Okay, we're live. Hi, everyone. My name is Kinneret, and I'm the lead research community officer on the WMF research team. I would like to welcome you to this month's research showcase. Research showcases are monthly convenings organized by our team to recognize and share recent research of relevance to the Wikimedia projects. For those of you joining us live, we welcome you to ask questions of the speakers in the YouTube chat. We will monitor this channel and pass questions to the speakers at the end of their presentations. And we kindly ask that attendees follow the friendly space policy and universal code of conduct. Before we get started, we have a couple of very exciting announcements today. We are happy to announce the launch of the next round of the research fund. The research fund provides support to individuals, groups and organizations with an interest in conducting research on or about Wikimedia projects. The application submission deadline is December 15, 2023. If you know someone who, or if you yourself, always wanted to get involved with Wikimedia research, this is a good way to do it. We also invite you, and we would highly appreciate it, if you could raise awareness and spread the word in your networks. We will attach the link in the YouTube chat now, and we will also send the announcement via our regular channels, such as the wiki-research-l mailing list. Another very exciting announcement for today is that we have a new date for the Wiki Workshop. Wiki Workshop, the main forum bringing together researchers exploring all aspects of Wikimedia projects, will be held virtually as a standalone event on June 20th, 2024. More information will come in time, but we are excited to begin planning the workshop and we look forward to seeing you all there. So with that in mind, I'll pass it over to my colleague Isaac, who will introduce this month's theme and speakers. Thank you, Kinneret, for that introduction. Hi all, I'm Isaac Johnson. I'm a senior research scientist with the Wikimedia Foundation. I'm glad to introduce the October 2023 research showcase. For my quick intro, I'm gonna actually be borrowing very heavily from our first speaker, Hal Triedman, and how he's explained it to me in the past, because I think it's a really useful framing for this. There's a really interesting and fundamental tension on the Wikimedia projects between our values of transparency and our values of privacy. On one hand, transparency suggests that we should lean towards publicly releasing all data, so that we, as researchers, as editors, can learn from it and help to improve the projects. On the other hand, privacy suggests that we should withhold information about how people interact with the Wikimedia projects to protect them from surveillance and other harms. And so our research showcase today focuses on two projects that I really love, and both of them are attempting to reconcile this tension between, on one hand, public data and the benefits of that, and on the other hand, user privacy and the protections that that brings. So first, we're gonna have a presentation by Hal Triedman, who's a senior privacy engineer at the Wikimedia Foundation living in Denver, Colorado. He primarily works on differential privacy, private data engineering, working with the Wikimedia Foundation legal team to write policies around data privacy, and occasionally machine learning. He will be presenting his research on differential privacy through a case study of geolocated daily page view counts.
After Hal, we're gonna have Akhil Arora, a final year PhD student affiliated with the EPFL Data Science Lab and an external research collaborator of the Wikimedia Foundation. Prior to this, Akhil spent close to five years in industry working with research labs at Xerox and American Express as a research scientist. His research lies at the intersection of data science, natural language processing and machine learning, with an overarching goal of modeling human behavior in real-world, web-scale systems. And his talk will focus on reader navigation on Wikipedia and the value of synthetic data in this space for capturing patterns. After each talk, we'll have about 10 minutes for discussion. At that time, we'll be happy to take your questions in the YouTube chat, and my colleague Pablo Aragón will monitor the channels and relay the questions during Q&A. And with that, I'm gonna pass it off to Hal. Take us away, Hal. Yeah, let me quickly share my screen. Can somebody give me a thumbs up if the screen is visible? All right, cool. So thank you so much, Isaac, for that wonderful introduction. Yeah, I'm gonna be talking a little bit today about how to tell the world about data that you cannot actually show to them, which is the main goal of differential privacy. As Isaac said, I'm a senior privacy engineer at the Wikimedia Foundation, and my name is Hal Triedman. So as you probably know, since you're in attendance at this research presentation, the Wikimedia Foundation is a non-profit. We're most well known for our project Wikipedia, which exists across, I think, 313, 315, 330, I don't even know, somewhere in the low 300s of languages. But we also have a bunch of other projects: Wikidata, Wikisource, Wikiversity, MediaWiki, Wikimedia Commons, and Wikimedia Enterprise, which is a newer one. And cumulatively, these projects net us around 20 to 22 billion page views per month, which makes us the seventh largest website in the world, or maybe sixth, depending on the year. So one of the key policies that governs how we use and access data, and publish about it in particular, is the open access policy. And fundamentally, at its core, what this is saying is that we have an organizational imperative to publish as much data as possible, because we have a lot of useful information about the internet when you're collecting somewhere between 20 and 22 billion page views across a variety of topics. We could tell something interesting, potentially, if we made a lot of this data available. And this takes a couple of different tacks. As you probably know, revision histories are public on almost every single page, but we also publish global statistics, for example, page views by country, page views by project, et cetera. On the other side of the equation, we have something that has been colloquially dubbed the lean data diet, which is defined by our privacy policy and data retention guidelines. And there are three aspects to this. Firstly, we have no first-party tracking cookies for reading. There are some session cookies that get put on someone's browser if they're logged in or they're editing or something like that, but there are no first-party tracking cookies that say this reader read page X earlier in the day and page Y later in the day. Secondly, there's no account that's necessary for reading or for editing. There's no account at all that you need to use our services.
And thirdly, there are only 90 days before the data that we have is aggregated, and then features that are personally identifiable, for example, IP address or user agent (information about a browser), are dropped, and we aggregate these things into a slightly larger bucket before we delete it. So we don't have a very long time to do this. Now, in 2020, community members on Phabricator, researchers who were engaged with some of these statistics that we publish, requested that WMF release page views by both country and project, as opposed to just country or just project. And this came to be known as the page view data release. And now there are some privacy concerns about this, right? So page views by country and page views by project are both constructed from user data, and the lean data diet constrains the kind of actions that we can take. And it ultimately illuminates a tension, as Isaac mentioned, between privacy and transparency, and these are expressed well in the policies on either side of this spectrum. And the stakes are high here, because Wikipedia and other Wikimedia projects are inherently political. Oftentimes users and editors are engaged in pseudonymous and sensitive editing and reading behavior, and that pseudonymity is very useful, and it's there for a very good reason. So this tension means that differential privacy, which I'll talk a little bit more about in the coming slides, could be useful. So let's take a step back and try to answer the question: what is differential privacy? Let's set up a very, very high-level schematic of what this looks like. We have a process which takes a database in as an input and returns some data abstractly as an output. For example, you could ask about this database that we have on the left side: how many people have red hair, or how many people have facial hair, or how many people have glasses, something like that. You get a number as the output. Differential privacy says we'll add some random noise to this process, and for the moment we'll ignore how much and what type that is and just call it magic noise. So as Shia LaBeouf is showing us here, we can add some magic noise to that process. Basically, the idea of differential privacy is that we should be able to remove one person from the database and rerun the process with this magic noise, and the outputs should be basically the same. Now, "basically the same" is kind of a squishy term. What does that actually mean? It means that the exact same outputs are possible with a somewhat similar likelihood. So if you look at this graph, the possibility of different counts without the person in the database and the possibility of counts with the person in the database, those two probability distributions overlap, which means that there's some uncertainty, when you get the answer, about whether the database included that person or did not include that person. So at its core, differential privacy is a promise that WMF can make to readers and editors and anyone who is a data subject of our nonprofit organization and who may be contributing to the data that is in a public release. And the promise is this: from the perspective of someone looking at this data release, your contribution to this database will be hidden. There will still be high-level trends that are visible in the data, but no one will be able to infer your presence or absence in the data, no matter what, even if you're an outlier, if you're reading a thousand pages a day or making hundreds of thousands of edits or something like that. So it's a promise that we can make, and there are a lot of different ways of satisfying this promise, but the best way to do it is with some random noise.
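To make that "magic noise" a little more concrete, here is a minimal Python sketch of a Laplace mechanism on a counting query. The toy database and the red-hair query are invented for illustration, and epsilon, the privacy budget, is explained just below; the key point is that removing one person changes the true count by at most one, so Laplace noise with scale 1/epsilon makes the answers with and without that person hard to tell apart.

```python
import numpy as np

rng = np.random.default_rng(42)

def count_red_hair(database):
    """Counting query: how many people in the database have red hair?"""
    return sum(1 for person in database if person["red_hair"])

def dp_count(database, epsilon=1.0):
    """Answer the counting query with Laplace noise calibrated to epsilon.

    Removing or adding one person changes the true count by at most 1,
    so noise drawn from Laplace(scale=1/epsilon) gives epsilon-DP for this query.
    """
    return count_red_hair(database) + rng.laplace(loc=0.0, scale=1.0 / epsilon)

people = [{"red_hair": True}, {"red_hair": False}, {"red_hair": True}]
with_person = dp_count(people)          # database including the third person
without_person = dp_count(people[:-1])  # same database with that person removed
print(with_person, without_person)      # the two noisy answers look similar
```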
So why is DP nice? This magic noise is configurable, so we can scale it up and scale it down using a parameter that we call epsilon, which represents a privacy budget. We'll ignore the small text in this slide for the moment. Basically, a smaller epsilon means you're spending less of your budget, so it's more noisy; a larger epsilon means you're spending more of your budget, so it's less noisy. And the noise is randomly generated, which means that it's impossible for differentially private data to be subject to re-identification attacks in the same way as other, non-noisy data. And then a really nice thing about this is that any post-processing of DP data, so modeling, sharing, combining with other data, is also covered by these guarantees. It's a process-agnostic thing; it doesn't rely on any foreknowledge of what the attacker knows. And finally, these guarantees also make their way downstream. So let's get back to the page view data release. I know that that's a lot of highly conceptual information, so let's make it a little bit more concrete. We worked with a company called Tumult Labs. They create open source software for doing large-scale differential privacy releases. And they have this approach, which we adopted as well for our project, where they split things up into a build stage, a tune stage and a deploy stage, and we'll go through each of these bubbles one by one to show you what the process looked like. So we wanted to define the problem and the success metrics. And what was the problem we were trying to solve? We wanted to release as much data as possible about reading activity, partitioned by country, project and page, and we wanted to release it every single day. Now, broadly, success looks like privacy being protected at a user-day level, and that's a pretty specific term; we can get a little bit more into what that actually means later. We want the data to be, in general, more plentiful and more granular than a baseline data release, which looks at a similar thing without DP. And then finally, we want the outputs of this data release to be equitable, accurate and trustworthy for data consumers. So broadly speaking, on a conceptual basis, this is what we're doing: we have a list of page views with a country, a project and a page ID. We group by and count to get the number of views that each page got in a given day. And then finally, we add some noise to those page view counts to obfuscate the contribution of a single person to this dataset. In practice, obviously, it's much more complicated. We're not gonna go through this diagram piece by piece, but I will make this presentation available just in case somebody is interested and wants to step through this flow chart. In practice, there are a lot of individual pieces of this puzzle that are much, much, much more complex. We also did this not just for current data but for historical data. We had a similar approach with some tweaks: a slightly different kind of noise that we're adding, a larger noise scale, and a slightly weaker privacy guarantee to our users. So when we implement this prototype, we want to evaluate the output quality and make sure we're doing a good job.
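Conceptually, the heart of the pipeline is that group-by, count, and add-noise step. Below is a minimal pandas sketch of the idea, assuming each user-day's contribution has already been bounded; the epsilon value, release threshold, and column names are illustrative, and the real pipeline is built with Tumult's tooling on much larger infrastructure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy page view log: one row per (already contribution-bounded) page view.
views = pd.DataFrame([
    {"country": "ES", "project": "es.wikipedia", "page_id": 101},
    {"country": "ES", "project": "es.wikipedia", "page_id": 101},
    {"country": "MA", "project": "fr.wikipedia", "page_id": 202},
    {"country": "MA", "project": "ar.wikipedia", "page_id": 303},
])

EPSILON = 1.0          # illustrative daily privacy budget
SENSITIVITY = 1.0      # assumed bound: each user-day adds at most 1 view per tuple
RELEASE_THRESHOLD = 2  # illustrative: suppress tiny noisy counts before publishing

# 1. Group by (country, project, page) and count true page views.
counts = (views.groupby(["country", "project", "page_id"])
               .size().rename("true_count").reset_index())

# 2. Add Laplace noise scaled to sensitivity / epsilon.
counts["noisy_count"] = counts["true_count"] + rng.laplace(
    0.0, SENSITIVITY / EPSILON, size=len(counts))

# 3. Keep only noisy counts above the threshold; everything else is suppressed.
release = counts.loc[counts["noisy_count"] >= RELEASE_THRESHOLD,
                     ["country", "project", "page_id", "noisy_count"]]
print(release)
```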
As far as our success metrics, are we being successful? Yeah, it seems like the data is more plentiful and more granular when we initially created this conceptual prototype, just as a proof of concept. But when we used the default parameters, the output was not actually equitable, accurate and trustworthy for our data consumers, because some of the metrics that we were looking at were not actually meeting the goals that we had set out. So let's dive into the metrics that we were looking at. We had three or four principal error metrics. Firstly, we wanted to look at median relative error, which is to say: if you compare the noisy dataset with the non-noisy dataset, what is the median relative error between those two things? We wanted to have a median relative error of around or less than 6%. We wanted to have a drop rate of less than 1%. This is a non-standard metric, but basically it's similar to a false negative rate: how many things that are above the threshold of what we wanna release, and that we should therefore be publishing, have had enough negative noise added that we're not publishing them anymore? And then finally, we wanted to have a spurious rate, also of less than 1%, and this is sort of the inverse: how many things that have a true value of zero, where nobody contributed any page views to this country-project-page tuple, got so much positive noise that they were added and are now above the threshold? They're like a hallucination in the output dataset. Finally, we also wanted to have relatively equitable regional error rates. Drop rate and spurious rate are important because the data that we're looking at here is very sparse and has a very long tail, and ultimately meeting the goals for equity, accuracy and trust requires optimizing for these metrics. So as I mentioned before, what ended up happening was the data was more plentiful and more granular, we were able to split it up with a lot more granularity, but ultimately it was not meeting all of these metrics, and I'll get into why. Initially we were meeting the drop rate being low, the spurious rate being low, and the median relative error being low, but the sub-global metrics around regional equity were not actually being met. So we wanted to optimize our algorithm around something that we called the Micronesia problem. Micronesia is a subcontinental region of seven Pacific Island nations, and they ultimately send very little traffic to the Wikimedia Foundation. So with our naive first implementation, we looked at the error for each subcontinental region and we noticed that more than 99% of the published data from this specific subcontinental region, and some others as well, which we'll talk about a little bit more, was spurious. It was completely fake, it was a hallucination. And this generalized as well: nine out of the 23 subcontinental regions had spurious rates of over 25%, meaning for those regions, more than a quarter of the data was completely hallucinated. And unfortunately these were concentrated in places that we really wanna be focusing our equity efforts on, such as Africa, Oceania, Central Asia and the Caribbean. So the lesson that we wanna take away here is that global metrics can conceal local inequities, and the solution that we ultimately came up with was to change the kind of differentially private noise that we were adding to solve this problem.
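For reference, the three error metrics described above can be computed roughly as follows from true and noisy counts aligned over all candidate country-project-page tuples. This is only a sketch of the definitions as described in the talk, with an illustrative release threshold, not the team's actual evaluation code.

```python
import pandas as pd

def evaluate_release(true_counts: pd.Series, noisy_counts: pd.Series, threshold: float):
    """Toy versions of the three error metrics discussed in the talk.

    true_counts / noisy_counts are aligned Series over every candidate
    (country, project, page) tuple; tuples with no real traffic have true value 0.
    """
    published = noisy_counts >= threshold
    truly_above = true_counts >= threshold

    # Median relative error, over tuples that are published and really exist.
    both = published & (true_counts > 0)
    relative_error = (noisy_counts[both] - true_counts[both]).abs() / true_counts[both]
    median_relative_error = relative_error.median()

    # Drop rate: tuples above the threshold in truth that noise pushed below it.
    drop_rate = (truly_above & ~published).sum() / max(truly_above.sum(), 1)

    # Spurious rate: published tuples whose true count is zero ("hallucinations").
    spurious_rate = ((true_counts == 0) & published).sum() / max(published.sum(), 1)

    return median_relative_error, drop_rate, spurious_rate

# Illustrative numbers only.
true = pd.Series([120, 45, 0, 8])
noisy = pd.Series([118.2, 47.9, 11.4, 3.1])
print(evaluate_release(true, noisy, threshold=10))
```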
We can get into the specifics here if you have questions about what kind of DP noise changes were made, but ultimately we ended up having a spurious rate that was less than 1%, both globally and for 21 out of 23 of the subcontinental regions. Now, there's a trade-off here that did mean publishing less data about those subcontinental regions, but I feel like, as far as our calculations go, a smaller and more accurate publication was a better trade-off than a larger and completely hallucinated one. So we made that change. We have another broad success metric that we're trying to meet as well, though, which is privacy protection at a user-day level, and it was kind of unclear initially whether we were meeting this with our second iteration of the prototype, because we have an internal construct for someone's user identity that has a bunch of known failure modes, and I'll talk a little bit more about that. So we wanted to bound user contributions in order to make sure that outliers were also safe here. If you recall from the very beginning of this presentation, we have no first-party tracking cookies. We're not tracking, on an individual browser or across multiple devices and multiple browsers, what people are looking at. So how do we bound user contributions? We can look at the hash of the IP address and the user agent, the default browser information that we get on any request, but that often fails. Think, for example, of somebody who is browsing on a mobile device on a bus: they might be using separate IP addresses as their phone switches cell towers. Or alternatively, think of someone using a library computer, where there's a bunch of similar computers that have the same browser, the same default software information that we're getting, and the same IP address. So sometimes the data is overly disaggregated and sometimes the data is overly aggregated. So our solution, which Isaac so kindly figured out, was this idea called client-side filtering. Basically, we can send the server a single one-or-zero Boolean value to include only the first K unique page views in a day. The idea here is that with every single page view that we send, we're just sending a single value. It's not anything that's specific, not anything that's identifying; it doesn't even tell you anything about the device that it's coming from. It's just a single one or zero bit saying include this page in our differentially private calculations, or don't include this page in our differentially private calculations. And we can set that up to only include the first K unique page views in a day. And that's a very useful thing, because instead of saying we're gonna arbitrarily look at the hash of an IP address and user agent, we can say, let's just include all of the pages, and we can count on this client-side, anonymous filtering to ensure that we're both collecting only the correct amount of data and not, in effect, putting a first-party tracking cookie on someone's device. So the lesson here is that data minimization and strong privacy guarantees can be in conflict with each other. And the solution, as I described, was building a new privacy-preserving piece of infrastructure that allowed us to do differentially private calculations in a slightly more accurate way and a slightly more privacy-protective way, while simultaneously ensuring that we're not degrading the infrastructure that we've already put together.
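The real instrumentation lives in Wikimedia's client code, but the idea of client-side filtering can be sketched in a few lines of Python; the class, the page titles, and the choice of K below are all made up for illustration. The device keeps its own small memory of which pages it has already counted today and sends only the resulting one-or-zero bit.

```python
class ClientSideFilter:
    """Sketch of client-side filtering: the device itself decides whether a
    page view should count toward the differentially private aggregation.

    Only the single include/exclude bit ever leaves the device; the set of
    pages seen today stays local, so no server-side identifier is needed.
    """

    def __init__(self, max_unique_pages_per_day=3):  # K is illustrative
        self.k = max_unique_pages_per_day
        self.seen_today = set()

    def include_bit(self, page):
        """Return 1 if this view should be counted, 0 otherwise."""
        if page in self.seen_today:          # repeat view of an already-counted page
            return 0
        if len(self.seen_today) >= self.k:   # daily budget of unique pages exhausted
            return 0
        self.seen_today.add(page)
        return 1

    def new_day(self):
        self.seen_today.clear()

device = ClientSideFilter(max_unique_pages_per_day=3)
for page in ["Earthquake", "Morocco", "Earthquake", "Marrakesh", "Atlas_Mountains"]:
    print(page, device.include_bit(page))   # the fourth unique page gets a 0
```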
So when we evaluate the output quality: the data is more plentiful and granular than our baseline, the output is more equitable, accurate and trustworthy for data consumers, and we're protecting privacy at a user-day level, because client-side filtering has significantly fewer failure modes than the hash of IP address and user agent. So our latest attempt, when we put everything together, meets our equity, accuracy and trustworthiness goals. It has a low spurious rate, a low drop rate, low median relative error, and it checks our geographic equity boxes. And it also significantly improves on a baseline of non-differentially-private data releases. Prior to this DP data release, we were releasing around 9,000 data points of this type per day. With differential privacy, we're able to release significantly more, around 360,000 data points a day, which is a 40 times increase in the amount of data. And we're also increasing the amount of actual page views that we're considering to about 2.4 times the amount. Finally, we finalize the algorithm. We put it in a Python package, we orchestrate it in our internal software, and we write some documentation. So there's material on Meta-Wiki about this, as well as a data-consumer-focused homepage where the data can be downloaded. And then finally, we deploy it. If you want to download the data and take a look at what this looks like, you can get it at this shortened link. And finally, there are some useful outcomes here. This dataset allows us to disaggregate interesting trends across a language that is spoken in multiple countries, for example, a celebrity's death across several Spanish-speaking countries. You can see this is a log scale on the Y axis, so there's a huge spike in page views, but there are still different orders of magnitude as we go down. And it also allows us to disaggregate different kinds of traffic in a single country where multiple languages are spoken. So this is the Marrakesh earthquake that happened about a month ago in Morocco, which is a place that speaks both Arabic and French, and you can see different patterns in how data is being consumed across languages. So in total, we have published eight years of safer, more granular data: about 300 million rows of data, from about 350 billion source data points. This is publicly accessible, it's openly licensed, and as I mentioned before, the guarantees of differential privacy make it safe for post-processing, so you could try to use it to do some sort of country-level trend modeling. And then for the future: we've done some geolocated editor activity, we've done grant data, we've done this work. We're currently working on banner views, search data, chains of page views, hopefully geolocated edit activity. There's a whole lot of different stuff that we're looking at in the future. And if you're looking for more information, you can click these links as I send them to you. There's lots of really interesting stuff, including from people that we've worked with. So yeah, thank you so much, and a lot of gratitude to collaborators both at WMF and Tumult Labs for their help, because this has been a multi-year project. Yeah, okay, so let's take some questions, and I will stop sharing so that we can stop being tiny little heads on the YouTube feed. Thanks a lot, Hal, for the presentation. You know, I love the work. It's been a lot of fun kind of being part of this and following this, so it's a lot of fun to see it shared out too.
So we now have some time for questions. I'm gonna open it up to the room. Are there folks in the room? And I think, Pablo, can I put you on the spot? You can. Yep, awesome. So I'm gonna give it to Pablo for the first question. Yes, sure. We don't have questions yet on the YouTube chat, but please send them, and I will share them with you as soon as they arrive. In the meantime, first, thank you. This is so impressive, this work that you are doing. It reminded me of a project that I was involved in some years ago, building a data commons. There was a community that was, like, having sensors at home that were taking information about pollution and noise, and we were sharing that data across the community, because having it gave context to our own sensors. But at the same time, this was very sensitive data. So there were some workshops in which the community was deciding how to add noise to those sensors to address those privacy concerns. So I was wondering: you defined very well how you have taken all these decisions in differential privacy to meet those requirements; I'm wondering how to incorporate community folks, to open up the governance of this process in the Wikimedia ecosystem. Yeah, no, that's a really good question. And I mean, I've been sort of active in the academic universe of differential privacy, and it's not a solved question. It's an open research question. There are two things that I can think of. Firstly, there's this parameter that I was talking about before that is the key to differential privacy, which is your privacy budget, or epsilon. And deciding what epsilon ought to be, and the quality of the data that results from that, is a very normative, value-based decision. It's almost similar to setting the thresholds on a machine learning model that does classification or something like that. So I think that an increase in discussion about what epsilon ought to be for given sets of data releases is definitely something that I would love to engage with, as far as community members go, if they're interested. At the same time, epsilon is unitless, or, I don't know, it's not exactly unitless, but it is a hard concept to wrap your brain around, because ultimately what it describes is the worst-case scenario of, like, a Bayesian update in priors, which takes some statistical numeracy to understand. And there are lots of studies that show that people are not great at thinking about likelihoods and taking actions based off of likelihoods. So yeah, I think on one hand I would love to see more discussion of those kinds of privacy parameters, and on the other hand, I would wanna make sure that we are giving people the tools they need, from a mathematical, from an intellectual, from a discursive perspective, to make those conversations productive and useful. And there is an interesting paper that I can send, which is called "What Are the Odds", which is looking at this question of how you communicate epsilon to lay people, Amazon Mechanical Turk workers, and which came out a month or two ago. Well, this is super. I remember we had a showcase some time ago, more than one year ago, about machine learning governance, and I do remember some presentations around how to create interfaces to lower the barriers, so that ordinary people can better understand the implications of those parameters, so as not to impose them but to empower these communities through such interfaces, interfaces that build literacy.
So we are having some questions from the YouTube chat. Well, these are more like clarification questions. Neil McBarnett is asking if you can share some details of how the metrics compare with the example data graphs that you shared at the end, for instance, error bars on the bar charts. Yeah, and also, where do you get your randomness from? Like, to understand much more how you create it. Yeah, yeah. So I'll answer the first question first. In differential privacy, you actually can't share precise error metrics about a given page or data point over time, specifically because, if you're sharing an error metric publicly, underlying that error metric is a comparison to the real data. So if you said, okay, this page had a relative error of 0.0357 or something like that, you could figure out, okay, either it got exactly 183 views or exactly 195 views or something like that. You could do some basic math. One of the things that we do is we try to communicate error bounds by saying we have, broadly speaking, ensured that the error is below x% on this set of metrics, broadly across the dataset and also on a sub-global, regional level. As far as getting randomness and noise, that's a really good question. It's actually quite difficult to do that, specifically using the open source data software that we use, which is Apache Spark. That is something that thankfully we've abstracted away into this open source, peer-reviewed, third-party library that we're using from Tumult Labs. So for the moment we haven't had to worry about that, but they're making sure that they're not sending correlated noise to all of the different workers in our Spark cluster, because that would be one way of potentially creating a privacy leak, or of adding strangely large or small amounts of noise to a given data point.
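As a tiny worked example of the back-calculation Hal describes: the noisy count is public, so publishing an exact per-row relative error alongside it would let anyone solve for the true count. The numbers here are invented for illustration.

```python
# The noisy count is published; suppose an exact per-row relative error were too.
# relative_error = |noisy - true| / true, so 'true' can simply be searched for.
noisy_count = 189.5
relative_error = 0.0357   # hypothetically published alongside the noisy count

candidates = [
    true for true in range(1, 10_000)
    if abs(abs(noisy_count - true) / true - relative_error) < 1e-3
]
print(candidates)   # prints [183]: the noise no longer hides the true value
```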
Yes, I think Caroline had a question. Thanks, can you hear me? Okay, great. This actually relates, if I understood the preceding question correctly, I think this relates very much to it, when you spoke to how you can't give too much detail about error ranges. But here's what I was wondering, and you can tell me whether you already answered it. I'm thinking about folks who use these types of data to track metrics related to increasing or decreasing page views geographically. And so I'm wondering, I guess my first question is whether there's documentation related to the caveats that come with these uncertainty thresholds? Because, in other words, I'm wondering what caveats folks who are monitoring increasing or decreasing page views from very small geographic locations should keep in mind. Like, should they be cognizant, or even more cognizant, of this type of masking that's going on? So I think there's a broad answer to this question and then there's a specific answer. A lot of times when you describe differential privacy and you say, this is privacy preserving, this is good, people are like, wait, hold on, you're adding random noise to our metrics, why would you be doing that? That's a terrible idea that's gonna lead to all sorts of perverse, strange outcomes. I think, broadly speaking, one of the things that's interesting is that all data is noisy, right? Like, our data pipelines are not perfect; actually, sometimes they break, sometimes data gets lost, sometimes, as I was saying, a given person gets split up into two personas because they switched their IP address within one browsing session, or people get agglomerated together. So all data is noisy. We're adding a very small and well-described amount of noise to this dataset. But to answer your question specifically, yes, there are caveats here. And I would say the randomness is very random, right? When you're actually looking at this in the dataset, if you're looking at, say, activity in, I don't know, Morocco over the course of some time, the likelihood is that if there were 90 page views to, say, some Japanese soccer team in Spanish from Morocco, that's probably noise, right? But also it might not be, right? That's the whole point of this; you're obfuscating these things. And thankfully also, if something shows up across multiple days, multiple data points consecutively, you can start to say that's more likely to be a trend. And we have data available now from July 1st, 2015, which is much more restricted, all the way through yesterday, which has thousands and thousands, hundreds of thousands of data points. And I think there was one more question. Is it from Akhil? We're gonna, I'm gonna give you a question while I let Akhil get set up on sharing slides. Hal, one thing this has touched on a couple of times in various ways, but this has been kind of a newer project, and there have been a lot of hurdles in the technical sense of getting it up and running, and one of those big ones has been kind of getting the open source piece going. And so I was wondering if you could talk a little bit towards the work that happened in that domain. Muted. Muted, yes, yes, definitely. So this is an idea, a cryptographic idea, that has been around since 2006, but it's only really been in the last couple of years, since 2020, or 2019 maybe, that this has started to become a productionizable technology. And there are maybe five or six separate open source libraries that are all trying to work in this space. Some of them are more geared towards distributed computing at a large scale; some of them are more geared towards an individual large computer doing something like an economic or health data analysis that's a little bit smaller. And we looked at, I think, four or five different options cumulatively to decide which piece of software we were gonna use. We picked Tumult Labs both because they have well-made software, they worked on what I think was previously the largest DP data release, the US Census in 2020, and also because they've given us a lot of really helpful hands-on support as far as creating documentation and helping us define these problems and their outcomes quite well. Also, obviously, the stuff that they're working with is open source. So it's a really useful tool and it checks a lot of boxes for us. Awesome, thank you. With that, we're gonna pass it to Akhil. Take us away, Akhil. Okay, thanks, Isaac. Okay, so hi, everyone. Today I'll talk to you about a very interesting piece of work that I did as a collaboration with the Wikimedia Foundation.
We studied how well, and under which scenarios, we can approximate Wikipedia reader navigation behavior by generating synthetic data from publicly available resources, right? And this was joint work with your very own Martin Gerlach, who's not here because he's on vacation, my friends from DLAB, and my advisor Bob West. Okay, so before we dive deep into reader navigation and Wikipedia, let's zoom out to first understand the bigger picture, right? So why should we study network navigation in the first place? Networks are ubiquitous in the world around us, and we rely on the smooth operation or smooth functioning of many such networks for our day-to-day lives, right? For instance, we rely on the road network and the transportation network for our daily commuting needs. Everyone today relies on the web and the services that it offers for something. For instance, we rely on the web for information seeking, we rely on online encyclopedic systems like Wikipedia for seeking knowledge, and we rely on social media services to talk to our colleagues, collaborate with each other, and so on and so forth. In fact, if you think about it, the human brain can also be seen as a network in which neurons communicate with each other, and this communication is important for the smooth functioning of the human body, right? So it's clear from all these examples that networks have utility only if they enable efficient routing, or, in other words, if they are navigable, right? And at this point, I would like to present this graphic, which shows the first experiment on this, conducted sometime in the 1960s by an American experimental psychologist named Stanley Milgram, which led to defining what we mean by navigability. And here, as you can see in the graphic, the task was for humans to send an information packet, literally post something, starting from Nebraska to a target in Boston, right? And they could only use information about their own friend circle and the local neighborhood that they were aware of, right? So having defined navigability and already discussed why navigation is important, I'll move on to the topic of why we should study navigation on Wikipedia, right? We all know that Wikipedia is the largest platform for open and freely accessible knowledge in the history of humankind, right? It contains about 60 million articles across more than 300 languages, every month we have 10 million edits done on Wikipedia by 500,000 volunteer editors, and Wikipedia receives a whopping 20 billion monthly page views. And it's not just about the size. Wikipedia has grown enormously, we know that, but it has also become an extremely important part of the information ecosystem, right? It has far-reaching applications and societal impacts, such as helping doctors, big tech companies and even government organizations, right? Overall, the point that I want to make here is that Wikipedia navigation traces become one of the richest sources of online human behavior, because of their scale and because of their utility. And therefore, it's paramount to study Wikipedia reader navigation behavior for human behavior modeling. So, having looked at the bigger picture, let's return from space and focus on Wikipedia reader navigation, right? While browsing any website, readers leave information traces of their navigation behavior, which tell us how they consume the content offered by the website and how they interact with it, right? And this is important, right?
Because this means that insights into navigation patterns have really high utility. Coming to Wikipedia, they tell us about the learning pathways that readers take while browsing Wikipedia, right? And they give us a lot of information about readers' needs, so that we can better serve them. Then they also help us understand knowledge gaps that exist on Wikipedia, or think about structural biases that may occur because, of course, it's edited by humans, and humans, you know, inherently have biases, right? Then lastly, and most importantly, a great utility of this navigation behavior is that you can organize articles into a curriculum, which will only improve the learning experience of readers, thereby improving the signals in the navigation data; it's a feedback loop, right? They help each other. So, with all of these important aspects and this huge amount of utility, it's sad to state that systematic studies of navigation on Wikipedia, and also on the web, are fairly limited, and this is because of one key challenge. In a commitment to protect the privacy of Wikipedia's readers, the Wikimedia Foundation has taken the very wise decision to keep real navigation traces private, stored only on its servers. In fact, more importantly, as Hal already mentioned, going back to the lean data diet, this data is not kept for more than 90 days. So only the last 90 days of data are available, right? So what can we do in this setup? Thankfully, there is a publicly available resource, the Wikipedia clickstream data, which captures aggregate traffic, right? What exactly it captures is counts of (referrer, resource) pairs extracted from the private server logs, and if you look at this example, you can see that it tells us how many times someone moved from the page London to England in a given month. So this is an aggregate source of information, which tells us the aggregate traffic between two pages, not identifying any individuals, but, at a very high level, the aggregate traffic between any two pages on Wikipedia over a month, right? And one important aspect is that, to ensure privacy and also k-anonymity, rare events are removed. So any pair that occurs less than 10 times is not present in this dataset. While it's a great thing that we have a publicly available resource that lets us model navigation behavior, a second challenge emerges here: the next page visit depends only on the current page. As you can see, it just captures the traffic between a pair of articles, so it only captures first-order navigation behavior and nothing beyond that, right? So in the remainder of this talk, we will assess the utility of this publicly available resource to actually model reader navigation behavior and to study it in a better way, right? And to do this, we formulated two key research questions. First, how different are real trajectories from synthetic ones generated using the Wikipedia clickstream? And more broadly, how well can we approximate real reader navigation behavior via the Wikipedia clickstream? To answer our research questions, we went for a setup where we look into four different types of synthetically generated navigation traces and compare them with the real traces, courtesy of Wikimedia allowing us to look into those, and this gives us a way to differentiate how well we can do using synthetic data generation, right?
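As a rough illustration of what generating synthetic traces from the clickstream means, here is a small sketch that treats the clickstream as a first-order Markov chain and samples a trajectory from it. The counts and article titles are made up, and the paper's actual generation procedures are more involved than this.

```python
import random

# Toy clickstream: counts of (referrer, resource) pairs for one month.
# In the real clickstream, pairs with fewer than 10 occurrences are removed.
clickstream = {
    ("London", "England"): 4200,
    ("London", "River_Thames"): 900,
    ("England", "United_Kingdom"): 3100,
    ("England", "London"): 750,
    ("River_Thames", "England"): 120,
}

def sample_trace(start, length, rng):
    """Sample a synthetic navigation trace as a first-order Markov chain:
    the next page depends only on the current page, with transition
    probabilities proportional to the clickstream counts."""
    trace = [start]
    for _ in range(length - 1):
        options = [(dst, n) for (src, dst), n in clickstream.items() if src == trace[-1]]
        if not options:          # dead end: no outgoing clickstream counts
            break
        pages, weights = zip(*options)
        trace.append(rng.choices(pages, weights=weights, k=1)[0])
    return trace

rng = random.Random(7)
print(sample_trace("London", length=4, rng=rng))
```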
And I think the beauty of this analysis, the key thing, is that it's not done for a single language; it is done for eight different language versions, so anything that we find is not specific to a given language edition. And we did a bunch of analyses, not just a single one: we did six different analyses, by doing some sorts of characterization and also by using models actually trained on this synthetic data for several downstream tasks that are popular in machine learning, and assessing their utility. So in the interest of time, and to keep things focused, today I'm just gonna talk about one specific analysis that we did in this paper, called mixing of flows, primarily because of its intuitive appeal and also because we like it a lot, right? And for anything else, you're welcome to read the paper; it's very well written, very clear, and you will be able to find whatever information you want in it, or you can reach out to us with any questions that you may have. So, going to the specific analysis, mixing of flows. An easy way to understand this is to look at triples of information access patterns. Basically, we look at a source page, a page that was visited in between, and the target, right? To understand this, we looked at all the trajectories that pass through a given Wikipedia page, and we connect the source and target articles, right? And you can see this example: both of these plots show what happens when we fix the intermediate page. On the left, we look at trajectories passing through the page Charles, Prince of Wales, and on the right, trajectories passing through another page. And what we are interested in here is: is there any predictability? Let me define that more clearly. Basically, what we want to know is: with your knowledge of the source, and with this intermediate page fixed, how well can you predict where someone navigating through this series of steps is going to land, right? And I think it becomes intuitive from these two plots. On the left, what you can see is that there's a lot of mixing: given the knowledge of the source article, which is Elizabeth II or Prince Harry, you basically still cannot say with confidence on which of these five pages you are gonna land, right? On the other hand, if you look at the right-hand plot, it's quite clear that, starting from the source article, it's highly likely you go to one particular article. The mixing is less; the spaghetti-like structure, as we call it, is less, and instead there's a clear path, right? And I think the beauty here is that, going from this simple intuition, we were able to wrap this up in a proper information-theoretic measure, where we quantified predictability using mutual information. Let me just try to give an intuitive explanation of what we actually do here. We measure predictability with mutual information, which tells us how much one random variable tells us about the other. Going back to the actual example of Wikipedia articles, in our case it tells us how much the start article tells us about the target article, right? And if you have done any course on basic probability, we can make it even clearer: if you look at the way mutual information is defined, when the two events are independent, that is, when X and Y are independent, and in our case when the source and the target are independent, this term just cancels out and the mutual information is zero, right? So, for independent events, the mutual information is zero. And this leads to the key finding, which relates the intuitive spaghetti-like structure, the mixing, to an actual information-theoretic measure: the mutual information is low when there is strong mixing, which means there is low predictability. On the other hand, the mutual information is high when there is weak mixing, which means there is high predictability.
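To make the measure concrete, here is a small sketch of how adjusted mutual information, the adjusted variant shown on the slide, can be computed per intermediate page with scikit-learn. The (source, target) pairs below are invented purely to illustrate the two regimes, and the label-encoding helper is just a convenience.

```python
from sklearn.metrics import adjusted_mutual_info_score

def encode(labels):
    """Map arbitrary labels (article titles) to integer ids for sklearn."""
    index = {label: i for i, label in enumerate(dict.fromkeys(labels))}
    return [index[label] for label in labels]

# Toy (source, target) pairs for trajectories passing through one fixed
# intermediate article; the titles are invented for illustration.
strong_mixing = [                     # the target barely depends on the source
    ("Elizabeth_II", "Page_A"), ("Elizabeth_II", "Page_B"),
    ("Prince_Harry", "Page_A"), ("Prince_Harry", "Page_B"),
]
weak_mixing = [                       # each source funnels readers to its own target
    ("Elizabeth_II", "Page_A"), ("Elizabeth_II", "Page_A"),
    ("Prince_Harry", "Page_B"), ("Prince_Harry", "Page_B"),
]

for name, pairs in [("strong mixing", strong_mixing), ("weak mixing", weak_mixing)]:
    sources, targets = zip(*pairs)
    ami = adjusted_mutual_info_score(encode(sources), encode(targets))
    print(name, round(ami, 3))        # close to 0 for strong mixing, 1 for weak mixing
```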
Okay, so I think with this we really understand what's going on with the real navigation traces. Now let's take this up to a larger scale and understand what happens when we look at all the navigation traces across all eight languages in our study. So here, this plot gives us a very clear indication. First, let me explain the plot: it is a complementary cumulative distribution, also called a CCDF, which tells us the probability of the adjusted mutual information being greater than or equal to a given value. And what we can see is that this is a fast, exponential-like decay: less than 10% of pages have a mutual information greater than 0.1. Making things more extreme, less than 0.1% of pages exhibit a mutual information of more than 0.5. So going back to this, what we understand is that if the mutual information is low, there is low predictability from the source: just by fixing the middle article, you already have essentially all the information about where the human browsing the page is gonna go, and there is no dependence on where the human started. So this gives us the key finding that the majority of human navigation traces on Wikipedia are Markovian, or memoryless; basically, the next step only depends on the current state and not on the previous states. So this is the key finding, and I also wanted to highlight some more results. Again, I don't want any of you to read through this entire slide. I just wanted to mention one key message here: we tried all the different analyses, all the different downstream tasks, and what we found is that the differences between real navigation traces and synthetic navigation sequences are more often than not less than 10%. They're sometimes even negative, which means that the synthetic navigation traces do better. And if we put our statistical hats on, one can think that, yes, the differences will be statistically significant, but I think the key message that we want to send here is that the effect sizes are really small. Okay, so with this, I'm gonna slowly start to wrap up and talk about the key takeaways of this work. One is that real trajectories exhibit strong mixing, which means that the mutual information is close to zero, and only a small set of articles have larger mutual information, which says that only a small set of articles are really affected by the fact that you need to know what happened in the past to predict the next step, right?
So this highlights cases where synthetic trajectories will differ substantially from the real ones and cannot be modeled using the Wikipedia clickstream, but these cases are few. And then the other key aspect is that you can build machine learning models based on navigation traces generated from the clickstream data, and the performance is usually within 10% of that obtained with real trajectories. It means that we can actually generate embeddings, which are quite popular in the machine learning community, human navigation behavior embeddings, from the synthetic data, and they will be of comparable quality to those from real data. And I would like to send a key message here: this gives quantitative evidence for the utility of the Wikipedia clickstream as a public resource that can closely capture real navigation on Wikipedia, which is super important. Then I would like to conclude with a slide with super important implications. I'm just gonna zoom out a lot here. The key point is that, yes, for many cases, the clickstream is good enough, right? And the most important implication is that this makes research on navigation in Wikipedia accessible to a wider audience. It also respects user privacy, because there is no need to store or reveal sensitive data unless necessary. But it's also important to acknowledge that, although they are very few, like less than 1%, there are cases where real data is required and where the clickstream is not good enough, right? So I'll give you some examples. If you want to track the activities of the same user, you need to think about revisitation patterns or multi-tab behavior. Or if you want to understand how the information consumption patterns of Wikipedia readers vary, you really need information about location or timestamps, which is missing from the clickstream as of now. Lastly, I think it's also important to note that we don't get a lot of information about how readers interact with additional content like images or infoboxes, which serve as a key way in which readers interact with Wikipedia. And then comes the fun part. I really love the broader impact that this work has. Again, I would acknowledge that it is an open question whether our findings will generalize beyond Wikipedia. But I would like to stress that clickstream-like data has the capacity to empower broader research on user navigation on online platforms. And this is also an open call to the community, to all the web companies out there, to release such datasets so that we can better model human navigation behavior, and a big shout-out to whoever helped create the clickstream dataset. It's a very valuable resource. Lastly, I want to mention that this provides a lot of impetus to the field of human behavior modeling, right? And human behavior modeling is of interest beyond computer science: it's of interest to anthropologists, psychologists, and definitely computer scientists. And I think when we look at this navigation behavior, it really tells us about how humans understand complex networks and reason about them. And by taking this first step in the direction of generating navigation traces at web scale in a privacy-preserving manner, we strongly believe that this work has the potential to provide the much-needed impetus to models of online human navigation behavior and eventually drive the field of human behavior modeling further.
With this, I would just like to flash some resources: you can use the code available on our GitHub page, you can look at the clickstream data if you want, and there's also a preprint available. I would like to thank all my collaborators on this paper, and also all my friends at DLAB and the Swiss National Science Foundation for funding my PhD. And with this, I'll just, you know, stop. Thank you so much, Akhil. If you'll end the slide sharing too, we can switch over to discussion and Q&A. I think to start, I believe we have some questions from YouTube, so Pablo, am I able to pass it off to you to read those? Should I stop sharing my screen? Yeah. Okay, that's good. All right, Pablo. Yeah. Well, first, thank you, Akhil, for sharing this work. It's very exciting. Particularly, I like many things, and I also like this spaghetti approach to flows. In network science, I was familiar with the concept of a hairy ball for those networks that are messy, where it's hard to find a community structure, but I was not familiar with this other metaphor for flows, with the spaghetti, so it was fun. So we have some questions on YouTube, by Gianna Inali and also Neil McBarnett. Most of them can be covered with one question we can focus on. Since you talked about this small subset of articles with larger adjusted mutual information, I'm wondering whether you were curious about what those articles were about, if you tried to characterize or find some patterns in what that small subset of articles was about? Okay, I can give some examples about this. So there were some articles about, let's say, of course, we don't know who the people behind it are, but it was mostly about, let's say, someone trying to understand a concept, literally like a student doing research for an assignment, right? Or we sometimes also revisit technical concepts; let's say you want to refresh your knowledge about topic models, right? In these cases, your navigation patterns really depend also on your history, where you started, right? It's not just the next click. The other example that we found is, let's say someone is doing research for a news article that they want to write, right? In these cases, again, the memory definitely plays a role. So these are some of the very high-level examples that we looked into, but definitely it would be valuable to go deeper into this. Although it's 1%, it's huge, because there are a lot of pages, and I have been discussing with Martin whether we can look into this and actually do a deeper analysis paper about it at a later point in time. Thank you. Well, actually you have been doing very interesting research on navigation patterns, so I recommend anyone in the audience watching this, if they're interested in reading patterns, to check your website and all the publications from your thesis, because it's definitely quite the body of work. I don't know if there are questions in the room. I do have my personal question, but Isaac, you're welcome to go. No, I think, Pablo, let's go with your question, then, Caroline, I know you had a broader question too, so let's follow up with that one then. Okay, well, so my question is this: today I was discussing with some colleagues about the topic of the showcase, and I mentioned there was going to be a presentation about synthetic data, and the first reaction was, yeah, there are some limitations to using synthetic data.
They were focusing mostly on medical research, where in some cases, if you need more data, it's prohibited; you need to have a new trial and create more natural data, because there are ethical constraints around that. And I tried to explain that this is not the case here; it's not because of the scarcity of data, but because of privacy purposes, but still there are some connections. So in your presentation, you defined very well some relevant downstream tasks that fit well with the approach that you have proposed. And I was wondering which other tasks might be challenging, from either an ethical or a technical perspective. In the context of Wikipedia and this talk, or generally? Yeah, you can go broader, but I think around synthetic data in particular, yeah. Yeah, I do think next article prediction, which is basically which page someone would like to visit next, is one; although we found some good results on it, it really is a case which can improve the reader experience, but it's tricky to do this in a Markovian way, specifically for the cases where it's actually needed, for instance the ones we spoke about, right? So we all know about the random surfer on the web: people get bored, they just go to a page and give up the session, or they use their favorite search engines and now even large language models, right? So basically, for information access there are two ways: either you organically navigate by clicking links, or you use a Q&A system where you ask questions and get answers, and sometimes these answers are available on a Wikipedia page, right? But literally going back to those cases where the history really matters: these are the cases where people could benefit from nice reader recommendations. I'm going back to the example, I think there is one recommendation feature in the mobile version of Wikipedia, which I think is also now there on the web, in desktop mode, but there were some recommendations that were available only in the mobile app. So I think these kinds of things will be interesting, because I feel they allow humans to not go down rabbit holes, where you're looking for something and you could get the information much better if you could model actual navigation behavior. Then, thinking aloud here, another thing could really be how humans engage with different parts of a website. For instance, let's take the example of Wikipedia. And I know Isaac here in the room is an expert on this, he has CHI expertise, but I'm just gonna put this out loud here: let's say you wanna do A/B testing on Wikipedia to launch a feature, right? Right now, the way A/B testing is done in the world is basically that we release the feature to some people and some people don't get the feature, right? And it's done randomly, but it still has the chance of marginalizing some specific people, or some people don't get a good feature. For instance, for a very long time, I didn't have the ability to reply to Instagram chats. Well, okay, right? So now the point is this: we do it in this manner because we really wanna know how relevant this feature is. But think about this: if you can really understand how humans are engaging with different parts of a website, and in this case, how humans are engaging with different sections of Wikipedia, like how important is the infobox?
Then, thinking aloud here, another thing could be how humans engage with different parts of a website. For instance, take the example of Wikipedia; I know Isaac here in the room is an expert, he has CHI expertise, but let me just put this out there: let's say you want to do A/B testing on Wikipedia to launch a feature. Right now, the way A/B testing is done is basically that we release the feature to some people and others don't get it. It's done randomly, but it still has the chance of marginalizing some specific people, or some people don't get a good feature; for instance, for a very long time I didn't have the ability to reply to Instagram chats. So the point is, we do it this way because we really want to know how relevant the feature is. But think about this: if you could really understand how humans engage with different parts of a website, and in this case how readers engage with different sections of Wikipedia, like how important the infobox is. There was a paper, I think by Dimitrov and colleagues, where they studied the value of a click on Wikipedia, and they did this using only clickstream data. But let's say you somehow managed to generate high-quality synthetic data, or you had a subset of real navigation traces covering these specific aspects; then you could build models of which parts of Wikipedia are more relevant, and once you have such a model, you could really do that kind of testing without a human in the loop. To me, that's a case where this data would be really useful, but the problem is that this data contains highly sensitive information and PII, so it cannot be released to the world for these kinds of features. I'm sure that if we had a way to get a good amount of good-quality data, there would be many creative applications from people that rely on it. At this point, let's switch over to Caroline, who I believe has a question for Akhil. Getting our audio set up, there we go. Yeah, can you hear me now? Great. Thanks for the presentation. As with my question for Hal, I'm thinking about the effects of synthetic noise on small geolocations and small language editions of Wikipedia. With the eight language editions that you looked at, I appreciated that you covered many language families, but most of them are very large projects in terms of numbers of pages and links. I wondered whether, if you were to look at smaller language editions of Wikipedia, or if you have plans to, you would expect to see similar results, or whether you think the effect sizes might change. I'm just wondering if you have any thoughts about that. Sure. I'll answer one part of this question first, whether we have plans. Again, I'm going to mention Isaac: he knows we have been pushing a lot for the clickstream to be available in multiple languages, and shout out to Isaac and Martin for really pushing this agenda. So we really do have plans, and we have the entire framework set up; we just need clickstream data available for a hundred languages and not just the current ten or eleven. Clickstream data is not available for more than these languages, which is why we couldn't do that study on effect sizes. That said, we did notice something in this initial study: the public clickstream removes any pair that received fewer than 10 occurrences, basically fewer than 10 visits per month, as a required post-processing step. We noticed that for smaller languages this had an impact, because for smaller languages more cases are removed, and we saw the impact on downstream tasks in terms of effect sizes. So I definitely think there would be an impact, and I can say this filter is what caused it, because we constructed our own clickstream from the private server logs, where the only difference between our clickstream and the publicly available one was this threshold of removing pairs with fewer than 10 occurrences. The problem was not there in our private clickstream, as I call it, but it did exist in the public clickstream for smaller languages. So there are clear indications that the effect sizes could be much larger, with implications for smaller languages. I hope that answers your question; really great, thanks for asking. Yeah, it does, thank you.
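As a rough illustration of the post-processing step Akhil describes (a sketch under stated assumptions, not the actual WMF pipeline), the public clickstream drops (source, target) pairs with fewer than 10 occurrences in a month, and the same threshold removes a much larger share of a small wiki's traffic:

# Sketch of the public-clickstream filter: drop pairs seen fewer than 10 times.
def filter_clickstream(pairs, min_count=10):
    # pairs: iterable of (source, target, count) triples
    return [(s, t, n) for s, t, n in pairs if n >= min_count]

def fraction_of_traffic_removed(pairs, min_count=10):
    total = sum(n for _, _, n in pairs)
    kept = sum(n for _, _, n in filter_clickstream(pairs, min_count))
    return 1 - kept / total if total else 0.0

# Toy numbers: the identical threshold costs the small wiki far more traffic.
large_wiki = [("A", "B", 500), ("A", "C", 80), ("B", "C", 9)]
small_wiki = [("A", "B", 12), ("A", "C", 7), ("B", "C", 4)]
print(fraction_of_traffic_removed(large_wiki))  # ~0.015
print(fraction_of_traffic_removed(small_wiki))  # ~0.48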
With that, Akhil, I think you had a question for the room, and I'll also pass along one from YouTube: somebody was asking about the most predictable paths, and whether the most predictable ones might be some undetected bot activity. I was curious too, qualitatively, whether there were some really interesting extremes of the dataset that you noticed. Yeah, it could be undetected bot activity. We did our due diligence to remove bots, but we are limited by whatever we can detect via Wikimedia resources, so it could be that. It could also be cases that are more like a fact-checking setup, or a very targeted question, where someone came from a search engine, looked into a specific concept, got interested, went through one or two more pages and then gave up. These sessions can also show this behavior where it literally doesn't matter where you started. Or take the example of Wikispeedia, the game that Bob built: there as well, if you really have an intent to reach a target, it doesn't matter where you started from; what matters is where you are currently, and you just take a greedy decision towards the target. So these are the cases where things like this can happen, and they account for a large amount of traffic: with all these technologies and search engines, the majority of sessions correspond to these cases. Then, going back to the question I had for Hal and everyone, it's about the differences between the theoretical guarantees of differential privacy and its actual practical implementation. For full disclosure, I am not a privacy expert; I only have some basic, passive knowledge about differential privacy, but I have heard this a lot from experts at EPFL and in different talks: people talk about differentially private stochastic gradient descent or differentially private machine learning, but there's a huge difference between the implementation and the theoretical proofs and guarantees that people claim, either as an addendum or white paper to their technology or in the research paper they write about it. So what are your thoughts, and can you educate us a bit about this? Yeah, that's a great question. I'm going to stay off camera, if that's okay. I do think there are several difficult metrics that people are trying to optimize for, specifically in the world of privacy-preserving machine learning with a tool like TensorFlow Privacy or Opacus, which is the PyTorch differentially private machine learning library. Oftentimes the impetus for a publishable result is high model output quality rather than concrete data protection guarantees. More broadly, I've been developing a thesis over the last couple of months that differential privacy as a domain is heavily theorized and lightly implemented, and that's not the fault of researchers: it's one of those things where the people doing theoretical research often don't have access to sensitive data that carries both the risks and the benefits relevant to differential privacy, whether for training a classifier on the data or doing some form of data release. But specifically in the realm of machine learning, when the publication imperative is a high accuracy or F1 or whatever metric you're using, people are incentivized to continue training even when additional training steps burn additional privacy budget, leading to privacy budgets like "we trained this with an epsilon of 12," which means that in the worst case someone could be almost 100% certain that a given person was in the source dataset.
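To put rough numbers on the privacy-budget point Hal is making (a back-of-the-envelope sketch; the per-step epsilon and step count below are made-up illustration values, not anything from a real pipeline), basic sequential composition simply adds the per-step epsilons, and the total epsilon bounds how far an adversary's belief about any one person's membership can shift:

import math

def total_epsilon_basic(eps_per_step, num_steps):
    # Basic (non-tight) sequential composition: epsilons add linearly.
    # Real DP-SGD accountants (e.g. RDP / moments accountants) give tighter bounds.
    return eps_per_step * num_steps

def worst_case_posterior(epsilon, prior=0.5):
    # Worst-case posterior that a target record is in the data, starting from
    # `prior`, using the e^epsilon bound on the likelihood ratio under pure DP.
    odds = prior / (1 - prior) * math.exp(epsilon)
    return odds / (1 + odds)

print(total_epsilon_basic(0.06, 200))  # 200 steps at eps=0.06 each -> 12.0
print(worst_case_posterior(12.0))      # ~0.999994, i.e. near certainty

So an epsilon of 12 really does mean that, in the worst case, an adversary who started at 50/50 could end up essentially certain.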
So yeah, I think that's definitely a valid constraint and a valid criticism of a lot of that kind of research, but there's also a lot of active research about changing the theoretical formulations of how you add noise, and about connecting differentially private classifiers to the literature on robustness in classification, which is a much older line of research from a theoretical computer science perspective. It's definitely an interesting, active, open field, and there's lots of room for improvement and optimization, especially when you're focusing on where the rubber meets the road in real-life implementations of DP pipelines. All right, well, thank you both for this, and thank you everyone for joining the Wikimedia Research Showcase and this lively discussion today. I want to thank our speakers, Hal and Akhil, for their contributions and presentations and the excellent question answering. This research showcase is made possible thanks to the coordination team, particularly my colleagues Kinaret and Pablo, so thank you to you two as well. Thanks also to Emerald for support with the audio and video today. The next showcase will be Wednesday, November 18th, with a focus on bibliometrics. I look forward to seeing all of you there. Hal, thank you.