Hi everybody, Dario from the Wikimedia Foundation, and welcome to the March edition of the Research Showcase. I'm very happy to have here today Andrei Rizoiu from the Australian National University, and I'm not going to give too long an introduction to his talk. Andrei came here a couple of weeks ago when he presented his paper at WSDM, a conference here in San Francisco, and we had a small but very intense discussion on these issues here at the Foundation. So we thought we would invite him to give a public showcase presenting his work on the evolution of privacy loss in Wikipedia.

To say one more word about this: at the Foundation and in the movement, we are very sensitive to private data, and we tend to take very seriously the issues of how much data we collect, how we retain it, and what we share or don't share. There isn't necessarily as much discussion on public data and the use of public data for inference, and given that this is a really hot topic at the moment in industry, with people at some of the key conferences in the field talking about designing algorithms for inferring private traits from public data, this would be a fantastic opportunity for the presentation to spark discussion within the movement. So with that, thanks for joining us. I know it's super early hours on your end, and with that, the stage is yours.

Okay. Hi everyone. Thank you, Dario, for this introduction. Indeed, this work was presented a couple of weeks ago, at the end of February, at WSDM in San Francisco, and that's how I got to meet you fantastic guys. I really enjoyed that I could speak with you and meet you. So let me pull up the slides so I can start showing you what we've been working on.

The name of the presentation is "Evolution of Privacy Loss", and it should probably have "in Wikipedia" in brackets. Initially it started as the idea that in any social system out there, when we observe its evolution over a certain length of time, there is a de-anonymization or privacy-loss trend that occurs naturally. We applied and tested this idea in Wikipedia. I'm going to go into more detail soon enough, but before that, let me show you a little bit about the team behind this work. These are my co-authors on the paper: Lexing Xie is also at the Australian National University, and then there are Manuel Cebrian and Tiberio Caetano. The common trait of all four of us is that we are all part of Data61, one of the major research organizations here in Australia, and some of us are also affiliated with other universities.

The motivation behind this work is exactly this: the advent of social media, the fact that more and more people use social media to share information or just to keep in touch with their friends or family. As you put it in the introduction to this discussion, online privacy is something people are becoming aware of nowadays and starting to be a little concerned about. Most people think that simply setting the privacy options on their social media website helps them keep their privacy in check, whereas this has been shown to be false. More and more research shows that the digital traces a user leaves behind reveal a lot more about them than they might think.
In 2013, Kosinski and his colleagues showed, for example, that just by looking at the public series of likes on Facebook, one can infer things like gender, which is not necessarily very sensitive, but also sexual orientation, or even whether the user's parents divorced before her 21st birthday. So really sensitive traits can be inferred just by looking at the pattern of a user's activity online. This is known.

What we are really interested in, the kind of meta-questions we are trying to address in this work, is: does user privacy degrade over time? So not only does it degrade when we have information, but does it degrade over time? Is there a trend to how this privacy is lost? If so, what are the factors that contribute most to revealing these private traits? And lastly, what would be the measures to stop this leak of personal information? If I stop posting information online, does that help protect my privacy? This is exactly what I'm going to show in the rest of my presentation: why we chose Wikipedia to test our hypothesis and what we used from it, how we profiled the editing behavior of Wikipedia editors, how we measure the predictability of some personal traits, what results we obtained, and then some recommendations to conclude, maybe.

So let me jump straight into the subject. First of all, why Wikipedia? This is the question people keep asking us: why did we choose Wikipedia to test our hypothesis? Well, there are a number of traits that make Wikipedia perfect for our study. First of all, it's all public. You guys did a great job making everything available, so that's what we used. It's public, and it's very long: we have the entire edit history for the whole 13 years, now a little bit more, but at the time I crawled it, it was 13 years. So it's ideal for our kind of study. The second reason we are interested in it is that there are tens of thousands of users in Wikipedia, and they come from different backgrounds: geographic location, religion, education, political views. That means there is a wealth of potentially private information that we could get, if there is any, and it also allows us to test these patterns on a very diverse population. But most of all, what really attracted us to Wikipedia is that it is an apparently harmless dataset. People in Wikipedia put a lot of energy into focusing only on the knowledge and not on personal information. So our hypothesis is: if we can detect these trends in an environment like Wikipedia, imagine what we could do in a socially richer environment.

Just to give you an idea of how big the dataset is: I downloaded the June 2013 stub dump of Wikipedia and simply parsed it. We ended up with more than 188 million revisions after filtering based on the information we have about the editor. So we were looking at the activity of almost 117,000 editors, and the way we got the personal information was by looking at editor badges; we got more than 8,000 categories of editor badges. We have the history of almost 22 million pages, and we also take into account the page categories, the categorization system that Wikipedia uses to classify pages. I'm going to show you exactly how we did this.
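To make the data-collection step concrete, here is a minimal sketch (not the authors' actual pipeline) of streaming a MediaWiki XML history dump and counting revisions per registered editor and per namespace. The file name is illustrative, and anonymous (IP-only) revisions are simply skipped.

```python
import bz2
from collections import Counter, defaultdict
from xml.etree import ElementTree as ET

# editor username -> {namespace -> revision count}
revisions_per_editor = defaultdict(Counter)

# hypothetical file name; the study used the June 2013 stub-meta-history dump
with bz2.open("enwiki-20130601-stub-meta-history.xml.bz2", "rb") as dump:
    page_ns = None
    for _event, elem in ET.iterparse(dump):
        tag = elem.tag.rsplit("}", 1)[-1]      # drop the XML namespace prefix
        if tag == "ns":
            page_ns = int(elem.text)           # <ns> appears before the revisions
        elif tag == "username":
            # one revision by a registered editor, in the current page's namespace
            revisions_per_editor[elem.text][page_ns] += 1
        elif tag == "page":
            elem.clear()                       # free memory; dumps have millions of pages
```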
So the first question was: how can we encode the editing behavior of users? How can we understand their editing patterns? The first thing we did is look only at the number of revisions that editors made over predefined categories. We have two sets of descriptors. One is the basic set, which simply counts the number of revisions over six predefined categories based on the wiki namespaces, so we get six features. The first feature captures content creation, actually editing one of the pages in Wikipedia, which corresponds roughly to namespaces 0 and 6. There is another feature which measures the activity of a user on the talk side of these pages, which we call TalkC, namespaces 1 and 7. There are two more features which capture the activity on the more social side of interactions in Wikipedia: one on the user pages, edits that the user makes to their own and other users' pages, which is namespace 2, and then the talk around these user pages, which is namespace 3. We also looked at two more features, and I'm going to show you why I think they are really interesting: one that we call wiki, with namespaces 4 and 5, and another one which we call infrastructure. Here we strive to capture edits towards community pages, how-tos, discussion lists, and so on; this feature tries to capture the organizational effort behind the content creation. That's what we aimed at. So that is how we capture editing behavior.

When it comes to the personal information that we extract, we rely on the self-disclosed information that some users put on their user pages about themselves. Using public APIs only, either provided by Wikipedia itself or developed by third parties, like the University of Waikato in New Zealand, we extracted these badges from the user pages, and we used the categorization system to get categories for the users. On the right-hand side are three examples of the badges we used. We got information about gender, ethnic origin, religious views, sexual orientation, the location of users, and the languages they speak. But because we have a constraint on the size of the datasets we want to analyze, we restrict ourselves to three kinds of personal traits: gender, religious views, and education. And just as an indication, I give how many users disclose each of those traits.

So I showed you how we encode the editing activity and how we collected the personal information. Now we're getting closer to the core of our question: how did we encode this editing behavior over time? Because we want to capture more and more information in a temporally cumulative fashion. What we did is simply divide time into three-month time frames, and then we compute the editing behavior of each user, just as I showed earlier, by counting over the categories. Here you see an example of time being cut into three time frames. In each time frame, we count the revisions the editor made over the categories, and that gives us a description of her editing behavior over that time frame. Now, if we want to temporally embed increasing quantities of information, all we need to do is take the concatenation of these features. For example, to describe the activity at the second time frame, we have the features from the first time frame and from the second time frame. If we want to go on to the third time frame, we take the previous description, so the first and second time frames, and add the description over the third time frame. The advantage of this method is that we now have features describing a user's activity, say the pattern of editing around socializing with other users, the edits in the user category, in the first time frame, denoted user 1, but also in the second and third time frames, denoted user 2 and user 3. So this gives us a time series of editing behavior on each of our features.
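As a rough illustration of this feature construction, here is a minimal sketch assuming a pandas DataFrame of revision records with editor, timestamp, and namespace columns. The namespace grouping follows the description above; the calendar-quarter alignment and the column naming are assumptions made for the example.

```python
import pandas as pd

# map Wikipedia namespaces onto the six basic categories described above
NS_TO_FEATURE = {0: "content", 6: "content",   # article and file pages
                 1: "talkC",   7: "talkC",     # their talk pages
                 2: "user",    3: "talkU",     # user pages and user talk
                 4: "wiki",    5: "wiki"}      # project pages; the rest -> infrastructure

def basic_features(revs: pd.DataFrame) -> pd.DataFrame:
    """revs has columns [editor, timestamp, namespace]; returns one row per editor
    with per-quarter revision counts over the feature categories."""
    revs = revs.copy()
    revs["feature"] = revs["namespace"].map(NS_TO_FEATURE).fillna("infra")
    revs["quarter"] = revs["timestamp"].dt.to_period("Q")     # three-month time frames
    counts = (revs.groupby(["editor", "quarter", "feature"]).size()
                  .unstack(["quarter", "feature"], fill_value=0))
    counts.columns = [f"{feat}_{q}" for q, feat in counts.columns]  # e.g. user_2007Q1
    return counts

# the cumulative description up to time frame t is obtained by keeping
# only the columns of quarters <= t, i.e. the concatenation described above
```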
Okay, so before I show you what we actually got on privacy itself: as in any data-mining task, we did a bit of profiling of the editing behavior, trying to understand what was in the data and to find some patterns visually. There were a couple of interesting conclusions we got just by looking at the editing behavior itself. This graph shows the total number of revisions in Wikipedia, as well as the number of active and new editors. The total revisions are shown with the green line, and the active and new editors are shown with the blue and red bars respectively. What we see on the left is the content edits, the active contributions to Wikipedia pages, which shows the famous slowdown of Wikipedia that has been reported previously. Aaron did quite an amount of work on this, and I've seen a number of his presentations where this phenomenon was analyzed in detail. So this is known. But we went a little further and tabulated this trend over our categories. What we saw is that while this trend is present in most of the features, so we can detect a similar slowdown, it is not the same for what we call infrastructure. What we detected is that infrastructure edits tend to be on the rise. So there is a rise in maintenance effort, or, as one of Aaron's earlier presentations put it, Wikipedia is getting efficient: there is more work around maintaining the content than around creating it. We found this really interesting.

We also tabulated this evolution over the size of the populations: we grouped the population based on the personal traits, so we can now plot things like the relative evolution of the population of editors based on their education or their religion. For example, on education we see that while the general population does follow the decreasing trend after 2007, it is interesting that different categories of editors, based on their education level, tend to evolve differently. While undergraduates and graduates decrease like the general population, PhDs tend to be stable, or have only decreased in recent years, which seems to reinforce the hypothesis of an increasingly high bar of expertise that editors need in order to make useful contributions to the bulk of Wikipedia. Simply because Wikipedia has been maturing a lot, and a lot of the articles are in good shape, editors need to be more and more specialized in order to make a contribution.
Looking at the tabulation per religion, we see that self-disclosed Muslim editors seem to be more and more present in the population, whereas all the other mainstream religions seem to be on a decreasing trend.

After looking at this, the next question is: does the editing behavior actually correlate with any of the private traits? What we did here, on the left, is tabulate and aggregate the number of edits over the different categories. Here it is for gender; we did it for all of the traits, but I'm presenting gender because it's more visual. We see, for example, that self-disclosed male editors tend to edit more on the content feature, whereas self-disclosed female editors tend to edit more on the user and talk sides of the pages. Looking at this temporally, we see that this pattern holds until around 2009, after which the values are comparable. I'm not inferring anything about the editing behavior of the different genders; all I'm trying to say is that the way different editors edit seems to correlate with the private trait they self-disclose, which means there is hope that we can actually learn and predict these private traits based only on editing behavior.

So here we come to the actual core of this presentation: we treat privacy loss as a prediction problem, meaning that we define an intrusion, or a breach of privacy, whenever a personal trait of an editor is predicted or disclosed against the will of the editor. It doesn't matter if it's gender or any other trait that would be easy to spot, or that for most of us does not pose a problem: if the user doesn't want that trait to be disclosed, we treat it as a privacy intrusion. Given this definition of privacy loss, one way to measure whether this phenomenon exists is to see if we can predict these traits, and how well. Basically, we took each of these traits, treated them as categorical classes, and tried to predict them using a typical, standard machine learning setup. We took our population of users, split them into a training set and a test set, and used a classical machine learning algorithm, in this case logistic regression, a really off-the-shelf algorithm.

What we found is that, as time increases, using the setup I presented earlier, we observe an increase in prediction performance for all of the traits. So we can conclude that all of the traits we looked at can be predicted increasingly well just by looking at the editing pattern of our editors over increasing prefixes of time. This was already a big result, but then we came to the next question: what causes this privacy loss? We wanted to look at the sources of privacy loss. One of the first things to do is to try to predict better, because, as you saw earlier, the prediction performance here is measured by AUC, an indicator defined between 0.5 and 1, with higher being better. Ours is around 0.7, which, while indicative, is not a big value. So there is always room for improvement, and therefore we went on to put more information into the description.
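A minimal sketch of this prediction setup, assuming per-quarter count matrices like the ones built above and a binary self-disclosed trait as the label; the 70/30 split and the scikit-learn defaults are illustrative, not the exact experimental protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def auc_over_time(X_by_quarter, y, seed=0):
    """X_by_quarter: list of per-quarter count matrices (same editor order);
    y: binary self-disclosed trait. A separate model is trained per time prefix."""
    aucs = []
    for t in range(1, len(X_by_quarter) + 1):
        X = np.hstack(X_by_quarter[:t])                  # concatenate quarters 1..t
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=seed, stratify=y)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
    return aucs  # expected to rise with t if privacy erodes over time
```

Note that this trains a separate model for every time prefix, which matches the clarification given during the Q&A later in the talk.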
Instead of just using our basic features, we now also use thematic features, shown here in red. The thematic features are constructed similarly to the basic ones, but they measure the thematic editing preferences of our users in each time frame. We took the high-level categories of pages, things like geography, history, social sciences, applied sciences, and so on, and we measure the activity of each user in each time frame over each of these. So now we know not only that the user edited content more than, say, the wiki kinds of pages, but also that the user prefers to edit mathematics more than people, in general. Having this additional information, we perform the same kind of learning and obtain the red curve, which is definitely better in prediction terms than the basic features. This means that, yes, using richer features gives us a boost in prediction power, and therefore a bigger privacy loss. But as you can see, the two prediction lines seem to be correlated, with a Pearson correlation coefficient of 0.97; one is just a scaled-up version of the other. This makes us believe that it doesn't really matter how much information we put into the features or into the description: the basic trend, the trend of de-anonymization or privacy loss, is the same, because it is intrinsic to the environment itself. It does not depend on the description; it is basically about the accumulation of information.

The next source of information we looked at is what can be learned from newcomers, people who joined the population during the course of the study. Here, the red line shows the prediction performance, the privacy-loss trend, in a fixed population: we take the population active at the beginning of 2007 and then study only these users, looking only at their own activity. Any performance increase is due only to learning the editing patterns of these users better from their own activity, with no additional information. In the blue line, we allow new users to enter the population, new editors who post their own personal information and have their own editing patterns, and we get a steeper line. Basically, we learn more over time; there is a clear performance difference towards the end of our time frame. This shows that newcomers do indeed hurt the privacy of existing users.

Now, we are trying to figure out why this happens, and in order to quantify it in more detail, we need to take a very brief look at the kind of measures we can use. These are information-theoretic measures. The uncertainty about the private information can be measured using entropy; think of this as the total amount of information that can be uncovered about, say, the gender of users. Now, if we want to measure the amount of information disclosed by a given feature X about our class Y, we can use the mutual information between X and Y. In the example on the right, the intersection in the Venn diagram shows the amount of mutual information: more intersection means higher mutual information, and therefore more disclosure of the class by the feature. But this is not quite what we are interested in.
Remember that all of our features are time series, so we are not really interested only in the amount of information that a feature discloses; we are mainly interested in the amount of new information that the feature discloses at a given time T. Therefore, we use the information transfer measure, which is a conditional mutual information. This tells us how much information X3, on the right, gives, considering we have already seen X1 and X2. X3, as you can see, has a pretty large intersection with Y, but the amount of new information that X3 brings is pretty small. So this information transfer allows us to measure how much additional privacy is lost in a new time frame.

Let me show you what we obtain. By online breadcrumbs we denote the effect of a user's own activity: what do I learn when looking at a fixed set of users and learning the patterns of their own usage? On the left, you have the instantaneous mutual information: how much information each of the three main features for us, content, life, and society, gives about gender in this case, at each time frame. You see that the amount seems to be roughly constant, which means that each time frame brings roughly the same amount of information about gender. But when we look at the amount of new information that we get, we see that initially there is a peak, and then it decreases quite fast. The idea is that even if all the descriptions contain as much information, so they hurt privacy just as much, in practice later edits tend to hurt privacy less, simply because what was there to be learned has been learned already. Imagine that you follow someone over a week or a month and learn her habits: where she shops, where she lives, what she does. Following her for another month or another year will not bring a lot of new information, because most of that person's habits have already been discovered.

Now, here we come to the interesting part: what do new people bring into our population? On the left, you see the information transfer, the amount of new information in the system, when we take into account a system that contains both new and old users, whereas on the right there are only old users. Both show the initial peak, which is due to the breadcrumbs, but on the right it then goes straight to zero, meaning that almost no new information is learned, whereas on the left it goes down to a non-trivial value. This means that the amount of information that can be learned from new editors constantly entering the system is moderate, pretty small, but constant over time, and this information learned from newcomers is outside the control of the users. The users cannot control what a learning algorithm can infer based on other people's actions.
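To make these measures concrete, here is a minimal sketch that estimates the information transfer (a conditional mutual information) from discretized feature values, using scikit-learn's mutual_info_score for the pairwise terms. Collapsing the whole past into a single discrete variable is a simplification for illustration, not the paper's exact estimator.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def information_transfer(x_t, y, past):
    """Estimate I(x_t ; y | past) for discrete arrays: the average, over values of
    `past`, of the mutual information between the new feature x_t and the trait y."""
    x_t, y, past = np.asarray(x_t), np.asarray(y), np.asarray(past)
    cmi = 0.0
    for value in np.unique(past):
        mask = past == value
        cmi += mask.mean() * mutual_info_score(y[mask], x_t[mask])
    return cmi  # in nats; a small value means the new time frame adds little

# mutual_info_score(y, x_t) alone would be the instantaneous mutual information
# shown on the left-hand plot described above
```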
This brings us to our third and really interesting question: if I retire from active life, am I safe? We look at the same setup, but we change it so that we take the set of users that were active in 2007, so before January 2008, and these are the users we concentrate on. These users stopped editing after 2008, they quit Wikipedia, so we have no information about them after 2008, and now we try to predict their traits. For some traits, not all of them, in this case undergraduates, we detect a constant increase in prediction performance, which basically means that their privacy keeps eroding even after they retire.

Now, why is that? We looked into why. Obviously, it has to be because of the new users, but how do the new users affect the old users? The way we see it is that there are connection users, users who have been active both before and after our users exited the system. Let me show you a little more how this works, because it really intrigued us. When we do our learning, each of the machine learning algorithms constructs a model. The model is an abstract description of the data it has seen until then. In our case, for example, it constructs a weighted linear combination of the features. Each of the features that you've seen, content 2 or user 5 and so on, has a weight which shows the importance of that feature for predicting the private trait of a user at a given time frame. By looking at the weights of these features inside each of the models, we can understand what the model thinks is important and not important for predicting, in this case, whether the education level of a user is undergraduate.

Look at the graphic on the right. Each of the bars shows the weights of the content features, in particular content 2 and 3. Remember that all we have to predict the education level of the retired users is their activity while they were still active, which means content 1, 2, 3, and 4, the four quarters of 2007. The initial models think that content features are not important for predicting whether a user is an undergraduate, whereas later models seem to become more and more convinced that content features are important. What happens is that new users enter the system, they provide more information, and the model learns that it should put more weight on, say, the content features. Then, when it predicts, it goes back and revisits the activity of our exited users, and it has the "aha" moment: ah, in fact they were undergraduates. What this shows is that even if today something cannot be inferred about our users, it doesn't mean it cannot be done tomorrow, simply because we will have access to other kinds of information. And do remember that everything I showed you here is based on a very basic description of the data and only very basic learning algorithms. Much more sophisticated learning algorithms, designed specifically for this task, could be devised, and they would give even better results.
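Since the models here are plain logistic regressions, inspecting how much weight each successive model puts on a feature family reduces to looking at the fitted coefficients. A minimal sketch, assuming the column-naming convention from the earlier feature-construction sketch:

```python
import numpy as np

def family_weight(model, feature_names, family="content"):
    """Total |coefficient| mass a fitted LogisticRegression puts on one feature
    family (e.g. all content_* columns of that prefix's feature matrix)."""
    idx = [i for i, name in enumerate(feature_names) if name.startswith(family)]
    return float(np.abs(model.coef_[0, idx]).sum())

# Computing family_weight(...) for the model of each successive time prefix shows
# whether later models, trained with newcomers' data included, put more weight on
# content features when re-scoring the retired 2007 editors, i.e. the "aha" moment.
```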
So the main conclusion, the takeaway message of this talk, is that yes, time has an adverse effect on privacy, and it cannot really be controlled by the users themselves at the time they post information in the online environment. Some people from a machine learning background will say: of course this happens, because we have more information, and more information translates almost automatically into better prediction. But when we look at the factors influencing privacy loss, we find two main sources. One is the online breadcrumbs, or better said, the editor's own activity, the patterns that emerge from just following one editor. The other, and this is the one people are less aware of, is the activity of other editors and newcomers: the information that can be learned from other editors and then used to make predictions about existing editors. The natural result of this is that privacy erodes even for retired users. So I think the main message, the one people should probably be aware of, is that users do not have complete control over the consequences of the information they release today. You cannot foresee what your information can be used to predict tomorrow; today's information can be used tomorrow to predict things you would not expect. This is the main message of my presentation. So thank you very much for your attention, and I'm open to discussion or any questions.

Thank you, Andrei. So we have a few questions; we're going to ask around first. If you're on IRC and have any questions, you can relay them on the channel.

Yeah, I've got one question from Giovanni, who asks: one thing that I don't understand is whether Andrei trains a different model at each point in time and computes the ROC with a separate model, or whether he trains once and then computes the ROC over time with a single model?

Yeah, OK, thank you. This is a good question, a good point; I did not go too much into the technical details. I am training separate models at each time step. At each moment in time, I train a different model based on the description at that time point. What does this mean? I construct different descriptions of my users at each moment, and then I use those descriptions to predict. So yes, they are different models at each time step; I'm not evolving a single model. I don't know if that was clear enough.

Yeah, regretfully, on IRC they're going to have a delayed response, so I'll let you know if Giovanni wants to ask a follow-up. Since I'm already talking, I'll ask my question too, and I'm pretty sure that finishes it for IRC. I'm curious: when you were showing those graphs at the beginning, and thank you for the shout-out, I was wondering how big the sub-sample of editors is for whom you have this information about their gender or educational background. What's the scale of that graph?

You're talking about these graphs over here, right?

Yeah, how many users do you even have this information for?

Yep, so these are proportions, all of these. The curves are scaled: the lines are scaled from 0 to 1, which is minimum to maximum population. Then the height of the bars on the left gives you the proportions of the categories. So, for example, grads are bigger than undergrads, which are bigger than PhDs, and in the parentheses you have the actual sizes of the classes. So I have around 3,300 undergraduates, a bit more than 700 PhDs, and 4,500 graduates. Now, if you want the numbers for all of these traits, I give them here: I have education information for 9,000 users.
Then I have religion information for more than 7,000 users, and gender for approximately 7,000 users as well. And this is an important point, thanks for the question: when I constructed the dataset I described here on this slide, I restricted all of the information only to editors that I have information about. So basically, I went through the Wikipedia dump and got the entire population of users; then I looked at which of these users I have information about. So when I present, say, 188 million revisions, those are revisions made only by editors who have put at least one badge on their user page, and about whom I therefore have information. Obviously, there are a lot more than 188 million revisions in Wikipedia. I'm restricting myself to registered users who actually have some information on their user page.

Yeah, I appreciate this. The thing that I'm struggling with, and the reason why I ask this question, is that I'd like to be able to take your observations as fact and then try to figure out why it is that, for example, people who have a PhD stick around in Wikipedia longer. But it worries me that this is such a small subset of editors, and it's limited to people who already got through enough of Wikipedia stuff that they know what a user box is and put it on their user page. So I guess, to finish my question and then shut up: what do you think about using this as fact and then trying to build hypotheses about why it is this way?

Yes, this is one of the known limitations of our work: there is a self-disclosure bias. And the disclosure bias matters not only for the validity of the hypotheses we could build around the results; I think it also matters for the prediction results themselves. If I am basing my training on a subset of users who are more inclined to disclose their religious views or their location, then probably the models I construct will be drawn towards this more vocal minority and will not be representative of the real majority out there. We are aware of this; it is a typical problem in setups that use self-disclosed information. A solution would be to perform a real study, a real questionnaire, where you can have a random sample that is representative of the population and where you get information from the typical editor, not just the ones who want their preferences to be known. I don't see another way of getting around the bias. Now, why would PhDs stick around longer? I think you're right that that population is probably a hardened population of editors who have been around for a while. I don't think I have intimate enough knowledge of Wikipedia's mechanisms to make any reasonable hypothesis about why that is happening.

Gotcha. Well, thank you very much, and for what it's worth, you also answered Giovanni's follow-up question in that, so that was excellent, thank you.

Okay, cool, thanks.

Now, I'm going to jump into the conversation to say that Aaron stole half of my question. I'm very happy that the bias introduced by self-disclosure got addressed; I agree that having a study with a more controlled demographic sample would be very, very valuable.
And I'm also thinking, to that point, that there might be an additional bias introduced by the discoverability of these badges. Given that there is no repository of badges that you can choose from when you register, there would almost certainly be network effects that determine when you come across certain badges: you learn about them and then start using them, and the timing of this may also affect the population of people who actually decide to use these badges on their own profile. But I think your answer was actually spot on about how to get a proper sample for this study.

I'm going to ask you a second question, which is the same one I asked you last time; I think it might be useful for the community, and it's basically related to one of the first points you mentioned at the beginning of the presentation. Namely, what can be done? Is there any actionable recommendation to prevent this kind of exposure of information? It sounds like the main takeaway from your presentation is no, basically, right? There's not much that can be done, either by the user or in the design of the system, that can prevent this progressive disclosure of information, even when people retire. Given that I guess many people would be interested in knowing about measures they could implement, I'm curious to hear any additional thoughts you have on this question.

Really good point, really good point. This is one of the things that we also argued in our, oh, can you hear me? Yeah, this is one of the points we argued in our paper: at the individual level, yes, probably there is not much that we can do at this moment. What can be done probably goes two ways, and they are based on different principles. One of them is simply being aware, and that is the most important thing, because when we are aware, policies can be designed. I think most of what can be done revolves around designing policies: things like the right to be forgotten, as applied to Google, or data retention policies, and so on. Now, what does it mean for Wikipedia editors themselves? Well, based on the results we get, they should not be too worried at this moment in time. We can predict, with fairly good performance, some of the personal traits which are abundant in the population, but from there to actually de-anonymizing the identity of one particular user, I think there is still a long road to go. We just detected this phenomenon for the first time, right? So from there to actually using it for de-anonymization is a long way. But the way I see it now, I don't see other options for limiting this effect other than not exposing the data, and that would make Wikipedia less public. And even then, I showed you that most of the prediction performance, most of the privacy, is lost within the first period of observation, right? That is when most of it is lost, actually. So, I don't know if that was a convoluted answer, but what I'm trying to say is that probably the only effective means of preserving users' privacy is designing systems with this in mind, so that they do not give away information. Of course, there are also differential-privacy kinds of methods, like k-anonymity or introducing noise into the results.
So there are other methods, but all of these assume that the designers of the system are aware of this phenomenon, which was not the case until now.

Thank you. All right, are there additional questions from the room or from IRC? Well, I think IRC went silent. So again, I want to thank you, Andrei. You're more than welcome to stick around on IRC if there is any additional comment, but thanks again for joining us today, and thanks everybody for joining us. Let's finish with a final round of applause. Thank you.

Thank you very much for having me. It's always a pleasure to talk to you guys. Bye. Thank you. Thank you.