Yeah, again: this session is being video streamed and recorded, so every one of you should be aware of that. It is about this upcoming law, the GDPR, which is the data protection law. I'm going to say more about it. The question I want to answer together with you is: how does MediaWiki relate to the GDPR on a technical basis? What do users who run MediaWiki on their sites need to know? What's missing? How do we intend to remedy these things? And what's our position, where do we stand regarding the GDPR? I'm Markus Glaser. I wear a few hats. Currently I'm spokesperson of the MWStake user group, which is a group that cares for MediaWiki installations out there in the wide internet world. In my professional life, I run an enterprise that helps companies set up their MediaWikis, and we publish a product called BlueSpice, which makes MediaWiki enterprise ready. So I also have a professional interest in the outcome of what's happening here. Okay, so where do we stand? What's the GDPR? GDPR is short for General Data Protection Regulation, an EU law that will become effective on May 25th, 2018. There has been a transition period from the old country-based regulations to this EU-wide regulation. The transition period was two years, so all the panic that is happening now around the GDPR could have been avoided if the people affected by the law had started working towards implementing it earlier. But that's probably human nature, so we now see a lot of activity in Europe, with businesses trying to implement the requirements of this law. Actually, the intention of this law is good and I fully support it. It is designed to let individuals better control their personal data, the use of it and the processing of it, and to be able to intervene if they feel that something is happening with their personal data that they don't want. As with many laws, the intention is good, but sometimes the details are not so perfect.
One of the flaws I guess we are seeing is that the wording is very vague, so there is a lot of room for interpretation, and what counts as a good way to implement the actual requirements of the law is something courts will eventually have to decide. So there is no blueprint, no checklist you can just follow and say: okay, if you do this and that, then you fully comply. It's about the intention of protecting the data, but there is a lot of room for interpretation. As I know that not all of you are from the EU, you might be wondering how this law affects you. The law is designed in such a way that it applies to any interaction with any citizen from the EU. For those coming from outside the EU, I found a very interesting hack, which is the EU blocker. You set up your website, you put up an EU blocker, and any IP address from the EU will not be able to reach your website. That sounds like a good idea, but then you have to see what happens if an EU citizen is in the US and tries to access your site: the law still applies. So this law is rather creeping all across the web and you should be... On one page on MediaWiki.org, someone said: I'm not from the EU, I'm not affected. Another person answered: if you ever plan to set foot in the EU and you don't want to get arrested or fined, you had better see that you comply with that law. I don't want to say whether it's good or bad; that's just the current reading of this law. Yeah, I forgot to say I'm not a lawyer, so I don't know this law by heart. I'm in almost the same situation as most of you: I see this law, I have to deal with it, and the question is how we can do this. So, a few basic terms. The law is about protecting personal data, and this is taken directly from the law, so the question is: what is personal data? It basically means any information that can be related to an individual person, a natural person.
So it is not personal data if I can only identify an organization, because an organization is not a person by definition. Interestingly, it's also about the question of directly or indirectly. If, say, I put my name somewhere, then I can be directly identified. If I put my IP address somewhere, that is probably a case of indirect identification, because it's possible to take the IP address, go to my provider and get to my name. That also applies to pseudonyms and other things. Then it's also a question of combining data. If, for example, you write about a person who is from Germany, that person is probably not identifiable here in this room, and the same goes if you write about a person who is aged 40 to 45. But if you can somehow bring those two data sets together, then you pretty much get me, or maybe one other person here. So that is the combination of factors, which should also be considered. The law applies to data processing. That basically means any case where you handle, see, modify, delete, or whatever, data: that is processing. It's a very wide term, and when in doubt whether something is processing or not, you had better assume it is processing under this law. So as a rule of thumb, if you deal with any personal data, you will somehow be affected by this law. Okay, so this law has a huge number of articles, and there are a lot of principles involved and a lot of rights. The things that are of interest here are the ones that affect the technology. That's what I think we should be talking about. So it's not about, I don't know, the disclaimer you need to put on your page or things like that. And we have six rights. Actually, when you try to categorize and organize what's happening here for the sake of technology, it's not that easy. When you skim through the web pages, they give you a lot of advice. I tried to compile it a little bit and condense it, but again, I'm not guaranteeing completeness here. It's just the way I understand it.
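The point about combining data sets can be made concrete with a minimal sketch in Python. Everything in it is an assumption made for illustration: the attendee records are invented, and the `Person` class and `candidates` helper are hypothetical names, not part of MediaWiki or of the GDPR text. It only shows how two attributes that are harmless on their own can narrow a group down to one person.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    country: str
    age: int


# Invented attendees of a hypothetical room; illustration only.
ROOM = [
    Person("A", "Germany", 42),
    Person("B", "Germany", 29),
    Person("C", "France", 44),
    Person("D", "USA", 41),
    Person("E", "Germany", 58),
]


def candidates(people, country=None, age_range=None):
    """Return the people who match every attribute that was supplied."""
    result = []
    for person in people:
        if country is not None and person.country != country:
            continue
        if age_range is not None and not (age_range[0] <= person.age <= age_range[1]):
            continue
        result.append(person)
    return result


# Each attribute alone still leaves several candidates ...
by_country = candidates(ROOM, country="Germany")
by_age = candidates(ROOM, age_range=(40, 45))
# ... but the combination singles out one person.
combined = candidates(ROOM, country="Germany", age_range=(40, 45))

print(len(by_country), len(by_age), len(combined))
```

With this invented data, each attribute on its own leaves three candidates, while the combination leaves exactly one.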
Okay, so what are the rights and the principles? There's one principle called transparency. The user needs to know what's happening with their data, and they need to be able to access that data. So that is the right to access and to be informed. There are a few requirements. For example, the user can ask an organization: what do you store about me? The organization then has to give the user a list of the data it has stored about this person. Okay, then you have the right to rectification. If there is any false information stored somewhere, then the processing entity, that's the term, is obliged to correct this error and to do so without a fee. That's another thing for companies. Facebook cannot just say: I charge you $10 for every spelling mistake in your name that I have somewhere in my data. They have to do it for free. Then there is the right to be forgotten. If you want to leave a website or an organization, you can require that organization to completely forget about you. That is something we will probably have to talk about in the context of MediaWiki. Then there's the right to data portability, which sounds somewhat strange at first. Basically, the data processor is required to provide the user, on request, with a set of data in a commonly known format that can be imported or used elsewhere. So I can go to Facebook again and say: give me all the data you have about me. Then I get this large file, or whatever printout, and I can do whatever I want with this file. The question is, what's the purpose of this? In the context of Facebook, for example, I see some people keeping their photo journals on Facebook. The lawmakers want to prevent a situation where Facebook says: okay, you can go away, we can erase all your data, but you can't take your personal photos with you. They're just locked in the system, so either you stay or you lose your data.
So this prevention of lock-in is one of the main intentions behind data portability. The lawmakers also think it will increase competition among similar platforms. I don't think we have a huge problem with data portability. There might be the question of how we compile all the contributions of a user, but generally all the contributions a user makes to a MediaWiki are basically open. So I don't see a big deal here. Okay, then there's the right to restrict or object to the processing of data, which basically means you can withdraw your consent. If I say it's okay for Wikipedia to process my data and to store my username, then at a certain point I can say it's no longer okay, and then the username has to be removed. And the last thing is that users need to be notified about security breaches. If there is a large hack and a lot of data gets stolen, then all the affected users have to be notified. I guess that makes sense in a way. We still have to see how this can be achieved using MediaWiki. Okay, so I looked a little bit at other software I work with and tried to see what they do. I found some position papers, and I guess you can find lots more if you look on the internet. For example, OSM takes the articles of the GDPR and says: okay, with regard to this article, this is how we stand, this is the current situation, this is what we want to do. A more lightweight approach came from Easy Redmine. Where the OSM paper has some 24 pages, I think the Easy Redmine paper could be condensed to two or three pages. The second approach is to write plugins to make it easier for auditors, for example, and to provide the necessary export features. WordPress, for example, has a GDPR plugin, and Joomla has a GDPR plugin. It might be worth looking into what features these plugins provide, to see what kind of things we might think are missing in MediaWiki. And then of course there's the approach of adapting the core software, again.
Easy Redmine, which is software I use personally, did this: they added a few buttons to provide complete audit trails and things like that to ensure data integrity. Okay, so that's the general overview of the GDPR, as far as I can give it to you. I was thinking a little bit about what data is affected in MediaWiki. Of course, again, this is not a complete list. This is the beginning of the brainstorming, which I want to continue with you in the etherpad after my talk. So what kind of data do we have in MediaWiki? What kind of personal data? I think these are pretty obvious. We collect username, real name, email address and language. We do so by consent, save for the username, I think. No one has to give their email address, but if they do, we are still storing personal data, and we still need to see how we can, for example, remove it. We also store the IP addresses of anonymous users. That is probably more of an issue in open wikis out on the internet. It's not so much an issue in internal wikis, because they typically have something like a login-only policy. But still, the software is capable of storing IP addresses and uses them, and how to deal with IP addresses is probably one of the questions that will keep us busy. Then we have user actions, like the version history and the action log. While these are mostly about content a user contributes, they contain some kinds of personal data which are interesting. For example, there's a timestamp, and you can do profiling using timestamps. From when a user is active you can, for example, infer their time zone, their working hours or non-working hours, or whether they edit during their working hours. You see there's a lot of stuff in there. So there are the timestamps in the logs, and also, if you have a profile of what kind of articles a user edits, you can maybe profile that user's interests.
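The timestamp profiling mentioned above can be illustrated with a minimal sketch in Python. It is an assumption-laden illustration, not something taken from the talk: the timestamps are invented, they are assumed to be ISO 8601 UTC strings, and the function names `hour_histogram`, `active_window` and `weekday_share` are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Invented edit timestamps (assumed ISO 8601, UTC); not real user data.
EDITS = [
    "2018-05-14T07:05:00Z",
    "2018-05-14T08:40:00Z",
    "2018-05-15T08:10:00Z",
    "2018-05-15T12:30:00Z",
    "2018-05-16T15:45:00Z",
    "2018-05-17T16:20:00Z",
]

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def hour_histogram(timestamps):
    """Count how many edits fall into each hour of the day (UTC)."""
    hours = (datetime.strptime(ts, TIMESTAMP_FORMAT).hour for ts in timestamps)
    return Counter(hours)


def active_window(timestamps):
    """Return the earliest and latest hour of the day with any edit."""
    hist = hour_histogram(timestamps)
    return min(hist), max(hist)


def weekday_share(timestamps):
    """Return the fraction of edits made Monday through Friday."""
    days = [datetime.strptime(ts, TIMESTAMP_FORMAT).weekday() for ts in timestamps]
    return sum(1 for day in days if day < 5) / len(days)


print(active_window(EDITS), weekday_share(EDITS))
```

Even this crude summary suggests a person who edits on weekdays between roughly 07:00 and 17:00 UTC, which is the kind of inference about working hours and time zone that the log timestamps make possible.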
You can draw various conclusions from that, which again is this combination of data and identification. I think we can deal with that quite easily, hopefully, but then we have one thing in MediaWiki which is quite tricky: all the kinds of data that are in the content area. Starting with signatures, which are part of the content, links to user pages, and any contribution of a user where the name is mentioned or that can be personally attributed: all of that could potentially be personal data. And yeah, we have to find a reasonable way to deal with that. While it is quite easy to replace signatures in the current version, it might not be so easy to remove the signature from all the previous versions. I mean, it can probably be done somehow, but even if you find the perfect regex to remove a signature, there's always the risk of false positives and things like that. So that is one of the things where we need to see how we deal with it. Okay, so where do we stand? This is again the beginning of drafting a document. Regarding the right to access and to be informed about the data: we have the list of recent changes, we have the list of user contributions, and we have the logs, which can be filtered. So users pretty much know most of what is stored about them. They can also do a full-text search for their username and see where they are mentioned, at least in the current revision. Then we have the right to rectification of the data. Users can just change their data, as far as I know. There is no history of data in the user table. In the most current version of an article, you can also correct the data, and I think one could safely argue that if it is fixed in the current version, that would somehow suffice, at least for rectification. I'm not sure about erasure, that is, the right to be forgotten. I think we have to put some effort, some energy into that. Regarding data portability, there is the page export.
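As a rough illustration of how a per-user export could be assembled, here is a minimal sketch in Python. It assumes the MediaWiki Action API module `list=usercontribs` and the `pages` and `curonly` form fields of Special:Export; check both against the version your wiki runs. The wiki URL, the username and the sample response are invented, the helper names are hypothetical, and continuation of long result lists is deliberately left out. The sketch makes no network calls.

```python
from urllib.parse import urlencode


def contribs_query_url(api_url, username, limit=500):
    """Build an Action API URL listing a user's contributions (no request is sent)."""
    params = {
        "action": "query",
        "list": "usercontribs",
        "ucuser": username,
        "uclimit": limit,
        "ucprop": "title|timestamp|comment",
        "format": "json",
    }
    return api_url + "?" + urlencode(params)


def titles_from_response(data):
    """Collect the distinct page titles from a usercontribs-style response."""
    return sorted({item["title"] for item in data["query"]["usercontribs"]})


def export_form(titles, current_only=True):
    """Form fields one could post to Special:Export (assumed field names)."""
    form = {"pages": "\n".join(titles)}
    if current_only:
        form["curonly"] = "1"
    return form


# Invented sample in the shape the API is assumed to return.
SAMPLE = {
    "query": {
        "usercontribs": [
            {"title": "Main Page", "timestamp": "2018-05-14T07:05:00Z", "comment": "typo"},
            {"title": "Help:Contents", "timestamp": "2018-05-15T08:10:00Z", "comment": ""},
            {"title": "Main Page", "timestamp": "2018-05-16T15:45:00Z", "comment": "link"},
        ]
    }
}

url = contribs_query_url("https://wiki.example.org/w/api.php", "ExampleUser")
titles = titles_from_response(SAMPLE)
form = export_form(titles)
print(url)
print(form["pages"])
```

Posting the resulting form to Special:Export should yield an XML dump of the listed pages, which is the commonly used format the portability discussion asks for. Whether the export has to contain only the user's own revisions or the whole pages is left open here.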
And while I still think we should provide easier user interfaces for exporting the data of one person, theoretically the person is able to export the pages they contributed to. So we don't have a lock-in situation, and I think with the page export this should be okay. Again, not a lawyer. The right to restrict or object to data processing: I'm not really sure if that is a technical thing. It might just be that we need an opt-out page on MediaWiki saying: I no longer consent. For the right to restrict, it's the case where the user is saying: I will take legal action over this data, please do not remove it. So the solution is to just clone this data into another database, move the data to another instance of MediaWiki, and not touch the data there. Okay. But I have to remove the user's data from the live version. It just needs to exist somewhere else, so that it can be used in a legal process, and not on the page. Sounds cool. And the last thing: breach notification. There is no mass mailing feature in MediaWiki right now. Again, there's no requirement for a technical solution. But if you run a wiki with more than a few hundred users, then extracting all the email addresses and sending a mail to those users one by one might be a little bit hard. Maybe this would be a good point to equip MediaWiki with a kind of notify-all-users-about-something feature. People have also proposed that CentralNotice might be adequate to meet that requirement. It's just a suggestion. Yeah. From how I read the text, I think it would be okay, because it's a notification. The question is: does it require a push notification, for example? That's something that's probably up to lawyers to decide in the end. But, I mean, we are not a decisive body here, but if the MediaWiki community decides to take a minimal approach, then I guess that a notification on the site could also be okay.
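To show what a simple notify-all-users helper might involve, here is a minimal sketch in Python that only prepares batched messages and sends nothing. It is not an existing MediaWiki feature. The addresses, the sender, the batch size and the helper names `batched` and `build_notices` are all assumptions for illustration, and how the addresses are extracted from the wiki is left out.

```python
from email.message import EmailMessage


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_notices(addresses, sender, subject, body, batch_size=50):
    """Prepare one message per batch, with recipients in Bcc (nothing is sent)."""
    # Drop empty entries and duplicates so nobody is mailed twice.
    unique = sorted({addr.strip() for addr in addresses if addr and addr.strip()})
    messages = []
    for group in batched(unique, batch_size):
        msg = EmailMessage()
        msg["From"] = sender
        msg["To"] = sender  # recipients stay hidden from each other via Bcc
        msg["Bcc"] = ", ".join(group)
        msg["Subject"] = subject
        msg.set_content(body)
        messages.append(msg)
    return messages


# Invented addresses; in practice these would come from the wiki's user data.
ADDRESSES = ["a@example.org", "b@example.org", "a@example.org", "", "c@example.org"]

notices = build_notices(
    ADDRESSES,
    sender="admin@example.org",
    subject="Security incident on our wiki",
    body="We are writing to inform you about a data breach affecting your account.",
    batch_size=2,
)
print(len(notices))
```

Each prepared message could then be handed to a mail library such as `smtplib`. Users who never registered an email address are not reached this way, which is one reason an on-site notice was suggested as a complement.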
Just one additional thought about what we have stored: there's the CheckUser data on the Wikimedia sites, and I think that is more about the Wikimedia projects or larger projects. That's something the general public doesn't see, and we generally don't want to disclose it to them. So there's a bit of a conflict between what we want and how we can resolve that. So I take an easier approach here, because I say this is about MediaWiki and not about the Wikipedias. But I know there's a discussion about that, and that's probably a very tricky part. People can only request data stored about themselves and not about others, but the question is, how can you do this in a safe manner? Yeah, so that's the end of me talking. What I would like to do now: I have basically copied all of this over to an etherpad, and I think we can go through it and see if we have any additions right now, and if we can find some potential solutions. And then what I would love to see, in a few weeks maybe, is a page on MediaWiki.org called GDPR compliance, where we just address each article and say: this is how we comply, or how we intend to comply. There is one thing that might be useful information. I have to get my company ready for the GDPR as well, and I talked to a lot of people knowledgeable about the GDPR. Most of them say: if you are in danger of being fined or being subject to a lawsuit, and you can prove that you actually started a process of implementing the regulations, that helps a lot. So at least for organizations, it will be the case that on May 25th they will not start a scan and fine everyone 20 million euros. But they want to see that you are in the process of getting there. So that's another reason why I think the GDPR compliance page should be on MediaWiki.org. Would you or someone else be able to give a summary of where the Foundation is at with their installation and how they intend to comply? Is that something that anyone knows?
No, just wait a second. I just want to remind you that this is publicly recorded, so be careful: it goes out on the internet. Okay, yep. Okay. Yeah, you wanted to ask a question? No? Okay. So in the presentation, which is also linked in the Phabricator task and in the etherpad, I put a few links, most notably of course the text of the law itself, so we can look it up for reference. Okay, so check out this etherpad. As I am well prepared, I hope, I opened it already. I guess the best thing we can do in this session is to try and do some kind of collection of where we are and what we can do. So unless there are more questions... sorry. Yes? Okay. So that's a copyright thing, basically. Yeah, okay. I have to rely on people who know more about that than I do. It is, yes. What you talked about then starts to get into what is copyrightable. Yeah. Which can vary a lot by jurisdiction, but in some places, if it's just facts about the user, that may not be copyrightable. In a similar sense, if you're writing a phone book, in many jurisdictions, but not all, the list of numbers in the phone book isn't copyrightable, but the presentation is. I don't think it's copyright. If you're going to argue with some other right, though, then it's... If a user requires us to anonymize all the data related to him, everything he has published is no longer attached to his name. Oh, I was not thinking about the content they had created. Yeah, that's an issue. Yeah, there's the attribution issue, which is a whole other thing we haven't talked about. Yeah, but I... So just going by common sense, if a user requests the removal of his name from Wikipedia and at the same time requests attribution, then the user is probably contradicting himself. No, that's because there may be... In France you cannot renounce attribution. It's part of copyright that you cannot renounce attribution. Okay, that might be a legal loophole for any media, published media, stuff.
But again, that is up to lawyers to decide, I think. And I don't know about the situation in Germany, for example. Yeah, okay, so what else? Any more questions, any need for clarification? Okay, so let's assume... Let's continue under the assumption that we will be required to remove personal data and cannot play this one card. So, regarding the kind of data: do you think that list is complete? The data we are collecting, actually. So username, real name, email address, language, IP address, as personal, directly related personal data. Yeah, I see, the extensions: that is probably also a very vast range of things. I was thinking about that extension problem before I did this talk. I think we need to split it up, basically, and say: okay, this is MediaWiki core, and then see what the situation for extensions is in a second step, because an extension can do almost anything. I know we have the SocialProfile extension, which stores just about any data. We have extensions that store data in wiki pages, so with revisions, we have extensions that don't store revisions, and more. Okay, yeah, web server access logs. That is also a good point. I think it's not in the scope of MediaWiki, of this GDPR compliance page. It should be mentioned, because people should not forget about it, but that just applies to any website in, yeah, okay. How about special pages? Except for logs and recent changes, do you think there's other... there's the list of most active users, for example, that might be some kind of information. But what happens when the user is deleted, given that you can't delete a user? If the user is merged into this big blob of anonymous users, will this information persist somehow? Probably not, okay. The list of active users would persist until the next time the script is run? I guess that's okay. There are some time periods, like 30 days; that's in the law, you can argue reasonable times. Well, I mean, that's...
There's a bug where the active users list doesn't ever get updated for certain configurations, but that user special page does. But as we work in a heavily developing environment, we assume that no bug is forever. One thing right now with images: for the current version of an image, it's impossible to delete which user uploaded that image, while for older versions it's possible to delete that. And generally, if you're looking at a page, you have to add a revision, like edit the page, because you can't delete the top user. That's something to consider in terms of this. You can't delete which user? Well, I guess if you're renaming the user... the user of the most recent revision of a page, you can't revision-delete that. If you delete the user, it will be merged into this special anonymous user. Yeah, I guess. Then it would be the anonymous user once it is updated. It'd be nicer, though, if instead of showing some random anonymous super-merged user, it showed that the user was deleted, like what you get when you use the hide user feature in Special:Block. Would you be renaming the user to anonymous plus some other generated identifier, from which it is possible to reconstruct the original user name? I don't know. So it's still pseudonymous. I think in some ways what this really needs is the hide user feature with a button that says: make this permanent. So you hide the user, what do you mean? You know how there's a hide user feature currently? Oh, no. It's disabled by default, but not really disabled. By default, no user group has permission to do it. But if you give a group the hideuser right, then on the block page there's an extra checkbox to hide this user on all pages. Hide is not an option. What we need is that, plus an option to make it not just hidden, but actually gone. Yeah. But I think that feature could be used as a stepping stone right now to see the things we need to do. Okay. That's cool.
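The question of whether a generated identifier lets someone reconstruct the original user name can be illustrated with a minimal sketch in Python. This is not how MediaWiki's rename, UserMerge or hide user features work; the placeholder scheme, the usernames and the function names are all hypothetical. The sketch contrasts a placeholder derived from the name with a random one.

```python
import hashlib
import secrets

PREFIX = "Anonymous-"


def hashed_placeholder(username):
    """Placeholder derived from the name: stable, but reversible by guessing."""
    digest = hashlib.sha256(username.encode("utf-8")).hexdigest()
    return PREFIX + digest[:12]


def random_placeholder():
    """Placeholder with no mathematical link to the original name."""
    return PREFIX + secrets.token_hex(6)


def reidentify(placeholder, known_usernames):
    """Try to recover the name by hashing every candidate (a dictionary attack)."""
    for name in known_usernames:
        if hashed_placeholder(name) == placeholder:
            return name
    return None


# Invented names, e.g. taken from old signatures or an earlier dump.
KNOWN = ["Alice", "ExampleUser", "Bob"]

derived = hashed_placeholder("ExampleUser")
unlinked = random_placeholder()

print(reidentify(derived, KNOWN))   # the derived placeholder is traced back
print(reidentify(unlinked, KNOWN))  # the random placeholder is not
```

Anyone holding a list of former usernames can trace the derived placeholder back to its owner, so it remains pseudonymous rather than anonymous. The random placeholder avoids that link only if no mapping table between old and new names is kept.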
Let's just put it in. Yeah. Great. I love Etherpad, because you don't have to do anything as a presenter. People in the audience will help you out. There was a question? Yeah. A lot of this seems to be based on the presumption that we are actually able to verify the person who's requesting a change. Is that actually something we can generally do? You've got the login and you've got some sort of authentication. But the person has chosen the user name themselves, so it could be someone else's user name that they request to change. How do we know that the person is actually the correct person? So... Yeah. Yeah. That's a whole new problem with IP addresses. I think that is a process question. If you run your MediaWiki and you allow pseudonymous or anonymous access, and then, as you say, a person using a pseudonym comes and says, please delete this user, it makes sense to make them prove that they actually hold that pseudonym. That is not a technical question, though. It can be a technical problem if you've never actually chosen to keep the information needed to verify it. If you then refuse the request, you are actually not in compliance. One way to do that with logged-in users, users who have an actual account, is to make them log into their account and edit their user page or something like that. With anonymous IP users, I have no idea. So... Yeah. I hope they have not lost their credentials and want to be deleted at the same time. The problem can be: I've changed my email provider and I can no longer access the email that gets sent. How can I rescue my account if I need to log in in order to verify who I am? We often have people in this scenario who end up creating a new account, and the old account is definitely lost.
If you verify it by means of the email address, what if they didn't set an email address and they can't rescue the account? How do you make sure there is something that can actually be used to verify an account, in order to let them make the change? So, I came across this discussion several times, and I think you have two conflicting interests here. I don't think there is a blueprint solution for this. For example, in order to verify that the person behind the pseudonym is still the same person, you might just say: we store the IP address as long as the pseudonym is there. That again leads to collecting more data instead of less. There is a clause that says you may store the data as long as it's necessary to run the operations and services in the way you want to provide them, with one exception: eternally is not an option here. And again, that is something that will only be solved in practice. I assume that in a few months there will be some trial cases, and then there will be some rules of thumb for this, also for takedown requests for pseudonyms, for example. But let's put it this way, from a common sense perspective: if you're not able to prove that you are the person behind a pseudonym, then identification is probably not possible. So by not being able to prove it, you have shown that there is no relation between the two. That's what I would argue. There's a provision in the GDPR that comments on whether it's reasonable to do it. In fact it says: shall take reasonable steps, including technical measures, taking account of available technology and the cost of implementation. That's directly from it. So there may be conditions where it's just not going to be possible to do it, and that's actually mentioned in here. Yeah, and we have this reasonable cost everywhere in this law. That's why I say the law is very vague, and I think a good first step is to just address it and say: okay, it's not reasonable. Even for Wikipedia, I
would say that if it takes a year to implement, then it's probably not reasonable and we have to find other ways to get around it. The same thing applies to MediaWiki. User deletion, for example, is not easy now: you could just go to the database and delete the user, but that's not a good idea. But there is a user merge extension, so that is a remedy, a workaround. You can say: okay, we don't delete that user, but we merge it into the anonymous pool user, and then we have probably got around this right to erasure, so we have satisfied the requirement. Just talking about future implementation: there has been a lot of discussion about the WHOIS database and whether or not they can just ignore all of this and be fine. It's probably a comparable case to large non-profit wikis, because it's a public good. They are a rich public good, which is a big difference; they could be considered to have a benefactor that could actually fund the changes that are necessary. So just saying, well, we're starting to implement changes, is probably not going to be enough; at least that seems to be the indication from the lawmakers. It's not just a question of whether on day one you get sufficient consideration; if you're not compliant on day one, the whole thing may not be acceptable. So that's also a political case. There's a bit more to it than just the WHOIS database, but it's worth using as a litmus test of whether or not the compliance is going to be acceptable. They were asking for 12 more months to comply, and what the regulators always say is: you had two years to comply. So, but now we have a situation where we have to deal with it, and I think starting a process is better than not starting a process. Sure. So that's what I think. Okay, so, responses to requests. I'll move on to the responses, because we were talking about the right to erasure anyway. Again, I think that is the hardest bit. If we go to data portability, then I think exporting the pages
could be a good thing. I don't know what you think, but it shouldn't be hard, for example for MediaWikis outside of Wikipedia, to write an export script that dumps the list of user contributions, compiles it and then exports those pages. And according to the law, I think you have 30 days to react to a request for all the data. That is a good question. Good question, I don't know. That's a contribution and we do track it; whether we need to provide files or file names, I have no idea. No, it says the data stored about a user should be provided in a commonly used format. So I'm thinking JSON, XML. I'm a developer, so you don't want plain text. Does it say that you have to give them only information about their user, or just a dump of these pages? Yeah, I was thinking about that, and yeah, yeah, so it's always the same, you know. I mean, what about if somebody else on a different account writes something about the user, or links to their user page? Exactly. We need to implement deleting content permanently on request, if it contains personal information, because right now there is no way to permanently delete an article. The revision text is going to remain on the servers, and the archived revision log is going to remain on the servers. I think it also applies to file uploads. And that's the point where it gets really messy, because you can't just retrospectively edit the history; that contradicts the whole philosophy of a wiki. And yeah, so I think we, wiki users in general, not only MediaWiki, need to arrive at a position on that which is legal as well as feasible. I don't have a solution for that, but I think we're getting to a point where it's only safe to delete all the content of the wiki, because you never know what kind of traces someone left somewhere, or what some person wrote about another person somewhere, and you just can't figure that out. I think on that part the onus is probably on the user to provide the infringing URLs;
for example, Google has an RTBF form, and it requires the user to provide a list of URLs which contain their personal information, and these are then deleted. So probably it's a similar judgment: we automatically delete the user pages, for example, but then the onus would be on the user to point out, for example, that this guy put my name on their user page. I think that's a defensible position, maybe. But still, I mean, reverse editing history is not, I don't think it's an acceptable option. I know, I mean, we're not negotiating with the GDPR, but maybe... Yeah, I think that is also kind of a gray area, because it can't be easily retrieved: you cannot find old revisions of pages in a search, for example. So I think this would all become a gray area. But again, I'm not a lawyer; I can't say this is what it is. The dump formats in general would actually be considered to be a commonly used format. You could rely on the precedent of the GFDL license: it was actually a requirement of that license that the content be distributed in a commonly acceptable format. That was done with the dumps and the text, neither of which were really common for much of the time, but they were considered to be enough because people could actually use them and interrogate them themselves. We actually do that here, with this being one extra bit, the more visible part. Yeah, no, so we wouldn't want to try and add extra stuff to the dumps, but we could use the dumps as a way to say: all the information is actually there, there are tools to interrogate it and identify which is yours, so therefore it's not being hidden. It's just that the tools would not be the most commonly used tools; they would be tools that we could write. And what did you say was the case where that was used? The GPL? The GFDL, where that was a requirement. It's also in the GPL: any of the newer licenses require that you provide the tools that are necessary in order to compile it or make it usable. So if you don't provide those tools, it is considered to be a breach of license, and if
anyone just puts something out and says, here is the source code, and they forget to include the makefile, that would be a breach of the GPL. Okay, so we are getting close to the end of this session. Before we continue discussing until time is out, I want to ask you what you think is a good way to proceed. I assume that if I now created a page called GDPR compliance on MediaWiki.org and just dumped this there, it would be taken down 10 seconds afterwards, because it might have legal implications and whatnot. So I kind of think we should continue working on this, maybe in the etherpad, until we have a version which we think we can safely present. But that's just a suggestion, if someone has a better idea. Just looking at it from this angle: what is the minimal installation that is actually compliant? What options do you need to have turned off in order to be compliant? Then you can actually publish one way, not the way, of having it set up that is actually compliant. For example, you must have authentication of email addresses, rule out all the dodgy cases, and all of that. Or, if you want to have anonymous users contributing, here is how you would go about being compliant. So we'd have an initial paper on how it can be set up, ignoring what people might have as their current setup and how they might transition. This is something that will work in one setup, and we can fairly safely say that this is a setup that is not going to cause problems. There may not be many people who can actually switch to that very quickly, but at least it's a study of how this is going to work. Sure, that's a good approach. But again, if we start creating this article publicly on MediaWiki.org, it has very high visibility, because a lot of people will look for it. So maybe it's a good idea to create a GDPR page and then go to the discussion page and work on this there, or use the etherpad, I don't know, until we have a good feeling about it. We can also use both. Good. So I think, as
I said, this is more the kickoff of a process than the end. For anyone who is interested, please contribute to that page. If you want, you can leave your email address or whatever, and I will put it somewhere publicly and erase it on request. I think the etherpad is good for now. Maybe it's a good idea to put some way to contact you on the etherpad, so we can form some kind of informal working group and assess when we think the page has the quality to go on MediaWiki.org. Okay, thanks for your contributions and for all your discussion, and on to the next person to talk. Thank you.