Good afternoon everyone, and welcome to today's ODI Friday lunchtime lecture. My name's Ellen and I'm the head of policy here at the Open Data Institute. Just some really quick housekeeping before we begin. There is a hashtag for anyone following this on Twitter: it's #ODIFridays. This is also live-streamed and recorded, so bear that in mind during the questions phase, because the questions you ask will be captured in the recording; we just want to make that clear before we start. We host a range of lunchtime lectures over the course of the year, almost every Friday, each on a different subject, and I'm personally really excited about today's lecture from Woodrow about open data: what does it mean, and what does it mean in the context of privacy? This is something we think about a lot at the Open Data Institute, because the tension between openness and privacy is one we encounter almost on a weekly basis, this sense that they're binary opposites and never the twain shall meet. So I'm really interested to see what Woodrow works through in terms of the spectrum between open and private as they're commonly described. I think that's it for housekeeping; we'll all have an opportunity to ask questions at the end. I hope you enjoy it.

Thank you very much. It's a pleasure to be here today, and thank you all for coming. I realised as I prepared these slides that it may have exhibited a great deal of hubris to come talk to the Open Data Institute about what open data is, because this is something you think about all day long. So I want to start off by saying that today I don't want to presume to tell you what open data is; rather,
I'd like to pose a question about possible alternative interpretations, or maybe just about solidifying a definition of what open data is and what it means with respect to privacy. And for that matter, I'm going to rely on you all to inform me during the Q&A about what I got wrong, and whether I'm even remotely close in what I'd like to suggest today.

I want to talk about a friction that is probably all too familiar to everyone at ODI and to anyone who works with data and open data initiatives, and that is the friction between privacy and utility in data sets. The conventional wisdom, which I run into in a lot of the talks I give on this, is that the more privacy people have in a data set, the more protected it is, the less useful the data is; the two are almost inversely proportional. You might as well give up on having any real utility if you want privacy, and vice versa. I want to apply this general debate to the topic of anonymization, which is what I've been working on for the past couple of years, and I want to make the proposition that whether these two values conflict, which I think is questionable, though there may be some truth to it, depends upon how we define the notion of privacy and how we define the notion of utility. I want to talk about that today with particular respect to anonymization.

So I want to make three major points, and I'm going to spend the rest of my talk running down them. One: the failure of anonymization has pitted privacy against open data, I think probably needlessly. Two: this is only a problem if open data is thought of in absolute terms, which is where we'll get to when I start asking what open data is anyway.
And the final, normative point I want to make is that definitions of open data should accommodate some kind of moderate risk-mitigation obligations, in order to make the concept of open data more sustainable and appealing to policymakers, regulators, and the general public.

Okay, so what do I mean by all this? A few years ago I started researching the notion of anonymization. For years, a long time ago now I guess, it was widely believed that as long as data sets were anonymized, they didn't pose a risk to anyone's privacy: as long as they didn't reveal the identity of the data subjects, these data sets were safe to use; you could trade them, give them away, and do the things we value in open data. Anonymized data sets were good for that. Then, starting around ten years ago, maybe a little more, the protection offered by anonymization seemingly started to crumble. There were data scientists and security researchers who were able to demonstrate what we call re-identification attacks: they could take a data set and, through some clever reverse engineering, identify with a certain degree of confidence a percentage of the people who were supposedly anonymized. In 2006, America Online famously published a sample of its search queries, and although AOL replaced the screen names with random numbers in the published search logs, this minimal step did not protect its users, and people were able to re-identify individuals in the supposedly anonymized data. The New York Times discovered and revealed the identity of a 62-year-old AOL customer in the data set, Thelma Arnold, and was able to publish her search queries. You can imagine that if someone is able to re-identify your search queries, that can reveal a great deal about you. There have been lots of other high-profile
re-identification attacks made public over the past few years. And so this failure of anonymization, this seemingly bulletproof approach, is now crumbling, and everyone says, "whoa, anonymization doesn't work anymore; that means we have to expand our definition of personal information, because things that are supposedly de-identified might be identifiable". That's the terminology we use. This failure of anonymization, at least from my perspective in the United States, really took hold in the minds of regulators. The regulators read these studies, and the papers that came out were incredibly influential, and the regulators said: wait a second, just because something is anonymized doesn't mean it's protected. It gave people great hesitation, regulators and the general public alike. I know something has hit the zeitgeist when my mom calls me and says, "you know, all those taxi cab drivers were able to be identified". When my mom is telling me that anonymization has failed, I truly know it's got some real problems.

The biggest culprit in anonymization's failure is what a scholar named Paul Ohm calls "release and forget" data sets. These are data sets that were de-identified in some way; he uses the term "scrubbed". You take someone's name and replace it with an identifier; you delete the address and keep only the city, or maybe the ZIP code in the United States. There are ways you can scrub data like this, and then you release it out into the world as anonymized. This is what he calls release and forget: you assume the protection is probably good enough. Then the researchers get hold of it and show that it's not good enough. Almost every major re-identification attack that's been demonstrated, usually as a sort
of proof of concept by data researchers, has been against release-and-forget style anonymization. And it is this release-and-forget anonymization that I think has done a lot of the framing work in pitting privacy against data utility. Because in order to make a release-and-forget data set truly anonymized, meaning the risk is sufficiently low that we feel confident no one is ever going to re-identify it, you've got to scrub that data pretty hard. You've got to strip it of a lot of nuanced, fine-grained detail, and to anyone who works in data science that's a bad thing, because you're losing a lot of really valuable information. So I think release-and-forget data is one of the real birthplaces of this framing where we say you can have useful data or you can have privacy-protective data, but you can't have both. And I think release-and-forget data is a threat to the concept of open data, because if people start to equate release-and-forget data with open data, then open data is going to seem risky from a privacy perspective, because we know that release-and-forget anonymization is not that good, right?
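The scrubbing just described, and the kind of linkage that defeats it, can be sketched in a few lines. This is a toy illustration with invented records, not any real attack: the name is swapped for a random number and the address generalized to a city, yet the surviving quasi-identifiers still join cleanly against public auxiliary data.

```python
import uuid

def scrub(record):
    """Naive 'release and forget' scrubbing: replace the name with a random
    identifier and generalize the address to just the city. Note that the
    quasi-identifiers (city, birth year) survive untouched."""
    return {"id": uuid.uuid4().hex,                           # random number replaces the name
            "city": record["address"].split(",")[-1].strip(), # keep the city only
            "birth_year": record["birth_year"],
            "query": record["query"]}

def link(anon_rows, aux_rows):
    """Toy re-identification: join the release to auxiliary data (a phone
    book, a voter roll) on the shared quasi-identifiers."""
    return [(a["id"], b["name"])
            for a in anon_rows for b in aux_rows
            if (a["city"], a["birth_year"]) == (b["city"], b["birth_year"])]

release = [scrub({"name": "Jane Doe", "address": "12 Oak St, Lilburn",
                  "birth_year": 1944, "query": "landscapers near me"})]
auxiliary = [{"name": "Jane Doe", "city": "Lilburn", "birth_year": 1944}]
matches = link(release, auxiliary)   # the scrubbed row is re-identified
```

The point of the sketch is that "scrubbed" and "safe" are not the same thing: nothing in the released rows is a direct identifier, but the join needs nothing more than a city and a birth year.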
There are ways to anonymize that are really good, that lower risk, but largely not through release and forget. So I want to talk about this framing. I think we can avoid the problem of equating release-and-forget data with openness, because we only run into this adversarial framing if we define openness as complete freedom. When I say complete freedom, I mean an absolute position that imposes no legal obligation and virtually no transaction cost on using data. This framing of openness is something I think we need to care more about, which leads us to a relatively interesting question: what is open data in the first place? For the past three or four months I've been digging around trying to find definitions of open data, because open data, like privacy, like anonymization, like a lot of terms we use in public spheres, is one of these terms we intuitively think we know the meaning of, but when you actually start to drill down you get a lot of different definitions. On day one of my privacy course (I teach information privacy law at a law school in the United States) I have all of my students give me their definition of privacy, and as you can imagine, I have 12 students and I get 12 completely different answers. It turns out that's the case for a lot of terms, and when I started digging into open data I found several different conceptualizations. So I want to propose today that it's time we start thinking really hard about finding precise definitions of what we mean when we say "open data". This matters, maybe not for everything, but it does matter for regulators and it matters for the general public. And the point I want to make here is that framing matters.
We know that the way something is presented, the way an issue or an event is framed, can produce significant changes of opinion. This isn't just a theory; it has been shown empirically in a lot of disciplines. Psychologists have done a lot of work on framing. We know that people are more willing to tolerate rallies by controversial hate groups in the United States when the rallies are framed as free speech issues than when they are framed as disruptions of public order. They can be both: something that disrupts the public order can be a free speech issue. But if you frame a hate speech rally as a free speech issue, people are much more willing to tolerate it, based almost entirely on the framing. One of the best examples of framing with respect to privacy is the nothing-to-hide argument. Has anyone here heard the nothing-to-hide argument? A bit of audience participation: who has heard it? That's right. Surveillance is often justified with the nothing-to-hide argument: why do you care that someone's spying on you? Do you have something to hide? And this argument only works when you frame privacy as the hiding of things, in terms of secrecy.
If privacy is hiding, is secrecy, then maybe the nothing-to-hide argument works. But of course I would argue that secrecy is only one very small part of privacy, and when you conceptualize privacy in different ways (privacy as trust, privacy as dignity, privacy as autonomy, privacy as control, which is the dominant paradigm in lots of different privacy regimes), then the nothing-to-hide argument crumbles completely, based entirely on the framing. So framing matters, and the way we talk about privacy and the way we talk about data matters, in terms of policy and in terms of norms.

So I started digging into a lot of the different definitions, and part of what I want to do today is take a tour of what I found. I would love to know if I missed any major definitions, because I'm compiling this for a research project. Obviously I started with the ODI, and I think the ODI probably has one of the most nuanced conceptions of open data I've seen, in terms of creating the Data Spectrum, which I found incredibly useful. As far as an actual definition goes, I dug around a little and couldn't find a very precise one, other than "open data is data that anyone can access, use or share". Simple as that. Even within that are embedded certain assumptions, which I want to unpack a little later. But note that if you look at something like a licence that limits use, that counts as less than open on this spectrum. I want to put a pin in that and just note that "open" here is framed in almost absolute terms.
The gold standard of data is framed in absolute terms, and the more restrictions there are, the more protective it is, the less open it becomes, heading closer towards the closed end of the spectrum. I want to revisit that, along with some of the other definitions I found.

Open Knowledge has an interesting definition. They say a piece of content or data is open if you are free to use, reuse and redistribute it, subject only, at most, to the requirement to attribute and share alike. This is an interesting conceptualization; note that no privacy protections are mentioned in this definition of openness. Privacy doesn't pop up anywhere, at least not explicitly. But there is a restriction placed on open data which I thought was really interesting: "subject at most to the requirement to attribute and share alike". That implies the existence of some sort of agreement. I want to put a pin in that too, and we'll come back to it.

Melanie Chernoff, the public policy manager for Red Hat, said that generally, open data means data that should be released in a format that is free of royalties and other IP restrictions, and she draws a distinction: the mere fact that something is available does not mean that it's open. This definition was primarily concerned with financial transaction costs:
whether it actually costs money to access the data, not with data protection responsibilities.

Another definition I found was the Open Definition, which says open means anyone can freely access, use, modify, and share for any purpose, subject, at most (here's another caveat), to requirements that preserve provenance and openness. Put most succinctly: open data and content can be freely used, modified, and shared by anyone for any purpose. Again, there's no real mention of privacy or any kind of data protection in that definition; it's more of a property mindset: is there some sort of propertization that allows someone to charge for or license use?

The Open Data Handbook provides that open data is data that can be freely used, reused and redistributed by anyone. You'll notice a pattern emerging in these definitions: "subject only, at most, to the requirement to attribute and share alike", another property-like requirement built into the definition of open data. The full Open Definition gives precise details as to what this means, and I'll briefly summarize the most important parts. There are really three tenets of open data within the Open Data Handbook. Availability and access is the first: the data must be available as a whole, at no more than a reasonable reproduction cost, preferably by downloading over the internet. That's how we access the data. The data must also be available in a convenient and modifiable form.
That's the format of the data; I think what we're getting at there is that machine-readable is better than, say, a scanned PDF that OCR won't work on. Reuse and redistribution is the second tenet: the data must be provided under terms that permit reuse and redistribution, including intermixing with other data sets. So: what can I do with this data, beyond just the format? And universal participation is the final tenet: everyone must be able to use, reuse and redistribute, and there should be no discrimination against fields of endeavour or against persons or groups. This is a no-wrongful-discrimination clause, and it makes sure that everyone can use the data. Note again that open data can include some restrictions, notably attribution and share-alike, which is important to my later point.

The Open Data Barometer is another place I found a definition. They say open data is data which is freely available and shareable online without charge. Again, note that this is a property-based mindset with regard to open data: they're concerned about making it free, and when I say free I mean free of financial cost. That's really what this definition of open data is focusing on.

And then there are two scholars at an organization now called Upturn, Harlan Yu and David Robinson. They say "open" has really two different components. There's a technological component to openness, which suggests that computers can handle the information efficiently.
This deals with the format of data: we need it to be computer-processable in order to really be open. And then they say there's a philosophical component to open data, one that allows all the people who might benefit from the information to share and reuse it in a democratized and accessible way. Really, what the philosophical component gets at is the absence of legal barriers to the use of data.

Another definition of open data I found is from NYU's GovLab. Joel Gurin, a senior advisor to the Governance Lab at NYU and former editor and executive vice president of Consumers Union, defined open data as intentionally released government data, intentionally released for public use: in other words, accessible public data that people, companies and organizations can use to launch new ventures, analyze patterns and trends, and make data-driven decisions. This definition targets an intentional release: if you intentionally release it for public use, then it's open. I don't know if that captures everything we've seen in other definitions of public data, but it's a different definition that I found.

So what can we get from all this? What is open data, anyway? I tried to distill a common set of concerns, or attributes, of open data, and what I got were really some questions. First, who can access it: anyone who might benefit from the data? What kind of licensing or authentication credentials are there? Is there any discrimination between people or groups? The less discrimination, the more open it is; the more people who can use it, the more open it is, and of course vice versa. Cost of access was the second thing I noticed, and we saw this come up in a lot of the definitions I just went through: is it financially free? Does anyone have to pay for it?
Obviously, the freer it is, the more open it is. Third, the format of the data: is it machine readable? Is it otherwise processable? Can you run OCR on it, for example? Fourth, permitted uses: are you allowed to edit the data any way you want? Are you allowed to ask anything of the data, or are there certain questions you're not allowed to ask of it? Can you commercially exploit the data? If there are restrictions on commercial exploitation, some people might say that's not open data. Can you combine it with other data sets, or must you keep it siloed off rather than combining it to get more powerful insights? Fifth, permitted disclosures: who can you give it to, who can you share it with? Can you share it with anyone? Can you share it with anyone who agrees? And this is incredibly important to where I'm headed. You'll notice that some of the requirements allowed under the definitions of open data were share-alike licences. What's a share-alike licence? Exactly: it forms a sort of chain, a protective chain, and often it's combined with attribution, or with non-commercial terms. There are lots of variants, but this is part of what we say is possibly compatible with open data: "subject, at most," to a share-alike licence, or attribution, or non-commercial terms. The share-alike licence creates a chain: you have to pass the data on under the same terms you took it on, and as long as you agree to that, you can have the data. What all of this boils down to, in my mind, are concerns about transaction costs (how difficult is it to access the data, to use it, to share it?) and legal prohibitions (am I allowed to do this with the data, or am I prohibited from doing it?). And I
welcome any feedback on that. Okay, so that was my second point: what is open data? I think those are the factors we can derive, and some combination of those factors is the universe of open data we're talking about.

The third point I want to make is that in order for open data to be sustainable, it has to be seen as safe. If the concept of open data becomes synonymous, in the eyes of regulators and the general public, with "risky", that jeopardizes the movement's flourishing. People say, "well, open data, I don't know, people can re-identify anything these days", and that is a hindrance to the concept. So one of the things I want to argue against is an absolute framing of open data, in order to preserve its sustainability as a concept. I want to make the argument that some restrictions that protect privacy can preserve utility and still live within the spirit of the major factors of open data we've seen. When you categorize protected-but-still-useful data as somehow less than open (and I want to go back to the ODI Data Spectrum slide here; I don't mean to give everyone whiplash), when you categorize this sort of data as something less than open, less than the gold standard, then you're implying that the only kind of data fit for the gold standard, truly open data, might be data that leaves people at risk. That's where privacy-versus-utility comes into play. So I want to propose that this needlessly pits two concepts, privacy and openness, against each other, particularly when I think a lot of our concerns about what is "really open" are financially based (whether you charge money or not) and are not necessarily about placing privacy obligations on use. So everyone close your eyes while I swing you back through the slides; I apologize.
Okay. So I want to argue that data subject to minimum obligations to protect it can still be thought of as open, particularly obligations aimed at things like prohibitions on re-identification, because that's what a lot of people are concerned about with anonymization. A lot of the talk we have in the data security community (I do a lot of data security research) is about adversaries: people who want to get data and exploit it, and a lot of that adversarial use involves re-identification. These re-identification attacks stir everyone's imagination. When that AOL story was published (and Netflix did this too; did everyone read about the Netflix study, where Netflix published viewing data online and some computer scientists got hold of it and said, "we can tell whose viewing data this is"?), it immediately struck fear into people's hearts that someone is going to take a data set, your Google profile or your Netflix profile, and figure out who you actually are. Well, that's hard, because a lot of this has to do with threat modeling: what do we choose as our threat model, and how do we define our data? I think there's a way for us to accommodate lots of different possible threats in a nuanced way, rather than pre-defining what any one particular danger is, which I'll hopefully get to in a second. And so I want to make a point that may or may not be controversial, but I'm going to say it anyway.
I think that defining open data in a safe and sustainable way means distancing open data from release-and-forget data sets. That means we might have to consider some kind of obligation, perhaps through a Creative Commons-like contractual obligation, at a minimum to do things like not re-identify, and data under such an obligation can still be truly open data, can still be the gold standard, which maybe requires tweaking our definition of what open data is a little. Before we get into the debate about whether we need to distance open data from, or at least somehow distinguish it from, release-and-forget data sets, I'd like (and I really mean this, particularly for the Q&A session) to dig into this a little. Let's ask ourselves whether release-and-forget data sets are worth defending, and if they are, I would love to hear the reasons why. We sign licensing agreements for literally almost every single internet transaction we make: you log on to any website, the cookie notice pops up, and we agree: "I agree to allow the use of cookies". We sign a licensing agreement for virtually everything we do online. And I wonder if we could have some basic terms that don't decrease the utility of data. In talking with data scientists, requirements not to try to re-identify the data seem to be relatively uncontroversial. I sat on my university's IRB, for example, and most people say: we don't want to re-identify the data, we just want to glean some insights from it. Re-identification is seen as separate from utility, because most people who get data sets don't want to re-identify anyone who has been anonymized. So maybe basic requirements not to re-identify can be part of the definition of open data. The law already acts as a backstop on data use for every single data subject anyway.
So truly pure, libertarian open data doesn't exist anyway, because we're all subject to the provisions of the GDPR, the Data Protection Directive, or whatever privacy regime applies. We can never achieve pure open data nirvana, and I don't think we would even want it if we had it, because we have concerns like privacy. So I want to throw out there that maybe release-and-forget data sets aren't worth defending, and if that's the case, then we can safely redefine open data to incorporate minimal privacy and data protections. I say that can fit within an open data regime.

Okay. In order for this to work (and this is what a lot of my research prior to today has been on), we also have to change the way we think about anonymization: we have to reshape notions of anonymization to mean low risk. One of the reasons people were shocked by the failure of anonymization is that they thought it was foolproof, that it was a guarantee. The re-identification attacks that occur only show that, with a certain piece of auxiliary information, you are able to re-identify a very small percentage of people. But (and this is why framing matters) a lot of the re-identification attacks are almost completely guaranteed never to happen in the real world, because the likelihood of them happening is so low. You have to have this certain piece of information, combined with this certain kind of expertise, working in this certain kind of way, and only against this one particular data set. The odds of all of that coming together in an adversary who's got the motivation to re-identify are very low. A lot of hackers don't care which database they break into, as long as they can break into one; so if a data set is reasonably well de-identified, they'll just move on to a different one.
Or it's not financially valuable to re-identify, so they're not going to waste a lot of effort. This is part of threat modeling. But that's not what we read when we log on to the Guardian and see that AOL released search queries and people could figure out who you were, because then everyone has a heart attack and says, "no, don't release any data anymore". Instead, we have to rethink anonymization in process terms. What I've argued is that anonymization should be thought of as the process of minimizing risk, just like data security. The law of data security doesn't guarantee that information is safe, because we never can: if someone really wants to hack you, you are going to get hacked; that's almost a guarantee. There's almost no way to completely prevent it. So what do we do instead? We say there are certain steps you have to follow to lower the risk of the data being compromised: you have to implement technical safeguards, procedural safeguards, and administrative safeguards, and if you do those things, you're in compliance with the law even if you get hacked. That's the way data security law works. We have to start thinking of anonymization in the same way: you have to follow certain steps, and if you follow those steps, with good threat modeling, we know what lowers the risk of re-identification, and if the risk of re-identification is lowered enough through those steps, that should be good enough. Will some people be re-identified? Possibly.
In the same way, some people will be hacked despite good data security; hacks happen all the time. But as a matter of policy, we have to start thinking in terms of risk. Risk management is the name of the game, in my opinion, both for anonymization and for data security. It's not going to solve all problems, but it can solve some. One of the ways we can mitigate this risk is through share-alike contracts. Is that going to solve everything? No; some people might violate their share-alike agreement. But it does put the normative heft of the law behind things. Most people, when they get data sets, want to follow the law; most people and most organizations comply with the law. So if they know they're under an obligation not to re-identify, that should serve as some sort of mitigating factor. And these sorts of protections (obligations not to re-identify, maybe some basic process-based steps that recipients of open data have to go through, maybe just basic data security: don't store the data in unencrypted form, that's pretty simple) should not make data less than open. We should be able to say "this is open data" even though there are minimal requirements to protect privacy. And I think we can do that in a way that still preserves the main factors I went through, those five: we can still keep it free, we can still make it non-discriminatory (anyone who agrees to the basic requirements gets the open data), and I think that should be seen as permissive. So, all of that to say: I think that if we rethink the way we define open data so that it protects privacy in at least a minimal way, and if we respect anonymization as a risk-management technique, then we can make the concept sustainable for the long term.
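The risk-management framing just described can be put in toy arithmetic: treat a successful re-identification as needing several threat-model factors to line up, and treat a release as acceptable when documented safeguards push the modeled risk below a threshold. The factor names, numbers, and threshold here are invented for illustration, not taken from any standard.

```python
def reidentification_risk(p_attempt, p_has_auxiliary, p_success):
    """Rough threat-model arithmetic: a successful re-identification needs
    a motivated adversary, the right auxiliary data, and a working attack,
    so the combined likelihood is the product of the three factors."""
    return p_attempt * p_has_auxiliary * p_success

def release_ok(risk, threshold=0.01):
    # Process-based compliance: safeguards that push the modeled risk
    # below the threshold are "good enough", even though risk is never zero.
    return risk <= threshold

# Illustrative numbers only: a 10% chance anyone tries, a 20% chance they
# hold the auxiliary data, a 30% chance the attack works if they do.
risk = reidentification_risk(p_attempt=0.1, p_has_auxiliary=0.2, p_success=0.3)
```

The design point is the shape of `release_ok`, not the numbers: like data security law, it asks whether the steps taken lower the risk enough, rather than whether re-identification is impossible.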
So, thank you very much. I'm happy to go into Q&A.

Thank you very much, Woodrow. I know that I'm thinking lots of things as a result. I can see someone already has their hand up, so did you want to start off? Can you introduce yourself?

Hi, my name is Carol; I'm one of the ODI associates here. I'm thinking about legal definitions, and that's a very interesting conversation to have. My question is, you know, where a perfect definition of open data should find its place. That's one thing, but another concern that I have is the culture of open data organizations. There is an organization called the Responsible Data Forum that deals with open data and responsible data policies, and my guess is that in the near future every organization dealing with open data will have a responsible data policy, and there's probably a connection between the culture of our community and the legal follow-up that follows, right? So my question is: have you been doing any thinking about the culture of the community, and whether there should be a shift in the perception of what open is, and whether "open by default" is really the idea worth defending, in your words?

I think that's a great question. I haven't thought about the culture of the open data community, but it's one that I think is going to be important. Because one of the things I would argue for in thinking about policy, so if a policymaker comes and says, okay, genius, what should we make open data be defined as, one of the reasons I would encourage redefining open data and thinking hard about it is that one of the things I would recommend is looking to industry standards, or community standards. Data security law does this relatively well, and what I'm arguing for is largely modeling data protection law and anonymization on data security law. Data security law looks to industry standards; it looks to things like the NIST framework.
It looks to the ISO frameworks. The open data community has an opportunity to get ahead of the game right now, to define open data in a sustainable way and provide a framework for regulators to look to. Because regulators, frankly, at least in my experience, prefer it if you have a framework already created where they can say: just do what everyone else is doing, just follow the industry standard that we think is safe and protective. Then they'll just say reasonable anonymization includes following the open data protective framework created by, what was it, the responsible open data policies? Maybe that's a framework that could actually be implemented in policy, and in that case the culture of the open data community matters very specifically to data protection policy.

Thanks very much. Thanks very much for the talk. Kieron O'Hara, University of Southampton. I agree with every point you make, absolutely right, and release-and-forget is not sustainable. One issue which I don't know whether you've taken into account is cost. I remember when open data was becoming more of a norm, around about 2010, 2011, there was a lot of effort and bureaucratic tug-of-war about developing an Open Government Licence, and that has come, in a way, to define openness, at least in the UK context; it may be different in the States. The definition of that Open Government Licence took a lot of doing and a lot of head-scratching. And the implication, I think, of your proposal, which I'm very, very sympathetic to, is that actually, for government data, every single department is going to have to think of its own license, based on the policies of the data and not on the policies of the transaction, which is what the OGL defines. So there's a cost associated with this, a bureaucratic cost, that in the government context may not be sustainable. What do you think about that?
This is where I would want to talk to a lot of experts to see whether it would work or not. Because, not to sound like a broken record, but data security law works in a similar way here, in that there is no one right way to do data security. Now, this frustrates a lot of companies and a lot of regulators, who frankly would rather have a lockstep approach to securing data. So I work with a lot of companies that say, okay, give me steps one through ten on how to comply with data security law, and you simply can't answer it. You can have your checklist, a low-cost checklist to protect data, or you can have good data security, but you can't have both, right? And it may simply be a matter of coming to terms with the fact that we need to identify good-enough standards, open-ended standards. This helps when you define something in a way where you say: just have reasonable data protection. That's a head-scratcher; it's infuriating for anyone actually implementing it. But it's something that works well as a legal standard, because then we say: just use your best judgment, and we defer to the judgment of each individual organization, as long as you made a good-faith attempt based on industry standards, which maybe the open data community could help provide. Maybe that's close. And maybe it's costly initially, but once you have industry standards to follow, maybe that lowers the cost over time. Or it may just be that it's costly if we want to do it right.

I've just been thinking this through in my head, because it's been really interesting listening. As soon as I saw you working through the definitions of open and open data, I thought: the reason a lot of those terms exist in the definitions is because, in the UK at least, open does have a very technical definition, which is that unless you have a license on your data that
explicitly allows people to do these things with it, companies, other public sector departments, and so on would be in breach of copyright and database rights laws by using that data. So part of the reason the language used in the definition is so property-rights-based is that your access to and use of data is governed both by property rights and database rights, and by the rights of the data subject, and then by your obligations as a controller and a processor under the data protection laws. And the thing that I get nervous about is when we try to put those together into one system. In the training that we provide, we frequently make exactly the point you are making about release-and-forget data: not only is it risky from a personal data perspective; also, if a department or an organization or a business is releasing data and then forgetting about it, it's probably not that useful anyway for someone to use and build services on. Why are you putting it out there? If you're not using it, how useful is it going to be for others? But what we help people understand in training is how your rights and obligations as a data controller and as a processor have to enable you to determine when it is safe to publish, because the rights of the data subject have primacy over rights in copyright and database law. But the thing that I get a bit worried about is when we say we need to change the definition, when I know that, in no small way, in the UK that definition of open comes from open access and open source, which is very explicitly about licensing. And I'm like, how do we reconcile those?
Right, how do we reconcile them? Because you're right, it is so property-based, and it's difficult to tack that on. Maybe it's worth thinking about adding a separate and distinct element that doesn't muddle the current existing definitions of open data but adds to them. Sustainability is a word that I've been using, and sustainability sort of implies protective, not risky, or at least not overly risky; everything is risky, walking out of your house in the morning is risky. So maybe there's a secondary component that can be put into it. A lot of what I've wanted to say, and maybe the point I didn't make too clearly, is that we have a lot of this responsible data use already embedded, but we're not putting it in the definition of open data. And I think that's a missed opportunity, because a lot of open data principles are protective and they are sustainable. So why don't we put that into the framing of it? Framing matters; the way we talk about data matters. And if open data is seen as sustainable and distinct from release-and-forget, then we don't even have to fight the anonymization battle with respect to open data, because protection is built into the framework. Some of the research I've done is on genomic data, and if you look at some of the data that comes from genomic projects, to download it there is a basic license you have to agree to. And if there's just that sort of licensing already, I don't think it detracts in a large way from the utility of the data to have a very basic no-re-identification term, or some sort of basic, non-discriminatory authentication. I don't know if that's redundant, but you can have authentication that applies to everyone that says yes, not based on a particular person.

Yeah.

I'm from the Polytechnic University of Turin.
I have three questions. The first one: you point out the chance to use a sort of contractual limitation regarding re-identification. That is a well-known strategy. The problem, and the question, is how you can enforce this kind of condition, because if you spread a lot of data sets around, it's not so easy to know if someone has broken it. The second question, a problem that you should also consider, I think, is to see the open data set not as an autonomous silo, but to consider the effect of the use and fusion of different data sets. In many cases we have a sort of involuntary re-identification: each single silo is quite well anonymized, but when we merge different silos, involuntarily, in many cases, we expose enough information to make re-identification possible. And the third point, I think, is related to the previous point. The key element is not only the anonymization or de-identification of open data, because in many open data projects I was involved in, the main problem is how you can anonymize data while maintaining enough value in the information. This is another point that should be considered, because if we talk in theoretical terms, we talk about anonymization and open data, and they can coexist. But the main problem is that when we talk with the people in the administrations that want to put out the data, they say: if I increase the level of anonymization, I lose a lot of potentially useful information for users. So this is another point we should consider in this balance, concretely, in the specific cases in which we have to apply it, right?

So I'll start with your third point and then work backwards, if I can. When we talk about anonymizing data losing value, what we're really talking about is actually changing the data sets, right?
What we mean is that we're taking out valuable pieces, some particular marker that lets you get at someone. But that's only one way to protect data sets, and there are lots of others. There's a whole field of study, statistical disclosure limitation, that involves several different techniques that don't necessarily involve changing the data itself. One of those is the trusted researcher model. Another is differential privacy, for example, where you access the database through queries, and a certain amount of noise is injected to make sure you can't tell whether any one person is in the data set or not. A lot of my research on anonymization has been an attempt to take a holistic approach and to think of ways in which we can pair anonymization with protections that don't degrade utility, because the trusted researcher model doesn't necessarily mean anonymization; it just means the data subject trusts whoever is going through the data. And trust is an incredibly important point. It is often really undervalued in our privacy policy right now, because we leverage everything on control, which is a whole different conversation that I'm happy to have later. But I think we should invest in trust more. Trusted researcher models can still be compatible, I think, with open data, as long as you're willing to preserve the trust that you're given. We can not discriminate, we can not charge, we can make it machine readable, all these open data principles, but only the people that agree to preserve the trust get the data. Now, that's a limitation, but it's one that might allow us to use more data in a more useful way. I'm a let-a-thousand-flowers-bloom kind of guy when it comes to protecting data and privacy, and that's one example. Your second question was about involuntary re-identification.
Involuntary re-identification, where you combine everything: this is hard. You're right, the more data you accumulate together, the higher the odds become. As I understand obligations not to re-identify, what we really care about is doing anything to purposefully uncover someone's identity; you just don't ask the question. Now, will some people be accidentally re-identified when you comb through? Possibly. Maybe when you combine things: oh, so-and-so lives at this address, I know this person. But we also need to become a little more comfortable with the idea of some collateral damage in using data. Data security has this right: if we wanted perfect data security, we would not allow data to be stored on a hard drive, full stop. In the same way, if we really cared about making car safety one hundred percent, we would ban the use of cars, but we don't do that, because we get too much utility from cars. In the same way, we need to get comfortable with what acceptable risk thresholds are for the use of data. And if we start talking about acceptable risk thresholds, then we can really start pushing the ball forward, and the threshold may vary according to appetite. Then finally, how do we check on this? This is the biggest problem in comparing anonymization to data security: we know, more or less, when data has been breached, but it is very difficult to know when data has been re-identified, because there's no external marker that we get. The best answer I have is that we can check things like de-identification procedures through auditing processes: did you go through the process to protect the data? Do you have organizational safeguards implemented? We can check things like that.
We can give private rights of action to data subjects who end up being harmed, who come to find out that they've been re-identified, at least where we know to a relative degree of certainty that this particular person was re-identified through this database, and enable private suits. But that's harder; I don't have an answer for that.

My name is Brigitte, and your last point, I think, was helpful to a degree, because I've been trying to get my head around re-identification. I'm quite confused, because on one hand I think it is kind of a bodge, because anonymization is not strong enough; it shouldn't be the reason. If I listen to a tape or a sound clip and identify that as Barack Obama, or Donald Trump, have I re-identified and broken the rule? So I'm unsure whether re-identification should be banned, or maybe we need a precise enough definition of re-identification to avoid things like listening to a tape and saying, I know that person.

Yeah, we put some sort of intent requirement in, right? And I don't know how much control you're going to be able to add, because if you're talking about fire-and-forget kinds of data, should they be banned? What are you going to do with census data? Census data is hard; that's an incredibly difficult one. A lot of it is: we can try to identify questions that we know are basically re-identification-type questions, and we say you have to avoid intentionally seeking out someone's data. And this is where I look to industry standards. This is where we need the participation of computer scientists and the open data community and advocates, working together to try to agree on a basic set of principles we are able to define. Privacy is probably one of the most intractable concepts that's ever been created, right?
We've been writing about it for a thousand years, yet somehow, over the past thirty years or so, we've settled on a relatively set number of privacy principles, the fair information practices. They're incredibly resilient; they're the universal language of privacy. Maybe we should seek to have a conversation where we try to find universally problematic re-identification techniques, and if we can articulate those, we can say: stay away from those, and then the other stuff is just collateral damage, right?

I know lots of people have their hands up. I'm looking at Phil, because he told me about ten minutes ago to wind up in five minutes. Have I got time, Phil, to take like three questions at once and have that be Woodrow's wrap-up? No? Okay, I'm really sorry. Email me; I'm happy to answer any questions after the lecture. Thank you, everybody, for coming.

Thank you so much for a passionate discussion today, including with the audience as well; there were lots of really interesting questions. Next week's lunchtime lecture is actually a wonderful lead-on from this, so I really hope many of you can be back, because next Friday's lunchtime lecture is titled "Who owns your reviews? Data, your reputation in a digital world". It's looking at data portability, which is totally wrapped up in notions of data protection, as well as data rights under IP and database law, and portability, which is coming through in the GDPR and now potentially in the free flow of data initiative. So it's a really good opportunity to continue the discussion. That will be a panel debate, so there'll be the Australian economist Nicholas Gruen; the assistant director from the department for business, or its new title, which I actually don't remember, industry, infrastructure, energy, Tom Gerlard; and also, hopefully, somebody from Airbnb. So it should be very interesting. Thank you again, thank you everyone, and I hope to see you next week.