Again, thank you so much, and welcome to our webinar series at the Center for Open Science. Today we have a great topic: demystifying the data anonymization process, myths and best practices. We were joking that everyone should try to say that three times backwards. We are recording this webinar, so I just wanted to let everyone know it will be made available afterwards for anyone who is not able to attend live. We have also enabled the transcription service below for those who find that a support, and we will make that transcript available as well.

Let me take a quick second to introduce myself. My name is Marcy Reedy. I'm a community manager here at the Center for Open Science, and I help support initiatives around our STEM education hub, which we have developed with support from the National Science Foundation. To start off, I wanted to share some additional resources and information that may be helpful as you embark on your open scholarship journey. The first is our STEM education hub page, which is a trove of valuable resources if you are searching for training and deep-dive supports in open scholarship. There are deep dives into preregistration and open access, and a collection of our previous webinars, including ones on data visualization, on supporting early career researchers in open scholarship, and an overview of licensing and copyright procedures. That is a great resource to bookmark and return to if you are looking for information. I also wanted to share a quick link to a hub we have created on the OER Commons website that we call the Open Scholarship Knowledge Base, a repository of the what, when, and how of open scholarship, everything you might need to know. It not only allows you to search for resources; it also allows you to add your own if you would like to support colleagues on their open science journey. So we invite you to visit and bookmark that as well and take advantage of the resources there.

With those resources shared, a couple of housekeeping tips. It seems the Q&A has already been found, yay, thank you so much, Phillip. We encourage everyone to take advantage of it and share questions as the webinar proceeds; we certainly want to make this active, engaging, and responsive to the audience. We will also be monitoring the attendee list, and there is a raise-hand function, so if you prefer to make a comment aloud and have us call on you, we would be happy to do that. Feel free to raise your hand if you would like clarification about what is being presented at that time, and we will keep track of that and monitor it closely.

And without further ado, because no one wants to hear from me, they want to hear from our guest, we welcome Dr. Micah Altman. We thank him for joining us and sharing his valuable perspective and expertise. Dr. Altman is a social and information scientist at the MIT Center for Research on Equitable and Open Scholarship. Previously, he served as director of research for the MIT Libraries, and at Harvard University as associate director of the Harvard-MIT Data Center. A hugely impressive background, and an amazing person to speak on the topic of data anonymization. Thank you so much for joining us, Dr. Altman.
And I invite you to take it away.

Thank you, Marcy, and hello, COS community. I'm really pleased to have the opportunity today to talk about the research and practice we've been doing and to share some of it. This is informed both by research with a number of collaborators and by my previous work, both as a social scientist and as the head of a social science data archive. In this talk we'll reflect on the fact that more information from individuals is available than ever before, and that the laws, technologies, and uses are all changing rapidly. What I hope to do here is look at how we integrate information protection into the research life cycle and deal with the ethical, legal, statistical, and methodological considerations: in particular, to talk about the needs for information protection, to characterize key concepts, and to identify some next steps for you.

Before we go further, a few caveats for this talk. This work represents my own perspective: not MIT's, not our funders', not the library's. Further, we're making predictions about where information, privacy, and anonymization are going, and predictions are tough to make, especially about the future. So, to summarize: if there's anything wrong in this presentation, it's entirely my fault. And if there's anything you like, it wouldn't have been possible without many, many collaborators, some of them outside MIT, the MIT Libraries for hosting earlier research, and our past funders for supporting many research projects in this area. And of course, nanos gigantum humeris insidentes: this project builds upon the scholarship of many others, only a few of whom could be directly referenced in this presentation.

One of the facts of research now, and of the world, is that personal information is everywhere, including where it isn't wanted. That includes information about what we do, what we say, what we think, where we go, our history, health, and property. All of this information is much more commonly observed, recorded, and shared. It's the subject of research; it's the subject of commerce. And a lot of this is voluntary. This is a chart of the growth of monthly active users on Facebook, in millions: there are about 3 billion monthly active users on Facebook, which seems like a lot, sharing all sorts of information about themselves, their friends, their activities. Now, this is nominally consented information, but even within this framework of consent there are externalities, such as my sharing something that contains information about you, a picture with you in it, perhaps. There are network effects: Facebook is really the only game in town for social information sharing of this sort, because the value of the network grows with the number of people in it, and they've got 3 billion people. And there is asymmetric information, because Facebook knows a lot more about how they can use your information, and what the threats to you are, than you do when you post a photo or send a message. So even voluntary, consented information sharing may not happen under fully informed, reflective, entirely uncoerced consent.

Sometimes the availability of information has consequences that are entirely unexpected. Here is one example, a mugshot database. It illustrates that nominally public information can become universally discoverable, rapidly.
About ten years ago, especially in the U.S., a number of county records offices started putting arrest records on the internet as part of transparency efforts. Now, these were nominally public: you had a right to see them as public records, but that usually involved going down to the county court office during a specified window of time when they were open, asking to look at a particular person's record, and paying the copying fees. Having them all heaved onto the internet had consequences. It made them discoverable, which created new opportunities for aggregation, analysis, and use. That inspired new business models, such as creating a database of mugshots, maybe to support criminal background search, or maybe because you don't want to be in a database of mugshots and you'll pay to have your mugshot removed. So intentional misuses can arise as well as unintentional ones. For you to discover that you're in this database of mugshots, well, you have to find it. How do people typically find out this kind of information? Through Google. So these mugshot databases bought last-name advertisements: when somebody searches a common last name, put up an ad so they'll know we have their mugshot and they'll pay us to take it down. But people tended to click on these ads more frequently when they saw black-identifying first names than white-identifying first names; this is based on work by Latanya Sweeney. So what ended up happening was that the ad algorithm adjusted to show ads for these databases mainly when there was a combination of a black-identifying first name and a last name, even though the mugshot database people had no intention to discriminate by race. Legislation then came around and worked at the edges, changed things; you will still see advertisements for criminal records on Google, though the racial bias in that particular case is now more controlled.

There have been a lot of other data protection issues in the headlines, from international disputes, to breaches and unexpected breaks in anonymity, to claims about group control over information, controversies over consent, and problems with hacking and identity theft. And my favorite, well, my second favorite: that reading privacy policies would take 76 days of your life. So there's another aspect of consent: although technically you've consented to all of those web privacy policies and click-throughs, you've never read them.

Information threats are changing in a variety of ways in the current environment. Information travels wider and faster; cyber attacks are increasingly common; common platforms for information collection, storage, and so on expose information in new ways; and privacy leakage accumulates, even when people use disclosure control methods. One aspect of this that is becoming increasingly important is how aggregate information can reveal things about us. Think about how unique I am as an individual. If you know there's someone whose birthday is on the 31st, whose zip code is 02145, and whose gender is male, have you identified them? Pretty close: there may be two people with that set of attributes. So if those attributes are public, even though we never used a name, we've accumulated enough aggregate information to learn something about them, for example, that they're in this data collection.
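Here is a minimal sketch, with toy records and hypothetical field names, of that kind of uniqueness check: count how many records share each combination of quasi-identifiers (day of birth, zip code, gender).

```python
from collections import Counter

# Toy records with hypothetical field names; real data would have many rows.
records = [
    {"birth_day": 31, "zip": "02145", "gender": "M"},
    {"birth_day": 31, "zip": "02145", "gender": "M"},
    {"birth_day": 12, "zip": "02139", "gender": "F"},
]

quasi_identifiers = ("birth_day", "zip", "gender")
crowd = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

# A combination shared by k records hides each of them in a crowd of size k;
# k == 1 means that record is unique on these attributes alone.
for combination, k in crowd.items():
    print(combination, "-> crowd size", k)
```

A crowd of size two, as in the example in the talk, is not much of a crowd; and, as discussed later in the talk, even a large crowd is not by itself a strong privacy guarantee.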
And it's not just numbers and databases that reveal this. This is a statistical graphic of disease incidence; it's simulated, but it comes from a study that looked at graphics like this. You can look at a disease incidence chart and, given the criteria for having had that disease, learn a lot about whether particular individuals are likely to have the disease in that area. Even though the streets aren't labeled, you can match it to a map, and so on. So regardless of the form of the information, whether it's a map, a picture, or a story, we can accumulate privacy losses and learn things about people in aggregate.

So there are specific challenges around anonymity, and that's what we're going to focus on for the most part here. One is the environment: the rapid change in how much data is collected and how fast and broadly dissemination occurs. This doesn't fundamentally change the problem, but it brings some fundamental constraints into a harsh light. One is that anonymity isn't about naming. It's not really about finding Micah Altman in the dataset; it's about learning. If you didn't learn my name from the dataset, and didn't learn that I was record 236, but you did learn that I was in the study, and the study is of ex-criminals, you've learned something about me. Or maybe you've only learned that I'm probably there. This accumulation of learning can create the sorts of harm that we typically think anonymization prevents. So anonymity is really about learning about individuals, not specifically about acquiring their names and identities. We've also learned that perfect privacy is possible: it involves throwing away all the data. So if we want useful information, we can't have perfect privacy. We will always learn at least a little bit about the people who participated in the data, in the computation, and what we need to do is minimize that amount relative to the value of what we're learning. But even when we minimize it, privacy losses add up, so we have to manage those losses cumulatively and conscientiously.
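The point that privacy losses add up has a precise form in the formal frameworks discussed later in this talk. In the notation of differential privacy, which is introduced below, where each release carries a privacy-loss parameter epsilon, a basic composition theorem says:

\[
M_1 \text{ is } \varepsilon_1\text{-DP}, \;\dots,\; M_k \text{ is } \varepsilon_k\text{-DP}
\;\Longrightarrow\;
(M_1, \dots, M_k) \text{ together is } \Big(\sum_{i=1}^{k} \varepsilon_i\Big)\text{-DP}.
\]

So it is the cumulative budget across every release from the same data, not the loss from any single output, that has to be managed.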
So, as researchers, we're responsible for information privacy, and for all the consequences of sharing information, at least ethically. Many countries recognize information privacy, and some recognize data protection, as a universal human right. This involves not only anonymity but also protection from intrusion into your life, and interests in dignity, reputation, and control over your information. Researchers have ethical responsibilities in general, but the point of research is informational: it is to gather information in order to learn. So when we are doing research on human subjects, we have to be concerned about the information we gather. That's a central part of what we're doing: we gather that information, we benefit from it, and protecting it should be a core concern. People participate in research because they view it as a public good, and we have an ethical responsibility both to protect their privacy and to make sure the benefits of the research compensate for the privacy loss that, even when minimized, is inevitably involved in conducting research. Furthermore, there are legal risks and requirements, and I assert that researchers should treat the law as a supplement to ethical responsibilities. You need to comply with the law; it is part of being ethical. But it is not the beginning or end of ethical responsibilities.

To do this, you'll need to incorporate data security and privacy planning into your research design, both to minimize the legal risks to you and your institution and to minimize the ethical risks to your institution and your data subjects. Now, there are many different laws. I am not a lawyer, and this is not legal advice. These laws impose different sorts of requirements. I'll talk about some of the privacy concepts underneath the law, and I hope this presentation will help you identify possible legal issues, but it is not a substitute for legal advice. The requirements, the triggers for those requirements, and even the definition of the same word, say, anonymity, vary from law to law.

So, in summary, there is a responsibility to understand individual rights and interests and to anticipate harms that may result from participation in your research. You can address this through research design, data management planning, and the selection and use of protections. We'll focus in the rest of this talk on some of the key concepts. And yes, the slides will be disseminated along with the recording, thanks. We'll address all this through planning informational controls throughout the life cycle.

So let's dive into some key concepts. Key concepts in this area derive from three domains, and these domains have grown organically, so the separation is not always clean. But it's useful to keep track of whether you're using a particular word or concept in one domain or another, even when the same word appears in more than one. At the core are a number of policy concepts. One is informational harms and benefits: the general sense of what the overall goods and bads are that come to individuals and society from a particular activity. Another is the privacy and informational rights that we are protecting. Those are, in essence, policy concepts that are implemented through technology and through law. On the technical side, there are concepts of information security, anonymization, and utility; sometimes these words are used in other domains too, but I'm going to use them primarily as technical terms. On the legal side, there are concepts like de-identification, scope of authority, personally identifiable information, and sensitivity, none of which are strictly defined in terms of the statistical or computer science definitions of privacy, but which are regulated by law.

So let's dive into a few of these. Informational harms occur when others use research results or data, learn something, and then violate the rights of others or negatively affect social welfare. That's a very broad definition, and anonymity doesn't take care of all of it. Some examples of these harms: someone links a de-identified health record to an individual; they find a dataset you managed and pin a record to Micah Altman, and I am denied employment, or lose my employment, because it's discovered I have this condition. That's an anonymity harm. Or perhaps you do a statistical analysis on some aggregate data tables, and you figure out that there are only several households in a particular zip code and all of them participated in the study, so you know that all of the households of interest in that area participated in the study of sex offenders.
So you haven't identified a particular person, but you know they're in there, and you've learned something about them. That could be considered an anonymity harm, though maybe not an individual one. Another example: you use biosamples collected for analysis of diabetes, and then you extend that to estimate inbreeding among indigenous people, and that was alleged to contribute to stereotyping and stigma. We're farther away from anonymization here; this could have happened even if you had completely de-identified the data, so anonymization would not be a protection against it, but we might agree that it's an informational harm. Another: analysis of exercise data shows where popular running paths are, and it also shows you where classified military bases are. This happened with the Strava application. Again, probably not an anonymization failure, at least not as typically conceived. Or you build really good machine learning models, train them on user data, develop something profitable, and don't share the results. That's potentially an economic misappropriation, but not an anonymization failure.

Information privacy, more broadly, is about the interests that individuals and groups have in controlling information about or from them. So who might be harmed by an information release? Harms can come to the data subject; or, not necessarily to the data subject, but to a vulnerable group. For example, the harm of stereotyping from using the diabetes data to study inbreeding arguably does not rest on a particular individual so much as on a group, and doesn't rely on the participation of a specific individual so much as on the participation of the group. We may also care about harm to institutions, and often that's where legal and institutional authorities are focused, as well as on data subjects. And we have obligations to society: individually protected information may still lead to unwanted social harms, surveillance, dual use of data, all sorts of things. But the concept of anonymity really focuses on the data subject. Although informational harms more broadly can involve those other actors, anonymity harms are in reference to the data subject, the research subject.

Now let's dive into some technical concepts. Information security is about control of, and protection against, unauthorized access, use, disclosure, disruption, and modification. Someone hacks into your information castle and steals all your records: that is an information security violation, and it is also probably a privacy violation, because privacy involves control over the extent and circumstances of collection, sharing, and use. Someone breaks into your information castle and burns it all down: that's a security violation, but it's not a privacy violation. And anonymity is a sub-case, a more technically well-defined sub-case, of privacy, focused on what others can learn about participants in the data as a result of the data collection, processing, and analysis. Someone learned something about you because you participated in the data: they learned that you're likely to have lung cancer because you were in the data, they looked at the data, found some heightened incidence in your zip code, and traced it to your household. That's an anonymity violation.
However, if they learn that you're more likely to have lung cancer, but not as a result of your participation, for example, if the study just shows that people in your profession are more likely to have lung cancer, they could have learned that whether or not you participated. That's not an anonymity violation. All useful computations provide some ability to learn about the individuals measured in the data from which the computations were made. We say anonymization fails when others learn more than a minimal amount because the person was in the data. So if I learn that you're more likely to have lung cancer, and I deny you insurance because your risk of lung cancer in my eyes has gone up 20 percent, and that happened because you were part of that data process, and I would not have learned it if we had randomly sampled a bunch of other people and done the same thing, then I have violated your privacy.

That's a good question. There's a question in the Q&A about how you distinguish the type of knowledge that is likely to be gained about you without your participating in research from knowledge gained because you participate. One answer is: by the privacy concept you're using. Distinguishing between those two things is at the core of the concept of differential privacy, which I'll talk about in a moment.

In response to that question, there have been different stages in thinking about identifiability. In the mid-60s through the 70s or so, the state of the art was "Where's Waldo?" We thought about anonymization and identifiability as being able to link a record to a particular person: if I can match you to record 17238 in the database, I've broken your privacy. Now, that's generally true if I can match your record in a specific database and that database has useful information I didn't know before, but it turns out that although that's a useful attack method, it's not a necessary condition for a definition of privacy. A second, roughly 1990s, concept is indistinguishability: maybe if there's a large enough group of people you can't be told apart from, that's privacy. The realization here was that even when we didn't link you to record 748, the fact that we learned you were in the dataset, and that you were one of 30 people in the subset of the data who had committed horrendous crimes, meant that we learned something interesting about you. It turns out that thinking about the crowd of people to whom you are identical is still not a great core definition of privacy. The most sophisticated, most modern definitions of privacy, which is where differential privacy and its variants focus, are about limiting what adversaries can learn. You can do this very formally, with statistical and cryptographic analysis, and it's actually possible to prove, for some particular data computation processes, that you can limit the amount any individual contributes to the result. So if the sensitivity of the answer, say the mean income of a district, can be guaranteed not to be affected too much by any individual, then if you publish just the summary statistics, you can limit, statistically, how much people can learn, across all possible background knowledge and distributions.
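Formally, in the standard notation: a randomized computation \(M\) is \(\varepsilon\)-differentially private if, for every pair of datasets \(D\) and \(D'\) differing in one person's record, and every set \(S\) of possible outputs,

\[
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S].
\]

And here is a minimal sketch, not a vetted implementation, of the mean-income example just described, assuming illustrative clamping bounds and an illustrative epsilon: each person's value is clamped so the mean has bounded sensitivity, then Laplace noise calibrated to that sensitivity is added.

```python
import random

def dp_mean(values, lower, upper, epsilon):
    """epsilon-differentially private mean of values clamped to [lower, upper]."""
    n = len(values)
    clamped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clamped) / n
    # Changing one person's record moves the clamped mean by at most
    # (upper - lower) / n, so that is the sensitivity of this statistic.
    sensitivity = (upper - lower) / n
    # The difference of two Exp(1) draws is a standard Laplace(0, 1) variate;
    # scaling by sensitivity / epsilon calibrates it to the privacy budget.
    noise = (random.expovariate(1.0) - random.expovariate(1.0)) * (sensitivity / epsilon)
    return true_mean + noise

district_incomes = [42_000, 58_000, 61_000, 250_000, 39_000]  # toy data
print(dp_mean(district_incomes, lower=0, upper=200_000, epsilon=1.0))
```

In practice you would rely on a vetted library rather than hand-rolled noise; the point of the sketch is that the noise scale is derived from the statistic's sensitivity and the chosen budget, not picked by eye.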
So, basically, if I thought you had a 50 percent chance of being a criminal and I look at the data, you can guarantee that my probability that you are a criminal will not change by more than, say, 0.05 percent. Of course, there's a trade-off between information utility and privacy. Perfectly anonymous data is perfectly useless data. There is a very formal way of describing the information privacy you lose, in terms of the posterior log odds, how much you change the distribution of belief over somebody's properties. But there's no single definition of utility: you can talk about mean squared error, and you can measure these trade-offs, but it will vary from case to case. So privacy protections always balance usefulness and privacy.

There are also a number of legal concepts, and some of them sound like the technical and policy concepts, but it's really important to distinguish them, because legal concepts are sometimes at odds with the technical definitions. Sensitivity refers to conditions under which the law expects a higher expectation of harm and usually requires additional scrutiny or protection. For example, a federal law may declare that data about criminal history is more sensitive, so when you ask questions and collect measures in that space, there are more required protections: you may expect more risk, so you may need more controls. Scope of authority refers to the conditions under which information is regulated by a specific law, and this is typically related to properties of the data subject, the data collector, and the data user. For example, if I'm a resident of Massachusetts and the data collector is an organization doing business in Massachusetts, then they're subject to the Massachusetts personal information protection law; otherwise, the information doesn't come under the scope of that law. And finally, the law generally has some concept of identifiability, often referred to as de-identification. This specifies the conditions under which the law considers outputs to be "anonymous," in scare quotes, and it often, though not always, implies that under that law the outputs are not subject to any further regulatory requirements. Often this is based on some complementary definition of PII, personally identifiable information: say, a list of attributes such that, if you get rid of them, the data will count as anonymous. And this is where the tension between legal concepts and technical concepts currently matters most. Because, technically, what you can learn is not dependent on specific attributes. You can learn things from any attribute, and you have to look at the combination of information and the computations in order to understand learnability. You cannot simply say that if you take out the first name, as you can under the Massachusetts law, everything is okay and it's anonymous. It doesn't work like that in statistics, even if it works like that in law. There's also sometimes an underlying legal assumption that anonymization is equivalent to zero risk, which is also not true; that's a matter of technical reality, not legal reality. So, for example, FERPA, the U.S. federal law protecting education data, defines identifiability in terms of a direct list of PII, specific attributes like name, whose presence implies that the data are identifiable.
FERPA also has a notion of linkage: there may be indirect information that together could identify a person, though the law doesn't specifically tell you how to determine when something is linkable or not. Under FERPA, to be de-identified, data has to both eliminate PII and not be linkable. Under HIPAA, it's an either/or: you can eliminate the 18 identifiers on a list, or you can have a, quote, statistician verify that the data can't be linked, where I believe the operative definition is roughly a GS-13 statistician, meaning you have an undergraduate degree and some number of credit hours in statistics. And the Common Rule, which is where a lot of our research governance and our IRBs in the U.S. come from, has a standard of "readily identifiable." These laws also have concepts of sensitivity. Under FERPA, anything not defined as directory information is sensitive; directory information is okay, there's assumed to be no harm even if it's identifiable. Under HIPAA, it's medical information. And under the Common Rule, it depends on the harm you'd expect from the information, how sensitive it is. Then there's scope: FERPA applies to data collected by institutions receiving federal educational funding; HIPAA to data collected by institutions designated as health providers, covered entities; and the Common Rule to data collected by institutions receiving federal research funding. So if you're a private company and you collect data to do research, you're not covered by any of these, and you have no legal responsibilities under any of these laws.

So let's talk about the process of anonymization, because protection should be viewed as a process. Effective data protection is the result of an effective process; it's not strictly a property of a data output. Look at the number 721-071-426: is that the average wealth of the zip code in which Bill Gates lives? Is it a privacy violation, if it is? It is impossible to know, given just that number. In fact, even if you had an entire dataset, just by looking at it you cannot say "this is de-identified" in a technical sense, or "this is a statistical average" in a technical sense. For both of those judgments, you need to understand the process that generated the outputs. More generally, we can think about data protection at every life cycle stage, from collection through post-processing. (That number is a social security number, by the way, but the person is not covered by the law, so we're good.)

There are different opportunities for protection at different parts of the research life cycle: the research design stage, the research implementation stage, the research analysis stage, and the post-analysis stage. These include things like evaluating the privacy and security of measurement and data collection, how you store data, what you do for disclosure limitation, and the auditing, adverse event monitoring, and data destruction you do afterwards. Although all of these matter for overall information protection, we're going to focus on disclosure limitation for anonymization. You can also think not only about the stage, but about the level of harm, as a continuum, and this is just a guide for thinking, not an official legal standard, running from minimal risk to grave risk of harm, should the data be linked to a person, or something be learned about a person, by an adversary.
So maybe if someone learns my favorite flavor of ice cream, even if they figure out that mine is fudge ripple, nothing bad will happen; but if they learn that I am a sex offender, that could be a grave risk to me. The consequences of breaching anonymity vary. So how we protect the data should be a function of both how identifiable the data is, that is, the risk of learning about individuals from it, and the harm, the sensitivity, of the data. For harmless data, even if it's identifiable, or for formally, strongly anonymized data, we might choose to release it without any other protections. But if something is either less well protected, maybe because there are no formal methods you can use for that particular analysis, or carries a higher risk of harm, you might want to combine statistical disclosure control with restrictions on use, secure data enclaves, legal agreements, auditing, and other sorts of controls. And generally, you'll want to plan for multiple ways of getting at the data, for tiered access. There will be some outputs that, because of their anonymity or lack of sensitivity, you can make available to the public; some that you may need to gate; and some that are necessary to support particular uses, especially cutting-edge research uses, which need many levels of protection, and for which simply trying to anonymize the data and disseminate it will not be sufficient.

For this last part, let's drill into disclosure limitation for data publishing. You can think of it mechanistically: you apply a set of transformations, at a particular stage, to execute a privacy goal, and it results in some output. That's the mechanical, timeline view. The important part here is really the privacy concept, because it's the privacy concept, like differential privacy, that dictates how protective the result is going to be. Regardless of the transformation, the processing stage, or the particular output, it's the privacy concept that creates the protection guarantee. But let's look at some example transformations, here illustrated on an image of a face. Redaction: we simply don't release part of it. Domain generalization: we limit the values of the pixels. Synthetic data: you might still be able to recognize me. Aggregation: we've aggregated lots of pixels. And noise. These transformations are commonly used in all sorts of disclosure control processes. Then there's a different category. This is an example, conceptually, of using differential privacy as a solution concept. Here, it's not just that we're using noise, but that we're using noise to produce an aggregate result where any single input doesn't affect the output by more than a bounded amount. So this is not my face blurred; it is the average face. That's where the idea of formal protection comes in: you report results, averages, model results, and so on, and the input from any particular person does not affect the output too much. And of course, this is perfect privacy, where we remove everything. Here's an example on a particular dataset with some nastiness in it: we've removed some fields, aggregated or generalized others, suppressed others, and averaged others. This is the result of applying aggregation, suppression, and generalization. We considered favorite ice cream flavor to be non-sensitive, so we didn't do anything to that.
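A minimal sketch, with toy rows and hypothetical column names, of the transformations just described: suppress the direct identifier, generalize zip code and age to coarser values, and leave the flavor field, judged non-sensitive, untouched.

```python
# Toy rows with hypothetical column names; not a real dataset.
rows = [
    {"name": "A. Smith", "zip": "02145", "age": 34, "flavor": "fudge ripple"},
    {"name": "B. Jones", "zip": "02139", "age": 61, "flavor": "crunchy frog"},
]

def transform(row):
    return {
        # "name" is suppressed entirely: it is a direct identifier.
        "zip_prefix": row["zip"][:3] + "**",        # generalize 5-digit zip to a 3-digit prefix
        "age_band": f"{(row['age'] // 10) * 10}s",  # generalize exact age to a decade band
        "flavor": row["flavor"],                    # judged non-sensitive, so left as-is
    }

print([transform(r) for r in rows])
```

As the talk stresses, transformations like these are exactly the 1970s-to-1990s style of disclosure control: useful, but not by themselves a formal guarantee of anonymity.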
Hopefully, nobody figures out that I like crunchy frog as a flavor and matches me somewhere else. All of these transformations reduce utility. There's a tension between maximizing the breadth of future uses and minimizing risk for targeted uses. If you know exactly what you want, you can focus on a method that protects privacy for that particular use. If you're trying to create a general, reusable synthetic dataset, it's not going to be as useful for every specific use. And if you don't say how you anonymized the data, and you don't provide measures of the uncertainty you introduced, you're going to get biased analyses (there's a small illustration of this at the end of this section). That's true no matter what method you use, and it's always been true, but it's becoming more obvious, and the debate over the census results highlights it.

Very quickly, some next steps: articulate privacy principles; identify the legal and institutional requirements; enumerate the informational harms that are part of your research; develop a principled, life cycle data protection plan; and select and apply privacy protection tools. This last set of slides is really just a set of links to useful resources to get you to the next step. On articulating subject rights, there are resources for that, but remember that data protection exists to protect subjects: you should consider and articulate, for your project, who you're protecting and from what. Data protection doesn't involve just disclosure control at the end; it's part of a whole life cycle, and that involves design principles both for what you are protecting and for what you are enabling, and for how you're measuring both. For legal and institutional requirements, you'll need to identify your institutional policies and national law (if you're in EU member states, for example), and there are resources for each of these. Identifying informational harms is generally a qualitative process: the information landscape is large and rapidly evolving, and we can enumerate categories of harm, but understanding which particular risks matter most to your subjects and your research is a qualitative analysis you need to conduct and revisit. Then design the life cycle data protection process, including particular information controls and particular disclosure control methods.

A general warning here: this is an area where the state of the art has substantially outrun the state of the practice, and to some extent the law. The textbooks are outdated; they cover the 1970s-to-1990s methods fairly well, but not more recently introduced methods, the new privacy-enhancing technologies, or the formal methods. There are some introductions linked here. And the shrink-wrap software likewise handles the 70s-to-90s methods; this is a fairly comprehensive list of the shrink-wrap software available, by the way. But for things that go beyond that, using differential privacy to compute means or tables, secure multi-party computation, or other, more advanced privacy-enhancing technologies, you'll need to get into development libraries and the like. The good news, I suppose, is that it is generally possible, for now, to get to legal compliance with the 70s-to-90s methods. But that is not future-proofing you against risks, even legal risks, that could hit you, and it is not protecting against all the possible risks to the subjects.
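To make that earlier uncertainty point concrete, here is a minimal sketch, with simulated data and illustrative parameters, of why publishing the noise parameters matters: if values were released with Laplace noise of a known scale b, an analyst can correct a naive variance estimate by subtracting the known noise variance, 2 times b squared; an analyst who is never told b cannot.

```python
import random
import statistics

true_values = [random.gauss(50, 10) for _ in range(10_000)]  # simulated underlying data

b = 5.0  # noise scale; published alongside the data in a well-documented release
released = [v + (random.expovariate(1.0) - random.expovariate(1.0)) * b
            for v in true_values]

naive_var = statistics.variance(released)  # biased upward by the added noise
corrected_var = naive_var - 2 * b * b      # Laplace(0, b) has variance 2 * b**2

print(f"true ~{statistics.variance(true_values):.1f}, "
      f"naive {naive_var:.1f}, corrected {corrected_var:.1f}")
```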
You can find more information about how to contact me, and some of the working papers, on my website. And I've seen a number of questions come in, so we can open up for more questions now.

Yes, thank you so much, Dr. Altman. We do have a number of interesting, very thought-provoking questions. One I was really interested in, from Ryan: he uses data from continuous glucose monitors, which collect glucose levels every five minutes. What are your thoughts on introducing a bit of non-significant error to throw folks trying to link the data off the trail?

So there's a deep insight in that, and it's both a great idea and not a great idea at the same time. Introducing noise is a fundamental transformation method that can be used in data protection. However, the introduction of noise has to be engineered so that it is known to satisfy a particular privacy solution concept. It is easy to inject noise in a way that still makes it easy to recover information about an individual or subgroup. For example, suppose the glucose monitor had a measure of your location; I know this is a strange thing to have, but suppose it recorded where you were when each sample was taken, and you noised that, so you only recorded the date plus or minus a week and the location plus or minus a hundred miles. If we had a sequence of five or ten of those over a week or a month, that would create a unique fingerprint that we could match to any other database. So if we could observe that person, even with that level of noise, somewhere else, we could match them. The noise has to be governed by a higher-level algorithm that is engineered to provide privacy.

Another good question we had was how the anonymization process would work with a study involving interviews. What if the participants' answers could be identifying?

Yeah, generally, that's a life cycle consideration. When we think about what we're trying to learn and what the outputs are, there are a number of risks in collecting survey information. The first is that people might be observed while they are being interviewed, whether or not the answers are identifying. Then the raw data might have identifying information, either directly or indirectly. The platform being used matters: if you go through SurveyMonkey, they may be collecting IP addresses, so responses could be identified that way. So you have to look at the whole process. How are the inputs related to your outputs? What are you releasing in terms of your final analysis, data products, publications, and so on? And do you have a process that limits what people can learn from that? If you needed open-ended answers, that might involve coding those answers and releasing the topics. And even if you redact information from answers, if the answers are long enough, they can be analyzed stylometrically: from blog posts, we can tell who wrote something by writing style. So you have to think of it as an overall life cycle and balance the goals of protection against what you're releasing.

If there are no other questions: could you please elaborate on the idea of a privacy concept that one could apply to research data, and the difference between the privacy concepts you introduced?
Yeah. The core privacy concept we focused on here is anonymity, which is technically defined, in the modern framework, as what someone can learn about an individual because of their participation. That is a different privacy concept from, say, a legal concept: can I prove that you are record number 33? It's also different from a group privacy concept. You may sample from Ashkenazi Jews, like me, and learn about a breast cancer mutation in that population. That doesn't violate my individual privacy; in that sense, the data is individually anonymous. But you may be interested in protecting group information. Or you may have a privacy concept that involves control over the purposes data is used for: maybe you've anonymized the data and nobody knows it's me, but I didn't authorize you to use it for machine learning and building profitable models, or for research into genealogy rather than medicine. So there are different end goals you may be trying to achieve with those different sorts of modifications to the data and protections across the life cycle.

What is the basic, or first, step toward anonymization to be taken when publishing a dataset, supplementary to a scientific publication, that contains patient data?

For that, I would refer back to the two anonymization textbooks, particularly the first textbook linked, which covers the basics: how to do some basic checks to make sure there are no obvious individual identifiers, so that the standard linkage attacks can't work. Other things will still be necessary to provide formal protection, but that's a good first step.

A big step indeed. Well, with that, we are at 3:03, so we have reached time. I know we could go on forever with questions; it's such a rich topic. We will follow up and share contact information, so please, attendees, feel free to reach back out to us with other questions and follow-up. We hope to continue this conversation; it's a start, certainly, on a topic we're going to continue to work on. I just want to take this time to again thank you, Dr. Altman, so much for joining us. We really appreciate you leading this discussion and sharing all of your expertise.

Well, this is a topic that I find very interesting and that I think is really important to researchers and research subjects. I thank everyone for paying attention to it and diving into the details. Thank you very much.

Yes, thank you, everyone. Have a great day.