So, good afternoon, or good morning if you're over in the Perth time zone. Thank you for calling into our webinar today. We've got some handouts in today's webinar as well: a guide to publishing and sharing sensitive data, which is an ANDS resource, and also an ANDS resource called the Sensitive Data Decision Tree, which is a one-page summary of the information available in our guide.
I'd just like to introduce our two guests today. We've got Professor George Alter. He's a research professor in the Institute for Social Research and Professor of History at the University of Michigan. His research integrates theory and methods from demography, economics and family history with historical sources to understand demographic behaviours in the past. From 2007 to 2016 he was the director of the Inter-university Consortium for Political and Social Research, ICPSR, the world's largest archive of social science data. He's been active in international efforts to promote research transparency, data sharing and secure access to confidential research data. He's currently engaged in projects to automate the capture of metadata from statistical analysis software and to compare fertility transitions in contemporary and historical populations, and we're lucky to currently have him as a visiting professor at ANU.
And Dr Steve McEachern is the director of the Australian Data Archive at the Australian National University. He holds a PhD in industrial relations and a graduate diploma in management information systems, and has research interests in data management and archiving, community and social attitude surveys, new data collection methods and reproducible research methods. Steve has been involved in various professional associations in survey research and data archiving over the last 10 years and is currently chair of the executive board of the Data Documentation Initiative.
So, firstly, we're going to hand over to George, who's going to share the benefit of over 50 years of ICPSR managing sensitive social science data. Over to you, George.
Thank you, Kate. It's a pleasure to talk to you today. ICPSR, as mentioned, has been in data archiving for more than 50 years, and an increasing amount of our effort has gone into devising safe ways to share data that have sensitive and confidential information. At the heart of everything we do in terms of protecting confidential information is a promise made as part of the research process: when we ask people to provide information about themselves, we tell them that the benefits of the research we're going to do will outweigh the risks to them, and we say that we will protect the information they give us. A lot of the data we receive at ICPSR, and here at the ADA, include questions that are very sensitive. Often we're asking people about types of behavior that could cause them harm: we might be specifically asking them about criminal activity, or about medications they take that could affect their jobs or other things, so we have to be careful about it. And we're afraid that if the information gets out, it could be used by various actors for specific purposes. It could be used in a divorce proceeding, for example; sometimes we interview adolescents about drug use or sexual behavior and we promise them that their parents won't see it, and so on. In data archiving, we often talk about two kinds of identifiers.
There are direct identifiers, which are things like names, addresses and social security numbers; many of these are unnecessary for analysis, but some types of direct identifiers, such as geographic locations or genetic characteristics, may actually be part of the research project. And then the most difficult problem is often the indirect identifiers, that is to say, characteristics of an individual that, when taken together, can identify them. We often refer to this as deductive disclosure, meaning that identification is not obvious directly, but if you know enough information about a person in a data set, then you can match them to something else. We're concerned that someone who knows that another person is in the survey could use that information to find them, or that there is some other external database where you could match information from the survey and re-identify a subject. Deductive disclosure is often dependent on contextual data. If you know that a person is in a small geographic area, or you know that they're in a certain kind of institution like a hospital or a school, it makes it easier to narrow down the field over which you have to search to identify them. And unfortunately, in the social sciences, contextual data have become more and more important: researchers now are very interested in things like the effect of neighborhood on behavior and political attitudes, or the effect of available health services on morbidity and mortality. And there are a number of different kinds of contextual data that can affect deductive disclosure. So we're in a world right now where social science researchers are increasingly using data collections that include items of information that make the subjects more identifiable. For example, people studying the effectiveness of teaching often have data sets with characteristics of students, teachers, schools and school districts, and once you put all those things together, individuals become very identifiable.
So we at ICPSR, and I think the social science data community in general, have taken up a framework for protecting confidential data that was originally developed by Felix Ritchie in the UK and that talks about ways to make data safe. I'm going to go through the points that Ritchie talks about: safe data, safe projects, safe settings, safe people and safe outputs. The idea is not that any one approach solves the problem, but that you can create an overall system that draws on all of these different approaches and uses them to reinforce each other.
Safe data means taking measures that make the data less identifiable. Ideally, that starts when the data are collected. There are things that data producers can do to make their data less identifiable. One of the simplest is to do something that masks the geography. If you're doing interviews, it's best to do the interviews in multiple locations; that adds to the anonymization of your interviewees. Or if you're doing them in only one location, you should keep the information about the location as secret as possible. Once the data have been collected, research projects have used a lot of different techniques over many years to mask the identity of individuals. One of the most common is what's called top coding: if you ask your subjects about their incomes, the people with the highest incomes are going to stand out in most cases.
And so usually you group them into a category that says people above $100,000 in income, or something like that, so that there's not just one person at the very top but a group of people, which makes them more anonymous. This list of techniques, which goes from aggregation approaches to actually changing the values, is ordered by the amount of intervention involved. Some of the more recently developed techniques actually involve adding noise, random numbers, to the data itself, which tends to make it less identifiable, but it also has an impact on the research that you can do with the data (a simple illustration of these masking techniques appears below).
Safe projects means that the projects themselves are reviewed before access is approved. At most data repositories, when the data need to be restricted because of sensitivity, we ask the people who apply for the data to give us a research plan. That research plan can be reviewed in several different ways. The first two things we do regularly at ICPSR are to ask, first of all, do you really need the confidential information to do this research project? And if you do need it, would this research plan identify individual subjects? We're not in the business of helping marketers identify people for target marketing, so we would not accept a research plan that did that. There are also projects that actually look at the scientific merit of a research plan. To do that, though, you need to have experts in the field who can help you.
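As a rough illustration of the masking techniques just described, top coding, banding and noise addition, here is a minimal sketch in Python. The column names, cutoff and noise scale are hypothetical, and real statistical disclosure control would be tuned to the dataset (and usually done with dedicated tools) rather than hard-coded like this.

```python
import numpy as np
import pandas as pd

# Hypothetical survey extract: column names and values are illustrative only.
df = pd.DataFrame({
    "respondent_id": [1, 2, 3, 4, 5],
    "income": [42_000, 58_000, 61_000, 95_000, 410_000],
    "age": [24, 37, 41, 58, 63],
})

# Top coding: collapse everyone above a cutoff into a single top category,
# so the highest earner no longer stands out as a unique value.
TOP_CODE = 100_000
df["income_topcoded"] = df["income"].clip(upper=TOP_CODE)

# Aggregation: report 10-year age bands instead of exact ages.
df["age_band"] = (df["age"] // 10) * 10

# Noise addition: perturb the masked income with random noise. This makes
# re-identification harder, but it also changes the values analysts see.
rng = np.random.default_rng(seed=42)
noise = rng.normal(loc=0, scale=2_000, size=len(df))
df["income_perturbed"] = (df["income_topcoded"] + noise).round()

print(df[["respondent_id", "income_perturbed", "age_band"]])
```

The trade-off George describes is visible here: each step makes individuals harder to single out, but the perturbed and banded values are less useful for some analyses than the originals.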
Safe settings means putting the data in places that reduce the risk that it will get out, and I'm going to talk here about four approaches. The first one is data protection plans. For data that need to be protected but where the level of risk is reasonably low, we often send those data to a researcher under a data protection plan and a data use agreement, which we'll come to in a couple of minutes. The data protection plan specifies how they're going to protect the data. Here's a list, which one of my colleagues at ICPSR put together, of the things we worry about. One of the things we ask people is: what happens if your computer is stolen? How will the confidential data be protected? And there are a number of things people can do, like encrypting their hard disk or locking their computer in a closet when it's not being used, that can address these concerns. I think data protection plans need to move to a general consideration of what it is we're trying to protect against, and allow users to propose alternative approaches, rather than saying you have to use this particular software or this or that. We have to be clear about what we're worried about. A couple of notes about data security plans. They are often difficult, partly because of the approach that has been taken in the past, and also because researchers are not computer technicians and we're often giving them confusing information. One of the ways I think universities, in the U.S. at least, are going to move beyond this in future is by developing their own protocols with different levels of security for different types of problems. At each level, they specify the kinds of measures researchers need to take to protect data at that level of sensitivity. And from my point of view as a repository director, any time the institutions provide guidance, it's a big help to us.
The other way to make the data safe by putting it in a safe setting is actually to control access. There are three main ways that repositories control access. One kind of system is what I've called a remote submission and execution system, where the researcher doesn't actually get access to the data directly. They submit program code, a script for a statistical package, to the data repository; the repository runs the script on the data and then sends back the results (a rough sketch of this submit-and-return flow appears below). That's a very restrictive approach, but it's very effective. Recently, however, a number of repositories and statistical agencies have been moving to virtual data enclaves. These enclaves, which I'll illustrate briefly in a minute, use technologies that isolate the data and provide access remotely while restricting what the user can do. The most restrictive approach is actually a physical enclave. At ICPSR, we have a room in our basement with computers that are isolated from the internet, and we have certain data sets that are highly sensitive. If you want to do research with them, you can, but on the way into the enclave we're going to go through your pockets to make sure you're not trying to bring anything in, on the way out we're going to go through your pockets again, and you'll be locked in there while you're working, because we want to make sure that nothing uncontrolled is removed from the enclave. The disadvantage of a physical enclave is that you actually have to travel to Ann Arbor, Michigan to use those data, which can be expensive. That's the reason a number of repositories are turning to virtual data enclaves.
This is a sketch of what the technology looks like. You as a researcher log on over the internet to a site that connects you to a virtual computer; that virtual computer has access to the data, but your desktop machine does not. You can only access the data through the virtual machine. At ICPSR, we actually use this system internally for our data processing to provide an additional level of security. We talk about the virtual data enclave, which is the service we provide to researchers, and the secure data environment, which is where our staff work when they're working on sensitive data. It's a little bit of a letdown, but this is what it actually looks like. The window that's open there with the blue background is our virtual data enclave, and I've opened a window for Stata inside it. The black background is my desktop computer. If you look closely, you'll see the usual Windows icons in the corner of the blue box. That's because when you're operating remotely in the virtual enclave, you're using Windows; it looks just like Windows and acts just like Windows, except that you can't get to anything on the internet. You can only get to things that we provide, for a level of security. On top of that, the software that's used (we use VMware software, but there are other brands that do the same thing) essentially turns off your access to your printer, your hard drive and your USB drives. You cannot copy data from the virtual machine to your local machine. You can take a picture of what you see there, and because you have that capability, we also restrict people with data use agreements.
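To make the submit-and-return flow concrete, here is a deliberately simplified sketch. The directory paths, file layout and release step are hypothetical and not ICPSR's actual system; a real service would add authentication, job queuing and a proper disclosure review before anything leaves the secure environment.

```python
import subprocess
from pathlib import Path

# Hypothetical locations; in a real enclave the data never leave this machine.
SECURE_DATA_DIR = Path("/secure/data")      # readable only inside the enclave
SUBMITTED_SCRIPTS = Path("/secure/inbox")   # scripts uploaded by researchers
STAGED_OUTPUTS = Path("/secure/outbox")     # results held pending review

def run_submitted_script(script_name: str) -> Path:
    """Run a researcher's script against the secure data and stage its output.

    The researcher never touches the microdata directly; they only receive
    what is released from STAGED_OUTPUTS after a disclosure review.
    """
    script = SUBMITTED_SCRIPTS / script_name
    output_file = STAGED_OUTPUTS / f"{script.stem}_results.txt"

    # Execute inside the secure environment; results go to a staging file
    # rather than straight back to the researcher.
    result = subprocess.run(
        ["python", str(script), "--data-dir", str(SECURE_DATA_DIR)],
        capture_output=True, text=True, timeout=3600,
    )
    output_file.write_text(result.stdout)
    return output_file  # handed to a reviewer (or automated check) before release
```

The restrictiveness George mentions comes from the last step: nothing is returned until the staged output has been checked, which is also where the "safe outputs" review discussed later fits in.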
That's my next topic: how do you make people safer? The main way we make people safer is by having them sign data use agreements and by providing training. The data use agreements used at ICPSR are, frankly, rather complicated. They consist of a research plan, as I mentioned before; IRB approval for what they're doing; a data protection plan, which I mentioned; and then these additional things of behavioral rules, security pledges and an institutional signature, which I'll mention now. If you look at the overall process of doing research, there are a number of legal agreements that get passed back and forth. It actually starts with an agreement made between the data collectors and the subject, in which they provide the subjects with informed consent about what the research is about and what they're going to be asked. It's only after that that the data go from the subject to the data producers. Then the data archive, such as ICPSR or ADA, reaches an agreement with the data producers in which we become their delegates for distributing the data; that's another legal agreement. Then, when the data are sensitive, we have to get an agreement from the researcher, and these are the pieces of information we get from the researcher. In the United States, our system is that the agreement is actually not with the researcher but with the researcher's institution. ICPSR is located at the University of Michigan, and all of our data use agreements are between the University of Michigan and some other university, in most cases; there are some exceptions. It's only after we get all of these legal agreements in place that the researcher gets the data. One of the things in our agreements at ICPSR is a list of the types of things we don't want people to do with the data. For example, we don't want someone to publish a cross-tabulation table where one cell has only one person in it, because that makes that person more identifiable (a simple automated version of this check is sketched below). There's a list of these things, often 10 or 12 of them, that are really standard rules of thumb that statisticians have developed for controlling re-identification. The ICPSR agreements are also, as I said, agreements between institutions. One of the things we require is that the institution takes responsibility for enforcing them, and that if we at ICPSR believe something has gone wrong, the institution agrees to investigate based on its own policies about scientific integrity and protecting research subjects. DUAs are not ideal; there's a lot of friction in the system. Currently, in most cases, a PI needs a different data use agreement for every dataset, and they don't like that. I think in the future we can reduce the costs of data use agreements by making institution-wide agreements in which the institution designates a steward who will work with researchers at that institution. There's already an example of this: the Databrary project, a project in developmental psychology that shares videos, has done very good work on legal agreements. My colleague Margaret Levenstein, the current director of ICPSR, has been working on a model where a researcher who gets a data use agreement for one dataset can use that to get a data use agreement for another dataset, so that individuals can be certified and that certification can be used in other places.
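The small-cell rule mentioned above (don't publish a cross-tabulation containing a cell with only one person) is one of the easier rules of thumb to check automatically. A minimal sketch, assuming a pandas DataFrame and a hypothetical minimum cell size of 3:

```python
import pandas as pd

# Hypothetical microdata; variable names and the threshold are illustrative.
df = pd.DataFrame({
    "region":    ["North", "North", "South", "South", "South", "East"],
    "diagnosis": ["A", "B", "A", "A", "B", "B"],
})

MIN_CELL_SIZE = 3  # agencies commonly use thresholds like 3, 5 or 10

# Build the cross-tabulation a researcher might want to publish.
table = pd.crosstab(df["region"], df["diagnosis"])

# Flag any non-empty cell smaller than the threshold.
small_cells = (table > 0) & (table < MIN_CELL_SIZE)

if small_cells.any().any():
    print("Small cells found; this table risks identifying individuals:")
    print(table[small_cells.any(axis=1)])
else:
    print("No cells below the minimum size; the table passes this check.")
```

A real output-checking workflow would also look at things like dominance and complementary disclosure across related tables, which is part of why a manual review usually still sits on top of automated checks.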
One of the things I think we need to do more about is training; a number of places, like ADA, train people who get confidential data. We've actually done some work on developing an online tutorial about disclosure risk, which we haven't yet released, but it is, I think, something that should be done.
Finally, there are safe outputs. The last stage in the process is that the repository can review what was done with the data and remove things that are a risk to subjects. This only works if you retain control, so it doesn't work if you send the data to the researcher, but it does work if you're using one of these remote systems like remote submission or a virtual data enclave. Often this kind of checking is costly. There are some ways to automate part of it, but a manual review is almost always necessary in the end.
So, a last thing about costs and benefits. Obviously, data protection has costs: modifying data affects the analysis, and if you restrict access, you're imposing burdens on researchers. Our view is that you need to weigh the costs against the risks involved. There are two dimensions of risk. One dimension is: in this particular data set, what's the likelihood that an individual could be re-identified if someone tried to do it? Secondly, if that person was re-identified, what harm would result? So we think about this as a matrix, as you can see in this figure: as you move up, you're getting more harm, and as you move to the right, you're increasing the probability of disclosure. If the data set is low on both of these things, for example a national survey where a thousand people from all over the United States were interviewed, we don't know where they're from, and we ask them what their favorite brand of refrigerator is, that kind of data we're happy to send out directly over the web without a data use agreement, with simple terms of use. But as we get more complex data with more questions, more sensitive questions, we often add some requirements in the form of a data use agreement to assure that the data are protected. And when we get to complex data where there is a strong possibility of re-identification and where some harm would result to the subjects, we often add a technology component like the virtual data enclave. And then there are the really seriously risky and sensitive things. My usual example is a data set we have at ICPSR compiled from interviews with convicts about sexual abuse and other kinds of abuse in prisons; that data is very easy to identify and very sensitive, and we only provide it in our physical enclave. So that's the end of my presentation. Thank you for your attention, and we'll take questions later.
Great. Thank you, George. So we'll pass over to Dr Steve McEachern to give his presentation about managing sensitive data at the Australian Data Archive.
Okay. So my aim today is to build off what George has talked about, particularly taking the Five Safes model and looking at the situation in the Australian case. I'll talk about the Australian Data Archive and how we support sensitive data, but I want to put it in the context of the broader framework of how we access sensitive data in the Australian social sciences generally. So I'm going to talk about some of the different options that are around, picking up somewhat on what George has discussed in terms of the alternatives that are available, and demonstrate the different ways these are in use here in Australia. So I'm really focusing more on the Five Safes model and its application in Australia than I am specifically on ADA.
As I always say, we are one component of the broader framework for sensitive data access here. So what I really want to cover is thinking about sensitive data and the Five Safes model. I'll look at the different frameworks for sensitive data access in Australia and where you might find them, then how we apply the Five Safes model at ADA in particular, and then, depending on time, I might say something briefly about the data life cycle and sensitive data as we go through.
I want to pick up particularly on the ANDS definition of sensitive data here, and to frame it by saying that most of what we deal with at ADA has, at some point in its life cycle, been sensitive data. More often than not it's information collected from humans, often with some degree of identifiability, at least at the point of data collection if not at the point of distribution. So a lot of what we deal with, and this is true for a lot of social science archives, would fall into the class of sensitive data, although we would draw a distinction between what we receive and what we distribute. In terms of our definition here, and this is in the guide that I think is in the handout section and is available online: data that can be used to identify an individual, species, object, process or location, and that introduces a risk of discrimination or unwanted attention. Now, we tend to think in terms of human risks more than anything else, the risk to humans and individuals, but it does apply in other cases as well. For example, the identification of sites of Indigenous art might in and of itself lead more people to want to visit that location and, in a sense, destroy the thing you're actually trying to protect: the more visits the site gets, the more degraded the art itself becomes. So it doesn't just hold for human research, but that's probably our emphasis at ADA.
So just to reiterate the five safes: we're talking about five things, people, projects, settings, data and outputs. The reference, down at the bottom, is the document that Felix Ritchie and two of his colleagues developed framing out the Five Safes model. What I would say about this is that it has been adopted directly by the UK Data Service; that's where it has its origin. The basic principles are applied in a lot of the social science data archives, and it's now been adopted by the Australian Bureau of Statistics as well: their framework for thinking about the output of different types of publications literally leverages this model. So we think it's quite a useful framework to talk about.
I'm going to take a slightly different approach to George in thinking about what we're worried about. As a depositor, you worry about the risk of disclosure. As a researcher, what's the flip side of that? Why do we need access to sensitive data? What does it provide? The National Science Foundation, about four or five years ago, put out a call around how access to microdata, particularly from government sources, could be improved, and the responses highlight why we talk about the need for access and the sorts of research you can do. This comes from a submission by David Card, Raj Chetty and several other economists in the US and elsewhere; they were highlighting what's needed. Direct access is really the critical thing here.
And direct access to microdata. By microdata, we mean record-level information about individuals, line by line. Aggregate statistics, synthetic data (where you create fake people, as it were), or submission of computer programs for someone else to run really don't allow you to do the sorts of work you need to answer policy questions in particular, and a lot of social policy research is focused in this way. So in order to do certain things, access to this data is necessary. How do we facilitate that, while taking account of the sorts of concerns that have been raised?
Alongside that is the question of how people expect to access it. There was an interesting blog post from a researcher previously based at the University of Canterbury comparing how you access US census data versus the New Zealand census, and similarly we could say the Australian census as well. In the US, you can get a 1% sample of the census; you can just go and download a file directly. It's open, as what's called a public use microdata sample file, and those are directly available. In New Zealand, there's a whole series of instructions you have to go through: you might be subject to data use agreements, you might be subject to an application process, et cetera. Now, he's criticising this, saying it should be much easier, that the US model is the appropriate one rather than the New Zealand model. But what we're really talking about is that, depending upon the sort of detail and the sort of identifying information that's available, both might be valid models; they just allow you to do different things. The first model really focuses on masking the data to some degree, some of the safe data approaches that George talked about; the other uses other aspects of the safes model to address confidentiality concerns. What you'll also find is that researchers understand there have to be some trade-offs. The need for confidentiality is recognised and understood, and there may well be, and ought to be, trade-offs in return for access. So, for example, Card and his colleagues suggested a set of criteria you could put in place to enable a form of access to sensitive microdata. They reference access through local statistical offices, through remote connections such as the virtual enclave that George talked about, and monitoring of what people are doing. If you're going to have highly sensitive data available, the trade-off for access should be appropriate monitoring. This is just one possible approach, but there is a recognition that access brings with it responsibilities and appropriate checks and balances.
So what I want to talk about is how that has eventuated in Australia. What do we see? The sorts of models that we see here in Australia I've broken out into roughly four broad areas. The one that people are probably most familiar with is the ABS, the Australian Bureau of Statistics. They have a number of systems and access methods that suit different types of safe profiles. These include what are called confidentialised unit record files, or CURFs. They have the Remote Access Data Laboratory, which is one of their online execution systems. They have an on-site data lab: you can go to the bowels of the ABS buildings, certainly in Canberra and I believe in other states as well, and do on-site processing.
Then they have other systems, probably the best known of which is TableBuilder, an online data aggregation tool that does safe data processing on the fly. Our emphasis at ADA is primarily on the equivalent of these confidentialised unit record files: we provide unit record access, and some aggregated data access as well. Then we have the remote execution or remote analysis environments. Under this model I'd put the Australian Urban Research Infrastructure Network (AURIN), for geographic data access in particular. The Secure Unified Research Environment (SURE), produced by the Population Health Research Network, is an example of George's remote access environment as well. Even the data linkage facilities, another part of the PHRN network, fit to some degree under this type of secure access model; in a sense they're a more extreme version of it. Then we have other ad hoc arrangements as well, things like physical secure rooms. A number of institutions have a secure space; there are a number here at ANU, for example. And there might be other departmental arrangements that exist as well.
We can classify these in terms of the types of approaches they take. What I've done here is a very simple assessment, from "not at all" through to a strong "yes, it addresses this safe element", from low to high. I have some question marks against some of the facilities, particularly SURE and the data linkage facilities, not because I don't think they can do it, but because I don't have enough information to make an assessment there. If you look at the different types, the ABS models have tended towards safe data (the sorts of confidentialisation and randomisation routines), data output checking, and secure access models. Tabulation systems are a secure access model as well. They've tended less towards safe people and safe projects, that is, checking of people and checking of projects. In a lot of cases, we put more trust in the technology than in the people using the technology, which I think is a little bit problematic, given that, and I'm going to talk to this in a moment, there are some fairly good processes in Australia for assessing the quality of people in particular, and to some extent of projects. The point I'm making here is that you have different alternatives for how you might make sensitive data available. There's not one solution; it's about what mix of things you might do, and I'll come back to that at the end.
In the Australian experience, I'd say we've had a strong emphasis on safe data. In Australia we came up with the term "confidentialisation"; that's probably the term you'll see most regularly here, whereas almost anywhere else in the world you would hear the term "anonymisation". I'm not quite sure why this is the case, but in Australia the term we tend to use is confidentialisation. The Australian Data Archive uses this model, as do the ABS and the partner social surveys, things like the Household, Income and Labour Dynamics in Australia (HILDA) survey; anonymisation techniques are the starting point. So you can make data safer before you release it. It has its limitations, though, and a good example is some of the data released into the data.gov environment using anonymisation, where safe data was essentially the only protection: there's the potential for it to be reverse engineered.
If you haven't done your anonymisation properly, then it can be reversed and you get a disclosure risk. So it has its flaws, and this is why we've tended towards looking at a combination of techniques. But as George pointed out, if the risk of actually being identified is low, and particularly if the harm that would come from that is low, then this may be sufficient. Certainly for a lot of the content we have at ADA, most of our emphasis is on safe data more than anything else.
Safe settings. We do have, as I say, examples here. Tabulation systems, things where you can do cross-tabs online, are fundamentally a safe settings model: people don't get access to the unit record data, they just use the system to produce outputs. Then there are the remote access systems: the Remote Access Data Laboratory, the PHRN's SURE system, and a new system that the ABS is bringing on, making their data labs available in a virtual environment. That's at the pilot stage, we're working with them on it at the moment, and it's increasingly being used as well. There are also the physical secure environments, as I mentioned: the on-site data lab and the secure rooms.
Safe outputs. A number of the safe settings environments, because they tend to hold highly sensitive data, have safe output models as well. The real problem with these has been scaling them. More often than not it requires manual checking, so reviewing the output of these sorts of systems requires people and requires time. It's hard to automate as well. The ABS have invested a lot of money into automating output checking; in point of fact, the TableBuilder system is one of the best around. But the new remote data lab still has manual checking of outputs. So it depends on what you're trying to do and the sorts of outputs you're producing as to whether you can actually automate the checking. The other side of this, which I think will become increasingly relevant, is the replication and reproducibility element of things that come out of systems like this. How are we going to facilitate replication within those environments? I'm not sure that question's been addressed yet.
Safe researchers and safe projects in Australia, to be frank, are considered in most models, but they're not really closely monitored, and that's because they're difficult to monitor. How do you follow up on the extent to which people comply with the things they've signed up to? Anyone who's been involved in reporting research outputs for ERA or the like will know that getting people to fill out forms about what they've produced is hard; filling out forms to say "I've been compliant with a data use agreement" is even harder. That said, we do have some checks and balances: certainly the ethics processes and the codes of conduct for research provide some degree of vetting assurance for those who go through that sort of system. So we have some checks and balances in place, particularly for university researchers, to address these sorts of concerns. So I think an increasing emphasis on safe researchers and safe projects might be something we can leverage a bit more. As I say, the frameworks we have in place, the Australian code of conduct for research, and increasingly professional association and journal requirements for data sharing, go some way to putting a degree of assessment on the sorts of practices we use as well.
The American Economic Association's requirements, the DA-RT agenda in political science, plus journal requirements for data sharing: these are also a mechanism for assessing the sharing of data, and for assessing the extent to which you're potentially being disclosive as well. That's something to consider for the future.
I'll quickly turn to the ADA model and then wrap up. In the ADA model, as I say, our emphasis is primarily on safe data. Data are anonymised, usually in advance, either by the agencies or by the researchers who provide data to us. We also do some review of content, and we provide recommendations back to our depositors: these are the sorts of things you probably want to think about, for example whether you have included things like postcodes or occupational information. If I know someone's postcode, their occupation and their age, there's a fair chance I could identify them in many cases, in remote locations in Australia in particular. There are some basic checks you can do (one of them is sketched below). On safe people and safe projects: our data access is almost all mediated. You must be identified, you must provide contact information and supervisor details, so we do some checking on safe people. We also require information on project descriptions, what you intend to do with the data, particularly where we have more sensitive content; often that's a requirement from our depositors. Frankly, we don't apply safe settings and safe outputs; that's not the space we work in. We work with other agencies such as the ABS where there's access to certain sensitive content, and we'll point people off to the relevant locations. Where you've got highly sensitive content that you want to make available, I'd point to something like the remote data lab. There the focus is less on safe data: it's a virtual enclave. They don't prohibit safe data practices, so they will still limit detail where you have highly sensitive data, but there's a more dedicated assessment process on the projects and the outcomes, and highly safe settings. The problem for the ABS is the cost of establishing the system itself, and they vet all of the outputs, which has costs associated with it. They have safe people: there's training for researchers prior to accessing the system. There are some challenges in assessing the backgrounds of people, for example; this is where the need for domain experts comes in. If you're going to fully assess people and projects, and you're going to assess their domain expertise, you need domain experts to make that sort of evaluation. So the emphasis might be on: are you using appropriate techniques? Are you maintaining secure facilities? What does the research plan itself look like? The emphasis is more on that than on the quality of the science, which is a much harder thing to evaluate. Safe projects: that has been used in some places at the ABS. Sometimes it's required for legislative reasons; the extent of data release depends, for example, on a public good statement. One of the questions for the future for some organisations is whether this should matter, given research itself might generate useful insights that you didn't expect. So, as I say, in some cases you're going to be moving the levers, focusing on different aspects of the safes environment.
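The basic check Steve describes, asking whether a combination such as postcode, occupation and age narrows down to a single person, amounts to counting group sizes over the quasi-identifiers (a k-anonymity style check). A minimal sketch with hypothetical column names and an assumed threshold:

```python
import pandas as pd

# Hypothetical deposited dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "postcode":   ["0872", "0872", "2600", "2600", "2600"],
    "occupation": ["teacher", "nurse", "teacher", "teacher", "nurse"],
    "age":        [34, 52, 34, 41, 34],
})

QUASI_IDENTIFIERS = ["postcode", "occupation", "age"]
K = 3  # require at least 3 respondents sharing every combination

# Count how many respondents share each combination of quasi-identifiers.
group_sizes = df.groupby(QUASI_IDENTIFIERS).size()
risky = group_sizes[group_sizes < K]

print(f"{len(risky)} combination(s) shared by fewer than {K} respondents:")
print(risky)
# Combinations with a count of 1 are unique records: prime candidates for
# recoding (e.g. age bands), suppression, or release under stricter access.
```

This is the kind of review an archive might run before release; combinations that fail the check are exactly where the usability-versus-confidentiality trade-off discussed later comes into play.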
I guess the message we want to put through here is that there is certainly a suite of options available to you for accessing sensitive data. Different models exist, and they cover different ranges of the five safes. You can certainly incorporate safe people models. Curiously, a lot of models focus on the expectation that there's an intruder, hackers coming in to access the system. Actually, what tends to be the case more often than not is the silly mistakes: I left my laptop on the train, or I left my USB stick in the computer lab. That's far more common. We tend to gravitate towards default options in terms of our mix of safes, but as I say, there are options available to you, and what you have to think about is what's appropriate for the form of data you're trying to work with. Fundamentally, the argument is that the principles should enable the right mix of safes for a given data source.
Thank you very much, Steve. That was a really great overview of the different ways that the five elements of the safes can be mixed and used in different settings. I thought it was really interesting that both of you mentioned that a safe location was in a basement; I've just got these images of people locked up in basements. I also wanted to note that George mentioned data masking and de-identification methods, and Steve mentioned confidentialisation and anonymisation. They're similar sorts of words for similar processes, and ANDS has a de-identification guide available on our website now; if you're interested, there's more detail in that guide, which you can have a look at. I was also wondering, George: you were talking about the data protection plan and the data use agreement, and that the onus is on the institution, so that if someone breaks it, the institution needs to put them through some sort of research integrity investigation. If that doesn't happen, is there any potential recourse against the university? Could ICPSR turn around and say, well, you didn't follow this process, you're not going to be accessing any of our data anymore?
Sure. Actually, on our website we list the levels of escalation that we're willing to go to. We can certainly cut off the institution from access to ICPSR data. What really gets people's attention is that the U.S. has an Office for Human Research Protections. If we thought that someone was breaching one of our agreements and endangering the confidentiality of research subjects, I would report them to that office. That office has a lot of power. They regularly publish the names of bad actors. What's more, they can cut off all NIH funding to universities, and they have done that in the past when they thought that protections weren't in place. I always think of that as the nuclear option. I know for a fact that university administrations and their trustees and regents are terrified that they will do something like that. Waving that in front of a university compliance officer gets their attention.
Excellent. Steve, I was wondering, with the Australian Data Archive, the use agreement that people are signing, is that with the individual user or with the institution, as it is with ICPSR?
Primarily it's with the individual. We have a small number of organisational agreements, but not many.
I would say the focus is on an agreement between the individual and the archive rather than an organisational one. Some organisations do ask for them, but frankly that's more for pragmatic reasons than compliance reasons: they want to host content and manage access by requesting access to a particular dataset for all members of their research team, for example. It just makes that easier, as it were. There are other models. In the ABS model, the agreement is actually with the institution, and then the individuals sign up to the institutional agreement. The Department of Social Services model is the same. It will be interesting to see the extent to which we move in one direction or the other. I think the compliance argument hasn't been all that common here in Australia, except in the case where you have government data, I would say. For academic research data, there hasn't tended to be a need for this.
And whereas with George's agreements with institutions the recourse is that the institution should then run some integrity investigation, what level of recourse do you have when the agreement is with the individual?
We would probably report back to the institution to which they belong, in much the same way. We do ask questions about supervisory arrangements, and we would probably also pursue some of the questions under the code of conduct for research. That's why I make reference back to the fact that there is an overarching set of obligations on those within Australian academic institutions, and we would pursue something in that way. One of the challenges for us, and I'm going to guess for George as well, is just finding out when you get breaches of compliance; one of the hardest things is actually finding out what happened in the first place. We've had one case that I'm aware of, certainly going back through my predecessor's time to the late 90s. So it's not a common occurrence that we're aware of.
Okay, excellent. So George mentioned standardised data use agreements between US institutions. Has that been formalised across a number of institutions as part of a consortium arrangement, or is it more informal and gaining momentum?
Well, the example I gave is the Databrary project, and they're the only ones I know that have done this in a formal way, where they get institutions to sign on as an institution and that covers all of the researchers at that institution. It took them a while to negotiate that and get the bugs out, but I think it's paying off for them. This is something that I think other groups like ICPSR should move to, but right now it's a big problem: about one in six of our data use agreements at ICPSR involve a negotiation between lawyers at the University of Michigan and lawyers at the other institution. So it's a major cost. I think it's one of the ways to go.
I would say that in Australia we have a pretty strong example, which is the Universities Australia ABS agreement. That model facilitates a whole lot of things; it's enabled access to the broad collection of ABS CURF data under a single agreement. The other side is that the universities signed up to a cost that comes with that, so they're paying for it, but I'd say it covers the full spectrum of what they can do. The challenge in some cases is what scope you've got for dissemination of the content.
As I say, if I went to the various departments (and I've had this discussion with them), could we establish a consistent data access agreement? Because the departments themselves are set up under different models and different legislation, the effect is that they can't necessarily have the same set of conditions. But certainly there is some capacity to try and optimise some of that, and I'd be interested to see the extent to which the Productivity Commission report that's coming out on data access might address some of those sorts of questions as well.
Okay, great. So just quickly, there's a question about whether there are any checklists or guidelines for new researchers to assess their research surveys for the level of confidentiality. I think they're talking about something like privacy risk assessments.
Okay, we have an internal checklist, and this is something we've talked about in terms of thinking about what you need to do, but it really depends on the situation. We talked before about the fact that, in order to do certain research, you need to have some things that might be identifying, so it depends on which point in the data life cycle you're actually talking about. When we're thinking about data release, we would basically apply some basic principles: these are the sorts of things that we look for. And actually we've talked about making that checklist available, in terms of these are the sorts of things you have to be concerned about; there is advice around that we could probably bring together. But I'd say it's this usability versus confidentiality question again. One of the things we sometimes do is split off those things that have a high confidentiality risk and release separate sets of data, so if you need that additional information, it can be made available under a separate, additional set of requirements, possibly in a different technological setting. So I think it depends a little bit on when in the life cycle you're talking about. It's often useful to have the information: for example, if you're running a longitudinal study, you must have identifying information going forward, or you won't be able to contact someone the next time around. It depends on what you're trying to achieve. But there is some basic advice that you can put out.
There's a literature that's been used by statistical agencies about what they can release. But that whole area is somewhat contentious right now, because the statistical agencies developed that literature largely in the age when data were released in the form of published tables. When the data are available online and you can do repetitive, iterative operations on them, you're in a new world, and there's a separate literature that's developed in the computer science world. Anyway, it is a problem. There is guidance out there in really complex areas, like some healthcare areas. Doing a full assessment of a data set can be very complicated and difficult. So my recommendation is that people start with the basics and think about: how would you identify this person, and if this information got out, what harm would it cause? Often the researchers themselves have a good sense of that from the research they're doing.
Just one last question: are the five safes applicable in all research disciplines, or are they specifically limited to the social sciences?
I think they're broadly applicable.
It's interesting that we're having a discussion here about the social sciences, but, for example, we work a lot with the health sciences and environmental sciences and the like, and I don't see any reason why they shouldn't be applied elsewhere. Part of the point is that it's really about what you have to think about in terms of the privacy and confidentiality risks, far more than what the topic is. The topic helps you make some sort of judgement about the harm, in George's terms, but it's the confidentiality questions that matter, and the framework is still applicable.
Thank you very much to George and Steve for coming along to our webinar today, and thank you everyone for calling in.