Good afternoon and welcome to our webinar on storing and publishing health and medical data. I'm Kate LeMay, a senior research data specialist from the Australian National Data Service (ANDS), and I've got with me Jeff Christensen from QCIF. There are probably some people here who are new to ANDS, so I just wanted to introduce us. We're a federally funded body, and we work to make Australia's research data assets more valuable for researchers, research institutions and the nation through various means around data management and data sharing.
As I said, today's webinar is about storing and publishing health and medical data. It's part of a series. Last week was our first one, on funders and publishers, and the recording of that is already available in the presentation section of the ANDS website. Next week we'll be talking about ethics and legal issues around data sharing.
As I said, I've got Jeff Christensen joining me from QCIF. He's the Program Manager for Health and Life Sciences at QCIF and the University of Queensland's Research Computing Centre. He's been involved in the development of many national research IT infrastructure projects with a biomedical focus within Australia, including med.data, which he'll be speaking to us about. Prior to this he was based in the UK, where he led a team that developed and maintained an international reference resource of embryo anatomy and associated gene expression patterns. Jeff will be speaking to us in a little while.
Firstly, I'd like to talk to you about data repositories in general and some things to think about when planning to submit data to a repository. When we're talking about a repository, we're not talking about just putting a file on Google Drive and making it shareable to other people. A repository is a managed environment capable of storing and sharing data, and it usually has some process for curating and preserving data as well.
People have quite a few choices when looking at which repository to use for their data. Most institutions, and by institutions I mean universities in general, have a repository that may be able to hold either the data set plus metadata about it, or just the metadata. When I say metadata, what I mean is the description of a data set: things like who made it, when it was made, what it's about in general, and any extra information that a secondary user might need to know about that data set. An advantage of having data in an institutional repository is that in general they're free, and in Australia institutional repositories feed their metadata to a site owned by ANDS called Research Data Australia, which I'll show you in a little bit. Research Data Australia collects this metadata about the data sets and provides a central point for people to go to when looking for data sets within Australia.
There are also discipline-specific repositories, and the Australian Data Archive is an example of one of these. The Australian Data Archive is in general social science specific, but it does have some medical and health data sets in there. It's a really great example of a repository that provides mediated access, which is a concept I'll touch on in my next slide. re3data is a registry of discipline repositories, which I will also show you after this slide set. You can go there and search for your discipline to see if there's a repository that's specific to your research.
There are also non-specific repositories, and some examples here are Figshare and Dryad. These repositories don't hold specific types of data, they're quite general, and in the case of Figshare they can hold things other than data, like papers, presentations, posters and other grey literature.
There are some things a researcher might want to think about when they're looking at which repository to deposit their data into. One is whether a repository is being mandated or recommended by, say, a funder or a journal that they're publishing in. I would suggest in this case checking whether it is a mandate or just a recommendation, and looking into whether you're actually required to put the data there. As an example, in last week's webinar Wiley said that they have a deal with Figshare, so you can deposit data into Figshare when you're submitting a paper to a Wiley journal, but they're not mandating that, so they're repository agnostic. There are also discipline conventions. Genomics, for example, has quite well-established discipline conventions in that area.
Another consideration might be whether you're going to be able to publish the metadata and the data in the same place. As I said, some institutional repositories at this point have the capacity for publishing metadata but maybe not for holding the data, so then where will you put that data, and what do we do about that? Some repositories have no cost and some do have a cost, so that's a consideration when you're looking at your choices.
The concept of mediated access that I mentioned when I was talking about the Australian Data Archive means that you can have the description, the metadata, of your data available publicly. It's findable, it's searchable, it's referenceable, but the data itself is not available for public download. In this case the access is mediated through some sort of means, so that only, say, a legitimate researcher with a research question that can be answered by use of your data set is able to access the data.
This can be done at a repository level, with somewhere like the Australian Data Archive, or it can be done at a researcher level, and I'll show you an example on Research Data Australia of a data set that has mediated access through the researcher.
Another thing to consider is whether the repository allocates DOIs to data sets. Most people would be familiar with DOIs, digital object identifiers, from papers being assigned them. If a data set is allocated a DOI, it can be cited in a reference list, and that citation can be tracked in the same way as DOI citations are tracked for papers, so that's an advantage for the researcher if one is assigned to your data set.
So I'd like to show you now a few of the websites that I mentioned. One of these was Research Data Australia. You can come here and search for data sets, and you can also browse by subject, so that's worth having a look at. Here is an example of a medical and health related data set on Research Data Australia that has mediated access. Under the access conditions on the left-hand side, it says to contact them to gain access to that data set. So it's got all the metadata that I was discussing earlier, but access is through the researcher, and the secondary user would also need to have ethical approval to get that data set.
And this is re3data.org. If you go to browse by subject, it's got this really fun, well, I think it's fun, graphic where you can look into subjects. Here, if we click on medicine, that's the fun bit, it pops up, and you can look further into it to see if there's a repository that is related to your discipline.
So now I'm going to hand over to Jeff, and he's going to speak to us about med.data.
Thanks very much for the invitation, Kate. It's a pleasure to be able to talk to the group today. What I'm going to talk about is an infrastructure project called med.data.edu.au, which we sometimes call med.data just to shorten it. This is a cloud-based data storage, computing and sharing system for health and medical researchers in Australia.
Last week at the ANDS webinar we heard from Wee-Ming Boon, and he reminded me about putting things in the context of the research lifecycle. This is an image of the research lifecycle from the NHMRC statement on data sharing. It covers everything from, at the beginning, applying for funding and ethics approval, then commencing research, undertaking research and disseminating the results. What is apparent these days, in particular since a lot of data has been born digital, is that digital data is central to this lifecycle.
When one is attempting to wrangle all of these data resources, there are a number of items of infrastructure that you need in place. When one is collecting data, we obviously need to store it somewhere. That may be on a shared system, it may be on a laptop, it may be in various places. When we store it we also need to manage it, so we need to organize it. This means putting it into directories that mean something and also attaching metadata, information about the data, to make it useful. I guess there are two levels of metadata that can be thought of. One is the collection level: this particular data set or folder over here contains information associated with research project X. The other is that you can attach metadata to items within the data repository that you have, so for instance you could say this is a file, and that file is associated with a person who has various characteristics.
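The two levels of metadata described here can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Collection` and `Item` classes, the field names and the file paths are hypothetical and are not taken from med.data or from any tool named in this webinar.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """A single file in a collection, carrying item-level metadata."""
    path: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Collection:
    """A data set or folder, carrying collection-level metadata."""
    title: str
    metadata: dict = field(default_factory=dict)
    items: list = field(default_factory=list)

    def find(self, **criteria):
        """Return items whose item-level metadata matches every criterion."""
        return [
            item for item in self.items
            if all(item.metadata.get(k) == v for k, v in criteria.items())
        ]


# Collection level: what the folder as a whole is about (hypothetical values).
project_x = Collection(
    title="Research project X imaging data",
    metadata={"project": "X", "custodian": "Example Lab"},
)

# Item level: which participant each file relates to, and their characteristics.
project_x.items.append(
    Item("scans/p001.nii", {"participant": "P001", "age_group": "40-49"})
)
project_x.items.append(
    Item("scans/p002.nii", {"participant": "P002", "age_group": "50-59"})
)

matches = project_x.find(age_group="50-59")
print([item.path for item in matches])
```

Collection-level metadata answers "what is this folder?", while item-level metadata makes it possible to pick out individual files by the characteristics attached to them.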
The other aspect of management that's worth thinking about is sharing data with collaborators during the lifecycle. Those collaborators may be people within your own lab, or they may be within the same institution, but they could very well be based internationally, so having a system that allows you to easily manage and share data is really central to undertaking research.
Obviously data by itself is not particularly useful unless it's subjected to some kind of analysis. There's a plethora of analysis tools that one can apply to data. They might be commercial, they might be open source, and analysis tools all need compute power to be able to run. Again, this computing power may be offered through something like a desktop machine or a laptop. Institutions may offer computational resources. There are cloud providers, and I'll talk a little bit about a national research cloud here in Australia that's come about through the Nectar project. Then there are also high-performance computing facilities, which are really for doing number crunching on very large data sets, or on data sets that require a lot of computation in a parallel fashion.
Once the analysis has been done and the results have been gleaned, it's important to be able to disseminate the results. Kate was just talking about a number of great repositories, such as Dryad and Figshare, which are really useful for disseminating results and the data associated with results from a research project.
This webinar is about a health and medical related topic, so what's so special about health and medical research data? Primarily, because it's health and medical, it includes a proportion of data that is derived from, or is directly related to, human beings who have participated in research. Within those data, single human beings may also be individually identifiable. I'll just point out that the term individually identifiable is used in the NHMRC National Statement on Ethical Conduct in Human Research, which defines three levels of identifiability, one of which is individually identifiable. So single human beings may actually be individually identifiable, in a proportion of the data associated with a research project, to some people involved in that research.
If the data is individually identifiable and it also contains information about a person's health, or their genetic information, or any biometric information, it's also considered to be sensitive, and that's defined by the Commonwealth Privacy Act. If data is sensitive, it carries legal and ethical responsibilities to ensure that the information is not intentionally or inadvertently disclosed to non-authorized individuals. So when conducting research, everybody related to that research has a shared ethical responsibility to ensure that harm doesn't come to any research participant through unauthorized release of identifiable data. That release may be unintentional, where someone accidentally gives access to someone when they shouldn't have, or it may come through things like hacking of systems, which I guess is fairly topical at the moment.
So when one is considering using health and medical research data, especially if that data is going to be identifiable, a process of risk management is required. We have to be able to store that information, use it and share it with collaborators, but it's imperative that this is done in a suitably safe manner.
Med.data is effectively data infrastructure, and as operators of data infrastructure we have a duty to the data custodians of a particular data set, and also to the researchers using that data set, to demonstrate that we have appropriate levels of maturity and discipline in information security practice to be able to store human-derived research data. We also have to have a repertoire of safeguards in place to show custodians the security of the information that's held on systems like this. These may be administrative, so for instance policies. They may be physical, so do we use appropriately secure data centers? And then there's also a plethora of technical safeguards that one can apply to data infrastructure in this space.
So what is med.data? Effectively, we're nationally funded data infrastructure for health and medical research data. We've received NCRIS funding, that's National Collaborative Research Infrastructure Strategy funding, through two projects: RDSI, which was the Research Data Storage Infrastructure project, and the Research Data Services project. Interestingly, this is alluded to in the NHMRC statement on data sharing. As Kate mentioned, some of the infrastructure may be provided by an institution, and Dryad also gets a mention there, but there are also established networks and projects across the country that have been nationally funded, including Research Data Storage Infrastructure.
I should point out that Intersect, which I'm going to mention in a minute, is another organization that's mentioned in the NHMRC statement on data sharing.
As data infrastructure we provide a number of features. One is cloud storage for health and medical research data. This is set in a number of data centers around the country, and I'll talk to that, and it's networked over the high-speed research network backbone. We have data management tools that can be associated with the data storage. Mediaflux and MyTardis are excellent tools that provide data structuring and attachment of metadata to particular items within collections, and also some other features such as encryption. Aspera is a tool for high-speed transfer, and it's also very good at structuring data into directories and has really good encryption capabilities.
Associated with the storage and management are compute resources and analysis tools. With med.data we rely on computing that sits behind the data storage or is associated with that storage. Some of this may be cloud, so I mentioned a little bit about the Nectar research cloud. This is another NCRIS-funded project, to provide a research cloud for Australian researchers. There are also a number of high-performance computing, or parallel computing, systems that are closely associated with this data storage. As for analysis tools, primarily, with med.data, the users of the storage have been bringing their own software. As I said, it may be a commercial product or it may be an open source product, but that software is run on the compute that's associated with the data storage.
Dissemination of results and data access is really important, so we also have a data registry that's closely coupled with the data storage. This leverages Research Data Australia, which Kate just mentioned. We have a widget built into our website which can present, on our site, information that's described in Research Data Australia, and that's for data that's stored on this infrastructure.
We also have a resource library, which I think has proved to be very popular. I think a lot of people in this space are fairly confused about the legislative and best practice landscape when dealing with health and medical research data that's derived from humans, so we have a resource library there, and I think next week's topic will also touch on that. We also talk about IT security frameworks that are suitable for describing the security features of data infrastructure, and I'll talk a little bit in a minute about the Australian Signals Directorate Information Security Manual. Something else that has also proved very popular is an interactive use guide. This is really to find out, if you're thinking about using med.data, whether it's going to be right for your data. It's interactive and leads you through a decision tree, so there are up to eight questions, and it can be very useful for finding information in a directed fashion.
So who manages med.data? The project is led by Intersect, and Intersect, QCIF and VicNode are the three primary partners in this project. ANDS has also been involved in preliminary stages of this project as well. I should say that QCIF, Intersect and VicNode are eResearch organizations and we have member universities, so together we have 23 universities. What we do is work with IT infrastructure groups and others at those institutions to provide value to those institutions. So there are a lot of universities affiliated through this grouping.
So who is using med.data? Prior to answering that, I guess I should say who can use it, and effectively anyone in Australia can use it. So feel free to contact us, and I'll tell you how to do that at the end. But who is using it now? We have about 2.6 petabytes of data stored. Just to give you some context, that equates to about 600,000 DVDs or, I think, about 8 million CDs, so it's a lot of data. The vast majority can be classified as human-derived genomics data and then human-derived imaging data, with a much smaller proportion of information collected from biosensor experiments, or data derived from biospecimens or from epidemiological, observational and simulation studies. Again, if we look at the number of data sets per type, genomics definitely has the largest amount.
We currently have research groups from 20 organizations. Not all of these organizations are members of QCIF, Intersect and VicNode, as I mentioned before. We have researchers from six universities, including fairly independent research centers within those universities, nine medical research institutes, and also other organizations, including hospital-based research groups.
I just want to say a little bit now about identifiable versus non-identifiable data. At the moment the majority of the 2.6 petabytes of data is classified as reidentifiable or non-identifiable. Again, these are classifications that are used in the National Statement on Ethical Conduct in Human Research. Reidentifiable is where a temporary identifier has been attached to information that has been anonymized. This may be a code, an alphanumeric identifier, that identifies the participant within the study but can't be traced back to reveal the identity of that person without the key that links the code to them. Non-identifiable data is data that's never had any kind of identifying information attached. I guess the reason why most of this is currently reidentifiable or non-identifiable is that a lot of the researchers who are using med.data are given access to data sets that have already been de-identified by third parties.
However, we can still store individually identifiable data. Prior to doing that, we urge that we have a discussion about risk management. It's really important that the data custodian of that data set and the infrastructure providers have this conversation. We need to understand collectively what the sensitivity level of this data is, but it's also really important for us, as an infrastructure provider, to understand the use cases for this data. How will it be used? Would it be used on an HPC system, or would a virtual machine in the cloud environment be sufficient? It's also really important to understand where the users are. Are they based within a university, or in a medical research institute, or maybe overseas? Understanding that gives us a better idea of how the data can be housed and protected, and also by whom. Again, are all the people within one research group, or are they across others? This, as I said, dictates the specifics of the IT security setup.
I should just say that the security policies are set by each node operator, and we're very happy to discuss the setup of our infrastructure with data custodians, and particularly with the institutional IT security officers of researchers or data custodians. We can provide a comparison of how we stack up against the Australian Signals Directorate Information Security Manual principles and controls, and I should just say that these are effectively the Australian standards for information security.
If you'd like to know more, we have a couple of methods. You can contact us, there's a contact page on the website, and as I said before there's an interactive use guide. Before closing off, I'd like to thank the funders: NCRIS funded this project, primarily through RDS and RDSI. Thank you.
Fabulous, thank you so much, Jeff, for talking about med.data. I've just received one little question: are most ethics committees requiring full ethics review for data sharing, or LNR? This is a good question. We will be talking about ethics next week. Ethics committees can vary in their attitudes to data sharing. This is a new concept that some ethics committees are coming up against. We have a guide available on our website for ethics committees as a bit of an introduction to data sharing, so if there are any ethics committee members here, or you know anyone who's on an ethics committee, please feel free to point that out to them. But in general, if information is identifiable, you definitely have to have the consent of the participants to be able to share data. If the data is non-identifiable, it doesn't fall under the Privacy Act and can be shared. However, there are other aspects that ethics committees may consider, so it's a bit of an "it depends". Jeff, do you have anything to add to that?
Oh no, I was just going to say that there are quite a lot of data questions in the new Human Research Ethics application form, and one thing we're considering doing is providing some further advice on our site about how one can respond to those questions if they're considering using med.data.
Someone has also asked for more details on how this stacks up against the principles and controls from the government ISM, and is after more details on how to access this. I believe that this is one of those questions, Jeff, where they would need to contact their local...
Yeah, we would have an individual conversation, because disclosing your security setup is itself a security risk. So we will have those conversations one by one, but we're not going to publish a list on the website saying we adhere to all of these and not to these.
I guess I should just clarify a little bit more about ASD. There are a number of classifications in that document, and they range from protected and sensitive, which are the classifications we're utilizing for this type of data, up to top secret, and obviously we're not building a system for top secret information. A lot of it comes back to policies. Each of the nodes has security policies and information security policies, and within that, again, it's really important that we clarify roles and responsibilities. We effectively provide a secure container for the data, and we want the people who are using our systems to use that secure container in a responsible manner, so it's definitely a shared thing. But please contact us through the website and we can have a discussion about how we stack up against those controls.
Yep, absolutely. And there was just one little last question. Someone asked about reidentifiable data, since we were talking about non-identifiable and identifiable data in that question about ethics. My understanding is that if reidentifiable data is separated from the key that can reidentify it, then the data set that does not have any of the identifiers in it is shareable, but again, it's always best to get consent.
And we also have some advice, so one of our guides in the resource section is on anonymization. There's actually no Australian-specific guidance about how to generate those keys, but the US HIPAA, the Health Insurance Portability and Accountability Act, has some really good tips on how to generate these types of identifiers, and basically an identifier shouldn't be derived from anything that was ever associated with that person. Another tip that's covered on that page, and I think the ASD does provide this guidance, from memory, is that the identifying information should never be stored on the same system as the keys. The keys should be stored somewhere else and also encrypted.
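The two tips above (codes that are not derived from anything about the person, and a key kept apart from the coded data) can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure recommended by HIPAA, the ASD or med.data: the `pseudonymise` function and the record fields are hypothetical, only one direct identifier is removed, and encrypting the key and moving it to a separate system are left out.

```python
import secrets


def pseudonymise(records, id_field="name"):
    """Split records into a coded data set and a separate re-identification key.

    Each code is random, so it is not derived from anything about the person.
    """
    coded = []   # shareable rows: the identifier is replaced by a random code
    key = {}     # code -> identifier; to be stored elsewhere and encrypted
    for record in records:
        code = secrets.token_hex(8)
        while code in key:  # vanishingly unlikely, but keep codes unique
            code = secrets.token_hex(8)
        key[code] = record[id_field]
        row = {k: v for k, v in record.items() if k != id_field}
        row["code"] = code
        coded.append(row)
    return coded, key


# Hypothetical example records.
records = [
    {"name": "Participant A", "measurement": 5.1},
    {"name": "Participant B", "measurement": 6.3},
]

coded, key = pseudonymise(records)
print(coded)  # rows carry a random code and no name
```

In this sketch `coded` is the reidentifiable data set and `key` is the part that would be held on a different system, since anyone holding both can reverse the coding.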
Yep, absolutely. And just one last thing to say about de-identification: ANDS also has a guide about de-identification on our website, and it points off to a lot of international and national guidance about the process that you can go through. These terms, identifiable, reidentifiable and non-identifiable, as Jeff said, are in the National Statement on Ethical Conduct in Human Research, and that statement is currently undergoing review. Who knows how long these government processes take, but those three terms are currently within the terms of the review, so we'll keep our eye on whether they continue being the terms. That's just something to note for the future. So I'd just like to thank Jeff very much for speaking to us about med.data. Thank you, everyone, for coming in today.