So, welcome everybody to the second in this series of webinars about the FAIR data principles. Today we're up to A for Accessible. Last week we talked about the first one, Findable; next week we'll talk about Interoperable, and the week after that about Reusable. First of all, I'd like to introduce myself. My name's Keith Russell, I'm from the Australian National Data Service, and I'm your host for today. A big thank you to Susanna Sabine in the background, who is organising and co-hosting this webinar with me. Just as a bit of background: the Australian National Data Service works with research organisations around the country to establish trusted partnerships and reliable services, and to enhance capability across the research sector to add value to research data. We are working together with two other NCRIS-funded projects, RDS and Nectar, to create an aligned set of joint investments to deliver transformation in the research sector. So, we have three speakers for today. I'll do a quick kick-off and give a very brief introduction to what the FAIR data principles say about Accessible. Then I'm really excited and very grateful to have two further speakers today. The first is David Fitzgerald; he is in this webinar but doesn't have a webcam, which is why you can't see him at present. David is a data manager at the Australian Longitudinal Study on Women's Health, and he's going to talk about how the study makes its data accessible. I was especially interested in his perspective from the angle of sensitive data and making sensitive data accessible. Our other speaker today is Jingbo Wang from NCI. I've asked Jingbo to talk about how NCI makes their data accessible using services over the data, so it can be interrogated and used by humans and machines.
So first of all, I'd like to give a brief introduction to the A in the FAIR data principles. The A stands for Accessible. The way FORCE11 describes the principle is that data and metadata, both of them, are retrievable by their identifier using a standardised communications protocol. "Retrieved by their identifier" refers to the identifier we talked about last week: a DOI, a Handle, a PURL, something persistent. By using that DOI, Handle or PURL, you should be able to get access to the data or the metadata. And the protocol to get there should be open, free and universally implementable. The thing to think about there is that the protocol is standardised and can be used by anybody; it's not something bespoke, not something home-built or badly documented. The classic example is HTTP, the very normal way of accessing materials and data over the internet. It should not require specialised, expensive software. Another point the principles make is that the protocol should allow for an authentication and authorisation procedure where necessary. This addresses a common misunderstanding: when people read "accessible", they think, oh, that means I have to make my data open. If you actually read the principles, that's not what they're saying. Accessible does not have to mean open or free, but you are expected to give the exact conditions under which the data are accessible. So even heavily protected and private data can be made FAIR. If you implement the FAIR data principles properly, a human being can see that the data is perhaps not openly available, but also what steps they need to take to get access to the data.
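To make the "retrievable by identifier over a standardised protocol" idea concrete, here is a minimal sketch in Python. It assumes the public doi.org resolver and plain HTTPS; the DOI itself is a made-up placeholder, and the status-code interpretation is an illustration of how a client might react, not a formal part of the FAIR principles.

```python
# Sketch: retrieving (meta)data via a persistent identifier over HTTP(S),
# the standardised, open, free protocol the A principle asks for.
from urllib.parse import quote

def doi_to_url(doi: str) -> str:
    """A DOI resolves through the public doi.org resolver over HTTPS."""
    return "https://doi.org/" + quote(doi)

def describe_access(status: int) -> str:
    """Interpret the HTTP status code a resolver or landing page returns."""
    if status == 200:
        return "retrievable"
    if status in (401, 403):
        return "authentication/authorisation required"
    if status == 410:
        return "data gone - metadata record should remain"
    return "unknown"

print(doi_to_url("10.1234/example.data"))   # hypothetical DOI
print(describe_access(403))
```

The point of the `describe_access` helper is that every outcome, including "you need to authenticate", travels over the same universal protocol, so no bespoke client software is needed.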
The FAIR data principles also talk about machine access to data. If a machine goes hunting around looking for the data, the machine should be able to recognise that the data is not open, and what steps need to be taken to actually get to the data. I'll talk about that a little further on. If the user, either a human or a machine, has been granted access to the data, then it should be accessible through some sort of standard authentication and authorisation procedure. The last point the FAIR data principles make about being accessible is that, in the case where data is no longer available, at least the metadata should remain accessible. This is of course not ideal, but in some cases it is necessary to take the data down. That could be because consent for use was only for a limited period of time, or maybe there's been a legal takedown notice, or something along those lines that makes it impossible to keep the data available. In that case it is valuable to keep up a metadata record describing the data and explaining that the data is no longer available. Now, just to reinforce that accessible does not always have to mean open: there are clear cases in which data cannot be made openly available. The obvious example is where data refers to human beings and specific characteristics of those human beings, like information about their health, income, religion, attitudes, political persuasion, all that sort of thing. That's not the sort of information you can make publicly available. Another example, and one that's probably worth remembering, is threatened species. The locations of threatened species are data that you may not want to make openly available, because that could mean that the last few individuals of those species are hunted down or collected.
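The idea that a machine can tell from metadata alone whether data is open, restricted, or withdrawn can be sketched as a small machine-readable record. The field names below are illustrative only, loosely inspired by common metadata vocabularies, not a real schema, and the handle and email address are invented.

```python
# Sketch of a machine-readable "tombstone" metadata record that stays
# online even after the data itself has been withdrawn. A harvesting
# machine can tell from the record alone whether the bytes are directly
# retrievable, instead of just hitting a dead link.

def can_download(record: dict) -> bool:
    return record["access_rights"] == "open" and not record.get("withdrawn", False)

tombstone = {
    "identifier": "hdl:1234/5678",           # hypothetical Handle
    "title": "Example survey dataset",
    "access_rights": "restricted",            # open | restricted | closed
    "access_conditions": "Apply via the ethics committee; see contact.",
    "contact": "data-custodian@example.org",  # role address, not a person
    "withdrawn": True,
    "withdrawal_reason": "Consent period expired",
}

print(can_download(tombstone))
```

Note that the record spells out the exact conditions of access and a contact route, which is precisely what the principles ask for when data cannot be open.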
The famous example is the Wollemi pine: the locations of those specific populations need to be protected. Finally, another example where data cannot always be made openly available is where there are commercial interests in the data. Maybe the metadata can be shared, but the data itself is commercially sensitive, and in that case it would not be appropriate for it to be made openly available. When considering making data accessible, we do argue for making it as accessible and as openly available as possible. The minimal option is to provide the metadata as a starting point: if nothing else can be made available, at least the metadata. Slightly more useful, perhaps, is making the data available through mediated access, and in that case it's valuable to be clear about how the user can actually get access. That can be through providing an email address, a name, a telephone number; and if, for example, the user has to go through an ethics procedure to get access to the data, then clearly describe that ethics procedure and what sort of information is required to apply. On mediated access and providing information about who to contact: one thing to keep in mind is that if you list a person within the organisation, have a think about whether that person might one day leave. If it's a researcher, they may move to another organisation. Have a fallback, some sort of mechanism, or maybe a more general role-based email address, so that when that data custodian leaves, somebody else can still answer questions and grant access to the data.
Another possible angle for making data accessible is creating a de-identified version of the data and making that public, as long as it's properly de-identified. That can be useful for certain data users, who at least get a better view of what's in the data set, and for some purposes a de-identified version is enough. Finally, a good point to keep in mind: if you do want to make the data accessible, plan for this in your consent forms, because coming back afterwards and trying to get consent is not easy. Another angle worth keeping in mind, and something I've invited Jingbo to talk about in more depth, is that making data accessible can happen through various routes and various protocols. In some cases it doesn't make sense to make a large data set available for download. It can make much more sense to have services over the data, which allow users to interrogate parts of the data and pull in just the specific parts that answer their requests. That can be useful for a human being, but especially for a machine it can be extremely useful. One thing to keep in mind there is that you need some sort of community-agreed standards around that, but Jingbo is going to talk much more about it. So that was all from a more theoretical perspective. I'm very grateful to have two speakers today to talk about Accessible in practice and how they have actually tackled making data accessible. The first speaker for today is David Fitzgerald, a data manager at the Australian Longitudinal Study on Women's Health, and I'm very grateful that David is available to talk about what ALSWH has done to make quite sensitive data still accessible for others to reuse. David is on the line, so I'd like to hand over to him to talk about how they have made data accessible. Thank you, Keith. OK, so I'm David Fitzgerald, the data manager for ALSWH — how do I pronounce it?
It's the Australian Longitudinal Study on Women's Health, and I'll be talking about the accessibility issues for this study. I'm going to first explain and give background to our study, and then talk about the accessibility issues and try to relate them to the FAIR data principles, which I've listed here. These are the exact ones Keith showed earlier, so I won't go through them in detail, but I'll try to relate them to our study. OK, so what is the ALSWH study? It's a collaborative project of the two universities of Newcastle and Queensland, and in fact the two universities are responsible for keeping the sensitive data, which I'll talk about briefly. It's one of Australia's longest-running longitudinal epidemiological studies: it's been going since 1996, is ongoing, and we hope it will go further into the future, funded by the Australian government. We started off with over 40,000 women, and a few years ago we recruited a new cohort of 17,000 women. I'll show you the four cohorts we work with — here they are. The four cohorts are age-based and we define them by years of birth. You can see the oldest one, born in 1921 to 1926, and there are three other cohorts of various ages. As you can imagine, each cohort has its own health issues, and that's what we're interested in, and indeed what the Australian government is interested in. So what are we collecting, and what is our methodology? Health issues — in particular mental, physical, reproductive and social health, and more — and also life transitions. Women at different ages are obviously going through different life transitions and life events, and things related to health, employment, health service use and more. I'll also just briefly mention data linkage.
I don't want to stress this too much, because it's a big area with lots of issues, but we have linked our survey data with some administrative data sets, which are listed there: the MBS, PBS, cancer registries and admitted patient hospital data. The linked data can be particularly sensitive, and we treat it quite differently in how we make it accessible. The data is used extensively: more than 680 peer-reviewed papers have been published using our data, we report back to the government frequently, and national health policies have been informed by reports and use of our data. OK, so I'll go on to the aspects of accessibility and how they relate to our data. First, the point about data being retrievable by an identifier using a standardised communications protocol. All the data sets from our survey that are analysed and used have an identifier — the same identifier across all surveys — and I stress that the data is de-identified, but with a consistent new identifier. So anyone using our survey data (with the caveat that this doesn't apply to the linked data) has one and only one identifier per respondent. We say the data has been de-identified because there are no personal names on the data, no addresses, no postcodes and no full dates of birth, although the year and month of birth are given, obviously so you can do things like age analysis. Those are the main ones, but any other data deemed identifiable is stripped off. The identifier — we call it the ID alias — is not the administrative ID that a respondent would see, or that somebody working in the office in Newcastle communicating with our respondents would use.
They would not know what the analysable identifier is; they would have a different, administrative ID. Just on this point: any small cell sizes that we think are identifiable are grouped into larger groups. For example, country of birth is grouped into broad continental geographical areas, to avoid particular countries of birth coming up. And anyone using the data must, along with a number of other conditions, agree not to identify respondents. Although we go to great lengths to make that very difficult, it's conceivable that something could come up, so they promise and sign that they will not identify respondents if they ever had that possibility. OK, I was also asked to look at legal and ethical issues. We do have a legal contract with the Australian Government Department of Health. This is ongoing; we didn't get one 20-year contract, we regularly update it with short-term contracts. The ethics committees of the two universities have also approved our usage. In fact, because the study is longitudinal — every year we go back to at least one of the cohorts to survey them — each new survey, which is never identical to previous surveys, is subject to ethics committee oversight and approval. So we do have extensive legal and ethical processes. Now I want to talk about how an investigator or re-user would actually get access to our survey data. As explained on our website, they must first complete an expression of interest form, saying in particular who they are, why they are a serious researcher, and what they want to find out from the data. That is reviewed by our Publications, Substudies and Analyses committee.
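The de-identification steps David describes — swapping the administrative ID for a stable analysis alias, stripping direct identifiers, keeping only year and month of birth, and coarsening small-cell values like country of birth — can be sketched as follows. The field names, region table and IDs are all invented for illustration, not ALSWH's actual schema.

```python
# Sketch of de-identifying a survey record: replace the administrative ID
# with a stable analysis alias, drop direct identifiers, keep only
# year/month of birth, and group country of birth into broad regions.

REGION = {"Vietnam": "Asia", "Italy": "Europe", "Australia": "Oceania"}
DIRECT_IDENTIFIERS = {"name", "address", "postcode", "date_of_birth"}

def deidentify(record: dict, alias_map: dict) -> dict:
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    out["id_alias"] = alias_map[record["admin_id"]]  # same alias in every survey
    del out["admin_id"]                              # admin ID never leaves the office
    out["birth_year_month"] = record["date_of_birth"][:7]  # never the full date
    # coarsen small-cell values into broader categories
    out["region_of_birth"] = REGION.get(out.pop("country_of_birth"), "Other")
    return out

raw = {"admin_id": "A1001", "name": "Jane Citizen", "postcode": "2300",
       "address": "1 Example St", "date_of_birth": "1951-04-17",
       "country_of_birth": "Vietnam"}
clean = deidentify(raw, {"A1001": "W000123"})
print(clean)
```

The key design point is the alias map: because the same alias is used across all surveys, longitudinal analysis still works, yet the released data can never be joined back to the administrative records without that map.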
That's the PSA committee. If their expression of interest (EOI) is approved, they sign confidentiality and data use statements before receiving the de-identified data. They must also report back to us about their progress; we expect work on the data to start reasonably soon, and to continue, for them to keep that access. If their expression of interest is successful, the data are actually sent to them, and this is an area I'm directly involved in. Before sending, we encrypt the data — we use 7-Zip software, which compresses it as well. We use the AARNet CloudStor system to send data to the approved researchers and re-users, and an email is sent to them with the passwords, and also copied to a substudies contact in the management here, for future correspondence. I'll just add a note about the linked data: we never send that out. Anyone using it has to come to our offices, or use the SAX Institute's SURE facility, which can also hold it; we don't own the linked data, so we've agreed not to send it anywhere. Now, public metadata — this refers back to the protocol being open. We have a website which sets out the procedure I just went through, but it also has a lot of metadata on it, including: a data dictionary, which lists all the variables in the many data sets; a data dictionary supplement, which describes the frequently used variables in some detail; and a data map, which shows how the variables are used across the different surveys and cohorts — being longitudinal, we have up to eight surveys for some of our cohorts, and each one is deemed a different survey with slight differences from the others. We have a list of all the variables in spreadsheets, for easy access. We also have data books, which list the frequency summaries of the variables — essentially the questions that the respondents filled in.
There are also technical reports, which go into detail on many aspects of the study, and a frequently asked questions page on exactly that. So, making metadata accessible: although our data is not completely open, we do want to make it accessible, and we archive both the metadata and the data. We do that annually with the Australian Data Archive. Although they are not releasing it yet, the plan is for them to take over release of our data in the future, perhaps when we're no longer doing it ourselves, and that will help keep our data useful and used in the long term. And yes, that's what I have to say. I'd just like to acknowledge the women in our study who fill in the surveys, and of course the Government Department of Health for funding us, and the Universities of Newcastle and Queensland for doing the work. So thank you. Thank you, David — a really interesting presentation. It was interesting to hear how you've made data accessible in practice, and what it means to make sensitive data accessible to researchers. Thanks for that perspective, and for that view on how quite sensitive data can still be made accessible through various routes. I think it's really interesting that you have both the route of de-identified data through appropriate channels, and the linked data — a much richer version — available either through SURE or by coming to the ALSWH, the Australian Longitudinal Study on Women's Health, itself. I've got to work on that one. Thanks. OK, I'd now like to move on to Jingbo. I've asked Jingbo to talk about making data accessible from a very different perspective. Jingbo works at NCI, and NCI does all sorts of things around making data findable, accessible, interoperable and reusable. Today I've asked Jingbo to focus on the accessible side of things, but I do want to note that NCI also does a whole bunch of other work in this space.
Thank you, Keith. I'll just turn off my camera, because I can't see my presentation. Right, so my name is Jingbo Wang. I work at the National Computational Infrastructure, which is a supercomputer centre located on the Australian National University campus. Today I'm going to address the different flavours of data accessibility practice at NCI, and before I do, I just want to make the comment that the FAIR principles are quite useful for governing our data management practice, and we use them in every single aspect of our data management. This is a quick overview of the data sets we have. As you can see, the main data types we store at NCI are national collections: climate models, satellite images, bathymetry, elevation, hydrology, geophysics. Those data are quite geospatially focused, but we also have social science data, genomic sequencing data and astronomy data. We aim to provide users with data as a service, as many digital repositories do. In our data management, we catalogue data so that people can query the metadata database to find what we have here. We also publish data through various data services — that's the focus of the next few slides. We offer data quality assurance, data quality control and benchmarking use cases. We provide data through virtual laboratories, and we also help with data visualisation. One thing that makes us different from other digital repositories is that we are co-located with an HPC facility — high performance computing. Given the large scale of our data — we host more than 10 petabytes of research data — we really want to make good use of the high performance computing here to advance science research. So these are the six points I want to address today about data access; I've put the key words in red to show the difference between the points.
First I'll talk about how we control data access, and then I'll present one example of how we use persistent identifiers to manage data access. Then I'll talk about the two main data services we offer at NCI for our users: one is THREDDS, the other is GSKY, which is a fancier, scalable, distributed data server. Finally, I'll cover data versioning and data quality very quickly. So the first point: how do we control data access? Most of our data comes from our stakeholders, such as Geoscience Australia, the Bureau of Meteorology, CSIRO and universities, and much of it has been funded by the Australian government, so it naturally falls under a full CC BY licence. Some owners also impose non-commercial, non-derivative or share-alike variants of CC BY. We also have international partners, in Europe and the US, and they impose even stricter terms and conditions on access to their data. So that's the legal perspective: controlling data access through licences. On the file system, we actually hard-code the data access control using ACLs. This is how we separate different groups of people accessing the same data. Basically, for each collection we have two access groups. The first group has read and write permission: those are the data managers, who are able to generate, write and modify data. The second group is read-only: those people can access the data on the file system, but they can't modify anything. This way we protect the integrity of the data — we only give write access to authorised people who actually manage the data. There is also a social aspect of data access. For a research project we often see an embargo period — for example, the data can only be made available two years after the project.
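The two-group access model Jingbo describes — per collection, one group that may read and write, and one that may only read — can be sketched as a simple lookup. On the real file system this is enforced with file-system ACLs; here it's just an in-memory table, and the collection and group names are invented.

```python
# Sketch of per-collection access control: a read/write group of data
# managers and a read-only group of data users. Names are illustrative.

ACL = {
    "climate_models": {
        "cm_admin": {"read", "write"},  # data managers
        "cm_users": {"read"},           # everyone else with access
    },
}

def allowed(collection: str, group: str, action: str) -> bool:
    """True if members of `group` may perform `action` on `collection`."""
    return action in ACL.get(collection, {}).get(group, set())

print(allowed("climate_models", "cm_users", "read"))    # True
print(allowed("climate_models", "cm_users", "write"))   # False
```

Separating the write path from the read path is what protects data integrity: the many readers can never alter the collection, however they access it.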
Also, some researchers say: I want to share my data only after my journal article about this data set is published. Another example is from the Bureau of Meteorology: for one data set there is a six-month delay between the data being developed and verified and it becoming operationally available on our THREDDS server. The second point I want to raise is our practice around implementing persistent identifiers. We've often experienced the frustration that when we give people a URL to access the data, it's only valid for a certain period of time, or only while somebody maintains it; afterwards, we can't really guarantee it. If you look at the left-hand side of the slide, those are the original URLs: the metadata catalogue URL and the service endpoint URL. Look at the second one, the service endpoint. From its naming convention you can tell that the later parts include the project code, file path and file name. If anything in that path changes — the project code changes, you rename the file, or we shuffle files around — the link is broken. So the original URL we provided is not a very stable one. We adopted a product CSIRO developed some time ago that acts as a persistent identifier broker. Most of the time we now give external users the right-hand-side naming convention. As you can see, we have four main categories after pid.nci.org.au: data sets, services, documentation and vocabularies. The only thing that has to be kept unique is the file identifier, a UUID. As long as the identifier stays the same, the URL on the right-hand side stays consistent. If anything changes in the original URL on the left-hand side, all we need to do is update the mapping inside the PID service broker, without disrupting the URL we've given to the external user.
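The broker idea reduces to one level of indirection: users keep a single stable URL, and only the mapping behind it changes when files move. A minimal sketch, with invented PID paths, project codes and target URLs:

```python
# Sketch of a PID broker: the external user keeps one stable PID URL;
# the mapping to the real (movable) location is updated behind it.

MAPPING = {
    "pid.nci.org.au/dataset/f1a2b3":
        "http://data.example.org/thredds/catalog/proj1/v1/file.nc",
}

def resolve(pid_url: str) -> str:
    """Return the current real location behind a PID URL."""
    return MAPPING[pid_url]

def relocate(pid_url: str, new_target: str) -> None:
    """File moved, renamed, or project code changed? Update only the
    broker's mapping -- the PID handed out to users never changes."""
    MAPPING[pid_url] = new_target

pid = "pid.nci.org.au/dataset/f1a2b3"
relocate(pid, "http://data.example.org/thredds/catalog/proj2/v2/file.nc")
print(resolve(pid))
```

In production the broker would answer over HTTP with a redirect to the target, but the essential contract is exactly this dictionary update.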
We have published the technical implementation in the Data Science Journal, so you're welcome to have a look. Now I'm going to talk about the main data services — the part Keith really wanted me to address from NCI's perspective. I divide our data services into two main groups. One is the OGC services; I'll talk more about what OGC is in a second. The other type is more project-specific. For example, we are one of the largest nodes in the Southern Hemisphere of the Earth System Grid Federation, which aggregates climate models from research institutes globally, and the way we provide that data service is by copying the main climate model data to serve Australian users. Another fancy data service I'm going to show you a bit more of is called GSKY; it's a scalable data server that interacts directly with our file system. So what is OGC? OGC is the Open Geospatial Consortium, an international non-profit organisation that makes quality open standards for the global geospatial community. We find the OGC standards quite useful because we have a lot of geospatially featured data, and OGC has standards for all sorts of mapping, feature, coverage and processing services for us to use. Because they're so common and free to use, if we make the data available through OGC standards, a lot of people can naturally access our data — that's the motivation. So what is an OGC service? It's actually an API in the middle, between the data store and the user, and the user can request whatever is available through it. Say I want a map of an anomaly across the whole Australian continent, and NCI hosts this data. We host the data, but we don't host images. What the OGC web service does is extract the image and return it to the user, and the user can take the URL containing the image of the data and put it on their own web portal.
For example, you can copy and paste the URL onto NationalMap to show the grids. NCI has two main production data services. One is THREDDS, and you can often find the THREDDS links in our data catalogues. This is the interface of GeoNetwork: the red-circled link is the NCI THREDDS server, which you can click and open. The second interface is the data catalogue. They contain more or less the same information, but serve different purposes: GeoNetwork is mainly for data harvesters — machine access — while the data catalogue is for human readers. THREDDS, in very simple terms, is a data service which allows you to browse and access the data. I've listed here the six main types of services that THREDDS offers. The first two, OPeNDAP and the NetCDF Subset Service, are for subsetting the data. We have a lot of very large data sets, but in practice, when scientists access the data, they don't necessarily need all of it; they might just need a very small piece from this big pool. What THREDDS offers is that you can define your query and get only the part of the data that you want, which really saves a lot of traffic on the internet. The next two, the standard OGC Web Map Service and Web Coverage Service, are very popular ways to get maps and coverages directly out of our data. THREDDS also offers a very quick data viewer: if you don't know what a data set is, you can have a quick look at it on the web without downloading it. And of course, THREDDS offers direct download if you really do want to download the data. The other scalable, distributed data server I mentioned is called GSKY, a product developed in-house at NCI. The problem it solves is that we have a lot of data on the file system — millions and millions of files. If we want people to query this data, how? It's going to be very hard to create millions of metadata records, one for every single file.
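Server-side subsetting is worth making concrete: instead of downloading a whole file, the client asks an OPeNDAP-style endpoint for just the slice it needs, expressed as index ranges appended to the dataset URL. The endpoint, variable name and index ranges below are invented for illustration, and this is a simplified sketch of the constraint syntax rather than a complete client.

```python
# Sketch: building an OPeNDAP-style subset request. The server slices
# the data and sends back only the requested piece, saving bandwidth.
from urllib.parse import quote

def opendap_subset(endpoint: str, variable: str,
                   t: slice, lat: slice, lon: slice) -> str:
    """OPeNDAP-style constraint expressions append [start:stop] index
    ranges for each dimension to the dataset URL after '?'."""
    rng = lambda s: f"[{s.start}:{s.stop}]"
    constraint = variable + rng(t) + rng(lat) + rng(lon)
    return endpoint + "?" + quote(constraint, safe="[]:")

url = opendap_subset("http://data.example.org/thredds/dodsC/temp.nc",
                     "air_temperature",
                     slice(0, 0), slice(100, 140), slice(200, 260))
print(url)
```

The point is that the query, not the data set, travels over the network first; only the small slice named by the constraint comes back.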
So what we've done is use a crawler to crawl the file system, extract the header of each file, and formulate that as a metadata database. The database then becomes a single window through which people can hand in a request — say, give me some images within this polygon at this time. The metadata database includes the essential geospatial information, and it returns to the user what they requested. We recently published the technical details of the GSKY implementation; you're more than welcome to have a look. Sorry, I think you're getting close to the end — there's only about a minute or two left, so if you could work towards the end. I'll quickly go through. Yes, so the next point is versioned data. Again, because of the scale of the data, we can't really store every single step of the data. What we can do is store the raw data and the final version, and keep the URIs of the metadata for the intermediate steps. That way the provenance information is kept, and we also save storage. The last point is quality data. Some users assume that if they can access data, the data is flawless; we can't really assume that. By publishing data alongside a quality report, we want to provide data access with a certain level of assurance. We also have a publication on that coming very soon. Thank you for your attention — that's our experience so far with data access. Thanks, thanks Jingbo. That was a really quick overview of all the work you've been doing around services and making data accessible, not only for humans but also for machines. So, first of all, I'd like to thank David and Jingbo again for providing insight into what it means in practice to make data accessible, from different perspectives. Those were very interesting presentations.
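The GSKY-style indexing Jingbo described — crawl the file headers once into a compact metadata index, then answer spatial and temporal queries from that index instead of touching millions of files — can be sketched like this. The file paths, bounding boxes and dates are all invented for illustration.

```python
# Sketch of a crawled metadata index: each entry records a file's path,
# geographic bounding box (min_lon, min_lat, max_lon, max_lat) and date,
# harvested once from the file headers.

INDEX = [
    {"path": "/g/data/proj/a.nc", "bbox": (110, -45, 155, -10), "date": "2016-01-01"},
    {"path": "/g/data/proj/b.nc", "bbox": (0, 40, 20, 60),      "date": "2016-01-01"},
]

def overlaps(a, b):
    """True if two (min_lon, min_lat, max_lon, max_lat) boxes intersect."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

def query(bbox, date):
    """Answer 'give me the files in this box at this time' from the index."""
    return [r["path"] for r in INDEX
            if r["date"] == date and overlaps(r["bbox"], bbox)]

print(query((120, -40, 150, -20), "2016-01-01"))
```

The design point is that the expensive crawl happens once; every subsequent request is a cheap lookup against the index, which is what makes the approach scale to millions of files.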
In case you're interested in learning more about making your data accessible and the things to think about, this slide provides some resources. The med.data project has a number of materials around sensitive data. There's a link here to the Australian Data Archive and its access conditions. On the ANDS website we have some materials on sensitive data. Another piece of work we're doing together with the community is looking at data services — the work Jingbo talked about — and making sure that the services over the data are discoverable. There is an interest group working in this space, so if you're interested in learning more and engaging around that, please follow the link; there's more information there about the data services interest group. Last year we also ran 23 (Research Data) Things, and two of the things are relevant to the topics discussed today: have a look at Thing 10 and Thing 19 if you want to learn more, get your hands dirty, and try out a little of what it means to make data accessible. The link at the bottom is the general link about the FAIR data principles on the ANDS website. So this week we talked about Accessible; next week we'll be talking about Interoperable. Thank you all for your attention. Finally, I'd like to thank our speakers for today, and I'd also like to thank NCRIS, the National Collaborative Research Infrastructure Strategy program, for funding ANDS and making all of this possible. Thank you all for your time, and I look forward to seeing you next week.