Right, so my name is Jingbo Wang. I work at the National Computational Infrastructure (NCI), a supercomputing center located on the Australian National University campus. Today I'm going to address the different flavors of data accessibility practice at NCI. Before I do that, I just want to comment that the FAIR principles are quite useful for governing our data management practice, and we use them in every single aspect of our data management.

Here is a quick overview of the datasets we have. The main data types we store at NCI are national collections of climate models, satellite imagery, bathymetry, elevation, hydrology, and geophysics, so the data are quite geospatially focused. But we also have social science data, genomic sequencing data, and astronomy data. We aim to provide users with data as a service, as many digital repositories do. In our data management, we catalog data so that people can query the metadata database to find what we hold here. We also publish data through various data services, which is the focus of the next few slides. We offer data quality assurance, data quality control, and benchmarking use cases. We provide data through virtual laboratories, and we also help with data visualization. If I wanted to name something that makes us different from other digital repositories, it is that we are co-located with a high-performance computing (HPC) facility. Given the large scale of the data (we host more than 10 petabytes of research data), we really want to make good use of the high-performance computing here to advance science research.

These are the six points about data access that I want to address today; the words in red show what is different about each point. Initially I will talk about how we control data access. Then I'm going to present one example of how we use persistent identifiers to manage data access. Next I will talk about the two main data services we offer at NCI for our users: one is THREDDS, and the other is GSKY, a more advanced, scalable, distributed data server. Finally I'm going to cover, very quickly, data versioning and data quality.

The first point is how we control data access. Most of our data come from stakeholders such as Geoscience Australia, the Bureau of Meteorology, CSIRO, and universities, and much of the data has been funded by the Australian government, so it naturally falls under a CC BY 4.0 license. Some data owners also impose non-commercial, non-derivative, or share-alike variants of the Creative Commons licenses. We also have international partners, for example in Europe and the US, and they impose even stricter terms and conditions on people who want to access the data. That is the legal perspective: controlling data access through licenses.

On the file system, we enforce data access control using ACLs. This is how we separate different groups of people accessing the same data: for each collection we have two access groups. The first group has read and write permission; these are the data managers, who are able to generate, write, and modify data. The second group is read-only, so people in that group can access the data on the file system but cannot modify anything; a small sketch of this two-group setup follows at the end of this part. This way we protect the integrity of the data and only give write access to the people who are authorized to manage it.

There is also a social aspect to data access. For a research project we often see an embargo period, so the data can only be made available, say, two years after the project. Some researchers also say they want to share their data only after the journal article about the dataset is published. Another example is from the Bureau of Meteorology: for one dataset there is a six-month delay between the data being developed and verified and it becoming operationally available on our THREDDS server.
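Going back to the file-system side for a moment: the two-group setup described above is essentially a permission pattern. Here is a minimal sketch only, assuming hypothetical group names, a hypothetical collection path, and that POSIX ACLs are applied with the standard setfacl tool; this is an illustration of the pattern, not our actual configuration.

```python
# Minimal sketch of the two-group access pattern using POSIX ACLs.
# The collection path and group names below are hypothetical placeholders.
import subprocess

COLLECTION_PATH = "/g/data/project_xy/collection"  # hypothetical collection path
MANAGER_GROUP = "project_xy_rw"                    # read/write: data managers
READER_GROUP = "project_xy_ro"                     # read-only: everyone else

def apply_collection_acls(path: str) -> None:
    """Give the manager group read/write access, the reader group read-only
    access, and remove access for all other users, recursively."""
    subprocess.run(
        [
            "setfacl", "-R",
            "-m", f"g:{MANAGER_GROUP}:rwX",  # managers can create and modify files
            "-m", f"g:{READER_GROUP}:rX",    # readers can list and read only
            "-m", "o::---",                  # no access for anyone else
            path,
        ],
        check=True,
    )

if __name__ == "__main__":
    apply_collection_acls(COLLECTION_PATH)
```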
The second point I want to raise is our practice of implementing persistent identifiers. We often experience frustration when we give people a URL to access the data: it is only valid for a certain period of time, or only as long as somebody maintains it, and after that we can't really guarantee it. If you look at the left-hand side of the slide, those are the original URLs, either metadata catalog URLs or service endpoint URLs. Take the second one, the service endpoint: from the URL naming convention you can tell that the latter part includes the project code, file path, and file name. If anything in that path changes, for example the project code or a username, or if we shuffle the files around, the link breaks. So the original URL we provide here is not a very stable one.

We adopted a product that CSIRO developed some time ago, a persistent identifier service that acts as a broker. Most of the time we now give external users the naming convention on the right-hand side. As you can see, we have four main categories after pid.nci.org.au: datasets, services, documentation, and vocabularies. The only thing that has to stay unique is the file identifier, a UUID. As long as the identifier stays the same, the URL on the right-hand side remains consistent. If anything changes in the original URL on the left-hand side, all we need to do is update the mapping inside the PID service broker, without disrupting the URL we gave to the external user. We have published the technical implementation in a journal article, so you're welcome to have a look.
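To make the broker idea concrete, here is a minimal sketch, assuming a hypothetical mapping table with made-up identifiers and target URLs; it illustrates the redirect pattern only and is not the actual CSIRO PID Service.

```python
# Minimal sketch of a PID broker: the stable URL pattern
#   https://pid.nci.org.au/<category>/<identifier>
# is resolved through a mapping table, so only the mapping needs updating
# when the underlying catalog or service URL changes.
# The identifiers and target URLs below are hypothetical placeholders.
from flask import Flask, abort, redirect

app = Flask(__name__)

PID_MAP = {
    ("dataset", "11111111-2222-3333-4444-555555555555"):
        "https://catalogue.example.org/metadata/11111111-2222-3333-4444-555555555555",
    ("services", "66666666-7777-8888-9999-000000000000"):
        "https://thredds.example.org/thredds/catalog/project_xy/catalog.html",
}

@app.route("/<category>/<identifier>")
def resolve(category: str, identifier: str):
    target = PID_MAP.get((category, identifier))
    if target is None:
        abort(404)
    # If the project code, file path, or host changes, only PID_MAP is
    # updated; the PID URL held by external users keeps working.
    return redirect(target, code=302)

if __name__ == "__main__":
    app.run()
```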
Now I'm going to talk about the main data services, which Keith really wanted me to address from NCI's perspective. I divide our data services into two main groups. One is the OGC services; I'll explain what OGC is in a second. The other type is more project-specific. For example, we are one of the largest nodes in the southern hemisphere of the Earth System Grid Federation, which aggregates climate models from research institutes around the globe; the way we provide that service is to replicate the main climate model data to serve Australian users. Another data service that I'm going to show you in a bit more detail is called GSKY, a scalable data server that interacts directly with our file system.

So what is OGC? OGC is the Open Geospatial Consortium, an international non-profit organization that makes quality open standards for the global geospatial community. We find the OGC standards quite useful because we have a lot of geospatially featured data, and OGC has standards for all sorts of mapping, feature, coverage, and processing use cases. Because the standards are so common and free to use, if we make a dataset available through OGC standards, a lot of people can naturally access our data. That's the motivation.

So what is an OGC service? It is effectively an API sitting between the data store and the user, and the user can request whatever is available through it. Say I want a map of an anomaly across the whole Australian continent, and NCI hosts that data; but we host data, not images. What the OGC web service does is render the image from the data and return it to the user, and the user can take the URL containing that image and put it on their own web portal. For example, you can copy and paste the URL onto NationalMap to display the grids.

NCI has two main production data services. One is THREDDS. You can often find the THREDDS endpoints in our data catalogs. This is the GeoNetwork interface; the link circled in red is the NCI THREDDS server, which you can click and open. The second interface is the data catalog. They contain more or less the same information but serve different purposes: GeoNetwork is mainly for data harvesters, so machine access, while the data catalog is for human readability.

THREDDS, in very simple terms, is a data service that lets you browse and access the data. I've listed here the six main types of services that THREDDS offers. The first two, OPeNDAP and the NetCDF Subset Service, are for subsetting the data. We have a lot of very large datasets, but in practice, when scientists access the data they don't necessarily need all of it; they might only need a very small piece from this big pool. What THREDDS offers is the ability to define your query and get back only the part of the data you want, which saves a lot of network traffic; a small subsetting sketch follows at the end of this part. The next two are the standard OGC Web Map Service and Web Coverage Service, which are very popular ways to access maps and coverages directly from our data. THREDDS also offers a quick data viewer, so if you don't know what a dataset is you can have a quick look at it on the web without downloading it, and of course it offers direct download if you really do want the files.

The other scalable, distributed data server I mentioned is called GSKY. GSKY is an in-house product developed at NCI. We have a huge number of files on the file system, millions and millions of them, so how can we let people query this data? It would be very hard to create millions of metadata records, one for every single file. What we've done instead is use a crawler to crawl the file system, read the header of each file, and build a metadata database. That database then becomes the query endpoint where people hand in requests such as "give me the images within this polygon at this time". The metadata database contains the essential geospatial information and returns to the user what they requested; a rough crawler sketch also follows below. We recently published the technical details of the GSKY implementation, and you're more than welcome to have a look.
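To show what the subsetting services mean in practice, here is a minimal sketch of an OPeNDAP-style subset request made through xarray; the server URL, variable name, coordinate names, and region are hypothetical.

```python
# Minimal sketch of server-side subsetting via OPeNDAP on a THREDDS server.
# The dataset URL, variable name, and coordinate names are hypothetical.
import xarray as xr

OPENDAP_URL = (
    "https://thredds.example.org/thredds/dodsC/project_xy/air_temperature.nc"
)

# Opening the remote dataset reads only its metadata; no array data yet.
ds = xr.open_dataset(OPENDAP_URL)

# Ask the server for a small window in space and time; only this slice is
# transferred over the network, not the whole source file.
subset = ds["air_temperature"].sel(
    lat=slice(-44.0, -10.0),   # roughly the Australian continent
    lon=slice(112.0, 154.0),
    time="2015-01",
)

print(subset.shape)  # values are fetched lazily when first accessed
```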
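And here is a rough sketch of the crawler-plus-metadata-database idea behind GSKY; this is not the actual GSKY code, and the file layout, coordinate names, and SQLite schema are assumptions made just for illustration.

```python
# Minimal sketch of the crawler idea: walk the file system, read only each
# NetCDF file's header and coordinates, and store the essential geospatial
# metadata in a small database that can be queried by bounding box and time,
# instead of creating a catalog record for every single file.
import os
import sqlite3
import xarray as xr

DB = sqlite3.connect("file_index.db")
DB.execute("""CREATE TABLE IF NOT EXISTS granules (
    path TEXT PRIMARY KEY,
    lat_min REAL, lat_max REAL,
    lon_min REAL, lon_max REAL,
    time_min TEXT, time_max TEXT)""")

def crawl(root: str) -> None:
    """Index every NetCDF file under 'root' using only its header/coordinates.
    Assumes coordinate variables named lat, lon, and time."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".nc"):
                continue
            path = os.path.join(dirpath, name)
            with xr.open_dataset(path) as ds:  # lazy open: header and coords only
                DB.execute(
                    "INSERT OR REPLACE INTO granules VALUES (?,?,?,?,?,?,?)",
                    (path,
                     float(ds.lat.min()), float(ds.lat.max()),
                     float(ds.lon.min()), float(ds.lon.max()),
                     str(ds.time.min().values), str(ds.time.max().values)))
    DB.commit()

def query(lat0, lat1, lon0, lon1, t0, t1):
    """Return file paths whose extent overlaps the requested box and period."""
    return [row[0] for row in DB.execute(
        "SELECT path FROM granules WHERE lat_max >= ? AND lat_min <= ? "
        "AND lon_max >= ? AND lon_min <= ? AND time_max >= ? AND time_min <= ?",
        (lat0, lat1, lon0, lon1, t0, t1))]
```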
(Session chair: I think we're getting close to the end; there is only about a minute or two left, so if you could go through the rest quickly.)

So the last two points. The first is data versioning. Again, because of the scale of the data, we can't really store every single intermediate version. What we do is keep the raw data and the final version, and keep the URIs of the metadata for the intermediate steps. In that way the provenance information is preserved while we also save storage.

The last point is data quality. Users can't simply assume that they can access the data and that the data is flawless, so by publishing each dataset alongside a quality report we want to provide data access with a certain level of assurance. We also have a publication on this that will be available very soon. Thank you for your attention; that's our experience so far with data access.