Good morning, everybody. Welcome back to the FAIR Data 101 training course. My name is Matthias Liffers, and I would like to start by acknowledging the traditional owners of the land on which we all are. I'm in Perth, so I would like to acknowledge the Whadjuk people of the Noongar Nation. I would like to pay my respects to their elders past and present. Thankfully, despite a crazy storm moving in over Perth overnight, I still have internet and power, so I'm able to present to you this morning. What we'll be covering today is part one of Accessible, the A in FAIR. Just a quick reminder: this course has a code of conduct, and I welcome you to review it at any time at the URL. If at any time you observe a breach of the code of conduct, could you please report it to the ARDC, and we will follow it up. All right, so today's agenda. No messing about, let's get straight into it. I do have a fair bit to cover, and it gets a little bit technical in places, so hold on to your hats. As I said, we're covering the A in FAIR, Accessible. Out of the four guiding principles, I'm going to aim to cover three of them, because they're all interrelated, and they are specifically about the technical side of accessing data and metadata. So there will be quite a bit about machine accessibility of data this morning. When it comes to machine accessibility, we're going to be talking about protocols: the various different protocols that work together in layers. I've got a nice diagram to show you of how they work together to bring data from one place to another, and I will be giving a couple of examples of how these protocols can work together to do what it is they need to do for you. All right. Two weeks ago, Liz and I covered the F in FAIR: how to find the data. Accessibility is: all right, we know where the data or the metadata is. How do we actually get to it?
How do we get our hands on it to be able to work with that data? I've put all three principles up on the slide at once because they are all connected to each other. A1: we would like metadata and data to be retrievable by their identifier using a standardized communications protocol. A1.1: we would also like that protocol to be open, free, and universally implementable. And finally, A1.2: the protocol allows for an authentication and authorization procedure where necessary. Liz will be speaking a bit more later this week about the ins and outs of where this kind of authorization might be necessary. Why do we care about machine accessibility? I did cover that in my last webinar: given the increasing volume and speed of data generation and collection, researchers are relying increasingly on computers to process that data for them. I'll give a relatively simple example of where we might like to do that. Let's imagine we are running an experiment that requires close-to-real-time knowledge of how warm it is. Say we're looking at solar panel generation across an entire city. We've already got a source of data for how those solar panels are generating, but we want to see how the weather might affect that. We don't necessarily want to put sensors with each and every set of solar panels, because that's a lot of work. Thankfully, there is a well-established organization that collects weather data and reports it at quite frequent intervals. In fact, they update this data every 10 minutes, as it says right there. The Bureau of Meteorology has weather observations across the entire country and publishes them every 10 minutes to its website.
If we want to incorporate this data into our experiment, one way we could do it is by visiting this page every 10 minutes, grabbing the latest figure and plugging it into our experiment, which to me sounds like an awful lot of work, even if you only have to do it during daylight hours for solar panels. What we would like to do instead is get a computer to visit this page every 10 minutes and get the latest data for us. Now, this web page is designed for humans. If we go to the HTML source of the page, we'll see that the first temperature is down here on line 296 of the HTML source. From a programming point of view, you actually have to teach the computer a fair amount to make it ignore everything up until that point and find the exact value you're looking for. So what the Bureau of Meteorology has done is also make all of this data available in JSON format. If we go back to our web page, and this is just a screenshot, there is a link, further down the page, to this data in other formats. If you click on that, you can get through to this JSON file. If we have a look here, this is the very beginning of the file, and it starts with this curly brace. It's actually not very long before the important data appears, only about 20 lines in, and with this nicely structured JSON it's only three layers down in the nested hierarchy. That is much easier to teach a computer to access than it would be to go through an entire HTML web page, which was designed for humans.
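To make that concrete, here is a minimal sketch in Python of teaching a computer to do exactly this: pull the latest temperature out of the nested JSON and check it every 10 minutes. The feed URL and the exact key names (observations, data, air_temp) are assumptions based on the structure described above, so check them against the real feed before relying on this.

```python
import json
import time
import urllib.request

# Hypothetical observations feed; the real URL is linked from the
# "other formats" section of the observations page.
OBS_URL = "http://www.bom.gov.au/fwo/IDW60901/IDW60901.94608.json"

def extract_air_temp(doc: dict) -> float:
    # The newest reading sits three layers down in the nested JSON:
    # observations -> data -> first entry -> air_temp
    return float(doc["observations"]["data"][0]["air_temp"])

def latest_air_temp(url: str) -> float:
    """Fetch the JSON feed and return the most recent air temperature."""
    with urllib.request.urlopen(url) as resp:
        return extract_air_temp(json.load(resp))

def poll(url: str, interval_s: int = 600) -> None:
    """Check the feed every 10 minutes, like a very patient research assistant."""
    while True:
        print(f"Current temperature: {latest_air_temp(url)} C")
        time.sleep(interval_s)
```

The point is that the parsing step is three dictionary lookups, rather than wading through hundreds of lines of human-oriented HTML.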
So what we can do with this JSON data is get the computer to check it every 10 minutes, grab the important value and plug that into our experiment, meaning we don't need to constantly refresh web pages, unless it's going to be a record high and we just want to tell our workmates about the weather. To be able to get a computer to grab data from somewhere else, we need to use protocols. Now, you might already be familiar with the more human-centric, probably somewhat older, definition of protocol: an accepted code of conduct or acceptable behavior in a given situation or group. A computer protocol really isn't that different. It is an accepted way to behave that is predictable and follows rules, so that other computers know what is going on as well. The computer definition, from whichever dictionary you take as your source, is a set of formal rules describing how to transmit or exchange data, especially across a network. Now, what would we like these protocols to be? I've picked some keywords out of the A1 and A1.1 guiding principles: we would like these protocols to be standard, open, free and universally implementable. In terms of open and free, there can be overlapping definitions between the two. What we would ideally like is for the standardized protocol to be openly available in the same way as open access: everybody can access it without any barrier of any kind, including cost. Free could mean the same in terms of cost, or it could mean free as in everybody is free to contribute and to implement, free as in libre, as some open source nerds like to call it. And finally, universally implementable: everybody should be able to implement this standard protocol without any barriers, and there is enough detail in the standards to be able to do that.
Why do we want all of these principles to apply? So we can trust what's going on. So we can trust that the computer or program on the other end of the connection is delivering the data to us in a format we know will work, via a method we know will work, and we don't need to worry about coding absolutely everything from scratch, from how two computers talk to each other over a blue cable, up to how the data gets from your computer on this side of the country to a computer on the other side of the country. A lot of the protocols in use have standard identifiers: they have PIDs, in effect. You might not have heard of all of these, but hopefully you've heard of quite a few. You're probably familiar with HTTP, the Hypertext Transfer Protocol, which is what web browsers use to connect to web servers and grab web pages. Ethernet is that blue cable that you have hopefully plugged your laptop into your router with, to get the best possible internet connection while you're working from home; otherwise you're using your Wi-Fi, which might not be as reliable. We had some questions in Liz's webinar about XML versus JSON. If you'd really like to know the differences between the two, the standards are available via their identifiers, so you can inspect them and compare and contrast them. Now, we see that four of the standards here are using DOIs. They're all from the same organization, which has chosen to mint DOIs for all of its protocols. Two of the standards, to do with physical network connectivity, Ethernet and Wi-Fi, are from the IEEE. MQTT is a protocol used in the Internet of Things; it's a very lightweight, low-power protocol for shuffling small amounts of data very quickly. Unfortunately they haven't come to the PID party, as it were. And then for XML, we have this URL, which is a persistent identifier in and of itself.
So you can always get to the XML standard by visiting that URL. All right, that's a lot of protocols. Why is it that we need so many? Well, apart from if you're trying to transfer internet data with birds, what you might like to do is have a look at the different layers involved. Protocols are like ogres, who in turn are like onions: they have layers. Protocols build on each other to get data moving around. Okay, now, I'm sorry for exposing you to this diagram. This is something you might learn in, say, a university networking course: the idea of a layered model, where protocols build on top of each other in order to shift data around. At the very bottom we have the link layer, which deals with the physical connection between two devices, and how those two devices, the two ports, communicate with each other over a cable, or, in the case of Wi-Fi, how two radios communicate and swap data between them. On top of that we have an internet layer, which deals with how two computers on the internet communicate with each other. It doesn't care what's happening below on the link layer. Some computers are using Ethernet, some are using Wi-Fi; the internet layer doesn't care about that. It's dealing at a higher level, as it were. You know what, I'm going to skip to the next slide, because I've actually put some helpful diagrams on it. As on the slide before, we've got the different protocols in use. Each and every one of the things on the tree diagram on the left is a protocol, and you will hopefully recognize HTTP and Ethernet. HTTP is layered on top of TCP, the Transmission Control Protocol, which is layered on top of the Internet Protocol, which is layered on top of the Ethernet protocol. So why am I showing this to you? It's because the word protocol is an incredibly loaded term. There are a lot of protocols, and no protocol really works in isolation. It depends on other protocols.
It works with other protocols in order to get the job done. So if you are talking to a librarian or a researcher or somebody in the IT department about protocols, it's important to make sure there's a common understanding of what kind of protocol you're talking about. You could say, for example: well, my data is available, you just have to plug a cable into my laptop and then you can get your data from me. If we're using the Ethernet protocol, that is a standardized protocol. Actually, it's not entirely open; it does cost money to get the standards from the IEEE. But it is a standard protocol. Is that not FAIR? Not entirely, because we would like to have a few more protocols working hand in hand so that you don't need to plug in a physical cable, so that you can access something remotely as well. That being said, when it comes to the actual crux of the situation, making our data and metadata available, or accessing metadata and data from somebody else, we only really care about the very top layers, because we can assume that the layers below are going to work. For example, most, touch wood, universities in Australia use AARNet, and AARNet provides the network infrastructure, the cables, as well as the routing infrastructure, that works and that we can trust, so that we can get our data from point to point without worrying about digging trenches and putting down cables ourselves. All right, mostly; it depends on where you're working. AARNet's great within Australia, but if you're trying to grab data from, say, a mountaintop in the Himalayas, you might have to work out some other way of transmitting that data on the physical layer, because there isn't very good phone service at the top of the Himalayas. All right, let's get into some examples of how these protocols all work with each other.
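To give a feel for what the layering means in practice, here is a sketch of speaking HTTP by hand over a TCP socket. Everything below TCP (IP, then Ethernet or Wi-Fi) is handled by the operating system; HTTP itself is just structured text sent over that reliable connection. The host name is a placeholder.

```python
import socket

def build_http_get(host: str, path: str = "/") -> bytes:
    """Compose a minimal HTTP/1.1 GET request. This text is everything
    HTTP adds on top of the byte stream that TCP already provides."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",  # a blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

def fetch(host: str, path: str = "/") -> bytes:
    """Open a TCP connection (riding on IP, which rides on Ethernet or
    Wi-Fi) and send the HTTP request over it by hand."""
    with socket.create_connection((host, 80)) as sock:
        sock.sendall(build_http_get(host, path))
        chunks = []
        while data := sock.recv(4096):
            chunks.append(data)
    return b"".join(chunks)

# fetch("example.org") would return the raw HTTP response, headers and all.
```

In everyday code a library handles all of this for you; the sketch just shows that each layer only has to worry about the layer directly beneath it.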
Now, the first example, which I'll go into in a bit of depth, is about how repositories can share metadata with each other. You may have heard of OAI-PMH, which is very widely used, especially in library and institutional repositories. It is a method used to harvest metadata, to transfer metadata from one repository to another. So, for example, an institutional publications repository might want to have its metadata harvested by, say, a centralized search service, so that end users only have to go to one place to search all of the repositories in Australia. For example, Research Data Australia. OAI-PMH stands for the Open Archives Initiative Protocol for Metadata Harvesting. You won't be tested on that, don't worry. OAI-PMH is in and of itself a protocol. It shares Dublin Core metadata, plus other metadata standards, but let's stick with Dublin Core for this example, which could be considered a protocol as well, as XML, yet another standard, over HTTP. So when the repositories talk to each other, they behave like a web browser and a web server. How does that work? Here's a cool project at Griffith University, the Prosecution Project. They have digitized metadata about crimes and prosecutions in colonial Australia. The data I was looking at was largely pre-Federation; I'm not sure if they have anything post-Federation. They have a repository, and they make their metadata records available through an OAI-PMH API. API, that's a term I haven't defined. You'll hear the word API every now and again. An API is a method for two computer programs to talk to each other. They're not interfacing with a human; it's the two programs talking to each other using an API, which uses a bunch of protocols to shift data around. Okay. This URL here is to access the, sorry, there are too many acronyms in this presentation, the OAI-PMH API.
And you can get metadata records. Let's break it down. First up, we can see that it uses HTTP, or in this case HTTPS; the S is for secure, so the connection between the two computers is verified with a security certificate. Next up, we have a host name, the same as in any other kind of URL, like www.ardc.edu.au, except in this case it's oai.prosecutionproject.griffith.edu.au. Then, when we're accessing that server, we ask for this file or directory, forward slash oai; that's the end bit of the URL there. And after that, we are providing some instructions to the API as to what we want, encoded in the URL. First, there is a parameter called verb, and we're saying the verb we want is ListRecords; we're instructing the server to give us a list of records. The next instruction is: please give it to us in the form of Dublin Core metadata. You can actually visit this URL in your web browser, and you'll get something that looks like this. It might be different depending on when you connect to the API, because they do update and change metadata, things like that; the data you get back will depend on when you access this particular server. What we have here is some XML with Dublin Core embedded within it. We can see this first record here: transcription of trial record, Thomas Matthews, assault with intent to rob in company, Melbourne, 1852. We've got all sorts of metadata fields around that particular record. So what can we do with this? Well, using our own software that we create, we can harvest records from the Prosecution Project and use those records to do our own kind of analysis. If you're investigating crime in pre-Federation Australia, the Prosecution Project could be a good source of data.
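A harvester along those lines can be quite short. This sketch builds a ListRecords URL and pulls the dc:title out of each record with the standard library's XML parser; the endpoint discussed above would be passed in as base_url. It is a simplified illustration, not a full harvester.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Dublin Core element namespace used inside OAI-PMH responses.
DC = "http://purl.org/dc/elements/1.1/"

def dc_titles(xml_text: str) -> list:
    """Extract every dc:title from a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(f"{{{DC}}}title")]

def harvest_titles(base_url: str) -> list:
    """Ask an OAI-PMH endpoint for its records as Dublin Core,
    exactly as a web browser would request the same URL."""
    url = f"{base_url}?verb=ListRecords&metadataPrefix=oai_dc"
    with urllib.request.urlopen(url) as resp:
        return dc_titles(resp.read().decode("utf-8"))
```

A real harvester would also follow the resumptionToken that OAI-PMH uses to page through large result sets, but the core idea, an HTTP request with instructions encoded in the URL and XML coming back, is all there.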
Now, I am still putting together the fine details, but I'm hoping to get everybody to use an API in the activities for this module. Hopefully that'll be good fun. Okay. So not all standardized communications protocols are about swapping XML metadata around. Earlier I mentioned the MQTT protocol, and don't ask me what that stands for, I've actually forgotten, but it is used in the Internet of Things and in sensor networks. If you have built a sensor network, or you're a researcher who wants to build a network of sensors, say temperature loggers or humidity sensors, placed wirelessly around a building so you know what the temperature is in each room, or in several buildings, you can use something called the MySensors framework to build that network of sensors. And if you use MySensors, one of the options for getting the data out of that sensor network is to get it in JSON format over the MQTT protocol. Again, we haven't spoken about the lower levels, because like HTTP, MQTT is one of these high-level protocols, and it assumes you already have the rest of the protocol stack set up and working. All right, now, I'm getting through this a little faster than I thought I would, which is nice, because that means we'll have plenty of time for questions. Okay. It can get quite confusing, because there are so many protocols available. This would have been an excellent opportunity to put in an XKCD comic, because I'm sure they've got something about this, but there are lots of protocols, and you can pick and choose different protocols to do what you need to do. There is an almost infinite number of combinations. Now, the good thing is that we, people working in research support, don't need to memorize all of the communications protocols and know how they work and what the problems and pitfalls are with each one.
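Going back to the sensor network for a moment, here is what consuming those readings might look like. The topic layout and payload fields are invented for illustration; the broker wiring in the comment uses the third-party paho-mqtt library, since Python's standard library has no MQTT client.

```python
import json

def handle_reading(topic: str, payload: bytes) -> tuple:
    """Decode a JSON temperature reading published over MQTT.
    Assumes topics like sensors/temperature/<room> carrying
    payloads like {"temperature": 22.5} -- both are assumptions
    for illustration, not the MySensors wire format."""
    doc = json.loads(payload)
    room = topic.rsplit("/", 1)[-1]
    return room, float(doc["temperature"])

# Subscribing to a broker needs an MQTT client library, e.g. paho-mqtt:
#
#   import paho.mqtt.client as mqtt
#   client = mqtt.Client()
#   client.on_message = lambda c, u, m: print(handle_reading(m.topic, m.payload))
#   client.connect("broker.example.org", 1883)
#   client.subscribe("sensors/temperature/#")
#   client.loop_forever()
```

Notice how little of this concerns the network: MQTT, like HTTP, sits at the top of the stack and trusts the layers beneath it.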
We can talk to more technically minded staff. We have research software engineers: RSEs come from either a software engineering or a research background, and they combine an understanding of research with really deep technical knowledge of software engineering, so they can build software that supports research. Similarly, there are data engineers, who are familiar with all the standards and with how to engineer together a system, some kind of data pipeline, to get data from A to B with a bit of processing along the way. And hopefully now they will not be able to bamboozle you by talking about bunches of different protocols; you'll have some understanding of what they're talking about. What this also means is that, when it comes to sharing data or making data accessible, given the diversity of standards and protocols, it is quite unlikely that a single repository solution will serve the needs of every single researcher at an organization. For example, your traditional data repository will be geared mostly towards flat files. You can load it up with tabular data, a spreadsheet or a CSV, or JSON or XML files, and people can grab those files, but it will not necessarily be able to offer a wealth of different APIs and protocols for harvesting that data in different ways. Here's a good example: the Square Kilometre Array, part of which is being built north of me, up in the Murchison. The sheer volume of data that the instrumentation up there generates is so huge that it's simply not feasible or practical to involve an institutional repository. It requires incredibly specialist processing and communications equipment, and accessing that data, again, would be handled by specialist solutions developed just for that data.
But they would be using standardized communications protocols. That being said, the metadata, and information on how to access that data, could be placed in an institutional repository, and the record in that repository would point to a different location. So, very importantly, data and metadata don't necessarily need to be co-located. Okay. This was really just the tip of the iceberg, talking about some of the research infrastructure for shunting data across the world. There is this topic, or discipline, of infrastructure literacy: the knowledge that it would be really nice for researchers and research support professionals to have, in order to get the most out of the huge amounts of money, millions if not billions of dollars, put into the infrastructure that supports us. AARNet in particular has done a pretty good job of developing modules to teach researchers how to use its infrastructure. Dr Sarah King, one of their trainers, would be willing to give you further training in AARNet's offerings, for example. It might be tricky at the moment, it might have to be done by Zoom, but get in touch with her if you'd like some free training. Okay, now I've spoiled my own slide: data and metadata do not need to be co-located. All right, let's get past that. Up until now, I haven't spoken about authentication and authorization at all. Since we're running out of time, I'll have to spend far less time on this than I'd like. However, everybody here should already be familiar with authentication and authorization procedures in the form of usernames and passwords. If we are authorized to access a resource, we are provided with the authentication credentials to access that resource. The how and why of authorization is what Liz will be covering on Wednesday.
However, while usernames and passwords are good for humans, they're not necessarily used very much by computers when they're talking to each other, especially via APIs. What is more common for a computer to use is an API key. API keys should be considered as protected and private as a password: if you are ever given an API key to access an API, treat it as securely as you would treat a password. You could think of an API key as a password without a username. It will probably be quite long, randomly generated, and unique to you or to the service that an app is trying to access. During the activities, you will hopefully be getting an API key to access Trove, which is from the National Library of Australia, so please treat that API key quite carefully. Okay, now, infrastructure and policy. For authorization and authentication, infrastructure and policy need to work together to ensure that data that needs to be kept safe is kept safe, but is still made available to those who need access to it and who are authorized to access it. For example, there are numerous centres around Australia for data linkage. A lot of these centres work with sensitive health data, and they want to be able to link different patient records together to draw conclusions and come up with answers to research questions around health outcomes. They are authorized to access that data, and they have certain ways of accessing it and bringing it into their secure processing environments, which could use these authentication procedures. They have systems built using authorization, usernames, passwords or keys, to shift the data around and keep it safe. Very important to keep it safe. Okay, so that is it for me.
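One practical habit worth showing before we move on: since an API key deserves the same care as a password, keep it out of your code entirely and read it from an environment variable. This is a sketch with a placeholder endpoint and parameter names; the real API's documentation will give you the actual base URL and parameters.

```python
import os
import urllib.parse

def api_query_url(search_terms: str) -> str:
    """Build an API request URL with the key drawn from the environment,
    so the secret never ends up in your code or version control.
    The endpoint and parameter names here are placeholders."""
    api_key = os.environ["MY_API_KEY"]  # raises KeyError if the key isn't set
    params = urllib.parse.urlencode({"q": search_terms, "key": api_key})
    return f"https://api.example.org/v2/result?{params}"
```

If the key ever does leak, most services let you revoke it and mint a new one, which is another reason keys are handier for machines than passwords.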
Next up, the next webinar will be on Wednesday, when Liz delves into this idea of the authorization-to-access continuum from closed to open data, and how to make sensitive things accessible while still keeping them safe. That will be at the time on your screen. Now, I believe we have some time for questions. Are you there, Liz? Hi, Matthias. Yes, we've got time for questions, although there aren't any questions in the question box or the chat at the moment, except for a "nice job tackling protocols, Matthias" from one of our participants. So I invite anyone, if this dive into protocols has got you thinking, or has got you floundering, the floor is yours. Please have at it in the question box and share some of those questions with us. Even if it's something like: Matthias, what was that first protocol you shared with us? What was the first protocol I shared with you? Possibly Ethernet, I can't recall. Now, if you are still digesting and need a bit of time for all of that to settle before you come up with questions, you can ask me in Slack, and there will be an opportunity in the community discussions next week as well to have a chat about it. Matthias, we've got a question, and I'm going to ask you now: how much of this are researchers required to know? As much as is required for them to be able to get their work done. I'd like to think that researchers can rely on research support professionals like us to know this kind of stuff for them, so that they can consult with us, get the solution they need, and then get on with their research. I mean, most people didn't get into research to deal with administration or IT; they got into research to do the research. However, if a researcher is working on the cutting edge, deploying sensor networks and things like that, it might be useful for them to know how the technology works so that they can account for it in the design of their experiment.
So, you know, particularly time-sensitive things, or huge volumes of data, need special treatment, and if you require incredible precision in an experiment, you really need to know how your data is being generated to understand what kind of errors might crop up. Matthias, that sounds like it might lead into the next question I have for you, which is: what are the key questions to consider when assessing whether an application we're thinking of using for our data is an accessible application, in terms of these types of protocols? I would see whether, and unfortunately I'm probably going to use some TLAs, three-letter acronyms, that particular system or software solution has APIs, and especially APIs that are well documented, so that you and anybody else can access the documentation of the API and then construct your own solution to talk to it. Or, say, let's go to the good old institutional repository solution: I'd like to implement a repository, and I want to make sure that it can be harvested by something like Research Data Australia. I need to make sure that the solution has the correct APIs and uses the correct protocols to let Research Data Australia harvest it. So check the documentation, ask hard questions of the developers or the vendors, and make sure you get what you need and what you want, and, especially in terms of commercial software, make sure you get what you paid for. Great. Thanks, Matthias. I have another question for you, with an apology prefacing it, just in case it might be out of scope, and potentially it is, but I'll ask it anyway and you can handball it. Could you talk a little bit more about how linked data works? For example, does everyone use the same protocols in linked data? Okay. So, linked data and linked open data, certainly to me and my understanding of it.
And in fact we might get into this in Interoperability, so good job handballing this to future Matthias, but linked data, to me, is more a set of principles around linking data records together with identifiers. Now, there are some very common standards used for linking these data to each other. Sorry, it's been a little while since I've touched on this deeply, so I might need to bone up, but we'll see how we go. There is a metadata standard called RDF, the Resource Description Framework. It is based on XML; it uses XML for its structure. But linked data doesn't have to be expressed as RDF. You can express linked data in different formats, for example in JSON. So, in short, linked data is this principle of structuring data, but it can use a variety of different standards and different protocols. More of a paradigm, I would say. Okay, I've got a couple more questions, and then I think that might be it for today's session. Here's one: for many researchers, will the focus be on machine-accessible protocols or human-readable policy? What about for research support professionals, or technologists and developers? Well, to be honest, it's important to have both machine readable and human readable. You can make your data machine accessible and machine readable, but without the human-readable documentation and policy behind it that describes to humans how these things work, humans would not be able to implement them. As I said earlier, when you're considering a system, make sure the documentation is up to scratch. Make sure that whatever APIs they've developed have good documentation, because those APIs are next to useless unless there is a way for a human to learn how they work, so that the human can then develop their own software or solution to access that API.
In the same way, having only human-readable policy and documentation is next to useless to a computer if there's nothing machine accessible or machine readable to work hand in hand with it. Nice one. Okay, so our final question, which on reading might be a nice round-up one, going back to some of your earlier points, Matthias: what does the researcher need to consider for accessibility if the researcher is mainly concerned with sharing their primary data? Okay. This is probably more about what Liz is going to be covering on Wednesday. Primary data is incredibly valuable and incredibly personal, and I certainly understand why many researchers would be reluctant to share that data. I mean, there's always this fear, a well-founded fear, of being scooped. I did hear the other day that, when it comes to primary data, your average researcher already has a 12-month advantage over anybody else trying to understand that data. So if you made your primary data available, it would take another researcher around 12 months from getting a copy of that data to being able to understand it and actually write any publications out of it. But I will otherwise hand that over to you, Liz, to deal with on Wednesday. Thank you. I shall take that on notice. All right. Well, that's it for questions, and a commendation on your answer to the question about the focus for researchers on machine-accessible protocols versus human-readable policy. Thank you very much for facilitating those questions for us. Now, as I said, I am on Slack and you can ask me questions there. Some people have sent me private questions and private messages, but if you think your question could be of interest to the rest of the community, please ask in the general channel so everybody else can see it. I've seen that there's already been some great discussion about the Australian Data Archive. Otherwise, I think that's it from me.
Was there anything more from you, Liz? Just to remind everyone to fill in the post-webinar survey. Thank you for your feedback on our last webinars; it was really valuable, and we continue to look forward to your suggestions and ideas on how we're going. Yes, great. Thanks for that, Liz, and thank you everyone for coming. I will see some of you next week during our community discussions.