Good afternoon, everybody, and welcome to this final webinar of the FAIR Data 101 course. First of all, I'd like to acknowledge the traditional owners of the lands on which we all are. For me in Perth, that is the Whadjuk Noongar people, and I'd like to pay my respects to their elders past and present. I'd also like to extend that respect to any First Nations people attending this webinar. So, yes, the final webinar: we'll be finishing up the reusable side of things and then talking a bit about FAIR beyond data. I did say at the very beginning of the series that the FAIR Guiding Principles were never intended to be just about data, but we've been spending a lot of time talking about data, so today we'll talk a little about other things that can be made FAIR, as it were. A reminder that the FAIR Data 101 course is governed by a code of conduct, and if you observe any breaches of it, please report them to the ARDC via the form linked from that code. I'd also like to remind you that you can enter your questions at any time in the question module here in GoToWebinar, and there will be a facilitated Q&A session after I've finished talking; Liz will be joining us for that. So, as I said, we're going to finish the reusable side of things, and I will mostly be talking about provenance: the provenance of research data, not the provenance of artworks or antiquities, if like me you watch too much Antiques Roadshow. Then I'll talk about FAIR beyond data: software and training materials. And I'll also outline a few options open to you if you want to continue your education or practice in FAIR. Alright, let's get right into it. Principle R1.2, as Liz described on Monday, says that metadata and data are associated with detailed provenance. And that is a little bit problematic.
In fact, keeping detailed provenance can be tricky, but it can also be easy. One area in which it is particularly tricky is when people work a lot with spreadsheets, in a program like Microsoft Excel or Google Sheets. And look, I'm absolutely guilty of this kind of behavior myself. I will get a data set, open it up in a spreadsheet program, mess around with it, move things around, clean things up, add columns, remove columns, then save it and move on. The issue is that I haven't made a concerted effort to record what I've actually done to that spreadsheet, and unfortunately Excel doesn't really have a formal mechanism for recording that kind of thing either. In fact, the closest thing to a recording mechanism is the undo and redo history, and, sorry, it's not just Excel I'm bashing, this applies to any spreadsheet software, that undo and redo history can vanish when you save the file or open it again from scratch. Now, there are other tools for working with tabular data that do save, or permit you to save, a nice detailed history. One of my favorite tools is OpenRefine. It allows you to open files, generally tabular data, manipulate that data, and then, alongside saving the data, also save the complete history of what you've done to it. In fact, you can load that history and apply it to a different data set, which is quite interesting. If you'd like to learn more about OpenRefine, I strongly suggest you look for the next available Library Carpentry course in your area, or indeed online at the moment. Another way to record detailed provenance of data manipulation is to separate the process from the data.
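To make that idea concrete, here is a toy sketch of an OpenRefine-style operation history: each edit is recorded as data, so the recipe can be saved, inspected, and replayed on another data set. The operation names and data are made up for this illustration; they are not OpenRefine's actual export format.

```python
# Toy illustration: a recorded history of operations, kept separate
# from the data, that can be replayed on any compatible data set.
# Operation names here are invented for the sketch.

history = [
    {"op": "trim-whitespace", "column": 0},
    {"op": "uppercase", "column": 1},
]

def apply_history(rows, history):
    """Replay each recorded operation, in order, over the rows."""
    for step in history:
        col = step["column"]
        if step["op"] == "trim-whitespace":
            rows = [[c.strip() if i == col else c for i, c in enumerate(r)]
                    for r in rows]
        elif step["op"] == "uppercase":
            rows = [[c.upper() if i == col else c for i, c in enumerate(r)]
                    for r in rows]
    return rows

data = [[" alice ", "perth"], [" bob ", "sydney"]]
cleaned = apply_history(data, history)
# The same `history` can later be replayed on a different data set,
# which is exactly what OpenRefine's exported histories allow.
```

Because the history is itself just data, it doubles as a provenance record of every change made.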
Because really, when we're recording provenance, what we want is to know what the original data was, what the process was, and what the resulting data was, and then link all of that through some kind of relationship, so you can explore backwards and forwards. I have a data set; how did it get to be in its current state? Oh look, there's the process that was used to transform the older data into this new state, and you can trace that through. And by investigating that process, you can establish whether the data is trustworthy. So I'd like to remind you of one of the activities we asked you to do a few weeks ago: the GLAM Workbench, using a Jupyter notebook to interrogate the Trove API, retrieve data, and manipulate it. Well, guess what? That was provenance. The entire Jupyter notebook programmatically defines not just the transformations but, first, what data we retrieve from the Trove API and how we retrieve it, through to any transformations of that data, and then how the data is displayed. That is the full provenance. So if you completed that activity, you essentially recorded the provenance of what you were doing. Fantastic. This is why I personally, and not just me, a lot of people, would really like to encourage a move away from using only a spreadsheet program to manipulate data in ways that can't easily be recorded or logged, and instead towards using some kind of formal programming language. In that Jupyter notebook we used Python, which is one of the most popular languages for scientific computing these days. Another quite popular language is R, and hopefully you've heard of one or both of these.
By defining your transformations programmatically in a script or program and using that to analyze your data, you not only keep the original data intact, rather than accidentally saving over it like I've definitely done in the past with spreadsheets, but you also end up with three things: the original data, the process you used to transform it (the script), and the resulting data that the script outputs. Between the three of them, you have a nice provenance trail. Another reason it's really nice to write software for data transformation is that the software can then itself be made FAIR, available to others for potential reuse. (Sorry, Andy, could you please mute yourself? I can hear you.) There are hundreds if not thousands of research computing packages available in a number of different languages, Python, or even really old ones like Fortran or COBOL. Using these packages can save researchers a lot of time, because they don't need to implement completely new packages from scratch themselves; they can look through a library of available packages and see if any do the kind of data analysis they need. Now, one wrinkle in making these packages themselves FAIR: with a lot of software, the focus has historically been on making it open rather than FAIR, and FAIR encourages openness but doesn't require it. For good provenance, though, research software really should be made open as well. Because if the software isn't open, it's a black box, and you cannot inspect its internal workings to work out exactly what has happened in the data transformation.
In fact, a colleague of mine was telling me about one very popular scientific computing program, SPSS. You can ask it to run all kinds of analyses on your data, including one particular analytical method called an ANOVA. Now, don't ask me exactly what that is; I'm sure I learned it in undergrad mathematics about 20 years ago, but I can't tell you anymore. The point is that there are several different ways of performing an ANOVA, and unfortunately SPSS doesn't tell you which method it uses. You say, please run an ANOVA, and it gives you the results and says, here's your ANOVA, but it doesn't give you the provenance of which ANOVA method was used. And why this is important, why we would like to be able to inspect research software and see how it works, is that despite all best intentions, software can have bugs in it. For example, the so-called Willoughby-Hoye Python scripts have been used in hundreds, if not thousands, of computational chemistry projects and papers. These scripts were found to rely on the computer's operating system to do something, but unfortunately the programmers didn't realize that different operating systems, asked to perform the same function, would do it differently. Under Windows, under Linux, under older versions of macOS and newer versions of macOS, what you got back from the operating system was subtly different. It was to do with how lists of files are ordered: different operating systems order file listings in different ways. The result of this particular bug was that under different operating systems, the same analysis on the same data would come up with subtly different results. The differences weren't big enough to be noticeable to casual inspection by a human, but they were different enough to make a statistically significant difference to the results.
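The class of bug described here is easy to reproduce: Python's `os.listdir()` returns file names in whatever order the operating system and filesystem provide, with no guarantee, so any analysis that depends on that order can silently differ between machines. This is a generic demonstration of the pitfall, not the original scripts.

```python
# Demonstration of platform-dependent file ordering, the kind of bug
# described above. os.listdir() makes no ordering guarantee; sorting
# the result is the one-line fix that makes the order deterministic.
import os
import tempfile

workdir = tempfile.mkdtemp()
for name in ("b.txt", "a.txt", "c.txt"):
    open(os.path.join(workdir, name), "w").close()

unordered = os.listdir(workdir)           # order chosen by the OS/filesystem
deterministic = sorted(os.listdir(workdir))  # same on every platform

# deterministic is always ['a.txt', 'b.txt', 'c.txt'];
# unordered may or may not be, depending on where this runs.
```

If downstream results depend on processing order, `sorted()` (or an explicit manifest of inputs) is what makes the analysis reproducible across operating systems.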
And this bug was only found because the software was open and could be investigated and scrutinized by other researchers. They found the error, reported it in a paper, and the package has since been fixed and made stronger, thanks to its open nature. All right. So that is one degree of providing a provenance trail: having the research software available for inspection. Now, the next level of recording provenance, well, there's always a metadata schema, and in this particular case, for provenance, there is a W3C recommendation; W3C recommendations are, to be honest, really standards. The World Wide Web Consortium has this recommendation called PROV, or rather a group of recommendations. I definitely will not go into too much detail about PROV, because it is pretty detailed and beyond the scope of an introductory course like this. But the basic idea of PROV is that things are described as linked data. We're all familiar with linked data; we went through that before. And at the most fundamental level, everything is one of three kinds of objects. You have entities: an entity is a discrete data set or data object. You have activities: processes by which data is transformed or generated. And you have agents: in the same way that the entity is the what and the activity is the how, the agent is the who, which person or organization undertook the activity that produced one entity from another. This can all be recorded as linked data in a variety of formats: XML, JSON, or another language called Turtle. There is an excellent primer on PROV on the W3C website, and I strongly encourage you to have a read of that primer.
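Here is a toy sketch of those three PROV types and the relations between them, using plain Python structures. Real PROV would be serialized as PROV-XML, PROV-JSON, or Turtle; the `ex:` identifiers and labels here are made up for illustration.

```python
# Toy model of PROV's core types: entities (the what), activities
# (the how), and agents (the who), linked by PROV relations.
# Identifiers are invented; real PROV uses proper serializations.

provenance = {
    "entities": {
        "ex:raw-data":   {"type": "prov:Entity", "label": "Original dataset"},
        "ex:clean-data": {"type": "prov:Entity", "label": "Cleaned dataset"},
    },
    "activities": {
        "ex:cleaning": {"type": "prov:Activity", "label": "Data cleaning run"},
    },
    "agents": {
        "ex:researcher": {"type": "prov:Agent", "label": "The researcher"},
    },
    # Relations as (subject, predicate, object) triples.
    "relations": [
        ("ex:clean-data", "prov:wasGeneratedBy",      "ex:cleaning"),
        ("ex:cleaning",   "prov:used",                "ex:raw-data"),
        ("ex:cleaning",   "prov:wasAssociatedWith",   "ex:researcher"),
        ("ex:clean-data", "prov:wasDerivedFrom",      "ex:raw-data"),
    ],
}

def derived_from(entity_id):
    """Walk wasDerivedFrom links: which entities did this one come from?"""
    return [obj for (subj, pred, obj) in provenance["relations"]
            if subj == entity_id and pred == "prov:wasDerivedFrom"]
```

Tracing `derived_from("ex:clean-data")` back to the raw data is exactly the backwards exploration described earlier: from a data set, to the process, to the original data.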
I found the first paragraph of that primer quite entertaining: it says you can engage with PROV as much or as little as you like. That is to say, you can implement just a small part of PROV for your purposes, or you can try to drink from the fire hose and implement all of PROV for your system. PROV is being used in some organizations in Australia for recording the provenance of their research data, but it is by no means universal. I think recording provenance in a really detailed and systematic way is something that's still to come, and that we can all work on together as an Australian community, or even as part of a worldwide community. All right. That's enough about provenance. Let's talk about something that isn't data. Although, to be honest, I just spent a lot of time talking about software, didn't I? Well, good news: I'm going to talk a bit more about software. FAIR software. Now, when the FAIR Guiding Principles were written, it was imagined that they could apply to any kind of digital research output, and to some degree perhaps analog research outputs as well, although that's a little bit trickier. Those guiding principles were originally written quite some years ago, and since then some very intelligent people have been thinking quite hard about where the FAIR principles can be applied to software, and where they don't quite work or don't fit very well. There was a paper released recently by Lamprecht et al. They acknowledge that software and data are very similar, more similar than they are different, in the sense that software is a special kind of data. But there are some very significant differences between data and software as digital research objects, and those differences require us, as data stewards or digital research object stewards, to treat them differently. I'll go over a couple of those differences very quickly.
That paper is a great read, and just like the W3C primer, I recommend you have a look at it. There is at least one Australian author on that paper as well. So, first up: licensing. Liz discussed the Creative Commons licenses on Monday. Creative Commons is now up to version 4.0, but in the very early days, version 1.0 was written for creative works: artworks, films, photos, things more artistic in nature than research-based or scientific. In fact, the original Creative Commons license wasn't even terribly good for research data, and it took several iterations of Creative Commons before its particular clauses were made relevant to data as well. So now we're at Creative Commons 4.0, which is a good, robust license to apply to data. Unfortunately, Creative Commons isn't very applicable to software, because of those differences between software and data. At the same time, the free and open source software movement has existed for a lot longer than the Creative Commons licenses have, and the FOSS community has come up with many, many different software-specific licenses. It's a bit like the joke about standards: there are so many standards that we need to create a new one that unites them all, and then all of a sudden we have yet another standard that needs to be united with everything else. So there are all these existing licenses, and you might be familiar with some of them: the Apache License, the MIT License, or the GNU General Public License (GPL). These licenses were written from the get-go as software licenses, particularly open source software licenses.
So we strongly encourage you to investigate those, rather than trying to apply a Creative Commons license to research software. Another way in which software and data differ quite a bit is that it's quite normal for software to be versioned and receive regular updates, continuous improvement, to use a management buzzword. These updates happen quite regularly, whereas with data it's not generally expected that a data set will have new versions. Some data sets do get new versions on a regular basis: if you're running a longitudinal study and surveying a cohort every year, then there's a new version of the data set as more data is added. But not all research projects work like that. With software, though, it absolutely is the case that new versions come out: there are bugs that need to be fixed, new functionality that needs to be added, or the software might start using a different method of calculating things. And in order to link all the different versions of a piece of software with each other in their metadata records, there are versioning practices that can be applied to keep things together and aid citation of that software. Alright, now I'm definitely done talking about software, and I will start talking about FAIR training materials. This might be something a little closer to your hearts, as I know that many people attending this course are themselves trainers and might in fact be seeking to train others in their data practices. Quite recently there was a paper, written once again by a group of very clever people, which came up with ten simple rules for making training materials FAIR. And this diagram I find fantastic, because they created it for that particular paper. I realize now I'm citing the image here and not the paper.
So I will provide the DOI of the paper itself once I'm done with this webinar. It's another great read and probably directly relevant to a lot of people in this course. Going clockwise around the diagram, well, they haven't put all ten simple rules on it, but you can see there are some actions you can take with the training materials you create, and those actions can assist more than one aspect of FAIR. There's a fair bit of overlap between findable and accessible, although interoperable and reusable sort of stand by themselves. Oh wait, there's number one, right in the very middle there. I think this is a really good practical guide for those of us who develop and deliver training, to help us make our training materials as useful to others as possible: not just to those learning from the material, but also to people who might like to teach with it. And then I have a bit of a navel-gazing question for you: is FAIR Data 101 FAIR? This is one of the topics for our community discussions next week, and you are absolutely welcome to completely sledge us. Well, sorry, be mindful of the code of conduct: challenge ideas, not people. But do give us some feedback on whether you think FAIR Data 101 is FAIR, and if it's not, what we could do differently. Alright, so that's it. That is the end of the course material, or rather the end of the FAIR-related webinars. And you might be wondering, given that for some parts of Australia lockdowns look like continuing a little longer, so we might have a bit of spare time, where to from here? We've come up with some suggestions of avenues you might like to pursue, depending on your particular interests. So, first up: Liz and I spent a lot of time talking about community-agreed standards.
And the first suggestion we have is that you join one of the communities that agree on those standards. The two most discipline-agnostic communities are worldwide ones: the Research Data Alliance (RDA) and FORCE11. The Research Data Alliance is a worldwide body; the ARDC contributes to it and assists in its governance. Within the RDA there are, I checked this morning, approximately 200 groups, and they come in two kinds. There are interest groups, which are long-term groups where people get together and discuss topics of common interest. And there are working groups: the idea of a working group is that it is formed, mobilized, for a year or two to produce a particular output, whether that's a framework or a series of recommendations. The group creates that thing, publishes it, releases it to the world for free, and disbands. The Research Data Alliance also has, I never get this right, biannual, that is, every six months, plenaries somewhere in the world. We were meant to have one in Australia earlier this year, but hey, guess what, that didn't end up happening in person; it did happen virtually. At these plenaries every six months, the recommendations of working groups that are winding up are released, and people get together to discuss what the next piece of work could be and what new working groups could be formed. FORCE11 is similar, though thinking more about research communications and scholarship, and the two bodies collaborate on many things. So that's the first suggestion: join the communities that agree on the standards. An example of a joint RDA and FORCE11 working group is the FAIR for Research Software working group. Sorry, I'm not done talking about software after all. This group is still in its forming stage.
And if you have a look at the chairs, we actually have two Australians, Michelle Barker and Paula Martinez, as chairs, and the other chairs come from around the world: Daniel Katz is from the US, Neil Chue Hong is in the UK. The idea of this working group, once it is endorsed, it's not yet endorsed but hopefully will be soon, is to come up with concrete FAIR principles that are really tailored to software. They have a couple of webinars next week, I believe; I'll share the details with you, and one is at a very Australia-friendly time. And yes, there are many, many working groups; this is just one of them, so there might be a working group that covers your particular domain of interest. I did say that the RDA is domain-agnostic, but there are many domain-specific groups within the RDA: research data for agriculture, for example. Something else you might be interested in doing is working on some real practical FAIR skills. The really big one coming up soon is in the banner across the top: FSCI, the FORCE11 Scholarly Communication Institute. This normally happens in the US, so it's a bit inaccessible for us here in Australia, but due to worldwide circumstances it is happening online this year. That means you don't need to travel to the US, but you do still need to be awake at the appropriate time; given the US is approximately 12 hours' time difference, that could be a bit of a challenge to attend, but at least it's possible, right? Sorry, yes: FSCI is in the first two weeks of August. A bit closer to home, down the bottom there: Hacking Heritage, with the GLAM Workbench. If you enjoyed that exercise accessing the Trove API with a Jupyter notebook: Tim Sherratt at the University of Canberra.
He is running a three-day course, delivered online of course, and you can sign up for that; it's happening in the coming weeks, in the coming month at any rate. There are also Carpentries workshops, which you may or may not have heard of before. The Carpentries is both an organization and a suite of lessons designed to train researchers and research support professionals in research computing methods in a very accessible and friendly way. The focus is on making sure that everyone has a fulfilling learning experience and everyone grows together; in fact, we got the idea for our code of conduct from the Carpentries. Then there are also some online resources available. FOSTER is a European group about open science, and they have a number of resources, an open science handbook for example. Similarly, The Turing Way, from the Alan Turing Institute, is also a handbook, full of interesting lessons and theories and ways to make research more FAIR, although they possibly focus more on the concept of open science than on FAIR research. Okay, so that's it for things you can do for yourself in terms of continuing your learning. For this particular course, the wrap-up now is that we have activities, community discussions, and the quiz. The activities and the quiz will be released sometime tomorrow; we're just finalizing them and making sure we haven't got any spelling errors. The community discussions continue as normal next week, and the topics are already available. Then, stickers. We will be sending out stickers. The problem is we don't know where you are, so you need to tell us where to post your stickers to. And to be able to tell us, you'll need to complete a feedback form. Now, the feedback form itself is anonymous, but we have a separate survey asking where you live, so we can't link those results together.
So you can be as brutal as you want in the feedback, but remember, please be kind as well. And then let us know where you would like your sticker sent for completing the course. We also have some bonus content available for you. The ARDC is partnering with CAUL, the Council of Australian University Librarians, and the AOASG, the Australasian Open Access Strategy Group, to deliver a webinar on FAIR beyond data. Now, they'll probably focus more on research publications than on my favorite, software, but I heartily recommend you sign up. That'll be next week on Monday, at a very Australia-friendly time for both West and East Australia. Yes, I think that's all correct. Go to ardc.edu.au/events to register. Now, that is it for me. So, Liz? Yes, I am here. Hooray. Okay, so how many questions do we have? We have a few comments, people piping up about ANOVA. But the first question, and I have half typed out a bunch of links to share, but I'll just ask the question: can you recommend any good resources for choosing the right license for software? Yes. There is a website, and I believe it is called something as simple as Choose a License: choosealicense.com. Now, this website was created by GitHub, and GitHub is owned by Microsoft; however, Microsoft has really gotten on board the open source train recently, which is quite nice. It's a very friendly way of picking out an appropriate license for your software, bearing in mind that it was created in the US. The ARDC is also working on a software licensing guide. I can't give you an ETA, unfortunately, but we would like that to be available sooner rather than later. All right, thanks. There are no other questions at the moment, so I do encourage people to ask; look, maybe we can extend this Q&A beyond reusability alone, if you would like to ask questions from this last cumulative series of weeks.
And happy to wait while you form those questions in your mind and send them to us via the question or chat module. Yes... yes, I see. Oh yeah, ANOVA: analysis of variance. That's the one. That is honestly going back 20 years for me, since I did first-year statistics at university, and I can't believe I'm that old already. All right, so while you patiently type them out: there will of course be opportunity for further discussion in the community discussions next week, or there is the Slack if you want to ask questions there. We've had some great discussion recently, discussing data publishing and embargoes. There was an article about the perils of people ignoring metadata standards for COVID-19 data, which was quite interesting. Wasn't it, Lance? You were telling me how, what was it, the lack of appropriate metadata to do with provenance, in fact, not saying where or when or how that data was collected, made that data completely useless. Yes, that's right. In the circumstances of a pandemic, you really want to make sure that the metadata you have for the location and the time at which disease has been reported is complete. But these researchers found lots of instances where that information was not complete, which makes that kind of research data absolutely useless until somebody has gone through and filled it out. So, nothing like a pandemic to sharpen the inequalities of data quality. Good one. Yeah. All right, so we just killed some time there. Have any more questions come in at all? Oh, yes, hang on. There's a request to email the slides to course participants, which we will undertake. We'll put the slides out on the Slack channel, but we'll see if we can send an email to everyone as well. And there's one final question about how best to find out about Carpentries events. Okay, great.
Now, the thing with Carpentries events: the Carpentries has really focused very much on face-to-face training, which is a problem right now. So they've been working very hard to pivot, sorry, using another management buzzword there, to delivering more online training, and there has been a bit of a bump in the availability of training workshops. But there are several places you can check. You can go to the Carpentries website to see if there are any workshops coming up soon that you can attend; in fact, with more of them being online, it might be easier to find one you can attend, although it might not be at a very friendly time of day. Alternatively, there could already be somebody at your institution delivering Carpentries workshops. Some universities, like Macquarie University and Monash University, and that's just the tip of the iceberg, deliver regular Carpentries workshops for their staff and students. Then there are also some organizations in Australia who will deliver Carpentries workshops for you. Intersect, for example, are based in Sydney but active around the country, and they can, for a fee, deliver a Carpentries workshop for you. Alternatively, I am one of the Carpentries regional coordinators for Australia, and I welcome you to get in touch with me to ask about your specific institution or area if you're not sure where to start. Excellent. A few more questions now. Oh, and also another helpful reminder that QCIF also delivers Carpentries workshops online, so thank you for submitting that; I'm just going to cut and paste it into the chat as well. QCIF tends to deliver only to their members, and their member organizations are most of the universities in Queensland, whereas Intersect will also deliver to non-member organizations. I do have a final question: is there guidance available from the ARDC on adding DOIs to research outputs in repositories? Absolutely. There isn't. No, I'm sure there is.
We'll share a link with you. We help Australian institutions get DOIs for their research outputs. Now, we are limited to research data, and non-publication research objects generally; we also have scope to help facilitate DOIs for grey literature. But when it comes to formal publications, say if your institution publishes a journal and you want DOIs for those journal articles, you're probably best off going directly to Crossref, which is the organization that produces most of those DOIs. Yeah, that's true. I think, in answer to that question, you'd be looking at contacting the data and services team at the ARDC. I will put a link to the part of our website that covers our DOI service, and, to follow up, I'll put the team to contact, the person to contact, into the Slack channel if that's okay. There is also your local ARDC liaison; we have at least one engagement officer in each state, and you're welcome to contact them and ask them if you know them. If you don't know them, ask us. Yeah. So, Matthias, one final question, and then I'm going to put a hard line under all of these, because I need to run off and pick up a certain child from school. Do you have any advice or useful links for someone who would like to learn Python or R and has never done it before? Yes. The Carpentries is a fantastic way: if there is a workshop available to you, attending a Software Carpentry course on Python or R is a fantastic and friendly introduction. It's gentle, and it works with real-world examples, so it's not a high-level theoretical overview of programming as you would get in a computer science or software engineering course, but a very practical and gentle introduction to coding. Yes.
And I would add: check with your organization whether they have any Carpentries instructors who are running workshops virtually, but you might also look out for a ResBaz activity in the future. ResBaz events, Research Bazaars, are festivals of research computing training. They've mostly been cancelled this year, I believe, certainly in Australia, but they're a good touch point for learning Python and R, normally through the Carpentries, in your state. So it's likely there might be some kind of virtual activity this year, or some kind of face-to-face thing next year. There are also a number of MOOCs available. Data science being the hot new thing, well, it's not that new anymore, but data science being the career of the future at the moment, there are quite a few open and free-to-attend MOOCs; one I did was from Johns Hopkins University in the US. Alright, I think we have run out of time, Liz. Yeah, I think so. Would you like to wrap up? Well, I don't actually have anything more to say, but I think it's nice if we both stay on and give a wave, and say thank you to everybody for attending these webinars. We look forward to seeing you in the community discussions next week, and on Slack as well. We will be sharing a lot of information with you after this presentation; there are lots of links. And in fact, something I completely forgot to mention was the ARDC communities of practice. The ARDC runs, or is affiliated with, a number of communities of practice on different topics, and we strongly encourage you to check out the ARDC communities of practice page to see if there is a community there that suits you. For example, many people in this course might be interested in the data librarians community of practice, if you're not already a member. Alright, now I think that's enough.
So, thank you very much for joining us for the past seven weeks, or it'll be eight weeks next week. It has been an amazing time, and it's been absolutely fantastic to be able to deliver these webinars to you. If you do have any questions, please let us know. Otherwise, we will see you next week. Thank you. Thank you, everybody.