Hello, everybody. My name is Liz Stokes and I am very happy to welcome you back to the final installment of our live Q&A webinar for the FAIR Data 101 Express course. I would like to acknowledge the traditional owners of the land on which we are meeting today. For me, that is the Gadigal people of the Eora Nation. I would like to acknowledge that they are the traditional custodians of our land and pay my respects to elders past, present and emerging. And a special welcome to any First Nations people who are with us today. So let's get into this FAIR Data 101 Express. It's nearly the end. Today we're going to focus on reusable as a theme of the FAIR Data Principles. And slowly I will make my way across to the next slide. I hope you've been keeping up with us, but fear not if you feel like you've fallen behind a little, you've missed a webinar or two and haven't made it all the way through the activities. We'll keep this course open and the Slack channel open for another couple of weeks after we officially finish at the end of this week. So you've got some time. You've got an extension, if you will, to wrap up. Because, you know, life happens and timelines have an innate flexibility in them, I understand. And that is why, with this recognition of life happening, we turn our attention to reusability of research data. Because once you've got some data to make reusable and you try to reuse it, it's only then that your engagement with how these FAIR Principles actually work really comes into play. So we've got a couple of awesome guests today. We have Maria Del Mar Caroga from the University of Melbourne, who's a researcher there, and she's going to talk to us about her experience wrangling some Medicare data. And we also have Tom Honeyman, who is the software program manager at the Australian Research Data Commons.
And he is going to talk to us about some of his experiences managing research data in his former life as a linguist, which comes out every now and again. We've had many long talks on linguistic tangents. So I would like to, yeah, crack on with the show today. So Ma, I welcome you to join us, turn on your camera, and I give you the floor. I'll turn my camera off and be quiet.
Oh no, I'm not going to be able to share my screen, but that's all right. So I wanted to talk about my experience last year working with Medicare data. I'm currently a research data specialist at the Melbourne Data Analytics Platform at the University of Melbourne. But last year I was working at the Burnet Institute, which is a medical research institute right next to the Alfred Hospital, in the alcohol and other drugs group. So it turns out the Victorian government started a trial and opened Melbourne's first medically supervised injecting room in July of 2018. The injecting room had a number of different aims and objectives, the main one being, of course, preventing opioid overdose and opioid overdose deaths among people who inject drugs. But some of the aims were a little bit more complex than that. For example, one of the aims was to advance the delivery of more effective health services for clients of the supervised injecting room, because it provides an easy gateway to health and social services. It's co-located with the North Richmond community health center. And also to reduce the spread of bloodborne diseases amongst the clients of the MSIR, the medically supervised injecting room, sorry. So the Victorian government set up an independent review panel to evaluate these aims and to see, one year into the operation, whether the room was satisfying these objectives. And that's when they came to us. My group, the alcohol and other drugs group at the Burnet Institute, had a prospective cohort study on people who inject drugs called SuperMIX.
So they've been following up people who inject drugs for over 10 years. Once a year they do an extensive interview where they ask about drug use patterns and health services utilization. And they get consent to link all of this self-report data to record linkage data like Medicare or PBS or ambulances, hospitals, emergency departments, etc. So when the MSIR was about to open, they added a few questions specifically about the supervised injecting room: do people visit it, for what proportion of their injections, and things like that. So the independent review panel kind of subcontracted us, given the richness of the data that we had, to see if we could help them answer some of these questions that required a comparison between people who visit the room and people who don't, to see whether people, once they start visiting the room, access health services better and whether there is a reduced spread of bloodborne diseases like hepatitis C or HIV. So we wanted to be able to look at changes over time, for GP visits in particular, and also hepatitis C testing and maybe some other pathology results that have to do with drugs of abuse. So we received linked data from Medicare, the MBS, the Medicare Benefits Schedule, for all of the participants of our study, covering the range from January of 2008 to March of 2019. We really wanted to look at this before and after the room opened and be able to know from this data whether there is a difference between the clients who were visiting the injecting room regularly to perform their injections and those who weren't. So we received 442,000 records that linked to about 1,150 participants in our analysis sample. The way the data looks is, you get a Medicare item number, and there are 13,500 different possible item numbers. So each row has a Medicare item number, the date on which the service happened, information about co-payments, clinic codes, doctor codes and things like that. And each record can be of multiple natures.
So they can be to do with professional attendances, which can be visits to the GP, the dentist, specialists, but there are also different item numbers for different kinds of GP visits. So after hours has a different number, or depending on the length of the visit it has a different number, etc. Records can also be to do with diagnostic procedures, for example an echocardiography or something to check heart function. Therapeutic procedures, for example any surgery, can be listed there in Medicare, as can diagnostic imaging like radiographies and pathology results like blood or urine tests. And there's also a bunch of miscellaneous or administrative codes that have to do with the management of bulk-billed services. So we had to be able to make sense of these 450,000 records and see which of those corresponded to the things we were interested in, namely GP visits in particular and hepatitis C testing. So I wanted to show you on the screen, sorry, I didn't set up the permissions on my computer properly for GoToWebinar before this meeting, but I'll just describe it to you. What we received initially from Medicare, together with the data for our participants, was an Excel sheet which had, for each Medicare item number, and again there's 13,500 of these, a big text description of what that Medicare number means. So for example, for 0001, it's professional attendance by a general practitioner on one patient on one occasion, each attendance other than an attendance in unsociable hours in an after-hours period, if the attendance is requested by the patient, you get the idea. It's a big, big paragraph describing, in a human-readable way, what that item number means. But now I needed to go into my big data set with 450,000 records and be able to pick out from there which ones are GP visits. That was very difficult to do with just that information.
Importantly, the numbers are not sequentially ordered. The item numbers have kind of been added as things have been changing in Medicare. So maybe the first five are to do with GP visits, but then there's 10 that are to do with specialist visits, and then there's another five that have to do with radiography, and then there's another 10 that are to do with something else. So it was really, really difficult. I spent a full day trying to look up the individual item numbers to see which ones corresponded to different kinds of GP visits. And so I thought there must be a better way to do this. So I approached a colleague at that point. I was doing this analysis within the secure environment at the Australian Institute of Family Studies, given the sensitive nature of this data. So I approached a colleague there who had experience working with Medicare data and he said, I've been exactly where you are, have a look at this link. He sent me a very obscure link that was somewhere on the Medicare website, but very, very difficult to find, that had a text file that ended up being a little bit more useful than the data dictionary that Medicare initially sent. This text file has a row for each Medicare item number. And it tells you, for that Medicare item number, which category it belongs to, professional attendances or diagnostic procedures or therapeutic procedures, and which group it corresponds to, so there are different groups like general practitioner attendances to which no other item applies, or other non-referred attendances to which no other item applies. And then there's an MBS subgroup and then there's a subheading. And so all of this, I think you would call it a vocabulary potentially, or a taxonomy-like hierarchy of all the different items and how they're grouped.
Once you have that, you can merge it into the original data set and add columns saying, depending on the item number, what category it is, which subcategory, which subgroup, and have these descriptions. So now I could easily say, please filter all the records that are GP visits and only GP visits. And then I can split that for participants who have visited the injecting room and those who haven't, and be able to do the analysis that I needed to do. So I think this is an example that was quite eye-opening to me, because I did the previous round of the FAIR data course, and at the beginning I kept thinking that reusability meant open. I kept getting confused between those two concepts, reusability meaning that anybody can use it. But it doesn't necessarily mean that anybody can use it, it just means that you can use it for different kinds of research questions. So in this case, obviously Medicare data is not collected for the research purposes that I had in mind. But it could be quite useful for my research purposes, I just needed to have a relatively easy way to be able to filter and get the information that I needed. I don't know if my conversations with the Australian Institute of Health and Welfare had anything to do with this, because I obviously explained this extensively, but if you go to the Medicare website today, this hard-to-find, more helpful data table is now featured on their main website. And so that's probably made life a lot easier for all the people looking into COVID stuff and other research questions. So that's my experience that I wanted to share with you.
I can't hear you, Liz.
Thanks. It took me a while to unmute myself. Sometimes it's finding the buttons. Well, thanks for that really interesting story and deep dive into Medicare data.
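The merge-and-filter step described in the talk can be sketched roughly as follows. This is only an illustration under stated assumptions: the column names (`item_number`, `category`, `subcategory`), the item numbers and the category codes are all invented stand-ins, not the real Medicare file layout or values, and the helper functions are hypothetical.

```python
import csv
import io

# Invented stand-in for the lookup file: one row per item number, with its
# place in the hierarchy. Real item numbers and groupings will differ.
LOOKUP_CSV = """item_number,category,subcategory
0001,1,01
0002,1,03
0003,6,02
0004,1,01
"""

# Invented stand-in for the linked service records: one row per service.
RECORDS_CSV = """participant_id,item_number,service_date
p1,0001,2018-03-02
p1,0003,2018-09-14
p2,0004,2019-01-20
p2,0002,2019-02-11
p3,9999,2019-03-01
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, keeping every field as a string."""
    return list(csv.DictReader(io.StringIO(text)))


def add_hierarchy(records, lookup_rows):
    """Left-join records to the lookup on item_number.

    Records whose item number is missing from the lookup are kept, with
    category and subcategory set to None, so they can be reviewed, not lost.
    """
    by_item = {row["item_number"]: row for row in lookup_rows}
    merged = []
    for rec in records:
        match = by_item.get(rec["item_number"], {})
        merged.append({**rec,
                       "category": match.get("category"),
                       "subcategory": match.get("subcategory")})
    return merged


def select(merged, category, subcategory):
    """Keep only the records in one category and subcategory."""
    return [r for r in merged
            if r["category"] == category and r["subcategory"] == subcategory]


merged = add_hierarchy(read_rows(RECORDS_CSV), read_rows(LOOKUP_CSV))
# Hypothetical codes standing in for "GP visits".
gp_visits = select(merged, category="1", subcategory="01")
# Item numbers the lookup does not know about.
unmatched = [r for r in merged if r["category"] is None]
```

Item numbers are kept as strings so that leading zeros survive the join. Once every record carries its category and subcategory, picking out one kind of service is a single filter on two columns instead of a hand-built list of scattered item numbers.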
It sounds like, I think, that's research impact right there with the changes to the AIHW page. What I'm going to do is hand over to Tom right now and then we'll go into our Q&A session. So Tom, matching ARDC jumpers. Yes. Take it away.
Hi everyone. So I'm going to talk about a slightly different scenario, how reusability works in the area of field linguistics or descriptive linguistics. For a little bit of context, this is essentially people who are studying the world's languages, of which there are roughly 7,000. But the use of those languages is in decline, and I think the statistic is roughly that one every fortnight around the world ceases to be spoken. So there is a strong imperative for reuse for those people who are going out and collecting data. I was one of those people. I used to work in Papua New Guinea on one of the many hundreds of languages that are spoken there. And so, like a good field linguist, all my materials are archived in a couple of different archives. And you should be able to find them pretty easily if you start looking for materials in model or you can be even more accurate. So what I wanted to particularly drill down into is the principles under reusable, how they play out in this field and what they enable. And hopefully what I'm going to spell out is that sometimes it's not as complicated as you'd think, and sometimes the value of what you're doing in this space is actually more than you would realize for a simple piece of metadata. So there are some specific keywords, sorry, there are some specific vocabularies that are used for this kind of linguistics. First and foremost amongst them is that we like to identify the languages of content and study in any materials that we're archiving. And this allows people to basically look at almost the entire extent of materials that exist for a given language.
There's a special portal that all of the specialist archives that record language materials publish feeds to, and you can get there by going to search.language-archives.org. And because all of the records have metadata capturing the language of focus and language of content, you can be sure to search for the language that you want. If you want to, you can use the three-letter ISO 639-3 code to do that. That's particularly useful because it just so happens that, for languages around the world, sometimes the same name pops up more than once. There are lots of reasons for that. So having that really good quality, specific metadata is really important for locating materials. If you look on this website, you'll generally find all of the recordings that you want for a given language. Now this is amazing for reuse, but it doesn't stop there. I'm not going to delve into what would be accessibility issues around access and authorization, but there are concerns there. Generally speaking, people have gone to the trouble of putting a relevant, clear and accessible data license on a lot of stuff. In particular, for the archive that I use, which is PARADISEC, there's a standard licensing drop-down box that you can choose from depending on how open you want the material to be. So I've got lots of material up there which is freely accessible. Please don't criticize my Tok Pisin pronunciation. And that means that people can go in there, they can listen to the recordings, they can look at the transcripts of the texts. They can cite the materials, as the details are available there to do that. And in particular, there are a couple of key fields that really help with the provenance of the materials. When you're really drilling down into the study of these languages, you do actually care which area the language was spoken in. And so reasonably specific geographic details are actually really useful for understanding differences between dialects, for instance.
Understanding the type of text is really important too. So this is a bit wonky, but in the field of linguistics there's a distinction made between elicited materials and so-called naturally occurring materials. A lot of texts are actually flagged as being one type or the other. Well, actually there's a whole vocabulary for describing the different types of text. And so depending on what type of linguistics you're doing, you might be more wary about incorporating elicited texts into your analysis. These all basically fall into domain-relevant community standards. For linguistic archives, there's the Open Language Archives Community standard, which, for those in the know, is basically DC terms with a couple of extra fields added on that really, really help you locate exactly what you're after when you go looking for those materials. So yeah, that's about it, I would say, for linguistics. I'm happy to answer any questions. But yeah, an example of things working reasonably well.
Wonderful. Thank you, Tom. And thanks, Matthias, for putting those links into the chat for us. And so now we come to the Q&A session for real, finally. I encourage everyone to put some questions into the question box down in the panel there, and I'll invite Ma back on. Matthias, you could turn on your webcam too if you'd like, give people a wave. I look forward to hearing what our participants have to say about both of your experiences. In the meantime, I will say thank you very much for sharing those. I've got your presentation up on my other screen. Did you want to just talk people briefly through that? Or I'm happy to publish these slides afterwards if that's okay.
Yeah, they're just screenshots of what it looks like, just to give an idea.
Excellent. Hang on, I'll come back to show screen two. That's moved everything across. Okay, can people see that screen?
Yeah, so the last slide.
Sure.
Okay, so that's moved over here and I'm using the correct one. So this is your summary of what was going on and how that data had moved across. And you actually might move over to the better looking slide, the more useful data dictionary slide. Here's the data that's nice and structured, and you get to see all of its context. Whereas if we went back to that previous one, because it's a data dictionary, you can see that there's similar content there, but it's very hard to look at that at scale. Was that the experience you had?
So with the other one, if I'm interested in GP visits, I could say, show me all the records for which the category is one and the subcategory is zero one. And then that covers all GP visits. And I don't have to mess with individual item numbers, of which there were hundreds, scattered all over the range of zero to 13,500.
Right. Awesome. I can't see any questions that have come in from our participants yet, unless somebody else can. Oh, that's because I think my permissions are as a presenter, so I can't see a thing. I'm going to hand this over to you.
Sure. I will quickly make you an organizer so you can see the questions, but we do have a few and I will start reading them out. So we have a question about sensitive data, and I suspect this could go to both Ma and Tom. Are there any special considerations to be taken into account to make sensitive data reusable by other researchers? Ma, maybe you'd like to go first.
Yeah. Do you want to stop sharing the screen, Liz? So yes, there are a lot of considerations that can be taken into account. Depending on the level of sensitivity of the data, the data custodians will have different requirements in terms of the security of the environment in which this data can be made available. So Medicare, PBS, etc. Obviously they are de-identified.
There's no personal information on there about people, but it is so sensitive that they have pretty strong requirements about the environment in which it can be accessed. In our case we had to sit in the Australian Institute of Family Studies, which is an accredited linkage service that is part of the Australian government, and sit there with laptops that were not connected to any network. They had no internet access. They had no USB access either. So the data was saved onto these physical laptops and we had to go to the physical location, in a room, and work on those laptops without access to the internet. Other types of sensitive data can have different requirements. It really depends on the conditions of the data custodians, but usually most data will be de-identified before being made available for research.
Right. Sorry, Tom, go on.
Yeah, whereas in my case it's sometimes the opposite scenario. It may not be that the personally identifiable details are the concern. It might be that the materials that you've collected are culturally sensitive and have particular restrictions on access. A common scenario is that if you're dealing with women's or men's business, you may not be able to view the materials or listen to them. How this is handled in linguistics is that you go right back to the consent form when you're gathering the data. It's very important to discuss with the community more generally how they feel about access to the information that you're gathering. And they indicate at the time whether they want to restrict access to the materials and what kinds of restrictions there might be. So if it's culturally sensitive material, then you may well want to put in place a protocol for gaining permission from the community before you access those materials. So it's not stopping reuse, but it is a barrier to reuse.
But generally speaking, I did also talk a lot about the value of reuse with the community so that they could understand what it was for and why it was there. And there's actually significant interest in this realm for reuse of the kind where other people in the community can access the materials from the archive afterwards as well. So it can be generally quite good for people to be able to use it. But yeah, different concerns, sometimes the opposite.
It's really interesting when it gets into that realm of community use, you know, not just use by researchers. Thank you, Ma, for describing that experience. We have a question about whether the AIHW is any help with the data extraction and usability process. This commenter notes that they create an Atlas of Australian healthcare variation, which maps the variation in, for example, surgery in different states. I wonder if you had any further comment on that.
So the AIHW is a linkage authority. It's a linkage authority certified by the government. In our case, they handled the whole linkage process for us. We sent them our full data set of self-report interviews with our participants, and also, I don't have access to this, the personally identifiable information of these participants, so that they could actually do the linkage to the Medicare information and PBS and ambulance and emergency departments, hospitals, the National Death Index, everything. Then they remove all the personally identifiable information and they put a unique identifier on each of these things so that you can connect across the data sets. I don't know if that actually answers the question, but they are one of many certified linkage authorities in Australia, so depending on the type of data that you want to get access to, you might want to contact them or other organizations. For health information in Australia, BioGrid is also one that's emerging, more in the university space.
So yeah, there's a bunch of different linkage groups.
Thanks. So I'm going to go to a question that's more towards data and government agencies, and then we'll cycle back to some questions that came up for Tom. Is there a way to influence how data is displayed by government agencies? Is there a forum or website to show those who are not making their public data available in usable formats? It would be nice to have a gentle method to show the way rather than rely on individual researchers having to do this.
Agreed. I don't know if I can add much to that. I think the work the ARDC is doing to raise awareness in general about these things is amazing. I don't know of any specific process for feeding this back up. Maybe you, Liz or Matthias or Tom, can comment on that.
Go for it, Matthias.
So I do know that many of the state governments, and the federal government as well, do have specific open government, open data projects. You can visit, say, data.gov.au or data.wa.gov.au. So there are efforts within government jurisdictions to make more government data open and available. They do focus on open rather than necessarily FAIR at this stage. But I think it's a process. Things will proceed. Different departments have different levels of how conservative they might be when it comes to safeguarding their data. So I would say give it time.
Yes, it's definitely a long game as we track the process of the landscape changing. The glaciers move, so governments and research institutions can also adapt to changing circumstances. I'm not going to be writing speeches for politicians anytime soon. Okay, I'm going to come back to Tom. The previous studies in your thesis area, did you find them reusable to the same extent? Or, in other words, has reusable changed over time as a concept?
Massively. I think the critical turning point in my field was about the year 2000, when there was a UNESCO report on endangered languages, or language endangerment. That really galvanized the community to act in this space and to do a lot more work archiving, and going back in particular. If you look at PARADISEC, it did a huge amount of work recovering recordings from old reel-to-reel recorders and other obscure formats that were dating very quickly. But the practices in the area have changed a lot as well. Data citation started to be taken very seriously at that time, and it's far more common to see specific references to the text that you're quoting included inline in the document. So for the materials, I pretty much did hunt down every single existing bit of work on the language that I worked on. There was a missionary linguist who worked in the area in the late 70s and early 80s. He didn't have a lot of stuff archived, but he did have a few papers floating around. I wasn't ever really able to cite the original texts for those, but if I'd been able to, it would have been an absolute treasure trove, basically. And, you know, one of the most obscure things I found was that there was an evangelizing group that went around making recordings in local languages. They started to digitize their tapes and put them up, and I accessed those, and I had speakers in the community who said, oh, yeah, that's, you know, that's fire was in in Aminab. And so I could sort of work out backwards from there who the speakers were, where they were based, what dialect they were speaking. Actually, it's pretty much the core evidence for me to know that there are in fact two dialects of the language I was working on. I wouldn't have been able to work that out otherwise. But yeah, so it really has changed over time.
It's fantastic.
It just goes to show how many different, I guess, I don't want to call them interest groups, but people with different perspectives and things to add actually contribute to the research that happens. So, Tom, another question: how do you handle a language when the 639 code does not exist?
Yeah, so there is a process for amending those codes, unless you mean it's a language that doesn't exist, or a language of one or something like that. But there is a process, basically, for submitting changes. A lot of the languages of the world, well, increasingly fewer, have had almost no work done on them, or no formal work done on them, or have not had the care taken to work out whether they are distinct languages or whether they're dialects, beyond sometimes just a list of maybe a hundred words. So the status of languages actually changes a bit and is sometimes contested. And so there is a process for basically submitting a change request to the authority.
Great. Thank you. Also, I'd like to note, panellists, that we've had a couple more comments in there about the discussion on changes in how government agencies make data available. There's a comment about the Office of the National Data Commissioner being made aware of what's going on, and also that the National Data Commissioner is advised by Australia's Chief Scientist, who has come out recently being quite enthusiastic about data sharing and open data. So hooray. That's good. I'm going to move now to a comment about data licensing, which is actually probably good to follow on from, you know, my fairly flippant comment about yay for open data. I do acknowledge that sometimes it is possible and sometimes it's not that possible. And I don't think absolutely every data set needs to be open and public.
So this question asks, in particular, about the benefits of CC0, so public domain, or attempting to put research data in the public domain, because as we know, there's a little bit of tension there with Australian copyright law. So in terms of using CC0 rather than requiring attribution, which might get messy as more and more data sets get used and combined over the years, thinking in terms of 10 to 15 years from the original sources, would either of you like to make any comment on that?
Specifically in linguistics, it's unlikely. With the data itself there are some complexities, but the convention is actually to recognise, although it's not recognised in traditional IP law, the rights of the people who are speaking that language as the owners of those materials. Leaving that attribution in there is an important link back to that, as is the possibility of shifting access rights over time. So you're not going to see CC0 any time soon. Well, apart from the issue with CC0 in Australia.
Yes, Matthias and Ma.
Yes, I just wanted to expand on that issue with CC0 in Australia because the attendees might not be aware of it. Sorry, I should preface this with: I am not a lawyer. But CC0 is an instrument that attempts to make it easier for people to put things in the public domain and therefore essentially completely give up all of their rights associated with a creative output of some kind. Now, the problem with that in Australia is that, as far as I'm aware, as not a lawyer, you actually cannot legally give up your moral rights to something. Your moral rights are what protect you and your creative outputs from misuse by others. You can waive, sorry, you can't waive, you can give people permission to reuse something and possibly breach your moral rights without any kind of legal action from you, but that's not something you can do on a blanket basis, as far as I'm aware.
Now, if there are any lawyers present who can articulate that better than I can, that would be fantastic, but I would definitely advise caution with trying to use CC0 in Australia. But I also think that by not requiring attribution, to me that feels like we're flying in the face of academia, which is all about acknowledging sources and where things came from. I mean, I'm sure it can make things easier because you just reuse something and you don't worry about who you need to acknowledge or what have you, but yeah, is it really compatible with academia?
Yeah, and I just wanted to add something exactly about that. I'm working in the Melbourne Data Analytics Platform as an academic specialist. It's kind of a new career path that sits somewhere in between traditional academic roles and professional roles. And a lot of what we're doing is trying to figure out how we can get recognition within academia for some more non-traditional research outputs. We don't necessarily write a lot of papers, because we collaborate with researchers in a bunch of different domains to apply cutting-edge computational data science techniques to their research problems. So in that sense, I'm thinking it's important to have licenses that retain that attribution of your contribution, just thinking about career paths and making cases for promotion and things like that. If you collect a big data set, you want that to count as an academic output. That was all I wanted to add about that.
Fabulous. Thank you. I guess also, thinking about the absence of attribution, it's great for ideas, but probably not for that long tail of provenance. I guess that's the difference between a research paper and a zine. And I, you know, love the freedom of a zine to not have to provide citations appropriately, and you can literally cut and paste things together and make beautiful artworks, meaningful in a completely different sense-making world.
So with that, I would like to thank our expert panel of presenters very much for your stories and answers today to our questions on reusability. And thank you to all our attendees for making it through to the final live Q&A session. Oh, I'll just draw your attention to the ARDC research data rights management guide that Matthias has shared in the chat there. Please get in touch with us if you have further questions about rights management and sharing research data, we'd be more than happy to answer or get stuck in the weeds with you. As usual, there will be a short feedback survey on this particular webinar, so we really appreciate your feedback on that. But since this is the last time I get to actually talk to you face to face, or I suppose face to screen, I would like to mention that we also have a full course evaluation survey that I think Matthias is going to pop into the chat. Or I will scroll through and find my reference to that and share it with you. We've tied the course evaluation survey to the form that you fill out if you would like a sticker and a certificate posted to you in recognition of the completion of this course, and I thoroughly recommend that you do that. So, thank you, thanks Matthias. Please tell us about your experience of the whole course. As I said at the beginning, we'll keep the Slack channel open for another fortnight, but please be aware that we will deactivate you, and I know that sounds a little bit harsh. Those are the terms of the Slack platform. We could talk to them about that too. So we'll deactivate your membership there after two weeks. So October the 15th, there's your deadline. I do recommend getting on board there and striking up a conversation about reusability, interoperability, accessibility and findability with regards to research data. But as it's now 12:46, it's probably time for me to say goodbye.
So thank you very much everyone for staying the course with us and let us know how you go in those surveys. Thank you.