Well, welcome everybody. This week we've got Marianne Brown from James Cook University. Marianne wears many hats and in fact has done the whole suite of ANDS projects: she's done several Apps projects, a Seeding the Commons project, a Data Capture project, and now Metadata Stores. That's why she's such a remarkable woman: she's covered them all. Anyway, without wasting more time I'll hand over to Marianne, who's going to talk, I hope, about two things: one is the research portfolio, and the other is the infrastructure she's built for self-deposit at JCU. I know there's considerable interest in the notion of self-deposit and what it entails. Okay, Marianne, over to you, and thank you for agreeing to do this.

Thank you, Simon. Good morning everyone. This is a bit of a rehash of the talk I gave at the ReDBox Community Day in Adelaide, so for those of you who were there, I hope it doesn't get too repetitive, but here we go.

At JCU we've been doing ANDS projects for a little while now. We started off with a very small Apps project where we built a prototype metadata repository. It was really just a toy system: we had the idea that we wanted to do self-deposit, and we wanted something small that we could play with to see how that would work. Then we took on a couple of bigger Apps projects, and we now have a Data Capture project that should be coming online shortly as well, all of which have been, and will be, producing lots of data sets.

Our Edgar project, which was AP03: on our first ingest we imported 1,498 records from it. All those records are in now, which was a big task and caused lots of stress, but it was all good. As the Edgar project continues, the number of new data sets it generates will drop off; it was just that initial bulk import which presented the challenge. We have another Apps project, Climas, which is AP02, and it will again stretch the metadata system when we do an import from it. Its initial import is around the two-and-a-half-thousand mark, so that will be interesting, and it should be happening in the next few weeks.

So, the background to our system. We started off with a little small system, and then as these projects went on we decided we needed a more robust, production-ready metadata repository to cope with what we were trying to do. Another main consideration was that library resources are scarce, and I don't think that's unusual for any Australian university, so the library's wish was to keep workload impacts to a minimum. We looked around at what other people were doing and saw that Newcastle were doing some wonderful work, but their chosen methodology of doing interviews was something our library felt they wouldn't be able to do, not to the level we were hoping for, given the number of data sets we were hoping to get into the system.

So that was our starting point. The main mechanisms needed to be self-deposit by researchers for their individual data sets, and then a machine-deposit capability for large-scale collections. One of the properties of those large-scale collections is that the metadata is all quite similar: the records have all been generated with the same algorithms or the same methodologies, the differences between one metadata record and another come down to a few key fields, and the bulk of the metadata is pretty much the same. We felt that lent itself to automation.
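To show why that kind of uniformity suits automation, here is the idea in miniature; this is a hedged sketch of my own, not our actual code, and the field names are hypothetical:

```python
# Sketch: when records share almost all their metadata, a whole
# collection can be stamped out from one shared template plus the
# few fields that vary. Field names here are hypothetical.
import copy

TEMPLATE = {
    "publisher": "James Cook University",
    "methodology": "Same modeling pipeline for every record",
    "license": "CC-BY",
}

def make_record(varying_fields):
    """Copy the shared template, then overlay the per-record fields."""
    record = copy.deepcopy(TEMPLATE)
    record.update(varying_fields)
    return record

records = [
    make_record({"title": "Projected distribution: Emu"}),
    make_record({"title": "Projected distribution: Cassowary"}),
]
```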
Okay, but the main thing we wanted to do was get this self-deposit working, so the main question was: how do we motivate researchers to add their own data sets to the system? My experience with academics is that sticks tend to turn them into mules and donkeys, and they won't do what you want them to do, so your best approach is always carrots: encourage them, and try to offer them a benefit for doing what it is you want them to do.

So, the main carrots. External forces add some carrots: the ARC and NHMRC granting bodies are encouraging people to deposit data sets, asking in grant applications for data management plans and all that sort of thing. That has certainly got some of our earliest people to do self-deposit: people filling in those grant applications and wanting easy solutions to fit those requirements. But that's only a few people at our university, and we wanted something that would get the bulk of our researchers on board.

The Metadata Stores project provided us with the perfect opportunity for that, because one of its optional deliverables was the research portfolio, the research profile pages. So over the last six months, I'd say, here at eResearch JCU we've been building our research portfolio system, which we'd like to show today. This has been custom-built in-house; it's not a commercial bit of software. We have our collection of sources of truth, some of which are third-party commercial products and some in-house built databases, and we wrote our own system to extract information from those sources of truth and present it in an attractive manner.

So this is our front page. If you update your profile regularly then you get to stay on the front page, so that's motivation for making sure things stay up to date. For an example page, let me see, let's pick my boss and do Ian's page. On this page most of the information is pulled out of internal sources of truth, so the amount of information a researcher has to add to make it more attractive is quite minimal. We don't store biographies, so the biography here, this About section, is something a researcher has to provide themselves, but the majority of the contact information is scraped from internal systems. This little word cloud about their research areas is built from the publications and grant applications in our systems, looking for the keywords that come up most often. The FOR and SEO codes, again, are the top three that show up again and again in their publications and grants.
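As a rough sketch of how that word cloud and the "top three codes" view could be derived (my own illustration, not the actual JCU implementation; the data shapes are invented), a frequency count over the researcher's publications is essentially all it takes:

```python
# Sketch: derive word-cloud weights and the top three FOR codes from
# a researcher's publication records. Data shapes are invented.
from collections import Counter

publications = [
    {"keywords": ["climate change", "birds"], "for_codes": ["0501", "0602"]},
    {"keywords": ["climate change", "modeling"], "for_codes": ["0602"]},
    {"keywords": ["birds", "climate change"], "for_codes": ["0602"]},
]

keyword_weights = Counter(kw for pub in publications for kw in pub["keywords"])
code_counts = Counter(c for pub in publications for c in pub["for_codes"])
top_three_codes = [code for code, _ in code_counts.most_common(3)]

print(keyword_weights)   # sizes for the word cloud
print(top_three_codes)   # the codes shown on the profile page
```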
The Publications tab is pulled from our publications repository. It shows the most recent publications, the last seven years, but there is a link through to the full profile, so if you want to see the full list of any researcher's publication history you can; otherwise you just see the most current, so you see what they're currently working on. Funding comes from our research services database, and supervision information comes from our student systems. For privacy reasons we don't show students' names at all. Some people really want their PhD students' names to appear, but we just give the name of the project and the researcher's role, either principal supervisor or associate supervisor. We show current supervisions and those completed within the last five years, and for those completed in the last five years, if the thesis is in the repository then the link takes you through to it.

And then one nice feature, again built from information that already exists in the publications repository: a map showing the places this researcher collaborates with around the world. And if a person has data sets in our metadata repository, then a Data tab also shows up. Now, Jeremy hasn't edited his profile at all; we put the very bad picture in for him, and we've got to get another picture of him, but you can see that even without the academic going to any amount of trouble, and you'll notice there's no biography there, it's still a fairly complete profile without the researcher having done anything, and it's attractive and looks nice. Adding the additional material then just completes the picture.

So what we're showing researchers is that if they do their bit in letting the university know about their publications, making sure their grants go through the proper channels and get recorded, and putting their data sets in the data repository, then without any other effort on their behalf they will have a complete, up-to-date and attractive-looking profile page that they can use to advertise themselves. That's our main carrot, and it has gone over really well. It's going live internally next week, partly to give people a chance to check their information. As we all know, the corporate systems that are our internal sources of truth sometimes contain small lies, and it's nice to be able to see those lies and get them fixed. There'll be about a two-month period where everyone has a chance to go through and inspect their pages. If they see things that aren't correct, for example they're down as teaching subjects they're not actually teaching, or not showing up as teaching subjects they are, then the site, when you go into edit mode, provides information on how to go about correcting the data that comes from those external sources of truth. The senior deputy vice chancellor took one look at the profile pages and said there's really no need for promotion applications now: you just send us a link to your page and we'll decide whether you get a promotion or not. So, more carrots.

Having got our carrots in place, we're now working on our self-deposit. We have a prototype site that we built back at the start of 2011 as our first project, and we're going to use that interface as the starting point for our self-deposit plugin for ReDBox. The way we want it to work in ReDBox is that a researcher will be able to log in and create a metadata record. That record will remain private to that individual researcher until such time as they believe it's ready and submit it, at which point it switches over into the normal library review workflow. We are also giving the researcher the ability to share view access to the record with colleagues.

One of the key deposit patterns we observed on our prototype site is that the professor who has 20 or 30 data sets to put in the repository doesn't put them in themselves. They find a PhD student or a research assistant, or they apply for a grant and get money to pay someone to do it for them. That's fine: as long as it's not our library staff being asked to do it, we don't care how the stuff gets in there.
But what it does mean is that, typically, the professor or researcher who's getting someone else to do that deposit does want to check what's going in before it goes public. So where someone other than the key researcher is doing the data entry, that person can share view access to the record with whoever needs it, so things can be checked. ReDBox unfortunately doesn't implement any record locking, so giving more than one person write access to a record is problematic. The way we overcome that is that while only the owner of the record can edit the metadata, that owner can transfer ownership if they choose. So where we have a PhD student entering the metadata, they can create the 50 records, or however many it is, get them to a certain standard, and then transfer ownership to the professor for the last spit and polish before it gets handed over to the librarians, who check that everything is as it should be.

Once a record has been submitted for review, it leaves the researcher's control and enters the normal workflow. If the researcher then sees something they want changed, it's the normal channels, just as would happen now: they need to contact the reviewers in the library and point out what needs to be changed or modified. For us, researchers care about their data and are usually fairly good at doing most of the metadata, so the position in the existing workflow where we'll be putting our self-deposited records is the final review stage. However, if anyone else chooses to use the plugin, it's just a matter of making a few simple changes if you'd rather records landed in the Investigation stage, as new records normally would, so that they go through the full chain of review you've got at your institution.

Marianne, can I just ask you there, just to clarify something, it's Simon here. The workflows built into ReDBox for that review: did you have to tweak them, or did they work perfectly for you?

Okay, so currently the ReDBox team have been working on the data management planning tool for Flinders in South Australia; Flinders, Deakin and Newcastle, I think, are all involved in that project. They've already created a similar workflow, so we're piggybacking off that work. We don't want to do anything new, and there's no point having two competing workflow streams, so we've got the code off Andrew and we're just adapting it to suit us; that workflow is pretty much the same one as the data management planning tool's.

The ultimate vision between me and Duncan, and I don't know if you've all seen Duncan's plan of a dashboard for a researcher, is that a researcher will come in and see the data management plans they've started working on, and, if self-deposit is a plugin you've installed or turned on, they'd also see the records they've created themselves. The one thing we need to work on with Duncan is that we want to be able to take a data management plan and import information from it into that self-deposit metadata record.
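A minimal sketch of what that plan-to-record import might look like, assuming both objects are simple dictionaries; the field names are hypothetical, and the real mapping between Duncan's tool and the self-deposit form is still to be worked out:

```python
# Sketch: pre-filling a self-deposit record from a data management
# plan. Field names are hypothetical; the real DMP-to-record mapping
# is still being designed.

def dataset_from_plan(plan):
    """Seed a new draft record with whatever the plan already knows."""
    return {
        "people": list(plan.get("people", [])),
        "for_codes": list(plan.get("for_codes", [])),
        # The project-level description is only a starting point: the
        # researcher must edit it to describe this particular data set.
        "description": plan.get("project_description", ""),
        "title": "",   # always specific to the individual data set
    }

plan = {
    "people": ["A. Researcher", "B. Student"],
    "for_codes": ["0602"],
    "project_description": "Generic description of the whole project.",
}
draft = dataset_from_plan(plan)
```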
So we're trying to make those two, which at the moment are two different products, work nicely together in the same sandbox. If you turn on both self-deposit and the data management planning tool, your researcher would start off with their data management plan, and then, as data sets started to eventuate out of the project that plan was for, they would click a button that says "create data set from this plan". It would pre-fill information from the plan into the self-deposit record: the people involved, probably the description, some of the FOR codes, and whatever other information you might have in your data management plan would come across, but it could then all be edited in the self-deposit form. Obviously a generic description you've written for your entire project in your data management plan would need to be adapted or amended or added to, to be an accurate description of the individual data set coming out of that plan. And then, once you're happy, it would switch over into the normal workflow.

Dianne Hillier has a question which you may want to answer now or later, because I know a lot of people have this question: how did you go about it, what carrots do you use, to get researchers to create good metadata records? Is that something you'd like to talk about now or later?

Yeah, I'll switch over and show you our metadata repository; first of all I was just going to do a quick show of what our prototype system looks like. Note this was built as a different system from ReDBox, so what will be in ReDBox won't be exactly this, but it gives you an idea of what we're planning to show for self-deposit. We've modified our ReDBox so that you can have more than just one description field. What we've found is that researchers really want to give detailed descriptions about their data: they care deeply about their methodology, want an opportunity to explain it, and want to explain any quality assurance mechanisms they've put their data through. Left to themselves, and it does depend on your researcher, I will qualify that, some researchers, when you give them a description field, will start writing like they're writing a technical paper, and it can go on for pages. So we give them the ability to create as many descriptions as they want, and we generally encourage three. Brief is your layman's description, for someone who's not necessarily in your field. Full description: go for your life; what do you think another researcher in your field who wants to use this data set would want to know about it? And then a notes field: okay, now give us a bit more information about the technical side of the data set, in terms of what formats it's stored in; if it's a spreadsheet with lots of columns and you've used abbreviations, give us some hints here. Preferably you'd include a legend in your spreadsheet, but there may be additional information you want to put down here.
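Put together, that multi-description structure might look something like this in sketch form; the field names are invented, and the real form is our modified ReDBox one:

```python
# Sketch: a record with multiple typed descriptions, as on our form.
# Field names are hypothetical, not the actual ReDBox schema.

record = {
    "title": "Example data set",
    "descriptions": [
        {"type": "brief",
         "text": "Layman's summary for someone outside the field."},
        {"type": "full",
         "text": "Everything a researcher in the field would want: "
                 "methodology, quality assurance, as long as they like."},
        {"type": "notes",
         "text": "Technical detail: file formats, column abbreviations, "
                 "anything not already covered by a legend in the data."},
    ],
}

# Researchers can add as many descriptions as they want; we encourage
# these three types.
brief = next(d["text"] for d in record["descriptions"] if d["type"] == "brief")
```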
Then there's the standard coverage stuff, with a map to draw on, and a data location field: is it in a lab, is it in a room, is it sitting on a computer system somewhere? That one is more for us, just to gather where the data is. Often researchers have large amounts of data which are quite tricky to transfer between systems; if you're talking to someone who's got a hundred 10-gig files they want to transfer, maybe sticking them on a USB key and walking them over to someone's office is problematic, and not everyone's up to doing transfers like that on the computer. Then associations, which we all know and love, keywords, and then space for the legal stuff and choosing licenses. That section is also a bit of an alert to us as to whether we need to go and talk to legal or ethics and make sure we've got all the appropriate clearances on things: whether the legal rights are sane, and whether how the researcher says they want to share this data is actually compatible with those rights. Okay, so that's what that interface looks like in that system.

I'll show you a record that a researcher did for us; this is one of the template ones for our Edgar data sets. I'm not a librarian, I am an IT person, so when it comes to a judgement of what makes good metadata and what doesn't, I guess my training has been through submitting things to ANDS and having the wonderful checkers there send back pages of comments to let me know where I could have done better. I definitely don't have a librarian's eye for this sort of thing, so I guess it's up to other people's judgement as to what is good metadata and what is not. I look at this and I think: I can read the first description and have some understanding of what it's about; there's some information that comes next, with links to sites to refer on to for someone who wants more detailed technical information; and it gives me some information about what I'm going to get if I actually click on the download file up the top. It looks good to me. There was a little bit of back and forth with this researcher, but not very much, so that was good.

I guess my attitude is that if the metadata record's not out there, no one's going to find the data set. I don't know how ANDS feels about the statement I'm about to make, but I'd rather have a mediocre metadata record available for someone to have a chance at finding than have metadata records stuck in a review queue for six months while they're made perfect. I may get burned at the stake for that comment, but that's how I feel about it, and that's how my boss feels about it. The other thing we feel will happen is that if the records go out there, and people are interested in your data set and find it but don't think your metadata is up to scratch, the researcher is going to be fielding lots of comments: what does this mean, what does that mean, how do I use this? I think it wouldn't be too long before the researcher with those data sets learns to be better about the way they put their records together. Does that answer the question?

Yes, I think it does, but I'm sure there'll be more about quality control.

And we're always careful to have a mediation step on it: we never, ever take a researcher's record and go straight to published; it always gets checked. What we're trying to save on is the bulk of the work, well, for starters, chasing the researcher down. We still do some of that, but because researchers have the ability to self-deposit, they self-identify, and that saves some of the workload. We do workshops with the graduate research school for early career researchers, and with PhD students it's part of their research training program, so they all know about their obligations around data management, about the tools we have here to help, and who they should contact if they want to create a metadata record. If they want hand-holding the first time, we're happy to do that.
Who does the checking? In the early stages I was doing the checking, because I was the one who had gone through Seeding the Commons and been trained by trial by fire. We now have a digital curation librarian at JCU who is starting to check in ReDBox as well, and she's working on a checklist. Because I've done a lot of the checking, I have a checklist in my head of what I look for when I'm going through metadata records. Can I understand the brief description? I should be able to understand the brief description even though I might not be up to scratch on what's in the full description. Do the FOR codes make sense? Is the title descriptive? Is there a note there that tells me what's in the data set, so that I know what I'm getting before I click on a link? Because there's nothing scarier than clicking on a link and finding out you've just got a five-terabyte download that you're not going to be able to fit on your file system. So there are some things I've just been doing myself. Jo is now going through that in a much more systematic way, because she's used to doing that in the library: making notes, making sure titles are in title case, and possibly some of the more detailed formatting issues which I possibly wasn't as fussy on. So that will become a more robust system, in that we'll then have a guide for other checkers, so it will be possible for more than just the one or two people who've been specifically trained over a period of time to get in and do the checking.
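Some of those checks could even be automated as a first pass before a human reviewer looks at a record. This is purely my own illustration; the real checklist is applied by hand, and the field names are invented:

```python
# Sketch: a first-pass automated version of some of the review checks
# described above, run before a human reviewer sees the record.
# Illustrative only; the actual checklist is a manual one.

def review_warnings(record):
    warnings = []
    if not record.get("brief_description"):
        warnings.append("no brief description for the lay reader")
    if not record.get("for_codes"):
        warnings.append("no FOR codes assigned")
    if len(record.get("title", "")) < 10:
        warnings.append("title may not be descriptive enough")
    if not record.get("size_note"):
        warnings.append("no note on data size/format: downloaders "
                        "should know what they're getting before they click")
    return warnings

record = {"title": "Emu data", "for_codes": ["0602"]}
for w in review_warnings(record):
    print("CHECK:", w)
```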
Okay, the next aspect of what we were doing was around the bulk imports. As I said, we've got the two big Apps projects, Edgar and Climas, which are producing lots of data sets for us. Our approach was to get the researcher to craft a metadata record for a sample output from the system. In the case of Edgar, each bird species produces two data sets: one is a cleaned set of observation records, and the other is the output of the modeling process, which projects the future distribution of that bird under different climate change scenarios and different parameters. So rather than saying "write us a template", we said: here we go, here's an emu; could you write me a metadata record for the input and the output records for an emu? Then, as a double check, we got them to do one more, the cassowary, compared the two, looked for the strong similarities, and came up with a template metadata record from that. With Climas it was a similar process. Climas comprises three tools, each tool puts out a different type of record, and the output from one tool is an input to the next, so again we got our researcher, and it's the same researcher involved in both of these projects, so I guess we're spoilt in a way, in that they've sort of been trained in metadata, to produce those three sample records. Because we were in an assessment stage with ANDS, we sent them to ANDS for an outside expert opinion on the metadata, took on board the feedback we got back, and then developed our template. In developing that template we worked out which fields were common to all the records and which fields would differ on a per-case basis.

What we do is make a default metadata file, and then an override metadata file that the application writes out into the directory with the data set, and it only has to write out the information that is particular to that record. So there's minimal impost on the application writer: they just have to put out a small JSON file with some additional information, which is stuff they already have, because in this project they've needed to display it on the website, like the name of the bird or some cross-referencing links to data sources, all stored in that metadata.json file. The default file sits on the ReDBox side with the harvesting script, and the override metadata sits on the system where the actual data sets are stored. We have ReDBox set up with a housekeeping job that runs at 6 a.m. every morning and inspects the file system for the creation of new files. If it finds new files, it runs its create-record mechanism: it takes the default metadata file, applies the override file, and then maps that to the internal ReDBox structure, and thus we have a record; there's a sketch of the idea below.

Because the template metadata file has gone through a series of checks before being put in place, the new records created in ReDBox go straight to published. The metadata has already been through an extensive vetting process, so more checking seemed like overkill; plus I cried at the thought of having to look at and manually publish 1,498 records for Edgar, so that just wasn't happening. As the project goes on, obviously new bird species don't come up all that frequently, so the ongoing load of new Edgar records will not be as great, and that would make it more feasible to do manual checking if for some reason you had doubts about your system and wanted human eyes across records before they went public. You can configure that in your harvester: if you decide to do a bulk import but are still cautious and want someone to look at things before they go public, it's just a matter of making some changes to the file so that records end up in one of the review stages rather than going straight to published.

For anyone interested in the bulk imports, or who just wants to have a look at what that metadata template looks like, the Edgar template can be found on our GitHub site, and the link is there; for those technically inclined, the rule mapping file can be found there too. As I said, we successfully published the Edgar records. We had to make a change to the ReDBox curation mechanism for it to cope with the request to curate that many records against a single researcher in one hit, and we've pushed those changes back to the ReDBox community for them to look at. Duncan said once they've got this 1.6 release out the door they'll be looking at the changes we've made and more than likely incorporating them into the main trunk of the ReDBox build, possibly with some changes; they may identify things that might cause problems in certain circumstances, so it'll go through a review process. But I know there are possibly other people in similar situations, where you've got a historical collection and the metadata will be very similar across all the records, and this mechanism of doing a bulk import might be something other people want to use.
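The default-plus-override merge is the heart of it. Here is a minimal sketch under stated assumptions: flat JSON files, a watched directory, and a marker file to avoid re-processing. The real implementation is the ReDBox housekeeping job and the rule mapping script linked above; the paths and file names here are invented:

```python
# Sketch of the harvest idea: scan a directory tree for new
# metadata.json override files, overlay each on the shared default
# metadata, and hand the merged record on for ingest. Paths, names
# and the ingest step are invented, not the real ReDBox code.
import json
from pathlib import Path

DEFAULTS = Path("default_metadata.json")   # lives with the harvester
DATA_ROOT = Path("/data/edgar")            # where data sets are written

def merged_record(override_path):
    record = json.loads(DEFAULTS.read_text())
    record.update(json.loads(override_path.read_text()))  # few key fields
    return record

def harvest():
    """Run daily (e.g. the 6 a.m. housekeeping job): pick up new data sets."""
    for override in DATA_ROOT.rglob("metadata.json"):
        done_marker = override.with_suffix(".done")
        if done_marker.exists():
            continue                       # already ingested on a prior run
        record = merged_record(override)
        # The real system maps this onto ReDBox's internal structure
        # and publishes it; here we just show the merged result.
        print("would ingest:", record.get("title"))
        done_marker.touch()

if __name__ == "__main__":
    harvest()
```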
So that's it; that's where I got to. Thanks, of course, to ANDS for the funding, and also to QCIF, which is Queensland's main eResearch body, for their assistance in getting this done. Any questions?

Okay, Marianne, it's probably a good opportunity for me to put in an unashamed plug for a couple of videos we've made about the Edgar project. In particular, there's one that was released yesterday with Stephen Garnett talking about some of the moral and philosophical issues that arise when you start to look at moving birds into a better habitat. All of this, the modeling, has come out of the Edgar project; I really find it fascinating.