So what I want to do is give you an overview of what we're trying to do in the metadata stores program, talk about some of the solutions we're funding, and then throw it open for more general questions.

As I probably mentioned in a throwaway line earlier today, we're in the process of updating our business plan. I think Ross Wilkinson spent a chunk of yesterday closeted away in a room updating it, so we should have a business plan available soon for the next, well, it's no longer the next year, the next nine months. Until then I'm having to point you at what's in the current publicly available document, which is a project plan from last year, but this text hasn't actually changed much. The research metadata store infrastructure, which was the name of this program, was really about infrastructure for the creation, management and harvesting of metadata about collections and objects. I won't read the rest of the definition because I'm going to talk about some of those concepts in a minute.

So the problem we've got is that there are an increasing number of data stores out there. Some of those are institutional: Monash has a thing with the catchy title of the Large Research Data Store, the University of Melbourne is looking to put in something similar, and a number of other institutions around Australia have got, or are building out, large data stores. In addition there's the national infrastructure. ARCS have got the ARCS Data Fabric; some of you may have been on mailing lists and seen notices about the ARCS Data Fabric roadshow that's coming up, and some of you may have also seen announcements that they've now got integration between the ARCS Data Fabric and Amazon S3. There was also an allocation of money in the Super Science budget, at the same time as the $48 million for the ANDS ARDC activity, to contribute towards national data storage infrastructure. $47 million of that has gone to an initiative at the University of Melbourne called NeCTAR, and I'm sure Steve Manos would be happy to answer questions on that. $50 million of it is going to infrastructure focused specifically on storage; there have been discussions about the business model for that, and about who the lead agent would be, but no announcement yet. That's in part caught up in caretaker mode. And then of course there's international data infrastructure, things like Amazon S3 and its equivalents, although S3 is less attractive for Australia because of the backhaul traffic costs.

Unfortunately, all of those have relatively poor support for rich metadata about the data being stored. They're good at storing ones and zeros; they're not good at storing metadata. If people want, I can do a quick demo of the Data Fabric in question time for those of you who haven't seen it; it's not on my laptop, so I'll have to remember how to authenticate. So I won't do the Data Fabric demo now, and we'll see if people want it at the end.

So what we're trying to do in the metadata stores program is provide ways to enable you to manage metadata that's going to drive and enable reuse. The reason I'm talking about this in a data capture briefing is that what you're capturing is data and metadata coming off the instruments, and that has to go somewhere. That's why metadata stores are a key component of this infrastructure.

The way we're currently talking about the kinds of metadata we'd ideally like to have available is in terms of four things. It's the second time I've used four things today; I'm not sure if this is a trend.
I guess it gets away from the standard habit of using three of everything. So we like to talk about information for discovery, information for determination of value, information for access, and information for reuse. You'll notice, by the way, that I haven't called this "discovery metadata" or "determination of value metadata" but "information for". That's in part because those phrases seem to resonate better with researchers, who don't necessarily think in terms of metadata. It's not true for all disciplines, but it's certainly true for some.

So what does that actually mean? The easiest one is information for discovery. This is the thing that's closest to catalogue metadata. You can infer some of it from other information, or from linked information; you can extract some of it from other systems, as I talked about earlier today; and some of it you have to enter manually. If you think back to the RDA demo earlier this afternoon, a lot of that was information for discovery.

So let's assume I've gone searching, either on Google or in RDA, and I've found some data that is potentially of interest. I now need to move on to the next step, which is deciding whether I care, and this is where information for determination of value comes in. Do I care about this data enough to investigate further?
And here the answer is really all about context. Two years ago, when we were first working out how ANDS was going to look, I think we thought we were mostly just going to build a discovery system. When we went out and consulted around our original model, people said: don't bother. If all you're doing is providing discovery, that's not going to be enough; you have to give people context for the data. So what we're trying to do in determination of value is provide as much context as possible, and some of the stuff you saw in the RDA demo, and in fact I might do a quick demo of that in a minute, was based around providing that context: information about the researcher, or the research program, or the institution they work for, or publications associated with the data, or what the experimental design looked like, or the availability of reuse metadata. Those are the things that help a researcher decide: yes, I care about that, because it's associated with a Nature publication, or I know the researcher, or I trust the ARC, or I think everything the University of Melbourne does is wonderful. Or not.

Third thing: once you've decided that you care about this particular data set, you then need information for access.
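To make the four categories concrete, here is a minimal sketch, in Python, of how a single collection record might group them. The record, the field names and the scoring function are all invented for illustration; this is not ANDS's actual schema or any real registry format.

```python
# A minimal sketch (field names invented, not ANDS's actual schema) of how
# the four kinds of information might sit together in one collection record.

coral_survey = {
    # information for discovery: the searchable terms
    "title": "Octocoral rapid ecological assessment",
    "keywords": ["octocoral", "sea fans", "sea whips"],
    # information for determination of value: context for "do I care?"
    "researcher": "K. Fabricius",
    "institution": "Australian Institute of Marine Science",
    "related_publications": ["(hypothetical publication reference)"],
    "methodology": "rapid ecological assessment",
    # information for access: how you actually get to the data
    "access": {"mode": "mediated", "contact": "email the researcher"},
    # information for reuse: what you need once you have the data
    "reuse_notes": "column legend, transect depths, calibration values",
}

def determination_of_value(record):
    """Gather the context a researcher might use to decide whether a
    discovered data set is worth investigating further."""
    context = []
    if record.get("related_publications"):
        context.append("has related publication(s)")
    if record.get("researcher"):
        context.append("produced by " + record["researcher"])
    if record.get("institution"):
        context.append("held by " + record["institution"])
    if record.get("methodology"):
        context.append("methodology: " + record["methodology"])
    return context

print(determination_of_value(coral_survey))
```

The point of the sketch is only that the four kinds of information are answers to four different questions a re-user asks, so a record that serves re-users needs fields addressing each of them, not just the discovery terms.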
You need to be able to get to the data. Now there are at least four possibilities here. Two of them are obvious; two of them, at least for me when I started doing this, were not.

The first obvious one is a direct link to open access data. You find the collections record, and the record lets you click straight through to the data. That's, if you like, the simplest possible use case. The second is that you have a link, but you have to authenticate: register or log in for restricted access. Remember that ANDS doesn't control what your data store does; we simply point. So this would take you to whatever your data store is, and it can enforce its own access control regime: log in using an authentication mechanism you manage, log in using the AAF, whatever, we don't care. Those are the two obvious ones: open access and restricted access.

The two that are perhaps less obvious, and therefore worth talking about a little, start with logging in for open access. The first time I saw this one I thought, wait a minute, what's going on here? An example of a system that uses it is an organisation in the Netherlands called DANS, Data Archiving and Networked Services, which is part of the Royal Netherlands Academy of Arts and Sciences. DANS is the major social science data archive for Dutch research; it covers some other disciplines as well, but primarily social science. In order to access the data they hold, you have to log in, and the reason is twofold. Firstly, the person who contributed the data can see who else is downloading and using it. That one you can kind of see: I've contributed my data, I'd like to see who else cares about the stuff I'm doing. Fine. The other reason for logging in is a little more subtle. When you download a data set, you can see everybody else who has downloaded it; as a re-user of the data, you can see all the other re-users. I asked them why they'd done that, and the answer was that they were trying to build a community of practice, or a community of interest, around a particular data set, so that you could discover other people who also cared about the data you cared about: I wonder why they're downloading that data set, I'll strike up a conversation with them. That's actually quite a nice use of login for open access data, and anyone can get a login, including Australians; it's not restricted to Dutch users.

The final information for access use case, again one that seemed a little weird the first time I saw it but that I now understand, is that there is no link to the data at all. There's no link to an underlying data store; there's an email address, or a phone number, or a physical address. A number of the researchers we're working with are saying: no, before anyone can reuse my data, I want to talk to them. I want to have a conversation with them, I want to check they're not one of my competitors, I want to make sure they understand the full brilliance of my research design, whatever. I want to be the gatekeeper. We're going to see a number of instances of that, and all of those possibilities are fine from our point of view; we absolutely don't have a problem with any of them. There is, of course, an additional dimension you can overlay on top of all that, which is an embargo. That is, it will be open access, or you'll have to log in, or you can contact me, but there's an embargo period within which you can't get it.

And the last one is information for reuse. This obviously varies hugely by discipline, but it's the kind of metadata you'd need to enable you to reuse the data once you've got it. It might be the methods section of a paper: many articles in Nature now have an electronic supplement carrying the detail of how the experiment was done, because the amount of space in Nature itself is constrained. It might be the reagents used in a chemical experiment. It might be the calibration values. It might even be something as subtle as what the variable names mean. For instance, suppose the data you've downloaded is a spreadsheet. Some of the polar data available through Research Data Australia at the moment is a series of Excel spreadsheets, and in a number of cases the column headings for those spreadsheets are not explained, so you have to guess. If you work in that discipline, you can probably guess reasonably accurately; if you're outside the discipline trying to reuse the data, your guess might be wrong. And even when you try, what do you do with a column called "temp"? Is it temporary? Is it temperature? If it's temperature, is it in Kelvin, or Fahrenheit, or Celsius? Is it in Réaumur?
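The "temp" problem is easy to sketch in code: with no legend a re-user can only guess, while even a tiny bit of per-column metadata settles the question. Everything below, the column names and the legend structure, is invented for illustration; it's not a real metadata standard.

```python
# Illustrative only: a hand-written legend showing how little metadata it
# takes to disambiguate a column heading like "temp" for re-users who are
# outside the discipline that produced the spreadsheet.

column_legend = {
    "temp": {"meaning": "water temperature", "unit": "degrees Celsius"},
    "vis": {"meaning": "visibility, modified Secchi technique", "unit": "metres"},
}

def describe(column):
    """Explain a spreadsheet column, or admit that we'd be guessing."""
    info = column_legend.get(column)
    if info is None:
        return column + ": meaning unknown, you would have to guess"
    return column + ": " + info["meaning"] + " (" + info["unit"] + ")"

print(describe("temp"))   # settled by the legend
print(describe("depth"))  # not in the legend: back to guessing
```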
There's no way of knowing unless you've got at least some kind of legend. Nick's shaking his head at Réaumur; I did it for you, Nick. So there's a significant need for information for reuse, and it's going to vary enormously.

So this is yet another view of ISO 2146. Rather than going through that, let me just quickly show you what it looks like for some real data. Here is the coral example Sally was showing you before: a collections record coming from the Australian Institute of Marine Science. This is actually displaying a number of the things I was talking about, discovery, access, determination of value and reuse, although not all in an optimal way. Discovery: you might have been searching for "octocoral", which is what I was searching for, or Stolonifera, whatever that is; you might have searched on species names, sea fans, sea whips, whatever. So you can imagine a range of searches that would have used some of that information to get you to this particular record. Information for determination of value: well, you might say, I'm interested in rapid ecological assessment, so if they're using that particular methodology I'm more interested in the data set, because I think that's a good way to do things. You might say, looking down the bottom, that I trust the stuff Katharina Fabricius does, or that I think the Australian Institute of Marine Science in general does good stuff. So that might be part of your information for determination of value.

Information for access, I think, is where this particular record falls down. Let's try. Yeah, okay. In this case this is going to take me to one of those metadata standards that Jabin talked about before, ISO 19115; you can see up there that you can save this record as ISO 19115. In this particular case the online resource is not the actual data set, it's a link to the AIMS website, so in fact the way you get access to this data is to contact Katharina Fabricius. The information for access is one of those go-through-a-gatekeeper models.

And then the last one, information for reuse. Some of it here they've actually put in the description, and you'll see this with a number of records: they've got quite a lot of what some disciplines might call methods information in the actual description. They're telling you how they're doing their transects, they're telling you what depths they're using, they're telling you what the site variables were, so visibility and a modified Secchi technique. Again, if you're one of those kinds of researchers, this is the kind of stuff you care about. So there's a reasonable amount of reuse information in there. In this case I can't actually see what the data itself looks like; I would need to contact Katharina Fabricius.

The last point I wanted to make about this one is that there is information hidden in the ISO 19115 record that we are not surfacing at the moment, and I think Sally showed you some of this before. A lot of it is stuff you might think, well, I don't want to surface. However, you probably do want to surface something like this: here's the related publication information. It would be nice if we could extract that and display it as related information in the collections record, because that would actually be helpful context for someone. So we might have to have a look at the crosswalk we're doing from ISO 19115 into our collections record. The reason this isn't showing up is that they haven't put it in as a publication.
They put it in as supplemental information So it's basically just a lump of text and it would be relatively difficult to work out how you'd surface that in a collections record But that kind of link to a publication is the sort of thing that would be nice And if your projects are doing data capture and there are associated Publications it would be really nice to see links to the publications in the records as part of that Determination of value information So I won't talk about That is not well that maybe what I said I was I wanted. That's not what I meant So if we care about these kinds of information information for discovery for determination of value for access for reuse Where are we going to get them from? well Down the left are the ISO 20 on 46 entities collections parties activities and services Next column across are the research instances. What does that mean in a research context? physical and digital collections individual and Organizations for parties whole lot of information about research projects and we're still trying to work out how to model things like Synchrotrons or beam lines on the synchrotrons as services Nick will have that solved by this weekend, right Nick? Excellent. It's actually a relatively tricky modeling task, but it is one we know we need to solve Okay, so we just need to recognize Nick's brilliance and the problem will go away the identifier source Nick talked about the role of persistent identifiers for collections Sally Sally Monica has already talked about people Australia for Individuals and organizations We're an active discussion with the ARC and the NHM RC about identifiers for research projects Providing linked data endpoints for those don't have a solution for services at the moment and inside Institutions this stuff may live in your institutional repositories. 
You have been talked about that in your institutional metadata stores Which I'll talk about in a minute in your HR system your research management system So, how does that get into the collections registry membering from this morning that we use the collections registry to build the RDA pages You can just do a series of feeds So that data source administrator interface that you have been showed you you can have multiple data sources for an institution You know if you wanted to you could have a repository and a metadata store and your HR system and your research management system Four separate data sources. That's fine Or you can feed stuff through a metadata aggregator and then do a single feed from the aggregator Into the collections registry. That's an architecture decision that you will need to make So what are the drivers for the metadata store? I've already talked about the paucity of metadata in the existing data solutions Clearly something that you're providing as a metadata store needs to meet your needs as an institution For managing rich metadata our needs to get feeds of information about collections But also as I think a couple of people have mentioned Needs to solve the problems for seeding the commons and for data capture Remembering that all of the data capture funded universities have also got seeding the commons money and Those are a little bit different certainly collection stuff plus associated information in the data capture world possibly some object metadata as well In an ideal world, we would have funded all of the metadata stores projects last year We would have got those solutions fill finished We would have had them ready and deployed into institutions and we then would have said and now guys We'd like to spend money with you on data capture Unfortunately the world that we live in rather than my preferred parallel universe we thought we had to spend all our money by the middle of next year and so we did everything in parallel and As a 
result the early activity method or metadata stores projects things that we started talking to institutions about in September last year Are either still building solutions or in some cases haven't started yet? And we're building the activity and party identifier infrastructure that you've heard a little bit about today and At the same time you're doing data capture and seeding the commons projects So this is clearly not ideal, but is unfortunately what we're going to have to live with the alternative would have been I guess Well, there really wasn't any alternative We could I guess in theory have said when we got the time extension Just put a hold on all your data capture projects and we'll come back and talk to you in a year But we would have been lynched so we didn't say that so we have to live with the fact that we're trying to do everything in Parallel and if I had time at this point I would play a fantastic commercial that EDS built EDS showed 10 or 15 years ago now about building an aeroplane in the air while you're flying it which is sometimes how Anne's feels So what metadata stores solutions do we have at the moment? 
The first is a system based on Vitro. The University of Melbourne is the lead for this, which is why they're in italics. QUT and Griffith, who we're also funding on a research metadata hub, are picking it up and using it, and UWA, I believe, are also proposing to use it. It's an RDF-triple-store-based solution, technology developed out of Cornell, and Simon Porter at the University of Melbourne would, I'm sure, be happy to answer questions on it if you have any.

The second solution is a thing called Inject, because we can't come up with a better name for it yet. The lead agency for this is the Australian Digital Futures Institute at USQ; some of you may know Peter Sefton. This is building on top of an existing institutional repository solution, and they're testing it out with the University of Newcastle to make sure it meets their needs. So they're not building a generic solution, they're building to a specific use case, and the Newcastle instance is called ReDBox, for reasons that will become clear on the next slide. Swinburne University is also interested in picking this one up. Peter Sefton at USQ or Vicki Picasso at Newcastle would be able to help you with the details. The reason it's called ReDBox is this: all the red stuff is what the project is building, the green stuff is the external systems it talks to, and the blue stuff is the institutional repository. So the red stuff is the new components: they have an event queue which monitors external sources of information, a pluggable harvester that can slurp stuff out of those, and a form system; they use the institutional repository as the underlying store; and then they can feed stuff to us. If you're interested in this, Peter Sefton blogs about it reasonably frequently at ptsefton.com.

The third solution is, I forget what you called it this morning, Anthony, it's no longer called a TARDIS derivative, it's "I Share My Research Data". I was going to say "I Heart My Research Data"; I knew that was wrong. This is building on the TARDIS code that Steve Androulakis has done. I still haven't done the Twitter thing for your name yet; see, I've done it in real life, how's that? So this is going to have the ability, and these are Anthony's bullet points except for the last one, to stage data, upload data via a web interface, annotate data with new parameters, map experiments, provide a web service interface, and manage access around groups. This one is intended to be generic. The stuff Steve built in the original TARDIS was specific to protein crystallography; this is a generalisation of that to sit on top of the Large Research Data Store. I'm saying it's potentially useful because it looks like it would be a really good idea, but it's running late, so until I actually see it running I'm reluctant to say more than "potentially useful". I'm sure Anthony would be happy to talk to you about it afterwards.

The second-last option is that you can run the software we use for the collections registry locally within your institution: a local ORCA instance. That was designed to operate in a federated mode, but it's primarily focused on meeting the needs of the ANDS registry, so it's not intended as an institutional metadata solution. Having said that, ANU have said that they're going to use it for their data capture project, and they will probably be extending it, so that is another possibility available now, and extensions will probably come out of the ANU use of it. That's the architecture diagram; don't worry about that.

And then the last option is that you can use your existing institutional repository, but I'd skip straight to the bottom bullet point, which is: the recommendation is don't. It's doable, but Jabin and Nick have spent some time looking at it, and it's a valley of pain. The valley is deeper and longer for some institutional repository solutions than for others, but in none of them is it less than a valley, and never without a degree of pain. Most institutional repositories don't like storing large objects, they don't have good collection support, they're really only designed to do DC and MODS and don't cope well with RIF-CS, and there's a range of other problems. So think very hard before you go with that solution. And I'm fractionally over time, but I'll stop there.
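As a footnote to that "valley of pain" point, here is a rough Python sketch of one reason the crosswalk hurts: Dublin Core is flat, while a RIF-CS-style collection description wants typed, linked entities. The field names and the mapping below are simplified assumptions for illustration, not the real DC or RIF-CS schemas.

```python
# A rough sketch of the repository "valley of pain": a flat DC record can't
# faithfully populate a RIF-CS-like collection description, so a crosswalk
# either guesses or drops information.

dc_record = {
    "dc:title": "Octocoral survey data",
    "dc:creator": "K. Fabricius",         # a bare string, not a linked party
    "dc:relation": "some related thing",  # publication? data set? no way to tell
}

def dc_to_rifcs_collection(dc):
    """Crosswalk a flat DC record into a RIF-CS-like collection dict,
    returning both what mapped and what could not be mapped faithfully."""
    collection = {"type": "collection",
                  "name": dc.get("dc:title", ""),
                  "relatedObjects": []}
    unmapped = {}
    if "dc:creator" in dc:
        # RIF-CS wants a keyed party object; DC only gives a name string,
        # so the party key (its persistent identifier) is simply missing.
        collection["relatedObjects"].append(
            {"relation": "hasCollector",
             "party": {"name": dc["dc:creator"], "key": None}})
    if "dc:relation" in dc:
        # dc:relation carries no relation type, so we can't map it safely.
        unmapped["dc:relation"] = dc["dc:relation"]
    return collection, unmapped

rifcs, lost = dc_to_rifcs_collection(dc_record)
print(lost)  # the context information the crosswalk had to leave behind
```

The sketch is one-directional and toy-sized, but it shows why the repository route loses exactly the determination-of-value context (linked parties, typed related publications) the talk argues for.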