Hello everybody. I'm Peter Sefton, the eResearch manager at the University of Technology Sydney, until next Thursday, after which my position is being made redundant under one of those voluntary separation programs we have. I realised I gave stupidly long titles to a couple of talks I'm giving this week.

Okay, just to give some context (I don't know how many people were here for the previous question): I'm involved in a project called RO-Crate. I'm getting a bit of background noise; could the person on the phone please mute? The RO-Crate specification is a way of using schema.org and JSON-LD, which fits nicely with the last discussion, to describe and package research data. It really started as a packaging exercise: if you have a bunch of files in a zip file or sitting on your hard disk, how do you provide good context, good metadata, for them? But it also turns out that, the way we've done it, RO-Crates can be used for discovery and description, and you can pop one onto a website. The standard was just released a month ago, version 1.1 of the spec, and you can look at it online.
Okay, just to show very quickly what RO-Crate looks like: an RO-Crate is a directory on a file system somewhere, with really only one compulsory thing, which you can see on the left there: it must have an ro-crate-metadata.json file, and that has schema.org in it. I guess I don't need to explain JSON-LD and schema.org to this audience, particularly after the last talk. This file is a description of a data set, and it lines up pretty well with Google's Dataset Search, so RO-Crates should be discoverable from Google Dataset Search. There's more work we need to do on that, but it will describe the thing as a whole as a data set, using the schema.org Dataset type, and it can have as much or as little data about the files and the variables inside the files as you like. So it would fit well with the work we've just heard about: if you were distributing a spreadsheet of ABS data or something, you could wrap it in an RO-Crate, and there's standard tooling across a number of programming languages to deal with it, and repositories and tools around the world are now starting to add support for it.

On the right we have a recommended optional feature which RO-Crates should have, which is that they come with an HTML page, a human-readable version. So you see I've got a scientist there and a computer, both reading the same data in different forms. The HTML can be generated from the JSON-LD and provides a human-readable summary of what's inside the crate.

So that's what we've been working on. The thing I talked about last time was that RO-Crate, using schema.org, has worked quite well at the discovery level: there's enough in schema.org to do the basic discovery metadata that you need for research data packages, with a little bit of fudging around the edges, dealing with a vocabulary which was largely developed for commercial purposes by search engines. It wasn't a 100%
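To make the one compulsory file concrete, here is a minimal ro-crate-metadata.json, sketched in Python so the structure can be built and inspected. The dataset name, date, and file name are illustrative, but the context URL, conformsTo, and root-dataset layout follow the v1.1 spec:

```python
import json

# Minimal RO-Crate metadata: a flat JSON-LD graph using the schema.org-based
# RO-Crate context. The root data entity is a schema.org Dataset; file
# entities use the File alias for schema.org MediaObject.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Example dataset",      # illustrative
            "datePublished": "2020-03-01",  # illustrative
            "hasPart": [{"@id": "data.csv"}],
        },
        {
            "@id": "data.csv",
            "@type": "File",  # RO-Crate alias for schema.org MediaObject
            "name": "Example data file",
        },
    ],
}

metadata_json = json.dumps(crate, indent=2)
print(metadata_json)
```

The HTML preview mentioned above can be generated from exactly this JSON-LD, since the whole description lives in one flat `@graph`.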
fit, but it was pretty good. You can talk about names and dates, and you can model people, organizations, and funding relationships, so you can quite easily model a data set. But we have discovered, particularly in working with some humanities groups, that we can push the Research Object Crate stuff much further.

RO-Crate makes a distinction between what we call data entities (the data set which is the root, the thing that you're talking about, and the files contained inside it; we've repurposed the schema.org MediaObject for files, since schema.org didn't have a native file class) and, importantly, contextual entities. So if you're talking about the author of a crate, or the organization that was funded by a funding body, you can talk about all those people and organizations, and schema.org is fine for that. But we've come unstuck trying to deal with more abstract crates where a lot of the information is actually contextual.

This is an example, a bit of text out of the spec, and it summarizes what I was talking about last time. I think I came to this group with some questions, some hand-waving, and some silly ideas last time about what we might do; this is what actually ended up getting written into the spec, which is a very pragmatic way to let people expand the vocabulary. The examples used here come from a real project I'm working on with Dr Alana Piper, a historian of criminology (you have to be careful how you say that: you don't want to call her a criminal historian), who has a large data set: the archive of the Victorian prison system from about 1850 to about 1950. She has run a big crowdsourcing exercise to transcribe all the records. We can describe most of that, the people, places and so on, and model the context of each of those archival records pretty well using schema.org, but there are some things
that Alana wanted to say about the data that were not easily done with this vocabulary. Specifically, she wanted an "education" term, to say what level of education people had. ("Interests" is made up; somebody else made that up, it's just an example in the spec.) There are also things like sentencing and convictions, whose definitions vary across different jurisdictions around the world.

So what we ended up coming up with is a way of expanding a vocabulary in a temporary way: we've allowed Alana to define her own terms. In the tooling we've got for this, she gives us stuff in spreadsheets, and when we generate the Research Object Crate, which has all this contextual information in it, we also generate an HTML file which simply prints out the definitions that Alana gave us. We give her that HTML file back and she pops it on her website. She has criminalcharacters.com as the website for her project, and if you resolve the URL criminalcharacters.com/vocab/education, you get a human-readable gloss for what "education" means. From a practical point of view, for a human, this is about the same as using schema.org: as we saw in the last talk, schema.org has a website, and you can go there and read the definitions for terms and classes (the definitions vary in quality, but they're there). So in a purely practical sense this is what we wanted to be able to do, and we started to build tool chains to let people define their own vocabulary and pop it up on their own website. Obviously, and I want to come back to this, this is not great at a community level, and we should work out ways of working together to build vocabularies and extend things like schema.org, but it's a pragmatic thing that we can do.
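The ad-hoc term pattern described here can be sketched as a contextual entity in the crate's JSON-LD: the term gets an @id that resolves to the human-readable page on the project website, plus a label and comment in the style schema.org uses to document its own terms. The exact URL and the gloss text are illustrative, not taken from the real project:

```python
import json

# A locally defined property, modelled as a contextual entity.
# The @id resolves to a human-readable definition on the project website;
# rdf:Property plus rdfs:label/rdfs:comment mirror how established
# vocabularies document their terms.
education_term = {
    "@id": "https://criminalcharacters.com/vocab/education",  # illustrative URL
    "@type": "rdf:Property",
    "rdfs:label": "education",
    "rdfs:comment": "Level of schooling a person attained, as recorded "
                    "in the prison registers.",  # illustrative gloss
}

# The term is then usable on entities in the same crate:
person = {
    "@id": "#person-1",          # hypothetical record
    "@type": "Person",
    "education": "Reads and writes",
}

print(json.dumps([education_term, person], indent=2))
```

Because the definition is just another entity in the graph, the same spreadsheet-driven tooling can emit both the HTML gloss page and this JSON-LD.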
The other thing we've added: just sticking up a website to define something is a tiny bit dodgy, so we're also encouraging people who are using ad-hoc terms in RO-Crate to actually put a definition, in JSON-LD, into the package. What we have here is how you could put in this "education" property, with the label "education", and actually ship it in the crate with the data. The data set for this will have about 50,000 records, PDF records that have been scanned and transcribed, with all this context around them; some of these people had dozens or hundreds of convictions over their criminal careers, and all of that contextual information is described. So the data will actually ship with its own vocabulary. It's not ideal. It would be better if a group of the historians of criminology could get together and come up with a shared vocabulary, but this is what we've decided to do so far.

Another data set we've been working with is Expert Nation, a history exercise about the universities that were around in the early part of the 20th century. It's actually led out of UTS, by Associate Professor Tamson Pietsch. They have data in Heurist that looks like this. Heurist is a sort of semantic database from the University of Sydney which has been around for a long time and is not really aligned with modern practice for doing linked data. It is essentially a linked-data thing, and it has its own notion of ontologies and so on, but at this point it doesn't mesh well with the rest of the web. So we wrote some scripts which pull the data out of Heurist (it has a complete API, so that's a fairly straightforward thing to do) and push the data into JSON-LD using the RO-Crate guidelines for how you do that. You end up with something I'll send a link to, so people can have a look when I tidy up this presentation and give it to Rowan.
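The Heurist-to-RO-Crate step can be sketched as a pure transformation: records pulled from the Heurist API (the input record shape shown here is an assumption for illustration, not the real API response format) are mapped into flat JSON-LD entities, preserving Heurist's own type names rather than guessing a schema.org mapping:

```python
# Sketch of mapping Heurist-style records into RO-Crate-style JSON-LD
# entities. The record shape is illustrative; the real Heurist API differs.
def heurist_to_entities(records):
    entities = []
    for rec in records:
        entity = {
            "@id": f"#heurist-{rec['id']}",
            # Preserve Heurist's own type name (e.g. "Person 2") rather than
            # forcing a possibly wrong schema.org mapping.
            "@type": rec["type"],
        }
        entity.update(rec.get("fields", {}))
        entities.append(entity)
    return entities

# Hypothetical records, shaped like what a Heurist export might contain:
records = [
    {"id": 101, "type": "Person 2", "fields": {"name": "Example Person"}},
    {"id": 202, "type": "Event", "fields": {"name": "Graduation, 1915"}},
]
graph = heurist_to_entities(records)
```

Keeping the original type names is a deliberate choice: it loses interoperability, but it avoids asserting mappings (is "Person 2" a second kind of person, or an accident of history?) that nobody can verify.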
And this will have a link in it, so you can go and have a play with this and have a look at it. Here the problem was more extreme than with Alana's criminal data I was talking about before, because Heurist is a complete self-contained world, with event concepts for things like births, deaths, and marriages, and it does not easily map onto something like schema.org; that would be a very major undertaking. So we did a similar thing: we consumed the definitions for things like people and universities that were in Heurist. They sometimes had names like "Person 2" as the type, which makes it a bit hard to map them onto schema.org, because you don't know whether "Person 2" is just an artifact of some accidental history or whether it's genuinely a second type of person. So we just preserved the Heurist ontology. The problem with that, though, is that the result does not play well with other datasets: you can't find a person and know that they're the same person between this dataset and another dataset if everything is self-contained in every separate project. So we did a similar thing to what I showed you with the locally defined classes, but there are lots of them. And we can publish that dataset.

Having worked with these couple of projects, and more we're talking to, what I think is that schema.org is quite close to being quite good for the universities and for the GLAM groups. You do have people and events. I've worked with people like Deb Verhoeven, who studies movies, and if you're studying movies, no problem, right? They're very well modeled; you can go down to actors and producers and screenwriters and everything. But if you're trying to do criminal history, you don't really have the level of precision that you'd like.
For example, we have a Courthouse in schema.org, but it conflates things: the courthouse is a place or a building. It doesn't give you the opportunity to separate the institution of the court from where it's located, and that might change over time, or you might have things like circuit courts that appear in different places. So the fundamental building blocks are there in schema.org, and I think it will probably be a sensible way to go to build out from that. I don't know if anybody in this group has been involved in any of those projects, but that's kind of what we're thinking of doing when we're dealing with these projects for people.

So, for example, Tamson wanted a stable website to support a book they're writing, a snapshot of what's in Heurist at the current time. By pulling the data out of Heurist into an RO-Crate (we have tools for generating a static website from a crate), we can build a static website with information about all of the people in that database (it's about the employment history of people who returned from World War One) without needing any special web servers or whatever. But we can also build a site which might run for a couple of years and provides more detailed search services and so on over the top of it.

Right. I've actually come with questions here, so I should finish up, and maybe we can discuss and people can give me some suggestions. Just to summarize where we are: we've come up with ways of capturing existing ontologies or schemas that are kind of buried in a system like Heurist, and simple ways for researchers like Alana Piper to extend the vocabulary she needs, plus the tools we need in the immediate future for her to be able to put up explanations of what those terms mean.
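The static-website step mentioned above can be sketched as a loop over the crate's Person entities, writing one plain HTML page each, with no web server required. The file layout, entity names, and HTML are all illustrative, not the real tooling:

```python
import html
import pathlib

# Sketch: one static HTML page per Person entity in a crate's @graph.
def build_site(graph, out_dir="site"):
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    pages = []
    for entity in graph:
        if entity.get("@type") != "Person":
            continue  # only generate pages for people in this sketch
        name = html.escape(entity.get("name", "Unknown"))
        page = out / f"{entity['@id'].lstrip('#')}.html"
        page.write_text(f"<html><body><h1>{name}</h1></body></html>")
        pages.append(page)
    return pages

# Hypothetical graph entries:
graph = [
    {"@id": "#person-1", "@type": "Person", "name": "Example Person"},
    {"@id": "#place-1", "@type": "Place", "name": "Melbourne"},
]
pages = build_site(graph)
```

Because the output is just files, the snapshot stays readable indefinitely, which is the point of the book-support use case; the richer search service is a separate, shorter-lived layer on top.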
But it would be better if we could come up with a pipeline where we could build out on the schema.org vocabulary for use in academic projects. Another thing, which came up on the screen but I didn't really talk about: one of the approaches we're taking with the RO-Crate project, to help seed this sort of development, is a very simple thing where we let people choose a namespace on the RO-Crate website and add terms to it. There will be a simple way to do that, like uploading a CSV file or a spreadsheet with some terms, and they go into a common place. It's similar, I think, to what ARDC has done with their vocabulary service, but using very simple tools: a GitHub repository which will automatically generate the definitions of terms from a spreadsheet, for people to use. But that's not as good as having a more fleshed-out schema.org. So I'll finish up there.