With the Taxonomy Research and Information Network, which deals with organisms or something. It's obviously not my area, so I'm not quite sure. But he's here to talk to us about lonely data versus linked data with Foswiki. So I'll hand it over to Paul. Thank you.
There we go. So I was just wondering if I could start with a question. How many in the audience here actually know about linked data, the semantic web? You guys have, okay. And you're using it to bind your data, to make your data reusable and that sort of thing? That's a lot harder.
So where I work is the Plant Biodiversity Research Center in Canberra, at CSIRO. And I work for something called TRIN, the Taxonomy Research and Information Network. And this is our team, so I'm not the only guy, of course. There's three of us employed full time. The other guys have day jobs, unfortunately. So we're like the geeks within TRIN. The other guys do real science. Well, I should say that we're trying to develop IT systems to help support science and make science more usable, because there's quite a lot of research done. And the focus for the scientist, as you would have heard Adam and others say before, is on getting a paper out and getting citations. That is the concern of the scientist. Sadly, a lot of really useful data that could be highly valuable later on doesn't go into the paper, and that's a bit unfortunate.
Especially with taxonomy, there's a real shortage of taxonomists. These are biodiversity researchers who are focused on the classification and the evolution of organisms. And this is very useful: if you're trying to grind up sea sponges to find new cancer cures and things, it's useful to know how organisms are related to each other in terms of their evolution. And so with there being very few taxonomists, we need to make sure they don't waste their time.
And so TRIN embarked on this project to accelerate biodiversity research and try to make all the data as seamless and reusable as possible. So this means that we have to use open source software, because taxonomy is not a very lucrative field of research, and there's usually not a whole lot of funding apart from the taxonomist's position. So we have to make sure the software that we use, recommend and also develop is free and open source. We also have to use existing open standards, and that's a bit more difficult than it sounds because there are simply so many of them. And when somebody is faced with the task of trying to take a data set and make it reusable in a linked data fashion, it's simply overwhelming, trying to pick RDF ontologies and vocabularies and to come up with your own data model that will make sense to someone else. And as Adam said, CSV is usually the best because it doesn't carry any of this model baggage with it. So you've just got columns and headings. That can be a problem.
We have all of these systems and activities going on. Everyone's collecting data in all of these formats, CSV, Access, everyone knows the problem. So a lot of time is wasted transcribing stuff from one platform to the next, from one part of the workflow to the next. Our focus has actually been to use Foswiki, which is a collaboration platform. It's a little bit more than a wiki. It's more of an application development environment in a sense. So we've got version control and access controls. That's why we chose Foswiki. So we've got data capture in the field on PDAs, and data capture from laboratories, especially with genetic work. There's a lot of data generated there. The big problem comes when you get to the end of a workflow and you've got a data set and you need to annotate it and mark it up for use. It might even be for publication or a field guide.
And so you've got this list of hundreds or thousands of types of species names with various attributes that you've generated. The very first problem we come up against is that the idea of the evolution of plants and animals is changing. So for instance, I had one person working on something, and it was just the first 10 records of their data set. And in the first five minutes, I'd found out that this species and genus name here had actually changed completely. It was no longer in the same family or order, as we call it. In fact, apart from a few ranks down from Plantae, it had basically completely changed its position in the taxonomy. So when scientists say, oh, I found an interesting compound in this plant, or this is an interesting gene in this plant, it's very important that they know that they're all researching and talking about the same critter. So apart from identification and classification, the naming is a pretty fuzzy handle to work with. If the classification changes, that changes the name as well.
So to try and automate this, we really want to avoid copy-paste madness. Copy-paste madness is where you've gone to the name authority and you've copy-pasted the taxonomy and the name and the authority and all the other attributes about the critter into your own data set. So you're sort of importing this stuff, but you're really just duplicating it. You're making redundant information. So really, we think, and this is the linked data part, the data should just be the link. If you have a standard identifier authority for the things that you have in your database, and this can be genes like in GenBank, or in the case of taxonomy, we have APNI and AFD, that's the Australian Plant Name Index and the Australian Faunal Directory, and a few others as well which are coming online, then if you use their persistent HTTP URIs, that is all you need to store in your database.
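As a minimal sketch of "the data should just be the link": the record below stores only the taxon URI plus the researcher's own data, and everything else is a cache that can be re-pulled. The `AUTHORITY` dictionary and `fetch_properties` are hypothetical stand-ins for a request to the name authority, and the URI, names and property keys are placeholders, not real APNI or AFD records.

```python
from dataclasses import dataclass, field
from typing import Dict, List

TAXON_URI = "http://example.org/taxon/123"  # placeholder, not a real authority URI

# Hypothetical stand-in for the name authority's answers, keyed by URI.
AUTHORITY: Dict[str, Dict[str, str]] = {
    TAXON_URI: {
        "name": "Examplea prima",
        "family": "Exampleaceae",
        "status": "accepted",
    },
}


def fetch_properties(uri: str) -> Dict[str, str]:
    """Assumed lookup; a real system would dereference the URI over HTTP."""
    return dict(AUTHORITY[uri])


@dataclass
class Record:
    taxon_uri: str  # the only authority data the record really owns
    note: str       # the researcher's own observation
    cache: Dict[str, str] = field(default_factory=dict)  # derived, disposable


def refresh(record: Record) -> List[str]:
    """Re-pull the cached properties and return the keys whose values changed."""
    latest = fetch_properties(record.taxon_uri)
    changed = sorted(
        key for key, value in latest.items()
        if key in record.cache and record.cache[key] != value
    )
    record.cache = latest
    return changed
```

The first call to `refresh` fills the cache and reports nothing; a later call returns the properties that differ from the cached copy, which is the list a person would be asked to review.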
You can cache and pull in all the other properties about the taxonomy and the name and the authority and not have to worry about keeping that in sync. So six months later, if the project takes 12 months, you can go back and query every single taxon and say, has the name changed? Is this still a currently accepted concept? What is its current classification? And you can update, or at least flag for review, a lot of things in your database that would otherwise require a human to sit down and manually have a look. Okay, this is a species name, find it in the literature, find it in the database: that wastes a lot of time, and it's something really easy to solve.
This is an example snippet, from the more nuts-and-bolts side of the semantic web and linked data. This is an RDF/XML fragment. As you can see, it's got a whole bunch of junk in it. I'm just showing the links. So you can think of linked data as actually a graph. It's a massive entity-relationship sort of network of things. So this thing is actually the response for Aegiceras corniculatum, and it's showing us that it has a name, obviously. It has literature references from where it was mentioned. It also says it's an accepted concept for several names. When we say accepted concept, that's according to the Australian Plant Census.
Linked data makes glorious graphs. Well, this is not to scale, it's quite a simplification, but determining the classification of an object is a little bit tricky. So the user actually finds this URI. These are research scientists, not ordinary Wikipedia users, so we're allowed to give them a bit of ugly skunkworks stuff. So they do actually have to determine the URI of the thing that they know they are talking about. And once they've put that into the wiki, it goes off and finds each node. It walks the graph and reverse-engineers the classification. So we have these child-taxon relationships.
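The graph walk described here can be sketched roughly as follows. This is only an illustration under stated assumptions: `NODES` is a hypothetical in-memory stand-in for the RDF fetched for each node, each node lists its child taxa, and the URIs, ranks and names are placeholders rather than real records. Because the links point from parent to child, the sketch first inverts them and then climbs from the chosen taxon to the root.

```python
from typing import Dict, List, Optional, Tuple

FAMILY = "http://example.org/taxon/family-1"    # placeholder URIs
GENUS = "http://example.org/taxon/genus-1"
SPECIES = "http://example.org/taxon/species-1"

# Hypothetical graph: each node records its rank, name and child taxa.
NODES: Dict[str, dict] = {
    FAMILY: {"rank": "family", "name": "Exampleaceae", "children": [GENUS]},
    GENUS: {"rank": "genus", "name": "Examplea", "children": [SPECIES]},
    SPECIES: {"rank": "species", "name": "Examplea prima", "children": []},
}


def parent_index(nodes: Dict[str, dict]) -> Dict[str, str]:
    """Invert the child-taxon links so each child URI maps to its parent URI."""
    return {
        child: parent
        for parent, node in nodes.items()
        for child in node["children"]
    }


def classification(uri: str, nodes: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Climb from `uri` to the root; return (rank, name) pairs, highest rank first."""
    parents = parent_index(nodes)
    path: List[Tuple[str, str]] = []
    seen = set()
    current: Optional[str] = uri
    while current is not None:
        if current in seen:  # guard against a malformed, cyclic graph
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        node = nodes[current]
        path.append((node["rank"], node["name"]))
        current = parents.get(current)  # None once the root is reached
    path.reverse()
    return path
```

Calling `classification(SPECIES, NODES)` yields the family, then the genus, then the species, without any of those names having been copied into the researcher's own record.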
So this is the species, that's the genus, that's the family. And so we avoid the copy-paste madness.
Being on top of computers and making them do what you want: that actually hasn't happened for me today. So I'm going to have to show you the previous version, as of a month ago. I had been working very hard to make this ready for today, but I'm afraid I'm not going to be able to show you that. So this is on our live wiki at the moment. This is the taxon that I've picked for this demonstration. The HTTP URI for this taxon has actually been recorded here, and if this was the current version, it would show you the cached attributes from the name server.
Now, because we've expressed this taxon using a standard identifier that hopefully all other researchers will refer to, that common ID is like a globally unique identifier. If everybody refers to that critter using that globally unique ID, then we can really easily find all other information related to that identifier. Now the ultimate plan is to experiment with exporting our wiki data into Freebase. I don't know if you guys have heard of that, and I don't really have time to talk about it, but it will automatically try to align nodes, find relationships between nodes and discover related information. So not just direct alignments of nodes, but through indirect links as well.
So because I'm using a standard identifier, there are two web services that know and understand the same IDs I do. All I've done there is add an ALA info link here, and there's another web service that will understand the same globally unique identifier that I've used. And because I've done that, the user can very easily find maps and related information pulled from all over the web. So this is a distribution map for this critter, or plant actually. That's a sort of heat map thing going on there.
And of course, this is the originating name authority with all the extra boring stuff about the name. And so that's all. Any questions? Yes.
We need to get a mic up to our questioners because we're recording these talks and we want to be able to hear the questioners asking their questions.
I'm interested to find out how the URI is actually constructed so that it's easy to find.
Yes, easy to find, very good point. Why don't I go to a taxon that I know won't have the URI associated? So they're very picky, actually. They don't want to accidentally just go to the same name that they've randomly found somewhere. They want to use the correct name with the correct author citation to really anchor the concept that they're referring to. So in taxonomy, we refer to taxon concepts. So I'll find one that doesn't. If I wasn't a programmer, the plant names would absorb into my brain a lot easier. Let's try this one. Now, I must admit I've started doing a UI to pick the name, and that's also not ready today.
There are several different versions of the identifier. I say globally unique, but actually, there are two main versions. There's an LSID. Can I have a show of hands? Who's heard of LSIDs? One, excellent. That's more than I was expecting. LSIDs are a URN, if you know what that is. A uniform resource name. And this one has been populated. They must have gone through and populated them all. It's made a liar of me. Okay, I'll just do it manually.
So I'm going to go to this search facility. The wiki has a link, so you get taken here automatically with the name already filled out. And so I can mistype this name, actually. We'll see what that comes back with. And they get a list of taxon concepts that exist in the APNI database. So when they're consolidating research facts about a taxon, usually the author of the original research will cite the taxonomy that they used for that critter. So come on, Telstra. All right.
Okay, so they basically just copy-paste one of these links on the accepted concept. We're just trying to stick to APC accepted concepts at the moment. So if they decide that they want to make a wiki page about this genus, then they can inspect the link. And it's basically just copy-pasting. This web service will actually do content negotiation. So if I put this into the wiki, the back end uses a library called RDF::Trine, and that's a Perl library. And it will content negotiate the RDF representation of that same resource. So for example, if I put RDF on the end here, that's not using HTTP content negotiation, but just a different URL. So I can get the exact same information in a computer-readable format. So you're basically copy-pasting links. I'm sorry that answer was so long-winded.
Okay. Right, do we have any more questions? Yes.
Are there many similarities between this and document object identifiers, the things that Nature, for example, uses?
Yes, DOIs are an excellent success story of URNs. And sadly, at the moment, we just haven't had the resources to implement DOI negotiation. We have a very manual copy-paste bibliography system at the moment. So yes, they are very similar, especially LSIDs. An LSID is actually much more similar to a DOI than to an HTTP URI; it's the same concept.
More questions? Okay, well, if there are no more questions, everybody please thank Paul for his talk. Thank you.