I only remembered yesterday afternoon that I was giving this presentation, so I've patched together a couple of pieces from some recent presentations I've done. I'm going to whiz through the first half, which is really scene setting, and then we're going to land on a couple of slides at the end, which really do raise the issue of the governance of controlled vocabularies.
So first, some scene setting. This is maybe not exactly in the vocabulary governance space, but nevertheless it says why we're thinking about vocabularies, and why we're thinking about vocabularies in a web setting. There's a big push at the moment with research data to make sure that it is made available in a FAIR way: that's findable, accessible, interoperable and reusable. And so a group of us have been working on what that means for vocabularies. I think I've already presented a bit of this, I can't remember whether it was to this audience, Rowan, so some of you will recognize some of the slides, but as I said, I'm just doing some scene setting.
A FAIR vocabulary is one that's registered, therefore findable, and that's on the web and downloadable, therefore accessible. The interoperability comes from the fact that, at least for the thing we're focusing on, it's encoded using web standards, perhaps with some domain-specific extensions. (I also realize I've missed a piece there.) Interoperability also goes to content issues: that it matches the expectations of the discipline or domain. That also comes up in the reusability context: for a vocabulary to be reusable, it should be made available with an open license, it should be maintained so that we can trust it, and it should be comprehensible, that is, it makes sense to the people who are familiar with the subject material.
We've got a lot of controlled vocabularies, and mostly they're not FAIR. For example, they're published in a book. This is a book that has a lot of controlled vocabularies which are widely used in the Australian environmental and ecology space, and this is what the vocabularies look like: they're tables with red symbols or letter codes. And here's another very important controlled vocabulary, maintained and published by the International Commission on Stratigraphy. That's the geological time scale, and this is the version that most people see, which is a PDF. The same information, in fact a lot more, is also available in a table on a web page at stratigraphy.org, but that's a very special table that's basically just in HTML. So it's not really machine processable, except in the sense of being rendered by a browser. And this sin is also committed by one of the most important controlled vocabularies, which is the SI units. This is a snip from the so-called SI Brochure, which is where the official definitions of the core units of measurement are, and again it's just a table on a web page.
So we've been doing some work on what a FAIR vocabulary looks like, which I've already described, but how do we get there? There are plenty of examples of FAIR vocabularies, though they are not necessarily the ones published by the actual custodians. In many cases they are ones which have been translated from the custodians' versions, but nevertheless they show how it can be done. This one is the original version from EUDM, the energy usage data model, which was a project that we did in CSIRO three or four years ago.
Of course, Research Vocabularies Australia publishes quite a lot of vocabularies, hundreds of them, in a FAIR way. This one is the official AODN vocabulary for sampling parameters, but there are a lot of vocabularies in Research Vocabularies Australia which are just translated or transferred from non-FAIR representations, and there's a bit of a question of which is the authoritative version. Is it the non-FAIR one or is it the FAIR one? This is an agricultural vocabulary published on a platform called Skosmos. So partly I'm just showing these as different platforms, all of which show vocabularies published in a web-friendly way, which means at least they're going some way towards being FAIR. BioPortal has been doing it for years. Ontobee shows the ones from the OBO Foundry.
And we've written a paper, with some colleagues scattered around the world, giving some basic instructions for how to take what we call a legacy vocabulary. That's a controlled vocabulary which is published, for example, in a book or in a comma-separated values spreadsheet, or is maybe downloadable but not in a way which allows you to address each term in the vocabulary with its own web identifier. The paper gives a kind of step-by-step for how you might convert a vocabulary from a legacy format to make it FAIR. In fact, on Sunday I submitted a revision of the manuscript back to PLOS Bioinformatics, because we received one set of reviews, so we hope that'll get accepted and published soon.
So now my computer is telling me that I have to do some exercises. I should have killed that off before we started this talk, shouldn't I? I think I've got control again, yes, good. Okay, and we've got a couple of examples where we can actually show how we've stepped through that recipe, if you like, to convert, in this case, the yellow book vocabulary into the Linked Data Registry format.
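To make the idea of "each term with its own web identifier" concrete, here is a minimal Python sketch of one small step in such a conversion: turning rows of a legacy CSV table into SKOS concepts written as Turtle. It is not the recipe from the paper and not the Linked Data Registry workflow. The base URI, the column names (`code`, `label`, `definition`) and the sample rows are all hypothetical, chosen only for illustration.

```python
import csv
import io

# Hypothetical base URI and sample rows, for illustration only.
BASE_URI = "https://example.org/vocab/demo/"

LEGACY_CSV = """code,label,definition
A,Alpha term,First illustrative definition
B,Beta term,Second illustrative definition
"""


def escape_literal(text):
    """Escape backslashes, double quotes and newlines for a Turtle string."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def csv_to_skos_turtle(csv_text, base_uri):
    """Give every CSV row its own URI and describe it as a skos:Concept."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"<{base_uri}> a skos:ConceptScheme .",
    ]
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumes the legacy code is already safe to use in a URI.
        term_uri = f"{base_uri}{row['code']}"
        lines += [
            "",
            f"<{term_uri}> a skos:Concept ;",
            f'    skos:notation "{escape_literal(row["code"])}" ;',
            f'    skos:prefLabel "{escape_literal(row["label"])}"@en ;',
            f'    skos:definition "{escape_literal(row["definition"])}"@en ;',
            f"    skos:inScheme <{base_uri}> .",
        ]
    return "\n".join(lines) + "\n"


turtle = csv_to_skos_turtle(LEGACY_CSV, BASE_URI)
print(turtle)
```

A real conversion would also need the things discussed above that code alone cannot supply: a stable, resolvable namespace, a license, and an agreement with the custodian about which version is authoritative.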
So this is not just an idle boast or a theoretical idea about how to do the conversion, and I've shown before how the specific parts of the set of instructions are implemented by information in this FAIR publication of the controlled vocabulary. Similarly, for the geological time scale, you can represent it in a rather simple way just as SKOS concepts, and you can then add in extra information if you have a domain-specific ontology, which we have actually developed there through the GSML project. I'm not going to show that at the moment because it's not the point of what we're talking about here.
The point is that there's a whole set of questions around the development and publication of controlled vocabularies and making them available to the research community. The paper that we've written really deals with just one part of a number of different workflows. We're planning for it, the converting paper, to be one of a set of guidelines, and the next cabs off the rank are going to address some issues around maintenance and governance. In the first place we've got the question of the creation of a controlled vocabulary, which is sort of the first step in the governance process, and then there's the maintenance, or keeping things up to date, which is the ongoing governance question.
So having set the scene with that, I'm now going to pause on a couple of slides where I step through a whole series of precedents: how the governance is actually managed for some controlled vocabularies which are important ones, which are used in the community or should be known and used. That is the point of the presentation today, and I've burnt up the first 10 minutes giving you some context.
So here's a list of what we can think of as controlled vocabularies. Almost all of them are very recognizable that way. Maybe you hadn't thought of schema.org that way, but it is. Schema.org, of course, is the set of tags which are used to tag web pages, and that includes web pages which are landing pages for data sets. Schema.org is a project that's been around for the best part of 10 years now. It was originally established by Microsoft, Google, Yahoo and Yandex, basically to bring a little order to the chaos of search engine optimization, but it has more recently been picked up by the research community, for example in the ESIP and EarthCube project science-on-schema.org. Primarily what the science-on-schema.org people have been doing is documenting usage in the research community, but the background vocabulary of schema.org is now maintained publicly in GitHub, using the issue tracker there. There are monthly releases.
In principle it's community owned. Anyone who's interested and has a GitHub account can make comments, raise issues, and contribute to whether a particular term makes its way into the vocabulary or whether it's positioned correctly, as agreed by the community. But interestingly, when push comes to shove, the decisions get made by a very small group of people, with one of them, Dan Brickley, acting as what might be thought of as a benevolent dictator. He's got the community's interest at heart. I know Dan, I've known him for a while, well, 20-odd years actually, and he very much does have the community's interest at heart. Even though he's employed by Google, he's not grinding any Google axes. In fact, at times he'll tell people: yes, it's all very well putting this into the vocabulary, and that's fine, but that doesn't necessarily mean Google will use it in building their index. Initially, at least, they use only a tiny fraction of schema.org in building the Google
index, and Dan plays the role of reminding people not to be overly optimistic about what's going to happen with this stuff, even while that vocabulary is being maintained.
Another controlled vocabulary which I've made some contributions to recently is QUDT, which is Quantities, Units, Dimensions and, I can't remember what the T stands for, but it doesn't really matter. There are two interesting parts of QUDT: the lists of units of measure and the so-called quantity kinds. Quantity kinds are things like the semantics of what a number refers to: length or depth or wavelength, those kinds of things. It's probably the most widely adopted controlled vocabulary of units of measure. There's another one, which is just the codes, called UCUM, which I'm going to talk about in a moment and which is very heavily adopted in the biomedical and clinical community, but in terms of a properly FAIR published controlled vocabulary, QUDT is probably the most widely adopted.
It's also maintained in GitHub, so anyone can submit issues or suggestions or requests, and they get considered by a small group of gatekeepers (I've given you the names there) in consultation with the technical advisory board, so there's a kind of supervision by the technical advisory board. QUDT has approximately monthly releases; there'll be somewhere between one and 10 units added every month. It's not quite as automatic as schema.org, which really does have monthly releases. With QUDT it's more on an as-needed basis, but that turns out to be on average about every month. So again, there's now a public maintenance process, and that's been going for about two years. Prior to that it was all a bit secretive, it was being held behind closed doors, and there was justifiably some suspicion about whose interests were being met by QUDT. But now it's done in public, or at least it's
transparent, though there are these individuals who are the gatekeepers.
UCUM, as I said, is the codes for units of measure used quite heavily in the medical, clinical and health space. That's owned by the National Library of Medicine. Notice that on each of these I'm making a point about who owns the content, who is intellectually responsible for the content of these controlled vocabularies, and it's quite important to bear that in mind. UCUM has been around for about 15 to 20 years; actually it goes back originally into the 1990s, the last millennium. It's been maintained publicly, again through an online issue tracking system, but of an earlier generation than GitHub: it's Subversion and Trac. The progress in dealing with issues there is a lot slower. The technical advisory committee has recently been revived for UCUM and is dealing with some issues in the tracker which are nine and ten years old. That's partly because the people who were maintaining it were of the impression that it was all done and dusted and didn't need any updates, although it turns out that it does. For example, it's got the constants for things like Planck's constant and the speed of light in there, the values for which were, believe it or not, updated by CODATA in 2017-2018.
Then there's the NERC vocabulary service (NVS), which is owned by the BODC but used broadly in the oceanography and marine science community. There are several hundred controlled vocabularies there. A lot of them are just translations of ones which are owned and run by individual agencies or organizations, so the NVS just acts as a kind of publication platform, but some are directly run and owned by the BODC. The actual reference content is maintained behind the scenes in Oracle, so they also use GitHub, but only for the issue tracker.
Effectively Gwen Moncoiffé of BODC acts as the gatekeeper, but again very benevolently. The gatekeeping there is really: is this formatted right, is all the information there? There's no judgment about whether the terms belong in a vocabulary, because as far as NVS is concerned, if a term is used in a dataset then it belongs in NVS. It's a sort of bottom-up inventory of terms used in datasets, rather than instructions that people shall use these terms, so the governance model there has to reflect that idea.
The yellow book, which I mentioned before, is the Australian Soil and Land Survey Handbook. There the book, the printed version, is normative. It's owned by the National Committee on Soil and Terrain (NCST), which is essentially a collaboration of the environment departments and their operating agencies; the states and territories in Australia are represented on the NCST. It has had occasional updates in the past. The current version, which I showed you the picture of (I think here's the picture), is the third edition, published in 2009. On average it gets updated about every 10 years, and it's currently being revised by the NCST, but primarily Andrew Biggs from Queensland natural resources is managing that revision, so if you've got any problems with it you need to get hold of Andrew. I haven't really tracked down whether they've got a transparent web presence; I think probably not. But it's owned by NCST, so as long as they're happy with their process I guess that's okay. And we do have, as I showed before, a linked data version of that, but it lags the official version, and that's very deliberate, because the official version is the version in the book.
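The form-only gatekeeping described for NVS above (is it formatted right, is all the information there, with no judgment about whether the term belongs) can be pictured as a simple completeness check. The Python sketch below is only an illustration of that governance idea. The field names and rules are hypothetical and are not taken from NVS or any other service.

```python
# Hypothetical required fields; a real service defines its own submission rules.
REQUIRED_FIELDS = ("identifier", "pref_label", "definition")


def check_submission(term):
    """Return a list of form problems; an empty list means the check passes.

    Deliberately makes no judgment about whether the term "belongs":
    it only asks whether the submission is complete and well formed.
    """
    problems = []
    for field in REQUIRED_FIELDS:
        value = term.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    identifier = term.get("identifier")
    if isinstance(identifier, str) and any(
        ch.isspace() for ch in identifier.strip()
    ):
        problems.append("identifier must not contain whitespace")
    return problems


complete = {
    "identifier": "TERM001",
    "pref_label": "Example term",
    "definition": "An illustrative definition.",
}
incomplete = {"identifier": "TERM 002", "pref_label": ""}

print(check_submission(complete))
print(check_submission(incomplete))
```

A prescriptive vocabulary, where a committee decides which terms are allowed, would need a review step that a check like this cannot express, which is why the governance model has to follow the purpose of the vocabulary.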
Finally, the geological time scale, as I mentioned, is owned by the International Commission on Stratigraphy. The values, the numbers for the different boundaries in the chart, are each tied to an article in a journal, these days almost always a journal published by the ICS called Episodes. So once you get an article published in Episodes which establishes one of those boundaries, then a value in the controlled vocabulary gets changed. The website tabulation gets updated annually, but again that lags the agreements of the Commission on Stratigraphy. And there's the semantic implementation that I've been managing, with Steve Richard a bit, which is in GitHub, but that is not the normative version. I'm not sure if Nick Car is on this call, but he's now the webmaster for ICS and, I believe, is planning on shifting things so that the semantic version becomes the normative version of the geological time scale. At the moment, though, it's only updated when I've got the energy.
So actually I'll skip this one. The point I was making here was just that some systems allow you to reflect very fine-grained revisions at an individual term level, shown by the "historical versions available" thing, but most systems tend to hide that behind closed doors.
As part of this slide deck I'd also reflected on some of the ways in which governance is done in the official standards community: ISO, with its five-year review cycles, technical committees, and multiple drafts and review stages; a similar system used by the Open Geospatial Consortium, although it doesn't have such hard timelines; and W3C, which has very hard timelines, two-year timelines for a working group to produce a recommendation. And also the different ways in which these standards are published: ISO standards are PDFs, OGC standards these days are HTML but prepared using a particular pipeline, and W3C uses its own standard HTML, and it's all done there. So, as sort of a summary, we're working on
these ten rules. We've done one for making a vocabulary FAIR, and the next one we're planning on doing is maintaining a FAIR vocabulary, which is part of the governance story. And I've just given you a few precedents for how some well-known controlled vocabularies are governed at present. Okay, thanks.