to kick off, first of all, formally with acknowledging, celebrating and paying our respects to the Ngunnawal and Ngambri people of the Canberra region and to all First Nations on whose traditional lands we may be meeting or dialling in from today, and we pay our respects to elders past, present and emerging, and any of you who might be in the room or joining us today. So over these next two days we're coming together, as people across a broad range of disciplines, to communicate and promote approaches for vocabularies that can be used to enhance and enrich data and the meaning of data, information and knowledge, so that data can be shared and re-used appropriately into perpetuity. That's a big word, isn't it? So we're working with vocabularies to enhance the meaning of data, information and knowledge, and I think with that it's only really appropriate that we pay our deep respect to the fact that we're meeting on the lands of the oldest continuing culture and knowledge system in human history today. So with that I'd like to hand over to Paul House from the Australian National University First Nations Portfolio. Thank you very much. I'm a Ndangu. Thank you Megan. And Jumburuburu Marambang and Narama Rang. Good morning everyone. Baladu, Nyambri, Canberra, Walgulu, Walabaloa, Wuradri, Nyayang, I speak all the languages here on country: Nyambri, Canberra, Walgulu, Walabaloa, Nunu, Wuradri. So Yilinga Longbu, Yundu, Paul Giroa House, my name is Paul Giroa House, Nado, Maradu, Maribiringu, Gujigangu, Nyambri, Nudu, Nudurambangu. I was born here, the centre of my ancestral country, at the Old Canberra Hospital. Anyone born in the Old Canberra Hospital? I must be an endangered species, am I? No. It's good to see a hospital alumni, though, when I do a welcome to the country. So Nyambri, Yilinga Longbu, Giba Bangu, Walgulu, Walabaloa, Wuradri, Nildubangu, ladies and gentlemen, young men, young women, distinguished guests. Nyari and Jamali, Nyambri, Guma, Walgulu, Walabaloa, Nunu, Ngaraga, Wuradri, Mujigang, Yangarambu, Jaya, Ndur, my respects, Nyambri, Canberra, Walgulu, Walabaloa, Nunu, Wuradri, Elders, past and present. Nyari and Jamarabu, Mujigangu, Nurambangu, Nini, Yiridu, my respects to all people from all parts of the country, First Nations people. Nyambri, Canberra, Walgulu, Walabaloa, Nunu, Wumai, Ngai and Banya, Nyinugang, Nurambangu, Dara, Nyambri, Canberra, Walgulu, Walabaloa, people, welcome you all to the country. Marambang, Nyiyang, Yambuwan, FAIR vocabularies for all, Nyiyang in language, it's not on there at the moment, your conference theme. Naadu, Wuragabugi, Balabambu, Wabalagi, Bangu, Gungulila, Dumbahalina, Murawai, Marambu, we listen to the old people, the ancestors, the elders, and they show us the good path, the right path, the straight path. Dulagang, Muro, straight path. Gungulila, Bilangalina, Yamamalina, Walamalina, they nurture us, they guide us, they protect us, our old people, our elders, our ancestors. Mambuwara, Naminya, Wuragabinya, Wuradarago, Wenangalago, Balago, looking to see, listening to hear, and learning to understand. Nyani, Njimali, Nurabangu, Balanin, Walawin, Galangga, Bangbu, Yanengu, Nyani, Mawangu, Bilagigila, Yamam, Bilgiri, Marambu, Bagregan, Yanengu, we look after country so it is healthy for our children, for all of our people; we teach them, we learn what is right for all of us on country. Muro Mwagiginya Yinja Mara, Muro Muro, Wurambu, Nurabangu, living a respectful way of life cares for country.
Yinja Mara, Wurambu, Marandogobo, Geiragobo, Yandogobo, respect is taking responsibility for the now, the past, the present, and the future. Our welcomes to country are always made in the spirit of peace and a desire for harmony for all people of modern Australia, and our main aim is always to establish an atmosphere of mutual respect through the acknowledgement of our ancestors and the recognition of our rights to declare our special place in the pre- and post-history of the Canberra region. The name Canberra is derived from the name of our people and country right here at the ANU. Ngamburi, Namburu, Ngamburi, Canberra, gazetted on the 22nd of January 1834, here, as Canberra station under the New South Wales colonial government. We've cared for Mother Earth since the dawn of time, and evidence of our occupation, our statehood, our sovereignty can be seen everywhere throughout the land. Our signature is in the land, not just our DNA, and taking care of country is important to us all. Anyone heard of those words? In language, it means many good things. They go slow, it's a way of life, many good things. They go slow, be patient, be polite, be gentle, take responsibility, uphold. Ngamburi, Namburu, Ngamburi, Ngamburi, respecting our key totems, our Creator and protector, the Crow and the Eagle here on country. Ngamburi, Nambur, Ngamburi, healthy people. Injumara Bala, Bala Birida, Bina, Bina, Bina, Yawulu Wurawin Nurembangon, respect is in the Canberra Creek and the rivers and the breeze, quietly moving through country. Bala Wala Mwangadabu, Muranmatan Dabu, Bama Yu Gurugam Bira Bangon Nara Nara, respect is in the grinding stones and the carved trees made long ago on country. Magangiri, Goengulia La, Magangiri Biringa Bokongu, and respect is in the journey of the Bogong moths in the mountains. Wongadawin Nandu Dungawabu Miradaganda Wijingayina Wongadadaganda Bapa Yiidinigur, respect is in the soles of the feet of our dancers. It's in our matriarchs digging for yams in Mother Earth. Maragaladal Walanmayan Mayangalong, hold fast to each other, empower the people. Walangunmala Maramara Gurebi, brave, may change. Gira yawana murawara nawan bira, get up, stand up, inshallah. Just in terms of our vocabulary, I'm sharing with you: Burumbabirra Baladu Niyang Burumbabirra Niyani Ginyu, I'm sharing this language with you all. Our language is what we describe as a free word order structure. It comes from Mother Earth. It comes out of the ground. And it was created by the old people, Bayami, Gujigang, Nyuyalangu, the old people, tens of thousands of years ago. And we still speak language, Garri speak Niyang language on country, because of oral history, but also because of the FNA historical records, powerful and compelling, and the records taken by non-Aboriginal people when they first came to the country. So that's important: we're able to continue to speak language on country. Language is empowering. Didn't go. Language is important. Didn't go. Important. Nia way. Go on and get your identity. Mara mara, the actions you take, the nina, the focus and the way and the transformation of being able to speak language and express yourself across country. So with that, in the mara bala, mara mara nia nia nia girama mara nia, respect shapes us and lifts up the people. Mara mara, when inga, in respect creates people who care for each other. So go on and nina, welcome to the country. Welcome, and Mandangu Wuruguri. Thank you very much. Thank you so much. Paul, did you want to set up where you're right? Yeah, you're right.
Would I just be able to say a word before you leave? Thank you very, very much. And I think respect creates people who care. We could take a lot of those messages that you expressed there, around respect for country, people and each other, as we build vocabularies that are respectful, and take our time, patience and care with creating them. So thank you very, very much for your time today. We are very pleased to welcome Arofan Gregory from CODATA today, and very pleased that he's taking the time out of his busy schedule to join us. Arofan works as a standards expert with CODATA, the data arm of the International Science Council. For more than two decades, he's focused on the development and implementation of technical standards for scientific and official data and metadata, contributing to the SDMX (Statistical Data and Metadata eXchange), DDI (Data Documentation Initiative) and GSIM (Generic Statistical Information Model) specifications, amongst others. He's the chair of the DDI Cross Domain Integration (DDI-CDI) Working Group, and he was co-chair of the IUSSP-CODATA FAIR Vocabularies Working Group, IUSSP being the International Union for the Scientific Study of Population. He's active in the WorldFAIR project, with a focus on the development of the Cross-Domain Interoperability Framework, especially the metadata to support FAIR data use in multi-disciplinary implementations. So today Arofan's talk will characterise the metadata landscape and the challenges we face regarding FAIR controlled vocabularies. He's asked the question: why, when the technology and standards can so easily solve the problem, do we remain unable to leverage these critical assets? We look forward to hearing his thoughts and ideas on this, and it will stimulate an interesting discussion, so hold on to those questions for the discussion at the end. So hopefully we have Arofan. I'm here, can you see my presentation? We can hear you and we can see your presentation in presentation mode. Very good. So I'm joining from Dublin in Ireland; it's evening for me, and I realise it's a bit earlier, tomorrow, for you. Before I get going here, I wanted to just mention something. I know that I'm going to be talking a lot about FAIR, which is primarily where I work these days, in implementation of the FAIR principles. I wanted to mention briefly the CARE principles, because I know that's something of significance. And I'm going to be using a lot of examples here that are sort of Western European examples, and I don't want anyone to think that I'm suggesting that's the primary or most important set of examples, merely that those are the ones I know best, which is why I've used them. I'm going to be talking today about some perspectives on metadata and how controlled vocabularies fit into the metadata landscape that we're dealing with today. I had to ask myself, when I was invited to do this talk, what I had to offer, because I'm not an ontologist, I'm not a subject matter expert, I'm not a classification expert. What I am is a specialist in metadata more broadly. And it seemed to me that positioning controlled vocabularies in relation to metadata would offer some perspectives, some food for thought, to this group, and I hope you find this useful. I'm going to be talking a little bit about understanding the metadata landscape and some ways of thinking about metadata that I've found useful.
Talk a little bit about how we organize knowledge, the systems we use to organize knowledge. Talk a bit more about FAIR and how ontologies fit into that equation. And then talk about trying not to fail. I think we're faced with an opportunity at this point that we can choose to take, but there will be consequences if we don't, and I want to start off with a little bit of a whimsical example of that. I'm a bit of an amateur historian, if you will, and so I want to give you an example from the mid-19th century. This is Lord Palmerston, who was the prime minister of the UK from 1855 to 1865, maybe something like that. He was obviously a huge driver of policy within the British Empire, which was a big player in those days, globally. And there was a kind of nasty situation in northern Germany, or southern Denmark, depending on who you talked to: they were having a civil war over Schleswig-Holstein. And he said about this: only three people have ever really understood the Schleswig-Holstein business, the Prince Consort, who is dead; a German professor, who has gone mad; and I, who have forgotten all about it. Now, that's a pretty flip thing to say. But what we have here is a scenario where no one understood the real meat of the matter, the merits of the situation, and yet they fought two shooting wars resulting in 35,000 casualties, and arguably this contributed to the lead-up to the German Empire that triggered the First World War. So the consequences of a policy failure (and they were really blundering around in their ignorance on this particular issue) stemmed from a failure of understanding, and from the fact that a tiny number of people really understood what it was about and could not communicate that to anyone else. And this seems like an irrelevant example, but when we look at the policy decisions that are important to us today, in an age of global warming and a lot of the other challenges we face, and we think about the potential cost of failure: the way we communicate about complex issues is primarily in explaining the data on which policy is based. I wanted to use this as a cautionary note, because if we don't have a good mechanism for explaining data, for understanding data around complex issues, bad things can happen. It's not a direct correlation, it's not an automatic consequence, but it opens a possibility, and it's a possibility it would be best if we could avoid. So I'm going to start with this little whimsical failure scenario and move forward from there, and I'll come back to this at the end. Controlled vocabularies are all about definition of terms, so I thought it would be right and good for me to define what I mean when I say 'controlled vocabulary' in this talk. A controlled vocabulary is a formal set of terms and definitions, often supplemented with information about the relationships between and among the important things in a domain. That's not a comprehensive definition, but I think you see where I'm going with this. Practically speaking, I'm talking, yes, about codelists and classifications and ontologies, but also thesauri and vocabularies and taxonomies. There are a number of different forms that controlled vocabularies can take, but I'm using the term in a broad sense here. Controlled vocabularies capture the important distinctions in how we understand any given domain.
And I want to point out that when we make those distinctions, we imply the existence of models or systems that are in operation within the domain being described. Think about a system where I have people who can either get on a bus because they have a ticket, or they can't. So I have two categories of people: ticketed people and unticketed people. That system is fine until I say that children ride for free, right? And children mean that I need a third category, because I have adults with tickets, adults without tickets, and children. So the capabilities of a system determine what the important distinctions are that we need to make, and there's an inherent connection between the capabilities of a system to describe things and the distinctions that need to be made. And the point here is that controlled vocabularies never exist independent of other considerations. When we think about metadata, that kind of connection is pretty fundamental. I want to talk a little bit about perspectives on metadata, because it's a very broad topic and a very difficult one to get your head around. And this is something that I've learned, I guess the hard way, over the past 20 years: it pays dividends to think about metadata at three different levels. The basic level is conceptual: we have information which provides, which formalizes, the concepts and ideas that allow us to describe data and other kinds of related resources. And CVs are obviously really at home at this level of metadata; CVs provide the building blocks which everything else uses to describe data. We have, however, another level, which I like to call the logical level, where we take the intellectual organization of concepts: the logical relationships between concepts, how they describe data, how they relate to each other. There are lots of different roles that concepts play in relation to data: they could be variables or categories, properties, universes, populations, all kinds of things they can do in relation to data. And we need to understand what those roles are and what the logical organization of concepts is, as well as just the definition of the concepts themselves. And below this, of course, we have the physical encoding of metadata: what is the syntax, what is the format, the things that allow it to be consumed and manipulated by computers. All of these levels are always in operation when we talk about metadata of any kind. And the levels are very related, so that the same concepts can play many different roles in different logical structures, and those logical structures can be encoded physically in an almost infinite number of ways, so that it's important, when we're trying to integrate data and manage data, that we understand where the agreements are and where the differences are, what's being reused and what's not. So I find this to be a very interesting, very useful way to think about metadata. If I had to give you a simple typology of metadata, this would be it. We have definitional metadata, again CVs, which provides concepts and semantics. We have structural metadata, which talks about the roles played by concepts vis-a-vis the data and vis-a-vis other concepts, and a lot of packaging and higher-level information about groups of data, streams of data, files of data, what have you. And then we have what I term provenance and contextual metadata: data lineage and origination. How was the data collected? Was it a survey, a sensor, a register? How was the data processed? What were the methods used? What was the purpose for collecting the data? Was it part of an experiment designed to answer a particular research question? Was there some other purpose? All of this information is necessary to fully understand data.
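To make those three levels and that typology concrete, here is a minimal Python sketch (an editorial illustration, not from the talk; the concept, variable and provenance values are invented): the same concept carries a definition (the conceptual level), plays the logical role of a category within a variable (the logical level), and the resulting description can then be physically encoded in any number of ways.

```python
import json
from dataclasses import dataclass, asdict

# Definitional metadata: a concept and its semantics (conceptual level).
@dataclass
class Concept:
    notation: str
    label: str
    definition: str

employed = Concept("EMP", "Employed",
                   "Persons who worked at least one hour in the reference week.")

# Structural metadata: the logical role the same concept plays vis-a-vis
# the data; here, as one category of a variable (logical level).
@dataclass
class Variable:
    name: str
    categories: list

labour_status = Variable("labour_force_status", [employed])

# Provenance and contextual metadata: how and why the data were collected.
provenance = {"collection": "survey",
              "instrument": "labour force survey",
              "purpose": "official employment statistics"}

# Physical level: one logical description, many possible encodings.
record = {"variable": asdict(labour_status), "provenance": provenance}
print(json.dumps(record, indent=2))  # JSON here; the same content could be CSV, XML, RDF...
```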
And controlled vocabularies here are foundational, but they are not alone. They are one type of metadata among many, and a complete set of metadata is necessary when we're talking about really understanding data fully. This is going to come up again and again, because I feel like it's an important point: I do see controlled vocabularies as a very foundational form of metadata, but in and of themselves, they cannot do everything. When I look around the world today, I see a couple of typical scenarios when we talk about metadata. A lot of domains have established standards and systems and technologies for dealing with metadata and managing data, disseminating data, reusing data. And domains in this sense are not necessarily academic disciplines, although that is certainly one primary way that domains are organized. When I look at official statistics, as an example, I see that as a single domain: you have a whole number of organizations that perform a common function in terms of providing official data for policy to government, and they have their own standards that they use to exchange data, and they're deeply cross-disciplinary. But that I see as a domain, because it's a community of practice. Within these communities, you see differing levels of maturity and very diverse approaches to how they deal with data management, with describing data, with managing data and metadata both. And there tends to be a lack of interoperability across domain boundaries. There also tends to be a depth of focus, what we term these days 'domain FAIR', which is to say that members of that community can often exchange data and have metadata that is meaningful to everybody within the domain, but not outside of it. And I think that's a very typical case. At the same time, we see a culture of standardization that is really organized around the web, and W3C is, I guess, the biggest organization there. A lot of standards that are webby use RDF and other webby technologies to describe things very broadly, but not in a lot of depth. So we have standards like PROV that describe provenance in a very general way, but not in a very deep way, not in a very specific way that's meaningful within a particular domain. And emerging from this, we have the idea of cross-domain FAIR, and I'll come back to that a little bit when I talk more about WorldFAIR and some of the things that are coming out of that. But controlled vocabularies exist in kind of a special space vis-a-vis these kinds of standardization, because they're inherently domain specific, and yet they're a very important kind of metadata to exchange in cross-domain scenarios. And so there's a bit of tension between these very broad, shallow standards and these very deep, domain-specific standards, and controlled vocabularies are one of the points where that becomes very complicated: you end up having to map controlled vocabularies across domain boundaries to make them useful. I want to stop there in this characterization of metadata and switch gears a little bit and talk about how we organize knowledge. And I'm going to take a sort of historical view of this, because I think there's a trajectory here that is worth recognizing.
Organizing knowledge is something that people have always done, since, I guess, the dawn of time, and there are different ways of doing this. People used to have oral traditions where they had lists of significant figures, king lists and so on, and that was passed on from generation to generation. Over time, more sophisticated systems have developed. But it does seem to be the case that once a system of organizing knowledge takes hold in human society, it tends to persist basically forever; I'll give you an example of this in a minute. But as new systems come along, they expand our ability to organize knowledge and to leverage knowledge, and that seems to be intricately tied up with the emergence of new technologies that support that expansion. Let me give you an example. I don't know if you've ever seen a bestiary, but I think these things are wonderful. They're books, mostly from the Middle Ages, where people tried to enumerate all of the animals they were aware of. So they are effectively lists of animals, organized maybe alphabetically, maybe not particularly organized at all, but they have these descriptions and these wonderful illuminations; you can imagine monks going blind in their monasteries painting these things. And some of the animals turn out to be fictional, like the wyvern and the phoenix and so on. But what you have is a simple listing, and lists are very consonant with the human experience of the world, if you view time in a linear fashion. Now, I recognize that not all societies do that; I understand that there are Aboriginal societies where time is non-linear, but I think that's a bit of an exception. I think most cultures experience the world in a linear fashion and think about time as a sequence of things, one thing after another. Lists are a natural extension of that into the organization of the things that we understand and know. And so even with simple technology, like oral traditions and verbal memorization, you have these lists of things, and today we still use them. We have code lists, which are a simple way of organizing knowledge, but they are very prevalent and very useful. Modern code lists are maybe a sophisticated form of list, but they're not unlike the way lists have existed throughout human history. If we dial forward a couple of centuries, we end up with this gentleman, Linnaeus, who's known as the father of taxonomy. And in, I think, the mid-18th century, 1761, maybe earlier than that, '58, he published a thing called the Systema Naturae, which is the original classification of plants and animals. And this is a levelled hierarchy, organized around observable physical similarities. So what we really have here is a list that is made up of lists, and it's a much more complex construction than a simple list. And it takes, I think, a little more demanding technology: you need to be literate. It's kind of a file-folder paradigm; you need pen and paper at a minimum, and a printing press probably helps a lot. And he wasn't the man who invented classifications, but he's maybe the most famous example of this; his taxonomy is still in use today, in a much-evolved form. And classifications are still a very, very common way of organizing knowledge, right? We think about the way computers interact with knowledge, and we have technologies like XML that are fundamentally hierarchical. Lots of classifications are used very, very widely today.
And it's again a fairly natural way for humans to think about the world and to organize their understanding of it. Anybody who's worked with classifications, however, will have had the experience where something doesn't fit neatly into a single category, where it sort of also wants to be in another category, and you end up with constructions like 'see also'. And that leads to, I guess, the most modern example I have here, which is the graph, and the example I'm going to give you is linked open data. Now, graphs are a kind of network of objects that are connected only through their relationships with other objects, and those relationships can be of particular types. This is not a natural way to understand the world, even though it can be a very powerful way to describe it. So you really need good technology to make sense of a graph, because when you present it to a human being, you're likely to end up with something presented as a list, or in fact a hierarchy, that is derived by navigating through that graph. You need a lot of computation to deal with this model, but it's an incredibly powerful model, and that's why we see it becoming more and more common. Ontologies use graphs as their fundamental organization, and there are a lot of reasons for this. But I'm going to stop now with this example of how we organize knowledge; we'll come back to it in a bit. I think this evolution is actually very important as we think about the form that controlled vocabularies take and how best they can be expressed.
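As a rough editorial illustration of that trajectory, here is a small Python sketch, with made-up animals, showing the same knowledge organized first as a list, then as a levelled hierarchy, then as a graph whose typed cross-links (the 'see also' problem) are exactly what a strict tree cannot hold.

```python
# A simple list: the bestiary / code-list paradigm.
animals = ["eagle", "lion", "salmon"]

# A levelled hierarchy: the Linnaean / classification paradigm.
taxonomy = {"Animalia": {"Aves": ["eagle"],
                         "Mammalia": ["lion"],
                         "Actinopterygii": ["salmon"]}}

# A graph: nodes connected by typed relationships. The cross-links
# between branches are what break the strict tree model.
edges = [("eagle", "is_a", "Aves"),
         ("lion", "is_a", "Mammalia"),
         ("salmon", "is_a", "Actinopterygii"),
         ("eagle", "preys_on", "salmon"),  # crosses hierarchy branches
         ("eagle", "see_also", "lion")]    # the classifier's 'see also'

# Human-friendly views are typically derived by navigating the graph.
def neighbours(node, relation):
    return [o for s, r, o in edges if s == node and r == relation]

print(neighbours("eagle", "preys_on"))  # ['salmon']
```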
This gives rise, I think, to some questions. Now, I've been trying to answer the question: what is the point here? We need precise, clear communication of significant distinctions that inform our understanding in domains, and I think that's kind of a given; that's the point of controlled vocabularies. But there are some other questions we need to ask. Who are we communicating these things to, and what do they do with controlled vocabularies? Why do they need these things to be explained? And if we can answer those questions: what language should we use to do it? I think those are interesting things to think about, but there's a third question, a little bit different: why are we so bad at it? Because I would argue that even though we have a lot of good approaches to controlled vocabularies today, we're not doing a very good job of sharing them in a useful way. So I'm going to take these questions one at a time. When I think about who we're communicating controlled vocabularies to and why we're doing it, I'm not really talking about end users so much as the kind of people who are in the room right now, people who use these things to serve end users. And I really divide this into two groups, and I'm not sure my names for these groups are very good. The first group I would call ontologists: people who are looking at shared definitions of the important things within domains. They're looking at very deep, complete descriptions of important objects and how they relate, the kind of models of the working of domains. The applications you can build with this are often termed reasoners; that is, I can perform a sort of logical reasoning on a domain based on the information I have to describe it. Now, this is an incredibly powerful kind of technology. You have standards for supporting this, things like OWL, the Web Ontology Language, and then lots of more general formal ontologies like BFO and GFO, and GIST, which is a little more practical upper ontology, and a whole ton of domain ontologies. So there are lots of standards within this community and lots of powerful technology, but it's very focused within the domain, primarily. Alongside them you have what I describe as the FAIR community, although that's not a great name. These are people who are focused on interoperability: how can we share definitions across different domains? How can we use them to harmonize data coming from different sources? And really the main use case here is reuse and integration of data. They have a different set of standards: things like SKOS, the Simple Knowledge Organization System; XKOS, an extension of it for doing statistical classifications; things like SSSOM, the Simple Standard for Sharing Ontology Mappings; models like GSIM, which deal with classifications; and so on. So you have somewhat different communities with different focuses and different applications, but I think these are really the major consumers of controlled vocabularies from a technical perspective today, from a metadata perspective. I want to point out that these are not exclusive communities, and less and less so over time. I feel like FAIR is bringing people from these two groups more and more together as they look at how you can share meanings across domains: the cross-domain FAIR case, and the upper-ontology case, where you want to organize ontologies into a broader understanding. Both groups are looking now more and more at describing crosswalks, and there's a lot of development in that space, even in the past couple of months, and I think that's going to be an interesting development. But one thing that both of these groups absolutely recognize is that an important aspect of communicating controlled vocabularies is to machines: we need to have machine-actionable descriptions of these things. Because when I turn to the question of what language we should use, there's one thing that everybody seems to agree on: not PDF. Word, Excel, pretty image files, printed manuals: these are all nice for human consumption, but they're no longer sufficient to support the description of controlled vocabularies in the modern world. For ontologists, I think OWL is probably the most common format, although there are some other vocabularies, like RDFS, that get used; those things are expressed in RDF-conformant syntaxes, so Turtle or RDF/XML or JSON-LD. In the FAIR community, SKOS and XKOS are probably the most common, and they use similar syntaxes: Turtle or JSON-LD or RDF/XML. I would say the lowest common denominator here is really SKOS or XKOS, plus SSSOM for mappings, in those common RDF expressions, and there are some approaches to more complex mappings that are under development as we speak. I want to emphasize here that the granularity of description in these standards is very important. If you're using SKOS and XKOS, you have to describe things to the level of individual concepts, each node in a classification or ontology. Because if you've described things to that level in a machine-actionable way, then even though you might be describing a flat list or a hierarchical construction, you make them able to participate in larger knowledge graphs, and that ability to connect to this more complex form of knowledge organization is really, really important moving into the future.
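As an editorial illustration of that granularity, here is a minimal sketch using the Python rdflib library, with an invented example.org namespace: a tiny concept scheme described down to the individual concept, with notation, definition, and language-tagged labels (the same mechanism that makes the multilingual publication recommended later cheap to do), then serialized in two of the RDF syntaxes just named.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("https://example.org/vocab/")  # invented namespace
g = Graph()
g.bind("skos", SKOS)

scheme = EX["labour-status"]
g.add((scheme, RDF.type, SKOS.ConceptScheme))
g.add((scheme, SKOS.prefLabel, Literal("Labour force status", lang="en")))

concept = EX["labour-status/EMP"]  # one URI per concept, per node
g.add((concept, RDF.type, SKOS.Concept))
g.add((concept, SKOS.inScheme, scheme))
g.add((concept, SKOS.topConceptOf, scheme))
g.add((concept, SKOS.notation, Literal("EMP")))
g.add((concept, SKOS.prefLabel, Literal("Employed", lang="en")))
g.add((concept, SKOS.prefLabel, Literal("Beschäftigt", lang="de")))
g.add((concept, SKOS.definition,
       Literal("Worked at least one hour in the reference week.", lang="en")))

print(g.serialize(format="turtle"))   # the same graph,
print(g.serialize(format="json-ld"))  # two physical encodings
```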
Why are we so bad at it today? This is not a simple topic, and I don't like to just point the finger at the world and say 'you're terrible at this', but I kind of have to do that, because so many controlled vocabularies today, even ones of very high importance, are very, very poorly disseminated according to the criteria I'm talking about. One thing I am going to maintain, however, is that this is not a failure of technology, and it's not a failure of standardization. We have good standards for describing controlled vocabularies in a technical way, the things I've been mentioning: SKOS, XKOS, all of this stuff, and we have good technology platforms and tools for doing it. RDF technologies are really very, very good at this, and they're pretty easy to implement; SKOS is not a hard thing to implement, it can be done fairly handily. So this isn't a technology failure at all. If people are familiar with the 'Ten Simple Rules' document (the link and the DOI are down at the bottom there), and if you're not familiar with that document, please read it: it lays out, in a very easy-to-understand way, the main considerations for providing controlled vocabularies in a FAIR way. That is maybe not the best possible way, but it provides a good basis for doing it, and this isn't that hard to do. So I'm going to argue that the failure is really a failure of, maybe, organizational awareness, I don't quite know how to describe this, but I see there being some real problems with how we approach controlled vocabularies. One of these I call the data-space fallacy, and the other is maybe a similar phenomenon around incomplete solutions to data sharing. The data-space fallacy is this: you see a lot of people talk about interoperability as a matter of providing access to existing data and metadata in their current form. So they'll put Jupyter notebooks and data files and PDFs of controlled vocabularies and documentation into an online space, so you can access them all at the same time, and they say: interoperability problem solved. And it's not true, because all you've really done is provide access, and although access is a precondition for interoperability, it does not solve the entire problem. You still have the challenge of being able to understand the resources, how they fit together and integrate. What you want is standard, granular metadata, including the controlled vocabularies and all of the other metadata I talked about; but instead you have the same kinds of messes that you have today, just with easier access to the mess. That doesn't produce interoperability; it's a partial solution. And I think we see a similar phenomenon in some other ways. People who talk about FAIR and FAIR implementation often trivialize it. They say: oh, everything has a DOI, and now it's FAIR. No, it's not FAIR, it's identifiable. There's a difference: being identifiable is part of FAIR, yes, but it's only a part of FAIR. Or they put a thin layer of discovery metadata on something and say: oh, you can find it, now it's FAIR. And that's not good enough. We have to be realistic about it: in order to solve the problems of interoperability and reusability, and not just findability and access, we actually need a lot more metadata, and CVs are core to having that base metadata off which we can build the complete picture. We need machine-actionable, standard controlled vocabularies, supplemented with machine-actionable, granular, standard metadata across the board, if we're actually going to solve this problem. And yet that's not what we get.
And I feel like, in some ways, there's a failure of understanding about the challenge with controlled vocabularies and with all the other metadata in this picture, and we're not going to get FAIR if we don't actually accept the extent of the problem. Now, we don't want to be too negative about this, because I feel like there are some very good developments in this space. I work a lot with WorldFAIR and the Cross-Domain Interoperability Framework, or CDIF, and what that is, basically, is a set of minimum recommended metadata for performing different FAIR functions, and of course controlled vocabularies are part of that picture. And we have other initiatives that are engaging with the same space, things like the EOSC Interoperability Framework; there are projects in FAIR-IMPACT; there's quite a long list of these initiatives. And I think we've done a fair job of collaborating on certain issues, notably around mapping controlled vocabularies. There's an emergent good practice for describing and disseminating controlled vocabularies, and a lot of recommendations around other metadata as well, that place CVs in that overall metadata framework. So I do think that there's a sort of hope for the future here. The problem, of course, is that we can have good practice, but will we follow the practice?
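To give a flavour of what that emergent mapping practice looks like, here is a small, hypothetical sketch of an SSSOM-style mapping set written from Python. The two vocabulary URIs and the curation details are invented; the column names, though, are standard SSSOM slots.

```python
import csv

# One hypothetical mapping: a national code-list term to an international one.
mappings = [{
    "subject_id": "https://example.org/vocab/labour-status/EMP",
    "subject_label": "Employed",
    "predicate_id": "skos:exactMatch",
    "object_id": "https://stats.example.int/labour/EMPLOYED",
    "object_label": "Employed persons",
    "mapping_justification": "semapv:ManualMappingCuration",
}]

with open("labour-mappings.sssom.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(mappings[0]), delimiter="\t")
    writer.writeheader()       # SSSOM mapping sets are shareable as plain TSV
    writer.writerows(mappings)
```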
So I want to talk about the importance of not failing here. I don't want to be alarmist, and I don't want to be too far out here, but I'm going to describe something that I think is pretty real, and that people maybe don't recognize sufficiently. I think we're actually at a historical inflection point in how we organize knowledge. You probably know people (I mean, I'm old enough that almost everybody's a youngster now) but when I look at younger people, they seem to live in the network as much as they do in the physical world. People are focused on their devices half the time, and they're having very real interactions and doing real things on the network. And so they exist in a sort of shared space: their reality is not just the physical world, it's also the virtual world. And we can pretend that it's not as important, and I think a lot of people make that assumption, but I don't think it's really true. I think we've reached a point where, if resources do not exist online, within the network, they fundamentally don't exist for any purpose whatsoever. People will not, today, dig paper books out of a library off their dusty shelves to look up controlled vocabularies. They just won't do it. If they can't Google it, it isn't real. That's of course a generalization, but I think it's an important aspect of where we're heading as a society. Now, I'm American, and when I think about the damage that's done by misinformation out there in social media, out there online: it's profound. And people say, oh, let's regulate the social media companies. I don't think that's the answer. I think the answer to disinformation is good information, and what that means is that we need to be able to encode knowledge in a way that naturally works within the way that people operate on the network. So our expression of what we know, and the data we rely on to know it, and how we understand that data, needs to be encoded in a way that naturally works within a networked architecture. This means using graphs to organize our knowledge, to organize our CVs. I was really happy to see in the Menti poll there how many people are working with ontologies, because ontologies are graph based, and I think it's super important that all of our knowledge encoding and describing of data fit into the sort of linked open data paradigm, where every node that we put out can be navigated to as part of a graph, because I think that's the way people are going to start to deal with the information that we're going to rely on to counter disinformation, and to base policy on, moving into the future. Now, this is a little out there, maybe, but I think it's at least something that's worth considering. And I do think that we have an opportunity to frame our description of data in a way that is optimally useful in the world that we're headed towards, and I'm not sure people think about this very much. This is my last slide, and I have some suggestions for not failing. I really like the 'Ten Simple Rules' document; I've worked with a lot of the authors, and they're very, very smart people, and I think it's a great starting point. I think, fundamentally, SKOS and XKOS are the basic expression of controlled vocabularies in a machine-actionable way, and when you're talking about mappings, SSSOM is probably the starting point. We should really think about multilingual use, because a lot of controlled vocabularies are national, but the network isn't. One thing that these large language models are very good at is actually translation within controlled frames, and RDF formats are very good at encoding language equivalents, so we have good technology for this. I'm not saying you won't have to double-check everything with domain experts, but people should think about putting out controlled vocabularies in multiple natural languages, because that's important for how people will use them. I think it's very important that people pay attention to emerging best practice (I mentioned WorldFAIR and FAIR-IMPACT and these initiatives) for controlled vocabularies, but also for the full set of FAIR metadata, because controlled vocabularies are only part of that picture. And people should not be seduced by partial solutions: I think that's really core to the reason we're failing to disseminate controlled vocabularies in an effective way today. It's a problem that can be solved if we focus on the issue, and focus on good practice, and make that a reality. So I feel like we can seize this opportunity if we decide to do it, and there are some fairly easy steps we can take, but I'm a little concerned that we won't, because I think the consequences of that kind of failure could be pretty severe. And on that note, I'm going to stop. I hope you found something in there of interest, or to think about.

Right, good morning, good everything, from wherever you are. It's almost 1 a.m.
in Nigeria, and it's really an honor and a privilege to be participating in this event. From my experience leading the Virus Outbreak Data Network (VODAN) Africa project, this is going to be more of a practice-focused presentation; it's not that much of a technical thing, it's more of an experience-sharing, and I hope I don't lose many of you along the way. I want to check: I hope my slide is advancing. So I want to start with a bit of history. The idea of FAIR in Africa actually formally started with COVID-19. Yes, we had a few colleagues who were in Europe and some parts of North America doing some FAIR-related research at that time, but it wasn't that popular, and a few of them had come around to do some workshops to create awareness about the FAIR data-management process, but it wasn't really that popular either. And then there was COVID-19, and there was disruption, and there was a lot of lockdown, and so we saw opportunities to do something new, and many of us sold the idea of ensuring data ownership in Africa. Because we had experience with the Ebola viral disease in West Africa and some parts of East and Central Africa, where observational patient data were collected from Africa and were taken out and reused elsewhere, Africa as a continent really couldn't even learn anything from the Ebola crisis. When COVID came, Africa was of course not a source point for FAIR data, because the data just weren't there. It wasn't uncommon at that time to go to conferences and hear researchers talk about the sources of their data, especially health data, and you'd see: from North America, from Australia, from Europe, from this country, from that country, and then Africa is generally classified as 'the rest of the world'. So a few of us came together at that time to join the Virus Outbreak Data Africa Network, to see how we could ensure data ownership and handling: to ensure that the data on COVID-19 collected in Africa actually resides in Africa, and becomes the property of the country of jurisdiction. We also saw that as an opportunity to strengthen data-informed health systems for Africa, and to ensure that the rest of the world actually has access to data from Africa that is resident in Africa. And so we had a network of universities, university health centers, ministries of health, and other government agencies related to health care: in North Africa in Tunisia; Nigeria in West Africa; Zimbabwe in the south; and then of course we had Kenya and Uganda, and we were still talking to Tanzania at that time. And VODAN Africa became the first successful implementation of data-in-residence in the context of COVID-19. Funded by the Philips Foundation, we were able to install COVID-19 FAIR Data Points in nine locations: we had two in Uganda, two in Ethiopia, one each in Kenya, Zimbabwe and Tunisia, and two in Nigeria. Then, working with the African graduate students at the Leiden University Medical Center, we also installed a COVID-19 FAIR Data Point there, and were able to demonstrate the possibility of data visiting between Africa and Europe in phase one. And so we began the process of training more and more students from Africa in the principles of FAIR data implementation. It was a bit straightforward at that time, because there was a single case reporting form, designed by the WHO for COVID-19, so the fields were basically the same, the vocabularies were basically the same, and the healthcare workers and the data collectors in the health facilities were trained with the same thing.
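For readers unfamiliar with FAIR Data Points: they expose their catalogue as machine-readable RDF over plain HTTP, which is part of what makes 'data visiting' possible. A minimal sketch of retrieving such metadata, assuming a hypothetical endpoint at https://fdp.example.org and using the requests and rdflib libraries:

```python
import requests
from rdflib import Graph
from rdflib.namespace import DCTERMS

# A FAIR Data Point serves RDF metadata describing itself and its catalogues.
resp = requests.get("https://fdp.example.org",  # hypothetical endpoint
                    headers={"Accept": "text/turtle"})
resp.raise_for_status()

g = Graph()
g.parse(data=resp.text, format="turtle")

# List the titles of whatever resources the endpoint describes.
for subject, _, title in g.triples((None, DCTERMS.title, None)):
    print(subject, title)
```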
So it became diverse when we had to go beyond COVID-19, to expand into clinical data, observational patient data beyond COVID-19, in antenatal care (ANC) and then of course the outpatient department (OPD). I will use the case of Nigeria as an example. Nigeria has 36 states plus the FCT, which is like 37 states, and there are three levels of hospital: we have what we call the primary, which is the lowest level, which is the community's; then we have the secondary, which is owned by the state; and then we have the tertiary, which is owned by the federal government; so all three in the same country. If I take one state, say the state where I live, the vocabularies for the primary healthcare have some slight differences from those of the secondary, owned by the state, and from those of the tertiary, owned by the federal government. And we observed this scenario across the different countries: there was no uniformity in anything, and that was what informed the architecture. So we first built an architecture for observational patient data, as you see here, trying to be conscious of the time. We had the hospital data in the form of bulk input (those are data that have been collected on paper, or some on spreadsheets), and we had to build an embeddable editor for the clinics, to create the metadata for the clinic and the metadata for the patient data, and then the VODAN Africa, WHO SMART-guidelines-compliant, data entry. Some of these countries already have a health information system: parts of Nigeria, most of northern Ethiopia and some parts of Uganda already had the DHIS, and the DHIS is a repository, so even after collecting the clinic data we have to put a duplicate into the DHIS. Then there is a dashboard for analytics within each hospital, because the primary aim was to provide access to critical data at the point of care, which was not in place. And then there is the data point, hosted on Stanford CEDAR at that time, which is publicly available, to host the metadata for each clinic; and then for the clinical data, owned by VODAN Africa, there is also a dashboard, so there are kind of three levels of view. Then there is the case of Tunisia, which is kind of unique: Tunisia is more interested in research data, on the impact of COVID-19 on migrants in Tunisia, parts of North Africa and the greater part of Niger, and so we had to modify the VODAN Africa localization architecture for research data. And then some of the participating university teaching hospitals were interested in both research data and clinical data; they were interested in repositing both kinds of data, and so we had to have a combined architecture, with the editor at the clinic or university, the metadata for the clinic or the university, and the metadata for the clinical or research data, following the same process. Then we removed the DHIS, because we worked with the GO FAIR office, with the Andrea project at that time, and so we have the hospital repository, the joint metadata hospital repository, and then we have the clinic or university repository, and then still the different dashboards; and it continues. And we had to implement this; by then we had expanded to Liberia in West Africa, in addition to Nigeria, and Uganda, Kenya, Tanzania, Zimbabwe, Ethiopia and Somalia, and we were able to install these data points. So how did we handle the differences in vocabularies? Thank you, just two more minutes, thank you Francisca. So I'm going to go on to the data-creation process for the VODAN vocabularies. I'll take the antenatal care (ANC) register, for example. So this is the case reporting form, and the VODAN vocabularies: we check the vocabularies, and if a vocabulary already exists, then it simply goes to BioPortal, because it's already there, and completes the rest of the verification process. If it does not exist, then we use the spreadsheet to create the metadata for the vocabulary; and once that is done, it goes through the entire process again, into the Web Ontology Language (OWL), and then it goes into the ontology repository. So the next time another clinic tries to use this, it already exists, so that process is skipped and it goes on.
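A rough sketch of that check-then-create loop in Python. The helpers here (lookup_term, create_from_spreadsheet, upload_ontology) are hypothetical stand-ins for whatever the ontology repository's API actually provides; this is not VODAN's code, just the shape of the deduplication logic described above.

```python
def lookup_term(repo, label):
    """Stand-in for querying the ontology repository (e.g. a BioPortal-style
    term search). Returns the term's URI if it already exists, else None."""
    return repo.get(label.lower())

def create_from_spreadsheet(row):
    """Stand-in for building vocabulary metadata (ultimately OWL) from the
    spreadsheet row filled in for a clinic's register field."""
    return {"label": row["label"], "definition": row["definition"]}

def upload_ontology(repo, term):
    """Stand-in for depositing the new term in the ontology repository."""
    uri = f"https://example.org/vodan/{term['label'].lower()}"  # invented URI scheme
    repo[term["label"].lower()] = uri
    return uri

def register_field_vocabulary(repo, row):
    existing = lookup_term(repo, row["label"])
    if existing:                          # already there: reuse, skip creation
        return existing
    term = create_from_spreadsheet(row)   # otherwise create the metadata,
    return upload_ontology(repo, term)    # deposit it; the next clinic reuses it

repo = {}
row = {"label": "Gravidity", "definition": "Number of times pregnant."}
print(register_field_vocabulary(repo, row))  # created and deposited
print(register_field_vocabulary(repo, row))  # found in repository, process skipped
```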
And that was how we were able to create the different vocabularies to handle all the 88 health facilities where this is installed. So I have the dashboard here: there are 13 in Ethiopia, 3 in Kenya, and we have one facility in Liberia. I don't know if I'm able to open a web page, and if you can all see my browser: so this is VODAN Africa, 8 countries with 67 health facilities so far with patient instances, and it keeps on growing. So let me take the case of Ethiopia. We have the different health facilities; we can look at the Adia Health Center, so we have two OPD metadata, which were last updated this day, and then we have the ANC one, and it continues like that. So that is the experience I really wanted to share, and I will be happy to take questions on how we were able to handle this diversity in vocabularies across 8 countries and 88 health facilities, with varying degrees of diversity in vocabularies. Thank you all very much.

Hey, good morning everyone. So today we'll be presenting kind of an update on what we've been doing. Last year we discussed a little bit more about this project we've been doing with the Gurriny Yealamucka health service and their Indigenous community, where we are developing a health-based application, a set of applications actually, using a technology called Solid Pods. Right, so I'm going to give you guys a bit of an introduction, and then I'm going to give you a bit of a demo as well as to what we've been doing so far. So a Pod, a personal online data store, is kind of like a place where you can store your own data and have full control over that data. The idea is that you have your own data store: instead of, let's say, Facebook having their own databases, or Twitter having their own databases, or Google having their own databases and us giving them access to that data, our own data is stored in our own Pods, and we have control over who gets access to this data and at what time, and we can revoke that access at any time we want. So the idea is that the individual is the first data user and the first data owner of their own data. And when we have this kind of concept, the advantage we get is that it is not just a data store; it can be an ecosystem for innovation as well. So for instance, in the picture here (the picture to your right side), that's your Pod, and then the innovators, the developers like us, can develop different software to access the data from your own Pod and provide different services. So that's the idea of personal online data stores. Solid is a project started by Sir Tim Berners-Lee back in, I think, 2016 or 2017, a couple of years ago. Solid is kind of a specification that's been developed for these personal online data stores.
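For context (this is an editorial gloss, not the presenters' material): under the Solid specification a Pod is a set of web resources addressed by URLs and manipulated with ordinary HTTP verbs, with the owner's access-control rules deciding who may read or write what. A minimal, hypothetical sketch using Python's requests library, assuming a Pod at https://alice.example.org and an access token already obtained from the Pod's login flow:

```python
import requests

POD = "https://alice.example.org/health/"  # hypothetical Pod container
TOKEN = "..."                              # token from the Pod's auth flow (elided)
auth = {"Authorization": f"Bearer {TOKEN}"}

# Write an RDF (Turtle) resource into the Pod...
requests.put(POD + "note1.ttl",
             data='@prefix ex: <https://example.org/> . ex:note1 ex:text "hello" .',
             headers={**auth, "Content-Type": "text/turtle"})

# ...and read it back. Whether an app may do either is governed by the
# access-control rules the Pod owner sets, and can be revoked at any time.
print(requests.get(POD + "note1.ttl", headers=auth).text)
```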
Using this specification, we can develop a server so that we can have our own Pods, and we already have multiple open-source Solid servers available; you can see there are some links to those servers here. So the general idea: with the centralized databases we have now, there is a database for each app, right? Facebook has their own database, Google has their own database. But with decentralized approaches like Solid Pods, we can have different apps pointing to the same Pod and providing different services. So that's the advantage we are getting over the general, naive way of storing data. And the project we've been doing: we are using this technology, the Solid Pods technology, in the Yarrabah community in North Queensland. That's an Indigenous community, and their health service actually wants to develop a set of apps that can manage patients' health data and, at the same time, provide patients access to their own data and have them in control of their own health data. So they contacted us about one and a half years ago, and we started developing these apps for them based on this idea of Pods. They already have an existing system as well, so what we are doing right now is developing data pipelines in order to get existing data into patients' Pods; and then, once the patients' Pods have their own data, we develop these apps. We are currently developing three apps: a health worker app, an individuals app and a Gurriny app. Actually, we talked about this the last time we were here, so I won't go into much detail about those. These apps basically access the data in the Pods and then provide services such as encryption of data, data-analytics modules, authentication modules, and so on and so forth. So, a little bit of an idea of those three apps. The individuals app is where the individual patients have their own data stored and have their own analytics; so the individuals app provides patients with diabetes analytics (we are kind of focusing on diabetic patients at the moment), and we are hoping to run a trial in the coming two months with a small cohort of patients in Yarrabah to test our apps. So the individuals app is the app which goes to the patients, and the clinic app is the app which goes to the actual health center, to the doctors, where, depending on what kind of data they have available from the patients, they can have different analytics. And then we have another app, called the Care Coordination Team app, which is purely developed for data collection: they have these care coordination teams going out in the community collecting data from patients, and this app is purely developed for that purpose. Once they collect this data, the data goes directly to the patients' Pods, and the patient can then access that data at any time they want. So, the security and privacy we have been developing in these three apps: that was the main point of this presentation. The idea is that all data stored in Pods are encrypted, and no one else can have access to those data, even the system admin: even though they can see the data, they will only see encrypted data, so without the keys they cannot get to the original plaintext data. So the individual here has their own Pod, and they have this set of keys, a master key and a public-private key pair, and then they have a set of data files in which they store their health data. These health data, or data files, are encrypted using something called random session keys, and we want to keep these random session keys secure as well; for that we are using a password that's only known by the individual himself. So the individual knows the master key, and he encrypts these random session keys, and they also encrypt the private key as well. The public-private key pair is used to share the data between Pods: if someone needs to share their own data with Bob, then what happens is that they take the public key of Bob, use that public key to encrypt the session keys of their data, and then share those encrypted session keys with Bob; and once Bob has these encrypted session keys, he can use his own private key to decrypt the session keys, file paths and access lists in order to get access to the data. So that's the idea of the security and privacy architecture.
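The scheme just described, a random session key per data file wrapped with the recipient's public key for sharing, is a standard hybrid-encryption pattern. Here is a minimal sketch of just the sharing step, using Python's cryptography library; the Pod storage, master key and key management around it are elided, and the key sizes are illustrative, not the project's actual choices.

```python
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Alice encrypts a health-data file with a random session key.
session_key = Fernet.generate_key()
ciphertext = Fernet(session_key).encrypt(b"blood glucose: 5.4 mmol/L")

# Bob's key pair (in the system described, kept secured via his Pod).
bob_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# To share, Alice wraps the session key with Bob's public key...
wrapped_key = bob_private.public_key().encrypt(session_key, oaep)

# ...and Bob unwraps it with his private key and decrypts the file.
recovered_key = bob_private.decrypt(wrapped_key, oaep)
print(Fernet(recovered_key).decrypt(ciphertext))
```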
So I'll just give Sergio a chance to discuss the ontologies. Thanks Anushka, hi. So, regarding the system ontologies, we are developing two ontologies. One is about the health and personal clinical information of the patients. That ontology we are developing in a bottom-up approach, based on the current use cases: the main classes are about the person, the patient, and the form related to all the information we gather from the data-collection team app; and recently we have also included a time-series data model, to capture and do analysis over time of different physiological aspects of the clinical data. And then the second model is the secure data model, or SPDM, based on the data-security architecture for the system that was just presented. Here we are focusing mainly on two use cases: one is encryption, and the other one is data sharing. The data-sharing aspect is the most important thing for achieving the decentralized environment, or ecosystem, that we are aiming for with the system. So, as you can see here, we have a person who has a Pod, and then the data files, and that is the part of the share case. This is still an ongoing data model that we are developing and including in the internal mechanisms of the system. And that's it; any questions? Thank you.

First of all, thank you so much for giving us the opportunity to speak with you today about some work that we've been doing, probably for a little over a year now, in conjunction with the International Classification of Diseases, Eleventh Revision, a classification that's been published by the World Health Organization for a number of years. So today I have Dr. Michael Pine and Dr.
First of all, thank you so much for giving us the opportunity to speak with you today about work we have been doing for a little over a year now in conjunction with the eleventh revision of the International Classification of Diseases, a classification that has been published by the World Health Organization for a number of years. Today I have Dr. Michael Pine and Dr. Chris Tompkins on the call with me; they'll be joining me for the Q&A portion of the session, so thank you again. Here are our bios; these slides are available, so I'm not going to go in depth on the information on this slide, but I will say that we have broad experience in clinical medicine, research, and terminologies and classifications. As I mentioned, the International Classification of Diseases from the World Health Organization has been around for 100-plus years and has evolved over time. It started as a mortality classification; around the 1940s, and a little later with the sixth and then seventh revisions, it moved into morbidity as well, and on to where we are today. A number of countries have adapted it: they felt that, while it is a classification that works for mortality and morbidity, there are things not in the WHO version that needed to be expanded upon. There are some examples on the screen here. The one at the top left is from the US, where the Centers for Disease Control oversees the ICD classification and our modification carries CM at the end, for clinical modification; to the right of that is the German modification; and the two below are further national modifications, including the Canadian one. These are only four examples of what currently exists for ICD-10, the 10th revision. So while we have one system that comes out of WHO, when WHO produced ICD-10 the World Health Assembly resolved in 1990 that it should become the classification in place of ICD-9, and countries were able to use it from 1994 onward. From that point, as I said, a number of countries adapted it for purposes beyond morbidity and mortality, and there are many use cases for the classification today. It is not just for statistics, which is where it started; it is used for things like case mix and quality measurement, and using it beyond its original focus is where a lot of these adaptations came into play. So what happened in the 11th revision? In 2019 the World Health Assembly basically said, we have a new system that we would like people to begin to use, and that system is much more than a replacement for the ICD-10 that came out of WHO back in the 90s. They really took account of what is happening in the world of terminology and in the world of electronic tools. When they created this, and the work began around 2007 I think, they asked whether there is a better way to build a clinical system that could be used as a standard, meaning one system that could perhaps be adopted across all countries, so we would not have the variation of modifications we had with ICD-10. So they created what is called the WHO-FIC foundation. Besides ICD-11 there are other classifications that WHO considers to be its reference classifications: the one for interventions, the International Classification of Health Interventions, commonly called ICHI, and the International Classification of Functioning, Disability and Health, or ICF. So when I talk about the WHO-FIC foundation it is more than just diseases and disorders; it includes ICD-11, but it also includes those other types of entities, which address functional descriptions and interventions for example, and there is a whole new area called extension codes. From that foundation, that large body of content of all entities, certain aspects are pulled and a constrained subset is created, so everything that comes out of the foundation as a linearization has the foundation as its source. The linearization is defined here, with both of these definitions coming directly from the WHO-FIC content model reference guide: it is basically a subset for a suitable purpose. Now, when we think about the replacement for ICD-10, that purpose again originally began as morbidity and mortality. One of the interesting things to think about, for the purposes of the MMS as we often refer to it, which is ICD-11 for Mortality and Morbidity Statistics, is whether it could be a replacement not only for the WHO version of ICD-10, which of course it should be, but also for the legacy classification systems like the ones I showed you a moment ago on this slide. When it was first created as a linearization there was some discussion about having separate linearizations for mortality and for morbidity, but it was decided that was not the way to go, and so WHO created a single linearization for both purposes. What has come about because of that is that a number of countries, the US being one and some others as well, who have these legacy systems have said this is not going to work for us, and it has not in the past; when ICD-10 was developed, adaptations were created, as I said. So is it fit for purpose for the current uses such as case mix, research and quality measures? If the answer is no, should countries go about their business and create some type of clinical modification, or, and this is the issue of consideration for what I am going to talk about next, is there perhaps a linearization that aligns with the ICD-11 MMS but also extends it using the foundation? I'll explain a bit about what I mean. What this group did was create three things, three aspects of what I'm going to talk about in terms of the innovation model. One of them is a comprehensive clinical linearization, where we have gone into the foundation and pulled content out of it, because we felt, again, that the MMS is not going to work as is and we need a better system; I think other countries who have done their research have found the same thing, that not everything they need for their purposes is in the MMS. And of course, any time you move from an old system to a new system there needs to be some type of map for longitudinal data analysis. Just to give you an idea: hyperplasia of maxilla is coded three or four different ways in these four legacy systems, and in the MMS it falls under what is considered an "other specified" category, so there is no specific code for it; in C.Clear we do have a specific code for it, as you can see with DA0C.03 with CCL as the subscript. So we have tried to pull very specific content out of the foundation that will help countries, including ourselves, adopt a system that works for their use cases. I'm not going to go into detail with this, given the time, but the main things I wanted to point out are that the main purpose of C.Clear is clinical care, and I'll show you in a moment why; that it is non-proprietary; and that we have absolutely taken advantage of everything WHO has made available to us for creating computer-friendly tools.
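To make the longitudinal-mapping point above concrete, here is a small, purely illustrative Python sketch of the kind of crosswalk table just described. Every code below is a made-up placeholder, not a real ICD-10 modification, MMS or C.Clear code.

```python
# Illustrative crosswalk for longitudinal analysis: one clinical concept,
# several legacy codes, one target code. All codes are placeholders.
CROSSWALK = {
    "hyperplasia of maxilla": {
        "legacy": {
            "ICD-10-CM": "CM-placeholder-1",
            "ICD-10-GM": "GM-placeholder-1",
            "ICD-10-CA": "CA-placeholder-1",
        },
        "ICD-11 MMS": "other-specified-placeholder",  # no specific MMS code
        "C.Clear": "CCL-placeholder-1",               # specific code exists
    },
}

def map_legacy_code(system: str, code: str) -> str | None:
    """Return the target code for a legacy (system, code) pair, if known."""
    for entry in CROSSWALK.values():
        if entry["legacy"].get(system) == code:
            return entry["C.Clear"]
    return None

print(map_legacy_code("ICD-10-CM", "CM-placeholder-1"))
```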
And we have come up with a syntax that helps people gather data using the C.Clear codes and create the kind of description they want. So we have the innovation model, which includes the comprehensive code set I mentioned; the linearization we created, which we call a composite linearization; a syntax with which we can take that composite linearization and make sense of it; and then the tools we have built as well. So why have we done this? We wanted to harness all of ICD-11's power, because there is a lot of power behind the foundation and a lot of content in the linearizations and what you can do with them, and because we wanted to be able to do all of the things on this page: prevention and early intervention, patient coordination, risk adjustment, resource use, and particularly clinical outcomes. What have we done beyond the linearization and the syntax? Right now we are working towards what we call tools. These are an expansion of the coding tools WHO already has, but we have taken it a step further and come up with ways to translate clinical language, using the code set, the linearization and the syntax, into coded clusters, and then take those clusters back to the clinical language they came from. Here is just one example; it is a lot of words on a slide, I realise, but the bottom line is that there is a lot of content about this patient, a 67-year-old woman. If we only used the MMS, all of that reduces to a very short description of what this person really has and what is going on with them. But if we use the C.Clear cluster, which draws on the linearization and the syntax, what we end up with is a description of what this 67-year-old woman has that is very well described and very well stated, because it takes into account all the information in the foundation in such a way that the syntax can create the code clusters and then, from those clusters, identify back exactly what happened. And with that I think I made it, 20-some seconds to spare anyway. Okay, yes, a thesaurus: 15 slides in 12 minutes. So yes, there is such a thing as a biosecurity thesaurus, and I'm going to talk about it. What is biosecurity? Oh my goodness, I don't really have time to talk much about that, but I'm from the Centre of Excellence for Biosecurity Risk Analysis, and we're not just concerned with things like shipping containers; we answer questions for the Department of Agriculture like, how many of those containers should I test, because I don't have resources to waste. We are a bunch of economists and statisticians, not just sandal-wearing ecologists, and we answer questions like this about much more than shipping containers. There is a project context to what I'm talking about today: we are developing a biosecurity metadata portal, extracting research metadata relevant to biosecurity from the places where the research is most practical and relevant to decision makers, which is the grey-literature world, so not behind paywalls but from regulatory and portfolio government sources, industry, and some open-access research sources as well. So there is a biosecurity thesaurus in development, really being driven by the development of this project, but I always feel it's important to share what you're doing, even if it only ever gets used in one particular application context. It's published through Research Vocabularies Australia; you can learn more about that by following the link there, and if you're familiar with Research Vocabularies Australia, do have a look for it and other interesting material. I mentioned we're collecting metadata, and this slide is a bit of flexing about who we're collecting it from, but the point I want to make is that when you're collecting metadata from lots of different sources it arrives in lots of different formats, so we have challenges around harmonisation and transformation of that metadata. Without going into what all these clouds and applications mean, we are doing transformation work on the metadata coming from all those organisations, including applying vocabularies. The reason we're using a thesaurus is to help harmonise the description of all those metadata sets coming from different places, and really to help with one of the project objectives, which is to deal with the language-ambiguity problem in biosecurity; it exists everywhere, of course, but certainly in biosecurity. I wanted to touch today on what goes into a controlled vocabulary and the reasons for the decisions we make about how we develop a vocab, so this is a bit of a carpentry-style, practical step-through. I'm going to talk about three different sources of warrant, by which I mean the rationale or explanation for why a term or concept ends up in one of these vocabularies. As the keynote this morning noted, these controlled vocabularies never exist outside a framework or model context; that's certainly something I'm interested in, and with the thesaurus we're developing we pay attention to things like legislation, the legal framework I guess, and to things like websites and how they're structured, since site navigation gives us hints about how a taxonomy or thesaurus might be structured, and to models such as the biosecurity continuum, which describes activities pre-border, at-border and post-border; these are useful for informing the overall framework of a vocabulary, I can attest. Anyway, that's the top-down side. Bottom-up development is often an analysis of the literature that you want to tag, which is the literary warrant. That's a big job, and we need machines to help us do it because the literature is vast; I'll get to how we're tackling that in just a moment. The third source of warrant in vocabulary construction is sometimes called user warrant: how are people searching, what language do they use, how do they discuss topics, maybe not so much in the literature but in communications, social settings and search contexts. So as part of this project we did a survey of biosecurity participants and asked: what are your questions about biosecurity, what do you want to know? We're interested in the make-up of those questions, analysing them and asking what concepts are found in them, whether there are different ways of asking the same question, and whether the question structure itself points to a particular kind of problem. How many containers should I test for a bug? How many salmon should I test for disease in a hatchery? Whatever it is, a lot of these questions have a similar structure, for example a statistical structure. So this is really the third input for our project: we've got our top-down framework analysis, the bottom-up literary analysis, but also the user question analysis, and the results from that are available if you're interested; there's a preliminary classification of those questions. So we run the text from the literature we're collecting, the questions, and to some extent those high-level frameworks, through our process. It's a little bit of a black box, I can't tell you everything, but we use some proprietary software called PoolParty to do corpus analysis, and that tells us about the frequency of terms but also their relevance, when you also consider mutual information: do the terms occur next to each other because they're part of a phrase, or is it just coincidence? That gives us interesting statistics that help inform us whether or not a term is an enduring concept that should be included in the vocabulary. Sorry if I'm talking too fast. This next slide doesn't even have a title, so what on earth is it? I just wanted to make a point about vocabulary publishing and collaboration. I really believe that one of the reasons we take our vocab work and put it on the web, and put it in registries where it can be described, discovered and discussed, is so that someone else can say, hey, I'm doing something like that. It's really a recent discovery that this thesaurus, which is very broad-ranging in its topics, overlaps with some other vocabulary work happening at the Department of Agriculture, Fisheries and Forestry, DAFF. I found, for example, that a sub-tree in the thesaurus about traps, and traps are used in biosecurity for surveillance of invasive species or to eradicate them, overlaps with another, smaller concept scheme that focuses directly on traps. So there's an opportunity there to say, what are we going to do now? At the very least, can we make links between these using SKOS predicates, exact match or close match or something like that? Should we be sharing the knowledge we've gained from our different vocabulary development activities with each other so we can augment each other's vocabularies, or indeed should we be consolidating them? Are we creating a mess by letting them live in parallel? This is something I think we're still trying to work out, so I'm not really suggesting an answer, but part of it for me is the application context. This is just a screenshot from a demo system for the biosecurity portal, and I don't expect you to be able to read it all, but I'll just say that on the left-hand side here are some vocabs being used as facets and filters and things like that. Think about the user experience: how many concept schemes do you want to throw at your users so that they can operate the system correctly? It might come down to how this stuff is implemented. What I'm finding in the vocab construction game is this question: do I want broad-reaching vocabularies, or lots of little concept schemes that focus on particular things? And if you have lots of little concept schemes, that in turn forces you into developing vocabularies to help express them correctly. So there are all these trade-offs that have to be made.
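Picking up the earlier point about linking the two overlapping "traps" schemes with SKOS mapping predicates, here is a minimal sketch of what those links might look like. The concept URIs are invented placeholders, not the published CEBRA or DAFF concepts.

```python
# Cross-links between two overlapping concept schemes using SKOS mapping
# predicates. Both URIs are invented placeholders.
from rdflib import Graph, Namespace
from rdflib.namespace import SKOS

BIO = Namespace("https://example.org/biosecurity-thesaurus/")
DAFF = Namespace("https://example.org/daff-traps-scheme/")

g = Graph()
g.bind("skos", SKOS)

# Same concept in both schemes: assert an exact match.
g.add((BIO["pitfall-trap"], SKOS.exactMatch, DAFF["pitfall-trap"]))

# Related but not identical concepts: a close match is the safer claim.
g.add((BIO["lure-trap"], SKOS.closeMatch, DAFF["baited-trap"]))

print(g.serialize(format="turtle"))
```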
But it could also be about how these get implemented in a discovery system, which is something I'm concerned about, because I want this to be usable and I want the vocabs to be maintainable as well. So I guess that's my point there. Last slide, yes, implications. Very good, thank you. Thank you. Well, that's coming up, so I'll just kick on. This morning we heard from Arofan Gregory, and there were a few points I noted down that were really of interest, things like "do not be seduced by partial solutions" and "machine actionable", and that probably segues quite nicely into my talk, coming on from Les there, who is of course using Research Vocabularies Australia and looking at different parts of the vocabularies. But what Arofan and I were talking about is that there are much greater challenges in the overall curation and governance around all these things, and I guess that's something we're really trying to look at. I'm still struggling with the screen share there, I think. That's all right. What we've been doing in Australia in the health domain is a little different, in that Australia is trying to go down the route of having a standard for clinical health data, which is SNOMED Clinical Terms, but of course there are hundreds and hundreds of different vocabularies and dictionaries out there and we are not all on those standards at this point in time. So the problem is: how do you get there, how do you map, what are the real-world problems we're facing, how can we bring it together, and, more importantly, how do we work from a basis in the national standards? For SNOMED Clinical Terms, we here in Australia, through CSIRO, have developed a system that I'll show you a little of in a minute, which licenses Australians to use it, and what you can see now is that we're looking to widen the tools coming out of that to build more standard tooling, and hopefully that can link in with some of the standards we were hearing about this morning. But in terms of "don't be seduced by partial solutions", this is certainly one I'm going to show you, because it has a way to go. Here's SNOMED: it's good in that it allows you to give very detailed descriptions of clinical concepts, but the problem is it's very complex in itself, and in fact in my space, which is general practice, we hit the real-world problem of people not having time to code data like this, so you get all sorts of free text in there that can represent a term. These are the sorts of things we want to map as well: it's not just about a standard vocabulary, it's about taking rubbish like this and making sense of it, while also supporting the vocabulary. So why does this whole ecosystem matter? If you look at the middle line here, chronic kidney disease: this is where we analysed coded versus free-text entries in the medical records. Government national statistics, for instance, work off the codes, but they're missing 22.55% of all occurrences of chronic kidney disease because much of it is not coded, so we really need to get this sorted out; you can't rely on coded data alone. Here's METeOR: AIHW do a fantastic job around standards with their METeOR work, but of course it changes over time as well, so what we need to be able to see is how you can migrate between different versions of things, and also how we can make it machine operable. I've been working, as part of a national initiative, to look at quality assessment but also terminologies and mappings, and at getting to more consistent common data models. You can see that if you're going to bring in common modelling, you need to convert data, and you need the conversions to be accurate. Through this work we came up with the idea of taking the ARDC's work to the next level by building on some of these other national tools, and this allows us, for instance, to map to things like the OMOP common data model. I won't go into it in any great detail, but for health research it is the largest common data model in the world; about a billion patient records have been mapped to it around the world, and we can run massive studies with it. Here's the Research Vocabularies Australia portal as it is just now, but it's very much around vocabularies, not mappings. This is the national system that looks after things like SNOMED coding; it allows all sorts of syndication, and it's a very complex system, which is why we wanted to build on it, because it allows different institutions to do things their own way. The UK are bigger adopters of this, and they will probably get to more than a thousand mappings to SNOMED Clinical Terms using it. On top of that, a tool called Snap2SNOMED has been developed that allows fairly automated mapping and look-up of terms to work out what the final SNOMED code is. The really clever thing happening now is that we wanted CSIRO to support mapping to lots of different things, not just SNOMED: can we just open this up so it's generic, so that whether it's code lists, vocabularies or standards, they could be mapped to, never mind health terms? This could go far beyond health. So that's what we've been doing; it's only been going a few months now. This slide shows the workshop we held to try to kick this off and work out some underlying assumptions and specifications, and we've started moving from SNOMED into other vocabularies. The idea is to have this on the ARDC Research Vocabularies website, so it would be an extension, it would integrate well, and hopefully it would look and feel similar, something like that; we'll see how it evolves. We have this up and running now on Ontoserver, and I think for the first time, instead of this being a SNOMED code system, we've actually got one called RxNorm, an American drug coding standard that's used in the OMOP data representation I showed earlier. That's now represented in here and we can see it visually; normally you would see SNOMED in this view, so all of a sudden we're seeing something different, and I'll be excited when we get beyond health altogether, so we'll see how that goes. This is the tool we want to develop further for community curation: if you look over on the right, it's quite small, but it shows who is reviewing a set of mappings and who they're assigned to, and it's controlled in terms of permissions, because there are a lot of basic things we need nationally and internationally around how you properly curate this, who looks after it, and how you record the provenance of the product, which is why we've aimed to go this way. The other thing we want to address is the machine-interoperable side of things. So where we are now: we have it operational in a first version at the ARDC, and we have it working with Australian Access Federation logins.
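The speaker mentions the SSSOM mapping format a little further on as one way of exchanging this kind of mapping in a machine-readable way. As a rough sketch of what sharing a mapping set in that style could look like, here is a toy TSV writer; the column set is only the common SSSOM core as I understand it, and every source code and target identifier below is a placeholder, not a real local, SNOMED CT or RxNorm code.

```python
# Rough SSSOM-style mapping set (core columns only). The codes and
# identifiers below are placeholders, not real terminology codes.
import csv, sys

mappings = [
    {
        "subject_id": "localgp:SEX-1",              # placeholder local code
        "subject_label": "Male",
        "predicate_id": "skos:exactMatch",
        "object_id": "snomed:0000000-placeholder",  # placeholder target id
        "object_label": "Male",
        "mapping_justification": "semapv:ManualMappingCuration",
        "confidence": "1.0",
    },
]

writer = csv.DictWriter(sys.stdout, fieldnames=list(mappings[0]), delimiter="\t")
writer.writeheader()
writer.writerows(mappings)
```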
The RxNorm work is in the testing phase now, and we want to make sure we can do the same with other code systems. The key next stage is around community co-design: we wanted to get the base functionality in place, but now it's about how people who aren't coders and experts can use this sort of thing to really understand how to map and maintain the concepts, and I want to stop people reinventing the wheel. If you look at this tiered example of male and female: how many people have looked at a database and said, oh, I've got my own way of coding that, and I've already mapped it, and it took me time to map it, so I'm not going to share it with you? So I'm really excited that we've got the ARDC to build a platform where perhaps we can start to get things happening in the community. That automated API interfacing is going to be important, and I liked hearing about the SSSOM format, for instance; just because we're coming up with something new in the health domain doesn't mean it can't integrate and work across other representations like SKOS and OWL, so it will be interesting to discuss that a little more today and tomorrow. To finish up, what we're really aiming for is to build a service that's seen as a national resource for many. We'd like to work more with METeOR, for instance, so we'd like to build automation to AIHW and the way they represent health data and clinical concepts, and we're having some discussions about that now, and if we start building up national utilisation then perhaps we can really get the ball rolling and get wider uptake, not restricted to, for instance, the ARDC's People Research Data Commons. We obviously have a focus there, but we've got it in the back of our minds that we don't want to be stuck in that rut, and we want to work with everyone in this room to see where we take this; ultimately it's about faster, better, cheaper research and what we can do with it. Okay, thanks very much. Yep, here we are, okay, wonderful, sorry about that. I'm Lucy at UNSW, and I'm presenting today on building the AusTraits plant dictionary into a formal vocabulary. I'd like to begin by acknowledging the traditional custodians of the lands throughout Australia on which plant trait data have been collected; my home is on the lands of the Margal and Dark people, and I pay my respects to their elders both past and present. As a quick roadmap for today's talk: first let me introduce AusTraits so you understand why we need an AusTraits plant dictionary, and then how we went about building the best possible trait definitions, definitions that are simultaneously a useful resource to ecologists and to bioinformaticians. If you look at any organism, like this eucalyptus tree here, there are three core pieces of information that matter about it: you have to document its name, the taxonomy; where it occurs; and what its characteristics, its traits, are. Within the Australian research infrastructure landscape, AusTraits is emerging as the go-to location for plant traits, the characteristics. AusTraits was first publicly released a little over two years ago, with the concurrent release of the dataset on Zenodo and a data descriptor in Scientific Data. It has continued to grow since then and now includes more than 370 datasets from more than 250 contributors; we're approaching 2 million data records, with some information for 500 traits and nearly all of Australia's plant taxa. So how are these traits defined? The workflow that compiles AusTraits requires a file, traits.yaml, that lists all supported traits, so all 500 traits are documented there together with some core metadata, but it's minimal: for leaf length, for instance, you need a label, a brief description, whether it is a categorical or numeric trait, the output units and the allowed range, and for a categorical trait you have to list the allowed values, but that's it. Better definitions are clearly required: to improve the interpretation of AusTraits data, to increase the accuracy with which new data are mapped into AusTraits, and of course to increase interoperability between trait databases. We subscribe wholeheartedly to the Open Traits Network vision of trait databases, where worldwide there are endless databases for different taxonomic groups, some larger, some smaller, and somehow, to maximise reuse of this data, there have to be connections between them. If we think about the plants noted here, AusTraits is one of these little dots for the angiosperms, and properly defining and documenting the traits is one way we can increase data reuse. I want to talk first about what is required for a formal vocabulary; this is well known to everyone in the room. You need a permanent, resolvable URI for each trait concept; the dictionary needs to be published as an RDF serialisation and archived in an ontology repository; there need to be links to identical and similar traits in other published vocabularies; and more explicit definitions, where individual words and terms are linked to the same term in a different vocabulary. But what does an ecologist view as necessary? It's a quite different list. Yes, the dictionary has to be published, but ecologists are actually quite happy with a PDF, even though no one here would be. When they think of links to identical and similar traits, simple string matching is fine. And as for more explicit definitions, I've actually had ecologists ask me to please remove the links to published vocabularies because they make the definitions hard to read. Meanwhile, they want the dictionary output as a spreadsheet for easy research reuse, they want categorical trait values to be part of specific trait definitions rather than standalone terms, and they simply want definitions linked to research papers and trait handbooks, not to something specific to that trait. So you can see this is a bit of a tug of war between cultures, and we aren't the first to note it. There are some plant trait ontologies out there; none is complete, and none sufficed for us to reuse when we built the AusTraits database. We've taken particular inspiration from the TOP thesaurus, a spin-off of the TRY global plant trait database; they've done things like linking individually defined words to terms from other vocabularies, but they have not gone as far as adding definitions for nearly all of the TRY traits, and there are no unique identifying URIs for each trait. So that was a starting point, and our goal has been to create a vocabulary that merges what's required for a formal vocabulary with a form ecologists can reuse; a vocabulary is only useful if its audience sees it as such. So what were some of our steps forward? We want to create a vocabulary that incorporates the information content ecologists want with the standards required by the bioinformatics community. Very quickly, what is this information ecologists want? They want trait concepts with clear scopes, explicit definitions, comments about best-practice methodology, curated lists of allowable trait values, links to references, and expert review of all the allowable ranges and units. Among other things, we held a series of workshops to discuss this for nearly every trait. For instance, we had a plant growth form trait that we considered muddy and confounded, and after much discussion decided that the trait values within it could be better divided into five much more explicit trait concepts; this occurred again and again. We also added descriptive metadata keywords, what structure was measured, what characteristic was measured, whether it was a length or a mass or a force, and we added metadata to document the trait concepts: references, links to databases, reviewers. Again, when I say links, these were mostly just string matches, but as a starting point it was good enough, and it has already led to one paper showing how merging trait databases together can increase trait coverage. If we now think about this from the perspective of building a formal vocabulary, one thing we had to do was add resolvable identifiers to each trait concept, and for this we turned to w3id.org, the redirect service, and the APD namespace, with a unique identifier for each of the 500 traits that redirects to our project's GitHub website. Then came the actual steps of teaching myself how to build this formal vocabulary. I started by attempting it in Protégé, and as a beginner it had its benefits: visual menus, it was easy for me to bring in different ontologies and pore over them, and it helped me understand what's required for a formal ontology. But it also had its downsides. All of our data was documented in a series of spreadsheets, with columns each mapped to annotation properties, because that's what I had; Protégé was very constraining about what are classes versus individuals and properties, it was incredibly clunky to correct mistakes, and, as Rowan Brownlee can attest, I never actually created an output that passed the validator. So I turned to a different approach, something very familiar to ecologists: I used R code to build the ontology from the base up. This meant I could use my spreadsheets, the code could simultaneously create RDF representations and human-friendly versions, it was easy to change mappings, easy to add and edit traits, and importantly the code can be reused for additional ontologies. There were downsides: I don't think I could ever have built a vocabulary like this as a beginner; Protégé taught me what the proper output was and how to go about it, and then I could turn to R. But what I now have is an R script where I can simultaneously take my input datasets and output the CSV spreadsheets that ecology researchers request, an HTML file and a website that is easy to search and peruse, and a series of machine-readable formats, all from the same data inputs and a single script, and all these output formats are archived on Zenodo. We had a new release just last week; the Turtle file is available to download, or to look at on Research Vocabularies Australia, and then we have our website, which includes all outputs as well and where you can simply scroll through and look at the different trait concepts. So here is the same very short trait definition we had at the beginning; it now has this much information content attached to it. Here is the resolvable URI and all the annotation properties, each one linked to a well-established vocabulary, to SKOS, to Dublin Core terms. Just two minutes? What's that, two minutes to go. And then there is information such as keywords.
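The speaker's pipeline is written in R, but to give a flavour of the "spreadsheet columns mapped to annotation properties" approach just described, here is an analogous sketch in Python with rdflib. The APD-style URI, the trait fields and the property choices are simplified assumptions for illustration, not the real AusTraits Plant Dictionary output.

```python
# Analogous sketch (the real pipeline is in R): turn one spreadsheet row of
# trait metadata into SKOS / Dublin Core style RDF. URI pattern and fields
# are assumptions, not the published APD structure.
import csv, io
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, RDF, SKOS

rows = csv.DictReader(io.StringIO(
    "trait,label,definition,units,allowed_min,allowed_max\n"
    "leaf_length,Leaf length,Length of the leaf lamina.,mm,0.1,3000\n"
))

APD = Namespace("https://w3id.org/example-apd/traits/")  # placeholder namespace
g = Graph()
g.bind("skos", SKOS)
g.bind("dcterms", DCTERMS)

for row in rows:
    trait = APD[row["trait"]]
    g.add((trait, RDF.type, SKOS.Concept))
    g.add((trait, SKOS.prefLabel, Literal(row["label"], lang="en")))
    g.add((trait, SKOS.definition, Literal(row["definition"], lang="en")))
    g.add((trait, DCTERMS.description, Literal(
        f"Allowed range {row['allowed_min']}-{row['allowed_max']} {row['units']}")))

print(g.serialize(format="turtle"))
```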
What is the measured characteristic, what is the measured structure: each of these has terms, and that's the keyword the ecologist wants to see, but those terms have been linked back to published vocabularies. We had to be a little more creative about how we did our exact matches for some terms. The Trait Ontology has leaf length as one of its terms, so there we could do a proper exact match to a URI, but the big global trait databases have a project web page without resolvable URIs for the individual terms, so we had to map them in creatively, as a simple string into which we embedded the project, so that we could see what's going on inside. It's not perfect from the bioinformatics perspective, but it offers a linkage to what the ecologists require, and we didn't want to leave these mappings out simply because they don't yet have URIs. Meanwhile, all references are indeed linked to identifiers, and all reviewers are identified by their ORCIDs. You'll see as well, up here, the version with the encoded words from ontologies, and we've also had it without them, so that ecologists can simply read a sentence. Finally, there's an additional resource we've created: in addition to the actual published vocabulary for the 500 terms, by far the most comprehensive plant trait vocabulary in existence, we have made freely available the resources we used to create this ontology. Anyone from the ecological sciences, or anyone who codes in R, can simply see how we've merged our CSV files, using a couple of R scripts, into the different RDF serialisations, and to us it's as important to share these resources as the vocabulary itself; I hope maybe somebody else can use R to code an ontology in the future. Thank you very much. For those who haven't come across it, the AuScope Geochemistry Network is an AuScope-funded initiative that has really been looking to bring together the earth science departments at Australian universities and some other sectors, in response to a community-identified need to better tackle the standardisation and organisation of the masses of earth-science-related data coming out of these various institutions. A large part of this has been building what we call AusGeochem, which is a flagship front-end platform that acts as a geospatial archive as well as a dissemination and exploration platform, allowing users to upload and interact with a large volume of geoscience data from a number of disciplines, from Australia but also, somewhat misleadingly given the name, internationally as well. Obviously, in order to do this we've needed to create quite extensive vocabularies for earth science, either from whole cloth or by adapting ones from within the community. These have been built largely from custom relational data formats together with SKOS vocabularies, and we've tried to make them quite bespoke for the geoscience community but also, very importantly, as the last talk touched on, accessible for non-data-scientists; and while they are suitable for taking existing data, we've really aimed to be forward-looking, with an aim to export data directly for machines. Within the earth sciences there's quite a variety of sub-disciplines and data types that we can collect, but we can break it down into three domains. All of our data is in some way linked to a sample that we're analysing, be it a physical rock or mineral sample, or things like groundwater or soils when we're looking at contamination. Then we have what we might think of as our primary data, the information telling you something about the age or the history or the chemistry of that sample, and its associated metadata; and then the way this is acquired, which is obviously a complex relationship, since there can be multiple ways of collecting different types of data. Where we are at the moment in this project is that we've established vocabularies for some of these but not others, and we're continuing to work on the rest. Today I'm going to focus a little more on the geochemistry side of things, because that's probably the least accessible. Just quickly, what we mean when we talk about geochemistry is essentially the composition of the rock or mineral or whatever our sample specimen is, be that the elemental composition or the amount of a certain compound in there, and this obviously has overlaps with chemistry more broadly, so we can draw on other vocabularies and resources there. This is a snapshot of how we've built this vocabulary. Like all of them, we haven't gone out by ourselves: we've assembled an advisory group of experts in the field, so that we're not just propagating one lab's idea, and we're trying to get as much input as we can; we're currently in the process of uploading these to RVA. The way it's set up, we have high-order concepts: the geochemistry data point, which tells us something about the sample and what the data represent, and then something about the composition, and the spatial information, which is the aliquot, which I'll step through in a moment. The geochemistry data point tells us what is being analysed and how it's being analysed, things like the analysis scale and the mineral type: are we looking at a whole rock or a whole soil sample, or are we actually looking at a single spatial point on a single mineral? Again, we're linking where we can to external vocabularies, things like Mindat and the IMA for mineralogical and geological descriptions, so we're not trying to reinvent this, and everything is tied to ORCIDs, so there is data attribution for anything being uploaded here. Moving down the hierarchy, we have what we call the aliquot, which tells us where in the sample the analysis was done: are we just doing one large sample, or are we taking a lot of sub-samples with some spatial context between them? Then we start to step into what you might think of as the hard data: which elements, what is the abundance, and what is the uncertainty associated with those. Here again we're drawing on external vocabularies, PubChem as the repository for chemical information and QUDT for units of measure. Set up in basically the same way is the compound data: again, what are the concentrations and uncertainties associated with these compounds rather than elements, and again we're outsourcing the definitions by linking to PubChem. Beyond the academic sector, we're currently working on a project with the national geological surveys to standardise the reporting standards for this elemental and compound data and to produce templates for the analytical procedures used to collect it, so that we can really start to move towards sharing this data and making it fully open and FAIR.
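A simplified sketch of the sample → aliquot → element-abundance structure just described, with links out to PubChem and QUDT in the style the speaker mentions. The identifiers, field names and unit URI are assumptions for illustration, not the published AusGeochem vocabulary.

```python
# Toy record following the described hierarchy: sample -> aliquot -> element
# abundance with uncertainty, linked to PubChem and a QUDT unit. All values
# and field names are illustrative assumptions only.
from dataclasses import dataclass, field

@dataclass
class ElementAbundance:
    element: str
    pubchem_url: str          # external definition of the element
    value: float
    uncertainty: float
    unit_url: str             # QUDT unit of measure

@dataclass
class Aliquot:
    aliquot_id: str
    location_in_sample: str   # spatial context within the parent sample
    measurements: list = field(default_factory=list)

@dataclass
class GeochemistryDataPoint:
    sample_id: str            # e.g. an IGSN-style sample identifier
    analysis_scale: str       # whole rock vs single mineral spot, etc.
    mineral: str
    analyst_orcid: str        # attribution for the uploaded data
    aliquots: list = field(default_factory=list)

point = GeochemistryDataPoint(
    sample_id="SAMPLE-0001",
    analysis_scale="in-situ spot",
    mineral="zircon",
    analyst_orcid="https://orcid.org/0000-0000-0000-0000",
    aliquots=[Aliquot(
        aliquot_id="SAMPLE-0001-A1",
        location_in_sample="grain 3, core",
        measurements=[ElementAbundance(
            element="U",
            pubchem_url="https://pubchem.ncbi.nlm.nih.gov/element/Uranium",
            value=350.0, uncertainty=12.0,
            unit_url="http://qudt.org/vocab/unit/PPM",
        )],
    )],
)
print(point.sample_id, point.aliquots[0].measurements[0].value)
```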
This is just a very quick mud map of the flow of geochemical data within the sector at the moment, edge cases aside, flowing down from the producers to the ways in which we can make this data accessible. Currently we have a lot of data coming out of either university or commercial labs and simply flowing down to end users in the academic and government sectors and also in industry, and at the moment, as you can see in the red fields at the bottom, a lot of this data is very segmented. What we're trying to do through this project, working with the surveys, is to establish and publish this national vocabulary for geochemical data and for the metadata and analytical procedures used to collect it, and to start proposing national best-practice reporting standards, with the end goal that we can move towards automated data sharing between government and academic databases. If we then map out how this revised framework would hopefully look, you can see it's a little different, but mainly in that we're starting to connect a lot more of these boxes. The two points I want to highlight about what we're trying to achieve in this project are, first, really standardising the data coming out of these first-order producers, so that we can be sure we're getting the requisite minimum metadata, and that things like the standards and the units are all standardised, so that when we want to interpret the data together we've got the requisite information to do so; and second, at the other end, making sure that data users are actually feeding into shared repositories, whether that be the AusGeochem platform or cross-sharing between the various government portals, so that we're not duplicating data or hiding publicly funded data in obscure places; the real goal is to make this as accessible as possible. And, as I think we'll hear in a couple of talks in a little bit, the AuScope Geochemistry Network is also working with international partners: most of what I've been talking about has been at an Australian scale, but our platform is obviously open to a global audience, and we're also trying to harmonise our vocabularies and databases to start really mapping this data between the large global databases such as EarthChem, GEOROC and AstroMat. So, to quickly summarise: the AGN is collaborating with a number of expert international groups to produce these SKOS vocabularies for the geoscience community; we've more recently started to work with the government geological surveys on these national reporting templates, to really start taking steps towards interoperability; and we're currently in the process of rolling these out and uploading them to RVA. If you do want to have a look at them in the meantime, they are publicly available, though not in a machine-readable format, on the AusGeochem website; otherwise I'm happy to talk about any of them with you more directly in the future. Thank you. First of all I'd like to acknowledge and celebrate the First Australians on whose lands we are meeting, and on whose traditional lands we always meet, and pay our respects to our elders past, present and future. I've included the abstract because this is a CC BY presentation, so I'd like to package it up, but I don't expect you to read it. What I'm going to do is quickly set the scene as to why we need to be sharing our data globally, talk a little more about the WorldFAIR project that Arofan touched on this morning, talk about how we are using FIPs and FERs in OneGeochemistry and how we then translated that into trying it with geophysics, very brave, and just point out how you can accelerate the development of FAIR vocabularies. Right, okay, call me an old fart, but I'd argue that with our sciences we're moving on from the portrayal of our science and its objects. I always argue that geoscience got contaminated by the geological map in 1815, when this character, William Smith, went around, made thousands of observations and produced a picture, and we're kind of still doing that now. With knowledge graphs and new technologies there's a new world opening up, and that new world is about visualising the observations themselves, using knowledge graphs that enable analytics at that finer granularity, but if you're going to do this then FAIR semantics are critical. One thing few people realise is that it's really worth revisiting the original paper of Wilkinson et al., where they emphasise that it's all about machine readability of data, making data machine actionable, and as Arofan said this morning, probably 95% of what's online and claimed to be FAIR is at best findable and accessible by humans; it's definitely not machine readable, and above all it's not interoperable or reusable. Moving on to why things fail on the interoperability principles: the second one says that metadata and data use vocabularies that follow the FAIR principles, which means they're available online and machine actionable, and as you know, most of the vocabularies we have to deal with are, if we're lucky, a PDF online. Even where a vocabulary does have PIDs per term, the vocabulary itself is not versioned or doesn't have a PID, so in a machine-to-machine environment it's very hard to state which version of a vocabulary you used, and it can be hard, as I said, to access previous versions, because usually you've just got one link for the whole lot. One of the foundations we need to remember is that we want these things to be global, and there are a couple of things about that: interoperability is dependent on the size of the community that firstly knows about your standard and secondly uses it, and that's a line I've pinched from Simon Cox from decades ago. We do have this need to converge on standards for FAIR machine data, but this is really going to take time, and at the moment, because everyone wants to get into AI or machine learning or all sorts of online analytics, they're just creating vocabularies; we say they're breeding like rabbits all over the place, and this just exacerbates the problem of how we make things more machine readable and more accessible. Above all, as I said to one group, the reason no one knows about your vocabs is that if I go to your website I've got to go down three levels to find them; I don't know about them, you don't go to conferences, you don't tell us about them. So I was really lucky to be invited to be part of a group that became part of this WorldFAIR project, and what was even better was that for some reason the EU, I'm not going to say lowered their standards, I should say changed their rules, and allowed Australians and a few other countries to participate. So Steve McGacken went in with the ADA, I'm working loosely with a group called OneGeochemistry, and we got in on it and were able to get funding and be part of it. Mind you, the downside is that anyone who's done an EU Horizon 2020 project knows what the bureaucracy and project-management overhead are; you wonder why the hell you did it. But to get more precise, as Arofan alluded to this morning, you can see what we call the petals: there's geochemistry, there's agriculture, cultural heritage, GBIF is in there, there's oceans, all sorts of groups. We're working within the domains, but under Arofan's leadership we're starting to work out how you can make this interoperable, and a lot of what Arofan presented this morning was from this project. Now, one of the tools they have started to use, and as you know I've been trying to do interoperability of data for probably the last 25 years, I think maybe we've got the silver bullet: they've picked up on a tool called a FAIR Implementation Profile that came out of the GO FAIR group. The GO FAIR group has its roots in the medical domain, where the attitude is simply, we've got to get the data machine readable, how do we do it. You can see on the right that it's just a simple Excel spreadsheet, that is the FAIR Implementation Profile, and for each of the FAIR principles it just asks you: which models and schemas do you use, what usage licence do you have? For each of the principles there's a simple question, and you can answer it in this Excel spreadsheet in free text. But you can go one step further, and that is what we call a FAIR Enabling Resource. This gets a little more sophisticated, but each principle can be linked to a FER, which is a link to an online resource: it can be a vocabulary, it can be the ISO 19115 metadata standard or, if you're using a profile of that standard, the link to what you're actually using, for each of the FAIR principles. It's done in what they call a nanopublication, and each FER must have a link in the real world; I've got a paper there that gives you more information. To make it easier, and this is really new technology that's not particularly stable operationally, one group developed this thing called the FIP Wizard, where you can go in, start to declare all your resources, and publish the profile as PDF, Excel or JSON. The important thing about this is that if you've got a community using it, you can go in, for the IGSN identifier say, declare you're using IGSN, and straight away I knew who else was using a resource or a vocabulary that I'm using, so I knew we then had the grounds for doing machine-to-machine interoperability. Just quickly on OneGeochemistry, which Angus mentioned: this is an informal amalgamation of quite a few of the major global databases, and AusGeochem is in it along with EarthChem, which is a big American database, among others, and we're trying to work out how we can get our datasets interoperable. So we're starting, say between EarthChem and AusGeochem, to go through these simple questions, what are you doing, find out what the commonalities are, and then hopefully get them into FERs so we can make them machine readable, and you can see how, if you do that across all the databases, you start to get somewhere. Even though a lot of people complain about figshare and Zenodo because they're generic and don't impose domain metadata or data standards, the thing is that if the entry in Zenodo has a FIP with it, then you know a lot more about that database as they become more standard.
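As a rough sketch of what a FAIR Implementation Profile boils down to, one declared FAIR Enabling Resource per question, here is a toy machine-readable version in Python. The principle-to-resource entries are illustrative choices, not an actual community's published FIP.

```python
# Toy FAIR Implementation Profile: each FAIR principle (or sub-principle)
# is answered with a link to a FAIR Enabling Resource. Entries are examples
# chosen for illustration, not an actual published FIP.
import json

fip = {
    "community": "example geochemistry network",
    "declarations": {
        "F1 (globally unique, persistent identifiers)":
            "https://igsn.org/",                       # sample identifiers
        "I2 (vocabularies that follow FAIR principles)":
            "https://vocabs.ardc.edu.au/",             # published vocabularies
        "R1.1 (clear and accessible data usage licence)":
            "https://creativecommons.org/licenses/by/4.0/",
    },
}

# Publish the profile as JSON (the FIP Wizard can also export PDF or Excel).
print(json.dumps(fip, indent=2))
```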
spectrum you've gone from long tail to the big end of town and I thought that was a bit easier because geophysicists tend to be a bit more literate than digitally literate than geochemists sorry guys I am a geochemist I can say that and the aim of that project was to test what we had to think about if we wanted to make our geophysical data accessible on the scale machines that are already in Europe and North America what do we got to do and the key conclusion was it has to be fully machined to machine and therefore we actually tried implementing the fair principles machine to machine in geophysical data so we went through the same effort thank you Joe she she tried that fit wizard that's around and was able to generate the fair implementation profiles we created a Zenodo community which bit ironic within NCI to store these fibs and if you go to our catalogue you can see how you've got the access to the resource but there again in yellow you've got the fair implementation profile so straight away download the data and here's the profile which tells you all those nitty gritty things you want to know about what Dan vocabulary do they use or what formats the data in and it's all there at a logical ordering and so we hope to work that across all these geophysical domains in the coming years if this catches on and I see it as an excellent tool for declaring what's going on and so we come to our second final diagram in which I'm saying you've got local we've talked about this Lizzie talked about taking local moving them up into community and ultimately we will have the global agreed vocabularies that takes time but the critical thing to get fair vocabularies for all is to get people down at the lower levels understanding what it means to publish a vocabulary publish it properly using identifiers and I think I agree with our friend this morning this is so darn hard to do if people know someone else has done it then we'll get quickly going on the community resources as Lizzie said the drivers are there and the important thing is that if you've already got something at tier 3 with the local vocabulary you can just redirect the URIs of the terms that you're using as they come online and so in conclusion to accelerate it I come back to it requires vocabularies that follow the fair principles this is why I can pretty well say there's very few data sets in Australia that are genuinely fair because they file that criteria we need to get people to publish them in a reliable vocabulary service and make sure they're fair compliant with the PIDs and even when you've got a vocabulary with URIs for the terms make sure the vocab itself and the version of the vocab has a PID because that's what the FERS refer to and want to embed data set metadata and ensure that super seeded versions can still be accessed i.e. 
time stamped so that the older data sets we can machine to machine get back to the terms that we used to make that data set fair, thank you so good afternoon my name is Maggie Smith and I work within the data governance and catalog team at Geoscience Australia my colleague Laura Sedgeman who leads the team which publishes the vocabs is also here today and she was my stand by I know she's not pushing me out of the way so I'm speaking today from none more country and I'd like to acknowledge the traditional owners and custodians of country throughout Australia and recognise their continuing connection to land, waters and community and I pay my respects to Aboriginal and Torres Strait Islander people here today and to elders past and present I've made three presentations this year about the GA vocab refresh so I apologise in advance for any repeated content the next few minutes I'll be covering some background about GA vocabs two example types that we're publishing our policies and finally how fair our vocab there is are at the moment so a vocab at GA is for communication both as human readable concepts and as machine readable persistently identifiable link to terms being used in GA products they're intended to be controlled lists managed by subject matter experts and published according to our existing approval and release process they can also be international controlled lists referenced in GA products like country codes sorry I'm going down the page don't move the words sorry I need to read it apologies I'll get completely off track and tell you how bad my flight's been in the last two days see how that goes so our vocabs can be a glossary alphabetical list of terms in a particular domain of knowledge a data dictionary structured data elements in their metadata generally taken from a database table or a pick list a defined set of terms classified using a domain model with classes, subclasses and sub properties currently the most important thing about GA vocabs at the moment is the fact that it's being populated voluntarily and there is no mandate for people to do that so it's a big step forward for us so I've chosen an example of vocabs that communicate database content we have a catalogue policy aspiration that for data held in databases which are not accessible for query or download the metadata records are to be publicly facing and should include a data dictionary or snapshot of the database as well as associations to delivery formats such as web services in the record above you can see that the download and links for the database contain links to web services which don't deliver all the data from the database and in fact have their own metadata records as indicated in the associations tab as of this month we now also have links to an existing geological databases data dictionary pdf so as Leslie mentioned before we've got the good old pdf which is better than nothing and to the record for the vocabulary register the existing geological databases data dictionary pdf was published in 2021 to support the delivery of web services by providing definitional information with the service when you look at down when you download the document describes the various tables and here looking at content character you can see the format of the information and the database table identifier this information is now available in the vocab register which is identified as an associated record link so looking at the same table in the vocab register the parent terms are identified and the hierarchy is as you 
The provenance is contained in the RDF, but it's not displaying at the moment. We're going to be doing a small body of work in the future to update the information given on the landing pages so that it is actually in line with our metadata records in the catalogue, so licence, point of contact and any lineage will be displayed. In the process of communicating these database tables as vocabs we've also been cleaning up duplications and spelling errors and, in the case of landform type, putting in definitions that were not previously given. The next step will be to reference vocab terms in web services, and Aaron Sedgeman's keen to give that a go, but there's a bit of work to do there yet. We're also in the process of publishing the controlled code lists used in the catalogue, where we have extended the standard list. In this example the code list is for associations made between metadata records in the catalogue. If you look at the associations in this record, you can see that this vocab is part of the GA vocab collections record and that the list was informed by the GA metadata profile of ISO 19115. As described in the abstract, this list contains the standard and extra terms used in the metadata standard, and when I was doing this presentation I found a mistake: "stereo mate", which is the top term, should be camel case, so I'll have to go back and clean that up. Releasing data is a well-known process at GA, and we're treating a vocab as a dataset for the purposes of the catalogue process. The vocab and publishing policy specify that before the vocab is created there needs to be an identified custodian for the vocab, a community of practice supporting the content, and the custodian director's approval to begin work and to confirm that it's okay to release these terms publicly. Once the vocabs in the template are with the catalogue team, they begin the online publishing process at the same time, the PMRT: the creation of the metadata record for the catalogue, the review of the metadata and product, and then the final approval to release. So this is a follow-the-bouncing-ball type process that people follow. The eCat PID, which is generated by the completion of the metadata creation process, is inserted into the RDF of the vocab, so there's a connection there as well. The vocab creators could use RDF directly, but generally the VocExcel template, which documents metadata about the vocab being created as well as the vocab itself, seems to be popular so far, in that everyone seems to be familiar with Excel. Once the template is complete they begin the publishing process: as I mentioned before, they send the template to us, and at that point I go through the template to look for consistency, normal spell checks and things like that, and where appropriate getting rid of definitions that don't make sense, like "template template", those sorts of things. So that's what the user sees as it goes through the online publishing tool. Then I take the template and convert it to RDF, validate it quickly to make sure I haven't missed anything obvious, and then I put it into a GitHub repository. I put that in as RDF and it's actually validated in the tool as well, so it gets a second validation, which was interesting a while ago when it validated online but didn't validate in our tool, but that's another story. Once it's put into GitHub it goes onto a non-prod page, a review page for the custodian to check that it is appearing as they expect it to.
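As a rough sketch of the template-to-RDF conversion step just described, assuming a simple one-sheet layout with notation, label and definition columns: this stands in for the VocExcel workflow rather than reproducing its actual code, and the file names, sheet name and base IRI are invented.

```python
from openpyxl import load_workbook
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

# Hypothetical base IRI and sheet layout; the real VocExcel template differs.
BASE = Namespace("https://example.org/def/demo-vocab/")

wb = load_workbook("vocabulary_template.xlsx")
ws = wb["Concepts"]  # assumed sheet name

g = Graph()
scheme = BASE["scheme"]
g.add((scheme, RDF.type, SKOS.ConceptScheme))

# Assumed columns: A = notation, B = preferred label, C = definition.
for notation, label, definition in ws.iter_rows(min_row=2, values_only=True):
    concept = BASE[str(notation)]
    g.add((concept, RDF.type, SKOS.Concept))
    g.add((concept, SKOS.prefLabel, Literal(label, lang="en")))
    g.add((concept, SKOS.definition, Literal(definition, lang="en")))
    g.add((concept, SKOS.inScheme, scheme))

# The serialised file would then be validated and pushed to the GitHub repo.
g.serialize("vocabulary.ttl", format="turtle")
```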
We're adopting a very slow approach to promoting this service at the moment, but once the current vocabs in the backlog are published (there will be about 40 vocabs), we'll start approaching the rest of the organisation to see if there are more. So how FAIR are our vocabs? Following the information given by Simon Cox in the FAIR vocabularies work on GitHub, I'd say that our vocabs are fairly good using this general description. As I mentioned before, we're doing more work on the delivery system, the licence will be obvious on the register, and eventually the vocabs will also be downloadable in different formats. Going into a bit more detail from the same authors, I believe that we've got F and A covered but could do better on the I and the R. As I mentioned, after a long pause GA is now gradually publishing persistent vocabs using VocPrez. Vocabs are a communication of the terms and the definitions being used or referenced in products released by GA. Other benefits so far have been improvements in database definitions, and conversations with other state surveys about the vocabs that GA is releasing in terms of geological databases. Vocabs are released via a standard, well-understood data release process; terms within a vocab, and the vocab itself, are persistently cited; and the vocabs will eventually be discoverable at RVA. I too have been annoying Rowan, not annoying, talking to Rowan, and we're going to be cleaning up the existing GA vocabs sitting there at the moment, I think there's 14 of them. If we extend a non-GA vocab that we reference, we will use the existing point-of-truth IRIs for the terms and then include GA IRIs for any extension terms (there's a sketch of this pattern below). As for future ideas, obviously there's going to be a huge long list, and that includes having more vocab metadata exposed; referencing the IRIs in metadata records for all code list and keyword terms in the actual catalogue web services; incorporating vocab links into the service rather than just pointing at the PDF version; and, as was mentioned in the last two presentations, having conversations to work towards creating community vocabs and potentially even international ones. But we have to start somewhere, so at least we're putting things out there and we can start the conversation from there. Thank you.
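One way the point-of-truth-plus-extension pattern mentioned above can be expressed in SKOS is to reuse the external IRI untouched and mint a local IRI only for the extension term. Every IRI here is invented for illustration, and linking the extension via skos:broader is just one possible modelling choice.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, SKOS

GA_EXT = Namespace("https://example.org/def/ga-ext/")  # hypothetical namespace

g = Graph()

# External point-of-truth term: referenced by its own IRI, never re-minted.
external_term = URIRef("https://example.org/external-vocab/stereo-image")

# Local extension term, with its own IRI, linked back to the source concept.
ext = GA_EXT["stereoMate"]
g.add((ext, RDF.type, SKOS.Concept))
g.add((ext, SKOS.prefLabel, Literal("stereo mate", lang="en")))
g.add((ext, SKOS.broader, external_term))  # one possible linking choice

print(g.serialize(format="turtle"))
```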
I'm Marthe Klöcking, and I'm talking about collaboration between three major geochemical and cosmochemical data systems and how we're aligning and harmonising our vocabularies. I'll be speaking about EarthChem and DIGIS, which are both terrestrial geochemistry data systems, dealing with rocks and minerals from Earth, and then the Astromaterials Data System, which is the cosmochemistry equivalent, dealing with chemical, compositional data of meteorites and other astromaterials. I'll mostly be speaking about the synthesis databases of these three data systems. They're called AstroDB, GEOROC and PetDB, and they compile chemical data from the published literature. This slide really is just to show that, besides the primary chemical analytical data, we also compile a large amount of metadata describing the samples, the geographic locations that these samples come from, and the analytical methods as well. Just one example from GEOROC and PetDB, the terrestrial side: together these two databases host over 30 million individual data values, so it's a huge amount of data I'm talking about here, collected over 25 years, and the colours on this image are just showing different geological settings. Beyond these synthesis databases the three systems actually also do a bit more. All three data systems host domain repositories, so they publish data that's submitted by researchers from the community, and these data then flow into the synthesis databases. And then, pretty uniquely, EarthChem run the EarthChem portal, which is a combination of both the metadata and data from seven different distributed synthesis databases; GEOROC and PetDB data are in here, as well as a number of other databases. Up to this point you might get away with not really worrying about your vocabularies too much, but if you want to run a data portal like this, combining this much data from this many synthesis databases, you really do need to synchronise your vocabularies. So this data portal is sort of the motivation behind the work that I'm presenting today, although of course it flows through all the rest of our system architecture. GEOROC, PetDB, EarthChem and DIGIS have been collaborating for the last 20 years or longer, but recently we decided to properly align all of our vocabularies and also bring Astromat on board with us. The goals are alignment of our vocabularies; extending those vocabularies to also integrate external standards where they exist, so where community authorities have published vocabularies we want to include those; we obviously want to make them FAIR, so we want to make them accessible by publishing them with RVA; and we're also putting in place a hopefully transparent and very flexible governance model that will allow for future changes and community revisions of these vocabularies. We've been doing this for two years now and it's really hard, largely because we're dealing with a very diverse, large set of communities and sub-disciplines within those, and there are really no global standards in geo- and cosmochemistry: everybody pretty much does what they want within their own little sub-community, and we often find multiple definitions of the same term and conflicting, complex hierarchies. So it's quite challenging to combine and synthesise all of this information to serve all of these communities together. Here's an overview of the ecosystem of vocabularies that we're dealing with. This is not a technical drawing by any means, it's just an overview, but you can roughly divide it into anything describing the sampling feature, so the geographical location of our samples and the physical samples themselves, and then the analytical side of things, the observations and measurements: analytical methods, instruments, variables, units, etc. In green I've highlighted external vocabularies that we can make use of, ones that are machine readable, whereas blue are ones that we will need to partly design and certainly publish and make accessible ourselves. So there are a number of bodies we can fall back on, but it still requires a lot of synthesis work as well, as I will explain next. I'm going to show you five examples of different vocabularies that we've been working on. First off, our poster child is the analytical methods. This is work compiled by Steve Richard, and this slide really just shows an example of the many different words used to describe one particular technique in geochemistry, which is laser ablation inductively coupled plasma mass spectrometry, LA-ICP-MS; all of these different terms are basically the same method. What Steve has done is aggregate all of this information and put it into the SKOS format, which is really handy for our purposes in this case, because you can see we've got information coming from four different data system sources. We've obviously got a definition of what we mean by this term, and what the standard notation is, the LA-ICP-MS; we've got an accepted name label, what we want to call it; but we can also preserve all of these other names, so we can preserve the information that's come from the literature compilation of our databases. This is great: it was published on RVA in June earlier this year, and there's a link up the top in the QR code if you're interested. And here's just the same LA-ICP-MS example again, now both human and machine readable.
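A toy version of what such a SKOS record can look like, with the accepted name kept as skos:prefLabel and the variant spellings preserved as skos:altLabel; the IRI, labels and definition are invented to show the pattern, not the record published on RVA.

```python
from rdflib import Graph

# Invented IRI and labels, modelled on the SKOS pattern described in the talk.
TTL = """
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<https://example.org/method/la-icp-ms>
    a skos:Concept ;
    skos:notation "LA-ICP-MS" ;
    skos:prefLabel "laser ablation inductively coupled plasma mass spectrometry"@en ;
    skos:altLabel "LA-ICPMS"@en, "LA ICP-MS"@en, "laser ablation ICPMS"@en ;
    skos:definition "Mass spectrometry of material ablated from a solid sample by a laser and ionised in an inductively coupled plasma."@en .
"""

graph = Graph()
graph.parse(data=TTL, format="turtle")
print(len(graph), "triples")  # the human-readable record is machine readable too
```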
Example number two is minerals. In this case we do actually have an external authoritative body: there's the gold standard of the International Mineralogical Association, the IMA, who publish an approved list of minerals. This is amazing, but unfortunately it's not quite enough for our purposes. We have a lot of legacy data that isn't covered in this approved list of minerals, and often people are really sort of sloppy when they describe what they've analysed, but also sometimes you just can't be as precise as this IMA list would like you to be; and maybe most important of all, the IMA list is a PDF. So this is where Mindat comes in, which is predominantly a database of minerals, and they've had the OpenMindat project: through that, they've made all of their information accessible through an API, including the IMA list of minerals. So we can now harvest those IMA terms and supplement them with other regular mineral names from the Mindat database, and then the plan is to publish our own implementation profile of mineral names on RVA. Example number three is lithologies and rock names, and this really is very similar to the mineral story. There is an authoritative body, the International Union of Geological Sciences; they've got several subcommittees that deal with rock names and their definitions, but it's really hard to get a comprehensive list or description of these rock names from the IUGS. There are a few machine-readable representations or compilations done by some of the surveys, the BGS leading amongst those, but again they're not quite comprehensive, and it looks like we're having to fall back on Mindat yet again, which conveniently compiles not only mineral names but also rock names and their hierarchies, so it's looking like Mindat will be our source of choice again. Now, the fourth example is sample description, and these are two examples; there are many more vocabularies you can use to describe samples, I'm just showing the material and the sampling techniques here. This is an example where we really were starting from our compilations within the database, as you can see here, and in the source column I've summarised whether it comes from GEOROC or from the PetDB database. These are words used by the community in their papers, so we're synthesising that, trying to harmonise it between our systems, and then mapping to more authoritative bodies like the iSamples project and CSIRO, which is an IGSN registration agent, to come up with a community vocabulary that again will be published as our implementation profile on RVA. The final example takes you back to those maps from the beginning and the different colours on them: one of our vocabularies is the geological settings, distinct rock-forming settings in a way. Historically PetDB and GEOROC dealt with different geological settings, which of course meant that our individual lists of geological settings were not compatible. So what we've now done is we've taken each of our lists, the first two columns, and come up with a joint list of 22 terms that we are both very happy to map our existing concepts to, and this is something that we will also publish on RVA in this new joint format.
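A sketch of how two legacy lists can be mapped onto an agreed joint list with SKOS mapping properties: the namespaces and the term are made up, and the choice between skos:exactMatch and skos:closeMatch would depend on how closely each legacy concept corresponds to the joint one.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import SKOS

# Hypothetical namespaces for the two legacy lists and the joint list.
GEOROC = Namespace("https://example.org/georoc/setting/")
PETDB = Namespace("https://example.org/petdb/setting/")
JOINT = Namespace("https://example.org/joint/setting/")

g = Graph()
joint_term = JOINT["ocean-island"]

# Each system keeps its historical concept but declares the agreed equivalent.
g.add((GEOROC["oceanic-island"], SKOS.exactMatch, joint_term))
g.add((PETDB["ocean-island-basalt-setting"], SKOS.closeMatch, joint_term))

print(g.serialize(format="turtle"))
```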
So I just wanted to finish on this slide from Kerstin Lehnert, based on an idea from Lesley Wyborn, on this tiered approach to vocabularies. Really, I think most of us are still in this bottom tier: we've got local vocabularies, and if we're good we will share them, we'll put a PDF on our website. What I've been talking about now is the community level, where we've got three big data systems working together to come up with a common set of vocabularies, and that's already really hard to do, but hopefully it's the first step in eventually getting to a global set of vocabularies for geochemistry and other science disciplines. So with that, thank you very much for your attention, and I look forward to your questions, either in the chat now, or you can email me or anyone at the other three data systems that I've talked about. Thank you.

All right, thanks. My name is Masut Rahimi, I'm a senior data scientist from AURIN, the Australian Urban Research Infrastructure Network. In this presentation I reflect on AURIN's journey to facilitate semantic interoperability and support the creation, management, integration and use of data and metadata in urban digital twins (UDTs). Before everything, I start this presentation by acknowledging the traditional owners of the land on which this event is taking place, and I pay my respects to elders past, present and emerging. Let's have a brief introduction to AURIN first. AURIN is a national research infrastructure initiative that aims to provide researchers and decision makers with access to urban data and tools to support evidence-based decision making. Established in 2010 and funded by NCRIS to provide nationwide digital research infrastructure, we have served more than 24K users, providing access to 5K datasets focusing on urban systems and regional centres. Over the next five years AURIN's focus will be mainly on three key challenges for Australian cities and regional centres: demographic transformation, energy transition and climate change. To address those challenges AURIN will bring together more granular, hard-to-get datasets from both the public and private sectors in a secure digital system, and this is actually what AURIN is known for. We call this piece AURIN's Urban Data as a Service, which is technically a service aimed at generating, curating and sharing hard-to-get data for urban and regional research and planning. But this is not all we do: we're currently exploring opportunities for designing, developing and maintaining a modular, scalable service for analytics. Moreover, building on top of Urban Data as a Service and Urban Analytics as a Service, AURIN has been working on establishing what we call foundations for Australian urban digital twins, which generally aims to, first, lower the transaction costs associated with building UDTs for both private and public sectors, and second, support and facilitate the translation of research outputs into this space. In line with this, AURIN has been following a roadmap, and I'll put more details on it in this presentation; for each step I briefly talk about what it is and the main challenges and gaps that we have identified. First, I draw on the extensive experience from the past national initiatives that AURIN has been part of. This includes Liveable Cities Digital Twins
(LCDT), other pie, iris, odds and health, ATRC, AHDAP, among others. For example, from LCDT we learned that urban digital twins are in great need of connectedness and linkage of data and tools, and that a lack or weakness in semantic interoperability and standardisation is a real blocker. From iris we learned that even the simplest data linkage can be non-trivial, due to data inconsistency caused by evolving semantics over time. From other pie we learned that poor data accessibility, due to technical and non-technical reasons, is a real challenge limiting the opportunity for data harmonisation. From odds and health we learned that UDTs are mainly developed government-centric, and this leads to significant disconnectedness and makes lots of data and tool silos. Now I expand on AURIN's UDT testbed, which technically delves into the domain of urban digital twins, assessing the market landscape and prototyping innovative UDT capabilities. Some of the most important challenges identified during these initiatives were: first, we learned that the lack of proper semantic interoperability hinders discovering actionable knowledge, which is essential for UDTs. Second, we observed a significantly high transaction cost on the back end of UDTs; there have been important efforts to improve the front end of UDTs, but handling FAIRness and heterogeneous data and tools, having a proper service architecture, and future-proofing what we make are mainly overlooked, and there are significantly limited efforts on technological reusability and scalability. Many definitions, concepts and technologies on the back end of UDTs are emerging, and by the time you design a UDT they might already have drifted. And we also identified a broad range of available solutions, which can be really confusing and needs to be embraced and acknowledged. As part of our commitment to overcoming the challenges identified in the national initiatives and on the testbed side, we then moved to the next step, which was the semantic interoperability challenge, technically designed to explore the complexities associated with implementing semantic interoperability in our current, existing infrastructure. There were two important challenges that we identified in this work. First, we observed that progress in making different UDT systems work together is mostly slowed down because there isn't a real, clear financial benefit for it: creating a system that can easily work with other systems requires lots of resources, integrated technologies and setting lots of standards, and if the people involved don't see a good financial reason to invest, they are less likely to do it. Second, we learned that heterogeneity is multifaceted: it can occur at different levels, the semantic level, the syntactic and structural level, and the spatial-temporal level, and this can lead to significant complexity in handling interoperability. Finally, AURIN has recently commenced its UDT community, and we have been driving collaboration as a central role by participating in important conferences and discussions such as this one here in Australia. Our collaboration efforts also involve working closely with some important partners from research, industry and government, such as OGC, CSIRO Data61, Woolpert, Kurrawong and RMIT University. Some of the most important challenges that we identified are the fact that interoperability is really complex and multi-dimensional, and such a
multi-dimensional problem needs a multifaceted solution. We need to acknowledge that making different systems from different domains work together may not even be feasible in many cases, and there may never be a case of perfect FAIRness. Third, there is significant resistance in various communities to evaluating the FAIRness of their data, probably a result of concern about the complexity involved in the area, uncertainties, or probably not knowing an easy tool or standard methodology for such assessment. And finally, efforts in semantic interoperability communities have been mostly government-centred rather than human-centred, which actually limits widespread adoption and the realisation of their benefits across different sectors and communities. With that opening, I can now discuss some of the lessons learned by AURIN and its partners. First, we need to acknowledge that interoperability has various aspects: it has legal aspects (licences, agreements, regulations), it has organisational aspects (credentials, authorisation), it has semantic aspects (all the conceptual and logical models), and technical ones (API specifications, schemas). All of this highlights the need for collaborative effort, with people from different domains and different expertise coming together. Also, we need to acknowledge that we need a sort of model for the value proposition; this is essential, and we need to ensure that interoperability has a clear gain for those who create it. This can easily be translated into the current demand in the UDT market, which is a great opportunity to actually bring that sort of potential into the semantic interoperability space and bring those benefits across different sectors and communities. Also, whatever we do on UDTs should move from government-centred (yep, two minutes; sure), whatever we do on UDTs should be moved from that government-centric perspective to a more human-centric perspective, one that ensures our solutions prioritise the needs, experiences and wellbeing of citizens as the main customers of the city. We also need a sort of procedure for constant evaluation of FAIRness: we know that FAIRness can change over time, and we need to be ready for that and constantly evaluate it; and we need to acknowledge that there will never be a case of perfect FAIRness, so what's important here is that even a small effort, like five minutes' work, can help make our data more FAIR. We need to acknowledge that there is diversity between different standards, and we need to strike a balance between how much of a standard we want to enforce and how much we don't, and we should understand the landscape and map all of this together; this also applies to tools and all the other solutions. A controlled vocabulary depends on the development of user communities, so it is very important to have that sort of community. And we need to really go beyond spatial data to get to the point where UDTs can deliver actionable outcomes, and this also highlights the need for semantic interoperability. So we believe that semantic interoperability is the core enabler for UDTs, since it's the piece of the puzzle that brings Urban Data as a Service and Urban Analytics as a Service together. We recently secured a budget of $25 million and there is more funding coming our way, so based on this, AURIN aims to bridge these gaps, challenges and opportunities by being an early adopter of best practices, bringing researchers,
industries and governments together, hosting that sort of conversation, aligning the efforts and supporting innovative solutions. If you're interested in what AURIN is doing on the UDT side and on semantic interoperability and would like to hear more about us, please scan this QR code, which is also available on the next slide, and fill in the form and we will get back to you. Yep, that's pretty much it. Thank you very much for the opportunity.

Thank you very much for the opportunity to present, and it's also a wonderful segue after the first one in terms of exploring techniques for addressing the human-centred aspects of FAIR practices and CARE practices. In this presentation we want to talk about some of the lessons we've learned from our work in a number of different communities trying to build more careful vocabularies. Before we begin, I would like, on behalf of Ruth and myself, to pay our respects to the traditional owners and custodians of the lands on which all of us are walking, working and living. I'm coming to you today from Gai-mariagal land; Ruth is close by; and we'd also extend that respect to all First Nations people who are listening. (Yeah, Cammeraygal land here.) Thank you, Ruth. A big part of what we're going to talk about can be summarised in two statements: one, naming matters, and two, data matters. I put up this quote from Mary Ellen Capek; this is a statement that came out of her work in the 1980s about the power of language and how, as an intimate and political activity, naming actually shapes and defines our institutions and our structures. I'm familiar with Mary Ellen's work because when I was training librarians many decades ago we were working with transformations of the Library of Congress subject headings, so these are not new phenomena, but they are acutely critical, I would argue, we would argue, right now, because of the increased use of data technologies in AI and big data contexts, where there are risks of magnifying the limitations of language or inappropriate or biased approaches. What we are talking about here builds on the CARE principles that come out of the Global Indigenous Data Alliance, and I'm very fortunate to be collaborating with some people as part of the WorldFAIR project, looking specifically at global health and the different ways that FAIR and CARE messaging and training can inform the work of data professionals in that domain. Ruth and I have been collaborating and connecting on FAIR and CARE principles in relation to the work we have been doing with the public service, looking at data collection and data analysis in local and state government contexts. So really what we are doing is taking these CARE principles and seeing them as serving not just as guides for dealing with Indigenous data but as valuable guides for a more inclusive and participatory practice across all communities. That's what we want to talk to you about briefly today, and it's really about how you put this into practice: if you seek to be FAIR and CARE, how do you create a culture of care? In our experience, what we have found to be really important, in the first instance, is to make sure that you are paying attention not just to the technical considerations, which so often demand our attention, but really giving time and space to the social and the cultural as well; and it's one thing to say that, it's quite another to make sure that they are respected and they are given the time and the
space. It's also important to recognise that the shaping of good practice and responsible practice, as we so often do through standards and frameworks, is not a set-and-forget exercise but rather an ongoing and ever-evolving process, and that dynamic is also important to address and brings its own challenges. So in our ways of working, what we've sought to do is apply a socially sensitive approach that, in the first instance, looks to make sure that we are making two-way streets that allow for authentic feedback and co-design: not only making sure that we are careful in the way that we are taking data or creating the ways of categorising and classifying, but also making sure that we give that data back and set up the two-way street to allow for that genuine engagement. Then you're looking to make sure that you are designing and maintaining these aspects in a very inclusive as well as actionable way, and before you even go in, making sure that you have permission to apply the language and the measures from another context. That is very much about putting into practice a human-centred approach, and it's important because it helps to establish the legitimacy of what you are doing, making sure that you can do the work that you're seeking to do in the first place. But there's another element to that which is really important if we are going to put CARE principles into practice, and that is a metacognitive one: it is about helping us to be mindful of our own practices, our own assumptions, our own skills, and this is really essential if you're going to seek to address unconscious bias, for instance. I like using this image that I took some years ago, because the way that we learn to look beneath the surface is a nice metaphor for thinking about ways of making the invisible visible, and trying to get a sense of what can be made visible and what may not be able to be made visible. If you were walking out to a reef, it would take you a while; if you stood very still, allowed the water around you to get still, allowed your eyes to adjust, you would start to see more deeply, you would start to understand and have a more intimate appreciation. This goes back to the CARE principles and, if you like, the lessons that have come from Indigenous data sovereignty communities, or those who are working in that space, Indigenous technologists who talk about Indigenous practices and ways of connecting and making time and making space. That seems a really powerful way of allowing us to be in a position to capture and locate language appropriately and to deal with its dynamics. So, putting that together, we reference quite a lot this idea of making the invisible visible and the power of enriching the process through a number of different techniques to really bring that home. One of the first things to keep in mind is taking the time to think and to link. This would mean appreciating the context; it means making time to think on your own about what it is that you are bringing, what your strengths are as well as your potential blind spots and limitations; thinking about ways that you can meet communities on their own ground, and then making the time for that meeting, not sticking to your own timetable but rather responding to the needs of the community and their practices; and when you're dealing with multiple communities, this magnifies those challenges further. Then it's important to think about your naming practices, to think about the different ways of mapping contexts
and, again, to allow for the flow and dynamic unfolding of those situations. Through that process you start to develop the skills for co-creation, which is really important for accommodating those multiple interpretations, and by developing the practice, the habit, of making the invisible visible, you also learn to show your work, which then helps build up the evidence that you need to help yourself and others understand the decisions that you've been making. So I'm going to turn it over to Ruth.

Thank you very much; that was a really good introduction to some of the things that we've been playing around with over the last few years, and metaphor is a very important one. One of the things we're really looking at is how we can change skill sets to address some of the issues around evolving semantics and semantic interoperability, and some of the concepts that Theresa was also talking about in terms of bringing the CARE principles into the process as well. Metaphor is a very good way of doing that: it helps take people out of the context that they were actually working in, where they have heavily value-laden interpretations of some of the terminology that we're using; it allows us to meet on common ground and have a shared meaning through some other shared experience; and it helps people reinterpret, through that sort of externalisation process, some of the concepts they work with every day in data. We've both worked a lot with metaphor, data as water in one case; I'm going to talk about data as food. Indeed, Theresa in her previous life also launched a little program that was very much inspired by this excellent quote at the top of this slide: "raw data is both an oxymoron and a bad idea; to the contrary, data should be cooked with care". The meaning is fairly self-evident, and she introduced an exercise to her master's students at UTS using the MasterChef mystery box challenge. (Just two minutes? Two minutes, okay, I'll move very quickly.) So one of the things that we've been doing through looking at metaphor is coming up with ideas for allowing people to think about what they're doing in their day-to-day work with data and the import of that, in order to really stretch and build their understanding of day-to-day activities with data, data governance and co-creation, and to really try to understand how they sit within that and how they can improve their practices and think about them differently. One of them is using the responsible service of data concept: as any Australian would know, we have a responsible service of alcohol requirement that anybody working in licensed premises has to complete before they're allowed to serve alcohol, and we're very much borrowing from that and feeling that the same is needed for data as well. We've also been playing around with developing experiential, integrated learning experiences using that metaphor, ones that people can actually streamline into the work that they're doing, so that they go through the usual practices: planning your meal, which is obviously doing your project management and getting your plans in place, and looking at your requirements and who your users are. Taking you out of your "I know this space, I'm a data practitioner" frame and into these other metaphors means you can think in a more holistic and different way about what you're doing, and context is everything, so as you'll see in the last of these, who are you cooking for: thinking about your
data subject, who is going to benefit and how; really thinking about the words that you're using, the vocabulary that you're employing, and whether you're meeting the same vocabulary shared by that community. And again you can take that same concept through, and we do here, to the semantics and meaning of data products and data outcomes: how we can evaluate them and whether we're on the same terms as the data subjects and the rest of the community that we're working with. The last part of that concept, and this is something we're still playing with and experimenting with, is the final piece of context, which is very important here: obviously the person, the practitioner themselves. How are they interpreting some of these concepts? What are they contributing in the way of semantics and definitions? What are the great superpowers that they might be bringing to this, which actually give them some advantages, but also the things that we're calling kryptonites, that may be slowing them down, that may be changing the way in which they interpret the information, and what have you. We're also taking people through that reflective process to learn new habits of showing your work, which is what Theresa was talking about earlier: being able to actually show your thinking process through all of these steps and to be able to justify your decisions as you go through. And then we're also playing around with getting expert feedback on those, to see how that will improve people's speed and quality of learning of new data practices. I think I'm more or less out of time, so if anyone wants to follow up with either of us, our details will be in the slide deck, and we're certainly very happy for you to reach out; there are quite a few resources we could point you to along the lines that we've been thinking.

First of all, I'm talking to you today from the land of the Boonwurrung people of the Kulin Nation, and I'd like to express my respects to their elders past and present. I'd also like to apologise in advance: I was hit yesterday by a rather nasty upper respiratory tract infection, and I hope I won't dissolve into fits of coughing at some point during my presentation. The Language Data Commons of Australia project has been developing a metadata vocabulary for our work. We started mainly on the basis of work done about 20 years ago in the Open Language Archives Community (OLAC), which, as the name makes fairly obvious, is part of the broader open archives community, so the work they did started from a Dublin Core base, although when you look at what we're doing it's probably not so easy to pick that up. But there was also input from other kinds of projects; in particular, back then 20 years ago, there was a project on electronic metadata for endangered language documentation. The people involved in that work tended to come from particular subfields of linguistics, descriptive linguistics and language documentation, and they were thinking primarily about the vocabulary they needed to describe languages other than English, languages spoken in small communities and so forth. And that means there are some places where the vocabulary that they developed has some quite obvious holes in it. For example, they have terms to describe various kinds of linguistic genre, which include things like formulaic language and ludic language, but when we were trying to describe some government documents we realised there was no term there to just talk about documents, pieces of
language which conveyed information, and we had to add that kind of term. So we worked on that basis: we've started from what was there, and we've added classes and properties as terms as we need them. Our first preference is to look to things like schema.org; if there's nothing suitable there, we'll look for other linkable sites; but some things inevitably are ones we devise ourselves. The vocabulary that we're working with is available, initially at least, in two formats: there's a JSON-LD version, which is the machine-readable version, and there's documentation that's automatically generated from that, and I'll show you what they look like in a second; both of those are available from a GitHub repo. So this is a little bit of the JSON-LD version, and that's the corresponding text version that's generated from it. But we have to ask ourselves the question: is this actually enough, these versions of what we're doing? If we can persuade people to use the vocabulary, there are obvious benefits there for data managers, for developers, people working on the more technical side, but we hope that it should also have benefits for our users, and the benefit that we should get for our users is to make the data FAIR, or more FAIR. If people understand how we're using the vocabulary, we think we can improve things in two areas: first of all, we can help people find data efficiently, and secondly, we can help people describe their data so that other people can find it more efficiently. But, as I said, this is work going back over 20 years, and the uptake of vocabularies for describing language data has been pretty poor. I'm a linguist by background, and I'm probably as guilty as the next person in this regard, but the history of these kinds of endeavours in our discipline has been fairly dispiriting. There have been a range of different schemes proposed at different times: there's the OLAC scheme we've been working with, and there were other schemes, the General Ontology for Linguistic Description and the Component Metadata Initiative, and none of these has been extensively adopted. I'll give you a specific example of a situation where there were very obvious advantages to adopting a particular vocabulary and it still didn't really happen. I can mention something called the Leipzig Glossing Rules, which is a proposal from the Max Planck Institute in Leipzig about how linguists could provide glosses for grammatical categories when they were presenting data, so this is things like number and tense. The Leipzig people proposed around 200 items that could be included in the list, and in the end general usage is restricted to a very small subset: people are happy to use the number 1 for first person, the number 2 for second person, the abbreviation SG for singular, PL for plural, but beyond that it falls away very quickly, because people say, well, my past tense is not exactly the same as your past tense, and therefore we don't want to use the same label. As I said, there were obvious advantages in adopting this kind of proposal; apart from anything else, it would have meant that people basically did not have to prepare abbreviation lists for their publications, which I think would be a benefit, but it still didn't get picked up. So we have decided that we should try and find some other way to improve uptake of what we are doing, and the way we are trying to do that is to provide different resources. We use terms from the vocabulary for displaying records in our data portal, so the first problem I mentioned is very relevant: are the people
who are looking at the portal understanding how those terms are being used? We also provide advice to people on how they should go about collecting language data, and then the second problem is relevant, because obviously we recommend people use our vocabulary so that other people are going to be able to find, access and use that data. So we also have to ask ourselves: do the people collecting data understand the terms we are using well enough to apply them? In order to try and meet these needs, we are, as I said, creating another resource: we are documenting our vocabulary in much greater detail than the automatically generated documentation, and we are doing this using GitBook. The reasons we chose that medium are, firstly, that it has a low cost in terms of our effort; it gives us a presentation style and a layout that is accessible and familiar for at least most of our users, it looks like a book; as I said, it's relatively low cost for us, the editing interface and the procedures are straightforward, we don't have to start from scratch, there's a lot of stuff there for free, and you end up with a nice product. Also, as much as one can make predictions about these things, the delivery platform seems reasonably stable; we can hope that it will be there for quite a while without us having to put a lot of maintenance effort into it, but if there were any problems it's easy to export what we have there into PDF or Markdown, and then we could reuse it in another context. Now I'll show you a couple of examples of the kinds of additional content that we are providing: specifically, one is explanation, one is usage examples, and one is how we refer to things in the literature. Here's an example where we've given a fairly detailed explanation. I should emphasise, please do go and look at the book site, but it's work in progress; I'm showing you some of the better developed examples, and there are a lot of pages which don't have this literature information yet, we're still working on it. So here we have quite a detailed explanation of one term and how we apply it. OLAC, on which it was based, comments that annotation is structured linguistic information aligned to some extent of another linguistic record. We've given a slightly more general explanation, as our base resource includes material which adds information to some other linguistic record, which is very general, and then we explain this in a little bit more detail: we talk about what OLAC might have meant by alignment, we talk about some of the tools that people use to produce alignment, some of the kinds of formats that result from that, different kinds of annotation and different kinds of annotation documents that result, and we give a little bit more discussion of an example familiar to a lot of linguists, which is transcription. What's the relationship between a recording or a video of people interacting and some kind of extra information provided as an annotation? We're suggesting that transcription, which just kind of records the words or whatever aspect you're going to record, is already an annotation. (One and a half minutes to go.) So that's the kind of level of explanation we like to provide. Here's an example where we go into some detail about usage: we make a distinction between subject language and in-language, that is, between the language which is the actual medium that a resource is using to communicate with people, the in-language, and the subject language, which is what's being talked about, and these can be different. Here we give an example of how this would apply in the case of a work about the Italian language as used in Australia but which was written in English: its in-language value is English, but its subject language value is Italian.
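A minimal sketch of how that distinction might be carried in a JSON-LD record, pairing schema.org's inLanguage with a project-specific subject-language term; the ldac: prefix and property name here are stand-ins for illustration, not the project's actual context.

```python
import json

# Illustrative record for the example just given: a work about Italian as used
# in Australia, written in English. The ldac: prefix is a hypothetical stand-in.
record = {
    "@context": {
        "@vocab": "https://schema.org/",
        "ldac": "https://example.org/ldac-terms/",
    },
    "@id": "https://example.org/item/italian-in-australia",
    "@type": "CreativeWork",
    "name": "Italian as used in Australia",
    "inLanguage": "en",            # the medium of the resource
    "ldac:subjectLanguage": "it",  # the language the resource is about
}

print(json.dumps(record, indent=2))
```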
And then we also, as I said, aim to give people references to additional literature. You may not be familiar with the concept of whistled language, but this does exist in various places around the world. This is a place where we haven't yet had a chance to add an explanation or usage notes, but we do at least have a reference where you can go and find out more about this slightly strange phenomenon. So those are the kinds of information we're trying to give people to make our metadata vocabulary more accessible and encourage them to use it. We want to maximise the benefit we get from this vocabulary, and as I've just said, to achieve that we hope to make it as accessible as possible to the people who are going to use our products and our services. To make this possible, to make the vocabulary accessible, we're developing this rich documentation with explanations, examples and references to the literature, and GitBook seems like a good way to do this, because it's so easy to edit and create content, and because for a low cost we're getting pretty good production values and, as I said earlier, a format which is familiar and therefore accessible to our users, who are not necessarily the kind of people who jump on GitHub on a daily basis to look for information. Thank you very much.

To the organisers of the conference, that is the Australian Research Data Commons: thank you for welcoming me to this vocabulary symposium. I am going to speak just briefly, from a socio-economic point of view, on the essence of indigenous data custodianship and governance, looking at frozen pathways and unknown civilisations and how indigenous data is significant. In our discussion we are defining indigenous data, what it is. This is critical information that we have in any nation, in any given community; even within a family you can trace this critical indigenous data. It is information that is critical for the existence of that society, and it is the information on which processes are managed. It can be categorised into the social sciences, biological sciences, natural sciences and engineering sciences, depending on how development frontiers are changing within a community and also depending on the level of development of that community. Indigenous governance speaks to the way this data is managed, and to how all these other groups in the circles that are indicated, the social, biological and engineering sciences, are contributing, so we have correspondingly those categories. I think we in our country, in Zimbabwe, in southern Africa, are still in the process of coming up with this area of constructing this critical indigenous data, which is culturally founded and historically founded, regardless of the classes that I have indicated. The other issue that we may want to understand here is why custodianship or governance of data matters: it is very critical for the identity of an individual, for the identity of a nation, of a country, and for defining the levels of scientific progress that a country is registering. That's one level, the country level; but again, to the extent that a country or a community is part of the global community, governance of indigenous data is also very critical for exchanging value with other communities across the world. In other words, it is critical for bridging the gaps that may exist
within the socio-economic cycle and the biological cycle. Now we have space technology, we have nuclear technologies also, and we have the challenges of climate change, which, as we all know, need the sharing of this knowledge. So indigenous knowledge, if it is not well developed, sometimes gets a kind of compensatory dimension from the international dimension we are talking about: it needs to have not only a national but even an international dimension, where common humanitarian issues will be discussed, and therefore the need for that indigenous knowledge to also have the international dimension. What are the major characteristics of this indigenous data? It should not be limited to the community or the nation itself, because it becomes very misleading in that case: you sort of divorce yourself from other processes from which humanity is already benefiting across the world. There are aspects of indigenous data, characteristics, that are quite critical for it to remain relevant to the community. The first is limitation: it must not be subject to limitation, in other words limited to only a small group of people or limited only to the nation; it may be inward looking, but it must also have an outward orientation, so that the community, the nation, is part of the global community. The other issue that is very critical is that there must be balance between what is closed and what is open data within a community or within a nation: for how long will that data which is specific to a community or a nation be closed, when will it be open, and under what circumstances? All those dimensions are needed, but you also, like I said, want to identify yourself within a progressive world that is moving on, so the dimension of sharing is always very critical. The other issue that we have discussed in our paper submission to the Australian Research Data Commons is that one of the critical characteristics of indigenous knowledge, indigenous data, is that it must not have the limitation aspect, it should be open ended; but it is also very critical to know which data, from which sector, which area, is to be shared and which is not to be shared, and there must be a time frame also on how long it should not be shared and when it can be opened to the public. The other issue that is quite critical for indigenous data is its commercialisation. We have seen nations, actually during the time of COVID, exchanging this data; the drugs that are made are part of the dialogue that is going on around indigenous data, when that data is availed. We also have a lot of advances in other sciences going on, on climate change, space technology, nuclear technologies: it's all to do with indigenous data and the critical information that must be secured or kept through these governance mechanisms we are talking about. However, in conclusion, ladies and gentlemen, allow me just to say, whatever the case, indigenous data should be meant in the final analysis for social progress, for the common human good. It must be data that is meant for peace, for change and transformation of communities, to make this a world worth living in by all. Thank you very much, Merry Christmas and a prosperous, blessed new year for 2024. Thank you.

Hello everyone, good morning. I am Janet Ba and I work at GESIS, the Leibniz Institute for the Social Sciences. Thank you for having me here today to present in the vocabulary symposium. I'm going to present our work on enhancing comparability:
a controlled vocabulary for relations between social science survey variables across waves and studies. So here are the authors: me, Janet Ba, and Peter Murdersky; the three of us work at GESIS. Our agenda will cover the main points: the context and motivations, the variable relations and the controlled vocabulary proposed; we also address the topic of the knowledge graph, discuss some open questions, and provide some next steps. Regarding context, motivations and goals: this controlled vocabulary was developed in the context of a PID registration service for variables. This is a service deliverable for KonsortSWD, a consortium funded by the German National Research Data Infrastructure (NFDI), and our task area will deliver a service to assign PIDs, persistent identifiers, at a more fine-grained level, such as dataset elements, in which our first approach is a variable within a dataset. Our first motivation is the fact that in social science surveys there is a dynamic relationship among studies, units and survey instruments, especially considering entities like questionnaires, variables, questions and response formats evolving across waves and studies. The second motivation is that the variable and dataset relations enable comparability across waves of a given study. For example, we are considering here a dataset element, a variable, and this variable is within a dataset in a rectangular or tabular format. From wave 1 to wave 2 this given variable could have many changes: for example, the same variable was reused with a different name, or a different label (which is the reference of the variable), or a different question wording, which is very common, especially to address differences in the response scheme as well. So you can have a variable with a response schema of, for example, a 5-point Likert scale, and then in the next wave you need to extend it to a 7-point Likert scale. This is very important from the researcher's standpoint when they are making decisions on how and which data to reuse. Another motivation is that the current relation type descriptions do not represent this complexity of variables in the social sciences. Although current frameworks like DDI or DataCite try to model these relationships, they fall short of addressing all of this complexity in the social sciences. For example, here we have the DDI controlled vocabulary for commonality types, and it tries to provide codes and terms to describe these relationships; there are three codes available: identical, some and none. Let's take a look at a code such as "some", which is not enough to disambiguate these relations, because, when you see the description, it uses "some" when compared items have similar but not identical content. For variables, for example, some of the elements of the variable description (name, label, question, category codes and so on) are indicated to be different, but not exactly which of those elements differ, so this is not enough to disambiguate these relations. This is why we are coming up with these goals: to describe relations between variables across waves, first within the same study; then to describe connections across variables from different studies, because there are many variables that are connected across studies; to describe relation types for better
semantics across and between variables in the social sciences; to find relations inherited within the DDI structure among all other possible entities, as I will show later; and to store these relations across variables within the metadata that are registered when a PID is assigned at the variable level. To fulfil these goals we are describing these variable relations, and we start by illustrating the elements. We start with a study, and this study has some waves, so the same survey or instrument is used repeatedly across years to get new data; all of these instruments have a bunch of questions, and each question has a response scheme or response scale, which can change from one wave to another. We are proposing this controlled vocabulary to explain these relations across waves and studies. To provide these descriptions we published our controlled vocabulary in the CESSDA Vocabulary Service; it's called the controlled vocabulary for variable relations for social science research data. Here is the link, and you can also find it in the description of this presentation. We provide a brief textual identification of the relation type, supported by a controlled vocabulary with an extended description of the relationship: for instance, the connections for variable versions, derivations, different labels, different question wording, everything we have talked about so far. It's also important to highlight some assumptions in our project requirements: we are aware of the DDI modelling of the variable cascade, but we are now focused only on variables within datasets, the instance variables, because we only get variable metadata for registration of the PID, so no other entity is being registered. Now we will give a brief overview of variable relations for a knowledge graph. A knowledge graph holds descriptions of entities and their interrelations; it's organised as a graph; it's built on established standards such as W3C RDF; it makes entities and their interrelations machine interpretable; and of course it uses persistent identifiers. Since these variables now have persistent identifiers, it's feasible to use them to build a knowledge graph for the social sciences. In terms of visual representations of these knowledge graphs, we have here the first example, which is the variable name. We have seen this example before, but now in terms of a knowledge graph: we have the study programme, then we have the waves 1 and 2, each wave has a different dataset, and each variable resides in its dataset; and here the predicate, the relation between these two variables, is "different name", and we can express this difference here: the first variable's name used to be 8, and now the second variable, which is being registered, has an update in this element, in this attribute. The second example is the question wording, which is also very common: here the question wording was changed, in this case in Germany, to comply with language requirements in terms of using or not using special characters. The third one is the response schema: again I have two waves, two datasets, and two variables, because each variable resides in a different dataset, and here questions 1 and 2 can be the same or can be different, because I'm talking about the same variable, but the second variable extended the response scale from the 5-point Likert scale used in the first variable to a 7-point Likert scale.
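A small sketch of the kind of triples such a knowledge graph might hold, with an invented relation predicate standing in for the published controlled vocabulary term and invented IRIs in place of the registered variable PIDs.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

# All IRIs are invented for illustration; the real service would use
# registered variable PIDs and the published relation vocabulary.
EX = Namespace("https://example.org/kg/")
REL = Namespace("https://example.org/vocab/variable-relations/")

g = Graph()
v1 = EX["wave1/dataset/var-8"]
v2 = EX["wave2/dataset/var-income"]

g.add((v1, RDF.type, EX.Variable))
g.add((v2, RDF.type, EX.Variable))
g.add((v1, EX.name, Literal("8")))
g.add((v2, EX.name, Literal("income")))

# The wave-2 variable reuses the wave-1 variable under a different name.
g.add((v2, REL.hasDifferentNameThan, v1))

print(g.serialize(format="turtle"))
```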
So what are the advantages of this knowledge graph? Once we document and describe these variable relations, we can support search and browse functionality and enhance data reuse: through search and browse, researchers are able to find these connections in a visual manner (a small query sketch follows at the end of this summary). It allows comparison across data sets and facilitates the harmonisation process. Using this proposed controlled vocabulary we can create a semantically rich, common social science knowledge graph: across institutes we can all start using the same controlled vocabulary, in line with the FAIR principles. There is also work describing some automatic filters, and such a knowledge graph can also be used for data access in R.

We still have some open questions. The first is data set versions: we have variables that are packaged in different data products; they are the same variables, but packaged differently. We provide, for example, basic versions, extended versions, versions with aggregated data from another study, and data sets that are restricted because of privacy issues. In this case we would need a relation type to the data set itself, linking the variable PID to the data set DOI, but we are still wondering whether we would need different relation descriptions or relation names for that.

Question sub-levels are something else we encountered. A question can have many parts or levels: the question stem, a pre-text or statement that comes before the question itself, the question text itself, and question prompts, that is, instructions on how to answer, such as "consider to which level you agree with this statement". Do we need to break down the relations to these question sub-levels as well?

Survey mode is also an open question. Depending on the mode (paper-based, online, face-to-face interview) the variables may require modifications. For example, the question wording in a paper format may say "go to question 10", but if you use the same survey online this instruction is no longer necessary, because the online survey software jumps automatically to the next applicable question.

To summarise: we presented the motivation behind the relations between variables, we published our first version in the CESSDA controlled vocabulary manager tool, and we provided some examples based on GESIS data sets. Our next steps are to continue the discussion (please join me if you would like to collaborate or to provide feedback; we are eager to hear it), to validate and extend the controlled vocabulary in the CESSDA controlled vocabulary manager tool, to describe links not just across different waves but also across different studies and other entities, to foster reuse of this controlled vocabulary among social science institutions, and to build and publish our first knowledge graph based on GESIS variables.

Here you can find the main outcomes of our project, the registration service: the technical report, the use case, and the extended metadata schema report.
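As the query sketch mentioned above, here is one hedged example of how the search-and-browse idea could work over such a graph: a SPARQL query that finds every pair of variables whose response schema changed between waves. Again, the predicate and identifiers are placeholders, not the published terms.

```python
# Minimal sketch: querying a (hypothetical) variable-relations graph for variables
# whose response schema changed between waves. Predicate and identifiers are placeholders.
from rdflib import Graph, Namespace

EX = Namespace("https://example.org/variable-relations/")
VAR = Namespace("https://example.org/variables/")

g = Graph()
# The wave-2 variable extends the wave-1 response scale (e.g. Likert-5 -> Likert-7).
g.add((VAR["wave2/trust_parl"], EX.hasDifferentResponseSchemaThan, VAR["wave1/a8"]))

query = """
PREFIX ex: <https://example.org/variable-relations/>
SELECT ?newVariable ?oldVariable
WHERE { ?newVariable ex:hasDifferentResponseSchemaThan ?oldVariable . }
"""
for new_var, old_var in g.query(query):
    print(f"{new_var} changed its response schema relative to {old_var}")
```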
Thank you very much, this was my presentation, and thank you for your attention. I'm looking forward to your questions, and please don't hesitate to contact me if you have any doubts, feedback or questions. Thank you very much.