Okay. Hello everyone. Nicholas Carr from CSIRO here, up in Brisbane. Now, the first speaker is Sophie Darnell, and Sophie was a student of mine until recently. She comes from the University of Southern Queensland in Toowoomba and has previously worked at Virgin Australia and other places in large operational IT environments. In undertaking this project with us recently, she's been looking at doing things a little bit differently with vocabularies and so on. She was working with us as a student from about mid-year, June-July, until about now. And now she's taken on additional work through some of our projects, actually doing vocabularies for some CSIRO contracts. But here's Sophie, and you can see her title there. I'll leave it up to her to introduce the specifics of her project. Over to you, Sophie.
Alright, so everyone can see my screen, my title? Yep, awesome. This is my first involvement with this group, other than accessing some of the resources and information you've got online as I started to work on the project that I'm going to be discussing. It was really interesting for me, as I've previously had fairly limited experience working with this type of formal vocabulary framework. It also highlighted some of the issues with accessing and developing those existing vocabularies without pre-existing knowledge of the tools and the technical background.
So my project was to adapt the existing free-text keywords tagged against entries in CSIRO's Data Access Portal, the DAP, and develop a controlled vocabulary in SKOS format that could be used to prompt for suggested keywords when future uploads are made, to try and introduce a bit of control and uniformity. The template I was using prioritised the SKOS properties for prefLabel (preferred label), altLabel (alternative label), definition, source, and broader concepts linked to a top concept.
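To make the template fields concrete, here is a minimal sketch of how one row of such a table could be written out as a SKOS concept in Turtle. It is not the CSIRO Excel-to-registry tool mentioned in the talk. The namespace `http://example.org/dap-keywords/`, the dictionary field names, the use of `dcterms:source` for the source column, and the sample row are all assumptions made for illustration.

```python
# Sketch only: render one template row as a SKOS concept in Turtle.
# BASE, the row field names and dcterms:source are illustrative assumptions.

BASE = "http://example.org/dap-keywords/"  # hypothetical namespace

PREFIXES = (
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
    "@prefix dcterms: <http://purl.org/dc/terms/> .\n"
    f"@prefix kw: <{BASE}> .\n"
)


def slug(label):
    """Turn a label into a simple local name, e.g. 'Sea surface' -> 'sea-surface'.

    Simplification: assumes labels contain only letters, digits and spaces.
    """
    return "-".join(label.lower().split())


def quote(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def row_to_turtle(row):
    """Render one template row (a dict) as a Turtle description of a concept."""
    lines = [f"kw:{slug(row['prefLabel'])} a skos:Concept ;"]
    lines.append(f"    skos:prefLabel {quote(row['prefLabel'])}@en ;")
    for alt in row.get("altLabels", []):
        lines.append(f"    skos:altLabel {quote(alt)}@en ;")
    if row.get("definition"):
        lines.append(f"    skos:definition {quote(row['definition'])}@en ;")
    if row.get("source"):
        lines.append(f"    dcterms:source {quote(row['source'])} ;")
    if row.get("broader"):
        lines.append(f"    skos:broader kw:{slug(row['broader'])} ;")
    # Swap the trailing ' ;' of the last property for ' .' to close the statement.
    lines[-1] = lines[-1][:-2] + " ."
    return "\n".join(lines)


# Made-up example row, not taken from the DAP.
example_row = {
    "prefLabel": "mooring",
    "altLabels": ["mooring data"],
    "definition": "Illustrative definition text.",
    "source": "Illustrative source",
    "broader": "equipment",
}
print(PREFIXES)
print(row_to_turtle(example_row))
```

A real conversion would more likely go through an RDF library so that URIs and literals are validated; plain string formatting is used here only to keep the sketch dependency-free.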
I'm not sure how familiar everyone is with the different SKOS tags, but that's fairly null and void for the purposes of this. I intended to expand on these concepts as a simple table and use the Excel-to-Linked-Data-Registry tool developed within CSIRO to build a SKOS structure for the vocabulary.
Sophie, just one second. We can actually see, as far as I can tell, your non-presentation screen. So we're seeing the one that's got your notes and so on rather than just the slides. I can't see either. For those who can't see either, there must be a way in Zoom to click viewing the screen rather than viewing a list of people. What I can see is, yeah, you're there. That's the way. Okay. I'm still seeing it. Okay. Do you want me to give you a minute? No joy? Okay, just continue. That's fine.
My main aims were to identify solutions for addressing syntactical errors, as well as solutions that might address the lack of uniformity in syntax that had developed under the free-text approach. I was also interested in any existing synonyms or relationships I could identify that might allow like concepts to be linked. As much as possible, I'd hoped to identify existing vocabularies that included some of the properties from the SKOS framework and that were accessible in a way that could readily be used to fill out missing information in the DAP keyword vocabulary. Keywords were extracted with field of research and business unit filters in order to provide a broad idea of term context and use, given the very broad range of scientific disciplines involved as well as the potential intended uses of the keywords chosen. The ability to gain a better sense of context became really important in identifying sources for concepts. My initial priority was to understand how words would be used in the DAP and what issues might arise.
I established a general approach to clean up the existing list with regard to spelling errors and plurals versus singulars. I developed a rough overview of what contributed to both the overall list and a selected section that I took out. A simple visual review identified issues with spelling and syntax, as well as varied approaches to the level of specificity and usefulness. I'd hoped to identify common keywords for different business units as a way of targeting more used and, in theory, more useful terms. Unfortunately, of the 3,000 or so terms that were identified as unique, over 1,300 occurred only once in the DAP and about 649 were used only twice. The majority of keywords in use were therefore tagged only once or twice and could only really be considered low frequency. This made it difficult for me to identify a clear section of the keywords in current use that were, in theory, correct syntactically and considered helpful tags by multiple uploaders.
Of those keywords that were used with a high frequency, the same sets of keywords appeared to have been applied to every instance of regularly repeated types of datasets, with limited variability. Keyword selection there appeared to be part of a set standard for that type of dataset upload, rather than users regularly selecting a common term by what you might consider their own choice. This meant that frequency was not likely to provide a simple, low-tech approach to filtering for more valuable keywords. Some terms, such as pulsar, are in fact used too regularly to make them a helpful way of searching DAP results. Using this term in the DAP actually returned about 1,400 results, but without any other filtering or knowledge of the datasets returned and how else they might be tagged, it doesn't allow you to search on that term in a particularly helpful sense. It was decided that the final vocabulary should use the simplest singular version of each keyword.
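The frequency breakdown described above can be sketched with the standard library alone. This is an illustration of the idea, not the script used in the project; the function names and the sample list are made up, and the input is assumed to be one entry per keyword tagging.

```python
from collections import Counter


def frequency_profile(tagged_keywords):
    """Summarise how often each distinct keyword is used.

    tagged_keywords: one entry per (dataset, keyword) tagging.
    Returns (number of distinct keywords, {usage count: number of keywords}).
    """
    # Light normalisation so 'Pulsar' and 'pulsar ' count as the same keyword.
    usage = Counter(k.strip().lower() for k in tagged_keywords if k.strip())
    # Histogram of usage counts: how many keywords were used once, twice, ...
    histogram = Counter(usage.values())
    return len(usage), dict(histogram)


def low_frequency_share(tagged_keywords, threshold=2):
    """Fraction of distinct keywords used at most `threshold` times."""
    total, histogram = frequency_profile(tagged_keywords)
    if total == 0:
        return 0.0
    low = sum(n for count, n in histogram.items() if count <= threshold)
    return low / total


# Made-up sample, not DAP data.
sample = ["pulsar", "Pulsar", "pulsar", "mooring", "mooring", "kelp", "CTD"]
print(frequency_profile(sample))
print(low_frequency_share(sample))
```

With the figures quoted in the talk, roughly 1,300 plus 649 of about 3,000 unique terms, this share comes out at around 65 percent, which is why a simple frequency cut-off left so little to work with.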
As part of my initial investigation into available tools, I'd hoped to use the Natural Language Toolkit (NLTK) in Python to tokenise, lemmatise and check terms for syntax. I couldn't identify an approach similar to lemmatisation that would work better for phrases, and while in some cases lemmatising each keyword as separate tokens wasn't technically correct, I still thought that a script to tokenise and lemmatise each keyword would ensure they were at least formatted similarly to each other. This didn't work, due to the Natural Language Toolkit's inability to work with uncommon and scientific terms, as well as being fairly clumsy because I was trying to adapt it with limited Python skills. Applying general spell-checking tools to the DAP keywords wasn't helpful at all either: they corrected incorrectly more often than they found terms actually requiring correction.
We attempted to find an alternative method of identifying plurals and correcting them to a singular form. I found that approximately 670 of the keywords ended with an S. However, with keywords in a scientific domain, a lot of those were just species names. Similarly, I'd hoped it would be useful to identify synonyms and therefore reduce the total number of concepts. But again, I found that the NLTK tools were too general, and my knowledge was a bit too limited to adapt them or to identify any more complex tools. The link in the Natural Language Toolkit between a term such as ocean and those of water, lake, sea or beach appeared to be quite strong using the synonyms tool, but none of those terms could be considered a synonym for the purposes of a vocabulary such as this. As well, for some less common terms such as macroalgae, the tool didn't know how to match those with like terms at all.
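The problem with treating every keyword ending in S as a plural can be illustrated with a small heuristic. This is a sketch under stated assumptions, not the approach used in the project: the excluded endings (`ss`, `us`, `is`) and the idea of a hand-maintained protected list of taxonomic names are both assumptions, and the function only flags candidates for manual review rather than rewriting them.

```python
def plural_candidates(keywords, protected=()):
    """Split keywords into likely plurals and everything else.

    A keyword is a plural candidate if its last word ends in 's' but not
    'ss', 'us' or 'is' (endings common in singular and Latin-derived words),
    and it is not in the protected list (e.g. known species or genus names).
    """
    protected = {p.lower() for p in protected}
    candidates, others = [], []
    for kw in keywords:
        norm = kw.strip().lower()
        if not norm:
            continue  # skip empty or whitespace-only entries
        last = norm.split()[-1]
        looks_plural = last.endswith("s") and not last.endswith(("ss", "us", "is"))
        if looks_plural and norm not in protected:
            candidates.append(kw)
        else:
            others.append(kw)
    return candidates, others


# Made-up sample; 'Tursiops' and 'Thunnus' are genus names, not plurals.
sample = ["moorings", "prawns", "seagrass", "Thunnus", "Tursiops",
          "sea surface temperature"]
print(plural_candidates(sample))
print(plural_candidates(sample, protected=["Tursiops"]))
```

Without the protected list, a genus name such as Tursiops is flagged as a plural, and a word like "species" would be flagged too. That is the kind of false positive that makes a purely suffix-based clean-up unreliable for scientific keywords.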
There do appear to be numerous general-purpose tools available to a beginner such as myself, but there is a significant gap between the possible applications of the readily discoverable or open tools and those that require specialised knowledge or are part of a paid service. The terms that were readily discoverable using the Natural Language Toolkit were not specific or varied enough for the types of scientific terms in use in the DAP entries. I realised that more automated tools might not provide me with the solution I'd hoped for. So I identified a smaller section of the total keywords in use to focus on by selecting the business unit that had the largest total list of keywords, which was the Marine and Atmospheric Research group.
At this point, I moved from prioritising tools to address syntax to finding pre-existing vocabularies in similar domains to the terms in use in the DAP, in order to marry terms between them. To assist in identifying how these were being used and what might be appropriate sources for the terms, I established top concepts under which each term could be grouped, using some of the regularly occurring keyword areas I was encountering and on the advice of the team members I was with, namely Mick. While my identification of these terms was partially limited by my interpretation of the intended context of fairly specialised concepts, I felt that, as any vocabulary such as this would require a reasonable amount of user input to adapt and grow, there was certainly room for a few errors in my judgement to be corrected later. I made one of those assumptions, as an example, by choosing to treat any plant or animal name, such as prawn or banana, as a reference to the species rather than a reference to a product or a process.
The business unit and field of research information provided a first basis for the context of user-created keywords, but I found that the more context I could give to the terms in use, the more readily I could identify the most appropriate sources to define them. Separating keywords into the most reasonable top concepts helped me to identify some of the easier groups of terms that could be found within existing vocabularies, particularly those with a logical pre-existing structure for the broader relationships. It also helped me to start thinking logically about the most appropriate of those relationships and the structure that I wanted to link these keywords to. Geographic location, as well as species names within the scientific domain top concept, became my two easiest groups of keywords to structure in that way.
It also highlighted the need to include some broader terms, to ensure that a logical structure was maintained for future use without needing a better way to fill any gaps created as more keywords were added. Some species names might be readily linkable through the standard taxonomic structure, but to ensure a certain uniformity for all species keywords, I found myself filling in gaps between broader concepts where the genus or family were not otherwise used but the species and the kingdom had been included.
There were also multiple instances of phrases such as mooring data where mooring also existed as a separate term, and due to the nature of the DAP, every entry could be considered data of some type. So with keyword phrases ending in things like data and research, we could make the assumption that these can reasonably be treated as synonymous with whatever preceding keyword they accompany. However, making a decision on some of these similar terms as someone who is not a subject matter expert in that field creates the possibility of a number of flawed assumptions. Keyword phrases relating to DNA and sequencing might be one instance of this.
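The assumption about phrases ending in "data" or "research" can be sketched as a small folding step. This is an illustration only: the function names and the sample list are made up, labels are lower-cased for comparison (a simplification that loses capitalisation such as "DNA"), and a phrase is only folded when the shorter keyword already exists in the list.

```python
GENERIC_SUFFIXES = ("data", "research")  # the suffixes named in the talk


def base_of(phrase, known, suffixes=GENERIC_SUFFIXES):
    """Return the shortest known keyword reachable by stripping generic suffix words."""
    words = phrase.split()
    base = phrase
    while len(words) > 1 and words[-1] in suffixes:
        words = words[:-1]
        candidate = " ".join(words)
        if candidate in known:
            base = candidate
    return base


def fold_generic_suffixes(keywords, suffixes=GENERIC_SUFFIXES):
    """Map each base keyword to the phrases that could become its alternative labels.

    'mooring data' is folded into 'mooring' only when 'mooring' is itself in
    the keyword list; otherwise the phrase stays as its own keyword.
    """
    known = {" ".join(k.lower().split()) for k in keywords if k.strip()}
    folded = {}
    for phrase in sorted(known):
        base = base_of(phrase, known, suffixes)
        folded.setdefault(base, [])
        if base != phrase:
            folded[base].append(phrase)
    return folded


# Made-up sample, not DAP data.
sample = ["mooring", "Mooring data", "glider data", "DNA sequence", "DNA sequencer"]
print(fold_generic_suffixes(sample))
```

Only "mooring data" is folded here. "glider data" stays separate because "glider" is not itself a keyword, and "DNA sequence" and "DNA sequencer" are left alone, since telling those apart needs subject knowledge rather than string handling.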
Those DNA keyword phrases seem to repeat themselves in certain DAP entries, but it's important to differentiate between a term such as DNA sequence and the concept of a DNA sequencer. So I can identify clearly that those keywords relate to separate concepts, but without a clear existing vocabulary source or a direct understanding of the intention of the original uploader when tagging a phrase, it becomes a bit of a greater challenge and increases the likelihood of me making incorrect choices. As the number of assumptions I was making increased, due to gaps in the tools and an inability to access large existing vocabularies, it became increasingly likely that my assumptions were overlooking an integral difference in the intended context. As I tried to differentiate related but not synonymous concepts, it also became more difficult to find existing vocabularies or sources that went to the correct level of detail for the scientific domain.
Science and domain-specific glossaries and vocabularies do exist in many large organisations. So locating applicable ones from an appropriate source and comparing them to the keywords in use provides, in theory, some measure of surety that the term has value in that field. Using the Research Data Australia linked data API, I thought I could extract appropriate SKOS vocabularies to not only identify the concepts in the intended format but also minimise the total number of terms for which a more manual identification process would be required. My first comparison, for the smallest set of terms I had taken from the Marine and Atmospheric Research business unit, immediately matched 74 of my 235-term subsection to the Global Change Master Directory, which was one vocabulary I found that was fairly helpful.
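The comparison step described above, checking a keyword list against the labels of an existing vocabulary, can be sketched as follows. This does not call the Research Data Australia API and does not use Global Change Master Directory data. It assumes the vocabulary has already been extracted into a list of dictionaries, and the field names, `example.org` URIs and sample terms are hypothetical.

```python
def normalise(label):
    """Case-fold and collapse whitespace so labels compare consistently."""
    return " ".join(label.casefold().split())


def build_label_index(vocabulary):
    """Index a vocabulary by every preferred and alternative label.

    vocabulary: iterable of dicts with 'uri', 'prefLabel' and optional 'altLabels'.
    """
    index = {}
    for concept in vocabulary:
        labels = [concept["prefLabel"], *concept.get("altLabels", [])]
        for label in labels:
            # Keep the first concept seen for a label; real data may need a
            # better policy for labels shared by several concepts.
            index.setdefault(normalise(label), concept)
    return index


def match_keywords(keywords, vocabulary):
    """Return (matched, unmatched); matched maps each keyword to a concept URI."""
    index = build_label_index(vocabulary)
    matched, unmatched = {}, []
    for kw in keywords:
        concept = index.get(normalise(kw))
        if concept is not None:
            matched[kw] = concept["uri"]
        else:
            unmatched.append(kw)
    return matched, unmatched


# Hypothetical vocabulary extract and keyword list, for illustration only.
vocabulary = [
    {"uri": "http://example.org/vocab/sst",
     "prefLabel": "Sea Surface Temperature", "altLabels": ["SST"]},
    {"uri": "http://example.org/vocab/salinity", "prefLabel": "Salinity"},
]
keywords = ["sea surface temperature", "sst", "salinity ", "field survey"]
matched, unmatched = match_keywords(keywords, vocabulary)
print(f"{len(matched)} of {len(keywords)} keywords matched")
print("unmatched:", unmatched)
```

Exact label matching of this kind is what makes a first pass quick: 74 of 235 terms is a little under a third. It also explains why the remaining terms need manual work, because anything spelled, pluralised or scoped differently from the source vocabulary falls straight into the unmatched list.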
However, despite this initial promise, I was unable to find similarly large proportions of the terms within existing vocabularies, and was rarely able to find sources that could be extracted simply, in a format that someone with my technical skills could manipulate for comparison and extraction of the key fields required to adapt the keywords to a SKOS format. Unfortunately, as well, some of the existing vocabularies I did manage to access were designed for even more specific scientific applications or fields. The definitions for a concept as described within them were too limiting and didn't make sense for entries in the DAP. One example of this was field survey, where in many of the vocabularies I accessed, definitions of the concept were very specifically outlined in relation to one stream of science. Entries for geological field surveys, vegetation field surveys, or marketing or economic field surveys came up quite regularly, but I struggled to find a clear definition directed at a more general concept, not targeted to a single application or scientific domain.
Strangely enough, this meant that in a lot of cases I found myself using Wikipedia as the source for the best general definition. However, as sources for a controlled SKOS vocabulary, I'd been hoping to find ones that were a little more permanent and a lot more reliable. There were still some of those more reliable sources available, but of those I identified, the majority only assisted with such a small fragment of the overall collection of terms I was trying to address that accessing or adapting tools to draw from them was impractical. In order to continue addressing the structure of the DAP keyword vocabulary and provide some output for my efforts, the sources used were essentially collected manually. A lot of the assumptions I made were limited, as I wasn't working directly with DAP users and couldn't readily gain greater insight into their intended use of terms.
In some cases there were clearly internal rules or lists being applied to datasets uploaded by different business units, and partnering with those users would have helped me to identify where other vocabularies were already in use, which might have allowed me to piggyback a bit off those. There were also some vocabularies and tools that were obviously created with the intention of being used and shared for other projects. So establishing where relationships for a collaboration exist would have been another opportunity, had I had a bit more time to develop this project. While there are a lot of tools available, their functionality for someone without a high level of technical skill or the ability to access paid resources still makes their accessibility a bit of an issue for a project on this scale.
Without a single unified tool to extract user-created keywords and form a neat and tidy controlled vocabulary, an interim step might be to improve the value of the keywords being added by users: creating options for entering keywords into fields for the top concepts identified, or assigning some clear rules for how keywords should be used when uploading datasets. That would provide a more consistent user approach to keyword selection in the future and greater context for any future transformation of the keyword pool into a controlled vocabulary. I did end up with a smaller version of the keyword vocabulary than intended, but I feel that the approach I took could have used a greater level of technical skill to automate a lot more of the processes, and I think that partnering with other organisations or business groups would also have helped me to identify valuable vocabularies that were already in use a bit more quickly. So that's it from me.
Thank you very much, Sophie.