Hello everybody, and welcome. I'm Julia Martin from the Australian Research Data Commons, and I want to thank you for joining us today for the second in a series of webinars showcasing outputs from the ARDC-funded RDC and DeVL projects, their benefits and their reuse potential. Today's webinar focuses on outputs supported or developed during the Marine and Characterisation projects. The speakers from Marine are Sebastian Mancini, who's the director of IMOS and was the Marine project manager; Tim Langlois, senior ecologist from the University of Western Australia and a Marine project stakeholder; and Ari Friedman, software engineer, project lead and developer. We also have speakers from the Characterisation Data Enhanced Virtual Laboratory project: Lance Wilson from Monash University, the CVL coordinator, who will talk to federating characterisation resources, and Andrew Maynard, senior lecturer from UWA and project lead, who will talk to the microscopy characterisation and analysis perspective. [Seb:] Okay, so apparently I've been promoted — I'm not the director of the AODN anymore, I'm the director of IMOS. But no: I'm Sebastian Mancini, the director of the Australian Ocean Data Network. The AODN is a facility of IMOS; it's in charge of the data management of all the data collected by the IMOS project, and another mission of the AODN is to create an infrastructure to publish all marine data into a single framework. That's the goal of the AODN. Today I'm going to give you a quick overview, in ten minutes, of the Marine RDC project we ran last year, and then Tim and Ari will dive into more detail about the tools that were developed. The Marine RDC project came out of an AODN technical advisory group recommendation from the year before, which recommended that we improve the capture and delivery of biological data into the portal. So we submitted a proposal to improve biological data delivery, with the multiple subcomponents you can see below — focusing on biological data but not exclusively, because we had other data sets we wanted to take care of, surface wave observations for example. I'm not going to go through all seven components today: Tim and Ari are going to talk about the national service for marine imagery, GlobalArchive and Squidle+, so that's perfect, and I'll spend my six or seven minutes on the linkages the AODN has been establishing with the national repositories of large organisations collecting marine data. Overall, the project developed the tools you'll see with Tim and Ari, but it also highlighted data sets that were, for the most part, already available on the internet: for those data sets we added web services on top to improve data accessibility and reusability. That was really the main outcome of the RDC work — enabling web services for some of those data sets. The goal of subcomponent 7 was to link the AODN portal to data sets owned by other organisations. We focused on five of them — the Australian Institute of Marine Science, the Australian Antarctic Division, the Atlas of Living Australia, CSIRO and Geoscience Australia — and for each of them we decided with our partners to focus on one or two data sets that they wanted to share with us.
The end goal was to publish the data on the AODN portal. The portal is the main window where we display the data collected by IMOS and by other partners, and it works in three steps. Step one is search and discovery: you look through the different data set collections using facets like parameter, platform or organisation. In step two you can visualise the different collections and overlay them on top of each other, filtering by different types of filter — a spatial subset, a temporal extent or a parameter subset. And when you've picked the area and the data you want, step three is the download panel, where you can select the data you want in different formats. That has really been the focus of the AODN portal in recent years: search and discovery, but also access, so the user can actually download the data at the end of the system. We released the portal four years ago, and at the beginning it only really worked with 100% IMOS collections, because we were in control of those, so they were easy to publish. Since then we've connected different national repositories to the portal, and just recently we've reached almost 250 data set collections on the portal: 50% from IMOS and 50% from the other organisations you can see on the graph — mostly IMOS, through the University of Tasmania, but all the partners have been adding collections from their repositories. So what does it mean to actually publish data to the AODN portal? The portal is really just the window that shows the data; it consumes web services. It relies on metadata standards — ISO 19115 for the search interface — looking at different fields in the metadata to power the faceted search, and using controlled vocabularies, which I'll talk about shortly. The second part is the web services themselves: the portal consumes them to enable steps two and three — the map via Web Map Service (WMS), and the download step via Web Feature Service (WFS) or any other type of download service that other people make available and that the portal can consume.
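To illustrate how a client such as the portal consumes these services, here is a minimal sketch using the open-source OWSLib Python library. The endpoint URLs and layer name are placeholders, not real AODN services:

```python
# Minimal sketch of consuming WMS/WFS endpoints the way a portal does.
# The endpoint URLs and layer name below are placeholders, not real
# AODN services.
from owslib.wms import WebMapService
from owslib.wfs import WebFeatureService

WMS_URL = "https://example.org/geoserver/wms"  # placeholder endpoint
WFS_URL = "https://example.org/geoserver/wfs"  # placeholder endpoint

# Step 2: render a map layer, subset spatially via a bounding box.
wms = WebMapService(WMS_URL, version="1.1.1")
print(list(wms.contents))  # layers advertised by the server
img = wms.getmap(
    layers=["imos:example_layer"],       # hypothetical layer name
    srs="EPSG:4326",
    bbox=(110.0, -45.0, 160.0, -10.0),   # lon/lat box around Australia
    size=(800, 600),
    format="image/png",
)
open("layer.png", "wb").write(img.read())

# Step 3: download the underlying features (the data) for the same subset.
wfs = WebFeatureService(WFS_URL, version="1.1.0")
resp = wfs.getfeature(typename=["imos:example_layer"],
                      bbox=(110.0, -45.0, 160.0, -10.0))
open("subset.gml", "wb").write(resp.read())
```

The point of the sketch is that any client that speaks WMS/WFS can consume the same services the portal does, which is what makes the federated supply model work.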
What we had to do for this part of the project was ingest the metadata into the portal, along with each organisation's web map service and download service. For each of those organisations, these are the tools they're using, and as you can see most of us use the same combination: GeoNetwork for the metadata catalogue and GeoServer for delivering the web map and web feature services. There were some differences, though: the AAD uses GCMD keywords and the DIF metadata standard, the ALA uses its own set of web services, and GA uses its own web services and APIs for accessing its GeoTIFF data. So we had to make some compromises and change the portal to connect to those different services. For the download services, for example, we connected to GA, CSIRO and the ALA through their existing APIs — they made changes to their APIs to connect to the portal, and that worked well. Others, like AIMS and the AAD, were a bit simpler because they were already using GeoServer and delivering data the same way we do, so that was easier to ingest. For the metadata, most of us use the same standards — ISO 19115 and the Marine Community Profile; GA has already stepped up to the new standard, ISO 19115-1. One of the key things for the metadata was the use of controlled vocabularies, which is very important for ingesting records and powering the faceted search. A couple of years ago we published a lot of our vocabularies on the Research Vocabularies Australia (RVA) website, and we use that platform to create, edit and publish our vocabularies; we then ask our partners to use them in their metadata catalogues to tag their metadata. On the RVA platform you can select from a range of published vocabularies that we use to tag our metadata fields — platform, instrument, organisation, discovery parameter — and all of those vocabularies are available for people to download and reuse. That has been a very important part of getting the data and the metadata into the system. The end outcome of this part of the project was connecting all those different repositories to the AODN. We've got the temperature loggers from AIMS, some 300 of them covering the last 30 years; the ALA's marine occurrence data is in the portal; and Geoscience Australia is a great example, where the bathymetry they collected for the MH370 search is now available as a web map service and a download service that we connect to. So that was pretty successful. The highlight of the project is that federated, standardised data supply is possible — it's technically feasible, and the technology has changed so much lately that I think it's very doable. It relies on standards, for the metadata and for the data, and you still need a little bit of fiddling to make the connections, but it's not really a technology problem any more: it's a community commitment, and getting those partners to continue adding more data into the system is still a challenge.
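As a rough illustration of reusing one of these published vocabularies, here is a sketch that loads a SKOS vocabulary with the open-source rdflib package and uses it to tag a metadata field with a controlled term. The download URL is a placeholder, not a real RVA link:

```python
# Sketch: load a SKOS vocabulary (e.g. an AODN platform vocabulary published
# on Research Vocabularies Australia) and look up preferred labels so that
# metadata fields can be tagged with stable concept URIs instead of free
# text. The URL is a placeholder, not a real RVA download link.
from rdflib import Graph
from rdflib.namespace import RDF, SKOS

VOCAB_URL = "https://example.org/aodn/platform_vocab.rdf"  # placeholder

g = Graph()
g.parse(VOCAB_URL, format="xml")  # assumes the vocabulary exports as RDF/XML

# Build a prefLabel -> concept URI index for tagging.
concepts = {
    str(g.value(c, SKOS.prefLabel)): str(c)
    for c in g.subjects(RDF.type, SKOS.Concept)
}

record_platform = "mooring"  # free-text value from an incoming metadata record
if record_platform in concepts:
    print("tag metadata with:", concepts[record_platform])
else:
    print("no controlled term found; flag record for review")
```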
So that was a great outcome, and we're still working with partners to get more data set collections into the portal. I think I'll stop here. [Julia:] Thanks, Seb, that's fantastic. We'll hand over now to Tim. [Tim:] Thanks very much, Julia — and thanks very much, Seb, for introducing us. I'm going to talk about GlobalArchive, a system that Ari Friedman from Greybits and I have been working on for the last five years to bring together data sets from a particular domain in the marine environment. I'll talk in particular about the GlobalArchive tool and also a data sync tool — I'll get to that in more detail in a moment — and this marine imagery project, as Seb mentioned, was funded by the Australian Research Data Commons. The data sets we're talking about come from stereo baited video systems; you can see one being deployed over the side of a boat here in northern Australia. I'll show you a brief video clip of what a stereo video system looks like. This is some work we've been conducting with the Marine Biodiversity Hub, characterising fish and habitats inside some of the new marine parks around Australia. This method, the baited remote underwater video system (BRUV), has been adopted broadly by marine park managers and scientists around Australia, and now internationally, in particular for investigating and studying marine parks; the dark green areas you see are the no-take areas we're studying. Here is a stereo camera system with two cameras, calibrated so we can make length measurements afterwards; we deploy it to the bottom of Ningaloo Reef, and then we see the fish come in. What this data essentially generates is information on the biodiversity, species abundance and size distribution of fishes and sharks around Australia, and we're currently running projects in particular with Australian Marine Parks, looking at some of the new deeper-water reserves created around Australia. So that's the data we're looking at. As I mentioned, this data is very useful for characterising fish and shark assemblages, and the method has been adopted across Australia and internationally. There's one particular piece of software, EventMeasure, that's generally used to do the stereo and length estimates, and there are global initiatives, such as the FinPrint program, that use mono cameras as well. So in a sense it should be relatively easy to synthesise and bring all these data sets together; however, we found that wasn't really the case around Australia. Australia is currently the leader in the development of this technology — the baited remote systems and also the diver-swum systems; you can see a diver there swimming a stereo video system — and that's the distribution of sample points we currently have around Australia in the GlobalArchive system. There are also evolving methods, in particular remotely operated vehicles and towed video, using the same stereo methods to sample fish and sharks. But what we found we needed was a centralised data archive, because by surveying different people around the country we found a lot of differences in how people were dealing with the data and how they were applying quality control. There's also a real need to be future-ready for the automation of image annotation: all these annotations are currently done manually, but there's a clear need and desire to move to a more automated annotation system for these fish and sharks, so we can speed up access to the data.
So we decided to develop GlobalArchive to solve some of these problems of synthesising different data sets, and to ensure the time series data will be available into the future. When we designed GlobalArchive we tried to make it flexible, so that we could import historical annotations — back when I started this, twenty years ago, we were just watching video and writing things down on pieces of paper and then entering them into Excel, so there were no proper digital annotations — alongside the output of the more modern stereo annotation software. We needed a way to bring together these two forms of annotation data set, historical and modern, while at the same time working with the people working internationally, like the FinPrint project, to make sure what we were doing would be compatible with their work as well. We chose an approach of having a custodian for each project that's uploaded, organising the sampling events or campaigns into projects, and making all the metadata public — and so discoverable in the system — with the option of making the complete data set open. We took this approach to the sharing procedure because we're trying to move the marine ecological community towards open data, but that's sometimes a bit tricky: people aren't always forward about sharing data and making data sets open. By creating this system, though, we've enabled people to build trust in the idea of sharing, collaborating and opening up data sets. So this model doesn't entirely follow the AODN's fully open model, in the sense that not all of the actual annotation data sets are open; but all the metadata is public, so everyone can see and discover the data and request that custodians share the data sets with them. We've now recruited custodians around Australia and brought 22,000 stereo BRUV deployments into an Australia-wide synthesis, and this is helping us build trust and build the idea of making these data sets more open and available to everybody.
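To make that sharing model concrete, here is a hypothetical sketch of a deployment record under the custodian model — metadata always discoverable, annotations released only by the custodian. The field names are illustrative, not GlobalArchive's actual schema:

```python
# Hypothetical sketch of the custodian sharing model described above:
# metadata is always public/discoverable, while annotation data is
# released only when the custodian grants access. Field names are
# illustrative, not GlobalArchive's actual schema.
from dataclasses import dataclass, field

@dataclass
class Deployment:
    campaign: str
    site: str
    latitude: float
    longitude: float
    custodian: str
    annotations: list = field(default_factory=list)  # restricted payload
    shared_with: set = field(default_factory=set)

    def public_metadata(self) -> dict:
        """Anyone can discover where/when/by whom data was collected."""
        return {"campaign": self.campaign, "site": self.site,
                "latitude": self.latitude, "longitude": self.longitude,
                "custodian": self.custodian}

    def get_annotations(self, user: str) -> list:
        """Annotations go only to the custodian or users they've invited."""
        if user == self.custodian or user in self.shared_with:
            return self.annotations
        raise PermissionError(f"request access from {self.custodian}")
```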
Just briefly, I'll try to illustrate how GlobalArchive has changed the acquisition of this data. The old situation was: we'd go into the field, collect this wonderful video footage, analyse it with EventMeasure, query some summaries of the data out of that, and use Excel to make plots and reports; maybe it went into a database, maybe it had some form of metadata associated with it, maybe R was used. This led to a really unfortunate fog of error, where we were never sure about the QA/QC of the data and could never go back to the original annotations. The whole objective of GlobalArchive has been to close this loop, and to remove the problem of these important data sitting on people's hard drives around the country without being backed up adequately. In the new workflow we have GlobalArchive and the sync tool, which Ari will introduce in a minute. We use the sync tool to back up the imagery to a national repository — we have this working, we've carried out pilot workshops with people around Australia, and it works really nicely — and also to create a metadata standard and push it into GlobalArchive along with the raw EventMeasure annotations, so that we maintain as close a link to the raw annotations as possible. Queries then come out of GlobalArchive into a data analysis framework. Currently our QA/QC sits in R: it feeds back a list of things we need to go and check in the EventMeasure annotations, and those corrections go back through the sync tool into GlobalArchive. In the future we're hoping to move the QA/QC into the sync tool/GlobalArchive step so we don't have to keep doing this loop, but at the moment it's a feedback loop that works quite well to QA/QC the data sets and then produce reporting. So we've removed the fog of error we previously had, and we're now working with partners around Australia and internationally to see how this best fits into everyone's workflows.
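The QA/QC Tim describes currently lives in R; as an illustration of the kind of check involved, here is a rough Python equivalent that flags length measurements exceeding a plausible species maximum, so the original EventMeasure annotations can be re-checked. The column names and maximum-length values are hypothetical, not GlobalArchive's real schema:

```python
# Rough sketch of the kind of QA/QC check described above: flag fish length
# measurements that exceed a plausible maximum for the species. Column names
# and the maximum-length table are hypothetical, for illustration only.
import pandas as pd

max_length_mm = {"Lethrinus nebulosus": 870,   # illustrative values
                 "Choerodon rubescens": 900}

annotations = pd.DataFrame({
    "deployment": ["D1", "D1", "D2"],
    "species": ["Lethrinus nebulosus", "Lethrinus nebulosus",
                "Choerodon rubescens"],
    "length_mm": [450, 2100, 310],  # 2100 is an implausible outlier
})

# Mark rows whose length is within the species maximum (unknown species pass).
annotations["max_ok"] = annotations.apply(
    lambda r: r["length_mm"] <= max_length_mm.get(r["species"], float("inf")),
    axis=1,
)

# These rows go back to the annotator for re-checking in EventMeasure.
to_recheck = annotations[~annotations["max_ok"]]
print(to_recheck[["deployment", "species", "length_mm"]])
```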
Now to my first example of reuse, and the motivation for it. We've just put in an application to the ARDC transformative data collections grants to run national workshops and create GlobalArchive champions, following the model used by ecocloud, the ARDC project. All the state marine conservation agencies and fisheries management agencies using the stereo BRUV technology want to adopt the GlobalArchive workflows we developed, because they see that it allows them to meet best-practice standards and also saves them time. We had a great quote from a researcher in New South Wales recently, when we were running a workshop with him: it obviously took you a long time to develop, and it's going to save us a lot of time, work and effort. People also want to move to this workflow because they can use the sync tool to back up their imagery to the national archive, giving them another backup for this important imagery around the country. Internationally, the Global FinPrint project, funded by the Vulcan Foundation through Paul Allen Philanthropies, is an international project that has collected 18,000 samples of fish and sharks around the world using similar methods, and they want somewhere to host their data that will be open and enable synthesis with other data sets — so they want to upload their data onto GlobalArchive and maintain the annotations in as raw a form as possible. Another example, closer to home and actually cross-domain: I'm currently the coordinator of the masters in ecology at UWA, and we're developing a new big data unit where we'll be using ARDC services such as ecocloud, linking to the ALA, and using the BCCVL for biodiversity modelling and species distribution modelling — the students have really taken to this. As part of this new unit we're proposing something further. In marine ecology, and in our discipline in general, most of the masters students produce a thesis and try to submit it to a journal, and at the moment those journals aren't really requiring any evidence of a reproducible data analysis workflow. What we're proposing is that, as part of the masters thesis, the students will learn to create, for example, a GitHub repository that holds all the code producing the data analysis for their thesis. And because GlobalArchive was built in a really flexible way that allowed us to ingest historical data sets in a variety of different formats, we're proposing to use GlobalArchive at UWA to hold and archive all the data sets the students produce — which could come from a range of different methods and sampling approaches — and then link those to the GitHub repositories and create truly reproducible research, which I think will be a good skill for these students to leave UWA with. So thanks very much, that's everything I had to present. Any questions? [Julia:] Sorry, we'll keep questions to the end, Tim. It's now over to Ari. [Ari:] Hi, thanks. Tim, I'd actually thought you were going to be talking about the sync tool — I have sync tool slides in here, plus something at the end that ties it all together. Anyway, I'm going to talk about another online platform, called Squidle+. It's built on the same underlying architecture as GlobalArchive, so it's very similar in many ways, but it's targeted towards an end-to-end suite of tools for data analysis. It was built with funding support largely from the Schmidt Ocean Institute, and IMOS is now going to fund some further development of the tools. It's designed for the exploration, management and annotation of image and video data, and it's built to integrate with machine learning tools. The system is designed to be platform agnostic: where GlobalArchive focuses primarily on BRUVs — the image in the top right — Squidle+ is intended to support different data types and formats coming from a variety of platforms. The idea is that it ingests all this data through a flexible data ingestion API supporting multiple data types, and both platforms, GlobalArchive and Squidle+, speak heavily to the idea of data reuse. The examples on this slide are all marine-based, but there's nothing specifically marine about the Squidle+ platform; in theory it could be extended to other data sources as well. One example data source, to give you a brief overview of the system, is shown here: this map shows the regular repeat surveys collected by the Integrated Marine Observing System's AUV facility. Most of these surveys are done every year as part of IMOS, and it's a huge volume of spatially and temporally diverse data with repeat monitoring. The biggest issue faced here is that the rate of data collection is outpacing the rate of possible data analysis. There are over five million images in the AODN repository that Seb highlighted earlier, and all five million have been imported into Squidle+, so the data is all discoverable through the platform and ready for further analysis in-platform, using tools that are all online. So while the AODN hosts the data, Squidle+ provides an alternative portal to access it, along with tools to do the analysis — so the data doesn't all have to be downloaded and analysed offline, with everybody using their own analysis workflow and then trying to reconcile the results afterwards, which makes data reuse very difficult.
Squidle+ currently holds those five million stereo image pairs, from 683 IMOS AUV deployments across 49 campaigns, and every image — every media item — that gets uploaded has associated metadata: here you can see the images with their latitude, longitude, depth and a number of additional sensor fields. In addition to the ingested data, the user interface also integrates various layers from external services — again, as Seb mentioned, pulling in external data sources to provide context, the way the AODN does. Squidle+ does something similar, and GlobalArchive does the same: these map layers provide contextual awareness for the data contained in the system. Rather than reinventing the wheel, we pull in these external data sources to give users a richer experience. For example, here are the deployments on the map underlaid by satellite imagery from Esri; a bathymetry layer from Geoscience Australia; ecological features from the Department of the Environment and Energy; and geomorphic features from Geoscience Australia. On top of these layers we can zoom in, look at the individual deployments contained in the system, and query those deployments for further analysis. The Squidle+ interface provides tools for quickly searching through campaigns or filtering the data by depth and altitude, and we can then go into further analysis of the data. Here's an example of how data might be analysed within the system: there are annotation tools where you can select from a variety of methods — a point, a grid of points, or randomly distributed points — you can search through multiple annotation schemes, apply multiple labels per point, and there are lots of view configurations, like adjusting contrast and brightness. One of the core concerns of the system — really important when we talk about data reuse, discoverability, collaboration and synthesis — is the problem of standardisation. GlobalArchive does a really good job of capturing BRUV data, where the annotation schemes in use are relatively consistent because we're talking mostly about fish; but in a lot of situations, particularly in benthic ecology, the annotation schemes and vocabularies in use are very different. Looking at this image of what I would call seaweed: depending on the annotation scheme, it might be called any one of the things on the screen — kelp, seaweed, macroalgae, Ecklonia radiata, canopy-forming macroalgae. So one of the things Squidle+ tries to address is this idea of standardisation. What's been tried in the past, and has failed several times, is to simply introduce one annotation scheme and make everyone use it; this cartoon captures the outcome quite well — if you have a number of competing standards and you create one gold standard to replace them all, what you end up with is yet another competing standard. The way we deal with this within the Squidle+ framework — or, to be more specific, the marine-db framework, since GlobalArchive is part of the same framework — is to provide multiple annotation schemes and then be able to map between them.
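A toy sketch of that scheme-mapping idea, using the seaweed example above. The mappings are illustrative: in practice the linkage is managed inside the platform, and not every pair of schemes maps cleanly:

```python
# Sketch of cross-walking between annotation schemes, using the seaweed
# example above. The mappings are illustrative, not the platform's actual
# scheme linkage, and real schemes do not always map one-to-one.
SCHEME_A_TO_COMMON = {
    "kelp": "canopy-forming macroalgae",
    "Ecklonia radiata": "canopy-forming macroalgae",
}
SCHEME_B_TO_COMMON = {
    "seaweed": "canopy-forming macroalgae",
    "macroalgae": "canopy-forming macroalgae",
}

def to_common(label: str, mapping: dict) -> str | None:
    """Translate a scheme-specific label into the shared term, if one exists."""
    return mapping.get(label)

# Two data sets annotated under different schemes can now be exported in a
# consistent view by translating both into the common vocabulary.
print(to_common("kelp", SCHEME_A_TO_COMMON))     # canopy-forming macroalgae
print(to_common("seaweed", SCHEME_B_TO_COMMON))  # canopy-forming macroalgae
print(to_common("sand", SCHEME_A_TO_COMMON))     # None -> no mapping exists
```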
There's not always a one-to-one mapping, but wherever possible we link data sets analysed under different annotation schemes by mapping the classification schemes onto each other, and the platform then provides tools for exporting the data in consistent views and formats that can be reused for other purposes. To give a bit of an overview of how things hang together: typically, if you have an existing data collection program, the data gets uploaded to an online cloud storage facility, as in the case of the IMOS AUV; that data is then automatically synced to marine-db/Squidle+. Squidle+ can support multiple online repositories — it's not limited to pulling data from one particular repository, but can interface with several — and it then provides tools for online analysis, data export and collaboration (a lot of these tools, such as the data sharing tools, are still in development). There's also the idea of providing multiple user interfaces on the same back end: marine-db is the underlying infrastructure, Squidle+ is a science-user front end, and we can also provide simplified user interfaces for citizen science projects or other types of users who might want to view the data in different ways. GlobalArchive is a separate front end to different components of the marine-db system — it's based on an old version of marine-db, and future work will build both interfaces onto the same common back end. The other thing Squidle+ and marine-db support is machine learning integration. Machine learning offers the promise of solving the problem of data collection outpacing data analysis: it will help us process much more data much more quickly, and we need a user interface to connect these algorithms to the end users. A core requirement is that we don't want to be beholden to a single automated algorithm, as some existing systems are; the idea is to provide a platform that facilitates integration with multiple online, automated algorithms. Here's an example of some classified results from my PhD thesis — a very old way of doing image classification now — where you go from a few point labels on images to every pixel in every image being classified. Integrating these algorithms into Squidle+ and marine-db provides users with "magical suggestions" to speed up the analysis: instead of having to label everything by searching through a classification scheme, their job becomes validating that the results provided by the classifier are sensible, and once users are happy with the results, they can use the automated output to continue the analysis.
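Here is a purely hypothetical sketch of what such a machine-learning "user" might look like — fetch unlabelled media, classify, post suggestions back for human validation. The endpoints, payloads and the classify() stub are all invented for illustration; they are not the real Squidle+/marine-db API:

```python
# Hypothetical sketch of a machine-learning "user" of the platform API:
# fetch unlabelled media, run a classifier, post label suggestions back for
# a human to validate. Endpoints, payloads and classify() are invented for
# illustration; this is not the real Squidle+/marine-db API.
import requests

API = "https://example.org/api"                # placeholder base URL
HEADERS = {"Authorization": "Bearer <token>"}  # ML users authenticate like any user

def classify(image_url: str) -> list[dict]:
    """Stand-in for a real model; returns point-label suggestions."""
    return [{"x": 0.5, "y": 0.5,
             "label": "canopy-forming macroalgae", "score": 0.91}]

media = requests.get(f"{API}/media", params={"unlabelled": True},
                     headers=HEADERS).json()
for item in media:
    suggestions = classify(item["url"])
    # Suggestions go back through the same API, subject to the same sharing
    # constraints as human users, and appear in the UI for validation.
    requests.post(f"{API}/media/{item['id']}/suggestions",
                  json=suggestions, headers=HEADERS)
```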
Putting this all together — in an admittedly complicated diagram — this shows the different components of the marine-db system. As I've mentioned, both GlobalArchive and Squidle+ are built on marine-db; the underlying framework and code base are open source, though it should be pointed out that the benefit of these systems really comes from the centralisation of data: while it's possible for everyone to spin up their own instance, it doesn't make much sense, because we want to centralise the data. With that said, there are various open-source libraries that provide access to the data stored behind the APIs, so people who want to work with external tools — or who have scripts that do their data analysis and want to pull data in dynamically from these APIs — can use these libraries to interact with the API and bring the data into their programs, obviously subject to the sharing constraints built into the system. Another interesting thing about this diagram is that machine learning algorithms are effectively special users of the system: they're subject to the same sharing constraints as the rest of the users, but they interact directly with the API, and there are libraries, still in development, to facilitate this. In a sense these tools provide cross-disciplinary collaboration and data sharing — synergies between the marine science community and machine learning researchers — with the ultimate goal of producing various data outputs: discoverable data, high-level stats and summaries, collaboration, science communication and academic publication, ultimately leading to informed policy. Looking at the same diagram in a slightly different way, as a workflow: the red arrows flag the manual processes involved, the green arrows show the automated steps, and sharing is shown by the blue arrows. You can see there are a number of red arrows if you take the route of doing annotations externally, outside the system. For example, in the BRUV community EventMeasure is used out of necessity: we're doing stereo video annotation, and there are no existing online annotation tools for that yet, so the analysis has to happen external to the platform. The caveat is that several manual steps are then involved in getting that data online: data custodians are left with unstructured data that's quite difficult to put into a structured format consistent enough for online repositories, and when you're not dealing with consistent annotation schemes the annotations become very difficult too. So we've developed the sync tool, aimed at simplifying a lot of that process. It's an application that runs on your local machine: unstructured data goes in, you import — or at least point it at — the imagery you used in your analysis and at the metadata files, and it reconciles them, making sure the imagery matches up with the metadata.
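As a rough illustration of the kind of reconciliation and validation the sync tool performs, here is a sketch that checks a metadata file against an imagery folder before any upload. The column names are hypothetical, not the sync tool's actual format:

```python
# Sketch of the kind of reconciliation and validation the sync tool performs
# before pushing data to an online repository: confirm every image referenced
# in the metadata file exists on disk, and that coordinates are sane.
# The metadata columns are hypothetical, not the sync tool's actual format.
from pathlib import Path
import csv

def validate(metadata_csv: str, image_dir: str) -> list[str]:
    problems = []
    images = {p.name for p in Path(image_dir).iterdir()}
    with open(metadata_csv, newline="") as f:
        for row in csv.DictReader(f):
            if row["filename"] not in images:
                problems.append(f"missing image: {row['filename']}")
            lat, lon = float(row["latitude"]), float(row["longitude"])
            if not (-90 <= lat <= 90 and -180 <= lon <= 180):
                problems.append(f"bad coordinates for {row['filename']}")
    return problems

# Only sync when everything checks out.
issues = validate("deployment_metadata.csv", "imagery/")
print("ready to sync" if not issues else "\n".join(issues))
```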
It also runs validation checks on the metadata files to make sure everything checks out before syncing to one of the online platforms. Even with GlobalArchive, you export data from EventMeasure but there's still the step of creating a metadata file that tells GlobalArchive more — the georeferencing information for your deployments, and a number of other things EventMeasure doesn't capture — so the sync tool captures a lot of that metadata too. The sync tool is aimed at streamlining the data wrangling that otherwise happens by manually reconciling annotations and survey data. Reconciling the annotations is still future work for the sync tool; up to now, as part of the marine RDC project, the focus has been on pushing data to an established online repository, and the sync tool does that currently — we're still working on a Windows build. Just to give a comparison: if you use the online annotation tools instead, a large number of the manual processes are cut out. There's a manual step of uploading structured data to a data repository, and if you have structured data — if your data sources always produce consistent metadata, as with the AUVs — the data can go straight into a repository and everything is streamlined and automated. The only manual step left is the annotation and analysis of the data itself, with no additional data wrangling needed to make the data discoverable and to do the analysis. And eventually, when you start plugging in machine learning tools, it becomes more an effort of validation — provided what you're trying to do can actually be solved by machine learning. Looking at the to-dos that are still left: GlobalArchive has a lot of data sharing tools built in, as Tim mentioned, and we're planning to roll GlobalArchive into marine-db and Squidle+, to have it as a separate interface to the system, so some of the data sharing components need to be built into the new version of marine-db. Annotation scheme management needs to be built into the user interface, and we're building curated species catalogues to assist with annotation. We're expanding the machine learning functionality — there's a LIEF grant submitted to build out the machine learning tools behind this platform, integrate the various other machine learning tools, and provide interfaces for machine learning researchers to contribute algorithms — and there are photo mosaics. One of the features of the system is that it's media-type agnostic: it doesn't matter whether you're talking about an image, a video, a large-scale mosaic or a map — it provides the same annotation tools, in the same annotation framework, for a variety of media types. You just have to build an online viewer for each one, and that's really the limiting factor. If we have streamable video, we can annotate in streamable video; or if there's a web service that provides frame grabs of video, we can interface with that web service and Squidle+ can provide an annotation tool for it.
Citizen science interfaces can also be developed on the same back end, and the system provides reporting and high-level overviews — these are things that still need further development. There are some basic QA/QC tools, which we plan to extend further, and then there's the idea of expanding these platforms to additional application domains: the examples here are marine imagery, but we could be using the same platform for UAV imagery, for example. One other looming item is working out long-term support and funding arrangements for sustainability. One of the bottlenecks up to now has been user support — data stewardship to help get data into the system, supporting users on the system, training users, and just responding to user requests — and not having enough development support behind the platform has really been a bottleneck. Hopefully that's going to change soon: there's an IMOS proposal under consideration for expanding the development team, but there's still plenty more scope for investment in these types of projects. I think that's what I'll leave you with — I've probably gone a little over time. [Julia:] That's fantastic, Ari, thank you very much. We'll go straight across to characterisation, and our next presenter is Lance Wilson. [Lance:] Excellent, thank you for inviting me along. What I want to talk to you about today — and I'll probably get the two terms convoluted — is the Characterisation Virtual Laboratory and the C-DeVL. The C-DeVL is our data-enhanced version of the CVL, and in practice we usually use CVL as the short name for everything we do. To give you a little background on what the CVL is: what we want to do for researchers is put all of their tools and their data in the one place. That one place can be anywhere around the country, but what we're trying to ensure is that any data a researcher collects at an instrument anywhere in the country lands on a system where the tools are there, so that they can actually carry out their research. In project terms, the CVL is a program of work where we've identified areas inside the characterisation space where things we can develop or coordinate will help researchers carry out their research more quickly. Today I want to talk about two main things: first, the reusable components we've built as part of the CVL project, and second, our journey in terms of federation — we began with one site, we're now up to three sites, and I want to talk about how we've gone along that path. Looking at the reusable software and infrastructure — I'll talk through these in more detail as we go — the primary thing we provide, from the user's point of view, is a remote desktop environment. The way we do that is with a tool called Strudel, and Strudel Web; both software programs are bundled together, with a couple of services underneath which I'll talk through in a second. The next thing we have, useful for people who don't have a repository technology in this space, is MyTardis, our general-purpose repository.
We use MyTardis for connecting all of the instruments in characterisation, which span from things like the Australian Synchrotron all the way down to desktop microscopes. The next thing we've developed is an authorisation service using SSH certificates and the AAF: researchers come in with a single identity, and we use that identity to map across to all of the different underlying services we provide. We've also built automation scripts for rebuilding all of our infrastructure as part of this project: everything we bring up — databases, login nodes, and so on — is built using Ansible scripts, so that in the event a cloud node disappears we can easily rebuild it. Another really important one, useful for many people, is a set of data repatriation scripts used to pull data back from ANSTO, and specifically the Australian Synchrotron. Those scripts are really useful if you have an API on the other end: they're made to interact with the APIs provided by research institutions, which lets you pull the data in a sane way. The last inherently reusable thing is that we've been working very hard at containerising every piece of software we install in the Characterisation Virtual Laboratory. That's been an ongoing process — we didn't do it in the past because we were running a different way, but we've since moved on and it now makes sense for us. On this slide you can see a nice picture of our desktop: when researchers come in they get an environment that looks reasonably comfortable, where they can recognise what they need to do. The top-level menus are organised by community — we support a whole range of communities: neutron beam imaging, general light microscopy, and cryo-electron microscopy, which is a really strong focus for us at the moment. In the background you can see the web interface you reach before hitting the desktop, which I'll talk through briefly. When I mentioned the authorisation service and Strudel Web, this is what those parts look like: a user comes in and sees a web page with a choice of remote systems — here you can see CVL at UWA, and we have a specifically branded CVL for the design house people. They click whichever one they want to log into; that redirects to the AAF, and the redirection is used to create a certificate that gets passed around to all of the infrastructure parts, so from the user's point of view they don't need to see how they're authorised across the different parts. Once they've logged in, they're presented with a choice of what type of desktop they want to run, where they want those resources to come up, and who they want to charge those resources to. They then progress through to the job control stage, which takes all of their authorisation and fires up a job on our HPC systems; that job is either a VNC desktop or a Guacamole desktop, depending on whether they want to use the desktop client or the web version.
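The authorisation flow Lance describes hinges on SSH certificates: once the AAF confirms who you are, a certificate authority signs your key so downstream nodes trust the certificate rather than a password. Here is a minimal sketch of that signing step using standard ssh-keygen certificate options — the paths and the AAF-to-local-identity mapping are invented for illustration; this is not the CVL service's actual code:

```python
# Sketch of the core of an SSH-certificate authorisation service: after the
# user authenticates via the AAF, a CA signs their public key with a
# short-lived certificate, so nodes trust the certificate instead of
# passwords. Paths and the identity mapping are illustrative; this uses
# standard ssh-keygen options, not the CVL service's actual code.
import subprocess

def issue_certificate(user_pubkey: str, aaf_identity: str,
                      ca_key: str = "/etc/ssh/ca") -> str:
    local_user = aaf_identity.split("@")[0]   # toy AAF -> local identity mapping
    subprocess.run(
        ["ssh-keygen",
         "-s", ca_key,          # sign with the CA private key
         "-I", aaf_identity,    # certificate identity (shows up in audit logs)
         "-n", local_user,      # principal the certificate is valid for
         "-V", "+1h",           # short-lived: valid for one hour
         user_pubkey],
        check=True,
    )
    # ssh-keygen writes the signed certificate alongside the public key.
    return user_pubkey.removesuffix(".pub") + "-cert.pub"

cert = issue_certificate("/tmp/id_rsa.pub", "jsmith@uwa.edu.au")
print("certificate written to", cert)
```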
To give some context about where Strudel is deployed: we've been international for a little while now, and we're getting significantly more uptake around the country — we've hit pretty much everywhere except the Northern Territory, I think, and we're still struggling a bit with New South Wales. We're running quite a number of deployments around the country, and they vary from sites where we're highly engaged all the way to sites that have just taken the source code and are running it themselves. To give you some perspective on how well used it is: we've been providing this particular version of the desktop since 2016, with a number of incremental improvements along the way, and we're currently up to 40,000 desktops, typically with over 200 unique users a month on the system. Moving on from Strudel and Strudel Web to MyTardis: where we help researchers with their data is at the instrument, and MyTardis is how we do that. The MyTardis project itself has a couple of parts. The first is the MyData client, which you put onto the microscope — or whatever the instrument is — and which replicates the data into the MyTardis ecosystem. MyTardis itself is a web interface to research storage and to the metadata. If you want to have a look, we have store.monash — I think we're up to about 60 instruments or more, and I'll probably be corrected in a minute about how many — and pretty much every instrument on Monash's campus is integrated now; when new instruments come along, this is the tool we use to pull the data in. It's really useful for researchers too, because there's a feature inside it that allows data to be pushed to analysis systems: currently it pushes to the CVL, and it pushes to MASSIVE as well.
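As a rough illustration of the pattern Lance describes — an instrument-side client watching folders and replicating new files into the repository — here is a minimal sketch. It is not the real MyData client: the endpoint, field names and polling approach are placeholders:

```python
# Simplified illustration of an instrument-side uploader like MyData: watch
# the instrument's output folders and push new files, with minimal metadata,
# to a repository's REST endpoint. The endpoint and field names are
# placeholders, not MyTardis's actual API; the real client is event-driven.
from pathlib import Path
import time
import requests

WATCH_DIR = Path("C:/instrument_output")         # folder the instrument writes to
ENDPOINT = "https://example.org/repo/api/files"  # placeholder repository endpoint
API_KEY = "<key>"

seen: set = set()
while True:
    for path in WATCH_DIR.glob("*/*"):           # files sit in per-project folders
        if path in seen or not path.is_file():
            continue
        project_id = path.parent.name            # folder name carries the project ID
        with open(path, "rb") as f:
            requests.post(
                ENDPOINT,
                files={"file": f},
                data={"project": project_id, "instrument": "SEM-01"},
                headers={"Authorization": f"ApiKey {API_KEY}"},
            )
        seen.add(path)
    time.sleep(60)  # poll every minute in this toy version
```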
Moving on to the federation activities we've been carrying out over the past 18 months, I want to cover how we got to where we are, beginning with the architecture we came from. The CVL desktop, which we now run at multiple nodes around the country, started off as the original CVL project on Nectar. That led to an HPC-on-cloud project called MonARCH; that particular cluster is ongoing and is now the campus cluster for all Monash researchers. Every time we've moved, we've learned new things and redeveloped the technology to stay current; that led to the MASSIVE M3 system, with CVL put on top of it, and finally to the current C-DeVL activities, which are CVL at UWA and CVL at UQ. We're running two different models of CVL around the country. One is our source model, where you deploy on a cloud using our Ansible scripts, which bring up all of the infrastructure you need; the second model is where you deploy CVL on top of an existing HPC system. Each has its own inherent design choices, and as part of federation we needed to explore both, because if you can get access to large, nationally funded resources, putting CVL on top of them really makes sense. The architecture we went with reflects how we've been running projects inside the characterisation community: with really strong governance, where that governance is made up of the research community and the infrastructure providers, which means the things we build are genuinely researcher-focused. That flows down to the lead node, where what we try to do is make sure the things we learn from our very large user community here get deployed to the rest of the nodes around the country. The other thing we really want from the federation is that any node with a specialised capability can contribute it back into the national experience. This architecture has allowed us to partner really strongly with the research community and the infrastructure providers. Looking at this in a little more detail, what we're trying to do is provide a consistent user experience for the researchers, so that wherever they come into the CVL it looks familiar to them. We're really aiming for a unified experience, but because we run in multiple locations that's not as easy as one might hope. The primary thing to notice in the detailed view is that everything we do is based around a single identity for the researcher. Diving into the technical components a little more: for what we call CVL-in-a-box, where you can be provided with all the source, every part — the job scheduler that runs the desktops for people, the identity servers that translate an AAF identity into a local identity — is scripted. So if you wanted to become a CVL partner, you could take those scripts, rebuild a cluster, and be up and running in, I don't know, less than a week. The other model we run, which you can see on the right-hand side of the slide, is bridging an existing HPC cluster. The difference there is that we don't have any control over the underlying identity services or the job scheduler, but the partner node has the capability of setting them up so that we can still provide that consistent user experience to everybody. Moving on to some example federation principles we've been working through — we communicate these because we think they're really important when you want to run a system in more than one location: as a researcher, you want to minimise the amount of time you spend learning another system to get your research done. From the beginning we've had a single user portal: everybody comes in through cvl.org.au, and everything is linked from there, whether that's Strudel Web or the documentation — the intention is that there's one port of call for everything. The next part, which we keep learning about, is a single sign-on mechanism. We've begun with this and it works really well for a couple of our sites, and we're still working through what it looks like when you run the second model. This means users can get resources where they need them: they come in from the one place, and they get everything they need in the one spot.
The other component I want to talk through is the single software stack. This is probably the biggest challenge for researchers in this space: how do they get their tools to where their data is? We've been working really hard on that, and we've gone about it in two ways. For the clusters running in CVL-in-a-box mode it's really simple: you can replicate the entire software stack and it pretty much just runs out of the box, which is fantastic for getting up and going; the slightly tricky part is how you then get support for new packages, and we're still working out what that looks like. The second thing we've been doing, for all of the software components and workbenches we identify now, is making containers for them. We have a public repository where the container build scripts are located, and the containers are also built publicly, so people can pull them down and run them wherever they want. This work is really useful: if nothing else, the reuse of these containers, and the technology we use for them, is important across all of the research communities — if you can package up your software in a repeatable way so that other people can use it, that's really, really good. Moving on to the last bit, what I wanted to convey is that there are some really strong outcomes from the CVL project over the last five or six years. We've been building reusable components that people can use for deploying remote desktops anywhere, and we've developed a repository system that you can plug pretty much whatever data you want into. The third thing is the authorisation service: we strongly encourage the use of this type of authentication, because the certificates that get passed around allow systems to have passwordless logins for everybody, carrying their AAF identity. Next, we'd be really happy to have people contribute to our software containers. And lastly, we'd be really happy to talk to anybody about our federation experience: we've been trying to document how we've gone about it — how we maintain strong governance, how we run the multiple nodes, and how we take contributions from those nodes — and we're really keen to talk to people about that. A lot of those materials are up in our GitHub under the Characterisation Virtual Laboratory, and there's also a section, just beginning, where we're putting our training materials. So that probably covers it from me. [Julia:] Thank you very much, Lance, that was fascinating. I will now pass over to Andrew. [Andrew:] Okay, so I'm going to continue on the C-DeVL project, but from one of the partners' perspectives: the Centre for Microscopy, Characterisation and Analysis (CMCA) at the University of Western Australia. My name is Andrew Maynard; I'm a joint Microscopy Australia and National Imaging Facility Informatics Fellow — the National Imaging Facility and Microscopy Australia both increase characterisation capability — and I'm also a senior lecturer and group leader for data management, analysis and visualisation at the CMCA. As an outline of my talk: I'll talk briefly about what the CMCA is, then about our big data challenge, our path to adopting the MyTardis platform, and how that links into our current involvement with the Characterisation Data Enhanced Virtual Laboratory project.
About the CMCA: we're a university centre, and we collaborate in microscopy and characterisation research, supporting research experiments locally, nationally and internationally. We have some 48 different instrument platforms worth about A$45 million, around 35 staff and more than 400 users, and our instruments enable us to characterise the continuum from atoms through to small animals — instrumentation covering everything from optical and confocal microscopes to magnetic resonance imaging, micro-CT and so on. Partners in the centre include the Australian National Fabrication Facility and the other organisations I show on the right-hand side of the slide. Importantly, we host the Western Australian node of Microscopy Australia and the Western Australian node of the National Imaging Facility, and together those organisations are in partnership with Euro-BioImaging through the Global BioImaging project. On to solving the CMCA's big data challenge. We have a concept of the user pathway in the CMCA: a researcher comes to us with a specimen — a material or sample they want to understand and characterise. They talk with academic staff to figure out the appropriate instrument or instruments to use to characterise and understand their sample; that's the planning aspect of the engagement. If they haven't registered with us already, they register, get training on the particular instrument or instruments they need, then book the instrument, collect their data, and perform analysis — often using proprietary software on workstations within the centre, with input from CMCA staff — and then, hopefully, move on to results and publications. What we're finding is that researchers are acquiring ever larger amounts of multi-dimensional data, and they're typically using more than just one instrument. So the question is: how do we manage, curate and archive this data? For us, at the moment, that's about 50 terabytes of data per annum, but that will increase sharply next year as we take on cryo-electron microscopy capability and also move into human MRI. The data is also predominantly long-tail data — relatively small, unstructured and uncurated files in the tens and hundreds of thousands — so how do we adhere to the FAIR data principles, making the data findable, accessible, interoperable and reusable? Then, how do we analyse this data to facilitate new discovery — everything from computation to data processing and visualisation? And finally, how do we collaborate across multiple sites, nationally and internationally? Solving this challenge is the CMCA informatics strategy, which has three major objectives. The first is to leverage national e-research infrastructure and tools: here we've looked to the ARDC and its precursors — ANDS, Nectar and RDS — to AARNet, and, given we're in Western Australia, to the Pawsey Supercomputing Centre, as well as to our colleagues at Monash and MASSIVE; and of course we make use of the Australian Access Federation for authenticating with online services using institutional credentials. The second objective has been to harvest instrument data into a data repository service: we use MyTardis for that, and I'll comment more on it shortly.
Trudat is hosted at the Pawsey Supercomputing Centre. CVL at UWA is also hosted at Pawsey, so we make use of that too, and we also make use of CVL at MASSIVE; as the federation grows we'll have more characterisation virtual laboratory instances we can use. At top right you can see the web portal for the Trudat at UWA data store. The user can download their data to their own PC or to one of our high-end workstations in the centre, or, as Lance mentioned earlier, it's possible to push it onto one of the CVL desktops. At bottom middle you can see an example of the CVL desktop, in this case running through Strudel Web. And since the data is sitting in the cloud at Pawsey or at MASSIVE, we then have a stepping stone into high-performance computing.

Okay, the path to MyTARDIS. Given we have such a variety of instrument platforms, we looked around at all the various offerings we could find in this space; I've listed a few of those on the slide. We eventually adopted MyTARDIS because it's really been designed for the long tail of instruments: it's not domain specific, it can ingest data from anything, and it's possible to put ingest filters in place to handle metadata depending on the source of the data. Plus it originated at Monash, there's a strong user community in Australia, and we wanted to support that as well.
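To make the ingest-filter idea concrete, here is a rough sketch of the kind of source-dependent metadata extraction involved; the function name, hook and file-type mapping are hypothetical, not MyTARDIS's actual filter API.

```python
from pathlib import Path

def post_ingest_filter(datafile_path: str) -> dict:
    """Illustrative post-ingest filter: derive metadata from the source file type."""
    path = Path(datafile_path)
    metadata = {"filename": path.name, "size_bytes": path.stat().st_size}
    suffix = "".join(path.suffixes).lower()
    if suffix == ".dm3":                 # e.g. a Gatan electron microscopy image
        metadata["instrument_type"] = "electron microscopy"
    elif suffix in (".nii", ".nii.gz"):  # e.g. an MRI volume
        metadata["instrument_type"] = "magnetic resonance imaging"
    return metadata
```

Because the filter runs after the file lands, the repository can stay completely domain-agnostic at ingest time and still attach instrument-specific metadata afterwards.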
Our initial deployment of MyTARDIS came about with our involvement in the RDS A1.4 transition project (RDS is now part of the ARDC, of course). We deployed on UWA infrastructure, virtual machines and storage provided with a service level agreement through Vocus, and at that time we integrated three electron microscope instruments with the platform; that project ended in June 2017. Our current deployment of MyTARDIS is much more mature, and it evolved from the ARDC-funded Trusted Data Repositories program, specifically the National Imaging Facility Trusted Data Repository project. The aim of that project was to deliver durable, reliable, high-quality image data for the National Imaging Facility. At UWA we deployed MyTARDIS on the Pawsey Nimbus cloud, replacing the Nectar cloud offering, and the project involved four nodes of the National Imaging Facility, which I've listed on the slide, with UWA as the lead node.

To summarise that project in a nutshell, without going into a great amount of detail: at top left, again, we're talking about an instrument PC with a piece of software on it, the MyData uploader client. We developed a protocol and agreed process within the National Imaging Facility for uploading data to the repository service, which included, for instance, a specification of minimal metadata, but also a standard operating procedure for quality control of the instrument and the requirement that we upload the instrument's quality control data to the repository as well. We also required that every instrument integrated into the repository should have a description record in Research Data Australia, the data and service discovery portal provided by the ARDC; that means essentially every instrument has a unique handle, a persistent identifier. On the right-hand side are the four repository services. We organise data by project ID, and we are moving towards using the Research Activity Identifier (RAiD) national database instead of locally minted project IDs. Logins to any of the repositories are via the Australian Access Federation, and the other thing to note is that we link back to the instrument record located in Research Data Australia.
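As an illustration only, a minimal metadata record under that kind of protocol might carry fields like the following; the field names are invented for this sketch, not a published NIF schema.

```python
import json

# Hypothetical minimal metadata record for one ingested dataset.
record = {
    "project_id": "RAiD-or-locally-minted-project-ID",  # moving toward RAiD over local IDs
    "instrument_pid": "hdl:example/instrument-handle",   # persistent ID from the RDA instrument record
    "operator": "researcher@uwa.edu.au",                 # authenticated via the AAF
    "acquired": "2019-08-01T10:30:00+08:00",
    "includes_qc_data": True,  # SOP: instrument QC data is uploaded alongside research data
}

print(json.dumps(record, indent=2))
```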
Okay, moving on: the deployment of MyTARDIS at UWA is essentially based on a Docker deployment of MyTARDIS which we built at UWA, plus some extensions: user interface additions, a post-ingest filter, a hierarchical file view, and particular configuration requirements for compliance with the Trusted Data Repository project. One of the things we did was to map the notion of an experiment in MyTARDIS onto a project. Underneath, or to the right there, you can see the rationale for going with MyTARDIS: easy instrument integration, simple data sharing and so on. Features of the UWA Docker deployment are that it's easy to deploy or redeploy the MyTARDIS platform (in fact we were able to deploy on behalf of the University of New South Wales) and that there are reduced administrative overheads, for example when updating to new versions. Because the component parts of MyTARDIS are containerised, we get self-healing properties: a container can be monitored and restarted if it crashes, we can auto-scale by adding more containers with demand, and we can orchestrate the containers using Kubernetes.

Finally, on to the link with the characterisation data enhanced virtual laboratory project. We're one of the partners in this project, which is of course Monash led, and we've just heard from Lance. On the right-hand side the diagram shows that the project has a number of sub-projects: three horizontals and four verticals. UWA has been involved in the infrastructure horizontal and the electron microscopy workbench vertical. Our involvement has specifically been to assist Monash with the deployment of CVL at UWA on Pawsey infrastructure; to integrate our repository service, Trudat at UWA, which is based on MyTARDIS, with CVL at UWA, enabling the push-to functionality; and finally to contribute software to a new electron microscopy workbench alongside our colleagues from the CMM at UQ, Microscopy Australia and Monash. At bottom right, just to refresh what Lance said before, there's a snapshot of the CVL instance running in Strudel Web. The applications menu essentially lists the individual workbenches; I've highlighted the cytometry workbench there, and you can see all of the tools inside it. What we're adding to this is another workbench for electron microscopy.

There have been challenges with the EM workbench. Firstly, we had to undertake a survey of EM software tools across the Microscopy Australia nodes, and we discovered that many of the applications are Windows applications, while the CVL is inherently a Linux platform. What we've had to do here is leverage what was done for the cytometry workbench in a previous project and make use of the Wine compatibility layer to deploy applications such as Gatan's Digital Micrograph; perhaps in the future we could look at a Windows virtual machine solution within the CVL.
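A minimal sketch of that Wine approach, with an invented install path and prefix, might look like this:

```python
import os
import shutil
import subprocess

# Hypothetical install location of a Windows-only tool inside a Wine prefix.
EXE = "/opt/wineprefix/drive_c/Program Files/Gatan/DigitalMicrograph.exe"

def launch_under_wine(exe_path: str) -> None:
    """Run a Windows application on a Linux desktop via the Wine compatibility layer."""
    if shutil.which("wine") is None:
        raise RuntimeError("Wine is not installed in this desktop or container")
    env = dict(os.environ, WINEPREFIX="/opt/wineprefix")  # keep the app in its own prefix
    subprocess.run(["wine", exe_path], env=env, check=True)

launch_under_wine(EXE)
```

Baking the Wine prefix into the workbench container is what makes a Windows tool appear as just another menu entry on the Linux desktop.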
Another aspect is dealing with licensed software. For Digital Micrograph, for instance, there's a free version and a licensed version, so to deploy the licensed version we need to have discussions with the vendor, Gatan. There's also standardising containerisation across the federation, to ensure that we each add containers to the workbench in the same way, so that we have a common recipe. We've had some challenges in dealing with desktops that are CPU-only versus desktops with GPU capability, trying to make containers work in both environments. And there's maintenance and support: you really need a community to support the workbench. In the case of Microscopy Australia, with nodes around the country and interest from the electron microscopy group leaders, we have that input and we've got some champions, and that's essential to keep the workbench alive and software being contributed to it.

So finally, summary and conclusion. The CVL itself leverages outputs from several ARDC projects, including the NIF Trusted Data Repository project; the software from that project is available on GitHub, and you can read more about the project on the National Imaging Facility website at the second link. The CMCA ecosystem now consists of CVL at UWA (in beta), CVL at MASSIVE, which we make use of, and our repository service, Trudat at UWA. I'll finish with a note on some federation challenges. The first is consistency of look and feel: that's essential if you want users to adopt the system, so each node of the federation should offer the same desktop experience. Integration of data repositories is important: if I'm at UWA, I should be able to push data from the Pawsey repository to CVL at MASSIVE, for instance. Community support, and champions for the workbenches, are essential. So is ensuring availability: when a user wants to access the CVL, a desktop needs to be there for them, which means having enough hardware infrastructure and being able to share that load amongst the federation. And finally the last one, the toughest for us all, I think: how do we support this infrastructure into the future? For UWA that would mean a commitment from central IT to support the cloud services, Trudat at UWA and CVL at UWA. And with that, I say thank you.

Thank you so much, Andrew. It's Jerry Rider here from the ARDC. We've had a really diverse group of presentations today, and I thank all our presenters. I know we already have a couple of people with questions or comments they'd like to add. Lesley Wyborn, I think you had some thoughts around the characterisation project; do you want to share those?

Yes: how can we plagiarise, I mean, sorry, reuse what you've developed, at a smaller scale? I'm speaking from AuScope, where I work as a sort of helper on their data project, and we're going to try to set up a geochemistry network. Now, we have a plethora of much smaller-scale instruments in various laboratories in various departments, and bringing them together is the first issue, but once we do that, you've pretty well built the back-end infrastructure and solved a lot of the problems that I can see we're going to have. So how the heck do we reuse what you've done?

The simplest thing is to reach out to me by email. We're really happy to partner with people. Most of the things we've developed are completely open source and public. There are a few things where we have a little bit of crossover with some of our internal systems, and there we'd typically have a conversation about how we help you get to the point where you can run it.

And my next question is a social one, because we're also working with an international group that's trying to bring geochemists together into an international network, and talk about tough nuts to crack: socially, they only want to share the bit that they publish. They want to keep their processing as their own. This is not Australia, this is international, although it probably goes on in Australia as well. It's that social issue of how you get people to sign on to what is, from what I can see, a fairly open system. I can see people are dialling in, and you've got your authorisation worked out, but generally those processing programs are something some people aren't willing to share. Do you have that problem, or have you got a much more open community?

No, we definitely have that problem as well, and I'm happy to share information on that. It's a real spectrum: some people are really keen to share everything, and other people don't want to share anything, so we just work with where people are at in that journey.

Okay, it's good to know you've had that issue as well. Maybe we can swap ideas and war stories.

War stories is right; just let me get the knife out of my back first. Thanks, Lesley.

Lance, I think you had a question for the marine folks. I'm not sure exactly what the question is, but do you want to pose it?

Yes, I suppose one of the things we've struggled with, in that people don't like it when we do it, is that when we've hooked up people's repositories to compute, they freak out once they realise how hard compute can drive their repositories. I was wondering if you've had that experience yet, or if you've got plans for how you're going to cope with it.

Sorry, I was muted when I started talking, but you can hear me now. It definitely is a concern. The way it's set up is that everything communicates over HTTP; the APIs are all RESTful HTTP APIs, and we've had instances where we've run compute on distributed systems, where the compute could be on my local machine while pulling data down from AWS. But we don't really need to pull the data down that quickly, because with image analysis you spend a lot more time on a single image, so it's not taxing the repository as much as it's taxing the CPU or GPU you're doing the compute on. You need a high-performance machine to run these deep learning algorithms, particularly for training, but you don't necessarily need high throughput on your I/O bandwidth. We've also set up some caching in the libraries to try to alleviate this: if you're running a compute node to do some prediction on imagery, the caching limits the amount of throughput from the repository. So we haven't really experienced a problem where it taxes the repository as much as it's an expense to run the compute on the cloud.
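That caching idea, stripped to its essence, looks something like the following sketch; the cache location and function name are invented for illustration, not taken from the project's actual client libraries.

```python
import hashlib
from pathlib import Path

import requests  # third-party: pip install requests

# Simple on-disk cache keyed by URL, so repeated predictions over the same
# imagery hit local storage instead of the repository.
CACHE_DIR = Path("/tmp/image-cache")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def fetch_image(url: str) -> bytes:
    """Return the image bytes for `url`, downloading only on a cache miss."""
    cached = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
    if cached.exists():
        return cached.read_bytes()        # cache hit: zero repository load
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    cached.write_bytes(resp.content)      # populate the cache for next time
    return resp.content
```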
Okay, thank you, Ari, and good question, Lance. We don't have any more questions in the question pod and we are just about out of time. Julia, did you want to wrap up?

Sure. I just wanted to thank everybody again for your time. It sounds like there could be an opportunity for further technical discussion, and we might see whether we can arrange another conversation separately. But once again, thank you very much, and if you do have any questions for the presenters, please let us know and we'll pass them on.