It's great to be here to speak with you today; thank you very much for having me, and thanks to Ivana for the invite. I'll be talking about some work that Ivana and I did together as part of an Open Geospatial Consortium (OGC) testbed.

Before I get into that, for those on the call who haven't come across FrontierSI before: we're a not-for-profit focused on uplifting and enabling collaboration across the space and spatial sector. In my role here I work as a senior data scientist, which means I do a mix of machine learning across different applications, with a particular focus on Earth observation. That's the context that motivates a lot of this work. So I'll share my screen and get into the presentation.

To give you some context for where this piece of work came from, it was done as part of an OGC testbed. The testbeds are innovation programs whose outcomes can be used to help guide standards. This particular testbed focused on location interoperability, and under the Advanced Models and Data thread there was a specific task on machine learning training data sets. Our goal in taking on this topic was to investigate what current best practice actually is in how people document machine learning training data sets, but also how people are creating these data sets, how they are being used, and what the implications are for reuse. We wanted to capture that as part of the testbed so it could inform any future standard developed by OGC.

Three people were involved from FrontierSI: myself and my colleagues Madeleine and Kate, together with Ivana, who often works with us as a partner. We also worked on this project with a UK-based company called Pixalytics. Together we could provide good coverage of different global data sets, of current practice from Australian governments trying to work with machine learning, and of global use cases as well.

I think it's important to start with what machine learning, and what a training data set, actually is. I tend to define machine learning as any process that identifies a pattern in data without being explicitly programmed. If you manually write a decision tree that says "if the value is greater than this threshold, classify it as this class", that's not machine learning; but if you show an algorithm a large amount of data and it arrives at that same decision tree on its own, that is machine learning.

The training data, which is what I'm really here to talk about with regard to data quality, is the data that goes into that machine learning process. The algorithms we use to train these models matter, but it's the training data itself that defines the problem space: it sets up what the purpose of a model is and what it's trying to identify in future data.

And the reason this work was motivated is that there are a lot of instances where we're using machine learning to create spatial data sets.
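To make that "explicitly programmed versus learned" distinction concrete, here's a tiny sketch, assuming Python and scikit-learn; this is my own illustration rather than anything from the testbed:

```python
# Explicitly programmed: a hand-written rule is not machine learning.
def manual_rule(ndvi):
    return "tree" if ndvi > 0.6 else "not tree"

# Learned from data: showing labelled examples to an algorithm that finds
# the split on its own IS machine learning.
from sklearn.tree import DecisionTreeClassifier

X = [[0.1], [0.2], [0.3], [0.7], [0.75], [0.8]]  # e.g. a vegetation index per sample
y = ["not tree", "not tree", "not tree", "tree", "tree", "tree"]  # the labels

model = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(model.predict([[0.65]]))  # the learned threshold approximates the manual rule
```

The point of the sketch is that the training pairs (X, y) are what define what the model is for; the algorithm just finds the pattern in them.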
To give an example of that, I had the opportunity to work with the Victorian Department for Environment on using image recognition techniques to map trees across the whole state of Victoria. Machine learning offers us a really useful tool for automating the creation of very large foundational data sets.

However, there is currently no standard for how we document these kinds of training data sets. Best practices are starting to emerge, but OGC really wanted to look into creating a standard, and the reason that's important is that if we don't understand what our training data describes, then we don't have a clear picture of how it can be reused and added to over time. Documenting it is an essential part of treating the creation of these data sets as an investment, and of being able to understand what effort they were targeting.

I would say there are a couple of key barriers to whether training data sets can be reused. The first is that they're often highly specific. Take my application of detecting trees in Victoria: I created a data set that digitized the presence and absence of trees across a stratified sampling of Victoria. So I have really good data about trees in Victoria, but whether it will apply to other places is hard to know.

The second is that they're extremely expensive to create. When we created that data set, it took a lot of people-hours to manually digitize imagery: looking at images and outlining the trees. And to do that with high quality, you need a lot of domain knowledge. You might think that, given some aerial imagery, it wouldn't be so challenging to distinguish between trees and the ground, but it can be quite hard to distinguish between trees and tall bushes. If the problem is really about trees and not tall bushes, then having those mixed classes come into your training data set is likely to affect the overall quality of your solution.

Given all of these factors, people tend to feel they need to create new data whenever they come up with a new problem. I think that if we had better metadata about how these data sets are generated, and about their overall quality, people could make clearer assessments about whether the data can be reused. For example, if there were higher-level information for our trees training data set that said "these are the bioregions across Victoria in which we have training samples", and that linked to other information about the kinds of vegetation you see in those areas, that would give people much better insight into whether the data can be reused for a different purpose, perhaps right down to whether it covers a particular species they're interested in. They may not use the whole data set, but if they can learn enough about it, that might help them reuse some of it.

So that's partly how quality plays into training data, and why metadata is such an important part: it's how people interface with these data sets and understand whether they're applicable to their problem. A sketch of the kind of metadata I mean follows.
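Here's a rough sketch of that kind of dataset-level metadata. Every field name and value is hypothetical, made up for illustration; none of it is taken from the draft OGC standard:

```python
# Hypothetical dataset-level metadata for the Victorian trees training data.
# All field names and values are illustrative, not from any standard.
trees_training_metadata = {
    "name": "victoria-trees-training",
    "task": "binary classification: tree presence/absence",
    "labelling_method": "manual digitisation from aerial imagery",
    "sampling_design": "stratified sampling across Victoria",
    "spatial_coverage": {
        "bioregions": ["Gippsland Plain", "Victorian Volcanic Plain"],  # example values
        "bbox": [141.0, -39.2, 150.0, -34.0],  # approx. Victoria, WGS84
    },
    "known_gaps": ["no samples in the Mallee region"],
}
```

It's exactly this sort of record, which bioregions are covered, how sampling was done, what's known to be missing, that lets someone else judge whether the data transfers to their problem.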
As part of the work we did, when we were thinking about how a standard could be developed, we talked about what it means for a training data set to be of high quality, in the sense of whether people would want to reuse it. What we discussed is that there are really two kinds of quality when it comes to the training data themselves.

The first is accuracy: how well does your training data reflect reality, and in particular the reality you're trying to measure, map, or capture? The second is consistency: how internally consistent is your data set? When one piece of your data says something is a tree, how similar is that to other records, and how often do people actually agree that it's true? If you gave the same image to five different people, would they all give you the same label back? (The label is what we use in machine learning as the target for the algorithm: the thing you're trying to detect, map, or measure.)

What can measuring these things actually look like? For accuracy, it might be understanding how many of the items in your training data set were created by a subject matter expert. Depending on exactly what you're digitizing and how you're creating your training data, these are likely to be the highest-quality examples: it's fair to expect a subject matter expert to be more reliable and consistent in how they assign labels, and their labels are more likely to reflect what's actually happening in the world.

For accuracy, statistical and spatial distribution is also really important. If I say my data set is representative of all types of trees in Victoria, then I need to be able to provide evidence that I actually capture that statistical range and spatial distribution. Otherwise someone might come along, look at my data set and say, "I think I can use this up in the Mallee region specifically", but if I haven't provided data there, there's no reason this particular training data would solve their problem. Being able to communicate the domain over which the data applies is an important part of accuracy, as is whether you're picking up enough of the statistical patterns, whether your data is actually sampling the underlying distribution. That can be captured in number-based metrics, but it might also come with human-readable documentation: what were your selection methods for the training data, and how did you split up your area of interest and sample from it?

There might also be things like the identification of outliers or duplicates. For example, when someone reviewed the initial data set, they may have found that one of the people doing the labelling had a slightly different definition, and consequently those labels couldn't be trusted as highly as the others. That comes into the point about consistency: how confident are you that when your labelling process said a sample was a particular species, that's actually true? A sketch of a couple of these checks follows.
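As a minimal sketch of what a couple of these checks could look like, assuming Python and a toy record structure I've invented for this talk (and using plain majority agreement rather than a chance-corrected statistic such as Fleiss' kappa):

```python
from collections import Counter

# Hypothetical training records: the label, whether a subject matter expert
# created it, and the stratum (bioregion) it was sampled from.
records = [
    {"label": "tree",     "expert": True,  "bioregion": "Gippsland Plain"},
    {"label": "tree",     "expert": False, "bioregion": "Gippsland Plain"},
    {"label": "not tree", "expert": True,  "bioregion": "Victorian Volcanic Plain"},
    {"label": "not tree", "expert": False, "bioregion": "Victorian Volcanic Plain"},
]

# Accuracy-style checks: how much of the data did experts create, and how
# are the samples spread across the strata the data set claims to cover?
expert_fraction = sum(r["expert"] for r in records) / len(records)
coverage = Counter(r["bioregion"] for r in records)

# Consistency-style check: several people label the same sample; what
# fraction agree with the majority label? Low agreement flags the sample
# as unreliable for training.
def majority_agreement(labels):
    _, top_count = Counter(labels).most_common(1)[0]
    return top_count / len(labels)

one_sample_labels = ["tree", "tree", "tall bush", "tree", "tall bush"]

print(f"expert-labelled fraction: {expert_fraction:.0%}")  # 50%
print(f"samples per bioregion:    {dict(coverage)}")
print(f"labeller agreement:       {majority_agreement(one_sample_labels):.0%}")  # 60%
```

Per-sample agreement like this can then be averaged across the whole data set, which is the dataset-level view I come to next.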
One thing that can be useful here, if things are being digitized from imagery, is whether there's a guide that people used to arrive at consistent definitions and classifications. That's the more qualitative side. On the quantitative side, you can ask: if I gave the same sample to five different people, did they all give it the same classification, or were the values mixed? Mixed values imply that over a very large data set you'll have that same inconsistency, and that makes the data less reliable for consistently predicting an outcome, because if the people who label your data can't agree, it's going to be very hard to train a machine learning model to have a consistent opinion. You can look at that both for an individual sample and across the whole data set (the agreement check sketched above is the per-sample version).

I've touched on some of this already, but the role metadata really plays is this: training data sets are created to solve specific problems, but because they're expensive, we should be looking at ways to support people in reusing them. For example, if we take ground-based samples of crops in a particular African country, then if we can communicate how that sampling was done, people could add more to the data set in a consistent way. Alternatively, they might be able to say, "this other country is in a similar climate zone, with similar majority crops, so we can apply the data there". The problem is that if people aren't exposed to those considerations, to the domain the data was created for, they may apply the data without that understanding, and that may lead to poor results: even though data might be high quality for one particular problem, it may not be sufficient for another.

So a lot of our recommendations from this testbed work were to point out that we can use metadata to capture these things. We also note that it may be possible to automate some of the accuracy and consistency metrics I mentioned earlier, but I think much of the domain information is something you need to encourage people to think about when they've created their data, so that they can describe what makes their data set a quality data set and allow others to judge that for their own problems.

In terms of the testbed work we conducted, the quality consideration was just a small part of it. We have released the full engineering report online, and I can provide a link if people are interested. It also goes through what the best practices are, how people are using training data to create spatial data sets, and what needs to be present in a standard in order to capture meaningful information about training data sets so that they can be reused.

One really great outcome from this project, and a huge amount of thanks for this goes to Ivana, who was really dedicated in working with the people drafting the actual standard, was that we were able to give direct feedback on a lot of the ideas we came up with to the domain working group,
or the standards working group, I get the terminology confused, sorry. That feedback actually led to changes in the standard that I think are for the better and more aligned with what people are actually doing out there in practice. Finally, that draft standard has gone out for public comment, which closed back in April.

Liz, I noticed you put your hand up, but I'm about to do my conclusions and then we can probably go to questions.

So, in summary: machine learning training data sets do have a data quality element, and, having come into this without having thought so much about data quality before, I think it would be interesting to talk further with this group about how those considerations of consistency and accuracy show up, possibly in other data sets too. And for the project as a whole, I think we gained a lot from being able to look at the way people use training data and to think about how we can structure metadata to capture the things that are really important and support that kind of reuse. That's everything I had to share with you today.