Thank you very much for inviting me to do this. Oceans of data lead to rivers of information, streams of knowledge, and eventually drops of understanding. Today I'm going to talk about three aspects: principles of data quality, georeferencing, and data standards. The principles of data quality go back to a book I prepared for the Global Biodiversity Information Facility some time back; you can see the DOI there. It's also available in French, Spanish, Portuguese, Chinese and Korean. At the end of the slides I have a list of references, so you can grab those at a later date. So, users need quality information. But what do we mean by data quality? I'm going to give you a few quotes here. The first one is from the United States Spatial Data Transfer Standard: an essential or distinguishing characteristic necessary for data to be fit for use. Chrisman, who wrote a very good book on data error that I'd recommend to anybody, said: the general intent of describing the quality of a particular data set or record is to describe the fitness of that data set or record for a particular use that one may have in mind for the data. Juran, back in 1950 (Juran was a Romanian-American, regarded as the father of data quality in many ways), said: the concept of fitness for use is universal. It refers to all goods and services without exception. The popular term for fitness for use is quality, and our basic definition becomes: quality means fitness for use. English, in a book on the principles of data quality for business, said: in the database, the data have no actual quality or value; they only have potential value. The value is realized only when someone uses the data to do something useful. And finishing up, there was Strong: the quality of data cannot be assessed independently of the users of that data. So I think there's a theme running through this: data quality and fitness for use are dependent on one another. But what do we mean by fitness for use?
Now, suppose we have a question: does species A occur in Tasmania? Or: does species A occur in the Southeast Tasmanian World Heritage Area? They're two different questions. If we have a point, as shown there, with an uncertainty around it, then we can say yes, it occurs in Tasmania. So if we're making a list of species that occur in Tasmania, that's a perfectly good record. But if I ask whether it occurs in the Southeast Tasmanian World Heritage Area, then given the uncertainty, I can't say yes or no. However, too many people pull the data out without bringing the uncertainty along, and then that point looks to be perfectly good, sitting within the World Heritage Area, and they may think the data is perfectly good for them. But with the uncertainty, that data is not suitable for that purpose. So we have this notion of fitness for use: some data may be good for one purpose and no good for another. And if we look at the data quality information chain: we have a plant or animal in the field, a collector comes along and collects it and makes notes in a field book or something, a data entry operator adds it to the database, it can be edited and cleaned and released, and users get at it. But the cost of error correction increases the further along that chain you are before you find the error. So if the data entry operator says, I can't tell from this collector's writing whether it's a four or a seven, and goes back to the collector and asks, which is it, and can you make it clearer in future, the cost of correcting that error is very little. But if you get right up to the end where the users are, and somebody using the data says, no, this was collected in April, not July, surely this doesn't flower in July, they may make that correction, but then how many other records are wrong in the database? So the cost of correction keeps increasing along the chain. To quote Redman, from a book on business: assign responsibility for the quality of data to those who create them.
If this is not possible, assign responsibility as close to data creation as possible. And another couple of quotes from Chrisman on error: in general, error must not be treated as a potentially embarrassing inconvenience, because error provides a critical component in judging fitness for use. And: although most data-gathering disciplines treat error as an embarrassing issue to be expunged, the error inherent in spatial data deserves closer attention and public understanding. We can improve our data by putting it out with the errors, or at least with documentation, so that people can look at it and decide, one, whether the data is fit for their use or not, and two, whether there's a possibility of correcting it. Quite often we collect plants, look in databases, and use other databases to try and test whether the data is correct or not. And I tend not to talk too much about errors; I'll talk about suspect records. Now, this is a database of a gazetteer of localities in Brazil. We can see that there are three records there that are not in Brazil. They're very easy to find, and so we can correct them. But what does it tell us about all these others over on the right there? Are they all in the right place, or might some be in the wrong place but still fall within Brazil, so we don't know? So we have to be careful with all databases. It's very rare to find a database that doesn't have some sort of error in it. How do we find suspect records? Some errors are easy to find. If we look at these three points, they're at sea. These are data from GBIF for a North American wolf. We know that the wolf doesn't swim this far, so these are errors. But what does it say about all the others? And what does it say about these two records over here and some others?
Now, I know from chasing up this data that these are records from a zoo. If I'm working on the wild distribution of this wolf and I've got these records in, my results are going to be skewed. So we must document this information, to show that these records are cultivated or captive or whatever, so that we can exclude them from our analysis. Now, quite often cumulative frequency curves are used to detect environmental outliers. Here we're using annual mean temperature against cumulative frequency for some plants. Cumulative frequency curves pull out a fixed number of records at the top and bottom of the array, so they always pull out some records. It might be the 95th percentile or the 97.5th percentile or whatever, but they always pull out some. Now, this record down here on the left is probably an error, and it's been picked up. This one up here on the right is probably not an error, and yet it's pulled out, so if you're excluding those you'd be excluding a good record. And what does it tell you about this one? It may or may not be an error; it's hard to tell. So back in the early 1990s, in 1992, I worked on how to identify an unknown number of outliers at both the top and bottom of an array. These are records for a wattle, an Acacia, and their temperature values: annual mean temperature, minimum temperature of the coolest month, maximum temperature of the warmest month, temperature range, and temperature of the coolest quarter, warmest quarter, wettest quarter and driest quarter. Nineteen records. And you can see this one in blue isn't fitting the pattern. What I determined was: you take the distance between a record and its nearest neighbour, multiply that by the distance between the record and the mean, and divide by the standard deviation, and you get a value C, a critical value. This is a reverse jackknifing process. Jackknifing is intended to reduce the effect of outliers on a dataset.
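The statistic just described can be sketched as follows. The threshold t is left as a free parameter here (the published method derives its threshold from the sample size, which I won't reproduce), and the temperature values are invented:

```python
from statistics import mean, stdev

def reverse_jackknife(values, t):
    """Flag suspect extremes in a one-dimensional set of environmental values.

    For each value, C = (gap to its nearest neighbour) * (distance from the
    mean) / standard deviation, and the outlierness is C / t: greater than
    one means suspect. The threshold t is a free parameter in this sketch.
    """
    xs = sorted(values)
    m, s = mean(xs), stdev(xs)
    flagged = []
    for i, x in enumerate(xs):
        if i == 0:
            gap = xs[1] - x                          # lowest value: gap upward
        elif i == len(xs) - 1:
            gap = x - xs[-2]                         # highest value: gap downward
        else:
            gap = min(x - xs[i - 1], xs[i + 1] - x)  # interior: nearest neighbour
        outlierness = (gap * abs(x - m) / s) / t
        if outlierness > 1.0:
            flagged.append((x, round(outlierness, 2)))
    return flagged

# Annual mean temperatures (invented) with one value not fitting the pattern:
temps = [14.0, 14.2, 14.3, 14.5, 14.6, 14.8, 15.0, 15.1, 15.3, 15.5, 21.5]
print(reverse_jackknife(temps, t=4.0))  # only 21.5 is flagged
```

In practice you would run this separately on each parameter (the climate variables, or latitude, longitude, elevation), as the talk suggests, and look at records flagged on more than one.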
This is expanding them, to make them more obvious. And then we put that value over a threshold in this formula. I won't explain how the formula was developed, but from it we developed the concept of outlierness. Outlierness is the degree to which a record is an outlier. If we put the value C that we got before over T, as shown there, we get a value. If it's greater than one, the record is an outlier, and the degree to which it's an outlier is given by the size of that number. If it's less than one, it's not an outlier, and it's probably a perfectly good record. Now, using this on just one parameter is usually not good enough, because you get all sorts of outliers. But if you use it on two or three parameters, climate parameters, or latitude, longitude, elevation, or whatever you want, it works well. So I'll leave that there and go on to talk about georeferencing. During the COVID lockdowns last year, some colleagues and I wrote a couple of books: one on georeferencing best practices, and a georeferencing quick reference guide. We also did a couple of other books as well. Again, they're available via the DOIs at those links, and they're in the references at the end. Georeferencing Best Practices provides guidelines to best practice for georeferencing; though targeted specifically at biological occurrence data, the concepts and methods presented may be just as useful in many other disciplines. It's a live document published on GitHub using AsciiDoc. This is a new way of publishing, and we produce HTML and PDF versions. Because it's a live document, people can update it or suggest updates at any time, and we just commit them through GitHub to produce a new version of the document. It can also be translated into multiple languages using crowdsourcing in the same way; I'll show an example in a minute.
Now, we started this process to help georeference the historical legacy data in museums, of which there are billions of records around the world, an estimated three billion. But the book also gives guidelines on recording georeferences and uncertainty using either the footprint or point-radius methods, which I'll talk a bit about, and gives details on the accuracy of different methods of determining georeferences: maps, GNSS (GPS and so on), web maps like Google Earth and Google Maps, smartphones, and many others. It talks about how to record georeferences and uncertainty in difficult locations, in a cave, down a mine, underwater, et cetera, and how to record elevation, depth, and distance above the surface if you've got, say, a bird flying above the surface. The book has a very detailed and comprehensive glossary; the glossary alone is worth using if you're looking for definitions of things. We put a lot of work into it. Now, this is an example of crowdsourcing through GitHub to produce a Spanish version. People come in and, virtually paragraph by paragraph, produce the Spanish version of the text. I think Spanish is being worked on at the moment, and certainly Russian; I get comments coming through my email every day when somebody's done something new. This can then be committed to eventually produce a Spanish or other version. Now, let's talk about coordinate precision: what does the number of digits in your coordinates mean? A lot of people look at precision and try to use it as accuracy, which is crazy, particularly when things come out of ArcInfo or the like with ten decimal places. This is something from xkcd, the scientific comic site. I'll let you read them, but: you're doing something space-related if you're giving coordinates to a whole degree, and it goes right down to dozens of decimal places.
It says: either you're handing out raw floating point variables, or you've built a database to track individual atoms. In either case, please stop. So that's a good illustration for people of what the number of digits in your coordinates actually means. Now, recording uncertainty is one of the things we do a lot of. We use two methods. We use the point-radius method, which was developed by my co-author and myself independently at various stages, and we have the shape method. With these: if you've got a plant collected here where the X is and you use the point-radius method, you've got a circular area of uncertainty. But I know the plant is a terrestrial plant; it grows on land. So I can modify that uncertainty by clipping out the ocean part, and I get a polygon here, a footprint. That can be stored in well-known text format, which is one of the ISO standards. We don't use this as much as we should, and we're hoping to move towards it. There are other methods: the point method, which we don't recommend, and a bounding box. Now, we also developed a measure called spatial fit, which looks at how well the bounding box, the point-radius circle, or the polygon represents the true locality in nature. We did a lot of work on this, and then we found that somebody called RIOC in 1961 had already developed the concept. He called it the degree of compactness; it was to do with statisticians collecting statistical data. The diagram illustrates the spatial fit of a location that can be described by a polygon, a bounding box, a circle, a point, et cetera. I won't go into the details, but it's a method we're recommending people use to say how well their representation matches the true locality. So, looking at uncertainty in locality types: we did a lot of work over the years on the ways you can describe a location.
Now, if we've got "20 kilometres northwest of Canberra", what does that actually mean? Canberra has a certain extent: are you measuring from the centre or from the boundary? All those sorts of things add to the uncertainty. So, for the types of locality: you can have a feature, which is anything that can be represented spatially. It can be a feature with an extent, for example Fraser Island or Uluru National Park, or one without a well-defined extent, such as the Great Barrier Reef or Mount Hot Pippamy, where you don't know where the edges are. It can be a path, which might be a road, a river or a transect. It can be a junction, where a road crosses the Bunya Mountains National Park boundary, or the confluence of the Darling and Murray Rivers. It can be a cave, it can be a dive location, and it can be something like between two features: between Missoula and Florence in Montana. And then you've got offsets. An offset can be a distance only, five kilometres from Calgary, where you don't know the direction; it can be a heading only, south of Yass; it can be a combination of those; or it can be an offset along a path: on the Murray River, 16 kilometres downstream from Echuca. And then you've got various coordinate systems, for example geographic, UTM and others, and you can have grids and 3D shapes, et cetera. These are all ways of describing a location. Now, one of the things my colleague John Wieczorek and his brother developed some years ago was a georeferencing calculator. It's a browser-based JavaScript application where you can put information in: the coordinate source, whether you used a map or GPS or whatever, direction, offset distance, et cetera, all of which can add to the uncertainty.
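In the spirit of the calculator, here is a crude sketch of how such sources might combine into one overall uncertainty. The function, field names and numbers are all invented for illustration; the real calculator treats datum shifts, coordinate precision and direction uncertainty far more carefully than a plain sum:

```python
def combined_uncertainty_m(feature_extent_m, measurement_accuracy_m,
                           offset_distance_m, distance_precision_fraction,
                           datum_uncertainty_m=0.0):
    """Hypothetical simplification: sum the independent contributions.

    feature_extent_m: how big the named place is (e.g. 'Canberra');
    measurement_accuracy_m: GPS or map-reading accuracy;
    distance_precision_fraction: e.g. '20 km' stated only to the nearest
    10 km contributes half of 10 km, i.e. a fraction of 0.25 of 20 km;
    datum_uncertainty_m: allowance for an unknown or shifted datum.
    """
    distance_uncertainty = offset_distance_m * distance_precision_fraction
    return (feature_extent_m + measurement_accuracy_m +
            distance_uncertainty + datum_uncertainty_m)

# "20 km NW of Canberra", with made-up magnitudes for each source:
print(combined_uncertainty_m(5000, 100, 20000, 0.25, 79))  # 10179.0
```

Even with invented numbers, the point stands: the extent of the named feature and the precision of the stated distance usually dominate, not the GPS reading.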
And you can put that information in along with the datum that you've used; we have worked on every published datum we've been able to find, to allow a calculation between any one and any other. There are several hundred, nearly a thousand I think, different datums you can calculate with. And then it gives you the latitude, the longitude, your uncertainty, the output datum, the precision, et cetera. This is all available online at that location, and you can put data in and play with it and have all sorts of fun. There are also different ways you can do things like coordinate formats, et cetera. So I'll leave that and move on to standards and the need for standards. We work with a group called Biodiversity Information Standards, which has the acronym TDWG. I won't explain why that is, but it's historic, and we can't get rid of it: we try, and people keep using it. We're a non-profit organisation and community dedicated to developing biodiversity information standards. Established in 1984, there are ten current standards, which go through a maintenance process. There are two draft standards, which undergo a pretty thorough public review before they can become current standards. We have lots of proposed standards, and we have prior standards from before 2005; there are 13 of those, and they predate the process of review, et cetera, but they are standards that are in use. The current standards include Darwin Core, our biggest standard: the largest and most used. It's a data transfer standard, and it includes theme-based extensions for genetics, palaeontology, marine and others. Over a billion records have been transferred using this standard, and a new version is on the way. Everybody in the biological world uses it to transfer their data: the Atlas of Living Australia uses it, the Global Biodiversity Information Facility, iDigBio in America.
It's a well-known and well-used standard. We have another one called Audubon Core, which will probably change its name, because information about Audubon himself as a slave trader has recently come to light, so they're trying to work out another name for it at the moment. It's a set of vocabularies designed to represent metadata for biodiversity multimedia resources and collections. It looks at things like regions of interest for images: if you've got an image with an insect in one corner, how do you describe that in a standard way? Sound is easier: you might have the start of the sound bite and the end of the sound bite at certain times. Video is difficult, and we're still working on that one; it's complicated. For example, if you're an underwater diver and you've taken a lot of video of fish, you might have a fish swimming through that video, so you want to be able to document the start time and the end time, but also where in each of the frames that fish occurs. They're trying to work on that at the moment. And there are views for images: is the image from the front, the back, the top, the bottom? If you're talking about a wing, you can really only have a top and a bottom; front or side doesn't mean much. So that is being well developed. We have others, like TAPIR, the TDWG Access Protocol for Information Retrieval, a standards documentation standard, a vocabulary maintenance standard, and more. The two draft standards at the moment are the Global Genome Biodiversity Network data standard and a natural collections descriptions standard. The second one looks at how you describe collections within a museum, et cetera. The first is a set of vocabularies designed to represent the tissue, DNA and RNA samples associated with voucher specimens and collections within a museum.
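Several of these standards build on Darwin Core's flat term/value model mentioned above. Here is what a minimal occurrence record transferred under that model can look like; the term names are genuine Darwin Core terms, but the values are invented:

```python
import json

# A minimal Darwin Core occurrence record as flat term/value pairs
# (the term names are real Darwin Core terms; the values are made up).
occurrence = {
    "occurrenceID": "urn:catalog:EXAMPLE:12345",
    "basisOfRecord": "PreservedSpecimen",
    "scientificName": "Acacia dealbata",
    "eventDate": "1992-04-17",
    "country": "Australia",
    "countryCode": "AU",
    "decimalLatitude": -35.2809,
    "decimalLongitude": 149.1300,
    "geodeticDatum": "EPSG:4326",
    "coordinateUncertaintyInMeters": 3000,
}

# Records like this are normally exchanged as rows in a Darwin Core
# Archive; JSON is shown here just for readability.
print(json.dumps(occurrence, indent=2))
```

Note that the datum and the uncertainty travel with the coordinates, which is exactly what the georeferencing discussion earlier was asking for.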
The developing TDWG standards: these are all the interest groups that are looking at developing standards, and each may develop one or more. You've got annotations, where we're working with the W3C annotations group. We're looking at how you annotate information: if you find an error in data, how do you annotate that so it goes back to the owner of the data, or so that users can see, hey, I think this is wrongly named or something? So we're looking at the W3C annotation system and how we can develop that for biodiversity. There's attributions; there's data quality, which I'll talk about in a minute; there's biological interactions data, if you're talking about a bee pollinating a plant, et cetera, that interaction. We have a lot of citizen science information, and a group trying to look at data quality and other issues for citizen science. There's a geographical schema that we're trying to update. And there are machine observations, which I'll talk about very briefly in a minute. And then all the taxonomic stuff: standards for plant names, animal names, viruses, et cetera. Looking at the data quality one: I'm the leader of the data quality interest group, and we have several task groups under it. The first one was on developing a framework for data quality. This largely came out of a PhD thesis by a young fellow in Brazil. It basically breaks data quality down into three parts. You have a profile, which is based on a use case: what is the data I need for this use case? What are the information elements for that use case that are valid for my purpose? For example, if I'm doing distribution modelling, I might need latitude, longitude, et cetera. How do I measure quality in the context of the use case? What is the validation policy, and what does the quality status need to be for the data to be suitable for my particular use? And how do I improve the quality of the data in that use-case context?
Then you have the solutions, which are a set of methods, technical specifications, mechanisms and tools that test or act on the data against all the profile requirements. And then you have a report, which tells you what data is fit for use and what is not, with the measures used, the validations and the improvements. This is a fairly complicated piece of work by Allan Koch Veiga; we published it in 2017, and it's a very mathematical representation. In 2020 we put out a different paper that the framework comes into, where we've tried to explain it in a way that's more accessible to individual users. Now, one of the other task groups is on data quality tests and assertions. We've developed 99 tests for testing biodiversity data in museums, herbaria, et cetera. Sixty-five of these are validations, which flag suspicious or invalid values in one or more of the Darwin Core terms. We have 26 amendments, which go towards improving the data; five measures, which tell you things like how many records have been fixed, et cetera; and three notifications, which I won't go into. Twenty-eight of the tests are time-related: related to an event, a start date, time, day, year, et cetera, with many tests for ISO 8601 compliance. We have 41 space-related tests, which deal with coordinates, datums, uncertainty, precision, depth, elevation, and geography: countries, country codes, the higher geography, et cetera. We have 25 tests related to taxonomic names, ranks and classification, and there are 22 others, covering things like data licensing, establishment means (whether a plant is introduced to an area, for instance), and the basis of record, whether it's a specimen, an observation, a citizen-science record, a tissue sample, et cetera. Now, this is an example of some of the space tests: decimal latitude empty; geography standardised (that one's an amendment); and validations such as the maximum elevation is out of range, or the coordinate uncertainty is out of range.
The minimum elevation is greater than the maximum elevation. Tests like this. And each one is fully documented in a standard way, and each has a GUID. This one is maximum elevation out of range, with the expected responses you get from running the test. You have "internal prerequisites not met" if the maximum elevation in metres is empty or the value is not a number. It's "compliant" if the value is within the parameter range, which you can see there below, and otherwise it's "not compliant". And we have parameters: there are default values for the parameters, but you can set your own. If you're working in Australia, you don't need a minimum elevation of minus 430 metres; you might just use minus 35 or whatever, and similarly for the maximum elevation. I'll finish up on one of the other interest groups that are working on standards: the machine observations one, because I know that's probably of interest to many people here. It's available on GitHub. The convener is Peggy Newman from the Atlas of Living Australia here in Australia, and it has people from the Ocean Biodiversity Information System, Swedish LifeWatch, the Ocean Tracking Network, the Max Planck institutes and others. It's looking at biological observations inferred from sensor data: for example radio telemetry, GPS tracking, acoustic telemetry, camera traps, acoustic monitoring, video monitoring, et cetera. For example, putting a tracker on an albatross or a turtle and following it, and how we document the quality of that information. They're working with the sensor developers and the builders of the equipment to look at how they document information. If you're interested in this, I'd suggest you contact Peggy Newman.
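The parameterised elevation test described a moment ago can be sketched like this. It's a simplification, not the standard's reference implementation; the default bounds are roughly the lowest and highest land surfaces on Earth, and the narrower bounds in the usage line are just an example of regional overrides:

```python
def validation_maxelevation_inrange(max_elevation_in_meters,
                                    min_param=-430.0, max_param=8850.0):
    """Sketch of a parameterised range validation with the three expected
    responses described in the talk. Defaults approximate the lowest land
    surface (about -430 m) and the highest (about 8850 m)."""
    value = max_elevation_in_meters
    # Internal prerequisites: the field must be present and numeric.
    if value is None or str(value).strip() == "":
        return "INTERNAL_PREREQUISITES_NOT_MET"
    try:
        number = float(value)
    except ValueError:
        return "INTERNAL_PREREQUISITES_NOT_MET"
    # Compliant only if the value sits within the parameter range.
    if min_param <= number <= max_param:
        return "COMPLIANT"
    return "NOT_COMPLIANT"

print(validation_maxelevation_inrange("1200"))   # COMPLIANT
print(validation_maxelevation_inrange("9999"))   # NOT_COMPLIANT
print(validation_maxelevation_inrange(""))       # INTERNAL_PREREQUISITES_NOT_MET
print(validation_maxelevation_inrange("500", -35.0, 2500.0))  # regional bounds, COMPLIANT
```

The key design point is the same as in the standard: the test never guesses; if its prerequisites aren't met, it says so rather than returning a pass or fail.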
Now, this is just a small sample of the activities in the field of biodiversity and biodiversity data quality. There's lots more being done, and lots more to be done. As I said, there's a series of references at the end. Thank you.