Thank you for allowing me to present today. I was originally going to present with Elia from Data Links; however, Elia has some challenges with water ingress at his premises today, so he's unable to join us, and he sends his apologies. Just to give you a very quick background: Elia and I have collaborated with a lot of academic and commercial research organizations to maximize the value of data. For me personally, I've been at NetApp for four years now, and I work with most of the major academic institutions around Australia, and over in New Zealand as well. I've presented at eResearch Australasia and eResearch New Zealand recently, and I continue my work both on a commercial basis and by volunteering in community activities to help enhance the value of research for all Australians. One of my focal areas at the moment is Indigenous research, which is a very interesting area because it crosses over a number of quantitative and qualitative areas and domains of interest. I'm currently active in a few projects there, but some of them are commercially sensitive, so I can't mention what they actually are.

Okay. So today's topic is data quality assessment, but before we dive into that, I'd like to frame where data quality assessment belongs: it's part of data quality management. The data quality management domain is specifically about making data fit for consumption and meeting the needs of data consumers, and it's subjective, because the quality of data relates to how it's going to be used, who's going to use it, and what outcomes we're trying to drive from the data. In today's presentation, I'll go through a few examples of where data quality assessment is essential for driving academic and commercial outcomes.

Right. So we need to assess the quality of data, because not doing so can lead to some big issues. In these three examples, which I'll go through quickly, there are elements of both commercial and academic challenges caused by a lack of ability to assess the quality of data. In a Spanish submarine project, a misplaced decimal point led to submarines being 70 tons heavier than planned. That's an issue from a cost perspective, but also for the people operating the submarine: if it can't resurface, it's going to drop to the bottom of the sea, and that's a real problem socially as well as for the lives of those people. Another example is the Air Canada incident, where there was an error in calculating how much fuel to put in the plane, and as a result the plane crash-landed (there's a quick back-of-the-envelope version of that calculation below). And the third one is Enron, an energy company that did a lot of R&D. One of the challenges they had was errors in their spreadsheets: when an audit was done, almost a quarter of their spreadsheets contained serious errors.
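As a sense of how small that kind of calculation error can be and still have catastrophic effects, here is a back-of-the-envelope sketch of the unit mix-up generally reported for that Air Canada flight (the 1983 "Gimli Glider"), where a fuel density factor in pounds per litre was applied where kilograms per litre were needed. The figures are illustrative approximations, not the exact flight records.

```python
# Back-of-the-envelope sketch of the reported unit mix-up on Air Canada
# Flight 143 (the 1983 "Gimli Glider"). Figures are illustrative
# approximations, not the exact flight records.
fuel_required_kg = 22_300        # roughly what the flight plan called for
litres_on_board = 7_682          # fuel measured in the tanks, in litres

density_kg_per_l = 0.80          # correct conversion: kilograms per litre
density_lb_per_l = 1.77          # factor actually applied: pounds per litre

# The litres-to-weight conversion used the pounds factor, but the result
# was read as kilograms, so the fuel on board looked like twice as much.
assumed_kg = litres_on_board * density_lb_per_l   # ~13,600 "kg" (really lb)
actual_kg = litres_on_board * density_kg_per_l    # ~6,100 kg in reality

print(f"believed on board: {assumed_kg:,.0f} kg")
print(f"actually on board: {actual_kg:,.0f} kg")
print(f"shortfall vs plan: {fuel_required_kg - actual_kg:,.0f} kg")
```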
I'm pretty sure a lot of us use tools like R, Python, or Excel to analyze data, or maybe even Power BI these days. With all of those tool sets, if you feed in data of insufficient quality, you may get anomalous results that can have negative impacts.

Around the research data lifecycle, data goes through several steps. The first one is creation, at the top there, which is where data gets ingested or created by a model, for example. Then around the lifecycle there's processing, analysis, preservation, and access and reuse. If we have time afterwards, we can dig a bit more into each of the steps, but at each step, errors can creep into data and metadata.

One of the key challenges we see is sensor data errors. Sensors can be inaccurate: they can get wet, for example, which can affect the electronics, or the orientation of the sensor might not be correct. There could be a data stream coming from a sensor with an interruption in it, and we don't know what caused the interruption, so there may be anomalies caused by a loss of data in the stream or an error in transmission, and we might not know what they are. There can also be issues with code or APIs. In the work that we do, we often have to pass data between different APIs, and sometimes the data format gets corrupted in between, or a field is truncated. Back in the days of mainframes, we saw screen-scraping applications have problems pulling data out of those environments: the fields got moved on the screen, the screen-scraping application was still trying to scrape the same characters, and as a result the data moved over and was read incorrectly. We can get false positives and aliasing, and I'll be looking at an example of that a little further into the presentation. Human error is always an issue, like somebody copying and pasting data from the wrong folder to the wrong folder. And of course, a poor or missing data architecture can result in systemic data quality failures.

So how do we assess what those failures are? How do we quantify them? The organization Precisely offers several different metrics that can be used to quantify these errors at a very basic level: how many errors we have relative to the size of the data, the number of empty values, and error rates when data is transformed. When data is composed together, how many blanks do we find where there should be data? How much unusable data is there, for example anomalous results from sensors that are out of range? Say we're looking at a sensor that should return a value between zero and a hundred, and it's providing numbers like 10,000 or a million: we know it's out of range, but is that actually the correct data, or is it just the sensor playing up? And in a more commercial environment, where people are looking at email campaigns, there are metrics like email response and bounce rates.
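As a rough illustration of those basic metrics, here is a minimal Python sketch over a hypothetical sensor data set, using the zero-to-a-hundred valid range from the example above. The column names and values are mine, not from any particular product.

```python
import pandas as pd

# A minimal sketch of the basic quality metrics discussed above, using a
# hypothetical sensor data set. Column names and values are illustrative;
# the valid range of 0-100 comes from the example in the talk.
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-03-01", periods=8, freq="h"),
    "sensor_value": [12.0, 55.3, None, 98.1, 10_000.0, 47.2, None, 63.9],
})

s = df["sensor_value"]
total = len(df)
empty = int(s.isna().sum())                       # empty values
out_of_range = int(((s < 0) | (s > 100)).sum())   # unusable, out-of-range readings
usable = total - empty - out_of_range

print(f"empty values: {empty}/{total} ({empty / total:.0%})")
print(f"out of range: {out_of_range}/{total} ({out_of_range / total:.0%})")
print(f"usable data:  {usable}/{total} ({usable / total:.0%})")
```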
The last two metrics on that list I would challenge: are they actually data quality issues? Data storage costs are not really a quality issue; that's more of a cost issue. But poor data quality can lead to more runs having to be made to resolve those issues, which can result in higher storage costs. And the amount of time it takes to get to an answer is not really a quality issue either, but bad-quality data can impact the time it takes to get to those outcomes.

Here's an example of how data quality impacts my personal work at NetApp. This is a real, live example; I've obviously cut out the customer identities around the edges. What you can see there is that we ingest data from our tool sets that analyze data on remote endpoints. We collect things like thermal data and physical operational data from systems, as well as data around performance and the presence of data assets. We collect all that data from the field, and we aggregate and average it. Sometimes we find holes in the data, and I'll go into that in a little bit. In the top right, you'll see that we analyze and preserve the data: we do this initial preprocessing and then save the data sets after we check them. And in the third step, we re-access that data to create solutions, and we require high accuracy to size those solutions correctly for our clients. In the same way, in an academic environment, data of sufficient quality would allow a scientist to produce results and publish a paper where the data can be relied on.

Now here's an example from a real-world R&D client I worked with recently; you can see the three pie charts there. They asked us to do some detailed performance analysis, but we encountered some technologies with legacy code in them that didn't return all the data. In that first chart, 3% of the data came back without any performance data associated with it; in the second, 1%. Those two were considered to be of sufficient quality to drive the outcomes we were looking for. The third one is the standout: 93% of the data that came back didn't have performance data, and that was a problem. We noted it straight away, went back to the client, and proposed that, to solve the problem, we take the performance profiles from types A and B and use them as a projection to fill out the rest of type C as an extrapolation. We wrote some algorithms to do that and went through the process with the client, and they were quite happy with the outcomes. As a result, they were confident to invest further in the IT platform required to support their future R&D projects.

In our next example, for a very large research organization in Australia, they do voyages that collect data, and they asked us to help them create a data pipeline that could be automated but would also do automatic quality checks along the pipeline. At each stage, one, two, three, and four, the platform has built-in checks that store the results of those quality checks back in the data hub. What that means is that as the data moves through the pipeline, all of the quality checks are being stored and can be viewed in a data quality dashboard.
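To make that pattern concrete, here is a minimal sketch of the idea: each pipeline stage runs a built-in check and writes the result back to a hub that a dashboard could read from. The stage names, checks, and the QualityHub class are hypothetical illustrations of the pattern, not the actual platform's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# A minimal sketch of the pipeline pattern in the voyage example: each stage
# runs quality checks and records the results back to a hub, so a dashboard
# can show quality as data moves through the pipeline. All names are
# hypothetical, not a real product API.

@dataclass
class QualityHub:
    results: list = field(default_factory=list)

    def record(self, stage: str, check: str, passed: bool, detail: str = "") -> None:
        self.results.append({
            "stage": stage,
            "check": check,
            "passed": passed,
            "detail": detail,
            "at": datetime.now(timezone.utc).isoformat(),
        })

def stage_ingest(records: list[dict], hub: QualityHub) -> list[dict]:
    # Stage 1: drop records with empty values and log the check outcome.
    clean = [r for r in records if r.get("value") is not None]
    hub.record("ingest", "no_empty_values", len(clean) == len(records),
               f"{len(records) - len(clean)} empty of {len(records)}")
    return clean

def stage_validate(records: list[dict], hub: QualityHub) -> list[dict]:
    # Stage 2: enforce the expected 0-100 range from the sensor example.
    in_range = [r for r in records if 0 <= r["value"] <= 100]
    hub.record("validate", "value_in_range", len(in_range) == len(records),
               f"{len(records) - len(in_range)} out of range")
    return in_range

hub = QualityHub()
data = [{"value": 42}, {"value": None}, {"value": 10_000}]
for stage in (stage_ingest, stage_validate):
    data = stage(data, hub)

for row in hub.results:   # the quality dashboard would read from this store
    print(row)
```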
That data quality dashboard is a work in progress at the moment. What it means is that data scientists will be able to look at the relative quality of the data, against their criteria, as it moves through the pipeline, and associate it with the tags for the data as well. So they'll be able to know whether something is an original copy of the data or a derivative copy, and assess its quality against the quality benchmarks.

This example is from a commercial banking client. What we found as we were measuring data in the environment is that something changed on the 4th of March, as you can see. Taking the average and peak readings over the whole time period can mask the nuances that reveal those insights, so understanding and looking at the time-series data is important as well. We also found one spike in performance, as you can see in the bottom graph; that was a demand spike, and it only happened once in the whole month. So when we were looking at how to design the solution using this data, we looked at those two dotted lines. The top one is the absolute peak in the sampling period: if we design for that, we're likely to over-serve and waste money. The bottom line is the average: if we design just for the average, we're likely to under-serve and cause attrition through dissatisfaction, because things take too long, transactions don't complete, and so on. So ideally, we want to design to about that midline: the highest-frequency peak level, which is around about the 20 mark. That's what we chose to design to, and the customer was happy with that (there's a small sketch of that sizing logic at the end of this section).

These are some of the challenges we face with commercial data quality, but they also apply to research data where the research is tied to a commercial outcome. We've been talking to some scientists about the increasing pressure to commercialize their research, and being able to look at time-series data, analyze the quality of data over time, and zoom in on those anomalous spikes that might impact the quality of the outcomes is becoming more and more important.

The fourth example is at a more strategic level: we're working with a commercial research organization spun off from a university. Some of the challenges they face are around the velocity, accuracy, and cost-effectiveness of the instrument survey results they're producing. They're working with very large data sets, potentially hundreds of terabytes each, and they need to assess the quality of that data quickly. So we're currently working with them to create a framework in which they can quickly determine what the appropriate level of quality is.

What's common in all of these examples is that there's a certain tool set we use in our work, so I just wanted to share a typical example of what we do. To assess the quality of data, in both commercial and academic environments, we collect data from the environment using a couple of different sets of tools: on the bottom right, you see we collect data from our data infrastructure, and on the top right, from scientific research. The data hub on the left then aggregates the data together and does feature extraction: it looks at what the master data model looks like and what quality features we're looking for.
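Coming back to the banking sizing example: here is a minimal sketch of that logic with synthetic demand readings. Using the mode of a histogram of daily peaks is just one simple stand-in for the "highest-frequency peak" midline; the real analysis would be tuned to the actual data.

```python
import numpy as np

# A minimal sketch of the sizing logic from the banking example, with
# synthetic demand readings: design neither to the absolute peak
# (over-serving) nor to the average (under-serving), but to the most
# frequent peak level.
rng = np.random.default_rng(seed=1)
readings = rng.normal(loc=14, scale=4, size=30 * 24)  # a month of hourly demand
readings[500] = 85                                    # the one-off demand spike

daily_peaks = readings.reshape(30, 24).max(axis=1)    # peak reading per day

average = readings.mean()
absolute_peak = readings.max()

# Approximate the "highest-frequency peak" as the mode of a histogram of
# daily peaks -- one simple stand-in for the ~20 midline in the example.
counts, edges = np.histogram(daily_peaks, bins=10)
mode_bin = counts.argmax()
design_point = (edges[mode_bin] + edges[mode_bin + 1]) / 2

print(f"average:       {average:5.1f}  (under-serves)")
print(f"absolute peak: {absolute_peak:5.1f}  (over-serves)")
print(f"design point:  {design_point:5.1f}  (most frequent peak level)")
```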
Beyond feature extraction, the data hub can do versioning, tagging, correlation, and anomaly management, and it can do data federation and migration. What it delivers is the ability to make data quality discoverable, observable, and actionable. Using the data hub, if a quality problem is detected, we can go back to the data scientists and have the conversation earlier about where the data problem is coming from and how to solve it.

Underpinning that is the architecture we call the data fabric. The data fabric is a design concept that allows research organizations to create the right environment to manage data quality efficiently. Sometimes what we've found is that the cost of managing data quality, if it's done incorrectly or suboptimally, actually exceeds the benefit it provides. That's challenging, because although we can come up with good ideas about what data quality should look like, the commercial reality is that somebody has to pay for the tools and components to observe, measure, and enforce data quality, and that is sometimes where things fall over. Often we hear sentiments like "great idea, but it looks too costly and expensive to implement." The data fabric is a way of breaking that down into its constituent components, so each component can be looked at incrementally and those data quality benefits can be delivered without having to boil the ocean, so to speak.

So, in conclusion: when we look at data quality, data has to be fit for consumption and meet user needs. Quality issues can arise anywhere in the lifecycle, and they impact both academic and commercial research. The outcomes can be devastating or differentiating. Good-quality data that's managed efficiently can be a real differentiator, whether that means being able to publish a research paper that can be relied on or achieving a commercial outcome that sets you apart; and if the data quality isn't right, it can be devastating, impacting people's lives in a significant and sometimes permanent way. So the best practice is early detection and correction, and that can only be done with the right data architecture and the right tools to ensure it's sustainable. Any questions?