In this session, we will cover the following learning objectives. First, we want to describe the principles of data quality. Second, we want to identify the tools in DHIS2 for data quality. Third, we want to verify that data quality procedures are in place. Fourth, we want to confirm the use of data quality tools in DHIS2. And finally, we want to verify that data use practices are in place and being used.

So let's start with some data quality principles. The WHO has defined data quality in its data quality review guidelines. There are three separate guidelines, linked on the slide. These are a really good starting place for understanding the foundational principles of data quality, conducting data quality reviews, and even using DHIS2 to enable data quality reviews.

If you're conducting data quality reviews, let's talk a little bit about frequency. Different programs, different countries, and different levels will have different frequencies of data quality reviews. Typically, a monthly data quality review is the most common. This is often done at health facilities or at district level: as new data comes in every month at the lowest levels, health facility and district staff check the data that has just come in. Annual data quality reviews are often done at national level, maybe regional level. These are large events where multiple programs review all of their data over the course of the year. They don't get into the level of granularity or detail of the monthly reviews at lower levels; these annual reviews look at annual trends, find big outliers, and review general data quality metrics across the entire country and the entire year. Different programs will also have their own periodic reviews. Programs that run only during the rainy season or during a certain outreach campaign may do data quality reviews only at the end of those campaigns, or during them, in an ad hoc way.

There are different methods for doing data quality reviews. The first is the data quality verification survey. Surveyors go out and look at different sites, in this case maybe health facilities, community health workers, or different districts, and they assess the accuracy of the data compared with the facility register or the actual hard copy that the facility holds. They physically go and ask: does the data in DHIS2 match what has actually been recorded on the paper records? They also assess the availability of inputs, for example guidelines, staff knowledge of data quality, and the standard operating procedures at the various levels, to see whether these are being adhered to. They also assess use, and it's very important to assess the use of data. As we discussed previously in the data use session, data quality and data use are tied together: we cannot have high data use if we have poor data quality. So it's important to look at how the data is being used in a routine, day-to-day way. That gives us some indication of whether the data is trusted and reliable, whether the people at the lowest levels feel they can use it, what they are not using, and why they are not using it. Is it because of data quality issues?
Is it because they don't trust the data? So looking at the use of the data, not just the data quality, is also very important when we're doing these data verification surveys. The final point to make here is that travel is typically required. This means you can't necessarily go to all sites, all health facilities, all districts, all the time. You may have to do this in an ad hoc way, perhaps going to certain sites or districts that have historically struggled with data quality to do these verification surveys. But it is important to note that travel is typically required.

The second method to note here is the data desk review. In this one, as the name implies, no travel is required. You're sitting at a desk, looking at the routine data coming in, usually monthly or more frequently, and you're interrogating that data using the whole host of DHIS2 tools and functionalities available to you. You're looking for outliers. You're looking for consistency over time, and for internal and external consistency, which are concepts we'll cover later in the presentation. And you're using DHIS2 to drill down into specifically where the problems are, then hopefully reaching out to the individuals who manage those facilities, or who are responsible for entering that data, to alert them that you have identified a problem.

The third method is DHIS2 data validation rules. DHIS2 has a functionality, which we'll cover in more detail later in the presentation, that allows you to check data against predefined logic, or what we call validation rules. These can be run at multiple points and at different frequencies. DHIS2 can be set up to run the validation rules while the user is actually entering the data. So if they enter a data point that violates a validation rule, for example a rule saying that treated cannot be more than tested for a certain disease, say malaria treated cannot be more than malaria tested, and they put in a value for malaria treated that is greater than the value for malaria tested, then DHIS2 will automatically flag it: here's an issue, we've automatically detected a problem. So that's during data entry. Validation rules can also be run ad hoc by a user through the Data Quality app in DHIS2: they can run a validation rule analysis, and DHIS2 will produce a list of all the validation rule notifications that have been triggered. The third option is to have this run automatically. DHIS2 can be configured, using the Scheduler application, to run validation rules automatically, and the user can receive the notifications from these scheduled jobs by email or SMS. In this way the validation rule notifications come to you, as opposed to the previous two options, where the user has to go to them.
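To make the "treated cannot be more than tested" example concrete, here is a minimal sketch of checking such a rule outside DHIS2, pulling raw values from the standard dataValueSets Web API endpoint. The server URL, credentials, and all UIDs below are hypothetical placeholders; this illustrates the logic of the rule, not the implementation of DHIS2's own validation rule engine.

```python
import requests

# Placeholder connection details and UIDs -- substitute your own instance's values.
BASE_URL = "https://dhis2.example.org/api"
AUTH = ("admin", "district")
MALARIA_DATASET = "dataSetUid0"   # hypothetical data set UID
TESTED_UID = "malTestedUid"       # hypothetical data element UID: malaria tested
TREATED_UID = "malTreatedUid"     # hypothetical data element UID: malaria treated

# Fetch one month of raw data values for a district and its facilities.
resp = requests.get(
    f"{BASE_URL}/dataValueSets",
    params={"dataSet": MALARIA_DATASET, "period": "202401",
            "orgUnit": "districtUid0", "children": "true"},
    auth=AUTH,
)
values = resp.json().get("dataValues", [])

# Index values by (orgUnit, dataElement) so we can compare the pair per facility.
by_facility = {}
for v in values:
    by_facility.setdefault(v["orgUnit"], {})[v["dataElement"]] = float(v["value"])

# Apply the rule: malaria treated must not exceed malaria tested.
for org_unit, elems in by_facility.items():
    tested = elems.get(TESTED_UID)
    treated = elems.get(TREATED_UID)
    if tested is not None and treated is not None and treated > tested:
        print(f"Violation at {org_unit}: treated={treated} > tested={tested}")
```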
So it's important to revisit quickly the virtuous data cycle. Here we want to reiterate that data quality and data use are connected. We see that high data quality enables high data use, and vice versa: where we have high data use, we often see very high data quality. Now the question always becomes: where do I start? Do I try to force data use, or do I try to force data quality? Can I get to high data quality and still have low data use? How do I handle this situation?

As we're talking about data quality, I think it's important to appreciate that both of these can be worked on at the same time. We can take steps to improve data quality, and as we do, we can also take measures to improve data use, for example by adding new feedback mechanisms and sending more alerts and notifications. They typically go hand in hand. The measures we take to improve data quality, for example adding more alerts and notifications, or having people engage with the data more frequently to find and fix errors, inherently lead people to use the data more.

What's the starting point for most countries? It is often to work on improving data quality, because we cannot have high data use if we have poor data quality. It's just not possible. We have to make sure the data quality reaches a point where people start to trust the data. But as we work on data quality, what we often see is that data use also naturally increases: as you drag one up, the other follows quickly behind. It's still important to appreciate that data quality and data use are connected. You can't have one high without the other, and if you have poor data quality, you will have poor data use. But it is a little more pragmatic to start with improving data quality, and as we see data quality improvements, we will hopefully see corresponding data use improvements as well.

The WHO recommends several core indicators for annual data quality reviews. For maternal health, they recommend antenatal care first visits (ANC1). For immunization, they recommend the pentavalent vaccine third dose (Penta3). For HIV/AIDS programs, they recommend patients currently receiving ART. For TB, they recommend the TB notification rate. And for malaria, they recommend confirmed malaria cases. These indicators are almost universally found in all country HMIS systems.

So now we're going to go through the key data quality metrics, and we're going to cover five different measures of data quality. The first one is data set completeness and timeliness. This is the measure that people often refer to as reporting rates: the reports received on time divided by the expected reports. This particular metric is geared specifically toward data that is entered routinely in aggregate form at health facility or other levels. Every month I receive an aggregated report from the health facility, and this is the measure of whether I received the number of reports I expected, at the time I expected to receive them. DHIS2 defines the expected reports, the denominator here, as the number of organisation units, or health facilities, that are assigned to the data set. A data set in DHIS2 is the actual reporting form. So: how many health facilities have I said, in my system configuration, should be reporting on this data? The measure is then the percentage of those assigned facilities that sent in their reports on time. We should target a completeness and timeliness of about 90% or greater.
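As a quick worked example of the reporting-rate arithmetic just described (the figures are invented for illustration):

```python
# Worked example of the completeness/timeliness arithmetic (illustrative figures).
expected_reports = 120        # facilities assigned to the data set for the month
received_reports = 110        # reports actually submitted
received_on_time = 102        # reports submitted by the deadline

completeness = received_reports / expected_reports * 100   # 91.7% -> above the 90% target
timeliness = received_on_time / expected_reports * 100     # 85.0% -> below the 90% target

print(f"Completeness: {completeness:.1f}%")
print(f"Timeliness:   {timeliness:.1f}%")
```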
Research has shown that if our reporting rate for completeness is less than 80%, our data is often not reliable: we're missing a large enough portion of the data that our global or national indicators can't be trusted. There's too much data missing to say that this is an accurate measure of the status of the country, the health program, or whatever it is we're trying to measure. In most countries this is improving, and in many countries it is actually very good. We see very good data set completeness and timeliness across most countries that are using DHIS2 as their HMIS, some of course better than others, but in many countries it has improved over the years and is now at a very acceptable level.

Here's an example of a simple month-to-month line graph showing the reporting rates of various data sets, or reporting forms. You can see that there is an ANC data set, an HMIS maternity data set, which is probably the monthly aggregated data from the health facility, one for malaria, and one for immunization, and you can see the reporting rates for each. You can also see that the reporting rate for the ANC data set has been going down over the last few months. This would be an indication that there's some issue with some of the health facilities reporting, a data quality issue that would have to be followed up on. Of course, as I mentioned earlier, if my reporting rate is going down, that means I'm not getting all of the data in, which means that some of my indicators, for example ANC1 coverage in this case, will become less accurate because they're not being calculated from all of the data that should be coming in.

Let's move on to the next one: consistency over time. What we expect in most data within the HMIS is year-over-year predictability. We expect the data to be relatively consistent from year to year. So when we make a line chart like the one in this example, we see that each year the data follows a similar pattern or trend. If it's seasonal, we see a similar rise, crest, and fall at the same time every year. If it's non-seasonal data, we expect to see roughly the same patterns within the data, without too much variation. Now, of course, many countries have growing populations, so the numbers could increase every year simply because the population base is increasing, but the rate of increase would probably still be the same: year over year, we see a similar slope to the data. And that's what you see in this example. The data from 2020, 2021, and 2022 is nearly identical over the 12-month period. You also see in this example a big spike in 2020, right in November; it's very obviously an outlier. There may be some situations where this is reliable data, some context where you could have an actual spike, maybe some kind of mass vaccination campaign. But more often than not, this is an outlier. By making these year-over-year charts, as you can in the WHO Data Quality app and in the Data Visualizer app, which we'll talk about later, you are clearly able to see these outliers and measure this kind of year-over-year consistency.
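A minimal sketch of this year-over-year comparison: given monthly counts for three years (made-up numbers, with a November 2020 spike like the one on the slide), it flags any month whose value departs sharply from the average of the same month in the other years. The 50% threshold here is an arbitrary illustration, not a DHIS2 default.

```python
# Year-over-year consistency check on monthly counts (illustrative data).
monthly = {
    2020: [310, 295, 330, 360, 400, 430, 425, 410, 380, 350, 980, 315],  # Nov spike
    2021: [305, 300, 335, 355, 395, 435, 420, 405, 375, 345, 330, 320],
    2022: [315, 305, 340, 365, 405, 440, 430, 415, 385, 355, 340, 325],
}
THRESHOLD = 0.5  # flag values more than 50% away from the same month in other years

for year, values in monthly.items():
    for m, value in enumerate(values):
        others = [monthly[y][m] for y in monthly if y != year]
        baseline = sum(others) / len(others)
        if abs(value - baseline) / baseline > THRESHOLD:
            print(f"{year}-{m + 1:02d}: {value} vs. baseline {baseline:.0f} -- possible outlier")
```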
In most countries this has also been improving, and in many countries it's very good. With the WHO Data Quality app and the other data quality tools we have in DHIS2, many countries have been monitoring this for several years now, and we see that it has significantly improved.

The next data quality metric is outlier detection. We just saw an example of an outlier in the year-over-year chart, but we need to understand that there are many different ways in DHIS2 to capture outliers, and that outliers can be very significant and throw off our national statistics in a really impactful way. What we see here in this example is a scatter plot, which can be made in the Data Visualizer application, and we have applied an outlier analysis to it. In this particular example we're using an outlier detection methodology called the interquartile range; Z-score and modified Z-score are also available in the Data Visualizer application. Using this methodology, we are monitoring malaria inpatient deaths under five and malaria inpatient deaths over five, and we have plotted these two data items against one another. You can see that the vast majority of the health facilities, each point here representing a health facility, fall within an expected range; those show up as green dots. Then, using the interquartile range methodology, we have plotted what we consider high and low thresholds: those are the gray lines on the outside. Any health facility above or below those gray lines shows up as a red dot, indicating that it is an outlier.

Equally important on this chart are the horizontal and vertical dotted lines labelled 1% of total X values and 1% of total Y values. What are these saying? They're saying that the dots beyond them represent more than one percent of the total value of malaria inpatient deaths under five or malaria inpatient deaths over five. So if a point is above the dotted line for 1% of total Y values, or to the right of the line for 1% of total X values, then this outlier is having a real effect on our national figures. It's an indication of how severe the outliers are. You can see the outlier we have highlighted here, facility 299, showing 183 malaria inpatient deaths under five and only 31 over five. It's clear that 183 malaria inpatient deaths under five is a value well outside the expected range. Now, is it wrong? Maybe, maybe not. But it is significant, so significant that this one health facility would be throwing off the national statistics, and it's definitely worth investigating. This example shows how you can use tools like the Data Visualizer application with scatter plots to quickly identify outliers when comparing two variables, giving you a good opportunity to follow up and investigate.

In many countries this is still a challenge. Many countries are still facing issues with detecting outliers in a timely way and being able to address them. Most of the outliers that exist are not significant; they're not throwing off national statistics. But it does happen occasionally, as we saw in the last chart, where a value is way outside the expected range and is throwing off national statistics.
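As a minimal sketch of the interquartile-range method named above (not necessarily the exact computation DHIS2 performs internally): values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are flagged. The 1.5 multiplier is the conventional Tukey fence, and the facility values are invented.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]

# Illustrative facility values for malaria inpatient deaths under five.
deaths_under_five = [4, 2, 7, 5, 3, 6, 2, 8, 5, 4, 183, 6, 3, 7]
for idx, value in iqr_outliers(deaths_under_five):
    print(f"Facility index {idx}: {value} is outside the IQR fences")
```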
Now, outliers are very, very easy to enter at a health facility. You can imagine a scenario where a health facility worker means to put in 100 and accidentally puts in 10,000; they just hit a couple of extra zeros as they're entering the data. It's a mistake any of us could easily make. And there are multiple points at which that should be caught. It should be caught at the point of data entry: DHIS2 has the functionality to raise an alert while the data is being entered, saying, hey, this is outside of the expected range, are you sure this number is correct? But if it's not caught there for some reason, then it can be caught on a dashboard, as we see here, or by validation rules, and we'll talk a little bit about all of these various steps later.

The next data quality metric is external consistency. This measure looks at the degree to which data points should agree with other data points. We know that there are common relationships between various data points; common examples are dropout rates and coverage rates. Here in the example, we see a very common dropout rate, between the Penta1 dose and the Penta3 dose. We know that in nearly all situations, we expect to have more Penta1 doses than Penta3 doses, simply because, over the course of immunization, unfortunately fewer people complete the entire course: they get the first Penta dose and then don't continue on to the second and third. A dropout rate measures what percentage of patients who got Penta1 did not go on to get Penta3, and you can see that here: we have all of our districts, and then we have the dropout rates. Now, if we have a high dropout rate, that's a problem to address programmatically; it indicates we have an issue with people in that particular area completing their immunization. We expect the dropout rate to be low, as low as we can possibly get it, because that indicates that the people who got Penta1 are also getting Penta3.

However, when we have a negative dropout rate, that indicates we have more Penta3 doses than Penta1 doses. Now, is this impossible? No, not technically. It could happen, especially if we're doing some kind of cohort analysis where the Penta3 cohort is very different from the Penta1 cohort, or for some other drug that has more differentiated cohorts along its various treatment steps. But by and large, we would expect to almost always have more Penta1 doses distributed than Penta3 doses. When we have a negative dropout rate, that indicates we probably have an issue with either our Penta1 dose count or our Penta3 dose count: either Penta1 is too low or Penta3 is too high, and it needs to be investigated. When we see negative dropout rates, we know they are technically possible in certain situations, but they're almost always a data quality problem. Here in this example, you see that we actually have three different districts reporting negative dropout rates. If you're an immunization coordinator looking at this, it may make you question your entire dataset: obviously that data is not reliable, and now I question all the districts. So this is something to keep track of and follow up on, continuously monitoring these various dropout rates and other external consistency measures.
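A quick sketch of the dropout-rate arithmetic described above, flagging the negative rates that usually signal a data quality problem (the district figures are invented):

```python
# Dropout rate = (Penta1 - Penta3) / Penta1 * 100; negative values usually
# signal a data quality problem. District figures are illustrative.
districts = {
    "District A": (1200, 1050),  # (Penta1 doses, Penta3 doses)
    "District B": (980, 1015),   # more Penta3 than Penta1 -> negative dropout
    "District C": (1500, 1320),
}

for name, (penta1, penta3) in districts.items():
    dropout = (penta1 - penta3) / penta1 * 100
    flag = "  <-- negative dropout: investigate Penta1/Penta3 counts" if dropout < 0 else ""
    print(f"{name}: {dropout:.1f}%{flag}")
```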
In most countries, because we have been building these dashboards for a long time, and because these types of analytics appear both on standard DHIS2 dashboards and in the WHO Data Quality app, we see that this has been improving significantly. Many countries are actually quite good at monitoring this now, but it does still happen. Again, data quality mistakes are very easy to put into the system, so it's something to continuously follow up on.

Our very last data quality metric is consistency, or accuracy, of population estimates. This one specifically refers to the review of denominators, the population denominators we use in our coverage calculations. Many of our high-level national, or even subnational, impact indicators or KPIs are population-based, and in many countries we see that the population data coming from the census is unreliable. Here in this example, you can see that nearly all of the coverage rates in this scorecard are over 100%. Well, except a few: the ones showing up with red, green, or yellow coloring are not over 100%, but all the ones without a color, shown just as a red value, are over 100%. Some of them are very high, nearly 200% or over 200%. How do we have coverage rates over 200%? Are we administering two treatments to every single person? Probably not. The reality is that the population denominator in many countries is very wrong, and that means that if we're calculating any kind of indicator based upon population, we're not getting information we can trust. I think this is probably one of the most significant challenges facing data use and data quality in countries that are using DHIS2 as an HMIS. DHIS2 has no magical way of calculating population for you; it has to come from somewhere. But there are various approaches to using alternative or service-based population estimates, as well as organizations like WorldPop and GRID3 that are able to produce new population projections that may be more accurate. At the end of the day, whether it's through alternative or service-based denominators or new population projections, we have got to figure out how we address population denominators. Because what most countries have is not usable, as you see in this example, and unless we start to correct it, we're not going to see significant improvement in data quality or data use, at least for these population-based indicators.
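To make the denominator problem concrete, here is a tiny illustration of the coverage arithmetic (all figures invented): the same service volume against an understated census denominator produces an impossible coverage rate, while a more realistic denominator brings it back under 100%.

```python
# Coverage = doses delivered / target population * 100.
# The same numerator looks impossible or plausible depending on the denominator.
doses_delivered = 9_400

census_target = 4_500       # understated census-based target population
revised_target = 11_200     # alternative estimate (e.g., a WorldPop-style projection)

print(f"Coverage with census denominator:  {doses_delivered / census_target * 100:.0f}%")   # ~209%
print(f"Coverage with revised denominator: {doses_delivered / revised_target * 100:.0f}%")  # ~84%
```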