So let's talk about quality control for a few minutes first. Quality control, unlike quality assurance, covers the processes that come after you start collecting data: the processes of monitoring and maintaining the reliability, accuracy, and completeness of data during the conduct of the project. Quality assurance is what you do before the first piece of data is ever collected, so the protocols, the data collection forms, setting up the data entry systems, all that. Quality control, then, is the set of procedures that monitor the quality of the data once you start collecting it.

Like quality assurance, quality control requires a multidisciplinary team, which typically includes the clinicians or whoever is collecting the data, the data entry staff, the statisticians, systems administrators, developers, and of course the data managers. And it requires sharing knowledge about what you're studying: the disease process, the clinical practices, the effects of the medical treatment, the relationships between the variables, and the expected timing of events. The data managers really need a good understanding of the policies and procedures behind the data they're working with. In research studies, we work with the clinicians and investigators. We read through the entire protocol. We ask questions. We meet with the statisticians to understand how they're going to use the data. And we learn a lot about the medicine behind the data sets we work with. For example, we've had to know that CD4 is only done about every six months, so you don't expect to see a CD4 result at every visit. Because when you go to do the cleanup queries, you don't want to say, hey, CD4 is missing on these hundreds of observations.
These hundreds of visits are without CD4 because it's only collected every six months. So you really need a good understanding of what's going on in the study and how the clinicians are practicing.

To ensure data quality, there are several points of assessment you can use. Probably the most important one is the point of collection: if you don't get good data collected, then no amount of good data entry and no statistical magic can produce good results from those data. So it's really important to have procedures in place that ensure complete, quality data is collected at the point where staff are gathering the data and filling out the forms. At the point of entry, the range restrictions and the logical checks, which we've already talked about, help to improve data quality. The post-entry cleanup queries, which are what we're going to focus on here, help to identify existing errors or questionable data in the system and allow the correction of those data. And you can't find everything just by looking at the data cross-sectionally; there are statistical methods and analyses that let you look at trends in the data, and sometimes trends in the data reveal problems with data collection or data entry. So it's really helpful to have statisticians look at the data as well when you're generating cleanup queries.

To ensure data quality, the data manager needs to understand the goals of the program. They need to understand the standards of operation, what's happening in the field. They need to understand the impact of the intervention or program.
So for example, specifically in research studies, if you are running a randomized clinical trial and you're giving a placebo to one group and a specific drug to another group, you need to understand what the side effects of each of those medications might be so that you can identify problems with adverse events or toxicities associated with them. So you need to understand the impact of the intervention and how that intervention is going to be delivered: the schedule of events, the visit schedule, and the scheduling of the intervention. And you always need to understand the relationships between variables. If the medication affects the incidence of nausea, say, you need to understand that and look for it in the data. And then the expected timing of events, as I said, the schedule: you need to know exactly when things are done. In HIV, we typically see patients once a month, so you would expect to have regular monthly visits.

So let's talk about cleanup queries. One of the things you want your cleanup queries to do is identify missing data. The first thing you probably want to do is generate reports on the percentage of missing data for each item on the data collection form. This will highlight differences between programs or specific groups of patients, in order to identify methods to minimize missing data. So if you're seeing a high percentage of missing data on the alcohol questions, you might want to break that down, say, by gender. And you might see that the women aren't answering the alcohol question. So you go back, review the processes, and ask: what's going on here? Why are the women not answering? And you may come to find out they're not answering because, for the most part, they don't use alcohol, so the clinicians have stopped asking because they're getting so many negative responses.
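The missing-data report described above can be sketched in a few lines of Python. This is a minimal sketch, not a real system's report: the record layout (dicts with `None` for missing items) and the field names `gender`, `alcohol_use`, and `weight_kg` are illustrative assumptions.

```python
# Sketch of a percent-missing report, optionally broken down by a grouping
# variable such as gender. Records are dicts; None marks a missing item.
from collections import defaultdict

def missing_report(records, fields, group_by=None):
    """Return {group: {field: percent missing}}; group is '(all)' if no group_by."""
    missing = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for rec in records:
        group = rec.get(group_by, "(all)") if group_by else "(all)"
        totals[group] += 1
        for f in fields:
            if rec.get(f) is None:
                missing[group][f] += 1
    return {g: {f: 100.0 * missing[g][f] / totals[g] for f in fields}
            for g in totals}

# Illustrative data showing the pattern discussed above: the alcohol
# question is going unanswered specifically for the women.
visits = [
    {"gender": "F", "alcohol_use": None,  "weight_kg": 62},
    {"gender": "F", "alcohol_use": None,  "weight_kg": 58},
    {"gender": "M", "alcohol_use": "no",  "weight_kg": 71},
    {"gender": "M", "alcohol_use": "yes", "weight_kg": None},
]

overall = missing_report(visits, ["alcohol_use", "weight_kg"])
by_gender = missing_report(visits, ["alcohol_use", "weight_kg"], group_by="gender")
# by_gender["F"]["alcohol_use"] comes out to 100.0, flagging the problem.
```

Breaking the same report down by program, site, or clinician works the same way; you just change the `group_by` variable.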
So there might be some retraining involved so that at least the question, do you use alcohol, gets asked, and you get less missing data on that variable.

Comparing dates is very, very important. You always want to make sure that the date of birth precedes all other dates. Nothing can happen to a patient, certainly not in our HIV clinics, prior to their birth. You want to calculate the age and verify that the date of birth makes sense. So if you have a pediatric form for a patient and their calculated age comes out to 21 years old, there may be a question about whether you have the right date of birth for that patient. And for patients who have died, you want to ensure that the date of death follows all other dates. These checks are kind of obvious, but they are constantly overlooked. Even in our program, we have sent data to the statisticians with negative ages. So you have to keep remembering to review those dates.

One other thing about dates: if you have multiple questionnaires in a study, held in different tables, and you know that, for example, three questionnaires were all completed on the same date, then you should merge those three data sets and identify any problems with the dates. If you see that the date for one of the questionnaires is off by a week, or off by an entire year, for a particular patient, you may want to double-check the data entry. There may be a mistake in the data entry, in transcribing the date, or in somebody simply writing the wrong date on the form. Specifically, we see a lot of errors in the recording of dates at the beginning of a new year; people are still writing 2009 through January and February. The transition is just difficult. So at the beginning of the year you may want to pay close attention to the year being written on the forms, and you can alert your data entry clerks to that as well.
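The date-ordering checks above can be sketched as a simple per-patient query. This is a minimal sketch under assumed field names (`dob`, `event_dates`, `death_date`); a real system would pull these from its own tables.

```python
# A minimal date-consistency query: birth precedes everything, death follows
# everything, and computed ages must not come out negative.
from datetime import date

def date_queries(rec):
    """Return a list of human-readable problems with this record's dates."""
    problems = []
    dob, dod = rec["dob"], rec.get("death_date")
    for d in rec["event_dates"]:
        if d < dob:
            problems.append(f"event {d} precedes date of birth {dob}")
        if dod and d > dod:
            problems.append(f"event {d} follows date of death {dod}")
    # A negative age means the date of birth is after a visit date.
    ages = [(d - dob).days / 365.25 for d in rec["event_dates"]]
    if any(a < 0 for a in ages):
        problems.append("negative age computed; review date of birth")
    return problems

rec = {"dob": date(1980, 5, 1),
       "event_dates": [date(1979, 1, 1), date(2009, 3, 10)],
       "death_date": date(2009, 1, 1)}
issues = date_queries(rec)  # flags the pre-birth event, the post-death visit,
                            # and the negative age
```

Running this over every patient and printing the non-empty lists gives you exactly the kind of cleanup report the statisticians need before analysis.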
For date comparisons, you want to generate a cleanup list for observations whose dates are after today's date, or preferably after the date of data entry. This is another way to confirm that the proper date has been entered into the system. Even better is if you've already put that restriction in your data entry system. I think with InfoPath and some of the tools you've seen, they default the date to the system date, which is a great way to prevent entry errors. But we're still seeing it. We still see future dates; we see lab values dated 2011, which are simply not possible. So if you can restrict those dates in your entry system, great. If not, you need cleanup queries that identify those erroneous dates. I think a system like OpenMRS also records the date and time the data were keyed in, so that's another way to check: rather than basing the check on today's date, you can base it on the date of data entry.

You also want to generate a similar list for observation dates that precede the date of the inception of your program. If you know that your HIV clinic didn't open until January 2008, then you should not be seeing any visit dates prior to January 2008. This gets a little tricky, because sometimes patients come with lab results or other information from another clinic, and staff want to enter those data. So it's not an actual visit or encounter at your clinic, but the data do precede the inception of the program. It's never easy; it's never straightforward. You also want to examine the intervals between observations and visit dates to ensure that the expected timeframe is reflected. As I said earlier, in AMPath they see patients about once a month.
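The future-date and program-inception checks just described can be sketched as follows. The January 2008 inception date comes from the example above; the `external_source` flag, which exempts records brought in from another clinic, is an illustrative assumption about how such data might be marked.

```python
# Flag observation dates that are in the future relative to data entry, or
# that precede the program's inception (unless they came from outside).
from datetime import date

INCEPTION = date(2008, 1, 1)  # illustrative: an HIV clinic opened January 2008

def flag_suspect_dates(obs_date, entry_date=None, external_source=False):
    """Return the reasons this observation's date should be queried."""
    reasons = []
    reference = entry_date or date.today()  # prefer the data-entry date if recorded
    if obs_date > reference:
        reasons.append("date is in the future relative to data entry")
    if obs_date < INCEPTION and not external_source:
        reasons.append("date precedes program inception")
    return reasons

# A visit keyed in during 2009 but dated 2011: impossible, gets flagged.
flag_suspect_dates(date(2011, 6, 1), entry_date=date(2009, 6, 1))
# A lab result brought from another clinic may legitimately pre-date the program.
flag_suspect_dates(date(2007, 3, 1), entry_date=date(2009, 6, 1), external_source=True)
```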
So if you're looking at the intervals between visits and someone is only being seen every three months, you may question whether you're capturing all the data in the system. Are you missing some of the visits, some of the encounters? This is even more important in research studies, where you have specific protocols that say the patient will come in at day seven, then two weeks after that, then every month, or every 30 days, from then on. In those cases you have a better idea of what you expect to see, and you'll be a little more rigorous about the cleanup queries you generate for those research studies.

For checks on numeric data, you want to confirm that all values are within the expected range, and we've talked about how to do that. You want to investigate possible outliers by verifying against the source document, the source document being the paper form. So if you find a lab value or a numeric value that doesn't really make sense, the first thing you do is go back to the form to confirm that it was entered properly. You can also compare it with other values for the same subject. If you see a weight that seems a little wacky, you can plot that patient's longitudinal weights and see whether, say, this patient's weight has been increasing over time, in which case a weight of 110 kilos might be OK. You can cross-reference with other variables, such as current illnesses in the case of elevated lab results. If somebody has a really high white blood cell count, you can look at their diagnoses or other symptoms for that visit; maybe they had a fever that day, which would imply that the result is probably OK as is. And you also want to confirm that the value makes sense with respect to the patient's age, gender, or disease status.
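The visit-interval check described above can be sketched like this. The roughly monthly schedule comes from the AMPath example; the 45-day tolerance is an illustrative assumption you would tune to your own protocol.

```python
# Flag gaps between consecutive visits that exceed the expected interval,
# suggesting encounters that were never captured in the system.
from datetime import date

def long_gaps(visit_dates, max_gap_days=45):
    """Return (previous, next) visit pairs whose gap suggests missed encounters."""
    ordered = sorted(visit_dates)
    return [(a, b) for a, b in zip(ordered, ordered[1:])
            if (b - a).days > max_gap_days]

visits = [date(2009, 1, 5), date(2009, 2, 3), date(2009, 5, 4)]
gaps = long_gaps(visits)  # the February-to-May gap of about 90 days is flagged
```

For a research protocol with a fixed schedule (day 7, then every 30 days), you would tighten `max_gap_days` or compare each visit against its scheduled window instead.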
We've talked about this: looking at weights compared to the patient's age, or even weights compared to the patient's gender. With adults, the height and weight might each make sense on their own, but it's a really good idea to calculate the body mass index (weight in kilograms divided by the square of height in meters) and confirm it, because you can get a valid-looking height and a valid-looking weight and still get a body mass index that's out of range. Most adult BMIs should be between 10 and 40, and even 10 is on the low side. You can also flag unexpected weight fluctuations in your data cleanup queries: if you look at a patient's weights longitudinally and there's a sudden drop, you can identify that observation and have it show up in your cleanup queries.

Pediatric heights and weights are a little easier to confirm, because there are a lot of systems that allow you to cross-reference the height and weight against standard growth charts. Epi Info has software for this, and SAS also has software for calculating what are called Z-scores. A Z-score is the number of standard deviations away from the mean of what you expect to see for children of that age. (I was going to show you a typical growth chart here; I'll load it at the break and post it.) When you find Z-scores that are unusual, and typically we would want to review any Z-score less than minus 5 or greater than 5, you need to review the date of birth, the visit date, and the weight for that patient. You may also need to review the patient ID: it could be that the data have been entered under the wrong identifier. Similar checks can be made with weight-for-age Z-scores, height-for-age Z-scores, and also weight-for-height Z-scores.
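The adult BMI check above is easy to sketch. The 10 to 40 plausibility window comes from the discussion; the function and message wording are illustrative, not from any particular system.

```python
# BMI sanity check: weight in kilograms divided by height in meters squared.
def bmi(weight_kg, height_m):
    return weight_kg / height_m ** 2

def bmi_query(weight_kg, height_m, low=10.0, high=40.0):
    """Return a cleanup-query message if the combination is implausible, else None."""
    value = bmi(weight_kg, height_m)
    if not low <= value <= high:
        return f"BMI {value:.1f} outside {low}-{high}: verify height and weight"
    return None

bmi_query(70, 1.75)  # plausible adult, BMI about 22.9: no query
bmi_query(70, 0.75)  # each field looks valid alone, but the combination is impossible
```

This is exactly the cross-field idea from above: the height passes its own range check, the weight passes its own range check, and only the derived value exposes the error.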
So comparing the weight and the height, regardless of age. We've talked about reviewing longitudinal data already: looking at fluctuations in heights or weights, or, just in HIV care and treatment, if someone starts ART for therapeutic reasons, you don't expect them to come off of it. Once you're on ART, you're on ART for life, as long as you're adherent and not failing the regimen and so forth. So we can look at the on-ARV variable longitudinally, and if it appears that a patient is on ART at only one or two visits, we may want to review that patient's data, because it may be a data entry or data collection error; you don't expect patients to be on ART for short periods of time.

When we were talking about missing data, we talked about having additional codes or responses for the reason data are missing. You want to make sure that these missing codes don't overlap with valid data. In the past, people have often used 999, or filled the field with 9s, to indicate missing or unknown. This becomes problematic when you start adding categories to your responses, or when you happen to run across somebody who really has 99 children. You think, nobody could possibly have 99 of these, but we've seen people with 27 or 30 children in AMPath, and the more data you collect and the more patients you see, the further the values creep up. So you have this problem of your missing-data codes overlapping with your actual valid values, and you need to be careful that that doesn't happen. With systems like OpenMRS, I think it can't happen for coded responses, because all of the responses are coded. I guess it could still happen where you have a free numeric field like blood pressure.
And if you said, OK, missing blood pressure is 999, could you really have somebody with a systolic blood pressure of 999? You just need to be thinking about that.

I mentioned this earlier for lab results: if you know you're going to have a qualifier, such as less than or greater than, you need to store that qualifier as a separate variable. If 90% of the data for a field are numeric, you want to make that data type numeric; you always want to default to numeric. It's much easier for the statisticians to work with, and it's much nicer when you're setting up data entry systems, because you can put range restrictions on it. So if you have lab results where some results will be reported as less than 0.1, you should create two fields for that result, storing the qualifier in one field and the actual numeric value in the other.

It's also a good idea to include cross-variable checks in your cleanup queries, so you can confirm that there is consistency between, for example, gender and other variables in the system such as pregnancy. Another consistency check might be contraindications and medications. This is where it helps to understand the nature of the topic whose data you're managing. If you know that two medications should never be given in combination and you see that in the database, you may want to review that particular encounter and make sure there hasn't been some error in collection or entry. I find that folks make these mistakes maybe not so much anymore with OpenMRS, because it has a much nicer system for identifying and registering patients, but in the old system, data were often entered under the wrong subject ID. There's really no way to catch this except by comparing the data from that observation with the data from the other observations for the same patient.
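The two-field approach for qualified lab results can be sketched as a small parser. The function name and the set of qualifiers handled are illustrative; a real system would match whatever qualifiers its lab actually reports.

```python
# Split a raw lab result like "<0.1" into a qualifier field and a numeric
# field, so the numeric column stays purely numeric and range-checkable.
def split_result(raw):
    """Return (qualifier, value); qualifier is '' when the result is plain."""
    raw = raw.strip()
    if raw[:2] in ("<=", ">="):
        return raw[:2], float(raw[2:])
    if raw[:1] in ("<", ">"):
        return raw[:1], float(raw[1:])
    return "", float(raw)

split_result("<0.1")  # → ('<', 0.1)
split_result("350")   # → ('', 350.0)
```

With the qualifier stored separately, the value column can be declared numeric in the database, range restrictions work at data entry, and the statisticians can still recover the detection-limit information when they need it.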