analysis. I'm going to talk about what a specific data request is, a little more about data cleaning, preparing the data sets, and of course documentation. So what is a research data request? Typically an investigator, a clinician, or a scientist comes to the group, the statisticians and the data managers, with a question they want to answer, a hypothesis about what the answer will be, and aims for a specific project. Typically they write up a concept proposal, which is a detailed plan for how the research project will proceed. That usually includes the aims of the study: what are their goals, what are they trying to achieve with this study? It includes the specific associated hypotheses: you might have multiple aims, and you might have multiple hypotheses about what you think is going to happen. This is really helpful, because the statisticians know all about testing hypotheses given the data. The concept proposal will typically include a statistical analysis plan. The investigator is not always the one who writes the analysis plan alone; they do so in conjunction with the biostatisticians. There is typically a detailed description of the cohort, and we've talked about what a cohort is: it defines the study population you want to analyze, the patients you want to include in your data set. And a detailed concept proposal usually includes a specific list of the variables needed and a description of the data needed to conduct the study. So let me show you a concept proposal. You can see that it can be pretty detailed. We have a description of the study that the investigator wants to perform, and we also have a list of statistical methods, a short analysis plan for how they're going to analyze the data.
And then below is a list of the specific variables they need: the dependent variables and the independent variables, along with some other administrative information about contacts and so forth. This is our guidebook when the data managers start to put together an analysis data set. We use it to figure out which patients should be included, which variables should be included, what time period they're looking at, and what new variables we're going to need to derive. In this example you can see some of the things we've seen in the demonstration data set: age, gender, and CD4 count (the CD4+ T cell count is the same as our CD4 count), the WHO stage, which we've worked with a little, opportunistic infections, which are the diseases patients get from HIV, and a few other miscellaneous variables. The dependent variable here is the time to treatment initiation. This particular concept proposal is looking at the characteristics of patients prior to initiating ART: how the characteristics of HIV-infected patients who are getting ready to start antiretroviral therapy in East Africa are changing over time. This is an actual concept proposal: we put together the data set, the results have been presented at one of the international AIDS meetings, and we're in the process of writing up the manuscript. So that's our concept proposal; that's how things begin, preferably. So how do we go about fulfilling a data request that's embedded in a concept proposal? The first thing you want to do is to identify and resolve any questions regarding the requirements.
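To make the dependent variable in a proposal like this concrete, here is a rough sketch of how a time-to-treatment-initiation variable might be derived in SAS. Everything here is hypothetical: the data set (demog) and variable names (enroll_date, art_start_date, last_visit_date) are made up for illustration, not taken from the actual study.

```sas
/* Hypothetical sketch: deriving time from enrollment to ART
   initiation, with censoring at the last visit for patients
   who have not yet started. All names are illustrative. */
data art_timing;
    set demog;                        /* one record per patient */
    if not missing(art_start_date) then do;
        art_init    = 1;              /* event observed */
        days_to_art = art_start_date - enroll_date;
    end;
    else do;
        art_init    = 0;              /* censored at last visit */
        days_to_art = last_visit_date - enroll_date;
    end;
    label days_to_art = 'Days from enrollment to ART initiation (or censoring)'
          art_init    = 'ART initiated (1=yes, 0=censored)';
run;
```

Because SAS stores dates as day counts, simple subtraction gives the duration in days; the event indicator is what a statistician would need for a time-to-event analysis.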
You have to read through the proposal and understand exactly what the investigator wants and what the statisticians are going to do with the data, and make sure you have a good sense of what is required of the analysis data set. You want to determine what the data sources are; the data may not come from just one source. The clinician or investigator may come with information from the pharmacy or from another sub-study, and you have to be able to merge that data in, so you need to identify all of your data sources. You need to determine which variables will be needed, what the derived variables are and how you will compute them, and which visits and observations will be included in these data sets. You'll need to prepare the actual data, and we'll talk about how that's done. And then, always, you need to document. Okay, let's back up a little bit. SAS is not a very good tool for inputting data, for a data entry clerk to actually key in data. It does have some capabilities, and you can set up some simple interfaces, but it's not as good as OpenMRS, certainly, or other SQL packages, or even Access. So typically data are not collected in SAS. If at all possible, if you have access to the raw data, you may want to do some processing and cleaning there before you bring the data over into SAS. When you get ready to start cleaning the raw data, the first thing you want to do is back up the original data files. There may be little sub-studies going on, collecting data in Access databases here and there. In those situations, when you're going to extract somebody's data, make sure you back up the original data files before you make any changes whatsoever.
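Once the raw files are backed up, a clean way to organize the hand-off into SAS is a stand-alone conversion program that does nothing but the import, as described above. This is a sketch under assumptions: the file paths, sheet name, and library name are hypothetical, and PROC IMPORT with DBMS=XLSX assumes the SAS/ACCESS Interface to PC Files is licensed.

```sas
/* Hypothetical sketch: a separate program whose only job is to
   convert the raw Excel file into a permanent SAS data set.
   Paths, sheet name, and library are assumptions. */
libname study 'C:\projects\studyx\sasdata';

proc import datafile='C:\projects\studyx\raw\visits.xlsx'
            out=study.visits_raw
            dbms=xlsx
            replace;
    sheet='visits';
    getnames=yes;      /* first spreadsheet row holds variable names */
run;

/* quick sanity check that the transfer worked:
   variable names, types, lengths, observation count */
proc contents data=study.visits_raw;
run;
```

Keeping the conversion in its own program means the import can be rerun on its own, and the PROC CONTENTS output gives you something to compare against the original spreadsheet to confirm nothing was lost or mistyped in transfer.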
You want to go through and look for blank records. When folks who don't have a lot of programming experience set up an Access database, there are often blank records in there: entries with no actual data, or records that were inserted just for testing. You want to eliminate those if possible. You want to locate any duplicate records. In Access, if you set it up right, there is a key that uniquely distinguishes every record in the table from every other record. Excel and Access are great for this kind of checking, especially if you have a small database without too many observations: you can just click on each column and sort it. Sorting by each column is great for looking at dates. If you know when the study started and when it ended, and you sort by date and see future dates, or dates in 1908, you know there have been entry errors, and you can fix them right there on the spot before you bring the data over. For categorical variables, a lot of inexperienced developers allow entry of both upper- and lower-case coding for specific values, or they mix coding schemes: for a field like gender, they start out using F and M and then switch to 1 and 2 midway through. If you sort that column, you can easily identify where the issues and the problems are.
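The same checks can be repeated once the data are in SAS. This is a hedged sketch of what that might look like; the library, data set, and variable names (raw.visits, gender, visit_date, patient_id) are assumptions for illustration.

```sas
/* Hypothetical sketch of the raw-data checks redone in SAS:
   inconsistent categorical coding, out-of-range dates, and
   duplicate records. Names are illustrative. */

/* gender should show exactly two levels; mixtures like
   'F', 'f', '1', '2' each appear as their own row here */
proc freq data=raw.visits;
    tables gender / missing;
run;

/* future dates, or dates like 1908, show up at the extremes */
proc means data=raw.visits min max n nmiss;
    var visit_date;
    format visit_date date9.;
run;

/* flag records that share the same patient ID and visit date */
proc sort data=raw.visits out=sorted;
    by patient_id visit_date;
run;
data dups;
    set sorted;
    by patient_id visit_date;
    if not (first.visit_date and last.visit_date);  /* keep duplicates only */
run;
```

PROC FREQ plays the role of sorting a column in Excel, and the FIRST./LAST. logic in the final data step isolates any duplicated key combinations so they can be queried back to the data collectors.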
And again, you can fix those on site. This kind of raw data cleaning is actually helpful to the person maintaining the database, because it teaches them to be more consistent in the future, and it cleans up their raw data for them, so that if they go to generate some frequencies on gender, they won't get four values; they'll only get the two, F and M. Uppercase F and lowercase f are not the same thing in SAS: SAS treats those as two separate values. That's not true of variable names, but it is true of values. Also, when you're sorting those columns, you can easily review the missing data. If you sort the gender column and see that five or six people are missing gender, you can ask: do you know this, can we enter it? And they can immediately say, nobody should be missing gender, we know the gender of everybody, and you can easily go and grab those data. The other advantage of doing data cleaning in the raw data set is that there are occasionally problems with transferring data to SAS, depending on the format. There are a lot of quirky things with Excel spreadsheets in particular that don't always come out well in SAS. If you look at the data in its original format first, you can confirm that everything is transferring to SAS correctly. Even for an ASCII file, I like to have a separate SAS program that does nothing but the conversion, and I think Eva talked about this last week for converting an Access or Excel data set. There are limits on how many variables or how many observations can be in an Access table, so people make things a little more modular and break things up into smaller tables, and usually those pieces can be recombined in SAS when creating an analysis data set. So you need to merge or append or concatenate these tables as necessary, and you need to double-check that the merging process worked properly by looking at the number of variables and the number of observations: look at what the number of observations was in each of the input data sets, and we'll look at some examples. You also need to understand how the number of records depends on the overlap among the data sets. You might have some patients in one data set that aren't in the other, which will create a few additional records in the final merged data set, so you need to understand what the relationship is.

Let's take a look at an example. I want to keep the ID, visit date, age, weight, height, BMI, and CD4 from this longitudinal data set. In this particular example (it doesn't match our demonstration set exactly), there were about 933,000 observations in the original visit data set, and in this data step, where I'm keeping only the records for patient IDs 1, 2, 3, 4, and 5, I end up with 71 observations in the subset. I'm extracting the second data set from the same demonstration data; that original data set has 23,000 observations. There will be multiple observations per patient in the merged data set, because I'm including patients 4 and 5 from both of these data sets. Does everybody have a picture in their head of what's going on so far? Now, let's count the variables in the merged data set. Patient ID, that's one; appointment date, two; age, three; weight, four; height, five; BMI, six; CD4, seven. Patient ID, do I count that one again? No, because I already counted it. How about appointment date? No, already counted. Then from the second data set: clinic, hemoglobin, and SAO2. So there will be 10 variables in that final data set. Can you tell how many observations will be in the final data set? 117? How do you get that, 46 plus 71? But I'm going to merge these data sets, and the two presumably have some observations in common for patients 4 and 5. You cannot tell. How could you tell? If you knew something more, you could. If the patient IDs were completely different between the data sets, the final number would be the sum. And if you knew the number of observations per patient ID, you could calculate the total, but you wouldn't have to know all of them. How many would you have to know? Actually, only patient 6. You wouldn't have to know 4 and 5, because there is complete overlap: we're not restricting by any other variables, so all the observations for patients 4 and 5 are in this data set, and they're also all in that one. So you would really only have to know how many observations there are for patient 6 to get the total; say patient 6 had ten observations, all you'd have to do is add that number. And it could be zero; it could be that there are no observations for patient 6 there. So let's see if we were right. When you do the merge, you read 71 observations from the first data set and 46 from the second, and you end up with 10 variables, like you said, and 83 observations. Based on that, can you tell me how many observations ID 6 has? Twelve, right. We should test that, to see if we're right. This is the kind of sleuthing you have to do to confirm that your merge worked correctly, and this is a very simple one: there are no conditional statements, no subsetting out of other observations, and so forth. It's really important to get these merges working right and not to overwrite data from one data set with another.

When you're doing this, you want to confirm that the total number of variables in the merged data set is correct, like we just did in the example. There's actually a specific formula you can use, based on the number of data sets being merged, the number of key fields, and the number of fields in each data set. In the previous example we had seven fields in the first data set and five fields in the second, and we had two key fields, patient ID and appointment date. You take the total number of fields and subtract the number of key fields multiplied by the number of data sets being merged minus one: 7 + 5 - 2*(2 - 1) = 10. (Actually, this might not be right in general; we can look at it again later. It works when you have only two data sets; what you subtract may depend on the number of key fields.) If the number of variables is less than what you expect, then you know that you have other fields common to both data sets, and this should be strictly avoided: you do not want to get into a situation where you're overwriting one field with another from a different data set. If I had also kept age in the second data set, its field count would be six, but the merged data set would still have ten variables, and you want to avoid that situation, where you're merging two data sets that have fields in common that are not
in the BY statement. The only variables in common that you want between the data sets you're merging are the ones in the BY statement, because SAS will do some strange things depending on how the records align. When you're running a SAS program, whether you're creating a data set for research or running cleanup queries or whatever you're doing, you always want to review the log before you go on and look at the output, or before you send out the data set. And you all have actually been pretty good about reviewing the log and making sure that things ran properly. Error messages come up in red, and it's pretty obvious that you've got a problem; you can't really continue without fixing it. Warning and NOTE messages allow you to continue processing, but you may not be getting what you expect in your data sets. Here is a list of five of the warning and note messages that really should be attended to before you continue to process your program. Some of them may not actually be causing problems with your data, but they should still be corrected, because they mean there's an error in your code; the error may not be affecting the actual data sets you're creating, but it is still a problem.

The first one in particular is serious: the note that the MERGE statement has more than one data set with repeats of BY values. What this means is that there are multiple observations per BY group in both of the data sets, so SAS doesn't know how to merge them. If your BY statement contains only patient ID, and a patient ID is listed three times in one data set and twice in the other, SAS doesn't know how to link those three observations with the two from the other data set, and that leads to confusion. Typically, when you get this message, the problem is that you don't have enough variables in your BY statement: you didn't realize that there are multiple observations per patient in both data sets. Whenever that happens, you cannot merge by a single variable; you cannot merge by just patient ID. You have to have some other variable to merge by: a visit date, a visit number, whatever else is available. The point is that you cannot merge the two data sets while there are multiple records in each of them for the values named in the BY statement. If you run the merge anyway, you get strange results, and you should definitely not continue on to the next step or procedure until you resolve the issue and the note goes away.

Then there are two other notes that are very similar, where a variable with a specific name (it can be any name; SAS lists the variable) is uninitialized or has never been referenced. These indicate that the variable is probably not properly defined; many times it's just a typo where you've misspelled one of your variable names, and it's handy that SAS picks up on that. The next one is the note that character values have been converted to numeric values, followed by a bunch of other detail. This indicates that SAS has automatically converted a character variable to numeric. Because this can lead to unexpected results, it's best to do the conversion explicitly: the INPUT function converts character to numeric, and the PUT function converts numeric to character; I always have to stop and remember which is which. The last one, which I don't see as often, is where multiple lengths are specified for your BY variable. If you get results from the lab database, and patient ID is in there, and you also get results from the clinical database, and patient ID is in there too, and the people designing those data sets didn't communicate before the design was set up, the length of the field may be 10 characters in one and 15 in the other. You want to be careful about trying to merge those two until you resolve that length problem.

Whenever I run a SAS program (and I typically rerun SAS programs many times before I get them right, and sometimes recreate the data sets a year down the road), I go straight to the log and search for these keywords: "merge statement", "uninitialized", "referenced", "character values", and "multiple lengths". You don't have to type in the entire message; a snippet from each will take you to the problems. Sometimes I search for warning messages as well; I always search for error messages first, then warnings, and so on. The notes don't stand out: warning messages show up in green, I think, and error messages in red, but the notes just show up in black, so it's harder to scan through the log and spot them, which is why I search for them automatically.

Okay, so what else do you have to do when you're setting up a permanent data set for research purposes? One thing is that you may need to recode the missing values that were used in the raw data files. As I mentioned earlier, sometimes people fill a field with nines to indicate missing; these should be converted to a blank or a dot in SAS before proceeding, unless the nines carry additional meaning, such as 9 meaning "unknown" or 9 meaning the patient refused to answer. If it's just generically missing, it should be converted to the SAS missing value. A lot of times we have to calculate summary scores. A simple example is the body mass index, which is based on height and weight, but there are also a lot of studies that use behavioral questionnaires asking about pain and depression; they may have a series of 30 questions about depression, and there are summary scores, sometimes several of them, that can be calculated from those questions. In our area we don't typically do these calculations in the data entry system; we do them at the end, in SAS, so there are steps you need to add for coding those summary scores. The AUDIT score is a summary of the alcohol questions; I think there are four alcohol questions, and there are a couple of AUDIT scores. You also want to look at differences between dates, such as the time from enrollment to ART initiation. This doubles as a cleanup query, but the analysis needs to have the durations in the data set, so this is a good place to calculate those variables and store them. Any variables that you calculate, derive, or create, you must label, and you must document externally as well, even in just a Word document. And the final thing you want to do is attach formats to specific variables. For marital status, or civil status, the entered values are actually 1 to 7, but 1, 2, 3, 4 doesn't mean anything to the statistician; that's why we create formats for those values and attach them to the associated variables. Once you have your permanent data sets created from the raw data sets, you can have a separate program that
does the data cleaning and generates cleanup queries. In the cleaning program I like to generate frequencies, means, and sometimes univariate statistics on all the variables. I look at every single variable in the data set I was given, to make sure that there are no outliers, to make sure that gender doesn't have four levels, and to check for invalid data. You can also plot the data; I'll often plot age against weight, especially if I'm looking at kids, to see how much work is going to be needed in cleaning up those weights. For numeric and date fields, you can look at the minimums and the maximums to verify that the values are in the expected ranges. If you have few observations, or a limited number of possible responses, you can use PROC FREQ to look at everything, including the minimum and the maximum. But if you're looking at weight and you've got 10,000 observations in the data set, you probably don't want to run a PROC FREQ on it; that's where you can use PROC UNIVARIATE and PROC MEANS, which will show you how many values sit at the lower end and how many at the upper end, and you can generate cleanup queries based on what you see in those descriptive statistics. You want to locate duplicate records, of course, and fix those. If you find multiple observations per patient in what's supposed to be a cross-sectional data set, you know you've got to eliminate those records. A lot of this is redundant with what you did, or could have done, in the raw data set, but all of it has to be repeated here in SAS. Then you want to compare fields where appropriate: looking at date of birth, age, and visit date, and confirming that the date of what they have defined as the initial visit actually precedes the dates of the follow-up visits.

Then you want to pick out the important fields, such as summary scores, and verify their values in more detail. There's going to be a lot of data collected for a research study, and some of it is not important to the analysis. How many times a person called in to ask about medication, those kinds of things, probably aren't useful for the analysis, so you don't need to spend your time on those variables. That's why it's important to understand the concept proposal: so you know what the specific aims of the study are, and what variables will be needed in the analysis to address the questions in those aims. Typically you want to merge all the longitudinal data sets together. You may not want to store the result as such in the final data set you prepare for analysis, but you certainly want to do it as part of the cleaning, so you can verify there are no inconsistencies in how variables are formatted. For the variables that are supposed to be in all the data sets, like an identifier, a visit number, and a visit date, you want to merge the data sets to confirm that the format and the length of those variables are the same everywhere, so that when the statistician needs to merge them there won't be any questions about inconsistencies. And you always want to merge the cross-sectional data set, the demographics, with the longitudinal data set, the visits, to identify subjects who are in one data set but not the other. You may also want to get the date of birth into the longitudinal data set so you can calculate age and similar variables, but this is only a temporary merge. You don't want to send the statisticians a data set with gender in every record of the longitudinal data set. It's a temporary merge, where you're just merging to make sure there's agreement between the two data sets, and maybe to do some calculations; when you actually send the data out, they need to be separate files, and I'll explain more about why that is.

Okay, so when you're writing SAS programs, it's really important to save all of your logs and your outputs, especially the final versions that you used to create the analysis data sets, and even more so if those data sets are going to be used for publication. Whenever I give a data set to a statistician or an investigator, I freeze the log, the listing, the SAS program, and the data sets that were given: I copy them into an archived folder and don't ever touch them again. That way, if they come back to me six months later and say, "I want to run some additional analyses on that data set you created, can you add these variables?", I can go back to that exact data set, even if data collection has continued and the live data set has grown; I still have the data set as it was at that time. And believe me, this happens: even after publication they'll come back wanting additional information about that data set, and if you haven't saved the log and the program, you cannot reproduce the results and you cannot answer questions about how that data set was created. When you're naming SAS programs, as we talked about earlier, it's important to give them a name that means something, not just to call them "frequencies" or "study" or whatever. It's also important to name the log and listing files with the same prefix as the original SAS program. In this example the name won't literally be "study X"; it will be something specific to a particular study, but the log and listing should be named the same thing, just with the different extensions, .log and .lst, which SAS attaches automatically when you save them through SAS.

Only the program that generates a permanent SAS data set should ever overwrite it. You should not have two SAS programs on your hard drive that modify the same permanent SAS data set, because that leads to a lot of confusion and a lot of problems. If you have one program that creates a permanent SAS data set, and another program that modifies and overwrites it, and you come back to recreate the data set six months later and forget to run the first program, or run the programs in the opposite order, then you're not going to get the right results in that data set. The other advantage of not letting more than one SAS program overwrite a specific data set is that you always have the date and time associated with the creation of that data set. If lots of other programmers and data managers are using a data set to generate other data sets, and you let them modify the permanent data set, then no one will ever know whether they have the most recent version just by looking at the date and time, because other programs, even a PROC SORT, will change the date and time associated with that data set. If someone sorts the data set and overwrites it, all the other programmers might think, "I have the latest data set, because it's from last week," but that may just be when that particular file was touched, not the actual data they want to be working with. I even take it one step further
and say even within one SAS program there shouldn't be more than one data step or procedure that that modifies that data set so you don't want to you don't want to have two data steps that that overwrite the same the create the same SAS data set I mean there's no point in that I mean if you just can eliminate that first data step or create a temporary data set for that first data set and then only then only output the permanent data set and in that final data step but then it gets a little confusing well what if you want that data set sorted in a certain way well then you have to sort it prior to creating that that permanent data set okay does that make sense okay so then you never you never want to override a permanent data set like I said even with a proc sort from any other program so only one program creates the data set and no other programs overwrite that permanent data set so let's go talk a little bit about documentation as we said earlier you want to internally document your SAS programs at a minimum you need to include a file name location purpose author date and revisions we also had when we're looking at the program design language we also had a header that included inputs and outputs as I said to include the names of any permanent SAS data sets created by this program that would be an output and then all your your SAS pronounce if you create a rich text file or just an ASCII file you want to put meaningful titles on that output meaningful titles that include the name of the project that at minimum include the name of the project but you may also want to put something else the treatment interruption analysis data set you know like the second generation or this is something about what the subset is or when it was run but at least at minimum the name of the study or the project would be very helpful and then you can also use the footnote option and I usually use the footnote option to display the name of the SAS program that created that output so this 
In SAS you can have a FOOTNOTE statement, and you just put whatever you want in quotes; SAS then prints it on every page of your output. I can tell you this has come in handy more than anything else. Investigators have come back to me months after I generated an analysis or a listing for them and said, "You know that list you put together for me? Can you rerun that?" And I'm thinking, oh my goodness, I've got 25 or 30 projects going at once. So I say, "Is there a note at the bottom of that page that tells what program generated it?" They read it to me, and I can go right to the folder, right to the program that generated that particular output. It's very handy and a good habit to get into.

As you saw with the demonstration data set, when we have formats attached to specific variables, we need to make sure we include those format VALUE statements somewhere in the documentation. You always want to generate format keys, as we talked about, and make them available. You also want to provide a detailed description of any variables included in the data set that are not found on the form key. These would be all your derived variables: your summary scores, your duration variables, any additional variables you create. Make sure they're in the documentation. When I review other programmers' work, the first thing I do is look through the list of variables, and if a variable is not on the form key and not in the documentation, I send it back and say, "I have no idea what this variable is." Even if I can figure out what it is, I tell them they have to provide a more detailed description, and I send it back.
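The titles, footnote, and format documentation just described might be sketched like this; the project name, program path, and format values are made up for illustration:

```sas
title "Treatment Interruption Analysis Data Set";          /* project name at minimum */
footnote "Generated by C:\project\programs\ti_lists.sas";  /* printed on every page   */

/* the format VALUE statements, which also belong in the documentation */
proc format;
  value whofmt 1 = "WHO Stage I"
               2 = "WHO Stage II"
               3 = "WHO Stage III"
               4 = "WHO Stage IV";
run;
```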
Now let's talk a little about summary scores. I tried to find an example for you this morning, but without the internet I could not pull one up. There are often cases, as I mentioned earlier, where you have questionnaires that measure depression or pain or other behavioral problems. There will be a series of 10, 20, 30, sometimes 60 questions, and typically these questions are interrelated and can result in multiple summary scores, and usually a total score for the questionnaire as well. These are very handy, because a lot of these questionnaires have been validated and are accepted in the research community. You can go to the literature and actually compare the mean and median depression scores for your population with other populations that have been published. As long as everyone is using the same standardized instrument and scoring it the same way, the results are comparable.

But scoring instruments can be a little complicated, so you need to provide a detailed description of how you generated those summary scores. By writing that description, you're also working out the algorithms and procedures for how you're actually going to code it in SAS. The detailed description should include which variables were included in which summary scores, and which variables were recoded and how. Quite often the items will have opposite meanings: if you strongly agree with one statement, it means you're more satisfied, but if you strongly agree with the next statement down, it indicates you're less satisfied, say with the service you're receiving from the clinician. You can't just add those two scores together; if "strongly agree" is coded as a five on both, they'll just negate one another, because they don't have the same meaning. So you actually have to flip the responses for one of the items, so that a higher score still means you're satisfied with the service you're receiving. Does that make sense? That's what I'm getting at here: you might need to flip the scoring of specific variables so they all have the same meaning, so they're consistent.

It's not possible to calculate summary scores if you have too much missing data. A rule of thumb is that if two-thirds of the items are non-missing, you can calculate the summary score; some people want three-quarters of the items non-missing before they feel comfortable calculating it. It also depends on the number of items. If you've got 50 items going into a summary score, you may be able to live with a somewhat higher percentage of missing; if you have three items going into a score, you may insist that at least two of the three are present before you'll let the score be calculated. In this documentation you also need to explain how missing values are addressed. Typically, when calculating a total or sum score, the mean of the non-missing values is calculated and imputed for the missing data. If the summary score is itself a mean, say the mean score for a set of five questions, then the missing data can simply be ignored. Let me go through that again. If it's a total or sum score, where you add all the items together, and you're missing one of the items, you can't ignore that fact, because that person would naturally have a lower total, or the potential of a lower total, than all the patients with complete data. So what you can do is take the mean of the four non-missing items and add it to the sum of those four items to come up with a total; that adjusts for the missing value. But if the summary score is a mean itself, imputing the mean for the missing value adds nothing, because the four non-missing items have the same mean as if you imputed the mean for the fifth item and then calculated the mean of all five. So there it doesn't matter.

It's very important to understand that in both of these cases, whether you're calculating a total or a mean, you cannot ignore the minimum requirement for non-missing data. If you have five items that go into a score and you require two-thirds non-missing, so at least three of the values answered, you cannot calculate the score if three or four of them are missing. Whether it's a mean or a total, too much of the data are missing, and you can't come up with a valid score for that patient. So even when the score is a mean and you can ignore the missing values, or when you impute the mean in the case where the summary score is a sum, you still have to pay attention to this minimum non-missing requirement.

The final thing you want in your documentation is the meaning of the score and how it's scaled. You want to include the possible range, how high the score can be, and how a high score differs from a low score. You want something in your documentation that says, for example, "a higher score indicates more depression."
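The recoding and missing-data rules above could be coded along these lines; the data set and item names are hypothetical, and the threshold follows the at-least-3-of-5 rule of thumb mentioned here:

```sas
data scored;
  set quest;                      /* hypothetical 5-item questionnaire   */
  q2 = 6 - q2;                    /* flip a reverse-worded 1-to-5 item   */
  nm = nmiss(of q1-q5);           /* how many of the 5 items are missing */
  if nm <= 2 then do;             /* require at least 3 of 5 answered    */
    meanscore = mean(of q1-q5);               /* MEAN ignores missing    */
    totscore  = sum(of q1-q5) + nm*meanscore; /* impute the item mean    */
  end;                                        /* for each missing item   */
run;
```

If more than two items are missing, both scores are simply left missing, which enforces the minimum non-missing requirement for the mean and the total alike.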
A lot of times the biostatisticians, and possibly the investigators, don't know how the instrument is scored, so you want to make sure you say what a score means and what the scale is.

When you create an analysis data set and you're going to distribute it to the statisticians and the investigators, you want to include what we call a data set cover sheet. It doesn't have to look exactly like this one, but it includes the additional information about the data sets you're distributing. The project name; a lot of this you can get from the original concept proposal. Who the principal investigator is; again, that's typically on the concept proposal from when the original request was made, and you can even attach a copy of the concept proposal. The names of the SAS data sets you created. The name and location of the SAS programs used to generate those data sets; that's always nice to have, especially if you have multiple data managers working on the same project, or you're going to pass some of your responsibilities on to somebody else, because then they don't have to go searching for things; all the information is right there. When the data sets were created and who created them. Some of this is redundant with what you're putting in your SAS programs, but this is what's actually going to be submitted to other people; the statisticians and investigators are probably never going to see your SAS programs. We also list who the biostatisticians are, so we know who the data went to. Then a description of the cohort, meaning which subjects are included in this data set, and a description of the derived variables. This is your documentation; this is where you would put, for example, how age at ARV start was calculated and what it means. Then your SAS formats, with the PROC FORMAT statements. Typically we would also run preliminary statistics, just descriptive statistics like frequencies on the most important variables, and maybe include those in the document, but that part is more optional. If you create additional versions of the data sets, I usually provide an updated version of the cover sheet as well.

Now some general notes when you're using SAS. If the study is longitudinal, you need to provide at least two data sets: one that contains the longitudinal data, the data that's collected multiple times, and one that contains just the demographics, the cross-sectional data where you have only one observation per person. Statisticians know how to merge data sets, so if they need to get the date of birth into the longitudinal data set, they can merge the two data sets and make it happen. You don't ever want to put cross-sectional variables such as gender into the longitudinal data set. I sometimes have this argument with statisticians, because they want that data in there. I say, okay, if I put gender into the longitudinal data set, how are you going to calculate frequencies on gender? "Well, I'll have to manipulate the data set to find just one observation per person and then calculate the frequency on gender." And I say, or I can give you the cross-sectional data set, and you can generate frequencies on gender on that data set automatically, without any manipulation. There are trade-offs, but I think it's just cleaner and more concise if you're not mixing longitudinal data with cross-sectional data.
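For example, with a visit-level data set and a one-row-per-person demographics data set (hypothetical names throughout), a statistician who really wants gender on the longitudinal records can merge it in themselves, sorting into WORK copies so the permanent data sets are never overwritten:

```sas
/* adata.visits is longitudinal; adata.demog is cross-sectional */
proc sort data=adata.visits out=visits; by patient_id; run;
proc sort data=adata.demog  out=demog;  by patient_id; run;

data combined;
  merge visits (in=inv)
        demog  (keep=patient_id gender);
  by patient_id;
  if inv;        /* one row per visit, gender repeated on each row */
run;
```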
The way you ask yourself whether a variable is longitudinal or cross-sectional is: does it change over time? If it were in a longitudinal data set, would it be different, or could it be different, for every observation? For things like gender and date of birth the answer is no, so those are cross-sectional variables and should be in a separate data set.

You want to be careful about formatting your dates. I've made mistakes where I formatted dates with just a two-digit year and didn't realize there were some years in there that were 1895 instead of 1995. It's a good idea to always format your dates so you can see all four digits. When I'm working in East Africa I almost always use DATE9. (I apologize, I did not use it in these presentations), because then there's no confusion about whether the month or the day comes first. You also want to default to the numeric type; numeric types are just much easier to work with, and the statisticians prefer numeric variables over text fields.

When you're distributing a SAS data set, if possible have another data manager review the data sets and the documentation before distributing them. They don't necessarily have to read every line of the SAS code that created those data sets, but the idea is that they should be able to open the SAS data sets in SAS or SAS Viewer, look at the documentation, and understand what the data sets are for and what all the variables mean. Maybe they also need the form keys, but it should make sense to them. If it doesn't make sense to them, you probably need to do a little more work on the documentation, or there may be problems with how you actually generated the data sets or derived the variables. So it's really handy to have somebody else look at your data sets and documentation.
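The DATE9. advice from a moment ago, as a minimal sketch with hypothetical data set and variable names:

```sas
data visits2;
  set visits;                    /* hypothetical data set and variables */
  format visit_date dob date9.;  /* e.g. 07AUG1995: a four-digit year   */
                                 /* and no month/day ambiguity          */
run;
```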
When you're distributing data sets, especially to statisticians, you need to include the following. The form keys or electronic data dictionaries, whichever you have. The appropriate data sets, all with the extension .sas7bdat. The data set cover sheet we were just looking at. The latest data request form or concept proposal; it's almost never the same as the original, because after the concept proposal is submitted there are discussions and meetings, the aims get tweaked, the hypotheses may change a little, so you want to make sure the most recent, final version of that concept proposal is made available to the statisticians and the investigators. And any other documentation that further explains the data set.

In most cases, the following should not be distributed. You do not want to distribute any protected health information, and that includes things like the subject's name, address, phone numbers, social security numbers, and national ID numbers. The statisticians do not need this information to analyze the data. They may need GPS coordinates for some sub-studies, but you need to be really careful about distributing those. Even things like date of birth may not be necessary. What I usually do is calculate age at specific points in time for every observation and round that age to the nearest tenth. The statistician doesn't necessarily need the exact age, so they don't actually need the date of birth; and especially here in East Africa, where a lot of people don't know their exact date of birth anyway, you can substitute age for date of birth. You also don't need to give your SAS generation programs to the statisticians. At least in my SAS programs there's a lot of protected health information: places where I'm listing out people's dates of birth, and maybe even their names, so I can distinguish men from women and so forth.
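The age-for-date-of-birth substitution described above might look like this; the names are hypothetical, and I'm assuming a SAS release where YRDIF supports the 'AGE' basis:

```sas
data deident;
  set adata.cohort;                    /* hypothetical source data set  */
  /* age at ARV start, rounded to the nearest tenth of a year */
  age_at_arv = round(yrdif(dob, arv_start_dt, 'AGE'), 0.1);
  drop dob patient_name national_id;   /* strip PHI before distribution */
run;
```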
So I hardly ever make those programs available to the statisticians, and certainly not to the investigators.

Let's talk for a couple of minutes about file maintenance and archiving. For your own records, you should keep a copy of everything you give to the biostatisticians. Everything you've ever given out, and almost everything is electronic now, should be stored and preserved in an archive subfolder on your hard disk somewhere, because they'll lose things. They'll come back and say, "Can you resend that data set?" and if you've overwritten it, you have to say no, and you may be in trouble if you can't reproduce the same data set you had already given them. Make sure you archive a copy of all the logs and SAS programs, especially those that create permanent SAS data sets; but I would say for anything that creates data sets you're passing on to somebody else, you should save the log and the program, and even the output file too. It's a good idea to keep copies of the grant proposals, notes from meetings, the scoring algorithms and instructions, and any manuscripts. A lot of times you'll get the scoring algorithms from the published literature, and you should try to keep a copy in the folder associated with that particular study. It may be helpful to maintain a subdirectory for each specific study that holds all of these materials. In addition, you may also want a subdirectory that mirrors the data entry system. This applies mostly to the simpler data entry systems, like Access; if you've actually developed the Access data entry system yourself, keep a separate copy on your computer so you have a record of what the data entry clerks are seeing, or what the data manager at the site is actually using. For longitudinal studies in particular, it's really important to archive the data sets and SAS programs that were used for analyses, abstracts, and papers, because in a longitudinal study there will typically be interim analyses: you'll create analysis data sets midway through the study and then another data set at the end of the study. You need to make sure you preserve each of those in separate files.

Now a few notes about working with investigators and biostatisticians. You want to make sure you are seen as a team member. You want to be involved in the decision-making process and to contribute, because if somebody designs a data collection form without input from a data manager, they're probably not going to do a very good job, and it's going to be really hard to develop a good electronic data capture system for that questionnaire. So you want to make sure you become part of the team, and that involves attending the study meetings and taking notes at all of them. You certainly want to comment on any proposed study changes. I don't think I've ever been involved in a research study where, midway through, the investigator didn't say, "Okay, now I want to add this questionnaire," or "I want to stop doing this on this group of patients." You really have to have many discussions before you make those kinds of decisions, because they can really affect the outcomes of the study, and it's good to get the statisticians involved in those discussions as well, because they may have comments about how the changes will affect the analysis. You also want to try to understand the analysis plan. Sometimes it can get very complicated.
But understanding how the statisticians are going to use the data, and which variables are most important, will really help you focus your attention on those areas and make sure you've got the data properly formatted to address the questions. I like to review the statistical reports before they actually go to the investigator. I work in the Division of Biostatistics, very closely with a lot of statisticians, and we form a team on each project. I like to review the analyses they do, mostly to make sure that I have conveyed all the information to them appropriately, because when they start the analysis they don't have the knowledge of the raw data that I have. They may not have been involved in developing the data collection form; they may never even have met the investigator or know anything about how the study is conducted. So I do the best I can to convey all my knowledge to them, but it's sometimes obvious when I look at their analyses that they've overlooked something or misinterpreted one of my variables. So it's a good idea to meet with them on a regular basis and review their reports.

I also spend a lot of time looking at manuscripts and abstracts. People have come to rely on me to go through them with a detailed eye, comparing every value: is this the right number, is this the right mean, is this the right p-value, and so forth. But in addition, you can contribute to the narrative, even to the results section and the description of the study. It's good to have the data manager's input; and sometimes when I read manuscripts I realize that I didn't understand things exactly the way they are, and I may need to adjust some of the variables in my data sets.

So just remember that your contribution is extremely important in these settings, in research data analysis. That's all I have. Any questions? Thank you very much; I'll be here all week.