Thanks, everyone. We are up for our last talk today from our own Michael Cain, and he's going to be talking about processing clinical trial analysis data with the forceps package.

Hello, I'm back. This talk is going to be pretty short, and if we don't have time for questions, I'll be at the virtual happy hour and you can ask me questions then. This talk is about creating an analyzable data set from standard FDA submission data.

As a bit of background: clinical trial study data submitted to the FDA has to conform to standards put out by a consortium called CDISC, the Clinical Data Interchange Standards Consortium, whose goal is to enable information system interoperability to improve medical research and related areas of healthcare. Basically, if you can standardize the format of the data, then it's easier to analyze, it's easier to validate analysis results, and it's potentially easier to reuse for other trials, to inform other trials, and to develop new hypotheses and eventually new therapies.

One of the things that gets submitted with an FDA submission is the Analysis Data Model, or ADaM, formatted data. This is the individual-level patient data, and it has to be validated. So if you have something like a cancer clinical trial, you define an endpoint, which says what it means for the therapy to be successful or not successful. You need two different analysts or programmers independently writing code to extract those variables. Those variables, along with all the other data you have on the individuals, go into the ADaM formatted data, which is used directly in the statistical analysis to show a therapy's efficacy. So the ADaM data sets go in, and along with them there are scripts.
These are usually still SAS scripts, and they go in to show that, yes, this therapy is actually effective.

So why might we care about ADaM formatted data? These are the data that are provided for a trial, and we might be creating one of these reports ourselves. The other thing is, I don't know if many people know this, but there is a huge movement right now to make clinical trial data available, often the control arm or standard-of-care arm. This has done a lot to let us understand the prognostic progression of disease. Two of the big resources I recommend right now are Project Data Sphere for oncology, if you're studying cancer, and ImmPort for other types of diseases. These data facilitate a lot of biomedical research. Like I said, I usually do prognostic characterization of patients, usually with control data. And then there's also subtype identification and analysis: the idea that some patients coming in, because of the characteristics we have about them, are more or less likely to respond to a drug.

So what's the challenge, and why am I giving this talk? The big issue we have is that the data format is SAS-centric, and one of the things SAS is not so great at is mixing these different types of variables. That really comes from two sources. First, you don't get proper factor encoding in SAS the way we do in R. Second, there's no notion of a nested data frame in SAS, so if you have a longitudinal data set and one that's not longitudinal and you want to combine them, you're stuck repeating values, and that's not how we need to do these things in R. On top of that, these data sets are not tidy. That is, you have columns in the data set that correspond to multiple variables.
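As a sketch of what R offers here that SAS does not: a data frame can hold another data frame in a list column, so longitudinal records nest inside a one-row-per-patient table with nothing repeated. Patient IDs and values below are invented for illustration.

```r
# One-row-per-patient frame; the longitudinal adverse-event records
# nest inside a list column, so nothing has to be repeated.
patients <- data.frame(usubjid = c("1003", "1004"), age = c(61, 54))
patients$ae <- list(
  data.frame(aeterm = c("anemia", "nausea"), grade = c(2, 1)),
  data.frame(aeterm = "fatigue", grade = 1)
)
nrow(patients)    # still 2: one row per patient
patients$ae[[1]]  # the full adverse-event history for patient "1003"
```

Each cell of the `ae` column is a complete table that can be un-nested or summarized on its own, which is the shape forceps aims to produce.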
So one of the things you need to do before an analysis is actually tease those variables out and think about how you're going to spread them.

The forceps package, which I'm talking about right now, is a minimal set of functions that facilitate the creation of data sets with one row per patient. If you have longitudinal data, it gets nested for that patient, and then you have the option to un-nest it yourself, or extract features from it, and do your analysis however you want. There are four main verbs: a data description, which I'll show in a second; contradictions, which looks for contradictions in variable assignments and types across the different data sets we're given; consolidate, which puts them all together; and cohort, which lets us look at these data sets at different levels, either the individual level, or maybe the site level, or by some demographic feature like sex.

The package includes a set of toy sas7bdat files in line with what you'd expect to see in an ADaM submission. If you want to know what the data look like, usually it's just a set of these sas7bdat files, so we can use haven to look at the actual contents in this example. There are three data sets: ae, which is longitudinal; biomarker, which is not; and demography, which is also not. Here's an example of ae. On top of being longitudinal, you can see it actually has multiple variables in the same column. AE stands for adverse event, and you have the different types of adverse events, along with their grades, along with the times they were seen. And this is the biomarker data. It doesn't actually have genetic indicators, but it does have things like ECOG status, smoking, and so on.
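To make the three shapes concrete, here is a minimal mock-up of the tables described above. Column names loosely follow CDISC conventions and every value is invented; with real submission files you would read them with something like `haven::read_sas("ae.sas7bdat")` instead.

```r
ae <- data.frame(              # longitudinal: one row per adverse event
  usubjid = c("1003", "1003", "1004"),
  aeterm  = c("anemia", "nausea", "fatigue"),
  grade   = c(2, 1, 1),
  aeday   = c(14, 30, 8)
)
biomarker <- data.frame(       # one row per patient
  usubjid = c("1003", "1004"),
  ecog    = c(1, 0),
  smoker  = c("never", "former")
)
demog <- data.frame(           # one row per patient
  usubjid = c("1003", "1004"),
  age     = c(61, 54),
  sex     = c("F", "M")
)
```

Note that `ae` repeats the patient ID across visits, while `biomarker` and `demog` are already one row per patient, which is exactly the mismatch the package has to reconcile.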
And again, in this case it's one row per patient, whereas before a patient could appear multiple times because it's a longitudinal data set. The last data set, demography, is about the same as the second data set in terms of its format.

So what would we like the data to look like? Again, I'd like one row per individual, and if I have a longitudinal data set, I'd like it nested in a data frame in one of the columns. In this case, for patient 1003, their longitudinal adverse event information is contained in a table that I can then extract features from, or that I can un-nest to do a more traditional longitudinal analysis. Any of the other variables, especially the ones that were repeated in the longitudinal data set, only get shown once; I only need them once per patient.

The first thing I usually do when cleaning these data sets is a data description. The sas7bdat files usually have a little bit of extra meta-information included, so you can pull that out and look at it next to the actual variable names. After that, you usually want to find conflicts. In this case we have three different data sets that we've read in. You do run into cases where, for a given patient, you'll have the same variable name repeated but the values will actually be different. That may be because they're actually different measurements, or because there's a coding error, and both do occur in these data sets, even though they're being submitted for clinical trials. So you can call contradictions on this, and it will find all the contradictions.
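The forceps API itself isn't shown here, but the idea behind the contradiction check can be sketched in base R: for each variable shared across the data sets, line the per-patient values up side by side and flag patients whose values disagree. The function and column names below are my own, not the package's.

```r
# Flag variables whose per-patient values disagree across data sets.
# An NA in one data set versus a recorded value in another counts as
# a disagreement, as in the overall-survival-days example.
find_contradictions <- function(dfs, id = "usubjid") {
  shared <- setdiff(Reduce(intersect, lapply(dfs, names)), id)
  out <- list()
  for (v in shared) {
    pieces <- mapply(function(d, i) {
      d <- d[!duplicated(d[[id]]), c(id, v)]   # one value per patient
      names(d)[2] <- paste0(v, "_", i)         # tag the source data set
      d
    }, dfs, seq_along(dfs), SIMPLIFY = FALSE)
    m <- Reduce(function(x, y) merge(x, y, by = id), pieces)
    bad <- apply(m[, -1, drop = FALSE], 1,
                 function(r) length(unique(r)) > 1)
    if (any(bad)) out[[v]] <- m[bad, , drop = FALSE]
  }
  out
}

# Toy inputs: os_days conflicts (NA vs 233); age agrees everywhere.
ae_like   <- data.frame(usubjid = "1003", os_days = NA_real_, age = 61)
bio_like  <- data.frame(usubjid = "1003", os_days = 233,      age = 61)
demo_like <- data.frame(usubjid = "1003", os_days = 233,      age = 61)
conflicts <- find_contradictions(list(ae_like, bio_like, demo_like))
names(conflicts)  # only "os_days" is flagged
```

The returned list shows, per conflicting variable, the side-by-side values so you can decide whether it's a genuine second measurement or a coding error.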
That is, all the places where a variable has different values in the different data sets. For example, here I have one for OS days, overall survival days. You can see that it's NA in ae for patient 1003, but it's 233 in both the biomarker and demography data sets. We have a similar issue with the OS censor variable, and there's also an issue with age in this particular data set.

What I might want to do next is pivot the adverse events wider, so that I have the adverse event and the grade, and then think about how to join that. The way I do that is to first go to the adverse event data set and handle the variables that are repeated. For example, sex is repeated for the patient, because the patient's sex does not change over the course of the trial, so I can just do a summarize and pull the first value of that variable per patient. After that, I get the actual adverse events and the grades. I do that with a pivot_wider nested inside a foreach, because you do tend to run into consistency issues with the data; inside the foreach I can use a tryCatch to make sure there aren't any errors, and if there are, I can be a little more careful about how I return the result. After that, for each of the adverse event variables I have, an NA indicates the patient didn't have that adverse event, so I just set it to grade zero, meaning it didn't happen. After that, I do a full join, and I have my new adverse event data set. Then I can consolidate: ds here holds the three data sets that were put together, I'm consolidating on the patient ID, and I have a data set that's now analysis ready.
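That whole sequence can be sketched in base R on toy data. The talk uses tidyverse verbs (summarize, pivot_wider inside a foreach with tryCatch); the `stats::reshape` call below is just a stand-in for the pivot, and all names and values are invented: keep the first value of patient-constant variables, spread event/grade pairs into columns, recode NA grades to zero, join, then merge everything on the patient ID the way consolidate would.

```r
ae <- data.frame(  # longitudinal adverse events, invented values
  usubjid = c("1003", "1003", "1004"),
  sex     = c("F", "F", "M"),          # repeated within patient
  aeterm  = c("anemia", "nausea", "fatigue"),
  grade   = c(2, 1, 1)
)

# 1. Patient-constant variables: keep the first value per patient.
constants <- ae[!duplicated(ae$usubjid), c("usubjid", "sex")]

# 2. Spread event/grade pairs into one column per adverse event
#    (a base-R stand-in for pivot_wider).
wide <- reshape(ae[, c("usubjid", "aeterm", "grade")],
                idvar = "usubjid", timevar = "aeterm",
                direction = "wide")

# 3. An NA grade means the event was never observed: code it as 0.
wide[is.na(wide)] <- 0

# 4. Join back to the patient-constant variables.
ae_flat <- merge(constants, wide, by = "usubjid")

# 5. Consolidate-style step: merge all one-row-per-patient tables
#    on the patient ID.
demog    <- data.frame(usubjid = c("1003", "1004"), age = c(61, 54))
analysis <- Reduce(function(x, y) merge(x, y, by = "usubjid", all = TRUE),
                   list(demog, ae_flat))
```

The result is one row per patient with columns like `grade.anemia` and `grade.fatigue`, where a zero records that the event never occurred for that patient.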
Along with that, there's a cohort verb. I'm not going to show it because of time, but the idea is, again, that if you have repeated variables, you can collapse them on the cohort variable of interest. I was showing cohorting by patient ID, but you can also do this on things like the site ID or demographic variables.

As far as whether it can be used now: this is actually the second iteration of a package called normalizer that I've been working on, and it's stable. The forceps package will be up and available in the next few weeks. Since this is the second iteration, I have a pretty good idea that the interface is not going to change very much. It will be available on my GitHub page, slash forceps. And that's it. Thanks very much.

Thanks, Michael. Let's see, I do believe that we're pretty much out of time. So, like you pointed out earlier, you're going to be in the happy hour session, and anybody who wants to ask any questions, please feel free to hit him up in there. Thanks very much.