Hello, I'm Peter Higgins, and I have the privilege of introducing Michael Kane, one of the founders of R/Medicine, who has served on the organizing committee for the past four years. He is an assistant professor at the Yale School of Public Health in the Biostatistics Department and is interested in scalable statistics. Today he'll be presenting on collaborative, reproducible exploration of clinical trial data.

All right. Thanks, Peter. So today I'm going to be talking about two different things, but they're pretty parallel. With a lot of the work that I do, I research methods, but I also work with clinicians, and I do a lot of data analyses around ADaM-formatted data sets, or clinical trial data. A lot of times that clinical trial data comes after the trial is complete, and people want to look at population heterogeneity: what were the different profiles of the individuals in the trial? Or a lot of times they want to see what happened when a trial goes bad, and why it went bad.

This talk is going to be specifically about the exploration portion of those analyses, mostly because I've been building tools with a couple of other people and we're up through the exploration part, which is, I think, one of the more interesting parts.

So this talk is about a couple of different things. First, how do I think about my relationship with my clinical collaborator? Essentially, what is my role and what is their role, and how does understanding those roles help get to better results? Second, how do I present results? Again, the tools I'm going to show are mostly for ADaM-formatted data sets or something close. So we're dealing with clinical data, usually on the order of tens to hundreds of different variables. How do we provide a comprehensive view that's consumable both by the statistician who's generating these things and by the clinician? At the same time, I want to do this without spending hours and hours creating visualizations and tables in R Markdown. And this is where we get to the tools that are available, including the ones I'm going to present, and how they're used for these types of collaborations. So I want to present not only the tools but the context in which I'm using them.

I tend to be the technical lead for a lot of these analyses. Sometimes I have an analyst who will run things for me; a lot of times I'm doing these things myself. One of the things to realize when working with a clinician is that your goal is to really be a partner with them in the investigation. If you don't manage that part of the collaboration, you will probably be managed by the clinician. Clinicians are busy, and they tend to be really good at organizing research. If you have trouble showing or providing value quickly, you're probably going to get tasked with things that help them with that organization instead.

So specifically, I usually think of my job as a statistician as presenting all of the relevant data and their basic relationships. And again, this is the exploratory portion.
Usually there is some idea of a hypothesis, but you go through the exploratory portion after cleaning the data, first to make sure the data look like we expect and there were no problems with the cleaning, and then to evaluate the hypothesis and see whether there are other interesting hypotheses. A lot of times my job is then to create a well-posed hypothesis. By well-posed, I mean a hypothesis that I can answer by statistical means with the data at hand. Clinicians vary in terms of the amount of statistical background they have. Some have great intuition but are not necessarily going to be able to tell you what the analysis is; others can suggest analyses all the way up through which statistical test they want to see. After that, my job is to test the hypothesis: execute whatever the analysis is and then provide an interpretation of the results.

In general, clinicians usually know at least enough statistics to be dangerous. At the same time, as a statistician, I should be learning at least enough clinical science to be dangerous. The goal is to keep each other out of danger and provide scientific insights by statistical means.

Especially as we're thinking about the exploratory portion, the context for a lot of these analyses is research, and research is inherently inefficient in some ways. But that doesn't mean all portions of it should be inefficient. From experience, the parts that are legitimately inefficient are creating these well-posed hypotheses, where sometimes we have to iterate and go back and forth about what they mean, and interpreting the results. Those parts are the most research-y, and because of that we usually accommodate a little more inefficiency there. The idea is, for the other parts, to minimize the inefficiency so we can really focus on the scientific questions and on answering them.

So one of the questions here is: how do we make the exploratory portion more efficient? We have the classic data workflow, but I want to think about what we need along with that, or how we go beyond it. In general, we want to make systematic how we clean the data, how we communicate the summaries, and how we navigate those summaries. Because I'm usually dealing with tens to hundreds of variables, there are enough relationships that I'm not going to hand-write all of the R Markdown code myself to create the tables and visualizations. And if I have a large number of these tables and visualizations, I want to think about how to get to the ones of interest as quickly as possible.

Another thing is that it's a lot better when the clinician can engage with the digital artifacts you're creating. I'm usually focused on well-organized tables and visualizations, and after those are created, I usually put them up as HTML documents in a place where the clinician can find them, so they can be shared before meetings. Those documents usually end up being the focus of the meetings I have with clinicians, especially when we're thinking about which hypotheses we want to test.
Going a little bit beyond this, this tends to be how I think of the workflow for these collaborative relationships: there's data acquisition, processing and normalization, analysis, and then communicating the output, with output artifacts coming out of each stage. The thing to remember is that each of these stages usually involves an evaluation of its output artifacts, and that usually means going back and doing some kind of exploration of the results with visualizations and analyses. So we really want to think about how we capture those artifacts, and how we make sure that everyone knows about them and that they're incorporated into the research process and into any decisions we're making.

This is basically a classic waterfall model. In software engineering there's the concept of a waterfall for the construction of software; this is a waterfall for the construction of these analyses and the artifacts that present them. One other thing is that you can have more than one analysis, in which case you have other tributaries coming off of the waterfall. But the idea with the waterfall is that anything downstream depends on everything upstream. That leads to a principle for cleaning data: if we're changing data, we want to change it as far upstream as we can, so that, if we're thinking in terms of a workflow, any of those changes cascade forward into the analyses and the other artifacts we create.

I'm going to show an example of the type of visualization we end up producing. (Let me get out of full screen.) This is one we did a little while ago, for a KRAS study. If you go to Project Data Sphere, you can find both treatment and control data for FOLFOX and FOLFIRI. These are the types of analyses we end up doing. Most of the time it begins with a table. Lately we've been adding a tab beforehand that describes the data cleaning process; a lot of times that includes something like a CONSORT diagram showing how many patients we expect to have, how many samples we're dropping, and what the data look like afterward. Then we have the data review. This particular analysis was based around subtyping, so we were looking at which patient profiles were associated with response or non-response, along with these other visualizations. Again, the idea is that we have a lot of artifacts we want to present, and we don't want to have to create all of these R Markdown documents, tab by tab, ourselves. We want to automate the process of generating them, and present them in a way that's navigable.

All right, so how do we do this? There are three packages I'm going to show today. One is forceps: the idea is that you can take a data frame and assign roles to its variables. This borrows from the tidymodels nomenclature; in tidymodels there's a notion of roles such as dependent or independent variable. I want to generalize that a little so that we can declare our own roles and then start thinking about combinations of relationships between the variables in each role, which is handled by the variant package. And then I'm going to use a third package, called listdown.
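As a quick setup sketch before I dive in: installation looks roughly like this. The repository paths are assumptions based on my GitHub handle and the package names (listdown is also on CRAN), so adjust them as needed.

```r
# Install the development versions from GitHub.
# Repository paths are assumed to follow the pattern <handle>/<package>.
install.packages("remotes")
remotes::install_github("kaneplusplus/forceps")
remotes::install_github("kaneplusplus/variant")   # role-based perspectives
remotes::install_github("kaneplusplus/listdown")  # newer than the CRAN release

library(forceps)
library(variant)
library(listdown)
```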
listdown is then used to take those artifacts, organize them, and present them as HTML. All three of these packages are on GitHub; my handle is kaneplusplus, and each repository is named after its package. listdown is on CRAN right now, in a slightly earlier version, so I'm going to be showing some of the newer functionality.

So I'm just going to start with this example. The forceps package includes a couple of data sets: an ADSL data set, biomarker, adverse event, and demography. These are modeled pretty closely after the kinds of data sets you see that are usually ADaM-formatted. What I'm going to do is import those and do a little bit of processing on them. One thing to note is that the adverse event data is longitudinal; the other data sets are not. So I'm going to use the cohort function, which basically says: by subject ID, if the other variables are repeated for a subject ID, collapse them to a single row; if a variable can't be collapsed to a single row, put it in its own table that we'll use as another column in the data set.

If you want to see what that looks like, I have this data list, and the adverse events table is this collapsed version of the data set. If I look at the beginning of data_list$adverse_events, you can see I have the subject ID, there's an adverse event count, and then all the other variables, which are longitudinal, are embedded in a table.

After I have those data sets put together, I'll consolidate on the subject ID. This does an outer join of those four tables, plus a little bit extra: it checks for name collisions and does a couple of other checks to make sure the data look OK. Once we have that, we have this nice, tidy data set, and the adverse events, which are longitudinal, are stored as a list column. If we need something else from the adverse events, like a single feature, we can use the purrr map functions to grab it. If we need to change the shape of the data to include the adverse events, say for a longitudinal analysis, we can unnest on the adverse events, selecting only the columns we need. We can even go further: right now each row corresponds to a single patient, but if we wanted each row to be a site, for example, we could nest on site ID and start thinking about the characteristics of the different sites.

Now that I have that data set, I'm going to assign roles to it. All I'm doing here is creating a list whose names are the role names. I'm going to say that baseline refers to the variables in demography and biomarkers, minus the subject ID and the site ID; in that case, baseline refers to that set of variables in the table. I do the same thing for the other roles. Once I have this roles list put together, I add the roles to my table. The roles are stored as an attribute, which means x is still a regular tibble that you can do all of your tibble-y things with, but it carries this extra information that allows us to do a couple of other things. If you want to see the roles per variable, you can just call roles(x).
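Pulling that together, here's a compact sketch of the forceps steps I just walked through. The data set names, the argument names, and the add_roles() helper are reconstructions from memory rather than the exact API, so treat this as the shape of the workflow:

```r
library(forceps)
library(dplyr)
library(tidyr)
library(purrr)

# Collapse the longitudinal adverse-event data: variables that repeat
# within a subject reduce to one row; those that can't be collapsed are
# nested into a per-subject table (a list column). Names are assumed.
data_list <- list(
  adsl           = adsl,
  biomarkers     = biomarkers,
  demography     = demography,
  adverse_events = cohort(adverse_events, on = "usubjid")
)

head(data_list$adverse_events)  # subject ID, AE count, nested AE table

# Outer-join the four tables on subject ID, checking for name collisions.
x <- consolidate(data_list, on = "usubjid")

# The nested AE tables behave like any list column:
x |> mutate(n_ae = map_int(adverse_events, nrow))  # grab a single feature
x |> unnest(adverse_events)                        # long format for longitudinal analyses
x |> nest(patients = -site_id)                     # re-pivot: one row per site

# Assign roles: a named list mapping role name -> variables.
roles_list <- list(
  baseline = setdiff(c(names(demography), names(biomarkers)),
                     c("usubjid", "site_id"))
  # ... endpoint, admin, survival, etc. defined the same way
)
x <- add_roles(x, roles_list)  # stored as an attribute; x stays a tibble
roles(x)
```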
In that output you can see the term, which is the variable; the role that was assigned; and the type, which I just keep track of.

Now that I have those things, I can start doing operations based on roles. One thing I can do is use gtsummary: I take x, grab all the variables in the admin role, which is only the site ID, and also keep arm. That grabs just the admin-role variables along with arm, and I can pipe that to tbl_summary(), for example, and I get something like this. Rather than needing to select each of the variables individually, I just select them by role.

The other thing I can do is create perspectives. A perspective is either a table or a visualization summarizing possibly conditional univariate or bivariate relationships. It's characterized by a formula and a data frame, and the variant package provides reasonable defaults. The idea is that I want to show relationships between variables within roles, and I want to create a lot of those visualizations. This is an extensible package, so if you don't like any of the perspectives, the visualizations or tables being generated, you can override them yourself. We basically tried to come up with a reasonable set of defaults.

So to create a perspective, I take my x, push it through the perspective function, and basically say I want my y to be endpoints and my x to be arms. What's returned is a table with the y terms corresponding to endpoints and the x terms corresponding to arm. One thing to note is that I made one of these roles survival so that I'd be able to visualize those outcomes as well. If I want to look at just one of these perspectives, I call ps$perspective1, say, because each perspective is held within a list. So here's one: best response by arm, with a reasonable visualization. Again, a lot of this is exploratory; I'm not worried about these being publication-ready. I just want something reasonable that lets me go through and understand the data pretty quickly, and that lets me commoditize the generation of these.

If I want to look at all of them, I can push this through trelliscope. I select my y term, my x term, and the perspective, push it through trelliscopejs, and I get this visualization, which lets me scroll through each of the perspectives, including overall survival. I can show more than one at a time. I can also add extra cognostics, so that if I want to prioritize which visualizations to see, I can sort or filter based on those extra values.

The last thing I want to do is actually report these things. We now have this nice framework for showing relationships between roles, and if we need to make something more interactive, because we have a lot of visualizations to prioritize, we can put things into trelliscope. But now we want to structure all of it. So I'm going to build a lot of these artifacts: perspectives and trelliscope visualizations.
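In code, that middle stretch looks something like the sketch below. Here select_role() and perspectives() are my shorthand for the role-based selection and perspective-building calls, so their names and signatures are reconstructions; tbl_summary() and trelliscope() are the real gtsummary and trelliscopejs entry points:

```r
library(gtsummary)
library(trelliscopejs)
library(dplyr)

# Select variables by role rather than by name, keeping arm for grouping.
x |>
  select_role(c("admin", "arm")) |>   # hypothetical helper; selects via the role attribute
  tbl_summary(by = arm)

# Build every endpoint-by-arm perspective at once; the result is a table
# with one row per (y term, x term) pair and a list of plots/tables.
ps <- perspectives(x, endpoint ~ arm)  # hypothetical signature
ps$perspective1                        # e.g. best response by arm

# Browse all of them interactively, with sortable/filterable cognostics.
ps |>
  select(y_term, x_term, perspective) |>
  trelliscope(name = "endpoint-by-arm", panel_col = "perspective")
```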
Then I use listdown to create my output document. To do that, I first describe the R Markdown structure with a list: a named list that points to my visualizations, perspectives, tables, or whatever else. I construct a tab by giving listdown that list along with a little extra information, and then I embed the tab in a web page that I'm generating.

Here's how I do this. First, here are what I call the computational components. We have a baseline, so this is just going to be a single tab: my table summary like I had before; my site count information, where I'm using my perspective, so admin by arm; and then the adverse events by arm as well. Again, all I'm doing is creating a list that points at these objects.

After that, I create a listdown object, which says how to actually render the document. I'm going to be using ggplot2 and gtsummary to render, because those are the packages that were required to make those objects, and I want my R Markdown chunks to not echo, not show messages, and not print warnings. Then I create a header, telling it that I want a table of contents and that I want it floating. Then I create the tab by bundling these three things together; ld_bundle_doc() is a function supported by listdown. Then I create the pages that make up my web page, which is just going to be a single page called Summary Tables, using the tab I described before. We still need a little more information, so I add a little YAML and tell it where to write the output.

And here's the result: I have the baseline table I had before; I have the site counts, where there's only the site ID perspective; and then for adverse events there was only the one AE count perspective. And it's complete. This is a pretty extensible framework; there are a lot of other things you can do to extend or customize it, and there's definitely a lot more to it, but that's the quick example. A compact sketch of this listdown step follows below.

At this point, I'm out of time, so I just want to say thanks to my collaborators. Here are some of the packages that were being used. And thanks very much.
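For reference, here's roughly what that listdown step looks like in code. listdown(), ld_rmarkdown_header(), and ld_make_chunks() are in the CRAN release; the tab bundling with ld_bundle_doc() and the page construction are from the newer GitHub version, so this sketch sticks to the CRAN pieces, and the object names are placeholders for the artifacts built earlier:

```r
library(listdown)

# 1. Computational components: a named list pointing at the artifacts.
#    The names become headings; baseline_tbl, site_ps, and ae_ps are
#    placeholder names.
cc <- list(
  Baseline = list(
    "Table Summary"  = baseline_tbl,
    "Site Counts"    = site_ps,
    "Adverse Events" = ae_ps
  )
)
saveRDS(cc, "cc.rds")

# 2. How to render: the packages the chunks need, plus chunk options
#    (no echo, no messages, no warnings).
ld <- listdown(
  load_cc_path = "cc.rds",
  package = c("ggplot2", "gtsummary"),
  echo = FALSE, message = FALSE, warning = FALSE
)

# 3. An R Markdown header; in the talk the header also sets
#    toc: true and toc_float: true.
header <- ld_rmarkdown_header("Summary Tables")

# 4. Write the document and render it to HTML; the GitHub version's
#    ld_bundle_doc()/pages step plays this same role for multi-tab pages.
writeLines(c(as.character(header), ld_make_chunks(ld)), "summary-tables.Rmd")
rmarkdown::render("summary-tables.Rmd")
```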