OK, this looks good. So thanks to the organizers for inviting me. I was actually going to talk about teaching R to physicians today, but I think I have something a little more interesting for this audience: work that's been going on for just a few weeks in my group, to solve a particularly annoying pain point we've encountered when working with complex REDCap databases. Let me see if I can get rid of this thing. The title of my talk is REDCapTidieR: Extracting REDCap Databases Into Tidy Tibbles. To give you a little background, I work at CHOP, the Children's Hospital of Philadelphia, where I lead a group called Cell and Gene Therapy Data Ops, or CGT Data Ops. CGT Data Ops is a data science team embedded in the Cell Therapy and Transplant Section of the Division of Oncology. I'm a physician by training; I see patients, but I also run this small data science team. CGT Data Ops has existed for about a year and a half now, and one of our key goals is the digital transformation of how we run clinical trials in which patients receive cell therapy products such as bone marrow transplants and CAR T cells, and specifically investigator-initiated cell therapy trials, those that don't have a hugely well-funded sponsor. We have more than 15 active trials like this going on. So here's the problem, the state of affairs when I started CGT Data Ops a year and a half ago. When we write up our trials into papers, we have our data sources: we use Epic as our electronic health record, and then we have protocol forms and other documents that the FDA requires us to put together. Usually these exist as PDFs, each on a computer, and they're then printed out and put into physical binders. We have hundreds of them, because we have hundreds of patients. This is exactly what one of these binders looks like.
Then, when it's time to write a paper, these would get manually re-entered, usually into Excel sheets, and the final data analysis would happen using an R script. I'm just kidding, this usually happens in Excel too. So that was the state of affairs a year and a half ago, and here was my proposal at the time. We wanted to move away from physical binders toward something like Dropbox, to have everything in there as PDFs; use REDCap as our standard electronic data capture system; and monitor both operations and outcomes for those clinical trials in custom-built R Shiny dashboards. Operations monitoring is things like how many people are on the trial, what's our enrollment, what's our accrual, who's coming up, which data that should have been entered are missing. Outcomes monitoring is things like what's our response rate, Kaplan-Meier curves, swimmer plots, those kinds of things. And importantly, I want to give a shout-out to Paul Harris's talk the other day. We've made a huge amount of progress toward this goal: we've implemented CDIS, and we've implemented a really nice standardized way to build complex REDCap databases with longitudinal structures. But a huge pain point we've encountered was in actually downloading a complex database into R for dashboarding, and I'm going to show you what I mean. So here's a brief outline for this talk: what is REDCap? What is REDCapR? And why REDCapTidieR? REDCap, for those of you who don't know, which I think is the minority, is a database solution meant to support research, but it can also be used for clinical operations, and it is used widely for clinical operations at places like CHOP. It's secure, accessible from a web browser, and can collect any type of data in any environment. That's how they bill themselves.
It's probably installed at your institution, which probably allows you to store PHI and set up your own databases, so it's very widely used. And for everybody who's a member of the REDCap Consortium, which is a lot and a lot of institutions, you can use it for free. So what is REDCapR? REDCapR is one of the R packages that exist for interacting with REDCap, and it's probably the one that's maintained best. The most important piece of REDCapR is a set of functions that wrap the REDCap API. REDCap comes with an API that lets you programmatically download data from REDCap using an API token, and REDCapR has functions such as redcap_read(), redcap_metadata_read(), and redcap_write() to interact with that API. One thing I want to point out here: it's extremely well engineered. There's a ton of unit tests, there's a lot of assertions, it's been around for 10 years, it's well tested and well maintained. So we love REDCapR. Now, as a motivating problem for REDCapTidieR, I could show you a data set with demographics and medications, but we've seen a lot of those over the past two days, so I thought it would be more fun to look at this data set, which has information about 736 superheroes. That's the superheroes database; you can go to this URL here. There are really two tables in this superheroes database, so we built a REDCap database containing the superheroes data with two instruments. One is heroes_information, an instrument that captures demographics of your superheroes: what's their eye color, what's their hair color, which universe are they in, Marvel or DC? Actually, I don't know anything about superheroes, but those kinds of things. This is what's considered a regular instrument, meaning one superhero per record. Then there's a second instrument, superhero powers, and this is what's called a repeating instrument.
So any superhero can have zero, one, or multiple superpowers. Here's what this looks like in REDCap. You have the first six records from the superheroes database; our first record is A-Bomb. For the heroes_information instrument, you can see this one circle, showing that there's exactly one instrument filled out with information about this specific record. Superhero powers, on the other hand, has multiple of these, and you can add additional ones: here there's just one, and here are multiple. If I click on Abomination, I can see that Abomination has eight superpowers, and if I click on this button here, I can see more. I just want to drive home that regular instruments and repeating instruments are something REDCap supports, and this is very useful for medical data: you might have the demographics of a patient, and then they might have multiple medications, so there are forms you want to fill in repeatedly even within a single record. Now, here's what it looks like when you try to use REDCapR to download this data set into R. What we have here is superheroes <- REDCapR::redcap_read_oneshot(...). That's a version of the redcap_read() function, and it has a very, very simple API: you just give it your REDCap uniform resource identifier, or URI, and the API token. What it actually spits out is a list of elements with return messages, plus a data element, which is a data frame with all of the data that comes out. So everything comes out as one big table, and here's what this big table looks like for our superheroes data set. Okay, so what the hell? Where do all these NAs come from? And if we look at the dimensions of this table: I said earlier that there are 736 superheroes, so why are there 6,700 rows in this thing? And why do I have 16 columns? Most of the data is NAs. This is what's called the sparse matrix.
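The call described above can be sketched as follows. This is a minimal sketch, assuming a hypothetical REDCap URI and token; REDCapR's redcap_read_oneshot() really does return a list whose records live in the data element.

```r
library(REDCapR)

# Hypothetical credentials -- substitute your institution's REDCap URI
# and your own project-specific API token.
uri   <- "https://redcap.example.edu/api/"
token <- "YOUR_32_CHARACTER_API_TOKEN"

# redcap_read_oneshot() returns a list with return messages plus a
# `data` element; the records themselves are in result$data.
result <- redcap_read_oneshot(redcap_uri = uri, token = token)
superheroes <- result$data

# With a mix of repeating and non-repeating instruments, this data
# frame is the "sparse matrix": far more rows than records, mostly NA.
dim(superheroes)
```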
So welcome to the matrix. The sparse matrix is what the REDCap API, and therefore REDCapR, returns when you have a mix of repeating and non-repeating instruments. It's unwieldy, it's huge, and it's very confusing what the NAs mean: our database is actually fully entered with values, so there shouldn't be any missing data, and every NA here is in fact an artifact of how the table gets put together. There's important metadata missing: given this data matrix, you don't know which fields come from which instrument, except for those that come from the repeating instrument. And, importantly, the meaning of a row in the data set is not consistent. These first two rows here represent information coming from the first instrument, which is a regular instrument, so one row means one record. That's not the case for these other rows, in which one row represents not one record but one record-repeat. So you have a mix of different granularities in the same table. That is not tidy; it's inconsistent with the definition of tidy. This is the problem we're trying to solve: having to deal with this sparse matrix. We named REDCapTidieR by squeezing the "tidy" idea in between REDCap and R, to really highlight that this is something built on top of REDCapR, which we're huge fans of. The main function, read_redcap_tidy(), has a similar API to redcap_read() or redcap_read_oneshot(), which is what you usually use: it really just requires a REDCap URI and API token. What it does is return a set of tidy tables, and the big idea here is that you get one table for each REDCap instrument. So let's look at what this looks like.
So here, we load the library REDCapTidieR and put the output of the read_redcap_tidy() function into superheroes_tidy. It doesn't return a list, so I don't have to do the awkward $data thing at the end. If I then look at this superheroes_tidy object, I can see it's a tibble with two rows. The first column is redcap_form_name: the name of the instrument. The second column is redcap_data, and this is a list column that contains the tidy tables we were talking about. You can see that one has a lot of rows and not a lot of columns, and one has far fewer rows and more columns, corresponding to heroes_information and superhero powers. It also tells me what the structure is: repeating or non-repeating. So here's the idea behind what we've called the REDCapTidieR supertibble: you get one row for each instrument, and a list column with tibbles that contain the actual data. And I'm going to make the claim that the tables in there look exactly like what you expect them to look like. Okay, I realize that's a bold statement, but let's look at the outputs for the two example instruments we looked at in the superheroes database. Let me check how much time I have for this... okay, it doesn't look too bad. So you start with a key column, which is your record ID. You'll be familiar with this if you've ever looked at a classic or regular database without repeating instruments; that's what it looks like. Then we have the data, and it's not sparse anymore: there are no random NAs in here anymore. And then, at the very last column, we have an indicator column for whether the form status is complete.
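The supertibble workflow just described looks roughly like this. A sketch under stated assumptions: the URI and token are placeholders, and the function name read_redcap_tidy() and the column names redcap_form_name/redcap_data are taken from the alpha-stage package as presented in this talk.

```r
library(REDCapTidieR)
library(dplyr)

# Hypothetical credentials, as before.
uri   <- "https://redcap.example.edu/api/"
token <- "YOUR_32_CHARACTER_API_TOKEN"

# read_redcap_tidy() returns the supertibble directly -- no $data step.
superheroes_tidy <- read_redcap_tidy(uri, token)

# One row per instrument; the data itself lives in the redcap_data
# list column alongside the instrument name and structure.
superheroes_tidy

# Pull one instrument's tidy tibble out of the list column.
heroes_information <- superheroes_tidy %>%
  filter(redcap_form_name == "heroes_information") %>%
  pull(redcap_data) %>%
  .[[1]]
```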
Okay, so this is going to be the structure of each tidy table that comes out of REDCapTidieR: we start with a key, then we have the data, and then form_status_complete at the end. Now let's look at the repeating instrument. Here we have record_id and redcap_repeat_instance, and together they form a composite key: these two columns allow me to uniquely identify a row and, if I wanted to, use that key to join the multiple tables I get from the project back together. So we have a composite key, one column with the data we're interested in, and form_status_complete. Okay, so what about longitudinal projects? Longitudinal projects are something REDCap supports for when you don't just want to repeat instruments on demand, but want to create a structure in your project where you fill out one instrument at predefined intervals. For example, you could have a physical exam that you fill out once at each visit, and you have a pretreatment visit, an infusion visit, and then day 7, day 28, one-month, two-month, and three-month follow-up visits. So how do we handle these? The idea, again, is that one instrument is one tidy table: if an instrument shows up in multiple events, it will still be just one table in your output. The second key idea is that the composite key REDCapTidieR makes up for you depends on the instrument structure and the project structure. We just saw that in a classic project, which is what the superheroes data set is (it's not longitudinal, it's classic), the record ID is our composite key, so it only has one column, while in a repeating instrument we have record_id plus redcap_repeat_instance. Now, this idea extends very nicely to longitudinal one-arm and longitudinal multi-arm kinds of projects.
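The composite-key idea can be illustrated with two toy tibbles standing in for the instrument tables; the column names and values here are illustrative, not actual package output.

```r
library(dplyr)
library(tibble)

# Regular instrument: keyed by record_id alone (one row per record).
heroes_information <- tribble(
  ~record_id, ~name,        ~gender,
  1,          "A-Bomb",     "Male",
  2,          "Abe Sapien", "Male"
)

# Repeating instrument: keyed by (record_id, redcap_repeat_instance),
# so one row is one record-repeat.
super_hero_powers <- tribble(
  ~record_id, ~redcap_repeat_instance, ~power,
  1,          1,                       "Accelerated Healing",
  1,          2,                       "Durability",
  2,          1,                       "Cold Resistance"
)

# Because each tibble carries its (composite) key, linking instruments
# back together is an ordinary join on the shared key columns.
heroes_with_powers <- super_hero_powers %>%
  left_join(heroes_information, by = "record_id")
```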
We just have additional partial key columns, redcap_event and redcap_arm. What this allows you to do is this: even though you're going to get back a bunch of tibbles that you may want to link back together, because you have this composite key, it's very easy to join these tibbles back together in any way you want. At the same time, for a lot of analytic tasks, one instrument often already contains the data you actually want to put together into an analytic object, and this really reduces your cognitive load, in terms of how many things you have to deal with at the same time. Now, one thing I glossed over a little is how, if you're getting a list with all of these instruments, you extract the tibbles out of that supertibble object and actually work with them. I glossed over it because you'd have to do some subsetting and some extracting, some single-bracket, double-bracket kind of action. But we've provided a helper function for this, called bind_tables(). It takes a REDCapTidieR supertibble object and magically makes the tibbles inside it appear in your environment. If I fake-click on this play button here, notice the environment is empty right now, and remember we had two tables in this object; now these two tables appear in your environment. Behind the scenes, this works with rlang's env_poke() to inject the tables into your environment. You can also give it a different environment object, for example for Shiny applications, where we like to use environments; you can do that as well.
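The injection mechanism behind bind_tables() can be sketched like this. This is not the package's implementation, just a minimal illustration of the rlang::env_poke() idea; the function name bind_tables_sketch and the named-list input are my own stand-ins.

```r
library(rlang)

# Given a named list of tables (standing in for the supertibble's
# redcap_data column keyed by instrument name), poke each table into a
# target environment under its instrument name.
bind_tables_sketch <- function(tables, envir = global_env()) {
  for (name in names(tables)) {
    env_poke(envir, name, tables[[name]])
  }
  invisible(envir)
}

# Usage: after this call, heroes_information and super_hero_powers
# exist as ordinary objects in the chosen environment -- here a fresh
# one, as you might use inside a Shiny application.
tables <- list(
  heroes_information = data.frame(record_id = 1, name = "A-Bomb"),
  super_hero_powers  = data.frame(record_id = 1, power = "Durability")
)
e <- new_environment()
bind_tables_sketch(tables, envir = e)
exists("heroes_information", envir = e)  # TRUE
```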
This makes it really easy for an analyst, if you have a database with lots of tables, to immediately have them in a place where you can start to work with them. So I want to invite everybody to try this for yourselves; I'm going to post this link in the chat. But I do want to warn you that this is still very alpha, though I think we're now at a point where we're ready to have the tires kicked by the community. I'd really like to hear what happens when you try to load your own databases with this: all the errors you get, and what fails. Again, this is purely read functionality; we're not writing anything, so it's not going to be destructive. But I'd really like to hear feedback. And I realize I'm out of time, so, future work. What we're going to try to do next: support, and actually default to, labels instead of raw data for categorical fields; create helper functions to extract individual tables into named objects, if you don't like the bind_tables() magic, non-pure-function business; be able to pull metadata; and gracefully deal with incomplete data sets, which is currently a bug. So thank you very much, and I'm happy to take a question if there's time.