Hi everyone, I'm Gabe Becker, and welcome to my useR! talk, where I'm going to talk about the rtables package, which is for creating complex, multi-level, report-grade tabulations in R. In particular, the types of tabulations you might need to do if you're filing regulatory reports about clinical trials. So first off, please bear with me: this is the first video talk that I've given, so hopefully it goes well, but if not, hopefully you will all be forgiving.
So, a few disclaimers. This work was fully funded by and is copyright Roche. It has been released open source under the Apache 2 license, and it is under active development. In fact, all of the features that I'm going to talk about today are currently in a feature branch called gabe_tabletree_work, but they have been tagged as a release on GitHub that you can see there, and they will be merged into the primary branch and be in production in the future. And finally, all the data in this talk is completely fabricated.
So, display tables. What am I talking about when I'm talking about complex tabulations? Let's think about what we mean when we say visualizing or summarizing data. Most people would typically make some plots, or they would create some model fits and things of that nature, and we're not really going to talk about any of that in this talk. So consider this table. We can see that it has structure in both the rows and the columns that goes beyond the simple row names and column names that we might be used to, and we're going to talk quite a bit more about what exactly that means in a second. But this is the type of table that I am talking about, where we have a complicated structure and then a tabulation, basically along that structure, that has been applied to some raw data set.
So first off, I'm going to claim that this is a visualization, because both the contents and the visual layout structure are functions of raw data, which is different from the data that appears in the table itself. The structure, and thus position within the table, is defined by subsetting and/or variable selection, whereas the contents of each cell within the table are the results of specific computations on those subsets or selected variables, and we're going to see more about what that means in just a moment.
So the rtables package allows us to declaratively build up a table layout and then apply it to data. Layouts are thus reusable, which is particularly useful if you're in a context where data standards control your data sets, as we are in pharma. This was motivated by the tables required for regulatory filing, for example to the FDA, but the framework that I'm going to be discussing is fully general, data agnostic and standard agnostic.
So there are only a few core instructions, or core layout pieces, that rtables provides. These are nested splitting, in row space by the split rows by and split rows by sibling functions, and in column space by the similarly named column-based functions; then summarizing the groups that are defined in row space; and then analyzing variables based on those subsets, those groups.
So this is the raw data that went into that table that I showed you a few minutes ago. You can see we have just six columns: arm, country, gender, handed (which is handedness), age and weight. It doesn't really matter what these are, because the data is completely made up anyway, but this is what they are. We have 400 rows, and each row is an individual observation, as in typical tidy data. So let's start simple. We start with a basic table.
We can see that here, and then we say: analyze the age variable with the analysis function mean. We give it a format, and I'm not going to talk too much about the format, but that just controls how it gets displayed. And then we say: use our layout to build the table, the display table, based on this raw data. Because we don't have any structure in row or column space, we have a single column that has all of the observations, and we only have one row, which is our analysis row: the application of our function, in this case mean, to all of our data on the variable age, which is the variable that we asked it to analyze. So that was pretty painless, but that would be painless in pretty much any system, so that's not particularly interesting.
So what's the next step? Next, let's say: take your column, your universal all-observations column, and split it by arm. What this means is that we don't know how many arms there are in a data set. In this case there are two, but we could apply this same layout to a data set that had four arms, and we don't need to know that. All the layout says is that for each value of the arm variable, make a new column. In this case, because we have two arms, we have two columns, and then we analyze the age variable the same as before, but now the mean of age is taken for each of the two arms. So this is also pretty straightforward; nothing super fancy going on here yet.
Next, let's say: now take each of those two columns and split that column by gender. We don't need to know how many arm columns there were, and we don't need to know how many genders there are. And we end up with our four columns. Note that this is solely for the purpose of screen real estate: we are only using male and female for gender here, but that should not be taken as a statement about other genders.
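As a rough sketch of the three steps described so far (the slide code itself is not part of this transcript), the layouts might look something like the following in R. This assumes a recent CRAN release of rtables and R 4.1 or later for the native pipe; the data frame `dat` and its column names `ARM`, `SEX` and `AGE` are hypothetical stand-ins for the fabricated data in the talk.

```r
library(rtables)

# Hypothetical stand-in data (NOT the data set used in the talk).
set.seed(1)
dat <- expand.grid(
  ARM = c("Arm A", "Arm B"),
  SEX = c("Female", "Male"),
  rep = 1:10
)
dat$AGE <- round(runif(nrow(dat), 20, 70))

# Step 1: no splits at all, so one column and one analysis row.
lyt1 <- basic_table() |>
  analyze("AGE", afun = mean, format = "xx.x")

# Step 2: one column per level of ARM, however many levels the data has.
lyt2 <- basic_table() |>
  split_cols_by("ARM") |>
  analyze("AGE", afun = mean, format = "xx.x")

# Step 3: nest a split by SEX inside each ARM column.
lyt3 <- basic_table() |>
  split_cols_by("ARM") |>
  split_cols_by("SEX") |>
  analyze("AGE", afun = mean, format = "xx.x")

# The layouts are declared without looking at the data;
# the data only comes in when the table is built.
tbl1 <- build_table(lyt1, dat)
tbl2 <- build_table(lyt2, dat)
tbl3 <- build_table(lyt3, dat)
tbl3
```

Because a layout never mentions the levels of `ARM` or `SEX`, the same `lyt3` could in principle be applied unchanged to a data set with a different number of arms.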
And so we split each of our arm columns by gender, and that was as simple as adding a new split by gender to our layout, because these are being nested: each new split happens within all the existing splits on the axis that it is being applied to, unless you specifically tell it otherwise, which you can do. So now we've got the mean age for arm A females, for arm A males, for arm B females and for arm B males.
So now it's getting a little bit more complicated, but it's not too bad yet. The next thing is that we're going to split by country. So we're going to say: now split on the rows. We could split more on the columns, but the table that I showed you before had the country along what we would think of as the y axis, the vertical axis, so we're going to split the rows by country. So now you can see we just split rows by country, and now we have a Canadian mean and a USA mean for each of those arm and gender groups that we had before. And again, we don't need to know how many countries there are. We don't need to know how many of any of these things there are, and we can apply this to any data set with any number of countries, and it would work exactly as it does now, because the layout is declared before the data is seen. So this is still not too bad.
So next we're going to summarize each of these countries. By summarize, I mean count how many observations there were for that country, and what percentage that represents of all of the observations within the column that it is in. So Canada had 66 females in arm A, and that was 53% of all of the females in arm A, is what that means. All we had to do to get that is summarize row groups. This is the default summarization behavior. You can actually control that and make the summarization do whatever you want, which we'll see a little bit later in this talk, but the default is pretty nice, I think. So there you go.
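A minimal sketch of the row split and the default group summary just described, again assuming a recent CRAN release of rtables and R 4.1 or later. The data frame `dat` and the column names `ARM`, `SEX`, `COUNTRY` and `AGE` are hypothetical, not the talk's actual data.

```r
library(rtables)

# Hypothetical stand-in data (NOT the data set used in the talk).
set.seed(1)
dat <- expand.grid(
  ARM = c("Arm A", "Arm B"),
  SEX = c("Female", "Male"),
  COUNTRY = c("CAN", "USA"),
  rep = 1:5
)
dat$AGE <- round(runif(nrow(dat), 20, 70))

lyt <- basic_table() |>
  split_cols_by("ARM") |>
  split_cols_by("SEX") |>
  # Row-space split: one group of rows per country found in the data.
  split_rows_by("COUNTRY") |>
  # Default summary: count and percentage of the column's observations.
  summarize_row_groups() |>
  analyze("AGE", afun = mean, format = "xx.x")

tbl <- build_table(lyt, dat)
tbl
```

The percentage in each summary row is taken relative to the column the cell sits in, which matches the "66 females, 53% of arm A females" reading above.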
But now we're going to say we want to split each country by handedness. The summaries of each country remain there, and that aspect of the layout doesn't change, but within each country you are now splitting on left-handed and right-handed. And so now we have this multifaceted splitting in both the column and row directions, and it's still the same: all we have to do is say analyze, and everything just happens.
And then we can further summarize each handedness within each country, so now we have multiple levels of summary at different levels of splitting in row space. And again, all we had to do was summarize again after a new row split. And finally, just a little nicety here: there's a little function that adds the total column counts at the top of the columns. So now we can see that there were 124 females in arm A, and that's why that 66 was 53% there.
So this is the exact table that I showed you before. We've recreated it with just a few very simple instructions to build up the layout. So that's the core of what rtables is trying to allow people to do.
So I told you this was motivated by regulatory filings. Let's look at some regulatory filing tables. These are obviously not real data, as I said at the beginning of the talk, but let's have some fun. First off, we're going to have a custom summary function. I don't want to spend too much time on this, but we can see here that we've got this in_rows function. And in_rows is basically an indication that a summary can have more than one row in it; an analysis can also, in fact, have more than one row in it, using the same in_rows function. So this summary is actually going to have two rows for each grouping: patients with at least one event, and total number of events.
And the summary function accepts the full data frame that represents the subset, and then it grabs the relevant ID column and does some things to it. So there's your adverse events table. Apologies for clipping the third column there a little bit, but that just says "combination" under there; nothing super exciting.
You can see that all we did is split by arm, add our column counts, and analyze the subject ID variable with the summary of events and patients, which gives us our top-level, all-patients summary there. And then we split by AEBODSYS, which I honestly couldn't really tell you what that is, but I think it's the type of adverse event, the large class. Then we trim the levels of AEDECOD, which is the more specific class within each grouping of AEBODSYS, and we label the kids, which just gives us these empty label rows here. Then we summarize those row groups with our summary function, and then we analyze this inner AEDECOD. So that gives us, for each large class, which is the AEBODSYS, our summary, and then underneath that we have values, which are just the counts, for each level of this AEDECOD variable.
And then we need to do one more thing, because an adverse event table actually has one row per event, but we want these counts up here to be patients. So we just need to calculate the patient numbers for each of the columns, which are the arms here, and we can pass that to build_table. So you can actually override the column counts, and the tabulation framework will take those column counts when it is applying summary functions and things like that. Obviously a lot of care needs to be taken when you're doing that, but it is supported.
So now we've got a couple of small utilities here. For some reason, true/false data is encoded as yes/no in this; don't ask why, I was not involved.
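Going back to the adverse events table, a rough sketch of a two-row custom summary and of overriding the column counts might look like this. It assumes a recent CRAN release of rtables and R 4.1 or later. The function name `s_events_patients`, the tiny data frame `ae` and the per-arm patient counts are all hypothetical, and the level-trimming and child-labelling steps mentioned in the talk are left out for brevity.

```r
library(rtables)

# Hypothetical adverse-event data: one row per EVENT, not per patient.
ae <- data.frame(
  USUBJID  = c("p1", "p1", "p2", "p3", "p3", "p4"),
  ARM      = factor(c("A", "A", "A", "B", "B", "B")),
  AEBODSYS = factor(c("cl A", "cl A", "cl B", "cl A", "cl B", "cl B")),
  AEDECOD  = factor(c("t1", "t2", "t3", "t1", "t3", "t3"))
)

# Custom summary: receives the data-frame subset for the current cell and
# returns two rows via in_rows(). .N_col is the (possibly overridden)
# column count supplied by the framework.
s_events_patients <- function(df, labelstr = "", .N_col) {
  n_pat <- length(unique(df$USUBJID))
  in_rows(
    "Patients with at least one event" =
      rcell(n_pat * c(1, 1 / .N_col), format = "xx (xx.x%)"),
    "Total number of events" = rcell(nrow(df), format = "xx")
  )
}

lyt <- basic_table() |>
  split_cols_by("ARM") |>
  add_colcounts() |>
  split_rows_by("AEBODSYS") |>
  summarize_row_groups(cfun = s_events_patients) |>
  # Simple per-level counts of the inner term variable.
  analyze("AEDECOD", afun = list_wrap_x(table))

# Hypothetical numbers of patients per arm; in practice these would come
# from a subject-level data set rather than from the event data.
n_patients <- c(10L, 12L)

tbl <- build_table(lyt, ae, col_counts = n_patients)
tbl
```

With the override in place, the percentages in the first summary row are computed against patients per arm rather than events per arm, which is why the counts passed in need to be checked carefully.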
And then we just have a summary that counts and gives a percentage. And then we have a little utility that I wrote that trims rows that have all zeros for each of the columns, and that's going to be a post-processing step that is not part of the layout itself.
And now we have a disposition table, which is another pretty important regulatory table that all clinical trials need to submit. So what we're doing here: we're splitting the columns, and we're analyzing completed study, just giving the counts and the percents. The percent is always 100, so that's not really that interesting. Then we split rows by discontinued the study, and we summarize those. Then we split rows by this discontinuation category, and we reorder the levels, because we want safety to be first, since it's the more important one. Then we summarize the groups, so we summarize safety and other here. And then we analyze this reason for the discontinuation here. Then we build the table, and then we trim the zeros, and there you have it: there is your disposition table. It's a little bit more complicated because we have some custom functions here, but I think overall it's still pretty simple.
So I'd like to acknowledge Adrian Waddell and all of Roche biometrics, who have been very supportive both of this work, which has taken a lot of doing and which they've invested in, and also of it being released open source. Adrian has been a very big proponent of that.
And finally, those of you who are paying attention are probably wondering where that simulated data came from that I was using for those regulatory tables. The answer is that there is a package coming soon where you can randomly generate data against the CDISC standard, so look for that a little bit later in the year when we get it released. It's already working internally.
So with that, thank you all for listening, and I hope to see you all in the virtual question and answer sessions when they are scheduled.