Good morning, good afternoon, or good evening. My name is Ethan Heinzen. I'm a senior data science analyst at Mayo Clinic, and today I'll be presenting the arsenal package. My co-presenters are Beth Atkinson and Jason Sinnwell, who will be joining me in the chat.

The motivation for this R package stems primarily from our work at Mayo Clinic. Mayo Clinic is a three-site research hospital with a fairly substantial statistical presence: we have over 500 people who use R but who are also doing statistics. Historically, we've been a SAS shop, using SAS for most of our data and analysis needs. About five or six years ago, a SAS license renegotiation revealed our critical dependency on SAS. That made leadership a little wary, so they asked us to evaluate how we could reduce our SAS footprint, which naturally led to the question: what's the best way to port our in-house macros and other SAS procedures to R? We started with an R package we called RLocal, which was internal to Mayo. That package contained both private functions (for accessing Mayo data, for example) and some public functions. We eventually decided to separate out the public functions that we thought might be useful to an external audience into a CRAN package, and that's how we arrived at arsenal.

The goal of arsenal is to mimic and improve upon the SAS functionality, but to do so easily. We didn't want a large barrier to entry for people moving from SAS to R; we wanted things to feel familiar, with similar output. The first release of the package came December 30th, 2016, and we've had a few major releases since then, with minor releases every couple of months and patches in between.

There are six main functions in arsenal, and we're going to talk about five of them today. The first is the tableby function, which is intended to replace the %table macro from SAS.
%table is basically used to take a data set as input and output a "table 1" summarizing the data, usually for a particular cohort. The paired function is a lot like tableby, except that it's used to report on paired data. modelsum replicates the %modelsum macro, which was used to fit models over a set of independent variables. freqlist mimics SAS's PROC FREQ procedure, compiling a frequency table over a set of variables. comparedf replicates SAS's PROC COMPARE procedure, which compares two data sets and looks for differences. And finally, the write2 family of functions I think of as something like ODS output: the basic idea is that you just want to write results to a file.

arsenal comes with a built-in data set called mockstudy. mockstudy represents a traditional medical data set, with one row per person (the case variable) and a variety of demographic and clinical information: age, sex, the follow-up time on a given study, the arm the patient is on, et cetera.

We'll start with tableby, which is arsenal's bread and butter. The idea here is that we want to tabulate the sex and age variables across the three arms of the study: for example, to ensure that a particular randomization scheme worked, or just to make sure there aren't stark differences between the arms that might explain a difference in a particular model result. So we feed tableby a traditional R formula with the by variable, arm, on the left-hand side and the variables to be tabulated (the independent variables) on the right-hand side: arm ~ sex + age. We feed that into tableby and then summarize it with the summary function.
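A minimal version of the call just described might look like this (a sketch; tableby and the mockstudy data ship with arsenal):

```r
library(arsenal)
data(mockstudy)

# Tabulate sex and age across the three treatment arms
tab <- tableby(arm ~ sex + age, data = mockstudy)

# Render the result as a simple pipe table
summary(tab, text = TRUE)
```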
Here we're passing the argument text = TRUE to tell tableby that we just want a simple pipe table. tableby, and in fact all the arsenal functions that report tables, use knitr's kable() as the table-generating engine. The pipe table is, I think at least, a little ugly, so if we omit text = TRUE and render to a PDF, like the slides I have here, we get this really nice-looking, attractive table.

One thing you'll notice with tableby is that categorical variables (here, sex) are treated differently from numeric variables (age). tableby has what we think are meaningful defaults for each of the different data types. It natively supports categoricals (character, logical, and factor), ordered factors, survival objects, dates, and also results from arsenal's selectall function.

Now, the most common requests people get about tables are: can I change the labels? Can I change which summary statistics are shown, say the median instead of the mean? Can I change the statistical test? Or, my favorite, can you show fewer significant digits on that number? All of these are easily accomplished with tableby. There are a few different ways to set labels, and summary statistics, p-value tests, and decimal places can all be controlled using tableby.control() or inline, as we'll see on the next page.

So here's an example call that modifies the table slightly. The first thing you'll notice is that we're wrapping the variable sex in the function fe(), which stands for Fisher's exact test. What we're telling tableby is that we want to use Fisher's exact test on the sex variable instead of the default chi-squared test. We then pass digits.pct to indicate that we don't want any trailing digits on the percentages.
Second, we wrap the age variable in notest(), which means we don't want a p-value for that variable. We also want only one digit, as opposed to the default, which is three, as you can see here. Finally, we change the summary statistics by passing unnamed arguments that match function names, here "median" and "q1q3", to indicate that we want the median and the first and third quartiles instead of the default mean and standard deviation. Lastly, when we summarize the table, we include pfootnote = TRUE to request a footnote indicating which statistical test was run.

tableby also works without a by variable; the default there is simply to summarize the overall data set. tableby also allows you to stratify, subset, and summarize multiple endpoints. In this example, we pass list(arm, sex) on the left-hand side to indicate that we want to tabulate both of those by variables against the same variables on the right-hand side. We also specify strata = ps to indicate that we want the statistical summaries tabulated separately for each level of the ps variable, and we subset to only those rows where ps is zero or one. In the output, the top table has the arm variable across the top, the bottom table has the sex variable across the top, and the summaries are separated for ps = 0 and ps = 1.

There are a lot of other tableby features we didn't have time to talk about today, including a data frame method for both the tableby object and the summary object. You can subset variables, reorder them, and delete them using `[`, head(), and tail(); you can sort and filter by p-values; you can merge two tables together; and you can supply custom p-values and custom user-defined statistics.
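Putting those pieces together, the modified calls might look something like this (a sketch based on the options just described; the inline-modifier spellings follow tableby's documented syntax):

```r
library(arsenal)
data(mockstudy)

# Fisher's exact test for sex, no trailing digits on percentages;
# no test for age, one digit, median and Q1/Q3 instead of mean (SD)
tab2 <- tableby(
  arm ~ fe(sex, digits.pct = 0) + notest(age, "median", "q1q3", digits = 1),
  data = mockstudy
)
summary(tab2, pfootnote = TRUE)

# Two by variables at once, stratified by ps and subset to ps 0 or 1
tab3 <- tableby(
  list(arm, sex) ~ age + bmi,
  strata = ps, data = mockstudy, subset = ps %in% c(0, 1)
)
summary(tab3)
```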
The modelsum function basically runs separate models for various independent variables against the same endpoint. Here, we're modeling alkaline phosphatase as a function of arm and, separately, alkaline phosphatase as a function of ps. Again we use the formula interface, with the outcome variable on the left-hand side and the separate independent variables on the right-hand side. In the bottom table you'll see two separate models, each with its own intercept, with the main effects in bold and a variety of summaries for each model: the point estimate, the standard error of that point estimate, the p-value, and some model-level summaries like the adjusted R-squared and the number of missing observations.

Now, you might want to add common adjusters to both of those models. Here we pass sex and age as the adjuster variables, as a one-sided formula, and both models then include sex and age. That table gets to be kind of a lot to look at, so you might want to hide the common adjusters and even the intercepts; you can do both using arguments to the summary function, show.adjust = FALSE and show.intercept = FALSE. As with tableby, there are lots of other options: you can change the model family to Poisson, binomial, a survival model, et cetera; you can change the labels, the decimal places, and the summary statistics; you can convert the result to a data frame; and you can subset variables, reorder them, delete them, and merge two tables.

The third function I want to talk about is freqlist. freqlist, as you'll remember, mimics SAS's PROC FREQ procedure, basically outputting a frequency table. Here we pass a one-sided formula to tabulate sex by arm by ps, and the default output looks a lot like PROC FREQ's. The variables are on the left.
The frequency is the next column, and the last three columns are the cumulative frequency, the percentage of the total, and the cumulative percentage of the total. Just like with the xtabs function, you can pass a left-hand side to this formula to indicate case weights for each observation. One neat trick with freqlist is that you can sort the table by frequency using the sort function. Note that the default doesn't duplicate labels: as you can see here, the "Male" label is reported only once for the whole column, and everything below it is assumed to take on that same value. The argument to disable that is dupLabels, which we set to TRUE here when sorting the table, so that we can see exactly which levels of each variable each frequency count represents. Just like with the other functions, with freqlist you can coerce to a data frame, change the labels, subset variables, and merge two tables together.

The next function I want to talk about is comparedf. comparedf compares two data frames and looks for differences. Here we're using the function muck_up_mockstudy(), which basically changes some variable names, deletes some observations, and edits some data. When we run comparedf, comparing mockstudy to the mucked-up version, we want to compare the rows that we expect to match up; here, we expect rows to match by the case variable, which you'll remember is the participant ID. The print method gives a fairly decent summary of what's going on in the comparison. You can see that there are nine non-by variables and 1495 observations shared, but seven variables that aren't shared between the two data sets and four observations that appear in one but not the other. You can also see that differences were found in three of the seven variables compared, and that three variables compared have non-identical attributes.
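That comparison might be run like this (a sketch; muck_up_mockstudy() ships with arsenal precisely for this kind of demonstration):

```r
library(arsenal)
data(mockstudy)

# A deliberately "mucked up" copy of mockstudy:
# renamed variables, dropped rows, edited data
mockstudy2 <- muck_up_mockstudy()

# Match rows on the participant ID before comparing
cmp <- comparedf(mockstudy, mockstudy2, by = "case")
print(cmp)
```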
The summary function gives a more robust output, but it's too long to include on this slide. You can also change the tolerances of the comparedf function: for example, you might want integers to be treated as numeric so they can be compared with numerics, or factors coerced to characters so they can be compared with character variables, or you might want to match variable names that differ only in case. Here you can see that we've eliminated some of our not-shared variables and increased the number of shared variables. There are also extractor functions that let you see what the actual differences are, as you can see here.

Finally, we'll look at the write2 functions. The idea with write2 is that you just want to output results to a document. There are three main functions for that: write2word, write2pdf, and write2html. The write2 function itself, however, supports other output formats that R Markdown supports. Here we're passing a list of objects that we want written to a PDF: the tableby object, a linear model summary, some raw LaTeX code, a level-one heading, the modelsum object, and a code chunk that we want evaluated. The output looks a lot like this: we have our tableby object, the results of the linear model, and then, after a new page, the modelsum table, the raw LaTeX, the level-one heading, and the evaluated code chunk.

To wrap up, here are some resources, including the docs pages, the issues page, a link to this presentation, and all of our GitHub handles so you can connect with us on GitHub. Thank you.