Hi, I'm Ethan Brown from the University of Minnesota's Research Methodology Consulting Center, and I'm here to talk to you today about simpr, our new R package for concise and readable simulations for the tidyverse. We're currently available on GitHub and working on our CRAN release; you can get it with install_github("statisfactions/simpr").

We created this package because we believe that simulation is a powerful and essential feature of R. However, writing simulation code often requires custom functions and nested loops, which can be difficult for R users at all levels. The simpr package streamlines the simulation workflow using a tidyverse approach, and it is especially helpful for systematically varying simulation parameters, which we often do, for instance, in a power analysis.

So let's look at simulating the power of a t-test in base R. First we might set up a matrix of conditions and create nested loops for looping over that matrix, and then the data generation and model fitting end up buried in the lowest level of the loop, or in a custom function that you might write. In the example code below, I generate a matrix of conditions using expand.grid() for both sample size and a difference in means, with 500 repetitions. I split that data frame into a list of one-row data frames that I can easily loop over, and then I loop over it using lapply() or similar base R functions: an outer loop over the conditions and an inner loop over the individual repetitions. (We could also use replicate() for this purpose.)

Then finally we get to the meat, the data-generating mechanism. First we generate a data frame with a control condition, where we call on condition$n to get the sample size, and an experimental condition, where we call on that sample size again along with condition$mu_diff, which specifies the difference in means between the two conditions. Then we run a t-test, extract the p-value, and record some housekeeping information so we know which condition and mean difference that p-value belongs to. Finally, we do some data manipulation with do.call(), rbind(), unlist(), and the works in order to get the results into a readable format for plotting and further analysis.

There are several approaches that make this cleaner and more readable using tidyverse tools, such as dplyr's slice() and purrr's rerun(). However, the overall workflow is still unclear, and essential aspects of the simulation get buried in data-manipulation syntax. So our question was: how do we streamline the overall workflow of a simulation study?

The simpr approach is to focus attention on the essentials of the simulation. We start with blueprint(), where we specify the data-generating mechanism; meta(), where we specify the parameters that vary across conditions, such as sample size; produce(), where we generate the actual simulated data sets; fit(), where we fit models to the data; and tidy_fits(), where we tidy the output using broom. If we simulate the power of a t-test in simpr, we can do it in a relatively compressed way that highlights the essential simulation features very readably: blueprint, meta, produce, fit, tidy_fits. I'll talk about each of these steps in more detail in the upcoming slides.
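A minimal sketch of the base-R workflow just described; the talk does not show the code verbatim, so the specific sample sizes are assumptions for illustration:

```r
# Condition matrix: sample sizes and mean differences (values assumed)
conditions <- expand.grid(n = c(25, 50, 100, 200),
                          mu_diff = seq(0, 1, by = 0.2))
reps <- 500

# Split into a list of one-row data frames to loop over
condition_list <- split(conditions, seq_len(nrow(conditions)))

results <- lapply(condition_list, function(condition) {   # outer loop: conditions
  one_cond <- lapply(seq_len(reps), function(i) {         # inner loop: repetitions
    # The data-generating mechanism, buried at the bottom of the loops
    control <- rnorm(condition$n)
    experimental <- rnorm(condition$n, mean = condition$mu_diff)
    tt <- t.test(control, experimental)
    # Housekeeping so we know which condition this p-value belongs to
    data.frame(n = condition$n, mu_diff = condition$mu_diff,
               rep = i, p.value = tt$p.value)
  })
  do.call(rbind, one_cond)
})

# Reshape into one data frame for plotting and further analysis
results_df <- do.call(rbind, results)
```

And a sketch of the compressed simpr version, using the blueprint/meta/produce/fit/tidy_fits verbs named in this talk; the argument details are assumptions based on the descriptions here, and later versions of the package may use different names:

```r
library(simpr)
library(magrittr)  # for %>%

power_sim <- blueprint(
    control = ~ rnorm(n),                       # n is a meta-parameter
    experimental = ~ rnorm(n, mean = mu_diff)   # so is mu_diff
  ) %>%
  meta(n = c(25, 50, 100, 200),
       mu_diff = seq(0, 1, by = 0.2)) %>%
  produce(500) %>%                              # 500 repetitions per condition
  fit(t_test = ~ t.test(.$control, .$experimental)) %>%
  tidy_fits()
```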
What this chain produces is a tidy table that we can then use for plotting power. If you look at the first row of the table, you'll see that for the first repetition (rep = 1), where the difference in means is 0 and n = 25, we get a t-statistic of 1.33 and a p-value of 0.191. This format allows us to plot relatively easily with some simple data manipulation: I load the tidyverse (really I'm just using dplyr and ggplot2), group by our simulation factors, the meta-parameters, summarize the power, and feed that into a simple plot (I sketch this below). Here we can see power curves increasing with sample size for each of the different conditions of the difference in means. Of course, we're hovering around 0.05 for a difference in means of 0, and we have steeper and steeper curves for bigger differences in means.

So let's look a little more at this simpr chain and how we build up a simulation like this. Part of the idea of having things in a chain is modularity: each stage of the chain can be saved, which is convenient for simulations with shared features.

We start with blueprint(). Each argument is a function specified using purrr's formula syntax, and each can refer to previous named arguments as well as to meta-parameters. I'll talk more about those in a moment, but a meta-parameter is something such as the sample size n. The blueprint is, of course, the most challenging part to specify, because it is our data-generating mechanism: we have to understand our model in order to be able to simulate it. blueprint() returns an R object with the specification information, so nothing has been generated yet.

Here, just to make this more interesting, I'm using a lognormal rather than a plain random normal, which would be analytically straightforward. In the blueprint I have a tilde, purrr's formula-function syntax. Without having to refer to anything else, I can just write rlnorm(n), with n as the sample size, which I'll define later. For the experimental variable I have n again, plus mu_diff. Note that these two variables, n and mu_diff, are never defined in the blueprint: they are meta-parameters that we'll define using meta().

meta() allows us to specify these meta-parameters, that is, any simulation factors that will be systematically varied. By default, all possible combinations of meta-parameters are generated. Typical meta-parameters for a power analysis are sample size and effect size, but simpr is not limited to those. Meta-parameters can be lists, for instance of alternative correlation matrices, or functions; the ways we can specify them in meta() are quite flexible, and there are lots of examples in the package documentation. For this example we set n to go from 25 to 200, and mu_diff to go from 0 to 1 by 0.2. This is similar to how we specified it in base R, but now simpr knows that these are meta-parameters and handles them without us having to explicitly loop over a condition matrix.

produce() actually generates the simulated data based on the specification. It takes just one argument, reps, the number of replications per condition, and the result is a table with one row per data set, with the data sets included as a list column. In this example it's relatively simple to specify: we produce 500 repetitions per condition, and we get a tibble where each row is one data set.
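A sketch of the plotting step described above, assuming the tidied results from the earlier chain are stored in power_sim and that tidy_fits() keeps the meta-parameters as columns, as the table shown in the talk suggests (the broom output for a t-test includes a p.value column):

```r
library(dplyr)
library(ggplot2)

power_sim %>%
  group_by(n, mu_diff) %>%                      # group by the meta-parameters
  summarize(power = mean(p.value < 0.05)) %>%   # proportion of significant tests
  ggplot(aes(n, power, color = factor(mu_diff))) +
  geom_line()
```

And a sketch of the lognormal blueprint, meta, and produce steps just described; treating mu_diff as the meanlog of the experimental group, and the exact sequence of n values, are my assumptions:

```r
library(simpr)
library(magrittr)  # for %>%

sim_data <- blueprint(
    control = ~ rlnorm(n),
    experimental = ~ rlnorm(n, meanlog = mu_diff)
  ) %>%
  meta(n = seq(25, 200, by = 25),
       mu_diff = seq(0, 1, by = 0.2)) %>%
  produce(500)   # one row per generated data set, stored as a list column
```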
Looking at that tibble, we have repetition number 1, where n = 25 and the difference in means is 0, and its data set is a tibble with 25 rows. Scrolling further down, we see tibbles with more rows as the sample size increases. The table has many rows overall, because we have 500 repetitions per combination of n and mu_diff.

After that, we can actually fit the model, or apply any other arbitrary R function, to each generated data set. We refer to the generated data set using purrr's dot (.) syntax, which we can see in the example, and we can fit multiple models to the same data; fit() adds a list column for each model. Again we use purrr's formula-function syntax, t_test = ~ t.test(...), and we refer to the columns of the current data set with .$ (for some functions, such as lm(), we would instead use data = .). This just adds another column, t_test, onto our table; everything else is the same as the result of produce().

Finally, we use tidy_fits() to tidy the model outputs using broom. This works for many R models and creates a tibble with one row per model that we can easily use for further analysis and visualization. If tidy() is not exactly what we want for our particular case, or doesn't support the model we're using, we can use apply_fits() to apply any arbitrary function to the model objects. Invoking tidy_fits() gets us to the same tidy data set I showed earlier in the presentation.

So let's talk about a slightly more advanced example: how does the number of categories in an ordinal variable affect power for the Pearson correlation and the chi-square test? Now I'll use the fact that in blueprint() we can refer to previous variables. I generate x1 as a random normal, and x2 as x1 plus another random normal, so I'm basically creating two correlated variables. There are other ways I could do this, but I want to show off the fact that I can refer to x1 within the specification of x2. Then I cut x1 and x2, creating ordinal versions of them by breaking them into b breaks, where b is defined later. In meta() I define n, the sample size and how it varies, and b, the number of breaks, which varies from 2 to 7. Simply enough, I produce 500 repetitions again, fit both chi-square tests and Pearson correlations using the same syntax, and then tidy. (I sketch this chain below.)

Now I have two rows per unique data set: the first and second rows describe the same data, but the first shows the chi-square p-value and the second shows the Pearson correlation p-value. simpr takes care of all of this behind the scenes for us. We can plot using the same strategies as before, and we see, not surprisingly, that the power of the chi-square test stays about the same across the number of breaks, whereas the Pearson correlation test becomes more powerful the more breaks we have.
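A hedged sketch of that ordinal example as described; the n values, and applying cor.test() to the numeric codes of the cut variables, are my assumptions rather than the talk's exact code:

```r
library(simpr)
library(magrittr)  # for %>%

ordinal_sim <- blueprint(
    x1 = ~ rnorm(n),
    x2 = ~ x1 + rnorm(n),              # x2 refers to the previous variable x1
    c1 = ~ cut(x1, breaks = b),        # ordinalize into b categories
    c2 = ~ cut(x2, breaks = b)
  ) %>%
  meta(n = seq(25, 200, by = 25),
       b = 2:7) %>%
  produce(500) %>%
  fit(chisq   = ~ chisq.test(.$c1, .$c2),
      pearson = ~ cor.test(as.numeric(.$c1), as.numeric(.$c2))) %>%
  tidy_fits()                          # two rows per data set, one per model
```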
So where do we go with simpr now? There are domain-specific simulation packages out there, but they can actually be used within simpr; they're not really competitors to simpr, because simpr is really about the workflow. Some other promising simulation workflow packages are out there, such as simulator and DeclareDesign. These are much faster computationally and have more support for specification, but they have a bigger learning curve and a less streamlined workflow: they generally involve finding ways of plugging custom functions into the frameworks created by these packages.

That leads us to the limitations of simpr: it is computationally slow and inefficient, simulations must fit in memory, and we could really use some templates for specifying common data-generating mechanisms, which both simulator and DeclareDesign have. We want to address these limitations and start adding syntax for efficient specifications, starting with linear models, and we're doing ongoing related research categorizing simulation packages in R to make sure we are taking advantage of the best simulation approaches already available in R.

Thank you for listening. We're very excited about this package and happy to talk about it with you, or to help you use it for whatever you're trying to do. Thank you so much.