This file shows various examples of how to run a simulation study in R. My focus here is on the simulation part, not really on the data generation or analysis parts. So what I'm doing in this file is just showing how you can simulate a simple dataset, run a regression analysis, and do that with various levels of complexity and various features.

Let's take a look at the single-sample simulation first. The idea of a single-sample simulation is that you generate a single sample and then analyze it in one or more ways. How this works is that we first set the seed. Setting the seed relates to how computers produce random numbers: computers don't really produce random numbers, but something called pseudo-random numbers. The numbers that your computer's random number generator produces are actually part of a deterministic sequence that looks as if it were random. The numbers are difficult to predict, but they are still deterministic, and if we use the same seed, we will always get the same result. So if we run this analysis here, it generates some data, estimates a regression model, and then prints out the results. We can see that we get the regression coefficient 0.98, and if we run it without resetting the seed, we will get a different regression result, but if we set the seed again, we will always get the same result. We can run this as many times as we like and it always gives us the same result.

Being able to reproduce simulations is important for a couple of reasons. First, if you are using simulations to develop practice problems for students, it is very useful that when your students work with the data set, they get the exact same data set that you used in your model answer. The students can generate the data, analyze it, compare against your model answer, and see that they get exactly the same result. The second reason is that sometimes you might get a result that is interesting by chance only. Let's say you get a non-convergence result in a structural equation model. If you want to learn how to troubleshoot non-convergence, or ask a colleague how to troubleshoot it, it is very important that you or they can reproduce the result. Whenever I talk with colleagues about methods, we sometimes send single-sample simulations to each other to make a point, and the point is only made if the analysis that my colleague sees is the same analysis, or the same result, that I see.

So let's take a look at how this works. We first set the seed again. Then we generate x, just one variable drawn from the normal distribution, and then we generate y as a function of x plus some noise. That is our y. Then we estimate a regression and summarize it, and that gives us 0.998, which is very close to the correct population value. If we want to try a different regression coefficient, or get a slightly smaller R-squared, we can multiply x by, let's say, 0.3; that gives a population regression coefficient of roughly 0.3, and we can see that the estimated coefficient is approximately correct.
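To make this concrete, here is a minimal sketch of the kind of single-sample simulation described above; the sample size of 100 and the specific seed are my assumptions, not values from the walkthrough.

```r
# Single-sample simulation: generate one data set, analyze it once.
set.seed(1)             # any fixed seed makes the result reproducible

n <- 100                # assumed sample size
x <- rnorm(n)           # one predictor from a standard normal
y <- x + rnorm(n)       # population regression coefficient is 1

summary(lm(y ~ x))      # estimate the regression and print the results

# Variant: y <- 0.3 * x + rnorm(n) gives a population coefficient of 0.3.
```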
If we want to get a different R-squared, let's say a smaller R-squared but still the same regression coefficient of 1, we can multiply the error term by, let's say, 5. That decreases our R-squared quite a bit: it was 0.4 and now it is 0.3. So you can modify the equation that generates the data to generate different kinds of populations. Let's restore the original model. That is the single-sample simulation.

The next example is a simple Monte Carlo simulation. This example extends the single-sample analysis by repeating the data generation and analysis a thousand times. 1000 replications is very common; if you have something that runs a bit longer, you might go with 500 or 200, but for most applications 1000 is standard. So why would we want to simulate a data set and estimate a model 1000 times? The reason is that this allows us to study how a statistic or analysis tool behaves over repeated samples. Whenever we estimate something from a single sample, that estimate will always contain some estimation error, so there is always some random noise. If we want to, for example, understand whether our estimates are systematically biased, meaning there is systematic estimation error and not just random noise, then we have to repeat the estimation process many times. Another useful way of using this kind of simple Monte Carlo is to do power analysis: we might check how our statistic or analysis behaves with varying sample sizes.

When we run this code, it produces a couple of things. First, it estimates a thousand regression models, and the mean estimate is 1.0013, so the estimator is unbiased, because the population value here is 1 (we don't multiply x by anything). We can also see that the estimates are roughly normally distributed. They are actually exactly normal; the small bumpiness in the plot is there because we estimate a finite number of samples. If we were to do this, say, a million times, it would look like a very smooth normal distribution. We can also see how the regression behaves with a larger sample size. The estimates are mostly between 0.8 and 1.2, so what happens if we multiply our sample size by 10 and collect 1,000 observations? We set the seed again and run the study, and we can see that the estimator is again unbiased, so the bias didn't change with the sample size, but now most of the estimates are between 0.95 and 1.05 instead of 0.8 and 1.2. So the precision is a lot better when we increase the sample size. If we want to see the replications, we can just print them; those are all the regression coefficients that this code estimated. Let's return to the original values.

There is one thing that beginners sometimes get wrong, and that is where you set the seed; I've seen this a few times with people who are starting to do simulations. It is an error to set the seed inside the replication loop. You might think that it is useful to set the seed right before you start generating random data, because that is where you need the seed, but the problem is that every single replication will then use the exact same seed and produce the exact same result. If we run this and print the estimates, we can see that every regression coefficient is the same, and that is of course a bad idea: you want to estimate the statistic over 1000 independent, different samples, not the same sample 1000 times, so we always set the seed outside the replication loop.
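A minimal sketch of this kind of Monte Carlo, with the seed set (correctly) outside the replication loop; the data-generating details are my reconstruction of the example above.

```r
set.seed(1)                          # set the seed ONCE, outside the loop

n <- 100
estimates <- replicate(1000, {       # repeat data generation and analysis
  x <- rnorm(n)
  y <- x + rnorm(n)                  # population coefficient is 1
  coef(lm(y ~ x))["x"]               # return the estimated coefficient
})

mean(estimates)                      # close to 1 if the estimator is unbiased
hist(estimates)                      # roughly normal sampling distribution

# Wrong: putting set.seed(1) inside the replicate() block would make
# every replication identical.
```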
The next example is a simple Monte Carlo with parallel processing. In the previous example the simulation ran very quickly, because we just generated two variables and ran a regression analysis, which is very quick for a computer to calculate. But not all models are so quick: if you do multilevel modeling, generalized linear models, or structural equation modeling, then one replication can take a few seconds to run, and that quickly adds up when you do it 1000 times. For example, a simulation with a structural equation model can easily take a few hours. Fortunately, you can speed that up. Normally, when you run R code, it runs on a single processor core, but my computer here, an M1 MacBook Pro, has 10 cores, so I can split the work into 10 chunks and the computer runs those chunks at the same time. That gives me almost a 10-fold increase in computation speed, and if you have a bigger computer with more cores, you get an even bigger speedup.

How this works is that we first need to choose a parallel processing framework. There are a couple for R, but the most modern, most advanced, and easiest to use is the future framework. The idea of a future is that it creates code blocks that are executed sometime in the future, which allows the program to proceed without waiting for the code to finish. If you are interested in the programming concept of futures, you can read more about it on this website. For us, the important change to the previous example is that instead of running the replications with replicate, we use future_replicate; because the replications are independent, they can simply be split over different cores. We need to load the future.apply library, which has future-based versions of the apply family of functions, of which replicate is one, and then we set the execution plan to multisession, which chooses the most appropriate multi-core framework for your operating system. We don't need to worry about the different ways multiple processes could be set up, such as sockets and forking; this just uses whatever is most appropriate for your computer. If you are troubleshooting, you can use plan(sequential) to run everything sequentially, in which case it is much easier to see where problems occur; when you run on multiple cores, getting error messages out of a worker that is not the main process can sometimes be difficult.

Then we set future.seed = TRUE. This is the default for future_replicate, but it is not the default for every function that the future framework provides. future.seed = TRUE means that we are using parallel-safe random number generation, which guarantees that the random numbers are not correlated between the cores and that if we run this code on this computer again, the results will be reproducible no matter what other workload the computer is running. If we don't use parallel-safe random number generation, the processor cores might get different random numbers depending on the order in which they query the random number generator, and that would make our results not reproducible; this guarantees that they are.
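Here is a sketch of the same Monte Carlo using the future framework; the structure follows the description above, with the data-generating details carried over from the earlier example.

```r
library(future.apply)        # future-based versions of the apply functions

plan(multisession)           # pick an appropriate multi-core backend
# plan(sequential)           # useful when troubleshooting

set.seed(1)
n <- 100
estimates <- future_replicate(1000, {
  x <- rnorm(n)
  y <- x + rnorm(n)
  coef(lm(y ~ x))["x"]
}, future.seed = TRUE)       # parallel-safe, reproducible random numbers

mean(estimates)
```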
Other than that, this works exactly the same as before. We run it and we have the results. They are now slightly different from before, because we used multi-core processing and parallel-safe random numbers, but if we run the same analysis again, we get the same results. So parallel-safe random numbers are not the same as the single-core random numbers, but they are reproducible if we rerun the simulation. If we were to run this on a different computer with a different number of cores, the result might not be the same. So what we need to do is load the appropriate package, set the plan for parallel processing, and change replicate to future_replicate; and just to be sure, we turn the parallel-safe random number generation on explicitly. It is the default for this command, but it is usually a good idea to be explicit about important assumptions.

The next example is a multiple Monte Carlo simulation using nested loops. What this code does is that instead of running the Monte Carlo over one condition, it varies the conditions, so we are actually running a Monte Carlo simulation over nine different conditions. I'll run it first and then we can take a look at what it does and how. It runs a bit longer now, because we are running the Monte Carlo simulation nine times: we vary the sample size from 100 to 500, we vary the population regression coefficient over 0.1, 0.2, and 0.3, and the code runs all possible combinations of sample size and population regression coefficient and prints out the mean regression coefficient. We can see that the regression coefficient is pretty much unbiased regardless of the sample size and the population value. The standard deviation of the estimates would of course differ depending on these values, but we are not interested in studying regression here; we are interested in how the mechanics of the simulation work.

This is the simplest way to produce a multiple Monte Carlo: we have two nested loops (if you had just one condition, you would leave out the inner loop and keep only the outer one), and we need to manage the results ourselves. We start by creating an empty list, and then, whenever we replicate something, we add the replications to the list. The replication part is exactly the same as the Monte Carlo in the previous example, except that we use the value of the coefficient b that comes from one loop and the value of n that comes from the other; so we just vary b and n, and we store n and b alongside the coefficient just to keep track of the conditions. Then we have a list, and when we print it, we can take a look at what's in there. The sim object is a list, so let's do str(sim) to see the structure a bit better: we have a list of 9 because of the 9 different conditions, and each condition contains 3 values times 1000, so 3 rows and 1000 columns. Each column corresponds to one replication, and the rows are: the first row is n, the second row is b, and the third row is the regression coefficient.
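A sketch of the nested-loop design just described; the variable name sim follows the walkthrough, and the specific condition values are assumptions.

```r
ns <- c(100, 200, 500)            # sample sizes (assumed values)
bs <- c(0.1, 0.2, 0.3)            # population regression coefficients

set.seed(1)                        # outside all loops

sim <- list()                      # we manage the results ourselves

for (b in bs) {                    # outer loop: population coefficient
  for (n in ns) {                  # inner loop: sample size
    sim[[length(sim) + 1]] <- replicate(1000, {
      x <- rnorm(n)
      y <- b * x + rnorm(n)
      c(n = n, b = b, est = unname(coef(lm(y ~ x))["x"]))
    })                             # a 3 x 1000 matrix per condition
  }
}

str(sim)                           # list of 9 conditions, each 3 x 1000
```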
In the summarization step we first apply rowMeans, so we calculate the mean of each row: the mean of the sample sizes is simply n, the mean of the betas is simply the population b, and then we get the mean of the estimates. Then we use rbind to put all those summaries into one matrix or data frame that we print out. If we don't do the rbind and just print the result, we get a list of the different rows; calling rbind puts them together nicely in a matrix or data frame.

So this is the simplest way of running a multiple Monte Carlo, or the simplest way of getting started, but it is not ideal, for a couple of reasons. First, if you have many different factors in your design, let's say five (we might also vary the number of predictors, the R-squared, and the correlation between predictors), then you would have five nested loops and the code becomes harder to read. The second problem is that it is difficult to run subsets of the design. If you just want to run a subset, you have to edit the code, and the subset always needs to be a full factorial over some specific set of values; running, for example, only the combination of sample size 100 with beta 0.1 and sample size 500 with beta 0.5 would not be possible in this kind of setup. And third, while this can be parallelized by using future_replicate, that is inefficient; it is computationally much more efficient to run the different conditions in parallel, so instead of parallelizing the replications, we run the conditions in parallel. To solve those problems we need a somewhat different design.

The next example is a multiple Monte Carlo with a design matrix and parallel processing. The code differs from the Monte Carlo with nested loops in that, obviously, we don't have the loops, but we have something else. We set up the parallel processing like we did before, remembering to use parallel-safe random numbers, and then there is a new thing: the design matrix. The design matrix is a data frame or matrix that contains all the simulation designs that we run. If we print it out, we can see that there are nine designs: three sample sizes and three regression coefficients, in a full factorial, so nine combinations in total.

Why is this more useful? The first thing is that it is easier to run subsets. If we want to run only, say, designs one, four, and seven, we can just subset the matrix like that, and that would not be as easy with a for loop. Some things would not be possible with a for loop at all: with a design matrix, it is possible to run a fractional factorial, for example having n = 200 use a different regression coefficient than the other sample sizes. That would not be very useful in this particular case, but in other cases it is a very useful feature. For example, if you run structural equation models of varying complexity, you might want the most complex models to be run only when the sample size is 500 or more, so you might decide that some more complex simulations should only use the larger sample sizes. In practice that is where I have used fractional factorials, leaving out some combinations, quite a lot. So that is the design matrix; a sketch of what it might look like follows.
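expand.grid is one natural way to build the full factorial described above, though the walkthrough does not show the exact construction.

```r
# Full factorial design matrix: 3 sample sizes x 3 coefficients = 9 designs
design_matrix <- expand.grid(n = c(100, 200, 500),
                             beta = c(0.1, 0.2, 0.3))
design_matrix

design_matrix[c(1, 4, 7), ]    # easy to run only a subset of designs

# A fractional factorial is just a subset of rows, e.g. dropping the
# most complex conditions for the smaller sample sizes.
```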
The actual simulation code also looks a bit different: instead of the for loop, we have a future_apply call, where future_apply is a parallel-processing-enabled version of the apply function. So what does apply do? apply applies a function over the margins of a matrix or data frame. If we take the design matrix, we can apply something over rows (margin 1) or columns (margin 2); if the object had more dimensions, there could be three or four. We can use margin 1 and just print: that prints out every row of the matrix, which is not particularly interesting, but it prints the first row, then the second row, and so on. We can also calculate summary statistics, for example the sum of every row, which here is the sum of n and beta. We can do the same for columns by using margin 2: 1 is rows, 2 is columns. So apply can be used to apply a function to either the rows or the columns of a matrix, and we use the parallel-processing version of apply from the future.apply package here. We call each row a design, because each row of the design matrix corresponds to one design; that is the variable name we use.

The replication loop is the same as before, with a couple of differences. Instead of using fixed values, we take n and b from the current design, and then we return the design together with the coefficient that we estimate, so the result always contains n, beta, and the estimate. The reason I have this explicit return here, instead of simply returning the result from replicate directly, is that it makes troubleshooting a bit easier. We also need simplify = FALSE, and the reason for not simplifying is that we now have a three-dimensional result object: one dimension is the statistic (the n, the beta, and the estimate), the second dimension is the replication, and the third dimension is the design. With simplify = TRUE, future_apply would try to simplify the result into a matrix, in this case 3000 rows by 9 columns, putting everything from the replications of each design, the n, the beta, and the estimate, into one long vector per design instead of a matrix. We don't want that; we simplify later ourselves. When we run it, we set the seed, and it produces the same result as before, but a lot faster.

This is probably the easiest way of running a Monte Carlo simulation over multiple different conditions using parallel processing. You could use future_replicate inside and plain apply outside, but that is computationally less efficient: you should always do the parallel processing as high up in your design as possible, because there are inefficiencies in starting new processes. This is also a lot easier to parallelize than the for loops, because with for loops you would have to decide which loop to parallelize. For example, I have ten cores on this computer, and if we applied parallel processing to the outer loop, we would split the simulation into three chunks, which would leave about 70% of my computer's power underutilized because it would only be running on three cores. Applying future_apply over the design matrix allows every design to run on a separate core.
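A sketch of the design-matrix simulation, assuming the design_matrix built above; the overall shape (future_apply over rows, simplify = FALSE, explicit return) follows the walkthrough.

```r
library(future.apply)
plan(multisession)

set.seed(1)
results <- future_apply(design_matrix, 1, function(design) {
  n <- design["n"]                  # take the condition from the current row
  b <- design["beta"]
  reps <- replicate(1000, {
    x <- rnorm(n)
    y <- b * x + rnorm(n)
    c(design, est = unname(coef(lm(y ~ x))["x"]))
  })
  return(reps)                      # explicit return eases troubleshooting
}, simplify = FALSE, future.seed = TRUE)

# Summarize ourselves: one row of means per design
do.call(rbind, lapply(results, rowMeans))
```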
The next example is a multiple Monte Carlo using the SimDesign package and parallel processing. One of the big advantages of R is the availability of packages: quite often, when someone runs into a problem they want to solve, they come up with a solution that is more general and then write a package around it, and this is a package written by an R user for doing Monte Carlo simulations. It has been around for a couple of years, but still, when developing this example, I found a bug in the package: if we have beta here in the design and also name the returned estimate beta, the package crashes with an uninformative error message. I hope this is an isolated incident; I reported the bug and hopefully the maintainer will look into it and fix it. Typically, with a package that is actively maintained such as this one, the maintainer will notice the error report within a couple of days and fix it right away if it is not difficult to fix. This doesn't look like something that would be hard to fix; it is just something that no one had happened to try before. No one knew that anyone might use the same name for an estimate and a population value.

Before looking at what this code does, let's talk generally about the advantages of a package. The advantage of this package is that it provides all the features you might need when running a Monte Carlo simulation on your computer. It provides multi-core processing using either parallel or future (we're using future because it is generally more modern and provides more features). It provides error handling and progress reports. For example, if we have a simulation with a thousand replications and there is an error at replication number 900, all of my earlier code would simply throw away everything that had been calculated up to that point and return an error. This package saves intermediate versions of the results, and if there is a crash, an analysis error, a data-generation failure, or something like that, it exits and gives you everything that was produced so far, so you actually get results that you can analyze even if not every replication completes. Another thing this package does is manage the random number generation, providing features that allow you to rerun a specific replication. Remember that pseudo-random numbers are a deterministic sequence: if your problem occurs in sample number 900, then unless you have specifically programmed your study to manage random numbers, you have to generate every sample before the problematic one to be able to troubleshoot it. This package provides a way of restoring the random number generator state right before the problem occurs. So there are lots of convenience features, and if I were starting from scratch with simulations in R now, I would probably go with this package.

But there are three downsides to a package. The first is a general downside of any package: a package has functions, the functions have arguments, and that is something you need to learn how to use. If you already know how to do a simulation using replicate or future_replicate and apply or future_apply, you might not want to spend the time learning a new tool when your existing tools are good enough. This applies to any statistical software: many people use SPSS not because they think it is the greatest software ever, but because it is what they know and it does the job; even if a better tool is available, they don't want to spend the time learning it.
The second downside is that if there is a problem with the package, like the bug I found in this one, then troubleshooting it, or even noticing it, might be very difficult. I wrote a paper for Psychological Methods, now in press, about latent interactions, and we actually found that one of the packages we applied did multivariate normality testing incorrectly. So it is possible that packages have bugs, and that might complicate your life. The third issue is that if the package doesn't do something that you want to do, it might be very difficult to modify the package or work around its limitations. For example, when I run a simulation on a computer cluster, which is like a rack of computers (there might be 50 computers in a rack), I split my simulation over those 50 computers to run faster, and I might not only split at the design level but also split the replications: if I am running 1000 replications, I would run 10 sets of 100 replications. This software doesn't easily allow that kind of splitting; you can work around it, but it is not built into the package. Of course, this is something that most people who just run simulations on their own computers would never need. Generally, this is a very promising tool for Monte Carlo simulations.

So let's take a look at what the code does. To get started with this package, we can use the SimFunctions function, which gives us a code template. This is a very nice feature: it just gives you the code, you copy-paste it into an R file, and you start editing. That is actually what I did here; you can see that I prefer lower case, but I just used the template as is. Then we do a SimClean if we want a clean-slate simulation. As I said, this simulation framework allows you to recover estimates if a simulation produces errors or crashes, so you can use whatever you got and continue from where you left off after you fix the problem that caused the simulation to crash. Here we don't want that, so we run SimClean to clean up any temporary files; we will see the temporary files appearing in a moment when we start running the analysis.

Then we define the design matrix, which works the same way as in my code: the design object contains all nine designs. It is a tibble instead of a plain data frame, which is like a more modern version of a data frame, built on the tidyverse functions instead of base R. Then we have three functions. First, a function for generating data, which takes the experimental condition (the simulation condition) as an argument, along with fixed objects, which is how you pass additional parameters or additional data sets to every function, and it must return a data frame. That adds a bit of overhead, so this code is substantially slower than my code, but for any simulation of meaningful complexity, when you actually run SEMs, multilevel models, or GLMs that take longer to estimate, the overhead is inconsequential. Then we have the analyze function, which takes the data and just runs the regression analysis, returning the result with nc, which stands for named concatenate (you have to name the returned values). And we have the summarize function that calculates the mean result, which is all we are interested in here. Then, when we run it, we use multi-core processing with parallel-safe random numbers, set the seed, and run it.
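A sketch of the SimDesign workflow described above, under my assumptions about the setup; I use plain c() rather than the package's named-concatenate helper, and I name the estimate est to avoid the name clash mentioned earlier.

```r
library(SimDesign)

# SimFunctions() prints a template like this to start from; SimClean()
# removes temporary files left over from previous runs.
SimClean()

design <- createDesign(n    = c(100, 200, 500),
                       beta = c(0.1, 0.2, 0.3))   # tibble of 9 conditions

generate <- function(condition, fixed_objects = NULL) {
  x <- rnorm(condition$n)
  y <- condition$beta * x + rnorm(condition$n)
  data.frame(x = x, y = y)          # must return a data frame
}

analyse <- function(condition, dat, fixed_objects = NULL) {
  c(est = unname(coef(lm(y ~ x, data = dat))["x"]))   # named result
}

summarise <- function(condition, results, fixed_objects = NULL) {
  c(mean_est = mean(results[, "est"]))
}

res <- runSimulation(design, replications = 1000,
                     generate = generate, analyse = analyse,
                     summarise = summarise,
                     parallel = TRUE)   # the package manages parallel-safe seeds
res
```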
Let's see what happens: it runs the nine designs, gives us progress reports, and tells us how long it takes to run; the total execution time was 4.8 seconds. We can also run without the future framework, on a single core, in which case it prints how long each design takes to run. The execution time was 4.22 seconds, because the reported execution time is core time: when you run on two cores, every second of wall-clock time counts as two seconds of computer time. So this is the same design, and it is a very useful package for Monte Carlo simulations.

The final example is code that is designed to run on a computer cluster. When you run on your own computer, you have the computational resources that your computer provides; for example, I have 10 cores and can just use future_apply. When you run on a cluster, the computation is split over different computers. A cluster is like a few racks of computers: there can be tens or hundreds of computers, and they tend to be more computationally focused. For example, the cluster where I sometimes run my simulations has computers with more than 100 cores each and a total core count in the tens of thousands, so it has thousands of times more computational power than my work laptop. To run on a computer cluster, we need to be able to run the simulation file itself in many different places: not just parallelize within the file, but execute the same file on different computers. This would also be useful if you don't have a cluster but have access to, say, a computer lab over the weekend. If the lab has 30 computers, you could split the work into 30 chunks, have each lab computer run one chunk, and combine the results at the end of the weekend; that would be about two months of computational time on a single computer.

So what we do here is that we have the design matrix as before, and 1000 replications as before, but this time we split the replications into replication sets: we are not running one batch of 1000 replications, but 10 sets of 100 replications, so we need a way of keeping track of which replication set we are running on which computer. We also need a maximum execution time. When you are using a computer cluster, or running in a computer lab, you cannot run forever: in a computer lab, the job might need to be done by Monday when a class comes in; on a cluster, you reserve a certain amount of computation time, and if your job doesn't complete in that time, it gets killed. So we monitor the execution time. This is also a useful feature on your own computer: if you want to run overnight as far as you get, stop the simulation, work during the day, and continue where you left off in the evening, setting a maximum execution time is useful. Then we add design numbers to the design matrix, so each design simply gets a number, which is useful for managing the workflow and the jobs.
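A sketch of the job bookkeeping described here and walked through below; the set count follows the description, but the exact construction is my reconstruction.

```r
replications <- 1000
n_sets <- 10                                   # 10 sets of 100 replications

design_matrix <- expand.grid(n = c(100, 200, 500),
                             beta = c(0.1, 0.2, 0.3))
design_matrix$design <- seq_len(nrow(design_matrix))   # design numbers

# Duplicate each design once per replication set: 9 designs x 10 sets = 90 jobs
jobs <- design_matrix[rep(seq_len(nrow(design_matrix)), each = n_sets), ]

# Split 1:1000 into 10 equally sized sets and attach one set to each job
sets <- split(seq_len(replications),
              cut(seq_len(replications), n_sets, labels = FALSE))
jobs$replications <- rep(sets, times = nrow(design_matrix))

jobs[1, ]    # job 1: design 1, replications 1-100
```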
So instead of having one row per design, we have one row per replication set. (I need to run that code too; let's run it from there.) We duplicate the rows, so now we have 90 rows, and each of these 90 rows corresponds to one job on the cluster; we could split this over 90 different cores instead of just 9. Then we create the replication sets: we take 1 to 1000 and cut it into 10 equally sized sets, so we now have 10 sets of 100 replication indices; set number 10, for example, runs from 901 to 1000. We put those into the design matrix, so it now contains information about the designs and about what we are going to run within each design. If we print the first row, we can see what it looks like: we have the n, the beta, the design number, and the replications. The first row runs design number 1 with replications 1 to 100, the second row runs replications 101 to 200, and so on; the replications are split into these sets.

Then we need a way for the R file to know which set it runs, so this file is designed to take arguments. Whenever we run it on a computer cluster, or in a computer lab, the computer gets a job number, which here varies between 1 and 90: the first computer gets 1, the second computer 2, and so on. The job number is given in the command-line arguments, and we simply pick whichever job ID we need to run and run only that on this computer. Then we loop. We use apply instead of future_apply, because on a computer cluster the parallelization is handled by the cluster; you are not supposed to run on multiple cores yourself. Each job runs on a single core, but all 90 jobs run in parallel, so the parallelism is managed by the cluster and not by your code. We also have a bit of logging, because if things fail, it is useful to see what was run.

Then we have some management of seeds. We don't set the seed at the beginning of the file; instead, we set the seed based on the design number, and then we set a seed for each replication, so each replication has a fixed seed. This guarantees reproducibility, and in practice it doesn't lead to correlated random numbers, although that is not guaranteed. Then we have a list that stores the replication results, and we loop over the replications. We use a for loop because we want to keep track of which replication we are running: with replicate, a replication does not know its own replication number, but here we need the replication number because we pick the seed based on it. Then we have code for managing time: if the execution time is over the limit, the loop breaks, we don't run any more replications, and we just save whatever was done so far and stop. We set the seed, do a bit of logging, generate data as before, estimate as before, and then append to the result list the design number, the replication number, the seed, and the coefficient. In larger simulations, what I normally do is not store the simulation conditions themselves; I just store the design number and later merge the simulation conditions back in using the design number. Then we put everything together and save it to a file, and we summarize afterwards
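A sketch of what the per-job script might look like, assuming the jobs table from above; the seed scheme, logging, time limit, and file naming are illustrative, not the walkthrough's exact code.

```r
args <- commandArgs(trailingOnly = TRUE)
job_id <- as.integer(args[1])            # 1..90, assigned by the cluster

job <- jobs[job_id, ]                    # this computer runs only this job
cat("Running design", job$design, "\n")  # simple logging

max_time <- 60 * 60                      # maximum execution time in seconds
start <- Sys.time()

results <- list()
for (i in job$replications[[1]]) {       # plain for loop: we need to know
                                         # which replication we are running
  if (difftime(Sys.time(), start, units = "secs") > max_time) break

  seed <- job$design * 10000 + i         # illustrative per-replication seed
  set.seed(seed)

  x <- rnorm(job$n)
  y <- job$beta * x + rnorm(job$n)

  results[[length(results) + 1]] <-
    c(design = job$design, rep = i, seed = seed,
      est = unname(coef(lm(y ~ x))["x"]))
}

saveRDS(do.call(rbind, results),
        file = paste0("results_job_", job_id, ".rds"))
```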
using a different script. Whenever you run on a cluster, you typically want to just dump everything onto the cluster's hard drive, then load the results onto your own computer and summarize them there instead of summarizing on the cluster.

Let's see what it does. We run it from the beginning; it runs, there is some logging, and it produces the results as files. Each file contains the design and the replications that belong to it, so we have 90 different files; if we ran this on a computer cluster, one computer or one core in the cluster would have handled one of these replication sets. This is probably not something most people need if you just do things for your own learning or for teaching, but if you end up doing methods work that you would like to publish, and you want to run it faster, this is a very useful thing to know how to do. For example, I have a colleague who was running a big simulation on his work laptop, and he ended up carrying the laptop around powered on for a couple of weeks so that the simulation could keep running. Instead of running the simulation on your work computer, you could run it on a cluster, if your university has one, in just a few minutes, because a cluster has so much more computing power than a normal laptop. So that is the final example, and I hope you find these examples of how to organize simulation code useful.