This video is an example of how to simulate data. You could watch other material on how to simulate data or how to analyze data, but here I want to show the mechanics behind different kinds of simulations. This time I use just one simple regression scenario: I define a population, generate X and Y from it, and analyze the data with a single regression. That analysis is not very interesting in itself, but the same principles that I present here can also be applied to much more complicated simulations, and the same basic simulation setup is used throughout.

Simulations are often used to check whether statistics work in a given scenario. The idea is that if you generate data from a known population, you know exactly what the true values of the statistics are, because you decided what the population looks like, so you can check whether an estimate recovers the population value. This can be used to verify that estimators work as intended, and it is a commonly used technique in methodological studies.

Before we look at the code, a word about how the random numbers are generated. I don't expect you to understand all of the details, but a little background on pseudorandom numbers helps. Computers do not produce truly random numbers; they produce pseudorandom numbers, which are numbers that look as if they were random. The point of the seed is that the sequence of pseudorandom numbers is deterministic: when we set the seed, the generator starts from the same point in the sequence, so it produces exactly the same sequence of "random" numbers every time. That is important for reproducibility. If I run my analysis again, I want to get the same results that I got before, and if someone else reruns my code, they should get the same numbers too. It is also very useful when we are developing or debugging simulation code.

So let's go through the code. We set the seed, we define x and the error term, and then we generate y and run the analysis. We can see that we have 100 observations. The variance of y is 2, because y is x plus the error term, and each of those has variance 1. We could generate the variables in other ways, but this is how it goes in this example. We can see that the coefficient we get is close to 1, and 1 is the true coefficient, because that is how we generated the data, so the regression analysis works quite well here. If we rerun the code with the seed set, we get the same result every time. If we do not set the seed, we get a different estimate on every run: the coefficient is now 1.07, and on the next run 1.02. Once we set the seed again, rerunning the code gives the same result every time.
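To make this concrete, here is a minimal sketch of what the single-sample code could look like; the variable names, the seed value, and the exact data-generation commands are my assumptions rather than the exact code on screen:

    clear
    set seed 2937            // fix the pseudorandom sequence for reproducibility
    set obs 100              // sample size of 100
    generate x = rnormal()   // predictor, standard normal
    generate e = rnormal()   // error term, standard normal
    generate y = x + e       // population regression coefficient is 1
    regress y x              // the estimate should be close to 1

Rerunning this do-file from the top reproduces the same estimate, because the seed restarts the pseudorandom sequence; deleting the set seed line makes the estimate change on every run.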
We can also modify the coefficients. For example, if we wanted the regression coefficient to be 2, we can multiply x by 2 when generating y; when we rerun the code, we can see that the regression coefficient is now about 2. The R-squared is about 0.8, so it is quite high. If we wanted the regression relationship to be weaker, so that the R-squared is smaller, we can for example multiply the error term by 5, and that takes the R-squared down to about 0.05, which is a much weaker relationship. We can play with the coefficients like this and see how the regression result changes. This becomes really useful when we draw samples from the population multiple different times.

This kind of analysis is useful for checking how statistics behave in samples from a known population. For example, we could check how sample size affects the precision of the regression estimates, or whether an estimation technique suffers from small-sample bias. Multiple different samples are required because, if we generate just one sample, the estimate from it could be off by chance only. If we do, let's say, 100 or 1,000 replicated samples from the same population, those random errors cancel out. Let's run the code to see what it does, and then I will explain the result.

What this code did is generate 1,000 samples from a population where the regression coefficient is 1. The mean of the estimates is very close to 1, and the standard deviation is 0.1, which quantifies how variable the estimates are; and this is a density plot of the estimates. Most estimates are around 1, but there are estimates at 0.8 and 1.2, for example, so some estimates are off by minus or plus 20 percent. We can modify this code and see how the precision of the regression coefficient changes. For example, if we increase the sample size to 1,000, we can rerun and see how variable the estimates are now; and this is something you could do with other parts of the setup as well, for example the population coefficient. We can see that where the standard deviation of the estimates was 0.1 before, it is now about 0.03, and the estimates are mostly between 0.95 and 1.05 instead of between 0.80 and 1.2. So the regression estimates are clearly more precise with a larger sample.

Let's then look at the code itself. What we need to simulate data repeatedly are two things. We need a program, which defines a set of commands that are run as a sequence; this is our program, and it is called sim. And then we have the simulate command, which we can use to run multiple replications of our simulation. This capture program drop is required because redefining a program causes an error: if we rerun this code, we get an error saying the program is already defined, so we need to drop the program sim first. The capture, in turn, is required because trying to drop a program that doesn't exist also causes an error. So the full code works, but if we just try to run the drop by itself in a fresh session, we get an error; capture prevents any error from stopping the execution. This line basically drops any program called sim if it exists, and if the program doesn't exist, it does nothing. The program itself starts with the program statement, then we clear the data, and then we do exactly the same thing as in the single-sample simulation.
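Here is a sketch of that program-plus-simulate pair; the program name sim follows the transcript, while the data-generation details and the seed value are my assumptions:

    capture program drop sim   // capture: no error even if sim doesn't exist yet
    program sim
        clear
        set obs 100
        generate x = rnormal()
        generate y = x + rnormal()
        regress y x            // simulate collects _b from this last estimation
    end

    set seed 2937              // set once, outside the program (see below)
    simulate _b, reps(1000): sim
    summarize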
It's just encapsulated in the program. What the program does by default is return whatever the final command returns. You can change that: there are commands, for example the return command, for returning other things. You could, for example, calculate two regression models, calculate the difference between the estimates, and return that. But when we test a statistic, this is the simplest way: just generate data and then run the regression. Another way of comparing two different estimators would be to generate the same data sets twice and have one program apply one estimator and another program apply the other estimator, running both with the same seed.

One important thing to note is that the seed must be set outside the program. It might seem logical to define the seed inside the program, because that is where you start generating random numbers. But this is an error that I sometimes see beginners make, and it is an error because it guarantees that every replication of the simulation uses the same data set. When we run the simulation that way and summarize the results, we can see that there is no variation, and if we list the data, every regression coefficient has the same value. That is because we generated each random sample using the same seed, which means that instead of generating a thousand independent samples, we generated the same sample a thousand times.

When we look at the simulate command, it has three parts. First is whatever we collect; here we just collect _b, which refers to the estimates. Then we have the options; we simply specify that we run a thousand replications. There are other options, for example for saving the simulation results into a file, but we just leave them in memory. And then we have a colon, followed by our simulation program. So this is the simple Monte Carlo simulation.

The next example is a simple Monte Carlo using parallel processing. The idea of parallel processing is that normal modern computers have multiple processor cores; for example, my M1 MacBook Pro has 10 cores. When you run an analysis, it normally runs on a single core, but you would want to spread the work across multiple cores, because that speeds up computation. Stata has support for multi-core processing built in, but it works at the level of a single command. For example, the regress command gets a small speed boost from parallel processing, and the number of processors that are used depends on your Stata license; I think my license allows the use of two of the ten cores that my computer has. In simulations, where we run replications of a piece of code, running each replication on a separate core is a lot more efficient than trying to parallelize one regression model within a replication.

This code runs the same thing a lot quicker, and we can run it, and there we go. It is a lot quicker than running on a single core, without parallel processing; there is almost an order of magnitude difference in speed. We need a user-written package here, because out of the box Stata does not support parallel processing beyond what is built into individual commands. You can install the parallel package by using these lines here; you need to install some Mata libraries as well. You run these lines, and you can read about the parallel package here: there is a journal paper, and there is an online tutorial on how to apply it.
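For reference, one standard route to install the package looks like this; the video's exact install lines may differ, for example installing from the package's own repository instead of SSC:

    ssc install parallel, replace
    mata mata mlib index     // rebuild the Mata library index after installing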
The parallel package basically requires two things on top of the simple Monte Carlo. One is that you need to run parallel initialize, which sets the number of computer cores that the simulation can use. Here it is set to 75% of what is available on the computer; on my computer we could set it to 10, but 75% is fine for this purpose. Second, the simulation command works a bit differently. Instead of using simulate, we have parallel sim, and instead of listing _b we have this option here, the expression option, which says what we capture; here we capture the beta from the results. Other than that, it is specified in exactly the same way as the built-in simulate. So moving from simulate to parallel sim simply requires changing your simulate command a bit, installing this package, and then running parallel initialize. There is really no reason not to use this when you do simulation work, except that while you are developing your simulation code you might run into errors, and it is a lot easier to troubleshoot something that runs on a single core than on multiple cores. But this package, which is now running on seven cores here, has a nice feature that allows us to print the error logs from each of the child processes. So this is a very user-friendly way of doing parallel programming.
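A sketch of the parallel version follows; the commands are the ones named above, but the exact option spellings (expr() versus expression(), and the initializer argument) should be checked against the package's help files:

    parallel initialize 7            // roughly 75% of the 10 cores on this machine
    parallel sim, expr(_b[x]) reps(1000): sim
    summarize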
The next example is a multiple Monte Carlo simulation using nested loops. Nested loops means that we have foreach loops within each other. This kind of simulation is useful for methodological research, when you want to study how an estimation technique works under different conditions. The conditions that we have here are that we vary the sample size, and we vary the regression coefficient in the population.

We need to adjust the simple Monte Carlo in a couple of ways to do this kind of multi-factor, or multiple, Monte Carlo. What we need is a way of giving the sample size and the beta, the regression coefficient, to the program. We do that by using the syntax command. What the syntax command does is allow you to pass different arguments to programs. The easiest way to use it is to have a comma and then specify whatever you need as options; you could also have a varlist and an if and an in and whatever else a Stata command can have, but options are the easiest way. So we specify that we have an option called n. It must be lowercase here, because uppercase letters in an option name refer to its abbreviation, and the name of the macro that this produces is always lowercase. The integer part means that n is a whole number: n is an integer, the sample size, and b is real, which means a decimal number. It can hold a whole number as well; these are just different ways of storing numbers on a computer. So b and n are declared here, and then they are available as macros. This backtick-n-tick is the macro that stores the option n, and backtick-b-tick is the macro that stores the regression coefficient. So instead of having a fixed number here like before, we have the macro n, and before Stata runs this line, it writes in the content of the macro, which contains the sample size, and then runs the command. The same happens here: Stata writes in whatever the value of the beta macro is before it evaluates the expression, and then runs the command with that b substituted in. The estimation part works the same as before.

The simulation part also needs a few things that we didn't have before. We have the loops here: the macro n receives the values 100, 200, and 500, and the macro b receives the values 0.1, 0.2, and 0.3. So this goes through all n and all b, every b for every n. Then in the simulate command we do a couple of things differently. Instead of capturing _b, we are now capturing with names: we store the macro n under the name n, because we want to store the sample size information in the results. We could also recover the sample size from the name of the file that we are saving, but I find it useful to store it directly in the data set. Then we store the value of b, and we store the regression coefficient of x under the name estimate. The other thing that we need here is saving to a file, because simulate always leaves the simulation results in memory: if you run simulate many times, whatever is in the memory of the computer afterwards will be the results of the final simulate only. So we need to save the results of each simulate into a file. We just save into the working directory, with n and b in the file name to tell the files apart. And then after the colon we have the sim program with the options n and b.

When we run this, and let's run it, it runs for a while, because it runs 9 different simulations of 1,000 replications each. We then have the files in my Downloads folder, which is my working directory now, and the next thing that we need to do is to load the files by appending them, and then we summarize them. This uses the macro extended function dir, which gives us the contents of a directory: we take the files in the working directory that match the simulation*.dta pattern, and then we append all of those data sets into memory. We can run these lines here, and if we do macro list (I copy the lines rather than run them from the do-file), we can see that the files macro now contains the names of these result files. You can do help macro, and that will show you how this works; we are using the dir macro function here. Then we append all those files and summarize the results. Note that a macro always has a scope: if I define a macro interactively and then run a do-file, that macro is not available inside the do-file, and if I define a macro in a do-file, it is only available in that do-file. So I have to define the macro in the same file where I use it.

Then we can tabulate, which gives us a quick and dirty cross-tabulation; we are tabulating the mean and standard deviation of the estimates. We can see here that regression analysis is unbiased: regardless of sample size we always get the correct result on average. The standard deviation of the estimates does not depend on the regression coefficient, but it does depend on the sample size, so the estimates become more precise when the sample size increases. This is a quick and dirty table; if you want to publish your results, you can use customizable tables to do the same thing and then adjust the layout and presentation of the table for your publication. So this is an example of a multiple Monte Carlo using nested loops, and it is a useful way of getting started with multiple Monte Carlos.
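Here is a sketch of the whole nested-loop setup as I understand it from the description above; the seed, the file-name pattern, and the data-generation details are my assumptions:

    capture program drop sim
    program sim
        syntax [, n(integer 100) b(real 1)]   // sample size and population beta
        clear
        set obs `n'
        generate x = rnormal()
        generate y = `b'*x + rnormal()
        regress y x
    end

    set seed 2937
    foreach n in 100 200 500 {
        foreach b in 0.1 0.2 0.3 {
            simulate n = `n' b = `b' estimate = _b[x], reps(1000): ///
                sim, n(`n') b(`b')
            save "simulation_`n'_`b'.dta", replace
        }
    }

    * Load the results back: list the matching files, then append them all.
    clear
    local files : dir . files "simulation*.dta"   // returns quoted file names
    foreach f of local files {
        append using `f'
    }
    tabulate n b, summarize(estimate)   // quick and dirty means and SDs

Note that in the syntax line the numeric options need default values (here 100 and 1); the loops always override them when calling sim.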
There are a couple of downsides, though. First, you have this managing of files, which is a bit inconvenient; it would be nicer if the computer just stored the files somewhere, or kept the results in memory, without you having to specify the files yourself. The second thing is that this is not easy to run with parallel computing: we could use parallel computing within each simulate, but it is typically more efficient to run the conditions in parallel, so that every replication of one condition runs on one core and the simulation conditions, the designs, are split over different cores. Another downside is that when you have many design factors, for example if we added the number of predictors, the correlation between predictors, and the error variance as three more factors, then we would have five nested loops, and the code becomes difficult to read. Finally, this only allows you to do full factorials: if we wanted to run certain conditions only when the sample size is large enough, we can't really do that easily, or it would be very confusing to write the if statements here. So there are other ways of organizing the simulation, and we can take a look at one next.

The final example is a multiple Monte Carlo simulation using a design matrix and parallel processing. The code is here, and let's zoom into the specific parts and take a look at what they do. This will run substantially faster than the previous one. We can run it, and it takes a little while, because it is still running a thousand replications for each design, but it should be done soon. And here we are: it is maybe close to ten times faster than without parallel processing.

How do we do this kind of parallel processing? Again, we need the parallel package, and the simulation program is the same. Now we also need another helper program; I'll explain the role of the helper program in a while. And then we have a design matrix. The design matrix here is a data set that describes the simulations that we run. If we run this code and then do list, we can see that it holds all possible combinations of n and b that define our simulation setup, so we have nine different designs, formed as the combinations of three levels of n and three levels of b. The way the simulation works is that we iterate over this data set, and each row defines one simulation design that we run.

What the code does is simply enter some data: we input the values of n and b. We could enter more; if we wanted to also have a sample size of 1,000, we could go to the data editor and add 1,000 here, and it doesn't make a difference that the columns then have different numbers of observations. The fillin command then creates all possible combinations. When we run it, we have all the combinations, except that we need to drop the rows with missing values, so we drop the ones with a missing beta, and then we have our design matrix. So: you input whatever unique values of n and b, or whatever your factors are, you do fillin to create the full factorial, all the possible combinations, and then you add a design number to identify the designs. And that is our data.
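A sketch of building such a design matrix, with the values used in this example:

    clear
    input n b
    100 0.1
    200 0.2
    500 0.3
    end
    fillin n b                          // expand to all 3 x 3 = 9 combinations
    drop if missing(n) | missing(b)     // needed if the columns had unequal lengths
    drop _fillin
    generate design = _n                // one row = one simulation design
    list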
The next thing we do is initialize parallel processing like before, and then, instead of using parallel sim, we use the parallel prefix itself. parallel breaks the data into chunks: if we have four cores, it breaks the data into four subsamples and runs the command that we specify on each subsample. When we run parallel here, it breaks the data into seven chunks, so we have seven subsamples; observations one and two might be one subsample, three and four another, and the remaining observations form the rest of the subsamples (these are the observation numbers). We also tell parallel that the program sim and the helper program runsim, which I will explain in a minute, need to be available to the workers; this option simply says that these programs must also be passed to the worker processes.

Because parallel splits the data into, in my case, seven different subsets, we still need a way to iterate within each of those subsets. We don't want to run one simulation for a subsample that contains, say, observations one and two together; what we need is to run the simulation on one row at a time, and we do that by using the runby command. runby is basically a user-written command that works the same way as the by prefix, except that it also works for commands that are not byable, that is, commands that don't support the by prefix. So parallel splits the data over the cores, say into the seven subsamples, and runby further splits each subsample into individual rows, because we are splitting by design. The parallel prefix might send observations one and two to the first core, and runby then says that those two observations are further split into two different chunks based on the design number.

Then we have runsim. What the runsim program does is take the first observation from the data, in this case the only observation, take the value of n and the value of b from that observation, store them, and then pass n and b as arguments, as options, to the sim command. So to recap: parallel splits the data into subsamples, runby splits each subsample by design, and runsim takes the first observation, which in this case is the only observation, and runs the simulation using the contents of that observation. What all of these commands then do is append the results into one big data set, and if we run this again, we can see that resulting data set here.

So what is the advantage of this approach? The advantage is that it separates the design of the simulation, the design matrix, from the actual simulation code. You can do more complicated designs, like fractional factorials, with less simulation code; instead of the loops we have just the design matrix and this one command, and it works with parallel processing.
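To close, here is a sketch of how these pieces could fit together; the runsim helper is my reconstruction from the description above, and the programs() option and the runby syntax should be checked against the respective help files:

    capture program drop runsim
    program runsim
        local n = n[1]    // design values from the first (and only) observation
        local b = b[1]
        simulate n = `n' b = `b' estimate = _b[x], reps(1000): ///
            sim, n(`n') b(`b')
    end

    parallel initialize 7
    * parallel splits the design rows over the worker processes; runby then
    * calls runsim once per design row, and the results that each runsim
    * leaves in memory are appended back into one big data set.
    parallel, programs(sim runsim): runby runsim, by(design)
    summarize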