Estimation of nonlinear mediation models can be a bit challenging. When we first learn about mediation, we learn about the linear case. In the linear case, we might define the mediation effect as the product of the coefficients from x to m and from m to y, and we also learn that full mediation and partial mediation can be differentiated based on the direct effect coefficient beta_y1: if it is zero or non-significant, we have full mediation; if it is non-zero and significant, we have partial mediation. In another video I talked about the problems of this approach. The first problem is that it is not applicable to all nonlinear models; it works for some nonlinear models, but not most. There is also the problem of causal effect heterogeneity: if the effect of x on m is not the same for everybody but varies between individuals, then the product-of-coefficients approach assumes that the variation in beta_m1 is uncorrelated with the variation in beta_y2. I also talked about why we cannot use any statistical model as a definition of mediation, because mediation is a causal concept and is not tied to any particular model.

Now let's start looking at how we estimate nonlinear mediation models. Before we do, it is good to remember that for linear models, the Baron and Kenny, or product-of-coefficients, approach is still valid. The article by Imai and co-authors, which is probably one of the most cited sources on estimating nonlinear mediation models, still recommends the product of coefficients if your model is linear. Of course, how we check assumptions and how we actually go about estimating these models has evolved since 1986, when Baron and Kenny was published, but the basic idea that you multiply two paths together is still the same. It just doesn't work for all nonlinear models.

To see how nonlinear mediation models are estimated, we first need to understand the counterfactual definition of the mediation effect. Imai defines the mediation effect as the difference between two outcomes, only one of which is observed. For an individual who is either treated or not, we compare the outcome where the mediator takes the value it would take under treatment against the outcome where the mediator takes the value it would take without treatment, holding the treatment itself fixed; depending on what we actually observed, each of these outcomes is either observed or counterfactual. The general estimation principle is that we record the observed outcomes and estimate the counterfactual for each individual. For the treated cases, we observe the outcome under treatment and the mediator under treatment, and we predict, or somehow estimate, a counterfactual outcome where the individual is treated but the mediator takes the value it would have without treatment. We do the same for the untreated cases: we observe the outcome and the mediator under the untreated condition, and we estimate what the mediator would have been under treatment for that person. We then estimate the individual causal effect as the difference between these two outcomes, and the average of these individual causal effects gives us the average causal mediation effect.
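In symbols, using potential-outcomes notation that I am adding here for reference (it follows Imai and colleagues' papers, though the exact symbols are my own choice), the definition reads:

```latex
% Individual causal mediation effect (Imai et al.):
\delta_i(t) \;=\; Y_i\bigl(t,\, M_i(1)\bigr) \;-\; Y_i\bigl(t,\, M_i(0)\bigr),
  \qquad t \in \{0, 1\}

% Average causal mediation effect (ACME) is its expectation over individuals:
\bar{\delta}(t) \;=\; \mathbb{E}\bigl[\, Y_i\bigl(t,\, M_i(1)\bigr)
  - Y_i\bigl(t,\, M_i(0)\bigr) \,\bigr]
```

Only one of the two terms is ever observed for a given individual; the other must be estimated as a counterfactual, which is what the rest of this section is about.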
Now there is still a little complexity in how we choose the counterfactual. One way is to hold the treatment as it is and simply flip the mediator, but the article by Nguyen and colleagues discusses other options as well. So which counterfactual do we estimate? We observe two kinds of outcomes: individuals under treatment, for whom the mediator is measured under the treatment condition, and individuals under control, for whom the mediator is observed under the control condition. In Imai's approach, we generate the counterfactuals by keeping the treatment where it is, either one or zero, and flipping the mediator. So treatment 1 with the mediator at its control value means that the individual received the treatment and we predict what the mediator would have been had that individual not been treated; for an individual under control, we predict what the mediator would have been had that individual been treated. This is one way, but it is not the only way. Another way that Nguyen and colleagues discuss, which is, at least according to them, fairly common, is to generate the counterfactuals by assigning everyone to the treatment: the counterfactual for the treated case is generated the same way as before, but for the untreated case we generate the counterfactual not by changing the mediator but by changing the treatment. We can of course also do the opposite, so that all cases are set to untreated but the mediator is flipped instead. I would say that the first option is probably the safest choice, because it is a kind of middle ground and it is also the easiest to implement in your statistical software; but if you do this kind of analysis, Nguyen and colleagues' paper on this decision is worth checking out, as is my explanation of how to choose between these approaches in another video.

So let's take a look at the general estimation principle. There are two main estimation strategies; there are more, but these are the two main ways of estimating nonlinear mediation effects. Imai's 2010 article presents one approach, called the simulation-based approach: you fit models for the outcome and the mediator, simulate multiple replications of the parameters, apply the estimation formula, and then calculate an average, which gives you the causal effect. Importantly, this does not depend on the functional form of any of these relationships. VanderWeele presents another approach, called the regression-based approach: you estimate two models and then apply a closed-form equation, for example for binary mediators, and that gives you the natural indirect effect, or causal mediation effect. Both approaches start the same way as the Baron and Kenny approach: we estimate two models, one for the mediator and one for the outcome, and these can be any models, linear or nonlinear. Where the approaches differ is in how we calculate the mediation effect from the two models. I'll take a look at the regression-based approach first, because it is a bit easier to implement in your statistical software.
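To fix the notation before going further, here is one way to write the two-model setup; the beta subscripts follow the convention used earlier in this video and are my own shorthand, not taken verbatim from either paper:

```latex
% Model for the mediator and model for the outcome (linear case):
M_i = \beta_{m0} + \beta_{m1} T_i + \varepsilon_{mi}
Y_i = \beta_{y0} + \beta_{y1} T_i + \beta_{y2} M_i + \varepsilon_{yi}

% In the linear case the indirect effect is the product \beta_{m1}\beta_{y2};
% in the nonlinear case the right-hand sides become arbitrary functions
% g_m(T) and g_y(T, M), and the product rule no longer applies.
```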
Both of these approaches, the simulation-based approach and the regression-based approach, are implemented in software packages, so you can apply them without really understanding the estimation principle, the math, or how they are programmed; but understanding the principle, the math, and the programming will make you a better user of these techniques. So let's start with the regression-based approach, because it is a bit easier to implement. This material is from VanderWeele's book, which is a really great resource for the regression-based approach. He works through different combinations of mediator and outcome models, and he proves how each combination can be estimated using regression analysis. To understand what all of this means, we need to define a couple of things. First, a* and a: a* is the baseline, so in a treatment-control scenario the baseline a* is the control and a is the treatment. Then we need to understand the pieces of the natural indirect effect equation. The first probability term is the probability of M under treatment, that is, the probability of a positive mediator value calculated from a logistic regression; you might not recognize it at first, but it is simply the inverse logit function that you use in logistic regression. Then we have the probability of M at baseline, and we compare the difference between these probabilities. So we do not actually use the observed value of M at all; instead we predict the probability of M for an individual under treatment and the probability of M for that same individual under control. We calculate these two adjusted predictions for each individual, and then we simply multiply the difference by the effect of M, because the model for the outcome is linear. The theta_3 a term we can ignore for now; it is there because VanderWeele considers the case where the mediator and the treatment variable are allowed to interact, and that is why the equation has a second term.
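Reconstructed from VanderWeele's results for the case of a binary mediator, a continuous outcome, and a treatment-mediator interaction, the natural indirect effect being described here looks like this (my transcription, omitting covariates; check the book for the exact conditional version):

```latex
% Outcome model:  E[Y \mid a, m] = \theta_0 + \theta_1 a + \theta_2 m + \theta_3 a m
% Mediator model: \mathrm{logit}\, P(M = 1 \mid a) = \beta_0 + \beta_1 a

\mathrm{NIE}
  = (\theta_2 + \theta_3 a)
    \left[
      \frac{e^{\beta_0 + \beta_1 a}}{1 + e^{\beta_0 + \beta_1 a}}
      - \frac{e^{\beta_0 + \beta_1 a^*}}{1 + e^{\beta_0 + \beta_1 a^*}}
    \right]
```

The two inverse-logit terms are the predicted probabilities of M under treatment a and baseline a*, and the multiplier is the effect of M in the outcome model, including the interaction term theta_3 a.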
So how do you then move from these equations to statistical software? I'll show examples using Stata, and my examples use linear models. The reason I use linear models here is that when I use this material in a class, I give students assignments, and I don't want to hand them the model answer; you just replace these commands with logistic regression and predictions from it to calculate the nonlinear version, and the linear model is easier to understand and check. We generate some data: a full mediation model where T is the treatment and M is the mediator, with a thousand observations from this hypothetical population. We start by regressing M on T; that is the model for M. We store the original value of T and the original value of M, and then we generate, or rather predict, MT0: what the value of the mediator would be for every case if they were not treated. Then we generate the other prediction, MT1, under treatment: this is another adjusted prediction, where we set everyone to be treated, calculate the prediction, and the average of those values is the average marginal prediction under treatment. Then we estimate the model for Y and calculate two predictions from it: first we replace M with MT0 and calculate predictions for Y, then we do another adjusted prediction using M under the treatment condition, MT1, and calculate predictions for Y again. The difference between these, the comparison between Y(T, MT1) and Y(T, MT0), is Imai's definition of the causal mediation effect, although of course it can be defined in other ways as well. So this is a simple technique using plain regression and prediction, and you can replace the regress command with some nonlinear model; it does not need to be linear, and it works the same way.

Now there is the problem of how we get the standard errors. The standard error of the difference between these two predictions cannot be taken from summary statistics, because the values are predicted rather than observed; summarizing gives you the differences, but it assumes they were observed. There are three main ways of getting valid standard errors. The first is the delta method, a large-sample analytical approximation: basically an equation that you apply to the estimates, and it gives you an approximation of the standard errors that works well in large samples. Another approach is simulation. I will not talk about simulation in detail here, but the basic idea is that instead of calculating the effect from these models just once, you simulate multiple sets of parameters from the sampling distribution implied by the estimated variance-covariance matrix of the model and repeat the calculation for each set; I will talk more about that in another video. The third approach is bootstrapping, and I will talk more about the bootstrapping approach in another video too.
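Before moving to the Stata implementation, here is a minimal R sketch of the workflow just described, with a simple nonparametric bootstrap for the standard error; the video demonstrates this in Stata, so the data-generating numbers and function names here are my own illustration, not the exact values used in the video.

```r
set.seed(1)

# Hypothetical full mediation model T -> M -> Y (coefficients are made up)
n <- 1000
d <- data.frame(t = rbinom(n, 1, 0.5))
d$m <- 0.5 * d$t + rnorm(n)
d$y <- 0.7 * d$m + rnorm(n)

acme <- function(data) {
  fit_m <- lm(m ~ t, data = data)               # model for the mediator
  fit_y <- lm(y ~ t + m, data = data)           # model for the outcome
  # Adjusted predictions of M with everyone untreated / treated
  m_t0 <- predict(fit_m, newdata = transform(data, t = 0))
  m_t1 <- predict(fit_m, newdata = transform(data, t = 1))
  # Predict Y holding T at its observed value, swapping in the two M's
  y_m0 <- predict(fit_y, newdata = transform(data, m = m_t0))
  y_m1 <- predict(fit_y, newdata = transform(data, m = m_t1))
  mean(y_m1 - y_m0)                             # average causal mediation effect
}

est <- acme(d)

# Nonparametric bootstrap for the standard error
boot_est <- replicate(500, acme(d[sample(n, replace = TRUE), ]))
c(ACME = est, SE = sd(boot_est))
```

With these made-up coefficients the ACME should come out near 0.5 times 0.7, i.e. about 0.35.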
How do we do this in practice? Here is a Stata implementation that gives you the delta-method standard errors. We will be using margins, and margins implements the delta method, which is nice because it is fast to calculate, although it may not be the best choice in very small samples. Some important things happen here. First, the average direct effect is calculated simply by adjusting T, that is, predicting at T = 0 and T = 1 while holding M at its observed values: the change in the prediction when T changes but M does not is the average direct effect, or natural direct effect. Calculating the mediation effect is a bit more complicated. We first need to combine the two regressions into one system using suest, and the reason we need to form a system out of the two equations is that we need to use two different sets of coefficients in margins, and this is the way to do it: we need margins to treat these estimates as estimates. If we instead stored the estimates in a matrix, margins would not know that they are estimates and would treat them as fixed values; taking the coefficients of one regression and putting them in a matrix, doing the same for the other, and then using the matrices in margins instead of the estimated b's would produce incorrect results, because the matrices would not be taken into account in the standard error calculation. The coeflegend (coefficient legend) option of the suest command gives you the technical names of the coefficients, which you need for writing the margins command; for example, the coefficient legend will tell you to refer to something like [m_mean]t.

So what do these margins commands mean, and how does one come up with this kind of margins command? We need a custom prediction equation, so we use the expression() option and simply write out the regression equation: we have the intercept, and then a prediction using the observed, not the adjusted, T. That is, we calculate adjusted predictions by adjusting T, but in the actual predictions we hold T constant at its observed value, and we let the adjustment of T work only through M. Adjusting T changes M, which changes Y, but T has no direct effect in this expression: we predict the outcome using the observed T while predicting M using the adjusted T, and then we multiply that prediction of M by the coefficient of M. That gives us the average causal mediation effect. So it is not that complicated to do with Stata's margins.

If your model is very complex, say you have lots of covariates, then typing this equation by hand will be tedious, so there is another approach you can apply. Here is a different way of specifying the same thing in the margins expression. If you have, say, ten covariates in the equation, typing out the coefficient of each covariate times the name of each variable is not something you want to do. Instead, you start with the linear prediction using the original M and the adjusted T; then you subtract the effect of the adjusted T, add back the effect of the observed T, and finally replace the effect of the observed M with the effect of the predicted M. And if you have a nonlinear model, then instead of using predict, xb, which gives you the linear prediction, you simply apply the link function from your GLM: the exponential function for Poisson regression, the inverse logit for logistic regression, or the cumulative normal distribution for probit regression.
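To make the link-function point concrete, here is a hedged R sketch where the mediator model is logistic and the outcome model is linear: predicting with type = "response" applies the inverse logit, and because the outcome model is linear (and has no treatment-mediator interaction), the mediation effect is the coefficient of M times the difference in predicted probabilities, mirroring the VanderWeele formula shown earlier. The generated data are my own illustration.

```r
set.seed(2)

n <- 1000
d <- data.frame(t = rbinom(n, 1, 0.5))
d$m <- rbinom(n, 1, plogis(-0.5 + 1.0 * d$t))   # binary mediator, logit link
d$y <- 0.7 * d$m + rnorm(n)                      # linear outcome model

fit_m <- glm(m ~ t, family = binomial, data = d) # logistic model for M
fit_y <- lm(y ~ t + m, data = d)                 # linear model for Y

# Adjusted predictions of P(M = 1); type = "response" applies the inverse logit
p_m0 <- predict(fit_m, newdata = transform(d, t = 0), type = "response")
p_m1 <- predict(fit_m, newdata = transform(d, t = 1), type = "response")

# ACME = coefficient of M times the difference in predicted probabilities
acme <- coef(fit_y)["m"] * mean(p_m1 - p_m0)
acme
```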
What are the limitations of the regression-based approach? It works well when the model for M or the model for Y is linear; when both are nonlinear, it works only in special cases. The approach I presented also works when there is one mediator; when there is more than one mediator, for example two competing causal paths where you want to understand which one is more important, things get more complicated. There is a really great book about this regression-based approach by Tyler VanderWeele. It is a few hundred pages of reading, and it goes through all the complications on a very detailed level, including how to address them, and which of them, at the time of writing in 2015, were still unaddressed by the methodological literature. Cases other than the one-model-linear case need to be proven on a case-by-case basis, and this book gives you the proofs. For example, if you have a logistic model for Y and a logistic model for M, then the approach I just presented does not work directly: you can derive odds ratios mathematically, but you cannot use the predictions the way I showed, because if we assume that the probability of Y is 0.5 in one particular case, applying the inverse logit to an average of logits is not the same as averaging the probabilities. It works in the linear case because expectation passes through a linear function: averaging over the mediator's possible values and plugging in the mediator's expected value give the same answer. But it does not work when you need a link function for the outcome as well. So this is a simple technique, but it does not work in all possible cases.

Imai's simulation-based approach, on the other hand, works in pretty much every condition that you can think of, but it is computationally more challenging. The similarity with the regression-based approach is that both start with estimating models for Y and M and calculating predictions for M. But there are also differences. The simulation-based approach uses multiple replications, whereas in the regression-based approach you simply predicted one probability under treatment and one under control in the binary mediator case. And we do not predict expectations: in a binary model like logistic or probit, we do not predict the probability; instead we generate actual values, ones and zeros. Also, the predictions use samples of the model estimates instead of the actual estimates.

Let's take a look at why we need to go to all this trouble: why do we simulate, why do we take samples of estimates, instead of just using the regression-based calculation? This is something I struggled to understand myself, but the key is in Theorem 1 of Imai's paper. There is an integral over M, and an integral over M means that we analyze the distribution of M instead of the expected value, as we do in the regression-based approach; we need to look at the full distribution instead of just the expected value. Where does this integral come from? It comes from something called the mediation formula, which was introduced, I believe by Pearl, in the literature on causality. The idea in the mediation formula is that we are not looking at two predicted outcomes for each individual, but at how the distribution of the mediator changes between the treatment and the control. So we do not analyze differences in expectations; we analyze differences in distributions, and then we weight each value of the mediator's distribution by the outcome function, whatever it is, and that gives us the causal mediation effect.
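For reference, here is the mediation formula written out for a discrete mediator; this is my reconstruction of the standard form found in Pearl's and Imai's papers rather than a slide from the video:

```latex
% Expected potential outcome when the treatment is t but the mediator
% follows the distribution it would have under treatment t':
E\bigl[\, Y\bigl(t,\, M(t')\bigr) \,\bigr]
  \;=\; \sum_{m} E\bigl[\, Y \mid T = t,\, M = m \,\bigr]\,
        P\bigl(M = m \mid T = t'\bigr)

% For a continuous mediator the sum becomes an integral over m,
% which is exactly the integral the simulation approximates.
```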
How this is proven is not important here; you just need to know that it is a very general formulation. Instead of looking at a particular value of M under treatment and a particular value of M under control on an individual-case basis, we look at the distribution of M under treatment and the distribution of M under control. What the simulation does is something called Monte Carlo integration. Monte Carlo integration is a way of calculating an integral: if we want to analyze the full distribution of a variable, we need to integrate over it, and Monte Carlo integration is one way of doing that. Here is a simple example. Assume we have x, which is normally distributed with mean zero and standard deviation one, and we want to know the mean of x squared. We know that x squared is chi-squared with one degree of freedom, and its mean is worked out in any statistics textbook, but for demonstration let's assume we do not know the mean of x squared and we want to find it by Monte Carlo integration. The idea of Monte Carlo integration is that we take a random sample: we take one observation of x, say -0.7, raise it to the second power, and get about 0.5; and we repeat this many, many times, drawing samples from the distribution of x, calculating x squared, and recording these squares. The dashed line in the figure shows the average of the first 35 replications. When we do this kind of simulation, most of the time we get x values that are close to zero; sometimes we get x values far from zero, but those are less common, because this is a normal distribution and in a normal distribution most observations are close to zero. After about 500 replications or so, we can see that the expected value, the average of these squares, settles at 1. That is the idea of Monte Carlo integration: we simulate values from the distribution of interest, calculate the function of interest for each of these draws, and then calculate the mean. The draws land in the tails as rarely as they should, most land near the center, and this works out well for many, many problems. This is the reason we run multiple simulations: we need to calculate the integral. And this is just another way of writing that integral: we draw actual values of m from its distribution, just as we simulated values of x from the normal distribution on the previous slide. This is why we do simulations, and why we take actual draws of M instead of its expected value: it allows us to do Monte Carlo integration.
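A compact R version of this demonstration (the replication count is arbitrary):

```r
set.seed(3)

# Monte Carlo integration of E[x^2] for x ~ N(0, 1).
# The true value is 1 (the mean of a chi-squared with 1 degree of freedom).
x <- rnorm(5000)
cumulative_mean <- cumsum(x^2) / seq_along(x)

cumulative_mean[c(35, 500, 5000)]  # running average after 35, 500, 5000 draws
```

The running average bounces around at 35 draws and has settled close to 1 by about 500 draws, matching the figure described above.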
Now there is the question of why we use samples of the estimates, and this is a bit tricky to understand. It is explained really well in the missing data literature, and for example in my video on data augmentation I talk about this issue in more detail. The idea is that when we want to estimate the expected value of a variable, the linear model, or whatever model we have, gives us an estimate that is our best guess. But if we want to characterize, or estimate, how the variable is distributed, then taking the model prediction plus the distribution of the variable around that prediction actually gives a biased estimate of the distribution, because it does not take the estimation error of the model into account. So we draw samples of the estimates to take that estimation error, or estimation uncertainty, into account. I will not talk about that in detail here, but it is basically the same thing you do in missing data analysis when you do multiple imputation with the data augmentation algorithm.

How do we actually implement this in practice? Here is a quick R implementation of Imai's simulation-based approach. We generate some data, again a full mediation model with 1000 observations, and we estimate two models; this time both are probit models, so we also generate the data from a probit model. Then we draw samples of coefficients from these models: we draw random numbers whose mean is the vector of estimated coefficients and whose variance-covariance matrix is the variance-covariance matrix of the estimates. Then we construct the adjusted values t = 1 and t = 0 for each case: n is 1000, so the adjusted design has 2000 rows, a t = 1 and a t = 0 row for each case. Then one line simulates 1000 replications of M for each case under treatment and under control. It is a matrix product: a 2000-by-2 design matrix, where the first column is 1 for the intercept (the intercept is something times 1 added to the model) and the second column is the adjusted t, times a 2-by-1000 matrix containing the two coefficients, the intercept and the coefficient of t, for each of the 1000 coefficient draws. The product is a 2000-by-1000 matrix: the columns are the replications, and the rows are the cases under the two adjustments, treated and untreated. Then, for each replication, each column of the simulated mediator, we calculate a similar matrix product, now with a 4000-by-3 design matrix: 4000 because each observation now contributes four cases, namely t set to 1 with the mediator predicted under t = 1, t set to 0 with the mediator predicted under t = 0, and the two counterfactuals, t set to 1 with the mediator predicted under t = 0 and t set to 0 with the mediator predicted under t = 1. We compare these four conditions, take the averages, and get the average causal effects. This runs in maybe less than a second on my computer, and if we calculate the same thing with Imai's mediation package, we get the same results in two or three seconds. So this is, in a nutshell, the simulation-based approach.
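Here is a compact R sketch of the simulation just described. To keep the dimensions easy to follow, I collapse the single large matrix product for the outcome into a loop over replications, the data-generating numbers are my own, and MASS::mvrnorm stands in for whatever multivariate normal sampler the video's code uses; the sketch computes the ACME under treatment.

```r
set.seed(4)
library(MASS)   # for mvrnorm, used to draw coefficient samples

# Hypothetical full mediation model with probit links (coefficients made up)
n <- 1000
t <- rbinom(n, 1, 0.5)
m <- rbinom(n, 1, pnorm(-0.5 + 1.0 * t))
y <- rbinom(n, 1, pnorm(-0.5 + 1.0 * m))

fit_m <- glm(m ~ t,     family = binomial(link = "probit"))
fit_y <- glm(y ~ t + m, family = binomial(link = "probit"))

# Draw R samples of coefficients from their estimated sampling distribution
R  <- 1000
bm <- mvrnorm(R, coef(fit_m), vcov(fit_m))   # R x 2: (intercept, t)
by <- mvrnorm(R, coef(fit_y), vcov(fit_y))   # R x 3: (intercept, t, m)

# Design matrix with t adjusted to 1 and 0 for every case: 2n x 2
X_m <- cbind(1, rep(c(1, 0), each = n))

# Simulate actual 0/1 mediator values (not probabilities): 2n x R
p_m   <- pnorm(X_m %*% t(bm))
m_sim <- matrix(rbinom(length(p_m), 1, p_m), nrow = 2 * n)
m1 <- m_sim[1:n, ]               # mediator draws under t = 1
m0 <- m_sim[(n + 1):(2 * n), ]   # mediator draws under t = 0

# ACME under treatment: compare y(1, m(1)) with y(1, m(0)), averaging over
# cases and replications; each replication uses its own coefficient draw
acme <- mean(sapply(1:R, function(r) {
  y11 <- rbinom(n, 1, pnorm(by[r, 1] + by[r, 2] + by[r, 3] * m1[, r]))
  y10 <- rbinom(n, 1, pnorm(by[r, 1] + by[r, 2] + by[r, 3] * m0[, r]))
  mean(y11 - y10)
}))
acme
```

On a modern machine this runs in roughly a second, in line with the timing mentioned above.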
What are the limitations of the simulation-based approach? VanderWeele, who advocates the regression-based approach, says that the main drawback of the simulation-based approach is that it takes a lot of computational time if your dataset is large. I am not sure that is actually true, because you are not really re-estimating anything. If you estimate a GLM on 26 million observations, as in their example, that takes some time, a few minutes on a modern computer. But if you have 26 million observations and then need to simulate 1000 replications, the matrix products that I showed, and that Imai's software package uses, will explode in size: you will run out of memory on your computer, and when your computer runs out of memory, it gets slower, because it needs to swap results to the hard drive, which is several orders of magnitude slower than working from memory. If you are working with large data and a simulation-based approach, your options are basically these: don't use the full data but take a random sample, because if you have 26 million observations, a random sample of 260,000 observations will be good enough for most purposes; or go back to the matrices and implement it yourself, using R for example, splitting the simulation into smaller tasks so that you never have to hold one big matrix that fills your computer's memory. These techniques can be implemented using the more foundational, or primitive, tools of your statistical software. I demonstrated the regression-based approach using Stata and the simulation-based approach using R; both can be done in both packages.

There are also several ready-made packages, and these packages are constantly being developed. An article in Structural Equation Modeling reviews six or seven common packages, covering Stata, R, SAS, and Mplus, and discusses the differences between them. If you want to do this kind of causal mediation analysis, or nonlinear mediation analysis, that article is a good source, because the first thing you need to do is pick a tool. Trying to implement this by hand is a good idea for learning, but for actual use, relying on your own implementation is problematic for two reasons: one, it is a bit tedious to think everything through instead of applying these black boxes, and two, how do you know that you have implemented it correctly? Verifying that your implementation actually works is a fair amount of work. So I recommend that you do both: use a package, and then verify that you understood what the package does by implementing it yourself using adjusted predictions.

So far we have talked about a binary treatment. How do we deal with a continuous treatment, where we don't have zeros and ones but, for example, individuals receive different amounts of medication rather than a yes or no; or, to use a sports example, we tell people to run two times more than they normally would, while the running baseline varies between people? Continuous treatments address these kinds of questions. The continuous treatment case is actually not different at all: all these equations, as Tingley and colleagues note in their package documentation, are just the same math. So let's take a look at how you do it with this data. Instead of predicting at T = 0 and T = 1, we predict at a grid of T values, say from -2 to 2 in increments of 0.5. What do we actually do here? We take the linear prediction using the original M and the adjusted T, the same thing I did before; then we subtract the effect of the adjusted T, add the effect of the observed T, and then add the effect of the predicted M. What this models is not the difference between 0 and 1 but the difference of going from whatever your current baseline is to a particular value. To use the running example, if the treatment is running one, three, four, or five times a week, then the question is: if you adjust your current running level, whatever it is, to five times a week, what is the average effect? And we can use margins here and plot the effect.
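Before reading the plot, here is a hedged R sketch of the same idea: computing the mediation effect over a grid of adjusted treatment values with linear models, where the grid and the generated data are my own illustration.

```r
set.seed(5)

n <- 1000
d <- data.frame(t = rnorm(n))                 # continuous treatment
d$m <- 0.5 * d$t + rnorm(n)
d$y <- 0.3 * d$t + 0.7 * d$m + rnorm(n)

fit_m <- lm(m ~ t, data = d)
fit_y <- lm(y ~ t + m, data = d)

grid <- seq(-2, 2, by = 0.5)                  # adjusted treatment values

# For each grid value: predict M with T adjusted, then predict Y holding T
# at its observed value but swapping in the predicted M; the effect is the
# change relative to the prediction at the observed M
acme_at <- sapply(grid, function(t_adj) {
  m_adj <- predict(fit_m, newdata = transform(d, t = t_adj))
  y_adj <- predict(fit_y, newdata = transform(d, m = m_adj))
  y_obs <- predict(fit_y, newdata = d)
  mean(y_adj - y_obs)
})

plot(grid, acme_at, type = "l", col = "red",
     xlab = "adjusted treatment", ylab = "average causal mediation effect")
```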
Reading the plot: if we change everybody to be at -2 on the treatment without changing the mediator, we see where the expected outcome would be, and if we move everyone from their current level to, say, 1.5, we see the value of the outcome holding the mediator and the treatment at their observed values in the outcome equation. The red line is the causal mediation effect: the effect of changing the mediator according to the different adjusted levels of T, given the current T and the current M.

Mediation testing is an active topic, so if you want to do nonlinear mediation analysis, you are in for some reading. There are quite a few recent articles on problems such as measurement error, confounding by omitted variables, multilevel designs, and all kinds of other issues. In particular, for the regression-based approach every case needs to be proven case by case, and the simulation-based approach also has some limitations that still need to be studied. Because this is a hot topic in some disciplines, there are good commentaries, reviews, and guidelines, particularly in epidemiology but also in the modeling literature. So this is an active research topic; it is just not active within management, so not all management researchers necessarily read all this literature.