Multilevel modeling and econometric techniques for panel data typically focus on data sets where repeated observations of individuals are stored as actual repeated rows in the data. However, this is not the only way of storing repeated observations. We could also store the repeated observations as new variables, so that if we have 10 observations of each individual, we have 10 variables for those observations. This is called the wide format, and it is a useful way of structuring data for longitudinal analysis as well; if your data are in this format, you would use structural equation modeling with latent variables for the analysis. Let's take a look at how these techniques work and how we can specify many commonly used panel data models using structural equation modeling.

The examples are based on an article by Bou and Satorra. The article covers how you can use SEM to estimate panel data models, and it also reviews the use of these wide format models in management research. The authors conclude that the techniques are not very widely known by management researchers, although they are not really new: they have been used for decades in, for example, psychology, but not so much in management. There tends to be a focus on either econometric techniques or structural equation modeling techniques for longitudinal data sets, and for some reason management researchers typically go for the econometric techniques rather than the structural equation modeling techniques. I have a soft spot for this paper because I was a reviewer for it, and, for example, I pushed the authors to include examples in R in addition to the Stata examples. So the paper provides nice examples that you can work through using R or Stata to understand more about these models.

Let's take a look at what wide format and long format data are. We can structure the same set of information, the same variable values, in different ways. In the long format data, each individual is observed four times, so we have four rows per individual. We have values x11 through xN4, y11 through yN4, and z1 through zN. As we can see, z does not vary within individuals, so the same value of z is repeated on every row of an individual. In the wide format data we have the exact same numbers, so the information content is identical; it is just presented in a different format. Instead of four rows per individual, we have one row per individual, and the repeated measures of x and y are stored as different variables: time one variables, time two variables, time three variables, and time four variables. And we have just one variable for z, because z does not vary over time; it is constant within an individual.

To switch between these two formats in your statistical software, you would use a command called reshape or pivot or something similar; these are the typical terms for going between wide and long. Once you have told your software how you want to go from long to wide, going back is usually straightforward: you just execute the same specification the other way, as in the sketch below.
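To make that concrete, here is a minimal sketch in R using the tidyr package. The data frame `long` and its column names `id`, `t`, `x`, `y`, and `z` are hypothetical names chosen for this illustration, not anything from the article.

```r
library(tidyr)

# Hypothetical long data: one row per individual-wave, with columns
# id, t (wave 1..4), time-varying x and y, and time-invariant z.
wide <- pivot_wider(long,
                    id_cols     = c(id, z),   # z is constant within id
                    names_from  = t,
                    values_from = c(x, y),
                    names_sep   = "")         # yields x1..x4, y1..y4

# Going back is the same specification, executed the other way.
long_again <- pivot_longer(wide,
                           cols          = matches("^[xy][1-4]$"),
                           names_to      = c(".value", "t"),
                           names_pattern = "([xy])([1-4])")
```

In Stata the same switch is done with the reshape wide and reshape long commands.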
So how do we use wide format data to estimate models with an unobserved effect, that is, models with a random intercept? Let's first look at how we estimate a pooled regression analysis. The idea of pooled regression is that we just regress y on x, and perhaps z, and get the result; in this example I use only y and x and omit z for simplicity. This is the pooled cross-sectional analysis, ignoring clustering. We specify the model in our structural equation modeling software as follows: we regress y1 on x1, y2 on x2, and y3 on x3, and we constrain the coefficients to be the same, the intercepts to be the same, and the error variances to be the same. This is exactly the same as estimating a normal regression model on the long format data. We specify three regression equations, we specify that the errors are uncorrelated because of the independence-of-observations assumption, and we specify that the regression paths are shared across the three equations. This gives you the normal regression results from SEM. In itself this is not particularly interesting except as a learning exercise, because the advantage of this modeling approach comes from the use of latent variables.

How do we then estimate a model with a random intercept, that is, with something common to all the y variables of an individual? We add an individual-specific variable that affects all the observations of that individual. This is the random intercept model. The idea of a random intercept is that we have unobserved heterogeneity: an effect that is shared between all observations of an individual. This is equivalent to estimating, for example, a normal multilevel model with a random intercept, or generalized least squares with a random effects specification. We have one unobserved variable that affects all the time observations of an individual. We can also see the random effects assumption explicitly: this latent variable, the unobserved effect, is assumed to be uncorrelated with all of the predictors, and we can use the chi-square test of this model to test that assumption.

We can also specify less constrained models, which is something the Bou and Satorra article focuses on a lot. We can estimate the correlated random effects model as well: instead of calculating cluster means of x and using them as controls when regressing y on x, we simply specify that the unobserved effect a can be correlated with all the x variables. The random effects model shown on the previous slide is nested within this correlated random effects model, so we can test whether those correlations are all zero, which is the random effects assumption, with a likelihood ratio test between the two models. A sketch of all three specifications follows below.
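Here is a minimal lavaan sketch of the three specifications, assuming a hypothetical wide data frame `d` with columns `y1`–`y3` and `x1`–`x3`; this is my illustration, not code from the article.

```r
library(lavaan)

# Pooled regression: one slope (b), one intercept (i), and one error
# variance (v) shared across the three waves; no error correlations.
pooled <- '
  y1 ~ b*x1
  y2 ~ b*x2
  y3 ~ b*x3
  y1 ~ i*1
  y2 ~ i*1
  y3 ~ i*1
  y1 ~~ v*y1
  y2 ~~ v*y2
  y3 ~~ v*y3
'
fit_pooled <- sem(pooled, data = d)

# Random intercept: latent a loads on every y with loading 1; the
# random effects assumption fixes its covariance with each x to zero.
ri <- '
  a =~ 1*y1 + 1*y2 + 1*y3
  y1 ~ b*x1
  y2 ~ b*x2
  y3 ~ b*x3
  y1 ~ i*1
  y2 ~ i*1
  y3 ~ i*1
  y1 ~~ v*y1
  y2 ~~ v*y2
  y3 ~~ v*y3
  a ~~ 0*x1 + 0*x2 + 0*x3
'

# Correlated random effects: the a-with-x covariances are freed.
cre <- '
  a =~ 1*y1 + 1*y2 + 1*y3
  y1 ~ b*x1
  y2 ~ b*x2
  y3 ~ b*x3
  y1 ~ i*1
  y2 ~ i*1
  y3 ~ i*1
  y1 ~~ v*y1
  y2 ~~ v*y2
  y3 ~~ v*y3
  a ~~ x1 + x2 + x3
'

# fixed.x = FALSE so that covariances involving the x variables are
# part of the estimated model.
fit_ri  <- sem(ri,  data = d, fixed.x = FALSE)
fit_cre <- sem(cre, data = d, fixed.x = FALSE)

# The RI model is nested in the CRE model; this likelihood ratio test
# is the test of the random effects assumption described above.
lavTestLRT(fit_ri, fit_cre)
```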
So this is how we can specify these simple econometric techniques using SEM. But we can also do more advanced things, some of which are of course possible with multilevel modeling software as well. For example, we can use lagged dependent variables. That gives us a dynamic panel model, because the ys affect each other over time, the xs affect the ys, and the unobserved effect is still there. And we can do other things.

We can estimate autocorrelated error models. Here y does not affect itself over time; instead, the error term at time 2 depends on its value at time 1, the error at time 3 depends on its value at time 2, and so on. This is called autocorrelation: the error term depends on its own previous value. It is very useful for modeling all kinds of trending data.

We can allow heteroskedastic errors. Instead of constraining the error variances to be equal, we let each error variance be freely estimated. This is heteroskedasticity over time. It does not, of course, capture every type of heteroskedasticity; for example, if the variance of the time 2 error depends on x2, that would not be accounted for by this model. But it handles heteroskedasticity over time, which is sometimes an important consideration.

We can also do something that resembles cluster-robust standard errors in regression analysis. It is not exactly the same, but you get the idea: the error terms within a unit, that is, the errors of the repeated observations of one unit, are allowed to be freely correlated, with no independence constraints imposed. This differs from cluster-robust standard errors for two reasons. First, the estimates from this model will be slightly different from the pooled cross-sectional analysis, whereas cluster-robust standard errors adjust only the standard errors, not the estimates. Second, this model does not address heteroskedasticity with respect to the predictors: for example, if the variance of the time 3 error depends on x3, that is not captured by this model, but it would be handled by cluster-robust standard errors. Sketches of these error structures follow below.

So we can do all kinds of things with clustered or longitudinal data using SEM by modeling the unobserved heterogeneity as a latent variable and by specifying different error covariance structures, much as we do in a multilevel model.
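In the same hypothetical lavaan notation, these variations look roughly as follows. Note that freeing adjacent error covariances is only a simple stand-in for a genuine AR(1) error process, and the dynamic panel sketch glosses over identification details that matter in practice.

```r
# Dynamic panel sketch: y depends on its own lag; the first-wave y is
# treated as predetermined and allowed to covary with a. (Real dynamic
# panel SEMs need careful identification work and usually more waves;
# this only shows the idea.)
dyn <- '
  a =~ 1*y2 + 1*y3 + 1*y4
  y2 ~ b*x2 + p*y1
  y3 ~ b*x3 + p*y2
  y4 ~ b*x4 + p*y3
  a ~~ y1
'

# Autocorrelation stand-in: adjacent errors covary, with the label r
# keeping the lag-1 covariance equal across waves.
ar <- '
  a =~ 1*y1 + 1*y2 + 1*y3
  y1 ~ b*x1
  y2 ~ b*x2
  y3 ~ b*x3
  y1 ~~ r*y2
  y2 ~~ r*y3
'

# Heteroskedasticity over time: drop the equality labels so each wave
# gets its own error variance.
het <- '
  a =~ 1*y1 + 1*y2 + 1*y3
  y1 ~ b*x1
  y2 ~ b*x2
  y3 ~ b*x3
  y1 ~~ y1
  y2 ~~ y2
  y3 ~~ y3
'

# Clustered-errors analogue: no latent a, all within-unit error
# covariances free.
unstr <- '
  y1 ~ b*x1
  y2 ~ b*x2
  y3 ~ b*x3
  y1 ~~ y2 + y3
  y2 ~~ y3
'

fit_dyn <- sem(dyn, data = d, fixed.x = FALSE)
```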
What, then, are the advantages and disadvantages of the two approaches? The long format is more commonly used, and it is easier to use in the sense that it is faster to specify: you normally just give the independent variables, the dependent variable, and the clustering variable, and if you need an error covariance structure, you choose one and estimate the model. The wide format requires more manual model specification: you actually need to write out the full model with all the correlations and all the paths, and if you have, say, 10 repeated observations, that model gets rather large.

Diagnostics for the long format are not well established. There are diagnostic tools for multilevel models and for panel data techniques, but they tend to be fairly specialized, so you need to read specialized literature to understand them, and you need to learn them somewhat case by case: for example, an R package that implements added variable plots for panel data models does not work for multilevel models, and vice versa. The wide format has well-developed diagnostics: you have the chi-square test and modification indices, and the same diagnostics apply regardless of what kind of model you estimate.

The wide format also handles missing data better, because full information maximum likelihood estimation, or FIML, can be applied. It allows you to estimate the model even if some individuals are missing values on some of the variables; with long format data, those cases would typically be dropped (see the sketch at the end of this section).

So the wide format, in a way, gives you more flexibility, but it is also more work and more error-prone, so which one you apply is largely a matter of personal preference. The wide format is not so much better that everyone should switch to it right away, but it is worth keeping in mind, if for nothing else then as a diagnostic tool and as a learning technique, because understanding how to specify different econometric models using SEM helps you understand those econometric techniques much better than before.
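As a final sketch, this is how FIML estimation is requested in lavaan, reusing the hypothetical random intercept model ri from above:

```r
# FIML uses all available observations instead of listwise deletion.
# fixed.x = FALSE brings the predictors into the likelihood so that
# cases with missing x values also contribute.
fit_fiml <- sem(ri, data = d, missing = "fiml", fixed.x = FALSE)
summary(fit_fiml, fit.measures = TRUE)
```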