So you want to reproduce a study…

Note that I said reproduce, not replicate. Reproducing a study means that given the same data (and code, ideally), another researcher would be able to derive the same model and estimates that the original researcher did.

The kicker is that you rarely have access to the original data. The logic of publishing correlation matrices in papers, though, is that, theoretically, you should be able to reproduce the same, or demonstrably similar, results as published.

In practice, it’s a bit trickier.


Here’s our study for today: Anderson et al. (2009).

Go ahead and reproduce it. I dare you! :)


First up, the correlation matrix

Your first choice is whether to reproduce the entire matrix or just a portion of it. The entire matrix is better, but depending on the covariates, you can get close by focusing on just the parameters of interest. That’s what we’re going to do here.

So, let’s build our matrix. We’re starting with a correlation matrix, so the diagonal elements are all 1.

vars <- c("SLC", "EO", "SO", "MR", "SFM")
cor.matrix <- matrix(c(1,.18,.26,.27,.28, 
                       .18,1,.24,.18,.41,
                       .26,.24,1,.16,-.05, 
                       .27,.18,.16,1,.12,
                       .28,.41,-.05,.12,1),
                     nrow = 5, ncol = 5,
                     dimnames = list(vars, vars))  # Name both rows and columns
cor.matrix
##      SLC   EO    SO   MR   SFM
## SLC 1.00 0.18  0.26 0.27  0.28
## EO  0.18 1.00  0.24 0.18  0.41
## SO  0.26 0.24  1.00 0.16 -0.05
## MR  0.27 0.18  0.16 1.00  0.12
## SFM 0.28 0.41 -0.05 0.12  1.00

Now we’re going to convert our correlation matrix into a covariance matrix using the cor2cov function in the lavaan package. Notice that the variance is the square of the standard deviation :)

library(lavaan)
# We need a vector of our standard deviations
sd <- c(.93,1.09,1.26,1.33,1.31)
cov.matrix <- cor2cov(cor.matrix, sd)
cov.matrix
##          SLC       EO        SO       MR       SFM
## SLC 0.864900 0.182466  0.304668 0.333963  0.341124
## EO  0.182466 1.188100  0.329616 0.260946  0.585439
## SO  0.304668 0.329616  1.587600 0.268128 -0.082530
## MR  0.333963 0.260946  0.268128 1.768900  0.209076
## SFM 0.341124 0.585439 -0.082530 0.209076  1.716100
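
If you want to see what cor2cov is doing under the hood, it’s just scaling: the covariance between two variables is their correlation multiplied by the product of their standard deviations. A quick sanity check:

# cov(x, y) = cor(x, y) * sd(x) * sd(y); %o% builds the outer
# product of the standard deviation vector with itself
manual.cov <- cor.matrix * (sd %o% sd)
all.equal(manual.cov, cov.matrix, check.attributes = FALSE)  # Should be TRUE
# And the diagonal is just the variances (the sds squared)
all.equal(unname(diag(manual.cov)), sd^2)  # Should be TRUE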

Now, let’s generate the dataset using the mvrnonnorm function from the semTools package. Note that the sample size is the same as that reported in the paper: 110.

library(semTools)
set.seed(08022003)
my.df <- mvrnonnorm(n = 110,  # Number of observations
                   mu = c(4.7, 4.01, 4.98, 4.92, 4.07),  # Variable means
                   Sigma = cov.matrix,  # Covariance matrix
                   empirical = TRUE)  # Set our matrix as the empirical values
# Convert the simulated data to a data frame
my.df <- as.data.frame(my.df)
# Now let's check the covariance matrix for the new dataset
myCov.matrix <- cov(my.df, method = c("pearson"))  
round(myCov.matrix,5)
##         SLC      EO       SO      MR      SFM
## SLC 0.86490 0.18247  0.30467 0.33396  0.34112
## EO  0.18247 1.18810  0.32962 0.26095  0.58544
## SO  0.30467 0.32962  1.58760 0.26813 -0.08253
## MR  0.33396 0.26095  0.26813 1.76890  0.20908
## SFM 0.34112 0.58544 -0.08253 0.20908  1.71610
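
While we’re at it, empirical = TRUE also forces the sample means to equal the mu vector we passed in, so the simulated data should reproduce the reported means exactly:

# The column means should come back as exactly 4.70, 4.01, 4.98, 4.92, 4.07
round(colMeans(my.df), 2)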

Lastly, let’s verify the correlations in the new data…

myCor.matrix <- cor(my.df, method = c("pearson"))  
round(myCor.matrix,2)
##      SLC   EO    SO   MR   SFM
## SLC 1.00 0.18  0.26 0.27  0.28
## EO  0.18 1.00  0.24 0.18  0.41
## SO  0.26 0.24  1.00 0.16 -0.05
## MR  0.27 0.18  0.16 1.00  0.12
## SFM 0.28 0.41 -0.05 0.12  1.00

Looks good!


Run some models

So the paper used the Baron & Kenny method (d’oh!). We’re not going to make the same mistake that author did!

Take a look, though, at Model 8 in Table 4. Let’s fit a simple model of the relationship between EO and SLC (we don’t have the covariates, but none of them were all that big anyway). We’re going to use the lm.beta package to get the standardized parameters.

library(lm.beta)
eo.slc.model <- lm(SLC ~ EO, data = my.df)
summary(eo.slc.model)
## 
## Call:
## lm(formula = SLC ~ EO, data = my.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0225 -0.6292 -0.1570  0.6434  2.0861 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.08415    0.33549  12.174   <2e-16 ***
## EO           0.15358    0.08076   1.902   0.0599 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.919 on 108 degrees of freedom
## Multiple R-squared:  0.0324, Adjusted R-squared:  0.02344 
## F-statistic: 3.616 on 1 and 108 DF,  p-value: 0.05988
lm.beta(eo.slc.model)
## 
## Call:
## lm(formula = SLC ~ EO, data = my.df)
## 
## Standardized Coefficients::
## (Intercept)          EO 
##        0.00        0.18

Ok, so we got close to what was reported in the paper. In fact, our estimate is no accident: in a simple regression, the standardized coefficient is just the zero-order correlation, and with empirical = TRUE we forced that correlation to be exactly .18. The gap from the published value likely comes from the covariates we left out. Computers aren’t all that smart; all we gave them were the means, standard deviations, and zero-order correlations for our simulated data.

So, to get a sense of the sampling variation, we’re going to run thousands of simulations. To do that, we need to write our own function. The function is going to generate a new dataset (without using set.seed), run the model, and return the standardized coefficient estimate for EO, which we’ll collect in a new data frame. We’re also not going to use the empirical = TRUE option this time, because we want random draws from a population with the specified correlations.

library(tidyverse)
reproduce.eo.function <- function() {
  # Draw a fresh sample of 110 observations from the specified population
  my.df.sim <- as.data.frame(mvrnonnorm(n = 110, 
                                        mu = c(4.7, 4.01, 4.98, 4.92, 4.07), 
                                        Sigma = cov.matrix))
  eo.slc.model.sim <- lm(SLC ~ EO, data = my.df.sim)
  # Return the standardized coefficient for EO
  coef(lm.beta(eo.slc.model.sim))["EO"]
}
sim.results <- tibble(EO = replicate(10000, reproduce.eo.function()))
summary(sim.results)
##        EO         
##  Min.   :-0.1540  
##  1st Qu.: 0.1173  
##  Median : 0.1815  
##  Mean   : 0.1803  
##  3rd Qu.: 0.2457  
##  Max.   : 0.4761

With 10,000 simulations, the mean estimate lands right on the .18 we got before, and reasonably close to the reported value of .20. In a perfect world, we would get a statistically indistinguishable match if we also included the covariates. That said, it’s worth noting that there is a negative correlation between effect size and sample size :)
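
One more check before we move on. Since we have the full sampling distribution in hand, we can pull the empirical 95% interval of the simulated estimates and see where the published .20 falls; given the quartiles above, it should sit comfortably inside:

# Empirical 95% interval of the simulated standardized estimates
quantile(sim.results$EO, probs = c(0.025, 0.975))
# Share of draws at or above the published value; if this isn't tiny,
# .20 is entirely plausible given the reported correlations
mean(sim.results$EO >= 0.20)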


Model comparison

As I said before, the paper uses the B&K method, which we already know to be, well, bad. So what if we used SEM instead? Well, we could use the simulated dataset, but we actually don’t have to. The type of SEM that we generally do is called covariance-based structural equation modeling: it estimates the parameters from the covariance structure of the data rather than from the raw observations. So as long as we have a covariance matrix (and the sample size), we can use SEM.

So let’s specify the full, three-mediator model from the paper.

library(lavaan)
eo.model <- 'SO ~ EO
             MR ~ EO
             SFM ~ EO

             SLC ~ SO + MR + SFM + EO'
eo.fit <- sem(eo.model, sample.cov = cov.matrix, sample.nobs = 110)
summary(eo.fit, standardized = TRUE)
## lavaan (0.5-23.1097) converged normally after  19 iterations
## 
##   Number of observations                           110
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic                5.389
##   Degrees of freedom                                 3
##   P-value (Chi-square)                           0.145
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   SO ~                                                                  
##     EO                0.277    0.107    2.593    0.010    0.277    0.240
##   MR ~                                                                  
##     EO                0.220    0.114    1.919    0.055    0.220    0.180
##   SFM ~                                                                 
##     EO                0.493    0.105    4.715    0.000    0.493    0.410
##   SLC ~                                                                 
##     SO                0.184    0.065    2.824    0.005    0.184    0.249
##     MR                0.141    0.061    2.320    0.020    0.141    0.202
##     SFM               0.200    0.067    2.992    0.003    0.200    0.281
##     EO               -0.027    0.083   -0.322    0.747   -0.027   -0.031
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .SO                1.483    0.200    7.416    0.000    1.483    0.942
##    .MR                1.696    0.229    7.416    0.000    1.696    0.968
##    .SFM               1.415    0.191    7.416    0.000    1.415    0.832
##    .SLC               0.692    0.093    7.416    0.000    0.692    0.805

So, not too far off from what was reported. Now, the kicker is that to truly get into this reproduction, we would need the covariance matrix of the indicators, so that we could specify the measurement model as well. But since the authors collapsed the variables into mean-scale scores (I know, right!!!), that’s the best we can do.
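
Just to make that concrete, if we did have the indicator-level covariance matrix, the lavaan syntax would look something like this. The indicator names (EO1 through EO3, and so on) are hypothetical placeholders; the actual items and item counts would come from the paper’s scales:

# Hypothetical measurement model using lavaan's =~ ("measured by")
# operator. The indicator names are made up for illustration; fitting
# this would require the item-level covariance matrix.
eo.full.model <- '# Measurement model
                  EO  =~ EO1 + EO2 + EO3
                  SO  =~ SO1 + SO2 + SO3
                  MR  =~ MR1 + MR2 + MR3
                  SFM =~ SFM1 + SFM2 + SFM3
                  SLC =~ SLC1 + SLC2 + SLC3

                  # Structural model, same as before
                  SO ~ EO
                  MR ~ EO
                  SFM ~ EO
                  SLC ~ SO + MR + SFM + EO'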

As a point of comparison, if we estimate the same path model using the simulated data frame itself, we get the same values…

eo.df.fit <- sem(eo.model, data = my.df)
summary(eo.df.fit, standardized = TRUE)
## lavaan (0.5-23.1097) converged normally after  19 iterations
## 
##   Number of observations                           110
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic                5.389
##   Degrees of freedom                                 3
##   P-value (Chi-square)                           0.145
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   SO ~                                                                  
##     EO                0.277    0.107    2.593    0.010    0.277    0.240
##   MR ~                                                                  
##     EO                0.220    0.114    1.919    0.055    0.220    0.180
##   SFM ~                                                                 
##     EO                0.493    0.105    4.715    0.000    0.493    0.410
##   SLC ~                                                                 
##     SO                0.184    0.065    2.824    0.005    0.184    0.249
##     MR                0.141    0.061    2.320    0.020    0.141    0.202
##     SFM               0.200    0.067    2.992    0.003    0.200    0.281
##     EO               -0.027    0.083   -0.322    0.747   -0.027   -0.031
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .SO                1.483    0.200    7.416    0.000    1.483    0.942
##    .MR                1.696    0.229    7.416    0.000    1.696    0.968
##    .SFM               1.415    0.191    7.416    0.000    1.415    0.832
##    .SLC               0.692    0.093    7.416    0.000    0.692    0.805

Pretty cool, huh?


Endogeneity

As with our last lab, we’re interested in understanding how big a threat endogeneity might be to the model. We know that there are a lot of potential problems with this model: selection effects, omitted causes, measurement error, and so on. We can’t go back in time to fix them, but we can get an idea of how big the endogeneity problem would need to be to produce a nomological error.

For that, we can run another simulation, this time using the mediate function from the mediation package. Now, while we could specify multiple-mediator models, doing this kind of simulation on observational data gets pretty tenuous. So we’re going to focus on one mediator, SFM, while controlling for the other two in both the mediator model and the outcome model. That gets us close to the indirect effect of EO through SFM in the SEM model above.
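In fact, for reference, we can pull that indirect effect straight out of lavaan by labeling the \(a\) and \(b\) paths and defining their product with the := operator. Given the estimates above, ab should come out around .493 * .200, or roughly .10; keep that number in mind when we look at the ACME below:

# Same structural model, with labels on the EO -> SFM (a) and
# SFM -> SLC (b) paths so we can define the indirect effect
eo.ind.model <- 'SO ~ EO
                 MR ~ EO
                 SFM ~ a*EO

                 SLC ~ SO + MR + b*SFM + EO

                 # Indirect effect of EO on SLC through SFM
                 ab := a*b'
eo.ind.fit <- sem(eo.ind.model, sample.cov = cov.matrix, sample.nobs = 110)
parameterEstimates(eo.ind.fit)  # Look for the ab row in the output
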

library(mediation)
mediator.model <- lm(SFM ~ EO + MR + SO, data = my.df)
outcome.model <- lm(SLC ~ SFM + MR + SO + EO, data = my.df)
# Now we specify the full model using the 'mediate' function
mediation.si.fit <- mediate(mediator.model, outcome.model, treat = "EO", 
                            mediator = "SFM", robustSE = TRUE, sims = 1000)
summary(mediation.si.fit)
## 
## Causal Mediation Analysis 
## 
## Quasi-Bayesian Confidence Intervals
## 
##                Estimate 95% CI Lower 95% CI Upper p-value
## ACME             0.1032       0.0221       0.1965    0.01
## ADE             -0.0259      -0.2379       0.1635    0.82
## Total Effect     0.0773      -0.1169       0.2790    0.46
## Prop. Mediated   0.7271      -9.4522      10.0780    0.46
## 
## Sample Size Used: 110 
## 
## 
## Simulations: 1000

And the plot…

plot(mediation.si.fit)


Sensitivity analysis

Now we need to run a simulation using the medsens function.

NOTE: The medsens function assumes that the \(a\) path is consistently estimated because \(x\) would have been manipulated. That is not the case in these data, because the design was observational. That said, the estimate of \(ab\) would be inconsistent because of endogeneity in the \(b\) path even if \(x\) had been manipulated, so this is still a useful exercise for observational designs. Just make sure to add that caveat.

sensitivity.fit <- medsens(mediation.si.fit, rho.by = 0.01, 
                           effect.type = "indirect", sims = 1000)
summary(sensitivity.fit)
## 
## Mediation Sensitivity Analysis for Average Causal Mediation Effect
## 
## Sensitivity Region
## 
##        Rho    ACME 95% CI Lower 95% CI Upper R^2_M*R^2_Y* R^2_M~R^2_Y~
##  [1,] 0.07  0.0787      -0.0011       0.1585       0.0049       0.0032
##  [2,] 0.08  0.0749      -0.0042       0.1540       0.0064       0.0042
##  [3,] 0.09  0.0711      -0.0073       0.1496       0.0081       0.0053
##  [4,] 0.10  0.0674      -0.0105       0.1452       0.0100       0.0065
##  [5,] 0.11  0.0636      -0.0137       0.1408       0.0121       0.0079
##  [6,] 0.12  0.0597      -0.0169       0.1364       0.0144       0.0094
##  [7,] 0.13  0.0559      -0.0202       0.1320       0.0169       0.0110
##  [8,] 0.14  0.0521      -0.0235       0.1276       0.0196       0.0127
##  [9,] 0.15  0.0482      -0.0269       0.1233       0.0225       0.0146
## [10,] 0.16  0.0443      -0.0303       0.1190       0.0256       0.0166
## [11,] 0.17  0.0404      -0.0338       0.1146       0.0289       0.0188
## [12,] 0.18  0.0365      -0.0374       0.1103       0.0324       0.0210
## [13,] 0.19  0.0325      -0.0410       0.1061       0.0361       0.0234
## [14,] 0.20  0.0286      -0.0446       0.1018       0.0400       0.0260
## [15,] 0.21  0.0246      -0.0483       0.0975       0.0441       0.0286
## [16,] 0.22  0.0206      -0.0521       0.0933       0.0484       0.0314
## [17,] 0.23  0.0165      -0.0560       0.0890       0.0529       0.0344
## [18,] 0.24  0.0125      -0.0599       0.0848       0.0576       0.0374
## [19,] 0.25  0.0084      -0.0639       0.0806       0.0625       0.0406
## [20,] 0.26  0.0042      -0.0680       0.0764       0.0676       0.0439
## [21,] 0.27  0.0000      -0.0721       0.0722       0.0729       0.0473
## [22,] 0.28 -0.0042      -0.0763       0.0680       0.0784       0.0509
## [23,] 0.29 -0.0084      -0.0807       0.0638       0.0841       0.0546
## [24,] 0.30 -0.0127      -0.0851       0.0596       0.0900       0.0584
## [25,] 0.31 -0.0170      -0.0896       0.0555       0.0961       0.0624
## [26,] 0.32 -0.0214      -0.0942       0.0513       0.1024       0.0665
## [27,] 0.33 -0.0258      -0.0988       0.0472       0.1089       0.0707
## [28,] 0.34 -0.0303      -0.1036       0.0430       0.1156       0.0751
## [29,] 0.35 -0.0348      -0.1085       0.0389       0.1225       0.0796
## [30,] 0.36 -0.0394      -0.1136       0.0347       0.1296       0.0842
## [31,] 0.37 -0.0441      -0.1187       0.0306       0.1369       0.0889
## [32,] 0.38 -0.0488      -0.1239       0.0264       0.1444       0.0938
## [33,] 0.39 -0.0535      -0.1293       0.0223       0.1521       0.0988
## [34,] 0.40 -0.0583      -0.1348       0.0181       0.1600       0.1039
## [35,] 0.41 -0.0632      -0.1404       0.0139       0.1681       0.1092
## [36,] 0.42 -0.0682      -0.1462       0.0098       0.1764       0.1146
## [37,] 0.43 -0.0732      -0.1521       0.0056       0.1849       0.1201
## [38,] 0.44 -0.0784      -0.1581       0.0014       0.1936       0.1257
## 
## Rho at which ACME = 0: 0.27
## R^2_M*R^2_Y* at which ACME = 0: 0.0729
## R^2_M~R^2_Y~ at which ACME = 0: 0.0473

So we end up with a \(\rho\) of 0.27. We would need to see a correlation of about .27 between the disturbance terms (\(\zeta\)) for a simulated omitted variable to drive the indirect effect to zero.

But look closer at the value of \(\rho\) at which the confidence interval first contains zero. It is only .07! Why is this important? Because we generally infer support for mediation on the basis of the confidence interval for \(ab\) not containing zero. Here, it doesn’t take much in the way of endogeneity for us to have failed to reject the null hypothesis. On the other end, we would need to see a \(\rho\) greater than .44 to have inferred a negative, but statistically significant, indirect effect (basically a sign flip). That’s a higher bar, but still, it doesn’t take much to show just how unstable parameter estimates can be!

And the plot…

plot(sensitivity.fit, sens.par = "rho", main = "Mediator")
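
If you prefer to think in terms of the \(R^2\) measures from the summary table rather than \(\rho\), the plot method for medsens objects can also plot against those. Here, sign.prod = "positive" assumes the unobserved confounder affects the mediator and the outcome in the same direction:

# Sensitivity plotted against the R^2 values instead of rho
plot(sensitivity.fit, sens.par = "R2", r.type = "total", 
     sign.prod = "positive")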

So with a threshold of less than .3 for \(ab\) to equal zero, and a fairly narrow range (.07 to .44) over which we would reach different nomological conclusions, then yes, I’d say it would not take much in the way of endogeneity to see a different result from the published model.


Word of caution

Remember, while you can get close to reproducing a result, you do not have the actual data from the original study, so be careful in drawing substantive conclusions.


Homework

Ok, your turn! Pick a mediation paper—the one from ASQ comes to mind—and set up a reproduction and sensitivity analysis. The purpose here is to practice these techniques!

Due to me COB Friday, 7 April.



Brian S. Anderson, Ph.D. | Assistant Professor of Entrepreneurship | andersonbri@umkc.edu
Henry W. Bloch School of Management | University of Missouri - Kansas City
entrepreneurship.bloch.umkc.edu
(c) Brian S. Anderson, 2017