Moderation: Power // LDV Models

ENT5587B - Research Design & Theory Testing II

Brian S. Anderson, Ph.D.
Assistant Professor
Department of Global Entrepreneurship & Innovation
andersonbri@umkc.edu


© 2017 Brian S. Anderson

  • How goes it?
  • IRB Applications
  • Papers Due!
  • Measurement error in moderation models
  • Statistical power considerations
  • Moderation in LDV models
  • NO LAB THIS WEEK
  • Seminar 27 Feb – 9:00AM
  • Lab 27 Feb – 1:00PM

Repeat this mantra over and over and over again…

Measurement error is the enemy. Simplicity in measurement is the best ally in defeating my enemy.

Let's start with a simple illustration, and let's generate some random numbers…

set.seed(08022003)
my.df <- data.frame(x = rnorm(1000),
                    m = rnorm(1000))
my.df$xm <- my.df$x * my.df$m

Now lets plot these distributions…

library(tidyverse)
distribution.plot <- ggplot(my.df) + 
                     geom_density(aes(x, colour = "x")) +
                     geom_density(aes(m, colour = "m")) +
                     geom_density(aes(xm, colour = "xm")) +
                     labs(title = "Distribution in Interaction Model",
                          x = "", 
                          y = "Distribution Density")
distribution.plot + theme_minimal() + 
                    xlim(-6, 6) + theme(legend.title = element_blank())

In a continuous-by-continuous interaction, even if the constituent terms, x and m, are normally distributed, the xm term will not be. Guaranteed.
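If you want to check this on the simulated data, here is a quick sketch (exact p-values depend on the seed); the Shapiro-Wilk p-value for the product term will be essentially zero, while x and m typically look fine.

# Shapiro-Wilk normality tests for each column of the simulated data
sapply(my.df, function(v) shapiro.test(v)$p.value)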

This reality of interaction models has important implications for modeling and interpreting interaction effects, which we’ve already covered.

Today we’re going to talk about why skews in the distribution of x and m, whether by random error, selection effects, or measurement error, create additional problems for testing interactions.

Even worse, measurement error in x or m makes false positives in moderation studies far more prevalent, especially when the study is under-powered.

Let's do a simulation to walk through this…

We’re going to simulate 100 studies with a simple interaction effect using nothing but random variables and with a small sample size (N = 150 — not uncommon in management studies). We’re then going to capture whether the p-value for the interaction effect was significant at the p < .05 level. We’ll wrap it up with a count of how many ‘false positives’ we observed.

Note that there are a lot of ways to do this in R—but this is one way, and we’re going to write our own R function and combine it with the replicate function.

falsepositives.function <- function() {
  # Simulate one 'study': y, x, and m are pure noise (N = 150)
  y.sim <- rnorm(150)
  x.sim <- rnorm(150)
  m.sim <- rnorm(150)
  model.sim <- lm(y.sim ~ x.sim*m.sim)
  # Return the p-value on the interaction term
  summary(model.sim)$coefficients["x.sim:m.sim", "Pr(>|t|)"]
}
sim.results <- data_frame(PValue = replicate(100, falsepositives.function()))
my.sim <- sim.results %>%
          filter(PValue <= .05) %>%
          summarise(fake.count = n())
cat("Number of False Positives:", my.sim$fake.count)
## Number of False Positives: 5

Just copy and paste the code, and run it a few times. You should see the number of false positive interactions vary (I got as high as 15 once). Remember, this model is nothing but random variables; any statistically significant interaction happened only by random chance.

The sample size is also very small and very under-powered to detect an interaction effect—although it’s a sample size that is commonly found in our literature—and yet we still found spurious interactions.

Let's kick this simulation up a notch by running our simulation 100 times. Think about it this way: one simulation is one researcher p-hacking one database 100 times. Now we're going to simulate 100 researchers doing the same thing.

What we want to know is the average number of false positives per researcher.

averagefalsepositives.function <- function() {
  # One 'researcher': run 100 all-noise studies and count how many
  # interaction terms hit p < .05
  mySim.results <- data_frame(PValue = replicate(100, falsepositives.function()))
  myCount.sim <- mySim.results %>%
                 filter(PValue <= .05) %>%
                 summarise(fake.count = n())
  myCount.sim$fake.count
}
sim2.results <- data_frame(FalseTotal = replicate(100, averagefalsepositives.function()))
summarise(sim2.results, myAverage = mean(FalseTotal))
## # A tibble: 1 × 1
##   myAverage
##       <dbl>
## 1      5.37

Quick quiz…how does this number relate broadly to our p < .05 standard for statistical significance?

Remember, our false positives were the result of random chance. We easily could have made a Type I error just out of bad luck.

Imagine how easy it would be to make a Type I error if we were p-hacking, HARKing, etc.

I’m just saying.

This is a key concern with interaction effects in management. It is easy to find spurious interactions in small samples just by random chance because the distribution of xm is not normal; interaction terms are already noisy under the best of circumstances, even when their constituent variables are not very noisy.

Imagine what happens when they are genuinely noisy, AND you layer researcher degrees of freedom on top of it.

Let's get some real-world data…

my.ds <- read_csv("http://a.web.umkc.edu/andersonbri/ENT5587.csv")
my.df <- as.data.frame(my.ds) %>%
         dplyr::select(INN1:RISK3, SGR) %>%
         filter(SGR > -50) %>% 
         na.omit()

Let's say we're interested in the interaction between innovativeness and proactiveness on sales growth rate. But before we estimate our model, we're going to do what is commonly done in our field with latent variables: create a mean scale score.

We'll cover this more in SEM, but with a mean scale score, we're effectively setting measurement error equal to zero. We don't actually make it zero; rather, we just assume it's negligible.

The most common way to evaluate whether we can make this assumption is by evaluating the internal consistency reliability of the latent construct’s indicators using a metric called Cronbach’s Alpha.
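For intuition, standardized alpha is just a function of the number of items, \(k\), and the average inter-item correlation, \(\bar{r}\): \(\alpha=\frac{k\bar{r}}{1+(k-1)\bar{r}}\). Here's a minimal sketch with three hypothetical items (not our data):

# Standardized alpha from the average inter-item correlation (hypothetical items)
set.seed(42)
item1 <- rnorm(100)
item2 <- item1 + rnorm(100)
item3 <- item1 + rnorm(100)
items <- data.frame(item1, item2, item3)
k <- ncol(items)
r.bar <- mean(cor(items)[lower.tri(cor(items))])
(k * r.bar) / (1 + (k - 1) * r.bar)  # compare to std.alpha from psych::alpha(items)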

Let's calculate alpha for innovativeness and proactiveness. First, let's create some data frames with our indicators.

INN.df <- my.df %>%
          dplyr::select(INN1:INN3)
PRO.df <- my.df %>%
          dplyr::select(PRO1:PRO3)

library(psych)
alpha(INN.df)
## 
## Reliability analysis   
## Call: alpha(x = INN.df)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
##       0.78      0.78    0.75      0.55 3.6 0.038    4 1.5
## 
##  lower alpha upper     95% confidence boundaries
## 0.7 0.78 0.85 
## 
##  Reliability if an item is dropped:
##      raw_alpha std.alpha G6(smc) average_r S/N alpha se
## INN1      0.83      0.83    0.71      0.71 4.9    0.032
## INN2      0.54      0.54    0.37      0.37 1.2    0.087
## INN3      0.71      0.71    0.55      0.55 2.5    0.055
## 
##  Item statistics 
##        n raw.r std.r r.cor r.drop mean  sd
## INN1 110  0.79  0.77  0.57   0.50  3.9 1.9
## INN2 110  0.90  0.90  0.86   0.76  4.2 1.7
## INN3 110  0.82  0.83  0.74   0.61  3.8 1.7
## 
## Non missing response frequency for each item
##         1    2    3    4    5    6    7 miss
## INN1 0.11 0.23 0.06 0.17 0.18 0.16 0.08    0
## INN2 0.05 0.17 0.11 0.20 0.20 0.18 0.08    0
## INN3 0.10 0.14 0.20 0.20 0.18 0.13 0.05    0

alpha(PRO.df)
## 
## Reliability analysis   
## Call: alpha(x = PRO.df)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
##       0.66      0.65    0.62      0.38 1.9 0.056  4.3 1.2
## 
##  lower alpha upper     95% confidence boundaries
## 0.55 0.66 0.77 
## 
##  Reliability if an item is dropped:
##      raw_alpha std.alpha G6(smc) average_r  S/N alpha se
## PRO1      0.30      0.30    0.18      0.18 0.42    0.133
## PRO2      0.52      0.52    0.35      0.35 1.09    0.091
## PRO3      0.77      0.77    0.62      0.62 3.32    0.044
## 
##  Item statistics 
##        n raw.r std.r r.cor r.drop mean  sd
## PRO1 110  0.87  0.86  0.79   0.65  4.1 1.6
## PRO2 110  0.80  0.78  0.66   0.50  4.3 1.7
## PRO3 110  0.63  0.66  0.36   0.29  4.5 1.4
## 
## Non missing response frequency for each item
##         1    2    3    4    5    6    7 miss
## PRO1 0.08 0.11 0.11 0.25 0.25 0.15 0.05    0
## PRO2 0.08 0.07 0.12 0.22 0.25 0.18 0.08    0
## PRO3 0.04 0.09 0.09 0.20 0.36 0.17 0.05    0

While the alpha for Proactiveness is just below the .7 threshold, you might still see this used in a paper.

FWIW, Cronbach's (1951) original recommendation was to retain constructs with alphas greater than .9 if using established measures.

Somewhere over the ensuing years, we just stuck with the .7 (or even lower) standard. Yeah, not a good thing.

So we’re going to exercise some researcher degrees of freedom and go ahead and collapse our scales into observed variables…

my.df$INN <- (my.df$INN1 + my.df$INN2 + my.df$INN3)/3
my.df$PRO <- (my.df$PRO1 + my.df$PRO2 + my.df$PRO3)/3
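An equivalent way to do the same thing, and a bit less error prone if the number of indicators ever changes (just an alternative, not what we used above):

# Same mean scale scores via rowMeans()
my.df$INN <- rowMeans(dplyr::select(my.df, INN1:INN3))
my.df$PRO <- rowMeans(dplyr::select(my.df, PRO1:PRO3))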

Now let's estimate our model…

sgr.model <- lm(SGR ~ INN*PRO, data = my.df)
summary(sgr.model)
## 
## Call:
## lm(formula = SGR ~ INN * PRO, data = my.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.131  -7.596  -0.822   7.842  33.798 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.3181     9.7179   0.033    0.974
## INN          -2.7935     2.7084  -1.031    0.305
## PRO          -0.4566     2.3067  -0.198    0.843
## INN:PRO       0.7449     0.5776   1.290    0.200
## 
## Residual standard error: 12.61 on 106 degrees of freedom
## Multiple R-squared:  0.06885,    Adjusted R-squared:  0.0425 
## F-statistic: 2.613 on 3 and 106 DF,  p-value: 0.05512

Darn, no interaction. Here’s the question though…is our failure to reject the null hypothesis because there really is no ‘true’ effect here?

Or…

Is it because our study is under-powered (N = 110)? Or is it because our variables are noisy?

Let's take a look at the noisy part first. This paper is a classic read on the issue; the short version is that when two noisy variables interact, the cross-product term will be even noisier.

Recall that our alpha for Innovativeness was .78, and for Proactiveness .65. If these two variables are perfectly uncorrelated (they aren't, but let's start there), the maximum reliability of the cross-product term will be the product of the two reliabilities: .78 x .65 = .507.

Roughly 50% of the variance in our interaction term would be nothing but random noise.

Now, the more correlated the constituent terms are, the better the cross-product reliability, but only marginally so (see the table on page 557 of Busemeyer & Jones [1983]). In our case, Innovativeness and Proactiveness correlate at about .44, so we bought about .08 of an increase in our cross-product reliability (.507 + .08 = .587).
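If you want to reproduce that number, one common form of the Busemeyer & Jones (1983) result for mean-centered constituent terms is \(\rho_{xm}=\frac{\rho_{x}\rho_{m}+r_{xm}^{2}}{1+r_{xm}^{2}}\), where \(r_{xm}\) is the correlation between x and m:

# Approximate reliability of the cross-product term (Busemeyer & Jones, 1983)
rel.x <- .78  # alpha for Innovativeness
rel.m <- .65  # alpha for Proactiveness
r.xm  <- .44  # observed correlation between the two scales
(rel.x * rel.m + r.xm^2) / (1 + r.xm^2)  # roughly .59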

That still leaves about 40% of the variance in the interaction term as noise. That’s not going to help us much.

The problem with measurement error is the signal-to-noise ratio.

We don't have time to cover all of this, but let's just walk through the logic.

Lets assume that our interaction effect is real, in the sense that innovativeness and proactiveness in real life have a joint effect on sales growth rate. Latent constructs—just like all measures—are imperfect; they have error. When we create our interaction term, the errors compound, so that within our measure, we end up with a lot more noise.

Now when we do our analysis, we’re already starting behind the curve. Our measurement noise obscures our signal, but so does the noise inherent to collecting real world data.

Now carry it forward. Most interaction effects in social sciences are small—there isn’t much of a signal there in the first place. If we have a small sample, we know from a simple power analysis that it gets very challenging to detect a ‘true’ effect that is also small. We are very likely to make a Type II error, but that assumes that our measures are error free.

So now we have a faint signal, in a noisy dataset, with measures that are also substantially noisy (and we haven’t even talked about measurement error in the DV yet!).
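Here's a quick sketch of that attenuation logic with simulated data (the coefficients and error variances below are just assumptions): adding measurement error to x and m shrinks the estimated interaction well below its true value of .30.

# Measurement error in x and m attenuates the estimated interaction
set.seed(456)
n <- 1000
x.true <- rnorm(n)
m.true <- rnorm(n)
y <- .3*x.true + .3*m.true + .3*x.true*m.true + rnorm(n)
x.obs <- x.true + rnorm(n, sd = .7)  # noisy measure of x
m.obs <- m.true + rnorm(n, sd = .7)  # noisy measure of m
coef(lm(y ~ x.true*m.true))["x.true:m.true"]  # close to .30
coef(lm(y ~ x.obs*m.obs))["x.obs:m.obs"]      # noticeably attenuated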

In theory, this should mean that we should not see a lot of moderation papers in the literature, but we do!

Even ignoring researcher degrees of freedom, it is common for researchers to fall victim to the low power fallacy: “Look at my small sample and I still found an effect; see, my study is conservative and my finding robust!”

Except that isn't true. The noise in the data and in the measures is just as likely to increase the Type I error rate, and when a false positive does occur, the estimated effect is likely to be really big.

Don’t believe me?

Run the simulation again and keep track of the lowest p-value. I was able to get several below p < .001, and remember, the data are pure noise.
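If you want to see it for yourself, something like this pulls the smallest interaction p-value from another batch of 100 all-noise studies (results will vary from run to run):

# Smallest p-value on the interaction term across 100 studies of pure noise
min(replicate(100, falsepositives.function()))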

So what to do???

  • The first step is to acknowledge the problem :)
  • Use estimators that account for measurement error (SEM, errors-in-variables, etc.)
  • Look for simple measures, particularly of moderators, that have high reliability and are a priori likely to be exogenous.
  • Up-power your study.

Speaking of power, let's talk about power calculations in moderation studies.

We’re going to make use of the pwr package.

Once we move beyond the simple univariate world, power calculations get tricky. We're going to use the \({f}^{2}\) measure for effect size, which in our case is the ratio of the sum of squares for the interaction term to the sum of squared errors.

Drawing from Bosco et al. (2015), a reasonable expectation for \({f}^{2}\) in the management literature is about .0025. This is the very definition of a small effect.

So what kind of N would we need to detect a typical interaction effect, with 80% probability of correctly rejecting the null hypothesis, assuming we had perfectly reliable measures?

library(pwr)
pwr.f2.test(u = 1, f2 = .0025, sig.level = 0.05, power = .8)
## 
##      Multiple regression power calculation 
## 
##               u = 1
##               v = 3139.466
##              f2 = 0.0025
##       sig.level = 0.05
##           power = 0.8

v is the number we are interested in; it's the denominator degrees of freedom, so the required sample size is u + v + 1, or more than 3,100 observations.
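If you want to see how sensitive that number is to the assumed effect size, here's a quick sweep (the larger \({f}^{2}\) values are just illustrative):

# Approximate required N (u + v + 1) for a range of interaction effect sizes
f2.values <- c(.0025, .005, .01, .02)
sapply(f2.values, function(f2) {
  ceiling(pwr.f2.test(u = 1, f2 = f2, sig.level = .05, power = .8)$v) + 2
})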

So what’s the moral of our story?

Moderation in LDV models…

Let's go back to our data set and create a new data frame…

ldv.df <- as.data.frame(my.ds) %>%
          dplyr::select(Proactiveness, SGR, PatentActivity) %>%
          filter(SGR > -50) %>% 
          na.omit()

Recall that our LDV model is just a logit transformation of the general linear model…

\(\ln\left[\frac{p}{1-p}\right]=\alpha+\beta_{1}x+\varepsilon\)

The benefit of a logit model is that we can predict the probability of our y = 1 condition occurring. The kicker is that the probability of y = 1 changes as a function of the level of x.
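Concretely, ignoring the error term, we recover that probability by inverting the logit (this is what R's plogis() function computes):

\(p=\frac{e^{\alpha+\beta_{1}x}}{1+e^{\alpha+\beta_{1}x}}=\frac{1}{1+e^{-(\alpha+\beta_{1}x)}}\)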

In an LDV interaction model, we're suggesting that how the probability of y = 1 changes as x changes itself depends on the level of our moderator, m.

Make sense?

\(\ln\left[\frac{p}{1-p}\right]=\alpha+\beta_{1}x+\beta_{2}m+\beta_{3}xm+\varepsilon\)

Don’t worry, it’s hard for me to get my mind around it also.

Before we start running models, let's go ahead and mean center our predictors: Sales Growth Rate and Proactiveness.

ldv.df$SGR.center <- (ldv.df$SGR - mean(ldv.df$SGR))
ldv.df$Pro.center <- (ldv.df$Proactiveness - mean(ldv.df$Proactiveness))

Quick quiz…how would we interpret the coefficient estimates for the lower-order terms using centered variables in an LDV interaction model?

Our dependent variable will be patent activity, which equals one if the firm filed for a patent in the previous year.

interaction.model <- glm(PatentActivity ~ SGR.center*Pro.center,
                         data = ldv.df, 
                         family = "binomial")

So it looks like there is a significant interaction between Proactiveness and Sales Growth Rate. Cool! Let's publish it! Oh wait, never mind, I p-hacked the crap out of the data to find it!

summary(interaction.model)
## 
## Call:
## glm(formula = PatentActivity ~ SGR.center * Pro.center, family = "binomial", 
##     data = ldv.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3014   0.2537   0.4964   0.6583   1.1427  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            1.79748    0.30803   5.835 5.37e-09 ***
## SGR.center             0.02440    0.02542   0.960   0.3372    
## Pro.center             0.57919    0.25653   2.258   0.0240 *  
## SGR.center:Pro.center  0.04387    0.02025   2.166   0.0303 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 98.399  on 110  degrees of freedom
## Residual deviance: 87.064  on 107  degrees of freedom
## AIC: 95.064
## 
## Number of Fisher Scoring iterations: 5
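Before we dig into marginal effects, one quick way to ground these coefficients is to compute predicted probabilities at a few combinations of the centered predictors (the grid values below are just illustrative; +/- 1.23 is one standard deviation of centered Proactiveness):

# Predicted probability of patent filing at a few combinations of the centered
# predictors (grid values chosen for illustration)
new.data <- expand.grid(SGR.center = c(-10, 0, 10),
                        Pro.center = c(-1.23, 0, 1.23))
new.data$prob <- predict(interaction.model, newdata = new.data, type = "response")
new.data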

With LDV interaction models, we're in the camp where the log-odds coefficients are interpretable, but they are so narrowly defined as to have little practical utility.

This is where understanding marginal effects becomes so important. We’re going to make use of the mfx and interplot packages.

First up, let's get the average marginal effects…

library(mfx)
logitmfx(PatentActivity ~ SGR.center*Pro.center, data = ldv.df) 
## Call:
## logitmfx(formula = PatentActivity ~ SGR.center * Pro.center, 
##     data = ldv.df)
## 
## Marginal Effects:
##                           dF/dx Std. Err.      z   P>|z|  
## SGR.center            0.0026592 0.0025985 1.0234 0.30613  
## Pro.center            0.0631295 0.0255174 2.4740 0.01336 *
## SGR.center:Pro.center 0.0047822 0.0020040 2.3864 0.01702 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So, on average, a one-unit change in the cross-product term corresponds to roughly a .005 change in the probability that the firm filed a patent, holding the other terms constant.

Make sense?

Just kidding.

Ok, let's make this into a picture and try to make better sense of things…

library(interplot)
interplot(m = interaction.model, var1 = "SGR.center", var2 = "Pro.center") +
          geom_hline(yintercept = 0, linetype = "dashed") + 
          theme_minimal() + 
          xlab("Proactiveness") + 
          ylab("Estimated Odds of SGR\nOn Whether the Firm Filed a Patent") +
          ggtitle("Marginal Effect of Sales Growth Rate\nOn Odds of Filing Patent by Proactiveness")

The y-axis here is on the log-odds scale, not the probability scale, so the interpretation is slightly different from the average marginal effect.

Still, we can see from the visualization that the interaction effect is pretty small. The effect of Sales Growth Rate on Patent Activity is only significant among firms that are very proactive (roughly > 2 on the centered Proactiveness scale).

One technique to help put that effect size in perspective is to calculate the percentage of firms in the sample that fall within that range of the proactiveness scale.

proactive.df <- ldv.df %>%
                filter(Pro.center >= 2)
my.obs <- nrow(ldv.df)
my.pro.count <- nrow(proactive.df)
cat("Percentage of '>2' Proactive Firms in Sample:", round((my.pro.count/my.obs)*100,2))
## Percentage of '>2' Proactive Firms in Sample: 2.7

So it looks like it’s just a few firms in our sample where our interaction effect applies—not exactly a robust finding.

Not surprisingly, p-hacked interaction effects often have very narrow regions of significance, but unless you go the extra step in the paper, readers would never know.

Ok, marginal effects are pretty valuable in LDV interaction models, but what about predicted probabilities?

Just like with the linear model, we’re going to dichotomize the moderator to +/- 1 standard deviation. In our data, Proactiveness has a standard deviation of 1.23.
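A quick check on that number (a sketch; the value reflects the filtered sample):

# Standard deviation of centered Proactiveness, used for the +/- 1 SD breaks below
sd(ldv.df$Pro.center)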

To make our plot, we’re going to make use of visreg again. You can easily modify the code from our LDV seminar to do this in ggplot2 as well.

library(visreg)
visreg(interaction.model, "SGR.center", by="Pro.center", 
       overlay=TRUE, partial=FALSE, breaks=c(-1.23, 1.23),
       scale="response", xlab="Sales Growth Rate", 
       ylab="Probability of Patent Filing", rug=2)

Quick quiz…how do we interpret this plot and reconcile this with our marginal effect visualization?

Let's have some discussion about interaction effects…

  • From the perspective of a reviewer.
  • From the perspective of a researcher.

Wrap-up.

NO LAB THIS WEEK

Seminar 27 Feb – 9:00AM Curvilinear relationships // Endogeneity in moderation models

Lab 27 Feb – 1:00PM Moderation paper critique