Special topics in panel data

ENT5587B - Research Design & Theory Testing II

Brian S. Anderson, Ph.D.
Assistant Professor
Department of Global Entrepreneurship & Innovation
andersonbri@umkc.edu


© 2017 Brian S. Anderson

  • Paper revision and reviewer responses due today!
  • Data collection discussion
  • Hybrid method
  • Interaction effects
  • LDV models
  • Lab 27 April – Panel data assessment
  • Seminar 1 May – Publishing strategy
  • Lab change – 3 May, 1:00PM

Another training time out.

What’s still bugging you about panel models?

\(y_{it}=\alpha+\beta{x_{it}}+\mu_{i}+\epsilon_{it}\)

Just to review…

  • \(y\) = Value of the \(t^{th}\) observation of \(y\) for the \(i^{th}\) entity
  • \(\alpha\) = The value of \(y\) when \(x\) equals zero across all \(i, t's\)
  • \(\beta\) = The expected change in \(y\) for an average \(i\) across time \(t\). Note that this interpretation gets more complicated, quickly.
  • \(x\) = Value of the \(t^{th}\) observation of \(x\) for the \(i^{th}\) entity
  • \(\mu\) = The portion of the disturbance term unique to \(i\) and that is constant over time \(t\).
  • \(\epsilon\) = The variance in the \(t^{th}\) observation of \(y\) for the \(i^{th}\) entity that is not explained by the variance in the \(t^{th}\) observation of \(x\) for the \(i^{th}\) entity or \(\mu_{i}\).

Recall that we can decompose \(x_{it}\) into the following:

\(x_{it}=\gamma_i+\tau_{it}\)

For each \(x_{it}\), there is going to be a between-component, \(\gamma\), that never changes over time for each \(i\) in the sample (the firm, for example). But there is also going to be a within-component, \(\tau\), that can change for each \(i\) over time (the firm’s sales, for example).
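To make the decomposition concrete, here is a minimal dplyr sketch using a made-up firm-sales panel (the data frame panel.df and its values are purely illustrative); we will do exactly this kind of group-mean centering with our simulated data later on.

library(dplyr)

# A toy long-format panel (hypothetical numbers): one row per firm-year
panel.df <- data.frame(firm  = c(1, 1, 1, 2, 2, 2),
                       year  = c(1, 2, 3, 1, 2, 3),
                       sales = c(10, 12, 14, 30, 29, 31))

# gamma_i (between): each firm's mean of sales across its years
# tau_it (within):   each firm-year's deviation from that firm's mean
panel.df <- panel.df %>%
  group_by(firm) %>%
  mutate(sales.between = mean(sales),
         sales.within  = sales - sales.between) %>%
  ungroup()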

\(x_{it}=\gamma_i+\tau_{it}\)

In OLS (or the pooled model), we assume that \(\gamma_i = 0\), or that it is otherwise constant for every \(i\) in the sample and, hence, meaningless for analysis purposes.

If \(\gamma_i\) ≠ 0, or if \(\gamma_i\) varies among the \(i's\) in the sample, then the OLS (or pooled) model will produce biased estimates of \(\beta\).

\(x_{it}=\gamma_i+\tau_{it}\)

In a panel model, we are assuming that the higher-order entity effect (\(\gamma_i\)) has some meaningful influence on the observation of \(x_{it}\).

The problem is that we don’t directly observe this effect. So any unobserved factors of \(i\) that do correlate with \(x_{it}\) will show up in the disturbance term \(\epsilon_{it}\) of our panel equation.

So in our panel equation:

\(y_{it}=\alpha+\beta{x_{it}}+\mu_{i}+\epsilon_{it}\)

We need to take some steps to address the \(\mu_{i}\) effect. We can assume that \(\mu_{i}\) has a mean of zero, constant variance, and doesn’t meaningfully correlate with \(x_{it}\), and estimate a random effect model.

Or we can assume that \(\mu_{i}\) does not have constant variance and does meaningfully correlate with \(x_{it}\), and then estimate a fixed effect model.

Recall that the choice of random or fixed effect has an important implication for theory construction and testing.

With the random effect model, the estimate of \(\beta\) will contain both between-entity and within-entity variance. In the fixed effect model, \(\beta\) contains only within-entity variance.

As we mentioned last week, the fixed effect model is a blunt instrument. Sometimes we really want between- and within-effects, but may not have that luxury because of unobserved heterogeneity at the \(i\) level.

Unless we use the hybrid method, which is what we’re going to demonstrate.

First up, let's simulate some multilevel data, and let me walk through the code. We're going to simulate startups nested within mentors, so in this case, our \(t's\) refer to startups, rather than repeated observations over time. Remember, the math is the same!

library(tidyverse)
set.seed(08022003)
# 40 mentors, each paired with between 4 and 8 startups
mentors.df <- data_frame(MentorID = 1:40, 
                         MinStartup = 3,
                         MaxStartup = sample(6:10, 40, replace = TRUE))
# Expand to one row per mentor-startup pair, then renumber startups from 1
mixed.df <- mentors.df %>%
  group_by(MentorID) %>%
  expand(MinStartup, MaxStartup, 
         StartupID = full_seq(MinStartup:MaxStartup, 1)) %>%
  select(MentorID, StartupID) %>%
  transmute(StartupID = StartupID - 2)

Let's take a look…

## Source: local data frame [10 x 2]
## Groups: MentorID [2]
## 
##    MentorID StartupID
##       <int>     <dbl>
## 1         1         1
## 2         1         2
## 3         1         3
## 4         1         4
## 5         2         1
## 6         2         2
## 7         2         3
## 8         2         4
## 9         2         5
## 10        2         6

Now let's add some variables. We're going to start with the two disturbance terms in our panel equation, \(\mu_{i}\) and \(\epsilon_{it}\). Remember \(\mu_{i}\) is unique to each \(i\)—our mentors in this case. The \(\epsilon_{it}\) is idiosyncratic to the particular observation index. The \(\mu_{i0}\) term allows us to specify a random slopes model (I'll explain later).

set.seed(08022003)
mixed.df <- mixed.df %>%
  group_by(MentorID) %>%
  # u_i and u_i0 are drawn once per mentor, so they are constant within a mentor
  mutate(u_i = rnorm(length(MentorID), 0, 5)[1]) %>%
  mutate(u_i0 = rnorm(length(MentorID), 0, 3)[1]) %>%
  # e_it is drawn fresh for every mentor-startup observation
  mutate(e_it = rnorm(n(), 0, 9))

And another look…

## Source: local data frame [10 x 5]
## Groups: MentorID [2]
## 
##    MentorID StartupID       u_i      u_i0        e_it
##       <int>     <dbl>     <dbl>     <dbl>       <dbl>
## 1         1         1 -4.586373  2.046178   1.3033081
## 2         1         2 -4.586373  2.046178  -6.7447618
## 3         1         3 -4.586373  2.046178   7.3491288
## 4         1         4 -4.586373  2.046178  -2.7354462
## 5         2         1  1.608891 -3.333907 -11.9360533
## 6         2         2  1.608891 -3.333907   0.6031354
## 7         2         3  1.608891 -3.333907  -8.0624269
## 8         2         4  1.608891 -3.333907  -7.5428354
## 9         2         5  1.608891 -3.333907  -4.7178909
## 10        2         6  1.608891 -3.333907   6.6763995

We’re getting closer!

Now let's add a couple of predictor variables. HrsMentoring will be the total number of hours each startup received from the mentor. PriorStartup will be a dichotomous variable with 1 indicating that the startup team has done a prior startup, which I'm just using as a covariate.

set.seed(08022003)
mixed.df <- mixed.df %>%
  mutate(HrsMentoring = sample(5:15, n(), replace = TRUE), 
         PriorStartup = rbinom(n(), 1, .5))

And another look…

## Source: local data frame [10 x 7]
## Groups: MentorID [2]
## 
##    MentorID StartupID       u_i      u_i0        e_it HrsMentoring
##       <int>     <dbl>     <dbl>     <dbl>       <dbl>        <int>
## 1         1         1 -4.586373  2.046178   1.3033081            6
## 2         1         2 -4.586373  2.046178  -6.7447618           10
## 3         1         3 -4.586373  2.046178   7.3491288           11
## 4         1         4 -4.586373  2.046178  -2.7354462            7
## 5         2         1  1.608891 -3.333907 -11.9360533           10
## 6         2         2  1.608891 -3.333907   0.6031354           10
## 7         2         3  1.608891 -3.333907  -8.0624269           11
## 8         2         4  1.608891 -3.333907  -7.5428354            8
## 9         2         5  1.608891 -3.333907  -4.7178909           11
## 10        2         6  1.608891 -3.333907   6.6763995           15
## # ... with 1 more variables: PriorStartup <int>

Now let's generate some DVs. We're going to create two outcome variables: a continuous y, EntSelfEfficacy, and a dichotomous y, FirstSale, which will be whether or not the startup team actually made a first sale of a product.

We're also going to force a random slopes model by creating an interaction between StartupID and \(\mu_{i0}\).

mixed.df <- mixed.df %>%
  # Continuous DV: a linear function of the predictors, the mentor effects,
  # and the idiosyncratic disturbance
  mutate(EntSelfEfficacy = (3 * HrsMentoring) + (5 * PriorStartup) 
         + (StartupID * u_i0) + u_i + e_it) %>%
  # Dichotomous DV: build a linear predictor, push it through the logistic
  # function to get a probability, then draw FirstSale from a binomial
  mutate(linpred = (.65 * HrsMentoring) + (.1 * PriorStartup) 
         + (StartupID * u_i0) + u_i + e_it) %>%
  mutate(pr = exp(linpred)/(1 + exp(linpred))) %>%
  mutate(FirstSale = rbinom(n(), 1, pr))

Whew!

Ok, now let's get to our model. We're going to walk through the model specification using plm like before. Let's estimate our models.

library(plm)
plm.df <- pdata.frame(mixed.df, index=c("MentorID","StartupID"))
pooled.model <- plm(EntSelfEfficacy ~ HrsMentoring + PriorStartup, 
                    data = plm.df, model = "pooling")
random.model <- plm(EntSelfEfficacy ~ HrsMentoring + PriorStartup, 
                    data = plm.df, model = "random")
fixed.model <- plm(EntSelfEfficacy ~ HrsMentoring + PriorStartup, 
                   data = plm.df, model = "within")

First up is our Breusch-Pagan Lagrange Multiplier test, under the null hypothesis that the variance of \(\mu_{i}\) is zero (no significant individual effects, so the pooled model is adequate).

plmtest(pooled.model, type=c("bp"))
## 
##  Lagrange Multiplier Test - (Breusch-Pagan) for unbalanced panels
## 
## data:  EntSelfEfficacy ~ HrsMentoring + PriorStartup
## chisq = 185, df = 1, p-value < 2.2e-16
## alternative hypothesis: significant effects

So how do we interpret this result?

Next up is our Hausman test, to determine whether the fixed effect and random effect estimates of \(\beta\) differ systematically, which would suggest that the unobserved \(i\) effect correlates with our predictors.

phtest(fixed.model, random.model)
## 
##  Hausman Test
## 
## data:  EntSelfEfficacy ~ HrsMentoring + PriorStartup
## chisq = 70.35, df = 2, p-value = 5.293e-16
## alternative hypothesis: one model is inconsistent

How do we interpret this result?

Ok, so there is an empirical reason to retain a fixed effect model. But what would be the theoretical reason to use the fixed effect model?

The challenge is that both within and between questions are often theoretically useful.

Between hypothesis:

Across mentors, startups with greater mentoring have higher self-efficacy.

Within hypothesis:

For a given mentor, increasing startup mentoring results in higher self-efficacy for the startup.

In a random effect model, the estimate of \(\beta\) includes both within and between variance, so interpreting \(\beta\) is difficult.

In a fixed effect model, you eliminate (control for) the between effect, so all you get is the within-effect.

The solution to having your cake and eating it too? The hybrid.

In the hybrid method, we’re going to estimate a random effects model but include a fixed effect for the variable we are interested in testing.

\(y_{it}=\alpha+\beta_1{\bar{x}}_{i}+\beta_2{(x_{it}-{\bar{x}}_{i})}+\mu_{i}+\epsilon_{it}\)

We demean \(x\), which removes the fixed portion of \(x_{it}\). \(\beta_2\) then is the estimate of the within effect of \(x\) on \(y\) for a given \(i\).

Then we include \({\bar{x}}_{i}\) as another predictor. This term is constant for each \(i\) over observations \(t\). \(\beta_1\) then is the estimate of the between effect of \(x\) on \(y\) across \(i's\).

Let's set up our data…

mixed.df <- mixed.df %>%
  group_by(MentorID) %>%
  # Between component: each mentor's mean hours of mentoring; within
  # component: each startup's deviation from its mentor's mean
  mutate(HrsMent.between = mean(HrsMentoring),
         HrsMent.within = (HrsMentoring - HrsMent.between))

Now let's estimate our model. We'll use a slightly different syntax for plm, so that we don't have to create a new plm dataframe.

hybrid.model <- plm(EntSelfEfficacy ~ HrsMent.between + 
                      HrsMent.within + PriorStartup, 
                    data = mixed.df, 
                    index=c("MentorID", "StartupID"), model="random")
summary(hybrid.model)

## Oneway (individual) effect Random Effect Model 
##    (Swamy-Arora's transformation)
## 
## Call:
## plm(formula = EntSelfEfficacy ~ HrsMent.between + HrsMent.within + 
##     PriorStartup, data = mixed.df, model = "random", index = c("MentorID", 
##     "StartupID"))
## 
## Unbalanced Panel: n=40, T=4-8, N=240
## 
## Effects:
##                  var std.dev share
## idiosyncratic 114.80   10.71 0.508
## individual    111.21   10.55 0.492
## theta  : 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5471  0.6169  0.6169  0.6201  0.6415  0.6619 
## 
## Residuals :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -34.300  -6.690   0.303  -0.094   6.620  34.300 
## 
## Coefficients :
##                  Estimate Std. Error t-value  Pr(>|t|)    
## (Intercept)     -38.70362   16.83674 -2.2988   0.02239 *  
## HrsMent.between   6.57289    1.65634  3.9683 9.607e-05 ***
## HrsMent.within    3.01510    0.25404 11.8684 < 2.2e-16 ***
## PriorStartup      7.47836    1.47648  5.0650 8.235e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    48887
## Residual Sum of Squares: 27097
## R-Squared:      0.44583
## Adj. R-Squared: 0.43879
## F-statistic: 63.2589 on 3 and 236 DF, p-value: < 2.22e-16

Ok, let's get some coefficients…

between.effect <- round(summary(hybrid.model)$coefficients[2,1],3)
within.effect <- round(summary(hybrid.model)$coefficients[3,1],3)
cat("Between effect of hrs. mentoring:",between.effect, "\n")
## Between effect of hrs. mentoring: 6.573
cat("Within effect of hrs. mentoring:",within.effect, "\n")
## Within effect of hrs. mentoring: 3.015

With our hybrid approach, we should see the same within-effect that we observed with our fixed effect specification earlier. Why?

## Within effect from hybrid model: 3.015
## Within effect from fixed effect model: 3.015
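You can produce that comparison yourself by pulling the two coefficients directly; a quick sketch, assuming the fixed.model and hybrid.model objects estimated above:

# Within effect from the fixed effect (within) model vs. the hybrid model
round(coef(fixed.model)["HrsMentoring"], 3)
round(coef(hybrid.model)["HrsMent.within"], 3)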

## Between effect of hrs. mentoring: 6.573
## Within effect of hrs. mentoring: 3.015

So interpreting these gets a bit tricky. Across mentors and on average, each additional hour of mentoring seems to increase entrepreneurial self-efficacy by 6.573. Further, given the substantial individual variance in \(\mu_{i}\), there seem to be substantial differences among the mentors.

However…

It is still possible that the between component, \({\bar{x}}_{i}\), correlates with \(\mu_{i}\), rendering \(\beta_{between}\) inconsistent.

Why?

But…

We do know that at least with respect to the “mentor effect,” our estimate of \(\beta_{within}\) will be consistent.

Why?

Other omitted variables at the within-level are another matter, and yes, you can integrate a 2SLS approach into the hybrid model to deal with omitted variables at both the between and within levels.

Still, I think the advantage of the hybrid model is that you retain the best feature of a fixed effect model, while still being able to answer interesting theoretical questions.

Next up…

Within-effect moderation.

Question…

Why didn’t I say between-effect moderation, or cross-level moderation, given what we know about the data that we are using today?

With the hybrid approach, it is possible to estimate cross-level interactions and still retain a partial fixed effect specification. That said, I’d avoid it.

Why? Well, you tell me how to interpret it, and I'm all in. Just because you can multiply two variables together doesn't mean that you should! :)

So, let's focus on our within (fixed) effect model. For our interaction effect, we'll need another predictor that is panel-variant (that is, one that varies within each \(i\)).

Fortunately, we happen to have one of these variables—PriorStartup.

So let's set up our model.

interaction.model <- plm(EntSelfEfficacy ~ PriorStartup * HrsMentoring, 
                         data = plm.df, model = "within")
summary(interaction.model)

## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = EntSelfEfficacy ~ PriorStartup * HrsMentoring, 
##     data = plm.df, model = "within")
## 
## Unbalanced Panel: n=40, T=4-8, N=240
## 
## Residuals :
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -24.9000  -6.4600   0.0873   6.4300  38.5000 
## 
## Coefficients :
##                           Estimate Std. Error t-value  Pr(>|t|)    
## PriorStartup               5.63158    5.47566  1.0285    0.3050    
## HrsMentoring               2.91483    0.34636  8.4155 7.931e-15 ***
## PriorStartup:HrsMentoring  0.22013    0.51833  0.4247    0.6715    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    42022
## Residual Sum of Squares: 22596
## R-Squared:      0.46229
## Adj. R-Squared: 0.34765
## F-statistic: 56.4556 on 3 and 197 DF, p-value: < 2.22e-16

So no interaction effect (because I didn’t make one earlier!). But that’s not the point.

The point is to reinforce that when a fixed effect specification is necessary or consistent with your theoretical development, interaction effects must be at the within-level.

What does that mean for testing, say, industry effects?

Industry effects are a pretty big deal in strategy research. Unfortunately, most of this research (particularly the early work) didn’t use a panel estimator, or used a random effect model.

Given your experience now with Compustat data, how confident are you in our field’s understanding of the contextual role of environmental exigencies?

Now, the nice thing about specifying interaction effects in fixed effect models is the ease of interpretation.

We could have specified our within-effect hypothesis this way…

For a given mentor, increasing startup mentoring results in higher self-efficacy for the startup among startup teams with prior startup experience.

While our results are not statistically significant, we would interpret a significant effect as the average within-group change in the effect (slope) of mentoring hours on self-efficacy.
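Had the interaction been significant, one quick way to see the two within-mentor slopes is to add the coefficients by hand; a sketch, assuming the interaction.model object from above:

# Simple slopes of mentoring hours at each level of PriorStartup
b <- coef(interaction.model)
slope.no.prior <- b["HrsMentoring"]
slope.prior    <- b["HrsMentoring"] + b["PriorStartup:HrsMentoring"]
round(c(no.prior = unname(slope.no.prior), prior = unname(slope.prior)), 3)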

We can also visualize our model fairly easily using the sjPlot package, although you can use most of the other moderation visualization techniques we discussed earlier.

library(sjPlot)
sjp.lm(interaction.model, show.ci = TRUE, 
       title = "Average Within-Mentor Effect of Hrs of Mentoring
on Ent Self Efficacy by Prior Startup Experience")

Quiz…why does this plot look the way it does?

By the way, it’s helpful to see another mixed effect modeling approach. The lme4 package fits hierarchical models of the type described in the Aguinis et al (2013) paper.

This code replicates the random effect model that we specified earlier using plm:

\(y_{it}=\alpha+\beta{x_{it}}+\mu_{i}+\epsilon_{it}\)
library(lme4)
mixed.fit <- lmer(EntSelfEfficacy ~ HrsMentoring + PriorStartup + 
                    (1 | MentorID), data = mixed.df)
summary(mixed.fit)

## Linear mixed model fit by REML ['lmerMod']
## Formula: EntSelfEfficacy ~ HrsMentoring + PriorStartup + (1 | MentorID)
##    Data: mixed.df
## 
## REML criterion at convergence: 1890.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.6216 -0.6422  0.0103  0.5940  3.2029 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  MentorID (Intercept) 127.5    11.29   
##  Residual             114.1    10.68   
## Number of obs: 240, groups:  MentorID, 40
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept)   -3.5154     3.2462  -1.083
## HrsMentoring   3.0873     0.2506  12.319
## PriorStartup   7.5543     1.4735   5.127
## 
## Correlation of Fixed Effects:
##             (Intr) HrsMnt
## HrsMentorng -0.779       
## PriorStartp -0.204 -0.005

These estimates are slightly different from our plm random effects model because of the fitting algorithm used. Still, they are very close to each other.
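If you want to see them side by side, here is a quick sketch, assuming the random.model and mixed.fit objects estimated above:

# Random effect (plm) vs. mixed effect (lme4) estimates of the same model
round(cbind(plm = coef(random.model), lme4 = fixef(mixed.fit)), 3)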

Anything you can do with mixed effect modeling (lme4) you can do with panel modeling (plm), but please take care to note that the terminology and notation can differ substantially—the fixed effect in the model output is NOT the same thing as a fixed effect in our econometric model!

That the models are equivalent comes in handy when it comes to publishing strategy. If you are writing an econometrics-focused paper, then panel modeling techniques are most appropriate. If you are writing a psych/micro paper, or do not have repeated measures over time as your \(t's\), then mixed/multilevel modeling is your best bet.

Yes, they are equivalent, but the language used by the journal/reviewers will be different, so it’s better to give them terminology they are used to. Always know your audience!

BTW, we can estimate our hybrid model in the mixed effect estimator…

mixed.hybrid.fit <- lmer(EntSelfEfficacy ~ HrsMent.between + 
                           HrsMent.within + PriorStartup + (1 | MentorID),
                         data = mixed.df)
summary(mixed.hybrid.fit)

Do these numbers look familiar?

## Panel model using plm (econometric)
##   Between effect of hrs. mentoring: 6.573
##   Within effect of hrs. mentoring: 3.015
## -----
## Panel model using lme4 (multilevel)
##   Between effect of hrs. mentoring: 6.569
##   Within effect of hrs. mentoring: 3.015

Next up…

LDV models in panel/multilevel data.

Truth be told, while I love LDV models, estimating these things well with panel data is a royal PITA. LDV models in a panel/multilevel framework have a special name: Generalized Linear Mixed Models.

The same assumptions about mixed effect models apply to GLMMs—there can be no correlation between the predictor, \(x\), and \(\mu_{i}\). Straight up fixed effect modeling with LDV outcomes is somewhat challenging, so I like using the hybrid method in a mixed effect (random) model.

Let's set it up. We're going to use the glmer function in the lme4 package. We'll keep our same predictors, but this time our DV will be FirstSale, which takes on a value of 1 if the startup actually sold a product to a real customer.

hybrid.ldv.fit <- glmer(FirstSale ~ HrsMent.between + 
                          HrsMent.within + PriorStartup + (1 | MentorID),
                        data = mixed.df, family = binomial(link = "logit"))
summary(hybrid.ldv.fit)

Let's walk through the output…

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: FirstSale ~ HrsMent.between + HrsMent.within + PriorStartup +  
##     (1 | MentorID)
##    Data: mixed.df
## 
##      AIC      BIC   logLik deviance df.resid 
##    260.8    278.2   -125.4    250.8      235 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.9541 -0.4435  0.2340  0.4958  2.4172 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  MentorID (Intercept) 3.612    1.901   
## Number of obs: 240, groups:  MentorID, 40
## 
## Fixed effects:
##                 Estimate Std. Error z value Pr(>|z|)  
## (Intercept)     -7.84846    3.56902  -2.199   0.0279 *
## HrsMent.between  0.85119    0.35413   2.404   0.0162 *
## HrsMent.within   0.14304    0.06717   2.130   0.0332 *
## PriorStartup     0.44974    0.38139   1.179   0.2383  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) HrsMnt.b HrsMnt.w
## HrsMnt.btwn -0.993                  
## HrsMnt.wthn -0.055  0.063           
## PriorStartp -0.073  0.029    0.033

WARNING

Inference in GLMM models is, well, rather tricky because of the distributional assumptions behind the standard errors. There are a couple of options, including bootstrapping and Monte Carlo simulation, but they're way beyond the scope of this class.

END WARNING
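If you do need interval estimates, one option is lme4's parametric bootstrap via confint(); a sketch only (the argument values are illustrative, and it can take a while to run):

# Parametric bootstrap confidence intervals for the GLMM fixed effects
confint(hybrid.ldv.fit, method = "boot", nsim = 500, parm = "beta_")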

Just like in any logit model, the “coefficients” reported are log odds ratios:

## Log odds of between effect of hrs. mentoring: 0.851
## Log odds of within effect of hrs. mentoring: 0.143

Those are handy, but not exactly easy to use.
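One small step toward interpretability is to exponentiate the fixed effects into odds ratios; a sketch, assuming the hybrid.ldv.fit object above:

# Odds ratios: a one-unit change in a predictor multiplies the odds of a
# first sale by this factor, holding the other terms constant
round(exp(fixef(hybrid.ldv.fit)), 3)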

What we're really after is predicted probabilities, just as before. Interpreting these is tricky because predicted probabilities change as a function of the value of \(x\). In a GLMM model, it's even more complicated because \(x_{it}\) has both a between- and a within-component.

Our hybrid model split those values out, which is handy. Still, the between effect is pretty challenging to interpret. Why?
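Before the plot, here is a rough by-hand look at the within effect on the probability scale; a sketch, assuming the hybrid.ldv.fit and mixed.df objects from above, with the random intercept set to zero (a "typical" mentor) and PriorStartup set to zero:

b <- fixef(hybrid.ldv.fit)
# Linear predictor at the sample-average hours of mentoring
# (HrsMent.within = 0 when a startup is at its mentor's own average)
xb <- unname(b["(Intercept)"] + b["HrsMent.between"] * mean(mixed.df$HrsMentoring))
# Predicted probability of a first sale at average hours vs. one hour above
plogis(c(avg.hours = xb, one.hr.above = xb + unname(b["HrsMent.within"])))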

The easiest way to understand predicted probabilities is with a picture, so let's use sjPlot again and just focus on the within effect.

library(ggplot2)
sjp.setTheme(base = theme_minimal())
sjp.glmer(hybrid.ldv.fit, type = "pred", show.ci = TRUE, vars = "HrsMent.within",
          axis.title = c("Hours With Mentor (Group Mean Centered)", 
                         "Average Predicted Probability of First Sale"), 
          title = "Mentorship Success", show.scatter = FALSE)

So how do we interpret this?

Whew!

Ok, now what’s still bugging you about panel models?

Wrap-up.

Lab 27 April – Panel data assessment

Seminar 1 May – Publishing strategy