Interpreting logistic regression

Back to Musings >> Resources

In Part I, I outlined interpreting logistic regression models with two dichotomous predictors. In this post, we’ll tackle logistic regression with two continuous predictors.

Generating some data

We’ll start by generating the data with two dichotomous predictors, and we’ll call this Model 2.

library(tidyverse)  # Love the tidyverse!

# Set the random seed generator to ensure reproducible results
set.seed(080203)

# Create a new dataframe with 1,000 observations and our two normally...
# ...distributed predictors.
model2.df <- data_frame(x1 = rnorm(1000),
                        x2 = rnorm(1000),
                        e_i = rnorm(1000))

# Next we'll create our predicted value of y and include a random...
# ...disturbance term
model2.df <- model2.df %>%
  mutate(y = (.75*x1 + 1.5*x2 + e_i)) %>%
  mutate(pr_y = exp(y)/(1 + exp(y))) %>%
  mutate(y = rbinom(1000, 1, pr_y)) %>%
  select(-pr_y, -e_i)

# Now lets take a look
head(model2.df, 10)

## # A tibble: 10 x 3
##            x1         x2     y
##         <dbl>      <dbl> <int>
##  1  0.9054952 -0.8219114     0
##  2  1.5866302  0.2418754     1
##  3  0.1654336  0.3295136     1
##  4 -0.5190593  2.2521170     1
##  5 -1.0471148  1.5826300     1
##  6 -0.1464832 -0.2107056     0
##  7 -1.3642726 -0.4808994     0
##  8  0.3521542 -0.7726017     1
##  9 -0.5077717 -0.5513062     0
## 10 -1.7676480  0.3960072     0

Estimate our model

Just to review, a logistic regression model is simply a logistic transformation of the linear model, so we can write our model in a familiar form…

\(ln\left[\frac{p}{1-p}\right]=\alpha+{\beta}{x_1}+{\beta}{x_2}+\epsilon\)

Our outcome variable is the probability of \(y\) occurring (y = 1), which we write as \(\left[\frac{p}{1-p}\right]\), the probability of \(y\) occurring divided by the probability of \(y\) not occurring.

We’ll use the glm function to estimate our model…

model2 <- glm(y ~ x1 + x2, data = model2.df, family = "binomial")
summary(model2)

## 
## Call:
## glm(formula = y ~ x1 + x2, family = "binomial", data = model2.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3476  -0.8569   0.2387   0.8389   2.9860  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.02863    0.07640  -0.375    0.708    
## x1           0.82535    0.08809   9.369   <2e-16 ***
## x2           1.41638    0.10210  13.873   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1386.1  on 999  degrees of freedom
## Residual deviance: 1030.2  on 997  degrees of freedom
## AIC: 1036.2
## 
## Number of Fisher Scoring iterations: 4

Making sense of the model

Once again, here is where things get complicated.

Lets take a look at the estimated coefficients…

log.odds <- summary(model2)$coefficients[1:3,1]
log.odds

## (Intercept)          x1          x2 
## -0.02862927  0.82535153  1.41638390

Just as we Model 1 from the last post, the \(y\) variable is on the logit scale. That means that the coefficient estimates are the log odds of \(y\) occurring given a unit change in the respective predictor. The intercept in this case is the log odds of \(y\) occurring when both \(x1\) and \(x2\) are zero. Because both our \(x1\) and \(x2\) are normally distributed with a mean of (close to) zero and standard deviation of (close to) one, the intercept is effectively the log odds of \(y\) occurring at roughly the mean value of \(x1\) and \(x2\).

Once again we can convert the log odds to the odds by taking the exponent:

exp.odds <- exp(log.odds)
exp.odds

## (Intercept)          x1          x2 
##   0.9717767   2.2826831   4.1221872

Importantly, the odds shown here are the “average” odds across the range of values of both predictors. That makes them only marginally useful. We can interpret, for example, the “average” odds of \(y\) occurring as 2.28:1, holding constant the effect of \(x2\) at its mean value, which is basically zero in this model. As we will see, the odds—and hence probability—of \(y\) occurring vary dramatically as a function of the value of the predictor.

Side note—with odds you are looking for statistically significant deviations from 1.0. Consider the intercept in the model. There is a 0.97:1 chance of \(y\) occurring when \(x1\) and \(x2\) equal zero (at the mean in the case of our data). The estimate is not statistically significant, which makes sense—1:1 odds is a 50% chance of \(y\) = 1, so it’s just even money, or a coin flip.

Marginal effects

The most important thing to remember when working with logit models is that they are inherently nonlinear. This means that we cannot express a change in the outcome variable as a function of a linear change in a vector of predictor variables. To put another way, as \(x\) increases/decreases in a nonlinear model, we do not expect a consistent change in the \(y\) variable. In practice, this means that the “average” log odds/odds have very little utility in a logit model. What we really want to see is how the odds itself changes as a \(x\) changes. This is where marginal effects come in to play.

The marginal effect in a logit model is the instantaneous expected change in the probability of \(y\) occurring for a given unit change in \(x\), assuming \(x\) is a continuous measure. You can think of the marginal effect intuitively as:

\(ME=\frac{\hat{Pr}\_{y}_{1}-\hat{Pr}\_{y}_{2}}{\Delta{x}}\)

Lets walk through the logic. We’ve got our first predicted probability (\(\hat{Pr}\_{y}_{1}\)), when \(x\) took on a particular value. Now we increment \(x\) by some value—could be large, could be small. We then subtract the two probabilities, and divide by the change in \(x\). This gives us the expected change in \(y\) for a given change in \(x\).

In a simple linear regression, \(y=\alpha+\beta{x}+\epsilon\), the estimate of \(\beta{x}\) is the marginal effect—the change in \(y\) is constant as \(x\) increments by some unit. But in a nonlinear model, the marginal effect varies between different changes in \(x\).

We’ll do this first with the predict function, and then do it by hand.

ave.model2.predict <- predict(model2, newdata = data.frame(x1 = 0, x2 = 0), 
                          type = "response")
ave.model2.predict

##         1 
## 0.4928432

As in Part I, we get this same “average” probability by calculating the predicted probability of \(y\) occurring at the estimated intercept. The intercept is the log odds of \(y\) occurring when \(x1\) and \(x2\) equal zero, which happens to be very close to their mean value in our model.

pr_y = 1 / (1 + (exp(-1 * (summary(model2)$coefficients[1,1]))))
pr_y

## [1] 0.4928432

Now lets increment \(x1\) by a very small amount, say .0001, but then hold \(x\) at its mean value, 0. Then we’ll construct a new predicted probability.

delta_x1 <- .0001
inc.model2.predict <- predict(model2, newdata = data.frame(x1 = delta_x1, x2 = 0), 
                          type = "response")
inc.model2.predict

##         1 
## 0.4928638

Now from our formula earlier, we’ll take the differences between the two predicted probabilities, and divide them by the change in \(x1\).

ame.hand <- (inc.model2.predict - ave.model2.predict)/delta_x1
ame.hand

##         1 
## 0.2062957

Because we used the mean value of \(x1\) in the preceding, and used a small change in \(x1\), the marginal effect we calculated by hand is going to be exceptionally close to the marginal effect calculated with calculus, using R’s margins package.

library(margins)
# We're setting `margins` to give us the average marginal effect...
# ...when x1 = 0 and x2 = 0; there mean values in our data 
ame.margins <- margins(model2, at = list(x1 = 0, x2 = 0))
ame.margins

##  at(x1) at(x2)     x1    x2
##       0      0 0.2063 0.354

As before though, the average marginal effect doesn’t help us much. Practically speaking, it would be the expected change in the probability of \(y\) occurring as we moved from the mean of \(x1\) and \(x2\) slightly away from the mean. To get a better handle on how a change in \(x1\) changes the probability of \(y\) occurring, we want to look at a range of changes in the value of \(x1\).

Because \(x1\) is continuous, there are, unfortunately, an infinite number of ways to increment \(x1\). This is where theory, and a good understanding the measure, come in to play. In our data, \(x1\) is normally distributed with a standard deviation close to 1.0. In that case, I’ll increment \(x1\) in tenth of a standard deviation changes (.1), starting from where \(x1\) begins to where \(x1\) ends.

As an aside, if \(x1\) was, say, a Likert-type measure from 1-5 or 1-7, an appropriate increment would be 1, which would be the change in the probability of \(y\) occurring as the respondent moved from 1 to 2 to 3 to 4 and so on. Pretty cool, huh!

We’re going to make use of the margins pacakge again for this—it makes life a lot easier. Also, because each marginal effect is itself a point estimate with a standard error, we can see whether the change in the value of \(\beta_{x1}\) is itself statistically significant. We can do that by calling the summary command for margins:

library(knitr)
# Setup a new vector of x1 values starting from the lowest ovbserved...
# value to the highest, incrementing along the way by .1
x1.inc <- seq(min(model2.df$x1), max(model2.df$x1), by = .1) 

# Now we store the results of the summary function for margins as...
# ...a new dataframe, which makes it easier to manipulate
xi.margins.df <- as.data.frame(summary(margins(model2, at = list(x1 = x1.inc, x2 = 0))))

# Now we're going to filter the dataframe to focus just on the x1 marginal effects
xi.margins.df <- xi.margins.df %>%
  filter(factor == "x1")

# Here's what it looks like!
kable(head(xi.margins.df, 10))

factor	x1	AME	SE	z	p	lower	upper
x1	-3.557843	0.0384812	0.0073825	5.212478	2e-07	0.0240117	0.0529506
x1	-3.457843	0.0414416	0.0075400	5.496251	0e+00	0.0266635	0.0562197
x1	-3.357843	0.0445993	0.0076729	5.812546	0e+00	0.0295606	0.0596379
x1	-3.257843	0.0479622	0.0077774	6.166843	0e+00	0.0327187	0.0632057
x1	-3.157843	0.0515379	0.0078495	6.565781	0e+00	0.0361532	0.0669225
x1	-3.057843	0.0553330	0.0078851	7.017402	0e+00	0.0398785	0.0707875
x1	-2.957843	0.0593533	0.0078806	7.531531	0e+00	0.0439075	0.0747991
x1	-2.857843	0.0636034	0.0078328	8.120135	0e+00	0.0482514	0.0789554
x1	-2.757843	0.0680863	0.0077392	8.797623	0e+00	0.0529178	0.0832548
x1	-2.657843	0.0728032	0.0075986	9.581162	0e+00	0.0579103	0.0876962

Yeah, so that’s good, but not exactly easy to make sense of. The best way to get a handle on marginal effects is to visualize them, along with a confidence interval to better understand the range in which the estimate of \(\beta_{x1}\) changes as a function of a change in \(x1\).

There are several packages out there that do this, but here’s one option using ggplot2 and our xi.margins.df dataframe we just created:

library(ggplot2)
library(ggthemes)
ame.plot <- ggplot(xi.margins.df, aes(y = AME, x = x1)) + 
  geom_line() +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = .2) +
  labs(x = "x1", 
       y = "Average Marginal Effect of x1")
ame.plot + theme_grey()

So how do we make sense of the plot? Well, the first thing to notice is that the marginal effect never crosses zero (on the y-axis, not the x1 axis). This tells us that there is always a statistically significant change in the expected probability of \(y\) occurring as \(x1\) itself changes.

So that’s important, but we also see that this change is substantially larger as \(x1\) approaches zero, which is its mean value. Here we see very dramatic changes in the probability of \(y\) occurring as \(x1\) changes. To put another way, when \(x1\) changes from, say, 2 to 2.2, each of those changes will produce about a 12-10% change in the probability of \(y\) occurring. But when \(x1\) changes from 0 to .2, each of those changes will produce more than a 20% change in the probability of \(y\). You get more bang for the buck so to speak in changes to the probability of \(y\) when \(x1\) is closer to its mean than you do when \(x1\) is closer to its tails.

When you compare this plot with the Average Marginal Effect at the mean value of \(x1\) that we calculated earlier, it’s easy to see that the “average” really doesn’t give us a good understanding of the underlying relationship between \(x1\) and \(y\).

Side note—again this is just the marginal effect of \(x1\) holding \(x2\) constant at its mean value, which in our case happens to be zero. To explore the marginal effect of \(x2\), simply do to \(x2\) what we did to \(x1\), and then hold \(x1\) constant at its mean.

Predicted probabilities

Another way to see how the probability of \(y\) changes as a function of the level of \(x1\) is by plotting the predicted probabilities, along with a confidence interval (always with a confidence interval!). Again there are built in functions—the margins package has one—but I like doing it by hand.

We’re going to start by calling the predict function, and get the standard errors at the same time. Like we did with the margins command, we’re going to ask our function to use the same range of \(x1\) that we used for the marginal effects—the range of \(x1\) in the data, incremented by .1.

# BTW, type = "response" ensures that we get predicted probabilities in a...
# normal scale, instead of the log probabilities!
x1.model2.predict <- predict(model2, newdata = data.frame(x1 = x1.inc, x2 = 0), 
                          type = "response", se = TRUE)

# Next we're going to store our results in a dataframe, to calculate...
# ...our confidence intervals.
x1.model2.predict <- as_data_frame(x1.model2.predict) %>%
  mutate(PredProb = fit,
         lower.ci = (PredProb - (1.96 * se.fit)),
         upper.ci = (PredProb + (1.96 * se.fit))) %>%
  select(-fit:-residual.scale)  # Just removing unnecessary variables

# Now we need to bring our values of x1 back into our predicted dataframe...
# ...so we know which probability goes to which x1!
x1.model2.predict$x1 = x1.inc

kable(head(x1.model2.predict, 10))

PredProb	lower.ci	upper.ci	x1
0.0490277	0.0192426	0.0788128	-3.557843
0.0530222	0.0217878	0.0842565	-3.457843
0.0573226	0.0246131	0.0900320	-3.357843
0.0619489	0.0277444	0.0961534	-3.257843
0.0669221	0.0312093	0.1026349	-3.157843
0.0722638	0.0350371	0.1094904	-3.057843
0.0779962	0.0392591	0.1167333	-2.957843
0.0841421	0.0439077	0.1243765	-2.857843
0.0907246	0.0490170	0.1324322	-2.757843
0.0977672	0.0546226	0.1409117	-2.657843

Tables are nice, but pictures are better. So lets plot our predicted probabilities using ggplot2:

pred.plot <- ggplot(x1.model2.predict, aes(y = PredProb, x = x1)) + 
  geom_line() +
  geom_ribbon(aes(ymin = lower.ci, ymax = upper.ci), alpha = .2) +
  labs(x = "x1", 
       y = "Predicted Probability of y Occuring")
pred.plot + theme_grey()

Just as before with our AME plot, the confidence interval for the predicted probability never crosses zero (on the y-axis), so there is always a change of \(y\) occurring in the presence of some value of \(x1\). Also as we saw in the AME plot, there is a big change in the probability of \(y\) occurring when \(x1\) is closer to its mean value, and a smaller change in the probability when \(x1\) is closer to its tails.

Key takeaway

Logit models are, to me, one of the most useful models in business research, and help us to answer a lot of interesting questions. They do, especially in the case of logit models with continuous predictors, take some extra work to make sense of and really get a handle on what the relationship is between the predictors and the outcome. These steps—marginal effects and predicted probabilities—are, to me, a must have in logit model analysis.