

To set expectations up front, this is a decidedly non-technical discussion of what is, in fact, a very technical, but very fundamental, topic in applied statistics. For the technical details, let me refer you to my favorite textbook on the topic:

Cohen J, Cohen P, West SG, Aiken LS. 2003. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.

And if you are interested, Wikipedia has lots of formulas for you to geek out on. Note that we’re talking here about sample variance/covariance, not population parameters.

Understanding variance, covariance, and correlation is fundamental to making sense of the vast majority of analyses we do as researchers. Variance (along with probability theory, but that’s another blog post) is the building block for making sense of causal relationships and, more importantly, the strength of those relationships. Yes, we want to know whether X affects Y, but really we want to know how much of a change in X causes how much of a change in Y. To do that, we need to understand variance.


Variance

Let’s start by creating a data frame with three normally distributed variables, each with a different standard deviation.

library(tidyverse)
set.seed(08022003)
sd5.df <- tibble(x = rnorm(1000, sd = .5), sd = .5)
sd1.df <- tibble(x = rnorm(1000, sd = 1), sd = 1)
sd15.df <- tibble(x = rnorm(1000, sd = 1.5), sd = 1.5)
my.df <- bind_rows(sd5.df, sd1.df, sd15.df)  # Stack the three variables into one data frame
my.df$sd <- as.factor(my.df$sd)  # Convert the 'sd' column into a factor for plotting

The easiest way to understand variance is with a graphic, so let’s create a box plot showing the different variances.

my.boxplot <- ggplot(my.df, aes(x = sd, y = x, color = sd)) +
  geom_boxplot()
my.boxplot + theme_minimal() + 
  xlab("Standard Deviation (sd)") +
  ggtitle("Box plot of three random variables with different variances")

Here we have my favorite diagnostic tool, the box plot, which is a handy way to visualize the dispersion of a given random variable.

That dispersion is exactly what variance measures. Variance is, roughly, the average of the squared differences from the mean across all of the observations of a given variable; for a sample, we divide the sum of squared differences by n - 1 rather than by n. What variance tells you is how spread out the observations of your variable are around that variable’s mean. The higher the variance, the more spread out the observations are.

When you take the square root of the variance, you get the variable’s standard deviation. What makes the standard deviation so handy is that it puts the variance into the same units as the variable itself (more on that later).
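
If you want to convince yourself of the arithmetic, here’s a quick check that computes the sample variance and standard deviation by hand for one of the variables we simulated above (the names var.by.hand and sd.by.hand are just illustrative) and compares them to R’s built-in var() and sd():

x <- sd15.df$x  # the variable we simulated with sd = 1.5
n <- length(x)
var.by.hand <- sum((x - mean(x))^2) / (n - 1)  # squared deviations from the mean, divided by n - 1
sd.by.hand <- sqrt(var.by.hand)  # the square root puts us back in x's units
var.by.hand; var(x)  # these should match
sd.by.hand; sd(x)  # and so should these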

In the box plot above, I’ve generated three random continuous variables (1,000 observations each [n = 1,000]), each with an expected mean of 0 but with three different standard deviations: .5, 1, and 1.5. As you can see, as the standard deviation gets larger, the ‘box’ around the mean value of 0 (the line in the center) gets larger, and the ‘whiskers’ of the plot extend farther out. The higher the variance (standard deviation), the more spread out the observations are from the mean.
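
We can also put numbers on what the plot shows. A quick grouped summary (dplyr comes along with the tidyverse we loaded above) confirms that each simulated variable’s sample standard deviation lands close to the value we asked for:

my.df %>%
  group_by(sd) %>%
  summarise(sample.sd = sd(x))  # close to .5, 1, and 1.5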

Here’s the most important thing to understand about variance: we have to understand the variance of X and Y if we are to understand the covariance between X and Y.


Covariance

If variance is a measure of how dispersed the observations of a single variable are, covariance is the extent to which two variables vary together. In effect, covariance is a measure of the relationship between two variables: the larger the covariance (in absolute value), the stronger the relationship.

Covariances can be positive (the two variables move in the same direction), negative (they move in opposite directions), or, in the case of no relationship, zero.
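
To make the formula concrete, here is a minimal by-hand calculation with two short made-up vectors, a and b (the names are purely illustrative), checked against R’s built-in cov():

a <- c(1, 2, 3, 4, 5)
b <- c(2, 4, 5, 4, 6)
n <- length(a)
# Multiply the paired deviations from each mean, sum them, and divide by n - 1
cov.by.hand <- sum((a - mean(a)) * (b - mean(b))) / (n - 1)
cov.by.hand  # 2 -- positive, so a and b tend to move together
cov(a, b)  # R's built-in function gives the same answer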

I’ve generated a new dataset (code is below) with three random continuous variables, x1, x2, and y. I’ve purposely set the covariance between x1 and x2 to zero: no relationship. If you take a look at the scatter plot of x1 and x2 below, it seems pretty clear that the two variables are unrelated…

library(MASS)  # Note: MASS masks dplyr's select(), so load it with care
set.seed(08022003)
# We start by creating a defined covariance matrix
# (variances on the diagonal, covariances off the diagonal)
cov.matrix <- matrix(c(.5, 0, 1.5,
                       0, 1, 0,
                       1.5, 0, 15),
                     nrow = 3, ncol = 3,
                     dimnames = list(c("x1", "x2", "y"),
                                     c("x1", "x2", "y")))
# Now we generate our simulated data
cv.df <- mvrnorm(n = 1000,  # Number of observations
                 mu = c(0, 0, 0),  # Variable means
                 Sigma = cov.matrix,  # Covariance matrix
                 tol = .1,  # Tolerance for the positive-definiteness check
                 empirical = TRUE)  # Make the sample moments match mu and Sigma exactly
cv.df <- data.frame(cv.df)
# Let's make a scatterplot of x1 and x2
x1x2.scatterplot <- ggplot(cv.df, aes(x = x1, y = x2)) +
  geom_point(shape = 1)
x1x2.scatterplot + theme_minimal()
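
And we can verify that the simulation did what we asked. Because empirical = TRUE forces the sample moments to match the matrix we specified, cov() returns (essentially) the exact values we set:

cov(cv.df$x1, cv.df$x2)  # ~0: the 'no relationship' we built in
cov(cv.df$x1, cv.df$y)  # ~1.5: the positive covariance we specified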