Welcome back! To kick off our first lab of the year, we’ll be using R Markdown created with R Studio Notebook today. Pretty fancy huh!
When working—and learning—with R, it’s best to keep in mind a couple of things:
You don’t have to use R Studio for this class, but I would highly, highly recommend it! I also highly recommend using R Notebook (or another notebook solution) for this class and for your research.
By itself, R is pretty cool, but also pretty limited. To do most of the things you want to do in R, you need to use packages. Packages are bolt-on software available at no cost to get more out of R. One of the packages we will use a lot is called tidyverse, which is itself a collection of popular packages. You can install a package through a point and click interface in R Studio, or by typing…
install.packages("tidyverse")
You only need to install a package once—the software itself is stored locally on your machine. But to use the package, you need to load what’s called its library each session…
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
We’ll be working a lot more with packages and libraries, so learn to love them! But, take a look at the output and the conflicts mentioned with other packages. This happens somewhat frequently—two or more packages have a function with the same name—but there is an easy way to deal with it.
To understand R you need to get a firm grasp of data sets and data frames. We’ll only scratch the surface of the countless ways to get data into R, and there are some great resources here and here.
A data set is just what it implies—a series of rows and columns of data in some type. I’m a big fan of Tidy Data, which is a standardized way of presenting data such that it’s easy to use and easy to make sense of. Most of the time, the data that we get are messy, so tidying the data is an important step to make raw data useful.
I’ve already got a tidy dataset handy, so lets fire that one up using the readr package…
# R can take in virtually any type of data.
# Here I'm using a built in command, read.table
# to create a new dataset, myDS, from a .csv file
library(readr)
myDS <- read_csv("http://a.web.umkc.edu/andersonbri/ENT5587.csv")
View(myDS) # Lets take a look at the dataset in a familiar format
What you should see is a spreadsheet-like view of your data in the source window. Take a look at cells with the NA value. This is how R stores missing data—we’ll come back to that later.
Data sets can be accessed directly, but the more useful device is the data frame. A data frame is really like an uber-customizable grid. You can create multiple data frames from a single data set, and new data frames from existing data frames, allowing us to conduct all sorts of analyses, build visualizations, and dig in to the data.
myDF <- data.frame(myDS) # This creates a data frame using our myDS data source
We can also use the subset option to parcial out a data frame along virtually any lines. For example, lets say we want a data frame that only has complete observations on every variable in our myDS data set…
myCompleteDF <- subset(myDS[complete.cases(myDS), ])
# Note that you can subset on just a few variables as well.
Subsetting is a very powerful option, and one that we’ll use a lot.
When we were working in Stata (like most other statistics software), you would generally make a local copy of your dataset saved in Stata’s proprietary format and then work from there. R works a bit differently. Yes, there is a native R database storage format, but you really don’t need to use it unless you want to. Instead, you can just call a stored, or master-version of your data each session, create a (or as many as you want) data frame from that data, and you are off. This is also one of the major advantages of working within an R notebook to save your code, and create a track record of what you’ve been working.
There are some other benefits of working dynamically with datasets…
A downside of working dynamically is that you may need an Internet connection (although you can always just use a cloud solution, like I do), and archiving might be an issue (although again, cloud storage generally takes care of that).
So if you haven’t guessed yet, you can have several different data sources and several different data frames stored in working memory at any one time. This is a handy feature, but it also means we have to help R out to know which data frame we want to use for a given set of commands. There are several ways to do this, and we’ll cover a few of them here as we get to know our data.
Lets start with getting some descriptive statistics out. There are lots of ways to do this, but my preferred way is the psych package. If you haven’t installed it yet, now’s a good time…
install.packages("psych")
With that done, you can load the library and then call an R function, in this case describe and tell it that you want to run that function on our myCompleteDF data frame. You will be working with R functions extensively, and the nice thing about them is that they generally follow a similar format.
library(psych) # Call the pysch library
# The describe function in the psych library has a conflict with another
# package, so we make sure that R references the function that we want.
# This is one way to resolve what is called a namespace conflict.
psych::describe(myCompleteDF)
## vars n mean sd median trimmed mad min max
## Obs 1 110 59.53 34.07 58.50 59.33 43.00 1.00 119.00
## Employees 2 110 277.23 76.85 280.50 278.02 77.10 26.00 447.00
## FirmAge 3 110 22.25 8.06 22.00 22.42 7.41 1.00 40.00
## INN1 4 110 3.85 1.88 4.00 3.85 2.97 1.00 7.00
## INN2 5 110 4.14 1.72 4.00 4.16 1.48 1.00 7.00
## INN3 6 110 3.83 1.68 4.00 3.84 1.48 1.00 7.00
## PRO1 7 110 4.13 1.61 4.00 4.19 1.48 1.00 7.00
## PRO2 8 110 4.35 1.66 5.00 4.43 1.48 1.00 7.00
## PRO3 9 110 4.44 1.41 5.00 4.55 1.48 1.00 7.00
## RISK1 10 110 3.94 1.34 4.00 3.94 1.48 1.00 7.00
## RISK2 11 110 3.84 1.42 4.00 3.85 1.48 1.00 7.00
## RISK3 12 110 4.04 1.42 4.00 4.11 1.48 1.00 7.00
## Innovativeness 13 110 3.94 1.47 4.00 3.95 1.48 1.00 7.00
## Proactiveness 14 110 4.30 1.22 4.33 4.38 1.48 1.00 7.00
## RiskTaking 15 110 3.94 1.21 4.00 4.01 0.99 1.00 6.33
## LRP 16 110 3.88 1.76 4.12 3.91 2.04 1.00 7.00
## Forecast 17 110 3.94 1.56 4.33 3.99 1.48 1.00 7.00
## AdPlan 18 110 4.18 0.88 4.25 4.20 0.74 2.00 7.00
## SGR 19 110 -0.24 14.73 0.05 0.04 12.03 -76.12 37.94
## IndTech 20 110 3.59 1.26 3.60 3.56 1.48 1.00 7.00
## IndDyn 21 110 4.03 1.12 4.20 4.02 1.19 1.40 6.60
## IndHos 22 110 4.34 1.03 4.33 4.34 0.99 1.83 6.83
## HighTech 23 110 0.55 0.50 1.00 0.57 0.00 0.00 1.00
## PatentCount 24 110 4.34 3.05 4.50 4.23 3.71 0.00 12.00
## PatentActivity 25 110 0.84 0.37 1.00 0.92 0.00 0.00 1.00
## RNDIntensity 26 110 4.93 3.84 4.85 4.60 4.10 0.00 17.28
## range skew kurtosis se
## Obs 118.00 0.06 -1.18 3.25
## Employees 421.00 -0.34 0.48 7.33
## FirmAge 39.00 -0.18 -0.25 0.77
## INN1 6.00 -0.01 -1.29 0.18
## INN2 6.00 -0.16 -1.05 0.16
## INN3 6.00 0.03 -0.90 0.16
## PRO1 6.00 -0.31 -0.66 0.15
## PRO2 6.00 -0.41 -0.60 0.16
## PRO3 6.00 -0.62 -0.15 0.13
## RISK1 6.00 -0.16 -0.41 0.13
## RISK2 6.00 -0.22 -0.86 0.14
## RISK3 6.00 -0.46 -0.51 0.14
## Innovativeness 6.00 -0.06 -0.84 0.14
## Proactiveness 6.00 -0.40 -0.41 0.12
## RiskTaking 5.33 -0.50 -0.27 0.12
## LRP 6.00 -0.25 -1.08 0.17
## Forecast 6.00 -0.33 -0.74 0.15
## AdPlan 5.00 -0.11 0.48 0.08
## SGR 114.06 -1.00 5.30 1.40
## IndTech 6.00 0.27 -0.40 0.12
## IndDyn 5.20 0.01 -0.53 0.11
## IndHos 5.00 0.03 -0.17 0.10
## HighTech 1.00 -0.22 -1.97 0.05
## PatentCount 12.00 0.14 -0.93 0.29
## PatentActivity 1.00 -1.79 1.23 0.04
## RNDIntensity 17.28 0.60 0.13 0.37
That doesn’t look all that great in our display—lets see if we can make it nicer looking. For this, we’ll use a package called stargazer which is a favorite of mine. But before we do that, one of the cool things about functions is that we can actually store the results of most functions as a vector, matrix, data frame, etc. This becomes very handy when we want to pass the results from one function to another function. That’s what we’re going to do here.
We’re going to get our descriptive statistics, store them, and then pass it to the stargazer function to have that function format our output. Don’t forget to load the stargazer package first!
install.packages("stargazer")
library(stargazer) # Load the stargazer package into memory
# I'm creating a new variable and store my descriptive stats table
desc.stats <- psych::describe(myCompleteDF, fast = TRUE )
# Now pass desc.stats off to stargazer for formatting.
# Note that we need to tell stargazer to treat our data frame
# as a matrix in order to make this command work.
stargazer(as.matrix(desc.stats), type = "html",
summary = FALSE,
digits = 2,
initial.zero = FALSE)
vars | n | mean | sd | min | max | range | se | |
Obs | 1 | 110 | 59.53 | 34.07 | 1 | 119 | 118 | 3.25 |
Employees | 2 | 110 | 277.23 | 76.85 | 26 | 447 | 421 | 7.33 |
FirmAge | 3 | 110 | 22.25 | 8.06 | 1 | 40 | 39 | .77 |
INN1 | 4 | 110 | 3.85 | 1.88 | 1 | 7 | 6 | .18 |
INN2 | 5 | 110 | 4.14 | 1.72 | 1 | 7 | 6 | .16 |
INN3 | 6 | 110 | 3.83 | 1.68 | 1 | 7 | 6 | .16 |
PRO1 | 7 | 110 | 4.13 | 1.61 | 1 | 7 | 6 | .15 |
PRO2 | 8 | 110 | 4.35 | 1.66 | 1 | 7 | 6 | .16 |
PRO3 | 9 | 110 | 4.44 | 1.41 | 1 | 7 | 6 | .13 |
RISK1 | 10 | 110 | 3.94 | 1.34 | 1 | 7 | 6 | .13 |
RISK2 | 11 | 110 | 3.84 | 1.42 | 1 | 7 | 6 | .14 |
RISK3 | 12 | 110 | 4.04 | 1.42 | 1 | 7 | 6 | .14 |
Innovativeness | 13 | 110 | 3.94 | 1.47 | 1 | 7 | 6 | .14 |
Proactiveness | 14 | 110 | 4.30 | 1.22 | 1 | 7 | 6 | .12 |
RiskTaking | 15 | 110 | 3.94 | 1.21 | 1 | 6.33 | 5.33 | .12 |
LRP | 16 | 110 | 3.88 | 1.76 | 1 | 7 | 6 | .17 |
Forecast | 17 | 110 | 3.94 | 1.56 | 1 | 7 | 6 | .15 |
AdPlan | 18 | 110 | 4.18 | .88 | 2 | 7 | 5 | .08 |
SGR | 19 | 110 | -.24 | 14.73 | -76.12 | 37.94 | 114.06 | 1.40 |
IndTech | 20 | 110 | 3.59 | 1.26 | 1 | 7 | 6 | .12 |
IndDyn | 21 | 110 | 4.03 | 1.12 | 1.40 | 6.60 | 5.20 | .11 |
IndHos | 22 | 110 | 4.34 | 1.03 | 1.83 | 6.83 | 5.00 | .10 |
HighTech | 23 | 110 | .55 | .50 | 0 | 1 | 1 | .05 |
PatentCount | 24 | 110 | 4.34 | 3.05 | 0 | 12 | 12 | .29 |
PatentActivity | 25 | 110 | .84 | .37 | 0 | 1 | 1 | .04 |
RNDIntensity | 26 | 110 | 4.93 | 3.84 | 0 | 17.28 | 17.28 | .37 |
There are several other things that we can use to get to know our data. Lets run through a few of them using the myCompleteDF data frame…
# Get the structure of the data
str(myCompleteDF)
## Classes 'tbl_df', 'tbl' and 'data.frame': 110 obs. of 26 variables:
## $ Obs : int 1 3 4 5 6 7 8 9 10 11 ...
## $ Employees : int 301 199 385 282 324 200 209 445 213 201 ...
## $ FirmAge : int 28 22 19 20 36 1 29 20 17 34 ...
## $ INN1 : int 4 1 4 6 2 2 3 3 5 3 ...
## $ INN2 : int 2 1 5 5 3 2 2 4 4 2 ...
## $ INN3 : int 3 1 3 5 3 2 2 4 5 1 ...
## $ PRO1 : int 4 1 1 5 4 4 5 3 5 5 ...
## $ PRO2 : int 7 1 4 4 3 4 5 3 5 6 ...
## $ PRO3 : int 5 1 3 3 4 5 5 4 5 5 ...
## $ RISK1 : int 5 1 4 4 2 2 4 6 4 4 ...
## $ RISK2 : int 3 1 3 5 2 4 4 5 4 3 ...
## $ RISK3 : int 4 1 4 5 3 4 5 4 4 5 ...
## $ Innovativeness: num 3 1 4 5.33 2.67 ...
## $ Proactiveness : num 5.33 1 2.67 4 3.67 ...
## $ RiskTaking : num 4 1 3.67 4.67 2.33 ...
## $ LRP : num 1 2.75 4.75 3.75 4.75 ...
## $ Forecast : num 1.67 2.33 2 4.33 5.33 ...
## $ AdPlan : num 4 5 5 4.75 4.5 4 4.75 4 4.5 3.25 ...
## $ SGR : num 11.86 16.63 3.05 -13.71 4.53 ...
## $ IndTech : num 1.2 1.8 2 3.4 4 ...
## $ IndDyn : num 4.6 2.8 4.8 3.2 4.2 ...
## $ IndHos : num 3 3.67 5.33 3.67 4.33 ...
## $ HighTech : int 0 0 0 1 1 1 0 1 0 1 ...
## $ PatentCount : int 2 0 0 9 2 0 1 4 8 0 ...
## $ PatentActivity: int 1 0 0 1 1 0 1 1 1 0 ...
## $ RNDIntensity : num 2.34 0 4.67 6.23 0 ...
# This is a helpful command to keep track of what type of object you are
# actually working with.
class(myCompleteDF)
## [1] "tbl_df" "tbl" "data.frame"
# Print the first 10 rows of the data
head(myCompleteDF, n = 10)
## # A tibble: 10 × 26
## Obs Employees FirmAge INN1 INN2 INN3 PRO1 PRO2 PRO3 RISK1 RISK2
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 1 301 28 4 2 3 4 7 5 5 3
## 2 3 199 22 1 1 1 1 1 1 1 1
## 3 4 385 19 4 5 3 1 4 3 4 3
## 4 5 282 20 6 5 5 5 4 3 4 5
## 5 6 324 36 2 3 3 4 3 4 2 2
## 6 7 200 1 2 2 2 4 4 5 2 4
## 7 8 209 29 3 2 2 5 5 5 4 4
## 8 9 445 20 3 4 4 3 3 4 6 5
## 9 10 213 17 5 4 5 5 5 5 4 4
## 10 11 201 34 3 2 1 5 6 5 4 3
## # ... with 15 more variables: RISK3 <int>, Innovativeness <dbl>,
## # Proactiveness <dbl>, RiskTaking <dbl>, LRP <dbl>, Forecast <dbl>,
## # AdPlan <dbl>, SGR <dbl>, IndTech <dbl>, IndDyn <dbl>, IndHos <dbl>,
## # HighTech <int>, PatentCount <int>, PatentActivity <int>,
## # RNDIntensity <dbl>
Next up, lets make some pictures!
R has several vary capable graphics engines, but this is a case where equifanility may not be in your best interest. The reason being is that you really need to invest the time building your competency with a single package, because they all work a little differently, and as you attempt more complicated visualizations, it’s easier if you have a solid foundation.
The package we will be using in class, and also my recommendation, is ggplot2. There are a LOT of references online for ggplot2, but let me point you over to my go-to. I would also HIGHLY recommend that you take advantage of the ggplot2 Cheat Sheet and keep it handy—it’s a terrific reference for most of what you want to do.
We’ll be doing a lot more with visualizations as we go along this term, but let me highlight a few common plots that we’ve already used.
Since you’ve installed the tidyverse already, you’ve already installed the ggplot2 package. There are other extensions of ggplot2 that I’d recommend, one of which gives you even more formating options, ggthemes.
install.packages('ggthemes', dependencies = TRUE)
Ok, lets start with a basic histogram of one of our variables, Sales Growth Rate (SGR). We’ll walk through the syntax in class, but this gets you started.
library(ggplot2)
# We're going to store the result in a new variable, myHist
myHist <- ggplot(myCompleteDF, aes(SGR)) +
geom_histogram()
myHist
Looks like we have an outlier. Lets change the y scale from a count of the bins to a density function, and then overlay a normal density curve over it to see how it compares.
# Note the 'myCompleteDF$SGR' reference
myHistNormal <- ggplot(myCompleteDF, aes(SGR)) +
geom_histogram(aes(y = ..density..)) +
stat_function(fun = dnorm,
args = list(mean = mean(myCompleteDF$SGR),
sd = sd(myCompleteDF$SGR)))
myHistNormal
It certainy seems like we’ve got a normality problem. We’ll come back to that later, but right now, our graphic is pretty boring. Lets make use of some built-in templates to make things look better, and we’ll add a title and change the axis labels.
library(ggthemes) # Call the themes library
myHistNormalColor <- ggplot(myCompleteDF, aes(SGR)) +
geom_histogram(aes(y = ..density..)) +
stat_function(fun = dnorm,
colour = 'red',
args = list(mean = mean(myCompleteDF$SGR),
sd = sd(myCompleteDF$SGR))) +
labs(title = "Histogram of Sales Growth Rate with \n Normal Distribution Overlay",
x = "Sales Growth Rate",
y = "Distribution Density")
myHistNormalColor + theme_hc()
We can see the outlier better now. But we can also use our trusty QQ plot to give us some additional visual evidence.
myQQ <- ggplot(myCompleteDF, aes(sample = SGR)) +
stat_qq()
myQQ + theme_hc()
Well, it looks pretty clear from this one as well that we’ve got an outlier on the low end of Sales Growth Rate.
Now lets move on to the scatter plot. The logic of ggplot2 is built around the Grammar of Graphics, which in a sense asks the data to tells its story through pictures. One manifestation of this is that you can think of building visualizations iteratively, in the sense that you start with one story, and then overlay additional stories to guide the reader to a conclusion. We did that with our histogram above—the basic histogram seemed to indicate an outlier, but it was easier to see when we overlayed the normal denisty distribution.
So we’re going to start with a basic scatterplot of the relationship between Proactiveness and Sales Growth Rate, and then build from there to improve our understanding of the relationship.
# The aes is a mapping command
# We are mapping the x axis to Proactiveness, and y to Sales Growth Rate
# geom's are the type of visualization we are going to make
myScatter <- ggplot(myCompleteDF, aes(x = Proactiveness, y = SGR)) +
geom_point()
myScatter
Ok, we might have to do something about that outlier now. If we look at the scatter plot and the historgrams above, it looks like there is a value of SGR around -80 that is well outside of our distribution. Lets create a new data frame by subsetting myCompleteDF to drop the outlier. Note the cool thing about data frames. We haven’t actually modified the original data set at all—it’s still pristine. But we’ve created a variety of data frames to make the original set more useful. Cool, huh?
# Drop the outlier observation, and create a working dataframe
# After the ',' R reads the command as 'keep if' SGR > -40
myNoOutlierDF <- subset(myCompleteDF, SGR > -40)
# Lets see how many observations we dropped...
original.rows <- nrow(myCompleteDF) # Get the original count of rows(obs)
new.rows <- nrow(myNoOutlierDF) # Get the new count of rows(obs)
dif.rows <- original.rows - new.rows # Calculate the difference
cat("Number of outliers dropped: ", dif.rows) # Print it out
## Number of outliers dropped: 1
Ok, so now we’ve got our new data frame, lets try that scatter plot again.
# We're going to use the new data frame this time
myNoOutlierScatter <- ggplot(myNoOutlierDF, aes(x = Proactiveness, y = SGR)) +
geom_point()
myNoOutlierScatter
That looks better. Now, lets keep going with telling our story. Lets overlay a linear function of SGR on Proactiveness and see if there is a general linear trend here.
# We overlay here two geoms, the point (the scatter) and smooth (the line)
myLinearTrend <- ggplot(myNoOutlierDF, aes(x = Proactiveness, y = SGR)) +
geom_point() +
stat_smooth(method="lm")
myLinearTrend + theme_hc()
Well, looks like there may not be much there, particularly given how much the confidence interval band includes zero, but it’s not entirely clear. Before we move on though, lets take advantage of another feature in R to easily put plots next to each other. For this example, lets put a linear trend of the SGR ~ Proactiveness relationship before and after we removed the outlier.
We’re also going to force R to use the same scale of reference for the Y variable—if we didn’t do that, the visualizations really wouldn’t be comparable!
We’re going to use the gridExtra package, so lets go get it.
install.packages("gridExtra")
# Create the first plot with the outlier
myTrend.Outlier <- ggplot(myCompleteDF, aes(x = Proactiveness, y = SGR)) +
geom_point() +
stat_smooth(method="lm") +
theme_hc() +
ylim(-100, 50) +
labs(title = "With Outlier")
# Create the second plot without the outlier
myTrend.NoOutlier <- ggplot(myNoOutlierDF, aes(x = Proactiveness, y = SGR)) +
geom_point() +
stat_smooth(method="lm") +
theme_hc() +
ylim(-100, 50) +
labs(title = "Without Outlier")
Now lets put it together.
library(gridExtra)
grid.arrange(myTrend.Outlier, myTrend.NoOutlier, nrow = 1)
Looks like the outlier might have been pulling the trend line down, suggesting a larger effect than what should be there. We’ll have to look into that further. It’s not looking good for increasing sales growth as a function of being proactive.
Given that frequentist statistics is all about the study of probabilities as a function of variance, it’s helpful to know how to get the variance, covariances, and correlations for focal variables. I’m going to start with showing you how to get a variance/covariance matrix and then a correlation matrix for a subset of our myCompleteDF data frame using R’s built-in commands. Then we’re going to explore a package that gives us additional options.
# Lets start by getting only the variables that we want into a new data frame
myVars <- subset(myCompleteDF, select = c(SGR, Innovativeness,
Proactiveness, RiskTaking))
# Now, we'll create a variance/covariance matrix
myCov <- cov(myVars, method = c("pearson"))
round(myCov, 3) # Round the matrix to three decimals
## SGR Innovativeness Proactiveness RiskTaking
## SGR 216.905 3.814 4.467 4.872
## Innovativeness 3.814 2.149 0.811 1.013
## Proactiveness 4.467 0.811 1.487 0.819
## RiskTaking 4.872 1.013 0.819 1.473
Now for the correlation matrix.
myCorr <- cor(myVars, method = c("pearson"))
round(myCorr, 3)
## SGR Innovativeness Proactiveness RiskTaking
## SGR 1.000 0.177 0.249 0.273
## Innovativeness 0.177 1.000 0.453 0.570
## Proactiveness 0.249 0.453 1.000 0.553
## RiskTaking 0.273 0.570 0.553 1.000
Now lets test our correlational hypotheses and get the p-values using the corr.test function.
library(psych)
corr.test(myVars) # This gives us p-values for the correlations
## Call:corr.test(x = myVars)
## Correlation matrix
## SGR Innovativeness Proactiveness RiskTaking
## SGR 1.00 0.18 0.25 0.27
## Innovativeness 0.18 1.00 0.45 0.57
## Proactiveness 0.25 0.45 1.00 0.55
## RiskTaking 0.27 0.57 0.55 1.00
## Sample Size
## [1] 110
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## SGR Innovativeness Proactiveness RiskTaking
## SGR 0.00 0.06 0.02 0.01
## Innovativeness 0.06 0.00 0.00 0.00
## Proactiveness 0.01 0.00 0.00 0.00
## RiskTaking 0.00 0.00 0.00 0.00
##
## To see confidence intervals of the correlations, print with the short=FALSE option
It should be pretty apparent now that R’s built in commands and even the common packages have some limitations, particularly when it comes to displaying the results, getting p-values, etc. I wish that I had an easy answer for you in terms of quick and nicely formatted correlation matrices with p-values, but I don’t. Here’s though a somewhat complicated function—original code here—that will output an html table that you can copy and past into Word, Google Docs, etc, for further editing.
library(xtable)
corstars <-function(x, method=c("pearson", "spearman"), removeTriangle=c("upper", "lower"), result=c("none", "html", "latex")){
#Compute correlation matrix
require(Hmisc)
x <- as.matrix(x)
correlation_matrix<-rcorr(x, type=method[1])
R <- correlation_matrix$r # Matrix of correlation coeficients
p <- correlation_matrix$P # Matrix of p-value
## Define notions for significance levels; spacing is important.
mystars <- ifelse(p < .001, "****", ifelse(p < .001, "*** ", ifelse(p < .01, "** ", ifelse(p < .05, "* ", " "))))
## trunctuate the correlation matrix to two decimal
R <- format(round(cbind(rep(-1.11, ncol(x)), R), 2))[,-1]
## build a new matrix that includes the correlations with their apropriate stars
Rnew <- matrix(paste(R, mystars, sep=""), ncol=ncol(x))
diag(Rnew) <- paste(diag(R), " ", sep="")
rownames(Rnew) <- colnames(x)
colnames(Rnew) <- paste(colnames(x), "", sep="")
## remove upper triangle of correlation matrix
if(removeTriangle[1]=="upper"){
Rnew <- as.matrix(Rnew)
Rnew[upper.tri(Rnew, diag = TRUE)] <- ""
Rnew <- as.data.frame(Rnew)
}
## remove lower triangle of correlation matrix
else if(removeTriangle[1]=="lower"){
Rnew <- as.matrix(Rnew)
Rnew[lower.tri(Rnew, diag = TRUE)] <- ""
Rnew <- as.data.frame(Rnew)
}
## remove last column and return the correlation matrix
Rnew <- cbind(Rnew[1:length(Rnew)-1])
if (result[1]=="none") return(Rnew)
else{
if(result[1]=="html") print(xtable(Rnew), type="html")
else print(xtable(Rnew), type="latex")
}
}
corstars(myVars, method=c("pearson"), removeTriangle=c("upper"),
result=c("html"))
SGR | Innovativeness | Proactiveness | |
---|---|---|---|
SGR | |||
Innovativeness | 0.18 | ||
Proactiveness | 0.25** | 0.45**** | |
RiskTaking | 0.27** | 0.57**** | 0.55**** |
Ok, so this table isn’t very pretty, but that’s for another class!
Lets take a look first at our simple linear regression model: \(y=\alpha+{\beta}{x}+\varepsilon\)
There are several ways to tackle regression in R, including the built in stats package. With this package, we’ll be using the lm function, which stands for linear model (clever, huh?). To this we’re going to add another package called car, which contians a very helpful set of regression diagnostic functions that you’re already familiar with.
install.packages("car")
We can’t hope to cover everything, so Google is your friend here. We will be working a lot with these models as the term progresses, so it will get easier to understand as time goes along. We’re going to start with a model regressing Sales Growth Rate on Innovativeness, using the myCompleteDF data frame (this is the one without the outlier).
myRegression <- lm(SGR ~ Innovativeness, # Here's the model itself (note the ~)
data = myCompleteDF) # This calls the data frame
summary(myRegression) # This displays our output
##
## Call:
## lm(formula = SGR ~ Innovativeness, data = myCompleteDF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.438 -8.131 0.685 8.749 36.015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.2330 3.9972 -1.810 0.0732 .
## Innovativeness 1.7746 0.9515 1.865 0.0649 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.56 on 108 degrees of freedom
## Multiple R-squared: 0.0312, Adjusted R-squared: 0.02223
## F-statistic: 3.479 on 1 and 108 DF, p-value: 0.06488
So you know what each of these output variables and values mean, although the default display isn’t particularly helpful (more on that later). As it stands, Innovativeness is tenuously related to Sales Growth Rate. If we remembr though, we were concerned about an outlier earlier, so lets check some common regression diagnostics using the car package.
library(car) # Don't forget to load the car library!
# We'll start with a QQ Plot of the studentized residuals
qqPlot(myRegression)
Looks there might be one observation we might be concernd about (similar to what our earlier plots showed). We can also use an influential observation plot to see if this observation may be materially influencing the effect size estimate.
influencePlot(myRegression, id.method="identify")
Well, we certainly have an observation with a studentized residual well outside of the standard +/- 2 standard deviations, which is likely impacting the results. There are LOTS of other ways to evaluate the regression diagnostics, but my favorite is the ggfortify package which adds on to ggplot2 and creates a series of plots one the same page.
# Don't forget to install the package!
install.packages("ggfortify")
library(ggfortify)
autoplot(myRegression)
Ok, it looks like observation #74 is our problem. So now lets run a new model, this time with a data frame that subsets myCompleteDF based on excluding observation #74.
# Subset if variable 'Obs' is not equal to '74'
myRemovedOutlierDF <- subset(myCompleteDF, Obs!=74)
# Now lets run a new regression
myNewRegression <- lm(SGR ~ Innovativeness, # Here's the model itself (note the ~)
data = myRemovedOutlierDF) # This calls the data frame
summary(myNewRegression) # This displays our output
##
## Call:
## lm(formula = SGR ~ Innovativeness, data = myRemovedOutlierDF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.074 -8.306 -0.555 8.262 35.507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.1341 3.5538 -1.163 0.247
## Innovativeness 1.1595 0.8429 1.375 0.172
##
## Residual standard error: 12.8 on 107 degrees of freedom
## Multiple R-squared: 0.01737, Adjusted R-squared: 0.008191
## F-statistic: 1.892 on 1 and 107 DF, p-value: 0.1719
So the results certainly changed, as we expected them to. It’s a little hard though to visually compare these models from the default output, so lets use stargazer again to make them pretty, and publication ready!
library(stargazer)
stargazer(myRegression, myNewRegression, type = "html",
title = "Model comparisons",
dep.var.labels.include = FALSE,
dep.var.caption = "DV: Sales Growth Rate",
column.labels = c("With Outlier", "Without Outlier"),
report = ("vc*sp"),
ci = TRUE,
omit.stat = c("ser"),
star.cutoffs = c(0.05, 0.01, 0.001),
star.char = c("*", "**", "***"),
notes = "† p < .05; ** p < .01; *** p < .001",
notes.align = c("r"),
notes.label = "",
notes.append = FALSE)
DV: Sales Growth Rate | ||
With Outlier | Without Outlier | |
(1) | (2) | |
Innovativeness | 1.775 | 1.159 |
(-0.090, 3.639) | (-0.493, 2.812) | |
p = 0.065 | p = 0.172 | |
Constant | -7.233 | -4.134 |
(-15.067, 0.601) | (-11.099, 2.831) | |
p = 0.074 | p = 0.248 | |
Observations | 110 | 109 |
R2 | 0.031 | 0.017 |
Adjusted R2 | 0.022 | 0.008 |
F Statistic | 3.479 (df = 1; 108) | 1.892 (df = 1; 107) |
† p < .05; ** p < .01; *** p < .001 |
Now, lets tackle a multiple regression model: \(y=\alpha+{\beta}_{i}{x}_{i}+{\beta}_{j}{x}_{j}+...+{\beta}_{k}{x}_{k}+\varepsilon\)
The syntax is very similar, but we’ll need to add some additional variables—in this case, Proactiveness and RiskTaking. Lets also use stargazer to make it pretty for us.
myMultipleRegression <- lm(SGR ~ Innovativeness + Proactiveness + RiskTaking,
data = myRemovedOutlierDF) # Use the no outlier data frame
stargazer(myMultipleRegression, type = "html",
title = "Multiple regression model",
dep.var.labels.include = FALSE,
dep.var.caption = "DV: Sales Growth Rate",
colnames = FALSE,
report = ("vc*sp"),
ci = TRUE,
omit.stat = c("ser"),
star.cutoffs = c(0.05, 0.01, 0.001),
star.char = c("*", "**", "***"),
notes = "† p < .05; ** p < .01; *** p < .001",
notes.align = c("r"),
notes.label = "",
notes.append = FALSE)
DV: Sales Growth Rate | |
Innovativeness | -0.144 |
(-2.150, 1.863) | |
p = 0.889 | |
Proactiveness | 1.712 |
(-0.666, 4.091) | |
p = 0.162 | |
RiskTaking | 1.447 |
(-1.164, 4.059) | |
p = 0.280 | |
Constant | -12.092* |
(-21.775, -2.409) | |
p = 0.017 | |
Observations | 109 |
R2 | 0.064 |
Adjusted R2 | 0.037 |
F Statistic | 2.394 (df = 3; 105) |
† p < .05; ** p < .01; *** p < .001 |