Please read and follow these instructions in order to try these past workshops on your own.

Basic statistics and R commands

This is the code that we went over during the session.

Outline of statistics code we used:

mean: mean
sd: standard deviation
median: median
IQR: Interquartile range
cor: correlation
cor.test: correlation with p-value
aov: ANOVA
glm: linear regression

Loading the data!

Import the data from the Code As Manuscript website to use in this session.

ds <- read.csv("http://codeasmanuscript.org/states_data.csv")
names(ds)

#>  [1] "StateName"  "Population" "Income"     "Illiteracy" "LifeExp"   
#>  [6] "Murder"     "HSGrad"     "Frost"      "Area"       "Region"    
#> [11] "Division"   "Longitude"  "Latitude"

Simple descriptive statistics

The $ tells R to take a column from the dataset. So ds$Population tells R to take the Population column from the ds object.

mean(ds$Population)

#> [1] 4246.42

sd(ds$Population)

#> [1] 4464.491

round(median(ds$LifeExp), 1)

#> [1] 70.7

IQR(ds$LifeExp)

#> [1] 1.775

You can use R code inline too:

The median of Life Expectancy in the States dataset is: 70.7

The median of Life Expectancy in the States dataset is: 70.7

You can put the results of the statistical output into an object called Corr (R is case-sensitive). And if we load the broom package, we can use the tidy() function

cor(ds$Population, ds$LifeExp)

#> [1] -0.06805195

Corr <- cor.test(ds$Population, ds$LifeExp)
library(broom)
Corr2 <- tidy(Corr)

Again, you can use inline R code to convert to text:

The correlation is -0.068052 and it is not significant (p=0.6386594).

The correlation is -0.068052 and it is not significant (p=0.6386594).

Running an ANOVA uses a slightly different syntax: the ~ (tilde) is used in R to indicate that it is a formula. So in this case, you are seeing the role of Division (the X variable as a factor/discrete) on Population (the y variable as a continuous variable).

AOV <- aov(Population ~ Division, data = ds)
summary(AOV)

#>             Df    Sum Sq  Mean Sq F value  Pr(>F)   
#> Division     8 422922254 52865282   3.914 0.00164 **
#> Residuals   41 553730250 13505616                   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

AOV2 <- tidy(AOV)

There is a significant difference in Population between Divisions (p=0.0016447). The table is below:

There is a significant difference in Population between Divisions (p=0.0016447). The table is below:

You can put the ANOVA results into a table too!

knitr::kable(AOV2)

term	df	sumsq	meansq	statistic	p.value
Division	8	422922254	52865282	3.914318	0.0016447
Residuals	41	553730250	13505616	NA	NA

Linear regression uses the same notation as ANOVA with the ~. In this case, since this is a linear regression you need to use a Gaussian distribution, as LifeExp (the y variable) is continuous.

fit <- glm(LifeExp ~ Income, data = ds, family = gaussian)
summary(fit)

#> 
#> Call:
#> glm(formula = LifeExp ~ Income, family = gaussian, data = ds)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -2.96547  -0.76381  -0.03428   0.92876   2.32951  
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 6.758e+01  1.328e+00  50.906   <2e-16 ***
#> Income      7.433e-04  2.965e-04   2.507   0.0156 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 1.62659)
#> 
#>     Null deviance: 88.299  on 49  degrees of freedom
#> Residual deviance: 78.076  on 48  degrees of freedom
#> AIC: 170.18
#> 
#> Number of Fisher Scoring iterations: 2

fit2 <- tidy(fit)
knitr::kable(fit2)

term	estimate	std.error	statistic	p.value
(Intercept)	67.5813178	1.3275714	50.90597	0.0000000
Income	0.0007433	0.0002965	2.50694	0.0156173

You can use math formulas inside R Markdown files too:

$$y = Xb + e$$

The regression formula is similar to the formula style in R with the ~.

$y = Xb + e$

If you want to go through each step of how linear regression works, I wrote a blog on understanding linear regression. Check it.

We can add covariates (confounders, etc) to the model, just like the math formula.

$y = x1b + x2b + e$

And in R:

fit_covar <- glm(LifeExp ~ Income + Population, data = ds, family = gaussian)
summary(fit_covar)

#> 
#> Call:
#> glm(formula = LifeExp ~ Income + Population, family = gaussian, 
#>     data = ds)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -3.2591  -0.8579  -0.0185   0.9485   2.2235  
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  6.747e+01  1.330e+00  50.724   <2e-16 ***
#> Income       8.094e-04  3.029e-04   2.673   0.0103 *  
#> Population  -4.366e-05  4.168e-05  -1.047   0.3003    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 1.623308)
#> 
#>     Null deviance: 88.299  on 49  degrees of freedom
#> Residual deviance: 76.295  on 47  degrees of freedom
#> AIC: 171.02
#> 
#> Number of Fisher Scoring iterations: 2

fit_covar2 <- tidy(fit_covar)
knitr::kable(fit_covar2)

term	estimate	std.error	statistic	p.value
(Intercept)	67.4737218	1.3302039	50.724345	0.0000000
Income	0.0008094	0.0003028	2.672564	0.0103146
Population	-0.0000437	0.0000417	-1.047401	0.3002713

If you want to select just certain variables to display in a table, you can more easily do that using the package dplyr:

library(dplyr)
fit_covar2 %>% 
    select(Variable = term, 
           Beta = estimate, 
           P = p.value) %>% 
    knitr::kable()

Variable	Beta	P
(Intercept)	67.4737218	0.0000000
Income	0.0008094	0.0103146
Population	-0.0000437	0.3002713

Written on October 14, 2016