Basic statistics and R commands
This is the code that we went over during the session.
Outline of statistics code we used:
- mean: mean
- sd: standard deviation
- median: median
- IQR: Interquartile range
- cor: correlation
- cor.test: correlation with p-value
- aov: ANOVA
- glm: generalized linear model (used here for linear regression)
Loading the data!
Import the data from the Code As Manuscript website to use in this session.
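The import itself depends on the file hosted on the Code As Manuscript website. For following along offline, here is a minimal sketch that builds a similar `ds` object from R's built-in `state.x77` matrix and `state.division` factor. That these built-in data match the session's file is an assumption, as is the choice of columns to keep.

```r
# Stand-in for the imported data.
# Assumption: the session's States dataset resembles R's built-in
# state.x77 (a numeric matrix) and state.division (a factor).
ds <- data.frame(
  Population = state.x77[, "Population"],
  Income     = state.x77[, "Income"],
  LifeExp    = state.x77[, "Life Exp"],
  Division   = state.division
)

str(ds)   # check the column types
head(ds)  # look at the first few rows
```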
Simple descriptive statistics
The `$` tells R to take a column from the dataset. So `ds$Population` tells R to take the `Population` column from the `ds` object.
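As a minimal sketch of the descriptive functions from the outline applied with `$`, assuming `ds` is a stand-in built from R's built-in `state.x77` (an assumption, not the session's actual file):

```r
# Assumed stand-in data built from the built-in state.x77 matrix.
ds <- data.frame(
  Population = state.x77[, "Population"],
  LifeExp    = state.x77[, "Life Exp"]
)

# Each function takes one column, selected with $.
desc <- c(
  mean   = mean(ds$Population),    # arithmetic mean
  sd     = sd(ds$Population),      # standard deviation
  median = median(ds$LifeExp),     # median
  IQR    = IQR(ds$LifeExp)         # interquartile range
)
print(desc)
```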
You can use R code inline too:
The median of Life Expectancy in the States dataset is: 70.7
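In the R Markdown source, inline code is wrapped in single backticks that start with `r`, for example `` `r round(median(ds$LifeExp), 1)` ``. The sketch below shows the R expression such an inline chunk would evaluate. It assumes `ds` is a stand-in built from R's built-in `state.x77`; rounding to one decimal place is also an assumption.

```r
# Assumed stand-in data built from the built-in state.x77 matrix.
ds <- data.frame(LifeExp = state.x77[, "Life Exp"])

# The value an inline chunk would insert into the sentence.
med_lifeexp <- round(median(ds$LifeExp), 1)

# Building the same sentence by hand, to see what knitr would render.
sentence <- paste0(
  "The median of Life Expectancy in the States dataset is: ",
  med_lifeexp
)
print(sentence)
```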
You can put the results of the statistical output into an object called `Corr` (R is case-sensitive). And if we load the `broom` package, we can use the `tidy()` function to turn those results into a tidy data frame.
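A minimal sketch of that pattern follows. It assumes the correlation is between `Population` and `LifeExp`, and that `ds` is a stand-in built from R's built-in `state.x77`; neither is stated explicitly in the text. The name `CorrTidy` is hypothetical.

```r
library(broom)

# Assumed stand-in data built from the built-in state.x77 matrix.
ds <- data.frame(
  Population = state.x77[, "Population"],
  LifeExp    = state.x77[, "Life Exp"]
)

# Store the test result in an object; note the capital C in Corr.
Corr <- cor.test(ds$Population, ds$LifeExp)

# tidy() returns a one-row data frame with the estimate, p-value, etc.
CorrTidy <- tidy(Corr)
print(CorrTidy)

# Individual values can then be pulled out by column name.
CorrTidy$estimate
CorrTidy$p.value
```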
Again, you can use inline R code to convert to text:
The correlation is -0.068052 and it is not significant (p=0.6386594).
Running an ANOVA uses a slightly different syntax: the `~` (tilde) tells R that the expression is a formula. In this case, you are looking at the effect of `Division` (the x variable, a discrete factor) on `Population` (the y variable, which is continuous).
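A minimal sketch of the formula syntax with `aov()`, assuming `ds` is a stand-in built from R's built-in `state.x77` and `state.division` (an assumption); the object name `AOV` is hypothetical:

```r
# Assumed stand-in data built from built-in datasets.
ds <- data.frame(
  Population = state.x77[, "Population"],
  Division   = state.division   # factor with one level per division
)

# y ~ x: Population is the outcome, Division is the grouping factor.
AOV <- aov(Population ~ Division, data = ds)

# Classic ANOVA table with degrees of freedom, F statistic and p-value.
summary(AOV)
```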
There is a significant difference in Population between Divisions (p=0.0016447). The table is below:
You can put the ANOVA results into a table too!
term | df | sumsq | meansq | statistic | p.value |
---|---|---|---|---|---|
Division | 8 | 422922254 | 52865282 | 3.914318 | 0.0016447 |
Residuals | 41 | 553730250 | 13505616 | NA | NA |
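A table like the one above could be produced along these lines. This is a sketch assuming `ds` is a stand-in built from R's built-in `state.x77` and `state.division`; the names `AOV` and `AOVTidy` are hypothetical.

```r
library(broom)

# Assumed stand-in data built from built-in datasets.
ds <- data.frame(
  Population = state.x77[, "Population"],
  Division   = state.division
)

AOV <- aov(Population ~ Division, data = ds)

# tidy() gives one row per term, with df, sumsq, meansq, statistic, p.value.
AOVTidy <- tidy(AOV)
print(AOVTidy)

# The p-value used in the sentence can be picked out by term.
AOVTidy$p.value[AOVTidy$term == "Division"]
```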
Linear regression uses the same formula notation as ANOVA, with the `~`. Because `LifeExp` (the y variable) is continuous, fitting a linear regression with `glm()` means using the Gaussian family (`family = gaussian`, which is also the default).
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 67.5813178 | 1.3275714 | 50.90597 | 0.0000000 |
Income | 0.0007433 | 0.0002965 | 2.50694 | 0.0156173 |
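The table above could come from a model along these lines. This is a sketch, assuming `ds` is a stand-in built from R's built-in `state.x77`; the names `Model` and `ModelTidy` are hypothetical.

```r
library(broom)

# Assumed stand-in data built from the built-in state.x77 matrix.
ds <- data.frame(
  Income  = state.x77[, "Income"],
  LifeExp = state.x77[, "Life Exp"]
)

# Gaussian family with the default identity link is ordinary linear regression.
Model <- glm(LifeExp ~ Income, data = ds, family = gaussian)

# One row per coefficient: estimate, std.error, statistic, p.value.
ModelTidy <- tidy(Model)
print(ModelTidy)
```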
You can use math formulas inside R Markdown files too:
$$y = Xb + e$$
The regression formula is similar to the formula style in R with the `~`.
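To make the link between $y = Xb + e$ and the R formula concrete, the sketch below pulls `X`, `b` and `e` out of a fitted model and checks that they add back up to `y`. It assumes `ds` is a stand-in built from R's built-in `state.x77`; the object names are hypothetical.

```r
# Assumed stand-in data built from the built-in state.x77 matrix.
ds <- data.frame(
  Income  = state.x77[, "Income"],
  LifeExp = state.x77[, "Life Exp"]
)

fit <- glm(LifeExp ~ Income, data = ds, family = gaussian)

X <- model.matrix(fit)                   # design matrix: intercept column plus Income
b <- coef(fit)                           # estimated coefficients
e <- residuals(fit, type = "response")   # observed minus fitted values

# Rebuild y as Xb + e and compare with the observed outcome.
y_rebuilt <- as.numeric(X %*% b + e)
max(abs(y_rebuilt - ds$LifeExp))         # should be near zero, up to floating-point error
```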
If you want to go through each step of how linear regression works, I wrote a blog post on understanding linear regression. Check it out.
We can add covariates (confounders, etc.) to the model, just as in the math formula.
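Written out for the variables used here, and assuming the added covariate is `Population` (as in the R output that follows), the formula would look something like this, with $b_0$, $b_1$ and $b_2$ as illustrative names for the coefficients:

$$\text{LifeExp} = b_0 + b_1\,\text{Income} + b_2\,\text{Population} + e$$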
And in R:
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 67.4737218 | 1.3302039 | 50.724345 | 0.0000000 |
Income | 0.0008094 | 0.0003028 | 2.672564 | 0.0103146 |
Population | -0.0000437 | 0.0000417 | -1.047401 | 0.3002713 |
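In the R formula, covariates are added with `+` on the right-hand side of the `~`. A minimal sketch, assuming `ds` is a stand-in built from R's built-in `state.x77`; the names `Model2` and `Model2Tidy` are hypothetical:

```r
library(broom)

# Assumed stand-in data built from the built-in state.x77 matrix.
ds <- data.frame(
  Population = state.x77[, "Population"],
  Income     = state.x77[, "Income"],
  LifeExp    = state.x77[, "Life Exp"]
)

# Income is the main predictor; Population is added as a covariate.
Model2 <- glm(LifeExp ~ Income + Population, data = ds, family = gaussian)

Model2Tidy <- tidy(Model2)
print(Model2Tidy)
```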
If you want to display only certain variables in a table, you can do that more easily with the `dplyr` package:
Variable | Beta | P |
---|---|---|
(Intercept) | 67.4737218 | 0.0000000 |
Income | 0.0008094 | 0.0103146 |
Population | -0.0000437 | 0.3002713 |
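One way to get a table like the one above is to pass the tidied model to `select()`, which can rename columns while choosing them. This is a sketch, assuming `ds` is a stand-in built from R's built-in `state.x77`; the names `Model2` and `BetaTable` are hypothetical.

```r
library(broom)
library(dplyr)

# Assumed stand-in data built from the built-in state.x77 matrix.
ds <- data.frame(
  Population = state.x77[, "Population"],
  Income     = state.x77[, "Income"],
  LifeExp    = state.x77[, "Life Exp"]
)

Model2 <- glm(LifeExp ~ Income + Population, data = ds, family = gaussian)

# Keep three columns and rename them in the same step (new = old).
BetaTable <- tidy(Model2) %>%
  select(Variable = term, Beta = estimate, P = p.value)

print(BetaTable)
```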