Please read and follow these instructions in order to try these past workshops on your own.

Tidy data and importing your data in R

In R, as in most statistical and programming languages, how your data is structured and stored impacts how or whether you can do your analysis. The best form for a dataset is in a tidy format. Tidy datasets are where each column contains a variable measurements and each row contains a single observation at a specific time. So, if you work with animals, each row would be the animal. For cells, each row would be the cell plate/dish. For humans, each row would be an individual. If you collect multiple time points on the animals/humans, they would take up a new row for each measurement time.

This tidy format for the data makes it easier to analyze the dataset and to plot it. It also makes it very easy to import into R. The best time to start using this format for storing your data is right away! But, sometimes your data is created by a machine or from the measurement instrument. In those cases, you’ll have to wrangle (‘manipulate’) the data into the appropriate tidy format before analyzing it.

Code used

Quick note: If you want to know more about a command (aka. function), use ?. For instance, use ?install.packages to learn more about the install.packages() function.

# install the package if you haven't already
# install.packages('readr')
# Load the readr package
library(readr)

We’ll be using read_csv() from the readr package, which is similar to the read.csv() function in base R but is slower and does more things that you usually don’t want. We can use read_csv() to also get a csv file, which is a comma separated value file, from the internet. To get a better look at the csv file, we can save it to a file by using write_csv(). The two datasets below are fake datasets I made up.

# This is an example of a messy dataset
web_link <- "http://codeasmanuscript.github.io/code-along/data/tidy-data-1.csv"
tidy_data <- read_csv(web_link)
#> Warning: Missing column names filled in: 'X1' [1], 'X3' [3], 'X4' [4],
#> 'X5' [5], 'X6' [6], 'X7' [7], 'X9' [9], 'X10' [10], 'X11' [11], 'X12' [12],
#> 'X13' [13]
#> Parsed with column specification:
#> cols(
#>   X1 = col_character(),
#>   Exercise = col_integer(),
#>   X3 = col_integer(),
#>   X4 = col_integer(),
#>   X5 = col_integer(),
#>   X6 = col_integer(),
#>   X7 = col_integer(),
#>   Control = col_integer(),
#>   X9 = col_integer(),
#>   X10 = col_integer(),
#>   X11 = col_integer(),
#>   X12 = col_integer(),
#>   X13 = col_integer()
#> )
# Save it into the data/ folder using write_csv
#write_csv(tidy_data, 'data/tidy-data-1.csv')

# Assign the tidy-data into the object ds using <-
# Import directly from a file:
ds <- read_csv('data/tidy-data-1.csv')
#> Warning: Missing column names filled in: 'X1' [1], 'X3' [3], 'X4' [4],
#> 'X5' [5], 'X6' [6], 'X7' [7], 'X9' [9], 'X10' [10], 'X11' [11], 'X12' [12],
#> 'X13' [13]
#> Parsed with column specification:
#> cols(
#>   X1 = col_character(),
#>   Exercise = col_integer(),
#>   X3 = col_integer(),
#>   X4 = col_integer(),
#>   X5 = col_integer(),
#>   X6 = col_integer(),
#>   X7 = col_integer(),
#>   Control = col_integer(),
#>   X9 = col_integer(),
#>   X10 = col_integer(),
#>   X11 = col_integer(),
#>   X12 = col_integer(),
#>   X13 = col_integer()
#> )
# Let's take a peek at this file too.
web_link2 <- "http://codeasmanuscript.github.io/code-along/data/tidy-data-2.csv"
read_csv(web_link2)
#> Parsed with column specification:
#> cols(
#>   ID_case = col_integer(),
#>   weight_case = col_integer(),
#>   testosterone_case = col_integer(),
#>   ID_control = col_integer(),
#>   weight_control = col_integer(),
#>   testosterone_control = col_integer()
#> )
#> # A tibble: 3 × 6
#>   ID_case weight_case testosterone_case ID_control weight_control
#>     <int>       <int>             <int>      <int>          <int>
#> 1    1241         110                34       2144            200
#> 2    1456         150                56       2315            140
#> 3    2679         165                43       3245            160
#> # ... with 1 more variables: testosterone_control <int>

If you have a dataset that is tab separated, you can use read_tsv(). Since this is a tab separated (or just plain text, txt, file), we need to save it to the computer using writeLines().

read_tsv("http://codeasmanuscript.github.io/code-along/data/tidy-data-3.txt")
#> Parsed with column specification:
#> cols(
#>   `##BLOCKS= 1` = col_character()
#> )
#> Warning: 11 parsing failures.
#> row col  expected     actual
#>   1  -- 1 columns 23 columns
#>   2  -- 1 columns 16 columns
#>   3  -- 1 columns 16 columns
#>   4  -- 1 columns 16 columns
#>   5  -- 1 columns 16 columns
#> ... ... ......... ..........
#> See problems(...) for more details.
#> # A tibble: 12 × 1
#>    `##BLOCKS= 1`
#>            <chr>
#> 1         Plate:
#> 2           <NA>
#> 3           <NA>
#> 4           <NA>
#> 5           <NA>
#> 6           <NA>
#> 7           <NA>
#> 8           <NA>
#> 9           <NA>
#> 10          <NA>
#> 11          <NA>
#> 12          ~End
# Save it using:
# writeLines(
#     read_lines("http://codeasmanuscript.github.io/code-along/data/tidy-data-3.txt"), 
#     'data/tidy-data-3.txt')

These next two datasets are data Zhila in our lab collected for PROMISE, which is the completely raw data file sent out from the machine she was using for measurements. You can see when you open the tidy-data-3.txt file that there is a bunch of stuff you probably don’t need. So, tell R to not include these stuff when importing. Each arguments (options) in the read_tsv() function is explained below:

  • comment argument says that anything starting with # can be ignored
  • skip says to skip the first line when importing
  • n_max says to only import 8 lines after the header line (i.e. the column names)
  • [-1:-2] is called indexing (see ?'[') and says to exclude (minus) the first two columns
  • [c(-13, -14)] (or [-13:-14]) takes the previous index results and says to exclude columns 13 and 14
ds <- read_tsv("http://codeasmanuscript.github.io/code-along/data/tidy-data-3.txt",
               comment = '#', skip = 1, n_max = 8)[-1:-2][c(-13, -14)]
#> Warning: Missing column names filled in: 'X1' [1], 'X15' [15], 'X16' [16]
#> Error in make.names(x): invalid multibyte string 1
# This adds a column called PlateRow to the ds object
ds$PlateRow <- 1:8
#> Error in `$<-.data.frame`(`*tmp*`, "PlateRow", value = 1:8): replacement has 8 rows, data has 3
ds
#> # A tibble: 3 × 13
#>      X1 Exercise    X3    X4    X5    X6    X7 Control    X9   X10   X11
#>   <chr>    <int> <int> <int> <int> <int> <int>   <int> <int> <int> <int>
#> 1 DietA       34   132    54    65    33    NA      31    12   124    45
#> 2 DietB      111    24    38    12    NA    NA      23     8    23   213
#> 3 DietC       43   150   123    43   500   321     425    23   235   534
#> # ... with 2 more variables: X12 <int>, X13 <int>

In this case, to get a tidy dataset, we’ll need to re-arrange the dataset using another package.

#install.packages('tidyr')
library(tidyr)
# gather takes a wide dataset can converts it to a long dataset. In this case
# I don't want PlateRow to be converted, just all other columns.
ds2 <- gather(ds, PlateColumn, Wavelength, -PlateRow)
#> Error in eval(expr, envir, enclos): object 'PlateRow' not found
ds2
#> Error in eval(expr, envir, enclos): object 'ds2' not found

And lastly, to look at this next dataset:

# Save it using:
# writeLines(
#     read_lines("http://codeasmanuscript.github.io/code-along/data/tidy-data-4.csv"), 
#     'data/tidy-data-4.csv')

# [1:20, ] means to keep only the first 20 *rows*, in which a comma must follow
read_csv("http://codeasmanuscript.github.io/code-along/data/tidy-data-4.csv",
         skip = 22)[1:20, ]
#> Parsed with column specification:
#> cols(
#>   `Plate Name` = col_character(),
#>   Sample = col_character(),
#>   Assay = col_character(),
#>   Well = col_character(),
#>   Spot = col_integer(),
#>   Dilution = col_integer(),
#>   Concentration = col_double(),
#>   Signal = col_integer(),
#>   `Adjusted Signal` = col_integer(),
#>   Mean = col_integer(),
#>   `Adj. Sig. Mean` = col_double(),
#>   CV = col_double(),
#>   `% Recovery` = col_double(),
#>   `% Recovery Mean` = col_double(),
#>   `Calc. Concentration` = col_double(),
#>   `Calc. Conc. Mean` = col_double(),
#>   `Calc. Conc. CV` = col_double()
#> )
#> # A tibble: 20 × 17
#>        `Plate Name` Sample       Assay  Well  Spot Dilution Concentration
#>               <chr>  <chr>       <chr> <chr> <int>    <int>         <dbl>
#> 1  Plate_25E5MAPB94   S001 Adiponectin   A01     1       NA      1.28e-02
#> 2  Plate_25E5MAPB94   S001 Adiponectin   A02     1       NA      1.28e-02
#> 3  Plate_25E5MAPB94   S002 Adiponectin   B02     1       NA      6.40e-02
#> 4  Plate_25E5MAPB94   S002 Adiponectin   B01     1       NA      6.40e-02
#> 5  Plate_25E5MAPB94   S003 Adiponectin   C02     1       NA      3.20e-01
#> 6  Plate_25E5MAPB94   S003 Adiponectin   C01     1       NA      3.20e-01
#> 7  Plate_25E5MAPB94   S004 Adiponectin   D01     1       NA      1.60e+00
#> 8  Plate_25E5MAPB94   S004 Adiponectin   D02     1       NA      1.60e+00
#> 9  Plate_25E5MAPB94   S005 Adiponectin   E02     1       NA      8.00e+00
#> 10 Plate_25E5MAPB94   S005 Adiponectin   E01     1       NA      8.00e+00
#> 11 Plate_25E5MAPB94   S006 Adiponectin   F01     1       NA      4.00e+01
#> 12 Plate_25E5MAPB94   S006 Adiponectin   F02     1       NA      4.00e+01
#> 13 Plate_25E5MAPB94   S007 Adiponectin   G01     1       NA      2.00e+02
#> 14 Plate_25E5MAPB94   S007 Adiponectin   G02     1       NA      2.00e+02
#> 15 Plate_25E5MAPB94   S008 Adiponectin   H02     1       NA      1.00e+03
#> 16 Plate_25E5MAPB94   S008 Adiponectin   H01     1       NA      1.00e+03
#> 17 Plate_25E5MAPB94   U001 Adiponectin   A03     1        1            NA
#> 18 Plate_25E5MAPB94   U001 Adiponectin   A04     1        1            NA
#> 19 Plate_25E5MAPB94   U002 Adiponectin   A05     1        1            NA
#> 20 Plate_25E5MAPB94   U002 Adiponectin   A06     1        1            NA
#> # ... with 10 more variables: Signal <int>, `Adjusted Signal` <int>,
#> #   Mean <int>, `Adj. Sig. Mean` <dbl>, CV <dbl>, `% Recovery` <dbl>, `%
#> #   Recovery Mean` <dbl>, `Calc. Concentration` <dbl>, `Calc. Conc.
#> #   Mean` <dbl>, `Calc. Conc. CV` <dbl>

Resources

Written on June 16, 2016 (Updated on: February 28, 2017)