Please read and follow these instructions in order to try these past workshops on your own.
Tidy data and importing your data in R
In R, as in most statistical and programming languages, how your data is
structured and stored affects how, or even whether, you can do your analysis. The
best form for a dataset is a tidy format. In a tidy dataset, each column
contains a single variable (a measurement) and each row contains a single
observation at a specific time. So, if you work with animals, each row would be
an animal. For cells, each row would be a cell plate/dish. For humans, each row
would be an individual. If you collect data at multiple time points on the
animals/humans, each measurement time takes up a new row.
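As a rough illustration of that layout, here is a small made-up data frame. The animal names, column names, and values are all hypothetical; the point is only that each column holds one variable and each animal gets a new row for every measurement time.

```r
# Hypothetical measurements: two animals, each weighed at two time points.
# Each row is one animal at one time; each column is one variable.
tidy_example <- data.frame(
  animal_id = c("mouse1", "mouse1", "mouse2", "mouse2"),
  week      = c(0, 4, 0, 4),
  weight_g  = c(21.3, 24.8, 20.1, 23.5)
)

tidy_example
```

Storing a second weighing as a new row (rather than as a new column such as weight_week4) is what keeps the dataset tidy.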
This tidy format for the data makes it easier to analyze the dataset and to plot
it. It also makes it very easy to import into R. The best time to start using
this format for storing your data is right away! But sometimes your data is
generated by a machine or a measurement instrument. In those cases, you’ll
have to wrangle (‘manipulate’) the data into the appropriate tidy format before
analyzing it.
Code used
Quick note: If you want to know more about a command (a.k.a. a function), put ?
in front of its name. For instance, use ?install.packages to learn more about the
install.packages() function.
We’ll be using read_csv() from the readr package, which is similar to the
read.csv() function in base R but is faster and does fewer of the things that
you usually don’t want. We can also use read_csv() to get a csv (comma
separated values) file from the internet. To get a better look at the csv
file, we can save it to the computer by using write_csv(). The two datasets
below are fake datasets I made up.
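A minimal sketch of that read-then-save pattern is below. It assumes the readr package is installed. Since the workshop's own file locations are not reproduced here, a tiny made-up csv file is first written to a temporary location to stand in for the downloaded one; the column names and values are hypothetical.

```r
library(readr)

# Hypothetical stand-in for the workshop dataset: a tiny csv in a temp file.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,group,weight_g",
    "1,control,21.3",
    "2,treated,24.8"),
  csv_path
)

# read_csv() takes a local path; a URL to a csv file can be given the same way.
ds <- read_csv(csv_path)
ds

# Save a copy to the computer so it can be opened and looked at.
copy_path <- tempfile(fileext = ".csv")
write_csv(ds, copy_path)
readLines(copy_path)
```

In your own work you would replace the temporary paths with a real file name, for example a file inside your project folder.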
If you have a dataset that is tab separated, you can use read_tsv(). Since
this is a tab separated file (that is, just a plain text, or txt, file), we need
to save it to the computer using writeLines().
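Here is a small sketch of the same idea for a tab separated file, again assuming readr is installed. The file contents are made up, and readLines() is used here as one way to get the raw lines that writeLines() then saves; that pairing is an assumption rather than something taken from the workshop files.

```r
library(readr)

# Hypothetical tab separated text file, written to a temporary location.
txt_path <- tempfile(fileext = ".txt")
writeLines(
  c("id\tgroup\tweight_g",
    "1\tcontrol\t21.3",
    "2\ttreated\t24.8"),
  txt_path
)

# Copy the plain text to another file, line by line, so it can be inspected.
raw_lines <- readLines(txt_path)
local_copy <- tempfile(fileext = ".txt")
writeLines(raw_lines, local_copy)

# Import the tab separated file as a data frame.
ds_tab <- read_tsv(local_copy)
ds_tab
```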
These next two datasets are data that Zhila in our lab collected for PROMISE; they are
the completely raw data files sent out from the machine she was using for measurements.
You can see when you open the tidy-data-3.txt file that there is a bunch of
stuff you probably don’t need. So, tell R not to include this stuff when importing.
Each argument (option) in the read_tsv() function is explained below:
comment says that any line starting with # can be ignored
skip says to skip the first line when importing
n_max says to only import 8 lines after the header line (i.e. the column
names)
[-1:-2] is called indexing (see ?'[') and says to exclude (minus) the
first two columns
[c(-13, -14)] (or [-13:-14]) takes the previous index results and says to
exclude columns 13 and 14
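The options above can be sketched on a small made-up file. This is not the real tidy-data-3.txt: the contents, the column names, and the specific numbers passed to skip, n_max, and the indexing are hypothetical and chosen to fit this toy file (for example, n_max is 3 here rather than 8, and the second index drops columns 4 and 5 rather than 13 and 14). It assumes readr is installed.

```r
library(readr)

# Hypothetical raw machine output: a junk first line, a comment line,
# the header, three data rows, and a summary row we do not want.
raw_path <- tempfile(fileext = ".txt")
writeLines(
  c("Instrument export (hypothetical first line)",
    "# notes written by the machine",
    "well\tplate\tid\tglucose\tinsulin\tqc1\tqc2",
    "A1\tP1\t1\t5.1\t40\tok\tok",
    "A2\tP1\t2\t5.6\t52\tok\tok",
    "A3\tP1\t3\t4.9\t38\tok\tok",
    "Summary\t\t\t5.2\t43.3\t\t"),
  raw_path
)

raw <- read_tsv(
  raw_path,
  comment = "#",  # ignore lines starting with #
  skip = 1,       # skip the first line of the file
  n_max = 3       # only import 3 lines after the header line
)

# Exclude the first two columns, then exclude columns 4 and 5 of what is left.
trimmed <- raw[-1:-2][c(-4, -5)]
trimmed
```

Because the second index is applied to the result of the first, its column numbers refer to the positions after the first two columns have already been dropped.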
In this case, to get a tidy dataset, we’ll need to re-arrange the dataset using
another package.
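One possible way to do that kind of re-arranging is sketched below. It assumes the tidyr package (version 1.0 or later, for pivot_longer()) is the other package; the text above does not name it, so treat that choice as an assumption. The data frame and its column names are made up.

```r
library(tidyr)

# Hypothetical wide data: one row per person, one column per measurement time.
wide <- data.frame(
  id          = c(1, 2),
  glucose_0   = c(5.1, 5.6),
  glucose_120 = c(7.8, 8.4)
)

# Re-arrange so each row is one person at one measurement time.
long <- pivot_longer(
  wide,
  cols = c(glucose_0, glucose_120),
  names_to = "minute",        # old column names become values in this column
  names_prefix = "glucose_",  # drop the shared prefix from those names
  values_to = "glucose"       # the measurements go in this column
)
long
```

After this step the minute column is still text; converting it to a number (for example with as.numeric()) would usually be the next step before plotting.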