Please read and follow these instructions in order to try these past workshops on your own.

Principles of managing and analyzing data

Main points

Have a plan and structure in place before you start collecting the data!
Keep raw data raw and untouched
Use plain text/open file formats (csv, txt)
Use version control
Use scripts to process/clean your data
Avoid spreadsheets
Keep data ‘tidy’
Have a single folder to store the master copy of the data
Have a data dictionary/metadata
Put your data online (or at least pretend you will eventually)

The foundation: folders and files!

Look into a Data Management Plan (DMP) before collecting data
Use a consistent file and folder naming system and date format (use the international standard! ISO 8601, YYYY-MM-DD).

Example:

                               levelName
1 project-name                          
2  °--data-raw                          
3      ¦--protein-measure-2017-01-17.txt
4      ¦--protein-measure-2017-01-18.txt
5      ¦--protein-measure-2017-01-19.txt
6      ¦--anthropometrics.csv           
7      °--blood-measures.csv
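
One practical payoff of ISO 8601 dates in file names is that a plain alphabetical sort is also a chronological sort. The following is a minimal Python sketch of that idea (the workshop itself works with R scripts); the helper name `measurement_filename` is hypothetical and only the `protein-measure` file names come from the example above.

```python
from datetime import date


def measurement_filename(prefix: str, day: date, ext: str = "txt") -> str:
    """Build a file name such as protein-measure-2017-01-17.txt (hypothetical helper)."""
    # date.isoformat() always returns YYYY-MM-DD.
    return f"{prefix}-{day.isoformat()}.{ext}"


# Deliberately created out of order.
names = [
    measurement_filename("protein-measure", date(2017, 1, 19)),
    measurement_filename("protein-measure", date(2017, 1, 17)),
    measurement_filename("protein-measure", date(2017, 1, 18)),
]

# Because the date runs from largest unit (year) to smallest (day),
# sorting the names as text also sorts them by date.
sorted_names = sorted(names)
print(sorted_names)
```

Formats such as DD-MM-YYYY do not have this property, which is one reason to prefer the international standard.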

Data entry

Use rules for data entry to prevent data entry errors
- Maybe have two people do data entry to cross-check?
- Use tools such as Google Forms to force proper data type (only numbers, only less than 10, etc)
Use ‘flat’ files as often as you can (csv, spreadsheet-style)
- Has only rows and columns, 2-dimensional
- First row is the column/variable names
- Variable names should be informative and written in CamelCase (e.g. FatIntake)
Avoid spreadsheets as much as you can since they often (wrongly) try to guess what the user wants (e.g. change date formats on you)
Use international standard for dates (YYYY-MM-DD)
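
If the entry tool cannot enforce the rules itself, a small script can check the entered rows afterwards. This is only a sketch in Python, under stated assumptions: the column names `VisitDate` and `PercentFatIntake` and the 0 to 100 range are made up for illustration, and `check_row` is a hypothetical helper.

```python
import re
from datetime import date

# Four digits, two digits, two digits, separated by hyphens.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_row(row: dict) -> list:
    """Return a list of problems found in one entered row (empty if none)."""
    problems = []

    # Rule 1: dates must be written as YYYY-MM-DD and be real calendar dates.
    value = row["VisitDate"]
    if ISO_DATE.fullmatch(value) is None:
        problems.append(f"VisitDate is not YYYY-MM-DD: {value!r}")
    else:
        try:
            date.fromisoformat(value)
        except ValueError:
            problems.append(f"VisitDate is not a real date: {value!r}")

    # Rule 2: the percentage must be a number between 0 and 100 (assumed range).
    try:
        fat = float(row["PercentFatIntake"])
    except ValueError:
        problems.append(f"PercentFatIntake is not a number: {row['PercentFatIntake']!r}")
    else:
        if not 0 <= fat <= 100:
            problems.append(f"PercentFatIntake is out of range: {fat}")

    return problems


good = check_row({"VisitDate": "2017-01-18", "PercentFatIntake": "35.6"})
bad = check_row({"VisitDate": "18/01/2017", "PercentFatIntake": "abc"})
print(good)
print(bad)
```

Returning a list of problems rather than stopping at the first one makes it easier to hand the whole list back to whoever entered the data.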

Storing the data

Keep the raw data completely raw and untouched
- Prevents accidental deletion
Look into the FAIR concept (Findable, Accessible, Interoperable, Reusable)
Use plain text or open file formats (txt, csv, sqlite)
- NOT xls, xlsx, Access database, etc (these are not FAIR since they depend on purchased software)
- Allows re-use in future, inter-platform (Mac, Windows, Linux), software independent (don’t need Word, etc)
Strongly recommended: Keep under version control (Git, or use backups frequently)
(optional) Have a single ‘master’ data file, then a copy in your folder for your manuscript, etc.
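
One way to confirm that the raw data really has stayed untouched is to record a checksum for each raw file and compare it later. The sketch below is in Python and uses a temporary folder so it can run anywhere; the helpers `snapshot` and `changed_files` are hypothetical, and checksums are offered here as one possible approach, not something the workshop prescribes. Version control gives you the same guarantee with more features.

```python
import hashlib
import tempfile
from pathlib import Path


def checksum(path: Path) -> str:
    """SHA-256 of the file contents, as a hex string."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(folder: Path) -> dict:
    """Map each file name in the folder to its checksum."""
    return {p.name: checksum(p) for p in sorted(folder.iterdir()) if p.is_file()}


def changed_files(folder: Path, recorded: dict) -> list:
    """Names of files that were added, removed, or edited since the snapshot."""
    current = snapshot(folder)
    all_names = recorded.keys() | current.keys()
    return sorted(n for n in all_names if recorded.get(n) != current.get(n))


# Demonstration in a throwaway folder standing in for data-raw.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "data-raw"
    raw.mkdir()
    target = raw / "blood-measures.csv"
    target.write_text("SubjectID,Time,BloodGlucose\n1,1,6.66\n")

    recorded = snapshot(raw)
    unchanged = changed_files(raw, recorded)

    # Simulate someone "fixing" a value directly in the raw file.
    target.write_text("SubjectID,Time,BloodGlucose\n1,1,6.67\n")
    edited = changed_files(raw, recorded)

print(unchanged)
print(edited)
```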

Version control

A way of managing and tracking changes to your files and data
Git is the most popular tool and has the most documentation
Lets you recover from accidental deletion of values in the data or of files
(this is a whole topic on its own)

                               levelName
1 project-name                          
2  ¦--data-raw                          
3  ¦   ¦--protein-measure-2017-01-17.txt
4  ¦   ¦--protein-measure-2017-01-18.txt
5  ¦   ¦--protein-measure-2017-01-19.txt
6  ¦   ¦--anthropometrics.csv           
7  ¦   °--blood-measures.csv            
8  °--.git

Use scripts (e.g. written in R) to process and clean data

                                levelName
1  project-name                          
2   ¦--data-raw                          
3   ¦   ¦--protein-measure-2017-01-17.txt
4   ¦   ¦--protein-measure-2017-01-18.txt
5   ¦   ¦--protein-measure-2017-01-19.txt
6   ¦   ¦--anthropometrics.csv           
7   ¦   °--blood-measures.csv            
8   ¦--.git                              
9   ¦--R                                 
10  ¦   ¦--process-protein.R             
11  ¦   ¦--process-blood-measures.R      
12  ¦   °--merge-data.R                  
13  °--data                              
14      °--project-data.csv

This is a well established workflow to use, so use it! :D
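
To make the workflow concrete, here is a rough sketch of what a merge step such as `merge-data.R` might do, written in Python for illustration. Everything specific is an assumption: the column names, the two tiny inline data sets standing in for `anthropometrics.csv` and `blood-measures.csv`, and the helper `merge_by_id`. A real script would read from `data-raw/` and write to `data/project-data.csv`.

```python
import csv
import io

# Stand-ins for the raw files; the raw data is only ever read, never edited.
anthropometrics_csv = "ParticipantID,WeightKg\n1,84.4\n2,52.5\n"
blood_measures_csv = "ParticipantID,BloodGlucose\n1,6.66\n2,5.65\n"


def read_rows(text: str) -> list:
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_by_id(left: list, right: list, key: str = "ParticipantID") -> list:
    """Left join: keep every row of `left`, add matching columns from `right`."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup.get(row[key], {})} for row in left]


merged = merge_by_id(read_rows(anthropometrics_csv), read_rows(blood_measures_csv))

# Write the processed data as plain CSV (here to a string instead of a file).
out = io.StringIO()
writer = csv.DictWriter(
    out,
    fieldnames=["ParticipantID", "WeightKg", "BloodGlucose"],
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(merged)
project_data_csv = out.getvalue()
print(project_data_csv)
```

Because every step lives in the script, the processed file can be deleted and rebuilt from the raw data at any time.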

Structuring the data:

Use informative column names
Each column should be a single data type (only numbers or only text)
If there are notes, create a new column (don’t use the note feature in Excel)
Try to keep it ‘tidy’ (see paper on ‘tidy’ data)
- Each row is an observation at one time
- Each column is (fairly) unique (e.g. don’t use WeightAt10, WeightAt15, WeightAt20, etc, use Weight and Age)
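
The WeightAt10/WeightAt15/WeightAt20 example can be turned into the tidy Weight and Age layout with a short script. The following Python sketch shows one way to do it; the weights are invented for illustration and `wide_to_long` is a hypothetical helper.

```python
# 'Wide' layout: one column per age (values are made up).
wide_rows = [
    {"SubjectID": 1, "WeightAt10": 20.1, "WeightAt15": 24.3, "WeightAt20": 27.8},
    {"SubjectID": 2, "WeightAt10": 19.4, "WeightAt15": 23.9, "WeightAt20": 26.5},
]


def wide_to_long(rows: list, prefix: str = "WeightAt") -> list:
    """Turn each WeightAt<age> column into its own row with Age and Weight."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column.startswith(prefix):
                long_rows.append(
                    {
                        "SubjectID": row["SubjectID"],
                        # The age is the number that follows the prefix.
                        "Age": int(column[len(prefix):]),
                        "Weight": value,
                    }
                )
    return long_rows


long_rows = wide_to_long(wide_rows)
for row in long_rows:
    print(row)
```

Each row of the result is now one observation at one time, so adding a measurement at a new age means adding rows, not columns.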

Structuring the data, examples

Multiple time points per subject (e.g. blood draws):

   SubjectID Time BloodGlucose
1          1    1         6.66
2          1    2         6.78
3          1    3         5.84
4          1    4         6.48
5          1    5         4.67
6          2    1         5.65
7          2    2         6.87
8          2    3         6.97
9          2    4         6.31
10         2    5         5.43

Multiple time points, with each subject measured only once (e.g. weighing the brain, which of course only happens once):

   MouseID Week BrainWeightG
1        1   10       138.84
2        2   10       140.18
3        3   10       123.87
4        4   10       142.96
5        5   10       133.36
6        6   15       123.35
7        7   15       132.81
8        8   15       141.17
9        9   15       149.23
10      10   15       149.22

Or if it’s a cross-sectional study:

   ParticipantID WeightKg PercentFatIntake
1              1     84.4             49.0
2              2     52.5             41.3
3              3     69.1             32.9
4              4     59.2             43.7
5              5     66.7             38.9
6              6     63.2             28.4
7              7     65.9             33.6
8              8     96.0             30.2
9              9     60.0             42.3
10            10     66.1             34.4

Or multiple time points for a clinical trial:

   ParticipantID Months BloodPressure PercentFatIntake
1              1      6         112.1             35.6
2              1     12         107.2             33.4
3              2      6          97.1             35.6
4              2     12         112.7             49.3
5              3      6         102.9             36.1
6              3     12          94.8             42.8
7              4      6         102.5             29.2
8              4     12          96.8             35.7
9              5      6         113.8             26.2
10             5     12          97.1             30.3

Write up a data dictionary/metadata

As often as possible, have the information about the data inside the data itself:
- Example: For simple units, include in variable name like WeightKg or TimeWeeks
- Example: For categorical variables (sex, ethnicity), include the actual meaning in the data like "European", "African", "East Asian" etc rather than 1, 2, 3
If not possible, write up a data dictionary (this saves soooo much time later)
- Use plain text (txt, csv, markdown) rather than Word Doc or Excel etc. This ensures everyone, using any software, can read it or use it.
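
A data dictionary can itself be a small CSV file, which also makes it possible to check automatically that every column in the data is documented. This Python sketch assumes a three-column layout (`Variable`, `Description`, `Units`); that layout, the descriptions, and the helpers `dictionary_as_csv` and `undocumented` are illustrative choices, not a standard.

```python
import csv
import io

# One entry per variable; descriptions here are illustrative guesses.
data_dictionary = [
    {"Variable": "ParticipantID", "Description": "Unique participant identifier", "Units": ""},
    {"Variable": "WeightKg", "Description": "Body weight", "Units": "kg"},
    {"Variable": "PercentFatIntake", "Description": "Fat intake as a percentage", "Units": "%"},
]


def dictionary_as_csv(entries: list) -> str:
    """Render the data dictionary as plain CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["Variable", "Description", "Units"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(entries)
    return out.getvalue()


def undocumented(columns: list, entries: list) -> list:
    """Columns that appear in the data but not in the data dictionary."""
    documented = {entry["Variable"] for entry in entries}
    return [column for column in columns if column not in documented]


dictionary_csv = dictionary_as_csv(data_dictionary)
missing = undocumented(
    ["ParticipantID", "WeightKg", "PercentFatIntake", "BloodPressure"],
    data_dictionary,
)
print(dictionary_csv)
print(missing)
```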

Consider putting your data online

Or at least pretend you will be asked to do that
Biggest advantage: It forces you to manage your data as correctly and carefully as possible. No one wants the embarrassment of someone else seeing their data and saying… what’s going on here?
- (but don’t let this fear stop you! The simple fact that you put it up will earn you respect and you’ll likely get good feedback on how to improve the data)
Use public repositories like GitHub, figshare, Dryad
There is no evidence that you will get scooped. There is evidence that you will get more citations by doing so.

Use scripts to re-arrange/wrangle your data

If you need to re-arrange or prep the data for a statistical technique, make sure to use scripts, as they keep a history of what you’ve done to the data. Spreadsheets do not do that, so be very wary of them.
Learn about ‘tidy’ data, which makes exploring, plotting, and analyzing your data much easier (see Hadley Wickham’s paper on tidy data).

Resources

FAIR concept:
- http://www.science.gc.ca/eic/site/063.nsf/eng/h_83F7624E.html?OpenDocument
- http://www.nature.com/articles/sdata201618
- https://www.ncbi.nlm.nih.gov/pubmed/26978244
‘Tidy’ data concept paper

Written on January 18, 2017