Notes – Tidy Data

by Hadley Wickham

Source: the white paper these notes are drawn from.

p. 1
- Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools is needed to deal with a wide range of un-tidy datasets.
p. 4
- Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.
- p. 5
  - Tidy data is particularly well suited for vectorised programming languages like R, because the layout ensures that values of different variables from the same observation are always paired.
  - Fixed variables should come first, followed by measured variables, each ordered so that related variables are contiguous. Rows can then be ordered by the first variable, breaking ties with the second and subsequent (fixed) variables.
  - Most messy datasets, including types of messiness not explicitly described in the paper, can be tidied with a small set of tools: melting, string splitting, and casting.
  - The complete datasets and the R code used to tidy them are available online at https://github.com/hadley/tidy-data.
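
The ordering convention noted above (fixed variables first, then measured variables, with rows sorted by the fixed variables) can be sketched in base R. This is only an illustration: the `measurements` data frame and its column names are made up for the example and do not come from the paper.

```r
# Hypothetical data: `site` and `year` are fixed variables, `value` is measured.
measurements <- data.frame(
  value = c(3.1, 2.4, 5.0, 1.7),
  site  = c("b", "a", "a", "b"),
  year  = c(2001, 2001, 2000, 2000),
  stringsAsFactors = FALSE
)

fixed    <- c("site", "year")
measured <- "value"

# Put fixed variables first, then measured variables.
ordered <- measurements[c(fixed, measured)]

# Order rows by the first fixed variable, breaking ties with the second.
ordered <- ordered[order(ordered$site, ordered$year), ]
```
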
- p. 6
  - The result of melting is a molten dataset.
- p. 12
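
Melting, as the paper uses the term, turns column headings that are really values into rows. The sketch below does this by hand in base R so that it runs without extra packages; the paper's own code relies on a reshaping package instead. The `wide` table, its column names, and the counts are invented for illustration.

```r
# Hypothetical wide table: the column headings `low` and `high` are
# values of an income variable, not variables in their own right.
wide <- data.frame(
  religion = c("Agnostic", "Atheist"),
  low      = c(27, 12),
  high     = c(34, 27),
  stringsAsFactors = FALSE
)

# Columns to melt: everything except the identifying variable.
value_cols <- setdiff(names(wide), "religion")

# Build the molten dataset: one row per (religion, income) combination.
molten <- data.frame(
  religion = rep(wide$religion, times = length(value_cols)),
  income   = rep(value_cols, each = nrow(wide)),
  freq     = unlist(wide[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
```

Each row of `molten` now holds one observation, with `income` and `freq` as ordinary variables.
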
  - 3.5. One type in multiple tables
    - It’s also common to find data values about a single type of observational unit spread out over multiple tables or files. These tables and files are often split up by another variable, so that each represents a single year, person, or location. As long as the format for individual records is consistent, this is an easy problem to fix:
      1. Read the files into a list of tables.
      2. For each table, add a new column that records the original file name (because the file name is often the value of an important variable).
      3. Combine all tables into a single table.
    - The plyr package makes this a straightforward task in R. The following code generates a vector of file names in a directory (data/) which match a regular expression (ends in .csv). Next we name each element of the vector with the name of the file. We do this because plyr will preserve the names in the following step, ensuring that each row in the final data frame is labelled with its source. Finally, ldply() loops over each path, reading in the CSV file and combining the results into a single data frame.
    - R> library(plyr)
    - R> paths <- dir("data", pattern = "\\.csv$", full.names = TRUE)
    - R> names(paths) <- basename(paths)
    - R> ldply(paths, read.csv, stringsAsFactors = FALSE)
    - Once you have a single table, you can perform additional tidying as needed. An example of this type of cleaning can be found at https://github.com/hadley/data-baby-names, which takes 129 yearly baby name tables provided by the US Social Security Administration and combines them into a single file.
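
For readers without plyr, the same three steps can be sketched in base R. This is an assumed alternative, not the paper's code: it writes two tiny made-up CSV files to a temporary directory so that the example is self-contained, and the `source` column name is chosen here for illustration.

```r
# Set up: two small CSV files with made-up data in a temporary directory.
data_dir <- file.path(tempdir(), "tidy-notes-data")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(name = c("Ann", "Bob"), n = c(10, 7)),
          file.path(data_dir, "2000.csv"), row.names = FALSE)
write.csv(data.frame(name = c("Ann", "Cy"), n = c(12, 3)),
          file.path(data_dir, "2001.csv"), row.names = FALSE)

# 1. Read the files into a list of tables.
paths  <- dir(data_dir, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(paths, read.csv, stringsAsFactors = FALSE)

# 2. For each table, add a column recording the original file name.
tables <- Map(function(tbl, path) {
  tbl$source <- basename(path)
  tbl
}, tables, paths)

# 3. Combine all tables into a single table.
combined <- do.call(rbind, tables)
```

Because each file here represents one year, the `source` column carries the year, which could then be extracted into its own variable.
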
  - p. 13
    - Data manipulation includes variable-by-variable transformation (e.g., log or sqrt), as well as aggregation, filtering and reordering. In my experience, these are the four fundamental verbs of data manipulation:
      - Filter: subsetting or removing observations based on some condition.
      - Transform: adding or modifying variables. These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume).
      - Aggregate: collapsing multiple values into a single value (e.g., by summing or taking means).
      - Sort: changing the order of observations.
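
The four verbs can each be illustrated with one base R call on a tidy data frame. This is a minimal sketch: the `samples` data and its column names are invented, and other functions could serve equally well for each verb.

```r
# Hypothetical tidy data: one row per sample.
samples <- data.frame(
  id     = c("a", "b", "c", "d"),
  group  = c("x", "y", "x", "y"),
  weight = c(10, 30, 20, 8),
  volume = c(2, 5, 4, 4),
  stringsAsFactors = FALSE
)

# Filter: keep observations that meet a condition.
heavy <- subset(samples, weight >= 10)

# Transform: add a variable computed from two others.
with_density <- transform(samples, density = weight / volume)

# Aggregate: collapse the weights in each group to a mean.
mean_weight <- aggregate(weight ~ group, data = samples, FUN = mean)

# Sort: reorder observations by weight.
by_weight <- samples[order(samples$weight), ]
```

Every result is again a data frame with one variable per column, so the calls can be chained in any order.
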
  - p. 14
    - In R, filtering and transforming are performed by the base R functions subset() and transform(). These are input and output-tidy. The aggregate() function performs group-wise aggregation. It is input-tidy. Provided that a single aggregation method is used, it is also output-tidy. The plyr package provides tidy summarise() and arrange() functions for aggregation and sorting.
    - Base R possesses a by() function, which is input-tidy, but not output-tidy, because it produces a list. The ddply() function from the plyr package is a tidy alternative.
    - Tidy visualisation tools only need to be input-tidy as their output is visual.
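
The difference between by() and an output-tidy function can be checked directly. In this sketch (made-up `samples` data again), both calls compute a group-wise mean, but only aggregate() returns a data frame that can be passed straight to the next tool.

```r
# Hypothetical data for a group-wise mean.
samples <- data.frame(
  group  = c("x", "y", "x", "y"),
  weight = c(10, 30, 20, 8),
  stringsAsFactors = FALSE
)

# by() accepts a tidy data frame but returns an object of class "by".
by_result <- by(samples, samples$group, function(d) mean(d$weight))

# aggregate() accepts the same input and returns a data frame.
agg_result <- aggregate(weight ~ group, data = samples, FUN = mean)

is.data.frame(by_result)   # FALSE: not output-tidy
is.data.frame(agg_result)  # TRUE: output-tidy
```
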

Steven Miller

Is a CFA® Charterholder and writer focused on providing people with insight on surviving and thriving in a volatile world.

He's published three books. Most recently The World After Covid 19: Coexisting with the Novel Coronavirus.

His musings can be found at stevenlmiller.me. Subscribe to The Pompatus Times for updates.
