
Notes – Tidy Data

by Hadley Wickham

White paper these notes are from.

  • p. 1
    • Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools is needed to deal with a wide range of un-tidy datasets.
  • p. 4
    • Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
      1. Each variable forms a column.
      2. Each observation forms a row.
      3. Each type of observational unit forms a table.
    • p. 5
      • Tidy data is particularly well suited for vectorised programming languages like R, because the layout ensures that values of different variables from the same observation are always paired.
      • Fixed variables should come first, followed by measured variables, each ordered so that related variables are contiguous. Rows can then be ordered by the first variable, breaking ties with the second and subsequent (fixed) variables.
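A minimal sketch of the two points above (vectorised operations pairing values from the same observation, and the suggested column and row ordering). The data frame, its column names and its values are hypothetical and do not come from the paper.

```r
# Hypothetical tidy data: one row per (person, treatment) observation.
tidy <- data.frame(
  treatment = c("b", "a", "b", "a"),
  person    = c("Y", "X", "X", "Y"),
  weight_kg = c(70, 82, 80, 71),
  height_m  = c(1.70, 1.80, 1.80, 1.70),
  stringsAsFactors = FALSE
)

# Vectorised arithmetic works row by row, so each BMI value is computed
# from the weight and height of the same observation.
tidy$bmi <- tidy$weight_kg / tidy$height_m^2

# Fixed variables (person, treatment) first, then the measured variables.
tidy <- tidy[, c("person", "treatment", "weight_kg", "height_m", "bmi")]

# Order rows by the first fixed variable, breaking ties with the second.
tidy <- tidy[order(tidy$person, tidy$treatment), ]
rownames(tidy) <- NULL
```

Treating person and treatment as the "fixed" variables is an assumption made for this example; which variables are fixed depends on how the data were collected.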
      • Most messy datasets, including types of messiness not explicitly described above, can be tidied with a small set of tools: melting, string splitting, and casting.
      • The complete datasets and the R code used to tidy them are available online at https://github.com/hadley/tidy-data.
    • p. 6
      • The result of melting is a molten dataset.
    • p. 12
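To make "melting" concrete, here is a rough base-R stand-in, not the melting function used in the paper. It assumes a hypothetical messy table in which the column headers (`treatment_a`, `treatment_b`) are really values of a variable, stacks those columns into a molten form, then applies a small string-splitting step.

```r
# Hypothetical messy layout: one row per person, one column per treatment.
messy <- data.frame(
  person      = c("A", "B", "C"),
  treatment_a = c(5, 16, 3),
  treatment_b = c(2, 11, 1),
  stringsAsFactors = FALSE
)

# Columns whose names are really values of a variable.
measure_cols <- c("treatment_a", "treatment_b")

# Melt by hand: repeat the identifier once per measured column, record which
# column each value came from, and stack the values into one column.
molten <- data.frame(
  person = rep(messy$person, times = length(measure_cols)),
  column = rep(measure_cols, each = nrow(messy)),
  value  = unlist(messy[measure_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# String splitting: strip the prefix so the column holds only the treatment.
molten$treatment <- sub("^treatment_", "", molten$column)
```

Each row of `molten` is now one observation (one person under one treatment), which is the shape the three tidy-data rules ask for.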
      • 3.5. One type in multiple tables
        • It’s also common to find data values about a single type of observational unit spread out over multiple tables or files. These tables and files are often split up by another variable, so that each represents a single year, person, or location. As long as the format for individual records is consistent, this is an easy problem to fix:
          1. Read the files into a list of tables.
          2. For each table, add a new column that records the original file name (because the file name is often the value of an important variable).
          3. Combine all tables into a single table.
        • The plyr package makes this a straightforward task in R. The following code generates a vector of file names in a directory (data/) which match a regular expression (ends in .csv). Next we name each element of the vector with the name of the file. We do this because plyr will preserve the names in the following step, ensuring that each row in the final data frame is labelled with its source. Finally, ldply() loops over each path, reading in the csv file and combining the results into a single data frame.
        • R> library(plyr)
        • R> paths <- dir("data", pattern = "\\.csv$", full.names = TRUE)
        • R> names(paths) <- basename(paths)
        • R> ldply(paths, read.csv, stringsAsFactors = FALSE)
        • Once you have a single table, you can perform additional tidying as needed. An example of this type of cleaning can be found at https://github.com/hadley/data-baby-names, which takes 129 yearly baby name tables provided by the US Social Security Administration and combines them into a single file.
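For readers without plyr installed, the same three steps can be sketched in base R. This is an illustrative alternative, not the paper's code: the temporary directory, the file names (`1990.csv`, `1991.csv`) and their contents are made up so that the example runs on its own.

```r
# Set up a throwaway directory holding two hypothetical yearly files.
data_dir <- tempfile("tidy_demo_")
dir.create(data_dir)
write.csv(data.frame(name = c("Ann", "Bob"), n = c(10, 7)),
          file.path(data_dir, "1990.csv"), row.names = FALSE)
write.csv(data.frame(name = c("Ann", "Cy"), n = c(8, 5)),
          file.path(data_dir, "1991.csv"), row.names = FALSE)

# 1. Read the files into a list of tables.
paths <- dir(data_dir, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(paths, read.csv, stringsAsFactors = FALSE)

# 2. Add a column to each table recording the original file name.
tables <- Map(function(tbl, path) {
  tbl$source <- basename(path)
  tbl
}, tables, paths)

# 3. Combine all tables into a single table.
combined <- do.call(rbind, tables)
rownames(combined) <- NULL

# Here the file name carries the value of a variable (the year), so extract it.
combined$year <- as.integer(sub("\\.csv$", "", combined$source))
```

`do.call(rbind, ...)` requires every table to have the same columns, which matches the paper's condition that the format of individual records is consistent.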
      • p. 13
        • Data manipulation includes variable-by-variable transformation (e.g., log or sqrt), as well as aggregation, filtering and reordering. In my experience, these are the four fundamental verbs of data manipulation:
          • Filter: subsetting or removing observations based on some condition.
          • Transform: adding or modifying variables. These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume).
          • Aggregate: collapsing multiple values into a single value (e.g., by summing or taking means).
          • Sort: changing the order of observations.
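The four verbs can be sketched with base R functions on a small made-up data frame. The column names and values are hypothetical, and the density calculation simply mirrors the weight and volume example mentioned above.

```r
# Hypothetical measurements at two sites.
df <- data.frame(
  site   = c("n", "n", "s", "s"),
  weight = c(2, 8, 3, 9),
  volume = c(1, 2, 1, 3),
  stringsAsFactors = FALSE
)

# Filter: keep only the observations that satisfy a condition.
heavy <- subset(df, weight > 2.5)

# Transform: add a variable computed from existing variables.
dens <- transform(df, density = weight / volume)

# Aggregate: collapse multiple values into one value per group.
totals <- aggregate(weight ~ site, data = df, FUN = sum)

# Sort: change the order of observations (heaviest first).
sorted <- df[order(-df$weight), ]
```

Each call takes a data frame and returns a data frame, so the results can be fed directly into the next step.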
      • p. 14
        • In R, filtering and transforming are performed by the base R functions subset() and transform(). These are input- and output-tidy. The aggregate() function performs group-wise aggregation. It is input-tidy. Provided that a single aggregation method is used, it is also output-tidy. The plyr package provides tidy summarise() and arrange() functions for aggregation and sorting.
        • Base R possesses a by() function, which is input-tidy, but not output-tidy, because it produces a list. The ddply() function from the plyr package is a tidy alternative.
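A small check of the input-tidy versus output-tidy distinction, using only base R so it runs without plyr. The data frame is hypothetical, and aggregate() stands in here for a tidy alternative; the ddply() function mentioned above is not shown.

```r
# Hypothetical grouped data.
df <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20),
  stringsAsFactors = FALSE
)

# by() is input-tidy (it accepts a data frame) but its result is an object
# of class "by", not a data frame, so it needs reshaping before reuse.
by_result <- by(df, df$group, function(d) mean(d$value))

# aggregate() with a single summary function returns a data frame with one
# row per group, so its output is tidy as well.
agg_result <- aggregate(value ~ group, data = df, FUN = mean)

is.data.frame(by_result)   # FALSE
is.data.frame(agg_result)  # TRUE
```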
        • Tidy visualisation tools only need to be input-tidy as their output is visual.