What is datawrangling?
- R is oriented around statistical analysis, but its other core strength is data manipulation
- that is, getting your data into the right format to feed into analysis functions
- datawrangling is a more expressive term for ‘data manipulation’ - it encapsulates some of the angst associated with wrangling datasets into the right format
- I’d estimate that at least half the code people write aims to manipulate data
- base R functions can be effective if you know how to use them, but people typically end up writing and reusing custom functions to do their bidding
- and if people don’t know how to do that, they often resort to doing things in Excel, either manually (!) or with lookups and pivot tables etc.
- enter Hadley Wickham, who now works as one of RStudio’s boss software engineers, and who has contributed hugely to all things R. The ‘tidyverse’ is his flagship project, and is an umbrella for a number of data science and data manipulation packages which have been developed to work together.
Datawrangling with tidyverse and dplyr
- I first used plyr (another allusion to the frustration inherent in datawrangling - “I wish I could just take a pair of pliers to this bloody data”) at a Software Carpentry workshop with D Falster, which I highly recommend
- plyr is a package that provides functionality similar to the ‘apply’ functions in base R
- who has heard of apply?
- let me run you through a basic apply example
- I was working with a lot of data and plyr was slow
- dplyr is an evolved plyr, with a much faster backend and an extensive ‘vocabulary’ of ‘verbs’: functions that perform the most common data manipulation tasks. Statements can be chained together, which is really powerful once you begin to explore it.
- My main aim here is for you to leave the room simply knowing that these functions exist, and to have seen a few examples of how they work, what they might be able to do for you, and what might be possible for you in the future
- I use these functions almost every day that I work with R, and they allow me to write 5 lines in a few minutes instead of what might otherwise have taken half an hour, or even a morning of painful wrangling, to produce 40 lines of tortured code.
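For the basic apply example promised above, here is a minimal sketch in base R. The `heights` matrix and its values are made up purely for illustration (they are not from the talk); `PlantGrowth` is the built-in dataset used in the rest of these notes.

```r
# A small numeric matrix: 3 plants (rows) measured on 4 days (columns).
# These numbers are invented for illustration only.
heights <- matrix(
  c(1, 2, 3, 4,
    2, 4, 6, 8,
    3, 3, 3, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("plant_a", "plant_b", "plant_c"),
                  paste0("day", 1:4))
)

# apply(X, MARGIN, FUN): MARGIN = 1 works over rows, MARGIN = 2 over columns
row_means <- apply(heights, 1, mean)  # mean height of each plant
col_max   <- apply(heights, 2, max)   # tallest plant on each day

# sapply() applies a function to each element of a list,
# which for a data frame means each column
col_classes <- sapply(PlantGrowth, class)

# tapply() applies a function to one vector, split by a grouping factor
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
```

The `tapply()` call at the end computes the same per-group means that the dplyr pipeline in the next example produces, which gives a direct point of comparison between the two styles.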
- summarising data to plot bar chart
library(dplyr)

# dplyr has a set of 'verbs' which are functions that perform basic data manipulation operations
# these manipulations can all be done in base R with varying degrees of difficulty
# dplyr's utility is in its intuitive syntax and its fast backend

# I'm going to show you how to create summary tables of your data
data("PlantGrowth")
str(PlantGrowth)

# first let's find mean plant growth by group and plot the data
summary <- PlantGrowth %>%
  group_by(group) %>%
  summarise(mean = mean(weight))
plot(mean ~ group, summary)

# %>% is a 'pipe' - it means 'and then do this next thing with the output'
# when using 'pipes' they need to go at the end of a line, not the beginning of a new line
# or you can put everything on one line if there isn't too much code
# so this next example won't work (commented out here because it is a syntax error)...
# summary <- PlantGrowth
#   %>% group_by(group)
#   %>% summarise(mean = mean(weight))
# plot(mean ~ group, summary)

# add in a 'filter' step to the pipeline to remove a group
summary <- PlantGrowth %>%
  group_by(group) %>%
  filter(group != 'trt2') %>%
  summarise(mean = mean(weight))
plot(mean ~ group, summary)

# or to remove values we don't want
# first I'll add some zeroes
blah <- data.frame(weight = c(0, 0, 0, 0), group = rep('zero', 4))
summary <- PlantGrowth %>%
  bind_rows(blah) %>%  # we could use rbind, but 'bind_rows' is the equivalent function in the dplyr workflow
  group_by(group) %>%
  filter(weight > 0) %>%
  summarise(mean = mean(weight))
plot(mean ~ group, summary)

# the plot call above won't work because bind_rows coerced our factor into a character vector
# but we can fix it by changing ('mutating') the class of our group vector
summary <- PlantGrowth %>%
  bind_rows(blah) %>%
  group_by(group) %>%
  filter(weight > 0) %>%
  summarise(mean = mean(weight)) %>%
  mutate(group = as.factor(group))
plot(mean ~ group, summary)

# another example - make a new vector of transformed data using mutate
summary <- PlantGrowth %>%
  mutate(weight_log10 = log10(weight)) %>%
  group_by(group) %>%
  summarise(mean_log10 = mean(weight_log10))
boxplot(mean_log10 ~ group, summary)
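The comments in the dplyr code note that these manipulations can all be done in base R. As a rough point of comparison, here is one way the same summaries might be written with `aggregate()`, `subset()` and `transform()`. The object names (`base_summary`, `no_trt2`, `pg` and so on) are mine, and this is only one of several possible base R approaches.

```r
# Mean weight per group, using aggregate() with a formula
base_summary <- aggregate(weight ~ group, data = PlantGrowth, FUN = mean)
names(base_summary)[2] <- "mean"  # rename the result column to match the dplyr version

# Drop a group first, as filter(group != 'trt2') does in the pipeline
no_trt2 <- subset(PlantGrowth, group != "trt2")
base_summary_no_trt2 <- aggregate(weight ~ group, data = no_trt2, FUN = mean)

# Add a transformed column, as mutate() does, then summarise it
pg <- transform(PlantGrowth, weight_log10 = log10(weight))
base_log_summary <- aggregate(weight_log10 ~ group, data = pg, FUN = mean)
```

Each step here creates a separate intermediate object, whereas the piped version reads top to bottom as a single statement.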
# tidyr contains functions for 'reshaping' data (i.e. changing the layout of a dataset)
# similar to reshape2 but IMHO easier to use
# this said, I've always found reshaping data a challenge, no matter what the tool
library(tidyr)

stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

# wide to long: one row per time/stock combination
stocksm <- stocks %>% gather(stock, price, -time)

# long back to wide: one column per stock
stocksm %>% spread(stock, price)

# or one column per date
stocksm %>% spread(time, price)
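`gather()` and `spread()` still work, but tidyr 1.0.0 and later mark them as superseded in favour of `pivot_longer()` and `pivot_wider()`. Here is a sketch of the same reshaping with the newer functions, assuming a tidyr version that provides them. The `set.seed()` call and the names `stocks_long` and `stocks_wide` are my additions for illustration.

```r
library(tidyr)

set.seed(1)  # arbitrary seed, only so the random prices are reproducible
stocks <- data.frame(
  time = as.Date("2009-01-01") + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

# wide to long: one row per time/stock combination (the gather() step)
stocks_long <- pivot_longer(
  stocks,
  cols = -time,          # reshape every column except 'time'
  names_to = "stock",    # old column names go into 'stock'
  values_to = "price"    # old cell values go into 'price'
)

# long back to wide: one column per stock (the spread() step)
stocks_wide <- pivot_wider(
  stocks_long,
  names_from = stock,
  values_from = price
)
```

The named arguments (`names_to`, `values_to`, `names_from`, `values_from`) make it clearer which column ends up where, which may help with the reshaping difficulty mentioned in the comments.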