You have learned a lot about data wrangling! This ecosystem of packages works well together to accomplish these tasks: `dplyr`, `plyr`, `tidyr`, and `reshape2`.
The goal of this homework is to solidify your data wrangling skills by working through some realistic problems, which typically live in the grey area between data aggregation and data reshaping.
If you internalize that there are multiple solutions to most problems, you will spend less time banging your head against the wall in data analysis. If something’s really hard, sneak up on it from a different angle.
Due anytime Thursday October 30th 2014.
Do two of the problems below. At least.
As always, it is fine to work with a new dataset and/or create variations on these problem themes. I point out the important principles, so you can craft comparable exercises.
For the STAT545 alums who have just joined us: `dplyr` is a new element in this course and is extremely powerful for data manipulation. I encourage you to go through the `dplyr` course material and draw on the exercises in HW03 to get some experience with `dplyr`.
Problem: You have two data sources and you need info from both in one new data object. Sometimes you have a primary data object and the other is secondary. That feels like “table look up” to me. Sometimes the data sources are equally important. That feels more like a “merge” to me. Under the hood, these are still the same type of problem.
Solution: Perform a join, which borrows terminology from the database world, specifically SQL.
Background reading:

* `dplyr` join cheatsheet, which includes the base function `merge()`
* Using `match()` and `merge()` to bring country color information into Gapminder

Possible activities:
Activity #1

Join two data.frames using a `dplyr` join function, `merge()`, and/or `match()`, and make some observations about the process and result. Explore the different types of joins. Examples:
Activity #2
You will likely need to iterate between your data prep and your joining to make your explorations comprehensive and interesting. For example, you will want a specific amount (or lack) of overlap between the two data.frames, in order to demonstrate all the different joins. You will want both the data.frames to be as small as possible, while still retaining the expository value.
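As a sketch of what such an exploration might look like, a pair of tiny data.frames with deliberate partial overlap is enough to demonstrate how the join types differ (all names and values below are invented for illustration):

```r
library(dplyr)

# Two tiny data.frames with deliberate partial overlap:
# Belgium appears only in `countries`, Canada only in `capitals`.
countries <- data.frame(country   = c("Belgium", "France", "Germany"),
                        continent = "Europe",
                        stringsAsFactors = FALSE)
capitals  <- data.frame(country = c("France", "Germany", "Canada"),
                        capital = c("Paris", "Berlin", "Ottawa"),
                        stringsAsFactors = FALSE)

inner_join(countries, capitals, by = "country")  # only the 2 matching rows
left_join(countries, capitals, by = "country")   # all 3 countries; NA capital for Belgium
anti_join(countries, capitals, by = "country")   # Belgium: no match in capitals

# Base R equivalents
merge(countries, capitals, by = "country")                   # like inner_join()
merge(countries, capitals, by = "country", all.x = TRUE)     # like left_join()
capitals$capital[match(countries$country, capitals$country)] # "table look up" with match()
```

Note how the small size makes it easy to predict, and then verify, the row count of each result.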
We worked with `plyr::ddply()` for an important special case in data aggregation: the input is a data.frame and the output is also a data.frame. In this special case, there are competing `dplyr` workflows, namely using `group_by()` and `do()`, but I chose to show `ddply()` due to its approachable syntax for very general problems.
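As a tiny sketch of that special case, here are the two competing workflows side by side (toy data invented for illustration):

```r
library(plyr)   # load plyr before dplyr to avoid masking surprises
library(dplyr)

# Toy data: data.frame in, data.frame out
df <- data.frame(group = c("a", "a", "b", "b", "b"),
                 x     = c(1, 2, 3, 4, 5))

# The ddply() workflow
ddply(df, ~ group, plyr::summarize, n = length(x), mean_x = mean(x))

# The competing dplyr workflow for the same result
df %>% group_by(group) %>% summarise(n = n(), mean_x = mean(x))
```

Both return a data.frame with one row per group; the choice is largely a matter of syntax and taste.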
Problem: What if you need to do this sort of operation on something other than a data.frame? Like a vector or matrix or array or list? And what if you need something other than a data.frame back?
Solution: Use `plyr`. Specifically, the functions that start and end with `l` for list or `a` for array, e.g. `laply()`, `ldply()`, `llply()`, `l_ply()`.

Background reading on the basics of data aggregation and `plyr`:

* `plyr::ddply()` lesson

Possible activities:
Activity #1

* Use `dlply()` to enact linear regression on each Gapminder country; the result will be a list of `lm` objects.
* Inspect this list with `str()`, `length()`, `names()`, etc.
* Inspect individual elements with `foo[[i]]`, `foo[["Canada"]]`, `str(foo[[i]])`, etc.
* Use `ldply()` to pull information out of the fitted models. Examples:
* A data.frame with variables `country`, `intercept`, `slope`
* A data.frame with two rows per country: variables `country`, a factor `coefficient` with levels intercept and slope, and a numeric variable `value` with the estimated coefficient
* A data.frame with variables `country`, `year`, `lifeExp`, `fitted`, `resid`
* Use the `tidy()` function from the `broom` package to get a full inferential table on the regression coefficients; output = a data.frame with two rows per country and variables `country`, a variable identifying which coefficient, plus variables for the estimate, estimated std error, \(t\) statistic, and p-value (Getting Genetics Done has a nice blog post on `broom`)

Reflect on things like this:
* Why use the `ldply()` + more wrangling approach over `ddply()`? Hint: Think about this in terms of the DRY principle (“don’t repeat yourself”): “Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.”
* Find an example of this in your work: one (input, output) pair can be achieved with two different workflows. Describe the competing workflows in terms of different choices of:
Activity #2

* Use `plyr` or `dplyr` to write out the data for each country to a separate delimited file in a sub-directory, like `country_data`, with good names, like `Canada.tsv`.
* Use `list.files()` to capture the file listing of this sub-directory as a character vector (which is an array). Use this array as input below.
* Use the `plyr` functions starting with `a` to do something for each country. Examples:
  * `adply()` to read the mini datasets back in and re-combine to recreate the original Gapminder data.
  * `aaply()` to read the mini datasets back in and retain one single piece of info, such as the continent, the number of rows, or the maximum life expectancy. When recombined, you will have an atomic vector (which is an array).
  * `alply()` to read the mini datasets back in and fit a linear model for life expectancy against time.
  * `a_ply()` to read the mini datasets back in, create a country-specific plot, and write it to file.

Problem: You have data in one “shape” but you wish it were in another. Usually this is because the alternative shape is superior for presenting a table, making a figure, or doing aggregation and statistical analysis.
Solution: Reshape your data. For simple reshaping, `gather()` and `spread()` from `tidyr` will suffice. If that fails, use the more powerful tools under the hood: `melt()`, `dcast()`, and `acast()` from `reshape2`.
Background reading:

* `tidyr` package on GitHub and CRAN (read the vignette!)
* `tidyr` lesson: using the `gather()` function from the `tidyr` package to reshape the Lord of the Rings wordcount data
* `reshape2` package

Possible activities:
NOTE: I PLAN TO PICK SPECIFIC CHALLENGES TO LIST HERE BUT FEEL FREE TO CHOOSE YOUR OWN.
* `dplyr` wasn’t on my radar last year but you know how to use it.
* `tidyr` didn’t even exist yet. Use it.
* `reshape2`: many of you are ready to read up on `melt()` and `dcast()`/`acast()` and try them out.
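For comparison, here is the same toy reshape done with the `reshape2` analogues, `melt()` and `dcast()` (data invented for illustration):

```r
library(reshape2)

# Toy wordcount data in "wide" format: one column per species
lotr_wide <- data.frame(Film   = c("The Fellowship Of The Ring", "The Two Towers"),
                        Elf    = c(1229, 331),
                        Hobbit = c(3644, 2463),
                        stringsAsFactors = FALSE)

# melt(): wide --> long, like tidyr's gather()
lotr_long <- melt(lotr_wide, id.vars = "Film",
                  variable.name = "Species", value.name = "Words")
lotr_long

# dcast(): long --> wide, like tidyr's spread()
dcast(lotr_long, Film ~ Species, value.var = "Words")
```

`dcast()` returns a data.frame; `acast()` takes the same formula interface but returns a matrix/array instead.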