Overview

You have learned a lot about data wrangling! You know how to:

This ecosystem of packages works well together to accomplish these tasks: dplyr, plyr, tidyr, reshape2.

The goal of this homework is to solidify your data wrangling skills by working through some realistic problems, which typically live in the grey area between data aggregation and data reshaping.

If you internalize that there are multiple solutions to most problems, you will spend less time banging your head against the wall in data analysis. If something’s really hard, sneak up on it from a different angle.

Due any time on Thursday, October 30th, 2014.

Choose your own adventure

Do two of the problems below. At least.

As always, it is fine to work with a new dataset and/or create variations on these problem themes. I point out the important principles, so you can craft comparable exercises.

Return of HW03

For the STAT545 alums who have just joined us: dplyr is a new element in this course and is extremely powerful for data manipulation. I encourage you to go through this course material:

and draw on the exercises in HW03 to get some experience with dplyr.

Join, merge, look up

Problem: You have two data sources and you need info from both in one new data object. Sometimes you have a primary data object and the other is secondary. That feels like “table look up” to me. Sometimes the data sources are equally important. That feels more like a “merge” to me. Under the hood, these are still the same type of problem.

Solution: Perform a join, which borrows terminology from the database world, specifically SQL.
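To make the look-up versus merge distinction concrete, here is a minimal sketch. The two data.frames (visits and capitals) and their values are invented for illustration; only left_join() from dplyr and base R's merge() are real functions.

```r
library(dplyr)

# Hypothetical primary table: one row per observation
visits <- data.frame(
  country = c("Canada", "Mexico", "Canada", "Brazil"),
  year    = c(2012, 2012, 2013, 2013),
  stringsAsFactors = FALSE
)

# Hypothetical secondary (look-up) table: one row per country
capitals <- data.frame(
  country = c("Canada", "Mexico", "France"),
  capital = c("Ottawa", "Mexico City", "Paris"),
  stringsAsFactors = FALSE
)

# Table look-up: keep every row of the primary table and pull in the
# matching columns from the secondary one; unmatched rows get NA
looked_up <- left_join(visits, capitals, by = "country")

# Merge of two equally important sources: keep all rows from both sides
merged <- merge(visits, capitals, by = "country", all = TRUE)

looked_up
merged
```

In looked_up, Brazil keeps its row but has no capital; in merged, France also appears even though it was never visited.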

Background reading:

Possible activities:

Activity #1

Activity #2

You will likely need to iterate between your data prep and your joining to make your explorations comprehensive and interesting. For example, you will want a specific amount (or lack) of overlap between the two data.frames, in order to demonstrate all the different joins. You will want both the data.frames to be as small as possible, while still retaining the expository value.
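As one possible starting point, the sketch below uses two tiny data.frames with deliberate partial overlap in their keys, so that each join gives a visibly different result. The objects x and y and their contents are made up; the join functions are from dplyr.

```r
library(dplyr)

# Keys b and c are shared; a is only in x, d is only in y
x <- data.frame(key = c("a", "b", "c"), x_val = 1:3,
                stringsAsFactors = FALSE)
y <- data.frame(key = c("b", "c", "d"), y_val = c(20, 30, 40),
                stringsAsFactors = FALSE)

inner <- inner_join(x, y, by = "key")  # rows whose key is in both x and y
left  <- left_join(x, y, by = "key")   # all rows of x; y_val is NA for key a
semi  <- semi_join(x, y, by = "key")   # rows of x with a match; x's columns only
anti  <- anti_join(x, y, by = "key")   # rows of x with no match in y

inner
left
semi
anti
```

Swapping the roles of x and y, or removing the overlap entirely, is a quick way to explore how each join behaves at the extremes.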

Data aggregation based on lists and arrays

We worked with plyr::ddply() for an important special case in data aggregation:

In this special case, there are competing dplyr workflows available, namely those using group_by() and do(), but I chose to show ddply() due to its approachable syntax for very general problems.
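For comparison, here is a rough sketch of the same split-apply-combine computation done both ways. The data.frame scores and the helper summarize_piece() are hypothetical. Functions are called with their package prefix to sidestep the name clashes between plyr and dplyr.

```r
# Hypothetical data: a grouping variable and a numeric value
scores <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 2, 4, 6),
  stringsAsFactors = FALSE
)

# Hypothetical helper: takes one piece (a data.frame), returns a data.frame
summarize_piece <- function(piece) {
  data.frame(n = nrow(piece), mean_value = mean(piece$value))
}

# plyr: split scores by group, apply the helper, glue the results together
by_ddply <- plyr::ddply(scores, ~ group, summarize_piece)

# dplyr: same idea; inside do(), the dot refers to the current group's rows
by_dplyr <- dplyr::do(dplyr::group_by(scores, group), summarize_piece(.))

by_ddply
as.data.frame(by_dplyr)
```

Both results have one row per group with the columns n and mean_value; the dplyr version comes back as a grouped tibble rather than a plain data.frame.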

Problem: What if you need to do this sort of operation on something other than a data.frame? Like a vector or matrix or array or list? And what if you need something other than a data.frame back?

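Solution: Use plyr, specifically the functions whose names encode l for list or a for array. The first letter of the function name gives the input type and the second gives the output type (an underscore means the output is discarded), e.g. laply(), ldply(), llply(), l_ply().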
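The naming scheme is easier to see in action. The sketch below feeds one made-up list (samples) to four list-input functions from plyr and differs only in what comes back.

```r
library(plyr)

# Hypothetical input: a named list of numeric vectors of differing lengths
samples <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5, 5))

# list in, list out: one range per list element
ranges_list <- llply(samples, range)

# list in, array out: each piece returns a length-2 vector,
# so the pieces are stacked into a 3 x 2 matrix
ranges_mat <- laply(samples, range)

# list in, data.frame out: the list names land in an .id column
stats_df <- ldply(samples, function(v) {
  data.frame(n = length(v), mean = mean(v))
})

# list in, nothing out: called purely for its side effect (printing)
l_ply(samples, function(v) cat(length(v), "values\n"))

ranges_mat
stats_df
```

The array-input functions (aaply(), adply(), alply()) follow the same pattern but take a .margins argument to say which dimensions to split over.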

Background reading on basics of data aggregation and plyr:

Possible activities:

Activity #1

Reflect on things like this:

Activity #2

General data reshaping and relationship to aggregation

Problem: You have data in one “shape” but you wish it were in another. Usually this is because the alternative shape is superior for presenting a table, making a figure, or doing aggregation and statistical analysis.

Solution: Reshape your data. For simple reshaping, gather() and spread() from tidyr will suffice. If that fails, use the more powerful tools under the hood: melt(), dcast(), and acast() from reshape2.
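Here is a small sketch of both routes on the same toy data. The data.frame wide and its values are invented for illustration. gather() and spread() are the tidyr functions named above (newer tidyr releases also offer pivot_longer() and pivot_wider()), and melt() and dcast() are from reshape2.

```r
library(tidyr)

# Hypothetical wide data: one row per country, one column per year
wide <- data.frame(
  country = c("Canada", "Mexico"),
  y2002   = c(10, 20),
  y2007   = c(11, 22),
  stringsAsFactors = FALSE
)

# Wide to long: the year columns collapse into a key/value pair
long <- gather(wide, year, pop, y2002, y2007)

# Long back to wide
wide_again <- spread(long, year, pop)

# The reshape2 equivalents
melted    <- reshape2::melt(wide, id.vars = "country",
                            variable.name = "year", value.name = "pop")
back_wide <- reshape2::dcast(melted, country ~ year, value.var = "pop")

# Reshaping shades into aggregation: when the formula leaves several
# values per cell, dcast() collapses them with fun.aggregate
totals <- reshape2::dcast(melted, year ~ ., value.var = "pop",
                          fun.aggregate = sum)

long
totals
```

The last call is where reshaping and aggregation meet: dropping country from the formula forces dcast() to summarize across countries within each year.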

Background reading:

Possible activities:

NOTE: I PLAN TO PICK SPECIFIC CHALLENGES TO LIST HERE BUT FEEL FREE TO CHOOSE YOUR OWN.