You have learned a lot about advanced string operations combined with regular expressions:
grep(), grepl(), stringr::str_match(), etc.
gsub(), etc.
strsplit(), str_split_fixed(), etc.
paste(), paste0().
Now you are ready to apply them to the real world of data cleaning!
Here is an interesting link on why data cleaning is important and how extremely frustrating it can be.
In this homework, you will start with the dirty version of Gapminder in our course repository, gapminderDataFiveYear_dirty.txt, and clean it up with all the string functions you’ve learned. Before you begin, copy this file into your own homework repository.
Due anytime Thursday November 6th 2014.
Read gapminderDataFiveYear_dirty.txt into R. Experiment with strip.white = FALSE (default) and strip.white = TRUE. Reflect on what difference this argument makes.
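To see what the argument does before trying it on the real file, here is a minimal sketch on a toy tab-delimited string. The rows and column names are made up for illustration and are not taken from the homework file; the text = argument simply lets read.delim() read from a string instead of a file.

```r
# Toy tab-delimited text with stray padding (made-up rows, not the real data).
toy <- "country\tyear\nCanada \t1952\n  China\t1957\n"

# Same input, read twice: once keeping whitespace, once stripping it.
raw      <- read.delim(text = toy, strip.white = FALSE, stringsAsFactors = FALSE)
stripped <- read.delim(text = toy, strip.white = TRUE,  stringsAsFactors = FALSE)

raw$country       # padding around the names is kept
stripped$country  # leading and trailing whitespace is removed
```

Comparing unique() of a column under both settings is a quick way to spot values that differ only by padding.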
Now look at the columns of this data frame. You need to decide how much information each column should hold. Are there two pieces of information in one field? Or is one piece of information split over two columns? Use string splitting or pasting so that each column contains only one conceptual variable.
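As a rough sketch of both directions, the toy example below splits one field into two columns and pastes two fragments back into one. The column name region, the "_" separator, and the values are all assumptions made for illustration; check what the dirty file actually contains before borrowing any of it.

```r
# Hypothetical column holding two variables joined by "_".
toy <- data.frame(region = c("Asia_China", "Americas_Canada"),
                  stringsAsFactors = FALSE)

# strsplit() returns a list; rbind the pieces into a two-column matrix.
parts <- do.call(rbind, strsplit(toy$region, "_", fixed = TRUE))
toy$continent <- parts[, 1]
toy$country   <- parts[, 2]

# The opposite case: one variable spread over two fields.
first  <- c("New", "Sri")
second <- c("Zealand", "Lanka")
joined <- paste(first, second)  # paste() separates with a space by default
```

stringr::str_split_fixed() returns the matrix directly, so it can replace the do.call(rbind, ...) step if you prefer.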
Missing values often come as fields with NA or empty strings. Is there any missing data in this dataset? Identify the missing fields, and try to fill them in yourself.
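One possible way to locate such fields is sketched below on a made-up data frame; the columns and values are placeholders, not the homework data.

```r
# Toy data with both kinds of missingness: NA and the empty string.
toy <- data.frame(country = c("Canada", "", NA, "China"),
                  year    = c(1952, 1957, 1962, NA),
                  stringsAsFactors = FALSE)

# A character field counts as missing if it is NA or "".
is_missing <- is.na(toy$country) | toy$country == ""
missing_rows <- which(is_missing)

# Count NA values per column (empty strings are not counted here).
na_counts <- colSums(is.na(toy))
```

Once you know which rows are affected, looking at the neighbouring rows usually tells you what the missing value should be.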
Many times you will come across different versions of the exact same record with different capitalization or spelling. Are there any in our dataset? There are several different ways to go about this:
Use gsub(). The pattern \\b[a-z] will match all cases where the first letter of the word is not capitalized.
For advanced users, check out the replacement argument in ?gsub for the use of \U together with perl = TRUE. For example, gsub("\\b([a-z])", "\\U\\1", strings, perl = TRUE) will capitalize the first letter of every word. But it is not required in this homework.
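Here is a small sketch of both ideas on a made-up vector: first flag the entries that contain an uncapitalized word, then fix them with the \U replacement. The vector strings is illustrative only, and perl = TRUE is used in both calls so that the same regex engine handles \\b each time.

```r
# Made-up names with inconsistent capitalization.
strings <- c("china", "Canada", "sri lanka")

# Which entries contain a word that starts with a lowercase letter?
needs_fix <- grepl("\\b[a-z]", strings, perl = TRUE)

# Upper-case the first letter of every word.
fixed <- gsub("\\b([a-z])", "\\U\\1", strings, perl = TRUE)
fixed
```

Note that this capitalizes every word, so it may be too aggressive for names that legitimately contain lowercase words such as "and" or "of"; inspect the result before overwriting a column.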
Finally, load the clean Gapminder, gapminderDataFiveYear.txt, and use identical() to compare your cleaned data frame with ours. If TRUE, congratulations! You’ve successfully cleaned this dataset!
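If identical() returns FALSE, all.equal() can help you find out why, since it describes the differences instead of only reporting a logical. The sketch below uses two tiny made-up data frames (the names mine and clean are placeholders) rather than the real files.

```r
# Two toy data frames that differ in a single value.
clean <- data.frame(country = c("Canada", "China"), year = c(1952L, 1952L),
                    stringsAsFactors = FALSE)
mine  <- data.frame(country = c("Canada", "china"), year = c(1952L, 1952L),
                    stringsAsFactors = FALSE)

before <- identical(mine, clean)  # strict comparison, no details
all.equal(mine, clean)            # reports which component differs

# After repairing the offending value, the strict check passes.
mine$country[2] <- "China"
after <- identical(mine, clean)
```

Keep in mind that identical() also compares column types and attributes, so a factor column will not match a character column even when the printed values look the same.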
There are other potential problems that we did not cover in this homework. Watch out for them in your own data cleaning process:
Stray \n and \r characters. These are most likely to show up if you are using a mixture of operating systems, such as Windows plus Linux.
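A minimal sketch of one way to strip these characters from a character vector is shown below; the input values are made up for illustration.

```r
# Made-up values carrying leftover carriage returns and newlines.
x <- c("Canada\r", "China\r\n", "India")

# Remove every carriage return and newline character.
cleaned <- gsub("[\r\n]", "", x)
cleaned
```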