3 Ways to Recode Categorical Variables in R

Documenting my R Learning Quest Vol.1

Sooner or later you will need to recode some variables, or to ‘translate’ obscure values into more informative labels. Of course, there are several ways to do this; I’m just listing the ones I used during the first stages of my R learning quest, plus my current favourite.

I wanted to use some data about frogs, and found this Frog Atlas one.[1]

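# The file path was lost from the original post; "frogs.csv" is a placeholder.
frog_atlas <- read.csv("frogs.csv", stringsAsFactors=FALSE)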

length(unique(frog_atlas$DATEACCURACY))

## [1] 6

unique(frog_atlas$DATEACCURACY)

## [1] "D" "C" NA "M" "T" "Y"

We have six distinct values (five accuracy codes, "D" "C" "M" "T" "Y", plus NA) that tell us how far we can trust the reported date of the frog kidnapping, and we’d like to replace the codes with their labels.


1. The Slow and Dirty: modifying data frames one value at a time using logical selection.

I remember a Coursera assignment where I had little time left to submit and went this way. I remember thinking that there had to be another way, but I was too tired at the time and did it like this.

It was indeed correct, but my eyeballs dried out from the lack of blinking during the process, and it is an error-prone method (given enough values to change, you’ll probably feel a strong Ctrl+C, Ctrl+V temptation).

frog_atlas$DATEACCURACY[frog_atlas$DATEACCURACY=="C"] <- "century"
frog_atlas$DATEACCURACY[frog_atlas$DATEACCURACY=="D"] <- "day"
frog_atlas$DATEACCURACY[frog_atlas$DATEACCURACY=="M"] <- "month"
frog_atlas$DATEACCURACY[frog_atlas$DATEACCURACY=="T"] <- "decade"
frog_atlas$DATEACCURACY[frog_atlas$DATEACCURACY=="Y"] <- "year"

unique(frog_atlas$DATEACCURACY)

## [1] "day" "century" NA "month" "decade" "year"

Not too scary or time-consuming, but imagine the same process with the 32 values of the SPECIES variable. Or imagine you want to change the variable name at some point.
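To make the mechanics concrete, here is a minimal sketch on a toy vector. The `codes` vector is made up for illustration and is not taken from the Frog Atlas. It shows that logical selection with a single replacement value leaves the NA entries alone, which is why this approach gives the right answer despite the missing codes.

```r
# Toy stand-in for frog_atlas$DATEACCURACY (hypothetical values)
codes <- c("D", "C", NA, "M", "D")

# codes == "D" is TRUE, FALSE, NA, FALSE, TRUE; with a length-one
# replacement value, R skips the NA positions instead of failing
codes[codes == "D"] <- "day"
codes[codes == "C"] <- "century"
codes[codes == "M"] <- "month"

print(codes)
```

Each code still needs its own line, so the typing (and the copy-paste risk) grows with the number of distinct values.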


2. The ‘I discovered merge() and I loved it, so I’m using it everywhere’

I still find this cool if you have the correspondence table in a separate data frame. BUT be careful, careful, careful with the NA values.

For the sake of the example, I’m creating a separate lookup data frame for the frogs dataset.

dateaccuracy_labels_df <- data.frame("DATEACCURACY"=c("D", "C", "M", "T", "Y", NA),
                                     "dateaccuracy_labels"=c("day", "century", "month", "decade", "year", NA),
                                     stringsAsFactors=FALSE)

dateaccuracy_labels_df

##   DATEACCURACY dateaccuracy_labels
## 1            D                 day
## 2            C             century
## 3            M               month
## 4            T              decade
## 5            Y                year
## 6         <NA>                <NA>

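# The right-hand side was lost from the original post; this merge() call is a reconstruction.
frog_atlas <- merge(frog_atlas, dateaccuracy_labels_df, by="DATEACCURACY", all.x=TRUE)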

unique(frog_atlas$dateaccuracy_labels)

## [1] "century" "day" "month" "decade" "year" NA
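One way the NA warning can bite, sketched on made-up data (the `frogs` and `labels_df` data frames and the `id` column are hypothetical, not part of the Frog Atlas): if the lookup table has no NA row, a default `merge()` silently drops every record whose code is NA. Passing `all.x = TRUE` keeps them. The sketch also shows that `merge()` reorders rows by the key, which is easy to miss.

```r
# Toy data: the id column lets us see how merge() reorders rows
frogs <- data.frame(id = 1:4,
                    DATEACCURACY = c("D", NA, "C", "D"),
                    stringsAsFactors = FALSE)

# Lookup table WITHOUT an NA row
labels_df <- data.frame(DATEACCURACY = c("D", "C"),
                        dateaccuracy_labels = c("day", "century"),
                        stringsAsFactors = FALSE)

# Default inner join: the row with the NA code has no match and is dropped
inner <- merge(frogs, labels_df, by = "DATEACCURACY")
nrow(inner)   # 3

# all.x = TRUE keeps every row of frogs; unmatched codes get an NA label
left <- merge(frogs, labels_df, by = "DATEACCURACY", all.x = TRUE)
nrow(left)    # 4

# merge() sorts on the key, so restore the original order if it matters
left <- left[order(left$id), ]
```

Comparing `nrow()` before and after the merge is a cheap check that no records went missing.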

3. The Quick and Painless: using named vectors

First, you create a named vector containing the original values and each corresponding label.

dateaccuracy_labels <- c("C"="century", "D"="day", "M"="month", "T"="decade", "Y"="year")

dateaccuracy_labels

##         C         D         M         T         Y 
## "century"     "day"   "month"  "decade"    "year" 

Now, if you subset this named vector with the character strings of your variable, R will return the ‘translation’, which can then be assigned to replace the variable.

frog_atlas$DATEACCURACY <- dateaccuracy_labels[frog_atlas$DATEACCURACY]

unique(frog_atlas$DATEACCURACY)

## [1] "day" "century" NA "month" "decade" "year"
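The same lookup can be tried on a toy vector, as a small self-contained sketch. The `codes` values, including the unmapped code "X", are made up for illustration. It shows what happens to NA and to codes missing from the named vector: both come back as NA. The `unname()` call is an optional extra, not part of the steps above, that strips the leftover code names from the result.

```r
# Named vector: names are the original codes, values are the labels
dateaccuracy_labels <- c("C"="century", "D"="day", "M"="month",
                         "T"="decade", "Y"="year")

# Toy codes (hypothetical), including an NA and a code "X" with no label
codes <- c("D", "C", NA, "X", "Y")

# Indexing by name returns one label per code; NA and unknown names give NA.
# unname() drops the leftover code names from the result.
recoded <- unname(dateaccuracy_labels[codes])
print(recoded)
```

Because unmapped codes turn into NA without any warning, it can be worth comparing `sum(is.na(codes))` with `sum(is.na(recoded))` after recoding.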

And this last one is my favourite way these days.


  1. I saved the .xls original ‘Frogs’ sheet to csv (comma delimited) before reading the data into R. 