R for Public Health: continuous

Wednesday, November 21, 2012

Data types, part 3: Factors!

In this third part of the data types series, I'll go an important class that I skipped over so far: factors.

Factors are categorical variables that are super useful in summary statistics, plots, and regressions. They basically act like dummy variables that R codes for you. So, let's start off with some data:

and let's check out what kinds of variables we have:

so we see that Race is a factor variable with three levels. I can see all the levels this way:

So what his means that R groups statistics by these levels. Internally, R stores the integer values 1, 2, and 3, and maps the character strings (in alphabetical order, unless I reorder) to these values, i.e. 1=Black, 2=Hispanic, and 3=White. Now if I were to do a summary of this variable, it shows me the counts for each category, as below. R won't let me do a mean or any other statistic of a factor variable other than a count, so keep that in mind. But you can always change your factor to be numeric, which I'll go over next week.

If I do a plot of age on race, I get a boxplot from the normal plot command since that is what makes sense for a categorical variable:

plot(mydata$Age~mydata$Race, xlab="Race", ylab="Age", main="Boxplots of Age by Race")

Finally, if I do a regression of age on race, notice how I instantly get dummy variables:

summary(lm(Age~Race, data=mydata))

Here Black is the reference category since it's the first level by alphabetical order.

What if I want to run the same regression as before, but I want to use Hispanic as my reference group? Very easily, I just relevel the factor like this and get the resulting regression output:

mydata$Race2<-relevel(mydata$Race, "Hispanic")

summary(lm(Age~Race2, data=mydata))

Notice how now the Hispanic category is the reference.

So that is great stuff. But it's really important to know how to manipulate the factor variables. First, I can create factors using the factor() function. I notice from viewing my dataset above (or from running class(mydata$Marriage)) that marriage is numeric and coded as 0, 1, or 2. I find out in my codebook that those values correspond to Single, Married, and Divorced/Widowed. We can fix that this way:

mydata$Married.cat<-factor(mydata$Married, labels=c("Single", "Married", "Divorced/Widowed"))

This will create an unordered factor where 1=Single, 2=Married, and 3=Divorced/Widowed.

Now let's say I want to create a variable that describes whether someone's weight is "Low", "Medium", and "High". In this case, I'll use the cut() function, which instantly creates a factor that I can label, then I use the ordered() function around the cut function to order the levels. I show it in different colors for ease of viewing the two functions that I'm nesting:

mydata$Weight.type<-ordered(cut(mydata$Weight, c(0,135,165,200), labels=c("Low", "Medium", "High")))

If I print out the class and the contents of the variable (left) , I notice that it's an ordered factor and it tells me that Low<Medium< High, which is what I want.

We can see the new additions to my dataset (I'm showing just the relevant columns):

One caveat with factors - if you start off with a level and then you drop the only observations with that level, R still holds on to the level as a stored value and this can mess up your later analysis. For example, I subset my data to the first 6 rows so that I eliminate all Hispanic subjects from my data, but R keeps Hispanic as a possible level:

mynewdata<-mydata[1:6,]
summary(mynewdata$Race)

So here Hispanic just has a 0 count, but is still a category. This can be really annoying, like when you're making a barplot and the category is still showing up in the plot.

One quick way to get rid of this is to use the droplevels() function to drop all unused levels from your dataset. A commentator let me know that this function was introduced in R 2.12.0. Before, it was necessary to use a separate package called gdata. The function takes the whole dataframe as the argument, and you can use the except argument to list the indices of columns that you do not want subject to the dropping:

finaldata<-droplevels(mynewdata)
summary(finaldata$Race)

You could do all this very efficiently in one step like this without changing the name of your dataframe:

mydata<-droplevels(mydata[1:6,])

which is why we love R!

Monday, September 24, 2012

From continuous to categorical

During data analysis, it is often super useful to turn continuous variables into categorical ones. In Stata you would do something like this:

gen catvar=0
replace catvar=1 if contvar>0 & contvar<=3
replace catvar=2 if contvar>3 & contvar<=5

etc. And then you would label your values like so:

label define agelabel 0 "0" 1 "1-3" 2 "3-5"
label values catvar agelabel

How can we do this in R? There's a great function in R called cut() that does everything at once. It takes in a continuous variable and returns a factor (which is an ordered or unordered categorical variable). Factor variables are extremely useful for regression because they can be treated as dummy variables. I'll have another post on the merits of factor variables soon.

But for now, let's focus on getting our categorical variable. Here is our data:

And now we want to take that "Age" variable and turn in into a categorical variable. The most basic statement is like so:

mydata$Agecat1<-cut(mydata$Age, c(0,5,10,15,20,25,30))

Here the function cut() takes in as the first argument the continuous variable mydata$Age and it cuts it into chunks that are described in the second argument. So here I've indicated to make groups that go from 0-5, 6-10, 11-15, 16-20, etc. By default, the right side of the interval is closed while the left is open. You can change that, as we will see below. First, the output with the new "Agecat" variable:

Now we can customize our intervals. First, in Agecat2, I show how instead of spelling out every cutoff of the interval, I can just specify a sequence using seq(0, 30, 5) - this means we start at 0 and go to 30 by intervals of 5.

For Agecat3, I switch the default closed interval to be the left one by specifying "right=FALSE".

Finally, for Agecat4 I add in my own labels instead of the default "(0,5]" labels that are provided by R. I want them to be numbers instead so I indicate "labels=c(1:6)". The output of all of the options are shown below.

mydata$Agecat2<-cut(mydata$Age, seq(0,30,5))

mydata$Agecat3<-cut(mydata$Age, seq(0,30,5), right=FALSE)

mydata$Agecat4<-cut(mydata$Age, seq(0,30,5), right=FALSE, labels=c(1:6))

Now, if I want some summary statistics or a bivariate table, I get some nice output:

summary(mydata$Agecat1)

(0,5] (5,10] (10,15] (15,20] (20,25] (25,30]
0 1 2 0 0 1

table(mydata$Agecat1, mydata$Sex)

0 1
(0,5] 0 0
(5,10] 0 1
(10,15] 1 1
(15,20] 0 0
(20,25] 0 0
(25,30] 0 1

Why R for public health?

I created this blog to help public health researchers that are used to Stata or SAS to begin using R. I find that public health data is unique and this blog is meant to address the specific data management and analysis needs of the world of public health.

R is a very powerful tool for programming but can have a steep learning curve. In my experience, people find it easier to do it the long way with another programming language, rather than try R, because it just takes longer to learn. I think all statistical packages are useful and have their place in the public health world. However, I am a strong proponent of R and I hope this blog can help you move toward using it when it makes sense for you.

Please email me with posts you would like to see or R questions, and I'll try my best to answer them. Thanks for following!