Showing posts with label apply. Show all posts
Review: Kölner R Meeting 18 October 2013
The Cologne R user group met last Friday for two talks on split apply combine in R and XLConnect by Bernd Weiß and Günter Faes respectively, before the usual Schnitzel and Kölsch at the Lux.
Split apply combine in R
The apply family of functions in R is incredibly powerful, yet often somewhat mysterious to newcomers. Bernd therefore gave an overview of the different apply functions and their cousins. The functions differ in the objects they take as input (vectors, arrays, data frames or lists) and in what they return. Other related functions are by, aggregate and ave. While functions like aggregate reduce the output size, others like ave return as many rows as the input object and repeat the results where necessary.
As an alternative to the base R functions, Bernd also touched on the **ply functions of the plyr package. The function names are certainly easier to remember, but their syntax can be a little awkward (.()). Bernd's slides, in German, are already available from our Meetup site.
XLConnect
When dealing with data stored in spreadsheets, most members of the group rely on read.csv and write.csv in R. However, if you have a spreadsheet with multiple tabs and formatted numbers, read.csv becomes clumsy, as you would have to save each tab without any formatting in a separate file.
Günter presented the XLConnect package as an alternative to read.csv or indeed RODBC for reading spreadsheet data. It uses the Apache POI API as the underlying interface. XLConnect requires a Java runtime environment on your computer, but no installation of Excel. That makes it a truly platform-independent solution for exchanging data between spreadsheets and R. Not only can you read defined rows and columns, or indeed named ranges, from Excel into R, but you can also write data back to Excel files in the same way and, to top it all, graphic output from R as well.
Next Kölner R meeting
The next meeting is scheduled for 13 December 2013. A discussion of the data.table package is already on the agenda.
Please get in touch if you would like to present and share your experience, or indeed if you have a request for a topic you would like to hear more about. For more details see also our Meetup page.
Thanks again to Bernd Weiß for hosting the event and Revolution Analytics for their sponsorship.
22 Oct 2013, 07:45
Labels: aggregate, apply, ave, Koelner R User, Kölner R Users, R, XLconnect
Say it in R with "by", "apply" and friends
Image: Iris versicolor, by Danielle Langlois. License: CC-BY-SA
Languages are full of surprises, in particular for non-native speakers. The other day I learned that there is courtesy and there is curtsey. The two words sound very similar to me, so of course it caused some laughter when I mixed them up in an email.
With languages you can get into the habit of using certain words and phrases, but sometimes you see or hear something that shakes you up again. The following two lines of R did that to me:
f <- function(x) x^2
sapply(1:10, f)
[1] 1 4 9 16 25 36 49 64 81 100
It reminded me of the phrase that everything is a list in R. It showed me again how easily a for loop can be turned into a statement using the apply family of functions and how little I know about all the subtleties of R.
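To make the comparison with a for loop concrete, here is a minimal sketch. The variable names res_loop, res_sapply and res_vapply are chosen for illustration and are not part of the original example. It also shows vapply, a stricter sibling of sapply that checks the type and length of each result.

```r
f <- function(x) x^2

# Explicit loop: pre-allocate the result, then fill it element by element
res_loop <- numeric(10)
for (i in 1:10) {
  res_loop[i] <- f(i)
}

# The same computation with sapply: apply f to each element and simplify
res_sapply <- sapply(1:10, f)

# vapply does the same, but insists that every result is a single numeric value
res_vapply <- vapply(1:10, f, numeric(1))

identical(res_loop, res_sapply)
identical(res_sapply, res_vapply)
```

Since f is already vectorised, f(1:10) would give the same numbers directly; the apply style pays off when the function only works on one element at a time.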
I remember how happy I felt when I finally understood the by function in R. I started to use it all the time, turning a blind eye to aggregate and the apply family of functions. Here is an example where I calculate the means of the various measurements by species of the famous iris data set using by.
by
do.call("rbind", as.list(
by(iris, list(Species=iris$Species), function(x){
y <- subset(x, select= -Species)
apply(y, 2, mean)
}
)))
Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
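It may help to look at the intermediate result before it is combined. The sketch below stores the output of by in a variable called res (a name picked here for illustration) and uses colMeans as a shortcut for apply(y, 2, mean). The result is an object of class "by", essentially a list with one vector of means per species, which do.call("rbind", ...) then stacks into a matrix.

```r
# Column means per species; colMeans stands in for apply(y, 2, mean)
res <- by(iris[, 1:4], list(Species = iris$Species), colMeans)

class(res)        # "by"
names(res)        # one element per species
res[["setosa"]]   # named vector with the four means for setosa

# Stack the per-species vectors into a matrix, one row per species
m <- do.call("rbind", as.list(res))
dim(m)
```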
Now let's find alternative ways of expressing ourselves, using other words/functions of the R language, such as aggregate, apply, sapply, tapply, data.table, ddply, sqldf, and summaryBy.
aggregate
The aggregate function splits the data into subsets and computes summary statistics for each of them. The output of aggregate is a data.frame, including a column for species.
iris.x <- subset(iris, select= -Species)
iris.s <- subset(iris, select= Species)
aggregate(iris.x, iris.s, mean)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
Addition: As John Christie points out in the comments, aggregate also has a formula interface, which simplifies the call to:
aggregate( . ~ Species, iris, mean)
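The formula interface also makes it easy to restrict the summary to selected columns by wrapping them in cbind on the left-hand side. A small sketch, in which the object name sepal.means is made up for the example:

```r
# Means of the two sepal measurements only, by species
sepal.means <- aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species,
                         data = iris, FUN = mean)
sepal.means
```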
apply and tapply
The combination of tapply and apply achieves a similar result, but this time the output is a matrix and hence I lose the column with species. The species are now the row names.
apply(iris.x, 2, function(x) tapply(x, iris.s, mean))
Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
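The inner building block is easier to see on a single column. In this sketch tapply computes the mean of one measurement per species; apply in the statement above simply repeats this for every column. The name sl is chosen here for illustration.

```r
# Mean sepal length per species: a one-dimensional array
# whose names are the levels of the grouping factor
sl <- tapply(iris$Sepal.Length, iris$Species, mean)
sl
names(sl)
```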
split and apply
Here I first split the data into subsets for each of the species and then calculate the mean for each column in the subset. The output is a matrix again, but transposed.
sapply(split(iris.x, iris.s), function(x) apply(x, 2, mean))
setosa versicolor virginica
Sepal.Length 5.006 5.936 6.588
Sepal.Width 3.428 2.770 2.974
Petal.Length 1.462 4.260 5.552
Petal.Width 0.246 1.326 2.026
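If you prefer the species in rows, as in the earlier examples, the result can be transposed with t. A sketch, which also uses colMeans in place of apply(x, 2, mean) and defines iris.x again so that it runs on its own:

```r
iris.x <- subset(iris, select = -Species)

# Split by species, take the column means of each subset, then transpose
m <- t(sapply(split(iris.x, iris$Species), colMeans))
m
```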
ddply
Hadley Wickham's plyr package provides tools for splitting, applying and combining data. The function ddply is similar to the by function, but it returns a data.frame instead of a by object and maintains the column for the species.
library(plyr)
ddply(iris, "Species", function(x){
y <- subset(x, select= -Species)
apply(y, 2, mean)
})
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
Addition: Sean mentions in the comments an alternative, using the colMeans function, while Andrew reminds us of the reshape package with its functions melt and cast.
ddply(iris, "Species", function(x) colMeans(subset(x, select= -Species)))
## or
ddply(iris, "Species", colwise(mean))
## same output as above
library(reshape)
cast(melt(iris, id.vars='Species'),formula=Species ~ variable,mean)
## same output as above
summaryBy
The summaryBy function of the doBy package by Søren Højsgaard and Ulrich Halekoh has a very intuitive interface, using formulas.
library(doBy)
summaryBy(Sepal.Length + Sepal.Width + Petal.Length + Petal.Width ~ Species, data=iris, FUN=mean)
Species Sepal.Length.mean Sepal.Width.mean Petal.Length.mean Petal.Width.mean
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
sqldf
If you are fluent in SQL, then the sqldf package by Gabor Grothendieck might be the one for you.
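library(sqldf)
## Written for an older sqldf, which translated dots in column names to underscores.
## Recent versions keep the dots, so quote the names there, e.g. avg("Sepal.Length").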
sqldf("select Species, avg(Sepal_Length), avg(Sepal_Width),
avg(Petal_Length), avg(Petal_Width) from iris
group by Species")
Species avg(Sepal_Length) avg(Sepal_Width) avg(Petal_Length) avg(Petal_Width)
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
data.table
The data.table package by M Dowle, T Short and S Lianoglou is the real rock star to me. It provides an elegant and fast way to complete the task. The statement reads in plain English from right to left: take columns 1 to 4, split them by the factor in column "Species" and calculate the means on the subsets of data (.SD).
library(data.table)
iris.dt <- data.table(iris)
iris.dt[,lapply(.SD,mean),by="Species",.SDcols=1:4]
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,] setosa 5.006 3.428 1.462 0.246
[2,] versicolor 5.936 2.770 4.260 1.326
[3,] virginica 6.588 2.974 5.552 2.026
apply
I should mention that R also provides the iris data set in array form. The third dimension of the iris3 array holds the species information. Therefore I can use the apply function again: I go down the third and then the second dimension to calculate the means.
apply(iris3, c(3,2), mean)
Sepal L. Sepal W. Petal L. Petal W.
Setosa 5.006 3.428 1.462 0.246
Versicolor 5.936 2.770 4.260 1.326
Virginica 6.588 2.974 5.552 2.026
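As a quick sanity check that these different ways of saying the same thing really agree, the base R results can be compared with all.equal. This is only a sketch: the object names m.by, m.agg and m.iris3 are made up for the comparison, and the dimnames are dropped because iris3 labels its rows and columns differently.

```r
# by: matrix with the species as row names
m.by <- do.call("rbind", as.list(
  by(iris[, 1:4], list(Species = iris$Species), colMeans)))

# aggregate: returns a data.frame, so drop the Species column and convert
m.agg <- as.matrix(aggregate(. ~ Species, iris, mean)[, -1])

# iris3: array version of the data, species in the third dimension
m.iris3 <- apply(iris3, c(3, 2), mean)

# Compare the numbers only; row and column labels differ between versions
all.equal(unname(m.by), unname(m.agg))
all.equal(unname(m.by), unname(m.iris3))
```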
