R Grouping functions: sapply vs. lapply vs. apply vs. tapply vs. by vs. aggregate


Whenever I want to do something "map"py in R, I usually try to use a function in the apply family. (Side question: I still haven't learned plyr or reshape -- would plyr or reshape replace all of these entirely?)

However, I've never quite understood the differences between them [how {sapply, lapply, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be], so I often just go through them all until I get what I want.

Can someone explain how to use which one when?

[My current (probably incorrect/incomplete) understanding is...

  1. sapply(vec, f): input is a vector. Output is a vector/matrix, where element i is f(vec[i]) [giving you a matrix if f has a multi-element output]
  2. lapply(vec, f): same as sapply, but output is a list?
  3. apply(matrix, 1/2, f): input is a matrix. Output is a vector, where element i is f(row/col i of the matrix)
  4. tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names
  5. by(dataframe, grouping, f): let g be a grouping. Apply f to each column of the group/dataframe. Pretty print the grouping and the value of f at each column.
  6. aggregate(matrix, grouping, f): similar to by, but instead of pretty printing the output, aggregate sticks everything into a dataframe.]

R has many *apply functions, which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning users may have difficulty deciding which one is appropriate for their situation, or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.

Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.

This answer is intended to act as a sort of signpost for new users, to help direct them to the correct *apply function for their particular problem. Note that it is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you decide which *apply function suits your situation, after which it is up to you to research it further. With one exception, performance differences will not be addressed.

  • apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames, as it will coerce to a matrix first.

    # Two dimensional matrix
    m <- matrix(seq(1,16), 4, 4)

    # Apply min to rows
    apply(m, 1, min)
    [1] 1 2 3 4

    # Apply max to columns
    apply(m, 2, max)
    [1]  4  8 12 16

    # 3 dimensional array
    m <- array( seq(32), dim = c(4,4,2))

    # Apply sum across each m[*, , ] - i.e. sum across the 2nd and 3rd dimensions
    apply(m, 1, sum)
    # Result is one-dimensional
    [1] 120 128 136 144

    # Apply sum across each m[*, *, ] - i.e. sum across the 3rd dimension
    apply(m, c(1,2), sum)
    # Result is two-dimensional
         [,1] [,2] [,3] [,4]
    [1,]   18   26   34   42
    [2,]   20   28   36   44
    [3,]   22   30   38   46
    [4,]   24   32   40   48

    If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.
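
As a rough check of the two points in this item, the sketch below compares the dedicated helpers with the equivalent apply() calls, and shows the coercion that happens when apply() is given a data frame. The data frame `df` and its column names are made up for illustration.

```r
m <- matrix(seq(1, 16), 4, 4)

# The dedicated helpers give the same numbers as the apply() calls
row_sums_apply  <- apply(m, 1, sum)
row_sums_fast   <- rowSums(m)
col_means_apply <- apply(m, 2, mean)
col_means_fast  <- colMeans(m)

all.equal(row_sums_apply, row_sums_fast)
all.equal(col_means_apply, col_means_fast)

# Hypothetical data frame with mixed column types: apply() converts it to a
# character matrix first, so every row reaches the function as character
df <- data.frame(id = c("a", "b"), value = c(1.5, 2.5))
coerced <- apply(df, 1, function(row) class(row))
coerced
```

Because of that coercion, a numeric summary applied row-wise to a mixed-type data frame would be working on strings, which is the reason for the warning above.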

  • lapply - When you want to apply a function to each element of a list in turn and get a list back.

    This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.

       x <- list(a = 1, b = 1:3, c = 10:100)
       lapply(x, FUN = length)
       $a
       [1] 1
       $b
       [1] 3
       $c
       [1] 91
       lapply(x, FUN = sum)
       $a
       [1] 1
       $b
       [1] 6
       $c
       [1] 5005
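
Two further uses of lapply, sketched here on made-up data (the names `df`, `col_classes` and `means` are illustrative only): a data frame is itself a list of columns, and any arguments given after the function are passed on to each call.

```r
# A data frame is a list of columns, so lapply() visits each column in turn
df <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))
col_classes <- lapply(df, class)

# Extra arguments after FUN are forwarded to every call of the function
x <- list(a = c(1, NA, 3), b = c(4, 5, NA))
means <- lapply(x, mean, na.rm = TRUE)

col_classes
means
```
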
  • sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.

    If you find yourself typing unlist(lapply(...)), stop and consider sapply.

       x <- list(a = 1, b = 1:3, c = 10:100)
       # Compare with above; a named vector, not a list
       sapply(x, FUN = length)
       a  b  c
       1  3 91
       sapply(x, FUN = sum)
          a    b    c
          1    6 5005
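
For this simple case, where every result has length one, the two spellings should give the same named vector. A minimal sketch (the variable names are illustrative):

```r
x <- list(a = 1, b = 1:3, c = 10:100)

via_unlist <- unlist(lapply(x, length))
via_sapply <- sapply(x, length)

# Both are named integer vectors with the same contents
identical(via_unlist, via_sapply)
```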

    In more advanced uses, sapply will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:

       sapply(1:5,function(x) rnorm(3,x)) 

    If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:

       sapply(1:5,function(x) matrix(x,2,2)) 

    Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:

       sapply(1:5,function(x) matrix(x,2,2), simplify = "array") 

    Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
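
The shapes involved can be checked directly. The sketch below uses deterministic functions instead of rnorm so the dimensions are easy to follow; the variable names are illustrative.

```r
# Equal-length results become the columns of a matrix
same_len <- sapply(1:3, function(i) c(i, i^2))
dim(same_len)

# Results of differing length cannot be simplified, so a list comes back
ragged <- sapply(1:3, function(i) seq_len(i))
is.list(ragged)

# 2x2 matrices are flattened to columns of length 4 by default ...
flat <- sapply(1:5, function(x) matrix(x, 2, 2))
dim(flat)

# ... and stacked along a third dimension with simplify = "array"
stacked <- sapply(1:5, function(x) matrix(x, 2, 2), simplify = "array")
dim(stacked)
```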

  • vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code.

    For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.

    x <- list(a = 1, b = 1:3, c = 10:100)
    # Note that since the advantage here is mainly speed, this
    # example is only for illustration. We're telling R that
    # everything returned by length() should be an integer of
    # length 1.
    vapply(x, FUN = length, FUN.VALUE = 0L)
    a  b  c
    1  3 91
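
Beyond speed, the template given in FUN.VALUE is also checked against every result, which goes slightly past the point made above. A minimal sketch of that behaviour (the variable names and the "error" marker string are illustrative):

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# Succeeds: every length() result is a single integer
lens <- vapply(x, FUN = length, FUN.VALUE = integer(1))

# Fails: range() returns two values, not the single number promised
res <- tryCatch(
  vapply(x, FUN = range, FUN.VALUE = numeric(1)),
  error = function(e) "error"
)

# On empty input vapply still returns the promised type,
# whereas sapply returns an empty list
empty_v <- vapply(list(), FUN = length, FUN.VALUE = integer(1))
empty_s <- sapply(list(), FUN = length)
```
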
  • mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.

    This is multivariate in the sense that your function must accept multiple arguments.

    # Sums the 1st elements, the 2nd elements, etc.
    mapply(sum, 1:5, 1:5, 1:5)
    [1]  3  6  9 12 15
    # To do rep(1,4), rep(2,3), etc.
    mapply(rep, 1:4, 4:1)
    [[1]]
    [1] 1 1 1 1

    [[2]]
    [1] 2 2 2

    [[3]]
    [1] 3 3

    [[4]]
    [1] 4
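
Two details that often come up with mapply, sketched on made-up inputs: names on the first vector carry over to the result, and arguments that should stay fixed across all calls go in MoreArgs. The names `totals` and `repeated` are illustrative.

```r
# Names on the first vector are used to name the result
totals <- mapply(function(x, y) x + y, c(a = 1, b = 2, c = 3), c(10, 20, 30))

# MoreArgs holds arguments that are the same for every call;
# here x = "z" is repeated 1, 2 and 3 times
repeated <- mapply(rep, times = 1:3, MoreArgs = list(x = "z"))

totals
repeated
```
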
  • Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.

    Map(sum, 1:5, 1:5, 1:5)
    [[1]]
    [1] 3

    [[2]]
    [1] 6

    [[3]]
    [1] 9

    [[4]]
    [1] 12

    [[5]]
    [1] 15
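
The relationship to mapply can be checked directly. A minimal sketch; the `pairs` example and its inputs are made up for illustration.

```r
via_map    <- Map(sum, 1:5, 1:5, 1:5)
via_mapply <- mapply(sum, 1:5, 1:5, 1:5, SIMPLIFY = FALSE)
identical(via_map, via_mapply)

# The result stays a list even when it could have been simplified;
# a character first argument supplies the names
pairs <- Map(function(n, v) paste(n, v), c("x", "y"), c(1, 2))
pairs
```
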
  • rapply - For when you want to apply a function to each element of a nested list structure, recursively.

    To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:

    # Append ! to a string, otherwise increment
    myfun <- function(x){
        if (is.character(x)){
            return(paste(x,"!",sep=""))
        }
        else{
            return(x + 1)
        }
    }

    # A nested list structure
    l <- list(a = list(a1 = "boo", b1 = 2, c1 = "eeek"),
              b = 3, c = "yikes",
              d = list(a2 = 1, b2 = list(a3 = "hey", b3 = 5)))

    # Result is a named vector, coerced to character
    rapply(l, myfun)

    # Result is a nested list like l, with values altered
    rapply(l, myfun, how = "replace")
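
An alternative to branching on type inside the function is rapply's classes argument, which restricts the function to leaves of a given class. A minimal sketch on the same nested list; the names `incremented` and `numbers_only` are illustrative.

```r
l <- list(a = list(a1 = "boo", b1 = 2, c1 = "eeek"),
          b = 3, c = "yikes",
          d = list(a2 = 1, b2 = list(a3 = "hey", b3 = 5)))

# Only numeric leaves reach the function; with how = "replace" the
# character leaves are left untouched and the nesting is preserved
incremented <- rapply(l, function(x) x + 1, classes = "numeric", how = "replace")

# With how = "unlist", leaves that do not match are replaced by
# deflt (NULL by default) and so vanish from the flattened result
numbers_only <- rapply(l, function(x) x + 1, classes = "numeric", how = "unlist")

incremented$d$b2
numbers_only
```
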
  • tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.

    The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.

    A vector:

       x <- 1:20 

    A factor (of the same length!) defining groups:

       y <- factor(rep(letters[1:5], each = 4)) 

    Add up the values in x within each subgroup defined by y:

       tapply(x, y, sum)
        a  b  c  d  e
       10 26 42 58 74

    More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.), hence its black sheep status.
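
To make the multi-factor case concrete, and to touch on the by and aggregate functions raised in the question, here is a small sketch on a made-up data frame. The data frame `df` and its columns `value`, `group` and `half` are hypothetical.

```r
df <- data.frame(
  value = 1:8,
  group = rep(c("a", "b"), each = 4),
  half  = rep(c("first", "second"), times = 4)
)

# Two grouping vectors: the result is a matrix indexed by both
sums_2d <- tapply(df$value, list(df$group, df$half), sum)

# aggregate() computes the same group sums but returns a data frame,
# one row per combination of group and half
sums_df <- aggregate(value ~ group + half, data = df, FUN = sum)

# by() hands each group's sub-data-frame to the function
sums_by <- by(df, df$group, function(d) sum(d$value))

sums_2d
sums_df
sums_by
```

The three calls differ mainly in what they hand back: tapply gives an array, aggregate gives a data frame, and by gives an object of class "by" that prints one block per group.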

