x <- 1:5
y <- 10
z <- x + y
z
[1] 11 12 13 14 15
March 23, 2023
At the end of this lesson you will:
Repeated execution on a number of objects is a common task you will have to do. For example, when you want a bootstrap confidence interval on something youʻve estimated, you will run the analysis once, on the original data, and on 100 or 1000 samples of simulated data.
When you know how many times you want to repeat execution, two common methods are to use for loops
and apply
functions. apply()
functions are special functions that operate on lists, and come in different flavors depending on the type of object you want returned.
Many functions in R are already vectorized in that they will perform the same computation on the entire object (rather than element by element). Basic arithmetic on vectors is a good example.
x <- 1:5
y <- 10
z <- x + y
z
[1] 11 12 13 14 15
The two vectors, x and y, are added together in parallel because vector arithmetic is vectorized. This allows you to write code that is natural, fast, and easy to read.
If R were not vectorized (as in Fortran and C), you would have to code operations element by element like so:
z <- numeric(length(x))
for(i in seq_along(x)) { # seq_along(x) same as 1:length(x)
z[i] <- x[i] + y
}
z
[1] 11 12 13 14 15
Imagine if you had to do this each time you wanted to do anything to any objects! It would take a lot longer to get anything done. Vectorization makes coding much more natural.
When you do find operations that are not vectorized, you can use apply
functions (below). But before we get into that, letʻs build up our example with a loop.
for
loops are straightforward to understand, and are a general feature of every programming language. They are necessary at times, for example when you need the results of the previous iteration for the current one. But they are usually slower in R and sometimes not very elegant (making the code harder to understand). For example, think of a very simple function that calculates the square of a number:
square <- function( x ) {
return (x*x)
}
If you wanted to apply it to the vector 1:10, using a for loop, it would look like this:
xx <- vector(length=10) ## create a container for output
for ( i in 1:10 ) { ## step through i from 1 to 10
xx[i] <- square( i ) ## run square function for each i
}
xx
[1] 1 4 9 16 25 36 49 64 81 100
This runs the square()
function 10 times, once for each value of i
from 1 to 10. Importantly, notice that it works by going through i
one element at a time.
Another way to repeatedly execute code is via the apply()
functions. apply
functions are unique to R, and in some situations can operate on an entire object at once, which can make them fast. This is called vectorization.
Letʻs try sapply()
:
sapply( 1:10, square )
[1] 1 4 9 16 25 36 49 64 81 100
There are several different flavors of apply functions, but they all have similar forms:
sapply( X, FUN, ...)
Where X
is an object, and FUN
is a function. The function is applied to each element of X
, often simultaneously (whether this happens simultaneously or not depends on whether the function written with vectorization in mind, you have to just try).
sapply
and lapply
Another common type is lapply
, which operates on list objects and returns a list. sapply
(s
for simplify) is almost identical to lapply
, but tries to make prettier output by returning a vector or a matrix if possible (instead of a list):
sapply( 1:5, square )
[1] 1 4 9 16 25
lapply( 1:5, square )
[[1]]
[1] 1
[[2]]
[1] 4
[[3]]
[1] 9
[[4]]
[1] 16
[[5]]
[1] 25
There is also apply()
which works on matrices or arrays, and has an index argument for whether it should apply the function over rows or columns.
tapply
to apply the function across a grouping index or treatments.
mapply
to apply to multiple lists simultaneously.
outer
which applies the function to an outer product of two arrays, and more.
aggregate
is actually a user-friendly wrapper for tapply
, used to apply a function across groups.
All of the apply
functions work in the same way. Donʻt get overwhelmed - I mainly use sapply
or lapply
, and aggregate
, and occasionally apply
if I need to work over rows. Thatʻs all you need to remember, consult the help page when you need.
Letʻs calculate the mean and standard error of sexual size dimorphism (log(male size/female size)) in Anolis lizards, and make a nice table:
The data are included in the GitHub repo for this course. There are 23 species, with each species belonging to one of five ecomorph groups. We can aggregate by mean over ecomorph groups:
anolis <- read.csv("https://raw.githubusercontent.com/mbutler808/rclass/main/data/anolisSSD.csv")
aggregate(anolis$logSSD, by=list(anolis$ecomorph), mean)
Group.1 x
1 crown-giant 0.1391750
2 grass-bush 0.1437525
3 trunk 0.1467167
4 trunk-crown 0.2626575
5 trunk-ground 0.3339650
6 twig 0.0848450
Calculate the mean and the sd by ecomorph group, and this time save them:
anolis.mean <- aggregate(anolis$logSSD, by=list(anolis$ecomorph), mean)
anolis.sd <- aggregate(anolis$logSSD, by=list(anolis$ecomorph), sd)
anolis.sd
Group.1 x
1 crown-giant 0.09909567
2 grass-bush 0.06924584
3 trunk 0.02136480
4 trunk-crown 0.09968872
5 trunk-ground 0.06966130
6 twig 0.07107131
Give the results of aggregate meaningful column names:
names(anolis.mean) # check that this is what we want to modify
[1] "Group.1" "x"
While we’re at it, let’s get the sample size so that we can calculate the standard error, which is the standard deviation divided by the square root of the sample size.
To put the columns together, use merge()
. Here there is only one matching column (ecomorph), so the by=
is optional, but itʻs good practice:
Merging works two by two so we have to do it a second time to add the N
. There are also options for by.x=
and by.y=
in case your columns have different names in the two objects – you can tell R which two columns to match.
Now itʻs easy to add the standard error, and we can use the print()
function to reduce the number of digits displayed:
ecomorph mean sd N se
1 crown-giant 0.139 0.099 4 0.050
2 grass-bush 0.144 0.069 4 0.035
3 trunk 0.147 0.021 3 0.012
4 trunk-crown 0.263 0.100 4 0.050
5 trunk-ground 0.334 0.070 4 0.035
6 twig 0.085 0.071 4 0.036
Or rearrange for our paper format:
ecomorph N mean se
1 crown-giant 4 0.139 0.050
2 grass-bush 4 0.144 0.035
3 trunk 3 0.147 0.012
4 trunk-crown 4 0.263 0.050
5 trunk-ground 4 0.334 0.035
6 twig 4 0.085 0.036
We can save it for later as well:
apply
functionsIf the function needs additional arguments, you just provide them separated by commas:
sapply( X, FUN, arg1, arg2, ...)
For example, letʻs say we wanted to sample with replacement from the vector 1:5
. To do it once, we would do:
sample(5, replace=T)
[1] 4 5 4 4 1
To do it 4 times, you could do:
[,1] [,2] [,3] [,4]
[1,] 2 1 4 3
[2,] 5 1 1 1
[3,] 1 1 4 5
[4,] 3 5 5 2
[5,] 5 2 3 3
sapply
took the vector of fives and created a sample for each one.
Sometimes the function that you want to run inside of an apply function is more complicated and requires many lines. Suppose you wanted to run several functions or have many lines of code. You have two choices. First, you can write a function definition and then pass it to an apply function:
Alternatively you could define the function within the apply function:
sapply( input, function(x) {
...lines_of_code...
})
Where x
is a single element of the input
object, so if input
is a vector, x
would be one element of the vector. But if input
is a list, it would be the first list element, etc. Apply functions work really nicely with lists, and many times they handle dataframes nicely as well.
To return to one of our first examples, to code the square
function inside of the sapply
it would simply be:
sapply ( 1:10, function(x) x*x )
[1] 1 4 9 16 25 36 49 64 81 100
Where {}
around {x*x}
are optional here because itʻs only one line. This is much cleaner and more elegant than:
xx <- vector(length=10) ## create a container for output
for ( i in 1:10 ) { ## step through i`s from 1 to 10
xx[i] <- square( i ) ## run square function for each i
}
xx
[1] 1 4 9 16 25 36 49 64 81 100
Furthermore, itʻs often easier to understand assigning the output object, because the entire object is returned, not filled element by element:
xx <- sapply ( 1:10, function(x) x*x )
This is another advantage of thinking of the manipulation on the whole object rather than pieces of it.
mylist <- vector("list") ## creates a null (empty) list
for (i in 1:4) {
mylist[i] <- list(data.frame(x=rnorm(3), y=rnorm(3)))
}
x
as a function of y
for each dataframe using an apply function.x ~ y
on each dataframe, and save the anova output (there should be 4 of them) to a list or dataframe.