Introduction to OU Models

Comparative analyses with Ornstein-Uhlenbeck models
module 8
week 15
comparative analysis
phylogenetic trees
Ornstein Uhlenbeck
Brownian motion
Author
Affiliation

School of Life Sciences, University of Hawaii

Published

April 27, 2023

Learning objectives

Learning Objectives

Goals: - Approaches for adaptive evolution (ouch, slouch, others) - Model-based vs statistical approaches

Concepts: - Model comparison tools in R - Process-based models

Introduction to OU Models

The OU Model for Comparative Analysis

A Brownian motion process can be described using the following differential equation:

\[ dX(t) = \sigma\,dB(t). \tag{1}\]

If we imagine the phenotype \(X\) as changing through time \(t\), this equation says that in a small increment of time, the change will be proportional to the parameter \(\sigma\). Here, \(dB(t)\) is a sample from a Brownian (white noise) process.

A small step towards reality is the OU Process:

\[ dX(t) = \alpha\,(\theta - X(t))\,dt + \sigma\,dB(t). \tag{2}\]

Eq. Equation 2 expresses the amount of change in character \(X\) over the course of a small increment of time: specifically, \(dX(t)\) is the infinitesimal change in the character \(X\) over the infinitesimal interval from time \(t\) to time \(t+dt\). The term \(dB(t)\) is white noise; that is, the random variables \(dB(t)\) are independent and identically-distributed normal random variables, each with mean zero and variance \(dt\). The parameter \(\alpha\) measures the strength of selection. When \(\alpha = 0\), the deterministic part of the OU model is diminished and Equation 2 approaches the familiar BM model of pure drift,

ouch package

See ouch lecture.

Good starting points available in package ouch:
- ?bimac help page for Bimaculatus character displacement dataset
- example(bimac) example of bimac analysis - ?anolis.ssd help page for Anolis sexual size dimorphism dataset

ouch is a package designed to test adaptive hypotheses using variations of the OU process, including BM (Butler and King 2004; King and Butler 2022). OUCH implements a model that fits alpha and sigma parameters to the entire phylogeny, but allows the user to specify which branches belong to different selective regimes. The location of the optima are also fit.

The Data

The data in OUCH are most easily assembled as a data frame. Load the built in example from ouch and then print it to the screen (I only printed the head of the dataset here):

require(ouch)
data(bimac)
bimac
Loading required package: ouch
  node spcode species island size ancestor time OU.1   OU.3 OU.4  OU.LP
1    1   <NA>    <NA>   <NA>   NA       NA    0   ns medium  anc medium
2    2   <NA>    <NA>   <NA>   NA        1   12   ns medium  anc medium
3    3   <NA>    <NA>   <NA>   NA        2   32   ns medium  anc  small
4    4   <NA>    <NA>   <NA>   NA        3   34   ns medium  anc  small
5    5   <NA>    <NA>   <NA>   NA        4   36   ns medium  anc  small
6    6   <NA>    <NA>   <NA>   NA        3   36   ns medium  anc  small

NOTE: a very important detail about ouch is that it matches trees with data and regimes using the node labels stored in the rownames of the objects you pass to the ouch functions. So it is important to make sure that your dataframes and vectors are appropriately named. The dataframe bimac already has the correct row names, but we do so here just to illustrate.

rownames(bimac) <- bimac$node

ouch was designed around a rectangular data model, so although the tree object is not a dataframe internally, it still helps us to build the data as a dataframe before making the ouchtree objects. The central organizing element is the node: it has a node number (usually an integer but it is actually a unique character string), an ancestor to which it is joined by a branch, a time since the root of the tree, and optional label such as a species name.

The hypotheses

The hypotheses which we use are assigned by painting particular regimes on branches. It is convenient to represent each model or hypothesis as a column on the dataframe, with the regime assigned to the node (that is, it is assigned to the branch connecting the node to its ancestor).

Make an ouchtree object using the ouchtree constructor. with is a very nice function to create a small local environment so that you can use a dataframe’s elements directly without using the bimac$ prefix.

It is similar to an attach but it is temporary – only lasting as long as the call itself. I like it much better than attach because I sometimes forget what I’ve attached and run into problems later. Also, with attach, you are actually working with a copy of the original dataframe object, so updating values is tricky. With with, it is more clear what’s going on, and I don’t tend to make those mistakes.

tree <- with(bimac, ouchtree(node,ancestor,time/max(time),species))
plot(tree)

ouch fits the OU model Eq. Equation 2 along each branch of the phylogeny. While \(\alpha\) and \(\sigma\) are held constant across the entire tree, the optima along each branch \(\theta\) are allowed to vary. Users can then paint various combinations of optima on the tree to reflect various biological scenarios.

For example, the dataset bimac was used to test the hypothesis of character displacement using an interspecific daaset of body sizes and sympatry/allopatry Butler and King (2004). The analysis tested several different models, which are included with bimac. They are: OU.1 or global optimum, OU.3 or small, medium, and large regimes depending on the body size of the observed species (terminal branches only, internal branches painted medium, OU.4 or the same as OU.3 but with internal branches given their own unique regime called ancestral, and OU.LP based on a linear parsimony reconstruction of the colonization events (i.e., that as species came into sympatry, they diverged in body size).

Plotting ouchtrees

You can plot the regime paintings on the tree, and set options such as line widths for prettier plots. ouch has a very nice feature which allows plotting of the alternative models on one plot.

plot(tree, regimes=bimac[c("OU.1", "OU.3", "OU.4", "OU.LP")], lwd=6)

Remember that you can pass a single vector or a data frame to the regimes parameter, but it must have the appropriate row names or names in the case of a vector. The regimes are not part of the ouchtree object, because they represent our hypothesis of evolution along the tree, rather than the tree itself. It is part of the original dataframe from which we derived the tree, so remember to refer to bimac when passing the regimes to the plot function.

Fitting models

There are two main model fitting functions in ouch, brown, which fits Brownian motion models, and hansen, which fits OU models to comparative data. The call to brown is particularly simple, as it takes only the data and the tree:

brown(log(bimac['size']),tree)

call:
brown(data = log(bimac["size"]), tree = tree)
   nodes ancestors     times             labels     size
1      1      <NA> 0.0000000               <NA>       NA
2      2         1 0.3157895               <NA>       NA
3      3         2 0.8421053               <NA>       NA
4      4         3 0.8947368               <NA>       NA
5      5         4 0.9473684               <NA>       NA
6      6         3 0.9473684               <NA>       NA
7      7         1 0.2105263               <NA>       NA
8      8         7 0.3421053               <NA>       NA
9      9         8 0.4736842               <NA>       NA
10    10         9 0.6052632               <NA>       NA
11    11        10 0.7368421               <NA>       NA
12    12         9 0.7368421               <NA>       NA
13    13         8 0.5789474               <NA>       NA
14    14        13 0.6842105               <NA>       NA
15    15        14 0.8947368               <NA>       NA
16    16        15 0.9473684               <NA>       NA
17    17         7 0.7368421               <NA>       NA
18    18        17 0.7894737               <NA>       NA
19    19        18 0.8947368               <NA>       NA
20    20        19 0.9473684               <NA>       NA
21    21        20 0.9736842               <NA>       NA
22    22        19 0.9473684               <NA>       NA
23    23         2 1.0000000       Anolis pogus 2.602690
24    24         4 1.0000000   Anolis schwartzi 2.660260
25    25         5 1.0000000   Anolis schwartzi 2.660260
26    26         5 1.0000000   Anolis schwartzi 2.653242
27    27         6 1.0000000     Anolis wattsii 2.674149
28    28         6 1.0000000     Anolis wattsii 2.701361
29    29        10 1.0000000 Anolis bimaculatus 3.161247
30    30        11 1.0000000 Anolis bimaculatus 3.299534
31    31        11 1.0000000 Anolis bimaculatus 3.328627
32    32        12 1.0000000      Anolis leachi 3.353407
33    33        12 1.0000000      Anolis leachi 3.360375
34    34        13 1.0000000     Anolis nubilus 3.049273
35    35        14 1.0000000     Anolis sabanus 2.906901
36    36        15 1.0000000  Anolis gingivinus 2.980619
37    37        16 1.0000000  Anolis gingivinus 2.933857
38    38        16 1.0000000  Anolis gingivinus 2.975530
39    39        17 1.0000000    Anolis oculatus 3.104587
40    40        18 1.0000000     Anolis ferreus 3.346389
41    41        20 1.0000000     Anolis lividus 2.928524
42    42        21 1.0000000  Anolis marmoratus 2.939162
43    43        21 1.0000000  Anolis marmoratus 2.990720
44    44        22 1.0000000 Anolis terraealtae 3.058707
45    45        22 1.0000000 Anolis terraealtae 3.068053

sigma squared:
           [,1]
[1,] 0.04311003

theta:
NULL
   loglik  deviance       aic     aic.c       sic       dof 
 17.33129 -34.66257 -30.66257 -30.06257 -28.39158   2.00000 

What is returned is an object of class browntree. It contains all input including the function call, the tree and data), as well as the parameter estimate for \(\sigma\) and the model fit statistics including: the log-likelihood, the deviance (\(-2*log(L)\)), the information criteria \(AIC\), \(AIC_c\) (corrected for small sample size), and \(SIC\), and the model degrees of freedom.

It is a good practice to save this, as it encapsulates the analysis. From this, we can rerun the model fit.

h1 <- brown(log(bimac['size']),tree)

hansen models are slightly more complex. In addition to \(\sigma\), we are now fitting \(\alpha\), the strength of selection, and all of the optima \(\theta\) specified by our model. This maximum-likelihood search now requires an initial guess. If you have no idea, a good starting guess is 1. If you want to be sure, you can intiate searches with different starting guesses. You can also specify alternative optimization algorithms and increase or decrease the relative tolerance, which is the stringency by which convergence is assessed. Typically, the default is roughly reltol=1e-8, and the limit of machine precision is in the neighborhood of reltol=1e-15.

h2 <- hansen(log(bimac['size']),
             tree,
             bimac['OU.1'],
             sqrt.alpha=1,
             sigma=1)
h3 <- hansen(log(bimac['size']),
             tree,bimac['OU.3'], 
             sqrt.alpha=1,
             sigma=1)
h4 <- hansen(log(bimac['size']),
             tree,
             bimac['OU.4'], 
             sqrt.alpha=1,
             sigma=1)
h5 <- hansen(log(bimac['size']),
             tree,
             bimac['OU.LP'], 
             sqrt.alpha=1,
             sigma=1,
             reltol=1e-5)

hansentree and ouchtree methods

We can see the model results by typing h5, which will execute the print method for this class. You could also use the attributes function, but this will dump too much information. ouchtree objects and the classes derived from them contain information that is used in internal calculations of the algorithms, not of general interest to users.

Additional accessor functions include:

coef(h5)    # the coefficients of the fitted model
$sqrt.alpha
[1] 1.61658

$sigma
[1] 0.2249274

$theta
$theta$size
   large   medium    small 
3.355087 3.040729 2.565249 


$alpha.matrix
        [,1]
[1,] 2.61333

$sigma.sq.matrix
           [,1]
[1,] 0.05059232
logLik(h5)   # the log-likelihood value
[1] 24.81823
summary(h5) # (not printed) everything except the tree+data

We can now generate a table of our model fits:

    # just the model fit statistics on a single line
unlist(summary(h5)[c('aic', 'aic.c', 'sic', 'dof')])  
      aic     aic.c       sic       dof 
-39.63645 -36.10704 -33.95898   5.00000 
h <- list(h1, h2, h3, h4, h5)   # store fitted models in a list
names(h) <- c('BM', 'OU.1', 'OU.3', 'OU.4', 'OU.LP')
sapply( h, function(x)
             unlist(
                summary(x)[c('aic', 'aic.c', 'sic', 'dof')]
             ) 
      )                         # table with all models
             BM      OU.1      OU.3      OU.4     OU.LP
aic   -30.66257 -25.39364 -29.15573 -35.22319 -39.63645
aic.c -30.06257 -24.13048 -25.62631 -29.97319 -36.10704
sic   -28.39158 -21.98715 -23.47826 -28.41022 -33.95898
dof     2.00000   3.00000   5.00000   6.00000   5.00000

By storing the model fits in a list, we can use apply methods to get the statistics from all the models at once. sapply returns a matrix if possible.

h.ic <- sapply( h, function(x) 
                     unlist(
                      summary(x)[c('aic', 'aic.c', 'sic', 'dof')]
                     ) 
               )  
print( h.ic, digits = 3)
         BM  OU.1  OU.3  OU.4 OU.LP
aic   -30.7 -25.4 -29.2 -35.2 -39.6
aic.c -30.1 -24.1 -25.6 -30.0 -36.1
sic   -28.4 -22.0 -23.5 -28.4 -34.0
dof     2.0   3.0   5.0   6.0   5.0

Simulation and bootstrap methods:

simulate generates random deviates or sets of simulated tip data based on the fitted model. The input is a fitted model hansentree or browntree, and the output is a list of dataframes, each comparable to the original data. These can then be used to refit the model.

h5.sim <- simulate(object = h5, nsim=10)   # saves 10 sets of simulated data
                                  #  based on OU.LP

update refits the model, with one or more parameters changed.

summary( update( object = h5, 
                 data = h5.sim[[1]] 
                )  # fit the first dataset
        )   
$call
hansen(data = data, tree = object, regimes = regimes, sqrt.alpha = sqrt.alpha, 
    sigma = sigma)

$conv.code
[1] 0

$optimizer.message
NULL

$alpha
         [,1]
[1,] 3.530333

$sigma.squared
           [,1]
[1,] 0.04448437

$optima
$optima$size
   large   medium    small 
3.333356 3.019454 2.619120 


$loglik
[1] 28.5958

$deviance
[1] -57.19159

$aic
[1] -47.19159

$aic.c
[1] -43.66218

$sic
[1] -41.51412

$dof
[1] 5
h5.sim.fit <- lapply( h5.sim, 
                      function(x) update(h5, x)
                    ) # fit all 10 simulations

bootstrap is a convenience function for generating parametric bootstraps of the parameter estimates. It takes the fitted model, performs the simulations, refits, and outputs a dataframe of parameter estimates.

bootstrap( object = h5, nboot=10)
      alpha sigma.squared optima.size.large optima.size.medium
1  3.927412    0.03738445          3.323732           3.037843
2  3.808633    0.04102961          3.229301           3.019838
3  6.620335    0.08857152          3.201652           3.008789
4  5.689668    0.06439393          3.387377           3.005807
5  4.041209    0.04768774          3.367822           2.974627
6  4.627289    0.05344219          3.392546           3.045898
7  4.551867    0.07127402          3.271024           3.024687
8  2.708405    0.04256169          3.344545           3.033925
9  8.064081    0.05081199          3.343484           3.081023
10 4.496143    0.10630749          3.257126           3.097088
   optima.size.small   loglik       aic     aic.c       sic dof
1           2.567702 31.46802 -52.93605 -49.40663 -47.25858   5
2           2.560893 30.14563 -50.29125 -46.76184 -44.61378   5
3           2.518318 26.20707 -42.41414 -38.88472 -36.73666   5
4           2.661032 28.46116 -46.92231 -43.39290 -41.24484   5
5           2.445358 28.90891 -47.81781 -44.28840 -42.14034   5
6           2.726667 28.75600 -47.51201 -43.98259 -41.83454   5
7           2.724046 25.30250 -40.60499 -37.07558 -34.92752   5
8           2.609753 27.06160 -44.12320 -40.59378 -38.44572   5
9           2.544137 34.51236 -59.02472 -55.49530 -53.34724   5
10          2.586874 20.59745 -31.19490 -27.66548 -25.51742   5

painting regimes on trees

A new function in ouch is paint. Previously, it was up to users to set up regimes manually by editing spreadsheets. paint helps with this task by specifying the regimes on particular species, subtrees, or particular branches.

There are two parameters to paint, subtrees, which paints the entire subtree which descends from the node, and branch, which paints the branch connecting the node to it’s ancestor. For either, you specify the node label (remember it’s a character and needs to be quoted), and set it equal to the name of the regime you want to specify.

Let’s try it on the bimac tree and try to recreate the OU.LP regime:

plot(tree, node.names=T)

Paint the subtrees first, take a look:

ou.lp <- paint( tree, 
                subtree=c("1"="medium","9"="large","2"="small") 
               )
plot(tree, regimes=ou.lp, node.names=T)

But there was an independent switch from medium to large at species gm, or node 38, and the node connecting 9 to its ancestor:

ou.lp <- paint( tree, 
                subtree=c("1"='medium',"9"='large',"2"='small'),
                branch=c("38"='large',"2"='medium')
              )  

Compare it to the original OU.LP from above.

plot(tree, regimes=ou.lp, node.names=T)

We can create alternative paintings of the regimes to test against the data. Suppose we wanted to add a clade specific hypothesis that diverged in a similar time period (this is a completely made-up hypothesis, just for example):

ou.clades <- paint( tree, 
                    subtree=c("1"="A","7"="B", "8"="C"), 
                    branch=c("8"="C", "7"="C", "1"="A")
                   )
plot(tree, regimes=ou.clades, node.names=T)

Run the model:

h6 <- hansen( 
              log(bimac['size']),
              tree, 
              regimes=ou.clades, 
              sqrt.alpha=1,
              sigma=1
            )

Rebuild our table and compare models:

h <- append(h, h6)         # append (add on) new model results to our list h
names(h)[length(h)] <- ou.clades    # add the name of the new model
Warning in names(h)[length(h)] <- ou.clades: number of items to replace is not
a multiple of replacement length
[1] "BM"    "OU.1"  "OU.3"  "OU.4"  "OU.LP" "1"    
h.ic <- sapply( h, function(x) 
                     unlist(
                        summary(x)[c('aic', 'aic.c', 'sic', 'dof')]
                     ) 
               )  
print( h.ic, digits = 3)
         BM  OU.1  OU.3  OU.4 OU.LP     1
aic   -30.7 -25.4 -29.2 -35.2 -39.6 -30.7
aic.c -30.1 -24.1 -25.6 -30.0 -36.1 -27.1
sic   -28.4 -22.0 -23.5 -28.4 -34.0 -25.0
dof     2.0   3.0   5.0   6.0   5.0   5.0

References

Butler, Marguerite A., and Aaron A. King. 2004. “Phylogenetic Comparative Analysis: A Modeling Approach for Adaptive Evolution.” American Naturalist 164: 683–95. https://doi.org/10.1086/426002.
King, Aaron A., and Marguerite A. Butler. 2022. Ouch: Ornstein-Uhlenbeck Models for Phylogenetic Comparative Hypotheses. https://kingaa.github.io/ouch/.