Hierarchical models II: Correlated observations

 

Most of the machinery of model fitting we've examined so far involves likelihood, which comes with the critical assumption that the values of our response variable are independent.  Ecological data often violate this assumption in both space and time, so the question arises as to how to compose legitimate likelihood functions with correlated data.  Bolker's (2008) section 10.3 offers a brief and lucid description of why we approach this as a variance modeling exercise, related qualitatively to mixed models because of the need for a variance-covariance matrix that specifies the correlation between observations.  For example, using our tree size (dbh) example from last time, if we assume that there is only one population and the dbh of each tree is independent, then we can estimate the mean dbh (intercept model) given normal error:

            dbh ~ N(μ,V)

where we've replaced the scalar variance σ2 with the covariance matrix V.  If the observations are independent, then the covariance matrix is just σ2 x I, where I is the 'identity matrix' (1s on the diagonal, 0s elsewhere).  But if observations are correlated, then the off-diagonal elements contain nonzero covariances that indicate the degree of correlation between pairs of observations.  Note the difference between 'covariance' (on the scale of the data, unbounded) and 'correlation' (values between -1 and 1); it is often more convenient to work with a correlation matrix (C), where we can see directly whether autocorrelation between observations is positive or negative. 
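To make the covariance/correlation distinction concrete, here is a tiny sketch (the numbers are made up) using the built-in cov2cor function:

V = matrix(c(4,1.2,1.2,1),nrow=2)   #hypothetical covariance matrix: variances 4 and 1, covariance 1.2
cov2cor(V)                          #corresponding correlation matrix (off-diagonal = 1.2/sqrt(4*1) = 0.6)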

            You can create your own variance-covariance functions for use in likelihood functions by using the multivariate normal density (dmvnorm) in place of the normal density and feeding it a matrix for the 'Sigma' argument representing a covariance matrix, as suggested by Bolker (p. 324).  However, you are unlikely to ever need to do this, given the built-in covariance structures available in R, especially in the nlme library.  Moreover, numerical fitting procedures for manual mixed-effects models can be daunting.  We'll stick with nlme.
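Just to make the idea concrete (you will rarely need this), here is a minimal sketch of a hand-rolled correlated-data likelihood in the spirit of Bolker's suggestion.  It assumes the mvtnorm package for dmvnorm, a hypothetical vector dbh of tree diameters, and a single shared correlation rho among all trees (compound symmetry), which is the simplest possible correlation structure:

library(mvtnorm)     #for dmvnorm
nll = function(p, dbh) {
    mu = p[1]; sigma = exp(p[2]); rho = plogis(p[3])   #transforms keep sigma > 0 and 0 < rho < 1
    n = length(dbh)
    C = matrix(rho, n, n); diag(C) = 1                 #correlation matrix (compound symmetry)
    V = sigma^2 * C                                    #covariance matrix V = sigma^2 * C, fed to 'Sigma'
    -dmvnorm(dbh, mean=rep(mu, n), sigma=V, log=TRUE)  #negative log-likelihood
}
#optim(c(20, log(5), 0), nll, dbh=dbh)   #hypothetical starting values for mu, log(sigma), logit(rho)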

            My suggestion for getting acquainted with correlation structures in the nlme library is section 5.3 of Pinheiro and Bates (2004, Mixed-Effects Models in S and S-PLUS), which describes the corStruct classes that account for different types of temporal (or serial) autocorrelation in time series data (where the vector describing correlations is a time vector, like Date) and those that describe spatial autocorrelation (where the position vectors are two-dimensional, involving columns of X and Y).  Temporal correlation is usually described by a correlation function (the decay of correlation along a single axis of separation in time, available with the acf function), while spatial autocorrelation is described by a semivariogram (available with the Variogram function).  A word of warning: don't get too carried away with accounting for autocorrelation in your data.  Adding correlation structures with nlme is easy in principle but difficult in practice, and the benefits to model fitting of trying out different correlation structures are usually small, especially for models that already contain different spatial or temporal levels.  With a large dataset and a complicated model, adding autocorrelation can also add substantially to computation time: for the full temperature models I constructed for the Smokies, ML fitting took over a week on a high-end Linux workstation.

 

Correlation in time

We'll start simply by examining maximum temperatures over time within one transect in the Smokies temperature dataset. We'll use the Tf3.RData data object, containing temperature values across several watersheds in the Smokies.

load(file.choose()) #grab new Tf3 object

Tf3$Date = as.Date(Tf3$DATE,format="%m/%d/%Y")  #making day a 'Date' object

Tf4 = na.exclude(Tf3)        #remove missing values

library(nlme)

When dealing with time series data, it is usually most convenient to make your date or time variable an object of class 'Date' using the as.Date function, which takes a character vector as input and converts dates to numbers according to an international standard.  The hard part to remember about this function is how to specify the format of the values you give it (see R help for 'strptime'; a couple of format examples follow the plot below).  Another helpful aspect of Date objects is how conveniently they plot:
plot(Tf4$Date,Tf4$maxT)
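As promised, a couple of format-string examples for as.Date (the dates here are made up, and the abbreviated month name assumes an English locale; the codes are documented under ?strptime):

as.Date("3/27/2006", format="%m/%d/%Y")   #month/day/4-digit year, as in our DATE column
as.Date("27-Mar-06", format="%d-%b-%y")   #day, abbreviated month name, 2-digit year
as.Date("2006-03-27")                     #the ISO default needs no format argument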
We suspect strong autocorrelation of day-to-day temperatures.  We can examine the pattern separately for each logger by plotting the groupedData object:
gD = groupedData(maxT~as.numeric(Date)|logger,data=Tf4)

plot(gD)

Date looks important.  A simple model asks about the nature of the Date effect:
mod1 = gls(maxT ~ Date,data=Tf4,method="ML")
summary(mod1)

Generalized least squares fit by maximum likelihood

  Model: maxT ~ Date

  Data: Tf4

       AIC      BIC    logLik

  21124.63 21142.92 -10559.32

 

Coefficients:

               Value Std.Error   t-value p-value

(Intercept) 742.4263 16.057490  46.23551       0

Date         -0.0560  0.001224 -45.70547       0

 

 Correlation:

     (Intr)

Date -1   

 

Standardized residuals:

       Min         Q1        Med         Q3        Max

-2.5257478 -0.6197299  0.1152725  0.5662425  4.2665986

 

Residual standard error: 6.039998

Degrees of freedom: 3282 total; 3280 residual
The function gls (generalized least squares) is the other common model fitting routine in the nlme library, and is used specifically for modeling correlated observations (although lme can also do this).  There are also nonlinear versions of each (nlme, gnls).  The P value here is significant but bogus because we strongly suspect temporal autocorrelation.  The residuals of the model plotted with respect to Date agree:

plot(Tf4$Date,mod1$resid)

An even better way to detect autocorrelation is with the acf function in the basic stats package:
acf(residuals(mod1))

The y values are the correlation (between -1 and 1) of values separated by 1, 2, 3,… days (the 'lag' between observations, here in days because our data are daily), and the dashed blue lines are approximate 95% confidence bounds under the null hypothesis of no autocorrelation.  We clearly have a massive autocorrelation problem. The nlme package provides a suite of autocorrelation structures that are easily folded into model construction using gls or lme.  Let's try 'lag-1' autocorrelation (corAR1):
mod2 = gls(maxT ~ Date,data=Tf4,method="ML",correlation=corAR1())
summary(mod2)

Generalized least squares fit by maximum likelihood

  Model: maxT ~ Date

  Data: Tf4

       AIC      BIC   logLik

  14051.76 14076.14 -7021.88

 

Correlation Structure: AR(1)

 Formula: ~1

 Parameter estimate(s):

      Phi

0.9428097

 

Coefficients:

               Value Std.Error   t-value p-value

(Intercept) 551.5893  66.61448  8.280322       0

Date         -0.0414   0.00508 -8.153653       0

 

 Correlation:

     (Intr)

Date -1   

 

Standardized residuals:

       Min         Q1        Med         Q3        Max

-2.6488294 -0.6821428  0.2128596  0.6972587  3.8573242

 

Residual standard error: 6.164809

Degrees of freedom: 3282 total; 3280 residual

The first thing you notice is that adding a correlation structure makes a maximum likelihood model much more complicated; not only do optimizations take a long time, but achieving convergence can be difficult (this makes sense, given that our single normal likelihood has now turned into a large multivariate normal likelihood).  The second thing to notice is that including the correlation structure reduced the AIC from 21124 to 14051 (!).  What the corAR1 term has done is add a variance-covariance matrix accounting for the correlation of observations according to a particularly simple form of V, one that assumes the correlation between any two days is just the product of the correlations between the adjacent days in between ('lag 1').  That is, if the correlation between adjacent days is 0.94 (about what the acf function showed, and consistent with the fitted correlation parameter Phi=0.943), then the correlation between values two days apart is 0.94^2 (0.88), three days apart 0.94^3 (0.83), and so forth: constant exponential decay of correlation over time.  To see this we can re-examine the initial acf graph and superimpose a line showing how a lag-1 correlation of 0.94 decays under the AR1 model:
acf(residuals(mod1))

lines(c(1:35),.94^(c(1:35)),col="red")

Although the AR1 correlation structure is commonly used for time series models, apparently our correlation doesn't decay quite that fast.  You can make the structure more sophisticated by allowing the autoregressive (AR) part to use more parameters, or by letting the correlation structure involve (separately or in addition) a moving average of the error variance (MA models).  The corARMA correlation structure does both in one argument, where you specify the number of AR parameters (p) and the number of MA parameters (q).  For example, here is a more complicated AR function involving two autocorrelation constants (Phi).  (Warning: this took my machine over 20 minutes to optimize!):
arma2 = corARMA(c(.2,.2),p=2,q=0)
mod3 = gls(maxT ~ Date,data=Tf4,method="ML",correlation=arma2)
summary(mod3)

Generalized least squares fit by maximum likelihood

  Model: maxT ~ Date

  Data: Tf4

       AIC      BIC    logLik

  13109.09 13139.57 -6549.545

 

Correlation Structure: ARMA(2,0)

 Formula: ~1

 Parameter estimate(s):

     Phi1      Phi2

0.4725051 0.5006501

 

Coefficients:

               Value Std.Error   t-value p-value

(Intercept) 436.6911  65.28497  6.688998       0

Date         -0.0327   0.00498 -6.562634       0

 

 Correlation:

     (Intr)

Date -1   

 

Standardized residuals:

       Min         Q1        Med         Q3        Max

-2.6566430 -0.7215224  0.2485611  0.7857267  3.5684973

 

Residual standard error: 6.354524

Degrees of freedom: 3282 total; 3280 residual

We've now fit two Phi parameters and our AIC has decreased to 13109.  The c(.2,.2) values in the corARMA call are starting values for the Phi parameters that are estimated during the optimization.  You can play around with adding still more parameters to corARMA; in my efforts AIC kept improving with more parameters.  But it is hard to find guidance in the literature about how to go about this systematically: indeed several sources (like Zuur et al. 2009, p. 151) warn against spending much time refining correlation structures, despite potentially large gains in AIC.
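If you do want to explore orders somewhat systematically, here is one possible sketch: loop over a few (p, q) combinations and compare AICs.  This is only a sketch; each fit can be very slow and convergence is not guaranteed, which is why the calls are wrapped in try:

orders = list(c(1,0), c(2,0), c(2,1))    #a few (p, q) combinations to try
fits = lapply(orders, function(pq) {
    try(gls(maxT ~ Date, data=Tf4, method="ML",
        correlation=corARMA(p=pq[1], q=pq[2])), silent=TRUE)
})
sapply(fits, function(f) if(inherits(f, "try-error")) NA else AIC(f))   #AIC for each successful fit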
             Let’s continue to refine the model by incorporating a random intercept effect for logger and adding in explanatory variables for temporal variance.  To do this we switch to lme:
mod4 = lme(maxT~maxSYN,random=~1|logger,correlation=corAR1(form=~Date),data=Tf4,method="ML")

summary(mod4)

Linear mixed-effects model fit by maximum likelihood

 Data: Tf4

       AIC      BIC    logLik

  14196.31 14226.79 -7093.156

 

Random effects:

 Formula: ~1 | logger

        (Intercept) Residual

StdDev:   0.6811543 2.158144

 

Correlation Structure: ARMA(1,0)

 Formula: ~Date | logger

 Parameter estimate(s):

     Phi1

0.2489540

Fixed effects: maxT ~ maxSYN

                Value  Std.Error   DF   t-value p-value

(Intercept) -1.831016 0.22158045 3270  -8.26344       0

maxSYN       0.934924 0.00606927 3270 154.04226       0

 Correlation:

       (Intr)

maxSYN -0.304

 

Standardized Within-Group Residuals:

       Min         Q1        Med         Q3        Max

-4.1361759 -0.5329948  0.0721283  0.5674469  5.8056673

 

Number of Observations: 3282

Number of Groups: 11

The form argument to the correlation term indicates that the variable defining the correlation structure is Date.  Comparing this model to one without temporal autocorrelation yields:
mod5 = update(mod4,correlation=NULL)

summary(mod5)

Linear mixed-effects model fit by maximum likelihood

 Data: Tf4

       AIC      BIC    logLik

  14369.54 14393.93 -7180.771

 

Random effects:

 Formula: ~1 | logger

        (Intercept) Residual

StdDev:   0.6961765 2.145037

 

Fixed effects: maxT ~ maxSYN

                 Value  Std.Error   DF  t-value p-value

(Intercept) -2.1393025 0.21999316 3270  -9.7244       0

maxSYN       0.9627423 0.00486653 3270 197.8293       0

 Correlation:

       (Intr)

maxSYN -0.245

 

Standardized Within-Group Residuals:

        Min          Q1         Med          Q3         Max

-4.13013167 -0.52209483  0.05250002  0.52749883  6.11885806

 

Number of Observations: 3282

Number of Groups: 11

Inferences based on mod4, which accounts for the non-independence of observations, are clearly better than those based on mod5 (AICs are 14196 and 14369, resp.; note here there are relatively few variables but lots of data, so the more conservative BIC may be more appropriate).  But note that the parameter estimates for both fixed and random effects are relatively insensitive to whether the correlation is included or not; the residuals will therefore look much the same for both models:
par(mfrow=c(2,1))

acf(residuals(mod4))

acf(residuals(mod5))

Periodic (e.g., annual) datasets like this one can be improved by modeling seasonal effects with a continuous sine-cosine wave, thus using up only two parameters to model ‘season’.  The formula for this is ‘sin(ω·t) + cos(ω·t)’, where t is the time index variable (here we’ll use Julian date, 1–365, via numerical conversion of our Date variable using the function julian) and ω is a constant that scales the index variable to the full period (here, 2π/365 = 0.0172).
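As an aside, the two fitted coefficients on the sine and cosine terms can be collapsed into a single wave with an amplitude and a peak day, using the identity b_cos·cos(ω·t) + b_sin·sin(ω·t) = R·cos(ω·t − φ), with R = sqrt(b_cos^2 + b_sin^2) and φ = atan2(b_sin, b_cos).  A small helper sketch (the function name is mine, not part of nlme):

omega = 2*pi/365                       #one full cycle per 365 days (~0.0172)
seasonal.summary = function(b.cos, b.sin) {
    c(amplitude = sqrt(b.cos^2 + b.sin^2),            #half the peak-to-trough seasonal swing
      peak.day  = (atan2(b.sin, b.cos)/omega) %% 365) #Julian day at which the wave peaks
}
#e.g., after fitting mod6 below: seasonal.summary(fixef(mod6)[3], fixef(mod6)[4])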
Tf4$JDATE = julian(Tf4$Date,origin=as.Date("2005-01-01"))
mod6 = update(mod4,fixed=maxT~maxSYN+cos(0.0172*JDATE)+sin(0.0172*JDATE))

summary(mod6)

Linear mixed-effects model fit by maximum likelihood

 Data: Tf4

       AIC      BIC    logLik

  13693.67 13736.34 -6839.835

 

Random effects:

 Formula: ~1 | logger

        (Intercept) Residual

StdDev:   0.6253322 2.000688

 

Correlation Structure: ARMA(1,0)

 Formula: ~Date | logger

 Parameter estimate(s):

     Phi1

0.2541087

Fixed effects: maxT ~ maxSYN + cos(0.0172 * JDATE) + sin(0.0172 * JDATE)

                         Value  Std.Error   DF   t-value p-value

(Intercept)          0.5855322 0.22875986 3268   2.55959  0.0105

maxSYN               0.7481913 0.00974929 3268  76.74314  0.0000

cos(0.0172 * JDATE) -2.4903238 0.11137111 3268 -22.36059  0.0000

sin(0.0172 * JDATE) -0.6903123 0.06892255 3268 -10.01577  0.0000

 Correlation:

                    (Intr) maxSYN c(0.*J

maxSYN              -0.525             

cos(0.0172 * JDATE) -0.447  0.779      

sin(0.0172 * JDATE) -0.141  0.349  0.144

 

Standardized Within-Group Residuals:

         Min           Q1          Med           Q3          Max

-4.105842090 -0.510211178 -0.004973752  0.531681313  5.199079986

 

Number of Observations: 3282

Number of Groups: 11

anova(mod4,mod6)

     Model df      AIC      BIC    logLik   Test  L.Ratio p-value

mod4     1  5 14196.31 14226.79 -7093.156                       

mod6     2  7 13693.67 13736.34 -6839.835 1 vs 2 506.6425  <.0001
The likelihood ratio test leaves no doubt that the added seasonal effect accounts for significantly more variation, and the residual (within-logger) standard deviation has decreased from 2.16 to 2.00.  The autocorrelation remains significant, however:
a6 = acf(residuals(mod6),plot=FALSE)   #store the mod6 acf without plotting it
acf(residuals(mod4))
points(a6$lag,a6$acf,col="red")        #overlay the mod6 correlations (red) on the mod4 acf

And we can test for this by comparing against the simpler model without the correlation structure:
mod7 = update(mod6,correlation=NULL)
anova(mod6,mod7)

     Model df      AIC      BIC    logLik   Test  L.Ratio p-value

mod6     1  7 13693.67 13736.34 -6839.835                       

mod7     2  6 13876.52 13913.10 -6932.261 1 vs 2 184.8515  <.0001

The correlation significantly improves the model, and here the coefficients themselves are sensitive to whether temporal autocorrelation is included:
cbind(mod6$coef$fixed,mod7$coef$fixed)

                          [,1]         [,2]

(Intercept)          0.5855322 -0.005507119

maxSYN               0.7481913  0.796203905

cos(0.0172 * JDATE) -2.4903238 -2.064411541

sin(0.0172 * JDATE) -0.6903123 -0.571010330

From here we would continue model building by adding more fixed effects for Level 1 (within-logger) variance, e.g. adding in a radiation term, and then exploring Level 2 (between-logger) variance.
            The corAR1 correlation class is probably the most useful of those relevant to time series modeling.  The different classes for time series data are explained fully in Pinheiro and Bates (2004), pp. 232-239.  These include:

corStruct class              description

corAR1                        autoregressive, order 1 (exponential decay of correlation with temporal distance)

corCAR1                     continuous-time version of AR1 (time index does not have to be integers)

corARMA                    autoregressive and/or moving average (higher-order AR structure, or moving average structure, or both)

corCompSymm            compound symmetry: like AR1 except the correlation does not decay with greater temporal distance

corSymm                      ‘general’ correlation, where each value in the correlation matrix can be a separate value; P&B suggest this typically results in overparameterized models and is only useful where groups contain few observations
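Each of these plugs into the same correlation argument of gls or lme that we used above; a minimal sketch (the structures are only constructed here, not refit, since fitting can be slow):

cs.car1 = corCAR1(form = ~as.numeric(Date))    #continuous-time AR(1); time index need not be integer
cs.cs   = corCompSymm(form = ~1)               #single shared within-group correlation
#mod2b = gls(maxT ~ Date, data=Tf4, method="ML", correlation=cs.car1)   #hypothetical refit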

 

Correlation in space

            Spatial correlation of neighboring observations is the bane of the population modeler and the boon of the landscape ecologist.  Hierarchical modelers usually worry less about such correlation because multi-level models can often remove the signature of spatial dependence with blocking or individual-based random effects, but there are always situations where you might be concerned that ML estimates are inaccurate due to spatial autocorrelation.  There are many ways of addressing spatial dependence in R; here we’ll focus on incorporating spatial correlation structures in regression models with nlme, where spatial dependence is mostly a nuisance concern (the typical case in niche modeling). More powerful ways of describing spatial autocorrelation structures are available in the package geoR; for some examples with our dataset see these notes by Catherine Ravenscroft.

            First, we’ll look at a Smokies dataset example where we do not need to consider spatial autocorrelation, using red maple:

dat = read.csv(file.choose())     #grab the treedata.csv file
dat2 = subset(dat,dat$species=="Acer rubrum")
ro.plots = unique(as.character(dat2$plotID))    #plots with red maple

u.plots = unique(as.character(dat$plotID)) #all plots
nob.plots = u.plots[is.element(u.plots,ro.plots)==F] #plots without red maple
dat3 = subset(dat,is.element(as.character(dat$plotID),nob.plots)) #dataset of no maple
dat4 = subset(dat3,duplicated(dat3$plotID)==F)  #one row per plot
dat4$cover = 0     #cover of red maple is zero in these plots
mp = na.exclude(rbind(dat2,dat4))     #new dataframe of presences and absences
mp2 = subset(mp,duplicated(mp$plotID)==F) #remove all remeasured plots
mp3 = subset(mp2,mp2$utme<260000) #reduce dataset to one park region

     The corStruct classes for spatial autocorrelation do not accept observations that have identical X and Y coordinates (zero distance), which is why we simplified the dataset to mp2; the mp3 object further retains only one well-surveyed park region, eliminating the rare, very large distances between plots.  We can fit a simple gls model of maple cover based on elevation:
gls1 = gls(cover~elev+I(elev^2),data=mp3)
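Before adding a spatial corStruct it is worth verifying that no two plots share identical coordinates (a quick check; the coordinate columns here are utme and utmn, the same ones used in the Variogram calls below):

any(duplicated(cbind(mp3$utme,mp3$utmn)))   #should be FALSE: spatial corStructs cannot handle zero distances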

The question is whether we need to bother with modeling spatial autocorrelation in these data.  For this we use a variogram (sometimes called a semivariogram), which shows how the total variance in the data is related to the distance between observations.  If there is no spatial autocorrelation, we expect a flat variogram with semivariance equal to the maximum (random) variance (called the sill; often this is scaled to 1).  If nearby samples are correlated, however, then only a fraction of the total variance is expressed between nearby values, so the variogram increases in some fashion up to a certain distance beyond which values are uncorrelated (called the range).  The semivariance at a lag distance of zero is called the nugget, which describes fine-scale variance and/or observation error.  The shape of this increasing function describes the nature of the autocorrelation, and the nlme package has 5 spatial corStruct classes (corExp, corGaus, corLin, corRatio, corSpher), whose model semivariograms are shown in Pinheiro and Bates (2004, p. 233).
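Since those reference curves are just simple functions of distance, here is a rough sketch of two of them, plotted from the standard semivariogram formulas (hypothetical range and nugget values; see ?corExp and ?corSpher for nlme's exact parameterizations):

d = seq(0,300,length=100)                   #distance axis (arbitrary units)
range. = 100; nugget = 0.2                  #hypothetical range and nugget
sv.exp = nugget + (1-nugget)*(1 - exp(-d/range.))                                        #exponential
sv.sph = ifelse(d < range., nugget + (1-nugget)*(1.5*(d/range.) - 0.5*(d/range.)^3), 1)  #spherical
plot(d,sv.exp,type="l",ylim=c(0,1),xlab="distance",ylab="semivariance")
lines(d,sv.sph,lty=2)                       #spherical reaches the sill exactly at the range
legend("bottomright",c("exponential","spherical"),lty=1:2)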

 

The basic strategy for dealing with spatial autocorrelation in regression models is to 1) fit a basic model without a correlation structure, 2) examine a variogram of the residuals (i.e., their correlation with respect to distance), 3) choose an appropriate corStruct class based on the variogram, and 4) test whether addition of the correlation term significantly improves model fit via AICs or LRTs (anova function).  The Variogram function in nlme serves as our diagnostic function:

plot(Variogram(gls1,form=~utme+utmn))
The graph shows an essentially flat variogram at y = 1, which is sufficient evidence to ignore spatial autocorrelation in this model.  You can experiment with different species in this dataset; I couldn’t find any that merited a spatial autocorrelation component.  In plot datasets with relatively small spatial grains and large mean distances between plots, this is probably commonplace, but until you demonstrate the lack of variogram pattern empirically there is always the chance that your likelihood estimates are inaccurate due to non-independent observations.

            To demonstrate what spatial autocorrelation looks like, grab the treedbh.csv dataset, which describes annual dbh increment in a hemlock-hardwood forest stand in northern MI:

trees = read.csv(file.choose())       #grab the ‘treedbh.csv’ dataset
trees1 = na.exclude(trees)        #remove rows of missing values

trees2 = subset(trees1,trees1$dead==0) #ignore dead trees
trees3 = subset(trees2,duplicated(cbind(trees2$xcoord,trees2$ycoord))==F)     #remove ‘double stems’ with same coordinates
tr.mod1 = gls(INCR~DBH,data=trees3)   #simple model of tree growth (INCR) as a function of stem size
plot(Variogram(tr.mod1,form=~xcoord+ycoord))

Now the variogram indicates significant spatial autocorrelation of growth values, with a nugget of about 0.9 and a ‘range’ (here the x value where y=1) of about 100 m.  We can first ask whether a spherical correlation structure is appropriate using the correlation=corSpher argument; note corSpher would like starting values for the range and nugget parameters (when nugget=T).  (Warning: this took me about 20 minutes to optimize).
tr.mod2 = update(tr.mod1,correlation=corSpher(c(100,0.9),form=~xcoord+ycoord,nugget=T))

summary(tr.mod2)

Generalized least squares fit by REML

  Model: INCR ~ DBH

  Data: trees3

       AIC     BIC    logLik

  3492.997 3518.12 -1741.498

 

Correlation Structure: Spherical spatial correlation

 Formula: ~xcoord + ycoord

 Parameter estimate(s):

       range       nugget

1.053204e+05 2.488153e-03

 

Coefficients:

                  Value Std.Error   t-value p-value

(Intercept) -0.18772279 21.593435 -0.008694  0.9931

DBH          0.00660686  0.001732  3.814293  0.0001

 

 Correlation:

    (Intr)

DBH -0.002

 

Standardized residuals:

        Min          Q1         Med          Q3         Max

-0.57888285  0.00798371  0.02819694  0.05323549  0.48943971

 

Residual standard error: 21.64101

Degrees of freedom: 1126 total; 1124 residual

We have a huge AIC advantage here:

anova(tr.mod1,tr.mod2)

        Model df      AIC      BIC    logLik   Test  L.Ratio p-value

tr.mod1     1  3 3590.035 3605.109 -1792.017                        

tr.mod2     2  5 3492.997 3518.120 -1741.498 1 vs 2 101.0376  <.0001

But clearly the correlation structure did not do what we expected, because the MLEs for the range and nugget are nonsensical (105320 and 0.0025, resp.).  We can force the issue by fitting the corSpher structure again, but with non-fitted range and nugget values using the fixed=T argument:

tr.mod3 = update(tr.mod1,correlation=corSpher(c(100,0.9),form=~xcoord+ycoord,nugget=T,fixed=T))

plot(Variogram(tr.mod3,form=~xcoord+ycoord))

anova(tr.mod2,tr.mod3)
        Model df      AIC     BIC    logLik   Test  L.Ratio p-value

tr.mod2     1  5 3492.997 3518.12 -1741.498                       

tr.mod3     2  3 3501.966 3517.04 -1747.983 1 vs 2 12.96868  0.0015

The variogram of the fixed correlation model shows our assumed correlation structure, but the LRT shows that we’re not accounting for much of the spatial autocorrelation function, and that the weirdly parameterized corSpher model does a better (but not particularly helpful) job.  Would exponential correlation work better?
tr.mod4 = update(tr.mod1,correlation=corExp(c(100,0.9),form=~xcoord+ycoord,nugget=T))
summary(tr.mod4)
Generalized least squares fit by REML

  Model: INCR ~ DBH

  Data: trees3

       AIC      BIC    logLik

  3491.192 3516.316 -1740.596

 

Correlation Structure: Exponential spatial correlation

 Formula: ~xcoord + ycoord

 Parameter estimate(s):

     range     nugget

34.4006815  0.7747585

 

Coefficients:

                Value  Std.Error  t-value p-value

(Intercept) 0.3159240 0.21814152 1.448252  0.1478

DBH         0.0066374 0.00173083 3.834806  0.0001

 

 Correlation:

    (Intr)

DBH -0.224

 

Standardized residuals:

         Min           Q1          Med           Q3          Max

-10.70824455  -0.27256013   0.08697457   0.53202986   8.28691426

 

Residual standard error: 1.217159

Degrees of freedom: 1126 total; 1124 residual
The range and nugget parameters make more sense, and the AIC value is slightly better than tr.mod2 (3491.2 vs. 3493.0); the correlation structures are not nested so we can’t obtain a P value for the difference, but here we’d probably choose to go with corExp.  We check the status of our residuals with tr.mod4: 
plot(Variogram(tr.mod4))
The variogram suggests we still have work to do.
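Finally, it can be handy to line the candidate models up by AIC in one call (a quick sketch; tr.mod3 is omitted because its correlation parameters were fixed rather than estimated, and all three models here share the same fixed effects, so the REML AICs are comparable):

AIC(tr.mod1,tr.mod2,tr.mod4)   #side-by-side AIC comparison of the REML fits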