**March 29: Models of correlated observations**

Most of the machinery of model fitting we've examined so
far involves likelihood, which comes with the critical assumption that values
of our response variable are independent.
Ecological data often violate this assumption across both space and
time, so the question arises as to how to compose legitimate likelihood
functions with correlated data. Bolker's
(2008) section 10.3 offers a brief and lucid description of why we approach
this as a variance modeling exercise, related qualitatively to mixed models
because of the need for a *variance-covariance
matrix* that specifies the correlation between observations. For example, using our tree size (dbh)
example from last time, if we assume that there is only population and the dbh
of each tree is independent, then we can estimate the mean dbh (intercept
model) given normal error:

dbh ~ N(μ,**V**)

where we've replaced the element σ^{2} with
the covariance matrix **V**. If the observations are independent, then the
covariance matrix is just σ^{2} x **I**, where **I** is the
'identity matrix' (1s on the diagonal, 0s elsewhere). But if observations are correlated, then
off-diagonal elements will contain covariance values between 0 and 1 that
indicate the degree of correlation. Note
the difference between 'covariance' and 'correlation' (values between -1 and
1); it is often more convenient to use a *correlation
matrix* (**C**) where we can observe
whether autocorrelation between observations is positive or negative.

You can
create your own variance-covariance functions for use in likelihood functions
by using the multivariate normal density (**dmvnorm**)
in place of the normal density and feeding it a matrix for the 'Sigma' argument
representing a covariance matrix, as suggested by Bolker (p. 324). However, you are unlikely to ever need to do
this, given the built-in covariance structures available in R, especially in
the **nlme** library. Moreover, numerical fitting procedures for
manual mixed-effects models can be daunting.
We'll stick with nlme.

My
suggestion for getting acquainted with correlation structures using the **nlme** library is section 5.3 in Pinheiro
and Bates (2004, *Mixed-Effects Models in
S and S-PLUS*), which describes the library of correlation structures in **nlme** that account for different types
of temporal (or *serial*) autocorrelation
in time series data (where the vector describing correlations is a time vector,
like Date) and those that describe spatial autocorrelation (where the position
vectors are two-dimensional, involving columns of X and Y). Temporal correlation structures are usually
described by a correlation function (decay in correlation along a single axis
of separation in time, available with the **acf**
function), while spatial autocorrelation is described by a semivariogram
(available with the **Variogram**
function). A word of warning is to not
get too carried away with accounting for autocorrelation in your data: adding
correlation structures with **nlme** is
easy in principle but difficult in practice, and the benefits to model fitting of
trying out different correlation structures are usually small, especially with
models already containing different spatial or temporal levels. With a large dataset and complicated model,
adding autocorrelation can also add significantly to computation time: for the
full temperature models I constructed for the Smokies, ML fitting took over a
week on a high-end linux workstation.

*Correlation in time*

We'll start simply by examining maxmimum temperatures over
time within one transect in the Smokies temperature dataset.

load(file.choose()) #grab new Tf3 object from course webpage

Tf3$Date = as.Date(Tf3$DATE,format="%m/%d/%Y") #making day a 'Date' object

Tf4 = na.exclude(Tf3) #remove missing values

library(nlme)

When dealing with time series data, it is usually most
convenient to make your date or time variable an object of class 'Date' using
the **as.Date** function, which takes a
character vector as input and changes dates to numbers according to an
international standard format. The hard
part to remember about this function is how to specify the *format* of the values you give it: see R help for 'strptime'. Among other helpful aspect of Date objects
are how conveniently they are plotted:

plot(Tf4$Date,Tf4$maxT)

We suspect strong autocorrelation of day-to-day temperatures. We can examine this separately between
loggers by plotting the groupedData object:

gD = groupedData(maxT~as.numeric(Date)|logger,data=Tf4)

plot(gD)

Date looks important.
A simple model asks about the nature of the Date effect:

mod1 = gls(maxT ~
Date,data=Tf4,method="ML")

summary(mod1)

Generalized least squares fit by maximum likelihood

Model: maxT ~ Date

Data: Tf4

AIC
BIC logLik

21124.63 21142.92 -10559.32

Coefficients:

Value Std.Error t-value p-value

(Intercept) 742.4263
16.057490 46.23551 0

Date -0.0560 0.001224 -45.70547 0

Correlation:

(Intr)

Date -1

Standardized residuals:

Min Q1 Med Q3 Max

-2.5257478 -0.6197299 0.1152725
0.5662425 4.2665986

Residual standard error: 6.039998

Degrees of freedom: 3282 total;
3280 residual

The function **gls** (generalized least
squares) is the other common model fitting routine in the nlme library, and is
used specifically for modeling correlated observations (although **lme** can also do this). There are also nonlinear versions of each (**nlme**, **gnls**). The P value here is
significant but bogus because we strongly suspect temporal
autocorrelation. The residuals of the
model plotted with respect to Date agree:

plot(Tf4$Date,mod1$resid)

An even better way to detect autocorrelation is with the **acf** function in the basic stats
package:

acf(residuals(mod1))

The y values are the correlation (between -1 and 1) of
values separated by 1, 2, 3,… days (the 'lag' in observations, because our
predictor variable was Date), and the dashed blue lines are approximate 95% CIs
for the lack of autocorrelation. We
clearly have a massive autocorrelation problem. The nlme package provides a
suite of auto-correlation structures that are easily folded into model
construction using gls or lme. Let's try
'lag-1' autocorrelation (corAR1):

mod2 = gls(maxT ~
Date,data=Tf4,method="ML",correlation=corAR1())

summary(mod2)

Generalized least
squares fit by maximum likelihood

Model: maxT ~ Date

Data: Tf4

AIC
BIC logLik

14051.76 14076.14 -7021.88

Correlation Structure: AR(1)

Formula: ~1

Parameter estimate(s):

Phi

0.9428097

Coefficients:

Value Std.Error t-value p-value

(Intercept) 551.5893 66.61448
8.280322 0

Date -0.0414 0.00508 -8.153653 0

Correlation:

(Intr)

Date -1

Standardized residuals:

Min Q1 Med Q3 Max

-2.6488294 -0.6821428 0.2128596
0.6972587 3.8573242

Residual standard error: 6.164809

Degrees of freedom: 3282 total;
3280 residual

The first thing you notice is that adding a correlation
structure into a maximum likelihood model makes it much more complicated; not
only do optimizations take a long time but achieving convergence can be
difficult (as our single normal likelihood has now turned into a large
multivariate normal likelihood, this makes sense). The second thing to notice is that including
the correlation structure reduced the AIC from 21124 to 14051 (!). What the corAR1 term has done has added a
variance-covariance matrix accounting for the correlation of observations
according to a particular kind of simplified **V** that assumes correlations between any two years are simply the
product of correlations between adjacent years ('lag 1'). That is, if the correlation between adjacent
years is 0.94 (about what the acf function showed, and consistent with the
fitted correlation parameter Phi=0.942), then the correlation between values
two years apart is 0.94^2 (0.88), three years 0.94^3 (0.83), and so forth—that
is, constant exponential decay of correlation over time. To see this we can re-examine the initial acf
graph, and superpose a line that shows the lag-1 correlation of 0.94 decay
according to the AR1 model:

acf(residuals(mod1))

lines(c(1:35),.94^(c(1:35)),col="red")

Although the AR1 correlation structure is commonly used for
time series models, apparently our correlation doesn't decay quite that
fast. You can make the AR1 structure
more sophisticated by allowing the auto-regressive (AR) function to use more
parameters, or by allowing the correlation structure to involve (separately or
in addition) a moving average of error variance (MA models). The corARMA correlation structure does both
in one complicated argument, where you can involve different AR parameters (p)
and different MA parameters (q). For
example, here is a more complicated AR function involving two autocorrelation
constants (Phi). (Warning: this took my
machine over 20 minutes to optimize!):

arma2 =
corARMA(c(.2,.2),p=2,q=0)

mod3 = gls(maxT ~ Date,data=Tf4,method="ML",correlation=arma2)

summary(mod3)

Generalized least squares fit by
maximum likelihood

Model: maxT ~ Date

Data: Tf4

AIC
BIC logLik

13109.09 13139.57 -6549.545

Correlation Structure: ARMA(2,0)

Formula: ~1

Parameter estimate(s):

Phi1
Phi2

0.4725051 0.5006501

Coefficients:

Value Std.Error t-value p-value

(Intercept) 436.6911 65.28497
6.688998 0

Date -0.0327 0.00498 -6.562634 0

Correlation:

(Intr)

Date -1

Standardized residuals:

Min Q1 Med Q3 Max

-2.6566430 -0.7215224 0.2485611
0.7857267 3.5684973

Residual standard error: 6.354524

Degrees of freedom: 3282 total;
3280 residual

We've now fit two Phi parameters and our AIC has decreased
to 13109. The c(.2,.2) values in the
corARMA function are starting values for the Phi parameters that are fitted
during the optimization process. You can
play around with adding yet even more parameters to corARMA; in my efforts AIC
kept getting better with more parameters.
But it is hard to find guidance in the literature about how to go about
this systematically: indeed several sources (like Zuur et al. 2009, p. 151)
warn against spending much time refining correlation structures, despite
potential big gains in AIC.

Let’s continue to refine the model by
incorporating a random intercept effect for logger and adding in explanatory
variables for temporal variance. To do
this we switch to lme:

mod4 = lme(maxT~maxSYN,random=~1|logger,correlation=corAR1(form=~Date),data=Tf4,method="ML")

summary(mod4)

Linear mixed-effects model fit by
maximum likelihood

Data: Tf4

AIC
BIC logLik

14196.31 14226.79 -7093.156

Random effects:

Formula: ~1 | logger

(Intercept) Residual

StdDev: 0.6811543 2.158144

Correlation Structure: ARMA(1,0)

Formula: ~Date | logger

Parameter estimate(s):

Phi1

0.2489540

Fixed effects: maxT ~ maxSYN

Value Std.Error
DF t-value p-value

(Intercept) -1.831016 0.22158045
3270 -8.26344 0

maxSYN 0.934924 0.00606927 3270 154.04226 0

Correlation:

(Intr)

maxSYN -0.304

Standardized Within-Group
Residuals:

Min Q1 Med Q3 Max

-4.1361759 -0.5329948 0.0721283
0.5674469 5.8056673

Number of Observations: 3282

Number of Groups: 11

The *form* argument to the correlation term indicates
that the variable defining the correlation structure is Date. Comparing this model to one without temporal
autocorrelation yields:

mod5 =
update(mod4,correlation=NULL)

summary(mod5)

Linear mixed-effects model fit by
maximum likelihood

Data: Tf4

AIC
BIC logLik

14369.54 14393.93 -7180.771

Random effects:

Formula: ~1 | logger

(Intercept) Residual

StdDev: 0.6961765 2.145037

Fixed effects: maxT ~ maxSYN

Value Std.Error
DF t-value p-value

(Intercept) -2.1393025 0.21999316
3270 -9.7244 0

maxSYN 0.9627423 0.00486653 3270 197.8293 0

Correlation:

(Intr)

maxSYN -0.245

Standardized Within-Group
Residuals:

Min
Q1 Med Q3 Max

-4.13013167 -0.52209483 0.05250002
0.52749883 6.11885806

Number of Observations: 3282

Number of Groups: 11

Inferences based on mod4, which includes non-independent
observations, are clearly better than those based on mod5 (AICs are 14196 and
14369, resp.; note here there are relatively few variables but lots of data, so
the more conservative BIC may be more appropriate). But note the parameter estimates for both
fixed and random effects are relatively insensitive to whether the correlation
is included or not; residuals will therefore look the same from both models:

par(mfrow=c(2,1))

acf(residuals(mod4))

acf(residuals(mod5))

Periodic (e.g., annual) datasets like this one can be
improved by modeling seasonal effects with a continuous sine-cosine wave, thus
using up only two parameters to model ‘season’.
The formula for this is ‘sin(*ωv _{t}*)
+ cos(

Tf4$JDATE = julian(Tf4$Date,origin=as.Date("2005-01-01"))

mod6 = update(mod4,fixed=maxT~maxSYN+cos(0.0172*JDATE)+sin(0.0172*JDATE))

summary(mod6)

Linear mixed-effects model fit by
maximum likelihood

Data: Tf4

AIC
BIC logLik

13693.67 13736.34 -6839.835

Random effects:

Formula: ~1 | logger

(Intercept) Residual

StdDev: 0.6253322 2.000688

Correlation Structure: ARMA(1,0)

Formula: ~Date | logger

Parameter estimate(s):

Phi1

0.2541087

Fixed effects: maxT ~ maxSYN +
cos(0.0172 * JDATE) + sin(0.0172 * JDATE)

Value Std.Error
DF t-value p-value

(Intercept) 0.5855322 0.22875986 3268 2.55959
0.0105

maxSYN 0.7481913 0.00974929 3268 76.74314
0.0000

cos(0.0172 * JDATE) -2.4903238
0.11137111 3268 -22.36059 0.0000

sin(0.0172 * JDATE) -0.6903123
0.06892255 3268 -10.01577 0.0000

Correlation:

(Intr) maxSYN c(0.*J

maxSYN -0.525

cos(0.0172 * JDATE) -0.447 0.779

sin(0.0172 * JDATE) -0.141 0.349
0.144

Standardized Within-Group
Residuals:

Min Q1 Med Q3 Max

-4.105842090 -0.510211178
-0.004973752 0.531681313 5.199079986

Number of Observations: 3282

Number of Groups: 11

anova(mod4,mod6)

Model df AIC
BIC logLik Test
L.Ratio p-value

mod4 1 5
14196.31 14226.79 -7093.156

mod6 2 7
13693.67 13736.34 -6839.835 1 vs 2 506.6425
<.0001

The likelihood ratio test leaves no doubt that
we’ve accounted for significantly more error with the added seasonal effect,
and the residual (within-logger) variance component has decreased from 2.15 to
2. The autocorrelation remains
significant, however:

x =
acf(residuals(mod6))$lag

y = acf(residuals(mod6))$acf

acf(residuals(mod4))

points(x,y,col="red")

And we can test for this comparing the simpler non-correlation model:

mod7 =
update(mod6,correlation=NULL)

anova(mod6,mod7)

Model df AIC
BIC logLik Test
L.Ratio p-value

mod6 1 7
13693.67 13736.34 -6839.835

mod7 2 6
13876.52 13913.10 -6932.261 1 vs 2 184.8515
<.0001

The correlation significantly improves the model, and here
the coefficients themselves are sensitive to whether temporal autocorrelation
is included:

cbind(mod6$coef$fixed,mod7$coef$fixed)

[,1] [,2]

(Intercept) 0.5855322 -0.005507119

maxSYN 0.7481913 0.796203905

cos(0.0172 * JDATE) -2.4903238
-2.064411541

sin(0.0172 * JDATE) -0.6903123
-0.571010330

From here we would continue model building by adding more
fixed effects for Level 1 (within-logger) variance, e.g. adding in a radiation
term, and then exploring Level 2 (between-logger) variance.

The corAR1 correlation class
is probably the most useful of those relevant to time series modeling. The different classes for times series data are
explained fully in Pinheiro and Bates (2004), pp., 232-239. These include:

__corStruct class description__

corAR1 autoregresssive, order 1
(exponential decay of correlation with temporal distance)

corCAR1 continuous-time version of AR1 (time
index does not have to be integers)

corARMA autoregressive
and/or moving average (higher-order AR1 structure, or moving average structure,
or both)

corCompSymm compound symmetry:
like AR1 except the correlation does not decay with greater temporal distance

corSymm ‘general’ correlation, where each value in the
correlation matrix can be a separate value; P&B suggest this

typically
results in overparameterized models and is only useful where groups contain few
observations

*Correlation in space *

Spatial
correlation of neighboring observations is the bane of the population modeler and
the boon of the landscape ecologist.
Hierarchical modelers usually worry less about such correlation because often
multi-level models can remove the signature of spatial dependence with blocking
or individual-based random effects, but there are always situations where you
might be concerned that ML estimates are inaccurate due to spatial
autocorrelation. There are many ways of
addressing spatial dependence in R; here we’ll focus on incorporating spatial
correlation structures in regression models with **nlme**, where spatial dependence is a mostly a nuisance concern (the
typical case in niche modeling).

First,
we’ll look at a Smokies dataset example where we do not need to consider spatial
autocorrelation, using red maple:

dat =
read.csv(file.choose()) #grab the treedata.csv
file

dat2 = subset(dat,dat$species=="Acer rubrum")

ro.plots =
unique(as.character(dat2$plotID)) #plots with red maple

u.plots = unique(as.character(dat$plotID)) #all plots

nob.plots = u.plots[is.element(u.plots,ro.plots)==F] #plots without red
maple

dat3 =
subset(dat,is.element(as.character(dat$plotID),nob.plots)) #dataset of no maple

dat4 =
subset(dat3,duplicated(dat3$plotID)==F) #one row per plot

dat4$cover = 0 #cover
of red oak is zero in these plots

mp =
na.exclude(rbind(dat2,dat4)) #new dataframe of
presences and absences

mp2 = subset(mp,duplicated(mp$plotID)==F) #remove all remeasured plots

mp3 = subset(mp2,mp2$utme<260000) #reduce
dataset to one park region

The
corStruct classes for spatial autocorrelation do not accept observations that
have identical X and Y coordinates (zero distance), which is why we simplified
the dataset to mp2, and the mp3 object only took a park region that was well
surveyed to eliminate rare large distances between plots. We can fit a simple gls model of maple cover
based on elevation:

gls1 =
gls(cover~elev+I(elev^2),data=mp3)

The question is whether we need to bother with modeling
spatial autocorrelation in these data. For this we use a *variogram* (or, sometimes called a semivariogram) that shows how the
total variance in the data is related to the distance between
observations. If there is no spatial
autocorrelation, we expect a flat variogram of semivariance equal to the
maximum (random) variance (called the *sill*;
often this is scaled to 1). If nearby
samples are correlated, however, then only a fraction of the total variance can
be expressed between nearby values, such that variograms increase in some
fashion until a certain distance where values are uncorrelated (called the *range*).
The variance at a lag distance of zero is called the *nugget*, which describes fine-scale
variance and/or observation error. The
shape of this increasing function describes the nature of the autocorrelation,
and the **nlme** package has 5 spatial
corStruct classes, corresponding to the semivariograms below (from Pinheiro and
Bates 2004, p. 233):

The basic strategy for dealing with spatial autocorrelation
in regression models is to 1) fit a basic model without a correlation
structure, 2) examine a variogram of the residuals (i.e., their correlation
with respect to distance), 3) choose an appropriate corStruct class based on
the variogram, and 4) test whether addition of the correlation term significantly
improves model fit via AICs or LRTs (**anova**
function). The **Variogram** function in **nlme**
serves as our diagnostic function:

plot(Variogram(gls1,form=~utme+utmn))

The graph shows a flat line at y=1, which is sufficient evidence to ignore
spatial autocorrelation in this model.
You can experiment with different species in this dataset; I couldn’t
find any that merited a spatial autocorrelation component. In plot datasets with relatively small
spatial grains and large mean distances between plots, this is probably
commonplace, but until you demonstrate the lack of variogram pattern
empirically there is always the chance that your likelihood estimates are
inaccurate due to non-independent observations.

To
demonstrate what spatial autocorrelation looks like, grab the ‘treedbh.csv’
dataset from the class website, which describes annual dbh increment in a
hemlock-hardwood forest stand in northern MI:

trees = read.csv(file.choose()) #grab the ‘treedbh.csv’ dataset

trees1 = na.exclude(trees) #remove
rows of missing values

trees2 =
subset(trees1,trees1$dead==0) #ignore dead
trees

trees3 = subset(trees2,duplicated(cbind(trees2$xcoord,trees2$ycoord))==F) #remove ‘double stems’ with same
coordinates

tr.mod1 = gls(INCR~DBH,data=trees3) #simple
model of tree growth (INCR) as a function of stem size

plot(Variogram(tr.mod1,form=~xcoord+ycoord))

Now the variogram indicates significant spatial
autocorrelation of growth values, with a nugget of about 0.9 and a ‘range’
(here the x value where y=1) of about 100 m.
We can first ask whether a spherical correlation structure is
appropriate using the *correlation=corSpher*
argument; note corSpher would like starting values for the range and nugget
parameters (when nugget=T). (Warning:
this took me about 20 minutes to optimize).

tr.mod2 =
update(tr.mod1,correlation=corSpher(c(100,0.9),form=~xcoord+ycoord,nugget=T))

summary(tr.mod2)

Generalized least squares fit by
REML

Model: INCR ~ DBH

Data: trees3

AIC
BIC logLik

3492.997 3518.12 -1741.498

Correlation Structure: Spherical
spatial correlation

Formula: ~xcoord + ycoord

Parameter estimate(s):

range nugget

1.053204e+05 2.488153e-03

Coefficients:

Value Std.Error t-value p-value

(Intercept) -0.18772279 21.593435
-0.008694 0.9931

DBH 0.00660686 0.001732
3.814293 0.0001

Correlation:

(Intr)

DBH -0.002

Standardized residuals:

Min Q1 Med Q3 Max

-0.57888285 0.00798371
0.02819694 0.05323549 0.48943971

Residual standard error: 21.64101

Degrees of freedom: 1126 total;
1124 residual

We have a huge AIC advantage here:

anova(tr.mod1,tr.mod2)

Model df AIC
BIC logLik Test
L.Ratio p-value

tr.mod1 1 3
3590.035 3605.109 -1792.017

tr.mod2 2 5
3492.997 3518.120 -1741.498 1 vs 2 101.0376
<.0001

But clearly the
correlation structure did not do what we expected, because the MLEs for the
range and nugget are nonsensical (10532 and 0.0025, resp.). We can force the issue by fitting the
corSpher structure again, but with non-fitted range and nugget values using the
*fixed=T* argument:

tr.mod3 =
update(tr.mod1,correlation=corSpher(c(100,0.9),form=~xcoord+ycoord,nugget=T,fixed=T))

plot(Variogram(tr.mod3,form=~xcoord+ycoord))

anova(tr.mod2,tr.mod3)

Model df AIC
BIC logLik Test
L.Ratio p-value

tr.mod2 1 5
3492.997 3518.12 -1741.498

tr.mod3 2 3
3501.966 3517.04 -1747.983 1 vs 2 12.96868
0.0015

The variogram of the
fixed correlation model shows our assumed correlation structure, but the LRT
shows that we’re not accounting for much of the spatial autocorrelation
function, and that the weirdly parameterized corSpher model does a better (but
not particularly helpful) job. Would
exponential correlation work better?

tr.mod4 = update(tr.mod1,correlation=corExp(c(100,0.9),form=~xcoord+ycoord,nugget=T))

summary(tr.mod4)

Generalized least
squares fit by REML

Model: INCR ~ DBH

Data: trees3

AIC
BIC logLik

3491.192 3516.316 -1740.596

Correlation Structure: Exponential
spatial correlation

Formula: ~xcoord + ycoord

Parameter estimate(s):

range
nugget

34.4006815 0.7747585

Coefficients:

Value Std.Error
t-value p-value

(Intercept) 0.3159240 0.21814152
1.448252 0.1478

DBH 0.0066374 0.00173083 3.834806 0.0001

Correlation:

(Intr)

DBH -0.224

Standardized residuals:

Min Q1 Med Q3 Max

-10.70824455 -0.27256013
0.08697457 0.53202986 8.28691426

Residual standard error: 1.217159

Degrees of freedom: 1126 total;
1124 residual

The range and nugget parameters make more sense, and the AIC value is
slightly better than tr.mod2 (3491.2 vs. 3493.0); the correlation structures
are not nested so we can’t obtain a P value for the difference, but here we’d
probably choose to go with corExp. We
check the status of our residuals with tr.mod4:

plot(Variogram(tr.mod4))

The variogram suggests we still have work to do.

*Exercise
*Apply a spatial autocorrelation structure to the above Smokies max
temperature dataset (Tf4) using