**Likelihood approaches in R**

For this session we’re going back to revisit a key concept
that underlies many of the modeling approaches we’ve already used and will be
an essential component of building more complicated model structures, including
mixed effect models and Bayesian techniques: maximum likelihood. *Likelihood*
is the probability of a dataset given a model.
The ‘dataset’ is any collection of values that are 1) independent and 2)
identically distributed (meaning they are produced by the same statistical distribution:
normal, binomial, Poisson, etc.). The
‘model’ includes both function(s) describing how the predictor variable(s) are
converted into the response variable, and the particular values of parameters
included in the function. In terms of a
standard linear regression, ‘data’ are the Y values, and ‘model’ is both the
form ‘b+mX’ and particular values of the slope and intercept. Because the observations are independent, the probability
of obtaining a particular set of observations given the model is equal to the product
of the probabilities of obtaining each value separately. One of the great things about likelihoods is
that they force you to specify the error distribution that produces the data. We usually don’t consider how error enters
the model until we’re forced to do so (e.g., by the *family* argument in a GLM) or we ignore it entirely. Creating likelihood functions forces us to
admit that data are produced by a combination of deterministic relationships (*process* models, indicating ecological
relationships) and functions that describe the distribution of error or
uncertainty (*data* models). Excellent introductions to the concept and
usage of likelihoods for model fitting are found in Hilborn and Mangel (1997,
Chap. 7) and Bolker (2008, Chap. 6), among many others.

Model fitting via maximum likelihood is usually accomplished
with a search through parameter space in a technique called *optimization*. Before computers this was basically
impossible for most applications, but R implements several different optimization
routines that are robust enough to now be the backbone of most model fitting
routines (including GAMs, nls, and all types of mixed effects models). Unfortunately convergence on the optimum
solution (here, the maximum likelihood estimate, or *MLE*) is not guaranteed and can be difficult to achieve in some
circumstances. For this reason, to
gainfully employ likelihood techniques in fitting models to data we have to
know how to construct likelihood functions AND know at least a little about the
underlying optimization methods that are supposedly finding our desired MLE.

My objectives in what follows concern implementing
likelihood model fitting techniques in R, including:

1. writing likelihood functions,

2. finding maximum likelihood solutions through optimization,

3. creating likelihood profiles to examine parameter precision, and

4. comparing models using likelihood techniques (likelihood ratio tests and
information criteria like AIC).

**Writing likelihood functions in R**

Our basic objective in fitting a model to data using
likelihood is, once a model has been specified, finding the parameter value(s)
that make the observed data the most likely (the maximum likelihood). Because of tradition and mathematical
convenience, we actually work not with likelihoods but with log-likelihoods
(a log transformation of the likelihood, which turns the product of
probabilities into a sum of log-probabilities), and we minimize the negative
log-likelihood rather than maximize the likelihood.
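Because the log is monotonic, minimizing the negative log-likelihood recovers the same parameters as maximizing the likelihood. A quick numeric check of the product-to-sum identity, using made-up probabilities:

```r
# hypothetical probabilities of three independent observations
p = c(0.2, 0.5, 0.1)

nll.prod = -log(prod(p))  # negative log of the product of probabilities
nll.sum = -sum(log(p))    # sum of the negative log-probabilities: same number
```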

A likelihood function is a specification of how the data are
obtained from a given error distribution. As a starting example consider a
model of the mean (intercept) and variance of data with normally distributed
error (akin to **lm**). Values distributed according to a normal
distribution each occur with a probability based on the mean and the variance
of the distribution in this way:

normprob = function (x,m,v) {

(1/sqrt(2*pi*v))*exp(-((x-m)^2)/(2*v)) }

You might recognize this as the probability density function (pdf) of a normal
distribution, telling us how probable certain values of x are given mean m and
variance v. For a given mean of 5 and
variance of 2, how probable are observations of integers from 1 to 10?

X = c(1:10)

Y = normprob(x=X,m=5,v=2)

plot(X,Y)

We're of course expecting values near the mean to be the most probable, and
from what we know about normal distributions we anticipate about 95% of the
data being within 2 standard deviations (~2.8 units) of 5. Let's check that this is in fact a pdf by
generating more x values and comparing it to the built-in **dnorm** function:

X = seq(0,10,by=.1)

Y = normprob(x=X,m=5,v=2)

plot(X,Y)

lines(X,dnorm(X,mean=5,sd=sqrt(2)))
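The "95% within 2 standard deviations" expectation mentioned above can also be checked directly with **pnorm**:

```r
# probability mass of a normal distribution within 2 SDs of its mean
pnorm(2) - pnorm(-2)  # about 0.954

# the same check for our mean=5, variance=2 example
pnorm(5 + 2*sqrt(2), mean=5, sd=sqrt(2)) - pnorm(5 - 2*sqrt(2), mean=5, sd=sqrt(2))
```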

Now
let's take a set of observations and assess their likelihood assuming normal
error. We assume these observations are independent, so we define their
likelihood as the product of their individual probabilities given a mean m and
variance v:

X = c(0,4,5,7,8,2,2,4,4)

like1 = prod(normprob(x=X,m=5,v=2))

like1

[1] 4.443365e-12

Again,
this is the probability of observing X given the values were generated from a
normal distribution of mean=5 and variance=2. Is this the model that maximizes
the likelihood? Let's try some other possibilities, all assuming normal error:

means = seq(0,10,by=.1)

vars = seq(1,8,by=.5)

mean.likes = sapply(means,function(y)prod(normprob(x=X,m=y,v=2))) #varying the mean

var.likes = sapply(vars,function(y)prod(normprob(x=X,m=5,v=y))) #varying the variance

plot(means,mean.likes,type="b")

windows() #opens a second graphics window; Windows only (use quartz() or x11() on other systems)

plot(vars,var.likes,type="b")

The
graphs (*likelihood curves*) show the
likelihood values for different parameter values of the mean (assuming
variance=2) and variance (assuming mean=5). Let's do this in a slightly more
sophisticated way by examining the likelihood of models where the mean and
variance parameters vary simultaneously (a *likelihood*
*surface*):

means = rep(means,each=length(vars))

vars = rep(vars,times=101)

mv.mat = cbind(means,vars)

mv.likes = apply(mv.mat,1,function(y)prod(normprob(x=X,m=y[1],v=y[2])))

library(lattice)

levelplot(mv.likes~means*vars,contour=T)

The
likelihood peaks at the mean and variance values we found from the separate
plots, and we can extract those values by asking on what row of
mv.mat the maximum likelihood lies:

mv.mat[which.max(mv.likes),]

means  vars 
  4.0   5.5 

So
our crude maximum likelihood estimate (MLE) for the mean is 4.0, and the
variance is 5.5. Of course we've gone to
a lot of trouble to calculate the obvious:

mean(X)

[1] 4

mean(X^2)-(mean(X)^2)

[1] 5.555556

Note that **var**(X) produces the
unbiased variance estimator for small samples, which multiplies the population
variance by n/(n-1), or 9/8. But hopefully you see the general technique: given
an error function, we can search over a range of possible parameter values to
find the model that maximizes the likelihood of our data. This is model fitting via ML in a nutshell.
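The n/(n-1) relationship between the ML (population) variance estimate and **var** is easy to verify:

```r
X = c(0,4,5,7,8,2,2,4,4)
n = length(X)                    # 9 observations

pop.var = mean(X^2) - mean(X)^2  # ML (population) variance, ~5.556
var(X)                           # unbiased sample variance, ~6.25
pop.var * n/(n-1)                # identical to var(X)
```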

Let's keep our data and add in a deterministic
function in addition to our normal error.
Let's envision our current dataset X as plot cover values of some
species and suppose we have a predictor variable of elevation in meters
associated with each X:

cover = X

elev = c(100,110,320,500,830,120,330,100,500)

Although normal error is probably not the best choice for an error model, we'll
stick with it for now and write a likelihood function assuming a simple linear
relationship (as in **lm**) and normal
error. This time we'll take a shortcut
using **dnorm** instead of bothering to
write out the formula for the normal pdf, and we'll convert our calculation to
negative log-likelihoods using the *log=T*
argument in **dnorm** and summing the
values rather than taking their product:

normNLL = function(par) {

y.pred = par[1] + par[2]*elev

-sum(dnorm(cover,mean=y.pred,sd=sqrt(par[3]),log=T)) }

The only model component we've added is how we think the mean of the normal
distribution should vary depending on elevation, using two parameters of an
intercept (par[1]) and slope (par[2]).
This means we need to find the MLE of 3 parameters, including the
variance (par[3]), which we assume does not vary according to elevation. Let's
plot the data to see which parameter values are reasonable:

plot(elev,cover)

The relationship looks roughly linear, with an intercept around zero and a
slope of close to 1/100 (0.01). We'll
now automate the search process using one of several 'search' or optimization
routines in R:

optim(par=c(0,1/100,1),normNLL) #par argument is the starting search values for par[1:3]

$par

[1] 1.54507000 0.00759274 2.31612581

$value

[1] 16.55038

We'll look at the details of **optim**
below; for now note the fitted parameter values of the intercept (par[1]=1.54),
slope (par[2]=.0076), and variance (par[3]=2.3), and our estimate of the
minimized value of the negative log-likelihood function at the MLEs of each
parameter (16.55). To see how this
works, plug in the MLEs into normNLL:

normNLL(c(1.545,.00759,2.316))

[1] 16.55038

If **optim** worked correctly, this is
the smallest possible value of normNLL given our data and model form. Let's check to make sure that the analytical
solution using least squares is basically the same:

summary(lm(cover~elev))

Call:

lm(formula = cover ~ elev)

Residuals:

    Min      1Q  Median      3Q     Max 
-2.3045 -1.3412  0.1534  1.6196  1.6955 

Coefficients:

             Estimate Std. Error t value Pr(>|t|)  
(Intercept) 1.545275   0.972862   1.588   0.1562  
elev        0.007592   0.002427   3.129   0.0166 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.726 on 7 degrees of freedom

Multiple R-squared: 0.5831, Adjusted R-squared: 0.5235

F-statistic: 9.789 on 1 and 7 DF, p-value: 0.01664

The
intercept and slope values are identical.
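The variance estimates, however, differ by a known factor: the ML estimate divides the residual sum of squares by n, while **lm**'s residual standard error divides by n minus the number of regression parameters. A quick check (re-entering the data so the snippet stands alone):

```r
cover = c(0,4,5,7,8,2,2,4,4)
elev = c(100,110,320,500,830,120,330,100,500)

fit = lm(cover ~ elev)
rss = sum(residuals(fit)^2)  # residual sum of squares

rss/length(cover)    # ML variance estimate (~2.32, as optim found)
summary(fit)$sigma^2 # lm's estimate, rss/(n-2) (= 1.726^2, ~2.98)
```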

Let's compare this result to one using a
different error distribution that recognizes cover cannot be negative, such as
the Poisson distribution. The likelihood
function is similar except we change the error function to the Poisson pdf:

poisNLL = function(par) {

y.pred = par[1] + par[2]*elev

-sum(dpois(cover,lambda=y.pred,log=T)) }

We now have
a model with one less parameter, because we assume the mean and variance are
equal with Poisson error. As initial parameter values we'll use the MLEs from
the normal error example:

optim(par=c(1.54,.0076),poisNLL)

$par

[1] 1.626705708 0.007340982

$value

[1] 17.61976

Notice that
our MLEs were similar but our negative log-likelihood went up, even though we
used one less parameter. Clearly this is
a worse model for these data.

**Optimization routines**

Before we go much further with model fitting using ML we
should examine the optimization routines in more detail. *Optimization*,
or the process of finding a function's minimum or maximum solution numerically,
is by far the most frustrating part of fitting complicated models to data using
likelihoods, because there is no universal best optimization method and there
are many ways a given routine can fail to converge on an answer or produce MLEs
that are nonsensical. The most commonly
used R function for optimization is **optim**,
but it itself is just a convenient way to access several different optimization
routines.

Calls to **optim**
require a function (*fn*=) to minimize
(here, the negative log-likelihood function), starting values for the fitted
parameters (*par*=), and the desired
optimization method (*method*=). **optim**
includes five methods of various utility.
Let's generate a more complicated model for our treedata and compare
several **optim** methods.

dat = read.csv(file.choose()) #grab the treedata.csv file

dat2 = subset(dat,dat$species=="Magnolia fraseri") #Fraser's magnolia

ro.plots = unique(as.character(dat2$plotID)) #plots with mag fra

u.plots = unique(as.character(dat$plotID)) #all plots

nob.plots = u.plots[is.element(u.plots,ro.plots)==F] #plots without mag fra

dat3 = subset(dat,is.element(as.character(dat$plotID),nob.plots)) #dataset of no mag fra

dat4 = subset(dat3,duplicated(dat3$plotID)==F) #one row per plot

dat4$cover = 0 #cover of Fraser magnolia is zero in these plots

mf = na.exclude(rbind(dat2,dat4)) #new dataframe of presences and absences

Fraser magnolia is a classic "rich cove" species
that probably has a strong relationship with cove (near-stream) locations. We'll write a likelihood function that
includes a deterministic component which includes a nonlinear (exponential
decay) relationship of magnolia abundance with distance-from-stream, with the
rate dependent on elevation (thinking that mid-slope locations get wetter as we
go higher). For this exercise we'll also
use the full complement of cover data, including zeros. Our expectation is that these data are
overdispersed, so the error component of the model will be a negative binomial
function. To show off the flexibility of
the ML approach, we'll do this using a pair of nested functions, with the first
being the error distribution and the second the deterministic function.

cover = mf$cover

elev = mf$elev

streamdist = mf$streamdist

nbNLL = function(par) {

y.pred = mfNLL(c(par[1],par[2],par[3],par[4]))

-sum(dnbinom(cover+1,mu=y.pred,size=par[5],log=T)) }

mfNLL = function(par) {

b0 = par[1]

b1 = par[2]

b2 = par[3]

b3 = par[4]

r = elev*b3

b0 + b1*elev + b2*exp(r*log(streamdist+1)) }

My
objective here is to illustrate the power of likelihoods in fitting models to
data, rather than to explore magnolia distribution in detail. What we did is define an error function based
on the two-parameter negative binomial function, where *mu* is the expected cover and *size*
is the dispersion parameter that accounts for the variance increasing faster
than the mean (thus relaxing this assumption for the otherwise equivalent
Poisson model). nbNLL thus returns the
negative log-likelihood of the dataset fed through the deterministic function
mfNLL. mfNLL has no specification of
error in it, but only contains a four-parameter model based on an overall
intercept (b0), slope for the relation of cover and elevation (b1), a constant
for the stream distance relationship (b2), and finally a parameter that
controls how elevation influences the exponential decay rate of cover with
stream distance (b3). We could have opted
for a simpler regression of elevation and stream distance, each with a slope
parameter and including an intercept, but I wanted to demonstrate the
flexibility of giving optim any function that will produce a likelihood value. This can involve calls to numerous other
functions, many estimated parameters (constrained by optim's ability to
converge on a consistent and sensible answer), or even calling up other
scripts—indeed anything that will generate a 'y.pred' value for nbNLL.

The first consideration in having **optim** produce MLEs is giving it reasonable starting values for each
parameter. b0 should be near the overall cover mean, so we can start it at 1.
b1 is the slope for elevation, and it must be small (given the scale of elevation
relative to cover), but we're not sure whether it's greater or less than zero, so we
start it at 0. b2 is a bit of a mystery; we'll start it at 1 as a default. b3
involves translating elevation into a decay rate and so must also be very small;
we start it at zero as well. The final
parameter is the dispersion parameter for the negative binomial error; there are guidelines for
figuring out reasonable values of this, but for convenience we'll start with a
value of 1.

start.par = c(1,0,1,0,1)
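Before handing these to **optim**, it is worth checking that the likelihood function returns a finite value at the starting parameters; a non-finite starting value is a common cause of optimization failure. A self-contained sketch of the idea on simulated data (a stand-in for the real cover values, which require the treedata file):

```r
# sanity check: the NLL should be finite at the starting values
set.seed(1)
sim.cover = rnbinom(50, mu=2, size=1)  # simulated overdispersed counts

simNLL = function(par) -sum(dnbinom(sim.cover, mu=par[1], size=par[2], log=TRUE))

is.finite(simNLL(c(1, 1)))  # TRUE: a safe starting point for optim
```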

Nelder-Mead is the default method for **optim**. Because it involves a bit of jumping around
in parameter space (rather than a derivative-based approach that is more likely
to get caught in local minima), it can produce consistent results but can run
slowly with large datasets or complicated parameterizations. Trying it with our
above likelihood model and defined starting parameters:

out1 = optim(fn=nbNLL,par=start.par,method="Nelder-Mead")

out1

$par

[1] 4.800675e-01 -4.718550e-04 2.677907e+00 -3.190603e-05 5.435705e+00

$value

[1] 1832.206

$counts

function gradient

502 NA

$convergence

[1] 1

$message

NULL

The $par component lists the MLEs for the 5 parameters, and the $value is the
minimized negative log-likelihood for our dataset. The $counts component is the
number of times our likelihood function was queried by **optim** (we did not use a gradient). The $convergence value is
critical: a 1 here indicates that what stopped the optimization was the
iteration limit (here 500) rather than the convergence criterion being reached
('0'). This should give us cause for concern.
Let's increase the iteration limit to see whether the fit improves:

out2 = optim(fn=nbNLL,par=start.par,method="Nelder-Mead",control=list(trace=TRUE,maxit=1000))

out2

$par

[1] -5.385844e+00 -9.194085e-05 8.467816e+00 -1.698507e-05 5.504508e+00

$value

[1] 1830.939

Still
no convergence and the fit is little better. The *trace=T* argument generates the output for each iteration; scroll
through these to see that in fact the likelihood hit below 1840 relatively
quickly.

The BFGS method is a version of estimating the location of
the minimum using derivatives. Bolker
(2008) recommends BFGS when the likelihood surface is 'well-behaved'
(presumably meaning simple) and starting values are reasonably clear; it is
also a faster technique.

out3 = optim(fn=nbNLL,par=start.par,method="BFGS")

out3

$par

[1] 1.000000e+00 1.002682e-20 1.000000e+00 -1.051292e-19 1.000000e+00

$value

[1] 2003.454

This
is a substantially worse fit than Nelder-Mead and our parameters basically
didn't move, so our functions appear too complex for this method. We can try a version of this that operates
with parameter constraints (upper and lower search limits for each parameter)
called L-BFGS-B. To do so we need vectors of parameter limits, which I provide
using educated guesses:

lower = c(.5,-0.01,0,-0.01,.5)

upper = c(10,0.01,5,1,5)

out4 = optim(fn=nbNLL,par=start.par,lower=lower,upper=upper,method="L-BFGS-B")

out4

$par

[1] 2.5919153042 -0.0000979387 1.7075221989 -0.0022838252 1.7075572583

$value

[1] 1909.586

$counts

function gradient

41 41

$convergence

[1] 0

We
did get convergence, and a better solution than unbounded BFGS, but still no
better than Nelder-Mead. The parameter values at least look reasonable. Bolker (2008, p. 248) suggests that L-BFGS-B
is less robust than unconstrained methods; however it can be useful for
complicated fits where you want to put tight constraints on parameters (e.g.,
due to biological realism; note Bolker suggests several alternatives to this
when constraints are important to the fit).

One more common optimization method available with **optim** is simulated annealing. Simulated annealing is one of the more robust
(but, unsurprisingly, slow) techniques for finding the 'global' minimum when
there may be many local minima, because it has a large stochastic component
(random jumps in parameter space). The
version in **optim** is SANN:

out5 = optim(fn=nbNLL,par=start.par,method="SANN")

out5

$par

[1] 1.7294495733 0.0003218725 0.6010998011 -0.2951911926 2.2183194506

$value

[1] 1891.716

$counts

function gradient

10000 NA

$convergence

[1] 0

It may take a few tries to get reasonable parameter values. The above output was one of several I tried,
and none produced a fit as good as Nelder-Mead. The parameter values here are
also very different. Most likely we have
a situation where there is a plateau on the likelihood surface where a large
range of parameter values produce similar likelihoods. From here we may need to make our model
simpler, or explore subsets of the data to get a better handle on reasonable
parameter values. The situation also
highlights the fact that we need more than reasonable parameter values if we're
to do much with this fitted model: in particular, we need a way to define
parameter precision (e.g., confidence intervals). Evaluating the support for particular model
values is also the basis for likelihood hypothesis testing.

**Likelihood profiles**

Likelihood profiles are the meat and potatoes of likelihood
model fitting, because they provide a snapshot of how the overall model
likelihood depends on particular values of one or more parameters. If negative
log-likelihood values increase quickly as you move away from the MLE (higher or
lower), then the MLE has high precision: it is unlikely that the real value is
far from the estimated value. Conversely, if the likelihood profile is flat,
then a range of different parameter values could produce basically the same
dataset. In this case your point estimate of the MLE may be worthless. Even
better is that likelihoods have a very useful statistical property: twice the
log of the ratio of any two likelihoods (generated from the same data) is
asymptotically chi-squared distributed (this doubled difference in negative
log-likelihoods is a difference in *deviance*). This means we can
associate any increase in negative log-likelihood with the probability that the
solution of higher likelihood is actually equivalent to that of lower likelihood:
i.e., a P value. This forms the basis of estimating parameter precision and
comparing (nested) models.
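In code, the recipe is just a call to **pchisq**; a minimal sketch with made-up negative log-likelihoods:

```r
# hypothetical negative log-likelihoods for two nested models
nll.reduced = 20.0  # model without the extra parameter
nll.full    = 17.5  # model with the extra parameter

# P value for the drop in negative log-likelihood (1 parameter difference)
1 - pchisq(2*(nll.reduced - nll.full), df=1)
```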

In models that include a single parameter, likelihood profiles are easy to generate because you only need to calculate the likelihood at different values of that parameter. We produced one above in our examination of the mean of a normal distribution (using likelihood, rather than negative log-likelihood). In models with many parameters, however, a change in any one parameter value will (usually) influence the MLEs of the others. Construction of an accurate likelihood profile requires that the fit be 're-optimized' to find the lowest negative log-likelihood for each alternate parameterization. Of course, there are various libraries in R that automate this process, but we'll try a simple manual example.

For ease of use let’s return to the simple case of linear regression of 9 cover values and associated elevation values.

cover = c(0,4,5,7,8,2,2,4,4)

elev = c(100,110,320,500,830,120,330,100,500)

We’ll first examine the overall MLEs fit by optim using a normal likelihood function:

fn = function(par) {-sum(dnorm(cover,mean=par[1]+par[2]*elev,sd=sqrt(par[3]),log=T))}

out1 = optim(fn,par=c(1,0,3),method="Nelder-Mead")

out1

$par

[1] 1.54495516 0.00759282 2.31627982

$value

[1] 16.55038

The slope for the effect of elevation is ~0.0076. So what? Is there evidence that in fact elevation is a significant predictor? We can find out by examining the likelihoods of a range of different slope parameters, while also allowing the MLEs of the other two parameters to vary with the different slope values. We create a vector of possible slopes and then generate new optim results for each case of a new fixed slope parameter:

slopes = seq(-.005,.02,by=.0005) #vector of possible slope values surrounding the MLE

lik.out = numeric(length(slopes)) #holding vector (currently zeros)

fn2 = function(par,m) { #new simplified likelihood function (only 2 fitted pars)

b = par[1]

v = par[2]

-sum(dnorm(cover,mean=b+m*elev,sd=sqrt(v),log=T))}

for(i in 1:length(slopes)) { #calculates new likelihoods for each new slope model

lik.out[i] = optim(fn2,par=c(1,3),m=slopes[i],method="Nelder-Mead")$value }

plot(slopes,lik.out,type="l")

This is our likelihood profile for the slope. The MLE is about 0.0076 and results in a negative log-likelihood of about 16.5. To determine the 95% confidence interval for the slope we use the fact that a difference in negative log-likelihood of more than half the 0.95 quantile of the chi-squared distribution (1 df) is beyond the P=0.05 boundary. The critical value is thus the current minimum negative log-likelihood plus:

qchisq(0.95,1)/2

[1] 1.920729

So we just need to find the slope values at which the negative log-likelihood equals the current minimum (~16.5) plus 1.92, which should be around 18.5. To get more precise values we use the **approx** function, which does simple interpolation given a set of x and y values. We do this for both the lower tail and the upper tail:

lower.s = slopes[1:which.min(lik.out)] #x values, left side of the graph

lower.k = lik.out[1:which.min(lik.out)] #y values, left side of the graph

lowerCI = approx(lower.k,lower.s,xout=min(lik.out)+qchisq(0.95,1)/2)$y #desired x value for y = 1.92+y(MLE)

upper.s = slopes[which.min(lik.out):length(lik.out)] #x values, right side

upper.k = lik.out[which.min(lik.out):length(lik.out)] #y values, right side

upperCI = approx(upper.k,upper.s,xout=min(lik.out)+qchisq(0.95,1)/2)$y #desired x value for y = 1.92+y(MLE)

abline(v=c(lowerCI,upperCI),lty=2,col="red")

arrows(lowerCI,18.5,upperCI,18.5,code=3,col="blue"); text(.0075,19,"95% CI",cex=2)

I’ve plotted the estimated confidence limits. They do not contain zero, so elevation is indeed significant. The fact that this procedure can be used to test how different a parameter MLE is from zero forms the basis of nested model evaluation using likelihood: we've just decided that a model including elevation is a better model than one without it. This is the likelihood ratio test.
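For comparison, the t-based interval that **lm** reports (via **confint**) leads to the same conclusion, though it is not numerically identical to the profile-based interval:

```r
cover = c(0,4,5,7,8,2,2,4,4)
elev = c(100,110,320,500,830,120,330,100,500)

# t-based 95% confidence interval for the slope; also excludes zero
confint(lm(cover ~ elev))["elev", ]
```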

**Model comparison**

Likelihood offers two general means of model comparison, corresponding to the frequentist likelihood ratio test (LRT), generating a P value for the comparison of nested models, and the non-frequentist (non-P-value) use of information criteria (like AIC). Information criteria-based methods like AIC are more flexible because they can be used to compare any models fit via likelihoods (including those not strictly fit via likelihood but from which a likelihood is easily calculated, such as least-squares). The 'rules' of IC-based methods are not strictly based on probability; rather, rules-of-thumb are established via simulation. For this reason, many researchers prefer LRTs when they are available, although IC-based methods, and particularly AIC, have widespread acceptance in the ecological literature.

LRTs apply
only to the comparison of nested models, involving the change in model
likelihood when one parameter is set to zero (or two terms are equivalent).
Returning to the above example of elevation and cover, we can ask whether there
is sufficient evidence from a comparison of their likelihoods that the
deterministic component of the model should contain a squared term for
elevation:

fn.sq = function(par) {-sum(dnorm(cover,mean=par[1]+par[2]*elev+par[4]*(elev^2),sd=sqrt(par[3]),log=T))}

out.sq = optim(fn.sq,par=c(1,0,3,0),method="Nelder-Mead")

out.sq

$par

[1] 1.840340e+00 5.376081e-03 2.294774e+00 2.622342e-06

$value

[1] 16.50852

The
optimized negative log-likelihood is 16.50852, which is barely lower than that of the
above model without the quadratic term (16.55). Is it low enough
to be a significantly better fit? We don't have to create the likelihood
profile, because (as frequentists) we're only interested in whether the quadratic
coefficient is zero (null hypothesis) or greater than zero (the positive MLE
according to our out.sq result). In fact we already know our critical value is
a negative log-likelihood difference of

qchisq(0.95,1)/2

[1] 1.920729

because the one-parameter difference between models gives us 1
DF for the chi-squared test. So the original negative log-likelihood would have had to drop
to

out1$value - qchisq(0.95,1)/2

[1] 14.62965

So
we weren't even close. A model including only an intercept, however, has a
negative log-likelihood of:

fn.int = function(par) {-sum(dnorm(cover,mean=par[1],sd=sqrt(par[2]),log=T))}

out.int = optim(fn.int,par=c(1,3),method="Nelder-Mead")

out.int$value

[1] 20.48704

This
difference in fit has an associated P-value of:

1-pchisq(2*(out.int$value-out1$value),df=1)

[1] 0.005016926

We conclude that a model including the main effect of elevation is much more
likely to be the better model than one with only an intercept. Interestingly,
this is not equivalent to the P-value obtained with a linear model based on
least squares:

summary(lm(cover~elev))

Call:

lm(formula = cover ~ elev)

Residuals:

Min 1Q Median 3Q Max

-2.3045 -1.3412 0.1534 1.6196 1.6955

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.545275 0.972862 1.588 0.1562

elev 0.007592 0.002427 3.129 0.0166 *

The
fitted coefficients are identical to what we found with likelihood, but the
test is based on a Student's t test of whether the elevation coefficient
overlaps with zero, rather than on the competing likelihoods of two models.
Conveniently, R has a built-in function **logLik** for returning
log-likelihoods of a fitted model object:

logLik(lm(cover~elev))

'log Lik.' -16.55038 (df=3)

Because the estimated coefficients are identical to what we produced manually,
the value is equal to that of our out1 object (note you need to return the
negative of this for the negative log-likelihood). Using this technique we
could do the same model comparison via likelihood but skipping the
optimization:

1-pchisq(abs(2*(-logLik(lm(cover~1))[1]+logLik(lm(cover~elev))[1])),df=1)

[1] 0.005016926

The
tricky bit about the above line is that we want the first element of the **logLik** output ([1]); I also made sure that the difference returned a positive value
(**abs**). You should be able to do this for all GLMs as well.
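For instance, the same likelihood ratio test with a Poisson GLM of the cover data (note that glm's Poisson family uses a log link by default, unlike the identity-link poisNLL written earlier):

```r
cover = c(0,4,5,7,8,2,2,4,4)
elev = c(100,110,320,500,830,120,330,100,500)

g1 = glm(cover ~ elev, family=poisson)  # full model (log link)
g0 = glm(cover ~ 1, family=poisson)     # intercept-only model

# LRT P value from the two model log-likelihoods
1 - pchisq(2*(logLik(g1)[1] - logLik(g0)[1]), df=1)
```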

AIC
(Akaike Information Criterion) and other IC-based model goodness-of-fit
statistics are similarly easy to calculate once model likelihoods are in hand,
because they are based only on the negative log-likelihood and the number of
parameters in the model (AIC = -2L + 2k, where k is the number of fitted
parameters). The comparisons needn't be restricted to nested models, but the
datasets fit by the models must be identical. What you don't get with IC-based
methods is a P value (many consider this to be a good thing). Let's compare our
simple models of elevation and cover using AIC rather than LRT:

2*out1$value+2*length(out1$par)

[1] 39.10075

2*out.int$value+2*length(out.int$par)

[1] 44.97408

Lower AIC values indicate better fits, and the rule-of-thumb established for
ICs is that a difference of 2 units or more indicates models that are clearly
different, so again we conclude that there is substantial support for a model
that includes elevation. Here too there is a built-in function to save you the
few seconds of trouble of writing the above formulas:

AIC(lm(cover~elev),lm(cover~1))

df AIC

lm(cover ~ elev) 3 39.10075

lm(cover ~ 1) 2 44.97408

The **AIC** function should work for any fitted model object that generates
an associated **logLik** value (lm, glm, gam, etc). Note according to
Burnham and Anderson (2002) we probably should have instead used AICc, which is
the AIC value corrected for small sample sizes (generally when #values/#params
< 40):

2*out1$value+2*length(out1$par) + 2*(length(out1$par)*(length(out1$par)+1)/(length(cover)-length(out1$par)-1))

[1] 43.90075

2*out.int$value+2*length(out.int$par) + 2*(length(out.int$par)*(length(out.int$par)+1)/(length(cover)-length(out.int$par)-1))

[1] 46.97408
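These manual formulas can be wrapped in a small AICc helper (my own sketch; base R has no AICc function, though packages such as MuMIn and AICcmodavg provide one). It should work for any model object with **logLik** and **residuals** methods:

```r
# a small AICc helper (hypothetical; not part of base R)
AICc = function(mod) {
  k = attr(logLik(mod), "df")    # number of fitted parameters
  n = length(residuals(mod))     # sample size
  AIC(mod) + 2*k*(k+1)/(n-k-1)   # AIC plus the small-sample correction
}

cover = c(0,4,5,7,8,2,2,4,4)
elev = c(100,110,320,500,830,120,330,100,500)

AICc(lm(cover ~ elev))  # matches the manual value above (~43.9)
```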

The
conclusion is the same but the disparity isn't quite as great with small sample
sizes. Yet another IC variant is the so-called 'Bayesian' IC (or BIC; not
actually a Bayesian method) with a parameter penalization of (log n)*k rather
than 2k, which may be more appropriate for comparing models for large datasets
(compared to AIC, BIC yields a more conservative acceptance of more highly
parameterized models):

2*out1$value+length(out1$par)*log(length(cover))

[1] 39.69243

2*out.int$value+length(out.int$par)*log(length(cover))

[1] 45.36853
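As with AIC, base R provides a built-in **BIC** function that reproduces these values from a fitted model's log-likelihood:

```r
cover = c(0,4,5,7,8,2,2,4,4)
elev = c(100,110,320,500,830,120,330,100,500)

# BIC table for the same two models; matches the manual calculations above
BIC(lm(cover ~ elev), lm(cover ~ 1))
```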

Given the growing popularity of model comparisons via information-theoretic
approaches like AIC, it is important to understand how AICs are used to compare
the relative goodness-of-fit of competing models. First, the actual AIC value
is not meaningful in itself (as opposed to something like R2); rather, it is the
differences in AIC between models that are of interest. AIC differences in a
table of competing models allow a quick and convenient means of model comparison
(differences < 2 suggest the models are of equal ranking; between 4 and 7 suggests
'considerably less' support; and > 10 suggests 'essentially none' [Burnham
and Anderson 2002, p. 70]). Burnham and Anderson have additionally popularized
a method (established by Akaike) that uses AIC values to calculate the relative
weight of evidence supporting one model over others, given a list of reasonable
alternative models fit to a given dataset. These are called Akaike weights (w_{i}),
and they are unique to a particular set of models (that is, if a model is added or
removed, the w_{i} must be recalculated). Akaike weights can also be used
with other ICs (e.g., BIC). Model comparison tables based on ICs are now common in
the literature, so it is worthwhile to be able to produce your own using the
basic likelihood approaches above. Here we can write a simple function that,
given a set of candidate models, creates the standard information-theoretic
output:

ICtab = function(model.list) {

  if(!is.list(model.list[[1]])) stop("Please enter models as a list (list(mod1,mod2,...))")

  n = length(model.list)             #number of candidate models
  s = length(model.list[[1]]$fitted) #no. observations
  kv = numeric(n)                    #empty no. param vector
  modLL = numeric(n)                 #empty neg log lik vector
  mod.call = character(n)            #model form (for output)
  R2 = numeric(n)                    #empty model R2 vector
  fam = character(n)                 #empty family vector

  for(i in 1:n) {                    #loop over each model
    mod = model.list[[i]]            #select model
    fam[i] = mod$family[[1]]         #error family
    mod.call[i] = as.character(mod$formula[3]) #return model formula
    modLL[i] = -logLik(mod)[1]       #extract neg log likelihood
    kv[i] = attr(logLik(mod),"df")   #extract no. fitted params
    res = mod$y - mod$fitted         #residuals scaled to original data
    R2[i] = 1 - (sum((res-mean(res))^2) / sum((mod$y-mean(mod$y))^2))
  }

  AICv = 2*modLL + 2*kv + 2*kv*(kv+1)/(s-kv-1)   #AICc for all models
  BICv = 2*modLL + kv*log(s)                     #BIC for all models
  AICdiff = AICv - min(AICv)                     #AICc differences
  BICdiff = BICv - min(BICv)                     #BIC differences
  wi = exp(-.5*AICdiff) / sum(exp(-.5*AICdiff))  #Akaike weights

  out = data.frame(mod.call,fam,kv,modLL,round(AICv,2),round(AICdiff,2),
    round(wi,3),round(BICv,2),round(BICdiff,2),round(R2,3))
  names(out) = c("model","family","k","-LL","AICc","AICdiff","wi","BIC","BICdiff","R2")
  out = out[order(AICdiff),]         #sort ascending by AICdiff column
  return(out)

}
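Before applying the function, the Akaike-weight step inside it can be checked in isolation. Given any vector of AIC values (the numbers below are hypothetical), the weights are the rescaled, renormalized AIC differences:

```r
# Akaike weights from a vector of hypothetical AIC values:
aic   <- c(100, 102, 110)
delta <- aic - min(aic)                          # AIC differences
wi    <- exp(-0.5 * delta) / sum(exp(-0.5 * delta))
round(wi, 3)  # -> 0.727 0.268 0.005; weights always sum to 1
```

Note how a 2-unit difference already cuts the weight by nearly two-thirds, and that adding or removing a candidate model changes every weight, since the denominator runs over the whole model set.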

To illustrate what the above function does, let's construct
a few models with our tree dataset:

dat = read.csv(file.choose()) #grab the treedata.csv file

dat2 = subset(dat,dat$species=="Acer rubrum") #red maple

We'll construct a set of candidate models that involve
reasonable a priori hypotheses about red maple abundance (cover) with respect
to elevation, water potential, stream distance, disturbance history, and
radiation (beers index). We're also interested in whether the data model
should be based on counts (Poisson) or a normal (Gaussian) distribution:

glm1 = glm(cover~elev+I(elev^2)+disturb+disturb*elev+disturb*I(elev^2),data=dat2,family=gaussian)

glm2 = glm(cover~elev+I(elev^2)+disturb+disturb*elev+disturb*I(elev^2),data=dat2,family=poisson)

glm3 = glm(cover~elev+I(elev^2)+disturb,data=dat2,family=gaussian)

glm4 = glm(cover~elev+I(elev^2)+disturb,data=dat2,family=poisson)

glm5 = glm(cover~elev+I(elev^2)+disturb+tci,data=dat2,family=gaussian)

glm6 = glm(cover~elev+I(elev^2)+disturb+streamdist,data=dat2,family=gaussian)

glm7 = glm(cover~elev+I(elev^2)+disturb+I(log(streamdist+1)),data=dat2,family=gaussian)

glm8 = glm(cover~elev+I(elev^2)+disturb+ disturb*elev+disturb*I(elev^2)+I(log(streamdist+1)),data=dat2,family=gaussian)

This is obviously not a complete list of candidate models; ideally we have some
hypotheses to test involving biological mechanisms of red maple distribution or
are interested in only a small subset of all possible regression models. I say
"ideally" because this approach does rely heavily on the set of
models you choose to evaluate. To illustrate this, let's first run our
information-theoretic function on models 1 through 7:

ICtab(list(glm1,glm2,glm3,glm4,glm5,glm6,glm7))

model family k -LL AICc AICdiff wi BIC BICdiff R2

7 elev + I(elev^2) + disturb + I(log(streamdist + 1)) gaussian 8 1676.202 3352.59 0.00 0.825 3405.83 0.00 0.042

6 elev + I(elev^2) + disturb + streamdist gaussian 8 1678.407 3357.00 4.41 0.091 3410.24 4.41 0.037

1 elev + I(elev^2) + disturb + disturb * elev + disturb * I(elev^2) gaussian 13 1678.354 3357.17 4.59 0.083 3443.53 37.70 0.041

5 elev + I(elev^2) + disturb + tci gaussian 8 1683.407 3367.00 14.41 0.001 3420.24 14.41 0.029

3 elev + I(elev^2) + disturb gaussian 7 1685.925 3371.99 19.41 0.000 3418.60 12.77 0.022

2 elev + I(elev^2) + disturb + disturb * elev + disturb * I(elev^2) poisson 12 1703.530 3407.46 54.87 0.000 3487.20 81.37 0.039

4 elev + I(elev^2) + disturb poisson 6 1709.601 3419.31 66.72 0.000 3459.27 53.44 0.023

Note our function requires arguments (model names) that are embedded in a
list. The function returns all the basic IC-based (and R2) criteria, including
the number of parameters (k), the negative log-likelihood, the sample
size-corrected AICc (with large samples this converges to AIC, so might as well
always use AICc), the rank order of AIC differences (all AICs compared to the
best [lowest] model), Akaike weights, and raw and relative BICs. Note the much
harsher penalties for k imposed by BIC than AIC (eg, glm1), and that R2s hardly
vary and are thus very poor tools for model comparison. Clearly model 7 is the
best model; its AICc advantage suggests "considerably more" support
than models 1 and 6, and the Akaike weights indicate that there is an order
of magnitude more support for it compared to the next two. Still, there
remains some minor support for those two models (the others have no support
at all). Now compare the result when we add model 8 to the mix:

ICtab(list(glm1,glm2,glm3,glm4,glm5,glm6,glm7,glm8))

model family k -LL AICc AICdiff wi BIC BICdiff R2

8 elev+I(elev^2)+disturb+disturb*elev+disturb*I(elev^2)+I(log(streamdist+1)) gaussian 14 1667.856 3336.25 0.00 1 3429.21 23.38 0.062

7 elev + I(elev^2) + disturb + I(log(streamdist + 1)) gaussian 8 1676.202 3352.59 16.34 0 3405.83 0.00 0.042

6 elev + I(elev^2) + disturb + streamdist gaussian 8 1678.407 3357.00 20.75 0 3410.24 4.41 0.037

1 elev + I(elev^2) + disturb + disturb * elev + disturb * I(elev^2) gaussian 13 1678.354 3357.17 20.92 0 3443.53 37.70 0.041

5 elev + I(elev^2) + disturb + tci gaussian 8 1683.407 3367.00 30.75 0 3420.24 14.41 0.029

3 elev + I(elev^2) + disturb gaussian 7 1685.925 3371.99 35.74 0 3418.60 12.77 0.022

2 elev + I(elev^2) + disturb + disturb * elev + disturb * I(elev^2) poisson 12 1703.530 3407.46 71.21 0 3487.20 81.37 0.039

4 elev + I(elev^2) + disturb poisson 6 1709.601 3419.31 83.06 0 3459.27 53.44 0.023

Relative to glm8, there is no support whatsoever (wi=0) for any remaining model if you adhere to AIC. Interestingly, because we've nearly doubled the number of parameters (because of the interaction terms for each level of disturbance class), BIC disagrees and instead favors the models with fewer parameters. Here you'd probably go with the AIC, as BIC is intended primarily for cases of extremely large sample sizes relative to the number of parameters (here, n/k for glm8 is 795/14 = 57).

Note that the Akaike weight cannot be interpreted as the 'probability that a model is true'; rather, wi shows the evidence in favor of each model given that we initially (a priori) considered our models equally likely. So, although the information-theoretic approach is a great way to compare several models that are all equally compelling on a priori grounds (perhaps for biological reasons, or from past literature), we're still left with a situation where we're evaluating the probability of data rather than models. Bayesians would argue that, on philosophical grounds, it makes more scientific sense to obtain a result that establishes the probability of a particular hypothesis (model) being true, rather than making statements based on the probability of observing the data given particular hypotheses. To make this transition we need to consider Bayes' theorem.