**March 1: Likelihood**

For this session we’re going back to revisit a key concept that underlies many of the modeling approaches we’ve already used and that will be an essential component of building more complicated model structures, including mixed-effects models and Bayesian techniques: maximum likelihood. *Likelihood* is the probability of a dataset given a model. The ‘dataset’ is any collection of values that are 1) independent and 2) identically distributed (meaning they are produced by the same statistical distribution: normal, binomial, Poisson, etc.). The ‘model’ includes both the function(s) describing how the predictor variable(s) are converted into the response variable and the particular values of the parameters in those functions. In terms of a standard linear regression, the ‘data’ are the Y values, and the ‘model’ is both the form ‘b+mX’ and the particular values of the slope and intercept. Because the observations are independent, the probability of obtaining a particular set of observations given the model equals the product of the probabilities of obtaining each value separately. One of the great things about likelihoods is that they force you to specify the error distribution that produces the data. We usually don’t consider how error enters the model until we’re forced to do so (e.g., by the *family* argument in a GLM), or we ignore it entirely. Creating likelihood functions forces us to admit that data are produced by a combination of deterministic relationships (*process* models, describing ecological relationships) and functions that describe the distribution of error or uncertainty (*data* models). Excellent introductions to the concept and usage of likelihoods for model fitting are found in Hilborn and Mangel (1997, Chap. 7) and Bolker (2008, Chap. 6), among many others.
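
In symbols: for independent, identically distributed observations $y_1, \ldots, y_n$ and parameters $\theta$, the likelihood of the whole dataset is just the product of the individual probabilities under the model:

$$\mathcal{L}(\theta \mid y) = \prod_{i=1}^{n} p(y_i \mid \theta)$$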

Model fitting via maximum likelihood is usually accomplished by a search through parameter space in a technique called *optimization*. Before computers this was basically impossible for most applications, but R implements several optimization routines that are robust enough to now be the backbone of most model-fitting functions (including GAMs, **nls**, and all types of mixed-effects models). Unfortunately, convergence on the optimum solution (here, the maximum likelihood estimate, or *MLE*) is not guaranteed and can be difficult to achieve in some circumstances. For this reason, to gainfully employ likelihood techniques in fitting models to data we have to know how to construct likelihood functions AND know at least a little about the underlying optimization methods that are supposedly finding our desired MLE.

My objectives in what follows concern implementing
likelihood model fitting techniques in R, including:

1. writing likelihood functions,

2. finding maximum likelihood solutions through optimization,

3. creating likelihood profiles to examine parameter precision, and

4. comparing models using likelihood techniques (likelihood ratio tests and
information criteria like AIC).

**Writing likelihood functions in R**

Our basic objective in fitting a model to data using likelihood is, once a model has been specified, finding the parameter value(s) that make the observed data most likely (the maximum likelihood). For reasons of tradition and mathematical convenience, we actually work not with likelihoods but with log-likelihoods (a log transformation of the likelihood, which changes the product of probabilities into a sum of log-probabilities), and we minimize the negative log-likelihood rather than maximizing the likelihood itself.
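
A quick numerical illustration that the log transform preserves the answer (the probabilities below are arbitrary values chosen only for the demonstration):

p = c(0.2,0.5,0.1) #any set of independent probabilities

prod(p) #the likelihood as a product: 0.01

exp(sum(log(p))) #identical: logs turn the product into a sum

-sum(log(p)) #the negative log-likelihood we will minimize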

A likelihood function is a specification of how the data are
obtained from a given error distribution. As a starting example consider a
model of the mean (intercept) and variance of data with normally distributed
error (akin to **lm**). Values distributed according to a normal
distribution each occur with a probability based on the mean and the variance
of the distribution in this way:

normprob = function(x,m,v) {
  (1/sqrt(2*pi*v))*exp(-((x-m)^2)/(2*v))
}

You might recognize this as the probability density function (pdf) of a normal
distribution, telling us how probable certain values of x are given mean m and
variance v. For a given mean of 5 and
variance of 2, how probable are observations of integers from 1 to 10?

X = c(1:10)

Y = normprob(x=X,m=5,v=2)

plot(X,Y)

We're of course expecting values near the mean to be the most probable, and
from what we know about normal distributions we anticipate about 95% of the
data being within 2 standard deviations (2 × √2 ≈ 2.8 units) of 5. Let's check that this is in fact a pdf by
generating more x values and comparing it to the built-in **dnorm** function:

X = seq(0,10,by=.1)

Y = normprob(x=X,m=5,v=2)

plot(X,Y)

lines(X,dnorm(X,mean=5,sd=sqrt(2)))
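
Since a pdf must integrate to 1 over its support, a quick numerical check is also possible:

integrate(normprob,lower=-Inf,upper=Inf,m=5,v=2) #should report a value of ~1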

Now
let's take a set of observations and assess their likelihood assuming normal
error. We assume these observations are independent, so we define their
likelihood as the product of their individual probabilities given a mean m and
variance v:

X = c(0,4,5,7,8,2,2,4,4)

like1 = prod(normprob(x=X,m=5,v=2))

like1

[1] 4.443365e-12
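
As a sanity check, the built-in pdf gives the same value:

prod(dnorm(X,mean=5,sd=sqrt(2)))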

Again,
this is the probability of observing X given the values were generated from a
normal distribution of mean=5 and variance=2. Is this the model that maximizes
the likelihood? Let's try some other possibilities, all assuming normal error:

means = seq(0,10,by=.1)

vars = seq(1,8,by=.5)

mean.likes = sapply(means,function(y)prod(normprob(x=X,m=y,v=2))) #varying the mean

var.likes = sapply(vars,function(y)prod(normprob(x=X,m=5,v=y))) #varying the variance

plot(means,mean.likes,type="b")

windows() #opens a new plot device on Windows; use quartz() or x11() on other systems

plot(vars,var.likes,type="b")

The
graphs (*likelihood curves*) show the
likelihood values for different parameter values of the mean (assuming
variance=2) and variance (assuming mean=5). Let's do this in a slightly more
sophisticated way by examining the likelihood of models where the mean and
variance parameters vary simultaneously (a *likelihood*
*surface*):

means = rep(means,each=length(vars))

vars = rep(vars,times=101) #101 is the original length of means

mv.mat = cbind(means,vars)

mv.likes = apply(mv.mat,1,function(y)prod(normprob(x=X,m=y[1],v=y[2])))

library(lattice)

levelplot(mv.likes~means*vars,contour=T)

The
likelihood peaks at the mean and variance values we found from the separate
plots, and we can extract the mean and variance values by asking on what row of
mv.mat the max value lies:

mv.mat[which.max(mv.likes),]

means  vars 
  4.0   5.5

So
our crude maximum likelihood estimate (MLE) for the mean is 4.0, and the
variance is 5.5. Of course we've gone to
a lot of trouble to calculate the obvious:

mean(X)

[1] 4

mean(X^2)-(mean(X)^2)

[1] 5.555556

Note that **var**(X) produces the
unbiased variance estimator for small samples, which multiplies the population
variance by n/(n-1), or 9/8. But hopefully you see the general technique: given
an error function, we can search over a range of possible parameter values to
find the model that maximizes the likelihood of our data. This is model fitting via ML in a nutshell.
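
To confirm the relationship between the two estimators for these data:

var(X) #unbiased estimator: 50/8 = 6.25

var(X)*8/9 #rescaled by (n-1)/n to the ML (population) estimate: 50/9 = 5.5556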

Let's keep our data and add in a deterministic
function in addition to our normal error.
Let's envision our current dataset X as plot cover values of some
species and suppose we have a predictor variable of elevation in meters
associated with each X:

cover = X

elev = c(100,110,320,500,830,120,330,100,500)

Although normal error is probably not the best choice for an error model, we'll
stick with it for now and write a likelihood function assuming a simple linear
relationship (as in **lm**) and normal
error. This time we'll take a shortcut
using **dnorm** instead of bothering to
write out the formula for the normal pdf, and we'll convert our calculation to
negative log-likelihoods using the *log=T*
argument in **dnorm** and summing the
values rather than taking their product:

normNLL = function(par) {
  y.pred = par[1] + par[2]*elev
  -sum(dnorm(cover,mean=y.pred,sd=sqrt(par[3]),log=T))
}

The only model component we've added is how we think the mean of the normal
distribution should vary depending on elevation, using two parameters of an
intercept (par[1]) and slope (par[2]).
This means we need to find the MLE of 3 parameters, including the
variance (par[3]), which we assume does not vary according to elevation. Let's
plot the data to see which parameter values are reasonable:

plot(elev,cover)

The relationship looks roughly linear, with an intercept around zero and a
slope of close to 1/100 (0.01). We'll
now automate the search process using one of several 'search' or optimization
routines in R:

optim(par=c(0,1/100,1),normNLL) #par argument gives the starting search values for par[1:3]

$par

[1] 1.54507000 0.00759274 2.31612581

$value

[1] 16.55038

We'll look at the details of **optim**
below; for now note the fitted parameter values of the intercept (par[1]=1.54),
slope (par[2]=.0076), and variance (par[3]=2.3), and our estimate of the
minimized value of the negative log-likelihood function at the MLEs of each
parameter (16.55). To see how this
works, plug in the MLEs into normNLL:

normNLL(c(1.545,.00759,2.316))

[1] 16.55038

If **optim** worked correctly, this is
the smallest possible value of normNLL given our data and model form. Let's check to make sure that the analytical
solution using least squares is basically the same:

summary(lm(cover~elev))

Call:
lm(formula = cover ~ elev)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3045 -1.3412  0.1534  1.6196  1.6955 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) 1.545275   0.972862   1.588   0.1562  
elev        0.007592   0.002427   3.129   0.0166 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.726 on 7 degrees of freedom
Multiple R-squared: 0.5831,     Adjusted R-squared: 0.5235 
F-statistic: 9.789 on 1 and 7 DF,  p-value: 0.01664

The intercept and slope estimates are identical. The variance estimates differ slightly because the MLE divides the residual sum of squares by n, whereas **lm**'s residual standard error divides by the residual degrees of freedom (n-2).
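
A quick check of that claim, rescaling **lm**'s residual variance by (n-2)/n = 7/9:

summary(lm(cover~elev))$sigma^2 * 7/9 #1.726^2 * 7/9 = ~2.32, matching par[3] above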

Let's compare this result to one using a
different error distribution that recognizes cover cannot be negative, such as
the Poisson distribution. The likelihood
function is similar except we change the error function to the Poisson pdf:

poisNLL = function(par) {
  y.pred = par[1] + par[2]*elev
  -sum(dpois(cover,lambda=y.pred,log=T))
}

We now have a model with one fewer parameter, because with Poisson error the mean and variance are assumed equal. As initial parameter values we'll use the MLEs from the normal-error example:

optim(par=c(1.54,.0076),poisNLL)

$par

[1] 1.626705708 0.007340982

$value

[1] 17.61976

Notice that our MLEs were similar but our negative log-likelihood went up, even though we used one fewer parameter; this is a worse model for these data.
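
We can quantify the comparison with AIC (2·NLL + 2k, where k is the number of parameters), previewing the model-comparison techniques listed in the objectives above:

2*16.55038 + 2*3 #normal error, 3 parameters: AIC = 39.10

2*17.61976 + 2*2 #Poisson error, 2 parameters: AIC = 39.24

The normal-error model wins, though by a slim margin on the AIC scale.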

**Optimization routines**

Before we go much further with model fitting using ML we
should examine the optimization routines in more detail. *Optimization*,
or the process of finding a function's minimum or maximum solution numerically,
is by far the most frustrating part of fitting complicated models to data using
likelihoods, because there is no universal best optimization method and there
are many ways a given routine can fail to converge on an answer or produce MLEs
that are nonsensical. The most commonly used R function for optimization is **optim**, which is itself just a convenient way to access several different optimization routines.

Calls to **optim**
require a function (*fn*=) to minimize
(here, the negative log-likelihood function), starting values for the fitted
parameters (*par*=), and the desired
optimization method (*method*=). **optim** includes several methods of varying utility. Let's generate a more complicated model for our treedata and compare several of them.

dat = read.csv(file.choose()) #grab the treedata.csv file

dat2 = subset(dat,dat$species=="Magnolia fraseri") #Fraser magnolia

ro.plots = unique(as.character(dat2$plotID)) #plots with mag fra

u.plots = unique(as.character(dat$plotID)) #all plots

nob.plots = u.plots[is.element(u.plots,ro.plots)==F] #plots without mag fra

dat3 = subset(dat,is.element(as.character(dat$plotID),nob.plots)) #dataset of no mag fra

dat4 = subset(dat3,duplicated(dat3$plotID)==F) #one row per plot

dat4$cover = 0 #cover of Fraser magnolia is zero in these plots

mf = na.exclude(rbind(dat2,dat4)) #new dataframe of presences and absences

Fraser magnolia is a classic "rich cove" species
that probably has a strong relationship with cove (near-stream) locations. We'll write a likelihood function whose deterministic component specifies a nonlinear (exponential decay) relationship of magnolia abundance with distance from stream, with the decay rate dependent on elevation (thinking that mid-slope locations get wetter as we go higher). For this exercise we'll also
use the full complement of cover data, including zeros. Our expectation is that these data are
overdispersed, so the error component of the model will be a negative binomial
function. To show off the flexibility of
the ML approach, we'll do this using a pair of nested functions, with the first
being the error distribution and the second the deterministic function.

cover = mf$cover

elev = mf$elev

streamdist = mf$streamdist

nbNLL = function(par) {
  y.pred = mfNLL(par[1:4])
  -sum(dnbinom(cover+1,mu=y.pred,size=par[5],log=T))
}

mfNLL = function(par) {
  b0 = par[1] #overall intercept
  b1 = par[2] #slope for elevation
  b2 = par[3] #constant for the stream-distance relationship
  b3 = par[4] #controls how elevation sets the decay rate
  r = elev*b3
  b0 + b1*elev + b2*exp(r*log(streamdist+1))
}

My objective here is to illustrate the power of likelihoods in fitting models to data, rather than to explore magnolia distribution in detail. What we did was define an error function based
on the two-parameter negative binomial function, where *mu* is the expected cover and *size*
is the dispersion parameter that accounts for the variance increasing faster
than the mean (thus relaxing this assumption for the otherwise equivalent
Poisson model). nbNLL thus returns the
negative log-likelihood of the dataset fed through the deterministic function
mfNLL. mfNLL has no specification of
error in it, but only contains a four-parameter model based on an overall
intercept (b0), slope for the relation of cover and elevation (b1), a constant
for the stream distance relationship (b2), and finally a parameter that
controls how elevation influences the exponential decay rate of cover with
stream distance (b3). We could have opted
for a simpler regression of elevation and stream distance, each with a slope
parameter and including an intercept, but I wanted to demonstrate the
flexibility of giving optim any function that will produce a likelihood value. This can involve calls to numerous other
functions, many estimated parameters (constrained by optim's ability to
converge on a consistent and sensible answer), or even calling up other
scripts—indeed anything that will generate a 'y.pred' value for nbNLL.
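
For instance, the simpler regression mentioned above could be swapped in without touching the error model (mfNLL2 is a hypothetical name for this sketch; nbNLL would then pass it par[1:3] and keep a separate parameter for *size*):

mfNLL2 = function(par) {
  par[1] + par[2]*elev + par[3]*streamdist #intercept plus one slope each
}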

The first consideration in having **optim** produce MLEs is giving it reasonable starting values for each
parameter. b0 should be near the overall cover mean, so we can start it at 1.
b1 is the slope for elevation, and it must be small (given the scale of elevation
relative to cover) but we're not sure if it's greater or less than zero, so we
start it at 0. b2 is a bit of a mystery; we'll start it at 1 as a default. b3 translates elevation into a decay rate and so must be very small; we start it at zero as well. The final
parameter is the dispersion parameter for NB error; there are guidelines for
figuring out reasonable values of this but for convenience we'll start with a
value of 1.

start.par = c(1,0,1,0,1)

Nelder-Mead is the default method for **optim**. Because it involves a bit of jumping around
in parameter space (rather than a derivative-based approach that is more likely
to get caught in local minima), it can produce consistent results but can run
slowly with large datasets or complicated parameterizations. Trying it with our
above likelihood model and defined starting parameters:

out1 = optim(fn=nbNLL,par=start.par,method="Nelder-Mead")

out1

$par

[1] 4.800675e-01 -4.718550e-04 2.677907e+00 -3.190603e-05 5.435705e+00

$value

[1] 1832.206

$counts

function gradient

502 NA

$convergence

[1] 1

$message

NULL

The $par component lists the MLEs for the 5 parameters, and the $value is the
minimized negative log-likelihood for our dataset. The $counts component is the
number of times our likelihood function was queried by **optim** (we did not use a gradient). The $convergence value is
critical: a 1 here indicates that what stopped the optimization was the
iteration limit (here 500) rather than the convergence criterion being reached
('0'). This should give us cause for concern.
Let's increase the iteration limit to see whether the fit improves:

out2 = optim(fn=nbNLL,par=start.par,method="Nelder-Mead",control=list(trace=TRUE,maxit=1000))

out2

$par

[1] -5.385844e+00 -9.194085e-05
8.467816e+00 -1.698507e-05
5.504508e+00

$value

[1] 1830.939

Still no convergence, and the fit is little better. The *trace=TRUE* argument generates output for each iteration; scroll through it to see that in fact the likelihood dropped below 1840 relatively quickly.

The BFGS method is a version of estimating the location of
the minimum using derivatives. Bolker
(2008) recommends BFGS when the likelihood surface is 'well-behaved'
(presumably meaning simple) and starting values are reasonably clear; it is
also a faster technique.

out3 = optim(fn=nbNLL,par=start.par,method="BFGS")

out3

$par

[1] 1.000000e+00 1.002682e-20
1.000000e+00 -1.051292e-19
1.000000e+00

$value

[1] 2003.454

This
is a substantially worse fit than Nelder-Mead and our parameters basically
didn't move, so our functions appear too complex for this method. We can try a version of this that operates
with parameter constraints (upper and lower search limits for each parameter)
called L-BFGS-B. To do so we need vectors of parameter limits, which I provide
using educated guesses:

lower = c(.5,-0.01,0,-0.01,.5)

upper = c(10,0.01,5,1,5)

out4 = optim(fn=nbNLL,par=start.par,lower=lower,upper=upper,method="L-BFGS-B")

out4

$par

[1] 2.5919153042 -0.0000979387 1.7075221989 -0.0022838252 1.7075572583

$value

[1] 1909.586

$counts

function gradient

41 41

$convergence

[1] 0

We
did get convergence, and a better solution than unbounded BFGS, but still no
better than Nelder-Mead. The parameter values at least look reasonable. Bolker (2008, p. 248) suggests that L-BFGS-B
is less robust than unconstrained methods; however it can be useful for
complicated fits where you want to put tight constraints on parameters (e.g.,
due to biological realism; note Bolker suggests several alternatives to this
when constraints are important to the fit).
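
One such alternative is transforming parameters so that an unconstrained method respects their natural ranges. A minimal sketch for our model, keeping only the dispersion parameter positive (nbNLL.t is an ad hoc name; the fifth parameter is now searched on the log scale):

nbNLL.t = function(par) {
  y.pred = mfNLL(par[1:4])
  -sum(dnbinom(cover+1,mu=y.pred,size=exp(par[5]),log=T)) #exp() keeps size > 0
}

out.t = optim(fn=nbNLL.t,par=c(1,0,1,0,0),method="Nelder-Mead") #start at log(size)=0

exp(out.t$par[5]) #back-transform to the size scale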

One more common optimization method available with **optim** is simulated annealing. Simulated annealing is one of the more robust
(but, unsurprisingly, slow) techniques for finding the 'global' minimum when
there may be many local minima, because it has a large stochastic component
(random jumps in parameter space). The
version in **optim** is SANN:

out5 = optim(fn=nbNLL,par=start.par,method="SANN")

out5

$par

[1] 1.7294495733 0.0003218725
0.6010998011 -0.2951911926
2.2183194506

$value

[1] 1891.716

$counts

function gradient

10000 NA

$convergence

[1] 0

It may take a few tries to get reasonable parameter values. The above output was one of several I tried,
and none produced a fit as good as Nelder-Mead. The parameter values here are
also very different. Most likely we have
a situation where there is a plateau on the likelihood surface where a large
range of parameter values produce similar likelihoods. From here we may need to make our model
simpler, or explore subsets of the data to get a better handle on reasonable
parameter values. The situation also
highlights the fact that we need more than reasonable parameter values if we're
to do much with this fitted model: in particular, we need a way to define
parameter precision (e.g., confidence intervals). Evaluating the support for particular model
values is also the basis for likelihood hypothesis testing.
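
As a preview, here is a minimal sketch of a likelihood *slice* for the dispersion parameter, holding the other four parameters at the Nelder-Mead estimates in out1 (a true profile would instead re-optimize the other parameters at each value of *size*):

size.grid = seq(1,10,by=.25)

slice = sapply(size.grid,function(s)nbNLL(c(out1$par[1:4],s)))

plot(size.grid,slice,type="l",xlab="size",ylab="negative log-likelihood")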

*An exercise.* Write a binomial likelihood function (using **dbinom**) for the magnolia dataset, using a deterministic model of elevation with both linear and quadratic terms (i.e., 3 parameters, including the intercept) and a binary response variable (presence-absence). What parameter values are produced by **optim**?