March 1: Likelihood

For this session we're going back to revisit a key concept that underlies many of the modeling approaches we've already used and that will be an essential component of building more complicated model structures, including mixed effects models and Bayesian techniques: maximum likelihood.  Likelihood is the probability of a dataset given a model.  The 'dataset' is any collection of values that are 1) independent and 2) identically distributed (meaning they are produced by the same statistical distribution: normal, binomial, Poisson, etc.).  The 'model' includes both the function(s) describing how the predictor variable(s) are converted into the response variable and the particular values of the parameters included in those functions.  In terms of a standard linear regression, the 'data' are the Y values, and the 'model' is both the form 'b+mX' and particular values of the slope and intercept.  Because the observations are independent, the probability of obtaining a particular set of observations given the model equals the product of the probabilities of obtaining each value separately.  One of the great things about likelihoods is that they force you to specify the error distribution that produces the data.  We usually don't consider how error enters the model until we're forced to do so (e.g., by the family argument in a GLM), or we ignore it entirely.  Creating likelihood functions forces us to admit that data are produced by a combination of deterministic relationships (process models, describing ecological relationships) and functions that describe the distribution of error or uncertainty (data models).  Excellent introductions to the concept and use of likelihoods for model fitting are found in Hilborn and Mangel (1997, Chap. 7) and Bolker (2008, Chap. 6), among many others.
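As a minimal sketch of the idea (with made-up numbers), we can compare the likelihood of a small sample under two candidate normal models directly:

```r
# Hypothetical sample; which normal model makes these data more probable?
x = c(4.2,5.1,4.8)
prod(dnorm(x,mean=5,sd=1))   #likelihood under N(5,1): relatively large
prod(dnorm(x,mean=0,sd=1))   #likelihood under N(0,1): vanishingly small
```

The model with mean 5 assigns the data a far higher probability, so it has the higher likelihood.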

Model fitting via maximum likelihood is usually accomplished with a search through parameter space in a technique called optimization.  Before computers this was basically impossible for most applications, but R implements several different optimization routines that are robust enough to now be the backbone of most model fitting routines (including GAMs, nls, and all types of mixed effects models).  Unfortunately convergence on the optimum solution (here, the maximum likelihood estimate, or MLE) is not guaranteed and can be difficult to achieve in some circumstances.  For this reason, to gainfully employ likelihood techniques in fitting models to data we have to know how to construct likelihood functions AND know at least a little about the underlying optimization methods that are supposedly finding our desired MLE.

My objectives in what follows concern implementing likelihood model fitting techniques in R, including:
1. writing likelihood functions,
2. finding maximum likelihood solutions through optimization,
3. creating likelihood profiles to examine parameter precision, and
4. comparing models using likelihood techniques (likelihood ratio tests and information criteria like AIC).

Writing likelihood functions in R

Our basic objective in fitting a model to data using likelihood is, once a model has been specified, finding the parameter value(s) that make the observed data the most likely (the maximum likelihood).  Because of tradition and mathematical convenience, we actually work not with likelihoods but with log-likelihoods (a log transformation of the likelihood, which turns the product of probabilities into a sum of log-probabilities), and we minimize the negative log-likelihood rather than maximizing the likelihood; because the log is monotonic, both approaches identify the same parameter values.
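A quick numerical check (with made-up data) that the negative log-likelihood carries the same information as the likelihood itself:

```r
x = c(1,2,3)
L = prod(dnorm(x,mean=2,sd=1))                   #likelihood: product of probabilities
nll = -sum(dnorm(x,mean=2,sd=1,log=TRUE))        #negative log-likelihood: sum of log-probabilities
all.equal(-log(L),nll)   #TRUE: identical up to floating-point rounding
```

The sum of logs is also numerically better behaved, since products of many small probabilities quickly underflow to zero.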

A likelihood function is a specification of how the data are obtained from a given error distribution. As a starting example consider a model of the mean (intercept) and variance of data with normally distributed error (akin to lm).  Values distributed according to a normal distribution each occur with a probability based on the mean and the variance of the distribution in this way:
normprob = function (x,m,v) {
(1/sqrt(2*pi*v))*exp(-((x-m)^2)/(2*v))  }

You might recognize this as the probability density function (pdf) of a normal distribution, telling us how probable certain values of x are given mean m and variance v.  For a given mean of 5 and variance of 2, how probable are observations of integers from 1 to 10?
X = c(1:10)
Y = normprob(x=X,m=5,v=2)
plot(X,Y)

We're of course expecting values near the mean to be the most probable, and from what we know about normal distributions we anticipate about 95% of the data lying within 2 standard deviations of 5 (the sd is sqrt(2), about 1.4, so within about 2.8 units).  Let's check that this is in fact a pdf by generating more x values and comparing it to the built-in dnorm function:
X = seq(0,10,by=.1)
Y = normprob(x=X,m=5,v=2)
plot(X,Y)
lines(X,dnorm(X,mean=5,sd=sqrt(2)))
Now let's take a set of observations and assess their likelihood assuming normal error. We assume these observations are independent, so we define their likelihood as the product of their individual probabilities given a mean m and variance v:
X = c(0,4,5,7,8,2,2,4,4)
like1 = prod(normprob(x=X,m=5,v=2))
like1
[1] 4.443365e-12
Again, this is the probability of observing X given the values were generated from a normal distribution of mean=5 and variance=2. Is this the model that maximizes the likelihood? Let's try some other possibilities, all assuming normal error:
means = seq(0,10,by=.1)
vars = seq(1,8,by=.5)
mean.likes = sapply(means,function(y)prod(normprob(x=X,m=y,v=2))) #varying the mean
var.likes = sapply(vars,function(y)prod(normprob(x=X,m=5,v=y))) #varying the variance
plot(means,mean.likes,type="b")
windows()   #opens a new graphics window on Windows; use dev.new() on other platforms
plot(vars,var.likes,type="b")
The graphs (likelihood curves) show the likelihood values for different parameter values of the mean (assuming variance=2) and variance (assuming mean=5). Let's do this in a slightly more sophisticated way by examining the likelihood of models where the mean and variance parameters vary simultaneously (a likelihood surface):
means = rep(means,each=length(vars))
vars = rep(vars,times=101)   #101 is the length of the original means vector
mv.mat = cbind(means,vars)
mv.likes = apply(mv.mat,1,function(y)prod(normprob(x=X,m=y[1],v=y[2])))
library(lattice)
levelplot(mv.likes~means*vars,contour=T)
The likelihood peaks at the mean and variance values we found from the separate plots, and we can extract the mean and variance values by asking on what row of mv.mat the max value lies:
mv.mat[which.max(mv.likes),]
means  vars 
  4.0   5.5 
So our crude maximum likelihood estimate (MLE) for the mean is 4.0, and the variance is 5.5.  Of course we've gone to a lot of trouble to calculate the obvious:
mean(X)
[1] 4
mean(X^2)-(mean(X)^2)
[1] 5.555556
Note that var(X) produces the unbiased variance estimator for small samples, which multiplies the population variance by n/(n-1), or 9/8. But hopefully you see the general technique: given an error function, we can search over a range of possible parameter values to find the model that maximizes the likelihood of our data.  This is model fitting via ML in a nutshell.
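As a check on that note, we can undo the bias correction directly (X restated from above):

```r
X = c(0,4,5,7,8,2,2,4,4)
n = length(X)
var(X)*(n-1)/n   #rescales the unbiased estimate back to the ML estimate: 5.555556
```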

Let's keep our data and add in a deterministic function in addition to our normal error.  Let's envision our current dataset X as plot cover values of some species and suppose we have a predictor variable of elevation in meters associated with each X:
cover = X
elev = c(100,110,320,500,830,120,330,100,500)
Although normal error is probably not the best choice for an error model, we'll stick with it for now and write a likelihood function assuming a simple linear relationship (as in lm) and normal error.  This time we'll take a shortcut using dnorm instead of bothering to write out the formula for the normal pdf, and we'll convert our calculation to negative log-likelihoods using the log=T argument in dnorm and summing the values rather than taking their product:
normNLL = function(par) {
y.pred = par[1] + par[2]*elev
-sum(dnorm(cover,mean=y.pred,sd=sqrt(par[3]),log=T)) }

The only model component we've added is how we think the mean of the normal distribution should vary depending on elevation, using two parameters of an intercept (par[1]) and slope (par[2]).  This means we need to find the MLE of 3 parameters, including the variance (par[3]), which we assume does not vary according to elevation. Let's plot the data to see which parameter values are reasonable:
plot(elev,cover)
The relationship looks roughly linear, with an intercept around zero and a slope of close to 1/100 (0.01).  We'll now automate the search process using one of several 'search' or optimization routines in R:
optim(par=c(0,1/100,1),normNLL) #par argument gives the starting search values for par[1:3]
$par
[1] 1.54507000 0.00759274 2.31612581
$value
[1] 16.55038

We'll look at the details of optim below; for now note the fitted parameter values of the intercept (par[1]=1.54), slope (par[2]=.0076), and variance (par[3]=2.3), and our estimate of the minimized value of the negative log-likelihood function at the MLEs of each parameter (16.55).  To see how this works, plug the MLEs into normNLL:
normNLL(c(1.545,.00759,2.316))
[1] 16.55038

If optim worked correctly, this is the smallest possible value of normNLL given our data and model form.  Let's check to make sure that the analytical solution using least squares is basically the same:
summary(lm(cover~elev))
Call:
lm(formula = cover ~ elev)
Residuals:
Min      1Q  Median      3Q     Max
-2.3045 -1.3412  0.1534  1.6196  1.6955

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.545275   0.972862   1.588   0.1562
elev        0.007592   0.002427   3.129   0.0166 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.726 on 7 degrees of freedom
Multiple R-squared: 0.5831,     Adjusted R-squared: 0.5235
F-statistic: 9.789 on 1 and 7 DF,  p-value: 0.01664
The intercept and slope values are identical.  (The residual standard error is not quite the square root of our variance MLE because lm divides the residual sum of squares by n-2 rather than n: 2.316*(9/7) is approximately 1.726^2.)
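We can also compare the likelihoods themselves: logLik on a fitted lm object returns the maximized log-likelihood, which should be approximately the negative of the value optim minimized.  A sketch, restating the data from above:

```r
cover = c(0,4,5,7,8,2,2,4,4)
elev = c(100,110,320,500,830,120,330,100,500)
logLik(lm(cover~elev))   #maximized log-likelihood; about -16.55, matching our optim result
```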

Let's compare this result to one using a different error distribution that recognizes cover cannot be negative, such as the Poisson distribution.  The likelihood function is similar except we change the error function to the Poisson pdf:
poisNLL = function(par) {
y.pred = par[1] + par[2]*elev
-sum(dpois(cover,lambda=y.pred,log=T)) }
We now have a model with one less parameter, because we assume the mean and variance are equal with Poisson error. As initial parameter values we'll use the MLEs from the normal error example:
optim(par=c(1.54,.0076),poisNLL)
$par
[1] 1.626705708 0.007340982
$value
[1] 17.61976
Notice that our MLEs were similar but our negative log-likelihood went up even though we used one fewer parameter, so the Poisson error model fits these data less well.
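Because the two models differ in their number of parameters, a fairer comparison uses an information criterion like AIC (= 2k + 2*NLL), previewing the model comparison promised in the objectives above.  A sketch using the minimized negative log-likelihoods reported above:

```r
# AIC = 2k + 2*NLL, where k is the number of fitted parameters
aic.norm = 2*3 + 2*16.55038   #normal model: intercept, slope, variance
aic.pois = 2*2 + 2*17.61976   #Poisson model: intercept, slope
c(normal=aic.norm,poisson=aic.pois)   #39.10 vs 39.24
```

By AIC the Poisson model is only slightly worse once its smaller parameter count is credited.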

Optimization routines

Before we go much further with model fitting using ML we should examine the optimization routines in more detail.  Optimization, the process of numerically finding a function's minimum or maximum, is by far the most frustrating part of fitting complicated models to data using likelihoods, because there is no universal best optimization method and there are many ways a given routine can fail to converge on an answer or can produce MLEs that are nonsensical.  The most commonly used R function for optimization is optim, which is itself just a convenient way to access several different optimization routines.

Calls to optim require a function (fn=) to minimize (here, the negative log-likelihood function), starting values for the fitted parameters (par=), and the desired optimization method (method=).  optim includes several methods of varying utility.  Let's generate a more complicated model for our treedata and compare several optim methods.
dat = read.csv(file.choose())     #grab the treedata.csv file
dat2 = subset(dat,dat$species=="Magnolia fraseri") #Fraser magnolia
mf.plots = unique(as.character(dat2$plotID))    #plots with Fraser magnolia

u.plots = unique(as.character(dat$plotID)) #all plots
nob.plots = u.plots[is.element(u.plots,mf.plots)==F] #plots without Fraser magnolia
dat3 = subset(dat,is.element(as.character(dat$plotID),nob.plots)) #dataset of no Fraser magnolia
dat4 = subset(dat3,duplicated(dat3$plotID)==F)  #one row per plot
dat4$cover = 0     #cover of Fraser magnolia is zero in these plots
mf = na.exclude(rbind(dat2,dat4))     #new dataframe of presences and absences

Fraser magnolia is a classic "rich cove" species that probably has a strong relationship with cove (near-stream) locations.  We'll write a likelihood function that includes a deterministic component which includes a nonlinear (exponential decay) relationship of magnolia abundance with distance-from-stream, with the rate dependent on elevation (thinking that mid-slope locations get wetter as we go higher).  For this exercise we'll also use the full complement of cover data, including zeros.  Our expectation is that these data are overdispersed, so the error component of the model will be a negative binomial function.  To show off the flexibility of the ML approach, we'll do this using a pair of nested functions, with the first being the error distribution and the second the deterministic function.
cover = mf$cover
elev = mf$elev
streamdist = mf$streamdist

nbNLL = function(par) {
y.pred = mfNLL(c(par[1],par[2],par[3],par[4]))
-sum(dnbinom(cover+1,mu=y.pred,size=par[5],log=T)) }

mfNLL = function(par) {
b0 = par[1]
b1 = par[2]
b2 = par[3]
b3 = par[4]
r = elev*b3
b0 + b1*elev + b2*exp(r*log(streamdist+1)) }
My objective here is to illustrate the power of likelihoods in fitting models to data rather than to explore magnolia distribution in detail.  What we did was define an error function based on the two-parameter negative binomial distribution, where mu is the expected cover and size is the dispersion parameter that allows the variance to increase faster than the mean (relaxing the equal mean-variance assumption of the otherwise similar Poisson model).  nbNLL thus returns the negative log-likelihood of the dataset fed through the deterministic function mfNLL.  mfNLL has no specification of error in it; it only contains a four-parameter model based on an overall intercept (b0), a slope for the relation of cover and elevation (b1), a constant for the stream distance relationship (b2), and finally a parameter that controls how elevation influences the exponential decay rate of cover with stream distance (b3).  We could have opted for a simpler regression of elevation and stream distance, each with a slope parameter and including an intercept, but I wanted to demonstrate the flexibility of giving optim any function that will produce a likelihood value.  This can involve calls to numerous other functions, many estimated parameters (constrained by optim's ability to converge on a consistent and sensible answer), or even calls to other scripts: indeed, anything that will generate a 'y.pred' value for nbNLL.
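Before handing a nested pair of functions like this to optim, it's worth confirming that the likelihood function returns a finite value at the starting parameters; a non-finite NLL at the start is a common cause of immediate optimizer failure.  A sketch with simulated stand-in data (the real cover, elev, and streamdist come from the treedata file):

```r
set.seed(1)
elev = runif(50,100,1500)              #stand-in predictors
streamdist = runif(50,0,500)
cover = rnbinom(50,mu=2,size=1)        #stand-in cover values
mfNLL = function(par) {
b0 = par[1]; b1 = par[2]; b2 = par[3]; b3 = par[4]
r = elev*b3
b0 + b1*elev + b2*exp(r*log(streamdist+1)) }
nbNLL = function(par) {
y.pred = mfNLL(par[1:4])
-sum(dnbinom(cover+1,mu=y.pred,size=par[5],log=TRUE)) }
is.finite(nbNLL(c(1,0,1,0,1)))   #TRUE here; if FALSE, adjust the starting values
```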

The first consideration in getting optim to produce MLEs is giving it reasonable starting values for each parameter.  b0 should be near the overall cover mean, so we can start it at 1.  b1 is the slope for elevation; it must be small (given the scale of elevation relative to cover), but we're not sure whether it's greater or less than zero, so we start it at 0.  b2 is a bit of a mystery; we'll start it at 1 as a default.  b3 translates elevation into a decay rate and so must be very small; we start it at zero as well.  The final parameter is the dispersion parameter for the negative binomial error; there are guidelines for choosing reasonable values of this, but for convenience we'll start with a value of 1.
start.par = c(1,0,1,0,1)
Nelder-Mead is the default method for optim.  Because it involves a bit of jumping around in parameter space (rather than a derivative-based approach that is more likely to get caught in local minima), it can produce consistent results but can run slowly with large datasets or complicated parameterizations. Trying it with our above likelihood model and defined starting parameters:
out1 = optim(fn=nbNLL,par=start.par,method="Nelder-Mead")
out1

$par
[1]  4.800675e-01 -4.718550e-04  2.677907e+00 -3.190603e-05  5.435705e+00
$value
[1] 1832.206
$counts
function gradient 
     502       NA 
$convergence
[1] 1
$message
NULL

The $par component lists the MLEs for the 5 parameters, and $value is the minimized negative log-likelihood for our dataset.  The $counts component is the number of times optim queried our likelihood function (we did not supply a gradient).  The $convergence value is critical: a 1 here indicates that the optimization was stopped by the iteration limit (here the default of 500) rather than by reaching the convergence criterion (which would return 0).  This should give us cause for concern.  Let's increase the iteration limit to see whether the fit improves:
out2 = optim(fn=nbNLL,par=start.par,method="Nelder-Mead",control=list(maxit=10000,trace=TRUE))   #maxit raised from the default of 500
out2
$par
[1] -5.385844e+00 -9.194085e-05  8.467816e+00 -1.698507e-05  5.504508e+00
$value
[1] 1830.939

Still no convergence, and the fit is barely better.  The trace=T argument (in optim's control list) generates output for each iteration; scroll through it to see that the likelihood in fact dropped below 1840 relatively quickly.

The BFGS method is a quasi-Newton technique that estimates the location of the minimum using derivatives.  Bolker (2008) recommends BFGS when the likelihood surface is 'well-behaved' (presumably meaning simple) and starting values are reasonably clear; it is also a faster technique.
out3 = optim(fn=nbNLL,par=start.par,method="BFGS")
out3
$par
[1]  1.000000e+00  1.002682e-20  1.000000e+00 -1.051292e-19  1.000000e+00
$value
[1] 2003.454
This is a substantially worse fit than Nelder-Mead and our parameters basically didn't move, so our functions appear too complex for this method.  We can try a version of this that operates with parameter constraints (upper and lower search limits for each parameter) called L-BFGS-B. To do so we need vectors of parameter limits, which I provide using educated guesses:
lower = c(.5,-0.01,0,-0.01,.5)
upper = c(10,0.01,5,1,5)

out4 = optim(fn=nbNLL,par=start.par,lower=lower,upper=upper,method="L-BFGS-B")
out4
$par
[1]  2.5919153042 -0.0000979387  1.7075221989 -0.0022838252  1.7075572583
$value
[1] 1909.586
$counts
function gradient 
      41       41 
$convergence
[1] 0
We did get convergence, and a better solution than unbounded BFGS, but still no better than Nelder-Mead. The parameter values at least look reasonable.  Bolker (2008, p. 248) suggests that L-BFGS-B is less robust than unconstrained methods; however it can be useful for complicated fits where you want to put tight constraints on parameters (e.g., due to biological realism; note Bolker suggests several alternatives to this when constraints are important to the fit).
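One knob worth knowing about for fits like this, where parameters differ by several orders of magnitude, is optim's parscale control: optim searches in units of par divided by parscale, so supplying rough magnitudes for each parameter puts them on a comparable footing.  A sketch, assuming nbNLL, start.par, and the bounds defined above, with guessed scales:

```r
# parscale entries are rough magnitude guesses for each of the 5 parameters
out4b = optim(fn=nbNLL,par=start.par,method="L-BFGS-B",
    lower=lower,upper=upper,
    control=list(parscale=c(1,1e-3,1,1e-3,1)))
```

Whether this improves the fit depends on the surface; it is often worth trying before abandoning a derivative-based method.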

One more common optimization method available with optim is simulated annealing.  Simulated annealing is one of the more robust (but, unsurprisingly, slow) techniques for finding the 'global' minimum when there may be many local minima, because it has a large stochastic component (random jumps in parameter space).  The version in optim is SANN:
out5 = optim(fn=nbNLL,par=start.par,method="SANN")
out5

$par
[1]  1.7294495733  0.0003218725  0.6010998011 -0.2951911926  2.2183194506
$value
[1] 1891.716
$counts