Likelihood approaches in R

For this session we’re going back to revisit a key concept that underlies many of the modeling approaches we’ve already used and will be an essential component of building more complicated model structures, including mixed effect models and Bayesian techniques: maximum likelihood.  Likelihood is the probability of a dataset given a model.  The ‘dataset’ is any collection of values that are 1) independent and 2) identically distributed (meaning they are produced by the same statistical distribution: normal, binomial, Poisson, etc.).  The ‘model’ includes both function(s) describing how the predictor variable(s) are converted into the response variable, and the particular values of parameters included in the function.  In terms of a standard linear regression, ‘data’ are the Y values, and ‘model’ is both the form ‘b+mX’ and particular values of the slope and intercept.  Given they are independent, the probability of obtaining a particular set of observations given the model is equal to the product of the probabilities of obtaining each value separately.  One of the great things about likelihoods is that they force you to specify the error distribution that produces the data.  We usually don’t consider how error enters the model until we’re forced to do so (e.g., by the family argument in a GLM) or we ignore it entirely.  Creating likelihood functions forces us to admit that data are produced by a combination of deterministic relationships (process models, indicating ecological relationships) and functions that describe the distribution of error or uncertainty (data models).  Excellent introductions to the concept and usage of likelihoods for model fitting are found in Hilborn and Mangel (1997, Chap. 7) and Bolker (2008, Chap. 6), among many others.

Model fitting via maximum likelihood is usually accomplished by searching through parameter space with a technique called optimization.  Before computers this was basically impossible for most applications, but R implements several different optimization routines that are robust enough to form the backbone of most modern model fitting functions (including GAMs, nls, and all types of mixed effects models).  Unfortunately, convergence on the optimum solution (here, the maximum likelihood estimate, or MLE) is not guaranteed and can be difficult to achieve in some circumstances.  For this reason, to gainfully employ likelihood techniques in fitting models to data we have to know how to construct likelihood functions AND know at least a little about the underlying optimization methods that are supposedly finding our desired MLE.
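To make the idea of optimization concrete before we attach it to likelihoods, here is a toy example (not part of the analyses below): we hand optim a simple function and ask it to find the parameter value that minimizes it.
optim(par=0, fn=function(p) (p-3)^2, method="BFGS")$par   #should return a value very close to 3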

My objectives in what follows concern implementing likelihood model fitting techniques in R, including:
1. writing likelihood functions,
2. finding maximum likelihood solutions through optimization,
3. creating likelihood profiles to examine parameter precision, and
4. comparing models using likelihood techniques (likelihood ratio tests and information criteria like AIC).

Writing likelihood functions in R

Our basic objective in fitting a model to data using likelihood is, once a model has been specified, finding the parameter value(s) that make the observed data the most likely (the maximum likelihood).  Because of tradition and mathematical convenience, we actually work not with likelihoods but with log-likelihoods (a log transformation of the likelihood, which changes the product of probabilities to a sum of log-probabilities), and we minimize the negative log-likelihood rather than maximizing the likelihood.
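As a quick numerical illustration (with three made-up probabilities for three hypothetical independent observations), the log converts a product of small probabilities into a manageable sum:
p = c(0.20,0.10,0.05)     #hypothetical probabilities of three independent observations
prod(p)                   #likelihood: the product of the individual probabilities
log(prod(p)); sum(log(p)) #the log of the product equals the sum of the logs
-sum(log(p))              #negative log-likelihood, the quantity we will minimize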

A likelihood function specifies the probability of the data under a given error distribution, treated as a function of the model parameters. As a starting example consider a model of the mean (intercept) and variance of data with normally distributed error (akin to lm).  Values drawn from a normal distribution each occur with a probability density determined by the mean and the variance of the distribution in this way:
normprob = function (x,m,v) {     #normal probability density of x given mean m and variance v
     (1/sqrt(2*pi*v))*exp(-((x-m)^2)/(2*v))  }
     
You might recognize this as the probability density function (pdf) of a normal distribution, telling us how probable certain values of x are given mean m and variance v.  For a given mean of 5 and variance of 2, how probable are observations of integers from 1 to 10?
X = c(1:10)
Y = normprob(x=X,m=5,v=2)
plot(X,Y)

We're of course expecting values near the mean to be the most probable, and from what we know about normal distributions we anticipate about 95% of the data falling within 2 standard deviations of 5 (the SD here is sqrt(2), or about 1.4, so within roughly 2.8 units).  Let's check that this is in fact a pdf by generating more x values and comparing it to the built-in dnorm function:
X = seq(0,10,by=.1)
Y = normprob(x=X,m=5,v=2)
plot(X,Y)
lines(X,dnorm(X,mean=5,sd=sqrt(2)))
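As a quick check of those two claims (an aside using the X and Y just created):
sum(Y)*0.1                                                   #crude numerical integral of the density over 0-10; should be very close to 1
pnorm(5+2*sqrt(2),5,sqrt(2)) - pnorm(5-2*sqrt(2),5,sqrt(2))  #proportion within 2 SD of the mean, ~0.95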
Now let's take a set of observations and assess their likelihood assuming normal error. We assume these observations are independent, so we define their likelihood as the product of their individual probabilities given a mean m and variance v:
X = c(0,4,5,7,8,2,2,4,4)
like1 = prod(normprob(x=X,m=5,v=2))
like1
[1] 4.443365e-12
Again, this is the probability of observing X given the values were generated from a normal distribution of mean=5 and variance=2. Is this the model that maximizes the likelihood? Let's try some other possibilities, all assuming normal error:
means = seq(0,10,by=.1)
vars = seq(1,8,by=.5)
mean.likes = sapply(means,function(y)prod(normprob(x=X,m=y,v=2))) #varying the mean 
var.likes = sapply(vars,function(y)prod(normprob(x=X,m=5,v=y))) #varying the variance
plot(means,mean.likes,type="b")
windows()     #opens a new graphics window on Windows; use quartz() or dev.new() on other platforms
plot(vars,var.likes,type="b")
The graphs (likelihood curves) show the likelihood values for different parameter values of the mean (assuming variance=2) and variance (assuming mean=5). Let's do this in a slightly more sophisticated way by examining the likelihood of models where the mean and variance parameters vary simultaneously (a likelihood surface):
means = rep(means,each=length(vars))  #all combinations: each mean repeated for every variance
vars = rep(vars,times=101)            #101 is the original length of the means vector
mv.mat = cbind(means,vars)
mv.likes = apply(mv.mat,1,function(y)prod(normprob(x=X,m=y[1],v=y[2])))
library(lattice)
levelplot(mv.likes~means*vars,contour=T)
The likelihood peaks at the mean and variance values we found from the separate plots, and we can extract the mean and variance values by asking on what row of mv.mat the max value lies:
mv.mat[which.max(mv.likes),]
means  vars 
  4.0   5.5 
So our crude maximum likelihood estimate (MLE) for the mean is 4.0, and the variance is 5.5.  Of course we've gone to a lot of trouble to calculate the obvious:
mean(X)
[1] 4
mean(X^2)-(mean(X)^2)
[1] 5.555556
Note that var(X) instead returns the unbiased variance estimator, which multiplies this ML (population) estimate by n/(n-1), here 9/8. But hopefully you see the general technique: given an error function, we can search over a range of possible parameter values to find the model that maximizes the likelihood of our data.  This is model fitting via ML in a nutshell.
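To verify the var(X) relationship just mentioned:
var(X)                        #R's built-in estimator (n-1 denominator)
[1] 6.25
(9/8)*(mean(X^2)-mean(X)^2)   #rescaling the ML estimate by n/(n-1) gives the same value
[1] 6.25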

Let's keep our data and add in a deterministic function in addition to our normal error.  Let's envision our current dataset X as plot cover values of some species and suppose we have a predictor variable of elevation in meters associated with each X:
cover = X
elev = c(100,110,320,500,830,120,330,100,500)
Although normal error is probably not the best choice for an error model, we'll stick with it for now and write a likelihood function assuming a simple linear relationship (as in lm) and normal error.  This time we'll take a shortcut using dnorm instead of bothering to write out the formula for the normal pdf, and we'll convert our calculation to negative log-likelihoods using the log=T argument in dnorm and summing the values rather than taking their product:
normNLL = function(par) {
     y.pred = par[1] + par[2]*elev
     -sum(dnorm(cover,mean=y.pred,sd=sqrt(par[3]),log=T)) }

The only model component we've added is how we think the mean of the normal distribution should vary depending on elevation, using two parameters of an intercept (par[1]) and slope (par[2]).  This means we need to find the MLE of 3 parameters, including the variance (par[3]), which we assume does not vary according to elevation. Let's plot the data to see which parameter values are reasonable:
plot(elev,cover)
The relationship looks roughly linear, with an intercept around zero and a slope of close to 1/100 (0.01).  We'll now automate the search process using one of several 'search' or optimization routines in R:
optim(par=c(0,1/100,1),normNLL) #par argument is the starting search values for par[1:3]
$par
[1] 1.54507000 0.00759274 2.31612581
$value
[1] 16.55038

We'll look at the details of optim below; for now note the fitted parameter values of the intercept (par[1]=1.54), slope (par[2]=.0076), and variance (par[3]=2.3), and our estimate of the minimized value of the negative log-likelihood function at the MLEs of each parameter (16.55).  To see how this works, plug in the MLEs into normNLL:
normNLL(c(1.545,.00759,2.316))
[1] 16.55038

If optim worked correctly, this is the smallest possible value of normNLL given our data and model form.  Let's check to make sure that the analytical solution using least squares is basically the same:
summary(lm(cover~elev))
Call:
lm(formula = cover ~ elev)
Residuals:
    Min      1Q  Median      3Q     Max
-2.3045 -1.3412  0.1534  1.6196  1.6955

Coefficients:
            Estimate Std. Error t value Pr(>|t|) 
(Intercept) 1.545275   0.972862   1.588   0.1562 
elev        0.007592   0.002427   3.129   0.0166 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.726 on 7 degrees of freedom
Multiple R-squared: 0.5831,     Adjusted R-squared: 0.5235
F-statistic: 9.789 on 1 and 7 DF,  p-value: 0.01664
The intercept and slope values are identical. 
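The variance estimates differ, though: lm reports a residual standard error whose square uses an n-2 denominator, while the MLE of the variance divides the residual sum of squares by n. A quick check using the fitted values above (an aside, not part of the original analysis):
sqrt(2.316*9/7)     #rescale the ML variance by n/(n-p); ~1.726, matching lm's residual standard error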

Let's compare this result to one using a different error distribution that recognizes cover cannot be negative, such as the Poisson distribution.  The likelihood function is similar except we change the error function to the Poisson pdf:
poisNLL = function(par) {
     y.pred = par[1] + par[2]*elev
     -sum(dpois(cover,lambda=y.pred,log=T)) }
We now have a model with one less parameter, because we assume the mean and variance are equal with Poisson error. As initial parameter values we'll use the MLEs from the normal error example: 
optim(par=c(1.54,.0076),poisNLL)
$par
[1] 1.626705708 0.007340982
$value
[1] 17.61976
Notice that our MLEs were similar but our negative log-likelihood went up, even though we used one fewer parameter.  Clearly this is a worse model for these data.

Optimization routines

Before we go much further with model fitting using ML we should examine the optimization routines in more detail.  Optimization, the process of numerically finding a function's minimum or maximum, is by far the most frustrating part of fitting complicated models to data using likelihoods, because there is no universally best optimization method and there are many ways a given routine can fail to converge or can produce MLEs that are nonsensical.  The most commonly used R function for optimization is optim, but optim is itself just a convenient way to access several different optimization routines.

Calls to optim require a function (fn=) to minimize (here, the negative log-likelihood function), starting values for the fitted parameters (par=), and the desired optimization method (method=).  optim includes five methods of various utility.  Let's generate a more complicated model for our treedata and compare several optim methods. 
dat = read.csv(file.choose())     #grab the treedata.csv file
dat2 = subset(dat,dat$species=="Magnolia fraseri") #Fraser's magnolia
ro.plots = unique(as.character(dat2$plotID))    #plots with mag fra

u.plots = unique(as.character(dat$plotID)) #all plots
nob.plots = u.plots[is.element(u.plots,ro.plots)==F] #plots without mag fra
dat3 = subset(dat,is.element(as.character(dat$plotID),nob.plots)) #dataset of no mag fra
dat4 = subset(dat3,duplicated(dat3$plotID)==F)  #one row per plot
dat4$cover = 0     #cover of Fraser magnolia is zero in these plots
mf = na.exclude(rbind(dat2,dat4))     #new dataframe of presences and absences

Fraser magnolia is a classic "rich cove" species that probably has a strong relationship with cove (near-stream) locations.  We'll write a likelihood function that includes a deterministic component which includes a nonlinear (exponential decay) relationship of magnolia abundance with distance-from-stream, with the rate dependent on elevation (thinking that mid-slope locations get wetter as we go higher).  For this exercise we'll also use the full complement of cover data, including zeros.  Our expectation is that these data are overdispersed, so the error component of the model will be a negative binomial function.  To show off the flexibility of the ML approach, we'll do this using a pair of nested functions, with the first being the error distribution and the second the deterministic function.
cover = mf$cover
elev = mf$elev
streamdist = mf$streamdist

nbNLL = function(par) {
     y.pred = mfNLL(c(par[1],par[2],par[3],par[4]))
     -sum(dnbinom(cover+1,mu=y.pred,size=par[5],log=T)) }

mfNLL = function(par) {
     b0 = par[1]
     b1 = par[2]  
     b2 = par[3]  
     b3 = par[4]
     r = elev*b3
     b0 + b1*elev + b2*exp(r*log(streamdist+1)) }
My objective here is to illustrate the power of likelihoods in fitting models to data, rather than to explore magnolia distribution in detail.  What we did is define an error function based on the two-parameter negative binomial distribution, where mu is the expected cover and size is the dispersion parameter that allows the variance to increase faster than the mean (thus relaxing the mean-equals-variance assumption of the otherwise equivalent Poisson model).  nbNLL thus returns the negative log-likelihood of the dataset fed through the deterministic function mfNLL.  mfNLL has no specification of error in it, but only contains a four-parameter model based on an overall intercept (b0), a slope for the relation of cover and elevation (b1), a constant for the stream distance relationship (b2), and finally a parameter that controls how elevation influences the exponential decay rate of cover with stream distance (b3).  We could have opted for a simpler regression of elevation and stream distance, each with a slope parameter and including an intercept, but I wanted to demonstrate the flexibility of giving optim any function that will produce a likelihood value.  This can involve calls to numerous other functions, many estimated parameters (constrained by optim's ability to converge on a consistent and sensible answer), or even calls to other scripts: anything that will generate a 'y.pred' value for nbNLL.

The first consideration in having optim produce MLEs is giving it reasonable starting values for each parameter.  b0 should be near the overall cover mean, so we can start it at 1.  b1 is the slope for elevation, and it must be small (given the scale of elevation relative to cover), but we're not sure whether it's greater or less than zero, so we start it at 0.  b2 is a bit of a mystery; we'll start it at 1 as a default.  b3 translates elevation into a decay rate and so must also be very small; we start it at zero as well.  The final parameter is the dispersion parameter for the negative binomial error; there are guidelines for choosing reasonable values of this, but for convenience we'll start with a value of 1.
start.par = c(1,0,1,0,1)
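Before launching the search it is worth confirming (a simple sanity check, not in the original workflow) that the likelihood function returns a finite value at these starting parameters; if it returns NA, NaN, or Inf, optim will fail immediately:
nbNLL(start.par)    #should be a single finite number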
Nelder-Mead is the default method for optim.  Because it involves a bit of jumping around in parameter space (rather than a derivative-based approach that is more likely to get caught in local minima), it can produce consistent results but can run slowly with large datasets or complicated parameterizations. Trying it with our above likelihood model and defined starting parameters:
out1 = optim(fn=nbNLL,par=start.par,method="Nelder-Mead")
out1

$par
[1]  4.800675e-01 -4.718550e-04  2.677907e+00 -3.190603e-05  5.435705e+00
$value
[1] 1832.206
$counts
function gradient
     502       NA
$convergence
[1] 1
$message
NULL

The $par component lists the MLEs for the 5 parameters, and $value is the minimized negative log-likelihood for our dataset. The $counts component is the number of times our likelihood function was queried by optim (we did not supply a gradient). The $convergence value is critical: a 1 here indicates that the optimization stopped because it hit the iteration limit (here 500) rather than because the convergence criterion was met (which would return 0). This should give us cause for concern.  Let's increase the iteration limit to see whether the fit improves:
out2 = optim(fn=nbNLL,par=start.par,method="Nelder-Mead",control=list(trace=TRUE,maxit=1000))
out2
$par
[1] -5.385844e+00 -9.194085e-05  8.467816e+00 -1.698507e-05  5.504508e+00
$value
[1] 1830.939

Still no convergence, and the fit is only slightly better. The trace=TRUE argument generates output for each iteration; scroll through it to see that in fact the likelihood dropped below 1840 relatively quickly. 
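One common tactic in this situation (sketched here; results will vary) is to restart the search from the previous solution, optionally tightening optim's relative convergence tolerance (reltol) in the control list:
out2b = optim(fn=nbNLL,par=out2$par,method="Nelder-Mead",control=list(maxit=2000,reltol=1e-10))
out2b$convergence   #0 would indicate the tolerance criterion was finally met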

The BFGS method is a version of estimating the location of the minimum using derivatives.  Bolker (2008) recommends BFGS when the likelihood surface is 'well-behaved' (presumably meaning simple) and starting values are reasonably clear; it is also a faster technique. 
out3 = optim(fn=nbNLL,par=start.par,method="BFGS")
out3
$par
[1]  1.000000e+00  1.002682e-20  1.000000e+00 -1.051292e-19  1.000000e+00
$value
[1] 2003.454
This is a substantially worse fit than Nelder-Mead and our parameters basically didn't move, so our functions appear too complex for this method.  We can try a version of this that operates with parameter constraints (upper and lower search limits for each parameter) called L-BFGS-B. To do so we need vectors of parameter limits, which I provide using educated guesses:
lower = c(.5,-0.01,0,-0.01,.5)
upper = c(10,0.01,5,1,5)

out4 = optim(fn=nbNLL,par=start.par,lower=lower,upper=upper,method="L-BFGS-B")
out4
$par
[1]  2.5919153042 -0.0000979387  1.7075221989 -0.0022838252  1.7075572583
$value
[1] 1909.586
$counts
function gradient
      41       41
$convergence
[1] 0
We did get convergence, and a better solution than unbounded BFGS, but still no better than Nelder-Mead. The parameter values at least look reasonable.  Bolker (2008, p. 248) suggests that L-BFGS-B is less robust than unconstrained methods; however it can be useful for complicated fits where you want to put tight constraints on parameters (e.g., due to biological realism; note Bolker suggests several alternatives to this when constraints are important to the fit).

One more common optimization method available with optim is simulated annealing.  Simulated annealing is one of the more robust (but, unsurprisingly, slow) techniques for finding the 'global' minimum when there may be many local minima, because it has a large stochastic component (random jumps in parameter space).  The version in optim is SANN:
out5 = optim(fn=nbNLL,par=start.par,method="SANN")
out5

$par
[1]  1.7294495733  0.0003218725  0.6010998011 -0.2951911926  2.2183194506
$value
[1] 1891.716
$counts
function gradient
   10000       NA
$convergence
[1] 0
It may take a few tries to get reasonable parameter values.  The above output was one of several I tried, and none produced a fit as good as Nelder-Mead. The parameter values here are also very different.  Most likely we have a situation where there is a plateau on the likelihood surface where a large range of parameter values produce similar likelihoods.  From here we may need to make our model simpler, or explore subsets of the data to get a better handle on reasonable parameter values.  The situation also highlights the fact that we need more than reasonable parameter values if we're to do much with this fitted model: in particular, we need a way to define parameter precision (e.g., confidence intervals).  Evaluating the support for particular model values is also the basis for likelihood hypothesis testing.

Likelihood profiles

Likelihood profiles are the meat and potatoes of likelihood model fitting, because they provide a snapshot of how the overall model likelihood depends on particular values of one or more parameters.  If negative log-likelihood values increase quickly as you move away from the MLE (higher or lower), then the MLE has high precision: it is unlikely that the real value is far from the estimated value.  Conversely, if the likelihood profile is flat, then a range of different parameter values could produce basically the same dataset.  In this case your point estimate of the MLE may be worthless.  Even better, likelihoods have a very useful statistical property: for two nested models fit to the same data, twice the difference in their negative log-likelihoods (equivalently, -2 times the log of their likelihood ratio, which is a difference in deviance) approximately follows a chi-squared distribution.  This means we can associate any increase in negative log-likelihood with a probability that the more constrained model explains the data as well as the less constrained one: i.e., a P value. This forms the basis of estimating parameter precision and comparing (nested) models.

In models that include a single parameter, likelihood profiles are easy to generate because you only need to calculate the likelihood at different values of that parameter.  We produced one above in our examination of the mean of a normal distribution (using likelihood, rather than negative log-likelihood).  For models with several parameters, however, a change in any one parameter value will (usually) shift the MLEs of the others.  Construction of an accurate likelihood profile requires that the fit be 're-optimized' to find the lowest negative log-likelihood for each alternate value of the focal parameter.  Of course, there are various libraries in R that automate this process, but we'll try a simple manual example.

For ease of use let’s return to the simple case of linear regression of 9 cover values and associated elevation values:
cover = c(0,4,5,7,8,2,2,4,4)
elev = c(100,110,320,500,830,120,330,100,500)
We’ll first examine the overall MLEs fit by optim using a normal likelihood function:
fn = function(par) {-sum(dnorm(cover,mean=par[1]+par[2]*elev,sd=sqrt(par[3]),log=T))}
out1 = optim(fn,par=c(1,0,3),method="Nelder-Mead")
out1
$par
[1] 1.54495516 0.00759282 2.31627982
$value
[1] 16.55038

The slope for the effect of elevation is ~0.0076. So what? Is there evidence that in fact elevation is a significant predictor? We can find out by examining the likelihoods of a range of different slope parameters, while also allowing the MLEs of the other two parameters to vary with the different slope values. We create a vector of possible slopes and then generate new optim results for each case of a new fixed slope parameter:

slopes = seq(-.005,.02,by=.0005)  #vector of possible slope values surrounding the MLE
lik.out = numeric(length(slopes)) #holding vector (currently zeros)
fn2 = function(par,m) {           #new simplified likelihood function (only 2 fitted pars)
     b = par[1]
     v = par[2]
     -sum(dnorm(cover,mean=b+m*elev,sd=sqrt(v),log=T))}
for(i in 1:length(slopes)) {      #calculates new likelihoods for each new slope model
     lik.out[i] = optim(fn2,par=c(1,3),m=slopes[i],method="Nelder-Mead")$value }
plot(slopes,lik.out,type="l")

This is our likelihood profile for the slope. The MLE is about 0.0076 and results in a negative log-likelihood of about 16.5. To determine the 95% confidence interval for the slope we use the fact that an increase in negative log-likelihood of more than half the chi-squared critical value (the 0.95 quantile with 1 df) puts a slope value beyond the P=0.05 boundary. The cutoff is therefore the current minimum negative log-likelihood plus:
qchisq(0.95,1)/2
[1] 1.920729
So we just need to find the slope values where the profile equals the current minimum negative log-likelihood (~16.5) plus 1.92, which should be around 18.5. To get more precise values we use the approx function, which does simple linear interpolation given a set of x and y values. We do this for both the lower tail and the upper tail:

lower.s = slopes[1:which.min(lik.out)]                #x values, left side of the graph
lower.k = lik.out[1:which.min(lik.out)]               #y values, left side of the graph
lowerCI = approx(lower.k,lower.s,xout=min(lik.out)+qchisq(0.95,1)/2)$y     #desired x value for y = 1.92+y(MLE)
upper.s = slopes[which.min(lik.out):length(lik.out)]  #x values, right side
upper.k = lik.out[which.min(lik.out):length(lik.out)] #y values, right side
upperCI = approx(upper.k,upper.s,xout=min(lik.out)+qchisq(0.95,1)/2)$y     #desired x value for y = 1.92+y(MLE)
abline(v=c(lowerCI,upperCI),lty=2,col="red")
arrows(lowerCI,18.5,upperCI,18.5,code=3,col="blue"); text(.0075,19,"95% CI",cex=2)

I’ve plotted the estimated confidence limits. They do not contain zero, so elevation is indeed significant.  The fact that this procedure can be used to test how different a parameter MLE is from zero forms the basis of nested model evaluation using likelihood: we've just decided that a model including elevation is a better model than one without it.  This is the likelihood ratio test.
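The manual profiling above can also be automated.  As one example (a sketch using the stats4 package that ships with R; expect warnings if the profile strays into negative variance values), mle will fit the same model and confint will return profile-likelihood confidence intervals, which should be close to the interval we just drew:
library(stats4)
nll.mle = function(b0,b1,v) {-sum(dnorm(cover,mean=b0+b1*elev,sd=sqrt(v),log=TRUE))}
fit = mle(nll.mle, start=list(b0=1,b1=0,v=3), method="Nelder-Mead")
confint(fit)        #profile-likelihood 95% CIs for the intercept, slope, and variance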

Model comparison

Likelihood offers two general means of model comparison, corresponding to the frequentist likelihood ratio test (LRT), generating a P value for the comparison of nested models, and the non-frequentist (non-P-value) use of information criteria (like AIC).  Information criteria-based methods like AIC are more flexible because they can be used to compare any models fit via likelihoods (including those not strictly fit via likelihood but from which a likelihood is easily calculated, such as least-squares).  The 'rules' of IC-based methods are not strictly based on probability; rather, rules-of-thumb are established via simulation.  For this reason, many researchers prefer LRTs when they are available, although IC-based methods, and particularly AIC, have widespread acceptance in the ecological literature.

LRTs apply only to the comparison of nested models, involving the change in model likelihood when one parameter is set to zero (or two terms are equivalent).  Returning to the above example of elevation and cover, we can ask whether there is sufficient evidence from a comparison of their likelihoods that the deterministic component of the model should contain a squared term for elevation:
fn.sq = function(par) {-sum(dnorm(cover,mean=par[1]+par[2]*elev+par[4]*(elev^2),sd=sqrt(par[3]),log=T))}
out.sq = optim(fn.sq,par=c(1,0,3,0),method="Nelder-Mead")
out.sq

$par
[1] 1.840340e+00 5.376081e-03 2.294774e+00 2.622342e-06
$value
[1] 16.50852

The optimized negative log-likelihood is 16.50852, which is barely lower than that of the model without the quadratic term (16.55).  Is it low enough to be a significantly better fit?  We don't have to create the likelihood profile, because (as frequentists) we're only interested in whether the quadratic coefficient is zero (the null hypothesis) or greater than zero (the positive MLE according to our out.sq result).  In fact we already know our critical value is a negative log-likelihood difference of
qchisq(0.95,1)/2
[1] 1.920729

because the one-parameter difference between the models gives 1 df for the chi-squared test.  So the negative log-likelihood of the quadratic model would have had to drop below
out1$value - qchisq(0.95,1)/2
[1] 14.62965
So we weren't even close.  A model including only an intercept, however, has a negative log-likelihood of:
fn.int = function(par) {-sum(dnorm(cover,mean=par[1],sd=sqrt(par[2]),log=T))}
out.int = optim(fn.int,par=c(1,3),method="Nelder-Mead")
out.int$value
[1] 20.48704
The corresponding likelihood ratio statistic (twice the difference in negative log-likelihoods, compared to a chi-squared distribution with 1 df) has an associated P-value of:
1-pchisq(2*(out.int$value-out1$value),df=1)
[1] 0.005016926
We conclude that a model including the main effect of elevation is much more likely to be the better model than one with only an intercept.  Interestingly, this is not equivalent to the P-value obtained with a linear model based on least squares:
summary(lm(cover~elev))
Call:
lm(formula = cover ~ elev)
Residuals:
    Min      1Q  Median      3Q     Max
-2.3045 -1.3412  0.1534  1.6196  1.6955

Coefficients:
            Estimate Std. Error t value Pr(>|t|) 
(Intercept) 1.545275   0.972862   1.588   0.1562 
elev        0.007592   0.002427   3.129   0.0166 *
The fitted coefficients are identical to what we found with likelihood, but the test is based on a Student's t test of whether the elevation coefficient overlaps with zero, rather than on the competing likelihoods of two models.  Conveniently, R has a built-in function logLik for returning log-likelihoods of a fitted model object:
logLik(lm(cover~elev))
'log Lik.' -16.55038 (df=3)
Because the estimated coefficients are identical to what we produced manually, the value is equal to that of our out1 object (note you need to return the negative of this for the negative log-likelihood). Using this technique we could do the same model comparison via likelihood but skipping the optimization:
1-pchisq(abs(2*(-logLik(lm(cover~1))[1]+logLik(lm(cover~elev))[1])),df=1)
[1] 0.005016926
The tricky bit about the above line is that we want the first element of the logLik output ([1]); I also made sure that the difference returned a positive value (abs).  You should be able to do this for all GLMs as well.
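For fitted GLMs there is also a built-in shortcut: anova with test="Chisq" performs the analogous deviance-based likelihood ratio test between nested models (sketched here with a Poisson fit of the same small dataset):
g0 = glm(cover~1,family=poisson)
g1 = glm(cover~elev,family=poisson)
anova(g0,g1,test="Chisq")    #deviance (likelihood ratio) test for the elevation term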

AIC (Akaike Information Criterion) and other IC-based model goodness-of-fit statistics are similarly easy to calculate once model likelihoods are in hand, because they are based only on the log-likelihood and the number of parameters in the model (AIC = -2L + 2k, where L is the log-likelihood and k is the number of fitted parameters).  The comparisons needn't be restricted to nested models, but the datasets fit by the models must be identical.  What you don't get with IC-based methods is a P value (many consider this to be a good thing). Let's compare our simple models of elevation and cover using AIC rather than LRT:
2*out1$value+2*length(out1$par)
[1] 39.10075
2*out.int$value+2*length(out.int$par)
[1] 44.97408

Lower AIC values indicate better fits, and the rule-of-thumb established for ICs is that a difference of 2 units or more indicates models that are clearly different, so again we conclude that there is substantial support for a model that includes elevation.  Here too there is a built-in function to save you the few seconds of trouble of writing the above formulas:
AIC(lm(cover~elev),lm(cover~1))
 
                df      AIC
lm(cover ~ elev)  3 39.10075
lm(cover ~ 1)     2 44.97408

The AIC function should work for any fitted model object that generates an associated logLik value (lm, glm, gam, etc).  Note according to Burnham and Anderson (2002) we probably should have instead used AICc, which is the AIC value corrected for small sample sizes (generally when #values/#params < 40):
2*out1$value+2*length(out1$par) +  2*(length(out1$par)*(length(out1$par)+1)/(length(cover)-length(out1$par)-1))
[1] 43.90075
2*out.int$value+2*length(out.int$par) +  2*(length(out.int$par)*(length(out.int$par)+1)/(length(cover)-length(out.int$par)-1))
[1] 46.97408
The conclusion is the same but the disparity isn't quite as great with small sample sizes. Yet another IC variant is the so-called 'Bayesian' IC (or BIC; not actually a Bayesian method) with a parameter penalization of (log n)*k rather than 2k, which may be more appropriate for comparing models for large datasets (compared to AIC, BIC yields a more conservative acceptance of more highly parameterized models):
2*out1$value+length(out1$par)*log(length(cover))
[1] 39.69243
2*out.int$value+length(out.int$par)*log(length(cover))
[1] 45.36853

Given the growing popularity of model comparison via information-theoretic approaches like AIC, it is important to understand how AICs are used to compare the relative goodness-of-fit of competing models.  First, the actual AIC value is not meaningful on its own (as opposed to something like R2); rather, it is the differences in AIC between models that are of interest.  AIC differences in a table of competing models allow a quick and convenient means of model comparison (a difference of less than 2 suggests the models are of essentially equal rank; 4-7 suggests 'considerably less' support, and >10 suggests 'essentially none' [Burnham and Anderson 2002, p. 70]).  Burnham and Anderson have additionally popularized a method (established by Akaike) that uses AIC values to calculate the relative weight of evidence supporting one model over others, given a list of reasonable alternative models fit to a given dataset.  These are called Akaike weights (wi), and they are unique to a particular set of models (that is, if a model is added or removed, the wi must be recalculated).  Akaike weights can also be used with other ICs (e.g., BIC).  Model comparison tables based on ICs are now common in the literature, so it is worthwhile to be able to produce your own using the basic likelihood approaches above.  Here we can write a simple function that, given a set of candidate models, creates the standard information-theoretic output:
ICtab = function(model.list) {
     if(is.list(model.list[[1]])==F) stop("Please enter models as a list (list(mod1,mod2,...))")
     n = length(model.list)               #number of candidate models
     s = length(model.list[[1]]$fitted)   #no. observations
     kv = numeric(n)                      #empty no. param vector
     modLL = numeric(n)                   #empty neg log lik vector
     mod.call = character(n)              #model form (for output)
     R2 = numeric(n)                      #empty model R2 vector
     fam = character(n)                   #empty family vector
     for(i in 1:n) {                      #loop over each model
          mod = model.list[[i]]           #select model
          fam[i] = mod$family[[1]]        #error family
          mod.call[i] = as.character(mod$formula[3])   #return model formula (right-hand side)
          modLL[i] = -logLik(mod)[1]      #extract neg log likelihood
          kv[i] = attr(logLik(mod),"df")  #extract no. fitted params
          res = mod$y-mod$fitted          #residuals scaled to original data
          R2[i] = 1-(sum((res-mean(res))^2) / sum((mod$y-mean(mod$y))^2) )
     }
     AICv = 2*modLL + 2*kv + 2*kv*(kv+1)/(s-kv-1)  #AICc for all models
     BICv = 2*modLL + kv*log(s)                    #BIC for all models
     AICdiff = AICv - min(AICv)                    #AIC differences
     BICdiff = BICv - min(BICv)                    #BIC differences
     wi = exp(-.5*AICdiff) / sum(exp(-.5*AICdiff)) #Akaike weights
     out = data.frame(mod.call,fam,kv,modLL,round(AICv,2),round(AICdiff,2),round(wi,3),round(BICv,2),round(BICdiff,2),round(R2,3))
     names(out) = c("model","family","k","-LL","AICc","AICdiff","wi","BIC","BICdiff","R2")
     out = out[order(AICdiff),]           #sort ascending by AICdiff column
     return(out)
}

To illustrate what the above function does, let's construct a few models with our tree dataset:
dat = read.csv(file.choose())     #grab the treedata.csv file
dat2 = subset(dat,dat$species=="Acer rubrum") #red maple

We'll construct a set of candidate models that involve reasonable a priori hypotheses about red maple abundance (cover) with respect to elevation, water potential, stream distance, disturbance history, and radiation (beers index).  We're also interested in whether the data model should be based on counts (poisson) or normal:
glm1 = glm(cover~elev+I(elev^2)+disturb+disturb*elev+disturb*I(elev^2),data=dat2,family=gaussian)
glm2 = glm(cover~elev+I(elev^2)+disturb+disturb*elev+disturb*I(elev^2),data=dat2,family=poisson)
glm3 = glm(cover~elev+I(elev^2)+disturb,data=dat2,family=gaussian)
glm4 = glm(cover~elev+I(elev^2)+disturb,data=dat2,family=poisson) 
glm5 = glm(cover~elev+I(elev^2)+disturb+tci,data=dat2,family=gaussian) 
glm6 = glm(cover~elev+I(elev^2)+disturb+streamdist,data=dat2,family=gaussian)
glm7 = glm(cover~elev+I(elev^2)+disturb+I(log(streamdist+1)),data=dat2,family=gaussian)
glm8 = glm(cover~elev+I(elev^2)+disturb+ disturb*elev+disturb*I(elev^2)+I(log(streamdist+1)),data=dat2,family=gaussian)

This is obviously not a complete list of candidate models; ideally we have some hypotheses to test involving biological mechanisms of red maple distribution or are interested in only a small subset of all possible regression models.  I say "ideally" because this approach does rely heavily on the set of models you choose to evaluate.  To illustrate this, let's first run our information-theoretic function on models 1 through 7:
ICtab(list(glm1,glm2,glm3,glm4,glm5,glm6,glm7))

                                                              model   family  k      -LL    AICc AICdiff    wi     BIC BICdiff    R2
7               elev + I(elev^2) + disturb + I(log(streamdist + 1)) gaussian  8 1676.202 3352.59    0.00 0.825 3405.83    0.00 0.042
6                           elev + I(elev^2) + disturb + streamdist gaussian  8 1678.407 3357.00    4.41 0.091 3410.24    4.41 0.037
1 elev + I(elev^2) + disturb + disturb * elev + disturb * I(elev^2) gaussian 13 1678.354 3357.17    4.59 0.083 3443.53   37.70 0.041
5                                  elev + I(elev^2) + disturb + tci gaussian  8 1683.407 3367.00   14.41 0.001 3420.24   14.41 0.029
3                                        elev + I(elev^2) + disturb gaussian  7 1685.925 3371.99   19.41 0.000 3418.60   12.77 0.022
2 elev + I(elev^2) + disturb + disturb * elev + disturb * I(elev^2)  poisson 12 1703.530 3407.46   54.87 0.000 3487.20   81.37 0.039
4                                        elev + I(elev^2) + disturb  poisson  6 1709.601 3419.31   66.72 0.000 3459.27   53.44 0.023
Note our function requires its argument (the model names) to be wrapped in a list.  The function returns all the basic IC-based (and R2) criteria, including the number of parameters (k), the negative log-likelihood, the sample size-corrected AICc (with large samples this converges to AIC, so we might as well always use AICc), the AIC differences in rank order (all AICs compared to the best [lowest] model), Akaike weights, and raw and relative BICs.  Note the much harsher penalties for k imposed by BIC than AIC (e.g., glm1), and that the R2s hardly vary and are thus very poor tools for model comparison.  Clearly model 7 is the best model; its AICc advantage suggests "considerably more" support than models 1 and 6, and the Akaike weights indicate that there is an order of magnitude more support for it compared to the next two.  Still, there remains some minor support for those two models (the others have no support at all).  Now compare the result when we add model 8 to the mix:
ICtab(list(glm1,glm2,glm3,glm4,glm5,glm6,glm7,glm8))

                                                                       model   family  k      -LL    AICc AICdiff wi     BIC BICdiff    R2
8 elev+I(elev^2)+disturb+disturb*elev+disturb*I(elev^2)+I(log(streamdist+1)) gaussian 14 1667.856 3336.25    0.00  1 3429.21   23.38 0.062
7                        elev + I(elev^2) + disturb + I(log(streamdist + 1)) gaussian  8 1676.202 3352.59   16.34  0 3405.83    0.00 0.042
6                                    elev + I(elev^2) + disturb + streamdist gaussian  8 1678.407 3357.00   20.75  0 3410.24    4.41 0.037
1          elev + I(elev^2) + disturb + disturb * elev + disturb * I(elev^2) gaussian 13 1678.354 3357.17   20.92  0 3443.53   37.70 0.041
5                                           elev + I(elev^2) + disturb + tci gaussian  8 1683.407 3367.00   30.75  0 3420.24   14.41 0.029
3                                                 elev + I(elev^2) + disturb gaussian  7 1685.925 3371.99   35.74  0 3418.60   12.77 0.022
2          elev + I(elev^2) + disturb + disturb * elev + disturb * I(elev^2)  poisson 12 1703.530 3407.46   71.21  0 3487.20   81.37 0.039
4                                                 elev + I(elev^2) + disturb  poisson  6 1709.601 3419.31   83.06  0 3459.27   53.44 0.023

Relative to glm8, there is no support whatsoever (wi=0) for any remaining model if you adhere to AIC.  Interestingly, because we've nearly doubled the number of parameters (due to the interaction terms for each level of disturbance class), BIC disagrees and instead favors the less heavily parameterized models.  Here you'd probably go with AIC, as BIC is intended primarily for cases of extremely large sample sizes relative to the number of parameters (here, n/k for glm8 is 795/14=57).

Note that the Akaike weight cannot be interpreted as the 'probability that a model is true'; rather, wi shows the evidence in favor of each model given that we initially (a priori) thought our models were equally likely.  So, although the information-theoretic approach is a great way to compare models if you're in the situation of comparing several models that are all equally compelling on a priori grounds (perhaps for biological reasons, or from past literature), we're still left with a situation where we're evaluating the probability of the data rather than of the models.  Bayesians would argue that, on philosophical grounds, it makes more scientific sense to obtain a result that establishes the probability of a particular hypothesis (model) being true, rather than making statements based on the probability of observing the data given particular hypotheses.  To make this transition we need to consider Bayes' theorem.