March 8: More likelihood

Likelihood profiles

Likelihood profiles are the meat and potatoes of likelihood model fitting, because they provide a snapshot of how the overall model likelihood depends on particular values of one or more parameters.  If negative log-likelihood values increase quickly as you move away from the MLE (higher or lower), then the MLE has high precision: it is unlikely that the real value is far from the estimated value.  Conversely, if the likelihood profile is flat, then a range of different parameter values could produce basically the same dataset, and your point estimate of the MLE may be nearly worthless.  Even better, likelihoods have a very useful statistical property: twice the difference in negative log-likelihood between two models fit to the same data (i.e., -2 times the log of their likelihood ratio, the change in deviance) is asymptotically chi-squared distributed.  This means we can associate any increase in negative log-likelihood with the probability that the poorer-fitting model is really no worse than the better-fitting one: i.e., a P value.  This forms the basis of estimating parameter precision and comparing (nested) models.

In models that include a single parameter, likelihood profiles are easy to generate because you only need to calculate the likelihood at different values of that parameter.  We produced this above in our examination of the mean of a normal distribution (using likelihood, rather than negative log-likelihood).  For models with more than one parameter, however, a change in any one parameter value will usually influence the MLEs of the others.  Construction of an accurate likelihood profile therefore requires that the fit be 're-optimized' to find the lowest negative log-likelihood for each alternative value of the focal parameter.  Of course, there are various libraries in R that automate this process, but we'll try a simple manual example.

For ease of use let’s return to the simple case of linear regression of 9 cover values and associated elevation values.

cover = c(0,4,5,7,8,2,2,4,4)

elev = c(100,110,320,500,830,120,330,100,500)

We’ll first examine the overall MLEs fit by optim using a normal likelihood function:

fn = function(par) {-sum(dnorm(cover,mean=par[1]+par[2]*elev,sd=sqrt(par[3]),log=T))}

out1 = optim(fn,par=c(1,0,3),method="Nelder-Mead")

out1

$par

[1] 1.54495516 0.00759282 2.31627982

$value

[1] 16.55038

The slope for the effect of elevation is ~0.0076. So what? Is there evidence that in fact elevation is a significant predictor? We can find out by examining the likelihoods of a range of different slope parameters, while also allowing the MLEs of the other two parameters to vary with the different slope values. We create a vector of possible slopes and then generate new optim results for each case of a new fixed slope parameter:

slopes = seq(-.005,.02,by=.0005)  #vector of possible slope values surrounding the MLE

lik.out = numeric(length(slopes)) #holding vector (currently zeros)

fn2 = function(par,m) {           #new simplified likelihood function (only 2 fitted pars)

     b = par[1]

     v = par[2]

     -sum(dnorm(cover,mean=b+m*elev,sd=sqrt(v),log=T))}

for(i in 1:length(slopes)) {      #calculates new likelihoods for each new slope model

     lik.out[i] = optim(fn2,par=c(1,3),m=slopes[i],method="Nelder-Mead")$value }

plot(slopes,lik.out,type="l")    

This is our likelihood profile for the slope. The MLE is about 0.0076 and results in a negative log-likelihood of about 16.5. To determine the 95% confidence interval for the slope we use the fact that an increase in negative log-likelihood of more than half the chi-squared critical value for alpha=0.95 (1 df) takes us beyond the P=0.05 boundary. The profile value at this boundary is therefore the current minimum negative log-likelihood plus:

qchisq(0.95,1)/2

[1] 1.920729

So we just need to find the slope values at which the profile equals the current minimum negative log-likelihood (~16.5) plus 1.92, which should be around 18.5. To get more precise values we use the approx function, which does simple interpolation given a set of x and y values. We do this for both the lower tail and the upper tail:

lower.s = slopes[1:which.min(lik.out)] #x values, left side of the graph

lower.k = lik.out[1:which.min(lik.out)]    #y values, left side of the graph

lowerCI = approx(lower.k,lower.s,xout=min(lik.out)+qchisq(0.95,1)/2)$y     #desired x value for y = 1.92+y(MLE)

upper.s = slopes[which.min(lik.out):length(lik.out)] #x values, right side

upper.k = lik.out[which.min(lik.out):length(lik.out)] #y values, right side

upperCI = approx(upper.k,upper.s,xout=min(lik.out)+qchisq(0.95,1)/2)$y #desired x value for y = 1.92+y(MLE)

abline(v=c(lowerCI,upperCI),lty=2,col="red")

arrows(lowerCI,18.5,upperCI,18.5,code=3,col="blue"); text(.0075,19,"95% CI",cex=2)

I’ve plotted the estimated confidence limits. They do not contain zero, so elevation is indeed significant.  The fact that this procedure can be used to test how different a parameter MLE is from zero forms the basis of nested model evaluation using likelihood: we've just decided that a model including elevation is a better model than one without it.  This is the likelihood ratio test.
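As a quick preview of that test, here is a minimal sketch using the fn2 function and out1 fit already defined: refit the model with the slope fixed at zero, and compare twice the difference in negative log-likelihoods to a chi-squared distribution with 1 df. This should reproduce the P value we derive more formally in the model comparison section below (about 0.005).

nll.b0 = optim(fn2,par=c(1,3),m=0,method="Nelder-Mead")$value  #refit with the slope fixed at zero

1-pchisq(2*(nll.b0-out1$value),df=1)                           #LRT P value for the slope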

Exercise

Create a likelihood profile for the quadratic coefficient of elevation for the exercise of magnolia presence-absence from last week.  Does the confidence interval suggest the quadratic term is significant? (Note this can be tricky because of the tendency for parameter values to produce predicted values of the binomial probability outside [0,1]. I used a starting parameter set of 0 for the intercept and 0.7 for the linear slope. Use a small starting range for the profiled parameter, near its MLE.)

Model comparison

Likelihood offers two general means of model comparison: the frequentist likelihood ratio test (LRT), which generates a P value for the comparison of nested models, and the non-frequentist (no P value) use of information criteria (ICs, like AIC).  IC-based methods are more flexible because they can be used to compare any models fit via likelihood (including those not strictly fit via likelihood but from which a likelihood is easily calculated, such as least-squares fits).  The 'rules' of IC-based methods are not strictly based on probability; rather, rules-of-thumb have been established via simulation.  For this reason, many researchers prefer LRTs when they are available, although IC-based methods, and particularly AIC, have widespread acceptance in the ecological literature.

LRTs apply only to the comparison of nested models, and involve the change in model likelihood when one parameter is set to zero (or two terms are constrained to be equal).  Returning to the above example of elevation and cover, we can ask whether there is sufficient evidence, from a comparison of their likelihoods, that the deterministic component of the model should contain a squared term for elevation:
fn.sq = function(par) {-sum(dnorm(cover,mean=par[1]+par[2]*elev+par[4]*(elev^2),sd=sqrt(par[3]),log=T))}
out.sq = optim(fn.sq,par=c(1,0,3,0),method="Nelder-Mead")
out.sq

$par

[1] 1.840340e+00 5.376081e-03 2.294774e+00 2.622342e-06
$value

[1] 16.50852

The optimized negative log-likelihood is 16.50852, which is barely lower than that of the above model without the quadratic term (16.55).  Is it low enough to be a significantly better fit?  We don't have to create a likelihood profile, because (as frequentists) we're only interested in whether the quadratic coefficient is zero (the null hypothesis) or greater than zero (the positive MLE according to our out.sq result).  In fact we already know our critical value is a negative log-likelihood difference of
qchisq(0.95,1)/2

[1] 1.920729

because the one-parameter difference between the models gives us 1 df for the chi-squared test.  So the negative log-likelihood of the quadratic model would have had to drop below
out1$value - qchisq(0.95,1)/2
[1] 14.62965
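We can also express this comparison directly as a P value (a quick sketch using the chi-squared distribution of twice the difference in negative log-likelihoods); it should come out far above 0.05 (roughly 0.77):
1-pchisq(2*(out1$value-out.sq$value),df=1)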
So we weren't even close.  A model including only an intercept, however, has a negative log-likelihood of:
fn.int = function(par) {-sum(dnorm(cover,mean=par[1],sd=sqrt(par[2]),log=T))}
out.int = optim(fn.int,par=c(1,3),method="Nelder-Mead")
out.int$value
[1] 20.48704
The resulting likelihood ratio test (twice the difference in negative log-likelihoods, compared to a chi-squared distribution with 1 df) has an associated P-value of:
1-pchisq(2*(out.int$value-out1$value),df=1)
[1] 0.005016926
We conclude that a model including the main effect of elevation is much more likely to be the better model than one with only an intercept.  Interestingly, this is not equivalent to the P-value obtained with a linear model based on least squares:
summary(lm(cover~elev))
Call:
lm(formula = cover ~ elev)
Residuals:
    Min      1Q  Median      3Q     Max
-2.3045 -1.3412  0.1534  1.6196  1.6955

Coefficients:
            Estimate Std. Error t value Pr(>|t|) 
(Intercept) 1.545275   0.972862   1.588   0.1562 
elev        0.007592   0.002427   3.129   0.0166 *
The fitted coefficients are identical to what we found with likelihood, but the test is based on a Student's t test of whether the elevation coefficient differs from zero, rather than on the competing likelihoods of two models.  Conveniently, R has a built-in function, logLik, for returning the log-likelihood of a fitted model object:
logLik(lm(cover~elev))
'log Lik.' -16.55038 (df=3)
Because the estimated coefficients are identical to what we produced manually, the value equals that of our out1 object (note you need to take the negative of this to get the negative log-likelihood). Using this technique we can do the same model comparison via likelihood while skipping the manual optimization:
1-pchisq(abs(2*(-logLik(lm(cover~1))[1]+logLik(lm(cover~elev))[1])),df=1)
[1] 0.005016926
The tricky bit about the above line is that we want only the first element of the logLik output ([1]); I also made sure that the difference returned a positive value (abs).  You should be able to do this for all GLMs as well.
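For example, here is a minimal sketch with simulated presence-absence data (the data, predictor x, and coefficients below are made up purely for illustration):
set.seed(1)                                  #hypothetical data for illustration only
x = runif(50,0,1000)
pres = rbinom(50,1,plogis(-2+0.005*x))
glm.null = glm(pres~1,family=binomial)
glm.x = glm(pres~x,family=binomial)
1-pchisq(abs(2*(logLik(glm.x)[1]-logLik(glm.null)[1])),df=1)  #LRT P value for the effect of x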

AIC (Akaike Information Criterion) and other IC-based model goodness-of-fit statistics are similarly easy to calculate once model likelihoods are in hand, because they are based only on the log-likelihood and the number of parameters in the model (AIC = -2L + 2k, where L is the log-likelihood and k is the number of fitted parameters).  The comparisons needn't be restricted to nested models, but the datasets fit by the models must be identical.  What you don't get with IC-based methods is a P value (many consider this to be a good thing). Let's compare our simple models of elevation and cover using AIC rather than LRT:
2*out1$value+2*length(out1$par)
[1] 39.10075
2*out.int$value+2*length(out.int$par)
[1] 44.97408

Lower AIC values indicate better fits, and the rule-of-thumb established for ICs is that a difference of 2 units or more indicates models that are clearly different, so again we conclude that there is substantial support for a model that includes elevation.  Here too there is a built-in function to save you the few seconds of trouble of writing the above formulas:
AIC(lm(cover~elev),lm(cover~1))
 
                df      AIC
lm(cover ~ elev)  3 39.10075
lm(cover ~ 1)     2 44.97408

The AIC function should work for any fitted model object that generates an associated logLik value (lm, glm, gam, etc).  Note according to Burnham and Anderson (2002) we probably should have instead used AICc, which is the AIC value corrected for small sample sizes (generally when #values/#params < 40):
2*out1$value+2*length(out1$par) +  2*(length(out1$par)*(length(out1$par)+1)/(length(cover)-length(out1$par)-1))
[1] 43.90075
2*out.int$value+2*length(out.int$par) +  2*(length(out.int$par)*(length(out.int$par)+1)/(length(cover)-length(out.int$par)-1))
[1] 46.97408
The conclusion is the same but the disparity isn't quite as great with small sample sizes. Yet another IC variant is the so-called 'Bayesian' IC (or BIC; not actually a Bayesian method) with a parameter penalization of (log n)*k rather than 2k, which may be more appropriate for comparing models for large datasets (compared to AIC, BIC yields a more conservative acceptance of more highly parameterized models):
2*out1$value+length(out1$par)*log(length(cover))
[1] 39.69243
2*out.int$value+length(out.int$par)*log(length(cover))
[1] 45.36853

Given the growing popularity of model comparisons via information-theoretic approaches like AIC, it is important to understand how AICs are used to compare the relative goodness-of-fit of competing models.  First, the actual AIC value is not meaningful on its own (as opposed to something like R2); rather, it is the differences in AIC between models that are of interest.  AIC differences in a table of competing models allow a quick and convenient means of model comparison (differences <2 suggest the models are of essentially equal rank; differences of 4-7 suggest 'considerably less' support for the poorer model; and >10 suggests 'essentially none' [Burnham and Anderson 2002, p. 70]).  Burnham and Anderson have additionally popularized a method (established by Akaike) that uses AIC values to calculate the relative weight of evidence supporting one model over others, given a list of reasonable alternative models fit to a given dataset.  These are called Akaike weights (wi), and are unique to a particular set of models (that is, if a model is added or removed, wi must be recalculated).  Akaike weights can also be used with other ICs (e.g., BIC).  Model comparison tables based on ICs are now common in the literature, so it is worthwhile to be able to produce your own using the basic likelihood approaches above.  Here we can write a simple function that, given a set of candidate models, creates the standard information-theoretic output:
ICtab = function(model.list) {

     if(!is.list(model.list[[1]])) stop("Please enter models as a list (list(mod1,mod2,...))")

     n = length(model.list)  #number of candidate models

     s = length(model.list[[1]]$fitted.values) #no. observations (assumed identical across models)

     kv = numeric(n)         #empty no. param vector

     modLL = numeric(n)      #empty neg log lik vector

     mod.call = character(n) #model form (for output)

     R2 = numeric(n)         #empty model R2 vector

     fam = character(n)      #empty family vector

     for(i in 1:n) {         #loop over each model

          mod = model.list[[i]]  #select model

          fam[i] = mod$family[[1]] #error family

          mod.call[i] = as.character(mod$formula[3])    #return model formula (right-hand side)

          modLL[i] = -logLik(mod)[1] #extract neg log likelihood

          kv[i] = attr(logLik(mod),"df") #extract no. fitted params

          res = mod$y-mod$fitted.values #residuals scaled to original data

          R2[i] = 1-(sum((res-mean(res))^2) / sum((mod$y-mean(mod$y))^2) )

     }

     AICv = 2*modLL + 2*kv + 2*kv*(kv+1)/(s-kv-1) #AICc (AIC plus small-sample correction) for all models

     BICv = 2*modLL + kv*log(s)          #BIC for all models

     AICdiff = AICv - min(AICv)   #AIC differences

     BICdiff = BICv - min(BICv)   #BIC differences

     wi = exp(-.5*AICdiff) / sum(exp(-.5*AICdiff)) #Akaike weights

     out = data.frame(mod.call,fam,kv,modLL,round(AICv,2),round(AICdiff,2),round(wi,3),round(BICv,2),round(BICdiff,2),round(R2,3))

     names(out) = c("model","family","k","-LL","AICc","AICdiff","wi","BIC","BICdiff","R2")

     out = out[order(AICdiff),]  #sort ascending by AICdiff column

     return(out)

}
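Before applying it, here is a quick by-hand check of the Akaike weight arithmetic used inside ICtab, using the two AIC values we computed above for the cover models (about 39.10 and 44.97); nearly all of the weight (roughly 0.95) should go to the elevation model:
aics = c(39.10075,44.97408)           #AICs of the elevation and intercept-only models
d = aics-min(aics)                    #AIC differences
exp(-0.5*d)/sum(exp(-0.5*d))          #Akaike weights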

To illustrate what the above function does, let's construct a few models with our tree dataset:
dat = read.csv(file.choose())     #grab the treedata.csv file
dat2 = subset(dat,dat$species=="Acer rubrum") #red maple

We'll construct a set of candidate models that involve reasonable a priori hypotheses about red maple abundance (cover) with respect to elevation, water potential, stream distance, disturbance history, and radiation (beers index).  We're also interested in whether the data model should be based on counts (poisson) or normal:
glm1 = glm(cover~elev+I(elev^2)+disturb+disturb*elev+disturb*I(elev^2),data=dat2,family=gaussian)
glm2 = glm(cover~elev+I(elev^2)+disturb+disturb*elev+disturb*I(elev^2),data=dat2,family=poisson)
glm3 = glm(cover~elev+I(elev^2)+disturb,data=dat2,family=gaussian)
glm4 = glm(cover~elev+I(elev^2)+disturb,data=dat2,family=poisson) 
glm5 = glm(cover~elev+I(elev^2)+disturb+tci,data=dat2,family=gaussian) 
glm6 = glm(cover~elev+I(elev^2)+disturb+streamdist,data=dat2,family=gaussian)
glm7 = glm(cover~elev+I(elev^2)+disturb+I(log(streamdist+1)),data=dat2,family=gaussian)
glm8 = glm(cover~elev+I(elev^2)+disturb+ disturb*elev+disturb*I(elev^2)+I(log(streamdist+1)),data=dat2,family=gaussian)

This is obviously not a complete list of candidate models; ideally we have some hypotheses to test involving biological mechanisms of red maple distribution or are interested in only a small subset of all possible regression models.  I say "ideally" because this approach does rely heavily on the set of models you choose to evaluate.  To illustrate this, let's first run our information-theoretic function on models 1 through 7:
ICtab(list(glm1,glm2,glm3,glm4,glm5,glm6,glm7))

                                                              model   family  k      -LL    AICc AICdiff    wi     BIC BICdiff    R2

7               elev + I(elev^2) + disturb + I(log(streamdist + 1)) gaussian  8 1676.202 3352.59    0.00 0.825 3405.83    0.00 0.042

6                           elev + I(elev^2) + disturb + streamdist gaussian  8 1678.407 3357.00    4.41 0.091 3410.24    4.41 0.037

1 elev + I(elev^2) + disturb + disturb * elev + disturb * I(elev^2) gaussian 13 1678.354 3357.17    4.59 0.083 3443.53   37.70 0.041

5                                  elev + I(elev^2) + disturb + tci gaussian  8 1683.407 3367.00   14.41 0.001 3420.24   14.41 0.029

3                                        elev + I(elev^2) + disturb gaussian  7 1685.925 3371.99   19.41 0.000 3418.60   12.77 0.022

2 elev + I(elev^2) + disturb + disturb * elev + disturb * I(elev^2)  poisson 12 1703.530 3407.46   54.87 0.000 3487.20   81.37 0.039

4                                        elev + I(elev^2) + disturb  poisson  6 1709.601 3419.31   66.72 0.000 3459.27   53.44 0.023
Note our function requires its argument (the model names) to be wrapped in a list.  The function returns all the basic IC-based (and R2) criteria, including the number of parameters (k), the negative log-likelihood, the sample size-corrected AICc (with large samples this converges to AIC, so you might as well always use AICc), the rank order of AIC differences (all AICs compared to the best [lowest] model), Akaike weights, and raw and relative BICs.  Note the much harsher penalty for k imposed by BIC than by AIC (eg, glm1), and that the R2 values hardly vary and are thus a very poor tool for model comparison.  Clearly model 7 is the best model; its AICc advantage suggests 'considerably more' support than for models 1 and 6, and the Akaike weights indicate that there is an order of magnitude more support for it compared to the next two.  Still, there remains some minor support for those two models (the others have no support at all).  Now compare the result when we add model 8 to the mix:
ICtab(list(glm1,glm2,glm3,glm4,glm5,glm6,glm7,glm8))

                                                                       model   family  k      -LL    AICc AICdiff wi     BIC BICdiff    R2

8 elev+I(elev^2)+disturb+disturb*elev+disturb*I(elev^2)+I(log(streamdist+1)) gaussian 14 1667.856 3336.25    0.00  1 3429.21   23.38 0.062

7                        elev + I(elev^2) + disturb + I(log(streamdist + 1)) gaussian  8 1676.202 3352.59   16.34  0 3405.83    0.00 0.042

6                                    elev + I(elev^2) + disturb + streamdist gaussian  8 1678.407 3357.00   20.75  0 3410.24    4.41 0.037

1          elev + I(elev^2) + disturb + disturb * elev + disturb * I(elev^2) gaussian 13 1678.354 3357.17   20.92  0 3443.53   37.70 0.041

5                                           elev + I(elev^2) + disturb + tci gaussian  8 1683.407 3367.00   30.75  0 3420.24   14.41 0.029

3                                                 elev + I(elev^2) + disturb gaussian  7 1685.925 3371.99   35.74  0 3418.60   12.77 0.022

2          elev + I(elev^2) + disturb + disturb * elev + disturb * I(elev^2)  poisson 12 1703.530 3407.46   71.21  0 3487.20   81.37 0.039

4                                                 elev + I(elev^2) + disturb  poisson  6 1709.601 3419.31   83.06  0 3459.27   53.44 0.023

Relative to glm8, there is no support whatsoever (wi=0) for any remaining model if you go by AIC.  Interestingly, because we've nearly doubled the number of parameters (owing to the interaction terms for each level of disturbance class), BIC disagrees and instead favors the less heavily parameterized models.  Here you'd probably go with the AIC, as BIC is intended primarily for cases with extremely large sample sizes relative to the number of parameters (here, n/k for glm8 is 795/14 = 57).

Note that the Akaike weight cannot be interpreted as the 'probability that a model is true'; rather, wi shows the relative evidence in favor of each model given that we initially (a priori) considered our models equally likely.  So, although the information-theoretic approach is a great way to compare models when you are faced with several models that are all equally compelling on a priori grounds (perhaps for biological reasons, or from past literature), we're still left with a situation where we're evaluating the probability of the data rather than of the models.  Bayesians would argue that, on philosophical grounds, it makes more scientific sense to obtain a result that establishes the probability of a particular hypothesis (model) being true, rather than making statements based on the probability of observing the data given particular hypotheses.  To make this transition we need to consider Bayes' theorem.

Toward Bayesian models

So much has been written on Bayesian vs. non-Bayesian analysis in ecology in the last decade that I will limit our discussion to implementation of Bayesian models in R and leave the 'why or why not' concerns to more able statisticians (highly recommended: McCarthy's 2007 book Bayesian Methods for Ecology, also Clark 2005 Ecology Letters 8:2-14).  Essentially, what a simple Bayesian approach (as opposed to hierarchical Bayes, which we'll approach later) facilitates is the translation of a conditional statement about data (given a hypothesis) to a conditional statement about a hypothesis (given a dataset), using Bayes' theorem.  Rev. Thomas Bayes figured this out in the 18th century: he was interested in calculating inverse probabilities (e.g., given data of heads and tails, what can you tell about the coin?).  In practice, we do this by multiplying our already established likelihood function [P(data|model)] by a so-called prior distribution, which is the suspected probability of the model before we collect new data.  This product of a likelihood and a prior is proportional to the posterior distribution [P(model|data)], which becomes the basis for statistical inference.  Once you've established the posterior distribution, model selection and parameter estimation are relatively easy and based on the more intuitive notion of calculating actual probabilities of different hypotheses.  The difficulties of Bayesian modeling are how you get to the posterior: you need to specify a prior (sometimes admitting that you know nothing about the model initially: a so-called uninformative or flat prior) and then you need to estimate the shape of the posterior distribution.  The analytical problem in this is that all the posterior probabilities must sum to 1 (the prior and posterior distributions are probability density functions), and thus the product of the likelihood function and the prior distribution must be standardized by a sum of probabilities of all possible outcomes.  In most cases we can't figure this out analytically and so must estimate the shape of the posterior using a numerical random sampling technique (Markov Chain Monte Carlo simulation, or MCMC). In many ways this is similar to the optimization routines we needed to calculate MLEs: like the simulated annealing algorithm, MCMC works sequentially to find new parameter values using random jumps through parameter space. Of the several common MCMC variants used to 'solve' (actually sample from) the posterior distribution, perhaps the most popular is called Gibbs sampling.  This algorithm (often called the Gibbs sampler) is particularly good for estimating the value of one parameter conditional on values of a host of other parameters, which is especially effective for fitting hierarchical Bayesian models.  Unfortunately, the most effective software that uses the Gibbs sampler, BUGS (Bayesian inference Using Gibbs Sampling), is outside the realm of R, although it is integrated using the R library R2WinBUGS (WinBUGS being the Windows version of BUGS).
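Before moving to BUGS, here is a minimal sketch of the likelihood-times-prior logic in R, using a grid approximation (not MCMC) for the slope in our earlier cover-elevation regression; for simplicity the intercept and variance are held fixed at their MLEs from out1, so this is a conditional rather than a full posterior:
b.grid = seq(-0.005,0.02,by=0.0001)            #candidate slope values
loglik = sapply(b.grid,function(m) sum(dnorm(cover,mean=out1$par[1]+m*elev,sd=sqrt(out1$par[3]),log=T)))
prior = rep(1/length(b.grid),length(b.grid))   #flat (uninformative) prior
post = exp(loglik-max(loglik))*prior           #likelihood x prior (rescaled to avoid underflow)
post = post/sum(post)                          #standardize so the posterior sums to 1
plot(b.grid,post,type="l",xlab="slope",ylab="posterior probability")
With a flat prior the posterior mode falls at the MLE (~0.0076), which is the sense in which likelihood and Bayesian answers coincide when the prior carries no information.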

A simple WinBUGS example

Here's an example of a simple regression to demonstrate how WinBUGS works, following the example of Box 5.1 in McCarthy 2007.  I'll use the open-source successor to WinBUGS, called OpenBUGS, available here:
http://www.openbugs.info/w/

Browse to the download page and grab the executable for your operating system. It takes a minute or two to install. Open the program and open a new file. In this file we specify a model that contains the 1) deterministic function; 2) stochastic function; and 3) prior distributions of each parameter to be estimated.  After the model we specify the data and (optionally) a list of starting values for fitted parameters.  For a simple regression with normal error, we're estimating three parameters (intercept, slope, and variance); note BUGS wants normal distributions described by a mean and precision term, which is the inverse of variance, and there are other minor differences between BUGS and R code.  Here's our first regression example:

model

{

     a ~ dnorm(0,1.0E-6) #flat prior for the intercept (extremely low precision), note "~" means "is distributed as"

     b ~ dnorm(0,1.0E-6) #flat prior for the slope (ditto)

     p ~ dgamma(0.001,0.001)  #flat gamma prior for the precision, because restricted to positive values

     for(i in 1:9) #BUGS only accepts for loops rather than vector-based functions, so we loop over each observation

     {

          mean[i] <- a + b*elev[i]     #deterministic function (linear); note equal sign in BUGS is "<-" rather than "="

          cover[i] ~ dnorm(mean[i],p)  #stochastic (data) model, here normal with precision=p

     }

}

list(elev=c(100,110,320,500,830,120,330,100,500),cover=c(0,4,5,7,8,2,2,4,4))   #data

list(a=0,b=0,p=1)  #initial values for parameters

The data line is a list with two elements, named for the variables used in the model (cover and elev); the final line is a list of starting values for the three fitted parameters.  The above can be copied and pasted into a new file window in OpenBUGS.  The next step is making sure that there are no errors or omissions in the model specification.  Double-click 'model' to highlight it and then select 'Specification' under the Model menu. The 'check model' button should report 'model is syntactically correct' in the bottom status line, or indicate lines that need attention.  You can then use the 'load data' button to pass the dataset to BUGS, after highlighting the word 'list' in the data line of the code.  Next hit the 'compile' button to set up the information needed for the Gibbs sampler, and then tell BUGS the starting values for each parameter (like we did for optim), either by having BUGS pick values from the priors ('gen inits' button) or by specifying your own values in the code: here highlight 'list' in the last line and then click 'load inits'.  Generally, it is recommended that you specify your own starting values.

At this point WinBUGS is ready to sample from the posterior distribution. Use the ‘Update Tool’ under the Model menu to run a given number of MCMC iterations (‘updates’, with a default of 1000).  To gauge the state of our variables after a chain of 1000, select the ‘Sample Monitor Tool’ under the Inference menu and type in our parameter names (a, b, and p) in the ‘node’ box (after typing in each separately, click ‘set’ to save it).  To view what’s happening with all the parameters at once, type ‘*’ in the node box and click ‘trace’.  A graph window should open with iteration on the x axis and the fitted parameter value on the y axis for all three parameters. Taking more updates (Update Tool box) will move these trace windows in real time. As they move through the window the curves should look like white noise centered on the posterior mode: these are random samples of the posterior distributions of each parameter (to see the entire history of iterations, select ‘history’ in the Sample Monitor Tool). To see the accumulated sample for each posterior distribution, click ‘density’ in the Sample Monitor Tool window (note all graphs can be resized via click-and-drag).  The ‘stats’ button shows summary statistics (including the mid-95% of each posterior, called the Bayesian credible interval, the confidence interval analogue in Bayesian statistics).  It is typically recommended that early samples (‘burn in’ samples) are thrown away because these remain correlated with the initial state of parameter values; you can get rid of these by changing the ‘beg’ box value in the Sample Monitor Tool to 1000 or some other reasonably large value. The ‘auto cor’ button gives you an idea as to the correlation structure of the MCMC values of each parameter (here, values separated by 10 or more iterations appear to be uncorrelated). As you take many samples, the results should conform to what we established using likelihood, given we used uninformative priors.
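If you prefer to drive the same analysis from R rather than the BUGS GUI, the R2WinBUGS library mentioned earlier can do so. The following is only a hedged sketch: it assumes WinBUGS is installed (adjust bugs.directory to your installation), that the model{} block above (just the model, not the data or inits lines) has been saved in a file called covermodel.txt in your R working directory, and that cover and elev are the vectors defined at the start of these notes.
library(R2WinBUGS)                    #install.packages("R2WinBUGS") if needed
sim = bugs(data=list(cover=cover,elev=elev),
     inits=function() list(a=rnorm(1),b=rnorm(1,0,0.01),p=runif(1,0.5,2)),
     parameters.to.save=c("a","b","p"),
     model.file="covermodel.txt",     #file containing only the model{} block
     n.chains=3,n.iter=11000,n.burnin=1000,
     bugs.directory="c:/Program Files/WinBUGS14/")  #adjust to your WinBUGS installation
print(sim)                            #posterior summaries (means, sd, credible intervals)
plot(sim)                             #graphical summary of the three chains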

Exercise

Use the OpenBUGS manual to see the required format for loading a dataset file. (There are two acceptable formats, the easier of which is the ‘rectangular format’: essentially a text file of data columns, each with a header row of the variable names [each followed by brackets], and with a final row containing only END.)  Using the write function in R (or write.csv), export a Fraser magnolia cover dataset that includes at least two columns (cover and elevation), reformat it in a text editor (or Excel), and load it into OpenBUGS. Finally, rewrite the above model to produce the posterior distribution of the elevation coefficient in a simple 3-parameter regression model of magnolia cover and elevation.