**March 8: More likelihood**

**Likelihood profiles**

Likelihood profiles are the meat and potatoes of likelihood model fitting, because they provide a snapshot of how the overall model likelihood depends on particular values of one or more parameters. If negative log-likelihood values increase quickly as you move away from the MLE (higher or lower), then the MLE has high precision: it is unlikely that the real value is far from the estimated value. Conversely, if the likelihood profile is flat, then a wide range of parameter values could have produced essentially the same dataset, and your point estimate of the MLE may be nearly worthless. Better still, likelihoods have a very useful statistical property: twice the log of the ratio of any two likelihoods fit to the same data (equivalently, twice the difference in their negative log-likelihoods, a difference in *deviance*) is approximately chi-square distributed. This means we can associate any increase in negative log-likelihood with the probability that the poorer-fitting parameterization is actually as good as the better one: i.e., a P value. This forms the basis of estimating parameter precision and comparing (nested) models.
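For example (with made-up numbers purely for illustration), if dropping a single parameter raised a model's minimum negative log-likelihood from 20.0 to 23.0, the corresponding P value would come straight from the chi-squared distribution with 1 df:

nll.full = 20.0; nll.reduced = 23.0 #hypothetical negative log-likelihoods, for illustration only

1-pchisq(2*(nll.reduced-nll.full),df=1) #P value for a 1-parameter difference (here ~0.014)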

In models that include a single parameter, likelihood profiles are easy to generate because you only need to calculate the likelihood at different values of that parameter. We produced this above in our examination of the mean of a normal distribution (using likelihood, rather than negative log-likelihood). For models with more than one parameter, however, a change in the value of one parameter will (usually) shift the MLEs of the others. Constructing an accurate likelihood profile therefore requires that the fit be 're-optimized' to find the lowest negative log-likelihood for each alternative value of the profiled parameter. Of course, there are various libraries in R that automate this process, but we'll try a simple manual example.

For ease of use let’s return to the simple case of linear regression of 9 cover values and associated elevation values.

cover = c(0,4,5,7,8,2,2,4,4)

elev = c(100,110,320,500,830,120,330,100,500)

We’ll first examine the overall MLEs fit by optim using a normal likelihood function:

fn = function(par) {-sum(dnorm(cover,mean=par[1]+par[2]*elev,sd=sqrt(par[3]),log=T))}

out1 = optim(fn,par=c(1,0,3),method="Nelder-Mead")

out1

$par

[1] 1.54495516 0.00759282 2.31627982

$value

[1] 16.55038

The slope for the effect of elevation is ~0.0076. So what? Is there evidence that in fact elevation is a significant predictor? We can find out by examining the likelihoods of a range of different slope parameters, while also allowing the MLEs of the other two parameters to vary with the different slope values. We create a vector of possible slopes and then generate new optim results for each case of a new fixed slope parameter:

slopes = seq(-.005,.02,by=.0005) #vector of possible slope values surrounding the MLE

lik.out = numeric(length(slopes)) #holding vector (currently zeros)

fn2 = function(par,m) { #new simplified likelihood function (only 2 fitted pars)

b = par[1]

v = par[2]

-sum(dnorm(cover,mean=b+m*elev,sd=sqrt(v),log=T))}

for(i in 1:length(slopes)) { #calculates new likelihoods for each new slope model

lik.out[i] = optim(fn2,par=c(1,3),m=slopes[i],method="Nelder-Mead")$value }

plot(slopes,lik.out,type="l")

This is our likelihood profile for the slope. The MLE is about 0.0076 and results in a negative log-likelihood of about 16.5. To determine the 95% confidence interval for the slope we use the fact that a difference in negative log-likelihood greater than half the chi-squared critical value for alpha=0.05 (the 0.95 quantile with 1 df) puts a slope value beyond the P=0.05 boundary. The confidence limits therefore lie where the profile crosses the current minimum negative log-likelihood plus:

qchisq(0.95,1)/2

[1] 1.920729

So we just need to find the slope values at which the negative log-likelihood equals the current minimum (~16.5) plus 1.92, which should be around 18.5. To get more precise values we use the approx function, which does simple interpolation given a set of x and y values. We do this for both the lower tail and the upper tail:

lower.s = slopes[1:which.min(lik.out)] #x values, left side of the graph

lower.k = lik.out[1:which.min(lik.out)] #y values, left side of the graph

lowerCI = approx(lower.k,lower.s,xout=min(lik.out)+qchisq(0.95,1)/2)$y #desired x value for y = 1.92+y(MLE)

upper.s = slopes[which.min(lik.out):length(lik.out)] #x values, right side

upper.k = lik.out[which.min(lik.out):length(lik.out)] #y values, right side

upperCI = approx(upper.k,upper.s,xout=min(lik.out)+qchisq(0.95,1)/2)$y #desired x value for y = 1.92+y(MLE)

abline(v=c(lowerCI,upperCI),lty=2,col="red")

arrows(lowerCI,18.5,upperCI,18.5,code=3,col="blue"); text(.0075,19,"95% CI",cex=2)

I’ve plotted the estimated confidence limits. They do not contain zero, so elevation is indeed significant. The fact that this procedure can be used to test how different a parameter MLE is from zero forms the basis of nested model evaluation using likelihood: we've just decided that a model including elevation is a better model than one without it. This is the likelihood ratio test.
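As an aside, R can build profile-likelihood intervals for you: **confint** applied to a model fit with **glm** profiles each coefficient in much the same way we just did by hand (in older versions of R the glm method comes from the MASS package). A quick check using the same data; the numbers won't match our manual interval exactly because the variance is handled differently:

library(MASS) #supplies the profile-likelihood confint method for glm in older R versions

confint(glm(cover~elev,family=gaussian)) #profile-based 95% CIs for the intercept and slope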

*Exercise*

Create a likelihood profile for the quadratic coefficient of elevation for the exercise of magnolia presence-absence from last week. Does the confidence interval suggest the quadratic term is significant? (Note this can be tricky because of the tendency for parameter values to produce predicted values of the binomial probability outside [0,1]. I used a starting parameter set of 0 for the intercept and 0.7 for the linear slope. Use a small starting range for the profiled parameter, near its MLE.)

**Model comparison**

Likelihood offers two general means of model comparison: the frequentist likelihood ratio test (LRT), which generates a P value for the comparison of nested models, and the non-frequentist (non-P-value) use of information criteria (ICs, like AIC). IC-based methods are more flexible because they can be used to compare any models fit via likelihood (including those not strictly fit via likelihood but from which a likelihood is easily calculated, such as least-squares fits). The 'rules' of IC-based methods are not strictly based on probability; rather, rules-of-thumb have been established via simulation. For this reason many researchers prefer LRTs when they are available, although IC-based methods, and particularly AIC, have widespread acceptance in the ecological literature.

LRTs apply only to the comparison of nested models, and involve the change in model likelihood when one parameter is set to zero (or when two parameters are constrained to be equal). Returning to the above example of elevation and cover, we can ask whether there is sufficient evidence from a comparison of their likelihoods that the deterministic component of the model should contain a squared term for elevation:

fn.sq = function(par) {-sum(dnorm(cover,mean=par[1]+par[2]*elev+par[4]*(elev^2),sd=sqrt(par[3]),log=T))}

out.sq = optim(fn.sq,par=c(1,0,3,0),method="Nelder-Mead")

out.sq

$par

[1] 1.840340e+00 5.376081e-03 2.294774e+00 2.622342e-06

$value

[1] 16.50852

The optimized negative log-likelihood is 16.50852, which is barely lower than that of the model without the quadratic term (16.55). Is it low enough to be a significantly better fit? We don't have to create the likelihood profile, because (as frequentists) we're only interested in whether the quadratic coefficient is zero (the null hypothesis) or greater than zero (the positive MLE according to our out.sq result). In fact we already know our critical value is a negative log-likelihood difference of

qchisq(0.95,1)/2

[1] 1.920729

because the one-parameter difference between the models gives us 1 df for the chi-squared test. So the original negative log-likelihood would have had to drop to

out1$value - qchisq(0.95,1)/2

[1] 14.62965

So we weren't even close. A model including only an intercept, however, has a negative log-likelihood of:

fn.int = function(par) {-sum(dnorm(cover,mean=par[1],sd=sqrt(par[2]),log=T))}

out.int = optim(fn.int,par=c(1,3),method="Nelder-Mead")

out.int$value

[1] 20.48704

Twice this difference in negative log-likelihood (the deviance difference) has an associated P-value of:

1-pchisq(2*(out.int$value-out1$value),df=1)

[1] 0.005016926

We conclude that a model including the main effect of elevation is much more
likely to be the better model than one with only an intercept. Interestingly,
this is not equivalent to the P-value obtained with a linear model based on
least squares:

summary(lm(cover~elev))

Call:

lm(formula = cover ~ elev)

Residuals:

Min 1Q Median 3Q Max

-2.3045 -1.3412 0.1534 1.6196 1.6955

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.545275 0.972862 1.588 0.1562

elev 0.007592 0.002427 3.129 0.0166 *

The fitted coefficients are identical to what we found with likelihood, but the test is a Student's t test of whether the elevation coefficient differs from zero, rather than a comparison of the likelihoods of two competing models. Conveniently, R has a built-in function **logLik** for returning log-likelihoods of a fitted model object:

logLik(lm(cover~elev))

'log Lik.' -16.55038 (df=3)

Because the estimated coefficients are identical to what we produced manually,
the value is equal to that of our out1 object (note you need to return the
negative of this for the negative log-likelihood). Using this technique we
could do the same model comparison via likelihood but skipping the
optimization:

1-pchisq(abs(2*(-logLik(lm(cover~1))[1]+logLik(lm(cover~elev))[1])),df=1)

[1] 0.005016926

The tricky bit about the above line is that we want the first element of the **logLik** output ([1]); I also made sure that the difference returned a positive value (**abs**). You should be able to do this for all GLMs as well.
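For example, here is the same kind of nested comparison for GLMs, done both with the built-in **anova** method (which performs the likelihood ratio/deviance test when you request test="Chisq") and by hand from **logLik**. The Poisson error model here is purely for illustration:

g0 = glm(cover~1,family=poisson) #intercept-only count model

g1 = glm(cover~elev,family=poisson) #count model with elevation

anova(g0,g1,test="Chisq") #likelihood ratio (deviance) test of the nested models

1-pchisq(2*(logLik(g1)[1]-logLik(g0)[1]),df=1) #the same P value computed by hand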

AIC (the Akaike Information Criterion) and other IC-based goodness-of-fit statistics are similarly easy to calculate once model likelihoods are in hand, because they are based only on the negative log-likelihood and the number of parameters in the model (AIC = -2logL + 2k, i.e. twice the negative log-likelihood plus 2k, where k is the number of fitted parameters). The comparisons needn't be restricted to nested models, but the datasets fit by the models must be identical. What you don't get with IC-based methods is a P value (many consider this to be a good thing). Let's compare our simple models of elevation and cover using AIC rather than LRT:

2*out1$value+2*length(out1$par)

[1] 39.10075

2*out.int$value+2*length(out.int$par)

[1] 44.97408

Lower AIC values indicate better fits, and the rule-of-thumb established for
ICs is that a difference of 2 units or more indicates models that are clearly
different, so again we conclude that there is substantial support for a model
that includes elevation. Here too there is a built-in function to save you the
few seconds of trouble of writing the above formulas:

AIC(lm(cover~elev),lm(cover~1))

df AIC

lm(cover ~ elev) 3 39.10075

lm(cover ~ 1) 2 44.97408

The **AIC** function should work for any fitted model object that generates an associated **logLik** value (lm, glm, gam, etc.). Note that according to Burnham and Anderson (2002) we probably should have instead used AICc, the AIC value corrected for small sample sizes (AICc = AIC + 2k(k+1)/(n-k-1), generally recommended when n/k < 40):

2*out1$value+2*length(out1$par) + 2*(length(out1$par)*(length(out1$par)+1)/(length(cover)-length(out1$par)-1))

[1] 43.90075

2*out.int$value+2*length(out.int$par) + 2*(length(out.int$par)*(length(out.int$par)+1)/(length(cover)-length(out.int$par)-1))

[1] 46.97408

The conclusion is the same, but the disparity between the two models isn't quite as great once the small-sample correction is applied. Yet another IC variant is the so-called 'Bayesian' IC (BIC; not actually a Bayesian method), which penalizes each parameter by log(n) rather than 2, and which may be more appropriate for comparing models fit to large datasets (compared to AIC, BIC is more conservative about accepting highly parameterized models):

2*out1$value+length(out1$par)*log(length(cover))

[1] 39.69243

2*out.int$value+length(out.int$par)*log(length(cover))

[1] 45.36853
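As with AIC, there is a built-in **BIC** function that accepts one or more fitted model objects and should reproduce the values calculated above:

BIC(lm(cover~elev),lm(cover~1)) #built-in BIC for the two candidate models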

Given the growing popularity of model comparison via information-theoretic approaches like AIC, it is important to understand how AICs are used to compare the relative goodness-of-fit of competing models. First, the actual AIC value is not meaningful in itself (as opposed to something like R2); rather, it is the differences in AIC between models that are of interest. AIC differences in a table of competing models allow a quick and convenient means of model comparison (differences <2 suggest the models are of essentially equal rank; 4-7 suggests 'considerably less' support; >10 suggests 'essentially none' [Burnham and Anderson 2002, p. 70]). Burnham and Anderson have additionally popularized a method (established by Akaike) that uses AIC values to calculate the relative weight of evidence supporting one model over others, given a list of reasonable alternative models fit to a given dataset. These are called Akaike weights (w_{i}), computed for model i as w_{i} = exp(-0.5*AICdiff_{i}) / sum_{j} exp(-0.5*AICdiff_{j}), where AICdiff is each model's AIC difference from the best model. Akaike weights are unique to a particular set of models (that is, if a model is added or removed, w_{i} must be recalculated), and they can also be computed for other ICs (e.g., BIC). Model comparison tables based on ICs are now common in the literature, so it is worthwhile to be able to produce your own using the basic likelihood approaches above. Here we can write a simple function that, given a set of candidate models, creates the standard information-theoretic output:

ICtab = function(model.list) {

if(is.list(model.list[[1]])==F) stop("Please enter models as a list (list(mod1,mod2,...))")

n = length(model.list) #number of candidate models

s = length(model.list[[1]]$fitted) #no. observations

kv = numeric(n) #empty no. param vector

modLL = numeric(n) #empty neg log lik vector

mod.call = numeric(n) #model form (for output)

R2 = numeric(n) #empty model R2 vector

fam = numeric(n) #empty family vector

for(i in 1:n) { #loop over each model

mod = model.list[[i]] #select model

fam[i] = mod$family[[1]] #error family

mod.call[i] = as.character(mod$formula[3]) #return model formula

modLL[i] = -logLik(mod)[1] #extract neg log likelihood

kv[i] = attr(logLik(mod),"df") #extract no. fitted params

res = mod$y-mod$fitted #residuals scaled to original data

R2[i] = 1-(sum((res-mean(res))^2) / sum((mod$y-mean(mod$y))^2) )

}

AICv = 2*modLL + 2*kv*(kv+1)/(s-kv-1) #AICc for all models

BICv = 2*modLL + kv*log(s) #BIC for all models

AICdiff = AICv - min(AICv) #AIC differences

BICdiff = BICv - min(BICv) #BIC differences

wi = exp(-.5*AICdiff) / sum(exp(-.5*AICdiff)) #Akaike weights

out = data.frame(mod.call,fam,kv,modLL,round(AICv,2),round(AICdiff,2),round(wi,3),round(BICv,2),round(BICdiff,2),round(R2,3))

names(out) = c("model","family","k","-LL","AICc","AICdiff","wi","BIC","BICdiff","R2")

out = out[order(AICdiff),] #sort ascending by AICdiff column

return(out)

}

To illustrate what the above function does, let's construct
a few models with our tree dataset:

dat = read.csv(file.choose()) #grab the treedata.csv file

dat2 = subset(dat,dat$species=="Acer rubrum") #red maple

We'll construct a set of candidate models that involve
reasonable a priori hypotheses about red maple abundance (cover) with respect
to elevation, water potential, stream distance, disturbance history, and
radiation (beers index). We're also interested in whether the data model
should be based on counts (poisson) or normal:

glm1 = glm(cover~elev+I(elev^2)+disturb+disturb*elev+disturb*I(elev^2),data=dat2,family=gaussian)

glm2 = glm(cover~elev+I(elev^2)+disturb+disturb*elev+disturb*I(elev^2),data=dat2,family=poisson)

glm3 = glm(cover~elev+I(elev^2)+disturb,data=dat2,family=gaussian)

glm4 = glm(cover~elev+I(elev^2)+disturb,data=dat2,family=poisson)

glm5 = glm(cover~elev+I(elev^2)+disturb+tci,data=dat2,family=gaussian)

glm6 = glm(cover~elev+I(elev^2)+disturb+streamdist,data=dat2,family=gaussian)

glm7 = glm(cover~elev+I(elev^2)+disturb+I(log(streamdist+1)),data=dat2,family=gaussian)

glm8 = glm(cover~elev+I(elev^2)+disturb+ disturb*elev+disturb*I(elev^2)+I(log(streamdist+1)),data=dat2,family=gaussian)

This is
obviously not a complete list of candidate models; ideally we have some
hypotheses to test involving biological mechanisms of red maple distribution or
are interested in only a small subset of all possible regression models. I say
"ideally" because this approach does rely heavily on the set of
models you choose to evaluate. To illustrate this, let's first run our
information-theoretic function on models 1 through 7:

ICtab(list(glm1,glm2,glm3,glm4,glm5,glm6,glm7))

model family k -LL AICc AICdiff wi BIC BICdiff R2

7 elev + I(elev^2) + disturb + I(log(streamdist + 1)) gaussian 8 1676.202 3352.59 0.00 0.825 3405.83 0.00 0.042

6 elev + I(elev^2) + disturb + streamdist gaussian 8 1678.407 3357.00 4.41 0.091 3410.24 4.41 0.037

1 elev + I(elev^2) + disturb + disturb * elev + disturb * I(elev^2) gaussian 13 1678.354 3357.17 4.59 0.083 3443.53 37.70 0.041

5 elev + I(elev^2) + disturb + tci gaussian 8 1683.407 3367.00 14.41 0.001 3420.24 14.41 0.029

3 elev + I(elev^2) + disturb gaussian 7 1685.925 3371.99 19.41 0.000 3418.60 12.77 0.022

2 elev + I(elev^2) + disturb + disturb * elev + disturb * I(elev^2) poisson 12 1703.530 3407.46 54.87 0.000 3487.20 81.37 0.039

4 elev + I(elev^2) + disturb poisson 6 1709.601 3419.31 66.72 0.000 3459.27 53.44 0.023

Note our function requires its argument (the model names) to be embedded in a list. The function returns all the basic IC-based (and R2) criteria, including the number of parameters (k), the negative log-likelihood, the sample size-corrected AICc (with large samples this converges to AIC, so we might as well always use AICc), the AIC differences (each model's AICc compared to the best [lowest] one), Akaike weights, and raw and relative BICs. Note the much harsher penalties for k imposed by BIC than AIC (e.g., glm1), and that the R2s hardly vary and are thus very poor tools for model comparison. Clearly model 7 is the best model; its AICc advantage suggests "considerably more" support than models 1 and 6, and the Akaike weights indicate that there is an order of magnitude more support for it compared to the next two. Still, there remains some minor support for those two models (the others have no support at all). Now compare the result when we add model 8 to the mix:

ICtab(list(glm1,glm2,glm3,glm4,glm5,glm6,glm7,glm8))

model family k -LL AICc AICdiff wi BIC BICdiff R2

8 elev+I(elev^2)+disturb+disturb*elev+disturb*I(elev^2)+I(log(streamdist+1)) gaussian 14 1667.856 3336.25 0.00 1 3429.21 23.38 0.062

7 elev + I(elev^2) + disturb + I(log(streamdist + 1)) gaussian 8 1676.202 3352.59 16.34 0 3405.83 0.00 0.042

6 elev + I(elev^2) + disturb + streamdist gaussian 8 1678.407 3357.00 20.75 0 3410.24 4.41 0.037

1 elev + I(elev^2) + disturb + disturb * elev + disturb * I(elev^2) gaussian 13 1678.354 3357.17 20.92 0 3443.53 37.70 0.041

5 elev + I(elev^2) + disturb + tci gaussian 8 1683.407 3367.00 30.75 0 3420.24 14.41 0.029

3 elev + I(elev^2) + disturb gaussian 7 1685.925 3371.99 35.74 0 3418.60 12.77 0.022

2 elev + I(elev^2) + disturb + disturb * elev + disturb * I(elev^2) poisson 12 1703.530 3407.46 71.21 0 3487.20 81.37 0.039

4 elev + I(elev^2) + disturb poisson 6 1709.601 3419.31 83.06 0 3459.27 53.44 0.023

Relative to glm8, there is no support whatsoever (wi=0) for any remaining model if you adhere to AIC. Interestingly, because we've nearly doubled the number of parameters (owing to the interaction terms for each level of disturbance class), BIC disagrees and instead favors the less heavily parameterized models. Here you'd probably go with AIC, as BIC is intended primarily for cases with extremely large sample sizes relative to the number of parameters (here, n/k for glm8 is 795/14 = 57).

Note that the Akaike weight cannot be interpreted as the 'probability that a model is true'; rather, wi shows the evidence in favor of each model, given that we initially (a priori) considered our models equally likely. So, although the information-theoretic approach is a great way to compare models if you're in the situation of comparing several models that are all equally compelling on a priori grounds (perhaps for biological reasons, or from past literature), we're still left with a situation where we're evaluating the probability of the data rather than of the models. Bayesians would argue that, on philosophical grounds, it makes more scientific sense to obtain a result that establishes the probability of a particular hypothesis (model) being true, rather than making statements based on the probability of observing the data given particular hypotheses. To make this transition we need to consider Bayes' theorem.

**Toward Bayesian models**

So much has been written on Bayesian vs. non-Bayesian analysis
in ecology in the last decade that I will limit our discussion to
implementation of Bayesian models in R and leave the 'why or why not' concerns
to more able statisticians (highly recommended: McCarthy's 2007 book *Bayesian
Methods for Ecology*, also Clark 2005 Ecology Letters 8:2-14). Essentially,
what a simple Bayesian approach (as opposed to hierarchical Bayes, which we'll
approach later) facilitates is the translation of a conditional statement about
data (given a hypothesis) to a conditional statement about a hypothesis (given a
dataset), using Bayes' theorem. Rev. Thomas Bayes figured this out in the 18th century: he was interested in calculating inverse probabilities (e.g., given data of heads and tails, what can you tell about the coin?). In practice, we
do this by multiplying our already established likelihood function [P(data|model)]
by a so-called *prior* distribution, which is the suspected probability of
the model before we collect new data. This product of a likelihood and a prior
is proportional to the *posterior* distribution [P(model|data)], which
becomes the basis for statistical inference. Once you've established the
posterior distribution, model selection and parameter estimation are relatively
easy and based on the more intuitive notion of calculating actual probabilities
of different *hypotheses*. The difficulties of Bayesian modeling are how
you get to the posterior: you need to specify a prior (sometimes admitting that
you know nothing about the model initially: a so-called *uninformative* or *flat prior*) and then you need to estimate the shape of the posterior
distribution. The analytical problem in this is that all the posterior probabilities
must sum to 1 (the prior and posterior distributions are probability density
functions), and thus the product of the likelihood function and the prior
distribution must be standardized by a sum of probabilities of all possible
outcomes. In most cases we can't figure this out analytically and so must
estimate the shape of the posterior using a numerical random sampling technique
(Markov Chain Monte Carlo simulation, or MCMC). In many ways this is similar to
the optimization routines we needed to calculate MLEs: like the simulated
annealing algorithm, MCMC works sequentially to find new parameter values using
random jumps through parameter space. Of the several common MCMC variants used
to 'solve' (actually sample from) the posterior distribution, perhaps the most
popular is called *Gibbs sampling*. This algorithm (often called the
Gibbs sampler) is particularly good for estimating the value of one parameter
conditional on values of a host of other parameters, which is especially
effective for fitting hierarchical Bayesian models. Unfortunately, the most
effective software that uses the Gibbs sampler, BUGS (Bayesian inference Using
Gibbs Sampling), is outside the realm of R, although it is integrated using the
R library R2WinBUGS (WinBUGS being the Windows version of BUGS).
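Before turning to BUGS it may help to see Bayes' theorem in action on a problem small enough to solve by brute force. The sketch below grid-approximates the posterior distribution of the slope from our cover-elevation regression; to keep it one-dimensional I've simply fixed the intercept and variance at their (rounded) MLEs from earlier, which is a shortcut for illustration only. The posterior is just the standardized product of the likelihood and a vague prior:

b.grid = seq(-0.005,0.02,by=0.0001) #candidate slope values

loglik = sapply(b.grid, function(b) sum(dnorm(cover,mean=1.545+b*elev,sd=sqrt(2.316),log=T))) #log-likelihood at each candidate slope

prior = dnorm(b.grid,mean=0,sd=1) #vague (nearly flat) normal prior for the slope

post = exp(loglik)*prior #Bayes: posterior is proportional to likelihood times prior

post = post/sum(post) #standardize so the grid probabilities sum to 1

plot(b.grid,post,type="l",xlab="slope",ylab="posterior probability")

With a prior this flat the posterior mode should sit very near the MLE of ~0.0076.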

*A simple WinBUGS example*

Here's an example of a simple regression to demonstrate how
WinBUGS works, following the example of Box 5.1 in McCarthy 2007. I'll use the
newer version of WinBUGS called OpenBUGS, available here:

http://www.openbugs.info/w/

Browse to the download page and grab the executable for your
operating system. It takes a minute or two to install. Open the program and
open a new file. In this file we specify a *model* that contains the 1)
deterministic function; 2) stochastic function; and 3) prior distributions of
each parameter to be estimated. After the model we specify the data and
(optionally) a list of starting values for fitted parameters. For a simple
regression with normal error, we're estimating three parameters (intercept,
slope, and variance); note BUGS wants normal distributions described by a mean
and precision term, which is the inverse of variance, and there are other minor
differences between BUGS and R code. Here's our first regression example:

model

{

a ~ dnorm(0,1.0E-6) #flat prior for the intercept (extremely low precision), note "~" means "is distributed as"

b ~ dnorm(0,1.0E-6) #flat prior for the slope (ditto)

p ~ dgamma(0.001,0.001) #flat gamma prior for the precision, because restricted to positive values

for(i in 1:9) #BUGS only accepts for loops rather than vector-based functions, so we loop over each observation

{

mean[i] <- a + b*elev[i] #deterministic function (linear); note equal sign in BUGS is "<-" rather than "="

cover[i] ~ dnorm(mean[i],p) #stochastic (data) model, here normal with precision=p

}

}

list(elev=c(100,110,320,500,830,120,330,100,500),cover=c(0,4,5,7,8,2,2,4,4)) #data

list(a=0,b=0,p=1) #initial values for parameters

After the model we supply the data as a list whose elements are named for the variables used in **model** (cover and elev), followed by a list of initial values for the fitted parameters. The above can be copied and pasted into a new OpenBUGS file window.
there are no errors or omissions in the model specification. Double-click
'model' to highlight it and then select 'Specification' under the Model menu.
The button 'check model' should indicate 'model is syntactically correct' in
the bottom status line, or indicate lines that need attention. You can then
use the 'load data' button to specify the dataset to BUGS, while highlighting
the 'list' term in the final line of the code. Next hit the 'compile' button
to set up the information needed for the Gibbs sampler, and then tell BUGS the
starting values for each parameter (like we did for **optim**), either by
having BUGS pick values from the priors ('gen inits' button) or specifying your own with a line in the code: here highlight 'list' in the final line and then
click 'load inits'. Generally, it is recommended you specify your own starting
values.

At this point WinBUGS is ready to sample from the posterior
distribution. Use the ‘Update Tool’ under the Model menu to run a given number
of MCMC iterations (‘updates’, with a default of 1000). To gauge the state of
our variables after a chain of 1000, select the ‘Sample Monitor Tool’ under the
Inference menu and type in our parameter names (a, b, and p) in the ‘node’ box
(after typing in each separately, click ‘set’ to save it). To view what’s
happening with all the parameters at once, type ‘*’ in the node box and click
‘trace’. A graph window should open with iteration on the x axis and the
fitted parameter value on the y axis for all three parameters. Taking more
updates (Update Tool box) will move these trace windows in real time. As they
move through the window the curves should look like white noise centered on the
posterior mode: these are random samples of the posterior distributions of each
parameter (to see the entire history of iterations, select ‘history’ in the
Sample Monitor Tool). To see the accumulated sample for each posterior
distribution, click ‘density’ in the Sample Monitor Tool window (note all
graphs can be resized via click-and-drag). The ‘stats’ button shows summary
statistics (including the mid-95% of each posterior, called the *Bayesian*
*credible interval*, the confidence interval analogue in Bayesian
statistics). It is typically recommended that early samples (‘burn in’
samples) are thrown away because these remain correlated with the initial state
of parameter values; you can get rid of these by changing the ‘beg’ box value
in the Sample Monitor Tool to 1000 or some other reasonably large value. The
‘auto cor’ button gives you an idea as to the correlation structure of the MCMC
values of each parameter (here, values separated by 10 or more iterations
appear to be uncorrelated). As you take many samples, the results should
conform to what we established using likelihood, given we used uninformative
priors.
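If you would rather drive the whole process from R than point-and-click in OpenBUGS, the R2OpenBUGS library (or R2WinBUGS for WinBUGS) wraps these steps in a single call. The following is only a sketch of the general form; it assumes OpenBUGS is installed and that the model{} block above (without the data and inits lines) has been saved as a text file, here called 'model.txt', in the working directory:

library(R2OpenBUGS) #interface between R and OpenBUGS

bugs.dat = list(elev=c(100,110,320,500,830,120,330,100,500), cover=c(0,4,5,7,8,2,2,4,4)) #data

inits = function() list(a=0,b=0,p=1) #starting values for each chain

fit = bugs(data=bugs.dat, inits=inits, parameters.to.save=c("a","b","p"), model.file="model.txt", n.chains=3, n.iter=11000, n.burnin=1000) #run the Gibbs sampler

print(fit) #posterior means, sds, and credible intervals for a, b, and p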

*Exercise*

Use the OpenBUGS manual to see the required format for
loading a dataset file. (There are two acceptable formats, the easier of which
is the ‘rectangular format’ that is essentially a text file of data columns,
each with a header row of the variables [each followed by brackets], and with a
final row containing only END). Using the **write** function in R (or **write.csv**),
export a Fraser magnolia cover dataset that includes at least two columns
(cover and elevation), reformat it in a text editor (or Excel), and load it
into OpenBUGS. Finally, rewrite the above model to produce the posterior
distribution of the elevation coefficient in a simple 3-parameter regression
model of magnolia cover and elevation.