Lecture notes, Feb 8

Generalized linear models

Generalized linear models (GLMs) extend the linear modeling capability of R to scenarios that involve non-normal error distributions or heteroscedasticity.  All other classic assumptions (particularly independent observations) still apply.  The idea here is that a link function g relates the expected value of the response to a linear function of the predictors: g(E[y]) = b0 + b1x1 + ... .  The data are then fit on this transformed (link) scale (using an iterative routine based on weighted least squares), but the expected variance is calculated on the original scale of the response variable.  Simple examples of link functions are log(y) [which linearizes exp(x)], sqrt(y) [x^2], and 1/y [1/x].  More particularly, GLMs work for the so-called 'exponential' family of error models: Poisson, binomial, Gamma, and normal.  Each of these has a standard (default) link function:

family          default link          inverse link                typical data

gaussian        identity              x                           normally distributed error

poisson         log                   exp(x)                      counts (many zeros, various integers)

binomial        logit                 1/(1+1/exp(x))              proportions or binary (0,1) data

Gamma           inverse               1/x                         continuous data with non-constant error (constant CV)
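You can confirm these defaults directly from R's family objects, which carry both the link name and the inverse-link function (a quick base-R check):

```r
# Each family object stores its default link name...
poisson()$link          # "log"
binomial()$link         # "logit"
Gamma()$link            # "inverse"
gaussian()$link         # "identity"

# ...and its inverse-link function; for the logit,
# linkinv(0) = 1/(1 + 1/exp(0)) = 0.5
binomial()$linkinv(0)
```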

You don't have to use the default link.  For example, for binary data (e.g., presence-absence data) it may be more appropriate to use the 'complementary log-log' link function, specified as family=binomial(link="cloglog").  The link transformations are particularly handy when it comes to model comparisons, because the GLM approach produces a response deviance on the same scale for all models (e.g., AICs are comparable and likelihood ratio tests are appropriate for nested models).  This allows for flexibility in exploring which link function is most appropriate: take the one with the lowest residual deviance (equivalently, the lowest AIC, since the number of parameters is the same).
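As a sketch of that comparison (with simulated presence-absence data, not the tree dataset), fitting the same model under both links and comparing AICs might look like:

```r
# Simulated binary data; the true process here happens to be logistic
set.seed(1)
x <- runif(200, 0, 10)
y <- rbinom(200, 1, plogis(-2 + 0.5 * x))

fit.logit <- glm(y ~ x, family = binomial(link = "logit"))
fit.clog  <- glm(y ~ x, family = binomial(link = "cloglog"))

# Same number of parameters in both models, so comparing
# deviances is equivalent to comparing AICs
AIC(fit.logit, fit.clog)
```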

Note that parameter values returned by coef() or summary() are on the scale of the link function, not the original response variable.  Use the inverse of the link function to get parameter values back on the response scale, or use the function predict with the type="response" argument.
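A minimal illustration with simulated counts: for the Poisson's log link, exponentiating the intercept recovers the sample mean, and predict(type="response") gives the same back-transformed value:

```r
set.seed(5)
y <- rpois(50, 3)
fit <- glm(y ~ 1, family = poisson)

coef(fit)                            # intercept on the log (link) scale
exp(coef(fit))                       # back-transformed: the sample mean
predict(fit, type = "response")[1]   # same value, via predict
```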

Count (or count-like) response variables

We'll first try some examples for using glm given count data.  Strictly speaking, the provided tree dataset does not include a vector of counts, but the abundance data, measured as cover classes (integers between 1 and 10), may fit the essential criteria of 1) being discrete, and 2) having variance that generally increases with the mean (you can consider this from first principles: if a species had a mean abundance around 1, the variance has to be low because you can't get any lower than this given our data; a mean of 5, however, could have high variance).  A Poisson process is one whose mean and variance are equal; we can check to see whether this is true for our data.  Taking eastern hemlock as an example:

> mean(dat2$cover)

 4.659517

> var(dat2$cover)

 4.471835

Not perfect but close.  Let's look at the distribution of counts themselves:

> table(dat2$cover)

1   2   3   4   5   6   7   8   9   10

39  71 166 110  92  80 108  60  19  1

If these counts were distributed exactly from a Poisson process, what would they look like, assuming the same mean (and variance)?

> bar1 = barplot(as.vector(table(dat2$cover)),names.arg=1:10)

> points(bar1,dpois(seq(1,10),4.66)*sum(table(dat2$cover)),

cex=2,type="b",col="sienna",lwd=2,pch=19)

A couple of things about what we did here: rather than use hist, I plotted a histogram from the table function output, because I wanted to see exactly the number of counts in each bin from 1 to 10 (this can be frustrating to do exactly with hist: try it).  Then I overlaid the expected values from a Poisson distribution with the same mean (=4.66) using the dpois function.  Because dpois returns a probability density, I had to multiply these values by the overall count sum to translate them into the expected frequencies we see on the bar chart.  As you can see, the data look Poisson-ish but they're not perfect.  (One reason you've probably already guessed: our data are bounded at 10, so variance should actually go down for the highest values.)

Now let's fit a GLM to these data with just an intercept (overall mean):

> glm1 = glm(cover~1,data=dat2,family=poisson)

> summary(glm1)

Call:

glm(formula = cover ~ 1, family = poisson, data = dat2)

Deviance Residuals:

Min       1Q   Median       3Q      Max

-2.0594  -0.8229  -0.3132   1.0085   2.1430

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept)  1.53891    0.01696   90.73   <2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 749.25  on 745  degrees of freedom

Residual deviance: 749.25  on 745  degrees of freedom

AIC: 3212.2

Number of Fisher Scoring iterations: 4

The output is similar to lm output, and the standard summary and other attribute functions (coef, confint, resid, fitted, etc) apply.  One major difference is how the fitted coefficients are scaled: we were expecting a mean of 4.66 (calculated above), but instead got back 1.54.  Here and in all other cases where the link is not the identity function, the fitted coefficients are returned on the scale of the link function, not the scale of the original data.  To get back the appropriately scaled coefficient we apply the inverse of the link function, in this case exp (as the default link for Poisson is log):

> exp(coef(glm1))

(Intercept)

4.659517

Let's add a continuous predictor variable like elevation to generate a simple Poisson regression.  First we'll graph it:

> with(dat2,plot(elev,cover,main="Hemlock cover vs. elevation",cex=1.5,col="cyan",pch=19))

And then we'll fit the new glm and test it against a model with only an intercept:

> glm2 = glm(cover~elev,data=dat2,family=poisson)

> anova(glm1,glm2,test="Chisq")

Analysis of Deviance Table

Model 1: cover ~ 1

Model 2: cover ~ elev

Resid. Df Resid. Dev Df Deviance P(>|Chi|)

1       745     749.25

2       744     749.23  1 0.023849    0.8773

Note the chi-squared test is typically recommended for models with known dispersion (Poisson and binomial, where the dispersion is fixed at 1).  Here the model with elevation adds no explanatory power (fairly obvious from the graph), but we can still add the predicted trend line to our graph:

> x = seq(0,1660)

> lines(x,predict(glm2,list(elev=x),type="response"),lwd=2,col="orange")

What is crucial here is the type argument to predict: "response" returns the predicted values on the same scale as the original response variable, rather than the scale of the link function.  Let's now try an ANOVA with Poisson error, using disturbance as our predictor:

> glm3 = glm(cover~disturb,data=dat2,family=poisson)

> summary(glm3)

Call:

glm(formula = cover ~ disturb, family = poisson, data = dat2)

Deviance Residuals:

Min       1Q   Median       3Q      Max

-2.1794  -0.7763  -0.1980   0.8523   2.2006

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept)    1.48367    0.03838  38.661   <2e-16 ***

disturbLT-SEL  0.03204    0.04685   0.684   0.4941

disturbSETTLE  0.08957    0.05485   1.633   0.1025

disturbVIRGIN  0.12184    0.05277   2.309   0.0210 *

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 749.25  on 745  degrees of freedom

Residual deviance: 742.32  on 742  degrees of freedom

AIC: 3211.3

Number of Fisher Scoring iterations: 4

> anova(glm1,glm3,test="Chisq")

Analysis of Deviance Table

Model 1: cover ~ 1

Model 2: cover ~ disturb

Resid. Df Resid. Dev Df Deviance P(>|Chi|)

1       745     749.25

2       742     742.32  3    6.926    0.0743 .

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

There is minor support for the disturbance variable.  The glm coefficient table works just like the summary for ANOVA produced by lm: the level alphabetically first (CORPLOG) is used as an intercept, and all subsequent levels are tested against the first (thus the remaining coefficients are differences from the mean, not means themselves).  There appears to be some support for higher hemlock cover in 'virgin' forest, and the overall model is of borderline significance (AICs are also similar).  Would it help to add an interaction with elevation?
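If you'd rather have a different baseline level (say, to test everything against VIRGIN), relevel the factor before fitting; a toy illustration with the level names (not the real data):

```r
# Treatment contrasts use the first factor level as the intercept;
# relevel() changes which level that is
f <- factor(c("CORPLOG", "LT-SEL", "SETTLE", "VIRGIN"))
levels(f)                           # "CORPLOG" comes first alphabetically
levels(relevel(f, ref = "VIRGIN"))  # now "VIRGIN" would be the baseline
```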

> glm4 = glm(cover~disturb*elev,data=dat2,family=poisson)

> summary(glm4)

Call:

glm(formula = cover ~ disturb * elev, family = poisson, data = dat2)

Deviance Residuals:

Min       1Q   Median       3Q      Max

-2.4042  -0.7782  -0.2072   0.8090   2.0888

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept)         1.445e+00  1.396e-01  10.352   <2e-16 ***

disturbLT-SEL       2.546e-01  1.639e-01   1.554   0.1203

disturbSETTLE      -3.702e-01  2.275e-01  -1.628   0.1036

disturbVIRGIN       9.283e-02  2.625e-01   0.354   0.7236

elev                3.540e-05  1.243e-04   0.285   0.7758

disturbLT-SEL:elev -2.788e-04  1.651e-04  -1.689   0.0912 .

disturbSETTLE:elev  7.319e-04  2.943e-04   2.487   0.0129 *

disturbVIRGIN:elev  2.278e-05  2.269e-04   0.100   0.9200

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 749.25  on 745  degrees of freedom

Residual deviance: 729.06  on 738  degrees of freedom

AIC: 3206

Number of Fisher Scoring iterations: 4

> step(glm4)

Start:  AIC=3206

cover ~ disturb * elev

Df Deviance  AIC

<none>              729.06 3206

- disturb:elev  3   742.06 3213

Call:  glm(formula = cover ~ disturb * elev, family = poisson, data = dat2)

Coefficients:

(Intercept)       disturbLT-SEL       disturbSETTLE       disturbVIRGIN                elev  disturbLT-SEL:elev

1.445e+00           2.546e-01          -3.702e-01           9.283e-02           3.540e-05          -2.788e-04

disturbSETTLE:elev  disturbVIRGIN:elev

7.319e-04           2.278e-05

Degrees of Freedom: 745 Total (i.e. Null);  738 Residual

Null Deviance:      749.2

Residual Deviance: 729.1        AIC: 3206

The ANOVA contrasts suggest a significant difference in slope with elevation in the plots of prior settlement; the step function shows that the interaction is necessary (large increase in AIC when the interaction is removed).

GLM extensions for overdispersed (discrete) data

The above exponential family of error functions cannot account for data overdispersion, where the variance is higher than expected from a Poisson or binomial process.  To account for these cases glm includes 'quasi' exponential families that add the parameter phi (φ) to the expected variance equation (Poisson example: variance of φλ rather than λ).  The expectation is that the fitted φ will be greater than 1 (<1 is underdispersion).  The 'quasipoisson' family object can thus account for count data with overdispersion (e.g., many zeros but also many very high values), and the 'quasibinomial' family object can do the same for binomial data.  The fitted φ value is returned in the summary.  One sign of overdispersion is that the residual deviance is substantially higher than the residual degrees of freedom.  That's not true of our glm4 fitted model, but we'll see what happens with a quasipoisson fit anyway:

> glm4.qp = update(glm4,family=quasipoisson)

> summary(glm4.qp)

Call:

glm(formula = cover ~ disturb * elev, family = quasipoisson,

data = dat2)

Deviance Residuals:

Min       1Q   Median       3Q      Max

-2.4042  -0.7782  -0.2072   0.8090   2.0888

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept)         1.445e+00  1.357e-01  10.649   <2e-16 ***

disturbLT-SEL       2.546e-01  1.593e-01   1.598   0.1104

disturbSETTLE      -3.702e-01  2.211e-01  -1.674   0.0945 .

disturbVIRGIN       9.283e-02  2.552e-01   0.364   0.7161

elev                3.540e-05  1.208e-04   0.293   0.7697

disturbLT-SEL:elev -2.788e-04  1.605e-04  -1.738   0.0827 .

disturbSETTLE:elev  7.319e-04  2.861e-04   2.558   0.0107 *

disturbVIRGIN:elev  2.278e-05  2.205e-04   0.103   0.9178

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 0.9449323)

Null deviance: 749.25  on 745  degrees of freedom

Residual deviance: 729.06  on 738  degrees of freedom

AIC: NA

Number of Fisher Scoring iterations: 4

In our case this looks to be a waste of time, since the dispersion parameter was fit at <1.  Note these quasi families are not fit with an actual likelihood, so the standard likelihood ratio test for comparing nested models is not available (hence the AIC of NA above).  Crawley suggests using F tests instead.
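Here's what those diagnostics look like when data really are overdispersed (simulated negative-binomial counts, purely for illustration): both the deviance/df ratio and the fitted φ come out well above 1.

```r
set.seed(2)
x <- runif(300)
y <- rnbinom(300, mu = exp(1 + x), size = 1)   # variance >> mean

fit.p  <- glm(y ~ x, family = poisson)
fit.qp <- glm(y ~ x, family = quasipoisson)

deviance(fit.p) / df.residual(fit.p)   # well above 1: overdispersion
summary(fit.qp)$dispersion             # fitted phi, also well above 1
```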

Another, often better way of dealing with overdispersion that retains the nice characteristics of likelihood (AICs, likelihood ratio tests, use of step or stepAIC) is using a negative binomial (NB) model.  NB models are handy for discrete data like counts, but where the variance increases much faster than the mean (as opposed to the Poisson, where they are equal).  Where the Poisson model has one parameter (lambda = mean = variance), the NB contains an additional parameter k that accounts for 'clumping', which is particularly handy for count data with a preponderance of zeros.  NB is not accommodated in the base glm function (it is not part of the exponential family of distributions), but the MASS library includes the otherwise-similar glm.nb function.  Here's a try at fitting an NB model to our standard cover data:

> library(MASS)

> nb1 = glm.nb(cover~disturb*elev,data=dat2)

Warning messages:

1: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace >  :

iteration limit reached

2: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace >  :

iteration limit reached

> summary(nb1)

Call:

glm.nb(formula = cover ~ disturb * elev, data = dat2, init.theta = 48556.6389,
    link = log)

Deviance Residuals:

Min       1Q   Median       3Q      Max

-2.4041  -0.7781  -0.2072   0.8089   2.0887

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept)         1.445e+00  1.396e-01  10.351   <2e-16 ***

disturbLT-SEL       2.546e-01  1.639e-01   1.553   0.1203

disturbSETTLE      -3.702e-01  2.275e-01  -1.627   0.1036

disturbVIRGIN       9.283e-02  2.625e-01   0.354   0.7236

elev                3.540e-05  1.243e-04   0.285   0.7758

disturbLT-SEL:elev -2.788e-04  1.651e-04  -1.689   0.0912 .

disturbSETTLE:elev  7.319e-04  2.943e-04   2.487   0.0129 *

disturbVIRGIN:elev  2.278e-05  2.269e-04   0.100   0.9200

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(48556.64) family taken to be 1)

Null deviance: 749.18  on 745  degrees of freedom

Residual deviance: 728.99  on 738  degrees of freedom

AIC: 3208

Number of Fisher Scoring iterations: 1

Theta:  48557

Std. Err.:  373192

Warning while fitting theta: iteration limit reached

2 x log-likelihood:  -3190.003

The overdispersion parameter is called 'theta' here (=k). We should be a bit wary of the results since the fit did not meet the default convergence limits.  Nonetheless, the fit is very similar to our Poisson model (glm4), and unlike the quasipoisson fit we can compare the models via AIC:

> AIC(glm4,nb1)

df      AIC

glm4  8 3205.998

nb1   9 3399.000

There is no support for an NB model here.  Given the glm4 model did not suggest overdispersion, this makes sense.  But what if we allowed absences (zero abundance) in our model of hemlock abundance?  Surely there must be hundreds of plots that don't contain hemlock, and this might drastically change our model based on Poisson assumptions.  We can start by creating a new dataset, where plots that do not contain hemlock are added in with zero cover values:

> h.plots = unique(as.character(dat2$plotID))

> u.plots = unique(as.character(dat$plotID))

> noh.plots = u.plots[is.element(u.plots,h.plots)==F]

The first line is a vector of plot IDs containing hemlock, the second is a vector of all the plots, and the third vector is all plots that do not contain hemlock.  Now we create a data frame of these plots with unique plotIDs (ignoring species):

> dat3 = subset(dat,is.element(as.character(dat$plotID),noh.plots))

> dat4 = subset(dat3,duplicated(dat3$plotID)==F)

And finally we add these rows to dat2 (containing hemlock) after changing the cover values to zero:

> dat4$cover = 0

> dat5 = rbind(dat2,dat4)

Now let's look at our cover distribution:

> bar1 = barplot(as.vector(table(dat5$cover)),names.arg=as.character(seq(0,10)))

> points(bar1,dpois(seq(0,10),mean(dat5$cover))*sum(table(dat5$cover)),

cex=2,type="b",col="lightblue",lwd=2,pch=19)

> points(bar1,dnbinom(seq(0,10),mu=mean(dat5$cover),

size=(mean(dat5$cover)^2)/(var(dat5$cover)-mean(dat5$cover)))*sum(table(dat5$cover)),

type="b",cex=2,col="salmon",lwd=2,pch=19)

The Poisson curve (blue) doesn't cut it, and note that the variance of the counts is now nearly twice the mean.  The salmon curve is the fit from the built-in dnbinom function, with specified values of the mean (calculated from the cover vector) and the overdispersion parameter k, estimated here as the mean squared divided by (variance – mean).  We still have way too many zeros even for NB, but is NB a better model here?

> nb2 = glm.nb(cover~disturb*elev,data=dat5)

> glm5 = glm(cover~disturb*elev,data=dat5,family=poisson)

> AIC(nb2,glm5)

df      AIC

nb2   9 4788.591

glm5  8 5287.842

No contest.  But we still have a major problem in that our plot diagnostics (and the residual deviance compared to the residual df) suggest that we've only partly dealt with overdispersion:

> plot(nb2)

It's no surprise that the left-hand tail of our residuals is significantly higher than expected.  This brings up an important note about overdispersion (mentioned  by Crawley): severe overdispersion suggests there is a hidden factor that you are not accounting for in your model.  Here, I might suggest that the complete absence of hemlock results from a very different ecological process than that controlling its abundance once it is present.  Ideally, if we insisted on modeling both absences and abundances simultaneously, we would use a coupled model of two components, one predicting presences and absences, and another predicting its abundance once it's there.  We'll do more of this later.

Binary response variables

Occupancy (presence-absence) data involve a response variable defined by one of two states; in statistical parlance this is a Bernoulli trial (heads or tails), or a binomial process where N=1.  In this case we're interested in the probability of 'success' (presence) given values of one or more independent variables.  In R this is done via a glm with family=binomial, with the link function either taken as the default (link="logit") or the user-specified complementary log-log (link="cloglog").  Crawley suggests the choice of link function should be determined by trying both and taking the fit with the lowest model deviance.  Unlike Poisson models or binomial models with N>1, overdispersion is not possible with a binary response variable, so there is no associated overdispersion function for binary data in glm.

Let's take our overdispersed hemlock count data and convert all abundances to 1, thereby creating a presence-absence vector:

> dat5$pa = dat5$cover

> dat5$pa[dat5$pa>0] = 1

We've defined a new column in dataset 'dat5' named 'pa', containing only 1s and 0s for presences and absences.  Let's pick up our above thread and fit an ANCOVA with disturbance class and elevation:

> bin.glm1 = glm(pa~disturb*elev,data=dat5,family=binomial)

> summary(bin.glm1)

Call:

glm(formula = pa ~ disturb * elev, family = binomial, data = dat5)

Deviance Residuals:

Min       1Q   Median       3Q      Max

-2.6619  -0.6417   0.5740   0.7278   1.8803

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept)         2.4435687  0.5939061   4.114 3.88e-05 ***

disturbLT-SEL       1.2989502  0.7147168   1.817  0.06915 .

disturbSETTLE      -0.7848815  0.8840129  -0.888  0.37461

disturbVIRGIN       7.0985933  1.2561157   5.651 1.59e-08 ***

elev               -0.0013617  0.0004864  -2.800  0.00511 **

disturbLT-SEL:elev -0.0016463  0.0006325  -2.603  0.00924 **

disturbSETTLE:elev  0.0007611  0.0010745   0.708  0.47875

disturbVIRGIN:elev -0.0054999  0.0009433  -5.830 5.53e-09 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1307.5  on 1067  degrees of freedom

Residual deviance: 1063.7  on 1060  degrees of freedom

AIC: 1079.7

Number of Fisher Scoring iterations: 5

Plenty of evidence that both factors matter. Let's test this model with sequential removal of terms:

> step(bin.glm1)

Start:  AIC=1079.67

pa ~ disturb * elev

Df Deviance    AIC

<none>              1063.7 1079.7

- disturb:elev  3   1113.0 1123.0

Call:  glm(formula = pa ~ disturb * elev, family = binomial, data = dat5)

Coefficients:

(Intercept)       disturbLT-SEL       disturbSETTLE       disturbVIRGIN                elev

2.4435687           1.2989502          -0.7848815           7.0985933          -0.0013617

disturbLT-SEL:elev  disturbSETTLE:elev  disturbVIRGIN:elev

-0.0016463           0.0007611          -0.0054999

Degrees of Freedom: 1067 Total (i.e. Null);  1060 Residual

Null Deviance:      1308

Residual Deviance: 1064         AIC: 1080

Strong support for including the interaction.  Now let's make sure the logit link function gives the lowest deviance, by fitting the same model with the cloglog link and comparing:

> bin.glm2 = glm(pa~disturb*elev,data=dat5,family=binomial(link="cloglog"))

> anova(bin.glm1,bin.glm2)

Analysis of Deviance Table

Model 1: pa ~ disturb * elev

Model 2: pa ~ disturb * elev

Resid. Df Resid. Dev Df Deviance

1      1060     1063.7

2      1060     1086.0  0  -22.346

The default logit link is much better (measured as lowest deviance; because the same number of parameters are used in both models, this is equivalent to comparing their AICs).  Let's see what the fitted model looks like on an occupancy graph against elevation for different subsets of disturbance classes.  We'll do this separately by disturbance type:

> x = seq(250,2000)

> with(dat5[dat5$disturb=="VIRGIN",],plot(elev,pa,cex=1.5,col="darkgreen",

xlim=c(250,2000)))

> lines(x,predict(bin.glm1,newdata=list(elev=x,disturb=factor(rep("VIRGIN",length(x)))),

type="response"),lwd=2,col="darkgreen")

This is the predicted logistic curve for the elevation response in VIRGIN stands.  We first made a new x axis vector, then plotted the presence-elevation data for a hemlock dataset restricted to VIRGIN plots, and then overlaid the predicted values from our fitted bin.glm1 model at each x value, but only using the coefficients specific to VIRGIN plots.  Don't forget that the 'response' type argument ensures our fitted values are on the scale of the original data rather than the logit link function.  Let's compare this distribution to that for CORPLOG plots:

> with(dat5[dat5$disturb=="CORPLOG",],points(elev,pa+.02,cex=1.5,col="blue"))

> lines(x,predict(bin.glm1,newdata=list(elev=x,disturb=factor(rep("CORPLOG",length(x)))),

type="response"),lwd=2,col="blue")

Apparently a history of heavy logging has blurred the relationship between hemlock presence and elevation.  Does this remain true with a history of selective cutting?

> with(dat5[dat5$disturb=="LT-SEL",],points(elev,pa-.02,cex=1.5,col="darkred"))

> lines(x,predict(bin.glm1,newdata=list(elev=x,disturb=factor(rep("LT-SEL",length(x)))),

type="response"),lwd=2,col="darkred")

The response is intermediate, which seems intuitive.

A note on proportion data

Although not a common case with species distribution data, you may run into a dependent variable where observations are actually proportions derived from a set of trials with a certain number of successes and failures—i.e., a binomial process where N>1.  You could simply model this as percentages (a continuous variable bounded by 0 and 1), which would be an appropriate case of using glm with family=binomial with the default logit link.  But if your observations differed in number of trials, you'd be throwing away the N associated with each datum.  Observations derived from more samples would seem intuitively more important in the regression than those from small samples, even though their proportions may be the same.  R can handle this using glm with the binomial(link="logit") family, with a dependent variable that is actually a two-column object, the first column being the number of 'successes' and the second the number of 'failures'.
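A sketch of the mechanics with simulated data (all names here are illustrative): sites differ in their number of 'trials', and the two-column cbind(successes, failures) response lets glm weight them accordingly:

```r
set.seed(4)
n.trials <- sample(5:50, 100, replace = TRUE)  # unequal effort per site
x <- runif(100)
p <- plogis(-1 + 2 * x)                        # true success probability
succ <- rbinom(100, n.trials, p)

# Two-column response: cbind(# successes, # failures)
fit <- glm(cbind(succ, n.trials - succ) ~ x, family = binomial)
coef(fit)   # estimates should land near the true values (-1, 2)
```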

One application of this to species distribution data is when surveys include differential survey effort per observation.  For example, in tree inventories, plots may vary in the total number of trees (individuals) sampled due to differences in quadrat size (or naturally due to environmental or historical factors).  If you have information on the total number of samples per site and a similar measure of abundance for individual species, you can model abundance as a binomial process of 'successes' (individuals of the focal species) and 'failures' (individuals of other species).  Essentially R will then treat this as a weighted regression, with sites of more individuals influencing parameter estimates more than those with fewer.

Here's an example by modeling hemlock cover with respect to total cover.  The example is a bit forced (as these are not actual samples of individuals) but you get the idea.  We start by generating the summed cover of all species for each plot:

> sumcover = tapply(dat$cover,dat$plotID,sum)

> coverdat = data.frame(names(sumcover),sumcover)

Now we merge these data with our dat5 hemlock object, based on plotID:

> dat6 = merge(dat5,coverdat,all.x=T,by.x=1,by.y=1)

And create our response variables of hemlock cover (successes) and total cover minus hemlock cover (failures):

> cover.y = cbind(dat6$cover,dat6$sumcover-dat6$cover)

We can now model these data as a binomial process of varying number of trials per observation:

> glm6 = glm(cover.y~disturb*elev,data=dat6,family=binomial)

> summary(glm6)

Call:

glm(formula = cover.y ~ disturb * elev, family = binomial, data = dat6)

Deviance Residuals:

Min       1Q   Median       3Q      Max

-5.1115  -1.4730  -0.4089   1.4116   5.1429

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept)        -2.528e+00  1.385e-01 -18.258  < 2e-16 ***

disturbLT-SEL       2.766e-01  1.625e-01   1.702 0.088828 .

disturbSETTLE      -2.065e+00  2.217e-01  -9.312  < 2e-16 ***

disturbVIRGIN       3.722e-01  2.217e-01   1.679 0.093245 .

elev               -8.036e-05  1.228e-04  -0.655 0.512737

disturbLT-SEL:elev -6.051e-04  1.628e-04  -3.716 0.000202 ***

disturbSETTLE:elev  2.941e-03  2.882e-04  10.207  < 2e-16 ***

disturbVIRGIN:elev -2.089e-04  1.908e-04  -1.095 0.273590

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 4068  on 1067  degrees of freedom

Residual deviance: 3866  on 1060  degrees of freedom

AIC: 6241.6

Number of Fisher Scoring iterations: 5

Note coefficient values are on the logit-transformed scale, as above.  You can also get overdispersion with proportion data.  The family for this is quasibinomial, and it works just like quasipoisson.
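A quick sketch of the quasibinomial case (simulated overdispersed proportion data, illustrative only): extra-binomial variation is induced by letting the true probability vary among sites, and the fitted dispersion comes out above 1:

```r
set.seed(7)
n <- rep(20, 150)                              # 20 trials per site
x <- runif(150)
# Site-level noise on the logit scale creates extra-binomial variation
p <- plogis(-0.5 + x + rnorm(150, sd = 0.8))
succ <- rbinom(150, n, p)

fit.qb <- glm(cbind(succ, n - succ) ~ x, family = quasibinomial)
summary(fit.qb)$dispersion   # above 1, flagging the overdispersion
```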

Note that there will be occasions where you may wonder whether it makes more sense to model observations as binary (0s and 1s) or lump them to form counts summarized by particular factors (e.g., 47 presences, 80 absences in a particular disturbance history class).  If your predictor variables are factor variables that are common to all occurrences, it may make more sense to model these as proportion data based on counts.  But if each occurrence has a unique value of a continuous variable (e.g., in an ANCOVA with both factor and continuous variables), a binary response is logical.

Two more points on when and when not to use GLMs with species distribution data:

1). If you have abundance measured in percent cover (where sample effort is equal for each observation), Crawley suggests using lm with an arcsine transformation (sin^-1(sqrt(0.01*%cover))) of the dependent variable.  As an exercise, you can convert our cover class data back into mean cover percentages (see the metadata) and examine model residuals from simple lm regressions.

2). Because link functions, even with the Gaussian (normal) family, can incorporate transformations of the response variable (such as log(y)), one might wonder whether it makes more sense to use lm with log(y) or glm with link="log".  This is apparently something of an art, but Crawley suggests the rule of thumb that if the variance of the response is constant (assessed, e.g., with a plot of the residuals against fitted values), use the log link function; if not, transform y.
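Both points can be sketched with made-up numbers (illustrative only): point 1 is just a transformed response for lm, and point 2 contrasts a log-transformed lm with a log-link Gaussian glm fit to the same exponential trend:

```r
# 1) Arcsine-square-root transform of percent cover, for use with lm
pct.cover <- c(5, 20, 50, 80, 95)
asin(sqrt(0.01 * pct.cover))   # e.g. asin(sqrt(0.5)) = pi/4

# 2) Same exponential trend, two different error assumptions
set.seed(6)
x <- runif(100, 2, 10)
y <- exp(0.3 * x) + rnorm(100, sd = 0.3)   # constant variance on y's scale

fit.loglink <- glm(y ~ x, family = gaussian(link = "log"))  # log link
fit.logy    <- lm(log(y) ~ x)                               # transformed y
```

The glm assumes constant variance on the original scale of y (matching how these data were simulated), while the lm on log(y) assumes constant variance on the log scale; the residual-vs-fitted plot is the usual way to decide between them.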