**Generalized additive models in R**

GAMs in R are a nonparametric extension of GLMs, often used when you have no a priori reason to choose a particular response function (linear, quadratic, etc.) and want the data to 'speak for themselves'. GAMs do this via a smoothing function, similar to what you may already know about locally weighted regressions. GAMs take each predictor variable in the model, divide it into sections (delimited by 'knots'), and fit polynomial functions to each section separately, with the constraint that there are no kinks at the knots (derivatives of the separate functions, up to the second derivative, match at the knots). The number of parameters used for such fitting is obviously greater than would be necessary for a simpler parametric fit to the same data, but computational shortcuts (penalized fitting) mean the model degrees of freedom are usually lower than you might expect from a line with so much 'wiggliness'. Indeed this is the principal statistical issue associated with GAM modeling: minimizing residual deviance (goodness of fit) while maximizing parsimony (lowest possible degrees of freedom). Since the model fit is based on deviance/likelihood, fitted models are directly comparable with GLMs using likelihood techniques (like AIC) or classical tests based on model deviance (Chi-squared or F tests, depending on the error structure). Even better, all the error and link structures of GLMs are available in GAMs (including poisson and binomial), as are the standard suite of **lm** or **glm** methods (resid, fitted, summary, coef, etc.). One reason GAMs are often less preferred than GLMs is that the results can be difficult to interpret: no parameter values are returned (although significance tests of each term are). They can be very good for prediction/interpolation, as well as for exploratory analyses of the functional nature of a response. Some researchers examine the shape of a curve with GAMs, then reconstruct the curve shape parametrically with GLMs for model building.
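The penalized-smoothing idea can be sketched with simulated data (a minimal illustration; the data here are made up, and mgcv is the package we use later in this lesson):

```r
library(mgcv)

# simulated data: a wiggly true function plus noise
set.seed(1)
x = seq(0, 1, length = 200)
y = sin(6 * x) + rnorm(200, sd = 0.3)

# s() builds the smooth of x from (up to) k basis functions;
# the wiggliness penalty shrinks the effective df well below k
fit = gam(y ~ s(x, k = 20))
summary(fit)$edf  # effective df actually used by the smooth
```

Even though 20 basis functions are available, the reported edf is typically far smaller, which is the parsimony/deviance tradeoff described above.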

There are
two common implementations of GAMs in R.
The older version (originally made for S-PLUS) is available as the 'gam'
package by Hastie and Tibshirani. The
newer version that we will use below is the 'mgcv' package from Simon
Wood. The basic modeling procedure for
both packages is similar (the function is **gam**
for both; be wary of having both libraries loaded at the same time), but the
behind-the-scenes computational approaches differ, as do the arguments for
optimization and the model output.
Expect the results to be slightly different when used with the same
model structure on the same dataset.

Let's first examine output from the built-in function **loess**, which is a locally weighted polynomial regression. This time we'll try modeling yellow birch, and we'll include plots of birch absences.

dat = read.csv(file.choose()) #grab the treedata.csv file

dat2 = subset(dat,dat$species=="Betula alleghaniensis") #a yellow birch subset

b.plots = unique(as.character(dat2$plotID)) #plots with birch

u.plots = unique(as.character(dat$plotID)) #all plots

nob.plots = u.plots[is.element(u.plots,b.plots)==F] #plots without birch

dat3 = subset(dat,is.element(as.character(dat$plotID),nob.plots)) #dataset of no birch

dat4 = subset(dat3,duplicated(dat3$plotID)==F) #one row per plot

dat4$cover = 0 #cover of birch is zero in these plots

dat5 = rbind(dat2,dat4) #new dataframe of presences and absences

As from last time, dat5 is our new data frame with both presences and absences, this time for yellow birch.

ls1 = loess(as.numeric(cover>0)~elev,data=dat5)

summary(ls1)

Call:
loess(formula = as.numeric(cover > 0) ~ elev, data = dat5)

Number of Observations: 1009
Equivalent Number of Parameters: 4.69
Residual Standard Error: 0.3931
Trace of smoother matrix: 5.11

Control settings:
  normalize:  TRUE
  span     :  0.75
  degree   :  2
  family   :  gaussian
  surface  :  interpolate      cell = 0.2

The summary of the loess model gives the precision (SE) of the fit and the (default) arguments used. We can plot the result using predict. (Note predict with loess requires newdata to be entered as a data.frame rather than a list.)

with(dat5,plot(elev,cover>0))

x = seq(0,2000)

lines(x,predict(ls1,newdata=data.frame(elev=x)),col="tomato",lwd=2)

The curve looks oddly classical-niche-like. The loess function gives a straightforward
way to create smooth responses for simple regressions, but its utility ends
there: the fit is not based on likelihood and there is no easy way to compare
whether this model fits better than other (e.g., parametric) models, nor is
there a way to accommodate non-Gaussian error structures. The older **lowess** function (ported from S-PLUS) does essentially the same thing and is typically used for drawing trendlines on plots (rather than creating model objects).
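For comparison, a **lowess** trendline can be overlaid on the existing plot (a sketch assuming the dat5 plot above is still open; the color choice is arbitrary):

```r
# lowess returns plotting coordinates, not a model object
with(dat5, lines(lowess(elev, as.numeric(cover > 0)), col = "gray40", lwd = 2))
```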

Now we can
examine what **gam** can do. The
formulation of a gam model is nearly exactly the same as for **glm**; all
the same families and link functions apply.
The only difference is wrapping the predictors in a non-parametric
smoother function, s().

install.packages("mgcv")

library(mgcv)

gam1 = gam(cover>0~s(elev),family=binomial,data=dat5)

summary(gam1)

Family: binomial

Link function: logit

Formula:

cover > 0 ~ s(elev)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.37157    0.08805   -4.22 2.44e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
          edf Ref.df Chi.sq p-value
s(elev) 8.266   8.85  298.6  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.396   Deviance explained = 32.5%
UBRE score = -0.056009   Scale est. = 1   n = 1009

points(dat5$elev,fitted(gam1),col="springgreen",lwd=2)

You can see right away that the **gam** fit here was more
sensitive to minimizing deviance (higher wiggliness) than the default fit of
the **loess** function. The gam was
also able to minimize deviance based on the logit transformation. The model output shows that an overall
(parametric) intercept was fit (the mean), -0.37157. But this is on the scale of the logit
transformation. To get the value on the
scale of actual probability we use the inverse of the logit function:

1/(1+1/exp(coef(gam1)[1]))

(Intercept)

0.4081611
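Base R also provides this inverse logit directly as plogis, so an equivalent (and less error-prone) version is:

```r
plogis(coef(gam1)[1])  # identical to 1/(1+1/exp(coef(gam1)[1]))
```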

The parametric estimates are listed with tests of significance against a null of zero. A summary of the smoothed terms is listed next, which here is only s(elev). Reported statistics include the estimated degrees of freedom (edf) and a test of whether the smoothed function significantly reduces model deviance. Finally, a pseudo-R2 is available using deviance explained (here 32.5%); this statistic is just 1 - (residual deviance/null deviance).
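mgcv stores both deviances as components of the fitted object, so the deviance explained can be checked by hand:

```r
1 - gam1$deviance/gam1$null.deviance  # should match the reported 32.5%
```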

Let's
compare the fit of this model against a simpler parametric binary regression
using **glm**:

glm1 = glm(cover>0~elev,family=binomial,data=dat5)

lines(x,predict(glm1,newdata=list(elev=x),type="response"),lwd=3,col="turquoise")

summary(glm1)

Call:
glm(formula = cover > 0 ~ elev, family = binomial, data = dat5)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4790  -0.7775  -0.4832   0.8654   2.3320

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.9132171  0.2558538  -15.29   <2e-16 ***
elev         0.0034832  0.0002306   15.11   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1382.7  on 1008  degrees of freedom
Residual deviance: 1057.5  on 1007  degrees of freedom
AIC: 1061.5

Number of Fisher Scoring iterations: 4

There are several ways to compare the gam1 and glm1 models. First, we can take the easy way out and compare the AICs:

AIC(gam1,glm1)

           df       AIC
gam1 9.266309  952.4866
glm1 2.000000 1061.4910

Although the gam1 model faces a stiffer penalty due to the need for many more df, the lower residual deviance more than makes up for it and the AIC is substantially lower. What is the difference in explained deviance?

1-(glm1$dev/glm1$null)

[1] 0.2352225

The gam1 explained deviance was 32.5%, but that for glm1 is only 23.5%. Finally, the residual deviance and df for both models mean we can compare them via a Chi-square test using **anova**:

anova(gam1,glm1,test="Chi")

Analysis of Deviance Table

Model 1: cover > 0 ~ s(elev)

Model 2: cover > 0 ~ elev

  Resid. Df Resid. Dev      Df Deviance  P(>|Chi|)
1    999.73     933.95
2   1007.00    1057.49 -7.2663  -123.54  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The clear winner is gam1. Looking at the plot, we suspect it is due to the apparently hump-shaped (or even bimodal) distribution of yellow birch along the elevation gradient: our glm1 model is strictly monotonic. Of course, we can allow for a unimodal possibility in the glm by adding a quadratic term:

glm2 = glm(cover>0~elev+I(elev^2),family=binomial,data=dat5)

lines(x,predict(glm2,newdata=list(elev=x),type="response"),lwd=3,col="purple")

anova(gam1,glm1,glm2,test="Chi")

Analysis of Deviance Table

Model 1: cover > 0 ~ s(elev)

Model 2: cover > 0 ~ elev

Model 3: cover > 0 ~ elev + I(elev^2)

  Resid. Df Resid. Dev      Df Deviance  P(>|Chi|)
1    999.73     933.95
2   1007.00    1057.49 -7.2663 -123.537  < 2.2e-16 ***
3   1006.00    1011.99  1.0000   45.505  1.523e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It's better, but it still doesn't come close to the goodness
of fit of the gam1. You can create a
simple plot of the fitted gam with **plot**,
including residuals. Unless you tell it
otherwise, the fit is shown on the scale of the link function. The trans argument can be given the inverse
link function (here the inverse logit) to scale the y axis for the desired
probabilities like this:

windows() #this creates a new (second) plot window

plot(gam1,residuals=T,trans=function(x)exp(x)/(1+exp(x)),shade=T)

The handy part of these plots is the display of error: the bands span 2*SE, so roughly 95% of the predicted values fall within the gray area (with shade=F these bounds are drawn as dashed lines instead). Also note what R calls the 'rug': the dashes on the x axis that show the distribution of x values (elevation). Here the areas with few plots have the highest associated error.
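If you prefer bands on the probability scale built by hand, predict with se.fit=TRUE returns standard errors on the link scale that can be back-transformed (a sketch assuming gam1 and the x sequence defined earlier):

```r
# predict on the link (logit) scale with standard errors
pr = predict(gam1, newdata = data.frame(elev = x), type = "link", se.fit = TRUE)
inv.logit = function(z) exp(z)/(1 + exp(z))
with(dat5, plot(elev, as.numeric(cover > 0)))
lines(x, inv.logit(pr$fit), lwd = 2)
lines(x, inv.logit(pr$fit + 2*pr$se.fit), lty = 2)  # approximate 95% bounds
lines(x, inv.logit(pr$fit - 2*pr$se.fit), lty = 2)
```

Back-transforming the interval endpoints (rather than adding SEs on the probability scale) keeps the bounds inside [0, 1].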

The specification of terms within GAMs can be a complicated assortment of smoothed terms, parametric terms, 'isotropic' smoothing of two or more variables, and 'overlapping' smooths. Let's take a second continuous variable like transformed aspect (Beers index: from zero [SW, or hottest] to 2 [NE, coolest]) and model yellow birch with respect to elevation and this heat-load index in several different ways:

gam2 = gam(cover>0~s(elev)+s(beers),family=binomial,data=dat5)

gam3 = gam(cover>0~s(elev)+beers,family=binomial,data=dat5)

gam4 = gam(cover>0~s(elev)+s(beers)+s(elev,by=beers),family=binomial,data=dat5)

gam5 = gam(cover>0~s(elev)+s(beers)+te(elev,beers),family=binomial,data=dat5)

The first model fits both predictor variables with their own independent smoothing functions. We get back a parametric intercept and two tests of significance for the smoothed variables:

summary(gam2)

Family: binomial
Link function: logit

Formula:
cover > 0 ~ s(elev) + s(beers)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.39007    0.08954  -4.357 1.32e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
           edf Ref.df Chi.sq  p-value
s(elev)  8.299  8.863 294.83  < 2e-16 ***
s(beers) 1.952  2.418  26.30 3.69e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.421   Deviance explained = 34.6%
UBRE score = -0.081485   Scale est. = 1   n = 1009

Both variables appear to be significant, and the explained
deviance has gone up to 34.6% (from 32.5%).
We can test this model against gam1 specifically by using **anova**:

anova(gam1,gam2,test="Chi")

Analysis of Deviance Table

Model 1: cover > 0 ~ s(elev)

Model 2: cover > 0 ~ s(elev) + s(beers)

  Resid. Df Resid. Dev     Df Deviance P(>|Chi|)
1    999.73     933.95
2    997.75     904.28 1.9849   29.675  3.51e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The lower deviance with very little cost in model df (basically 2) leads to a much better model. What do the predicted responses look like?

plot(gam2,trans=function(x)exp(x)/(1+exp(x)),shade=T,pages=1)

Yellow birch appears to prefer cooler and shadier spots in addition to mid-high elevation. But the response to the Beers-transformed aspect looks basically linear: should this term just be a (non-smoothed) parametric fit? Our gam3 model is a GAM-GLM hybrid that has a smoothed elevation term and a parametric beers term:

anova(gam2,gam3,test="Chi")

Analysis of Deviance Table

Model 1: cover > 0 ~ s(elev) + s(beers)

Model 2: cover > 0 ~ s(elev) + beers

  Resid. Df Resid. Dev     Df Deviance P(>|Chi|)
1    997.75     904.28
2    998.68     909.39 -0.936  -5.1077   0.02146 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Surprisingly, gam2 is actually the better model (lower residual deviance) even though it is more highly parameterized. You can check AIC to see if the likelihood-based assessment agrees.
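That check is one line:

```r
AIC(gam2, gam3)  # lower AIC indicates the better likelihood-based tradeoff
```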

The next two models are two different ways to assess interaction terms in a GAM model. Unfortunately, assessing interactions in GAMs is anything but straightforward. The first thing to understand is that this model structure is not usually what you want:

summary(gam(cover>0~s(elev)+s(beers)+s(elev,beers),family=binomial,data=dat5))

Family: binomial
Link function: logit

Formula:
cover > 0 ~ s(elev) + s(beers) + s(elev, beers)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.39008    0.08954  -4.357 1.32e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                   edf  Ref.df  Chi.sq  p-value
s(elev)       8.298849 8.86203  275.49  < 2e-16 ***
s(beers)      1.951841 2.41812   26.30 3.69e-06 ***
s(elev,beers) 0.002208 0.00384 3.4e-08       NA
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.421   Deviance explained = 34.6%
UBRE score = -0.081484   Scale est. = 1   n = 1009

In fact the fit here is equal to the fit of gam2 (the df and residual deviance are identical), and the 'interaction' term is not tested. What the third term does is scale the smoothing functions equally for the two variables (isotropic smoothing). This may come in handy for truly isotropic variables (e.g., modeling with spatial distances in 2D or 3D), but otherwise doesn't get you closer to understanding whether the significance of one variable in the model depends on values of another. Instead, you can use the form of either gam4 (interaction fit isotropically; note the "by" argument here) or gam5 (interaction fit non-isotropically using the "tensor product" smoothing function, te). The mgcv developer, Simon Wood, suggests "tensor product smooths often perform better than isotropic smooths when the covariates of a smooth are not naturally on the same scale, so that their relative scaling is arbitrary"; in other words, if the variables are not in the same units, you're probably better off using the structure of gam5.

summary(gam5)

Family: binomial
Link function: logit

Formula:
cover > 0 ~ s(elev) + s(beers) + te(elev, beers)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.35987    0.09143  -3.936 8.29e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                 edf Ref.df  Chi.sq p-value
s(elev)        8.311  8.868 243.283  <2e-16 ***
s(beers)       1.915  2.373   7.361  0.0364 *
te(elev,beers) 1.001  1.002   2.053  0.1523
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.422   Deviance explained = 34.8%
UBRE score = -0.081586   Scale est. = 1   n = 1009

Now we have a significance test associated with the interaction, which is here not significant. A plot of this model:

plot(gam5,trans=function(x)exp(x)/(1+exp(x)),shade=T,pages=1)

shows the smoothed response of cover to each variable, plus contour lines of smoothed cover against both elevation and beers variables.
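mgcv's vis.gam function gives another view of the joint fitted surface (the argument choices here are just one reasonable option):

```r
# fitted probability of birch presence over both predictors
vis.gam(gam5, view = c("elev", "beers"), type = "response", plot.type = "contour")
```

Using plot.type="persp" instead draws the same surface in 3D.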

The "by" formulation is also appropriate for an ANCOVA structure, where we want to assess whether the smoothed fit varies according to different classes of some factor variable. Here's an example using the distribution of yellow birch with respect to elevation and disturbance history:

gam6 = gam(cover>0~s(elev)+disturb+te(elev,by=disturb,k=4),family=binomial,data=dat5)

plot(gam6,trans=function(x)exp(x)/(1+exp(x)),shade=T,pages=1)

The model was fit as a hybrid of a smoothed response to elevation, a parametric (ANOVA-type) response to disturbance class (see the standard contrasts below), and a tensor-product smoothed response to elevation separated by disturbance class (here not significant). The graph shows the overall elevation response and elevation responses restricted to disturbance classes. The large error associated with most of these is consistent with the interaction being insignificant. The "k=4" argument limits the number of knots in each smoothing of elevation within disturbance classes; I did this because this is the maximum number the 5-class disturbance variable can support (try it without k specified: you should get an error).

summary(gam6)

Family: binomial
Link function: logit

Formula:
cover > 0 ~ s(elev) + disturb + te(elev, disturb, k = 4)

Parametric coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.04427    0.37272   0.119    0.905
disturbLT-SEL -0.49432    0.45897  -1.077    0.281
disturbSETTLE -0.25341    0.60576  -0.418    0.676
disturbVIRGIN -0.75584    0.68266  -1.107    0.268

Approximate significance of smooth terms:
                   edf Ref.df Chi.sq  p-value
s(elev)          8.203  8.816  80.45 1.06e-13 ***
te(elev,disturb) 2.792  3.633   2.02    0.678
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.4   Deviance explained = 33.1%
UBRE score = -0.054127   Scale est. = 1   n = 1009

Here's an
example of the procedure used by Bio et al. (1998, JVS 9:5-16) to compare
models fit by GLMs and GAMs, using the **gam**
function in the gam package. Using our
yellow birch dataset:

install.packages("gam")

library(gam)

m0 = glm(cover>0~1,family=binomial,data=dat5)

m1 = glm(cover>0~elev,family=binomial,data=dat5)

m2 = glm(cover>0~elev+I(elev^2),family=binomial,data=dat5)

m3 = gam(cover>0~s(elev,3),family=binomial,data=dat5) #smoother with 3 df

m4 = gam(cover>0~s(elev,4),family=binomial,data=dat5) #smoother with 4 df

anova(m0,m1,m2,m3,m4,test="Chi")

Analysis of Deviance Table

Model 1: cover > 0 ~ 1

Model 2: cover > 0 ~ elev

Model 3: cover > 0 ~ elev + I(elev^2)

Model 4: cover > 0 ~ s(elev, 3)

Model 5: cover > 0 ~ s(elev, 4)

  Resid. Df Resid. Dev     Df Deviance  P(>|Chi|)
1      1008    1382.74
2      1007    1057.49 1.0000   325.25  < 2.2e-16 ***
3      1006    1011.99 1.0000    45.50  1.523e-11 ***
4      1005     978.88 1.0000    33.11  8.712e-09 ***
5      1004     965.93 1.0002    12.95    0.00032 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All the models are better than the null model (intercept only), and the lowest deviance belongs to m4 (the 4-df smoother). This term (s(elev,4)) would then be included in the model, and the next environmental variable would be tested using the current m4 as the null model. Like Bio et al., we can also compare the 4-df smoother fit with a 4th-degree polynomial fit in a GLM:

m5 = glm(cover>0~elev+I(elev^2)+I(elev^3)+I(elev^4),family=binomial,data=dat5)

plot(x,predict(m4,newdata=list(elev=x),type="response"),pch=19,col="steelblue",
ylim=c(0,1),ylab="Probability",xlab="Elevation")

points(x,predict(m5,newdata=list(elev=x),type="response"),pch=19,col="orchid")

anova(m4,m5,test="Chi")

Analysis of Deviance Table

Model 1: cover > 0 ~ s(elev, 4)

Model 2: cover > 0 ~ elev + I(elev^2) + I(elev^3) + I(elev^4)

  Resid. Df Resid. Dev          Df Deviance P(>|Chi|)
1      1004     965.93
2      1004     973.77 -0.00025783  -7.8438 5.358e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Like Bio et al., we conclude that the high-order polynomial fit leaves substantially more residual deviance than the equivalent nonparametric GAM smoother.

Finally, you may wish to experiment with a GAM variant called BRUTO (an adaptive backfitting method also used in flexible discriminant analysis), available as the **bruto** function in the "mda" package. Elith et al. (Ecography 29: 129-151) suggest this method often runs faster than the traditional GAM approach and has some convenient options for term selection. However, it only accepts Gaussian error, so working with binary data is difficult. Unfortunately I was unable to get the **bruto** function to work with the above Smokies data; Elith et al. also had problems with the R implementation and used S-PLUS instead.