Generalized additive models in R

 

GAMs in R are a nonparametric extension of GLMs, used often for the case when you have no a priori reason for choosing a particular response function (such as linear, quadratic, etc.) and want the data to 'speak for themselves'.  GAMs do this via a smoothing function, similar to what you may already know about locally weighted regressions.  GAMs take each predictor variable in the model and separate it into sections (delimited by 'knots'), and then fit polynomial functions to each section separately, with the constraint that there are no kinks at the knots (second derivatives of the separate functions are equal at the knots).  The number of parameters used for such fitting is obviously more than what would be necessary for a simpler parametric fit to the same data, but computational shortcuts mean the model degrees of freedom is usually lower than what you might expect from a line with so much 'wiggliness'.  Indeed this is the principal statistical issue associated with GAM modeling: minimizing residual deviance (goodness of fit) while maximizing parsimony (lowest possible degrees of freedom).  Since the model fit is based on deviance/likelihood, fitted models are directly comparable with GLMs using likelihood techniques (like AIC) or classical tests based on model deviance (Chi-squared or F tests, depending on the error structure).  Even better, all the error and link structures of GLMs are available in GAMs (including poisson and binomial), as are the standard suite of lm or glm attributes (resid, fitted, summary, coef, etc.).  A principal reason why GAMs are often less preferred than GLMs is that the results are often difficult to interpret because no parameter values are returned (although significance tests of each term are).  They can be very good for prediction/interpolation, as well as exploratory analyses about the functional nature of a response.  Some researchers examine the shape of a curve with GAMs, then reconstruct the curve shape parametrically with GLMs for model building.

            There are two common implementations of GAMs in R.  The older version (originally made for S-PLUS) is available as the 'gam' package by Hastie and Tibshirani.  The newer version that we will use below is the 'mgcv' package from Simon Wood.  The basic modeling procedure for both packages is similar (the function is gam for both; be wary of having both libraries loaded at the same time), but the behind-the-scenes computational approaches differ, as do the arguments for optimization and the model output.  Expect the results to be slightly different when used with the same model structure on the same dataset.

Let's first examine output from the built-in function loess, which is a locally weighted polynomial regression.  This time we'll try modeling yellow birch, and we'll include plots of birch absences.

dat = read.csv(file.choose())     #grab the treedata.csv file

dat2 = subset(dat,dat$species=="Betula alleghaniensis")   #a yellow birch subset

b.plots = unique(as.character(dat2$plotID)) #plots with birch

u.plots = unique(as.character(dat$plotID)) #all plots

nob.plots = u.plots[is.element(u.plots,b.plots)==F] #plots without birch

dat3 = subset(dat,is.element(as.character(dat$plotID),nob.plots)) #datast of no birch

dat4 = subset(dat3,duplicated(dat3$plotID)==F)  #one row per plot

dat4$cover = 0     #cover of birch is zero in these plots

dat5 = rbind(dat2,dat4) #new dataframe of presences and absences

As from last time, dat5 is our new data frame with both presences and absences, this time for yellow birch.

ls1 = loess(cover>0~elev,data=dat5)

summary(ls1)

Call:

loess(formula = as.numeric(cover > 0) ~ elev, data = dat5)

 

Number of Observations: 1009

Equivalent Number of Parameters: 4.69

Residual Standard Error: 0.3931

Trace of smoother matrix: 5.11

 

Control settings:

  normalize:  TRUE

  span      :  0.75

  degree   :  2

  family   :  gaussian

  surface  :  interpolate         cell = 0.2

The summary of the loess model gives the precision (SE) of the fit and the (default) arguments used.  We can plot the result using predict. (Note predict with loess requires newdata to be entered as a data.frame rather than a list.)

with(dat5,plot(elev,cover>0))

x = seq(0,2000)

lines(x,predict(ls1,newdata=data.frame(elev=x)),col="tomato",lwd=2)

The curve looks oddly classical-niche-like.  The loess function gives a straightforward way to create smooth responses for simple regressions, but its utility ends there: the fit is not based on likelihood and there is no easy way to compare whether this model fits better than other (e.g., parametric) models, nor is there a way to accommodate non-Gaussian error functions.  The older lowess function (ported from S-PLUS) does essentially the same thing, used for plotting trendlines (rather than creating models).

            Now we can examine what gam can do.  The formulation of a gam model is nearly exactly the same as for glm; all the same families and link functions apply.  The only difference is wrapping the predictors in a non-parametric smoother function, s().

install.packages("mgcv")

library(mgcv)

gam1 = gam(cover>0~s(elev),family=binomial,data=dat5)

summary(gam1)

Family: binomial

Link function: logit

 

Formula:

cover > 0 ~ s(elev)

 

Parametric coefficients:

            Estimate Std. Error z value Pr(>|z|)   

(Intercept) -0.37157    0.08805   -4.22 2.44e-05 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Approximate significance of smooth terms:

          edf Ref.df Chi.sq p-value   

s(elev) 8.266   8.85  298.6  <2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

R-sq.(adj) =  0.396   Deviance explained = 32.5%

UBRE score = -0.056009  Scale est. = 1         n = 1009

points(dat5$elev,fitted(gam1),col="springgreen",lwd=2)

You can see right away that the gam fit here was more sensitive to minimizing deviance (higher wiggliness) than the default fit of the loess function.  The gam was also able to minimize deviance based on the logit transformation.  The model output shows that an overall (parametric) intercept was fit (the mean), -0.37157.  But this is on the scale of the logit transformation.  To get the value on the scale of actual probability we use the inverse of the logit function:

1/(1+1/exp(coef(gam1)[1]))

(Intercept)

  0.4081611

The parametric estimates are listed with tests of significance against a null of zero.  A summary of the smoothed terms are listed next, which here is only s(elev).  Reported statistics include the estimated degrees of freedom (edf) and a test of whether the smoothed function significantly reduces model deviance.  Finally, a pseudo-R2 is available using deviance explained (here 32.5%); this statistic is just 1 – (residual deviance/null deviance). 

            Let's compare the fit of this model against a simpler parametric binary regression using glm:

glm1 = glm(cover>0~elev,family=binomial,data=dat5)

lines(x,predict(glm1,newdata=list(elev=x),type="response"),lwd=3,col="turquoise")

summary(glm1)

Call:

glm(formula = cover > 0 ~ elev, family = binomial, data = dat5)

 

Deviance Residuals:

    Min       1Q   Median       3Q      Max 

-2.4790  -0.7775  -0.4832   0.8654   2.3320 

 

Coefficients:

              Estimate Std. Error z value Pr(>|z|)   

(Intercept) -3.9132171  0.2558538  -15.29   <2e-16 ***

elev         0.0034832  0.0002306   15.11   <2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

(Dispersion parameter for binomial family taken to be 1)

 

    Null deviance: 1382.7  on 1008  degrees of freedom

Residual deviance: 1057.5  on 1007  degrees of freedom

AIC: 1061.5

 

Number of Fisher Scoring iterations: 4

There are several ways to compare the gam1 and glm1 models.  First, we can take the easy way out and compare the AICs:

AIC(gam1,glm1)

           df       AIC

gam1 9.266309  952.4866

glm1 2.000000 1061.4910

Although the gam1 model faces a stiffer penalty due to the need for many more df, the lower residual deviance more than makes up for it and the AIC is substantially lower.  What is the difference in explained deviance?

1-(glm1$dev/glm1$null)

[1] 0.2352225

The gam1 explained deviance was 32.5%, but that for glm1 is only 23.5%.  Finally, the residual deviance and df for both models means we can compare them via a Chi-square test using anova:

anova(gam1,glm1,test="Chi")

Analysis of Deviance Table

 

Model 1: cover > 0 ~ s(elev)

Model 2: cover > 0 ~ elev

  Resid. Df Resid. Dev      Df Deviance P(>|Chi|)   

1    999.73     933.95                              

2   1007.00    1057.49 -7.2663  -123.54 < 2.2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The clear winner is gam1.  Looking at the plot, we suspect it is due to the apparently hump-shaped (or even bimodal) distribution of yellow birch along the elevation gradient: our glm1 model is strictly monotonic.  Of course, we can allow for a unimodal possibility in the glm by adding a quadratic term:

glm2 = glm(cover>0~elev+I(elev^2),family=binomial,data=dat5)

lines(x,predict(glm2,newdata=list(elev=x),type="response"),lwd=3,col="purple")

anova(gam1,glm1,glm2,test="Chi")

Analysis of Deviance Table

 

Model 1: cover > 0 ~ s(elev)

Model 2: cover > 0 ~ elev

Model 3: cover > 0 ~ elev + I(elev^2)

  Resid. Df Resid. Dev      Df Deviance P(>|Chi|)   

1    999.73     933.95                              

2   1007.00    1057.49 -7.2663 -123.537 < 2.2e-16 ***

3   1006.00    1011.99  1.0000   45.505 1.523e-11 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

It's better, but it still doesn't come close to the goodness of fit of the gam1.  You can create a simple plot of the fitted gam with plot, including residuals.  Unless you tell it otherwise, the fit is shown on the scale of the link function.  The trans argument can be given the inverse link function (here the inverse logit) to scale the y axis for the desired probabilities like this:

windows() #this creates a new (second) plot window

plot(gam1,residuals=T,trans=function(x)exp(x)/(1+exp(x)),shade=T)

The handy part of these plots is the display of error (2*SE, roughly 95% of the predicted values fall within the gray area, or have these drawn as dashed lines using shade=F).  Also note what R calls the 'rug': the dashes on the x axis that show the distribution of x values (elevation).  Here the areas of few plots have the highest associated error. 

            The specification of terms within GAMs can a complicated assortment of smoothed, parametric, 'isotropic' smoothing of two or more parameters, and 'overlapping' smoothing.  Let's take a second continuous variable like transformed aspect (Beers index: from zero [SW, or hottest] to 2 [NE, coolest]) and model yellow birch with respect to elevation and the heat-load index in several different ways:

gam2 = gam(cover>0~s(elev)+s(beers),family=binomial,data=dat5)

gam3 = gam(cover>0~s(elev)+beers,family=binomial,data=dat5)

gam4 = gam(cover>0~s(elev)+s(beers)+s(elev,by=beers),family=binomial,data=dat5)

gam5 = gam(cover>0~s(elev)+s(beers)+te(elev,beers),family=binomial,data=dat5)

The first model fits both predictor variables with their own independent smoothing functions.  We get back a parametric intercept and two tests of significance for the smoothed variables:

summary(gam2)

Family: binomial

Link function: logit

 

Formula:

cover > 0 ~ s(elev) + s(beers)

 

Parametric coefficients:

            Estimate Std. Error z value Pr(>|z|)   

(Intercept) -0.39007    0.08954  -4.357 1.32e-05 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Approximate significance of smooth terms:

           edf Ref.df Chi.sq  p-value   

s(elev)  8.299  8.863 294.83  < 2e-16 ***

s(beers) 1.952  2.418  26.30 3.69e-06 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

R-sq.(adj) =  0.421   Deviance explained = 34.6%

UBRE score = -0.081485  Scale est. = 1         n = 1009

Both variables appear to be significant, and the explained deviance has gone up to 34.6% (from 32.5%).  We can test this model against gam1 specifically by using anova:

anova(gam1,gam2,test="Chi")

Analysis of Deviance Table

 

Model 1: cover > 0 ~ s(elev)

Model 2: cover > 0 ~ s(elev) + s(beers)

  Resid. Df Resid. Dev     Df Deviance P(>|Chi|)   

1    999.73     933.95                             

2    997.75     904.28 1.9849   29.675  3.51e-07 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The lower deviance with very little cost in model df (basically 2) leads to a much better model.  What do the predicted responses look like?

plot(gam2,trans=function(x)exp(x)/(1+exp(x)),shade=T,pages=1)

Yellow birch appears to prefer cooler and shadier spots in addition to mid-high elevation.  But the response to the Beers-transformed aspect looks basically linear: should this term just be a (non-smoothed) parametric fit?  Our gam3 model is a GAM-GLM hybrid that has a smoothed elevation term and a parametric beers term:

anova(gam2,gam3,test="Chi")

Analysis of Deviance Table

 

Model 1: cover > 0 ~ s(elev) + s(beers)

Model 2: cover > 0 ~ s(elev) + beers

  Resid. Df Resid. Dev     Df Deviance P(>|Chi|) 

1    997.75     904.28                           

2    998.68     909.39 -0.936  -5.1077   0.02146 *

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Surprisingly, gam2 is actually the better model (lower residual deviance) even though it is more highly parameterized.  You can check AIC to see if the likelihood-based assessment agrees.

            The next two models are two different ways to assess interaction terms in a GAM model.  Unfortunately, assessing interactions in GAMs is anything but straightforward.  The first thing to understand is that this model structure is not usually what you want:

summary(gam(cover>0~s(elev)+s(beers)+s(elev,beers),family=binomial,data=dat5))

Family: binomial

Link function: logit

 

Formula:

cover > 0 ~ s(elev) + s(beers) + s(elev, beers)

 

Parametric coefficients:

            Estimate Std. Error z value Pr(>|z|)   

(Intercept) -0.39008    0.08954  -4.357 1.32e-05 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Approximate significance of smooth terms:

                   edf  Ref.df  Chi.sq  p-value   

s(elev)       8.298849 8.86203  275.49  < 2e-16 ***

s(beers)      1.951841 2.41812   26.30 3.69e-06 ***

s(elev,beers) 0.002208 0.00384 3.4e-08       NA   

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

R-sq.(adj) =  0.421   Deviance explained = 34.6%

UBRE score = -0.081484  Scale est. = 1         n = 1009

In fact the fit here is equal to the fit of gam2 (the df and residual deviance is identical), and the 'interaction' term is not tested.  What the third term does is scale the smoothing functions equally for the two variables (isotropic smoothing).  This may come in handy for truly isotrophic variables (e.g., modeling with spatial distances in 2D or 3D), but otherwise doesn't get you closer to understanding whether the significance of one variable in the model depends on values of another.  Instead, you can use the form of either gam4 (interaction fit isotropically; note the "by" argument here) or gam5 (interaction fit non-isotropically using the "tensor product" smoothing function, te).  The mgcv developer, Simon Wood, suggests "tensor product smooths often perform better than isotropic smooths when the covariates of a smooth are not naturally on the same scale, so that their relative scaling is arbitrary"—in other words, if the variables are not in the same units, you're probably better off using the structure of gam5. 

summary(gam5)

Family: binomial

Link function: logit

 

Formula:

cover > 0 ~ s(elev) + s(beers) + te(elev, beers)

 

Parametric coefficients:

            Estimate Std. Error z value Pr(>|z|)   

(Intercept) -0.35987    0.09143  -3.936 8.29e-05 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Approximate significance of smooth terms:

                 edf Ref.df  Chi.sq p-value   

s(elev)        8.311  8.868 243.283  <2e-16 ***

s(beers)       1.915  2.373   7.361  0.0364 * 

te(elev,beers) 1.001  1.002   2.053  0.1523   

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

R-sq.(adj) =  0.422   Deviance explained = 34.8%

UBRE score = -0.081586  Scale est. = 1         n = 1009

Now we have a significance test associated with the interaction, which is here not significant.  A plot of this model:

plot(gam5,trans=function(x)exp(x)/(1+exp(x)),shade=T,pages=1)

shows the smoothed response of cover to each variable, plus contour lines of smoothed cover against both elevation and beers variables. 

            The "by" formulation is also appropriate for an ANCOVA structure, where we want to assess whether the smoothed fit varies according to different classes of some factor variable.  Here's an example using the distribution of yellow birch with respect to elevation and disturbance history:

gam6 = gam(cover>0~s(elev)+disturb+te(elev,by=disturb,k=4),family=binomial,data=dat5)

plot(gam6,trans=function(x)exp(x)/(1+exp(x)),shade=T,pages=1)

The model was fit as a hybrid of a smoothed response to elevation, a parametric (ANOVA-type) response to disturbance class (see the standard contrasts below), and a tensor-product smoothed response to elevation separated by disturbance class (here not significant).  The graph shows the overall elevation response and elevation responses restricted to disturbance classes.  The large error associated with most of these is consistent with the interaction being insignificant.  The "k=4" argument limits the number of knots in each smoothing of elevation within disturbance classes; I did this because this is maximum number the 5-class disturbance variable can support (try it without the k specified: you should get an error).

summary(gam6)

Family: binomial

Link function: logit

 

Formula:

cover > 0 ~ s(elev) + disturb + te(elev, disturb, k = 4)

 

Parametric coefficients:

              Estimate Std. Error z value Pr(>|z|)

(Intercept)    0.04427    0.37272   0.119    0.905

disturbLT-SEL -0.49432    0.45897  -1.077    0.281

disturbSETTLE -0.25341    0.60576  -0.418    0.676

disturbVIRGIN -0.75584    0.68266  -1.107    0.268

 

Approximate significance of smooth terms:

                   edf Ref.df Chi.sq  p-value   

s(elev)          8.203  8.816  80.45 1.06e-13 ***

te(elev,disturb) 2.792  3.633   2.02    0.678   

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

R-sq.(adj) =    0.4   Deviance explained = 33.1%

UBRE score = -0.054127  Scale est. = 1         n = 1009

 

            Here's an example of the procedure used by Bio et al. (1998, JVS 9:5-16) to compare models fit by GLMs and GAMs, using the gam function in the gam package.  Using our yellow birch dataset:

install.packages("gam")

library(gam)

m0 = glm(cover>0~1,family=binomial,data=dat5)

m1 = glm(cover>0~elev,family=binomial,data=dat5)

m2 = glm(cover>0~elev+I(elev^2),family=binomial,data=dat5)

m3 = gam(cover>0~s(elev,3),family=binomial,data=dat5)     #smoother with 3 df

m4 = gam(cover>0~s(elev,4),family=binomial,data=dat5)     #smoother with 4 df

anova(m0,m1,m2,m3,m4,test="Chi")

Analysis of Deviance Table

 

Model 1: cover > 0 ~ 1

Model 2: cover > 0 ~ elev

Model 3: cover > 0 ~ elev + I(elev^2)

Model 4: cover > 0 ~ s(elev, 3)

Model 5: cover > 0 ~ s(elev, 4)

  Resid. Df Resid. Dev     Df Deviance P(>|Chi|)   

1      1008    1382.74                              

2      1007    1057.49 1.0000   325.25 < 2.2e-16 ***

3      1006    1011.99 1.0000    45.50 1.523e-11 ***

4      1005     978.88 1.0000    33.11 8.712e-09 ***

5      1004     965.93 1.0002    12.95   0.00032 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

All the models are better than the null model (intercept only), and the lowest deviance is model 4 (the 4-df smoother).  This term would this be included in the model (s(elev,4)) and the next environmental variable would be test using the current model 4 as the null model.  Like Bio et al., we can also compare the 4-df smoother fit with a 4th-degree polynomial fit in a GLM:

m5 = glm(cover>0~elev+I(elev^2)+I(elev^3)+I(elev^4),family=binomial,data=dat5)

plot(x,predict(m4,newdata=list(elev=x),type="response"),pch=19,col="steelblue",

     ylim=c(0,1),ylab="Probability",xlab="Elevation")

points(x,predict(m5,newdata=list(elev=x),type="response"),pch=19,col="orchid")

anova(m4,m5,test="Chi")

Analysis of Deviance Table

 

Model 1: cover > 0 ~ s(elev, 4)

Model 2: cover > 0 ~ elev + I(elev^2) + I(elev^3) + I(elev^4)

  Resid. Df Resid. Dev          Df Deviance P(>|Chi|)   

1      1004     965.93                                  

2      1004     973.77 -0.00025783  -7.8438 5.358e-07 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Like that of Bio et al., we conclude that using high-order polynomial functions leads to much greater residual variance than the nonparametric equivalent GAM. 

 

            Finally, you may wish to experiment with a GAM called BRUTO, or flexible discriminant analysis, available as the bruto function in the "mda" package.  Elith et al. (Ecography 29: 129-151) suggest this method often works faster than the traditional GAM approach and has some convenient options for term selection.  However, it only takes Gaussian error, so working with binary data is difficult.  Unfortunately I was unable to get the bruto function to work with the above Smokies data; Elith et al. also had problems with the R implementation and used S-PLUS instead.