General linear models (least squares) in R


Bolker (Chap. 9) provides a nice summary of general linear models as opposed to the more modern techniques. Essentially general linear models (NOT generalized linear models) are the old-school models of normal residual distributions, independent observations, homoscedasticity, and (assumed) lack of any observation error. In R this includes everything that the lm function does (simple and multiple least-squares regression), ANOVA, and ANCOVA. The only difference between these is whether the model includes only continuous variables (regression), only factor variables (ANOVA), or both (ANCOVA).

All modeling approaches in R use the same basic structure of "response ~ explanatory variable(s)". Here's an example of a regression model of the abundance (cover) of beech on elevation:

dat = read.csv(file.choose()) #choose the treedata.csv dataset

lm1 = lm(cover~elev,data=dat,subset=species=="Fagus grandifolia")

This simple linear regression fits both an intercept and a slope. A summary of the fitted model object is available with the summary function:




lm(formula = cover ~ elev, data = dat, subset = species == "Fagus grandifolia")



Min 1Q Median 3Q Max

-3.9054 -1.8398 -0.1664 1.6433 5.4512



Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.9620373 0.3437750 5.707 2.21e-08 ***

elev 0.0019596 0.0002971 6.596 1.31e-10 ***


Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1


Residual standard error: 2.1 on 409 degrees of freedom

Multiple R-squared: 0.09613, Adjusted R-squared: 0.09393

F-statistic: 43.5 on 1 and 409 DF, p-value: 1.313e-10

All fitted model objects can be explored with summary. For lm, you get a restatement of the model, the range and quantile estimates of the model residuals, a table of fitted coefficients along with their precision (SE) and significance (t tests), the residual SE (square root of the error variance), the R2 or coefficient of determination (fraction of variance explained), the adjusted-R2 (applies a penalty for more fitted parameters), and the associated F statistic evaluating the significance of the full model. There are many attributes of lm1, most of which are not shown. For example, you can retrieve a vector of fitted (predicted) values with lm1$fitted (or fitted(lm1)), a vector of residuals with lm1$resid (or resid(lm1)), and return the coefficients with coef. Alternatively, note that this summary is itself an object, with components that can be accessed with indexing. For example:


Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.962037256 0.3437749892 5.707330 2.206867e-08

elev 0.001959647 0.0002971168 6.595544 1.312590e-10

returns the 4th element of summary(lm1), which is stored as a list. Here this element is the coefficient table. We could also show these results as an ANOVA-type table:


Df Sum Sq Mean Sq F value Pr(>F)

elev 1 191.76 191.76 43.501 1.313e-10 ***

Residuals 409 1802.91 4.41

In fact, all lm fitted objects can be returned in both summary forms (summary.lm and summary.aov). Other helpful related items are a report of the confidence intervals of each parameter:


2.5 % 97.5 %

(Intercept) 1.28625089 2.637823622

elev 0.00137558 0.002543713

and associated AIC:


[1] 1780.058

Don't forget that AIC values are particular to a given response vector: a change in any of the values of the response (e.g., transformation, removing an outlier, etc.) results in a non-comparable vector.

Is our model a good model? One easy way to examine model results is with the plot function:


Here you get four graphs (click to go from one panel to the next): 1) residuals vs. fitted values (a good model will show no pattern); 2) the qqnorm plot we saw above (values should be on the dashed line); 3) scale-location graph, indicating heteroscedasticity ; and 4) standardized residuals vs. leverage and Cook's distance, which is handy for identifying outliers. The first thing we notice in our data is that the y data are not continuous (thus the cross-hatched patterns in residual graphs) and that the residuals are probably not normally distributed (try shapiro.test(lm1$resid) to confirm). Would square root transformation help? We could compare the fit of a new model (lm2) with a transformed cover variable, by either calling lm again with new arguments or using the update function:

lm2 = update(lm1,formula=sqrt(cover)~elev)


The residuals don't look any better and the R2 has gone down. To do better we'll have to move to a generalized linear model (glm).

Now we'll try an ANOVA. Let's ask whether beech is differentially sensitive to disturbance history:

aov1 = lm(cover~disturb,data=dat,subset=species=="Fagus grandifolia")




lm(formula = cover ~ disturb, data = dat, subset = species ==

"Fagus grandifolia")



Min 1Q Median 3Q Max

-3.333 -2.053 -0.229 1.947 5.771



Estimate Std. Error t value Pr(>|t|)

(Intercept) 4.2290 0.1920 22.030 <2e-16 ***

disturbLT-SEL 0.1043 0.2759 0.378 0.7055

disturbSETTLE -0.8199 0.3828 -2.142 0.0328 *

disturbVIRGIN -0.1759 0.2821 -0.624 0.5332


Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1


Residual standard error: 2.197 on 407 degrees of freedom

Multiple R-squared: 0.01499, Adjusted R-squared: 0.007725

F-statistic: 2.064 on 3 and 407 DF, p-value: 0.1044

This output is more confusing than regression output; for one thing coefficients are only shown for 3 of four levels of disturbance, plus the overall intercept. This is because the default contrast structure for ANOVA in R is to use 'treatment contrasts' and ignore a raw intercept. The factor level that is alphabetically first (here, "disturbCORPLOG") is used as the overall intercept, so this estimate is used to test whether the overall mean differs from zero (here, it clearly does, because there are no zero cover values in our dataset). Each remaining factor level is then tested against this first level; i.e., the second parameter estimate (here next to "disturbLT-SEL") is the difference in cover of beech in CORPLOG and LT-SEL plots. So what this output is showing us is that SETTLE plots have significantly lower abundances of beech than CORPLOG plots (by -.8 cover units). The first row coefficient is an actual mean of a sample (mean beech cover of all CORPLOG plots), while the remaining coefficients are differences in mean cover, not actual means themselves. These are of course only a few of all possible contrasts; you can specify a particular contrast structure with the function contrasts.

We could optionally choose to have R not fit an intercept at all, but instead determine whether the mean cover of each group is significantly different from zero. In our case this is a bit silly, but it is easily accomplished using '-1' in on right side of the equation:

aov2 = lm(cover~disturb-1,data=dat,subset=species=="Fagus grandifolia")

A call to summary(aov2) will confirm that indeed cover is greater than zero for all groups. Not fitting an intercept is usually more handy in regressions when we want the line to go through the origin.

Finally, note that the lm output for our model is different than that produced by aov:


Df Sum Sq Mean Sq F value Pr(>F)

disturb 3 29.89 9.9635 2.0639 0.1044

Residuals 407 1964.78 4.8275

You can see why from the summary table: the lm output produces a set of treatment contrasts (the mean of each group), each tested with pairwise t tests. But here aov produces the overall ANOVA result, which is a simple comparison of variances of the cover response variable between and within the grouping variable; comparisons of two variances are done via the F test. The fact that these tests give us different results suggests than in fact the difference between groups is an issue of variance as well as the mean:

plot(dat$disturb[dat$species=="Fagus grandifolia"],dat$cover[dat$species=="Fagus grandifolia"],notch=T)

The problem is evident here: a general linear model is not a good choice because of heteroscedasticity (SETTLE variance is much lower). This would be a good occasion to use a bootstrapped model (like a permutational ANOVA) rather than something based on classical assumptions.

Multiple comparisons can be accomplished with various other tests in R, such as the popular Tukey honest significant difference (HSD) test. The argument for the TukeyHSD function must be an aov object rather than a lm one:


Tukey multiple comparisons of means

95% family-wise confidence level


Fit: aov(formula = aov1)



diff lwr upr p adj

LT-SEL-CORPLOG 0.1043257 -0.6073034 0.81595478 0.9815546

SETTLE-CORPLOG -0.8199167 -1.8075190 0.16768552 0.1417827

VIRGIN-CORPLOG -0.1759103 -0.9035981 0.55177748 0.9244785

SETTLE-LT-SEL -0.9242424 -1.9198871 0.07140225 0.0796909

VIRGIN-LT-SEL -0.2802360 -1.0188019 0.45832992 0.7616165

VIRGIN-SETTLE 0.6440064 -0.3631787 1.65119155 0.3521145

Unlike the t tests produced by the lm summary, the more conservative Tukey HSD tests suggest that beech cover does not vary between any two disturbance classes.

We have already noted that plots of different disturbance classes have a very different elevation distribution. We can determine whether elevation influences the above ANOVA results (or equivalently, whether the elevation effect varies by disturbance class) with an ANCOVA:

ancova1 = lm(cover~disturb*elev,data=dat,subset=species=="Fagus grandifolia")




lm(formula = cover ~ disturb * elev, data = dat, subset = species ==

"Fagus grandifolia")



Min 1Q Median 3Q Max

-4.3054 -1.5585 -0.1539 1.7687 5.3592



Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.517e+00 6.635e-01 2.287 0.0227 *

disturbLT-SEL 3.248e-01 8.653e-01 0.375 0.7076

disturbSETTLE 1.688e+00 1.546e+00 1.092 0.2754

disturbVIRGIN 2.092e-01 1.338e+00 0.156 0.8758

elev 2.340e-03 5.503e-04 4.252 2.64e-05 ***

disturbLT-SEL:elev 2.112e-05 7.402e-04 0.029 0.9772

disturbSETTLE:elev -2.041e-03 2.073e-03 -0.984 0.3255

disturbVIRGIN:elev -4.867e-04 1.065e-03 -0.457 0.6478


Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1


Residual standard error: 2.094 on 403 degrees of freedom

Multiple R-squared: 0.1139, Adjusted R-squared: 0.09848

F-statistic: 7.398 on 7 and 403 DF, p-value: 2.249e-08

This again is the default contrast form of model results, where we see that elevation is really driving the relationship, with no support for including disturbance history or its interaction with elevation. The significance tests associated with only the main effect of disturbance (rather than treatment contrasts is found in the anova-style output:


Analysis of Variance Table


Response: cover

Df Sum Sq Mean Sq F value Pr(>F)

disturb 3 29.89 9.964 2.2717 0.07972 .

elev 1 191.92 191.924 43.7589 1.184e-10 ***

disturb:elev 3 5.32 1.775 0.4047 0.74971

Residuals 403 1767.53 4.386


Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Which again shows no support for disturbance or the interaction. We can test the fit of models with and without the disturbance variance by defining a second ANCOVA model with elevation only (the period below followed by the minus sign is translated as 'the full model minus these terms, with the colon signifying the interaction):

ancova2 = update(ancova1, ~.-disturb-disturb:elev)

and then using the useful anova function to compare whether the explained variance is significantly less from the simpler model:


Analysis of Variance Table


Model 1: cover ~ disturb * elev

Model 2: cover ~ elev

Res.Df RSS Df Sum of Sq F Pr(>F)

1 403 1767.5

2 409 1802.9 -6 -35.381 1.3445 0.2361

The models are not significantly different in explanatory power (p=0.2361), so the simpler one with only elevation (model 2) is better. AIC agrees:

> AIC(ancova1,ancova2)

df AIC

ancova1 9 1783.912

ancova2 3 1780.058

There is an even easier (but not always correct) way to perform model simplification (backwards elimination from a maximal model) in R using the step function:

> step(ancova1)

Start: AIC=615.54

cover ~ disturb * elev


Df Sum of Sq RSS AIC

- disturb:elev 3 5.3248 1772.9 610.78

<none> 1767.5 615.54


Step: AIC=610.78

cover ~ disturb + elev


Df Sum of Sq RSS AIC

<none> 1772.9 610.78

- disturb 3 30.057 1802.9 611.69

- elev 1 191.924 1964.8 651.03



lm(formula = cover ~ disturb + elev, data = dat, subset = species == "Fagus grandifolia")



(Intercept) disturbLT-SEL disturbSETTLE disturbVIRGIN elev

1.649949 0.335265 0.244698 -0.390918 0.002225

The output is not as complicated as it seems. Starting from the maximal model (~disturb*elev), the function compares the resulting AIC value from taking away the most complicated (last) term in the model, which here is the interaction between disturb and elev. With a substantially better AIC, the function keeps this simpler model and then takes away the terms from the next tier (here, both main effects). The model with disturb removed is essentially of equal fit to having it in (AIC goes up 1 unit), but taking elev out results in a massive loss of explanatory power (up 41 units). So the best model only includes elev. I do not know why the AICs reported here do not correspond to that produced by AIC. Note that step works for forward selection as well as backward, and you can select to do both simultaneously. (It works with glm, too). Try using step with a more complicated ANCOVA, such including the tci variable along with disturb and elev, along with all interactions. Beware that the saved object associated with step will include the best-fitting model according to raw AIC (better AIC wins, no matter whether the actual difference is less than two). For example:

step.ancova = step(ancova1)



lm(formula = cover ~ disturb + elev, data = dat, subset = species ==

"Fagus grandifolia")



Min 1Q Median 3Q Max

-4.24965 -1.67849 -0.09873 1.81710 5.37934



Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.6499485 0.4297313 3.839 0.000143 ***

disturbLT-SEL 0.3352653 0.2646654 1.267 0.205970

disturbSETTLE 0.2446982 0.3979475 0.615 0.538965

disturbVIRGIN -0.3909178 0.2702366 -1.447 0.148787

elev 0.0022252 0.0003357 6.630 1.07e-10 ***


Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1


Residual standard error: 2.09 on 406 degrees of freedom

Multiple R-squared: 0.1112, Adjusted R-squared: 0.1024

F-statistic: 12.7 on 4 and 406 DF, p-value: 9.535e-10

So disturbance is retained here, even though there is not sufficient AIC support.


Nonlinear Least Squares (nls)

Let's limit the focus of our analysis to one species (say, sweet birch) using the subset function:

dat2 = subset(dat,dat$species=="Betula lenta")

Which gives us a data frame of 795 rows (use str to confirm). An example of one relationship that may be nonlinear is that between elevation and tci (topographic convergence index).


Clearly there is more to the relationship than a simple linear function, but nonetheless let's try fitting an asymptotic model to these data. The function for this in R using least squares is nls. Unlike lm, minimization is done via an optimization algorithm, which thus requires extra input on your part in the form of initial values for each parameter in the optimization. This also means nls can sometimes be frustrating, due to lack of convergence or slow performance with very large datasets. One two-parameter asymptotic model is the Michaelis-Menten (MM) function, used in nls here:

model1 = nls(tci ~ a*elev / (b+elev), start=list(a=500,b=5), data=dat2)

The start argument is a list of each fitted parameter and its starting value. Here a is the asymptote, and b is the x value where y=a/2. Note usually the MM function is used for monotonically increasing data, whereas here the trend is downward, so our b value will be negative. Access parameter estimates with the summary function:


Formula: tci ~ a * elev/(b + elev)



Estimate Std. Error t value Pr(>|t|)

a 4.9742 0.1564 31.814 < 2e-16 ***

b -109.3603 16.9142 -6.466 1.76e-10 ***


Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1


Residual standard error: 2.255 on 793 degrees of freedom


Number of iterations to convergence: 8

Achieved convergence tolerance: 1.309e-06

Indeed the guess about the asymptote (a=5) was right on, and b was fit to -109. The t and associated p values indicate whether the parameter values are significantly different from zero. (This is different from deciding whether or not this is the best model, or even whether these terms both belong in this model.) You also get an indication of the model fit and df, plus information about the optimization result. Let's see what the regression line looks like:

x = seq(0,2000)


Here we use predict to generate new (predicted) values based on our fitted model, using the new vector x. The predict function is generic and should work with any modeling function.

R provides an easier way to implement nls for many common nonlinear functions via 'self-starting functions' that do not require specification of initial parameter values. (It is also possible to write your own functions for this; see selfStart.) The one for the MM model looks like this:

model2 = nls(tci ~ SSmicmen(elev,a,b), data=dat2)

Fortunately we get basically the same model fit as before:



Formula: tci ~ SSmicmen(elev, a, b)



Estimate Std. Error t value Pr(>|t|)

a 4.9742 0.1564 31.814 < 2e-16 ***

b -109.3588 16.9144 -6.465 1.76e-10 ***


Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1


Residual standard error: 2.255 on 793 degrees of freedom


Number of iterations to convergence: 4

Achieved convergence tolerance: 2.352e-06

Maybe a power function fits the data better?

model3 = nls(tci ~ c*elev^z, start=list(c=20,z=1), data=dat2)


The fit is similar. If the functions were nested (i.e,. setting a parameter to zero yields the other function), we could use a likelihood ratio test to compare models with the anova function. Here they're not, so we can appeal to AIC:


df AIC

model2 3 3552.757

model3 3 3546.405

There is substantially more support for the power model (lower AIC). We'll talk more about this later. For more on useful deterministic nonlinear functions, see Crawley p. 662 or Bolker p. 88. You can incorporate grouping (factor) variables with nonlinear least squares using other functions like gnls in the nlme library. Finally, note that for relatively simple nonlinear relationships it is often more productive to simply transform the dependent variable rather than worry about nonlinearity. Crawley (p. 205) lists these common transformations:

- log y vs x for exponential relationship

- log y vs log x for power

- exp(y) vs. x for logarithmic

- 1/y vs 1/x for asymptotic

- log(p/(1-p)) vs x for proportion data

- sqrt(y) for count data for variance homogenization

- arcsin(y) for percentage data for variance homogenization