Hierarchical models I: parameter models and random effects

The model approaches we have considered up to this point involve estimating how the expected value (mean response) changes with respect to some deterministic function (process model), with all of the variation around the mean produced by the same error function (data model).  We considered model variance only in the sense that we wanted an appropriate stochastic function for a given response variable (e.g., a Poisson process for count data)—that is, assuming all the variance came from one source (usually from the ecological process).  In reality, observations in any ecological dataset come from several potential sources, including observation error (for both response and predictor variables) and different error structures in the process model.

Consider the simple example of a model of tree sizes (dbh).  If we only want to describe the mean expectation without regard to explanatory variables, we have a one-line model of dbh ~ normal(μ,σ2), and we estimate two parameters (mean, variance).  This model assumes that our trees come from a single population defined by a constant variance.  But what if our observations come from several different tree populations?  Even if we had no reason to suspect the mean size would be different among populations, it seems logical to suspect that the populations might be different, and accounting for such differences might improve the fit of the model.  One clever way of doing this is to think about the mean (intercept) as not a 'fixed effect'—a parameter whose value does not vary regardless of the observation—but as a variable whose value varies depending on which population it is sampled from ('random effect').  For example, we might suppose the intercept is sampled from a normal distribution of a given mean and variance, which is allowed to vary for observations sampled from different populations.  So our full model would be a composite statement of a deterministic model (here, the null model of only the intercept), a stochastic model (describing the lowest level variance, here within populations), and a parameter model (describing the expected mean and variance of the overall intercept, which varies between populations).  Our original model of dbh ~ normal(μ,σ2) had two fixed effects, and our revised model has three fixed effects and one random effect:

dbh ~ normal(μ,σ2)

μ ~ normal(μpp2)

Because the first μ is itself modeled stochastically, it has become a random effect, and you can see that our model is now a hierarchy: at least one model parameter is itself the response variable of another model.  The addition of this 'parameter model' completes the overall menu of model building: process models describe ecological processes, stochastic models describe how error enters the relationships (and observations), and parameter models link deterministic and stochastic functions to different parts of an overall hierarchy of data and process models.  (Of course, the parameters in parameter models can themselves be modeled, producing 'hyperparameters' and an abundance of model levels).  In classical statistics models that include parameter models (which include new sources of variance) are called 'mixed effects' models because some predictor variables are fixed (unmodeled) while others aren't (random); mixed effect models are equivalent to the more recent jargon of multi-level, multistratum, nested, and hierarchical models.  The unifying principle is that all such models include some modeling of the variance, rather than assuming a single (constant) variance parameter.  Also, these terms only apply to model construction, as evaluation can be via any statistical method (least squares, ML, Bayesian).  Models where observations are not independent (including temporal and spatial autocorrelation) are similar qualitatively because they include variance-covariance structures (more on this next time).  The mixed model above is the simplest variant of a class of models called block models that allow the intercept to vary randomly according to some grouping factor.

To illustrate construction of hierarchical models in R, we’ll use a dataset that involves clustered spatial sampling (blocks) and multiple observations (through time) per ‘individual’.  Such a structure typifies many species distribution datasets, but here we’ll use data from a temperature logger network in the Smokies (Tf2.RData; bring it into R with the load command).  These data were used in a project to create daily landscape models for ground-level temperatures; here I’ve pared down the full dataset to save us some computation by taking only daily temperature data for July 2006.  The dataset describes minimum and maximum daily temperatures from 101 data loggers located 1m above the ground, arranged in 12 linear ‘cover-to-ridge’ transects at different overall elevations and aspects.  Existence of the 12 transects means we can construct a ‘block’ model, which may include a random effect for Transect.  We also have variation of temperatures within each logger itself (31 days in July), so we can also include an ‘individual’ submodel, and test for a random effect of ‘logger’. So we might envision this model building exercise as one of three separate sources of variation: the day-to-day variation of temperature within each logger; the average variation between loggers of the same transect (up to 10 per group); and broad-scale spatial variation of temperature between the 12 Transects.

The are two standard packages for creating hierarchical models in R based on maximum likelihood via optimization: nlme is older and is better documented, while lme4 replaces most of the nlme functionality and adds the critical family argument to include the exponential family of error structures (Poisson, binomial, etc.).  The approaches are similar but there are some differences in syntax.  It is worthwhile to learn both, and we’ll start with nlme.

install.packages(c("nlme","lme4"))

library(nlme)

The function for linear mixed models in the nlme library is lme, and it works like lm or glm except you need to specify fixed and random components separately.  We’ll start with a null model that includes only an intercept term, but one that is allowed normal variation within both Transect and logger. We’ll concentrate on minimum daily temperature (minT):
lme1 = lme(fixed=minT~1,random=~1|Transect/logger,method="ML",na.action=na.exclude,data=Tf2)

summary(lme1)

Linear mixed-effects model fit by maximum likelihood

Data: Tf2

AIC      BIC    logLik

21533.11 21559.77 -10762.55

Random effects:

Formula: ~1 | Transect

(Intercept)

StdDev:    1.911162

Formula: ~1 | logger %in% Transect

(Intercept) Residual

StdDev:   0.5425797 1.513821

Fixed effects: minT ~ 1

Value Std.Error   DF  t-value p-value

(Intercept) 16.66494 0.5549976 5696 30.02705       0

Standardized Within-Group Residuals:

Min         Q1        Med         Q3        Max

-4.1771081 -0.5028408  0.1852666  0.6500676  3.3828615

Number of Observations: 5797

Number of Groups:

Transect logger %in% Transect

12                  101

Most critical here is to understand how the random effects are specified: here we have “random=~1|Transect/logger”, where the “|” sign signifies that what comes after is the structure of the groups, with latter terms nested within earlier terms (here loggers are nested within Transects). You don’t repeat the response variable in the random effects formula.  The “1” indicates that we’re only interested in modeling the intercept with respect to normal distributions for the groups (Transect and logger). Also note that there are two methods for fitting these models: “ML” (full [regular] maximum likelihood) is required if we’re to compare models with different fixed effects (using AIC or LRT). “REML” is restricted or residual ML, and actually works by fitting the model to the least squares residuals after including the fixed effects terms, and so only involves ML fitting of model variance components.  This is advantageous when you’re most interested in models comparing different variance structures (random effects), and in fact you can’t compare models with different fixed effects using REML.  Again note that normal error is assumed in lme.

At this point we should examine the qualitative nature of the data within our imposed hierarchical framework. A basic question is how much variance exists at each level: does temperature vary more day-to-day, within transects, or between them?  The random effects output above includes estimates of the standard deviation for each level: 1.191 between transects, 0.542 between loggers within transects, and 1.514 within loggers.  We can put this in terms of percent variance at each level:
sigmas = c(1.911,.542,1.514)
sigmas^2 / sum(sigmas^2)

 0.58544256 0.04709356 0.36746389

You can also extract the variance of each level (variance components) using the VarCorr function:
VarCorr(lme1)

Variance     StdDev

Transect =  pdLogChol(1)

(Intercept) 3.6525412    1.9111623

logger =    pdLogChol(1)

(Intercept) 0.2943928    0.5425797

Residual    2.2916555    1.5138215

The variance components indicate that day-to-day variation in July is important (37% of the variance) but that spatial variance is higher, and most of this is between transects (59%). The next question is how to specify our fixed effects terms.  This dataset includes many predictors, but let’s focus on two that are likely candidates for explaining within vs. between transect variance, elevation (ELEV) and stream distance (here log-transformed, LOG.STRDST).  To make graph inspection easier we can create simplified datasets that specify the grouping factor of interest, called a ‘groupedData’ object. For example:
gD = groupedData(minT~1|Transect,data=Tf2) #transect-level variation only (means of each logger)

plot(gD)
gD = groupedData(minT~1|logger,data=Tf2)   #logger-level variation only (means of each transect)

plot(gD)
gD = groupedData(minT~ELEV|Transect,data=Tf2)   #transect-level variation modeled with elevation fixed effect

plot(gD)

gD = groupedData(minT~LOG.STRDST|Transect,data=Tf2)  #transect-level variation modeled with stream distance fixed effect

plot(gD)

There is clearly more variation in STRDST within transects, and more variation in ELEV between transects (this is, of course no accident—this is how logger deployment was structuring from the beginning).  Based with this information we update our previous model with each new fixed effect, starting with ELEV:
lme2 = update(lme1,fixed=minT~ELEV)
lme3 = update(lme1,fixed=minT~LOG.STRDST)

lme4 = update(lme1,fixed=minT~ELEV+LOG.STRDST)

Now let’s examine what happened to the variance of each level. I write a quick function to extract the variance components from VarCorr and summarize each as a percentage:
VC = function(model) {

vars = as.numeric(VarCorr(model)[,1][c(2,4,5)])
round(vars/sum(vars),2)

}

out = data.frame(c("between transects", "between loggers", "within loggers"),VC(lme1),VC(lme2),VC(lme3),VC(lme4))

names(out) = c("level", "null","elevation","log.streamdist","elev+log.strdist")

out

level null elevation log.streamdist elev+log.strdist

1 between transects 0.59      0.04           0.60             0.03

2   between loggers 0.05      0.14           0.04             0.07

3    within loggers 0.37      0.82           0.36             0.91

The output shows that when we added elevation to the model, we explained a vast proportion of between-transect level variation. Conversely, stream distance was mostly helpful at the between-logger level (the absolute variance went down about 25% at the between logger level). Adding both variables left essentially all of the variance at the within-logger level. We can of course use the tools of ML to investigate whether the decreases in variance were significant:
anova(lme1,lme2,lme3,lme4)

Model df      AIC      BIC    logLik   Test  L.Ratio p-value

lme1     1  4 21533.11 21559.77 -10762.55

lme2     2  5 21522.50 21555.82 -10756.25 1 vs 2 12.61125   4e-04

lme3     3  5 21515.21 21548.53 -10752.60

lme4     4  6 21450.89 21490.88 -10719.44 3 vs 4 66.32371  <.0001

The default output for anova in nlme guesses which nested contrasts you want; here we see lme4 is substantially better than the other models and we can confirm with p values (LRT) using:
anova(lme2,lme4)

Model df      AIC      BIC    logLik   Test L.Ratio p-value

lme2     1  5 21522.50 21555.82 -10756.25

lme4     2  6 21450.89 21490.88 -10719.44 1 vs 2 73.6117  <.0001

anova(lme3,lme4)

Model df      AIC      BIC    logLik   Test  L.Ratio p-value

lme3     1  5 21515.21 21548.53 -10752.60

lme4     2  6 21450.89 21490.88 -10719.44 1 vs 2 66.32371  <.0001

which tests the two fixed-effect term both against each single-term model.  What can we do for day-to-day variance?  The variable minSYN is derived from regional weather stations corrected for lapse rate:
lme5 = update(lme4,fixed=minT~ELEV+LOG.STRDST+minSYN)
VC(lme5)

 0.07 0.22 0.72

anova(lme4,lme5)

Model df      AIC      BIC    logLik   Test  L.Ratio p-value

lme4     1  6 21450.89 21490.88 -10719.44

lme5     2  7 14290.76 14337.42  -7138.38 1 vs 2 7162.125  <.0001

We can continue to exploit the power of hierarchical modeling by asking whether the effects of our predictor variables (elevation, stream distance, regional temperature) should themselves vary by transect or logger. Our current model only varies our overall intercept with respect to transect and logger.  Of course, now that we have fixed effects associated with each level, do we actually need these random effects?
lme6 = update(lme5,random=~1|Transect) #remove the random effect for logger
lme7 = update(lme5,random=~1|logger)  #remove the random effect for transect
anova(lme5,lme6)

Model df      AIC      BIC    logLik   Test  L.Ratio p-value

lme5     1  7 14290.76 14337.42 -7138.380

lme6     2  6 15393.36 15433.35 -7690.679 1 vs 2 1104.598  <.0001

anova(lme5,lme7)

Model df      AIC      BIC    logLik   Test  L.Ratio p-value

lme5     1  7 14290.76 14337.42 -7138.380

lme7     2  6 14297.56 14337.55 -7142.781 1 vs 2 8.800888   0.003

Indeed there are still factors that produce variation between both transects and loggers that our fixed effects haven’t fully accounted for. We already saw in our groupedData graph for LOG.STRDST that, although the effect was generally positive (temperatures increased with distance-from-stream), the slope wasn’t perfectly consistent between transects.  We could allow the slope itself (the coefficient form LOG.STRDST we already estimated as a fixed effect) to vary randomly (via normal error) with Transect, like this (note I had to change LOG.STRDST slightly by adding a constant +1 to it, because other the model fit wouldn’t otherwise converge when the term included zeros):
lme8 = update(lme5,random=list(~1|logger,~I(LOG.STRDST+1)|Transect))
anova(lme5,lme8)

Model df      AIC      BIC    logLik   Test  L.Ratio p-value
lme5     1  7 14290.76 14337.42 -7138.380
lme8     2  9 14303.56 14363.55 -7142.781 1 vs 2 8.800893  0.0123

Here we broke up the random effects term into two separate formulas for each nested level.  Interestingly, the model suggests that the variation in the slope value for LOG.STRDST is not enough between transects to merit a random effect. What about regional temperature—is this effect the same regardless of where the transect is located?
lme9 = update(lme5,random=list(~1|logger,~minSYN|Transect))
anova(lme5,lme9)

Model df      AIC      BIC    logLik   Test L.Ratio p-value

lme5     1  7 14290.76 14337.42 -7138.380

lme9     2  9 14291.46 14351.44 -7136.729 1 vs 2 3.30166  0.1919

Again, the random effect is not enough to merit inclusion the model.  This is intellectually satisfying, since it means the fixed effect predictors work the same regardless of where we are on the landscape. Of course, we still have some work to do: in particular we may be worried about the non-independence of logger values through time, or their spatial correlation (although note the random effects for spatial structure take care of some of this problem).  We’ll examine this next time.

It is worthwhile to demonstrate this model structure using the more recent lmer function in the lme4 library:

library(lme4)
lmer5 = lmer(minT~ELEV+LOG.STRDST+minSYN+(1|Transect)+(1|logger),REML=F,data=Tf2,na.action=na.exclude,family=gaussian)

summary(lmer5)

Linear mixed model fit by maximum likelihood

Formula: minT ~ ELEV + LOG.STRDST + minSYN + (1 | Transect) + (1 | logger)

Data: Tf2

AIC   BIC logLik deviance REMLdev

14291 14337  -7138    14277   14308

Random effects:

Groups   Name        Variance Std.Dev.

logger   (Intercept) 0.196541 0.44333

Transect (Intercept) 0.062834 0.25067

Residual             0.651855 0.80738

Number of obs: 5797, groups: logger, 101; Transect, 12

Fixed effects:

Estimate Std. Error t value

(Intercept)  3.5941512  0.2799954   12.84

ELEV        -0.0012356  0.0002009   -6.15

LOG.STRDST   0.3169141  0.0298608   10.61

minSYN       0.8361858  0.0069850  119.71

Correlation of Fixed Effects:

(Intr) ELEV   LOG.ST

ELEV       -0.726

LOG.STRDST -0.253 -0.288

minSYN     -0.498  0.138  0.001

The important differences are how to specify the random effects components (in parentheses), “method” here is whether REML=T or not, and we can specify the glm family of error distributions. The output is slightly different (with variance components more easily read in a table) but the results look identical to lme5.  Like lme, comparing models of different fixed effects requires REML=F.

Finally, we’d be remiss not to try a GAM mixed model example, which is fit with the gamm function in Wood’s familiar mgcv library. The syntax for random effects is again slightly different but the rest is built on lme (although the gamm4 package apparently does the same with lmer):

gamm5 = gamm(minT~s(ELEV)+s(LOG.STRDST)+s(minSYN),random=list(Transect=~1,logger=~1),data=Tf2,na.action=na.exclude)
gamm5

Although there does not appear to be anova or AIC functionality in this package, the reported log L is -7072.89, making the deviance=14146 on 13.73 estimated DF.  The lme5 deviance is 14277 using up 14 parameters, so (unsurprisingly) the gamm appears to be a more efficient model. There appears to be AIC methods for the gamm4 package.

Zero-truncated and zero-inflated models

A convenient class of hierarchical models for species distribution modeling are those that can deal with count-like response variables that would otherwise be associated with a Poisson data model, but have either 1) no zeros (a dataset that does not include absences but presences are positive integers), or 2) an overabundance of zeros (many more than expected from a Poisson process).  The former situation can be modeled with zero-truncated models that adjust either Poisson or negative binomial functions such that zeros are not possible, and are available in the VGAM library.  The latter situation can be dealt with using zero-inflated models that either model absences separately from the positive integers (so-called 'two-part' or zero-altered models), or zero-inflated mixture models that also use two sub-models but absences can enter both (one accounts for 'false' zeros, where the species cannot occur, and the other accounts for 'real' zeros, where the species can indeed occur but, by chance, doesn't in your dataset).  Zero-inflated model functions are available in the pscl library.  A very readable introduction to these models is Chapter 11 of Zuur et al. (2009), Mixed effects models and extensions in ecology with R, which I've used as the basis for the following examples.

Our species cover data that don't include absences may be a decent candidate for a zero-truncated model.  We can return to our Fraser magnolia example:

dat = read.csv(file.choose())     #grab the treedata.csv file
dat2 = subset(dat,dat\$species=="Magnolia fraseri") #Fraser's magnolia
install.packages("VGAM")
library(VGAM) #lookout: VGAM overwrites some key functions from the stats package
The function implementing the zero-truncated model is vglm. The family argument has many options; here we are most interested whether the zero truncation approach fits a better model than a simple Poisson fit. The zero-truncated Poisson model is family=pospoisson, and the (regular) poisson is family=poissonff.  I construct a simple deterministic model involving elevation and stream distance:
vglm1 = vglm(cover~elev+I(elev^2)+streamdist+elev*streamdist,family=pospoisson,data=dat2)
vglm2 = vglm(cover~elev+I(elev^2)+streamdist+elev*streamdist,family=poissonff,data=dat2)
AIC(vglm1)
 1490.624
AIC(vglm2)
 1532.216

Unfortunately the anova function is not yet implemented in the VGAM library, but AICs suggest we have a vast improvement in model building with the zero-truncated model.  Also note that Poisson counts assume equal mean and variance.  If the variance increases faster than the mean, a negative binomial model is more appropriate, because it includes an additional parameter for the added variance (overdispersion).  Zero-truncated models have an extension to the negative binomial case using the family=posnegbinomial argument.  Unfortunately I couldn't get this routine to work in the present example, but it makes sense to compare Poisson and NB versions of model building when constructing zero-truncated models.  There is also a GAM version using vgam:

vgam1 = vgam(cover~elev+I(elev^2)+streamdist+elev*streamdist,family=pospoisson,data=dat2)
although I couldn't get this to do much of anything beyond the glm version.  Still, they merit exploration.

Zero-inflated models come in two flavors, each with a Poisson or negative binomial variant: ZAP and ZANB on the one hand (zero-altered Poisson, zero-altered NB) and ZIP and ZINB on the other (zero-inflated Poisson, NB).  Each technique fits two separate models: one to the zeros (absences) only, and the other to the counts.  What distinguishes them is whether the counts are allowed to contain absences.  ZAP and ZANB divide the data in two and fit a binomial model to the absences and a zero-truncated model to the positive integers.  Thus, in ZAP and ZANB models there is only one 'type' of absence, and presences are modeled with any desired deterministic function like we did above.  ZIP and ZINB models are more sophisticated, in that they force you into thinking about the nature of an absence: is the site completely outside of a species' range or niche (a 'naughty naught'), and thus should be eliminated on the basis of a simple binomial function, or is the site acceptable but, simply due to Poisson (or NB) chance, the species was absent?  ZIP and ZINB models allow absences to enter each submodel (one binomial, the other Poisson or NB).  Because both submodels can have complicated deterministic components (one predicting which sites are 'off limits', the other predicting species abundance within the range or niche), ZIP/ZINB models can be a logical choice for distribution modeling with response variables that include information on absences and abundances.  The ability of the model to differentiate between the two types of absences, however, depends on the quality of the data, so beware.

To demonstrate the use of ZIP and ZAP models we'll add absences to the magnolia dataset:

ro.plots = unique(as.character(dat2\$plotID))    #plots with mag fra
u.plots = unique(as.character(dat\$plotID)) #all plots
nob.plots = u.plots[is.element(u.plots,ro.plots)==F] #plots without mag fra
dat3 = subset(dat,is.element(as.character(dat\$plotID),nob.plots)) #dataset of no mag fra
dat4 = subset(dat3,duplicated(dat3\$plotID)==F)  #one row per plot
dat4\$cover = 0     #cover of red oak is zero in these plots
mf = na.exclude(rbind(dat2,dat4))     #new dataframe of presences and absences

We presume this new cover vector is highly overdispersed in the sense of inflated zeros:
hist(mf\$cover,nclass=13)
We'll start with a ZAP model.  Note the two different kinds of overdispersion we're potentially dealing with: overdispersion in the zeros only, which the ZAP or ZIP models will take care of; or overdispersion from both the zeros and the counts (e.g., if there were also too many 1s for a Poisson process), which the NB variants of ZANB and ZINB will handle.  You likely won't know a priori which of any of these would be the better choices (or just a NB model alone without a binomial component), but we can compare each approach with likelihood methods (e.g., AIC).  Using our above deterministic function, let's start with un-mixed models that involve Poisson and NB error, knowing full well that the inflated zeros will favor a NB stochastic function:
glm1 = glm(cover~elev+I(elev^2)+streamdist+elev*streamdist,family=poisson,data=mf)
library(MASS)
glm.nb1 = glm.nb(cover~elev+I(elev^2)+streamdist+elev*streamdist,data=mf)      #negative binomial GLM from the MASS library
AIC(glm1,glm.nb1)

df      AIC
glm1     5 3456.315
glm.nb1  6 2847.853

No surprise there. Now we ask whether a mixed (ZAP) model that divides up the data into two groups—zeros and nonzeros, fitting binomial error to the former and Poisson error to the latter—is a better model even though it requires another parameter.  For this we use the hurdle function in the pscl library:
install.packages("pscl")
library(pscl)
summary(zap1)

Call:

hurdle(formula = cover ~ elev | elev + I(elev^2) + streamdist + elev * streamdist, data = mf, dist = "poisson", zero.dist = "binomial", link = "logit", control = hurdle.control(method = "Nelder-Mead"))

Pearson residuals:

Min      1Q  Median      3Q     Max

-1.0436 -0.7001 -0.3469  0.4502  6.0524

Count model coefficients (truncated poisson with log link):

Estimate Std. Error z value Pr(>|z|)

(Intercept) 6.755e-01  1.051e-01   6.427 1.30e-10 ***

elev        4.651e-04  9.552e-05   4.869 1.12e-06 ***

Zero hurdle model coefficients (binomial with logit link):

Estimate Std. Error z value Pr(>|z|)

(Intercept)     -4.782e+00  2.150e-01  -22.24   <2e-16 ***

elev             1.231e-02  1.929e-04   63.82   <2e-16 ***

I(elev^2)       -6.950e-06         NA      NA       NA

streamdist      -2.927e-03  1.479e-05 -197.84   <2e-16 ***

elev:streamdist  1.866e-06         NA      NA       NA

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in Nelder-Mead optimization: 362

Log-likelihood: -1328 on 7 Df

Warning message:

In sqrt(diag(object\$vcov)) : NaNs produced
The formula for hurdle can take two components: the left side of the "|" is the model for the count component, and the right side (without repeating the response variable) is the model for the binomial predictor of absences.  Including one formula component only means the same deterministic function is used for both submodels.  Here I made the presence-absence component our above deterministic function, and the count (abundance) component a simple function of elevation.  The output shows that the parameters vary strongly according to whether counts or absences are being predicted (note the coefficients signs are often opposite because the latter model is predicting absences, not presences).  The model fitting procedure did not like the second-order components for some reason, so we get the parameter values but not any estimate of variance.  Also note the hurdle function passes the hard work of model fitting to optim, and you can access optim parameters (like method) using the hurdle.control argument.  Before asking whether the ZAP model is better than the standard (un-mixed) NB model, we can try the equivalent ZANB model:
AIC(glm.nb1,zap1,zanb1)
df      AIC
glm.nb1  6 2847.853
zap1     7 2669.024
zanb1    8 2667.635

We gained a lot by using the ZA model structure, but the Poisson and NB count models are similar. Of course, at this point you would want to compare other possibilities for deterministic components of each submodel.

Finally, we can compare the ZAP/ZANB approach to one that allows absences to enter both submodels, the so-called 'mixture' models (ZIP and ZINB).  The function for ZIP/ZINB is zeroinfl in the same pscl package, and the formula syntax for submodels is the same as for hurdle. Because only a binomial model is used for the zeros-only component, you can specify the link but there is no zero.dist argument:
AIC(zap1,zanb1,zip1,zinb1)
df      AIC

zap1   7 2669.024

zanb1  8 2667.635

zip1   7 2667.738

zinb1  8 2666.023
Here we see that the ZINB model is almost a clear frontrunner, although not clearly better than ZIP or ZANB models.  Although there is also no anova (LRT) functionality within the pscl package, the lmtest library contains the lrtest function that is able to extract –LL and df from nested models (like Poisson and NB) for the chi-squared test:
install.packages("lmtest")
library(lmtest)
lrtest(zip1,zinb1)

Likelihood ratio test

Model 1: cover ~ elev | elev + I(elev^2) + streamdist + elev * streamdist

Model 2: cover ~ elev | elev + I(elev^2) + streamdist + elev * streamdist

#Df  LogLik Df  Chisq Pr(>Chisq)

1   7 -1326.9

2   8 -1325.0  1 3.7147    0.05394 .

As a way to compare how the presence-absence vs. abundance models affect the overall prediction of magnolia abundance, we can use the predict function with special arguments for zeroinfl objects that specify output of either submodel or the full composite model, and graph the predictions for one or more predictor variables.  I'll use predictions over an elevation gradient and will fix the stream distance input at 100 m:
x = seq(0,2000)    #new elevation input
y.absence = predict(zinb1,newdata=list(elev=x,streamdist=rep(100,length(x))),type="zero")  #predictions for the bimomial submodel
y.count = predict(zinb1,newdata=list(elev=x,streamdist=rep(100,length(x))),type="count") #predictions for the NB submodel
y.all = predict(zinb1,newdata=list(elev=x,streamdist=rep(100,length(x))),type="response")   #predictions for the full model

par(mfrow=c(3,1))  #three-panel graph
plot(x,y.all,xlab="Elevation",ylab="Predicted abundance")
plot(x,1-y.absence,xlab="Elevation",ylab="Predicted occurrence")   #note 1-y.absence predicts occurrences rather than absences
plot(x,y.count, xlab="Elevation",ylab="Predicted abundance")

With so many zeros dominating the cover vector, it is not surprising that the presence-absence model dominates the overall predicted cover response.  Our deterministic model for abundance given occurrence is simply a monotonically increasing function of elevation because elevation was our only predictor.  Clearly this is just the start of such an exercise; ZIP and ZINB models allow for complex sets of hypotheses involving factors that influence whether a species occurs in a location versus how abundant it gets once it's there, using up a minimal amount of degrees of freedom.