GRASS GIS and R

 

The movement away from proprietary GIS packages like Arc that are convenient but expensive and toward an open-source program like GRASS has not quite achieved the level of SAS/SPLUS/STATA/etc to R, but the analogy holds: there is a large and growing community of users that are continually providing new packages in a free software product that is harder to learn (at first) but ultimately more powerful and satisfying.  And the best bit is that GRASS is now designed to integrate (relatively) smoothly with R (R runs within GRASS), allowing powerful statistical analysis of spatial data and basically one-stop shopping for species distribution modeling.  We'll just cover the basics of starting with GRASS and interfacing with R, but I recommended these resources for continuing on:

Neteler, M. and Mitasova, H. 2008. Open source GIS: a grass GIS approach (3rd Ed.). Springer.

Hengl, T. 2009. A practical guide to geostatistical mapping. Open Access (FREE!):

http://blog.revolution-computing.com/2010/04/a-free-book-on-geostatistical-mapping-with-r.html

 

Installing GRASS GIS

As of this writing, you have 3 basic options for working with GRASS: 1) you can run GRASS in its native UNIX environment on a Linux box or Mac (recommended; find a linux box and run everything via a remote terminal is the best option, thereby not worrying about local computer performance); 2) you can be a guinea pig for the *new* native Windows GRASS, available here:

http://grass.itc.it/grass64/binary/mswindows/native/

which does not require a unix emulator but is still in beta stage and has no easy integration with R; or 3) install the Cygwin linux-like environment for Windows and run GRASS from Cygwin, which used to be a decent alternative to the UNIX version but recent Cygwin changes have made installation and particularly R integration difficult.  As much as I wanted to make things easy using a Windows install, I have had so many problems this week working with winGRASS and the new Cygwin changes to GRASS that I'm instead going to show you the Linux version, which at this point seems to be the only option for smooth R integration.  For intrepid Windows junkies I post how I proceeded with the latest Cygwin interface in the box below.  Mac and Linux users can download GRASS binaries here:

http://grass.osgeo.org/download/software.php

and Linux and skip the box below.

 

---------------------------------------------------------------------------------------------------------------------------------------------------------------

GRASS on Windows via Cygwin

            The page supposedly describing GRASS 6.3 installation via Cygwin is here: http://grass.osgeo.org/grass63/binary/mswindows/cygwin/#prereq
            What they don't tell you on that page is that the default package installer for GRASS does not get all the components you need for visualization (that is, the Windows X server that allows you to open new frames for map display).  So I recommend you start here:

http://x.cygwin.com/docs/ug/setup-cygwin-x-installing.html

Follow these instructions carefully, in particular the X packages listed in step 15.  During the download and installation process you may get notices that the download could not be completed: choose another download mirror site and try it again.  The install process remembers what you've already successfully downloaded, so it doesn't take as long in repeated attempts (although repeated cycles through the process is a bit frustrating).  At the last step, choose to create the icons and then fire up Cygwin from the desktop icon.

            You now have an environment that simulates a bash shell UNIX environment.  All the files that Cygwin can 'see' are embedded in your home directory under C:\cygwin.  To start the X windows server, type 'startxwin' at the bash prompt.  A new white background dialogue box appears that will run GRASS.  However, you don't have it yet.  Put the 'setup.exe' installer you downloaded at the beginning in the C:\ directory (if it's not there already), and from the DOS command line (Start-Run: Cmd), type "setup.exe –X".  The Cygwin installer starts again, and this time at step #3 add this location to the list of sites:
http://grass.ibiblio.org/grass63/binary/mswindows/cygwin/

and select it (just it, no other mirror necessary).  Now under the 'Database' package folder, select "grass: Geographic…", and click Next until the installer finishes.  You now hopefully have grass63 installed (or possibly grass62; some mirrors have different versions available).  From your X window, type "grass63" (or grass62), and now we're ready to get started, you can proceed with defining a LOCATION AND MAPSET below.

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

 

Making the Arc-GRASS transition

            I will assume you already have at least a little GIS know-how and have worked with maps in Arc (not essential if you haven't), and are in the position of having an existing raster dataset in Arc that defines your region of interest (e.g., a DEM).  You can port your own raster from Arc to GRASS by saving the Arc raster as a ESRI ASCII grid file (same input for Maxent).  (In ArcToolbox: Conversion Tools -> From Raster -> Raster to ASCII).  This should be a .txt file.  The file is put in the GRASS data directory (see below for Linux; in Cygwin this is created in your home directory (mine is C:\cygwin\home\Fridley\grassdata).  Here we'll use the elev.asc file from the Maxent session (in grids.zip), renamed to "elev.txt". 

There are other tips for importing Arc vector data, etc here:

http://grass.osgeo.org/wiki/Tips_for_Arc_users

`

Defining a LOCATION and MAPSET

            Because of its size and formatting complexity, working with GIS data is always a hassle, but the advantage of GRASS is that the hassle is mostly front-loaded: it takes some time to start work in a new region and upload new files, but once they're in the projection and formatting-related problems disappear.  (Anyone who works in Arc knows that the Arc situation is the opposite: easy to get files in, sometimes impossible to get them to talk to each other in an analysis.)  Once you get the hang of it creating new projects is easy, but you do need to know a few basic parameters of your region of interest to get started.

            Whether you're in Linux or Cygwin or winGRASS, GRASS projects have a nested directory structure.  All objects are stored in a central database directory (e.g., /home/user/grassdata), which is where you put maps to be imported (here, our elev.txt file).  Nested within this directory are separate directories for different projects called LOCATIONs.  A Location is both a folder and a geographical region defined by standard spatial properties of spatial extent, resolution, measurement unit (e.g., UTMs), and projection/datum, all of which must be defined when a new Location is created.  A MAPSET is a directory of maps for a particular Location.  Mapsets are most useful in a multi-user environment, where a central set of read-only maps (often called PERMANENT) can be copied and manipulation by individual users, each with their own Mapset folder (/user1, /user2, etc).  I typically create one Mapset called PERMANENT and don’t worry about creating others.  Finally, within the Mapset folder are separate folders for different data types (rasters, vectors, volume maps), which fortunately you don't have to worry about because these are all visible to in a standard GRASS session.  This diagram from the GRASS Quickstart Wiki (http://grass.osgeo.org/grass64/manuals/html64_user/helptext.html) summarizes how GRASS projects are organized:

 

Creating a new Location and Mapset for your region of interest is straightforward once you have basic spatial information for it.  Here is an example working from the command line of GRASS 5.4 in Linux (instructions for the GUI interface for winGRASS are in the above link).

Start grass:

$ grass54

The define location screen opens (this is always the first screen).  Type in a new location name:

LOCATION: BIO793_GSMNP

Keep the Mapset as PERMANENT, and indicate ('y') that you would indeed to create this Location, and give it a one-line description.  For the Smokies most geocoordinate data are in UTMs (option C).  For datum, specify nad27 (type 'list' to see options). A geodetic datum is a particular spatial reference grid; NAD27 is the older North American Datum, and also common is WGS84 (World Geodetic System of 1984) that covers the globe, which is basically equal to NAD83.  Specify option 1 for datum transformation (contiguous North America).  Our UTM zone is 17 (north).

            Next we specify the UTM boundaries of the location.  You can get this info using the spatial attributes property on an existing map in Arc, or you can examine the header information for elev.txt, which lists the X,Y coordinates of the lower-left (ll) corner and calculate E-W-N-S boundaries by multiplying the number of rows or columns by the resolution (here, 30 m).  Our west edge is 227845, south edge is 3922854, north edge is 3963294 (=3922854+[30*1348]), and east edge is 315235 (=227845+[30*2913]).  We'll work from a spatial resolution of 30 m, which matches our input DEM.  After accepting these parameters, our new location is created.  GRASS then returns to the define Location and Mapset page; select our new location and GRASS starts from the command line.  Note the default GRASS data directory on the define location screen; this is again where maps to be imported should be placed (here, elev.txt).  That's it—the hard part is now over.

 

Importing rasters

            Standard GIS tasks in GRASS are usually very simple.  For example, to import our elevation raster, type:
r.in.arc input=elev.txt output=elev

which creates the raster file elev and puts it in our raster folder in our PERMANENT Mapset.  Note that the syntax for GRASS commands is based on UNIX: a particular command function (r.in.arc) is followed by required and optional arguments (here, two required arguments are input and output).  If you omit required arguments, GRASS will usually prompt you for more information.  Raster-based commands start with 'r.', vector commands with 'v.', and general commands (like file manipulation, location information, etc.) with 'g.'.  For the latest version (GRASS 6.4), the list of available functions with links for each are here:
http://grass.osgeo.org/grass64/manuals/html64_user/index.html

For map display in both Linux and Cygwin, GRASS uses the standard X windows display drivers, initiated by:

d.mon start=x1

which opens a monitor window.  You can open more windows (x2, x3, etc) as needed, and close them with replacing start with stop.  To display our imported elevation map:

d.rast elev

You can add a legend with

d.legend elev

or try just d.legend without arguments to see the range of formatting options.  A handy tool for summarizing a raster is r.univar, which lists the total cells and simple stats of the distribution of raster values.

 

Topographic calculations

            Aside from interfacing directly with R, a main advantage of GRASS is the built-in functionality for using DEMs to create standard topographic indices that are often used as proxy variables for critical environmental factors like water and radiation.  For example, the r.slope.aspect function uses a DEM to create slope and aspect maps, as well as other DEM derivatives (curvature):

r.slope.aspect elevation=elev slope=slope aspect=aspect

Generation of new raster maps 'slope' and 'aspect' took a few seconds on my Linux box.

            For hydrologic variables, make sure your region includes all upslope contributing area.  A typical example is the topographic convergence index (TCI) of Beven and Kirkby (1979, A physically-based variable contributing area model of basin hydrology, Hydrol. Sci. Bull. 1: 43–69), which calculates 'potential wetness' as the total upslope contributing area corrected by the steepness of the local slope: TCI = log(upslope area / tan(slope angle)).  This is a one-line command in GRASS, requiring only an input DEM and specification of the output file:
r.topidx intput=elev output=tci

Despite our DEM containing over 2 million pixels, this took only about two minutes on my Linux box.  You can do the same calculations using GRID in Arc (or Arc Toolbox), but it takes several steps and (in my experience) much longer.  Users with decent rainfall and ET data can create dynamic (spatial water flow) maps using the TOPMODEL hydrologic model with the r.topmodel command.  There are many other raster functions related to hydrologic modeling: r.watershed can delimit separate watersheds from a DEM; r.terraflow creates flow accumulation and direction maps (used to estimate stream locations or find ponding basins), and see the raster command manual above for others. 

            Another very useful function in GRASS is r.sun, a solar irradiance model that uses a DEM along with earth-sun geometry to calculate daily solar radiation for each pixel (direct, diffuse, and reflected).  You specify a day and overall latitude, and the algorithm computes the intensity and daylight length (including a terrain shadowing effect) to derive total daily radiation (in watts per m2 per day).  You can optionally include other variables or maps that allow for ground albedo, atmospheric turbidity, and cloudiness.  Here's an example for calculating direct-beam radiation for each pixel on June 20 (Julian day 171), using shadows but assuming clear-sky conditions:
r.sun –s eleven-elev aspin=aspect slopein=slope day=171 lat=35.00 beam_rad=rad171

Currently, it is trickier to sum up radiation values for multiple days (e.g., a full month or differences in total annual radiation between pixels).  I paste an addendum shell script below that automates the r.sun routine for any list of days you specify.  You can also do this with the R interface, but for large rasters these calculations are much faster in GRASS than GRASS/R (several minutes in this case).

 

Extracting values from point samples in GRASS

            Outside of the R environment, it is often useful to extract values of certain raster maps for use in subsequent modeling.  In GRASS this is accomplished using the r.what command along with a text file in your working directory that specifies the coordinates of your locations of interest; you put a text file of coordinates in, and get a text file of coordinates plus raster values back.  It is a slightly confusing process but the following steps work for me:

1. In Excel, save your coordinates as 3 columns: X, Y, and (optionally) a site label.  For a 4th column, just paste a comma (",") in a a cell at the end of the line.  Save it as a tab-delimited text file with no header (no column labels).  When you open it in a text editor, it should look like the treedataxy.txt file. For example line 1 is:

275736    3942439   ATBN-01-0403,

2. Put this file in your GRASS working directory.

3. In GRASS:

r.what input=elev,tci null="NA" <treedataxy.txt >output.txt

Which will extract elevation and TCI values according to the coordinates in treedataxy.txt, and place them in a new text file output.txt.  Missing values will get a NA value.

4. The output.txt file now has new rows under each original row for elevation and TCI values.  To format this correctly (one row per site), open the file in a good text editor (my favorite is TextPad), and use the 'find and replace' function to replace the end of the line comma with nothing.  In TextPad: hit F8, make sure 'regular expression' is checked, find ",\n", replace with [blank]. This kills the commas and the line breaks, and it's now easy to examine in Excel (import text with | delimiter) or feed into R.

            Of course you often desire values from many rasters at once, often those whose names are almost identical (but may differ in number, such as days of radiation); for this you can use r.what along with the g.mlist command to specify a set of rasters that share part of their names.  Pay close attention to the syntax here for the input argument:
r.what input="`g.mlist pattern='radiation*' sep=,`" null="NA" <treedataxy.txt >output.txt

This grabs values for all rasters with "radiation" at the beginning of their name.

 

Exporting rasters out of GRASS

            Any new maps created in GRASS but needed elsewhere (like Arc) can be exported with various 'r.out…' commands specific to particular raster formats.  Exporting to Arc via a ASCII GRID file is accomplished via:
r.out.arc input=tci output=tci.txt

which places the tci.txt file in your working directory.  In Arc Toolbox, use Conversion Tools -> To Raster -> ASCII to Raster to import the raster back into an Arc raster file.

 

An example shell script for automating GRASS functions using r.sun

            Save this as a text file in the grass data directory with a '.sh' suffix, and then execute in GRASS by typing its name at the GRASS prompt.

# Compute a sum of daily global radiation for a given period of the year

echo "Enter elevation map:"

read elev

g.copy rast=$elev,SOLelev

echo "Enter slope map:"

read sl

g.copy rast=$sl,SOLslope

echo "Enter aspect map:"

read asp

g.copy rast=$asp,SOLaspect

echo "Enter Latitude of given region:"

read lat

#loops over Julian days, from 'i' to 'lastday'

i=1

lastday=30

while [ $i -le $lastday ]

do

 # generate convenient map names

 DAY=`echo $i | awk '{printf "%03i", $1}'`

 echo "Computing radiation for day $DAY..."

 r.sun -s elevin=SOLelev aspin=SOLaspect slopein=SOLslope\

       lat="$lat" day="$i"\

       beam_rad=b_rad.$DAY diff_rad=d_rad.$DAY\

                            refl_rad=r_rad.$DAY

 r.mapcalc totalrad.$DAY = "b_rad.$DAY + d_rad.$DAY + r_rad.$DAY"

 r.timestamp totalrad.$DAY date="$i days"

 r.colors totalrad.$DAY col=gyr

 i=`expr $i + 1`

 g.remove rast=d_rad.$DAY,r_rad.$DAY,b_rad.$DAY

done

#sum all days for total radiation, output raster as "globalrad"

echo "Summing all daily rasters..."

r.series input="`g.mlist pattern='totalrad*' sep=,`" output=globalrad method=sum

     #note: if more than 200 or so rasters, have to break this up into pieces

#cleanup:

g.remove rast=SOLelev,SOLaspect,SOLslope

echo "Finished."

 

 

GRASS and R

 

Starting the R environment from within a GRASS session gives you a powerful approach for spatial modeling and is one of the clear advantages for using GRASS over Arc and other GIS software.  In UNIX/Mac, run R from the GRASS shell environment by starting up a GRASS session in the desired LOCATION and MAPSET, then at the GRASS prompt type 'R' to start an R session.  The R package required to interface with GRASS depends on which version of GRASS you're using: for GRASS 5.x, use the "GRASS" library; for GRASS 6.x, use the "spgrass6" library.  For example:
install.packages("spgrass6", dependencies = TRUE)

Although the way R interfaces with GRASS is different with version 6 (supposedly faster), the commands are similar regardless of version.  For Windows users, the situation looks a bit dire, as I can find very little in the way of recent web updating for integrating with R via the new Cygwin changes, and I am unaware of a reasonable integration of winGRASS with R.  (Those wanting to try can explore old GRASS-R posts on the web; please let me know of any successes.)  A general introduction to GRASS-R is here:

http://grass.osgeo.org/statsgrass/grass_geostats.html

 

Preliminaries

            In this example I'm starting GRASS 5.4 from our BIO793_GSMNP2 location and the PERMANENT mapset.  After the R GRASS (or spgrass6) package is installed and R is running from the GRASS environment, load the library and then use the gmeta command to load the current GRASS environment parameters into R:

library(GRASS) #or library(spgrass6)

G = gmeta()

This creates an object of class=grassmeta, which tells R which versions of the generic functions to use (like plot), and it defines the spatial environment.  A summary command specific to GRASS meta-objects shows:
summary(G)

Data from GASS 5.0 LOCATION BIO793_GSMNP2 with 2913 columns and 1348 rows; UTM, zone: 17.

The west-east range is 227845, 315235, and the south-north: 3922854, 3963294;
West-east cell sizes are 30 units, and south-north 30 units.

The str(G) command shows that G is just a list containing spatial information.  Note that all GRASS environment commands are available in R using the system command.  For example, to see a list of available rasters use the g.list command as if in GRASS:
system("g.list rast")

-----------------------------------------------------

raster files available in mapset PERMANENT:

aspect    elev slope     tci

-----------------------------------------------------

To manipulate GRASS rasters in R, the rast.get command converts a raster into a large R list object.  The c(F) argument tells R that you want the raw raster values rather than a categorical dataset (you can also use catlabels=F), and the G argument is needed to specify the GRASS spatial environment:

elev = rast.get(G,"elev",c(F))

str(elev)

List of 1

 $ elev: num [1:3926724] NA NA NA NA NA NA NA NA …

The NAs describe missing cells.  Note that the raster values are a vector nested inside a list, so to access particular values you need to specify both the list name and the vector name.  Note the difference:
length(elev)

[1] 1

length(elev$elev)

[1] 3926724

As one single vector, there is no X or Y information in our R object.  To create an entire data frame with X, Y, and Z (here, elevation) data, we extract the 'east' and 'north' elements of our G object:

elev.frame = data.frame(east(G),north(G),elev$elev)

str(elev.frame)

'data.frame': 3926724 obs. of 3 variables:

 $ east.G.    : num     227860 227890 227920 227950 227980 …

 $ north.G.   : num     3963279 3963279 3963279 3963279 3963279 …

 $ elev.elev  : num     NA NA NA NA NA NA NA NA NA ...

The naming conventions are a bit silly, so change via:
names(elev.frame) = c("x","y","elev")

Depending on the objectives of the analysis, it may make sense to create one data frame for the region that contains all rasters (although beware of creating very large objects).  You can grab multiple rasters with rast.get and then put them each as a separate column in a data frame:
rast.list = rast.get(G, rlist=c("elev","aspect","slope","tci","rad171"), catlabels=as.logical(rep(F,5)))

rf = data.frame(east(G),north(G),rast.list$elev,rast.list$aspect,rast.list$slope,rast.list$tci,rast.list$rad171)

names(rf) = c("x","y","elev","aspect","slope","tci","rad")

str(rf)

Now you can access all raster variables as you normally do those in data frames.  For example, compare elevation summary statistics with GRASS output:
summary(rf$elev)

system("r.univar elev")

The generic plotting function for GRASS objects is a 3D image, where you specify the region (G) and the R vector for a particular raster:
plot(G,rf$elev,col=terrain.colors(30))

R has functions for specifying certain color palettes: the above is for basic terrain (the argument is for the number of categories to display) and you can also try topo.colors(), heat.colors(), and rainbow().  There is a built-in function for making contour lines for raster in GRASS-R 5 called contour.G, which we can superimpose our on image:

contour.G(G,rf$elev,add=T)

 

Spatial analysis

            One you have a data frame of X, Y, and various 'Z' coordinates, spatial analysis via a variety of R packages is straightforward.  Given the number of calculations involved (for example, using pairwise comparisons of millions of pixels), you often want to take randomly sampled rows of the spatial data frame rather than the entire set, and it's generally a good idea to remove missing values before the analysis.  A polynomial trend surface fitted by least squares (or generalized least squares using surf.gls) can be fit to a raster using the surf.ls function in the spatial library, and displayed with a GRASS 5 wrapper called trmat.G:

library(spatial)

xyz = na.omit(rf[,1:3]) #remove missing values of elevation

xyz2 = xyz[sample(1:2292309),10000),] #10,000 random points (not NA)

names(xyz2) = c("x","y","z")

trend = surf.ls(4,xyz2) #4th-order polynomial

plot(G,rf$elev)

contour.G(G,trmat.G(G,trend),add=T)   #superpose contour lines of trend model on raster

The spatial package includes a function for creating variograms that takes the fitted trend-surface as input, along with desired number of distance bins.  correlogram is the equivalent function for distance-based covariances.
variogram(trend,nint=30)

 

Mapping predictions from a species distribution model

            Predictions from a fitted species distribution model can be easily mapped in the GRASS-R interface using the predict function for any model building exercise and a data frame of raster variables that were used in the model.  Here's an example where we return to our Smokies tree dataset and are interested in examining the probability of Fraser magnolia occurrence across the entire Park, based on elevation, moisture (tci), and site radiation that we calculated for June 20.  Make sure that the treedata.csv file is in the working directory.  I additionally need to import the radiation raster (rad171.txt):

system("r.in.arc input=rad171.txt output=rad171")

rad171 = rast.get(G,"rad171",c(F))

rf$rad = rad171$rad171

Let's create the dataset for magnolia:

dat = read.csv("treedata.csv")    

dat2 = subset(dat,dat$species=="Magnolia fraseri")  

b.plots = unique(as.character(dat2$plotID)) #plots with magnolia

u.plots = unique(as.character(dat$plotID)) #all plots

nob.plots = u.plots[is.element(u.plots,b.plots)==F] #plots without magnolia

dat3 = subset(dat,is.element(as.character(dat$plotID),nob.plots)) #dataset of no magnolia

dat4 = subset(dat3,duplicated(dat3$plotID)==F)  #one row per plot

dat4$cover = 0     #cover of magnolia is zero in these plots

dat5 = rbind(dat2,dat4) #new dataframe of presences and absences

We need to extract values of our new variable rad corresponding to presences and absences from our existing raster. We can use the utme and utmn coordinates to subset the raster values in R, but note they don't exactly equal our X and Y values.  An easier option is to route it through GRASS using the r.what command:
frame1 = data.frame(dat5$utme,dat5$utmn)   #data frame of only x and y coordinates

write.table(frame1,file="magout.txt",sep="\t",row.names=F,col.names=F)     #write a text file to working directory for r.what to access

system("r.what input=rad171 <magout.txt >output1.txt")    #create a new text file with appended rad171 values

rad1 = read.table("output1.txt",sep="|")   #read new text file

dat5$rad = as.numeric(as.character(rad1[,4]))        #final column of text file is appended rad171 value; beware it is imported as a factor rather than numeric values

            Now we construct a GAM model with binomial error:

library(mgcv)

gam1 = gam(cover>0~s(elev)+s(tci)+s(rad),family=binomial,data=dat5,na.action=na.exclude)

summary(gam1)

Using the predict function we can insert a another column into our raster data frame:
rf$predict1 = as.numeric(predict(gam1,newdata=list(elev=rf$elev,tci=rf$tci,rad=rf$rad),type="response"))

Each pixel in the new raster predict1 is now a value between 0 and 1 representing our predictions about magnolia presence:

hist(rf$predict1)

plot(G,rf$predict1,col=heat.colors(30))

We can save our prediction map using the rast.put function, for further display elsewhere (such as in GRASS or exported into Arc via r.out.arc):
rast.put(G,lname=gampred,layer=rf$predict1,DCELL=T)  #DCELL=T keeps 2-decimal precision

Display of the map is faster in native GRASS, which we can again access with the system command. The colors of the map in GRASS can be manipulated with the r.colors command.

system("g.list rast")

system("r.colors map=gampred color=gyr")

system("d.mon start=x1")

system("d.rast gampred")
system("d.legend gampred")