rtrim 2.0 extensions

Patrick Bogaart

2024-06-21

Introduction

This vignette describes the extensions to the rtrim package as per version 2.0:

and in addition extensions introduced in version 2.2: * long output * scaled index standard errors

See also the vignettes rtrim confidence intervals and taming overdispersion for additional new features.

Monthly data

One of the major improvements in rtrim 2.0 with respect to 1.0 is the handling of monthly, or any other intra-annual, data. For example, where a classic TRIM model 3 looks like \[ \ln \mu_{ij} = \alpha_i + \beta_j \] where \(\mu_{ij}\) is an expected count, \(\alpha_i\) is a site parameter for site \(i\) and \(\beta_j\) is a time point parameter for year \(j\). the extension towards months looks like \[ \ln \mu_{ijm} = \alpha_i + \beta_j + \delta_m \] where \(\delta_m\) are month parameters for month \(m\).

Please take a look in Models and statistical methods in rtrim for a full explanation, and application to other model formulations.

The general syntax to specify R-Trim models that use monthly data is as follows:

  z <- trim(count ~ site + year, data=...)                   # simple annual data
  z <- trim(count ~ site + year + habitat, data=...)         # using covariates
  z <- trim(count ~ site + (year+month), data=...)           # using monthy data
  z <- trim(count ~ site + (year+month) + habitat, data=...) # using both

Note the use of brackets to distinguish months from covariates.

Here is a full example for Oystercatcher data, which now comes with rtrim 2.0

rm(list=ls()) # always start with a clean slate
library(rtrim)
data(oystercatcher)
oc <- trim(count ~ site + (year+month), data=oystercatcher, model=3, overdisp=TRUE,
           constrain_overdisp=0.999)
#> Warning in trim_estimate(count = count, site = site, year = year, month =
#> month, : Removed 15 sites without positive observations: (1, 30, 41, 48, 66,
#> 69, 78, 82, 83, 85, 86, 87, 88, 92, 104)
plot(index(oc))

Comparison with UIndex

While in the past TRIM was used to analyse count data with an annual resolution (i.e. one observation per site per year), the software package UIndex [Underhill, 1989; Underhill and Prŷs-Jones, 1995] was and is used to analyse count data with higher (e.g., monthly) resolution. As demonstrated above, rtrim is extended to accept and analyse monthly data as well. This section demonstrates the application of rtrim to monthly data, and compares the output with that of UIndex.

UIndex

UIndex was used to analyse the monthly Oystercatcher counts, collected by SOVON Netherlands. Here we show the pre-saved output of UIndex, as the main trend, and the 90% `consistency intervals’. Note also the use of 2004 as base year.

load("UIndex_Oystercatcher_output.RData")
yrange <- range(uidx$index, uidx$lower, uidx$upper)
plot(uidx$year, uidx$index, type='l', xlab="Year", ylab="Index", ylim=yrange)
segments(uidx$year, uidx$lower, uidx$year,uidx$upper)
legend("topright", "UIndex", col="black", lty="solid")

# Add index=1 line for reference
abline(h=1.0, lty="dashed", col=gray(0.5))

# Mark the base year
ibase <- match(2004, uidx$year)
points(uidx$year[ibase], uidx$index[ibase], pch=16)

rtrim

Here we show the comparision with rtrim, using the results computed above.

# Compute and plot an index for Oystercatcher counts, using 2004 as base year and
# adding 90% confidence intervals as well.
idx <- index(oc, level=0.9, base=2004)
plot(idx, band="ci")

# Plot UIndex on top
lines(uidx$year, uidx$index)
segments(uidx$year, uidx$lower, uidx$year,uidx$upper, lwd=2)

legend("bottom", c("UIndex","TRIM"), col=c("black","red"), lty="solid")

Note the computation and display of confidence intervals, which is new for rtrim 2.0, along with the standard errors of both classic TRIM and rtrim 1.0.

This plot demonstrates that the indices as computed by UIndex and rtrim are virtually identical, and that the 90% confidence intervals of TRIM are well comparable to the 90% consistency intervals of TRIM, although both are estimated using completely different approaches. In the case of TRIM, confidence intervals are based on standard errors which are derived analytically as part of the GEE estimation process and ultimately are based on the variance within the orginal data. See the vignettes Models and statistical methods in rtrim and rtrim confidence intervals for more information. Consistency intervals in UIndex are estimated by means of a bootstrap method. See Underhill [1989] and Underhill and Prŷs-Jones [1995] for more information.

Stratified rtrim

Sometimes it can be usefull to combine rtrim results for different regions (‘strata’) into a single, larger scale (‘superstratum’) rtrim analysis. One particular example is the case where individual EU countries use TRIM or rtrim to compute indices for their own countries, and submit the results to the European Bird Census Counsil EBCC for aggregatation on the EU scale, see van Strien et al. [2001] for an example using the previous stand-alone version of TRIM. In this case, the output of the lower scale rtrim runs, i.e., the time totals, are used as ‘observations’ in the higher scale run. Although this procedure works out well for the estimates and indices, it doesn’t work for the associated standard errors, because the time totals are not Poisson distributed, where the original counts are. To circumvent this problem, rtrim has options to export the variances of the lower scale runs and to import these into the higher scale runs, to use instead.

The following example shows the associated workflow. Strictly for demonstration purposes, we split the Skylark dataset into two ‘regions’ associated with the habitats (heath and dunes).

# split data
data(skylark2)
heath <- subset(skylark2, habitat=="heath") # 208 records
dunes <- subset(skylark2, habitat=="dunes") # 232 records

heath$site <- factor(heath$site) # get rid of empty levels
dunes$site <- factor(dunes$site)

# run models
m1 <- trim(count ~ site + year, data=heath, model=3)
m2 <- trim(count ~ site + year, data=dunes, model=3)

# collect imputed time-totals (which is the default)
t1 <- totals(m1)
t2 <- totals(m2)

plot(t1,t2, names=c("heath", "dunes"))

Note the use of multiple time-totals in a single plot (new for rtrim 2.0)

The next step is to use the time totals for the differente habitats (strata') as input data for an upscaled (superstratum’) run. The habitat types now serve as site names, and imputed counts will be the input counts.

t1$region <- "heath"
t2$region <- "dunes"
t12 <- rbind(t1, t2)
head(t12)
#>   time imputed se_imp region
#> 1 1984     376     33  heath
#> 2 1985     255     25  heath
#> 3 1986     339     22  heath
#> 4 1987     336     21  heath
#> 5 1988     389     23  heath
#> 6 1989     425     24  heath

The final preparation step is to extract the variance-covariance information for the different habitats, and combine them into a single list, using habitat/region names as identifier, enabling the correct match between the site identifiers in the data, and the variance-covariance matrices.

# Also collect the variance-covariance matrices for both runs
vcv1 <- vcov(m1)
vcv2 <- vcov(m2)
vcv3 <- list(heath=vcv1, dunes=vcv2)

and off we go with the superstratum run. Note the new argument covin to use the variance-covariance data.

m3 <- trim(imputed ~ region + time, data=t12, model=3, covin=vcv3)
plot(totals(m3))

Now, just for comparison, we compare index plots for both the baseline run (where dunes and heath are taken together, but do act as covariates) and the upscaled `superstratum’ variant.

m0 <- trim(count ~ site + year + habitat, data=skylark2, model=3) # baseline
t0 <- totals(m0)
t3 <- totals(m3)
plot(t0,t3, names=c("baseline","superstrata"))

Which suggests that for this example the differences are small, if any.

Taming overdispersion

In some cases, especially with clustering bird species, overdispersion can be huge, reaching unrealistic values of more than 500. rtrim now contains an option to constrain the computed value of overdispersion by detecting outliers, and removing them from the computation of overdispersion (but retaining them for all other calculations). The full rationale and methdology is described in Taming overdispersion, but the actual use is rather simple.

Take for example the Oystercatcher data, which results in a huge overdispersion of about 850

data(oystercatcher)
m1 <- trim(count ~ site + (year + month), data=oystercatcher, model=3, overdisp=TRUE)
#> Warning in trim_estimate(count = count, site = site, year = year, month =
#> month, : Removed 15 sites without positive observations: (1, 30, 41, 48, 66,
#> 69, 78, 82, 83, 85, 86, 87, 88, 92, 104)
overdispersion(m1)
#> [1] 850.7956

The inclusion of the option constrain_overdisp=0.999 triggers the detection of outliers that have a probability of 0.1%.

m2 <- trim(count ~ site + (year + month), data=oystercatcher, model=3, overdisp=TRUE,
           constrain_overdisp=0.99)
#> Warning in trim_estimate(count = count, site = site, year = year, month =
#> month, : Removed 15 sites without positive observations: (1, 30, 41, 48, 66,
#> 69, 78, 82, 83, 85, 86, 87, 88, 92, 104)
overdispersion(m2)
#> [1] 102.6988

And so we get a much more reasonable result, with smaller standard errors.

t1 <- totals(m1)
t2 <- totals(m2)
plot(t1, t2, names=c("unconstrained","constrained"), leg.pos="bottom")

Output visualization

Plotting time-totals

Once an rtrim model has been estimated, one of the first steps of analysis schould be the plotting of time-totals. This is done by first calling the totals() function, and then a custom plot() function:

rm(list=ls())                          # New section; time for a new blank slate
data(skylark2)                                             # reload Skylark data
m1 <- trim(count ~ site + year, data=skylark2, model=3)
t1 <- totals(m1)          # By default, the time-totals for the imputed data set
plot(t1)

Alternatively, one may compute the fitted time-totals. the next example shows the plotting of both the imputed and fitted time-totals, and also demonstrates how series can be named, and the plot can be decorated with a main title.

m2 <- trim(count ~ site + year, data=skylark2, model=2, changepoints=c(1,2))
ti <- totals(m2, "imputed")
tf <- totals(m2, "fitted")
plot(ti, tf, names=c("imputed","fitted"), main="Skylark", leg.pos="bottomright")

Since imputed totals are composed of both observed and estimated counts, it might be insightful to plot the observed counts as well.

m3 <- trim(count ~ site + year, data=skylark2, model=3)
t3 <- totals(m3, obs=TRUE)          # Extract observations in addition to totals
plot(t3)

As can be seen, the amount of observed Skylarks is considerable smaller than the time totals suggest. Furthermore, it can be seen that while the observed counts decrease from 1989, the imputed counts continue to increase. It is thus suggested to look into more detail what is going on in different sites.

Plotting indices

Once a TRIM model has been estimated, and indices are computed, these latter can be plotted using the generic plot command plot() (which, behind the screens, calls plot.trim.index()).

m <- trim(count ~ site + year, data=skylark2, model=3) # Run a fairly basic TRIM model
idx <- index(m) # By default, the indices for the imputed data set
plot(idx)

If required, the x-axis and y-axis labels as well as the tile can be defined, and the index can be expressed as a percentage, instead as a fraction. This example shows all these options:

plot(idx, xlab="Year AD", ylab="Index (%)", main="Skylark index", pct=TRUE)

Indices and covariates

When covariates are involved, it can be helpful to compute and plot indices for the various covariate categories as well. The next example demonstrates this.

m <- trim(count ~ site + year + habitat, data=skylark2, model=3) # Run a fairly basic TRIM model
idx <- index(m, covars=TRUE)
plot(idx)

As can be seen, indices for the various covariate categories are automatically plotted as well. This behaviour can be supressed by setting covar="none" in the call to plot() (note the use of plural covars' in the call toindex()--- because indices for multiple covariates can be computed, and the singularcovarin the call toplot()` — because only one of them can be used for a single figure)

Combining multiple indices

Indices for multiple TRIM runs can be combined in a single plot.

data(skylark2)
m0 = trim(count ~ site + year          , data=skylark2, model=3)
m1 = trim(count ~ site + year + habitat, data=skylark2, model=3)

idx0 <- index(m0)
idx1 <- index(m1)

plot(idx0, idx1)

As you see, a legend is inserted automatically. You can change the names of the series by using the names argument:

plot(idx0, idx1, names=c("Without covariates", "Using 'Habitat' as covariate"))

Adding confidence intervals.

New in rtrim 2.0 is the possibility to express uncertainty as a confidence interval, in addition to the standard errors. Both the functions totals() and index() now accept the option level that specifies the confidence level and triggers the computation.

m <- trim(count ~ site + year, data=skylark2, model=3)
tt <- totals(m, level=0.95)                   # Compute 95% confidence intervals
head(tt)
#>   time imputed se_imp       lo       hi
#> 1 1984     511     38 510.0584 510.0584
#> 2 1985     362     31 361.1155 361.1155
#> 3 1986     429     26 428.4749 428.4749
#> 4 1987     423     25 422.5076 422.5076
#> 5 1988     469     27 468.4820 468.4820
#> 6 1989     522     27 521.5346 521.5346

So, the lower and upper bounds of the confidence interval is stored in columns lo and hi. These are automatically picked up by the plot() function.

plot(tt)

If required, the uncertainty band, which is by default plotted using standard errors, can be plotted using the confidence intervals when the option band="ci" is used.

plot(tt, band="ci")

See vignette rtrim confidence intervals for more information on the underlying methodology.

Long output

Stating with version 2.2, the totals() and index() functions can also generate so-called long output, which simplifies plotting using alternative approaches, e.g. ggplot2 or similar, especially because imputed and fitted time-totals are in rows, not in columns.

tt <- totals(m, long=TRUE)
head(tt)
#>      variable  series year value SE
#> 1 time_totals imputed 1984   511 38
#> 2 time_totals imputed 1985   362 31
#> 3 time_totals imputed 1986   429 26
#> 4 time_totals imputed 1987   423 25
#> 5 time_totals imputed 1988   469 27
#> 6 time_totals imputed 1989   522 27

lo <- tt$value - 1.96 * tt$SE # Assume normal distribution
hi <- tt$value + 1.96 * tt$SE

# create an empty plot with sufficient space
xrange = range(tt$year)
yrange = range(lo, hi)
plot(xrange, yrange, type='n', xlab="Year", ylab="Time-totals")
# plot the time series and error bars
segments(tt$year, lo, tt$year, hi, col="red", lwd=2)
lines(tt$year, tt$value, col="red", lwd=2)

A similar mechanism has been included for custom plotting of overall trends, using a new separate trendlines function.

tl <- trendlines(overall(m))     # collect overall trend line
print(tl)
#>    segment     year    value       lo       hi
#> 1        1 1984.000 404.7544 362.9033 451.4320
#> 2        1 1984.101 406.7516 365.4989 452.6604
#> 3        1 1984.203 408.7587 368.1056 453.9015
#> 4        1 1984.304 410.7757 370.7227 455.1559
#> 5        1 1984.406 412.8026 373.3498 456.4244
#> 6        1 1984.507 414.8395 375.9863 457.7077
#> 7        1 1984.609 416.8865 378.6314 459.0067
#> 8        1 1984.710 418.9436 381.2843 460.3223
#> 9        1 1984.812 421.0108 383.9444 461.6556
#> 10       1 1984.913 423.0882 386.6106 463.0075
#> 11       1 1985.014 425.1759 389.2821 464.3792
#> 12       1 1985.116 427.2739 391.9577 465.7720
#> 13       1 1985.217 429.3822 394.6363 467.1872
#> 14       1 1985.319 431.5009 397.3167 468.6263
#> 15       1 1985.420 433.6301 399.9974 470.0907
#> 16       1 1985.522 435.7698 402.6771 471.5821
#> 17       1 1985.623 437.9200 405.3541 473.1023
#> 18       1 1985.725 440.0809 408.0267 474.6532
#> 19       1 1985.826 442.2524 410.6932 476.2368
#> 20       1 1985.928 444.4347 413.3516 477.8551
#> 21       1 1986.029 446.6277 415.9999 479.5104
#> 22       1 1986.130 448.8315 418.6359 481.2050
#> 23       1 1986.232 451.0462 421.2574 482.9415
#> 24       1 1986.333 453.2718 423.8621 484.7222
#> 25       1 1986.435 455.5085 426.4474 486.5499
#> 26       1 1986.536 457.7561 429.0111 488.4271
#> 27       1 1986.638 460.0148 431.5505 490.3566
#> 28       1 1986.739 462.2847 434.0633 492.3410
#> 29       1 1986.841 464.5658 436.5470 494.3829
#> 30       1 1986.942 466.8582 438.9993 496.4850
#> 31       1 1987.043 469.1618 441.4179 498.6495
#> 32       1 1987.145 471.4768 443.8007 500.8788
#> 33       1 1987.246 473.8033 446.1460 503.1750
#> 34       1 1987.348 476.1412 448.4522 505.5398
#> 35       1 1987.449 478.4907 450.7177 507.9749
#> 36       1 1987.551 480.8517 452.9418 510.4814
#> 37       1 1987.652 483.2244 455.1235 513.0604
#> 38       1 1987.754 485.6088 457.2625 515.7124
#> 39       1 1987.855 488.0050 459.3587 518.4377
#> 40       1 1987.957 490.4130 461.4124 521.2364
#> 41       1 1988.058 492.8329 463.4240 524.1080
#> 42       1 1988.159 495.2647 465.3944 527.0522
#> 43       1 1988.261 497.7085 467.3246 530.0679
#> 44       1 1988.362 500.1644 469.2157 533.1544
#> 45       1 1988.464 502.6324 471.0693 536.3103
#> 46       1 1988.565 505.1126 472.8868 539.5344
#> 47       1 1988.667 507.6050 474.6699 542.8253
#> 48       1 1988.768 510.1097 476.4201 546.1816
#> 49       1 1988.870 512.6268 478.1393 549.6017
#> 50       1 1988.971 515.1563 479.8291 553.0843
#> 51       1 1989.072 517.6982 481.4913 556.6279
#> 52       1 1989.174 520.2527 483.1274 560.2309
#> 53       1 1989.275 522.8199 484.7392 563.8921
#> 54       1 1989.377 525.3997 486.3283 567.6100
#> 55       1 1989.478 527.9922 487.8960 571.3835
#> 56       1 1989.580 530.5975 489.4439 575.2113
#> 57       1 1989.681 533.2157 490.9734 579.0923
#> 58       1 1989.783 535.8467 492.4857 583.0255
#> 59       1 1989.884 538.4908 493.9821 587.0099
#> 60       1 1989.986 541.1479 495.4637 591.0445
#> 61       1 1990.087 543.8181 496.9316 595.1286
#> 62       1 1990.188 546.5016 498.3868 599.2614
#> 63       1 1990.290 549.1982 499.8303 603.4422
#> 64       1 1990.391 551.9081 501.2629 607.6703
#> 65       1 1990.493 554.6315 502.6855 611.9453
#> 66       1 1990.594 557.3682 504.0989 616.2666
#> 67       1 1990.696 560.1185 505.5038 620.6338
#> 68       1 1990.797 562.8823 506.9008 625.0464
#> 69       1 1990.899 565.6598 508.2906 629.5041
#> 70       1 1991.000 568.4510 509.6738 634.0065

tt <- totals(m, long=TRUE)       # collect time-totals

# define plot limits
xr <- range(tt$year)
yr <- range(tl$lo, tl$hi, tt$value)
plot(xr, yr, type='n', xlab="Year", ylab="Total counts")

# Plot uncertainty band
ubx <- c(tl$year, rev(tl$year))
uby <- c(tl$lo, rev(tl$hi))
polygon(ubx, uby, col=gray(0.9), border=NA)

# Plot trend line
lines(tl$year, tl$value, col="black", lwd=2)

# Plot time-totals
lines(tt$year, tt$value, col="red", lwd=2)
points(tt$year, tt$value, col="red", pch=16, cex=1.5)

## “Scaled” standard errors

In rtrim version \(<2.2\) standard errors for indices are always computed using a formal approach that results in \(SE=0\) for the reference year. Mathematically, this makes completely sense because the indices are computed by dividing the time-totals time series by the time-total for the reference year. The index for that reference year is thus by definition 1, without any uncertainty, hence \(SE=0\). The following plot illustrates this.

tt  <- totals(m)
idx <- index(m)
par(mfrow=c(1,2))
plot(tt, main="Time-totals", ylab=NA)
plot(idx, main="Index", ylab=NA)

However, for many (communication) purposes this formal approach is rather confusing, because the interpretation is often made that uncertainties ‘disappeared’ this way, Furthermore, the indices are often only calculated to compare trends for multiple species. For this comparison purposes, one often wants to preserve the original uncertainty pattern as in the time-totals. Starting with rtrim version 2.2, this is possible by using the method="scaled" option:

fidx <- index(m, method="formal") # same as just index(m)
sidx <- index(m, method="scaled")
par(mfrow=c(1,2))
plot(fidx, main="Formal approach")
plot(sidx, main="Scaled approach")

As can be seen, the uncertainty for the reference year now is \(>0\), while the uncertainties for the other years have been shrunk, which can be explaiend from what loosely can be described as a preservation of total uncertainty.

Plotting heatmaps.

The detailed spatiotemporal structure of both the observed and the imputed data can be inspected by means of the function heatmap() that operates on the output of trim(). The default behaviour of this function is to display a heat map of the observed counts only:

m <- trim(count ~ site + year, data=skylark2, model=3)
heatmap(m, main="Skylark, observations")

In this heatmap, site/time combinations are colored by (log) counts: lower counts are colored a pale red, and high counts a dark red. Consistent with the nature of count data, this color scale is proportional to the log counts. Observed counts of 0 cannot be represented this way and are colored white. Missing site/time combinations are marked as gray.

It can be seen that the observational coverage is not constant: most sites have incomplete records, especially in the earlier years. This is a typical patern for an expanding observation program, but may have consequences for the statistical analysis, because the imputation for these years will in fact be an extrapolation back in time.

The next example shows the TRIM estimated counts (using shades of blue, rather than red:

heatmap(m, "fitted", main="Skylark, TRIM estimates")

From this plot, it is clear that the variance between sites is much higher than the variance between years. In fact, the trend in time can hardly be seen.

The last example sows the heatmap for the imputed data, where estimates are used to fill up the missing observations. Again, red is for obervations, blue for estimates.

heatmap(m, "imputed", main="Skylark, imputed data")

Heatmaps for monthly datasets

For monthly data, heatmaps work slightly different, but in the same spirit:

data(oystercatcher)
m <- trim(count ~ site + (year + month), data=oystercatcher, model=3, overdisp=TRUE)
#> Warning in trim_estimate(count = count, site = site, year = year, month =
#> month, : Removed 15 sites without positive observations: (1, 30, 41, 48, 66,
#> 69, 78, 82, 83, 85, 86, 87, 88, 92, 104)
heatmap(m, "imputed", main="Oystercatcher (imputed)")

Again, observational coverage is extremely variable in both space and time. There appears to be a few sites that have sporadic, yet high, count observations, causing large amounts of estimated counts for this location for all other time points, which may effect the aggregated time-totals in a significant way.

Also note that in this example, many site/time combinations have registered a count of 0, which are colored white, as explained above.

References

van Strien, A. J., J. Pannekoek and D.W. Gibbons (2001) Indexing European bird population trends using results of national monitoring schemes: a trial of a new method, Bird Study, 48 (2), 200-213, DOI: 10.1080/00063650109461219

Underhill, LG, Prŷs-Jones, RP (1994) Index numbers for waterbird populations. I. Review and methodology. J Appl Ecol, 31, 463-480. doi: 10.2307/2404443

Underhill, L.G. (1989) Indices for Waterbird Populations. BTO Research Report 52, British Trust for Ornithology, Tring.