Application: Modeling Relative Humidity in Brasília

Everton da Costa, Francisco Cribari-Neto, and Rodney V. Fonseca

2026-05-23

Note: This vignette requires suggested packages. Install them with: install.packages(c("ggplot2", "gridExtra", "zoo", "forecast", "scales", "moments"))

1 Introduction

This vignette illustrates a complete \(\beta\)ARMA modeling workflow applied to monthly relative humidity data from Brasília, Brazil, using the betaARMA package. The analysis has two complementary objectives.

The first objective concerns numerical stability. Conditional maximum likelihood estimation (CMLE) of \(\beta\text{ARMA}\) models can fail in practice: Cribari-Neto, Costa, and Fonseca (2025) show, via Monte Carlo simulation, that convergence failure rates can reach 18.68% in challenging scenarios (e.g., \(\beta\text{ARMA}(1,4)\) with \(n = 125\)). Their paper also demonstrates, using two empirical datasets, that certain model specifications only converge when ridge penalization is applied. A natural question then arises: once the penalized estimator (PCMLE) achieves convergence, how reliable is the resulting model? Does it produce a well-specified fit that passes residual diagnostics? This vignette addresses that question directly by presenting a specification for which CMLE fails and PCMLE succeeds, and by carrying out a full residual analysis on the penalized model.

The second objective is to present a full modeling workflow — from exploratory analysis through feature engineering, model fitting, residual diagnostics, and out-of-sample forecasting — applied to the complete Brasília humidity series. This mirrors the applied methodology of Costa, Cribari-Neto, and Scher (2024), published in the Journal of Hydrology.

Replication code and data are openly available at https://github.com/Everton-da-Costa/BarmaRidgeBJPS2025.

2 Data and Exploratory Analysis

library(betaARMA)
library(forecast)
library(ggplot2)
library(gridExtra)
library(zoo)
library(scales)
library(moments)

The dataset consists of 306 monthly observations of average relative humidity in Brasília, Brazil, covering January 1999 to June 2024. The data were obtained from NASA’s POWER project. Relative humidity is naturally bounded in \((0, 1)\), making it well suited for \(\beta\)ARMA modeling.

3 A Case of Numerical Instability

To motivate the penalized estimator, we focus on a specific subsample of the Brasília dataset. In Cribari-Neto, Costa, and Fonseca (2025), this subsample was used to illustrate convergence failure under standard CMLE, and it was shown that ridge penalization successfully restores convergence.

While that paper established the numerical stability of the penalized estimator, the objective here is different: to verify the statistical adequacy of the penalized fit through residual diagnostics, and to show how this workflow can guide model reduction. To that end, we specify a \(\beta\text{ARMA}\) model with \(\text{AR} = (10, 18)\), \(\text{MA} = (1, 11, 13)\), and harmonic regressors.

3.1 Train–Test Split

y_subsample_ts <- window(
  brasilia_ts,
  start = c(2006, 02),
  end   = c(2019, 07)
)

y_train <- window(
  y_subsample_ts,
  end = c(2018, 07)
)

y_test <- window(
  y_subsample_ts,
  start = c(2018, 08)
)

The figure below shows the full time series with the training period delimited by dashed vertical lines.

Relative humidity in Brasília. Dashed red lines delimit the training period used in the instability analysis.

Relative humidity in Brasília. Dashed red lines delimit the training period used in the instability analysis.

3.2 Descriptive Statistics

Descriptive statistics of the monthly relative humidity in Brasília — subsample (February 2006 to July 2019).
Min Max Median Mean SD Skewness Excess_kurtosis
0.33 0.88 0.75 0.69 0.16 -0.72 2.11

3.3 Feature Engineering

To capture annual seasonality we include two harmonic regressors, \(s_t = \sin(2\pi t / 12)\) and \(c_t = \cos(2\pi t / 12)\), which together represent a smooth annual wave.

freq              <- frequency(y_train)
n_test            <- length(y_test)
trend_index_train <- seq_along(y_train)
trend_index_test  <- (max(trend_index_train) + 1):
                     (max(trend_index_train) + n_test)

hs_train <- sin(2 * pi * trend_index_train / freq)
hc_train <- cos(2 * pi * trend_index_train / freq)
hs_test  <- sin(2 * pi * trend_index_test  / freq)
hc_test  <- cos(2 * pi * trend_index_test  / freq)

X_train <- cbind(hs = hs_train, hc = hc_train)
X_test  <- cbind(hs = hs_test,  hc = hc_test)

3.4 Standard CMLE

ar_lags <- c(10, 18)
ma_lags <- c(1, 11, 13)

fit_cmle <- barma(
  y       = y_train,
  ar      = ar_lags,
  ma      = ma_lags,
  penalty = FALSE,
  xreg    = X_train
)
summary(fit_cmle)
#> Beta Autoregressive Moving Average Model
#> 
#> Call:
#> barma(y = y_train, ar = ar_lags, ma = ma_lags, xreg = X_train, 
#>     penalty = FALSE)
#> 
#> Link function: logit 
#> 
#> Coefficients:
#>          Estimate Std. Error z value Pr(>|z|)    
#> alpha     0.72525    0.11752   6.171 6.77e-10 ***
#> varphi10 -0.09790    0.06146  -1.593 0.111167    
#> varphi18  0.28234    0.08558   3.299 0.000969 ***
#> theta1    0.92386    0.04349  21.242  < 2e-16 ***
#> theta11   0.16907    0.05591   3.024 0.002496 ** 
#> theta13   0.15547    0.05554   2.799 0.005120 ** 
#> hs        0.51653    0.06040   8.552  < 2e-16 ***
#> hc        0.71515    0.05372  13.313  < 2e-16 ***
#> phi      53.56433    6.54492   8.184 2.22e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Warning: Optimization algorithm [BFGS (stats::optim)] did not converge (Code: 1).
#> ---
#> Log-likelihood: 186.2 
#> AIC: -354.3   BIC: -327.2   HQ: -343.3

The optimizer failed to converge (conv = 1), and the estimates are unreliable. Without a penalized alternative, an analyst would have little recourse but to abandon this lag structure and search for a different specification.

3.5 Penalized CMLE (PCMLE)

Applying ridge penalization to the same specification restores convergence.

fit_pcmle <- barma(
  y       = y_train,
  ar      = ar_lags,
  ma      = ma_lags,
  penalty = TRUE,
  xreg    = X_train
)
summary(fit_pcmle)
#> Beta Autoregressive Moving Average Model
#> 
#> Call:
#> barma(y = y_train, ar = ar_lags, ma = ma_lags, xreg = X_train, 
#>     penalty = TRUE)
#> 
#> Link function: logit 
#> Penalization : Ridge (lambda = 0.01234)
#> 
#> Coefficients:
#>          Estimate Std. Error z value Pr(>|z|)    
#> alpha     0.76929    0.10587   7.266 3.69e-13 ***
#> varphi10 -0.18060    0.07521  -2.401  0.01634 *  
#> varphi18  0.23408    0.08074   2.899  0.00374 ** 
#> theta1    0.47711    0.07387   6.459 1.06e-10 ***
#> theta11  -0.05077    0.07329  -0.693  0.48843    
#> theta13   0.31205    0.07846   3.977 6.97e-05 ***
#> hs        0.53171    0.04495  11.828  < 2e-16 ***
#> hc        0.77621    0.04482  17.317  < 2e-16 ***
#> phi      59.57003    7.28618   8.176 2.22e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Optimization: BFGS (stats::optim) converged after 111 objective function evaluations.
#> ---
#> Log-likelihood: 191.8 
#> AIC: -365.7   BIC: -338.6   HQ: -354.7

The penalized estimator successfully converges. The ridge penalty value \(\lambda_n = 1/(n - a)^{0.9}\) is chosen to balance bias and variance, as established in Cribari-Neto, Costa, and Fonseca (2025).

This illustrates a practical use case for PCMLE. When standard CMLE presents a convergence failure, an analyst might discard the entire lag structure without further investigation. By applying PCMLE, convergence is achieved and the parameter estimates become available for inspection. Furthermore, the diagnostic plots below for the PCMLE fit confirm that the residuals exhibit no serial correlation (Scher, Cribari-Neto, Pumi, and Bayer, 2020), indicating that the lag structure adequately captures the dynamics of the series.

plot(fit_pcmle, which = "all")
Diagnostic plots for the PCMLE fit: AR=(10,18), MA=(1,11,13).

Diagnostic plots for the PCMLE fit: AR=(10,18), MA=(1,11,13).

The observed and fitted values track each other closely. All ACF and PACF spikes lie within the confidence bands, and both the Ljung-Box and Monti p-values remain consistently above 0.05 across all evaluated lags, providing no evidence of residual serial correlation.

Upon reviewing the PCMLE summary, we note that the \(\theta_{11}\) estimate is not statistically significant (p-value = 0.488). This suggests that the MA(11) lag does not contribute meaningfully to the model dynamics. Removing it yields the reduced specification \(\text{AR} = (10, 18)\), \(\text{MA} = (1, 13)\), for which the likelihood surface is no longer flat and standard CMLE converges. Without PCMLE, the non-significance of \(\theta_{11}\) would never have been discovered, and the valid underlying structure of the series would have remained inaccessible.

3.6 Reduced Model

ar_lags_reduced <- c(10, 18)
ma_lags_reduced <- c(1, 13)

fit_cmle_reduced <- barma(
  y       = y_train,
  ar      = ar_lags_reduced,
  ma      = ma_lags_reduced,
  penalty = FALSE,
  xreg    = X_train
)
summary(fit_cmle_reduced)
#> Beta Autoregressive Moving Average Model
#> 
#> Call:
#> barma(y = y_train, ar = ar_lags_reduced, ma = ma_lags_reduced, 
#>     xreg = X_train, penalty = FALSE)
#> 
#> Link function: logit 
#> 
#> Coefficients:
#>          Estimate Std. Error z value Pr(>|z|)    
#> alpha     0.79153    0.10931   7.241 4.45e-13 ***
#> varphi10 -0.18959    0.07669  -2.472 0.013435 *  
#> varphi18  0.22200    0.08253   2.690 0.007147 ** 
#> theta1    0.47857    0.07509   6.373 1.85e-10 ***
#> theta13   0.29543    0.07676   3.849 0.000119 ***
#> hs        0.53129    0.04584  11.591  < 2e-16 ***
#> hc        0.77560    0.04571  16.969  < 2e-16 ***
#> phi      59.35648    7.26010   8.176 2.22e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Optimization: BFGS (stats::optim) converged after 97 objective function evaluations.
#> ---
#> Log-likelihood: 193.2 
#> AIC: -370.4   BIC: -346.3   HQ: -360.6

The reduced CMLE model converges and all coefficients are statistically significant. We proceed with residual diagnostics and forecasting using this parsimonious specification.

3.7 Residual Diagnostics

The panels below assess whether the fitted model adequately captures the dynamics of the Brasília relative humidity series.

plot(fit_cmle_reduced, which = "all")
Diagnostic plots for the reduced CMLE model AR=(10,18), MA=(1,13).

Diagnostic plots for the reduced CMLE model AR=(10,18), MA=(1,13).

The observed and fitted values track each other closely throughout the sample period. All ACF and PACF spikes lie within the confidence bands, and both the Ljung-Box and Monti p-values remain consistently above 0.05 across all evaluated lags, providing no evidence of residual serial correlation.

plot(fit_cmle_reduced, which = "qq")
Normal Q-Q plot of quantile residuals for the reduced CMLE model.

Normal Q-Q plot of quantile residuals for the reduced CMLE model.

The quantile residuals fall closely along the reference line within the 95% confidence band, supporting the normality assumption. One observation in the upper tail departs slightly from the line but does not indicate a systematic departure from the assumed distribution.

The largest residual corresponds to October 2011 (quantile residual = 4.37), a month that marks the climatological transition from the dry to the rainy season in Brasília — the period of highest interannual humidity variability. This observation coincides with the onset of an extreme drought period documented across Brazil from 2011 onward (Cunha et al., 2019). The second largest residual corresponds to January 2015 (−3.44), coinciding with the strong 2015–2016 El Niño event, which suppressed rainfall across central Brazil. These isolated departures reflect genuine meteorological anomalies rather than systematic model misspecification.

4 Forecasting

# Define the translation dictionary
month_map <- c(
  "Jan" = "Jan", "Fev" = "Feb", "Mar" = "Mar", "Abr" = "Apr",
  "Mai" = "May", "Jun" = "Jun", "Jul" = "Jul", "Ago" = "Aug",
  "Set" = "Sep", "Out" = "Oct", "Nov" = "Nov", "Dez" = "Dec"
)

dates     <- as.Date(zoo::as.yearmon(time(y_test)))
pt_months <- tools::toTitleCase(format(dates, "%b"))
months    <- unname(month_map[pt_months])

forecasts_df <- data.frame(
  observed = as.numeric(y_test),
  forecast = forecast(fit_cmle_reduced, xreg = X_test, h = n_test)$mean,
  month    = months
)
forecasts_df$time <- seq_len(n_test)
Out-of-sample forecasts from the reduced CMLE model against the observed test data (teal dots).

Out-of-sample forecasts from the reduced CMLE model against the observed test data (teal dots).

mae  <- function(obs, pred) mean(abs(obs - pred))
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

obs <- forecasts_df$observed

accuracy_table <- data.frame(
  Model = "betaARMA CMLE (reduced)",
  MAE   = round(mae(obs,  forecasts_df$forecast) * 100, 2),
  RMSE  = round(rmse(obs, forecasts_df$forecast) * 100, 2)
)

knitr::kable(
  accuracy_table,
  caption = paste0(
    "MAE and RMSE ($\\times 100$) over the ",
    n_test,
    "-month forecast horizon."
  )
)
MAE and RMSE (\(\times 100\)) over the 12-month forecast horizon.
Model MAE RMSE
betaARMA CMLE (reduced) 5.27 7.57

5 Conclusion

This vignette illustrated the practical value of PCMLE as a diagnostic tool in \(\beta\)ARMA modeling, using monthly relative humidity data from Brasília.

Standard CMLE failed to converge for the \(\beta\text{ARMA}\) model with \(\text{AR} = (10, 18)\), \(\text{MA} = (1, 11, 13)\). Applying ridge penalization (PCMLE) restored convergence and made the parameter estimates available for inspection. Residual diagnostics on the PCMLE fit confirmed that the lag structure adequately captured the series dynamics. Inspection of the PCMLE estimates revealed that \(\theta_{11}\) was not statistically significant, motivating the reduction to the parsimonious \(\beta\text{ARMA}\) model with \(\text{AR} = (10, 18)\), \(\text{MA} = (1, 13)\). The reduced CMLE model converged, passed all residual diagnostics, and produced accurate out-of-sample forecasts, with all predictions guaranteed to lie in \((0, 1)\).

This workflow — CMLE fails, PCMLE reveals the structure, reduction yields a stable CMLE model — demonstrates that PCMLE is not merely a fallback estimator but an active analytical tool that can guide model selection in challenging estimation scenarios. For the complete simulation study, bootstrap procedures, and extended empirical results, we refer the reader to Cribari-Neto, Costa, and Fonseca (2025).

6 References

Costa, E., Cribari-Neto, F., & Scher, V. T. (2024). Test inferences and link function selection in dynamic beta modeling of seasonal hydro-environmental time series with temporary abnormal regimes. Journal of Hydrology, 638, 131489. https://doi.org/10.1016/j.jhydrol.2024.131489

Cribari-Neto, F., Costa, E., & Fonseca, R. V. (2025). Numerical stability enhancements in beta autoregressive moving average model estimation. Brazilian Journal of Probability and Statistics, 39(4), 410–437. https://doi.org/10.1214/25-BJPS645

Cunha, A. P. M. A., Zeri, M., Deusdará Leal, K., Costa, L., Cuartas, L. A., Marengo, J. A., … & Cunningham, C. (2019). Extreme drought events over Brazil from 2011 to 2019. Atmosphere, 10(11), 642. https://doi.org/10.3390/atmos10110642

Rocha, A. V., & Cribari-Neto, F. (2009). Beta autoregressive moving average models. TEST, 18(3), 529–545. https://doi.org/10.1007/s11749-008-0112-z

Rocha, A. V., & Cribari-Neto, F. (2017). Erratum to: Beta autoregressive moving average models. TEST, 26, 451–459. https://doi.org/10.1007/s11749-017-0528-4

Scher, V. T., Cribari-Neto, F., Pumi, G., & Bayer, F. M. (2020). Goodness-of-fit tests for \(\beta\)ARMA hydrological time series modeling. Environmetrics, 31(3), e2607. https://doi.org/10.1002/env.2607

7 Reproducibility

The analysis was conducted in the following computational environment.

#> =================================================================
#> Session Information
#> =================================================================
#> This report was generated at: maio 23, 2026 at 02:12
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=pt_BR.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=pt_BR.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=pt_BR.UTF-8    LC_MESSAGES=pt_BR.UTF-8   
#>  [7] LC_PAPER=pt_BR.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/Bahia
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] moments_0.14.1 scales_1.4.0   zoo_1.8-15     gridExtra_2.3  ggplot2_4.0.3 
#> [6] forecast_9.0.2 betaARMA_1.2.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6       jsonlite_2.0.0     dplyr_1.1.4        compiler_4.4.2    
#>  [5] tidyselect_1.2.1   Rcpp_1.1.1-1.1     parallel_4.4.2     jquerylib_0.1.4   
#>  [9] yaml_2.3.10        fastmap_1.2.0      lattice_0.22-5     R6_2.6.1          
#> [13] labeling_0.4.3     generics_0.1.4     knitr_1.50         tibble_3.3.0      
#> [17] timeDate_4052.112  bslib_0.9.0        pillar_1.11.1      RColorBrewer_1.1-3
#> [21] rlang_1.2.0        cachem_1.1.0       urca_1.3-4         xfun_0.53         
#> [25] sass_0.4.10        S7_0.2.2           cli_3.6.6          withr_3.0.2       
#> [29] magrittr_2.0.5     digest_0.6.37      grid_4.4.2         rstudioapi_0.17.1 
#> [33] nlme_3.1-165       lifecycle_1.0.5    fracdiff_1.5-4     vctrs_0.7.3       
#> [37] evaluate_1.0.5     glue_1.8.1         farver_2.1.2       colorspace_2.1-2  
#> [41] rmarkdown_2.31     tools_4.4.2        pkgconfig_2.0.3    htmltools_0.5.8.1