| Title: | Reality Check and Predictive Ability Tests for Forecast Evaluation |
|---|---|
| Description: | Implements a comprehensive suite of statistical tests for evaluating the accuracy of forecasting models against a benchmark. The package is grounded in the reality check framework of White (2000) <doi:10.1111/1468-0262.00152>, extended by Hansen (2005) <doi:10.1198/073500105000000063> for Superior Predictive Ability (SPA), 'Giacomini' & White (2006) <doi:10.1111/j.1468-0262.2006.00718.x> for Conditional Predictive Ability (CPA), and 'Corradi' & Swanson (2006) <doi:10.1016/j.jeconom.2005.07.026> for predictive density evaluation via the 'Kullback'-'Leibler' Information Criterion ('KLIC') and 'ZP' Quantile Loss test, the Continuous Ranked Probability Score ('CRPS') ('Gneiting' & 'Raftery', 2007) <doi:10.1198/016214506000001437>, coverage tests ('Kupiec', 1995) <doi:10.3905/jod.1995.407942>, 'HAC' covariance estimation ('Newey' & West, 1987) <doi:10.2307/1913610>, and Moving Block Bootstrap resampling ('Kunsch', 1989) <doi:10.1214/aos/1176347265>. |
| Authors: | Joanna Jedrzejewska [aut, cre] (Faculty of Economic Sciences, University of Warsaw, Poland), Krzysztof Drachal [ctb] (Faculty of Economic Sciences, University of Warsaw, Poland) |
| Maintainer: | Joanna Jedrzejewska <[email protected]> |
| License: | GPL-3 |
| Version: | 1.0 |
| Built: | 2026-06-03 09:41:44 UTC |
| Source: | https://github.com/cran/RCtest |
Calculates the Continuous Ranked Probability Score (CRPS) using the energy score (Monte Carlo) approximation for a single forecast period.
compute_crps(forecast_density, target_realization)compute_crps(forecast_density, target_realization)
forecast_density |
|
target_realization |
|
The CRPS is a strictly proper scoring rule that jointly rewards calibration and sharpness of a probabilistic forecast. It is computed via the energy score identity:
where are independent draws from the forecast distribution and is
the realization. Lower values are better: a CRPS of 0 indicates a perfect
point-mass forecast at the true realization.
numeric scalar representing the CRPS loss, or NA if
input is invalid. Lower values indicate better probabilistic forecast accuracy.
Gneiting, T., & Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 102(477), 359–378. doi:10.1198/016214506000001437
data(metals) # metals: 165 x 15; columns 1-14 are competing forecasts, column 15 is the benchmark # CRPS for forecast 1, period 1: # Use the cross-sectional spread of all competing forecasts at period t=1 as the density density_samples <- as.numeric(metals[1, 1:14]) realized_value <- metals[1, 15] compute_crps(density_samples, realized_value) # In practice, iterate over all forecasts and periods. # For forecast k and period t, the predictive density is approximated by shifting the # cross-sectional spread of all K competing forecasts so that it is centred at the # cross-sectional mean of forecasts at period t. Specifically, for each forecast k: # density_samples_tk = (forecasts of all K forecast at t) - forecast_k(t) + mean # (all K forecasts at t) # This preserves the spread (diversity) across forecasts while recentring around the # cross-sectional mean rather than around forecast k's own point forecast. It is an # empirical approximation to the predictive distribution when no parametric density # is available. P <- nrow(metals) K <- ncol(metals) - 1L # 14 competing forecasts crps_matrix <- matrix(NA_real_, nrow = P, ncol = K, dimnames = list(NULL, colnames(metals)[1:K])) for (t in seq_len(P)) { for (k in seq_len(K)) { density_samples_tk <- as.numeric(metals[t, 1:K]) - metals[t, k] + mean(metals[t, 1:K]) crps_matrix[t, k] <- compute_crps(as.numeric(density_samples_tk), target_realization = metals[t, ncol(metals)]) } } head(crps_matrix)data(metals) # metals: 165 x 15; columns 1-14 are competing forecasts, column 15 is the benchmark # CRPS for forecast 1, period 1: # Use the cross-sectional spread of all competing forecasts at period t=1 as the density density_samples <- as.numeric(metals[1, 1:14]) realized_value <- metals[1, 15] compute_crps(density_samples, realized_value) # In practice, iterate over all forecasts and periods. # For forecast k and period t, the predictive density is approximated by shifting the # cross-sectional spread of all K competing forecasts so that it is centred at the # cross-sectional mean of forecasts at period t. Specifically, for each forecast k: # density_samples_tk = (forecasts of all K forecast at t) - forecast_k(t) + mean # (all K forecasts at t) # This preserves the spread (diversity) across forecasts while recentring around the # cross-sectional mean rather than around forecast k's own point forecast. It is an # empirical approximation to the predictive distribution when no parametric density # is available. P <- nrow(metals) K <- ncol(metals) - 1L # 14 competing forecasts crps_matrix <- matrix(NA_real_, nrow = P, ncol = K, dimnames = list(NULL, colnames(metals)[1:K])) for (t in seq_len(P)) { for (k in seq_len(K)) { density_samples_tk <- as.numeric(metals[t, 1:K]) - metals[t, k] + mean(metals[t, 1:K]) crps_matrix[t, k] <- compute_crps(as.numeric(density_samples_tk), target_realization = metals[t, ncol(metals)]) } } head(crps_matrix)
Computes the per-period Negative Log-Likelihood Score (NLS) loss matrix under a Gaussian predictive density assumption. The NLS is the loss function corresponding to minimisation of the Kullback-Leibler Information Criterion (KLIC) distance from the true density (Corradi & Swanson, 2006).
compute_klic( forecast_matrix, forecast_sd_models, benchmark_col = ncol(forecast_matrix) )compute_klic( forecast_matrix, forecast_sd_models, benchmark_col = ncol(forecast_matrix) )
forecast_matrix |
|
forecast_sd_models |
|
benchmark_col |
Index or name of the benchmark column. Defaults to the last column. |
For each competing forecast k and period t:
where denotes the Gaussian density, is the realized value,
is the point forecast, and is the
forecast standard deviation. Minimising the average NLS is equivalent to minimising
the KLIC distance between the forecast's predictive density and the true density
(Corradi & Swanson, 2006). Lower NLS values are better. The benchmark
column in the returned matrix is set to zero.
matrix of dimension P x K_total containing NLS
values. Lower values indicate better density forecast accuracy. The benchmark
column is set to zero.
Corradi, V., & Swanson, N. R. (2006). Predictive density and conditional confidence interval accuracy tests. Journal of Econometrics, 135(1–2), 187–228. doi:10.1016/j.jeconom.2005.07.026
Corradi, V., & Swanson, N. R. (2011). The White Reality Check and some of its recent extensions. In Festschrift in honor of Halbert L. White.
kullback_leibler_test,
estimate_forecast_variance
data(metals) benchmark_col <- 15 K_total <- ncol(metals) comp_cols <- setdiff(seq_len(K_total), benchmark_col) forecast_variance <- estimate_forecast_variance(metals, benchmark_col = benchmark_col) forecast_sd_models <- sqrt(forecast_variance[, comp_cols]) klic_loss <- compute_klic(metals, forecast_sd_models, benchmark_col = benchmark_col) head(klic_loss)data(metals) benchmark_col <- 15 K_total <- ncol(metals) comp_cols <- setdiff(seq_len(K_total), benchmark_col) forecast_variance <- estimate_forecast_variance(metals, benchmark_col = benchmark_col) forecast_sd_models <- sqrt(forecast_variance[, comp_cols]) klic_loss <- compute_klic(metals, forecast_sd_models, benchmark_col = benchmark_col) head(klic_loss)
Performs Kupiec's (1995) Unconditional Coverage (UC) test for evaluating Value-at-Risk (VaR) forecasts from competing forecast against realized values.
Hypotheses:
H0: The forecast correctly captures VaR — violations occur with the
expected frequency alpha.
H1: The forecast fails to correctly capture VaR — the observed
frequency of violations differs significantly from alpha.
compute_kupiec( forecast_matrix, forecast_sd_models, benchmark_col = ncol(forecast_matrix), alpha = 0.05 )compute_kupiec( forecast_matrix, forecast_sd_models, benchmark_col = ncol(forecast_matrix), alpha = 0.05 )
forecast_matrix |
|
forecast_sd_models |
|
benchmark_col |
Index or name of the benchmark column. Defaults to the last column. |
alpha |
|
For each competing forecast k, the VaR at level alpha is:
where is the standard normal quantile function. A violation occurs
when the realized value falls below . The likelihood-ratio statistic
follows a distribution under H0 (Kupiec, 1995).
Failing to reject H0 (large p-value) indicates correctly calibrated VaR;
rejecting H0 (small p-value) indicates the forecast under- or over-estimates
tail risk.
A named list (one element per competing forecast) of htest objects,
each containing:
statisticThe LR-UC test statistic (-distributed under
H0).
p.valueP-value from the distribution. A large
p-value indicates correctly calibrated VaR coverage.
actual_exceedancesObserved number of VaR violations.
expectedExpected number of violations
(P * alpha).
Kupiec, P. H. (1995). Techniques for Verifying the Accuracy of Risk Measurement Models. The Journal of Derivatives, 3(2), 173–184. doi:10.3905/jod.1995.407942
estimate_forecast_variance,
run_comprehensive_erc_analysis
data(metals) benchmark_col <- 15 K_total <- ncol(metals) comp_cols <- setdiff(seq_len(K_total), benchmark_col) forecast_variance <- estimate_forecast_variance(metals, benchmark_col = benchmark_col, window_size = 20) forecast_sd_models <- sqrt(forecast_variance[, comp_cols]) coverage_results <- compute_kupiec(metals, forecast_sd_models, benchmark_col = benchmark_col, alpha = 0.05) print(coverage_results[[1]])data(metals) benchmark_col <- 15 K_total <- ncol(metals) comp_cols <- setdiff(seq_len(K_total), benchmark_col) forecast_variance <- estimate_forecast_variance(metals, benchmark_col = benchmark_col, window_size = 20) forecast_sd_models <- sqrt(forecast_variance[, comp_cols]) coverage_results <- compute_kupiec(metals, forecast_sd_models, benchmark_col = benchmark_col, alpha = 0.05) print(coverage_results[[1]])
Performs a Diebold-Mariano (1995) test for each competing forecast
separately, testing whether the predictive accuracy of that forecast differs
significantly from the benchmark forecast. The test statistic is the sample
mean loss differential standardised by a Newey-West HAC standard error;
p-values are provided both analytically and via Moving Block Bootstrap (MBB).
The direction of the alternative hypothesis is controlled by the
H1 argument: "same" for a two-sided test (),
"more" for the one-sided test that forecast is more accurate than
the benchmark (), and "less" for the one-sided
test that forecast is less accurate than the benchmark
().
compute_per_model_statistics( loss_differences, model_names, n_boot = 999, block_length = 5, alpha = 0.05, h = 1, H1 = "same" )compute_per_model_statistics( loss_differences, model_names, n_boot = 999, block_length = 5, alpha = 0.05, h = 1, H1 = "same" )
loss_differences |
|
model_names |
|
n_boot |
|
block_length |
|
alpha |
|
h |
|
H1 |
|
For each forecast , the loss differential series is:
where is the loss function used to construct loss_differences.
In the standard workflow of this package
(run_comprehensive_erc_analysis), is the squared error loss:
The Diebold-Mariano test statistic is:
For multi-step forecasts (), the Harvey, Leybourne & Newbold (1997)
small-sample correction is applied:
For the correction reduces to , which approaches
1 as . For the correction inflates the statistic,
improving finite-sample size control. The corrected statistic is
compared to a distribution.
The analytic p-value (P_Value) uses the t-distribution with
degrees of freedom. The bootstrap p-value (P_Value_Boot) uses MBB
resampling (Kunsch, 1989) with recentering at the sample mean ,
placing the bootstrap distribution under H0. The Harvey correction is applied
consistently to both the analytic and bootstrap statistics. The p-values are
computed according to the alternative hypothesis specified by H1:
H1)"same" – two-sided"more" – one-sided right"less" – one-sided leftThe bootstrap analogue replaces the t-distribution probability with the empirical proportion of bootstrap statistics falling in the appropriate tail.
where follows a t-distribution with degrees of freedom.
The bootstrap analogue replaces the t-distribution tail probability with the
empirical proportion of bootstrap statistics falling in the appropriate tail.
This function performs individual tests and does not control for
multiple comparisons. For a joint test controlling the family-wise error rate,
use white_reality_check or
superior_predictive_ability_test.
Note on MASE: When loss_differences are constructed from
Mean Absolute Scaled Errors, the scaling (division by the naive benchmark
MAE, i.e. mean(abs(diff(realizations)))) must be applied
before passing the loss differentials to this function.
compute_per_model_statistics receives pre-computed loss differentials
and applies no internal rescaling — the caller is responsible for ensuring
that MASE-based loss_differences already contain scaled errors.
In run_comprehensive_erc_analysis this is handled automatically.
data.frame with one row per competing model forecast:
Model |
Model name. |
Mean_Loss_Diff |
Sample mean of . |
Frac_Better_Than_Benchmark |
Fraction of periods where . |
T_Stat |
Harvey-corrected DM statistic . |
P_Value |
Analytic p-value (t-distribution, df). |
P_Value_Boot |
MBB bootstrap p-value. |
Significant |
TRUE if P_Value <= alpha. |
Significant_Boot |
TRUE if P_Value_Boot <= alpha.
|
Davidson, R., & MacKinnon, J. G. (2000). Bootstrap tests: How many bootstraps? Econometric Reviews, 19(1), 55–68. doi:10.1080/07474930008800459
Diebold, F. X., & Mariano, R. S. (1995). Comparing Predictive Accuracy. Journal of Business & Economic Statistics, 13(3), 253–263. doi:10.1080/07350015.1995.10524599
Harvey, D., Leybourne, S., & Newbold, P. (1997). Testing the equality of prediction mean squared errors. International Journal of Forecasting, 13(2), 281–291. doi:10.1016/S0169-2070(96)00719-4
Kunsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. The Annals of Statistics, 17(3), 1217–1241. doi:10.1214/aos/1176347265
Newey, W. K., & West, K. D. (1987). A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix. Econometrica, 55(3), 703–708. doi:10.2307/1913610
white_reality_check,
superior_predictive_ability_test,
estimate_long_run_covariance
data(metals) P <- nrow(metals) K_total <- ncol(metals) K <- K_total - 1 # A small offset (+0.5) avoids degenerate zero loss differences (illustration only). realized <- c(metals[-1, K_total], metals[P, K_total]) + 0.5 bench_loss <- (metals[, K_total] - realized)^2 model_loss <- (metals[, 1:K] - realized)^2 loss_differences <- bench_loss - model_loss model_names <- colnames(metals)[1:K] # Two-sided test (default) result_df <- compute_per_model_statistics(loss_differences, model_names, n_boot = 10) print(result_df) # One-sided test: H1 = forecast is more accurate than benchmark result_more <- compute_per_model_statistics(loss_differences, model_names, n_boot = 10, H1 = "more") print(result_more)data(metals) P <- nrow(metals) K_total <- ncol(metals) K <- K_total - 1 # A small offset (+0.5) avoids degenerate zero loss differences (illustration only). realized <- c(metals[-1, K_total], metals[P, K_total]) + 0.5 bench_loss <- (metals[, K_total] - realized)^2 model_loss <- (metals[, 1:K] - realized)^2 loss_differences <- bench_loss - model_loss model_names <- colnames(metals)[1:K] # Two-sided test (default) result_df <- compute_per_model_statistics(loss_differences, model_names, n_boot = 10) print(result_df) # One-sided test: H1 = forecast is more accurate than benchmark result_more <- compute_per_model_statistics(loss_differences, model_names, n_boot = 10, H1 = "more") print(result_more)
Computes the per-period ZP quantile loss matrix based on the squared difference between the indicator of a tail event and the forecast's predicted probability of that event (Corradi & Swanson, 2006, eq. 7).
compute_zp( forecast_matrix, forecast_sd_models, threshold, benchmark_col = ncol(forecast_matrix) )compute_zp( forecast_matrix, forecast_sd_models, threshold, benchmark_col = ncol(forecast_matrix) )
forecast_matrix |
|
forecast_sd_models |
|
threshold |
|
benchmark_col |
Index or name of the benchmark column. Defaults to the last column. |
For each competing forecast k and period t:
where is the realized value, is the threshold,
is the point forecast, and is the
forecast standard deviation. Lower ZP values are better.
Choosing at the 5th percentile focuses evaluation on whether forecasts
correctly predict the risk of falling into the worst 5% of outcomes. The benchmark
column is assigned a point-mass predictive distribution (),
which approximates the Brier score for the tail indicator and serves as a conservative
reference. When the benchmark is the Historical Average (HA), the ZP test thus
evaluates whether any competing forecast's calibrated tail probability outperforms the
HA's point prediction of the tail event.
matrix of dimension P x K_total containing ZP
loss values. Lower values indicate better left-tail probability calibration.
The benchmark column uses .
Corradi, V., & Swanson, N. R. (2006). Predictive density and conditional confidence interval accuracy tests. Journal of Econometrics, 135(1–2), 187–228. doi:10.1016/j.jeconom.2005.07.026
Corradi, V., & Swanson, N. R. (2011). The White Reality Check and some of its recent extensions. In Festschrift in honor of Halbert L. White.
reality_check_zp_test,
estimate_forecast_variance
data(metals) benchmark_col <- 15 K_total <- ncol(metals) comp_cols <- setdiff(seq_len(K_total), benchmark_col) forecast_variance <- estimate_forecast_variance(metals, benchmark_col = benchmark_col) forecast_sd_models <- sqrt(forecast_variance[, comp_cols]) threshold_val <- quantile(metals[, benchmark_col], 0.10) zp_loss <- compute_zp(metals, forecast_sd_models, threshold = threshold_val, benchmark_col = benchmark_col) head(zp_loss)data(metals) benchmark_col <- 15 K_total <- ncol(metals) comp_cols <- setdiff(seq_len(K_total), benchmark_col) forecast_variance <- estimate_forecast_variance(metals, benchmark_col = benchmark_col) forecast_sd_models <- sqrt(forecast_variance[, comp_cols]) threshold_val <- quantile(metals[, benchmark_col], 0.10) zp_loss <- compute_zp(metals, forecast_sd_models, threshold = threshold_val, benchmark_col = benchmark_col) head(zp_loss)
Creates a summary data frame consolidating p-values and conclusions for all statistical tests across all datasets and error metrics. Covers White's Reality Check (WRC), Superior Predictive Ability (SPA), Conditional Predictive Ability (CPA), ZP Quantile Loss test, and Kullback-Leibler (KLIC) test.
create_unified_summary(comprehensive_results, alpha = 0.05)create_unified_summary(comprehensive_results, alpha = 0.05)
comprehensive_results |
|
alpha |
Numeric significance level for determining conclusions. Default is 0.05. |
list containing a data frame named summary with
columns: Dataset, Test, P_Value, Statistic,
Conclusion ("H0 rejected" or "H0 accepted").
White, H. (2000). A reality check for data snooping. Econometrica, 68(5), 1097–1126. doi:10.1111/1468-0262.00152
Hansen, P. R. (2005). A Test for Superior Predictive Ability. Journal of Business & Economic Statistics, 23(4), 365–380. doi:10.1198/073500105000000063
Giacomini, R., & White, H. (2006). Tests of Conditional Predictive Ability. Econometrica, 74(6), 1545–1578. doi:10.1111/j.1468-0262.2006.00718.x
Corradi, V., & Swanson, N. R. (2006). Predictive density and conditional confidence interval accuracy tests. Journal of Econometrics, 135(1–2), 187–228. doi:10.1016/j.jeconom.2005.07.026
run_comprehensive_erc_analysis,
generate_comprehensive_report,
compute_per_model_statistics
data(metals) realizations <- list(M = metals[, ncol(metals)]) prep_list <- list(M = list(R_start = 0)) f_hat <- list(list(NULL, NULL, metals)) names(f_hat) <- "M" res <- run_comprehensive_erc_analysis( data_list_prepared = prep_list, mods_matrix = matrix(0), alpha_grid = 0.05, window_size = 20, y_hat_all = f_hat, y_raw = realizations, block_length = 5, n_boot = 10, zp_quantile = 0.05 ) summary_table <- create_unified_summary(res$aggregate_results) print(summary_table$summary)data(metals) realizations <- list(M = metals[, ncol(metals)]) prep_list <- list(M = list(R_start = 0)) f_hat <- list(list(NULL, NULL, metals)) names(f_hat) <- "M" res <- run_comprehensive_erc_analysis( data_list_prepared = prep_list, mods_matrix = matrix(0), alpha_grid = 0.05, window_size = 20, y_hat_all = f_hat, y_raw = realizations, block_length = 5, n_boot = 10, zp_quantile = 0.05 ) summary_table <- create_unified_summary(res$aggregate_results) print(summary_table$summary)
Estimates forecast variance from historical forecast errors relative to
the benchmark using a rolling window. Used to approximate the time-varying predictive
standard deviation for each competing model forecast, required by compute_klic,
compute_zp, and compute_kupiec.
estimate_forecast_variance( forecast_matrix, benchmark_col = ncol(forecast_matrix), window_size = 20 )estimate_forecast_variance( forecast_matrix, benchmark_col = ncol(forecast_matrix), window_size = 20 )
forecast_matrix |
|
benchmark_col |
Index or name of the benchmark column. Defaults to the last column. |
window_size |
|
For each competing forecast k and period t, the forecast error is defined
as . The variance of these errors
is estimated over a rolling window:
If t < window_size: variance is computed over observations
1:t (expanding window).
If t >= window_size: variance is computed over observations
max(1, t - window_size):t (rolling window of size window_size).
For t = 1 the variance of a single observation is undefined (NA).
Estimated variances that are NA, zero, or negative are replaced by
1e-6 to ensure numerical stability in downstream computations.
The benchmark column in the returned matrix is set to zero throughout.
matrix of dimension P x K_total containing
estimated variances. Columns correspond to the same forecasts as
forecast_matrix; the benchmark column contains zeros.
compute_klic, compute_zp,
compute_kupiec
data(metals) forecast_variance <- estimate_forecast_variance(metals, benchmark_col = 15, window_size = 20) head(forecast_variance)data(metals) forecast_variance <- estimate_forecast_variance(metals, benchmark_col = 15, window_size = 20) head(forecast_variance)
Estimates the long-run covariance matrix using the Newey-West (1987) approach with a Bartlett kernel. Provides Heteroskedasticity and Autocorrelation Consistent (HAC) variance estimates used for studentizing Reality Check test statistics.
estimate_long_run_covariance(loss_differences, block_length)estimate_long_run_covariance(loss_differences, block_length)
loss_differences |
A |
block_length |
|
Implements the Newey-West (1987) HAC covariance matrix estimator with Bartlett kernel
weights for lags , where
denotes the truncation lag (following the notation of Newey & West, 1987, and
Politis & Romano, 1994), here set equal to block_length. This is essential
for accounting for serial dependence in time-series forecast evaluations.
A symmetric positive semi-definite matrix of dimensions
K x K representing the estimated long-run covariance.
Newey, W. K., & West, K. D. (1987). A Simple Positive Semi-Definite Heteroskedasticity and Autocorrelation Consistent Covariance Matrix. Econometrica, 55(3), 703–708. doi:10.2307/1913610
Politis, D. N., & Romano, J. P. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89(428), 1303–1313. doi:10.1080/01621459.1994.10476870
data(metals) # metals: 165 x 15; columns 1-14 are competing forecasts, column 15 is the benchmark # A small offset (+0.5) is added to the lagged benchmark to avoid degenerate zero # loss differences when forecasts equal the realized value exactly (illustration only). P <- nrow(metals) K_total <- ncol(metals) K <- K_total - 1 # 14 competing forecasts realized <- c(metals[-1, K_total], metals[P, K_total]) + 0.5 benchmark_loss <- (metals[, K_total] - realized)^2 model_loss <- (metals[, 1:K] - realized)^2 loss_diff <- benchmark_loss - model_loss lrc_result <- estimate_long_run_covariance(loss_diff, block_length = 5) print(round(lrc_result[1:3, 1:3], 6))data(metals) # metals: 165 x 15; columns 1-14 are competing forecasts, column 15 is the benchmark # A small offset (+0.5) is added to the lagged benchmark to avoid degenerate zero # loss differences when forecasts equal the realized value exactly (illustration only). P <- nrow(metals) K_total <- ncol(metals) K <- K_total - 1 # 14 competing forecasts realized <- c(metals[-1, K_total], metals[P, K_total]) + 0.5 benchmark_loss <- (metals[, K_total] - realized)^2 model_loss <- (metals[, 1:K] - realized)^2 loss_diff <- benchmark_loss - model_loss lrc_result <- estimate_long_run_covariance(loss_diff, block_length = 5) print(round(lrc_result[1:3, 1:3], 6))
Prepares the comprehensive results list for export (e.g., to Microsoft
Excel). Converts all htest objects and per-model results into flat
data.frame objects, one per dataset.
extract_and_flatten_results_aggregated(comprehensive_results, alpha = 0.05)extract_and_flatten_results_aggregated(comprehensive_results, alpha = 0.05)
comprehensive_results |
|
alpha |
Numeric significance level used to determine |
list of data frames, one per dataset, each with columns:
Model, Test_Type, P_Value, Statistic,
Actual_Violations, Reject_H0.
run_comprehensive_erc_analysis, create_unified_summary,
generate_comprehensive_report
data(metals) realizations <- list(M = metals[, ncol(metals)]) prep_list <- list(M = list(R_start = 0)) f_hat <- list(list(NULL, NULL, metals)) names(f_hat) <- "M" res <- run_comprehensive_erc_analysis( data_list_prepared = prep_list, mods_matrix = matrix(0), alpha_grid = 0.05, window_size = 20, y_hat_all = f_hat, y_raw = realizations, block_length = 5, n_boot = 10, zp_quantile = 0.05 ) excel_data <- extract_and_flatten_results_aggregated(res$aggregate_results, alpha = 0.05) head(excel_data[[1]])data(metals) realizations <- list(M = metals[, ncol(metals)]) prep_list <- list(M = list(R_start = 0)) f_hat <- list(list(NULL, NULL, metals)) names(f_hat) <- "M" res <- run_comprehensive_erc_analysis( data_list_prepared = prep_list, mods_matrix = matrix(0), alpha_grid = 0.05, window_size = 20, y_hat_all = f_hat, y_raw = realizations, block_length = 5, n_boot = 10, zp_quantile = 0.05 ) excel_data <- extract_and_flatten_results_aggregated(res$aggregate_results, alpha = 0.05) head(excel_data[[1]])
Generates an automatic summary report in Markdown format covering all
error metrics (MSE, MAE, MASE) and distributional tests (ZP, KLIC, Kupiec).
For the ZP and KLIC sections, the report lists superior competing forecasts (i.e.
those whose forecasts are found to be more accurate than the benchmark forecast)
or states that no superior forecasts were found. For the Kupiec section, forecasts
with correct VaR coverage (Reject_H0 == FALSE) are listed, or a message
is printed if none passed.
Technical Abbreviations:
WRC: White's Reality Check (White, 2000). Tests whether any competing forecast has lower expected loss than the benchmark forecast; controls family-wise error rate.
SPA: Superior Predictive Ability test (Hansen, 2005). A studentized extension of WRC with improved power that corrects for irrelevant forecasts.
CPA: Conditional Predictive Ability test (Giacomini & White, 2006). Tests whether loss differentials are predictable by a conditioning variable.
ZP: Quantile Loss test (Corradi & Swanson, 2006). Evaluates whether
any competing forecast better calibrates the probability of a left-tail event
defined by the zp_quantile threshold.
KLIC: Kullback-Leibler Information Criterion based density test (Corradi & Swanson, 2006). Selects the forecast whose predictive density is closest to the true density in terms of KLIC distance, evaluated via Negative Log-Likelihood Scores (NLS) under a Gaussian predictive density assumption.
CRPS: Continuous Ranked Probability Score (Gneiting & Raftery, 2007). Jointly rewards calibration and sharpness of the predictive distribution.
UC: Kupiec Unconditional Coverage test (Kupiec, 1995).
MSE: Mean Squared Error.
MAE: Mean Absolute Error.
MASE: Mean Absolute Scaled Error.
generate_comprehensive_report( summary_df, zp_models_df, klic_models_df, kupiec_models_df, dataset_name, alpha = 0.05 )generate_comprehensive_report( summary_df, zp_models_df, klic_models_df, kupiec_models_df, dataset_name, alpha = 0.05 )
summary_df |
|
zp_models_df |
|
klic_models_df |
|
kupiec_models_df |
|
dataset_name |
|
alpha |
|
character string in Markdown format.
White, H. (2000). A reality check for data snooping. Econometrica, 68(5), 1097–1126. doi:10.1111/1468-0262.00152
Hansen, P. R. (2005). A Test for Superior Predictive Ability. Journal of Business & Economic Statistics, 23(4), 365–380. doi:10.1198/073500105000000063
Giacomini, R., & White, H. (2006). Tests of Conditional Predictive Ability. Econometrica, 74(6), 1545–1578. doi:10.1111/j.1468-0262.2006.00718.x
Corradi, V., & Swanson, N. R. (2006). Predictive density and conditional confidence interval accuracy tests. Journal of Econometrics, 135(1–2), 187–228. doi:10.1016/j.jeconom.2005.07.026
Corradi, V., & Swanson, N. R. (2011). The White Reality Check and some of its recent extensions. In Festschrift in honor of Halbert L. White.
Gneiting, T., & Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 102(477), 359–378. doi:10.1198/016214506000001437
Kupiec, P. H. (1995). Techniques for Verifying the Accuracy of Risk Measurement Models. The Journal of Derivatives, 3(2), 173–184. doi:10.3905/jod.1995.407942
Politis, D. N., & Romano, J. P. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89(428), 1303–1313. doi:10.1080/01621459.1994.10476870
Diebold, F. X., & Mariano, R. S. (1995). Comparing Predictive Accuracy. Journal of Business & Economic Statistics, 13(3), 253–263. doi:10.1080/07350015.1995.10524599
data(metals) ds_name <- "Dataset1" prep_list <- list(Dataset1 = list(R_start = 0)) realizations <- list(Dataset1 = metals[, ncol(metals)]) f_hat <- list(list(NULL, NULL, metals)) names(f_hat) <- ds_name res <- run_comprehensive_erc_analysis( data_list_prepared = prep_list, mods_matrix = matrix(0), alpha_grid = 0.05, window_size = 20, y_hat_all = f_hat, y_raw = realizations, block_length = 5, n_boot = 10, zp_quantile = 0.05 ) unified_summ <- create_unified_summary(res$aggregate_results) zp_table <- data.frame(Model = colnames(metals)[1:(ncol(metals) - 1)], P_Value = 0.1, Dataset = ds_name) klic_table <- zp_table kupiec_table <- data.frame(Model = colnames(metals)[1:(ncol(metals) - 1)], Reject_H0 = rep(FALSE, ncol(metals) - 1), Dataset = ds_name) report <- generate_comprehensive_report( summary_df = unified_summ$summary, zp_models_df = zp_table, klic_models_df = klic_table, kupiec_models_df = kupiec_table, dataset_name = ds_name ) cat(report)data(metals) ds_name <- "Dataset1" prep_list <- list(Dataset1 = list(R_start = 0)) realizations <- list(Dataset1 = metals[, ncol(metals)]) f_hat <- list(list(NULL, NULL, metals)) names(f_hat) <- ds_name res <- run_comprehensive_erc_analysis( data_list_prepared = prep_list, mods_matrix = matrix(0), alpha_grid = 0.05, window_size = 20, y_hat_all = f_hat, y_raw = realizations, block_length = 5, n_boot = 10, zp_quantile = 0.05 ) unified_summ <- create_unified_summary(res$aggregate_results) zp_table <- data.frame(Model = colnames(metals)[1:(ncol(metals) - 1)], P_Value = 0.1, Dataset = ds_name) klic_table <- zp_table kupiec_table <- data.frame(Model = colnames(metals)[1:(ncol(metals) - 1)], Reject_H0 = rep(FALSE, ncol(metals) - 1), Dataset = ds_name) report <- generate_comprehensive_report( summary_df = unified_summ$summary, zp_models_df = zp_table, klic_models_df = klic_table, kupiec_models_df = kupiec_table, dataset_name = ds_name ) cat(report)
Implements the Reality Check using Negative Log-Likelihood Scores (NLS) to evaluate predictive densities in terms of their Kullback-Leibler divergence from the true density. Based on Corradi & Swanson (2006).
H0: – no
competing forecast achieves a higher average log-likelihood (lower KLIC distance)
than the benchmark density .
H1: At least one competing forecast achieves strictly higher average log-likelihood than the benchmark.
kullback_leibler_test( log_likelihood_differences, block_length, num_bootstrap_replications, alpha = 0.05 )kullback_leibler_test( log_likelihood_differences, block_length, num_bootstrap_replications, alpha = 0.05 )
log_likelihood_differences |
A |
block_length |
|
num_bootstrap_replications |
|
alpha |
|
The KLIC between the true density and a forecast density is
. Minimising KLIC is equivalent to maximising
expected log-likelihood. This test therefore selects the forecast with the smallest
KLIC distance from the true density. Lower NLS values are better: a forecast
with lower NLS assigns higher average probability to events that actually occurred
(Corradi & Swanson, 2006).
The NLS loss matrix is constructed via compute_klic assuming normal
predictive densities parameterised by a point forecast and a rolling-window standard
deviation. The test statistic is the maximum studentized mean NLS differential,
with p-values obtained via MBB following the SPA bootstrap of Hansen (2005).
An object of class "htest". A small p-value (below alpha)
leads to rejection of H0, indicating that at least one competing forecast has a
lower Kullback-Leibler distance from the true density than the benchmark.
Failure to reject H0 suggests no forecast provides significantly better density
fit than the benchmark.
Corradi, V., & Swanson, N. R. (2006). Predictive density and conditional confidence interval accuracy tests. Journal of Econometrics, 135(1–2), 187–228. doi:10.1016/j.jeconom.2005.07.026
Corradi, V., & Swanson, N. R. (2011). The White Reality Check and some of its recent extensions. In Festschrift in honor of Halbert L. White.
Davidson, R., & MacKinnon, J. G. (2000). Bootstrap tests: How many bootstraps? Econometric Reviews, 19(1), 55–68. doi:10.1080/07474930008800459
Hansen, P. R. (2005). A Test for Superior Predictive Ability. Journal of Business & Economic Statistics, 23(4), 365–380. doi:10.1198/073500105000000063
Politis, D. N., & Romano, J. P. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89(428), 1303–1313. doi:10.1080/01621459.1994.10476870
data(metals) # metals: 165 x 15; columns 1-14 are competing forecasts, column 15 is the benchmark benchmark_col <- 15 P <- nrow(metals) K_total <- ncol(metals) K <- K_total - 1 # 14 competing forecasts forecast_variance <- estimate_forecast_variance(metals, benchmark_col = K_total, window_size = 20) comp_cols <- setdiff(seq_len(K_total), benchmark_col) forecast_sd_models <- sqrt(forecast_variance[, comp_cols]) nls_loss <- compute_klic(metals, forecast_sd_models, benchmark_col = K_total) nls_diff <- nls_loss[, K_total] - nls_loss[, comp_cols] kullback_leibler_test(nls_diff, block_length = 5, num_bootstrap_replications = 50)data(metals) # metals: 165 x 15; columns 1-14 are competing forecasts, column 15 is the benchmark benchmark_col <- 15 P <- nrow(metals) K_total <- ncol(metals) K <- K_total - 1 # 14 competing forecasts forecast_variance <- estimate_forecast_variance(metals, benchmark_col = K_total, window_size = 20) comp_cols <- setdiff(seq_len(K_total), benchmark_col) forecast_sd_models <- sqrt(forecast_variance[, comp_cols]) nls_loss <- compute_klic(metals, forecast_sd_models, benchmark_col = K_total) nls_diff <- nls_loss[, K_total] - nls_loss[, comp_cols] kullback_leibler_test(nls_diff, block_length = 5, num_bootstrap_replications = 50)
Generates a bootstrap resample of a time series matrix using the Moving Block Bootstrap (MBB) method of Kunsch (1989).
mbb_resample_data(data_series, block_length)mbb_resample_data(data_series, block_length)
data_series |
|
block_length |
|
Resamples overlapping blocks of block_length consecutive rows with replacement,
then concatenates them to produce a bootstrap sample of the same length P as
the original series. This preserves the short-run autocorrelation structure of the data,
which is required for valid inference in the Reality Check and SPA-type tests.
matrix of bootstrap resampled data with the same dimensions
as data_series.
Kunsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. The Annals of Statistics, 17(3), 1217–1241. doi:10.1214/aos/1176347265
Politis, D. N., & Romano, J. P. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89(428), 1303–1313. doi:10.1080/01621459.1994.10476870
Corradi, V., & Swanson, N. R. (2011). The White Reality Check and some of its recent extensions. In Festschrift in honor of Halbert L. White.
Liu, R. Y., & Singh, K. (1992). Moving blocks jackknife and bootstrap capture weak dependence. In R. LePage & L. Billard (Eds.), Exploring the Limits of Bootstrap (pp. 225–248). Wiley.
data(metals) # metals: 165 x 15; columns 1-3 are the first three competing forecasts mbb_resample_data(metals[, 1:3], block_length = 5)data(metals) # metals: 165 x 15; columns 1-3 are the first three competing forecasts mbb_resample_data(metals[, 1:3], block_length = 5)
A dataset of 165 observations of base metals price indices. It includes point forecasts from 15 predictive models (Bayesian and shrinkage-based). This dataset is based on the research methodology and benchmarks presented in Drachal and Jedrzejewska (2025).
metalsmetals
A matrix with 165 rows and 15 columns:
metals[,1:14]: Point forecasts from 14 competing models (including
variations of Bayesian Dynamic Mixture Models (BDMM), Dynamic Model Averaging (DMA),
Time-Varying Parameters (TVP), as well as Bayesian LASSO, Bayesian RIDGE,
Autoregressive (AR1), and Linear Regression).
metals[,15]: HA - Historical Average (Benchmark model).
Monthly World Bank base metals price index (2010 = 100, USD), including aluminium, copper, lead, nickel, tin and zinc, forecasts for the period from 03/2011 to 11/2024.
Drachal, K. (2025). Base metals price index forecasts. figshare. doi:10.6084/m9.figshare.28382480.v1
Drachal, K., & Jedrzejewska, J. (2025). Forecasting Base Metals Prices: A Comparison of Various Bayesian-Based Methods. Sinteza 2025, 175–183. doi:10.15308/Sinteza-2025-175-183
World Bank. (2025). Commodity markets. https://www.worldbank.org/en/research/commodity-markets
data(metals) head(metals) plot(metals[, 1], type = "l", main = "Competing Model vs Benchmark") lines(metals[, ncol(metals)], col = "red", lty = 2)data(metals) head(metals) plot(metals[, 1], type = "l", main = "Competing Model vs Benchmark") lines(metals[, ncol(metals)], col = "red", lty = 2)
Generates a time-series plot of cumulative squared error (MSE)
loss differences between each competing forecast and the benchmark. A positive
value at time means the competing forecast has accumulated lower squared
errors than the benchmark up to that point (i.e., the forecast is outperforming
the benchmark cumulatively).
plot_cumulative_loss(data_matrix, benchmark_col = ncol(data_matrix))plot_cumulative_loss(data_matrix, benchmark_col = ncol(data_matrix))
data_matrix |
Matrix or data frame of dimension |
benchmark_col |
Index or name of the benchmark column. Defaults to the last column. |
The cumulative loss difference for forecast at time is:
where is the forecast error of forecast
at period . The function squares the values in
data_matrix directly, so pre-computed forecast errors
(not raw forecasts) must be passed for the result to represent cumulative
MSE differences.
A positive means forecast has accumulated lower
squared errors than the benchmark up to period . A negative value
means the benchmark has been more accurate up to that point.
A ggplot object. The y-axis shows the cumulative MSE
difference; forecasts above zero at the right edge have outperformed the
benchmark over the full evaluation window. The top 2 and bottom 2 forecasts
(by final cumulative loss difference) are labelled directly on the plot.
white_reality_check,
plot_performance_metrics
data(metals) P <- nrow(metals) K_total <- ncol(metals) realized <- c(metals[-1, K_total], metals[P, K_total]) errors <- sweep(as.matrix(metals), 1, realized, "-") p <- plot_cumulative_loss(errors, benchmark_col = K_total) print(p)data(metals) P <- nrow(metals) K_total <- ncol(metals) realized <- c(metals[-1, K_total], metals[P, K_total]) errors <- sweep(as.matrix(metals), 1, realized, "-") p <- plot_cumulative_loss(errors, benchmark_col = K_total) print(p)
Generates a kernel density plot of a predictive distribution,
highlighting the point forecast and a symmetric credible interval at a
specified level. The predictive distribution can be constructed by treating
the cross-sectional spread of model forecasts at a given period as a proxy
for forecast uncertainty (see compute_crps for the centring
convention).
plot_density_forecast( full_distribution, point_forecast, title = "Density Forecast", ci_level = 0.9 )plot_density_forecast( full_distribution, point_forecast, title = "Density Forecast", ci_level = 0.9 )
full_distribution |
Numeric vector of forecast samples drawn from the
predictive distribution (e.g., forecasts from all models at one time
period, optionally centred and shifted as in |
point_forecast |
Single numeric value: the point forecast to highlight
(e.g., the mean or median of |
title |
Character string for the plot title. Default is
|
ci_level |
Numeric confidence interval level strictly between 0 and 1.
Default is 0.90, producing lower and upper quantiles at
|
To build a pseudo-predictive distribution from the metals dataset,
centre the cross-sectional model forecasts at period t around their
mean, following the same convention used in compute_crps:
dist_t <- forecasts[t, ] - forecasts[t, k] + mean(forecasts[t, ]).
This preserves cross-sectional spread while recentring on the cross-sectional
mean rather than on model k's own point forecast.
A ggplot object. The plot shows a kernel density curve
(blue fill), a red dashed vertical line at point_forecast, and
orange dotted vertical lines at the lower and upper quantiles of the
credible interval. The subtitle reports the numeric bounds of the interval.
compute_crps, run_comprehensive_erc_analysis'
data(metals) t <- 100 K <- ncol(metals) - 1 dist_t <- as.numeric(metals[t, 1:K]) - metals[t, 7] + mean(as.numeric(metals[t, 1:K])) pt_fcst <- mean(as.numeric(metals[t, 1:K])) plot_density_forecast(dist_t, pt_fcst, title = "Predictive Density at t = 100", ci_level = 0.90)data(metals) t <- 100 K <- ncol(metals) - 1 dist_t <- as.numeric(metals[t, 1:K]) - metals[t, 7] + mean(as.numeric(metals[t, 1:K])) pt_fcst <- mean(as.numeric(metals[t, 1:K])) plot_density_forecast(dist_t, pt_fcst, title = "Predictive Density at t = 100", ci_level = 0.90)
Generates a grid of scatter plots for Root Mean Squared Error
(RMSE), Normalized Root Mean Squared Error (N-RMSE), Mean Absolute Error
(MAE), and Mean Absolute Scaled Error (MASE) plotted against the Equally
Weighted Risk Contribution (ERC) Weight of each competing forecast. These
metrics are used for visualization only and are computed directly from
forecast errors; they are distinct from the loss functions used in the
WRC/SPA/CPA hypothesis tests (see run_comprehensive_erc_analysis).
Each panel shows:
Points: one per model. The radius (size) of each circle is
proportional to the model's Risk Contribution, defined as
. Larger circles indicate
forecasts that contribute more to the portfolio-level risk.
Dashed line: an OLS regression line of the error metric (x-axis) on the ERC Weight (y-axis), showing whether higher-weighted forecasts tend to have systematically lower or higher error.
Note on MASE scaling: The MASE denominator used here is
mean(abs(diff(benchmark))), where benchmark is the
benchmark forecast column extracted from forecast_matrix.
This differs from the scaling used in run_comprehensive_erc_analysis,
where the denominator is mean(abs(diff(realizations_raw))) computed
from the actual realised values. The two denominators coincide when the
benchmark is a random-walk or historical-average forecast whose one-step-ahead
forecasts equal the lagged realised value, but will diverge otherwise.
The plot_performance_metrics function does not accept a separate
realisation vector, so the benchmark forecast series is used as a
proxy for the naive forecast scale. MASE values produced by this function
are therefore intended for visual comparison across forecasts only
and should not be directly compared to MASE values from the hypothesis
tests in run_comprehensive_erc_analysis.
plot_performance_metrics( forecast_matrix, weights = NULL, benchmark_col = ncol(forecast_matrix) )plot_performance_metrics( forecast_matrix, weights = NULL, benchmark_col = ncol(forecast_matrix) )
forecast_matrix |
Matrix or data frame of dimension |
weights |
Optional numeric vector of length |
benchmark_col |
Index or name of the benchmark column. Defaults to the last column. |
A gtable object produced by
grid.arrange containing four panels arranged
in a 2x2 grid: RMSE (top-left), N-RMSE (top-right), MAE (bottom-left),
MASE (bottom-right). Each panel is a
ggplot object and can be extracted individually if needed.
ERC weights are portfolio weights assigned so that every competing forecast
contributes an equal share to the total portfolio risk (measured here by
forecast error dispersion). Formally, weights are chosen so that
for all ,
where is a measure of forecast 's risk and
is the common
per-forecast risk budget determined endogenously by the equal-contribution
constraint (Maillard et al., 2010).
When weights are supplied by the user, they are treated as
pre-computed ERC weights and normalised to sum to one. When
weights = NULL, equal weights are used as a baseline.
Maillard, S., Roncalli, T., & Teïletche, J. (2010). The Properties of Equally Weighted Risk Contributions Portfolios. The Journal of Portfolio Management, 36(4), 60–70. doi:10.3905/jpm.2010.36.4.060
run_comprehensive_erc_analysis,
plot_cumulative_loss
data(metals) K_models <- ncol(metals) - 1 custom_weights <- (K_models:1) / sum(K_models:1) plot_performance_metrics(metals, weights = custom_weights, benchmark_col = 15)data(metals) K_models <- ncol(metals) - 1 custom_weights <- (K_models:1) / sum(K_models:1) plot_performance_metrics(metals, weights = custom_weights, benchmark_col = 15)
Implements the studentized Reality Check test for comparing predictive densities based on the ZP quantile loss function of Corradi & Swanson (2006). The test evaluates whether any competing forecast more accurately predicts the probability of the outcome falling below a specified threshold than the benchmark forecast.
H0: – no competing
forecast has a lower expected squared probability forecast error at threshold
than the benchmark (Corradi & Swanson, 2006, eq. 7).
H1: At least one competing forecast has lower expected ZP-loss than the benchmark.
reality_check_zp_test( zp_loss_differences, block_length, num_bootstrap_replications, alpha = 0.05 )reality_check_zp_test( zp_loss_differences, block_length, num_bootstrap_replications, alpha = 0.05 )
zp_loss_differences |
A |
block_length |
|
num_bootstrap_replications |
|
alpha |
|
The ZP loss for forecast at period is
where is a threshold, is the forecast's predictive CDF at
period , and is the indicator for a tail event.
Interpretively, the threshold defines a tail event of interest (e.g., the
5th percentile of realizations). The ZP loss penalises the squared difference between
the predicted probability of this event and whether it actually occurred. A forecast with
lower ZP loss more accurately calibrates the left-tail probability. The
threshold is typically set to a quantile of the realized series (e.g.,
quantile(realized, 0.05)); a lower threshold focuses the test more sharply on
extreme left-tail events.
The benchmark is treated as a degenerate (point-mass) predictive distribution
with , which is a conservative choice ensuring the
benchmark's ZP loss approximates the Brier score for the tail indicator.
Two p-values are returned: p_consistent and p_conservative, analogous
to those in superior_predictive_ability_test.
An object of class "htest". Lower p-values indicate that
at least one competing forecast is significantly better calibrated in the
left-tail than the benchmark forecast.
Also contains p_consistent and p_conservative.
Failure to reject H0 suggests no forecast provides significantly better density
fit than the benchmark forecast.
Corradi, V., & Swanson, N. R. (2006). Predictive density and conditional confidence interval accuracy tests. Journal of Econometrics, 135(1–2), 187–228. doi:10.1016/j.jeconom.2005.07.026
Corradi, V., & Swanson, N. R. (2011). The White Reality Check and some of its recent extensions. In Festschrift in honor of Halbert L. White.
Davidson, R., & MacKinnon, J. G. (2000). Bootstrap tests: How many bootstraps? Econometric Reviews, 19(1), 55–68. doi:10.1080/07474930008800459
Hansen, P. R. (2005). A Test for Superior Predictive Ability. Journal of Business & Economic Statistics, 23(4), 365–380. doi:10.1198/073500105000000063
Politis, D. N., & Romano, J. P. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89(428), 1303–1313. doi:10.1080/01621459.1994.10476870
data(metals) # metals: 165 x 15; columns 1-14 are competing forecasts, column 15 is the benchmark benchmark_col <- 15 P <- nrow(metals) K_total <- ncol(metals) K <- K_total - 1 # 14 competing forecasts forecast_variance <- estimate_forecast_variance(metals, benchmark_col = K_total, window_size = 20) comp_cols <- setdiff(seq_len(K_total), benchmark_col) forecast_sd_models <- sqrt(forecast_variance[, comp_cols]) threshold_val <- quantile(metals[, K_total], 0.05) zp_loss <- compute_zp(metals, forecast_sd_models, threshold = threshold_val, benchmark_col = K_total) zp_diff <- zp_loss[, K_total] - zp_loss[, comp_cols] reality_check_zp_test(zp_diff, block_length = 5, num_bootstrap_replications = 50)data(metals) # metals: 165 x 15; columns 1-14 are competing forecasts, column 15 is the benchmark benchmark_col <- 15 P <- nrow(metals) K_total <- ncol(metals) K <- K_total - 1 # 14 competing forecasts forecast_variance <- estimate_forecast_variance(metals, benchmark_col = K_total, window_size = 20) comp_cols <- setdiff(seq_len(K_total), benchmark_col) forecast_sd_models <- sqrt(forecast_variance[, comp_cols]) threshold_val <- quantile(metals[, K_total], 0.05) zp_loss <- compute_zp(metals, forecast_sd_models, threshold = threshold_val, benchmark_col = K_total) zp_diff <- zp_loss[, K_total] - zp_loss[, comp_cols] reality_check_zp_test(zp_diff, block_length = 5, num_bootstrap_replications = 50)
Runs a full suite of reality check and density forecast evaluation tests on one or more datasets. Tests included: White's Reality Check (WRC), Superior Predictive Ability (SPA), Conditional Predictive Ability (CPA), ZP Quantile Loss test, Kullback-Leibler Information Criterion (KLIC) test, Kupiec Unconditional Coverage (UC) test, CRPS-based CDF comparison, and per-model Diebold-Mariano statistics across four error metrics (MSE, MAE, MASE).
Technical Abbreviations:
WRC: White's Reality Check (White, 2000). Tests whether any competing forecast has lower expected loss than the benchmark; controls family-wise error rate.
SPA: Superior Predictive Ability test (Hansen, 2005). A studentized extension of WRC with improved power that corrects for irrelevant forecasts.
CPA: Conditional Predictive Ability test (Giacomini & White, 2006). Tests whether loss differentials are predictable by a conditioning variable.
ZP: Quantile Loss test (Corradi & Swanson, 2006). Evaluates whether
any competing forecast better calibrates the probability of a left-tail event
defined by the zp_quantile threshold.
KLIC: Kullback-Leibler Information Criterion based density test (Corradi & Swanson, 2006). Selects the forecast whose predictive density is closest to the true density in terms of KLIC distance, evaluated via Negative Log-Likelihood Scores (NLS) under a Gaussian predictive density assumption.
CRPS: Continuous Ranked Probability Score (Gneiting & Raftery, 2007). Jointly rewards calibration and sharpness of the predictive distribution.
UC: Kupiec Unconditional Coverage test (Kupiec, 1995).
MSE: Mean Squared Error.
MAE: Mean Absolute Error.
MASE: Mean Absolute Scaled Error.
run_comprehensive_erc_analysis( data_list_prepared, mods_matrix, alpha_grid, window_size, y_hat_all, y_raw, block_length = 5, n_boot = 999, zp_quantile = 0.05, n_crps_samples = 10, benchmark_col = NULL )run_comprehensive_erc_analysis( data_list_prepared, mods_matrix, alpha_grid, window_size, y_hat_all, y_raw, block_length = 5, n_boot = 999, zp_quantile = 0.05, n_crps_samples = 10, benchmark_col = NULL )
data_list_prepared |
Named list of prepared data frames, one per dataset. Each
element must be a list containing at least a field |
mods_matrix |
A legacy placeholder matrix retained for interface compatibility
with external pipelines in which forecasts were previously defined as a parameter
matrix. Its contents are not read or used anywhere in this function — the actual
forecast structure is derived entirely from the forecast matrices supplied in
|
alpha_grid |
Numeric scalar or vector of significance levels. Only the first
element ( |
window_size |
Integer window size for rolling variance estimation passed to
|
y_hat_all |
Named list of forecast results, one per dataset. Each element must
be a list of length at least 3, where the third element ( |
y_raw |
Named list of raw realized value vectors, one per dataset. Each element
is a numeric vector whose length must be at least |
block_length |
Integer block length for Moving Block Bootstrap (MBB) used in
WRC, SPA, CPA, ZP, and KLIC tests. Default is 5. A commonly used rule of thumb is
|
n_boot |
|
zp_quantile |
Numeric quantile level used to define the left-tail threshold
|
n_crps_samples |
Integer number of Monte Carlo samples drawn from the
Gaussian predictive distribution |
benchmark_col |
Index or name of the benchmark column in the forecast matrix.
Defaults to the last column ( |
list containing:
aggregate_results: Named list of test results per dataset. Each
dataset element contains named htest objects for each test and metric
combination, plus VaR_Backtests (a list of per-model Kupiec htest
objects).
per_model_results: Named list of per-model Diebold-Mariano statistics
per dataset, returned as data.frame objects from
compute_per_model_statistics.
Davidson, R., & MacKinnon, J. G. (2000). Bootstrap tests: How many bootstraps? Econometric Reviews, 19(1), 55–68. doi:10.1080/07474930008800459
White, H. (2000). A reality check for data snooping. Econometrica, 68(5), 1097–1126. doi:10.1111/1468-0262.00152
Hansen, P. R. (2005). A Test for Superior Predictive Ability. Journal of Business & Economic Statistics, 23(4), 365–380. doi:10.1198/073500105000000063
Giacomini, R., & White, H. (2006). Tests of Conditional Predictive Ability. Econometrica, 74(6), 1545–1578. doi:10.1111/j.1468-0262.2006.00718.x
Corradi, V., & Swanson, N. R. (2006). Predictive density and conditional confidence interval accuracy tests. Journal of Econometrics, 135(1–2), 187–228. doi:10.1016/j.jeconom.2005.07.026
Corradi, V., & Swanson, N. R. (2011). The White Reality Check and some of its recent extensions. In Festschrift in honor of Halbert L. White.
Gneiting, T., & Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 102(477), 359–378. doi:10.1198/016214506000001437
Kupiec, P. H. (1995). Techniques for Verifying the Accuracy of Risk Measurement Models. The Journal of Derivatives, 3(2), 173–184. doi:10.3905/jod.1995.407942
Politis, D. N., & Romano, J. P. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89(428), 1303–1313. doi:10.1080/01621459.1994.10476870
Diebold, F. X., & Mariano, R. S. (1995). Comparing Predictive Accuracy. Journal of Business & Economic Statistics, 13(3), 253–263. doi:10.1080/07350015.1995.10524599
create_unified_summary,
generate_comprehensive_report,
white_reality_check,
superior_predictive_ability_test,
white_reality_check_conditional,
reality_check_zp_test,
kullback_leibler_test,
compute_kupiec
Implements the Hansen (2005) Superior Predictive Ability (SPA) test, a studentized extension of White's (2000) Reality Check that corrects for the inclusion of irrelevant (poor) forecasts to reduce conservatism.
Hypotheses:
H0: – no competing
forecast produces strictly lower expected loss than the benchmark forecast.
H1: At least one competing forecast has strictly lower expected loss than the benchmark forecast.
superior_predictive_ability_test( loss_differences, block_length, num_bootstrap_replications, alpha )superior_predictive_ability_test( loss_differences, block_length, num_bootstrap_replications, alpha )
loss_differences |
A |
block_length |
|
num_bootstrap_replications |
|
alpha |
|
The SPA statistic studentizes each mean loss differential by its HAC standard
deviation (estimated via estimate_long_run_covariance), then takes
the maximum across forecasts. Two p-values are returned, corresponding to two choices
of the null distribution (Hansen, 2005, Section 3):
p_consistent: uses the sample-dependent null estimator
, which recentres the bootstrap statistic at the sample mean
for each forecast. This is the recommended p-value.
p_conservative: uses the Least Favourable Configuration (LFC)
for all forecasts – equivalent to White's (2000) Reality
Check bootstrap, where no recentring is applied. This provides an upper bound
on the true p-value and is always p_consistent.
An object of class "htest". Additionally contains
p_consistent and p_conservative for the two SPA bootstrap variants.
Hansen, P. R. (2005). A Test for Superior Predictive Ability. Journal of Business & Economic Statistics, 23(4), 365–380. doi:10.1198/073500105000000063
Corradi, V., & Swanson, N. R. (2011). The White Reality Check and some of its recent extensions. In Festschrift in honor of Halbert L. White.
Davidson, R., & MacKinnon, J. G. (2000). Bootstrap tests: How many bootstraps? Econometric Reviews, 19(1), 55–68. doi:10.1080/07474930008800459
Kunsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. The Annals of Statistics, 17(3), 1217–1241. doi:10.1214/aos/1176347265
data(metals) # metals: 165 x 15; columns 1-14 are competing forecasts, column 15 is the benchmark # A small offset (+0.5) is added to the lagged benchmark to avoid degenerate zero # loss differences when forecasts equal the realized value exactly (illustration only). P <- nrow(metals) K_total <- ncol(metals) K <- K_total - 1 # 14 competing forecasts realized <- c(metals[-1, K_total], metals[P, K_total]) + 0.5 benchmark_loss <- (metals[, K_total] - realized)^2 model_loss <- (metals[, 1:K] - realized)^2 loss_diff <- benchmark_loss - model_loss res <- superior_predictive_ability_test(loss_diff, block_length = 5, num_bootstrap_replications = 50, alpha = 0.05) print(res)data(metals) # metals: 165 x 15; columns 1-14 are competing forecasts, column 15 is the benchmark # A small offset (+0.5) is added to the lagged benchmark to avoid degenerate zero # loss differences when forecasts equal the realized value exactly (illustration only). P <- nrow(metals) K_total <- ncol(metals) K <- K_total - 1 # 14 competing forecasts realized <- c(metals[-1, K_total], metals[P, K_total]) + 0.5 benchmark_loss <- (metals[, K_total] - realized)^2 model_loss <- (metals[, 1:K] - realized)^2 loss_diff <- benchmark_loss - model_loss res <- superior_predictive_ability_test(loss_diff, block_length = 5, num_bootstrap_replications = 50, alpha = 0.05) print(res)
Implements White's (2000) Reality Check (WRC) for comparing forecast accuracy of multiple competing forecasts against a benchmark forecast based on mean loss differences. The test controls the family-wise error rate across all forecast comparisons simultaneously, avoiding data-snooping bias.
Hypotheses:
H0: – no competing
forecast produces a strictly lower expected loss than the benchmark forecast.
H1: At least one competing forecast has strictly lower expected loss than the benchmark forecast.
white_reality_check( loss_differences, n_simulations = 999, block_length = 5, alpha = 0.05 )white_reality_check( loss_differences, n_simulations = 999, block_length = 5, alpha = 0.05 )
loss_differences |
A |
n_simulations |
|
block_length |
|
alpha |
|
The test statistic is , where
is the sample mean of the loss differential series for forecast
(White, 2000, eq. 2). Bootstrap p-values are obtained via the MBB of
Kunsch (1989) by recentring each bootstrap statistic at the sample mean, following
the procedure in Corradi & Swanson (2011). This is an unstudentized test; for
a studentized version with improved power against irrelevant forecasts, see
superior_predictive_ability_test.
An object of class "htest". The printed output shows the
test statistic (maximum mean loss differential), the bootstrap p-value, and the
test name. A small p-value (below alpha) leads to rejection of H0,
indicating that at least one competing forecast is significantly more accurate
than the benchmark forecast.
White, H. (2000). A reality check for data snooping. Econometrica, 68(5), 1097–1126. doi:10.1111/1468-0262.00152
Kunsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. The Annals of Statistics, 17(3), 1217–1241. doi:10.1214/aos/1176347265
Davidson, R., & MacKinnon, J. G. (2000). Bootstrap tests: How many bootstraps? Econometric Reviews, 19(1), 55–68. doi:10.1080/07474930008800459
Corradi, V., & Swanson, N. R. (2011). The White Reality Check and some of its recent extensions. In Festschrift in honor of Halbert L. White.
data(metals) # metals: 165 x 15; columns 1-14 are competing forecasts, column 15 is the benchmark # A small offset (+0.5) is added to the lagged benchmark to avoid degenerate zero # loss differences when forecasts equal the realized value exactly (illustration only). P <- nrow(metals) K_total <- ncol(metals) K <- K_total - 1 # 14 competing forecasts realized <- c(metals[-1, K_total], metals[P, K_total]) + 0.5 benchmark_loss <- (metals[, K_total] - realized)^2 model_loss <- (metals[, 1:K] - realized)^2 loss_diff <- benchmark_loss - model_loss res <- white_reality_check(loss_diff, block_length = 5, n_simulations = 50) print(res)data(metals) # metals: 165 x 15; columns 1-14 are competing forecasts, column 15 is the benchmark # A small offset (+0.5) is added to the lagged benchmark to avoid degenerate zero # loss differences when forecasts equal the realized value exactly (illustration only). P <- nrow(metals) K_total <- ncol(metals) K <- K_total - 1 # 14 competing forecasts realized <- c(metals[-1, K_total], metals[P, K_total]) + 0.5 benchmark_loss <- (metals[, K_total] - realized)^2 model_loss <- (metals[, 1:K] - realized)^2 loss_diff <- benchmark_loss - model_loss res <- white_reality_check(loss_diff, block_length = 5, n_simulations = 50) print(res)
Implements a studentized Reality Check test that compares competing
forecasts against a benchmark across the entire distribution of loss
differences, not only the mean. For each forecast and each quantile
threshold (derived from the pooled empirical distribution of loss
differences), the test evaluates
whether the empirical CDF of forecast 's loss differences lies
uniformly above that of the benchmark, indicating stochastic dominance of the
benchmark's loss distribution over the forecast's loss distribution.
Hypotheses:
H0:
for all and all quantile thresholds
– no competing forecast has a uniformly higher
empirical CDF of loss differences than the benchmark at any evaluation point.
H1: At least one competing forecast has a significantly higher
empirical CDF of loss differences than the benchmark at some quantile threshold
, i.e., the benchmark is stochastically dominated in terms of
loss differences.
white_reality_check_cdf_approx( loss_differences, block_length, num_bootstrap_replications, alpha = 0.05 )white_reality_check_cdf_approx( loss_differences, block_length, num_bootstrap_replications, alpha = 0.05 )
loss_differences |
A |
block_length |
|
num_bootstrap_replications |
|
alpha |
|
The test proceeds in three steps.
Step 1 — Quantile grid. A grid of evaluation points
is constructed as the
quantiles of the pooled
empirical distribution of all loss differences (across all forecasts and all periods).
Using quantiles of the data rather than a fixed grid ensures that the evaluation
points are always in the support of the observed loss differences.
Step 2 — Indicator matrix. For each forecast and each threshold
, the binary indicator
is formed, where is the loss difference for forecast at
period . This yields a matrix Gdata
with columns (for the metals dataset).
The column mean estimates the
empirical CDF of forecast 's loss differences evaluated at .
A higher CDF value means a larger fraction of the forecast's loss differences fall
below , i.e., the competing forecast more frequently outperforms the
benchmark forecast (since a positive loss difference means the competing forecast is more accurate).
Step 3 — Studentized test statistic. Each column mean is studentized by
its HAC standard deviation (from estimate_long_run_covariance),
and the test statistic is the maximum studentized CDF indicator mean across all
columns:
Bootstrap p-values are obtained via the MBB of Kunsch (1989) with recentring, following the WRC procedure of White (2000) and Corradi & Swanson (2011).
Relationship to Corradi & Swanson (2006). This test is a loss-difference
analogue of the predictive CDF comparison in Corradi & Swanson (2006, Section 4).
Rather than comparing forecast CDFs against the true conditional distribution
(as in the ZP test), it compares empirical CDFs of loss differences against
zero, assessing stochastic dominance of the benchmark over each competing forecast
in terms of loss. It complements white_reality_check (which tests
only the mean) by detecting cases where one forecast is better in the tails but
not on average.
Lower p-values are more informative: rejection of H0 indicates that at
least one forecast stochastically dominates the benchmark at some point of the loss
distribution. Failure to reject does not preclude dominance at specific quantiles
– it means no single combination is significant after controlling
for multiple comparisons.
An object of class "htest" with the following components:
statistic |
Maximum studentized CDF indicator mean across all
forecast-quantile combinations,
labelled "KS-type". |
p.value |
Bootstrap p-value from the MBB procedure. |
method |
"Expected Loss CDF Comparison Test". |
null.value |
Named scalar
"max studentized CDF indicator mean
(benchmark minus forecast)" = 0. |
alternative |
Description of the alternative hypothesis. |
A small p-value (below alpha) leads to rejection of H0, indicating that
at least one competing forecast has a significantly higher empirical CDF of loss
differences than the benchmark at some quantile threshold – i.e., the competing forecast
more frequently produces smaller losses than the benchmark forecast in some region of the
loss distribution.
Corradi, V., & Swanson, N. R. (2006). Predictive density and conditional confidence interval accuracy tests. Journal of Econometrics, 135(1–2), 187–228. doi:10.1016/j.jeconom.2005.07.026
Davidson, R., & MacKinnon, J. G. (2000). Bootstrap tests: How many bootstraps? Econometric Reviews, 19(1), 55–68. doi:10.1080/07474930008800459
White, H. (2000). A reality check for data snooping. Econometrica, 68(5), 1097–1126. doi:10.1111/1468-0262.00152
Corradi, V., & Swanson, N. R. (2011). The White Reality Check and some of its recent extensions. In Festschrift in honor of Halbert L. White.
Kunsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. The Annals of Statistics, 17(3), 1217–1241. doi:10.1214/aos/1176347265
Politis, D. N., & Romano, J. P. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89(428), 1303–1313. doi:10.1080/01621459.1994.10476870
white_reality_check for the mean-based WRC test;
reality_check_zp_test for a distributional test based on the
true conditional CDF; superior_predictive_ability_test for the
studentized mean-based SPA test.
data(metals) # metals: 165 x 15; columns 1-14 are competing forecasts, column 15 is the benchmark # A small offset (+0.5) is added to the lagged benchmark to avoid degenerate zero # loss differences when forecasts equal the realized value exactly (illustration only). P <- nrow(metals) K_total <- ncol(metals) K <- K_total - 1L # 14 competing forecasts realized <- c(metals[-1, K_total], metals[P, K_total]) + 0.5 benchmark_loss <- (metals[, K_total] - realized)^2 model_loss <- (metals[, 1:K] - realized)^2 loss_diff <- benchmark_loss - model_loss res <- white_reality_check_cdf_approx(loss_diff, block_length = 5, num_bootstrap_replications = 50) print(res)data(metals) # metals: 165 x 15; columns 1-14 are competing forecasts, column 15 is the benchmark # A small offset (+0.5) is added to the lagged benchmark to avoid degenerate zero # loss differences when forecasts equal the realized value exactly (illustration only). P <- nrow(metals) K_total <- ncol(metals) K <- K_total - 1L # 14 competing forecasts realized <- c(metals[-1, K_total], metals[P, K_total]) + 0.5 benchmark_loss <- (metals[, K_total] - realized)^2 model_loss <- (metals[, 1:K] - realized)^2 loss_diff <- benchmark_loss - model_loss res <- white_reality_check_cdf_approx(loss_diff, block_length = 5, num_bootstrap_replications = 50) print(res)
Implements the Conditional Predictive Ability (CPA) test of Giacomini &
White (2006), extended to a multiple-forecast setting via a studentized Reality Check
statistic. Tests whether any competing forecast's predictive advantage over the benchmark
is state-dependent, i.e., predictable from a conditioning variable known
at the time the forecast is made.
Hypotheses:
H0: for all
– no competing forecast's loss differential with the benchmark
is predictable using the conditioning information .
H1: At least one forecast's loss differential
is predictable by , i.e.,
for some .
white_reality_check_conditional( loss_differences, weighting_vector, block_length, num_bootstrap_replications, alpha )white_reality_check_conditional( loss_differences, weighting_vector, block_length, num_bootstrap_replications, alpha )
loss_differences |
A |
weighting_vector |
|
block_length |
|
num_bootstrap_replications |
|
alpha |
|
The test multiplies each column of loss_differences element-wise by
weighting_vector to form the weighted loss differential series
. The unconditional mean of this product,
, equals zero under H0 by the law of iterated
expectations when is a valid instrument. The test statistic is the
maximum studentized mean across all forecasts:
where is the HAC standard deviation of
estimated via estimate_long_run_covariance. Bootstrap p-values are
obtained via the MBB of Kunsch (1989) with recentring, following the SPA-type
procedure of Hansen (2005) applied to the weighted series.
weighting_vector)A significant result means that knowing allows one to predict which
forecast will perform better in period – the benchmark's advantage (or
disadvantage) is state-dependent and potentially exploitable. This is a strictly
stronger statement than the unconditional WRC: a forecast can fail the WRC (no
unconditional improvement) yet pass the CPA test (conditional improvement in
specific states).
must be measurable with respect to the information set available at
time (Giacomini & White, 2006, Assumption 1) – it must not use
information from period or later. The scale of does not
affect the test result because the statistic is studentized by its own HAC
standard deviation.
Recommended choices:
abs(realized) – absolute realised valuesTests whether forecast performance depends on outcome magnitude – a natural
proxy for market volatility or economic uncertainty. Default in
run_comprehensive_erc_analysis.
c(realized[1], realized[-length(realized)]) – lagged realised valuesTests whether the previous period's outcome predicts which forecast wins next period. Relevant when forecast errors are autocorrelated.
rep(1, P) – constant vectorThe product reduces to , making the
CPA test equivalent to the unconditional WRC. Use as a sanity check: results
should be consistent with white_reality_check.
E.g., a recession dummy, VIX level, lagged interest rate spread, or monetary
policy stance dummy. Tests whether one forecast systematically outperforms
during specific regimes. Must be lagged one period to ensure is
in the information set at the time of the forecast.
The scale of weighting_vector has no effect on inference because both the
test statistic and its bootstrap distribution are studentized by the same
.
An object of class "htest" with the following components:
statistic |
Maximum studentized weighted mean loss differential
across all forecasts, labelled "T-CPA". |
p.value |
Bootstrap p-value from the MBB procedure. |
method |
"Conditional Predictive Ability (CPA) Test". |
null.value |
Named scalar "max studentized weighted mean
loss differential" = 0. |
alternative |
Direction of the alternative hypothesis. |
reject_null |
Logical: TRUE if p.value <= alpha.
|
A small p-value indicates that at least one forecast's loss differential is
predictable from the conditioning variable . Failure to reject H0
means no evidence of state-dependent predictive ability for the chosen instrument.
Giacomini, R., & White, H. (2006). Tests of Conditional Predictive Ability. Econometrica, 74(6), 1545–1578. doi:10.1111/j.1468-0262.2006.00718.x
Hansen, P. R. (2005). A Test for Superior Predictive Ability. Journal of Business & Economic Statistics, 23(4), 365–380. doi:10.1198/073500105000000063
Corradi, V., & Swanson, N. R. (2011). The White Reality Check and some of its recent extensions. In Festschrift in honor of Halbert L. White.
Davidson, R., & MacKinnon, J. G. (2000). Bootstrap tests: How many bootstraps? Econometric Reviews, 19(1), 55–68. doi:10.1080/07474930008800459
Kunsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. The Annals of Statistics, 17(3), 1217–1241. doi:10.1214/aos/1176347265
Politis, D. N., & Romano, J. P. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89(428), 1303–1313. doi:10.1080/01621459.1994.10476870
white_reality_check for the unconditional WRC test (equivalent to
CPA with a constant weighting_vector);
superior_predictive_ability_test for the studentized unconditional test.
data(metals) # metals: 165 x 15; columns 1-14 are competing forecasts, column 15 is the benchmark # A small offset (+0.5) is added to the lagged benchmark to avoid degenerate zero # loss differences when forecasts equal the realized value exactly (illustration only). P <- nrow(metals) K_total <- ncol(metals) K <- K_total - 1L # 14 competing forecasts realized <- c(metals[-1, K_total], metals[P, K_total]) + 0.5 benchmark_loss <- (metals[, K_total] - realized)^2 model_loss <- (metals[, 1:K] - realized)^2 loss_diff <- benchmark_loss - model_loss # Example 1: absolute realised values as conditioning variable (volatility proxy) res1 <- white_reality_check_conditional( loss_differences = loss_diff, weighting_vector = abs(realized), block_length = 5, num_bootstrap_replications = 50, alpha = 0.05 ) print(res1) # Example 2: constant vector - should give results consistent with white_reality_check() res2 <- white_reality_check_conditional( loss_differences = loss_diff, weighting_vector = rep(1, P), block_length = 5, num_bootstrap_replications = 50, alpha = 0.05 ) print(res2)data(metals) # metals: 165 x 15; columns 1-14 are competing forecasts, column 15 is the benchmark # A small offset (+0.5) is added to the lagged benchmark to avoid degenerate zero # loss differences when forecasts equal the realized value exactly (illustration only). P <- nrow(metals) K_total <- ncol(metals) K <- K_total - 1L # 14 competing forecasts realized <- c(metals[-1, K_total], metals[P, K_total]) + 0.5 benchmark_loss <- (metals[, K_total] - realized)^2 model_loss <- (metals[, 1:K] - realized)^2 loss_diff <- benchmark_loss - model_loss # Example 1: absolute realised values as conditioning variable (volatility proxy) res1 <- white_reality_check_conditional( loss_differences = loss_diff, weighting_vector = abs(realized), block_length = 5, num_bootstrap_replications = 50, alpha = 0.05 ) print(res1) # Example 2: constant vector - should give results consistent with white_reality_check() res2 <- white_reality_check_conditional( loss_differences = loss_diff, weighting_vector = rep(1, P), block_length = 5, num_bootstrap_replications = 50, alpha = 0.05 ) print(res2)