Distribution-Based Model Evaluation and Diagnostics: Elicitability, Propriety, and Scoring Rules for Hydrograph Functionals

Cited by: 6
Author
Vrugt, Jasper A. [1]
Affiliation
[1] Univ Calif Irvine, Dept Civil & Environm Engn, Irvine, CA 92697 USA
Keywords
ensemble prediction; distribution forecast; elicitability; scoring rules; divergence score; propriety; sharpness; reliability; uncertainty; entropy; integral transform; logarithmic score; continuous ranked probability score; recession analysis; flow duration curve; signatures; watershed models; ranked probability score; Kullback-Leibler divergence; rainfall-runoff models; parameter estimation; hydrologic models; uncertainty estimation; improved calibration; density forecasts; catchment models
DOI
10.1029/2023WR036710
Chinese Library Classification
X [Environmental science, safety science]
Discipline classification code
08; 0830
Abstract
Distribution forecasts P over future quantities or events are routinely made in hydrology but usually traded for a (likelihood-weighted) mean or median prediction to accommodate error measures or scoring functions such as the mean absolute error or mean squared error. A case in point is the so-called Kling-Gupta efficiency (KGE) of Gupta et al. (2009) and improvements thereof (Lamontagne et al., 2020), which have rapidly gained popularity among hydrologists as alternative scoring functions to the commonly used Nash and Sutcliffe (1970) efficiency, but are equally exclusive in how they quantify model performance using only single-valued output of the quantities of interest. This point-valued mapping necessarily implies a loss of information about model performance. This paper advocates the use of probabilistic watershed model training, evaluation and diagnostics. Distribution evaluation opens a mature literature on scoring rules whose strong statistical underpinning provides, as we will demonstrate, the theory, context and guidelines necessary for the development of robust information-theoretically principled metrics for watershed signatures. These so-called hydrograph functionals are scalar-valued mappings of major behavioral watershed functions embodied in a strictly proper scoring rule. We discuss past developments that led to the current state-of-the-art of distribution evaluation in hydrology and review scoring rules for dichotomous and categorical events, quantiles (intervals) and density forecasts. We are particularly concerned with elicitable functionals and scoring rule propriety, discuss the decomposition of scoring rules into a sharpness, reliability and entropy term, and present diagnostically appealing strictly proper divergence scores of hydrograph functionals for flood frequency analysis, flow duration and recession curves. The usefulness and power of distribution-based model evaluation and diagnostics by means of scoring rules is demonstrated on simple illustrative problems and discharge distributions simulated with watershed models using random sampling and Bayesian model averaging. The presented theory (a) enables a more complete evaluation of distribution forecasts, (b) offers a statistically principled means for watershed model training, evaluation, diagnostics and selection using hydrograph functionals and/or extreme events and (c) provides a universal framework for metric development of watershed signatures, promoting metric standardization and reproducibility.
Plain Language Summary
The past decades have witnessed an unbridled growth in goodness-of-fit metrics of hydrologic models. These metrics may satisfy the needs of hydrologists but lack conforming theory and principles. This state of affairs (a) elicits improper model training and evaluation, (b) provokes and supports misguided inferences, (c) impedes statistically principled uncertainty quantification, metric standardization and development of universal model benchmarks and (d) obfuscates determination of whether the model has finished learning. What is more, most hydrologic model evaluation metrics in use today are rather exclusive in how they quantify model performance using only single-valued simulated output of the quantities of interest. Predictive distributions derived from (quasi-)Bayesian methods or ensembles are usually traded for a (likelihood-weighted) mean or median prediction to accommodate error measures (scoring functions) such as the mean absolute error. This implies a large loss of information. This paper develops a distribution-based approach to hydrologic model evaluation and diagnostics. Distribution evaluation opens the necessary theory and guidelines for development of robust information-theoretically principled metrics of watershed signatures. These so-called hydrograph functionals are scalar-valued mappings of major behavioral watershed functions embodied in a strictly proper scoring rule. The hydrograph functionals offer a statistically principled means for hydrologic model evaluation, diagnostics and selection.
Key Points
Scoring rules of hydrograph functionals provide an information-theoretically principled means for watershed model training, evaluation, and diagnostics.
We present strictly proper (divergence) scores for flood frequency analysis, flow duration, and recession curves.
Propriety and elicitability offer useful working paradigms for metric development of hydrograph functionals.
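The abstract's central claim, that collapsing a distribution forecast to its mean discards information, can be illustrated with a minimal sketch. It uses the standard sample-based estimator of the continuous ranked probability score, CRPS = E|X - y| - 0.5 E|X - X'|. The function names, ensemble values, and observation below are hypothetical and chosen only for illustration; none of them come from the paper.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'| (lower is better)."""
    x = np.asarray(members, dtype=float)
    # Mean absolute distance between ensemble members and the observation
    term_obs = np.mean(np.abs(x - obs))
    # Half the mean absolute distance between all ordered pairs of members
    term_spread = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return float(term_obs - term_spread)

def abs_error_of_mean(members, obs):
    """Absolute error after collapsing the ensemble to its mean."""
    return abs(float(np.mean(members)) - obs)

# Hypothetical discharge observation and two ensembles with the same mean (12.0)
obs = 10.0
sharp = np.array([11.0, 11.5, 12.0, 12.5, 13.0])  # narrow spread
wide = np.array([2.0, 7.0, 12.0, 17.0, 22.0])     # broad spread

mae_sharp = abs_error_of_mean(sharp, obs)
mae_wide = abs_error_of_mean(wide, obs)
crps_sharp = crps_ensemble(sharp, obs)
crps_wide = crps_ensemble(wide, obs)

print(f"abs. error of mean: sharp={mae_sharp:.2f}, wide={mae_wide:.2f}")
print(f"CRPS:               sharp={crps_sharp:.2f}, wide={crps_wide:.2f}")
```

In this toy setting the absolute error of the ensemble mean cannot tell the two forecasts apart, whereas the CRPS rewards the sharper ensemble. For a single-member ensemble the estimator reduces to the absolute error, which is why the CRPS is often read as a generalization of it.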
Pages: 80