This page provides a server-rendered, non-JavaScript summary of all ARGOS validation evidence.
It is designed to be parsable by automated audit tools and non-JS scrapers.
For the full machine-parsable dataset, see /api/validation-evidence (JSON)
and /api/brier-status (live Brier tracker JSON).
Important Disclosure: All retrospective validation evidence below is in-sample.
The same 47 events were used to both calibrate the GRS weighting scheme and evaluate it.
No holdout set or walk-forward protocol has been applied. The GRS/85 normalization is a
heuristic, not a calibrated probability. Prospective out-of-sample evidence is accumulating
via the Live Brier Tracker (see below).
BSS Interpretation: Why is the BSS negative? The Brier Skill Score (BSS) compares the Brier score of GRS/85, treated as a probability, against the score of a climatological baseline that predicts the base rate (81%) for every event. Because 81% of the events in this set are high-severity crises, always predicting that rate yields a lower (better) Brier score (0.155) than GRS/85 does (0.238), so the BSS is negative (-53.5%). This does not by itself mean the model lacks discriminative power; it means that dividing GRS by 85 does not produce a well-calibrated probability. The model's value lies in threshold-based risk classification (83.0% TPR at the Elevated threshold), not in its raw probability output. A proper calibration layer (e.g., Platt scaling or isotonic regression) applied to GRS could improve the Brier score, but fitting and validating one would require out-of-sample data that is not yet available.
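The arithmetic behind the negative BSS can be sketched as follows. This is a minimal illustration using the Brier scores reported on this page; the function names are hypothetical, and the 38-of-47 base rate is an assumption chosen to match the stated 81%.

```python
def climatological_brier(base_rate: float) -> float:
    """Brier score of a forecast that always predicts the base rate.

    For a constant forecast p on binary outcomes with mean p,
    the mean squared error simplifies to p * (1 - p).
    """
    return base_rate * (1.0 - base_rate)


def brier_skill_score(brier_model: float, brier_reference: float) -> float:
    """BSS = 1 - BS_model / BS_reference. Negative means worse than the reference."""
    return 1.0 - brier_model / brier_reference


# Assumption: 38 of 47 events are high-severity (about 81%).
reference = climatological_brier(38 / 47)

# Brier scores as reported on this page (model 0.238, baseline 0.155).
bss = brier_skill_score(0.238, 0.155)

print(f"climatological Brier ~ {reference:.3f}")
print(f"BSS = {bss:.1%}")
```

Under these inputs the baseline comes out near 0.155 and the skill score near -53.5%, in line with the figures quoted on this page.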
Confidence Intervals (10,000 Bootstrap Samples)
| Metric | Point Estimate | 95% CI Lower | 95% CI Upper |
| --- | --- | --- | --- |
| True Positive Rate (Elevated+ flag) | 0.830 | 0.723 | 0.936 |
| Brier Score (GRS/85 vs severity >= 4) | 0.238 | 0.200 | 0.279 |
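A percentile bootstrap of the kind summarized above might look like the sketch below. The page does not state which bootstrap variant was used, so the percentile method, the seed, and the helper name `bootstrap_ci` are assumptions; the 39 flagged and 8 unflagged events come from the figures on this page.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the events with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# 39 of 47 events flagged at Elevated or above, 8 not flagged.
flags = [1] * 39 + [0] * 8
point = sum(flags) / len(flags)
lower, upper = bootstrap_ci(flags)

print(f"TPR = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Because the resampling is random, the interval endpoints can shift slightly with the seed and the number of resamples.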
Calibration Bins
| Bin | Predicted Rate | Observed Rate | Count |
| --- | --- | --- | --- |
| Low (< 15) | 0.0% | 50.0% | 2 |
| Moderate (15-30) | 26.5% | 50.0% | 6 |
| Elevated (30-45) | 44.1% | 81.5% | 27 |
| High (45-60) | 61.8% | 100.0% | 8 |
| Critical (60+) | 85.3% | 100.0% | 4 |
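As a consistency check, the bin counts and observed rates can be combined to recover the overall base rate. The sketch below assumes that "Observed Rate" is the share of events in each bin with severity >= 4, which this page does not state explicitly; the variable names are illustrative.

```python
# (bin name, observed rate, event count) as listed in the calibration bins.
bins = [
    ("Low", 0.500, 2),
    ("Moderate", 0.500, 6),
    ("Elevated", 0.815, 27),
    ("High", 1.000, 8),
    ("Critical", 1.000, 4),
]

total_events = sum(count for _, _, count in bins)

# Observed rates are rounded to one decimal place in the table,
# so round each product back to a whole number of events.
severe_events = sum(round(rate * count) for _, rate, count in bins)

base_rate = severe_events / total_events
print(f"{severe_events} of {total_events} events, base rate = {base_rate:.1%}")
```

Under that assumption the bins add up to 47 events with a base rate close to the 81% quoted on this page.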
Category Breakdown
| Category | Events | TPR |
| --- | --- | --- |
| Conflict | 17 | 100.0% |
| Economic | 9 | 77.8% |
| Political | 14 | 71.4% |
| Social | 7 | 71.4% |
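The per-category rates can be rolled up into the overall TPR as a quick cross-check. This is only a sketch: it assumes each category's TPR is the number of flagged events divided by that category's event count, and the names are illustrative.

```python
# (category, event count, TPR) as listed in the category breakdown.
categories = [
    ("Conflict", 17, 1.000),
    ("Economic", 9, 0.778),
    ("Political", 14, 0.714),
    ("Social", 7, 0.714),
]

total_events = sum(events for _, events, _ in categories)

# TPRs are rounded in the table, so round each product to whole events.
flagged = sum(round(events * tpr) for _, events, tpr in categories)

overall_tpr = flagged / total_events
print(f"{flagged} of {total_events} flagged, overall TPR = {overall_tpr:.3f}")
```

The roll-up gives 39 of 47 events, which matches the 0.830 point estimate reported for the Elevated+ flag.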
False Negatives (8 Events Not Flagged at Elevated+)
| Year | Country | Event | GRS | Tier |
| --- | --- | --- | --- | --- |
| 2008 | USA | Lehman Brothers collapse / GFC | 18.5 | Moderate |
| 2010 | GRC | Sovereign debt crisis onset | 28.4 | Moderate |
| 2019 | HKG | Pro-democracy protests | 22.4 | Moderate |
| 2016 | GBR | Brexit referendum | 12.8 | Low |
| 2019 | BOL | Political crisis and Morales ouster | 29.8 | Moderate |
| 2020 | ITA | COVID-19 first European wave | 14.2 | Low |
| 2015 | NPL | Gorkha earthquake | 28.9 | Moderate |
| 2022 | KAZ | Bloody January protests | 27.5 | Moderate |
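The tier assignment and the Elevated+ flag can be sketched as a simple threshold lookup. The cut points are taken from the calibration bin labels on this page; treating each lower bound as inclusive is an assumption, and the function names are hypothetical.

```python
from bisect import bisect_right

# Tier boundaries taken from the calibration bin labels.
# Assumption: each lower bound is inclusive (e.g. GRS 30.0 is Elevated).
TIER_BOUNDS = [15, 30, 45, 60]
TIER_NAMES = ["Low", "Moderate", "Elevated", "High", "Critical"]


def tier(grs: float) -> str:
    """Map a GRS value to its risk tier."""
    return TIER_NAMES[bisect_right(TIER_BOUNDS, grs)]


def is_flagged(grs: float) -> bool:
    """True when the event would be flagged at Elevated or above."""
    return grs >= TIER_BOUNDS[1]


for country, grs in [("GBR", 12.8), ("GRC", 28.4), ("BOL", 29.8)]:
    print(country, grs, tier(grs), "flagged" if is_flagged(grs) else "not flagged")
```

All three sample events fall below the Elevated cut point, which is why they appear in the false-negative list.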
Benchmarks
| Model | Metric | Value | Source |
| --- | --- | --- | --- |
| ARGOS GRS (in-sample) | TPR (Elevated+ flag rate) | 0.830 | 39 of 47 events flagged at Elevated or above |
| ARGOS GRS (in-sample) | Brier Score | 0.238 | Lower is better. Climatological baseline: 0.155 |
| Naive baseline: always predict "Elevated" | TPR | 1.000 | Perfect recall but 100% false positive rate; no discrimination |
| Naive baseline: random coin flip | Expected TPR | 0.500 | No information content |
| Fragile States Index (FSI) | Comparable TPR (literature) | 0.720 | Approximate from published FSI validation studies; annual granularity only |
| Global Peace Index (GPI) | Comparable TPR (literature) | 0.680 | Approximate from published GPI methodology papers; focuses on peace/violence |
| ACLED event-count threshold | Estimated TPR | 0.362 | Simulated: uses conflict-event proxy from BACKTEST_EVENTS. Countries with pre-event GRS ISI sub-index > 30 treated as high-ACLED. Captures conflict events well but misses economic/political crises. |
| Regional GRS average | Estimated TPR | 0.851 | Simulated: assigns each event its region's mean GRS from the 47-event set. Tests whether regional context alone provides signal. |
| ETI-only (macro model) | Estimated TPR | 0.830 | Simulated: flags events where the ETI sub-index contribution (grsPreEvent * 0.25) >= 7.5, approximating an ETI-only threshold. Tests whether economic indicators alone suffice. |
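The simulated ETI-only rule, as described, scales the pre-event GRS by 0.25 and compares it with 7.5. The sketch below checks how that rule relates to a plain GRS cutoff of 30; the Elevated cutoff of 30 is an assumption taken from the calibration bin labels, and the function names are hypothetical.

```python
ETI_WEIGHT = 0.25        # from the benchmark description
ETI_CUTOFF = 7.5         # from the benchmark description
ELEVATED_CUTOFF = 30.0   # assumption: Elevated tier starts at GRS 30


def eti_only_flag(grs_pre_event: float) -> bool:
    """Simulated ETI-only rule: scaled GRS contribution at or above 7.5."""
    return grs_pre_event * ETI_WEIGHT >= ETI_CUTOFF


def elevated_flag(grs_pre_event: float) -> bool:
    """Standard Elevated+ flag on the unscaled GRS."""
    return grs_pre_event >= ELEVATED_CUTOFF


sample = [12.8, 14.2, 18.5, 22.4, 27.5, 28.4, 28.9, 29.8, 30.0, 44.9, 60.0]
disagreements = [g for g in sample if eti_only_flag(g) != elevated_flag(g)]
print("disagreements:", disagreements)
```

Since 7.5 / 0.25 = 30, the two rules agree on every sampled value under this assumption, which may be why the ETI-only benchmark reports the same 0.830 TPR as the ARGOS GRS row.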
2. Prospective Validation (Live Brier Tracker)
Status: Accumulating
Start Date: 2026-03-15
Total Snapshots: 9163
Resolved: 0
Pending: 9163
Expired (Unresolved): 4933
Forecast Horizon: 30 days
Running Brier Score: N/A
Progress: 0/30
Status: Accumulating data. 0 of the 30 resolved forecasts required for a reportable Brier score are available.
Prospective Validation Note: The Live Brier Tracker records automated
GRS-Live probability snapshots every 6 hours for all 85 baseline countries. Each snapshot
uses a logistic mapping (P = 1/(1+exp(-0.08*(GRS-45)))) to convert GRS-Live to a crisis
probability. Outcomes are resolved after the 30-day forecast window expires. This evidence
is strictly separated from the retrospective 47-event calibration set. A minimum of 30
resolved forecasts is required before reporting a Brier score.
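The logistic mapping and the 30-forecast reporting rule described above can be sketched as follows. The formula and the minimum of 30 come from this page; the function names and the (probability, outcome) data layout are assumptions for illustration, not the tracker's actual implementation.

```python
import math
from typing import Optional, Sequence, Tuple

MIN_RESOLVED = 30  # minimum resolved forecasts before a score is reported


def crisis_probability(grs_live: float) -> float:
    """Logistic mapping P = 1 / (1 + exp(-0.08 * (GRS - 45)))."""
    return 1.0 / (1.0 + math.exp(-0.08 * (grs_live - 45.0)))


def running_brier(resolved: Sequence[Tuple[float, int]]) -> Optional[float]:
    """Mean squared error over (probability, outcome) pairs.

    Returns None until at least MIN_RESOLVED forecasts have resolved.
    """
    if len(resolved) < MIN_RESOLVED:
        return None
    return sum((p - y) ** 2 for p, y in resolved) / len(resolved)


for grs in (30.0, 45.0, 60.0):
    print(grs, round(crisis_probability(grs), 3))

print(running_brier([]))  # no resolved forecasts yet
```

The mapping is centered at GRS 45, where it returns 0.5, and with no resolved forecasts the score stays unreported, which corresponds to the "N/A" shown in the tracker status.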
3. Evidence Classification
| Dimension | Retrospective (In-Sample) | Prospective (Out-of-Sample) |
| --- | --- | --- |
| Data Source | 47 hand-selected geopolitical events, 2008-2024. Same events used for both calibration and evaluation. | Live Brier Tracker: automated GRS-Live snapshots recorded every 6 hours, resolved against observed outcomes after 30-day windows. |
| Primary Metric | True Positive Rate: 83.0% of events flagged at Elevated+ tier. | Running Brier Score (logistic calibration). Currently accumulating. |
| What It Proves | Internal consistency: the GRS tier system rank-orders historical crises consistently with observed severity. | Genuine predictive accuracy: whether GRS-derived probabilities, issued before events, match observed crisis rates. |
| Known Limitations | No holdout set. GRS/85 heuristic normalization produces negative BSS (-53.5%). | Data collection began 2026-03-15. Insufficient resolved forecasts. |