ARGOS Validation Evidence Summary

This page provides a server-rendered, non-JavaScript summary of all ARGOS validation evidence. It is designed to be parsable by automated audit tools and non-JS scrapers. For the full machine-parsable dataset, see /api/validation-evidence (JSON) and /api/brier-status (live Brier tracker JSON).

Important Disclosure: All retrospective validation evidence below is in-sample. The same 47 events were used to both calibrate the GRS weighting scheme and evaluate it. No holdout set or walk-forward protocol has been applied. The GRS/85 normalization is a heuristic, not a calibrated probability. Prospective out-of-sample evidence is accumulating via the Live Brier Tracker (see below).

1. Retrospective Validation (In-Sample, 47 Events, 2008-2024)

True Positive Rate: 83.0%
Brier Score: 0.238
Brier Skill Score: -53.5%
Log Loss: 0.670
Climatological Baseline: 0.155
Events Flagged: 39/47

BSS Interpretation: Why is the BSS negative? The Brier Skill Score compares the Brier score of GRS/85, treated as a probability, against a climatological forecast that predicts the base rate (81%) for every event. Because 81% of the events in this set are high-severity crises (severity >= 4), that constant forecast scores 0.155, which is lower (better) than the 0.238 obtained with GRS/85. This does not mean the model lacks discriminative power. It means that dividing GRS by 85 does not produce a well-calibrated probability. The model's value is in threshold-based risk classification (83.0% TPR at the Elevated threshold), not in raw probability output. A proper calibration layer (e.g., Platt scaling or isotonic regression) applied to GRS could improve the Brier score, but fitting one would require out-of-sample data that is not yet available.
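The reported skill score can be reproduced from the two Brier figures using the standard definition BSS = 1 - BS / BS_ref. The following is a minimal sketch that uses only the values reported on this page; the function and variable names are illustrative and are not part of ARGOS.

```python
def brier_skill_score(brier: float, reference_brier: float) -> float:
    """Standard Brier Skill Score: 1 - BS / BS_ref (positive beats the reference)."""
    return 1.0 - brier / reference_brier


# Values reported in the summary above.
model_brier = 0.238
climatological_brier = 0.155

bss = brier_skill_score(model_brier, climatological_brier)
print(f"BSS = {bss:.1%}")

# A constant forecast equal to the base rate p has Brier score p * (1 - p).
# With the rounded 81% base rate this lands close to the reported baseline.
base_rate = 0.81
constant_forecast_brier = base_rate * (1.0 - base_rate)
print(f"Constant-forecast Brier = {constant_forecast_brier:.3f}")
```

Any model whose Brier score exceeds the baseline produces a negative BSS under this definition, regardless of how well its thresholds separate events.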

Confidence Intervals (10,000 Bootstrap Samples)

Metric | Point Estimate | 95% CI Lower | 95% CI Upper
True Positive Rate (Elevated+ flag) | 0.830 | 0.723 | 0.936
Brier Score (GRS/85 vs severity >= 4) | 0.238 | 0.200 | 0.279
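For readers unfamiliar with bootstrap intervals, the sketch below shows one common variant, the percentile bootstrap, applied to the 39-of-47 flag counts. It assumes a percentile method and a fixed random seed; the page does not state which bootstrap variant ARGOS uses, so the bounds it produces need not match the table exactly.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# 39 flagged events (1) and 8 missed events (0), as reported above.
flags = [1] * 39 + [0] * 8
point_estimate = sum(flags) / len(flags)
ci_lower, ci_upper = bootstrap_ci(flags)
print(f"TPR = {point_estimate:.3f}, 95% CI = [{ci_lower:.3f}, {ci_upper:.3f}]")
```

With only 47 events, every resampled rate is a multiple of 1/47, so the interval bounds move in steps of roughly 0.021.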

Calibration Bins

Bin | Predicted Rate | Observed Rate | Count
Low (< 15) | 0.0% | 50.0% | 2
Moderate (15-30) | 26.5% | 50.0% | 6
Elevated (30-45) | 44.1% | 81.5% | 27
High (45-60) | 61.8% | 100.0% | 8
Critical (60+) | 85.3% | 100.0% | 4
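The bins can be read as a calibration check: in a well-calibrated model the observed rate in each bin would sit close to the predicted rate. The sketch below computes the gap per bin and recovers the overall base rate from the bin counts. It uses only the figures in the table; rounding the observed rate times the count to a whole number of events is an assumption.

```python
# (bin label, predicted rate, observed rate, count) as reported in the table.
bins = [
    ("Low (< 15)", 0.000, 0.500, 2),
    ("Moderate (15-30)", 0.265, 0.500, 6),
    ("Elevated (30-45)", 0.441, 0.815, 27),
    ("High (45-60)", 0.618, 1.000, 8),
    ("Critical (60+)", 0.853, 1.000, 4),
]

total_events = sum(count for _, _, _, count in bins)
# Assumption: observed rate * count rounds to a whole number of crisis events.
crisis_events = sum(round(observed * count) for _, _, observed, count in bins)
base_rate = crisis_events / total_events

gaps = {}
for label, predicted, observed, count in bins:
    gaps[label] = observed - predicted
    print(f"{label:18s} observed - predicted = {gaps[label]:+.1%} (n = {count})")

print(f"Base rate = {crisis_events}/{total_events} = {base_rate:.1%}")
```

Every gap is positive, meaning GRS/85 under-predicts the observed crisis rate in each bin of this event set, which is consistent with the negative Brier Skill Score.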

Category Breakdown

Category | Events | TPR
Conflict | 17 | 100.0%
Economic | 9 | 77.8%
Political | 14 | 71.4%
Social | 7 | 71.4%
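As a consistency check, the per-category rates can be converted back into flagged-event counts and summed. This sketch uses only the table values; recovering each count by rounding is an assumption.

```python
# category -> (number of events, reported TPR)
categories = {
    "Conflict": (17, 1.000),
    "Economic": (9, 0.778),
    "Political": (14, 0.714),
    "Social": (7, 0.714),
}

# Assumption: events * TPR rounds to the whole number of flagged events.
flagged = {name: round(n * tpr) for name, (n, tpr) in categories.items()}
total_events = sum(n for n, _ in categories.values())
total_flagged = sum(flagged.values())

for name, count in flagged.items():
    print(f"{name:10s} {count}/{categories[name][0]} flagged")
print(f"Overall TPR = {total_flagged}/{total_events} = {total_flagged / total_events:.1%}")
```

The category counts sum to the 39 of 47 events reported in the headline figures.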

False Negatives (8 Events Not Flagged at Elevated+)

Year | Country | Event | GRS | Tier
2008 | USA | Lehman Brothers collapse / GFC | 18.5 | Low
2010 | GRC | Sovereign debt crisis onset | 28.4 | Moderate
2019 | HKG | Pro-democracy protests | 22.4 | Moderate
2016 | GBR | Brexit referendum | 12.8 | Low
2019 | BOL | Political crisis and Morales ouster | 29.8 | Moderate
2020 | ITA | COVID-19 first European wave | 14.2 | Low
2015 | NPL | Gorkha earthquake | 28.9 | Moderate
2022 | KAZ | Bloody January protests | 27.5 | Moderate
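The flagging rule behind this table can be sketched as a simple threshold test. The tier boundaries below are hypothetical: they are read off the Calibration Bins labels (15, 30, 45, 60) and treated as inclusive lower bounds, which the page does not confirm. Under these assumed boundaries a GRS of 18.5 maps to Moderate, whereas the table lists the 2008 USA event as Low, so the production tier rule may differ from this sketch.

```python
# Hypothetical tier boundaries, read off the Calibration Bins labels.
TIER_LOWER_BOUNDS = [
    (60.0, "Critical"),
    (45.0, "High"),
    (30.0, "Elevated"),
    (15.0, "Moderate"),
]
ELEVATED_THRESHOLD = 30.0  # assumed cut-off for the "Elevated+" flag


def grs_tier(grs: float) -> str:
    """Map a GRS value to a tier name under the assumed boundaries."""
    for lower_bound, name in TIER_LOWER_BOUNDS:
        if grs >= lower_bound:
            return name
    return "Low"


def is_flagged(grs: float) -> bool:
    """An event counts as flagged when its GRS reaches the Elevated tier."""
    return grs >= ELEVATED_THRESHOLD


# Pre-event GRS values from the false-negative table.
false_negative_grs = {
    "2008 USA": 18.5, "2010 GRC": 28.4, "2019 HKG": 22.4, "2016 GBR": 12.8,
    "2019 BOL": 29.8, "2020 ITA": 14.2, "2015 NPL": 28.9, "2022 KAZ": 27.5,
}
for event, grs in false_negative_grs.items():
    print(f"{event}: GRS {grs} -> {grs_tier(grs)}, flagged = {is_flagged(grs)}")
```

Several of these events sit within two points of the assumed threshold (29.8, 28.9, 28.4), so the in-sample TPR is sensitive to where the Elevated cut-off is placed.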

Benchmarks

Model | Metric | Value | Source
ARGOS GRS (in-sample) | TPR (Elevated+ flag rate) | 0.830 | 39 of 47 events flagged at Elevated or above
ARGOS GRS (in-sample) | Brier Score | 0.238 | Lower is better. Climatological baseline: 0.155
Naive baseline: always predict "Elevated" | TPR | 1.000 | Perfect recall but 100% false positive rate; no discrimination
Naive baseline: random coin flip | Expected TPR | 0.500 | No information content
Fragile States Index (FSI) | Comparable TPR (literature) | 0.720 | Approximate from published FSI validation studies; annual granularity only
Global Peace Index (GPI) | Comparable TPR (literature) | 0.680 | Approximate from published GPI methodology papers; focuses on peace/violence
ACLED event-count threshold | Estimated TPR | 0.362 | Simulated: uses conflict-event proxy from BACKTEST_EVENTS. Countries with pre-event GRS ISI sub-index > 30 treated as high-ACLED. Captures conflict events well but misses economic/political crises.
Regional GRS average | Estimated TPR | 0.851 | Simulated: assigns each event its region's mean GRS from the 47-event set. Tests whether regional context alone provides signal.
ETI-only (macro model) | Estimated TPR | 0.830 | Simulated: flags events where the ETI sub-index contribution (grsPreEvent * 0.25) >= 7.5, approximating an ETI-only threshold. Tests whether economic indicators alone suffice.
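One detail of the ETI-only row is worth spelling out: the stated rule, grsPreEvent * 0.25 >= 7.5, is algebraically the same as grsPreEvent >= 30. If the Elevated tier starts at GRS 30 (an assumption based on the Calibration Bins labels), the simulated ETI-only rule flags exactly the same events as the Elevated+ flag, which would account for the identical 0.830. The sketch below checks that equivalence numerically; the function names are illustrative.

```python
ELEVATED_THRESHOLD = 30.0  # assumed from the "Elevated (30-45)" bin label


def eti_only_flag(grs_pre_event: float) -> bool:
    """The simulated ETI-only rule exactly as described in the table."""
    return grs_pre_event * 0.25 >= 7.5


def elevated_flag(grs_pre_event: float) -> bool:
    """The Elevated+ flag under the assumed threshold."""
    return grs_pre_event >= ELEVATED_THRESHOLD


# Compare the two rules on a grid covering the 0-85 GRS range.
grid = [i / 10 for i in range(0, 851)]
rules_agree = all(eti_only_flag(g) == elevated_flag(g) for g in grid)
print(f"Rules agree on all {len(grid)} grid points: {rules_agree}")
```

Under that assumption the ETI-only benchmark is a rescaling of the GRS threshold and does not test economic indicators independently of the composite score.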

2. Prospective Validation (Live Brier Tracker)

Status: Accumulating
Start Date: 2026-03-15
Total Snapshots: 9163
Resolved: 0
Pending: 9163
Expired (Unresolved): 4933
Forecast Horizon: 30 days
Running Brier Score: N/A
Progress: 0/30

Status: Accumulating data. 0 of the 30 resolved forecasts required for a reportable Brier score have been collected.

Prospective Validation Note: The Live Brier Tracker records automated GRS-Live probability snapshots every 6 hours for all 85 baseline countries. Each snapshot uses a logistic mapping (P = 1/(1+exp(-0.08*(GRS-45)))) to convert GRS-Live to a crisis probability. Outcomes are resolved after the 30-day forecast window expires. This evidence is strictly separated from the retrospective 47-event calibration set. A minimum of 30 resolved forecasts is required before reporting a Brier score.
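The logistic mapping and the 30-forecast reporting rule described above can be sketched as follows. The formula and the minimum of 30 come from this page; the function names, the (probability, outcome) pair representation, and the choice to return None below the minimum are assumptions made for illustration.

```python
import math
from typing import Optional, Sequence, Tuple

MIN_RESOLVED = 30  # minimum resolved forecasts before a score is reported


def crisis_probability(grs_live: float) -> float:
    """Logistic mapping from the note: P = 1 / (1 + exp(-0.08 * (GRS - 45)))."""
    return 1.0 / (1.0 + math.exp(-0.08 * (grs_live - 45.0)))


def running_brier(resolved: Sequence[Tuple[float, int]]) -> Optional[float]:
    """Mean squared error over (probability, outcome) pairs.

    Returns None until MIN_RESOLVED forecasts have been resolved.
    """
    if len(resolved) < MIN_RESOLVED:
        return None
    return sum((p - y) ** 2 for p, y in resolved) / len(resolved)


for grs in (15.0, 30.0, 45.0, 60.0):
    print(f"GRS-Live {grs:4.1f} -> P(crisis) = {crisis_probability(grs):.3f}")

# With no resolved forecasts, as in the current tracker state, nothing is reported.
print(running_brier([]))
```

The mapping is centered at GRS 45, where it returns exactly 0.5, and the 0.08 slope controls how quickly the probability moves away from 0.5 on either side.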

3. Evidence Classification

Dimension | Retrospective (In-Sample) | Prospective (Out-of-Sample)
Data Source | 47 hand-selected geopolitical events, 2008-2024. Same events used for both calibration and evaluation. | Live Brier Tracker: automated GRS-Live snapshots recorded every 6 hours, resolved against observed outcomes after 30-day windows.
Primary Metric | True Positive Rate: 83.0% of events flagged at Elevated+ tier. | Running Brier Score (logistic calibration). Currently accumulating.
What It Proves | Internal consistency: the GRS tier system rank-orders historical crises consistently with observed severity. | Genuine predictive accuracy: whether GRS-derived probabilities, issued before events, match observed crisis rates.
Known Limitations | No holdout set. GRS/85 heuristic normalization produces negative BSS (-53.5%). | Data collection began 2026-03-15. Insufficient resolved forecasts.
Status | Complete | Accumulating