ARGOS Validation Evidence Summary

This page provides a server-rendered, non-JavaScript summary of all ARGOS validation evidence. It is designed to be parsable by automated audit tools and non-JS scrapers. For the full machine-parsable dataset, see /api/validation-evidence (JSON) and /api/brier-status (live Brier tracker JSON).

1. Retrospective Validation (In-Sample, 47 Events, 2008-2024)

True Positive Rate

83.0%

Brier Score

0.238

Brier Skill Score

-53.5%

Log Loss

0.670

Climatological Baseline

0.155

Events Flagged

39/47

BSS Interpretation: Why is the BSS negative? The Brier Skill Score compares the Brier score of GRS/85, treated as a probability, against that of a climatological forecast that always predicts the base rate (81%). Because 81% of events in this set are high-severity crises (severity >= 4), predicting that rate for every case yields a lower Brier score (0.155) than GRS/85 does (0.238), so BSS = 1 - 0.238/0.155 = -53.5%. This does not mean the model lacks discriminative power; it means that dividing GRS by 85 does not produce a well-calibrated probability. The model's value is in threshold-based risk classification (83.0% TPR at the Elevated threshold), not in raw probability output. A proper calibration layer (e.g., Platt scaling or isotonic regression) applied to GRS could improve the Brier score, but fitting one would require out-of-sample data that is not yet available.
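
The arithmetic behind the negative BSS can be reproduced with a minimal sketch. It assumes the standard definitions (a constant forecast at base rate p has Brier score p(1 - p), and BSS = 1 - BS/BS_ref) and uses the rounded figures quoted on this page; the function names are illustrative and not part of ARGOS.

```python
def climatological_brier(base_rate: float) -> float:
    """Brier score of a forecaster that always predicts the base rate."""
    return base_rate * (1.0 - base_rate)


def brier_skill_score(brier: float, brier_ref: float) -> float:
    """BSS = 1 - BS / BS_ref; a negative value means worse than the reference."""
    return 1.0 - brier / brier_ref


# Rounded figures quoted on this page.
reference = climatological_brier(0.81)   # about 0.154; the page reports 0.155
bss = brier_skill_score(0.238, 0.155)    # about -0.535, i.e. -53.5%
```

The small gap between 0.154 and the reported 0.155 is consistent with the 81% base rate being a rounded value.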

Confidence Intervals (10,000 Bootstrap Samples)

Metric	Point Estimate	95% CI Lower	95% CI Upper
True Positive Rate (Elevated+ flag)	0.830	0.723	0.936
Brier Score (GRS/85 vs severity >= 4)	0.238	0.200	0.279
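
A percentile bootstrap of this kind can be sketched as follows. This is an illustration under stated assumptions, not the ARGOS implementation: it resamples the 47 flag outcomes (39 flagged, 8 not) with replacement and reads the 2.5th and 97.5th percentiles of the resampled TPR. The function name, the seed, and the choice of the percentile method are assumptions.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# 39 of 47 events flagged at Elevated or above (from this page).
flags = [1] * 39 + [0] * 8
point = sum(flags) / len(flags)
lower, upper = bootstrap_ci(flags)
```

Because each resampled TPR is a multiple of 1/47, the interval endpoints land on a coarse grid, so small differences from the table are expected with a different seed or percentile convention.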

Calibration Bins

Bin	Predicted Rate	Observed Rate	Count
Low (< 15)	0.0%	50.0%	2
Moderate (15-30)	26.5%	50.0%	6
Elevated (30-45)	44.1%	81.5%	27
High (45-60)	61.8%	100.0%	8
Critical (60+)	85.3%	100.0%	4
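
A calibration table of this shape can be built by grouping events into tiers and comparing the predicted and observed rates in each. The sketch below takes the tier edges from the table above and assumes that the predicted rate is the mean of GRS/85 within a tier, which may not be how the Predicted Rate column was actually produced. The sample events are hypothetical and are not the 47-event set.

```python
# Tier edges taken from the Calibration Bins table; upper edge is exclusive.
TIERS = [
    ("Low", 0.0, 15.0),
    ("Moderate", 15.0, 30.0),
    ("Elevated", 30.0, 45.0),
    ("High", 45.0, 60.0),
    ("Critical", 60.0, float("inf")),
]


def tier_of(grs: float) -> str:
    """Return the tier name whose [low, high) range contains grs."""
    for name, low, high in TIERS:
        if low <= grs < high:
            return name
    raise ValueError(f"GRS out of range: {grs}")


def calibration_bins(events):
    """events: iterable of (grs, outcome), outcome 1 if severity >= 4.

    Returns {tier: (mean GRS/85, observed rate, count)} for non-empty tiers.
    """
    grouped = {}
    for grs, outcome in events:
        grouped.setdefault(tier_of(grs), []).append((grs / 85.0, outcome))
    return {
        tier: (
            sum(p for p, _ in rows) / len(rows),
            sum(o for _, o in rows) / len(rows),
            len(rows),
        )
        for tier, rows in grouped.items()
    }


# Hypothetical events for illustration only; not the 47-event set.
sample = [(12.0, 0), (14.0, 1), (35.0, 1), (40.0, 1), (65.0, 1)]
bins = calibration_bins(sample)
```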

Category Breakdown

Category	Events	TPR
Conflict	17	100.0%
Economic	9	77.8%
Political	14	71.4%
Social	7	71.4%

False Negatives (8 Events Not Flagged at Elevated+)

Year	Country	Event	GRS	Tier
2008	USA	Lehman Brothers collapse / GFC	18.5	Moderate
2010	GRC	Sovereign debt crisis onset	28.4	Moderate
2019	HKG	Pro-democracy protests	22.4	Moderate
2016	GBR	Brexit referendum	12.8	Low
2019	BOL	Political crisis and Morales ouster	29.8	Moderate
2020	ITA	COVID-19 first European wave	14.2	Low
2015	NPL	Gorkha earthquake	28.9	Moderate
2022	KAZ	Bloody January protests	27.5	Moderate

Benchmarks

Model	Metric	Value	Source
ARGOS GRS (in-sample)	TPR (Elevated+ flag rate)	0.830	39 of 47 events flagged at Elevated or above
ARGOS GRS (in-sample)	Brier Score	0.238	Lower is better. Climatological baseline: 0.155
Naive baseline: always predict "Elevated"	TPR	1.000	Perfect recall but 100% false positive rate; no discrimination
Naive baseline: random coin flip	Expected TPR	0.500	No information content
Fragile States Index (FSI)	Comparable TPR (literature)	0.720	Approximate from published FSI validation studies; annual granularity only
Global Peace Index (GPI)	Comparable TPR (literature)	0.680	Approximate from published GPI methodology papers; focuses on peace/violence
ACLED event-count threshold	Estimated TPR	0.362	Simulated: uses a conflict-event proxy from BACKTEST_EVENTS. Countries with a pre-event GRS ISI sub-index > 30 are treated as high-ACLED. Captures conflict events well but misses economic/political crises.
Regional GRS average	Estimated TPR	0.851	Simulated: assigns each event its region's mean GRS from the 47-event set. Tests whether regional context alone provides signal.
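ETI-only (macro model)	Estimated TPR	0.830	Simulated: flags events where the ETI sub-index contribution (grsPreEvent * 0.25) >= 7.5, approximating an ETI-only threshold. Intended to test whether economic indicators alone suffice. Note that this condition is algebraically equivalent to grsPreEvent >= 30 (the Elevated threshold), so it reproduces the ARGOS GRS TPR rather than providing an independent comparison.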

2. Prospective Validation (Live Brier Tracker)

Status

Accumulating

Start Date

2026-03-15

Total Snapshots

9163

Resolved

0

Pending

9163

Expired (Unresolved)

4933

Forecast Horizon

30 days

Running Brier Score

N/A

Progress

0/30

Status: Accumulating data. 0 of the 30 resolved forecasts required for a reportable accuracy score have been collected.

Prospective Validation Note: The Live Brier Tracker records automated GRS-Live probability snapshots every 6 hours for all 85 baseline countries. Each snapshot uses a logistic mapping (P = 1/(1+exp(-0.08*(GRS-45)))) to convert GRS-Live to a crisis probability. Outcomes are resolved after the 30-day forecast window expires. This evidence is strictly separated from the retrospective 47-event calibration set. A minimum of 30 resolved forecasts is required before reporting a Brier score.
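
The logistic mapping and the 30-forecast reporting rule described in the note can be sketched as below. The slope (0.08), midpoint (45), and minimum of 30 resolved forecasts come from the note; the function names and the (probability, outcome) pair format are assumptions made for illustration.

```python
import math


def grs_to_probability(grs: float, slope: float = 0.08, midpoint: float = 45.0) -> float:
    """P = 1 / (1 + exp(-slope * (GRS - midpoint)))."""
    return 1.0 / (1.0 + math.exp(-slope * (grs - midpoint)))


def running_brier(resolved, min_resolved: int = 30):
    """Mean squared error over (probability, outcome) pairs.

    Returns None until at least min_resolved forecasts have resolved,
    mirroring the reporting rule described in the note.
    """
    if len(resolved) < min_resolved:
        return None
    return sum((p - o) ** 2 for p, o in resolved) / len(resolved)


p_mid = grs_to_probability(45.0)    # 0.5 at the midpoint
p_high = grs_to_probability(60.0)   # about 0.77
```

With this mapping a GRS-Live of 45 corresponds to a 50% crisis probability, and each 15-point step changes the log-odds by 1.2.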

3. Evidence Classification

Dimension	Retrospective (In-Sample)	Prospective (Out-of-Sample)
Data Source	47 hand-selected geopolitical events, 2008-2024. Same events used for both calibration and evaluation.	Live Brier Tracker: automated GRS-Live snapshots recorded every 6 hours, resolved against observed outcomes after 30-day windows.
Primary Metric	True Positive Rate: 83.0% of events flagged at Elevated+ tier.	Running Brier Score (logistic calibration). Currently accumulating.
What It Proves	Internal consistency: the GRS tier system rank-orders historical crises consistently with observed severity.	Genuine predictive accuracy: whether GRS-derived probabilities, issued before events, match observed crisis rates.
Known Limitations	No holdout set. GRS/85 heuristic normalization produces negative BSS (-53.5%).	Data collection began 2026-03-15. Insufficient resolved forecasts.
Status	Complete	Accumulating