Model Transparency

Validation & Calibration

This page presents quantitative calibration metrics for the ARGOS Geopolitical Risk Score (GRS). All results are derived from a 47-event retrospective calibration set spanning 2008-2024.

Important Epistemic Caveat

These metrics are in-sample retrospective results. The same 47 events were used to calibrate the GRS weighting scheme and are now used to evaluate it. This demonstrates internal consistency, not prospective predictive accuracy. No holdout set or walk-forward out-of-sample protocol has been applied. The metrics answer: "Does the model's scoring system rank historical crises consistently with their observed severity?" - not "Can the model predict future crises?" A formal out-of-sample validation framework is planned for a future release.
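The planned out-of-sample protocol could be sketched roughly as follows. This is an illustrative walk-forward split, not the actual ARGOS pipeline: every name, date, and the fold logic here are assumptions. The property it enforces is the one the current setup lacks: weights are fit only on events that strictly precede every event used for evaluation.

```python
from datetime import date

def walk_forward_splits(events, n_folds=3):
    """Yield (calibration, holdout) pairs in which every calibration
    event strictly precedes every holdout event. Hypothetical sketch,
    not ARGOS code."""
    events = sorted(events, key=lambda e: e["date"])
    fold = len(events) // (n_folds + 1)
    for k in range(1, n_folds + 1):
        yield events[: k * fold], events[k * fold : (k + 1) * fold]

# Illustrative register: one synthetic event per year, 2008-2019.
register = [{"date": date(2008 + i, 6, 1), "grs": 25 + 3 * i} for i in range(12)]
for calib, hold in walk_forward_splits(register):
    # No holdout event may leak into the calibration window.
    assert max(e["date"] for e in calib) < min(e["date"] for e in hold)
```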

Threshold-Based Detection (Primary Strength)

True Positive Rate: 83.0% (39 of 47 events flagged at Elevated+)

False Negative Rate: 17.0% (8 events below the Elevated threshold)

Discrimination Metrics (See Discrimination Tab)

AUC (ROC): 0.880 (acceptable+ discrimination)

PR-AUC: 0.946 (baseline: 80.9%)

F1 Score (GRS >= 45): 48.0% (at the High threshold)
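The 48.0% F1 figure can be reproduced from counts stated elsewhere on this page: the calibration-bin table shows 12 events at High/Critical (GRS >= 45), all 12 of them crises, out of 38 crises total. A minimal check:

```python
# Counts taken from the calibration-bin table on this page.
tp = 12            # events with GRS >= 45, all of which were crises
fp = 0             # no non-crisis events at High/Critical
fn = 38 - tp       # crises that scored below the High threshold

precision = tp / (tp + fp)                          # 1.0
recall = tp / (tp + fn)                             # 12/38
f1 = 2 * precision * recall / (precision + recall)
assert abs(f1 - 0.48) < 1e-9                        # the 48.0% reported above
```

The perfect precision and modest recall illustrate why F1 at the High threshold is low despite strong detection at Elevated+: raising the threshold trades missed crises for fewer false alarms.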

Probabilistic Calibration (Known Limitation)

GRS/85 is a heuristic normalization, not a probability calibration. Dividing the raw GRS (range [-15, +85]) by 85 maps it to a [0, 1] interval for Brier scoring, but this linear rescaling does not produce a calibrated probability of crisis occurrence. The resulting Brier and Log-Loss metrics below should be interpreted as measuring how well this naive mapping performs, not as a reflection of the model's core risk-ranking capability. The Live Brier Tracker tab applies a logistic calibration layer to produce genuine probability forecasts.
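What "heuristic normalization" means here can be made concrete. In the sketch below, only the /85 linear rescale and the Brier formula come from this page; the five GRS values and outcomes are invented for illustration, and the published 0.238 is computed over all 47 real events.

```python
def naive_crisis_prob(grs: float) -> float:
    """Linear rescale of raw GRS (range [-15, +85]) onto [0, 1].
    This is the heuristic mapping described above, not a calibrated
    probability of crisis occurrence."""
    return min(max(grs / 85.0, 0.0), 1.0)   # clip negative GRS to 0

def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Invented example pairs for illustration only.
grs_values = [-10, 20, 40, 55, 70]
outcomes = [0, 0, 1, 1, 1]
score = brier_score([naive_crisis_prob(g) for g in grs_values], outcomes)
```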

Brier Score: 0.238 (baseline: 0.155)

Brier Skill Score: -53.5% (below the base-rate benchmark; see note)

Log-Loss: 0.670 (baseline: 0.488)

Understanding the Negative Brier Skill Score

Why is the BSS negative? The Brier Skill Score compares GRS/85, treated as a probability, against the base rate (81%). Because 81% of events in this set are high-severity crises, simply predicting that rate for every case yields a lower Brier score than GRS/85 does. This does not mean the model lacks discriminative power; it means that dividing GRS by 85 does not produce a well-calibrated probability. The model's value lies in threshold-based risk classification (83.0% TPR at the Elevated threshold), not in raw probability output. A proper calibration layer (e.g., Platt scaling or isotonic regression) applied to GRS could improve the Brier score, but that would require out-of-sample data not yet available.
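The arithmetic behind the negative BSS can be checked directly from the numbers on this page (the small discrepancy with the published -53.5% comes from using the rounded 0.238 rather than the unrounded in-sample score):

```python
p = 38 / 47                      # in-sample base rate of severity-4+ crises
brier_ref = p * (1 - p)          # Brier of always forecasting the base rate
brier_model = 0.238              # published in-sample Brier for GRS/85

bss = 1 - brier_model / brier_ref
assert round(brier_ref, 3) == 0.155   # matches the stated baseline
assert bss < 0                        # roughly -0.54: worse than base rate
```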

What this means in practice: ARGOS is designed as a risk-ranking and threshold-detection system, not as a calibrated probability forecaster. Its primary metric, the True Positive Rate (83.0%), demonstrates that the tier system successfully identifies elevated-risk environments before crises occur. The negative BSS reflects a known gap between risk ranking and probability calibration, not a failure of the underlying model. The Live Brier Tracker tab will accumulate genuine out-of-sample probability forecasts using a logistic calibration layer to address this limitation over time.
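The "logistic calibration layer" mentioned above is, in general form, a Platt-style sigmoid fit. The sketch below shows the shape of such a mapping; the coefficients are purely illustrative assumptions, not the tracker's fitted values, which would be estimated from resolved out-of-sample forecasts.

```python
import math

def logistic_calibration(grs, a, b):
    """Platt-style calibration: map GRS to a probability via
    sigmoid(a * grs + b). Coefficients a and b would be fit on
    out-of-sample forecast/outcome pairs; values here are illustrative."""
    return 1.0 / (1.0 + math.exp(-(a * grs + b)))

a, b = 0.08, -2.5   # hypothetical coefficients for demonstration only
probs = [logistic_calibration(g, a, b) for g in (10, 40, 70)]
assert all(0.0 < p < 1.0 for p in probs)
assert probs[0] < probs[1] < probs[2]   # monotone in GRS, as required
```

Unlike the GRS/85 rescale, this mapping can bend toward the observed base rate, which is exactly what a negative BSS indicates is missing.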

Calibration: Predicted vs Observed Crisis Rates

Events grouped by pre-event GRS tier. A well-calibrated model shows predicted and observed rates that track each other across bins. In-sample retrospective metric.

[Figure: predicted rate (GRS midpoint / 85) vs observed crisis rate (severity ≥ 4) by GRS tier: Low (< 15, n=2), Moderate (15-30, n=6), Elevated (30-45, n=27), High (45-60, n=8), Critical (60+, n=4).]
| GRS Tier | Events | Crises (sev ≥ 4) | Predicted Rate | Observed Rate | Gap |
|---|---|---|---|---|---|
| Low (< 15) | 2 | 1 | 0.0% | 50.0% | 50.0 pp |
| Moderate (15-30) | 6 | 3 | 26.5% | 50.0% | 23.5 pp |
| Elevated (30-45) | 27 | 22 | 44.1% | 81.5% | 37.4 pp |
| High (45-60) | 8 | 8 | 61.8% | 100.0% | 38.2 pp |
| Critical (60+) | 4 | 4 | 85.3% | 100.0% | 14.7 pp |
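The Predicted Rate column follows directly from the tier definitions: each tier's GRS midpoint divided by 85. The Low tier's lower bound (-15) and the Critical tier's upper bound (85) below are taken from the stated GRS range, since the table labels give only "< 15" and "60+":

```python
# Tier bounds inferred from the table labels and the GRS range [-15, +85].
tiers = {
    "Low": (-15, 15),
    "Moderate": (15, 30),
    "Elevated": (30, 45),
    "High": (45, 60),
    "Critical": (60, 85),
}
# Predicted crisis rate = tier midpoint / 85, the heuristic normalization.
predicted = {name: ((lo + hi) / 2) / 85 for name, (lo, hi) in tiers.items()}

assert round(predicted["Low"] * 100, 1) == 0.0
assert round(predicted["Elevated"] * 100, 1) == 44.1
assert round(predicted["Critical"] * 100, 1) == 85.3
```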
Performance by Event Category

How well does GRS flag events across different crisis types? In-sample retrospective metric.

| Category | Events | Flagged | TPR | Avg GRS | Avg Severity |
|---|---|---|---|---|---|
| Conflict | 17 | 17 | 100.0% | 47.4 | 4.6 |
| Economic | 9 | 7 | 77.8% | 36.0 | 4.0 |
| Political | 14 | 10 | 71.4% | 36.0 | 3.6 |
| Social | 7 | 5 | 71.4% | 35.7 | 4.4 |

Validation Evidence Summary

This table summarizes the two distinct categories of validation evidence available for ARGOS. Retrospective and prospective evidence are kept strictly separate to avoid conflating in-sample consistency with out-of-sample predictive accuracy.

| Dimension | Retrospective (In-Sample) | Prospective (Out-of-Sample) |
|---|---|---|
| Data Source | 47 hand-selected geopolitical events, 2008-2024. Same events used for both calibration and evaluation. | Live Brier Tracker: automated GRS-Live snapshots recorded every 6 hours by the scheduler, resolved against observed outcomes after 30-day forecast windows expire. |
| Primary Metric | True Positive Rate: 83.0% of events flagged at Elevated+ tier before occurrence. | Running Brier Score (logistic calibration). Currently accumulating; requires 30+ resolved forecasts for statistical significance. |
| What It Proves | Internal consistency: the GRS tier system rank-orders historical crises in a manner consistent with their observed severity. | Genuine predictive accuracy: whether GRS-derived probabilities, issued before events occur, match observed crisis rates. |
| Known Limitations | No holdout set. Events were used to set model weights. The GRS/85 heuristic normalization produces a negative Brier Skill Score (-53.5%) because GRS is not a calibrated probability. | Data collection began March 2026. Insufficient resolved forecasts for meaningful Brier scoring. Results will be published here as they accumulate. |
| Status | Complete | Accumulating |

The retrospective case studies on the Case Studies page are explicitly labeled as hindsight reconstructions. They demonstrate how the ARGOS framework would have interpreted historical events, not that it predicted them in real time. Only the Live Brier Tracker constitutes forward-looking validation evidence.

Machine-Parsable Validation Data

All metrics, the 47-event register, calibration bins, confidence intervals, benchmarks, and evidence classification are available as structured JSON for independent third-party verification.