How well do our predictions match reality? The table below reports calibration and discrimination metrics for each domain's resolved forecasts.
| Domain | Brier Score | Log Loss | AUC-ROC | N |
|---|---|---|---|---|
| Caucasus | 0.037 | 0.195 | 0.78 | 48 |
| Iran | 0.082 | 0.311 | 0.58 | 48 |
| Israel-Lebanon | 0.067 | 0.273 | 0.92 | 48 |
| Korean Peninsula | 0.019 | 0.145 | 0.50 | 48 |
| Red Sea | 0.015 | 0.127 | 0.50 | 48 |
| Sahel | 0.040 | 0.209 | 0.35 | 48 |
| South China Sea | 0.083 | 0.314 | 0.75 | 48 |
| Taiwan Strait | 0.019 | 0.144 | 0.50 | 48 |
| Ukraine | 0.065 | 0.265 | 0.83 | 48 |
| Venezuela | 0.068 | 0.278 | 0.70 | 48 |
**Brier score.** Measures the mean squared difference between predicted probabilities and binary outcomes, ranging from 0 (perfect) to 1 (worst). A climatological baseline that always predicts the base rate p scores p(1 − p), which is at most 0.25 (at p = 0.5). Scores below 0.1 generally indicate strong performance, though in domains where events are rare, a low base rate alone can drive the Brier score down.
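A minimal sketch of the computation (the function name and sample values are illustrative, not taken from the scoring pipeline):

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities
    and binary (0/1) resolved outcomes."""
    p = np.asarray(probs, dtype=float)
    y = np.asarray(outcomes, dtype=float)
    return float(np.mean((p - y) ** 2))

# Three hypothetical forecasts: 0.9 resolves yes, 0.2 resolves no,
# 0.6 resolves yes -> (0.01 + 0.04 + 0.16) / 3 = 0.07.
print(brier_score([0.9, 0.2, 0.6], [1, 0, 1]))  # ~0.07
```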
**Log loss.** Penalizes confident wrong predictions more heavily than the Brier score: a prediction of 95% for an event that does not occur is punished severely. Lower values indicate better probabilistic accuracy.
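A corresponding sketch. The clipping constant is an assumption for numerical safety, not a detail from the pipeline; it keeps a probability of exactly 0 or 1 from producing an infinite penalty:

```python
import numpy as np

def log_loss(probs, outcomes, eps=1e-15):
    """Negative mean log-likelihood of the resolved outcomes."""
    # Clip probabilities away from exactly 0 and 1 so a maximally
    # confident miss is penalized severely but finitely.
    p = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    y = np.asarray(outcomes, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# A 95% forecast that resolves "no" contributes -ln(0.05) ≈ 3.00,
# versus -ln(0.5) ≈ 0.69 for a maximally uncertain 50% forecast.
```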
**AUC-ROC.** Area Under the Receiver Operating Characteristic curve. Measures the model's ability to rank positive outcomes above negative ones, independent of any probability threshold. 0.5 corresponds to random guessing and 1.0 to perfect discrimination; values below 0.5, such as the Sahel's 0.35, indicate rankings systematically worse than chance.
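AUC is equivalent to the probability that a randomly chosen positive outcome receives a higher forecast than a randomly chosen negative one, so it can be computed from pairwise comparisons without constructing the curve. A sketch (the O(n²) pairwise comparison is fine at 48 forecasts per domain):

```python
import numpy as np

def auc_roc(scores, outcomes):
    """AUC as the Mann-Whitney rank statistic:
    P(positive score > negative score), with ties counted as 0.5."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(outcomes, dtype=int)
    pos, neg = s[y == 1], s[y == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC is undefined without both outcome classes")
    wins = (pos[:, None] > neg[None, :]).sum()   # positive ranked higher
    ties = (pos[:, None] == neg[None, :]).sum()  # tied scores count half
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))
```

Note that AUC is undefined for a domain in which every forecast resolved the same way; some scoring setups fall back to reporting 0.5 in that case, which is one possible (unconfirmed) reading of the exact 0.50 entries in the table above.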
**N.** The number of forecasts that have reached their resolution date and been scored against ground truth. Calibration metrics are only meaningful with a sufficient sample size; N > 30 per domain is a common rule of thumb.