Reproducible benchmark of LLM native forecasting under knowledge-cutoff and temporal-masking control.
Forecasting = Gathering × Synthesis × Judgment × Decision
The core composite capability driving LLMs toward decision support
Traditional benchmarks ask: “Can you recall the answer?”
OracleProto asks: “Can you predict the future?”
May every forecast be reproducible, and may AI truly become decision support
In service of each person's judgments and choices for a good life
Claude Opus 4.6
GPT 5.4 Thinking (High)
Gemini 3.1 Pro Preview
Full Ranking
Per-Model Detail
GPT 5.4 Thinking (High)
Claude Opus 4.6
Gemini 3.1 Pro Preview
Claude Sonnet 4.6
DeepSeek V3.2 Exp
GLM-5
Qwen3.5 Flash
GPT 5.4
Kimi K2.5
MiniMax M2.5
Gemini 3.1 Flash Lite Preview
Qwen3.5 35B A3B
Doubao Seed 2.0 Lite
GPT 5.3 Codex
Grok 4.1 Fast Reasoning
The canonical score per trial is $\mathrm{exam}(\hat S, G) = |\hat S \cap G| / |G|$ when $\hat S \setminus G = \varnothing$, and $0$ otherwise (recall under a zero-false-positive gate: a single prediction outside the gold set zeroes the trial). The leaderboard sorts models by this score, averaged first over each question's $N$ trials and then across questions.
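A minimal sketch of this gated-recall score in Python (the function name and set-based signature are illustrative, not the benchmark's actual code):

```python
def exam_score(pred: set, gold: set) -> float:
    """Recall under a zero-false-positive gate: any predicted item
    outside the gold set G zeroes the whole trial; otherwise the
    score is |pred ∩ gold| / |gold|."""
    if pred - gold:  # at least one false positive -> gate trips
        return 0.0
    return len(pred & gold) / len(gold)
```

So predicting a strict subset of the gold set earns partial credit, while one wrong item forfeits the trial entirely.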
A question $q_i$ is admitted to model $M$ only when $\kappa_M \le \chi_i < \tau_i$, where $\kappa_M$ is the model’s knowledge cutoff and $\tau_i$ the event resolution time. Tool-level temporal masking and content-level leakage detection together suppress retrieval-mediated leakage to roughly $1\%$.
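The admission rule can be sketched as a timestamp comparison (a hypothetical helper; reading $\chi_i$ as the question's grounding timestamp is an assumption from the inequality, and the leakage-suppression machinery is not modelled here):

```python
from datetime import datetime

def admissible(kappa_m: datetime, chi_i: datetime, tau_i: datetime) -> bool:
    """Question i is posed to model M only when its grounding time chi_i
    falls at or after M's knowledge cutoff kappa_m (so the answer cannot
    be memorised) and strictly before the resolution time tau_i (so the
    outcome was still open when the question was grounded)."""
    return kappa_m <= chi_i < tau_i
```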
$\mathrm{Pass}^{\mathrm{any}}@N$ and $\mathrm{Pass}^{\mathrm{all}}@N$ characterise sampling stability across $N$ trials. Cohen's $\kappa$ and Fleiss' $\kappa$ provide chance-corrected measures of strict accuracy. FSS is a Tversky skill score with $(\alpha,\beta)=(2.0, 0.5)$. Per-correct cost $C^{\mathrm{per\text{-}correct}}_m = C^{\mathrm{total}}_m / (|\mathcal{D}^{\mathrm{eval}}| \cdot N \cdot \mathrm{CA}_m)$ amortises model $m$'s total invoice over the difficulty-weighted notional correct-sample count.
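The stability and cost metrics reduce to short expressions; a sketch under the assumption that a trial is recorded as a boolean success and $\mathrm{CA}_m$ is model $m$'s per-sample correctness rate (names here are illustrative):

```python
def pass_any_at_n(trials: list[bool]) -> float:
    """Pass^any@N: 1.0 if at least one of the N trials succeeds."""
    return float(any(trials))

def pass_all_at_n(trials: list[bool]) -> float:
    """Pass^all@N: 1.0 only if every one of the N trials succeeds."""
    return float(all(trials))

def per_correct_cost(total_cost: float, n_questions: int,
                     n_trials: int, ca: float) -> float:
    """C^per-correct = C^total / (|D_eval| * N * CA): the total invoice
    amortised over the notional count of correct samples."""
    return total_cost / (n_questions * n_trials * ca)
```

A wide gap between $\mathrm{Pass}^{\mathrm{any}}@N$ and $\mathrm{Pass}^{\mathrm{all}}@N$ flags a model whose forecasts are unstable across resamples.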
