Reproducible benchmark of LLM native forecasting under knowledge-cutoff and temporal-masking control.
Forecasting = Gathering × Synthesis × Judgment × Decision
The core composite capability driving LLMs toward decision support
Traditional benchmarks ask: “Can you recall the answer?”
OracleProto asks: “Can you predict the future?”
May every forecast be reproducible, and may AI become genuine decision support,
in service of each person's judgments and choices toward a good life
GPT-5.4 Thinking (High)
Claude Opus 4.6
Gemini 3.1 Pro Preview
Full Ranking
Per-Model Detail
Claude Opus 4.6
GPT-5.4 Thinking (High)
Gemini 3.1 Pro Preview
DeepSeek V3.2 Exp
GLM-5
Doubao Seed 2.0 Lite
GPT-5.4
Qwen3.5 Flash
Claude Sonnet 4.6
gpt-oss-120b
Gemini 3.1 Flash Lite Preview
Kimi K2.5
Qwen3.5 35B A3B
MiniMax M2.5
GPT-5.3 Codex
Qwen3.5 Plus
Grok 4.1 Fast Reasoning
DeepSeek V3.2 Exp (Reasoning)
Per trial, each question is scored as $\mathrm{exam}(\hat S, G) = |\hat S \cap G| / |G|$ when $\hat S \setminus G = \varnothing$, and $0$ otherwise: recall under a zero-false-positive gate. The leaderboard ranks by a composite over four canonical question types (Yes/No, named-entity, single-choice, and multi-choice), averaged across $N=3$ trials.
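The gated-recall rule above can be sketched in a few lines; the function name and set representation are illustrative, not the benchmark's actual implementation.

```python
def exam(pred: set, gold: set) -> float:
    """Recall under a zero-false-positive gate.

    Any predicted item outside the gold set voids the trial;
    otherwise the score is the fraction of gold items recovered.
    """
    if pred - gold:  # at least one false positive -> score 0
        return 0.0
    return len(pred & gold) / len(gold)

# Multi-choice: partial recall with no false positives earns partial credit.
print(exam({"a", "b"}, {"a", "b", "c"}))  # 2/3
# One spurious answer zeroes the trial, regardless of the hits.
print(exam({"a", "d"}, {"a", "b", "c"}))  # 0.0
```

Note that Yes/No and single-choice questions reduce to $|G| = 1$, so under this rule their per-trial score is always exactly $0$ or $1$.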
A question $q_i$ is admitted to model $M$ only when $\kappa_M \le \chi_i < \tau_i$, where $\kappa_M$ is the model’s knowledge cutoff and $\tau_i$ the event resolution time. We take $\kappa_M$ from the vendor’s officially published cutoff when one is disclosed, and fall back to the model’s public release date otherwise. Tool-level temporal masking and content-level leakage detection together suppress retrieval-mediated leakage to roughly $1\%$.
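The admission window $\kappa_M \le \chi_i < \tau_i$ is a simple timestamp comparison; a minimal sketch, assuming questions carry $\chi_i$ and $\tau_i$ as datetimes (the `Question` type and field names here are hypothetical):

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Question:
    chi: datetime  # question reference time chi_i
    tau: datetime  # event resolution time tau_i


def admitted(q: Question, kappa_model: datetime) -> bool:
    """Admit q to a model only when kappa_M <= chi_i < tau_i.

    kappa_model is the vendor-disclosed knowledge cutoff, or the
    public release date when no cutoff is disclosed.
    """
    return kappa_model <= q.chi < q.tau
```

The strict inequality $\chi_i < \tau_i$ ensures the model is asked before the event resolves, while $\kappa_M \le \chi_i$ ensures the question postdates everything the model could have memorised.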
$\mathrm{Pass}^{\mathrm{any}}@N$ and $\mathrm{Pass}^{\mathrm{all}}@N$ characterise sampling stability across $N$ trials. Cohen's $\kappa$ and Fleiss' $\kappa$ correct strict accuracy for chance agreement. FSS is a Tversky skill score with $(\alpha,\beta)=(2.0, 0.5)$. Per-correct cost $C^{\mathrm{per\text{-}correct}}_m = C^{\mathrm{total}}_m / (|\mathcal{D}^{\mathrm{eval}}| \cdot N \cdot \mathrm{CA}_m)$ amortises the total invoice over the difficulty-weighted notional count of correct samples.
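A sketch of these secondary metrics under stated assumptions: "strict" correctness is taken as a per-trial score of exactly $1.0$, and the Tversky orientation assumed here weights misses by $\alpha$ and false alarms by $\beta$ (the benchmark may assign the weights the other way round). All function names are illustrative.

```python
def pass_any_at_n(scores: list) -> bool:
    """Pass^any@N: strictly correct in at least one of N trials."""
    return any(s == 1.0 for s in scores)


def pass_all_at_n(scores: list) -> bool:
    """Pass^all@N: strictly correct in every one of N trials."""
    return all(s == 1.0 for s in scores)


def tversky_fss(tp: float, fn: float, fp: float,
                alpha: float = 2.0, beta: float = 0.5) -> float:
    """Tversky skill score; assumed orientation: alpha on FN, beta on FP."""
    return tp / (tp + alpha * fn + beta * fp)


def per_correct_cost(total_cost: float, n_questions: int,
                     n_trials: int, ca: float) -> float:
    """C^per-correct = C^total / (|D_eval| * N * CA_m)."""
    return total_cost / (n_questions * n_trials * ca)
```

For example, a model billed \$120 over 100 questions and 3 trials at $\mathrm{CA}_m = 0.4$ costs $120 / (100 \cdot 3 \cdot 0.4) = \$1$ per notional correct sample, so a cheap but inaccurate model can still lose on this axis.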
