A reproducible benchmark of LLMs' native forecasting ability under knowledge-cutoff and temporal-masking controls.
Forecasting = Gathering × Synthesis × Judgment × Decision
The core composite capability driving LLMs toward decision support
Traditional benchmarks ask: “Can you recall the answer?”
OracleProto asks: “Can you predict the future?”
May every forecast be reproducible, and may AI become genuine decision support,
in service of every person's judgments and choices for a good life.
Full Ranking
Per-Model Detail
DeepSeek V3.2 Exp
GLM-5
Qwen3.5 Flash
Kimi K2.5
MiniMax M2.5
Doubao Seed 2.0 Lite
The canonical score per trial is $\mathrm{exam}(\hat S, G) = |\hat S \cap G| / |G|$ when $\hat S \setminus G = \varnothing$, and $0$ otherwise (recall under a zero-false-positive gate). The leaderboard ranks models by this score averaged over $N$ trials within each question, then averaged across questions.
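The scoring rule and the two-level averaging above can be sketched as follows. This is a minimal illustration of the stated formulas, not the benchmark's actual implementation; the function and variable names are hypothetical.

```python
from statistics import mean

def exam_score(pred: set, gold: set) -> float:
    """Per-trial score: |pred ∩ gold| / |gold| if pred ⊆ gold, else 0.0."""
    if not gold:
        raise ValueError("gold set G must be non-empty")
    if pred - gold:            # any false positive voids the whole trial
        return 0.0
    return len(pred & gold) / len(gold)

def leaderboard_score(trials_by_question: dict) -> float:
    """Average over N trials within each question, then across questions."""
    return mean(mean(scores) for scores in trials_by_question.values())
```

For example, `exam_score({"A"}, {"A", "B"})` gives `0.5` (partial recall, no false positives), while `exam_score({"A", "C"}, {"A", "B"})` gives `0.0` because the stray element `"C"` trips the zero-false-positive gate.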
A question $q_i$ is admitted to model $M$ only when $\kappa_M \le \chi_i < \tau_i$, where $\kappa_M$ is the model’s knowledge cutoff and $\tau_i$ the event resolution time. Tool-level temporal masking and content-level leakage detection together suppress retrieval-mediated leakage to roughly $1\%$.
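The admission window and the tool-level temporal mask can be sketched as below. Assumptions are labeled in the comments: the source does not define $\chi_i$, so it is treated here as the question's reference timestamp, and masking retrieved documents at $\chi_i$ is likewise an assumed policy; all names are hypothetical.

```python
from datetime import datetime

def is_admitted(chi_i: datetime, kappa_M: datetime, tau_i: datetime) -> bool:
    """Admit q_i for model M iff kappa_M <= chi_i < tau_i.

    kappa_M: model M's knowledge cutoff
    chi_i:   question reference time (assumption: not defined in the source)
    tau_i:   event resolution time
    """
    return kappa_M <= chi_i < tau_i

def mask_documents(docs: list, chi_i: datetime) -> list:
    """Tool-level temporal mask (assumed policy): drop any retrieved
    document published at or after the question's reference time."""
    return [d for d in docs if d["published_at"] < chi_i]
```

Content-level leakage detection would run on top of this as a second filter over the documents that survive the mask.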
$\mathrm{Pass}^{\mathrm{any}}@N$ and $\mathrm{Pass}^{\mathrm{all}}@N$ characterise sampling stability across $N$ trials. Cohen's $\kappa$ and Fleiss' $\kappa$ give chance-corrected strict accuracy. FSS is a Tversky skill score with $(\alpha,\beta)=(2.0, 0.5)$. Per-correct cost $C^{\mathrm{per\text{-}correct}}_m = C^{\mathrm{total}}_m / (|\mathcal{D}^{\mathrm{eval}}| \cdot N \cdot \mathrm{CA}_m)$ amortises total spend over the difficulty-weighted notional count of correct samples.
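The stability, skill, and cost metrics above can be sketched as follows. One assumption is flagged in a comment: the source does not say whether $\alpha$ weights false positives or false negatives in the Tversky index, so the conventional mapping ($\alpha$ on false positives) is used; all function names are hypothetical.

```python
def pass_any_at_n(scores: list, thresh: float = 1.0) -> bool:
    """Pass^any@N: at least one of the N trials is fully correct."""
    return any(s >= thresh for s in scores)

def pass_all_at_n(scores: list, thresh: float = 1.0) -> bool:
    """Pass^all@N: every one of the N trials is fully correct."""
    return all(s >= thresh for s in scores)

def tversky_skill(tp: int, fp: int, fn: int,
                  alpha: float = 2.0, beta: float = 0.5) -> float:
    """Tversky index TP / (TP + alpha*FP + beta*FN).

    Assumption: alpha penalises false positives, beta false negatives;
    the source only states (alpha, beta) = (2.0, 0.5)."""
    return tp / (tp + alpha * fp + beta * fn)

def per_correct_cost(total_cost: float, n_questions: int,
                     n_trials: int, ca: float) -> float:
    """C_total / (|D_eval| * N * CA): cost amortised over the
    notional count of correct samples."""
    return total_cost / (n_questions * n_trials * ca)
```

With $(\alpha,\beta)=(2.0,0.5)$ a false positive costs four times as much skill as a false negative, which matches the zero-false-positive spirit of the per-trial exam gate.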
