OracleProto
A reproducible benchmark of native LLM forecasting under knowledge-cutoff and temporal-masking controls.
Full Ranking
Ordered by Exam Score by default. Click a column header to re-sort. All 6 models share the same 80-question subset and 3 parallel trials per question.
Per-Model Detail
Each model's results are broken into three panels: discrimination (how often it gets the answer right), agreement (how stable its $k=3$ trials are after chance correction), and efficiency (resource cost per answer).
DeepSeek V3.2 (Exp)
GLM-5
Qwen3.5 Flash
Kimi K2.5
MiniMax M2.5
Doubao Seed 2.0 Lite
The canonical score per trial is $\mathrm{exam}(\hat S, G) = |\hat S \cap G| / |G|$ when $\hat S \setminus G = \varnothing$, and $0$ otherwise: recall under a zero-false-positive gate. The leaderboard sorts by this score averaged over the $N$ trials of each question, then across questions for each model.
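The gated score can be sketched as follows (a minimal illustration; the function name and set representation are assumptions, not the benchmark's actual code):

```python
def exam_score(predicted: set, gold: set) -> float:
    """Recall |S ∩ G| / |G|, gated to zero by any false positive."""
    if not gold:                 # guard: undefined recall on an empty gold set
        return 0.0
    if predicted - gold:         # S \ G nonempty: a false positive zeroes the trial
        return 0.0
    return len(predicted & gold) / len(gold)

print(exam_score({"A", "B"}, {"A", "B", "C"}))  # partial recall, no false positives
print(exam_score({"A", "D"}, {"A", "B", "C"}))  # "D" is a false positive, score 0
```

Note the asymmetry: missing a gold item only reduces recall, but a single spurious item forfeits the whole trial.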
A question $q_i$ is admitted to model $M$ only when $\kappa_M \le \chi_i < \tau_i$, where $\kappa_M$ is the model’s knowledge cutoff and $\tau_i$ the event resolution time. Tool-level temporal masking and content-level leakage detection together suppress retrieval-mediated leakage to roughly $1\%$.
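The admission window reduces to a single chained comparison. In this sketch, `chi_i` stands for the question's reference timestamp $\chi_i$ (the text leaves its exact definition implicit, so the parameter name and semantics here are assumptions):

```python
from datetime import datetime

def admits(kappa_m: datetime, chi_i: datetime, tau_i: datetime) -> bool:
    """Admit question i to model M iff kappa_M <= chi_i < tau_i.

    kappa_m: model M's knowledge cutoff
    chi_i:   question i's reference timestamp (assumed meaning)
    tau_i:   event resolution time for question i
    """
    return kappa_m <= chi_i < tau_i

# A question dated after the cutoff but before resolution is admitted;
# one dated before the cutoff is excluded (the model may already know the answer).
print(admits(datetime(2024, 6, 1), datetime(2024, 9, 1), datetime(2025, 1, 1)))
print(admits(datetime(2024, 6, 1), datetime(2024, 5, 1), datetime(2025, 1, 1)))
```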
$\mathrm{Pass}^{\mathrm{any}}@N$ and $\mathrm{Pass}^{\mathrm{all}}@N$ characterise sampling stability across $N$ trials. Cohen’s $\kappa$ and Fleiss’ $\kappa$ provide chance-corrected agreement on strict accuracy. FSS is a Tversky skill score with $(\alpha,\beta)=(2.0, 0.5)$, penalising false positives more heavily than misses. Per-correct cost $C^{\mathrm{per\text{-}correct}}_m = C^{\mathrm{total}}_m / (|\mathcal{D}^{\mathrm{eval}}| \cdot N \cdot \mathrm{CA}_m)$ amortises the OpenRouter invoice over the difficulty-weighted notional correct-sample count.
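These metrics can be sketched in a few lines. Assumptions here: FSS is taken to be the standard Tversky index $\mathrm{TP}/(\mathrm{TP} + \alpha\,\mathrm{FP} + \beta\,\mathrm{FN})$, and $\mathrm{CA}_m$ is treated as a corrected-accuracy fraction; all function names are illustrative:

```python
def pass_any_at_n(correct: list[bool]) -> bool:
    """Pass^any@N: at least one of the N trials is strictly correct."""
    return any(correct)

def pass_all_at_n(correct: list[bool]) -> bool:
    """Pass^all@N: every one of the N trials is strictly correct."""
    return all(correct)

def tversky_skill(pred: set, gold: set, alpha: float = 2.0, beta: float = 0.5) -> float:
    """Tversky index TP / (TP + alpha*FP + beta*FN). With alpha > beta,
    false positives cost more than misses, echoing the zero-FP exam gate."""
    tp = len(pred & gold)
    fp = len(pred - gold)
    fn = len(gold - pred)
    denom = tp + alpha * fp + beta * fn
    return tp / denom if denom else 0.0

def per_correct_cost(total_cost: float, n_questions: int, n_trials: int, ca: float) -> float:
    """C_total / (|D_eval| * N * CA_m): invoice amortised over correct samples."""
    return total_cost / (n_questions * n_trials * ca)

print(pass_any_at_n([False, True, False]))          # one good trial suffices
print(tversky_skill({"A", "B"}, {"A", "B", "C"}))   # one miss, no false positives
```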