OracleProto

Reproducible benchmark of LLM native forecasting under knowledge-cutoff and temporal-masking control.

Forecasting = Gathering × Synthesis × Judgment × Decision

The core composite capability driving LLMs toward decision support

Traditional benchmarks ask: “Can you recall the answer?”

OracleProto asks: “Can you predict the future?”

May every forecast be reproducible, and may AI truly become decision support

In service of every person’s judgments and choices for a good life

| Rank | Model | Knowledge Cutoff | Exam Score (/ 100) | Pass@1 | Pass^any@N | Pass^all@N | Majority Vote |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | DeepSeek V3.2 Exp | 2025-09-29 | 59.03 | 57.56 | 80.00 | 35.00 | 67.16 |
| 2 | GLM-5 | 2026-02-11 | 58.00 | 56.96 | 76.25 | 36.25 | 62.50 |
| 3 | Qwen3.5 Flash | 2026-02-25 | 57.37 | 55.65 | 75.00 | 38.75 | 60.87 |

Full Ranking

| Rank | Model | Knowledge Cutoff | Exam Score | Pass@1 | Majority Vote | Cost / Correct (USD) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | DeepSeek V3.2 Exp | 2025-09-29 | 59.03 | 57.56 | 67.16 | $0.025 |
| 2 | GLM-5 | 2026-02-11 | 58.00 | 56.96 | 62.50 | $0.048 |
| 3 | Qwen3.5 Flash | 2026-02-25 | 57.37 | 55.65 | 60.87 | $0.003 |
| 4 | Kimi K2.5 | 2026-01-27 | 56.69 | 56.12 | 66.18 | $0.049 |
| 5 | MiniMax M2.5 | 2026-02-12 | 54.05 | 53.14 | 63.01 | $0.024 |
| 6 | Doubao Seed 2.0 Lite | 2026-03-10 | 51.76 | 50.56 | 57.41 | $0.006 |

Per-Model Detail

Discrimination (↑, higher is better): Pass@1, Pass^any@N, Pass^all@N, FSS. Agreement (↑): Cohen κ, Fleiss κ. Cost (↓, lower is better): Total Cost, Cost / Correct.

| Rank | Model | Exam Score | Pass@1 | Pass^any@N | Pass^all@N | FSS | Cohen κ | Fleiss κ | Total Cost (USD) | Cost / Correct (USD) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 01 | DeepSeek V3.2 Exp | 59.03 | 57.56 | 80.00 | 35.00 | 37.58 | 0.310 | 0.345 | $3.60 | $0.025 |
| 02 | GLM-5 | 58.00 | 56.96 | 76.25 | 36.25 | 36.27 | 0.298 | 0.429 | $7.06 | $0.048 |
| 03 | Qwen3.5 Flash | 57.37 | 55.65 | 75.00 | 38.75 | 34.33 | 0.278 | 0.452 | $0.45 | $0.003 |
| 04 | Kimi K2.5 | 56.69 | 56.12 | 80.00 | 30.00 | 33.15 | 0.285 | 0.297 | $6.79 | $0.049 |
| 05 | MiniMax M2.5 | 54.05 | 53.14 | 68.75 | 32.50 | 29.97 | 0.239 | 0.387 | $3.21 | $0.024 |
| 06 | Doubao Seed 2.0 Lite | 51.76 | 50.56 | 70.00 | 28.33 | 23.49 | 0.185 | 0.420 | $0.89 | $0.006 |

Exam Score

The canonical score for a single trial is $\mathrm{exam}(\hat S, G) = |\hat S \cap G| / |G|$ when $\hat S \setminus G = \varnothing$, and $0$ otherwise: recall of the gold set $G$ by the predicted set $\hat S$, gated to zero by any false positive. The leaderboard sorts by this score averaged first over the $N$ trials of each question and then over each model's questions.
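The gated-recall rule can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: the function names, the dictionary layout, and the scaling to a 0-100 range are assumptions, and the aggregation order (trials within a question first, then questions) is one reading of "question-then-model average".

```python
from statistics import mean


def exam_score(predicted: set, gold: set) -> float:
    """Recall of the gold set, forced to 0 if any predicted item is not in it."""
    if not gold:
        raise ValueError("gold set must be non-empty")
    if predicted - gold:  # any false positive zeroes the whole trial
        return 0.0
    return len(predicted & gold) / len(gold)


def model_exam_score(trials_by_question: dict) -> float:
    """Average over trials within each question, then over questions (0-100).

    `trials_by_question` maps a hypothetical question id to a pair
    (gold_set, [predicted_set_for_trial_1, ..., predicted_set_for_trial_N]).
    """
    per_question = [
        mean(exam_score(pred, gold) for pred in preds)
        for gold, preds in trials_by_question.values()
    ]
    return 100 * mean(per_question)


# Partial answer with no false positive earns partial credit;
# one wrong option wipes out the trial.
partial = exam_score({"A"}, {"A", "B"})       # 0.5
gated = exam_score({"A", "C"}, {"A", "B"})    # 0.0
```

The gate makes over-selecting strictly worse than under-selecting: a model that hedges by adding an extra option loses the trial, while one that names only part of the gold set still keeps its recall.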

Knowledge Cutoff & Temporal Masking

A question $q_i$ is admitted to model $M$ only when $\kappa_M \le \chi_i < \tau_i$, where $\kappa_M$ is the model’s knowledge cutoff and $\tau_i$ the event resolution time. Tool-level temporal masking and content-level leakage detection together suppress retrieval-mediated leakage to roughly $1\%$.
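As a rough sketch, the admission rule is a half-open interval test on dates. The function and variable names below are hypothetical, the question dates are invented for illustration, and $\chi_i$ is treated simply as the question-level timestamp that appears in the inequality.

```python
from datetime import date


def is_admitted(kappa_m: date, chi_i: date, tau_i: date) -> bool:
    """True when kappa_M <= chi_i < tau_i, the admission condition above."""
    return kappa_m <= chi_i < tau_i


# Hypothetical question timestamps; the model cutoffs are from the leaderboard.
chi, tau = date(2025, 10, 15), date(2025, 11, 1)

cutoffs = {
    "DeepSeek V3.2 Exp": date(2025, 9, 29),
    "GLM-5": date(2026, 2, 11),
}
admitted = {name: is_admitted(k, chi, tau) for name, k in cutoffs.items()}
# The question is admitted for the 2025-09-29 cutoff but not for the
# 2026-02-11 cutoff, which falls after chi.
```

Because the test is applied per model, models with later cutoffs can end up evaluated on a smaller subset of questions.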

Companion Metrics & Cost

$\mathrm{Pass}^{\mathrm{any}}@N$ and $\mathrm{Pass}^{\mathrm{all}}@N$ characterise sampling stability across $N$ trials. Cohen’s $\kappa$ and Fleiss’ $\kappa$ chance-correct strict accuracy. FSS is a Tversky skill score with $(\alpha,\beta)=(2.0, 0.5)$. Per-correct cost $C^{\mathrm{per\text{-}correct}}_m = C^{\mathrm{total}}_m / (|\mathcal{D}^{\mathrm{eval}}| \cdot N \cdot \mathrm{CA}_m)$ amortises the total invoice over the difficulty-weighted notional correct-sample count.
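The stability metrics and the per-correct cost can be illustrated as follows. This is a sketch under stated assumptions: a trial counts as "correct" via a boolean flag supplied by the caller, $\mathrm{CA}_m$ is taken as a fraction in $(0, 1]$, and all function names and input numbers are hypothetical rather than taken from the leaderboard.

```python
def pass_any_all(correct_by_question: list[list[bool]]) -> tuple[float, float]:
    """Return (Pass^any@N, Pass^all@N) as percentages.

    Each inner list holds the N per-trial correctness flags of one question.
    """
    n_questions = len(correct_by_question)
    n_any = sum(any(trials) for trials in correct_by_question)
    n_all = sum(all(trials) for trials in correct_by_question)
    return 100 * n_any / n_questions, 100 * n_all / n_questions


def per_correct_cost(total_cost: float, n_questions: int,
                     n_trials: int, ca: float) -> float:
    """C_total / (|D_eval| * N * CA_m), following the formula above."""
    if ca <= 0:
        raise ValueError("CA_m must be positive")
    return total_cost / (n_questions * n_trials * ca)


# Four hypothetical questions, two trials each.
flags = [[True, False], [False, False], [True, True], [False, True]]
pass_any, pass_all = pass_any_all(flags)   # 75.0 and 25.0

# Hypothetical invoice: $10 over 50 questions x 4 trials at CA_m = 0.5.
cost = per_correct_cost(10.0, 50, 4, 0.5)  # 0.1 USD per notional correct sample
```

The gap between the two pass rates is what signals instability: a wide gap means many questions are answered correctly in some trials but not in others.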