OracleProto

A reproducible benchmark of native LLM forecasting under knowledge-cutoff and temporal-masking control.

Forecasting = Gathering × Synthesis × Judgment × Decision

The core composite capability driving LLMs toward decision support

Traditional benchmarks ask: “Can you recall the answer?”

OracleProto asks: “Can you predict the future?”

May every forecast be reproducible, and may AI truly become decision support,

in service of every person’s judgments and choices for a good life.

| Rank | Model | Knowledge Cutoff | Exam Score (/ 100) | Pass@1 | $\mathrm{Pass}^{\mathrm{any}}@N$ | $\mathrm{Pass}^{\mathrm{all}}@N$ | Majority Vote |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 2026-02-19 | 61.86 | 61.25 | 73.75 | 46.25 | 64.56 |
| 2 | Claude 4.6 Sonnet | 2026-02-17 | 59.98 | 59.17 | 68.75 | 50.00 | 66.20 |
| 3 | DeepSeek V3.2 Exp | 2025-09-29 | 59.03 | 57.56 | 80.00 | 35.00 | 67.16 |

Full Ranking

| Rank | Model | Knowledge Cutoff | Exam Score | Pass@1 | Majority Vote | Cost / |
|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 2026-02-19 | 61.86 | 61.25 | 64.56 | $0.247 |
| 2 | Claude 4.6 Sonnet | 2026-02-17 | 59.98 | 59.17 | 66.20 | $0.239 |
| 3 | DeepSeek V3.2 Exp | 2025-09-29 | 59.03 | 57.56 | 67.16 | $0.025 |
| 4 | GLM-5 | 2026-02-11 | 58.00 | 56.96 | 62.50 | $0.048 |
| 5 | Qwen3.5 Flash | 2026-02-25 | 57.37 | 55.65 | 60.87 | $0.003 |
| 6 | GPT 5.4 | 2026-03-05 | 57.30 | 56.25 | 60.81 | $0.139 |
| 7 | Kimi K2.5 | 2026-01-27 | 56.69 | 56.12 | 66.18 | $0.049 |
| 8 | MiniMax M2.5 | 2026-02-12 | 54.05 | 53.14 | 63.01 | $0.024 |
| 9 | Doubao Seed 2.0 Lite | 2026-03-10 | 51.76 | 50.56 | 57.41 | $0.006 |

Per-Model Detail

Discrimination ↑ covers Pass@1, $\mathrm{Pass}^{\mathrm{any}}@N$, $\mathrm{Pass}^{\mathrm{all}}@N$, and FSS. Agreement ↑ covers Cohen κ and Fleiss κ. Costs are in USD (↓).

| Rank | Model | Exam Score | Pass@1 | $\mathrm{Pass}^{\mathrm{any}}@N$ | $\mathrm{Pass}^{\mathrm{all}}@N$ | FSS | Cohen κ | Fleiss κ | Total Cost (USD) ↓ | Cost / (USD) ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| 01 | Gemini 3.1 Pro Preview | 61.86 | 61.25 | 73.75 | 46.25 | 41.11 | 0.370 | 0.553 | 35.90 | 0.247 |
| 02 | Claude 4.6 Sonnet | 59.98 | 59.17 | 68.75 | 50.00 | 39.73 | 0.337 | 0.683 | 35.80 | 0.239 |
| 03 | DeepSeek V3.2 Exp | 59.03 | 57.56 | 80.00 | 35.00 | 37.58 | 0.310 | 0.345 | 3.60 | 0.025 |
| 04 | GLM-5 | 58.00 | 56.96 | 76.25 | 36.25 | 36.27 | 0.298 | 0.429 | 7.06 | 0.048 |
| 05 | Qwen3.5 Flash | 57.37 | 55.65 | 75.00 | 38.75 | 34.33 | 0.278 | 0.452 | 0.45 | 0.003 |
| 06 | GPT 5.4 | 57.30 | 56.25 | 73.75 | 38.75 | 36.25 | 0.289 | 0.504 | 18.71 | 0.139 |
| 07 | Kimi K2.5 | 56.69 | 56.12 | 80.00 | 30.00 | 33.15 | 0.285 | 0.297 | 6.79 | 0.049 |
| 08 | MiniMax M2.5 | 54.05 | 53.14 | 68.75 | 32.50 | 29.97 | 0.239 | 0.387 | 3.21 | 0.024 |
| 09 | Doubao Seed 2.0 Lite | 51.76 | 50.56 | 70.00 | 28.33 | 23.49 | 0.185 | 0.420 | 0.89 | 0.006 |

Exam Score

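The canonical per-trial score compares the predicted answer set $\hat S$ with the ground-truth set $G$: $\mathrm{exam}(\hat S, G) = |\hat S \cap G| / |G|$ when $\hat S \setminus G = \varnothing$, and $0$ otherwise. It is recall under a zero-false-positive gate, so a single wrong selection zeroes the trial. The leaderboard sorts by this score averaged first per question over its $N$ trials and then per model over its questions.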
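The gated-recall rule can be sketched in a few lines of Python. This is a minimal illustration rather than the benchmark's own code: the function names, the use of string labels for answer options, and the scaling to a 0 to 100 range (suggested by the "/ 100" on the leaderboard) are assumptions.

```python
from typing import Sequence, Set


def exam_score(predicted: Set[str], gold: Set[str]) -> float:
    """Recall of the gold set, gated to 0 when any prediction falls outside it."""
    if not gold:
        raise ValueError("gold set must be non-empty")
    if predicted - gold:  # at least one false positive zeroes the trial
        return 0.0
    return len(predicted & gold) / len(gold)


def model_exam_score(
    trials_per_question: Sequence[Sequence[Set[str]]],
    gold_sets: Sequence[Set[str]],
) -> float:
    """Average over the trials of each question, then over questions.

    The 0-100 scaling is an assumption based on the leaderboard display.
    """
    per_question = [
        sum(exam_score(pred, gold) for pred in trials) / len(trials)
        for trials, gold in zip(trials_per_question, gold_sets)
    ]
    return 100.0 * sum(per_question) / len(per_question)
```

For example, predicting `{"A"}` against a gold set `{"A", "B"}` earns partial credit of 0.5, while predicting `{"A", "C"}` earns 0 because `"C"` is a false positive.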

Knowledge Cutoff & Temporal Masking

A question $q_i$ is admitted to model $M$ only when $\kappa_M \le \chi_i < \tau_i$, where $\kappa_M$ is the model’s knowledge cutoff and $\tau_i$ the event resolution time. Tool-level temporal masking and content-level leakage detection together suppress retrieval-mediated leakage to roughly $1\%$.
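The admission rule is a half-open interval check, which the following sketch makes explicit. The `Question` class and its field names are hypothetical; this section does not define $\chi_i$, so the code treats it only as a question-level timestamp named `chi`, and using calendar dates (rather than finer-grained timestamps) is likewise an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Question:
    """Hypothetical container for one benchmark question."""
    qid: str
    chi: date         # question-level timestamp chi_i (not defined in this section)
    resolution: date  # event resolution time tau_i


def is_admissible(model_cutoff: date, question: Question) -> bool:
    """True when kappa_M <= chi_i < tau_i."""
    return model_cutoff <= question.chi < question.resolution


def admitted_questions(model_cutoff: date, pool: List[Question]) -> List[Question]:
    """Keep only the questions a model with this knowledge cutoff may be asked."""
    return [q for q in pool if is_admissible(model_cutoff, q)]
```

Because the check depends on $\kappa_M$, the admitted question set can differ from model to model: a question admitted for a model with an early cutoff may be excluded for one with a later cutoff.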

Companion Metrics & Cost

$\mathrm{Pass}^{\mathrm{any}}@N$ and $\mathrm{Pass}^{\mathrm{all}}@N$ characterise sampling stability across $N$ trials. Cohen’s $\kappa$ and Fleiss’ $\kappa$ chance-correct strict accuracy. FSS is a Tversky skill score with $(\alpha,\beta)=(2.0, 0.5)$. Per-correct cost $C^{\mathrm{per\text{-}correct}}_m = C^{\mathrm{total}}_m / (|\mathcal{D}^{\mathrm{eval}}| \cdot N \cdot \mathrm{CA}_m)$ amortises the total invoice over the difficulty-weighted notional correct-sample count.
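As a rough sketch of two of these quantities, the code below computes $\mathrm{Pass}^{\mathrm{any}}@N$, $\mathrm{Pass}^{\mathrm{all}}@N$, and the per-correct cost. Several things here are assumptions: the input is a per-question list of booleans marking whether each trial was strictly correct, the pass rates are reported as percentages, and $\mathrm{CA}_m$ is passed in as a plain number because this section does not spell out how it is computed. The function names are illustrative.

```python
from typing import Sequence, Tuple


def pass_any_all(correct: Sequence[Sequence[bool]]) -> Tuple[float, float]:
    """Return (Pass^any@N, Pass^all@N) as percentages.

    correct[i][j] is True when trial j on question i was strictly correct
    (what counts as "strictly correct" is assumed to be decided upstream).
    """
    n_questions = len(correct)
    any_rate = sum(any(trials) for trials in correct) / n_questions
    all_rate = sum(all(trials) for trials in correct) / n_questions
    return 100.0 * any_rate, 100.0 * all_rate


def per_correct_cost(total_cost: float, n_questions: int, n_trials: int, ca: float) -> float:
    """C_total / (|D_eval| * N * CA), following the formula in the text."""
    denominator = n_questions * n_trials * ca
    if denominator <= 0:
        raise ValueError("the notional correct-sample count must be positive")
    return total_cost / denominator
```

The gap between the two pass rates gives a quick read on sampling stability: a model that is right on every trial or on none has equal values, while a wide gap means answers flip between trials.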