OracleProto

A reproducible benchmark of LLM-native forecasting under knowledge-cutoff and temporal-masking control.

Forecasting = Gathering × Synthesis × Judgment × Decision

The core composite capability driving LLMs toward decision support

Traditional benchmarks ask: “Can you recall the answer?”

OracleProto asks: “Can you predict the future?”

May every forecast be reproducible, and may AI truly become decision support, in service of every person’s judgments and choices for a good life.

Rank · 02

Claude Opus 4.6

Knowledge Cutoff · 2026-02-04
Exam Score 59.50 / 100
Binary 65.83
Single Choice 66.67
Multi Choice 32.06
Cost / Correct (USD) ↓ $0.363
Rank · 01

GPT-5.4 Thinking (High)

Knowledge Cutoff · 2026-03-05
Exam Score 60.34 / 100
Binary 74.16
Single Choice 67.71
Multi Choice 21.18
Cost / Correct (USD) ↓ $0.217
Rank · 03

Gemini 3.1 Pro Preview

Knowledge Cutoff · 2026-02-19
Exam Score 57.46 / 100
Binary 68.34
Single Choice 61.46
Multi Choice 31.13
Cost / Correct (USD) ↓ $0.260

Full Ranking

| Rank | Model | Knowledge Cutoff | Exam Score | Pass@1 | | Cost / Correct (USD) |
|---|---|---|---|---|---|---|
| 1 | GPT-5.4 Thinking (High) | 2026-03-05 | 60.34 | 65.42 | 71.62 | $0.217 |
| 2 | Claude Opus 4.6 | 2026-02-04 | 59.50 | 61.25 | 65.79 | $0.363 |
| 3 | Gemini 3.1 Pro Preview | 2026-02-19 | 57.46 | 61.25 | 64.56 | $0.260 |
| 4 | Claude Sonnet 4.6 | 2026-02-17 | 54.81 | 59.17 | 66.20 | $0.272 |
| 5 | gpt-oss-120b | 2024-01-30 | 54.61 | 57.74 | 64.79 | $0.018 |
| 6 | DeepSeek V3.2 Exp | 2025-09-29 | 54.39 | 57.08 | 66.18 | $0.028 |
| 7 | DeepSeek V3.2 Exp (Reasoning) | 2025-09-29 | 53.92 | 57.50 | 62.67 | $0.030 |
| 8 | Qwen3.5 Flash | 2026-02-25 | 53.49 | 55.83 | 60.87 | $0.004 |
| 9 | GLM-5 | 2026-02-11 | 53.31 | 56.67 | 62.50 | $0.056 |
| 10 | Doubao Seed 2.0 Lite | 2026-03-10 | 52.37 | 52.50 | 58.67 | $0.011 |
| 11 | GPT-5.4 | 2026-03-05 | 52.05 | 56.25 | 60.81 | $0.150 |
| 12 | Kimi K2.5 | 2026-01-27 | 51.71 | 55.65 | 66.18 | $0.056 |
| 13 | Gemini 3.1 Flash Lite Preview | 2026-03-03 | 50.62 | 53.75 | 60.81 | $0.013 |
| 14 | Qwen3.5 35B A3B | 2026-02-25 | 49.57 | 51.67 | 59.21 | $0.007 |
| 15 | MiniMax M2.5 | 2026-02-12 | 48.19 | 53.14 | 63.01 | $0.028 |
| 16 | Qwen3.5 Plus | 2026-02-15 | 46.69 | 52.50 | 58.90 | $0.013 |
| 17 | GPT-5.3 Codex | 2026-02-24 | 45.34 | 48.75 | 54.29 | $0.062 |
| 18 | Grok 4.1 Fast Reasoning | 2025-11-19 | 44.32 | 48.33 | 53.42 | $0.018 |

Per-Model Detail

Rank · 01

GPT-5.4 Thinking (High)

Exam Score
60.34
By Category ↑
Binary
74.16
Single Choice
67.71
Multi Choice
21.18
Discrimination ↑
Pass@1
65.42
Passany@N
75.00
Passall@N
55.00
FSS
50.19
Agreement ↑
Cohen κ
0.438
Fleiss κ
0.677
Total Cost (USD) ↓
$31.49
Cost / Correct (USD) ↓
$0.217
Rank · 02

Claude Opus 4.6

Exam Score
59.50
By Category ↑
Binary
65.83
Single Choice
66.67
Multi Choice
32.06
Discrimination ↑
Pass@1
61.25
Passany@N
75.00
Passall@N
46.25
FSS
41.80
Agreement ↑
Cohen κ
0.370
Fleiss κ
0.611
Total Cost (USD) ↓
$51.86
Cost / Correct (USD) ↓
$0.363
Rank · 03

Gemini 3.1 Pro Preview

Exam Score
57.46
By Category ↑
Binary
68.34
Single Choice
61.46
Multi Choice
31.13
Discrimination ↑
Pass@1
61.25
Passany@N
73.75
Passall@N
46.25
FSS
41.11
Agreement ↑
Cohen κ
0.370
Fleiss κ
0.553
Total Cost (USD) ↓
$35.90
Cost / Correct (USD) ↓
$0.260
Rank · 04

Claude Sonnet 4.6

Exam Score
54.81
By Category ↑
Binary
65.83
Single Choice
63.54
Multi Choice
16.44
Discrimination ↑
Pass@1
59.17
Passany@N
68.75
Passall@N
50.00
FSS
39.73
Agreement ↑
Cohen κ
0.337
Fleiss κ
0.683
Total Cost (USD) ↓
$35.80
Cost / Correct (USD) ↓
$0.272
Rank · 05

gpt-oss-120b

Exam Score
54.61
By Category ↑
Binary
64.16
Single Choice
59.38
Multi Choice
28.36
Discrimination ↑
Pass@1
57.74
Passany@N
77.50
Passall@N
37.50
FSS
37.20
Agreement ↑
Cohen κ
0.314
Fleiss κ
0.387
Total Cost (USD) ↓
$2.31
Cost / Correct (USD) ↓
$0.018
Rank · 06

DeepSeek V3.2 Exp

Exam Score
54.39
By Category ↑
Binary
64.16
Single Choice
58.33
Multi Choice
29.86
Discrimination ↑
Pass@1
57.08
Passany@N
80.00
Passall@N
35.00
FSS
36.64
Agreement ↑
Cohen κ
0.303
Fleiss κ
0.341
Total Cost (USD) ↓
$3.63
Cost / Correct (USD) ↓
$0.028
Rank · 07

DeepSeek V3.2 Exp (Reasoning)

Exam Score
53.92
By Category ↑
Binary
62.50
Single Choice
65.62
Multi Choice
11.81
Discrimination ↑
Pass@1
57.50
Passany@N
77.50
Passall@N
36.25
FSS
35.87
Agreement ↑
Cohen κ
0.309
Fleiss κ
0.372
Total Cost (USD) ↓
$3.86
Cost / Correct (USD) ↓
$0.030
Rank · 08

Qwen3.5 Flash

Exam Score
53.49
By Category ↑
Binary
62.50
Single Choice
58.33
Multi Choice
27.89
Discrimination ↑
Pass@1
55.83
Passany@N
75.00
Passall@N
40.00
FSS
34.33
Agreement ↑
Cohen κ
0.282
Fleiss κ
0.450
Total Cost (USD) ↓
$0.45
Cost / Correct (USD) ↓
$0.004
Rank · 09

GLM-5

Exam Score
53.31
By Category ↑
Binary
65.00
Single Choice
57.29
Multi Choice
25.81
Discrimination ↑
Pass@1
56.67
Passany@N
76.25
Passall@N
37.50
FSS
36.27
Agreement ↑
Cohen κ
0.296
Fleiss κ
0.436
Total Cost (USD) ↓
$7.17
Cost / Correct (USD) ↓
$0.056
Rank · 10

Doubao Seed 2.0 Lite

Exam Score
52.37
By Category ↑
Binary
55.00
Single Choice
59.38
Multi Choice
30.90
Discrimination ↑
Pass@1
52.50
Passany@N
70.00
Passall@N
32.50
FSS
27.92
Agreement ↑
Cohen κ
0.228
Fleiss κ
0.439
Total Cost (USD) ↓
$1.44
Cost / Correct (USD) ↓
$0.011
Rank · 11

GPT-5.4

Exam Score
52.05
By Category ↑
Binary
67.50
Single Choice
51.04
Multi Choice
31.37
Discrimination ↑
Pass@1
56.25
Passany@N
73.75
Passall@N
38.75
FSS
36.25
Agreement ↑
Cohen κ
0.289
Fleiss κ
0.504
Total Cost (USD) ↓
$18.71
Cost / Correct (USD) ↓
$0.150
Rank · 12

Kimi K2.5

Exam Score
51.71
By Category ↑
Binary
62.50
Single Choice
58.33
Multi Choice
18.98
Discrimination ↑
Pass@1
55.65
Passany@N
80.00
Passall@N
30.00
FSS
32.92
Agreement ↑
Cohen κ
0.280
Fleiss κ
0.290
Total Cost (USD) ↓
$6.89
Cost / Correct (USD) ↓
$0.056
Rank · 13

Gemini 3.1 Flash Lite Preview

Exam Score
50.62
By Category ↑
Binary
57.50
Single Choice
57.29
Multi Choice
23.61
Discrimination ↑
Pass@1
53.75
Passany@N
73.75
Passall@N
31.25
FSS
28.53
Agreement ↑
Cohen κ
0.248
Fleiss κ
0.358
Total Cost (USD) ↓
$1.56
Cost / Correct (USD) ↓
$0.013
Rank · 14

Qwen3.5 35B A3B

Exam Score
49.57
By Category ↑
Binary
55.00
Single Choice
57.29
Multi Choice
22.11
Discrimination ↑
Pass@1
51.67
Passany@N
66.25
Passall@N
32.50
FSS
26.31
Agreement ↑
Cohen κ
0.215
Fleiss κ
0.433
Total Cost (USD) ↓
$0.87
Cost / Correct (USD) ↓
$0.007
Rank · 15

MiniMax M2.5

Exam Score
48.19
By Category ↑
Binary
64.17
Single Choice
50.00
Multi Choice
19.68
Discrimination ↑
Pass@1
53.14
Passany@N
68.75
Passall@N
32.50
FSS
29.97
Agreement ↑
Cohen κ
0.239
Fleiss κ
0.387
Total Cost (USD) ↓
$3.21
Cost / Correct (USD) ↓
$0.028
Rank · 16

Qwen3.5 Plus

Exam Score
46.69
By Category ↑
Binary
66.66
Single Choice
45.83
Multi Choice
18.87
Discrimination ↑
Pass@1
52.50
Passany@N
72.50
Passall@N
31.25
FSS
31.29
Agreement ↑
Cohen κ
0.228
Fleiss κ
0.398
Total Cost (USD) ↓
$1.49
Cost / Correct (USD) ↓
$0.013
Rank · 17

GPT-5.3 Codex

Exam Score
45.34
By Category ↑
Binary
59.17
Single Choice
46.88
Multi Choice
20.72
Discrimination ↑
Pass@1
48.75
Passany@N
70.00
Passall@N
28.75
FSS
24.74
Agreement ↑
Cohen κ
0.167
Fleiss κ
0.402
Total Cost (USD) ↓
$6.77
Cost / Correct (USD) ↓
$0.062
Rank · 18

Grok 4.1 Fast Reasoning

Exam Score
44.32
By Category ↑
Binary
59.17
Single Choice
46.88
Multi Choice
15.62
Discrimination ↑
Pass@1
48.33
Passany@N
63.75
Passall@N
32.50
FSS
23.70
Agreement ↑
Cohen κ
0.160
Fleiss κ
0.452
Total Cost (USD) ↓
$1.95
Cost / Correct (USD) ↓
$0.018
Exam Score

Within each trial, every question is scored by $\mathrm{exam}(\hat S, G) = |\hat S \cap G| / |G|$ when $\hat S \setminus G = \varnothing$, and $0$ otherwise, where $\hat S$ is the predicted answer set and $G$ the gold answer set. This is recall under a zero-false-positive gate. The leaderboard ranks by a weighted composite of the three canonical question types (binary, single-choice, and multi-choice) with weights of $30\%$, $50\%$, and $20\%$ respectively, averaged across $N=3$ trials.
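The scoring rule and the composite can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: the names `exam`, `WEIGHTS`, and `composite` are hypothetical, and the example assumes the per-category scores shown on the leaderboard are already trial-averaged.

```python
def exam(pred: set[str], gold: set[str]) -> float:
    """Recall under a zero-false-positive gate (hypothetical helper)."""
    if not gold:
        raise ValueError("gold set must be non-empty")
    if pred - gold:                      # any selected option outside the gold set
        return 0.0
    return len(pred & gold) / len(gold)  # otherwise plain recall


# Composite weights as stated in the text: binary 30%, single 50%, multi 20%.
WEIGHTS = {"binary": 0.30, "single": 0.50, "multi": 0.20}


def composite(by_type: dict[str, float]) -> float:
    """Weighted composite of per-category scores."""
    return sum(WEIGHTS[t] * by_type[t] for t in WEIGHTS)


# Per-question examples on a multi-choice item with gold = {A, C}.
partial = exam({"A"}, {"A", "C"})        # subset of gold: partial credit
gated = exam({"A", "B"}, {"A", "C"})     # one false positive: zero
full = exam({"A", "C"}, {"A", "C"})      # exact match: full credit

# Category scores listed for GPT-5.4 Thinking (High) on the leaderboard.
top_score = round(composite({"binary": 74.16, "single": 67.71, "multi": 21.18}), 2)
print(partial, gated, full, top_score)
```

With the category scores listed for GPT-5.4 Thinking (High), the composite reproduces the published Exam Score of 60.34, which suggests the weights are applied directly to the category averages.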

Knowledge Cutoff & Temporal Masking

A question $q_i$ is admitted to model $M$ only when $\kappa_M \le \chi_i < \tau_i$, where $\kappa_M$ is the model’s knowledge cutoff and $\tau_i$ the event resolution time. We take $\kappa_M$ from the vendor’s officially published cutoff when one is disclosed, and fall back to the model’s public release date otherwise. Tool-level temporal masking and content-level leakage detection together suppress retrieval-mediated leakage to roughly $1\%$.
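The admission rule can be illustrated as a simple date comparison. The sketch below is an assumption-laden example rather than the benchmark's implementation: the `Model` and `Question` classes, the field names, and all dates except the Claude Opus 4.6 cutoff listed on the leaderboard are hypothetical, and `chi` simply stands for the per-question timestamp $\chi_i$ used in the inequality.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Model:
    name: str
    vendor_cutoff: Optional[date]  # officially published cutoff, if disclosed
    release_date: date             # fallback when no cutoff is disclosed

    @property
    def kappa(self) -> date:
        """kappa_M: vendor cutoff when available, else public release date."""
        return self.vendor_cutoff or self.release_date


@dataclass
class Question:
    chi: date  # chi_i, the question's timestamp in the admission inequality
    tau: date  # tau_i, the event resolution time


def admitted(model: Model, q: Question) -> bool:
    """True when kappa_M <= chi_i < tau_i."""
    return model.kappa <= q.chi < q.tau


# The cutoff is taken from the leaderboard; the release date is a placeholder.
opus = Model("Claude Opus 4.6", date(2026, 2, 4), date(2026, 3, 1))
# A hypothetical model with no disclosed cutoff falls back to its release date.
undisclosed = Model("hypothetical-model", None, date(2026, 3, 1))

q_after = Question(chi=date(2026, 2, 20), tau=date(2026, 4, 1))
q_before = Question(chi=date(2026, 1, 15), tau=date(2026, 4, 1))

print(admitted(opus, q_after), admitted(opus, q_before), admitted(undisclosed, q_after))
```

Under these assumed dates the same question is admitted for one model and rejected for another, so the evaluated question set can differ from model to model.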

Companion Metrics & Cost

$\mathrm{Pass}^{\mathrm{any}}@N$ and $\mathrm{Pass}^{\mathrm{all}}@N$ characterise sampling stability across $N$ trials. Cohen’s $\kappa$ and Fleiss’ $\kappa$ chance-correct strict accuracy. FSS is a Tversky skill score with $(\alpha,\beta)=(2.0, 0.5)$. Per-correct cost $C^{\mathrm{per\text{-}correct}}_m = C^{\mathrm{total}}_m / (|\mathcal{D}^{\mathrm{eval}}| \cdot N \cdot \mathrm{CA}_m)$ amortises the total invoice over the difficulty-weighted notional correct-sample count.
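The companion quantities can be sketched as follows. This is a hedged illustration with hypothetical function names. The Tversky function is the standard Tversky index with $\alpha$ assumed to weight false positives and $\beta$ false negatives; how FSS is derived from it is not specified here. The cost example assumes an evaluation set of 80 questions and takes $\mathrm{CA}_m$ to be the Exam Score as a fraction; both are assumptions inferred from the published figures, not stated definitions.

```python
def pass_any(trials: list[list[bool]]) -> float:
    """Share of questions answered correctly in at least one of N trials."""
    return sum(any(q) for q in trials) / len(trials)


def pass_all(trials: list[list[bool]]) -> float:
    """Share of questions answered correctly in every one of N trials."""
    return sum(all(q) for q in trials) / len(trials)


def tversky(pred: set[str], gold: set[str],
            alpha: float = 2.0, beta: float = 0.5) -> float:
    """Standard Tversky index; alpha weights false positives (assumed)."""
    tp = len(pred & gold)
    fp = len(pred - gold)
    fn = len(gold - pred)
    denom = tp + alpha * fp + beta * fn
    return tp / denom if denom else 0.0


def per_correct_cost(total_usd: float, n_questions: int,
                     n_trials: int, ca: float) -> float:
    """C_total / (|D_eval| * N * CA), as in the formula above."""
    return total_usd / (n_questions * n_trials * ca)


# Four hypothetical questions, three trials each (True = strictly correct).
trials = [
    [True, True, True],
    [True, False, False],
    [False, False, False],
    [True, True, False],
]
any_rate = pass_any(trials)  # 3 of 4 questions
all_rate = pass_all(trials)  # 1 of 4 questions

# One true positive, one false positive, one miss: 1 / (1 + 2.0 + 0.5).
t = tversky({"A", "B"}, {"A", "C"})

# GPT-5.4 Thinking (High): total cost $31.49, N = 3, Exam Score 60.34.
# The 80-question set size and CA = 0.6034 are assumptions.
cost = per_correct_cost(31.49, 80, 3, 0.6034)
print(any_rate, all_rate, round(t, 4), round(cost, 3))
```

Under those assumptions the computed per-correct cost rounds to the $0.217 shown for GPT-5.4 Thinking (High).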