OracleProto

A reproducible benchmark of LLMs' native forecasting ability under knowledge-cutoff and temporal-masking controls.

Forecasting = Gathering × Synthesis × Judgment × Decision

The core composite capability driving LLMs toward decision support

Traditional benchmarks ask: “Can you recall the answer?”

OracleProto asks: “Can you predict the future?”

May every forecast be reproducible, and may AI become true decision support

in service of every person's judgments and choices for a good life.

Rank · 02

GPT 5.4 Thinking (High)

Knowledge Cutoff · 2026-03-05
Exam Score: 53.66 / 100
Pass@1 65.42 · Passany@N 75.00 · Passall@N 55.00 · Majority Vote 71.62

Rank · 01

Claude Opus 4.6

Knowledge Cutoff · 2026-02-04
Exam Score: 56.59 / 100
Pass@1 61.25 · Passany@N 75.00 · Passall@N 46.25 · Majority Vote 65.79

Rank · 03

Gemini 3.1 Pro Preview

Knowledge Cutoff · 2026-02-19
Exam Score: 52.09 / 100
Pass@1 61.25 · Passany@N 73.75 · Passall@N 46.25 · Majority Vote 64.56

Full Ranking

| Rank | Model | Knowledge Cutoff | Exam Score | Pass@1 | Majority Vote | Cost / Correct (USD) |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 2026-02-04 | 56.59 | 61.25 | 65.79 | $0.382 |
| 2 | GPT 5.4 Thinking (High) | 2026-03-05 | 53.66 | 65.42 | 71.62 | $0.245 |
| 3 | Gemini 3.1 Pro Preview | 2026-02-19 | 52.09 | 61.25 | 64.56 | $0.287 |
| 4 | DeepSeek V3.2 Exp | 2025-09-29 | 51.53 | 57.08 | 66.18 | $0.029 |
| 5 | GLM-5 | 2026-02-11 | 51.21 | 56.67 | 62.50 | $0.058 |
| 6 | Doubao Seed 2.0 Lite | 2026-03-10 | 50.78 | 52.50 | 58.67 | $0.012 |
| 7 | GPT 5.4 | 2026-03-05 | 50.67 | 56.25 | 60.81 | $0.154 |
| 8 | Qwen3.5 Flash | 2026-02-25 | 50.43 | 55.83 | 60.87 | $0.004 |
| 9 | Claude Sonnet 4.6 | 2026-02-17 | 49.15 | 59.17 | 66.20 | $0.303 |
| 10 | gpt-oss-120b | 2024-01-30 | 48.13 | 57.74 | 64.79 | $0.020 |
| 11 | Gemini 3.1 Flash Lite Preview | 2026-03-03 | 47.53 | 53.75 | 60.81 | $0.014 |
| 12 | Kimi K2.5 | 2026-01-27 | 46.69 | 55.65 | 66.18 | $0.061 |
| 13 | Qwen3.5 35B A3B | 2026-02-25 | 46.50 | 51.67 | 59.21 | $0.008 |
| 14 | MiniMax M2.5 | 2026-02-12 | 46.45 | 53.14 | 63.01 | $0.029 |
| 15 | GPT 5.3 Codex | 2026-02-24 | 45.21 | 48.75 | 54.29 | $0.062 |
| 16 | Qwen3.5 Plus | 2026-02-15 | 43.82 | 52.50 | 58.90 | $0.014 |
| 17 | Grok 4.1 Fast Reasoning | 2025-11-19 | 43.07 | 48.33 | 53.42 | $0.019 |
| 18 | DeepSeek V3.2 Exp Think | 2025-09-29 | 42.66 | 57.50 | 62.67 | $0.038 |

Per-Model Detail

Rank · 01

Claude Opus 4.6

Exam Score: 56.59
By Category ↑: Yes/No 63.06 · Named 100.00 · Single MC 66.67 · Multi MC 32.06
Discrimination ↑: Pass@1 61.25 · Passany@N 75.00 · Passall@N 46.25 · FSS 41.80
Agreement ↑: Cohen κ 0.370 · Fleiss κ 0.611
Total Cost (USD) ↓: $51.86
Cost / Correct (USD) ↓: $0.382

Rank · 02

GPT 5.4 Thinking (High)

Exam Score: 53.66
By Category ↑: Yes/No 72.07 · Named 100.00 · Single MC 67.71 · Multi MC 21.18
Discrimination ↑: Pass@1 65.42 · Passany@N 75.00 · Passall@N 55.00 · FSS 50.19
Agreement ↑: Cohen κ 0.438 · Fleiss κ 0.677
Total Cost (USD) ↓: $31.49
Cost / Correct (USD) ↓: $0.245

Rank · 03

Gemini 3.1 Pro Preview

Exam Score: 52.09
By Category ↑: Yes/No 67.57 · Named 77.78 · Single MC 61.46 · Multi MC 31.13
Discrimination ↑: Pass@1 61.25 · Passany@N 73.75 · Passall@N 46.25 · FSS 41.11
Agreement ↑: Cohen κ 0.370 · Fleiss κ 0.553
Total Cost (USD) ↓: $35.90
Cost / Correct (USD) ↓: $0.287

Rank · 04

DeepSeek V3.2 Exp

Exam Score: 51.53
By Category ↑: Yes/No 62.16 · Named 88.89 · Single MC 58.33 · Multi MC 29.86
Discrimination ↑: Pass@1 57.08 · Passany@N 80.00 · Passall@N 35.00 · FSS 36.64
Agreement ↑: Cohen κ 0.303 · Fleiss κ 0.341
Total Cost (USD) ↓: $3.63
Cost / Correct (USD) ↓: $0.029

Rank · 05

GLM-5

Exam Score: 51.21
By Category ↑: Yes/No 62.16 · Named 100.00 · Single MC 57.29 · Multi MC 25.81
Discrimination ↑: Pass@1 56.67 · Passany@N 76.25 · Passall@N 37.50 · FSS 36.27
Agreement ↑: Cohen κ 0.296 · Fleiss κ 0.436
Total Cost (USD) ↓: $7.17
Cost / Correct (USD) ↓: $0.058

Rank · 06

Doubao Seed 2.0 Lite

Exam Score: 50.78
By Category ↑: Yes/No 52.25 · Named 88.89 · Single MC 59.38 · Multi MC 30.90
Discrimination ↑: Pass@1 52.50 · Passany@N 70.00 · Passall@N 32.50 · FSS 27.92
Agreement ↑: Cohen κ 0.228 · Fleiss κ 0.439
Total Cost (USD) ↓: $1.44
Cost / Correct (USD) ↓: $0.012

Rank · 07

GPT 5.4

Exam Score: 50.67
By Category ↑: Yes/No 65.77 · Named 88.89 · Single MC 51.04 · Multi MC 31.37
Discrimination ↑: Pass@1 56.25 · Passany@N 73.75 · Passall@N 38.75 · FSS 36.25
Agreement ↑: Cohen κ 0.289 · Fleiss κ 0.504
Total Cost (USD) ↓: $18.71
Cost / Correct (USD) ↓: $0.154

Rank · 08

Qwen3.5 Flash

Exam Score: 50.43
By Category ↑: Yes/No 60.36 · Named 88.89 · Single MC 58.33 · Multi MC 27.89
Discrimination ↑: Pass@1 55.83 · Passany@N 75.00 · Passall@N 40.00 · FSS 34.33
Agreement ↑: Cohen κ 0.282 · Fleiss κ 0.450
Total Cost (USD) ↓: $0.45
Cost / Correct (USD) ↓: $0.004

Rank · 09

Claude Sonnet 4.6

Exam Score: 49.15
By Category ↑: Yes/No 63.06 · Named 100.00 · Single MC 63.54 · Multi MC 16.44
Discrimination ↑: Pass@1 59.17 · Passany@N 68.75 · Passall@N 50.00 · FSS 39.73
Agreement ↑: Cohen κ 0.337 · Fleiss κ 0.683
Total Cost (USD) ↓: $35.80
Cost / Correct (USD) ↓: $0.303

Rank · 10

gpt-oss-120b

Exam Score: 48.13
By Category ↑: Yes/No 63.96 · Named 66.67 · Single MC 59.38 · Multi MC 28.36
Discrimination ↑: Pass@1 57.74 · Passany@N 77.50 · Passall@N 37.50 · FSS 37.20
Agreement ↑: Cohen κ 0.314 · Fleiss κ 0.387
Total Cost (USD) ↓: $2.31
Cost / Correct (USD) ↓: $0.020

Rank · 11

Gemini 3.1 Flash Lite Preview

Exam Score: 47.53
By Category ↑: Yes/No 54.95 · Named 88.89 · Single MC 57.29 · Multi MC 23.61
Discrimination ↑: Pass@1 53.75 · Passany@N 73.75 · Passall@N 31.25 · FSS 28.53
Agreement ↑: Cohen κ 0.248 · Fleiss κ 0.358
Total Cost (USD) ↓: $1.56
Cost / Correct (USD) ↓: $0.014

Rank · 12

Kimi K2.5

Exam Score: 46.69
By Category ↑: Yes/No 60.36 · Named 88.89 · Single MC 58.33 · Multi MC 18.98
Discrimination ↑: Pass@1 55.65 · Passany@N 80.00 · Passall@N 30.00 · FSS 32.92
Agreement ↑: Cohen κ 0.280 · Fleiss κ 0.290
Total Cost (USD) ↓: $6.89
Cost / Correct (USD) ↓: $0.061

Rank · 13

Qwen3.5 35B A3B

Exam Score: 46.50
By Category ↑: Yes/No 52.25 · Named 88.89 · Single MC 57.29 · Multi MC 22.11
Discrimination ↑: Pass@1 51.67 · Passany@N 66.25 · Passall@N 32.50 · FSS 26.31
Agreement ↑: Cohen κ 0.215 · Fleiss κ 0.433
Total Cost (USD) ↓: $0.87
Cost / Correct (USD) ↓: $0.008

Rank · 14

MiniMax M2.5

Exam Score: 46.45
By Category ↑: Yes/No 61.26 · Named 100.00 · Single MC 50.00 · Multi MC 19.68
Discrimination ↑: Pass@1 53.14 · Passany@N 68.75 · Passall@N 32.50 · FSS 29.97
Agreement ↑: Cohen κ 0.239 · Fleiss κ 0.387
Total Cost (USD) ↓: $3.21
Cost / Correct (USD) ↓: $0.029

Rank · 15

GPT 5.3 Codex

Exam Score: 45.21
By Category ↑: Yes/No 55.86 · Named 100.00 · Single MC 46.88 · Multi MC 20.72
Discrimination ↑: Pass@1 48.75 · Passany@N 70.00 · Passall@N 28.75 · FSS 24.74
Agreement ↑: Cohen κ 0.167 · Fleiss κ 0.402
Total Cost (USD) ↓: $6.77
Cost / Correct (USD) ↓: $0.062

Rank · 16

Qwen3.5 Plus

Exam Score: 43.82
By Category ↑: Yes/No 64.86 · Named 88.89 · Single MC 45.83 · Multi MC 18.87
Discrimination ↑: Pass@1 52.50 · Passany@N 72.50 · Passall@N 31.25 · FSS 31.29
Agreement ↑: Cohen κ 0.228 · Fleiss κ 0.398
Total Cost (USD) ↓: $1.49
Cost / Correct (USD) ↓: $0.014

Rank · 17

Grok 4.1 Fast Reasoning

Exam Score: 43.07
By Category ↑: Yes/No 55.86 · Named 100.00 · Single MC 46.88 · Multi MC 15.62
Discrimination ↑: Pass@1 48.33 · Passany@N 63.75 · Passall@N 32.50 · FSS 23.70
Agreement ↑: Cohen κ 0.160 · Fleiss κ 0.452
Total Cost (USD) ↓: $1.95
Cost / Correct (USD) ↓: $0.019

Rank · 18

DeepSeek V3.2 Exp Think

Exam Score: 42.66
By Category ↑: Yes/No 62.16 · Named 66.67 · Single MC 65.62 · Multi MC 11.81
Discrimination ↑: Pass@1 57.50 · Passany@N 77.50 · Passall@N 36.25 · FSS 35.87
Agreement ↑: Cohen κ 0.309 · Fleiss κ 0.372
Total Cost (USD) ↓: $3.86
Cost / Correct (USD) ↓: $0.038

Exam Score

Within each trial, every question is scored by $\mathrm{exam}(\hat S, G) = |\hat S \cap G| / |G|$ when $\hat S \setminus G = \varnothing$, and $0$ otherwise: recall over the gold set $G$, gated on the predicted set $\hat S$ containing no false positives. The leaderboard ranks models by a composite over four canonical question types (Yes/No, named-entity, single-choice, and multi-choice), averaged across $N=3$ trials.
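The per-question rule can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: the function name `exam_score` and the reading of $\hat S$ as the set of predicted options and $G$ as the set of gold options are assumptions, and how the four question types are weighted in the composite is not specified here, so it is left out.

```python
from __future__ import annotations


def exam_score(predicted: set[str], gold: set[str]) -> float:
    """Recall over the gold set, gated on zero false positives.

    Assumption: `predicted` plays the role of S-hat and `gold` the role of G.
    """
    if not gold:
        raise ValueError("gold set must be non-empty")
    if predicted - gold:
        # Any predicted option outside the gold set zeroes the trial.
        return 0.0
    return len(predicted & gold) / len(gold)


# Illustrative inputs only (not benchmark data).
partial = exam_score({"A"}, {"A", "B"})            # one of two gold options, no false positive
gated = exam_score({"A", "C"}, {"A", "B"})         # "C" is a false positive
exact = exam_score({"A", "B"}, {"A", "B"})         # full match
```

Under this reading, a multi-choice answer earns partial credit only when it is a subset of the gold set, while a single wrong option forfeits the question. That may help explain why Multi MC scores sit well below the other categories.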

Knowledge Cutoff & Temporal Masking

A question $q_i$ is admitted to model $M$ only when $\kappa_M \le \chi_i < \tau_i$, where $\kappa_M$ is the model’s knowledge cutoff and $\tau_i$ the event resolution time. We take $\kappa_M$ from the vendor’s officially published cutoff when one is disclosed, and fall back to the model’s public release date otherwise. Tool-level temporal masking and content-level leakage detection together suppress retrieval-mediated leakage to roughly $1\%$.
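The admission rule and the cutoff fallback can be expressed as a small filter. The sketch below is illustrative only: `ModelInfo`, `effective_cutoff`, and `is_admitted` are hypothetical names, the model entries are placeholders, and $\chi_i$ is treated as an opaque per-question date because the text above does not define it further.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ModelInfo:
    """Hypothetical record of the dates needed to derive kappa_M."""
    name: str
    release_date: date
    vendor_cutoff: Optional[date] = None  # None when the vendor discloses no cutoff


def effective_cutoff(model: ModelInfo) -> date:
    """kappa_M: the vendor's published cutoff if disclosed, else the release date."""
    if model.vendor_cutoff is not None:
        return model.vendor_cutoff
    return model.release_date


def is_admitted(model: ModelInfo, chi: date, tau: date) -> bool:
    """Admit a question only when kappa_M <= chi_i < tau_i."""
    return effective_cutoff(model) <= chi < tau


# Placeholder models, not leaderboard entries.
disclosed = ModelInfo("hypothetical-a", date(2026, 1, 15), date(2026, 1, 1))
undisclosed = ModelInfo("hypothetical-b", date(2026, 1, 15))
```

Because admission depends on $\kappa_M$, models with different cutoffs can end up being evaluated on different question subsets.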

Companion Metrics & Cost

$\mathrm{Pass}^{\mathrm{any}}@N$ and $\mathrm{Pass}^{\mathrm{all}}@N$ characterise sampling stability across $N$ trials. Cohen’s $\kappa$ and Fleiss’ $\kappa$ chance-correct strict accuracy. FSS is a Tversky skill score with $(\alpha,\beta)=(2.0, 0.5)$. Per-correct cost $C^{\mathrm{per\text{-}correct}}_m = C^{\mathrm{total}}_m / (|\mathcal{D}^{\mathrm{eval}}| \cdot N \cdot \mathrm{CA}_m)$ amortises the total invoice over the difficulty-weighted notional correct-sample count.
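As a rough illustration of two of these metrics, the sketch below computes the any/all pass rates and the per-correct cost. It assumes the usual reading that a question passes "any" when at least one of its $N$ trials is strictly correct and passes "all" when every trial is. The function names and input values are hypothetical, and `ca` stands in for $\mathrm{CA}_m$ as a fraction in $(0, 1]$.

```python
from __future__ import annotations


def pass_any_all(trials: list[list[bool]]) -> tuple[float, float]:
    """Return (Pass-any@N, Pass-all@N) as percentages.

    trials[q] holds the N strict-correctness flags for question q.
    """
    if not trials:
        raise ValueError("need at least one question")
    n_questions = len(trials)
    any_rate = sum(any(flags) for flags in trials) / n_questions
    all_rate = sum(all(flags) for flags in trials) / n_questions
    return 100.0 * any_rate, 100.0 * all_rate


def per_correct_cost(total_cost: float, n_questions: int, n_trials: int, ca: float) -> float:
    """C_total / (|D_eval| * N * CA), following the formula above."""
    if ca <= 0:
        raise ValueError("ca must be positive")
    return total_cost / (n_questions * n_trials * ca)


# Made-up inputs for illustration (not leaderboard data).
flags = [
    [True, True, True],     # passes both "any" and "all"
    [True, False, False],   # passes "any" only
    [False, False, False],  # passes neither
    [True, True, False],    # passes "any" only
]
pass_any, pass_all = pass_any_all(flags)
cost = per_correct_cost(total_cost=12.0, n_questions=50, n_trials=3, ca=0.5)
```

The gap between the two pass rates is what makes them a stability signal: a model whose trials agree has the two values close together, while a wide gap indicates that correctness depends on the sample drawn.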