OracleProto
A reproducible benchmark of native LLM forecasting under knowledge-cutoff and temporal-masking controls.
Full Ranking
Ordered by Exam Score by default. Click a column header to re-sort. All 6 models share the same 80-question subset and 3 parallel trials per question.
Per-Model Detail
Each model's results are broken into three panels: discrimination (how often it gets the answer right), agreement (how stable its $k=3$ trials are after chance correction), and efficiency (resource cost per answer).
DeepSeek V3.2 (Exp)
GLM-5
Qwen3.5 Flash
Kimi K2.5
MiniMax M2.5
Doubao Seed 2.0 Lite
The canonical score per trial is $\mathrm{exam}(\hat S, G) = |\hat S \cap G| / |G|$ when $\hat S \setminus G = \varnothing$, and $0$ otherwise: recall under a zero-false-positive gate. The leaderboard sorts by this score averaged over the $N$ trials of each question, then across questions for each model.
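The gated score can be sketched as follows (a minimal illustration; the function name and set representation are assumptions, not the benchmark's actual code):

```python
def exam_score(predicted: set, gold: set) -> float:
    """Recall |S ∩ G| / |G|, gated to zero by any false positive."""
    if not gold:                 # guard: undefined recall on an empty gold set
        return 0.0
    if predicted - gold:         # S \ G nonempty: a false positive zeroes the trial
        return 0.0
    return len(predicted & gold) / len(gold)

print(exam_score({"A", "B"}, {"A", "B", "C"}))  # partial recall, no false positives
print(exam_score({"A", "D"}, {"A", "B", "C"}))  # "D" is a false positive, score 0
```

Note the asymmetry: missing a gold item only reduces recall, but a single spurious item forfeits the whole trial.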
A question $q_i$ is admitted to model $M$ only when $\kappa_M \le \chi_i < \tau_i$, where $\kappa_M$ is the model’s knowledge cutoff and $\tau_i$ the event resolution time. Tool-level temporal masking and content-level leakage detection together suppress retrieval-mediated leakage to roughly $1\%$.
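The admission window reduces to a single chained comparison. In this sketch, `chi_i` stands for the question's reference timestamp $\chi_i$ (the text leaves its exact definition implicit, so the parameter name and semantics here are assumptions):

```python
from datetime import datetime

def admits(kappa_m: datetime, chi_i: datetime, tau_i: datetime) -> bool:
    """Admit question i to model M iff kappa_M <= chi_i < tau_i.

    kappa_m: model M's knowledge cutoff
    chi_i:   question i's reference timestamp (assumed meaning)
    tau_i:   event resolution time for question i
    """
    return kappa_m <= chi_i < tau_i

# A question dated after the cutoff but before resolution is admitted;
# one dated before the cutoff is excluded (the model may already know the answer).
print(admits(datetime(2024, 6, 1), datetime(2024, 9, 1), datetime(2025, 1, 1)))
print(admits(datetime(2024, 6, 1), datetime(2024, 5, 1), datetime(2025, 1, 1)))
```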
$\mathrm{Pass}^{\mathrm{any}}@N$ and $\mathrm{Pass}^{\mathrm{all}}@N$ characterise sampling stability across $N$ trials. Cohen’s $\kappa$ and Fleiss’ $\kappa$ provide chance-corrected agreement on strict accuracy. FSS is a Tversky skill score with $(\alpha,\beta)=(2.0, 0.5)$, penalising false positives more heavily than misses. Per-correct cost $C^{\mathrm{per\text{-}correct}}_m = C^{\mathrm{total}}_m / (|\mathcal{D}^{\mathrm{eval}}| \cdot N \cdot \mathrm{CA}_m)$ amortises the OpenRouter invoice over the difficulty-weighted notional correct-sample count.
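These metrics can be sketched in a few lines. Assumptions here: FSS is taken to be the standard Tversky index $\mathrm{TP}/(\mathrm{TP} + \alpha\,\mathrm{FP} + \beta\,\mathrm{FN})$, and $\mathrm{CA}_m$ is treated as a corrected-accuracy fraction; all function names are illustrative:

```python
def pass_any_at_n(correct: list[bool]) -> bool:
    """Pass^any@N: at least one of the N trials is strictly correct."""
    return any(correct)

def pass_all_at_n(correct: list[bool]) -> bool:
    """Pass^all@N: every one of the N trials is strictly correct."""
    return all(correct)

def tversky_skill(pred: set, gold: set, alpha: float = 2.0, beta: float = 0.5) -> float:
    """Tversky index TP / (TP + alpha*FP + beta*FN). With alpha > beta,
    false positives cost more than misses, echoing the zero-FP exam gate."""
    tp = len(pred & gold)
    fp = len(pred - gold)
    fn = len(gold - pred)
    denom = tp + alpha * fp + beta * fn
    return tp / denom if denom else 0.0

def per_correct_cost(total_cost: float, n_questions: int, n_trials: int, ca: float) -> float:
    """C_total / (|D_eval| * N * CA_m): invoice amortised over correct samples."""
    return total_cost / (n_questions * n_trials * ca)

print(pass_any_at_n([False, True, False]))          # one good trial suffices
print(tversky_skill({"A", "B"}, {"A", "B", "C"}))   # one miss, no false positives
```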