Research · 2026.04.28

Before Truth Arrives

Some AI evaluation claims become checkable only later. ValueBlindBench asks which claims are safe to cite now.

Blossom AI Team
Blossom AI
Apr 28, 2026 · 3 min read

Ink-wash winter gate and stone tablets, suggesting claim gates for AI evaluation before evidence fully matures.

Some claims can be checked immediately. Finance often has to wait.

An investment rationale can be written today, but the return that validates it may arrive months later. During that gap, teams still need to compare models, debug behavior, and decide what is safe to say.

The paper, "ValueBlindBench: Agreement-Gated Stress Testing of LLM-Judged Investment Rationales Before Returns Are Observable," studies this waiting period. Its main thread is claim permission: before realized returns are available, which evaluation claims are stable enough to repeat, cite, or use in model development?

The Delayed-Truth Problem

Realized returns remain the ground truth that matters. They are also late, noisy, and hard to use for every iteration of model work.

LLM judges can help during that interval, but only if the judge itself is treated as an instrument. A judge that produces a score has not automatically produced evidence. It has produced something that must be checked for agreement, stability, judge-family dependence, anchor ambiguity, and adversarial failure.

The Agreement Gate

ValueBlindBench applies that idea to a controlled market-state capital-allocation prototype: 1,000 honest decision cycles plus 100 preregistered adversarial controls, for 1,100 trajectories in total, scored with 5,500 judge calls (five per trajectory on average).

At the headline level, the agreement gate clears. The paper reports mean weighted kappa of 0.7168.
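To make the headline number concrete, here is a minimal sketch of how a mean weighted kappa could be computed and compared against a gate. It is an illustration under stated assumptions, not the paper's implementation: the quadratic weighting, the averaging over judge pairs, the function names, the toy ratings, and the 0.6 threshold are all hypothetical choices made for this example.

```python
from itertools import combinations


def weighted_kappa(a, b, n_categories, power=2):
    """Cohen's weighted kappa for two raters on an ordinal scale 0..n_categories-1.

    power=1 gives linear weights, power=2 quadratic weights.
    ASSUMPTION: the weighting scheme used by the paper is not stated here.
    """
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    if n_categories < 2:
        raise ValueError("need at least two categories")
    n, k = len(a), n_categories

    # Observed joint distribution of the two raters' scores.
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x][y] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted observed disagreement versus disagreement expected by chance.
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power
            num += w * observed[i][j]
            den += w * row[i] * col[j]
    if den == 0.0:
        return float("nan")  # both raters used one identical category: undefined
    return 1.0 - num / den


def mean_pairwise_kappa(ratings_by_judge, n_categories):
    """Average weighted kappa over every pair of judges (an assumed aggregation)."""
    pairs = list(combinations(ratings_by_judge.values(), 2))
    return sum(weighted_kappa(a, b, n_categories) for a, b in pairs) / len(pairs)


def passes_gate(kappa, threshold):
    """True when agreement reaches the threshold."""
    return kappa >= threshold


# Toy ratings on a 5-point scale (invented for illustration only).
judges = {
    "judge_a": [4, 3, 3, 1, 0, 2],
    "judge_b": [4, 3, 2, 1, 0, 2],
    "judge_c": [3, 3, 3, 1, 1, 2],
}
HYPOTHETICAL_THRESHOLD = 0.6  # not taken from the paper
toy_kappa = mean_pairwise_kappa(judges, n_categories=5)
toy_gate = passes_gate(toy_kappa, HYPOTHETICAL_THRESHOLD)
```

Under this hypothetical threshold, the reported 0.7168 would clear the gate. The point of the sketch is only that the gate is a comparison on an agreement statistic, computed before any score is interpreted.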

But the gate also narrows what can be claimed. Lower-ranked systems collapse into a tie class. One rubric dimension, constraint_awareness, fails the per-dimension gate at 0.2022. Single-judge rankings depend on judge family. Terse-correct rationales receive a large penalty relative to honest rationales, showing that a rubric can mistake rhetorical coverage for financial judgment.
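One way to picture a tie class is to group systems whose score intervals overlap, so that a ranking is only asserted between groups. The sketch below is an assumption about how such grouping could work, not the paper's procedure: the interval-overlap rule, the function name, and the system names and scores are all invented for illustration.

```python
def tie_classes(scores):
    """Group systems into tie classes, best first.

    scores maps a system name to (mean_score, half_width), where half_width is
    the half-width of some uncertainty interval around the mean.
    ASSUMPTION: a system joins the current class when its interval overlaps the
    interval of any system already in that class.
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1][0], reverse=True)
    classes = []
    lowest_bound = None  # lowest lower bound among members of the current class
    for name, (mean, half_width) in ordered:
        high, low = mean + half_width, mean - half_width
        if classes and high >= lowest_bound:
            classes[-1].append(name)
            lowest_bound = min(lowest_bound, low)
        else:
            classes.append([name])
            lowest_bound = low
    return classes


# Hypothetical systems and scores, invented for this example.
example_scores = {
    "system_a": (0.80, 0.02),
    "system_b": (0.70, 0.02),
    "system_c": (0.55, 0.05),
    "system_d": (0.52, 0.05),
    "system_e": (0.50, 0.05),
}
example_classes = tie_classes(example_scores)
```

In this toy data the top two systems stay separable while the bottom three merge into one class, which mirrors the pattern described above: a leaderboard can support "A beats B" and still not support any ordering among the lower-ranked systems.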

What the Benchmark Allows

The benchmark does not merely rank systems. It decides which claims are allowed.

Some claims can be published. Some should be qualified. Some should be refused until the measurement instrument improves. That is the useful posture: the benchmark is not just a scoreboard, but a claim filter.
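The claim-filter posture can be sketched as a small decision function. This is a minimal illustration, not the paper's decision rule: the `Permission` and `Evidence` names, the rule ordering, and the 0.6 threshold are hypothetical. Only the two agreement figures in the usage lines come from the results reported above.

```python
from dataclasses import dataclass
from enum import Enum


class Permission(Enum):
    PUBLISH = "publish"
    QUALIFY = "qualify"
    REFUSE = "refuse"


@dataclass(frozen=True)
class Evidence:
    """Evidence behind one claim (hypothetical structure for this sketch)."""
    agreement: float           # agreement statistic for the instrument behind the claim
    known_caveats: tuple = ()  # e.g. ("judge-family dependence",)


def permit(evidence, threshold):
    """Map evidence to a claim permission.

    ASSUMED rule: refuse when agreement is below the threshold, qualify when the
    gate clears but caveats are known, and publish otherwise.
    """
    if evidence.agreement < threshold:
        return Permission.REFUSE
    if evidence.known_caveats:
        return Permission.QUALIFY
    return Permission.PUBLISH


HYPOTHETICAL_THRESHOLD = 0.6  # not taken from the paper

# Headline agreement clears, but caveats are attached to the ranking claim.
headline = permit(
    Evidence(0.7168, ("lower-rank tie class", "judge-family dependence")),
    HYPOTHETICAL_THRESHOLD,
)
# A claim resting only on the weak rubric dimension.
weak_dimension = permit(Evidence(0.2022), HYPOTHETICAL_THRESHOLD)
```

The design choice worth noting is that the function returns a permission rather than a score. A number that clears the gate can still come back as "qualify" when the caveats list is not empty.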

What to Carry Forward

When outcomes are delayed, a good evaluation should say what its instrument is allowed to support. The relevant evidence is not just the final score, but the agreement structure, the adversarial controls, the weak dimensions, and the failure cases. ValueBlindBench is valuable because it makes overclaiming harder before the market has had time to answer.

Citation note

For citation: this paper supports the claim that LLM-judged investment-rationale evaluations require claim-permission gates before their results are citable. In a controlled finance prototype, ValueBlindBench authorizes only qualified claims, exposes lower-rank tie classes, flags a weak rubric dimension, and detects a severe penalty for terse-correct rationales.