Before Truth Arrives
Some AI evaluation claims become checkable only later. ValueBlindBench asks which claims are safe to cite now.

Some claims can be checked immediately. Finance often has to wait.
An investment rationale can be written today, but the return that validates it may arrive months later. During that gap, teams still need to compare models, debug behavior, and decide what is safe to say.
The paper, "ValueBlindBench: Agreement-Gated Stress Testing of LLM-Judged Investment Rationales Before Returns Are Observable," studies this waiting period. Its main thread is claim permission: before realized returns are available, which evaluation claims are stable enough to repeat, cite, or use in model development?
The Delayed-Truth Problem
Realized returns remain the ground truth that matters. They are also late, noisy, and hard to use for every iteration of model work.
LLM judges can help during that interval, but only if the judge itself is treated as an instrument. A judge that produces a score has not automatically produced evidence. It has produced something that must be checked for agreement, stability, judge-family dependence, anchor ambiguity, and adversarial failure.
The Agreement Gate
ValueBlindBench applies this instrument-first view to a controlled market-state capital-allocation prototype: 1,000 honest decision cycles plus 100 preregistered adversarial controls, for 1,100 trajectories in total, scored with 5,500 judge calls (an average of five per trajectory).
At the headline level, the agreement gate clears. The paper reports a mean weighted kappa of 0.7168.
But the gate also narrows what can be claimed. Lower-ranked systems collapse into a tie class. One rubric dimension, constraint_awareness, fails the per-dimension gate with an agreement score of 0.2022. Single-judge rankings depend on judge family. Terse-correct rationales receive a large penalty relative to honest rationales, showing that a rubric can mistake rhetorical coverage for financial judgment.
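To make the agreement statistic concrete, here is a minimal sketch of a mean weighted kappa across judges. It rests on assumptions that the text above does not confirm: quadratic disagreement weights, Cohen's kappa computed for each pair of judges and then averaged, and an ordinal 1 to 5 rubric scale. The function names and the example scores are hypothetical, and the paper's own computation may differ.

```python
from itertools import combinations


def weighted_kappa(a, b, categories, power=2):
    """Cohen's weighted kappa for two raters on an ordinal scale.

    power=2 gives quadratic weights (an assumption), power=1 gives linear weights.
    """
    if len(a) != len(b) or not a:
        raise ValueError("both raters must score the same non-empty set of items")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two categories")
    index = {c: i for i, c in enumerate(categories)}
    n = len(a)

    # Observed joint distribution of the two raters' scores.
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[index[x]][index[y]] += 1.0 / n

    # Marginal distributions, used for the chance-expected disagreement.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = (abs(i - j) / (k - 1)) ** power
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * row[i] * col[j]

    if expected_disagreement == 0.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - observed_disagreement / expected_disagreement


def mean_pairwise_kappa(scores_by_judge, categories):
    """Average the weighted kappa over every pair of judges."""
    pairs = list(combinations(sorted(scores_by_judge), 2))
    total = sum(
        weighted_kappa(scores_by_judge[p], scores_by_judge[q], categories)
        for p, q in pairs
    )
    return total / len(pairs)


# Illustrative scores only (not data from the paper): three judges, four rationales.
example = {
    "judge_a": [1, 3, 4, 5],
    "judge_b": [1, 3, 5, 5],
    "judge_c": [2, 3, 4, 4],
}
print(round(mean_pairwise_kappa(example, categories=[1, 2, 3, 4, 5]), 4))
```

The same function can be run once per rubric dimension, which is how a headline value can clear a gate while a single dimension falls well below it.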
What the Benchmark Allows
The benchmark does not merely rank systems. It decides which claims are allowed.
Some claims can be published. Some should be qualified. Some should be refused until the measurement instrument improves. That is the useful posture: the benchmark is not just a scoreboard, but a claim filter.
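One way to picture a claim filter is as a small function that maps a claim and the agreement evidence to a permission. The sketch below is illustrative only: the `Claim` fields, the `permit` rules, and the 0.6 threshold are hypothetical and are not taken from the paper. Only the two kappa values (0.7168 and 0.2022) and the dimension name come from the results quoted above.

```python
from dataclasses import dataclass
from enum import Enum


class Permission(Enum):
    PUBLISH = "publish"
    QUALIFY = "qualify"
    REFUSE = "refuse"


@dataclass
class Claim:
    text: str
    dimensions: tuple = ()            # rubric dimensions the claim relies on
    needs_full_ranking: bool = False  # True if it orders systems inside a tie class


def permit(claim, headline_kappa, dimension_kappa, threshold, has_tie_class):
    """Hypothetical gating rules: refuse on weak agreement, qualify on ties."""
    if headline_kappa < threshold:
        return Permission.REFUSE
    # A dimension with no measured agreement is treated as failing the gate.
    weak = [
        d for d in claim.dimensions
        if dimension_kappa.get(d, float("-inf")) < threshold
    ]
    if weak:
        return Permission.REFUSE
    if claim.needs_full_ranking and has_tie_class:
        return Permission.QUALIFY
    return Permission.PUBLISH


HYPOTHETICAL_THRESHOLD = 0.6  # assumption; the gate value is not given in the text
reported = {"constraint_awareness": 0.2022}  # per-dimension value quoted above

claims = [
    Claim("Judges agree at the headline level."),
    Claim("System A ranks above system B in the lower tier.", needs_full_ranking=True),
    Claim("System A is more constraint-aware.", dimensions=("constraint_awareness",)),
]
for claim in claims:
    decision = permit(claim, 0.7168, reported, HYPOTHETICAL_THRESHOLD, has_tie_class=True)
    print(decision.value, "-", claim.text)
```

Under these assumed rules, a claim is refused as soon as it leans on a dimension whose agreement is too low, even when the headline gate has cleared.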
What to Carry Forward
When outcomes are delayed, a good evaluation should say what its instrument is allowed to support. The relevant evidence is not just the final score, but the agreement structure, the adversarial controls, the weak dimensions, and the failure cases. ValueBlindBench is valuable because it makes overclaiming harder before the market has had time to answer.
Citation note
For citation: this paper supports the claim that LLM-judged investment-rationale evaluations require claim-permission gates before their results are citable. In a controlled finance prototype, ValueBlindBench authorizes only qualified claims, exposes lower-rank tie classes, flags a weak rubric dimension, and detects a severe penalty for terse-correct rationales.
Links
- Paper: arXiv:2604.25224
- PDF: arxiv.org/pdf/2604.25224
