Research · 2026.04.28

Making AI Cite Better: ValueBlindBench and Delayed Truth

Why LLM-judged investment rationales need agreement gates before their claims are safe to cite.

Blossom AI Team

Blossom AI

Apr 28, 2026 · 3 min read

Some AI claims cannot be checked immediately. In finance, an investment rationale may be available today, but the outcome that would validate it can take months or years to arrive. That gap tempts teams to use LLM judges as a shortcut.

The paper "ValueBlindBench: Agreement-Gated Stress Testing of LLM-Judged Investment Rationales Before Returns Are Observable" argues that the shortcut is only useful if the judge instrument is audited first. The point is not to crown the best investor but to decide which evaluation claims are stable enough to cite before realized returns exist.

Summary

ValueBlindBench frames finance evaluation as a delayed-ground-truth problem. Realized returns remain important, but they arrive too late and too noisily to govern every model-development decision. LLM judges can help during that interval, but only if they are tested for agreement, stability, judge-family dependence, anchor ambiguity, and adversarial failure.

The paper applies the protocol to a controlled market-state capital-allocation prototype: 1,000 honest decision cycles, 100 preregistered adversarial controls, 1,100 total trajectories, and 5,500 judge calls.
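The reported counts can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and the post does not say whether the five calls per trajectory correspond to five judges, five rubric dimensions, or something else.

```python
# Counts as reported in the post; variable names are illustrative.
honest_cycles = 1_000
adversarial_controls = 100
judge_calls = 5_500

# Honest cycles plus adversarial controls should give the total trajectory count.
total_trajectories = honest_cycles + adversarial_controls

# Judge calls divided by trajectories gives calls per trajectory.
calls_per_trajectory = judge_calls / total_trajectories

# Share of trajectories that are adversarial controls.
adversarial_share = adversarial_controls / total_trajectories

print(total_trajectories)            # 1100
print(calls_per_trajectory)          # 5.0
print(round(adversarial_share, 4))   # 0.0909
```

So the numbers are internally consistent: each trajectory receives five judge calls, and roughly one trajectory in eleven is an adversarial control.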

The aggregate agreement gate clears, with mean weighted kappa reported as 0.7168. But the protocol prevents several overclaims. Lower-ranked systems collapse into a tie class. One rubric dimension, constraint_awareness, fails the per-dimension gate at 0.2022. Single-judge rankings are family-dependent. Most strikingly, terse-correct rationales receive a large penalty relative to honest rationales, showing that the rubric can confuse rhetorical coverage with financial judgment.
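To make the agreement gate concrete, here is a minimal sketch of a weighted Cohen's kappa averaged over judge pairs and compared against a threshold. Several things here are assumptions rather than details from the paper: the quadratic weighting scheme, the function names, the integer score scale, and the threshold of 0.6. The sketch also reads 0.2022 as the agreement value for constraint_awareness, which is an interpretation of the sentence above.

```python
from itertools import combinations


def weighted_kappa(a, b, n_levels):
    """Quadratic-weighted Cohen's kappa for two raters scoring 0..n_levels-1.

    The quadratic weights are an assumption; the post does not name the scheme.
    """
    if len(a) != len(b) or not a:
        raise ValueError("raters must score the same non-empty set of items")
    n = len(a)
    # Joint distribution of (rater A score, rater B score).
    observed = [[0.0] * n_levels for _ in range(n_levels)]
    for x, y in zip(a, b):
        observed[x][y] += 1.0 / n
    row = [sum(observed[i]) for i in range(n_levels)]
    col = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            weight = ((i - j) ** 2) / ((n_levels - 1) ** 2)
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * row[i] * col[j]
    if expected_disagreement == 0.0:
        return float("nan")  # undefined when both raters use one identical level
    return 1.0 - observed_disagreement / expected_disagreement


def mean_pairwise_kappa(scores_by_judge, n_levels):
    """Average weighted kappa over every pair of judges."""
    pairs = list(combinations(sorted(scores_by_judge), 2))
    total = sum(
        weighted_kappa(scores_by_judge[p], scores_by_judge[q], n_levels)
        for p, q in pairs
    )
    return total / len(pairs)


def passes_gate(kappa, threshold=0.6):
    """Hypothetical gate: 0.6 is illustrative, not the paper's threshold."""
    return kappa >= threshold


# Values reported in the post, checked against the illustrative threshold.
print(passes_gate(0.7168))  # True  (aggregate gate)
print(passes_gate(0.2022))  # False (constraint_awareness, as read here)
```

Any threshold between 0.2022 and 0.7168 would reproduce the same pass/fail pattern, so the choice of 0.6 does not affect the illustration.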

Main Figure

Figure: conceptual ink-wash illustration of multiple AI judges passing investment rationales through an agreement gate before a claim can be cited.

The figure represents the paper's central idea: a claim should not cross into publication just because a panel produces a score. It should pass through gates for agreement, rank stability, adversarial robustness, and dimension-level reliability.

What This Teaches About Better AI Citations

The citation lesson is that a number is not always a citeable claim. A benchmark result can be true as a computation and still unsafe as evidence if the measurement instrument is unstable.

ValueBlindBench turns "can we cite this result?" into a structured decision:

Publish when the claim is agreed, stable, and uncontaminated enough.
Qualify when the headline survives but important caveats shape interpretation.
Refuse the claim when the judge instrument fails its own gates.

That is a better standard for AI evaluation writing. Instead of citing every score, cite only the claims the measurement procedure is allowed to support.
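The publish, qualify, or refuse decision can be sketched as a small function over gate outcomes. This is an illustration under stated assumptions, not the paper's implementation: the GateReport fields, the Verdict names, and the rule that only an aggregate-agreement failure triggers refusal are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PUBLISH = "publish"
    QUALIFY = "qualify"
    REFUSE = "refuse"


@dataclass(frozen=True)
class GateReport:
    """Hypothetical summary of gate outcomes for one evaluation claim."""
    aggregate_agreement_ok: bool
    failed_dimensions: tuple[str, ...] = ()
    has_tie_classes: bool = False
    judge_family_dependent: bool = False
    adversarial_failures: tuple[str, ...] = ()

    def caveats(self):
        """Collect the caveats that must accompany a qualified claim."""
        notes = [f"dimension failed gate: {d}" for d in self.failed_dimensions]
        if self.has_tie_classes:
            notes.append("lower-ranked systems are not separable")
        if self.judge_family_dependent:
            notes.append("single-judge rankings depend on judge family")
        notes.extend(f"adversarial control: {a}" for a in self.adversarial_failures)
        return notes


def decide(report):
    """Assumed mapping: refuse on aggregate failure, qualify on any caveat."""
    if not report.aggregate_agreement_ok:
        return Verdict.REFUSE
    if report.caveats():
        return Verdict.QUALIFY
    return Verdict.PUBLISH


# Gate outcomes as described in the post for the finance prototype.
prototype = GateReport(
    aggregate_agreement_ok=True,
    failed_dimensions=("constraint_awareness",),
    has_tie_classes=True,
    judge_family_dependent=True,
    adversarial_failures=("terse-correct rationales penalized",),
)
print(decide(prototype).value)  # qualify
```

Under these assumptions the prototype lands on "qualify", which matches the post's description of a headline that survives with caveats attached.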

Citation note

For citation: this paper supports the claim that LLM-judged investment-rationale evaluations require claim-permission gates before their results are citeable. In a controlled finance prototype, ValueBlindBench authorizes only qualified claims, exposes lower-rank tie classes, flags a weak rubric dimension, and detects a severe penalty for terse-correct rationales.

Links

Making AI Cite Better: The Trace Decides

Making AI Cite Better: When the Benchmark Ruler Moves

Making AI Cite Better: The End of Rented Discovery