Research · 2026.05.01

When the Benchmark Ruler Moves

A benchmark can have gold labels and still produce unstable evidence if the rubric, metric, or ranking rule moves.

Blossom AI Team
Blossom AI
May 1, 2026

Ink-wash scholar table beside water and reeds, suggesting measurement instruments dissolving into the landscape they measure.

Gold labels help. They do not make a benchmark self-explanatory.

A benchmark result is produced by a pipeline: the dataset split, the prompt or rubric, the metric, the aggregation rule, and the class distribution underneath it all. If those choices move, the claim can move too.

The paper "Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICR" studies that problem in a setting where ground truth exists. Its main thread is measurement risk: what happens when a benchmark has labels, but the measurement contract still changes the result?

The Task

The paper audits Japanese Financial Implicit-Commitment Recognition, or JF-ICR. The task is to classify whether a corporate answer in an investor-relations Q&A exchange expresses refusal, neutrality, weak commitment, or strong commitment.

That label can matter in disclosure monitoring, management-candor analysis, and risk review. It is also a subtle language task. A literal reading and a pragmatic reading can disagree about whether a statement is truly committing to future action.

Rubric Sensitivity

The paper finds that rubric wording moves labels. Agreement between literal and pragmatic rubric variants ranges from 70.0% to 83.4% across models.

Much of the movement occurs around the weak-commitment / neutral boundary. That is exactly where financial language often lives: careful enough to avoid overpromising, but specific enough to signal intent.
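The agreement figure above can be sketched as a simple comparison of one model's labels under two rubric wordings. The function name, label strings, and toy labels below are illustrative assumptions, not the paper's code or data.

```python
from collections import Counter

def rubric_agreement(literal, pragmatic):
    """Return (agreement_rate, flips) for one model labeled under two rubrics.

    agreement_rate: share of items that get the same label under both rubrics.
    flips: Counter of (literal_label, pragmatic_label) pairs that disagree.
    """
    if len(literal) != len(pragmatic) or not literal:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(literal, pragmatic))
    matches = sum(a == b for a, b in pairs)
    flips = Counter((a, b) for a, b in pairs if a != b)
    return matches / len(pairs), flips

# Toy labels, invented for illustration only.
literal = ["neutral", "neutral", "weak_commitment", "strong_commitment", "refusal"]
pragmatic = ["weak_commitment", "neutral", "weak_commitment", "strong_commitment", "refusal"]

rate, flips = rubric_agreement(literal, pragmatic)
print(f"agreement: {rate:.1%}")
print(flips.most_common())
```

Counting the disagreeing pairs, not just the rate, is what shows where the movement concentrates, for example on the neutral / weak-commitment boundary.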

Metric Identifiability

Not every plausible metric can bear the same evidentiary weight.

Within-one accuracy is too forgiving because near misses receive credit and the majority class dominates. Worst-class accuracy is too noisy because the rarest class has only two examples. The paper therefore treats exact accuracy, macro-F1, and weighted kappa as the primary ranking metrics under its operational rule.
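A small sketch can make the two failure modes concrete. It assumes the four labels are encoded as ordinals 0 to 3 (refusal, neutral, weak, strong) and uses an invented ten-item class distribution in which the rarest class has two examples; none of this is the paper's data. Weighted kappa is left out because the post does not state the weighting scheme.

```python
def exact_accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def within_one_accuracy(y_true, y_pred):
    # Credit any prediction at most one ordinal step from the gold label.
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)

def worst_class_accuracy(y_true, y_pred):
    # Minimum per-class recall over the classes present in the gold labels.
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return min(recalls)

def macro_f1(y_true, y_pred):
    # Unweighted mean of per-class F1 over classes seen in gold or predictions.
    f1s = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Toy gold labels: six neutral (1), two weak (2), two strong (3).
y_true = [1, 1, 1, 1, 1, 1, 2, 2, 3, 3]

# A degenerate model that always predicts the majority class.
majority_only = [1] * len(y_true)

# A near-perfect model that misses exactly one rare-class item.
one_miss = y_true[:-1] + [2]

for name, y_pred in [("majority_only", majority_only), ("one_miss", one_miss)]:
    print(
        name,
        f"exact={exact_accuracy(y_true, y_pred):.2f}",
        f"within_one={within_one_accuracy(y_true, y_pred):.2f}",
        f"worst_class={worst_class_accuracy(y_true, y_pred):.2f}",
        f"macro_f1={macro_f1(y_true, y_pred):.2f}",
    )
```

On this toy data, the majority-only predictor reaches 0.80 within-one accuracy despite never predicting a commitment, and a single missed item drops worst-class accuracy from 1.0 to 0.5 while exact accuracy moves by only 0.1.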

Ranking After Measurement

Aggregation becomes more defensible only after the metric audit. On the identifiable metric subset, Bradley-Terry, Borda, and Ranked Pairs agree. When all five metrics are swept in, the ordering of the closest pair of models becomes less stable.
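To illustrate how the choice of metric set can move a close pair, here is a minimal Borda-style sketch. The model names and every score are invented for illustration and do not reproduce the paper's results; the tie handling (a model earns one point per model it strictly outscores) is an assumption that matches a standard Borda count only when there are no ties.

```python
PRIMARY = ["exact_accuracy", "macro_f1", "weighted_kappa"]
ALL_FIVE = PRIMARY + ["within_one_accuracy", "worst_class_accuracy"]

# Hypothetical scores: model_a and model_b are a close pair.
scores = {
    "model_a": {"exact_accuracy": 0.72, "macro_f1": 0.61, "weighted_kappa": 0.64,
                "within_one_accuracy": 0.93, "worst_class_accuracy": 0.5},
    "model_b": {"exact_accuracy": 0.70, "macro_f1": 0.59, "weighted_kappa": 0.66,
                "within_one_accuracy": 0.95, "worst_class_accuracy": 1.0},
    "model_c": {"exact_accuracy": 0.60, "macro_f1": 0.50, "weighted_kappa": 0.52,
                "within_one_accuracy": 0.90, "worst_class_accuracy": 0.0},
}

def borda_ranking(scores, metrics):
    """Rank models by Borda-style points summed over the given metrics."""
    points = {model: 0 for model in scores}
    for metric in metrics:
        for model in scores:
            # One point for every other model strictly outscored on this metric.
            points[model] += sum(
                scores[other][metric] < scores[model][metric]
                for other in scores
                if other != model
            )
    # Sort by points descending, then by name so the output is deterministic.
    ranking = sorted(points, key=lambda m: (-points[m], m))
    return ranking, points

primary_rank, primary_points = borda_ranking(scores, PRIMARY)
full_rank, full_points = borda_ranking(scores, ALL_FIVE)

print("primary subset:", primary_rank, primary_points)
print("all five metrics:", full_rank, full_points)
```

In this toy setup the top two models swap places once the two weakly identified metrics are added, which is the kind of instability a metric audit is meant to surface before any aggregation rule is applied.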

The result is not "benchmarks are bad." It is more useful than that: benchmark claims are conditional objects.

What to Carry Forward

A score should travel with the conditions that produced it: the exact benchmark artifact, the rubric wording, the metric-identifiability audit, the aggregation rule, and the scope of the rank claim. In financial NLP, that context is not administrative detail; it is part of the evidence.
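One way to make a score carry its conditions is to store them in the same record. The sketch below is a hypothetical schema: the class name, field names, and all placeholder values are assumptions for illustration, not a format defined by the paper.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ScoreRecord:
    """A benchmark score bundled with the measurement conditions behind it."""
    model: str
    metric: str
    value: float
    benchmark_artifact: str   # exact dataset version, split, or content hash
    rubric_id: str            # identifier for the exact rubric wording used
    metric_audited: bool      # did the metric pass an identifiability audit?
    aggregation_rule: str     # rule used if this score feeds a ranking
    claim_scope: str          # what the resulting rank claim does and does not cover

    def supports_ranking(self) -> bool:
        # Only audited metrics should be used to back a ranking claim.
        return self.metric_audited

# Placeholder values only; none of these come from the paper.
record = ScoreRecord(
    model="model_a",
    metric="macro_f1",
    value=0.61,
    benchmark_artifact="jf-icr/<version-or-hash>",
    rubric_id="pragmatic-v1",
    metric_audited=True,
    aggregation_rule="borda",
    claim_scope="pairwise order on the audited metric subset only",
)

print(asdict(record))
print("usable for ranking:", record.supports_ranking())
```

Freezing the dataclass keeps a reported score from being separated from, or silently edited apart from, the conditions that produced it.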

Citation note

For citation: this paper supports the claim that supervised financial NLP evaluations can remain measurement-sensitive even when gold labels exist. On JF-ICR, rubric wording materially changes model-assigned labels, metric identifiability changes which ranking claims are defensible, and aggregation agreement improves only after weakly identified metrics are removed.