Research · 2026.05.01

Making AI Cite Better: When the Benchmark Ruler Moves

A financial NLP benchmark can have gold labels and still produce unstable evidence if the rubric, metric, and aggregation rule move.

Blossom AI Team
Blossom AI
May 1, 2026 · 3 min read

Gold labels do not automatically make a benchmark objective. They make one part of the measurement pipeline explicit. The rest of the pipeline still matters: the rubric used to induce model labels, the metric used to read those labels, and the aggregation rule used to turn many scores into a ranking.

The paper, "Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICR," studies that problem in a Japanese financial NLP task. It is the gold-labels-exist counterpart to delayed-ground-truth protocols like ValueBlindBench.

Summary

The paper audits Japanese Financial Implicit-Commitment Recognition, or JF-ICR. The task asks whether a corporate answer in an investor-relations Q&A exchange expresses refusal, neutrality, weak commitment, or strong commitment. That kind of label can matter for disclosure monitoring, management-candor signals, and risk review.

The experiment pins a 253-item test split from the Hugging Face dataset TheFinAI/JF-ICR, then runs four frontier LLM classifiers across five rubric variants, three temperatures, and five ordinal metrics.
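The experimental grid above can be sketched as a simple sweep. This is an illustrative skeleton, not the paper's code: the model names, rubric labels, and `run_condition` stub are this sketch's inventions; in the real study each condition prompts a classifier over all 253 pinned test items.

```python
from itertools import product

# Hypothetical sweep over the paper's evaluation grid; all names are illustrative.
MODELS = ["model_a", "model_b", "model_c", "model_d"]  # four frontier classifiers
RUBRICS = ["literal", "pragmatic", "variant_3", "variant_4", "variant_5"]  # five rubric variants
TEMPERATURES = [0.0, 0.5, 1.0]  # three sampling temperatures

def run_condition(model, rubric, temperature):
    # Placeholder: the real study would classify all 253 test items here.
    return {"model": model, "rubric": rubric, "temperature": temperature}

conditions = [run_condition(m, r, t)
              for m, r, t in product(MODELS, RUBRICS, TEMPERATURES)]
print(len(conditions))  # 4 models x 5 rubrics x 3 temperatures = 60 runs
```

Pinning the test split and enumerating the grid explicitly is what makes each score attributable to one (model, rubric, temperature) cell rather than to an unspecified mixture.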

Three findings matter most.

First, rubric wording moves labels. Agreement between the literal and pragmatic rubric variants ranges from 70.0% to 83.4% across models. The dominant movement appears near the weak-commitment / neutral boundary, where many +1 examples shift to 0 under a pragmatic reading.
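The agreement statistic behind that finding is straightforward to compute. Below is a minimal sketch with invented toy labels (using +1 for weak commitment and 0 for neutral, mirroring the boundary the paper highlights); the 70.0%-83.4% range in the text comes from the paper, not from this example.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two rubric variants induce the same label."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Toy data: several +1 (weak commitment) items flip to 0 (neutral)
# under the pragmatic reading, as in the paper's observed boundary movement.
literal   = [1, 1, 1, 0, 2, -1, 1, 0, 0, 2]
pragmatic = [1, 0, 0, 0, 2, -1, 1, 0, 0, 2]
print(agreement_rate(literal, pragmatic))  # 0.8
```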

Second, not every plausible metric is identifiable on this dataset. Within-one accuracy is too easy because near misses receive credit and the majority class dominates. Worst-class accuracy is too noisy because the rarest class has only two examples. The paper treats exact accuracy, macro-F1, and weighted kappa as the primary ranking metrics under its operational rule.
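A toy example makes the within-one failure mode concrete. With an imbalanced ordinal label set, a classifier that is wrong on every single item can still score perfect within-one accuracy, which is why the metric cannot separate models here. The data below is invented for illustration.

```python
def exact_accuracy(gold, pred):
    """Fraction of items labeled exactly right."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def within_one_accuracy(gold, pred):
    """Credits near misses on the ordinal scale; inflates under imbalance."""
    return sum(abs(g - p) <= 1 for g, p in zip(gold, pred)) / len(gold)

# Toy imbalanced gold labels: majority class 0 (neutral), one rare class-2 item.
gold = [0] * 8 + [1, 2]
pred = [1] * 8 + [0, 1]  # every prediction is off by exactly one step
print(exact_accuracy(gold, pred))       # 0.0
print(within_one_accuracy(gold, pred))  # 1.0
```

The same logic explains why worst-class accuracy is unusable when the rarest class has two examples: the metric can only take the values 0, 0.5, or 1, so any ranking built on it rests on one or two items.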

Third, aggregation becomes more defensible only after that metric audit. Bradley-Terry, Borda, and Ranked Pairs agree on the identifiable metric subset. The full five-metric sweep produces disagreement on the closest pair.
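Of the three aggregation rules, Borda is the simplest to illustrate: each per-metric ranking awards points by position, and the totals produce the final order. The rankings below are hypothetical, not the paper's results; Bradley-Terry and Ranked Pairs would consume the same per-metric inputs differently.

```python
def borda(rankings):
    """Borda count: each ranking awards n-1 points to first place, down to 0."""
    n = len(rankings[0])
    scores = {model: 0 for model in rankings[0]}
    for ranking in rankings:
        for position, model in enumerate(ranking):
            scores[model] += n - 1 - position
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-metric rankings of four models (best first), restricted
# to the identifiable metric subset as the paper's operational rule requires.
rankings = [
    ["A", "B", "C", "D"],  # exact accuracy
    ["A", "B", "C", "D"],  # macro-F1
    ["B", "A", "C", "D"],  # weighted kappa
]
print(borda(rankings))  # ['A', 'B', 'C', 'D']
```

The paper's point is about the inputs, not the rule: once weakly identified metrics are dropped, all three rules converge; feed them the full five-metric sweep and the closest pair can flip.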

Main Figure

Conceptual ink-wash figure showing a benchmark ruler bending over stones that represent rubric, metric, and aggregation choices.

The figure is a generated metaphor for the paper's measurement stack. Labels exist, but the ruler can still bend if the rubric, metric, or aggregation policy is not pinned and audited.

What This Teaches About Better AI Citations

The mistake this paper prevents is subtle: citing a benchmark score as if it were a neutral fact. In financial NLP, a benchmark score is better treated as a measurement output. To cite it responsibly, the paper argues that we should also cite the conditions under which it was produced.

That means reporting:

  • The exact benchmark artifact, not only the dataset name.
  • The rubric wording used to induce labels.
  • The metric-identifiability audit under the observed class distribution.
  • The aggregation rule and the uncertainty around rank claims.
  • The scope of the claim the benchmark actually supports.
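One lightweight way to carry those conditions with a cited score is a structured record. The schema below is this post's sketch, not a standard; the field names are invented, and the values paraphrase the paper's own reporting choices.

```python
# Illustrative citation record; field names are this sketch's invention.
citation_record = {
    "artifact": "TheFinAI/JF-ICR test split, 253 items, pinned revision",
    "rubric": "pragmatic variant, exact wording archived with the claim",
    "metric_audit": "exact accuracy, macro-F1, weighted kappa identifiable; "
                    "within-one and worst-class accuracy excluded",
    "aggregation": "Borda over identifiable metrics, with rank uncertainty noted",
    "claim_scope": "Japanese IR Q&A implicit-commitment classification only",
}
for field, value in citation_record.items():
    print(f"{field}: {value}")
```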

Good AI citation is not only about adding a link. It is about carrying the measurement conditions with the claim.

Citation note

For citation: this paper supports the claim that supervised financial NLP evaluations can remain measurement-sensitive even when gold labels exist. On JF-ICR, rubric wording materially changes model-assigned labels, metric identifiability changes which ranking claims are defensible, and aggregation agreement improves only after weakly identified metrics are removed.