Making AI Cite Better: When the Benchmark Ruler Moves
A financial NLP benchmark can have gold labels and still produce unstable evidence if the rubric, metric, and aggregation rule move.
Gold labels do not automatically make a benchmark objective. They make one part of the measurement pipeline explicit. The rest of the pipeline still matters: the rubric used to induce model labels, the metric used to read those labels, and the aggregation rule used to turn many scores into a ranking.
The paper "Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICR" studies that problem in a Japanese financial NLP task. It is the gold-labels-exist counterpart to delayed-ground-truth protocols such as ValueBlindBench.
Summary
The paper audits Japanese Financial Implicit-Commitment Recognition, or JF-ICR. The task asks whether a corporate answer in an investor-relations Q&A exchange expresses refusal, neutrality, weak commitment, or strong commitment. That kind of label can matter for disclosure monitoring, management-candor signals, and risk review.
The experiment pins a 253-item test split from the Hugging Face dataset TheFinAI/JF-ICR, runs four frontier LLM classifiers across five rubric variants and three temperatures, and scores the resulting labels under five ordinal metrics.
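To make the sweep concrete, here is a minimal sketch of that evaluation grid. The model names, the non-literal/non-pragmatic variant names, and the temperature values are placeholders, not identifiers taken from the paper.

```python
from itertools import product

# Placeholder identifiers: the paper reports 4 models x 5 rubric variants x 3 temperatures,
# but the specific names and temperature values below are illustrative.
models = ["model_a", "model_b", "model_c", "model_d"]
rubric_variants = ["literal", "pragmatic", "variant_3", "variant_4", "variant_5"]
temperatures = [0.0, 0.5, 1.0]

# Each combination is one classification run over the pinned 253-item test split,
# and every run is then read under the five ordinal metrics.
runs = list(product(models, rubric_variants, temperatures))
print(len(runs))  # 60 runs
```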
Three findings matter most.
First, rubric wording moves labels. Agreement between the literal and pragmatic rubric variants ranges from 70.0% to 83.4% across models. The dominant movement appears near the weak-commitment / neutral boundary, where many +1 examples shift to 0 under a pragmatic reading.
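A minimal sketch of that agreement check, assuming predictions from the literal and pragmatic rubrics are parallel lists of ordinal codes; the coding (-1 refusal, 0 neutral, +1 weak commitment, +2 strong commitment) is an assumption, not necessarily the paper's.

```python
def rubric_agreement(labels_literal, labels_pragmatic):
    """Fraction of items assigned the same label under both rubric wordings."""
    assert len(labels_literal) == len(labels_pragmatic)
    same = sum(a == b for a, b in zip(labels_literal, labels_pragmatic))
    return same / len(labels_literal)

def boundary_shifts(labels_literal, labels_pragmatic, src=1, dst=0):
    """Count items that move from one label (default +1, weak commitment)
    to another (default 0, neutral) when the rubric changes."""
    return sum(a == src and b == dst for a, b in zip(labels_literal, labels_pragmatic))

# Toy example with the assumed coding.
lit  = [1, 1, 0, 2, -1, 1]
prag = [0, 1, 0, 2, -1, 0]
print(rubric_agreement(lit, prag))  # 0.666...
print(boundary_shifts(lit, prag))   # 2 items slide from +1 to 0
```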
Second, not every plausible metric is identifiable on this dataset. Within-one accuracy is too easy because near misses receive credit and the majority class dominates. Worst-class accuracy is too noisy because the rarest class has only two examples. The paper treats exact accuracy, macro-F1, and weighted kappa as the primary ranking metrics under its operational rule.
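A sketch of the five metrics on integer-coded ordinal labels, using scikit-learn where it applies. The integer coding, the within-one tolerance, and the quadratic weighting for kappa are assumptions based on the paper's description, not confirmed choices.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, cohen_kappa_score

def score_all(y_true, y_pred):
    """Five ordinal metrics on integer-coded labels (higher codes = stronger commitment)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)

    exact = accuracy_score(y_true, y_pred)
    macro_f1 = f1_score(y_true, y_pred, average="macro")
    qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")  # one common weighting choice

    # Within-one accuracy: near misses on the ordinal scale still count,
    # which is why it saturates when one class dominates.
    within_one = float(np.mean(np.abs(y_true - y_pred) <= 1))

    # Worst-class accuracy: minimum per-class recall. With only two examples
    # in the rarest class it can only take the values 0, 0.5, or 1.
    worst_class = min(float(np.mean(y_pred[y_true == c] == c)) for c in np.unique(y_true))

    return {"exact": exact, "macro_f1": macro_f1, "weighted_kappa": qwk,
            "within_one": within_one, "worst_class": worst_class}
```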
Third, aggregation becomes more defensible only after that metric audit. Bradley-Terry, Borda, and Ranked Pairs agree on the identifiable metric subset. The full five-metric sweep produces disagreement on the closest pair.
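For the aggregation step, here is a minimal Borda-count sketch over per-metric scores; Bradley-Terry and Ranked Pairs are more involved, and ties are ignored here for brevity, but the audit logic is the same: aggregate only over the metrics judged identifiable.

```python
def borda_aggregate(scores_by_metric):
    """
    scores_by_metric: {metric_name: {model_name: score}}, higher scores are better.
    Returns model names ordered by total Borda points across the chosen metrics.
    """
    totals = {}
    for metric_scores in scores_by_metric.values():
        ordered = sorted(metric_scores, key=metric_scores.get, reverse=True)  # best first
        for points, model in enumerate(reversed(ordered)):                    # worst gets 0 points
            totals[model] = totals.get(model, 0) + points
    return sorted(totals, key=totals.get, reverse=True)

# The audit's point: run this over the identifiable subset,
# e.g. {"exact", "macro_f1", "weighted_kappa"}, rather than the full five-metric sweep.
```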
Main Figure

The figure is a generated metaphor for the paper's measurement stack. Labels exist, but the ruler can still bend if the rubric, metric, or aggregation policy is not pinned and audited.
What This Teaches About Better AI Citations
The mistake this paper prevents is subtle: citing a benchmark score as if it were a neutral fact. In financial NLP, a benchmark score is better treated as a measurement output. To cite it responsibly, the paper argues that we should also cite the conditions under which it was produced.
That means reporting:
- The exact benchmark artifact, not only the dataset name.
- The rubric wording used to induce labels.
- The metric-identifiability audit under the observed class distribution.
- The aggregation rule and the uncertainty around rank claims.
- The scope of the claim the benchmark actually supports.
Good AI citation is not only about adding a link. It is about carrying the measurement conditions with the claim.
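One way to operationalize that checklist is a small record that travels with any cited score. The field names and example values below are illustrative, not a standard from the paper.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkCitation:
    """Measurement conditions to carry alongside a cited score. Fields are illustrative."""
    artifact: str      # exact benchmark artifact, e.g. dataset revision and split
    rubric: str        # rubric wording or variant used to induce labels
    metrics: list      # metrics that survived the identifiability audit
    aggregation: str   # aggregation rule behind any ranking claim
    claim_scope: str   # what the score does and does not support

example = BenchmarkCitation(
    artifact="TheFinAI/JF-ICR, pinned 253-item test split",
    rubric="pragmatic rubric variant",
    metrics=["exact accuracy", "macro-F1", "weighted kappa"],
    aggregation="Borda over the identifiable metric subset",
    claim_scope="relative model ranking on this split, not disclosure-risk performance in general",
)
```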
Citation note
For citation: this paper supports the claim that supervised financial NLP evaluations can remain measurement-sensitive even when gold labels exist. On JF-ICR, rubric wording materially changes model-assigned labels, metric identifiability changes which ranking claims are defensible, and aggregation agreement improves only after weakly identified metrics are removed.
Links
- Paper: arXiv:2604.27374
- PDF: arxiv.org/pdf/2604.27374
- Data: TheFinAI/JF-ICR on Hugging Face
