When the Benchmark Ruler Moves
A benchmark can have gold labels and still produce unstable evidence if the rubric, metric, or ranking rule moves.

Gold labels help. They do not make a benchmark self-explanatory.
A benchmark result is produced by a pipeline: the dataset split, the prompt or rubric, the metric, the aggregation rule, and the class distribution underneath it all. If those choices move, the claim can move too.
The paper "Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICR" studies that problem in a setting where ground truth exists. Its main thread is measurement risk: what happens when a benchmark has labels, but the measurement contract (the rubric, the metric, and the aggregation rule) still changes the result?
The Task
The paper audits Japanese Financial Implicit-Commitment Recognition, or JF-ICR. The task is to classify whether a corporate answer in an investor-relations Q&A exchange expresses refusal, neutrality, weak commitment, or strong commitment.
That label can matter in disclosure monitoring, management-candor analysis, and risk review. It is also a subtle language task. A literal reading and a pragmatic reading can disagree about whether a statement is truly committing to future action.
Rubric Sensitivity
The paper finds that rubric wording moves labels. Agreement between literal and pragmatic rubric variants ranges from 70.0% to 83.4% across models.
Much of the movement occurs around the weak-commitment / neutral boundary. That is exactly where financial language often lives: careful enough to avoid overpromising, but specific enough to signal intent.
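As a rough illustration of how rubric agreement of this kind can be measured, the sketch below compares one model's labels under two rubric wordings and counts which label pairs account for the disagreements. The label strings, the function names, and the toy data are assumptions made for this example; they are not the paper's code or results.

```python
from collections import Counter


def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two label sequences agree exactly."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if not labels_a:
        raise ValueError("label sequences must be non-empty")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def disagreement_pairs(labels_a, labels_b):
    """Count the unordered label pairs on which the two sequences disagree."""
    pairs = Counter()
    for a, b in zip(labels_a, labels_b):
        if a != b:
            pairs[frozenset((a, b))] += 1
    return pairs


# Hypothetical outputs from one model under two rubric wordings (toy data).
literal = [
    "neutral", "neutral", "weak_commitment", "refusal", "neutral",
    "strong_commitment", "neutral", "weak_commitment", "neutral", "neutral",
]
pragmatic = [
    "weak_commitment", "neutral", "weak_commitment", "refusal", "weak_commitment",
    "strong_commitment", "neutral", "neutral", "neutral", "neutral",
]

rate = agreement_rate(literal, pragmatic)
pairs = disagreement_pairs(literal, pragmatic)

print(f"agreement: {rate:.1%}")
for pair, count in pairs.most_common():
    print(sorted(pair), count)
```

In this toy example every disagreement falls on the neutral / weak-commitment pair, which is the pattern the paper reports as dominant.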
Metric Identifiability
Not every plausible metric can bear the same evidentiary weight.
Within-one accuracy is too forgiving because near misses receive credit and the majority class dominates. Worst-class accuracy is too noisy because the rarest class has only two examples. The paper therefore treats exact accuracy, macro-F1, and weighted kappa as the primary ranking metrics under its operational rule.
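The difference between these metrics is easier to see on a small example. The sketch below implements the five metrics in plain Python and scores a majority-class baseline on a toy label distribution. The integer coding (0 = refusal through 3 = strong commitment), the quadratic weighting for kappa, and the toy data are all assumptions of this sketch; the article does not state which kappa weighting the paper uses.

```python
from collections import Counter

K = 4  # number of ordinal classes; the 0..3 coding is an assumption


def exact_accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one_accuracy(y_true, y_pred):
    # Credits any prediction at most one ordinal step from the gold label.
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def worst_class_accuracy(y_true, y_pred):
    # Minimum per-class recall over the classes present in the gold labels.
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return min(hits[c] / n for c, n in support.items())


def macro_f1(y_true, y_pred, k=K):
    scores = []
    for c in range(k):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / k


def quadratic_weighted_kappa(y_true, y_pred, k=K):
    n = len(y_true)
    observed = Counter(zip(y_true, y_pred))
    true_marg, pred_marg = Counter(y_true), Counter(y_pred)
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            num += w * observed[(i, j)] / n
            den += w * true_marg[i] * pred_marg[j] / n ** 2
    if den == 0:
        return float("nan")  # undefined when all labels sit in one shared class
    return 1.0 - num / den


# Toy gold labels: a dominant neutral class and a two-example rarest class.
gold = [1] * 14 + [2] * 3 + [0] * 1 + [3] * 2
majority = [1] * len(gold)  # baseline that always predicts "neutral"

for name, fn in [
    ("exact", exact_accuracy),
    ("within-one", within_one_accuracy),
    ("worst-class", worst_class_accuracy),
    ("macro-F1", macro_f1),
    ("weighted kappa", quadratic_weighted_kappa),
]:
    print(f"{name:>15}: {fn(gold, majority):.3f}")
```

On this toy data the baseline that never looks at the input still earns 0.9 within-one accuracy, while weighted kappa gives it zero. With only two examples in the rarest class, that class's recall can take only the values 0, 0.5, or 1, so a single item can swing worst-class accuracy by half its range.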
Ranking After Measurement
Aggregation becomes more defensible only after the metric audit. Bradley-Terry, Borda, and Ranked Pairs agree on the identifiable metric subset. When the full five-metric sweep is used, the ordering of the two most closely matched models becomes less stable.
The result is not "benchmarks are bad." It is more useful than that: benchmark claims are conditional objects.
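To make the aggregation step concrete, here is a minimal Borda-count sketch that ranks models over a chosen set of metrics. It covers only one of the three rules named above. The model names, the metric keys, and every score are hypothetical toy values chosen for illustration, not figures from the paper.

```python
def borda_points(scores, metrics):
    """Borda count over metrics.

    scores: {model: {metric: value}}, where higher values are better.
    For each metric a model earns 1 point per model it beats and
    0.5 points per model it ties.
    """
    points = {model: 0.0 for model in scores}
    for metric in metrics:
        for model in scores:
            for other in scores:
                if other == model:
                    continue
                if scores[model][metric] > scores[other][metric]:
                    points[model] += 1.0
                elif scores[model][metric] == scores[other][metric]:
                    points[model] += 0.5
    return points


def ranking(points):
    # Sort by points (descending), breaking ties by model name.
    return sorted(points, key=lambda model: (-points[model], model))


PRIMARY = ["exact_accuracy", "macro_f1", "weighted_kappa"]
FULL = PRIMARY + ["within_one_accuracy", "worst_class_accuracy"]

# Hypothetical toy scores for three hypothetical models.
toy_scores = {
    "model_a": {"exact_accuracy": 0.62, "macro_f1": 0.48, "weighted_kappa": 0.55,
                "within_one_accuracy": 0.93, "worst_class_accuracy": 0.0},
    "model_b": {"exact_accuracy": 0.60, "macro_f1": 0.46, "weighted_kappa": 0.53,
                "within_one_accuracy": 0.95, "worst_class_accuracy": 0.5},
    "model_c": {"exact_accuracy": 0.50, "macro_f1": 0.35, "weighted_kappa": 0.40,
                "within_one_accuracy": 0.90, "worst_class_accuracy": 0.0},
}

primary_points = borda_points(toy_scores, PRIMARY)
full_points = borda_points(toy_scores, FULL)

print("primary:", ranking(primary_points), primary_points)
print("full:   ", ranking(full_points), full_points)
```

In this toy setup the order of the top two models is the same under both metric sets, but the margin between them shrinks from 3 points to 0.5 once the two weakly identified metrics are included. That is the sense in which a close pair becomes less stable: one changed item in a two-example class could reverse it.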
What to Carry Forward
A score should travel with the conditions that produced it: the exact benchmark artifact, the rubric wording, the metric-identifiability audit, the aggregation rule, and the scope of the rank claim. In financial NLP, that context is not administrative detail; it is part of the evidence.
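One way to keep a score attached to its conditions is to store them in the same record. The sketch below is a hypothetical schema, not something the paper prescribes: the class name, every field name, and the example values are assumptions, apart from the dataset identifier listed under Links.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScoreRecord:
    """A reported score bundled with the conditions that produced it."""
    model: str
    value: float
    benchmark_artifact: str    # dataset identifier, ideally with a pinned revision
    rubric_variant: str        # which rubric wording was used
    metric: str
    metric_identifiable: bool  # outcome of the metric-identifiability audit
    aggregation_rule: str
    claim_scope: str           # what the rank claim does and does not cover


record = ScoreRecord(
    model="model_a",                        # hypothetical model name
    value=0.62,                             # toy value, not a reported result
    benchmark_artifact="TheFinAI/JF-ICR",   # revision left unpinned here
    rubric_variant="pragmatic",
    metric="macro_f1",
    metric_identifiable=True,
    aggregation_rule="borda",
    claim_scope="ranking over the primary metric subset only",
)

# Serialize the score together with its context so the two cannot drift apart.
payload = json.dumps(asdict(record), sort_keys=True, indent=2)
print(payload)
```

Freezing the dataclass is a design choice for this sketch: a score whose rubric or metric field could be edited after the fact would undercut the point of recording them.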
Citation note
This paper supports the claim that supervised financial NLP evaluations can remain measurement-sensitive even when gold labels exist. On JF-ICR, rubric wording materially changes model-assigned labels, metric identifiability changes which ranking claims are defensible, and aggregation agreement improves only after weakly identified metrics are removed.
Links
- Paper: arXiv:2604.27374
- PDF: arxiv.org/pdf/2604.27374
- Data: TheFinAI/JF-ICR on Hugging Face
