When the Trace Decides
An agent can hit the outcome metric while learning the wrong behavior. Better evidence follows the trace.

Outcome metrics can be true and incomplete.
An AI agent may earn the target reward while learning a behavior that would make a domain expert uneasy. It may optimize the number while drifting away from the process the number was supposed to represent.
The paper "Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State" takes up that behavioral question. It asks what evidence shows that a pricing agent learned market-like behavior, not just a market-like score.
The Setting
The paper studies a two-hotel revenue-management simulator. Hotel A trains an agent against a fixed Hotel B competitor. The main business metric is RevPAR, or revenue per available room.
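To make the metric concrete, here is a minimal sketch of the standard hotel KPIs: RevPAR is room revenue divided by available rooms, which equals ADR (average daily rate) times occupancy. The function name and the two example nights are hypothetical and do not come from the paper's simulator.

```python
def hotel_kpis(room_revenue, rooms_sold, rooms_available):
    """Return occupancy, ADR, and RevPAR for one hotel over one period."""
    occupancy = rooms_sold / rooms_available
    # ADR is undefined when nothing is sold; report 0.0 in that case.
    adr = room_revenue / rooms_sold if rooms_sold else 0.0
    revpar = room_revenue / rooms_available  # equals adr * occupancy
    return {"occupancy": occupancy, "adr": adr, "revpar": revpar}


# Hypothetical nights: same revenue and capacity, different selling behavior.
discounter = hotel_kpis(room_revenue=9000.0, rooms_sold=90, rooms_available=100)
rate_holder = hotel_kpis(room_revenue=9000.0, rooms_sold=60, rooms_available=100)
```

Both hypothetical hotels report the same RevPAR, yet one fills rooms at a low rate and the other holds a high rate with fewer bookings. The scalar cannot tell them apart.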
At first glance, this looks like a reward-optimization problem. If the agent reaches competitive RevPAR, the system seems to work.
The trace says otherwise.
The Failure Mode
A standard learning agent can get close to the reference RevPAR while selling too aggressively, undercutting, or collapsing into modal price buckets.
A deterministic copy policy can even improve exact one-step action accuracy while making aggregate market alignment worse. The reason is that argmax copying collapses uncertainty that should remain unresolved.
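The accuracy-versus-alignment trade-off can be seen in a toy calculation. The sketch below assumes a single visible state with three price buckets and a made-up competitor distribution; it is an illustration of the argument, not the paper's experiment.

```python
import numpy as np

# Hypothetical distribution of the competitor's price bucket in one visible state.
market = np.array([0.5, 0.3, 0.2])

# Deterministic copy: always play the most likely bucket.
argmax_policy = np.zeros_like(market)
argmax_policy[np.argmax(market)] = 1.0

# Stochastic copy: sample from the market distribution itself.
sampling_policy = market.copy()


def expected_accuracy(policy, target):
    """Chance that an independent draw from policy equals a draw from target."""
    return float(np.dot(policy, target))


def l1_distance(policy, target):
    """Sum of absolute differences between two bucket distributions."""
    return float(np.abs(policy - target).sum())


acc_argmax = expected_accuracy(argmax_policy, market)
acc_sampling = expected_accuracy(sampling_policy, market)
gap_argmax = l1_distance(argmax_policy, market)
gap_sampling = l1_distance(sampling_policy, market)
```

Under these assumed numbers the argmax policy has the higher expected one-step accuracy, while its aggregate price distribution is far from the market's. The sampling policy gives up some accuracy and matches the distribution exactly.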
Hidden Competitor State
The problem is partial observability. Hotel A can see its own state and lagged market traces, but not Hotel B's remaining inventory, booking curve, or pricing rule.
The same visible state can correspond to several plausible competitor prices. The target is not one correct action. It is a distribution.
Trace-Prior RL
Trace-Prior RL turns that observation into a training rule. First, learn a distributional market prior from observed traces. Then train a stochastic pricing policy that optimizes Hotel A's own RevPAR while paying a KL penalty when its full action distribution drifts away from the learned market prior.
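A minimal sketch of a KL-regularized objective of this kind follows. It makes several assumptions that are not in the source: a single state with discrete price buckets, a known expected-RevPAR vector, the KL direction KL(policy || prior), and a penalty weight called `beta`. The paper trains a policy with RL; the closed-form maximizer here only applies to this simplified one-state case.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def regularized_objective(policy, expected_revpar, prior, beta):
    """Expected RevPAR minus beta times the drift from the market prior."""
    reward = float(np.dot(policy, expected_revpar))
    return reward - beta * kl_divergence(policy, prior)


def closed_form_policy(expected_revpar, prior, beta):
    """One-state maximizer: policy proportional to prior * exp(reward / beta).

    Assumes every entry of prior is strictly positive.
    """
    logits = np.log(prior) + np.asarray(expected_revpar, dtype=float) / beta
    logits -= logits.max()  # stabilize the exponentials
    weights = np.exp(logits)
    return weights / weights.sum()


# Hypothetical numbers for three price buckets.
prior = np.array([0.2, 0.5, 0.3])
expected_revpar = np.array([80.0, 90.0, 95.0])
greedy = np.array([0.0, 0.0, 1.0])

policy = closed_form_policy(expected_revpar, prior, beta=10.0)
```

With a large `beta` the policy stays near the prior; with a small `beta` it approaches the greedy bucket. The penalty acts on the whole action distribution, so the policy keeps probability on buckets the market also uses.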
In the selected result, the final policy matches Hotel B's RevPAR, occupancy, ADR, and price distribution within seed-level uncertainty. The paper reports a close final price-bucket distribution, with L1 distance 0.0196 and JS divergence 0.0001.
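For readers who want to compute the same kind of diagnostics on their own traces, here is a sketch of both measures over price-bucket histograms. The conventions are assumptions: L1 is taken as the plain sum of absolute differences (no factor of one half), and Jensen-Shannon divergence uses natural logarithms. The source does not state which conventions the paper uses, and the bucket counts below are invented.

```python
import numpy as np


def to_distribution(counts):
    """Normalize nonnegative bucket counts into a probability vector."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()


def l1_distance(p, q):
    """Sum of absolute differences between two distributions."""
    return float(np.abs(np.asarray(p) - np.asarray(q)).sum())


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence in nats; symmetric and at most ln(2)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        return np.sum(a * (np.log(a + eps) - np.log(b + eps)))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))


# Hypothetical price-bucket counts for the agent and the competitor.
agent = to_distribution([120, 310, 400, 170])
competitor = to_distribution([115, 320, 395, 170])

l1 = l1_distance(agent, competitor)
js = js_divergence(agent, competitor)
```

Both values are zero for identical histograms and grow as the agent's price mix moves away from the competitor's, which is why they complement RevPAR, occupancy, and ADR.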
What to Carry Forward
If the claim is "the agent earned competitive RevPAR," a RevPAR citation may be enough. If the claim is "the agent learned market-like yield management," the evidence needs more: occupancy, ADR, price-bucket distributions, divergence measures, seed uncertainty, and failure-mode diagnostics. The trace is where the behavioral claim either holds together or falls apart.
Citation note
This paper supports the claim that scalar reward metrics can hide behavioral failure in pricing agents under hidden competitor state. It diagnoses the failure with trace-level metrics and verifies Trace-Prior RL as a repair that preserves market-distribution uncertainty while still optimizing the agent's own RevPAR.
Links
- Paper: arXiv:2605.06529
- PDF: arxiv.org/pdf/2605.06529
