Research · 2026.05.08

When the Trace Decides

An agent can hit the outcome metric while learning the wrong behavior. Better evidence follows the trace.

Blossom AI Team, Blossom AI · May 8, 2026 · 3 min read

[Illustration: an ink-wash summer lake town with two inns, a ferry wake, and subtle ripples, suggesting market traces under partial observability.]

Outcome metrics can be true and incomplete.

An AI agent may earn the target reward while learning a behavior that would make a domain expert uneasy. It may optimize the number while drifting away from the process the number was supposed to represent.

The paper "Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State" takes up that behavioral question. It asks what evidence shows that a pricing agent learned market-like behavior, not just a market-like score.

The Setting

The paper studies a two-hotel revenue-management simulator. Hotel A trains an agent against a fixed Hotel B competitor. The main business metric is RevPAR, or revenue per available room.
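As a reminder of how the headline metric relates to the other quantities a trace exposes, here is a minimal sketch of the standard definitions: occupancy is rooms sold over rooms available, ADR (average daily rate) is revenue per room sold, and RevPAR is their product. The function name and the numbers are hypothetical and are not taken from the paper's simulator.

```python
def revenue_metrics(rooms_available: int, rooms_sold: int, room_revenue: float) -> dict:
    """Return occupancy, ADR, and RevPAR for one hotel over one period."""
    if rooms_available <= 0:
        raise ValueError("rooms_available must be positive")
    occupancy = rooms_sold / rooms_available
    # ADR (average daily rate) is revenue per room actually sold.
    adr = room_revenue / rooms_sold if rooms_sold else 0.0
    # RevPAR divides by rooms available, so it equals occupancy * ADR.
    revpar = room_revenue / rooms_available
    return {"occupancy": occupancy, "adr": adr, "revpar": revpar}


# Hypothetical numbers: same revenue and capacity, different selling behavior.
high_volume = revenue_metrics(rooms_available=100, rooms_sold=90, room_revenue=9000.0)
high_rate = revenue_metrics(rooms_available=100, rooms_sold=60, room_revenue=9000.0)
```

Both hypothetical hotels report the same RevPAR, but one fills 90% of its rooms at a lower rate and the other fills 60% at a higher rate. The scalar alone cannot tell them apart.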

At first glance, this looks like a reward-optimization problem. If the agent reaches competitive RevPAR, the system seems to work.

The trace says otherwise.

The Failure Mode

A standard learning agent can get close to the reference RevPAR while selling too aggressively, undercutting, or collapsing into modal price buckets.

A deterministic copy policy can even improve exact one-step action accuracy while making aggregate market alignment worse. The reason is that argmax copying collapses uncertainty that should remain unresolved.
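A toy calculation can make the argmax effect concrete. The sketch below assumes a single visible state with a made-up market distribution over three price buckets; none of the numbers come from the paper. It compares a policy that always plays the most likely bucket with one that samples from the market distribution.

```python
# Toy market distribution over three price buckets for one visible state.
# These probabilities are invented for illustration only.
market = [0.5, 0.3, 0.2]


def expected_accuracy(policy, market):
    """Probability that the policy's action equals the market's action
    when both are drawn independently from their distributions."""
    return sum(p * q for p, q in zip(policy, market))


def l1_distance(p, q):
    """Sum of absolute differences between two bucket distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Argmax copying: put all probability mass on the most likely bucket.
best = max(range(len(market)), key=lambda i: market[i])
argmax_policy = [1.0 if i == best else 0.0 for i in range(len(market))]

# Stochastic copying: act by sampling from the market distribution itself.
sampling_policy = list(market)

argmax_acc = expected_accuracy(argmax_policy, market)
sampling_acc = expected_accuracy(sampling_policy, market)
argmax_l1 = l1_distance(argmax_policy, market)
sampling_l1 = l1_distance(sampling_policy, market)
```

With these toy numbers, argmax copying matches the market's action 50% of the time against 38% for sampling, yet its long-run price distribution sits at L1 distance 1.0 from the market while the sampling policy sits at 0. Higher one-step accuracy and worse aggregate alignment can coexist.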

Hidden Competitor State

The problem is partial observability. Hotel A can see its own state and lagged market traces, but not Hotel B's remaining inventory, booking curve, or pricing rule.

The same visible state can correspond to several plausible competitor prices. The target is not one correct action. It is a distribution.
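One simple way to represent "a distribution per visible state" is a table of smoothed counts. The sketch below is a stand-in and not the paper's model: the state labels, the bucket indices, and the additive smoothing are all assumptions chosen to keep the example small.

```python
from collections import Counter, defaultdict


def fit_market_prior(traces, n_buckets, smoothing=1.0):
    """Estimate P(competitor price bucket | visible state) by counting.

    traces: iterable of (visible_state, price_bucket) pairs.
    smoothing: additive pseudo-count so unseen buckets keep nonzero mass.
    """
    counts = defaultdict(Counter)
    for state, bucket in traces:
        counts[state][bucket] += 1
    prior = {}
    for state, c in counts.items():
        total = sum(c.values()) + smoothing * n_buckets
        prior[state] = [(c[b] + smoothing) / total for b in range(n_buckets)]
    return prior


# Hypothetical traces: the same visible state appears with different prices.
traces = [
    ("weekend_high_demand", 2),
    ("weekend_high_demand", 2),
    ("weekend_high_demand", 1),
    ("weekday_low_demand", 0),
]
prior = fit_market_prior(traces, n_buckets=3)
```

Here `prior["weekend_high_demand"]` spreads mass over all three buckets instead of naming a single "correct" price, which is the shape of target the paragraph above describes.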

Trace-Prior RL

Trace-Prior RL turns that observation into a training rule. First, learn a distributional market prior from observed traces. Then train a stochastic pricing policy that optimizes Hotel A's own RevPAR while paying a KL penalty when its full action distribution drifts away from the learned market prior.
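The shape of that training rule can be sketched for a single state as expected RevPAR minus a weighted KL term. This is an illustration under stated assumptions, not the paper's implementation: the KL direction (policy relative to prior), the weight `beta`, and the per-bucket RevPAR estimates are all hypothetical, and a real agent would optimize this with an RL algorithm over many states.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same buckets."""
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


def regularized_objective(policy, expected_revpar, market_prior, beta):
    """Expected RevPAR under the policy minus beta * KL(policy || prior).

    policy, market_prior: action probabilities per price bucket.
    expected_revpar: hypothetical per-bucket RevPAR estimates.
    beta: hypothetical penalty weight.
    """
    reward = sum(p * r for p, r in zip(policy, expected_revpar))
    return reward - beta * kl_divergence(policy, market_prior)


# All values below are made up for illustration.
market_prior = [0.2, 0.5, 0.3]
expected_revpar = [80.0, 90.0, 95.0]
greedy = [0.0, 0.0, 1.0]         # all mass on the best-paying bucket
near_prior = [0.15, 0.45, 0.40]  # tilted toward reward, still close to the prior

scores = {
    beta: (
        regularized_objective(greedy, expected_revpar, market_prior, beta),
        regularized_objective(near_prior, expected_revpar, market_prior, beta),
    )
    for beta in (0.0, 50.0)
}
```

With no penalty (`beta = 0.0`) the greedy policy scores higher. With the penalty switched on, the policy that stays near the prior scores higher, because the penalty is computed on the full action distribution and not only on the most likely action.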

In the selected result, the final policy matches Hotel B's RevPAR, occupancy, ADR, and price distribution within seed-level uncertainty. The paper reports a close final price-bucket distribution, with L1 distance 0.0196 and JS divergence 0.0001.
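For readers who want to compute the same kind of distribution checks on their own traces, here is a minimal sketch of L1 distance and Jensen-Shannon divergence between two price-bucket distributions. The post does not state the paper's conventions, so the natural-log base and the unhalved L1 sum are assumptions, and the example distributions are hypothetical.

```python
import math


def l1_distance(p, q):
    """Sum of absolute differences between two bucket distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


def js_divergence(p, q):
    """Jensen-Shannon divergence with natural log (range 0 to ln 2)."""
    # The mixture m is positive wherever p or q is, so the logs are defined.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical price-bucket shares for the agent and the competitor.
agent = [0.24, 0.51, 0.25]
competitor = [0.25, 0.50, 0.25]
l1 = l1_distance(agent, competitor)
js = js_divergence(agent, competitor)
```

Both measures are zero only when the two distributions are identical, so they catch undercutting or bucket collapse that a matching RevPAR would hide.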

What to Carry Forward

If the claim is "the agent earned competitive RevPAR," a RevPAR citation may be enough. If the claim is "the agent learned market-like yield management," the evidence needs more: occupancy, ADR, price-bucket distributions, divergence measures, seed uncertainty, and failure-mode diagnostics. The trace is where the behavioral claim either holds together or falls apart.

Citation note

This paper supports the claim that scalar reward metrics can hide behavioral failure in pricing agents under hidden competitor state. It diagnoses the failure with trace-level metrics and evaluates Trace-Prior RL as a repair that preserves market-distribution uncertainty while still optimizing the agent's own RevPAR.