Research · 2026.05.18

Outcome Is Not Discipline

A strategic agent can hit the business KPI while failing to preserve the behavior that makes it deployable.

Blossom AI Team
Blossom AI · May 18, 2026 · 4 min read

[Image: Ink-wash landscape with two market pavilions, a hidden mountain state, and diverging water traces, suggesting outcome success without behavioral discipline.]

Outcome metrics can make the wrong agent look right.

A pricing system can earn plausible revenue while selling too aggressively. A bidding agent can win value while losing budget discipline. A routing system can meet an aggregate target while using a process no operator would want to deploy.

The paper "When Outcome Looks Right But Discipline Fails: Trace-Based Evaluation Under Hidden Competitor State" studies this problem as an evaluation failure. Its core claim is not that one optimizer is best. It is that strategic economic agents need to be evaluated by the traces of their behavior, not only by the KPI they achieve.

Discipline Stability

The paper introduces discipline stability: an empirical check that a learned policy preserves both the outcome and the behavioral structure of a benchmark system under the information regime where it will actually run.

In hotel pricing, the outcome is RevPAR. The discipline is rate discipline: how prices, occupancy, ADR, inventory, and price buckets evolve through time.

That distinction matters because RevPAR compresses the story. It can say the agent earned revenue. It cannot say whether the agent earned it by matching market-like yield management, undercutting, or collapsing into a few modal prices.
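To make the compression concrete, here is a minimal sketch using the standard identity RevPAR = occupancy × ADR. The `NightlyResult` class and every number in it are hypothetical illustrations, not values from the paper. Two very different nights land on the same RevPAR.

```python
from dataclasses import dataclass


@dataclass
class NightlyResult:
    """One night of results for one hotel (hypothetical container)."""
    rooms_available: int
    rooms_sold: int
    room_revenue: float

    @property
    def occupancy(self) -> float:
        # Share of available rooms that were sold.
        return self.rooms_sold / self.rooms_available

    @property
    def adr(self) -> float:
        # Average daily rate: revenue per room actually sold.
        return self.room_revenue / self.rooms_sold

    @property
    def revpar(self) -> float:
        # Revenue per available room; algebraically occupancy * ADR.
        return self.room_revenue / self.rooms_available


# Made-up numbers: one agent holds rate, the other sells out at a lower price.
yield_managed = NightlyResult(rooms_available=100, rooms_sold=80, room_revenue=10_000.0)
undercutting = NightlyResult(rooms_available=100, rooms_sold=100, room_revenue=10_000.0)

for name, night in [("yield_managed", yield_managed), ("undercutting", undercutting)]:
    print(f"{name}: RevPAR={night.revpar:.2f} ADR={night.adr:.2f} "
          f"occupancy={night.occupancy:.0%}")
```

Both nights report the same RevPAR, yet one sold 80 rooms at a higher rate and the other sold out at a lower one. Only the decomposed trace shows the difference.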

Hidden State Makes One Label Too Simple

The testbed is a two-hotel pricing simulator. Hotel A is the learner. Hotel B is a fixed revenue-management competitor.

Hotel A can observe its own state and, in some regimes, lagged market prices. It cannot observe Hotel B's remaining inventory or pricing rule. The same Hotel A-visible state can therefore correspond to several valid Hotel B prices.

That is the hidden-state aliasing problem. The benchmark target is not a single correct price. It is a distribution over plausible prices induced by missing competitor state.

The paper verifies this directly: adding oracle Hotel B inventory to a supervised price predictor raises accuracy from 76.91% to 95.47%. The hidden variable is not decorative. It explains much of the label uncertainty.
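A toy version of that check can be sketched in a few lines. The log below is invented for illustration, and the "predictor" is just an in-sample majority-vote lookup table, not the paper's supervised model. It only shows the mechanism: when the same visible state maps to several Hotel B prices, no deterministic predictor can be right every time, and adding the hidden inventory removes the ambiguity.

```python
from collections import Counter, defaultdict

# Hypothetical log rows:
# (days_to_arrival, hotel_a_inventory, hotel_b_inventory, hotel_b_price)
records = [
    (7, 20, 5, 180), (7, 20, 5, 180), (7, 20, 25, 120),
    (7, 20, 25, 120), (7, 20, 5, 180), (3, 10, 2, 220),
    (3, 10, 18, 150), (3, 10, 2, 220),
]


def majority_accuracy(rows, key):
    """In-sample accuracy of the best lookup-table predictor keyed on key(row)."""
    prices_by_state = defaultdict(Counter)
    for row in rows:
        prices_by_state[key(row)][row[3]] += 1
    # For each state, the best single guess is the most frequent price.
    correct = sum(c.most_common(1)[0][1] for c in prices_by_state.values())
    return correct / len(rows)


# Hotel A's view: days to arrival and its own inventory only.
visible_only = majority_accuracy(records, key=lambda r: (r[0], r[1]))
# Oracle view: Hotel B's inventory is added to the state.
with_oracle = majority_accuracy(records, key=lambda r: (r[0], r[1], r[2]))

print(f"visible state only: {visible_only:.3f}")
print(f"with oracle inventory: {with_oracle:.3f}")
```

In this made-up log the visible-only predictor tops out at 0.625, while the oracle version reaches 1.0. The gap in the paper is smaller because real traces are noisier, but the direction of the effect is the same.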

What Reward-Only Agents Miss

The paper uses PPO, recurrent PPO, and CTDE (centralized training with decentralized execution) PPO as negative controls. They test whether stronger reward optimization, memory, or privileged critic information available only during training is enough.

It is not. Reward-only agents can remain far from the benchmark trace even when their business outcome looks plausible. In the main table, PPO's price-distribution L1 distance is 0.4635, recurrent PPO's is 0.5400, and CTDE PPO's is 0.6582. The Trace-Prior teacher is 0.0165, and the corrected-history student is 0.0198.
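For readers who want to see what such a number measures, here is a minimal sketch of an L1 distance between two price-bucket distributions. The buckets and price lists are hypothetical. The sketch sums absolute differences without halving, so the scale runs from 0 to 2, and the paper's exact normalization may differ.

```python
from collections import Counter


def price_distribution(prices, buckets):
    """Empirical share of each price bucket, in the order given by `buckets`."""
    counts = Counter(prices)
    total = len(prices)
    return [counts.get(bucket, 0) / total for bucket in buckets]


def l1_distance(p, q):
    """Sum of absolute differences between two discrete distributions."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical price buckets and traces (not taken from the paper).
buckets = [100, 120, 140, 160]
benchmark_prices = [100, 120, 120, 140, 140, 140, 160, 160, 120, 140]
agent_prices = [100] * 6 + [120] * 4  # collapsed onto the two cheapest buckets

benchmark_dist = price_distribution(benchmark_prices, buckets)
agent_dist = price_distribution(agent_prices, buckets)
distance = l1_distance(benchmark_dist, agent_dist)

print("benchmark:", benchmark_dist)
print("agent:    ", agent_dist)
print(f"L1 distance: {distance:.4f}")
```

The collapsed agent comes out at about 1.2 here, while an agent whose histogram matched the benchmark would score 0. The metric looks only at how prices are distributed, not at how much revenue they produced.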

The failure is not just weak optimization. It is objective incompleteness. RevPAR does not contain the whole discipline.

What Repairs the Trace

The repair signal is distributional. First, learn a market prior from observed benchmark traces. Then train a stochastic policy that still optimizes Hotel A's objective, but pays a KL penalty when its whole action distribution drifts away from that learned market prior.
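A single-state sketch of that objective looks like the following. It makes several assumptions: the penalty is taken as KL(policy || prior), the weight `beta`, the per-bucket rewards, and both candidate policies are invented for illustration, and a real implementation would apply the penalty per state inside the RL loss, not to one fixed distribution.

```python
import math


def kl_divergence(policy, prior, eps=1e-12):
    """KL(policy || prior) over a shared discrete set of price buckets."""
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(policy, prior)
        if p > 0  # terms with zero policy mass contribute nothing
    )


def penalized_objective(policy, reward_per_bucket, prior, beta):
    """Expected reward minus beta times the drift from the market prior."""
    expected_reward = sum(p * r for p, r in zip(policy, reward_per_bucket))
    return expected_reward - beta * kl_divergence(policy, prior)


# Hypothetical values: a learned market prior over four price buckets,
# and per-bucket rewards that favor the cheapest bucket.
market_prior = [0.1, 0.3, 0.4, 0.2]
reward_per_bucket = [1.2, 1.0, 0.9, 0.7]

collapsed = [0.97, 0.01, 0.01, 0.01]  # nearly all mass on one price
near_prior = [0.1, 0.3, 0.4, 0.2]     # matches the market prior

for beta in (0.0, 0.5):
    score_collapsed = penalized_objective(collapsed, reward_per_bucket, market_prior, beta)
    score_near = penalized_objective(near_prior, reward_per_bucket, market_prior, beta)
    print(f"beta={beta}: collapsed={score_collapsed:.3f} near_prior={score_near:.3f}")
```

With `beta = 0` the collapsed policy wins on reward alone. With `beta = 0.5` the ordering flips, because the penalty charges for the drift of the whole distribution and not only for its mean price.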

The result is deliberately nuanced. In the default symmetric market, behavior cloning from the trace is already very strong. That means the trace itself carries much of the discipline.

Trace-Prior RL becomes more useful when the agent needs bounded adaptation. In the capacity-asymmetric stress test, Hotel A has larger capacity than Hotel B. Trace-Prior RL improves Hotel A's objective RevPAR by +0.764 with a paired 95% confidence interval of [+0.125, +1.403], while keeping the trace-shape cost small.
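A paired interval like this compares the two policies seed by seed, so seed-level noise cancels before the uncertainty is computed. This post does not say how the paper computed its interval. The sketch below swaps in a percentile bootstrap over paired differences, and the per-seed RevPAR values are invented.

```python
import random
import statistics


def paired_bootstrap_ci(treated, baseline, n_resamples=10_000, alpha=0.05, seed=0):
    """Mean paired difference with a percentile-bootstrap confidence interval."""
    if len(treated) != len(baseline):
        raise ValueError("paired samples must have the same length")
    # Pair by seed: one difference per seed.
    diffs = [t - b for t, b in zip(treated, baseline)]
    rng = random.Random(seed)
    # Resample the differences with replacement and record each resample mean.
    resampled_means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lower, upper


# Hypothetical per-seed RevPAR for two policies run on the same seeds.
trace_prior_revpar = [61.2, 59.8, 62.5, 60.9, 61.7, 60.1]
baseline_revpar = [60.3, 59.5, 61.0, 60.6, 60.4, 59.9]

mean_diff, ci_low, ci_high = paired_bootstrap_ci(trace_prior_revpar, baseline_revpar)
print(f"mean paired difference: {mean_diff:+.3f}")
print(f"95% CI: [{ci_low:+.3f}, {ci_high:+.3f}]")
```

The reading is the same as in the paper's result: an interval that sits entirely above zero means the improvement is unlikely to be an artifact of which seeds were drawn.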

The paper also tests whether the discipline can persist after direct market dependence is reduced. A corrected-history student learns from teacher-labeled states induced by its own rollouts, then deploys without the teacher. It keeps aggregate trace alignment, though the paper is careful to note that this is not proof of online co-learning safety.
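The idea of labeling states from the student's own rollouts resembles dataset-aggregation imitation learning in the spirit of DAgger. Whether the paper follows that exact procedure is not stated here, so treat the following as a toy sketch under that assumption. The teacher rule, the dynamics, and the lookup-table student are all hypothetical.

```python
from collections import Counter, defaultdict

START_INVENTORY = 10
HORIZON = 6


def teacher(inventory):
    # Hypothetical teacher rule: protect scarce inventory with the high price.
    return "high" if inventory <= 4 else "low"


def step(inventory, action):
    # Toy dynamics: the low price sells two rooms per step, the high price one.
    sold = 2 if action == "low" else 1
    return max(inventory - sold, 0)


def rollout(policy):
    """Return the inventory states visited when acting with `policy`."""
    inventory, visited = START_INVENTORY, []
    for _ in range(HORIZON):
        visited.append(inventory)
        inventory = step(inventory, policy(inventory))
    return visited


def fit(dataset):
    """Majority-vote lookup table; unseen states fall back to the low price."""
    votes = defaultdict(Counter)
    for state, action in dataset:
        votes[state][action] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    return lambda inventory: table.get(inventory, "low")


dataset = []
student = lambda inventory: "low"  # untrained student: always the low price
for _ in range(3):
    visited = rollout(student)                            # student's own states
    dataset += [(s, teacher(s)) for s in visited]         # teacher labels them
    student = fit(dataset)                                # refit on everything

deployed_trace = rollout(student)  # the teacher is no longer consulted
print("student trace:", deployed_trace)
print("teacher trace:", rollout(teacher))
```

In this toy, the second round visits inventory states the first round never reached. Those are the states where an uncorrected student would have kept guessing. After the third round the deployed student reproduces the teacher's trace on its own.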

What to Carry Forward

The useful artifact is the evaluation ladder:

  • Define the benchmark discipline.
  • Define what the deployed agent can and cannot observe.
  • Start with the scalar KPI, then decompose how it was achieved.
  • Use trace diagnostics across action distributions, business components, state slices, and seeds.
  • Separate mechanisms with ablations.
  • Test whether the learned discipline persists after the benchmark signal is reduced.

That checklist is the contribution. Trace-Prior RL is one repair that passes it in this benchmark. The broader lesson is that strategic economic agents should not be certified by outcome alone.

Citation note

For citation: this paper supports the claim that outcome-only evaluation can certify strategically unsafe behavior under hidden competitor state. It introduces discipline stability as a trace-based evaluation paradigm and shows, in hotel pricing and a compact hidden-budget bidding task, that reward-only agents can match plausible KPIs while failing behavioral traces, whereas distributional trace priors and corrected-history students better preserve benchmark discipline.