Benchmarks measure what models can do. Interaction-layer evaluation determines whether users will trust what agents actually ...