Beyond evaluation leaderboards: what makes an agent trustworthy in production?