The pilot-production gap is an accountability gap
A pilot answers "can the agent do the task?" Production asks a harder question: "who is accountable when it does the task wrong?" Systems that make it to production define failure modes explicitly — what the agent may decide alone, what it must escalate, and what it must never touch.
Constrain the action space first
The reliability of an agent is inversely proportional to the surface area of what it can do. Start with a narrow, idempotent set of tools, log every action with enough context to replay it, and expand the action space only as the audit trail earns trust.
Design for the retry, not the happy path
Agents fail mid-task. The systems around them must treat every step as resumable: durable state, idempotent operations, and compensation paths for the steps that cannot be undone. This is distributed-systems hygiene applied to a new kind of worker.
Measure task completion, not token quality
The only metric that matters is the rate at which the agent completes the business task correctly without human rescue. Instrument that number from day one, and let it — not the demo — decide when the agent earns more autonomy.