A six‑week push to pave the roads
We formed a focused team with a simple mandate: ship a handful of meaningful automations, and in doing so, standardize the stack — hosting, observability, testing, human review, and audits — so any team could build an agent the same way. We prioritized categories of work that were manual, time‑consuming, and decision‑heavy: think “summarize and triage,” “collect and compare,” and “draft with references for a human to approve,” across a few different internal domains.
By the end of the six weeks, two automations were in production with measurable time savings, two more were completed end‑to‑end in development, and we’d published reference implementations and onboarding materials other teams could adopt. Most importantly, once those paved roads were in place, the time to build a new agent dropped from quarters to days, and more than half a dozen engineers were able to self‑serve using those patterns.
We built this with an “observability‑first” mindset. Every tool call, retrieval, decision, and output is traced. Data‑fetch and transform steps are deterministic and unit‑tested. LLM steps run with evaluation harnesses and curated datasets. We use a second LLM as a judge for spot‑checks and confidence scoring. And we treat human review as an intentional part of the system, not a workaround.
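To make that concrete, here is a minimal sketch of what “every step is traced” can look like. All names here (trace_step, TRACE_LOG, the example steps) are illustrative rather than our internal APIs, and in production the events ship to an observability backend instead of an in‑memory list.

```python
import functools
import json
import time
import uuid
from typing import Any, Callable

# Illustrative in-memory sink; real events go to the observability backend.
TRACE_LOG: list[dict] = []

def trace_step(step_type: str) -> Callable:
    """Decorator that records inputs, outputs, errors, and latency for one agent step."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            event = {
                "trace_id": str(uuid.uuid4()),
                "step": fn.__name__,
                "type": step_type,  # e.g. "tool_call", "retrieval", "llm"
                "inputs": {"args": repr(args), "kwargs": repr(kwargs)},
                "started_at": time.time(),
            }
            try:
                result = fn(*args, **kwargs)
                event["status"] = "ok"
                event["output"] = repr(result)
                return result
            except Exception as exc:
                event["status"] = "error"
                event["error"] = repr(exc)
                raise
            finally:
                event["duration_s"] = round(time.time() - event["started_at"], 4)
                TRACE_LOG.append(event)
        return wrapper
    return decorator

@trace_step("tool_call")
def fetch_tickets(queue: str) -> list[dict]:
    # Deterministic data fetch: no LLM involved, covered by ordinary unit tests.
    return [{"id": 1, "queue": queue, "subject": "Password reset loop"}]

@trace_step("llm")
def summarize(tickets: list[dict]) -> str:
    # Stand-in for the model call; the real step runs against an evaluation
    # harness and curated datasets before it ships.
    return f"{len(tickets)} open ticket(s): " + "; ".join(t["subject"] for t in tickets)

if __name__ == "__main__":
    print(summarize(fetch_tickets("support")))
    print(json.dumps(TRACE_LOG, indent=2))
```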
We designed for auditability from the start; it’s essential in our legal and regulatory flows and best practice everywhere else. Every agent execution produces an immutable record of which data was used, how it was used, the reasoning the agent followed, and who approved the output. Once we had built those capabilities for one agent, every other agent could reuse them. That lets us meet today’s requirements with minimal human friction, while building the confidence to reduce that friction further over time.
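As a rough illustration of that record’s shape (the field names and example values below are hypothetical, and a real deployment would write to append‑only infrastructure rather than a Python list), each run might append an entry like this, hash‑chained so later tampering is detectable:

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class AuditRecord:
    """One immutable entry per agent execution (field names are illustrative)."""
    agent: str
    run_id: str
    inputs_used: list[str]   # identifiers of the data the agent read
    steps: list[dict]        # ordered trace of tool calls and reasoning
    output: str
    approved_by: str | None  # the human who signed off, when review is required
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def append_record(log: list[dict], record: AuditRecord) -> str:
    """Append to a hash-chained log so any later edit breaks the chain."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = asdict(record)
    digest = hashlib.sha256(
        (prev_hash + json.dumps(body, sort_keys=True)).encode()
    ).hexdigest()
    log.append({"hash": digest, "prev_hash": prev_hash, "record": body})
    return digest

if __name__ == "__main__":
    log: list[dict] = []
    append_record(log, AuditRecord(
        agent="contract-triage",                      # hypothetical agent name
        run_id="run-001",
        inputs_used=["s3://contracts/acme-msa.pdf"],  # hypothetical data reference
        steps=[{"step": "summarize_clauses", "type": "llm", "status": "ok"}],
        output="Flagged clause 7.2 for review.",
        approved_by="j.doe",
    ))
    print(json.dumps(log, indent=2))
```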