In one paragraph
A small sales team needed hand-built contact lists for outbound — scoped by company, geography, seniority, and role. The old loop took 30–60 minutes per request and only one person on the team was fluent enough across the source tools to do it well. This agent compresses that into a one-line message in the team’s existing chat channel and a 1–3 minute wait. The salesperson asks in plain English; the agent picks the right data sources, runs the lookups in parallel, deduplicates against the CRM, and replies with a ready-to-use spreadsheet. About a thousand contact records a week, delivered to a sales team of 3–5. Seven months of iteration through three generations to get the shape right.
The problem in business terms
The sales team builds outbound lists by hand. Each list needs a specific filter — a city, an industry, a seniority band, a role, a function. The pre-agent workflow was: open the contact database, build a search, export to CSV, dedupe against the CRM by hand, hop to professional-network search for missing emails, paste it all into a spreadsheet, share the link, repeat. A 30–60 minute round-trip per request, multiple browser tabs, and the only person who could do it well was the one who knew which tool mapped to which question.
The agent compresses the loop into a one-line chat message. The salesperson asks something like “find me innovation-team decision-makers at target account X, with LinkedIn URL and email.” The agent acknowledges in the same thread, picks the data sources, runs the lookups in parallel, deduplicates, writes the results into a freshly spawned spreadsheet, and posts the link back. End to end: a minute or three.
What gets removed from the daily critical path is the procurement skill — nobody on the sales team needs to remember which database covers which geography or which surface has up-to-date emails. The agent knows.
Before the agent:
- salesperson opens the contact database, builds a search, exports a CSV
- deduplicates the CSV against the CRM by hand
- hops to professional-network search for missing emails
- 30–60 minutes per request; only one teammate is fluent across the tools
- every list is a one-off, with no record of what was asked or delivered
With the agent:
- salesperson sends a one-line request in their existing chat channel
- three specialist operators investigate in parallel across the sources
- deduplication happens automatically against the CRM and prior lists
- 1–3 minutes end to end; anyone in the channel can do it
- every run leaves a spreadsheet artefact and a structured execution record
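The parallel-lookup-then-deduplicate step can be sketched roughly as follows. This is an illustration, not the production workflow: the two lookup functions, the sample contacts, and the choice of a normalised email address as the deduplication key are all assumptions made for the example.

```python
import asyncio


def normalise_email(email: str) -> str:
    # Assumed dedup key: trimmed, lower-cased email address.
    return email.strip().lower()


async def lookup_contact_db(query: str) -> list[dict]:
    # Hypothetical stand-in for a contact-database search.
    await asyncio.sleep(0)
    return [
        {"name": "Ana Ruiz", "email": "Ana.Ruiz@example.com"},
        {"name": "Ben Cole", "email": "ben.cole@example.com"},
    ]


async def lookup_public_profiles(query: str) -> list[dict]:
    # Hypothetical stand-in for a professional-network lookup.
    await asyncio.sleep(0)
    return [
        {"name": "Ana Ruiz", "email": "ana.ruiz@example.com "},
        {"name": "Cy Dent", "email": "cy.dent@example.com"},
    ]


async def build_list(query: str, crm_emails: set[str]) -> list[dict]:
    # Run both lookups concurrently instead of one after the other.
    batches = await asyncio.gather(
        lookup_contact_db(query),
        lookup_public_profiles(query),
    )
    # Start from what the CRM already holds, then add each new contact once.
    seen = {normalise_email(e) for e in crm_emails}
    fresh = []
    for batch in batches:
        for contact in batch:
            key = normalise_email(contact["email"])
            if key in seen:
                continue
            seen.add(key)
            fresh.append(contact)
    return fresh


result = asyncio.run(build_list("innovation leads", {"BEN.COLE@example.com"}))
print([c["name"] for c in result])
```

In this toy run the contact already in the CRM is dropped and the contact returned by both sources appears only once.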
What makes this different from “a chatbot wired to a few APIs”
Most agent demos collapse under their own tool surface: hand the LLM twelve flat tools, watch it pick the wrong one under load, watch it loop, watch it fail. This agent is built on four decisions that turn “the model has access to our contact sources” into “the team can rely on it without watching.”
Three operators, each with its own context. The main agent only routes.
The most valuable structural decision — and the reason the agent doesn’t collapse — is that the data sources aren’t exposed as a flat list to the main model. They’re grouped into three specialist operators:
- a request operator — owns the contact database, the CRM, and web research
- a public-profile operator — owns the long-running professional-network scraper
- a spreadsheet operator — owns spreadsheet creation and row append
Each operator has its own model context and its own narrow tool surface. The main agent only chooses which operator to call. The planning prompt stays short, the model never accidentally writes to a spreadsheet in the middle of a research turn, and adding a new data source means writing a new operator rather than expanding a flat list everyone has to re-read.
| criterion | flat tool list (typical) | operator-per-domain (this agent, chosen) |
|---|---|---|
| main agent's tool count | ~12 flat tools | 3 operators |
| context per tool decision | full conversation | operator's narrow scope |
| wrong-tool risk under load | high | structural · near-zero |
| adding a new source | edit main prompt | new operator, isolated |
| parallel investigation | sequential only | natural per-operator |
For the business: predictable latency, fewer wrong-tool calls, and a clean place to plug the next data source in.
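To make the operator-per-domain idea concrete, here is a minimal sketch. The operator names follow the three operators described above, but the tool names, the `Operator` class, and the `route` function are hypothetical and only illustrate the shape: the main agent picks an operator, and each operator can only reach its own tools.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Operator:
    name: str
    tools: dict[str, Callable[[str], str]]  # this operator's narrow tool surface

    def run(self, tool: str, arg: str) -> str:
        # An operator refuses anything outside its own domain.
        if tool not in self.tools:
            raise PermissionError(f"{self.name} has no tool {tool!r}")
        return self.tools[tool](arg)


# Hypothetical tool names; the real tool surfaces are not given in the source.
OPERATORS = {
    "request": Operator("request", {
        "contact_db_query": lambda q: f"contacts for {q}",
        "crm_cross_check": lambda q: f"crm matches for {q}",
        "web_research": lambda q: f"web notes for {q}",
    }),
    "public_profile": Operator("public_profile", {
        "scrape_and_wait": lambda q: f"profiles for {q}",
    }),
    "spreadsheet": Operator("spreadsheet", {
        "spawn_sheet": lambda title: f"sheet:{title}",
        "append_rows": lambda rows: f"appended {rows}",
    }),
}


def route(operator_name: str, tool: str, arg: str) -> str:
    # The main agent only chooses among operators, never a flat tool list.
    return OPERATORS[operator_name].run(tool, arg)


print(route("spreadsheet", "spawn_sheet", "LATAM list"))
```

With this layout a research turn cannot write to a spreadsheet by accident: `route("request", "append_rows", ...)` raises instead of silently succeeding.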
Every model context has a fallback.
The main agent and each of the three operators are configured with two models: a sharper but flakier primary, and a steadier fallback. That makes eight model nodes wired into four contexts. The primary is faster and better when it works; the fallback absorbs the provider-side hiccups that bleeding-edge models invariably ship with.
This is a deliberate resilience-over-capability trade. A slightly less clever second-tier answer is better than a hard failure surfaced to a salesperson in chat. The 4% error rate in the most recent full week is what’s left over after this pattern has absorbed every provider outage that hit production.
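The primary-then-fallback pattern can be sketched in a few lines. This is a simplified illustration under stated assumptions: `ProviderError`, `call_with_fallback`, and the two stand-in model functions are hypothetical names, and the real workflow runtime may wire the fallback differently.

```python
from typing import Callable


class ProviderError(Exception):
    """Assumed error type for provider-side failures (timeouts, 5xx, etc.)."""


def call_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> tuple[str, str]:
    # Try the sharper model first; on a provider failure, use the steadier one.
    try:
        return "primary", primary(prompt)
    except ProviderError:
        return "fallback", fallback(prompt)


def flaky_primary(prompt: str) -> str:
    # Simulates a provider hiccup on the primary model.
    raise ProviderError("upstream 503")


def steady_fallback(prompt: str) -> str:
    return f"plan for: {prompt}"


used, answer = call_with_fallback("find CTOs in Lisbon", flaky_primary, steady_fallback)
print(used, "->", answer)
```

Only provider errors trigger the fallback here; any other exception still surfaces, which keeps genuine bugs visible.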
Long-running scrapes are pushed into a sub-workflow.
A public-profile scrape can take anywhere from 30 seconds to several minutes — the upstream service does the work asynchronously. Rather than block the main agent for that time, a dedicated scrape-and-wait sub-workflow owns the polling loop entirely: submit task → wait → check status → loop up to five minutes → return either a parsed result, a failure payload, or a still-running task ID. The main agent calls it like any other tool.
For the salesperson: a consistent “wait, then spreadsheet” experience whether the underlying source is fast or slow. The 5-minute cap means a stuck scrape never silently consumes the session — it surfaces as a clean failure the operator can retry.
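The polling loop described above might look roughly like this. The 30-second interval and 5-minute cap come from the text; the function name, the status payload shape (`state`, `rows`, `error`), and the injectable `sleep` and `clock` parameters are assumptions made so the sketch can run without a real scraping service.

```python
import time
from typing import Callable


def scrape_and_wait(
    task_id: str,
    check_status: Callable[[str], dict],
    poll_interval: float = 30.0,
    max_wait: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    deadline = clock() + max_wait
    while True:
        status = check_status(task_id)
        if status["state"] == "done":
            return {"outcome": "result", "rows": status["rows"]}
        if status["state"] == "failed":
            return {"outcome": "failure", "error": status.get("error", "unknown")}
        # Give up cleanly if another wait would run past the cap.
        if clock() + poll_interval > deadline:
            return {"outcome": "still_running", "task_id": task_id}
        sleep(poll_interval)


class FakeClock:
    """Test double so the example finishes instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


clock = FakeClock()
statuses = iter([
    {"state": "running"},
    {"state": "running"},
    {"state": "done", "rows": [{"name": "Ana Ruiz"}]},
])
done = scrape_and_wait("t-1", lambda _id: next(statuses), sleep=clock.sleep, clock=clock)

stuck_clock = FakeClock()
stuck = scrape_and_wait(
    "t-2", lambda _id: {"state": "running"}, sleep=stuck_clock.sleep, clock=stuck_clock
)
print(done["outcome"], stuck["outcome"])
```

The caller always gets one of three payloads back, so the main agent can treat the scrape like any other tool call.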
The interface is the chat channel they already use.
There is no dashboard, no portal, no separate auth. Permissions are inherited from chat-channel membership. Onboarding a new salesperson is an /invite to the channel, and that is the training: type what you want, read the spreadsheet that comes back.
The non-obvious consequence is observability. Of 140 lifetime executions, 138 were triggered by humans on the team; only 2 were manual test runs by the maintainer. The agent isn’t a demo that someone keeps alive by poking it — it’s load-bearing for the sales week.
The three operators
Request operator
- 01 contact-db query: by company · geography · seniority · role
- 02 CRM cross-check: drop contacts the team already owns
- 03 web research: fill missing emails · verify company info
Public-profile operator
- 01 submit scrape: queue task in external service
- 02 poll status: 30s interval · 5-minute cap
- 03 parse + clean: BOM-strip · column normalisation
Spreadsheet operator
- 01 spawn sheet: freshly created · channel permissions inherited
- 02 append rows: name · title · profile URL · email · location · company
- 03 reply: shareable link posted in original chat thread
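The "parse + clean" step is small but easy to get wrong, so here is a hedged sketch of BOM-stripping and column normalisation. The function names, the snake_case header convention, and the sample CSV are assumptions; only the two techniques themselves are named in the source.

```python
import csv
import io


def normalise_header(header: str) -> str:
    # "Profile URL" -> "profile_url", " Email" -> "email"
    return "_".join(header.strip().lower().split())


def parse_scrape_csv(raw: bytes) -> list[dict]:
    # "utf-8-sig" drops a leading byte-order mark if one is present,
    # so the first column is not read as "\ufeffName".
    text = raw.decode("utf-8-sig")
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for row in reader:
        rows.append({
            normalise_header(key): (value or "").strip()
            for key, value in row.items()
            if key is not None  # skip overflow cells from ragged rows
        })
    return rows


raw = (
    b"\xef\xbb\xbfName,Profile URL, Email\r\n"
    b"Ana Ruiz,https://example.com/in/ana, ana@example.com \r\n"
)
rows = parse_scrape_csv(raw)
print(rows)
```

Normalising headers once at this boundary means the spreadsheet operator can rely on stable column names regardless of how the scraper labels its export.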
The stability story
The architecture pays for itself in the production numbers. Two consecutive weeks:
| week of | runs | success | error | success rate |
|---|---|---|---|---|
| 2026-05-11 | 83 | 72 | 11 | 87% |
| 2026-05-18 | 53 | 51 | 2 | 96% |
A 3.5× drop in error rate week-over-week. It didn’t come from the upstream APIs getting better. It came from rolling the dual-model fallback pattern out across all four model contexts — the failure class that dominated the earlier week (provider hiccups on the bleeding-edge primary model) got absorbed at the architecture level instead of surfacing to the salesperson.
The remaining errors cluster narrowly: most are bad-request edge cases on specific contact-database search parameters. That’s a knob to turn, not a structural problem.
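For readers who want to check the week-over-week figure, the arithmetic from the table can be reproduced directly. The run and error counts are taken from the table above; the variable names are just for illustration.

```python
# (runs, errors) per week, as reported in the table.
weeks = {
    "2026-05-11": (83, 11),
    "2026-05-18": (53, 2),
}

# Error rate per week as a fraction of runs.
error_rates = {week: errors / runs for week, (runs, errors) in weeks.items()}

# How many times smaller the later week's error rate is.
drop = error_rates["2026-05-11"] / error_rates["2026-05-18"]

for week, rate in error_rates.items():
    print(week, f"error rate {rate:.1%}", f"success rate {1 - rate:.0%}")
print(f"drop factor: {drop:.1f}x")
```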
Audit, by default
Because the agent runs on a workflow runtime, every execution leaves a record in that runtime’s execution log: which operator ran, with which parameters, with what output, with how long it took, with success or failure. There’s nothing to instrument — the runtime captures it for free.
That’s how the weekly stability numbers above were computed: by querying the runtime’s execution history for the 140 lifetime runs of the production workflow, scoring each as success or error, computing the rates.
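A rough sketch of that computation, assuming each execution record exposes a start date and a status. The record fields (`started`, `status`), the Monday-based week bucketing, and the three sample records are assumptions for illustration; the real runtime's export format is not described here.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(day: date) -> date:
    # Bucket each run under the Monday of its week.
    return day - timedelta(days=day.weekday())


def weekly_rates(executions: list[dict]) -> dict[str, dict]:
    totals: Counter = Counter()
    errors: Counter = Counter()
    for run in executions:
        week = week_start(date.fromisoformat(run["started"]))
        totals[week] += 1
        if run["status"] != "success":
            errors[week] += 1
    return {
        week.isoformat(): {
            "runs": totals[week],
            "errors": errors[week],
            "success_rate": round(100 * (totals[week] - errors[week]) / totals[week]),
        }
        for week in sorted(totals)
    }


# Synthetic sample records, not real production data.
sample = [
    {"started": "2026-05-12", "status": "success"},
    {"started": "2026-05-14", "status": "error"},
    {"started": "2026-05-19", "status": "success"},
]
report = weekly_rates(sample)
print(report)
```

Because the scoring runs over the execution history itself, the same query can be repeated for any later week without extra instrumentation.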
For the business: the same record that exists for debugging doubles as the record for audit. Sales, finance, and security can all answer “what did the agent do last Tuesday for the LATAM list?” from one place, without anyone having to remember to log it.
What’s shipped
In production: the main workflow (30 nodes), a data-appender helper, a long-poll scrape-and-wait sub-workflow (17 nodes, added 2026-05-21), and a paired regression-test workflow.
Kept as references, deactivated: v1 (“Archive”) and v2 (“playground”), left in the runtime as deactivated workflows rather than deleted. The diff between generations stays legible months later — when someone (including future-me) asks “why did we move away from X?” the answer lives in the workflow next door, not in someone’s memory.
Separate dev fork: a wizleads-dev twin used for safer experimentation with new integrations before they touch the production graph.
About fifty nodes across the production graph in total.
What this says about the builder
The interesting part isn’t that the agent exists — agentic workflows on this kind of runtime are common enough that templates exist for them. The interesting part is the shape: orchestration as nested agents with disjoint vocabularies, not as a flat tool list. A main planner that only knows operators. Operators that only know their own domain. The same instinct an experienced service-mesh architect applies — narrow contracts, push specifics down, keep the top of the stack legible.
The stability work shows the same discipline. Rather than chase 100% by adding retries everywhere, the builder picked one failure class (provider flakiness on bleeding-edge models) and absorbed it architecturally with a fallback at every model context. The error rate fell because the surface area where one provider hiccup could derail a run got smaller — not because individual integrations got better.
And the practical tells: BOM-stripping CSVs from the scraper, polling loops kept inside their sub-workflow, kept-for-reference legacy versions, a paired test workflow, a separate dev fork. The artefacts of someone who has shipped enough automation to know which corners eat you later.
Seven months. One builder. ~50 nodes across the production graph. Roughly a thousand contact records a week, delivered to a sales team that previously did the work by hand.
This case describes architecture and patterns. Specific vendors, hostnames, and the client itself are deliberately left abstract; what matters here is the shape of the system, not the procurement choices.