AI-Ops kit for a corporate operations team

In one paragraph

A mid-size operations team used to spend most of a working day stitching context together by hand — across a sales CRM, a support desk, an engineering tracker, an analytics warehouse, a payments processor, a BI surface, and the automation glue between them — just to answer a single business question. This kit replaces that scramble with one conversation. The operator asks in plain English; the right skill is selected automatically; a fleet of three specialist subagents investigates in parallel; an audit record is written without anyone having to remember to. Fifteen corporate skills reaching six systems, three subagents, built solo over about three weeks — and now shipped to the team as a finished, governed catalogue rather than a drawer of one-off scripts.

15 skills

corporate skills · read+write pairs + meta

3 subagents

specialist investigators, invokable

6 systems

reached through one conversation

~3 weeks

solo build, end to end

The problem in business terms

The team’s data lives in eight different places, each with its own login, its own query dialect, and its own quirks. Every real business question — why does this number differ across two dashboards? where did this customer’s refund get stuck? what’s actually on my plate before the sync call? — means reading two or three of those systems and reconciling the answers by hand. Before the kit, that reconciliation took multiple sessions and hours of elapsed time; operators without their own AI workflow simply escalated to engineering, and engineering became the bottleneck for routine questions.

how a multi-system investigation runs before vs after the kit shipped

without the kit

operators without an AI workflow escalate to engineering
the maintainer hand-assembles context across two or three chat sessions
every investigation re-invents its own prompt; no shared vocabulary
ad-hoc queries via personal scripts, scattered across laptops
no record of what the AI was asked, against which system, with what outcome

with the kit

operators get a reconciled answer in one conversation, without escalating
subagents investigate the question in parallel from different angles
shared skills with shared semantics; the next operator inherits the vocabulary
data access goes through separate read and write skills — the risk surface is bounded
every invocation is captured as a structured audit record, automatically

What makes this different from “just install a few MCPs”

Most teams that bolt AI onto their operations stop at the MCP step: install a CRM connector, install a chat connector, hand Claude the credentials, hope for the best. That works for one user with simple questions; it doesn’t survive a second user or an audit. The kit’s value isn’t any single skill; it’s the fleet of three specialist subagents the main session delegates to. A multi-angle question doesn’t get answered in one stretched thread: Claude dispatches the warehouse-analyst, the crm-agent, and the desk-agent to look in parallel, each loading only the references its corner of the world needs, each reporting back. That parallelism is what makes the kit feel like a competent analyst rather than a bot that takes forever and forgets what you asked. Crucially, none of the subagents grants new access; each inherits exactly the access the operator already has.
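The fan-out described above can be sketched with asyncio. This is only an illustration of the shape, not the kit's orchestrator: the `investigate` coroutine and its canned findings are hypothetical stand-ins for real subagent runs, and the subagent names follow the catalogue section.

```python
import asyncio

# Subagent names as listed in the catalogue; everything else is hypothetical.
SUBAGENTS = ("warehouse-analyst", "crm-agent", "desk-agent")


async def investigate(agent: str, question: str) -> dict:
    """Stand-in for one subagent run in its own context window."""
    await asyncio.sleep(0.01)  # simulates the upstream round trip
    return {"agent": agent, "finding": f"{agent} looked into: {question}"}


async def dispatch(question: str) -> list[dict]:
    """Fan the question out to every subagent and wait for all reports."""
    tasks = [investigate(agent, question) for agent in SUBAGENTS]
    # gather runs the coroutines concurrently and preserves input order
    return await asyncio.gather(*tasks)


def run(question: str) -> list[dict]:
    """Synchronous entry point for the main session."""
    return asyncio.run(dispatch(question))
```

Because the three runs overlap, the elapsed time is roughly that of the slowest subagent rather than the sum of all three.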

The two decisions that make it safe to hand to the whole team are quieter. Credentials never pass through the model: every skill reads its token from the OS keychain at execution time and uses it locally, so if a session transcript leaked tomorrow, no secret would leak with it. It is the single most under-discussed pattern in AI-ops today. And every skill carries a RACI block at the top of its instruction file by template, not by hope, so when a skill returns a wrong answer six months from now, accountability is in the file rather than in someone’s memory.

read vs write skill contract · enforced by template + lint, not a policy doc

criterion	read skills	write skills
default behaviour	execute	dry-run
confirmation per call	no	yes (--commit + --reason)
per-session change cap	—	yes (per-skill default)
pre-state captured	—	backup_strategy block
destructive ops	impossible	extra ticket reference required
lint rules applied	base set	base + write-specific rules
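The write-skill column of the table can be read as a small gate in front of every mutation. The sketch below is one possible reading, not the kit's code: `--commit` and `--reason` come from the table, while the `crm-write` program name, the `--ticket` and `--destructive` flags, and the cap of five changes are assumptions made for illustration.

```python
import argparse

SESSION_CHANGE_CAP = 5  # hypothetical per-skill default


class ChangeCapExceeded(RuntimeError):
    """Raised when a session has already used up its change budget."""


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="crm-write")  # hypothetical skill name
    parser.add_argument("--commit", action="store_true",
                        help="apply the change; without it the call is a dry run")
    parser.add_argument("--reason", default="",
                        help="why the change is being made (needed with --commit)")
    parser.add_argument("--destructive", action="store_true",
                        help="marks the operation as destructive (assumed flag)")
    parser.add_argument("--ticket", default="",
                        help="ticket reference for destructive operations (assumed flag)")
    return parser


def plan_write(argv: list[str], changes_so_far: int) -> str:
    """Return 'dry-run' or 'commit', or raise if the contract is violated."""
    args = build_parser().parse_args(argv)
    if not args.commit:
        return "dry-run"  # default behaviour for write skills
    if not args.reason.strip():
        raise ValueError("--commit requires --reason")
    if args.destructive and not args.ticket.strip():
        raise ValueError("destructive operations require --ticket")
    if changes_so_far >= SESSION_CHANGE_CAP:
        raise ChangeCapExceeded("per-session change cap reached")
    return "commit"
```

Keeping the decision in one function makes the contract testable on its own, which is what lets a lint or CI job enforce it rather than a policy document.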

A session in the room

Three everyday operator questions, run consecutively, each one touching a different corner of the business, each answered end to end. The agent investigates, decides, executes, and verifies; every scenario closes on a success line.

A quick word before the work

The kit’s promise fits in one sentence an operator can quote: ask a cross-system question in plain English, get a reconciled answer back in one conversation, without learning six logins or escalating to engineering.

Two files load on every session. CLAUDE.md holds the kit’s operating principles and the map of which skill answers which kind of question; profile.json records which skills this operator has been granted and authenticated. Between them, the agent knows what it can reach and how to behave before the first question is typed.

Three questions follow, picked because they’re the ones an operations team actually asks: a year of refunds and who makes them; what’s on my plate before a sync call; the whole story of a customer I’m about to phone. Different corners of the business, the same loop each time — investigate, decide, execute, verify, report.

The trust property worth stating up front: no credential ever passes through the model. Each skill reads its token from the operating system’s keychain and uses it locally, so the conversation you’re watching could be captured wholesale and not a single secret would travel with it. That is what makes a kit like this safe to hand to a whole team rather than one careful engineer.
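A minimal sketch of that credential pattern, under stated assumptions: the case does not say which keychain backend or HTTP client the kit uses, so the macOS `security` CLI is shown as one possible backend and `send` is a hypothetical callable standing in for the real upstream request.

```python
import subprocess
from typing import Callable


def read_from_macos_keychain(service: str, account: str) -> str:
    """Read a secret with the macOS `security` CLI (one possible backend)."""
    result = subprocess.run(
        ["security", "find-generic-password", "-s", service, "-a", account, "-w"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def call_upstream(
    service: str,
    account: str,
    send: Callable[[dict], dict],
    read_secret: Callable[[str, str], str] = read_from_macos_keychain,
) -> dict:
    """Fetch the token at execution time, use it locally, return only data."""
    token = read_secret(service, account)
    headers = {"Authorization": f"Bearer {token}"}
    response = send(headers)
    # Only whitelisted payload fields go back to the conversation;
    # the token and the request headers are never part of the return value.
    return {"status": response.get("status"), "rows": response.get("rows", [])}
```

The point of the shape is that the secret exists only inside the skill's process: the model sees the call and the returned rows, never the token.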

A year of refunds, and who actually makes them

Refund analysis looks like a one-query job and almost never is. The data is spread across the payment processor and the warehouse, and the table that looks like the obvious source is the one that lies: a known webhook gap means the warehouse’s own refund table silently captures only about half of what actually happened.

The expertise the kit encodes is knowing that before it runs anything. It starts from the processor’s balance feed, the complete record, and uses the internal table only to enrich, never to filter. The business meaning is blunt: the naive query would have reported the refund population at roughly half its true size, the portrait would have been drawn from the wrong sample, and nobody downstream would have noticed the answer was wrong.
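The "enrich, never filter" rule amounts to a left join with the complete feed on the left. The sketch below assumes hypothetical field names (`id`, `type`, `amount`, `refund_id`, `plan`); the real schemas are not given in the case.

```python
def build_refund_sample(balance_feed: list[dict], internal_refunds: list[dict]) -> list[dict]:
    """Left join: every processor refund is kept; internal rows only add fields."""
    internal_by_id = {row["refund_id"]: row for row in internal_refunds}
    sample = []
    for txn in balance_feed:
        if txn["type"] != "refund":
            continue  # charges, payouts and fees are not part of the sample
        extra = internal_by_id.get(txn["id"], {})
        sample.append({
            "refund_id": txn["id"],
            "amount": txn["amount"],
            "plan": extra.get("plan"),  # None when the webhook was missed
            "in_internal_table": txn["id"] in internal_by_id,
        })
    return sample
```

An inner join, or a query that starts from the internal table, would silently drop every refund whose webhook never arrived, which is the failure the paragraph above describes.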

With the right sample, the portrait inverts the intuition. The person who refunds isn’t the disappointed long-term power user; it’s the early, self-serve, entry-plan buyer inside a two-week buyer’s-remorse window, over-represented on the trial-to-paid auto-convert. The lever that follows isn’t a retention campaign — it’s the expectation-setting on the self-serve checkout.

For the business: an analyst-grade investigation, with the landmines already mapped and a defensible sample, in minutes rather than a day — and a written breakdown the next person can audit instead of re-deriving.

Ten minutes before a sync call

Everyone runs the same scramble before a status meeting: what’s on my plate, what’s blocked, what’s quietly gone stale. Done by hand it’s a scroll through a tracker and a guess; done badly it turns the meeting into status theatre.

A raw query would just list eleven open tickets. What the kit adds is the triage a good lead does in their head — ship versus blocked versus stale — and, more usefully, it names the human dependencies: the review that’s been sitting six days, the task whose upstream blocker is still open. Those are the two lines that change what gets said on the call.
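The ship / blocked / stale triage can be sketched as a small classifier. The ticket fields (`key`, `open_blockers`, `last_activity`) and the five-day staleness threshold are assumptions for illustration, not values from the kit.

```python
from datetime import date

STALE_AFTER_DAYS = 5  # hypothetical threshold


def triage(ticket: dict, today: date) -> str:
    """Classify one open ticket as 'blocked', 'stale', or 'ship'."""
    if ticket.get("open_blockers"):
        return "blocked"  # an upstream dependency is still open
    idle_days = (today - ticket["last_activity"]).days
    if idle_days > STALE_AFTER_DAYS:
        return "stale"
    return "ship"


def summarize(tickets: list[dict], today: date) -> dict[str, list[str]]:
    """Group ticket keys by triage bucket for a pre-meeting summary."""
    buckets: dict[str, list[str]] = {"ship": [], "blocked": [], "stale": []}
    for ticket in tickets:
        buckets[triage(ticket, today)].append(ticket["key"])
    return buckets
```

Checking blockers before staleness means a ticket that is both blocked and idle is reported as blocked, since that is the dependency someone else can act on.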

The trust property here is that the tracker skill is read-only. It pulls the picture and changes nothing; the operator walks in prepared and the board is exactly as their teammates left it. No accidental transitions, no surprise reassignments.

For the business: the meeting starts from a shared, accurate state instead of a round of “let me check what I’m working on.” Less ceremony, more decision — and the preparation cost dropped from ten anxious minutes to ten honest seconds.

A three-year relationship on one screen

Walking into a customer call blind is expensive, and the relationship that should inform it is scattered: the commercial shape lives in the CRM, the health of the account lives in the support desk, and no single screen shows both.

This is the kit’s signature move: the cross-system stitch. The crm-agent reads the deal history, the plan, the renewal date; the desk-agent checks for open tickets. The agent assembles them into one brief, in the order a salesperson should actually read them.

And it surfaces the thing that changes the call: an open ticket that’s had no reply for nine days. The right play isn’t to lead with the renewal; it’s to acknowledge the silence first. A nine-day gap owned is goodwill; the same gap ignored is the quiet reason a renewal slips.
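One way to picture the stitch is a function that takes the two subagents' findings and orders them for the call. Everything here is a hypothetical shape: the field names, the seven-day flag threshold, and the line ordering are assumptions, not the kit's actual brief format.

```python
from datetime import date

SILENCE_FLAG_DAYS = 7  # hypothetical threshold


def build_brief(crm: dict, tickets: list[dict], today: date) -> list[str]:
    """Merge CRM and support-desk findings into an ordered call brief."""
    lines = []
    # Lead with unanswered tickets so the rep can acknowledge them first.
    for ticket in tickets:
        if ticket["status"] != "open":
            continue
        silent_days = (today - ticket["last_reply"]).days
        if silent_days >= SILENCE_FLAG_DAYS:
            lines.append(
                f"FLAG: ticket {ticket['id']} has had no reply for {silent_days} days"
            )
    lines.append(f"Plan: {crm['plan']}")
    lines.append(f"Renewal date: {crm['renewal_date'].isoformat()}")
    lines.append(f"Deals on record: {len(crm['deals'])}")
    return lines
```

Putting the flag ahead of the commercial details encodes the advice in the text: the silence gets acknowledged before the renewal is raised.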

For the business: every rep walks into every call with the same complete picture, assembled in seconds. The quality of the relationship stops depending on who happens to remember what — and the support gap gets caught by the person with the customer on the line, not a quarter later.

The catalogue

Each connector domain is a read skill and a write skill; the write skill always inherits the extra discipline shown in the table above. Three meta-skills sit above the connectors — one to scaffold new skills, one to onboard a new operator, one to refresh the dated snapshots inside other skills’ references.

meta and orchestration · 3 skills · all live

● 01 skill-creator
interviews you · scaffolds with a RACI block

● 02 setup
walks a new operator through which skills they need

● 03 refresh-references
on-demand refresh of dated reference snapshots

the six connector domains · each a read + write pair · all live

● 01 analytics warehouse
read: keychain auth, smoke-tested · write: statement allowlist + row cap

● 02 sales CRM
read: per-operator OAuth · write: first backup_strategy · DELETE needs a ticket

● 03 support desk
read: orgId injection · write: public reply needs a literal confirm token

● 04 engineering tracker
read: JQL / issue / project · write: comments, transitions, assignee

● 05 automation orchestrator
read: live sync snapshot · write: curated tool surface (MCP)

● 06 BI / dashboards
read: export a view as CSV · write: publish (Creator-role only)

The three subagents — warehouse-analyst, crm-agent, desk-agent — are the value layer. Each is a read-only specialist that loads only its own references, runs in its own context window so the main thread stays clean, and reports findings back to the orchestrator. None grants new access; each inherits the operator’s existing scope. Three more connectors — a payments processor, team chat, and a secondary accounting system — are scoped but deliberately out of this kit’s first version; adding one is now a mechanical pass through the meta-skill, not a design problem. The architecture is done.

Governance: the merge gate

Every artifact passes this checklist before it can merge. Lint enforces the mechanical items; reviewers enforce the rest.

Useful to at least two people — single-user helpers stay personal, outside the corporate kit.
A named owner — a RACI block at the top of the instruction file, no exceptions.
Template-compliant — required sections present, no TODO markers, plain-language triggers.
Safe defaults — read-only by default; no destructive operation runs without confirmation.
No secrets — verified by lint and the CI secret-detection job.
No name conflicts — the area-verb name is unique across the catalogue.
Documented failure behaviour — what the skill does when the API errors, auth expires, or input is malformed.

A lint rule that protects the team starts life as a warning and graduates to a hard error only after a few real skills have used it successfully in production. The rule engine evolves with the catalogue rather than running ahead of it — governance that earns its strictness instead of asserting it.
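The mechanical half of the merge gate, and the warning-to-error graduation, can be sketched as a list of rules with a severity field. The rule names, the text checks, and the assumption that the RACI block appears as a line starting with "RACI" near the top of the file are all illustrative; the kit's real lint rules are not shown in this case.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    check: Callable[[str], bool]  # returns True when the file passes
    severity: str = "warning"     # new rules start life as warnings


def has_raci_block(text: str) -> bool:
    # Assumes the RACI block is a heading-like line within the first 20 lines.
    head = text.splitlines()[:20]
    return any(re.match(r"^\s*#*\s*RACI\b", line) for line in head)


def no_todo_markers(text: str) -> bool:
    return "TODO" not in text


RULES = [
    Rule("raci-block", has_raci_block, severity="error"),  # graduated
    Rule("no-todo", no_todo_markers, severity="error"),    # graduated
    Rule("failure-behaviour", lambda t: "failure" in t.lower()),  # still a warning
]


def lint(text: str) -> dict[str, list[str]]:
    """Return the names of failed rules, grouped by severity."""
    report: dict[str, list[str]] = {"error": [], "warning": []}
    for rule in RULES:
        if not rule.check(text):
            report[rule.severity].append(rule.name)
    return report


def can_merge(text: str) -> bool:
    """Only hard errors block a merge; warnings are reported but pass."""
    return not lint(text)["error"]
```

Graduating a rule is then a one-word change to its `severity`, made once enough real skills have passed it.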

The audit trail, in two sentences

Every skill that touches an upstream system is wrapped in a @telemetered decorator

python

@telemetered(
  skill="warehouse-read",
  operation="query",
  outcome_map={0: "ok", 1: "error", 2: "auth_expired"},
  extras=lambda result, args: {"sql_preview": args.sql[:200]},
)
def cmd_query(args): ...

that captures what skill ran, against what system, with what parameters, with what outcome, and how long it took — and ships those records to a central log aggregator. Two operational properties make it trustworthy: the shipper is idempotent (a session that drops its connection mid-flight retries later and never ships a duplicate), and it is non-blocking (if the aggregator is unreachable, records queue locally and the operator’s session is unaffected).
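The source shows only how the decorator is applied, not how it is built. The following is a guess at its shape, assuming the wrapped command returns an integer exit code, that records carry a UUID as an idempotency key, and that an in-memory list stands in for the local queue. `LOCAL_QUEUE`, `ship`, and the placeholder body of `cmd_query` are hypothetical.

```python
import functools
import time
import uuid

LOCAL_QUEUE: list[dict] = []  # stand-in for an on-disk queue


def telemetered(skill, operation, outcome_map, extras=None):
    """Record skill, operation, outcome and duration for every call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(args):
            started = time.monotonic()
            result = func(args)
            record = {
                "id": str(uuid.uuid4()),  # idempotency key for the shipper
                "skill": skill,
                "operation": operation,
                "outcome": outcome_map.get(result, "unknown"),
                "duration_s": round(time.monotonic() - started, 6),
            }
            if extras is not None:
                record.update(extras(result, args))
            LOCAL_QUEUE.append(record)  # queue locally; never wait on the network
            return result
        return wrapper
    return decorator


def ship(queue, send, already_shipped):
    """Flush queued records; safe to retry because shipped ids are skipped."""
    remaining = []
    for record in queue:
        if record["id"] in already_shipped:
            continue
        try:
            send(record)
            already_shipped.add(record["id"])
        except ConnectionError:
            remaining.append(record)  # aggregator unreachable: keep for later
    queue[:] = remaining


@telemetered(
    skill="warehouse-read",
    operation="query",
    outcome_map={0: "ok", 1: "error", 2: "auth_expired"},
    extras=lambda result, args: {"sql_preview": args.sql[:200]},
)
def cmd_query(args):
    return 0  # placeholder body; the real command is elided in the source
```

In a real deployment the aggregator would also need to deduplicate on the record id, since a client can lose the connection after a record was received but before it was acknowledged.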

For the business, that’s a complete record of who asked the AI tier what, against which system, when, and with what result — the foundation for trusting the kit at scale, and the starting point for any future compliance conversation.

What the kit reclaims

A cross-system investigation that took most of an hour by hand now takes a few minutes. Plug in your team’s own numbers to see what that difference is worth across a year.
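The arithmetic behind that comparison is simple enough to sketch. The function below is an illustration only; the parameter names and the default of 48 working weeks are assumptions, and the inputs in the usage line are placeholders rather than figures from this case.

```python
def reclaimed_hours_per_year(
    investigations_per_week: float,
    minutes_before: float,
    minutes_after: float,
    working_weeks: int = 48,  # assumed default; adjust to your calendar
) -> float:
    """Hours per year saved when each investigation gets faster."""
    saved_per_run = max(minutes_before - minutes_after, 0)
    saved_minutes = saved_per_run * investigations_per_week * working_weeks
    return saved_minutes / 60


# Placeholder inputs: 20 investigations a week, 50 minutes before, 5 after.
hours = reclaimed_hours_per_year(20, 50, 5)  # 45 * 20 * 48 / 60 = 720.0
```

The calculation is deterministic and needs no network access, so a team can rerun it with its own volumes and timings.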


The verdict from the room

The kit’s impact is best measured by what stops happening: operators no longer queue behind engineering for routine questions, and the answers they get arrive already reconciled and already recorded. An operator who uses it daily puts it more plainly than any metric — “there’s nothing to configure; it does everything it can on its own, and when it does need me it’s one simple step with clear instructions. It structures every answer cleanly — everything laid out in its place.” That is the whole design goal stated as a user experience: the machinery is invisible, the judgement is visible, and the operator is never asked to think about the plumbing.

~3 weeks. One builder. 15 skills, 3 subagents, 6 systems. A multi-system investigation that took a day now takes minutes — reconciled, recorded, and safe to hand to the whole team.

This case describes architecture and patterns. Specific vendors, hostnames, and the client itself are deliberately left abstract; what matters here is the shape of the system — one conversation over many systems, governed so a team can trust it — not the procurement choices behind it.