In one paragraph
A small operations team was spending most of its day stitching context together by hand across a CRM, a support desk, an engineering tracker, an analytics warehouse, a payments processor, and a couple of glue tools — just to answer a single business question. This kit replaces that scramble with one conversation. The operator speaks plainly to Claude Code; the right skill is picked automatically; specialist subagents go off to investigate in parallel; an audit trail is recorded without anyone having to remember. Fourteen skills and three subagents in production. Four days of build by one person. Rolling out to the operations team now.
The problem in business terms
The operations team’s data lives in many places: an analytics warehouse, a CRM, a support desk, an engineering tracker, a payments processor, a secondary accounting system, a separate BI surface, and an automation orchestrator gluing them together. Every real business question — why does this number differ across two dashboards? where did this customer’s refund get stuck? which support tickets correlate with the recent product change? — requires reading two or three of those systems and reconciling the answers.
Before this kit, that reconciliation took multiple sessions and hours of elapsed time. The maintainer would spend an hour assembling context for each Gemini conversation, then another hour reconciling. Other operators, without a workflow at all, would escalate to engineering instead.
The kit collapses that loop into one prompt. The operator describes the question; the kit picks the right skills, talks to the right systems, and returns a reconciled answer. The team stops waiting on engineering, and engineering stops being a bottleneck for routine investigations.
Before
- operators without their own AI workflow escalate to engineering
- the maintainer assembles context manually across 2–3 Gemini sessions
- every investigation re-invents its own prompt; no shared vocabulary
- ad-hoc queries via personal scripts, scattered across laptops
- no record of what the AI was asked, against which system, with what outcome
After
- operators get answers in one conversation, without escalating
- subagents investigate the question in parallel from different angles
- shared skills with shared semantics; the next operator inherits the vocabulary
- data access goes through separate read and write skills; the risk surface is bounded
- every invocation is captured as a structured audit record
Two lifecycles, two pipelines
The kit has two distinct flows owned by two distinct populations. One runs when I push a change. The other runs every time an operator opens Claude Code. They share a repository, but nothing else.
Build & release — runs on git push (maintainer)
- ● 01 commit: main is the catalogue
- ● 02 lint: structure + naming + RACI presence
- ● 03 secret-detect: trufflehog + CI scanner
- ● 04 auto-tag: bot tags the commit; kit pins to it
I’m currently the only person who pushes to main. That’s by design while
the kit is still maturing, and it’s about to change: see the meta-skill
in the catalogue below.
Session — runs every time an operator opens Claude Code
- ● 01 session start: kit self-updates to the latest tag
- ● 02 prompt: operator speaks naturally
- ● 03 subagents: Claude delegates to specialists in parallel
- ● 04 skill calls: each subagent invokes the skills it needs
- ● 05 session end: audit trail shipped to the log aggregator
For the operator, the machinery is invisible: open Claude Code, let the kit self-update, ask a question in plain English. Claude decides which subagents to dispatch, those subagents call the right skills, and the answer comes back as a single reply. The audit record is created without anyone having to think about it.
What makes this different from “just install a few MCPs”
Most teams that add AI to their operations stop at the MCP step: install a chat MCP, install a CRM MCP, give Claude the credentials, hope for the best. That works for a single user with simple questions. It doesn’t scale, and it doesn’t survive an audit.
The kit is built on four decisions that turn “Claude has access to our systems” into “the team can use Claude against our systems safely.”
Subagents do the investigation. Claude orchestrates.
The most valuable piece of the kit isn’t any single skill; it’s the fleet of three specialist subagents that the main Claude session delegates to. When an operator asks a multi-angle question, Claude doesn’t try to investigate it all in one stretched thread. It dispatches the warehouse-analyst, the crm-agent, and the desk-agent to look in parallel. Each subagent loads only the references it needs for its corner of the world, keeps its context window clean, and reports back.
That parallelism is what makes the kit feel fast and competent rather than “a bot that takes forever and forgets what you asked.” It also gives the operator investigation depth they wouldn’t be able to produce by hand — three specialists working at the same time, on the same question, talking to the same orchestrator.
Every skill has an owner. By template, not by hope.
Each skill carries a RACI block at the top of its instruction file: Executor (who runs it), Owner (who maintains it), Experts (who to consult when it breaks), Consumers (who depends on its output). The meta-skill that scaffolds new skills refuses to ship one without that block filled in.
The business meaning: if a skill returns a wrong answer six months from now, accountability is in the file, not in someone’s memory. Standard practice from regulated industries (banking, healthcare, aerospace), translated into AI tooling.
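A minimal sketch of how such a refusal might be checked. The "Role: value" line format, the TODO handling, and the function name `missing_raci_fields` are assumptions for illustration; the kit's actual file layout is not shown here.

```python
import re

# The four roles named in the text. The "Role: value" line format is an
# assumption for illustration, not the kit's documented layout.
RACI_ROLES = ("Executor", "Owner", "Experts", "Consumers")


def missing_raci_fields(instruction_text: str) -> list[str]:
    """Return the RACI roles that are absent or left empty."""
    missing = []
    for role in RACI_ROLES:
        # Match e.g. "Owner: ops-maintainer" at the start of a line.
        match = re.search(rf"^{role}:[ \t]*(.*)$", instruction_text, re.MULTILINE)
        value = match.group(1).strip() if match else ""
        if not value or value.upper() == "TODO":
            missing.append(role)
    return missing


complete = """Executor: operator
Owner: ops-maintainer
Experts: data-team
Consumers: finance
"""
incomplete = """Executor: operator
Owner:
Experts: TODO
"""

print(missing_raci_fields(complete))    # []
print(missing_raci_fields(incomplete))  # ['Owner', 'Experts', 'Consumers']
```

A scaffolding step could refuse to write the skill whenever this list is non-empty.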
Read and write are separate, even for the same system.
The kit treats read and write as different skills. Same underlying
credentials, same API — different skill names. The reason isn’t security
(that’s the OAuth scope’s job); it’s cognitive. When a skill is named
*-write, both the operator and Claude become aware that this turn can
change something downstream. Write skills get extra discipline:
| criterion | read skills | write skills |
|---|---|---|
| default behaviour | execute | dry-run |
| confirmation per call | no | yes (--commit + --reason) |
| per-session change cap | — | yes (per-skill default) |
| pre-state captured | — | backup_strategy block |
| destructive ops | impossible | extra ticket reference required |
| lint rules applied | base set | base + 4 write-specific rules |
Result: no operator mutates a system by accident. A change has to be deliberate, and even then it is capped and recorded.
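The first three rows of the table could be sketched as a small guard object. This is an illustration under assumptions: the class name `WriteGuard`, the `commit` and `reason` keyword arguments (standing in for the --commit and --reason flags), and the cap of 5 are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WriteGuard:
    """Dry-run by default; a real write needs commit=True and a reason."""
    max_changes: int = 5   # per-session cap; the value is illustrative
    changes_made: int = 0

    def run(self, action, *, commit: bool = False, reason: str = "") -> str:
        if not commit:
            # Default path: report what would happen, touch nothing.
            return "dry-run"
        if not reason.strip():
            raise ValueError("a committed write needs a reason")
        if self.changes_made >= self.max_changes:
            raise RuntimeError("per-session change cap reached")
        action()
        self.changes_made += 1
        return "committed"


updated = []
guard = WriteGuard(max_changes=1)
print(guard.run(lambda: updated.append("owner")))  # dry-run
print(guard.run(lambda: updated.append("owner"),
                commit=True, reason="fix owner field"))  # committed
```

A second committed call on the same guard would raise, because the cap of one change has been reached.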
Secrets never reach the AI.
The non-obvious point isn’t “no secrets in the repo” — that’s table stakes. It’s that no credential is ever passed through Claude’s context window. Every skill reads its credentials from the OS keychain at execution time, through Python’s keyring library, and uses them locally. Claude orchestrates the skill, but never sees the bearer token. If a session transcript leaks tomorrow, no credentials leak with it.
This is one of the most under-discussed patterns in AI-Ops today. Many teams pass credentials in environment variables that the AI tier can read. The kit doesn’t.
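A hedged sketch of the pattern: the skill resolves the token locally and returns only data that is safe to place in the model's context. `keyring.get_password` is the keyring library's lookup call; the keychain entry names "crm-api" and "operator", the function names, and the injectable `lookup` parameter are assumptions made so the example can run without a real keychain.

```python
from typing import Callable, Optional


def keychain_lookup(service: str, account: str) -> Optional[str]:
    """Read a secret from the OS keychain via the keyring library."""
    import keyring  # third-party; imported lazily so the sketch loads without it
    return keyring.get_password(service, account)


def call_upstream(
    query: str,
    lookup: Callable[[str, str], Optional[str]] = keychain_lookup,
) -> dict:
    """Run a skill call and return only what is safe to show the model."""
    # "crm-api" and "operator" are hypothetical keychain entry names.
    token = lookup("crm-api", "operator")
    if token is None:
        return {"ok": False, "error": "no credential in keychain"}
    headers = {"Authorization": f"Bearer {token}"}
    # A real skill would send `headers` with its HTTP request here. The
    # token stays in this process and is never part of the return value.
    _ = headers
    return {"ok": True, "query": query}


# Usage with a fake lookup, so no real keychain is touched.
result = call_upstream("open deals", lookup=lambda service, account: "secret-token")
print(result)  # {'ok': True, 'query': 'open deals'}
```

The design point is the return value: whatever flows back to the orchestrator contains the result of the call, not the credential used to make it.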
The investigation in practice
When an operator asks “why are we seeing a 40% discrepancy between the two analytics surfaces for last month’s recurring revenue?” the kit responds like this:
- Claude reads the question. It identifies the two systems involved (warehouse + CRM live API) and the kind of analysis needed (snapshot reconciliation).
- Two subagents are dispatched in parallel. The warehouse-analyst queries the relevant snapshot tables. The crm-agent queries the live API for the same period.
- Each subagent reports. Numbers, the queries they ran, the timestamps. Nothing else: no extra reasoning, no opinion.
- Claude reconciles. It identifies where the numbers diverge and what’s likely causing it (timing lag in the snapshot table; a column that includes refunds in one source and not the other).
- The audit record is written. Which skills ran, against which systems, with what query, with what outcome, with what latency.
What used to take a multi-hour scramble between three people now takes a few minutes and produces a record an auditor can read.
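The dispatch-then-reconcile shape of those steps can be sketched in plain Python. This is an analogy only: real subagents are Claude Code sessions, not threads, and the functions, the period label, and the revenue figures below are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def warehouse_analyst(period: str) -> dict:
    # Stand-in for a subagent; the figure is invented for illustration.
    return {"source": "warehouse", "period": period, "mrr": 60_000.0}


def crm_agent(period: str) -> dict:
    # Stand-in for a subagent; the figure is invented for illustration.
    return {"source": "crm", "period": period, "mrr": 100_000.0}


def investigate(period: str) -> dict:
    """Run both specialists concurrently, then compare their numbers."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(agent, period)
                   for agent in (warehouse_analyst, crm_agent)]
        reports = [future.result() for future in futures]
    low, high = sorted(report["mrr"] for report in reports)
    return {
        "reports": reports,
        # Gap relative to the larger figure.
        "relative_gap": (high - low) / high,
    }


print(investigate("last-month")["relative_gap"])  # 0.4
```

Each specialist returns only numbers and their provenance; explaining the gap is left to the orchestrator.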
Governance: the merge gate
Every artifact in the kit passes this checklist before it can merge. Mechanical items are enforced by lint; the rest by reviewers.
- Useful to at least two people: single-user helpers stay outside the corporate kit.
- A named owner: RACI block at the top, no exceptions.
- Template-compliant: required sections present, no TODO markers, description in plain language.
- Safe defaults: read-only by default; no destructive operation runs without confirmation.
- No secrets: verified by lint and the CI’s secret-detection job.
- No name conflicts: <prefix>-<area>-<verb> is unique across the catalogue.
- Documented failure behaviour: what the skill does when the upstream API errors, auth expires, or input is malformed.
Lint rules that protect the team start as warnings and graduate to errors only after three real skills have used the rule successfully in production. The rule engine evolves with the catalogue, not ahead of it.
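Two of the mechanical checks, sketched under assumptions: the regular expression is one possible reading of <prefix>-<area>-<verb>, the prefix "ops" in the usage is hypothetical, and the function names are illustrative. Only the threshold of three comes from the text.

```python
import re

# Assumed reading of <prefix>-<area>-<verb>: three lowercase,
# hyphen-separated parts. The kit's real pattern is not shown.
NAME_PATTERN = re.compile(r"^[a-z0-9]+-[a-z0-9]+-[a-z]+$")
GRADUATION_THRESHOLD = 3  # "three real skills", per the text


def severity(skills_passing_in_production: int) -> str:
    """A rule stays a warning until enough production skills have used it."""
    if skills_passing_in_production >= GRADUATION_THRESHOLD:
        return "error"
    return "warning"


def check_name(name: str, catalogue: set[str]) -> list[str]:
    """Return naming problems for a proposed skill name."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append("name does not match <prefix>-<area>-<verb>")
    if name in catalogue:
        problems.append("name already exists in the catalogue")
    return problems


print(severity(2), severity(3))                      # warning error
print(check_name("ops-crm-read", {"ops-crm-read"}))  # ['name already exists in the catalogue']
```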
The catalogue
Each domain has a read skill and a write skill. The write skill always inherits the extra discipline shown above.
- ● 01 skill-creator: interviews you · scaffolds with RACI
- ● 02 setup: walks new operator through which skills they need
- ● 03 refresh-references: on-demand snapshot refresh
- ● 01 warehouse-read: keychain auth · smoke-tested
- ● 02 warehouse-write: statement allowlist · row cap · audit log
- ● 01 crm-read: OAuth v2 · refresh-on-call
- ● 02 crm-write: first to implement backup_strategy · DELETE needs ticket reference
- ● 01 desk-read: orgId header injection
- ● 02 desk-write: public reply needs literal send-public token
- ● 01 tracker-read: by user / by issue / by query / by project
- ● 02 tracker-write: comments + transitions + assignee changes
- ● 01 orch-read: live snapshot of CRM↔payments sync (18 flows, 11 active)
- ● 02 orch-write: scoped to a curated tool surface
Three subagents — the value layer
warehouse-analyst, crm-agent, desk-agent. Each is a read-only
specialist that loads only the references it needs for its domain and
reports findings back to the orchestrator. Each runs in its own context
window so the main session never gets cluttered. None of them grants
new access — each one inherits the operator’s existing OAuth scope and
can do exactly what the operator could do by hand, no more.
Planned
- ○ 01 payments-read: cross-check vs warehouse
- ○ 02 team-chat-read: analytics on only-in-chat threads
- ○ 03 accounting-read: manual-refund channel; access path TBD
The audit trail, in two sentences
Every skill that touches an upstream system is wrapped in a decorator. The decorator captures which skill ran, against which system, with what parameters, with what outcome, and how long it took, and ships those records to a central log aggregator at the end of each session.
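A minimal sketch of what such a wrapper might look like. The name `audited`, the record fields, and the in-memory list are assumptions for illustration; the kit's own decorator is not reproduced here.

```python
import functools
import time

AUDIT_RECORDS = []  # in-memory stand-in for the local record queue


def audited(skill: str, system: str):
    """Record skill, system, parameters, outcome and latency for each call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = time.perf_counter()
            outcome = "error"  # overwritten only if the call succeeds
            try:
                result = func(*args, **kwargs)
                outcome = "ok"
                return result
            finally:
                # Runs on success and on exception, so failures are logged too.
                AUDIT_RECORDS.append({
                    "skill": skill,
                    "system": system,
                    "params": {"args": args, "kwargs": kwargs},
                    "outcome": outcome,
                    "latency_s": time.perf_counter() - started,
                })
        return wrapper
    return decorator


@audited(skill="crm-read", system="crm")
def count_open_deals(owner: str) -> int:
    return 3  # placeholder result for the example


count_open_deals("alice")
print(AUDIT_RECORDS[-1]["skill"], AUDIT_RECORDS[-1]["outcome"])  # crm-read ok
```

Because the record is written in a `finally` clause, the skill author gets the audit entry without adding any logging code to the skill body.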
Two operational properties matter:
- The shipper is idempotent — a session that loses its VPN connection mid-flight retries on the next session, and never ships a duplicate.
- The shipper is non-blocking — if the log aggregator is unreachable, the operator’s session is unaffected; records queue up locally and ship on the next successful connection.
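Both properties can be sketched with a local queue and a stable record ID. Everything named here is an assumption: the `Shipper` and `Aggregator` classes, the use of a UUID as the deduplication key, and the choice to deduplicate on the receiving side.

```python
import uuid


class Aggregator:
    """Stand-in log aggregator that ignores IDs it has already stored."""

    def __init__(self):
        self.online = True
        self.stored = {}

    def ingest(self, record: dict) -> None:
        if not self.online:
            raise ConnectionError("aggregator unreachable")
        # Storing the same ID twice is a no-op, which makes retries safe.
        self.stored.setdefault(record["id"], record)


class Shipper:
    """Queue records locally and ship them without blocking the session."""

    def __init__(self, send):
        self.send = send   # callable that raises ConnectionError on failure
        self.queue = []    # records waiting to ship

    def record(self, payload: dict) -> None:
        # The ID is assigned once, so a retry re-sends the same ID.
        self.queue.append({"id": str(uuid.uuid4()), **payload})

    def flush(self) -> int:
        """Try to ship the queue; never raise into the operator's session."""
        remaining = []
        sent = 0
        for rec in self.queue:
            try:
                self.send(rec)
                sent += 1
            except ConnectionError:
                remaining.append(rec)  # keep for the next attempt
        self.queue = remaining
        return sent


aggregator = Aggregator()
shipper = Shipper(aggregator.ingest)
shipper.record({"skill": "crm-read", "outcome": "ok"})

aggregator.online = False
print(shipper.flush(), len(shipper.queue))      # 0 1
aggregator.online = True
print(shipper.flush(), len(aggregator.stored))  # 1 1
```

If an acknowledgement were lost and the same record were sent again, the aggregator would still hold one copy, because the ID is unchanged.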
For the business: the team gets a complete record of who asked the AI tier what, against which system, when, with what outcome. That record is the foundation for trusting the kit at scale, and the starting point for any future compliance conversation.
What this says about the builder
In the AI-tooling space today, most teams ship individual prompts, individual MCP servers, or individual agents. The choice here was to ship a catalogue with governance: a meta-skill that enforces structure, a lint engine that codifies invariants, a RACI block that names accountability, a telemetry pipeline that creates an audit trail, a backup contract that protects against mutations going wrong, and a tiered model that distinguishes corporate artifacts from personal experiments.
The vocabulary borrowed from elsewhere (RACI from project management, progressive disclosure from documentation, dry-run from infrastructure-as-code, audit trail from regulated software, OAuth-per-user from security operations) is the signature of an AI-Ops practice that asks “what does the AI tier need to be safe to scale?”, not just “what does it need to be useful once?”
Four days of build. One person. Fourteen skills. Three subagents. The rollout is happening now.
This case describes structure and patterns. Specific vendors, hostnames, and product names are deliberately left abstract; what matters here is the architecture, not the procurement choices. The client, the codename, and any URL that would identify either are out of scope by policy.