AI-Ops kit for a corporate operations team

A Claude Code skill catalogue with a subagent fleet that turns a multi-system operations investigation into a single conversation. Built solo in 4 days, governed so the rest of the team can use it safely.

Build time: 4 days · ongoing rollout

In one paragraph

A small operations team was spending most of its day stitching context together by hand across a CRM, a support desk, an engineering tracker, an analytics warehouse, a payments processor, and a couple of glue tools — just to answer a single business question. This kit replaces that scramble with one conversation. The operator speaks plainly to Claude Code; the right skill is picked automatically; specialist subagents go off to investigate in parallel; an audit trail is recorded without anyone having to remember. Fourteen skills and three subagents in production. Four days of build by one person. Rolling out to the operations team now.

The problem in business terms

The operations team’s data lives in many places: an analytics warehouse, a CRM, a support desk, an engineering tracker, a payments processor, a secondary accounting system, a separate BI surface, and an automation orchestrator gluing them together. Every real business question — why does this number differ across two dashboards? where did this customer’s refund get stuck? which support tickets correlate with the recent product change? — requires reading two or three of those systems and reconciling the answers.

Before this kit, that reconciliation took multiple sessions and hours of elapsed time. The maintainer would spend an hour assembling context for each Gemini conversation, then another hour reconciling. Other operators, without a workflow at all, would escalate to engineering instead.

The kit collapses that loop into one prompt. The operator describes the question; the kit picks the right skills, talks to the right systems, and returns a reconciled answer. The team stops waiting on engineering, and engineering stops being a bottleneck for routine investigations.

how a multi-system investigation runs · 6 weeks before vs after rollout
without the kit
  • operators without their own AI workflow escalate to engineering
  • the maintainer assembles context manually across 2–3 Gemini sessions
  • every investigation re-invents its own prompt; no shared vocabulary
  • ad-hoc queries via personal scripts, scattered across laptops
  • no record of what the AI was asked, against which system, with what outcome
with the kit
  • operators get answers in one conversation, without escalating
  • subagents investigate the question in parallel from different angles
  • shared skills with shared semantics; the next operator inherits the vocabulary
  • data access goes through separate read and write skills — risk surface is bounded
  • every invocation is captured as a structured audit record

Two lifecycles, two pipelines

The kit has two distinct flows owned by two distinct populations. One runs when I push a change. The other runs every time an operator opens Claude Code. They share a repository, but nothing else.

Build & release — runs on git push (maintainer)

build & release · me · runs on push to main
  1. commit: main is the catalogue
  2. lint: structure + naming + RACI presence
  3. secret-detect: trufflehog + CI scanner
  4. auto-tag: bot tags the commit; kit pins to it

I’m currently the only person who pushes to main. That’s by design while the kit is still maturing, and it’s about to change (see the meta-skill below).

Session — runs every time an operator opens Claude Code

session · every operator · per session
  1. session start: kit self-updates to the latest tag
  2. prompt: operator speaks naturally
  3. subagents: Claude delegates to specialists in parallel
  4. skill calls: each subagent invokes the skills it needs
  5. session end: audit trail shipped to the log aggregator

For the operator, the machinery is invisible: open Claude Code, the kit self-updates, ask a question in plain English. Claude decides which subagents to dispatch, those subagents call the right skills, and the answer comes back as a single reply. The audit record is created without anyone having to think about it.

What makes this different from “just install a few MCPs”

Most teams that add AI to their operations stop at the MCP step: install a chat MCP, install a CRM MCP, give Claude the credentials, hope for the best. That works for a single user with simple questions. It doesn’t scale, and it doesn’t survive an audit.

The kit is built on four decisions that turn “Claude has access to our systems” into “the team can use Claude against our systems safely.”

Subagents do the investigation. Claude orchestrates.

The most valuable piece of the kit isn’t any single skill; it’s the fleet of three specialist subagents that the main Claude session delegates to. When an operator asks a multi-angle question, Claude doesn’t try to investigate it all in one stretched thread; it dispatches warehouse-analyst, crm-agent, and desk-agent to look in parallel. Each subagent loads only the references it needs for its corner of the world, keeps its context window clean, and reports back.

That parallelism is what makes the kit feel fast and competent rather than “a bot that takes forever and forgets what you asked.” It also gives the operator investigation depth they wouldn’t be able to produce by hand — three specialists working at the same time, on the same question, talking to the same orchestrator.

Every skill has an owner. By template, not by hope.

Each skill carries a RACI block at the top of its instruction file: Executor (who runs it), Owner (who maintains it), Experts (who to consult when it breaks), Consumers (who depends on its output). The meta-skill that scaffolds new skills refuses to ship one without that block filled in.

The business meaning: if a skill returns a wrong answer six months from now, accountability is in the file, not in someone’s memory. Standard practice from regulated industries (banking, healthcare, aerospace), translated into AI tooling.
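
A minimal sketch of how a scaffolding or lint step could refuse a skill whose RACI block is incomplete. It assumes, hypothetically, that the block is written as `Role: value` lines in the instruction file; the function name `missing_raci_roles` and the placeholder list are illustrative, not the kit's actual code.

```python
import re

# The four roles named in the RACI block described above.
REQUIRED_ROLES = ("Executor", "Owner", "Experts", "Consumers")

# Values that should not count as "filled in" (an assumed list).
PLACEHOLDERS = {"", "todo", "tbd", "?"}


def missing_raci_roles(instruction_text: str) -> list[str]:
    """Return the RACI roles that are absent or left as placeholders."""
    found = {}
    for line in instruction_text.splitlines():
        match = re.match(r"^\s*(\w+)\s*:\s*(.*)$", line)
        if match and match.group(1) in REQUIRED_ROLES:
            found[match.group(1)] = match.group(2).strip()
    return [
        role for role in REQUIRED_ROLES
        if found.get(role, "").lower() in PLACEHOLDERS
    ]
```

A scaffolder built this way would stop with an error whenever the returned list is non-empty, which is one way to make ownership a precondition rather than a convention.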

Read and write are separate, even for the same system.

The kit treats read and write as different skills. Same underlying credentials, same API — different skill names. The reason isn’t security (that’s the OAuth scope’s job); it’s cognitive. When a skill is named *-write, both the operator and Claude become aware that this turn can change something downstream. Write skills get extra discipline:

read vs write skill contract · enforced by template + lint
criterion              | read skills | write skills
default behaviour      | execute     | dry-run
confirmation per call  | no          | yes (--commit + --reason)
per-session change cap | -           | yes (per-skill default)
pre-state captured     | -           | backup_strategy block
destructive ops        | impossible  | extra ticket reference required
lint rules applied     | base set    | base + 4 write-specific rules

Result: an operator cannot mutate a system by accident. A change has to be deliberate, and even then it is capped and recorded.
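
The write-side contract can be sketched roughly as follows. The `--commit` and `--reason` flags come from the table above; the cap value, the `crm-write` program name, and the `run_write` function are hypothetical stand-ins, and a real write skill would also capture pre-state before calling the upstream API.

```python
import argparse

SESSION_CHANGE_CAP = 5  # hypothetical per-skill default


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="crm-write")
    parser.add_argument("--commit", action="store_true",
                        help="apply the change; without it the skill only previews")
    parser.add_argument("--reason", default="",
                        help="why the change is being made (needed with --commit)")
    return parser


def run_write(argv: list[str], changes_so_far: int) -> str:
    """Decide what a write call does: dry-run, committed, or refused."""
    args = build_parser().parse_args(argv)
    if not args.commit:
        return "dry-run"  # default behaviour: preview only
    if not args.reason.strip():
        return "refused: --commit needs --reason"
    if changes_so_far >= SESSION_CHANGE_CAP:
        return "refused: per-session change cap reached"
    # A real skill would capture pre-state and call the upstream API here.
    return "committed"
```

Calling `run_write([], 0)` previews only; a change happens only when both flags are present and the session is still under its cap.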

Secrets never reach the AI.

The non-obvious point isn’t “no secrets in the repo” — that’s table stakes. It’s that no credential is ever passed through Claude’s context window. Every skill reads its credentials from the OS keychain at execution time, through Python’s keyring library, and uses them locally. Claude orchestrates the skill, but never sees the bearer token. If a session transcript leaks tomorrow, no credentials leak with it.

This is the single most under-discussed pattern in AI-Ops today. Most teams pass credentials in environment variables that the AI tier can read. The kit doesn’t.
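
A sketch of the pattern, assuming the third-party `keyring` package and its `get_password(service, username)` call. The function names, the injectable `lookup` parameter (added here so the example can run without a real keychain), and the shape of the returned summary are illustrative assumptions.

```python
from typing import Callable, Optional

Lookup = Callable[[str, str], Optional[str]]


def load_token(service: str, account: str, lookup: Optional[Lookup] = None) -> str:
    """Fetch a credential at execution time; it is never printed or returned."""
    if lookup is None:
        import keyring  # third-party package; reads the OS keychain
        lookup = keyring.get_password
    token = lookup(service, account)
    if not token:
        raise RuntimeError(f"no credential stored for {service}/{account}")
    return token


def run_skill(service: str, account: str, lookup: Optional[Lookup] = None) -> dict:
    """Use the token locally and hand back only a credential-free summary."""
    token = load_token(service, account, lookup)
    headers = {"Authorization": f"Bearer {token}"}  # stays inside this process
    # A real skill would send `headers` to the upstream API here.
    return {"service": service, "authenticated": bool(headers["Authorization"])}
```

Only the returned dictionary reaches the conversation, so a leaked transcript contains the fact that a call was authenticated but not the token itself.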

The investigation in practice

When an operator asks “why are we seeing a 40% discrepancy between the two analytics surfaces for last month’s recurring revenue?” the kit responds like this:

  1. Claude reads the question. It identifies the two systems involved (warehouse + CRM live API) and the kind of analysis needed (snapshot reconciliation).
  2. Two subagents are dispatched in parallel. The warehouse-analyst queries the relevant snapshot tables. The crm-agent queries the live API for the same period.
  3. Each subagent reports. Numbers, the queries they ran, the timestamps. Nothing else — no extra reasoning, no opinion.
  4. Claude reconciles. It identifies where the numbers diverge and what’s likely causing it (timing lag in the snapshot table; a column that includes refunds in one source and not the other).
  5. The audit record is written. Which skills ran, against which systems, with what query, with what outcome, with what latency.
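
As a rough analogy for steps 2 to 4, the sketch below runs two collectors in parallel and compares their figures. Threads stand in for subagents here; this is not how Claude Code dispatches them. The `Report` type, the collector functions, and the revenue figures are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Report:
    source: str
    value: float
    query: str


def warehouse_report() -> Report:
    # Stand-in for warehouse-analyst; the figure is invented for the example.
    return Report("warehouse", 100_000.0, "snapshot table, last month")


def crm_report() -> Report:
    # Stand-in for crm-agent; the figure is invented for the example.
    return Report("crm", 60_000.0, "live API, last month")


def investigate(collectors) -> list[Report]:
    """Run every collector in parallel and return reports in the given order."""
    with ThreadPoolExecutor(max_workers=len(collectors)) as pool:
        futures = [pool.submit(collector) for collector in collectors]
        return [future.result() for future in futures]


def relative_gap(a: Report, b: Report) -> float:
    """Difference between two figures as a fraction of the larger one."""
    larger = max(abs(a.value), abs(b.value))
    return 0.0 if larger == 0 else abs(a.value - b.value) / larger
```

With the invented figures above, `relative_gap(*investigate([warehouse_report, crm_report]))` comes out at about 0.4, the kind of discrepancy the orchestrator would then try to explain.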

What used to take a multi-hour scramble between three people now takes a few minutes and produces a record an auditor can read.

Governance: the merge gate

Every artifact in the kit passes this checklist before it can merge. Mechanical items are enforced by lint; the rest by reviewers.

  1. Useful to at least two people — single-user helpers stay outside the corporate kit.
  2. A named owner — RACI block at the top, no exceptions.
  3. Template-compliant — required sections present, no TODO markers, description in plain language.
  4. Safe defaults — read-only by default; no destructive operation runs without confirmation.
  5. No secrets — verified by lint and the CI’s secret-detection job.
  6. No name conflicts: <prefix>-<area>-<verb> is unique across the catalogue.
  7. Documented failure behaviour — what the skill does when the upstream API errors, auth expires, or input is malformed.

Lint rules that protect the team start as warnings and graduate to errors only after three real skills have used the rule successfully in production. The rule engine evolves with the catalogue, not ahead of it.
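
Two of the mechanical checks can be sketched as below: the naming rule from item 6 and the warning-to-error graduation just described. The regular expression, the function names, and the `ops-` prefix used in the usage note are assumptions, not the kit's actual lint engine.

```python
import re
from collections import Counter

# Assumed reading of <prefix>-<area>-<verb>: three lowercase segments.
NAME_PATTERN = re.compile(r"^[a-z0-9]+-[a-z0-9]+-[a-z0-9]+$")

GRADUATION_THRESHOLD = 3  # "three real skills" from the text above


def name_findings(names: list[str]) -> list[str]:
    """Report names that break the pattern or appear more than once."""
    findings = []
    for name, count in Counter(names).items():
        if not NAME_PATTERN.match(name):
            findings.append(f"{name}: does not match <prefix>-<area>-<verb>")
        if count > 1:
            findings.append(f"{name}: declared {count} times")
    return findings


def severity(skills_passing_in_production: int) -> str:
    """A rule stays a warning until enough real skills have used it."""
    if skills_passing_in_production >= GRADUATION_THRESHOLD:
        return "error"
    return "warning"
```

For example, `name_findings(["ops-crm-read", "ops-crm-read"])` reports the duplicate, and `severity(2)` still returns `"warning"`.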

The catalogue

Each system domain has a read skill and a write skill; the meta and orchestration group is the exception. The write skill always inherits the extra discipline shown above.

meta and orchestration · 3 skills · all live
  1. skill-creator: interviews you · scaffolds with RACI
  2. setup: walks new operator through which skills they need
  3. refresh-references: on-demand snapshot refresh
analytics warehouse · 2 skills · read + write
  1. warehouse-read: keychain auth · smoke-tested
  2. warehouse-write: statement allowlist · row cap · audit log
sales CRM · 2 skills · per-operator OAuth
  1. crm-read: OAuth v2 · refresh-on-call
  2. crm-write: first to implement backup_strategy · DELETE needs ticket reference
support desk · 2 skills · org-scoped tokens
  1. desk-read: orgId header injection
  2. desk-write: public reply needs literal send-public token
engineering tracker · 2 skills · per-operator personal access token
  1. tracker-read: by user / by issue / by query / by project
  2. tracker-write: comments + transitions + assignee changes
automation orchestrator · 2 skills · curated tool surface
  1. orch-read: live snapshot of CRM↔payments sync (18 flows, 11 active)
  2. orch-write: scoped to a curated tool surface

Three subagents — the value layer

warehouse-analyst, crm-agent, desk-agent. Each is a read-only specialist that loads only the references it needs for its domain and reports findings back to the orchestrator. Each runs in its own context window so the main session never gets cluttered. None of them grants new access — each one inherits the operator’s existing OAuth scope and can do exactly what the operator could do by hand, no more.

Planned

  1. payments-read: cross-check vs warehouse
  2. team-chat-read: analytics only-in-chat threads
  3. accounting-read: manual-refund channel; access path TBD

The audit trail, in two sentences

Every skill that touches an upstream system is wrapped in a @telemetered decorator:

```python
@telemetered(
  skill="warehouse-read",
  operation="query",
  outcome_map={0: "ok", 1: "error", 2: "auth_expired"},
  extras=lambda result, args: {"sql_preview": args.sql[:200]},
)
def cmd_query(args): ...
```

The decorator captures what skill ran, against what system, with what parameters, with what outcome, and how long it took, and ships those records to a central log aggregator at the end of each session.
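
One way such a decorator could be written is sketched below. Only the call signature is taken from the snippet above; the in-memory `RECORDS` list, the record field names, and the `"unknown"` fallback outcome are assumptions.

```python
import functools
import time
from types import SimpleNamespace

RECORDS = []  # in-memory stand-in for the local record queue (assumption)


def telemetered(skill, operation, outcome_map, extras=None):
    """Wrap a command so each call appends one structured audit record."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(args):
            started = time.monotonic()
            result = None  # stays None if func raises, giving outcome "unknown"
            try:
                result = func(args)
                return result
            finally:
                record = {
                    "skill": skill,
                    "operation": operation,
                    "outcome": outcome_map.get(result, "unknown"),
                    "latency_s": time.monotonic() - started,
                }
                if extras is not None:
                    record.update(extras(result, args))
                RECORDS.append(record)
        return wrapper
    return decorator


@telemetered(
    skill="warehouse-read",
    operation="query",
    outcome_map={0: "ok", 1: "error", 2: "auth_expired"},
    extras=lambda result, args: {"sql_preview": args.sql[:200]},
)
def cmd_query(args):
    return 0  # stand-in for a successful query


cmd_query(SimpleNamespace(sql="SELECT 1"))
```

After the call, `RECORDS` holds one entry with the skill name, the mapped outcome, the latency, and the truncated SQL preview.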

Two operational properties matter:

  • The shipper is idempotent — a session that loses its VPN connection mid-flight retries on the next session, and never ships a duplicate.
  • The shipper is non-blocking — if the log aggregator is unreachable, the operator’s session is unaffected; records queue up locally and ship on the next successful connection.
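
Both properties can be illustrated with a small sketch. It assumes each record carries something unique (a timestamp or call id) so that a content hash works as a stable identifier; the `Shipper` class and the `send` callable are hypothetical, and a real shipper would persist its queue to disk between sessions.

```python
import hashlib
import json


def record_id(record: dict) -> str:
    """Stable id derived from record content, so a retry yields the same id."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class Shipper:
    def __init__(self, send):
        self.send = send          # callable(record_id, record); raises ConnectionError offline
        self.queue = []           # records waiting to ship
        self.shipped_ids = set()  # ids already delivered

    def enqueue(self, record: dict) -> None:
        self.queue.append(record)

    def flush(self) -> int:
        """Ship what can be shipped; never raises on a connection failure."""
        sent = 0
        while self.queue:
            record = self.queue[0]
            rid = record_id(record)
            if rid not in self.shipped_ids:
                try:
                    self.send(rid, record)
                except ConnectionError:
                    break  # aggregator unreachable: keep the rest queued
                self.shipped_ids.add(rid)
                sent += 1
            self.queue.pop(0)  # delivered now or earlier, so drop it
        return sent
```

A failed `flush()` leaves the queue untouched for the next attempt, and a record that was already delivered is dropped without being sent again.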

For the business: the team gets a complete record of who asked the AI tier what, against which system, when, with what outcome. That record is the foundation for trusting the kit at scale, and the starting point for any future compliance conversation.

What this says about the builder

In the AI-tooling space today, most teams ship individual prompts, individual MCP servers, or individual agents. The choice here was to ship a catalogue with governance: a meta-skill that enforces structure, a lint engine that codifies invariants, a RACI block that names accountability, a telemetry pipeline that creates an audit trail, a backup contract that protects against mutations going wrong, and a tiered model that distinguishes corporate artifacts from personal experiments.

The vocabulary is borrowed from elsewhere: RACI from project management, progressive disclosure from documentation, dry-run from infrastructure-as-code, audit trail from regulated software, OAuth-per-user from security operations. That borrowing is the signature of an AI-Ops practice that asks “what does the AI tier need to be safe to scale?”, not just “what does it need to be useful once?”

Four days of build. One person. Fourteen skills. Three subagents. The rollout is happening now.


This case describes structure and patterns. Specific vendors, hostnames, and product names are deliberately left abstract; what matters here is the architecture, not the procurement choices. The client, the codename, and any URL that would identify either are out of scope by policy.