News-digest swarm for an internal product team

A multi-tenant runtime that delivers a weekly industry-news digest to a product team's chat channel. One codebase, N YAML configs — adding another team's agent is configuration, not code. Three days of solo build; seven test runs to polish before regular delivery.

Build time: 3 days build · 7 test runs · rolling out

In one paragraph

A product team needed a weekly industry-news digest in their chat channel: two audiences inside the same team (marketing for content, engineering for market context), one shared signal. Ad-hoc queries to consumer AI tools happen only when someone thinks to ask, so they are useful occasionally but never a steady source of market context. This swarm replaces occasional with scheduled. Every Monday morning, a containerised agent on a single VM collects candidates from search APIs and RSS, deduplicates them at three levels (cheapest filter first, most expensive last), ranks them with an LLM, summarises the top items in two languages, and posts them as a structured message in the team’s chat channel. Three days of build and seven test runs to polish the output produced one agent ready to deliver and a second already designed; adding it is a YAML file, not a code change.

3 days · solo build
7 runs · test runs, polish before rollout
159 tests · CI green on every push
0 code · per new tenant, YAML only

The problem in business terms

The product team has two audiences that both need industry news, for different reasons. Marketing needs fresh facts to feed content. Engineering and product need market context — M&A moves, competitor releases, regulatory shifts, weather and climate signals that affect the business. Before this swarm, context arrived occasionally — when someone happened to need an answer that day and ran an ad-hoc query, or when a colleague happened to forward something useful that week. There was no steady cadence; the team moved with whoever was paying attention.

The swarm replaces that with a structured weekly delivery in the chat channel the team already lives in. Every Monday at 10:00 local time, a single message arrives with the top items, summarised in two languages, tagged by country, with a one-line “why this matters” per item. Nobody has to remember to run anything. Nobody has to forward links.

The non-obvious move is that the system is built multi-tenant from day one. A second team can be added with one YAML file — their own topic list, their own countries, their own chat channel, their own delivery cadence. No code change, no fork, no parallel deployment. The same runtime serves them.

How the weekly digest gets to the team: before vs after the swarm
without the swarm
  • context arrives only when someone thinks to ask for it — never on a steady cadence
  • coverage depends on whoever happened to forward something useful this week
  • no record of what was surveyed, ranked, or chosen — every digest a one-off
  • adding a second team's digest means duplicating the whole effort
  • no measurement of cost, no measurement of recall, no audit
with the swarm
  • the digest arrives in chat every Monday at 10:00, without anyone running anything
  • coverage is driven by a per-tenant config — countries, topics, sources, owner
  • every run leaves a structured record: discovery → dedup → rank → summary → deliver
  • adding a second team is one YAML file plus one chat webhook — zero code
  • ops channel reports failure, silent partial-failure, and weekly cost per run

A weekend project can scrape RSS and post links. What separates that from something the team can rely on for years is a small set of structural decisions that compound. Four of them carry most of the weight here.

Multi-tenant from day one — one codebase, N YAML configs.

The most valuable structural decision — and the reason adding the next team is cheap — is that the agent is a thin runtime parameterised by config. A per-tenant YAML file declares everything that varies between teams: which topics to include and exclude, which countries are P1 vs P2, which sources are always-include and which are blacklisted, the cron schedule, the chat webhook, the owner, the model parameters, even the in-context ranking examples.

A new tenant is created by adding one file to configs/ and one service to docker-compose.yml. The agent’s row keys in the database are scoped by agent_id, so two tenants share the same Postgres without ever seeing each other’s data. The embedding model is loaded once on the host and shared by every tenant — N agents, one ML cost.
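
As a rough illustration of what "parameterised by config" can look like, the sketch below turns one parsed YAML mapping into a frozen, validated config object. Every field name here (agent_id, cron, webhook_env, topics_include and so on) is an assumption made for the example, not the swarm's actual schema, and the dict stands in for whatever a YAML parser would return for a file in configs/.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TenantConfig:
    """Everything that varies between teams. Field names are illustrative."""
    agent_id: str
    cron: str
    webhook_env: str  # name of the env var that holds the chat webhook URL
    owner: str
    topics_include: tuple[str, ...]
    topics_exclude: tuple[str, ...] = ()
    countries_p1: tuple[str, ...] = ()
    countries_p2: tuple[str, ...] = ()
    top_n: int = 10


def load_tenant(raw: dict) -> TenantConfig:
    """Build a TenantConfig from a parsed YAML mapping, failing fast on gaps."""
    required = ("agent_id", "cron", "webhook_env", "owner", "topics_include")
    missing = [key for key in required if key not in raw]
    if missing:
        raise ValueError(f"tenant config missing keys: {missing}")
    known = {f.name for f in fields(TenantConfig)}
    unknown = sorted(set(raw) - known)
    if unknown:
        raise ValueError(f"tenant config has unknown keys: {unknown}")
    # YAML lists become tuples so the frozen config stays immutable.
    cleaned = {k: tuple(v) if isinstance(v, list) else v for k, v in raw.items()}
    return TenantConfig(**cleaned)


# Stand-in for the mapping a YAML parser would return for one tenant file.
example = {
    "agent_id": "team-a",
    "cron": "0 10 * * 1",  # Mondays at 10:00
    "webhook_env": "TEAM_A_WEBHOOK",
    "owner": "owner@example.com",
    "topics_include": ["competitor releases", "regulation"],
    "countries_p1": ["CA"],
}
config = load_tenant(example)
print(config.agent_id, config.cron, config.top_n)
```

Rejecting unknown keys is a deliberate choice in this sketch: a typo in a tenant file fails at load time instead of silently falling back to a default.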

One codebase + N configs vs one project per team: why parameterisation beats forking

| criterion | per-team fork (typical) | one codebase + N YAML (this swarm, chosen) |
|---|---|---|
| adding a new team | fork the repo · diverge over time | new YAML + new service |
| shared bug fix | cherry-pick across N forks | one commit, all tenants benefit |
| shared ML cost | N copies of the embedder | one embedder · shared volume |
| schedules per team | independent | cron per YAML |
| data isolation | separate database per fork | agent_id FK · same Postgres |
| config drift over time | high · merges get harder | structural · per-tenant by design |

For the business: the cost of serving the second team is the cost of writing its YAML file. No engineer-week per new team, no parallel deployment to maintain, no slow drift between forks.

Three-level dedup — cheap filter first, expensive filter last.

The same news story shows up in four publications, three of them rewriting each other. Ranking every duplicate through the LLM is wasted money; ignoring duplicates dilutes the digest. The swarm runs a three-level dedup pipeline before any LLM call:

Dedup pipeline: cheapest filter first · 30-day window
  1. L1, URL hash: exact-URL collision · milliseconds
  2. L2, content hash: SHA-256 of normalised text · cheap text match
  3. L3, vector cosine: multilingual e5-base · pgvector · threshold 0.90
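
The three levels can be sketched in a few lines of plain Python. This is a minimal in-memory illustration, not the swarm's code: the article dict shape is assumed, the toy two-dimensional vectors stand in for real e5-base embeddings, a linear scan stands in for the pgvector lookup, and the 30-day window is left out.

```python
import hashlib
import math
import re


def url_key(url: str) -> str:
    """L1: hash of the exact URL string."""
    return hashlib.sha256(url.strip().encode("utf-8")).hexdigest()


def content_key(text: str) -> str:
    """L2: SHA-256 of the text with case and whitespace normalised."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedup(articles, threshold=0.90):
    """Keep the first article of each duplicate group, cheapest check first."""
    seen_urls, seen_content, kept = set(), set(), []
    for art in articles:
        u = url_key(art["url"])
        if u in seen_urls:  # L1: same URL seen before
            continue
        seen_urls.add(u)
        c = content_key(art["text"])
        if c in seen_content:  # L2: same normalised text seen before
            continue
        seen_content.add(c)
        # L3: only articles that survive L1 and L2 pay for vector comparison.
        if any(cosine(art["embedding"], k["embedding"]) >= threshold for k in kept):
            continue
        kept.append(art)
    return kept


articles = [
    {"url": "https://a.example/story", "text": "Hello world", "embedding": [1.0, 0.0]},
    {"url": "https://a.example/story", "text": "Other text", "embedding": [0.0, 1.0]},
    {"url": "https://b.example/copy", "text": "hello   WORLD", "embedding": [0.5, 0.5]},
    {"url": "https://c.example/rewrite", "text": "A rewrite", "embedding": [0.99, 0.05]},
    {"url": "https://d.example/other", "text": "Unrelated", "embedding": [0.0, 1.0]},
]
kept = dedup(articles)
print([a["url"] for a in kept])
```

In the example, the second article falls at L1, the third at L2, and the fourth at L3, so only the first and last survive to ranking.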

The threshold (0.90) wasn’t picked by feel. It came out of a sweep on a 288-article fresh-state set — a one-off tuner that prints the unique-count at each threshold value and lets the operator pick the inflection point. That tuner is checked into the repo as a CLI command for the next operator who tunes it.
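
A tuner of that kind might look roughly like the following. It is a sketch under assumptions: the function names are invented, the greedy first-seen grouping is one possible way to count uniques, and the four toy unit vectors replace the real embeddings of the tuning set.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def unique_count(embeddings, threshold) -> int:
    """Greedy count: a vector is unique if no kept vector is within threshold."""
    kept = []
    for vec in embeddings:
        if not any(cosine(vec, k) >= threshold for k in kept):
            kept.append(vec)
    return len(kept)


def sweep(embeddings, thresholds):
    """Return (threshold, unique_count) pairs for the operator to inspect."""
    return [(t, unique_count(embeddings, t)) for t in thresholds]


def unit(degrees):
    """Toy 2-D unit vector at the given angle."""
    rad = math.radians(degrees)
    return [math.cos(rad), math.sin(rad)]


vectors = [unit(0), unit(10), unit(30), unit(90)]
for threshold, count in sweep(vectors, [0.85, 0.95, 0.99]):
    print(f"threshold={threshold:.2f} unique={count}")
```

Reading the printed column top to bottom, the operator looks for the point where raising the threshold stops merging true duplicates and starts letting rewrites through. Because the grouping is greedy, the count can depend on input order.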

For the business: the LLM ranking budget is spent on signal, not on near-duplicates. A back-of-envelope reading of the funnel — roughly 287 candidates per run reduced to about 10 delivered items — is what makes the weekly cost report come out small enough to be unremarkable.

Idempotent stages — every run is restartable.

The pipeline has six stages: discovery, dedup, ranking, summarisation, delivery, ops post. Each stage reads articles from the database in a specific status, processes them, and writes them back in a new status. A run can be resumed at any stage with flags: for example, --skip-discovery --skip-dedup skips the work already done. Nothing held in memory survives across stages.
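
The status-in, status-out pattern and the skip flags can be sketched as below. Only --skip-discovery and --skip-dedup come from the description above; the other flags, the status names, and the in-memory rows (standing in for database rows) are assumptions for the example.

```python
import argparse

STAGES = ["discovery", "dedup", "ranking", "summarise", "delivery", "ops_post"]


def stages_to_run(argv):
    """Parse --skip-<stage> flags and return the stages still to execute."""
    parser = argparse.ArgumentParser(prog="agent-run")
    for stage in STAGES:
        flag = "--skip-" + stage.replace("_", "-")
        parser.add_argument(flag, action="store_true")
    args = parser.parse_args(argv)
    return [s for s in STAGES if not getattr(args, "skip_" + s)]


def run_stage(rows, reads, writes, work):
    """Process only rows in status `reads`, then move them to status `writes`.

    Rows already past this stage are untouched, so a rerun does no extra work.
    """
    handled = 0
    for row in rows:
        if row["status"] == reads:
            work(row)
            row["status"] = writes
            handled += 1
    return handled


rows = [{"id": 1, "status": "unique"}, {"id": 2, "status": "ranked"}]
first = run_stage(rows, "unique", "ranked", lambda r: r.update(score=0.5))
second = run_stage(rows, "unique", "ranked", lambda r: r.update(score=0.5))
print(stages_to_run(["--skip-discovery", "--skip-dedup"]), first, second)
```

The second call handles zero rows, which is the property that makes resuming from a failed stage safe in this sketch.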

The scheduler itself spawns each run as a subprocess, not in-process, so a single bad run can’t take down the scheduler. The container healthcheck stays green; next Monday’s run starts on schedule.
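
A minimal sketch of that isolation, assuming a scheduler that shells out once per run: the function name, the timeout, and the result dict are illustrative. The point is only that a crash, a hang, or a missing executable in the child comes back as data rather than as an exception in the scheduler.

```python
import subprocess
import sys


def run_once(command, timeout_s=3600):
    """Run one pipeline invocation in a child process.

    Any failure is returned as a result dict so the caller keeps running.
    """
    try:
        done = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "exit_code": None, "stderr": "timed out"}
    except OSError as exc:  # e.g. the executable could not be started
        return {"ok": False, "exit_code": None, "stderr": str(exc)}
    return {
        "ok": done.returncode == 0,
        "exit_code": done.returncode,
        "stderr": done.stderr[-2000:],  # tail only, enough for an ops message
    }


# A child that exits non-zero does not raise in the parent.
bad = run_once([sys.executable, "-c", "raise SystemExit(3)"])
good = run_once([sys.executable, "-c", "print('done')"])
print(bad["exit_code"], good["ok"])
```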

For the business: a transient upstream blip never costs a week. The operator can rerun from the failing stage with one command and the digest still ships before the team’s Monday standup.

The ops channel — three message types, none of them ever a surprise.

Production AI systems fail in two ways: loud failure (the subprocess crashes; everyone notices) and silent failure (the run completes but the digest is half-empty, or noticeably off-topic). Both need to surface before the audience sees the broken delivery.

The swarm posts to a separate ops chat channel — never the digest channel — with three message types:

Ops channel message types: all three go to the owner · none surprise the audience

| criterion | failure | silent partial | weekly cost |
|---|---|---|---|
| what it means | subprocess exited non-zero | run ok · delivered count below ½ of expected | scheduled report after each run |
| trigger | exit code ≠ 0 | ranked < 0.9 × unique OR summarised < 0.8 × top-N | every run |
| what's in the message | traceback + run id | funnel numbers + diagnosis hint | funnel + token spend + estimated USD |
| audience reaction | fix before next Monday | investigate · maybe rerun · maybe tune | read · file |
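
The two alert triggers translate almost directly into code. The sketch below assumes a function signature and hint wording of its own; only the thresholds (0.9 × unique, 0.8 × top-N) and the exit-code check are taken from the trigger row above.

```python
def diagnose(exit_code, unique, ranked, summarised, top_n):
    """Return (message_type, hint), or (None, None) when the run looks healthy."""
    if exit_code != 0:
        return "failure", f"subprocess exited with code {exit_code}"
    if ranked < 0.9 * unique:
        return "silent_partial", f"ranked {ranked} of {unique} unique articles"
    if summarised < 0.8 * top_n:
        return "silent_partial", f"summarised {summarised} of top {top_n}"
    return None, None


print(diagnose(exit_code=0, unique=100, ranked=95, summarised=7, top_n=10))
```

With these inputs ranking is healthy (95 of 100), but only 7 of 10 top items were summarised, which falls below the 0.8 × top-N line and produces a silent-partial alert.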

For the business: the owner of the digest can sleep through Sunday because the agent will tell them if something is off. The weekly cost message turns “what does this AI thing actually cost us?” from a quarterly audit question into a routine line in chat.

The pipeline, end to end

Weekly run · Monday 10:00 · six stages · all idempotent · subprocess-isolated
  1. discovery: search API + Google News RSS · URL-filter
  2. dedup: URL hash → content hash → vector cosine (0.90)
  3. ranking: Gemini Flash-Lite · batch=30 · JSON-mode · 5 scoring rules
  4. summarise: trafilatura → markdown → snippet · top-N · EN + UK native
  5. delivery: Block Kit · per-country sections · chat webhook
  6. ops post: diagnose() → alert · estimate_cost() → weekly report
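
The arithmetic behind the ranking batches and the weekly cost line is small enough to sketch. The batch size of 30 comes from the stage list; the function signatures, the token counts, and above all the per-million-token prices are placeholders chosen for the example, not real pricing for any model.

```python
import math


def batches(items, size=30):
    """Split the ranking candidates into consecutive batches of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def estimate_cost(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Estimated USD for one run from token counts and per-million-token prices."""
    return (
        input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    ) / 1_000_000


# 287 is the rough per-run candidate count quoted earlier, used here as an
# upper bound on what reaches ranking after dedup.
candidates = list(range(287))
calls = batches(candidates)
assert len(calls) == math.ceil(287 / 30)

# Placeholder token counts and prices, for illustration only.
usd = estimate_cost(
    input_tokens=200_000,
    output_tokens=20_000,
    usd_per_m_input=0.10,
    usd_per_m_output=0.40,
)
print(f"{len(calls)} ranking calls, estimated ${usd:.3f}")
```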

Two pipeline choices are worth a second look:

  • Full-text extraction happens after ranking, not before. Article extraction is the slow, brittle part: different sites need different extraction paths (Trafilatura first, then a JS-rendering fallback, then a paywall-snippet fallback). Running it on every discovered URL (roughly 287 per run) instead of the top-N survivors (max 10) would multiply HTTP traffic by ~30×. Running it only on the top-N items keeps the bandwidth bill flat.
  • Country classification happens at the summary stage, with full text in hand — not at discovery. Discovery only sees the search-query context, which is famously unreliable about which country the article is actually about (a query about Canadian markets routinely returns Argentine news). The LLM sees the full text and reclassifies; the previous label is ignored.

Both choices show the same instinct: do the cheap thing on everything, do the expensive thing only on the things that survive.
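
That instinct can be sketched as a fallback chain that runs only on the ranked survivors. The extractor callables below are fakes with invented names; in a real system they would wrap an HTML extractor, a JS-rendering fetch, and a snippet fallback. The article dict shape and the score field are also assumptions.

```python
def extract_text(url, extractors):
    """Try each (name, extractor) in order; return the first non-empty text."""
    for name, extractor in extractors:
        try:
            text = extractor(url)
        except Exception:
            continue  # a brittle extractor must not sink the whole run
        if text:
            return name, text
    return None, ""


def fetch_top_n(ranked, extractors, top_n=10):
    """Fetch full text only for the top-N articles by score."""
    top = sorted(ranked, key=lambda a: a["score"], reverse=True)[:top_n]
    results = []
    for article in top:
        source, text = extract_text(article["url"], extractors)
        results.append({**article, "extracted_by": source, "text": text})
    return results


calls = {"primary": 0}


def fake_primary(url):
    calls["primary"] += 1
    return "" if url.endswith("/js") else "full text"


def fake_js_render(url):
    if url.endswith("/js"):
        return "rendered text"
    raise RuntimeError("render failed")


def fake_snippet(url):
    return "snippet"


extractors = [("primary", fake_primary), ("js", fake_js_render), ("snippet", fake_snippet)]
ranked = [{"url": f"https://x.example/{i}", "score": i} for i in range(30)]
ranked.append({"url": "https://x.example/js", "score": 100})
top = fetch_top_n(ranked, extractors, top_n=3)
print(calls["primary"], [t["extracted_by"] for t in top])
```

Thirty-one ranked candidates cost three primary fetches here, not thirty-one, and the one page that needs rendering falls through to the second extractor without affecting the others.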

Observability — three signals, all routed through tools the team already runs

The swarm publishes everything through surfaces the company already operates:

  • /healthz — a 200 if the scheduler is alive. The container’s healthcheck reads this.
  • /metrics — a JSON object: last run timestamp, last run status, last delivered count, 7-day run total, 7-day failure count. Computed on demand from the database — so the value survives a restart, and any tool that can parse JSON can read it.
  • the ops chat channel — the three message types above, posted by the agent itself.
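
To make "computed on demand from the database" concrete, here is a minimal sketch of the /metrics payload built from two queries. It uses an in-memory SQLite database so it runs anywhere; the swarm itself uses Postgres, and the runs table, its columns, and the JSON key names are all assumptions for the example.

```python
import json
import sqlite3
from datetime import datetime, timedelta, timezone

SCHEMA = """
CREATE TABLE runs (
    agent_id    TEXT NOT NULL,
    finished_at TEXT NOT NULL,   -- ISO-8601 timestamp in UTC
    status      TEXT NOT NULL,   -- 'ok' or 'failed'
    delivered   INTEGER NOT NULL
)
"""


def compute_metrics(conn, agent_id, now):
    """Build the metrics dict from stored runs; nothing is cached in memory."""
    week_ago = (now - timedelta(days=7)).isoformat()
    last = conn.execute(
        "SELECT finished_at, status, delivered FROM runs "
        "WHERE agent_id = ? ORDER BY finished_at DESC LIMIT 1",
        (agent_id,),
    ).fetchone()
    total, failed = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(status = 'failed'), 0) FROM runs "
        "WHERE agent_id = ? AND finished_at >= ?",
        (agent_id, week_ago),
    ).fetchone()
    return {
        "last_run_at": last[0] if last else None,
        "last_run_status": last[1] if last else None,
        "last_delivered": last[2] if last else None,
        "runs_7d": total,
        "failures_7d": failed,
    }


now = datetime(2024, 1, 8, 12, 0, tzinfo=timezone.utc)
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [
        ("team-a", (now - timedelta(days=14)).isoformat(), "ok", 10),
        ("team-a", (now - timedelta(days=6)).isoformat(), "failed", 0),
        ("team-a", (now - timedelta(hours=2)).isoformat(), "ok", 9),
        ("team-b", (now - timedelta(hours=1)).isoformat(), "failed", 0),
    ],
)
metrics = compute_metrics(conn, "team-a", now)
print(json.dumps(metrics))
```

Every query filters on agent_id, so the same table serves every tenant, and because the numbers come from stored rows, a scheduler restart does not reset them.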

Structured logs go to stdout in JSON; the company’s existing log aggregator picks them up without a line of code on our side. No new metrics stack to operate, no new dashboard to build. When the production load grows (many tenants, many runs per day), the same database-backed values can be handed to a metrics scraper. A Prometheus-style scraper reads a text exposition format rather than JSON, so that step would need a small adapter, not a redesign.

What’s shipped

Built and ready to deliver: one agent (the one prepared to run the regular Monday digest), one codebase, one container image (Python 3.11-slim with a CPU-only inference wheel — 512MB), one Postgres instance shared across all future agents, one scheduler running on a single VM, two chat webhooks (digest + ops).

Tested: 159 tests, green on every push to main. Mix of unit tests against a fake LLM client, integration tests against a real Postgres via a session-rollback fixture, and recorded-cassette tests against the search APIs.

Designed but not yet running: the second agent, with its own YAML config and chat channel. The work to add it is mechanical — the runtime is already multi-tenant; only the per-tenant decisions remain.

What this says about the builder

The shape of this case isn’t “someone wired an LLM to a news feed.” It’s “someone built a runtime in which adding the next news feed is a configuration change.” That’s a different instinct — closer to platform engineering than to script-writing.

Each architectural decision shows up again in the next one. The agent is parameterised by config; the database keys are scoped by agent; the embedder is shared across tenants; the ops channel is separate from the digest channel; the cost report is a stage in the pipeline, not an afterthought. By the time you’ve read the YAML file, you’ve read the whole system.

The discipline shows in the small things too: a threshold tuner shipped as a CLI rather than a notebook, a /metrics endpoint that computes from the database instead of holding a counter in memory, country classification moved to the only stage where the data exists to do it correctly, full-text extraction held back until ranking has already proved the article worth the HTTP call. None of those choices showed up on day one; they were the output of seven test runs and one operator paying attention.

Three days. One builder. One agent ready to deliver, one designed. A runtime that costs the same whether it serves one team or ten.


This case describes architecture and patterns. Specific vendors, the client, the agent’s topic domain, and the receiving team are deliberately left abstract; what matters here is the shape of the multi-tenant runtime and the production discipline around it. The codename, the client, and any URL that would identify either are out of scope by policy.