TRACK 01 · DATA
Data layer surgery
For teams whose primary database is now the single most dangerous object in the architecture. Typical entry point: paging rate is climbing, query p99 visibly correlates with deploys, and the warehouse and the app are fighting over the same primary.
What's in scope
- Audit of top 200 queries by time, bytes, and lock contention — traced back to the call sites that generate them (pg_stat_statements, auto_explain, or equivalent).
- Partitioning, sharding, or read/write split design, proposed with a cost estimate and a rollback plan before any code is written.
- Logical replication or CDC pipeline into your warehouse so analytics traffic leaves the primary for good.
- Written runbooks for failover, reindex, vacuum storms, autovacuum tuning, and bloat recovery, checked into your repo.
- A "day one on-call" document for every engineer rotating onto the data team after we leave.
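The first-pass ranking behind that query audit can be sketched in a few lines — assuming the rows have already been exported from pg_stat_statements (the dict keys below mirror that view's column names; the sample queries are illustrative):

```python
# Rank exported pg_stat_statements rows to find the worst offenders.
# A sketch of the first pass only; the real audit also traces each
# statement back to its call site.

def top_queries(rows, n=200, key="total_exec_time"):
    """Return the n worst offenders by the given pg_stat_statements column."""
    return sorted(rows, key=lambda r: r[key], reverse=True)[:n]

rows = [
    {"query": "SELECT * FROM orders WHERE ...", "total_exec_time": 9_200.0, "calls": 14_000},
    {"query": "UPDATE inventory SET ...",       "total_exec_time": 31_500.0, "calls": 800},
    {"query": "SELECT 1",                       "total_exec_time": 12.0,     "calls": 90_000},
]

worst = top_queries(rows, n=2)
```

The same function reranked by `calls` or a lock-wait column surfaces a different top 200 — the audit runs all three cuts, because the worst query by time is rarely the worst by contention.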
Typical deliverable
A merged PR series, a before-and-after report on the same dashboard, and a 10–15 page memo in /docs/adr/ explaining every non-obvious choice.
Duration
Eight to twelve weeks, depending on dataset size and release cadence.
TRACK 02 · QUEUES
Queue & event topology
For teams whose background work has quietly become the product — billing runs, webhook fan-out, ingestion pipelines, video or ML job orchestration — and where a single queue misbehaving takes the business with it.
What's in scope
- Inventory of every asynchronous code path, classified by delivery semantics required vs. delivery semantics actually implemented. The gap is almost always larger than the team expects.
- Migration to a durable queue with per-tenant fairness, explicit idempotency keys, and structured retry-with-decorrelated-jitter.
- Dead-letter handling that is survivable on-call: a single dashboard, a single runbook, a single replay command that is safe to run half-asleep.
- Load tests reproducing last quarter's worst incident, run against the new topology, with the results committed to the repo.
- A saturation model: the ingest rate at which each queue tips, and the upstream signal that predicts it five minutes out.
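The retry schedule named above is the standard decorrelated-jitter pattern: each delay is drawn uniformly between a base and three times the previous delay, capped. A minimal sketch (the base and cap values here are illustrative, not a recommendation):

```python
import random

def decorrelated_jitter(base=0.1, cap=30.0):
    """Yield successive retry delays in seconds.

    Each draw depends on the previous one, so concurrent retriers
    spread out instead of synchronizing into a thundering herd.
    """
    sleep = base
    while True:
        sleep = min(cap, random.uniform(base, sleep * 3))
        yield sleep

gen = decorrelated_jitter()
delays = [next(gen) for _ in range(5)]
# Every delay stays within [base, cap]; growth is bounded but noisy.
```

Paired with an explicit idempotency key on the message, a retried delivery is safe even when the first attempt half-completed.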
Duration
Six to ten weeks.
TRACK 03 · DEPLOY
Deploy & runtime
For teams where deploys are slow, scary, or both, and where the Friday freeze has become culture. The goal of this track is simple: any engineer, in their first week, should be able to ship safely.
What's in scope
- Build-time audit. Most pipelines we see are two to four times longer than they need to be because of cache misses, serial dependencies, and a test matrix that stopped reflecting the code a year ago.
- Canary design with automated abort on error-rate, latency, and saturation signals. No human in the loop for the common case.
- Feature-flag discipline: a policy for how flags are born, observed, and retired, so your flag system doesn't become a second codebase. Default TTL, owner, and removal PR at birth.
- On-call rotation design, paging thresholds tied to SLOs, and the three dashboards the on-call engineer actually needs.
- A "safe to ship" checklist that runs in CI, not in a human's head.
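The error-rate half of the automated abort reduces to a small decision function — a minimal sketch, assuming per-minute error rates are already scraped for the canary and the stable fleet; the allowance and window here are placeholders, not tuned values:

```python
def should_abort(canary_errors, baseline_errors, allowance=0.01, min_samples=3):
    """Abort the canary if its recent error rate exceeds baseline + allowance.

    canary_errors / baseline_errors: per-minute error rates as fractions,
    oldest first. Requires a sustained breach, not a single noisy minute.
    """
    if len(canary_errors) < min_samples:
        return False  # not enough signal yet; keep watching
    recent = canary_errors[-min_samples:]
    base = sum(baseline_errors[-min_samples:]) / min_samples
    return all(r > base + allowance for r in recent)

verdict = should_abort([0.05, 0.06, 0.07], [0.01, 0.01, 0.01])
```

The latency and saturation signals get the same shape: a window, a baseline, an allowance, and no human in the loop for the common case.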
Duration
Eight to fourteen weeks.
TRACK 04 · INCIDENT
Incident-grade consulting
A narrower engagement, intended for teams in the middle of a fire or the week after one. Two of us, up to fourteen days, embedded with your on-call.
What's in scope
- Incident-command assist and postmortem authorship — not a slide deck, a document the team will actually reread. Our template is in the Runbook.
- Triage of the top five latent issues most likely to cause the next incident, ranked by blast radius and time-to-detect.
- One concrete, shippable mitigation in flight before we leave. Not a proposal. A merged PR with measurements.
- A one-page "what we learned, what you already knew" for the executive team, written plainly.
Duration
Up to two weeks. Extensions, if needed, roll over into one of the three longer tracks.
WHAT WE DO NOT SELL
We don't do greenfield architecture for pre-launch companies; there are better firms for that, and the problems are different. We don't do security audits or compliance work. We don't do front-end performance beyond the parts that touch the API gateway. We don't do team coaching, org design, or "engineering culture" engagements. We are not a staff-aug shop: every hour billed is one of the four partners.
If you describe your problem to us and it fits one of the above, we will say so in the first call and try to name three firms we trust who do.
HOW WE START
The queue orders itself on referral timestamps.
A staff or principal engineer we've shipped alongside forwards your roadmap; we send back a one-page memo inside a week; if it's a fit, we lock the calendar. That's the whole protocol, and it hasn't changed since the second engagement. Cold outreach, recruiter inbound, and agency placements go to /dev/null without reply.
First call runs forty-five minutes and covers three questions: what's breaking, what's already been tried, and what the team would do with four more senior engineers for ten weeks. If we can't write a useful memo from that call, we tell you on the call — not a week later.