Pro Scale Team — Platform engineering for systems under pressure

001 — THESIS

Most scaling failures are not capacity failures. They are coupling failures.

A single Postgres primary serving twenty-two services. A Sidekiq-over-Redis arrangement that also happens to be the source of truth for billing. A deploy pipeline that takes nineteen minutes because nobody has owned it since the second engineer left. We have seen the same four or five patterns at companies in healthtech, logistics, developer tooling, and fintech — and the fix is rarely "add more nodes."

We decompose the failure mode, propose the smallest intervention that buys twelve to eighteen months of headroom, and then execute it alongside your team. No seat warming. No slide decks that outlive the engagement. We work in your repo, on your runbooks, paged on your rotation for the duration.

A short taxonomy of the patterns we see lives in Field Notes. The operator-facing cheatsheets we leave behind are indexed in the Runbook.

002 — WHAT WE ACTUALLY DO

Data layer surgery

Read replicas that actually take read traffic. Logical decoding into a warehouse without blowing up WAL. Range-partitioning, hash-sharding, and — when warranted — a carve-out to a purpose-built store. Primarily Postgres and MySQL; we've also touched Cassandra, DynamoDB, and ClickHouse in anger.

Queue & event topology

Replacing ad-hoc cron-plus-Redis arrangements with durable work queues. Explicit idempotency keys. Dead-letter handling a human can triage at 3 a.m. Exactly-once where it matters, at-least-once everywhere else, documented so the next engineer knows which is which.

Deploy & runtime

Builds under five minutes. Canaries that abort without a human. Blue/green where it earns its complexity, rolling where it doesn't. SLO-linked paging thresholds. The three dashboards on-call actually opens before anything else.

003 — SELECTED ENGAGEMENTS

SERIES B · LOGISTICS SAAS

10 WEEKS

ENG-221

Re-shaped a write-heavy Postgres cluster carrying 4.2 billion rows in its busiest table.

The team had been told they needed to rewrite in another database. They did not. We range-partitioned shipments by created_at, moved the historical tail to cheaper storage via pg_partman and tablespace routing, and introduced a CDC feed (logical replication → Kafka → warehouse) so analytics stopped pounding the primary. P99 writes fell from 840 ms to 62 ms. Migration cost: zero downtime, one weekend of our time on call.

SERIES A · DEV TOOLS

6 WEEKS

ENG-208

Rebuilt a job system that had been silently dropping webhooks.

Their queue was a Redis list with a supervisor that restarted workers whenever memory spiked — which silently discarded in-flight jobs. We replaced it with a durable work queue backed by Postgres SELECT … FOR UPDATE SKIP LOCKED, added per-tenant fairness with a weighted leaky bucket, structured retries with decorrelated jitter, and a DLQ with a one-line replay command. Webhook delivery SLO moved from "aspirational" to 99.94% measured over the following quarter.

SERIES C · HEALTHTECH

14 WEEKS

ENG-194

Cut deploy time from 23 minutes to 4, and eliminated the Friday freeze.

We shrank the container image by 71% (multi-stage build, distroless base, dropped a 190 MB debug toolchain nobody remembered adding), parallelized the test matrix along lines that matched actual code ownership, and introduced a canary stage with automatic rollback on error-rate, p99 latency, and saturation signals. The team ships on Fridays again. Their staff engineers asked us to write the postmortem template they now use company-wide — it's reproduced in the Runbook.

SERIES B · FINTECH

9 WEEKS

ENG-187

Broke up a 22-service monolithic Postgres without a single cutover window.

The billing, auth, and ledger services had grown into the same primary through a decade of "just add the table here, it's faster." We identified the natural seams using pg_stat_statements and foreign-key topology, then carved out auth and ledger into their own clusters using a dual-write strategy with a replay-and-verify phase before each cutover. No downtime. The billing team stopped being afraid of Tuesday.

004 — OPERATING PRINCIPLES

Reversible first. If a change can't be rolled back in under fifteen minutes, we design a way until it can. Dual-write, shadow read, feature-flag the cutover.
Measure the bad thing before fixing it. No engagement begins without a baseline we can point at later. If the metric doesn't exist, we instrument it in week one.
Write it down where your team works. Runbooks, ADRs, and decision memos live in /docs of your repo, not a Notion we own.
Leave, visibly. Our success is measured by how obviously unnecessary we become after the engagement ends. The handoff document is part of the deliverable, not an afterthought.
The third option. When a team is arguing between two bad plans, there is almost always a third plan neither side has articulated yet. Finding it is most of the work.

005 — ENGAGEMENT

Four engagements a year. Referred by operators we've paged with.

The queue orders itself the day a staff or principal engineer we've worked with forwards your roadmap to us. There is no intake form because we've never had one work — the useful signal is what the referring engineer says on the phone before we read the doc, not a Typeform field. If you've been pointed here by someone who's been in our git blame, they already have the thread.

Hard nos: cold outbound, recruiting-agency placements, fractional-CTO retainers, "advisory" arrangements paid in equity, interview loops disguised as trial projects, and anything that requires a mutual NDA before we can ask what database you run. Current window: one Q3 slot tentatively held for an ENG-187 follow-up; the next opens against ENG-194's successor engagement in October.