№ 012 · MARCH 2026 · DATA · ENG-221
The read replica that wasn't
A Series B team had three Postgres replicas behind a pgbouncer pool labeled "reads". Dashboards showed the replicas at 30% CPU. The primary was on fire. We traced it.
Two-thirds of read traffic was going to the primary. The ORM's "reader" connection was, by a two-year-old config bug, resolving to the writer DNS entry in the default region. The dashboard told the truth about the replicas — they were genuinely idle. It just wasn't the truth anyone needed.
Fix was fifteen lines of config and a CI assertion that fails the build if reader_host and writer_host ever resolve to the same endpoint. The interesting part: this had been the cause of every "mystery primary spike" the team had paged on for eighteen months. Three postmortems had blamed the application.
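A minimal sketch of that assertion, assuming the two hostnames reach CI as environment variables (the variable names and wording below are ours, not the team's):
    #!/usr/bin/env ruby
    # CI guard: fail the build if the "reader" and "writer" hostnames share
    # any resolved address.
    require "resolv"

    reader = ENV.fetch("READER_HOST")
    writer = ENV.fetch("WRITER_HOST")

    overlap = Resolv.getaddresses(reader) & Resolv.getaddresses(writer)
    unless overlap.empty?
      abort "reads are writes: #{reader} and #{writer} both resolve to #{overlap.join(', ')}"
    end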
Takeaway. When a dashboard and a pager disagree, believe the pager. Instrument the thing the pager is actually telling you about.
№ 011 · FEBRUARY 2026 · QUEUES · ENG-218
Sidekiq as source of truth
Billing cycle kicked off with a scheduled job. The job read customer state, computed line items, wrote invoices. Standard enough — except the job also mutated the customer's plan state on the way through, and those mutations lived only in the worker's memory until the run finished. If the worker crashed mid-run, the plan transitions were lost, but the invoices it had already written stayed written.
The team had noticed the drift. They had a reconciliation script. The reconciliation script had itself been running as a scheduled job on the same Redis, which had been evicting keys under memory pressure for the last six weeks.
We separated the billing run into three idempotent phases (gather, compute, emit), pushed each phase into a durable queue with explicit keys, and moved the plan-state transitions into the database transaction that produced the invoice. Reconciliation became unnecessary, which is the correct end state for a reconciliation script.
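A sketch of the emit phase under that shape. Class, queue, and column names here are our own; the load-bearing parts are the shared transaction and a unique index on (customer_id, cycle_key) that turns a retried job into a no-op.
    class EmitInvoiceJob
      include Sidekiq::Job
      sidekiq_options queue: "billing", retry: 10

      def perform(customer_id, cycle_key, line_items)
        customer = Customer.find(customer_id)

        ApplicationRecord.transaction do
          Invoice.create!(customer: customer, cycle_key: cycle_key, line_items: line_items)
          customer.update!(plan_state: "invoiced") # the transition now commits with the invoice
        end
      rescue ActiveRecord::RecordNotUnique
        # This customer's cycle was already emitted; a crashed-and-retried
        # worker lands here instead of writing a second invoice.
      end
    end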
Takeaway. If your reconciliation script has a reconciliation script, you have accepted a category of bug that the architecture is generating faster than you can clean it up.
№ 010 · DECEMBER 2025 · DEPLOY · ENG-212
The 19-minute deploy
Nineteen minutes, door to door. Four engineers on the team could describe what happened in those nineteen minutes. None of the four agreed. The pipeline was a chain of nineteen jobs, fifteen of which ran serially, and three of which were conditional on a file path regex that had been written for a directory that no longer existed.
First week: we deleted four jobs, inlined two, and parallelized the test matrix along service ownership lines. Deploy fell to eleven minutes with zero behavioral change. The psychological effect was immediate and disproportionate — engineers started shipping smaller PRs because the cost of shipping them had fallen below the cost of batching.
Second phase was the container. A 1.1 GB image, 190 MB of which was a debug toolchain left in from an incident in 2023. Multi-stage build, distroless runtime, image down to 310 MB, pull time down by a factor of four. Final deploy time: four minutes, ten seconds.
Takeaway. Deploy duration is a morale metric. It sets the size of the unit of work engineers are willing to ship.
№ 009 · NOVEMBER 2025 · DATA · ENG-204
Autovacuum on a bank holiday
A healthtech team scheduled their heaviest ETL for Monday morning because their warehouse load window closed at noon. Autovacuum on the largest table had been tuned, five years ago, for a workload that no longer existed. On a Monday after a three-day weekend, the accumulated dead tuples crossed a threshold that triggered a full-table vacuum at 09:14, exactly when the ETL started reading. Query p99 went to fourteen seconds. The team got paged. The symptom was "the warehouse is slow." The cause was three calendar days of bloat and a default.
We retuned autovacuum_vacuum_scale_factor and autovacuum_vacuum_cost_limit per table, added a Sunday-evening scheduled vacuum for the three largest tables, and wrote a dashboard that shows dead-tuple ratio over time per table, so the next surprise wouldn't be a surprise.
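The settings themselves are per-table storage parameters. A sketch of the shape, with an illustrative table name and numbers (the real values came from measuring that table's churn, and the Rails-migration wrapper is our assumption about where such a change would live):
    class TuneAutovacuumOnLargeTables < ActiveRecord::Migration[7.1]
      def up
        execute <<~SQL
          ALTER TABLE encounters SET (
            autovacuum_vacuum_scale_factor = 0.01,  -- vacuum at ~1% dead tuples, not the 20% default
            autovacuum_vacuum_cost_limit   = 2000   -- let the vacuum keep pace with the table
          );
        SQL
      end

      def down
        execute "ALTER TABLE encounters RESET (autovacuum_vacuum_scale_factor, autovacuum_vacuum_cost_limit);"
      end
    end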
Takeaway. Database defaults are calibrated for a median workload five years out of date. Assume every default is wrong, and measure which ones are wrong enough to matter.
№ 008 · SEPTEMBER 2025 · QUEUES · ENG-197
"Exactly-once" (reader, it was not)
A fintech team had built their ledger on the promise, inherited from an early architect, that the event bus delivered exactly-once. The event bus did not. It delivered at-least-once with a deduplication window of five minutes, which was true enough that nobody had noticed for two years, because the duplicate rate was roughly one in nine hundred thousand events.
The ledger had been quietly double-posting about eighty transactions a day against a daily volume in the tens of millions. Cumulatively, over two years: low six figures of mis-posted cents, mostly self-cancelling. The company's own auditors had flagged nothing. Our audit did.
Fix was a dedupe table keyed on (event_id, consumer), checked inside the ledger's write transaction. Backfill took ninety minutes against a read replica and a careful reconciliation. The team now treats "delivery semantics" as a written contract between service owners, not folklore.
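The shape of that check, with model and column names of our own invention:
    # The unique index on processed_events (event_id, consumer) is the dedupe
    # check, and it sits in the same transaction as the ledger write: either
    # both rows commit or neither does.
    class LedgerPoster
      CONSUMER = "ledger".freeze

      def post(event)
        ApplicationRecord.transaction do
          ProcessedEvent.create!(event_id: event.fetch("id"), consumer: CONSUMER)
          LedgerEntry.create!(
            account_id: event.fetch("account_id"),
            amount_cents: event.fetch("amount_cents")
          )
        end
      rescue ActiveRecord::RecordNotUnique
        # A redelivery outside the bus's five-minute window lands here and is dropped.
      end
    end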
Takeaway. "Exactly-once" is almost always a claim about a window, not an invariant. Ask what the window is, in milliseconds.
№ 007 · AUGUST 2025 · DATA · ENG-190
The cache that outlived the schema
A dev-tools company had a Memcached layer in front of one of its hottest endpoints. Cache key included the customer ID. Cache value was a serialized object from a Ruby class that had been renamed, then split, then had one of its fields change type, across three separate deploys over eighteen months. Old entries still deserialized because the deserializer was tolerant. They deserialized into a shape that lied about two fields.
The team had been debugging intermittent "customer sees wrong plan tier" reports for a year. The cache TTL was seven days. The poisoned entries were outliving the fixes.
We added a schema version to the cache key, wrote a migration, and made the serializer refuse to read an entry whose version didn't match. The bug class disappeared the same week.
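A sketch of both halves, in names we made up: the version in the key means stale entries simply miss, and the version inside the payload means a tolerant deserializer stops being tolerant.
    require "json"

    PLAN_SCHEMA_VERSION = 3 # bump in the same deploy that changes the serialized shape

    def plan_cache_key(customer_id)
      "plan:v#{PLAN_SCHEMA_VERSION}:#{customer_id}"
    end

    # `cache` is any memcached-style client responding to get/set.
    def read_cached_plan(cache, customer_id)
      raw = cache.get(plan_cache_key(customer_id))
      return nil if raw.nil?

      payload = JSON.parse(raw)
      return nil unless payload["schema_version"] == PLAN_SCHEMA_VERSION # refuse: caller recomputes and rewrites

      payload["plan"]
    end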
Takeaway. Cache keys without schema versions are a time machine pointed at your users.
№ 006 · JUNE 2025 · DEPLOY · ENG-184
Feature flags, at the bottom of a well
A team we were advising, not yet formally engaged with, had 1,340 feature flags. Of those, 1,180 had been "on for everyone" for longer than a year. Twenty were "off for everyone" permanently. The remaining 140 were doing actual work. The cost was not the flag system — it was the if flag_enabled? branches threaded through every hot path, each one a small tax on reading the code.
We didn't take the engagement. We wrote them a one-page memo: retire flags the same day you retire the PR that introduced the code behind them; if you cannot commit to that, do not introduce the flag. They hired internally and did the work themselves over two quarters. We heard from them a year later: 97 flags remain.
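One way to make the memo's rule mechanical, entirely our sketch rather than anything they built: every flag carries a retirement date, and CI fails once a date is overdue.
    require "date"

    # Hypothetical registry; in practice this would live next to the flag definitions.
    FLAG_RETIREMENTS = {
      "new_billing_flow" => Date.new(2026, 6, 30),
      "legacy_export"    => Date.new(2026, 4, 1)
    }.freeze

    overdue = FLAG_RETIREMENTS.select { |_flag, retire_by| Date.today > retire_by }
    unless overdue.empty?
      abort "Overdue feature flags: #{overdue.keys.join(', ')}. Delete the flag or re-justify the date."
    end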
Takeaway. A feature flag without a retirement date is an architectural decision wearing a disguise.