Share

Trust, but Verify: Letting AI Drive a Production Migration

June 01, 2026 • tech

Trust, but Verify: Letting AI Drive a Production Migration
claude-codeai-agentsagentic-developmentdata-migrationproductionmongodbcode-swarmleaderboard

Trust, but Verify: Letting AI Drive a Production Migration

Monday evening. I'm SSH'd into Leaderboard's production box, a fresh database backup sitting safely on disk, and I'm about to re-key the entire production database onto a data provider it has never run on. Every player, every tournament, every score — about to get new identifiers and a new source of truth.

Here's the part that should make any engineer's palms sweat: I didn't write this migration. Claude did. The new API client, the identity model, the blue-green rebuild, the verification suite, the cutover runbook I'm reading off of right now — all of it. My contribution over the previous four days was mostly to sit a half-step back and ask "are you sure?" in increasingly specific ways.

And now I was about to find out whether four days of asking that question had been enough.

Let me rewind.

The job

Leaderboard runs on golf data. Tournament schedules, fields, live leaderboards, FedExCup and world rankings, and — during a tournament — live scoring that updates as players move around the course. All of that came from one source: the SlashGolf live-golf-data feed.

The plan was to move it to balldontlie's PGA API. Richer data — per-round results, full scorecards by hole — cleaner rate limits, and the kind of reliability I want underneath a product as its user base grows. Same sport, same shape of data, completely different plumbing.

In the old world, a migration like that is my job. I'd write the client, map the schema, build the backfill, sweat the cutover. In this one, I described the destination and Claude built the road. That's the inversion this whole post is about, so I want to be honest about it up front: for four days, I was not the author. I was the reviewer.

How the work actually ran

The migration wasn't one big heroic session. It was a sequence of phased GitHub issues — provider abstraction, then the new API client, then a provider-agnostic identity model, then ingestion across each data type, then live scoring, then the cutover — each one scoped, built, reviewed, and merged on its own.

Where it got genuinely fun was the parallelism. I leaned on my /code-swarm workflow, which fans independent issues out to multiple Claude subagents, each working in its own isolated git worktree so they can't trample each other. Wave one built the new client and the identity model at the same time. Wave two did players and rankings in parallel. Then they consolidated back into clean PRs.

flowchart TD
    A["Phase 1<br/>Provider abstraction"] --> B{"Swarm Wave 1<br/>(parallel)"}
    B --> C["Phase 2<br/>balldontlie client"]
    B --> D["Phase 3<br/>Identity model<br/>+ DB rebuild"]
    C --> E{"Swarm Wave 2<br/>(parallel)"}
    D --> E
    E --> F["Phase 4/5<br/>Tournaments · Players · Rankings"]
    F --> G["Phase 6/7<br/>Field · Live scoring"]
    G --> H["Cutover<br/>(production)"]
    style A fill:#cce5ff,stroke:#004085,color:#000
    style H fill:#d4edda,stroke:#28a745,color:#000
    style B fill:#fff3cd,stroke:#856404,color:#000
    style E fill:#fff3cd,stroke:#856404,color:#000

But speed at the keyboard is the easy part now. What kept this from turning into a fast way to ship a disaster was a set of standing rules I refused to bend on. Every merge had to run through /address-pr-feedback first. CI had to be green — actually green, not "probably fine." And the default posture was conservative:

"never assume, ask questions, and be conservative... All debugging will fall on you for logs, hotfix, etc."

When the agent is fast, the gates are the product. The discipline isn't in the typing anymore — it's in what you refuse to let through.

Where I earned my keep

Here's the honest mechanic of working this way: Claude built it, and my job was to try to break it. Trust, but verify. A few times, verifying is exactly what caught something that would've gone to production otherwise.

The leaderboard that showed too much. Mid-migration, I was eyeballing a live test of the Charles Schwab Challenge and something felt off — the leaderboard was showing four rounds of scores when, that early in the tournament, it should've shown two. I didn't have a stack trace. I had a hunch and a reference:

"This current tournament should only have two days. You can compare at pgatour.com/leaderboard."

That hunch was right. The investigation traced it back to a score load that had run in Charles Schwab's context but fetched a different tournament's data — RBC Heritage's — because the rebuild had seeded an ID crosswalk for players but not for tournaments. A gap you'd never see in a unit test. You'd only catch it by looking at the actual product and going "that's not what golf looks like."

The parity question, and the trap it sprung. Before I'd let this anywhere near production, I wanted proof:

"Can you review our balldontlie impl, and verify we have 100% parity with RapidAPI? Also, I'm curious if we have a safe fallback."

The answer I got back was refreshingly un-salesy: "100% parity? No." A few things — live in-round progress, some course detail, ranking trends — weren't at full parity yet. Good. That's the answer I needed to plan around.

But the more important thing surfaced because I'd asked about the fallback. I'd been quietly assuming that a config flag — golf.provider — was my rollback switch. It wasn't:

"The flag is NOT a rollback after re-key... after re-key the primary key is a ULID, so flipping back mints duplicate players and mis-attributes scores. Your safety net is the DB instance, not the flag."

That correction reshaped the entire cutover plan. My real rollback wasn't a flag — it was keeping the old database intact and untouched as a cold standby. I would not have known that until it was too late to matter.

The bug my gut found before the logs did. Later, a live contest that should've shown up on my dashboard just... didn't. My instinct went straight to the obvious suspect:

"Remember we were migrating to a different provider for golf data, so I bet this is related."

It was. balldontlie writes date-only values like "2026-05-28"; RapidAPI had written full datetimes like "2025-05-22T00:00:00". One date parser had only an ISO_DATE_TIME format and no plain-date fallback — so it threw, returned null, and a downstream "is registration open?" check quietly defaulted to true. Which made the dashboard's Live tab decide the tournament hadn't started, and hide the contest.

flowchart TD
    A["balldontlie sends<br/>date-only: 2026-05-28"] --> B["Parser expects<br/>full datetime only"]
    B --> C["Parse throws →<br/>returns null"]
    C --> D["'registration open?'<br/>defaults to TRUE"]
    D --> E["Live tab computes<br/>'hasn't started'"]
    E --> F["❌ Live contest<br/>hidden from dashboard"]
    style A fill:#fff3cd,stroke:#856404,color:#000
    style F fill:#f8d7da,stroke:#721c24,color:#000

The fix was a one-line-ish parser change, not a data re-ingest — "the parser is the bug, not the data." But notice the pattern across all three: none of these came from reading the diff. They came from looking at the running product and trusting the feeling that something was wrong. That's the part of the job that didn't go away.

The cutover

Cutover night was deliberately boring, which is the highest compliment you can pay a production migration.

The shape was blue-green. Re-key everything into a second database rather than mutating the live one. Run a verification pass — referential integrity, plus a re-score parity check to confirm contest results came out the same. Match players across providers (we landed north of 90% automatic coverage, with a known handful deferred to manual linking). Then, and only then, flip the environment to point at the new database and the new provider, smoke-test, and let the first real ingest run.

flowchart LR
    A["Backup prod"] --> B["Re-key into<br/>SECOND database"]
    B --> C["Verify:<br/>integrity + re-score parity"]
    C --> D["Flip env →<br/>new DB + provider"]
    D --> E["Smoke test<br/>+ first ingest"]
    E --> F["✅ Live"]
    B -.->|"untouched"| G["Old DB<br/>(cold standby)"]
    G -.->|"fast fallback"| H["Rollback path"]
    style A fill:#cce5ff,stroke:#004085,color:#000
    style F fill:#d4edda,stroke:#28a745,color:#000
    style G fill:#fff3cd,stroke:#856404,color:#000

The site never went down. The old database is sitting right there, intact and ready, as the fast fallback I now know I actually need.

It wasn't flawless along the way, and I don't want to pretend otherwise. At one point Claude reported a release as not-yet-deployed, and I pushed:

"when we deployed, did you pull latest before tagging the release?"

It hadn't run a git fetch first, so a stale local view of main had produced a wrong conclusion. It owned the mistake immediately and corrected the read. Small thing — but it's the texture of the whole partnership. I'm not rubber-stamping. I'm checking the work, and sometimes the check matters.

The numbers

I like publishing the metrics, so here they are, computed straight from the session transcripts:

Metric Value
Calendar span 4 days (May 28 → Jun 1)
Total tokens 3.42 billion
— of which cache reads ~96%
Output tokens (generated) 25.8 M
Total messages 12,296 (4,028 me · 8,268 Claude)
Parallel subagents spawned 16
Models Claude Opus 4.7 + 4.8 (~50/50), ~5% Haiku
Estimated cost ~$8,845

A word on that cost, because the token count looks alarming next to it. 96% of those 3.42 billion tokens were cache reads — context being re-read across a long-running set of sessions, billed at a fraction of fresh tokens. That's the only reason 3.4 billion tokens costs under nine thousand dollars instead of fifty. The figure is an estimate: I applied Anthropic's published per-model rates to the token totals pulled from the transcripts. Treat it as the right order of magnitude, not an invoice.

Now the comparison I actually care about. How long would this have taken a human?

I'll be honest that it's a range, not a number. A provider abstraction layer, a new paginated API client with retries and rate-limit handling, a provider-agnostic identity model, a blue-green database rebuild with a verification suite, provider-aware ingestion across half a dozen data types, a live-scoring feature, a dev-environment rehearsal, and a documented production cutover — plus the bug fixes along the way. For a solo senior developer who knows the codebase, that's somewhere between a few weeks and a couple of months of focused work, depending on the dev and how much breaks.

Call it weeks. It happened in four days, and I wasn't even at the keyboard full-time.

This is the thing I keep coming back to, the same lesson my three-agents experiment pointed at from a different angle: the cost of doing serious, risky engineering work has collapsed. Not just toy projects — a real production data migration with a database re-key. When that cost collapses, the scarce resource stops being the typing. It becomes judgment: knowing what to challenge, what to verify, and what you refuse to ship without proof.

Cautiously optimistic

So we're live. I want to resist the urge to tie a bow on it, because we're not done — we're watching.

The real test is the next set of contests. The Memorial tees off June 4, and that's the first live tournament running entirely on the new provider's scoring. Live, in-round data is exactly where parity was thinnest, so that's where my attention goes next. Until I've watched a full tournament cycle through cleanly, "it works" is a hypothesis, not a conclusion.

Here's why I can say that without losing sleep: we planned for precisely this. The old database is a cold standby, so the fallback is fast if I need it. And the hotfix workflow already proved itself during the migration — the date-parser bug and a round of 504 timeouts both got caught, root-caused, and patched in tight loops, not fire drills. The muscle for "spot it, diagnose it, ship the fix" is warm. That's what lets me run live and cautious at the same time.

Cautiously optimistic is the honest state. Trust, but verify — and the verifying doesn't stop at cutover.

That's really the whole shift, isn't it. I used to measure my work by what I built. On this one, I built almost none of it. What I did was decide what was worth doing, then refuse to believe it was done until I'd seen it hold up. The leverage is real. It is emphatically not autopilot. The catches that mattered came from a human looking at the actual product and trusting the feeling that something was off.

I'll take that trade. I just won't take my eyes off the leaderboard this weekend.

–Jeremy


Thanks for reading! I'd love to hear your thoughts.

Have questions, feedback, or just want to say hello? I always enjoy connecting with readers.

Get in Touch

Published on June 01, 2026 in tech