
Three Coding Agents, One Spec, and What I Actually Learned

May 10, 2026 • tech

ai-assisted-development • coding-agents • claude • gemini • glm • go • model-comparison • gotour


I'm learning Go and wanted something focused purely on syntax. There are a bajillion sites for that already — but for the last month I haven't been doing much agentic coding, and instead of hunting for the right learning resource, I decided to build my own to get back into the swing. Partly because I could. Mostly because I wanted an honest look at how close open-source (and potentially self-hostable) models are getting to the frontier in quality and accuracy.

So I handed the same brief — build an interactive Go tutorial — to three different coding agents and turned them loose.

Claude (Opus 4.7) in Claude Code finished in 24 minutes and 32 seconds. Gemini in Google Antigravity finished in 36 minutes and 18 seconds. GLM-5.1, running in the Pi coding agent, took 1 hour and 13 minutes.

The setup

Same product spec authored by my custom /spec-writer skill. Same set of phased GitHub issues. Same baseline commit (458d83c). Same M4 Max laptop kicking off each run — though inference happened in each provider's cloud, so the laptop didn't influence anything other than my electric bill.

The lineup matters for the question I was actually asking. Claude Opus 4.7 is the frontier reference. Gemini sits a tier down on price but is still closed. GLM-5.1 is the open-weights contender — the only one of the three I could realistically self-host.

I called the experiment GoTour, and you can see all three apps side-by-side at go.jking.ai.

The point wasn't to crown a winner. I'm a software professional — I make tooling decisions for real systems, and "which one is fastest at one task" is a terrible way to make those decisions. What I actually wanted was a controlled-enough environment to look at the differences in how three model families approach the same brief.

What the numbers show (and what they don't)

                      go-claude    go-gemini    go-glm
Implementation time   24m 32s      36m 18s      1h 13m 18s
Lines added           4,997        4,744        5,494
Files created         125          54           47
Subscription cost     $100/mo      $20/mo       $20/mo

A few honest caveats before anyone takes this to a bake-off:

  • "Implementation time" is wall-clock between the agent's first phase commit and its Phase 19 commit. It's not active model think-time. It includes pauses, retries, and tool waits.
  • GLM's extra hour wasn't debugging — it was mid-run rework. Clean cadence through Phase 12, then ~41 minutes across two bundled commits that revisited earlier work. Different agents revise differently. That's a real signal, not just a slower clock.
  • Claude's file count (125 vs. 54 and 47) is an architecture choice — one TypeScript module per lesson — not 2× the work. The other two used a small handful of registry modules. Speed and structure are independent variables.
  • Token usage and model think-time aren't shown. Git can't see them.
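
If you want to reproduce the commit-derivable rows of that table, here's a minimal sketch of how they could be pulled out of one checkout with git and Go. The repo path, the baseline hash, and the assumption that the last commit on the branch is the Phase 19 commit are labeled as assumptions in the code; this isn't the exact tooling behind go.jking.ai.

```go
// gotour-metrics.go — a minimal sketch for reproducing the commit-derivable
// numbers in the table above. The repo path, the baseline hash, and the
// assumption that HEAD is the final phase commit are assumptions, not the
// exact scripts used for go.jking.ai.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
	"time"
)

// git runs a git subcommand in repo and returns trimmed stdout.
func git(repo string, args ...string) string {
	out, err := exec.Command("git", append([]string{"-C", repo}, args...)...).Output()
	if err != nil {
		log.Fatalf("git %v: %v", args, err)
	}
	return strings.TrimSpace(string(out))
}

func main() {
	repo := "./go-claude" // hypothetical local checkout of one run
	base := "458d83c"     // shared baseline commit all three agents started from

	// "Implementation time": wall-clock from the first commit after the
	// baseline to the last commit on the branch, using committer timestamps.
	stamps := strings.Split(git(repo, "log", "--reverse", "--format=%cI", base+"..HEAD"), "\n")
	first, err1 := time.Parse(time.RFC3339, stamps[0])
	last, err2 := time.Parse(time.RFC3339, stamps[len(stamps)-1])
	if err1 != nil || err2 != nil {
		log.Fatal("no commits after the baseline, or unexpected timestamp format")
	}
	fmt.Println("implementation time:", last.Sub(first).Round(time.Second))

	// Lines added: insertions reported by a whole-run diff against the baseline.
	fmt.Println("diffstat:", git(repo, "diff", "--shortstat", base+"..HEAD"))

	// Files created: paths added (diff-filter=A) since the baseline.
	added := strings.Split(git(repo, "diff", "--name-only", "--diff-filter=A", base+"..HEAD"), "\n")
	fmt.Println("files created:", len(added))
}
```

Run it once per checkout (go-claude, go-gemini, go-glm) and you get the first three rows of the table; subscription cost is the one number git can't tell you.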

The leadership thing I keep coming back to

Two of these runs cost $20/month — and one of those was an open-weights model I could pull down and run on my own hardware. The most expensive cost $100/month. For under $150/month, I ran the same product brief through three model families — frontier, mid-tier closed, and open-weights — and got three working products to compare.

A year ago, that experiment would have taken three developers, three weeks, and a project plan.

The cost of trying things has collapsed. And when the cost of an experiment collapses, the right move isn't to pick the cheapest tool and lock in. It's to run more experiments. The bottleneck isn't compute or talent anymore — it's taste and judgment about what's worth comparing in the first place.

That shift — from picking tools to comparing approaches — is the part I think most engineering leaders are still underweighting.

The part I didn't expect

Reading the three apps side-by-side is genuinely useful in a way I didn't predict.

Claude's prose has a particular rhythm. Gemini frames concepts differently. GLM picks different examples. None of them are wrong; they're just three takes shaped by three different sets of training data and three different tool harnesses. If you've ever wondered what "model voice" actually means in practice, running the same brief through multiple agents is the cleanest way I've found to see it.

That's the whole reason I published the comparison. Not to declare a winner. To make it possible for someone else to look at the same artifact and form their own opinion.

What's next

I'm going to keep doing this. The next experiment is already half-formed in my head — and the whole point of running these in public is that other people can look at the work, push back, suggest variants, and tell me what I missed.

If you're curious, the three apps and all the commit-derivable metrics are at go.jking.ai. The repo is public. Fork it, run your own variant, tell me what you find.

The tools have gotten cheap. The discipline of comparing them carefully — and being honest about what the comparison actually shows — is the part that still matters.

–Jeremy

