
My Remote Agent Experiment: From Cloud Agents to My Own AI Dev Team

March 17, 2026 • tech

Tags: composer, claude-code, ai-agents, agentic-development, orchestration, multi-agent, claude-max, ai-labs


If you read Part 1, you know how that story ended. I set up a dedicated Mac Mini, installed OpenClaw, wired it into Slack, and tried to build myself an always-on remote coding agent. The vision was compelling. The results were... educational.

But the experiment didn't fail because the technology was bad. It failed because the tradeoffs didn't work for me. I was uncomfortable with the escalated permissions. The black-box nature of the agent left me second-guessing every decision it made. And API pricing meant every experiment had an unpredictable cost attached to it. I wanted the subscription model I was already paying for — clear spending, clear quota limits, no surprises.

The thing is, the goal was never "run an agent in the cloud." The goal was always the same: produce better software, faster. OpenClaw was one path to that. When it didn't pan out, I didn't abandon the goal. I just changed the approach.

What if I stopped trying to replace my tools and started automating them instead?

What I Was Actually Chasing

Before I built anything, I tried the existing options. cto.new, Claude on the web, GitHub Copilot, Cursor's cloud agents. I gave each of them a real shot.

The common thread? Too much setup to get right. Every cloud agent required me to configure environments, grant permissions, set up repositories, and define workflows — and even after all that, I was never quite happy with the output. Things like UI testing either didn't work at all or simply weren't done. And that's not entirely the tools' fault — the setup burden was on me, and I never invested enough to get it right.

But that realization was the insight. I already had Claude Code running locally. I already had CLI tooling for GitHub, Firebase, and Google Cloud. I already had a workflow I trusted. The question wasn't "which cloud agent should I use?" It was "why am I not just orchestrating what I already have?"

Meet Composer

So I built Composer — a work queue and orchestration dashboard that manages a team of AI agents. Think of it like a small software company where every employee is an AI, and I'm the manager handing out assignments.

Here's the team:

| Agent | Think of them as... | What they do |
| --- | --- | --- |
| Designer 🎨 | Functional analyst + software architect | Reads the codebase, plans the approach, identifies files to change, creates a blueprint |
| Design Reviewer 🔍 | Product owner + lead architect | Reviews the design for quality and completeness — can send it back for revisions |
| Implementer ⚙️ | Software developer | Writes the code, runs linting and builds, commits to a branch, opens a pull request |
| Code Reviewer 🔎 | Lead engineer / senior developer | Reviews the Implementer's pull request, checks CI status, and iterates with the Implementer until the code passes |
| Tester 🧪 | QA engineer | Writes and runs tests — unit, integration, edge cases |
| Documenter 📝 | Technical writer | Updates READMEs, API docs, project and in-app documentation |

You describe what you want built. Composer breaks it down into phases and assigns it to the right specialists, in order. Each agent is a separate Claude Code session with its own instructions, its own perspective, and its own definition of "done."
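Conceptually, each stage is just a role plus its own instructions, launched as its own headless Claude Code session. Here's a minimal sketch of that idea — the names, prompts, and structure are my illustration, not Composer's actual code (`claude -p` is Claude Code's non-interactive print mode):

```typescript
// Minimal sketch (hypothetical names, not Composer's actual implementation):
// each pipeline stage is a role with its own instructions and "definition of done".
import { execFile } from "node:child_process";

interface Stage {
  role: string;
  instructions: string;
}

const featurePipeline: Stage[] = [
  { role: "Designer", instructions: "Read the codebase and produce a blueprint." },
  { role: "Design Reviewer", instructions: "Approve the blueprint or send it back with feedback." },
  { role: "Implementer", instructions: "Write the code, lint, build, and open a pull request." },
  { role: "Code Reviewer", instructions: "Review the PR and iterate until it passes." },
  { role: "Tester", instructions: "Write and run unit and integration tests." },
  { role: "Documenter", instructions: "Update READMEs and API docs." },
];

// Each stage gets its own prompt, so each Claude Code session sees only its role.
function buildPrompt(stage: Stage, task: string): string {
  return `You are the ${stage.role}. ${stage.instructions}\n\nTask: ${task}`;
}

// A stage runs as its own headless Claude Code session (error handling omitted).
function runStage(stage: Stage, task: string, done: (output: string) => void): void {
  execFile("claude", ["-p", buildPrompt(stage, task)], (_err, stdout) => done(stdout));
}
```

The point of the separate sessions is isolation: the Code Reviewer never inherits the Implementer's context, so it actually reviews rather than rubber-stamps.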

Composer's multi-agent pipeline — six specialized agents collaborate with built-in review loops to go from idea to shipped code.

Under the hood, it's a Node.js backend with a React dashboard, SQLite for persistence, and WebSocket for real-time updates. But the magic isn't the tech stack — it's what happens when you let these agents collaborate.

The Assembly Line

The pipeline isn't just a linear sequence. It has feedback loops built in.

The Design Reviewer can send the Designer back to the drawing board — up to three times. If the blueprint isn't solid, it doesn't move forward. The Code Reviewer does the same with the Implementer. If the pull request doesn't meet standards, the feedback goes back, the code gets revised, and the review happens again. Nothing ships without passing quality gates.
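In code terms, that gate is just a bounded loop: produce, review, and feed rejection feedback into the next draft, with a hard cap on revisions. A simplified synchronous sketch — the cap of three mirrors the behavior above, but the function names and shape are illustrative:

```typescript
// Simplified sketch of a bounded review loop (illustrative, not Composer's code).
interface ReviewResult {
  approved: boolean;
  feedback?: string;
}

function reviewLoop(
  produce: (feedback?: string) => string,     // e.g. the Designer drafting a blueprint
  review: (artifact: string) => ReviewResult, // e.g. the Design Reviewer judging it
  maxRevisions = 3
): string {
  let feedback: string | undefined;
  // First draft plus up to maxRevisions revised drafts.
  for (let attempt = 0; attempt <= maxRevisions; attempt++) {
    const artifact = produce(feedback);
    const result = review(artifact);
    if (result.approved) return artifact; // quality gate passed
    feedback = result.feedback;           // back to the drawing board
  }
  throw new Error(`Rejected after ${maxRevisions} revisions; escalate to a human.`);
}
```

The same loop shape works for the Implementer/Code Reviewer pair — only the producer and reviewer change.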

Different task types get different pipelines. A full feature goes through all six agents. A bugfix skips design and goes straight to implementation, testing, and a doc check. A specification task doesn't write code at all — it creates phased GitHub issues as an implementation plan.
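One way to express that routing is a plain lookup from task type to stage list. This is a hypothetical sketch — the stage names match the team table above, but the exact stages per type are my reading of the behavior described:

```typescript
// Hypothetical task-type routing (my reading of the behavior, not Composer's code).
const pipelines: Record<string, string[]> = {
  // A full feature goes through all six agents.
  feature: ["Designer", "Design Reviewer", "Implementer", "Code Reviewer", "Tester", "Documenter"],
  // A bugfix skips design and goes straight to implementation, testing, and docs.
  bugfix: ["Implementer", "Code Reviewer", "Tester", "Documenter"],
  // A specification task plans but writes no code: it emits phased GitHub issues.
  specification: ["Designer", "Design Reviewer"],
};

function stagesFor(taskType: string): string[] {
  return pipelines[taskType] ?? pipelines.feature; // unknown types get the full pipeline
}
```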

And then there's quota management. I'm running on Claude Max, which has usage windows — a 5-hour rolling window and a 7-day window. Composer watches both. When usage is low, it runs up to three tasks in parallel. As quota fills up, it automatically throttles down to two, then one, then pauses entirely. Queued tasks show a countdown: "Quota full — starts in 2h 14m." When the window resets, work resumes automatically. I don't have to babysit it.
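The throttling logic reduces to a step function over whichever usage window is most constrained. The thresholds below are invented for illustration — Composer's real cutoffs may differ:

```typescript
// Illustrative quota throttle: thresholds are invented, not Composer's real cutoffs.
// Inputs are fractions of quota used (0 = empty, 1 = full) for each window.
function allowedParallelism(fiveHourUsage: number, sevenDayUsage: number): number {
  const usage = Math.max(fiveHourUsage, sevenDayUsage); // tightest window wins
  if (usage >= 1.0) return 0;  // quota full: pause and show a countdown
  if (usage >= 0.85) return 1; // nearly full: one task at a time
  if (usage >= 0.6) return 2;  // filling up: throttle to two
  return 3;                    // plenty of headroom: up to three in parallel
}
```

Because both windows are rolling, re-evaluating this on a timer is enough for work to resume automatically when a window resets.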

Vibing It Into Existence

Here's the part that might surprise you: I built the first version of Composer in about a week. And I'm still building it. There was no grand design document, no detailed specification. I had a loose idea — orchestrate Claude Code sessions from a dashboard — and I started vibing.

Some days I'd sit down with a specific feature in mind: "Today I'm adding Slack integration so I can create tasks from my phone." Other days I'd use the tool, hit a friction point, and build the fix right then and there. The Slack bot and MCP integration came from a practical need — I wanted to operate Composer remotely, either from a Claude Code session via MCP or from my phone via Slack. So I built both.

And yes — the meta question everyone asks — I used Claude Code to build the orchestrator for Claude Code. The irony isn't lost on me. But honestly, it's the same workflow I use for everything now. I describe the architecture, Claude writes the code, and we iterate fast. This has become my standard approach for AI Labs projects.

I'd also been reading about others in the community building similar orchestration layers. That helped accelerate the idea from "interesting concept" to "okay, I'm doing this tonight."

The Moment It Built Itself

This is the story I keep telling people.

I needed an Activity Timeline feature for Composer — a way to see every event in a task's lifecycle. When did it start? When did the reviewer send it back? When did the PR merge? I wanted a full audit trail with a real-time UI.

So I used my /spec-writer tool to draft a detailed specification and create it as GitHub Issue #22. The spec covered everything: database schema, API endpoints, WebSocket events, a timeline UI component — the works. Four implementation phases.

Then I opened Composer's dashboard and created a task: "Implement Issue #22."

Composer's Designer read the spec and the codebase. The Design Reviewer approved the approach. The Implementer wrote the code — new database table, event logger, 15+ instrumentation points across the backend, a REST endpoint, WebSocket broadcasting, and a full React timeline component. The Code Reviewer approved the PR. The Tester added tests. The Documenter updated the docs.

Then I created one more task: "Pull latest, build, and restart the service."

It worked. Composer pulled its own updated code, rebuilt itself, and restarted. When the dashboard came back up, there was a brand new Activity Timeline on every task detail page — including the task that had just built it.

I'm not going to pretend I was cool about it. That was a genuine whoa moment.

What It Can't Do (Yet)

I want to be honest about the gaps, because this is still an evolving experiment.

The biggest challenge is UI and visual testing. My agents can write and run unit tests and integration tests, but browser-based testing is a different beast. Which tools do I use — the Chrome DevTools Protocol or Playwright? How do I manage test credentials for role-based auth? My platforms all use different roles with different permissions, and each combination needs its own test scenarios.

There's also a deeper problem that I think the entire industry is wrestling with: AI agents think in happy paths. They're excellent at testing the expected flow. They're much less reliable at imagining the weird edge cases — the timeout during a race condition, the user who pastes emoji into a number field, the session that expires mid-transaction. Edge case coverage is a massive opportunity, both for my workflow and for AI tooling in general.

I haven't tried to tackle this head-on yet. But it's on the list.

The Real Takeaway

Composer is running production workloads right now. It's building features for TutorPro (still in testing), shipping updates to Leaderboard Fantasy, and — as you just read — improving itself. I've essentially eliminated my dependency on cloud agents. I run everything locally, on my own hardware, with my Claude Max subscription, on my terms.

But here's the thing — I'm not going to tell you to go build your own orchestration platform. That's not the takeaway.

We all work differently. We all want different levels of control. OpenClaw wasn't for me — but it might be perfect for someone else. Cursor's cloud agents didn't click for me — but plenty of developers swear by them. The tools are getting better every week, and honestly, my workflow will probably look completely different in a few months.

What I will say is this: the landscape is evolving so fast that the best strategy is to stay curious and keep experimenting. Find what works for you and your organization. Push the boundaries of what's possible, be honest about what doesn't work yet, and share what you learn along the way.

That's what AI Labs is for. That's what this blog is for. And that's what I'll keep doing.

–Jeremy

