I gave the same feature to 3 different agents and then things escalated
March 24, 2026
There was a feature in my backlog I hadn't had time for until now. I started in the usual way, planning it out with Codex, my chosen cli agent for this one, running gpt-5.4 on the model's high setting. We talked through it, I answered a few "product questions" as it puts it, and off it went - it put the feature together in about 30 minutes of rhythmically fading in and out text in a cli window. When it was done I ran my internal lint-style suite checks, glanced at the UI, and sent it off to be cross-reviewed by several different agents. They found a few things, we talked it through, fixed it, and 2 more of the same passes later everything looked good.
The feature was essentially 2 new checkboxes. If you're asking "how can two tiny inputs take OpenAI's best model half an hour, and it still didn't get them right?" the answer, of course, is the backend - the complexity these two new properties add there - and to explain I'll have to describe the feature as briefly as I can before any of this makes much sense.
What I've been working on is the tail end of a bigger project, a new web game involving "composable" chess variants. Users can assign rules to squares on the board and play games on them, basically.
The "before" of what I'm having these agents work on is this ruleset here - users can set these inputs and then drag the ruleset onto squares on the board they're making. This is of course the first rule, because without it there's no "chess": pawns couldn't promote after their 6 square journey to the opposite team's back rank. But as you can see, users can set it to a lot of different things here to create different experiences.
The option that's missing is kings, for both the "from" and "to" sides. My system already handles multiple-king situations; users who create boards with more than 1 king per side have to select whether the game ends on the first or the last king capture. So the feature is: add kings, that's it. When an eligible piece moves into a square with this rule, the player is given the option to promote to a king, and it works for either team too. There's the 30 minutes.
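To make the shape of the rule concrete, here's a minimal sketch of how a promotion rule like this might be modeled. Every name here is my invention for illustration - this is not the game's actual schema:

```typescript
// Hypothetical sketch, not the real codebase's types.
type PieceType = "pawn" | "knight" | "bishop" | "rook" | "queen" | "king";

// Boards with multiple kings per side need a capture mode,
// matching the "first or last king capture" setting described above.
type KingCaptureMode = "first" | "last";

interface PromotionRule {
  from: PieceType[]; // pieces eligible to promote on this square
  to: PieceType[];   // pieces they may promote into
}

// The feature: "king" becomes a legal value on both sides of the rule.
const kingPromo: PromotionRule = {
  from: ["pawn", "king"],
  to: ["queen", "king"],
};

// When an eligible piece lands on a square carrying the rule,
// the player is offered the promotion choices; otherwise nothing happens.
function promotionChoices(rule: PromotionRule, moved: PieceType): PieceType[] {
  return rule.from.includes(moved) ? rule.to : [];
}
```

The checkbox part really is that small; as the post goes on to explain, the cost is everything downstream of `to: ["king"]` on the server.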
Enter Claude
What I had now was a completed feature, done by Codex and ready to PR. But at this point Claude Code's token allowance had reset, and I didn't have much else lined up for it. So I figured, hey, I've got this prompt just sitting here, right? And then I went a little nuts.
I committed this work to king-promo-codex and then branched off the commit before it, creating king-promo-claude. We went down the same path with Claude (Opus 4.6, high reasoning): planning it out, making sure we were on the same page about the feature, and off it went. It finished about 10 minutes faster, but the experience was very similar - small lint errors, fixed by Claude, some defects caught by review, then more fixing and review until it was good. It seemed to work just like Codex's version.
So then I said "Well, I have these 2 implementations in 2 places - which one is better?" and who better to find that out than these same 2 agents, of course. And by same I mean memory-wiped: I asked both to compare the two branches' code and tell me which one was better and why.
I'm going to shock you when I tell you that Claude thought Claude's branch was better, and Codex thought Codex's was. Claude explained thoroughly in its usual way, leading with "More complete core function". Codex tackled it more like a traditional code review, finding 3 issues and making a somewhat more convincing case for its preferred branch.
So I took those responses and gave them to each other: "I had another agent do this same analysis and this is what they came up with, thoughts?". Codex was not impressed at all, coming back with "I think that analysis overweights helper design and picks the wrong winner." Interestingly, Claude was much more amenable to changing its mind, concluding that "But the other agent is right that functional completeness matters more than structural elegance."
That was enough for me - I was going with Codex's implementation. But how could I get the "best" code out of all of this? I took Claude's last response, gave it back to Codex, and asked it to pull out anything actionable from what Claude was attempting that would actually make the Codex branch better, or fix any remaining issues. That led to more discussion, implementation, and code review, and then that was it: I had one very token-expensive feature ready to PR. But it was probably "better" - the changes made sense, the agents learned from each other, the feature improved - and I should have just PR'd it right then.
Then I had a thought: "It's close to the end of the month, I have most of my tokens left for github copilot cli, that has Gemini 3 pro preview, and I have this prompt..."
Enter Gemini
So off we went again, this time through our third cli experience, set to the only Google model I have access to. We planned it out, it seemed to get it, it finished its first pass in about Claude-time, and we started reviews.
The first review found many items, which I sent back to Gemini to address. Then back to review, which found more items, and back for fixes. After many rounds of this - review finding more and more issues that Gemini happily declared fixed after each batch of changes - I told my reviewers that for the eighth and final round I planned to run, they should "in as much detail as you can, describe the problems and fixes you'd do if you were in the shoes of the agent assigned to work on this" for every issue found. It failed miserably: the ninth round of reviews surfaced multiple blocking issues, much like all the rest.
If you're wondering what could be so hard here: I mentioned that the draggable promotion rule was the first rule. There are currently 5 others, and they can - and must - stack and reconcile with all of the others on a square, and they all have their own subrules too.
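As a rough illustration of why the stacking is the hard part, here's a hedged sketch - invented names, not the real engine - of several rules composing on a single square:

```typescript
// Hypothetical sketch of rule stacking; the actual game's rule names,
// effect model, and reconciliation order are unknown to me.
interface Rule {
  kind: string;
  // Each rule transforms the list of effects produced so far.
  apply(effects: string[]): string[];
}

// Rules on a square must compose deterministically: apply each one in
// placement order, feeding each rule's output into the next.
function resolveSquare(rules: Rule[], initial: string[]): string[] {
  return rules.reduce((effects, rule) => rule.apply(effects), initial);
}

// Two of the six rules, crudely modeled as effect-appenders.
const promote: Rule = {
  kind: "promotion",
  apply: (e) => [...e, "offer-promotion"],
};
const flip: Rule = {
  kind: "flip",
  apply: (e) => [...e, "flip-board"],
};

// A move onto a square carrying both rules fires both, in order.
const result = resolveSquare([promote, flip], []);
```

The toy version is trivial; the real difficulty is that each rule's subrules can change what the others are allowed to do, which is exactly where Gemini kept losing the thread.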
Gemini kept getting things wrong: it never figured out how to make the feature work with the flip rule, had problems with win conditions and with pieces being exploded, and had all kinds of trouble making sense of the logic flows.
And because I'm a glutton - I did another 4 full rounds of review and fixes before throwing in the towel on seeing what Gemini could do. Then I had the Codex session that did all the original work review the Gemini branch and tell me what it saw. Its verdict (paraphrased): "It did real work - the UI changes are there, it added a test file, it tried to thread the rule through the server. But it made a bad architectural call early on, introducing a parallel win condition override instead of using the existing mechanism, and that one decision created problems everywhere." Which was about what I expected.
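For what it's worth, the anti-pattern Codex described is easy to sketch in the abstract. The names below are invented for illustration, not taken from the actual codebase:

```typescript
// Hypothetical sketch of a "parallel override" mistake, not real code.
interface GameState {
  winner?: string;         // the existing win-condition mechanism's answer
  winnerOverride?: string; // a second, parallel channel bolted on later
}

// Every consumer must now remember to check BOTH fields, and any
// code path that forgets silently disagrees with the rest - which is
// how one early decision "created problems everywhere".
function currentWinner(s: GameState): string | undefined {
  return s.winnerOverride ?? s.winner;
}

// The fix Codex implied: thread the new rule through the one existing
// mechanism so "who won" has a single source of truth, and delete the
// override field entirely.
```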
There was one more experiment to run: was either model biased by the branch names? To test that I'd have to make new branches, each named for the opposite of the agent that actually wrote it, and have both agents compare and review them again. And that will have to wait, as I ran out of tokens doing all of this.