How our AI coding agent generates 18% of our PRs

Download PDF

Every blog post about AI coding agents promises the same thing: ten engineers' worth of output, overnight, just add some agentic magic dust. Most are about generating code in a flashy zero-to-one demo. That's the easy part. At Checkout we've found that velocity gains come from the opposite instinct: not unleashing an agent but putting it on a narrow contract–bounded task, embedded in the processes we already had.

That's the work we hand to our Agent HAL–the in-house AI coding agent we built to work inside our process, not alongside it. The acceleration hasn't come from one heroic feature shipped without oversight; it's come from that contract, applied across a stream of small, well-scoped tickets and turned into reviewable pull requests.

Today, HAL is generating 18% of all pull requests opened across Checkout. We got here by making HAL narrow, event-driven, and–honestly–a little bit mundane on purpose.

Here is how we moved agentic engineering out of the chat window and directly into our core development pipeline.

‍

Embedded, not bolted-on

HAL is wired into the tools the team already lives in. Work enters through a few well-defined gates – a labelled ticket, a scheduled job, or activity on a pull request it already owns – and comes out as a pull request a human reviews. In between, HAL reads the backlog, picks up the tickets that are ready, and implements them. But it doesn’t just drop a first draft and leave. HAL actively manages the lifecycle of the PR: it reviews its own changes, fixes its own failing CI checks, and continuously rebases the branch whenever main moves.

By handling the tedious branch babysitting that normally causes PRs to rot and drift, HAL ensures that by the time an engineer looks at the code, it is still fresh and immediately mergeable.

It starts in the backlog

The cheapest place to lose a week is before any code is written–on a ticket too vague to act on. So HAL works the backlog first: on a schedule it scores how ready each incoming ticket is and writes that back onto it. The same check gates HAL itself–too thin a ticket, and it asks for detail rather than work against unclear requirements. The backlog gets sharper because the agent keeps asking it to.

Small features, not just maintenance

The part most people assume an agent can't do is build something new. In practice it's where HAL earns its keep–most product work is small and well-scoped, just queued behind bigger things.

A recent practical example: we wanted a new filter on an internal dashboard–a genuinely new feature that had sat in the backlog for weeks. HAL picked up the ticket, built it, and had a pull request open that afternoon; it was reviewed and merged the same day. Same story for new endpoints, front-end changes, and other bounded work.

Before any pull request, a separate review persona–not the one that wrote the code, with its own fresh context–checks the work against the ticket. That separation matters: an agent grading its own reading of a spec in the same context could be confidently wrong twice over–so we run the review in a new context with fresh system prompts, and of course with a human as the final word (for now). If it falls short, HAL says what's missing and stops.

One case study: the Friday-afternoon advisory

Maintenance is still part of the story–one lane now, not the whole road. The clearest example of the economics: context switching an engineer with interruptive tickets fragments their day.

On a Friday afternoon, vulnerability alerts landed against one of our services. Some real, some noise, all needing triage–historically half a day spent reading advisories to decide what applies. HAL takes the ticket, reads each advisory, updates the dependencies, pins what’s needed, runs the test suite, opens a pull request, and clears the alerts. The reviewing engineer's job shrank to: confirm the changes are dependencies-only, glance at the tests, merge. The same shape runs on a schedule across many repositories.

Control, not autonomy

None of this works without the narrow contract. HAL is not autonomous–it's a controllable agent operating inside guardrails that both Checkout Engineering and the team chose. There's no ambient agent deciding what to do next; every run traces to a specific instruction. When it goes off-track–and it does, because LLMs do–the engineer can stop it in one step, and feedback flows into the next run. A second pair of human eyes is preserved structurally: the people who described what to build can't be the ones who sign off on how it was built.

A subtler risk: automation complacency. The more a team trusts HAL, the easier it is to wave its PRs through–and for a payments company, a rubber-stamped change is exactly what you can't afford. The narrow contract is itself a defence: small, scoped diffs get genuinely reviewed, not skimmed. And HAL's PRs clear the same gates as any human's–tests, security scans, a named approver, nothing auto-merged. The trust is in the process, not the agent.

Beyond the 18%: the next phase of engineering

When an agent fleet has a hand in nearly a fifth of your pull requests, the nature of daily engineering shifts. We don't measure HAL solely by lines of code or raw speed; the real win is cognitive offload. By handing the low-context, high-friction work to a background agent, we've given engineers their focus back. Engineers aren't giving up the work they enjoy; they're giving up the work they used to put off.

That's the real promise of agentic engineering: not one all-powerful AI that replaces human thought, but a background workforce of narrow, specialised agents operating safely inside the guardrails we already had. HAL began as an experiment in whether an LLM could handle a few mundane tasks; today it's a core part of how Checkout.com ships software. As we expand it, we aren't chasing more autonomy–we're deepening its integration, so the boring, well-scoped work is done, reviewed, and waiting for our engineers in the morning.