How it started

Over the last year or two, our team has been doing more and more coding work with AI. We started by running CLIs like Claude Code locally, and quickly hit a practical problem: every agent task I kicked off held my laptop hostage. The agent was building, running tests, editing files, thinking about the next step; meanwhile I wanted to do something else, and the CPU fan started screaming. Forget about running five agents on five unrelated things at once.

Collaboration was even more painful. I’d run an interesting agent flow and want to show a teammate the progress — all I could do was screenshot or copy-paste the conversation. If they wanted to take over, I had to package up the working directory state and hand it over. Every time, it felt like we were emailing Word docs back and forth.

What we actually wanted was a remote, shareable, extensible AI coding environment — the local IDE stays, but the agent’s long-running tasks live “somewhere else” and we just subscribe to their progress. Like the experience of Anthropic’s Managed Agents — call an API, stream events back — but on our own infrastructure, with our own keys, hooked up to our own internal tools.

We looked at a few options:

  • Build our own K8s sandbox cluster. Technically doable. But images, networking, quotas, security hardening, multi-region — that pile alone is enough to keep a team busy for a year. We’re a few people.
  • Use Replit / E2B / Modal or similar off-the-shelf sandboxes. Data flows through a third party, and the GitHub integration has to be cobbled together.
  • Just call Anthropic’s / OpenAI’s managed agents directly. Can’t use our internal MCP servers, can’t use our customized system prompt, can’t use our own tool policies — and the data leaves our boundary.

We circled back to an answer that doesn’t look very mainstream: use GitHub Actions runners as the sandbox.

It sounds odd. GHA has always been a CI/CD tool, with a hard 6-hour timeout and ephemeral runners — not exactly designed for long-running tasks. But think about it: it already has almost every property we need.

  • It’s an isolated environment, fresh on every task (GitHub-hosted runners give each job a new virtual machine), so there is no contamination between runs.
  • It has native checkout permission for the repo — and operating on GitHub-hosted code is exactly what we want the agent to do.
  • It has full outbound network — calls to internal services and external APIs both work.
  • It has mature OIDC, which lets us achieve “zero secrets in the sandbox repo”.
  • For private repos we buy our own quota; for public repos GitHub’s free tier is enough.
  • Its instability and timeouts are known — they can be handled as engineering problems.

We decided to try it.


Three things we wanted

Stripped of romance and grounded in our daily reality, the requirements were pragmatic:

First, we wanted to fan out. I fire off five agents on five things; they run on different runners and don’t interfere with each other. I close the laptop and go home; they keep running. I come back the next morning and harvest the results.

Second, we wanted team-visibility and hand-off. For any agent task, anyone can use the SDK to subscribe to its event stream and see in the web UI what the agent is doing right now, what it’s thinking, what tools it used. If you need to interrupt, append input, or approve some high-risk action, anyone can step in. It’s not somebody’s private local process.

Third, we wanted freedom to evolve and experiment. Want to try a new model, swap the system prompt, hook the agent up to a new internal MCP server, add a hook intercepting some class of tool calls — all of these should be configured once in a central place, and every agent in flight should pick it up immediately. We did not want every engineer maintaining their own copy of “my Claude config”.

These three together describe what we wanted to build: an AI coding workbench, for ourselves.


First thing we got clear: don’t paper over GHA’s nature

The initial design impulse looked like this: since GHA cuts off at 6 hours and runners are occasionally flaky, we’d build a complete checkpoint mechanism — agent’s working directory, file system, temp files, running processes, all snapshotted, so a new sandbox could come up and resume seamlessly. Technically possible.

We went a few rounds on this design, and each round made us more uncomfortable:

  • Every tool call required an atomic checkpoint (commit + upload + ack), which slowed the happy path.
  • Filesystem snapshots had to be shipped to the control plane via git bundle or incremental tar — bandwidth and storage aren’t cheap.
  • The control plane had to manage “shadow repos” or object storage — that pile was bigger than the business code.
  • And once the agent has done anything with external side effects (called an external API, sent a Slack message), you simply cannot replay it — replaying is re-executing the side effect.

The real turning point was admitting one fact: GHA’s instability is its essential nature, not a defect. As the users of this workbench ourselves, we are perfectly capable of understanding that. Our job is not to disguise GHA as an immortal container — it’s to provide a clean contract on top of that nature.

So we drew the responsibilities cleanly:

Concern | Whose responsibility / how it’s handled
Whether files in the working directory survive across sandboxes | The agent committing & pushing frequently is the preferred path; on a clean sandbox exit (including timeout), a post hook auto-commits & pushes as a safety net.
Whether conversation state continues across sandboxes | The platform.
Idempotency of external side effects | Us, via tool policy and the platform’s idempotency proxy.
GHA’s 6-hour ceiling | Our own awareness.

Once that division was made, the design became sharp:

  • We don’t have to solve “working directory persistence”, a problem with no clean solution once external side effects are involved.
  • We only have to guarantee one thing: the conversation does not get lost, and a new sandbox can resume from where the conversation was cut off.

As long as that one thing holds, the agent itself handles working-directory differences — it has the full conversation memory, knows “I edited these files earlier”, and will use git to check and re-do as needed. A few more tokens spent, but the engineering complexity drops by an order of magnitude.

That said, “the agent commits frequently” is ultimately a contract — in practice an agent doesn’t commit every time it edits a file. Here GHA gave us a clean hook: the post step. An Action runs its cleanup code after main exits, regardless of whether main finished normally, was canceled, or was killed by timeout. We use this hook to the hilt: on each segment exit, the post step checks the working directory for uncommitted changes, batch-commits & pushes them to the working branch, and ships the latest session snapshot back to the control plane.

Only when the runner truly hard-crashes (OOM, infra failure — post doesn’t even get to run) do we lose the agent’s work from the last stretch. That’s rare, and we don’t try to save it — saving it means going back to the complex path we cut earlier. In the vast majority of day-to-day cases, even if the agent itself was lazy about pushing, the code lands safely on GitHub in the end.
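As a rough illustration of the safety net (not the actual Action code), the post step’s decision can be sketched as below. The helper names `plan_safety_net` and `run_safety_net`, the commit message, and the use of plain `git` commands are all assumptions made for the example.

```python
import subprocess


def plan_safety_net(porcelain_status: str, branch: str) -> list[list[str]]:
    """Return the git commands a post step would run (hypothetical helper).

    `porcelain_status` is the output of `git status --porcelain`; an empty
    string means the working tree is clean and there is nothing to save.
    """
    if not porcelain_status.strip():
        return []
    return [
        ["git", "add", "--all"],
        ["git", "commit", "-m", "chore: safety-net commit from post step"],
        ["git", "push", "origin", f"HEAD:refs/heads/{branch}"],
    ]


def run_safety_net(branch: str) -> None:
    """Inspect the current working directory and execute the plan."""
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        check=True, capture_output=True, text=True,
    ).stdout
    for command in plan_safety_net(status, branch):
        subprocess.run(command, check=True)
```

Keeping the planning step separate from execution makes the "nothing to commit" case a cheap no-op and easy to test without a repository.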

We wrote a very short “user contract” pinned to page one of the README: The agent runs in an ephemeral sandbox; commit & push frequently is good hygiene; we’ll cover you on a clean exit, but on a hard crash some loss is unavoidable. Everyone using this workbench starts from that single sentence.


Second thing we got clear: the control plane owns the truth, sandboxes are disposable fuel

Once “conversation continuity” was established as the only hard guarantee, the system structure clarified instantly.

The control plane is the source of all truth. It holds:

  • Every conversation event (every utterance the agent made, every tool it used, every step it thought through).
  • Every session snapshot (used to restore the agent’s internal state when restarting).
  • The state machine of every task (created, executing, paused, resumed, finished).
  • Every run configuration (model, system prompt, tool allow-list, quotas, etc.).
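The task state machine mentioned above could look roughly like this. The five state names come from the list; the transition table is an assumption for illustration, and a real one would likely also cover cancellation and failure.

```python
from enum import Enum


class TaskState(Enum):
    CREATED = "created"
    EXECUTING = "executing"
    PAUSED = "paused"
    RESUMED = "resumed"
    FINISHED = "finished"


# Assumed transition table; the real one may allow more moves.
ALLOWED = {
    TaskState.CREATED: {TaskState.EXECUTING},
    TaskState.EXECUTING: {TaskState.PAUSED, TaskState.FINISHED},
    TaskState.PAUSED: {TaskState.RESUMED},
    TaskState.RESUMED: {TaskState.EXECUTING},
    TaskState.FINISHED: set(),
}


def transition(current: TaskState, target: TaskState) -> TaskState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```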

Sandboxes are stateless execution fuel. When a sandbox starts up:

  • Pull run config from the control plane.
  • Pull session snapshot from the control plane (if this is a resume).
  • Clone the target code from GitHub.
  • Run the agent.
  • Stream events back to the control plane in real time.
  • Periodically push session snapshots back.
  • On interrupt or completion, exit gracefully.
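The startup sequence above can be sketched as a single function. This is a simplified illustration: the `ControlPlane` interface, its method names, and the idea that the agent yields `(event, snapshot)` pairs are assumptions, not the real protocol.

```python
from typing import Callable, Iterable, Optional, Protocol


class ControlPlane(Protocol):
    """Hypothetical client interface; method names are assumptions."""

    def fetch_run_config(self, task_id: str) -> dict: ...
    def fetch_snapshot(self, task_id: str) -> Optional[bytes]: ...
    def publish_event(self, task_id: str, event: dict) -> None: ...
    def store_snapshot(self, task_id: str, snapshot: bytes) -> None: ...


def run_segment(
    task_id: str,
    control_plane: ControlPlane,
    clone_repo: Callable[[dict], None],
    run_agent: Callable[[dict, Optional[bytes]], Iterable[tuple[dict, Optional[bytes]]]],
) -> None:
    """One sandbox segment: pull state, run, stream events and snapshots back."""
    config = control_plane.fetch_run_config(task_id)
    snapshot = control_plane.fetch_snapshot(task_id)  # None on a first run
    clone_repo(config)
    # The agent yields (event, snapshot-or-None) pairs as it works.
    for event, new_snapshot in run_agent(config, snapshot):
        control_plane.publish_event(task_id, event)
        if new_snapshot is not None:
            control_plane.store_snapshot(task_id, new_snapshot)
```

Nothing in the function survives the call, which is the point: all durable state lives behind the control-plane interface.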

If the sandbox crashes mid-flight, the control plane knows. It pulls a fresh sandbox from the pool, hands it the previous snapshot, and the new sandbox picks up from there. Apart from a one-line notice injected on resume, the agent doesn’t notice; neither do we.

The session snapshot is Claude Code’s own internal jsonl format. We decided to treat it as an opaque blob: we don’t try to understand or parse it, because Claude Code’s format is for Claude Code itself. The control plane only stores it, retrieves it, and validates its integrity. The “knowing less is more reliable” trade-off runs through the entire design.
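A minimal sketch of “store, retrieve, validate integrity” without ever parsing the content. The in-memory `SnapshotStore` class and the choice of SHA-256 as the integrity check are assumptions for illustration.

```python
import hashlib


class SnapshotStore:
    """In-memory stand-in for the control plane's snapshot storage."""

    def __init__(self) -> None:
        self._blobs: dict[str, tuple[bytes, str]] = {}

    def put(self, task_id: str, blob: bytes) -> str:
        """Store the bytes untouched and return their SHA-256 digest."""
        digest = hashlib.sha256(blob).hexdigest()
        self._blobs[task_id] = (blob, digest)
        return digest

    def get(self, task_id: str) -> bytes:
        """Return the stored bytes, refusing to serve a corrupted blob."""
        blob, digest = self._blobs[task_id]
        if hashlib.sha256(blob).hexdigest() != digest:
            raise ValueError(f"snapshot for {task_id} failed integrity check")
        return blob
```

The store never decodes the bytes, so a change in the snapshot format does not require a change here.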


Third thing we got clear: zero secrets in the sandbox repo

This one is my favorite. Our sandbox repo holds no long-lived secrets — no ANTHROPIC_API_KEY, no control-plane access token, no GitHub PAT.

Sounds impossible. The Action calls the Anthropic API, ships data back to the control plane — surely those need keys?

GitHub Actions’ built-in OIDC makes this elegant. Every workflow run can ask GitHub for a signed identity token (a JWT) that says “I am this repo, this workflow, this run, on this branch”. That token is sent to the control plane, which verifies the signature using GitHub’s public keys and then checks whether the token’s identity matches what it expected when it dispatched the task. If it matches, the control plane issues a 15-minute access token. If not, the door doesn’t open.

Under this mechanism, anyone reaching the control-plane API has to be a runner currently executing in our sandbox org, on a specific workflow, on a specific run — nobody else can impersonate that. Even if the entire git history of the sandbox repo leaks, there is no reusable token in there.
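The identity check on the control-plane side might look roughly like the sketch below. It covers only the step after the JWT signature has been verified against GitHub’s published keys, which would normally be done with a JWT library. The function names, the set of pinned claims, and the token record shape are assumptions; `repository`, `workflow`, `run_id` and `ref` are claims GitHub includes in its Actions OIDC tokens.

```python
import secrets
import time
from typing import Optional

GITHUB_OIDC_ISSUER = "https://token.actions.githubusercontent.com"
ACCESS_TOKEN_TTL_SECONDS = 15 * 60


def claims_match(claims: dict, expected: dict) -> bool:
    """Compare already-verified token claims with what was dispatched."""
    if claims.get("iss") != GITHUB_OIDC_ISSUER:
        return False
    # Which claims to pin is a policy choice; these four are an assumption.
    return all(
        claims.get(name) == expected[name]
        for name in ("repository", "workflow", "run_id", "ref")
    )


def issue_access_token(
    claims: dict, expected: dict, now: Optional[float] = None
) -> Optional[dict]:
    """Return a short-lived token record, or None if the identity is wrong."""
    if not claims_match(claims, expected):
        return None
    issued_at = time.time() if now is None else now
    return {
        "token": secrets.token_urlsafe(32),
        "expires_at": issued_at + ACCESS_TOKEN_TTL_SECONDS,
    }
```

Because `expected` is recorded at dispatch time, a valid token from any other run of the same workflow is still rejected.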

The most beautiful part: the sandbox repo is essentially an empty repo. A workflow template of under ten lines and a README, that’s it. No secrets panel to worry about, no leak surface to audit. The security model is so clean it makes you laugh.


What it looks like in daily use

The final test of any technical decision is daily experience.

We kick off a task with a small SDK: pick the agent, pick the repo, tell it what to do, and you get back a task object — stream events from it, respond to approval requests and questions when needed, wait for terminal state. The whole call can be made from a laptop, an internal web UI, or even from another agent (agent-in-agent is a common pattern). Once dispatched, I can close my laptop; events keep flowing to the control plane; a teammate opens the web UI and sees progress instantly, and can take over directly.
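To make the shape of that loop concrete, here is a sketch against a fake, in-memory task object. Everything in it is hypothetical: `FakeTask`, `follow`, the event types, and the field names stand in for the SDK, whose real names may differ.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class FakeTask:
    """Stand-in for the task object; the real SDK's names may differ."""

    scripted_events: list[dict]
    approvals: dict[str, bool] = field(default_factory=dict)
    state: str = "executing"

    def events(self) -> Iterator[dict]:
        for event in self.scripted_events:
            yield event
        self.state = "finished"  # terminal state once the stream ends

    def approve(self, request_id: str, allow: bool) -> None:
        self.approvals[request_id] = allow


def follow(task: FakeTask, auto_approve: set[str]) -> list[str]:
    """Consume the stream, answer approval requests, collect agent messages."""
    messages = []
    for event in task.events():
        if event["type"] == "approval_request":
            task.approve(event["id"], event["tool"] in auto_approve)
        elif event["type"] == "message":
            messages.append(event["text"])
    return messages
```

The same loop works whether the caller is a laptop script, a web backend, or another agent, since it only depends on the event stream.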

And the workflow file in the sandbox repo? Under ten lines, and it stays that way — business changes don’t touch this file. It only declares the trigger, the permissions, and a call to the Action we publish. That’s it. No business logic, no secret references, no per-task config — all of those live in the control plane.

Want to swap models? Change the control plane. Tweak the system prompt? Change the control plane. Add a new internal MCP server so the agent can query the company wiki? Change the control plane. Want a hook approval gate on some class of high-risk tools? Change the control plane. The sandbox repo doesn’t move a line. This is the experience you only get after strictly separating “infrastructure” from “configuration” — and it’s the most fundamental difference from “running Claude Code on your own laptop”. The latter has every engineer maintaining their own config; the former has the whole team sharing a single source of truth.


When a resume happens, nothing happens

The thing we’re most pleased with is that in daily use resume is essentially invisible.

Suppose an agent has been running for 5h55m. GHA fires the timeout and sends SIGTERM to the runner. The Action’s main receives the signal, tells claude “finish the current turn and stop”, waits for the child to exit, and exits itself. Now GHA’s post step takes over: it batch-commits & pushes anything the agent didn’t get around to pushing, ships the latest session snapshot back to the control plane, emits a segment.suspended event, and exits. The control plane marks the task as awaiting resume; tens of seconds later it pulls a fresh sandbox from the pool and dispatches the task. The new sandbox starts, clones the working branch (which already includes the post step’s safety-net commit), downloads the snapshot, restarts Claude Code, and injects a line: “the sandbox was rebuilt; your working branch is at commit X; if you find some in-memory edits aren’t on disk, please verify with git”. The agent carries on.
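The “finish the current turn and stop” behaviour can be illustrated with a stop flag that the signal handler sets and the turn loop checks. This is a simplified sketch: the real main talks to a Claude Code child process, whereas here each turn is a plain callable, and all function names are assumptions.

```python
import signal
import threading
from typing import Callable, Iterable


def make_stop_handler(stop: threading.Event) -> Callable:
    """Build a SIGTERM handler that only records the request to stop."""
    def handler(signum, frame) -> None:
        stop.set()
    return handler


def run_turns(turns: Iterable[Callable[[], str]], stop: threading.Event) -> list[str]:
    """Run agent turns one at a time; after a stop request, finish the
    turn in progress and then return."""
    finished = []
    for turn in turns:
        finished.append(turn())  # the turn in progress always completes
        if stop.is_set():
            break
    return finished


def main_entrypoint(turns: Iterable[Callable[[], str]]) -> list[str]:
    """Wire the handler to SIGTERM, then run until done or asked to stop."""
    stop = threading.Event()
    signal.signal(signal.SIGTERM, make_stop_handler(stop))
    return run_turns(turns, stop)
```

The handler does no work itself, so the loop decides when it is safe to exit and a turn is never cut off halfway.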

For us, subscribed to the event stream, the whole thing is just a segment.suspended event followed tens of seconds later by a segment.resumed event. If the web UI wants to be fancy, it can show “reassembling sandbox…”. If you don’t care, treat it as if nothing happened.

The continuity of the event stream matters more than the continuity of physical execution — that is the central insight of this design. As long as the externally visible conversation stream looks continuous, the internal infrastructure can flail however it likes.
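On the subscriber side, hiding or surfacing a sandbox swap is a small rendering decision. The sketch below assumes events are dictionaries with a `type` and an optional `text`; only the `segment.suspended` and `segment.resumed` names come from the description above, and `render_stream` is a hypothetical helper.

```python
def render_stream(events: list[dict], show_resume: bool = False) -> list[str]:
    """Flatten a task's events into display lines.

    `segment.suspended` / `segment.resumed` mark a sandbox swap. They are
    hidden by default so the stream reads as one continuous conversation.
    """
    lines = []
    for event in events:
        kind = event["type"]
        if kind == "segment.suspended":
            if show_resume:
                lines.append("reassembling sandbox...")
        elif kind == "segment.resumed":
            continue
        else:
            lines.append(event.get("text", kind))
    return lines
```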


A note on complexity

The same pattern kept showing up while designing this workbench:

Imagine a flashy solution → list out the implementation details → notice that the details share a recurring root cause of complexity → step back and ask “whose responsibility is this part, really?” → once the responsibilities are drawn cleanly, the flashy solution becomes unnecessary.

Every round of “cut the flashy feature” left the system simpler, more reliable, and clearer to use. What stood out this time: a lot of complexity comes not from the requirements, but from us mistakenly believing “we shouldn’t have to understand certain low-level facts”.

GHA is unstable? We should understand. The agent’s working directory isn’t durable? We should understand. External side effects can’t be replayed? We should understand.

Once those understandings become part of the contract, the system can be minimal. If you insist on hiding those facts from yourself, the system bloats — and it never stays consistent across all edge cases anyway, because there will always be some corner where you have to confront the facts directly, and by then the trust in the system has already been pre-spent on the “considerate concealment”.

The workbench is for ourselves. “For ourselves” means we are both the engineers and the users — we don’t get to hide behind the “user education cost” excuse you’d use when shipping to an external audience, because we are the audience. Conversely, we don’t have to pretend we don’t understand anything either.


What this design isn’t good at

To be honest, some scenarios this workbench doesn’t fit well:

Tasks that don’t fit in a single 6-hour segment can lose much of their prompt cache: if the gap between segments exceeds about an hour, the cache expires and token consumption goes up substantially. For very long tasks our practice is to teach the agent in the system prompt to actively write progress to an issue or PR description, so that the critical state isn’t only in memory.

Workflows that depend heavily on external side effects need to be handled carefully. For example, an agent that streams Slack messages, triggers CI/CD, and writes downstream databases as it runs — those side effects can’t be undone on a segment restart. We provide a tool-policy layer to disable dangerous tools, and an idempotency proxy to make common side effects safely retryable, but fundamentally this is an engineering problem, not something the platform can absorb.
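A minimal sketch of how a tool-policy check and an idempotency proxy can work together. The `IdempotencyProxy` class, the key scheme (task id, tool name, and canonical JSON of the arguments), and the in-memory result cache are assumptions, not the platform’s actual design.

```python
import hashlib
import json
from typing import Any, Callable


class IdempotencyProxy:
    """Replays the recorded result when the same side effect is retried."""

    def __init__(self, denied_tools: set[str]) -> None:
        self._denied = denied_tools
        self._results: dict[str, Any] = {}

    @staticmethod
    def key(task_id: str, tool: str, args: dict) -> str:
        # Assumed key scheme: task + tool + canonical JSON of the arguments.
        payload = json.dumps([task_id, tool, args], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, task_id: str, tool: str, args: dict,
             execute: Callable[[dict], Any]) -> Any:
        if tool in self._denied:
            raise PermissionError(f"tool {tool!r} is disabled by policy")
        k = self.key(task_id, tool, args)
        if k not in self._results:
            self._results[k] = execute(args)  # the side effect happens once
        return self._results[k]
```

A key built this way also deduplicates two intentional, identical calls within one task, which is one reason this remains an engineering decision per tool rather than something the platform can settle for everyone.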

It is not a replacement for low-latency real-time applications. End-to-end event latency p95 is around 400ms — fine for code generation, PR automation, PR review, bug investigation. But for live chat or voice interaction, it’s the wrong tool. Those scenarios should call the LLM provider’s API directly and not wrap a layer of GHA around it.


Why I wrote this

Now that we use this thing every day and it just works, we increasingly think the approach might be useful to other teams of similar size. Not that this exact architecture should be copied; rather, the design taste behind it, deliberately shrinking what you commit to, may be underrated in the AI tooling space.

Looking back, the most unusual thing about this design is that it deliberately shrinks the scope of what it promises. We didn’t promise “perfect container persistence”. We didn’t promise “breaking past GHA’s time limit”. We didn’t promise “transparently hiding all low-level details”.

We promised one thing: the conversation does not get lost, and a new sandbox can resume from where the conversation was cut off.

Around that one thing, we designed a clean state machine, a clear event protocol, a zero-secret authentication model, and a ten-line workflow template.

A lot of technical decisions only become available once you stop trying to do everything.