When "Continue" meant "start over"

Our agents build things. Ask one to put together a business case and it will create use cases and write the calculations and inputs underneath them, calling tools that write each of those to Firestore as it goes. A big request takes a lot of steps, and a single run doesn't always finish: it can hit the per-request turn limit, run past the stream's duration cap, lose the client connection, or catch a provider error partway through a response. When that happens, the user gets a "Continue" button.

For a long time, "Continue" restarted the run instead of resuming it. The agent would recreate use cases it had already created and redo calculations it had already written, then spend turns finding the duplicates and deleting them. Everything from the previous run was already in the database; the agent just couldn't see that it had done any of it.

Side effects survive, the transcript doesn't

A turn produces two different things, and we had been treating them as one.

The first is side effects: the documents the agent's tools write. When a write tool runs, it updates Firestore immediately — that data is durable the moment the tool returns, independent of anything the agent framework does afterward.

The second is conversational memory: the transcript of the turn — which tools were called, what they returned, what the model said. On the next turn we replay that transcript to the model so it knows what it has already done.

We build on the OpenAI Agents SDK, and it saves that transcript to the session at the end of a turn — but only when the turn ends cleanly, on a final answer or a deliberate pause to ask the user something. Any other ending throws straight to the error handler, which records the user's input message and drops everything the turn generated. The ways a run actually ends badly in production — MaxTurnsExceededError, the stream-duration abort, a client disconnect, a mid-stream provider error — all take that path.

So on a messy exit the two stores disagree. The side effects are on disk, because the tools wrote them in real time, but the transcript is gone. On "Continue," the agent reads back a conversation in which it never acted, and builds everything a second time.
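A toy model makes the disagreement concrete. The sketch below is not the SDK's code: `Stores`, `runTurn`, and the `failAfter` parameter are hypothetical names, a `Set` stands in for Firestore, and a string array stands in for the session. It only mirrors the behavior described above, where tool writes land immediately and the transcript is saved only on a clean finish.

```typescript
// Toy model of one turn. All names here are illustrative assumptions.
interface Stores {
  documents: Set<string>; // stands in for Firestore
  session: string[];      // stands in for the SDK session transcript
}

function runTurn(
  stores: Stores,
  userInput: string,
  docsToWrite: string[],
  failAfter?: number, // simulate a turn limit after this many tool calls
): void {
  const generated: string[] = [];
  try {
    docsToWrite.forEach((doc, index) => {
      if (failAfter !== undefined && index >= failAfter) {
        throw new Error("turn limit exceeded");
      }
      stores.documents.add(doc); // side effect: durable immediately
      generated.push(`tool_call:${doc}`, `tool_result:${doc}`);
    });
    // Clean finish: the whole turn is saved.
    stores.session.push(userInput, ...generated, "assistant:done");
  } catch {
    // Messy exit (pre-fix behavior): only the user's input is recorded.
    stores.session.push(userInput);
  }
}
```

Running a three-document turn that fails after two tool calls leaves two documents in `documents` and a single entry, the user's input, in `session`. That gap is what the agent reads back on "Continue."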

Writing the transcript the framework skipped

The fix is to write that transcript ourselves when the turn doesn't end cleanly. In the shared error path, before we surface the turn-limit or error frame to the client, we take the items the SDK would have saved on a clean finish and add them to the same session.

The care is in matching the SDK's own save logic so that a later, clean turn doesn't write the same items twice. We skip the items it had already persisted earlier in the turn, drop the internal approval placeholders that aren't really conversation, strip the per-response IDs that shouldn't be replayed into the next request, and advance the SDK's persisted-item counter so its own save can't re-flush them. If the turn died between a tool call and its result, we leave the dangling pair alone — there is already a step at the start of the next turn that repairs orphaned tool calls.
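The filtering steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SDK's internals: `TurnItem`, `FailedTurn`, `persistedCount`, and `collectUnsavedItems` are hypothetical names, and the real item types carry more fields than shown here.

```typescript
// Hypothetical item shape; the SDK's real item types are richer than this.
interface TurnItem {
  type: "message" | "tool_call" | "tool_result" | "approval_placeholder";
  id?: string;      // per-response ID, not meant to be replayed
  callId?: string;  // links a tool result to its tool call
  content?: string;
}

// Hypothetical view of the turn's state at the moment it failed.
interface FailedTurn {
  generatedItems: TurnItem[]; // everything the turn produced, in order
  persistedCount: number;     // how many of those are already in the session
}

function collectUnsavedItems(turn: FailedTurn): TurnItem[] {
  // 1. Skip what was already persisted earlier in the turn.
  const unsaved = turn.generatedItems.slice(turn.persistedCount);

  const cleaned = unsaved
    // 2. Drop internal approval placeholders.
    .filter((item) => item.type !== "approval_placeholder")
    // 3. Strip per-response IDs from a copy, leaving the original untouched.
    .map((item) => {
      const copy: TurnItem = { ...item };
      delete copy.id;
      return copy;
    });

  // 4. Advance the counter so a later save cannot flush these items again.
  turn.persistedCount = turn.generatedItems.length;

  // A trailing tool call without its result is returned as-is; repairing
  // it is left to the orphan-repair step at the start of the next turn.
  return cleaned;
}
```

Because the counter is advanced in the same call, invoking the function a second time on the same turn returns an empty list, which is the property that keeps a later clean save from writing duplicates.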

It is best-effort by construction. This code runs on a path where something has already failed, so if the recovery write itself throws, it logs and returns rather than turning a turn-limit into a crash.

We put it in one place on purpose. All six of our streaming agents — the business-case builder, the use-case agent, and insights among them — share a single endpoint factory and a single catch block. Putting the recovery there means every one of them, and every agent we add later, gets memory recovery without any code of its own.
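A shared factory with a single catch block, combined with the best-effort rule, might look roughly like this. The names `makeStreamingEndpoint`, `AgentDeps`, `recoverTranscript`, and `Frame` are assumptions made for illustration; the real endpoint streams frames to the client rather than returning one value.

```typescript
// Hypothetical frame and dependency types, for illustration only.
type Frame = { kind: "done" } | { kind: "error"; message: string };

interface AgentDeps {
  runAgent: (input: string) => Promise<void>; // may throw partway through a turn
  recoverTranscript: () => Promise<void>;     // writes the unsaved turn items
  log: (message: string) => void;
}

function errorMessage(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

// Every agent is built through this factory, so they all share one catch block.
function makeStreamingEndpoint(deps: AgentDeps): (input: string) => Promise<Frame> {
  return async (input) => {
    try {
      await deps.runAgent(input);
      return { kind: "done" };
    } catch (err) {
      // Best-effort recovery: a failure here is logged, never rethrown,
      // so a turn-limit does not turn into a crash.
      try {
        await deps.recoverTranscript();
      } catch (recoveryErr) {
        deps.log(`transcript recovery failed: ${errorMessage(recoveryErr)}`);
      }
      // Only after attempting recovery is the original error surfaced.
      return { kind: "error", message: errorMessage(err) };
    }
  };
}
```

In this shape an agent only supplies its own `runAgent`. If the recovery write throws, the caller still receives the original error frame and the recovery failure appears only in the log.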

Longer conversations, older bugs

While the bug existed, every failed turn dropped its own items, so persisted histories stayed small. Once turns started persisting, conversations got longer — and that reached two bugs that had been in the code the whole time, unreachable until histories grew.

The first was the history window. To bound cost, we replay at most the 150 most recent items of a conversation to the model. With longer histories, that slice could now begin in the middle of a turn, for instance on a tool result whose matching tool call had been trimmed off the top. Anthropic's Messages API rejects that: every tool_result needs the tool_use that produced it in the preceding message. We now trim the window forward to the first user message, which is always a clean turn boundary, so whatever we send starts at the beginning of a turn and every tool result in it still has its call. If a single turn is itself larger than the window, we fall back to dropping leading orphaned results.
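The trimming rule can be sketched as a small pure function. This is an illustration under assumptions rather than the production code: `HistoryItem` and `windowHistory` are hypothetical names, and the fallback here simply drops tool results at the very start of the window, since a result with nothing before it cannot have its call in the window.

```typescript
// Hypothetical, simplified item shape for the replayed history.
type HistoryItem =
  | { kind: "user"; text: string }
  | { kind: "assistant"; text: string }
  | { kind: "tool_call"; callId: string }
  | { kind: "tool_result"; callId: string };

function windowHistory(history: HistoryItem[], maxItems = 150): HistoryItem[] {
  if (maxItems <= 0) return [];

  // Take the most recent items to bound cost.
  const recent = history.slice(-maxItems);

  // Preferred path: start at the first user message, a clean turn boundary.
  const firstUser = recent.findIndex((item) => item.kind === "user");
  if (firstUser >= 0) return recent.slice(firstUser);

  // Fallback: the window sits inside one oversized turn, so drop any
  // tool results at the front whose calls were trimmed away.
  let start = 0;
  while (start < recent.length && recent[start]?.kind === "tool_result") {
    start += 1;
  }
  return recent.slice(start);
}
```

With a seven-item history of two turns and a window of five, the slice would begin on the first turn's tool result; trimming forward returns only the second turn, starting at its user message.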

The second was the plan checklist. The agent shows a live plan and ticks off steps as it finishes them. The code that marked a step done only knew how to find a plan proposed on the same turn — the step-update event carries a plan ID only when the plan was proposed in the same connection. The moment "Continue" resumed a plan from an earlier turn, the updates had no plan to attach to, and the checklist sat frozen while the agent kept working. We fixed it on both ends: the backend now patches the existing persisted plan in place instead of appending a second one (which the UI would render as a duplicate checklist), and the client falls back to the most recent plan when an update arrives without a resolvable ID.
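Both halves of that fix share one idea: resolve the update to an existing plan and patch it, rather than requiring a same-connection plan ID or appending a new plan. A minimal sketch, assuming hypothetical `Plan`, `PlanStep`, and `StepUpdate` shapes that are not taken from the actual codebase:

```typescript
// Hypothetical shapes for the plan checklist and its update events.
interface PlanStep { id: string; title: string; done: boolean }
interface Plan { id: string; steps: PlanStep[] }
interface StepUpdate { planId?: string; stepId: string; done: boolean }

function applyStepUpdate(plans: Plan[], update: StepUpdate): Plan[] {
  // Try to resolve the plan by ID when the event carries one.
  const byId =
    update.planId !== undefined
      ? plans.findIndex((plan) => plan.id === update.planId)
      : -1;

  // Fall back to the most recent plan when the ID is missing or unknown.
  const target = byId >= 0 ? byId : plans.length - 1;
  if (target < 0) return plans; // no plan exists yet; nothing to patch

  // Patch the existing plan in place of appending a second one, so the
  // list length never changes and no duplicate checklist can appear.
  return plans.map((plan, index) =>
    index !== target
      ? plan
      : {
          ...plan,
          steps: plan.steps.map((step) =>
            step.id === update.stepId ? { ...step, done: update.done } : step,
          ),
        },
  );
}
```

The function returns new objects instead of mutating its input, which is a choice made for this sketch; the point it illustrates is that an update arriving on a resumed turn, without a resolvable plan ID, still lands on the plan the user is looking at.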

There was also a quieter follow-on. Now that resuming no longer duplicates work, we raised the per-request turn ceiling from 25 to 50 for the two heaviest multi-entity builders, to let a large request finish in fewer "Continue" cycles. The analytical agents stayed at 25.

Where it stands

"Continue" now resumes the same run instead of restarting it, and the duplicate-and-delete behavior is gone.

Best-effort recovery is a deliberate principle, not a gap. On a path where something has already failed, the job is to never make things worse — so the recovery write can't itself become a crash, and if the very end of a turn leaves a tool call without its result, the next turn's orphan-repair step settles it. The system heals forward rather than carrying a mistake into the user's next request.

We already log every time the recovery path fires, so the behavior is observable today. The next step is to put it on a dashboard and watch it in production — which will let us quantify exactly how much memory the old behavior used to throw away.

Last updated: Jun 15, 2026
