Add write-once probe loop kill guard to prevent 15-min agent timeout #43569

Closed
pelikhan with Copilot wants to merge 3 commits into main from copilot/aw-failures-fix-copilot-cli-timeout

Copilot AI commented Jul 5, 2026

Reviewer workflows (PR Code Quality Reviewer, Impeccable Skills Reviewer, Matt Pocock Skills Reviewer) were hitting the 15-minute step hard timeout at a rate of 10/22 red runs in 6 hours. Root cause: the Copilot CLI agent enters an infinite probe loop, repeatedly calling write-once safe-output tools with empty arguments, receiving -32602 (JSON-RPC "Invalid params") rejections, and never exiting.

Changes

harness_retry_guard.cjs — detection primitives

  • WRITE_ONCE_PROBE_REJECTION_PATTERN / WRITE_ONCE_PROBE_REJECTION_THRESHOLD (3)
  • countWriteOnceProbeRejections(output) — counts occurrences of the rejection phrase in accumulated output
  • hasExcessiveWriteOnceProbeRejections(output) — true when count ≥ threshold
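As a rough illustration of the detection primitives listed above, here is a minimal sketch. The function and constant names and the threshold of 3 come from this description; the regular expression itself is an assumption (the actual rejection phrase is not shown here), so treat `/-32602/g` as a placeholder.

```javascript
// ASSUMPTION: placeholder pattern. The real rejection phrase matched by
// WRITE_ONCE_PROBE_REJECTION_PATTERN is not shown in this PR description.
const WRITE_ONCE_PROBE_REJECTION_PATTERN = /-32602/g;

// Threshold stated in the PR description.
const WRITE_ONCE_PROBE_REJECTION_THRESHOLD = 3;

// Counts how many times the rejection pattern occurs in the accumulated output.
function countWriteOnceProbeRejections(output) {
  if (typeof output !== "string" || output.length === 0) return 0;
  const matches = output.match(WRITE_ONCE_PROBE_REJECTION_PATTERN);
  return matches ? matches.length : 0;
}

// True once the count reaches the threshold.
function hasExcessiveWriteOnceProbeRejections(output) {
  return countWriteOnceProbeRejections(output) >= WRITE_ONCE_PROBE_REJECTION_THRESHOLD;
}
```

Because both functions take the whole accumulated output rather than a single chunk, a rejection phrase split across two data chunks is still counted once the second chunk arrives.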

process_runner.cjs — live kill guard

  • Added optional killGuard?: (output: string) => boolean to runProcess. Called after each stdout/stderr data chunk; fires SIGTERM the first time it returns true. Prevents any loop from burning the full step timeout.
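One way such a live kill guard could be wired into a process runner is sketched below. Only the `killGuard` callback shape, the per-chunk check, the single SIGTERM, and the `runProcess` name come from this description; the `createGuardedCollector` helper, the `runProcess` parameters, and the resolved result shape are hypothetical.

```javascript
const { spawn } = require("node:child_process");

// HYPOTHETICAL helper: accumulates output and invokes `kill` at most once,
// the first time killGuard(output) returns true.
function createGuardedCollector(killGuard, kill) {
  let output = "";
  let fired = false;
  return {
    onData(chunk) {
      output += chunk.toString();
      if (killGuard && !fired && killGuard(output)) {
        fired = true; // single-fire: later chunks never trigger a second kill
        kill();
      }
    },
    get output() {
      return output;
    },
    get fired() {
      return fired;
    },
  };
}

// ASSUMPTION: simplified signature and result shape, not the real runProcess.
function runProcess(command, args, { killGuard } = {}) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: ["ignore", "pipe", "pipe"] });
    const collector = createGuardedCollector(killGuard, () => child.kill("SIGTERM"));
    child.stdout.on("data", chunk => collector.onData(chunk));
    child.stderr.on("data", chunk => collector.onData(chunk));
    child.on("error", reject);
    child.on("close", (code, signal) =>
      resolve({ code, signal, output: collector.output, killGuardFired: collector.fired })
    );
  });
}
```

Keeping the guard state separate from the spawn call makes the fire, no-op, and single-fire cases checkable without starting a real child process.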

copilot_harness.cjs — wiring

  • Passes a killGuard to runProcess that fires on 3+ probe rejections in live output
  • Post-exit: detects the pattern, emits infrastructure_incomplete, breaks the retry loop (non-retryable — retrying would reproduce the same loop)
  • New "write_once_probe_loop" failure class in classifyCopilotFailure

// Before: agent hangs for 15 min, killed by Actions timeout
// After: harness detects 3rd rejection, sends SIGTERM, emits structured diagnostic

[copilot-harness] kill-guard fired — terminating process (SIGTERM)
[copilot-harness] attempt 1: excessive write-once probe rejections detected
  (count=3, threshold=3) — not retrying (agent probe loop terminated)
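The wiring described above might look roughly like the following sketch. The failure class name `write_once_probe_loop`, the `infrastructure_incomplete` diagnostic, the threshold of 3, and the non-retryable behavior come from this description; `runWithRetries`, `runAttempt`, the result fields, and the `/-32602/g` pattern are hypothetical stand-ins for the real harness code.

```javascript
const THRESHOLD = 3; // threshold stated in the PR description

// ASSUMPTION: placeholder pattern; the real rejection phrase is not shown here.
const countRejections = output => (output.match(/-32602/g) || []).length;

// HYPOTHETICAL retry loop. `runAttempt` stands in for a call to runProcess
// and is expected to resolve to { code, output }.
async function runWithRetries(runAttempt, maxAttempts) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await runAttempt({
      // Live guard: terminate as soon as the threshold is reached.
      killGuard: output => countRejections(output) >= THRESHOLD,
    });
    if (result.code === 0) return { ok: true, attempt };

    // Post-exit check: a probe loop is non-retryable, so stop here.
    const count = countRejections(result.output);
    if (count >= THRESHOLD) {
      return {
        ok: false,
        attempt,
        count,
        failureClass: "write_once_probe_loop",
        diagnostic: "infrastructure_incomplete",
      };
    }
  }
  return { ok: false, attempt: maxAttempts, failureClass: "other" };
}
```

Returning from inside the loop is what makes the failure non-retryable: a second attempt would be expected to reproduce the same probe loop and spend more of the step budget.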

Tests

  • 11 new tests in harness_retry_guard.test.cjs covering count/threshold/false-positive cases
  • 3 new tests in process_runner.test.cjs covering kill-guard fire, no-op, and single-fire semantics

Copilot AI and others added 2 commits July 5, 2026 15:40
…timeout

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
…and fix tools/list message

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix Copilot CLI agent-loop timeout in reviewer workflows" to "Add write-once probe loop kill guard to prevent 15-min agent timeout" Jul 5, 2026
Copilot AI requested a review from pelikhan July 5, 2026 15:45
@pelikhan pelikhan closed this Jul 5, 2026

2 participants