
fix(neon_local): allow restart of crashed compute endpoints#12898

Open
hawthornic wants to merge 1 commit into neondatabase:main from hawthornic:fix/crashed-endpoint-restart

Problem

When a compute endpoint is killed externally (kill -9), neon_local cannot restart it:

  • cargo neon endpoint stop fails because pg_ctl cannot signal a dead PID
  • cargo neon endpoint start fails because check_conflicting_endpoints sees the crashed endpoint as a "duplicate primary" (status is Crashed, not Stopped)
  • The user is stuck: neither command can bring the endpoint back

Root cause

check_conflicting_endpoints iterates over all endpoints and never excludes the endpoint being operated on. Its status filter skips only Stopped endpoints, not Crashed ones, so a crashed endpoint is reported as conflicting with itself. Additionally, stop() unconditionally calls pg_ctl stop, which fails when the postmaster is already dead.
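To make the self-conflict concrete, here is a minimal sketch of a conflict check over a simplified endpoint model. It is not the neon_local code: the `Endpoint`, `Status`, and `Mode` types, the `timeline` field, and the function name `find_conflicting_primary` are all hypothetical, and the real check may compare more fields. The only things taken from this PR are the idea that Stopped endpoints are skipped while Crashed ones are not, and the optional id to exclude.

```rust
// Simplified, hypothetical model; these are NOT the actual neon_local types.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Status {
    Running,
    Stopped,
    Crashed,
}

#[derive(Clone, Copy, PartialEq, Debug)]
enum Mode {
    Primary,
    Replica,
}

struct Endpoint {
    id: &'static str,
    timeline: &'static str, // assumption: conflicts are scoped per timeline
    mode: Mode,
    status: Status,
}

/// Returns the id of the first primary endpoint on `timeline` whose status is
/// anything other than Stopped, skipping `exclude_id` when one is given.
fn find_conflicting_primary<'a>(
    endpoints: &'a [Endpoint],
    timeline: &str,
    exclude_id: Option<&str>,
) -> Option<&'a str> {
    endpoints
        .iter()
        // An endpoint being restarted must not be compared against itself.
        .filter(|ep| Some(ep.id) != exclude_id)
        .filter(|ep| ep.timeline == timeline && ep.mode == Mode::Primary)
        // Only Stopped is skipped, so a Crashed endpoint still counts.
        .find(|ep| ep.status != Status::Stopped)
        .map(|ep| ep.id)
}
```

In this model, given a single crashed primary `ep-main` on timeline `main`, calling `find_conflicting_primary(&eps, "main", None)` returns `Some("ep-main")`, which mirrors the reported failure. Passing `Some("ep-main")` as `exclude_id` returns `None`, so the endpoint no longer blocks its own start.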

Fix

  1. check_conflicting_endpoints: Added exclude_id: Option<&str> parameter so an endpoint does not conflict with itself when being restarted. The Create path passes None (no-op), the Start path passes the endpoint being started.

  2. stop(): Before attempting pg_ctl stop, check whether the endpoint is Crashed. If so, verify that the postmaster process is truly dead via kill(pid, None) (signal 0, an existence check only), then clean up the stale postmaster.pid and let compute_ctl shut down. If the postmaster is alive despite the Crashed heuristic (a false positive from the 300ms TCP timeout during startup), fall through to the normal stop path.
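The branching in stop() described in item 2 can be sketched as a small pure function. This is an illustration under stated assumptions, not the code in this PR: `StopAction`, `plan_stop`, and the two-variant `Status` are hypothetical names, and the liveness probe is injected as a closure so the sketch needs no external crates. In the PR itself the probe is kill(pid, None). How the real code behaves when no PID can be read is not stated here; the sketch simply falls through to the normal path in that case.

```rust
// Hypothetical model of the decision made at the top of stop().
#[derive(Clone, Copy, PartialEq, Debug)]
enum Status {
    Running,
    Crashed,
}

#[derive(PartialEq, Debug)]
enum StopAction {
    /// Postmaster is gone: remove the stale postmaster.pid and skip pg_ctl.
    CleanUpStalePidFile,
    /// Postmaster is alive (or liveness is unknown): run the regular pg_ctl stop.
    NormalStop,
}

/// Decide how to stop an endpoint.
///
/// `is_alive` stands in for the signal-0 existence check; it is only called
/// when the endpoint looks Crashed and a PID is available.
fn plan_stop(status: Status, pid: Option<i32>, is_alive: impl Fn(i32) -> bool) -> StopAction {
    match (status, pid) {
        // Crashed and confirmed dead: nothing to signal, just clean up.
        (Status::Crashed, Some(pid)) if !is_alive(pid) => StopAction::CleanUpStalePidFile,
        // Crashed was a false positive, or the endpoint is healthy, or no PID
        // was available (assumption): use the normal stop path.
        _ => StopAction::NormalStop,
    }
}
```

Keeping the decision separate from the side effects (removing the file, invoking pg_ctl) is a choice made for this sketch only; it makes the false-positive case easy to exercise by passing a probe that always reports the process as alive.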

Testing

Manual testing confirmed all three paths work:

  • kill -9 → direct cargo neon endpoint start
  • kill -9 → cargo neon endpoint stop → start
  • Normal stop / start unaffected ✅

A regression test in test_runner/ is planned but not included in this PR.

When a compute endpoint is killed externally (kill -9), its status becomes
Crashed (stale postmaster.pid + no TCP connectivity). The user then cannot
restart it:

- 'endpoint stop' fails because pg_ctl cannot signal a dead process
- 'endpoint start' fails because check_conflicting_endpoints treats the
  crashed endpoint as a duplicate primary of itself

Fix:
1. Add exclude_id parameter to check_conflicting_endpoints so an endpoint
   does not conflict with itself when being restarted.
2. Handle the Crashed state in stop(): verify the postmaster is truly dead
   via kill(pid, None), clean up the stale postmaster.pid, and return
   success instead of bailing.