fix(neon_local): allow restart of crashed compute endpoints #12898
Open
hawthornic wants to merge 1 commit into
When a compute endpoint is killed externally (kill -9), its status becomes Crashed (stale postmaster.pid + no TCP connectivity). The user then cannot restart it:
- 'endpoint stop' fails because pg_ctl cannot signal a dead process
- 'endpoint start' fails because check_conflicting_endpoints treats the crashed endpoint as a duplicate primary of itself

Fix:
1. Add exclude_id parameter to check_conflicting_endpoints so an endpoint does not conflict with itself when being restarted.
2. Handle the Crashed state in stop(): verify the postmaster is truly dead via kill(pid, None), clean up the stale postmaster.pid, and return success instead of bailing.
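For reviewers, here is a minimal sketch of the self-exclusion idea from item 1. It is not the code in this PR: the `Endpoint`, `Mode`, and `Status` types, the `timeline` field, and the `Option<&str>` return value are simplified assumptions made for illustration, and the real function in `neon_local` reports conflicts differently. The status filter is unchanged from the behaviour described above (only `Stopped` is skipped); the new `exclude_id` check is what stops an endpoint from conflicting with itself.

```rust
// Hypothetical, simplified types for illustration; the real neon_local structs differ.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Running,
    Stopped,
    Crashed,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    Primary,
    Replica,
}

struct Endpoint {
    id: String,
    timeline: String,
    mode: Mode,
    status: Status,
}

/// Returns the id of a primary endpoint that conflicts on `timeline`, if any.
/// `exclude_id` names the endpoint being (re)started, so it is never
/// reported as a conflict with itself. Passing `None` excludes nothing.
fn check_conflicting_endpoints<'a>(
    endpoints: &'a [Endpoint],
    timeline: &str,
    exclude_id: Option<&str>,
) -> Option<&'a str> {
    endpoints
        .iter()
        // Skip the endpoint we are operating on.
        .filter(|ep| Some(ep.id.as_str()) != exclude_id)
        // Any other primary on the same timeline that is not Stopped conflicts.
        .find(|ep| {
            ep.timeline == timeline && ep.mode == Mode::Primary && ep.status != Status::Stopped
        })
        .map(|ep| ep.id.as_str())
}
```

In this sketch, a create-style call passes `None` and a start-style call passes `Some(id)` of the endpoint being started, so a `Crashed` primary no longer blocks its own restart.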
Problem
When a compute endpoint is killed externally (`kill -9`), `neon_local` cannot restart it:
- `cargo neon endpoint stop` fails because `pg_ctl` cannot signal a dead PID
- `cargo neon endpoint start` fails because `check_conflicting_endpoints` sees the crashed endpoint as a "duplicate primary" (status is `Crashed`, not `Stopped`)

Root cause
`check_conflicting_endpoints` iterates all endpoints and never excludes the endpoint being operated on. The status filter only excluded `Stopped`, not `Crashed`. Additionally, `stop()` unconditionally called `pg_ctl stop`, which fails when the postmaster is already dead.

Fix
- `check_conflicting_endpoints`: Added an `exclude_id: Option<&str>` parameter so an endpoint does not conflict with itself when being restarted. The `Create` path passes `None` (no-op); the `Start` path passes the ID of the endpoint being started.
- `stop()`: Before attempting `pg_ctl stop`, check whether the endpoint is `Crashed`. If so, verify the postmaster process is truly dead via `kill(pid, None)` (signal 0, an existence check only), then clean up the stale `postmaster.pid` and let `compute_ctl` shut down. If the postmaster is alive despite the `Crashed` heuristic (a false positive from the 300ms TCP timeout during startup), fall through to the normal stop path.

Testing
Manual testing confirmed all three paths work:
- `kill -9` → direct `cargo neon endpoint start` ✅
- `kill -9` → `cargo neon endpoint stop` → `start` ✅
- `stop`/`start` unaffected ✅

A regression test in `test_runner/` is planned but not included in this PR.
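Until that regression test exists, the `Crashed` branch of `stop()` described under Fix can be reasoned about with a small decision sketch. This is not the code in this PR: `StopAction`, `plan_stop`, and `parse_postmaster_pid` are hypothetical names, the `Status` enum is cut down to two states, and the liveness probe is passed in as a closure standing in for the `kill(pid, None)` check so the logic can be exercised without real processes. Treating a missing or unparseable `postmaster.pid` as "dead" is also an assumption of the sketch.

```rust
/// Simplified, hypothetical status enum; only the states this sketch needs.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Running,
    Crashed,
}

/// What stop() should do next (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum StopAction {
    /// Postmaster is gone: remove the stale postmaster.pid and report success.
    CleanupStalePid,
    /// Postmaster is alive (or status is not Crashed): run the normal `pg_ctl stop` path.
    NormalStop,
}

/// The first line of postmaster.pid holds the postmaster PID.
fn parse_postmaster_pid(contents: &str) -> Option<i32> {
    contents.lines().next()?.trim().parse().ok()
}

/// Decide how to stop an endpoint. `is_alive` stands in for the signal-0
/// existence check; it is injected so the decision logic stays testable.
fn plan_stop(
    status: Status,
    pid_file: Option<&str>,
    is_alive: impl Fn(i32) -> bool,
) -> StopAction {
    if status != Status::Crashed {
        return StopAction::NormalStop;
    }
    match pid_file.and_then(parse_postmaster_pid) {
        // Crashed was a false positive (e.g. TCP timeout during startup).
        Some(pid) if is_alive(pid) => StopAction::NormalStop,
        // Assumption: no readable PID or a dead process both mean cleanup.
        _ => StopAction::CleanupStalePid,
    }
}
```

Keeping the probe behind a closure is a design choice for the sketch only; it lets a future test cover both the truly-dead case and the alive-despite-`Crashed` fall-through without sending `kill -9` to anything.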