process-segments consumer crash-loops after upgrading #4346

@Mateolem

Self-Hosted Version

25.11.1

CPU Architecture

x86_64

Docker Version

25.0.3

Docker Compose Version

2.32.3

Machine Specification

  • My system meets the minimum system requirements of Sentry

Installation Type

Upgrade from 25.8.0 to 25.11.1

Steps to Reproduce

Environment

  • Self-hosted version: 25.11.1
  • Upgraded from: 25.8.0
  • Host OS: Linux x86_64
  • Available RAM: ~12 GiB free of 48 GiB total
  • Shared memory (/dev/shm): 22 GiB available, <1 MB in use
  • Disk: 1.3 TB available
  • Swap: 16 GB, within the expected system requirements

Description

After upgrading from 25.8.0 to 25.11.1 (with a data migration to SeaweedFS), the process-segments consumer enters a continuous crash-restart loop and never successfully starts. All other containers are healthy and the Sentry application itself is functional. Only process-segments is affected.

The container starts, initializes its multiprocessing pool (including running parallel_worker_initializer), begins consuming from the buffered-segments Kafka topic, and then crashes approximately 20 to 30 seconds into processing every time, without exception.

Observed behaviour

The crash cycle follows this consistent pattern:

  1. Container starts, multiprocessing pool initializes successfully
  2. Consumer is assigned Partition(topic=Topic(name='buffered-segments'), index=0)
  3. Worker begins processing, then emits one or more incomplete batch warnings
  4. Child process terminates and the parent receives signal 17 (SIGCHLD)
  5. Parent process crashes with ChildProcessTerminated: 17
  6. Container restarts and the cycle repeats
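A note on reading the number in step 5: on Linux x86_64, signal 17 is SIGCHLD, which is the notification a parent receives when any child exits. It does not say what actually ended the child. The following is a minimal sketch, not taken from arroyo or Sentry, of how a signal number and a child's return code can be decoded with the Python standard library. The helper names `signal_name`, `describe_returncode`, and `kill_and_describe` are hypothetical.

```python
import signal
import subprocess
import sys


def signal_name(signum):
    """Return the symbolic name for a signal number on this platform."""
    return signal.Signals(signum).name


def describe_returncode(returncode):
    """Explain a subprocess return code in plain words."""
    if returncode is None:
        return "still running"
    if returncode < 0:
        # On POSIX, a negative return code means "killed by signal -returncode".
        return f"killed by {signal_name(-returncode)}"
    return f"exited with status {returncode}"


def kill_and_describe():
    """Start a sleeping child, kill it with SIGKILL, and describe how it ended."""
    proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    proc.kill()   # sends SIGKILL on POSIX, similar to what an OOM kill would do
    proc.wait()   # reap the child so returncode is populated
    return describe_returncode(proc.returncode)


if __name__ == "__main__":
    print(signal_name(signal.SIGCHLD))  # what the parent is told
    print(kill_and_describe())          # what actually happened to the child
```

In this sketch the parent would be notified with SIGCHLD regardless of the cause, while the return code carries the real reason. The equivalent information for the process-segments worker is what is missing from the logs.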

Error output

WARNING arroyo.processing.strategies.run_task_with_multiprocessing: Received incomplete batch (57.00% complete), resubmitting

Traceback (most recent call last):
  File ".../run_task_with_multiprocessing.py", line 860, in __reset_batch_builder
    input_block = self.__input_blocks.pop()
IndexError: pop from empty list

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".../processor.py", line 440, in _run_once
    self.__processing_strategy.submit(message)
  File ".../healthcheck.py", line 29, in submit
    self.__next_step.submit(message)
  File ".../run_task_with_multiprocessing.py", line 879, in submit
    self.__reset_batch_builder()
  File ".../run_task_with_multiprocessing.py", line 862, in __reset_batch_builder
    raise MessageRejected("no available input blocks") from e
arroyo.processing.strategies.abstract.MessageRejected: no available input blocks

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  ...
  File ".../kafka.py", line 67, in run_processor_with_signals
    processor.run()
    ...
    raise ChildProcessTerminated(signum)
arroyo.processing.strategies.run_task_with_multiprocessing.ChildProcessTerminated: 17

ERROR arroyo.processing.processor: Caught exception, shutting down...
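To make the first two tracebacks easier to follow, here is a deliberately simplified model of the pattern they show: a finite pool of input blocks, a `pop()` that fails once the pool is empty, and the `IndexError` being re-raised as `MessageRejected`. This is an illustration only and not arroyo's implementation. `BatchBuilderModel` and its methods are hypothetical names, and `MessageRejected` here is a local stand-in for arroyo's exception.

```python
class MessageRejected(Exception):
    """Local stand-in for arroyo's MessageRejected (illustration only)."""


class BatchBuilderModel:
    """Toy model of a strategy that owns a fixed pool of input blocks."""

    def __init__(self, input_blocks):
        self.input_blocks = list(input_blocks)
        self.current_block = None

    def reset_batch_builder(self):
        # Mirrors the shape of the traceback: pop a free block, or reject
        # the message if every block is still in use.
        try:
            self.current_block = self.input_blocks.pop()
        except IndexError as e:
            raise MessageRejected("no available input blocks") from e

    def release(self, block):
        # In this model, a block only becomes reusable once it is released.
        self.input_blocks.append(block)


if __name__ == "__main__":
    model = BatchBuilderModel(["block-a", "block-b"])
    model.reset_batch_builder()
    model.reset_batch_builder()
    try:
        model.reset_batch_builder()  # pool is empty now
    except MessageRejected as exc:
        print(f"{exc!r} caused by {exc.__cause__!r}")
```

In this model, if blocks are never released (for example because the worker holding them has died), every later submit is rejected. Whether that is what happens in the real consumer is an assumption.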

Investigation

We investigated the following potential causes:

  • Memory pressure: Ruled out. 12 GiB RAM available, /dev/shm has 22 GiB available with negligible usage. No OOM entries in dmesg.
  • Shared memory exhaustion: Ruled out. df -h /dev/shm confirms ample space, and only a single unrelated semaphore is present.
  • Kafka connectivity: Ruled out. The consumer successfully connects, receives its partition assignment, and begins consuming before crashing.
  • Kafka topic state: Ruled out. In addition to recreating the consumer groups, we deleted and recreated all Kafka topics related to this consumer, including buffered-segments itself, to ensure that no stale messages, corrupt data, or leftover topic configuration was contributing to the crash. The issue persists on a completely fresh topic with no existing messages.
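The memory and shared-memory checks above can be repeated with a small script. This is a minimal sketch using only the Python standard library and the Linux /proc/meminfo format. The function names `parse_meminfo`, `shm_report`, and `mem_available_gib` are hypothetical, and the default paths are assumptions about a typical Linux host.

```python
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: integer value} dict."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # kB for most fields
    return fields


def shm_report(path="/dev/shm"):
    """Report size and usage of the shared-memory mount at `path`."""
    usage = shutil.disk_usage(path)
    return {
        "total_gib": usage.total / 2**30,
        "used_mib": usage.used / 2**20,
        "free_gib": usage.free / 2**30,
    }


def mem_available_gib(meminfo_path="/proc/meminfo"):
    """Return MemAvailable in GiB, read from a meminfo file."""
    with open(meminfo_path) as fh:
        return parse_meminfo(fh.read())["MemAvailable"] / 2**20


if __name__ == "__main__":
    print(shm_report())
    print(f"MemAvailable: {mem_available_gib():.1f} GiB")
```

The numbers are only meaningful for the environment the script runs in. A container has its own /dev/shm mount, so running this inside the process-segments container may give different values than running it on the host.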

The error originates in arroyo's run_task_with_multiprocessing.py: the parent process attempts to reset the batch builder, finds no available input blocks, and then crashes when it is notified that a child process has terminated. The root cause appears to be the child worker process dying silently before the parent can recover. The 17 in ChildProcessTerminated: 17 is SIGCHLD, the notification the parent receives, so it does not identify what killed the child. The child's own stderr output is not surfaced in Docker logs, so the reason for its termination is still unknown.
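One general Python technique for getting output from a process that dies on a fatal signal is the standard library's faulthandler module, which writes a traceback to stderr on signals such as SIGSEGV. The sketch below demonstrates it on a throwaway child process. It is not Sentry or arroyo configuration, `CHILD_SOURCE` and `run_crashing_child` are hypothetical names, and whether it can be hooked into the process-segments workers is an assumption. It also cannot report a SIGKILL, since that signal cannot be handled.

```python
import signal
import subprocess
import sys

# Source for a throwaway child that enables faulthandler and then
# sends itself SIGSEGV to simulate a hard crash.
CHILD_SOURCE = """
import faulthandler, os, signal, sys
faulthandler.enable(file=sys.stderr, all_threads=True)
os.kill(os.getpid(), signal.SIGSEGV)
"""


def run_crashing_child():
    """Run the crashing child and return (returncode, captured stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    returncode, stderr = run_crashing_child()
    print("return code:", returncode)  # negative signal number on POSIX
    print(stderr)                      # faulthandler's dump from the child
```

Python also honours the PYTHONFAULTHANDLER environment variable, which enables the same handler at interpreter startup without code changes.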

Steps to reproduce

  1. Run a healthy self-hosted Sentry 25.8.0 instance
  2. Upgrade to 25.11.1 following the standard upgrade procedure
  3. Observe process-segments container entering a crash-restart loop

What we have tried

  • Restarting the full application
  • Re-running the full ./install.sh / docker compose up sequence
  • Verifying resource availability (RAM, shm, disk, swap)

Full logs

logs.txt

Expected Result

All containers, including process-segments, are healthy after the upgrade and the application is fully operational. At present the application is online, but because of the crash loop on process-segments it is not fully operational.

Actual Result

The process-segments container crash-loops as described above. See the attached logs.txt under "Full logs".

Event ID

No response
