fix(ops): make telegram-claude-bridge daemon resilient #21

Merged
navigator merged 1 commit from feature/bridge-resilience-fix into main 2026-05-24 13:31:11 -03:00
Owner

Three fixes after the daemon from PR #20 crashed in its first production deploy on the agent host.

Symptoms (from agent journalctl):

  • Service crashed exit 1 ~100s after each msg arrived
  • systemd restarted, daemon picked same msg again (offset never persisted)
  • Logs showed only "msg from..." and two HTTP 200s (ack + maybe response), no "reply sent"

Root causes:

  1. set -e killed the daemon on any non-zero exit anywhere in the per-message path. Removed (kept -u, pipefail).
  2. save_offset ran AFTER process_message; if process_message crashed first, the offset was never written, so the daemon looped forever on the same msg.
  3. The main loop iterated updates via ... | jq | while read, which runs the loop body in a subshell that lost state (offset writes from the subshell were visible, but the parent died first).
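
The subshell effect behind root cause 3 can be reproduced in isolation. The sketch below is not the daemon's code; the `count` variable and the three sample ids are made up for illustration. It relies on bash's default behaviour (no `lastpipe`): a `while read` fed by a pipe runs in a subshell, so variables it changes are gone once the loop ends, while the same loop fed by a redirected tempfile runs in the current shell.

```bash
# Pipe-fed loop: the loop body runs in a subshell under default bash settings.
count=0
printf '1\n2\n3\n' | while read -r id; do
  count=$((count + 1))        # increments the subshell's copy only
done
count_pipe=$count             # parent's count was never touched

# Tempfile + redirect: the loop body runs in the current shell.
tmp=$(mktemp)
printf '1\n2\n3\n' > "$tmp"
count=0
while read -r id; do
  count=$((count + 1))        # increments the parent's variable
done < "$tmp"
rm -f "$tmp"
count_file=$count

echo "pipe: $count_pipe, tempfile: $count_file"
```

Whether lost variables were the exact failure in the daemon is not shown in this PR; the sketch only illustrates why a redirected loop is the safer shape.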

Fix:

  • Drop -e, keep -u -o pipefail
  • Move save_offset BEFORE process_message
  • Iterate via tempfile + redirected while-read (no subshell)
  • Add || fallback on jq calls so malformed Telegram responses don't kill the loop
  • Per-message errors get logged and skipped, daemon stays alive
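
A minimal sketch of how the pieces of the fix could fit together, assuming the tempfile holds one update id per line. `save_offset` and `process_message` are the names used in this PR, but their bodies here are stand-ins; `OFFSET_FILE`, `UPDATES_FILE`, `log`, and the simulated failure on update 2 are hypothetical. Storing `update_id + 1` follows the Telegram getUpdates convention and is an assumption about what the daemon persists.

```bash
#!/usr/bin/env bash
set -u -o pipefail            # deliberately no -e: one failed command must not end the daemon

OFFSET_FILE=$(mktemp)         # hypothetical; a real daemon would use a persistent path
UPDATES_FILE=$(mktemp)        # hypothetical tempfile holding the fetched updates

log() { printf '%s\n' "$*" >&2; }

# Stand-in: persist the next offset to request.
save_offset() { printf '%s\n' "$1" > "$OFFSET_FILE"; }

# Stand-in handler: fails on update 2 to simulate a per-message crash.
process_message() { [ "$1" != "2" ]; }

# Stand-in for the fetched batch, one update id per line.
printf '1\n2\n3\n' > "$UPDATES_FILE"

processed=0
failed=0
while read -r update_id; do
  # Persist first, so a crash in the handler cannot cause a replay.
  save_offset "$((update_id + 1))"
  if process_message "$update_id"; then
    processed=$((processed + 1))
  else
    log "update $update_id failed, skipping"
    failed=$((failed + 1))
  fi
done < "$UPDATES_FILE"        # redirect, not a pipe: the loop runs in the current shell
rm -f "$UPDATES_FILE"
```

The trade-off of saving the offset first is at-most-once handling: a message whose handler fails is skipped rather than retried.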

Verified locally on agent: 'claude -p reply ok' returns in ~3s (Xeon E5-2630 v2 with AVX). The failing component is the daemon shell, not claude or systemd.

Tests: shell only. shell-lint validates the daemon. Auto-merge via auto-merge.sh.

fix(ops): make telegram-claude-bridge daemon resilient
All checks were successful
build / scalafmt-check (push) Successful in 4s
build / sbt-compile (push) Successful in 4s
build / shell-lint (push) Successful in 30s
build / scalafmt-check (pull_request) Successful in 10s
build / sbt-compile (pull_request) Successful in 13s
build / shell-lint (pull_request) Successful in 24s
975dbee2e5
Three fixes after first deploy crashed in production:

1. Remove 'set -e' (keep -u -o pipefail). The main loop could die when
   a per-message command returned non-zero, killing the whole daemon
   instead of just logging and continuing.

2. Save offset BEFORE process_message, not after. Previously a crash
   in process_message would lose the offset write, causing infinite
   re-processing of the same message on systemd restart.

3. Iterate updates via tempfile instead of pipe-to-while. The pipe
   creates a subshell where save_offset's file write was lost when
   the parent terminated.

Also added '|| fallback' suffixes on jq parses so malformed
Telegram responses don't kill the loop.
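
One possible shape for such a fallback, as a hedged sketch: `parse_update_id` and the `none` sentinel are hypothetical and not taken from the daemon. It uses jq's `-e` flag so that a missing or null field also yields a non-zero exit, and captures the output before applying the fallback so partial output never mixes with the sentinel.

```bash
# Hypothetical helper: print the update_id of one update object, or "none".
parse_update_id() {
  local id
  # -r: raw output; -e: non-zero exit when the result is null or false.
  # jq's parse errors go to stderr and are discarded here.
  id=$(printf '%s' "$1" | jq -er '.update_id' 2>/dev/null) || id="none"
  printf '%s\n' "$id"
}

bad=$(parse_update_id 'not json at all')      # malformed input
missing=$(parse_update_id '{"ok":true}')      # valid JSON, no update_id
echo "malformed: $bad, missing field: $missing"
```

A caller can then test for the sentinel and skip the update instead of letting a parse failure propagate.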

Verified on agent: 'claude -p reply ok' returns in ~3s. Bridge itself
was the failing component; fixes target only the daemon shell, not
claude or systemd unit.
fluidpop-bot left a comment
Collaborator

CI green (head 975dbee2e5), auto-approving

Reference
Fluid/fluidpop-v1!21