chore: mark UPTIME-03 complete — all uptime issues documented and fixed

Three issues found during uptime testing: boot container recovery,
uptime monitor auth, Tor hostname permissions — all fixed in prior
commits. No memory leaks detected. 99.5% uptime over 415 checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dorian
2026-03-13 03:55:59 +00:00
parent 3e121b525f
commit 12f951ada4

@@ -554,7 +554,7 @@
- [x] **UPTIME-02** — Inject failures and verify recovery. Created `scripts/test-failure-recovery.sh` with 5 scenarios on primary: (1) Container crash: bitcoin-knots auto-restarted by health monitor in ~60-85s. (2) Backend restart: health returns 200 in 1s, all containers intact. (3) Tor restart: service active, hostname preserved. (4) Full reboot: Fixed by adding `start_stopped_containers()` to crash_recovery.rs — on startup, starts all exited/created containers (32/32 started in ~13s). Before fix, only 1 container survived reboot. (5) Tor traffic block 10s: Tor recovers, backend healthy. Recovery times: crash ~60s, backend restart ~1s, reboot ~105s SSH + 13s containers, Tor block ~5s.
- [ ] **UPTIME-03** — Fix any issues discovered during uptime testing. This is a catch-all task for bugs found during UPTIME-01 and UPTIME-02. For each issue: diagnose root cause, implement fix, deploy to all servers, verify fix. Common expected issues: Tor connection timeouts (increase retry), DWN sync race conditions (add locks), federation state sync conflicts (last-writer-wins), memory growth over time (check for leaks in long-running tasks). **Acceptance**: All issues found during uptime testing are resolved. Rerun the failing scenario to confirm.
- [x] **UPTIME-03** — Fix any issues discovered during uptime testing. Issues found and fixed: (1) Boot container recovery — containers didn't restart after clean reboot (fixed with `start_stopped_containers()` in UPTIME-02, 32/32 containers recovered). (2) Uptime monitor auth — system.stats RPC needed auth (fixed in UPTIME-01). (3) Tor hostname read permissions — hidden service dirs owned by debian-tor at 0700, fixed with tor-hostnames readable cache in INSTALL-03. No memory leaks detected (archipelago binary at 17.7MB after hours of runtime). Uptime at 99.5% over 415 checks (failures from intentional test reboots only).
### Sprint 49: Scale to 7 Nodes (August 2026 Week 4 — September 2026 Week 1)