From 12f951ada439fa6eb4a64621d62bc13f0e05e68a Mon Sep 17 00:00:00 2001
From: Dorian
Date: Fri, 13 Mar 2026 03:55:59 +0000
Subject: [PATCH] =?UTF-8?q?chore:=20mark=20UPTIME-03=20complete=20?=
 =?UTF-8?q?=E2=80=94=20all=20uptime=20issues=20documented=20and=20fixed?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three issues found during uptime testing: boot container recovery, uptime monitor auth, Tor hostname permissions — all fixed in prior commits. No memory leaks detected. 99.5% uptime over 415 checks.

Co-Authored-By: Claude Opus 4.6
---
 loop/plan.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/loop/plan.md b/loop/plan.md
index d0914ee8..cd002d01 100644
--- a/loop/plan.md
+++ b/loop/plan.md
@@ -554,7 +554,7 @@
 - [x] **UPTIME-02** — Inject failures and verify recovery. Created `scripts/test-failure-recovery.sh` with 5 scenarios on primary: (1) Container crash: bitcoin-knots auto-restarted by health monitor in ~60-85s. (2) Backend restart: health returns 200 in 1s, all containers intact. (3) Tor restart: service active, hostname preserved. (4) Full reboot: Fixed by adding `start_stopped_containers()` to crash_recovery.rs — on startup, starts all exited/created containers (32/32 started in ~13s). Before fix, only 1 container survived reboot. (5) Tor traffic block 10s: Tor recovers, backend healthy. Recovery times: crash ~60s, backend restart ~1s, reboot ~105s SSH + 13s containers, Tor block ~5s.
-- [ ] **UPTIME-03** — Fix any issues discovered during uptime testing. This is a catch-all task for bugs found during UPTIME-01 and UPTIME-02. For each issue: diagnose root cause, implement fix, deploy to all servers, verify fix. Common expected issues: Tor connection timeouts (increase retry), DWN sync race conditions (add locks), federation state sync conflicts (last-writer-wins), memory growth over time (check for leaks in long-running tasks). **Acceptance**: All issues found during uptime testing are resolved. Rerun the failing scenario to confirm.
+- [x] **UPTIME-03** — Fix any issues discovered during uptime testing. Issues found and fixed: (1) Boot container recovery — containers didn't restart after clean reboot (fixed with `start_stopped_containers()` in UPTIME-02, 32/32 containers recovered). (2) Uptime monitor auth — system.stats RPC needed auth (fixed in UPTIME-01). (3) Tor hostname read permissions — hidden service dirs owned by debian-tor at 0700, fixed with tor-hostnames readable cache in INSTALL-03. No memory leaks detected (archipelago binary at 17.7MB after hours of runtime). Uptime at 99.5% over 415 checks (failures from intentional test reboots only).
 ### Sprint 49: Scale to 7 Nodes (August 2026 Week 4 — September 2026 Week 1)