Closes failure mode FM5 from docs/bulletproof-containers.md: the v1.7.38 + v1.7.39 rollouts left every affected node on an unreachable UI (nginx 500) with no recovery path short of SSH. This release adds a self-check guardrail to the update flow.

What changed:
- apply_update() writes a pending-verify marker with old+new version and a 150s deadline immediately before scheduling the service restart.
- verify_pending_update() runs from main.rs startup. If the marker is present and within its freshness window, the new binary waits 15s for nginx + backend to settle, then probes https://127.0.0.1/ every 5s for up to 90s (self-signed certs accepted).
- On any probe success within the window, the marker is cleared and nothing else happens.
- On window-exhaust, the new binary:
  1. Moves the broken /opt/archipelago/web-ui to web-ui.failed.<ts> (quarantined, not deleted, so we can post-mortem).
  2. Restores web-ui.bak on top of web-ui.
  3. Calls rollback_update() to restore the previous binary.
  4. Updates state.current_version to reflect the rollback.
  5. systemctl --no-block restart archipelago so the OLD binary boots.
- Markers older than 10 minutes are treated as stale and cleared without probing, so a crashed-during-startup marker from weeks ago cannot spontaneously roll back a healthy node on a later reboot.
- rollback_update() binary copy now goes through host_sudo instead of tokio::fs::copy, so it escapes the service's ProtectSystem=strict mount namespace. Without this, the rollback silently failed with EROFS on /usr/local/bin and orphaned the rollback - the exact opposite of what auto-rollback is for.

Tests: 4 new unit tests in update::tests covering marker round-trip, absent-marker noop, no-panic on verify_pending_update with nothing to verify, and an invariant assert that the 90s probe window stays below the 600s stale threshold. All passing.
Side fix: scripts/create-release-manifest.sh was dying with exit 141 (SIGPIPE from the tar tvzf | head | awk pipeline) under set -euo pipefail. Replaced with a single awk NR==1 that doesn't short-circuit the upstream pipe, so the release-build flow is idempotent again.
Bulletproof Containers for Beta
Status: plan agreed 2026-04-22, implementation started.
Target: zero-manual-intervention container lifecycle for the beta launch. A user installs, uninstalls, reboots, updates, or loses power — every combination must leave the node in a known-good state without SSH.
Project memory: ~/.claude/projects/-home-archipelago-Projects-archy/memory/project_reconcile_architecture.md
Failure log: ~/.claude/projects/-home-archipelago-Projects-archy/memory/feedback_container_lifecycle_failure_modes.md
Why we're doing this
The v1.7.38 and v1.7.39 rollouts on 2026-04-22 exposed a cluster of container-lifecycle failures that required manual SSH recovery on every affected node (.116, .198, .228, .253). If a user had been on those nodes, they'd have been stuck with "can't reach" or 500 errors and no path forward. We can't ship beta with this class of failure on the table.
The pattern under every failure: the canonical source of truth had the right answer, but derived state drifted away from it and nothing noticed or fixed it.
The six failure modes
| # | Symptom | Root cause |
|---|---|---|
| FM1 | archy-bitcoin-ui + archy-lnd-ui disappeared from podman ps -a after a daemon restart | Archipelago owns container creation imperatively; no owner recreates companions after a crash mid-transition |
| FM2 | ElectrumX "Daemon connection problem" | bitcoin.conf's rpcauth drifted from /var/lib/archipelago/secrets/bitcoin-rpc-password: config written once at install, never re-derived |
| FM3 | archipelago.service status=226/NAMESPACE crash-loop SIGKILL'd every child container | Containers were children of archipelago's cgroup; systemd teardown killed them. KillMode=control-group default |
| FM4 | host.containers.internal inside containers resolved to LAN gateway (192.168.1.254) | Known podman bug on bridge networks pre-5.3 (#22644) |
| FM5 | Nginx 500 fleet-wide after OTA | Tarball root dir was drwx------ (700), extracted identically on every node. Fixed in v1.7.40 at build time; still need post-OTA auto-rollback |
| FM6 | Rootless podman's libpod/bolt_state.db vanished → whole registry node unreachable | No detection of corrupt state; required manual rm -rf /run/user/$UID/libpod + podman system renumber |
Architecture decision
Adopt balena-style, level-triggered, desired-state reconciler built on Quadlet + sdnotify.
This is the one architecture that would have prevented all six failures, because each one is "reality drifted from the intended config and nothing noticed" — the exact problem reconcilers are designed for.
Why not the alternatives
- Keep imperative + patch per-failure: we've been doing this. Five releases in a day. Doesn't scale.
- Migrate to LXC (StartOS's path): 6-month project. Our investment in podman (`install.rs`, `docker_packages.rs`, `image_versions.rs`) is substantial. Quadlet gives us StartOS's isolation property without the migration.
- Ship k3s / MicroShift: 400-800 MB RAM baseline on top of bitcoind/electrs. Overkill for a home node OS.
- Edge-triggered like Umbrel: their `app.ts` has an explicit TODO admitting they don't handle failure events. We'd inherit the same bug class.
The four patterns (from mature players)
- Desired-state-first, level-triggered reconcile. balena-supervisor, Kubernetes operators, NixOS. A supervisor owns a manifest of what should run; on every tick it diffs against what is running and issues steps.
- Every container is its own systemd unit, not a child of the daemon. Red Hat's Quadlet pattern: a `.container` file is parsed by a systemd generator into a normal `.service`. The daemon can crash without taking any containers with it.
- sdnotify readiness + HealthCmd + rollback. Podman v3.4+ has real rollback: bad image fails health check, systemd considers service failed, Podman re-tags the previous image digest.
- Credentials and config derived from canonical secrets on every apply. Not trusted across upgrades; re-rendered idempotently from single source of truth.
Fix-per-failure
| Failure | Fix |
|---|---|
| FM1 | Move companions to Quadlet .container files in /etc/containers/systemd/. systemd (not archipelago) owns them |
| FM2 | reconcile::derived::render_bitcoin_conf(secrets) — pure function, runs every tick, atomic rewrite + HUP on drift |
| FM3 | KillMode=mixed in archipelago.service + containers in their own archipelago-apps.slice. Quadlet units already live outside archipelago's cgroup |
| FM4 | Ship /etc/containers/containers.conf with host_containers_internal_ip = "10.89.0.1" + default_rootless_network_cmd = "pasta"; also --add-host=host.archipelago:10.89.0.1 in every unit |
| FM5 | Post-OTA curl -k https://127.0.0.1/ health probe in new binary startup. If non-200 within 90s, rollback to web-ui.bak + binary-backup |
| FM6 | Startup probe: podman info with timeout. On "invalid internal status", clear /run/user/$UID/{containers,libpod,podman} + podman system renumber + reconcile tick rebuilds from Quadlet units |
New code layout (lands in v1.7.48)
core/archipelago/src/reconcile/
  mod.rs      run_reconcile_loop, reconcile_once (called from main.rs)
  desired.rs  DesiredState built from packages.json + catalog + secrets
  current.rs  snapshot via `systemctl list-units archy-*.service` + `podman ps -a --format json`
  diff.rs     pure: reconcile(desired, current) -> Vec<Step> (unit-testable without podman)
  apply.rs    step executor with timeouts, structured logs, backoff
  quadlet.rs  write `.container` / `.volume` / `.network` units atomically
  derived.rs  render_bitcoin_conf, render_containers_conf, render_nginx_app_routes
  backoff.rs  restart-history tracking (moved from health_monitor.rs)
Step types (idempotent)
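use std::path::PathBuf;

// Field types are illustrative; the plan only names the fields.
enum Step {
    WriteQuadletUnit(PathBuf, String), // (path, content)
    WriteDerivedFile(PathBuf, String), // (path, content)
    WriteSecret(PathBuf, String),      // (path, content)
    DaemonReload,
    EnsureStarted(String),             // systemd unit name
    StopUnit(String),                  // systemd unit name
    RestartUnit(String),               // systemd unit name
    PullImage(String),                 // image reference (`ref` is a Rust keyword)
}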
Triggers
- 30s interval tick
- install/uninstall RPC
- update-applied event
- explicit `/rpc/v1/reconcile.tick`
- podman event stream (if available)
Level-triggered + idempotent — every call considers full desired vs current diff. Missed ticks/events are irrelevant.
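A minimal sketch of what the pure diff in `diff.rs` could look like. The `Desired` and `Observed` types, the struct-style `WriteQuadletUnit` variant, and the ordering of steps are assumptions made for illustration, and only a subset of the step types is shown. The point is that the function looks at the whole desired and current state on every call, so running it twice against a converged node yields no steps.

```rust
use std::collections::BTreeMap;

/// What the manifest says a unit should be doing (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
enum Desired {
    Started { unit_content: String },
    UserStopped,
}

/// What was observed for one unit on this tick (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct Observed {
    unit_content: String,
    running: bool,
}

/// Simplified subset of the step types, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Step {
    WriteQuadletUnit { unit: String, content: String },
    DaemonReload,
    EnsureStarted(String),
    StopUnit(String),
    RestartUnit(String),
}

/// Pure diff: no I/O, so it can be unit-tested without podman or systemd.
fn reconcile(
    desired: &BTreeMap<String, Desired>,
    current: &BTreeMap<String, Observed>,
) -> Vec<Step> {
    let mut writes = Vec::new();
    let mut actions = Vec::new();

    for (unit, want) in desired {
        let seen = current.get(unit);
        match want {
            Desired::Started { unit_content } => {
                // Missing or different unit file counts as drift.
                let drifted = seen.map_or(true, |o| &o.unit_content != unit_content);
                if drifted {
                    writes.push(Step::WriteQuadletUnit {
                        unit: unit.clone(),
                        content: unit_content.clone(),
                    });
                }
                match seen {
                    Some(o) if o.running && drifted => {
                        actions.push(Step::RestartUnit(unit.clone()))
                    }
                    Some(o) if o.running => {}
                    _ => actions.push(Step::EnsureStarted(unit.clone())),
                }
            }
            Desired::UserStopped => {
                if seen.map_or(false, |o| o.running) {
                    actions.push(Step::StopUnit(unit.clone()));
                }
            }
        }
    }

    // Anything running that is no longer desired gets stopped.
    for (unit, o) in current {
        if o.running && !desired.contains_key(unit) {
            actions.push(Step::StopUnit(unit.clone()));
        }
    }

    // Unit files are written first, then one reload, then start/stop actions.
    let mut steps = writes;
    if !steps.is_empty() {
        steps.push(Step::DaemonReload);
    }
    steps.extend(actions);
    steps
}
```

Because the output depends only on the two snapshots, a missed tick or a lost event costs nothing: the next call produces the same steps.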
Edits to existing code
- `src/main.rs`: replace `tokio::spawn(crash_recovery::start_stopped_containers)` with `tokio::spawn(reconcile::run_reconcile_loop(state))`. Keep self-heal perms + PID-marker crash detection.
- `src/api/rpc/package/install.rs`: stop calling `podman run` directly. Writes desired state + Quadlet unit + signals reconciler. Reconciler does pull + `systemctl start`.
- `src/api/rpc/package/runtime.rs` + `lifecycle.rs` + `stacks.rs`: same pattern: mutate desired state, reconciler applies.
- `src/crash_recovery.rs`: keep PID-marker + snapshot. Delete `start_stopped_containers` (reconciler handles cold boot). Keep `user-stopped.json` as `AppSpec.desired_state: Started | UserStopped | Uninstalled`.
- `src/health_monitor.rs`: strip restart logic. Keep memory-leak detection; push unhealthy events as `Trigger::ContainerUnhealthy(name)`.
- `src/bitcoin_rpc.rs`: add `pub fn derive_rpcauth_line(user, pass) -> String` (HMAC-SHA256 per Bitcoin Core's `rpcauth.py`).
- `src/update.rs`: post-swap health probe + auto-rollback (v1.7.41).
Shipping order
Each release is independently deployable. Not a big-bang rewrite.
v1.7.41 — Post-OTA health probe + auto-rollback (closes FM5)
- In `update.rs`: write `/var/lib/archipelago/update-pending-verify.json` just before service restart, with `applied_at`, `new_version`, `previous_version`, deadline.
- In `main.rs` startup: read marker, spawn verification task. Wait 15s for full startup, then `curl -k https://127.0.0.1/` with retries up to 90s.
- On 200: delete marker.
- On non-200 after window: call `rollback_update(data_dir)` (already exists), restart service to boot the old binary.
- Smallest diff, highest ROI.
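The timing rules for the post-OTA check can be sketched as below. This is an illustration, not the shipped `update.rs`: `classify_marker`, `verify_with`, and the two enums are hypothetical names, and the probe and sleep are injected as closures so the schedule can be tested without a network or a real clock. The 15s settle delay, 5s probe interval, 90s window, and 10-minute stale threshold are the values stated for this release.

```rust
use std::time::Duration;

const SETTLE_DELAY: Duration = Duration::from_secs(15);
const PROBE_INTERVAL: Duration = Duration::from_secs(5);
const PROBE_WINDOW: Duration = Duration::from_secs(90);
const STALE_AFTER: Duration = Duration::from_secs(600);

#[derive(Debug, PartialEq)]
enum MarkerAction {
    Nothing,    // no marker on disk: normal boot
    ClearStale, // marker too old to trust: delete it without probing
    Probe,      // fresh marker: run the health probe
}

/// Decides what to do at startup from the age of the pending-verify marker.
fn classify_marker(marker_age: Option<Duration>) -> MarkerAction {
    match marker_age {
        None => MarkerAction::Nothing,
        Some(age) if age > STALE_AFTER => MarkerAction::ClearStale,
        Some(_) => MarkerAction::Probe,
    }
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Healthy,
    RollBack,
}

/// Runs the probe schedule. `probe` returns true on an HTTP 200;
/// `sleep` is injected so tests can record delays instead of waiting.
fn verify_with<P, S>(mut probe: P, mut sleep: S) -> Verdict
where
    P: FnMut() -> bool,
    S: FnMut(Duration),
{
    sleep(SETTLE_DELAY);
    let mut elapsed = Duration::ZERO;
    loop {
        if probe() {
            return Verdict::Healthy;
        }
        // Stop once another interval would exceed the probe window.
        if elapsed + PROBE_INTERVAL > PROBE_WINDOW {
            return Verdict::RollBack;
        }
        sleep(PROBE_INTERVAL);
        elapsed += PROBE_INTERVAL;
    }
}
```

In this sketch a node that never answers is probed 19 times (at 0s, 5s, ..., 90s after the settle delay) before the rollback verdict. The settle delay plus the window stays well under the stale threshold, which is the invariant the unit tests are described as asserting.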
v1.7.42 — containers.conf + host.archipelago alias (closes FM4)
- Idempotent write of `/etc/containers/containers.conf` on startup (archipelago compares hash, rewrites only on drift).
- Add `--add-host=host.archipelago:10.89.0.1` to every generated container in `install.rs` / `docker_packages.rs`.
- ElectrumX `DAEMON_URL` migrates from `host.containers.internal` → `host.archipelago`.
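A rough sketch of the rendering side of this release. The function names are hypothetical, and the placement of the two keys under `[containers]` and `[network]` is an assumption to check against containers.conf(5) for the podman version being shipped. The drift check here compares content directly, a simplification of the hash comparison mentioned above.

```rust
const HOST_ALIAS: &str = "host.archipelago";
const HOST_GATEWAY_IP: &str = "10.89.0.1";

/// Renders the desired containers.conf. Table placement of each key is an
/// assumption here and should be verified against containers.conf(5).
fn render_containers_conf() -> String {
    format!(
        "[containers]\nhost_containers_internal_ip = \"{}\"\n\n[network]\ndefault_rootless_network_cmd = \"pasta\"\n",
        HOST_GATEWAY_IP
    )
}

/// The extra argument added to every generated container.
fn add_host_arg() -> String {
    format!("--add-host={}:{}", HOST_ALIAS, HOST_GATEWAY_IP)
}

/// True when the file is missing or differs from the desired content.
fn needs_rewrite(on_disk: Option<&str>, desired: &str) -> bool {
    on_disk != Some(desired)
}
```

Keeping the render step pure means the startup path only has to read the file, call `needs_rewrite`, and write when it returns true.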
v1.7.43 — reconcile::derived for bitcoin.conf / lnd.conf (closes FM2)
- Pure function `render_bitcoin_conf(secrets) -> String`.
- Tick every 30s: read secret, derive `rpcauth`, compare to on-disk, atomic rewrite (via `tempfile::NamedTempFile::persist`) + `podman exec ... kill -HUP 1` on drift.
- Same pattern for `lnd.conf`.
- First user of the eventual `reconcile::` module: ships the `derived.rs` piece early.
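The compare-then-rewrite step could look roughly like this. It is a standard-library-only sketch: it swaps `tempfile::NamedTempFile::persist` for a sibling temp file plus `rename`, takes the `rpcauth` line as an already-derived input (the HMAC-SHA256 derivation is out of scope here), and uses hypothetical names (`BitcoinSecrets`, `apply_derived_file`). The single `server=1` setting stands in for whatever the real config carries.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical input: the rpcauth line is assumed to be derived elsewhere
/// from the canonical secret.
struct BitcoinSecrets {
    rpcauth_line: String,
}

/// Pure render: same secrets in, same bytes out.
fn render_bitcoin_conf(secrets: &BitcoinSecrets) -> String {
    format!("server=1\n{}\n", secrets.rpcauth_line)
}

/// Rewrites `path` only when its content differs from `desired`.
/// Returns Ok(true) when a rewrite happened, so the caller knows to
/// signal the daemon.
fn apply_derived_file(path: &Path, desired: &str) -> io::Result<bool> {
    match fs::read_to_string(path) {
        Ok(existing) if existing.as_str() == desired => return Ok(false),
        Ok(_) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }
    // Write a sibling temp file, flush it to disk, then rename over the
    // target so readers never see a half-written config.
    let tmp = path.with_extension("conf.tmp");
    let mut file = fs::File::create(&tmp)?;
    file.write_all(desired.as_bytes())?;
    file.sync_all()?;
    fs::rename(&tmp, path)?;
    Ok(true)
}
```

A tick would call `apply_derived_file` and send the HUP only when it returns `Ok(true)`, which keeps a converged node from restarting anything.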
v1.7.44 — Podman state self-heal on startup (closes FM6)
- Startup probe: `podman info --format '{{.Host.OS}}'` with 10s timeout.
- On "invalid internal status" or similar:
  - `systemctl --user stop podman.socket podman.service`
  - `rm -rf /run/user/$UID/{containers,libpod,podman}`
  - `podman system renumber`
  - Trigger reconcile tick (will rebuild containers from their source of truth)
- Surface clear error on `/health` if recovery fails; don't silently serve 502.
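One way to keep the self-heal testable is to separate classification and planning from execution, as sketched below. `PodmanProbe`, `classify_podman_info`, and `recovery_plan` are hypothetical names. The sketch matches only the literal "invalid internal status" text, so the "or similar" cases would need their own patterns, and the shell brace expansion is written out as three explicit paths because no shell is involved.

```rust
#[derive(Debug, PartialEq)]
enum PodmanProbe {
    Healthy,
    CorruptState,
    OtherFailure(String),
}

/// Classifies the result of the startup `podman info` probe.
fn classify_podman_info(exit_ok: bool, stderr: &str) -> PodmanProbe {
    if exit_ok {
        PodmanProbe::Healthy
    } else if stderr.contains("invalid internal status") {
        PodmanProbe::CorruptState
    } else {
        PodmanProbe::OtherFailure(stderr.trim().to_string())
    }
}

fn argv(parts: &[&str]) -> Vec<String> {
    parts.iter().map(|p| p.to_string()).collect()
}

/// Builds the recovery commands as argv vectors, in execution order.
/// A separate executor would run them with timeouts.
fn recovery_plan(uid: u32) -> Vec<Vec<String>> {
    let run_dir = format!("/run/user/{}", uid);
    vec![
        argv(&["systemctl", "--user", "stop", "podman.socket", "podman.service"]),
        vec![
            "rm".to_string(),
            "-rf".to_string(),
            format!("{}/containers", run_dir),
            format!("{}/libpod", run_dir),
            format!("{}/podman", run_dir),
        ],
        argv(&["podman", "system", "renumber"]),
    ]
}
```

After the plan runs, the caller would trigger a reconcile tick; an `OtherFailure` is the case to surface on `/health` rather than retry blindly.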
v1.7.45–47 — Quadlet migration per companion (closes FM1 + FM3)
One companion per release so regressions have a narrow blame window:
- v1.7.45: `archy-bitcoin-ui` → Quadlet `.container` unit
- v1.7.46: `archy-lnd-ui` → Quadlet
- v1.7.47: `archy-electrs-ui` → Quadlet
Each:
- Write `.container` file to `/etc/containers/systemd/<name>.container`
- `systemctl daemon-reload`
- `systemctl enable --now <name>.service`
- Remove the `podman run` path from `install.rs` for that name
- Add Goss probe for the lifecycle test matrix
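A sketch of how a companion's `.container` file might be rendered. `CompanionSpec`, the image reference, the published port, `Restart=always`, and `WantedBy=multi-user.target` are all assumptions for illustration. The `ContainerName=`, `Image=`, `AddHost=`, and `PublishPort=` keys are documented in podman-systemd.unit(5), but their availability should be checked against the minimum podman version.

```rust
use std::path::PathBuf;

/// Hypothetical description of one companion container.
struct CompanionSpec {
    name: String,
    image: String,
    publish_port: Option<String>,
}

/// Renders the Quadlet unit text for one companion.
fn render_container_unit(spec: &CompanionSpec) -> String {
    let mut out = String::new();
    out.push_str("[Unit]\n");
    out.push_str(&format!("Description={} (managed by archipelago)\n\n", spec.name));
    out.push_str("[Container]\n");
    out.push_str(&format!("ContainerName={}\n", spec.name));
    out.push_str(&format!("Image={}\n", spec.image));
    out.push_str("AddHost=host.archipelago:10.89.0.1\n");
    if let Some(port) = &spec.publish_port {
        out.push_str(&format!("PublishPort={}\n", port));
    }
    // Restart policy and install target are illustrative choices.
    out.push_str("\n[Service]\nRestart=always\n\n");
    out.push_str("[Install]\nWantedBy=multi-user.target\n");
    out
}

/// Where the unit file lands; systemd's generator picks it up on daemon-reload.
fn unit_path(name: &str) -> PathBuf {
    PathBuf::from("/etc/containers/systemd").join(format!("{}.container", name))
}
```

One thing to verify when wiring this up: podman-systemd.unit(5) describes generated units as being enabled through the `[Install]` section of the `.container` file rather than through `systemctl enable`, so the enable step in the list above may need to become a plain `systemctl start`.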
v1.7.48+ — Full reconcile module
- `core/archipelago/src/reconcile/` replaces imperative `install.rs` container management.
- Main app containers (bitcoin-knots, bitcoin-core, lnd, electrumx, btcpay-server, mempool, fedimint) become Quadlet units.
- `install.rs` shrinks to ~300 lines of "write desired state, poke reconciler."
- Biggest diff, lands last.
Test harness (parallel track)
Stack
- Outer runner: `bats-core`, TAP-style bash testing, readable by anyone
- Verifier: `goss`, YAML assertions on ports, processes, HTTP endpoints, files. Reused by CI + live probe
- Chaos layer: Chaos Toolkit JSON experiments (steady-state-hypothesis → method → rollback → verify)
- VM layer: `vmtest` (Go) for reboot-survival + ISO-boot tests, or raw QEMU+SSH
- Tor probe: curl through archipelago's own tor SOCKS5 (`--socks5-hostname 127.0.0.1:9050`), 60-180s retry window
- Live probe: small Rust agent on every fleet node, ships same Goss YAMLs to Prometheus. Neither Umbrel nor StartOS has this, which makes it a real differentiator.
- Reproducibility: btrfs subvolume snapshots primary (fast), QEMU qcow2 for ISO/kernel-level repro
Directory layout
tests/lifecycle/
  bats/
    _helpers.bash                      # install_app, wait_healthy, assert_no_orphans
    00_bootstrap.bats
    10_install.bats                    # per-app install
    20_ui_reachable.bats               # direct port + HTTPS proxy + iframe
    30_tor_reachable.bats              # .onion probe
    40_stop_start.bats
    50_restart.bats
    60_reboot.bats                     # vmtest-driven
    70_reinstall.bats                  # idempotence + data preservation
    80_uninstall.bats                  # leak check
    90_soak.bats                       # 2-6h hold, periodic probe
  goss/
    bitcoin-knots.yaml
    bitcoin-core.yaml
    lnd.yaml
    electrumx.yaml
    btcpay-server.yaml
    mempool.yaml
    fedimint.yaml
  chaos/
    kill9_archipelago_mid_install.json
    wipe_bolt_db.json
    kill9_bitcoind.json
    reboot_during_ota.json
    corrupt_bitcoin_conf.json
    systemctl_restart_mid_install.json
    fill_disk_99_percent.json
    kill_tor.json
    delete_nginx_snippet.json
    clock_jump_30min.json
  vm/
    iso_boot_smoke.go
    reboot_survival.go
  ci/
    vm_runner.sh
    collect_artifacts.sh
  probe/archy-probe/                   # Rust bin, reuses goss YAMLs, ships to fleet
  Makefile                             # `make beta-matrix`, `make chaos`, `make soak`
Minimum beta matrix
7 apps × 9 lifecycle events × 10 chaos scenarios. Pass = every MUST-ship cell green on fresh rootless-podman single-node CI.
| Case \ App | knots | core | lnd | electrumx | btcpay | mempool | fedimint |
|---|---|---|---|---|---|---|---|
| Fresh install | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| UI direct port | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| UI HTTPS proxy | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| UI iframe | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Tor .onion reachable | ✓ | ✓ | ✓ | — | ✓ | ✓ | ✓ |
| Stop → ports released | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Restart → integrations | — | — | ✓↔btc | ✓↔btc | ✓↔btc,lnd | ✓↔electrs | — |
| Reboot survival | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Reinstall idempotent | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Uninstall no orphans | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 6h soak | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Harness scaffold lands in v1.7.41. First lifecycle tests blocking v1.7.45. Full matrix + chaos suite blocking beta tag.
Chaos scenarios (10)
Ordered by likelihood × severity:
- `kill -9 archipelagod` mid-install → systemd restart, in-flight install resumes or cleanly rolls back
- `rm bolt_state.db` while service stopped → restart regenerates, no data loss in named volumes
- `systemctl restart archipelago` mid-install → no orphans, no half-state
- Reboot mid-OTA → old version intact OR new version active, never half
- Corrupt `bitcoin.conf` → container restart-loops; UI surfaces banner; reconcile re-derives; other apps unaffected
- Fill `/var` to 99% → graceful degradation, disk-pressure report
- Revoke rootless-netns → self-heal within Tor descriptor window
- `pkill -9 tor` → supervisor restarts; onions reachable within 3–5 min
- Delete nginx conf snippet → reconciler rewrites or `archipelago doctor` flags drift
- Clock jump +30min → daemons survive; Tor recovers
Decision log
| Decision | Answer | Rationale |
|---|---|---|
| Scope | 6+ incremental releases, not big-bang rewrite | Each closes one failure class, narrow blame window |
| Quadlet migration | Yes | Isolation from daemon crashes, systemd-native recovery, free from Red Hat's production patterns. Minimum podman version becomes 4.4+ (fine for modern Debian) |
| Live probe to Prometheus | Yes, part of beta | Genuine differentiator — neither Umbrel nor StartOS has this. Adds Grafana dep |
| Test gating | Scaffold in v1.7.41, first tests blocking v1.7.45, full matrix blocking beta tag | Gradual rather than all-or-nothing |
Key sources
Architecture
- Umbrel app.ts — edge-triggered, TODO on failure handling
- StartOS repo, v0.4 podman→LXC announce
- balena-supervisor repo, Supervisor API
- Quadlet: Dan Walsh 2023 blog, podman-systemd.unit(5)
- Podman rollback: auto-update blog, podman-auto-update(1)
- Kubernetes operator pattern: Kubebuilder reconcile, good practices
- NixOS containers: wiki
Known bugs & references
- `host.containers.internal` → LAN: podman #22644, #23782
- `bolt_state.db` recovery: podman #17730, staticdir mismatch #20872
- aardvark-dns flakiness: #20396, #22407
- systemd 226/NAMESPACE: Arch forum, systemd #29526
- systemd CGROUP_DELEGATION, systemd.kill(5)
Test harness prior art
- Umbrel ci.yml — Vitest + qemu matrix fan-out
- YunoHost package_check — closest analog, scored per-app lifecycle harness on LXC
- bats-core
- Goss, dgoss
- Chaos Toolkit
- vmtest (Go)
Tor
- rend-spec-v3: descriptor lifetime + republish cadence
- stem: Python Tor controller for `HS_DESC UPLOADED` waits
To resume
- Read project memory: `~/.claude/projects/-home-archipelago-Projects-archy/memory/project_reconcile_architecture.md`
- Read failure-mode memory: `~/.claude/projects/-home-archipelago-Projects-archy/memory/feedback_container_lifecycle_failure_modes.md`
- Check task list for current release (should start with v1.7.41)
- Current state on fleet as of 2026-04-22:
  - All 4 mirrors (tx1138, gitea-local, .160, .168) synced to v1.7.40-alpha
  - .116, .198, .228, .253 healed manually via `systemd-run chmod 755 /opt/archipelago/web-ui`
  - .228 still has stale `bitcoin.conf` rpcauth (regenerated during triage; will drift again until v1.7.43)
  - .228 UI companions (archy-bitcoin-ui, archy-lnd-ui) keep vanishing (Quadlet migration in v1.7.45+ fixes)
  - .160 Gitea required `podman system renumber` recovery (v1.7.44 automates this)
- Implementation is in progress on `main` branch; next edit is `core/archipelago/src/update.rs` for v1.7.41.