archy/docs/STATUS.md
archipelago 4b8ef0a098 docs: STATUS.md through Step 9 (.228 hot-swap verified)
Logs Step 9 acceptance evidence, the two bugs caught and fixed during
the hot-swap (parse_memory_limit IEC suffix bug in 732df1b8 and
cgroup Delegate in ba83f9bc), and outlines the Step 10 plan for .116.
2026-04-23 03:46:23 -04:00


RESUME HERE — Rust orchestrator migration

Updated: 2026-04-23 (Step 9 complete on .228, Step 10 next)

To resume this work, SSH into the ThinkPad and run opencode from ~/Projects/archy/. Or work from the laptop via the SSHFS mount at ~/mnt/archy-thinkpad/.

Where we are

Working through the 11-step plan in rust-orchestrator-migration.md.

  • Step 1 (3767c267): ContainerConfig schema with build:, ResolvedSource enum, resolve(), 10 tests
  • Step 2 (34af4d9d): ContainerRuntime trait gained image_exists + build_image, 4 argv tests, 25/25 pass
  • Step 3 (b6a04d31): ProdContainerOrchestrator (999 LOC), 16 tests all pass, not yet wired to main.rs
  • Step 4 (e8a59c93): ContainerOrchestrator trait, RpcHandler uses it in prod (+ 13858842 chore gitignore ._*)
  • Step 5 (fc39b04b): BootReconciler with Arc shutdown, 4 paused-time tests pass
  • Step 6 (48f08aa3): main.rs wire-up (orchestrator construction + adopt_existing + BootReconciler spawn + shutdown Notify)
  • Step 7 (069bc4a5): bitcoin-ui pre-start hook + embedded nginx.conf template (8 unit tests + 1 integration test), 39/39 container:: tests pass
  • Step 8a (a0707f4d): retire archipelago-reconcile.{service,timer} + ISO builder touchpoints, keep scripts for update.rs
  • Step 9: Hot-swap on .228 verified. All three UIs (bitcoin-ui/lnd-ui/electrs-ui) installing + serving HTTP 200. Adoption + reconciler + pre-start hook + dependency ordering all working under the prod code path. See "Step 9 evidence" below.
  • Step 8b — Port remaining ~25 container creations from first-boot-containers.sh into apps/<id>/manifest.yml, then port update.rs to orchestrator (deferred, multi-day work)
  • Step 8c — Rename first-boot-containers.sh → first-boot-setup.sh, strip container ops, keep setup. Delete reconcile-containers.sh + container-specs.sh. Add ISO lines to copy apps/ (final one-way door, requires 8b complete)
  • Step 10 — Hot-swap + verify on .116 (adoption-heavy test — .116 already has all containers running)
  • Step 11 — Chaos matrix on both nodes

Step 9 evidence (.228, 2026-04-23)

  • Binary: fix: parse_memory_limit accepts Ki/Mi/Gi IEC binary suffixes (732df1b8) + feat(systemd): delegate cgroup controllers (ba83f9bc), built on .116, scp'd to .228 as /usr/local/bin/archipelago. Old binary backed up at /usr/local/bin/archipelago.bak-pre-step9.
  • DEV_MODE override disabled (override.conf → override.conf.disabled-pre-step9).
  • /opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml populated.
  • /opt/archipelago/docker/bitcoin-ui/Dockerfile replaced with the Step 7 version (no COPY nginx.conf). Old dir backed up as bitcoin-ui.bak-pre-step9.
  • Post-start snapshot:
    • 🔗 Adopted 1 existing container(s): ["electrs-ui"] — adoption of 13h-running container worked without recreation
    • 🔄 Boot reconciler started (interval: 30s) — every 30s, all three app_ids reach NoOp after the initial install pass
    • bitcoin-ui nginx.conf rendered path=/var/lib/archipelago/bitcoin-ui/nginx.conf auth_hash=97af1c18 — pre-start hook fires in install_fresh
    • curl localhost:8334 → HTTP 200 (bitcoin-ui), :8081 → 200 (lnd-ui), :50002 → 200 (electrs-ui)
    • OCI memory limits correctly applied: bitcoin-ui=128Mi, electrs-ui=128Mi, lnd-ui=64Mi (was emitted as 0 pre-fix)
  • bitcoin-core / filebrowser / lnd / electrumx continue running untouched (prod orchestrator currently only manages apps it has manifests for; Step 8b expands that scope).
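
The "every 30s, all three app_ids reach NoOp" behaviour above is what a level-triggered reconcile pass looks like once the observed state matches the desired state. A minimal sketch, assuming hypothetical `Observed` and `Action` enums and a hypothetical `plan` function (none of these names are confirmed to exist in the orchestrator; only NoOp and install_fresh appear in the notes above):

```rust
/// Hypothetical observed state of one managed container.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Observed {
    Missing,
    Stopped,
    Running,
}

/// Hypothetical action chosen for one app on one reconcile pass.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    InstallFresh,
    Start,
    NoOp,
}

/// Decide from the current state only, never from past events.
fn plan(observed: Observed) -> Action {
    match observed {
        Observed::Missing => Action::InstallFresh,
        Observed::Stopped => Action::Start,
        Observed::Running => Action::NoOp,
    }
}

/// One pass over every app the orchestrator has a manifest for.
fn reconcile_pass(apps: &[(&str, Observed)]) -> Vec<(String, Action)> {
    apps.iter()
        .map(|(id, obs)| ((*id).to_string(), plan(*obs)))
        .collect()
}
```

Because the decision depends only on what is observed now, a second pass over already-running containers yields NoOp for every app_id, which matches the journal evidence.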

Two bugs found & fixed during Step 9

  1. parse_memory_limit truncation bug (732df1b8): lowercased "128Mi" → "128mi" → trim_end_matches('m') → "128i" → f64 parse fails → None.unwrap_or(0) → OCI memory.limit:0 → systemd rejects MemoryMax=0 at container start. Every manifest in apps/ uses IEC suffixes so every ProdContainerOrchestrator install was DOA. Now handles Ki/Mi/Gi/Ti + SI decimal + shorthand + raw bytes; 6 regression tests. Also changed create_container to OMIT the memory/cpu fields on absent/unparseable input rather than emitting 0.
  2. archipelago.service cgroup delegation missing (ba83f9bc): not the root cause of Step 9 failures (bug #1 was), but belt-and-braces so future code paths using the --memory CLI arg (runtime.rs DockerRuntime) don't hit the same systemd rejection on hosts where podman needs to create libpod scopes inside a non-delegated system-slice subtree. Added Delegate=memory pids cpu io.
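
For reference, bug #1 comes from trimming single characters off a lowercased string. A sketch of a parser that matches the whole suffix instead is below. It is not the code from 732df1b8; the signature, the treatment of plain k/m/g as SI decimal, and the choice to return None for a zero result are assumptions made for illustration.

```rust
/// Parse a memory limit such as "128Mi", "1G", "512k", or "1048576" into bytes.
/// Returns None for absent or unparseable input so the caller can omit the field
/// rather than emit 0.
fn parse_memory_limit(input: &str) -> Option<u64> {
    let s = input.trim();
    // Split at the first character that is not part of the number.
    let split = s
        .find(|c: char| !(c.is_ascii_digit() || c == '.'))
        .unwrap_or(s.len());
    let (number, suffix) = s.split_at(split);
    let value: f64 = number.parse().ok()?;

    // Match the whole suffix, so "Mi" can never degrade into a stray "i".
    // Assumption: plain k/m/g/t are SI decimal; the real code may differ.
    let multiplier: f64 = match suffix.trim().to_ascii_lowercase().as_str() {
        "" | "b" => 1.0,
        "k" | "kb" => 1e3,
        "m" | "mb" => 1e6,
        "g" | "gb" => 1e9,
        "t" | "tb" => 1e12,
        "ki" | "kib" => 1024.0,
        "mi" | "mib" => 1024.0 * 1024.0,
        "gi" | "gib" => 1024.0 * 1024.0 * 1024.0,
        "ti" | "tib" => 1024.0_f64.powi(4),
        _ => return None,
    };

    let bytes = value * multiplier;
    // Assumption: a limit below one byte is treated as "no limit given".
    if bytes.is_finite() && bytes >= 1.0 {
        Some(bytes as u64)
    } else {
        None
    }
}
```

With this shape, "128Mi" parses to 134217728 bytes, and the truncated "128i" that the old code produced internally is rejected outright instead of falling through to 0.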

Commits made this session

ba83f9bc feat(systemd): delegate cgroup controllers to archipelago.service
732df1b8 fix: parse_memory_limit accepts Ki/Mi/Gi IEC binary suffixes
a0707f4d refactor: retire archipelago-reconcile.{service,timer}  (Step 8a)
1c81a739 docs: split Step 8 into 8a/8b/8c
6e46932f docs: STATUS.md through Step 7
069bc4a5 feat: bitcoin-ui pre-start hook (Step 7)

Branch is 17 commits ahead of tx1138/main (local only — user pushes to mirrors personally).

Uncommitted state

Clean. Only untracked: tests/ (bats harness from prior session, not in scope).

Answered design questions (no need to re-ask)

  1. UI container naming → archy-<app_id> for UIs only; existing bitcoin-knots/lnd/electrumx keep bare names
  2. BITCOIN_RPC_AUTH injection → runtime bind-mount of nginx.conf (no build-args, no envsubst)
  3. Reconciler interval → 30 seconds
  4. Concurrency → per-app Mutex<()> in a DashMap
  5. Bash scripts → split into 8a/8b/8c; 8a done, 8b/8c deferred
  6. Step 4 extension → ContainerOrchestrator trait includes install(app_id); the manifest_path-based install RPC stays dev-only
  7. Step 7 bitcoin-ui template → embed via include_str!, render on install + every reconcile, atomic tmp+rename to /var/lib/archipelago/bitcoin-ui/nginx.conf, bind-mount into container. RPC user hardcoded archipelago, password from /var/lib/archipelago/secrets/bitcoin-rpc-password.
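
The "atomic tmp+rename" in answer 7 can be sketched with the standard library alone. The helper name `write_atomic` and the ".tmp" sibling suffix are assumptions for illustration, not the names used in the Step 7 commit.

```rust
use std::fs;
use std::io::Write;
use std::path::{Path, PathBuf};

/// Write `contents` to `dest` by writing a sibling temp file, syncing it,
/// and renaming it over the destination. A reader (here, nginx inside the
/// container) sees either the old file or the new one, never a partial write.
fn write_atomic(dest: &Path, contents: &str) -> std::io::Result<()> {
    // Sibling path in the same directory, so the rename stays on one filesystem.
    let mut tmp_name = dest.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(contents.as_bytes())?;
        file.sync_all()?; // flush to disk before the rename makes it visible
    }
    fs::rename(&tmp, dest)
}
```

Rendering on every reconcile is cheap with this approach: rewriting identical contents simply replaces the file with an equal one.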

Context: which host is what

| Host | IP | Role | Dashboard pw | Sudo pw |
|---|---|---|---|---|
| archy | 192.168.1.116 | Dev ThinkPad (Lenovo X250, Debian 13). Currently running v1.7.42-alpha (DEV_MODE). Step 10 target. | archipelago | ThisIsWeb54321@ |
| archy228 | 192.168.1.228 | Kiosk HP ProDesk. Step 9 landing zone, now running Rust-orchestrator binary in prod mode. | password123 | archipelago |

Both are development alpha nodes — full destructive latitude, no need to ask before stop/start/rebuild.

Next action

Step 10 — Hot-swap on .116.

Unlike .228 (which tested the INSTALL path for net-new UI containers), .116 tests the ADOPTION path: it already has all three UIs and all backend containers running from prior v1.7.42-alpha runs. We want to verify the new prod orchestrator adopts every existing container without recreating or restarting them.

Steps:

  1. Disable DEV_MODE on .116 (check if override.conf exists — /etc/systemd/system/archipelago.service.d/)
  2. Stage the already-built binary at ~/Projects/archy/core/target/release/archipelago → /usr/local/bin/archipelago.new
  3. Ensure /opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml present (copy from repo)
  4. Ensure /opt/archipelago/docker/bitcoin-ui/ matches the Step-7 layout (no baked nginx.conf)
  5. Snapshot: podman ps -a --format "{{.Names}}\t{{.Status}}\t{{.CreatedAt}}" → save to /tmp/pre-step10-containers.txt
  6. systemctl stop archipelago → install binary → systemctl start archipelago
  7. Verify in journal: every running container appears in "Adopted N existing container(s)"; no container was recreated; all HTTP smokes still 200; BootReconciler reaches NoOp on every app_id after one pass.
  8. If broken → restore .bak binary, re-enable DEV_MODE override.
  9. Commit STATUS.md update.
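
The "no container was recreated" check in step 7 can be made mechanical by diffing the step 5 snapshot against a second snapshot taken after the swap. A sketch, assuming both files hold the tab-separated output of the podman ps format string above; the function names are hypothetical, and CreatedAt is compared as an opaque string because its exact format depends on the podman version.

```rust
use std::collections::BTreeMap;

/// Parse "name<TAB>status<TAB>created_at" lines into name -> created_at.
fn parse_snapshot(text: &str) -> BTreeMap<String, String> {
    text.lines()
        .filter_map(|line| {
            let mut cols = line.split('\t');
            let name = cols.next()?.trim();
            let _status = cols.next()?; // uptime text changes naturally, so ignore it
            let created = cols.next()?.trim();
            if name.is_empty() {
                None
            } else {
                Some((name.to_string(), created.to_string()))
            }
        })
        .collect()
}

/// Containers from the pre-swap snapshot that vanished or whose CreatedAt changed.
fn recreated_or_missing(pre: &str, post: &str) -> Vec<String> {
    let before = parse_snapshot(pre);
    let after = parse_snapshot(post);
    before
        .iter()
        .filter(|(name, created)| after.get(*name) != Some(*created))
        .map(|(name, _)| name.clone())
        .collect()
}
```

An empty result means every pre-existing container kept its creation timestamp. This only detects recreation; a plain restart keeps CreatedAt unchanged, so restarts still need to be checked in the journal.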

Risk on .116: If adoption fails mid-flight, we'd lose the running v1.7.42 backend that I'm currently typing at. Keep a second SSH session open to the ThinkPad for emergency revert. The backup plan is install /usr/local/bin/archipelago.bak /usr/local/bin/archipelago && systemctl restart archipelago.

After Step 10 we are blocked on Step 8b (multi-day manifest ports) before Step 11 (chaos matrix).


Why Step 8 got split (discovered 2026-04-23)

Original plan was one commit "delete bash + edit ISO builder". But on investigation:

  • first-boot-containers.sh creates 30+ containers with per-container logic (wallets, DB init, rpcauth derivations, post-create health waits). The repo only has manifests for 3 (bitcoin-ui, electrs-ui, lnd-ui from Step 7). Deleting bash now = brick first-boot on fresh installs.
  • Script also does non-container setup: secret generation (RPC pw, DB pw, FileBrowser admin pw), UID-mapping chowns for rootless podman subuid, Tor hostnames dir, WireGuard, firewall rules, nostr-relay dir. None of this lives in the Rust orchestrator.
  • update.rs (OTA update RPC) invokes reconcile-containers.sh at two sites. Deleting the script breaks package updates. Porting those call sites to the orchestrator needs all containers to have manifests.
  • Design doc §505 updated to split 8 → 8a/8b/8c. Only 8a (delete the reconcile systemd unit + timer, BootReconciler covers) is safe to execute before we port manifests.

Archipelago — Current State, Plan, and Releases

Updated: 2026-04-22

This is the "pick this up tomorrow" page. One-stop summary of where we are, what the plan is, and what's shipped. Detailed plan lives in bulletproof-containers.md.


Current state

Fleet status

All four Gitea mirrors are synced to v1.7.40-alpha:

| Mirror | Host | Status |
|---|---|---|
| tx1138 | https://git.tx1138.com | v1.7.40-alpha live |
| gitea-local | http://localhost:3000 | v1.7.40-alpha live |
| .160 | http://23.182.128.160:3000 | v1.7.40-alpha live (Gitea recovered via podman system renumber; see below) |
| .168 | http://146.59.87.168:3000 | v1.7.40-alpha live |

Fleet test nodes:

| Node | Version | State |
|---|---|---|
| .103 (dev) | 1.7.40 | running, being developed against |
| .116 (this box) | 1.7.40 | healed manually via systemd-run chmod 755 /opt/archipelago/web-ui after v1.7.38/39 bug |
| .198 | 1.7.39 → 1.7.40-alpha | healed manually |
| .228 (primary test) | 1.7.40-alpha | healed manually; bitcoin-core + lnd + electrumx running; UI companions currently missing; bitcoin.conf rpcauth patched live |
| .249 (ISO test) | | unreachable today |
| .253 | 1.7.39 → 1.7.40-alpha | healed manually |

Known open issues (drives the plan below)

  1. UI companion containers disappear on .228 after daemon restarts, with no auto-recreate (planned fix: v1.7.45 Quadlet migration)
  2. bitcoin.conf rpcauth drifts from canonical secret → ElectrumX "Daemon connection problem" (planned fix: v1.7.43 reconcile::derived)
  3. host.containers.internal resolves to LAN gateway inside containers on some versions (planned fix: v1.7.42 containers.conf)
  4. Podman state DB loss requires manual recovery (planned fix: v1.7.44 startup self-heal)
  5. LND "Connect Wallet" info vanishing after crashes — symptom of the same drift class as #2
  6. ElectrumX not syncing on .228 — downstream of #2; will resolve when bitcoin.conf is reconciled

Recent field incident (2026-04-22)

  • Shipped v1.7.38 + v1.7.39, both broke nginx fleet-wide because the frontend tarball's root dir was drwx------ (700). Every node that OTA'd got 500 errors on every page.
  • Root-cause fix shipped in v1.7.40 (create-release-manifest.sh chmod + pre-ship assertion that tar tvzf | head -1 shows drwxr-xr-x).
  • .160 Gitea was down all day (502) because its rootless podman's libpod/bolt_state.db had vanished. Recovered via clearing /run/user/$UID/{containers,libpod,podman} + podman system renumber.
  • Full failure-mode audit is in bulletproof-containers.md.

Plan

We're shipping a level-triggered reconciler + Quadlet architecture over six incremental releases. Each release closes one failure mode. See bulletproof-containers.md for the full design, code layout, test harness, chaos matrix, sources.

Release roadmap

| Release | Closes | What lands | Status |
|---|---|---|---|
| v1.7.41 | FM5 (bad OTA nginx 500) | Post-OTA auto-rollback. New binary probes https://127.0.0.1/ on boot; if non-200 within 90s, restores web-ui.bak + calls rollback_update() + restarts | in flight, deploying to .228 for test |
| v1.7.42 | FM4 (host.containers.internal wrong) | /etc/containers/containers.conf w/ host_containers_internal_ip = 10.89.0.1; every container gets --add-host=host.archipelago:10.89.0.1 | pending |
| v1.7.43 | FM2 (config drift) | reconcile::derived::render_bitcoin_conf: pure fn over canonical secret, rewrites on drift. Same for lnd.conf | pending |
| v1.7.44 | FM6 (podman state loss) | Startup probe detects broken podman state, auto-recovers via /run/user/$UID/* clear + system renumber | pending |
| v1.7.45 | FM1 + FM3 (companion orphans) | archy-bitcoin-ui → Quadlet .container unit in /etc/containers/systemd/. systemd (not archipelago) owns it | pending |
| v1.7.46 | | archy-lnd-ui → Quadlet | pending |
| v1.7.47 | | archy-electrs-ui → Quadlet | pending |
| v1.7.48+ | all (full daemon refactor) | core/archipelago/src/reconcile/ module replaces imperative install.rs container management. Main app containers become Quadlet too | pending |
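
The v1.7.43 idea of a pure render function plus a rewrite on drift can be sketched as follows. This is an illustration only: the two-line file body, the `rpcauth_line` parameter, and `plan_rewrite` are assumptions, and a real bitcoin.conf carries more settings than shown here.

```rust
/// Pure render: the same input always yields the same file body.
/// `rpcauth_line` stands in for whatever is derived from the canonical secret.
fn render_bitcoin_conf(rpcauth_line: &str) -> String {
    format!("server=1\nrpcauth={}\n", rpcauth_line)
}

/// Return the contents to write when the on-disk file is missing or has drifted,
/// or None when it already matches the desired rendering.
fn plan_rewrite(on_disk: Option<&str>, desired: &str) -> Option<String> {
    match on_disk {
        Some(current) if current == desired => None,
        _ => Some(desired.to_string()),
    }
}
```

Keeping the render pure means drift detection is a plain string comparison, and a rewrite is idempotent: running it twice in a row produces no second write.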

The test harness (bats + Goss + Chaos Toolkit + vmtest) lands as a scaffold in v1.7.41; the first lifecycle tests block v1.7.45, and the full matrix blocks the beta tag.


Release history

v1.7.41-alpha — IN FLIGHT — 2026-04-22

Post-OTA auto-rollback. After an update lands, the node probes its own web UI through nginx — if the frontend isn't answering cleanly within 90 seconds, the node automatically rolls back to the previous version and restarts. A bad release can no longer leave the fleet stranded on an unreachable node.

Changes:

  • core/archipelago/src/update.rs: PendingVerification struct, write marker before service restart, verify_pending_update() on new binary boot — probes https://127.0.0.1/, on fail restores web-ui.bak + calls rollback_update() + systemctl restart archipelago
  • core/archipelago/src/main.rs: startup task invokes verifier concurrently with server
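
The probe-with-deadline logic behind verify_pending_update() can be sketched independently of HTTP by passing the probe in as a closure. The signature, the `Verdict` enum, and the polling interval are assumptions for illustration; only the function name, the 200 check, and the 90-second window come from the notes above.

```rust
use std::time::{Duration, Instant};

/// Hypothetical outcome of the post-OTA check.
#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    Healthy,
    RollBack,
}

/// Poll `probe` until it reports HTTP 200 or `deadline` elapses.
/// `probe` returns the status code, or None when the connection failed.
fn verify_pending_update<F>(mut probe: F, deadline: Duration, interval: Duration) -> Verdict
where
    F: FnMut() -> Option<u16>,
{
    let start = Instant::now();
    loop {
        if probe() == Some(200) {
            return Verdict::Healthy;
        }
        if start.elapsed() >= deadline {
            return Verdict::RollBack;
        }
        std::thread::sleep(interval);
    }
}
```

In production the closure would issue the request to https://127.0.0.1/ and the deadline would be 90 seconds; on RollBack the caller restores web-ui.bak, calls rollback_update(), and restarts the service. Injecting the probe keeps the timing logic testable without nginx.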

v1.7.40-alpha — 2026-04-22

Proper fix for the 500 error. Fixed the v1.7.38/39 tarball-perms bug at its source — staging dir is now explicitly chmod 755 before tar; --mode=u=rwX,go=rX normalizes archive perms; pre-ship assertion aborts release if tar tvzf | head -1 isn't drwxr-xr-x.

Changes:

  • scripts/create-release-manifest.sh: pre-tar chmod + tar --mode flag + post-tar verify
  • Everything from .38 + .39 still in place (onboarding auto-heal, silent logins, app purge, AIUI in tarball)
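
The pre-ship assertion itself lives in a shell script, but the check it performs is small enough to sketch. Assuming the first line of the `tar tvzf` listing is passed in as a string (the function name is hypothetical):

```rust
/// True when the first entry of a `tar tv` listing is a directory with mode 755.
/// The mode string is the first whitespace-separated field of the line.
fn root_mode_ok(first_listing_line: &str) -> bool {
    first_listing_line.split_whitespace().next() == Some("drwxr-xr-x")
}
```

A listing line starting with drwx------ (the v1.7.38/39 failure) returns false, which is the condition under which the release should abort.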

v1.7.39-alpha — 2026-04-22

Hotfix attempt for v1.7.38's nginx 500 (didn't fully work — still shipped broken tarball perms). Added startup self-heal chmod in main.rs and post-extract chmod in update.rs OTA applier.

v1.7.38-alpha — 2026-04-22

Onboarding auto-heal + silent logins + App Store trim.

Changes:

  • auth.rs: is_onboarding_complete() auto-heals from setup_complete + password_hash (prevents clear-cache → onboarding wizard bug)
  • useOnboarding: tri-state — backend-unreachable no longer defaults to /onboarding/intro
  • Login sounds gated by isFirstInstallPhase() — silent after onboarding, typing sounds unaffected
  • Removed FIPS app, Nostr Relay, Nostr VPN, Routstr, Penpot from catalog + Rust + docker + icons
  • Deleted 15 image versions from tx1138, .168, gitea-local registries
  • AIUI baked into release tarball via demo/aiui/
  • prebuild hook syncs app-catalog/catalog.json → public/catalog.json

(Shipped with tarball-perms bug; fleet had to be healed before v1.7.40.)

v1.7.37-alpha — 2026-04-22

Bitcoin Core install fixes + dynamic node UI + full-archive default.

  • Bitcoin Core passes explicit -rpcbind/-rpcallowip/etc. CLI args so vanilla image exposes RPC
  • Split bitcoin-core from bitcoin-knots in backend AppMetadata
  • bitcoin-ui auto-detects Core vs. Knots from subversion, swaps branding at runtime
  • Storage (Full Archive · X GB / Pruned) indicator on dashboard
  • Node Settings modal shows real values (network, storage, txindex, ZMQ, RPC port)
  • Pull fallback to docker.io when no mirror carries the image
  • Removed prune=550 hardcode — full archive default
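
The Core vs. Knots detection above runs in the bitcoin-ui frontend; the decision itself is a string check on the node's subversion field. A sketch of that logic, written in Rust for consistency with the rest of this page, and assuming that Knots builds include a "Knots" component in their user-agent string (the enum and function names are hypothetical):

```rust
/// Hypothetical branding choice for the node UI.
#[derive(Debug, PartialEq, Eq)]
enum NodeFlavor {
    Core,
    Knots,
}

/// Classify a node from the `subversion` string reported by getnetworkinfo.
/// Anything without a Knots marker is treated as Core.
fn detect_flavor(subversion: &str) -> NodeFlavor {
    if subversion.to_ascii_lowercase().contains("knots") {
        NodeFlavor::Knots
    } else {
        NodeFlavor::Core
    }
}
```

Defaulting to Core keeps the UI usable when the subversion string is empty or unexpected.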

Key docs


How to resume

  1. Check fleet mirrors are all live: curl -sS https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/manifest.json | jq .version
  2. Read bulletproof-containers.md for the current plan
  3. Check task list (/list or via Claude Code) for the in-flight release
  4. Latest in-flight work: v1.7.41 deploying to .228 for test; will ship to all 4 mirrors once verified