1052 Commits

Author SHA1 Message Date
archipelago
7f912549d6 docs(testing): track Phase 3.4 race fix + drift-sync hook
* L0 unit count: 630 → 631 (translate_health_check_http_does_not_double_prefix_scheme)
* Phase 3 row: add TimeoutStartSec=600 race fix (44f275ed) + drift-sync hook (0889367d)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:53:18 -04:00
archipelago
0889367dbf feat(orchestrator): drift-sync existing Quadlet units on each reconcile
When a Quadlet unit file already exists for an orchestrator-managed
backend, sync its on-disk bytes against what the current renderer
produces. write_if_changed makes this idempotent — when bytes match,
no IO; when they differ (post-deploy of a renderer change), the file
is rewritten and systemctl --user daemon-reload runs once.

We deliberately do NOT restart the .service when the file changes:
running containers keep their current config until the operator
restarts them. That's the right tradeoff — file updates are cheap and
non-destructive; service restarts are the SIGKILL cascade we're
trying to eliminate.

Why this matters: pre-this-commit, every renderer change required a
fresh package.install RPC per app to take effect. Observed live on
.228 2026-05-02 — the TimeoutStartSec=600 fix shipped in code but
existing units stayed on the old format because nothing triggered a
re-render. Combined with state.json being empty (so the reconciler's
auto-install path didn't fire either), the fix was invisible until
manual unit deletion.

Companions (UI_APP_IDS) are skipped — companion.rs renders those units
with a different shape; syncing here would clobber them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:43:18 -04:00
archipelago
44f275eda4 fix(quadlet): TimeoutStartSec=600 when Notify=healthy is set
Bug surfaced live on .228 2026-05-02 — every backend Quadlet unit
(lnd, electrumx, fedimint, btcpay-server, mempool-api, bitcoin-knots)
hit systemd's default 90s start timeout because Notify=healthy makes
systemctl wait for the first green health probe, but
HealthInterval=30s × HealthRetries=3 = 90s minimum even on a healthy
service. Race: timeout fires the moment the third probe MIGHT succeed.

Result was three different post-states (inactive+running, failed+missing,
inactive+stopped) depending on whether systemd's ExecStopPost ran
podman rm before the orchestrator's adoption logic re-grabbed the
container.

Fix: when health is set, render TimeoutStartSec=600 (10 minutes) into
[Service]. Long enough for slow-starting backends (electrumx index
replay, lnd wallet unlock) without being so long that a truly stuck
unit hangs forever. Companions stay unchanged (no health → no override,
default 90s applies).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 07:14:48 -04:00
archipelago
a9c2685d8b fix(quadlet): http:// double-prefix + companion migration race
Two bugs surfaced by the first real-node validation of Phase 3.2-3.4
on .228 (2026-05-02), both caught before flipping the default.

Bug 1 — translate_health_check double-prefixed http://. Manifests in
the wild carry the scheme inside the endpoint string ("http://localhost:8175"),
and we were prepending another http:// unconditionally. Result on .228:
every backend HealthCmd read `curl -fsS -m 5 http://http://localhost...`,
every probe failed, fedimint hit a 14-restart loop. Now we accept either
form and skip appending hc.path when the endpoint already carries one.
Regression test asserts no double-prefix and that an in-endpoint path
is honoured.

Bug 2 — Phase 3.3 migration ran for UI companions (bitcoin-ui /
electrs-ui / lnd-ui) that have shipped via Quadlet since v1.7.41.
Migration tore down the running companion + raced companion.rs render,
producing "Phase 3.3: re-install archy-bitcoin-ui via Quadlet" reconcile
errors and leaving archy-bitcoin-ui down. Companions now short-circuit
out of migrate_to_quadlet_if_needed before any IO. Also: when try_exists
returns Err for an unrelated reason (permissions, EIO), we now skip
migration instead of treating "I can't tell" as "go ahead and migrate" —
migrating on top of a possibly-existing unit is destructive.

What this does not fix yet:
  * the orchestrator's reconciler iterating every manifest in
    /opt/archipelago/apps/, not just installed apps. Pre-existing
    behavior (also affects the legacy path) — separate scope.
  * fedimint /data UID mismatch surfaced when Quadlet started fedimint
    fresh. Likely orthogonal — defer.
  * no rollback when install_via_quadlet fails after a remove_container.
    Tracked as Phase 3.3.1 — defer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 06:37:37 -04:00
archipelago
409a19e7d7 feat(config): ARCHIPELAGO_USE_QUADLET_BACKENDS env override
Adds an env-var lever for Phase 3.2's use_quadlet_backends flag so the
20× harness can flip the path on per-node without a config.json edit
(which would require an archipelago.service restart — and that triggers
FM3 cgroup cascade until Phase 3.5 ships, so we can't ask anyone to
reconfigure live nodes that way today).

Truthy parsing centralised in `parse_truthy_env` (1, true, yes, on —
case-insensitive, whitespace-trimmed). Anything else is false. The
helper is unit-tested so future env-var flags can reuse the same shape.

Also adds a default-off regression test for use_quadlet_backends so
flipping the default ahead of the 20× verification fires immediately.

TESTING.md documents the Environment= snippet for the systemd drop-in
so the next operator can flip the flag on a debug node without
re-deriving the recipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 05:44:09 -04:00
archipelago
5eec0c143c test(lifecycle): post-condition gate for use_quadlet_backends path
A six-test bats suite that validates what install_via_quadlet (Phase 3.2)
is supposed to leave behind:

  * `.container` unit on disk in $XDG_CONFIG_HOME/containers/systemd/
    with [Container] / [Service] / [Install] sections, Image= present,
    and Restart=on-failure (the backend invariant — companions use Always)
  * Phase 3.4 cross-check: any unit with HealthCmd= must also emit
    Notify=healthy, otherwise systemctl start won't gate on health
  * `systemctl --user is-active` returns 0 for the .service
  * podman shows the container running
  * the container's cgroup is under user.slice/, NOT under
    archipelago.service — the kernel-level proof that FM3 cgroup
    cascade SIGKILL is structurally fixed for this container

Auto-skips on every test when no backend Quadlet units exist (today's
default state, use_quadlet_backends=false) — so the suite is a no-op
on current fleet boxes and turns into a hard regression gate the
moment anyone flips the flag and reinstalls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 05:34:47 -04:00
archipelago
e7b8fb8fa3 feat(quadlet): Phase 3.4 — health-gated startup via Notify=healthy
QuadletUnit gains an optional HealthSpec; from_manifest translates the
manifest's health_check (tcp/http/cmd) into a HealthCmd= directive and
emits Notify=healthy alongside it. systemctl start <unit>.service then
blocks until the container's first green probe — eliminating the
"container up but RPC not ready" race the orchestrator currently papers
over with post-start polling.

Translation policy:
* tcp,  endpoint "host:port"        -> nc -z host port
* http, endpoint "host:port", path  -> curl -fsS -m 5 http://endpoint<path>
* cmd,  endpoint "<shell command>"  -> verbatim
* unknown type / malformed endpoint -> None (skip Notify=healthy rather
  than emit a HealthCmd that hangs the unit start forever)

Companion units leave health: None and remain byte-identical to before
this PR — the renderer only emits the Health* / Notify= block when set.

+4 quadlet unit tests (19 total). Dropped a never-used test setter that
was generating a dead_code warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 05:21:57 -04:00
archipelago
12acf02d7b feat(orchestrator): Phase 3.3 — in-place migration to Quadlet
When use_quadlet_backends flips from off → on, existing fleet boxes
have backend containers parented under archipelago.service's cgroup
(the bad shape that triggers FM3 cascade SIGKILL on every archipelago
restart). ensure_running now notices and corrects this:

* If there's already a `<name>.container` unit on disk → no-op
  (subsequent reconcile ticks take this fast path).
* Else if a podman container with that name exists → it's a pre-3.3
  artifact. Stop+remove it (volumes survive — bind mounts are not
  touched by `podman rm`), then write the Quadlet unit, daemon-reload,
  and start the new managed service.
* Else → fall through to install_fresh, which already routes through
  install_via_quadlet when the flag is on.

The migration is idempotent and self-healing: if a fleet box is
half-migrated (unit on disk but no service active, or service active
but stale unit), the next reconcile tick converges. Bitcoin chain
data, lnd wallet state, and electrumx index all live on host bind
mounts and are unaffected by the container-record swap.

Volume safety audited per backend in `uses_orchestrator_install_flow`
allowlist — every entry mounts its data dir as a host bind mount.

Default still off. To migrate a node:
  /etc/archipelago/config.toml: use_quadlet_backends = true
followed by `systemctl restart archipelago` — the next reconcile tick
walks every managed app and migrates each in turn.

Tests: 624 passing, 0 cargo warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:27:59 -04:00
archipelago
5e67208c1d feat(orchestrator): Phase 3.2 — wire Quadlet path behind feature flag
prod_orchestrator::install_fresh now branches on the new
Config::use_quadlet_backends flag (default false):

* off (today's production behavior) — unchanged: runtime.create_container
  + start_container, container parented under archipelago.service's
  cgroup, FM3 cascade SIGKILL on every archipelago restart.
* on  — install_via_quadlet renders the manifest as a Quadlet unit via
  QuadletUnit::from_manifest, writes it atomically into
  ~/.config/containers/systemd/, calls daemon-reload, and starts the
  generated <name>.service. Container ends up under user.slice — no
  more cgroup parented under archipelago, so archipelago restarts
  don't touch the container's lifetime.

Default off so this commit is structurally safe to ship: nothing
changes at runtime until an operator opts in. Flip the default once
tests/lifecycle/run-20x.sh has gone green against the new path on
.228 + .198 (the v1.7.52 release gate).

Plumbing:
* config.rs — `use_quadlet_backends: bool` w/ Default false
* prod_orchestrator.rs — flag stored on the struct, threaded through
  new(), with set_use_quadlet_backends(bool) test setter
* prod_orchestrator.rs — install_via_quadlet helper
* dropped the Phase-3.1 #[allow(dead_code)] markers on from_manifest /
  parse_memory_mib / RestartPolicy::OnFailure now that the call path
  exists; if a future revert removes the wiring, the warnings come back.

Tests: 624 passing, cargo check clean (0 warnings). Existing companion
behavior unaffected — render_skips_backend_directives_when_default
still passes byte-equal to before quadlet.rs grew the new fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:22:10 -04:00
archipelago
82eba8be03 feat(quadlet): backend-manifest renderer (Phase 3.1 of v1.7.52)
The QuadletUnit struct now covers everything a backend manifest needs
(ports, environment, devices, add_hosts, entrypoint+command, read-only
root, no_new_privileges, cpu_quota, restart policy choice). Adds
QuadletUnit::from_manifest(&AppManifest, name) that translates a parsed
manifest into a unit, plus parse_memory_mib for "1g"/"512m"/raw-MiB
forms. The renderer skips empty/false directives so existing companion
units render byte-identically — no behavior change for shipping
companions; the backend renderer is dead code until Phase 3.2 wires it
into the orchestrator.

Eight new unit tests cover:
* parse_memory_mib forms (1024, 512m, 2g, garbage)
* shell_join quoting (whitespace, embedded quotes)
* RestartPolicy → systemd string mapping
* render emits backend directives when set
* render skips them when defaulted (companion regression gate)
* from_manifest happy path on a bitcoin-knots-shaped manifest
* from_manifest read-only volume detection
* from_manifest tmpfs filtering
* end-to-end manifest → render bytes assertion

Tests: 615 → 624 (+9 net; one pre-existing parse_memory_mib path was
implicitly covered before but is now explicit). Cargo warnings: 0.

`from_manifest`, `parse_memory_mib`, and `RestartPolicy::OnFailure` are
marked allow(dead_code) with explicit references to Phase 3.2 — if
3.2 doesn't wire them, the dead-code warning resurfaces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:09:50 -04:00
archipelago
9a5d5027f5 test(lifecycle): add btcpay + fedimint + mempool suites
Brings L1 (RPC API) + L3 (lifecycle survival) parity coverage to the
three multi-app stacks that were previously only touched by
required-stack.bats. Combined with bitcoin-knots / lnd / electrumx
already shipping, the six core apps now have dedicated bats files.

Each suite is shaped like the existing single-container suites
(bitcoin-knots / lnd / electrumx) and gates every assertion on the
backing container actually being present, so a node without the stack
installed gets clean skip messages instead of false fails.

* btcpay.bats — 9 tests, including stack-wide presence and a
  "supporting containers don't cascade-restart" guard
* fedimint.bats — 8 tests, single container
* mempool.bats — 9 tests, mixed legacy + orchestrator-managed stack;
  reuses the :8999 mempool-api probe from required-stack for parity

Total bats now: 88 (was 53 → +35).
TESTING.md matrix advances 23 → 50 of 110 cells.
UI URL coverage for these three apps already lives in
ui-coverage.bats, so this PR doesn't duplicate proxy-path probes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:55:31 -04:00
archipelago
d6f2e7bddc docs(testing): canonical scorecard for container subsystem testing
Single source of truth for "where are we, where are we going" on the
v1.7.52 container excellence work. Replaces ad-hoc tracking in chat.

Sections:
* Test layers L0..L6 with toolchain + per-iteration latency
* Per-app × per-state coverage matrix (23 of 110 cells today; goal 110)
* Layer-by-layer status (L0+L1+L2 ●; L3 ◐; L4..L6 ○)
* Run commands (single suite / full suite / 20×)
* LoC budget — -270 committed, ~1,616 more possible if Phase 3 ships
* Performance KPIs (TBD — measure first, target second)
* Release gates — 8 boxes that must tick before v1.7.52 ships

The file lives in-repo so PR diffs to it answer "what did this commit
improve?". If you can't tick the box, the change isn't ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:52:42 -04:00
archipelago
1cbac3b82c test(lifecycle): add UI surface coverage — HTTPS proxy + iframe URLs
Closes the coverage gap where existing bats suites would report green
on a node whose dashboard tiles 502 because the proxy upstream is dead.
First pass against .198 caught real prod issues immediately:
  /app/lnd/       → 502 (lnd container exited)
  /app/mempool/   → 502 (mempool container exited)
  /app/fedimint/  → 502 (fedimint container exited)
while existing tests reported only "container is up: false" with no
404/502 distinction.

* lib/ui-probes.bash — sourced helper. probe_https_200,
  probe_app_url (skip-if-container-down else assert-200),
  probe_dashboard_shell (asserts the Vue SPA HTML, not nginx default —
  catches the layout regression from feedback_release_tarball_layout.md),
  probe_dashboard_catalog (asserts /catalog.json non-empty).
* bats/ui-coverage.bats — 9 @test cases covering the dashboard +
  bitcoin-ui :8334 + the seven HTTPS_PROXY_PATHS most users hit
  (lnd, electrumx, mempool, fedimint, btcpay, filebrowser).

URL list mirrors HTTPS_PROXY_PATHS in
neode-ui/src/views/appSession/appSessionConfig.ts. Divergence between
the two is the exact bug class we're guarding against.

Loops clean under run-20x.sh. Container-state oracle is via local
podman inspect, so the suite must run on the archy host (same as
companion-survives-archipelago-restart.bats).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:49:30 -04:00
archipelago
4f503df6f1 test(bootstrap): regression gate for the heal_podman_state socket bug
Extracted the heal_podman_state cleanup list as a module-level
HEAL_RUNTIME_SUBDIRS const so a unit test can structurally enforce
the invariant: the list must contain "containers" + "libpod" but
must NOT contain "podman" (which holds systemd's podman.sock
listener and was the bug fixed in commit bb421803).

If anyone re-adds "podman" — accidentally, by reverting, or by
copy-paste from old plan memory — this test fires before we ship,
not on the next deploy when it nukes the orchestrator's HTTP path.

Total tests: 614 → 615.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:32:59 -04:00
archipelago
ebffe3ace5 test(lifecycle): regression gate for FM3 cgroup-cascade SIGKILL
Sister suite to companion-survives-archipelago-restart.bats. That one
tests the same property for UI companions, which already ship via
Quadlet (commit 6e716f68) and so already pass.

This new suite tests the property for backend containers (bitcoin-knots
/ bitcoin-core / lnd / electrumx). Until v1.7.52 Phase 3 ships these
under Quadlet too, the suite is EXPECTED TO FAIL on fleet boxes — it's
the executable definition of "FM3 fixed".

Observed live on .198 on 2026-05-01: `sudo systemctl stop archipelago`
killed every container in archipelago.service's cgroup. The dedicated
"backends survive archipelago restart" test catches exactly that, and
also verifies the SAME container instance survives (compares pre/post
.Id), so an orchestrator that recreates a fresh container after the
SIGKILL doesn't read as pass.

Three @test cases:
* destructive gate (skip-marker for the suite)
* baseline: at least one backend installed + running
* backends survive: same .Id pre + post archipelago restart

Don't gate releases on this passing until Phase 3 lands; before then
treat it as a "expected to fail / shows progress" indicator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:17:27 -04:00
archipelago
a9896aabfa test(lifecycle): add dedicated electrumx.bats suite
Same shape as bitcoin-knots.bats and lnd.bats so the 20× release-gate
exercises electrumx through the same state matrix it uses for the other
two core apps. electrumx previously had a single TCP-port check inside
required-stack.bats; this adds destructive + cascade-destructive tiers.

10 @test cases:
* read-only: presence, valid state, TCP port (50001) reachable, no
  orphan containers beyond {electrumx, archy-electrs-ui}
* destructive: stop, start, restart, TCP port recovers within 120s of
  cold restart (longer than bitcoind because electrumx replays its
  index against bitcoind on start)
* cascade: uninstall, reinstall (240s timeout for index rebuild)

With this suite, the three single-container core apps (bitcoin-knots,
lnd, electrumx) now have parity coverage. Multi-container stacks
(btcpay, mempool, fedimint) come next.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:11:02 -04:00
archipelago
abe07c6588 test(lifecycle): add dedicated lnd.bats suite
Mirrors bitcoin-knots.bats so the 20× release-gate run exercises lnd
through the same state matrix. lnd previously had only a single
read-only check inside required-stack.bats; this adds the destructive
and cascade-destructive tiers that match what we already test for
bitcoin-knots.

10 @test cases:
* read-only: presence, valid state, lncli getinfo, no orphan containers
* destructive (ARCHY_ALLOW_DESTRUCTIVE=1): stop, start, restart,
  RPC recovers within 90s of cold restart (longer than bitcoind
  because the wallet has to unlock first)
* cascade (ARCHY_ALLOW_CASCADE_DESTRUCTIVE=1): uninstall, reinstall

Reuses the same lncli invocation as required-stack.bats so divergence
shows up clearly if either test breaks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:09:43 -04:00
archipelago
168a0d9509 test(lifecycle): add setup-teardown + run-20x harness scaffolding
Phase 4 of the v1.7.52 container excellence plan: a release-gate harness
that loops the bats suite N times in a row, with teardown between
iterations, and reports a pass/fail tally.

* setup-teardown.sh — clears /tmp/archy-rpc-session-* between runs so
  iteration N+1 doesn't reuse a logged-out cookie from iteration N.
  Idempotent; safe to run anytime. Designed to grow as we add suites
  that leave other transient state.
* run-20x.sh — wraps run.sh in a loop of ARCHY_ITERATIONS (default 20).
  Tracks per-iteration pass/fail with wall-clock timing, prints a
  results block, exits non-zero on any failure. Honors ARCHY_FAIL_FAST
  for short-circuit during dev.

Suggested release-gate command:
  ARCHY_PASSWORD=password123 ARCHY_ALLOW_DESTRUCTIVE=1 \
    tests/lifecycle/run-20x.sh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:06:09 -04:00
archipelago
bb42180373 fix(bootstrap): don't nuke podman socket dir during runtime self-heal
Observed live on .198: heal_podman_state was removing
$XDG_RUNTIME_DIR/podman/ alongside containers/ and libpod/. That dir
holds the systemd-bound podman.sock — the listener systemd creates for
socket-activated podman.service. Removing it broke every libpod HTTP
call from the orchestrator until `systemctl --user restart
podman.socket` ran. Far worse than any wedge it was trying to repair.

Drop podman/ from the cleanup list. The runtime state we actually want
to clean for FM6 (bolt_state.db drift) lives in containers/ and
libpod/ only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 15:57:15 -04:00
archipelago
1c0df95f9a refactor: drop dead code surfaced by cargo
cargo check was showing five real warnings, all genuinely dead:

* container/mod.rs   — re-exports compute_container_name, AdoptionReport,
                       ReconcileAction, ReconcileReport were unused outside
                       prod_orchestrator. Drop from the pub use line.
* prod_orchestrator  — with_runtime + insert_manifest_for_test only exist
                       for the test module in the same file. Mark them
                       #[cfg(test)] so they don't appear in release builds.
* async_lifecycle    — remove_package_entry has no callers; doc claims
                       "used for install-failure cleanup" but nothing
                       cleans up. Delete (10 lines).
* registry.rs        — `use tracing::{debug, info};` had no consumers.
* fips.rs            — unused-assignment chain on last_status. The poll
                       loop always sets it on every break path, so the
                       initial `None` and the unwrap_or_else fallback
                       were both dead. Refactored to `let after = loop
                       { ...; break s; };`.

cargo check is now clean. cargo test --workspace --bins: 614 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 15:34:02 -04:00
archipelago
45c1f0b6d1 fix(bootstrap): self-heal wedged podman runtime state at startup
Closes FM6 (podman bolt_state.db / runtime drift) — observed live on
.198 today: bitcoind was running for several minutes, but podman's
state DB reported the container as Exited. The reconciler then tried
to "restart" it, racing the still-bound port 8332 and failing in a
loop.

heal_podman_state() runs as the last bootstrap stage, BEFORE the
orchestrator's reconcile loop ticks. It probes `podman info` with a
5s timeout; on failure it removes the runtime-state dirs under
$XDG_RUNTIME_DIR and re-probes. Persistent storage under
~/.local/share/containers/storage/ is never touched, so containers
re-discover from manifests on next call.

Cleanup never includes `podman system reset` or `system renumber` —
those are destructive and must stay operator-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 15:23:36 -04:00
archipelago
a1f37b20ed refactor(container): drop unused dependency_resolver module
DependencyResolver had zero call sites in prod or tests outside the
module itself. The actual install-time dependency check lives in
install.rs::detect_running_deps + check_install_deps; this DAG-walk
solver was never wired up. -268 LoC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 15:22:07 -04:00
archipelago
8321d093e8 fix(install): auto-clean stuck OTHER-variant bitcoin container
If bitcoin-core was installed but never started (e.g. port 8332 already
bound by bitcoin-knots), the container sticks in `created` state forever.
The old conflict check refused EVERY future bitcoin install — including
re-install of the running variant — leaving no UI path to recovery.

Now the check distinguishes states:
  - missing                       → no conflict, continue
  - running                       → real conflict, refuse install
  - created/exited/configured/... → stuck; auto-remove and continue

Volumes are untouched; only the dead container record goes away.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:59:11 -04:00
archipelago
d5c1253a7e fix(install): generate bitcoin RPC password before orchestrator install
Bitcoin containers were exiting in ms after start because the orchestrator
install path skipped the credential-materialisation step the legacy path
did. resolve_secret_env then failed to read
/var/lib/archipelago/secrets/bitcoin-rpc-password, the container started
with no password, and bitcoind crashed before logs were useful.

Two changes:

1. install.rs — call bitcoin_rpc_credentials() for bitcoin/bitcoin-core/
   bitcoin-knots before any install branch runs. The function generates +
   persists on first call (OnceCell-cached), so this is idempotent.

2. manifest.rs::resolve_secret_env — return ManifestError::Invalid when a
   resolved secret trims to empty, instead of silently producing
   `KEY=` env vars that crash auth.

Adds a unit test for the empty-secret rejection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:39:56 -04:00
archipelago
fdb1f3d4d3 refactor(install): route orchestrator-managed apps through orchestrator first
Phase 3a of the install path consolidation. Two coupled changes:

1. install.rs handle_package_install: gate the legacy "container exists →
   adopt + return" probe on !orchestrator_managed. Apps the orchestrator
   knows about (bitcoin-knots, bitcoin-core, lnd, electrumx, fedimint,
   filebrowser, btcpay-server stack apps, mempool stack apps, plus the
   companion UIs that just moved to Quadlet) skip the legacy probe and
   fall straight into the orchestrator branch.

   The legacy adopt block was returning success on a bare `podman start`
   exit-0 — even when the process inside the container crashed seconds
   later. That's the .228 "running but unreachable" failure mode. The
   orchestrator's ensure_running honors the manifest's health check and
   pre-start hooks (e.g. re-renders bitcoin-ui's nginx.conf if the RPC
   password rotated), so this is a behavioral upgrade, not just a
   refactor.

2. ProdContainerOrchestrator::install: make idempotent. Previously it
   blindly called install_fresh which would fail on `podman create` if
   the container name already existed. Now it delegates to ensure_running:
     - Container Running + healthy → no-op (refresh hooks, restart if
       config rewritten)
     - Container Stopped/Exited → start (with hook refresh)
     - Container missing → install_fresh
     - Container in wedged state (Created/Paused/Unknown) → force-recreate

   Without this, change #1 would regress every "container already exists"
   case for the 18 orchestrator-managed app IDs. With it, install becomes
   the single source of truth for "make app X be in the desired state."

Tests: 654 passed across the workspace (614 unit + 37 orchestration + 3
rpc), 0 failures. The 20 prod_orchestrator tests cover the install /
ensure_running / reconcile paths the new install delegates through.

Net delta: install.rs grows by ~30 lines (gating wrapper + comments),
prod_orchestrator.rs grows by ~30 lines (idempotent install body). Both
are temporary — the larger deletions (~1700 lines) come once every app
has been verified through the orchestrator path in subsequent phases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:12:52 -04:00
archipelago
6e716f68b6 refactor(container): move companion UIs to systemd via Quadlet
Companion UI containers (archy-bitcoin-ui, archy-lnd-ui,
archy-electrs-ui) used to be launched as fire-and-forget tokio::spawn
blocks from install.rs. If archipelago crashed mid-spawn or the
container's cgroup was reaped, companions vanished from podman ps -a
and only a manual rm/run could bring them back (the .228 incident).

Now each companion is rendered as a Quadlet .container unit under
~/.config/containers/systemd/, daemon-reloaded, and started via
systemctl --user. systemd owns supervision from that point on:

- archipelago can crash, restart, or be uninstalled without touching
  any companion.
- Quadlet's Restart=always + RestartSec=10 handles container exits.
- A 30s reconcile tick in boot_reconciler enumerates expected
  companion units and re-installs any whose unit file or service
  vanished — defense-in-depth against external tampering.

New module layout:
- container/quadlet.rs: pure unit renderer + atomic write_if_changed
  + systemctl helpers (daemon_reload_user / enable_now / disable_remove
  / is_active). 6 unit tests, no I/O in the renderer.
- container/companion.rs: per-app companion specs, install/remove/
  reconcile, image presence (build local first, fall back to insecure
  registry only via image_uses_insecure_registry whitelist). 2 tests.

install.rs handle_package_install now ends with a single call to
companion::install_for(package_id), replacing 287 lines of spawn-and-
hope shellouts plus a ~120-line nginx auth-injector helper that worked
around per-node RPC password baking. The helper is gone too — the
pre-start hook renders the per-node nginx.conf to /var/lib/archipelago/
bitcoin-ui/nginx.conf and the Quadlet unit bind-mounts it read-only.

runtime.rs handle_package_uninstall now disables companions before
the container rm loop. Otherwise systemd's Restart=always would
respawn each companion within ~10s of removal.

Tests: 53 container tests pass, including 6 quadlet renderer tests
(host network, bridge network, capability set, atomic write idempotence)
and 2 companion specs (per-app companion lookup, build_unit shape).
boot_reconciler tests gain a #[cfg(test)] without_companion_stage()
flag so the paused-clock fixtures don't race the real systemctl I/O.

A bats regression test (companion-survives-archipelago-restart.bats,
gated on ARCHY_ALLOW_DESTRUCTIVE=1) asserts the .228 failure mode
cannot recur: every installed companion has a unit file, services
stay active across systemctl --user restart archipelago, and a
deleted unit file is recreated within one reconcile tick.

Net delta: +941 / -363, but the +941 is mostly tests (~440 lines)
and the new declarative layer; the imperative tokio::spawn block and
its nginx-auth helper are gone, removing two failure classes
(orphan companions on archipelago crash, and post-start exec races
under tightly-confined cgroups) that previously needed manual SSH
recovery.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:45:07 -04:00
archipelago
8a22ccfa20 refactor(security): tighten capability + TLS-bypass surface
Three small, focused tightenings:

- core/container/src/podman_client.rs: drop the legacy Hetzner
  23.182.128.160:3000 mirror from image_uses_insecure_registry().
  It was decommissioned in v1.7.x and is stripped from active
  registry config at load time; leaving it in the bypass list let
  a stale config still skip TLS. Replace the inline match with a
  named INSECURE_REGISTRY_HOSTS slice so future entries are one
  line. Test now also pins the spoofing-immune semantics
  ("evil.example/146.59.87.168:3000/x" must NOT match).

- core/archipelago/src/api/rpc/package/config.rs: split bitcoin
  from lnd in get_app_capabilities(). bitcoind never opens raw
  sockets — drop CAP_NET_RAW from bitcoin/bitcoin-core/bitcoin-knots.
  lnd/fedimint/fedimint-gateway keep it because they enumerate
  network interfaces during cert generation.

- core/archipelago/src/bootstrap.rs: tighten_secrets_dir()
  enforces 0700 on /var/lib/archipelago/secrets and 0600 on every
  file inside on each startup. The dir-mode is the load-bearing
  isolation boundary against rootless container escapes (their UID
  maps to >=100000, can't traverse uid=1000/0700). The per-file
  sweep is defense-in-depth against any installer that wrote 0644.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 08:59:11 -04:00
archipelago
3866c12ddf chore: baseline codex hardening before lifecycle refactor
Snapshots the in-flight hardening work so subsequent reconcile/Quadlet
phases land on a clean before/after diff.

Changes:
- core/container/src/podman_client.rs: image_uses_insecure_registry()
  whitelist for the OVH (146.59.87.168:3000) and legacy Hetzner
  (23.182.128.160:3000) HTTP mirrors; podman_network_settings() lifts
  custom networks into the Networks map so containers can join them.
- core/archipelago/src/container/prod_orchestrator.rs:
  ensure_container_network() creates per-manifest networks on demand;
  apply_data_uid() now goes through host_sudo for mkdir -p + chown so
  bind-mount roots get created and chowned without password prompts.
- core/archipelago/src/api/rpc/package/{install,update,stacks}.rs:
  podman pull adds --tls-verify=false only for whitelisted registries.
- core/archipelago/src/bootstrap.rs: removes stale dev-mode systemd
  override on startup (live nodes carried it from old installers).
- core/archipelago/src/config.rs: ignore ARCHIPELAGO_DEV_MODE in prod
  binaries — it had been silently rerouting volumes to /tmp.
- apps/bitcoin-{core,knots}/manifest.yml: locate bitcoind at runtime
  so image-layout differences don't break entrypoint.
- scripts/app-catalog-image-smoke-test.py: production catalog/image
  smoke test that probes a target node before users click Install.
- .gitignore: cover .codex, .pnpm-store, __pycache__, *.bak.

Removes filebrowser.rs.bak and two stale catalog.json.bak files
(verified identical to live counterparts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 08:52:29 -04:00
archipelago
63a33de229 fix: release v1.7.51-alpha install hardening 2026-05-01 05:02:39 -04:00
archipelago
e376fec825 fix: release v1.7.50-alpha OTA runtime repair v1.7.50-alpha 2026-05-01 03:14:07 -04:00
archipelago
b4756183e8 chore: release v1.7.49-alpha v1.7.49-alpha 2026-04-30 16:37:54 -04:00
archipelago
b7ee82ccbc chore: release v1.7.48-alpha
Hotfix: archipelago.service ExecStartPre now mkdirs /run/containers and
/var/lib/containers before the unit's mount-namespace setup tries to bind
them. Without this, fresh nodes that don't have /run/containers (e.g.
nodes provisioned without a prior podman session) fail at the namespace
step with:

  Failed to set up mount namespacing: /run/containers: No such file or directory
  Failed at step NAMESPACE spawning /bin/bash: No such file or directory

Existing nodes don't pick up systemd unit changes via OTA — they need a
one-time `systemctl edit archipelago` adding the same mkdir. ISO installs
from this version forward have the fix baked in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v1.7.48-alpha
2026-04-29 16:27:22 -04:00
archipelago
1f86f2e937 chore: release v1.7.47-alpha
Sync-perf tuning for bitcoin/bitcoin-core/bitcoin-knots/electrumx.

- Drop the --cpus=2 cap on bitcoin/electrumx variants. Script verification
  is parallelizable; the cap halved IBD speed on 4-8 core machines.
- Bump bitcoin --memory 4g→8g so dbcache=4096 has headroom for mempool +
  connection buffers + I/O. 4g was OOM-prone during heavy IBD.
- Bump electrumx --memory 1g→2g + add CACHE_MB=2048 + MAX_SEND=10MB.
- bitcoin-core CLI args gain -dbcache=4096 -par=0 -maxconnections=125.
- bitcoin-knots manifest matched (1024MB pruned / 4096MB full + par=0).

Future v2: host-RAM-aware dbcache scaling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v1.7.47-alpha
2026-04-29 15:47:51 -04:00
archipelago
03b7966c38 chore: release v1.7.46-alpha
Follow-up to v1.7.45-alpha closing the remaining tasks identified by the
resilience sweeps + the new bitcoin orphan / install-fail-vanish bugs.

User-visible:
- Health monitor: stop paging on orphaned containers from variant switches
- Install fail: card stays visible (was vanishing) with error message
- Stack pull progress: interpolate 20→70% (was stuck at 20%)
- docker.io → lfg2025 mirror: bitcoin/gitea/nextcloud/valkey

Internal:
- Resilience harness — install-wait uses expected_containers_for, ui+auth
  probes retry with 60s backoff, dep-snapshot fix
- InstallProgress gains optional `message` field (frontend renders it
  when phase is None)

binary  $(stat -c %s releases/v1.7.46-alpha/archipelago)  sha256:$(sha256sum releases/v1.7.46-alpha/archipelago | awk '{print $1}')
tarball $(stat -c %s releases/v1.7.46-alpha/archipelago-frontend-1.7.46-alpha.tar.gz)  sha256:$(sha256sum releases/v1.7.46-alpha/archipelago-frontend-1.7.46-alpha.tar.gz | awk '{print $1}')

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v1.7.46-alpha
2026-04-29 14:50:33 -04:00
archipelago
68245142b5 release: v1.7.45-alpha artifacts and manifest
binary  41,618,344 bytes  sha256:ca1958b0f420cc6e73aa4bc161e20ebe7750e933888368394ad17a3f3a36cfad
tarball 77,025,110 bytes  sha256:59d538768e92a1cd726afd272838dbdd581c87780140792b2818434ef2ae7b81

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v1.7.45-alpha
2026-04-29 12:43:22 -04:00
archipelago
dacdab9f6e chore: release v1.7.45-alpha
Resilience-validated release. Three full sweeps of the new resilience
harness against .228 confirm no shipstoppers.

Big user-visible:
- Bitcoin RPC auth durably correct via host-rendered nginx.conf bind-mount,
  replaces fragile post-start exec that failed under restricted-cap rootless
  podman ("crun: write cgroup.procs: Permission denied")
- Multi-container stack installs (indeedhub, immich, btcpay, mempool) now
  emit phase events at every boundary so the progress bar advances
- Apps no longer vanish from the dashboard mid-install (absent-scanner skips
  packages in transitional states)
- Indeedhub fresh installs work end-to-end (was 8500+ restart loop): five
  missing env vars (DATABASE_PORT, QUEUE_HOST, QUEUE_PORT,
  S3_PRIVATE_BUCKET_NAME, AES_MASTER_SECRET) added to install code
- Tailscale install fixed: --entrypoint string was being passed as a single
  shell-line arg; switched to custom_args array
- Catalog cleaned of broken entries (dwn, endurain, ollama removed; nextcloud
  restored on docker.io)
- Bitcoin Core update path uses correct image (was looking for nonexistent
  lfg2025/bitcoin:28.4)
- ISO installs now allocate swap on the encrypted data partition

Infra:
- New resilience harness (scripts/resilience/) — black-box state-machine
  tester, every app × every transition. Run before each release.

Sweep #3 final: PASS 107 / FAIL 12 / SKIP 14. The 12 fails are 1 cosmetic
(homeassistant trusted_hosts), 8 harness/timing false-positives, and 3
non-shipstopper tracked items. Down from 23 in baseline sweep #1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:31:45 -04:00
archipelago
6c970dc969 chore: release v1.7.44-alpha v1.7.44-alpha 2026-04-28 15:03:04 -04:00
archipelago
43de3b73b2 feat(orchestrator): complete container migration and release hardening 2026-04-28 15:00:58 -04:00
archipelago
ce39430b33 feat(self-update): sync and rebuild UI containers on OTA
self-update.sh previously rebuilt only the backend binary and Vue
frontend. The custom UI containers (archy-bitcoin-ui, archy-lnd-ui,
archy-electrs-ui) were left untouched forever. That meant any change to
docker/<ui>/{Dockerfile, nginx.conf, index.html, ...} never reached a
running node through OTA; it required a manual SSH + rebuild. This is
exactly why the lnd-ui port fix didnt reach .228 in v1.7.43-alpha.

Add a sync-and-rebuild stage:

  1. Hash each docker/<ui>/ tree (content-only, path-stable via
     `cd && find` so src and dst compare equal when identical).
  2. rsync changed trees to /opt/archipelago/docker/<ui>/.
  3. For each changed UI: rebuild image as the archipelago user
     (rootless podman), then stop+remove+recreate the container using
     the canonical spec from scripts/container-specs.sh. Port mappings,
     caps, memory, and security opts all come from the spec, so the
     runtime cant drift from the tree.

Also install first-boot-containers.sh into /opt/archipelago/scripts/ so
a later reconciler run or reboot picks up current orchestration logic.

Idempotent: if no UI tree changed since the last update, the whole stage
is a no-op beyond the hash compare. Verified end-to-end on .228 with a
synthetic change to lnd-ui: detection, sync, build, recreate, and HTTP
200 on both the direct container port and the host-nginx /app/lnd/
proxy.
2026-04-23 15:48:53 -04:00
archipelago
72dec5aaa5 fix(lnd-ui): align container port across all specs
The LND UI container was unreachable on .228 after the v1.7.43-alpha
deploy because three sources of truth disagreed on which port nginx
listens on inside the container:

  - docker/lnd-ui/nginx.conf        listen 8081
  - docker/lnd-ui/Dockerfile        EXPOSE 8080
  - apps/lnd-ui/manifest.yml        host networking, ports: []
  - scripts/first-boot-containers.sh  -p 8081:8080
  - scripts/deploy-to-target.sh        -p 8081:80     (de-facto)
  - scripts/deploy-tailscale.sh        -p 8081:80
  - scripts/container-specs.sh        SPEC_PORTS=8081:80

Result: podman published host 8081 to container port 80, but no one was
listening on 80 inside, so connections were reset. Canonicalize on
container:80 with host:8081 publish, matching the three deploy paths
already in agreement.

Changes:
  - docker/lnd-ui/nginx.conf: listen 8081 -> listen 80
  - docker/lnd-ui/Dockerfile: EXPOSE 8080 -> EXPOSE 80
  - apps/lnd-ui/manifest.yml: replace host-network (never true) with
    bridge networking and explicit 8081:80 port mapping, correcting a
    documentation-vs-reality mismatch
  - scripts/first-boot-containers.sh: -p 8081:8080 -> -p 8081:80, and
    fix the internal-port comment

Verified on .228 after rebuild: curl http://127.0.0.1:8081/ returns HTTP
200 and the /app/lnd/ host-nginx proxy resolves cleanly.
2026-04-23 15:42:49 -04:00
archipelago
83aacdf209 chore(release): archive ISO build recipes, tarball-only releases
Releases no longer ship as bootable ISOs. Archipelago updates are
distributed as the backend binary plus a frontend tarball referenced by
releases/manifest.json. Nodes OTA-update via scripts/self-update.sh.

Filebrowser and AIUI remain bundled inside the frontend tarball and
deployed atomically, verified present in v1.7.43-alpha release artifact
(189 AIUI files, filebrowser-client bundle).

Archived under image-recipe/_archived/ (resurrectable if ISO distribution
is reintroduced):
  - build-auto-installer-iso.sh
  - build-unbundled-iso.sh
  - test-iso-qemu.sh
  - scripts/convert-iso-to-disk.sh
  - BUILD-ISO-STATUS.md, ISO-BUILD-CHECKLIST.md
  - branding/isohdpfx.bin
  - .gitea/workflows/build-iso-dev.yml

Updated release process docs to drop ISO references:
  - scripts/create-release.sh (next-steps text)
  - docs/BETA-RELEASE-CHECKLIST.md
  - docs/hotfix-process.md
  - README.md
2026-04-23 15:36:00 -04:00
archipelago
4ece2c1e7e release: v1.7.43-alpha artifacts and manifest
Backend binary, frontend tarball (with AIUI bundled), and updated
manifest.json pointing fleet updaters at the new download URLs.
v1.7.43-alpha
2026-04-23 13:25:21 -04:00
archipelago
a672f45b00 docs(release-notes): v1.7.43-alpha bullet for AIUI preservation fix 2026-04-23 13:22:28 -04:00
archipelago
a76e7604a0 chore(release): bump version to 1.7.43-alpha 2026-04-23 13:21:58 -04:00
archipelago
84c2c2880a fix(aiui): bundle demo/aiui in self-update and ISO builds so updates never wipe it
Every OTA self-update and every ISO capture was implicitly relying on
/opt/archipelago/web-ui/aiui/ already being present on disk. Any node that
had its web-ui directory atomically swapped (for example by a manual
deployment shipping only neode-ui dist output) lost aiui entirely and the
AI Assistant tab fell through to the "needs to be enabled" placeholder.

self-update.sh: drop the rsync --exclude aiui preservation trick and
instead stage demo/aiui into the freshly-built dist tree before rsync.
demo/aiui in the repo is now the source of truth; every update overwrites
the on-disk copy with a matching version rather than carrying forward
whatever stale bundle happened to survive.

build-auto-installer-iso.sh: prepend demo/aiui to the AIUI search list so
ISO builds from a fresh repo clone pick it up automatically, without
requiring a side-checkout of the AIUI project or a live dev server.

This matches create-release-manifest.sh which already bakes demo/aiui
into the release tarball (lines 86-89).
2026-04-23 13:21:49 -04:00
archipelago
8034d382ee docs(release-notes): v1.7.43-alpha bullets for chunking, avatar, outbox, parser
Four production-code fixes merit user-visible mention: the transport
chunking data-corruption fix (real user-affecting bug for multi-chunk
mesh payloads), the avatar u16 overflow panic (backend crash on certain
seeds), the outbox TTL boundary, and the image-versions parser hardening.
2026-04-23 13:03:49 -04:00
archipelago
5ddc30db1e test: repair stale test fixtures across identity, mesh, update, wallet, fips
Several tests had drifted from the current production behavior:

- identity_manager: create() already auto-provisions a Nostr key, so the
  explicit create_nostr_key() call failed with "already exists". Rewrite
  the test to assert on record.nostr_npub from create() directly.
- mesh/protocol: test_build_app_start read the app name from frame[4..]
  but the v2 layout is [0:marker][1-2:len][3:cmd][4:version][5..:name].
  test_identity_broadcast_roundtrip expected input DID = output DID but
  the v2 decoder derives DID from the ed25519 pubkey, so the roundtrip
  compares against did_key_from_pubkey_hex(&pub) now.
- mesh/bitcoin_relay: test_build_block_header_announcement asserted
  sig.is_some(), but the builder intentionally emits an unsigned envelope
  to fit the 160-byte LoRa limit; assert sig.is_none(). Also widen
  placeholder hashes to the required 64 hex chars (32 bytes).
- update: load_mirrors() now merges default mirrors post-migration, so
  the roundtrip test must assert the custom mirror survives alongside
  the defaults rather than strict equality.
- wallet/cashu: test_proof_c_as_pubkey used hex that is not on the curve;
  replace with the secp256k1 generator point G so parsing succeeds.
- fips: test_status_reports_no_key_pre_onboarding asserted npub.is_none(),
  which fails on dev boxes where the fips daemon is already running. Keep
  the !key_present assertion and drop the npub one.
2026-04-23 13:02:45 -04:00
archipelago
de9995f869 test(credentials): seed identity/node_key in test helper so encrypt/decrypt works
Credentials tests created a fresh tempdir and immediately invoked
encrypt/decrypt, but load_encryption_key reads <dir>/identity/node_key
which did not exist, so every test failed with "node key not found".
Add a test_dir_with_node_key() helper that writes a deterministic 32-byte
key and switch all 8 call sites to it.
2026-04-23 13:02:28 -04:00
archipelago
83dac52410 fix(session): add test-only constructor so tests do not read real sessions
SessionStore::new() reads /var/lib/archipelago/sessions.json, which on
any node with an active dashboard contains live sessions that pollute
test state and cause intermittent failures. Introduce a cfg(test) only
new_for_tests(PathBuf) constructor and switch the test suite to it so
tests always start from a clean tempdir.
2026-04-23 13:02:22 -04:00
archipelago
5439aa8ff1 fix(container/image_versions): reject entries that are not image references
The parser retained any key ending in _IMAGE, so a harmless-looking
variable like NOT_AN_IMAGE="something" would be treated as a pinned
container image. Add a value-shape check: the value must contain both
a registry separator (/) and a tag separator (:) to qualify.
2026-04-23 13:02:15 -04:00