lfg2025/archy

Dorian 64b57dca7d

Build Archipelago ISO (dev) / build-iso (push) Failing after 13m44s

Container Orchestration Tests / unit-tests (push) Failing after 7m30s

Container Orchestration Tests / smoke-tests (push) Has been skipped

fix: overhaul container lifecycle — recovery, health, uninstall, UI state

Container recovery:
- Health monitor: MAX_RESTART_ATTEMPTS 3→10, interval 60s→120s
- Dependency-aware restarts: won't restart services before their deps
- Reset dependent counters when a dependency recovers
- Handle "created" state containers (were invisible to health monitor)
- Added IndeedHub, mempool-api, mysql to tier system
- Crash recovery: podman start timeout 30s→120s with retry
- Podman client: socket timeout 5s→30s, added restart policy

UI state representation:
- Exit code 0 shows "stopped" (gray), not "crashed" (red)
- Exit code 137 shows "killed (OOM)"
- Non-zero exit shows "crashed" (red)
- Added exit_code field to PackageDataEntry

Install/uninstall fixes:
- Install returns error when container doesn't start (was silent success)
- Post-install hooks awaited instead of fire-and-forget tokio::spawn
- Uninstall: graceful rm before force, volume prune, network cleanup
- Uninstall returns error on partial failure (was 200 OK)

Config consistency:
- DB passwords read from /var/lib/archipelago/secrets/ (was hardcoded)
- Bitcoin: added ZMQ ports 28332/28333 for LND block notifications
- IndeedHub port 7777→8190 (was conflicting with strfry)
- Marketplace versions: LND 0.17.4→0.18.4, Mempool 2.5.0→3.0.0

Performance:
- Metrics collector interval 60s→300s (was duplicating health monitor)
- Podman client: proper error propagation instead of unwrap_or_default

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-31 07:03:57 +01:00

Archipelago Container Infrastructure — Critical Issues Report

Date: 2026-03-31
Status: Server .228 rebooted; some apps recovered, many did not. UI showed everything as "crashed" during recovery window.
Purpose: Fix guide for getting container lifecycle to production quality.

Executive Summary

The container system has 7 systemic failures that compound each other:

1. Silent failures everywhere: errors are swallowed with || true, .unwrap_or_default(), and warn-level logs. Nothing actually tells the user (or the system) that something broke.
2. Health checks are fake: manifests define real health checks (HTTP probes, exec checks) but they are never executed. "Healthy" just means podman ps shows "running".
3. Duplicate polling burns CPU: the health monitor and the metrics collector each poll Podman on their own 60-second timer, and both call podman stats. Add crash recovery snapshots, the disk monitor, and frontend polling, and the result is constant subprocess spawning.
4. Uninstall doesn't clean up: no volume removal, no network cleanup, force-kills stateful containers (risking wallet/DB corruption), returns 200 OK on partial failure.
5. Two divergent install paths: first-boot-containers.sh and the Rust RPC installer use different passwords, ports, capabilities, memory limits, and Bitcoin config. They are never in sync.
6. UI misrepresents state: Exited (even clean exit code 0) shows as "crashed". No "recovering" or "starting up" state exists. During boot recovery, UI shows a wall of red/gray "crashed" labels.
7. Dependency-blind restarts: the health monitor restarts services without restarting their dependencies first, so they immediately fail again and burn through the 3-attempt limit.

LIVE EVIDENCE: .228 Reboot on 2026-03-31

After rebooting .228, here's the actual container state 30 minutes later:

Permanently Dead (exceeded 3 restart attempts, abandoned)

Container	Exit Code	Cause
`indeedhub-postgres`	0 (clean)	Shut down by reboot. Health monitor tried 3 restarts, it keeps exiting cleanly. Once abandoned, all dependent services die too.
`indeedhub-redis`	0	Same — clean exit, 3 failed restart attempts, abandoned
`indeedhub-minio`	0	Same
`indeedhub-relay`	0	Same
`indeedhub`	0	Same
`indeedhub-api`	1	Can't resolve hostname `indeedhub-postgres` (postgres is dead, DNS entry gone from network)
`jellyfin`	137 (SIGKILL, likely OOM)	"Failed to create CoreCLR": memory limit too low for the .NET runtime. Exit 137 is 128 + 9 (SIGKILL), which is consistent with an OOM kill. 3 attempts exhausted.

Crash-Looping (still failing on every restart)

Container	Cause
`mempool-api`	`ECONNREFUSED 10.89.0.42:3306` — DB (`archy-mempool-db`) just restarted, not ready yet
`portainer`	"database schema version does not align with server version" — image upgraded, DB not migrated. Will NEVER recover.
`photoprism`	"Failed creating test file in storage folder" — volume permission issue (rootless UID mapping)

Never Started (stuck in "Created" state)

Container	Cause
`archy-mempool-web`	"cannot assign requested address" — network binding failure
`fedimint`	Same network error

Running but Unhealthy

Container	Notes
`homeassistant`	Up 14 min, health check failing
`searxng`	Up 13 min, health check failing
`onlyoffice`	Up 10 min, health check failing

Actually Recovered (healthy)

filebrowser, bitcoin-knots, vaultwarden, nginx-proxy-manager, archy-btcpay-db, lnd, electrumx, grafana

Key Observations

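All containers on .228 report an unless-stopped restart policy, but it did not help here: these containers were shut down by the reboot (exit code 0), and Podman's documentation notes that restart policies do not bring containers back after a system reboot. The health monitor is therefore the only restart mechanism, and it gives up after 3 attempts.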
The entire IndeedHub stack died because postgres was abandoned first. Once postgres hit 3 restart attempts, every dependent service (api, redis, minio, relay, main) also failed and hit their own 3-attempt limit. No dependency awareness.
Containers in "Created" state were never even started — some kind of network assignment failure during creation. The health monitor doesn't handle "Created" state containers.
The UI showed ALL apps as "crashed" during the first few minutes, even the ones that eventually recovered. This is because Exited state (even exit code 0) maps to the label "crashed" in appsConfig.ts.

Problem 1: Containers Don't Start or Recover After Reboot

Confirmed: after the .228 reboot on 2026-03-31, the UI showed all apps as crashed and many never recovered.

Root Causes

A. Crash recovery has a 30-second timeout that's too short

File: core/archipelago/src/crash_recovery.rs:265-271

let result = tokio::time::timeout(
    std::time::Duration::from_secs(30),
    tokio::process::Command::new("podman").args(["start", &record.name]).output(),
).await;

On a cold boot with many containers, Podman is under load. 30 seconds is not enough. If it times out, the container is skipped — no retry.
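The missing retry can be sketched as a small helper. This is a minimal, synchronous illustration using only the standard library, not the project's code: `retry_with_backoff` and its parameters are hypothetical names, and the real recovery path would wrap its async `podman start` call in an equivalent loop.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Runs `op` until it succeeds or `max_attempts` is reached (always at least once).
/// The delay triples after every failed attempt: base, 3x base, 9x base, ...
/// `op` receives the 1-based attempt number so callers can log it.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error instead of skipping silently.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                sleep(delay);
                delay *= 3;
                attempt += 1;
            }
        }
    }
}
```

With three attempts and a 10-second base delay, a container that fails twice gets its third attempt 40 seconds after the first, and a container that never starts produces an error the caller has to handle rather than being dropped.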

B. If `podman ps` itself times out, recovery finds zero containers

File: core/archipelago/src/crash_recovery.rs:318

The podman ps -a call to discover stopped containers has a 30-second timeout. On a busy system post-reboot, this can time out. Result: all_names is empty, and recovery silently exits having started nothing.

C. Boot tier ordering uses a catch-all that misses dependencies

File: core/archipelago/src/crash_recovery.rs:374-385

fn container_boot_tier(name: &str) -> u8 {
    match id {
        "btcpay-db" | "mempool-db" | ... => 0,  // databases
        "bitcoin-knots" | ... => 1,               // bitcoin
        "lnd" | "electrumx" | ... => 2,           // depends on bitcoin
        "mempool-web" | ... => 4,                  // frontend
        _ => 3,  // EVERYTHING ELSE - may start before its dependencies
    }
}

Any app not explicitly listed gets tier 3, which may be before its dependencies are ready.
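One way to avoid a catch-all is to derive each tier from declared dependencies. The sketch below assumes a dependency map is available (for example, loaded from manifests); `boot_tiers` and the map shape are hypothetical and not taken from crash_recovery.rs.

```rust
use std::collections::HashMap;

/// Computes a boot tier per container: 0 for containers without dependencies,
/// otherwise one more than the highest tier among its dependencies.
/// Returns Err on a dependency cycle or a dependency that is not in the map.
pub fn boot_tiers(
    deps: &HashMap<String, Vec<String>>,
) -> Result<HashMap<String, u32>, String> {
    fn visit(
        name: &str,
        deps: &HashMap<String, Vec<String>>,
        tiers: &mut HashMap<String, u32>,
        in_progress: &mut Vec<String>,
    ) -> Result<u32, String> {
        if let Some(&tier) = tiers.get(name) {
            return Ok(tier);
        }
        if in_progress.iter().any(|n| n == name) {
            return Err(format!("dependency cycle involving {name}"));
        }
        let children = deps
            .get(name)
            .ok_or_else(|| format!("unknown container {name}"))?;
        in_progress.push(name.to_string());
        let mut tier: u32 = 0;
        for child in children {
            tier = tier.max(visit(child, deps, tiers, in_progress)? + 1);
        }
        in_progress.pop();
        tiers.insert(name.to_string(), tier);
        Ok(tier)
    }

    let mut tiers = HashMap::new();
    let mut in_progress = Vec::new();
    for name in deps.keys() {
        visit(name, deps, &mut tiers, &mut in_progress)?;
    }
    Ok(tiers)
}
```

Because an unlisted dependency is an error rather than a default tier, a new app cannot silently land ahead of the services it needs.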

D. First-boot script swallows ALL errors

File: scripts/first-boot-containers.sh:8 (no set -e)

48+ commands have || true appended. Every podman run failure is silently ignored. The script always exits 0 and reports "complete" to systemd even if 50% of containers failed.

E. Install RPC returns success before container is actually running

File: core/archipelago/src/api/rpc/package/install.rs:260-294

After container creation, the installer polls for 30 seconds (6 checks x 5 seconds). If the container is still in "created" or "starting" state after 30 seconds:

if i == 5 {
    debug!("Container {} health check timeout (30s) -- continuing anyway");
}

It logs at debug level and returns success. The user sees "installed" but the container never actually started.
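A polling loop that fails loudly could look like the following sketch. It is synchronous and std-only for illustration; `ContainerState` here is a local stand-in enum, and `wait_until_running` with its `query_state` callback is a hypothetical shape, not the installer's actual API.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Local stand-in for the container states mentioned in the report.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ContainerState {
    Created,
    Starting,
    Running,
    Exited,
}

/// Polls `query_state` up to `checks` times, `interval` apart.
/// Returns Ok only if the container reaches Running; an exit during
/// startup or a timeout becomes an Err the RPC layer can return.
pub fn wait_until_running(
    name: &str,
    checks: u32,
    interval: Duration,
    mut query_state: impl FnMut() -> ContainerState,
) -> Result<(), String> {
    let mut last = ContainerState::Created;
    for i in 0..checks {
        last = query_state();
        match last {
            ContainerState::Running => return Ok(()),
            ContainerState::Exited => {
                return Err(format!("container {name} exited during startup"));
            }
            _ => {
                // Sleep only between checks, not after the final one.
                if i + 1 < checks {
                    sleep(interval);
                }
            }
        }
    }
    Err(format!(
        "container {name} not running after {checks} checks (last state: {last:?})"
    ))
}
```

Called with 6 checks and a 5-second interval, this keeps roughly the current polling budget while turning the "continuing anyway" branch into an error.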

Fixes Required

Increase crash recovery timeout to 120s and add retry with backoff (3 attempts per container)
Increase podman ps timeout to 60s during boot recovery
Replace tier catch-all — every container must be explicitly listed or derived from manifest dependencies
Remove || true from critical commands in first-boot-containers.sh. Use proper error handling: log the error, record the failure, continue to next container, but report actual failures at the end
Install RPC must return failure if container isn't running after timeout, not silently succeed
Add --restart unless-stopped to container creation in the Podman client (core/container/src/podman_client.rs:303-335) — currently missing, so Podman itself never auto-restarts crashed containers

Problem 2: Health Checks Are Fake

Root Causes

A. "Healthy" just means "running" — application health is never checked

File: core/archipelago/src/container/dev_orchestrator.rs:239-249

pub async fn get_health_status(&self, app_id: &str) -> Result<String> {
    match status.state {
        ContainerState::Running => Ok("healthy".to_string()),  // <-- THIS IS THE ENTIRE CHECK
        ContainerState::Stopped | ContainerState::Exited => Ok("unhealthy".to_string()),
        ...
    }
}

A container can be "running" but the application inside is completely broken. This is reported as "healthy".

B. Manifest health checks exist but are never executed

All 30+ app manifests in image-recipe/build/debian-iso/custom/archipelago/apps/*/manifest.yml define health checks like:

health_check:
  type: http
  endpoint: http://localhost:4080
  path: /api/health
  interval: 30s
  timeout: 5s
  retries: 3

The HealthMonitor struct at core/container/src/health_monitor.rs can execute these checks. But it is never instantiated. No code path creates a HealthMonitor from the manifest health check definitions.
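To show how the manifest fields could drive a verdict, here is a small sketch of the bookkeeping around a probe. `HealthTracker`, `Health`, and `parse_secs` are hypothetical names, not the existing HealthMonitor API, and the probe itself (HTTP or exec) is left to the caller.

```rust
/// Verdict after recording one probe result.
#[derive(Debug, PartialEq)]
pub enum Health {
    Healthy,
    /// Probe failed, but fewer than `retries` times in a row.
    Degraded,
    Unhealthy,
}

/// Tracks consecutive probe failures for one container.
pub struct HealthTracker {
    retries: u32,
    consecutive_failures: u32,
}

impl HealthTracker {
    /// `retries` corresponds to the manifest's `retries` field.
    pub fn new(retries: u32) -> Self {
        Self { retries, consecutive_failures: 0 }
    }

    /// Records one probe result and returns the current verdict.
    pub fn record(&mut self, probe_ok: bool) -> Health {
        if probe_ok {
            self.consecutive_failures = 0;
            return Health::Healthy;
        }
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.retries {
            Health::Unhealthy
        } else {
            Health::Degraded
        }
    }
}

/// Parses manifest durations written as whole seconds, such as "30s" or "5s".
/// Other units are not handled in this sketch.
pub fn parse_secs(text: &str) -> Option<u64> {
    text.strip_suffix('s')?.parse().ok()
}
```

With `retries: 3`, a single slow response marks the app as degraded rather than unhealthy, and one successful probe clears the failure count.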

C. Health status is never pushed to the frontend via WebSocket

File: core/archipelago/src/data_model.rs:120-127

pub struct PackageDataEntry {
    pub health: Option<String>,  // Field exists but is NEVER POPULATED
}

The health field in the data model is always None. Frontend can only get health via explicit RPC call, which it almost never makes.

D. Frontend never polls health status

File: neode-ui/src/stores/container.ts:169-175

fetchHealthStatus() is only called after startContainer() and startBundledApp(). There is no setInterval, no periodic polling, no watch. After the initial call, health status is never refreshed.

Fixes Required

Wire up manifest health checks — instantiate HealthMonitor from manifest definitions, run actual HTTP/exec probes instead of just checking podman ps
Populate the health field in PackageDataEntry so WebSocket pushes real health status to frontend
Add 30-second health polling in the frontend container store (with backoff to 60s when all healthy)
Fix get_health_status() in dev_orchestrator to call actual health checks, not just check container state

Problem 3: CPU Exhaustion from Duplicate Polling

Root Causes

A. Two independent monitors both poll Podman every 60 seconds, each with its own `podman stats` calls

Health monitor: core/archipelago/src/health_monitor.rs:17 — CHECK_INTERVAL_SECS = 60
- Runs podman ps -a --format json (line 305-323)
- Runs podman stats --no-stream every 5 cycles (line 442-450)
Metrics collector: core/archipelago/src/monitoring/mod.rs:28 — 60-second interval
- Runs podman stats --no-stream --format json independently (collector.rs:220-224)

These are not coordinated. Both spawn separate subprocesses. On a system with 15+ containers, each podman stats call is expensive.

B. Total subprocess spawning frequency

Component	Interval	What it runs
Health monitor	60s	`podman ps`, `podman stats` (every 5th), restart attempts
Metrics collector	60s	`podman stats` (duplicate!)
Crash recovery snapshot	120s	`podman ps`
Disk monitor	300s	`df`, `sudo dmesg`, potentially `podman image prune`
Telemetry	900s	`podman stats` (another duplicate)
Systemd watchdog	120s	sd_notify ping
Frontend fleet polling	60s	RPC calls that trigger more podman commands

That's roughly one podman subprocess every 10-15 seconds on average, plus all the triggered operations.
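As a rough cross-check of that figure, the fixed backend timers from the table can be summed as rates. The sketch below counts only the timers that always run a podman command (podman ps at 60s, the health monitor's podman stats at every 5th cycle or 300s, metrics collector stats at 60s, snapshot at 120s, telemetry at 900s); the frontend-triggered RPCs and restart attempts are left out because the table gives no count for them. `average_spawn_period` is a hypothetical helper.

```rust
/// Average seconds between spawns, given each timer's interval in seconds.
/// Each timer contributes a rate of 1/interval; the period is 1/sum(rates).
pub fn average_spawn_period(intervals_secs: &[f64]) -> f64 {
    let rate: f64 = intervals_secs.iter().map(|i| 1.0 / i).sum();
    1.0 / rate
}
```

For `[60.0, 300.0, 60.0, 120.0, 900.0]` this gives 1800/83, about 21.7 seconds, so the timers alone account for one podman subprocess roughly every 22 seconds. The 10-15 second estimate is plausible once frontend-triggered commands and restart attempts are included, but it depends on those unmeasured sources.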

C. No restart policy means polling-driven restarts

File: core/container/src/podman_client.rs:303-335

Container creation spec does NOT include RestartPolicy. Podman itself never restarts crashed containers. Instead, the health monitor's 60-second poll detects the crash and attempts a restart. This is far more CPU-intensive than Podman's built-in restart mechanism.

D. Health monitor restart attempts with exponential backoff still spawn processes

When a container fails, the health monitor tries restarts at 10s, 30s, 90s backoff. Each attempt spawns podman start, podman inspect, etc. If multiple containers are unhealthy, this multiplies.

Fixes Required

Deduplicate podman stats — create a shared cache layer. One component fetches, others read from cache (TTL: 30s)
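Add RestartPolicy: unless-stopped to all container creation so Podman handles restarts natively instead of relying on polling. A retry cap such as MaxRetryCount: 5 applies to the on-failure policy, so use on-failure instead if a cap is wanted.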
Increase health monitor interval to 120s (60s is too aggressive when health checks are just podman ps)
Remove duplicate podman stats call from metrics collector — share data with health monitor
Make frontend fleet polling viewport-aware — only poll when user is actually viewing the fleet page
Batch all container queries — use a single podman ps -a --format json per check cycle, shared across all consumers
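The shared cache layer from the first fix could be as small as the sketch below. `TtlCache` is a hypothetical type; it takes the current time as a parameter so it can be tested without sleeping, and in the backend it would likely sit behind a mutex shared by the health monitor, metrics collector, and telemetry.

```rust
use std::time::{Duration, Instant};

/// Holds the most recent fetch result and reuses it until `ttl` has elapsed.
pub struct TtlCache<T> {
    ttl: Duration,
    entry: Option<(Instant, T)>,
}

impl<T: Clone> TtlCache<T> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entry: None }
    }

    /// Returns the cached value if it is younger than the TTL; otherwise
    /// calls `fetch`, stores the result, and returns it. A failed fetch is
    /// propagated and leaves any previous entry untouched.
    pub fn get_or_fetch<E>(
        &mut self,
        now: Instant,
        fetch: impl FnOnce() -> Result<T, E>,
    ) -> Result<T, E> {
        if let Some((fetched_at, value)) = &self.entry {
            if now.duration_since(*fetched_at) < self.ttl {
                return Ok(value.clone());
            }
        }
        let value = fetch()?;
        self.entry = Some((now, value.clone()));
        Ok(value)
    }
}
```

With a 30-second TTL, any number of consumers asking within the same window trigger a single podman stats subprocess.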

Problem 4: Uninstall Doesn't Work

Root Causes

A. No volume removal

File: core/archipelago/src/api/rpc/package/runtime.rs:172-289

The uninstall function stops containers, removes containers, releases ports, and attempts data directory cleanup. It never removes Podman volumes. Orphaned volumes accumulate forever.

B. No network cleanup

File: core/archipelago/src/api/rpc/package/runtime.rs:172-289

Multi-container stacks create networks (archy-net, immich-net, penpot-net) during install (stacks.rs:89, 211). These are never cleaned up during uninstall. Leftover networks can prevent reinstallation.

C. Force-kills stateful containers without graceful shutdown

File: core/archipelago/src/api/rpc/package/runtime.rs:226

let rm_out = tokio::process::Command::new("podman")
    .args(["rm", "-f", name])  // -f = force kill
    .output().await;

The code defines proper shutdown timeouts (Bitcoin: 600s, LND: 330s, databases: 120s) but only uses them for stop. The rm -f that follows ignores these timeouts and force-kills immediately. This risks corrupting Bitcoin's UTXO set, LND channel state, or database WAL.
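A graceful sequence would issue a timed stop and then a plain rm. The sketch below only builds the podman argument lists; the timeout values are the ones quoted above, while the name matching and the 30-second default are assumptions made for illustration, and `removal_commands` is a hypothetical helper.

```rust
/// Graceful shutdown timeout in seconds. 600/330/120 come from the report;
/// the name matching and the 30s default are illustrative assumptions.
pub fn shutdown_timeout_secs(name: &str) -> u32 {
    if name.contains("bitcoin") {
        600
    } else if name == "lnd" {
        330
    } else if name.ends_with("-db") || name.contains("postgres") || name.contains("mysql") {
        120
    } else {
        30
    }
}

/// Argument lists for `podman`, in the order they should run:
/// a timed stop first, then a removal without -f.
pub fn removal_commands(name: &str) -> Vec<Vec<String>> {
    let timeout = shutdown_timeout_secs(name).to_string();
    vec![
        vec!["stop".into(), "-t".into(), timeout, name.into()],
        vec!["rm".into(), name.into()],
    ]
}
```

Because the second command omits -f, a container that is somehow still running makes the rm fail visibly instead of being killed mid-write.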

D. Returns 200 OK even on partial failure

File: core/archipelago/src/api/rpc/package/runtime.rs:268-289

Ok(serde_json::json!({
    "status": if errors.is_empty() { "uninstalled" } else { "partial" },
    ...
}))

Returns HTTP 200 with "partial" status. Frontend at neode-ui/src/views/apps/useAppsActions.ts:74 doesn't check for "partial" — it deletes the app from the UI regardless.

E. Data directory cleanup requires sudo and fails silently

File: core/archipelago/src/api/rpc/package/runtime.rs:256-265

let rm_out = tokio::process::Command::new("sudo")
    .args(["rm", "-rf", dir]).output().await;
if let Ok(o) = rm_out {
    if !o.status.success() {
        tracing::warn!(...);  // Warning only, continues
    }
}

If sudo isn't configured or fails, data remains on disk but UI shows "uninstalled".

F. Container name detection has gaps

File: core/archipelago/src/api/rpc/package/config.rs:287-340

Container names are hardcoded patterns. If a container was created with a different naming convention (e.g., by first-boot-containers.sh vs RPC installer), it won't be found and won't be removed.

Fixes Required

Add podman volume rm for all volumes associated with the app after container removal
Add network cleanup — remove app-specific networks after all containers on that network are gone
Use podman stop -t {timeout} then podman rm (without -f) — respect graceful shutdown timeouts, especially for Bitcoin/LND/databases
Return an error (not 200) when uninstall has failures. Frontend must check and display errors
Surface "partial" failures to the user with specific error messages
Unify container naming — derive names from a single source (manifest), not hardcoded patterns in multiple files

Problem 5: Two Divergent Install Paths

The first-boot bash script and the Rust RPC installer create containers with different configurations. This is a major source of bugs.

Specific Divergences

A. Database passwords

First-boot (scripts/first-boot-containers.sh:118-127): Generates random passwords with openssl rand -base64 24, stores in /var/lib/archipelago/secrets/
Rust RPC (core/archipelago/src/api/rpc/package/config.rs:456,484,514-515,610): Uses hardcoded "btcpaypass", "mempoolpass", "rootpass", "immichpass"

Result: Apps installed via RPC after first-boot can't connect to databases because passwords don't match.
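Reading the generated password instead of hardcoding it could look like this sketch. The one-file-per-secret layout and the file name used in the usage note are assumptions, since the report does not show how first-boot names the files; `read_secret` is a hypothetical helper.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reads one secret from `secrets_dir`, assuming one file per secret.
/// Trailing whitespace is trimmed because shell redirection usually
/// leaves a newline. A missing or empty file is an error, never a default.
pub fn read_secret(secrets_dir: &Path, name: &str) -> io::Result<String> {
    let raw = fs::read_to_string(secrets_dir.join(name))?;
    let secret = raw.trim_end().to_string();
    if secret.is_empty() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("secret {name} is empty"),
        ));
    }
    Ok(secret)
}
```

A call such as `read_secret(Path::new("/var/lib/archipelago/secrets"), "btcpay-db-password")` (file name hypothetical) would then replace the literal "btcpaypass", and a missing file would fail the install rather than produce a mismatched password.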

B. Bitcoin configuration

First-boot (scripts/first-boot-containers.sh:295-313): Dynamically sets -prune=550 on small disks, -txindex=1 on large disks
Rust RPC (core/archipelago/src/api/rpc/package/config.rs:415-420): No custom args at all

Result: Bitcoin installed via RPC has no pruning or txindex regardless of disk size.
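The disk-aware choice could be ported to the Rust installer along these lines. The flags are the two named above; the threshold is passed in by the caller because the report does not state the cutoff that first-boot uses, and `bitcoin_args` is a hypothetical function name.

```rust
/// Picks Bitcoin startup flags from total disk size.
/// `prune_below_gb` is the caller-supplied cutoff; the report does not
/// say which value first-boot-containers.sh uses.
pub fn bitcoin_args(disk_total_gb: u64, prune_below_gb: u64) -> Vec<&'static str> {
    if disk_total_gb < prune_below_gb {
        // Small disk: keep only recent blocks.
        vec!["-prune=550"]
    } else {
        // Large disk: keep the full chain and build the transaction index.
        vec!["-txindex=1"]
    }
}
```

The two flags are returned as alternatives rather than combined, since Bitcoin does not allow a transaction index on a pruned node.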

C. ZMQ configuration for LND

First-boot (scripts/first-boot-containers.sh:100-114): Bitcoin.conf generated without ZMQ publisher settings
Rust RPC (core/archipelago/src/api/rpc/package/config.rs:438-439): LND configured to connect to tcp://bitcoin-knots:28332 and tcp://bitcoin-knots:28333

Result: LND can't receive block notifications from Bitcoin. LND is pointed at ports 28332/28333, but neither path configures Bitcoin to publish ZMQ on them.

D. Port conflicts

First-boot (scripts/first-boot-containers.sh:813,835): Both strfry and indeedhub bind to host port 7777
Rust RPC (core/archipelago/src/api/rpc/package/config.rs:734): IndeedHub uses 8190:3000

Result: On first-boot, whichever of strfry/indeedhub starts second fails. Via RPC, different port entirely.

E. Memory limits

First-boot (scripts/first-boot-containers.sh:253-283): Ollama gets 1g on low-mem systems
Rust RPC (core/archipelago/src/api/rpc/package/config.rs:245-280): Ollama gets 4g always

Result: Same app gets different resource limits depending on how it was installed.

F. Version mismatches in marketplace UI

scripts/image-versions.sh:17: LND image is v0.18.4-beta
neode-ui/src/views/marketplace/marketplaceData.ts:155: Shows 0.17.4
scripts/image-versions.sh:21-22: Mempool images are v3.0.0
neode-ui/src/views/marketplace/marketplaceData.ts:177: Shows 2.5.0

Fixes Required

Single source of truth for container config — Rust config must read passwords from /var/lib/archipelago/secrets/, not hardcode them
Add ZMQ config to Bitcoin startup in both paths: zmqpubrawblock=tcp://0.0.0.0:28332 and zmqpubrawtx=tcp://0.0.0.0:28333
Fix port 7777 conflict — assign unique ports to strfry and indeedhub
Add disk-aware Bitcoin config to Rust installer (prune/txindex based on disk size)
Sync memory limits between first-boot and Rust config
Update marketplace version strings to match actual image versions in image-versions.sh
Long-term: eliminate first-boot-containers.sh — have the backend handle all container creation using the same Rust code path

Problem 6: Post-Install Hooks Run Async and Fail Silently

File: core/archipelago/src/api/rpc/package/install.rs:541-625

Post-install hooks (setting FileBrowser password, configuring NextCloud, etc.) are spawned as background tasks:

tokio::spawn(async move {
    let _ = tokio::fs::create_dir_all(secret_dir).await;
    let _ = tokio::fs::write(...).await;
});

The install RPC returns success before hooks complete. If a hook fails (network timeout, service not ready), the error is logged but the user is told installation succeeded. Credentials aren't set, configs aren't applied.

Fix Required

Await post-install hooks before returning success, or return a "configuring" status and let the frontend poll for completion.

Problem 7: Podman Client Swallows Errors

File: core/container/src/podman_client.rs

A. JSON serialization failures return empty strings (line 182-183)

let body_str = body.map(|b| serde_json::to_string(&b).unwrap_or_default()).unwrap_or_default();

B. Container ID parsing failures return empty string (line 344-348)

let id = result["Id"].as_str().unwrap_or("").to_string();
Ok(id)  // Empty string = success?
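A stricter version of the ID handling might look like the sketch below. It takes the `Option<&str>` that `result["Id"].as_str()` already yields, so it stays free of JSON dependencies; `parse_container_id` and the error messages are hypothetical.

```rust
/// Turns the optional "Id" field into a validated container ID.
/// A missing or blank ID is an error, never an empty success value.
pub fn parse_container_id(id_field: Option<&str>) -> Result<String, String> {
    match id_field.map(str::trim) {
        Some(id) if !id.is_empty() => Ok(id.to_string()),
        Some(_) => Err("Podman returned an empty container Id".to_string()),
        None => Err("Podman response has no string Id field".to_string()),
    }
}
```

The call site would become `parse_container_id(result["Id"].as_str())?`, assuming the surrounding function's error type can be built from a String.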

C. Socket timeout is only 5 seconds (line 154-160)

On a busy system or during boot, Podman socket may take >5s to respond. Every API call fails. No retry logic.

Fixes Required

Replace .unwrap_or_default() with proper error propagation using ?
Return Err when container ID is empty
Increase socket timeout to 15-30s
Add retry with backoff (3 attempts) on socket connection

Problem 8: UI Misrepresents Container State

Root Causes

A. "Exited" always displays as "Crashed" — even for clean shutdowns

File: neode-ui/src/views/apps/appsConfig.ts:119-146

getStatusLabel(state, health):
  - "exited" → "crashed"     // <-- THIS IS THE PROBLEM

Every container that exited — whether from a clean reboot (exit 0), OOM kill (exit 137), or app error (exit 1) — shows the same "crashed" label. After a reboot, the UI is a wall of "crashed" labels even though containers are in the process of starting up.

B. No "recovering" or "boot in progress" state exists

File: core/archipelago/src/data_model.rs:103-119

PackageState enum has Starting, but it's only set during explicit user start actions, not during automatic crash recovery. During boot recovery, containers transition from Exited → Running without ever passing through Starting, so the UI never shows a spinner or "starting up" message.

C. Backend skips sub-containers from package listing, so their state is invisible

File: core/archipelago/src/container/docker_packages.rs:39-117

The excluded_services list filters out backend services like mempool-db, btcpay-db, nbxplorer, penpot-postgres, etc. UI containers ending in -ui are also skipped. These containers are invisible to the user even when they're the actual cause of a stack failure (e.g., indeedhub-postgres being dead kills the entire IndeedHub stack, but only indeedhub-api errors are visible).

D. No distinction between "needs manual intervention" and "will recover soon"

The UI shows the same visual treatment for:

Portainer (DB migration error — will NEVER recover without manual intervention)
mempool-api (DB not ready yet — will recover in 30 seconds)
IndeedHub (dependencies abandoned — won't recover until deps are manually restarted)

Fixes Required

Differentiate exit codes: Exit 0 = "stopped" (gray), Exit non-zero = "crashed" (red), Exit 137 = "killed (OOM)" (red with warning)
Add a "recovering" state: During boot/crash recovery window (first 5 minutes after backend start), show "Starting up..." instead of "crashed" for exited containers
Show sub-container health: When a parent app is unhealthy, show which sub-service caused the failure (e.g., "IndeedHub: postgres is down")
Distinguish recoverable from permanent failures: After health monitor gives up (3 attempts), change label to "Needs attention" instead of keeping "crashed"
Add recovery progress indicator: During boot, show "Recovering containers: 15/22 started" on the dashboard
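The first two fixes reduce to a small mapping from exit code and backend uptime to a label. The sketch below expresses it in Rust, as it might be computed server-side before being pushed to the UI; `Label` and `exited_label` are hypothetical names, and the same table could equally live in appsConfig.ts.

```rust
/// Display label for a container in the Exited state.
#[derive(Debug, PartialEq)]
pub enum Label {
    StartingUp,
    Stopped,
    KilledOom,
    Crashed,
}

/// Maps an exited container to a label.
/// During the first 5 minutes after backend start, every exited container
/// is shown as starting up, because recovery may not have reached it yet.
pub fn exited_label(exit_code: i32, secs_since_backend_start: u64) -> Label {
    const RECOVERY_WINDOW_SECS: u64 = 5 * 60;
    if secs_since_backend_start < RECOVERY_WINDOW_SECS {
        return Label::StartingUp;
    }
    match exit_code {
        0 => Label::Stopped,
        137 => Label::KilledOom,
        _ => Label::Crashed,
    }
}
```

Exit code 137 means the process received SIGKILL (128 + 9); labelling it as OOM follows this report's convention, although a manual kill produces the same code.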

Problem 9: Dependency-Blind Restarts

Root Cause (Confirmed by .228 reboot)

The health monitor restarts containers individually without considering dependencies. This was proven by the IndeedHub stack failure:

1. indeedhub-postgres exits cleanly (code 0) on reboot
2. Health monitor restarts postgres: it starts, but exits again (likely needs volume mount or network ready)
3. After 3 attempts, postgres is abandoned
4. Meanwhile, indeedhub-api tries to connect to postgres → ENOTFOUND indeedhub-postgres → exits
5. Health monitor restarts api → same DNS failure → exits
6. After 3 attempts, api is abandoned
7. Same cascade for redis, minio, relay, main container: all abandoned within minutes

File: core/archipelago/src/health_monitor.rs:500-530

The restart loop treats each container independently. There's no logic to:

Check if a container's dependencies are running before restarting it
Restart dependencies first when a dependent container fails
Reset attempt counters when a dependency comes back online

Three attempts are too few, especially when dependencies need time:

Attempt 1: 10s backoff → dependency still starting
Attempt 2: 30s backoff → dependency crashed and is being restarted
Attempt 3: 90s backoff → dependency hit its own 3-attempt limit and was abandoned
Game over. Entire stack is dead.

Fixes Required

Dependency-aware restart ordering: Before restarting a container, check if its dependencies are running. If not, restart dependencies first.
Increase max restart attempts to 5-10 for containers with dependencies
Reset attempt counters when a dependency comes back online (the dependent container failed because of the dependency, not itself)
Add a "stack restart" concept: When restarting any container in a multi-container stack (indeedhub, mempool, btcpay, immich, penpot), restart the entire stack in dependency order
Handle "Created" state containers: archy-mempool-web and fedimint are in "Created" state (never started). The health monitor should detect these and attempt to start them.
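The dependency check and the counter reset can be sketched together. `RestartPlanner` is a hypothetical type, not the health monitor's real structure, and the dependency map is assumed to come from manifests or a static table.

```rust
use std::collections::{HashMap, HashSet};

/// Decides which down containers may be restarted, and tracks attempts.
pub struct RestartPlanner {
    deps: HashMap<String, Vec<String>>,
    attempts: HashMap<String, u32>,
    max_attempts: u32,
}

impl RestartPlanner {
    pub fn new(deps: HashMap<String, Vec<String>>, max_attempts: u32) -> Self {
        Self { deps, attempts: HashMap::new(), max_attempts }
    }

    /// Down containers that still have attempts left and whose dependencies
    /// are all running. Containers without an entry in the dependency map
    /// are treated as having no dependencies.
    pub fn restartable(&self, down: &[&str], running: &HashSet<&str>) -> Vec<String> {
        down.iter()
            .filter(|name| self.attempts.get(**name).copied().unwrap_or(0) < self.max_attempts)
            .filter(|name| {
                self.deps
                    .get(**name)
                    .map_or(true, |ds| ds.iter().all(|d| running.contains(d.as_str())))
            })
            .map(|name| name.to_string())
            .collect()
    }

    /// Counts one failed restart attempt for `name`.
    pub fn record_failed_attempt(&mut self, name: &str) {
        *self.attempts.entry(name.to_string()).or_insert(0) += 1;
    }

    /// Called when `dep` is running again: containers that depend on it
    /// failed because of it, so their attempt counters are cleared.
    pub fn dependency_recovered(&mut self, dep: &str) {
        for (name, ds) in &self.deps {
            if ds.iter().any(|d| d == dep) {
                self.attempts.remove(name);
            }
        }
    }
}
```

In the IndeedHub case, indeedhub-api would never have been attempted while postgres was down, and any attempts it had already used would be cleared once postgres came back.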

Priority Order for Fixes

P0 — System is broken without these (reboot = broken system)

Dependency-aware restarts in health_monitor.rs — restart dependencies before dependents, reset attempt counters when deps recover
Increase max restart attempts to 10 (currently 3) — dependency chains need more time on boot
Handle "Created" state — containers stuck in Created are never started by health monitor
Fix UI state labels — "exited" code 0 should say "stopped", not "crashed". Add "recovering" state during boot window.
Fix Rust config to read secrets from /var/lib/archipelago/secrets/ instead of hardcoded passwords
Fix port 7777 conflict (strfry vs indeedhub)
Add ZMQ config to Bitcoin for LND block notifications

P1 — Core functionality broken

Wire up manifest health checks (replace fake "running = healthy" with actual HTTP/exec probes)
Fix uninstall to clean up volumes, networks, and respect graceful shutdown timeouts
Return actual errors from install/uninstall instead of silent success on partial failure
Remove || true from critical first-boot commands
Show sub-container health in UI (which dependency is actually broken)

P2 — Performance and CPU

Deduplicate podman stats calls (health monitor + metrics collector both call every 60s independently)
Increase health monitor interval to 120s
Add frontend health polling via WebSocket push (populate health field in data model)
Make fleet polling viewport-aware (don't poll when user isn't viewing)

P3 — Consistency and correctness

Sync memory limits between first-boot and Rust config
Update marketplace version strings (LND shows 0.17.4, actual is 0.18.4; Mempool shows 2.5.0, actual is 3.0.0)
Unify container naming conventions between first-boot script and Rust config
Add disk-aware Bitcoin config (prune/txindex) to Rust installer
Distinguish "needs manual intervention" from "will recover soon" in UI

Key Files to Modify

File	What to fix
`core/archipelago/src/health_monitor.rs`	Dependency-aware restarts, increase MAX_RESTART_ATTEMPTS to 10, handle Created state, deduplicate with metrics collector
`core/container/src/podman_client.rs`	Add RestartPolicy to container creation spec, fix `.unwrap_or_default()` error swallowing, increase socket timeout to 15-30s
`core/archipelago/src/crash_recovery.rs`	Increase timeouts to 120s, add retry with backoff, fix tier ordering catch-all
`core/archipelago/src/api/rpc/package/install.rs`	Return failure on timeout (not silent success), await post-install hooks
`core/archipelago/src/api/rpc/package/runtime.rs`	Add volume/network cleanup on uninstall, use `podman stop -t` then `podman rm` (not `-f`), return errors on partial failure
`core/archipelago/src/api/rpc/package/config.rs`	Read secrets from disk, fix port 7777, add ZMQ config, sync memory limits
`core/archipelago/src/container/dev_orchestrator.rs`	Wire up manifest-defined health checks instead of just checking podman state
`core/archipelago/src/container/docker_packages.rs`	Stop filtering sub-containers from state — or expose their health as part of parent app status
`core/archipelago/src/data_model.rs`	Populate `health` field for WebSocket push, add exit code to state
`core/archipelago/src/monitoring/mod.rs`	Share podman stats data with health monitor instead of duplicate subprocess calls
`neode-ui/src/views/apps/appsConfig.ts`	Fix state labels: exit 0 = "stopped", exit non-zero = "crashed", add "recovering" during boot window
`neode-ui/src/stores/container.ts`	Add periodic health polling (30s)
`neode-ui/src/views/apps/useAppsActions.ts`	Check for "partial" uninstall status, show errors to user
`neode-ui/src/views/marketplace/marketplaceData.ts`	Fix version strings to match image-versions.sh
`scripts/first-boot-containers.sh`	Remove `|| true` from critical commands, fix port 7777 conflict, add proper error reporting
