archy/docs/CONTAINER-ISSUES-REPORT.md
Dorian 64b57dca7d
fix: overhaul container lifecycle — recovery, health, uninstall, UI state
Container recovery:
- Health monitor: MAX_RESTART_ATTEMPTS 3→10, interval 60s→120s
- Dependency-aware restarts: won't restart services before their deps
- Reset dependent counters when a dependency recovers
- Handle "created" state containers (were invisible to health monitor)
- Added IndeedHub, mempool-api, mysql to tier system
- Crash recovery: podman start timeout 30s→120s with retry
- Podman client: socket timeout 5s→30s, added restart policy

UI state representation:
- Exit code 0 shows "stopped" (gray), not "crashed" (red)
- Exit code 137 shows "killed (OOM)"
- Non-zero exit shows "crashed" (red)
- Added exit_code field to PackageDataEntry

Install/uninstall fixes:
- Install returns error when container doesn't start (was silent success)
- Post-install hooks awaited instead of fire-and-forget tokio::spawn
- Uninstall: graceful rm before force, volume prune, network cleanup
- Uninstall returns error on partial failure (was 200 OK)

Config consistency:
- DB passwords read from /var/lib/archipelago/secrets/ (was hardcoded)
- Bitcoin: added ZMQ ports 28332/28333 for LND block notifications
- IndeedHub port 7777→8190 (was conflicting with strfry)
- Marketplace versions: LND 0.17.4→0.18.4, Mempool 2.5.0→3.0.0

Performance:
- Metrics collector interval 60s→300s (was duplicating health monitor)
- Podman client: proper error propagation instead of unwrap_or_default

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 07:03:57 +01:00


Archipelago Container Infrastructure — Critical Issues Report

Date: 2026-03-31
Status: Server .228 rebooted; some apps recovered, many did not. UI showed everything as "crashed" during recovery window.
Purpose: Fix guide for getting container lifecycle to production quality.


Executive Summary

The container system has 7 systemic failures that compound each other:

  1. Silent failures everywhere — errors are swallowed with || true, .unwrap_or_default(), and warn-level logs. Nothing actually tells the user (or the system) that something broke.
  2. Health checks are fake — manifests define real health checks (HTTP probes, exec checks) but they are never executed. "Healthy" just means podman ps shows "running".
  3. Duplicate polling burns CPU — health monitor + metrics collector both call podman stats every 60 seconds independently. Add crash recovery snapshots, disk monitor, and frontend polling = constant subprocess spawning.
  4. Uninstall doesn't clean up — no volume removal, no network cleanup, force-kills stateful containers (risking wallet/DB corruption), returns 200 OK on partial failure.
  5. Two divergent install paths: first-boot-containers.sh and the Rust RPC installer use different passwords, ports, capabilities, memory limits, and Bitcoin config. They are never in sync.
  6. UI misrepresents state: Exited (even clean exit code 0) shows as "crashed". No "recovering" or "starting up" state exists. During boot recovery, UI shows a wall of red/gray "crashed" labels.
  7. Dependency-blind restarts — health monitor restarts services without restarting their dependencies first, so they immediately fail again and burn through the 3-attempt limit.

LIVE EVIDENCE: .228 Reboot on 2026-03-31

After rebooting .228, here's the actual container state 30 minutes later:

Permanently Dead (exceeded 3 restart attempts, abandoned)

| Container | Exit Code | Cause |
|---|---|---|
| indeedhub-postgres | 0 (clean) | Shut down by reboot. Health monitor tried 3 restarts, it keeps exiting cleanly. Once abandoned, all dependent services die too. |
| indeedhub-redis | 0 | Same: clean exit, 3 failed restart attempts, abandoned |
| indeedhub-minio | 0 | Same |
| indeedhub-relay | 0 | Same |
| indeedhub | 0 | Same |
| indeedhub-api | 1 | Can't resolve hostname indeedhub-postgres (postgres is dead, DNS entry gone from network) |
| jellyfin | 137 (OOM) | "Failed to create CoreCLR": memory limit too low for .NET runtime. SIGKILL = OOM. 3 attempts exhausted. |

Crash-Looping (still failing on every restart)

| Container | Cause |
|---|---|
| mempool-api | ECONNREFUSED 10.89.0.42:3306: DB (archy-mempool-db) just restarted, not ready yet |
| portainer | "database schema version does not align with server version": image upgraded, DB not migrated. Will NEVER recover. |
| photoprism | "Failed creating test file in storage folder": volume permission issue (rootless UID mapping) |

Never Started (stuck in "Created" state)

| Container | Cause |
|---|---|
| archy-mempool-web | "cannot assign requested address": network binding failure |
| fedimint | Same network error |

Running but Unhealthy

| Container | Notes |
|---|---|
| homeassistant | Up 14 min, health check failing |
| searxng | Up 13 min, health check failing |
| onlyoffice | Up 10 min, health check failing |

Actually Recovered (healthy)

filebrowser, bitcoin-knots, vaultwarden, nginx-proxy-manager, archy-btcpay-db, lnd, electrumx, grafana

Key Observations

  1. All containers have unless-stopped restart policy — but this doesn't help because containers that exit cleanly (code 0) don't get restarted by Podman. The health monitor is the only restart mechanism, and it gives up after 3 attempts.
  2. The entire IndeedHub stack died because postgres was abandoned first. Once postgres hit 3 restart attempts, every dependent service (api, redis, minio, relay, main) also failed and hit their own 3-attempt limit. No dependency awareness.
  3. Containers in "Created" state were never even started — some kind of network assignment failure during creation. The health monitor doesn't handle "Created" state containers.
  4. The UI showed ALL apps as "crashed" during the first few minutes, even the ones that eventually recovered. This is because Exited state (even exit code 0) maps to the label "crashed" in appsConfig.ts.

Problem 1: Containers Don't Start or Recover After Reboot

Confirmed by the .228 reboot on 2026-03-31: every app went down with the reboot, and many never recovered on their own (see the live evidence above).

Root Causes

A. Crash recovery has a 30-second timeout that's too short

File: core/archipelago/src/crash_recovery.rs:265-271

let result = tokio::time::timeout(
    std::time::Duration::from_secs(30),
    tokio::process::Command::new("podman").args(["start", &record.name]).output(),
).await;

On a cold boot with many containers, Podman is under load. 30 seconds is not enough. If it times out, the container is skipped — no retry.

B. If podman ps itself times out, recovery finds zero containers

File: core/archipelago/src/crash_recovery.rs:318

The podman ps -a call to discover stopped containers has a 30-second timeout. On a busy system post-reboot, this can time out. Result: all_names is empty, recovery silently exits having started nothing.

C. Boot tier ordering uses a catch-all that misses dependencies

File: core/archipelago/src/crash_recovery.rs:374-385

fn container_boot_tier(name: &str) -> u8 {
    match id {
        "btcpay-db" | "mempool-db" | ... => 0,  // databases
        "bitcoin-knots" | ... => 1,               // bitcoin
        "lnd" | "electrumx" | ... => 2,           // depends on bitcoin
        "mempool-web" | ... => 4,                  // frontend
        _ => 3,  // EVERYTHING ELSE - may start before its dependencies
    }
}

Any app not explicitly listed gets tier 3, which may be before its dependencies are ready.
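
One way to avoid the catch-all is to derive tiers from declared dependencies. The following is a minimal sketch, not the project's implementation: the `boot_tiers` and `tier_of` names are hypothetical, and the dependency map is assumed to be built from manifest data elsewhere.

```rust
use std::collections::HashMap;

/// Computes a boot tier for `name`: 0 for containers without dependencies,
/// otherwise one more than the highest tier among its dependencies.
/// Returns an error on a dependency cycle.
fn tier_of(
    name: &str,
    deps: &HashMap<&str, Vec<&str>>,
    tiers: &mut HashMap<String, u32>,
    in_progress: &mut Vec<String>,
) -> Result<u32, String> {
    if let Some(&tier) = tiers.get(name) {
        return Ok(tier);
    }
    if in_progress.iter().any(|n| n == name) {
        return Err(format!("dependency cycle involving {name}"));
    }
    in_progress.push(name.to_string());
    let mut tier: u32 = 0;
    // Containers missing from the map are treated as having no dependencies.
    for dep in deps.get(name).map(Vec::as_slice).unwrap_or(&[]) {
        tier = tier.max(tier_of(dep, deps, tiers, in_progress)? + 1);
    }
    in_progress.pop();
    tiers.insert(name.to_string(), tier);
    Ok(tier)
}

/// Derives tiers for every container named in the dependency map.
pub fn boot_tiers(deps: &HashMap<&str, Vec<&str>>) -> Result<HashMap<String, u32>, String> {
    let mut tiers = HashMap::new();
    for name in deps.keys() {
        tier_of(name, deps, &mut tiers, &mut Vec::new())?;
    }
    Ok(tiers)
}
```

With this approach a newly added app lands in the right tier as soon as its dependencies are declared, and a cycle is reported instead of silently producing a wrong order.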

D. First-boot script swallows ALL errors

File: scripts/first-boot-containers.sh:8 (no set -e)

48+ commands have || true appended. Every podman run failure is silently ignored. The script always exits 0 and reports "complete" to systemd even if 50% of containers failed.

E. Install RPC returns success before container is actually running

File: core/archipelago/src/api/rpc/package/install.rs:260-294

After container creation, the installer polls for 30 seconds (6 checks x 5 seconds). If the container is still in "created" or "starting" state after 30 seconds:

if i == 5 {
    debug!("Container {} health check timeout (30s) -- continuing anyway");
}

It logs at debug level and returns success. The user sees "installed" but the container never actually started.
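
A sketch of the alternative behaviour, where running out of checks is an error the RPC can return. It is synchronous and std-only for brevity; the real installer is async. `wait_until_running` and the `state_of` callback are hypothetical names standing in for whatever queries the container state.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Polls `state_of` until it reports "running" or the checks run out.
/// Unlike the excerpt above, running out of checks is an error.
pub fn wait_until_running(
    container: &str,
    checks: u32,
    interval: Duration,
    mut state_of: impl FnMut(&str) -> String,
) -> Result<(), String> {
    let mut last_state = String::from("unknown");
    for check in 0..checks {
        last_state = state_of(container);
        if last_state == "running" {
            return Ok(());
        }
        // Do not sleep after the final check.
        if check + 1 < checks {
            sleep(interval);
        }
    }
    Err(format!(
        "container {container} not running after {checks} checks (last state: {last_state})"
    ))
}
```

The error string carries the last observed state, so the user sees "created" or "exited" rather than a bare failure.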

Fixes Required

  1. Increase crash recovery timeout to 120s and add retry with backoff (3 attempts per container)
  2. Increase podman ps timeout to 60s during boot recovery
  3. Replace tier catch-all — every container must be explicitly listed or derived from manifest dependencies
  4. Remove || true from critical commands in first-boot-containers.sh. Use proper error handling: log the error, record the failure, continue to next container, but report actual failures at the end
  5. Install RPC must return failure if container isn't running after timeout, not silently succeed
  6. Add --restart unless-stopped to container creation in the Podman client (core/container/src/podman_client.rs:303-335) — currently missing, so Podman itself never auto-restarts crashed containers
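
Fix 1 can be sketched as a small retry helper. This is an illustration only: `retry_with_backoff` is a hypothetical name, the version below blocks the thread (the real code would use async timers), and the tripling delay is chosen to mirror the 10s, 30s, 90s pattern mentioned elsewhere in this report.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Runs `op` up to `max_attempts` times (at least once), sleeping between
/// failures. The delay triples after each failed attempt.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    initial_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error instead of skipping silently.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                sleep(delay);
                delay *= 3;
                attempt += 1;
            }
        }
    }
}
```

The closure receives the attempt number so the caller can log it. The final error is returned to the caller, which lets recovery record the container as failed rather than dropping it.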

Problem 2: Health Checks Are Fake

Root Causes

A. "Healthy" just means "running" — application health is never checked

File: core/archipelago/src/container/dev_orchestrator.rs:239-249

pub async fn get_health_status(&self, app_id: &str) -> Result<String> {
    match status.state {
        ContainerState::Running => Ok("healthy".to_string()),  // <-- THIS IS THE ENTIRE CHECK
        ContainerState::Stopped | ContainerState::Exited => Ok("unhealthy".to_string()),
        ...
    }
}

A container can be "running" but the application inside is completely broken. This is reported as "healthy".

B. Manifest health checks exist but are never executed

All 30+ app manifests in image-recipe/build/debian-iso/custom/archipelago/apps/*/manifest.yml define health checks like:

health_check:
  type: http
  endpoint: http://localhost:4080
  path: /api/health
  interval: 30s
  timeout: 5s
  retries: 3

The HealthMonitor struct at core/container/src/health_monitor.rs can execute these checks. But it is never instantiated. No code path creates a HealthMonitor from the manifest health check definitions.

C. Health status is never pushed to the frontend via WebSocket

File: core/archipelago/src/data_model.rs:120-127

pub struct PackageDataEntry {
    pub health: Option<String>,  // Field exists but is NEVER POPULATED
}

The health field in the data model is always None. Frontend can only get health via explicit RPC call, which it almost never makes.

D. Frontend never polls health status

File: neode-ui/src/stores/container.ts:169-175

fetchHealthStatus() is only called after startContainer() and startBundledApp(). There is no setInterval, no periodic polling, no watch. After the initial call, health status is never refreshed.

Fixes Required

  1. Wire up manifest health checks — instantiate HealthMonitor from manifest definitions, run actual HTTP/exec probes instead of just checking podman ps
  2. Populate the health field in PackageDataEntry so WebSocket pushes real health status to frontend
  3. Add 30-second health polling in the frontend container store (with backoff to 60s when all healthy)
  4. Fix get_health_status() in dev_orchestrator to call actual health checks, not just check container state
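
The difference between "running" and "healthy" can be sketched as follows. The types are hypothetical, and the sketch assumes the manifest's retries field means "consecutive probe failures before the app counts as unhealthy"; the report does not state that meaning, so treat it as an assumption. The probe itself (HTTP or exec) is left to the caller.

```rust
/// Container state as reported by the runtime (simplified).
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum RuntimeState {
    Running,
    Exited,
}

#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Health {
    Healthy,
    Unhealthy,
}

/// Tracks consecutive probe failures for one app.
pub struct ProbeTracker {
    retries: u32, // assumed meaning of the manifest's `retries` field
    consecutive_failures: u32,
}

impl ProbeTracker {
    pub fn new(retries: u32) -> Self {
        // At least one failure is always required before reporting unhealthy.
        Self { retries: retries.max(1), consecutive_failures: 0 }
    }

    /// Records one probe result and returns the resulting health.
    /// `probe_ok` is the outcome of the manifest's HTTP or exec check.
    pub fn record(&mut self, state: RuntimeState, probe_ok: bool) -> Health {
        if state != RuntimeState::Running {
            return Health::Unhealthy;
        }
        if probe_ok {
            self.consecutive_failures = 0;
        } else {
            self.consecutive_failures += 1;
        }
        if self.consecutive_failures >= self.retries {
            Health::Unhealthy
        } else {
            Health::Healthy
        }
    }
}
```

A running container only reports healthy while its probe keeps passing, which is the property the current get_health_status() lacks.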

Problem 3: CPU Exhaustion from Duplicate Polling

Root Causes

A. Two independent monitors both call podman stats every 60 seconds

  • Health monitor: core/archipelago/src/health_monitor.rs:17 (CHECK_INTERVAL_SECS = 60)
    • Runs podman ps -a --format json (line 305-323)
    • Runs podman stats --no-stream every 5 cycles (line 442-450)
  • Metrics collector: core/archipelago/src/monitoring/mod.rs:28 — 60-second interval
    • Runs podman stats --no-stream --format json independently (collector.rs:220-224)

These are not coordinated. Both spawn separate subprocesses. On a system with 15+ containers, each podman stats call is expensive.

B. Total subprocess spawning frequency

| Component | Interval | What it runs |
|---|---|---|
| Health monitor | 60s | podman ps, podman stats (every 5th), restart attempts |
| Metrics collector | 60s | podman stats (duplicate!) |
| Crash recovery snapshot | 120s | podman ps |
| Disk monitor | 300s | df, sudo dmesg, potentially podman image prune |
| Telemetry | 900s | podman stats (another duplicate) |
| Systemd watchdog | 120s | sd_notify ping |
| Frontend fleet polling | 60s | RPC calls that trigger more podman commands |

That's roughly one podman subprocess every 10-15 seconds on average, plus all the triggered operations.

C. No restart policy means polling-driven restarts

File: core/container/src/podman_client.rs:303-335

Container creation spec does NOT include RestartPolicy. Podman itself never restarts crashed containers. Instead, the health monitor's 60-second poll detects the crash and attempts a restart. This is far more CPU-intensive than Podman's built-in restart mechanism.

D. Health monitor restart attempts with exponential backoff still spawn processes

When a container fails, the health monitor tries restarts at 10s, 30s, 90s backoff. Each attempt spawns podman start, podman inspect, etc. If multiple containers are unhealthy, this multiplies.

Fixes Required

  1. Deduplicate podman stats — create a shared cache layer. One component fetches, others read from cache (TTL: 30s)
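  2. Add RestartPolicy: unless-stopped to all container creation so that Podman handles restarts natively instead of polling. If a cap such as MaxRetryCount: 5 is wanted, note that Podman honors a retry count only with the on-failure policy, which does not restart clean (exit 0) exits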
  3. Increase health monitor interval to 120s (60s is too aggressive when health checks are just podman ps)
  4. Remove duplicate podman stats call from metrics collector — share data with health monitor
  5. Make frontend fleet polling viewport-aware — only poll when user is actually viewing the fleet page
  6. Batch all container queries — use a single podman ps -a --format json per check cycle, shared across all consumers
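
The shared cache in fix 1 could look roughly like this. It is a sketch under assumptions: `TtlCache` is a hypothetical type, the caller passes the current time explicitly to keep it testable, and sharing it between the health monitor and the metrics collector would additionally need a lock (for example a Mutex) around it.

```rust
use std::time::{Duration, Instant};

/// Caches one value and refetches it only after `ttl` has elapsed.
pub struct TtlCache<T> {
    ttl: Duration,
    entry: Option<(Instant, T)>,
}

impl<T: Clone> TtlCache<T> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entry: None }
    }

    /// Returns the cached value if it is still fresh at `now`;
    /// otherwise calls `fetch`, stores the result, and returns it.
    pub fn get_or_fetch(&mut self, now: Instant, fetch: impl FnOnce() -> T) -> T {
        if let Some((fetched_at, value)) = &self.entry {
            if now.duration_since(*fetched_at) < self.ttl {
                return value.clone();
            }
        }
        let value = fetch();
        self.entry = Some((now, value.clone()));
        value
    }
}
```

Here `fetch` would be the single place that spawns podman stats; every other consumer reads the cached result until the 30-second TTL expires.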

Problem 4: Uninstall Doesn't Work

Root Causes

A. No volume removal

File: core/archipelago/src/api/rpc/package/runtime.rs:172-289

The uninstall function stops containers, removes containers, releases ports, and attempts data directory cleanup. It never removes Podman volumes. Orphaned volumes accumulate forever.

B. No network cleanup

File: core/archipelago/src/api/rpc/package/runtime.rs:172-289

Multi-container stacks create networks (archy-net, immich-net, penpot-net) during install (stacks.rs:89, 211). These are never cleaned up during uninstall. Leftover networks can prevent reinstallation.

C. Force-kills stateful containers without graceful shutdown

File: core/archipelago/src/api/rpc/package/runtime.rs:226

let rm_out = tokio::process::Command::new("podman")
    .args(["rm", "-f", name])  // -f = force kill
    .output().await;

The code defines proper shutdown timeouts (Bitcoin: 600s, LND: 330s, databases: 120s) but only uses them for stop. The rm -f that follows ignores these timeouts and force-kills immediately. This risks corrupting Bitcoin's UTXO set, LND channel state, or database WAL.

D. Returns 200 OK even on partial failure

File: core/archipelago/src/api/rpc/package/runtime.rs:268-289

Ok(serde_json::json!({
    "status": if errors.is_empty() { "uninstalled" } else { "partial" },
    ...
}))

Returns HTTP 200 with "partial" status. Frontend at neode-ui/src/views/apps/useAppsActions.ts:74 doesn't check for "partial" — it deletes the app from the UI regardless.

E. Data directory cleanup requires sudo and fails silently

File: core/archipelago/src/api/rpc/package/runtime.rs:256-265

let rm_out = tokio::process::Command::new("sudo")
    .args(["rm", "-rf", dir]).output().await;
if let Ok(o) = rm_out {
    if !o.status.success() {
        tracing::warn!(...);  // Warning only, continues
    }
}

If sudo isn't configured or fails, data remains on disk but UI shows "uninstalled".

F. Container name detection has gaps

File: core/archipelago/src/api/rpc/package/config.rs:287-340

Container names are hardcoded patterns. If a container was created with a different naming convention (e.g., by first-boot-containers.sh vs RPC installer), it won't be found and won't be removed.

Fixes Required

  1. Add podman volume rm for all volumes associated with the app after container removal
  2. Add network cleanup — remove app-specific networks after all containers on that network are gone
  3. Use podman stop -t {timeout} then podman rm (without -f) — respect graceful shutdown timeouts, especially for Bitcoin/LND/databases
  4. Return an error (not 200) when uninstall has failures. Frontend must check and display errors
  5. Surface "partial" failures to the user with specific error messages
  6. Unify container naming — derive names from a single source (manifest), not hardcoded patterns in multiple files
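
Fixes 1 to 4 can be sketched as a planned command sequence plus an error-returning result. The function names are hypothetical. The 600/330/120 second timeouts come from this report, while the name matching and the 30-second default are assumptions. The sketch only builds podman argument lists and does not execute anything.

```rust
/// Graceful-stop timeout in seconds. The 600/330/120 values come from the
/// report; the name matching and the 30s default are assumptions.
pub fn stop_timeout_secs(container: &str) -> u32 {
    if container.contains("bitcoin") {
        600
    } else if container == "lnd" {
        330
    } else if container.ends_with("-db") || container.contains("postgres") {
        120
    } else {
        30
    }
}

/// Builds the podman argument lists for one uninstall, in execution order.
pub fn uninstall_commands(
    containers: &[&str],
    volumes: &[&str],
    networks: &[&str],
) -> Vec<Vec<String>> {
    let mut cmds: Vec<Vec<String>> = Vec::new();
    for c in containers {
        let timeout = stop_timeout_secs(c).to_string();
        cmds.push(vec!["stop".to_string(), "-t".to_string(), timeout, c.to_string()]);
        // No -f: the container has already been stopped gracefully.
        cmds.push(vec!["rm".to_string(), c.to_string()]);
    }
    for v in volumes {
        cmds.push(vec!["volume".to_string(), "rm".to_string(), v.to_string()]);
    }
    for n in networks {
        cmds.push(vec!["network".to_string(), "rm".to_string(), n.to_string()]);
    }
    cmds
}

/// Turns collected failures into an error instead of a "partial" success.
pub fn uninstall_result(errors: Vec<String>) -> Result<(), String> {
    if errors.is_empty() {
        Ok(())
    } else {
        Err(format!("uninstall incomplete: {}", errors.join("; ")))
    }
}
```

A caller would run each list as `podman <args>`, push any failure onto `errors`, and return `uninstall_result(errors)` so the frontend receives an error rather than a 200 with "partial". Shared networks such as archy-net should only be listed once no other app's containers use them.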

Problem 5: Two Divergent Install Paths

The first-boot bash script and the Rust RPC installer create containers with different configurations. This is a major source of bugs.

Specific Divergences

A. Database passwords

  • First-boot (scripts/first-boot-containers.sh:118-127): Generates random passwords with openssl rand -base64 24, stores in /var/lib/archipelago/secrets/
  • Rust RPC (core/archipelago/src/api/rpc/package/config.rs:456,484,514-515,610): Uses hardcoded "btcpaypass", "mempoolpass", "rootpass", "immichpass"

Result: Apps installed via RPC after first-boot can't connect to databases because passwords don't match.

B. Bitcoin configuration

  • First-boot (scripts/first-boot-containers.sh:295-313): Dynamically sets -prune=550 on small disks, -txindex=1 on large disks
  • Rust RPC (core/archipelago/src/api/rpc/package/config.rs:415-420): No custom args at all

Result: Bitcoin installed via RPC has no pruning or txindex regardless of disk size.

C. ZMQ configuration for LND

  • First-boot (scripts/first-boot-containers.sh:100-114): Bitcoin.conf generated without ZMQ publisher settings
  • Rust RPC (core/archipelago/src/api/rpc/package/config.rs:438-439): LND configured to connect to tcp://bitcoin-knots:28332 and tcp://bitcoin-knots:28333

Result: LND can't receive block notifications from Bitcoin. LND is told to connect to ports 28332/28333, but neither install path enables the ZMQ publishers on the Bitcoin side.

D. Port conflicts

  • First-boot (scripts/first-boot-containers.sh:813,835): Both strfry and indeedhub bind to host port 7777
  • Rust RPC (core/archipelago/src/api/rpc/package/config.rs:734): IndeedHub uses 8190:3000

Result: On first-boot, whichever of strfry/indeedhub starts second fails. Via RPC, different port entirely.

E. Memory limits

  • First-boot (scripts/first-boot-containers.sh:253-283): Ollama gets 1g on low-mem systems
  • Rust RPC (core/archipelago/src/api/rpc/package/config.rs:245-280): Ollama gets 4g always

Result: Same app gets different resource limits depending on how it was installed.

F. Version mismatches in marketplace UI

  • scripts/image-versions.sh:17: LND image is v0.18.4-beta
  • neode-ui/src/views/marketplace/marketplaceData.ts:155: Shows 0.17.4
  • scripts/image-versions.sh:21-22: Mempool images are v3.0.0
  • neode-ui/src/views/marketplace/marketplaceData.ts:177: Shows 2.5.0

Fixes Required

  1. Single source of truth for container config — Rust config must read passwords from /var/lib/archipelago/secrets/, not hardcode them
  2. Add ZMQ config to Bitcoin startup in both paths: zmqpubrawblock=tcp://0.0.0.0:28332 and zmqpubrawtx=tcp://0.0.0.0:28333
  3. Fix port 7777 conflict — assign unique ports to strfry and indeedhub
  4. Add disk-aware Bitcoin config to Rust installer (prune/txindex based on disk size)
  5. Sync memory limits between first-boot and Rust config
  6. Update marketplace version strings to match actual image versions in image-versions.sh
  7. Long-term: eliminate first-boot-containers.sh — have the backend handle all container creation using the same Rust code path
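
For fix 1, reading a secret from disk might look like the sketch below. `read_secret` is a hypothetical helper, and the individual file names under /var/lib/archipelago/secrets/ are not given in this report, so any name passed in is an assumption.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reads one secret from `dir`, trimming surrounding whitespace.
/// An empty file is treated as an error rather than an empty password.
pub fn read_secret(dir: &Path, name: &str) -> io::Result<String> {
    let path = dir.join(name);
    let raw = fs::read_to_string(&path)?;
    let secret = raw.trim();
    if secret.is_empty() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("secret file {} is empty", path.display()),
        ));
    }
    Ok(secret.to_string())
}
```

Failing loudly on a missing or empty file matters here: falling back to a hardcoded default would reintroduce the password mismatch described above.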

Problem 6: Post-Install Hooks Run Async and Fail Silently

File: core/archipelago/src/api/rpc/package/install.rs:541-625

Post-install hooks (setting FileBrowser password, configuring NextCloud, etc.) are spawned as background tasks:

tokio::spawn(async move {
    let _ = tokio::fs::create_dir_all(secret_dir).await;
    let _ = tokio::fs::write(...).await;
});

The install RPC returns success before hooks complete. If a hook fails (network timeout, service not ready), the error is logged but the user is told installation succeeded. Credentials aren't set, configs aren't applied.

Fix Required

Await post-install hooks before returning success, or return a "configuring" status and let the frontend poll for completion.


Problem 7: Podman Client Swallows Errors

File: core/container/src/podman_client.rs

A. JSON serialization failures return empty strings (line 182-183)

let body_str = body.map(|b| serde_json::to_string(&b).unwrap_or_default()).unwrap_or_default();

B. Container ID parsing failures return empty string (line 344-348)

let id = result["Id"].as_str().unwrap_or("").to_string();
Ok(id)  // Empty string = success?

C. Socket timeout is only 5 seconds (line 154-160)

On a busy system or during boot, Podman socket may take >5s to respond. Every API call fails. No retry logic.

Fixes Required

  1. Replace .unwrap_or_default() with proper error propagation using ?
  2. Return Err when container ID is empty
  3. Increase socket timeout to 15-30s
  4. Add retry with backoff (3 attempts) on socket connection

Problem 8: UI Misrepresents Container State

Root Causes

A. "Exited" always displays as "Crashed" — even for clean shutdowns

File: neode-ui/src/views/apps/appsConfig.ts:119-146

getStatusLabel(state, health):
  - "exited" → "crashed"     // <-- THIS IS THE PROBLEM

Every container that exited — whether from a clean reboot (exit 0), OOM kill (exit 137), or app error (exit 1) — shows the same "crashed" label. After a reboot, the UI is a wall of "crashed" labels even though containers are in the process of starting up.

B. No "recovering" or "boot in progress" state exists

File: core/archipelago/src/data_model.rs:103-119

PackageState enum has Starting, but it's only set during explicit user start actions, not during automatic crash recovery. During boot recovery, containers transition from Exited → Running without ever passing through Starting, so the UI never shows a spinner or "starting up" message.

C. Backend skips sub-containers from package listing, so their state is invisible

File: core/archipelago/src/container/docker_packages.rs:39-117

The excluded_services list filters out backend services like mempool-db, btcpay-db, nbxplorer, penpot-postgres, etc. UI containers ending in -ui are also skipped. These containers are invisible to the user even when they're the actual cause of a stack failure (e.g., indeedhub-postgres being dead kills the entire IndeedHub stack, but only indeedhub-api errors are visible).

D. No distinction between "needs manual intervention" and "will recover soon"

The UI shows the same visual treatment for:

  • Portainer (DB migration error — will NEVER recover without manual intervention)
  • mempool-api (DB not ready yet — will recover in 30 seconds)
  • IndeedHub (dependencies abandoned — won't recover until deps are manually restarted)

Fixes Required

  1. Differentiate exit codes: Exit 0 = "stopped" (gray), Exit non-zero = "crashed" (red), Exit 137 = "killed (OOM)" (red with warning)
  2. Add a "recovering" state: During boot/crash recovery window (first 5 minutes after backend start), show "Starting up..." instead of "crashed" for exited containers
  3. Show sub-container health: When a parent app is unhealthy, show which sub-service caused the failure (e.g., "IndeedHub: postgres is down")
  4. Distinguish recoverable from permanent failures: After health monitor gives up (3 attempts), change label to "Needs attention" instead of keeping "crashed"
  5. Add recovery progress indicator: During boot, show "Recovering containers: 15/22 started" on the dashboard
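
The label mapping in fixes 1 and 2 can be sketched as below. The real getStatusLabel lives in the TypeScript frontend; this Rust version only illustrates the decision table. The function name, the `in_recovery_window` flag, and the "not started" label for created containers are assumptions.

```rust
/// Maps container state to a UI label, following fixes 1 and 2 above.
pub fn status_label(state: &str, exit_code: Option<i32>, in_recovery_window: bool) -> &'static str {
    match (state, exit_code) {
        ("running", _) => "running",
        // During boot recovery, exited containers are probably about to start.
        ("exited", _) if in_recovery_window => "starting up",
        ("exited", Some(0)) => "stopped",
        // 137 = 128 + 9 (SIGKILL); OOM is the usual cause but not the only one.
        ("exited", Some(137)) => "killed (OOM)",
        ("exited", _) => "crashed",
        ("created", _) => "not started",
        _ => "unknown",
    }
}
```

The order of the match arms encodes the priority: the recovery window wins over the exit code, so a clean exit during boot shows "starting up" instead of "stopped".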

Problem 9: Dependency-Blind Restarts

Root Cause (Confirmed by .228 reboot)

The health monitor restarts containers individually without considering dependencies. This was proven by the IndeedHub stack failure:

  1. indeedhub-postgres exits cleanly (code 0) on reboot
  2. Health monitor restarts postgres — it starts, but exits again (likely needs volume mount or network ready)
  3. After 3 attempts, postgres is abandoned
  4. Meanwhile, indeedhub-api tries to connect to postgres → ENOTFOUND indeedhub-postgres → exits
  5. Health monitor restarts api → same DNS failure → exits
  6. After 3 attempts, api is abandoned
  7. Same cascade for redis, minio, relay, main container — all abandoned within minutes

File: core/archipelago/src/health_monitor.rs:500-530

The restart loop treats each container independently. There's no logic to:

  • Check if a container's dependencies are running before restarting it
  • Restart dependencies first when a dependent container fails
  • Reset attempt counters when a dependency comes back online

3 attempts is too few, especially when dependencies need time:

  • Attempt 1: 10s backoff → dependency still starting
  • Attempt 2: 30s backoff → dependency crashed and is being restarted
  • Attempt 3: 90s backoff → dependency hit its own 3-attempt limit and was abandoned
  • Game over. Entire stack is dead.

Fixes Required

  1. Dependency-aware restart ordering: Before restarting a container, check if its dependencies are running. If not, restart dependencies first.
  2. Increase max restart attempts to 5-10 for containers with dependencies
  3. Reset attempt counters when a dependency comes back online (the dependent container failed because of the dependency, not itself)
  4. Add a "stack restart" concept: When restarting any container in a multi-container stack (indeedhub, mempool, btcpay, immich, penpot), restart the entire stack in dependency order
  5. Handle "Created" state containers: archy-mempool-web and fedimint are in "Created" state (never started). The health monitor should detect these and attempt to start them.
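
Fixes 1 and 3 can be sketched together. `RestartPlanner` is a hypothetical type, it only looks at direct dependencies, and the dependency map is assumed to come from manifests or the tier table.

```rust
use std::collections::{HashMap, HashSet};

pub struct RestartPlanner {
    deps: HashMap<String, Vec<String>>, // container -> direct dependencies
    attempts: HashMap<String, u32>,     // restart attempts used so far
    max_attempts: u32,
}

impl RestartPlanner {
    pub fn new(deps: HashMap<String, Vec<String>>, max_attempts: u32) -> Self {
        Self { deps, attempts: HashMap::new(), max_attempts }
    }

    /// Returns the container to start next: the first dependency that is
    /// not running, or `name` itself once all dependencies are up.
    /// Returns None when that container has used up its attempts.
    pub fn next_to_start(&mut self, name: &str, running: &HashSet<String>) -> Option<String> {
        let target = self
            .deps
            .get(name)
            .and_then(|ds| ds.iter().find(|d| !running.contains(*d)))
            .cloned()
            .unwrap_or_else(|| name.to_string());
        let used = self.attempts.entry(target.clone()).or_insert(0);
        if *used >= self.max_attempts {
            return None;
        }
        *used += 1;
        Some(target)
    }

    /// Called when `dep` is observed running again: dependents failed
    /// because of it, so their counters start over.
    pub fn on_recovered(&mut self, dep: &str) {
        self.attempts.remove(dep);
        for (name, ds) in &self.deps {
            if ds.iter().any(|d| d == dep) {
                self.attempts.remove(name);
            }
        }
    }
}
```

In the IndeedHub case, a restart request for indeedhub-api would be redirected to indeedhub-postgres while postgres is down, and the api's own counter would stay untouched until postgres is back.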

Priority Order for Fixes

P0 — System is broken without these (reboot = broken system)

  1. Dependency-aware restarts in health_monitor.rs — restart dependencies before dependents, reset attempt counters when deps recover
  2. Increase max restart attempts to 10 (currently 3) — dependency chains need more time on boot
  3. Handle "Created" state — containers stuck in Created are never started by health monitor
  4. Fix UI state labels — "exited" code 0 should say "stopped", not "crashed". Add "recovering" state during boot window.
  5. Fix Rust config to read secrets from /var/lib/archipelago/secrets/ instead of hardcoded passwords
  6. Fix port 7777 conflict (strfry vs indeedhub)
  7. Add ZMQ config to Bitcoin for LND block notifications

P1 — Core functionality broken

  1. Wire up manifest health checks (replace fake "running = healthy" with actual HTTP/exec probes)
  2. Fix uninstall to clean up volumes, networks, and respect graceful shutdown timeouts
  3. Return actual errors from install/uninstall instead of silent success on partial failure
  4. Remove || true from critical first-boot commands
  5. Show sub-container health in UI (which dependency is actually broken)

P2 — Performance and CPU

  1. Deduplicate podman stats calls (health monitor + metrics collector both call every 60s independently)
  2. Increase health monitor interval to 120s
  3. Push health status to the frontend over WebSocket (populate health field in data model) and add periodic frontend health polling
  4. Make fleet polling viewport-aware (don't poll when user isn't viewing)

P3 — Consistency and correctness

  1. Sync memory limits between first-boot and Rust config
  2. Update marketplace version strings (LND shows 0.17.4, actual is 0.18.4; Mempool shows 2.5.0, actual is 3.0.0)
  3. Unify container naming conventions between first-boot script and Rust config
  4. Add disk-aware Bitcoin config (prune/txindex) to Rust installer
  5. Distinguish "needs manual intervention" from "will recover soon" in UI

Key Files to Modify

| File | What to fix |
|---|---|
| core/archipelago/src/health_monitor.rs | Dependency-aware restarts, increase MAX_RESTART_ATTEMPTS to 10, handle Created state, deduplicate with metrics collector |
| core/container/src/podman_client.rs | Add RestartPolicy to container creation spec, fix .unwrap_or_default() error swallowing, increase socket timeout to 15-30s |
| core/archipelago/src/crash_recovery.rs | Increase timeouts to 120s, add retry with backoff, fix tier ordering catch-all |
| core/archipelago/src/api/rpc/package/install.rs | Return failure on timeout (not silent success), await post-install hooks |
| core/archipelago/src/api/rpc/package/runtime.rs | Add volume/network cleanup on uninstall, use podman stop -t then podman rm (not -f), return errors on partial failure |
| core/archipelago/src/api/rpc/package/config.rs | Read secrets from disk, fix port 7777, add ZMQ config, sync memory limits |
| core/archipelago/src/container/dev_orchestrator.rs | Wire up manifest-defined health checks instead of just checking podman state |
| core/archipelago/src/container/docker_packages.rs | Stop filtering sub-containers from state, or expose their health as part of parent app status |
| core/archipelago/src/data_model.rs | Populate health field for WebSocket push, add exit code to state |
| core/archipelago/src/monitoring/mod.rs | Share podman stats data with health monitor instead of duplicate subprocess calls |
| neode-ui/src/views/apps/appsConfig.ts | Fix state labels: exit 0 = "stopped", exit non-zero = "crashed", add "recovering" during boot window |
| neode-ui/src/stores/container.ts | Add periodic health polling (30s) |
| neode-ui/src/views/apps/useAppsActions.ts | Check for "partial" uninstall status, show errors to user |
| neode-ui/src/views/marketplace/marketplaceData.ts | Fix version strings to match image-versions.sh |
| scripts/first-boot-containers.sh | Remove \|\| true from critical commands, fix port 7777 conflict, add proper error reporting |