Container recovery:
- Health monitor: MAX_RESTART_ATTEMPTS 3→10, interval 60s→120s
- Dependency-aware restarts: won't restart services before their deps
- Reset dependent counters when a dependency recovers
- Handle "created" state containers (were invisible to health monitor)
- Added IndeedHub, mempool-api, mysql to tier system
- Crash recovery: podman start timeout 30s→120s with retry
- Podman client: socket timeout 5s→30s, added restart policy

UI state representation:
- Exit code 0 shows "stopped" (gray), not "crashed" (red)
- Exit code 137 shows "killed (OOM)"
- Non-zero exit shows "crashed" (red)
- Added exit_code field to PackageDataEntry

Install/uninstall fixes:
- Install returns error when container doesn't start (was silent success)
- Post-install hooks awaited instead of fire-and-forget tokio::spawn
- Uninstall: graceful rm before force, volume prune, network cleanup
- Uninstall returns error on partial failure (was 200 OK)

Config consistency:
- DB passwords read from /var/lib/archipelago/secrets/ (was hardcoded)
- Bitcoin: added ZMQ ports 28332/28333 for LND block notifications
- IndeedHub port 7777→8190 (was conflicting with strfry)
- Marketplace versions: LND 0.17.4→0.18.4, Mempool 2.5.0→3.0.0

Performance:
- Metrics collector interval 60s→300s (was duplicating health monitor)
- Podman client: proper error propagation instead of unwrap_or_default

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Archipelago Container Infrastructure — Critical Issues Report
Date: 2026-03-31
Status: Server .228 rebooted; some apps recovered, many did not. The UI showed everything as "crashed" during the recovery window.
Purpose: Fix guide for getting container lifecycle to production quality.
Executive Summary
The container system has 7 systemic failures that compound each other:
- Silent failures everywhere: errors are swallowed with `|| true`, `.unwrap_or_default()`, and warn-level logs. Nothing actually tells the user (or the system) that something broke.
- Health checks are fake: manifests define real health checks (HTTP probes, exec checks) but they are never executed. "Healthy" just means `podman ps` shows "running".
- Duplicate polling burns CPU: health monitor + metrics collector both call `podman stats` every 60 seconds independently. Add crash recovery snapshots, disk monitor, and frontend polling, and the result is constant subprocess spawning.
- Uninstall doesn't clean up: no volume removal, no network cleanup, force-kills stateful containers (risking wallet/DB corruption), returns 200 OK on partial failure.
- Two divergent install paths: `first-boot-containers.sh` and the Rust RPC installer use different passwords, ports, capabilities, memory limits, and Bitcoin config. They are never in sync.
- UI misrepresents state: `Exited` (even clean exit code 0) shows as "crashed". No "recovering" or "starting up" state exists. During boot recovery, UI shows a wall of red/gray "crashed" labels.
- Dependency-blind restarts: health monitor restarts services without restarting their dependencies first, so they immediately fail again and burn through the 3-attempt limit.
LIVE EVIDENCE: .228 Reboot on 2026-03-31
After rebooting .228, here's the actual container state 30 minutes later:
Permanently Dead (exceeded 3 restart attempts, abandoned)
| Container | Exit Code | Cause |
|---|---|---|
| `indeedhub-postgres` | 0 (clean) | Shut down by reboot. Health monitor tried 3 restarts, it keeps exiting cleanly. Once abandoned, all dependent services die too. |
| `indeedhub-redis` | 0 | Same: clean exit, 3 failed restart attempts, abandoned |
| `indeedhub-minio` | 0 | Same |
| `indeedhub-relay` | 0 | Same |
| `indeedhub` | 0 | Same |
| `indeedhub-api` | 1 | Can't resolve hostname indeedhub-postgres (postgres is dead, DNS entry gone from network) |
| `jellyfin` | 137 (OOM) | "Failed to create CoreCLR": memory limit too low for .NET runtime. SIGKILL = OOM. 3 attempts exhausted. |
Crash-Looping (still failing on every restart)
| Container | Cause |
|---|---|
| `mempool-api` | ECONNREFUSED 10.89.0.42:3306 (DB archy-mempool-db just restarted, not ready yet) |
| `portainer` | "database schema version does not align with server version": image upgraded, DB not migrated. Will NEVER recover. |
| `photoprism` | "Failed creating test file in storage folder": volume permission issue (rootless UID mapping) |
Never Started (stuck in "Created" state)
| Container | Cause |
|---|---|
| `archy-mempool-web` | "cannot assign requested address": network binding failure |
| `fedimint` | Same network error |
Running but Unhealthy
| Container | Notes |
|---|---|
| `homeassistant` | Up 14 min, health check failing |
| `searxng` | Up 13 min, health check failing |
| `onlyoffice` | Up 10 min, health check failing |
Actually Recovered (healthy)
filebrowser, bitcoin-knots, vaultwarden, nginx-proxy-manager, archy-btcpay-db, lnd, electrumx, grafana
Key Observations
- All containers have `unless-stopped` restart policy, but this doesn't help because containers that exit cleanly (code 0) don't get restarted by Podman. The health monitor is the only restart mechanism, and it gives up after 3 attempts.
- The entire IndeedHub stack died because postgres was abandoned first. Once postgres hit 3 restart attempts, every dependent service (api, redis, minio, relay, main) also failed and hit their own 3-attempt limit. No dependency awareness.
- Containers in "Created" state were never even started, apparently because of a network assignment failure during creation. The health monitor doesn't handle "Created" state containers.
- The UI showed ALL apps as "crashed" during the first few minutes, even the ones that eventually recovered. This is because `Exited` state (even exit code 0) maps to the label "crashed" in `appsConfig.ts`.
Problem 1: Containers Don't Start or Recover After Reboot
Confirmed: after the .228 reboot on 2026-03-31, every app showed as crashed and many never recovered.
Root Causes
A. Crash recovery has a 30-second timeout that's too short
File: core/archipelago/src/crash_recovery.rs:265-271
let result = tokio::time::timeout(
std::time::Duration::from_secs(30),
tokio::process::Command::new("podman").args(["start", &record.name]).output(),
).await;
On a cold boot with many containers, Podman is under load. 30 seconds is not enough. If it times out, the container is skipped — no retry.
B. If podman ps itself times out, recovery finds zero containers
File: core/archipelago/src/crash_recovery.rs:318
The podman ps -a call to discover stopped containers has a 30-second timeout. On a busy system post-reboot, this can time out. Result: all_names is empty, and recovery silently exits having started nothing.
C. Boot tier ordering uses a catch-all that misses dependencies
File: core/archipelago/src/crash_recovery.rs:374-385
fn container_boot_tier(name: &str) -> u8 {
match id {
"btcpay-db" | "mempool-db" | ... => 0, // databases
"bitcoin-knots" | ... => 1, // bitcoin
"lnd" | "electrumx" | ... => 2, // depends on bitcoin
"mempool-web" | ... => 4, // frontend
_ => 3, // EVERYTHING ELSE - may start before its dependencies
}
}
Any app not explicitly listed gets tier 3, which may be before its dependencies are ready.
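One way to remove the catch-all is to derive the tier from declared dependencies instead of a hand-maintained match. The sketch below is only an illustration: `boot_tier`, `visit`, and the dependency map are hypothetical names, and the edges used in any example are assumptions, not the real manifest data. The tier is the length of the longest dependency chain beneath a container, and an unknown container yields `None` rather than a default tier.

```rust
use std::collections::HashMap;

/// Boot tier = length of the longest dependency chain below `name`.
/// Containers without dependencies get tier 0. Returns None for a container
/// that is missing from the map or sits on a dependency cycle, so nothing
/// silently lands in a default tier.
fn boot_tier(name: &str, deps: &HashMap<&str, Vec<&str>>) -> Option<u8> {
    let mut stack = Vec::new();
    visit(name, deps, &mut stack)
}

fn visit<'a>(
    name: &'a str,
    deps: &HashMap<&'a str, Vec<&'a str>>,
    stack: &mut Vec<&'a str>,
) -> Option<u8> {
    if stack.contains(&name) {
        return None; // dependency cycle
    }
    let children = deps.get(name)?; // unknown container: refuse to guess
    stack.push(name);
    let mut tier = 0u8;
    for &child in children {
        tier = tier.max(visit(child, deps, stack)? + 1);
    }
    stack.pop();
    Some(tier)
}
```

A caller would start containers in ascending tier order and treat `None` as a configuration error to report, not as tier 3.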
D. First-boot script swallows ALL errors
File: scripts/first-boot-containers.sh:8 — no set -e
48+ commands have || true appended. Every podman run failure is silently ignored. The script always exits 0 and reports "complete" to systemd even if 50% of containers failed.
E. Install RPC returns success before container is actually running
File: core/archipelago/src/api/rpc/package/install.rs:260-294
After container creation, the installer polls for 30 seconds (6 checks x 5 seconds). If the container is still in "created" or "starting" state after 30 seconds:
if i == 5 {
debug!("Container {} health check timeout (30s) -- continuing anyway");
}
It logs at debug level and returns success. The user sees "installed" but the container never actually started.
Fixes Required
- Increase crash recovery timeout to 120s and add retry with backoff (3 attempts per container)
- Increase `podman ps` timeout to 60s during boot recovery
- Replace tier catch-all: every container must be explicitly listed or derived from manifest dependencies
- Remove `|| true` from critical commands in first-boot-containers.sh. Use proper error handling: log the error, record the failure, continue to next container, but report actual failures at the end
- Install RPC must return failure if container isn't running after timeout, not silently succeed
- Add `--restart unless-stopped` to container creation in the Podman client (core/container/src/podman_client.rs:303-335), which is currently missing, so Podman itself never auto-restarts crashed containers
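The "retry with backoff" fix could look roughly like the following. This is a synchronous, std-only sketch so it stays self-contained; the real recovery code is async (tokio), and `retry_with_backoff` is a hypothetical helper name, not something that exists in the codebase. The operation receives the attempt index so the caller can log it.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Calls `op` until it succeeds or `max_attempts` calls have been made
/// (at least one call always happens). Between failed attempts it sleeps
/// `base_delay * 2^attempt`. Returns the first success or the last error.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(err) => {
                if attempt + 1 >= max_attempts {
                    return Err(err); // out of attempts: surface the failure
                }
                sleep(base_delay * 2u32.pow(attempt));
                attempt += 1;
            }
        }
    }
}
```

With `max_attempts = 3` and a 10-second base delay, the waits between attempts would be 10s and 20s; the important property is that the final error is returned to the caller instead of the container being skipped.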
Problem 2: Health Checks Are Fake
Root Causes
A. "Healthy" just means "running" — application health is never checked
File: core/archipelago/src/container/dev_orchestrator.rs:239-249
pub async fn get_health_status(&self, app_id: &str) -> Result<String> {
match status.state {
ContainerState::Running => Ok("healthy".to_string()), // <-- THIS IS THE ENTIRE CHECK
ContainerState::Stopped | ContainerState::Exited => Ok("unhealthy".to_string()),
...
}
}
A container can be "running" but the application inside is completely broken. This is reported as "healthy".
B. Manifest health checks exist but are never executed
All 30+ app manifests in image-recipe/build/debian-iso/custom/archipelago/apps/*/manifest.yml define health checks like:
health_check:
type: http
endpoint: http://localhost:4080
path: /api/health
interval: 30s
timeout: 5s
retries: 3
The HealthMonitor struct at core/container/src/health_monitor.rs can execute these checks. But it is never instantiated. No code path creates a HealthMonitor from the manifest health check definitions.
C. Health status is never pushed to the frontend via WebSocket
File: core/archipelago/src/data_model.rs:120-127
pub struct PackageDataEntry {
pub health: Option<String>, // Field exists but is NEVER POPULATED
}
The health field in the data model is always None. Frontend can only get health via explicit RPC call, which it almost never makes.
D. Frontend never polls health status
File: neode-ui/src/stores/container.ts:169-175
fetchHealthStatus() is only called after startContainer() and startBundledApp(). There is no setInterval, no periodic polling, no watch. After the initial call, health status is never refreshed.
Fixes Required
- Wire up manifest health checks: instantiate `HealthMonitor` from manifest definitions, run actual HTTP/exec probes instead of just checking `podman ps`
- Populate the `health` field in `PackageDataEntry` so WebSocket pushes real health status to frontend
- Add 30-second health polling in the frontend container store (with backoff to 60s when all healthy)
- Fix `get_health_status()` in dev_orchestrator to call actual health checks, not just check container state
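To make the difference between "running" and "healthy" concrete, here is a minimal sketch of how container state and probe results might be combined. The enums and `health_status` are hypothetical simplifications (the real `ContainerState` has more variants), and the rule that failures below the manifest's `retries` count as "pending" is an assumption about the intended semantics.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ContainerState {
    Running,
    Created,
    Exited,
}

#[derive(Debug, PartialEq)]
enum Health {
    Healthy,
    Pending,
    Unhealthy,
}

/// `consecutive_failures` counts probe failures in a row (HTTP or exec);
/// `retries` is the `retries` value from the manifest health check.
fn health_status(state: ContainerState, consecutive_failures: u32, retries: u32) -> Health {
    match state {
        // Running and the latest probe passed.
        ContainerState::Running if consecutive_failures == 0 => Health::Healthy,
        // Running, probes failing, but still within the retry budget.
        ContainerState::Running if consecutive_failures < retries => Health::Pending,
        // Retry budget exhausted, or the container is not running at all.
        _ => Health::Unhealthy,
    }
}
```

The point of the sketch is that `Running` alone never produces `Healthy`; a probe result is always part of the decision.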
Problem 3: CPU Exhaustion from Duplicate Polling
Root Causes
A. Two independent monitors both call podman stats every 60 seconds
- Health monitor: `core/archipelago/src/health_monitor.rs:17`, `CHECK_INTERVAL_SECS = 60`
  - Runs `podman ps -a --format json` (line 305-323)
  - Runs `podman stats --no-stream` every 5 cycles (line 442-450)
- Metrics collector: `core/archipelago/src/monitoring/mod.rs:28`, 60-second interval
  - Runs `podman stats --no-stream --format json` independently (collector.rs:220-224)
These are not coordinated. Both spawn separate subprocesses. On a system with 15+ containers, each podman stats call is expensive.
B. Total subprocess spawning frequency
| Component | Interval | What it runs |
|---|---|---|
| Health monitor | 60s | podman ps, podman stats (every 5th), restart attempts |
| Metrics collector | 60s | podman stats (duplicate!) |
| Crash recovery snapshot | 120s | podman ps |
| Disk monitor | 300s | df, sudo dmesg, potentially podman image prune |
| Telemetry | 900s | podman stats (another duplicate) |
| Systemd watchdog | 120s | sd_notify ping |
| Frontend fleet polling | 60s | RPC calls that trigger more podman commands |
That's roughly one podman subprocess every 10-15 seconds on average, plus all the triggered operations.
C. No restart policy means polling-driven restarts
File: core/container/src/podman_client.rs:303-335
Container creation spec does NOT include RestartPolicy. Podman itself never restarts crashed containers. Instead, the health monitor's 60-second poll detects the crash and attempts a restart. This is far more CPU-intensive than Podman's built-in restart mechanism.
D. Health monitor restart attempts with exponential backoff still spawn processes
When a container fails, the health monitor tries restarts at 10s, 30s, 90s backoff. Each attempt spawns podman start, podman inspect, etc. If multiple containers are unhealthy, this multiplies.
Fixes Required
- Deduplicate `podman stats`: create a shared cache layer. One component fetches, others read from cache (TTL: 30s)
- Add `RestartPolicy: unless-stopped` with MaxRetryCount: 5 to all container creation, letting Podman handle restarts natively instead of polling
- Increase health monitor interval to 120s (60s is too aggressive when health checks are just `podman ps`)
- Remove duplicate `podman stats` call from metrics collector; share data with health monitor
- Make frontend fleet polling viewport-aware: only poll when user is actually viewing the fleet page
- Batch all container queries: use a single `podman ps -a --format json` per check cycle, shared across all consumers
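The shared cache layer from the first fix could be as small as the sketch below. `TtlCache` and `get_or_fetch` are hypothetical names; the current time is passed in explicitly so the behaviour can be tested without sleeping, and in the real async code the cache would presumably sit behind a mutex shared by the health monitor and the metrics collector.

```rust
use std::time::{Duration, Instant};

/// Holds one cached value and the time it was fetched.
struct TtlCache<T> {
    ttl: Duration,
    entry: Option<(Instant, T)>,
}

impl<T: Clone> TtlCache<T> {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entry: None }
    }

    /// Returns the cached value while it is younger than `ttl`; otherwise
    /// runs `fetch` (e.g. one `podman stats` subprocess) and caches the result.
    fn get_or_fetch(&mut self, now: Instant, fetch: impl FnOnce() -> T) -> T {
        if let Some((fetched_at, value)) = &self.entry {
            if now.duration_since(*fetched_at) < self.ttl {
                return value.clone();
            }
        }
        let value = fetch();
        self.entry = Some((now, value.clone()));
        value
    }
}
```

With a 30-second TTL, two consumers polling every 60 seconds a few seconds apart would trigger one subprocess per cycle instead of two.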
Problem 4: Uninstall Doesn't Work
Root Causes
A. No volume removal
File: core/archipelago/src/api/rpc/package/runtime.rs:172-289
The uninstall function stops containers, removes containers, releases ports, and attempts data directory cleanup. It never removes Podman volumes. Orphaned volumes accumulate forever.
B. No network cleanup
File: core/archipelago/src/api/rpc/package/runtime.rs:172-289
Multi-container stacks create networks (archy-net, immich-net, penpot-net) during install (stacks.rs:89, 211). These are never cleaned up during uninstall. Leftover networks can prevent reinstallation.
C. Force-kills stateful containers without graceful shutdown
File: core/archipelago/src/api/rpc/package/runtime.rs:226
let rm_out = tokio::process::Command::new("podman")
.args(["rm", "-f", name]) // -f = force kill
.output().await;
The code defines proper shutdown timeouts (Bitcoin: 600s, LND: 330s, databases: 120s) but only uses them for stop. The rm -f that follows ignores these timeouts and force-kills immediately. This risks corrupting Bitcoin's UTXO set, LND channel state, or database WAL.
D. Returns 200 OK even on partial failure
File: core/archipelago/src/api/rpc/package/runtime.rs:268-289
Ok(serde_json::json!({
"status": if errors.is_empty() { "uninstalled" } else { "partial" },
...
}))
Returns HTTP 200 with "partial" status. Frontend at neode-ui/src/views/apps/useAppsActions.ts:74 doesn't check for "partial" — it deletes the app from the UI regardless.
E. Data directory cleanup requires sudo and fails silently
File: core/archipelago/src/api/rpc/package/runtime.rs:256-265
let rm_out = tokio::process::Command::new("sudo")
.args(["rm", "-rf", dir]).output().await;
if let Ok(o) = rm_out {
if !o.status.success() {
tracing::warn!(...); // Warning only, continues
}
}
If sudo isn't configured or fails, data remains on disk but UI shows "uninstalled".
F. Container name detection has gaps
File: core/archipelago/src/api/rpc/package/config.rs:287-340
Container names are hardcoded patterns. If a container was created with a different naming convention (e.g., by first-boot-containers.sh vs RPC installer), it won't be found and won't be removed.
Fixes Required
- Add `podman volume rm` for all volumes associated with the app after container removal
- Add network cleanup: remove app-specific networks after all containers on that network are gone
- Use `podman stop -t {timeout}` then `podman rm` (without -f) to respect graceful shutdown timeouts, especially for Bitcoin/LND/databases
- Return an error (not 200) when uninstall has failures. Frontend must check and display errors
- Surface "partial" failures to the user with specific error messages
- Unify container naming: derive names from a single source (manifest), not hardcoded patterns in multiple files
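The ordering in the fixes above (graceful stop, plain remove, then volumes and networks) can be expressed as a command plan. This is a sketch only: `uninstall_plan` and `stop_timeout_secs` are hypothetical helpers, the 600/330/120-second values are the ones quoted in this report, and the name matching and the 30-second default are assumptions. The plan is returned as data so it can be inspected before anything is executed.

```rust
/// Graceful shutdown timeout in seconds for a container name.
/// 600/330/120 come from the report; matching rules and default are assumed.
fn stop_timeout_secs(container: &str) -> u32 {
    if container.contains("bitcoin") {
        600
    } else if container == "lnd" {
        330
    } else if container.ends_with("-db") || container.contains("postgres") {
        120
    } else {
        30
    }
}

fn argv(parts: &[&str]) -> Vec<String> {
    parts.iter().map(|p| p.to_string()).collect()
}

/// Builds the podman commands for an uninstall, in execution order:
/// stop with timeout, remove without -f, then volumes, then networks.
fn uninstall_plan(containers: &[&str], volumes: &[&str], networks: &[&str]) -> Vec<Vec<String>> {
    let mut plan = Vec::new();
    for &c in containers {
        let timeout = stop_timeout_secs(c).to_string();
        plan.push(argv(&["podman", "stop", "-t", timeout.as_str(), c]));
        plan.push(argv(&["podman", "rm", c]));
    }
    for &v in volumes {
        plan.push(argv(&["podman", "volume", "rm", v]));
    }
    for &n in networks {
        plan.push(argv(&["podman", "network", "rm", n]));
    }
    plan
}
```

An executor would run each entry, collect failures, and return an error to the caller if the failure list is non-empty rather than reporting "partial" with a 200.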
Problem 5: Two Divergent Install Paths
The first-boot bash script and the Rust RPC installer create containers with different configurations. This is a major source of bugs.
Specific Divergences
A. Database passwords
- First-boot (`scripts/first-boot-containers.sh:118-127`): Generates random passwords with `openssl rand -base64 24`, stores in `/var/lib/archipelago/secrets/`
- Rust RPC (`core/archipelago/src/api/rpc/package/config.rs:456,484,514-515,610`): Uses hardcoded `"btcpaypass"`, `"mempoolpass"`, `"rootpass"`, `"immichpass"`
Result: Apps installed via RPC after first-boot can't connect to databases because passwords don't match.
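On the Rust side, reading the generated password instead of hardcoding it might look like the sketch below. `read_secret` is a hypothetical helper and the file naming inside the secrets directory is an assumption; the report only states the directory. A missing or empty secret is an error, never a fallback to a default password.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reads one secret file from the secrets directory and strips the trailing
/// newline. Fails if the file is missing, unreadable, or empty.
fn read_secret(secrets_dir: &Path, name: &str) -> io::Result<String> {
    let raw = fs::read_to_string(secrets_dir.join(name))?;
    let secret = raw.trim_end().to_string();
    if secret.is_empty() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("secret {} is empty", name),
        ));
    }
    Ok(secret)
}
```

Called with `Path::new("/var/lib/archipelago/secrets")`, this would make both install paths use the same credentials, provided the first-boot script writes the file names the installer expects.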
B. Bitcoin configuration
- First-boot (`scripts/first-boot-containers.sh:295-313`): Dynamically sets `-prune=550` on small disks, `-txindex=1` on large disks
- Rust RPC (`core/archipelago/src/api/rpc/package/config.rs:415-420`): No custom args at all
Result: Bitcoin installed via RPC has no pruning or txindex regardless of disk size.
C. ZMQ configuration for LND
- First-boot (`scripts/first-boot-containers.sh:100-114`): Bitcoin.conf generated without ZMQ publisher settings
- Rust RPC (`core/archipelago/src/api/rpc/package/config.rs:438-439`): LND configured to connect to `tcp://bitcoin-knots:28332` and `tcp://bitcoin-knots:28333`
Result: LND can't receive block notifications from Bitcoin because neither path enables the ZMQ publishers on the Bitcoin side; the RPC path only points LND at ports that nothing is publishing on.
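Divergences B and C both come down to which arguments Bitcoin is started with. A shared helper, sketched below, could produce them for both paths. `bitcoin_args` and the `txindex_min_gb` threshold are hypothetical: the report does not say where the first-boot script draws the line between small and large disks, so the threshold is left as a parameter. The flags themselves are the ones named in this report.

```rust
/// Builds extra bitcoind arguments: ZMQ publishers for LND on every install,
/// plus pruning on small disks or a transaction index on large ones.
fn bitcoin_args(disk_total_gb: u64, txindex_min_gb: u64) -> Vec<&'static str> {
    let mut args = vec![
        "-zmqpubrawblock=tcp://0.0.0.0:28332",
        "-zmqpubrawtx=tcp://0.0.0.0:28333",
    ];
    if disk_total_gb >= txindex_min_gb {
        args.push("-txindex=1");
    } else {
        args.push("-prune=550");
    }
    args
}
```

Pruning and `-txindex=1` are mutually exclusive in Bitcoin, which is why the helper emits exactly one of them.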
D. Port conflicts
- First-boot (`scripts/first-boot-containers.sh:813,835`): Both strfry and indeedhub bind to host port 7777
- Rust RPC (`core/archipelago/src/api/rpc/package/config.rs:734`): IndeedHub uses `8190:3000`
Result: On first-boot, whichever of strfry/indeedhub starts second fails. Via RPC, different port entirely.
E. Memory limits
- First-boot (`scripts/first-boot-containers.sh:253-283`): Ollama gets 1g on low-mem systems
- Rust RPC (`core/archipelago/src/api/rpc/package/config.rs:245-280`): Ollama gets 4g always
Result: Same app gets different resource limits depending on how it was installed.
F. Version mismatches in marketplace UI
- `scripts/image-versions.sh:17`: LND image is `v0.18.4-beta`
- `neode-ui/src/views/marketplace/marketplaceData.ts:155`: Shows `0.17.4`
- `scripts/image-versions.sh:21-22`: Mempool images are `v3.0.0`
- `neode-ui/src/views/marketplace/marketplaceData.ts:177`: Shows `2.5.0`
Fixes Required
- Single source of truth for container config: Rust config must read passwords from `/var/lib/archipelago/secrets/`, not hardcode them
- Add ZMQ config to Bitcoin startup in both paths: `zmqpubrawblock=tcp://0.0.0.0:28332` and `zmqpubrawtx=tcp://0.0.0.0:28333`
- Fix port 7777 conflict: assign unique ports to strfry and indeedhub
- Add disk-aware Bitcoin config to Rust installer (prune/txindex based on disk size)
- Sync memory limits between first-boot and Rust config
- Update marketplace version strings to match actual image versions in `image-versions.sh`
- Long-term: eliminate first-boot-containers.sh and have the backend handle all container creation using the same Rust code path
Problem 6: Post-Install Hooks Run Async and Fail Silently
File: core/archipelago/src/api/rpc/package/install.rs:541-625
Post-install hooks (setting FileBrowser password, configuring NextCloud, etc.) are spawned as background tasks:
tokio::spawn(async move {
let _ = tokio::fs::create_dir_all(secret_dir).await;
let _ = tokio::fs::write(...).await;
});
The install RPC returns success before hooks complete. If a hook fails (network timeout, service not ready), the error is logged but the user is told installation succeeded. Credentials aren't set, configs aren't applied.
Fix Required
Await post-install hooks before returning success, or return a "configuring" status and let the frontend poll for completion.
Problem 7: Podman Client Swallows Errors
File: core/container/src/podman_client.rs
A. JSON serialization failures return empty strings (line 182-183)
let body_str = body.map(|b| serde_json::to_string(&b).unwrap_or_default()).unwrap_or_default();
B. Container ID parsing failures return empty string (line 344-348)
let id = result["Id"].as_str().unwrap_or("").to_string();
Ok(id) // Empty string = success?
C. Socket timeout is only 5 seconds (line 154-160)
On a busy system or during boot, Podman socket may take >5s to respond. Every API call fails. No retry logic.
Fixes Required
- Replace `.unwrap_or_default()` with proper error propagation using `?`
- Return `Err` when container ID is empty
- Increase socket timeout to 15-30s
- Add retry with backoff (3 attempts) on socket connection
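The second fix, treating an empty container ID as a failure, might be sketched as follows. `PodmanError` and `parse_container_id` are hypothetical names; the `Option<&str>` input mirrors what `result["Id"].as_str()` yields in the snippet under B, so the helper needs no JSON dependency here.

```rust
#[derive(Debug, PartialEq)]
enum PodmanError {
    /// The create response had no usable "Id" field.
    MissingContainerId,
}

/// Turns the optional "Id" field of a create response into a container ID,
/// rejecting a missing, empty, or whitespace-only value instead of
/// returning an empty string as success.
fn parse_container_id(id_field: Option<&str>) -> Result<String, PodmanError> {
    match id_field.map(str::trim) {
        Some(id) if !id.is_empty() => Ok(id.to_string()),
        _ => Err(PodmanError::MissingContainerId),
    }
}
```

The caller can then use `?` and let the install RPC report the failure rather than continuing with an ID that matches nothing.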
Problem 8: UI Misrepresents Container State
Root Causes
A. "Exited" always displays as "Crashed" — even for clean shutdowns
File: neode-ui/src/views/apps/appsConfig.ts:119-146
getStatusLabel(state, health):
- "exited" → "crashed" // <-- THIS IS THE PROBLEM
Every container that exited — whether from a clean reboot (exit 0), OOM kill (exit 137), or app error (exit 1) — shows the same "crashed" label. After a reboot, the UI is a wall of "crashed" labels even though containers are in the process of starting up.
B. No "recovering" or "boot in progress" state exists
File: core/archipelago/src/data_model.rs:103-119
PackageState enum has Starting, but it's only set during explicit user start actions, not during automatic crash recovery. During boot recovery, containers transition from Exited → Running without ever passing through Starting, so the UI never shows a spinner or "starting up" message.
C. Backend skips sub-containers from package listing, so their state is invisible
File: core/archipelago/src/container/docker_packages.rs:39-117
The excluded_services list filters out backend services like mempool-db, btcpay-db, nbxplorer, penpot-postgres, etc. UI containers ending in -ui are also skipped. These containers are invisible to the user even when they're the actual cause of a stack failure (e.g., indeedhub-postgres being dead kills the entire IndeedHub stack, but only indeedhub-api errors are visible).
D. No distinction between "needs manual intervention" and "will recover soon"
The UI shows the same visual treatment for:
- Portainer (DB migration error — will NEVER recover without manual intervention)
- mempool-api (DB not ready yet — will recover in 30 seconds)
- IndeedHub (dependencies abandoned — won't recover until deps are manually restarted)
Fixes Required
- Differentiate exit codes: Exit 0 = "stopped" (gray), Exit non-zero = "crashed" (red), Exit 137 = "killed (OOM)" (red with warning)
- Add a "recovering" state: During boot/crash recovery window (first 5 minutes after backend start), show "Starting up..." instead of "crashed" for exited containers
- Show sub-container health: When a parent app is unhealthy, show which sub-service caused the failure (e.g., "IndeedHub: postgres is down")
- Distinguish recoverable from permanent failures: After health monitor gives up (3 attempts), change label to "Needs attention" instead of keeping "crashed"
- Add recovery progress indicator: During boot, show "Recovering containers: 15/22 started" on the dashboard
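The label rules in the fixes above can be written down as a single decision function. The real logic lives in the TypeScript frontend (`appsConfig.ts`); the sketch below uses Rust only to stay consistent with the other examples, and `exited_label`, its parameters, and the precedence between the rules are assumptions.

```rust
/// Chooses a label for a container in the Exited state.
/// Precedence (assumed): abandoned containers first, then the boot
/// recovery window, then the exit code.
fn exited_label(
    exit_code: i32,
    restart_attempts: u32,
    max_attempts: u32,
    in_recovery_window: bool,
) -> &'static str {
    if restart_attempts >= max_attempts {
        // The health monitor has given up: a human has to look at this.
        "Needs attention"
    } else if in_recovery_window {
        // Shortly after backend start, exited containers are still being recovered.
        "Starting up..."
    } else {
        match exit_code {
            0 => "stopped",
            // 137 = 128 + SIGKILL (9); the report treats this as an OOM kill.
            137 => "killed (OOM)",
            _ => "crashed",
        }
    }
}
```

Under these rules a clean exit during the first minutes after boot shows "Starting up..." instead of "crashed", and a Portainer-style permanent failure ends up as "Needs attention" once the attempt limit is reached.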
Problem 9: Dependency-Blind Restarts
Root Cause (Confirmed by .228 reboot)
The health monitor restarts containers individually without considering dependencies. This was proven by the IndeedHub stack failure:
- `indeedhub-postgres` exits cleanly (code 0) on reboot
- Health monitor restarts postgres; it starts, but exits again (likely needs volume mount or network ready)
- After 3 attempts, postgres is abandoned
- Meanwhile, `indeedhub-api` tries to connect to postgres → `ENOTFOUND indeedhub-postgres` → exits
- Health monitor restarts api → same DNS failure → exits
- After 3 attempts, api is abandoned
- Same cascade for redis, minio, relay, main container: all abandoned within minutes
File: core/archipelago/src/health_monitor.rs:500-530
The restart loop treats each container independently. There's no logic to:
- Check if a container's dependencies are running before restarting it
- Restart dependencies first when a dependent container fails
- Reset attempt counters when a dependency comes back online
3 attempts is too few, especially when dependencies need time:
- Attempt 1: 10s backoff → dependency still starting
- Attempt 2: 30s backoff → dependency crashed and is being restarted
- Attempt 3: 90s backoff → dependency hit its own 3-attempt limit and was abandoned
- Game over. Entire stack is dead.
Fixes Required
- Dependency-aware restart ordering: Before restarting a container, check if its dependencies are running. If not, restart dependencies first.
- Increase max restart attempts to 5-10 for containers with dependencies
- Reset attempt counters when a dependency comes back online (the dependent container failed because of the dependency, not itself)
- Add a "stack restart" concept: When restarting any container in a multi-container stack (indeedhub, mempool, btcpay, immich, penpot), restart the entire stack in dependency order
- Handle "Created" state containers: `archy-mempool-web` and `fedimint` are in "Created" state (never started). The health monitor should detect these and attempt to start them.
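A dependency-aware restart and the counter reset could be sketched as below. `restart_plan`, `collect`, and `reset_dependents` are hypothetical names, and the dependency edges used in any example are illustrative, not taken from the manifests. The plan lists what to start, dependencies first, skipping anything already running; the reset clears the attempt counter of every container that directly depends on a recovered service.

```rust
use std::collections::{HashMap, HashSet};

/// Returns the containers to start so that `target` can come up:
/// non-running dependencies first (depth-first), `target` last.
fn restart_plan<'a>(
    target: &'a str,
    deps: &HashMap<&'a str, Vec<&'a str>>,
    running: &HashSet<&'a str>,
) -> Vec<&'a str> {
    let mut plan = Vec::new();
    let mut seen = HashSet::new();
    collect(target, deps, running, &mut seen, &mut plan);
    plan
}

fn collect<'a>(
    name: &'a str,
    deps: &HashMap<&'a str, Vec<&'a str>>,
    running: &HashSet<&'a str>,
    seen: &mut HashSet<&'a str>,
    plan: &mut Vec<&'a str>,
) {
    // Skip running containers; `seen` also guards against dependency cycles.
    if running.contains(name) || !seen.insert(name) {
        return;
    }
    if let Some(children) = deps.get(name) {
        for &child in children {
            collect(child, deps, running, seen, plan);
        }
    }
    plan.push(name);
}

/// Resets the restart counter of every container that lists `recovered`
/// as a direct dependency: it failed because of the dependency, not itself.
fn reset_dependents<'a>(
    recovered: &str,
    deps: &HashMap<&'a str, Vec<&'a str>>,
    attempts: &mut HashMap<&'a str, u32>,
) {
    for (&name, children) in deps {
        if children.iter().any(|&c| c == recovered) {
            attempts.insert(name, 0);
        }
    }
}
```

Applied to the IndeedHub failure, a restart of `indeedhub` would first bring up postgres and then the api, instead of burning attempts on containers whose dependencies are still down.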
Priority Order for Fixes
P0 — System is broken without these (reboot = broken system)
- Dependency-aware restarts in health_monitor.rs — restart dependencies before dependents, reset attempt counters when deps recover
- Increase max restart attempts to 10 (currently 3) — dependency chains need more time on boot
- Handle "Created" state — containers stuck in Created are never started by health monitor
- Fix UI state labels — "exited" code 0 should say "stopped", not "crashed". Add "recovering" state during boot window.
- Fix Rust config to read secrets from `/var/lib/archipelago/secrets/` instead of hardcoded passwords
- Fix port 7777 conflict (strfry vs indeedhub)
- Add ZMQ config to Bitcoin for LND block notifications
P1 — Core functionality broken
- Wire up manifest health checks (replace fake "running = healthy" with actual HTTP/exec probes)
- Fix uninstall to clean up volumes, networks, and respect graceful shutdown timeouts
- Return actual errors from install/uninstall instead of silent success on partial failure
- Remove `|| true` from critical first-boot commands
- Show sub-container health in UI (which dependency is actually broken)
P2 — Performance and CPU
- Deduplicate `podman stats` calls (health monitor + metrics collector both call every 60s independently)
- Increase health monitor interval to 120s
- Add frontend health polling via WebSocket push (populate `health` field in data model)
- Make fleet polling viewport-aware (don't poll when user isn't viewing)
P3 — Consistency and correctness
- Sync memory limits between first-boot and Rust config
- Update marketplace version strings (LND shows 0.17.4, actual is 0.18.4; Mempool shows 2.5.0, actual is 3.0.0)
- Unify container naming conventions between first-boot script and Rust config
- Add disk-aware Bitcoin config (prune/txindex) to Rust installer
- Distinguish "needs manual intervention" from "will recover soon" in UI
Key Files to Modify
| File | What to fix |
|---|---|
| `core/archipelago/src/health_monitor.rs` | Dependency-aware restarts, increase MAX_RESTART_ATTEMPTS to 10, handle Created state, deduplicate with metrics collector |
| `core/container/src/podman_client.rs` | Add RestartPolicy to container creation spec, fix .unwrap_or_default() error swallowing, increase socket timeout to 15-30s |
| `core/archipelago/src/crash_recovery.rs` | Increase timeouts to 120s, add retry with backoff, fix tier ordering catch-all |
| `core/archipelago/src/api/rpc/package/install.rs` | Return failure on timeout (not silent success), await post-install hooks |
| `core/archipelago/src/api/rpc/package/runtime.rs` | Add volume/network cleanup on uninstall, use podman stop -t then podman rm (not -f), return errors on partial failure |
| `core/archipelago/src/api/rpc/package/config.rs` | Read secrets from disk, fix port 7777, add ZMQ config, sync memory limits |
| `core/archipelago/src/container/dev_orchestrator.rs` | Wire up manifest-defined health checks instead of just checking podman state |
| `core/archipelago/src/container/docker_packages.rs` | Stop filtering sub-containers from state, or expose their health as part of parent app status |
| `core/archipelago/src/data_model.rs` | Populate health field for WebSocket push, add exit code to state |
| `core/archipelago/src/monitoring/mod.rs` | Share podman stats data with health monitor instead of duplicate subprocess calls |
| `neode-ui/src/views/apps/appsConfig.ts` | Fix state labels: exit 0 = "stopped", exit non-zero = "crashed", add "recovering" during boot window |
| `neode-ui/src/stores/container.ts` | Add periodic health polling (30s) |
| `neode-ui/src/views/apps/useAppsActions.ts` | Check for "partial" uninstall status, show errors to user |
| `neode-ui/src/views/marketplace/marketplaceData.ts` | Fix version strings to match image-versions.sh |
| `scripts/first-boot-containers.sh` | Remove \|\| true from critical commands, fix port 7777 conflict, add proper error reporting |