Monitoring & Watchdog

Three-layer monitoring architecture: because process alive does not mean healthy, and healthy does not mean correct.

Three-Layer Architecture

Level 1

OS / systemd

Process-level monitoring by the operating system. If openclaw-gateway crashes, systemd automatically restarts it. This catches catastrophic process death but cannot detect application-level issues.

Detects

Process crash, OOM kill, service failure

Limitation

Process alive ≠ healthy. Gateway can be running but unresponsive.

Level 1.5

Watchdog

Lightweight health ping that runs independently of the gateway process. Verifies the gateway is not just alive but actually responding to HTTP requests. Runs via cron — survives gateway death.

Detects

Unresponsive gateway, hung process, failed HTTP health endpoint

Limitation

Only checks availability, not correctness of data or configuration.

Level 2

Full Health Battery

Mighty Mark's 138+ checks across 13 categories. Deep validation of gateway config, agent state, API connectivity, data integrity, fleet communication, security posture, and temporal systems.

Detects

Configuration drift, API failures, audit chain corruption, stale data, security violations

Limitation

Slower to run (minutes, not seconds). Scheduled daily + on-demand.

The watchdog (Level 1.5) runs as a cron job outside the gateway process. If the gateway dies, the watchdog keeps running and can detect the failure. This is why it's called Level 1.5—it's more intelligent than systemd but lighter than the full battery.

Heal Recommendations

When problems are detected, Mighty Mark classifies them and emits structured JSON recommendations. A separate executor (fleet-heal) picks up FIXABLE items and performs the actual remediation—preserving the sentinel's read-only constraint.

Classification	Description	Example
FIXABLE	Mighty Mark knows the exact remediation steps. fleet-heal can execute automatically.	Gateway process dead → restart via systemctl
BROKEN	Problem detected but remediation requires investigation. Human review recommended.	Audit chain hash mismatch → data corruption investigation
MANUAL	Requires human intervention. Cannot be automated safely.	Expired API key → credential rotation by operator

Incident Logging

Every check run produces a JSONL append-only incident log. Each record includes the fleet-bus auditHash so incident investigation can cross-reference the cryptographic audit trail—connecting operational incidents to tamper-proof evidence.

Incident Log Entry (JSONL)

{
  "timestamp": "2026-04-03T06:00:00Z",
  "category": "gateway",
  "passed": 9,
  "failed": 1,
  "warned": 0,
  "classification": "FIXABLE",
  "recommendation": "Gateway process unresponsive — restart via systemctl",
  "auditHash": "a3f2c1...b8e9d4"
}

Upgrade Triage

mighty-mark triage performs version-indexed breaking change detection before an OpenClaw upgrade. It compares your current configuration against known breaking changes in the target version, flagging items that need attention before you upgrade.

CLI Reference

Monitoring Commands

# mighty-mark v0.7.x

# Start watchdog mode (lightweight health ping loop)
mighty-mark watchdog

# Run heal detection and emit recommendations
mighty-mark heal

# View incident log
mighty-mark incidents

# Run upgrade triage for target version
mighty-mark triage

# Install watchdog as system service
mighty-mark install-watchdog

Health Check Battery

Full Level 2 check breakdown

Governance & Authority

Why heal is detect-only