# Health checks
A health check is a one-line curl wrapped in cron. The interesting
parts are the failure semantics: how do you avoid paging on a single
flaky probe, and how do you make sure the alert actually arrives when
the underlying service is genuinely down?
This recipe walks through a Postgres-backed web app’s external health check, but the shape is the same for any HTTP service.
## The task

```toml
[tasks.healthcheck-api]
group = "Health checks"
description = "External probe of the public API every 5 minutes"
cron = "*/5 * * * *"
on_overlap = "skip"      # don't pile up if the probe stalls
timeout = "30s"          # 30s for one HTTP round-trip is generous
retry_attempts = 2       # forgive 2 transient blips before escalating
retry_delay = "30s"
retry_backoff = "linear" # 30s, 60s — total ~90s of probe-and-wait
keep_runs = 500          # ~1.7 days of 5-minute checks for forensics
notify_on_failure = ["slack-ops"]
run = """
set -euo pipefail
curl --silent --show-error --fail-with-body \
  --max-time 25 \
  --connect-timeout 5 \
  --user-agent 'runwisp-healthcheck/1.0' \
  https://api.example.com/health
"""
```

That's the whole task. Two minutes of reading; ten minutes if you care about the rationale below.
## Why each knob

### `cron = "*/5 * * * *"`

Every five minutes is the sweet spot for human-driven services:
- fine-grained enough that “down for 5 minutes” is the worst-case detection delay, which most teams can tolerate;
- coarse enough that a slightly slow service doesn’t generate load you’d notice.
Drop to `*/1 * * * *` for tighter SLAs, or `*/15 * * * *` for internal tools. Keep an eye on the run history in the Web UI — if probes regularly take longer than the cron interval, dial it back.
### `on_overlap = "skip"`

A health check that takes 30 seconds because the service is struggling shouldn't fire a second probe while the first is still in flight — that just adds load. `skip` drops the new firing.

The skipped firing is recorded as a failed row with exit code `-1` and the message "task already running, skipping (policy: skip)".
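The semantics are roughly those of an advisory file lock. A sketch with `flock`, purely for illustration (the daemon does its own bookkeeping, and the lock path here is made up):

```bash
# Sketch only: emulate on_overlap = "skip" with an advisory file lock.
# The daemon does not actually use flock; this just shows the semantics.
lock=/tmp/healthcheck-api.lock           # hypothetical path
exec 9>"$lock"
if ! flock --nonblock 9; then
  echo "task already running, skipping (policy: skip)" >&2
  exit 1                                 # recorded as a failed run
fi
echo "probe runs here"                   # first firing proceeds
```

A second firing while fd 9 still holds the lock takes the skip branch instead of stacking a second probe.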
### `timeout = "30s"` + `--max-time 25`

Belt and braces. The shell-side `curl --max-time` exits non-zero before the daemon's timeout kicks in — so the run exits with a real curl error and a useful message in the log, rather than the daemon's terse "killed by timeout".
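The interplay is easy to demonstrate without a network. In this scaled-down sketch, the coreutils `timeout` command stands in for the daemon's limit, an inner `timeout 1 sleep 10` stands in for a stalled `curl --max-time`, and exit code 28 mimics curl's "operation timed out":

```bash
# Scaled-down sketch: inner deadline (curl's --max-time) vs. outer deadline
# (the daemon's timeout = "30s"). `sleep 10` plays a stalled HTTP request.
timeout 3 bash -c "
  if ! timeout 1 sleep 10; then        # inner limit trips first, at 1s
    echo 'inner deadline hit first'    # a real, specific error in the log
    exit 28                            # curl's 'operation timed out' code
  fi
"
echo "run exit code: $?"               # prints: run exit code: 28
```

Because the inner limit is tighter, the run ends with curl's own exit code and message rather than an opaque kill from outside.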
### `retry_attempts = 2` + `retry_delay = "30s"` + `retry_backoff = "linear"`

The trick: a real outage shows up as all three attempts failing, which fires the notification. A single dropped packet shows up as one failure followed by a passing retry — silent.
Math: with `linear` and `retry_delay = "30s"`, the attempts wait 30s, then 60s between them. Total alert latency from the start of the first probe to the notification: ~90s plus probe time. That's faster than most on-call humans expect, and it eliminates the "alert bot pages me at 3am for one flaky probe" failure mode.
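The schedule in executable form — a sketch assuming `linear` multiplies the base delay by the attempt number, which is what the 30s/60s figures above imply:

```bash
# Linear backoff sketch: wait = retry_delay * attempt number.
delay=30
total=0
for attempt in 1 2; do                   # retry_attempts = 2
  wait=$(( delay * attempt ))
  total=$(( total + wait ))
  echo "retry $attempt fires after ${wait}s"
done
echo "worst-case alert latency: ~${total}s plus probe time"
```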
The trade-off: if your service is genuinely flapping every 90 seconds, the retries will sometimes mask it. The notification coalescing window catches that case — flapping shows up as a single coalesced row with a count and a sparkline, rather than dozens of separate alerts.
### `keep_runs = 500`

Five-minute checks generate ~288 rows per day. 500 keeps not quite two days of probes — usually enough to correlate "the service was unhappy from 14:30 onwards" without flooding the run list. Bump it for finer-grained probes.
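The retention arithmetic, spelled out:

```bash
# 500 retained runs at one probe every 5 minutes:
runs_per_day=$(( 24 * 60 / 5 ))                 # prints 288 below
echo "runs per day: $runs_per_day"
awk -v keep=500 -v per_day="$runs_per_day" \
  'BEGIN { printf "days retained: %.1f\n", keep / per_day }'   # 1.7
```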
### `--silent --show-error --fail-with-body`

The curl flags that matter:

- `--silent` — no progress bar in the captured log.
- `--show-error` — but show the error if one happens.
- `--fail-with-body` — exit non-zero on HTTP 4xx/5xx and print the response body. Without this, curl returns 0 on a 500 with body content; the task would silently report success.
`--fail-with-body` requires curl 7.76+ (Ubuntu 22.04, Debian 12+, macOS native). On older systems use plain `--fail` and accept that 4xx/5xx response bodies aren't captured.
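If the same task file has to run on a mix of hosts, the flag can be chosen at probe time. A sketch — `version_ge` is a local helper defined here, not a curl feature:

```bash
# Pick --fail-with-body when the local curl is new enough (7.76.0+),
# otherwise fall back to --fail. version_ge is defined here, not built in.
version_ge() {
  # True when $1 >= $2 under version-number ordering.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

curl_ver=$(curl --version | awk 'NR==1 { print $2 }')
if version_ge "$curl_ver" 7.76.0; then
  fail_flag='--fail-with-body'
else
  fail_flag='--fail'    # 4xx/5xx bodies won't be captured
fi
echo "using $fail_flag"
```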
## A flap-friendly variant

If your service is known to be flaky and you only want to be paged on a sustained outage (say "5 failures in the last 30 minutes"), turn `retry_attempts` down to 0 and lean on the notifications model to do the rate-limiting:
```toml
[tasks.healthcheck-api]
cron = "*/5 * * * *"
on_overlap = "skip"
timeout = "30s"
retry_attempts = 0
keep_runs = 500
run = "curl --silent --show-error --fail-with-body --max-time 25 https://api.example.com/health"

# No notify_on_failure here — every failure goes only to in-app.
[notify]
coalesce_window = "30m"
```

Now an outage that lasts 30 minutes is one in-app row with `count = 6` and a sparkline. The bell glows red, but you don't get a Slack barrage.
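Where the `count = 6` comes from:

```bash
# One coalesced row per window: firings = coalesce window / probe interval.
window_min=30      # coalesce_window = "30m"
interval_min=5     # cron = "*/5 * * * *"
echo "failures coalesced per window: $(( window_min / interval_min ))"
```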
## Probing internal services

For internal-network checks you don't always have a public URL — just probe a TCP port:

```bash
# Replace the curl line with:
exec 3<>/dev/tcp/db.internal/5432
echo -e "" >&3
read -t 5 -u 3 RESP || { echo "no response from db.internal:5432"; exit 1; }
```

Or run a real query that exercises the database, not just the listener:

```bash
PGPASSWORD="$HEALTHCHECK_PG_PWD" psql \
  --host=db.internal --username=healthcheck \
  --dbname=app_production \
  --command='SELECT 1' \
  --quiet --no-psqlrc \
  > /dev/null
```

A passing port-only check on a database that won't accept connections is the classic cause of an undetected outage. Probe something real.
## Where to next

- Slack provider — wiring the `slack-ops` notifier the task references.
- Notifications model — coalescing, delivery failure handling, and the in-app default.
- Concepts: retries & timeouts — what `retry_backoff = "linear"` actually computes.