Health checks

A health check is a one-line curl wrapped in cron. The interesting parts are the failure semantics: how do you avoid paging on a single flaky probe, and how do you make sure the alert actually arrives when the underlying service is genuinely down?

This recipe walks through a Postgres-backed web app’s external health check, but the shape is the same for any HTTP service.

[tasks.healthcheck-api]
group = "Health checks"
description = "External probe of the public API every 5 minutes"
cron = "*/5 * * * *"
on_overlap = "skip" # don't pile up if the probe stalls
timeout = "30s" # 30s for one HTTP round-trip is generous
retry_attempts = 2 # forgive 2 transient blips before escalating
retry_delay = "30s"
retry_backoff = "linear" # 30s, 60s — total ~90s of probe-and-wait
keep_runs = 500 # ~1.7 days of 5-minute checks for forensics
notify_on_failure = ["slack-ops"]
run = """
set -euo pipefail
curl --silent --show-error --fail-with-body \\
--max-time 25 \\
--connect-timeout 5 \\
--user-agent 'runwisp-healthcheck/1.0' \\
https://api.example.com/health
"""

That’s the whole task. Two minutes of reading; ten minutes if you care about the rationale below.

Every five minutes is the sweet spot for human-driven services:

  • fine-grained enough that “down for 5 minutes” is the worst-case detection delay, which most teams can tolerate;
  • coarse enough that a slightly slow service doesn’t generate load you’d notice.

Drop to */1 * * * * for tighter SLAs, or */15 * * * * for internal tools. Keep an eye on the run history in the Web UI — if probes regularly take longer than the cron interval, dial it back.

A health check that takes 30 seconds because the service is struggling shouldn’t fire a second probe while the first is still in flight — that just adds load. skip drops the new firing. The skipped firing is recorded as a failed row with exit code -1 and the message “task already running, skipping (policy: skip)”.
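Where a runner has no overlap policy, the same guard can be hand-rolled with flock (util-linux); a sketch, with the lock path chosen here for illustration:

```shell
# Refuse to start if a previous probe still holds the lock; the lock
# is released automatically when fd 9 closes on process exit.
exec 9> /tmp/healthcheck-api.lock
if ! flock --nonblock 9; then
  echo "task already running, skipping" >&2
  exit 1
fi
echo "lock acquired, probing"
```

on_overlap = "skip" gives you the same semantics without the boilerplate, and records the skipped firing in run history instead of burying it in a log line.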

Belt and braces. curl’s --max-time 25 sits just inside the daemon’s timeout = "30s", so on a stalled request the shell-side curl exits non-zero first. The run then fails with a real curl error and a useful message in the log, rather than the daemon’s terse “killed by timeout”.
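The layering can be sketched with coreutils timeout standing in for both limits; the inner limit plays curl’s --max-time, the outer plays the daemon’s timeout:

```shell
# The inner 1s limit kills the stalled command first (exit 124);
# the outer 3s limit never has to fire, so the useful error survives.
timeout 3 sh -c 'timeout 1 sleep 10; echo "inner exit: $?"'
```

The same race plays out in the task: with --max-time 25 under timeout = "30s", the inner limit always wins.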

retry_attempts = 2 + retry_delay = "30s" + retry_backoff = "linear"

The trick: a real outage shows up as all three attempts failing, which fires the notification. A single dropped packet shows up as one failure followed by a passing retry — silent.

Math: with linear backoff and delay = 30s, the waits between attempts are 30s, then 60s. Total alert latency from the start of the first probe to the notification: ~90s of waiting plus probe time. That’s faster than most on-call rotations expect, and it eliminates the “alert bot pages me at 3am for one flaky probe” failure mode.
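The arithmetic, assuming linear means the delay scales with the retry number (matching the 30s, 60s sequence in the config comment above):

```shell
delay=30
total=0
for attempt in 1 2; do           # two retries after the initial failure
  wait=$(( delay * attempt ))    # linear: 30s, then 60s
  total=$(( total + wait ))
done
echo "waiting between attempts: ${total}s"
```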

The trade-off: if your service is genuinely flapping every 90 seconds, the retries will sometimes mask it. The notification coalescing window catches that case — flapping shows up as a single coalesced row with a count and a sparkline, rather than dozens of separate alerts.

Five-minute checks generate ~288 rows per day. 500 keeps not quite two days of probes — usually enough to correlate “the service was unhappy from 14:30 onwards” without flooding the run list. Bump it for finer-grained probes.
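The retention arithmetic, handy when retuning keep_runs for a different cadence:

```shell
interval_min=5
rows_per_day=$(( 24 * 60 / interval_min ))   # 288 rows/day at a 5-minute cadence
keep=500
tenths=$(( keep * 10 / rows_per_day ))       # days of history, in tenths
echo "${keep} runs covers about $(( tenths / 10 )).$(( tenths % 10 )) days"
```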

The curl flags that matter:

  • --silent — no progress bar in the captured log.
  • --show-error — but show the error if one happens.
  • --fail-with-body — exit non-zero on HTTP 4xx/5xx and print the response body. Without this, curl returns 0 on a 500 with body content; the task would silently report success.

--fail-with-body is curl 7.76+ (Ubuntu 22.04, Debian 12+, macOS native). On older systems use plain --fail and accept that 4xx/5xx response bodies aren’t captured.
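If the same config ships to hosts with mixed curl versions, the flag can be feature-detected at run time; a sketch (curl --help all lists every option the local binary supports):

```shell
# Prefer --fail-with-body (curl 7.76+); fall back to plain --fail.
FAIL_FLAG='--fail'
if curl --help all 2>/dev/null | grep -q -- '--fail-with-body'; then
  FAIL_FLAG='--fail-with-body'                # keeps 4xx/5xx bodies in the log
fi
echo "probing with $FAIL_FLAG"
# then: curl --silent --show-error "$FAIL_FLAG" --max-time 25 https://api.example.com/health
```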

If your service is known to be flaky and you only want to be paged on a sustained outage (say “5 failures in the last 30 minutes”), set retry_attempts to 0 and lean on the notification model to do the rate-limiting:

[tasks.healthcheck-api]
cron = "*/5 * * * *"
on_overlap = "skip"
timeout = "30s"
retry_attempts = 0
keep_runs = 500
run = "curl --silent --show-error --fail-with-body --max-time 25 https://api.example.com/health"
# No notify_on_failure here — every failure goes only to in-app.

[notify]
coalesce_window = "30m"

Now an outage that lasts 30 minutes is one in-app row with count = 6 and a sparkline. The bell glows red, but you don’t get a Slack barrage.

For internal-network checks you don’t always have a public URL — just probe a TCP port:

# Replace the curl line with:
# /dev/tcp tests the TCP handshake only; don't wait for a banner,
# since binary protocols like Postgres won't send one unsolicited
if ! timeout 5 bash -c 'exec 3<>/dev/tcp/db.internal/5432'; then
  echo "cannot connect to db.internal:5432"; exit 1
fi

Or run a real query that exercises the database, not just the listener:

PGPASSWORD="$HEALTHCHECK_PG_PWD" psql \
  --host=db.internal --username=healthcheck \
  --dbname=app_production \
  --command='SELECT 1' \
  --quiet --no-psqlrc \
  > /dev/null

A passing port-only check on a database that won’t accept connections is the classic cause of an undetected outage. Probe something real.
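Wired into a task, the query probe looks just like the API check. A sketch reusing only fields from this recipe; the host, role, database name, and the HEALTHCHECK_PG_PWD variable are illustrative, and how that secret reaches the task’s environment is assumed to be handled elsewhere:

```toml
[tasks.healthcheck-db]
group = "Health checks"
description = "SELECT 1 against app_production every 5 minutes"
cron = "*/5 * * * *"
on_overlap = "skip"
timeout = "30s"          # same belt-and-braces ceiling as the API check
retry_attempts = 2
retry_delay = "30s"
retry_backoff = "linear"
keep_runs = 500
notify_on_failure = ["slack-ops"]
run = """
set -euo pipefail
PGPASSWORD="$HEALTHCHECK_PG_PWD" psql \\
  --host=db.internal --username=healthcheck \\
  --dbname=app_production \\
  --command='SELECT 1' \\
  --quiet --no-psqlrc > /dev/null
"""
```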