Health checks

A health check is a one-line curl wrapped in cron. The interesting parts are the failure semantics: how do you avoid paging on a single flaky probe, and how do you make sure the alert actually arrives when the underlying service is genuinely down?

This recipe walks through a Postgres-backed web app’s external health check, but the shape is the same for any HTTP service.

[tasks.healthcheck-api]
group = "Health checks"
description = "External probe of the public API every 5 minutes"
cron = "*/5 * * * *"
on_overlap = "skip" # don't pile up if the probe stalls
timeout = "30s" # 30s for one HTTP round-trip is generous
retry_attempts = 2 # forgive 2 transient blips before escalating
retry_delay = "30s"
retry_backoff = "linear" # 30s, 60s — total ~90s of probe-and-wait
keep_runs = 500 # ~1.7 days of 5-minute checks for forensics
notify_on_failure = ["slack-ops"]
run = """
set -euo pipefail
curl --silent --show-error --fail-with-body \\
--max-time 25 \\
--connect-timeout 5 \\
--user-agent 'runwisp-healthcheck/1.0' \\
https://api.example.com/health
"""

That’s the whole task. Two minutes of reading; ten minutes if you care about the rationale below.

Every five minutes is the sweet spot for human-driven services:

  • fine-grained enough that “down for 5 minutes” is the worst-case detection delay, which most teams can tolerate;
  • coarse enough that a slightly slow service doesn’t generate load you’d notice.

Drop to */1 * * * * for tighter SLAs, or */15 * * * * for internal tools. Keep an eye on the run history in the Web UI — if probes regularly take longer than the cron interval, dial it back.

A health check that takes 30 seconds because the service is struggling shouldn’t fire a second probe while the first is still in flight — that just adds load. skip drops the new firing. The skipped firing is recorded as a failed row with exit code -1 and the message “task already running, skipping (policy: skip)”.
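Where a runner has no overlap policy, the same guard can be hand-rolled with flock (util-linux); a sketch, with the lock path chosen here for illustration:

```shell
# Refuse to start if a previous probe still holds the lock; the lock
# is released automatically when fd 9 closes on process exit.
exec 9> /tmp/healthcheck-api.lock
if ! flock --nonblock 9; then
  echo "task already running, skipping" >&2
  exit 1
fi
echo "lock acquired, probing"
```

on_overlap = "skip" gives you the same semantics without the boilerplate, and records the skipped firing in run history instead of burying it in a log line.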

Belt and braces. curl’s --max-time 25 sits just inside the daemon’s timeout = "30s", so on a stalled request the shell-side curl exits non-zero first. The run then fails with a real curl error and a useful message in the log, rather than the daemon’s terse “killed by timeout”.
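The layering can be sketched with coreutils timeout standing in for both limits; the inner limit plays curl’s --max-time, the outer plays the daemon’s timeout:

```shell
# The inner 1s limit kills the stalled command first (exit 124);
# the outer 3s limit never has to fire, so the useful error survives.
timeout 3 sh -c 'timeout 1 sleep 10; echo "inner exit: $?"'
```

The same race plays out in the task: with --max-time 25 under timeout = "30s", the inner limit always wins.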

retry_attempts = 2 + retry_delay = "30s" + retry_backoff = "linear"

The trick: a real outage shows up as all three attempts failing, which fires the notification. A single dropped packet shows up as one failure followed by a passing retry — silent.

Math: with linear backoff and delay = 30s, the waits between attempts are 30s, then 60s. Total alert latency from the start of the first probe to the notification: ~90s of waiting plus probe time. That’s faster than most on-call rotations expect, and it eliminates the “alert bot pages me at 3am for one flaky probe” failure mode.
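The arithmetic, assuming linear means the delay scales with the retry number (matching the 30s, 60s sequence in the config comment above):

```shell
delay=30
total=0
for attempt in 1 2; do           # two retries after the initial failure
  wait=$(( delay * attempt ))    # linear: 30s, then 60s
  total=$(( total + wait ))
done
echo "waiting between attempts: ${total}s"
```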

The trade-off: if your service is genuinely flapping every 90 seconds, the retries will sometimes mask it. The notification coalescing window catches that case — flapping shows up as a single coalesced row with a count and a sparkline, rather than dozens of separate alerts.

Five-minute checks generate ~288 rows per day. 500 keeps not quite two days of probes — usually enough to correlate “the service was unhappy from 14:30 onwards” without flooding the run list. Bump it for finer-grained probes.
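The retention arithmetic, handy when retuning keep_runs for a different cadence:

```shell
interval_min=5
rows_per_day=$(( 24 * 60 / interval_min ))   # 288 rows/day at a 5-minute cadence
keep=500
tenths=$(( keep * 10 / rows_per_day ))       # days of history, in tenths
echo "${keep} runs covers about $(( tenths / 10 )).$(( tenths % 10 )) days"
```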

The curl flags that matter:

  • --silent — no progress bar in the captured log.
  • --show-error — but show the error if one happens.
  • --fail-with-body — exit non-zero on HTTP 4xx/5xx and print the response body. Without this, curl returns 0 on a 500 with body content; the task would silently report success.

--fail-with-body is curl 7.76+ (Ubuntu 22.04, Debian 12+, macOS native). On older systems use plain --fail and accept that 4xx/5xx response bodies aren’t captured.
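If the same config ships to hosts with mixed curl versions, the flag can be feature-detected at run time; a sketch (curl --help all lists every option the local binary supports):

```shell
# Prefer --fail-with-body (curl 7.76+); fall back to plain --fail.
FAIL_FLAG='--fail'
if curl --help all 2>/dev/null | grep -q -- '--fail-with-body'; then
  FAIL_FLAG='--fail-with-body'                # keeps 4xx/5xx bodies in the log
fi
echo "probing with $FAIL_FLAG"
# then: curl --silent --show-error "$FAIL_FLAG" --max-time 25 https://api.example.com/health
```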

If your service is known to be flaky and you only want to be paged on a sustained outage (say “5 failures in the last 30 minutes”), set retry_attempts to 0 and lean on the notification model to do the rate-limiting:

[tasks.healthcheck-api]
cron = "*/5 * * * *"
on_overlap = "skip"
timeout = "30s"
retry_attempts = 0
keep_runs = 500
run = "curl --silent --show-error --fail-with-body --max-time 25 https://api.example.com/health"
# No notify_on_failure here — every failure goes only to in-app.

[notify]
coalesce_window = "30m"

Now an outage that lasts 30 minutes is one in-app row with count = 6 and a sparkline. The bell glows red, but you don’t get a Slack barrage.

For internal-network checks you don’t always have a public URL — just probe a TCP port:

# Replace the curl line with:
# /dev/tcp tests the TCP handshake only; don't wait for a banner,
# since binary protocols like Postgres won't send one unsolicited
if ! timeout 5 bash -c 'exec 3<>/dev/tcp/db.internal/5432'; then
  echo "cannot connect to db.internal:5432"; exit 1
fi

Or run a real query that exercises the database, not just the listener:

PGPASSWORD="$HEALTHCHECK_PG_PWD" psql \
  --host=db.internal --username=healthcheck \
  --dbname=app_production \
  --command='SELECT 1' \
  --quiet --no-psqlrc \
  > /dev/null

A passing port-only check on a database that won’t accept connections is the classic cause of an undetected outage. Probe something real.
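Wired into a task, the query probe looks just like the API check. A sketch reusing only fields from this recipe; the host, role, database name, and the HEALTHCHECK_PG_PWD variable are illustrative, and how that secret reaches the task’s environment is assumed to be handled elsewhere:

```toml
[tasks.healthcheck-db]
group = "Health checks"
description = "SELECT 1 against app_production every 5 minutes"
cron = "*/5 * * * *"
on_overlap = "skip"
timeout = "30s"          # same belt-and-braces ceiling as the API check
retry_attempts = 2
retry_delay = "30s"
retry_backoff = "linear"
keep_runs = 500
notify_on_failure = ["slack-ops"]
run = """
set -euo pipefail
PGPASSWORD="$HEALTHCHECK_PG_PWD" psql \\
  --host=db.internal --username=healthcheck \\
  --dbname=app_production \\
  --command='SELECT 1' \\
  --quiet --no-psqlrc > /dev/null
"""
```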