Nightly backup

A nightly database backup is the canonical RunWisp task. It runs on a schedule, writes a timestamped artefact, takes long enough that you care about overlap, and you absolutely want to know when it fails.

This recipe covers Postgres; the shape is the same for MySQL, SQLite, MongoDB, or any external service you can drive from a shell script.

[tasks.backup-postgres]
group = "Backups"
description = "Nightly logical dump of the production database"
cron = "30 2 * * *" # 02:30 every day, daemon-local time
on_overlap = "skip" # never two dumps at once
timeout = "45m" # die before the next firing
retry_attempts = 1 # one quick retry on transient failure
retry_delay = "5m"
keep_runs = 30
keep_for = "60d"
notify_on_failure = ["slack-ops"]
run = """
set -euo pipefail
TS=$(date -u +%Y%m%dT%H%M%SZ)
DEST=/srv/backups/postgres
mkdir -p "$DEST"
PGPASSWORD="$BACKUP_DB_PASSWORD" pg_dump \\
  --host=db.internal \\
  --username=backup \\
  --format=custom \\
  --no-owner --no-privileges \\
  app_production \\
  | gzip --best > "$DEST/app_production-$TS.dump.gz"
# Verify the archive is at least readable end-to-end.
gzip -t "$DEST/app_production-$TS.dump.gz"
echo "Wrote $DEST/app_production-$TS.dump.gz ($(du -h "$DEST/app_production-$TS.dump.gz" | cut -f1))"
"""

The matching [[notifier]] block — described on the Slack provider page — listens for run.failed, run.timeout, and run.crashed because that’s what notify_on_failure desugars to.
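As a minimal sketch — the field names here (name, provider, webhook_url, events) are illustrative guesses, and the Slack provider page has the real schema — the wiring looks roughly like:

```toml
# Hypothetical keys — consult the Slack provider page for the actual schema.
[[notifier]]
name = "slack-ops"
provider = "slack"
webhook_url = "https://hooks.slack.com/services/T000/B000/XXXXXXXX"
# notify_on_failure = ["slack-ops"] desugars to a route on exactly these:
events = ["run.failed", "run.timeout", "run.crashed"]
```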

Off-peak. Avoid landing exactly on the hour or half-hour — a host whose cron jobs all fire at 0 2 * * * and 0 3 * * * has every job contending for disk and database I/O at the same instant. 30 2 * * * keeps this dump clear of that stampede.
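If you run a fleet, one way to spread firings is to derive a stable per-host minute — the modulo-on-hostname trick below is just an illustration, not a RunWisp feature:

```shell
# Derive a stable minute in 1-59 from the hostname so a fleet of hosts
# spreads its 02:xx firings instead of piling onto :00 or :30.
# cksum is a POSIX CRC, so the result is deterministic per host.
MINUTE=$(( $(hostname | cksum | cut -d' ' -f1) % 59 + 1 ))
echo "cron = \"$MINUTE 2 * * *\""
```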

If a previous dump is still running at the next firing, don’t start a second one. The default of "queue" would line up overlapping firings; for a nightly task that doesn’t help and can cause a backup pile-up if a slow night extends past 02:30 the next morning.

A backup that takes 45 minutes is broken — kill it. The cron is 24 hours apart so this is a safe ceiling. See Retries & timeouts for what timeout actually does (per-attempt, hard kill, no grace period).

A transient network blip (database briefly unreachable, an intermittent DNS hiccup) shouldn’t trigger an alert. One retry five minutes later forgives that. Two retries is rarely the right call here — if the real cause persists, you want the alert to fire so a human looks at it.

Two months of nightly dumps is enough for both forensics (“when did the schema change?”) and to outlast a long incident (“we discovered the data corruption a month later”). The retention sweeper deletes the oldest completed runs (rows + log files) when either limit is exceeded — see Logs & retention.

notify_on_failure is sugar that desugars to a route firing on run.failed / run.timeout / run.crashed. Per-task notification sugar is documented separately. The “in-app” notifier is implicitly added — even without Slack, the bell in the Web UI lights up.

Bash’s -e exits on the first failed command, -u errors on unset variables, -o pipefail propagates the exit code from any stage of a pipeline. Without these, a pg_dump that fails mid-stream will still produce a “successful” gzipped file (gzip exits 0 on truncated input) and your backup task will quietly return success.

This is the pattern for every non-trivial run block. RunWisp itself has no opinion on shell flags — the burden is on your script.
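A quick demonstration of what pipefail changes, in plain bash and independent of RunWisp:

```shell
# Without pipefail, a pipeline reports the exit code of its LAST stage only,
# so a failing producer hides behind a succeeding consumer.
set +o pipefail
false | cat > /dev/null
echo "without pipefail: $?"   # 0 — the failure of `false` is swallowed

# With pipefail, any failing stage makes the whole pipeline fail.
set -o pipefail
false | cat > /dev/null
echo "with pipefail: $?"      # 1
```

Swap `false | cat` for `pg_dump | gzip` and you have exactly the truncated-backup failure mode described above.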

Local backups die with the host. Append a sync to S3, B2, or your NAS:

aws s3 cp "$DEST/app_production-$TS.dump.gz" \
  "s3://my-backups/postgres/$(hostname)/app_production-$TS.dump.gz" \
  --storage-class GLACIER_IR

Or split it into a second task that depends on the first having landed something on disk — one cron-fired backup task plus a separate cron-fired sync task is simpler than wiring up a multi-step DAG (which RunWisp deliberately doesn’t do).
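A sketch of that second task, reusing the keys from the backup task above (the bucket name and 03:15 schedule are illustrative):

```toml
[tasks.backup-sync-s3]
group = "Backups"
description = "Ship last night's dump to S3"
cron = "15 3 * * *" # illustrative: 02:30 dump, 03:15 sync
on_overlap = "skip"
timeout = "30m"
notify_on_failure = ["slack-ops"]
run = """
set -euo pipefail
LATEST=$(ls -1t /srv/backups/postgres/app_production-*.dump.gz 2>/dev/null | head -n1 || true)
test -n "$LATEST" || { echo "no dump to sync"; exit 1; }
aws s3 cp "$LATEST" \\
  "s3://my-backups/postgres/$(hostname)/$(basename "$LATEST")" \\
  --storage-class GLACIER_IR
"""
```

If the dump hasn’t landed, the sync fails loudly and slack-ops hears about it — which is the behaviour you’d want from a DAG edge, without the DAG.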

A backup you’ve never restored is a hopeful filename, not a backup. Run a periodic restore-test as its own task:

[tasks.backup-restore-test]
group = "Backups"
description = "Restore last night's dump into a scratch DB and run a smoke query"
cron = "0 5 * * *" # 02:30 dump → 05:00 restore-test
on_overlap = "skip"
timeout = "1h"
notify_on_failure = ["slack-ops"]
run = """
set -euo pipefail
# "|| true" so an empty backup dir reaches the friendly error below
# instead of tripping set -e / pipefail inside the substitution.
LATEST=$(ls -1t /srv/backups/postgres/app_production-*.dump.gz 2>/dev/null | head -n1 || true)
test -n "$LATEST" || { echo "no dump found"; exit 1; }
# Restore into a scratch database the daemon can drop and recreate.
psql -h db.internal -U backup -d postgres -c 'DROP DATABASE IF EXISTS app_restore_test'
psql -h db.internal -U backup -d postgres -c 'CREATE DATABASE app_restore_test'
gunzip -c "$LATEST" | pg_restore --no-owner --no-privileges --dbname=app_restore_test
# Smoke query — adjust to something cheap that proves the schema is real.
psql -h db.internal -U backup -d app_restore_test -c 'SELECT count(*) FROM users LIMIT 1'
"""

Two cron rows in the daemon, two log streams, two failure paths. You’ll know within 24 hours if a backup file isn’t restorable — which is the only failure mode that actually matters.

  • Slack provider — wiring up the slack-ops notifier this recipe references.
  • Concepts: retries & timeouts — what retry_attempts and timeout actually do, and which end reasons trigger retries.
  • [storage] — the daemon-wide cap that sits above keep_runs. Don’t let on-disk dumps fill the data dir.