Nightly backup

A nightly database backup is the canonical RunWisp task. It runs on a schedule, writes a timestamped artefact, takes long enough that you care about overlap, and you absolutely want to know when it fails.

This recipe covers Postgres; the shape is the same for MySQL, SQLite, MongoDB, or any external service you can drive from a shell script.

[tasks.backup-postgres]
group = "Backups"
description = "Nightly logical dump of the production database"
cron = "30 2 * * *" # 02:30 every day, daemon-local time
on_overlap = "skip" # never two dumps at once
timeout = "45m" # die before the next firing
retry_attempts = 1 # one quick retry on transient failure
retry_delay = "5m"
keep_runs = 30
keep_for = "60d"
notify_on_failure = ["slack-ops"]
run = """
set -euo pipefail
TS=$(date -u +%Y%m%dT%H%M%SZ)
DEST=/srv/backups/postgres
mkdir -p "$DEST"
PGPASSWORD="$BACKUP_DB_PASSWORD" pg_dump \\
  --host=db.internal \\
  --username=backup \\
  --format=custom \\
  --no-owner --no-privileges \\
  app_production \\
  | gzip --best > "$DEST/app_production-$TS.dump.gz"
# Verify the archive is at least readable end-to-end.
gzip -t "$DEST/app_production-$TS.dump.gz"
echo "Wrote $DEST/app_production-$TS.dump.gz ($(du -h "$DEST/app_production-$TS.dump.gz" | cut -f1))"
"""

The matching [[notifier]] block — described on the Slack provider page — listens for run.failed, run.timeout, and run.crashed because that’s what notify_on_failure desugars to.
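As a minimal sketch — the field names here (name, provider, webhook_url, events) are illustrative guesses, and the Slack provider page has the real schema — the wiring looks roughly like:

```toml
# Hypothetical keys — consult the Slack provider page for the actual schema.
[[notifier]]
name = "slack-ops"
provider = "slack"
webhook_url = "https://hooks.slack.com/services/T000/B000/XXXXXXXX"
# notify_on_failure = ["slack-ops"] desugars to a route on exactly these:
events = ["run.failed", "run.timeout", "run.crashed"]
```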

Off-peak. Avoid landing exactly on the hour or half-hour — a host whose cron jobs all fire at 0 2 * * * and 0 3 * * * has every job contending for disk and database I/O at the same instant. 30 2 * * * keeps this dump clear of that stampede.
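If you run a fleet, one way to spread firings is to derive a stable per-host minute — the modulo-on-hostname trick below is just an illustration, not a RunWisp feature:

```shell
# Derive a stable minute in 1-59 from the hostname so a fleet of hosts
# spreads its 02:xx firings instead of piling onto :00 or :30.
# cksum is a POSIX CRC, so the result is deterministic per host.
MINUTE=$(( $(hostname | cksum | cut -d' ' -f1) % 59 + 1 ))
echo "cron = \"$MINUTE 2 * * *\""
```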

If a previous dump is still running at the next firing, don’t start a second one. The default of "queue" would line up overlapping firings; for a nightly task that doesn’t help and can cause a backup pile-up if a slow night extends past 02:30 the next morning.

A backup that takes 45 minutes is broken — kill it. The cron is 24 hours apart so this is a safe ceiling. See Retries & timeouts for what timeout actually does (per-attempt, hard kill, no grace period).

A transient network blip (database briefly unreachable, an intermittent DNS hiccup) shouldn’t trigger an alert. One retry five minutes later forgives that. Two retries is rarely the right call here — if the real cause persists, you want the alert to fire so a human looks at it.

Two months of nightly dumps is enough for both forensics (“when did the schema change?”) and to outlast a long incident (“we discovered the data corruption a month later”). The retention sweeper deletes the oldest completed runs (rows + log files) when either limit is exceeded — see Logs & retention.

notify_on_failure is sugar that desugars to a route firing on run.failed / run.timeout / run.crashed. Per-task notification sugar is documented separately. The “in-app” notifier is implicitly added — even without Slack, the bell in the Web UI lights up.

Bash’s -e exits on the first failed command, -u errors on unset variables, -o pipefail propagates the exit code from any stage of a pipeline. Without these, a pg_dump that fails mid-stream will still produce a “successful” gzipped file (gzip exits 0 on truncated input) and your backup task will quietly return success.

This is the pattern for every non-trivial run block. RunWisp itself has no opinion on shell flags — the burden is on your script.
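A quick demonstration of what pipefail changes, in plain bash and independent of RunWisp:

```shell
# Without pipefail, a pipeline reports the exit code of its LAST stage only,
# so a failing producer hides behind a succeeding consumer.
set +o pipefail
false | cat > /dev/null
echo "without pipefail: $?"   # 0 — the failure of `false` is swallowed

# With pipefail, any failing stage makes the whole pipeline fail.
set -o pipefail
false | cat > /dev/null
echo "with pipefail: $?"      # 1
```

Swap `false | cat` for `pg_dump | gzip` and you have exactly the truncated-backup failure mode described above.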

Local backups die with the host. Append a sync to S3, B2, or your NAS:

aws s3 cp "$DEST/app_production-$TS.dump.gz" \
  "s3://my-backups/postgres/$(hostname)/app_production-$TS.dump.gz" \
  --storage-class GLACIER_IR

Or split it into a second task that depends on the first having landed something on disk — one cron-fired backup task plus a separate cron-fired sync task is simpler than wiring up a multi-step DAG (which RunWisp deliberately doesn’t do).
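A sketch of that second task, reusing the keys from the backup task above (the bucket name and 03:15 schedule are illustrative):

```toml
[tasks.backup-sync-s3]
group = "Backups"
description = "Ship last night's dump to S3"
cron = "15 3 * * *" # illustrative: 02:30 dump, 03:15 sync
on_overlap = "skip"
timeout = "30m"
notify_on_failure = ["slack-ops"]
run = """
set -euo pipefail
LATEST=$(ls -1t /srv/backups/postgres/app_production-*.dump.gz 2>/dev/null | head -n1 || true)
test -n "$LATEST" || { echo "no dump to sync"; exit 1; }
aws s3 cp "$LATEST" \\
  "s3://my-backups/postgres/$(hostname)/$(basename "$LATEST")" \\
  --storage-class GLACIER_IR
"""
```

If the dump hasn’t landed, the sync fails loudly and slack-ops hears about it — which is the behaviour you’d want from a DAG edge, without the DAG.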

A backup you’ve never restored is a hopeful filename, not a backup. Run a periodic restore-test as its own task:

[tasks.backup-restore-test]
group = "Backups"
description = "Restore last night's dump into a scratch DB and run a smoke query"
cron = "0 5 * * *" # 02:30 dump → 05:00 restore-test
on_overlap = "skip"
timeout = "1h"
notify_on_failure = ["slack-ops"]
run = """
set -euo pipefail
# "|| true" so an empty backup dir reaches the friendly error below
# instead of tripping set -e / pipefail inside the substitution.
LATEST=$(ls -1t /srv/backups/postgres/app_production-*.dump.gz 2>/dev/null | head -n1 || true)
test -n "$LATEST" || { echo "no dump found"; exit 1; }
# Restore into a scratch database the daemon can drop and recreate.
psql -h db.internal -U backup -d postgres -c 'DROP DATABASE IF EXISTS app_restore_test'
psql -h db.internal -U backup -d postgres -c 'CREATE DATABASE app_restore_test'
gunzip -c "$LATEST" | pg_restore --no-owner --no-privileges --dbname=app_restore_test
# Smoke query — adjust to something cheap that proves the schema is real.
psql -h db.internal -U backup -d app_restore_test -c 'SELECT count(*) FROM users LIMIT 1'
"""

Two cron rows in the daemon, two log streams, two failure paths. You’ll know within 24 hours if a backup file isn’t restorable — which is the only failure mode that actually matters.

  • Slack provider — wiring up the slack-ops notifier this recipe references.
  • Concepts: retries & timeouts — what retry_attempts and timeout actually do, and which end reasons trigger retries.
  • [storage] — the daemon-wide cap that sits above keep_runs. Don’t let on-disk dumps fill the data dir.