Retries & timeouts

If a task exits non-zero, RunWisp can take another crack at it. If a run drags on too long, RunWisp can pull the plug. You set both per task. Retry settings you leave off fall back to their built-in defaults and don’t inherit from [defaults]; an unset timeout does inherit from [defaults] timeout.

Retries are a task thing only. Services don’t retry — they restart instead.

[tasks.publish-feed]
cron = "*/15 * * * *"
run = "/usr/local/bin/publish.sh"
retry_attempts = 3
retry_delay = "30s"
retry_backoff = "exponential"

retry_attempts is the number of extra tries after the first one fails — it defaults to 0, so out of the box nothing retries. Set it to 3 and a failing run gets up to three more goes.

Not every ending counts as a failure worth retrying, though. A retry fires when a run ends in failed (non-zero exit), timeout, crashed, or log_overflow (which is what log_on_full = "kill_task" does after a run blows past log_max_size).

It does not fire on success, stopped, or skipped. stopped means a human deliberately killed it from the API/CLI/UI, or a sibling run cancelled it via on_overlap = "terminate" — either way, retrying would override an intentional decision. skipped means the original run is still going, so another attempt would just race it.
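The two lists above boil down to one small decision. Here is a minimal sketch of it, assuming the attempt index starts at 0 for the first try; the function and constant names are made up for illustration and are not RunWisp's own code.

```python
# End reasons as named on this page. The names RETRYABLE and should_retry
# are illustrative only, not part of RunWisp.
RETRYABLE = {"failed", "timeout", "crashed", "log_overflow"}


def should_retry(end_reason: str, retry_attempt: int, retry_attempts: int) -> bool:
    """Decide whether another attempt should follow this run.

    retry_attempt  -- index of the run that just ended (0 = first try)
    retry_attempts -- configured number of extra tries (default 0)
    """
    if end_reason not in RETRYABLE:
        # success, stopped, and skipped never trigger a retry
        return False
    # Only retry while there are extra tries left in the budget
    return retry_attempt < retry_attempts
```

With retry_attempts = 3, should_retry("failed", 0, 3) is true, should_retry("failed", 3, 3) is false because the budget is spent, and should_retry("stopped", 0, 3) is false no matter how many tries remain.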

retry_delay is how long to wait before trying again, written as a duration ("5s", "2m", "1h"). Turn retries on and it defaults to "5s".

retry_backoff decides whether that wait stays put or grows with each attempt. Services use the same vocabulary for their restart_backoff, so once you know it here it carries over there too.

| Value | Wait before attempt N (1-indexed retries) | With retry_delay = "10s" |
| --- | --- | --- |
| constant (or unset) | constant delay | 10s, 10s, 10s … |
| linear | delay × N | 10s, 20s, 30s, 40s … |
| exponential | delay × 2^(N-1) | 10s, 20s, 40s, 80s, 160s, 300s … |

No matter which curve you pick, every wait is capped at 5 minutes. An exponential schedule with a short base climbs up to that ceiling and then just sits there.
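To make the three curves and the cap concrete, here is a small sketch that reproduces the schedules from the table. It is a reading of the formulas on this page, not RunWisp's implementation, and retry_wait and CAP_SECONDS are hypothetical names.

```python
CAP_SECONDS = 300  # the 5-minute ceiling described above


def retry_wait(delay_s, backoff, n):
    """Seconds to wait before retry n (n = 1 is the first retry)."""
    if n < 1:
        raise ValueError("retries are 1-indexed")
    if backoff in (None, "constant"):
        wait = delay_s
    elif backoff == "linear":
        wait = delay_s * n
    elif backoff == "exponential":
        wait = delay_s * 2 ** (n - 1)
    else:
        raise ValueError(f"unknown backoff: {backoff!r}")
    # Every curve is clamped to the same ceiling
    return min(wait, CAP_SECONDS)


# Exponential with a 10s base: the sixth retry would be 320s, so it is capped.
print([retry_wait(10, "exponential", n) for n in range(1, 8)])
# → [10, 20, 40, 80, 160, 300, 300]
```

The cap applies to linear schedules too: with a 10s base, retry 30 and every retry after it waits 300s.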

Every attempt is its own run — its own ID, its own exit code, its own log file. The Web UI lists them all under the task’s history, numbered by retry_attempt (0 for the first try, 1 for the first retry, and on up). That’s on purpose: if attempt 1 quietly corrupted something and attempt 2 papered over it by succeeding, you can still go back and read exactly what attempt 1 printed.

[tasks.heavy-job]
cron = "0 3 * * *"
run = "/usr/local/bin/heavy-job.sh"
timeout = "30m"

timeout caps how long a single run may take. It uses the same duration syntax as retry_delay ("30s", "5m", "1h"). If you don’t set one, it inherits from [defaults] timeout, and if that isn’t set either, there’s no timeout at all and the run can take as long as it needs. The window is per attempt, so each retry starts the clock fresh, and time spent waiting out retry_delay doesn’t eat into it.
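Because the clock restarts on every attempt, the worst case for a whole retry chain is longer than one timeout. The sketch below is back-of-the-envelope arithmetic under stated assumptions: every attempt runs to its deadline, each one then uses its full graceful_stop window, and scheduling overhead is ignored. The function name and the retry values in the example are hypothetical.

```python
def worst_case_seconds(timeout_s, retry_attempts, waits_s, graceful_stop_s=5):
    """Rough upper bound on wall-clock time when every attempt times out.

    waits_s holds the wait before each retry, in order (retry 1, retry 2, ...).
    """
    attempts = retry_attempts + 1  # the first try plus the extra tries
    per_attempt = timeout_s + graceful_stop_s  # deadline, then the stop window
    return attempts * per_attempt + sum(waits_s[:retry_attempts])


# heavy-job with timeout = "30m", plus a hypothetical retry_attempts = 2
# and a constant retry_delay of "1m":
print(worst_case_seconds(1800, 2, [60, 60]))  # → 5535 (a bit over 92 minutes)
```

A number like this is worth checking against the cron interval: if the chain can outlast the gap between firings, the overlap policy comes into play.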

When the deadline passes, the daemon SIGTERMs the run’s process group, gives it up to graceful_stop (default "5s") to bow out, then SIGKILLs whatever’s left and records the run with end reason timeout. That same SIGTERM-then-wait dance is what on_overlap = "terminate", manual stops, and daemon shutdown all use — graceful_stop is the one knob behind all of it.
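The SIGTERM-then-wait-then-SIGKILL sequence is a common pattern, and a minimal POSIX-only sketch of it looks like this. It illustrates the general technique with Python's standard library; run_with_timeout is a hypothetical helper, not how the RunWisp daemon is written.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(cmd, timeout_s, graceful_stop_s=5.0):
    """Run cmd in its own process group; return (end_reason, exit_code)."""
    # start_new_session=True makes the child a process-group leader, so
    # os.killpg(proc.pid, ...) reaches the child and its descendants.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        code = proc.wait(timeout=timeout_s)
        return ("success" if code == 0 else "failed", code)
    except subprocess.TimeoutExpired:
        pass  # deadline passed, fall through to the stop sequence

    os.killpg(proc.pid, signal.SIGTERM)  # ask politely first
    try:
        proc.wait(timeout=graceful_stop_s)
    except subprocess.TimeoutExpired:
        # Simplification: this only escalates if the group leader is still
        # alive; a real supervisor would also sweep leftover children.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
    return ("timeout", proc.returncode)


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(sleeper, timeout_s=0.5, graceful_stop_s=2.0))
```

Signalling the whole group rather than a single PID matters for shell scripts, which often spawn children that would otherwise outlive the script.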

A few places where retries bump into other features:

  • on_overlap = "terminate" plus retries. When a fresh firing terminates the running attempt, that attempt ends as stopped, which shuts the retry chain down. The run the terminate policy starts is a brand-new execution, not a retry.
  • Manual stop. Same deal — stop a run from the API or UI and it ends as stopped, which ends the retry chain.
  • max_concurrent > 1. Retries go through the same concurrency check — but since a retry only starts once its predecessor has finished, there’s no overlap left to evaluate.

Services play by different rules, because the whole point of a service is to stay up.

The two do share a backoff vocabulary — constant / linear / exponential — so there’s only one set of words to remember. The difference is in the surrounding knobs: tasks have retry_attempts (services run forever, so there’s nothing to count), and services have restart_delay (the supervisor sets the pace).

| Field | Tasks | Services |
| --- | --- | --- |
| retry_attempts | ✅ default 0 | ❌ rejected |
| retry_delay | ✅ default 5s | ❌ rejected |
| retry_backoff | constant / linear / exponential, default constant | ❌ rejected |
| restart_delay | ❌ rejected | ✅ default 1s |
| restart_backoff | ❌ rejected | constant / linear / exponential, default exponential |

A service supervisor restarts a replica with bounded exponential backoff as long as it looks like the replica can come back — but if it exits before healthy_after too many times in a row, the supervisor marks that slot FATAL and stops trying until you manually restart the service.

Once a replica has stayed up for at least healthy_after (default 60s), RunWisp calls it healthy — and that one threshold pulls double duty. It resets the restart counter, so a service that flapped a bunch on startup isn’t stuck with long restart delays after it finally settles down. And it clears the failed-start streak behind FATAL, so a service that comes up and stays up is never given up on. Set it in [defaults] to cover every service that doesn’t say otherwise, or per service:

[defaults]
healthy_after = "30s" # global default

[services.flaky-worker]
run = "/usr/local/bin/worker"
healthy_after = "2m" # this one needs longer to call "stable"
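The double duty of healthy_after is easier to see as a tiny state machine. This is a sketch under assumptions, not the supervisor's real code: ReplicaSlot and max_failed_starts are hypothetical names, and the value behind "too many times in a row" is not given on this page, so the threshold here is a placeholder.

```python
class ReplicaSlot:
    """Tracks restarts for one service replica (illustrative model only)."""

    def __init__(self, healthy_after_s=60.0, max_failed_starts=5):
        self.healthy_after_s = healthy_after_s
        # Placeholder threshold: the real limit is not documented here.
        self.max_failed_starts = max_failed_starts
        self.restart_count = 0   # feeds the restart backoff curve
        self.failed_starts = 0   # consecutive exits before healthy_after
        self.fatal = False

    def on_exit(self, uptime_s):
        """Record one exit; return True if the replica should be restarted."""
        if self.fatal:
            return False  # stays down until someone restarts the service
        if uptime_s >= self.healthy_after_s:
            # The replica had been healthy: reset both counters.
            self.restart_count = 0
            self.failed_starts = 0
        else:
            self.failed_starts += 1
            if self.failed_starts >= self.max_failed_starts:
                self.fatal = True
                return False
        self.restart_count += 1
        return True
```

In this model a replica that exits after two seconds a couple of times, then stays up past healthy_after before exiting, comes back with restart_count at 1 and a clean failed-start streak, so its next restart uses the base delay again.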

Rule of thumb: if you want a finite, escalating set of retries, that’s a task with cron and retry_attempts. If you want a workload kept alive and self-healing indefinitely, that’s [services.*].