
Retries & timeouts

A task that exits non-zero can be retried automatically. A run that takes too long can be killed. Both are configured per-task, with sensible defaults from [defaults] if you leave them off.

Retries apply only to tasks. Services don’t retry — they restart instead.

```toml
[tasks.publish-feed]
cron = "*/15 * * * *"
run = "/usr/local/bin/publish.sh"
retry_attempts = 3
retry_delay = "30s"
retry_backoff = "exponential"
```
  • Default: 0 (no retries).
  • Semantics: the number of additional attempts after the first failure. retry_attempts = 3 means up to 4 executions total — one initial run plus three retries.
  • Triggers a retry: the run ended with failed (non-zero exit), timeout, crashed, or log_overflow (cancelled by log_on_full = "kill_task" after exceeding log_max_size).
  • Does not trigger a retry: success, stopped (manual stop via the API/CLI/UI, or a sibling run cancelled it via on_overlap = "terminate"), or skipped (on_overlap = "skip" rejected the firing because another run was still in flight). Stopped is a deliberate human action; skipped means the original run is still running and another attempt would just race it.
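
The retry decision described in the list above can be sketched in a few lines of Go. This is an illustrative model, not the daemon's code: the function name `shouldRetry` and the `retryable` map are hypothetical, and only the end-reason strings and the counting rule come from the text.

```go
package main

import "fmt"

// retryable holds the end reasons that the documentation says trigger a retry.
// Every other reason (success, stopped, skipped) ends the chain.
var retryable = map[string]bool{
	"failed":       true,
	"timeout":      true,
	"crashed":      true,
	"log_overflow": true,
}

// shouldRetry is a hypothetical helper. retryAttempt is 0 for the initial run
// and 1 for the first retry; retryAttempts mirrors the retry_attempts setting.
func shouldRetry(endReason string, retryAttempt, retryAttempts int) bool {
	return retryable[endReason] && retryAttempt < retryAttempts
}

func main() {
	// With retry_attempts = 3, runs 0, 1 and 2 may be followed by a retry.
	// Run 3 is the fourth and final execution.
	for attempt := 0; attempt <= 3; attempt++ {
		fmt.Println(attempt, shouldRetry("failed", attempt, 3))
	}
	// A manual stop never retries, whatever the remaining budget.
	fmt.Println(shouldRetry("stopped", 0, 3))
}
```

With the default retry_attempts = 0, the second condition is false on the very first run, so nothing is retried.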

retry_delay is parsed as a Go duration ("5s", "2m", "1h"). The default — used when retries are enabled but retry_delay is unset — is 5 seconds.
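
To check a value before putting it in the config, Go's standard `time.ParseDuration` can be run directly. The sketch below assumes the daemon uses that function unchanged, which the text implies but does not state; `parseRetryDelay` is a hypothetical helper that also applies the documented 5-second default.

```go
package main

import (
	"fmt"
	"time"
)

// defaultRetryDelay is the documented fallback when retry_delay is unset.
const defaultRetryDelay = 5 * time.Second

// parseRetryDelay is a hypothetical helper: an empty string means "unset",
// anything else goes through the standard library parser.
func parseRetryDelay(raw string) (time.Duration, error) {
	if raw == "" {
		return defaultRetryDelay, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("retry_delay %q: %w", raw, err)
	}
	return d, nil
}

func main() {
	for _, raw := range []string{"", "5s", "2m", "1h", "1m30s", "1d"} {
		d, err := parseRetryDelay(raw)
		fmt.Printf("%q -> %v (err: %v)\n", raw, d, err)
	}
}
```

Note that `time.ParseDuration` has no day unit, so a value such as "1d" is rejected by the standard parser; "24h" is the equivalent spelling.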

retry_backoff chooses the curve. The same enum is used by services’ restart_backoff, so the values move between the two contexts cleanly.

| Value | Wait before attempt N (1-indexed retries) | With retry_delay = "10s" |
| --- | --- | --- |
| constant (or unset) | constant delay | 10s, 10s, 10s … |
| linear | delay × N | 10s, 20s, 30s, 40s … |
| exponential | delay × 2^(N-1) | 10s, 20s, 40s, 80s, 160s, 300s … |

All curves are capped at 5 minutes. exponential with a 10-second base hits the cap on attempt 6 and stays there.
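
The three curves and the 5-minute cap can be reproduced with a short Go sketch. The function name `retryWait` and the constants are illustrative; only the formulas and the cap are taken from the table above.

```go
package main

import (
	"fmt"
	"time"
)

const (
	maxWait = 5 * time.Minute  // documented cap for every curve
	base    = 10 * time.Second // the retry_delay used in the table
)

// retryWait returns the wait before retry n (1-indexed) for a given curve.
// Any value other than "linear" or "exponential" is treated as constant,
// matching the "constant (or unset)" row.
func retryWait(delay time.Duration, backoff string, n int) time.Duration {
	wait := delay
	switch backoff {
	case "linear":
		wait = delay * time.Duration(n)
	case "exponential":
		// Double step by step and stop once the cap is reached,
		// so a large n cannot overflow the duration.
		for i := 1; i < n && wait < maxWait; i++ {
			wait *= 2
		}
	}
	if wait > maxWait {
		wait = maxWait
	}
	return wait
}

func main() {
	for _, curve := range []string{"constant", "linear", "exponential"} {
		waits := make([]time.Duration, 0, 6)
		for n := 1; n <= 6; n++ {
			waits = append(waits, retryWait(base, curve, n))
		}
		fmt.Println(curve, waits)
	}
}
```

Go prints durations in its own format, so the 80s and 160s entries of the exponential row appear as 1m20s and 2m40s. The sixth entry would be 320s uncapped and is clamped to 5m0s.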

Each attempt is its own run row in SQLite, with its own ULID, its own exit code, and its own captured log file. The retry chain links back via two fields:

  • retry_attempt: 0 for the first run, 1 for the first retry, 2 for the second, and so on.
  • retry_of_run_id: the ULID of the immediate predecessor.

In the Web UI you see each attempt as a separate entry in the task’s run list. That’s deliberate: if attempt 1 silently corrupted state and attempt 2 succeeded, you can still find attempt 1’s stderr.
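
The two link fields are enough to rebuild a whole chain from any attempt. The sketch below models that walk in Go; the `Run` struct, the `chain` function, and the placeholder IDs are hypothetical (real run rows carry more columns, and real IDs are ULIDs).

```go
package main

import "fmt"

// Run models only the two chain fields named above.
type Run struct {
	ID           string // ULID of this run (placeholders are used below)
	RetryAttempt int    // retry_attempt
	RetryOfRunID string // retry_of_run_id, empty for the first run
}

// sample is a made-up three-attempt chain keyed by run ID.
var sample = map[string]Run{
	"run-a": {ID: "run-a", RetryAttempt: 0},
	"run-b": {ID: "run-b", RetryAttempt: 1, RetryOfRunID: "run-a"},
	"run-c": {ID: "run-c", RetryAttempt: 2, RetryOfRunID: "run-b"},
}

// chain follows retry_of_run_id from the given run back to the first attempt
// and returns the runs oldest first. The length check guards against a
// malformed cycle in the data.
func chain(runs map[string]Run, id string) []Run {
	var out []Run
	for id != "" && len(out) < len(runs) {
		r, ok := runs[id]
		if !ok {
			break
		}
		out = append([]Run{r}, out...)
		id = r.RetryOfRunID
	}
	return out
}

func main() {
	for _, r := range chain(sample, "run-c") {
		fmt.Printf("attempt %d: %s (retry of %q)\n", r.RetryAttempt, r.ID, r.RetryOfRunID)
	}
}
```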

```toml
[tasks.heavy-job]
cron = "0 3 * * *"
run = "/usr/local/bin/heavy-job.sh"
timeout = "30m"
```
  • Parsed as a Go duration (same syntax as retry_delay).
  • Default: inherited from [defaults] timeout if set; otherwise no timeout — the run is allowed to take as long as it likes.
  • Scope: per attempt. A retry gets a fresh timeout window; the time spent waiting in retry_delay doesn’t count against it.
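
A per-attempt deadline maps naturally onto Go's `context.WithTimeout` and `exec.CommandContext`. The sketch below is an assumption about how such a deadline could be enforced, not a description of the daemon's internals; `runAttempt` is a hypothetical function, and the example commands assume a POSIX system with `sleep`, `true`, and `false` available.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// runAttempt runs one attempt under its own deadline and returns an end
// reason. An empty timeout means no deadline. Each call builds a fresh
// context, so time spent waiting between retries is never counted.
func runAttempt(timeout string, name string, args ...string) (string, error) {
	ctx := context.Background()
	if timeout != "" {
		d, err := time.ParseDuration(timeout)
		if err != nil {
			return "", err
		}
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, d)
		defer cancel()
	}

	// By default exec.CommandContext kills the process outright when the
	// context is done; it does not send a termination signal first.
	err := exec.CommandContext(ctx, name, args...).Run()
	if err == nil {
		return "success", nil
	}
	if errors.Is(ctx.Err(), context.DeadlineExceeded) {
		return "timeout", nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return "failed", nil
	}
	return "", err // the command could not be started at all
}

func main() {
	fmt.Println(runAttempt("200ms", "sleep", "5"))
	fmt.Println(runAttempt("5s", "true"))
	fmt.Println(runAttempt("5s", "false"))
}
```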

When the deadline hits, the daemon cancels the run’s context and the underlying process is killed. The run is recorded with end reason timeout. There is no SIGTERM grace period — the kill goes straight through. If your job needs to flush state cleanly before exiting, do it defensively (atomic temp-file rename, transactional commit) rather than relying on a graceful shutdown signal.
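
One way to write output defensively is the atomic temp-file rename mentioned above: write to a temporary file in the same directory, then rename it over the target. The Go sketch below is an example for a job's own code, with hypothetical names (`writeFileAtomic`, `demo`, `feed.json`); it relies on rename being atomic within one filesystem, which holds on POSIX systems.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temp file next to path and renames it into
// place. A reader sees either the old file or the complete new one, even if
// the process is killed partway through the write.
func writeFileAtomic(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	// Clean up the temp file on failure. After a successful rename the name
	// no longer exists and this call fails harmlessly.
	defer os.Remove(tmp.Name())

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	// Flush to disk before the rename makes the file visible.
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// demo writes a file into a scratch directory and reads it back.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "atomic-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "feed.json")
	if err := writeFileAtomic(path, []byte(`{"ok":true}`)); err != nil {
		return "", err
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	fmt.Println(demo())
}
```

If the kill arrives before the rename, the target is untouched and only a stray temp file may remain for a later run to clean up.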

  • on_overlap = "terminate" plus retries. If a new firing terminates the running attempt, that attempt records end reason stopped — which blocks any further retries. The new run from the terminate policy is a fresh execution, not a retry.
  • Manual stop. Same story: stopping a run from the API/UI records stopped and ends the retry chain.
  • parallelism > 1. Retries don’t count against parallelism. Each retry runs after its predecessor is already terminal, so there’s no overlap to evaluate.

Services have a different model because they’re meant to stay up.

Tasks and services share the same backoff vocabulary — constant / linear / exponential — so one rule is easier to remember. Tasks add retry_attempts (services run forever); services add restart_delay (the supervisor owns the cadence).

| Field | Tasks | Services |
| --- | --- | --- |
| retry_attempts | ✅ default 0 | ❌ rejected |
| retry_delay | ✅ default 5s | ❌ rejected |
| retry_backoff | constant / linear / exponential, default constant | ❌ rejected |
| restart_delay | ❌ rejected | ✅ default 1s |
| restart_backoff | ❌ rejected | constant / linear / exponential, default exponential |

A service supervisor restarts a replica forever (with exponential backoff capped at 60s) until you stop it explicitly. A replica that stays up for at least 60 seconds resets its backoff counter, so transient flapping doesn’t permanently slow restarts on a service that eventually stabilises.
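
The restart cadence described above can be modelled as a small state machine. The Go sketch below is a guess at the shape of that logic, not the supervisor's code: `backoffState` and `next` are hypothetical names, and it assumes the exponential curve uses the same delay × 2^(N-1) formula as task retries, with N counting consecutive exits since the last reset.

```go
package main

import (
	"fmt"
	"time"
)

const (
	defaultRestartDelay = time.Second      // documented restart_delay default
	restartCap          = 60 * time.Second // documented cap on the restart wait
	stableUptime        = 60 * time.Second // uptime that resets the counter
)

// backoffState tracks consecutive short-lived exits of one replica.
type backoffState struct {
	delay    time.Duration // restart_delay
	failures int           // exits since the last reset
}

// next records an exit after the given uptime and returns the wait before
// the next restart.
func (b *backoffState) next(uptime time.Duration) time.Duration {
	if uptime >= stableUptime {
		b.failures = 0 // the replica was stable, so start the curve over
	}
	b.failures++
	wait := b.delay
	for i := 1; i < b.failures && wait < restartCap; i++ {
		wait *= 2
	}
	if wait > restartCap {
		wait = restartCap
	}
	return wait
}

func main() {
	b := &backoffState{delay: defaultRestartDelay}
	// Eight immediate crashes: the wait doubles until it reaches the cap.
	for i := 0; i < 8; i++ {
		fmt.Print(b.next(0), " ")
	}
	// One run that stayed up for 90 seconds resets the curve.
	fmt.Println(b.next(90 * time.Second))
}
```

Under these assumptions a flapping replica reaches the 60-second cap on its seventh consecutive exit, and a single stable run brings the next wait back to restart_delay.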