Error Handling & Retry Mechanisms for Continuous Aggregates

In high-throughput IoT telemetry pipelines, financial tick data streams, and industrial monitoring workloads, continuous aggregates serve as the foundational layer for low-latency analytics. Automated materialization cycles inevitably encounter transient failures: serialization conflicts, lock contention, out-of-memory conditions, or collisions with concurrent retention sweeps. This page solves one focused engineering problem — how to make a failed continuous-aggregate refresh recover deterministically instead of degrading silently. The building blocks defined in Continuous Aggregate Creation & Refresh Management establish how TimescaleDB materializes incremental updates; here we wrap those refreshes in an explicit failure boundary, a durable retry queue, and post-refresh validation so that a single aborted job never leaves dashboards, alerting rules, or machine-learning feature stores reading stale data.

A four-state machine drives the rest of this guide. Every refresh attempt lands in one of four states (Pending, Retrying, Resolved, or DeadLetter), and each implementation step below maps directly to a transition between them.

Prerequisites

This pattern targets TimescaleDB 2.10+ on PostgreSQL 14+ with the background job scheduler enabled. Confirm each item before deploying the retry harness:

timescaledb extension installed and loaded via shared_preload_libraries
At least one continuous aggregate already created (see Materialized View Architecture & Syntax)
CREATE and USAGE privileges on the target schema, plus EXECUTE on the helper functions and procedures defined below
Read access to timescaledb_information.jobs, timescaledb_information.job_stats, and timescaledb_information.job_errors
timescaledb.max_background_workers sized to cover every aggregate, compression, and retention policy plus headroom
A Python 3.11+ environment with psycopg v3 for the external orchestration layer
A connection role that can execute CALL refresh_continuous_aggregate(...) outside a transaction block

Because refresh_continuous_aggregate() cannot run inside a transaction block, the orchestration connection must use autocommit. Attempting to call it from within a BEGIN ... COMMIT wrapper raises invalid_transaction_termination and is the single most common setup error.

The Failure Surface in Incremental Materialization

Understanding the storage engine is mandatory before implementing fault tolerance. Continuous aggregates rely on hypertable chunking, watermark tracking, and incremental materialization. The system maintains an internal watermark that determines which time ranges require recomputation. When a refresh transaction aborts mid-chunk — because of a serialization failure, an out-of-memory kill, or a network partition during a distributed query — the watermark may stall or advance inconsistently with the materialized partials. The result is either stale aggregation or duplicate materialization on the next cycle. Treat the watermark as a recoverable checkpoint rather than an immutable progression: the retry queue must be able to re-drive a window without assuming the previous attempt left clean state.

The failure surface overlaps three adjacent subsystems. Retention automation may issue a DROP CHUNK on a range a refresh is mid-way through reading; the background worker pool defined by your Asynchronous Execution & Queue Management configuration can starve refresh jobs during saturation; and unaligned refresh windows from Refresh Policy Design & Scheduling can queue overlapping recomputation of the same buckets. Each of these produces a distinct SQLSTATE, and classifying them correctly is what separates a targeted retry from a blind re-execution loop.

Step-by-Step Implementation

The five steps below build the closed loop shown in the state diagram: capture a failure into a durable row (Pending), let backoff elapse, re-drive the refresh (Retrying), validate the result (Resolved), and escalate exhausted jobs (DeadLetter).
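The four states can be modeled as a small transition function. The sketch below is illustrative only: the names `RefreshState` and `next_state` are hypothetical, and the exact transitions are an assumption inferred from the step descriptions rather than a definitive specification.

```python
from enum import Enum


class RefreshState(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    RESOLVED = "Resolved"
    DEAD_LETTER = "DeadLetter"


def next_state(state, *, succeeded=False, validated=False,
               retry_count=0, max_retries=3, backoff_elapsed=True):
    """Return the state that follows `state` after one retry-loop tick."""
    if state is RefreshState.PENDING:
        if retry_count >= max_retries:
            return RefreshState.DEAD_LETTER  # retry budget exhausted
        # A pending job waits until its backoff window has elapsed.
        return RefreshState.RETRYING if backoff_elapsed else RefreshState.PENDING
    if state is RefreshState.RETRYING:
        if succeeded and validated:
            return RefreshState.RESOLVED
        # A failed or unvalidated attempt goes back to waiting.
        return RefreshState.PENDING
    # Resolved and DeadLetter are treated as terminal in this sketch.
    return state
```

Keeping the transitions in one pure function makes them easy to unit-test separately from the database calls that trigger them.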

Step 1 — Persist a durable failure record

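Default scheduler retry logic is intentionally conservative; it will not recover from heavy lock contention, concurrent DROP CHUNK operations, or resource exhaustion on shared worker pools. The first step is a durable audit table plus an upsert helper. Call the helper in a transaction separate from the failed refresh (the autocommit orchestration connection does this naturally) so the audit row is not rolled back along with it. Keying ON CONFLICT by aggregate name lets repeat failures accumulate a retry_count on a single row rather than flooding the table. The helper schedules the first retry one minute out and repeat failures five minutes out; the orchestrator in Step 5 replaces that fixed delay with exponential backoff when its own retry fails.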

sql

CREATE TABLE IF NOT EXISTS aggregate_refresh_audit (
    aggregate_name TEXT PRIMARY KEY,
    failed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    error_state TEXT,
    error_detail TEXT,
    retry_count INT NOT NULL DEFAULT 0,
    next_retry TIMESTAMPTZ,
    resolved BOOLEAN NOT NULL DEFAULT FALSE
);

-- Record a refresh failure. The orchestration layer captures the SQLSTATE and
-- detail when a refresh raises, then calls this helper. Keying ON CONFLICT by
-- aggregate_name lets repeat failures accumulate retry_count on a single row.
CREATE OR REPLACE FUNCTION record_refresh_failure(
    p_aggregate TEXT, p_state TEXT, p_detail TEXT
) RETURNS VOID AS $$
BEGIN
    INSERT INTO aggregate_refresh_audit (aggregate_name, error_state, error_detail, next_retry)
    VALUES (p_aggregate, p_state, p_detail, NOW() + INTERVAL '1 minute')
    ON CONFLICT (aggregate_name) DO UPDATE SET
        failed_at    = NOW(),
        error_state  = EXCLUDED.error_state,
        error_detail = EXCLUDED.error_detail,
        retry_count  = aggregate_refresh_audit.retry_count + 1,
        resolved     = FALSE,
        next_retry   = NOW() + INTERVAL '5 minutes';
END;
$$ LANGUAGE plpgsql;

Step 2 — Classify the failure by SQLSTATE

Not every failure deserves a retry. This pattern leverages PostgreSQL’s standard error-code catalog to classify failures accurately. Referencing the official PostgreSQL Error Codes appendix lets you differentiate transient conditions from permanent ones and route each accordingly:

Class 08 (connection exception) and 40 (transaction rollback, including 40001 serialization failure and 40P01 deadlock) are transient — retry with backoff.
Class 53 (insufficient resources, e.g. 53200 out of memory) is transient but should widen the backoff to let pressure subside.
Class 23 (integrity constraint violation) and 42 (syntax/access rule) are permanent — escalate straight to the dead-letter state without wasting retry budget.

Capturing SQLSTATE and MESSAGE_TEXT in the caller and passing them to record_refresh_failure() preserves this signal for the retry loop to act on.
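The routing rules above can be sketched as a lookup on the first two characters of the SQLSTATE. The function name `classify_sqlstate` and the three return labels are hypothetical, and treating unlisted classes as retryable is an assumption that leans on max_retries to bound the cost.

```python
TRANSIENT_CLASSES = {"08", "40"}   # connection exception, transaction rollback
RESOURCE_CLASSES = {"53"}          # insufficient resources
PERMANENT_CLASSES = {"23", "42"}   # integrity violation, syntax/access rule


def classify_sqlstate(sqlstate):
    """Map a five-character SQLSTATE to a routing decision."""
    error_class = (sqlstate or "")[:2]
    if error_class in PERMANENT_CLASSES:
        return "dead_letter"
    if error_class in RESOURCE_CLASSES:
        return "retry_wide_backoff"
    if error_class in TRANSIENT_CLASSES:
        return "retry"
    # Assumption: classes not listed above are retried, and the retry
    # budget caps how long an unknown fault can keep looping.
    return "retry"
```

For example, `classify_sqlstate("40P01")` routes a deadlock to a normal retry, while `classify_sqlstate("42P01")` sends a missing relation straight to the dead-letter state.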

Step 3 — Intercept failures at the database layer

For refreshes triggered inside the database, wrap the call in structured exception handling so the audit row is written on the same connection that saw the error. The full trigger-based interception approach — including how to pause conflicting background jobs while a window recovers — is covered in Handling refresh failures with custom PL/pgSQL triggers. The minimal wrapper looks like this:

sql

CREATE OR REPLACE PROCEDURE refresh_with_capture(p_aggregate TEXT)
LANGUAGE plpgsql AS $$
BEGIN
    -- refresh_continuous_aggregate cannot run inside a transaction block, so
    -- this procedure is invoked with CALL on an autocommit connection.
    CALL refresh_continuous_aggregate(p_aggregate, NULL, NULL);
EXCEPTION WHEN OTHERS THEN
    -- SQLSTATE + SQLERRM classify the failure for the retry loop in Step 5.
    PERFORM record_refresh_failure(p_aggregate, SQLSTATE, SQLERRM);
    RAISE NOTICE 'Refresh of % captured: % (%)', p_aggregate, SQLERRM, SQLSTATE;
END;
$$;

Step 4 — Validate before releasing the result

Once a retry succeeds, verify data integrity before marking the row Resolved. The function below is a deterministic checkpoint that checks the row count and null ratio of a given window against fixed thresholds. Because the view name is dynamic, it builds the query with format()/EXECUTE and binds the window bounds as parameters. As written it assumes the aggregate exposes a bucket time column and an avg_temp value column; substitute your own column names.

sql

CREATE OR REPLACE FUNCTION validate_continuous_aggregate(
    agg_name TEXT,
    window_start TIMESTAMPTZ,
    window_end TIMESTAMPTZ
)
RETURNS TABLE (is_valid BOOLEAN, validation_notes TEXT) AS $$
DECLARE
    row_count BIGINT;
    null_ratio NUMERIC;
BEGIN
    -- Inspect the aggregate itself over the supplied window. The view name is
    -- dynamic, so build the query with format()/EXECUTE and bind the bounds.
    EXECUTE format(
        'SELECT count(*), avg((avg_temp IS NULL)::int)::numeric
           FROM %I WHERE bucket >= $1 AND bucket < $2', agg_name)
    INTO row_count, null_ratio
    USING window_start, window_end;

    IF row_count = 0 THEN
        RETURN QUERY SELECT FALSE, 'Zero rows materialized in window';
    ELSIF null_ratio > 0.15 THEN
        RETURN QUERY SELECT FALSE, 'Null ratio exceeds 15% threshold';
    ELSE
        RETURN QUERY SELECT TRUE, 'Validation passed';
    END IF;
END;
$$ LANGUAGE plpgsql STABLE;

Step 5 — Drive retries from an external orchestrator

External orchestration bridges the database with your observability stack. A lightweight polling service reads the audit table, re-drives due jobs with exponential backoff, and escalates exhausted ones to the dead-letter state. The loop below uses psycopg v3 with an autocommit connection and always binds the aggregate name as a parameter — never string-interpolated — to eliminate injection risk.

python

import logging
import psycopg
from psycopg.rows import dict_row

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("tsdb_aggregate_monitor")

def monitor_and_retry(conn_str: str, max_retries: int = 3):
    # refresh_continuous_aggregate() cannot run inside a transaction block, so
    # the connection uses autocommit; each statement commits on its own.
    with psycopg.connect(conn_str, autocommit=True) as conn:
        with conn.cursor(row_factory=dict_row) as cur:
            cur.execute("""
                SELECT aggregate_name, next_retry, retry_count
                FROM aggregate_refresh_audit
                WHERE resolved = FALSE AND next_retry <= NOW()
                ORDER BY next_retry ASC
                LIMIT 10
            """)
            pending = cur.fetchall()

            for job in pending:
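                if job['retry_count'] >= max_retries:
                    # Dead-letter: clear next_retry so the exhausted row stops
                    # matching the poll filter and no longer occupies the LIMIT.
                    logger.warning("Max retries exceeded for %s. Escalating to dead-letter queue.", job['aggregate_name'])
                    cur.execute("""
                        UPDATE aggregate_refresh_audit
                        SET next_retry = NULL
                        WHERE aggregate_name = %s
                    """, (job['aggregate_name'],))
                    continue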

                logger.info("Retrying refresh for %s (attempt %d)", job['aggregate_name'], job['retry_count'] + 1)
                try:
                    # Parameterized: the aggregate name is bound, never interpolated.
                    cur.execute("CALL refresh_continuous_aggregate(%s, NULL, NULL)", (job['aggregate_name'],))
                    cur.execute("""
                        UPDATE aggregate_refresh_audit
                        SET resolved = TRUE, next_retry = NULL
                        WHERE aggregate_name = %s
                    """, (job['aggregate_name'],))
                    logger.info("Successfully resolved %s", job['aggregate_name'])
                except psycopg.DatabaseError as e:
                    backoff = min(2 ** job['retry_count'] * 60, 3600)
                    logger.error("Retry failed for %s: %s. Backoff: %ds", job['aggregate_name'], e, backoff)
                    cur.execute("""
                        UPDATE aggregate_refresh_audit
                        SET retry_count = retry_count + 1, next_retry = NOW() + INTERVAL '1 second' * %s
                        WHERE aggregate_name = %s
                    """, (backoff, job['aggregate_name']))

Integrating Python’s native logging module ensures retry attempts, validation outcomes, and escalation events are structured for ingestion into centralized observability platforms. This layer runs asynchronously and must never block the primary ingestion pipeline; it treats database failures as recoverable state transitions rather than fatal exceptions.
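The loop in Step 5 marks a row resolved as soon as the refresh call returns. One way to put the Step 4 validation gate in front of that update is sketched below. The helper name `refresh_and_validate` is hypothetical, and the sketch assumes a DB-API style cursor that returns plain tuples (with psycopg's dict_row factory, read the two columns by name instead).

```python
def refresh_and_validate(cur, aggregate, window_start, window_end):
    """Re-drive one refresh and mark it resolved only if validation passes.

    `cur` is a cursor on an autocommit connection. Returns the
    (is_valid, validation_notes) pair reported by the database.
    """
    # Same parameterized call as the Step 5 loop.
    cur.execute("CALL refresh_continuous_aggregate(%s, NULL, NULL)", (aggregate,))

    # Run the Step 4 checkpoint over the window that was re-driven.
    cur.execute(
        "SELECT is_valid, validation_notes "
        "FROM validate_continuous_aggregate(%s, %s, %s)",
        (aggregate, window_start, window_end),
    )
    is_valid, notes = cur.fetchone()

    if is_valid:
        cur.execute(
            "UPDATE aggregate_refresh_audit "
            "SET resolved = TRUE, next_retry = NULL "
            "WHERE aggregate_name = %s",
            (aggregate,),
        )
    return is_valid, notes
```

When validation fails, the row stays unresolved, so the caller can log the notes and apply the same backoff update used for a failed refresh.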

Configuration Parameters Reference

The following knobs govern how aggressively the retry harness recovers and how much backoff pressure it applies. Tune them to the failure profile of your workload rather than accepting defaults.

Parameter	Type	Recommended value	Effect
`max_retries`	int	3–5	Attempts before a job moves to the dead-letter state; higher values mask persistent faults
`base_backoff`	interval	60 s	Base delay multiplied by `2 ** retry_count`; sets the scale of the exponential retry delay
`max_backoff`	interval	3600 s	Ceiling on any single backoff so a job is never deferred indefinitely
`null_ratio_threshold`	numeric	0.15	Validation fails when the materialized window exceeds this null fraction
`poll_limit`	int	10	Rows drained per poll; caps how many refreshes a single tick can trigger
`timescaledb.max_background_workers`	int	policies + 2	Total worker slots; too few starves refresh jobs and inflates retry rates
`retry_period` (job)	interval	scheduler default	Native scheduler retry delay for policy-driven refreshes, set with alter_job() and independent of this harness
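To see how base_backoff and max_backoff interact, the delay formula from the Step 5 loop can be evaluated on its own. The function name `backoff_seconds` is hypothetical; the arithmetic mirrors `min(2 ** retry_count * 60, 3600)` from the orchestrator.

```python
def backoff_seconds(retry_count, base_backoff=60, max_backoff=3600):
    """Exponential backoff in seconds, capped at max_backoff."""
    return min(base_backoff * 2 ** retry_count, max_backoff)


# Delay applied after each consecutive failure with the recommended values.
schedule = [backoff_seconds(n) for n in range(8)]
print(schedule)  # [60, 120, 240, 480, 960, 1920, 3600, 3600]
```

With the recommended values the cap takes effect from the seventh consecutive failure, so max_retries in the 3 to 5 range never reaches it.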

Integration with Adjacent Features

The retry harness does not operate in isolation. It sits at the intersection of three lifecycle subsystems, and correct integration prevents the retries from causing the very contention they are meant to survive.

Retention automation is the most frequent source of refresh conflicts. When a policy defined under Data Retention & Compression Lifecycle Automation drops a chunk that a refresh is reading, the refresh aborts with a lock or missing-relation error. Sequence your TTL policy mapping & enforcement so retention sweeps never overlap the refresh window — leave the newest, in-flight bucket untouched by both. When the underlying data is stale rather than merely unrefreshed, cross-reference troubleshooting stale continuous aggregates in production before burning retry budget on a window that will never validate.

The retry strategy also depends on which refresh mode a job uses. A full recompute costs far more than an incremental one, so before setting max_retries high, decide the mode via Incremental vs Full Refresh Strategies — an expensive full refresh that fails repeatedly should escalate faster than a cheap incremental one. Finally, retry throughput is bounded by worker availability; if the loop is re-driving jobs faster than the pool can absorb, revisit incremental refresh performance tuning for large datasets to reduce per-refresh cost before adding more retries.

Performance Validation

Verify the harness is actually recovering jobs — not just accumulating dead letters — by querying TimescaleDB’s system views alongside the audit table. First, inspect native job health to see which policy-driven refreshes are failing at the scheduler level:

sql

SELECT j.job_id,
       j.proc_name,
       s.last_run_status,
       s.total_failures,
       s.last_run_duration,
       s.next_start
FROM timescaledb_information.jobs j
JOIN timescaledb_information.job_stats s ON s.job_id = j.job_id
WHERE j.proc_name = 'policy_refresh_continuous_aggregate'
ORDER BY s.total_failures DESC;

Then correlate scheduler errors with the captured detail. The job_errors view records the exact SQLSTATE the background worker saw, which should match the error_state column your harness stored:

sql

SELECT e.job_id,
       e.sqlerrcode  AS sqlstate,
       e.err_message AS message,
       e.start_time
FROM timescaledb_information.job_errors e
ORDER BY e.start_time DESC
LIMIT 20;

Finally, measure recovery effectiveness from the audit table itself. A healthy harness keeps the unresolved backlog small and the mean retry_count low:

sql

SELECT count(*) FILTER (WHERE resolved)            AS resolved_total,
       count(*) FILTER (WHERE NOT resolved)        AS pending_total,
       count(*) FILTER (WHERE NOT resolved AND retry_count >= 3) AS dead_letter_total,
       round(avg(retry_count), 2)                  AS avg_retries
FROM aggregate_refresh_audit;

A rising dead_letter_total signals a permanent fault — a schema drift, a constraint violation, or a persistently locked chunk — that no amount of retrying will clear.

Troubleshooting

Common failure states and how to resolve them:

ERROR: refresh_continuous_aggregate cannot be executed from within a transaction block (SQLSTATE 25001) — the orchestrator opened an explicit transaction. Set autocommit=True on the psycopg connection, or issue a COMMIT before the CALL.
ERROR: could not serialize access due to concurrent update (SQLSTATE 40001) — two refreshes targeted overlapping windows, or a retention sweep touched the same chunk. Stagger policy schedule_intervals and ensure retention never overlaps the active refresh window.
ERROR: relation "_hyper_x_y_chunk" does not exist (SQLSTATE 42P01) — a DROP CHUNK removed the chunk mid-refresh. This is permanent for that window; mark the job resolved for the dropped range and align retention ordering so drops trail refreshes.
ERROR: out of memory (SQLSTATE 53200) — the refresh window is too wide for available work_mem. Narrow the window, raise work_mem for the refresh role, or split the recompute into smaller time ranges.
Jobs stuck in Pending and never retried — the poller’s next_retry <= NOW() filter never matches because clocks or timezones diverge between the app host and the database. Store next_retry as TIMESTAMPTZ (as shown) and compare against database NOW(), not client time.
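For the out-of-memory case, splitting the recompute into smaller time ranges can be sketched as a generator of half-open sub-windows. The name `split_window` is hypothetical, and the sketch assumes the chosen step is a multiple of the aggregate's bucket width so that no bucket straddles two sub-windows.

```python
from datetime import datetime, timedelta, timezone


def split_window(start, end, step):
    """Yield (sub_start, sub_end) ranges that together cover [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)  # the last range may be shorter
        yield cursor, upper
        cursor = upper


# Example: a 5-hour window refreshed in 2-hour slices.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
slices = list(split_window(start, start + timedelta(hours=5), timedelta(hours=2)))
```

Each pair can then be passed as the window bounds of a separate refresh call, so a single oversized window no longer has to fit in memory at once.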

Frequently Asked Questions

Does TimescaleDB not already retry failed refresh jobs automatically?

The native scheduler does retry policy-driven jobs with its own backoff, but it treats every failure identically and gives you no window-level control, no SQLSTATE-based classification, and no validation gate. The harness here adds a durable audit trail, permanent-versus-transient routing, and a post-refresh integrity check the built-in retry cannot provide.

How do I keep retention policies from causing refresh failures in the first place?

Order the lifecycle so retention trails materialization. Configure retention through TTL policy mapping & enforcement to drop only chunks older than the oldest window any aggregate still refreshes, and never let a DROP CHUNK overlap the active refresh window. That single ordering rule eliminates most 42P01 and lock-timeout aborts.
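The ordering rule can be checked mechanically before deploying policies. In the sketch below, `drop_after` and `start_offset` stand for the retention policy's cutoff and the refresh policy's oldest refreshed offset; the function name and the `margin` parameter are hypothetical additions for illustration.

```python
from datetime import timedelta


def retention_trails_refresh(drop_after, start_offset, margin=timedelta(0)):
    """True when retention only drops data older than the refresh window.

    drop_after:   age beyond which the retention policy drops chunks
    start_offset: how far back the refresh policy still recomputes
    margin:       extra safety gap, for example one chunk interval
    """
    return drop_after >= start_offset + margin


# Retention at 90 days trails a refresh window reaching back 30 days.
safe = retention_trails_refresh(timedelta(days=90), timedelta(days=30))
# Retention at 7 days would drop chunks the refresh still reads.
unsafe = retention_trails_refresh(timedelta(days=7), timedelta(days=30))
```

Running a check like this in a deployment script catches a misordered policy pair before it shows up as refresh aborts.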

Should the retry loop run inside the database or as an external service?

Use both layers. Database-side interception (Step 3) captures the failure on the connection that saw it, preserving SQLSTATE and detail. An external psycopg service (Step 5) then drives backoff, escalation, and alerting without holding a long-lived database session — keeping recovery logic off the ingestion path.

What belongs in the dead-letter state versus a retry?

Retry class 08, 40, and 53 failures — connection drops, serialization/deadlock, and resource exhaustion — because they are transient. Send class 23 and 42 failures — constraint violations and missing objects — straight to dead-letter, since re-running the same statement will fail identically until a human fixes the schema or data.

How often should the orchestrator poll the audit table?

Match the poll interval to your freshness SLA and the base backoff. A 30–60 second tick is typical: frequent enough to clear transient faults within one refresh window, infrequent enough that the poll_limit cap and worker pool are never overwhelmed by a burst of simultaneously due jobs.
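A fixed-interval tick can be sketched as a small loop around whatever function performs one poll. The name `run_poller` and its parameters are hypothetical; `max_ticks` and the injectable `sleep` exist only to make the loop testable.

```python
import logging
import time

logger = logging.getLogger("tsdb_aggregate_monitor")


def run_poller(poll_once, interval_s=30.0, max_ticks=None, sleep=time.sleep):
    """Call poll_once() every interval_s seconds and return the tick count.

    A failing tick is logged and the loop continues, so one bad poll
    does not stop later retries from being driven.
    """
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        try:
            poll_once()
        except Exception:
            logger.exception("Poll tick failed; continuing")
        ticks += 1
        sleep(interval_s)
    return ticks
```

A typical call would be `run_poller(lambda: monitor_and_retry(conn_str), interval_s=45)`, which runs until the process is stopped.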
