Fallback Routing for Legacy Data

Legacy telemetry ingestion is one of the most persistent operational problems in industrial IoT and time-series platforms. Network partitions, edge gateway reboots, and protocol migrations routinely produce delayed payloads that violate strict chronological ordering. When that historical data finally reaches the central database, a naive INSERT fragments chunks, invalidates already-materialized rollups, and collides with retention sweeps. This guide solves one focused problem: how to route late-arriving and legacy payloads into a hypertable safely, so they reconcile with the live ingestion stream without degrading query performance, compression ratios, or automated lifecycle jobs. It assumes the schema and partitioning conventions covered in the Core Hypertable Architecture & Partitioning Strategy guide.

Two-tier ingestion keeps late-arriving data off the hot path until it is validated and merged.

The core idea is a two-tier path. Live telemetry writes straight to the primary hypertable on the hot path. Late or legacy payloads land first in an uncompressed staging table, where they are validated and de-duplicated, then merged into the hypertable within a bounded time window, after which only the affected continuous aggregate range is refreshed. Each numbered step below maps to a node in the diagram above.

Prerequisites

This pattern targets TimescaleDB 2.10+ on PostgreSQL 14+ with psycopg v3 on the application side. Before wiring in a fallback path, confirm the following:

The timescaledb extension is loaded via shared_preload_libraries and CREATE EXTENSION timescaledb has run in the target database.
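The primary table is already a hypertable (SELECT create_hypertable('telemetry_hypertable', by_range('time'))) with a sensible chunk interval; see time-based chunk partitioning strategies. The by_range form needs TimescaleDB 2.13 or later, so on 2.10 to 2.12 use create_hypertable('telemetry_hypertable', 'time') instead.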
A composite unique index on (device_id, time) exists on the hypertable so ON CONFLICT is deterministic. TimescaleDB requires every partitioning column to be part of any unique index.
At least one continuous aggregate (for example telemetry_1h_agg) is defined over the hypertable and has a running refresh policy.
timescaledb.max_background_workers leaves headroom for an out-of-band refresh in addition to your steady-state compression and retention jobs.
The connecting role holds INSERT, SELECT, and DELETE on the staging table, INSERT and SELECT on the hypertable for the merge, plus EXECUTE on refresh_continuous_aggregate. Scope this per the security boundaries and access control model.

The staging table mirrors the hypertable columns but carries no compression policy and is a plain table, not a hypertable — it exists only to hold a handful of in-flight batches:

sql

CREATE TABLE telemetry_staging (
    device_id    bigint      NOT NULL,
    time         timestamptz NOT NULL,
    metric_value double precision,
    metadata     jsonb,
    UNIQUE (device_id, time)
);

Step-by-Step Implementation

The four steps below correspond directly to the diagram: stage (validate + dedupe), merge (merge window), backfill-refresh the aggregate, and enforce retention. Each is idempotent, so a retried batch after a transient network failure produces the same end state.

1. Stage and de-duplicate the incoming batch

Late payloads frequently repeat rows the gateway already flushed once. Insert them into the staging table first with ON CONFLICT DO NOTHING keyed on (device_id, time), which drops in-batch and cross-batch duplicates before they can touch the hot path:

sql

INSERT INTO telemetry_staging (device_id, time, metric_value, metadata)
VALUES (%(device_id)s, %(time)s, %(metric_value)s, %(metadata)s)
ON CONFLICT (device_id, time) DO NOTHING;
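
The same validation and de-duplication can also be applied in memory before the batch reaches the staging table, which saves round trips for rows that would be discarded anyway. The sketch below is one possible shape, not part of the router: prepare_batch and REQUIRED_KEYS are hypothetical names, and the rule that naive timestamps are rejected is an assumption about what "validated" should mean here.

```python
from datetime import datetime, timezone

# Assumption: mirrors the columns of telemetry_staging.
REQUIRED_KEYS = ("device_id", "time", "metric_value", "metadata")


def prepare_batch(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into (clean, rejected); clean holds one row per (device_id, time)."""
    clean: dict[tuple, dict] = {}
    rejected: list[dict] = []
    for rec in records:
        ts = rec.get("time")
        # Reject rows that miss a column or carry a naive timestamp, which
        # timestamptz would interpret in the session time zone.
        if (
            any(key not in rec for key in REQUIRED_KEYS)
            or not isinstance(ts, datetime)
            or ts.tzinfo is None
        ):
            rejected.append(rec)
            continue
        # First occurrence wins, matching ON CONFLICT DO NOTHING.
        clean.setdefault((rec["device_id"], ts), rec)
    return list(clean.values()), rejected


if __name__ == "__main__":
    ts = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    row = {"device_id": 1, "time": ts, "metric_value": 1.0, "metadata": None}
    clean, rejected = prepare_batch([row, dict(row)])
    print(len(clean), len(rejected))
```

The ON CONFLICT clause is still needed afterwards, because in-memory de-duplication only sees the current batch and cannot catch rows that an earlier batch already wrote.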

2. Merge the validated window into the hypertable

Once staged, merge only the bounded [min_time, max_time] window of this batch into the hypertable. Bounding the merge by an explicit time predicate keeps the write from scanning the whole staging table and lets the planner touch only the relevant chunks:

sql

INSERT INTO telemetry_hypertable (device_id, time, metric_value, metadata)
SELECT device_id, time, metric_value, metadata
FROM telemetry_staging
WHERE time >= %(min_time)s AND time <= %(max_time)s
ON CONFLICT (device_id, time) DO NOTHING;

If the target window overlaps chunks that columnar compression has already sealed, the insert transparently routes through the decompress path on modern TimescaleDB — correct, but costly. Keep the merge window newer than the compression boundary wherever possible, and treat backfills that reach into cold chunks as scheduled maintenance rather than inline work.
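
One way to act on that advice is to split each batch at the compression boundary before merging. This is a minimal sketch: split_at_compression_boundary and COMPRESS_AFTER are hypothetical, and the 7-day value is an assumption to be replaced with the compress_after of your own policy. Compression applies to whole chunks, so the boundary is an approximation rather than an exact cut.

```python
from datetime import datetime, timedelta, timezone

# Assumption: set this to the compress_after interval of your compression policy.
COMPRESS_AFTER = timedelta(days=7)


def split_at_compression_boundary(
    records: list[dict],
    now: datetime,
    compress_after: timedelta = COMPRESS_AFTER,
) -> tuple[list[dict], list[dict]]:
    """Return (inline, deferred) rows relative to the approximate compression boundary."""
    boundary = now - compress_after
    # Rows at or after the boundary should land in uncompressed chunks.
    inline = [rec for rec in records if rec["time"] >= boundary]
    # Older rows probably hit sealed chunks; queue them for a maintenance window.
    deferred = [rec for rec in records if rec["time"] < boundary]
    return inline, deferred


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    batch = [{"time": now - timedelta(hours=2)}, {"time": now - timedelta(days=30)}]
    inline, deferred = split_at_compression_boundary(batch, now)
    print(len(inline), "inline,", len(deferred), "deferred")
```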

3. Refresh only the affected aggregate range

A blanket refresh_continuous_aggregate over all history would saturate the worker pool. Instead, refresh the exact window the merge touched, padded by one bucket on each side to catch buckets that straddle the boundary. This call cannot run inside a transaction block, so it uses a separate autocommit connection:

sql

CALL refresh_continuous_aggregate('telemetry_1h_agg', %(from)s, %(to)s);

The choice between this targeted refresh and a full recompute is exactly the tradeoff analysed in incremental vs full refresh strategies — for out-of-order backfill the incremental, range-scoped refresh is almost always correct.
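
The one-bucket padding described in this step can be made explicit by aligning the window to bucket edges first. The helper below is a sketch under the assumption that the aggregate uses fixed-width buckets whose edges fall on whole multiples of the bucket width from the Unix epoch, which is the case for hourly buckets on UTC hour boundaries; padded_refresh_window is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def padded_refresh_window(
    min_time: datetime,
    max_time: datetime,
    bucket: timedelta = timedelta(hours=1),
) -> tuple[datetime, datetime]:
    """Align [min_time, max_time] outward to bucket edges and pad one bucket per side."""

    def floor_to_bucket(ts: datetime) -> datetime:
        # Subtract the offset into the current bucket to reach its start.
        return ts - ((ts - EPOCH) % bucket)

    start = floor_to_bucket(min_time) - bucket
    # floor + bucket is the end of the bucket holding max_time;
    # the second bucket is the padding. The end is treated as exclusive.
    end = floor_to_bucket(max_time) + 2 * bucket
    return start, end
```

The two returned values would be passed as the window start and end of the refresh call. Compared with adding a flat hour to each side, this always hands the refresh whole buckets, so the covered range is easier to reason about.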

4. Reconcile with retention

Finally, keep the fallback path aligned with lifecycle automation. A backfill that inserts data older than the retention horizon will simply be dropped by the next sweep, so validate the batch’s minimum timestamp against your TTL before merging, and let the TTL policy mapping and enforcement job reclaim expired chunks on its own schedule with drop_chunks.
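
The pre-merge TTL check might look like the sketch below. filter_expired is a hypothetical helper, and the 90-day default is chosen only to match the interval the router passes to drop_chunks; use whatever horizon your retention policy actually enforces.

```python
from datetime import datetime, timedelta, timezone

# Assumption: matches the 90-day interval used with drop_chunks in the router.
RETENTION = timedelta(days=90)


def filter_expired(
    records: list[dict],
    now: datetime,
    retention: timedelta = RETENTION,
) -> tuple[list[dict], int]:
    """Return (rows worth merging, number of rows older than the retention horizon)."""
    horizon = now - retention
    kept = [rec for rec in records if rec["time"] >= horizon]
    return kept, len(records) - len(kept)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    batch = [{"time": now - timedelta(days=1)}, {"time": now - timedelta(days=120)}]
    kept, expired = filter_expired(batch, now)
    print(f"merging {len(kept)} rows, skipping {expired} expired rows")
```

Logging the skipped count makes it visible when a gateway replays data that the next retention sweep would drop anyway.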

Full Python router

The following module ties the four steps together using psycopg v3. Staging, merge, and cleanup share one explicit transaction, and the aggregate refresh runs on a separate autocommit connection:

python

import psycopg
from datetime import timedelta
import logging

logging.basicConfig(level=logging.INFO)


class FallbackIngestionRouter:
    def __init__(self, conn_string: str):
        self.conn_string = conn_string

    def route_legacy_batch(self, telemetry_batch: list[dict]) -> None:
        """
        Validates, stages, and merges legacy telemetry into the primary hypertable.
        Idempotent via ON CONFLICT DO NOTHING on (device_id, time), which requires a
        unique index on (device_id, time) on both the staging table and hypertable.
        """
        if not telemetry_batch:
            return

        min_time = min(rec["time"] for rec in telemetry_batch)
        max_time = max(rec["time"] for rec in telemetry_batch)

        # Stage -> merge -> clean up runs in a single transaction.
        with psycopg.connect(self.conn_string) as conn:
            with conn.cursor() as cur:
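                # 1. Insert into staging table for temporal validation.
                #    psycopg 3 does not adapt a plain dict to jsonb, so pass metadata
                #    as None, a JSON string, or wrapped in psycopg.types.json.Jsonb.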
                cur.executemany("""
                    INSERT INTO telemetry_staging (device_id, time, metric_value, metadata)
                    VALUES (%(device_id)s, %(time)s, %(metric_value)s, %(metadata)s)
                    ON CONFLICT (device_id, time) DO NOTHING;
                """, telemetry_batch)

                # 2. Merge validated records into the primary hypertable
                cur.execute("""
                    INSERT INTO telemetry_hypertable (device_id, time, metric_value, metadata)
                    SELECT device_id, time, metric_value, metadata
                    FROM telemetry_staging
                    WHERE time >= %(min_time)s AND time <= %(max_time)s
                    ON CONFLICT (device_id, time) DO NOTHING;
                """, {"min_time": min_time, "max_time": max_time})

                # 3. Delete only the rows from THIS batch, keyed by (device_id, time),
                #    so a concurrent batch's staged rows in the same window survive.
                cur.executemany(
                    "DELETE FROM telemetry_staging WHERE device_id = %(device_id)s AND time = %(time)s;",
                    telemetry_batch,
                )
            conn.commit()

        # 4. Refresh the continuous aggregate for the affected window. This cannot
        #    run inside a transaction block, so use a separate autocommit connection.
        #    The window is extended by 1 hour on each side to cover partial overlaps.
        with psycopg.connect(self.conn_string, autocommit=True) as conn:
            conn.execute(
                "CALL refresh_continuous_aggregate('telemetry_1h_agg', %s, %s);",
                (min_time - timedelta(hours=1), max_time + timedelta(hours=1)),
            )

        logging.info("Processed %d legacy records (duplicates skipped by ON CONFLICT) and refreshed aggregates.", len(telemetry_batch))

    def enforce_retention_policy(self) -> None:
        """
        Ensures data retention aligns with lifecycle automation requirements.
        """
        with psycopg.connect(self.conn_string) as conn:
            with conn.cursor() as cur:
                # Drop chunks older than 90 days. Current TimescaleDB expects the
                # relation first: drop_chunks(relation, older_than).
                cur.execute("""
                    SELECT drop_chunks('telemetry_hypertable', INTERVAL '90 days');
                """)
                conn.commit()
                logging.info("Retention policy enforced: chunks older than 90 days dropped.")

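This router makes replays harmless rather than guaranteeing exactly-once delivery: the composite unique constraint turns a repeated row into a no-op, so at-least-once delivery from the gateways still yields one stored row per (device_id, time). It also isolates historical writes from real-time ingestion and triggers only targeted aggregate refreshes. Deployed as a scheduled worker or a queue consumer, it keeps lifecycle automation running without manual intervention.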

Configuration Parameters Reference

The knobs below govern how aggressively the fallback path merges and refreshes. Tune them against your worst-case reconnection burst, not the steady state.

Parameter	Type	Recommended value	Effect
`chunk_time_interval` (hypertable)	interval	1 day for high-frequency telemetry	Sets how many chunks a wide historical backfill touches; too small fragments metadata, too large widens decompress scope
Staging batch size	int (rows)	5,000–20,000	Rows per merge transaction; larger amortizes overhead but lengthens lock hold on target chunks
Merge-window padding	interval	1 bucket width each side	Extra range around `[min_time, max_time]` passed to the refresh so straddling buckets recompute correctly
Consumer poll interval	interval	1–5 s, backoff on pressure	How often the fallback worker drains its queue; throttle when refresh lag climbs
Token-bucket rate	rows/sec	Provision to hot-path headroom	Caps burst ingestion from reconnecting gateways so live writes keep their SLA
`timescaledb.max_background_workers`	int	jobs + 2	Must cover refresh, compression, and retention jobs plus the out-of-band refresh this path issues
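
The token-bucket rate in the table can be sketched as a small in-process limiter. This is an illustrative implementation, not something the router ships with: the TokenBucket class, its parameter names, and the injectable clock are all assumptions.

```python
import time


class TokenBucket:
    """Admit `rate` rows per second on average, with bursts of up to `capacity` rows."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = capacity      # start full so the first burst is admitted
        self.updated = clock()

    def try_acquire(self, rows: int) -> bool:
        """Consume `rows` tokens if available; return False so the caller can back off."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if rows <= self.tokens:
            self.tokens -= rows
            return True
        return False
```

A fallback worker would call try_acquire(len(batch)) before merging and sleep for its poll interval when the call returns False, which leaves the remaining write capacity to the hot path.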

Integration With Adjacent Features

Fallback routing never runs in isolation — it sits between ingestion, aggregation, compression, and retention, and its correctness depends on how it cooperates with each.

Against continuous aggregates, the range-scoped refresh in step 3 is what keeps historical dashboards honest without a full recompute; when a backfill fails mid-refresh, recovery belongs to the error handling and retry mechanisms layer rather than a bare retry loop. When many gateways reconnect at once, the refresh jobs the router queues share the scheduler with every other policy, so treat asynchronous execution and queue management as the governor on how much backfill work you admit at once.

Against partitioning, multi-tenant deployments must keep one tenant’s bulk sync from starving another’s live path — distribute backfills across dedicated tenant chunks using space partitioning for multi-tenant IoT so mass reconnection stays a noisy-neighbour non-event. And the hardest edge — reordering rows whose timestamps predate the current chunk boundary — is covered in depth in handling out-of-order data insertion in TimescaleDB, the focused walkthrough beneath this guide.

At the device layer, offline gateways should buffer to a local SQLite store with monotonic per-device sequence tracking, then transmit only the delta between the last acknowledged sequence number and the current buffer state. That delta-sync keeps bandwidth low and gives the router a clean, ordered stream to validate on reconnect.
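
A gateway-side buffer of that kind could look like the following sketch, which uses Python's built-in sqlite3 module. The table layout and the helper names open_buffer, append_reading, and delta_since are assumptions made for illustration; real gateways may track sequences differently.

```python
import sqlite3


def open_buffer(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the local buffer; the schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS buffer (
            device_id    INTEGER NOT NULL,
            seq          INTEGER NOT NULL,  -- monotonic per device
            time         TEXT    NOT NULL,  -- ISO 8601 timestamp
            metric_value REAL,
            PRIMARY KEY (device_id, seq)
        )
        """
    )
    return conn


def append_reading(conn: sqlite3.Connection, device_id: int, time_iso: str, value: float) -> int:
    """Store one reading under the next sequence number for its device."""
    (last,) = conn.execute(
        "SELECT COALESCE(MAX(seq), 0) FROM buffer WHERE device_id = ?", (device_id,)
    ).fetchone()
    conn.execute(
        "INSERT INTO buffer VALUES (?, ?, ?, ?)", (device_id, last + 1, time_iso, value)
    )
    conn.commit()
    return last + 1


def delta_since(conn: sqlite3.Connection, device_id: int, last_acked_seq: int) -> list[tuple]:
    """Return only the rows the server has not acknowledged yet, in sequence order."""
    return conn.execute(
        "SELECT seq, time, metric_value FROM buffer "
        "WHERE device_id = ? AND seq > ? ORDER BY seq",
        (device_id, last_acked_seq),
    ).fetchall()
```

This sketch never prunes acknowledged rows. Deleting them would require persisting the sequence counter separately, because MAX(seq) would otherwise restart from zero on an empty buffer.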

Performance Validation

After a backfill, confirm the fallback path behaved as intended rather than quietly fragmenting the hypertable. First, watch for small-chunk explosion during historical inserts:

sql

SELECT chunk_name, range_start, range_end,
       pg_size_pretty(total_bytes) AS size
FROM timescaledb_information.chunks
WHERE hypertable_name = 'telemetry_hypertable'
ORDER BY range_start DESC
LIMIT 20;

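A run of tiny, adjacent chunks around the backfilled window means the chunk_time_interval is too fine for the historical span. Next, verify that the scheduled jobs on the hypertable are keeping up and not piling up failures. job_stats reports scheduled policy jobs only, so the router's manual CALL refresh_continuous_aggregate does not appear here; if that call fails, it raises in the router itself: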

sql

SELECT job_id, last_run_status, last_run_duration,
       total_successes, total_failures, next_start
FROM timescaledb_information.job_stats
WHERE hypertable_name = 'telemetry_hypertable'
ORDER BY last_run_duration DESC;

Rising total_failures or a last_run_duration that approaches the schedule interval signals the backfill is outpacing the refresh window — throttle the consumer poll interval until it recovers. Finally, spot-check that staged rows are not leaking (a healthy staging table trends toward empty between bursts):

sql

SELECT count(*) AS pending_rows,
       min(time) AS oldest_pending
FROM telemetry_staging;

Troubleshooting

ERROR: duplicate key value violates unique constraint "telemetry_staging_device_id_time_key": the batch reached the staging table without ON CONFLICT DO NOTHING. If the error is instead "there is no unique or exclusion constraint matching the ON CONFLICT specification", the (device_id, time) unique index is missing on the table being written. Confirm both tables carry that index; the merge in step 2 depends on it for idempotency.

ERROR: refresh_continuous_aggregate cannot run inside a transaction block — the refresh in step 3 was issued on the same connection that staged and merged. It must run on a separate autocommit=True connection, as in the router above.

ERROR: cannot update/insert into chunk "..." because it is compressed (older TimescaleDB) or unexpectedly slow inserts (newer) — the merge window reached into a compressed chunk. Shorten the window to stay newer than the compression boundary, or schedule the deep backfill as maintenance and decompress the target range first.

Refresh reports success but dashboards stay stale — the padding around the merge window was too tight and a straddling bucket never recomputed. Widen the merge-window padding to at least one full bucket on each side and re-run a manual refresh_continuous_aggregate over the affected range.

Backfilled rows vanish within minutes — the batch’s oldest timestamp is older than the retention horizon, so drop_chunks reclaimed it. Validate min_time against the active TTL before merging, or extend the retention window for the historical range you intend to keep.

Frequently Asked Questions

Why stage legacy data instead of inserting straight into the hypertable?

Direct inserts of out-of-order data interleave with the hot path, can trigger decompression on sealed chunks, and force an aggregate refresh over an unbounded range. The staging table gives you a place to de-duplicate and bound the batch first, so the merge into the hypertable touches only the chunks and aggregate buckets it must — protecting live ingestion latency.

Does routing legacy data hurt my compression ratio?

Only if you write into already-compressed chunks. Compression groups similar values within a chunk; inserting stragglers after the fact decompresses, rewrites, and recompresses that chunk. Keep the merge window newer than the compression boundary, and run genuinely old backfills as scheduled maintenance rather than inline.

How large should each fallback batch be?

Size the batch so a single merge transaction holds locks on target chunks for well under a second — typically 5,000 to 20,000 rows for narrow-row telemetry. Larger batches amortize per-transaction overhead but lengthen lock hold time and delay the aggregate refresh; measure last_run_duration in job_stats and adjust.
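
Splitting a large backlog into merge-sized pieces takes only a few lines. In this sketch iter_batches is a hypothetical helper, and the default of 10,000 rows is simply a point inside the 5,000 to 20,000 range suggested above.

```python
from itertools import islice
from typing import Iterable, Iterator


def iter_batches(rows: Iterable[dict], size: int = 10_000) -> Iterator[list[dict]]:
    """Yield consecutive lists of at most `size` rows without loading everything twice."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(rows)
    # islice pulls the next `size` rows; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch


if __name__ == "__main__":
    backlog = [{"row": i} for i in range(25)]
    for batch in iter_batches(backlog, size=10):
        print(len(batch))
```

Each yielded list can be handed to route_legacy_batch in turn, so every sub-batch gets its own short merge transaction and its own range-scoped refresh.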

What happens when a gateway replays a batch it already sent?

Nothing changes state. Both the staging insert and the hypertable merge use ON CONFLICT (device_id, time) DO NOTHING, so replayed rows are silently discarded. This is what makes the router safe to retry after a transient network failure without a distributed lock.

Can two fallback workers run concurrently against the same hypertable?

Yes, provided each processes a distinct batch. The cleanup delete in the router (numbered 3 in its code comments, after the merge) removes only the exact (device_id, time) rows from the batch it merged, so a concurrent worker's staged rows in the same window survive. The unique constraints guarantee neither worker double-inserts. Keep an eye on worker-pool pressure, since each worker issues its own out-of-band refresh.

Handling Out-of-Order Data Insertion in TimescaleDB — the focused walkthrough of the reordering edge case beneath this guide.
Time-Based Chunk Partitioning Strategies — sizing the chunks a backfill has to touch.
Space Partitioning for Multi-Tenant IoT — isolating tenant backfills from each other.
Compression Models for High-Frequency Telemetry — why writing into sealed chunks is expensive.
Incremental vs Full Refresh Strategies — choosing the refresh scope after a merge.
