Core Problem
When an INCREMENTAL_BY_TIME_RANGE model with lookback configured fails during a lookback interval run, it creates a data hole (empty interval) rather than preserving the existing stale data. This occurs because the default materialization strategy for most engines is DELETE + INSERT, which is not atomic. You can find the materialization strategy documentation in docs/concepts/models/model_kinds.md.
Materialization Strategy
The DELETE + INSERT strategy works in two steps:
- DELETE existing data for the time interval
- INSERT new data for that interval
If step 1 succeeds but step 2 fails, the interval is left empty. This is fundamentally different from atomic operations like INSERT OVERWRITE (used by Spark/Databricks) which either fully succeed or fully fail. The strategy table is documented in docs/concepts/models/model_kinds.md.
Lookback Mechanism
The lookback parameter is designed to handle late-arriving data by reprocessing recent intervals on each run. The lookback parameter is documented in docs/concepts/models/overview.md.
For example, with lookback=2 and daily intervals:
- On January 5th, SQLMesh processes January 5th, 4th, and 3rd
- On January 6th, SQLMesh processes January 6th, 5th, and 4th
The lookback_start method calculates the effective start date by moving back lookback intervals. This method is implemented in sqlmesh/core/model/meta.py.
Interval Computation Logic
The compute_missing_intervals function determines which intervals need processing. This function is implemented in sqlmesh/core/snapshot/definition.py.
Key logic:
- Expands the time range into individual interval boundaries
- Compares against already-processed intervals
- If
lookback > 0, adds additional intervals to the missing set based on parent interval availability
The critical section for lookback handling checks if parent intervals are missing and adds current intervals to the missing set accordingly. However, this doesn't create a mechanism for retrying failed lookback intervals specifically.
Scheduler Behavior
The scheduler's merged_missing_intervals function computes missing intervals across all snapshots. This function is implemented in sqlmesh/core/scheduler.py.
This function:
- Takes a collection of snapshots
- Calls
compute_interval_params for each
- Returns a mapping of snapshots to their missing intervals
Crucially, it treats all missing intervals equally - there's no distinction between:
- Current intervals that failed
- Lookback intervals that failed
- Intervals that were never processed
The Gap
The gap exists because:
-
No Atomicity Guarantee: DELETE + INSERT is not atomic, so failures after DELETE leave data holes. This is documented in docs/concepts/models/model_kinds.md.
-
No Special Retry Logic: The scheduler doesn't distinguish between failed lookback intervals and other missing intervals. The scheduler logic is in sqlmesh/core/scheduler.py.
-
Assumption of Success: The lookback mechanism assumes intervals will either succeed or be naturally retried on the next scheduled run. However, if the failure is transient (e.g., network issue), the next scheduled run may have moved past the lookback window.
-
Interval Tracking: While failed intervals should theoretically be included in missing intervals (since they're never added to the processed intervals list), there's no explicit mechanism to prioritize retrying them. The interval tracking logic is in sqlmesh/core/snapshot/definition.py.
Test Case Evidence
The test cases in tests/core/test_snapshot.py show the expected behavior of lookback. Specifically, the test_missing_intervals_past_end_date_with_lookback function demonstrates that lookback intervals ARE included in missing intervals when they're removed.
However, this doesn't address the case where a DELETE succeeded but INSERT failed - the interval would be missing from the processed list, but the data would also be gone from the target table.
Gap Detection Logic
SQLMesh does have gap detection logic in _ensure_no_gaps, but this is only used during plan application for version changes, not for runtime retry of failed intervals. This function is implemented in sqlmesh/core/state_sync/db/facade.py.
This prevents gaps during version changes but doesn't help with runtime failures.
Summary
The issue is a gap because:
- The materialization strategy is non-atomic for most engines (documented in
docs/concepts/models/model_kinds.md)
- Lookback intervals that fail mid-execution leave data holes
- There's no special retry mechanism for failed lookback intervals (scheduler logic in
sqlmesh/core/scheduler.py)
- The system assumes lookback intervals will succeed or be naturally retried, which may not happen for transient failures
- Gap detection only applies to plan application, not runtime failures (logic in
sqlmesh/core/state_sync/db/facade.py)
Notes
The end_bounded parameter in missing_intervals can be used to limit intervals to the target end date, but this doesn't address the retry issue for failed lookback intervals. This parameter is documented in the missing_intervals method in sqlmesh/core/snapshot/definition.py.
Relevant Files
docs/concepts/models/model_kinds.md - Materialization strategy documentation
docs/concepts/models/overview.md - Lookback parameter documentation
sqlmesh/core/model/meta.py - lookback_start method implementation
sqlmesh/core/snapshot/definition.py - compute_missing_intervals and missing_intervals implementation
sqlmesh/core/scheduler.py - merged_missing_intervals implementation
tests/core/test_snapshot.py - Test cases for lookback behavior
sqlmesh/core/state_sync/db/facade.py - Gap detection logic in _ensure_no_gaps
Wiki pages you might want to explore:
Core Problem
When an
INCREMENTAL_BY_TIME_RANGEmodel withlookbackconfigured fails during a lookback interval run, it creates a data hole (empty interval) rather than preserving the existing stale data. This occurs because the default materialization strategy for most engines is DELETE + INSERT, which is not atomic. You can find the materialization strategy documentation indocs/concepts/models/model_kinds.md.Materialization Strategy
The DELETE + INSERT strategy works in two steps:
If step 1 succeeds but step 2 fails, the interval is left empty. This is fundamentally different from atomic operations like INSERT OVERWRITE (used by Spark/Databricks) which either fully succeed or fully fail. The strategy table is documented in
docs/concepts/models/model_kinds.md.Lookback Mechanism
The
lookbackparameter is designed to handle late-arriving data by reprocessing recent intervals on each run. The lookback parameter is documented indocs/concepts/models/overview.md.For example, with
lookback=2and daily intervals:The
lookback_startmethod calculates the effective start date by moving backlookbackintervals. This method is implemented insqlmesh/core/model/meta.py.Interval Computation Logic
The
compute_missing_intervalsfunction determines which intervals need processing. This function is implemented insqlmesh/core/snapshot/definition.py.Key logic:
lookback > 0, adds additional intervals to the missing set based on parent interval availabilityThe critical section for lookback handling checks if parent intervals are missing and adds current intervals to the missing set accordingly. However, this doesn't create a mechanism for retrying failed lookback intervals specifically.
Scheduler Behavior
The scheduler's
merged_missing_intervalsfunction computes missing intervals across all snapshots. This function is implemented insqlmesh/core/scheduler.py.This function:
compute_interval_paramsfor eachCrucially, it treats all missing intervals equally - there's no distinction between:
The Gap
The gap exists because:
No Atomicity Guarantee: DELETE + INSERT is not atomic, so failures after DELETE leave data holes. This is documented in
docs/concepts/models/model_kinds.md.No Special Retry Logic: The scheduler doesn't distinguish between failed lookback intervals and other missing intervals. The scheduler logic is in
sqlmesh/core/scheduler.py.Assumption of Success: The lookback mechanism assumes intervals will either succeed or be naturally retried on the next scheduled run. However, if the failure is transient (e.g., network issue), the next scheduled run may have moved past the lookback window.
Interval Tracking: While failed intervals should theoretically be included in missing intervals (since they're never added to the processed intervals list), there's no explicit mechanism to prioritize retrying them. The interval tracking logic is in
sqlmesh/core/snapshot/definition.py.Test Case Evidence
The test cases in
tests/core/test_snapshot.pyshow the expected behavior of lookback. Specifically, thetest_missing_intervals_past_end_date_with_lookbackfunction demonstrates that lookback intervals ARE included in missing intervals when they're removed.However, this doesn't address the case where a DELETE succeeded but INSERT failed - the interval would be missing from the processed list, but the data would also be gone from the target table.
Gap Detection Logic
SQLMesh does have gap detection logic in
_ensure_no_gaps, but this is only used during plan application for version changes, not for runtime retry of failed intervals. This function is implemented insqlmesh/core/state_sync/db/facade.py.This prevents gaps during version changes but doesn't help with runtime failures.
Summary
The issue is a gap because:
docs/concepts/models/model_kinds.md)sqlmesh/core/scheduler.py)sqlmesh/core/state_sync/db/facade.py)Notes
The
end_boundedparameter inmissing_intervalscan be used to limit intervals to the target end date, but this doesn't address the retry issue for failed lookback intervals. This parameter is documented in themissing_intervalsmethod insqlmesh/core/snapshot/definition.py.Relevant Files
docs/concepts/models/model_kinds.md- Materialization strategy documentationdocs/concepts/models/overview.md- Lookback parameter documentationsqlmesh/core/model/meta.py- lookback_start method implementationsqlmesh/core/snapshot/definition.py- compute_missing_intervals and missing_intervals implementationsqlmesh/core/scheduler.py- merged_missing_intervals implementationtests/core/test_snapshot.py- Test cases for lookback behaviorsqlmesh/core/state_sync/db/facade.py- Gap detection logic in _ensure_no_gapsWiki pages you might want to explore: