fix(targets): Delay table preparation until processing the first batch #3340
Conversation
Reviewer's Guide

This PR defers SQLSink’s table creation until the first data batch is processed by adding a table-prepared flag, removing premature table setup in `setup()`, encapsulating table preparation in a helper method, and invoking it on batch start and during processing. Tests are updated to verify that preparation only occurs once a batch begins.

Sequence diagram for deferred table preparation in SQLSink:

```mermaid
sequenceDiagram
participant "SQLSink"
participant "Connector"
participant "Database"
"SQLSink"->>"Connector": prepare_schema(schema_name) (during setup)
Note over "SQLSink","Connector": Table is NOT prepared during setup
"SQLSink"->>"SQLSink": start_batch(context)
"SQLSink"->>"SQLSink": _ensure_table_prepared()
alt Table not prepared yet
"SQLSink"->>"Connector": prepare_table(full_table_name, schema, primary_keys, as_temp_table=False)
"Connector"->>"Database": Create table
"SQLSink"->>"SQLSink": _table_prepared = True
end
"SQLSink"->>"SQLSink": process_batch(context)
"SQLSink"->>"SQLSink": _ensure_table_prepared() (if not already prepared)
Note over "SQLSink": Table preparation only occurs on first batch
```

Class diagram for updated SQLSink table preparation logic:

```mermaid
classDiagram
class SQLSink {
- _connector: _C
- _table_prepared: bool
+ setup()
+ start_batch(context: dict)
+ process_batch(context: dict)
+ _ensure_table_prepared()
+ activate_version(new_version: int)
}
SQLSink --> Connector : uses
class Connector {
+ prepare_schema(schema_name)
+ prepare_table(full_table_name, schema, primary_keys, as_temp_table)
+ table_exists(full_table_name)
}
```
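As a rough illustration of the pattern the guide describes, here is a minimal, self-contained sketch of a lazily prepared sink. The class and method names mirror the diagrams above, but the bodies are stand-ins for illustration, not the SDK's actual implementation:

```python
# Minimal sketch of deferred table preparation; NOT the SDK's real code.
class Connector:
    def prepare_schema(self, schema_name: str) -> None:
        print(f"schema {schema_name} ready")

    def prepare_table(self, full_table_name: str) -> None:
        print(f"table {full_table_name} created")


class SQLSink:
    def __init__(self, connector: Connector, full_table_name: str) -> None:
        self._connector = connector
        self.full_table_name = full_table_name
        self._table_prepared = False  # gates one-time table creation

    def setup(self) -> None:
        # Only the schema is prepared here; table creation is deferred.
        self._connector.prepare_schema("target_schema")

    def _ensure_table_prepared(self) -> None:
        # Idempotent helper: creates the table on the first call only.
        if not self._table_prepared:
            self._connector.prepare_table(self.full_table_name)
            self._table_prepared = True

    def start_batch(self, context: dict) -> None:
        self._ensure_table_prepared()

    def process_batch(self, context: dict) -> None:
        self._ensure_table_prepared()
        # ... write the batch's records here ...

    def activate_version(self, new_version: int) -> None:
        self._ensure_table_prepared()
        # ... version bookkeeping here ...


sink = SQLSink(Connector(), "users")
sink.setup()            # prints the schema message only; no table yet
sink.start_batch({})    # first batch: table created exactly once
sink.process_batch({})  # already prepared; no second creation
```

The flag makes `_ensure_table_prepared()` idempotent, so calling it from `start_batch`, `process_batch`, and `activate_version` costs nothing after the first invocation.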
Force-pushed from 7eea4ce to cc14c47.
Documentation build overview
Files changed (2 in total): 📝 2 modified | ➕ 0 added | ➖ 0 deleted
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##             main    #3340      +/-   ##
==========================================
+ Coverage   93.57%   93.61%   +0.04%
==========================================
  Files          69       69
  Lines        5663     5669       +6
  Branches      700      700
==========================================
+ Hits         5299     5307       +8
+ Misses        259      258       -1
+ Partials      105      104       -1
```
CodSpeed Performance Report

Merging #3340 will not alter performance.
Hey there - I've reviewed your changes - here's some feedback:
- Extend the test suite with a case for activate_version to verify that it also triggers deferred table preparation on its first invocation.
- Clarify in the setup() docstring that table creation is intentionally deferred to start_batch/process_batch so consumers don’t assume the table exists immediately after setup.
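For the first suggestion above, a hedged sketch of such a subtest; helper names like `self._make_sink()` are assumptions carried over from the suggested implementation later in this review, not confirmed test utilities:

```python
# Possible subtest for deferred preparation via activate_version.
# Assumes the same fixtures as the suggested test below.
with subtests.test("table preparation via activate_version"):
    sink = self._make_sink()
    assert not sink._table_prepared
    sink.activate_version(new_version=1)
    assert sink._table_prepared
```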
## Individual Comments
### Comment 1
<location> `tests/core/sinks/test_sql_sink.py:172-181` </location>
<code_context>
+ def test_table_preparation_deferred_until_first_batch(self, subtests: SubTests):
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding a test for table preparation when process_batch is called directly.
Add a subtest that calls process_batch directly to verify table preparation, ensuring both code paths are covered.
Suggested implementation:
```python
def test_table_preparation_deferred_until_first_batch(self, subtests: SubTests):
    """Test that table preparation is deferred until the first batch.

    This test verifies the fix for issue #3237, where table preparation
    occurred during setup() instead of during the first batch, causing
    data loss when multiple schema messages arrived for the same stream.

    The test verifies that:
    1. The table is NOT prepared during sink setup()
    2. The table IS prepared during the first batch (start_batch)
    3. The table IS prepared when process_batch is called directly
    4. This prevents data loss when schemas change mid-stream
    """
    # Setup sink and verify the table is not prepared
    sink = self._make_sink()
    assert not sink._table_prepared

    # Subtest: table preparation via start_batch
    with subtests.test("table preparation via start_batch"):
        sink.start_batch()
        assert sink._table_prepared

    # Reset sink for the next subtest
    sink = self._make_sink()
    assert not sink._table_prepared

    # Subtest: table preparation via process_batch directly
    with subtests.test("table preparation via process_batch"):
        batch = self._make_batch()  # adjust to match your batch creation logic
        sink.process_batch(batch)
        assert sink._table_prepared
```
- Ensure that `self._make_batch()` (or equivalent) is available and returns a valid batch for `process_batch`. If not, you will need to implement or adjust the batch creation logic to fit your test setup.
- If `sink._table_prepared` is not the correct attribute to check, replace it with the appropriate property or method for verifying table preparation.
</issue_to_address>
@edgarrmondragon I'm not sure this resolves #3237 🤔 Consider the following stream under the current behaviour:

```jsonc
{"type": "SCHEMA", "stream": "users", "key_properties": ["id"], "schema": {"required": ["id"], "type": "object", "properties": {"id": {"type": "integer"}}}}
// Schema received, so SQLSink class instantiated.
// With the default `TargetLoadMethods.OVERWRITE` load method,
// this results in any existing destination table being dropped and recreated by `.setup()`.
{"type": "RECORD", "stream": "users", "record": {"id": 1}}
{"type": "RECORD", "stream": "users", "record": {"id": 2}}
// Records received...
// `batch_size_rows` reached, so sink would drain and destination table would have 2 records.
{"type": "RECORD", "stream": "users", "record": {"id": 3}}
// Another record received, but below the batch threshold, so not yet drained to the destination.
{"type": "SCHEMA", "stream": "users", "key_properties": ["id"], "schema": {"required": ["id"], "type": "object", "properties": {"id": {"type": "integer"}, "name": {"type": "string"}}}}
// Then another schema received, this time with an additional "name" property, which means:
// - a new SQLSink class would be instantiated, and the previous one (with un-drained record) is moved to `_sinks_to_clear`
// - new sink instance calls `setup()` which results in the destination table being dropped and recreated by `.setup()`
// Issue 1: **records (id=1,2) previously written to destination table are lost in the drop/recreate cycle**
{"type": "RECORD", "stream": "users", "record": {"id": 4, "name": "Alice"}}
{"type": "RECORD", "stream": "users", "record": {"id": 5, "name": "Alfred"}}
// Records received...
// `batch_size_rows` reached, so sink would drain and destination table would again have 2 records.
{"type": "RECORD", "stream": "users", "record": {"id": 6, "name": "Anita"}}
// Record received, but below the batch threshold, so not yet drained to the destination.
// This is the final record, so now the Target would begin `drain_all()`.
// `drain_all()` starts with sinks in `_sinks_to_clear`, however the target table schema has since been changed.
// Issue 2: **attempting to drain the first sink (with only "id" column) would/could fail because the target table now expects both "id" and "name" columns**
```

For the above stream, I would expect the following table contents: … Instead we get either an error (due to id=3 and …) or … Under your new implementation, with …, we also still encounter … It's hard to reason through, but I believe this PR still produces the lossy outcome for the example stream. So, to solve both issues, I think we need to: …
Point 1 is definitely nuanced: is a new schema message effectively a new stream, and therefore something that should be overwritten? Even if that is the consensus, we'd still need to fix point 1. Does that make sense?
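To make Issue 1 concrete, here is a toy, self-contained model of the drop/recreate data loss described above; it only mimics the behaviour discussed in this thread, not the SDK's actual Target/Sink classes:

```python
# Toy model of Issue 1: eager setup() under OVERWRITE drops rows that an
# earlier batch already drained. Illustrative only; not the SDK's code.
table: list[dict] = []

def setup_overwrite() -> None:
    table.clear()  # OVERWRITE load method: drop and recreate the table

setup_overwrite()                                # first sink's eager setup()
table.extend([{"id": 1}, {"id": 2}])             # first batch drained
pending = [{"id": 3}]                            # un-drained record; sink parked

setup_overwrite()                                # new SCHEMA -> new sink's setup()
table.extend([{"id": 4, "name": "Alice"},        # second sink drains its batch
              {"id": 5, "name": "Alfred"}])

print(table)    # rows id=1,2 were lost in the drop/recreate cycle
print(pending)  # id=3 still waits in _sinks_to_clear and may fail to drain
```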
Summary by Sourcery

Defer SQL table creation in the `SQLSink` until the first batch is processed to avoid data loss on schema changes, refactoring `setup()` to only prepare schemas and introducing lazy table preparation across batch and version methods, with a new test to validate the behavior.

Bug Fixes:
- Avoid losing data on schema changes by deferring table preparation until the first batch is processed.

Enhancements:
- Add a `_table_prepared` flag and an `_ensure_table_prepared` method to lazily prepare tables.
- Remove premature table setup from `setup()` and invoke table preparation in `start_batch`, `process_batch`, and `activate_version`.

Tests:
- Add a test verifying that table preparation only occurs once the first batch begins.