
Conversation

@edgarrmondragon (Collaborator) commented Oct 22, 2025

Summary by Sourcery

Defer SQL table creation in the SQLSink until the first batch is processed, avoiding data loss on schema changes: setup() now only prepares the schema, table preparation happens lazily across the batch and version methods, and a new test validates the behavior.

Bug Fixes:

  • Delay table preparation until first batch to prevent data loss when multiple schema messages arrive.

Enhancements:

  • Introduce a _table_prepared flag and _ensure_table_prepared method to lazily prepare tables.
  • Remove immediate table creation from setup() and invoke table preparation in start_batch, process_batch, and activate_version.

Tests:

  • Add a test to verify that table preparation is deferred until the first batch invocation.
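Taken together, the pattern looks roughly like the following sketch (illustrative only, based on the summary above; the actual implementation lives in singer_sdk/sql/sink.py and differs in detail):

```python
class SQLSink:
    def __init__(self, *args, **kwargs):
        ...
        self._table_prepared = False  # table creation is deferred until first use

    def _ensure_table_prepared(self) -> None:
        """Prepare the target table once, on first use."""
        if self._table_prepared:
            return
        self.connector.prepare_table(
            full_table_name=self.full_table_name,
            schema=self.conform_schema(self.schema),
            primary_keys=self.key_properties,
            as_temp_table=False,
        )
        self._table_prepared = True

    def start_batch(self, context: dict) -> None:
        self._ensure_table_prepared()  # first batch triggers table creation
        ...

    def process_batch(self, context: dict) -> None:
        self._ensure_table_prepared()  # no-op if already prepared
        ...
```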

sourcery-ai bot (Contributor) commented Oct 22, 2025

Reviewer's Guide

This PR defers SQLSink’s table creation until the first data batch is processed by adding a table-prepared flag, removing premature table setup in setup(), encapsulating table preparation in a helper method, and invoking it on batch start and during processing. Tests are updated to verify that preparation only occurs once a batch begins.

Sequence diagram for deferred table preparation in SQLSink

```mermaid
sequenceDiagram
    participant "SQLSink"
    participant "Connector"
    participant "Database"
    "SQLSink"->>"Connector": prepare_schema(schema_name) (during setup)
    Note over "SQLSink","Connector": Table is NOT prepared during setup
    "SQLSink"->>"SQLSink": start_batch(context)
    "SQLSink"->>"SQLSink": _ensure_table_prepared()
    alt Table not prepared yet
        "SQLSink"->>"Connector": prepare_table(full_table_name, schema, primary_keys, as_temp_table=False)
        "Connector"->>"Database": Create table
        "SQLSink"->>"SQLSink": _table_prepared = True
    end
    "SQLSink"->>"SQLSink": process_batch(context)
    "SQLSink"->>"SQLSink": _ensure_table_prepared() (if not already prepared)
    Note over "SQLSink": Table preparation only occurs on first batch
```

Class diagram for updated SQLSink table preparation logic

```mermaid
classDiagram
    class SQLSink {
        - _connector: _C
        - _table_prepared: bool
        + setup()
        + start_batch(context: dict)
        + process_batch(context: dict)
        + _ensure_table_prepared()
        + activate_version(new_version: int)
    }
    SQLSink --> Connector : uses
    class Connector {
        + prepare_schema(schema_name)
        + prepare_table(full_table_name, schema, primary_keys, as_temp_table)
        + table_exists(full_table_name)
    }
```

File-Level Changes

Introduce table preparation flag and defer initial table setup (singer_sdk/sql/sink.py)
  • Add self._table_prepared = False in __init__
  • Remove the connector.prepare_table call from setup()
  • Update the setup() docstring to note deferred table preparation

Encapsulate table preparation logic in a helper (singer_sdk/sql/sink.py)
  • Create _ensure_table_prepared() to prepare the table and set the flag
  • Use conform_schema and key_properties when preparing

Trigger table preparation at batch processing points (singer_sdk/sql/sink.py)
  • Call _ensure_table_prepared() in start_batch()
  • Call _ensure_table_prepared() at the start of process_batch()
  • Replace the table_exists check in activate_version() with _ensure_table_prepared()

Add test to verify deferred table creation (tests/core/sinks/test_sql_sink.py)
  • Introduce test_table_preparation_deferred_until_first_batch()
  • Use subtests to assert the flag and actual table existence before and after the first batch
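The activate_version change is the subtle one: rather than guarding on Connector.table_exists, it now funnels through the same lazy preparation, roughly like this sketch (illustrative, not the exact diff):

```python
def activate_version(self, new_version: int) -> None:
    # Previously this guarded on connector.table_exists(self.full_table_name);
    # now it ensures the table is created (once) before applying version logic.
    self._ensure_table_prepared()
    ...
```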


@edgarrmondragon force-pushed the fix/sql/delay-table-preparation branch from 7eea4ce to cc14c47 on October 22, 2025 23:23
@edgarrmondragon changed the title from "fix(sql,targets): Delay table preparation until processing the first batch" to "fix(targets): Delay table preparation until processing the first batch" on Oct 22, 2025
@edgarrmondragon added the SQL (Support for SQL taps and targets) and Type/Target (Singer targets) labels on Oct 22, 2025
@edgarrmondragon self-assigned this on Oct 22, 2025
read-the-docs-community bot commented Oct 22, 2025

Documentation build overview

📚 Meltano SDK | 🛠️ Build #30102294 | 📁 Comparing 397f2ac against latest (ffd9a81)


🔍 Preview build

Show files changed (2 files in total): 📝 2 modified | ➕ 0 added | ➖ 0 deleted

File | Status
genindex.html | 📝 modified
classes/singer_sdk.SQLSink.html | 📝 modified

codecov bot commented Oct 22, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.61%. Comparing base (ffd9a81) to head (397f2ac).

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #3340      +/-   ##
==========================================
+ Coverage   93.57%   93.61%   +0.04%
==========================================
  Files          69       69
  Lines        5663     5669       +6
  Branches      700      700
==========================================
+ Hits         5299     5307       +8
+ Misses        259      258       -1
+ Partials      105      104       -1
```
Flag | Coverage Δ
core | 80.45% <66.66%> (+0.09%) ⬆️
end-to-end | 76.68% <100.00%> (-0.05%) ⬇️
optional-components | 43.32% <22.22%> (-0.02%) ⬇️


codspeed-hq bot commented Oct 22, 2025

CodSpeed Performance Report

Merging #3340 will not alter performance

Comparing fix/sql/delay-table-preparation (397f2ac) with main (cc29a37)¹

Summary

✅ 8 untouched

Footnotes

¹ No successful run was found on main (ffd9a81) during the generation of this report, so cc29a37 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@edgarrmondragon marked this pull request as ready for review on October 23, 2025 18:27
sourcery-ai bot (Contributor) left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • Extend the test suite with a case for activate_version to verify that it also triggers deferred table preparation on its first invocation.
  • Clarify in the setup() docstring that table creation is intentionally deferred to start_batch/process_batch so consumers don’t assume the table exists immediately after setup.
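
For the second point, the clarified docstring might read roughly like this (a sketch, not the actual singer_sdk wording):

```python
def setup(self) -> None:
    """Set up the sink.

    Prepares the target schema only. Table creation is intentionally
    deferred to the first batch (start_batch/process_batch), so callers
    must not assume the table exists immediately after setup() returns.
    """
```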
## Individual Comments

### Comment 1
Location: `tests/core/sinks/test_sql_sink.py:172-181`

Code context:

```python
+    def test_table_preparation_deferred_until_first_batch(self, subtests: SubTests):
```
**suggestion (testing):** Consider adding a test for table preparation when process_batch is called directly.

Add a subtest that calls process_batch directly to verify table preparation, ensuring both code paths are covered.

Suggested implementation:

```python
    def test_table_preparation_deferred_until_first_batch(self, subtests: SubTests):
        """Test that table preparation is deferred until first batch.

        This test verifies the fix for issue #3237 where table preparation
        occurred during setup() instead of during the first batch, causing
        data loss when multiple schema messages arrived for the same stream.

        The test verifies that:
        1. Table is NOT prepared during sink setup()
        2. Table IS prepared during the first batch (start_batch)
        3. Table IS prepared when process_batch is called directly
        4. This prevents data loss when schemas change mid-stream
        """

        # Setup sink and verify table is not prepared
        sink = self._make_sink()
        assert not sink._table_prepared

        # Subtest: Table preparation via start_batch
        with subtests.test("table preparation via start_batch"):
            sink.start_batch()
            assert sink._table_prepared

        # Reset sink for next subtest
        sink = self._make_sink()
        assert not sink._table_prepared

        # Subtest: Table preparation via process_batch directly
        with subtests.test("table preparation via process_batch"):
            batch = self._make_batch()  # You may need to adjust this to match your batch creation logic
            sink.process_batch(batch)
            assert sink._table_prepared

```

- Ensure that `self._make_batch()` (or equivalent) is available and returns a valid batch for `process_batch`. If not, you will need to implement or adjust the batch creation logic to fit your test setup.
- If `sink._table_prepared` is not the correct attribute to check, replace it with the appropriate property or method for verifying table preparation.


kgpayne (Contributor) commented Oct 29, 2025

@edgarrmondragon I'm not sure this resolves #3237 🤔 Consider the following stream with the current behaviour, with batch_size_rows set to 2 for the purpose of illustration:

{"type": "SCHEMA", "stream": "users", "key_properties": ["id"], "schema": {"required": ["id"], "type": "object", "properties": {"id": {"type": "integer"}}}}
// Schema received, so SQLSink class instantiated.
// With the default `TargetLoadMethods.OVERWRITE` load method,
// this results in any existing destination table being dropped and recreated by `.setup()`.
{"type": "RECORD", "stream": "users", "record": {"id": 1}}
{"type": "RECORD", "stream": "users", "record": {"id": 2}}
// Records received..
// `batch_size_rows` reached, so sink would drain and destination table would have 2 records.
{"type": "RECORD", "stream": "users", "record": {"id": 3}}
// Another record received, but below the batch threshold, so not yet drained to the destination.
{"type": "SCHEMA", "stream": "users", "key_properties": ["id"], "schema": {"required": ["id"], "type": "object", "properties": {"id": {"type": "integer"}, "name": {"type": "string"}}}}
// Then another schema received, this time with an additional "name" property, which means:
// - a new SQLSink class would be instantiated, and the previous one (with un-drained record) is moved to `_sinks_to_clear`
// - new sink instance calls `setup()` which results in the destination table being dropped and recreated by `.setup()`
// Issue 1: **records (id=1,2) previously written to destination table are lost in the drop/recreate cycle**
{"type": "RECORD", "stream": "users", "record": {"id": 4, "name": "Alice"}}
{"type": "RECORD", "stream": "users", "record": {"id": 5, "name": "Alfred"}}
// Records recieved...
// `batch_size_rows` reached, so sink would drain and destination table would again have 2 records.
{"type": "RECORD", "stream": "users", "record": {"id": 6, "name": "Anita"}}
// Record received, but below the batch threshold, so not yet drained to the destination.
// This is the final record, so now the Target would begin `drain_all()`.
// `drain_all()` starts with sinks in `_sinks_to_clear`, however the target table schema has since been changed.
// Issue 2: **attempting to drain the first sink (with only "id" column) would/could fail because the target table now expects both "id" and "name" columns**

For the above stream, I would expect the following table contents:

```csv
id,name
1,null
2,null
3,null
4,Alice
5,Alfred
6,Anita
```

Instead we get either an error (due to id=3 and Issue 2) or:

```csv
id,name
4,Alice
5,Alfred
3,null
6,Anita
```

Under your new implementation, with prepare_table deferred to the first drain, we still encounter Issue 1: calling drain for the first time on the second Sink instance triggers prepare_table, causing a drop-and-recreate under the default TargetLoadMethods.OVERWRITE load method.

We also still encounter Issue 2 when the second instance of the Sink meets the batch_size_rows threshold and drains, thereby dropping and recreating the table before the first instance (stored in _sinks_to_clear) has completely drained. Once the first instance does eventually drain (during drain_all), the schemas will no longer match and the drain may fail.

It's hard to reason through, but I believe this PR still produces the lossy outcome for the example stream:

```csv
id,name
4,Alice
5,Alfred
3,null
6,Anita
```

So, to solve both issues, I think we need to:

  1. Only drop-and-recreate once per Stream. I.e. set target load method to TargetLoadMethods.APPEND_ONLY or UPSERT on subsequent instances of SQLSink for a given stream.
  2. Completely drain one Sink before instantiating another (for the same stream). I.e. do not retire SQLSink streams to _sinks_to_clear. Arguably we wouldn't want to do this for any stream types, for the same reason.

Point 1 is definitely nuanced - is a new schema message effectively a new stream, and therefore one whose table should be overwritten? Even if that is the consensus, we'd still need to fix point 2.
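
For illustration, point 1 might look roughly like the sketch below. The overwritten_streams set and the effective_load_method helper are invented names for illustration, not actual SDK internals, and the "overwrite"/"append-only" strings stand in for the TargetLoadMethods values:

```python
# Hypothetical sketch of point 1: downgrade OVERWRITE to APPEND_ONLY for
# later SQLSink instances of the same stream, so a mid-stream SCHEMA
# message doesn't drop data already written in this run.
overwritten_streams: set[str] = set()  # would live on the Target, scoped per run


def effective_load_method(stream_name: str, configured: str) -> str:
    """Return the load method a new sink instance should actually use."""
    if configured != "overwrite":
        return configured
    if stream_name in overwritten_streams:
        return "append-only"  # table was already dropped/recreated once
    overwritten_streams.add(stream_name)
    return "overwrite"
```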

Does that make sense?
