Skip to content

Conversation

@longbingljw
Copy link
Member

@longbingljw longbingljw commented Dec 19, 2025

Summary

1.Fixed a bug where $in/$nin operators on JSON array metadata fields failed to match due to incorrect SQL generation.
2.Add tests for array $in/$nin
close #83

Solution Description

  • $in: JSON_OVERLAPS(JSON_EXTRACT(metadata, '$.field'), CAST(%s AS JSON))
  • $nin: NOT JSON_OVERLAPS(...)

Summary by CodeRabbit

  • Bug Fixes

    • Improved metadata filtering for $in and $nin: validates and normalizes scalar or array inputs, skips empty lists, and uses JSON-overlap comparisons to ensure correct matches.
  • Tests

    • Added regression tests for array and scalar $in/$nin scenarios (overlap, exclusion, missing/null/empty) across embedded, server, and OceanBase clients.
    • Refactored tests to use shared helpers and added extra cleanup and progress logging.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Dec 19, 2025

Walkthrough

Metadata $in/$nin handling was changed to validate and normalize values into JSON arrays and use JSON_OVERLAPS with CAST(... AS JSON) and a single bound parameter; tests added to cover array and scalar $in/$nin across embedded, server, and OceanBase clients.

Changes

Cohort / File(s) Summary
Filter logic for metadata array operators
src/pyseekdb/client/filters.py
Added json import; updated $in and $nin paths for metadata fields to require list/tuple (skip if empty), normalize scalar values to single-element arrays, serialize to a JSON array string, and emit JSON_OVERLAPS(CAST(JSON_EXTRACT(... ) AS JSON), CAST(%s AS JSON)) SQL fragments with a single parameter (replacing previous IN/NOT IN placeholders).
Regression tests for metadata array overlap & nin
tests/test_collection_get.py
Added _test_metadata_array_in_overlap shared flow and _test_collection_get helper; introduced tests for embedded/server/oceanbase variants for both $in and $nin overlap cases and refactored existing collection.get tests to use the new helper; included collection lifecycle setup/teardown and additional assertions covering overlap, disjoint, missing, null, and empty-array scenarios.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–30 minutes

  • Review JSON serialization and parameter binding in src/pyseekdb/client/filters.py for correct escaping and safe binding.
  • Verify the empty-list skipping and scalar-to-array normalization across edge cases (missing key, null, empty array).
  • Run the new tests/test_collection_get.py scenarios (embedded, server, oceanbase) and confirm cleanup paths and assertions.

Poem

🐇 I nibbled tags and turned them true,
Wrapped values in brackets, one or two,
Overlaps now dance where matches hide,
Queries hop in, no more divide,
Tiny paws applaud the JSON tide.

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly identifies the bug fix: metadata $in/$nin operators now correctly match JSON array fields.
Linked Issues check ✅ Passed The PR implements the required fix: replacing IN/NOT IN with JSON_OVERLAPS for metadata array field comparisons [#83].
Out of Scope Changes check ✅ Passed All changes are directly related to fixing the $in/$nin operators for JSON array metadata fields as specified in issue #83.
Docstring Coverage ✅ Passed Docstring coverage is 92.31% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@longbingljw
Copy link
Member Author

Distract from #77,cc plz @hnwyllmm

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/test_collection_get.py (2)

141-141: Clean up unused noqa directives.

Lines 141 and 285 have # noqa: F401 directives that are no longer needed (the import check is not enabled).

Proposed fix
     try:
-        import pylibseekdb  # noqa: F401
+        import pylibseekdb
     except ImportError:

Apply the same change on line 285.

Also applies to: 285-285


558-558: Remove extraneous f prefix from strings without placeholders.

Lines 558, 569, 901, and 912 use f-strings but don't contain any placeholders.

Proposed fix
-            print(f"✅ Testing get with scalar $in operator")
+            print("✅ Testing get with scalar $in operator")

Apply the same fix to lines 569, 901, and 912.

Also applies to: 569-569, 901-901, 912-912

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 417f62b and 85e7f27.

📒 Files selected for processing (2)
  • src/pyseekdb/client/filters.py (2 hunks)
  • tests/test_collection_get.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_collection_get.py (6)
src/pyseekdb/client/collection.py (5)
  • client (78-80)
  • name (63-65)
  • embedding_function (88-90)
  • add (103-149)
  • get (364-428)
src/pyseekdb/client/__init__.py (1)
  • Client (84-201)
src/pyseekdb/client/admin_client.py (1)
  • get_or_create_collection (214-233)
src/pyseekdb/client/embedding_function.py (1)
  • DefaultEmbeddingFunction (84-469)
src/pyseekdb/client/client_seekdb_embedded.py (1)
  • _execute (106-122)
src/pyseekdb/client/client_seekdb_server.py (1)
  • _execute (98-111)
🪛 Ruff (0.14.8)
tests/test_collection_get.py

141-141: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


176-176: Do not catch blind exception: Exception

(BLE001)


197-197: Do not catch blind exception: Exception

(BLE001)


226-226: Do not catch blind exception: Exception

(BLE001)


247-247: Do not catch blind exception: Exception

(BLE001)


276-276: Do not catch blind exception: Exception

(BLE001)


285-285: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


326-326: Do not catch blind exception: Exception

(BLE001)


347-347: Do not catch blind exception: Exception

(BLE001)


382-382: Do not catch blind exception: Exception

(BLE001)


403-403: Do not catch blind exception: Exception

(BLE001)


438-438: Do not catch blind exception: Exception

(BLE001)


558-558: f-string without any placeholders

Remove extraneous f prefix

(F541)


569-569: f-string without any placeholders

Remove extraneous f prefix

(F541)


901-901: f-string without any placeholders

Remove extraneous f prefix

(F541)


912-912: f-string without any placeholders

Remove extraneous f prefix

(F541)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: test (server, ubuntu-latest)
  • GitHub Check: test (oceanbase, ubuntu-latest)
  • GitHub Check: test (embedded, ubuntu-latest)
🔇 Additional comments (6)
src/pyseekdb/client/filters.py (3)

9-9: LGTM!

Clean addition of the standard library json module to support JSON serialization for the $in and $nin operators.


128-140: The implementation is correct. JSON_OVERLAPS returns 1 if a scalar is an element of the array when comparing the two, so the comment on Line 135 accurately states this handles both scalar and array fields. The old IN operator and the new JSON_OVERLAPS approach both work correctly for scalar metadata fields.


142-154: The $nin operator implementation is correct and properly tested with both scalar and array metadata fields. JSON_OVERLAPS inherently supports scalar-to-array comparisons, so no issues exist.

tests/test_collection_get.py (3)

135-277: Excellent regression test coverage for array $in operator.

These three test functions (test_embedded_metadata_array_in_overlap, test_server_metadata_array_in_overlap, test_oceanbase_metadata_array_in_overlap) properly validate the fix for issue #83 across all three client types. The tests correctly:

  • Insert metadata with JSON array fields ("tags": ["ml", "ai"])
  • Query using $in operator with an overlapping array
  • Expect a single matching result due to overlap
  • Include proper cleanup and connection validation

279-439: Good coverage for array $nin operator.

These three test functions (test_embedded_metadata_array_nin_overlap, test_server_metadata_array_nin_overlap, test_oceanbase_metadata_array_nin_overlap) provide complementary coverage for the $nin operator. They correctly test that:

  • Documents with array overlap are excluded
  • Documents without overlap are returned
  • Proper assertions verify both inclusion (id2) and exclusion (id1)

557-577: These scalar field $in/$nin tests are correctly implemented and should pass.

JSON_OVERLAPS is specifically designed to handle comparisons between a scalar value and an array, returning true if the scalar is an element of the array. The test data stores scalar metadata values ("tag": "ml", "tag": "python", "tag": "neural"), and the filter implementation correctly generates SQL like JSON_OVERLAPS(JSON_EXTRACT(metadata, '$.tag'), CAST('["ml", "python"]' AS JSON)), which will work as intended.

The concern raised in the original review is unfounded—when comparing a scalar with an array, JSON_OVERLAPS returns 1 if the scalar is an element of the array, which is exactly what these tests expect.

Likely an incorrect or invalid review comment.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
tests/test_collection_get.py (3)

135-180: Good edge case coverage for array $in/$nin, but consider additional corner cases.

The helper method properly tests overlap, disjoint, missing, null, and empty scenarios for JSON array fields, which validates the JSON_OVERLAPS implementation.

Based on the past review feedback requesting more corner cases, consider expanding coverage to include:

  • Single-element arrays (e.g., tags: ["ml"])
  • Large arrays to verify performance characteristics
  • Special characters or unicode in array values
  • Numeric or mixed-type arrays (if the implementation supports them)

181-440: Consider refactoring to reduce duplication across client modes.

The six test methods (embedded/server/OceanBase × $in/$nin) follow a repetitive pattern with duplicated setup, connection testing, and cleanup code. As noted in the past review comment (line 210), this structure can be simplified.

Per the previous suggestion, consider extracting the core test logic into shared helper methods that accept a client parameter, then have each mode-specific test create the appropriate client and delegate to the helper. This would improve maintainability and reduce code duplication.

Example refactoring pattern
def _test_array_in_with_client(self, client):
    """Shared test logic for array $in across all client modes."""
    collection_name = f"test_tags_in_{int(time.time() * 1000)}"
    collection = client.get_or_create_collection(
        name=collection_name,
        embedding_function=pyseekdb.DefaultEmbeddingFunction(),
    )
    try:
        self._run_tags_array_in_nin_suite(collection)
    finally:
        try:
            client.delete_collection(name=collection_name)
        except Exception as cleanup_error:
            print(f"Warning: cleanup failed for {collection_name}: {cleanup_error}")

def test_embedded_metadata_array_in_overlap(self):
    try:
        import pylibseekdb  # noqa: F401
    except ImportError:
        pytest.skip("seekdb embedded package is not installed")
    
    client = pyseekdb.Client(path=SEEKDB_PATH, database=SEEKDB_DATABASE)
    self._test_array_in_with_client(client)

# Similar pattern for server and oceanbase modes

558-579: These scalar $in/$nin tests verify basic operator functionality and regression coverage.

Tests 9-10 exercise scalar metadata field filtering, which is a separate concern from array field support. The $nin operator had a documented bug with scalar metadata fields, making these regression tests essential. $in and $nin operators are designed to work with scalar metadata fields of the same type, and these tests appropriately validate that functionality.

If this PR extends $in/$nin to also support array fields (using JSON_OVERLAPS), the scalar tests are correctly included for regression coverage and to ensure backward compatibility. Verify that both scalar and array field behavior is implemented in the underlying metadata filter logic.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 85e7f27 and 23e67b8.

📒 Files selected for processing (2)
  • src/pyseekdb/client/filters.py
  • tests/test_collection_get.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/pyseekdb/client/filters.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_collection_get.py (3)
src/pyseekdb/client/collection.py (5)
  • add (103-149)
  • get (364-428)
  • client (78-80)
  • name (63-65)
  • embedding_function (88-90)
src/pyseekdb/client/client_seekdb_server.py (1)
  • _execute (98-111)
src/pyseekdb/client/client_seekdb_embedded.py (1)
  • _execute (106-122)
🪛 Ruff (0.14.8)
tests/test_collection_get.py

187-187: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


207-207: Do not catch blind exception: Exception

(BLE001)


228-228: Do not catch blind exception: Exception

(BLE001)


242-242: Do not catch blind exception: Exception

(BLE001)


263-263: Do not catch blind exception: Exception

(BLE001)


277-277: Do not catch blind exception: Exception

(BLE001)


286-286: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


327-327: Do not catch blind exception: Exception

(BLE001)


348-348: Do not catch blind exception: Exception

(BLE001)


383-383: Do not catch blind exception: Exception

(BLE001)


404-404: Do not catch blind exception: Exception

(BLE001)


439-439: Do not catch blind exception: Exception

(BLE001)


559-559: f-string without any placeholders

Remove extraneous f prefix

(F541)


570-570: f-string without any placeholders

Remove extraneous f prefix

(F541)


902-902: f-string without any placeholders

Remove extraneous f prefix

(F541)


913-913: f-string without any placeholders

Remove extraneous f prefix

(F541)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: test (oceanbase, ubuntu-latest)
  • GitHub Check: test (server, ubuntu-latest)
  • GitHub Check: test (embedded, ubuntu-latest)

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
tests/test_collection_get.py (2)

200-200: Remove unused # noqa: F401 directive.

The F401 rule is not enabled in the Ruff configuration, making this directive unnecessary.

Also applies to line 260.


532-552: Remove extraneous f prefix from f-strings without placeholders.

Lines 533 and 544 use f-string syntax without any format placeholders. Use plain strings instead.

🔎 Proposed fix
             # Test 9: Get with scalar $in operator
-            print(f"✅ Testing get with scalar $in operator")
+            print("✅ Testing get with scalar $in operator")
             results = collection.get(
                 where={"tag": {"$in": ["ml", "python"]}},
                 limit=10
             )
             assert results is not None
             assert "ids" in results
             assert len(results["ids"]) > 0
             print(f"   Found {len(results['ids'])} results with tag in ['ml', 'python']")
             
             # Test 10: Get with scalar $nin operator
-            print(f"✅ Testing get with scalar $nin operator")
+            print("✅ Testing get with scalar $nin operator")
             results = collection.get(
🧹 Nitpick comments (2)
tests/test_collection_get.py (2)

254-302: Consider consolidating with shared test flow.

This standalone $nin test partially duplicates the $nin scenario already covered in _test_metadata_array_in_overlap (lines 181-187). While having explicit regression tests is valuable, you could simplify by relying on the shared flow and removing the standalone methods.

The same observation applies to test_server_metadata_array_nin_overlap (lines 304-358) and test_oceanbase_metadata_array_nin_overlap (lines 360-414).


647-654: Consider adding scalar $nin test for consistency.

The embedded client test (lines 543-552) and OceanBase test (lines 886-895) both include a scalar $nin operator test, but the server client test only has the existing $in test here. For symmetric coverage, consider adding a $nin test block.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 23e67b8 and 65c402a.

📒 Files selected for processing (1)
  • tests/test_collection_get.py
🧰 Additional context used
🪛 Ruff (0.14.8)
tests/test_collection_get.py

191-191: Do not catch blind exception: Exception

(BLE001)


200-200: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


228-228: Do not catch blind exception: Exception

(BLE001)


250-250: Do not catch blind exception: Exception

(BLE001)


260-260: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


301-301: Do not catch blind exception: Exception

(BLE001)


322-322: Do not catch blind exception: Exception

(BLE001)


357-357: Do not catch blind exception: Exception

(BLE001)


378-378: Do not catch blind exception: Exception

(BLE001)


413-413: Do not catch blind exception: Exception

(BLE001)


533-533: f-string without any placeholders

Remove extraneous f prefix

(F541)


544-544: f-string without any placeholders

Remove extraneous f prefix

(F541)


876-876: f-string without any placeholders

Remove extraneous f prefix

(F541)


887-887: f-string without any placeholders

Remove extraneous f prefix

(F541)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test (server, ubuntu-latest)
  • GitHub Check: test (embedded, ubuntu-latest)
  • GitHub Check: test (other, ubuntu-latest)
  • GitHub Check: test (oceanbase, ubuntu-latest)
🔇 Additional comments (1)
tests/test_collection_get.py (1)

194-252: LGTM!

The three test methods correctly invoke the shared test flow for each client mode. Connection verification for server and OceanBase modes is appropriate.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/test_collection_get.py (2)

321-336: Consider further consolidating client-mode test duplication.

A previous reviewer suggested simplifying the 3-mode testing pattern. While the shared _test_metadata_array_in_overlap helper is a good step, the six test methods (embedded/server/OceanBase × $in/$nin) still contain significant boilerplate for client initialization and connection validation.

Additionally, line 327 has an unused noqa: F401 directive that can be removed (the F401 rule is not enabled in Ruff configuration).

🔎 Example consolidation pattern
@pytest.fixture(params=["embedded", "server", "oceanbase"])
def client_fixture(request):
    """Fixture providing all three client types"""
    mode = request.param
    if mode == "embedded":
        try:
            import pylibseekdb
        except ImportError:
            pytest.skip("seekdb embedded package is not installed")
        client = pyseekdb.Client(path=SEEKDB_PATH, database=SEEKDB_DATABASE)
    elif mode == "server":
        client = pyseekdb.Client(host=SERVER_HOST, port=SERVER_PORT, ...)
        # validation...
    else:  # oceanbase
        client = pyseekdb.Client(host=OB_HOST, port=OB_PORT, tenant=OB_TENANT, ...)
        # validation...
    yield client

def test_metadata_array_in_overlap(self, client_fixture):
    """Single parameterized test across all modes"""
    self._test_metadata_array_in_overlap(client_fixture)

This would reduce from 6 test methods to 2 (one for $in, one for $nin).

🔎 Quick fix for unused noqa directive
         try:
-            import pylibseekdb  # noqa: F401
+            import pylibseekdb  # noqa
         except ImportError:

Apply the same change on line 387.


381-430: Test coverage is narrower than the comprehensive helper.

This $nin test validates the basic overlap vs. disjoint case but doesn't cover the NULL, empty array, or missing field scenarios tested in _test_metadata_array_in_overlap. Consider whether these edge cases should also be validated here or if relying on the comprehensive helper is sufficient.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 65c402a and f83f63d.

📒 Files selected for processing (1)
  • tests/test_collection_get.py
🧰 Additional context used
🪛 Ruff (0.14.8)
tests/test_collection_get.py

191-191: Do not catch blind exception: Exception

(BLE001)


327-327: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


355-355: Do not catch blind exception: Exception

(BLE001)


377-377: Do not catch blind exception: Exception

(BLE001)


387-387: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


428-428: Do not catch blind exception: Exception

(BLE001)


449-449: Do not catch blind exception: Exception

(BLE001)


484-484: Do not catch blind exception: Exception

(BLE001)


505-505: Do not catch blind exception: Exception

(BLE001)


540-540: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test (server, ubuntu-latest)
  • GitHub Check: test (oceanbase, ubuntu-latest)
  • GitHub Check: test (embedded, ubuntu-latest)
  • GitHub Check: test (other, ubuntu-latest)
🔇 Additional comments (6)
tests/test_collection_get.py (6)

194-320: Good refactoring—extracts common test flow.

The _test_collection_get helper method effectively reduces duplication across the three client-mode test methods. The 14 test scenarios provide comprehensive coverage of the collection.get() interface.


337-358: LGTM—server mode test is properly structured.

The connection validation and test flow are correct. The bare Exception catch at line 355 is acceptable for connection testing in a test context.


359-380: LGTM—OceanBase mode test follows consistent pattern.

The test structure and connection validation are appropriate.


431-486: LGTM—server mode $nin test is consistent.

The test logic correctly validates that overlapping arrays are excluded by $nin. The bare exception catches in connection validation (line 449) and cleanup (line 484) are acceptable in test context.


487-542: LGTM—OceanBase mode $nin test follows established pattern.

The test structure is consistent with the other client modes. Exception handling is appropriate for test context.


579-579: Effective use of the refactored helper.

The calls to _test_collection_get successfully eliminate code duplication across the three client modes.

Also applies to: 631-631, 683-683

@hnwyllmm hnwyllmm merged commit 650af0b into oceanbase:develop Dec 22, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: failed to handle $in operator if the type of the value is array

2 participants