-
Notifications
You must be signed in to change notification settings - Fork 15
fix: metadata $in/$nin operators not matching JSON array fields #87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughMetadata Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20–30 minutes
Poem
Pre-merge checks and finishing touches✅ Passed checks (5 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/test_collection_get.py (2)
141-141: Clean up unusednoqadirectives.Lines 141 and 285 have
# noqa: F401directives that are no longer needed (the import check is not enabled).Proposed fix
try: - import pylibseekdb # noqa: F401 + import pylibseekdb except ImportError:Apply the same change on line 285.
Also applies to: 285-285
558-558: Remove extraneousfprefix from strings without placeholders.Lines 558, 569, 901, and 912 use f-strings but don't contain any placeholders.
Proposed fix
- print(f"✅ Testing get with scalar $in operator") + print("✅ Testing get with scalar $in operator")Apply the same fix to lines 569, 901, and 912.
Also applies to: 569-569, 901-901, 912-912
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/pyseekdb/client/filters.py(2 hunks)tests/test_collection_get.py(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_collection_get.py (6)
src/pyseekdb/client/collection.py (5)
client(78-80)name(63-65)embedding_function(88-90)add(103-149)get(364-428)src/pyseekdb/client/__init__.py (1)
Client(84-201)src/pyseekdb/client/admin_client.py (1)
get_or_create_collection(214-233)src/pyseekdb/client/embedding_function.py (1)
DefaultEmbeddingFunction(84-469)src/pyseekdb/client/client_seekdb_embedded.py (1)
_execute(106-122)src/pyseekdb/client/client_seekdb_server.py (1)
_execute(98-111)
🪛 Ruff (0.14.8)
tests/test_collection_get.py
141-141: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
176-176: Do not catch blind exception: Exception
(BLE001)
197-197: Do not catch blind exception: Exception
(BLE001)
226-226: Do not catch blind exception: Exception
(BLE001)
247-247: Do not catch blind exception: Exception
(BLE001)
276-276: Do not catch blind exception: Exception
(BLE001)
285-285: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
326-326: Do not catch blind exception: Exception
(BLE001)
347-347: Do not catch blind exception: Exception
(BLE001)
382-382: Do not catch blind exception: Exception
(BLE001)
403-403: Do not catch blind exception: Exception
(BLE001)
438-438: Do not catch blind exception: Exception
(BLE001)
558-558: f-string without any placeholders
Remove extraneous f prefix
(F541)
569-569: f-string without any placeholders
Remove extraneous f prefix
(F541)
901-901: f-string without any placeholders
Remove extraneous f prefix
(F541)
912-912: f-string without any placeholders
Remove extraneous f prefix
(F541)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: test (server, ubuntu-latest)
- GitHub Check: test (oceanbase, ubuntu-latest)
- GitHub Check: test (embedded, ubuntu-latest)
🔇 Additional comments (6)
src/pyseekdb/client/filters.py (3)
9-9: LGTM!Clean addition of the standard library
jsonmodule to support JSON serialization for the$inand$ninoperators.
128-140: The implementation is correct. JSON_OVERLAPS returns 1 if a scalar is an element of the array when comparing the two, so the comment on Line 135 accurately states this handles both scalar and array fields. The oldINoperator and the newJSON_OVERLAPSapproach both work correctly for scalar metadata fields.
142-154: The $nin operator implementation is correct and properly tested with both scalar and array metadata fields. JSON_OVERLAPS inherently supports scalar-to-array comparisons, so no issues exist.tests/test_collection_get.py (3)
135-277: Excellent regression test coverage for array$inoperator.These three test functions (
test_embedded_metadata_array_in_overlap,test_server_metadata_array_in_overlap,test_oceanbase_metadata_array_in_overlap) properly validate the fix for issue #83 across all three client types. The tests correctly:
- Insert metadata with JSON array fields (
"tags": ["ml", "ai"])- Query using
$inoperator with an overlapping array- Expect a single matching result due to overlap
- Include proper cleanup and connection validation
279-439: Good coverage for array$ninoperator.These three test functions (
test_embedded_metadata_array_nin_overlap,test_server_metadata_array_nin_overlap,test_oceanbase_metadata_array_nin_overlap) provide complementary coverage for the$ninoperator. They correctly test that:
- Documents with array overlap are excluded
- Documents without overlap are returned
- Proper assertions verify both inclusion (id2) and exclusion (id1)
557-577: These scalar field $in/$nin tests are correctly implemented and should pass.JSON_OVERLAPS is specifically designed to handle comparisons between a scalar value and an array, returning true if the scalar is an element of the array. The test data stores scalar metadata values (
"tag": "ml","tag": "python","tag": "neural"), and the filter implementation correctly generates SQL likeJSON_OVERLAPS(JSON_EXTRACT(metadata, '$.tag'), CAST('["ml", "python"]' AS JSON)), which will work as intended.The concern raised in the original review is unfounded—when comparing a scalar with an array, JSON_OVERLAPS returns 1 if the scalar is an element of the array, which is exactly what these tests expect.
Likely an incorrect or invalid review comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (3)
tests/test_collection_get.py (3)
135-180: Good edge case coverage for array $in/$nin, but consider additional corner cases.The helper method properly tests overlap, disjoint, missing, null, and empty scenarios for JSON array fields, which validates the JSON_OVERLAPS implementation.
Based on the past review feedback requesting more corner cases, consider expanding coverage to include:
- Single-element arrays (e.g.,
tags: ["ml"])- Large arrays to verify performance characteristics
- Special characters or unicode in array values
- Numeric or mixed-type arrays (if the implementation supports them)
181-440: Consider refactoring to reduce duplication across client modes.The six test methods (embedded/server/OceanBase × $in/$nin) follow a repetitive pattern with duplicated setup, connection testing, and cleanup code. As noted in the past review comment (line 210), this structure can be simplified.
Per the previous suggestion, consider extracting the core test logic into shared helper methods that accept a client parameter, then have each mode-specific test create the appropriate client and delegate to the helper. This would improve maintainability and reduce code duplication.
Example refactoring pattern
def _test_array_in_with_client(self, client): """Shared test logic for array $in across all client modes.""" collection_name = f"test_tags_in_{int(time.time() * 1000)}" collection = client.get_or_create_collection( name=collection_name, embedding_function=pyseekdb.DefaultEmbeddingFunction(), ) try: self._run_tags_array_in_nin_suite(collection) finally: try: client.delete_collection(name=collection_name) except Exception as cleanup_error: print(f"Warning: cleanup failed for {collection_name}: {cleanup_error}") def test_embedded_metadata_array_in_overlap(self): try: import pylibseekdb # noqa: F401 except ImportError: pytest.skip("seekdb embedded package is not installed") client = pyseekdb.Client(path=SEEKDB_PATH, database=SEEKDB_DATABASE) self._test_array_in_with_client(client) # Similar pattern for server and oceanbase modes
558-579: These scalar $in/$nin tests verify basic operator functionality and regression coverage.Tests 9-10 exercise scalar metadata field filtering, which is a separate concern from array field support. The $nin operator had a documented bug with scalar metadata fields, making these regression tests essential. $in and $nin operators are designed to work with scalar metadata fields of the same type, and these tests appropriately validate that functionality.
If this PR extends $in/$nin to also support array fields (using JSON_OVERLAPS), the scalar tests are correctly included for regression coverage and to ensure backward compatibility. Verify that both scalar and array field behavior is implemented in the underlying metadata filter logic.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/pyseekdb/client/filters.pytests/test_collection_get.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/pyseekdb/client/filters.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_collection_get.py (3)
src/pyseekdb/client/collection.py (5)
add(103-149)get(364-428)client(78-80)name(63-65)embedding_function(88-90)src/pyseekdb/client/client_seekdb_server.py (1)
_execute(98-111)src/pyseekdb/client/client_seekdb_embedded.py (1)
_execute(106-122)
🪛 Ruff (0.14.8)
tests/test_collection_get.py
187-187: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
207-207: Do not catch blind exception: Exception
(BLE001)
228-228: Do not catch blind exception: Exception
(BLE001)
242-242: Do not catch blind exception: Exception
(BLE001)
263-263: Do not catch blind exception: Exception
(BLE001)
277-277: Do not catch blind exception: Exception
(BLE001)
286-286: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
327-327: Do not catch blind exception: Exception
(BLE001)
348-348: Do not catch blind exception: Exception
(BLE001)
383-383: Do not catch blind exception: Exception
(BLE001)
404-404: Do not catch blind exception: Exception
(BLE001)
439-439: Do not catch blind exception: Exception
(BLE001)
559-559: f-string without any placeholders
Remove extraneous f prefix
(F541)
570-570: f-string without any placeholders
Remove extraneous f prefix
(F541)
902-902: f-string without any placeholders
Remove extraneous f prefix
(F541)
913-913: f-string without any placeholders
Remove extraneous f prefix
(F541)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: test (oceanbase, ubuntu-latest)
- GitHub Check: test (server, ubuntu-latest)
- GitHub Check: test (embedded, ubuntu-latest)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (2)
tests/test_collection_get.py (2)
200-200: Remove unused# noqa: F401directive.The
F401rule is not enabled in the Ruff configuration, making this directive unnecessary.Also applies to line 260.
532-552: Remove extraneousfprefix from f-strings without placeholders.Lines 533 and 544 use f-string syntax without any format placeholders. Use plain strings instead.
🔎 Proposed fix
# Test 9: Get with scalar $in operator - print(f"✅ Testing get with scalar $in operator") + print("✅ Testing get with scalar $in operator") results = collection.get( where={"tag": {"$in": ["ml", "python"]}}, limit=10 ) assert results is not None assert "ids" in results assert len(results["ids"]) > 0 print(f" Found {len(results['ids'])} results with tag in ['ml', 'python']") # Test 10: Get with scalar $nin operator - print(f"✅ Testing get with scalar $nin operator") + print("✅ Testing get with scalar $nin operator") results = collection.get(
🧹 Nitpick comments (2)
tests/test_collection_get.py (2)
254-302: Consider consolidating with shared test flow.This standalone
$nintest partially duplicates the$ninscenario already covered in_test_metadata_array_in_overlap(lines 181-187). While having explicit regression tests is valuable, you could simplify by relying on the shared flow and removing the standalone methods.The same observation applies to
test_server_metadata_array_nin_overlap(lines 304-358) andtest_oceanbase_metadata_array_nin_overlap(lines 360-414).
647-654: Consider adding scalar$nintest for consistency.The embedded client test (lines 543-552) and OceanBase test (lines 886-895) both include a scalar
$ninoperator test, but the server client test only has the existing$intest here. For symmetric coverage, consider adding a$nintest block.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/test_collection_get.py
🧰 Additional context used
🪛 Ruff (0.14.8)
tests/test_collection_get.py
191-191: Do not catch blind exception: Exception
(BLE001)
200-200: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
228-228: Do not catch blind exception: Exception
(BLE001)
250-250: Do not catch blind exception: Exception
(BLE001)
260-260: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
301-301: Do not catch blind exception: Exception
(BLE001)
322-322: Do not catch blind exception: Exception
(BLE001)
357-357: Do not catch blind exception: Exception
(BLE001)
378-378: Do not catch blind exception: Exception
(BLE001)
413-413: Do not catch blind exception: Exception
(BLE001)
533-533: f-string without any placeholders
Remove extraneous f prefix
(F541)
544-544: f-string without any placeholders
Remove extraneous f prefix
(F541)
876-876: f-string without any placeholders
Remove extraneous f prefix
(F541)
887-887: f-string without any placeholders
Remove extraneous f prefix
(F541)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test (server, ubuntu-latest)
- GitHub Check: test (embedded, ubuntu-latest)
- GitHub Check: test (other, ubuntu-latest)
- GitHub Check: test (oceanbase, ubuntu-latest)
🔇 Additional comments (1)
tests/test_collection_get.py (1)
194-252: LGTM!The three test methods correctly invoke the shared test flow for each client mode. Connection verification for server and OceanBase modes is appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/test_collection_get.py (2)
321-336: Consider further consolidating client-mode test duplication.A previous reviewer suggested simplifying the 3-mode testing pattern. While the shared
_test_metadata_array_in_overlaphelper is a good step, the six test methods (embedded/server/OceanBase × $in/$nin) still contain significant boilerplate for client initialization and connection validation.Additionally, line 327 has an unused
noqa: F401directive that can be removed (theF401rule is not enabled in Ruff configuration).🔎 Example consolidation pattern
@pytest.fixture(params=["embedded", "server", "oceanbase"]) def client_fixture(request): """Fixture providing all three client types""" mode = request.param if mode == "embedded": try: import pylibseekdb except ImportError: pytest.skip("seekdb embedded package is not installed") client = pyseekdb.Client(path=SEEKDB_PATH, database=SEEKDB_DATABASE) elif mode == "server": client = pyseekdb.Client(host=SERVER_HOST, port=SERVER_PORT, ...) # validation... else: # oceanbase client = pyseekdb.Client(host=OB_HOST, port=OB_PORT, tenant=OB_TENANT, ...) # validation... yield client def test_metadata_array_in_overlap(self, client_fixture): """Single parameterized test across all modes""" self._test_metadata_array_in_overlap(client_fixture)This would reduce from 6 test methods to 2 (one for $in, one for $nin).
🔎 Quick fix for unused noqa directive
try: - import pylibseekdb # noqa: F401 + import pylibseekdb # noqa except ImportError:Apply the same change on line 387.
381-430: Test coverage is narrower than the comprehensive helper.This $nin test validates the basic overlap vs. disjoint case but doesn't cover the NULL, empty array, or missing field scenarios tested in
_test_metadata_array_in_overlap. Consider whether these edge cases should also be validated here or if relying on the comprehensive helper is sufficient.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/test_collection_get.py
🧰 Additional context used
🪛 Ruff (0.14.8)
tests/test_collection_get.py
191-191: Do not catch blind exception: Exception
(BLE001)
327-327: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
355-355: Do not catch blind exception: Exception
(BLE001)
377-377: Do not catch blind exception: Exception
(BLE001)
387-387: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
428-428: Do not catch blind exception: Exception
(BLE001)
449-449: Do not catch blind exception: Exception
(BLE001)
484-484: Do not catch blind exception: Exception
(BLE001)
505-505: Do not catch blind exception: Exception
(BLE001)
540-540: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test (server, ubuntu-latest)
- GitHub Check: test (oceanbase, ubuntu-latest)
- GitHub Check: test (embedded, ubuntu-latest)
- GitHub Check: test (other, ubuntu-latest)
🔇 Additional comments (6)
tests/test_collection_get.py (6)
194-320: Good refactoring—extracts common test flow.The
_test_collection_gethelper method effectively reduces duplication across the three client-mode test methods. The 14 test scenarios provide comprehensive coverage of thecollection.get()interface.
337-358: LGTM—server mode test is properly structured.The connection validation and test flow are correct. The bare
Exceptioncatch at line 355 is acceptable for connection testing in a test context.
359-380: LGTM—OceanBase mode test follows consistent pattern.The test structure and connection validation are appropriate.
431-486: LGTM—server mode $nin test is consistent.The test logic correctly validates that overlapping arrays are excluded by $nin. The bare exception catches in connection validation (line 449) and cleanup (line 484) are acceptable in test context.
487-542: LGTM—OceanBase mode $nin test follows established pattern.The test structure is consistent with the other client modes. Exception handling is appropriate for test context.
579-579: Effective use of the refactored helper.The calls to
_test_collection_getsuccessfully eliminate code duplication across the three client modes.Also applies to: 631-631, 683-683
Summary
1.Fixed a bug where $in/$nin operators on JSON array metadata fields failed to match due to incorrect SQL generation.
2.Add tests for array $in/$nin
close #83
Solution Description
JSON_OVERLAPS(JSON_EXTRACT(metadata, '$.field'), CAST(%s AS JSON))NOT JSON_OVERLAPS(...)Summary by CodeRabbit
Bug Fixes
$inand$nin: validates and normalizes scalar or array inputs, skips empty lists, and uses JSON-overlap comparisons to ensure correct matches.Tests
$in/$ninscenarios (overlap, exclusion, missing/null/empty) across embedded, server, and OceanBase clients.✏️ Tip: You can customize this high-level summary in your review settings.