-
Notifications
You must be signed in to change notification settings - Fork 35
[WIP] Add expression index for sargable queries #2151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer's GuideIntroduces generated columns and an expression-backed composite index on product_status to make joins against base_purl sargable, and updates the advisory summary query to use the new indexed columns instead of a non-sargable LIKE/OR condition. Sequence diagram for advisory summary query using indexed generated columnssequenceDiagram
actor Client
participant VulnerabilityAdvisorySummary
participant DbConnection
participant PostgreSQL
participant BTreeIndex as idx_product_status_package_lookup
Client->>VulnerabilityAdvisorySummary: get_summaries()
VulnerabilityAdvisorySummary->>DbConnection: execute SQL
DbConnection->>PostgreSQL: JOIN product_status WITH base_purl
PostgreSQL->>BTreeIndex: lookup(package_namespace, package_name)
BTreeIndex-->>PostgreSQL: matching product_status rows
PostgreSQL-->>DbConnection: result set
DbConnection-->>VulnerabilityAdvisorySummary: mapped rows
VulnerabilityAdvisorySummary-->>Client: advisory summaries
ER diagram for product_status and base_purl join using generated columnserDiagram
product_status {
uuid id
text package
text package_namespace
text package_name
uuid product_version_range_id
uuid context_cpe_id
}
base_purl {
uuid id
text namespace
text name
}
product_status ||--o{ base_purl : joins_on
Updated class diagram for product_status model and migrationclassDiagram
class product_status_Model {
+Option~String~ package
+Uuid product_version_range_id
+Option~Uuid~ context_cpe_id
+Option~String~ package_namespace
+Option~String~ package_name
}
class m0001220_improve_product_status_Migration {
+up(manager : SchemaManager) Result
+down(manager : SchemaManager) Result
}
class Migrator {
+migrations() Vec
}
Migrator --> m0001220_improve_product_status_Migration : registers
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- The new JOIN on product_status.package_namespace/package_name assumes those columns are always populated; consider clarifying or enforcing how existing product_status.package values are backfilled into the split columns to avoid mismatches or missed rows during the migration period.
- If product_status.package can contain values without a namespace (e.g., just a name), verify that the split logic and index definition handle the NULL/empty-namespace case consistently with the previous OR/LIKE semantics so you don't lose matches that previously hit via the equality branch.
- Double-check the new expression index and schema changes are compatible with any remaining code paths that still rely on product_status.package directly, to avoid divergence between the denormalized package field and the new namespace/name columns.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new JOIN on product_status.package_namespace/package_name assumes those columns are always populated; consider clarifying or enforcing how existing product_status.package values are backfilled into the split columns to avoid mismatches or missed rows during the migration period.
- If product_status.package can contain values without a namespace (e.g., just a name), verify that the split logic and index definition handle the NULL/empty-namespace case consistently with the previous OR/LIKE semantics so you don't lose matches that previously hit via the equality branch.
- Double-check the new expression index and schema changes are compatible with any remaining code paths that still rely on product_status.package directly, to avoid divergence between the denormalized package field and the new namespace/name columns.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
|
@sourcery-ai review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- The new join condition
"product_status"."package_namespace" = "base_purl"."namespace" AND "product_status"."package_name" = "base_purl"."name"changes semantics versus the previousLIKE/=OR predicate: barepackagevalues that previously matched scoped base_purl entries by name alone will no longer match; if this equivalence is still desired you may need to encode the logic from the description (NULL namespace + name-only matching) in the SQL instead of a straight equality join. - Because
=does not match NULLs in SQL, the new join will drop rows where bothproduct_status.package_namespaceandbase_purl.namespaceare NULL (unscoped packages); consider usingIS NOT DISTINCT FROMor an explicit(both NULL) OR (both equal)condition to preserve behavior for unscoped purls.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new join condition `"product_status"."package_namespace" = "base_purl"."namespace" AND "product_status"."package_name" = "base_purl"."name"` changes semantics versus the previous `LIKE`/`=` OR predicate: bare `package` values that previously matched scoped base_purl entries by name alone will no longer match; if this equivalence is still desired you may need to encode the logic from the description (NULL namespace + name-only matching) in the SQL instead of a straight equality join.
- Because `=` does not match NULLs in SQL, the new join will drop rows where both `product_status.package_namespace` and `base_purl.namespace` are NULL (unscoped packages); consider using `IS NOT DISTINCT FROM` or an explicit `(both NULL) OR (both equal)` condition to preserve behavior for unscoped purls.
## Individual Comments
### Comment 1
<location> `modules/fundamental/src/vulnerability/model/details/vulnerability_advisory.rs:246` </location>
<code_context>
-- find purls belonging to the sboms having a name matching package patterns
- JOIN base_purl on "product_status"."package" LIKE CONCAT("base_purl"."namespace", '/', "base_purl"."name") OR "product_status"."package" = "base_purl"."name"
+ JOIN "base_purl" ON "product_status"."package_namespace" = "base_purl"."namespace" AND "product_status"."package_name" = "base_purl"."name"
JOIN "versioned_purl" ON "versioned_purl"."base_purl_id" = "base_purl"."id"
JOIN "qualified_purl" ON "qualified_purl"."versioned_purl_id" = "versioned_purl"."id"
</code_context>
<issue_to_address>
**issue:** Consider impact of stricter join semantics and NULL handling compared to the previous LIKE/OR condition
The previous join matched either `namespace/name` via `LIKE CONCAT(namespace, '/', name)` or just the plain name. The new join requires both `package_namespace` and `package_name` to match, so any rows with NULL `package_namespace` or only a name (no namespace) will now be excluded.
If all `product_status` rows are guaranteed to have been migrated to non-NULL `package_namespace`/`package_name`, this is fine (and likely faster). Otherwise, consider either:
- keeping a fallback on `package_name = base_purl.name` for namespace-less packages, or
- explicitly filtering to rows with non-NULL `package_namespace` to make the behavior change intentional.
Please confirm this stricter join is valid for existing `product_status` data.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
f18e4e7 to
d78bdb3
Compare
|
@sourcery-ai dismiss |
|
@sourcery-ai guide |
d78bdb3 to
76f9752
Compare
|
@sourcery-ai dismiss |
76f9752 to
fa2e80c
Compare
|
@sourcery-ai resolve |
fa2e80c to
ac44ed3
Compare
ctron
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool work. We definitely need some unit tests for it. Do you have any measurements (even local) how much improvement we get out of it?
| package: Set(self.package), | ||
| context_cpe_id: Set(self.cpe.as_ref().map(Cpe::uuid)), | ||
| product_version_range_id: Set(self.product_version_range_id), | ||
| package_namespace: Set(package_namespace), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my understanding this should be generated by postgres, why do we need to set it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ctron, sure, do you have pointers ?
|
|
||
| manager | ||
| .get_connection() | ||
| .execute_unprepared("ALTER TABLE product_status DROP COLUMN IF EXISTS package_name") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be done in seaorm. Like the others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ctron, absolutely and it will, the reason it started as SQL code it's because CONCURRENTLY is not supported by sea-orm and using it allows writes (new data) while the index gets updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where in that code is CURRENTLY being used?
.execute_unprepared("ALTER TABLE product_status DROP COLUMN IF EXISTS package_name")| pub product_version_range_id: Uuid, | ||
| pub context_cpe_id: Option<Uuid>, | ||
| /// Generated column: namespace part of package (NULL if no '/' in package) | ||
| pub package_namespace: Option<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't need to use those in Rust code, do we need them here?
| ON product_status (package_namespace, package_name)", | ||
| ) | ||
| .await?; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my understanding the package_name is a mandatory field. Shouldn't this be set NOT NULL at the end of the migration?
| /// Generated column: namespace part of package (NULL if no '/' in package) | ||
| pub package_namespace: Option<String>, | ||
| /// Generated column: name part of package (everything after '/' or entire package if no '/') | ||
| pub package_name: Option<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is mandatory, isn't it? If that's the case, this should be String.
ac44ed3 to
7291f18
Compare
Add expression index built on the result of expression applied to the package column to make JOIN or WHERE queries more efficient.
The current join:
The problem is that even if with an index on product_status.package, PostgreSQL cannot use it because:
This expression index pre-computes the split (see the slash in the CONCAT) so PostgreSQL doesn't have to do it for every row during the join:
The migration adds
package_namespaceandpackage_nameas GENERATED STORED columnsderived from the existing
packagefield, along with a composite index to optimize lookups.Compatibility Guarantees
Source of Truth:
packageremains the authoritative field. The generated columnsare automatically maintained by PostgreSQL and cannot diverge from their source.
Existing Code Paths: All existing code that reads/writes
packagecontinues to work:package(generated columns auto-populate)packageautomatically recalculate the generated columnsproduct_status.packageremain valid (not replaced by this migration)Query Pattern Equivalence:
OLD:
package = 'name' OR package = 'namespace/name'NEW:
(package_namespace IS NULL AND package_name = 'name') OR (package_namespace = 'namespace' AND package_name = 'name')Both patterns return identical results; the new pattern can use the composite index.
Index Usage: The index on (package_namespace, package_name) efficiently handles:
Migration Safety
Affected Code Locations
See #2039
And https://issues.redhat.com/browse/TC-3106
Summary by Sourcery
Introduce a database migration to optimize product status lookups and update advisory queries to use the new indexed columns.
Enhancements: