[WIP] Add expression index for sargable queries #2151

gildub · 2025-11-27T22:26:35Z

Add an expression index, built on the result of an expression applied to the package column, to make JOIN and WHERE queries on that column more efficient.

The current join:

JOIN base_purl ON 
    product_status.package = base_purl.name 
    OR product_status.package LIKE CONCAT(base_purl.namespace, '/', base_purl.name)

The problem is that, even with an index on product_status.package, PostgreSQL cannot use it because:

- LIKE with a pattern built from another column (CONCAT(...)) instead of a constant prevents index usage
- The OR condition makes it worse
- The predicate is not sargable (Search ARGument ABLE, i.e. able to use indexes)

This expression index pre-computes the split (see the slash in the CONCAT) so PostgreSQL doesn't have to do it for every row during the join:

- Without index: PostgreSQL scans all rows (e.g. 100k) and evaluates the string expression once per row
- With index: PostgreSQL looks up the pre-computed values in O(log n) B-tree time
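
As a rough in-memory analogy (not how PostgreSQL is implemented), the two cases can be sketched in Rust with a `BTreeMap` standing in for the B-tree index. The names `Key`, `key_of`, `scan`, and `build_index` are hypothetical, and splitting at the first '/' is an assumption about the split rule.

```rust
use std::collections::BTreeMap;

/// (namespace, name) pair; `None` plays the role of a NULL namespace.
type Key = (Option<String>, String);

/// Assumed split rule: everything before the first '/' is the namespace.
fn key_of(package: &str) -> Key {
    match package.split_once('/') {
        Some((ns, name)) => (Some(ns.to_string()), name.to_string()),
        None => (None, package.to_string()),
    }
}

/// "Without index": visit every row and redo the string work each time.
fn scan(rows: &[(u32, String)], wanted: &Key) -> Vec<u32> {
    rows.iter()
        .filter(|(_, package)| &key_of(package) == wanted)
        .map(|(id, _)| *id)
        .collect()
}

/// "With index": split once per row at build time, then look up in O(log n).
fn build_index(rows: &[(u32, String)]) -> BTreeMap<Key, Vec<u32>> {
    let mut index: BTreeMap<Key, Vec<u32>> = BTreeMap::new();
    for (id, package) in rows {
        index.entry(key_of(package)).or_default().push(*id);
    }
    index
}

fn main() {
    let rows = vec![
        (1, "openssl".to_string()),
        (2, "@angular/core".to_string()),
        (3, "zlib".to_string()),
    ];
    let index = build_index(&rows);
    let wanted: Key = (Some("@angular".to_string()), "core".to_string());
    let from_index = index.get(&wanted).cloned().unwrap_or_default();
    // Both strategies find the same rows; only the amount of work differs.
    assert_eq!(scan(&rows, &wanted), from_index);
    println!("matching row ids: {from_index:?}");
}
```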

The migration adds package_namespace and package_name as GENERATED STORED columns
derived from the existing package field, along with a composite index to optimize lookups.
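
For illustration, the intended semantics of the two generated columns (namespace is NULL when package has no '/', name is the part after the '/' or the whole package) can be mirrored in a small Rust function. `split_package` is a hypothetical helper, not code from this PR, and it assumes the split happens at the first '/'; the migration's CASE/split_part expression may treat packages with several slashes differently.

```rust
/// Hypothetical mirror of the generated columns: returns
/// (package_namespace, package_name) for a given `package` value.
/// Assumption: the split happens at the first '/'.
fn split_package(package: &str) -> (Option<&str>, &str) {
    match package.split_once('/') {
        // Scoped package, e.g. "namespace/name".
        Some((namespace, name)) => (Some(namespace), name),
        // No '/': the namespace is NULL and the whole value is the name.
        None => (None, package),
    }
}

fn main() {
    for package in ["openssl", "org.apache.logging.log4j/log4j-core"] {
        let (namespace, name) = split_package(package);
        println!("{package:?} -> namespace={namespace:?}, name={name:?}");
    }
}
```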

Compatibility Guarantees

Source of Truth: package remains the authoritative field. The generated columns
are automatically maintained by PostgreSQL and cannot diverge from their source.
Existing Code Paths: All existing code that reads/writes package continues to work:
- Insertions only need to set package (generated columns auto-populate)
- Updates to package automatically recalculate the generated columns
- Queries using product_status.package remain valid (not replaced by this migration)
Query Pattern Equivalence:
OLD: package = 'name' OR package = 'namespace/name'
NEW: (package_namespace IS NULL AND package_name = 'name') OR (package_namespace = 'namespace' AND package_name = 'name')
Both patterns return identical results; the new pattern can use the composite index.
Index Usage: The index on (package_namespace, package_name) efficiently handles:
- Lookups with NULL namespace (packages without '/')
- Lookups with non-NULL namespace (scoped packages like npm, maven)
- B-tree indexes naturally support NULL values in PostgreSQL
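
The OLD and NEW patterns from "Query Pattern Equivalence" can be compared side by side in a minimal Rust sketch. `old_matches` and `new_matches` are hypothetical names, and the sketch assumes plain string equality, so it ignores the wildcard behaviour of `%` and `_` that the original LIKE would have.

```rust
/// OLD: package = 'name' OR package = 'namespace/name'
fn old_matches(package: &str, namespace: &str, name: &str) -> bool {
    package == name || package == format!("{namespace}/{name}")
}

/// NEW: (package_namespace IS NULL AND package_name = 'name')
///   OR (package_namespace = 'namespace' AND package_name = 'name')
fn new_matches(
    package_namespace: Option<&str>,
    package_name: &str,
    namespace: &str,
    name: &str,
) -> bool {
    (package_namespace.is_none() && package_name == name)
        || (package_namespace == Some(namespace) && package_name == name)
}

fn main() {
    // (package, package_namespace, package_name), split by hand for the example.
    let samples = [
        ("core", None, "core"),
        ("@angular/core", Some("@angular"), "core"),
        ("@vue/core", Some("@vue"), "core"),
        ("zlib", None, "zlib"),
    ];
    for (package, pkg_ns, pkg_name) in samples {
        let old = old_matches(package, "@angular", "core");
        let new = new_matches(pkg_ns, pkg_name, "@angular", "core");
        assert_eq!(old, new);
        println!("{package:?}: old={old}, new={new}");
    }
}
```

Note that the NEW form needs the explicit NULL branch; a plain equality on both columns would not reproduce the name-only match.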

Migration Safety

GENERATED STORED columns are computed at write-time, not query-time
Backfill UPDATE ensures all existing rows have populated generated columns
CONCURRENTLY option prevents write blocking during index creation
Columns added with IF NOT EXISTS for migration idempotency

Affected Code Locations

entity/src/product_status.rs: Entity updated with new columns
modules/fundamental/src/sbom/model/raw_sql.rs: Can use new columns for optimized joins
modules/fundamental/src/purl/model/details/purl.rs: Can migrate to indexed lookups
modules/fundamental/src/vulnerability/model/details/vulnerability_advisory.rs: Already uses new columns

See #2039
And https://issues.redhat.com/browse/TC-3106

Summary by Sourcery

Introduce a database migration to optimize product status lookups and update advisory queries to use the new indexed columns.

Enhancements:

Register a new migration that improves product_status performance via additional schema/index changes.
Update vulnerability advisory summary queries to join base_purl using package_namespace and package_name instead of non-sargable LIKE/OR conditions.

sourcery-ai · 2025-11-27T22:26:41Z

Reviewer's Guide

Introduces generated columns and an expression-backed composite index on product_status to make joins against base_purl sargable, and updates the advisory summary query to use the new indexed columns instead of a non-sargable LIKE/OR condition.

Sequence diagram for advisory summary query using indexed generated columns

sequenceDiagram
    actor Client
    participant VulnerabilityAdvisorySummary
    participant DbConnection
    participant PostgreSQL
    participant BTreeIndex as idx_product_status_package_lookup

    Client->>VulnerabilityAdvisorySummary: get_summaries()
    VulnerabilityAdvisorySummary->>DbConnection: execute SQL
    DbConnection->>PostgreSQL: JOIN product_status WITH base_purl
    PostgreSQL->>BTreeIndex: lookup(package_namespace, package_name)
    BTreeIndex-->>PostgreSQL: matching product_status rows
    PostgreSQL-->>DbConnection: result set
    DbConnection-->>VulnerabilityAdvisorySummary: mapped rows
    VulnerabilityAdvisorySummary-->>Client: advisory summaries

ER diagram for product_status and base_purl join using generated columns

erDiagram
    product_status {
        uuid id
        text package
        text package_namespace
        text package_name
        uuid product_version_range_id
        uuid context_cpe_id
    }

    base_purl {
        uuid id
        text namespace
        text name
    }

    product_status ||--o{ base_purl : joins_on

Updated class diagram for product_status model and migration

classDiagram
    class product_status_Model {
        +Option~String~ package
        +Uuid product_version_range_id
        +Option~Uuid~ context_cpe_id
        +Option~String~ package_namespace
        +Option~String~ package_name
    }

    class m0001220_improve_product_status_Migration {
        +up(manager : SchemaManager) Result
        +down(manager : SchemaManager) Result
    }

    class Migrator {
        +migrations() Vec
    }

    Migrator --> m0001220_improve_product_status_Migration : registers

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Add generated columns to product_status model to expose namespace/name split of package for indexed lookups. | Extend the product_status ORM model with nullable package_namespace and package_name fields annotated as generated columns conceptually derived from package. Document the semantics of the new fields as the namespace and name components of package, covering both scoped and unscoped package formats. | `entity/src/product_status.rs` |
| Register a new database migration that adds generated columns, backfills them, and creates a composite expression index for efficient joins. | Introduce migration m0001220_improve_product_status implementing up/down logic for schema changes. In up, add GENERATED ALWAYS AS STORED columns package_namespace and package_name to product_status using SQL CASE and split_part to parse the package string. Run a backfill UPDATE to recompute generated columns for existing rows where either generated column is NULL. Create a concurrent composite index idx_product_status_package_lookup on (package_namespace, package_name) to support sargable queries while avoiding write blocking. In down, drop the composite index and remove the generated columns to fully revert the migration. | `migration/src/m0001220_improve_product_status.rs` `migration/src/lib.rs` |
| Update vulnerability advisory summary query to join base_purl via the new indexed namespace/name columns instead of a non-sargable LIKE/OR predicate. | Replace the JOIN condition that used package LIKE CONCAT(namespace,'/',name) OR package = name with an equality join on package_namespace = namespace AND package_name = name. Leverage the new composite index so PostgreSQL can efficiently resolve joins without scanning and evaluating string functions per row. | `modules/fundamental/src/vulnerability/model/details/vulnerability_advisory.rs` |

sourcery-ai

Hey there - I've reviewed your changes - here's some feedback:

The new JOIN on product_status.package_namespace/package_name assumes those columns are always populated; consider clarifying or enforcing how existing product_status.package values are backfilled into the split columns to avoid mismatches or missed rows during the migration period.
If product_status.package can contain values without a namespace (e.g., just a name), verify that the split logic and index definition handle the NULL/empty-namespace case consistently with the previous OR/LIKE semantics so you don't lose matches that previously hit via the equality branch.
Double-check the new expression index and schema changes are compatible with any remaining code paths that still rely on product_status.package directly, to avoid divergence between the denormalized package field and the new namespace/name columns.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

gildub · 2025-11-27T22:42:35Z

@sourcery-ai review

sourcery-ai

Hey there - I've reviewed your changes - here's some feedback:

The new join condition "product_status"."package_namespace" = "base_purl"."namespace" AND "product_status"."package_name" = "base_purl"."name" changes semantics versus the previous LIKE/= OR predicate: bare package values that previously matched scoped base_purl entries by name alone will no longer match; if this equivalence is still desired you may need to encode the logic from the description (NULL namespace + name-only matching) in the SQL instead of a straight equality join.
Because = does not match NULLs in SQL, the new join will drop rows where both product_status.package_namespace and base_purl.namespace are NULL (unscoped packages); consider using IS NOT DISTINCT FROM or an explicit (both NULL) OR (both equal) condition to preserve behavior for unscoped purls.
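
The NULL behaviour described in the second point can be modelled in Rust, with `Option` standing in for a nullable column. This is only a sketch of SQL's three-valued comparison; the function names `sql_eq`, `is_not_distinct_from`, and `join_keeps` are hypothetical.

```rust
/// SQL `a = b`: the result is unknown (`None`) when either side is NULL.
fn sql_eq(a: Option<&str>, b: Option<&str>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x == y),
        _ => None,
    }
}

/// SQL `a IS NOT DISTINCT FROM b`: two NULLs count as equal.
fn is_not_distinct_from(a: Option<&str>, b: Option<&str>) -> bool {
    a == b
}

/// A join keeps a row pair only when the ON condition is true;
/// both false and unknown drop it.
fn join_keeps(condition: Option<bool>) -> bool {
    condition == Some(true)
}

fn main() {
    // Unscoped package on both sides: namespace is NULL in both tables.
    let (status_ns, purl_ns): (Option<&str>, Option<&str>) = (None, None);
    println!("plain '=' keeps the row: {}", join_keeps(sql_eq(status_ns, purl_ns)));
    println!(
        "IS NOT DISTINCT FROM keeps the row: {}",
        is_not_distinct_from(status_ns, purl_ns)
    );
}
```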

Prompt for AI Agents

Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `modules/fundamental/src/vulnerability/model/details/vulnerability_advisory.rs:246` </location>
<code_context>

             -- find purls belonging to the sboms having a name matching package patterns
-            JOIN base_purl on "product_status"."package" LIKE CONCAT("base_purl"."namespace", '/', "base_purl"."name") OR "product_status"."package" = "base_purl"."name"
+            JOIN "base_purl" ON "product_status"."package_namespace" = "base_purl"."namespace" AND "product_status"."package_name" = "base_purl"."name"
             JOIN "versioned_purl" ON "versioned_purl"."base_purl_id" = "base_purl"."id"
             JOIN "qualified_purl" ON "qualified_purl"."versioned_purl_id" = "versioned_purl"."id"
</code_context>

<issue_to_address>
**issue:** Consider impact of stricter join semantics and NULL handling compared to the previous LIKE/OR condition

The previous join matched either `namespace/name` via `LIKE CONCAT(namespace, '/', name)` or just the plain name. The new join requires both `package_namespace` and `package_name` to match, so any rows with NULL `package_namespace` or only a name (no namespace) will now be excluded.

If all `product_status` rows are guaranteed to have been migrated to non-NULL `package_namespace`/`package_name`, this is fine (and likely faster). Otherwise, consider either:
- keeping a fallback on `package_name = base_purl.name` for namespace-less packages, or
- explicitly filtering to rows with non-NULL `package_namespace` to make the behavior change intentional.

Please confirm this stricter join is valid for existing `product_status` data.
</issue_to_address>

gildub · 2025-11-27T22:48:17Z

@sourcery-ai dismiss

gildub · 2025-11-27T22:48:33Z

@sourcery-ai guide

gildub · 2025-11-27T23:02:36Z

@sourcery-ai dismiss

gildub · 2025-11-27T23:09:40Z

@sourcery-ai resolve

ctron

Cool work. We definitely need some unit tests for it. Do you have any measurements (even local ones) of how much improvement we get out of it?

ctron · 2025-11-28T09:41:34Z

modules/ingestor/src/graph/advisory/product_status.rs

            package: Set(self.package),
            context_cpe_id: Set(self.cpe.as_ref().map(Cpe::uuid)),
            product_version_range_id: Set(self.product_version_range_id),
+            package_namespace: Set(package_namespace),


To my understanding this should be generated by postgres, why do we need to set it here?

@ctron, sure, do you have pointers?

ctron · 2025-11-28T09:42:04Z

migration/src/m0001220_improve_product_status.rs

+
+        manager
+            .get_connection()
+            .execute_unprepared("ALTER TABLE product_status DROP COLUMN IF EXISTS package_name")


This can be done in seaorm. Like the others.

@ctron, absolutely, and it will be. It started as raw SQL because CONCURRENTLY is not supported by sea-orm, and using it allows writes (new data) while the index is being built.

Where in that code is CONCURRENTLY being used?

.execute_unprepared("ALTER TABLE product_status DROP COLUMN IF EXISTS package_name")

ctron · 2025-11-28T09:43:17Z

entity/src/product_status.rs

    pub product_version_range_id: Uuid,
    pub context_cpe_id: Option<Uuid>,
+    /// Generated column: namespace part of package (NULL if no '/' in package)
+    pub package_namespace: Option<String>,


If we don't need to use those in Rust code, do we need them here?

ctron · 2025-11-28T09:44:16Z

migration/src/m0001220_improve_product_status.rs

+                 ON product_status (package_namespace, package_name)",
+            )
+            .await?;
+


To my understanding the package_name is a mandatory field. Shouldn't this be set NOT NULL at the end of the migration?

ctron · 2025-11-28T09:44:47Z

entity/src/product_status.rs

+    /// Generated column: namespace part of package (NULL if no '/' in package)
+    pub package_namespace: Option<String>,
+    /// Generated column: name part of package (everything after '/' or entire package if no '/')
+    pub package_name: Option<String>,


The name is mandatory, isn't it? If that's the case, this should be String.

sourcery-ai bot reviewed Nov 27, 2025


gildub force-pushed the product_status-index branch from f18e4e7 to d78bdb3 Compare November 27, 2025 22:43

gildub force-pushed the product_status-index branch from d78bdb3 to 76f9752 Compare November 27, 2025 23:00

gildub force-pushed the product_status-index branch from 76f9752 to fa2e80c Compare November 27, 2025 23:09

gildub force-pushed the product_status-index branch from fa2e80c to ac44ed3 Compare November 27, 2025 23:11

gildub changed the title ~~Add expression index for sargable queries~~ [WIP] Add expression index for sargable queries Nov 27, 2025

ctron reviewed Nov 28, 2025


Add expression index for sargable query

7291f18

gildub force-pushed the product_status-index branch from ac44ed3 to 7291f18 Compare November 28, 2025 14:20
