@gildub
Contributor

@gildub gildub commented Nov 27, 2025

Add an expression index, built on the result of an expression applied to the package column, to make JOIN and WHERE queries more efficient.

The current join:

JOIN base_purl ON 
    product_status.package = base_purl.name 
    OR product_status.package LIKE CONCAT(base_purl.namespace, '/', base_purl.name)

The problem is that, even with an index on product_status.package, PostgreSQL cannot use it because:

  • LIKE with a pattern computed per row (CONCAT(...)) rather than a constant prevents index usage
  • The OR condition makes it worse
  • The predicate is therefore not sargable (Search ARGument ABLE, i.e. able to use an index)

This expression index pre-computes the split (see the slash in the CONCAT) so PostgreSQL doesn't have to do it for every row during the join:

  • Without index: PostgreSQL scans 100k rows and calls split_part() 100k times
  • With index: PostgreSQL looks up the pre-computed values in O(log n) B-tree time

The migration adds package_namespace and package_name as GENERATED STORED columns
derived from the existing package field, along with a composite index to optimize lookups.
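
As a rough illustration of the split the generated columns are described as performing, here is a minimal Rust sketch. The helper name `split_package` is hypothetical, and it assumes the split happens at the first '/'. The migration itself is described as using SQL CASE and split_part, so its handling of packages containing several slashes may differ from this sketch.

```rust
/// Hypothetical helper: split a `package` string into (namespace, name).
/// Assumption: the split happens at the FIRST '/', and a package without
/// a '/' has no namespace (modeled as `None`, standing in for SQL NULL).
fn split_package(package: &str) -> (Option<&str>, &str) {
    match package.split_once('/') {
        // "namespace/name" -> namespace is set, name is the remainder
        Some((namespace, name)) => (Some(namespace), name),
        // no '/' at all -> the whole string is the name
        None => (None, package),
    }
}
```

With this model, a package such as `lodash` maps to a NULL namespace and the name `lodash`, while `@angular/core` maps to the namespace `@angular` and the name `core`.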

Compatibility Guarantees

  1. Source of Truth: package remains the authoritative field. The generated columns
    are automatically maintained by PostgreSQL and cannot diverge from their source.

  2. Existing Code Paths: All existing code that reads/writes package continues to work:

    • Insertions only need to set package (generated columns auto-populate)
    • Updates to package automatically recalculate the generated columns
    • Queries using product_status.package remain valid (not replaced by this migration)
  3. Query Pattern Equivalence:
    OLD: package = 'name' OR package = 'namespace/name'
    NEW: (package_namespace IS NULL AND package_name = 'name') OR (package_namespace = 'namespace' AND package_name = 'name')
    Both patterns return identical results; the new pattern can use the composite index.

  4. Index Usage: The index on (package_namespace, package_name) efficiently handles:

    • Lookups with NULL namespace (packages without '/')
    • Lookups with non-NULL namespace (scoped packages like npm, maven)
    • B-tree indexes naturally support NULL values in PostgreSQL
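
To make the index usage above more concrete, the sketch below models the composite index in memory with a Rust `BTreeMap` keyed by (namespace, name), with `None` standing in for SQL NULL. This is only an analogy, not how PostgreSQL stores its index; `build_index`, `lookup`, and the row ids are hypothetical. It expresses the NEW pattern from item 3 as two exact-key probes, one for the NULL namespace and one for the given namespace.

```rust
use std::collections::BTreeMap;

/// Composite key mirroring (package_namespace, package_name).
/// `None` models a NULL namespace.
type Key = (Option<String>, String);

/// Build an in-memory stand-in for the composite index: key -> row ids.
fn build_index(rows: &[(u32, Option<&str>, &str)]) -> BTreeMap<Key, Vec<u32>> {
    let mut index: BTreeMap<Key, Vec<u32>> = BTreeMap::new();
    for &(id, namespace, name) in rows {
        index
            .entry((namespace.map(str::to_owned), name.to_owned()))
            .or_default()
            .push(id);
    }
    index
}

/// NEW pattern as two ordered-map probes:
/// (namespace IS NULL AND name = ?) OR (namespace = ? AND name = ?).
fn lookup(index: &BTreeMap<Key, Vec<u32>>, namespace: &str, name: &str) -> Vec<u32> {
    let mut ids = Vec::new();
    let probes = [
        (None, name.to_owned()),
        (Some(namespace.to_owned()), name.to_owned()),
    ];
    for key in probes {
        if let Some(found) = index.get(&key) {
            ids.extend(found);
        }
    }
    ids
}
```

Each probe is an exact-key lookup in an ordered map, which is the in-memory counterpart of the O(log n) behaviour described above, in contrast to evaluating a string expression against every row.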

Migration Safety

  • GENERATED STORED columns are computed at write-time, not query-time
  • Backfill UPDATE ensures all existing rows have populated generated columns
  • CONCURRENTLY option prevents write blocking during index creation
  • Columns added with IF NOT EXISTS for migration idempotency

Affected Code Locations

  • entity/src/product_status.rs: Entity updated with new columns
  • modules/fundamental/src/sbom/model/raw_sql.rs: Can use new columns for optimized joins
  • modules/fundamental/src/purl/model/details/purl.rs: Can migrate to indexed lookups
  • modules/fundamental/src/vulnerability/model/details/vulnerability_advisory.rs: Already uses new columns

See #2039
And https://issues.redhat.com/browse/TC-3106

Summary by Sourcery

Introduce a database migration to optimize product status lookups and update advisory queries to use the new indexed columns.

Enhancements:

  • Register a new migration that improves product_status performance via additional schema/index changes.
  • Update vulnerability advisory summary queries to join base_purl using package_namespace and package_name instead of non-sargable LIKE/OR conditions.

@sourcery-ai
Contributor

sourcery-ai bot commented Nov 27, 2025

Reviewer's Guide

Introduces generated columns and an expression-backed composite index on product_status to make joins against base_purl sargable, and updates the advisory summary query to use the new indexed columns instead of a non-sargable LIKE/OR condition.

Sequence diagram for advisory summary query using indexed generated columns

sequenceDiagram
    actor Client
    participant VulnerabilityAdvisorySummary
    participant DbConnection
    participant PostgreSQL
    participant BTreeIndex as idx_product_status_package_lookup

    Client->>VulnerabilityAdvisorySummary: get_summaries()
    VulnerabilityAdvisorySummary->>DbConnection: execute SQL
    DbConnection->>PostgreSQL: JOIN product_status WITH base_purl
    PostgreSQL->>BTreeIndex: lookup(package_namespace, package_name)
    BTreeIndex-->>PostgreSQL: matching product_status rows
    PostgreSQL-->>DbConnection: result set
    DbConnection-->>VulnerabilityAdvisorySummary: mapped rows
    VulnerabilityAdvisorySummary-->>Client: advisory summaries

ER diagram for product_status and base_purl join using generated columns

erDiagram
    product_status {
        uuid id
        text package
        text package_namespace
        text package_name
        uuid product_version_range_id
        uuid context_cpe_id
    }

    base_purl {
        uuid id
        text namespace
        text name
    }

    product_status ||--o{ base_purl : joins_on

Updated class diagram for product_status model and migration

classDiagram
    class product_status_Model {
        +Option~String~ package
        +Uuid product_version_range_id
        +Option~Uuid~ context_cpe_id
        +Option~String~ package_namespace
        +Option~String~ package_name
    }

    class m0001220_improve_product_status_Migration {
        +up(manager : SchemaManager) Result
        +down(manager : SchemaManager) Result
    }

    class Migrator {
        +migrations() Vec
    }

    Migrator --> m0001220_improve_product_status_Migration : registers

File-Level Changes

Change: Add generated columns to product_status model to expose namespace/name split of package for indexed lookups.
Details:
  • Extend the product_status ORM model with nullable package_namespace and package_name fields annotated as generated columns conceptually derived from package.
  • Document the semantics of the new fields as the namespace and name components of package, covering both scoped and unscoped package formats.
Files:
  • entity/src/product_status.rs

Change: Register a new database migration that adds generated columns, backfills them, and creates a composite expression index for efficient joins.
Details:
  • Introduce migration m0001220_improve_product_status implementing up/down logic for schema changes.
  • In up, add GENERATED ALWAYS AS STORED columns package_namespace and package_name to product_status using SQL CASE and split_part to parse the package string.
  • Run a backfill UPDATE to recompute generated columns for existing rows where either generated column is NULL.
  • Create a concurrent composite index idx_product_status_package_lookup on (package_namespace, package_name) to support sargable queries while avoiding write blocking.
  • In down, drop the composite index and remove the generated columns to fully revert the migration.
Files:
  • migration/src/m0001220_improve_product_status.rs
  • migration/src/lib.rs

Change: Update vulnerability advisory summary query to join base_purl via the new indexed namespace/name columns instead of a non-sargable LIKE/OR predicate.
Details:
  • Replace the JOIN condition that used package LIKE CONCAT(namespace,'/',name) OR package = name with an equality join on package_namespace = namespace AND package_name = name.
  • Leverage the new composite index so PostgreSQL can efficiently resolve joins without scanning and evaluating string functions per row.
Files:
  • modules/fundamental/src/vulnerability/model/details/vulnerability_advisory.rs

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • The new JOIN on product_status.package_namespace/package_name assumes those columns are always populated; consider clarifying or enforcing how existing product_status.package values are backfilled into the split columns to avoid mismatches or missed rows during the migration period.
  • If product_status.package can contain values without a namespace (e.g., just a name), verify that the split logic and index definition handle the NULL/empty-namespace case consistently with the previous OR/LIKE semantics so you don't lose matches that previously hit via the equality branch.
  • Double-check the new expression index and schema changes are compatible with any remaining code paths that still rely on product_status.package directly, to avoid divergence between the denormalized package field and the new namespace/name columns.

@gildub
Contributor Author

gildub commented Nov 27, 2025

@sourcery-ai review

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • The new join condition "product_status"."package_namespace" = "base_purl"."namespace" AND "product_status"."package_name" = "base_purl"."name" changes semantics versus the previous LIKE/= OR predicate: bare package values that previously matched scoped base_purl entries by name alone will no longer match; if this equivalence is still desired you may need to encode the logic from the description (NULL namespace + name-only matching) in the SQL instead of a straight equality join.
  • Because = does not match NULLs in SQL, the new join will drop rows where both product_status.package_namespace and base_purl.namespace are NULL (unscoped packages); consider using IS NOT DISTINCT FROM or an explicit (both NULL) OR (both equal) condition to preserve behavior for unscoped purls.
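
The NULL behaviour raised in the second point can be sketched in Rust by modeling a nullable text column as `Option<&str>`. The function names are hypothetical, and the model covers only equality under SQL's usual three-valued logic; it is not taken from the pull request's code.

```rust
/// SQL `a = b`: the result is NULL (modeled as `None`) if either side is NULL.
fn sql_eq(a: Option<&str>, b: Option<&str>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x == y),
        _ => None,
    }
}

/// SQL `a IS NOT DISTINCT FROM b`: NULL-safe equality, always true or false.
/// Rust's `==` on `Option` already treats `None == None` as true.
fn is_not_distinct_from(a: Option<&str>, b: Option<&str>) -> bool {
    a == b
}

/// A join keeps a pair of rows only when the predicate is exactly TRUE;
/// both FALSE and NULL discard the pair.
fn join_keeps(predicate: Option<bool>) -> bool {
    predicate == Some(true)
}
```

Under this model, two unscoped packages (both namespaces NULL) are dropped by a plain `=` join on the namespace but kept by `IS NOT DISTINCT FROM`.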

## Individual Comments

### Comment 1
<location> `modules/fundamental/src/vulnerability/model/details/vulnerability_advisory.rs:246` </location>
<code_context>

             -- find purls belonging to the sboms having a name matching package patterns
-            JOIN base_purl on "product_status"."package" LIKE CONCAT("base_purl"."namespace", '/', "base_purl"."name") OR "product_status"."package" = "base_purl"."name"
+            JOIN "base_purl" ON "product_status"."package_namespace" = "base_purl"."namespace" AND "product_status"."package_name" = "base_purl"."name"
             JOIN "versioned_purl" ON "versioned_purl"."base_purl_id" = "base_purl"."id"
             JOIN "qualified_purl" ON "qualified_purl"."versioned_purl_id" = "versioned_purl"."id"
</code_context>

<issue_to_address>
**issue:** Consider impact of stricter join semantics and NULL handling compared to the previous LIKE/OR condition

The previous join matched either `namespace/name` via `LIKE CONCAT(namespace, '/', name)` or just the plain name. The new join requires both `package_namespace` and `package_name` to match, so any rows with NULL `package_namespace` or only a name (no namespace) will now be excluded.

If all `product_status` rows are guaranteed to have been migrated to non-NULL `package_namespace`/`package_name`, this is fine (and likely faster). Otherwise, consider either:
- keeping a fallback on `package_name = base_purl.name` for namespace-less packages, or
- explicitly filtering to rows with non-NULL `package_namespace` to make the behavior change intentional.

Please confirm this stricter join is valid for existing `product_status` data.
</issue_to_address>

@gildub gildub force-pushed the product_status-index branch from f18e4e7 to d78bdb3 Compare November 27, 2025 22:43
@gildub
Contributor Author

gildub commented Nov 27, 2025

@sourcery-ai dismiss

@gildub
Contributor Author

gildub commented Nov 27, 2025

@sourcery-ai guide

@gildub gildub force-pushed the product_status-index branch from d78bdb3 to 76f9752 Compare November 27, 2025 23:00
@gildub
Contributor Author

gildub commented Nov 27, 2025

@sourcery-ai dismiss

@gildub gildub force-pushed the product_status-index branch from 76f9752 to fa2e80c Compare November 27, 2025 23:09
@gildub
Contributor Author

gildub commented Nov 27, 2025

@sourcery-ai resolve

@gildub gildub force-pushed the product_status-index branch from fa2e80c to ac44ed3 Compare November 27, 2025 23:11
@gildub gildub changed the title Add expression index for sargable queries [WIP] Add expression index for sargable queries Nov 27, 2025
Contributor

@ctron ctron left a comment

Cool work. We definitely need some unit tests for it. Do you have any measurements (even local ones) of how much improvement we get out of it?

package: Set(self.package),
context_cpe_id: Set(self.cpe.as_ref().map(Cpe::uuid)),
product_version_range_id: Set(self.product_version_range_id),
package_namespace: Set(package_namespace),

Contributor

To my understanding this should be generated by postgres, why do we need to set it here?

Contributor Author

@ctron, sure, do you have pointers?


manager
.get_connection()
.execute_unprepared("ALTER TABLE product_status DROP COLUMN IF EXISTS package_name")

Contributor

This can be done in seaorm. Like the others.

Contributor Author

@gildub gildub Nov 28, 2025

@ctron, absolutely, and it will be. It started as raw SQL because CONCURRENTLY is not supported by sea-orm, and using it allows writes (new data) while the index is being built.

Contributor

Where in that code is CONCURRENTLY being used?

.execute_unprepared("ALTER TABLE product_status DROP COLUMN IF EXISTS package_name")

pub product_version_range_id: Uuid,
pub context_cpe_id: Option<Uuid>,
/// Generated column: namespace part of package (NULL if no '/' in package)
pub package_namespace: Option<String>,

Contributor

If we don't need to use those in Rust code, do we need them here?

ON product_status (package_namespace, package_name)",
)
.await?;

Contributor

To my understanding the package_name is a mandatory field. Shouldn't this be set NOT NULL at the end of the migration?

/// Generated column: namespace part of package (NULL if no '/' in package)
pub package_namespace: Option<String>,
/// Generated column: name part of package (everything after '/' or entire package if no '/')
pub package_name: Option<String>,

Contributor

The name is mandatory, isn't it? If that's the case, this should be String.

@gildub gildub force-pushed the product_status-index branch from ac44ed3 to 7291f18 Compare November 28, 2025 14:20