Skip to content

Support SQL queries over dataset branches #5507

@majin1102

Description

@majin1102

Motivation

Lance datasets support versioning and branching, which is powerful for ML experimentation (e.g., trying different feature engineering, labeling, or modeling approaches). Today, if you want to run SQL against a specific branch(Let's focus on the dataset.sql in this context), you must first obtain a Dataset for that branch via APIs like checkout_branch / checkout_version, and then call sql(...) on that Dataset (or register it in a SQL context).

This works, but it has several drawbacks:

  • Branch selection happens outside of SQL, so queries are less declarative and harder to reason about.
  • Multi-branch analysis cannot be expressed as a single SQL statement: to compare or combine branches, you need to manually wire multiple datasets/contexts in host code instead of writing JOIN/UNION directly in SQL.
  • It is harder to build generic tools that treat branches as first-class tables, because the mapping from branches to queryable objects is not visible at the SQL layer.

DataFusion itself fully supports multi-table SQL (joins, unions, etc.). The limitation is in how Lance currently exposes branches to DataFusion: only a single logical table is registered per query. We would like branches to be first-class at the SQL level, so that users can directly query branches and express cross-branch JOIN, UNION, and similar operations purely in SQL.

Grammar / Proposal

Expose each dataset branch as a logical table in DataFusion.

  1. Branch-to-Table Mapping: A branch named main would be accessible as a table named main. A branch named exp_new_features would be accessible as exp_new_features.

  2. Quoted identifiers: Some branch names are not valid bare SQL identifiers, for example names that contain hyphens such as exp-new-features. In that case, the branch can still be queried by using a quoted identifier, e.g.:

    SELECT * FROM "exp-new-features" WHERE score > 0.9;
  3. Backward compatibility: Keep the special table name table as an alias for the default/current branch to ensure existing SQL continues to work.

From the user's perspective, SQL would look like:

SELECT ... FROM <branch_name> WHERE ...

where <branch_name> is either a bare identifier or a quoted identifier corresponding to an existing branch.

How to use

Query a specific branch:

-- Query the main branch
SELECT * FROM main WHERE label = 'cat';

-- Query an experimental branch
SELECT id, score FROM exp_new_features WHERE score > 0.95;

Compare two branches with a JOIN:

-- Find rows where the label differs between `main` and `exp_new_features`
SELECT
    m.id,
    m.label AS main_label,
    e.label AS exp_label
FROM main AS m
JOIN exp_new_features AS e
    ON m.id = e.id
WHERE m.label <> e.label;

Combine branches with UNION:

SELECT * FROM main
UNION ALL
SELECT * FROM exp_new_features;

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions