-
Notifications
You must be signed in to change notification settings - Fork 513
Description
Motivation
Lance datasets support versioning and branching, which is powerful for ML experimentation (e.g., trying different feature engineering, labeling, or modeling approaches). Today, if you want to run SQL against a specific branch(Let's focus on the dataset.sql in this context), you must first obtain a Dataset for that branch via APIs like checkout_branch / checkout_version, and then call sql(...) on that Dataset (or register it in a SQL context).
This works, but it has several drawbacks:
- Branch selection happens outside of SQL, so queries are less declarative and harder to reason about.
- Multi-branch analysis cannot be expressed as a single SQL statement: to compare or combine branches, you need to manually wire multiple datasets/contexts in host code instead of writing JOIN/UNION directly in SQL.
- It is harder to build generic tools that treat branches as first-class tables, because the mapping from branches to queryable objects is not visible at the SQL layer.
DataFusion itself fully supports multi-table SQL (joins, unions, etc.). The limitation is in how Lance currently exposes branches to DataFusion: only a single logical table is registered per query. We would like branches to be first-class at the SQL level, so that users can directly query branches and express cross-branch JOIN, UNION, and similar operations purely in SQL.
Grammar / Proposal
Expose each dataset branch as a logical table in DataFusion.
-
Branch-to-Table Mapping: A branch named
mainwould be accessible as a table namedmain. A branch namedexp_new_featureswould be accessible asexp_new_features. -
Quoted identifiers: Some branch names are not valid bare SQL identifiers, for example names that contain hyphens such as
exp-new-features. In that case, the branch can still be queried by using a quoted identifier, e.g.:SELECT * FROM "exp-new-features" WHERE score > 0.9;
-
Backward compatibility: Keep the special table name
tableas an alias for the default/current branch to ensure existing SQL continues to work.
From the user's perspective, SQL would look like:
SELECT ... FROM <branch_name> WHERE ...where <branch_name> is either a bare identifier or a quoted identifier corresponding to an existing branch.
How to use
Query a specific branch:
-- Query the main branch
SELECT * FROM main WHERE label = 'cat';
-- Query an experimental branch
SELECT id, score FROM exp_new_features WHERE score > 0.95;Compare two branches with a JOIN:
-- Find rows where the label differs between `main` and `exp_new_features`
SELECT
m.id,
m.label AS main_label,
e.label AS exp_label
FROM main AS m
JOIN exp_new_features AS e
ON m.id = e.id
WHERE m.label <> e.label;Combine branches with UNION:
SELECT * FROM main
UNION ALL
SELECT * FROM exp_new_features;