Support SQL queries over dataset branches #5511

majin1102 · 2025-12-17T09:12:00Z

majin1102
Dec 17, 2025
Collaborator

Motivation

Lance datasets support versioning and branching, which is powerful for ML experimentation (e.g., trying different feature engineering, labeling, or modeling approaches). Today, if you want to run SQL against a specific branch (this discussion focuses on dataset.sql), you must first obtain a Dataset for that branch via APIs like checkout_branch / checkout_version, and then call sql(...) on that Dataset (or register it in a SQL context).

This works, but it has several drawbacks:

Branch selection happens outside of SQL, so queries are less declarative and harder to reason about.
Multi-branch analysis cannot be expressed as a single SQL statement: to compare or combine branches, you need to manually wire multiple datasets/contexts in host code instead of writing JOIN/UNION directly in SQL.
It is harder to build generic tools that treat branches as first-class tables, because the mapping from branches to queryable objects is not visible at the SQL layer.

DataFusion itself fully supports multi-table SQL (joins, unions, etc.). The limitation is in how Lance currently exposes branches to DataFusion: only a single logical table is registered per query. We would like branches to be first-class at the SQL level, so that users can directly query branches and express cross-branch JOIN, UNION, and similar operations purely in SQL.

Grammar / Proposal

Expose each dataset branch as a logical table in DataFusion.

Branch-to-Table Mapping: A branch named main would be accessible as a table named main. A branch named exp_new_features would be accessible as exp_new_features.
Quoted identifiers: Some branch names are not valid bare SQL identifiers, for example names that contain hyphens such as exp-new-features. In that case, the branch can still be queried by using a quoted identifier, e.g.:
```
SELECT * FROM "exp-new-features" WHERE score > 0.9;
```
Backward compatibility: Keep the special table name table as an alias for the default/current branch to ensure existing SQL continues to work.

From the user's perspective, SQL would look like:

SELECT ... FROM <branch_name> WHERE ...

where <branch_name> is either a bare identifier or a quoted identifier corresponding to an existing branch.
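As a rough illustration of the bare-versus-quoted rule above, the helper below decides when a branch name needs double quotes. It is only a sketch: `quote_branch` and `select_all` are hypothetical names, and the "bare identifier" pattern is an assumption (lower-case letters, digits, and underscores), chosen because many SQL engines fold unquoted identifiers to lower case. It does not check for reserved keywords.

```python
import re

# Assumption: a bare identifier starts with a lower-case letter or underscore
# and continues with lower-case letters, digits, or underscores. Anything else
# (hyphens, upper-case letters, spaces) is quoted to be safe.
_BARE_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def quote_branch(branch: str) -> str:
    """Return `branch` as a SQL table reference, quoting only when needed."""
    if _BARE_IDENTIFIER.fullmatch(branch):
        return branch
    # Standard SQL escapes an embedded double quote by doubling it.
    return '"' + branch.replace('"', '""') + '"'


def select_all(branch: str) -> str:
    """Build a simple query against one branch (hypothetical helper)."""
    return f"SELECT * FROM {quote_branch(branch)}"


print(select_all("exp_new_features"))
print(select_all("exp-new-features"))
```

A tool that generates SQL from a list of branch names could route every name through such a helper instead of asking users to remember which names need quoting.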

How to use

Query a specific branch:

-- Query the main branch
SELECT * FROM main WHERE label = 'cat';

-- Query an experimental branch
SELECT id, score FROM exp_new_features WHERE score > 0.95;

Compare two branches with a JOIN:

-- Find rows where the label differs between `main` and `exp_new_features`
SELECT
    m.id,
    m.label AS main_label,
    e.label AS exp_label
FROM main AS m
JOIN exp_new_features AS e
    ON m.id = e.id
WHERE m.label <> e.label;

Combine branches with UNION:

SELECT * FROM main
UNION ALL
SELECT * FROM exp_new_features;
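To make the JOIN and UNION examples concrete, here is a small stand-in that runs them end to end. It uses Python's built-in sqlite3 instead of Lance or DataFusion, and models each branch as an ordinary table with made-up rows, so it only illustrates the SQL semantics of "one table per branch", not how Lance would register branches.

```python
import sqlite3

# Stand-in only: each "branch" is a plain SQLite table with invented rows.
conn = sqlite3.connect(":memory:")
for branch in ("main", "exp_new_features"):
    conn.execute(f"CREATE TABLE {branch} (id INTEGER, label TEXT)")
conn.executemany(
    "INSERT INTO main VALUES (?, ?)",
    [(1, "cat"), (2, "dog"), (3, "cat")],
)
conn.executemany(
    "INSERT INTO exp_new_features VALUES (?, ?)",
    [(1, "cat"), (2, "cat"), (3, "bird")],
)

# Rows whose label differs between the two branches.
changed = conn.execute(
    """
    SELECT m.id, m.label AS main_label, e.label AS exp_label
    FROM main AS m
    JOIN exp_new_features AS e
        ON m.id = e.id
    WHERE m.label <> e.label
    ORDER BY m.id
    """
).fetchall()

# All rows from both branches, duplicates kept.
combined = conn.execute(
    "SELECT * FROM main UNION ALL SELECT * FROM exp_new_features"
).fetchall()

print(changed)
print(len(combined))
```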

majin1102 · 2025-12-17T09:13:05Z

majin1102
Dec 17, 2025
Collaborator Author

Love to hear your thoughts on this @yanghua @Xuanwo @jackye1995 @wjones127 @westonpace

0 replies

Xuanwo · 2025-12-17T12:03:38Z

Xuanwo
Dec 17, 2025
Maintainer

Thank you for starting this discussion. The SQL statement SELECT * FROM main is somewhat surprising to me. My initial assumption when parsing this SQL would always be to treat main as a table name.

I've explored some ideas and feel that perhaps Dataset::sql() is not the most appropriate place to handle this type of work. Now I'm thinking that maybe it's a good time for us to treat SQL as a first-class citizen. We could provide APIs like lance_client.sql("select * from 's3://path/to/dataset@main'").

In this way, we will have:

-- Query the main branch
SELECT * FROM 's3://bucket/test.lance' WHERE label = 'cat';

-- Query an experimental branch
SELECT id, score FROM 's3://bucket/test.lance@exp_new_features' WHERE score > 0.95;

-- Find rows where the label differs between `main` and `exp_new_features`
SELECT
    m.id,
    m.label AS main_label,
    e.label AS exp_label
FROM 's3://bucket/test.lance' AS m
JOIN 's3://bucket/test.lance@exp_new_features' AS e
    ON m.id = e.id
WHERE m.label <> e.label;

-- Combine branches with UNION
SELECT * FROM 's3://bucket/test.lance'
UNION ALL
SELECT * FROM 's3://bucket/test.lance@exp_new_features';
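One way the `'<uri>@<branch>'` form could be split into a dataset location and a branch is sketched below. `DatasetRef` and `parse_dataset_ref` are hypothetical names, not an existing Lance API, and the sketch assumes branch names contain no `/`, so that an `@` in the authority part of a URI is not mistaken for the branch separator.

```python
from typing import NamedTuple, Optional


class DatasetRef(NamedTuple):
    uri: str
    branch: Optional[str]  # None means "use the default branch"


def parse_dataset_ref(ref: str) -> DatasetRef:
    """Split 'uri@branch' into its parts (hypothetical helper)."""
    # Only look for '@' in the last path segment, so 's3://user@host/x.lance'
    # is treated as a plain URI with no branch.
    head, slash, tail = ref.rpartition("/")
    name, at, branch = tail.rpartition("@")
    if not at:
        return DatasetRef(ref, None)
    if not name or not branch:
        raise ValueError(f"malformed dataset reference: {ref!r}")
    return DatasetRef(head + slash + name, branch)


print(parse_dataset_ref("s3://bucket/test.lance@exp_new_features"))
print(parse_dataset_ref("s3://bucket/test.lance"))
```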

2 replies

jackye1995 Dec 17, 2025
Maintainer

I've explored some ideas and feel that perhaps Dataset::sql() is not the most appropriate place to handle this type of work. Now I'm thinking that maybe it's a good time for us to treat SQL as a first-class citizen. We could provide APIs like lance_client.sql("select * from 's3://path/to/dataset@main'").

Agree! I think we discussed doing that through a DataFusion catalog provider in #4779, but we just haven't had time to do it...

majin1102 Jan 6, 2026
Collaborator Author

I've built a prototype for #4779, and I think it's a good time to pick this up again. Would love your input!

Xuanwo · 2025-12-17T17:37:31Z

Xuanwo
Dec 17, 2025
Maintainer

By the way, Iceberg in Spark handles branches this way:

SELECT * FROM db.table.branch_test_branch;
1   a   NULL
2   b   NULL
3   c   NULL

SELECT * FROM db.table VERSION AS OF 'test_branch';
1   a   NULL
2   b   NULL
3   c   NULL

SELECT * FROM db.table.refs;
test_branch BRANCH  8109744798576441359 NULL    NULL    NULL
main        BRANCH  6910357365743665710 NULL    NULL    NULL


SELECT * FROM db.table VERSION AS OF 8109744798576441359;
1   a   1.0
2   b   2.0
3   c   3.0

1 reply

westonpace Dec 18, 2025
Maintainer

+1 for this approach. Both Snowflake and Databricks support time travel queries with this style. Substituting the branch in place of the version/timestamp seems like a very natural way to represent this.

Sadly, DataFusion's SQL parser doesn't (yet) have support for this, but we can do it with custom SQL.

jackye1995 · 2025-12-17T17:47:17Z

jackye1995
Dec 17, 2025
Maintainer

SELECT ... FROM <branch_name> is also surprising to me. I think we need to have at least a table reference, even if it is a dummy one, like SELECT ... FROM dataset@<branch_name>.

In Iceberg, I think the consensus is to support this through the time travel syntax, treating the branch as a string version name.

There is the standard SQL approach:

ANSI standard we pushed (Trino, Hive):

SELECT * FROM table FOR VERSION AS OF branch1

Spark:

SELECT * FROM table VERSION AS OF branch1

And there is the identifier approach, where you encode the branch name directly in the table name, like table@branch1 or table$branch_branch1, depending on the engine.

I think the identifier approach turned out to be much easier to integrate with, since there is no need to plumb new SQL syntax through the system, and many no-code platforms don't even allow adding a new clause and only expose the table name as user input. So for accessing a branch in SQL for Lance, to me it's just a question of how we resolve the identifier properly when branch information is encoded in it, and we can definitely do that through a table provider.
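A minimal sketch of the identifier-resolution step, covering the two encodings mentioned above (`table@branch1` and `table$branch_branch1`). `resolve_identifier` is a hypothetical name, the patterns are assumptions for illustration, and a real table provider would still need to map the result onto an actual dataset and branch.

```python
import re
from typing import Optional, Tuple

# Assumed encodings, taken from the two examples above:
#   table@branch1          -> ("table", "branch1")
#   table$branch_branch1   -> ("table", "branch1")
_PATTERNS = (
    re.compile(r"^(?P<table>[^@$]+)@(?P<branch>.+)$"),
    re.compile(r"^(?P<table>[^@$]+)\$branch_(?P<branch>.+)$"),
)


def resolve_identifier(identifier: str) -> Tuple[str, Optional[str]]:
    """Return (table, branch); branch is None when none is encoded."""
    for pattern in _PATTERNS:
        match = pattern.match(identifier)
        if match:
            return match.group("table"), match.group("branch")
    return identifier, None


print(resolve_identifier("table@branch1"))
print(resolve_identifier("table$branch_branch1"))
print(resolve_identifier("table"))
```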

0 replies
