Replies: 4 comments 3 replies
-
|
Love to hear your thoughts on this @yanghua @Xuanwo @jackye1995 @wjones127 @westonpace |
Beta Was this translation helpful? Give feedback.
-
|
Thank you for starting this discussion. The SQL statement I've explored some ideas and feel that perhaps In this way, we will have: -- Query the main branch
SELECT * FROM 's3://bucket/test.lance' WHERE label = 'cat';
-- Query an experimental branch
SELECT id, score FROM 's3://bucket/test.lance@exp_new_features' WHERE score > 0.95;
-- Find rows where the label differs between `main` and `exp_new_features`
SELECT
m.id,
m.label AS main_label,
e.label AS exp_label
FROM 's3://bucket/test.lance' AS m
JOIN 's3://bucket/test.lance@exp_new_features' AS e
ON m.id = e.id
WHERE m.label <> e.label;
-- Combine branches with UNION
SELECT * FROM 's3://bucket/test.lance'
UNION ALL
SELECT * FROM 's3://bucket/test.lance@exp_new_features'; |
Beta Was this translation helpful? Give feedback.
-
|
btw, iceberg in spark handles branches in this way: SELECT * FROM db.table.branch_test_branch;
1 a NULL
2 b NULL
3 c NULL
SELECT * FROM db.table VERSION AS OF 'test_branch';
1 a NULL
2 b NULL
3 c NULL
SELECT * FROM db.table.refs;
test_branch BRANCH 8109744798576441359 NULL NULL NULL
main BRANCH 6910357365743665710 NULL NULL NULL
SELECT * FROM db.table VERSION AS OF 8109744798576441359;
1 a 1.0
2 b 2.0
3 c 3.0 |
Beta Was this translation helpful? Give feedback.
-
|
In Iceberg I think the common consensus way to support it is through the time travel syntax, and treat branch as a string version name. There is the standard SQL approach: ANSI standard we pushed (Trino, Hive): SELECT * FROM table FOR VERSION AS OF branch1Spark: SELECT * FROM table VERSION AS OF branch1and the identifier approach, that you directly encode branch name in table name like I think the identifier approach turned out to be much easier to integrate with, since there is no need to plumb in the new SQL syntax across the system, and many no-code platforms don't even allow adding a new clause and only exposes table name as allowed user inputs. So for accessing branch in SQL for lance, to me it's just a question of how we resolve the identifier properly when branch information is encoded, and we can definitely do that through table provider. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Motivation
Lance datasets support versioning and branching, which is powerful for ML experimentation (e.g., trying different feature engineering, labeling, or modeling approaches). Today, if you want to run SQL against a specific branch(Let's focus on the dataset.sql in this context), you must first obtain a
Datasetfor that branch via APIs likecheckout_branch/checkout_version, and then callsql(...)on thatDataset(or register it in a SQL context).This works, but it has several drawbacks:
DataFusion itself fully supports multi-table SQL (joins, unions, etc.). The limitation is in how Lance currently exposes branches to DataFusion: only a single logical table is registered per query. We would like branches to be first-class at the SQL level, so that users can directly query branches and express cross-branch JOIN, UNION, and similar operations purely in SQL.
Grammar / Proposal
Expose each dataset branch as a logical table in DataFusion.
Branch-to-Table Mapping: A branch named
mainwould be accessible as a table namedmain. A branch namedexp_new_featureswould be accessible asexp_new_features.Quoted identifiers: Some branch names are not valid bare SQL identifiers, for example names that contain hyphens such as
exp-new-features. In that case, the branch can still be queried by using a quoted identifier, e.g.:Backward compatibility: Keep the special table name
tableas an alias for the default/current branch to ensure existing SQL continues to work.From the user's perspective, SQL would look like:
where
<branch_name>is either a bare identifier or a quoted identifier corresponding to an existing branch.How to use
Query a specific branch:
Compare two branches with a JOIN:
Combine branches with UNION:
Beta Was this translation helpful? Give feedback.
All reactions