Lance is a modern columnar data format optimized for ML/AI workloads, with native cloud storage support. This extension aims to make Lance a first-class file, table, and lakehouse format on DuckDB.
If you just want to use the extension, install it directly from DuckDB's community extensions repository:

```sql
INSTALL lance FROM community;
LOAD lance;

SELECT *
FROM 'path/to/dataset.lance'
LIMIT 1;
```

See DuckDB's extension page for lance for the latest release details: https://duckdb.org/community_extensions/extensions/lance
This repository focuses on source builds for development and CI.
- Initialize submodules:

```bash
git submodule update --init --recursive
```

- Build:

```bash
GEN=ninja make release
```

- Load the extension from a standalone DuckDB binary (local builds typically require unsigned extensions):

```bash
duckdb -unsigned -c "LOAD 'build/release/extension/lance/lance.duckdb_extension'; SELECT 1;"
```

Full SQL reference: docs/sql.md
```sql
-- local file
SELECT *
FROM 'path/to/dataset.lance'
LIMIT 10;

-- s3
SELECT *
FROM 's3://bucket/path/to/dataset.lance'
LIMIT 10;
```

To read s3:// paths, the extension can use DuckDB's native Secrets mechanism to obtain credentials:
```sql
CREATE SECRET (TYPE S3, provider credential_chain);

SELECT *
FROM 's3://bucket/path/to/dataset.lance'
LIMIT 10;
```
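If you are not using a credential chain, DuckDB's standard `CREATE SECRET` syntax with explicit keys should work the same way, since credentials are read through DuckDB's Secrets manager; this is a minimal sketch and the key values below are placeholders, not real credentials:

```sql
-- Hedged sketch: an S3 secret with explicit (placeholder) credentials instead of
-- provider credential_chain; the extension is assumed to pick up any S3 secret
-- registered with DuckDB's Secrets manager.
CREATE SECRET my_s3_secret (
    TYPE S3,
    KEY_ID 'your-access-key-id',
    SECRET 'your-secret-access-key',
    REGION 'us-east-1'
);
```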
Use DuckDB's COPY ... TO ... to materialize query results as a Lance dataset.

```sql
-- Create/overwrite a Lance dataset from a query
COPY (
    SELECT 1::BIGINT AS id, 'a'::VARCHAR AS s
    UNION ALL
    SELECT 2::BIGINT AS id, 'b'::VARCHAR AS s
) TO 'path/to/out.lance' (FORMAT lance, mode 'overwrite');

-- Read it back via the replacement scan
SELECT count(*) FROM 'path/to/out.lance';

-- Append more rows to an existing dataset
COPY (
    SELECT 3::BIGINT AS id, 'c'::VARCHAR AS s
) TO 'path/to/out.lance' (FORMAT lance, mode 'append');

-- Optionally create an empty dataset (schema only)
COPY (
    SELECT 1::BIGINT AS id, 'x'::VARCHAR AS s
    LIMIT 0
) TO 'path/to/empty.lance' (FORMAT lance, mode 'overwrite', write_empty_file true);
```

To write to s3:// paths, load DuckDB's httpfs extension and configure an S3 secret:

```sql
INSTALL httpfs;
LOAD httpfs;
CREATE SECRET (TYPE S3, provider credential_chain);

COPY (SELECT 1 AS id) TO 's3://bucket/path/to/out.lance' (FORMAT lance, mode 'overwrite');
```

When you ATTACH a directory as a Lance namespace, you can create new datasets using CREATE TABLE (schema-only)
or CREATE TABLE AS SELECT (CTAS). The dataset is written to <namespace_root>/<table_name>.lance.
```sql
ATTACH 'path/to/dir' AS lance_ns (TYPE LANCE);

-- Schema-only (creates an empty dataset)
CREATE TABLE lance_ns.main.my_empty (id BIGINT, s VARCHAR);

-- CTAS (writes query results)
CREATE TABLE lance_ns.main.my_dataset AS
SELECT 1::BIGINT AS id, 'a'::VARCHAR AS s
UNION ALL
SELECT 2::BIGINT AS id, 'b'::VARCHAR AS s;

SELECT count(*) FROM lance_ns.main.my_dataset;
```
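Because the dataset lands at <namespace_root>/<table_name>.lance, it can also be read back by path via the replacement scan; the path below simply mirrors the ATTACH example above:

```sql
-- Read the CTAS result directly by path (namespace root 'path/to/dir' plus
-- table name 'my_dataset', per the example above).
SELECT count(*) FROM 'path/to/dir/my_dataset.lance';
```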
```sql
-- Search a vector column, returning distances in `_distance` (smaller is closer)
SELECT id, label, _distance
FROM lance_vector_search('path/to/dataset.lance', 'vec', [0.1, 0.2, 0.3, 0.4]::FLOAT[],
                         k = 5, prefilter = true)
ORDER BY _distance ASC;
```

- Signature: `lance_vector_search(uri, vector_column, query_vector, ...)`
- Positional arguments:
  - `uri` (VARCHAR): Dataset root path or object store URI (e.g. `s3://...`).
  - `vector_column` (VARCHAR): Vector column name.
  - `query_vector` (FLOAT[] or DOUBLE[]): Query vector (must be non-empty; values are cast to float32).
- Named parameters:
  - `k` (BIGINT, default `10`): Number of results to return.
  - `prefilter` (BOOLEAN, default `false`): If `true`, filters are applied before top-k selection.
  - `use_index` (BOOLEAN, default `true`): If `true`, allow ANN index usage when available.
  - `explain_verbose` (BOOLEAN, default `false`): Emit a more verbose Lance plan in `EXPLAIN` output.
- Output:
  - Dataset columns plus `_distance` (smaller is closer).
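As an illustrative variation on the example above, `use_index` and `explain_verbose` can be combined to force an exact (non-ANN) scan and inspect the verbose Lance plan; the dataset path and columns are the same hypothetical ones used above:

```sql
-- Sketch: disable ANN index usage (exact search) and emit a verbose Lance plan.
EXPLAIN
SELECT id, _distance
FROM lance_vector_search('path/to/dataset.lance', 'vec', [0.1, 0.2, 0.3, 0.4]::FLOAT[],
                         k = 5, use_index = false, explain_verbose = true);
```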
```sql
-- Search a text column, returning BM25-like scores in `_score`
SELECT id, text, _score
FROM lance_fts('path/to/dataset.lance', 'text', 'puppy', k = 10, prefilter = true)
ORDER BY _score DESC;
```

- Signature: `lance_fts(uri, text_column, query, ...)`
- Positional arguments:
  - `uri` (VARCHAR): Dataset root path or object store URI (e.g. `s3://...`).
  - `text_column` (VARCHAR): Text column name.
  - `query` (VARCHAR): Query string.
- Named parameters:
  - `k` (BIGINT, default `10`): Number of results to return.
  - `prefilter` (BOOLEAN, default `false`): If `true`, filters are applied before top-k selection.
- Output:
  - Dataset columns plus `_score` (larger is better).
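Since the function's output is an ordinary relation, results can also be post-filtered in plain SQL, for example on a minimum score; the threshold below is arbitrary and purely illustrative:

```sql
-- Sketch: pull a larger candidate set, then keep only rows above an arbitrary
-- score threshold (plain SQL over the function's output).
SELECT id, text, _score
FROM lance_fts('path/to/dataset.lance', 'text', 'puppy', k = 50)
WHERE _score > 1.0
ORDER BY _score DESC
LIMIT 10;
```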
```sql
-- Combine vector and text scores, returning `_hybrid_score` in addition to `_distance` / `_score`
SELECT id, _hybrid_score, _distance, _score
FROM lance_hybrid_search('path/to/dataset.lance',
                         'vec', [0.1, 0.2, 0.3, 0.4]::FLOAT[],
                         'text', 'puppy',
                         k = 10, prefilter = false,
                         alpha = 0.5, oversample_factor = 4)
ORDER BY _hybrid_score DESC;
```

- Signature: `lance_hybrid_search(uri, vector_column, query_vector, text_column, query, ...)`
- Positional arguments:
  - `uri` (VARCHAR): Dataset root path or object store URI (e.g. `s3://...`).
  - `vector_column` (VARCHAR): Vector column name.
  - `query_vector` (FLOAT[] or DOUBLE[]): Query vector (must be non-empty; values are cast to float32).
  - `text_column` (VARCHAR): Text column name.
  - `query` (VARCHAR): Query string.
- Named parameters:
  - `k` (BIGINT, default `10`): Number of results to return.
  - `prefilter` (BOOLEAN, default `false`): If `true`, filters are applied before top-k selection.
  - `alpha` (FLOAT, default `0.5`): Vector/text mixing weight.
  - `oversample_factor` (INTEGER, default `4`): Oversample factor for candidate generation (larger can improve recall at higher cost).
- Output:
  - Dataset columns plus `_hybrid_score` (larger is better), `_distance`, and `_score`.
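To sketch the recall/cost trade-off, the same query can be rerun with a larger `oversample_factor` and a different `alpha` mixing weight; the values below are illustrative, not tuned recommendations, and reuse the hypothetical dataset above:

```sql
-- Sketch: larger candidate pool (oversample_factor) and a different alpha weight;
-- values are illustrative only.
SELECT id, _hybrid_score
FROM lance_hybrid_search('path/to/dataset.lance',
                         'vec', [0.1, 0.2, 0.3, 0.4]::FLOAT[],
                         'text', 'puppy',
                         k = 10, alpha = 0.3, oversample_factor = 8)
ORDER BY _hybrid_score DESC;
```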
Issues and PRs are welcome. High-impact areas include pushdown, parallelism/performance, type coverage, and better diagnostics.
Apache License 2.0.