Description
Describe the bug
I'm encountering an issue where filtering a Daft DataFrame read from a Lance dataset and then using show() with a limit produces unexpected results. It appears that the limit is being applied before the filter, which is not the intended behavior.
Notably, this issue does NOT occur with other data formats like Arrow - only with Lance.
To Reproduce
import tempfile
import os

import pyarrow as pa
import lance
import daft

TABLE_NAME = "my_table"
data = {
    "vector": [[1.1, 1.2], [0.2, 1.8]],
    "lat": [45.5, 40.1],
    "long": [-122.7, -74.1],
    "id": [1, 2],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    lance_path = os.path.join(tmp_dir, TABLE_NAME)
    arrow_table = pa.Table.from_pydict(data)
    lance.write_dataset(arrow_table, lance_path)
    daft_df = daft.read_lance(lance_path)

    # This works correctly
    daft_df.filter("id = 1").show(1)

    # This should show 1 row but shows none
    daft_df.filter("id = 2").show(1)

    # This works correctly when limit is larger than result count
    daft_df.filter("id = 2").show(2)
╭───────────────┬─────────┬─────────┬───────╮
│ vector        ┆ lat     ┆ long    ┆ id    │
│ ---           ┆ ---     ┆ ---     ┆ ---   │
│ List[Float64] ┆ Float64 ┆ Float64 ┆ Int64 │
╞═══════════════╪═════════╪═════════╪═══════╡
│ [1.1, 1.2]    ┆ 45.5    ┆ -122.7  ┆ 1     │
╰───────────────┴─────────┴─────────┴───────╯
(Showing first 1 rows)
╭───────────────┬─────────┬─────────┬───────╮
│ vector        ┆ lat     ┆ long    ┆ id    │
│ ---           ┆ ---     ┆ ---     ┆ ---   │
│ List[Float64] ┆ Float64 ┆ Float64 ┆ Int64 │
╞═══════════════╪═════════╪═════════╪═══════╡
╰───────────────┴─────────┴─────────┴───────╯
(No data to display: Materialized dataframe has no rows)
╭───────────────┬─────────┬─────────┬───────╮
│ vector        ┆ lat     ┆ long    ┆ id    │
│ ---           ┆ ---     ┆ ---     ┆ ---   │
│ List[Float64] ┆ Float64 ┆ Float64 ┆ Int64 │
╞═══════════════╪═════════╪═════════╪═══════╡
│ [0.2, 1.8]    ┆ 40.1    ┆ -74.1   ┆ 2     │
╰───────────────┴─────────┴─────────┴───────╯
Expected behavior
Expected output:
- All three calls should display the single matching row (id=1 or id=2), since every limit used is at least 1
- The second show(1) should display the row with id=2
Actual output:
- The first filter (id=1) with show(1) works correctly
- The second filter (id=2) with show(1) shows "No data to display"
- The third filter (id=2) with show(2) works correctly
Component(s)
Expressions
Additional context
This behavior only occurs when reading from Lance datasets. When using other data formats with Daft, the filtering and show() behavior works as expected, applying the filter first before limiting results in show().
Environment:
daft: latest version
lance: latest version
pyarrow: compatible version
It seems the limit parameter in show() is being applied before the filter when working with Lance datasets, which is the opposite of the expected behavior.
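To make the suspected ordering concrete, here is a plain-Python sketch that involves neither Daft nor Lance. The helper names `filter_then_limit` and `limit_then_filter` are hypothetical and only model the two possible orderings; they are not Daft's implementation. Run against the same two rows, the limit-first ordering reproduces the pattern reported above: it matches for id = 1 with limit 1, returns nothing for id = 2 with limit 1, and matches again for id = 2 with limit 2.

```python
# Rows mirror the two-row dataset from the reproduction (vector column omitted).
rows = [
    {"id": 1, "lat": 45.5, "long": -122.7},
    {"id": 2, "lat": 40.1, "long": -74.1},
]


def filter_then_limit(rows, predicate, n):
    """Expected semantics: keep matching rows, then take the first n."""
    return [r for r in rows if predicate(r)][:n]


def limit_then_filter(rows, predicate, n):
    """Suspected ordering: take the first n rows, then filter them."""
    return [r for r in rows[:n] if predicate(r)]


# Same three cases as the reproduction: (id to match, limit passed to show()).
for target, n in [(1, 1), (2, 1), (2, 2)]:
    pred = lambda r, t=target: r["id"] == t
    expected = filter_then_limit(rows, pred, n)
    suspected = limit_then_filter(rows, pred, n)
    print(
        f"id = {target}, limit {n}: "
        f"filter-first gives {len(expected)} row(s), "
        f"limit-first gives {len(suspected)} row(s)"
    )
```

Only the middle case differs between the two orderings, which is consistent with the limit being applied before the filter, though it does not by itself show where in the Lance read path that happens.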