[Lance] Filter followed by show() with limit behaves incorrectly when reading Lance datasets #5407

@SimonWan1029

Describe the bug

I'm encountering an issue where filtering a Daft DataFrame read from a Lance dataset and then using show() with a limit produces unexpected results. It appears that the limit is being applied before the filter, which is not the intended behavior.
Notably, this issue does NOT occur with other data formats like Arrow - only with Lance.

To Reproduce

  import tempfile
  import os
  import pyarrow as pa
  import lance
  import daft
  
  TABLE_NAME = "my_table"
  data = {
      "vector": [[1.1, 1.2], [0.2, 1.8]], 
      "lat": [45.5, 40.1], 
      "long": [-122.7, -74.1], 
      "id": [1, 2]
  }
  
  with tempfile.TemporaryDirectory() as tmp_dir:
      lance_path = os.path.join(tmp_dir, TABLE_NAME)
      
      arrow_table = pa.Table.from_pydict(data)
      lance.write_dataset(arrow_table, lance_path)
      daft_df = daft.read_lance(lance_path)
      
      # This works correctly
      daft_df.filter("id = 1").show(1)
      
      # This should show 1 row but shows none
      daft_df.filter("id = 2").show(1)
      
      # This works correctly when limit is larger than result count
      daft_df.filter("id = 2").show(2)


╭───────────────┬─────────┬─────────┬───────╮
│ vector        ┆ lat     ┆ long    ┆ id    │
│ ---           ┆ ---     ┆ ---     ┆ ---   │
│ List[Float64] ┆ Float64 ┆ Float64 ┆ Int64 │
╞═══════════════╪═════════╪═════════╪═══════╡
│ [1.1, 1.2]    ┆ 45.5    ┆ -122.7  ┆ 1     │
╰───────────────┴─────────┴─────────┴───────╯

(Showing first 1 rows)
╭───────────────┬─────────┬─────────┬───────╮
│ vector        ┆ lat     ┆ long    ┆ id    │
│ ---           ┆ ---     ┆ ---     ┆ ---   │
│ List[Float64] ┆ Float64 ┆ Float64 ┆ Int64 │
╞═══════════════╪═════════╪═════════╪═══════╡
╰───────────────┴─────────┴─────────┴───────╯

(No data to display: Materialized dataframe has no rows)
╭───────────────┬─────────┬─────────┬───────╮
│ vector        ┆ lat     ┆ long    ┆ id    │
│ ---           ┆ ---     ┆ ---     ┆ ---   │
│ List[Float64] ┆ Float64 ┆ Float64 ┆ Int64 │
╞═══════════════╪═════════╪═════════╪═══════╡
│ [0.2, 1.8]    ┆ 40.1    ┆ -74.1   ┆ 2     │
╰───────────────┴─────────┴─────────┴───────╯

Expected behavior

Expected output:

  • Each of the three filter operations should return its matching row (id=1 or id=2)
  • The second show(1) should display the row with id=2

Actual output:

  • The first filter (id=1) with show(1) works correctly
  • The second filter (id=2) with show(1) shows "No data to display"
  • The third filter (id=2) with show(2) works correctly

Component(s)

Expressions

Additional context

This behavior only occurs when reading from Lance datasets. When using other data formats with Daft, the filtering and show() behavior works as expected, applying the filter first before limiting results in show().

Environment:
daft: latest version
lance: latest version
pyarrow: compatible version

It seems the limit parameter in show() is being applied before the filter when working with Lance datasets, which is the opposite of the expected behavior.
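
To make the suspected ordering problem concrete, here is a minimal pure-Python sketch that uses neither Daft nor Lance. The helper names `filter_then_limit` and `limit_then_filter` are hypothetical and only model the two possible orderings. This illustrates the hypothesis and does not claim to show how Daft's Lance reader actually plans the query.

```python
from itertools import islice

# Same two rows as the reproduction, reduced to the columns that matter here.
rows = [
    {"id": 1, "lat": 45.5},
    {"id": 2, "lat": 40.1},
]


def filter_then_limit(rows, predicate, n):
    """Expected order: keep the matching rows, then take the first n."""
    return list(islice((r for r in rows if predicate(r)), n))


def limit_then_filter(rows, predicate, n):
    """Suspected order: take the first n rows, then keep the matching ones."""
    return [r for r in islice(rows, n) if predicate(r)]


def is_id_1(row):
    return row["id"] == 1


def is_id_2(row):
    return row["id"] == 2


# Mirrors filter("id = 2").show(1): the two orderings disagree.
print(filter_then_limit(rows, is_id_2, 1))
print(limit_then_filter(rows, is_id_2, 1))

# Mirrors filter("id = 2").show(2): a larger limit hides the difference.
print(limit_then_filter(rows, is_id_2, 2))
```

Under the limit-first ordering, the id=1 query with a limit of 1 still returns its row because that row happens to come first, the id=2 query with a limit of 1 returns nothing, and the id=2 query with a limit of 2 returns the row. That matches all three results reported above.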

@Jay-ju
