feat(core): Add support for `_file` column #1824

gbrgr · 2025-11-04T09:06:43Z

Which issue does this PR close?

Closes #1766 (Support for _file metadata column).

What changes are included in this PR?

This PR adds support for the _file reserved column in Iceberg table scans, allowing users to retrieve the file path for each row in their query results. This is useful for debugging, auditing, and tracking data lineage.

Core Implementation

Reserved column constants (reader.rs):
- Added `RESERVED_FIELD_ID_FILE = 2147483646`, the reserved field ID per the Iceberg spec
- Added `RESERVED_COL_NAME_FILE = "_file"`, the column name per the Iceberg spec

Public API (scan/mod.rs):
- Exposed `RESERVED_COL_NAME_FILE` as a public constant in the `iceberg::scan` module

Key Features

- Column ordering preserved: If a user selects `["col1", "_file", "col2"]`, the `_file` column appears in the requested position (second)
- Memory efficient: Uses Run-End Encoding (REE), so the file path is stored once per batch rather than once per row
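
To illustrate the memory argument, here is a minimal, self-contained sketch of run-end encoding for a constant column. It is a conceptual model only: `RunEndEncoded` and its methods are hypothetical names made up for this example, and they are neither the arrow-rs types nor the helper added in this PR.

```rust
/// Conceptual run-end-encoded string column (illustration only, not arrow-rs).
/// `values[i]` applies to every logical row index that is below `run_ends[i]`
/// and at or above the previous run end.
pub struct RunEndEncoded {
    run_ends: Vec<i32>,
    values: Vec<String>,
}

impl RunEndEncoded {
    /// Encode a batch in which all `num_rows` rows share one file path.
    pub fn constant(file_path: &str, num_rows: i32) -> Self {
        if num_rows <= 0 {
            // An empty batch has no runs at all.
            return Self { run_ends: Vec::new(), values: Vec::new() };
        }
        Self {
            run_ends: vec![num_rows],
            values: vec![file_path.to_string()],
        }
    }

    /// Number of logical rows represented by the column.
    pub fn len(&self) -> usize {
        self.run_ends.last().map_or(0, |&end| end as usize)
    }

    /// Number of values physically stored.
    pub fn physical_len(&self) -> usize {
        self.values.len()
    }

    /// Look up the value for a logical row index.
    pub fn get(&self, row: usize) -> Option<&str> {
        if row >= self.len() {
            return None;
        }
        // Index of the first run whose end is strictly greater than `row`.
        let run = self.run_ends.partition_point(|&end| (end as usize) <= row);
        self.values.get(run).map(String::as_str)
    }
}
```

For a batch of 1024 rows read from a single data file, `RunEndEncoded::constant(path, 1024)` keeps one string and one run end, while `get` still answers for every logical row.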

Usage Example

use iceberg::scan::RESERVED_COL_NAME_FILE;

let scan = table
    .scan()
    .select(["id", "name", RESERVED_COL_NAME_FILE])
    .build()?;

let batches = scan.to_arrow().await?;
// Each batch will have 3 columns: id, name, and _file

Are these changes tested?

Yes, comprehensive tests have been added to verify the functionality:

New Tests (9 tests added)

Table Scan API Tests (7 tests)

- `test_select_with_file_column` - Verifies basic functionality of selecting `_file` with regular columns
- `test_select_file_column_position` - Verifies column ordering is preserved
- `test_select_file_column_only` - Tests selecting only the `_file` column
- `test_file_column_with_multiple_files` - Tests multiple data files scenario
- `test_file_column_at_start` - Tests `_file` at position 0
- `test_file_column_at_end` - Tests `_file` at the last position
- `test_select_with_repeated_column_names` - Tests repeated column selection

Arrow Reader Helper Tests (2 tests)

- `test_add_file_path_column_ree` - Tests the `add_file_path_column_ree_at_position()` helper
- `test_add_file_path_column_ree_empty_batch` - Tests empty batch handling

vustef · 2025-11-04T11:15:37Z

crates/iceberg/src/scan/mod.rs

+/// // Select regular columns along with the file path
+/// let scan = table
+///     .scan()
+///     .select(["id", "name", RESERVED_COL_NAME_FILE])


How do we ask for the _file column without having to explicitly list all the other columns? E.g. "get me all columns + _file". There should be some shortcut for this.
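
One possible shape for such a shortcut, sketched under assumptions: `all_columns_with_file` is a hypothetical helper that does not exist in iceberg-rust, and the caller is assumed to already hold the schema's top-level column names as a slice.

```rust
/// Column name for the file path metadata column (value taken from this PR).
pub const RESERVED_COL_NAME_FILE: &str = "_file";

/// Hypothetical helper: return every schema column followed by `_file`.
/// If the list already contains `_file`, it is returned unchanged.
pub fn all_columns_with_file<'a>(schema_columns: &[&'a str]) -> Vec<&'a str> {
    let mut selected: Vec<&'a str> = schema_columns.to_vec();
    if !selected.contains(&RESERVED_COL_NAME_FILE) {
        selected.push(RESERVED_COL_NAME_FILE);
    }
    selected
}
```

The returned list could then be handed to `.select(...)` in the same way as the explicit list in the doc example above.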

liurenjie1024

Thanks @gbrgr for this PR. But I think we need to rethink how to compute the _file and _pos metadata columns. While it's fairly trivial to compute _file, it's non-trivial to compute _pos efficiently, since some row groups have already been filtered out when we read Parquet files. I think the best way is to push reading these two columns down to arrow-rs.
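
The difficulty with `_pos` can be shown with a small sketch. It assumes hypothetical inputs (the row count of each row group in the file, plus a flag saying whether the reader kept that group); `first_row_positions` is an illustrative name, not an existing function. The point is that a surviving row's position depends on the sizes of the skipped groups too, so it cannot be recovered by counting only the rows that were actually read.

```rust
/// For each selected row group, return the file-level position of its first row.
/// `row_group_sizes[i]` is the number of rows in group `i`;
/// `selected[i]` says whether the reader kept that group after filtering.
pub fn first_row_positions(row_group_sizes: &[i64], selected: &[bool]) -> Vec<i64> {
    let mut offset = 0i64;
    let mut starts = Vec::new();
    for (size, keep) in row_group_sizes.iter().zip(selected) {
        if *keep {
            starts.push(offset);
        }
        // Skipped groups still advance the position counter.
        offset += size;
    }
    starts
}
```

With group sizes `[100, 50, 200]` and the middle group filtered out, the third group starts at position 150, whereas a counter over the rows actually read would say 100.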

liurenjie1024 · 2025-11-06T10:15:04Z

crates/iceberg/src/arrow/reader.rs

+pub(crate) const RESERVED_FIELD_ID_FILE: i32 = 2147483646;
+
+/// Column name for the file path metadata column per Iceberg spec
+pub(crate) const RESERVED_COL_NAME_FILE: &str = "_file";


I think we should add a metadata_columns module, similar to what the Java implementation has: https://github.com/apache/iceberg/blob/bb32b90c4ad63f037f0bda197cc53fb08c886934/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L2

vustef · 2025-11-06T10:23:35Z

> Thanks @gbrgr for this PR. But I think we need to rethink how to compute the _file and _pos metadata columns. While it's fairly trivial to compute _file, it's non-trivial to compute _pos efficiently, since some row groups have already been filtered out when we read Parquet files. I think the best way is to push reading these two columns down to arrow-rs.

@liurenjie1024 I agree for _pos, and we have a PR for that: apache/arrow-rs#8715
But _file seems like something that arrow-rs doesn't need to know about. Similarly, in the future, we cannot expect arrow-rs to be responsible for computing _row_id from the V3 spec.

How do we go forward with rethinking this, what would be the action items for us?

vustef · 2025-11-07T13:34:15Z

crates/iceberg/src/arrow/metadata_column_transformer.rs

+// Reserved field IDs and names for metadata columns
+
+/// Reserved field ID for the file path (_file) column per Iceberg spec
+pub(crate) const RESERVED_FIELD_ID_FILE: i32 = 2147483646;


nit: Can we move consts to the top of the file?

liurenjie1024

Thanks @gbrgr for this PR. I left some comments with suggested improvements.

liurenjie1024 · 2025-11-10T01:15:32Z

crates/iceberg/src/scan/mod.rs

+/// # Ok(())
+/// # }
+/// ```
+pub const RESERVED_COL_NAME_FILE: &str = RESERVED_COL_NAME_FILE_INTERNAL;


We will have more metadata columns, so I would prefer to put these definitions in something like a metadata_columns module.

liurenjie1024 · 2025-11-10T01:22:27Z

crates/iceberg/src/scan/mod.rs

        if let Some(column_names) = self.column_names.as_ref() {
            for column_name in column_names {
+                // Skip reserved columns that don't exist in the schema
+                if column_name == RESERVED_COL_NAME_FILE_INTERNAL {


We should have something like is_metadata_column_name() in the metadata_columns module and use it here, so that we can avoid such changes when we add more metadata columns.
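
A minimal sketch of what such a module might look like. The module layout and the function names `is_metadata_column_name` and `metadata_field_id` are assumptions for illustration. The `_file` constants are the ones from this PR; `_pos` and its reserved ID come from the Iceberg spec and are included only to show how a second column would slot in, since this PR does not implement it.

```rust
/// Hypothetical `metadata_columns` module collecting reserved column definitions.
pub mod metadata_columns {
    /// Reserved field ID for the file path (`_file`) column per the Iceberg spec.
    pub const RESERVED_FIELD_ID_FILE: i32 = 2147483646;
    /// Column name for the file path metadata column.
    pub const RESERVED_COL_NAME_FILE: &str = "_file";

    /// Reserved field ID for the row position (`_pos`) column per the Iceberg spec.
    /// Not implemented by this PR; listed only to show extensibility.
    pub const RESERVED_FIELD_ID_POS: i32 = 2147483645;
    /// Column name for the row position metadata column.
    pub const RESERVED_COL_NAME_POS: &str = "_pos";

    /// Map a metadata column name to its reserved field ID, if it is one.
    pub fn metadata_field_id(name: &str) -> Option<i32> {
        match name {
            RESERVED_COL_NAME_FILE => Some(RESERVED_FIELD_ID_FILE),
            RESERVED_COL_NAME_POS => Some(RESERVED_FIELD_ID_POS),
            _ => None,
        }
    }

    /// True if `name` is a reserved metadata column rather than a schema column.
    pub fn is_metadata_column_name(name: &str) -> bool {
        metadata_field_id(name).is_some()
    }
}
```

With a predicate like this, the loop over `column_names` in scan/mod.rs could skip every metadata column with a single check, and adding a new column would only touch the `match`.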

liurenjie1024 · 2025-11-10T01:32:32Z

crates/iceberg/src/arrow/reader.rs


+    /// Helper function to add a `_file` column to a RecordBatch at a specific position.
+    /// Takes the array, field to add, and position where to insert.
+    fn create_file_field_at_position(


I don't think this approach is extensible. I would prefer something similar to what is done in this PR:

- Add constant_map for ArrowReader
- Add another variant of RecordBatchTransformer to handle constant fields like _file
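
The idea of a constant-field variant can be sketched roughly as follows. This is a toy model under stated assumptions: `ColumnSource`, `Batch`, and `transform` are hypothetical names, columns are plain string vectors rather than Arrow arrays, and none of this is the actual `RecordBatchTransformer` API.

```rust
use std::collections::HashMap;

/// Where each output column comes from (hypothetical, for illustration).
pub enum ColumnSource {
    /// Copy the column at this index of the input batch.
    PassThrough(usize),
    /// Fill every row with the constant registered for this field ID.
    Constant(i32),
}

/// Toy stand-in for a record batch: equally long columns of strings.
pub type Batch = Vec<Vec<String>>;

/// Build the output batch by following `plan` column by column.
/// Returns `None` if the plan refers to a missing input column or constant.
pub fn transform(
    input: &Batch,
    num_rows: usize,
    plan: &[ColumnSource],
    constants: &HashMap<i32, String>,
) -> Option<Batch> {
    plan.iter()
        .map(|source| match source {
            ColumnSource::PassThrough(idx) => input.get(*idx).cloned(),
            ColumnSource::Constant(field_id) => constants
                .get(field_id)
                .map(|value| vec![value.clone(); num_rows]),
        })
        .collect()
}
```

Because the plan is an ordered list, a constant column such as `_file` lands wherever the user selected it, and supporting another constant metadata column means adding an entry to the map rather than another insertion helper.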

liurenjie1024 · 2025-11-10T01:34:46Z

> > Thanks @gbrgr for this PR. But I think we need to rethink how to compute the _file and _pos metadata columns. While it's fairly trivial to compute _file, it's non-trivial to compute _pos efficiently, since some row groups have already been filtered out when we read Parquet files. I think the best way is to push reading these two columns down to arrow-rs.
>
> @liurenjie1024 I agree for _pos, and we have a PR for that: apache/arrow-rs#8715 But _file seems like something that arrow-rs doesn't need to know about. Similarly, in the future, we cannot expect arrow-rs to be responsible for computing _row_id from the V3 spec.
>
> How do we go forward with rethinking this, what would be the action items for us?

Hi @vustef, I also agree that we should handle _file in iceberg-rust, and I left some comments about how to proceed.

gbrgr and others added 8 commits November 4, 2025 08:38

Add REE file column helpers

aab78d6

Add helper tests

ee21cab

Add constants

37b52e2

Add support for _file constant

44463a0

Merge branch 'main' into feature/gb/file-column

b5449f6

Update tests

e034009

Fix clippy warning

4f0a4f1

Fix doc test

51f76d3

vustef reviewed Nov 4, 2025

gbrgr and others added 2 commits November 4, 2025 12:52

Track in field ids

d84e16b

Merge branch 'main' into feature/gb/file-column

984dacd

gbrgr changed the title ~~Add support for _file column~~ feat(core): Add support for _file column Nov 4, 2025

gbrgr added 2 commits November 4, 2025 15:32

Add test

bd478cb

Merge branch 'feature/gb/file-column' of github.com:RelationalAI/iceb…

8593db0

…erg-rust into feature/gb/file-column

gbrgr marked this pull request as ready for review November 4, 2025 14:32

gbrgr and others added 2 commits November 4, 2025 16:04

Allow repeated virtual file column selection

9b186c7

Merge branch 'main' into feature/gb/file-column

30ae5fb

alamb mentioned this pull request Nov 5, 2025

Support extended partition cols for listing table. apache/datafusion#18482

Open

liurenjie1024 requested changes Nov 6, 2025

liurenjie1024 mentioned this pull request Nov 6, 2025

General support for metadata columns + implementation for _pos #1791

Draft

gbrgr added 2 commits November 7, 2025 14:03

Refactor into own transformer step

adf0da0

Merge branch 'feature/gb/file-column' of github.com:RelationalAI/iceb…

f4336a8

…erg-rust into feature/gb/file-column

vustef reviewed Nov 7, 2025

gbrgr added 3 commits November 7, 2025 15:04

Revert "Refactor into own transformer step"

ef3a965

This reverts commit adf0da0.

Avoid special casing in batch creation

534490b

.

04bf463

liurenjie1024 reviewed Nov 10, 2025
