@gbrgr gbrgr commented Nov 4, 2025

Which issue does this PR close?

What changes are included in this PR?

This PR adds support for the _file reserved column in Iceberg table scans, allowing users to retrieve the file path for each row in their query results. This is useful for debugging, auditing, and tracking data lineage.

Core Implementation

  1. Reserved column constants (reader.rs):

    • Added RESERVED_FIELD_ID_FILE = 2147483646 - reserved field ID per Iceberg spec
    • Added RESERVED_COL_NAME_FILE = "_file" - column name per Iceberg spec
  2. Public API (scan/mod.rs):

    • Exposed RESERVED_COL_NAME_FILE as a public constant in the iceberg::scan module

Key Features

  • Column ordering preserved: If a user selects ["col1", "_file", "col2"], the _file column appears in the correct position (second)
  • Memory efficient: Uses Run-End Encoding (REE) to store the repeated file path value across all rows in a batch
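To illustrate the two features above, here is a minimal, std-only sketch. It is not the arrow-rs run-end encoded array and not the code in this PR: `RunEndEncoded`, `constant`, `get`, and `file_column_position` are hypothetical names used only to show why a constant file path costs one stored value per batch, and how the position of `_file` in the selection can be recovered.

```rust
/// Hand-rolled run-end encoded string column (illustration only, not the arrow-rs type).
#[derive(Debug, PartialEq)]
struct RunEndEncoded {
    /// Exclusive end index of each run.
    run_ends: Vec<i32>,
    /// One stored value per run.
    values: Vec<String>,
}

impl RunEndEncoded {
    /// Encode `num_rows` copies of `value` as a single run.
    /// An empty batch produces no runs at all.
    fn constant(value: &str, num_rows: usize) -> Self {
        if num_rows == 0 {
            return Self { run_ends: Vec::new(), values: Vec::new() };
        }
        Self {
            run_ends: vec![num_rows as i32],
            values: vec![value.to_string()],
        }
    }

    /// Logical number of rows represented by the runs.
    fn len(&self) -> usize {
        self.run_ends.last().map_or(0, |&end| end as usize)
    }

    /// Logical value at `row`, found by binary search over the run ends.
    fn get(&self, row: usize) -> Option<&str> {
        let idx = self.run_ends.partition_point(|&end| (end as usize) <= row);
        self.values.get(idx).map(String::as_str)
    }
}

/// Index of `_file` within the user's selection, if it was requested.
fn file_column_position(selected: &[&str]) -> Option<usize> {
    selected.iter().position(|name| *name == "_file")
}
```

With this sketch, a 1024-row batch from one data file stores the path once, and a selection of `["col1", "_file", "col2"]` places `_file` at index 1.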

Usage Example

use iceberg::scan::RESERVED_COL_NAME_FILE;

let scan = table
    .scan()
    .select(["id", "name", RESERVED_COL_NAME_FILE])
    .build()?;

let batches = scan.to_arrow().await?;
// Each batch will have 3 columns: id, name, and _file

Are these changes tested?

Yes, comprehensive tests have been added to verify the functionality:

New Tests (9 tests added)

Table Scan API Tests (7 tests)

  1. test_select_with_file_column - Verifies basic functionality of selecting _file with regular columns
  2. test_select_file_column_position - Verifies column ordering is preserved
  3. test_select_file_column_only - Tests selecting only the _file column
  4. test_file_column_with_multiple_files - Tests multiple data files scenario
  5. test_file_column_at_start - Tests _file at position 0
  6. test_file_column_at_end - Tests _file at the last position
  7. test_select_with_repeated_column_names - Tests repeated column selection

Arrow Reader Helper Tests (2 tests)

  1. test_add_file_path_column_ree - Tests the add_file_path_column_ree_at_position() helper
  2. test_add_file_path_column_ree_empty_batch - Tests empty batch handling

/// // Select regular columns along with the file path
/// let scan = table
///     .scan()
///     .select(["id", "name", RESERVED_COL_NAME_FILE])

How do we ask for the _file column without having to explicitly list all the other columns? E.g. "give me all columns plus _file". There should be some shortcut for this.

@gbrgr gbrgr changed the title from "Add support for _file column" to "feat(core): Add support for _file column" on Nov 4, 2025
@gbrgr gbrgr marked this pull request as ready for review on November 4, 2025 14:32
Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @gbrgr for this PR. But I think we need to rethink how to compute the _file and _pos metadata columns. While it's fairly trivial to compute _file, it's non-trivial to compute _pos efficiently, since some row groups have already been filtered out by the time we read the parquet files. I think the best way is to push reading these two columns down to arrow-rs.
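A small std-only sketch of why _pos is harder than _file once row groups are skipped: a row's position is its index in the whole file, so it depends on the row counts of every earlier row group, including the ones that were filtered out. The function names and the example row counts are hypothetical, and this is not how arrow-rs or this PR computes the value.

```rust
/// Starting file-level row position of each row group, given per-group row counts.
fn row_group_offsets(row_counts: &[u64]) -> Vec<u64> {
    let mut offsets = Vec::with_capacity(row_counts.len());
    let mut next = 0u64;
    for &count in row_counts {
        offsets.push(next);
        next += count;
    }
    offsets
}

/// File-level positions of the rows produced when only the `selected`
/// row groups are read. Panics if a selected index is out of range.
fn positions_for_selected(row_counts: &[u64], selected: &[usize]) -> Vec<u64> {
    let offsets = row_group_offsets(row_counts);
    selected
        .iter()
        .flat_map(|&rg| offsets[rg]..offsets[rg] + row_counts[rg])
        .collect()
}
```

A plain running counter over the rows actually read would number the second selected group starting right after the first, which is wrong whenever a group in between was skipped.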

pub(crate) const RESERVED_FIELD_ID_FILE: i32 = 2147483646;

/// Column name for the file path metadata column per Iceberg spec
pub(crate) const RESERVED_COL_NAME_FILE: &str = "_file";

vustef commented Nov 6, 2025

@liurenjie1024 I agree about _pos, and we have a PR for that: apache/arrow-rs#8715
But _file seems like something that arrow-rs doesn't need to know about. Similarly, in the future, we cannot expect arrow-rs to be responsible for computing _row_id from the V3 spec.

How do we go forward with rethinking this, what would be the action items for us?

// Reserved field IDs and names for metadata columns

/// Reserved field ID for the file path (_file) column per Iceberg spec
pub(crate) const RESERVED_FIELD_ID_FILE: i32 = 2147483646;

nit: Can we move consts to the top of the file?

Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @gbrgr for this PR. I left some comments with suggested improvements.

/// # Ok(())
/// # }
/// ```
pub const RESERVED_COL_NAME_FILE: &str = RESERVED_COL_NAME_FILE_INTERNAL;
Contributor

We will have more metadata columns, so I would prefer to put these definitions in something like a metadata_columns module.

if let Some(column_names) = self.column_names.as_ref() {
    for column_name in column_names {
        // Skip reserved columns that don't exist in the schema
        if column_name == RESERVED_COL_NAME_FILE_INTERNAL {
Contributor

We should have something like is_metadata_column_name() in the metadata_columns module and use it here, so that we can avoid such changes when we add more metadata columns.
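One possible shape for that suggestion, as a std-only sketch. The module layout, the lookup table, `metadata_field_id`, and `schema_column_names` are assumptions made for illustration; only the constant names, their values, and the `is_metadata_column_name` name come from this conversation.

```rust
/// Hypothetical `metadata_columns` module collecting reserved column definitions.
mod metadata_columns {
    /// Reserved field ID for the `_file` column.
    pub const RESERVED_FIELD_ID_FILE: i32 = 2147483646;
    /// Column name for the file path metadata column.
    pub const RESERVED_COL_NAME_FILE: &str = "_file";

    /// (name, field ID) pairs. New metadata columns would be appended here.
    const METADATA_COLUMNS: &[(&str, i32)] =
        &[(RESERVED_COL_NAME_FILE, RESERVED_FIELD_ID_FILE)];

    /// Returns true if `name` is a reserved metadata column name.
    pub fn is_metadata_column_name(name: &str) -> bool {
        METADATA_COLUMNS.iter().any(|(n, _)| *n == name)
    }

    /// Returns the reserved field ID for a metadata column name, if any.
    pub fn metadata_field_id(name: &str) -> Option<i32> {
        METADATA_COLUMNS
            .iter()
            .find(|(n, _)| *n == name)
            .map(|(_, id)| *id)
    }
}

/// Keeps only the requested names that must be resolved against the table schema.
fn schema_column_names<'a>(requested: &[&'a str]) -> Vec<&'a str> {
    requested
        .iter()
        .copied()
        .filter(|name| !metadata_columns::is_metadata_column_name(name))
        .collect()
}
```

With a table like this, the scan-building loop could call `is_metadata_column_name` instead of comparing against a single constant, so adding another metadata column would only touch the table.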


/// Helper function to add a `_file` column to a RecordBatch at a specific position.
/// Takes the array, field to add, and position where to insert.
fn create_file_field_at_position(
Contributor

I think this approach is not extensible. I would prefer something similar to what is done in this PR:

  1. Add constant_map for ArrowReader
  2. Add another variant of RecordBatchTransformer to handle constant field like _file
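A rough std-only sketch of how the two steps above could fit together. It is not the code from the referenced PR and does not use the real ArrowReader or RecordBatchTransformer types: `Batch`, `ConstantMap`, `ColumnSource`, and `transform` are hypothetical stand-ins, with string cells in place of Arrow arrays.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a record batch: named columns of string cells.
#[derive(Debug, Clone)]
struct Batch {
    columns: Vec<(String, Vec<String>)>,
    num_rows: usize,
}

/// Hypothetical constant map: field ID -> value that is constant for one data file.
type ConstantMap = HashMap<i32, String>;

/// How each output column is produced, listed in the user's requested order.
enum ColumnSource {
    /// Copy the column at this index from the source batch.
    PassThrough(usize),
    /// Materialize the constant registered for this field ID.
    Constant { name: String, field_id: i32 },
}

/// Builds the output batch by walking the plan in order, so constant
/// columns land at whatever position the user asked for.
fn transform(
    batch: &Batch,
    plan: &[ColumnSource],
    constants: &ConstantMap,
) -> Result<Batch, String> {
    let mut columns = Vec::with_capacity(plan.len());
    for source in plan {
        let column = match source {
            ColumnSource::PassThrough(idx) => batch
                .columns
                .get(*idx)
                .cloned()
                .ok_or_else(|| format!("no source column at index {}", idx))?,
            ColumnSource::Constant { name, field_id } => {
                let value = constants
                    .get(field_id)
                    .ok_or_else(|| format!("no constant for field id {}", field_id))?;
                // Repeat the constant once per row of the source batch.
                (name.clone(), vec![value.clone(); batch.num_rows])
            }
        };
        columns.push(column);
    }
    Ok(Batch { columns, num_rows: batch.num_rows })
}
```

In this shape, supporting another constant metadata column would mean adding an entry to the map rather than writing a new position-specific helper.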

@liurenjie1024
Contributor

Hi @vustef, I also agree that we should put _file in iceberg-rust, and I left some comments about how to proceed.

Development

Successfully merging this pull request may close these issues.

Support for _file metadata column

3 participants