How to get a snapshot list with some summary properties #4337

@majin1102

Description

The basic motivation is to integrate Lance with a lakehouse management platform such as:

Image

The page above comes from Apache Amoro (incubating), a multi-format lakehouse management system. I have raised a discussion about integrating the Lance format and received considerable feedback: apache/amoro#3668

In production scenarios, when users want to explore a table/dataset, a snapshot list with this basic summary information would help them to:

  1. Understand the basic write workload on the table/dataset (frequency, volume, sizes, etc.)
  2. Quickly find the target tag/version to roll back to
  3. See the trends
  4. Assess the health

I think a snapshot/version list is a basic requirement for lakehouse management.

Solutions

Iceberg

For Iceberg, the snapshot list is stored in the metadata file together with all the necessary summary information. Serving a page of the snapshot list therefore costs a single metadata file read, i.e. O(1) file reads:

Image

Paimon

Paimon doesn't have a root metadata file; it uses one snapshot file per snapshot to store the summary:
https://paimon.apache.org/docs/0.8/concepts/specification/#snapshot

The list-snapshots operation loads each snapshot file sequentially, which leads to a poor experience on snapshot pages. Amoro optimized this by loading the files in parallel.
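To illustrate the general idea of parallel loading (this is not Amoro's or Paimon's actual code), here is a minimal Rust sketch using only the standard library. `SnapshotSummary`, `load_snapshots_parallel`, and the `load` callback are hypothetical names; the callback stands in for reading and parsing one snapshot file.

```rust
use std::thread;

/// Hypothetical per-snapshot summary; the fields are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct SnapshotSummary {
    pub id: u64,
    pub record_count: u64,
}

/// Calls `load` for every snapshot id on its own scoped thread and
/// returns the summaries in the same order as `ids`.
pub fn load_snapshots_parallel<F>(ids: &[u64], load: F) -> Vec<SnapshotSummary>
where
    F: Fn(u64) -> SnapshotSummary + Sync,
{
    // Share the loader by reference so every thread can call it.
    let load = &load;
    thread::scope(|s| {
        // Spawn all loads first so they run concurrently.
        let handles: Vec<_> = ids
            .iter()
            .map(|&id| s.spawn(move || load(id)))
            .collect();
        // Join in spawn order, which preserves the input order.
        handles
            .into_iter()
            .map(|h| h.join().expect("loader thread panicked"))
            .collect()
    })
}
```

One thread per snapshot keeps the sketch short; a real implementation would probably bound the concurrency, for example with a thread pool or an async runtime.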

Delta

Delta doesn't have a root metadata file either. It uses transaction JSON files and checkpoint files to store snapshots. Amoro has no plan to integrate Delta yet.

Lance

For Lance, I think it is most similar to Paimon, and there is already a list-versions operation on the dataset:

pub struct Version {
    /// version number
    pub version: u64,

    /// Timestamp of dataset creation in UTC.
    pub timestamp: DateTime<Utc>,

    /// Key-value pairs of metadata.
    pub metadata: BTreeMap<String, String>,
}

/// Convert Manifest to Data Version.
impl From<&Manifest> for Version {
    fn from(m: &Manifest) -> Self {
        Self {
            version: m.version,
            timestamp: m.timestamp(),
            metadata: BTreeMap::default(),
        }
    }
}

However, there are two concerns with this list-versions operation:

  1. For now, the metadata field is always empty. We expect a summary map like the ones in Iceberg and Paimon. I think this could be populated in the impl From<&Manifest> shown above.
  2. Under the hood, the list operation lists the files under dataset_root/.manifest and loads each manifest file in parallel. A practical problem is that we may need a range parameter for listing versions, for cases where there are too many versions.
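As a rough sketch of both points, the following self-contained Rust example fills the metadata map during conversion and restricts listing to a version range. Everything here is an assumption for illustration: this `Manifest` is a simplified stand-in with made-up counter fields (not Lance's real type), the timestamp is a plain integer instead of `DateTime<Utc>`, and `list_versions_in_range` and the summary keys are hypothetical.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Simplified, hypothetical stand-in for Lance's `Manifest`.
/// The counter fields are assumptions, not the real layout.
pub struct Manifest {
    pub version: u64,
    pub timestamp_millis: u64,
    pub total_rows: u64,
    pub total_fragments: u64,
    pub total_bytes: u64,
}

pub struct Version {
    pub version: u64,
    pub timestamp_millis: u64,
    /// Summary key-value pairs (hypothetical keys).
    pub metadata: BTreeMap<String, String>,
}

impl From<&Manifest> for Version {
    fn from(m: &Manifest) -> Self {
        // Populate the summary map instead of leaving it empty.
        let mut metadata = BTreeMap::new();
        metadata.insert("total_rows".to_string(), m.total_rows.to_string());
        metadata.insert("total_fragments".to_string(), m.total_fragments.to_string());
        metadata.insert("total_bytes".to_string(), m.total_bytes.to_string());
        Self {
            version: m.version,
            timestamp_millis: m.timestamp_millis,
            metadata,
        }
    }
}

/// Hypothetical ranged listing: `available` holds the version numbers found
/// by listing the manifest directory, and `load` reads one manifest.
/// Only versions inside `range` are loaded at all.
pub fn list_versions_in_range<F>(available: &[u64], range: Range<u64>, load: F) -> Vec<Version>
where
    F: Fn(u64) -> Manifest,
{
    available
        .iter()
        .copied()
        .filter(|v| range.contains(v))
        .map(|v| Version::from(&load(v)))
        .collect()
}
```

Filtering on the version number before calling `load` is the point of the range parameter: a page of the version list then costs only as many manifest reads as there are versions in the requested range.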

What do you think?
@jackye1995 @westonpace
