Description
The basic motivation is to integrate a lakehouse management platform like:
The page above comes from Apache Amoro (incubating), a multi-format lakehouse management system. I raised a discussion about integrating the Lance format there and got considerable feedback: apache/amoro#3668
In production scenarios, when users want to explore a table/dataset, a snapshot list with this basic summary information helps them to:
- Understand the basic write workload on the table/dataset (frequency, volume, sizes, etc.)
- Quickly get the target tag/version to rollback
- Know the trends
- Know the health
I think a snapshot/version list is a basic requirement for lakehouse management.
Solutions
Iceberg
For Iceberg, the snapshot list is stored in the metadata file together with all the necessary summary information. Requesting a page of the snapshot list costs O(1) file reads, since only the metadata file has to be read:
Paimon
Paimon doesn't have a root metadata file; it uses one snapshot file per snapshot to store the summary:
https://paimon.apache.org/docs/0.8/concepts/specification/#snapshot
The snapshot listing operation loads each snapshot file sequentially, which leads to a bad experience on snapshot pages. Amoro optimized this by loading the snapshot files in parallel.
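A minimal sketch of the parallel loading idea, using only the Rust standard library. This is not Amoro's or Paimon's actual code: `load_all_parallel` and the `load` callback (which stands in for reading and parsing one snapshot file) are hypothetical names introduced for illustration.

```rust
use std::thread;

/// Load one item per id using up to `workers` threads.
/// `load` is a hypothetical callback standing in for "read and parse one snapshot file".
/// Results are returned in the same order as `ids`.
pub fn load_all_parallel<T, F>(ids: &[u64], workers: usize, load: F) -> Vec<T>
where
    T: Send,
    F: Fn(u64) -> T + Sync,
{
    let workers = workers.max(1);
    // Ceiling division; at least 1 because `chunks(0)` would panic.
    let chunk_size = ((ids.len() + workers - 1) / workers).max(1);
    let load = &load;

    thread::scope(|s| {
        // Spawn one thread per contiguous chunk of ids.
        let handles: Vec<_> = ids
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().map(|&id| load(id)).collect::<Vec<T>>()))
            .collect();

        // Join in spawn order so the output order matches the input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("loader thread panicked"))
            .collect()
    })
}
```

With I/O-bound loads, the wall-clock cost drops from roughly one file read per snapshot to roughly one file read per chunk, but the total number of reads still grows with the number of snapshots.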
Delta
Delta doesn't have a root metadata file either. It uses transaction JSON files and checkpoint files to store snapshots. Amoro has no plan to integrate Delta yet.
Lance
For Lance, I think it is more similar to Paimon. There is already a version listing operation on the dataset:
pub struct Version {
    /// version number
    pub version: u64,
    /// Timestamp of dataset creation in UTC.
    pub timestamp: DateTime<Utc>,
    /// Key-value pairs of metadata.
    pub metadata: BTreeMap<String, String>,
}

/// Convert Manifest to Data Version.
impl From<&Manifest> for Version {
    fn from(m: &Manifest) -> Self {
        Self {
            version: m.version,
            timestamp: m.timestamp(),
            metadata: BTreeMap::default(),
        }
    }
}
However, I have two concerns about this version listing operation:
- For now, the metadata field is always empty. We would expect a summary map like the ones in Iceberg and Paimon. I think this could be populated in the impl From<&Manifest> shown above.
- Under the hood, the list operation lists the files under dataset_root/.manifest and loads each manifest file in parallel. A practical problem is that we may need a range parameter for listing versions, for the case where there are too many versions.
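To make the two points concrete, here is a rough, self-contained sketch. None of these names are existing Lance APIs: `ManifestSummarySource`, `VersionSummary`, `list_versions_in_range`, and the summary keys `num_fragments` and `num_rows` are all assumptions made for illustration, and the timestamp is a plain integer to avoid external crates.

```rust
use std::collections::BTreeMap;
use std::ops::RangeInclusive;

/// Hypothetical stand-in for the manifest fields a summary could draw on.
#[derive(Debug, Clone)]
pub struct ManifestSummarySource {
    pub version: u64,
    pub timestamp_millis: i64,
    pub num_fragments: u64,
    pub num_rows: u64,
}

/// Hypothetical counterpart of `Version` with a populated summary map.
#[derive(Debug, Clone, PartialEq)]
pub struct VersionSummary {
    pub version: u64,
    pub timestamp_millis: i64,
    pub metadata: BTreeMap<String, String>,
}

impl From<&ManifestSummarySource> for VersionSummary {
    fn from(m: &ManifestSummarySource) -> Self {
        // Fill the summary map instead of leaving it empty (keys are assumptions).
        let mut metadata = BTreeMap::new();
        metadata.insert("num_fragments".to_string(), m.num_fragments.to_string());
        metadata.insert("num_rows".to_string(), m.num_rows.to_string());
        Self {
            version: m.version,
            timestamp_millis: m.timestamp_millis,
            metadata,
        }
    }
}

/// List only the versions inside `range`.
/// `available` is the set of version numbers found by listing the manifest files;
/// `load` is a hypothetical callback that reads one manifest.
pub fn list_versions_in_range(
    available: &[u64],
    range: RangeInclusive<u64>,
    load: impl Fn(u64) -> ManifestSummarySource,
) -> Vec<VersionSummary> {
    // Filter by version number first, so manifests outside the range are never read.
    let mut selected: Vec<u64> = available
        .iter()
        .copied()
        .filter(|v| range.contains(v))
        .collect();
    selected.sort_unstable();
    selected
        .iter()
        .map(|&v| VersionSummary::from(&load(v)))
        .collect()
}
```

The point of the range parameter in this sketch is that filtering happens on version numbers before any manifest is opened, so a page request reads only as many manifests as it returns.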
What do you think?
@jackye1995 @westonpace