Migrate existing unstable dataset to stable dataset #5528
Replies: 2 comments 2 replies
cc @westonpace @wjones127 @jackye1995 Just raised the idea and some major steps. Looking forward to hearing your feedback. After that, I will draft the API design...
My only issue is this. I think we should put the effort into making the transaction safe. This should be pretty easy actually.
That should give you basic conflict resolution for this and prevent inconsistent state. The disadvantage of using |
Motivation and Goals
Today, stable row IDs in Lance must effectively be decided at dataset creation time. Once a dataset has been written without stable row IDs, there is no supported way to “upgrade” it in place so that future features (e.g., CDF and row lineage) can rely on stable row IDs.
This proposal introduces a metadata-only migration that converts an existing unstable dataset into a stable dataset. The major steps are:
[...] RowIdSequence) for every existing fragment.
[...] FLAG_STABLE_ROW_IDS in the feature flags.
[...] row_id_meta for all fragments.
[...] next_row_id value so future writes can continue allocating stable row IDs.
[...] enable_stable_row_ids = true from the beginning.

Key Constraints
Once we set FLAG_STABLE_ROW_IDS, the manifest invariants still apply: either all fragments have row_id_meta or none do. The migration must ensure that every fragment gets a valid RowIdSequence before flipping the flag.
The upgrade is done by creating a new manifest version. Older versions remain as-is (no stable row IDs); only the new version and onward use stable row IDs. There is no mixing of “some fragments with row IDs, some without” within a single manifest.
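To make the all-or-none invariant and the "new manifest version" rule concrete, here is a minimal Python sketch. It is not Lance's actual API: the Fragment and Manifest classes, the migrate and check_invariant helpers, the flag's bit value, and the use of a plain range as a stand-in for a RowIdSequence are all assumptions for illustration. It also assumes IDs are handed out as contiguous ranges in fragment order, which the proposal does not specify.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

# Hypothetical bit value, for illustration only.
FLAG_STABLE_ROW_IDS = 1 << 1


@dataclass(frozen=True)
class Fragment:
    id: int
    physical_rows: int
    # Stand-in for a RowIdSequence; None means "no stable row IDs".
    row_id_meta: Optional[range] = None


@dataclass(frozen=True)
class Manifest:
    version: int
    fragments: Tuple[Fragment, ...]
    feature_flags: int = 0
    next_row_id: int = 0


def check_invariant(manifest: Manifest) -> None:
    """Either all fragments carry row_id_meta or none do."""
    with_meta = sum(f.row_id_meta is not None for f in manifest.fragments)
    total = len(manifest.fragments)
    if with_meta not in (0, total):
        raise ValueError("manifest mixes fragments with and without row_id_meta")
    if manifest.feature_flags & FLAG_STABLE_ROW_IDS and with_meta != total:
        raise ValueError("flag is set but some fragment lacks row_id_meta")


def migrate(old: Manifest) -> Manifest:
    """Build a NEW manifest version; the old one is left untouched."""
    next_id = 0
    new_fragments = []
    for frag in old.fragments:
        # Assumed allocation scheme: one contiguous ID range per fragment.
        ids = range(next_id, next_id + frag.physical_rows)
        new_fragments.append(replace(frag, row_id_meta=ids))
        next_id += frag.physical_rows
    new = Manifest(
        version=old.version + 1,
        fragments=tuple(new_fragments),
        # The flag is flipped only after every fragment has a sequence.
        feature_flags=old.feature_flags | FLAG_STABLE_ROW_IDS,
        next_row_id=next_id,
    )
    check_invariant(new)
    return new


old = Manifest(version=3, fragments=(Fragment(0, 10), Fragment(1, 5)))
new = migrate(old)
```

Because the dataclasses are frozen, the old manifest cannot be mutated by accident, which mirrors the requirement that older versions remain as-is.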
The migration must not rewrite data files. It only:
To avoid inconsistent views, the migration assumes that no concurrent operations run while it is in progress.
Existing deletion files remain untouched. The migration ensures that the resulting RowIdIndex still maps only live rows, consistent with current deletion semantics.
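One way to picture "maps only live rows" is the sketch below. It assumes, purely for illustration, that each fragment's sequence holds one ID per physical row and that deletions are applied as a mask when the index is built; the build_row_id_index helper and its tuple-based input are hypothetical and do not reflect Lance's real RowIdIndex interface.

```python
from typing import Dict, Iterable, Sequence, Set, Tuple

# (fragment_id, row IDs in physical order, deleted physical offsets)
FragmentInfo = Tuple[int, Sequence[int], Set[int]]


def build_row_id_index(fragments: Iterable[FragmentInfo]) -> Dict[int, Tuple[int, int]]:
    """Map row ID -> (fragment_id, physical offset), skipping deleted rows."""
    index: Dict[int, Tuple[int, int]] = {}
    for frag_id, row_ids, deleted_offsets in fragments:
        for offset, row_id in enumerate(row_ids):
            if offset in deleted_offsets:
                continue  # deleted rows never become addressable
            index[row_id] = (frag_id, offset)
    return index


fragments = [
    (0, range(0, 4), {1}),    # physical offset 1 was deleted
    (1, range(4, 6), set()),  # no deletions
]
index = build_row_id_index(fragments)
```

In this sketch the deletion information is only read, never rewritten, so the existing deletion files stay untouched.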
Logical row IDs assigned during migration: