@Anton-Tarazi
Contributor

Resolves #2673

Rationale for this change

_SnapshotProducer._summary() copies the table metadata for every added / deleted DataFile, which is pretty expensive. Instead, we now copy it once at the beginning of the function and reuse that value for each DataFile.

On my workload, which overwrites a few million rows at a time, I saw the time for table.overwrite go from ~20 seconds to ~6 seconds.
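
For readers who want to see the shape of the change, here is a minimal sketch of hoisting a copy out of a per-file loop. It is not the actual PyIceberg code: `FakeMetadata`, `summarize_per_file_copy`, and `summarize_single_copy` are hypothetical names made up for illustration, and the data files are plain dicts.

```python
import copy


class FakeMetadata:
    """Hypothetical stand-in for table metadata that is costly to copy."""

    copy_count = 0  # counts deep copies so the two variants can be compared

    def __init__(self, schema):
        self.schema = schema

    def __deepcopy__(self, memo):
        FakeMetadata.copy_count += 1
        return FakeMetadata(copy.deepcopy(self.schema, memo))


def summarize_per_file_copy(metadata, data_files):
    """Before: the metadata is copied once for every data file."""
    added_records = 0
    for data_file in data_files:
        local = copy.deepcopy(metadata)  # repeated, expensive copy
        if local.schema:
            added_records += data_file["record_count"]
    return added_records


def summarize_single_copy(metadata, data_files):
    """After: the metadata is copied once and reused for every data file."""
    local = copy.deepcopy(metadata)  # single copy, hoisted out of the loop
    added_records = 0
    for data_file in data_files:
        if local.schema:
            added_records += data_file["record_count"]
    return added_records
```

Both functions return the same total; the only difference is that the number of copies no longer grows with the number of data files.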

Are these changes tested?

Yes, existing unit / integration tests

Are there any user-facing changes?

Just faster writes :)

@Fokko
Contributor

Fokko commented Nov 2, 2025

@Anton-Tarazi This makes a lot of sense to me, thanks for digging into this and providing the patch 🙌

@Fokko Fokko merged commit 1f9c46b into apache:main Nov 2, 2025
8 checks passed

Development

Successfully merging this pull request may close these issues.

_SnapshotProducer._summary() unreasonably slow