-
-
Notifications
You must be signed in to change notification settings - Fork 33.5k
GH-140638: Add a GC "visited" stat #141814
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
pablogsal
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 🚀
I assumed that visited is not initialized because IIRC the interpreter struct is calloc-ed. I'm on my phone so is not easy to double check but just a thing to check before landing
| add_stats(GCState *gcstate, int gen, struct gc_collection_stats *stats) | ||
| { | ||
| gcstate->generation_stats[gen].duration += stats->duration; | ||
| gcstate->generation_stats[gen].visited += stats->visited; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this matters but what would happen if this overflows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we generally worried about overflowing Py_ssize_t?
I can switch this to size_t if we are so at least the overflow isn't UB. A Python int seems sort of heavy for this, but we could go that route if needed too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we generally worried about overflowing
Py_ssize_t?
Nah, I think this has the same problem as the other fields just wanted to raise it here in case we both believe is worth fixing. I don't think is important, but it is possible. WDYT? I am fine to ignore just to be clear
| "uncollectable": When *phase* is "stop", the number of objects | ||
| that could not be collected and were put in :data:`garbage`. | ||
|
|
||
| "visited": When *phase* is "stop", the number of unique objects visited |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be missing something but if I understand this correctly this counts objects in the collection set, not all objects visited during traversal (which would include objects referenced by containers). Consider maybe renaming to "objects_in_collection" or "initial_objects" or updating implementation to count all visited objects (including referenced ones during subtract_refs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, @nascheme pointed this out too on the issue. I think naming is hard... this can really be any of:
- Total visits performed (edges in the graph).
- Total unique objects visited (nodes in the graph, optionally minus immortals, marked-alive objects, untracked objects, etc.).
- Total objects in the current generation (nodes in the generation subgraph, optionally minus immortals, marked-alive objects, untracked objects, etc.).
I'm really trying not to overthink it... I think the probably most intuitive answer is something along the lines of "How many unique objects did the GC consider before eventually freeing <collected> number of them"? Like I said above, collected / visited should be a meaningful efficiency metric to help see at a glance if you're running at the right times (in addition to memory metrics, of course).
But definitely open to suggestions. I naturally lean towards a simpler API name and more nuance in the docs for it, rather than trying to convey all of the nuance in one long name alone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put another way, it seems like objects that can't count towards collected also shouldn't count towards visited. So I'm pretty sure that would exclude anything reachable from, but not part of, the current generation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree but even keeping the "naming is hard" optics, I would say that anything other than visited is good because at least we can agree that surely this is not "number of visited objects" no? Maybe candidates?
Yeah, per-collection stats have an initializer here (here in free-threading), so unnamed members are zeroed. The process-wide stats live on the |
This exposes a counter for the number of unique objects seen in the current generation. Previously, something like
sum(len(gc.get_objects(g)) for g in range(gen + 1))was needed to get the a statistic like this (in this PR I've made the choice to collect this statistic inupdate_refs, and omit immortal and "marked alive" members of the current generation).It's useful for things like
info["collected"] / info["visited"], to get the efficiency of a collection.durationandvisitedto thegc.callbacksandgc.get_stats()dicts #140638📚 Documentation preview 📚: https://cpython-previews--141814.org.readthedocs.build/