Destruct internals during interpreter finalization #5958
… shutdown can safely DECREF Python objects owned by the internals.
I think for shared internals you should go ahead and deallocate the internals struct, but not the pointer to it, from the capsule destructor. That is, run the internals destructor there without freeing the pointer slot itself.

The internals state dict is cleared very late in interpreter shutdown - AFAICT it's the last place that calls user-controlled code where the Python C API is available. C-level Py_AtExit handlers run later, but those aren't allowed to call into Python. See step 24 in this wonderful list, which I wish had a wider audience than a random GitHub comment: hudson-trading/pymetabind#2 (comment) (that list is for main-interpreter shutdown, but I believe subinterpreters do a subset in basically the same order).

If anyone is trying to access internals after the state dict is cleared, it's overwhelmingly likely that whatever they're going to do with the result is invalid anyway. If somehow someone does manage to ask for internals after they've been destroyed, there are two possibilities for what will occur, and I think both are acceptable.
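For orientation, a minimal sketch of the mechanism under discussion, with illustrative names (`my_internals` and `stash_internals` are not pybind11's actual identifiers, and error handling is omitted): state stashed in the interpreter state dict via a PyCapsule whose destructor runs when that dict is cleared late in finalization, while the C API is still usable.

```cpp
#include <Python.h>

struct my_internals { PyObject *owned_type = nullptr; /* ... */ };

// Capsule destructor: runs when the interpreter state dict is cleared, very
// late in Py_Finalize / Py_EndInterpreter, but while DECREF is still legal.
static void my_internals_dtor(PyObject *capsule) {
    auto *p = static_cast<my_internals *>(
        PyCapsule_GetPointer(capsule, "my_internals"));
    Py_XDECREF(p->owned_type); // safe here: the Python C API is available
    delete p;                  // deallocate the struct itself
}

static void stash_internals(my_internals *p) {
    PyObject *dict = PyInterpreterState_GetDict(PyInterpreterState_Get());
    PyObject *cap = PyCapsule_New(p, "my_internals", my_internals_dtor);
    PyDict_SetItemString(dict, "my_internals", cap);
    Py_DECREF(cap); // the state dict now holds the only reference
}
```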
If we were OK with extension modules in the second category above crashing (since they hold a pointer to a deallocated internals struct), … The perfect solution would be to store … inside …
Ok, I have updated the title, description, and code to delete the internals itself in the capsule destructor. I didn't have it call …

If something triggers internals to be created during finalization, it might end up being destroyed after finalization, and we don't want to do the DECREF at that point; we need the leaky behavior.
include/pybind11/detail/internals.h (outdated)

```cpp
// In odd finalization scenarios it might end up running after the interpreter has
// completely shut down. In that case, we should not decref these objects because pymalloc
// is gone.
if (Py_IsInitialized() != 0) {
```
This check defeats the point. Py_IsInitialized begins returning false quite early in Py_Finalize, well before the state dict is cleared, so adding this check will bring your leak back (at least for the main interpreter).
internals can only be freed by a DECREF on the capsule (in which case its destructor runs at a time when it's safe to DECREF things) or by an explicit call to internals_pp_manager::destroy(). Do any of the latter happen at times when it's not safe to DECREF things?
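A minimal probe (my sketch, not from the thread) that illustrates the timing claim: a Py_AtExit handler runs inside Py_Finalize and already observes Py_IsInitialized() == 0, even though finalization is still in progress.

```cpp
#include <Python.h>
#include <cstdio>

// Py_AtExit handlers run during Py_Finalize and must not call into Python,
// but Py_IsInitialized() itself is safe to call, and already returns 0 here.
static void at_exit_probe() {
    std::printf("inside Py_Finalize: Py_IsInitialized() = %d\n",
                Py_IsInitialized());
}

int main() {
    Py_Initialize();
    Py_AtExit(at_exit_probe);
    Py_Finalize(); // prints: inside Py_Finalize: Py_IsInitialized() = 0
    return 0;
}
```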
If internals is re-created during finalization, then it will be set in the unique_ptr, and the current pybind11 finalization code will subsequently call destroy (after Py_Finalize has completed) and delete the re-created version.
We could, of course, change the pybind11 finalization process to not behave this way....
include/pybind11/detail/internals.h (outdated)

```diff
  // destructed when going out of scope here, so the destructor will be called
  // immediately, which will also free the storage.
- /*destructor=*/[](void *ptr) -> void { delete static_cast<Payload *>(ptr); });
+ /*destructor=*/dtor);
```
This is not a semantics-preserving change.
There are two possibilities for when the capsule gets destroyed:
- It could be destroyed almost immediately, if another thread inserted the same key before we did. See the comment on line 639. In that case, no one else has seen our newly-allocated payload, so we should delete it.
- It could be destroyed at interpreter finalization. That's when custom destruction logic might come into play.
If two threads concurrently try to create internals, your use of the final dtor from the start means that one of them will leak its unique_ptr. Not a terrible outcome, but easy enough to avoid.
Suggest you change the dtor parameter to dtor_if_inserted. Create the capsule initially with a destructor that deallocates the payload. If insertion succeeds and dtor_if_inserted is not null, then swap the capsule destructor to dtor_if_inserted. Effectively, the old clear_destructor parameter becomes a special case of the dtor_if_inserted semantics: dtor_if_inserted is a no-op lambda if clear_destructor is true, or null (i.e. keep the original dtor that deletes the payload) if clear_destructor is false. A sketch of this scheme follows.
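Here is that sketch, under stated assumptions: all names are hypothetical, and PyDict_SetDefault stands in for whatever insertion primitive the PR actually uses.

```cpp
#include <Python.h>

struct Payload { PyObject *owned = nullptr; };

// Initial destructor: no one else has seen this payload (we lost the
// insertion race, or the capsule was never inserted), so just free it.
static void delete_payload(PyObject *self) {
    delete static_cast<Payload *>(PyCapsule_GetPointer(self, "demo"));
}

// Destructor swapped in after a successful insertion; it runs at interpreter
// finalization, when DECREF is still safe.
static void finalize_payload(PyObject *self) {
    auto *p = static_cast<Payload *>(PyCapsule_GetPointer(self, "demo"));
    Py_XDECREF(p->owned);
    delete p;
}

static Payload *get_or_create_payload(PyObject *state_dict,
                                      PyCapsule_Destructor dtor_if_inserted) {
    PyObject *cap = PyCapsule_New(new Payload, "demo", delete_payload);
    PyObject *key = PyUnicode_FromString("demo");
    // PyDict_SetDefault returns (borrowed) the stored value: our capsule if
    // we inserted first, another thread's capsule if it beat us to it.
    PyObject *stored = PyDict_SetDefault(state_dict, key, cap);
    Py_DECREF(key);
    if (stored == cap && dtor_if_inserted != nullptr) {
        PyCapsule_SetDestructor(cap, dtor_if_inserted); // we won the race
    }
    Py_DECREF(cap); // if we lost, delete_payload frees our unseen payload
    return static_cast<Payload *>(PyCapsule_GetPointer(stored, "demo"));
}
```

Calling get_or_create_payload(state_dict, finalize_payload) corresponds to clear_destructor == false; passing a no-op destructor reproduces clear_destructor == true.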
Ok, I believe the latest change addresses this.
…e capsule already exists.
Added a step to clean out unused files to save space during CI.
I'm trying to get my head around the current state of this PR, starting with:

Cursor (Claude 4.5 Opus) generated, with minor manual edits:

Problem Being Solved

pybind11 leaks three PyTypeObjects …

Joshua's (@oremanj) Concerns and Their Resolution

1. …
EDIT: Please see #5961 for the continuation of the leak investigation.

Cursor (Claude 4.5 Opus) generated:

Subinterpreter Memory Leak Investigation

I ran tests to measure memory usage with repeated subinterpreter create/destroy cycles, comparing the pure C API against pybind11.

Summary of Results
Key Finding

The massive leak (~1.7 MB per subinterpreter cycle) is NOT primarily from Python itself. Pure C API subinterpreter create/destroy cycles leak only ~4-6 kB, while pybind11 leaks approximately 400x more.

This PR correctly fixes pybind11's leak of the three PyTypeObjects (~3 kB), but there appears to be a much larger leak somewhere else in pybind11's subinterpreter handling that this PR does not address.

Raw Test Output

Pure C API (simple Py_NewInterpreter): …
Pure C API (with PyInterpreterConfig + PyRun_SimpleString): …
pybind11 upstream/master: …
pybind11 PR #5958: …

Test Environment
Test Code

pybind11 Test (measure_subinterpreter_leak.cpp)

```cpp
// Memory leak measurement for subinterpreter create/destroy cycles
//
// Compile:
// g++ -std=c++20 -O0 -g \
// -I<pybind11>/include \
// -I<python>/include/python3.14 \
// -o measure_leak measure_subinterpreter_leak.cpp \
// -L<python>/lib -Wl,-rpath,<python>/lib \
// -lpython3.14 -lpthread -ldl -lutil
//
// Run:
// ./measure_leak 500
#include <pybind11/embed.h>
#ifdef PYBIND11_HAS_SUBINTERPRETER_SUPPORT
#include <pybind11/subinterpreter.h>
#endif
#include <iostream>
#include <fstream>
#include <string>
namespace py = pybind11;
long get_rss_kb() {
std::ifstream status("/proc/self/status");
std::string line;
while (std::getline(status, line)) {
if (line.rfind("VmRSS:", 0) == 0) {
// Format: "VmRSS: 12345 kB"
long rss = 0;
for (char c : line) {
if (c >= '0' && c <= '9') {
rss = rss * 10 + (c - '0');
}
}
return rss;
}
}
return -1;
}
int main(int argc, char* argv[]) {
int iterations = 10000;
if (argc > 1) {
iterations = std::stoi(argv[1]);
}
py::scoped_interpreter guard{};
#ifdef PYBIND11_HAS_SUBINTERPRETER_SUPPORT
long rss_before = get_rss_kb();
for (int i = 0; i < iterations; ++i) {
{
py::scoped_subinterpreter ssi;
}
}
long rss_after = get_rss_kb();
std::cout << "Iterations: " << iterations << "\n";
std::cout << "RSS before: " << rss_before << " kB\n";
std::cout << "RSS after: " << rss_after << " kB\n";
std::cout << "Increase: " << (rss_after - rss_before) << " kB\n";
std::cout << "Per iter: " << (double)(rss_after - rss_before) / iterations << " kB\n";
#else
std::cout << "Subinterpreter support: NO\n";
#endif
return 0;
}
```

Pure C API Test - Simple (measure_leak_pure_c.c)

```c
// Pure C API subinterpreter leak measurement (no pybind11)
// Uses simple Py_NewInterpreter/Py_EndInterpreter
//
// Compile:
// gcc -o measure_leak_pure_c measure_leak_pure_c.c \
// -I<python>/include/python3.14 \
// -L<python>/lib -Wl,-rpath,<python>/lib \
// -lpython3.14 -lpthread -ldl -lutil
//
// Run:
// ./measure_leak_pure_c 500
#include <Python.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
long get_rss_kb() {
FILE *f = fopen("/proc/self/status", "r");
if (!f) return -1;
char line[256];
long rss = -1;
while (fgets(line, sizeof(line), f)) {
if (strncmp(line, "VmRSS:", 6) == 0) {
rss = 0;
for (char *p = line; *p; p++) {
if (*p >= '0' && *p <= '9') {
rss = rss * 10 + (*p - '0');
}
}
break;
}
}
fclose(f);
return rss;
}
int main(int argc, char *argv[]) {
int iterations = 500;
if (argc > 1) {
iterations = atoi(argv[1]);
}
// Initialize main interpreter
Py_Initialize();
PyThreadState *main_tstate = PyThreadState_Get();
long rss_before = get_rss_kb();
for (int i = 0; i < iterations; i++) {
// Create subinterpreter
PyThreadState *sub_tstate = Py_NewInterpreter();
if (!sub_tstate) {
fprintf(stderr, "Failed to create subinterpreter at iteration %d\n", i);
break;
}
// Destroy subinterpreter
Py_EndInterpreter(sub_tstate);
// Switch back to main
PyThreadState_Swap(main_tstate);
}
long rss_after = get_rss_kb();
printf("Iterations: %d\n", iterations);
printf("RSS before: %ld kB\n", rss_before);
printf("RSS after: %ld kB\n", rss_after);
printf("Increase: %ld kB\n", rss_after - rss_before);
printf("Per iter: %.2f kB\n", (double)(rss_after - rss_before) / iterations);
Py_Finalize();
return 0;
}
```

Pure C API Test - With Config (measure_leak_pure_c_active.c)

This version uses Py_NewInterpreterFromConfig with an explicit PyInterpreterConfig (matching the settings pybind11 uses) and runs a small snippet via PyRun_SimpleString.

```c
// Pure C API subinterpreter leak measurement - with PyInterpreterConfig
// More comparable to pybind11's scoped_subinterpreter
//
// Compile:
// gcc -o measure_leak_pure_c_active measure_leak_pure_c_active.c \
// -I<python>/include/python3.14 \
// -L<python>/lib -Wl,-rpath,<python>/lib \
// -lpython3.14 -lpthread -ldl -lutil
//
// Run:
// ./measure_leak_pure_c_active 500
#include <Python.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
long get_rss_kb() {
FILE *f = fopen("/proc/self/status", "r");
if (!f) return -1;
char line[256];
long rss = -1;
while (fgets(line, sizeof(line), f)) {
if (strncmp(line, "VmRSS:", 6) == 0) {
rss = 0;
for (char *p = line; *p; p++) {
if (*p >= '0' && *p <= '9') {
rss = rss * 10 + (*p - '0');
}
}
break;
}
}
fclose(f);
return rss;
}
int main(int argc, char *argv[]) {
int iterations = 500;
if (argc > 1) {
iterations = atoi(argv[1]);
}
// Initialize main interpreter
Py_Initialize();
PyThreadState *main_tstate = PyThreadState_Get();
long rss_before = get_rss_kb();
for (int i = 0; i < iterations; i++) {
// Create subinterpreter with config (like pybind11 does)
PyInterpreterConfig cfg = {
.use_main_obmalloc = 0,
.allow_fork = 0,
.allow_exec = 0,
.allow_threads = 1,
.allow_daemon_threads = 0,
.check_multi_interp_extensions = 1,
.gil = PyInterpreterConfig_OWN_GIL,
};
PyThreadState *sub_tstate;
PyStatus status = Py_NewInterpreterFromConfig(&sub_tstate, &cfg);
if (PyStatus_Exception(status)) {
fprintf(stderr, "Failed to create subinterpreter at iteration %d\n", i);
break;
}
// We're now on the subinterpreter - run a tiny bit of code
PyRun_SimpleString("x = 1");
// Destroy subinterpreter
Py_EndInterpreter(sub_tstate);
// Switch back to main
PyThreadState_Swap(main_tstate);
}
long rss_after = get_rss_kb();
printf("Iterations: %d\n", iterations);
printf("RSS before: %ld kB\n", rss_before);
printf("RSS after: %ld kB\n", rss_after);
printf("Increase: %ld kB\n", rss_after - rss_before);
printf("Per iter: %.2f kB\n", (double)(rss_after - rss_before) / iterations);
Py_Finalize();
return 0;
}
```

Analysis

The ~400x difference between the pure C API (~4 kB/iter) and pybind11 (~1.7 MB/iter) suggests the leak is in pybind11's subinterpreter infrastructure, not in Python itself. Looking at … The …

This PR fixes the leak of the PyTypeObjects (~3 kB), but the remaining ~1.7 MB leak suggests something else in the internals infrastructure is not being properly cleaned up. Possible areas to investigate: …
Conclusion

This PR is correct and should be merged - it fixes the leak it claims to fix. However, there's a much larger leak in pybind11's subinterpreter handling that warrants separate investigation.
rwgk left a comment:
@b-pass I'd say go ahead and merge this PR, it looks solid. (It'd be great if you could adopt the suggested expanded comment.)
I'll try to find out more about the "massive leak" as measured with the Cursor-generated code (#5958 (comment)). If you see anything suspicious in the measurements, please let me know.
@b-pass I asked Cursor (GPT-5.2 Codex Extra High) to run leak measurements for this PR, similar to what it did before under PR #5961. The test code and measurements are below.

Measurement: PR 5958 (…
Follow-on measurement: revert …
TL;DR (written by rwgk): It looks like …

Measurement: PR 5958 (…
rwgk left a comment:
I believe this PR is about as much as we can/should do to reduce the leakage, even though the impact is small compared to the default obmalloc behavior.
My vote is to merge this PR asap.
I'll see if I can make a CPython reproducer to demonstrate the vast difference in obmalloc default vs use_main_obmalloc=1 behavior.
I've been using KDE heaptrack (with …)
I have interesting/surprising new observations I wanted to show here clearly, without going into interpretations quite yet.

Results (non-heaptrack)

All runs used the same binary built against the default CPython 3.14 branch install, and the same two runtime args:
I also ran the same six measurements under heaptrack.

Environment
Reproducer source (…
Similar to the previous comment, but with a freethreaded build of the exact same CPython sources. I don't want to explore this dimension (default vs freethreaded) at the moment, but just report the measurements for potential future reference.

Results (non-heaptrack, freethreaded)

All runs used the same binary built against the freethreaded CPython 3.14 branch install, and the same two runtime args:
Environment
Reproducer source

Same as in the previous comment.

Build command (freethreaded): …

Run commands (freethreaded): …
pybind11 subinterpreter RSS measurements (default build)

Rationale

We now have a CPython repro (…). …

Results (non-heaptrack)

All runs used the same binary built against the default CPython 3.14 branch install, and the same two runtime args:
Reproducer source (…
pybind11 subinterpreter RSS measurements with PYTHONMALLOC=malloc
| use_main_obmalloc | iterations | RSS before | RSS after | Increase | Per iter |
|---|---|---|---|---|---|
| 0 | 1000 | 12288 kB | 16964 kB | 4676 kB | 4.68 kB |
| 1 | 1000 | 12320 kB | 16660 kB | 4340 kB | 4.34 kB |
Note: With PYTHONMALLOC=malloc, the two modes converge to ~4-5 kB/iter, which is much closer to the pure-C measurements.
FYI: I just gave the prompt below to Cursor (still using GPT-5.2 Codex Extra High). Let's see what it can do.

Now my mind is changed again, back to: there is probably a leaky path in cpython. How can we demonstrate that in a pure C reproducer? Could you please try hard to work out a reproducer, conceptually converging cpython_obmalloc_repro.cpp and pybind11_obmalloc_repro.cpp, by reducing the latter and expanding the former? Always run with 1000 iterations, to be sure we're not getting misled by noise. I'm guessing the super-safe approach would be to run four combinations for each new reproducer version: (PYTHONMALLOC=malloc no, yes) x (use_main_obmalloc 0, 1). But probably it's reasonable to only do that at certain critical decision points (or when backtracking), and to mostly work with just "PYTHONMALLOC=malloc no" and use_main_obmalloc=0. Could you please try to handle this as independently as possible, until you have a pure cpython reproducer? Each time you stop, could you please write ~/progress_$(date "+%Y-%m-%d+%H%M%S").md for me to read?
The prompt reported before was not very successful. Cursor was trying to expand cpython_obmalloc_repro.cpp …

I believe the most certain path to success will be to reduce, rather than to guess. I.e., to start from pybind11_obmalloc_repro.cpp. Conceptually: …

Expected intermediate result: a large pybind11_obmalloc_repro.cpp that no longer needs pybind11, but still needs C++. Continue by looking for these opportunities at each step: …

Theoretically this must be possible, although taken to the extreme it'd be just inlining stdc++ library code and pybind11 code into the reproducer. However, I'd expect that a lot of pruning will be possible along the way. The end-game will be pruning, pruning, pruning, going hand-in-hand with emulating C++ features in C. If I did this manually, I'd keep an eye out for safe shortcuts, although in my experience shortcut guesses often don't work out, and it's best to backtrack quickly to the systematic inline-and-prune strategy. Could you please try this strategy?
pybind11_obmalloc_repro_like_inlined.c (pure-C reproducer)

This reproducer is a pure C reduction of pybind11_obmalloc_repro.cpp. It keeps the same high-level loop and subinterpreter config, but replaces all pybind11 usage with Python C-API calls and inlines only the minimal pybind11 internals needed to preserve the behavior.

How this was produced (from …):
| PYTHONMALLOC | use_main_obmalloc | RSS before (kB) | RSS after (kB) | Increase (kB) | Per iter (kB) |
|---|---|---|---|---|---|
| default | 0 | 9920 | 1705580 | 1695660 | 1695.66 |
| default | 1 | 9956 | 14184 | 4228 | 4.23 |
| malloc | 0 | 9972 | 14492 | 4520 | 4.52 |
| malloc | 1 | 9936 | 14140 | 4204 | 4.20 |
@b-pass, could you please glance through the reproducer attached to my previous comment (pybind11_obmalloc_repro_like_inlined.c)? Please let me know if you have any suggestions. I want to create a cpython issue for it.
User prompt: "I think by chance I saw a message about the pybind11 metaclass while you worked on the reduction. IIUC it was something to the effect that using the default metaclass reduced the leakage to the 4 kB range, did I catch that correctly?"

Assistant response: "Short answer: close, but not 4 kB. When I switched make_default_metaclass() to return the default metaclass (PyType_Type, i.e., no heap-allocated metaclass), the leak collapsed from ~1.7 MB/iter to ~1.7 kB/iter (baseline-ish), even with instance_base creation still enabled. Restoring the heap metaclass brought the ~1.7 MB/iter leak back. So the 'metaclass switch' effect is real, but the observed drop was to ~1.7 kB/iter, not ~4 kB/iter. The ~4 kB/iter numbers are from use_main_obmalloc=1 / PYTHONMALLOC=malloc."
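For concreteness, a hedged sketch of that experiment (simplified; this is not pybind11's actual make_default_metaclass implementation): the normal path builds a fresh heap metaclass per interpreter, equivalent to `class pybind11_type(type): pass`, while the experimental path reuses the static PyType_Type.

```cpp
#include <Python.h>

static PyObject *make_default_metaclass(bool use_heap_metaclass) {
    if (!use_heap_metaclass) {
        // Experiment: no per-interpreter heap metaclass; in the measurements
        // above this collapsed the growth from ~1.7 MB/iter to ~1.7 kB/iter.
        Py_INCREF(&PyType_Type);
        return reinterpret_cast<PyObject *>(&PyType_Type);
    }
    // Normal path: type("pybind11_type", (type,), {}) via the C API, giving
    // a heap-allocated metaclass that lives in the interpreter's obmalloc.
    return PyObject_CallFunction(reinterpret_cast<PyObject *>(&PyType_Type),
                                 "s(O){}", "pybind11_type",
                                 reinterpret_cast<PyObject *>(&PyType_Type));
}
```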
include/pybind11/detail/internals.h

```cpp
using instance_map = std::unordered_multimap<const void *, instance *>;
// ...
inline bool is_interpreter_alive() {
```
I am sad to report that (AFAICT) this is still not quite correct.
- Py_IsInitialized and Py_IsFinalizing check whether the whole runtime is initialized/finalizing, not whether a specific interpreter is. It can be invalid to DECREF an object from a particular interpreter even though the runtime is still active.
- Py_IsFinalizing remains true even after Py_Finalize returns. It does not reset to false unless Py_Initialize is later called again.
I think the correct solution is an internals::leak_detach() method which will set the relevant members to NULL, without decref'ing them, so that a later destructor invocation won't call into the CPython API. Then internals_pp_manager::destroy() can call that method if the internals_pp still contains a valid pointer, before destroying it.
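A minimal sketch of that proposal, assuming illustrative member names (the real internals struct holds more state; leak_detach is the name proposed in the comment above):

```cpp
#include <Python.h>

struct internals_sketch {
    PyObject *instance_base = nullptr;
    PyObject *static_property_type = nullptr;
    PyObject *default_metaclass = nullptr;

    // Null out the members without DECREF, deliberately leaking them, so a
    // destructor that runs after the interpreter is gone never touches the
    // CPython API. internals_pp_manager::destroy() would call this when it
    // cannot prove the owning interpreter is still alive.
    void leak_detach() {
        instance_base = nullptr;
        static_property_type = nullptr;
        default_metaclass = nullptr;
    }

    ~internals_sketch() {
        Py_XDECREF(instance_base);        // no-ops after leak_detach()
        Py_XDECREF(static_property_type);
        Py_XDECREF(default_metaclass);
    }
};
```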
I'll point Cursor to this, the pybind11 sources, and the CPython 3.14 sources, and ask it to look very thoroughly to figure out what is the best achievable solution in terms of avoiding UB but leaking as little as possible. I'll report here what it finds.
d'ooh, I should've caught this before merging it. I just opened #5965 for a possible solution to this by checking the current interpreter instead of the "is alive" check.
Description
pybind11 leaks three PyTypes (metaclass, function_record, static_property) when an interpreter or sub-interpreter is finalized, about 1 kB each. These should be released during finalization so that simply creating interpreters does not leak memory.
This fixes #5794 (Python itself leaks more memory than this, but we can fix what we're doing at least...)
If a pybind11 module is loaded into a non-pybind11-owned sub-interpreter, the pybind11 internals are leaked when the sub-interpreter shuts down. This PR also fixes that, by having the internals destructed and freed during interpreter finalization. A minimal loop that exercises the interpreter-lifecycle leak is sketched below.
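For reference, an embedding loop of the kind that exercises this leak (a sketch; the exact trigger depends on when pybind11 internals get created in a given program):

```cpp
#include <pybind11/embed.h>

int main() {
    // Each init/finalize cycle previously leaked the three types named
    // above (~3 kB per cycle); with this PR they are released at
    // interpreter finalization.
    for (int i = 0; i < 1000; ++i) {
        pybind11::scoped_interpreter guard{}; // Py_Initialize + Py_Finalize
    }
    return 0;
}
```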
Suggested changelog entry:
📚 Documentation preview 📚: https://pybind11--5958.org.readthedocs.build/