Destruct internals during interpreter finalization #5958
Conversation
… shutdown can safely DECREF Python objects owned by the internals.
I think for shared internals you should go ahead and deallocate the internals struct, but not the pointer to it, from the capsule destructor. That is, call …

The internals state dict is cleared very late in interpreter shutdown - AFAICT it's the last place that calls user-controlled code where the Python C API is available. C-level Py_AtExit handlers are later, but those aren't allowed to call into Python. Step 24 in this wonderful list, which I wish had a wider audience than a random GitHub comment: hudson-trading/pymetabind#2 (comment) (that list is for main interpreter shutdown, but I believe subinterpreters do a subset in basically the same order).

If anyone is trying to access internals after the state dict is cleared, it's overwhelmingly likely that whatever they're going to do with the result is invalid anyway. If somehow someone does manage to ask for internals after they've been destroyed, there are two possibilities for what will occur, and I think both are acceptable.
If we were OK with extension modules in the second category above crashing (since they hold a pointer to a deallocated …

The perfect solution would be to store inside …
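For concreteness, here is a minimal sketch of the mechanism under discussion: a heap-allocated struct parked behind a PyCapsule in the interpreter state dict, whose destructor runs when the state dict is cleared late in finalization and can therefore still DECREF the Python objects it owns. The names (fake_internals, register_fake_internals) are illustrative only; this is not pybind11's actual internals code.

#include <Python.h>

struct fake_internals {                // hypothetical stand-in for pybind11's internals struct
    PyObject *cached_type = nullptr;   // example of a Python object owned by the internals
};

static void fake_internals_dtor(PyObject *capsule) {
    auto *p = static_cast<fake_internals *>(
        PyCapsule_GetPointer(capsule, "fake_internals"));
    if (p == nullptr) {
        return;
    }
    Py_XDECREF(p->cached_type);  // runs at a point where the C API is still usable
    delete p;
}

// Park the struct in the interpreter state dict; the dict holds the only strong
// reference, so the capsule destructor fires when the dict is cleared during
// finalization (Py_Finalize / Py_EndInterpreter).
static int register_fake_internals() {
    PyObject *state = PyInterpreterState_GetDict(PyInterpreterState_Get());
    if (state == nullptr) {
        return -1;
    }
    auto *p = new fake_internals{};
    PyObject *capsule = PyCapsule_New(p, "fake_internals", fake_internals_dtor);
    if (capsule == nullptr) {
        delete p;
        return -1;
    }
    int rc = PyDict_SetItemString(state, "fake_internals", capsule);
    Py_DECREF(capsule);  // on insertion failure this runs the destructor and frees p
    return rc;
}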
OK, I have updated the title, description, and code to delete the internals itself in the capsule destructor. I didn't have it call …
If something triggers internals to be created during finalization, it might end up being destroyed after finalization. We don't want to do the DECREF at that point; we need the leaky behavior.
include/pybind11/detail/internals.h
// In odd finalization scenarios it might end up running after the interpreter has
// completely shut down. In that case, we should not decref these objects because pymalloc
// is gone.
if (Py_IsInitialized() != 0) {
This check defeats the point. Py_IsInitialized begins returning false quite early in Py_Finalize, well before the state dict is cleared, so adding this check will bring your leak back (at least for the main interpreter).
internals can only be freed by a DECREF on the capsule (in which case its destructor runs at a time when it's safe to DECREF things) or by an explicit call to internals_pp_manager::destroy(). Do any of the latter happen at times when it's not safe to DECREF things?
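(A small embedded probe, if anyone wants to check the Py_IsInitialized() timing described above; the expected "= 0" output is based on that description, not re-verified here.)

#include <Python.h>
#include <cstdio>

// Destructor of a capsule parked in the interpreter state dict: it reports what
// Py_IsInitialized() returns at the moment the dict is cleared inside Py_Finalize.
static void probe_dtor(PyObject *) {
    std::printf("state-dict capsule dtor: Py_IsInitialized() = %d\n", Py_IsInitialized());
}

int main() {
    Py_Initialize();
    PyObject *state = PyInterpreterState_GetDict(PyInterpreterState_Get());
    PyObject *cap = PyCapsule_New(reinterpret_cast<void *>(&probe_dtor), "probe", probe_dtor);
    if (state != nullptr && cap != nullptr) {
        PyDict_SetItemString(state, "probe", cap);
    }
    Py_XDECREF(cap);
    Py_Finalize();  // per the comment above, this should print "... = 0"
    return 0;
}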
If internals is re-created during finalization, then it will be set in the unique_ptr, and the current pybind11 finalization code will subsequently call destroy (after Py_Finalize has completed) and delete the re-created version.
We could, of course, change the pybind11 finalization process to not behave this way....
include/pybind11/detail/internals.h
// destructed when going out of scope here, so the destructor will be called
// immediately, which will also free the storage.
/*destructor=*/[](void *ptr) -> void { delete static_cast<Payload *>(ptr); });
/*destructor=*/dtor);
This is not a semantics-preserving change.
There are two possibilities for when the capsule gets destroyed:
- It could be destroyed almost immediately, if another thread inserted the same key before we did. See the comment on line 639. In that case, no one else has seen our newly-allocated payload, so we should delete it.
- It could be destroyed at interpreter finalization. That's when custom destruction logic might come into play.
If two threads concurrently try to create internals, your use of the final dtor from the start means that one of them will leak its unique_ptr. Not a terrible outcome, but easy enough to avoid.
Suggest you change the dtor parameter to dtor_if_inserted. Create the capsule initially with a destructor that deallocates the payload. If insertion succeeds and dtor_if_inserted is not null, then swap the capsule destructor to be dtor_if_inserted instead. Effectively, the old clear_destructor parameter is a special case of the dtor_if_inserted semantics: dtor_if_inserted is a no-op lambda if clear_destructor is true, or is null (i.e. continue with the original dtor that deletes the payload) if clear_destructor is false.
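Sketched below, for illustration, is one way the dtor_if_inserted suggestion could look, using PyDict_SetDefault for the insert-if-absent step and PyCapsule_SetDestructor to swap in the custom destructor only on the capsule that actually got inserted. The names (Payload, get_or_create, delete_payload) are placeholders, not the real pybind11 registration code.

#include <Python.h>

struct Payload { /* stand-in for the real payload type */ };

static void delete_payload(PyObject *capsule) {
    delete static_cast<Payload *>(PyCapsule_GetPointer(capsule, "payload"));
}

// Returns the payload stored under `key` in `dict`, inserting a new one if absent.
static Payload *get_or_create(PyObject *dict, PyObject *key,
                              PyCapsule_Destructor dtor_if_inserted) {
    auto *payload = new Payload{};
    PyObject *capsule = PyCapsule_New(payload, "payload", delete_payload);
    if (capsule == nullptr) {
        delete payload;
        return nullptr;
    }
    // Setdefault semantics: if another thread inserted first, we get its capsule
    // back, and the DECREF below runs delete_payload on our unused payload.
    PyObject *winner = PyDict_SetDefault(dict, key, capsule);  // borrowed reference
    if (winner == capsule && dtor_if_inserted != nullptr) {
        // Our capsule won the race: only now install the custom destructor that
        // should run at interpreter finalization.
        PyCapsule_SetDestructor(capsule, dtor_if_inserted);
    }
    Py_DECREF(capsule);
    return winner != nullptr
               ? static_cast<Payload *>(PyCapsule_GetPointer(winner, "payload"))
               : nullptr;
}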
Ok, I believe the latest change addresses this.
…e capsule already exists.
Added a step to clean out unused files to save space during CI.
I'm trying to get my head around the current state of this PR, starting with:

Cursor (Claude 4.5 Opus) generated, with minor manual edits:

Problem Being Solved

pybind11 leaks three …

Joshua's (@oremanj) Concerns and Their Resolution

1. …
EDIT: Please see #5961 for the continuation of the leak investigation.

Cursor (Claude 4.5 Opus) generated:

Subinterpreter Memory Leak Investigation

I ran tests to measure memory usage with repeated subinterpreter create/destroy cycles, comparing pure C API against pybind11.

Summary of Results
Key Finding

The massive leak (~1.7 MB per subinterpreter cycle) is NOT primarily from Python itself. Pure C API subinterpreter create/destroy cycles leak only ~4-6 kB, while pybind11 leaks approximately 400x more. This PR correctly fixes pybind11's leak of the three PyTypeObjects (~3 kB), but there appears to be a much larger leak somewhere else in pybind11's subinterpreter handling that this PR does not address.

Raw Test Output

Pure C API (simple Py_NewInterpreter): …

Pure C API (with PyInterpreterConfig + PyRun_SimpleString): …

pybind11 upstream/master: …

pybind11 PR #5958: …

Test Environment
Test Code

pybind11 Test (measure_subinterpreter_leak.cpp)

// Memory leak measurement for subinterpreter create/destroy cycles
//
// Compile:
// g++ -std=c++20 -O0 -g \
// -I<pybind11>/include \
// -I<python>/include/python3.14 \
// -o measure_leak measure_subinterpreter_leak.cpp \
// -L<python>/lib -Wl,-rpath,<python>/lib \
// -lpython3.14 -lpthread -ldl -lutil
//
// Run:
// ./measure_leak 500
#include <pybind11/embed.h>
#ifdef PYBIND11_HAS_SUBINTERPRETER_SUPPORT
#include <pybind11/subinterpreter.h>
#endif
#include <iostream>
#include <fstream>
#include <string>
namespace py = pybind11;
long get_rss_kb() {
std::ifstream status("/proc/self/status");
std::string line;
while (std::getline(status, line)) {
if (line.rfind("VmRSS:", 0) == 0) {
// Format: "VmRSS: 12345 kB"
long rss = 0;
for (char c : line) {
if (c >= '0' && c <= '9') {
rss = rss * 10 + (c - '0');
}
}
return rss;
}
}
return -1;
}
int main(int argc, char* argv[]) {
int iterations = 10000;
if (argc > 1) {
iterations = std::stoi(argv[1]);
}
py::scoped_interpreter guard{};
#ifdef PYBIND11_HAS_SUBINTERPRETER_SUPPORT
long rss_before = get_rss_kb();
for (int i = 0; i < iterations; ++i) {
{
py::scoped_subinterpreter ssi;
}
}
long rss_after = get_rss_kb();
std::cout << "Iterations: " << iterations << "\n";
std::cout << "RSS before: " << rss_before << " kB\n";
std::cout << "RSS after: " << rss_after << " kB\n";
std::cout << "Increase: " << (rss_after - rss_before) << " kB\n";
std::cout << "Per iter: " << (double)(rss_after - rss_before) / iterations << " kB\n";
#else
std::cout << "Subinterpreter support: NO\n";
#endif
return 0;
}

Pure C API Test - Simple (measure_leak_pure_c.c)

// Pure C API subinterpreter leak measurement (no pybind11)
// Uses simple Py_NewInterpreter/Py_EndInterpreter
//
// Compile:
// gcc -o measure_leak_pure_c measure_leak_pure_c.c \
// -I<python>/include/python3.14 \
// -L<python>/lib -Wl,-rpath,<python>/lib \
// -lpython3.14 -lpthread -ldl -lutil
//
// Run:
// ./measure_leak_pure_c 500
#include <Python.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
long get_rss_kb() {
FILE *f = fopen("/proc/self/status", "r");
if (!f) return -1;
char line[256];
long rss = -1;
while (fgets(line, sizeof(line), f)) {
if (strncmp(line, "VmRSS:", 6) == 0) {
rss = 0;
for (char *p = line; *p; p++) {
if (*p >= '0' && *p <= '9') {
rss = rss * 10 + (*p - '0');
}
}
break;
}
}
fclose(f);
return rss;
}
int main(int argc, char *argv[]) {
int iterations = 500;
if (argc > 1) {
iterations = atoi(argv[1]);
}
// Initialize main interpreter
Py_Initialize();
PyThreadState *main_tstate = PyThreadState_Get();
long rss_before = get_rss_kb();
for (int i = 0; i < iterations; i++) {
// Create subinterpreter
PyThreadState *sub_tstate = Py_NewInterpreter();
if (!sub_tstate) {
fprintf(stderr, "Failed to create subinterpreter at iteration %d\n", i);
break;
}
// Destroy subinterpreter
Py_EndInterpreter(sub_tstate);
// Switch back to main
PyThreadState_Swap(main_tstate);
}
long rss_after = get_rss_kb();
printf("Iterations: %d\n", iterations);
printf("RSS before: %ld kB\n", rss_before);
printf("RSS after: %ld kB\n", rss_after);
printf("Increase: %ld kB\n", rss_after - rss_before);
printf("Per iter: %.2f kB\n", (double)(rss_after - rss_before) / iterations);
Py_Finalize();
return 0;
}

Pure C API Test - With Config (measure_leak_pure_c_active.c)

This version uses PyInterpreterConfig and runs a small Python snippet via PyRun_SimpleString, making it more comparable to pybind11's scoped_subinterpreter.

// Pure C API subinterpreter leak measurement - with PyInterpreterConfig
// More comparable to pybind11's scoped_subinterpreter
//
// Compile:
// gcc -o measure_leak_pure_c_active measure_leak_pure_c_active.c \
// -I<python>/include/python3.14 \
// -L<python>/lib -Wl,-rpath,<python>/lib \
// -lpython3.14 -lpthread -ldl -lutil
//
// Run:
// ./measure_leak_pure_c_active 500
#include <Python.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
long get_rss_kb() {
FILE *f = fopen("/proc/self/status", "r");
if (!f) return -1;
char line[256];
long rss = -1;
while (fgets(line, sizeof(line), f)) {
if (strncmp(line, "VmRSS:", 6) == 0) {
rss = 0;
for (char *p = line; *p; p++) {
if (*p >= '0' && *p <= '9') {
rss = rss * 10 + (*p - '0');
}
}
break;
}
}
fclose(f);
return rss;
}
int main(int argc, char *argv[]) {
int iterations = 500;
if (argc > 1) {
iterations = atoi(argv[1]);
}
// Initialize main interpreter
Py_Initialize();
PyThreadState *main_tstate = PyThreadState_Get();
long rss_before = get_rss_kb();
for (int i = 0; i < iterations; i++) {
// Create subinterpreter with config (like pybind11 does)
PyInterpreterConfig cfg = {
.use_main_obmalloc = 0,
.allow_fork = 0,
.allow_exec = 0,
.allow_threads = 1,
.allow_daemon_threads = 0,
.check_multi_interp_extensions = 1,
.gil = PyInterpreterConfig_OWN_GIL,
};
PyThreadState *sub_tstate;
PyStatus status = Py_NewInterpreterFromConfig(&sub_tstate, &cfg);
if (PyStatus_Exception(status)) {
fprintf(stderr, "Failed to create subinterpreter at iteration %d\n", i);
break;
}
// We're now on the subinterpreter - run a tiny bit of code
PyRun_SimpleString("x = 1");
// Destroy subinterpreter
Py_EndInterpreter(sub_tstate);
// Switch back to main
PyThreadState_Swap(main_tstate);
}
long rss_after = get_rss_kb();
printf("Iterations: %d\n", iterations);
printf("RSS before: %ld kB\n", rss_before);
printf("RSS after: %ld kB\n", rss_after);
printf("Increase: %ld kB\n", rss_after - rss_before);
printf("Per iter: %.2f kB\n", (double)(rss_after - rss_before) / iterations);
Py_Finalize();
return 0;
}

Analysis

The ~400x difference between pure C API (~4 kB/iter) and pybind11 (~1.7 MB/iter) suggests the leak is in pybind11's subinterpreter infrastructure, not in Python itself. Looking at …

The …

This PR fixes the leak of the PyTypeObjects (~3 kB), but the remaining ~1.7 MB leak suggests something else in the internals infrastructure is not being properly cleaned up. Possible areas to investigate: …

Conclusion

This PR is correct and should be merged - it fixes the leak it claims to fix. However, there's a much larger leak in pybind11's subinterpreter handling that warrants separate investigation.
rwgk left a comment:
@b-pass I'd say go ahead and merge this PR, it looks solid. (It'd be great if you could adopt the suggested expanded comment.)
I'll try to find out more about the "massive leak" as measured with the Cursor-generated code (#5958 (comment)). If you see anything suspicious in the measurements, please let me know.
Description
pybind11 leaks three PyType objects (metaclass, function_record, static_property), about 1 KB each, when an interpreter or sub-interpreter is finalized. These should be released during finalization so that simply creating and destroying interpreters does not leak memory.
This fixes #5794 (Python itself leaks more memory than this, but we can fix what we're doing at least...)
If a pybind11 module is loaded into a sub-interpreter that pybind11 does not own, the pybind11 internals are leaked when that sub-interpreter shuts down. This PR also fixes that by destructing and freeing the internals during interpreter finalization.
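As an illustration of that scenario, here is a hypothetical sketch: some_pybind11_module is a placeholder module name, and Py_NewInterpreterFromConfig assumes Python 3.12+ with the main interpreter already initialized.

#include <Python.h>

// The embedding application, not pybind11, owns the subinterpreter and imports a
// pybind11 extension into it. Before this PR, the pybind11 internals created for
// that subinterpreter were leaked at Py_EndInterpreter; with this PR they are
// destructed during the subinterpreter's finalization.
int run_in_app_owned_subinterpreter() {
    PyThreadState *main_tstate = PyThreadState_Get();  // assumes Py_Initialize() was called

    PyInterpreterConfig cfg = {
        .use_main_obmalloc = 0,
        .allow_fork = 0,
        .allow_exec = 0,
        .allow_threads = 1,
        .allow_daemon_threads = 0,
        .check_multi_interp_extensions = 1,
        .gil = PyInterpreterConfig_OWN_GIL,
    };
    PyThreadState *sub_tstate = nullptr;
    PyStatus status = Py_NewInterpreterFromConfig(&sub_tstate, &cfg);
    if (PyStatus_Exception(status)) {
        return -1;
    }

    // Importing a pybind11 module creates that interpreter's pybind11 internals.
    PyRun_SimpleString("import some_pybind11_module");  // placeholder module name

    Py_EndInterpreter(sub_tstate);
    PyThreadState_Swap(main_tstate);
    return 0;
}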
Suggested changelog entry:
📚 Documentation preview 📚: https://pybind11--5958.org.readthedocs.build/