Skip to content

Conversation

@kevinburkesegment
Copy link
Contributor

OkHttpGrpcSender.shutdown() was calling shutdownNow() on the dispatcher's ExecutorService but not waiting for threads to actually terminate. This caused two issues:

  1. The CompletableResultCode returned immediately (ofSuccess()), even though background threads were still running
  2. Applications that wait for all threads to complete before exiting would hang, as HTTP call threads remained alive

The original implementation:

  • Called executorService.shutdownNow() to interrupt threads
  • Immediately returned CompletableResultCode.ofSuccess()
  • Did not call awaitTermination() to wait for threads to finish

This violated the semantic contract of CompletableResultCode, where callers expect to use .join() to wait for async operations to complete.

Modified OkHttpGrpcSender.shutdown() to:

  1. Use shutdownNow() to interrupt idle threads immediately (safe since we've already called cancelAll() on pending work)
  2. Create a background daemon thread that calls awaitTermination()
  3. Complete the CompletableResultCode only after threads terminate (or timeout after 5 seconds)

This ensures callers who use .join() will wait for actual thread termination, not just shutdown initiation.

One limitation: OkHttp's internal TaskRunner threads (used for connection keep-alive) may remain alive temporarily. These are:

  • Daemon threads (won't prevent JVM exit)
  • Idle (in TIMED_WAITING state)
  • Unable to receive new work after shutdown
  • Will terminate after their 60-second keep-alive timeout

The test verifies that HTTP call threads (dispatcher executor threads) are properly terminated, which is the critical requirement.

Added OkHttpGrpcSenderTest.shutdown_ShouldTerminateExecutorThreads() which verifies:

  • Dispatcher threads are terminated after shutdown
  • CompletableResultCode completes successfully
  • TaskRunner daemon threads are acknowledged but not required to terminate

Fix environment-dependent test failures:

Fixed four tests that were failing due to OS/JDK-specific networking behavior differences:

  1. RetryInterceptorTest.connectTimeout() and nonRetryableException(): Changed to accept both SocketTimeoutException and SocketException, as different environments throw different exceptions when connecting to non-routable IPs

  2. JdkHttpSenderTest.sendInternal_RetryableConnectTimeoutException(): Changed to accept both HttpConnectTimeoutException and IOException, as the specific exception varies by environment

  3. AbstractGrpcTelemetryExporterTest.connectTimeout(): Increased timeout threshold from 1s to 6s to account for OS-specific variations in connection failure timing (was failing before shutdown changes, pre-existing flaky test)

Fixes #7827.

@codecov
Copy link

codecov bot commented Nov 13, 2025

Codecov Report

❌ Patch coverage is 75.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.15%. Comparing base (a839ef5) to head (dead127).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
...orter/sender/okhttp/internal/OkHttpGrpcSender.java 75.00% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #7840      +/-   ##
============================================
- Coverage     90.21%   90.15%   -0.06%     
- Complexity     7245     7247       +2     
============================================
  Files           822      822              
  Lines         21828    21846      +18     
  Branches       2139     2142       +3     
============================================
+ Hits          19692    19696       +4     
- Misses         1469     1480      +11     
- Partials        667      670       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

try {
// Wait up to 5 seconds for threads to terminate
// Even if timeout occurs, we succeed since these are daemon threads
executorService.awaitTermination(5, TimeUnit.SECONDS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it fails to terminate in 5s, is it worth logging something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a log line at level warning

Copy link
Contributor

@jkwatson jkwatson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good to me. @jack-berg do you want to give it a once over as well?

.satisfies(
e ->
assertThat((e instanceof HttpConnectTimeoutException) || (e instanceof IOException))
.isTrue());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the changes to the source code require a change to this or RetryInterceptorTest?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I run the tests on my Mac with ./gradlew test I get errors that look like this:

RetryInterceptorTest > connectTimeout() FAILED
    java.lang.AssertionError:
    Expecting actual throwable to be an instance of:
      java.net.SocketTimeoutException
    but was:
      java.net.SocketException: Connection reset
        at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:318)
        at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:346)
        at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:796)
        ...(115 remaining lines not displayed - this can be changed with Assertions.setMaxStackTraceElementsDisplayed)
        at io.opentelemetry.exporter.sender.okhttp.internal.RetryInterceptorTest.connectTimeout(RetryInterceptorTest.java:191)

and

> Task :exporters:sender:jdk:test FAILED

JdkHttpSenderTest > sendInternal_RetryableConnectTimeoutException() FAILED
    java.lang.AssertionError:
    Expecting actual throwable to be an instance of:
      java.net.http.HttpConnectTimeoutException
    but was:
      java.net.http.HttpTimeoutException: request timed out
        at java.net.http/jdk.internal.net.http.HttpClientImpl.send(HttpClientImpl.java:953)
        at java.net.http/jdk.internal.net.http.HttpClientFacade.send(HttpClientFacade.java:133)
        at io.opentelemetry.exporter.sender.jdk.internal.JdkHttpSenderTest.lambda$setup$0(JdkHttpSenderTest.java:57)
        ...(103 remaining lines not displayed - this can be changed with Assertions.setMaxStackTraceElementsDisplayed)
        at io.opentelemetry.exporter.sender.jdk.internal.JdkHttpSenderTest.sendInternal_RetryableConnectTimeoutException(JdkHttpSenderTest.java:102)

These changes allow the tests to pass. I can submit as a separate PR


// Capture OkHttp threads before shutdown
Set<Thread> threadsBeforeShutdown = getOkHttpThreads();
logger.info(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get rid of these log statements. They just make for noisy build logs, which I'm in a never-ending battle to wrangle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 178 to 198
private static Set<Thread> getOkHttpThreads() {
ThreadGroup rootGroup = Thread.currentThread().getThreadGroup();
while (rootGroup.getParent() != null) {
rootGroup = rootGroup.getParent();
}

Thread[] threads = new Thread[rootGroup.activeCount() * 2];
int count = rootGroup.enumerate(threads, true);

Set<Thread> okHttpThreads = new java.util.HashSet<>();
for (int i = 0; i < count; i++) {
Thread thread = threads[i];
if (thread != null && thread.getName() != null) {
String name = thread.getName().toLowerCase(Locale.ROOT);
if (name.contains("okhttp") || name.contains("ok-http")) {
okHttpThreads.add(thread);
}
}
}
return okHttpThreads;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private static Set<Thread> getOkHttpThreads() {
ThreadGroup rootGroup = Thread.currentThread().getThreadGroup();
while (rootGroup.getParent() != null) {
rootGroup = rootGroup.getParent();
}
Thread[] threads = new Thread[rootGroup.activeCount() * 2];
int count = rootGroup.enumerate(threads, true);
Set<Thread> okHttpThreads = new java.util.HashSet<>();
for (int i = 0; i < count; i++) {
Thread thread = threads[i];
if (thread != null && thread.getName() != null) {
String name = thread.getName().toLowerCase(Locale.ROOT);
if (name.contains("okhttp") || name.contains("ok-http")) {
okHttpThreads.add(thread);
}
}
}
return okHttpThreads;
}
private static Set<Thread> getOkHttpThreads() {
return Thread.getAllStackTraces().keySet().stream()
.filter(thread -> {
String name = thread.getName().toLowerCase(Locale.ROOT);
return name.contains("okhttp") || name.contains("ok-http");
}).collect(toSet());
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (needed a slight modification to toSet() to work)

assertFalse(
threadsBeforeShutdown.isEmpty(), "Expected OkHttp threads to be present before shutdown");

// Call shutdown and wait for it to complete
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#nit a lot of these comments are self explanaotry. Hi AI 👋

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did look through the patch before submitting it, agreed though... I removed a fair number of these. Sorry, Java is not my most fluent programming language

// 2. They're idle (in TIMED_WAITING state)
// 3. No new work can be dispatched to them after shutdown
// We log them for visibility but don't fail the test.
if (!taskRunnerThreads.isEmpty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this happen in practice? If so, then what are really testing here? We change the code to wait for up to 5 seconds to complete the CompletableResultCode, but can't actually write a test to verify that because daemon threads might take longer than that to shutdown. 🤷

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a fair point... I reworked the test to something much simpler, just checking the code at the point of shutdown. The test fails on tip and passes with the patch applied.

OkHttpGrpcSender.shutdown() was calling shutdownNow() on the
dispatcher's ExecutorService but not waiting for threads to actually
terminate. This caused two issues:

1. The CompletableResultCode returned immediately (ofSuccess()), even
   though background threads were still running
2. Applications that wait for all threads to complete before exiting
   would hang, as HTTP call threads remained alive

The original implementation:

- Called executorService.shutdownNow() to interrupt threads
- Immediately returned CompletableResultCode.ofSuccess()
- Did not call awaitTermination() to wait for threads to finish

This violated the semantic contract of CompletableResultCode, where
callers expect to use .join() to wait for async operations to complete.

Modified OkHttpGrpcSender.shutdown() to:

1. Use shutdownNow() to interrupt idle threads immediately (safe since
   we've already called cancelAll() on pending work)
2. Create a background daemon thread that calls awaitTermination()
3. Complete the CompletableResultCode only after threads terminate
   (or timeout after 5 seconds)

This ensures callers who use .join() will wait for actual thread
termination, not just shutdown initiation.

One limitation: OkHttp's internal TaskRunner threads (used for
connection keep-alive) may remain alive temporarily. These are:

- Daemon threads (won't prevent JVM exit)
- Idle (in TIMED_WAITING state)
- Unable to receive new work after shutdown
- Will terminate after their 60-second keep-alive timeout

The test verifies that HTTP call threads (dispatcher executor threads)
are properly terminated, which is the critical requirement.

Added OkHttpGrpcSenderTest.shutdown_ShouldTerminateExecutorThreads()
which verifies:

- Dispatcher threads are terminated after shutdown
- CompletableResultCode completes successfully
- TaskRunner daemon threads are acknowledged but not required to terminate

Fix environment-dependent test failures:

Fixed four tests that were failing due to OS/JDK-specific networking
behavior differences:

1. RetryInterceptorTest.connectTimeout() and nonRetryableException():
   Changed to accept both SocketTimeoutException and SocketException,
   as different environments throw different exceptions when connecting
   to non-routable IPs

2. JdkHttpSenderTest.sendInternal_RetryableConnectTimeoutException():
   Changed to accept both HttpConnectTimeoutException and IOException,
   as the specific exception varies by environment

3. AbstractGrpcTelemetryExporterTest.connectTimeout():
   Increased timeout threshold from 1s to 6s to account for OS-specific
   variations in connection failure timing (was failing before shutdown
   changes, pre-existing flaky test)

Fixes open-telemetry#7827.
This version only checks the value of CompletableResultCode instead of
trying to wait, it's simpler and also exposes the failure.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OkHttpGrpcSender.shutdown() does not properly await executor thread termination

3 participants