Summary
We are seeing large latency outliers when making async Responses API calls with openai-java from a Java service.
Most OpenAI requests complete in about 1-3s, but some otherwise similar requests occasionally take >30s. These outliers cause our outer application timeout to fire even though the rest of our pipeline is fast.
This does not look like local CPU or executor blockage. In the slow cases, our service continues processing other work normally while the OpenAI call remains in flight.
Environment
- openai-java: 4.24.1
- Java: 21
- Framework: Quarkus 3.30.8
- Client: OpenAIOkHttpClientAsync
- Model: gpt-5.1
How we use the SDK
We use one shared OpenAIClientAsync and run an async pipeline with:
- one non-streaming classification call via client.responses().create(...)
- one streaming planner call via client.responses().createStreaming(...)
- several parallel non-streaming calls via client.responses().create(...)
- one final non-streaming call via client.responses().create(...)
All of this happens inside a single chat turn. In production-like traffic, multiple chat turns may also overlap in time.
Expected behavior
We expect occasional variance, but not very large outliers for otherwise small JSON-style requests.
A request pattern where most calls finish in 1-3s but some similar calls take ~30s makes it hard to set a reasonable application timeout.
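For reference, the outer bound we apply is just a plain CompletableFuture timeout, nothing SDK-specific. A minimal stand-in (the sleep here simulates the real client.responses().create(...) call; names are illustrative):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class TimeoutSketch {
    public static void main(String[] args) {
        // Stand-in for an SDK call; in our service this is client.responses().create(...)
        CompletableFuture<String> call = CompletableFuture.supplyAsync(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(100); // typical case: ~1-3s in production
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            return "ok";
        });

        // Outer application timeout: this is what fires on the ~30s outliers
        // even though the median call is 1-3s.
        String result = call.orTimeout(25, TimeUnit.SECONDS).join();
        System.out.println(result);
    }
}
```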
Actual behavior
We are seeing severe tail latency on individual OpenAI calls:
- in one case, the final non-streaming request took 27.439s
- in another case, the planner stream took 35.109s
Meanwhile, nearby requests in the same system complete much faster.
Why this does not seem like a local thread/executor problem
During the slow request:
- other requests in the same service continue to run
- local search calls complete in about 0.3-0.7s
- other OpenAI calls in nearby flows complete in about 1-3s
- completion logs for the slow request appear on the OkHttp / stream handler side, suggesting the service is waiting on the remote OpenAI call rather than being blocked locally
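The timings above come from a small wrapper around each call's future; since the completion callback runs on the SDK/OkHttp callback thread, a long gap between start and callback means the request itself was in flight, not that our executor was starved. A simplified sketch of that instrumentation (the sleeping future is a stand-in for the real SDK call):

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class LatencyLog {
    // Wrap any async call and log its wall-clock duration on completion.
    static <T> CompletableFuture<T> timed(String label, CompletableFuture<T> f) {
        long start = System.nanoTime();
        // whenComplete runs on the completing (callback) thread, so the measured
        // interval covers the remote round trip, not local queueing in our pool.
        return f.whenComplete((v, err) ->
            System.out.println(label + " took "
                + Duration.ofNanos(System.nanoTime() - start).toMillis() + "ms"));
    }

    public static void main(String[] args) {
        CompletableFuture<String> fake = CompletableFuture.supplyAsync(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(50);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            return "done";
        });
        System.out.println(timed("final-call", fake).join());
    }
}
```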
Example timings observed
Typical successful runs:
- calls often complete in ~1-3s
- whole flow often completes in ~6-11s
Outlier runs:
- final call outlier:
  - planner completed in 3.542s
  - parallel calls completed in 2.932s
  - internal API call completed in 0.317s
  - final call completed in 27.439s
- planner stream outlier:
  - flow started normally
  - outer flow timed out at 25s
  - planner stream itself completed only after 35.109s
Important detail
The slow final call is not a large request. The static prompt is very small and the payload is compact; it shouldn't be more than 500 tokens. So this does not seem to be caused by an unusually large request on our side. The outputs are also small: each call returns a JSON payload of under 200 characters.
Question
Is this level of tail latency variance expected for responses().create(...) / responses().createStreaming(...) with gpt-5.1 under bursty concurrent usage?
If so:
- are there SDK-level recommendations for this traffic pattern?
- are there known best practices around concurrent async requests with one shared OpenAIClientAsync?
- is there a better model or request pattern for lower tail latency on small JSON-style calls?
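One mitigation we have been considering for the small JSON calls is a hedged-request pattern: fire a backup request if the primary has not completed within some budget, and take whichever finishes first. A minimal sketch, purely illustrative (the supplier stands in for client.responses().create(...); cancellation of the losing request is omitted for brevity, so the extra call is wasted tokens):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class Hedge {
    // Fire the primary call; if it has not completed within hedgeAfterMs,
    // fire a second identical call and take whichever completes first.
    static <T> CompletableFuture<T> hedged(Supplier<CompletableFuture<T>> call, long hedgeAfterMs) {
        CompletableFuture<T> primary = call.get();
        CompletableFuture<T> hedge = CompletableFuture
            .supplyAsync(() -> 0, CompletableFuture.delayedExecutor(hedgeAfterMs, TimeUnit.MILLISECONDS))
            .thenCompose(ignored -> call.get());
        // Note: the slower future is not cancelled here; real code should cancel it.
        return primary.applyToEither(hedge, v -> v);
    }

    public static void main(String[] args) {
        // Fast stand-in call: completes in 30ms, well before the 500ms hedge fires.
        CompletableFuture<String> result = hedged(() -> CompletableFuture.supplyAsync(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(30);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            return "fast";
        }), 500);
        System.out.println(result.join());
    }
}
```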
Minimal reproduction shape
The exact business logic is not important; the pattern is:
- shared OpenAIClientAsync
- start one async non-streaming call
- start one streaming planner call
- start several parallel JSON calls
- start one more async JSON call
- observe that most requests complete quickly, but occasional outliers take much longer than the rest
Pseudo-flow (imports added for clarity; *Params values elided):

```java
import com.openai.client.OpenAIClientAsync;
import com.openai.client.okhttp.OpenAIOkHttpClientAsync;
import com.openai.core.http.AsyncStreamResponse;
import com.openai.models.responses.Response;
import com.openai.models.responses.ResponseStreamEvent;
import java.util.concurrent.CompletableFuture;

OpenAIClientAsync client = OpenAIOkHttpClientAsync.builder()
        .apiKey(System.getenv("OPENAI_API_KEY"))
        .build(); // did not pass any OkHttp params here, could that be an issue?

// 1. non-streaming classification call
CompletableFuture<Response> classify =
        client.responses().create(classifyParams);

// 2. streaming planner call
AsyncStreamResponse<ResponseStreamEvent> plannerStream =
        client.responses().createStreaming(plannerParams);
CompletableFuture<Void> plannerDone = plannerStream.onCompleteFuture();

// 3. several parallel non-streaming calls, fanned out after the planner completes
CompletableFuture<Response> agent1 = plannerDone.thenCompose(v -> client.responses().create(agent1Params));
CompletableFuture<Response> agent2 = plannerDone.thenCompose(v -> client.responses().create(agent2Params));
CompletableFuture<Response> agent3 = plannerDone.thenCompose(v -> client.responses().create(agent3Params));

// 4. one final non-streaming call after the fan-out joins
CompletableFuture<Response> agent4 =
        CompletableFuture.allOf(agent1, agent2, agent3)
                .thenCompose(v -> client.responses().create(agent4Params));
```