async Responses API has severe latency variance for small concurrent requests #722

@pablo-chocobar

Description

Summary

We are seeing large latency outliers when using openai-java async Responses API calls in a Java service.

Most OpenAI requests complete in about 1-3s, but some otherwise similar requests occasionally take >30s. These outliers cause our outer application timeout to fire even though the rest of our pipeline is fast.

This does not look like local CPU or executor blockage. In the slow cases, our service continues processing other work normally while the OpenAI call remains in flight.

Environment

  • openai-java: 4.24.1
  • Java: 21
  • Framework: Quarkus 3.30.8
  • Client: OpenAIOkHttpClientAsync
  • Model: gpt-5.1

How we use the SDK

We use one shared OpenAIClientAsync and run an async pipeline with:

  • one non-streaming classification call via client.responses().create(...)
  • one streaming planner call via client.responses().createStreaming(...)
  • several parallel non-streaming calls via client.responses().create(...)
  • one final non-streaming call via client.responses().create(...)

All of this happens inside a single chat turn. In production-like traffic, multiple chat turns may also overlap in time.

Expected behavior

We expect occasional variance, but not very large outliers for otherwise small JSON-style requests.

A request pattern where most calls finish in 1-3s but some similar calls take ~30s makes it hard to set a reasonable application timeout.
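One way to keep a single outlier from consuming the whole outer budget is to bound each call individually. This is a minimal sketch using only `java.util.concurrent`; `fakeCall` is a stand-in for the future returned by `client.responses().create(...)` and is not part of the SDK:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;

public class CallTimeoutSketch {
    // Simulates an SDK call that completes after the given delay;
    // stands in for client.responses().create(...).
    static CompletableFuture<String> fakeCall(long millis) {
        return CompletableFuture.supplyAsync(
            () -> "ok",
            CompletableFuture.delayedExecutor(millis, TimeUnit.MILLISECONDS));
    }

    public static void main(String[] args) {
        // A fast call completes well within its 2s budget.
        String fast = fakeCall(100).orTimeout(2, TimeUnit.SECONDS).join();
        System.out.println(fast); // prints "ok"

        // A stalled call is cut off by orTimeout instead of
        // silently eating the outer application timeout.
        try {
            fakeCall(5_000).orTimeout(1, TimeUnit.SECONDS).join();
        } catch (CompletionException e) {
            // cause is java.util.concurrent.TimeoutException
            System.out.println("timed out: " + e.getCause().getClass().getSimpleName());
        }
    }
}
```

With per-call bounds like this, an outlier surfaces as a `TimeoutException` at the slow stage rather than as an opaque outer-flow timeout, which also makes it easier to attach a retry or fallback at that stage.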

Actual behavior

We are seeing severe tail latency on individual OpenAI calls:

  • in one case, the final non-streaming request took 27.439s
  • in another case, the planner stream took 35.109s

Meanwhile, nearby requests in the same system complete much faster.

Why this does not seem like a local thread/executor problem

During the slow request:

  • other requests in the same service continue to run
  • local search calls complete in about 0.3-0.7s
  • other OpenAI calls in nearby flows complete in about 1-3s
  • completion logs for the slow request appear on the OkHttp / stream handler side, suggesting the service is waiting on the remote OpenAI call rather than being blocked locally

Example timings observed

Typical successful runs:

  • calls often complete in ~1-3s
  • whole flow often completes in ~6-11s

Outlier runs:

  1. Final call outlier:
  • planner completed in 3.542s
  • parallel calls completed in 2.932s
  • internal API call completed in 0.317s
  • final call completed in 27.439s
  2. Planner stream outlier:
  • flow started normally
  • outer flow timed out at 25s
  • planner stream itself completed only after 35.109s

Important detail

The slow final call is not a large request. The static prompt is very small and the payload is compact; it shouldn't be more than 500 tokens. So this does not seem to be caused by an unusually large request on our side. The outputs are also small: each call returns a JSON object of fewer than 200 characters.

Question

Is this level of tail latency variance expected for responses().create(...) / responses().createStreaming(...) with gpt-5.1 under bursty concurrent usage?

If so:

  • are there SDK-level recommendations for this traffic pattern?
  • are there known best practices around concurrent async requests with one shared OpenAIClientAsync?
  • is there a better model or request pattern for lower tail latency on small JSON-style calls?
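One generic mitigation for this kind of tail latency, independent of the SDK, is a hedged (backup) request: if the primary call has not completed after a threshold, fire an identical second attempt and take whichever finishes first. This is a sketch under the assumption that duplicate requests are acceptable for these idempotent JSON calls; `fakeCall` is a stand-in for the SDK future, not a real API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class HedgedRequestSketch {
    // Start the primary call; if it has not finished after hedgeAfterMillis,
    // start an identical backup and take whichever completes first.
    static <T> CompletableFuture<T> hedged(
            Supplier<CompletableFuture<T>> call, long hedgeAfterMillis) {
        CompletableFuture<T> primary = call.get();
        CompletableFuture<T> backup = CompletableFuture
            .supplyAsync(() -> (T) null,
                CompletableFuture.delayedExecutor(hedgeAfterMillis, TimeUnit.MILLISECONDS))
            .thenCompose(ignored -> primary.isDone() ? primary : call.get());
        return primary.applyToEither(backup, r -> r);
    }

    static CompletableFuture<String> fakeCall(long millis, String tag) {
        return CompletableFuture.supplyAsync(() -> tag,
            CompletableFuture.delayedExecutor(millis, TimeUnit.MILLISECONDS));
    }

    public static void main(String[] args) {
        // First attempt stalls for 5s; the backup fired at 500ms wins.
        boolean[] first = {true};
        String winner = hedged(() -> {
            if (first[0]) { first[0] = false; return fakeCall(5_000, "slow"); }
            return fakeCall(100, "backup");
        }, 500).join();
        System.out.println(winner); // prints "backup"
    }
}
```

The trade-off is extra cost and load whenever the hedge fires, so the threshold should sit well above the typical 1-3s latency (e.g. around the p99) so backups stay rare.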

Minimal reproduction shape

The exact business logic is not important; the pattern is:

  1. shared OpenAIClientAsync
  2. start one async non-streaming call
  3. start one streaming planner call
  4. start several parallel JSON calls
  5. start one more async JSON call
  6. observe that most requests complete quickly, but occasional outliers take much longer than the rest
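To attribute outliers to a specific stage of this shape, each future can be wrapped with a timing hook. This is a self-contained sketch of that instrumentation; `fakeCall` again stands in for `client.responses().create(...)` and the stage names are only illustrative:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class StageTimingSketch {
    // Wraps a stage future and prints its wall-clock duration on completion,
    // so an outlier can be pinned to one stage of the fan-out.
    static <T> CompletableFuture<T> timed(String stage, CompletableFuture<T> f) {
        long start = System.nanoTime();
        return f.whenComplete((r, t) -> {
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(stage + " took " + ms + "ms" + (t != null ? " (failed)" : ""));
        });
    }

    static CompletableFuture<String> fakeCall(long millis) {
        return CompletableFuture.supplyAsync(() -> "ok",
            CompletableFuture.delayedExecutor(millis, TimeUnit.MILLISECONDS));
    }

    public static void main(String[] args) {
        // Mirrors the reproduction shape: classify, planner, parallel agents, final.
        CompletableFuture<String> classify = timed("classify", fakeCall(100));
        CompletableFuture<Void> fanOut = timed("planner", fakeCall(200))
            .thenCompose(v -> CompletableFuture.allOf(
                timed("agent1", fakeCall(150)),
                timed("agent2", fakeCall(150)),
                timed("agent3", fakeCall(150))));
        timed("final", fanOut.thenCompose(v -> fakeCall(100))).join();
        classify.join();
    }
}
```

Comparing these per-stage durations against network-level timings (e.g. an OkHttp `EventListener`) would show whether the 27-35s gap is spent waiting on the remote response or queued before the request leaves the client.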

Pseudo-flow:

OpenAIClientAsync client = OpenAIOkHttpClientAsync.builder()
    .apiKey(System.getenv("OPENAI_API_KEY"))
    .build(); // no timeout, retry, or connection-pool settings passed here; could that be an issue?

CompletableFuture<Response> classify =
    client.responses().create(classifyParams);

AsyncStreamResponse<ResponseStreamEvent> plannerStream =
    client.responses().createStreaming(plannerParams);

CompletableFuture<Void> plannerDone = plannerStream.onCompleteFuture();

CompletableFuture<Response> agent1 = plannerDone.thenCompose(v -> client.responses().create(agent1Params));
CompletableFuture<Response> agent2 = plannerDone.thenCompose(v -> client.responses().create(agent2Params));
CompletableFuture<Response> agent3 = plannerDone.thenCompose(v -> client.responses().create(agent3Params));

CompletableFuture<Response> agent4 =
    CompletableFuture.allOf(agent1, agent2, agent3)
        .thenCompose(v -> client.responses().create(agent4Params));
