
Conversation


@pablomartinezbernardo pablomartinezbernardo commented Feb 12, 2026

Overview

  1. The first thing handle_end_invocation does is spawn a task; let's call it anonymousTask
  2. handle_end_invocation then immediately returns 200 so the tracer can continue
  3. anonymousTask, meanwhile, is still busy in extract_request_body working through a complex body
  4. Because the tracer has continued, eventually PlatformRuntimeDone is processed
  5. Because this customer is not managed (initialization_type: SnapStart), PlatformRuntimeDone tries to pair_platform_runtime_done_event, which comes back None because anonymousTask is still busy with the body
  6. We then jump to process_on_platform_runtime_done
  7. Span and trace ids are not there yet, and they are never checked again after this
  8. anonymousTask finally completes, but by then it is irrelevant: send_ctx_spans only runs on PlatformRuntimeDone, which assumes universal_instrumentation_end has already been sent (a minimal sketch of this ordering follows the list)
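
A self-contained sketch of that ordering, assuming tokio; the channel, the sleep durations, and the message strings are purely illustrative stand-ins for the real event flow, not the extension's code:

    use std::time::Duration;
    use tokio::sync::mpsc;

    #[tokio::main]
    async fn main() {
        let (tx, mut rx) = mpsc::unbounded_channel::<&'static str>();

        // handle_end_invocation: spawn anonymousTask, then "return 200" immediately.
        let task_tx = tx.clone();
        tokio::spawn(async move {
            // anonymousTask: still busy with a complex body in extract_request_body...
            tokio::time::sleep(Duration::from_millis(200)).await;
            let _ = task_tx.send("universal_instrumentation_end (span/trace ids finally available)");
        });
        let _ = tx.send("200 OK returned, tracer continues");

        // Because the tracer has continued, PlatformRuntimeDone arrives before the body is read.
        tokio::time::sleep(Duration::from_millis(50)).await;
        let _ = tx.send("PlatformRuntimeDone processed, pair_platform_runtime_done_event finds nothing");

        drop(tx);
        // Prints the events in the problematic order: 200 OK, then PlatformRuntimeDone,
        // and only then universal_instrumentation_end.
        while let Some(event) = rx.recv().await {
            println!("{event}");
        }
    }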

Why this looks likely

In the customer's logs we can see

  • 05:11:48.463 datadog.trace.agent.core.DDSpan - Finished span (WRITTEN): DDSpan [ t_id=2742542901019652192
  • 05:11:48.489 PlatformRuntimeDone received
  • 05:11:48.630 REPORT RequestId 1db22159-7200-43c8-bec1-11b89df4f099 (last log emitted in an execution)
  • 05:11:53.784 START RequestId: 8c801767-e21b-43f7-bd11-078bb64bc430 (new request id, 5s later)
  • 05:11:53.789 Received end invocation request from headers:{"x-datadog-trace-id": "2742542901019652192"... -> we are now trying to finish the span after the request is long gone 🙃

In this specific run, the lambda even had time to stop before continuing with the anonymous task from handle_end_invocation.

Performance

This PR makes reading the body synchronous with the response: execution is not handed back outside the extension until the body has been read. That delay does not matter, because reading the body and sending universal_instrumentation_end before relinquishing control is precisely the requirement (sketched below).
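
A minimal sketch of that shape, assuming an axum 0.7-style handler; the State<ListenerState> extractor is omitted, and extract_request_body / send_universal_instrumentation_end below are simplified stand-ins, not the crate's real signatures:

    use axum::{
        body::{to_bytes, Bytes},
        extract::Request,
        http::{request::Parts, StatusCode},
        response::{IntoResponse, Response},
    };
    use serde_json::json;
    use tracing::error;

    // Simplified stand-in for the crate's extract_request_body.
    async fn extract_request_body(request: Request) -> Result<(Parts, Bytes), axum::Error> {
        let (parts, body) = request.into_parts();
        let bytes = to_bytes(body, usize::MAX).await?;
        Ok((parts, bytes))
    }

    // Placeholder for sending universal_instrumentation_end once span/trace ids are known.
    async fn send_universal_instrumentation_end(_parts: &Parts, _body: &Bytes) {}

    async fn handle_end_invocation(request: Request) -> Response {
        // Await the body here, before responding: the tracer resumes as soon as it sees
        // the 200, so universal_instrumentation_end must already have been sent by then.
        let (parts, body) = match extract_request_body(request).await {
            Ok(r) => r,
            Err(e) => {
                error!("Failed to extract request body: {e}");
                return (StatusCode::OK, json!({}).to_string()).into_response();
            }
        };
        send_universal_instrumentation_end(&parts, &body).await;
        (StatusCode::OK, json!({}).to_string()).into_response()
    }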

Testing

Suggestions very welcome

@pablomartinezbernardo pablomartinezbernardo changed the title [SLES-2666] extract_request_body before exiting handle_end_invocation [SLES-2666] handle_end_invocation race condition Feb 12, 2026
    Ok(r) => r,
    Err(e) => {
        error!("Failed to extract request body: {e}");
        return (StatusCode::OK, json!({}).to_string()).into_response();

@lucaspimentel lucaspimentel Feb 12, 2026


handle_end_invocation returns StatusCode::OK on error but handle_start_invocation returns StatusCode::BAD_REQUEST (line 129). Should these be consistent?

If this was intentional, consider leaving a comment.


This was intentional, in the sense that it is the behavior that does not introduce regressions: handle_end_invocation has never been able to return anything other than 200, and consumers of this endpoint may not be expecting anything else. But yeah, a comment explaining why (or better yet, making consumers aware that this may not return 200) is a good idea.
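
One possible shape for such a comment, sketched as a hypothetical standalone helper (the helper name and the comment wording are illustrative, not from the PR; assumes axum, serde_json, and tracing are available):

    use axum::{
        http::StatusCode,
        response::{IntoResponse, Response},
    };
    use serde_json::json;
    use tracing::error;

    // Hypothetical helper mirroring the error arm shown in the hunk above.
    fn body_extraction_error_response(e: &dyn std::fmt::Display) -> Response {
        error!("Failed to extract request body: {e}");
        // Intentionally 200 rather than 400: handle_end_invocation has never returned
        // anything other than 200, and consumers of this endpoint may not handle other
        // status codes. See the discussion on this PR.
        (StatusCode::OK, json!({}).to_string()).into_response()
    }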

    State((invocation_processor_handle, _, tasks)): State<ListenerState>,
    request: Request,
) -> Response {
    let (parts, body) = match extract_request_body(request).await {

@lucaspimentel lucaspimentel Feb 12, 2026


Consider leaving a comment to explain why this should not be async (so somebody doesn't come and try to "optimize" it again later).

Something like:

// extract_request_body must complete BEFORE returning 200 OK
// to avoid a race condition. See SLES-2666.
