HttpRemoteTaskRunner enhancements #18851

Open
jtuglu1 wants to merge 6 commits into apache:master from jtuglu1:http-remote-task-runner-revamp-v2

Conversation

@jtuglu1
Contributor

@jtuglu1 jtuglu1 commented Dec 18, 2025

Description

Clone of #18729, but merged into the current runner per @kfaraz's request.

I've seen the giant lock in HttpRemoteTaskRunner cause severe performance degradation under heavy load: 200-500ms per acquisition with 1000s of activeTasks, which can slow down the startPendingTasks loop in TaskQueue. This leads to scheduling delays, which lead to more lag, which auto-scales more tasks, and so on. The runner also has a number of races in the code, some documented and some not. The lock overhead also slows down query tasks under load (e.g. MSQE and others) that rely on the scheduler for execution.

I'm attempting a rewrite of this class to optimize for throughput and safety.

Apart from the performance improvements/bug fixes, this will also include some new features:

  • Simpler code. The old task runner had legacy ZK references lying around, as well as a fairly complicated scheduling loop.

I would ultimately like to make this the default HttpRemoteTaskRunner and have it run in all tests/production clusters, etc. as I think that would help catch more bugs/issues.

Performance Testing

Test results so far show a ~100-300ms speedup per task runner operation (add(), etc.). Over 1000s of tasks, this adds up to minutes of delay saved.
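As a rough back-of-envelope check of that estimate, assuming (hypothetically) 1,000 runner operations executed one after another and the per-operation savings quoted above:

```math
1000 \times 0.1\,\mathrm{s} = 100\,\mathrm{s} \approx 1.7\,\mathrm{min}, \qquad 1000 \times 0.3\,\mathrm{s} = 300\,\mathrm{s} = 5\,\mathrm{min}
```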

Release note

Improve throughput and thread safety of HttpRemoteTaskRunner.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@jtuglu1 jtuglu1 changed the title from "Http remote task runner revamp v2" to "HttpRemoteTaskRunner enhancements" Dec 18, 2025
@jtuglu1 jtuglu1 force-pushed the http-remote-task-runner-revamp-v2 branch from d6dc9a2 to 6cc5303 Compare December 18, 2025 03:28
@kfaraz
Contributor

kfaraz commented Dec 18, 2025

Thanks for creating this PR, @jtuglu1! The patch seems much simpler now.
I should be able to complete an initial review today.

@jtuglu1 jtuglu1 force-pushed the http-remote-task-runner-revamp-v2 branch from 6cc5303 to 0deca3a Compare December 18, 2025 03:38
@jtuglu1 jtuglu1 requested a review from kfaraz December 18, 2025 10:12
@jtuglu1 jtuglu1 marked this pull request as ready for review December 18, 2025 17:12
@jtuglu1 jtuglu1 added this to the 36.0.0 milestone Dec 18, 2025
@jtuglu1 jtuglu1 force-pushed the http-remote-task-runner-revamp-v2 branch 2 times, most recently from 006a079 to f1b210a Compare December 21, 2025 19:45
Contributor

@kfaraz kfaraz left a comment

Leaving a partial review, will try to finish going through the rest of the changes today.

Contributor

@kfaraz kfaraz left a comment

Finished going through the bulk of the changes.

On the whole, the patch looks good. I have these major suggestions:

  • For the time being, it would be cleaner to use workerStateLock consistently whenever accessing the workers map. We can try to improve this later.
  • Avoid use of .forEach() and use .compute() instead, preferably encasing it in an addOrUpdate method similar to TaskQueue.
  • Do not perform any heavy operation like metadata store access, metric emission, listener notification, etc. inside the .compute() lambda.
  • Avoid throwing exceptions inside the lambda, if they are just to be caught back in the same method/loop. Instead, log an error and continue with the loop.
  • Remove the priority scheduling changes for now.
  • Reduce debug logging.
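
The `.compute()` suggestions above can be illustrated with a minimal sketch. It is not code from this PR: `TaskStateMap`, its `State` enum, `markRunning`, and the `events` list (a stand-in for metric emission or listener notification) are all hypothetical names chosen for illustration. The idea is that the lambda passed to `compute()` only does cheap in-memory work and returns the new value, while anything heavy runs after `compute()` returns, and unexpected states are logged rather than signalled by throwing inside the lambda.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.UnaryOperator;

// Hypothetical, simplified stand-in for the runner's task bookkeeping (not code from this PR).
final class TaskStateMap
{
  enum State { PENDING, RUNNING }

  private final ConcurrentHashMap<String, State> tasks = new ConcurrentHashMap<>();
  // Stand-in for heavy side effects such as metric emission or listener notification.
  private final List<String> events = new CopyOnWriteArrayList<>();

  /**
   * Atomically adds or updates the entry for taskId. The updater receives the current
   * state (null if absent) and returns the new state (null leaves the entry absent or removes it).
   * It should only do cheap, in-memory work.
   */
  State addOrUpdate(String taskId, UnaryOperator<State> updater)
  {
    return tasks.compute(taskId, (id, current) -> updater.apply(current));
  }

  /** Moves a task from PENDING to RUNNING; returns whether the transition happened. */
  boolean markRunning(String taskId)
  {
    final AtomicBoolean transitioned = new AtomicBoolean(false);
    addOrUpdate(taskId, current -> {
      if (current == State.PENDING) {
        transitioned.set(true);
        return State.RUNNING;
      }
      // Unexpected state: leave the entry unchanged instead of throwing inside the lambda.
      return current;
    });

    // Heavy work happens only after compute() has returned.
    if (transitioned.get()) {
      events.add("task/running " + taskId);
    } else {
      System.err.println("Skipping task [" + taskId + "]: not in PENDING state");
    }
    return transitioned.get();
  }

  List<String> events()
  {
    return events;
  }
}
```

A caller would first register a task with `addOrUpdate("t1", s -> TaskStateMap.State.PENDING)` and then call `markRunning("t1")`. Repeated or unknown task IDs are skipped with a log line, so a surrounding loop can simply continue.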

@kgyrtkirk
Member

I will move this out from 36.0.0 for now - it doesn't seem like something which should block the release.
If it gets merged in the upcoming days it could still be ported over to the release branch - and thus be part of it!

@kgyrtkirk kgyrtkirk modified the milestones: 36.0.0, 37.0.0 Jan 12, 2026
@jtuglu1 jtuglu1 force-pushed the http-remote-task-runner-revamp-v2 branch from f1b210a to 3237a49 Compare February 4, 2026 07:10
task.getType(),
HttpRemoteTaskRunnerWorkItem.State.PENDING
);
pendingTasks.offer(new PendingTaskQueueItem(task));

Check notice

Code scanning / CodeQL

Ignored error status of call Note

Method apply ignores exceptional return value of LinkedBlockingQueue.offer.
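
One way to address this kind of notice is to check the boolean returned by `offer()`. The sketch below is illustrative only: `PendingQueueExample` and `enqueue` are hypothetical names, the queue holds plain task IDs rather than the PR's `PendingTaskQueueItem`, and it assumes a bounded queue. With an unbounded `LinkedBlockingQueue` (the default capacity is `Integer.MAX_VALUE`), `offer()` returns `false` only when that limit is reached.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical example class; the real runner's queue and item types differ.
final class PendingQueueExample
{
  private final LinkedBlockingQueue<String> pendingTasks;

  PendingQueueExample(int capacity)
  {
    this.pendingTasks = new LinkedBlockingQueue<>(capacity);
  }

  /** Tries to enqueue a task ID without blocking; returns false if the queue is full. */
  boolean enqueue(String taskId)
  {
    final boolean accepted = pendingTasks.offer(taskId);
    if (!accepted) {
      // Surface the failure instead of silently dropping the task.
      System.err.println("Could not enqueue task [" + taskId + "]: pending queue is full");
    }
    return accepted;
  }

  int size()
  {
    return pendingTasks.size();
  }
}
```

The caller can then decide whether to retry, fail the task, or fall back to a blocking `put()`, depending on what the surrounding code expects.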
@jtuglu1 jtuglu1 force-pushed the http-remote-task-runner-revamp-v2 branch 2 times, most recently from e2ca69f to 87d0be5 Compare March 6, 2026 03:33
@jtuglu1 jtuglu1 requested a review from kfaraz March 6, 2026 04:00
@jtuglu1 jtuglu1 force-pushed the http-remote-task-runner-revamp-v2 branch 3 times, most recently from 1ee0493 to 1c741ec Compare March 7, 2026 02:15
@jtuglu1 jtuglu1 requested a review from gianm March 9, 2026 19:40
@jtuglu1
Contributor Author

jtuglu1 commented Mar 10, 2026

@gianm any thoughts here?

@gianm
Contributor

gianm commented Mar 10, 2026

> @gianm any thoughts here?

I will try to take a look. It may take some time to get to it, since the changes look quite extensive.

Have you run this on a real production at-scale cluster yet (something with hundreds or thousands of tasks running simultaneously, ideally)? If so, that's always helpful to know.

@jtuglu1
Contributor Author

jtuglu1 commented Mar 10, 2026

> > @gianm any thoughts here?
>
> I will try to take a look. It may take some time to get to it, since the changes look quite extensive.
>
> Have you run this on a real production at-scale cluster yet (something with hundreds or thousands of tasks running simultaneously, ideally)? If so, that's always helpful to know.

Yes, no observed issues. We run with close to 10k tasks at peak per cluster.

@jtuglu1 jtuglu1 force-pushed the http-remote-task-runner-revamp-v2 branch from 4eaed68 to 626a95c Compare March 12, 2026 20:32
@jtuglu1
Contributor Author

jtuglu1 commented Mar 16, 2026

@kfaraz @gianm thoughts here?

@jtuglu1
Contributor Author

jtuglu1 commented Mar 19, 2026

@kfaraz @gianm any thoughts here?


4 participants