Thanks for creating this PR, @jtuglu1! The patch seems much simpler now.
kfaraz
left a comment
Leaving a partial review, will try to finish going through the rest of the changes today.
Finished going through the bulk of the changes.
On the whole, the patch looks good. I have these major suggestions:
- For the time being, it would be cleaner to use `workerStateLock` consistently whenever accessing the `workers` map. We can try to improve this later.
- Avoid use of `.forEach()` and use `.compute()` instead, preferably encasing it in an `addOrUpdate` method similar to `TaskQueue`.
- Do not perform any heavy operation like metadata store access, metric emission, listener notification, etc. inside the `.compute()` lambda.
- Avoid throwing exceptions inside the lambda, if they are just to be caught back in the same method/loop. Instead, log an error and continue with the loop.
- Remove the priority scheduling changes for now.
- Reduce debug logging.
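To make the `.compute()` suggestions concrete, here is a minimal sketch of the pattern, not the actual Druid code. `WorkerRegistry`, `WorkerState`, and the shape of `addOrUpdate` are hypothetical names invented for illustration, and the `workerStateLock` point is not shown. The idea is that the lambda passed to `ConcurrentHashMap.compute()` only derives the new value, while anything heavy (here a stand-in for listener notification) runs after `compute()` returns, and a bad entry is logged and skipped rather than thrown and caught in the same loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for the runner's worker bookkeeping; not Druid's class. */
class WorkerRegistry {
    /** Immutable per-worker state (illustrative only). */
    public static final class WorkerState {
        public final String host;
        public final int runningTasks;

        public WorkerState(String host, int runningTasks) {
            this.host = host;
            this.runningTasks = runningTasks;
        }
    }

    private final ConcurrentHashMap<String, WorkerState> workers = new ConcurrentHashMap<>();

    /**
     * Atomically adds or updates the entry for a host. The update function
     * receives the existing state (or null) and must stay cheap and free of
     * side effects, because other updates to the same bin may block meanwhile.
     */
    public WorkerState addOrUpdate(String host, UnaryOperator<WorkerState> update) {
        return workers.compute(host, (key, existing) -> update.apply(existing));
    }

    public WorkerState get(String host) {
        return workers.get(host);
    }

    public static void main(String[] args) {
        WorkerRegistry registry = new WorkerRegistry();
        List<String> pendingNotifications = new ArrayList<>();

        for (String host : new String[] {"worker-1", "", "worker-2", "worker-1"}) {
            if (host.isEmpty()) {
                // Log and continue instead of throwing and catching in the same loop.
                System.err.println("Skipping worker with empty host");
                continue;
            }
            WorkerState updated = registry.addOrUpdate(
                host,
                existing -> existing == null
                    ? new WorkerState(host, 1)
                    : new WorkerState(host, existing.runningTasks + 1)
            );
            // Heavy work is collected here, outside the compute() lambda.
            pendingNotifications.add(host + " now runs " + updated.runningTasks + " task(s)");
        }

        // Stand-in for metric emission or listener notification.
        pendingNotifications.forEach(System.out::println);
    }
}
```

Keeping `WorkerState` immutable means the value returned by `addOrUpdate` is a consistent snapshot that can be used for the follow-up work without holding any lock.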
I will move this out of 36.0.0 for now - it doesn't seem like something which should block the release.
@gianm any thoughts here?
I will try to take a look. It may take some time to get to it, since the changes look quite extensive. Have you run this on a real production at-scale cluster yet (something with hundreds or thousands of tasks running simultaneously, ideally)? If so, that's always helpful to know.
Yes, no observed issues. We run with close to 10k tasks at peak per cluster.
Description
Clone of #18729 but merged into the current runner per @kfaraz's request.
I've seen the giant lock in `HttpRemoteTaskRunner` cause severe performance degradation under heavy load (200-500ms per acquisition with 1000s of activeTasks can slow down the `startPendingTasks` loop in `TaskQueue`). This leads to scheduling delays, which lead to more lag, which auto-scales more tasks, and so on. The runner also has a few (un)documented races in the code. This overhead also slows down query tasks under load (e.g. MSQE and others) which use the scheduler for execution.
I'm attempting a rewrite of this class to optimize for throughput and safety.
Apart from the performance improvements/bug fixes, this will also include some new features:
I would ultimately like to make this the default `HttpRemoteTaskRunner` and have it run in all tests/production clusters, etc. as I think that would help catch more bugs/issues.
Performance Testing
Test results thus far have shown a ~100-300ms speed-up per task runner operation (`add()`, etc.). Over 1000s of tasks, this amounts to minutes of delay saved.
Release note
Speed up throughput and improve thread safety of HttpRemoteTaskRunner
This PR has: