Add channel benchmarks #4546
base: develop
Conversation
It does perform unexpectedly badly; could you all take a look for any silly mistakes? The simplest benchmark, sending X elements only (not receiving anything), is already pretty sad. Full output for the first three counts (4KB, 40KB, 400KB) of Ints (just FYI, no reason to look at this, since the snippet above is already bad enough).
Just for the record, we discussed benchmarks with @murfel offline and she'll rework them.
Ran (on a freshly restarted MacBook, with no apps open but the terminal and the system monitor).
Quick normalisation with ChatGPT.
(I'll do a proper notebook for the JSON benchmark output after we agree on the benchmark correctness. I forgot to save this one as JSON, and it takes 40 min to re-run.)
    repeat(maxCount) { add(it) }
}

@Setup(Level.Invocation)
Why does it have to be done before every benchmark function invocation and not once per trial / iteration?
It's a tradeoff. After each invocation there could be a few extra items left in the channel, and these can accumulate across iterations. I could rewrite runSendReceive to leave the channel with the same number of elements it came in with, but that would slightly affect the benchmark. Possibly negligibly, since it's only up to 4 items each time...
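For reference, a rough, untested sketch of that alternative (class name, prefill size, and element counts are placeholders, not the PR's code): the benchmark receives exactly as many elements as it sends, so the channel keeps its original size and the prefill can move from Level.Invocation to Level.Iteration.

import java.util.concurrent.TimeUnit
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.runBlocking
import org.openjdk.jmh.annotations.*
import org.openjdk.jmh.infra.Blackhole

@State(Scope.Benchmark)
open class BalancedChannelBenchmark {
    // Hypothetical fixed prefill; in the real benchmark this would be a @Param.
    private val prefill = 1_000_000

    lateinit var channel: Channel<Int>

    // Prefill once per iteration rather than once per invocation.
    @Setup(Level.Iteration)
    fun prepare() {
        channel = Channel<Int>(Channel.UNLIMITED)
        repeat(prefill) { channel.trySend(it) }
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    fun sendReceiveBalanced(blackhole: Blackhole) = runBlocking {
        // Send a few elements and take the same number back out,
        // so the channel ends the invocation with its original size.
        repeat(4) { channel.send(it) }
        repeat(4) { blackhole.consume(channel.receive()) }
    }
}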
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Fork(1)
open class ChannelBenchmark {
Could you please elaborate on what exactly you're trying to measure with these benchmarks?
Right now, it looks like "time required to create a new channel, send N messages into it (and, optionally, receive them), and then close the channel". However, I thought the initial idea was to measure the latency of sending (and receiving) a single message into the channel.
measure the latency of sending (and receiving) a single message into the channel
I do measure that, indirectly. Are you suggesting literally sending/receiving only one message per benchmark? Is that reliable?
Direct measurements are always better than indirect. If the goal is to measure send/recv timing, let's measure it.
What makes you think it will be unreliable?
Direct measurements are always better than indirect.
Not always. See below. Also depends on how you define "better".
If the goal is to measure send/recv timing, let's measure it.
Again, I am measuring that. My way of measuring is a valid way of measuring. Having to assert that makes me feel dismissed.
We can explore other ways to measure, for sure.
What makes you think it will be unreliable?
- Overhead of the measuring setup could be greater than the effect measured.
- Yet simplified setup might not capture a typical usage.
- Does not average over data structure amortization (e.g. our sent element could be the element which triggers the channel's internal data structure doubling / allocation) (or, on the contrary, the constant from amortization could be noticeable and we do in fact want to measure it)
- Does not average over GC
What setup did you have in mind, something like this?
@Benchmark
fun sendReceiveUnlimitedPrefilledSequential(wrapper: UnlimitedChannelWrapper, blackhole: Blackhole) =
    runBlocking {
        wrapper.channel.send(42)
        blackhole.consume(wrapper.channel.receive())
    }
Benchmark (count) (prefill) Mode Cnt Score Error Units
ChannelBenchmark.sendReceiveUnlimitedPrefilledSequential 0 0 avgt 10 53.959 ± 0.168 ns/op
ChannelBenchmark.sendReceiveUnlimitedPrefilledSequential 0 1000000 avgt 10 60.069 ± 1.345 ns/op
ChannelBenchmark.sendReceiveUnlimitedPrefilledSequential 0 100000000 avgt 10 71.457 ± 13.101 ns/op
Or this? (no suspension, trySend/tryReceive)
@Benchmark
fun sendReceiveUnlimitedPrefilledSequentialNoSuspension(wrapper: UnlimitedChannelWrapper, blackhole: Blackhole) {
    wrapper.channel.trySend(42)
    blackhole.consume(wrapper.channel.tryReceive().getOrThrow())
}
Benchmark (count) (prefill) Mode Cnt Score Error Units
ChannelBenchmark.sendReceiveUnlimitedPrefilledNoSuspension 0 0 avgt 10 10.619 ± 0.270 ns/op
ChannelBenchmark.sendReceiveUnlimitedPrefilledNoSuspension 0 1000000 avgt 10 10.859 ± 0.330 ns/op
ChannelBenchmark.sendReceiveUnlimitedPrefilledNoSuspension 0 100000000 avgt 10 17.163 ± 1.523 ns/op
The internal structure of the channel may be worth taking into account. For a prefilled channel with 32+ elements (32 is the default channel segment size), we can expect send and receive not to interact with one another at all, that is, the duration of send followed by receive should be roughly the sum of durations of send and receive, invoked independently. I imagine wrapper.channel.send(42) and blackhole.consume(wrapper.channel.receive()) could stay in different benchmarks without affecting the results too much.
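Roughly, that could look like this (a sketch only: the methods would live in the same benchmark class as the snippets above and reuse the PR's UnlimitedChannelWrapper; trySend/tryReceive are used to sidestep runBlocking, and over a long iteration the send-only benchmark grows the channel while the receive-only one drains it, so the prefill would need to be large or refreshed in setup):

@Benchmark
fun sendOnlyPrefilled(wrapper: UnlimitedChannelWrapper) {
    // With 32+ elements already buffered, sending should not interact with receivers.
    wrapper.channel.trySend(42)
}

@Benchmark
fun receiveOnlyPrefilled(wrapper: UnlimitedChannelWrapper, blackhole: Blackhole) {
    // Assumes the prefill is large enough that an element is always available.
    blackhole.consume(wrapper.channel.tryReceive().getOrThrow())
}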
For an empty channel, we could also try racing send and receive.
Using runBlocking in a benchmark that's only doing send doesn't seem optimal to me; I can imagine the run time getting dominated by the runBlocking machinery. I don't know what the proper way of doing this in JMH is, but I'd try a scheme like this:
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.atomic.AtomicReference
import java.util.concurrent.locks.LockSupport

internal class BenchmarkSynchronization {
    private val state = AtomicInteger(0)
    private val benchmarkThread = Thread.currentThread()
    private val threadDoingWork = AtomicReference<Thread?>()

    // Called from the benchmark thread: blocks until the worker thread has registered itself.
    fun awaitThreadAssignment(): Thread {
        assert(Thread.currentThread() === benchmarkThread)
        while (true) {
            val thread = threadDoingWork.get()
            if (thread != null) return thread
            LockSupport.parkNanos(Long.MAX_VALUE)
        }
    }

    // Called from the worker thread: registers itself and parks until runBenchmark signals the start.
    fun awaitStartSignal() {
        threadDoingWork.set(Thread.currentThread())
        LockSupport.unpark(benchmarkThread)
        while (state.get() == 0) {
            LockSupport.parkNanos(Long.MAX_VALUE)
        }
    }

    // Called from the worker thread once the benchmarked work is done.
    fun signalFinish() {
        state.set(2)
        LockSupport.unpark(benchmarkThread)
    }

    // Called from the benchmark thread: releases the worker and waits for it to finish.
    fun runBenchmark(thread: Thread) {
        state.set(1)
        LockSupport.unpark(thread)
        while (state.get() != 2) {
            LockSupport.parkNanos(Long.MAX_VALUE)
        }
    }
}

(I haven't actually tested the code.) Then, the scheme would be:
// preparation
val synchronization = BenchmarkSynchronization()
GlobalScope.launch {
    synchronization.awaitStartSignal()
    try {
        // actual benchmark code here
    } finally {
        synchronization.signalFinish()
    }
}
val threadDoingWork = synchronization.awaitThreadAssignment()

// the @Benchmark itself
wrapper.synchronization.runBenchmark(wrapper.threadDoingWork)

@fzhinkin, is there a standard mechanism that encapsulates this?
For an empty channel, we could also try racing send and receive.
?
Running them in parallel.
Direct measurements are always better than indirect.
Not always. See below. Also depends on how you define "better".
I mean, if our goal is to measure the latency of a certain operation and we have facilities to do so, then it's better to do it directly (to an extent; benchmark results are averages anyway). By doing so, we can ensure that all unnecessary setup and teardown code (like creating a channel) won't skew the results.
On the other hand, if the goal is to measure end-to-end latency, like "time to create a channel and send 100k messages over it", then sure, the current approach works for that (moreover, I don't see how to measure it otherwise).
If the goal is to measure send/recv timing, let's measure it.
Again, I am measuring that. My way of measuring is a valid way of measuring. Having to assert that makes me feel dismissed.
See the comment above. I was under the impression that the typical use case for a channel is to be used indirectly (within a flow, for example), so for channels as they are we decided to measure the latency of a single operation to see how it will be affected by potential changes in the implementation.
I'm not saying that the way you're measuring it is invalid, but if there are facilities to measure the latency of a single operation (well, the send-receive pair of operations), I'm voting for using them (unless there is evidence that such a measurement is impossible or makes no sense).
We can explore other ways to measure, for sure.
What makes you think it will be unreliable?
Overhead of the measuring setup could be greater than the effect measured.
Setup (and teardown) actions performed before (after) the whole run (or an individual iteration) should not affect measurements (as they are performed outside of the measurement scope); they will only affect the measurements when performed for each benchmark function invocation.
Yet simplified setup might not capture a typical usage.
I'm not sure if sending 400MB of data is a typical usage either. ;)
Does not average over data structure amortization (e.g. our sent element could be the element which triggers the channel's internal data structure doubling / allocation) (or, on the contrary, the constant from amortization could be noticeable and we do in fact want to measure it)
The benchmark function is continuously invoked over a configured period of time (you set it to 1 second).
If we reuse the same channel in each invocation, results will average over data structure amortization.
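Concretely, the reuse could look like this (an illustrative sketch with assumed names and a hardcoded prefill): the channel is created and prefilled once per trial and shared by every invocation, so occasional per-element costs such as segment allocation are averaged into the score.

@State(Scope.Benchmark)
open class ReusedChannelState {
    lateinit var channel: Channel<Int>

    // Created and prefilled once for the whole run; every benchmark invocation reuses it.
    @Setup(Level.Trial)
    fun createOnce() {
        channel = Channel<Int>(Channel.UNLIMITED)
        repeat(1024) { channel.trySend(it) }
    }
}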
Does not average over GC
It's easier to focus on memory footprint, as it is something we control directly (how many bytes we're allocating when performing an operation), rather than on GC pauses (they are subject to various factors).
What setup did you have in mind, something like this?
Both approaches look sane (assuming the wrapper is not recreated for every benchmark call) and we can do both.
@dkhalanskyjb, it feels like I didn't get you, but nevertheless: JMH provides some facilities for running benchmark methods concurrently and synchronizing their execution:
https://github.com/openjdk/jmh/blob/master/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_15_Asymmetric.java
https://github.com/openjdk/jmh/blob/master/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_17_SyncIterations.java
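For the "racing send and receive" case, the asymmetric sample would translate to something like this (a sketch with illustrative class and group names, using trySend/tryReceive to avoid suspension; imports are the same JMH and Channel imports as in the earlier snippets):

@State(Scope.Group)
open class SharedChannelState {
    val channel = Channel<Int>(Channel.UNLIMITED)
}

open class RacingChannelBenchmark {
    @Benchmark
    @Group("sendReceiveRace")
    @GroupThreads(1)
    fun sender(state: SharedChannelState) {
        state.channel.trySend(42)
    }

    @Benchmark
    @Group("sendReceiveRace")
    @GroupThreads(1)
    fun receiver(state: SharedChannelState, blackhole: Blackhole) {
        // The receive attempt may find the channel empty; consume the result either way.
        blackhole.consume(state.channel.tryReceive())
    }
}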
As for runBlocking, it would be nice to have a kx-benchmarks maintainer here who would solve the problem of benchmarking suspend APIs for us. Oh, wait... 😄
it feels like I didn't get you
Indeed, but that's because I had the wrong assumptions. I hadn't realized one important aspect:
Setup (and teardown) actions performed before (after) the whole run (or an individual iteration) should not affect measurements (as they are performed outside of the measurement scope); they will only affect the measurements when performed for each benchmark function invocation.
My goal was to reduce the influence of runBlocking on the performance, and to that end, I wanted to spawn the actual computation in a separate thread beforehand. In the benchmark itself, instead of wasting time on scheduling a computation, initializing and deinitializing structured concurrency, etc., it would only send a signal "you can actually start now" and then wait for the computation to signal "okay, done".
This is all moot if the preparatory work in @Setup(Level.Invocation) objects is included in the time measurements. It's not viable to start as many threads as there are computations beforehand.
This subset takes 10 min on my machine, and looks representative enough. [attachments: Original / Modified]
So… with the change, our channels are a bit more efficient?
Not really ready for review yet; I'm polishing bits up before making any statements. Sorry for posting here, I thought I had some reliable data and then got carried away.