Avoid atomic operations when the local queue is not empty (!5) · Merge requests · Martin Karsten / libfibre

Peter Cai requested to merge p5cai/libfibre:patch-remove-atomic into master May 31, 2022

As previously found out by Thierry's test, the atomic integer fredCounter causes a bottleneck in scalability as it needs to be decremented each time a processor tries to find a new fred to execute, regardless of whether coordination with other processors is needed or not. This is unnecessary when the local run queue is not empty, and removing atomic access in this case would reduce the bottleneck observed in the benchmark.

This commit removes the use of fredCounter and instead introduces a new atomic counter, spinningCounter, that counts the number of "spinning" processors instead. A processor enters the spinning state when it runs out of local work and has to attempt to steal from other processors -- and it does not increment or decrement the counter unless it needs to transition into or out of the spinning state. A processor goes to sleep (calls block()) when it has spinned for a while and has not been able to find any new work.

When a new task gets submitted, processors are only awakened (unblock()-ed) if none of them are spinning (spinningCounter is zero). This helps reduce the thundering herd effect when a burst of workload enters the queue, which before this commit could cause too many processors to be unblocked and start stealing from each other unnecessarily. To compensate for the fact that we will most definitely not unblock enough processors after the change (since we only ever unblock 1 processor regardless of how many tasks are submitted at once), each processor must unblock another processor if it finds a task and exits the spinning state. This is the same logic used by Go 1, and by inheritance, Tokio, since Tokio's scheduler is modelled after Go's.

This change has been tested with a TCP echo server, the threadtest demo with 1 contending lock, and Thierry's test that originally revealed the atomic's scalability issue. No significant performance regression has been observed in performance benchmarks, and the implementation seems to be correct (in the sense that sleeping / waking behavior is consistent) as far as current testing shows. Thierry's test run against libfibre after this change shows a significant improvement:

$ ./cycle -d 60 -t 1 -p 100
Running for 60.000000 seconds
Starting
 60.0
Done
Duration (ms)        : 60040
Number of processors : 100
Number of threads    : 2
Cycle size (# thrds) : 2
Total Operations(ops):          251873
Total blocks         :          251873
Ops per second       :            4195.02
ns per ops           :          238377.83
Ops per threads      :          125936
Ops per procs        :            2518
Ops/sec/procs        :              41.95
ns per ops/procs     :        23844693.66
$ ./cycle -d 60 -t 8 -p 800
Running for 60.000000 seconds
Starting
 60.1
Done
Duration (ms)        : 60051
Number of processors : 800
Number of threads    : 16
Cycle size (# thrds) : 2
Total Operations(ops):         9576176
Total blocks         :         9575103
Ops per second       :          159466.69
ns per ops           :            6270.90
Ops per threads      :          598511
Ops per procs        :           11970
Ops/sec/procs        :             199.33
ns per ops/procs     :         5016813.77

It is still not linear scaling, but that should be expected, given that work stealing still involves a lot of locking and atomics.

Note that this change has not been integrated into the statistics system, and the original stats metrics for fredCounter no longer applies to the new counter. This should be fixed in a future commit.

Edited May 31, 2022 by Peter Cai

Avoid atomic operations when the local queue is not empty

Merge request reports