Avoid atomic operations when the local queue is not empty
As previously found out by Thierry's test, the atomic integer
fredCounter
causes a bottleneck in scalability as it needs to be
decremented each time a processor tries to find a new fred to execute,
regardless of whether coordination with other processors is needed or
not. This is unnecessary when the local run queue is not empty, and
removing atomic access in this case would reduce the bottleneck observed
in the benchmark.
This commit removes the use of fredCounter
and instead introduces a
new atomic counter, spinningCounter
, that counts the number of
"spinning" processors instead. A processor enters the spinning state
when it runs out of local work and has to attempt to steal from other
processors -- and it does not increment or decrement the counter unless
it needs to transition into or out of the spinning state. A processor
goes to sleep (calls block()
) when it has spinned for a while and has
not been able to find any new work.
When a new task gets submitted, processors are only awakened
(unblock()
-ed) if none of them are spinning (spinningCounter
is
zero). This helps reduce the thundering herd effect when a burst of
workload enters the queue, which before this commit could cause too many
processors to be unblocked and start stealing from each other
unnecessarily. To compensate for the fact that we will most definitely
not unblock enough processors after the change (since we only ever
unblock 1 processor regardless of how many tasks are submitted at once),
each processor must unblock another processor if it finds a task and
exits the spinning state. This is the same logic used by Go 1, and by
inheritance, Tokio, since Tokio's scheduler is modelled after Go's.
This change has been tested with a TCP echo server, the threadtest demo with 1 contending lock, and Thierry's test that originally revealed the atomic's scalability issue. No significant performance regression has been observed in performance benchmarks, and the implementation seems to be correct (in the sense that sleeping / waking behavior is consistent) as far as current testing shows. Thierry's test run against libfibre after this change shows a significant improvement:
$ ./cycle -d 60 -t 1 -p 100
Running for 60.000000 seconds
Starting
60.0
Done
Duration (ms) : 60040
Number of processors : 100
Number of threads : 2
Cycle size (# thrds) : 2
Total Operations(ops): 251873
Total blocks : 251873
Ops per second : 4195.02
ns per ops : 238377.83
Ops per threads : 125936
Ops per procs : 2518
Ops/sec/procs : 41.95
ns per ops/procs : 23844693.66
$ ./cycle -d 60 -t 8 -p 800
Running for 60.000000 seconds
Starting
60.1
Done
Duration (ms) : 60051
Number of processors : 800
Number of threads : 16
Cycle size (# thrds) : 2
Total Operations(ops): 9576176
Total blocks : 9575103
Ops per second : 159466.69
ns per ops : 6270.90
Ops per threads : 598511
Ops per procs : 11970
Ops/sec/procs : 199.33
ns per ops/procs : 5016813.77
It is still not linear scaling, but that should be expected, given that work stealing still involves a lot of locking and atomics.
Note that this change has not been integrated into the statistics
system, and the original stats metrics for fredCounter
no longer
applies to the new counter. This should be fixed in a future commit.