Hello.
The main point of my patches is to introduce two separate cpumasks: one
for parallel workers and another for serial workers (callback CPUs).
This makes it possible to bind non-intersecting groups of CPUs to serial
and parallel workers and to tune the padata subsystem more finely.
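To illustrate what I mean by non-intersecting masks, here is a small
user-space sketch. It is only a model, not the padata API: the helper
name split_masks and the choice of CPUs 0 and 1 as callback CPUs are
hypothetical, and the masks are plain integers rather than struct cpumask.

```python
# User-space model only; this is NOT the padata interface.
# NR_CPUS matches the 16-core test machines; everything else is assumed.
NR_CPUS = 16


def split_masks(serial_cpus, nr_cpus=NR_CPUS):
    """Return (parallel_mask, serial_mask) as disjoint integer bitmasks.

    serial_cpus: CPU numbers reserved for serial (callback) workers.
    Every other CPU below nr_cpus goes into the parallel mask.
    """
    all_mask = (1 << nr_cpus) - 1
    serial_mask = 0
    for cpu in serial_cpus:
        if not 0 <= cpu < nr_cpus:
            raise ValueError(f"CPU {cpu} is out of range")
        serial_mask |= 1 << cpu

    # Exclude the callback CPUs from the parallel mask.
    parallel_mask = all_mask & ~serial_mask
    if parallel_mask == 0:
        raise ValueError("no CPUs left for parallel workers")
    return parallel_mask, serial_mask


# Hypothetical choice: CPUs 0 and 1 handle serialization only.
parallel_mask, serial_mask = split_masks([0, 1])
print(f"parallel=0x{parallel_mask:04x} serial=0x{serial_mask:04x}")
```

The two masks never share a bit, so no CPU has to run both a parallel
and a serial worker.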
My tests show that properly configured serial and parallel cpumasks
give roughly 15-20% better throughput. For example (aes-asm,
sha1-generic, two 16-core machines):
1) One point-to-point connection:
Unmodified padata gives ~650 Mbit/s of TCP and ~780 Mbit/s of UDP.
When I exclude the callback CPUs from the parallel cpumask, padata gives
~750 Mbit/s of TCP and ~900 Mbit/s of UDP.
2) Two IPsec tunnels between the 16-core machines and 4 clients
communicating with each other via the tunnels:
Unmodified padata gives ~1.5 Gbit/s of UDP.
padata with non-intersecting cpumasks for parallel and serial workers
gives ~1.8 Gbit/s.
Besides the performance gain, there may be situations where the serial
job takes a long time. For example, if I add several dozen firewall
rules, the serial worker runs slower, while padata_do_parallel keeps
enqueuing requests into the parallel queue of the CPU that the serial
worker executes on.
This may significantly slow down parallelization and reordering, because
one CPU (the one shared by both parallel and serial workers) will always
have more requests in its parallel queue than the other CPUs (because
serialization takes a long time). In such cases the user may exclude the
callback CPUs from the cpumask for parallel workers.
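
The backlog effect can be sketched with a toy model. All of it is
assumed for illustration: 4 CPUs, 3 requests arriving per tick and
spread round-robin over the parallel CPUs, and a callback CPU that
spends every second tick on serialization. It is not how padata
schedules work internally; it only shows why the shared CPU falls behind.

```python
# Toy model of parallel queue lengths; all parameters are assumptions.
def simulate(parallel_cpus, serial_cpu, ticks=100,
             arrivals_per_tick=3, serial_every=2):
    """Return {cpu: queued requests} after the given number of ticks."""
    queues = {cpu: 0 for cpu in parallel_cpus}
    rr = 0  # round-robin cursor over the parallel CPUs
    for tick in range(ticks):
        # Enqueue new requests round-robin over the parallel CPUs.
        for _ in range(arrivals_per_tick):
            cpu = parallel_cpus[rr % len(parallel_cpus)]
            queues[cpu] += 1
            rr += 1
        # Each CPU handles one parallel request per tick, unless it is
        # the callback CPU and is busy with serialization on this tick.
        for cpu in parallel_cpus:
            busy_serializing = (cpu == serial_cpu
                                and tick % serial_every == 0)
            if queues[cpu] > 0 and not busy_serializing:
                queues[cpu] -= 1
    return queues


# CPU 0 is the callback CPU in both runs.
shared = simulate(parallel_cpus=[0, 1, 2, 3], serial_cpu=0)
excluded = simulate(parallel_cpus=[1, 2, 3], serial_cpu=0)
print("shared:  ", shared)
print("excluded:", excluded)
```

In this model, CPU 0 ends up with 25 queued requests when it is shared
while the other CPUs stay empty; with CPU 0 excluded from the parallel
set, no queue builds up at the same offered load.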
--
W.B.R.
Dan Kruchinin