2011-04-05 15:31:27

by Peter Zijlstra

Subject: [PATCH 00/21] sched: Reduce runqueue lock contention -v6

This patch series aims to optimize remote wakeups by moving most of the
work of the wakeup to the remote cpu and avoiding bouncing runqueue data
structures where possible.
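
To illustrate the idea, a rough userspace sketch (plain pthreads, not
the actual kernel code; all the names here are made up): the waking cpu
pushes the task onto a lock-free per-cpu list and sends the target an
IPI, and the target then finishes the activation under its own runqueue
lock:

#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stdio.h>

struct wakee {
        struct wakee *next;
        int id;
};

struct target {
        _Atomic(struct wakee *) wake_list; /* lock-free "remote wakeup" list */
        pthread_mutex_t rq_lock;           /* stand-in for the runqueue lock */
        sem_t ipi;                         /* stand-in for the resched IPI */
};

/* Waker side: one atomic push and a poke; the remote "rq_lock" is never taken. */
static void queue_remote_wakeup(struct target *t, struct wakee *w)
{
        struct wakee *old = atomic_load(&t->wake_list);

        do {
                w->next = old;
        } while (!atomic_compare_exchange_weak(&t->wake_list, &old, w));

        sem_post(&t->ipi); /* "send the IPI" */
}

/* Target side: drain the list and do the actual activation locally. */
static void *target_cpu(void *arg)
{
        struct target *t = arg;
        int seen = 0;

        while (seen < 4) {
                struct wakee *list;

                sem_wait(&t->ipi);
                list = atomic_exchange(&t->wake_list, (struct wakee *)NULL);

                pthread_mutex_lock(&t->rq_lock); /* local, uncontended */
                for (struct wakee *w = list; w; w = w->next) {
                        printf("target: activating wakee %d\n", w->id);
                        seen++;
                }
                pthread_mutex_unlock(&t->rq_lock);
        }
        return NULL;
}

int main(void)
{
        struct target t = { .wake_list = NULL };
        struct wakee w[4];
        pthread_t tid;

        pthread_mutex_init(&t.rq_lock, NULL);
        sem_init(&t.ipi, 0, 0);
        pthread_create(&tid, NULL, target_cpu, &t);

        for (int i = 0; i < 4; i++) {
                w[i].id = i;
                queue_remote_wakeup(&t, &w[i]);
        }

        pthread_join(tid, NULL);
        return 0;
}

The point is that the only cross-cpu traffic is the list push and the
IPI itself; the runqueue lock is only ever taken by the cpu that owns
it. (Builds with gcc -pthread.)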

As measured by sembench (which basically creates a wakeup storm) on my
dual-socket westmere:

$ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > $i; done
$ echo 4096 32000 64 128 > /proc/sys/kernel/sem
$ ./sembench -t 2048 -w 1900 -o 0
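
(The four values written to /proc/sys/kernel/sem are SEMMSL, SEMMNS,
SEMOPM and SEMMNI, that is: max semaphores per set, max semaphores
system-wide, max operations per semop() call, and max number of sets;
bumped here so sembench has enough semaphore headroom for its 2048
threads.)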

unpatched: run time 30 seconds 647278 worker burns per second
patched: run time 30 seconds 816715 worker burns per second

I've queued this series for .40.


2011-04-05 16:00:44

by Peter Zijlstra

Subject: Re: [PATCH 00/21] sched: Reduce runqueue lock contention -v6

On Tue, 2011-04-05 at 17:23 +0200, Peter Zijlstra wrote:
>
> unpatched: run time 30 seconds 647278 worker burns per second
> patched: run time 30 seconds 816715 worker burns per second

Obviously bigger is better :-); the above means 26% more wakeups
processed in the 30 seconds.
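(816715 / 647278 ≈ 1.262, so roughly a 26% increase in worker burns
per second.)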

2011-04-06 11:00:50

by Peter Zijlstra

Subject: Re: [PATCH 00/21] sched: Reduce runqueue lock contention -v6

On Tue, 2011-04-05 at 17:23 +0200, Peter Zijlstra wrote:
> This patch series aims to optimize remote wakeups by moving most of the
> work of the wakeup to the remote cpu and avoiding bouncing runqueue data
> structures where possible.
>
> As measured by sembench (which basically creates a wakeup storm) on my
> dual-socket westmere:
>
> $ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > $i; done
> $ echo 4096 32000 64 128 > /proc/sys/kernel/sem
> $ ./sembench -t 2048 -w 1900 -o 0
>
> unpatched: run time 30 seconds 647278 worker burns per second
> patched: run time 30 seconds 816715 worker burns per second
>
> I've queued this series for .40.

Full diffstat, per request:

---
arch/alpha/kernel/smp.c | 3 +-
arch/arm/kernel/smp.c | 5 +-
arch/blackfin/mach-common/smp.c | 3 +
arch/cris/arch-v32/kernel/smp.c | 13 +-
arch/ia64/kernel/irq_ia64.c | 2 +
arch/ia64/xen/irq_xen.c | 10 +-
arch/m32r/kernel/smp.c | 4 +-
arch/mips/cavium-octeon/smp.c | 2 +
arch/mips/kernel/smtc.c | 2 +-
arch/mips/mti-malta/malta-int.c | 2 +
arch/mips/pmc-sierra/yosemite/smp.c | 4 +
arch/mips/sgi-ip27/ip27-irq.c | 2 +
arch/mips/sibyte/bcm1480/smp.c | 7 +-
arch/mips/sibyte/sb1250/smp.c | 7 +-
arch/mn10300/kernel/smp.c | 5 +-
arch/parisc/kernel/smp.c | 5 +-
arch/powerpc/kernel/smp.c | 4 +-
arch/s390/kernel/smp.c | 6 +-
arch/sh/kernel/smp.c | 2 +
arch/sparc/kernel/smp_32.c | 4 +-
arch/sparc/kernel/smp_64.c | 1 +
arch/tile/kernel/smp.c | 6 +-
arch/um/kernel/smp.c | 2 +-
arch/x86/kernel/smp.c | 5 +-
arch/x86/xen/smp.c | 5 +-
include/linux/mutex.h | 2 +-
include/linux/sched.h | 23 +-
init/Kconfig | 5 +
kernel/mutex-debug.c | 2 +-
kernel/mutex-debug.h | 2 +-
kernel/mutex.c | 2 +-
kernel/mutex.h | 2 +-
kernel/sched.c | 622 +++++++++++++++++++----------------
kernel/sched_debug.c | 2 +-
kernel/sched_fair.c | 23 ++-
kernel/sched_features.h | 6 +
kernel/sched_idletask.c | 2 +-
kernel/sched_rt.c | 54 ++--
kernel/sched_stoptask.c | 5 +-
39 files changed, 483 insertions(+), 380 deletions(-)

2011-04-27 16:55:21

by Dave Kleikamp

Subject: Re: [PATCH 00/21] sched: Reduce runqueue lock contention -v6

On 04/05/2011 10:23 AM, Peter Zijlstra wrote:
> This patch series aims to optimize remote wakeups by moving most of the
> work of the wakeup to the remote cpu and avoiding bouncing runqueue data
> structures where possible.
>
> As measured by sembench (which basically creates a wakeup storm) on my
> dual-socket westmere:
>
> $ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > $i; done
> $ echo 4096 32000 64 128 > /proc/sys/kernel/sem
> $ ./sembench -t 2048 -w 1900 -o 0
>
> unpatched: run time 30 seconds 647278 worker burns per second
> patched: run time 30 seconds 816715 worker burns per second
>
> I've queued this series for .40.

Here are the results of running sembench on a 128 cpu box. In all of the
below cases, I had to use the kernel parameter idle=mwait to eliminate
spinlock contention in clockevents_notify() in the idle loop. I'll try
to track down what can be done about that later.
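
(For reference, idle=mwait is just a boot-time kernel parameter, i.e.
it gets appended to the kernel line of the boot loader entry; the image
name and root device below are placeholders:

kernel /vmlinuz-2.6.39-rc3 ro root=/dev/sda1 idle=mwait
)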

I took Peter's patches from the tip/sched/locking tree. I got similar
results directly from that branch, but separated them out to try to
isolate some irregular behavior that mostly went away when I added
idle=mwait. Since that branch was on top of 2.6.39-rc3, I used that
as a base.

The other patchset in play is Chris Mason's semtimedop optimization
patches. By themselves, I didn't see an improvement with Chris' patches,
but in conjunction with Peter's, they gave the best results. When
combining the patches, I removed Chris' batched wakeup patch, since it
conflicted with Peter's patchset and really isn't needed any more.

(It's been a while since Chris posted these. They are in the
"unbreakable" git tree,
http://oss.oracle.com/git/?p=linux-2.6-unbreakable.git;a=summary ,
and ported easily to mainline. I can repost them.)

I used Chris's latest sembench, http://oss.oracle.com/~mason/sembench.c
and the command "./sembench -t 2048 -w 1900 -o 0". I got similar
burns-per-second numbers when cranking up the parameters to
"./sembench -t 16384 -w 15000 -o 0".


2.6.38:

2048 threads, waking 1900 at a time
using ipc sem operations
main thread burns: 6549
worker burn count total 12443100 min 6068 max 6105 avg 6075
run time 30 seconds 414770 worker burns per second

2.6.39-rc3:

worker burn count total 11876900 min 5791 max 5805 avg 5799
run time 30 seconds 395896 worker burns per second

2.6.39-rc3 + mason's semtimedop patches:

worker burn count total 9988300 min 4868 max 4896 avg 4877
run time 30 seconds 332943 worker burns per second

2.6.39-rc3 + mason's patches (no batch wakeup patch):

worker burn count total 9743200 min 4750 max 4786 avg 4757
run time 30 seconds 324773 worker burns per second

2.6.39-rc3 + peterz's patches:

worker burn count total 14430500 min 7038 max 7060 avg 7046
run time 30 seconds 481016 worker burns per second

2.6.39-rc3 + mason's patches + peterz's patches:

worker burn count total 15072700 min 7348 max 7381 avg 7359
run time 30 seconds 502423 worker burns per second