(switched to email. Please respond via emailed reply-to-all, not via the
bugzilla web interface).
On Wed, 28 Jan 2009 06:35:20 -0800 (PST)
[email protected] wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12562
>
> Summary: High overhead while switching or synchronizing threads
> on different cores
Thanks for the report, and the testcase.
> Product: Process Management
> Version: 2.5
> KernelVersion: 2.6.28
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: normal
> Priority: P1
> Component: Scheduler
> AssignedTo: [email protected]
> ReportedBy: [email protected]
(There's testcase code in the bugzilla report)
(Seems to be a regression)
>
> Hardware Environment: Core2Duo 2.4GHz / 4GB RAM
> Software Environment: Ubuntu 8.10 + Vanilla 2.6.28
>
> Hardware Environment: AMD64 X2 2.1GHz / 6GB RAM
> Software Environment: Ubuntu 8.10 + Vanilla 2.6.28.2
>
> Problem Description:
> The overhead on a dual core while switching between tasks is extremely high
> (>60% of cputime). It is produced by synchronization with pthread
> mutex/cond.
>
> Executing the attached program schedulingissue 1 1024 8 20 creates a
> producer and a consumer thread with eight 8kB buffers. The producer creates
> 1024 randomly generated double values; the consumer does the same after
> receiving the buffer.
>
> While executing the program, the throughput is ~1.6 msg/s. While executing two
> instances of the program, the throughput is much higher (2 * 8.7 msg/s = 17.4
> msg/s).
>
> There is a small improvement when using jiffies as the clocksource instead of
> acpi_pm or hpet (1.8 msg/s instead of 1.6). Disabling NO_HZ and
> HIGH_RESOLUTION_TIME gives no improvement. Performance is much higher with
> kernels <= 2.6.24, but still four times slower.
Unclear. What is four times slower than what? You're saying that the
app progresses four times faster when there are two instances of it
running, rather than one instance?
> ---------------------------------------
> Linux bugs-laptop 2.6.28-hz-hrt #4 SMP Wed Jan 28 13:33:18 CET 2009 x86_64
> GNU/Linux
> acpi_pm (equal with hpet)
> schedulerissue 1 1024 8 20
> All threads finished: 20 messages in 12.295 seconds / 1.627 msg/s
> schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
> All threads finished: 200 messages in 22.882 seconds / 8.741 msg/s
> All threads finished: 200 messages in 22.934 seconds / 8.721 msg/s
> ---------------------------------------
> Linux bugs-laptop 2.6.28-hz-hrt #4 SMP Wed Jan 28 13:33:18 CET 2009 x86_64
> GNU/Linux
> jiffies
> schedulerissue 1 1024 8 20
> All threads finished: 20 messages in 10.704 seconds / 1.868 msg/s
> schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
> All threads finished: 200 messages in 23.372 seconds / 8.557 msg/s
> All threads finished: 200 messages in 23.460 seconds / 8.525 msg/s
> --------------------------------------
> Linux bugs-laptop 2.6.24.7 #1 SMP Wed Jan 14 10:21:04 CET 2009 x86_64 GNU/Linux
> hpet
> schedulerissue 1 1024 8 20
> All threads finished: 20 messages in 5.290 seconds / 3.781 msg/s
> schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
> All threads finished: 200 messages in 23.000 seconds / 8.695 msg/s
> All threads finished: 200 messages in 23.078 seconds / 8.666 msg/s
>
Seems that 2.6.24 is faster than 2.6.28 with 20 messages, but 2.6.24
and 2.6.28 run at the same speed when 200 messages are sent?
If so, that seems rather odd, doesn't it? Is it possible that cpufreq
does something bad once the CPU gets hot?
> AMD64 X2 @ 2.1GHz
> Linux bugs-desktop 2.6.28.2 #4 SMP Mon Jan 26 20:26:12 CET 2009 x86_64
> GNU/Linux
> acpi_pm
> schedulerissue 1 1024 8 20
> All threads finished: 20 messages in 9.288 seconds / 2.153 msg/s
> schedulerissue 1 1024 8 200
> All threads finished: 200 messages in 17.049 seconds / 11.731 msg/s
> All threads finished: 200 messages in 18.539 seconds / 10.788 msg/s
On Wed, 2009-01-28 at 12:56 -0800, Andrew Morton wrote:
> (switched to email. Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Wed, 28 Jan 2009 06:35:20 -0800 (PST)
> [email protected] wrote:
>
> > http://bugzilla.kernel.org/show_bug.cgi?id=12562
> >
> > Summary: High overhead while switching or synchronizing threads
> > on different cores
>
> Thanks for the report, and the testcase.
>
> > Product: Process Management
> > Version: 2.5
> > KernelVersion: 2.6.28
> > Platform: All
> > OS/Version: Linux
> > Tree: Mainline
> > Status: NEW
> > Severity: normal
> > Priority: P1
> > Component: Scheduler
> > AssignedTo: [email protected]
> > ReportedBy: [email protected]
>
> (There's testcase code in the bugzilla report)
>
> (Seems to be a regression)
Is there a known good kernel?
> >
> > Hardware Environment: Core2Duo 2.4GHz / 4GB RAM
> > Software Environment: Ubuntu 8.10 + Vanilla 2.6.28
> >
> > Hardware Environment: AMD64 X2 2.1GHz / 6GB RAM
> > Software Environment: Ubuntu 8.10 + Vanilla 2.6.28.2
> >
> > Problem Description:
> > The overhead on a dual core while switching between tasks is extremely high
> > (>60% of cputime). It is produced by synchronization with pthread
> > mutex/cond.
> >
> > Executing the attached program schedulingissue 1 1024 8 20 creates a
> > producer and a consumer thread with eight 8kB buffers. The producer creates
> > 1024 randomly generated double values; the consumer does the same after
> > receiving the buffer.
> >
> > While executing the program, the throughput is ~1.6 msg/s. While executing two
> > instances of the program, the throughput is much higher (2 * 8.7 msg/s = 17.4
> > msg/s).
> >
> > There is a small improvement when using jiffies as the clocksource instead of
> > acpi_pm or hpet (1.8 msg/s instead of 1.6). Disabling NO_HZ and
> > HIGH_RESOLUTION_TIME gives no improvement. Performance is much higher with
> > kernels <= 2.6.24, but still four times slower.
>
> Unclear. What is four times slower than what? You're saying that the
> app progresses four times faster when there are two instances of it
> running, rather than one instance?
It seems that way indeed, a bit more clarity would be good though.
> > ---------------------------------------
> > Linux bugs-laptop 2.6.28-hz-hrt #4 SMP Wed Jan 28 13:33:18 CET 2009 x86_64
> > GNU/Linux
> > acpi_pm (equal with hpet)
> > schedulerissue 1 1024 8 20
> > All threads finished: 20 messages in 12.295 seconds / 1.627 msg/s
> > schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
> > All threads finished: 200 messages in 22.882 seconds / 8.741 msg/s
> > All threads finished: 200 messages in 22.934 seconds / 8.721 msg/s
> > ---------------------------------------
> > Linux bugs-laptop 2.6.28-hz-hrt #4 SMP Wed Jan 28 13:33:18 CET 2009 x86_64
> > GNU/Linux
> > jiffies
> > schedulerissue 1 1024 8 20
> > All threads finished: 20 messages in 10.704 seconds / 1.868 msg/s
> > schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
> > All threads finished: 200 messages in 23.372 seconds / 8.557 msg/s
> > All threads finished: 200 messages in 23.460 seconds / 8.525 msg/s
> > --------------------------------------
> > Linux bugs-laptop 2.6.24.7 #1 SMP Wed Jan 14 10:21:04 CET 2009 x86_64 GNU/Linux
> > hpet
> > schedulerissue 1 1024 8 20
> > All threads finished: 20 messages in 5.290 seconds / 3.781 msg/s
> > schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
> > All threads finished: 200 messages in 23.000 seconds / 8.695 msg/s
> > All threads finished: 200 messages in 23.078 seconds / 8.666 msg/s
> >
>
> Seems that 2.6.24 is faster than 2.6.28 with 20 messages, but 2.6.24
> and 2.6.28 run at the same speed when 200 messages are sent?
>
> If so, that seems rather odd, doesn't it? Is it possible that cpufreq
> does something bad once the CPU gets hot?
Nah, I'll bet it's a cache affinity issue.
Some applications like strong wakeup affinity, others not so much. This looks
to be a lover.
With a single instance, the producer and consumer get scheduled on two
different cores for some reason (maybe wake idle too strong).
With two instances, they get to stay on the same cpu, since the other
cpu is already busy.
I'll start up the browser in the morning to download this proglet and
poke at it some, but sleep comes first.
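(A quick way to check that hypothesis would be to pin a single instance onto
one core and see whether the throughput comes back, e.g. something like
schedtool -a 1 -e ./schedulingissue 1 1024 8 20, or taskset -c 0 with the same
arguments. If wake placement is what hurts, the pinned run should be much
faster than the unpinned one.)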
On Wednesday, 28.01.2009, 12:56 -0800, Andrew Morton wrote:
> (There's testcase code in the bugzilla report)
>
> (Seems to be a regression)
There is a regression, because of the improved cpu switching. The problem exists in every kernel.
It takes a lot of time to switch between the threads when they are executed on different cores.
Perhaps because of the big buffer size of 512KB?
> > Small improvement while using jiffies as clocksource instead of acpi_pm or hpet
> > (1.8 messages instead of 1.6). Disabling NO_HZ and HIGH_RESOLUTION_TIME gives
> > no improvement. Much higher performance with kernel <= 2.6.24, but still four
> > times slower.
>
> Unclear. What is four times slower than what? You're saying that the
> app progresses four times faster when there are two instances of it
> running, rather than one instance?
About 4 messages per second while executing only one instance, and
about 8 messages per second while executing two instances of the test.
It reaches 16 messages per second when the two threads of an instance
are executed on only one core.
> Seems that 2.6.24 is faster than 2.6.28 with 20 messages, but 2.6.24
> and 2.6.28 run at the same speed when 200 messages are sent?
I have executed the test twenty times. It stays constant on 2.6.28. On
2.6.24, one in ten runs is slower.
******* kernel 2.6.28:
All threads finished: 20 messages in 12.853 seconds / 1.556 msg/s
real 0m12.857s
user 0m8.589s
sys 0m16.629s
******* kernel 2.6.24:
All threads finished: 20 messages in 4.939 seconds / 4.050 msg/s
real 0m4.942s
user 0m5.248s
sys 0m4.352s
One in ten executions drops to 1.806 msg/s.
All threads finished: 20 messages in 11.074 seconds / 1.806 msg/s
real 0m11.077s
user 0m8.817s
sys 0m12.925s
> If so, that seems rather odd, doesn't it? Is it possible that cpufreq
> does something bad once the CPU gets hot?
I have disabled acpid, clocked the cpu to 2.4GHz and watched the
temperature of the cores and the frequency. The clock always stays at
2.4GHz and the temperature is always below 67°C. My cpu only clocks down
at 95°C.
On Wed, 2009-01-28 at 23:25 +0100, Thomas Pilarski wrote:
> On Wednesday, 28.01.2009, 12:56 -0800, Andrew Morton wrote:
>
> > (There's testcase code in the bugzilla report)
> >
> > (Seems to be a regression)
>
> There is a regression, because of the improved cpu switching. The
> problem exists in every kernel.
This is a contradiction in terms - twice.
If it is a regression, then clearly things haven't improved.
If it is a regression, state clearly when it worked last. If it never
worked, it cannot be a regression.
> It takes a lot of time to switch between the threads when they are
> executed on different cores.
> Perhaps because of the big buffer size of 512KB?
Of course, pushing 512kb to another cpu means lots and lots of cache
misses.
> > There is a regression, because of the improved cpu switching. The
> > problem exists in every kernel.
>
> This is a contradiction in terms - twice.
>
> If it is a regression, then clearly things haven't improved.
>
> If it is a regression, state clearly when it worked last. If it never
> worked, it cannot be a regression.
There is an improvement in load balancing for single-threaded
applications. It's a regression for my problem. But the problem exists
in every kernel I have tested.
> > It takes a lot of time to switch between the threads when they are
> > executed on different cores.
> > Perhaps because of the big buffer size of 512KB?
>
> Of course, pushing 512kb to another cpu means lots and lots of cache
> misses.
I have tried 2.6.15, 2.6.18 and 2.6.20 too, with the same behavior as in
2.6.24.
With Windows I can get 64 messages per second with a buffer size of
512KB. It is reduced to 16 messages with a buffer size of 1MB. But I think
it is not really comparable, because there is nearly no cpu consumption
with 512kB. Perhaps random() works differently. By increasing the cpu
usage in the producer eight times, I can get 16 msg/s and both cores are
used at about ~50%. Doing the same with Linux I get a throughput of
~2 msg/s.
If it is a caching issue, shouldn't it exist in Windows too?
With a smaller buffer of 4KB, the test is executed on one core only.
./schedulerissue 1 4096 8 2000
All threads finished: 2000 messages in 1.631 seconds / 1226.076 msg/s
real 0m1.635s
user 0m1.352s
sys 0m0.052s
But I want to use both cores to increase the performance. Adding a
second producer and a second consumer reduces the performance to 33%.
Both cores are used.
./schedulerissue 2 4096 8 2000
All threads finished: 1999 messages in 4.744 seconds / 421.379 msg/s
real 0m4.748s
user 0m3.280s
sys 0m5.852s
I have added a new version as there was a possible deadlock during
shut-down.
Some explanation of the test program.
./schedulerissue 1 4096 8 2000
1 producer and 1 consumer
buffer size of 4096 doubles * 8 bytes
8 buffers (256kB total buffer)
2000 messages
./schedulerissue 2 4096 8 2000
2 producers and 2 consumers
buffer size of 4096 doubles * 8 bytes
8 buffers (256kB total buffer)
2000 messages
It was not 512KB in the test before, but 4MB.
But there is the same problem with a total buffer size of 48kB and 4
threads (./schedulerissue 2 2048 3 20000).
On Thu, 2009-01-29 at 11:24 +0100, Thomas Pilarski wrote:
> Some explanation of the test program.
>
> ./schedulerissue 1 4096 8 2000
> 1 producer and 1 consumer
> buffer size of 4096 doubles * 8 bytes
> 8 buffers (256kB total buffer)
> 2000 messages
>
> ./schedulerissue 2 4096 8 2000
> 2 producers and 2 consumers
> buffer size of 4096 doubles * 8 bytes
> 8 buffers (256kB total buffer)
> 2000 messages
>
>
> It was not 512KB in the test before, but 4MB.
> But there is the same problem with a total buffer size of 48kB and 4
> threads (./schedulerissue 2 2048 3 20000).
Right, read the proglet (and removed that usleep(1)) and am poking at
it.
On Thu, 2009-01-29 at 11:24 +0100, Thomas Pilarski wrote:
> Some explanation of the test program.
>
> ./schedulerissue 1 4096 8 2000
> 1 producer and 1 consumer
> buffer size of 4096 doubles * 8 bytes
> 8 buffers (256kB total buffer)
> 2000 messages
>
> ./schedulerissue 2 4096 8 2000
> 2 producers and 2 consumers
> buffer size of 4096 doubles * 8 bytes
> 8 buffers (256kB total buffer)
> 2000 messages
>
>
> It was not 512KB in the test before, but 4MB.
> But there is the same problem with a total buffer size of 48kB and 4
> threads (./schedulerissue 2 2048 3 20000).
Linux opteron 2.6.29-rc3-tip #61 SMP PREEMPT Thu Jan 29 11:59:15 CET
2009 x86_64 x86_64 x86_64 GNU/Linux
[root@opteron bench]# schedtool -a 1 -e ./ThreadSchedulingIssue 1 4096 8 20000
All threads finished: 19992 messages in 6.485 seconds / 3082.877 msg/s
[root@opteron bench]# ./ThreadSchedulingIssue 1 4096 8 20000
All threads finished: 19992 messages in 6.496 seconds / 3077.604 msg/s
[root@opteron bench]# ./ThreadSchedulingIssue 1 4096 8 20000 & ./ThreadSchedulingIssue 1 4096 8 20000 &
[1] 10314
[2] 10315
[root@opteron bench]# All threads finished: 19992 messages in 6.720 seconds / 2975.009 msg/s
All threads finished: 19992 messages in 6.792 seconds / 2943.574 msg/s
[1]- Done ./ThreadSchedulingIssue 1 4096 8 20000
[2]+ Done ./ThreadSchedulingIssue 1 4096 8 20000
[root@opteron bench]# ./ThreadSchedulingIssue 2 4096 8 20000
All threads finished: 19992 messages in 17.299 seconds / 1155.667 msg/s
[root@opteron bench]# for i in 4 8 16 32 64 128 256 ; do
> echo -n $((i*1024)) $((80000/i)) " " ;
> schedtool -a 1 -e ./ThreadSchedulingIssue 1 $((i*1024)) 8 $((80000/i)) ;
> done
4096 20000 All threads finished: 19992 messages in 6.368 seconds / 3139.251 msg/s
8192 10000 All threads finished: 9992 messages in 5.363 seconds / 1863.083 msg/s
16384 5000 All threads finished: 4992 messages in 5.471 seconds / 912.479 msg/s
32768 2500 All threads finished: 2493 messages in 5.730 seconds / 435.059 msg/s
65536 1250 All threads finished: 1242 messages in 5.544 seconds / 224.021 msg/s
131072 625 All threads finished: 617 messages in 5.755 seconds / 107.217 msg/s
262144 312 All threads finished: 305 messages in 6.014 seconds / 50.713 msg/s
[root@opteron bench]# for i in 4 8 16 32 64 128 256 ; do
> echo -n $((i*1024)) $((80000/i)) " " ;
> ./ThreadSchedulingIssue 1 $((i*1024)) 8 $((80000/i)) ;
> done
4096 20000 All threads finished: 19992 messages in 6.462 seconds / 3093.717 msg/s
8192 10000 All threads finished: 9992 messages in 8.767 seconds / 1139.738 msg/s
16384 5000 All threads finished: 5000 messages in 5.366 seconds / 931.798 msg/s
32768 2500 All threads finished: 2494 messages in 20.720 seconds / 120.369 msg/s
65536 1250 All threads finished: 1242 messages in 11.521 seconds / 107.805 msg/s
131072 625 All threads finished: 618 messages in 14.035 seconds / 44.032 msg/s
262144 312 All threads finished: 305 messages in 17.342 seconds / 17.587 msg/s
The above point between 16 and 32 is exactly where the total working set
(8 buffers of 32768 doubles is already 2MB) stops fitting into cache -- I
suspect that pushes the producer's latency to go to sleep over the edge and
everything collapses.
We use wakeup patterns to determine if two tasks are working together
and should thus be kept together.
Task A should wake up B, and B should wake up A. Furthermore, any task
should quickly go to sleep after waking up the other.
This program does neither: with a single pair, the producer continues
production after waking the consumer (until the queue is filled --
which, if the consumer is fast enough, might never happen).
With multiple pairs there is no strict pair relation at all, since they
all work on the same global buffer queue, so P1 can wake Cn etc.
Furthermore the program uses shared memory (not a bad design), and thus
misses out on the explicit affinity hints of pipes, sockets, etc.
In short this program is carefully crafted to defeat all our affinity
tests - and I'm not sure what to do.
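For the record, the producer half of such a testcase presumably boils down to
the classic bounded-buffer loop below (a single-pair sketch with made-up names,
not the actual attachment). Note how, after pthread_cond_signal(), the producer
goes straight back to filling the next buffer instead of sleeping, so the
"A wakes B, B wakes A, and both sleep soon after" pattern the heuristics look
for never shows up:

#include <pthread.h>
#include <stdlib.h>

#define NBUF 8                     /* buffers in the ring */
#define NDBL 1024                  /* doubles per buffer */

static double buf[NBUF][NDBL];
static int fill, used;             /* ring index and fill count */
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
        (void)arg;
        for (;;) {
                pthread_mutex_lock(&lock);
                while (used == NBUF)            /* only sleeps when the ring is full */
                        pthread_cond_wait(&not_full, &lock);
                pthread_mutex_unlock(&lock);

                for (int i = 0; i < NDBL; i++)  /* the actual work */
                        buf[fill][i] = (double)random();

                pthread_mutex_lock(&lock);
                fill = (fill + 1) % NBUF;
                used++;
                pthread_cond_signal(&not_empty); /* wake the consumer ... */
                pthread_mutex_unlock(&lock);
                /* ... and keep producing rather than going to sleep */
        }
        return NULL;
}

The consumer mirrors this with not_full/not_empty swapped, and with several
pairs all working the same ring any producer can wake any consumer, which is
the "no strict pair relation" mentioned above.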
> In short this program is carefully crafted to defeat all our affinity
> tests - and I'm not sure what to do.
I am sorry, but it is not carefully crafted. The function random()
is causing my problem. I currently have no real data, so I tried to
generate some random utilization and data.
Without the random() function it works even with 80MB of data and I get
great results.
./ThreadSchedulingIssue 1 10485760 8 312
All threads finished: 309 messages in 29.369 seconds / 10.521 msg/s
schedtool -a 1 -e ./ThreadSchedulingIssue 1 10485760 8 312
All threads finished: 312 messages in 44.284 seconds / 7.045 msg/s
It does not even regress with more than two threads.
./ThreadSchedulingIssue 2 10485760 8 312
All threads finished: 311 messages in 28.040 seconds / 11.091 msg/s
./ThreadSchedulingIssue 4 10485760 8 312
All threads finished: 309 messages in 28.021 seconds / 11.027 msg/s
With small amounts of data the speed on two cores is even doubled.
schedtool -a 1 -e ./ThreadSchedulingIssue 1 1048 8 312000
All threads finished: 311992 messages in 19.437 seconds / 16051.247 msg/s
./ThreadSchedulingIssue 3 1048 8 312000
All threads finished: 311998 messages in 9.652 seconds / 32324.411 msg/s
./ThreadSchedulingIssue 8 1048 8 312000
All threads finished: 311997 messages in 9.339 seconds / 33406.370 msg/s
--------------
Perhaps it is as it should be, but when I run the test (without
random()) with 2*8 threads, it uses ~186% of the cpu, while an instance
of "bzip2 -9 -c /dev/urandom >/dev/null" gets only 12%.
On Thu, 2009-01-29 at 15:05 +0100, Thomas Pilarski wrote:
> > In short this program is carefully crafted to defeat all our affinity
> > tests - and I'm not sure what to do.
>
> I am sorry, but it is not carefully crafted. The function random()
> is causing my problem. I currently have no real data, so I tried to
> generate some random utilization and data.
Yeah, rather big difference, mega-contention vs zero-contention.
2.6.28.2, profile of ThreadSchedulingIssue 4 524288 8 200
vma samples % app name symbol name
ffffffff80251efa 2574819 31.6774 vmlinux futex_wake
ffffffff80251a39 1367613 16.8255 vmlinux futex_wait
0000000000411790 815426 10.0320 ThreadSchedulingIssue random
ffffffff8022b3b5 343692 4.2284 vmlinux task_rq_lock
0000000000404e30 299316 3.6824 ThreadSchedulingIssue __lll_lock_wait_private
ffffffff8030d430 262906 3.2345 vmlinux copy_user_generic_string
ffffffff80462af2 235176 2.8933 vmlinux schedule
0000000000411b90 210984 2.5957 ThreadSchedulingIssue random_r
ffffffff80251730 129376 1.5917 vmlinux hash_futex
ffffffff8020be10 123548 1.5200 vmlinux system_call
ffffffff8020a679 119398 1.4689 vmlinux __switch_to
ffffffff8022f49b 110068 1.3541 vmlinux try_to_wake_up
ffffffff8024c4d1 106352 1.3084 vmlinux sched_clock_cpu
ffffffff8020be20 102709 1.2636 vmlinux system_call_after_swapgs
ffffffff80229a2d 100614 1.2378 vmlinux update_curr
ffffffff80248309 86475 1.0639 vmlinux add_wait_queue
ffffffff80253149 85969 1.0577 vmlinux do_futex
Versus using the myrand() free sample cruft generator from the rand(3) manpage. Poof.
vma samples % app name symbol name
004002f4 979506 90.7113 ThreadSchedulingIssue myrand
00400b00 53348 4.9405 ThreadSchedulingIssue thread_consumer
00400c25 42710 3.9553 ThreadSchedulingIssue thread_producer
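(For reference, that myrand() is presumably the trivial example generator from
the rand(3) manpage, roughly:

static unsigned long next = 1;

/* RAND_MAX assumed to be 32767 */
int myrand(void)
{
        next = next * 1103515245 + 12345;
        return (unsigned)(next / 65536) % 32768;
}

void mysrand(unsigned int seed)
{
        next = seed;
}

Unlike glibc's random(), it takes no internal lock, and that lock is what the
threads in the first profile are mostly fighting over. For real multi-threaded
use the state should live per thread, e.g. rand_r(), or random_r() with a
per-thread struct random_data, rather than in one shared global.)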
One of those "don't _ever_ do that" things?
-Mike