2007-06-19 05:12:51

by Vernon Mauery

Subject: [-RT] multiple streams have degraded performance

In looking at the performance characteristics of my network I found that
2.6.21.5-rt15 suffers from degraded throughput with multiple threads. The
test that I did this with is simply invoking 1, 2, 4, and 8 instances of
netperf at a time and measuring the total throughput. I have two 4-way
machines connected with 10GbE cards. I tested several kernels (some older
and some newer) and found that the only thing in common was that with -RT
kernels the performance went down with concurrent streams.

While the test was showing the numbers for receiving as well as sending, the
receiving numbers are not reliable because that machine was running a -RT
kernel for these tests.

I was just wondering if anyone had seen this problem before or would have any
idea on where to start hunting for the solution.

--Vernon

The key for this is that 'default' was invoked like:
netperf -c -C -l 60 -H 10.2.2.4 -t UDP_STREAM -- -m 1472 -M 1472
and '1Msock' was invoked like:
netperf -c -C -l 60 -H 10.2.2.4 -t UDP_STREAM -- -m 1472 -M 1472 -s 1M -S 1M

2.6.21
==============
default: 1 streams: Send at 2844.2 Mb/s, Receive at 2840.1 Mb/s
default: 2 streams: Send at 3927.9 Mb/s, Receive at 3603.9 Mb/s
default: 4 streams: Send at 4197.4 Mb/s, Receive at 3776.3 Mb/s
default: 8 streams: Send at 4223.9 Mb/s, Receive at 3848.9 Mb/s
1Msock: 1 streams: Send at 4232.3 Mb/s, Receive at 3914.4 Mb/s
1Msock: 2 streams: Send at 5428.8 Mb/s, Receive at 3853.2 Mb/s
1Msock: 4 streams: Send at 6202.1 Mb/s, Receive at 3774.8 Mb/s
1Msock: 8 streams: Send at 6225.1 Mb/s, Receive at 3754.7 Mb/s

2.6.21.5-rt15
===============
default: 1 streams: Send at 3091.6 Mb/s, Receive at 3048.1 Mb/s
default: 2 streams: Send at 3768.8 Mb/s, Receive at 3714.2 Mb/s
default: 4 streams: Send at 1873.6 Mb/s, Receive at 1825.9 Mb/s
default: 8 streams: Send at 1806.5 Mb/s, Receive at 1792.7 Mb/s
1Msock: 1 streams: Send at 3680.4 Mb/s, Receive at 3255.6 Mb/s
1Msock: 2 streams: Send at 4129.8 Mb/s, Receive at 3991.5 Mb/s
1Msock: 4 streams: Send at 1862.1 Mb/s, Receive at 1787.1 Mb/s
1Msock: 8 streams: Send at 1790.2 Mb/s, Receive at 1556.8 Mb/s


2007-06-19 06:52:25

by Peter Zijlstra

Subject: Re: [-RT] multiple streams have degraded performance

On Mon, 2007-06-18 at 22:12 -0700, Vernon Mauery wrote:
> In looking at the performance characteristics of my network I found that
> 2.6.21.5-rt15 suffers from degraded throughput with multiple threads. The
> test that I did this with is simply invoking 1, 2, 4, and 8 instances of
> netperf at a time and measuring the total throughput. I have two 4-way
> machines connected with 10GbE cards. I tested several kernels (some older
> and some newer) and found that the only thing in common was that with -RT
> kernels the performance went down with concurrent streams.
>
> While the test was showing the numbers for receiving as well as sending, the
> receiving numbers are not reliable because that machine was running a -RT
> kernel for these tests.
>
> I was just wondering if anyone had seen this problem before or would have any
> idea on where to start hunting for the solution.

could you enable CONFIG_LOCK_STAT

echo 0 > /proc/lock_stat

<run your test>

and report the output of (preferably not 80 column wrapped):

grep : /proc/lock_stat | head

or otherwise if there are any highly contended network locks listed?

- Peter

2007-06-19 14:26:08

by Vernon Mauery

Subject: Re: [-RT] multiple streams have degraded performance

On Monday 18 June 2007 11:51:38 pm Peter Zijlstra wrote:
> On Mon, 2007-06-18 at 22:12 -0700, Vernon Mauery wrote:
> > In looking at the performance characteristics of my network I found that
> > 2.6.21.5-rt15 suffers from degraded throughput with multiple threads. The
> > test that I did this with is simply invoking 1, 2, 4, and 8 instances of
> > netperf at a time and measuring the total throughput. I have two 4-way
> > machines connected with 10GbE cards. I tested several kernels (some
> > older and some newer) and found that the only thing in common was that
> > with -RT kernels the performance went down with concurrent streams.
> >
> > While the test was showing the numbers for receiving as well as sending,
> > the receiving numbers are not reliable because that machine was running a
> > -RT kernel for these tests.
> >
> > I was just wondering if anyone had seen this problem before or would have
> > any idea on where to start hunting for the solution.
>
> could you enable CONFIG_LOCK_STAT
>
> echo 0 > /proc/lock_stat
>
> <run your test>
>
> and report the output of (preferably not 80 column wrapped):
>
> grep : /proc/lock_stat | head

/proc/lock_stat stayed empty for the duration of the test. I am guessing this
means there was no lock contention.

I do see this on the console:
BUG: scheduling with irqs disabled: IRQ-8414/0x00000000/9494
caller is wait_for_completion+0x85/0xc4

Call Trace:
[<ffffffff8106f3b2>] dump_trace+0xaa/0x32a
[<ffffffff8106f673>] show_trace+0x41/0x64
[<ffffffff8106f6ab>] dump_stack+0x15/0x17
[<ffffffff8106566f>] schedule+0x82/0x102
[<ffffffff81065774>] wait_for_completion+0x85/0xc4
[<ffffffff81092043>] set_cpus_allowed+0xa1/0xc8
[<ffffffff810986e2>] do_softirq_from_hardirq+0x105/0x12d
[<ffffffff810ca6cc>] do_irqd+0x2a8/0x32f
[<ffffffff8103469d>] kthread+0xf5/0x128
[<ffffffff81060f68>] child_rip+0xa/0x12

INFO: lockdep is turned off.

I hadn't seen this until I enabled lock_stat and ran the test.

> or otherwise if there are any highly contended network locks listed?

Any other ideas for debugging this?

--Vernon

2007-06-19 15:39:19

by Peter Zijlstra

Subject: Re: [-RT] multiple streams have degraded performance

On Tue, 2007-06-19 at 07:25 -0700, Vernon Mauery wrote:
> On Monday 18 June 2007 11:51:38 pm Peter Zijlstra wrote:
> > On Mon, 2007-06-18 at 22:12 -0700, Vernon Mauery wrote:
> > > In looking at the performance characteristics of my network I found that
> > > 2.6.21.5-rt15 suffers from degraded throughput with multiple threads. The
> > > test that I did this with is simply invoking 1, 2, 4, and 8 instances of
> > > netperf at a time and measuring the total throughput. I have two 4-way
> > > machines connected with 10GbE cards. I tested several kernels (some
> > > older and some newer) and found that the only thing in common was that
> > > with -RT kernels the performance went down with concurrent streams.
> > >
> > > While the test was showing the numbers for receiving as well as sending,
> > > the receiving numbers are not reliable because that machine was running a
> > > -RT kernel for these tests.
> > >
> > > I was just wondering if anyone had seen this problem before or would have
> > > any idea on where to start hunting for the solution.
> >
> > could you enable CONFIG_LOCK_STAT
> >
> > echo 0 > /proc/lock_stat
> >
> > <run your test>
> >
> > and report the output of (preferably not 80 column wrapped):
> >
> > grep : /proc/lock_stat | head
>
> /proc/lock_stat stayed empty for the duration of the test. I am guessing this
> means there was no lock contention.

Most likely caused by the issue below.

> I do see this on the console:
> BUG: scheduling with irqs disabled: IRQ-8414/0x00000000/9494
> caller is wait_for_completion+0x85/0xc4
>
> Call Trace:
> [<ffffffff8106f3b2>] dump_trace+0xaa/0x32a
> [<ffffffff8106f673>] show_trace+0x41/0x64
> [<ffffffff8106f6ab>] dump_stack+0x15/0x17
> [<ffffffff8106566f>] schedule+0x82/0x102
> [<ffffffff81065774>] wait_for_completion+0x85/0xc4
> [<ffffffff81092043>] set_cpus_allowed+0xa1/0xc8
> [<ffffffff810986e2>] do_softirq_from_hardirq+0x105/0x12d
> [<ffffffff810ca6cc>] do_irqd+0x2a8/0x32f
> [<ffffffff8103469d>] kthread+0xf5/0x128
> [<ffffffff81060f68>] child_rip+0xa/0x12
>
> INFO: lockdep is turned off.
>
> I hadn't seen this until I enabled lock_stat and ran the test.

I think this is what causes the lack of output. It disables all lock
tracking features...

> > or otherwise if there are any highly contended network locks listed?
>
> Any other ideas for debugging this?

fixing the above bug would help :-)

Ingo says that should be fixed in -rt17, so if you could give that a
spin...

2007-06-19 17:19:26

by Vernon Mauery

Subject: Re: [-RT] multiple streams have degraded performance

On Tuesday 19 June 2007 8:38:50 am Peter Zijlstra wrote:
> On Tue, 2007-06-19 at 07:25 -0700, Vernon Mauery wrote:
> > On Monday 18 June 2007 11:51:38 pm Peter Zijlstra wrote:
> > > On Mon, 2007-06-18 at 22:12 -0700, Vernon Mauery wrote:
> > > > In looking at the performance characteristics of my network I found
> > > > that 2.6.21.5-rt15 suffers from degraded throughput with multiple
> > > > threads. The test that I did this with is simply invoking 1, 2, 4,
> > > > and 8 instances of netperf at a time and measuring the total
> > > > throughput. I have two 4-way machines connected with 10GbE cards. I
> > > > tested several kernels (some older and some newer) and found that the
> > > > only thing in common was that with -RT kernels the performance went
> > > > down with concurrent streams.
> > > >
> > > > While the test was showing the numbers for receiving as well as
> > > > sending, the receiving numbers are not reliable because that machine
> > > > was running a -RT kernel for these tests.
> > > >
> > > > I was just wondering if anyone had seen this problem before or would
> > > > have any idea on where to start hunting for the solution.
> > >
> > > could you enable CONFIG_LOCK_STAT
> > >
> > > echo 0 > /proc/lock_stat
> > >
> > > <run your test>
> > >
> > > and report the output of (preferably not 80 column wrapped):
> > >
> > > grep : /proc/lock_stat | head
> >
> > /proc/lock_stat stayed empty for the duration of the test. I am guessing
> > this means there was no lock contention.
>
> Most likely caused by the issue below.
>
> > I do see this on the console:
> > BUG: scheduling with irqs disabled: IRQ-8414/0x00000000/9494
> > caller is wait_for_completion+0x85/0xc4
> >
> > Call Trace:
> > [<ffffffff8106f3b2>] dump_trace+0xaa/0x32a
> > [<ffffffff8106f673>] show_trace+0x41/0x64
> > [<ffffffff8106f6ab>] dump_stack+0x15/0x17
> > [<ffffffff8106566f>] schedule+0x82/0x102
> > [<ffffffff81065774>] wait_for_completion+0x85/0xc4
> > [<ffffffff81092043>] set_cpus_allowed+0xa1/0xc8
> > [<ffffffff810986e2>] do_softirq_from_hardirq+0x105/0x12d
> > [<ffffffff810ca6cc>] do_irqd+0x2a8/0x32f
> > [<ffffffff8103469d>] kthread+0xf5/0x128
> > [<ffffffff81060f68>] child_rip+0xa/0x12
> >
> > INFO: lockdep is turned off.
> >
> > I hadn't seen this until I enabled lock_stat and ran the test.
>
> I think this is what causes the lack of output. It disables all lock
> tracking features...
>
> > > or otherwise if there are any highly contended network locks listed?
> >
> > Any other ideas for debugging this?
>
> fixing the above bug would help :-)
>
> Ingo says that should be fixed in -rt17, so if you could give that a
> spin...

I just tested with -rt17 and the BUG message is gone, but I still don't see
any entries in /proc/lock_stat. I don't see any other suspicious messages, so I
think it is really strange that it is not listing any locks at all.

--Vernon

2007-06-20 03:38:18

by Vernon Mauery

Subject: Re: [-RT] multiple streams have degraded performance

On Monday 18 June 2007 10:12:21 pm Vernon Mauery wrote:
> In looking at the performance characteristics of my network I found that
> 2.6.21.5-rt15 suffers from degraded throughput with multiple threads. The
> test that I did this with is simply invoking 1, 2, 4, and 8 instances of
> netperf at a time and measuring the total throughput. I have two 4-way
> machines connected with 10GbE cards. I tested several kernels (some older
> and some newer) and found that the only thing in common was that with -RT
> kernels the performance went down with concurrent streams.

I just tested this using lo instead of the 10GbE adapter. I found similar
results. Since this makes it reproducible by just about anybody (maybe the
only factor now is the number of CPUs), I have attached the script that I
test things with.

So with the script run like ./stream_test 127.0.0.1 on 2.6.21 and
2.6.21.5-rt17 I got the following:

2.6.21
=======================
default: 1 streams: Send at 2790.3 Mb/s, Receive at 2790.3 Mb/s
default: 2 streams: Send at 4129.4 Mb/s, Receive at 4128.7 Mb/s
default: 4 streams: Send at 7949.6 Mb/s, Receive at 7735.5 Mb/s
default: 8 streams: Send at 7930.7 Mb/s, Receive at 7910.1 Mb/s
1Msock: 1 streams: Send at 2810.7 Mb/s, Receive at 2810.7 Mb/s
1Msock: 2 streams: Send at 4093.4 Mb/s, Receive at 4092.6 Mb/s
1Msock: 4 streams: Send at 7887.8 Mb/s, Receive at 7880.4 Mb/s
1Msock: 8 streams: Send at 8091.7 Mb/s, Receive at 8082.2 Mb/s

2.6.21.5-rt17
======================
default: 1 streams: Send at 938.2 Mb/s, Receive at 938.2 Mb/s
default: 2 streams: Send at 1476.3 Mb/s, Receive at 1436.9 Mb/s
default: 4 streams: Send at 1489.8 Mb/s, Receive at 1145.0 Mb/s
default: 8 streams: Send at 1099.8 Mb/s, Receive at 1079.1 Mb/s
1Msock: 1 streams: Send at 921.4 Mb/s, Receive at 920.4 Mb/s
1Msock: 2 streams: Send at 1332.2 Mb/s, Receive at 1311.5 Mb/s
1Msock: 4 streams: Send at 1483.0 Mb/s, Receive at 1137.8 Mb/s
1Msock: 8 streams: Send at 1446.2 Mb/s, Receive at 1135.6 Mb/s

--Vernon

> While the test was showing the numbers for receiving as well as sending,
> the receiving numbers are not reliable because that machine was running a
> -RT kernel for these tests.
>
> I was just wondering if anyone had seen this problem before or would have
> any idea on where to start hunting for the solution.
>
> --Vernon
>
> The key for this is that 'default' was invoked like:
> netperf -c -C -l 60 -H 10.2.2.4 -t UDP_STREAM -- -m 1472 -M 1472
> and '1Msock' was invoked like:
> netperf -c -C -l 60 -H 10.2.2.4 -t UDP_STREAM -- -m 1472 -M 1472 -s 1M -S
> 1M
>
> 2.6.21
> ==============
> default: 1 streams: Send at 2844.2 Mb/s, Receive at 2840.1 Mb/s
> default: 2 streams: Send at 3927.9 Mb/s, Receive at 3603.9 Mb/s
> default: 4 streams: Send at 4197.4 Mb/s, Receive at 3776.3 Mb/s
> default: 8 streams: Send at 4223.9 Mb/s, Receive at 3848.9 Mb/s
> 1Msock: 1 streams: Send at 4232.3 Mb/s, Receive at 3914.4 Mb/s
> 1Msock: 2 streams: Send at 5428.8 Mb/s, Receive at 3853.2 Mb/s
> 1Msock: 4 streams: Send at 6202.1 Mb/s, Receive at 3774.8 Mb/s
> 1Msock: 8 streams: Send at 6225.1 Mb/s, Receive at 3754.7 Mb/s
>
> 2.6.21.5-rt15
> ===============
> default: 1 streams: Send at 3091.6 Mb/s, Receive at 3048.1 Mb/s
> default: 2 streams: Send at 3768.8 Mb/s, Receive at 3714.2 Mb/s
> default: 4 streams: Send at 1873.6 Mb/s, Receive at 1825.9 Mb/s
> default: 8 streams: Send at 1806.5 Mb/s, Receive at 1792.7 Mb/s
> 1Msock: 1 streams: Send at 3680.4 Mb/s, Receive at 3255.6 Mb/s
> 1Msock: 2 streams: Send at 4129.8 Mb/s, Receive at 3991.5 Mb/s
> 1Msock: 4 streams: Send at 1862.1 Mb/s, Receive at 1787.1 Mb/s
> 1Msock: 8 streams: Send at 1790.2 Mb/s, Receive at 1556.8 Mb/s



Attachments:
stream_test (2.76 kB)

2007-06-20 07:43:32

by Peter Zijlstra

Subject: Re: [-RT] multiple streams have degraded performance

On Tue, 2007-06-19 at 20:38 -0700, Vernon Mauery wrote:
> On Monday 18 June 2007 10:12:21 pm Vernon Mauery wrote:
> > In looking at the performance characteristics of my network I found that
> > 2.6.21.5-rt15 suffers from degraded throughput with multiple threads. The
> > test that I did this with is simply invoking 1, 2, 4, and 8 instances of
> > netperf at a time and measuring the total throughput. I have two 4-way
> > machines connected with 10GbE cards. I tested several kernels (some older
> > and some newer) and found that the only thing in common was that with -RT
> > kernels the performance went down with concurrent streams.
>
> I just tested this using lo instead of the 10GbE adapter. I found similar
> results. Since this makes it reproducible by just about anybody (maybe the
> only factor now is the number of CPUs), I have attached the script that I
> test things with.
>
> So with the script run like ./stream_test 127.0.0.1 on 2.6.21 and
> 2.6.21.5-rt17 I got the following:
>
> 2.6.21
> =======================
> default: 1 streams: Send at 2790.3 Mb/s, Receive at 2790.3 Mb/s
> default: 2 streams: Send at 4129.4 Mb/s, Receive at 4128.7 Mb/s
> default: 4 streams: Send at 7949.6 Mb/s, Receive at 7735.5 Mb/s
> default: 8 streams: Send at 7930.7 Mb/s, Receive at 7910.1 Mb/s
> 1Msock: 1 streams: Send at 2810.7 Mb/s, Receive at 2810.7 Mb/s
> 1Msock: 2 streams: Send at 4093.4 Mb/s, Receive at 4092.6 Mb/s
> 1Msock: 4 streams: Send at 7887.8 Mb/s, Receive at 7880.4 Mb/s
> 1Msock: 8 streams: Send at 8091.7 Mb/s, Receive at 8082.2 Mb/s
>
> 2.6.21.5-rt17
> ======================
> default: 1 streams: Send at 938.2 Mb/s, Receive at 938.2 Mb/s
> default: 2 streams: Send at 1476.3 Mb/s, Receive at 1436.9 Mb/s
> default: 4 streams: Send at 1489.8 Mb/s, Receive at 1145.0 Mb/s
> default: 8 streams: Send at 1099.8 Mb/s, Receive at 1079.1 Mb/s
> 1Msock: 1 streams: Send at 921.4 Mb/s, Receive at 920.4 Mb/s
> 1Msock: 2 streams: Send at 1332.2 Mb/s, Receive at 1311.5 Mb/s
> 1Msock: 4 streams: Send at 1483.0 Mb/s, Receive at 1137.8 Mb/s
> 1Msock: 8 streams: Send at 1446.2 Mb/s, Receive at 1135.6 Mb/s

Unfortunately I do not have a 4-way machine, but a dual core was enough
to show the problem:

./stream_test 127.0.0.1

** default: 1 streams: Send at 268.9 Mb/s, Receive at 268.0 Mb/s
** default: 2 streams: Send at 448.7 Mb/s, Receive at 448.4 Mb/s
** default: 4 streams: Send at 438.7 Mb/s, Receive at 438.3 Mb/s
** default: 8 streams: Send at 352.7 Mb/s, Receive at 351.2 Mb/s

** 1Msock: 1 streams: Send at 207.4 Mb/s, Receive at 207.4 Mb/s
** 1Msock: 2 streams: Send at 450.0 Mb/s, Receive at 450.0 Mb/s
** 1Msock: 4 streams: Send at 444.9 Mb/s, Receive at 444.9 Mb/s
** 1Msock: 8 streams: Send at 351.4 Mb/s, Receive at 351.4 Mb/s


and lock_stat shows the following offender:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
class name           contentions   waittime-min   waittime-max   waittime-total   con-bounces   acquisitions   holdtime-min   holdtime-max   holdtime-total   acq-bounces
-----------------------------------------------------------------------------------------------------------------------------------------------------------------

udp_hash_lock-W:               1           1.58           1.58             1.58             1            163          0.100           4.38           184.44           110
udp_hash_lock-R:          539603           1.30        4848.34       7756821.91        539131       15095136           0.36         520.09      20404391.93      11527348
---------------
udp_hash_lock             539603   [<ffffffff80468d63>] __udp4_lib_lookup+0x3c/0x10d
udp_hash_lock                  1   [<ffffffff804686fe>] __udp_lib_get_port+0x2c/0x21a


Which makes perfect sense, in that r/w locks are degraded to regular
exclusive locks on -rt.
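
To make that concrete, the contended path has roughly the following shape. This
is an illustrative sketch only, not the actual 2.6.21 net/ipv4/udp.c code: the
hash-chain walk and the matching logic are elided, and every identifier other
than the udp_hash_lock / __udp4_lib_lookup / __udp_lib_get_port names above is
invented for the example.

#include <linux/spinlock.h>	/* rwlock_t, read_lock(), write_lock() */
#include <net/sock.h>		/* struct sock */

/* Stand-in for the global udp_hash_lock rwlock. */
static DEFINE_RWLOCK(udp_hash_lock_sketch);

/*
 * Receive path: every incoming UDP datagram takes the hash lock for
 * reading while it searches for the matching socket.  On mainline,
 * any number of CPUs can hold the read side at once, so concurrent
 * streams scale.  On -rt, the rwlock is replaced by an exclusive
 * (rtmutex-based) lock, so these "readers" run one at a time and the
 * lock becomes a global serialization point -- which is exactly what
 * the udp_hash_lock-R contention figures above show.
 */
static struct sock *udp_lookup_sketch(unsigned int hash)
{
	struct sock *sk = NULL;

	read_lock(&udp_hash_lock_sketch);
	/* ... walk the hash chain for 'hash', find the matching socket,
	 *     take a reference on it ... */
	read_unlock(&udp_hash_lock_sketch);
	return sk;
}

/*
 * Bind-time insertion (the __udp_lib_get_port side): rare, which is
 * why the write lock shows only a single contention in the statistics.
 */
static void udp_insert_sketch(struct sock *sk, unsigned int hash)
{
	write_lock(&udp_hash_lock_sketch);
	/* ... add sk to the hash chain for 'hash' ... */
	write_unlock(&udp_hash_lock_sketch);
}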

So I guess you're wanting to go help out with the RCU-ification of
whatever it is that this lock protects. :-)
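
The usual way out of a read-mostly bottleneck like this is the RCU pattern
sketched below: lookups run inside an RCU read-side critical section and scale
on both mainline and -rt, while only the rare insert/remove path takes a
spinlock and defers freeing until pre-existing readers are done. Again, this is
a generic sketch of the technique being suggested, not the actual UDP
conversion; the identifiers are invented for the example.

#include <linux/rcupdate.h>	/* rcu_read_lock(), synchronize_rcu() */
#include <linux/spinlock.h>
#include <net/sock.h>		/* struct sock */

/* Writers still exclude each other, but no longer block readers. */
static DEFINE_SPINLOCK(udp_hash_wlock_sketch);

/*
 * Read side: no lock at all, just an RCU read-side critical section.
 * Lookups on different CPUs no longer serialize; on -rt the RCU read
 * side is preemptible but still allows full reader concurrency.
 */
static struct sock *udp_lookup_rcu_sketch(unsigned int hash)
{
	struct sock *sk = NULL;

	rcu_read_lock();
	/* ... walk the hash chain with the _rcu list helpers and take a
	 *     reference on the matching socket before returning it ... */
	rcu_read_unlock();
	return sk;
}

/*
 * Write side: removal may not free the socket until all readers that
 * might still see it have left their read-side critical sections.
 */
static void udp_remove_rcu_sketch(struct sock *sk)
{
	spin_lock(&udp_hash_wlock_sketch);
	/* ... unlink sk from its hash chain with the _rcu list helpers ... */
	spin_unlock(&udp_hash_wlock_sketch);

	synchronize_rcu();	/* wait for those readers to finish */
	/* ... now it is safe to free sk (or use call_rcu() instead) ... */
}

In practice socket lifetime and reference counting make the real conversion
quite a bit more involved than this outline, but it gives an idea of what
"RCU-ification" of udp_hash_lock would buy.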