2009-03-18 17:25:39

by Vernon Mauery

[permalink] [raw]
Subject: High contention on the sk_buff_head.lock

I have been beating on network throughput in the -rt kernel for some time
now. After digging down through the send path of UDP packets, I found
that the sk_buff_head.lock is under some very high contention. This lock
is acquired each time a packet is enqueued on a qdisc and then acquired
again to dequeue the packet. Under high networking loads, the enqueueing
processes are not only contending among each other for the lock, but also
with the net-tx soft irq. This makes for some very high contention on this
one lock. My testcase is running varying numbers of concurrent netperf
instances pushing UDP traffic to another machine. As the count goes from
1 to 2, the network performance increases. But from 2 to 4 and from 4 to 8,
we see a big decline, with 8 instances pushing about half of what a single
thread can do.
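
For reference, the per-packet locking being described boils down to roughly
the following (a simplified sketch of the 2.6.29-era TX path, with error
handling and qdisc state checks omitted; not the literal kernel code):

/*
 * Simplified sketch: the same qdisc lock (which is the embedded
 * sk_buff_head.lock) is taken once to enqueue and once to dequeue
 * every packet.
 */

/* enqueue side, reached from every sending task via dev_queue_xmit() */
static int sketch_xmit_enqueue(struct sk_buff *skb, struct Qdisc *q)
{
	spinlock_t *root_lock = qdisc_lock(q);	/* == &q->q.lock */
	int rc;

	spin_lock(root_lock);	/* contended by all senders + the net-tx softirq */
	rc = q->enqueue(skb, q);
	qdisc_run(q);		/* may dequeue/transmit while we are here */
	spin_unlock(root_lock);
	return rc;
}

/* dequeue side, run from __qdisc_run() in the net-tx softirq */
static struct sk_buff *sketch_xmit_dequeue(struct Qdisc *q)
{
	spinlock_t *root_lock = qdisc_lock(q);
	struct sk_buff *skb;

	spin_lock(root_lock);	/* the same lock, taken again per packet */
	skb = q->dequeue(q);
	spin_unlock(root_lock);
	return skb;
}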

Running 2.6.29-rc6-rt3 on an 8-way machine with a 10GbE card (I have tried
both NetXen and Broadcom, with very similar results), I can only push about
1200 Mb/s. Whereas with the mainline 2.6.29-rc8 kernel, I can push nearly
6000 Mb/s, though still not as much as I think is possible. I was curious and
decided to see if the mainline kernel was hitting the same lock; according to
/proc/lock_stat, it is hitting the sk_buff_head.lock as well (it was the
number one contended lock).

So while this issue really hits -rt kernels hard, it has a real effect on
mainline kernels as well. The contention of the spinlocks is amplified
when they get turned into rt-mutexes, which causes a double context switch.

Below is the top of the lock_stat for 2.6.29-rc8. This was captured from
a 1-minute network stress test. The next highest contender had 2 orders of
magnitude fewer contentions. Think of the throughput increase if we could
ease this contention a bit. We might even be able to saturate a 10GbE
link.

lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

&list->lock#3: 24517307 24643791 0.71 1286.62 56516392.42 34834296 44904018 0.60 164.79 31314786.02
-------------
&list->lock#3 15596927 [<ffffffff812474da>] dev_queue_xmit+0x2ea/0x468
&list->lock#3 9046864 [<ffffffff812546e9>] __qdisc_run+0x11b/0x1ef
-------------
&list->lock#3 6525300 [<ffffffff812546e9>] __qdisc_run+0x11b/0x1ef
&list->lock#3 18118491 [<ffffffff812474da>] dev_queue_xmit+0x2ea/0x468


The story is the same for -rt kernels, only the waittime and holdtime are both
orders of magnitude greater.

I am not exactly clear on the solution, but if I understand correctly, in the
past there has been some discussion of batched enqueueing and dequeueing. Is
anyone else working on this problem right now who has just not yet posted
anything for review? Questions, comments, flames?
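
(As an illustration of the batched-dequeue idea, the shape could be something
like the sketch below: take the contended lock once, splice the whole queue
onto a private list, and transmit from that without the lock. This is purely
illustrative, it ignores the packet-scheduler semantics discussed later in the
thread, and sketch_xmit_one() is a hypothetical transmit helper.)

static void sketch_batch_dequeue(struct sk_buff_head *q)
{
	struct sk_buff_head batch;
	struct sk_buff *skb;

	__skb_queue_head_init(&batch);

	spin_lock(&q->lock);
	skb_queue_splice_init(q, &batch);	/* O(1): steal everything queued */
	spin_unlock(&q->lock);

	/* drain the batch without holding the contended sk_buff_head.lock */
	while ((skb = __skb_dequeue(&batch)) != NULL)
		sketch_xmit_one(skb);		/* hypothetical helper */
}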

--Vernon


2009-03-18 19:07:39

by Eric Dumazet

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

Vernon Mauery wrote:
> I have been beating on network throughput in the -rt kernel for some time
> now. After digging down through the send path of UDP packets, I found
> that the sk_buff_head.lock is under some very high contention. This lock
> is acquired each time a packet is enqueued on a qdisc and then acquired
> again to dequeue the packet. Under high networking loads, the enqueueing
> processes are not only contending among each other for the lock, but also
> with the net-tx soft irq. This makes for some very high contention on this
> one lock. My testcase is running varying numbers of concurrent netperf
> instances pushing UDP traffic to another machine. As the count goes from
> 1 to 2, the network performance increases. But from 2 to 4 and from 4
> to 8,
> we see a big decline, with 8 instances pushing about half of what a single
> thread can do.
>
> Running 2.6.29-rc6-rt3 on an 8-way machine with a 10GbE card (I have tried
> both NetXen and Broadcom, with very similar results), I can only push about
> 1200 Mb/s. Whereas with the mainline 2.6.29-rc8 kernel, I can push nearly
> 6000 Mb/s. But still not as much as I think is possible. I was curious and
> decided to see if the mainline kernel was hitting the same lock, and using
> /proc/lock_stat, it is hitting the sk_buff_head.lock as well (it was the
> number one contended lock).
>
> So while this issue really hits -rt kernels hard, it has a real effect on
> mainline kernels as well. The contention of the spinlocks is amplified
> when they get turned into rt-mutexes, which causes a double context switch.
>
> Below is the top of the lock_stat for 2.6.29-rc8. This was captured from
> a 1 minute network stress test. The next high contender had 2 orders of
> magnitude fewer contentions. Think of the throughput increase if we could
> ease this contention a bit. We might even be able to saturate a 10GbE
> link.
>
> lock_stat version 0.3
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> class name con-bounces contentions waittime-min
> waittime-max waittime-total acq-bounces acquisitions
> holdtime-min holdtime-max holdtime-total
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>
> &list->lock#3: 24517307 24643791 0.71
> 1286.62 56516392.42 34834296 44904018
> 0.60 164.79 31314786.02
> -------------
> &list->lock#3 15596927 [<ffffffff812474da>]
> dev_queue_xmit+0x2ea/0x468
> &list->lock#3 9046864 [<ffffffff812546e9>]
> __qdisc_run+0x11b/0x1ef
> -------------
> &list->lock#3 6525300 [<ffffffff812546e9>]
> __qdisc_run+0x11b/0x1ef
> &list->lock#3 18118491 [<ffffffff812474da>]
> dev_queue_xmit+0x2ea/0x468
>
>
> The story is the same for -rt kernels, only the waittime and holdtime
> are both
> orders of magnitude greater.
>
> I am not exactly clear on the solution, but if I understand correctly,
> in the
> past there has been some discussion of batched enqueueing and
> dequeueing. Is
> anyone else working on this problem right now who has just not yet posted
> anything for review? Questions, comments, flames?
>

Yes we have a known contention point here, but before adding more complex code,
could you try following patch please ?

[PATCH] net: Reorder fields of struct Qdisc

dev_queue_xmit() needs to dirty fields "state" and "q"

On x86_64 arch, they currently span two cache lines, involving more
cache line ping pongs than necessary.

Before patch :

offsetof(struct Qdisc, state)=0x38
offsetof(struct Qdisc, q)=0x48
offsetof(struct Qdisc, dev_queue)=0x60

After patch :

offsetof(struct Qdisc, dev_queue)=0x38
offsetof(struct Qdisc, state)=0x48
offsetof(struct Qdisc, q)=0x50


Signed-off-by: Eric Dumazet <[email protected]>

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f8c4742..e24feeb 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -51,10 +51,11 @@ struct Qdisc
u32 handle;
u32 parent;
atomic_t refcnt;
- unsigned long state;
+ struct netdev_queue *dev_queue;
+
struct sk_buff *gso_skb;
+ unsigned long state;
struct sk_buff_head q;
- struct netdev_queue *dev_queue;
struct Qdisc *next_sched;
struct list_head list;
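
(For reference, the offsets quoted above can be double-checked with a
throwaway snippet along these lines, built against the kernel headers in
question; it is not part of the patch:)

/* Throwaway module printing the struct Qdisc offsets quoted above. */
#include <linux/kernel.h>
#include <linux/module.h>
#include <net/sch_generic.h>

static int __init qdisc_layout_init(void)
{
	pr_info("offsetof(struct Qdisc, state)=0x%zx\n",
		offsetof(struct Qdisc, state));
	pr_info("offsetof(struct Qdisc, q)=0x%zx\n",
		offsetof(struct Qdisc, q));
	pr_info("offsetof(struct Qdisc, dev_queue)=0x%zx\n",
		offsetof(struct Qdisc, dev_queue));
	return 0;
}
module_init(qdisc_layout_init);
MODULE_LICENSE("GPL");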


2009-03-18 20:18:09

by Vernon Mauery

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

Eric Dumazet wrote:
> Vernon Mauery wrote:
>> I have been beating on network throughput in the -rt kernel for some time
>> now. After digging down through the send path of UDP packets, I found
>> that the sk_buff_head.lock is under some very high contention. This lock
>> is acquired each time a packet is enqueued on a qdisc and then acquired
>> again to dequeue the packet. Under high networking loads, the enqueueing
>> processes are not only contending among each other for the lock, but also
>> with the net-tx soft irq. This makes for some very high contention on this
>> one lock. My testcase is running varying numbers of concurrent netperf
>> instances pushing UDP traffic to another machine. As the count goes from
>> 1 to 2, the network performance increases. But from 2 to 4 and from 4
>> to 8,
>> we see a big decline, with 8 instances pushing about half of what a single
>> thread can do.
>>
>> Running 2.6.29-rc6-rt3 on an 8-way machine with a 10GbE card (I have tried
>> both NetXen and Broadcom, with very similar results), I can only push about
>> 1200 Mb/s. Whereas with the mainline 2.6.29-rc8 kernel, I can push nearly
>> 6000 Mb/s. But still not as much as I think is possible. I was curious and
>> decided to see if the mainline kernel was hitting the same lock, and using
>> /proc/lock_stat, it is hitting the sk_buff_head.lock as well (it was the
>> number one contended lock).
>>
>> So while this issue really hits -rt kernels hard, it has a real effect on
>> mainline kernels as well. The contention of the spinlocks is amplified
>> when they get turned into rt-mutexes, which causes a double context switch.
>>
>> Below is the top of the lock_stat for 2.6.29-rc8. This was captured from
>> a 1 minute network stress test. The next high contender had 2 orders of
>> magnitude fewer contentions. Think of the throughput increase if we could
>> ease this contention a bit. We might even be able to saturate a 10GbE
>> link.
>>
>> lock_stat version 0.3
>> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> class name con-bounces contentions waittime-min
>> waittime-max waittime-total acq-bounces acquisitions
>> holdtime-min holdtime-max holdtime-total
>> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>>
>> &list->lock#3: 24517307 24643791 0.71
>> 1286.62 56516392.42 34834296 44904018
>> 0.60 164.79 31314786.02
>> -------------
>> &list->lock#3 15596927 [<ffffffff812474da>]
>> dev_queue_xmit+0x2ea/0x468
>> &list->lock#3 9046864 [<ffffffff812546e9>]
>> __qdisc_run+0x11b/0x1ef
>> -------------
>> &list->lock#3 6525300 [<ffffffff812546e9>]
>> __qdisc_run+0x11b/0x1ef
>> &list->lock#3 18118491 [<ffffffff812474da>]
>> dev_queue_xmit+0x2ea/0x468
>>
>>
>> The story is the same for -rt kernels, only the waittime and holdtime
>> are both
>> orders of magnitude greater.
>>
>> I am not exactly clear on the solution, but if I understand correctly,
>> in the
>> past there has been some discussion of batched enqueueing and
>> dequeueing. Is
>> anyone else working on this problem right now who has just not yet posted
>> anything for review? Questions, comments, flames?
>>
>
> Yes we have a known contention point here, but before adding more complex code,
> could you try following patch please ?

This patch does seem to reduce the number of contentions by about 10%. That is
a good start (and a good catch on the cacheline bounces). But, like I mentioned
above, this lock still has 2 orders of magnitude greater contention than the
next lock, so even a decrease as large as 10% makes little difference in the
overall contention characteristics.

So we will have to do something more. Whether it needs to be more complex or
not is still up in the air. Batched enqueueing and batched dequeueing are just
two options, and the former would be a *lot* less complex than the latter.

If anyone else has any ideas they have been holding back, now would be a great
time to get them out in the open.

--Vernon

> [PATCH] net: Reorder fields of struct Qdisc
>
> dev_queue_xmit() needs to dirty fields "state" and "q"
>
> On x86_64 arch, they currently span two cache lines, involving more
> cache line ping pongs than necessary.
>
> Before patch :
>
> offsetof(struct Qdisc, state)=0x38
> offsetof(struct Qdisc, q)=0x48
> offsetof(struct Qdisc, dev_queue)=0x60
>
> After patch :
>
> offsetof(struct Qdisc, dev_queue)=0x38
> offsetof(struct Qdisc, state)=0x48
> offsetof(struct Qdisc, q)=0x50
>
>
> Signed-off-by: Eric Dumazet <[email protected]>
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index f8c4742..e24feeb 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -51,10 +51,11 @@ struct Qdisc
> u32 handle;
> u32 parent;
> atomic_t refcnt;
> - unsigned long state;
> + struct netdev_queue *dev_queue;
> +
> struct sk_buff *gso_skb;
> + unsigned long state;
> struct sk_buff_head q;
> - struct netdev_queue *dev_queue;
> struct Qdisc *next_sched;
> struct list_head list;
>
>
>

2009-03-18 20:57:35

by Andi Kleen

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

Vernon Mauery <[email protected]> writes:
>
> So while this issue really hits -rt kernels hard, it has a real effect on
> mainline kernels as well. The contention of the spinlocks is amplified
> when they get turned into rt-mutexes, which causes a double context switch.

The new adaptive spin heuristics that were discussed some time
ago didn't help? Unsurprisingly, making locks a lot more expensive
has drawbacks, as you discovered.

> &list->lock#3: 24517307 24643791 0.71 1286.62 56516392.42 34834296 44904018 0.60 164.79 31314786.02
> -------------
> &list->lock#3 15596927 [<ffffffff812474da>] dev_queue_xmit+0x2ea/0x468

The real "fix" would probably be to use a multiqueue-capable NIC
and a NIC driver that sets up multiple queues for TX (normally they
only do so for RX). Then each core or set of cores (often the number
of cores is larger than the number of NIC queues) could avoid this
problem. Disadvantage: more memory use.
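
(To make the multiqueue point concrete: with multiqueue TX, each TX queue gets
its own qdisc and therefore its own lock, so flows hashed to different queues
never touch the same sk_buff_head.lock. A rough sketch of that selection, in
the spirit of the kernel's TX hashing but simplified, with illustrative names:)

#include <linux/jhash.h>
#include <linux/skbuff.h>
#include <net/sock.h>

static u16 sketch_pick_tx_queue(const struct sk_buff *skb, u16 num_tx_queues)
{
	u32 hash;

	/* derive a per-flow value: prefer the socket's precomputed hash,
	 * fall back to the protocol number */
	if (skb->sk && skb->sk->sk_hash)
		hash = skb->sk->sk_hash;
	else
		hash = skb->protocol;

	hash = jhash_1word(hash, 0);

	/* one TX queue (and thus one qdisc lock) per bucket */
	return (u16)(((u64)hash * num_tx_queues) >> 32);
}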

But then again I'm not sure it's worth it if the problem only
happens in out of tree RT.

-Andi

--
[email protected] -- Speaking for myself only.

2009-03-18 21:04:15

by David Miller

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

From: Andi Kleen <[email protected]>
Date: Wed, 18 Mar 2009 21:54:37 +0100

> But then again I'm not sure it's worth it if the problem only
> happens in out of tree RT.

The list of problems that only show up with the RT kernel seems to be
constantly increasing, which honestly is very surprising to me.

I don't understand why we even need to be concerned about this stuff
upstream, to be honest.

Please reproduce this in the vanilla kernel, then get back to us.

2009-03-18 21:07:51

by Vernon Mauery

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

Andi Kleen wrote:
> Vernon Mauery <[email protected]> writes:
>> So while this issue really hits -rt kernels hard, it has a real effect on
>> mainline kernels as well. The contention of the spinlocks is amplified
>> when they get turned into rt-mutexes, which causes a double context switch.
>
> The new adaptive spin heuristics that have been discussed some time
> ago didn't help? Unsurprisingly making locks a lot more expensive
> has drawbacks as you discovered.

Yes. Well, while the adaptive spinlocks did great things for the
network throughput last time I tested them, they also didn't quite
give the determinism needed in other areas. It would be nice to be able to
target a handful of trouble locks with adaptive spinlocks.

Even so, though I saw dramatic throughput increases with adaptive
spinlocks, they would still be bound by this same lock contention
that I am seeing when the locks are true spinlocks.

>> &list->lock#3: 24517307 24643791 0.71 1286.62 56516392.42 34834296 44904018 0.60 164.79 31314786.02
>> -------------
>> &list->lock#3 15596927 [<ffffffff812474da>] dev_queue_xmit+0x2ea/0x468
>
> The real "fix" would be probably to use a multi queue capable NIC
> and a NIC driver that sets up multiple queues for TX (normally they
> only do for RX). Then cores or a set of cores (often the number
> of cores is larger than the number of NIC queues) could avoid this
> problem. Disadvantage: more memory use.

Hmmm. So does either the netxen_nic or bnx2x driver support multiple
queues? (that is the HW that I have access to right now). And do I
need to do anything to set them up?

> But then again I'm not sure it's worth it if the problem only
> happens in out of tree RT.

The effects of the high contention are not quite so pronounced in the
vanilla kernel, but I think we are still limited by this lock. In the
-rt kernel, it is obvious that the lock contention is causing lots of
trouble.

--Vernon

2009-03-18 21:10:54

by Vernon Mauery

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

David Miller wrote:
> From: Andi Kleen <[email protected]>
> Date: Wed, 18 Mar 2009 21:54:37 +0100
>
>> But then again I'm not sure it's worth it if the problem only
>> happens in out of tree RT.
>
> The list of problems that only show up with the RT kernel seems to be
> constantly increasing, but honestly is very surprising to me.
>
> I don't understand why we even need to be concerned about this stuff
> upstream, to be honest.
>
> Please reproduce this in the vanilla kernel, then get back to us.

Huh? The numbers that I posted *were* from the vanilla kernel. I ran
the 2.6.29-rc8 kernel with lock_stat enabled. The lock contention
happens on the same lock in both vanilla and -rt; it just happens
to be more pronounced in the -rt kernel because of the double context
switches that the sleeping spinlock/rt-mutexes introduce.

--Vernon

2009-03-18 21:39:18

by David Miller

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

From: Vernon Mauery <[email protected]>
Date: Wed, 18 Mar 2009 14:10:33 -0700

> David Miller wrote:
> > From: Andi Kleen <[email protected]>
> > Date: Wed, 18 Mar 2009 21:54:37 +0100
> >
> >> But then again I'm not sure it's worth it if the problem only
> >> happens in out of tree RT.
> > The list of problems that only show up with the RT kernel seems to be
> > constantly increasing, but honestly is very surprising to me.
> > I don't understand why we even need to be concerned about this stuff
> > upstream, to be honest.
> > Please reproduce this in the vanilla kernel, then get back to us.
>
> Huh? The numbers that I posted *were* from the vanilla kernel. I ran
> the 2.6.29-rc8 kernel with lock_stat enabled. The lock contention
> happens on the same lock in both vanilla and -rt, it just happens
> to be more pronounced in the -rt kernel because of the double context
> switches that the sleeping spinlock/rt-mutexes introduce.

And the double context switches are probably also why less
natural batching and locality are achieved in the RT kernel.

Isn't that true?

2009-03-18 21:56:57

by Eilon Greenstein

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

On Wed, 2009-03-18 at 14:07 -0700, Vernon Mauery wrote:
> >
> > The real "fix" would be probably to use a multi queue capable NIC
> > and a NIC driver that sets up multiple queues for TX (normally they
> > only do for RX). Then cores or a set of cores (often the number
> > of cores is larger than the number of NIC queues) could avoid this
> > problem. Disadvantage: more memory use.
>
> Hmmm. So do either the netxen_nic or bnx2x drivers support multiple
> queues? (that is the HW that I have access to right now). And do I
> need to do anything to set them up?
>
The version of bnx2x in net-next supports multiple Tx queues (and Rx). It
will open an equal number of Tx and Rx queues, up to 16 or the number of
cores in the system, whichever is smaller. You can validate that all queues
are transmitting with "ethtool -S", which has per-queue statistics in that version.

I hope it will help,
Eilon

2009-03-18 21:57:23

by Vernon Mauery

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

David Miller wrote:
> From: Vernon Mauery <[email protected]>
> Date: Wed, 18 Mar 2009 14:10:33 -0700
>
>> David Miller wrote:
>>> From: Andi Kleen <[email protected]>
>>> Date: Wed, 18 Mar 2009 21:54:37 +0100
>>>
>>>> But then again I'm not sure it's worth it if the problem only
>>>> happens in out of tree RT.
>>> The list of problems that only show up with the RT kernel seems to be
>>> constantly increasing, but honestly is very surprising to me.
>>> I don't understand why we even need to be concerned about this stuff
>>> upstream, to be honest.
>>> Please reproduce this in the vanilla kernel, then get back to us.
>> Huh? The numbers that I posted *were* from the vanilla kernel. I ran
>> the 2.6.29-rc8 kernel with lock_stat enabled. The lock contention
>> happens on the same lock in both vanilla and -rt, it just happens
>> to be more pronounced in the -rt kernel because of the double context
>> switches that the sleeping spinlock/rt-mutexes introduce.
>
> And the double context switches are probably also why less
> natural batching and locality are achieved in the RT kernel.
>
> Isn't that true?

Yes, the double context switches surely hurt the temporal and
spatial locality of the vanilla codepath, but they also induce a
longer penalty for blocking on a lock -- instead of nanoseconds
or a few microseconds, the task gets delayed for tens of
microseconds. So really, the -rt kernel has more to fix than
the vanilla kernel in this case, but any improvement in the lock
contention in the vanilla case would be magnified and would cause
dramatic improvements in the -rt kernel.

--Vernon

2009-03-18 21:57:59

by Vernon Mauery

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

Eilon Greenstein wrote:
> On Wed, 2009-03-18 at 14:07 -0700, Vernon Mauery wrote:
>>> The real "fix" would be probably to use a multi queue capable NIC
>>> and a NIC driver that sets up multiple queues for TX (normally they
>>> only do for RX). Then cores or a set of cores (often the number
>>> of cores is larger than the number of NIC queues) could avoid this
>>> problem. Disadvantage: more memory use.
>> Hmmm. So do either the netxen_nic or bnx2x drivers support multiple
>> queues? (that is the HW that I have access to right now). And do I
>> need to do anything to set them up?
>>
> The version of bnx2x in net-next support multi Tx queues (and Rx). It
> will open an equal number of Tx and Rx queues up to 16 or the number of
> cores in the system. You can validate that all queues are transmitting
> with "ethtool -S" which has per queue statistics in that version.

Thanks. I will test to see how this affects this lock contention the
next time the broadcom hardware is available.

--Vernon

2009-03-18 21:58:27

by Gregory Haskins

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

David Miller wrote:
> From: Vernon Mauery <[email protected]>
> Date: Wed, 18 Mar 2009 14:10:33 -0700
>
>
>> David Miller wrote:
>>
>>> From: Andi Kleen <[email protected]>
>>> Date: Wed, 18 Mar 2009 21:54:37 +0100
>>>
>>>
>>>> But then again I'm not sure it's worth it if the problem only
>>>> happens in out of tree RT.
>>>>
>>> The list of problems that only show up with the RT kernel seems to be
>>> constantly increasing, but honestly is very surprising to me.
>>> I don't understand why we even need to be concerned about this stuff
>>> upstream, to be honest.
>>> Please reproduce this in the vanilla kernel, then get back to us.
>>>
>> Huh? The numbers that I posted *were* from the vanilla kernel. I ran
>> the 2.6.29-rc8 kernel with lock_stat enabled. The lock contention
>> happens on the same lock in both vanilla and -rt, it just happens
>> to be more pronounced in the -rt kernel because of the double context
>> switches that the sleeping spinlock/rt-mutexes introduce.
>>
>
> And the double context switches are probably also why less
> natural batching and locality are achieved in the RT kernel.
>
> Isn't that true?
>

Note that -rt doesn't typically context-switch under contention anymore
since we introduced adaptive-locks. Also note that the contention
against the lock is still contention, regardless of whether you have -rt
or not. It's just that the slow path to handle the contended case for
-rt is more expensive than mainline. However, once you have the
contention as stated, you have already lost.

We have observed the poster's findings ourselves in both mainline and
-rt, i.e. that lock doesn't scale very well once you have more than a
handful of cores. It's certainly a great area to look at for improving
the overall stack, IMO, as I believe there is quite a bit of headroom
left to be recovered that is buried there.

Regards,
-Greg



2009-03-18 22:00:20

by Andi Kleen

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

> Thanks. I will test to see how this affects this lock contention the
> next time the broadcom hardware is available.

The other strategy to reduce lock contention here is to use TSO/GSO/USO.
With that, the lock has to be taken less often because there are fewer packets
travelling down the stack. I'm not sure how well that works with netperf-style
workloads, though. Using multiple TX queues is probably better if you have
capable hardware. Or ideally both.

-Andi

--
[email protected] -- Speaking for myself only.

2009-03-18 22:20:12

by Rick Jones

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

Andi Kleen wrote:
>>Thanks. I will test to see how this affects this lock contention the
>>next time the broadcom hardware is available.
>
>
> The other strategy to reduce lock contention here is to use TSO/GSO/USO.
> With that the lock has to be taken less often because there are less packets
> travelling down the stack. I'm not sure how well that works with netperf style
> workloads though.

It all depends on what the user provides with the test-specific -m option for how
much data they shove into the socket each time "send" is called, and I suppose on
whether they use the test-specific -D option to set TCP_NODELAY in the case of a
TCP test with small values of -m. E.g.

netperf -t TCP_STREAM ... -- -m 64K
vs
netperf -t TCP_STREAM ... -- -m 1024
vs
netperf -t TCP_STREAM ... -- -m 1024 -D
vs
netperf -t UDP_STREAM ... -- -m 1024

etc etc.

If the netperf test is:

netperf -t TCP_RR ... -- -r 1 (single-byte request/response)

then TSO/GSO/USO won't matter at all, and probably still won't matter even if the
user has ./configure'd netperf with --enable-burst and does:

netperf -t TCP_RR ... -- -r 1 -b 64
or
netperf -t TCP_RR ... -- -r 1 -b 64 -D

which was basically what I was doing for the 32-core scaling stuff I posted about
a few weeks ago. That was running on multi-queue NICs, so looking at some of the
profiles of the "no iptables" data may help show how big/small the problem is,
keeping in mind that my runs (either the XFrame II runs, or the Chelsio T3C runs
before them) had one queue per core in the system...and as such may be a best
case scenario as far as lock contention on a per-queue basis goes.

ftp://ftp.netperf.org/

happy benchmarking,

rick jones

BTW, that setup went "poof" and had to go to other nefarious porpoises. I'm not
sure when I can recreate it, but I still have both the XFrame and T3C NICs for when
the HW comes free again.

2009-03-19 01:02:53

by David Miller

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

From: Vernon Mauery <[email protected]>
Date: Wed, 18 Mar 2009 14:49:17 -0700

> Yes, the double context switches surely hurt the temporal and
> spatial locality of the vanilla codepath, but it also induces a
> longer penalty for blocking on a lock -- instead of a nanoseconds
> or a few microseconds, the task gets delayed for tens of
> microseconds. So really, the -rt kernel has more to fix than
> the vanilla kernel in this case, but any improvement in the lock
> contention in the vanilla case would be magnified and would cause
> dramatic improvements in the -rt kernel.

Contention on a shared resource is not implicitly bad.

And with upstream spinlocks the cost is relatively low for a case like
this, where a thread of control goes in, holds the lock only long
enough to unlink a packet from the list, and then immediately releases
it.

The cost should be: a cache line move from cpu to cpu, an atomic lock,
a linked-list unlink, a store, and a memory barrier. And that's
all it is upstream.

If the -rt kernel makes this 10 times more expensive, I really don't
see why that is an upstream concern at the current point in time.

That's the tradeoff, common situations where locks are held by
multiple threads with contention, but only for mere cycles, are
seemingly a lot more expensive in the -rt kernel.

I mean, for example, why doesn't the -rt kernel just spin for a little
while like a normal spinlock would instead of always entering that
expensive contended code path? It would be a huge win here, but I
have yet to see discussion of changes in that area of the -rt kernel
to combat these effects. Is it the networking that always has to
change for the sake of -rt? :-)

2009-03-19 01:04:23

by David Miller

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

From: Gregory Haskins <[email protected]>
Date: Wed, 18 Mar 2009 17:54:04 -0400

> Note that -rt doesnt typically context-switch under contention anymore
> since we introduced adaptive-locks. Also note that the contention
> against the lock is still contention, regardless of whether you have -rt
> or not. Its just that the slow-path to handle the contended case for
> -rt is more expensive than mainline. However, once you have the
> contention as stated, you have already lost.

First, contention is not implicitly a bad thing.

Second, if the -rt kernel is doing adaptive spinning I see no
reason why that adaptive spinning is not kicking in here to
make this problem just go away.

This lock is held for mere cycles, just to unlink an SKB from
the networking qdisc, and then it is immediately released.

2009-03-19 01:17:41

by David Miller

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

From: Sven-Thorsten Dietrich <[email protected]>
Date: Wed, 18 Mar 2009 18:13:11 -0700

> On Wed, 2009-03-18 at 18:03 -0700, David Miller wrote:
> > From: Gregory Haskins <[email protected]>
> > Date: Wed, 18 Mar 2009 17:54:04 -0400
> >
> > > Note that -rt doesnt typically context-switch under contention anymore
> > > since we introduced adaptive-locks. Also note that the contention
> > > against the lock is still contention, regardless of whether you have -rt
> > > or not. Its just that the slow-path to handle the contended case for
> > > -rt is more expensive than mainline. However, once you have the
> > > contention as stated, you have already lost.
> >
> > First, contention is not implicitly a bad thing.
>
> Its a bad thing when it does not scale.

You have only one pipe to shove packets into in this case, and for
your workload multiple cpus are going to be trying to stuff a single
packet at a time from a single UDP send request.

There is no added parallelism you can create for that kind of workload
on that kind of hardware.

It is one of the issues addressed by multiqueue.

2009-03-19 01:20:33

by Sven-Thorsten Dietrich

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

On Wed, 2009-03-18 at 18:03 -0700, David Miller wrote:
> From: Gregory Haskins <[email protected]>
> Date: Wed, 18 Mar 2009 17:54:04 -0400
>
> > Note that -rt doesnt typically context-switch under contention anymore
> > since we introduced adaptive-locks. Also note that the contention
> > against the lock is still contention, regardless of whether you have -rt
> > or not. Its just that the slow-path to handle the contended case for
> > -rt is more expensive than mainline. However, once you have the
> > contention as stated, you have already lost.
>
> First, contention is not implicitly a bad thing.
>

It's a bad thing when it does not scale.

> Second, if the -rt kernel is doing adaptive spinning I see no
> reason why that adaptive spinning is not kicking in here to
> make this problem just go away.
>

If only the first of N contending threads gets to spin, threads 2..N would
context-switch.


> This lock is held for mere cycles, just to unlink an SKB from
> the networking qdisc, and then it is immediately released.

For very short hold times and heavy contention, as well as for
scalability, the solution may lie in a tunable spinner count and an adaptive
spinner time-out.

Sven


2009-03-19 01:44:43

by Sven-Thorsten Dietrich

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

On Wed, 2009-03-18 at 18:17 -0700, David Miller wrote:
> From: Sven-Thorsten Dietrich <[email protected]>
> Date: Wed, 18 Mar 2009 18:13:11 -0700
>
> > On Wed, 2009-03-18 at 18:03 -0700, David Miller wrote:
> > > From: Gregory Haskins <[email protected]>
> > > Date: Wed, 18 Mar 2009 17:54:04 -0400
> > >
> > > > Note that -rt doesnt typically context-switch under contention anymore
> > > > since we introduced adaptive-locks. Also note that the contention
> > > > against the lock is still contention, regardless of whether you have -rt
> > > > or not. Its just that the slow-path to handle the contended case for
> > > > -rt is more expensive than mainline. However, once you have the
> > > > contention as stated, you have already lost.
> > >
> > > First, contention is not implicitly a bad thing.
> >
> > Its a bad thing when it does not scale.
>
> You have only one pipe to shove packets into in this case, and for
> your workload multiple cpus are going to be trying to stuff a single
> packet at a time from a single UDP send request.
>
> There is no added parallelism you can create for that kind of workload
> on that kind of hardware.
>

Do we have to rule out per-CPU queues that aggregate into a master
queue in a batch-wise manner?

I speculate that might reduce the lock contention by a factor of NCPUs.
I cannot say whether this would be enough to offset the consequent overhead
penalty.

> It is one of the issues addressed by multiqueue.

Properly tuned adaptive locks would theoretically perform
near-optimally compared to the mainline case, and would provide better
CPU utilization on very large parallel architectures.

Sven

2009-03-19 01:55:13

by David Miller

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

From: Sven-Thorsten Dietrich <[email protected]>
Date: Wed, 18 Mar 2009 18:43:27 -0700

> Do we have to rule-out per-CPU queues, that aggregate into a master
> queue in a batch-wise manner?

That would violate the properties and characteristics expected by
the packet scheduler, w.r.t. fairness, rate limiting,
etc.

The only legal situation where we can parallelize to a single device
is where only the most trivial packet scheduler is attached to
the device and the device is multiqueue, and that is exactly what
we do right now.

2009-03-19 03:46:49

by Gregory Haskins

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

David Miller wrote:
> From: Gregory Haskins <[email protected]>
> Date: Wed, 18 Mar 2009 17:54:04 -0400
>
>
>> Note that -rt doesnt typically context-switch under contention anymore
>> since we introduced adaptive-locks. Also note that the contention
>> against the lock is still contention, regardless of whether you have -rt
>> or not. Its just that the slow-path to handle the contended case for
>> -rt is more expensive than mainline. However, once you have the
>> contention as stated, you have already lost.
>>
>
> First, contention is not implicitly a bad thing.
>
However, when the contention in question is your top bottleneck, even
small improvements have the potential to yield large performance gains. :)

> Second, if the -rt kernel is doing adaptive spinning I see no
> reason why that adaptive spinning is not kicking in here
It does. Things would be *much* worse if it wasn't.

> to
> make this problem just go away.
>
Well, "go away" is a relative term. Before adaptive-locks, the box was
heavily context-switching under a network workload like we are
discussing, and that is where much of the performance was lost.
Adaptive-locks mitigate that particular problem so that spinlock_t
clients now more closely model mainline behavior (i.e. spin under
contention, at least when they can) and this brought -rt much closer to
what you expect for performance from mainline.

However, in -rt the contention path is, by necessity, still moderately
more heavyweight than a simple mainline spin (PI, etc), so any
contention will still hurt relatively more. And obviously
adaptive-locks do nothing to address the contention itself. Therefore,
as long as the contention remains it will have a higher impact in -rt.
But make no mistake: the contention can (and I believe does) affect both
trees.

To see this in action, try taking a moderately large smp system (8-way+)
and scaling the number of flows. At 1 flow the stack is generally quite
capable of maintaining line-rate (at least up to GigE). With two flows
this should result in each achieving roughly 50% of the line-rate, for a
sum total of line-rate. But as you add flows to the equation, the
aggregate bandwidth typically starts to drop off to be something less
than line-rate (many times it's actually much less). And when this
happens, the contention in question is usually at the top of the charts
(thus all the interest in this thread).

> This lock is held for mere cycles, just to unlink an SKB from
> the networking qdisc, and then it is immediately released.
>
To clarify: I haven't been looking at the stack code since last summer,
so some things may have changed since then. However, the issue back
then from my perspective was the general backpressure in the qdisc
subsystem (e.g. dev->queue_lock), not just the sk_buff_head.lock per se.
This dev->queue_lock can be held for quite a long time, and the
mechanism in general scales poorly as the fabric speeds and core-counts
increase. (It is arguable whether you would even want another buffering
layer on any reasonably fast interconnect, anyway. It ultimately just
destabilizes the flows and is probably best left to the underlying
hardware which will typically have the proper facilities for
shaping/limiting, etc. However, that is a topic for another discussion ;).)

One approach that we found yielded some significant improvements here
was to bypass the qdisc outright and do per-cpu lockless queuing. One
thread/core could then simply aggregate the per-cpu buffers with only
minor contention between the egress (consuming) core and the producer
thread. I realize that this solution as-is may not be feasible for
production use. However, it did serve to prove the theory that backing
off cache-line pressures that scale with the core count *did* help the
stack make more forward progress per unit time (e.g. higher throughput)
as the number of flows increased. The idea was more or less to refactor
the egress/TX path to be more per-cpu, similar to the methodology of the
ingress/RX side.
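
(To make the shape of that prototype concrete, here is a minimal sketch of the
idea as described above: per-CPU producer queues drained by a single egress
context. This is a reconstruction from the description, not the actual
prototype code, and sketch_xmit_one() is a hypothetical transmit helper. The
per-CPU sk_buff_head still has a lock, but it is essentially uncontended since
only one producer and the single consumer ever touch it.)

static DEFINE_PER_CPU(struct sk_buff_head, sketch_txq);

static void sketch_txq_init(void)
{
	int cpu;

	for_each_possible_cpu(cpu)
		skb_queue_head_init(&per_cpu(sketch_txq, cpu));
}

/* producer side: each sender queues locally, no globally shared cache line */
static void sketch_local_enqueue(struct sk_buff *skb)
{
	skb_queue_tail(&get_cpu_var(sketch_txq), skb);
	put_cpu_var(sketch_txq);
}

/* consumer side: one egress thread/softirq aggregates and transmits */
static void sketch_egress_drain(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct sk_buff_head *q = &per_cpu(sketch_txq, cpu);
		struct sk_buff *skb;

		while ((skb = skb_dequeue(q)) != NULL)
			sketch_xmit_one(skb);	/* hypothetical helper */
	}
}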

You could probably adapt this same concept to work harmoniously with the
qdisc infrastructure (though as previously indicated, I am not sure if
we would *want* to ;)). Something to consider, anyway.

Regards,
-Greg





2009-03-19 05:38:33

by David Miller

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

From: Gregory Haskins <[email protected]>
Date: Wed, 18 Mar 2009 23:48:46 -0400

> To see this in action, try taking a moderately large smp system
> (8-way+) and scaling the number of flows.

I can maintain line-rate over 10GB with a 64-cpu box. It's not
a problem.

2009-03-19 05:49:50

by Eric Dumazet

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

David Miller wrote:
> From: Sven-Thorsten Dietrich <[email protected]>
> Date: Wed, 18 Mar 2009 18:43:27 -0700
>
>> Do we have to rule-out per-CPU queues, that aggregate into a master
>> queue in a batch-wise manner?
>
> That would violate the properties and characteristics expected by
> the packet scheduler, wrt. to fair based fairness, rate limiting,
> etc.
>
> The only legal situation where we can parallelize to single device
> is where only the most trivial packet scheduler is attached to
> the device and the device is multiqueue, and that is exactly what
> we do right now.

I agree with you David.

Still, there is room for improvements, since :

1) default qdisc is pfifo_fast. This beast uses three sk_buff_head (96 bytes)
where it could use 3 smaller list_head (3 * 16 = 48 bytes on x86_64)

(assuming sizeof(spinlock_t) is only 4 bytes, but it's more than that
in various situations (LOCKDEP, ...))

2) struct Qdisc layout could be better, putting read-mostly fields
at the beginning of the structure (i.e. move 'dev_queue', 'next_sched', reshape_fail,
u32_node, __parent, ...).

'struct gnet_stats_basic' has a 32-bit hole.

'gnet_stats_queue' could be split, at least in Qdisc, so that three
seldom-used fields (drops, requeues, overlimits) go in a different cache line.

gnet_stats_rate_est might also be moved to a 'not very used' cache line, if
I am not mistaken?

3) In a stress situation, CPU A queues an skb to a sk_buff_head, but CPU B
dequeues it to feed the device, involving an expensive cache line miss
on skb.{next|prev} (to set them to NULL).

We could:
Use a special dequeue op that doesn't touch skb.{next|prev}
Eventually set next/prev to NULL after q.lock is released
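
(For item 3, such a dequeue variant might look roughly like the sketch below.
It is purely illustrative: sketch_dequeue_raw() is a hypothetical helper, it
relies on the documented fact that the head of a sk_buff_head aliases an
skb's next/prev pair (as skb_peek() does), and as the next reply points out,
deferring the NULLing of skb->next has GSO implications.)

static struct sk_buff *sketch_dequeue_raw(struct sk_buff_head *list)
{
	struct sk_buff *skb;

	spin_lock(&list->lock);
	skb = skb_peek(list);
	if (skb) {
		struct sk_buff *next = skb->next;

		/* unlink by rewriting only the neighbours */
		list->qlen--;
		((struct sk_buff *)list)->next = next;
		next->prev = (struct sk_buff *)list;
	}
	spin_unlock(&list->lock);

	/* dirty skb->next/prev outside the contended lock */
	if (skb)
		skb->next = skb->prev = NULL;

	return skb;
}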


2009-03-19 05:58:49

by David Miller

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

From: Eric Dumazet <[email protected]>
Date: Thu, 19 Mar 2009 06:49:19 +0100

> Still, there is room for improvements, since :
>
> 1) default qdisc is pfifo_fast. This beast uses three sk_buff_head (96 bytes)
> where it could use 3 smaller list_head (3 * 16 = 48 bytes on x86_64)
>
> (assuming sizeof(spinlock_t) is only 4 bytes, but it's more than that
> on various situations (LOCKDEP, ...)

I already plan on doing this, skb->{next,prev} will be replaced with a
list_head and nearly all of the sk_buff_head usage will simply
disappear. It's a lot of work because every piece of SKB queue
handling code has to be sanitized to only use the interfaces in
linux/skbuff.h, and lots of extremely ugly code like the PPP
defragmenter makes many non-trivial direct skb->{next,prev}
manipulations.

> 2) struct Qdisc layout could be better, letting read mostly fields
> at beginning of structure. (ie move 'dev_queue', 'next_sched', reshape_fail,
> u32_node, __parent, ...)

I have no problem with your struct layout changes, submit it formally.

> 3) In stress situation a CPU A queues a skb to a sk_buff_head, but a CPU B
> dequeues it to feed device, involving an expensive cache line miss
> on the skb.{next|prev} (to set them to NULL)
>
> We could:
> Use a special dequeue op that doesnt touch skb.{next|prev}
> Eventually set next/prev to NULL after q.lock is released

You absolutely can't do this, as it would break GSO/GRO. The whole
transmit path is littered with checks of skb->next being NULL for the
purposes of segmentation handling.

2009-03-19 07:15:43

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

On Wed, Mar 18, 2009 at 05:54:04PM -0400, Gregory Haskins ([email protected]) wrote:
> Note that -rt doesnt typically context-switch under contention anymore
> since we introduced adaptive-locks. Also note that the contention
> against the lock is still contention, regardless of whether you have -rt
> or not. Its just that the slow-path to handle the contended case for
> -rt is more expensive than mainline. However, once you have the
> contention as stated, you have already lost.
>
> We have observed the posters findings ourselves in both mainline and
> -rt. I.e. That lock doesnt scale very well once you have more than a
> handful of cores. It's certainly a great area to look at for improving
> the overall stack, IMO, as I believe there is quite a bit of headroom
> left to be recovered that is buried there.

Something tells me that you observed sk_buff_head lock contention because
you stressed the network while everything else on that machine was left
sleeping. What if you start an IO stress test, will __queue_lock contention
be of the same order of magnitude? Or what if you have as many processes as
skbs, each racing for the scheduler? Will the run-queue lock show up in the
stats?

I believe the answer is yes to all of these questions.
You stressed one subsystem and it showed up in the statistics.

--
Evgeniy Polyakov

2009-03-19 12:40:32

by Gregory Haskins

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

David Miller wrote:
> From: Gregory Haskins <[email protected]>
> Date: Wed, 18 Mar 2009 23:48:46 -0400
>
>
>> To see this in action, try taking a moderately large smp system
>> (8-way+) and scaling the number of flows.
>>
>
> I can maintain line-rate over 10GB with a 64-cpu box.
Oh man, I am jealous of that 64-way :)

How many simultaneous flows? What hardware? What qdisc and other
config do you use? MTU? I cannot replicate such results on 10GB even
with much smaller cpu counts.

On my test rig here, I have a 10GB link connected by crossover between
two 8-core boxes. Running one unidirectional TCP flow typically
tops out at ~5.5Gb/s on 2.6.29-rc8. Granted, we are using MTU=1500,
which in and of itself is part of the upper limit. However, that result in
and of itself isn't a problem, per se. What is a problem is that the aggregate
bandwidth drops as the number of flows scales. I would like to
understand how to make this better, if possible, and perhaps I can learn
something from your setup.
> It's not
> a problem.
>

To clarify terms, we are not saying "the stack performs inadequately".
What we are saying here is that analysis of our workloads and of the
current stack indicates that we are io-bound, and that this particular
locking architecture in the qdisc subsystem is the apparent top gating
factor from going faster. Therefore we are really saying "how can we
make it even better"? This is not a bad question to ask in general,
would you agree?

To vet our findings, we built that prototype I mentioned in the last
mail where we substituted the single queue and queue_lock with a
per-cpu, lockless queue. This meant each cpu could submit work
independent of the others with substantially reduced contention. More
importantly, it eliminated the property of scaling the RFO pressure on a
single cache-lne for the queue-lock. When we did this, we got
significant increases in aggregate throughput (somewhere on the order of
6%-25% depending on workload, but this was last summer so I am a little
hazy on the exact numbers).

So you had said something to the effect of "Contention isn't implicitly
a bad thing". I agree up to a point, in that contention
cannot always be avoided. Ultimately we only have one resource in this
equation: the phy-link in question. So naturally multiple flows
targeted for that link will contend for it.

But the important thing to note here is that there are different kinds
of contention. And contention against spinlocks *is* generally bad, for
multiple reasons. It not only affects the cores under contention, but
it affects all cores that exist in the same coherency domain. IMO it
should be avoided whenever possible.

So I am not saying our per-cpu solution is the answer. But what I am
saying is that we found that an architecture that doesn't piggyback all
flows into a single spinlock does have the potential to unleash even
more Linux-networking fury :)

I haven't really been following the latest developments in netdev, but
if I understand correctly, part of what we are talking about here would
be addressed by the new MQ stuff? And if I also understand that
correctly, support for MQ is dependent on the hardware beneath it? If
so, I wonder if we could apply some of the ideas I presented earlier for
making "soft MQ" with a lockless-queue per flow or something like that?
Thoughts?

-Greg




2009-03-19 12:51:16

by Peter W. Morreale

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

On Wed, 2009-03-18 at 18:03 -0700, David Miller wrote:
> From: Gregory Haskins <[email protected]>
> Date: Wed, 18 Mar 2009 17:54:04 -0400
>
> > Note that -rt doesnt typically context-switch under contention anymore
> > since we introduced adaptive-locks. Also note that the contention
> > against the lock is still contention, regardless of whether you have -rt
> > or not. Its just that the slow-path to handle the contended case for
> > -rt is more expensive than mainline. However, once you have the
> > contention as stated, you have already lost.
>
> First, contention is not implicitly a bad thing.
>
> Second, if the -rt kernel is doing adaptive spinning I see no
> reason why that adaptive spinning is not kicking in here to
> make this problem just go away.

The basic 'problem' with comparing RT adaptive spinning to non-rt
spinlocks is that if the lock owner is !oncpu, all spinners must break
and go to sleep, otherwise we (potentially) deadlock. This does not
exist for non-rt spinners.
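
(In heavily simplified pseudo-C, the adaptive path is roughly the following.
This is a self-contained sketch only, not the -rt rtmutex code: it ignores
priority inheritance and lock stealing, and sketch_block_until_free() is a
hypothetical "sleep until the owner unlocks" helper.)

#include <linux/sched.h>

struct adaptive_lock_sketch {
	struct task_struct *owner;	/* NULL when the lock is free */
};

static int sketch_try_take(struct adaptive_lock_sketch *l)
{
	return cmpxchg(&l->owner, NULL, current) == NULL;
}

static void sketch_lock(struct adaptive_lock_sketch *l)
{
	for (;;) {
		struct task_struct *owner;

		if (sketch_try_take(l))
			return;

		/*
		 * Spin only while the owner is actually running on a CPU.
		 * If the owner has been preempted or sleeps (!oncpu),
		 * spinning can (potentially) deadlock, so the waiter must
		 * break out and block instead.
		 */
		owner = l->owner;
		while (owner && task_curr(owner)) {
			if (sketch_try_take(l))
				return;
			cpu_relax();
			owner = l->owner;
		}

		sketch_block_until_free(l);	/* hypothetical helper */
	}
}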

Best,
-PWM


>
> This lock is held for mere cycles, just to unlink an SKB from
> the networking qdisc, and then it is immediately released.

2009-03-19 12:59:42

by Peter W. Morreale

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

On Wed, 2009-03-18 at 21:54 +0100, Andi Kleen wrote:
> Vernon Mauery <[email protected]> writes:
> >

>
> But then again I'm not sure it's worth it if the problem only
> happens in out of tree RT.
>

Heh. That darn bastard child again, eh? :-)


Recall that adaptive spin, developed in RT, is now in mainline.

Best,
-PWM

2009-03-19 13:36:25

by Peter W. Morreale

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

On Thu, 2009-03-19 at 06:59 -0600, Peter W. Morreale wrote:
> On Wed, 2009-03-18 at 21:54 +0100, Andi Kleen wrote:
> > Vernon Mauery <[email protected]> writes:
> > >
>
> >
> > But then again I'm not sure it's worth it if the problem only
> > happens in out of tree RT.
> >
>
> Heh. That darn bastard child again, eh? :-)
>
>
> Recall that adaptive spin,developed in RT, is now in mainline.
>

Err.. correction... *about* to be in mainline. The only point I'm making
is that RT does implicitly, if not explicitly, contribute to mainline.
It's not a one-way street.

Best,
-PWM

> Best,
> -PWM
>
>

2009-03-19 13:45:28

by Andi Kleen

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

> Do we have to rule-out per-CPU queues, that aggregate into a master
> queue in a batch-wise manner?

TSO/GSO/USO actually is that, in a very specific way. It batches
packets into larger packets, so fewer per-packet lock acquisitions are needed.
It was unclear whether that was used in this workload or not.

-Andi
--
[email protected] -- Speaking for myself only.

2009-03-19 13:46:29

by Andi Kleen

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

> Recall that adaptive spin,developed in RT, is now in mainline.

For mutexes, but not for spinlocks. And qdisc uses spinlocks.

-Andi

--
[email protected] -- Speaking for myself only.

2009-03-19 14:04:38

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH] net: reorder struct Qdisc for better SMP performance

David Miller wrote:
> From: Eric Dumazet <[email protected]>
> Date: Thu, 19 Mar 2009 06:49:19 +0100
>
>> Still, there is room for improvements, since :
>>
>> 1) default qdisc is pfifo_fast. This beast uses three sk_buff_head (96 bytes)
>> where it could use 3 smaller list_head (3 * 16 = 48 bytes on x86_64)
>>
>> (assuming sizeof(spinlock_t) is only 4 bytes, but it's more than that
>> on various situations (LOCKDEP, ...)
>
> I already plan on doing this, skb->{next,prev} will be replaced with a
> list_head and nearly all of the sk_buff_head usage will simply
> disappear. It's a lot of work because every piece of SKB queue
> handling code has to be sanitized to only use the interfaces in
> linux/skbuff.h and lots of extremely ugly code like the PPP
> defragmenter make many non-trivial direct skb->{next,prev}
> manipulations.
>
>> 2) struct Qdisc layout could be better, letting read mostly fields
>> at beginning of structure. (ie move 'dev_queue', 'next_sched', reshape_fail,
>> u32_node, __parent, ...)
>
> I have no problem with your struct layout changes, submit it formally.

OK here is the layout change then ;)

Thank you

[PATCH] net: reorder struct Qdisc for better SMP performance

dev_queue_xmit() needs to dirty fields "state", "q", "bstats" and "qstats"

On x86_64 arch, they currently span three cache lines, involving more
cache line ping pongs than necessary and lengthening the hold time of the queue spinlock.

We can reduce this to one cache line, by grouping all read-mostly fields
at the beginning of the structure. (Or should I say, all highly modified fields
at the end :) )

Before patch :

offsetof(struct Qdisc, state)=0x38
offsetof(struct Qdisc, q)=0x48
offsetof(struct Qdisc, bstats)=0x80
offsetof(struct Qdisc, qstats)=0x90
sizeof(struct Qdisc)=0xc8

After patch :

offsetof(struct Qdisc, state)=0x80
offsetof(struct Qdisc, q)=0x88
offsetof(struct Qdisc, bstats)=0xa0
offsetof(struct Qdisc, qstats)=0xac
sizeof(struct Qdisc)=0xc0

Signed-off-by: Eric Dumazet <[email protected]>
---
include/linux/gen_stats.h | 2 +-
include/net/sch_generic.h | 21 ++++++++++++---------
2 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/include/linux/gen_stats.h b/include/linux/gen_stats.h
index 13f4e74..0ffa41d 100644
--- a/include/linux/gen_stats.h
+++ b/include/linux/gen_stats.h
@@ -22,7 +22,7 @@ struct gnet_stats_basic
{
__u64 bytes;
__u32 packets;
-};
+} __attribute__ ((packed));

/**
* struct gnet_stats_rate_est - rate estimator
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f8c4742..b0ae100 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -48,18 +48,10 @@ struct Qdisc
int padded;
struct Qdisc_ops *ops;
struct qdisc_size_table *stab;
+ struct list_head list;
u32 handle;
u32 parent;
atomic_t refcnt;
- unsigned long state;
- struct sk_buff *gso_skb;
- struct sk_buff_head q;
- struct netdev_queue *dev_queue;
- struct Qdisc *next_sched;
- struct list_head list;
-
- struct gnet_stats_basic bstats;
- struct gnet_stats_queue qstats;
struct gnet_stats_rate_est rate_est;
int (*reshape_fail)(struct sk_buff *skb,
struct Qdisc *q);
@@ -70,6 +62,17 @@ struct Qdisc
* and it will live until better solution will be invented.
*/
struct Qdisc *__parent;
+ struct netdev_queue *dev_queue;
+ struct Qdisc *next_sched;
+
+ struct sk_buff *gso_skb;
+ /*
+ * For performance sake on SMP, we put highly modified fields at the end
+ */
+ unsigned long state;
+ struct sk_buff_head q;
+ struct gnet_stats_basic bstats;
+ struct gnet_stats_queue qstats;
};

struct Qdisc_class_ops

2009-03-19 20:53:19

by David Miller

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

From: Gregory Haskins <[email protected]>
Date: Thu, 19 Mar 2009 08:42:33 -0400

> David Miller wrote:
> > From: Gregory Haskins <[email protected]>
> > Date: Wed, 18 Mar 2009 23:48:46 -0400
> >
> >
> >> To see this in action, try taking a moderately large smp system
> >> (8-way+) and scaling the number of flows.
> >>
> >
> > I can maintain line-rate over 10GB with a 64-cpu box.
> Oh man, I am jealous of that 64-way :)
>
> How many simultaneous flows? What hardware? What qdisc and other
> config do you use? MTU? I cannot replicate such results on 10GB even
> with much smaller cpu counts.

Sun Neptune NIU 10G with 24 TX queues.

And it's all because of the number of TX queues, nothing more.

It is essential for good performance with any level of cpu
arity.

2009-03-20 08:34:42

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] net: reorder struct Qdisc for better SMP performance

From: Eric Dumazet <[email protected]>
Date: Thu, 19 Mar 2009 15:04:11 +0100

> [PATCH] net: reorder struct Qdisc for better SMP performance

Applied, thanks.

2009-03-20 23:41:51

by Jarek Poplawski

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

Vernon Mauery wrote, On 03/18/2009 09:17 PM:
...
> This patch does seem to reduce the number of contentions by about 10%. That is
> a good start (and a good catch on the cacheline bounces). But, like I mentioned
> above, this lock still has 2 orders of magnitude greater contention than the
> next lock, so even a large decrease like 10% makes little difference in the
> overall contention characteristics.
>
> So we will have to do something more. Whether it needs to be more complex or
> not is still up in the air. Batched enqueueing/dequeueing are just two options
> and the former would be a *lot* less complex than the latter.
>
> If anyone else has any ideas they have been holding back, now would be a great
> time to get them out in the open.

I think it would be interesting to check another idea around this
contention: not all contenders are equal here. One thread is doing
qdisc_run() and owns the transmit queue (even after releasing the TX
lock). So while it waits for the qdisc lock, the NIC, if not multiqueue,
sits idle. Perhaps some handicap like the one in the patch below could
make some difference in throughput; alas, I didn't test it.

Jarek P.
---

net/core/dev.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index f112970..d5ad808 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1852,7 +1852,11 @@ gso:
if (q->enqueue) {
spinlock_t *root_lock = qdisc_lock(q);

- spin_lock(root_lock);
+ while (!spin_trylock(root_lock)) {
+ do {
+ cpu_relax();
+ } while (spin_is_locked(root_lock));
+ }

if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
kfree_skb(skb);

2009-03-23 08:33:44

by Eric Dumazet

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

Jarek Poplawski wrote:
> Vernon Mauery wrote, On 03/18/2009 09:17 PM:
> ...
>> This patch does seem to reduce the number of contentions by about 10%. That is
>> a good start (and a good catch on the cacheline bounces). But, like I mentioned
>> above, this lock still has 2 orders of magnitude greater contention than the
>> next lock, so even a large decrease like 10% makes little difference in the
>> overall contention characteristics.
>>
>> So we will have to do something more. Whether it needs to be more complex or
>> not is still up in the air. Batched enqueueing/dequeueing are just two options
>> and the former would be a *lot* less complex than the latter.
>>
>> If anyone else has any ideas they have been holding back, now would be a great
>> time to get them out in the open.
>
> I think it would be interesting to check another idea around this
> contention: not all contenders are equal here. One thread is doing
> qdisc_run() and owns the transmit queue (even after releasing the TX
> lock). So while it waits for the qdisc lock, the NIC, if not multiqueue,
> sits idle. Perhaps some handicap like the one in the patch below could
> make some difference in throughput; alas, I didn't test it.
>
> Jarek P.
> ---
>
> net/core/dev.c | 6 +++++-
> 1 files changed, 5 insertions(+), 1 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index f112970..d5ad808 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1852,7 +1852,11 @@ gso:
> if (q->enqueue) {
> spinlock_t *root_lock = qdisc_lock(q);
>
> - spin_lock(root_lock);
> + while (!spin_trylock(root_lock)) {
> + do {
> + cpu_relax();
> + } while (spin_is_locked(root_lock));
> + }
>
> if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> kfree_skb(skb);
>
>

I don't understand: doesn't this defeat the ticket spinlock and its fairness?

The thread doing __qdisc_run() already owns the __QDISC_STATE_RUNNING bit.

Trying or taking the spinlock has the same effect, since it forces a cache
line ping pong, and that is the real problem.
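
Eric's point can be modelled with C11 atomics (a generic test-and-set lock
sketch, not the kernel's ticket spinlock): even a failed trylock issues an
atomic read-modify-write on the lock word, so the cache line holding that
word still has to bounce to the trying CPU.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* One lock word shared by all CPUs; it lives in a single cache line. */
static atomic_int lock_word;	/* 0 = unlocked, 1 = locked */

static void lock(void)
{
	int expected = 0;

	/* Atomic RMW: the line must be owned exclusively by this CPU. */
	while (!atomic_compare_exchange_weak(&lock_word, &expected, 1))
		expected = 0;
}

static bool trylock(void)
{
	int expected = 0;

	/* Also an atomic RMW on the same word: the line bounces here even
	 * when the attempt fails. */
	return atomic_compare_exchange_strong(&lock_word, &expected, 1);
}

static void unlock(void)
{
	atomic_store(&lock_word, 0);
}

int main(void)
{
	lock();
	printf("trylock while held:   %d\n", trylock());	/* prints 0 */
	unlock();
	printf("trylock after unlock: %d\n", trylock());	/* prints 1 */
	return 0;
}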

2009-03-23 08:38:23

by David Miller

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

From: Eric Dumazet <[email protected]>
Date: Mon, 23 Mar 2009 09:32:39 +0100

> I don't understand: doesn't this defeat the ticket spinlock and its fairness?
>
> The thread doing __qdisc_run() already owns the __QDISC_STATE_RUNNING bit.

Right.

Remember, the way this is designed is that if there is a busy
cpu taking packets out of the queue and putting them into the
device then other cpus will simply add to the queue and immediately
return. This effectively keeps the queue running there processing
all the new work that other cpus are adding to the qdisc.

Those other cpus make these decisions by looking at that
__QDISC_STATE_RUNNING bit, which the queue runner grabs before
it does any work.
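
That scheme can be modelled in userspace (a pthread mutex in place of the
qdisc root lock, a plain flag in place of __QDISC_STATE_RUNNING; a sketch,
not kernel code): every caller enqueues under the lock, but only the caller
that finds the runner flag clear stays behind to drain the queue, while
everyone else returns immediately.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t root_lock = PTHREAD_MUTEX_INITIALIZER;
static int queue_len;		/* packets waiting in the qdisc */
static int runner_active;	/* is some CPU already draining the queue? */

static void enqueue_and_maybe_run(void)
{
	int i_am_runner;

	pthread_mutex_lock(&root_lock);
	queue_len++;			/* enqueue the packet */
	/* Claim the runner role under the same lock, so a packet can never
	 * be enqueued after the runner has decided the queue is empty. */
	i_am_runner = !runner_active;
	runner_active = 1;
	pthread_mutex_unlock(&root_lock);

	if (!i_am_runner)
		return;			/* another CPU is already draining */

	for (;;) {
		pthread_mutex_lock(&root_lock);
		if (queue_len == 0) {
			runner_active = 0;	/* give up the role */
			pthread_mutex_unlock(&root_lock);
			return;
		}
		queue_len--;		/* dequeue and hand to the device */
		pthread_mutex_unlock(&root_lock);
	}
}

int main(void)
{
	int i;

	for (i = 0; i < 1000; i++)
		enqueue_and_maybe_run();
	printf("left in queue: %d\n", queue_len);
	return 0;
}

The flag removes duplicate runners, but every enqueueing CPU and the runner
still meet on the same root lock, which is where the contention reported in
this thread comes from.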

2009-03-23 08:51:25

by Jarek Poplawski

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

On Mon, Mar 23, 2009 at 01:37:49AM -0700, David Miller wrote:
> From: Eric Dumazet <[email protected]>
> Date: Mon, 23 Mar 2009 09:32:39 +0100
>
> > I don't understand: doesn't this defeat the ticket spinlock and its fairness?
> >
> > The thread doing __qdisc_run() already owns the __QDISC_STATE_RUNNING bit.
>
> Right.
>
> Remember, the way this is designed is that if there is a busy
> cpu taking packets out of the queue and putting them into the
> device then other cpus will simply add to the queue and immediately
> return.

But this "busy cpu" can't take packets out of the queue when it's
waiting on the contended spinlock. Anyway, it's only for testing,
and I didn't say it has to be right.

Jarek P.

2009-04-02 14:14:20

by Herbert Xu

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

David Miller <[email protected]> wrote:
> From: Eric Dumazet <[email protected]>
> Date: Mon, 23 Mar 2009 09:32:39 +0100
>
>> I don't understand: doesn't this defeat the ticket spinlock and its fairness?
>>
>> The thread doing __qdisc_run() already owns the __QDISC_STATE_RUNNING bit.
>
> Right.
>
> Remember, the way this is designed is that if there is a busy
> cpu taking packets out of the queue and putting them into the
> device then other cpus will simply add to the queue and immediately
> return. This effectively keeps the queue running there processing
> all the new work that other cpus are adding to the qdisc.
>
> Those other cpus make these decisions by looking at that
> __QDISC_STATE_RUNNING bit, which the queue runner grabs before
> it does any work.

Come on guys, if this lock is a problem, go out and buy a proper
NIC that supports multiqueue TX!

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-04-02 14:16:10

by Herbert Xu

[permalink] [raw]
Subject: Re: High contention on the sk_buff_head.lock

On Thu, Apr 02, 2009 at 10:13:42PM +0800, Herbert Xu wrote:
>
> Come on guys, if this lock is a problem, go out and buy a proper
> NIC that supports multiqueue TX!

Nevermind, I was reading old emails again :)
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt