Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754957AbZCRTHj (ORCPT ); Wed, 18 Mar 2009 15:07:39 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753687AbZCRTHa (ORCPT ); Wed, 18 Mar 2009 15:07:30 -0400 Received: from gw1.cosmosbay.com ([212.99.114.194]:54162 "EHLO gw1.cosmosbay.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753361AbZCRTH3 convert rfc822-to-8bit (ORCPT ); Wed, 18 Mar 2009 15:07:29 -0400 Message-ID: <49C14667.2040806@cosmosbay.com> Date: Wed, 18 Mar 2009 20:07:19 +0100 From: Eric Dumazet User-Agent: Thunderbird 2.0.0.19 (Windows/20081209) MIME-Version: 1.0 To: Vernon Mauery CC: netdev , LKML , rt-users Subject: Re: High contention on the sk_buff_head.lock References: <49C12E64.1000301@us.ibm.com> In-Reply-To: <49C12E64.1000301@us.ibm.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-1.6 (gw1.cosmosbay.com [0.0.0.0]); Wed, 18 Mar 2009 20:07:20 +0100 (CET) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4785 Lines: 119 Vernon Mauery a ?crit : > I have been beating on network throughput in the -rt kernel for some time > now. After digging down through the send path of UDP packets, I found > that the sk_buff_head.lock is under some very high contention. This lock > is acquired each time a packet is enqueued on a qdisc and then acquired > again to dequeue the packet. Under high networking loads, the enqueueing > processes are not only contending among each other for the lock, but also > with the net-tx soft irq. This makes for some very high contention on this > one lock. My testcase is running varying numbers of concurrent netperf > instances pushing UDP traffic to another machine. As the count goes from > 1 to 2, the network performance increases. But from 2 to 4 and from 4 > to 8, > we see a big decline, with 8 instances pushing about half of what a single > thread can do. > > Running 2.6.29-rc6-rt3 on an 8-way machine with a 10GbE card (I have tried > both NetXen and Broadcom, with very similar results), I can only push about > 1200 Mb/s. Whereas with the mainline 2.6.29-rc8 kernel, I can push nearly > 6000 Mb/s. But still not as much as I think is possible. I was curious and > decided to see if the mainline kernel was hitting the same lock, and using > /proc/lock_stat, it is hitting the sk_buff_head.lock as well (it was the > number one contended lock). > > So while this issue really hits -rt kernels hard, it has a real effect on > mainline kernels as well. The contention of the spinlocks is amplified > when they get turned into rt-mutexes, which causes a double context switch. > > Below is the top of the lock_stat for 2.6.29-rc8. This was captured from > a 1 minute network stress test. The next high contender had 2 orders of > magnitude fewer contentions. Think of the throughput increase if we could > ease this contention a bit. We might even be able to saturate a 10GbE > link. > > lock_stat version 0.3 > ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > class name con-bounces contentions waittime-min > waittime-max waittime-total acq-bounces acquisitions > holdtime-min holdtime-max holdtime-total > ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > > &list->lock#3: 24517307 24643791 0.71 > 1286.62 56516392.42 34834296 44904018 > 0.60 164.79 31314786.02 > ------------- > &list->lock#3 15596927 [] > dev_queue_xmit+0x2ea/0x468 > &list->lock#3 9046864 [] > __qdisc_run+0x11b/0x1ef > ------------- > &list->lock#3 6525300 [] > __qdisc_run+0x11b/0x1ef > &list->lock#3 18118491 [] > dev_queue_xmit+0x2ea/0x468 > > > The story is the same for -rt kernels, only the waittime and holdtime > are both > orders of magnitude greater. > > I am not exactly clear on the solution, but if I understand correctly, > in the > past there has been some discussion of batched enqueueing and > dequeueing. Is > anyone else working on this problem right now who has just not yet posted > anything for review? Questions, comments, flames? > Yes we have a known contention point here, but before adding more complex code, could you try following patch please ? [PATCH] net: Reorder fields of struct Qdisc dev_queue_xmit() needs to dirty fields "state" and "q" On x86_64 arch, they currently span two cache lines, involving more cache line ping pongs than necessary. Before patch : offsetof(struct Qdisc, state)=0x38 offsetof(struct Qdisc, q)=0x48 offsetof(struct Qdisc, dev_queue)=0x60 After patch : offsetof(struct Qdisc, dev_queue)=0x38 offsetof(struct Qdisc, state)=0x48 offsetof(struct Qdisc, q)=0x50 Signed-off-by: Eric Dumazet diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index f8c4742..e24feeb 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -51,10 +51,11 @@ struct Qdisc u32 handle; u32 parent; atomic_t refcnt; - unsigned long state; + struct netdev_queue *dev_queue; + struct sk_buff *gso_skb; + unsigned long state; struct sk_buff_head q; - struct netdev_queue *dev_queue; struct Qdisc *next_sched; struct list_head list; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/