Date: Sun, 12 Apr 2009 10:30:09 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Stephen Hemminger
Cc: Linus Torvalds, David Miller, Ingo Molnar, Lai Jiangshan,
	jeff.chua.linux@gmail.com, dada1@cosmosbay.com,
	jengelh@medozas.de, kaber@trash.net, r000n@r000n.net,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	netfilter-devel@vger.kernel.org, netdev@vger.kernel.org
Subject: Re: iptables very slow after commit 784544739a25c30637397ace5489eeb6e15d7d49
Message-ID: <20090412173009.GN6822@linux.vnet.ibm.com>
References: <20090410095246.4fdccb56@s6510>
	<20090410.182507.140306636.davem@davemloft.net>
	<20090411041533.GB6822@linux.vnet.ibm.com>
	<20090412003445.GK6822@linux.vnet.ibm.com>
	<20090412090603.556ba4fa@nehalam>
In-Reply-To: <20090412090603.556ba4fa@nehalam>

On Sun, Apr 12, 2009 at 09:06:03AM -0700, Stephen Hemminger wrote:
> On Sat, 11 Apr 2009 17:34:45 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
>
> > On Sat, Apr 11, 2009 at 11:57:16AM -0700, Linus Torvalds wrote:
> > >
> > > On Fri, 10 Apr 2009, Paul E. McKenney wrote:
> > > >
> > > > 1.	Assuming that the synchronize_net() is intended to guarantee
> > > > 	that the new rules will be in effect before returning to
> > > > 	user space:
> > >
> > > Btw, I think that's a bad assumption.
> >
> > It does indeed appear to be!
> >
> > > The thing is, nobody can really care if the new rules are in effect or
> > > not, because the thing you race with is not the "return to user space"
> > > part, but the incoming packets.
> > >
> > > And those incoming packets might have been incoming before the rules
> > > were set up too.
> > >
> > > So I seriously doubt you need to synchronize with any returning to user
> > > space. What you want to synchronize with is then later actions that do
> > > things like turning on the interface that the rules are attached to etc!
> > >
> > > So I would suggest:
> > >
> > >  - remove the synchronize_net() entirely. Replace it with just freeing
> > >    the old rules using RCU.
> > >
> > >  - new packets will always end up seeing the new rules. That includes
> > >    the case of somebody doing "ifconfig eth0 up" that enables a new
> > >    source of packets, so there are no real security issues.
> > >
> > >  - if you enabled your network interfaces before you updated your packet
> > >    filtering rules, you already had a window where packets would come in
> > >    with the old rules, so doing a "synchronize_net()" in no way protects
> > >    against any race conditions anyway.
> > >
> > > Am I missing something?
> >
> > The issue at this point seems to be the need to get accurate snapshots
> > of various counters -- there are a number of Linux networking users who
> > need to account for every byte flowing through their systems.
> > However, it is also necessary to update these counters very
> > efficiently, given that they are updated on a per-packet basis.
> > The current approach is as follows:
> >
> > 1.	Install a new set of counters.
> >
> > 2.	Wait for a grace period to elapse.
> >
> > 3.	At this point, we know that all subsequent counting will happen
> > 	on the new set of counters.
> >
> > 4.	Add the value of the old set of counters to the new set of
> > 	counters.
> >
> > 5.	Copy the old set of counters up to user space.
> >
> > So we get a good snapshot in #5, while #4 ensures that we don't lose
> > any counts when taking future snapshots. Unfortunately, #2 hits us
> > with grace-period latencies on the critical path.
> >
> > We are going through the following possibilities:
> >
> > o	Stick with the current approach, and ask people to move to
> > 	new batch-oriented interfaces. However, a 30x decrease in
> > 	performance is pretty grim, even for an old-style interface.
> >
> > o	Use various atomic tricks to get an immediate snapshot of the
> > 	old counters after step 1. Make step 3 use call_rcu() instead
> > 	of synchronize_rcu(), and then step 4 happens off the
> > 	critical path.
> >
> > 	This approach moves the RCU grace period off of the critical
> > 	path, but the atomic tricks are extremely ugly on 32-bit SMP
> > 	machines. 32-bit UP machines and 64-bit machines are not
> > 	too bad, though the 32-bit UP case does add preemption-disable
> > 	overhead on the counter-update fastpath.
> >
> > o	Provide some sort of expedited synchronize_rcu(). This might
> > 	be able to decrease the hit from 30x down to maybe 5x.
> > 	But I might need to do this for the fast-boot folks anyway,
> > 	though I am first trying to get away with just speeding
> > 	up synchronize_rcu(). Though I was not thinking in terms
> > 	of 6x, let alone 30x.
> >
> > 	Please note that this would not be a drop-in replacement for
> > 	synchronize_rcu(). One would use synchronize_rcu_expedited()
> > 	(or whatever) only when the system really could not get any
> > 	useful work done while the grace period was in progress.
> > 	The general approach would be to keep the whole machine busy
> > 	trying to get the grace period done as soon as possible.
> >
> > 							Thanx, Paul
>
> We could also try:
> 	* per-cpu spinlock on counters (instead of synchronize_net).
> 	  When doing an update, just acquire the lock on that cpu and
> 	  futz with the counters then. Overhead should still be less
> 	  than the global rwlock used in 2.6.29 and earlier.

This one makes a lot of sense to me. The overhead of an uncontended
lock is pretty small on most systems. This would also mean that you
don't have to actually swap the counters, correct? (I have appended
rough sketches of both the current swap-and-wait scheme and this
per-cpu-lock approach at the end of this message.)

> 	* synchronize_rcu/synchronize_net is more guarantee than needed?

If you really do need to swap the counters themselves, you -might-
also need call_rcu() to dispose of them. But it should be possible
to do that under the per-CPU lock instead. (See the short call_rcu()
fragment below.)

> 	* use on_each_cpu() somehow to do the grace period?

You could certainly use something like smp_call_function() to collect
the other CPUs' counter values -- just disable interrupts across the
increments for architectures that cannot atomically increment a 64-bit
value. (And it only needs to be atomic with respect to an interrupt,
not necessarily to some other CPU.) The last sketch below shows one
way this might look.

> 	* Add a cond_resched() into net_rx_action, which might cause rx
> 	  processing to get out of rcu sooner? Also in the transmit
> 	  packet scheduler.

This might help some, but would probably only give a few tens of
percent improvement.
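For concreteness, here is a rough sketch of the swap-and-wait scheme
from steps 1-5 above. All of the names (ctr_set, live_ctrs, and so on)
are invented for illustration -- the real code is the xt_counters
handling in net/ipv4/netfilter/ip_tables.c -- and the NR_CPUS arrays
are just to keep the sketch short; a real version would more likely
use alloc_percpu().

#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/smp.h>
#include <linux/types.h>

struct ctr_set {
	u64 bytes[NR_CPUS];
	u64 packets[NR_CPUS];
};

static struct ctr_set *live_ctrs;	/* RCU-protected; updaters are
					 * serialized by a table mutex. */

/* Per-packet fast path: each CPU bumps only its own slot. */
static void ctr_account(unsigned int len)
{
	struct ctr_set *c;
	int cpu;

	rcu_read_lock();
	cpu = get_cpu();
	c = rcu_dereference(live_ctrs);
	c->bytes[cpu] += len;
	c->packets[cpu]++;
	put_cpu();
	rcu_read_unlock();
}

/* Control path, steps 1-5: returns the old set for copying up to
 * user space; the caller frees it afterward. */
static struct ctr_set *ctr_snapshot_swap(void)
{
	struct ctr_set *old, *new;
	int cpu;

	new = kzalloc(sizeof(*new), GFP_KERNEL);
	if (!new)
		return NULL;
	old = live_ctrs;
	rcu_assign_pointer(live_ctrs, new);	/* step 1 */
	synchronize_rcu();			/* step 2: the grace-period
						 * latency we are complaining
						 * about sits right here. */
	/* Step 3: all CPUs are now counting into the new set. */
	for_each_possible_cpu(cpu) {		/* step 4 */
		new->bytes[cpu] += old->bytes[cpu];
		new->packets[cpu] += old->packets[cpu];
	}
	return old;				/* step 5 */
}

Note that the sketch glosses over step 4's read-modify-write colliding
with the fast path bumping the same slot; a real implementation has to
do those additions on the owning CPU or fall back on the atomic tricks
discussed above.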
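And here is the sort of thing I have in mind for your per-cpu spinlock
suggestion, again with invented names. The fast path takes only its
own CPU's lock, so the cache line stays local, and no grace period is
needed at all:

#include <linux/cpumask.h>
#include <linux/init.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct pcpu_ctr {
	spinlock_t lock;
	u64 bytes;
	u64 packets;
};

static DEFINE_PER_CPU(struct pcpu_ctr, pkt_ctr);

static int __init pkt_ctr_init(void)
{
	int cpu;

	for_each_possible_cpu(cpu)
		spin_lock_init(&per_cpu(pkt_ctr, cpu).lock);
	return 0;
}

/* Per-packet fast path: uncontended except while a snapshot is
 * reading this particular CPU's counters. */
static void pcpu_ctr_account(unsigned int len)
{
	struct pcpu_ctr *c = &get_cpu_var(pkt_ctr);

	spin_lock(&c->lock);
	c->bytes += len;
	c->packets++;
	spin_unlock(&c->lock);
	put_cpu_var(pkt_ctr);
}

/* Control path: sum everyone's counters, no grace period.  The _bh
 * variant avoids deadlocking against this CPU's own softirq-context
 * fast path. */
static void pcpu_ctr_snapshot(u64 *bytes, u64 *packets)
{
	int cpu;

	*bytes = 0;
	*packets = 0;
	for_each_possible_cpu(cpu) {
		struct pcpu_ctr *c = &per_cpu(pkt_ctr, cpu);

		spin_lock_bh(&c->lock);
		*bytes += c->bytes;
		*packets += c->packets;
		spin_unlock_bh(&c->lock);
	}
}

The snapshot is not a single atomic cut across all CPUs, but neither
is the current scheme's, and no counts are ever lost.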
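For completeness, the call_rcu() disposal I mentioned would look
something like the following fragment, extending the invented ctr_set
from the first sketch with an rcu_head so that freeing the old set
moves off the critical path:

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct ctr_set_rcu {
	struct rcu_head rcu;
	/* ...counters as in the first sketch... */
};

static void ctr_set_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct ctr_set_rcu, rcu));
}

/* Instead of synchronize_rcu() + kfree(old) on the critical path: */
static void ctr_set_dispose(struct ctr_set_rcu *old)
{
	call_rcu(&old->rcu, ctr_set_free_rcu);
}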
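Finally, a sketch of the on_each_cpu() collection scheme, once more
with made-up names. The point is that the fast path needs only
local_irq_save() to make its 64-bit update atomic with respect to the
IPI handler running on the same CPU; no cross-CPU atomics are needed:

#include <linux/cpumask.h>
#include <linux/irqflags.h>
#include <linux/percpu.h>
#include <linux/slab.h>
#include <linux/smp.h>
#include <linux/types.h>

static DEFINE_PER_CPU(u64, byte_count);

/* Per-packet fast path: irqs off makes the 64-bit read-modify-write
 * safe against the IPI handler below, even on 32-bit machines. */
static void byte_count_add(unsigned int len)
{
	unsigned long flags;

	local_irq_save(flags);
	__get_cpu_var(byte_count) += len;
	local_irq_restore(flags);
}

/* Runs on each CPU in turn, in IPI context with irqs disabled, so it
 * cannot interleave with a local byte_count_add(). */
static void store_local_count(void *info)
{
	u64 *slots = info;

	slots[smp_processor_id()] = __get_cpu_var(byte_count);
}

/* Control path: one round of IPIs instead of a grace period. */
static u64 collect_byte_count(void)
{
	u64 *slots;
	u64 total = 0;
	int cpu;

	slots = kcalloc(nr_cpu_ids, sizeof(*slots), GFP_KERNEL);
	if (!slots)
		return 0;
	on_each_cpu(store_local_count, slots, 1);	/* 1 == wait */
	for_each_possible_cpu(cpu)
		total += slots[cpu];
	kfree(slots);
	return total;
}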
							Thanx, Paul