Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753208AbZDVLTZ (ORCPT ); Wed, 22 Apr 2009 07:19:25 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751897AbZDVLTE (ORCPT ); Wed, 22 Apr 2009 07:19:04 -0400 Received: from mx2.mail.elte.hu ([157.181.151.9]:43682 "EHLO mx2.mail.elte.hu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751383AbZDVLTB (ORCPT ); Wed, 22 Apr 2009 07:19:01 -0400 Date: Wed, 22 Apr 2009 13:18:17 +0200 From: Ingo Molnar To: Eric Dumazet Cc: Stephen Hemminger , Peter Zijlstra , Linus Torvalds , Paul Mackerras , paulmck@linux.vnet.ibm.com, Evgeniy Polyakov , David Miller , kaber@trash.net, jeff.chua.linux@gmail.com, laijs@cn.fujitsu.com, jengelh@medozas.de, r000n@r000n.net, linux-kernel@vger.kernel.org, netfilter-devel@vger.kernel.org, netdev@vger.kernel.org, benh@kernel.crashing.org, mathieu.desnoyers@polymtl.ca Subject: Re: [PATCH] netfilter: use per-cpu recursive lock (v11) Message-ID: <20090422111817.GA21190@elte.hu> References: <18924.59347.375292.102385@cargo.ozlabs.ibm.com> <20090420215827.GK6822@linux.vnet.ibm.com> <18924.64032.103954.171918@cargo.ozlabs.ibm.com> <20090420160121.268a8226@nehalam> <20090421111541.228e977a@nehalam> <20090421191007.GA15485@elte.hu> <49EE2293.4090201@cosmosbay.com> <20090422073524.GA31835@elte.hu> <49EEDAF0.2010507@cosmosbay.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <49EEDAF0.2010507@cosmosbay.com> User-Agent: Mutt/1.5.18 (2008-05-17) X-ELTE-VirusStatus: clean X-ELTE-SpamScore: -1.5 X-ELTE-SpamLevel: X-ELTE-SpamCheck: no X-ELTE-SpamVersion: ELTE 2.0 X-ELTE-SpamCheck-Details: score=-1.5 required=5.9 tests=BAYES_00 autolearn=no SpamAssassin version=3.2.3 -1.5 BAYES_00 BODY: Bayesian spam probability is 0 to 1% [score: 0.0000] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3650 Lines: 85 * Eric Dumazet wrote: > netfilter in 2.6.2[0-9] used : > > CPU 1 > > sofirq handles one packet from a NIC > > ipt_do_table() /* to handle INPUT table for example, or FORWARD */ > read_lock_bh(&a_central_and_damn_rwlock) > ... parse rules > -> calling some netfilter sub-function > re-entering network IP stack to send some packet (say a RST packet) > ... > ipt_do_table() /* to handle OUPUT table rules for example */ > read_lock_bh() ; /* WE RECURSE here, but once in a while (if ever) */ > > This is one of the case, but other can happens with virtual > networks, tunnels, ... and so on. (Stephen had some cases with KVM > if I remember well) > > If this could be done without recursion, I am pretty sure > netfilter and network guys would have done it. I found Linus > reaction quite shocking IMHO, considering hard work done by all > people on this. Again ... i dont know this code well, but you yourself describe it as "a_central_and_damn_rwlock" above. "Central and damn" locks of any type, anywhere tend to cause such trouble. Often they are mini-BKL's in the making. Here's some historic patterns: 1- First there's a convenient lock around a popular and useful piece of data and code. 2- As popularity (and reach of code) grows, and some non-trivial interaction ensues, a little bit of self-recursion is added to the mix (this is often easier to do than to fix the root cause of the problem) - making critical sections even easier to grow in size. 3- Then attempts are made to make it scale all (it's a popular piece of code), it's extended along a per-cpu axis, but complexity of locking explode by taking a ton of locks all at once. The solution: yell at the lock validator for not allowing this (or for exploding due to the sheer mathematically large complexity of the locking rules). Frequent requests for an exclusion from those pesky validations are made, and various hacks are done to work it around. It's all the fault of the lock validator, of course. 4- At this point it's rarely a clean, tidy data lock anymore - it tends to grow into a code lock nobody really knows how it works, except that it better be taken more often than not, badness may ensue otherwise. Nobody really knows when to take it, only that it should be taken widely enough, that it should be recursive enough to call even _more_ code from under it. Efforts to reduce critical section length are rebuffed with: "this adds unlocking and relocking overhead". And it should then all scale as well. 5- In the end it's a lock everyone curses but nobody is able to fix anymore. "If it was easy to fix we'd have fixed it long ago already" kind of fatalist thinking becomes widespread. We saw many examples of this in the past: beyond the BKL (which is a bit special as its more of an UP legacy - but which hurt us the most), we had the tasklist_lock which was problematic for a long time, then there's also the reiser3 locking which was extremely monolithic. [ i'm only naming safe examples here, where nobody will flame me =B-) ] Whether this particular case applies is up to you. I see certain matches up to phase 3 or so - but i also see certain dissimilarities as well. Nor do i claim that it's easy to fix or improve. Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/