Date: Tue, 21 Apr 2009 09:13:52 -0700 (PDT)
From: Linus Torvalds
To: Stephen Hemminger
Cc: Paul Mackerras, paulmck@linux.vnet.ibm.com, Eric Dumazet,
    Evgeniy Polyakov, David Miller, kaber@trash.net,
    jeff.chua.linux@gmail.com, mingo@elte.hu, laijs@cn.fujitsu.com,
    jengelh@medozas.de, r000n@r000n.net, linux-kernel@vger.kernel.org,
    netfilter-devel@vger.kernel.org, netdev@vger.kernel.org,
    benh@kernel.crashing.org, mathieu.desnoyers@polymtl.ca
Subject: Re: [PATCH] netfilter: use per-cpu recursive lock (v11)
In-Reply-To: <20090420160121.268a8226@nehalam>

Ok, so others already pointed out how dangerous/buggy this looks, but I'd
like to strengthen that a bit more:

On Mon, 20 Apr 2009, Stephen Hemminger wrote:
> +
> +/**
> + * xt_table_info_rdlock_bh - recursive read lock for xt table info
> + *
> + * Table processing calls this to hold off any changes to table
> + * (on current CPU). Always leaves with bottom half disabled.
> + * If called recursively, then assumes bh/preempt already disabled.
> + */
> +void xt_info_rdlock_bh(void)
> +{
> +	struct xt_info_lock *lock;
> +
> +	preempt_disable();
> +	lock = &__get_cpu_var(xt_info_locks);
> +	if (likely(++lock->depth == 0))
> +		spin_lock_bh(&lock->lock);
> +	preempt_enable_no_resched();
> +}
> +EXPORT_SYMBOL_GPL(xt_info_rdlock_bh);
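[The matching unlock is not quoted above; from the lock path it would
presumably have to mirror the sketch below - an inference added for
context, not code taken from the patch:]

	void xt_info_rdunlock_bh(void)
	{
		struct xt_info_lock *lock = &__get_cpu_var(xt_info_locks);

		/*
		 * Inferred counterpart of the "++lock->depth" above: only
		 * drop the spinlock (and re-enable bottom halves) when the
		 * outermost acquire is released, i.e. when depth falls back
		 * to its assumed initial value of -1.
		 */
		if (likely(--lock->depth < 0))
			spin_unlock_bh(&lock->lock);
	}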
This function is FUCKED UP. It's total crap for several reasons:

 - the already-mentioned race with bottom half locking.

   If bottom halves aren't already disabled, then if a bottom half comes
   in after the "++lock->depth" and before the spin_lock_bh(), you will
   have no locking AT ALL for the processing of that bottom half - it will
   just increment the lock depth again, and nobody will have locked
   anything at all.

   And if, for some reason, you can prove that bottom half processing is
   already disabled, then ALL THAT OTHER CRAP is just that - CRAP. The
   whole preemption disabling, the whole "_bh()" thing, everything.

   So either it's horribly buggy, or it's horribly broken and pointless.
   Take your pick.

 - the total lack of comments. Why does that "count" protect anything?

   It's not a recursive lock, since there is no ownership (two independent
   accessors could call this and both "get" the lock), so you had damn
   well better create some big-ass comments about why it's ok in this
   case, and what the rules are that make it ok.

   So DON'T GO AROUND CALLING IT A RECURSIVE LOCK! Don't write comments
   that are TOTAL AND UTTER SH*T! Just DON'T!

   It's a "reader lock". It's not "recursive". It never was recursive, it
   never will be, and calling it that is just a sign that whoever wrote
   the function is a moron and doesn't know what he is doing. STOP DOING
   THIS!

 - that _idiotic_ "preempt_enable_no_resched()".

   F*ck me, but the comment already says that preemption is disabled when
   exiting, so why does it even bother to enable it? Why play those
   mindless games with preemption counts, knowing that they are bogus?

   Do it readably. Disable preemption first, and just re-enable it at
   UNLOCK time. No odd pseudo-reenables anywhere.

   Oh, I know very well that the spin-locking will disable preemption, so
   it's "correct" to play those games, but the point is, it's just damn
   stupid and annoying. It just makes the source code actively
   _misleading_ to see crap like that - it looks like you enabled
   preemption, when in fact the code very much on purpose does _not_
   enable preemption at all.

In other words, I really REALLY hate that patch. I think it looks slightly
better than Eric Dumazet's original patch (at least the locking subtlety
is now in a function of its own and _could_ be commented upon sanely, and
if it wasn't broken it might be acceptable), but quite frankly, I'm still
horribly disgusted with the crap.

Why are you doing this insane thing? If you want a read-lock, just use the
damned read-write locks already! As far as I can tell, this lock is in
_no_ way better than just using those counting reader-writer locks, except
it is open-coded and looks buggy.

There is basically _never_ a good reason to re-implement locking
primitives: you'll just introduce bugs. As proven very ably by the amount
of crap above in what is supposed to be a very simple function.

If you want a counting read-lock (in _order_ to allow recursion, but not
because the lock itself is somehow recursive!), then that function should
look like this:

	void xt_info_rdlock_bh(void)
	{
		struct xt_info_lock *lock;

		local_bh_disable();
		lock = &__get_cpu_var(xt_info_locks);
		read_lock(&lock->lock);
	}

And then the "unlock" should be the reverse. No games, no crap, and
hopefully then no bugs. And if you do it that way, you don't even need the
comments, although quite frankly, it probably makes a lot of sense to talk
about the interaction between "local_bh_disable()" and the preempt count,
and why you're not using "read_lock_bh()".

And if you don't want a read-lock, then fine - don't use a read-lock, do
something else. But then don't just re-implement it (badly) either and
call it something else!

		Linus

PS: Ingo, why do the *_bh() functions in kernel/spinlock.c do _both_ a
"local_bh_disable()" and a "preempt_disable()"? BH disable should disable
preemption too, no? Or am I confused? In which case we need that in the
above rdlock_bh too.
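[For reference, the "reverse" unlock described above would presumably look
like the sketch below - an illustration under the same assumptions as the
example in the mail, not code from the patch or the thread:]

	void xt_info_rdunlock_bh(void)
	{
		struct xt_info_lock *lock;

		lock = &__get_cpu_var(xt_info_locks);
		/*
		 * Release in the opposite order of the lock path: drop the
		 * rwlock first, then re-enable bottom halves, so the lock
		 * is never held with BHs enabled.
		 */
		read_unlock(&lock->lock);
		local_bh_enable();
	}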