Date: Tue, 7 Jul 2009 10:57:10 -0400
From: Mathieu Desnoyers
To: Eric Dumazet
Cc: Jiri Olsa, Ingo Molnar, Peter Zijlstra, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, fbl@redhat.com, nhorman@redhat.com,
	davem@redhat.com, htejun@gmail.com, jarkao2@gmail.com,
	oleg@redhat.com, davidel@xmailserver.org
Subject: Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
Message-ID: <20090707145710.GB7124@Krystal>
In-Reply-To: <4A535EB9.2020406@gmail.com>

* Eric Dumazet (eric.dumazet@gmail.com) wrote:
> Mathieu Desnoyers wrote:
> > * Jiri Olsa (jolsa@redhat.com) wrote:
> >> On Tue, Jul 07, 2009 at 12:18:16PM +0200, Jiri Olsa wrote:
> >>> On Fri, Jul 03, 2009 at 01:18:48PM +0200, Jiri Olsa wrote:
> >>>> On Fri, Jul 03, 2009 at 12:25:30PM +0200, Ingo Molnar wrote:
> >>>>> * Jiri Olsa wrote:
> >>>>>
> >>>>>> On Fri, Jul 03, 2009 at 11:24:38AM +0200, Ingo Molnar wrote:
> >>>>>>> * Eric Dumazet wrote:
> >>>>>>>
> >>>>>>>> Ingo Molnar wrote:
> >>>>>>>>> * Jiri Olsa wrote:
> >>>>>>>>>
> >>>>>>>>>> +++ b/arch/x86/include/asm/spinlock.h
> >>>>>>>>>> @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> >>>>>>>>>>  #define _raw_read_relax(lock)	cpu_relax()
> >>>>>>>>>>  #define _raw_write_relax(lock)	cpu_relax()
> >>>>>>>>>>
> >>>>>>>>>> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> >>>>>>>>>> +#define smp_mb__after_lock() do { } while (0)
> >>>>>>>>>
> >>>>>>>>> Two small stylistic comments, please make this an inline function:
> >>>>>>>>>
> >>>>>>>>> 	static inline void smp_mb__after_lock(void) { }
> >>>>>>>>> 	#define smp_mb__after_lock
> >>>>>>>>>
> >>>>>>>>> (untested)
> >>>>>>>>>
> >>>>>>>>>> +/* The lock does not imply full memory barrier. */
> >>>>>>>>>> +#ifndef smp_mb__after_lock
> >>>>>>>>>> +#define smp_mb__after_lock() smp_mb()
> >>>>>>>>>> +#endif
> >>>>>>>>>
> >>>>>>>>> ditto.
> >>>>>>>>>
> >>>>>>>>> 	Ingo
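As an illustration of the override idiom Ingo is suggesting (a rough sketch only,
not the actual patch; the follow-up patch quoted below ends up using an
ARCH_HAS_SMP_MB_AFTER_LOCK define instead):

	/* Arch header (e.g. x86), where taking the lock already implies a full barrier: */
	static inline void smp_mb__after_lock(void) { }
	#define smp_mb__after_lock smp_mb__after_lock

	/* Generic header: fallback used when no architecture override is present. */
	#ifndef smp_mb__after_lock
	static inline void smp_mb__after_lock(void)
	{
		smp_mb();	/* the lock alone does not imply a full memory barrier */
	}
	#endif

Defining the macro to its own name keeps the #ifndef test working while callers
still get a proper inline function.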
> >>>>>>>> This was following existing implementations of various smp_mb__???
> >>>>>>>> helpers :
> >>>>>>>>
> >>>>>>>> # grep -4 smp_mb__before_clear_bit include/asm-generic/bitops.h
> >>>>>>>>
> >>>>>>>> /*
> >>>>>>>>  * clear_bit may not imply a memory barrier
> >>>>>>>>  */
> >>>>>>>> #ifndef smp_mb__before_clear_bit
> >>>>>>>> #define smp_mb__before_clear_bit()	smp_mb()
> >>>>>>>> #define smp_mb__after_clear_bit()	smp_mb()
> >>>>>>>> #endif
> >>>>>>>
> >>>>>>> Did I mention that those should be fixed too? :-)
> >>>>>>>
> >>>>>>> 	Ingo
> >>>>>>
> >>>>>> ok, could I include it in the 2/2 or do you prefer a separate patch?
> >>>>>
> >>>>> depends on whether it will regress ;-)
> >>>>>
> >>>>> If it regresses, it's better to have it separate. If it won't, it can
> >>>>> be included. If unsure, default to the more conservative option.
> >>>>>
> >>>>> 	Ingo
> >>>>
> >>>> how about this..
> >>>> and a similar change for smp_mb__before_clear_bit in a separate patch
> >>>>
> >>>>
> >>>> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> >>>> index b7e5db8..4e77853 100644
> >>>> --- a/arch/x86/include/asm/spinlock.h
> >>>> +++ b/arch/x86/include/asm/spinlock.h
> >>>> @@ -302,4 +302,8 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> >>>>  #define _raw_read_relax(lock)	cpu_relax()
> >>>>  #define _raw_write_relax(lock)	cpu_relax()
> >>>>
> >>>> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> >>>> +static inline void smp_mb__after_lock(void) { }
> >>>> +#define ARCH_HAS_SMP_MB_AFTER_LOCK
> >>>> +
> >>>>  #endif /* _ASM_X86_SPINLOCK_H */
> >>>> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> >>>> index 252b245..4be57ab 100644
> >>>> --- a/include/linux/spinlock.h
> >>>> +++ b/include/linux/spinlock.h
> >>>> @@ -132,6 +132,11 @@ do { \
> >>>>  #endif /*__raw_spin_is_contended*/
> >>>>  #endif
> >>>>
> >>>> +/* The lock does not imply full memory barrier. */
> >>>> +#ifndef ARCH_HAS_SMP_MB_AFTER_LOCK
> >>>> +static inline void smp_mb__after_lock(void) { smp_mb(); }
> >>>> +#endif
> >>>> +
> >>>>  /**
> >>>>   * spin_unlock_wait - wait until the spinlock gets unlocked
> >>>>   * @lock: the spinlock in question.
> >>>> diff --git a/include/net/sock.h b/include/net/sock.h
> >>>> index 4eb8409..98afcd9 100644
> >>>> --- a/include/net/sock.h
> >>>> +++ b/include/net/sock.h
> >>>> @@ -1271,6 +1271,9 @@ static inline int sk_has_allocations(const struct sock *sk)
> >>>>   * in its cache, and so does the tp->rcv_nxt update on CPU2 side. The CPU1
> >>>>   * could then endup calling schedule and sleep forever if there are no more
> >>>>   * data on the socket.
> >>>> + *
> >>>> + * The sk_has_sleeper is always called right after a call to read_lock, so we
> >>>> + * can use the smp_mb__after_lock barrier.
> >>>>   */
> >>>>  static inline int sk_has_sleeper(struct sock *sk)
> >>>>  {
> >>>> @@ -1280,7 +1283,7 @@ static inline int sk_has_sleeper(struct sock *sk)
> >>>>  	 *
> >>>>  	 * This memory barrier is paired in the sock_poll_wait.
> >>>>  	 */
> >>>> -	smp_mb();
> >>>> +	smp_mb__after_lock();
> >>>>  	return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
> >>>>  }
> >>>>
> >>>
> >>> any feedback on this?
> >>> I'd send v6 if this way is acceptable..
> >>>
> >>> thanks,
> >>> jirka
> >>
> >> also I checked the smp_mb__before_clear_bit/smp_mb__after_clear_bit and
> >> it is used quite extensively.
> >>
> >> I'd prefer to send it in a separate patch, so we can move on with the
> >> changes I've sent so far..
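For context, the pairing the sk_has_sleeper comment above refers to is the
classic sleeper/waker pattern; schematically (a simplified sketch, not the
exact kernel code -- data_available() stands in for the protocol-specific
"is there work" check):

	/* Waker side (e.g. sock_def_readable), after new data has been queued: */
	read_lock(&sk->sk_callback_lock);
	smp_mb__after_lock();	/* order the data writes before reading sk_sleep */
	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
		wake_up_interruptible(sk->sk_sleep);
	read_unlock(&sk->sk_callback_lock);

	/* Sleeper side (the poll path, via sock_poll_wait): */
	poll_wait(file, sk->sk_sleep, wait);	/* add ourselves to the wait queue */
	smp_mb();	/* order the wait queue addition before reading socket state */
	if (!data_available(sk))
		return 0;	/* no events yet; the waker above will find us queued */

Each side performs its write (queue the data / add the waiter), issues a full
barrier, and only then reads the other side's state, so at least one of the two
always observes the other's update and the lost wakeup described in the sock.h
comment cannot happen.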
> >
> > As with any optimization (and this is one that adds a semantic that will
> > just grow the memory barrier/locking rule complexity), it should come
> > with performance benchmarks showing real-life improvements.
> >
> > Otherwise I'd recommend sticking to smp_mb() if this execution path is
> > not that critical, or moving to RCU if it's _that_ critical.
> >
> > A valid argument would be if the data structures protected are so
> > complex that RCU is out of the question but the few cycles saved by
> > removing a memory barrier are still really significant. And even then,
> > the proper solution would be something more like a
> > __read_lock()+smp_mb+smp_mb+__read_unlock(), so we get the performance
> > improvements on architectures other than x86 as well.
> >
> > So in all cases, I don't think smp_mb__after_lock() is the
> > appropriate solution.
>
> RCU on this part is out of the question, as David already mentioned.
>

OK

> It would be a regression for short-lived tcp/udp sessions, and some workloads
> use them a lot...
>
> We gained about 20% performance between 2.6.26 and 2.6.31, carefully removing
> some atomic ops in the network stack and adding RCU where it was sensible, but
> this is a painful process, not something Jiri can use to fix bugs on legacy
> RedHat kernels :) (We are still sorting out regressions.)
>

Yep, I can understand that. Tbench on localhost is an especially good
benchmark for this ;)

> To solve the problem pointed out by Jiri, we have to insert an smp_mb() at this
> point (not mentioning the other change in the select() logic, of course):
>
> static void sock_def_readable(struct sock *sk, int len)
> {
> 	read_lock(&sk->sk_callback_lock);
> +	smp_mb();	/* paired with opposite smp_mb() in sk poll logic */
> 	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> 		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN |
> 						POLLRDNORM | POLLRDBAND);
> 	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
> 	read_unlock(&sk->sk_callback_lock);
> }
>
> As just about every incoming packet calls this path, we should be very careful
> not to slow down the stack when it is not necessary.
>
> On x86 this extra smp_mb() is not needed, since the previous call to
> read_lock() already gives the full barrier for free.
>

Well, I see the __read_lock() + 2x smp_mb() + __read_unlock() combination is
not well suited for x86. You're right. But read_lock + smp_mb__after_lock +
read_unlock is not well suited for powerpc, arm and mips, and probably others
where there is an explicit memory barrier at the end of the read lock primitive.

One thing that would be efficient for all architectures is to create a locking
primitive that contains the smp_mb, e.g.:

  read_lock_smp_mb()

which would act as a read_lock that does a full smp_mb after the lock is taken.

The naming may be a bit odd; better ideas are welcome.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
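For illustration, a generic fallback for the primitive proposed above might
look like this (a hypothetical sketch following the naming suggested in the
mail; no such API exists in the kernel at this point):

	/*
	 * Take the read lock, then promote its acquire semantics to a full
	 * memory barrier.
	 */
	static inline void read_lock_smp_mb(rwlock_t *lock)
	{
		read_lock(lock);
		smp_mb();
	}

An architecture whose read_lock() is already a full barrier (x86, per the
discussion above) would map this to a plain read_lock(), and architectures
whose read lock sequence already ends with a barrier instruction (powerpc,
arm, mips, as noted above) could emit a single combined barrier instead of
stacking two.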