Date: Thu, 23 Feb 2023 09:37:25 +0530
From: Kautuk Consul
To: Michael Ellerman
Cc: paulmck@kernel.org, Nicholas Piggin, Christophe Leroy, Rohan McLure,
    linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2] arch/powerpc/include/asm/barrier.h: redefine rmb and
 wmb to lwsync
References: <20230222090344.189270-1-kconsul@linux.vnet.ibm.com>
 <20230222174719.GA1400185@paulmck-ThinkPad-P17-Gen-1>
 <87fsaxavk2.fsf@mpe.ellerman.id.au>
In-Reply-To: <87fsaxavk2.fsf@mpe.ellerman.id.au>

On 2023-02-23 14:51:25, Michael Ellerman wrote:
> Hi Paul,
>
> "Paul E. McKenney" writes:
> > On Wed, Feb 22, 2023 at 02:33:44PM +0530, Kautuk Consul wrote:
> >> A link from ibm.com states:
> >> "Ensures that all instructions preceding the call to __lwsync
> >> complete before any subsequent store instructions can be executed
> >> on the processor that executed the function. Also, it ensures that
> >> all load instructions preceding the call to __lwsync complete before
> >> any subsequent load instructions can be executed on the processor
> >> that executed the function. This allows you to synchronize between
> >> multiple processors with minimal performance impact, as __lwsync
> >> does not wait for confirmation from each processor."
> >>
> >> That's why smp_rmb() and smp_wmb() are defined to lwsync.
> >> But this same understanding applies to parallel pipeline
> >> execution on each PowerPC processor.
> >> So, use the lwsync instruction for rmb() and wmb() on the PPC
> >> architectures that support it.
> >>
> >> Signed-off-by: Kautuk Consul
> >> ---
> >>  arch/powerpc/include/asm/barrier.h | 7 +++++++
> >>  1 file changed, 7 insertions(+)
> >>
> >> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> >> index b95b666f0374..e088dacc0ee8 100644
> >> --- a/arch/powerpc/include/asm/barrier.h
> >> +++ b/arch/powerpc/include/asm/barrier.h
> >> @@ -36,8 +36,15 @@
> >>   * heavy-weight sync, so smp_wmb() can be a lighter-weight eieio.
> >>   */
> >>  #define __mb()   __asm__ __volatile__ ("sync" : : : "memory")
> >> +
> >> +/* The sub-arch has lwsync. */
> >> +#if defined(CONFIG_PPC64) || defined(CONFIG_PPC_E500MC)
> >> +#define __rmb() __asm__ __volatile__ ("lwsync" : : : "memory")
> >> +#define __wmb() __asm__ __volatile__ ("lwsync" : : : "memory")
> >
> > Hmmm...
> >
> > Does the lwsync instruction now order both cached and uncached accesses?
>
> No.
>
> > Or have there been changes so that smp_rmb() and smp_wmb() get this
> > definition, while rmb() and wmb() still get the sync instruction?
>
> No.
>
> They come from:
>
> include/asm-generic/barrier.h:#define rmb()	do { kcsan_rmb(); __rmb(); } while (0)
> include/asm-generic/barrier.h:#define wmb()	do { kcsan_wmb(); __wmb(); } while (0)
>
> > (Not seeing this, but I could easily be missing something.)
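
To restate my understanding of the cacheable-memory case that lwsync does
handle, here is a minimal hypothetical sketch -- data, flag, compute() and
consume() are invented names for illustration, not kernel code. For
ordinary cacheable memory, the lwsync behind smp_wmb()/smp_rmb() is enough
for this classic message-passing pattern:

	int data;
	int flag;

	/* CPU 0 (producer) */
	data = compute();	/* cacheable store */
	smp_wmb();		/* lwsync: orders the store to data
				 * before the store to flag */
	WRITE_ONCE(flag, 1);

	/* CPU 1 (consumer) */
	if (READ_ONCE(flag)) {	/* cacheable load */
		smp_rmb();	/* lwsync: orders the load of flag
				 * before the load of data */
		consume(data);
	}

If I understand the objection below correctly, what this reasoning misses
is accesses to non-cacheable (I/O) memory.
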
> You are correct, the patch is wrong because it fails to account for IO
> accesses.
>
> Kautuk, I'm not sure what motivated you to look at these barriers, was
> it just the documentation you linked to?
>
> Maybe we need some better documentation in the kernel explaining why we
> use the barriers we do?
>
> Although there's already a pretty decent comment above the definitions,
> but maybe it needs further clarification:
>
>  /*
>   * Memory barrier.
>   * The sync instruction guarantees that all memory accesses initiated
>   * by this processor have been performed (with respect to all other
>   * mechanisms that access memory). The eieio instruction is a barrier
>   * providing an ordering (separately) for (a) cacheable stores and (b)
>   * loads and stores to non-cacheable memory (e.g. I/O devices).
>   *
>   * mb() prevents loads and stores being reordered across this point.
>   * rmb() prevents loads being reordered across this point.
>   * wmb() prevents stores being reordered across this point.
>   *
>   * *mb() variants without smp_ prefix must order all types of memory
>   * operations with one another. sync is the only instruction sufficient
>   * to do this.
>   *
>   * For the smp_ barriers, ordering is for cacheable memory operations
>   * only. We have to use the sync instruction for smp_mb(), since lwsync
>   * doesn't order loads with respect to previous stores. Lwsync can be
>   * used for smp_rmb() and smp_wmb().

Sorry, I didn't change the comment. My point is: is the "sync is the only
instruction sufficient to do this" comment completely correct?
As I mentioned in my reply to Paul, I didn't find any documentation that
states up front (in the differences between sync and lwsync) that lwsync
distinguishes between cached and uncached accesses.

> cheers
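
Also, to make the I/O case concrete for myself: below is a purely
hypothetical driver-style fragment (desc, buf_dma, len, DOORBELL and
DB_OFFSET are all invented names). My reading of this thread is that
wmb() here must stay sync, because lwsync would not order the cacheable
descriptor stores against the subsequent store to non-cacheable MMIO
space:

	/* Publish a DMA descriptor in cacheable RAM ... */
	desc->addr = buf_dma;
	desc->len  = len;
	wmb();		/* must order the cacheable stores above
			 * against the non-cacheable doorbell write
			 * below; lwsync does not do that, sync does */
	/* ... then ring the device doorbell via MMIO */
	writel(DOORBELL, dev->regs + DB_OFFSET);

Is that the right way to think about why rmb()/wmb() cannot be relaxed
to lwsync?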