2016-03-03 15:05:33

by Dexuan Cui

Subject: x86 memory barrier: why does Linux prefer MFENCE to Locked ADD?

Hi,
My understanding of arch/x86/include/asm/barrier.h is that Linux clearly
prefers {L,S,M}FENCE: Locked ADD is only used on x86_32 platforms that
don't support XMM2 (SSE2).
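
For illustration only, here is a rough user-space sketch of the choice
described above. It is not the kernel's code: as far as I can tell the kernel
selects the instruction through its alternatives patching mechanism rather
than a runtime branch. The names mb_mfence, mb_lock_addl, full_barrier and
barrier_demo are made up for this example, and it assumes GCC or Clang.

```c
#include <assert.h>

#if defined(__x86_64__) || defined(__i386__)
/* Full barrier using the SSE2 fence instruction. */
static inline void mb_mfence(void)
{
    __asm__ __volatile__("mfence" ::: "memory");
}

/* Full barrier using a locked add of 0 to the top of the stack.
 * Adding 0 leaves the memory unchanged; only the ordering effect matters. */
static inline void mb_lock_addl(void)
{
#if defined(__x86_64__)
    __asm__ __volatile__("lock; addl $0,0(%%rsp)" ::: "memory", "cc");
#else
    __asm__ __volatile__("lock; addl $0,0(%%esp)" ::: "memory", "cc");
#endif
}

/* Hypothetical selector: MFENCE when SSE2 is reported, locked ADD otherwise. */
static void full_barrier(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        mb_mfence();
    else
        mb_lock_addl();
}
#else
/* Non-x86 fallback so the sketch still builds; uses the compiler builtin. */
static void full_barrier(void)
{
    __sync_synchronize();
}
#endif

/* Store, full barrier, then load back; returns the stored value. */
int barrier_demo(void)
{
    volatile int flag = 0;

    flag = 42;
    full_barrier();
    return flag;
}
```

The runtime check here only mirrors the "FENCE if XMM2, otherwise Locked ADD"
rule; a real implementation would want to avoid the branch on every barrier.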

However, several sources say Locked ADD is much faster than the FENCE
instructions, even on modern Intel CPUs like Haswell. For example, see
these three sources:

" 11.5.1 Locked Instructions as Memory Barriers
Optimization
Use locked instructions to implement Store/Store and Store/Load barriers.
"
http://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf

"lock addl %(rsp), 0 is a better solution for StoreLoad barrier ":
http://shipilev.net/blog/2014/on-the-fence-with-dependencies/

"...locked instruction are more efficient barriers...":
http://www.pvk.ca/Blog/2014/10/19/performance-optimisation-~-writing-an-essay/

I also found that FreeBSD prefers Locked Add.

So, I'm curious why Linux prefers MFENCE.
I guess I may be missing something.

I tried to google the question, but didn't find an answer.

Thanks,
-- Dexuan



2016-03-03 15:27:45

by Ingo Molnar

Subject: Re: x86 memory barrier: why does Linux prefer MFENCE to Locked ADD?


* Dexuan Cui <[email protected]> wrote:

> Hi,
> My understanding about arch/x86/include/asm/barrier.h is: obviously Linux
> more likes {L,S,M}FENCE -- Locked ADD is only used in x86_32 platforms that
> don't support XMM2.
>
> However, it looks people say Locked Add is much faster than the FENCE
> instructions, even on modern Intel CPUs like Haswell, e.g., please see
> the three sources:
>
> " 11.5.1 Locked Instructions as Memory Barriers
> Optimization
> Use locked instructions to implement Store/Store and Store/Load barriers.
> "
> http://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf
>
> "lock addl %(rsp), 0 is a better solution for StoreLoad barrier ":
> http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
>
> "...locked instruction are more efficient barriers...":
> http://www.pvk.ca/Blog/2014/10/19/performance-optimisation-~-writing-an-essay/
>
> I also found that FreeBSD prefers Locked Add.
>
> So, I'm curious why Linux prefers MFENCE.
> I guess I may be missing something.
>
> I tried to google the question, but didn't find an answer.

It's being worked on, see this thread on lkml from a few weeks ago:

C Jan 13 Michael S. Tsir | [PATCH v3 0/4] x86: faster mb()+documentation tweaks
C Jan 13 Michael S. Tsir | ├─>[PATCH v3 1/4] x86: add cc clobber for addl
C Jan 13 Michael S. Tsir | ├─>[PATCH v3 2/4] x86: drop a comment left over from X86_OOSTORE
C Jan 13 Michael S. Tsir | ├─>[PATCH v3 3/4] x86: tweak the comment about use of wmb for IO
C Jan 13 Michael S. Tsir | ├─>[PATCH v3 4/4] x86: drop mfence in favor of lock+addl

The 4th patch replaces MFENCE with a locked ADDL instruction (LOCK ADDL).
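
To get a rough feel for the difference on a particular machine, a crude
user-space timing loop along these lines can help. This is only a sketch:
mb_mfence, mb_lock_addl and time_barrier are names invented here, it assumes
GCC or Clang, and on 32-bit x86 the MFENCE variant additionally assumes the
CPU has SSE2.

```c
#include <assert.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
/* Full barrier using MFENCE (needs SSE2 on 32-bit CPUs). */
static void mb_mfence(void)
{
    __asm__ __volatile__("mfence" ::: "memory");
}

/* Full barrier using a locked add of 0 to the top of the stack. */
static void mb_lock_addl(void)
{
#if defined(__x86_64__)
    __asm__ __volatile__("lock; addl $0,0(%%rsp)" ::: "memory", "cc");
#else
    __asm__ __volatile__("lock; addl $0,0(%%esp)" ::: "memory", "cc");
#endif
}
#else
/* Non-x86 fallback so the sketch still builds; both use the same builtin. */
static void mb_mfence(void)    { __sync_synchronize(); }
static void mb_lock_addl(void) { __sync_synchronize(); }
#endif

/* Returns the CPU time, in seconds, spent doing `iters` rounds of
 * "store, then barrier" with the given barrier function. */
double time_barrier(void (*barrier)(void), long iters)
{
    volatile long sink = 0;
    clock_t start, end;

    assert(iters >= 0);

    start = clock();
    for (long i = 0; i < iters; i++) {
        sink = i;      /* a store for the barrier to order */
        barrier();
    }
    end = clock();

    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_barrier(mb_mfence, n) and time_barrier(mb_lock_addl, n) with the
same n gives a per-machine comparison. A single-threaded loop like this says
nothing about behaviour under contention, or about cases where the two
instructions may not be interchangeable.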

Thanks,

Ingo

2016-03-03 15:34:59

by Peter Zijlstra

Subject: Re: x86 memory barrier: why does Linux prefer MFENCE to Locked ADD?

On Thu, Mar 03, 2016 at 04:27:39PM +0100, Ingo Molnar wrote:
>
> * Dexuan Cui <[email protected]> wrote:
>
> > Hi,
> > My understanding about arch/x86/include/asm/barrier.h is: obviously Linux
> > more likes {L,S,M}FENCE -- Locked ADD is only used in x86_32 platforms that
> > don't support XMM2.
> >
> > However, it looks people say Locked Add is much faster than the FENCE
> > instructions, even on modern Intel CPUs like Haswell, e.g., please see
> > the three sources:
> >
> > " 11.5.1 Locked Instructions as Memory Barriers
> > Optimization
> > Use locked instructions to implement Store/Store and Store/Load barriers.
> > "
> > http://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf
> >
> > "lock addl %(rsp), 0 is a better solution for StoreLoad barrier ":
> > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> >
> > "...locked instruction are more efficient barriers...":
> > http://www.pvk.ca/Blog/2014/10/19/performance-optimisation-~-writing-an-essay/
> >
> > I also found that FreeBSD prefers Locked Add.
> >
> > So, I'm curious why Linux prefers MFENCE.
> > I guess I may be missing something.
> >
> > I tried to google the question, but didn't find an answer.
>
> It's being worked on, see this thread on lkml from a few weeks ago:
>
> C Jan 13 Michael S. Tsir | [PATCH v3 0/4] x86: faster mb()+documentation tweaks
> C Jan 13 Michael S. Tsir | ├─>[PATCH v3 1/4] x86: add cc clobber for addl
> C Jan 13 Michael S. Tsir | ├─>[PATCH v3 2/4] x86: drop a comment left over from X86_OOSTORE
> C Jan 13 Michael S. Tsir | ├─>[PATCH v3 3/4] x86: tweak the comment about use of wmb for IO
> C Jan 13 Michael S. Tsir | ├─>[PATCH v3 4/4] x86: drop mfence in favor of lock+addl
>
> The 4th patch changes MFENCE to a LOCK ADDL locked instruction.

Lots of additional chatter here:

lkml.kernel.org/r/[email protected]

And some useful bits here:

lkml.kernel.org/r/[email protected]

latest version here:

lkml.kernel.org/r/[email protected]

2016-03-03 18:35:58

by Michael S. Tsirkin

Subject: Re: x86 memory barrier: why does Linux prefer MFENCE to Locked ADD?

On Thu, Mar 03, 2016 at 04:34:53PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 03, 2016 at 04:27:39PM +0100, Ingo Molnar wrote:
> >
> > * Dexuan Cui <[email protected]> wrote:
> >
> > > Hi,
> > > My understanding about arch/x86/include/asm/barrier.h is: obviously Linux
> > > more likes {L,S,M}FENCE -- Locked ADD is only used in x86_32 platforms that
> > > don't support XMM2.
> > >
> > > However, it looks people say Locked Add is much faster than the FENCE
> > > instructions, even on modern Intel CPUs like Haswell, e.g., please see
> > > the three sources:
> > >
> > > " 11.5.1 Locked Instructions as Memory Barriers
> > > Optimization
> > > Use locked instructions to implement Store/Store and Store/Load barriers.
> > > "
> > > http://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf
> > >
> > > "lock addl %(rsp), 0 is a better solution for StoreLoad barrier ":
> > > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> > >
> > > "...locked instruction are more efficient barriers...":
> > > http://www.pvk.ca/Blog/2014/10/19/performance-optimisation-~-writing-an-essay/
> > >
> > > I also found that FreeBSD prefers Locked Add.
> > >
> > > So, I'm curious why Linux prefers MFENCE.
> > > I guess I may be missing something.
> > >
> > > I tried to google the question, but didn't find an answer.
> >
> > It's being worked on, see this thread on lkml from a few weeks ago:
> >
> > C Jan 13 Michael S. Tsir | [PATCH v3 0/4] x86: faster mb()+documentation tweaks
> > C Jan 13 Michael S. Tsir | ├─>[PATCH v3 1/4] x86: add cc clobber for addl
> > C Jan 13 Michael S. Tsir | ├─>[PATCH v3 2/4] x86: drop a comment left over from X86_OOSTORE
> > C Jan 13 Michael S. Tsir | ├─>[PATCH v3 3/4] x86: tweak the comment about use of wmb for IO
> > C Jan 13 Michael S. Tsir | ├─>[PATCH v3 4/4] x86: drop mfence in favor of lock+addl
> >
> > The 4th patch changes MFENCE to a LOCK ADDL locked instruction.
>
> Lots of additional chatter here:
>
> lkml.kernel.org/r/[email protected]
>
> And some useful bits here:
>
> lkml.kernel.org/r/[email protected]
>
> latest version here:
>
> lkml.kernel.org/r/[email protected]

It's ready as far as I am concerned.
Basically we are just waiting for ack from hpa.

--
MST

2016-03-03 19:06:28

by H. Peter Anvin

Subject: Re: x86 memory barrier: why does Linux prefer MFENCE to Locked ADD?

On March 3, 2016 10:35:50 AM PST, "Michael S. Tsirkin" <[email protected]> wrote:
>On Thu, Mar 03, 2016 at 04:34:53PM +0100, Peter Zijlstra wrote:
>> On Thu, Mar 03, 2016 at 04:27:39PM +0100, Ingo Molnar wrote:
>> >
>> > * Dexuan Cui <[email protected]> wrote:
>> >
>> > > Hi,
>> > > My understanding about arch/x86/include/asm/barrier.h is: obviously Linux
>> > > more likes {L,S,M}FENCE -- Locked ADD is only used in x86_32 platforms that
>> > > don't support XMM2.
>> > >
>> > > However, it looks people say Locked Add is much faster than the FENCE
>> > > instructions, even on modern Intel CPUs like Haswell, e.g., please see
>> > > the three sources:
>> > >
>> > > " 11.5.1 Locked Instructions as Memory Barriers
>> > > Optimization
>> > > Use locked instructions to implement Store/Store and Store/Load barriers.
>> > > "
>> > > http://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf
>> > >
>> > > "lock addl %(rsp), 0 is a better solution for StoreLoad barrier ":
>> > > http://shipilev.net/blog/2014/on-the-fence-with-dependencies/
>> > >
>> > > "...locked instruction are more efficient barriers...":
>> > > http://www.pvk.ca/Blog/2014/10/19/performance-optimisation-~-writing-an-essay/
>> > >
>> > > I also found that FreeBSD prefers Locked Add.
>> > >
>> > > So, I'm curious why Linux prefers MFENCE.
>> > > I guess I may be missing something.
>> > >
>> > > I tried to google the question, but didn't find an answer.
>> >
>> > It's being worked on, see this thread on lkml from a few weeks ago:
>> >
>> > C Jan 13 Michael S. Tsir | [PATCH v3 0/4] x86: faster mb()+documentation tweaks
>> > C Jan 13 Michael S. Tsir | ├─>[PATCH v3 1/4] x86: add cc clobber for addl
>> > C Jan 13 Michael S. Tsir | ├─>[PATCH v3 2/4] x86: drop a comment left over from X86_OOSTORE
>> > C Jan 13 Michael S. Tsir | ├─>[PATCH v3 3/4] x86: tweak the comment about use of wmb for IO
>> > C Jan 13 Michael S. Tsir | ├─>[PATCH v3 4/4] x86: drop mfence in favor of lock+addl
>> >
>> > The 4th patch changes MFENCE to a LOCK ADDL locked instruction.
>>
>> Lots of additional chatter here:
>>
>> lkml.kernel.org/r/[email protected]
>>
>> And some useful bits here:
>>
>> lkml.kernel.org/r/[email protected]
>>
>> latest version here:
>>
>> lkml.kernel.org/r/[email protected]
>
>It's ready as far as I am concerned.
>Basically we are just waiting for ack from hpa.

And I'm still discussing this with the hardware people. It seems we can do this for *most* things, but not all; the question is where exactly we need to do something different.