2008-08-06 17:14:09

by Arkadiusz Miśkiewicz

Subject: Re: Opteron Rev E has a bug ... a locked instruction doesn't act as a read-acquire barrier (confirmed)

On Wednesday 06 August 2008, Wahlig, Elsie wrote:
> Your issue may be one that has been seen on 1st generation
> AMD Opteron processors with cpuid family 0Fh, cpuid models
> < 40h with the code sequence that performs a read-modify-write
> operation after acquiring a semaphore.

Matches my hardware

cpu family : 15
model : 33
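
For illustration only, here is a minimal sketch of how the family/model test could be done from the raw CPUID function 1 EAX value. The names decode_cpu_sig() and sig_in_affected_range() are made up for this example, the vendor string is not checked, and the extended-model handling assumes the AMD convention of applying the extended fields when the base family is 0Fh.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded CPUID function 1 EAX signature (hypothetical helper type). */
struct cpu_sig {
    unsigned family;
    unsigned model;
    unsigned stepping;
};

/* Decode family/model/stepping from the raw EAX value of CPUID function 1. */
struct cpu_sig decode_cpu_sig(uint32_t eax)
{
    unsigned base_family = (eax >> 8) & 0xf;
    unsigned base_model  = (eax >> 4) & 0xf;
    unsigned ext_family  = (eax >> 20) & 0xff;
    unsigned ext_model   = (eax >> 16) & 0xf;
    struct cpu_sig sig;

    sig.stepping = eax & 0xf;
    sig.family = base_family;
    sig.model = base_model;
    if (base_family == 0xf) {
        /* Extended fields only apply when the base family is 0Fh. */
        sig.family += ext_family;
        sig.model |= ext_model << 4;
    }
    return sig;
}

/* True for family 0Fh with model < 40h, the range named in the report. */
bool sig_in_affected_range(uint32_t eax)
{
    struct cpu_sig sig = decode_cpu_sig(eax);

    return sig.family == 0xf && sig.model < 0x40;
}
```

With family 15 and model 33 (21h) as shown above, the predicate is true, since 21h < 40h.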

>
> The memory read ordering between a semaphore operation and a
> subsequent read-modify-write instruction (an instruction which
> uses the same memory location as both a source and destination)
> may allow the read-modify-write instruction to operate on the
> memory location ahead of the completion of the semaphore
> operation and an erratum may occur.

I wonder why there was no official errata about this?


> If you think your software is encountering this code sequence,
> a work-around should be implemented by adding an LFENCE
> instruction right after the semaphore, after a cpuid check.
> The workarounds applied to OpenSolaris at
> http://mail.opensolaris.org/pipermail/onnv-notify/2006-October/009080.html
> and to Google performance tools at
> http://google-perftools.googlecode.com/svn-history/r48/trunk/src/base/atomicops-internals-x86.cc
> are suitable examples.
> A list of the model numbers this issue may occur on is at
> http://products.amd.com/en-us/downloads/AMD_Opteron_First_Generation_Reference_101607.pdf.

It would be better to fix the bug at the kernel level, if that is possible.
It just needs someone with the knowledge to do it. Anyone interested?

> Mikael Pettersson writes:
> ... snip ...
>
> > I investigated the Solaris track, but I've found no detailed
> > explanation of the alleged bug. I've asked the Sun engineer
> > who committed the fix for an explanation, but so far there's
> > been no reply.
> >
> > Anyway, here's what I've found out.
> >
> > It's Solaris bug # 6323525.
> >
> > They call it "Mutex primitives don't work as expected."
> >
> > if (number_of_cores() < 2) then don't have bug
> > if (family == 0xf && Model < 0x40) then have bug
> > if (rdmsr(MSR_BU_CFG/*0xC0011023*/) & 2) then bug is masked
> >
> > lock: // mutex_lock, spin_lock, etc
> > ...
> > lock; cmpxchg ..
> > jnz fail
> > ret; nop; nop; nop // patched to "lfence; ret" if bug
> >
> > The workaround is to place a fencing instruction (lfence) between
> > the mutex operation and the subsequent read-modify-write instruction.
> > (This provides the necessary load memory barrier.)
> >
> > There's no change to the unlock code.
> >
> > Anyone know who to contact @ AMD about confirming or denying this?
> >
> > /Mikael
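
One possible reading of the three conditions in the quoted pseudo-code, written as a plain C predicate. This is only a sketch: the function name and the BU_CFG_BUG_MASKED macro are invented here, the order of the checks is an interpretation of the pseudo-code, and the caller is assumed to supply the MSR value itself (rdmsr needs ring 0, so it is not done in this snippet).

```c
#include <stdbool.h>
#include <stdint.h>

/* The "& 2" test on MSR_BU_CFG (0xC0011023) from the pseudo-code above. */
#define BU_CFG_BUG_MASKED 0x2u

/*
 * Decide whether the lfence workaround would be wanted.
 * ncores, family and model come from CPU enumeration;
 * bu_cfg is the value read from MSR_BU_CFG by privileged code.
 */
bool needs_lfence_workaround(unsigned ncores, unsigned family,
                             unsigned model, uint64_t bu_cfg)
{
    if (ncores < 2)
        return false;           /* single core: don't have bug */
    if (family != 0xf || model >= 0x40)
        return false;           /* outside the affected range */
    if (bu_cfg & BU_CFG_BUG_MASKED)
        return false;           /* bug is masked */
    return true;
}
```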



--
Arkadiusz Miśkiewicz PLD/Linux Team
arekm / maven.pl http://ftp.pld-linux.org/


2008-08-06 19:23:41

by Mikael Pettersson

Subject: Re: Opteron Rev E has a bug ... a locked instruction doesn't act as a read-acquire barrier (confirmed)

On Wed, 6 Aug 2008 19:13:34 +0200, Arkadiusz Miskiewicz wrote:
>On Wednesday 06 August 2008, Wahlig, Elsie wrote:
>> Your issue may be one that has been seen on 1st generation
>> AMD Opteron processors with cpuid family 0Fh, cpuid models
>> < 40h with the code sequence that performs a read-modify-write
>> operation after acquiring a semaphore.
>
>Matches my hardware
>
>cpu family : 15
>model : 33
>
>>
>> The memory read ordering between a semaphore operation and a
>> subsequent read-modify-write instruction (an instruction which
>> uses the same memory location as both a source and destination)
>> may allow the read-modify-write instruction to operate on the
>> memory location ahead of the completion of the semaphore
>> operation and an erratum may occur.

Thanks for the detailed erratum description.

>I wonder why there was no official errata about this?

Indeed.

>> If you think your software is encountering this code sequence,
>> a work-around should be implemented by adding an LFENCE
>> instruction right after the semaphore, after a cpuid check.
>> The workarounds applied to OpenSolaris at
>> http://mail.opensolaris.org/pipermail/onnv-notify/2006-October/009080.html
>> and to Google performance tools at
>> http://google-perftools.googlecode.com/svn-history/r48/trunk/src/base/atomicops-internals-x86.cc
>> are suitable examples.
>> A list of the model numbers this issue may occur on is at
>> http://products.amd.com/en-us/downloads/AMD_Opteron_First_Generation_Reference_101607.pdf.
>
>It would be better to fix the bug at the kernel level, if that is possible.
>It just needs someone with the knowledge to do it. Anyone interested?

In principle it's easy. We append a 3-byte nop to the lock-taking
instructions. We invent an AMD_MUTEX_BUG synthetic cpuid feature
bit and add boot-time code to detect it. We use the alternatives()
infrastructure to replace that nop with lfence at boot-time if
AMD_MUTEX_BUG is present.
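
A rough user-space analogue of that idea, for illustration only. It does not use the kernel's alternatives mechanism: instead of patching a nop at boot, it tests a flag that is set once at startup. All names here (locks_init, lock_try_acquire, and so on) are hypothetical, and the GCC __sync builtins are assumed to emit a locked cmpxchg on x86.

```c
#include <stdbool.h>

typedef volatile int spinlock_t;    /* 0 = free, 1 = held */

/* Set once at startup, before any thread takes a lock. */
static bool amd_mutex_bug;

void locks_init(bool cpu_has_bug)
{
    amd_mutex_bug = cpu_has_bug;
}

/* Load fence; falls back to a full barrier on non-x86 builds. */
static inline void load_fence(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("lfence" ::: "memory");
#else
    __sync_synchronize();
#endif
}

/* One attempt to take the lock; the fence runs only on success. */
bool lock_try_acquire(spinlock_t *l)
{
    if (!__sync_bool_compare_and_swap(l, 0, 1))
        return false;
    if (amd_mutex_bug)
        load_fence();   /* stands in for the nop patched to lfence */
    return true;
}

void lock_acquire(spinlock_t *l)
{
    while (!lock_try_acquire(l))
        ;               /* spin */
}

/* Unlock is unchanged: a plain release store of 0. */
void lock_release(spinlock_t *l)
{
    __sync_lock_release(l);
}
```

The flag test costs a branch on every acquisition, which is what boot-time patching would avoid.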

I think the hardest part is locating all lock-taking code sequences.

Also I think I'll start by writing a user-space test program that
does a stress-test of the plain lock;rmw;unlock sequence to see if
it can break it. (Locks/mutexes are also used in user-space.)
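
A minimal sketch of what such a stress test might look like, assuming pthreads and the GCC __sync builtins. The names run_stress() and worker() are invented for this example, and whether a sequence this simple actually triggers the problem on affected hardware is exactly the open question.

```c
#include <pthread.h>
#include <stdlib.h>

static volatile int lock_word;      /* 0 = free, 1 = held */
static volatile int counter;        /* protected by lock_word */
static long iterations;

static void *worker(void *arg)
{
    (void)arg;
    for (long i = 0; i < iterations; i++) {
        /* lock: a locked cmpxchg on x86 */
        while (!__sync_bool_compare_and_swap(&lock_word, 0, 1))
            ;   /* spin */
        /* rmw: a single read-modify-write instruction on memory */
#if defined(__x86_64__) || defined(__i386__)
        __asm__ __volatile__("incl %0" : "+m"(counter) : : "cc");
#else
        counter++;
#endif
        /* unlock: plain release store of 0 */
        __sync_lock_release(&lock_word);
    }
    return NULL;
}

/* Returns the final counter value, or -1 if setup failed. */
int run_stress(int nthreads, long iters)
{
    pthread_t *tids = malloc(sizeof(*tids) * (size_t)nthreads);
    int started = 0;

    if (tids == NULL)
        return -1;
    lock_word = 0;
    counter = 0;
    iterations = iters;

    while (started < nthreads &&
           pthread_create(&tids[started], NULL, worker, NULL) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    free(tids);

    return started == nthreads ? counter : -1;
}
```

If the lock works as intended, run_stress(n, iters) returns n * iters; a smaller value would mean increments were lost.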

/Mikael

2008-08-06 21:29:19

by Wahlig, Elsie

Subject: RE: Opteron Rev E has a bug ... a locked instruction doesn't act as a read-acquire barrier (confirmed)



Mikael Pettersson writes:
>
> On Wed, 6 Aug 2008 19:13:34 +0200, Arkadiusz Miskiewicz wrote:
> >On Wednesday 06 August 2008, Wahlig, Elsie wrote:
> >> Your issue may be one that has been seen on 1st generation
> >> AMD Opteron processors with cpuid family 0Fh, cpuid models
> >> < 40h with the code sequence that performs a read-modify-write
> >> operation after acquiring a semaphore.
> >
> >Matches my hardware
> >
> >cpu family : 15
> >model : 33
> >
> >>
> >> The memory read ordering between a semaphore operation and a
> >> subsequent read-modify-write instruction (an instruction which
> >> uses the same memory location as both a source and destination)
> >> may allow the read-modify-write instruction to operate on the
> >> memory location ahead of the completion of the semaphore
> >> operation and an erratum may occur.
>
> Thanks for the detailed erratum description.
>
> >I wonder why there was no official errata about this?
>
> Indeed.

I don't know but I will see about getting it in there.

Elsie

>
> >> If you think your software is encountering this code sequence, a
> >> work-around should be implemented by adding an LFENCE instruction
> >> right after the semaphore, after a cpuid check.
> >> The workarounds applied to OpenSolaris at
> >> http://mail.opensolaris.org/pipermail/onnv-notify/2006-October/009080.html
> >> and to Google performance tools at
> >> http://google-perftools.googlecode.com/svn-history/r48/trunk/src/base/atomicops-internals-x86.cc
> >> are suitable examples.
> >> A list of the model numbers this issue may occur on is at
> >> http://products.amd.com/en-us/downloads/AMD_Opteron_First_Generation_Reference_101607.pdf.
> >
> >It would be better to fix the bug at the kernel level, if that is possible.
> >It just needs someone with the knowledge to do it. Anyone interested?
>
> In principle it's easy. We append a 3-byte nop to the
> lock-taking instructions. We invent an AMD_MUTEX_BUG
> synthetic cpuid feature bit and add boot-time code to detect
> it. We use the alternatives() infrastructure to replace that
> nop with lfence at boot-time if AMD_MUTEX_BUG is present.
>
> I think the hardest part is locating all lock-taking code sequences.
>
> Also I think I'll start by writing a user-space test program
> that does a stress-test of the plain lock;rmw;unlock sequence
> to see if it can break it. (Locks/mutexes are also used in
> user-space.)
>
> /Mikael
>
>

2008-08-11 17:03:57

by Arkadiusz Miśkiewicz

Subject: Re: Opteron Rev E has a bug ... a locked instruction doesn't act as a read-acquire barrier (confirmed)

On Wednesday 06 August 2008, Mikael Pettersson wrote:
> On Wed, 6 Aug 2008 19:13:34 +0200, Arkadiusz Miskiewicz wrote:
> >On Wednesday 06 August 2008, Wahlig, Elsie wrote:
> >> Your issue may be one that has been seen on 1st generation
> >> AMD Opteron processors with cpuid family 0Fh, cpuid models
> >> < 40h with the code sequence that performs a read-modify-write
> >> operation after acquiring a semaphore.
[...]
> Also I think I'll start by writing a user-space test program that
> does a stress-test of the plain lock;rmw;unlock sequence to see if
> it can break it. (Locks/mutexes are also used in user-space.)

I've filed a bug report, so hopefully this won't get lost.

http://bugzilla.kernel.org/show_bug.cgi?id=11305

> /Mikael

--
Arkadiusz Miśkiewicz PLD/Linux Team
arekm / maven.pl http://ftp.pld-linux.org/