2008-08-03 09:06:25

by Arkadiusz Miśkiewicz

Subject: Opteron Rev E has a bug ... a locked instruction doesn't act as a read-acquire barrier


Hello,

http://google-perftools.googlecode.com/svn-history/r48/trunk/src/base/atomicops-internals-x86.cc
says

" // Opteron Rev E has a bug in which on very rare occasions a locked
// instruction doesn't act as a read-acquire barrier if followed by a
// non-locked read-modify-write instruction. Rev F has this bug in
// pre-release versions, but not in versions released to customers,
// so we test only for Rev E, which is family 15, model 32..63 inclusive.
if (strcmp(vendor, "AuthenticAMD") == 0 && // AMD
family == 15 &&
32 <= model && model <= 63) {
AtomicOps_Internalx86CPUFeatures.has_amd_lock_mb_bug = true;
} else {
AtomicOps_Internalx86CPUFeatures.has_amd_lock_mb_bug = false;
}
"

Does the kernel have a quirk/workaround for this? I'm looking at arch/x86/kernel/cpu
but I don't see a workaround related to this (possibly I'm overlooking it).

--
Arkadiusz Miśkiewicz PLD/Linux Team
arekm / maven.pl http://ftp.pld-linux.org/


2008-08-04 13:26:37

by Mikael Pettersson

Subject: Re: Opteron Rev E has a bug ... a locked instruction doesn't act as a read-acquire barrier

Arkadiusz Miskiewicz writes:
>
> Hello,
>
> http://google-perftools.googlecode.com/svn-history/r48/trunk/src/base/atomicops-internals-x86.cc
> says
>
> " // Opteron Rev E has a bug in which on very rare occasions a locked
> // instruction doesn't act as a read-acquire barrier if followed by a
> // non-locked read-modify-write instruction. Rev F has this bug in
> // pre-release versions, but not in versions released to customers,
> // so we test only for Rev E, which is family 15, model 32..63 inclusive.
> if (strcmp(vendor, "AuthenticAMD") == 0 && // AMD
> family == 15 &&
> 32 <= model && model <= 63) {
> AtomicOps_Internalx86CPUFeatures.has_amd_lock_mb_bug = true;
> } else {
> AtomicOps_Internalx86CPUFeatures.has_amd_lock_mb_bug = false;
> }
> "
>
> does kernel have quirk/workaround for this? I'm looking at arch/x86/kernel/cpu
> but I don't see workaround related to this (possibly I'm overlooking).

I can find no reference to this alleged RevE erratum in the
Athlon64/Opteron revision guide (25759.pdf).

But if this bug is real then we need to know about it. Could
you ask the author of the code you quoted above to clarify?

/Mikael

2008-08-04 13:56:18

by Arkadiusz Miśkiewicz

Subject: Re: Opteron Rev E has a bug ... a locked instruction doesn't act as a read-acquire barrier

On Monday 04 August 2008, Mikael Pettersson wrote:
> Arkadiusz Miskiewicz writes:
> > Hello,
> >
> > http://google-perftools.googlecode.com/svn-history/r48/trunk/src/base/atomicops-internals-x86.cc
> > says
> >
> > " // Opteron Rev E has a bug in which on very rare occasions a locked
> > // instruction doesn't act as a read-acquire barrier if followed by a
> > // non-locked read-modify-write instruction. Rev F has this bug in
> > // pre-release versions, but not in versions released to customers,
> > // so we test only for Rev E, which is family 15, model 32..63 inclusive.
> > if (strcmp(vendor, "AuthenticAMD") == 0 && // AMD
> > family == 15 &&
> > 32 <= model && model <= 63) {
> > AtomicOps_Internalx86CPUFeatures.has_amd_lock_mb_bug = true;
> > } else {
> > AtomicOps_Internalx86CPUFeatures.has_amd_lock_mb_bug = false;
> > }
> > "
> >
> > does kernel have quirk/workaround for this? I'm looking at
> > arch/x86/kernel/cpu but I don't see workaround related to this (possibly
> > I'm overlooking).
>
> I can find no reference to this alleged RevE erratum in the
> Athlon64/Opteron revision guide (25759.pdf).
>
> But if this bug is real then we need to know about it. Could
> you ask the author of the code you quoted above to clarify?

Got an answer: OpenSolaris has some workarounds for this bug. I still don't know
which erratum number it is:

http://groups.google.com/group/google-perftools/browse_thread/thread/3d1b78d4a9db8c6e

btw. I got info about this bug after hitting this problem:
http://bugs.mysql.com/bug.php?id=26081

> /Mikael

--
Arkadiusz Miśkiewicz PLD/Linux Team
arekm / maven.pl http://ftp.pld-linux.org/

2008-08-04 14:54:32

by Mikael Pettersson

Subject: Re: Opteron Rev E has a bug ... a locked instruction doesn't act as a read-acquire barrier

On Mon, 4 Aug 2008 15:56:05 +0200, Arkadiusz Miskiewicz wrote:
>On Monday 04 August 2008, Mikael Pettersson wrote:
>> Arkadiusz Miskiewicz writes:
>> > Hello,
>> >
>> > http://google-perftools.googlecode.com/svn-history/r48/trunk/src/base/atomicops-internals-x86.cc
>> > says
>> >
>> > " // Opteron Rev E has a bug in which on very rare occasions a locked
>> > // instruction doesn't act as a read-acquire barrier if followed by a
>> > // non-locked read-modify-write instruction. Rev F has this bug in
>> > // pre-release versions, but not in versions released to customers,
>> > // so we test only for Rev E, which is family 15, model 32..63 inclusive.
>> > if (strcmp(vendor, "AuthenticAMD") == 0 && // AMD
>> > family == 15 &&
>> > 32 <= model && model <= 63) {
>> > AtomicOps_Internalx86CPUFeatures.has_amd_lock_mb_bug = true;
>> > } else {
>> > AtomicOps_Internalx86CPUFeatures.has_amd_lock_mb_bug = false;
>> > }
>> > "
>> >
>> > does kernel have quirk/workaround for this? I'm looking at
>> > arch/x86/kernel/cpu but I don't see workaround related to this (possibly
>> > I'm overlooking).
>>
>> I can find no reference to this alleged RevE erratum in the
>> Athlon64/Opteron revision guide (25759.pdf).
>>
>> But if this bug is real then we need to know about it. Could
>> you ask the author of the code you quoted above to clarify?
>
>Got answer, opensolaris has some workarounds for this bug I still don't know
>which errata # is that:
>
>http://groups.google.com/group/google-perftools/browse_thread/thread/3d1b78d4a9db8c6e
>
>btw. I got info about this bug after hiting this problem:
>http://bugs.mysql.com/bug.php?id=26081

Thanks, found the Solaris code in question and the mysql discussion.
I'll dig deeper tomorrow.

/Mikael

2008-08-06 13:09:45

by Mikael Pettersson

Subject: Re: Opteron Rev E has a bug ... a locked instruction doesn't act as a read-acquire barrier

Mikael Pettersson writes:
> On Mon, 4 Aug 2008 15:56:05 +0200, Arkadiusz Miskiewicz wrote:
> >On Monday 04 August 2008, Mikael Pettersson wrote:
> >> Arkadiusz Miskiewicz writes:
> >> > Hello,
> >> >
> >> > http://google-perftools.googlecode.com/svn-history/r48/trunk/src/base/atomicops-internals-x86.cc
> >> > says
> >> >
> >> > " // Opteron Rev E has a bug in which on very rare occasions a locked
> >> > // instruction doesn't act as a read-acquire barrier if followed by a
> >> > // non-locked read-modify-write instruction. Rev F has this bug in
> >> > // pre-release versions, but not in versions released to customers,
> >> > // so we test only for Rev E, which is family 15, model 32..63 inclusive.
> >> > if (strcmp(vendor, "AuthenticAMD") == 0 && // AMD
> >> > family == 15 &&
> >> > 32 <= model && model <= 63) {
> >> > AtomicOps_Internalx86CPUFeatures.has_amd_lock_mb_bug = true;
> >> > } else {
> >> > AtomicOps_Internalx86CPUFeatures.has_amd_lock_mb_bug = false;
> >> > }
> >> > "
> >> >
> >> > does kernel have quirk/workaround for this? I'm looking at
> >> > arch/x86/kernel/cpu but I don't see workaround related to this (possibly
> >> > I'm overlooking).
> >>
> >> I can find no reference to this alleged RevE erratum in the
> >> Athlon64/Opteron revision guide (25759.pdf).
> >>
> >> But if this bug is real then we need to know about it. Could
> >> you ask the author of the code you quoted above to clarify?
> >
> >Got answer, opensolaris has some workarounds for this bug I still don't know
> >which errata # is that:
> >
> >http://groups.google.com/group/google-perftools/browse_thread/thread/3d1b78d4a9db8c6e
> >
> >btw. I got info about this bug after hiting this problem:
> >http://bugs.mysql.com/bug.php?id=26081
>
> Thanks, found the Solaris code in question and the mysql discussion.
> I'll dig deeper tomorrow.

I investigated the Solaris track, but I've found no detailed
explanation of the alleged bug. I've asked the Sun engineer
who committed the fix for an explanation, but so far there's
been no reply.

Anyway, here's what I've found out.

It's Solaris bug # 6323525.

They call it "Mutex primitives don't work as expected."

if (number_of_cores() < 2) then don't have bug
if (family == 0xf && Model < 0x40) then have bug
if (rdmsr(MSR_BU_CFG/*0xC0011023*/) & 2) then bug is masked

lock: // mutex_lock, spin_lock, etc
...
lock; cmpxchg ..
jnz fail
ret; nop; nop; nop // patched to "lfence; ret" if bug

The workaround is to place a fencing instruction (lfence) between
the mutex operation and the subsequent read-modify-write instruction.
(This provides the necessary load memory barrier.)

There's no change to the unlock code.

Anyone know who to contact @ AMD about confirming or denying this?

/Mikael