LinuxLists.cc - AMD FX CPU bug, not fixed by latest microcode?

2012-06-10 19:47:30

Subject: AMD FX CPU bug, not fixed by latest microcode?

Hi,

I have an AMD FX-8120 boxed CPU in an ASUS M5A99X-EVO mainboard
with 32GB DDR3/1600 memory, running Fedora 17, upgraded from 16.
memtest86+ shows no problems.

Still, I get occasional crashes and signal 11 during kernel compilation even
with single-job make. Sometimes the compiler jumps out with a strange
error message, like "stray \NNN character in the source". When re-running
make, the error doesn't happen in the same file and the source file doesn't
contain the character being complained about when inspecting with
an editor or hexdump.

Now, a few minutes ago I was able to catch this bug when I copied the
kernel GIT tree to apply a patch manually and did "git commit -a".
Strangely, the commit contained one extra file that I didn't touch.
git diff showed this for the extra file:

==============================
--- a/drivers/usb/gadget/fsl_usb2_udc.h
+++ b/drivers/usb/gadget/fsl_usb2_udc.h
@@ -427,7 +427,7 @@ struct ep_td_struct {
#define DTD_ADDR_MASK 0xFFFFFFE0
#define DTD_PACKET_SIZE 0x7FFF0000
#define DTD_LENGTH_BIT_POS 16
-#define DTD_ERROR_MASK (DTD_STATUS_HALTED | \
+#define DTD_ERROR_MASK (DTD_STATUS_HALTED | ^Z
DTD_STATUS_DATA_BUFF_ERR | \
DTD_STATUS_TRANSACTION_ERR)
/* Alignment requirements; must be a power of two */
==============================

The "^Z" is a 0-character in the file and is not present in the
original source tree, only in the copy.

Similar errors happened while copying large files on the same
machine, but it seems that they are triggered whenever the total
amount of data read is large enough.

The mainboard has the latest (UEFI) firmware flashed which
contains the latest AMD microcode, so microcode_ctl doesn't
need to apply it anymore. Previously, I used amd-ucode-2012-01-17.tar
from http://www.amd64.org/support/microcode.html which is now
part of microcode_ctl in Fedora.

Since the error happens during compiling a source file and not only
copying, the bug seems to happen while *reading* data.

Does anyone know whether it's a known problem in AMD FX CPUs?
Does AMD have a newer microcode to fix this bug, or should I apply
for warranty?

Thanks in advance,
Zoltán Böszörményi

2012-06-11 03:45:53

by Rus

Subject: Re: AMD FX CPU bug, not fixed by latest microcode?

:Does anyone know whether it's a known problem in AMD FX CPUs?
:Does AMD have a newer microcode to fix this bug, or should I apply
:for warranty?

This is not related to AMD microcode. The main problem is that ASUS does
not test its motherboards against the supported CPU list. The next
problem is the m/b BIOS, and the last problem is that ASUS ignores Linux
users and does not treat Linux as an officially supported OS on its m/b.
I'm having the same problems with an FX-8150 and an M5A97-PRO m/b. This
m/b does not work by default with FX-* CPUs because the BIOS selects the
wrong power mode for them. To make it work somehow, you need to:

1. Flash the latest BIOS
2. Disable turbo core in BIOS setup
3. Enable extreme EPU mode in BIOS setup
4. Disable EPU Power Saving mode in BIOS setup
5. Use a kernel >= 3.5 (rc1 or rc2 for now)
6. Experiment with the IOMMU mode in BIOS setup (try it enabled, then disabled)

P.S. I'm not subscribed, so cc.

Rus

--
SfinxSoft
http://sfinxsoft.com

2012-06-11 07:52:12

by Clemens Ladisch

Subject: Re: AMD FX CPU bug, not fixed by latest microcode?

Boszormenyi Zoltan wrote:
> I have an AMD FX-8120 boxed CPU in an ASUS M5A99X-EVO mainboard
> with 32GB DDR3/1600 memory, running Fedora 17, upgraded from 16.
>
> I get occasional crashes and signal 11 during kernel compilation even
> with single-job make. Sometimes the compiler jumps out with a strange
> error message, like "stray \NNN character in the source". When re-running
> make, the error doesn't happen in the same file and the source file doesn't
> contain the character being complained about when inspecting with
> an editor or hexdump.
>
> Now, a few minutes ago I was able to catch this bug when I copied the
> kernel GIT tree to apply a patch manually and did "git commit -a".
> Strangely, the commit contained one extra file that I didn't touch.
> git diff showed this for the extra file:
>
> ==============================
> --- a/drivers/usb/gadget/fsl_usb2_udc.h
> +++ b/drivers/usb/gadget/fsl_usb2_udc.h
> @@ -427,7 +427,7 @@ struct ep_td_struct {
> #define DTD_ADDR_MASK 0xFFFFFFE0
> #define DTD_PACKET_SIZE 0x7FFF0000
> #define DTD_LENGTH_BIT_POS 16
> -#define DTD_ERROR_MASK (DTD_STATUS_HALTED | \
> +#define DTD_ERROR_MASK (DTD_STATUS_HALTED | ^Z
> DTD_STATUS_DATA_BUFF_ERR | \
> DTD_STATUS_TRANSACTION_ERR)
> /* Alignment requirements; must be a power of two */
> ==============================
>
> The "^Z" is a 0-character in the file and is not present in the
> original source tree, only in the copy.

Is it always a zero, or are there other invalid characters?
(The (number of) changed bits might tell something.)

> Similar errors happened during copying large files on the same
> machine but it seems it's enough to trigger if the total amount
> of data read is large enough.

Does "large enough" mean "large enough so that they are not in the file
cache"?

All caches and your memory are ECC protected, so I think it is unlikely
that the problem is with these. If I had to guess, I'd point to your
disk (firmware) or the SATA controller. (A bad or loose SATA cable
would throw CRC errors into the kernel log. Are there any?)
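
The kernel-log question can be checked by scanning a saved dmesg dump for
link-level error markers. The sketch below is only illustrative: the function
name is hypothetical, and the patterns (BadCRC, ICRC, "CRC error") are an
assumption about what libata-style messages tend to contain, not an
exhaustive list.

```python
import re

# Assumed markers for SATA link/CRC trouble; adjust to what your kernel prints.
CRC_PATTERN = re.compile(r"BadCRC|ICRC|CRC error", re.IGNORECASE)


def find_crc_lines(log_text):
    """Return the log lines that match one of the assumed CRC markers."""
    return [line for line in log_text.splitlines() if CRC_PATTERN.search(line)]
```

It could be fed the output of "dmesg" saved to a file; an empty result does
not prove the link is healthy, it only means none of the assumed markers
appeared.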

What is the exact offset of the changed byte in the file? (It might be
at a cacheline, sector, or page boundary.)

> Does anyone know whether it's a known problem in AMD FX CPUs?

http://support.amd.com/us/Processor_TechDocs/48063_15h_Mod_00h-0Fh_Rev_Guide.pdf

Regards,
Clemens

2012-06-11 08:13:45

by Böszörményi Zoltán

Subject: Re: AMD FX CPU bug, not fixed by latest microcode?

On 2012-06-11 09:52, Clemens Ladisch wrote:
> Boszormenyi Zoltan wrote:
>> I have an AMD FX-8120 boxed CPU in an ASUS M5A99X-EVO mainboard
>> with 32GB DDR3/1600 memory, running Fedora 17, upgraded from 16.
>>
>> I get occasional crashes and signal 11 during kernel compilation even
>> with single-job make. Sometimes the compiler jumps out with a strange
>> error message, like "stray \NNN character in the source". When re-running
>> make, the error doesn't happen in the same file and the source file doesn't
>> contain the character being complained about when inspecting with
>> an editor or hexdump.
>>
>> Now, a few minutes ago I was able to catch this bug when I copied the
>> kernel GIT tree to apply a patch manually and did "git commit -a".
>> Strangely, the commit contained one extra file that I didn't touch.
>> git diff showed this for the extra file:
>>
>> ==============================
>> --- a/drivers/usb/gadget/fsl_usb2_udc.h
>> +++ b/drivers/usb/gadget/fsl_usb2_udc.h
>> @@ -427,7 +427,7 @@ struct ep_td_struct {
>> #define DTD_ADDR_MASK 0xFFFFFFE0
>> #define DTD_PACKET_SIZE 0x7FFF0000
>> #define DTD_LENGTH_BIT_POS 16
>> -#define DTD_ERROR_MASK (DTD_STATUS_HALTED | \
>> +#define DTD_ERROR_MASK (DTD_STATUS_HALTED | ^Z
>> DTD_STATUS_DATA_BUFF_ERR | \
>> DTD_STATUS_TRANSACTION_ERR)
>> /* Alignment requirements; must be a power of two */
>> ==============================
>>
>> The "^Z" is a 0-character in the file and is not present in the
>> original source tree, only in the copy.

Actually, the "^Z" there is 0x1a. It should be 0x5c, the backslash character.
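
The number of changed bits follows directly from these two values. A minimal
sketch (the helper name is made up for illustration):

```python
def bit_diff(expected, actual):
    """Return (xor_mask, number_of_flipped_bits) for two byte values."""
    mask = expected ^ actual
    return mask, bin(mask).count("1")


mask, flipped = bit_diff(0x5C, 0x1A)  # backslash vs. the observed ^Z
print(f"xor mask = {mask:#04x} ({mask:08b}), flipped bits = {flipped}")
# prints: xor mask = 0x46 (01000110), flipped bits = 3
```

Three bits differ here, so this particular corruption is not a single-bit
flip.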

> Is it always a zero, or other invalids characters?
> (The (number of) changed bits might tell something.)

IIRC, GCC has a different error for a 0-character and "stray \NNN character"
(that's not inside a string literal) and both happened at some time.
Sorry, I didn't bother to make a note of the error messages.

>
>> Similar errors happened during copying large files on the same
>> machine but it seems it's enough to trigger if the total amount
>> of data read is large enough.
> Does "large enough" mean "large enough so that they are not in the file
> cache"?
>
> All caches and your memory are ECC protected,

Unfortunately, the memory is not ECC. "Large enough" means that it's
usually not in the file system cache.

> so I think it is unlikely
> that the problem is with these. If I had to guess, I'd point to your
> disk (firmware) or the SATA controller. (A bad or loose SATA cable
> would throw CRC errors into the kernel log. Are there any?)

The disks (8 of them) are attached to a 3ware 9650SE-8LPML in RAID10.
tw_cli reports no problems.

> What is the exact offset of the changed byte in the file? (It might be
> at a cacheline, sector, or page boundary.)

The bad character is at offset 0x4b74.

>> Does anyone know whether it's a known problem in AMD FX CPUs?
> http://support.amd.com/us/Processor_TechDocs/48063_15h_Mod_00h-0Fh_Rev_Guide.pdf

Thanks, but I have seen this file already. The "no fix planned" for every
erratum is saddening...

>
>
> Regards,
> Clemens
>

2012-06-11 08:43:25

by Borislav Petkov

Subject: Re: AMD FX CPU bug, not fixed by latest microcode?

(leaving in the full text)

On Sun, Jun 10, 2012 at 09:24:13PM +0200, Boszormenyi Zoltan wrote:
> Hi,
>
> I have an AMD FX-8120 boxed CPU in an ASUS M5A99X-EVO mainboard
> with 32GB DDR3/1600 memory, running Fedora 17, upgraded from 16.
> memtest86+ show no problems.

Did you have the same issue with Fedora 16? Also, could you test with
another distro whether the same thing happens?

> Still, I get occasional crashes and signal 11 during kernel
> compilation even with single-job make. Sometimes the compiler jumps
> out with a strange error message, like "stray \NNN character in the
> source".

Is that the same ^Z character as below? Is it the character with ASCII
code 26 (octal \032)? Or do you get different characters each time?

> When re-running
> make, the error doesn't happen in the same file and the source file doesn't
> contain the character being complained about when inspecting with
> an editor or hexdump.
>
> Now, a few minutes ago I was able to catch this bug when I copied the
> kernel GIT tree to apply a patch manually and did "git commit -a".
> Strangely, the commit contained one extra file that I didn't touch.
> git diff showed this for the extra file:
>
> ==============================
> --- a/drivers/usb/gadget/fsl_usb2_udc.h
> +++ b/drivers/usb/gadget/fsl_usb2_udc.h
> @@ -427,7 +427,7 @@ struct ep_td_struct {
> #define DTD_ADDR_MASK 0xFFFFFFE0
> #define DTD_PACKET_SIZE 0x7FFF0000
> #define DTD_LENGTH_BIT_POS 16
> -#define DTD_ERROR_MASK (DTD_STATUS_HALTED | \
> +#define DTD_ERROR_MASK (DTD_STATUS_HALTED | ^Z
> DTD_STATUS_DATA_BUFF_ERR | \
> DTD_STATUS_TRANSACTION_ERR)
> /* Alignment requirements; must be a power of two */
> ==============================
>
> The "^Z" is a 0-character in the file and is not present in the
> original source tree, only in the copy.
>
> Similar errors happened during copying large files on the same
> machine but it seems it's enough to trigger if the total amount
> of data read is large enough.
>
> The mainboard has the latest (UEFI) firmware flashed which
> contains the latest AMD microcode, so microcode_ctl doesn't
> need to apply it anymore. Previously, I used amd-ucode-2012-01-17.tar
> from http://www.amd64.org/support/microcode.html which is now
> part of microcode_ctl in Fedora.

Can you send /proc/cpuinfo?

Also, a dmesg from a recent kernel?

> Since the error happens during compiling a source file and not only
> copying, the bug seems to happens during *reading* data.
>
> Does anyone know whether it's a known problem in AMD FX CPUs?
> Does AMD have a newer microcode to fix this bug, or should I apply
> for warranty?

Thanks.

--
Regards/Gruss,
Boris.

2012-06-11 09:49:10

by Borislav Petkov

Subject: Re: AMD FX CPU bug, not fixed by latest microcode?

On Mon, Jun 11, 2012 at 10:43:18AM +0200, Borislav Petkov wrote:
> On Sun, Jun 10, 2012 at 09:24:13PM +0200, Boszormenyi Zoltan wrote:
> > I have an AMD FX-8120 boxed CPU in an ASUS M5A99X-EVO mainboard
> > with 32GB DDR3/1600 memory, running Fedora 17, upgraded from 16.
> > memtest86+ show no problems.

One other thing: if there's an option in the BIOS to disable the IOMMU,
can you do that and try reproducing the issue with IOMMU disabled?

Thanks.

--
Regards/Gruss,
Boris.

2012-06-11 10:22:04

by Clemens Ladisch

Subject: Re: AMD FX CPU bug, not fixed by latest microcode?

Boszormenyi Zoltan wrote:
> On 2012-06-11 09:52, Clemens Ladisch wrote:
>>> Similar errors happened during copying large files on the same
>>> machine but it seems it's enough to trigger if the total amount
>>> of data read is large enough.
>>
>> Does "large enough" mean "large enough so that they are not in the file
>> cache"?
>>
> "Large enough" means it's usually not in file system cache

If you could see a change while it's in the cache, you could rule
out the disks.
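
A rough sketch of that check: hash the same file repeatedly and compare the
digests. After the first pass the data is normally served from the page
cache, so a digest that changes between passes would point away from the
disks. The function names and the choice of SHA-256 are illustrative
assumptions; whether the file really stays cached depends on its size and on
memory pressure.

```python
import hashlib


def digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def count_mismatches(path, passes=20):
    """Re-read the file `passes` times; count digests that differ from the first."""
    reference = digest(path)
    return sum(1 for _ in range(passes) if digest(path) != reference)
```

On healthy hardware count_mismatches() should return 0 for a file that
nobody is modifying.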

>> All caches and your memory are ECC protected,
>
> Unfortunately the memory is not with ECC.

Sorry, I misread your mail.

This means that you cannot rule out bad memory.

>> so I think it is unlikely
>> that the problem is with these. If I had to guess, I'd point to your
>> disk (firmware) or the SATA controller. (A bad or loose SATA cable
>> would throw CRC errors into the kernel log. Are there any?)
>
> The disks (8 of them) are attached to 3ware 9650SE-8LPML in RAID10.
> tw_cli reports no problems.

Could you check whether the same happens with some disk connected to
the on-board SATA controller? Or while copying around lots of data
inside a RAM disk?
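
The RAM-disk variant could look roughly like this: copy a file from copy to
copy inside a memory-backed directory and compare each copy with the
original. This assumes that /dev/shm is a tmpfs mount, as on most Linux
distributions; the function name and the file names are hypothetical.

```python
import filecmp
import os
import shutil


def copy_and_verify(source, workdir, rounds=10):
    """Chain-copy `source` inside `workdir`; return the rounds whose copy differed."""
    bad_rounds = []
    previous = source
    for i in range(rounds):
        target = os.path.join(workdir, f"copy_{i}.bin")
        shutil.copyfile(previous, target)
        # shallow=False forces a byte-by-byte comparison against the original.
        if not filecmp.cmp(source, target, shallow=False):
            bad_rounds.append(i)
        if previous != source:
            os.remove(previous)  # drop the intermediate copy
        previous = target
    if previous != source:
        os.remove(previous)  # drop the last copy, never the source
    return bad_rounds
```

For example, copy_and_verify("/dev/shm/big.bin", "/dev/shm", rounds=100)
with a multi-gigabyte file keeps all traffic in RAM; any non-empty result
cannot be blamed on the disks or the RAID controller.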

>> What is the exact offset of the changed byte in the file? (It might be
>> at a cacheline, sector, or page boundary.)
>
> The bad character is at offset 0x4b74.

That's completely random, i.e., probably a hardware error.
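
The alignment can be checked with a little arithmetic. The sketch below
assumes a 64-byte cacheline, a 512-byte sector and a 4096-byte page; these
sizes are common defaults, not values taken from the reporter's system.

```python
BOUNDARIES = {"cacheline": 64, "sector": 512, "page": 4096}  # assumed sizes


def boundary_offsets(offset):
    """Return the offset's remainder within each boundary size."""
    return {name: offset % size for name, size in BOUNDARIES.items()}


print(boundary_offsets(0x4B74))
# prints: {'cacheline': 52, 'sector': 372, 'page': 2932}
```

A remainder of 0 (or of size minus 1) would have hinted at a boundary
effect; none of these is close.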

>> http://support.amd.com/us/Processor_TechDocs/48063_15h_Mod_00h-0Fh_Rev_Guide.pdf
>
> The "no fix planned" for every errata is saddening...

It's good news, because none of them actually matter.

Regards,
Clemens

2012-06-11 10:57:58

by Böszörményi Zoltán

Subject: Re: AMD FX CPU bug, not fixed by latest microcode?

On 2012-06-11 12:21, Clemens Ladisch wrote:
> All caches and your memory are ECC protected,
>> Unfortunately the memory is not with ECC.
> Sorry, I misread your mail.
>
> This means that you cannot rule out bad memory.

And there were two storms recently with lightning that struck nearby. :-(
I will retest with memtest86+.

Thanks to everyone who replied.

Best regards,
Zoltán Böszörményi

2012-06-11 11:06:32

by Johannes Stezenbach

Subject: Re: AMD FX CPU bug, not fixed by latest microcode?

On Mon, Jun 11, 2012 at 11:49:05AM +0200, Borislav Petkov wrote:
> On Mon, Jun 11, 2012 at 10:43:18AM +0200, Borislav Petkov wrote:
> > On Sun, Jun 10, 2012 at 09:24:13PM +0200, Boszormenyi Zoltan wrote:
> > > I have an AMD FX-8120 boxed CPU in an ASUS M5A99X-EVO mainboard
> > > with 32GB DDR3/1600 memory, running Fedora 17, upgraded from 16.
> > > memtest86+ show no problems.
>
> Ohe other thing: if there's an option in the BIOS to disable the IOMMU,
> can you do that and try reproducing the issue with IOMMU disabled?

Maybe not related, but some months ago I had bad memory in my Intel
Core i5 based system, which resulted in rare crashes that usually
manifested as g++ ICEs when compiling a mid-sized C++ project --
compiling a kernel with make -j4 showed no problem. memtest86+ didn't
show the issue either, so I tried memtest86-4.0a, which claims to find
more errors due to SMP support. An overnight run left me with a screen
full of garbage and a crashed memtest86-4.0a. I replaced
the RAM anyway and the box has been stable since then.

memtest86-4.0a is at
http://memtest86.com/

The page claims:
With a single CPU it is not possible to drive multi-channel memory
controllers at full speed making it impossible to detect some types of errors

Maybe someone knowledgeable could comment on whether this is true.

Johannes

2012-06-13 07:30:25

by Böszörményi Zoltán

Subject: Re: AMD FX CPU bug, not fixed by latest microcode?

On 2012-06-11 13:05, Johannes Stezenbach wrote:
> On Mon, Jun 11, 2012 at 11:49:05AM +0200, Borislav Petkov wrote:
>> On Mon, Jun 11, 2012 at 10:43:18AM +0200, Borislav Petkov wrote:
>>> On Sun, Jun 10, 2012 at 09:24:13PM +0200, Boszormenyi Zoltan wrote:
>>>> I have an AMD FX-8120 boxed CPU in an ASUS M5A99X-EVO mainboard
>>>> with 32GB DDR3/1600 memory, running Fedora 17, upgraded from 16.
>>>> memtest86+ show no problems.
>> Ohe other thing: if there's an option in the BIOS to disable the IOMMU,
>> can you do that and try reproducing the issue with IOMMU disabled?
> Maybe not related, but I had bad memory in my Intel Core-i5
> based system some months ago which resulted in rare crashes,
> usually manifested itself as g++ ICEs when compiling a
> mid-sized C++ project -- compiling a kernel with make -p4 showed
> no problem. Also memtest86+ didn't show the issue,
> so I tried memtest86-4.0a which claims to find more errors
> due to SMP support. An overnight run left me with a screen
> full of garbage and a crashed memtest86-4.0. I replaced
> the RAM anyway and the box was stable since then.
>
> memtest86-4.0a is at
> http://memtest86.com/
>
> The page claims:
> With a single CPU it is not possible to drive multi-channel memory
> controllers at full speed making it impossible to detect some types of errors
>
> Maybe someone knowledgable could comment if this is true.

This one locked up on my machine, but memtest86+ 4.20 detected
12 different addresses with faulty bits in the lower 16GB. I'm applying
for warranty.

With only two modules installed, "make -j8" succeeded many times.

Thanks to everyone who tried to help.

Best regards,
Zoltán Böszörményi

2012-06-13 15:57:39

by Borislav Petkov

Subject: Re: AMD FX CPU bug, not fixed by latest microcode?

On Wed, Jun 13, 2012 at 09:30:03AM +0200, Boszormenyi Zoltan wrote:
> This one locked up on my machine but memtest86+ 4.20 detected 12
> different addresses with faulty bits in the lower 16GB. Applying for
> warranty.

If you have multiple DIMMs, you can also take out the DIMM which
contains the 16GB and rerun memtest to confirm it really is the culprit.

--
Regards/Gruss,
Boris.

2012-06-13 18:26:39

by Böszörményi Zoltán

Subject: Re: AMD FX CPU bug, not fixed by latest microcode?

On 2012-06-13 17:57, Borislav Petkov wrote:
> On Wed, Jun 13, 2012 at 09:30:03AM +0200, Boszormenyi Zoltan wrote:
>> This one locked up on my machine but memtest86+ 4.20 detected 12
>> different addresses with faulty bits in the lower 16GB. Applying for
>> warranty.
> If you have multiple DIMMs, you can also take out the DIMM which
> contains the 16GB and rerun memtest to confirm it really is the culprit.
>

I did exactly that. The remaining two 8GB modules don't have faults
according to memtest86+, and the machine is stable with "make -j8"
in the kernel tree. Neither Thunderbird nor Firefox has crashed for a day;
these are the usual victims when hitting the bad memory address.

2012-06-13 22:06:56

by Borislav Petkov

Subject: Re: AMD FX CPU bug, not fixed by latest microcode?

On Wed, Jun 13, 2012 at 08:26:07PM +0200, Boszormenyi Zoltan wrote:
> I did exactly that, the remaining two 8GB modules don't have faults
> according to memtest86+ and the machine is stable with "make -j8" in
> the kernel tree. Neither thunderbird nor firefox crashed for a day,
> these are the usual victims when hitting the bad memory address.

Cool.

I guess you could try to limit it even further by taking a known-good
8GB module and pairing it with one of the "bad" ones to see whether one
of the "bad" 8GB modules is faulty or both of them are.

Or you could stop wasting time and go buy two new ones :-)

--
Regards/Gruss,
Boris.

2012-06-14 04:23:46

by Böszörményi Zoltán

Subject: Re: AMD FX CPU bug, not fixed by latest microcode?

On 2012-06-14 00:06, Borislav Petkov wrote:
> On Wed, Jun 13, 2012 at 08:26:07PM +0200, Boszormenyi Zoltan wrote:
>> I did exactly that, the remaining two 8GB modules don't have faults
>> according to memtest86+ and the machine is stable with "make -j8" in
>> the kernel tree. Neither thunderbird nor firefox crashed for a day,
>> these are the usual victims when hitting the bad memory address.
> Cool.
>
> I guess you could try to limit it even further by taking a known-good
> 8GB module and pairing it with one of the "bad" ones to see whether one
> of the "bad" 8GB modules is faulty or both of them are.
>
> Or you could stop wasting time and go buy two new ones :-)

Yesterday I took the bad pair back to the shop with a screenshot
showing the memtest86+ results, and they accepted it under warranty.
I will get a new pair of RAM modules at no extra cost.