Subject: Issues with AMD microcode updates

Jacob, Andreas,

I take care of the amd64 microcode update support for Debian, and I'm
receiving user reports of lockup issues with the AMD microcode driver in
several kernels. This is about the runtime update interface,
/sys/devices/system/cpu/*/microcode/reload and
/sys/devices/system/cpu/microcode/reload.

Basically, the issue is that the process that tries to write "1" to the
reload node gets stuck in "D" state on several kernel versions.
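
For anyone trying to confirm the symptom, a minimal sketch of how stuck tasks could be listed. It assumes a procps-style ps; the function names are made up for this example and are not part of any Debian package.

```shell
# Keep the header line plus any task whose state starts with "D"
# (uninterruptible sleep). Reads "PID STAT WCHAN COMMAND" rows on stdin.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# List D-state tasks together with the kernel function they are waiting in.
list_d_state() {
    ps -eo pid,stat,wchan:32,comm | filter_d_state
}
```

Running list_d_state while the writer is hung should show the stuck shell (or echo) and its wait channel.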

I started by blacklisting several older kernels (e.g. I got a report of
2.6.38 locking up), but recently I got a report of a lockup with kernel
3.5.1. Blacklisting everything before 3.10 is not exactly kosher, not when
I would have to blindly trust 3.0, 3.2 and 3.4 to not have whatever issue is
causing the lockups.

IMHO that's the point where it becomes interesting to actually track down
the bug even if it apparently doesn't exist anymore on the more recent
kernels, and ensure that the stable/long-term kernels have the fix. That
would also help distros blacklist microcode update on the broken kernels.

Unfortunately, I don't own, or have access to, any boxes with an AMD
processor (let alone one with an AMD processor in need of a microcode
update) to bisect the problem.

I'd appreciate if AMD (or anyone with an AMD processor, really) could help
me track this issue down.

Debian bug reports:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=717185
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=723081

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh


2013-09-19 16:44:16

by Borislav Petkov

Subject: Re: Issues with AMD microcode updates

On Thu, Sep 19, 2013 at 11:58:34AM -0300, Henrique de Moraes Holschuh wrote:
> Jacob, Andreas,
>
> I take care of the amd64 microcode update support for Debian, and I'm
> receiving user reports of lockup issues with the AMD microcode driver in
> several kernels. This is about the runtime update interface,
> /sys/devices/system/cpu/*/microcode/reload and
> /sys/devices/system/cpu/microcode/reload.
>
> Basically, the issue is that the process that tries to write "1" to the
> reload node gets stuck in "D" state on several kernel versions.
>
> I started by blacklisting several older kernels (e.g. I got a report of
> 2.6.38 locking up), but recently I got a report of a lockup with kernel
> 3.5.1. Blacklisting everything before 3.10 is not exactly kosher, not when
> I would have to blindly trust 3.0, 3.2 and 3.4 to not have whatever issue is
> causing the lockups.
>
> IMHO that's the point where it becomes interesting to actually track down
> the bug even if it apparently doesn't exist anymore on the more recent
> kernels, and ensure that the stable/long-term kernels have the fix. That
> would also help distros blacklist microcode update on the broken kernels.
>
> Unfortunately, I don't own, or have access to, any boxes with an AMD
> processor (let alone one with an AMD processor in need of a microcode
> update) to bisect the problem.
>
> I'd appreciate if AMD (or anyone with an AMD processor, really) could help
> me track this issue down.
>
> Debian bug reports:
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=717185
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=723081

Well, both Andreas and Jacob don't work for AMD anymore. I could try to
help with this but it'll be slow as I'm pretty busy with other stuff.

Anyway, I'd suggest we look only on the long term kernels since they're
the only ones which can get updates/fixes anyway.

Now, how do I reproduce this? Writing 1 to .../reload on latest kernel
works here. So I'd need a reproducer. Alternatively, I'd need a sysrq-l
and sysrq-w from those systems with hung processes.
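
For users without a console keyboard at hand, the same dumps can be requested through the sysrq trigger file. A minimal sketch, to be run as root on the hung system; the function name is hypothetical, and the optional argument exists only so the function can be pointed at a scratch file.

```shell
# Request the two dumps through the sysrq trigger file:
#   l = backtrace of all active CPUs
#   w = tasks in uninterruptible (blocked) state
# The output lands in the kernel log, not on stdout.
request_sysrq_dumps() {
    trigger=${1:-/proc/sysrq-trigger}
    for key in l w; do
        printf '%s' "$key" > "$trigger" || return 1
    done
}
```

Usage would be along the lines of: request_sysrq_dumps && dmesg > sysrq-dumps.txt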

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Subject: Re: Issues with AMD microcode updates

On Thu, 19 Sep 2013, Borislav Petkov wrote:
> On Thu, Sep 19, 2013 at 11:58:34AM -0300, Henrique de Moraes Holschuh wrote:
> > I take care of the amd64 microcode update support for Debian, and I'm
> > receiving user reports of lockup issues with the AMD microcode driver in
> > several kernels. This is about the runtime update interface,
> > /sys/devices/system/cpu/*/microcode/reload and
> > /sys/devices/system/cpu/microcode/reload.
> >
> > Basically, the issue is that the process that tries to write "1" to the
> > reload node gets stuck in "D" state on several kernel versions.
> >
> > I started by blacklisting several older kernels (e.g. I got a report of
> > 2.6.38 locking up), but recently I got a report of a lockup with kernel
> > 3.5.1. Blacklisting everything before 3.10 is not exactly kosher, not when

The kernels reported to be broken are 2.6.38 and 3.5.2; I got the last one
wrong.

> > I would have to blindly trust 3.0, 3.2 and 3.4 to not have whatever issue is
> > causing the lockups.

...

> > Debian bug reports:
> > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=717185
> > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=723081
>
> Well, both Andreas and Jacob don't work for AMD anymore. I could try to
> help with this but it'll be slow as I'm pretty busy with other stuff.

Well, if someone can give me suitable ssh and full root access to a small
AMD box anywhere in the world [with a suitably outdated BIOS/EFI that
doesn't have the latest microcode for the processor] so that I can bisect
this, I'm game. Preferably, a box with a throw-away install of the latest
Debian stable, which might help track down the issue faster since it is what
I am most comfortable with.

> Anyway, I'd suggest we look only on the long term kernels since they're
> the only ones which can get updates/fixes anyway.

If I could get a confirmation that "it's good on latest 3.0, 3.2, 3.4, 3.10
and mainline", I'd at least be able to blacklist everything else. But I'd
need at least a control test of 3.5.2 (which should fail) to make sure it is
easy to reproduce the bug on the test box...

I'm almost sure that the latest 3.2 and 3.10+ work just fine, otherwise I'd
have noticed it really fast...
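
To make the blacklisting idea concrete, a minimal sketch of a check against only the two versions reported so far. The function name is hypothetical, and treating 2.6.38.y point releases like 2.6.38 is an assumption of this example, not something the reports establish.

```shell
# Return 0 if the given "uname -r" string matches a kernel that was
# reported to hang (2.6.38 or 3.5.2), 1 otherwise.
kernel_reported_bad() {
    upstream=${1%%-*}            # "3.5.2-1-amd64" -> "3.5.2"
    case $upstream in
        2.6.38|2.6.38.*|3.5.2) return 0 ;;   # 2.6.38.* is an assumption
        *) return 1 ;;
    esac
}
```

A caller could skip the update with something like: kernel_reported_bad "$(uname -r)" && exit 0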

> Now, how do I reproduce this? Writing 1 to .../reload on latest kernel
> works here. So I'd need a reproducer. Alternatively, I'd need a sysrq-l
> and sysrq-w from those systems with hung processes.

I can request help on debian-user or debian-devel to get someone with an AMD
box to help with bisection, but it is usually best if we don't ask general
users to bisect kernels (due to non-zero risk of data corruption if the
bisect hits one of the problem spots that often show up during the
development window).

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2013-09-19 18:47:03

by Borislav Petkov

Subject: Re: Issues with AMD microcode updates

On Thu, Sep 19, 2013 at 03:15:54PM -0300, Henrique de Moraes Holschuh wrote:
> I can request help on debian-user or debian-devel to get someone with
> an AMD box to help with bisection, but it is usually best if we don't
> ask general users to bisect kernels (due to non-zero risk of data
> corruption if the bisect hits one of the problem spots that often show
> up during the development window).

I have a couple of AMD boxes so I can bisect - I just need a reproducer
how to trigger.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Subject: Re: Issues with AMD microcode updates

On Thu, 19 Sep 2013, Borislav Petkov wrote:
> On Thu, Sep 19, 2013 at 03:15:54PM -0300, Henrique de Moraes Holschuh wrote:
> > I can request help on debian-user or debian-devel to get someone with
> > an AMD box to help with bisection, but it is usually best if we don't
> > ask general users to bisect kernels (due to non-zero risk of data
> > corruption if the bisect hits one of the problem spots that often show
> > up during the development window).
>
> I have a couple of AMD boxes so I can bisect - I just need a reproducer
> how to trigger.

Sure. There are two possibilities:


Possibility one (the most likely): hang on first microcode update:

1. Have the latest AMD microcode update (from linux-firmware) installed to
the proper place under /lib/firmware, but NOT yet uploaded to the kernel.

2. Run this:

find /sys/devices/system/cpu -noleaf -type f -path '/sys/devices/system/cpu/cpu*/microcode/reload' | while read i ; do echo -n 1 >"$i" || true ; done

If the kernel is buggy, it should hang. If it doesn't hang for a
supposed-bad kernel (2.6.38 or 3.5.2), please check "possibility two" below.
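
To tell whether a given run actually changed anything, the patch level each core is running can be compared before and after. A minimal sketch, assuming the microcode driver exposes a "version" file next to the per-core "reload" node; the function name is hypothetical, and the optional argument only allows pointing it at a copy of the sysfs tree.

```shell
# Print "cpuN <patch level>" for every core that exposes a microcode
# version file. Cores without the file are silently skipped.
show_ucode_versions() {
    base=${1:-/sys/devices/system/cpu}
    for f in "$base"/cpu[0-9]*/microcode/version; do
        [ -r "$f" ] || continue
        cpu=${f#"$base"/}        # strip the leading directory
        cpu=${cpu%%/*}           # keep only "cpuN"
        printf '%s %s\n' "$cpu" "$(cat "$f")"
    done
}
```

Saving the output before and after the find command above and diffing the two files shows which cores were updated.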



Possibility two: hang on second microcode update in a row:

1. Install a previous version of the AMD microcode update (which must still
be newer than what is in the processor) to /lib/firmware/...

2. Run the command (2) above. It should not hang, and it should update the
microcode in the processor.

3. Update /lib/firmware with the latest microcode from AMD (i.e. so that the
processor will have its microcode updated TWICE).

4. Run the command (2) above. It should hang if the kernel is buggy.
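
When bisecting, it helps if the test itself reports the hang instead of blocking the terminal. A minimal sketch under that assumption: the write is done in a background subshell and a watchdog reports if it has not finished in time. The function name and the 30 second default are arbitrary choices for this example. Note that a writer stuck in "D" state cannot be killed, so the sketch only reports it.

```shell
# Write "1" to a reload node in the background and watch it.
# Returns the writer's exit status, or 2 if it is still running
# after the time limit (in which case its ps state is printed).
reload_with_watchdog() {
    node=$1
    limit=${2:-30}
    ( printf 1 > "$node" ) &     # same effect as: echo -n 1 > "$node"
    pid=$!
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            state=$(ps -o stat= -p "$pid" 2>/dev/null)
            echo "writer $pid still running after ${limit}s (state: ${state:-unknown})" >&2
            return 2
        fi
        sleep 1
        waited=$((waited + 1))
    done
    wait "$pid"
}
```

For the per-core interface it would be called once per node, for example: reload_with_watchdog /sys/devices/system/cpu/cpu0/microcode/reload 60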





I do not have any reports of kernels 3.6 and later causing issues. If they
do, the "reproducer" should be this, instead:

echo -n 1 > /sys/devices/system/cpu/microcode/reload

You can get earlier versions of the AMD microcode to test "possibility two"
from the Debian package historical archive:

http://snapshot.debian.org/archive/debian/20120710T032858Z/pool/non-free/a/amd64-microcode/amd64-microcode_1.20120117.orig.tar.bz2
http://snapshot.debian.org/archive/debian/20120915T033250Z/pool/non-free/a/amd64-microcode/amd64-microcode_1.20120910.orig.tar.bz2

The latest version of the microcode is available in linux-firmware.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2013-09-24 23:35:17

by Hurwitz, Sherry

Subject: Re: Issues with AMD microcode updates

On 09/19/2013 11:44 AM, Borislav Petkov wrote:
> On Thu, Sep 19, 2013 at 11:58:34AM -0300, Henrique de Moraes Holschuh wrote:
>> Jacob, Andreas,
>>
>> I take care of the amd64 microcode update support for Debian, and I'm
>> receiving user reports of lockup issues with the AMD microcode driver in
>> several kernels. This is about the runtime update interface,
>> /sys/devices/system/cpu/*/microcode/reload and
>> /sys/devices/system/cpu/microcode/reload.
>>
>> Basically, the issue is that the process that tries to write "1" to the
>> reload node gets stuck in "D" state on several kernel versions.
>>
>> I started by blacklisting several older kernels (e.g. I got a report of
>> 2.6.38 locking up), but recently I got a report of a lockup with kernel
>> 3.5.1. Blacklisting everything before 3.10 is not exactly kosher, not when
>> I would have to blindly trust 3.0, 3.2 and 3.4 to not have whatever issue is
>> causing the lockups.
>>
>> IMHO that's the point where it becomes interesting to actually track down
>> the bug even if it apparently doesn't exist anymore on the more recent
>> kernels, and ensure that the stable/long-term kernels have the fix. That
>> would also help distros blacklist microcode update on the broken kernels.
>>
>> Unfortunately, I don't own, or have access to, any boxes with an AMD
>> processor (let alone one with an AMD processor in need of a microcode
>> update) to bisect the problem.
>>
>> I'd appreciate if AMD (or anyone with an AMD processor, really) could help
>> me track this issue down.
>>
>> Debian bug reports:
>> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=717185
>> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=723081
> Well, both Andreas and Jacob don't work for AMD anymore. I could try to
> help with this but it'll be slow as I'm pretty busy with other stuff.
>
> Anyway, I'd suggest we look only on the long term kernels since they're
> the only ones which can get updates/fixes anyway.
>
> Now, how do I reproduce this? Writing 1 to .../reload on latest kernel
> works here. So I'd need a reproducer. Alternatively, I'd need a sysrq-l
> and sysrq-w from those systems with hung processes.
>
> Thanks.
>
You can direct AMD microcode issues to me now.
We are setting up some systems in the lab and trying to duplicate
the problem now.

Thanks.

Subject: Re: Issues with AMD microcode updates

On Tue, 24 Sep 2013, Sherry Hurwitz wrote:
> You can direct AMD microcode issues to me now.
> We are setting up some systems in the lab and trying to duplicate
> the problem now.

Thank you!

If you're going to be taking care of AMD microcode update issues, maybe it
would be a good idea to add your name to the MAINTAINERS file for the "AMD
MICROCODE UPDATE SUPPORT", and remove the (dead for a while now)
[email protected] mailing list?

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2013-09-26 17:36:49

by Hurwitz, Sherry

Subject: Re: Issues with AMD microcode updates

On 09/25/2013 08:49 AM, Henrique de Moraes Holschuh wrote:
> On Tue, 24 Sep 2013, Sherry Hurwitz wrote:
>> You can direct AMD microcode issues to me now.
>> We are setting up some systems in the lab and trying to duplicate
>> the problem now.
> Thank you!
>
> If you're going to be taking care of AMD microcode update issues, maybe it
> would be a good idea to add your name to the MAINTAINERS file for the "AMD
> MICROCODE UPDATE SUPPORT", and remove the (dead for a while now)
> [email protected] mailing list?
>
We have failed to reproduce a hang while loading microcode.
We have tested kernel and AMD family combinations under both
normal and error conditions, so the error paths were taken. Obviously
there are factors we are missing that the users are hitting.
Any suggestions on how we improve the test matrix would be
helpful. We will continue the investigation but any insights are appreciated.

NOTE: kernels before 3.0 only load one size (2k) of microcode patch and
therefore do not support microcode loading on family 14h, 15h, and 16h.
Also, in a test request on another thread you suggested that someone with
family 15h revC0 load microcode twice, first with an earlier patch and then
the latest, but there has only been 1 microcode patch level published for revB2
so that test won't work.
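
The note above could be turned into a pre-check along these lines. This is a minimal sketch, not driver logic: the function name is hypothetical, the family is given in decimal as /proc/cpuinfo prints it (14h = 20, 15h = 21, 16h = 22), and it encodes nothing beyond the statement that pre-3.0 kernels cannot load patches for those families.

```shell
# Return 1 if the note above rules out loading (kernel older than 3.0
# on family 14h/15h/16h), 0 otherwise.
# $1 = kernel version such as "2.6.38", $2 = cpu family in decimal.
ucode_load_supported() {
    major=${1%%.*}
    case $2 in
        20|21|22) [ "$major" -ge 3 ] ;;
        *) return 0 ;;           # not covered by the note
    esac
}
```

For example, ucode_load_supported 2.6.38 21 fails, which matches the "2.6.38 / fam15h revC0 / load failed" row in the matrix.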

Test Matrix:

kernel    cpu family     results       conditions
---------------------------------------------------------------------------------
2.6.38    fam10h         load passed   normal
2.6.38    fam15h revC0   load failed   2.6.38 can not handle 4k patches
3.5.2     fam10h         load passed   normal
3.5.2     fam15h revB2   load passed   loaded 637 then second load 63d
3.5.2     fam15h revC0   load passed   normal
3.5.2     fam15h revC0   load failed   used a corrupted bin file
3.7       fam15h revC0   load passed   loaded 81c then second load 822
3.10      fam15h revC0   load passed   loaded 81c then second load 822
3.11rc7   fam15h revB2   load passed   BIOS loaded 637; test loaded 63d; sysfs info can be misleading

Subject: Re: Issues with AMD microcode updates

On Thu, 26 Sep 2013, Sherry Hurwitz wrote:
> We have failed to reproduce a hang while loading microcode.

I got an offer from a Debian user to test it over the weekend, let's hope
he will have more luck(?) at hitting the issue. If he does, it should give
us sysrq+t dumps of the hung system.

> We have tested kernel and AMD family combinations under both
> normal and error conditions, so the error paths were taken. Obviously
> there are factors we are missing that the users are hitting.

Yeah, and it is not likely to be a kernel patch, as the users hit the issue
using non-distro kernels :-(

Maybe it is on the firmware-loader side, but one user did wait 1 hour for
the thing to get unstuck, and that would have taken care of any possible
firmware-loader timeouts.

> Any suggestions on how we improve the test matrix would be
> helpful. We will continue the investigation but any insights are appreciated.
>
> NOTE: kernels before 3.0 only load one size (2k) of microcode patch and
> therefore do not support microcode loading on family 14h, 15h, and 16h.
> Also, in a test request on another thread you suggested that someone with
> family 15h revC0 load microcode twice, first with an earlier patch and then
> the latest, but there has only been 1 microcode patch level published for revB2
> so that test won't work.

Well, it is the only thing I could think of, other than some nasty race
condition...

> kernel    cpu family     results       conditions
> ---------------------------------------------------------------------------------
> 2.6.38    fam10h         load passed   normal
> 2.6.38    fam15h revC0   load failed   2.6.38 can not handle 4k patches
> 3.5.2     fam10h         load passed   normal
> 3.5.2     fam15h revB2   load passed   loaded 637 then second load 63d
> 3.5.2     fam15h revC0   load passed   normal
> 3.5.2     fam15h revC0   load failed   used a corrupted bin file

I just looked, and the 2.6.38 hang happened on i686 with an unidentified
3-core AMD processor, and the 3.5.2 hang on x86-64 PREEMPT, on a fam15h model 2
stepping 0, 32-core AMD processor (Linux 3.5.2 (SMP w/32 CPU cores;
PREEMPT)). No patterns there.
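
Since one of the reports did not identify the processor, a one-line summary that reporters could paste might help. A minimal sketch that reads the usual x86 /proc/cpuinfo fields; the function name is hypothetical, and the family is printed in decimal (21 is family 15h).

```shell
# Summarise vendor, family, model, stepping and logical CPU count
# from a cpuinfo-style file (defaults to /proc/cpuinfo).
cpu_summary() {
    awk -F '[ \t]*:[ \t]*' '
        $1 == "processor"                { n++ }
        $1 == "vendor_id"  && v == ""    { v = $2 }
        $1 == "cpu family" && f == ""    { f = $2 }
        $1 == "model"      && m == ""    { m = $2 }
        $1 == "stepping"   && s == ""    { s = $2 }
        END { printf "%s family %s model %s stepping %s, %d logical CPUs\n", v, f, m, s, n }
    ' "${1:-/proc/cpuinfo}"
}
```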

BTW, the userspace script that users reported to have hung is this:

grep -q "^vendor_id[[:blank:]]*:[[:blank:]]*.*AuthenticAMD" /proc/cpuinfo && {
    if modprobe -q --first-time microcode ; then
        echo "Updating microcode on all online processors..." >&2
    else
        # we have to trigger the microcode update manually
        if [ -e /sys/devices/system/cpu/microcode/reload ] ; then
            echo "Updating microcode on all online processors..." >&2
            echo 1 > /sys/devices/system/cpu/microcode/reload || {
                echo "Kernel reported failure while updating microcode!" >&2
            }
        else
            # Try all online processors, broken kernels need this,
            # fixed kernels will accept it only on the BSP and update
            # all processors anyway, and -EINVAL all others... but we
            # don't know which one is the BSP, so we try all of them
            # and hide errors, the kernel will log any real problem.
            echo "Using per-core interface to update microcode on online processors..." >&2
            find /sys/devices/system/cpu -noleaf -type f -path '/sys/devices/system/cpu/cpu*/microcode/reload' | \
                while read i ; do echo -n 1 2>/dev/null >"$i" || true ; done
        fi
    fi
}


This was with the microcode driver already loaded (so that modprobe line fails).

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh