2017-09-30 02:05:19

by Adam Borowski

Subject: random insta-reboots on AMD Phenom II

Hi!
I'm afraid I see random instant reboots on current -rc, approximately
once per day, only under CPU load. There's nothing on serial/etc -- just
an immediate reboot. 4.13 works perfectly; last kernel I've tried is
v4.14-rc2-165-g770b782f555d. gcc 7.2.0-7 (Debian).

CPU is AMD Phenom II X6 1055T (family 10h).

Sometimes it dies within a few minutes of load, sometimes all is fine for a
couple of days. This randomness makes bisecting not really an option.

Any hints how to debug this?


Meow!
--
⢀⣴⠾⠻⢶⣦⠀ We domesticated dogs 36000 years ago; together we chased
⣾⠁⢰⠒⠀⣿⡁ animals, hung out and licked or scratched our private parts.
⢿⡄⠘⠷⠚⠋⠀ Cats domesticated us 9500 years ago, and immediately we got
⠈⠳⣄⠀⠀⠀⠀ agriculture, towns then cities. -- whitroth on /.


2017-09-30 11:12:00

by Borislav Petkov

Subject: Re: random insta-reboots on AMD Phenom II

On Sat, Sep 30, 2017 at 04:05:16AM +0200, Adam Borowski wrote:
> Any hints how to debug this?

Do

rdmsr -a 0xc0010015

as root and paste it here.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
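
For readers without msr-tools at hand, the per-CPU read requested above can
be approximated through the Linux msr driver, which exposes each CPU's MSRs
as /dev/cpu/N/msr. What follows is a minimal sketch, assuming the msr module
is loaded and the script runs as root. The helper names (decode_msr,
read_msr, read_msr_all) are invented for illustration and are not part of
any tool mentioned in this thread.

```python
import glob
import os
import struct

# MSR address taken from the rdmsr command in the thread.
HWCR_MSR = 0xC0010015


def decode_msr(raw: bytes) -> int:
    """Interpret the 8 bytes returned by the msr device as a little-endian u64."""
    if len(raw) != 8:
        raise ValueError(f"expected 8 bytes from the msr device, got {len(raw)}")
    return struct.unpack("<Q", raw)[0]


def read_msr(cpu: int, msr: int = HWCR_MSR) -> int:
    """Read one MSR on one CPU; the file offset selects the MSR address."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return decode_msr(os.pread(fd, 8, msr))
    finally:
        os.close(fd)


def read_msr_all(msr: int = HWCR_MSR) -> dict:
    """Read the same MSR on every CPU that has an msr device node."""
    cpus = sorted(int(path.split("/")[3]) for path in glob.glob("/dev/cpu/[0-9]*/msr"))
    return {cpu: read_msr(cpu, msr) for cpu in cpus}
```

Printing each value from read_msr_all() with format(value, "x") should give
one hexadecimal line per core, similar in shape to what rdmsr -a prints.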

2017-09-30 11:29:06

by Adam Borowski

Subject: Re: random insta-reboots on AMD Phenom II

On Sat, Sep 30, 2017 at 01:11:37PM +0200, Borislav Petkov wrote:
> On Sat, Sep 30, 2017 at 04:05:16AM +0200, Adam Borowski wrote:
> > Any hints how to debug this?
>
> Do
> rdmsr -a 0xc0010015
> as root and paste it here.

1000010
1000010
1000010
1000010
1000010
1000010

on both 4.13.4 and 4.14-rc2+.


Meow!
--
⢀⣴⠾⠻⢶⣦⠀ We domesticated dogs 36000 years ago; together we chased
⣾⠁⢰⠒⠀⣿⡁ animals, hung out and licked or scratched our private parts.
⢿⡄⠘⠷⠚⠋⠀ Cats domesticated us 9500 years ago, and immediately we got
⠈⠳⣄⠀⠀⠀⠀ agriculture, towns then cities. -- whitroth on /.

2017-09-30 11:53:19

by Borislav Petkov

Subject: Re: random insta-reboots on AMD Phenom II

On Sat, Sep 30, 2017 at 01:29:03PM +0200, Adam Borowski wrote:
> On Sat, Sep 30, 2017 at 01:11:37PM +0200, Borislav Petkov wrote:
> > On Sat, Sep 30, 2017 at 04:05:16AM +0200, Adam Borowski wrote:
> > > Any hints how to debug this?
> >
> > Do
> > rdmsr -a 0xc0010015
> > as root and paste it here.
>
> 1000010
> 1000010
> 1000010
> 1000010
> 1000010
> 1000010
>
> on both 4.13.4 and 4.14-rc2+.

Boot into -rc2+ and do as root:

# wrmsr -a 0xc0010015 0x1000018

If the issue gets fixed then Mr. Luto better revert the new lazy TLB
flushing fun'n'games for 4.14 before it is too late and that kernel
releases b0rked.

Thx.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-09-30 12:47:14

by Markus Trippelsdorf

Subject: Re: random insta-reboots on AMD Phenom II

On 2017.09.30 at 13:53 +0200, Borislav Petkov wrote:
> On Sat, Sep 30, 2017 at 01:29:03PM +0200, Adam Borowski wrote:
> > On Sat, Sep 30, 2017 at 01:11:37PM +0200, Borislav Petkov wrote:
> > > On Sat, Sep 30, 2017 at 04:05:16AM +0200, Adam Borowski wrote:
> > > > Any hints how to debug this?
> > >
> > > Do
> > > rdmsr -a 0xc0010015
> > > as root and paste it here.
> >
> > 1000010
> > 1000010
> > 1000010
> > 1000010
> > 1000010
> > 1000010
> >
> > on both 4.13.4 and 4.14-rc2+.
>
> Boot into -rc2+ and do as root:
>
> # wrmsr -a 0xc0010015 0x1000018
>
> If the issue gets fixed then Mr. Luto better revert the new lazy TLB
> flushing fun'n'games for 4.14 before it is too late and that kernel
> releases b0rked.

The issue does get fixed by setting TlbCacheDis to 1. I have been
running it for the last few weeks without any problems.
Performance is not affected at all. So it might be easier to just set
the bit for older AMD processors as a boot quirk.
Changing the TLB code so late might not be a good idea...

--
Markus
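
The value read back from the MSR (1000010) and the value written by the
wrmsr command (0x1000018) differ in a single bit, bit 3, which is the bit
referred to above as TlbCacheDis. A minimal sketch of that arithmetic
follows. The bit position is inferred from the two values in the thread,
and the constant and helper names are chosen here for illustration only.

```python
# Bit position inferred from 0x1000010 -> 0x1000018; the name follows the thread.
TLB_CACHE_DIS_BIT = 3


def set_bit(value: int, bit: int) -> int:
    """Return value with the given bit set, leaving all other bits untouched."""
    return value | (1 << bit)


def bit_is_set(value: int, bit: int) -> bool:
    """Report whether the given bit is set in value."""
    return (value >> bit) & 1 == 1


before = 0x1000010  # value reported by rdmsr on all six cores
after = set_bit(before, TLB_CACHE_DIS_BIT)

print(f"before: {before:#x}  TlbCacheDis set: {bit_is_set(before, TLB_CACHE_DIS_BIT)}")
print(f"after:  {after:#x}  TlbCacheDis set: {bit_is_set(after, TLB_CACHE_DIS_BIT)}")
```

Computing the new value from the old one, instead of hard-coding it, keeps
any other bits the firmware may have set on a given machine.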

2017-09-30 14:20:50

by Brian Gerst

Subject: Re: random insta-reboots on AMD Phenom II

On Sat, Sep 30, 2017 at 8:47 AM, Markus Trippelsdorf
<[email protected]> wrote:
> On 2017.09.30 at 13:53 +0200, Borislav Petkov wrote:
>> On Sat, Sep 30, 2017 at 01:29:03PM +0200, Adam Borowski wrote:
>> > On Sat, Sep 30, 2017 at 01:11:37PM +0200, Borislav Petkov wrote:
>> > > On Sat, Sep 30, 2017 at 04:05:16AM +0200, Adam Borowski wrote:
>> > > > Any hints how to debug this?
>> > >
>> > > Do
>> > > rdmsr -a 0xc0010015
>> > > as root and paste it here.
>> >
>> > 1000010
>> > 1000010
>> > 1000010
>> > 1000010
>> > 1000010
>> > 1000010
>> >
>> > on both 4.13.4 and 4.14-rc2+.
>>
>> Boot into -rc2+ and do as root:
>>
>> # wrmsr -a 0xc0010015 0x1000018
>>
>> If the issue gets fixed then Mr. Luto better revert the new lazy TLB
>> flushing fun'n'games for 4.14 before it is too late and that kernel
>> releases b0rked.
>
> The issue does get fixed by setting TlbCacheDis to 1. I have been
> running it for the last few weeks without any problems.
> Performance is not affected at all. So it might be easier to just set
> the bit for older AMD processors as a boot quirk.
> Changing the TLB code so late might not be a good idea...

Looking at the AMD K10 revision guide
(http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf), erratum #298,
which this fixes, should only apply to revisions DR-BA and DR-B2, which
include the original Phenom, but not Phenom II. The Phenom II X6 is
revision PH-E0, which does not have this erratum.

--
Brian Gerst

2017-09-30 15:11:55

by Andy Lutomirski

Subject: Re: random insta-reboots on AMD Phenom II



> On Sep 30, 2017, at 4:53 AM, Borislav Petkov <[email protected]> wrote:
>
>> On Sat, Sep 30, 2017 at 01:29:03PM +0200, Adam Borowski wrote:
>>> On Sat, Sep 30, 2017 at 01:11:37PM +0200, Borislav Petkov wrote:
>>>> On Sat, Sep 30, 2017 at 04:05:16AM +0200, Adam Borowski wrote:
>>>> Any hints how to debug this?
>>>
>>> Do
>>> rdmsr -a 0xc0010015
>>> as root and paste it here.
>>
>> 1000010
>> 1000010
>> 1000010
>> 1000010
>> 1000010
>> 1000010
>>
>> on both 4.13.4 and 4.14-rc2+.
>
> Boot into -rc2+ and do as root:
>
> # wrmsr -a 0xc0010015 0x1000018
>
> If the issue gets fixed then Mr. Luto better revert the new lazy TLB
> flushing fun'n'games for 4.14 before it is too late and that kernel
> releases b0rked.

Yeah, working on it. It's not a straightforward revert.

>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.

2017-09-30 15:21:08

by Markus Trippelsdorf

Subject: Re: random insta-reboots on AMD Phenom II

On 2017.09.30 at 10:20 -0400, Brian Gerst wrote:
> On Sat, Sep 30, 2017 at 8:47 AM, Markus Trippelsdorf
> <[email protected]> wrote:
> > On 2017.09.30 at 13:53 +0200, Borislav Petkov wrote:
> >> On Sat, Sep 30, 2017 at 01:29:03PM +0200, Adam Borowski wrote:
> >> > On Sat, Sep 30, 2017 at 01:11:37PM +0200, Borislav Petkov wrote:
> >> > > On Sat, Sep 30, 2017 at 04:05:16AM +0200, Adam Borowski wrote:
> >> > > > Any hints how to debug this?
> >> > >
> >> > > Do
> >> > > rdmsr -a 0xc0010015
> >> > > as root and paste it here.
> >> >
> >> > 1000010
> >> > 1000010
> >> > 1000010
> >> > 1000010
> >> > 1000010
> >> > 1000010
> >> >
> >> > on both 4.13.4 and 4.14-rc2+.
> >>
> >> Boot into -rc2+ and do as root:
> >>
> >> # wrmsr -a 0xc0010015 0x1000018
> >>
> >> If the issue gets fixed then Mr. Luto better revert the new lazy TLB
> >> flushing fun'n'games for 4.14 before it is too late and that kernel
> >> releases b0rked.
> >
> > The issue does get fixed by setting TlbCacheDis to 1. I have been
> > running it for the last few weeks without any problems.
> > Performance is not affected at all. So it might be easier to just set
> > the bit for older AMD processors as a boot quirk.
> > Changing the TLB code so late might not be a good idea...
>
> Looking at the AMD K10 revision guide
> (http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf), erratum #298,
> which this fixes, should only apply to revisions DR-BA and DR-B2, which
> include the original Phenom, but not Phenom II. The Phenom II X6 is
> revision PH-E0, which does not have this erratum.

It has nothing to do with erratum #298. The new lazy TLB code causes
MCEs, because the page tables may now contain garbage.
See the long "Current mainline git (24e700e291d52bd2) hangs when
building e.g. perf" LKML thread.
--
Markus

2017-09-30 15:48:44

by Borislav Petkov

Subject: Re: random insta-reboots on AMD Phenom II

On Sat, Sep 30, 2017 at 08:11:51AM -0700, Andy Lutomirski wrote:
> Yeah, working on it. It's not a straightforward revert.

Thanks. At least you have testers :-)

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-09-30 15:51:08

by Borislav Petkov

Subject: Re: random insta-reboots on AMD Phenom II

On Sat, Sep 30, 2017 at 02:47:11PM +0200, Markus Trippelsdorf wrote:
> Changing the TLB code so late might not be a good idea...

The new lazy code is too risky to keep as we don't know what else will
break. The conservative and thus safe thing to do is to revert to the
old behavior for old machines.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-09-30 16:04:38

by Andy Lutomirski

Subject: Re: random insta-reboots on AMD Phenom II

On Sat, Sep 30, 2017 at 8:50 AM, Borislav Petkov <[email protected]> wrote:
> On Sat, Sep 30, 2017 at 02:47:11PM +0200, Markus Trippelsdorf wrote:
>> Changing the TLB code so late might not be a good idea...
>
> The new lazy code is too risky to keep as we don't know what else will
> break. The conservative and thus safe thing to do is to revert to the
> old behavior for old machines.

Agreed.

The only problem is that the code has changed so much on top of the
problematic commit that just reverting it won't work.

>
> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.