2021-08-03 14:53:17

by Jean Delvare

[permalink] [raw]
Subject: Faulty commit "watchdog: iTCO_wdt: Account for rebooting on second timeout"

Hi all,

Commit cb011044e34c ("watchdog: iTCO_wdt: Account for rebooting on
second timeout") causes a regression on several systems. Symptoms are:
system reboots automatically after a short period of time if watchdog
is enabled (by systemd for example). This has been reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=213809

Unfortunately this commit was backported to all stable kernel branches
(4.14, 4.19, 5.4, 5.10, 5.12 and 5.13). I'm not sure why that is the
case, BTW, as there is no Fixes tag and no Cc to stable@vger either.
And the fix is not trivial, has apparently not seen enough testing,
and addresses a problem that has a known and simple workaround. IMHO it
should never have been accepted as a stable patch in the first place.
Especially when the previous attempt to fix this issue already ended
with a regression and a revert.

Anyway... After a glance at the patch, I see what looks like a nice
thinko:

+ if (p->smi_res &&
+ (SMI_EN(p) & (TCO_EN | GBL_SMI_EN)) != (TCO_EN | GBL_SMI_EN))

The author most certainly meant inl(SMI_EN(p)) (the register's value)
and not SMI_EN(p) (the register's address).

--
Jean Delvare
SUSE L3 Support


2021-08-03 15:01:58

by Jan Kiszka

[permalink] [raw]
Subject: Re: Faulty commit "watchdog: iTCO_wdt: Account for rebooting on second timeout"

On 03.08.21 16:51, Jean Delvare wrote:
> Hi all,
>
> Commit cb011044e34c ("watchdog: iTCO_wdt: Account for rebooting on
> second timeout") causes a regression on several systems. Symptoms are:
> system reboots automatically after a short period of time if watchdog
> is enabled (by systemd for example). This has been reported in bugzilla:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=213809
>
> Unfortunately this commit was backported to all stable kernel branches
> (4.14, 4.19, 5.4, 5.10, 5.12 and 5.13). I'm not sure why that is the
> case, BTW, as there is no Fixes tag and no Cc to stable@vger either.
> And the fix is not trivial, has apparently not seen enough testing,
> and addresses a problem that has a known and simple workaround. IMHO it
> should never have been accepted as a stable patch in the first place.
> Especially when the previous attempt to fix this issue already ended
> with a regression and a revert.
>
> Anyway... After a glance at the patch, I see what looks like a nice
> thinko:
>
> + if (p->smi_res &&
> + (SMI_EN(p) & (TCO_EN | GBL_SMI_EN)) != (TCO_EN | GBL_SMI_EN))
>
> The author most certainly meant inl(SMI_EN(p)) (the register's value)
> and not SMI_EN(p) (the register's address).
>

https://lkml.org/lkml/2021/7/26/349

Jan

--
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux

2021-08-03 15:04:46

by Jan Kiszka

[permalink] [raw]
Subject: Re: Faulty commit "watchdog: iTCO_wdt: Account for rebooting on second timeout"

On 03.08.21 16:59, Jan Kiszka wrote:
> On 03.08.21 16:51, Jean Delvare wrote:
>> Hi all,
>>
>> Commit cb011044e34c ("watchdog: iTCO_wdt: Account for rebooting on
>> second timeout") causes a regression on several systems. Symptoms are:
>> system reboots automatically after a short period of time if watchdog
>> is enabled (by systemd for example). This has been reported in bugzilla:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=213809
>>
>> Unfortunately this commit was backported to all stable kernel branches
>> (4.14, 4.19, 5.4, 5.10, 5.12 and 5.13). I'm not sure why that is the
>> case, BTW, as there is no Fixes tag and no Cc to stable@vger either.
>> And the fix is not trivial, has apparently not seen enough testing,
>> and addresses a problem that has a known and simple workaround. IMHO it
>> should never have been accepted as a stable patch in the first place.
>> Especially when the previous attempt to fix this issue already ended
>> with a regression and a revert.
>>
>> Anyway... After a glance at the patch, I see what looks like a nice
>> thinko:
>>
>> + if (p->smi_res &&
>> + (SMI_EN(p) & (TCO_EN | GBL_SMI_EN)) != (TCO_EN | GBL_SMI_EN))
>>
>> The author most certainly meant inl(SMI_EN(p)) (the register's value)
>> and not SMI_EN(p) (the register's address).
>>
>
> https://lkml.org/lkml/2021/7/26/349
>

That's for the fix (in line with your analysis).

I was also wondering if backporting that quickly was needed. Didn't
propose it, though.

Jan

--
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux

2021-08-03 15:28:22

by Guenter Roeck

[permalink] [raw]
Subject: Re: Faulty commit "watchdog: iTCO_wdt: Account for rebooting on second timeout"

On 8/3/21 8:01 AM, Jan Kiszka wrote:
> On 03.08.21 16:59, Jan Kiszka wrote:
>> On 03.08.21 16:51, Jean Delvare wrote:
>>> Hi all,
>>>
>>> Commit cb011044e34c ("watchdog: iTCO_wdt: Account for rebooting on
>>> second timeout") causes a regression on several systems. Symptoms are:
>>> system reboots automatically after a short period of time if watchdog
>>> is enabled (by systemd for example). This has been reported in bugzilla:
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=213809
>>>
>>> Unfortunately this commit was backported to all stable kernel branches
>>> (4.14, 4.19, 5.4, 5.10, 5.12 and 5.13). I'm not sure why that is the
>>> case, BTW, as there is no Fixes tag and no Cc to stable@vger either.
>>> And the fix is not trivial, has apparently not seen enough testing,
>>> and addresses a problem that has a known and simple workaround. IMHO it
>>> should never have been accepted as a stable patch in the first place.
>>> Especially when the previous attempt to fix this issue already ended
>>> with a regression and a revert.
>>>
>>> Anyway... After a glance at the patch, I see what looks like a nice
>>> thinko:
>>>
>>> + if (p->smi_res &&
>>> + (SMI_EN(p) & (TCO_EN | GBL_SMI_EN)) != (TCO_EN | GBL_SMI_EN))
>>>
>>> The author most certainly meant inl(SMI_EN(p)) (the register's value)
>>> and not SMI_EN(p) (the register's address).
>>>

Yes, shame on me that I didn't see that.

>>
>> https://lkml.org/lkml/2021/7/26/349
>>
>
> That's for the fix (in line with your analysis).
>
> I was also wondering if backporting that quickly was needed. Didn't
> propose it, though.
>

I'd suggest to discuss that with Greg and Sasha. Backporting is pretty
aggressive nowadays.

Guenter

2021-08-03 15:33:26

by Jean Delvare

[permalink] [raw]
Subject: Re: Faulty commit "watchdog: iTCO_wdt: Account for rebooting on second timeout"

On Tue, 3 Aug 2021 16:59:02 +0200, Jan Kiszka wrote:
> https://lkml.org/lkml/2021/7/26/349

For the record, I tested this fix successfully on my system (as in: no
more random reboots).

Tested-by: Jean Delvare <[email protected]>

Thanks,
--
Jean Delvare
SUSE L3 Support

2021-08-04 00:20:52

by Michael Marley

[permalink] [raw]
Subject: Re: Faulty commit "watchdog: iTCO_wdt: Account for rebooting on second timeout"

On 8/3/21 10:59 AM, Jan Kiszka wrote:

> https://lkml.org/lkml/2021/7/26/349

It fixes the problem for me (the person who opened the Bugzilla report)
too, thanks!

Tested-by: Michael Marley <[email protected]>


2021-08-04 07:12:37

by Jan Kiszka

[permalink] [raw]
Subject: Re: Faulty commit "watchdog: iTCO_wdt: Account for rebooting on second timeout"

On 04.08.21 02:04, Michael Marley wrote:
> On 8/3/21 10:59 AM, Jan Kiszka wrote:
>
>> https://lkml.org/lkml/2021/7/26/349
>>
>
> It fixes the problem for me (the person who opened the Bugzilla report)
> too, thanks!
>
> Tested-by: Michael Marley <[email protected]>
>

Thanks for the confirmation!

Yeah, sorry, that original mistake is truly mine. In fact, I wrote this
code twice [1] but only messed it up here. Unfortunate that it spread so
quickly. I'll try discuss this with stable people eventually, if it was
a one off or if there are more such cases.

Jan

[1]
https://github.com/siemens/efibootguard/commit/aa89fe3cbd883198c23eaec43c4448fe9e8ae148

--
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux

2021-08-06 07:53:47

by Greg KH

[permalink] [raw]
Subject: Re: Faulty commit "watchdog: iTCO_wdt: Account for rebooting on second timeout"

On Tue, Aug 03, 2021 at 04:51:08PM +0200, Jean Delvare wrote:
> Hi all,
>
> Commit cb011044e34c ("watchdog: iTCO_wdt: Account for rebooting on
> second timeout") causes a regression on several systems. Symptoms are:
> system reboots automatically after a short period of time if watchdog
> is enabled (by systemd for example). This has been reported in bugzilla:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=213809
>
> Unfortunately this commit was backported to all stable kernel branches
> (4.14, 4.19, 5.4, 5.10, 5.12 and 5.13). I'm not sure why that is the
> case, BTW, as there is no Fixes tag and no Cc to stable@vger either.
> And the fix is not trivial, has apparently not seen enough testing,
> and addresses a problem that has a known and simple workaround. IMHO it
> should never have been accepted as a stable patch in the first place.
> Especially when the previous attempt to fix this issue already ended
> with a regression and a revert.
>
> Anyway... After a glance at the patch, I see what looks like a nice
> thinko:
>
> + if (p->smi_res &&
> + (SMI_EN(p) & (TCO_EN | GBL_SMI_EN)) != (TCO_EN | GBL_SMI_EN))
>
> The author most certainly meant inl(SMI_EN(p)) (the register's value)
> and not SMI_EN(p) (the register's address).

Let me go revert this from the stable trees now, thanks for the report.

greg k-h