From: Lenny Szubowicz <[email protected]>
In __ghes_panic() clear the block status in the APEI generic
error status block for that generic hardware error source before
calling panic() to prevent a second panic() in the crash kernel
for exactly the same fatal error.
Otherwise ghes_probe(), running in the crash kernel, would see
an unhandled error in the APEI generic error status block and
panic again, thereby precluding any crash dump.
Signed-off-by: Lenny Szubowicz <[email protected]>
Signed-off-by: David Arcari <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Alexandru Gagniuc <[email protected]>
Cc: [email protected]
---
drivers/acpi/apei/ghes.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 02c6fd9..f008ba7 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -691,6 +691,8 @@ static void __ghes_panic(struct ghes *ghes)
{
__ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
+ ghes_clear_estatus(ghes);
+
/* reboot to log the error! */
if (!panic_timeout)
panic_timeout = ghes_panic_timeout;
--
1.8.3.1
On Wed, Dec 19, 2018 at 1:33 PM David Arcari <[email protected]> wrote:
>
> From: Lenny Szubowicz <[email protected]>
>
> In __ghes_panic() clear the block status in the APEI generic
> error status block for that generic hardware error source before
> calling panic() to prevent a second panic() in the crash kernel
> for exactly the same fatal error.
>
> Otherwise ghes_probe(), running in the crash kernel, would see
> an unhandled error in the APEI generic error status block and
> panic again, thereby precluding any crash dump.
>
> Signed-off-by: Lenny Szubowicz <[email protected]>
> Signed-off-by: David Arcari <[email protected]>
Good catch!
Tested-by: Tyler Baicar <[email protected]>
+ James.
On Wed, Dec 19, 2018 at 11:50:52AM -0500, David Arcari wrote:
> From: Lenny Szubowicz <[email protected]>
>
> In __ghes_panic() clear the block status in the APEI generic
> error status block for that generic hardware error source before
> calling panic() to prevent a second panic() in the crash kernel
> for exactly the same fatal error.
>
> Otherwise ghes_probe(), running in the crash kernel, would see
> an unhandled error in the APEI generic error status block and
> panic again, thereby precluding any crash dump.
>
> Signed-off-by: Lenny Szubowicz <[email protected]>
> Signed-off-by: David Arcari <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Tony Luck <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: "Eric W. Biederman" <[email protected]>
> Cc: Alexandru Gagniuc <[email protected]>
> Cc: [email protected]
> ---
> drivers/acpi/apei/ghes.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 02c6fd9..f008ba7 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -691,6 +691,8 @@ static void __ghes_panic(struct ghes *ghes)
> {
> __ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
>
> + ghes_clear_estatus(ghes);
> +
> /* reboot to log the error! */
> if (!panic_timeout)
> panic_timeout = ghes_panic_timeout;
> --
Acked-by: Borislav Petkov <[email protected]>
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Thursday, December 20, 2018 8:24:47 PM CET Borislav Petkov wrote:
> + James.
>
> On Wed, Dec 19, 2018 at 11:50:52AM -0500, David Arcari wrote:
> > From: Lenny Szubowicz <[email protected]>
> >
> > In __ghes_panic() clear the block status in the APEI generic
> > error status block for that generic hardware error source before
> > calling panic() to prevent a second panic() in the crash kernel
> > for exactly the same fatal error.
> >
> > Otherwise ghes_probe(), running in the crash kernel, would see
> > an unhandled error in the APEI generic error status block and
> > panic again, thereby precluding any crash dump.
> >
> > Signed-off-by: Lenny Szubowicz <[email protected]>
> > Signed-off-by: David Arcari <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Tony Luck <[email protected]>
> > Cc: Borislav Petkov <[email protected]>
> > Cc: "Eric W. Biederman" <[email protected]>
> > Cc: Alexandru Gagniuc <[email protected]>
> > Cc: [email protected]
> > ---
> > drivers/acpi/apei/ghes.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > index 02c6fd9..f008ba7 100644
> > --- a/drivers/acpi/apei/ghes.c
> > +++ b/drivers/acpi/apei/ghes.c
> > @@ -691,6 +691,8 @@ static void __ghes_panic(struct ghes *ghes)
> > {
> > __ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
> >
> > + ghes_clear_estatus(ghes);
> > +
> > /* reboot to log the error! */
> > if (!panic_timeout)
> > panic_timeout = ghes_panic_timeout;
>
> Acked-by: Borislav Petkov <[email protected]>
Patch applied, thanks!
On Fri, Dec 21, 2018 at 06:52:20PM +0000, James Morse wrote:
> Do we need to ghes_ack_error() too?
That's GHES v2 AFAICT.
> With the location cleared the new kernel will never find the records, and
> firmware can never re-use that location because it wasn't ack'd. The upshot is
> RAS records can't be generated for the kdump kernel. The acpi spec talks about
> use of the memory, so I don't think its fair for it to use this to disarm a
> watchdog.
>
> I think we can live with this as the kdump kernel isn't going to handle RAS
> errors for the bulk of memory anyway.
Usually, handling hw errors is always better than not but the second
kernel can't do anything better in that respect than the first, right?
If it panics, it panics - no matter the kernel. Generally.
Therefore I think the role of the second kernel should be to be as
resilient as possible to hw errors - like, not even see them :-) - dump
the memory of the first kernel as quickly as possible and reboot for
analysis.
IMHO, of course.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On 21/12/2018 11:17, Rafael J. Wysocki wrote:
> On Thursday, December 20, 2018 8:24:47 PM CET Borislav Petkov wrote:
>> + James.
Thanks,
>> On Wed, Dec 19, 2018 at 11:50:52AM -0500, David Arcari wrote:
>>> From: Lenny Szubowicz <[email protected]>
>>>
>>> In __ghes_panic() clear the block status in the APEI generic
>>> error status block for that generic hardware error source before
>>> calling panic() to prevent a second panic() in the crash kernel
>>> for exactly the same fatal error.
>>>
>>> Otherwise ghes_probe(), running in the crash kernel, would see
>>> an unhandled error in the APEI generic error status block and
>>> panic again, thereby precluding any crash dump.
I bet that was fun to watch!
>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>> index 02c6fd9..f008ba7 100644
>>> --- a/drivers/acpi/apei/ghes.c
>>> +++ b/drivers/acpi/apei/ghes.c
>>> @@ -691,6 +691,8 @@ static void __ghes_panic(struct ghes *ghes)
>>> {
>>> __ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
>>>
>>> + ghes_clear_estatus(ghes);
>>> +
>>> /* reboot to log the error! */
>>> if (!panic_timeout)
>>> panic_timeout = ghes_panic_timeout;
>>
>> Acked-by: Borislav Petkov <[email protected]>
>
> Patch applied, thanks!
Great!
Do we need to ghes_ack_error() too?
With the location cleared the new kernel will never find the records, and
firmware can never re-use that location because it wasn't ack'd. The upshot is
RAS records can't be generated for the kdump kernel. The acpi spec talks about
use of the memory, so I don't think its fair for it to use this to disarm a
watchdog.
I think we can live with this as the kdump kernel isn't going to handle RAS
errors for the bulk of memory anyway.
Thanks,
James