2018-04-16 22:01:42

by Alexandru Gagniuc

[permalink] [raw]
Subject: [RFC PATCH v2 0/4] acpi: apei: Improve error handling with firmware-first

Or "acpi: apei: Don't let puny firmware crash us with puny errors"

This is the improved implementation following feedback from James Morse
(thanks James!). This implementation, I think, is more modular, and easier to
follow, and just makes more sense.

I'm leaving this as RFC because the BIOS team is a bit scared of an OS
that won't crash when it's told to. However, if people like the idea, then
I have nothing against merging this.

Changes since v1:
- Due to popular request, the panic() is left in the NMI handler
- GHES AER handler is split into NMI and non-NMI portions
- ghes_notify_nmi() does not panic on deferrable errors
- The handlers are put in a mapping and given a common call signature

Alexandru Gagniuc (4):
EDAC, GHES: Remove unused argument to ghes_edac_report_mem_error
acpi: apei: Split GHES handlers outside of ghes_do_proc
acpi: apei: Do not panic() when correctable errors are marked as
fatal.
acpi: apei: Warn when GHES marks correctable errors as "fatal"

drivers/acpi/apei/ghes.c | 132 ++++++++++++++++++++++++++++++++++++++++-------
drivers/edac/ghes_edac.c | 3 +-
include/acpi/ghes.h | 5 +-
3 files changed, 117 insertions(+), 23 deletions(-)

--
2.14.3



2018-04-16 22:00:48

by Alexandru Gagniuc

[permalink] [raw]
Subject: [RFC PATCH v2 2/4] acpi: apei: Split GHES handlers outside of ghes_do_proc

Use a mapping from CPER UUID to get the correct handler for a given
GHES error. This is in preparation of splitting some handlers into
irq safe and regular parts.

Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/acpi/apei/ghes.c | 78 ++++++++++++++++++++++++++++++++++++++----------
1 file changed, 63 insertions(+), 15 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index f9b53a6f55f3..2119c51b4a9e 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -414,6 +414,25 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
#endif
}

+static int ghes_handle_arm(struct acpi_hest_generic_data *gdata, int sev)
+{
+ struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
+
+ log_arm_hw_error(err);
+ return ghes_severity(gdata->error_severity);
+}
+
+static int ghes_handle_mem(struct acpi_hest_generic_data *gdata, int sev)
+{
+ struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
+
+ ghes_edac_report_mem_error(sev, mem_err);
+ arch_apei_report_mem_error(sev, mem_err);
+ ghes_handle_memory_failure(gdata, sev);
+
+ return ghes_severity(gdata->error_severity);
+}
+
/*
* PCIe AER errors need to be sent to the AER driver for reporting and
* recovery. The GHES severities map to the following AER severities and
@@ -428,7 +447,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
* GHES_SEV_PANIC does not make it to this handling since the kernel must
* panic.
*/
-static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
+static int ghes_handle_aer(struct acpi_hest_generic_data *gdata, int sev)
{
#ifdef CONFIG_ACPI_APEI_PCIEAER
struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
@@ -456,14 +475,54 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
(struct aer_capability_regs *)
pcie_err->aer_info);
}
+
+ return GHES_SEV_CORRECTED;
#endif
+ return ghes_severity(gdata->error_severity);
}

+/**
+ * ghes_handler - handler for ACPI APEI errors
+ * @error_uuid: UUID describing the error entry (See ACPI/EFI CPER for details)
+ * @handle: Handler for the GHES entry of type 'error_uuid'. The handler
+ * returns the severity of the error after handling. A handler is allowed
+ * to demote errors to correctable or corrected, as appropriate.
+ */
+static const struct ghes_handler {
+ const guid_t *error_uuid;
+ int (*handle_irqsafe)(struct acpi_hest_generic_data *gdata, int sev);
+ int (*handle)(struct acpi_hest_generic_data *gdata, int sev);
+} ghes_handlers[] = {
+ {
+ .error_uuid = &CPER_SEC_PLATFORM_MEM,
+ .handle = ghes_handle_mem,
+ }, {
+ .error_uuid = &CPER_SEC_PCIE,
+ .handle = ghes_handle_aer,
+ }, {
+ .error_uuid = &CPER_SEC_PROC_ARM,
+ .handle = ghes_handle_arm,
+ }
+};
+
+static const struct ghes_handler *get_handler(const guid_t *type)
+{
+ size_t i;
+
+ for (i = 0; i < ARRAY_SIZE(ghes_handlers); i++) {
+ if (guid_equal(type, ghes_handlers[i].error_uuid))
+ return &ghes_handlers[i];
+ }
+ return NULL;
+}
+
+
static void ghes_do_proc(struct ghes *ghes,
const struct acpi_hest_generic_status *estatus)
{
int sev, sec_sev;
struct acpi_hest_generic_data *gdata;
+ const struct ghes_handler *handler;
guid_t *sec_type;
guid_t *fru_id = &NULL_UUID_LE;
char *fru_text = "";
@@ -478,21 +537,10 @@ static void ghes_do_proc(struct ghes *ghes,
if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
fru_text = gdata->fru_text;

- if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
- struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
-
- ghes_edac_report_mem_error(sev, mem_err);
-
- arch_apei_report_mem_error(sev, mem_err);
- ghes_handle_memory_failure(gdata, sev);
- }
- else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
- ghes_handle_aer(gdata);
- }
- else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
- struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);

- log_arm_hw_error(err);
+ handler = get_handler(sec_type);
+ if (handler) {
+ sec_sev = handler->handle(gdata, sev);
} else {
void *err = acpi_hest_get_payload(gdata);

--
2.14.3


2018-04-16 22:01:01

by Alexandru Gagniuc

[permalink] [raw]
Subject: [RFC PATCH v2 4/4] acpi: apei: Warn when GHES marks correctable errors as "fatal"

There seems to be a culture amongst BIOS teams to want to crash the
OS when an error can't be handled in firmware. Marking GHES errors as
"fatal" is a very common way to do this.

However, a number of errors reported by GHES may be fatal in the sense
a device or link is lost, but are not fatal to the system. When there
is a disagreement with firmware about the handleability of an error,
print a warning message.

Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/acpi/apei/ghes.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index e0528da4e8f8..6a117825611d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -535,13 +535,14 @@ static const struct ghes_handler *get_handler(const guid_t *type)
static void ghes_do_proc(struct ghes *ghes,
const struct acpi_hest_generic_status *estatus)
{
- int sev, sec_sev;
+ int sev, sec_sev, corrected_sev;
struct acpi_hest_generic_data *gdata;
const struct ghes_handler *handler;
guid_t *sec_type;
guid_t *fru_id = &NULL_UUID_LE;
char *fru_text = "";

+ corrected_sev = GHES_SEV_NO;
sev = ghes_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
sec_type = (guid_t *)gdata->section_type;
@@ -563,6 +564,13 @@ static void ghes_do_proc(struct ghes *ghes,
sec_sev, err,
gdata->error_data_length);
}
+
+ corrected_sev = max(corrected_sev, sec_sev);
+ }
+
+ if ((sev >= GHES_SEV_PANIC) && (corrected_sev < sev)) {
+ pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were able to correct");
+ pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor");
}
}

--
2.14.3


2018-04-16 22:01:45

by Alexandru Gagniuc

[permalink] [raw]
Subject: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.

Firmware is evil:
- ACPI was created to "try and make the 'ACPI' extensions somehow
Windows specific" in order to "work well with NT and not the others
even if they are open"
- EFI was created to hide "secret" registers from the OS.
- UEFI was created to allow compromising an otherwise secure OS.

Never has firmware been created to solve a problem or simplify an
otherwise cumbersome process. It is of no surprise then, that
firmware nowadays intentionally crashes an OS.

One simple way to do that is to mark GHES errors as fatal. Firmware
knows and even expects that an OS will crash in this case. And most
OSes do.

PCIe errors are notorious for having different definitions of "fatal".
In ACPI, and other firmware sandards, 'fatal' means the machine is
about to explode and needs to be reset. In PCIe, on the other hand,
fatal means that the link to a device has died. In the hotplug world
of PCIe, this is akin to a USB disconnect. From that view, the "fatal"
loss of a link is a normal event. To allow a machine to crash in this
case is downright idiotic.

To solve this, implement an IRQ safe handler for AER. This makes sure
we have enough information to invoke the full AER handler later down
the road, and tells ghes_notify_nmi that "It's all cool".
ghes_notify_nmi() then gets calmed down a little, and doesn't panic().

Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/acpi/apei/ghes.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 2119c51b4a9e..e0528da4e8f8 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -481,12 +481,26 @@ static int ghes_handle_aer(struct acpi_hest_generic_data *gdata, int sev)
return ghes_severity(gdata->error_severity);
}

+static int ghes_handle_aer_irqsafe(struct acpi_hest_generic_data *gdata,
+ int sev)
+{
+ struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
+
+ /* The system can always recover from AER errors. */
+ if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
+ pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO)
+ return CPER_SEV_RECOVERABLE;
+
+ return ghes_severity(gdata->error_severity);
+}
+
/**
* ghes_handler - handler for ACPI APEI errors
* @error_uuid: UUID describing the error entry (See ACPI/EFI CPER for details)
* @handle: Handler for the GHES entry of type 'error_uuid'. The handler
* returns the severity of the error after handling. A handler is allowed
* to demote errors to correctable or corrected, as appropriate.
+ * @handle_irqsafe: (optional) Non-blocking handler for GHES entry.
*/
static const struct ghes_handler {
const guid_t *error_uuid;
@@ -498,6 +512,7 @@ static const struct ghes_handler {
.handle = ghes_handle_mem,
}, {
.error_uuid = &CPER_SEC_PCIE,
+ .handle_irqsafe = ghes_handle_aer_irqsafe,
.handle = ghes_handle_aer,
}, {
.error_uuid = &CPER_SEC_PROC_ARM,
@@ -551,6 +566,30 @@ static void ghes_do_proc(struct ghes *ghes,
}
}

+/* How severe is the error if handling is deferred outside IRQ/NMI context? */
+static int ghes_deferrable_severity(struct ghes *ghes)
+{
+ int deferrable_sev, sev, sec_sev;
+ struct acpi_hest_generic_data *gdata;
+ const struct ghes_handler *handler;
+ const guid_t *section_type;
+ const struct acpi_hest_generic_status *estatus = ghes->estatus;
+
+ deferrable_sev = GHES_SEV_NO;
+ sev = ghes_severity(estatus->error_severity);
+ apei_estatus_for_each_section(estatus, gdata) {
+ section_type = (guid_t *)gdata->section_type;
+ handler = get_handler(section_type);
+ if (handler && handler->handle_irqsafe)
+ sec_sev = handler->handle_irqsafe(gdata, sev);
+ else
+ sec_sev = ghes_severity(gdata->error_severity);
+ deferrable_sev = max(deferrable_sev, sec_sev);
+ }
+
+ return deferrable_sev;
+}
+
static void __ghes_print_estatus(const char *pfx,
const struct acpi_hest_generic *generic,
const struct acpi_hest_generic_status *estatus)
@@ -980,7 +1019,7 @@ static void __process_error(struct ghes *ghes)
static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
{
struct ghes *ghes;
- int sev, ret = NMI_DONE;
+ int sev, dsev, ret = NMI_DONE;

if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
return ret;
@@ -993,8 +1032,9 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
ret = NMI_HANDLED;
}

+ dsev = ghes_deferrable_severity(ghes);
sev = ghes_severity(ghes->estatus->error_severity);
- if (sev >= GHES_SEV_PANIC) {
+ if ((sev >= GHES_SEV_PANIC) && (dsev >= GHES_SEV_PANIC)) {
oops_begin();
ghes_print_queued_estatus();
__ghes_panic(ghes);
--
2.14.3


2018-04-16 22:03:32

by Alexandru Gagniuc

[permalink] [raw]
Subject: [RFC PATCH v2 1/4] EDAC, GHES: Remove unused argument to ghes_edac_report_mem_error

Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/acpi/apei/ghes.c | 2 +-
drivers/edac/ghes_edac.c | 3 +--
include/acpi/ghes.h | 5 ++---
3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 1efefe919555..f9b53a6f55f3 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -481,7 +481,7 @@ static void ghes_do_proc(struct ghes *ghes,
if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);

- ghes_edac_report_mem_error(ghes, sev, mem_err);
+ ghes_edac_report_mem_error(sev, mem_err);

arch_apei_report_mem_error(sev, mem_err);
ghes_handle_memory_failure(gdata, sev);
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 68b6ee18bea6..32bb8c5f47dc 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -172,8 +172,7 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
}
}

-void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
- struct cper_sec_mem_err *mem_err)
+void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
{
enum hw_event_mc_err_type type;
struct edac_raw_error_desc *e;
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 8feb0c866ee0..e096a4e7f611 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -55,15 +55,14 @@ enum {
/* From drivers/edac/ghes_edac.c */

#ifdef CONFIG_EDAC_GHES
-void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
- struct cper_sec_mem_err *mem_err);
+void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err);

int ghes_edac_register(struct ghes *ghes, struct device *dev);

void ghes_edac_unregister(struct ghes *ghes);

#else
-static inline void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
+static inline void ghes_edac_report_mem_error(int sev,
struct cper_sec_mem_err *mem_err)
{
}
--
2.14.3


2018-04-17 09:38:58

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH v2 1/4] EDAC, GHES: Remove unused argument to ghes_edac_report_mem_error

On Mon, Apr 16, 2018 at 04:59:00PM -0500, Alexandru Gagniuc wrote:

<--- Insert commit message here.

A possible candidate would be some blurb about what commit removed the
use of that first arg.

> Signed-off-by: Alexandru Gagniuc <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 2 +-
> drivers/edac/ghes_edac.c | 3 +--
> include/acpi/ghes.h | 5 ++---
> 3 files changed, 4 insertions(+), 6 deletions(-)

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-04-17 16:45:08

by Alexandru Gagniuc

[permalink] [raw]
Subject: Re: [RFC PATCH v2 1/4] EDAC, GHES: Remove unused argument to ghes_edac_report_mem_error



On 04/17/2018 04:36 AM, Borislav Petkov wrote:
> On Mon, Apr 16, 2018 at 04:59:00PM -0500, Alexandru Gagniuc wrote:
>
> <--- Insert commit message here.
>
> A possible candidate would be some blurb about what commit removed the
> use of that first arg.

I didn't consider any commit message pork to be necessary when the
summary already explains the triviality of the change. I'll add it in
the next rev.

Thanks,
Alex

>> Signed-off-by: Alexandru Gagniuc <[email protected]>
>> ---
>> drivers/acpi/apei/ghes.c | 2 +-
>> drivers/edac/ghes_edac.c | 3 +--
>> include/acpi/ghes.h | 5 ++---
>> 3 files changed, 4 insertions(+), 6 deletions(-)
>

2018-04-18 17:53:44

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH v2 2/4] acpi: apei: Split GHES handlers outside of ghes_do_proc

On Mon, Apr 16, 2018 at 04:59:01PM -0500, Alexandru Gagniuc wrote:
> static void ghes_do_proc(struct ghes *ghes,
> const struct acpi_hest_generic_status *estatus)
> {
> int sev, sec_sev;
> struct acpi_hest_generic_data *gdata;
> + const struct ghes_handler *handler;
> guid_t *sec_type;
> guid_t *fru_id = &NULL_UUID_LE;
> char *fru_text = "";
> @@ -478,21 +537,10 @@ static void ghes_do_proc(struct ghes *ghes,
> if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
> fru_text = gdata->fru_text;
>
> - if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
> - struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
> -
> - ghes_edac_report_mem_error(sev, mem_err);
> -
> - arch_apei_report_mem_error(sev, mem_err);
> - ghes_handle_memory_failure(gdata, sev);
> - }
> - else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
> - ghes_handle_aer(gdata);
> - }
> - else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
> - struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
>
> - log_arm_hw_error(err);
> + handler = get_handler(sec_type);

I don't like this - it was better and more readable before because I can
follow which handler gets called. This change makes is less readable.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-04-18 17:56:17

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.

On Mon, Apr 16, 2018 at 04:59:02PM -0500, Alexandru Gagniuc wrote:
> Firmware is evil:
> - ACPI was created to "try and make the 'ACPI' extensions somehow
> Windows specific" in order to "work well with NT and not the others
> even if they are open"
> - EFI was created to hide "secret" registers from the OS.
> - UEFI was created to allow compromising an otherwise secure OS.
>
> Never has firmware been created to solve a problem or simplify an
> otherwise cumbersome process. It is of no surprise then, that
> firmware nowadays intentionally crashes an OS.

I don't believe I'm saying this but, get rid of that rant. Even though I
agree, it doesn't belong in a commit message.

>
> One simple way to do that is to mark GHES errors as fatal. Firmware
> knows and even expects that an OS will crash in this case. And most
> OSes do.
>
> PCIe errors are notorious for having different definitions of "fatal".
> In ACPI, and other firmware sandards, 'fatal' means the machine is
> about to explode and needs to be reset. In PCIe, on the other hand,
> fatal means that the link to a device has died. In the hotplug world
> of PCIe, this is akin to a USB disconnect. From that view, the "fatal"
> loss of a link is a normal event. To allow a machine to crash in this
> case is downright idiotic.
>
> To solve this, implement an IRQ safe handler for AER. This makes sure
> we have enough information to invoke the full AER handler later down
> the road, and tells ghes_notify_nmi that "It's all cool".
> ghes_notify_nmi() then gets calmed down a little, and doesn't panic().
>
> Signed-off-by: Alexandru Gagniuc <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 42 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 2119c51b4a9e..e0528da4e8f8 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -481,12 +481,26 @@ static int ghes_handle_aer(struct acpi_hest_generic_data *gdata, int sev)
> return ghes_severity(gdata->error_severity);
> }
>
> +static int ghes_handle_aer_irqsafe(struct acpi_hest_generic_data *gdata,
> + int sev)
> +{
> + struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
> +
> + /* The system can always recover from AER errors. */
> + if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> + pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO)
> + return CPER_SEV_RECOVERABLE;
> +
> + return ghes_severity(gdata->error_severity);
> +}

Well, Tyler touched that AER error severity handling recently and we had
it all nicely documented in the comment above ghes_handle_aer().

Your ghes_handle_aer_irqsafe() graft basically bypasses
ghes_handle_aer() instead of incorporating in it.

If all you wanna say is, the severity computation should go through all
the sections and look at each error's severity before making a decision,
then add that to ghes_severity() instead of doing that "deferrable"
severity dance.

And add the changes to the policy to the comment above
ghes_handle_aer(). I don't want any changes from people coming and going
and leaving us scratching heads why we did it this way.

And no need for those handlers and so on - make it simple first - then we
can talk more complex handling.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-04-18 17:57:01

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH v2 4/4] acpi: apei: Warn when GHES marks correctable errors as "fatal"

On Mon, Apr 16, 2018 at 04:59:03PM -0500, Alexandru Gagniuc wrote:
> There seems to be a culture amongst BIOS teams to want to crash the
> OS when an error can't be handled in firmware. Marking GHES errors as
> "fatal" is a very common way to do this.
>
> However, a number of errors reported by GHES may be fatal in the sense
> a device or link is lost, but are not fatal to the system. When there
> is a disagreement with firmware about the handleability of an error,
> print a warning message.
>
> Signed-off-by: Alexandru Gagniuc <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index e0528da4e8f8..6a117825611d 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -535,13 +535,14 @@ static const struct ghes_handler *get_handler(const guid_t *type)
> static void ghes_do_proc(struct ghes *ghes,
> const struct acpi_hest_generic_status *estatus)
> {
> - int sev, sec_sev;
> + int sev, sec_sev, corrected_sev;
> struct acpi_hest_generic_data *gdata;
> const struct ghes_handler *handler;
> guid_t *sec_type;
> guid_t *fru_id = &NULL_UUID_LE;
> char *fru_text = "";
>
> + corrected_sev = GHES_SEV_NO;
> sev = ghes_severity(estatus->error_severity);
> apei_estatus_for_each_section(estatus, gdata) {
> sec_type = (guid_t *)gdata->section_type;
> @@ -563,6 +564,13 @@ static void ghes_do_proc(struct ghes *ghes,
> sec_sev, err,
> gdata->error_data_length);
> }
> +
> + corrected_sev = max(corrected_sev, sec_sev);
> + }
> +
> + if ((sev >= GHES_SEV_PANIC) && (corrected_sev < sev)) {
> + pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were able to correct");
> + pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor");

No, I don't want any of that crap issuing stuff in dmesg and then people
opening bugs and running around and trying to replace hardware.

We either can handle the error and log a normal record somewhere or we
cannot and explode. The complaining about the FW doesn't bring shit.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-04-19 14:20:27

by Alexandru Gagniuc

[permalink] [raw]
Subject: Re: [RFC PATCH v2 2/4] acpi: apei: Split GHES handlers outside of ghes_do_proc


On 04/18/2018 12:52 PM, Borislav Petkov wrote:
> On Mon, Apr 16, 2018 at 04:59:01PM -0500, Alexandru Gagniuc wrote:
>> static void ghes_do_proc(struct ghes *ghes,
>> const struct acpi_hest_generic_status *estatus)
>> {
>> int sev, sec_sev;
>> struct acpi_hest_generic_data *gdata;
>> + const struct ghes_handler *handler;
>> guid_t *sec_type;
>> guid_t *fru_id = &NULL_UUID_LE;
>> char *fru_text = "";
>> @@ -478,21 +537,10 @@ static void ghes_do_proc(struct ghes *ghes,
>> if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
>> fru_text = gdata->fru_text;
>>
>> - if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
>> - struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
>> -
>> - ghes_edac_report_mem_error(sev, mem_err);
>> -
>> - arch_apei_report_mem_error(sev, mem_err);
>> - ghes_handle_memory_failure(gdata, sev);
>> - }
>> - else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
>> - ghes_handle_aer(gdata);
>> - }
>> - else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
>> - struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
>>
>> - log_arm_hw_error(err);
>> + handler = get_handler(sec_type);
>
> I don't like this - it was better and more readable before because I can
> follow which handler gets called. This change makes is less readable.

I agree with the readability argument in the current situation of three
handlers. I guess I was thinking ahead and generalizing for an arbitrary
number of handlers.

On the other side, you lose readability as soon as you get a few more
handlers and the function becomes too long. And more importantly, you
lose generality: it's not obvious that there's
ghes_edac_report_mem_error() which too wide a context.

Alex

2018-04-19 14:32:16

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH v2 2/4] acpi: apei: Split GHES handlers outside of ghes_do_proc

On Thu, Apr 19, 2018 at 09:19:03AM -0500, Alex G. wrote:
> On the other side, you lose readability as soon as you get a few more
> handlers and the function becomes too long.

No you don't - you split it properly.

> And more importantly, you lose generality: it's not obvious that
> there's ghes_edac_report_mem_error() which too wide a context.

I don't understand what that means.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-04-19 14:59:56

by Alexandru Gagniuc

[permalink] [raw]
Subject: Re: [RFC PATCH v2 2/4] acpi: apei: Split GHES handlers outside of ghes_do_proc



On 04/19/2018 09:30 AM, Borislav Petkov wrote:
> On Thu, Apr 19, 2018 at 09:19:03AM -0500, Alex G. wrote:
>> On the other side, you lose readability as soon as you get a few more
>> handlers and the function becomes too long.
>
> No you don't - you split it properly.

And that was the motivation behind my splitting it in this patch.

>> And more importantly, you lose generality: it's not obvious that
>> there's ghes_edac_report_mem_error() which too wide a context.
>
> I don't understand what that means.

My apologies, sometimes my thought is too far ahead of my typing
fingers. For the purpose of handling _one_ error, you need the CPER
entry for that one error -- narrow context. You don't need the entire
GHES structure -- wide context. Individual handlers should not be able
to access the entire ghes.

When the handlers are restricted to a common signature --which doesn't
include ghes--, it's obvious when functions try to bite more than they
can chew.

Alex


2018-04-19 14:59:56

by Alexandru Gagniuc

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.



On 04/18/2018 12:54 PM, Borislav Petkov wrote:
> On Mon, Apr 16, 2018 at 04:59:02PM -0500, Alexandru Gagniuc wrote:
>> Firmware is evil:
>> - ACPI was created to "try and make the 'ACPI' extensions somehow
>> Windows specific" in order to "work well with NT and not the others
>> even if they are open"
>> - EFI was created to hide "secret" registers from the OS.
>> - UEFI was created to allow compromising an otherwise secure OS.
>>
>> Never has firmware been created to solve a problem or simplify an
>> otherwise cumbersome process. It is of no surprise then, that
>> firmware nowadays intentionally crashes an OS.
>
> I don't believe I'm saying this but, get rid of that rant. Even though I
> agree, it doesn't belong in a commit message.

Of course.

(snip)> Well, Tyler touched that AER error severity handling recently
and we had
> it all nicely documented in the comment above ghes_handle_aer().
>
> Your ghes_handle_aer_irqsafe() graft basically bypasses
> ghes_handle_aer() instead of incorporating in it.
>
> If all you wanna say is, the severity computation should go through all
> the sections and look at each error's severity before making a decision,
> then add that to ghes_severity() instead of doing that "deferrable"
> severity dance.

ghes_severity() is a one-to-one mapping from a set of unsorted
severities to monotonically increasing numbers. The "one-to-one" mapping
part of the sentence is obvious from the function name. To change it to
parse the entire GHES would completely destroy this, and I think it
would apply policy in the wrong place.

Should I do that, I might have to call it something like
ghes_parse_and_apply_policy_to_severity(). But that misses the whole
point if these changes.

I would like to get to the handlers first, and then decide if things are
okay or not, but the ARM guys didn't exactly like this approach. It
seems there are quite some per-error-type considerations.
The logical step is to associate these considerations with the specific
error type they apply to, rather than hide them as a decision under an
innocent ghes_severity().

> And add the changes to the policy to the comment above
> ghes_handle_aer(). I don't want any changes from people coming and going
> and leaving us scratching heads why we did it this way.
>
> And no need for those handlers and so on - make it simple first - then we
> can talk more complex handling.

I don't want to leave people scratching their heads, but I also don't
want to make AER a special case without having a generic way to handle
these cases. People are just as susceptible to scratch their heads
wondering why AER is a special case and everything else crashes.

Maybe it's better move the AER handling to NMI/IRQ context, since
ghes_handle_aer() is only scheduling the real AER andler, and is irq
safe. I'm scratching my head about why we're messing with IRQ work from
NMI context, instead of just scheduling a regular handler to take care
of things.

Alex


2018-04-19 15:12:41

by Alexandru Gagniuc

[permalink] [raw]
Subject: Re: [RFC PATCH v2 4/4] acpi: apei: Warn when GHES marks correctable errors as "fatal"



On 04/18/2018 12:54 PM, Borislav Petkov wrote:
> On Mon, Apr 16, 2018 at 04:59:03PM -0500, Alexandru Gagniuc wrote:

(snip)
>> +
>> + corrected_sev = max(corrected_sev, sec_sev);
>> + }
>> +
>> + if ((sev >= GHES_SEV_PANIC) && (corrected_sev < sev)) {
>> + pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were able to correct");
>> + pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor");
>
> No, I don't want any of that crap issuing stuff in dmesg and then people
> opening bugs and running around and trying to replace hardware.
>
> We either can handle the error and log a normal record somewhere or we
> cannot and explode.

There is value in this. From my observations, fw claims it will do
everything through FFS, yet fails to fully handle the situation. It's
rooted in FW's assumptions about OS behavior. Because the (old) versions
of windows, esxi, and rhel used during development crash, fw assumes
that _all_ OSes crash. The result in a surprising majority of cases is
that FFS doesn't properly handle recurring errors, and fw is, in fact,
broken.

> The complaining about the FW doesn't bring shit.

You are correct. It doesn't bring defecation. It brings a red flag that
helps people get closer to the root cause of problems.

That being said, I can just drop this patch.

Alex


2018-04-19 15:31:11

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH v2 2/4] acpi: apei: Split GHES handlers outside of ghes_do_proc

On Thu, Apr 19, 2018 at 09:57:08AM -0500, Alex G. wrote:
> And that was the motivation behind my splitting it in this patch.

By "split" I don't mean add a function pointer which gets selected and
then called - if the function becomes too long, you simply split the
function body properly.

> You don't need the entire GHES structure -- wide context. Individual
> handlers should not be able to access the entire ghes.

But you remove the @ghes argument in patch 1. So what are we even
talking about?

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-04-19 15:37:30

by James Morse

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.

Hi Alex,

(I haven't read through all this yet, just on this one:)

On 04/19/2018 03:57 PM, Alex G. wrote:
> Maybe it's better move the AER handling to NMI/IRQ context, since
> ghes_handle_aer() is only scheduling the real AER andler, and is irq
> safe. I'm scratching my head about why we're messing with IRQ work from
> NMI context, instead of just scheduling a regular handler to take care
> of things.

We can't touch schedule_work_on() from NMI context as it takes spinlocks and
disables interrupts. (see __queue_work()) The NMI may have interrupted
IRQ-context code
that was already holding the same locks.

IRQ-work behaves differently, it uses an llist for the work and an arch
code hook
to raise a self-IPI.


Thanks,

James

2018-04-19 15:42:02

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.

On Thu, Apr 19, 2018 at 09:57:07AM -0500, Alex G. wrote:
> ghes_severity() is a one-to-one mapping from a set of unsorted
> severities to monotonically increasing numbers. The "one-to-one" mapping
> part of the sentence is obvious from the function name. To change it to
> parse the entire GHES would completely destroy this, and I think it
> would apply policy in the wrong place.

So do a wrapper or whatever. Do a ghes_compute_severity() or however you
would wanna call it and do the iteration there.

> Should I do that, I might have to call it something like
> ghes_parse_and_apply_policy_to_severity(). But that misses the whole
> point if these changes.

What policy? You simply compute the severity like we do in the mce code.

> I would like to get to the handlers first, and then decide if things are
> okay or not,

Why? Give me an example why you'd handle an error first and then decide
whether we're ok or not?

Usually, the error handler decides that in one place. So what exactly
are you trying to do differently that doesn't fit that flow?

> I don't want to leave people scratching their heads, but I also don't
> want to make AER a special case without having a generic way to handle
> these cases. People are just as susceptible to scratch their heads
> wondering why AER is a special case and everything else crashes.

Not if it is properly done *and* documented why we applying the
respective policy for the error type.

> Maybe it's better move the AER handling to NMI/IRQ context, since
> ghes_handle_aer() is only scheduling the real AER andler, and is irq
> safe. I'm scratching my head about why we're messing with IRQ work from
> NMI context, instead of just scheduling a regular handler to take care
> of things.

No, first pls explain what exactly you're trying to do and then we can
talk about how to do it. Btw, a real-life example to accompany that
intention goes a long way.

Thx.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-04-19 15:48:01

by Alexandru Gagniuc

[permalink] [raw]
Subject: Re: [RFC PATCH v2 2/4] acpi: apei: Split GHES handlers outside of ghes_do_proc



On 04/19/2018 10:29 AM, Borislav Petkov wrote:
> On Thu, Apr 19, 2018 at 09:57:08AM -0500, Alex G. wrote:
>> And that was the motivation behind my splitting it in this patch.
>
> By "split" I don't mean add a function pointer which gets selected and
> then called - if the function becomes too long, you simply split the
> function body properly.

The bulk of the function is the if/else mapping from UUID to error
handler. I don't see how that can be easily split up, hence why I
originally resorted to the mapping. As you said, we'll keep it simple at
first.

>> You don't need the entire GHES structure -- wide context. Individual
>> handlers should not be able to access the entire ghes.
>
> But you remove the @ghes argument in patch 1. So what are we even
> talking about?

You could say, by convention, handlers shouldn't access ghes directly,
but that is not obvious when @ghes is in scope. The reason I bring it up
is that, if [1/4] ends up being unneeded, then I will drop it from the
series.

Alex

2018-04-19 16:29:20

by Alexandru Gagniuc

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.

On 04/19/2018 10:35 AM, James Morse wrote:
> Hi Alex,
>
> (I haven't read through all this yet, just on this one:)
>
> On 04/19/2018 03:57 PM, Alex G. wrote:
>> Maybe it's better move the AER handling to NMI/IRQ context, since
>> ghes_handle_aer() is only scheduling the real AER andler, and is irq
>> safe. I'm scratching my head about why we're messing with IRQ work from
>> NMI context, instead of just scheduling a regular handler to take care
>> of things.
>
> We can't touch schedule_work_on() from NMI context as it takes spinlocks
> and
> disables interrupts. (see __queue_work()) The NMI may have interrupted
> IRQ-context code
> that was already holding the same locks.
>
> IRQ-work behaves differently, it uses an llist for the work and an arch
> code hook
> to raise a self-IPI.

That makes sense. Thank you!

Alex

>
> Thanks,
>
> James

2018-04-19 16:47:44

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.

On Thu, Apr 19, 2018 at 11:26:57AM -0500, Alex G. wrote:
> At a very high level, I'm working with Dell on improving server
> reliability, with a focus on NVME hotplug and surprise removal. One of
> the features we don't support is surprise removal of NVME drives;
> hotplug is supported with 'prepare to remove'. This is one of the
> reasons NVME is not on feature parity with SAS and SATA.

Ok, first question: is surprise removal something purely mechanical or
do you need firmware support for it? In the sense that you need to tell
the firmware that you will be removing the drive.

I'm sceptical, though, as it has "surprise" in the name so I'm guessing
the firmware doesn't know about it, the drive physically disappears and
the FW starts spewing PCIe errors...

> I'm not sure if this is the example you're looking for, but
> take an r740xd server, and slowly unplug an Intel NVME drives at an
> angle. You're likely to crash the machine.

No no, that's actually a great example!

Thx.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-04-19 16:57:20

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH v2 4/4] acpi: apei: Warn when GHES marks correctable errors as "fatal"

On Thu, Apr 19, 2018 at 10:11:03AM -0500, Alex G. wrote:
> There is value in this. From my observations, fw claims it will do
> everything through FFS, yet fails to fully handle the situation. It's
> rooted in FW's assumptions about OS behavior. Because the (old) versions
> of windows, esxi, and rhel used during development crash, fw assumes
> that _all_ OSes crash. The result in a surprising majority of cases is
> that FFS doesn't properly handle recurring errors, and fw is, in fact,
> broken.

So FW being broken is a social secret. But we don't care. We have tried,
nothing happens. No one moves. The crack monkeys which program it have
long moved to the next release and you hear crap like, "we don't support
linux" and other bullshit.

What we do now is to try to make the best of it - we either can handle
an error *without* firmware's help or we panic. If we can recover from
it, let's do that without screaming about something the user can't deal
with anyway.

All those FW_ERR printks cause nothing but expensive support calls, the
outcome of which is nothing. Just a lot of money down the drain.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-04-19 17:02:13

by Alexandru Gagniuc

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.


On 04/19/2018 10:40 AM, Borislav Petkov wrote:
> On Thu, Apr 19, 2018 at 09:57:07AM -0500, Alex G. wrote:
>> ghes_severity() is a one-to-one mapping from a set of unsorted
>> severities to monotonically increasing numbers. The "one-to-one" mapping
>> part of the sentence is obvious from the function name. To change it to
>> parse the entire GHES would completely destroy this, and I think it
>> would apply policy in the wrong place.
>
> So do a wrapper or whatever. Do a ghes_compute_severity() or however you
> would wanna call it and do the iteration there.

That doesn't sound right. There isn't a formula to compute. What we're
doing is we're looking at individual error sources, and deciding what
errors we can handle based both on the error, and our ability to handle
the error.

>> Should I do that, I might have to call it something like
>> ghes_parse_and_apply_policy_to_severity(). But that misses the whole
>> point if these changes.
>
> What policy? You simply compute the severity like we do in the mce code.

As explained above, our ability to resolve an error depends on the
interaction between the error and error handler. This is very closely
tied to the capabilities of each individual handler. I'll do it your
way, but I don't think ignoring this tight coupling is the right thing
to do.

>
>> I would like to get to the handlers first, and then decide if things are
>> okay or not,
>
> Why? Give me an example why you'd handle an error first and then decide
> whether we're ok or not?
>
> Usually, the error handler decides that in one place. So what exactly
> are you trying to do differently that doesn't fit that flow?

In the NMI case you don't make it to the error handler. James and I beat
this subject to the afterlife in v1.

>> I don't want to leave people scratching their heads, but I also don't
>> want to make AER a special case without having a generic way to handle
>> these cases. People are just as susceptible to scratch their heads
>> wondering why AER is a special case and everything else crashes.
>
> Not if it is properly done *and* documented why we applying the
> respective policy for the error type.
>
>> Maybe it's better move the AER handling to NMI/IRQ context, since
>> ghes_handle_aer() is only scheduling the real AER andler, and is irq
>> safe. I'm scratching my head about why we're messing with IRQ work from
>> NMI context, instead of just scheduling a regular handler to take care
>> of things.
>
> No, first pls explain what exactly you're trying to do

I realize v1 was quite a while back, so I'll take this opportunity to
restate:

At a very high level, I'm working with Dell on improving server
reliability, with a focus on NVME hotplug and surprise removal. One of
the features we don't support is surprise removal of NVME drives;
hotplug is supported with 'prepare to remove'. This is one of the
reasons NVME is not on feature parity with SAS and SATA.

My role is to solve this issue on linux, and to not worry about other
OSes. This puts me in a position to have a linux-centric view of the
problem, as opposed to the more common firmware-centric view.

Part of solving the surprise removal issue involves improving FFS error
handling. This is required because the servers which are shipping use
FFS instead of native error notifications. As part of extensive testing,
I have found the NMI handler to be the most common cause of crashes, and
hence this series.

> and then we can talk about how to do it.

Your move.

> Btw, a real-life example to accompany that intention goes a long way.

I'm not sure if this is the example you're looking for, but
take an r740xd server, and slowly unplug an Intel NVME drives at an
angle. You're likely to crash the machine.

Alex

2018-04-19 17:03:02

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH v2 2/4] acpi: apei: Split GHES handlers outside of ghes_do_proc

On Thu, Apr 19, 2018 at 10:46:15AM -0500, Alex G. wrote:
> The bulk of the function is the if/else mapping from UUID to error
> handler. I don't see how that can be easily split up, hence why I
> originally resorted to the mapping. As you said, we'll keep it simple at
> first.

So that function is 43 lines now. Why are we even talking about this?!

Just add your UUID check to the if-else statement and be done with it
already. No handlers no nothing.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-04-19 17:42:56

by Alexandru Gagniuc

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.

SURPRISE!!!

On 04/19/2018 11:45 AM, Borislav Petkov wrote:
> On Thu, Apr 19, 2018 at 11:26:57AM -0500, Alex G. wrote:
>> At a very high level, I'm working with Dell on improving server
>> reliability, with a focus on NVME hotplug and surprise removal. One of
>> the features we don't support is surprise removal of NVME drives;
>> hotplug is supported with 'prepare to remove'. This is one of the
>> reasons NVME is not on feature parity with SAS and SATA.
>
> Ok, first question: is surprise removal something purely mechanical or
> do you need firmware support for it? In the sense that you need to tell
> the firmware that you will be removing the drive.

SURPRISE!!! removal only means that the system was not expecting the
drive to be yanked. An example is removing a USB flash drive without
first unmounting it and removing the usb device (echo 0 >
/sys/bus/usb/.../authorized).

PCIe removal and hotplug is fairly well spec'd, and NVMe rides on that
without issue. It's much easier and faster for an OS to just follow the
spec and handle things on its own.

Interference from firmware only comes in with EFI/ACPI and FFS. From a
purely technical point of view, firmware has nothing to do with this.
From a firmware-centric view, unfortunately, firmware wants the ability
to log errors to the BMC... and hotplug events.

Does firmware need to know that a drive will be removed? I'm not aware
of any such requirement. I think the main purpose of 'prepare to remove'
is to shut down any traffic on the link. This way, link removal does not
generate PCIe errors which may otherwise end up crashing the OS.


> I'm sceptical, though, as it has "surprise" in the name so I'm guessing
> the firmware doesn't know about it, the drive physically disappears and
> the FW starts spewing PCIe errors...

It's not the FW that spews out errors. It's the hardware. It's very
likely that a device which is actively used will have several DMA
transactions already queued up and lots of traffic going through the
link. When the link dies and the traffic can't be delivered, Unsupported
Request errors are very common.

On the r740xd, FW just hides those errors from the OS with no further
notification. On this machine BIOS sets things up such that non-posted
requests report fatal (PCIe) errors. FW still tries very hard to hide
this from the OS, and I think the heuristic is that if the drive
physical presence is gone, don't even report the error.

There are a lot of problems with the approach, but one thing to keep in
mind is that the FW was written at a time when OSes were more than happy
to crash at any PCIe error reported through GHES.

Alex

>> I'm not sure if this is the example you're looking for, but
>> take an r740xd server, and slowly unplug an Intel NVME drives at an
>> angle. You're likely to crash the machine.
>
> No no, that's actually a great example!
>
> Thx.
>

2018-04-19 19:05:12

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.

(snip useful explanation).

On Thu, Apr 19, 2018 at 12:40:54PM -0500, Alex G. wrote:
> On the r740xd, FW just hides those errors from the OS with no further
> notification. On this machine BIOS sets things up such that non-posted
> requests report fatal (PCIe) errors. FW still tries very hard to hide
> this from the OS, and I think the heuristic is that if the drive
> physical presence is gone, don't even report the error.

Ok, second question: can you detect from the error signatures alone that
it was a surprise removal? How does such an error look like, in detail?
Got error logs somewhere to dump?

Thx.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-04-19 22:59:23

by Alexandru Gagniuc

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.



On 04/19/2018 02:03 PM, Borislav Petkov wrote:
> (snip useful explanation).
>
> On Thu, Apr 19, 2018 at 12:40:54PM -0500, Alex G. wrote:
>> On the r740xd, FW just hides those errors from the OS with no further
>> notification. On this machine BIOS sets things up such that non-posted
>> requests report fatal (PCIe) errors. FW still tries very hard to hide
>> this from the OS, and I think the heuristic is that if the drive
>> physical presence is gone, don't even report the error.
>
> Ok, second question: can you detect from the error signatures alone that
> it was a surprise removal?

I suppose you could make some inference, given the timing of other
events going on around the the crash. It's not uncommon to see a "Card
not present" event around drive removal.

Since the presence detect pin breaks last, you might not get that
interrupt for a long while. In that case it's much harder to determine
if you're seeing a SURPRISE!!! removal or some other fault.

I don't think you can use GHES alone to determine the nature of the
event. There is not a 1:1 mapping from the set of things going wrong to
the set of PCIe errors.

> How does such an error look like, in detail?

It's green on the soft side, with lots of red accents, as well as some
textured white shades:

[ 51.414616] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
[ 51.414634] pciehp 0000:b0:05.0:pcie204: Slot(179): Link Down
[ 52.703343] FIRMWARE BUG: Firmware sent fatal error that we were able
to correct
[ 52.703345] BROKEN FIRMWARE: Complain to your hardware vendor
[ 52.703347] {1}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 1
[ 52.703358] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up
[ 52.711616] {1}[Hardware Error]: event severity: fatal
[ 52.716754] {1}[Hardware Error]: Error 0, type: fatal
[ 52.721891] {1}[Hardware Error]: section_type: PCIe error
[ 52.727463] {1}[Hardware Error]: port_type: 6, downstream switch port
[ 52.734075] {1}[Hardware Error]: version: 3.0
[ 52.738607] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ 52.744786] {1}[Hardware Error]: device_id: 0000:b0:06.0
[ 52.750271] {1}[Hardware Error]: slot: 4
[ 52.754371] {1}[Hardware Error]: secondary_bus: 0xb3
[ 52.759509] {1}[Hardware Error]: vendor_id: 0x10b5, device_id: 0x9733
[ 52.766123] {1}[Hardware Error]: class_code: 000406
[ 52.771182] {1}[Hardware Error]: bridge: secondary_status: 0x0000,
control: 0x0003
[ 52.779038] pcieport 0000:b0:06.0: aer_status: 0x00100000, aer_mask:
0x01a10000
[ 52.782303] nvme0n1: detected capacity change from 3200631791616 to 0
[ 52.786348] pcieport 0000:b0:06.0: [20] Unsupported Request
[ 52.786349] pcieport 0000:b0:06.0: aer_layer=Transaction Layer,
aer_agent=Requester ID
[ 52.786350] pcieport 0000:b0:06.0: aer_uncor_severity: 0x004eb030
[ 52.786352] pcieport 0000:b0:06.0: TLP Header: 40000001 0000020f
e12023bc 01000000
[ 52.786357] pcieport 0000:b0:06.0: broadcast error_detected message
[ 52.883895] pci 0000:b3:00.0: device has no driver
[ 52.883976] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
[ 52.884184] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down event
queued; currently getting powered on
[ 52.967175] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up


> Got error logs somewhere to dump?

Sure [1]. They have the ANSI sequences, so you might want to wget and
grep them in a color terminal.

Alex

[1] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180416-1919.log

2018-04-22 10:51:48

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.

On Thu, Apr 19, 2018 at 05:55:08PM -0500, Alex G. wrote:
> > How does such an error look like, in detail?
>
> It's green on the soft side, with lots of red accents, as well as some
> textured white shades:
>
> [ 51.414616] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
> [ 51.414634] pciehp 0000:b0:05.0:pcie204: Slot(179): Link Down
> [ 52.703343] FIRMWARE BUG: Firmware sent fatal error that we were able
> to correct
> [ 52.703345] BROKEN FIRMWARE: Complain to your hardware vendor
> [ 52.703347] {1}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 1
> [ 52.703358] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up
> [ 52.711616] {1}[Hardware Error]: event severity: fatal
> [ 52.716754] {1}[Hardware Error]: Error 0, type: fatal
> [ 52.721891] {1}[Hardware Error]: section_type: PCIe error
> [ 52.727463] {1}[Hardware Error]: port_type: 6, downstream switch port
> [ 52.734075] {1}[Hardware Error]: version: 3.0
> [ 52.738607] {1}[Hardware Error]: command: 0x0407, status: 0x0010
> [ 52.744786] {1}[Hardware Error]: device_id: 0000:b0:06.0
> [ 52.750271] {1}[Hardware Error]: slot: 4
> [ 52.754371] {1}[Hardware Error]: secondary_bus: 0xb3
> [ 52.759509] {1}[Hardware Error]: vendor_id: 0x10b5, device_id: 0x9733
> [ 52.766123] {1}[Hardware Error]: class_code: 000406
> [ 52.771182] {1}[Hardware Error]: bridge: secondary_status: 0x0000,
> control: 0x0003
> [ 52.779038] pcieport 0000:b0:06.0: aer_status: 0x00100000, aer_mask:
> 0x01a10000
> [ 52.782303] nvme0n1: detected capacity change from 3200631791616 to 0
> [ 52.786348] pcieport 0000:b0:06.0: [20] Unsupported Request
> [ 52.786349] pcieport 0000:b0:06.0: aer_layer=Transaction Layer,
> aer_agent=Requester ID
> [ 52.786350] pcieport 0000:b0:06.0: aer_uncor_severity: 0x004eb030
> [ 52.786352] pcieport 0000:b0:06.0: TLP Header: 40000001 0000020f
> e12023bc 01000000
> [ 52.786357] pcieport 0000:b0:06.0: broadcast error_detected message
> [ 52.883895] pci 0000:b3:00.0: device has no driver
> [ 52.883976] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
> [ 52.884184] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down event
> queued; currently getting powered on
> [ 52.967175] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up

Btw, from another discussion we're having with Yazen:

@Yazen, do you see how this error record is worth shit?

class_code: 000406
command: 0x0407, status: 0x0010
bridge: secondary_status: 0x0000, control: 0x0003
aer_status: 0x00100000, aer_mask: 0x01a10000
aer_uncor_severity: 0x004eb030

those above are only some of the fields which are purely useless
undecoded. Makes me wonder what's worse for the user: dump the
half-decoded error or not dump an error at all...

Anyway, Alex, I see this in the logs:

[ 66.581121] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
[ 66.591939] pciehp 0000:b0:05.0:pcie204: Slot(179): Card not present
[ 66.592102] pciehp 0000:b0:06.0:pcie204: Slot(176): Card not present

and that comes from that pciehp_isr() interrupt handler AFAICT.

So there *is* a way to know that the card is not present anymore. So,
theoretically, and ignoring the code layering for now, we can connect
that error to the card not present event and then ignore the error...

Hmmm.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-04-24 04:21:05

by Alexandru Gagniuc

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.



On 04/22/2018 05:48 AM, Borislav Petkov wrote:
> On Thu, Apr 19, 2018 at 05:55:08PM -0500, Alex G. wrote:
>>> How does such an error look like, in detail?
>>
>> It's green on the soft side, with lots of red accents, as well as some
>> textured white shades:
>>
>> [ 51.414616] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
>> [ 51.414634] pciehp 0000:b0:05.0:pcie204: Slot(179): Link Down
>> [ 52.703343] FIRMWARE BUG: Firmware sent fatal error that we were able
>> to correct
>> [ 52.703345] BROKEN FIRMWARE: Complain to your hardware vendor
>> [ 52.703347] {1}[Hardware Error]: Hardware error from APEI Generic
>> Hardware Error Source: 1
>> [ 52.703358] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up
>> [ 52.711616] {1}[Hardware Error]: event severity: fatal
>> [ 52.716754] {1}[Hardware Error]: Error 0, type: fatal
>> [ 52.721891] {1}[Hardware Error]: section_type: PCIe error
>> [ 52.727463] {1}[Hardware Error]: port_type: 6, downstream switch port
>> [ 52.734075] {1}[Hardware Error]: version: 3.0
>> [ 52.738607] {1}[Hardware Error]: command: 0x0407, status: 0x0010
>> [ 52.744786] {1}[Hardware Error]: device_id: 0000:b0:06.0
>> [ 52.750271] {1}[Hardware Error]: slot: 4
>> [ 52.754371] {1}[Hardware Error]: secondary_bus: 0xb3
>> [ 52.759509] {1}[Hardware Error]: vendor_id: 0x10b5, device_id: 0x9733
>> [ 52.766123] {1}[Hardware Error]: class_code: 000406
>> [ 52.771182] {1}[Hardware Error]: bridge: secondary_status: 0x0000,
>> control: 0x0003
>> [ 52.779038] pcieport 0000:b0:06.0: aer_status: 0x00100000, aer_mask:
>> 0x01a10000
>> [ 52.782303] nvme0n1: detected capacity change from 3200631791616 to 0
>> [ 52.786348] pcieport 0000:b0:06.0: [20] Unsupported Request
>> [ 52.786349] pcieport 0000:b0:06.0: aer_layer=Transaction Layer,
>> aer_agent=Requester ID
>> [ 52.786350] pcieport 0000:b0:06.0: aer_uncor_severity: 0x004eb030
>> [ 52.786352] pcieport 0000:b0:06.0: TLP Header: 40000001 0000020f
>> e12023bc 01000000
>> [ 52.786357] pcieport 0000:b0:06.0: broadcast error_detected message
>> [ 52.883895] pci 0000:b3:00.0: device has no driver
>> [ 52.883976] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
>> [ 52.884184] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down event
>> queued; currently getting powered on
>> [ 52.967175] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up
>
> Btw, from another discussion we're having with Yazen:
>
> @Yazen, do you see how this error record is worth shit?
>
> class_code: 000406
> command: 0x0407, status: 0x0010
> bridge: secondary_status: 0x0000, control: 0x0003
> aer_status: 0x00100000, aer_mask: 0x01a10000
> aer_uncor_severity: 0x004eb030

That tells you what FFS said about the error. Keep in mind that FFS has
cleared the hardware error bits, which the AER handler would normally
read from the PCI device.

> those above are only some of the fields which are purely useless
> undecoded. Makes me wonder what's worse for the user: dump the
> half-decoded error or not dump an error at all...

It's immediately obvious if there's a glaring FFS bug and if we get
bogus data. If you distrust firmware as much as I do, then you will find
great value in having such info in the logs. It's probably not too
useful to a casual user, but then neither is a majority of the system log.

> Anyway, Alex, I see this in the logs:
>
> [ 66.581121] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
> [ 66.591939] pciehp 0000:b0:05.0:pcie204: Slot(179): Card not present
> [ 66.592102] pciehp 0000:b0:06.0:pcie204: Slot(176): Card not present
>
> and that comes from that pciehp_isr() interrupt handler AFAICT.
>
> So there *is* a way to know that the card is not present anymore. So,
> theoretically, and ignoring the code layering for now, we can connect
> that error to the card not present event and then ignore the error...

You're missing the timing and assuming you will get the hotplug
interrupt. In this example, you have 22ms between the link down and
presence detect state change. This is a fairly fast removal.

Hotplug dependencies aside (you can have the kernel run without PCIe
hotplug support), I don't think you want to just linger in NMI for
dozens of milliseconds waiting for presence detect confirmation.

For enterprise SFF NVMe drives, the data lanes will disconnect before
the presence detect. FFS relies on presence detect, and these are two of
the reasons why slow removal is such a problem. You might not get a
presence detect interrupt at all.

Presence detect is optional for PCIe. PD is such a reliable heuristic,
that it guarantees worse error handling than the crackmonkey firmware. I
don't see how might be useful in a way which gives us better handling
than firmware.

> Hmmm.

Hmmm

Anyway, heuristics about PCIe error recovery belong in the recovery
handler. I don't think it's smart to apply policy before we get there

Alex


2018-04-25 14:03:26

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.

On Mon, Apr 23, 2018 at 11:19:25PM -0500, Alex G. wrote:
> That tells you what FFS said about the error.

I betcha those status and command values have a human-readable counterparts.

Btw, what do you abbreviate with "FFS"?

> It's immediately obvious if there's a glaring FFS bug and if we get bogus
> data. If you distrust firmware as much as I do, then you will find great
> value in having such info in the logs. It's probably not too useful to a
> casual user, but then neither is a majority of the system log.

No no, you're missing the point - I *want* all data in the error log
which helps debug a hardware issue. I just want it humanly readable so
that I don't have to jot down the values and go scour the manuals to map
what it actually means.

> You're missing the timing and assuming you will get the hotplug interrupt.
> In this example, you have 22ms between the link down and presence detect
> state change. This is a fairly fast removal.
>
> Hotplug dependencies aside (you can have the kernel run without PCIe hotplug
> support), I don't think you want to just linger in NMI for dozens of
> milliseconds waiting for presence detect confirmation.

No, I don't mean that. I mean something like deferred processing: you
get an error, you notice it is a device which supports physical removal
so you exit the NMI handler and process the error in normal, process
context which allows you to query the device and say, "Hey device, are
you still there?"

If it is not, you drop all the hw I/O errors reported for it.

Hmmm?

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-04-25 15:02:59

by Alexandru Gagniuc

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.



On 04/25/2018 09:01 AM, Borislav Petkov wrote:
> On Mon, Apr 23, 2018 at 11:19:25PM -0500, Alex G. wrote:
>> That tells you what FFS said about the error.
>
> I betcha those status and command values have a human-readable counterparts.
>
> Btw, what do you abbreviate with "FFS"?

Firmware-first.

>> It's immediately obvious if there's a glaring FFS bug and if we get bogus
>> data. If you distrust firmware as much as I do, then you will find great
>> value in having such info in the logs. It's probably not too useful to a
>> casual user, but then neither is a majority of the system log.
>
> No no, you're missing the point - I *want* all data in the error log
> which helps debug a hardware issue. I just want it humanly readable so
> that I don't have to jot down the values and go scour the manuals to map
> what it actually means.

We could probably use more of the native AER print functions, but that's
beyond the scope of this patch. I tried something like this [1], but
have given up following the PCI maintainer's radio silence. I don't care
_that_ much about the log format.

[1] http://www.spinics.net/lists/linux-pci/msg71422.html

>> You're missing the timing and assuming you will get the hotplug interrupt.
>> In this example, you have 22ms between the link down and presence detect
>> state change. This is a fairly fast removal.
>>
>> Hotplug dependencies aside (you can have the kernel run without PCIe hotplug
>> support), I don't think you want to just linger in NMI for dozens of
>> milliseconds waiting for presence detect confirmation.
>
> No, I don't mean that. I mean something like deferred processing:

Like the exact thing that this patch series implements? :)

> you
> get an error, you notice it is a device which supports physical removal
> so you exit the NMI handler and process the error in normal, process
> context which allows you to query the device and say, "Hey device, are
> you still there?"

Like the exact way the AER handler works?

> If it is not, you drop all the hw I/O errors reported for it.

Like the PCI error recovery mechanisms that AER invokes?

> Hmmm?
Hmmm

2018-04-25 17:18:56

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.

On Wed, Apr 25, 2018 at 10:00:53AM -0500, Alex G. wrote:
> Firmware-first.

Ok, my guess was right.

> We could probably use more of the native AER print functions, but that's
> beyond the scope of this patch.

No no, this does not belong in this patchset.

> Like the exact thing that this patch series implements? :)

Exact thing? I don't think so.

No, your patchset is grafting some funky and questionable side-handler
which gets to see the PCIe errors first, out-of-line and then it
practically downgrades their severity outside of the error processing
flow.

What I've been telling you to do is to extend ghes_severity() to
give the lower than PANIC severity for CPER_SEC_PCIE errors first
so that the machine doesn't panic from them anymore and those PCIe
errors get processed in the normal error processing path down
through ghes_do_proc() and then land in ghes_handle_aer(). No adhoc
->handle_irqsafe thing - just the normal straightforward error
processing path.

There, in ghes_handle_aer(), you do the check whether the device is
still there - i.e., you try to apply some heuristics to detect the error
type and why the system is complaining - you maybe even check whether
the NVMe device is still there - and *then* you do the proper recovery
action.

And you document for the future people looking at this code *why* you're
doing this.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-04-25 17:29:31

by Alexandru Gagniuc

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.



On 04/25/2018 12:15 PM, Borislav Petkov wrote:
> On Wed, Apr 25, 2018 at 10:00:53AM -0500, Alex G. wrote:
>> Firmware-first.
>
> Ok, my guess was right.
>
>> We could probably use more of the native AER print functions, but that's
>> beyond the scope of this patch.
>
> No no, this does not belong in this patchset.
>
>> Like the exact thing that this patch series implements? :)
>
> Exact thing? I don't think so.
>
> No, your patchset is grafting some funky and questionable side-handler
> which gets to see the PCIe errors first, out-of-line and then it
> practically downgrades their severity outside of the error processing
> flow.

SURPRISE!!! This is a what vs how issue. I am keeping the what, and
working on the how that you suggested.

> What I've been telling you

It's coming (eventually). I'm trying to avoid pushing more than one
series per week.

(snip useful email context)

Hmmm.

Alex

2018-04-25 17:42:15

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.

On Wed, Apr 25, 2018 at 12:27:59PM -0500, Alex G. wrote:
> SURPRISE!!!

What does that mean? You've had too much coffee?

> It's coming (eventually). I'm trying to avoid pushing more than one
> series per week.

You better. Flooding people with patchsets won't get you very far.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-04-25 20:42:03

by Alexandru Gagniuc

[permalink] [raw]
Subject: [RFC PATCH v3 1/3] EDAC, GHES: Remove unused argument to ghes_edac_report_mem_error

The use of the 'ghes' argument was removed in a previous commit, but
function signature was not updated to reflect this.

Fixes: 0fe5f281f749 ("EDAC, ghes: Model a single, logical memory controller")
Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/acpi/apei/ghes.c | 2 +-
drivers/edac/ghes_edac.c | 3 +--
include/acpi/ghes.h | 5 ++---
3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 1efefe919555..f9b53a6f55f3 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -481,7 +481,7 @@ static void ghes_do_proc(struct ghes *ghes,
if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);

- ghes_edac_report_mem_error(ghes, sev, mem_err);
+ ghes_edac_report_mem_error(sev, mem_err);

arch_apei_report_mem_error(sev, mem_err);
ghes_handle_memory_failure(gdata, sev);
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 68b6ee18bea6..32bb8c5f47dc 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -172,8 +172,7 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
}
}

-void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
- struct cper_sec_mem_err *mem_err)
+void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
{
enum hw_event_mc_err_type type;
struct edac_raw_error_desc *e;
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 8feb0c866ee0..e096a4e7f611 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -55,15 +55,14 @@ enum {
/* From drivers/edac/ghes_edac.c */

#ifdef CONFIG_EDAC_GHES
-void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
- struct cper_sec_mem_err *mem_err);
+void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err);

int ghes_edac_register(struct ghes *ghes, struct device *dev);

void ghes_edac_unregister(struct ghes *ghes);

#else
-static inline void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
+static inline void ghes_edac_report_mem_error(int sev,
struct cper_sec_mem_err *mem_err)
{
}
--
2.14.3


2018-04-25 20:42:08

by Alexandru Gagniuc

[permalink] [raw]
Subject: [RFC PATCH v3 3/3] acpi: apei: Warn when GHES marks correctable errors as "fatal"

There seems to be a culture amongst BIOS teams to want to crash the
OS when an error can't be handled in firmware. Marking GHES errors as
"fatal" is a very common way to do this.

However, a number of errors reported by GHES may be fatal in the sense
a device or link is lost, but are not fatal to the system. When there
is a disagreement with firmware about the handleability of an error,
print a warning message.

Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/acpi/apei/ghes.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 8ccb9cc10fc8..34d0da692dd0 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -539,6 +539,12 @@ static void ghes_do_proc(struct ghes *ghes,
sec_sev, err,
gdata->error_data_length);
}
+
+ }
+
+ if ((sev >= GHES_SEV_PANIC) && (ghes_actual_severity(ghes) < sev)) {
+ pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were able to correct");
+ pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor");
}
}

--
2.14.3


2018-04-25 20:42:47

by Alexandru Gagniuc

[permalink] [raw]
Subject: [RFC PATCH v3 0/3] acpi: apei: Improve PCIe error handling with firmware-first

Or "acpi: apei: Don't trust firmware any further than you can throw it"

This is the improved implementation following feedback from James and
Borislav. This implementation is much simpler, albeit less flexible than v2.

I'm leaving this as RFC because the BIOS team is a bit scared of an OS
that won't crash when it's told to. However, if people like the idea, then
I have nothing against merging this.

Borislav, if you don't like the third patch in the series, feel free to leave
it out. THings will work beautifully with or without it.


Changes since v2:
- Due to popular request, simple is chosen over flexible
- Removed splitting of handlers into irq safe portion.
- Change behavior only for PCIe errors

Changes since v1:
- Due to popular request, the panic() is left in the NMI handler
- GHES AER handler is split into NMI and non-NMI portions
- ghes_notify_nmi() does not panic on deferrable errors
- The handlers are put in a mapping and given a common call signature

Alexandru Gagniuc (3):
EDAC, GHES: Remove unused argument to ghes_edac_report_mem_error
acpi: apei: Do not panic() on PCIe errors reported through GHES
acpi: apei: Warn when GHES marks correctable errors as "fatal"

drivers/acpi/apei/ghes.c | 56 +++++++++++++++++++++++++++++++++++++++++++-----
drivers/edac/ghes_edac.c | 3 +--
include/acpi/ghes.h | 5 ++---
3 files changed, 54 insertions(+), 10 deletions(-)

--
2.14.3


2018-04-25 20:44:16

by Alexandru Gagniuc

[permalink] [raw]
Subject: [RFC PATCH v3 2/3] acpi: apei: Do not panic() on PCIe errors reported through GHES

The policy was to panic() when GHES said that an error is "Fatal".
This logic is wrong for several reasons, as it doesn't take into
account what caused the error.

PCIe fatal errors indicate that the link to a device is either
unstable or unusable. They don't indicate that the machine is on fire,
and they are not severe enough that we need to panic(). Instead of
relying on crackmonkey firmware, evaluate the error severity based on
what caused the error (GHES subsections).

Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/acpi/apei/ghes.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index f9b53a6f55f3..8ccb9cc10fc8 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -425,8 +425,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
* GHES_SEV_RECOVERABLE -> AER_NONFATAL
* GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
* These both need to be reported and recovered from by the AER driver.
- * GHES_SEV_PANIC does not make it to this handling since the kernel must
- * panic.
+ * GHES_SEV_PANIC -> AER_FATAL
*/
static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
{
@@ -459,6 +458,46 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
#endif
}

+/* PCIe errors should not cause a panic. */
+static int ghes_sec_pcie_severity(struct acpi_hest_generic_data *gdata)
+{
+ struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
+
+ if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
+ pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
+ IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))
+ return CPER_SEV_RECOVERABLE;
+
+ return ghes_severity(gdata->error_severity);
+}
+/*
+ * The severity field in the status block is oftentimes more severe than it
+ * needs to be. This makes it an unreliable metric for the severity. A more
+ * reliable way is to look at each subsection and correlate it with how well
+ * the error can be handled.
+ * - SEC_PCIE: All PCIe errors can be handled by AER.
+ */
+static int ghes_actual_severity(struct ghes *ghes)
+{
+ int worst_sev, sec_sev;
+ struct acpi_hest_generic_data *gdata;
+ const guid_t *section_type;
+ const struct acpi_hest_generic_status *estatus = ghes->estatus;
+
+ worst_sev = GHES_SEV_NO;
+ apei_estatus_for_each_section(estatus, gdata) {
+ section_type = (guid_t *)gdata->section_type;
+ sec_sev = ghes_severity(gdata->error_severity);
+
+ if (guid_equal(section_type, &CPER_SEC_PCIE))
+ sec_sev = ghes_sec_pcie_severity(gdata);
+
+ worst_sev = max(worst_sev, sec_sev);
+ }
+
+ return worst_sev;
+}
+
static void ghes_do_proc(struct ghes *ghes,
const struct acpi_hest_generic_status *estatus)
{
@@ -932,7 +971,7 @@ static void __process_error(struct ghes *ghes)
static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
{
struct ghes *ghes;
- int sev, ret = NMI_DONE;
+ int sev, asev, ret = NMI_DONE;

if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
return ret;
@@ -945,8 +984,9 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
ret = NMI_HANDLED;
}

+ asev = ghes_actual_severity(ghes);
sev = ghes_severity(ghes->estatus->error_severity);
- if (sev >= GHES_SEV_PANIC) {
+ if ((sev >= GHES_SEV_PANIC) && (asev >= GHES_SEV_PANIC)) {
oops_begin();
ghes_print_queued_estatus();
__ghes_panic(ghes);
--
2.14.3


2018-04-26 11:21:43

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/3] acpi: apei: Do not panic() on PCIe errors reported through GHES

On Wed, Apr 25, 2018 at 03:39:50PM -0500, Alexandru Gagniuc wrote:
> @@ -932,7 +971,7 @@ static void __process_error(struct ghes *ghes)
> static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
> {
> struct ghes *ghes;
> - int sev, ret = NMI_DONE;
> + int sev, asev, ret = NMI_DONE;
>
> if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
> return ret;
> @@ -945,8 +984,9 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
> ret = NMI_HANDLED;
> }
>
> + asev = ghes_actual_severity(ghes);
> sev = ghes_severity(ghes->estatus->error_severity);

So renaming ghes_deferrable_severity() to ghes_actual_severity() is not
a big change. And that's not what I meant.

I'd like to see here:

sev = ghes_severity(ghes);

and inside you do all the required mapping/severity processing/etc. And
you can rename the current ghes_severity() to ghes_map_cper_severity()
or whatever...

> - if (sev >= GHES_SEV_PANIC) {
> + if ((sev >= GHES_SEV_PANIC) && (asev >= GHES_SEV_PANIC)) {

... so that this change doesn't happen and there are not two severities
queried but a single one.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-04-26 11:23:19

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/3] acpi: apei: Warn when GHES marks correctable errors as "fatal"

On Wed, Apr 25, 2018 at 03:39:51PM -0500, Alexandru Gagniuc wrote:
> There seems to be a culture amongst BIOS teams to want to crash the
> OS when an error can't be handled in firmware. Marking GHES errors as
> "fatal" is a very common way to do this.
>
> However, a number of errors reported by GHES may be fatal in the sense
> a device or link is lost, but are not fatal to the system. When there
> is a disagreement with firmware about the handleability of an error,
> print a warning message.
>
> Signed-off-by: Alexandru Gagniuc <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 8ccb9cc10fc8..34d0da692dd0 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -539,6 +539,12 @@ static void ghes_do_proc(struct ghes *ghes,
> sec_sev, err,
> gdata->error_data_length);
> }
> +
> + }
> +
> + if ((sev >= GHES_SEV_PANIC) && (ghes_actual_severity(ghes) < sev)) {
> + pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were able to correct");
> + pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor");

Pasting the same comment from last time since you missed it:

"No, I don't want any of that crap issuing stuff in dmesg and then people
opening bugs and running around and trying to replace hardware.

We either can handle the error and log a normal record somewhere or we
cannot and explode. The complaining about the FW doesn't bring shit."

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-04-26 17:46:42

by Alexandru Gagniuc

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/3] acpi: apei: Do not panic() on PCIe errors reported through GHES

Hi Borislav,

On 04/26/2018 06:19 AM, Borislav Petkov wrote:
> On Wed, Apr 25, 2018 at 03:39:50PM -0500, Alexandru Gagniuc wrote:
>> @@ -932,7 +971,7 @@ static void __process_error(struct ghes *ghes)
>> static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
>> {
>> struct ghes *ghes;
>> - int sev, ret = NMI_DONE;
>> + int sev, asev, ret = NMI_DONE;
>>
>> if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
>> return ret;
>> @@ -945,8 +984,9 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
>> ret = NMI_HANDLED;
>> }
>>
>> + asev = ghes_actual_severity(ghes);
>> sev = ghes_severity(ghes->estatus->error_severity);
>
> So renaming ghes_deferrable_severity() to ghes_actual_severity() is not
> a big change. And that's not what I meant.

I'm sorry I misunderstood you.

> I'd like to see here:
>
> sev = ghes_severity(ghes);

sev = ghes_severity(ghes);


> and inside you do all the required mapping/severity processing/etc. And
> you can rename the current ghes_severity() to ghes_map_cper_severity()
> or whatever...

I agree that the current ghes_severity() name is vague. I'll get it done
properly in v4 (next week).

>> - if (sev >= GHES_SEV_PANIC) {
>> + if ((sev >= GHES_SEV_PANIC) && (asev >= GHES_SEV_PANIC)) {
>
> ... so that this change doesn't happen and there are not two severities
> queried but a single one.

Two severities is a result of the wanky GHES data structure. Nothing
says we have to use the severity field in the header... if you're okay
with just ignoring it.

Alex

2018-04-26 17:49:09

by Alexandru Gagniuc

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/3] acpi: apei: Warn when GHES marks correctable errors as "fatal"

On 04/26/2018 06:20 AM, Borislav Petkov wrote:
> Pasting the same comment from last time since you missed it:
>
> "No, I don't want any of that crap issuing stuff in dmesg and then people
> opening bugs and running around and trying to replace hardware.
>
> We either can handle the error and log a normal record somewhere or we
> cannot and explode. The complaining about the FW doesn't bring shit."

" Borislav, if you don't like the third patch in the series, feel free
to leave it out. THings will work beautifully with or without it."

:)

2018-04-26 18:05:36

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/3] acpi: apei: Warn when GHES marks correctable errors as "fatal"

On Thu, Apr 26, 2018 at 12:47:30PM -0500, Alex G. wrote:
> " Borislav, if you don't like the third patch in the series, feel free to
> leave it out. THings will work beautifully with or without it."

Then don't send it.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-05-02 19:11:14

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/3] acpi: apei: Warn when GHES marks correctable errors as "fatal"

On Thu 2018-04-26 13:20:57, Borislav Petkov wrote:
> On Wed, Apr 25, 2018 at 03:39:51PM -0500, Alexandru Gagniuc wrote:
> > There seems to be a culture amongst BIOS teams to want to crash the
> > OS when an error can't be handled in firmware. Marking GHES errors as
> > "fatal" is a very common way to do this.
> >
> > However, a number of errors reported by GHES may be fatal in the sense
> > a device or link is lost, but are not fatal to the system. When there
> > is a disagreement with firmware about the handleability of an error,
> > print a warning message.


> > +
> > + if ((sev >= GHES_SEV_PANIC) && (ghes_actual_severity(ghes) < sev)) {
> > + pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were able to correct");
> > + pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor");
>
> Pasting the same comment from last time since you missed it:
>
> "No, I don't want any of that crap issuing stuff in dmesg and then people
> opening bugs and running around and trying to replace hardware.

We want to see warnings. Maybe they can be toned done. We even have
dedicated distros for firmware testing.

> Good mailing practices for 400: avoid top-posting and trim the reply.

Good mailing practices -- limit use of four letter words on public lists.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2018-05-02 19:31:44

by Alexandru Gagniuc

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/3] acpi: apei: Warn when GHES marks correctable errors as "fatal"

On 05/02/2018 02:10 PM, Pavel Machek wrote:
> On Thu 2018-04-26 13:20:57, Borislav Petkov wrote:
>> On Wed, Apr 25, 2018 at 03:39:51PM -0500, Alexandru Gagniuc wrote:
>>> There seems to be a culture amongst BIOS teams to want to crash the
>>> OS when an error can't be handled in firmware. Marking GHES errors as
>>> "fatal" is a very common way to do this.
>>>
>>> However, a number of errors reported by GHES may be fatal in the sense
>>> a device or link is lost, but are not fatal to the system. When there
>>> is a disagreement with firmware about the handleability of an error,
>>> print a warning message.
>
>
>>> +
>>> + if ((sev >= GHES_SEV_PANIC) && (ghes_actual_severity(ghes) < sev)) {
>>> + pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were able to correct");
>>> + pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor");
>>
>> Pasting the same comment from last time since you missed it:
>>
>> "No, I don't want any of that crap issuing stuff in dmesg and then people
>> opening bugs and running around and trying to replace hardware.
>
> We want to see warnings. Maybe they can be toned done. We even have
> dedicated distros for firmware testing.

I'm told that had we had this warning when the r740 BIOS was in
development, we would have solved a lot of the issues that I'm currently
working on. That would, in turn, have exposed bigger issues, and we
would have had a platform to fix and test those bigger issues.

Hardware vendors who test on linux might be scratching their heads at
this error, though they tend to figure out what they're doing wrong, and
fix it.

One argument against was "expensive support calls", on which I call BS.
The firmware resources are expensive, but those are there whether or not
the customers call to complain.

Alex

>> Good mailing practices for 400: avoid top-posting and trim the reply.
>
> Good mailing practices -- limit use of four letter words on public lists.

Then can't show word 'four'.