The purpose of these changes is to see if we can safely de-escalate
the situation and notify the appropriate error handler. Since FFS
reports errors through NMIs or other non-standard mechanisms, we have
to be just a little more careful with reporting the error.
We're concerned with things such as being able to cross the NMI/IRQ
boundary, or being able to safely schedule work and notify the
appropriate subsystem. Once the notification is sent, our job is done.
I'm explicitly _NOT_ concerned with whether the error is handled or
not, especially since such concern reduces to a call to __ghes_panic().
There are rare cases that prevent us from de-escalating to lesser
contexts, such as uncorrectable memory errors in the kernel. In these
sorts of cases, trying to leave the NMI might cause a triple fault. James
Morse explained this very well when discussing v1 of this series. In
such cases, and only in such cases, are we justified in calling panic().
Once the error is safely sent on its merry way, it's really up to the
error handler to panic() or continue. For example, aer_recover_queue()
might fail for ungodly reasons. However, it's up to the AER code to
decide whether failing to queue an error for handling is panic-worthy.
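To make the intent concrete, here is a rough sketch of the NMI-path flow
this series aims for, using the per-section ghes_severity() from patch 2.
ghes_defer_to_irq_work() is a made-up name for illustration only; the
actual code keeps using the existing estatus queue and irq_work:

	static int ghes_in_nmi_sketch(struct ghes *ghes)
	{
		/* Per-section severity, not the header's error_severity field. */
		int sev = ghes_severity(ghes);

		/* Only errors we cannot safely carry out of the NMI justify panic(). */
		if (sev >= GHES_SEV_PANIC)
			__ghes_panic(ghes);

		/* Otherwise just queue the work; the error handler decides what
		 * happens next. */
		ghes_defer_to_irq_work(ghes);	/* hypothetical helper */
		return NMI_HANDLED;
	}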
Changes since v5:
- Removed zoological references from commit message
Changes since v4:
- Fix Freudian slip and use GHES_ instead of CPER_ enum
- Rephrased comments to clarify what we don't care about
Changes since v3:
- Renamed ghes_severity to something more concrete
- Reorganized code to make it look like more than just a rename
- Remembered to remove last patch in the series
Changes since v2:
- Due to popular request, simple is chosen over flexible
- Removed splitting of handlers into an IRQ-safe portion
- Changed behavior only for PCIe errors
Changes since v1:
- Due to popular request, the panic() is left in the NMI handler
- GHES AER handler is split into NMI and non-NMI portions
- ghes_notify_nmi() does not panic on deferrable errors
- The handlers are put in a mapping and given a common call signature
Alexandru Gagniuc (2):
acpi: apei: Rename ghes_severity() to ghes_cper_severity()
acpi: apei: Do not panic() on PCIe errors reported through GHES
drivers/acpi/apei/ghes.c | 63 +++++++++++++++++++++++++++++++++++++++---------
1 file changed, 52 insertions(+), 11 deletions(-)
--
2.14.3
The policy was to panic() when GHES said that an error is "Fatal".
This logic is wrong for several reasons, as it doesn't account for the
cause of the error.
PCIe fatal errors indicate that the link to a device is either
unstable or unusable. They don't indicate that the machine is on fire,
and they are not severe enough to justify a panic(). Do not blindly
rely on firmware to evaluate the severity for us. Instead, look at
the error severity based on what caused the error (GHES subsections).
Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/acpi/apei/ghes.c | 48 +++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 45 insertions(+), 3 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 7c1a16b106ba..9baaab798020 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -425,8 +425,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
* GHES_SEV_RECOVERABLE -> AER_NONFATAL
* GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
* These both need to be reported and recovered from by the AER driver.
- * GHES_SEV_PANIC does not make it to this handling since the kernel must
- * panic.
+ * GHES_SEV_PANIC -> AER_FATAL
*/
static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
{
@@ -459,6 +458,49 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
#endif
}
+/* PCIe errors should not cause a panic. */
+static int ghes_sec_pcie_severity(struct acpi_hest_generic_data *gdata)
+{
+ struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
+
+ if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
+ pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
+ IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))
+ return GHES_SEV_RECOVERABLE;
+
+ return ghes_cper_severity(gdata->error_severity);
+}
+
+/*
+ * The severity field in the status block is an unreliable metric for the
+ * severity. A more reliable way is to look at each subsection and see how safe
+ * it is to call the appropriate error handler.
+ * We're not concerned with handling the error. We're concerned with being able
+ * to notify an error handler by crossing the NMI/IRQ boundary, being able to
+ * schedule_work, and so forth.
+ * - SEC_PCIE: All PCIe errors can be handled by AER.
+ */
+static int ghes_severity(struct ghes *ghes)
+{
+ int worst_sev, sec_sev;
+ struct acpi_hest_generic_data *gdata;
+ const guid_t *section_type;
+ const struct acpi_hest_generic_status *estatus = ghes->estatus;
+
+ worst_sev = GHES_SEV_NO;
+ apei_estatus_for_each_section(estatus, gdata) {
+ section_type = (guid_t *)gdata->section_type;
+ sec_sev = ghes_cper_severity(gdata->error_severity);
+
+ if (guid_equal(section_type, &CPER_SEC_PCIE))
+ sec_sev = ghes_sec_pcie_severity(gdata);
+
+ worst_sev = max(worst_sev, sec_sev);
+ }
+
+ return worst_sev;
+}
+
static void ghes_do_proc(struct ghes *ghes,
const struct acpi_hest_generic_status *estatus)
{
@@ -944,7 +986,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
ret = NMI_HANDLED;
}
- sev = ghes_cper_severity(ghes->estatus->error_severity);
+ sev = ghes_severity(ghes);
if (sev >= GHES_SEV_PANIC) {
oops_begin();
ghes_print_queued_estatus();
--
2.14.3
ghes_severity() is a misnomer in this case, as it implies the severity
of the entire GHES structure. Instead, it maps one CPER value to a
GHES_SEV* value. ghes_cper_severity() is clearer.
Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/acpi/apei/ghes.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 1efefe919555..7c1a16b106ba 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -271,7 +271,7 @@ static void ghes_fini(struct ghes *ghes)
unmap_gen_v2(ghes);
}
-static inline int ghes_severity(int severity)
+static inline int ghes_cper_severity(int severity)
{
switch (severity) {
case CPER_SEV_INFORMATIONAL:
@@ -388,7 +388,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
#ifdef CONFIG_ACPI_APEI_MEMORY_FAILURE
unsigned long pfn;
int flags = -1;
- int sec_sev = ghes_severity(gdata->error_severity);
+ int sec_sev = ghes_cper_severity(gdata->error_severity);
struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
@@ -468,10 +468,10 @@ static void ghes_do_proc(struct ghes *ghes,
guid_t *fru_id = &NULL_UUID_LE;
char *fru_text = "";
- sev = ghes_severity(estatus->error_severity);
+ sev = ghes_cper_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
sec_type = (guid_t *)gdata->section_type;
- sec_sev = ghes_severity(gdata->error_severity);
+ sec_sev = ghes_cper_severity(gdata->error_severity);
if (gdata->validation_bits & CPER_SEC_VALID_FRU_ID)
fru_id = (guid_t *)gdata->fru_id;
@@ -512,7 +512,7 @@ static void __ghes_print_estatus(const char *pfx,
char pfx_seq[64];
if (pfx == NULL) {
- if (ghes_severity(estatus->error_severity) <=
+ if (ghes_cper_severity(estatus->error_severity) <=
GHES_SEV_CORRECTED)
pfx = KERN_WARNING;
else
@@ -534,7 +534,7 @@ static int ghes_print_estatus(const char *pfx,
static DEFINE_RATELIMIT_STATE(ratelimit_uncorrected, 5*HZ, 2);
struct ratelimit_state *ratelimit;
- if (ghes_severity(estatus->error_severity) <= GHES_SEV_CORRECTED)
+ if (ghes_cper_severity(estatus->error_severity) <= GHES_SEV_CORRECTED)
ratelimit = &ratelimit_corrected;
else
ratelimit = &ratelimit_uncorrected;
@@ -705,9 +705,8 @@ static int ghes_proc(struct ghes *ghes)
if (rc)
goto out;
- if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
+ if (ghes_cper_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC)
__ghes_panic(ghes);
- }
if (!ghes_estatus_cached(ghes->estatus)) {
if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
@@ -945,7 +944,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
ret = NMI_HANDLED;
}
- sev = ghes_severity(ghes->estatus->error_severity);
+ sev = ghes_cper_severity(ghes->estatus->error_severity);
if (sev >= GHES_SEV_PANIC) {
oops_begin();
ghes_print_queued_estatus();
--
2.14.3
On 5/21/2018 9:49 AM, Alexandru Gagniuc wrote:
> +/* PCIe errors should not cause a panic. */
> +static int ghes_sec_pcie_severity(struct acpi_hest_generic_data *gdata)
> +{
> + struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
> +
> + if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> + pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
> + IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))
> + return GHES_SEV_RECOVERABLE;
> +
> + return ghes_cper_severity(gdata->error_severity);
> +}
> +
> +/*
> + * The severity field in the status block is an unreliable metric for the
> + * severity. A more reliable way is to look at each subsection and see how safe
> + * it is to call the appropriate error handler.
> + * We're not concerned with handling the error. We're concerned with being able
> + * to notify an error handler by crossing the NMI/IRQ boundary, being able to
> + * schedule_work, and so forth.
> + * - SEC_PCIE: All PCIe errors can be handled by AER.
> + */
> +static int ghes_severity(struct ghes *ghes)
> +{
> + int worst_sev, sec_sev;
> + struct acpi_hest_generic_data *gdata;
> + const guid_t *section_type;
> + const struct acpi_hest_generic_status *estatus = ghes->estatus;
> +
> + worst_sev = GHES_SEV_NO;
> + apei_estatus_for_each_section(estatus, gdata) {
> + section_type = (guid_t *)gdata->section_type;
> + sec_sev = ghes_cper_severity(gdata->error_severity);
> +
> + if (guid_equal(section_type, &CPER_SEC_PCIE))
> + sec_sev = ghes_sec_pcie_severity(gdata);
> +
> + worst_sev = max(worst_sev, sec_sev);
> + }
> +
> + return worst_sev;
> +}
> +
> static void ghes_do_proc(struct ghes *ghes,
> const struct acpi_hest_generic_status *estatus)
> {
> @@ -944,7 +986,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
> ret = NMI_HANDLED;
> }
>
> - sev = ghes_cper_severity(ghes->estatus->error_severity);
> + sev = ghes_severity(ghes);
Hello Alex,
There is a compile warning if CONFIG_HAVE_ACPI_APEI_NMI is not selected.
CC drivers/acpi/apei/ghes.o
drivers/acpi/apei/ghes.c:483:12: warning: ‘ghes_severity’ defined but not used
[-Wunused-function]
static int ghes_severity(struct ghes *ghes)
^~~~~~~~~~~~~
Thanks,
Tyler
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
On 05/21/2018 09:27 AM, Tyler Baicar wrote:
> On 5/21/2018 9:49 AM, Alexandru Gagniuc wrote:
(snip)
> Hello Alex,
>
> There is a compile warning if CONFIG_HAVE_ACPI_APEI_NMI is not selected.
>
> CC drivers/acpi/apei/ghes.o
> drivers/acpi/apei/ghes.c:483:12: warning: ‘ghes_severity’ defined but
> not used [-Wunused-function]
> static int ghes_severity(struct ghes *ghes)
> ^~~~~~~~~~~~~
Thanks for finding that. It's an easy fix. Staged for v7.
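For reference, one way to silence that warning (my assumption only; not
necessarily the fix that was staged for v7) would be to annotate the
helper, since its only caller lives behind the NMI notifier:

	-static int ghes_severity(struct ghes *ghes)
	+static int __maybe_unused ghes_severity(struct ghes *ghes)

An #ifdef CONFIG_HAVE_ACPI_APEI_NMI guard around the new helpers would
work just as well.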
> Thanks,
> Tyler
>
On Mon, May 21, 2018 at 3:49 PM, Alexandru Gagniuc <[email protected]> wrote:
> ghes_severity() is a misnomer in this case, as it implies the severity
> of the entire GHES structure. Instead, it maps one CPER value to a
> GHES_SEV* value. ghes_cper_severity() is clearer.
It looks like the *real* reason for this change is that you
re-introduce ghes_severity() as a different function in the second
patch.
There are a couple of reasons to avoid that, one of them being that
people will now have to remember what the function did in which kernel
versions. Also, the current name is good enough IMO, so I'm not going
to apply this patch.
On Mon, May 21, 2018 at 3:49 PM, Alexandru Gagniuc <[email protected]> wrote:
> The policy was to panic() when GHES said that an error is "Fatal".
> This logic is wrong for several reasons, as it doesn't account for the
> cause of the error.
>
> PCIe fatal errors indicate that the link to a device is either
> unstable or unusable. They don't indicate that the machine is on fire,
But they very well may indicate just that AFAICS.
> and they are not severe enough to justify a panic(). Do not blindly
> rely on firmware to evaluate the severity for us. Instead, look at
> the error severity based on what caused the error (GHES subsections).
Which bit also comes from the firmware, right? So why is it regarded
as a better source of information?
Or are you trying to say that both of the pieces of information in
question should be consistent with each other? But if they aren't,
which one should we trust more and why?
> Signed-off-by: Alexandru Gagniuc <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 48 +++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 45 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 7c1a16b106ba..9baaab798020 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -425,8 +425,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
> * GHES_SEV_RECOVERABLE -> AER_NONFATAL
> * GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
> * These both need to be reported and recovered from by the AER driver.
> - * GHES_SEV_PANIC does not make it to this handling since the kernel must
> - * panic.
> + * GHES_SEV_PANIC -> AER_FATAL
> */
> static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
> {
> @@ -459,6 +458,49 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
> #endif
> }
>
> +/* PCIe errors should not cause a panic. */
This comment is not sufficient and it should go inside of the function.
> +static int ghes_sec_pcie_severity(struct acpi_hest_generic_data *gdata)
> +{
> + struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
> +
> + if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> + pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
> + IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))
> + return GHES_SEV_RECOVERABLE;
You have not explained convincingly enough why the above condition
makes sense at all.
> +
> + return ghes_cper_severity(gdata->error_severity);
> +}
> +
> +/*
> + * The severity field in the status block is an unreliable metric for the
> + * severity. A more reliable way is to look at each subsection and see how safe
> + * it is to call the appropriate error handler.
> + * We're not concerned with handling the error. We're concerned with being able
> + * to notify an error handler by crossing the NMI/IRQ boundary, being able to
> + * schedule_work, and so forth.
> + * - SEC_PCIE: All PCIe errors can be handled by AER.
Make this comment a proper kerneldoc or move it inside of the function.
> + */
> +static int ghes_severity(struct ghes *ghes)
> +{
> + int worst_sev, sec_sev;
> + struct acpi_hest_generic_data *gdata;
> + const guid_t *section_type;
> + const struct acpi_hest_generic_status *estatus = ghes->estatus;
> +
> + worst_sev = GHES_SEV_NO;
> + apei_estatus_for_each_section(estatus, gdata) {
> + section_type = (guid_t *)gdata->section_type;
> + sec_sev = ghes_cper_severity(gdata->error_severity);
> +
> + if (guid_equal(section_type, &CPER_SEC_PCIE))
> + sec_sev = ghes_sec_pcie_severity(gdata);
> +
> + worst_sev = max(worst_sev, sec_sev);
> + }
> +
> + return worst_sev;
> +}
> +
> static void ghes_do_proc(struct ghes *ghes,
> const struct acpi_hest_generic_status *estatus)
> {
> @@ -944,7 +986,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
> ret = NMI_HANDLED;
> }
>
> - sev = ghes_cper_severity(ghes->estatus->error_severity);
> + sev = ghes_severity(ghes);
> if (sev >= GHES_SEV_PANIC) {
> oops_begin();
> ghes_print_queued_estatus();
> --
On 05/22/2018 03:55 AM, Rafael J. Wysocki wrote:
> On Mon, May 21, 2018 at 3:49 PM, Alexandru Gagniuc <[email protected]> wrote:
>> ghes_severity() is a misnomer in this case, as it implies the severity
>> of the entire GHES structure. Instead, it maps one CPER value to a
>> GHES_SEV* value. ghes_cper_severity() is clearer.
>
> It looks like the *real* reason for this change is that you
> re-introduce ghes_severity() as a different function in the second
> patch.
/me holds fist at Borislav
> There are a couple of reasons to avoid that, one of them being that
> people will now have to remember what the function did in which kernel
> versions.
So?
> Also, the current name is good enough IMO,
Two other reviewers were extremely confused by the vague name, so no,
this is not good enough.
> so I'm not going to apply this patch.
On Tue, May 22, 2018 at 08:38:39AM -0500, Alex G. wrote:
> > It looks like the *real* reason for this change is that you
> > re-introduce ghes_severity() as a different function in the second
> > patch.
>
> /me holds fist at Borislav
That was a misunderstanding between Rafael and me - we fixed it on IRC.
But this is not the real problem with your approach - it is the marking
of all PCIe errors as recoverable, regardless of the signature. That's a
no-no, IMO.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On 05/22/2018 04:02 AM, Rafael J. Wysocki wrote:
> On Mon, May 21, 2018 at 3:49 PM, Alexandru Gagniuc <[email protected]> wrote:
>> The policy was to panic() when GHES said that an error is "Fatal".
>> This logic is wrong for several reasons, as it doesn't account for the
>> cause of the error.
>>
>> PCIe fatal errors indicate that the link to a device is either
>> unstable or unusable. They don't indicate that the machine is on fire,
>
> But they very well may indicate just that AFAICS.
I guess it's possible to set a machine on fire, and get a PCIe error as
one of the links melts. Although in that case, I doubt that how we try to
handle the error makes a difference. Sarcasm aside, my point is that it
makes little sense to crash a machine when we lose a PCIe link.
>> and they are not severe enough to justify a panic(). Do not blindly
>> rely on firmware to evaluate the severity for us. Instead, look at
>> the error severity based on what caused the error (GHES subsections).
>
> Which bit also comes from the firmware, right? So why is it regarded
> as a better source of information?
It's less bad (not using 'better') because it relates more closely to
the error than to the specific mechanism through which it is reported. The
header severity is an artificial concept that firmware has to make up,
whereas the subsection severity usually comes directly from hardware.
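To illustrate with a made-up, purely hypothetical record:

	acpi_hest_generic_status                  <- the "header"
	    .error_severity = CPER_SEV_FATAL         (firmware's summary judgement)
	    section[0]: CPER_SEC_PCIE
	        .error_severity = CPER_SEV_FATAL     (link dead; AER can still handle it)
	    section[1]: CPER_SEC_PLATFORM_MEM
	        .error_severity = CPER_SEV_CORRECTED

The proposed ghes_severity() walks the sections and takes the worst
per-section severity, rather than trusting the header value alone.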
> Or are you trying to say that both of the pieces of information in
> question should be consistent with each other? But if they aren't,
> which one should we trust more and why?
The header severity is letting someone else make the decisions for you.
(snip)
>> +/* PCIe errors should not cause a panic. */
>
> This comment is not sufficient and it should go inside of the function.
What would make a "sufficient" comment?
>> +static int ghes_sec_pcie_severity(struct acpi_hest_generic_data *gdata)
>> +{
>> + struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
>> +
>> + if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
>> + pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
>> + IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))
>> + return GHES_SEV_RECOVERABLE;
>
> You have not explained convincingly enough why the above condition
> makes sense at all.
Is this a test? I think it's self-explanatory:
How can you invoke a handler when you don't have a source for the error?
Or how can you invoke a handler when you don't have that handler?
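Spelled out against the patch, the condition is (same code, annotated):

	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&  /* we know which device */
	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&   /* we have the AER registers */
	    IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))                     /* and AER is built in */
		return GHES_SEV_RECOVERABLE;  /* safe to hand the error off to AER */

If any of those is missing, the code falls back to
ghes_cper_severity(gdata->error_severity), i.e. the old behavior.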
>> +
>> + return ghes_cper_severity(gdata->error_severity);
>> +}
>> +
>> +/*
>> + * The severity field in the status block is an unreliable metric for the
>> + * severity. A more reliable way is to look at each subsection and see how safe
>> + * it is to call the appropriate error handler.
>> + * We're not concerned with handling the error. We're concerned with being able
>> + * to notify an error handler by crossing the NMI/IRQ boundary, being able to
>> + * schedule_work, and so forth.
>> + * - SEC_PCIE: All PCIe errors can be handled by AER.
>
> Make this comment a proper kerneldoc or move it inside of the function.
I don't like moving long comments inside a function, as it breaks code
flow. Above-function explanation is also consistent with how
ghes_handle_aer() is documented.
Rafael, thank you very much for taking the time to review these patches.
Although, after reading through your emails, I'm at a loss on how you
want to solve the problem. It appears everyone has a very strong and
different opinion how to proceed.
I think the biggest problem is having a policy to panic on "fatal"
errors, instead of letting the error handler make that decision. I'd
much rather kill that stupid policy, but people seem to like it for some
reason.
Alex
On 05/22/2018 08:50 AM, Borislav Petkov wrote:
> On Tue, May 22, 2018 at 08:38:39AM -0500, Alex G. wrote:
>>> It looks like the *real* reason for this change is that you
>>> re-introduce ghes_severity() as a different function in the second
>>> patch.
>>
>> /me holds fist at Borislav
>
> That was a misunderstanding with Rafael and me - we fixed it on IRC.
You mean to say this whole time I've been struggling to write emails,
there was an IRC?
> But this is not the real problem with your approach - it is the marking
> of all PCIe errors as recoverable, regardless of the signature. That's a
> no-no, IMO.
No, the problem is with the current approach, not with mine. The problem
is trying to handle the error outside of the existing handler. That's a
no-no, IMO.
Alex
On Tue, May 22, 2018 at 09:39:15AM -0500, Alex G. wrote:
> No, the problem is with the current approach, not with mine. The problem
> is trying to handle the error outside of the existing handler. That's a
> no-no, IMO.
Let me save you some time: until you come up with a proper solution for
*all* PCIe errors so that the kernel can correctly decide what to do for
each error based on its actual severity, consider this NAKed.
I don't care about outside or inside of the handler - this thing needs
to be done properly and not just to serve your particular use case of
abrupt removal of devices causing PCIe errors, and punish the rest.
I especially don't want to have the case where a PCIe error is *really*
fatal and then we noodle in some handlers debating about the severity
because it got marked as recoverable intermittently and end up causing
data corruption on the storage device. Here's a real no-no for ya.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On 5/22/2018 10:32 AM, Alex G. wrote:
> I think the biggest problem is having a policy to panic on "fatal"
> errors, instead of letting the error handler make that decision. I'd
> much rather kill that stupid policy, but people seem to like it for some
> reason.
>
You can get around that panic and still have the error handled as AER_FATAL in
the current code. Your FW needs to mark the error as RECOVERABLE and then
set the CPER_SEC_RESET flag.
https://elixir.bootlin.com/linux/v4.17-rc6/source/drivers/acpi/apei/ghes.c#L450
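Roughly, the logic behind that link boils down to the mapping quoted
earlier in this thread (paraphrased, not the verbatim kernel code):

	aer_severity = cper_severity_to_aer(gdata->error_severity);
	/* RECOVERABLE in the CPER record plus the reset flag is promoted to
	 * AER_FATAL, so AER still performs a full link reset and recovery. */
	if (gdata->flags & CPER_SEC_RESET)
		aer_severity = AER_FATAL;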
Thanks,
Tyler
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
On 05/22/2018 10:15 AM, Tyler Baicar wrote:
> On 5/22/2018 10:32 AM, Alex G. wrote:
>> I think the biggest problem is having a policy to panic on "fatal"
>> errors, instead of letting the error handler make that decision. I'd
>> much rather kill that stupid policy, but people seem to like it for some
>> reason.
>>
> You can get around that panic and still have the error handled as
> AER_FATAL in
> the current code. Your FW needs to mark the error as RECOVERABLE and then
> set the CPER_SEC_RESET flag.
Of course, that would be ideal. But experience shows that firmware
doesn't do this. That's the whole point: firmware sends questionable data.
Alex
> https://elixir.bootlin.com/linux/v4.17-rc6/source/drivers/acpi/apei/ghes.c#L450
>
>
> Thanks,
> Tyler
>
On 05/22/2018 09:54 AM, Borislav Petkov wrote:
> On Tue, May 22, 2018 at 09:39:15AM -0500, Alex G. wrote:
>> No, the problem is with the current approach, not with mine. The problem
>> is trying to handle the error outside of the existing handler. That's a
>> no-no, IMO.
>
> Let me save you some time: until you come up with a proper solution for
> *all* PCIe errors so that the kernel can correctly decide what to do for
> each error based on its actual severity, consider this NAKed.
I do have a proper solution for _all_ PCIe errors. In fact, we discussed
several valid approaches already.
> I don't care about outside or inside of the handler
I do. I have a handler that can handle (no pun intended) errors. I want
to use the same code path in native and GHES cases. If I allow ghes.c to
take different decisions than what aer_do_recovery() would, I've failed.
>- this thing needs to be done properly
Exactly!
> and not just to serve your particular use case of
> abrupt removal of devices causing PCIe errors, and punish the rest.
I think you're confused about what I'm actually trying to do. Or maybe
you're confused about how PCIe errors work. That's understandable. PCIe
uses the term "fatal" for errors that may make the link unusable, and
which may require a link reset, and in most other specs "fatal" means
"on fire". I understand your confusion, and I hope I cleared it up.
You're trying to make the case that surprise removal is my only concern
and use case, because that's the example that I gave. It makes your
argument stronger, but it's wrong. You don't know our test setup, and
all the things I'm testing for, and whenever I try to tell you, you fall
back to the 'surprise removal' example.
I don't know why you'd think Dell would pay me to work on this if I were
to allow things like silent data corruption to creep in. This isn't a
traditional company from Redmond, Washington.
> I especially don't want to have the case where a PCIe error is *really*
> fatal and then we noodle in some handlers debating about the severity
> because it got marked as recoverable intermittently and end up causing
> data corruption on the storage device. Here's a real no-no for ya.
I especially don't want a kernel maintainer who hasn't even read the
recovery handler (let alone the spec around which the handler was
written) tell me how the recovery handler works and what it's supposed
to do (see, I can be an ass).
PCIe errors really are fatal. They might require us to unload the driver
and remove the device. But somebody set the questionable policy that
"fatal"=="panic", and that is completely inappropriate for a larger
class of errors -- PCIe happens to be the easiest example to pick on.
And even you realize that the argument that a panic() will somehow
prevent data corruption is complete noodle sauce.
Alex
On Tue, May 22, 2018 at 10:22:19AM -0500, Alex G. wrote:
> I especially don't want a kernel maintainer who hasn't even read the
> recovery handler (let alone the spec around which the handler was
> written) tell me how the recovery handler works and what it's supposed
> to do (see, I can be an ass).
Well, since you're such an ass and since you know better, go find
someone else to review your crap. I'm done wasting time with you.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote:
> I especially don't want to have the case where a PCIe error is *really*
> fatal and then we noodle in some handlers debating about the severity
> because it got marked as recoverable intermittently and end up causing
> data corruption on the storage device. Here's a real no-no for ya.
All that we have is a message from the BIOS that this is a "fatal"
error. When did we start trusting the BIOS to give us accurate
information?
PCIe fatal means that the link or the device is broken. But that
seems a poor reason to take down a large server that may have
dozens of devices (some of them set up specifically to handle
errors ... e.g. mirrored disks on separate controllers, or NIC
devices that have been "bonded" together).
So, as long as the action for a "fatal" error is to mark a link
down and offline the device, that seems a pretty reasonable course
of action.
The argument gets a lot more marginal if you simply reset the
link and re-enable the device to "fix" it. That might be enough,
but I don't think the OS has enough data to make the call.
-Tony
P.S. I deliberately put "fatal" in quotes above because to
quote "The Princess Bride" -- "that word, I do not think it
means what you think it means". :-)
On Tue, May 22, 2018 at 7:57 PM, Luck, Tony <[email protected]> wrote:
> On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote:
>> I especially don't want to have the case where a PCIe error is *really*
>> fatal and then we noodle in some handlers debating about the severity
>> because it got marked as recoverable intermittently and end up causing
>> data corruption on the storage device. Here's a real no-no for ya.
>
> All that we have is a message from the BIOS that this is a "fatal"
> error. When did we start trusting the BIOS to give us accurate
> information?
Some time ago, actually.
This is about changing the existing behavior which has been to treat
"fatal" errors reported by the BIOS as good enough reasons for a panic
for quite a while AFAICS.
> PCIe fatal means that the link or the device is broken.
And that may really mean that the component in question is on fire.
We just don't know.
> But that seems a poor reason to take down a large server that may have
> dozens of devices (some of them set up specifically to handle
> errors ... e.g. mirrored disks on separate controllers, or NIC
> devices that have been "bonded" together).
>
> So, as long as the action for a "fatal" error is to mark a link
> down and offline the device, that seems a pretty reasonable course
> of action.
>
> The argument gets a lot more marginal if you simply reset the
> link and re-enable the device to "fix" it. That might be enough,
> but I don't think the OS has enough data to make the call.
Again, that's about changing the existing behavior or the existing policy even.
What exactly has changed to make us consider this now?
Thanks,
Rafael
On Tue, May 22, 2018 at 3:38 PM, Alex G. <[email protected]> wrote:
>
>
> On 05/22/2018 03:55 AM, Rafael J. Wysocki wrote:
>> On Mon, May 21, 2018 at 3:49 PM, Alexandru Gagniuc <[email protected]> wrote:
>>> ghes_severity() is a misnomer in this case, as it implies the severity
>>> of the entire GHES structure. Instead, it maps one CPER value to a
>>> GHES_SEV* value. ghes_cper_severity() is clearer.
>>
>> It looks like the *real* reason for this change is that you
>> re-introduce ghes_severity() as a different function in the second
>> patch.
>
> /me holds fist at Borislav
>
>> There are a couple of reasons to avoid that, one of them being that
>> people will now have to remember what the function did in which kernel
>> versions.
>
> So?
>
>> Also, the current name is good enough IMO,
>
> Two other reviewers were extremely confused by the vague name, so no,
> this is not good enough.
Of course, you are free to have a differing opinion and I don't have
to convince you about my point. You need to convince me about your
point to get the patch in through my tree, which you haven't done so
far.
>> so I'm not going to apply this patch.
On 05/22/2018 12:57 PM, Luck, Tony wrote:
> On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote:
>> I especially don't want to have the case where a PCIe error is *really*
>> fatal and then we noodle in some handlers debating about the severity
>> because it got marked as recoverable intermittently and end up causing
>> data corruption on the storage device. Here's a real no-no for ya.
>
> All that we have is a message from the BIOS that this is a "fatal"
> error. When did we start trusting the BIOS to give us accurate
> information?
When we merged ACPI handling.
> PCIe fatal means that the link or the device is broken. But that
> seems a poor reason to take down a large server that may have
> dozens of devices (some of them set up specifically to handle
> errors ... e.g. mirrored disks on separate controllers, or NIC
> devices that have been "bonded" together).
>
> So, as long as the action for a "fatal" error is to mark a link
> down and offline the device, that seems a pretty reasonable course
> of action.
>
> The argument gets a lot more marginal if you simply reset the
> link and re-enable the device to "fix" it. That might be enough,
> but I don't think the OS has enough data to make the call.
I'm not 100% satisfied with how the AER handler works, and how certain
drivers (nvme!!!) interface with AER handling. But this is an argument
that belongs in PCI code, and a fight I will fight with Bjorn and Keith.
The issue we're having with Borislav and Rafael's estate is that we
can't make it to PCI land.
I'm seeing here the same fight that I saw with firmware vs. OS, where the
firmware wants to have control, and the OS wants to have control. I saw the
same with the ME/CSE/CSME team vs. the ChromeOS team, where the ME team did
everything possible to make sure only they could access the boot vector and
boot the processor, and the ChromeOS team couldn't use this approach because
they wanted their own root of trust. I've seen this in other places as well,
though confidentiality agreements prevent me from talking about it.
It's the issue of control, and it's a fact of life. Borislav and Rafael
don't want to relinquish control until they can be 100% certain that
going further will result in 100% recovery. That is a goal I aspire to
as well, but an unachievable ideal nonetheless.
I thought the best compromise would be to be as close as possible to
native handling. That is, if AER can't recover, we don't recover the
device, but the machine keeps running. I think there's some deeper
history to GHES handling, which I didn't take into consideration. The
fight is to convince appropriate parties to share the responsibility in
a way which doesn't kill the machine. We still have a ways to go until
we get there.
Alex
> -Tony
>
> P.S. I deliberately put "fatal" in quotes above because to
> quote "The Princess Bride" -- "that word, I do not think it
> means what you think it means". :-)
>
On 05/22/2018 01:10 PM, Rafael J. Wysocki wrote:
> On Tue, May 22, 2018 at 7:57 PM, Luck, Tony <[email protected]> wrote:
>> On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote:
>>> I especially don't want to have the case where a PCIe error is *really*
>>> fatal and then we noodle in some handlers debating about the severity
>>> because it got marked as recoverable intermittently and end up causing
>>> data corruption on the storage device. Here's a real no-no for ya.
>>
>> All that we have is a message from the BIOS that this is a "fatal"
>> error. When did we start trusting the BIOS to give us accurate
>> information?
>
> Some time ago, actually.
>
> This is about changing the existing behavior which has been to treat
> "fatal" errors reported by the BIOS as good enough reasons for a panic
> for quite a while AFAICS.
Yes, you are correct. I'd actually like to go deeper, and remove the
policy to panic() on fatal errors. Now whether we blacklist or whitelist
which errors can go through is up for debate, but the current policy
seems broken.
>> PCIe fatal means that the link or the device is broken.
>
> And that may really mean that the component in question is on fire.
> We just don't know.
Should there be a physical fire, we have much bigger issues. At the same
time, we could retrain the link, call the driver, and release freon gas
to put out the fire. That's something we don't currently have the option
to do.
>> But that seems a poor reason to take down a large server that may have
>> dozens of devices (some of them set up specifically to handle
>> errors ... e.g. mirrored disks on separate controllers, or NIC
>> devices that have been "bonded" together).
>>
>> So, as long as the action for a "fatal" error is to mark a link
>> down and offline the device, that seems a pretty reasonable course
>> of action.
>>
>> The argument gets a lot more marginal if you simply reset the
>> link and re-enable the device to "fix" it. That might be enough,
>> but I don't think the OS has enough data to make the call.
>
> Again, that's about changing the existing behavior or the existing policy even.
>
> What exactly has changed to make us consider this now?
Firmware started passing "fatal" GHES headers with the explicit intent
of crashing an OS. At the same time, we've learnt how to handle these
errors in a number of cases. With DPC (coming soon to firmware-first)
the error is contained, and a non-issue.
As specs and hardware implementations evolve, we have to adapt. I'm here
until November, and one of my goals is to involve Linux upstream in the
development of these features so that when the hardware hits the market,
we're ready. That does mean we have to drop some of the silly things
we're doing.
Alex
> Thanks,
> Rafael
>
On 05/22/2018 01:13 PM, Rafael J. Wysocki wrote:
(snip)
> Of course, you are free to have a differing opinion and I don't have
> to convince you about my point. You need to convince me about your
> point to get the patch in through my tree, which you haven't done so
> far.
My point is that crossing your arms and saying "impress me" doesn't give
me a good idea of how you'd like to see the problem resolved.
Alex
On Tue, May 22, 2018 at 08:10:47PM +0200, Rafael J. Wysocki wrote:
> > PCIe fatal means that the link or the device is broken.
>
> And that may really mean that the component in question is on fire.
> We just don't know.
Components on fire could be the root cause of many errors. If we really
believe that is a problem, we should power the system off rather than
just calling panic() [not just for PCIe errors, but also for machine
checks, and perhaps a bunch of other places in the kernel].
True story: I used to work for Stratus Computer on fault-tolerant
systems. A customer once called in with a "my computer is on fire"
report and asked what to do. The support person told them to power it
off. The customer asked, "Isn't there something else? It's still running
just fine".
-Tony
On Tue, May 22, 2018 at 01:19:34PM -0500, Alex G. wrote:
> Firmware started passing "fatal" GHES headers with the explicit intent of
> crashing an OS. At the same time, we've learnt how to handle these errors in
> a number of cases. With DPC (coming soon to firmware-first) the error is
> contained, and a non-issue.
Perhaps DPC is the change that you need to emphasize as to
why things are different now, so we can change the default
Linux behavior.
With the h/w guaranteeing that corrupt data is contained, we
should be safe to disregard BIOS indications of "fatal" problems
that could be anything and might show up in unknown ways some
time later if we keep running.
-Tony
On 05/22/2018 01:45 PM, Luck, Tony wrote:
> On Tue, May 22, 2018 at 01:19:34PM -0500, Alex G. wrote:
>> Firmware started passing "fatal" GHES headers with the explicit intent of
>> crashing an OS. At the same time, we've learnt how to handle these errors in
>> a number of cases. With DPC (coming soon to firmware-first) the error is
>> contained, and a non-issue.
>
> Perhaps DPC is the change that you need to emphasize as to
> why things are different now, so we can change the default
> Linux behavior.
>
> With the h/w guaranteeing that corrupt data is contained, we
> should be safe to disregard BIOS indications of "fatal" problems
> that could be anything and might show up in unknown ways some
> time later if we keep running.
Sure. DPC is much harder to contest as a reason. However, the AER path
also benefits from this change in behavior. I'm certain there are
other classes of errors that benefit from the change as well, though I
haven't had the time or the inclination to look for them.
Alex
On Tue, May 22, 2018 at 8:20 PM, Alex G. <[email protected]> wrote:
> On 05/22/2018 01:13 PM, Rafael J. Wysocki wrote:
> (snip)
>>
>> Of course, you are free to have a differing opinion and I don't have
>> to convince you about my point. You need to convince me about your
>> point to get the patch in through my tree, which you haven't done so
>> far.
>
>
> My point is that crossing your arms and saying "impress me" doesn't give me
> a good idea of how you'd like to see the problem resolved.
I do not recall having said "impress me" at any time during our
conversation so far.
Also, this problem space is not the one I'm focusing on, and I can't
tell you off the top of my head how to address the problem you want to
address in that space. I can only look at what you post and decide
whether or not I agree with that. It has to be well documented and
convincing enough, however, and it hasn't been so far in my view. As
a rule, I'd rather not apply patches that I don't feel confident about.
Tony seems to agree with you, so maybe work with him on making your
case stronger.