Hi,
I'm helping Dell work through the issues related to PCIe and NVMe
hotplug. Although hotplug generally works, there are corner cases such as
pin bounce, drive failures and surprise removal that are not 100% worked out.
Because of this, NVMe is not yet at feature parity with SCSI and SAS.
One of the interesting issues is that most server vendors like to use
firmware-first (FFS) error handling, for various reasons that I won't go
into. The side effect of that is that we oftentimes don't even get a stab
at correcting the problem.
This is especially troublesome for NVMe, which needs PCIe hotplug to work
correctly. When we do get a stab, it's after FFS can't handle a fatal error,
and we're told of it through ACPI tables. On x86, this happens through an
NMI, and as soon as we see a "fatal" error, we panic().
One problem with this FFS approach is that AER never even gets notified of
the issue. Yet even without AER, if a PCIe drive were to stop responding,
the nvme driver or higher-level block drivers would notice that something
is wrong. Unless there is a physical defect or silicon bug, AER can recover
the link.
Another issue we're seeing with FFS is that BIOSes assume that an OS will crash
on a fatal error reported through ACPI. Sometimes they will leave hardware in
a "kind of" working state, or will fail to re-arm the errors. From what I've
observed, this happens on hardware with silicon bugs. For example, PCIe root
ports are unaffected, but certain PCIe switches may stop issuing hotplug
interrupts. It's just another headache with FFS.
While I don't expect server vendors to drop FFS in favor of native AER control,
I do think we can harden Linux against the idiosyncrasies of such systems. The
scope of these patches is to protect against poorly designed firmware, and
perform proper error handling when possible. It is not to make FFS a first
class citizen in error handling.
Alexandru Gagniuc (4):
acpi: apei: Return severity of GHES messages after handling
acpi: apei: Swap ghes_print_queued_estatus and ghes_proc_in_irq
acpi: apei: Do not panic() in NMI because of GHES messages
acpi: apei: Warn when GHES marks correctable errors as "fatal"
drivers/acpi/apei/ghes.c | 100 ++++++++++++++++++++++++++++++-----------------
1 file changed, 64 insertions(+), 36 deletions(-)
--
2.14.3
Move ghes_print_queued_estatus() above ghes_proc_in_irq(). In a
subsequent patch, the NMI handler will be updated, and the print
functionality will be used in ghes_proc_in_irq.
This simply makes the subsequent diff look sane.
Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/acpi/apei/ghes.c | 40 ++++++++++++++++++++--------------------
1 file changed, 20 insertions(+), 20 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 25cf77a18e0a..2c998125b1d5 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -870,13 +870,12 @@ static atomic_t ghes_in_nmi = ATOMIC_INIT(0);
static LIST_HEAD(ghes_nmi);
-static void ghes_proc_in_irq(struct irq_work *irq_work)
+static void ghes_print_queued_estatus(void)
{
- struct llist_node *llnode, *next;
+ struct llist_node *llnode;
struct ghes_estatus_node *estatus_node;
struct acpi_hest_generic *generic;
struct acpi_hest_generic_status *estatus;
- u32 len, node_len;
llnode = llist_del_all(&ghes_estatus_llist);
/*
@@ -885,30 +884,22 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
*/
llnode = llist_reverse_order(llnode);
while (llnode) {
- next = llnode->next;
estatus_node = llist_entry(llnode, struct ghes_estatus_node,
llnode);
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
- len = cper_estatus_len(estatus);
- node_len = GHES_ESTATUS_NODE_LEN(len);
- ghes_do_proc(estatus_node->ghes, estatus);
- if (!ghes_estatus_cached(estatus)) {
- generic = estatus_node->generic;
- if (ghes_print_estatus(NULL, generic, estatus))
- ghes_estatus_cache_add(generic, estatus);
- }
- gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
- node_len);
- llnode = next;
+ generic = estatus_node->generic;
+ ghes_print_estatus(NULL, generic, estatus);
+ llnode = llnode->next;
}
}
-static void ghes_print_queued_estatus(void)
+static void ghes_proc_in_irq(struct irq_work *irq_work)
{
- struct llist_node *llnode;
+ struct llist_node *llnode, *next;
struct ghes_estatus_node *estatus_node;
struct acpi_hest_generic *generic;
struct acpi_hest_generic_status *estatus;
+ u32 len, node_len;
llnode = llist_del_all(&ghes_estatus_llist);
/*
@@ -917,12 +908,21 @@ static void ghes_print_queued_estatus(void)
*/
llnode = llist_reverse_order(llnode);
while (llnode) {
+ next = llnode->next;
estatus_node = llist_entry(llnode, struct ghes_estatus_node,
llnode);
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
- generic = estatus_node->generic;
- ghes_print_estatus(NULL, generic, estatus);
- llnode = llnode->next;
+ len = cper_estatus_len(estatus);
+ node_len = GHES_ESTATUS_NODE_LEN(len);
+ ghes_do_proc(estatus_node->ghes, estatus);
+ if (!ghes_estatus_cached(estatus)) {
+ generic = estatus_node->generic;
+ if (ghes_print_estatus(NULL, generic, estatus))
+ ghes_estatus_cache_add(generic, estatus);
+ }
+ gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
+ node_len);
+ llnode = next;
}
}
--
2.14.3
BIOSes like to send NMIs for a number of silly reasons often deemed
to be "fatal". For example, pin bounce during a PCIe hotplug/unplug
might cause the link to go down and retrain, with fatal PCIe errors
being generated while the link is retraining.
Instead of panic()ing in NMI context, pass fatal errors down to IRQ
context to see if they can be resolved.
With these changes, PCIe errors are handled by AER. Other, far less
common errors, such as machine check exceptions, still cause a panic()
in their respective handlers.
Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/acpi/apei/ghes.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 2c998125b1d5..7243a99ea57e 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -428,8 +428,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
* GHES_SEV_RECOVERABLE -> AER_NONFATAL
* GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
* These both need to be reported and recovered from by the AER driver.
- * GHES_SEV_PANIC does not make it to this handling since the kernel must
- * panic.
+ * GHES_SEV_PANIC -> AER_FATAL
*/
static bool ghes_handle_aer(struct acpi_hest_generic_data *gdata)
{
@@ -899,6 +898,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
struct ghes_estatus_node *estatus_node;
struct acpi_hest_generic *generic;
struct acpi_hest_generic_status *estatus;
+ int corrected_sev;
u32 len, node_len;
llnode = llist_del_all(&ghes_estatus_llist);
@@ -914,7 +914,14 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
len = cper_estatus_len(estatus);
node_len = GHES_ESTATUS_NODE_LEN(len);
- ghes_do_proc(estatus_node->ghes, estatus);
+ corrected_sev = ghes_do_proc(estatus_node->ghes, estatus);
+
+ if (corrected_sev >= GHES_SEV_PANIC) {
+ oops_begin();
+ ghes_print_queued_estatus();
+ __ghes_panic(estatus_node->ghes);
+ }
+
if (!ghes_estatus_cached(estatus)) {
generic = estatus_node->generic;
if (ghes_print_estatus(NULL, generic, estatus))
@@ -955,7 +962,7 @@ static void __process_error(struct ghes *ghes)
static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
{
struct ghes *ghes;
- int sev, ret = NMI_DONE;
+ int ret = NMI_DONE;
if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
return ret;
@@ -968,13 +975,6 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
ret = NMI_HANDLED;
}
- sev = ghes_severity(ghes->estatus->error_severity);
- if (sev >= GHES_SEV_PANIC) {
- oops_begin();
- ghes_print_queued_estatus();
- __ghes_panic(ghes);
- }
-
if (!(ghes->flags & GHES_TO_CLEAR))
continue;
--
2.14.3
The current policy is to simply panic() on GHES fatal errors. Oftentimes,
however, we can correct "fatal" errors; e.g. "fatal" PCIe errors can be
recovered via AER. When such errors are corrected, it doesn't make sense
to panic().
Update ghes_do_proc() to return the severity of the worst error, while
marking handled errors as corrected.
Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/acpi/apei/ghes.c | 35 +++++++++++++++++++++++++++++------
1 file changed, 29 insertions(+), 6 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 1efefe919555..25cf77a18e0a 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -383,7 +383,7 @@ static void ghes_clear_estatus(struct ghes *ghes)
ghes->flags &= ~GHES_TO_CLEAR;
}
-static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int sev)
+static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int sev)
{
#ifdef CONFIG_ACPI_APEI_MEMORY_FAILURE
unsigned long pfn;
@@ -411,7 +411,10 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
if (flags != -1)
memory_failure_queue(pfn, flags);
+
+ return true;
#endif
+ return false;
}
/*
@@ -428,7 +431,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
* GHES_SEV_PANIC does not make it to this handling since the kernel must
* panic.
*/
-static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
+static bool ghes_handle_aer(struct acpi_hest_generic_data *gdata)
{
#ifdef CONFIG_ACPI_APEI_PCIEAER
struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
@@ -456,20 +459,33 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
(struct aer_capability_regs *)
pcie_err->aer_info);
}
+
+ return true;
#endif
+ return false;
}
-static void ghes_do_proc(struct ghes *ghes,
+/*
+ * Handle GHES messages, and return the highest encountered severity.
+ * Errors which are handled are considered to be CORRECTED. The severity is
+ * taken from each GHES error data entry, not the error status block.
+ * An error is considered corrected if it can be dispatched to an appropriate
+ * handler. However, simply logging an error is not enough to "correct" it.
+ */
+static int ghes_do_proc(struct ghes *ghes,
const struct acpi_hest_generic_status *estatus)
{
- int sev, sec_sev;
+ int sev, sec_sev, corrected_sev;
struct acpi_hest_generic_data *gdata;
guid_t *sec_type;
guid_t *fru_id = &NULL_UUID_LE;
char *fru_text = "";
+ bool handled;
+ corrected_sev = GHES_SEV_NO;
sev = ghes_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
+ handled = false;
sec_type = (guid_t *)gdata->section_type;
sec_sev = ghes_severity(gdata->error_severity);
if (gdata->validation_bits & CPER_SEC_VALID_FRU_ID)
@@ -484,10 +500,10 @@ static void ghes_do_proc(struct ghes *ghes,
ghes_edac_report_mem_error(ghes, sev, mem_err);
arch_apei_report_mem_error(sev, mem_err);
- ghes_handle_memory_failure(gdata, sev);
+ handled = ghes_handle_memory_failure(gdata, sev);
}
else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
- ghes_handle_aer(gdata);
+ handled = ghes_handle_aer(gdata);
}
else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
@@ -500,7 +516,14 @@ static void ghes_do_proc(struct ghes *ghes,
sec_sev, err,
gdata->error_data_length);
}
+
+ if (sec_sev >= GHES_SEV_RECOVERABLE && handled)
+ sec_sev = GHES_SEV_CORRECTED;
+
+ corrected_sev = max(corrected_sev, sec_sev);
}
+
+ return corrected_sev;
}
static void __ghes_print_estatus(const char *pfx,
--
2.14.3
There seems to be a culture amongst BIOS teams of wanting to crash the
OS when an error can't be handled in firmware. Marking GHES errors as
"fatal" is a very common way to do this.
However, a number of errors reported by GHES may be fatal in the sense
that a device or link is lost, but are not fatal to the system. When there
is a disagreement with firmware about whether an error is handleable,
print a warning message.
Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/acpi/apei/ghes.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 7243a99ea57e..d9123c5fa3d9 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -521,6 +521,11 @@ static int ghes_do_proc(struct ghes *ghes,
corrected_sev = max(corrected_sev, sec_sev);
}
+
+ if ((sev >= GHES_SEV_PANIC) && (corrected_sev < sev)) {
+ pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were able to correct\n");
+ pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor\n");
+ }
return corrected_sev;
}
--
2.14.3
Hi Alexandru,
On 03/04/18 18:08, Alexandru Gagniuc wrote:
> BIOSes like to send NMIs for a number of silly reasons often deemed
> to be "fatal". For example, pin bounce during a PCIe hotplug/unplug
> might cause the link to go down and retrain, with fatal PCIe errors
> being generated while the link is retraining.
Sounds fun!
> Instead of panic()ing in NMI context, pass fatal errors down to IRQ
> context to see if they can be resolved.
How do we know we will survive this trip?
On arm64 systems it may not be possible to return to the context we took the NMI
notification from: we may bounce back into firmware with the same "world is on
fire" error. Firmware can rightly assume the OS has made no attempt to handle
the error. Your 'not re-arming the error' example makes this worrying.
> With these changes, PCIe errors are handled by AER. Other, far less
> common errors, such as machine check exceptions, still cause a panic()
> in their respective handlers.
I agree AER is always going to be different. Could we take a look at the CPER
records while still in_nmi() to decide whether linux knows better than firmware?
For non-standard or processor-errors I think we should always panic() if they're
marked as fatal.
For memory-errors we could split memory_failure() up to have
{NMI,IRQ,process}-context helpers, all we need to know at NMI-time is whether
the affected memory is kernel memory.
For the PCI*/AER errors we should be able to try and handle it ... if we can get
back to process/IRQ context:
What happens if a PCIe driver has interrupts masked and is doing something to
cause this error? We can take the NMI and setup the irq-work, but it may never
run as we return to interrupts-masked context.
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 2c998125b1d5..7243a99ea57e 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -955,7 +962,7 @@ static void __process_error(struct ghes *ghes)
> static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
> {
> struct ghes *ghes;
> - int sev, ret = NMI_DONE;
> + int ret = NMI_DONE;
>
> if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
> return ret;
> @@ -968,13 +975,6 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
> ret = NMI_HANDLED;
> }
>
> - sev = ghes_severity(ghes->estatus->error_severity);
> - if (sev >= GHES_SEV_PANIC) {
> - oops_begin();
> - ghes_print_queued_estatus();
> - __ghes_panic(ghes);
> - }
> -
> if (!(ghes->flags & GHES_TO_CLEAR))
> continue;
For Processor-errors I think this is the wrong thing to do, but we should be
able to poke around in the CPER records and find out what we are dealing with.
Thanks,
James
Hi James,
On 04/04/2018 02:18 AM, James Morse wrote:
> Hi Alexandru,
>
> On 03/04/18 18:08, Alexandru Gagniuc wrote:
>> BIOSes like to send NMIs for a number of silly reasons often deemed
>> to be "fatal". For example, pin bounce during a PCIe hotplug/unplug
>> might cause the link to go down and retrain, with fatal PCIe errors
>> being generated while the link is retraining.
>
> Sounds fun!
>
>
>> Instead of panic()ing in NMI context, pass fatal errors down to IRQ
>> context to see if they can be resolved.
>
> How do we know we will survive this trip?
We don't.
> On arm64 systems it may not be possible to return to the context we took the NMI
> notification from: we may bounce back into firmware with the same "world is on
> fire" error. Firmware can rightly assume the OS has made no attempt to handle
> the error.
I'm not aware of the NMI path being used on arm64:
$ git grep 'select HAVE_ACPI_APEI_NMI'
arch/x86/Kconfig: select HAVE_ACPI_APEI_NMI if ACPI
$
As far as jumping back into firmware, the OS should clear the GHES block
status before exiting the NMI. This tells the firmware that the OS got
the message [1].
I'm not seeing any mechanism to say "handled/not handled", so I'm not
sure how firmware can make such assumptions.
> Your 'not re-arming the error' example makes this worrying.
Those are one class of bugs you get when you let firmware run the show.
It's been an issue since day one of EFI. The best you can do in such
cases is accept you may later lose the link to a device.
Side-note: We have a BIOS defect filed against this particular
issue, and it only seems to affect PCIe SMIs.
>> With these changes, PCIe errors are handled by AER. Other, far less
>> common errors, such as machine check exceptions, still cause a panic()
>> in their respective handlers.
>
> I agree AER is always going to be different. Could we take a look at the CPER
> records while still in_nmi() to decide whether linux knows better than firmware?
>
> For non-standard or processor-errors I think we should always panic() if they're
> marked as fatal.
> For memory-errors we could split memory_failure() up to have
> {NMI,IRQ,process}-context helpers, all we need to know at NMI-time is whether
> the affected memory is kernel memory.
I'm trying to avoid splitting up the recovery logic based on _how_ we
were notified of the error. The only sure way to know if you've handled
an error is to try to handle it. Not handling _any_ fatal error still
results in a panic(), and I was very careful to ensure that.
The alternative is to keep ghes_do_proc() and ghes_notify_nmi() in sync,
duplicate code between the two. But we can also get ghes_proc_irq_work()
as an entry point, and this easily turns into a maintenance nightmare.
> For the PCI*/AER errors we should be able to try and handle it ... if we can get
> back to process/IRQ context:
> What happens if a PCIe driver has interrupts masked and is doing something to
> cause this error? We can take the NMI and setup the irq-work, but it may never
> run as we return to interrupts-masked context.
The way recovery from NMI is set up is confusing. On NMIs,
ghes_proc_in_irq() is queued up to do further work. Despite the name, in
this case it runs as a task. That, in turn, queues up AER work.
Either way, PCIe endpoint interrupts are irrelevant here. AER recovery
runs independently of drivers or interrupts:
- Drivers that implement AER recovery services are expected to do the
right thing and shut down IO. Not doing so is a bug in the driver.
- Drivers that do not implement AER have their devices taken offline.
It is not uncommon for PCIe errors to amplify. Imagine losing the link
right after you've fired off several DMA queues. You'd get a stream of
Unsupported Request errors. AER handles this fairly well.
>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 2c998125b1d5..7243a99ea57e 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -955,7 +962,7 @@ static void __process_error(struct ghes *ghes)
>> static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
>> {
>> struct ghes *ghes;
>> - int sev, ret = NMI_DONE;
>> + int ret = NMI_DONE;
>>
>> if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
>> return ret;
>> @@ -968,13 +975,6 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
>> ret = NMI_HANDLED;
>> }
>>
>> - sev = ghes_severity(ghes->estatus->error_severity);
>> - if (sev >= GHES_SEV_PANIC) {
>> - oops_begin();
>> - ghes_print_queued_estatus();
>> - __ghes_panic(ghes);
>> - }
>> -
>> if (!(ghes->flags & GHES_TO_CLEAR))
>> continue;
>
> For Processor-errors I think this is the wrong thing to do, but we should be
> able to poke around in the CPER records and find out what we are dealing with.
I don't think being trigger-happy with the panic button is ever a good
thing. The logic to stop the system exists downstream, and is being
called. I can see an issue if we somehow lose the ability to run queued
tasks, but in that case, we've already lost control of the machine.
Alex
[1] ACPI 6.2 spec.
18.3.2.8 Generic Hardware Error Source version 2 (GHESv2 - Type 10)
> Thanks,
>
> James
>
Hi Alex,
On 04/04/18 16:33, Alex G. wrote:
> On 04/04/2018 02:18 AM, James Morse wrote:
>> On 03/04/18 18:08, Alexandru Gagniuc wrote:
>>> BIOSes like to send NMIs for a number of silly reasons often deemed
>>> to be "fatal". For example, pin bounce during a PCIe hotplug/unplug
>>> might cause the link to go down and retrain, with fatal PCIe errors
>>> being generated while the link is retraining.
>>> Instead of panic()ing in NMI context, pass fatal errors down to IRQ
>>> context to see if they can be resolved.
>>
>> How do we know we will survive this trip?
>
> We don't.
Isn't that even worse than panic()ing? (No output, followed by a watchdog reboot
if we're lucky)
>> On arm64 systems it may not be possible to return to the context we took the NMI
>> notification from: we may bounce back into firmware with the same "world is on
>> fire" error. Firmware can rightly assume the OS has made no attempt to handle
>> the error.
>
> I'm not aware of the NMI path being used on arm64:
> $ git grep 'select HAVE_ACPI_APEI_NMI'
> arch/x86/Kconfig: select HAVE_ACPI_APEI_NMI if ACPI
> $
(I agree it's not obvious!), arm64 has NOTIFY_SEA, NOTIFY_SEI and NOTIFY_SDEI
that all behave like NOTIFY_NMI (in that they can interrupt IRQ-masked code).
CONFIG_HAVE_ACPI_APEI_NMI has come to mean NOTIFY_NMI, which requires the arch
code to implement register_nmi_handler() and (from memory) behave as a notifier
chain, which doesn't fit with how these things behave.
NOTIFY_SEA lives behind CONFIG_ACPI_APEI_SEA in driver/acpi/apei/Kconfig.
The series to add SDEI support is on the list here:
https://www.spinics.net/lists/arm-kernel/msg642946.html
(NOTIFY_SEI is very tricky for firmware to get right, I don't think we're there
yet.)
> As far as jumping back into firmware, the OS should clear the GHES block
> status before exiting the NMI. This tells the firmware that the OS got
> the message [1].
> I'm not seeing any mechanism to say "handled/not handled", so I'm not
> sure how firmware can make such assumptions.
(I'm probably assuming GHESv2's ACK stuff).
Wouldn't reasonably-intelligent firmware be able to spot the same fault
repeatedly in a loop? (last-fault-address == this-fault-address,
last-fault-time==now)
>> Your 'not re-arming the error' example makes this worrying.
>
> Those are one class of bugs you get when you let firmware run the show.
> It's been an issue since day one of EFI. The best you can do in such
> cases is accept you may later lose the link to a device.
> Side-note: We have a BIOS defect filed against this particular
> issue, and it only seems to affect PCIe SMIs.
This sounds like policy applied too early; fixing the firmware to at least make
this configurable would be preferable.
My point was there is a class of reported-as-fatal error that really is fatal,
and for these trying to defer the work to irq_work is worse than panic()ing as
we get less information about what happened.
>>> With these changes, PCIe errors are handled by AER. Other, far less
>>> common errors, such as machine check exceptions, still cause a panic()
>>> in their respective handlers.
>>
>> I agree AER is always going to be different. Could we take a look at the CPER
>> records while still in_nmi() to decide whether linux knows better than firmware?
>>
>> For non-standard or processor-errors I think we should always panic() if they're
>> marked as fatal.
>> For memory-errors we could split memory_failure() up to have
>> {NMI,IRQ,process}-context helpers, all we need to know at NMI-time is whether
>> the affected memory is kernel memory.
>
> I'm trying to avoid splitting up the recovery logic based on _how_ we
> were notified of the error.
I agree, last time I checked there were 11 of them. 'in_nmi()' or not is a
distinction the ghes code already has for ghes_copy_tofrom_phys(). We already
have to consider NMI-like notifications separately as we can't call the
recovery-helpers.
> The only sure way to know if you've handled
> an error is to try to handle it. Not handling _any_ fatal error still
> results in a panic(), and I was very careful to ensure that.
Are all AER errors recoverable? (is an OS:'fatal' AER error ever valid?)
> The alternative is to keep ghes_do_proc() and ghes_notify_nmi() in sync,
> duplicate code between the two. But we can also get ghes_proc_irq_work()
> as an entry point, and this easily turns into a maintenance nightmare.
We can't call all the recovery-helpers from ghes_notify_nmi() as they sleep and
allocate memory. I think the trick would be to make the CPER parsing routines
NMI-safe so that we can check these 'fatal' records for cases where we're pretty
sure linux can do a better job.
>> For the PCI*/AER errors we should be able to try and handle it ... if we can get
>> back to process/IRQ context:
>> What happens if a PCIe driver has interrupts masked and is doing something to
>> cause this error? We can take the NMI and setup the irq-work, but it may never
>> run as we return to interrupts-masked context.
>
> The way the recovery from NMI is set up is confusing. On NMIs,
> ghes_proc_in_irq() is queued up to do further work. Despite the name, in
> this case it runs as a task. That, in turn, queues up AER work.
> Either way, PCIe endpoint interrupts are irrelevant here. AER recovery
> runs independently of drivers or interrupts:
Hmm, we may be looking at different things, (and I'm not familiar with x86):
ghes_notify_nmi() uses irq_work_queue() to add the work to a percpu llist and
call arch_irq_work_raise(). arch_irq_work_raise() sends a self IPI, which, once
taken causes the arch code to call irq_work_run() to consume the llist.
My badly made point is we could take the NMI that triggers all this from a
context that has interrupts masked. We can't take the self-IPI until we return
from the NMI, and we return to a context that still has interrupts masked. The
original context may cause the NMI to trigger again before it unmasks
interrupts. In this case we've livelocked, the recovery work never runs.
I think something like polling a status register over a broken link would do
this. (Has my config write been acknowledged?)
> - Drivers that implement AER recovery services are expected to do the
> right thing and shut down IO. Not doing so is a bug in the driver.
> - Drivers that do not implement AER have their devices taken offline.
> It is not uncommon for PCIe errors to amplify. Imagine losing the link
> right after you've fired off several DMA queues. You'd get a stream of
> Unsupported Request errors. AER handles this fairly well.
>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>> index 2c998125b1d5..7243a99ea57e 100644
>>> --- a/drivers/acpi/apei/ghes.c
>>> +++ b/drivers/acpi/apei/ghes.c
>>> @@ -955,7 +962,7 @@ static void __process_error(struct ghes *ghes)
>>> static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
>>> {
>>> struct ghes *ghes;
>>> - int sev, ret = NMI_DONE;
>>> + int ret = NMI_DONE;
>>>
>>> if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
>>> return ret;
>>> @@ -968,13 +975,6 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
>>> ret = NMI_HANDLED;
>>> }
>>>
>>> - sev = ghes_severity(ghes->estatus->error_severity);
>>> - if (sev >= GHES_SEV_PANIC) {
>>> - oops_begin();
>>> - ghes_print_queued_estatus();
>>> - __ghes_panic(ghes);
>>> - }
>>> -
>>> if (!(ghes->flags & GHES_TO_CLEAR))
>>> continue;
>>
>> For Processor-errors I think this is the wrong thing to do, but we should be
>> able to poke around in the CPER records and find out what we are dealing with.
>
> I don't think being trigger-happy with the panic button is ever a good
> thing. The logic to stop the system exists downstream, and is being
> called.
> I can see an issue if we somehow lose the ability to run queued
> tasks, but in that case, we've already lost control of the machine.
About to lose control of the machine: firmware has saved the day, giving us a
clean slate and some standardised messages that describe the fault.
I agree we can do better for AER errors, but I don't think we should make the
handling of other classes of error worse. These classes match nicely to the CPER
record types, which we already have mapped in ghes_notify_nmi().
Thanks,
James
On 04/04/2018 11:53 AM, James Morse wrote:
> Hi Alex,
(snip)
>>> How do we know we will survive this trip?
>>
>> We don't.
>
> Isn't that even worse than panic()ing? (No output, followed by a watchdog reboot
> if we're lucky)
On the contrary. No output, followed by a watchdog reboot, is usually
what happens when the panic() is in an NMI. One of the very few ways that
can be debugged is over a null-modem cable.
>>> On arm64 systems it may not be possible to return to the context we took the NMI
>>> notification from: we may bounce back into firmware with the same "world is on
>>> fire" error. Firmware can rightly assume the OS has made no attempt to handle
>>> the error.
>>
>> I'm not aware of the NMI path being used on arm64:
>> $ git grep 'select HAVE_ACPI_APEI_NMI'
>> arch/x86/Kconfig: select HAVE_ACPI_APEI_NMI if ACPI
>> $
>
> (I agree its not obvious!), arm64 has NOTIFY_SEA, NOTIFY_SEI and NOTIFY_SDEI
> that all behave like NOTIFY_NMI (in that they can interrupt IRQ-masked code).
>
> CONFIG_HAVE_ACPI_APEI_NMI has come to mean NOTIFY_NMI, which requires the arch
> code to implement register_nmi_handler() and (from memory) behave as a notifier
> chain, which doesn't fit with how these things behave.
>
> NOTIFY_SEA lives behind CONFIG_ACPI_APEI_SEA in driver/acpi/apei/Kconfig.
> The series to add SDEI support is on the list here:
> https://www.spinics.net/lists/arm-kernel/msg642946.html
> (NOTIFY_SEI is very tricky for firmware to get right, I don't think we're there
> yet.)
From my understanding, your patch series tries to use ghes_notify_nmi()
as a common entry point. But if, on arm64, you can bounce back into
firmware, then we have wildly different constraints.
On x86, you can't return to SMM. You can have a new SMI triggered while
servicing the NMI, but you can't re-enter an SMI that you've returned
from. SMM holds all the cores, and they all leave SMM at roughly the
same time, so you don't deal with coexistence issues. This probably also
avoids the multiple notifications that you are trying to implement on arm.
On quick thought, I think the best way to go for both series is to leave
the entry points arch-specific, and prevent hax86 and harm64 from
stepping on each other's toes. Thoughts?
(snip)
> Wouldn't reasonably-intelligent firmware be able to spot the same fault
> repeatedly in a loop? (last-fault-address == this-fault-address,
> last-fault-time==now)
I'm told that the BIOS on Dell machines will detect interrupt storms,
and disable the specific error source in such cases. On an x86 server,
50us of SMM eats up several core-milliseconds of CPU time, so they try
to be a little careful.
>>> Your 'not re-arming the error' example makes this worrying.
>>
>> Those are one class of bugs you get when you let firmware run the show.
>> It's been an issue since day one of EFI. The best you can do in such
>> cases is accept you may later lose the link to a device.
>> Side-note: We have a BIOS defect filed against this particular
>> issue, and it only seems to affect PCIe SMIs.
>
> This sounds like policy applied too early
That's firmware-first in one sentence.
> fixing the firmware to at least make this configurable would be preferable.
I can ask for "fixing the firmware" and we might even get the fix, but
the fact remains that there are machines in the field with the buggy
BIOSes, and there will continue to be such machines even after a fix is
released.
> My point was there is a class of reported-as-fatal error that really is fatal,
> and for these trying to defer the work to irq_work is worse than panic()ing as
> we get less information about what happened.
How do we get less information? We save the GHES structure on the
estatus_queue, and use it as the error source in either case. Is there a
class of errors where we can reasonably expect things to be okay in NMI,
but go horribly wrong on return from NMI?
(snip)
> Are all AER errors recoverable? (is an OS:'fatal' AER error ever valid?)
PCIe "fatal" means that your link is gone and you might not get it back.
You lose all the downstream devices. You may be able to reset the link
and get it back. This sort of "fatal" has no impact on a machine's
ability to continue operation.
From ACPI, FFS should only send a "fatal" error if a machine needs reset
[1], so it's clearly a firmware bug to mark any PCIe error "fatal". This
won't stop BIOS vendors from taking shortcuts and deciding to crash an
OS by sending the error as "fatal".
[1] ACPI 6.2: 18.1 Hardware Errors and Error Sources
>> The alternative is to keep ghes_do_proc() and ghes_notify_nmi() in sync,
>> duplicating code between the two. But we can also get ghes_proc_irq_work()
>> as an entry point, and this easily turns into a maintenance nightmare.
>
> We can't call all the recovery-helpers from ghes_notify_nmi() as they sleep and
> allocate memory. I think the trick would be to make the CPER parsing routines
> NMI-safe so that we can check these 'fatal' records for cases where we're pretty
> sure linux can do a better job.
can_i_handle_this_nonblocking(struct four_letter_acronym_err *err) ?
That's the approach I originally tried, but I didn't like it. The
parsing gets deeper and deeper, and, by the time you realize if the
error can be handled or not, you're already halfway through recovery.
This is a perfectly viable approach. However, is there a palpable
(non-theoretical) concern that warrants complicating things this way?
>>> For the PCI*/AER errors we should be able to try and handle it ... if we can get
>>> back to process/IRQ context:
>>> What happens if a PCIe driver has interrupts masked and is doing something to
>>> cause this error? We can take the NMI and setup the irq-work, but it may never
>>> run as we return to interrupts-masked context.
>>
>> The way the recovery from NMI is set up is confusing. On NMIs,
>> ghes_proc_in_irq() is queued up to do further work. Despite the name, in
>> this case it runs as a task. That further queues up AER work.
>> Either way, PCIe endpoint interrupts are irrelevant here. AER recovery
>> runs independently of drivers or interrupts:
>
> Hmm, we may be looking at different things, (and I'm not familiar with x86):
> ghes_notify_nmi() uses irq_work_queue() to add the work to a percpu llist and
> call arch_irq_work_raise(). arch_irq_work_raise() sends a self IPI, which, once
> taken causes the arch code to call irq_work_run() to consume the llist.
I abstract things at irq_work_queue(): "queue this work and let me
return from this interrupt"
> My badly made point is we could take the NMI that triggers all this from a
> context that has interrupts masked. We can't take the self-IPI until we return
> from the NMI, and we return to a context that still has interrupts masked. The
> original context may cause the NMI to trigger again before it unmasks
> interrupts. In this case we've livelocked, the recovery work never runs.
In order to get another NMI from firmware, you need SMM to be triggered
again. Remember, SMM has to exit before the NMI can be serviced.
I doubt the livelock is a concern in this place, because we're dealing
with NMIs invoked by firmware (SMM). NMIs for other reasons would be
handled elsewhere.
In the event that FFS fails to prevent an SMI storm and keeps spamming
the CPU with NMIs, that is a clear firmware bug. It's also the sort of
bug I haven't observed in the wild.
> I think something like polling a status register over a broken-link would do
> this. (has my config write been acknowledge?)
That would not be a valid reason to disable interrupts. Non-posted PCIe
requests can take hundreds of milliseconds (memory-mapped or otherwise).
I can't think of any sort of code that can cause errors to come in via
NMI which would be a suitable atomic context for disabling interrupts.
Maybe on arm it's normal to stuff in anything error-related via NMI? On
x86, most error sources have their own (maskable) interrupt vector. I am
not concerned about livelocking the system.
(snip)
>> I can see an issue if we somehow lose the ability to run queued
>> tasks, but in that case, we've already lost control of the machine.
>
> about to lose control of the machine: firmware has saved the day, giving us a
> clean slate and some standardised messages that describe the fault.
In theory, yes. In practice, by the time OS gets control things are
usually worse than they would have been without FFS.
"standardized" is really a pipe dream, especially on x86, where each
vendor interprets things differently, so the OS still has to figure it
all out on its own.
> I agree we can do better for AER errors, but I don't think we should make the
> handling of other classes of error worse. These classes match nicely to the CPER
> record types, which we already have mapped in ghes_notify_nmi().
My approach is to _not_ trust the firmware. I hope arm does it better,
but on x86, each hardware vendor does what it thinks is best in
firmware. The result is wildly different FFS behavior between vendors.
This is something that an OS then, cannot rely upon.
Alex
Hi Alex,
On 04/04/18 20:49, Alex G. wrote:
> On 04/04/2018 11:53 AM, James Morse wrote:
>>>> How do we know we will survive this trip?
>>>
>>> We don't.
>>
>> Isn't that even worse than panic()ing? (No output, followed by a watchdog reboot
>> if we're lucky)
> On the contrary. No output, followed by a watchdog reboot is usually
> what happens when the panic() is in NMI. One of the very few ways that
> can be debugged is over a NULL modem cable.
(IPMI-SOL?)
Really? printk has NMI support, and panic()'s call to bust_spinlocks() should
unblank the screen before printk_safe_flush_on_panic() dumps the messages. (I'm
assuming some BMC-VGA exists to capture the 'screen').
>>>> On arm64 systems it may not be possible to return to the context we took the NMI
>>>> notification from: we may bounce back into firmware with the same "world is on
>>>> fire" error. Firmware can rightly assume the OS has made no attempt to handle
>>>> the error.
>>>
>>> I'm not aware of the NMI path being used on arm64:
>>> $ git grep 'select HAVE_ACPI_APEI_NMI'
>>> arch/x86/Kconfig: select HAVE_ACPI_APEI_NMI if ACPI
>>> $
>>
>> (I agree its not obvious!), arm64 has NOTIFY_SEA, NOTIFY_SEI and NOTIFY_SDEI
>> that all behave like NOTIFY_NMI (in that they can interrupt IRQ-masked code).
>>
>> CONFIG_HAVE_ACPI_APEI_NMI has come to mean NOTIFY_NMI, which requires the arch
>> code to implement register_nmi_handler() and (from memory) behave as a notifier
>> chain, which doesn't fit with how these things behave.
>>
>> NOTIFY_SEA lives behind CONFIG_ACPI_APEI_SEA in driver/acpi/apei/Kconfig.
>> The series to add SDEI support is on the list here:
>> https://www.spinics.net/lists/arm-kernel/msg642946.html
>> (NOTIFY_SEI is very tricky for firmware to get right, I don't think we're there
>> yet.)
> From my understanding, your patch series tries to use ghes_notify_nmi()
> as a common entry point.
It abstracts the NMI-like notifications to use a common helper into the
estatus-queue. ghes_notify_nmi() is the arch-code entry point for x86's
NOTIFY_NMI. We have ghes_notify_sea() and ghes_sdei_callback() on arm64. They
are different as their registration requirements are different. They each call
_in_nmi_notify_one() to do the in_nmi() ghes work.
> But if on arm64, you can return to firmware,
> then we have wildly different constraints.
We can't return to firmware, but we can take the error again, causing another
trap to firmware.
e.g.: If a program accesses a value in the cache and the cache location is
discovered to be corrupt/marked-as-poisoned. This causes an 'external abort'
exception, which firmware has configured to trap to firmware. Once there, it
generates CPER records and sets-up an external-abort-exception for Linux.
If linux returns from this external-abort, we return to whatever program
accessed the bad-cache-value in the first case, re-run the instruction that
generates the external abort, and the same thing happens again, but to firmware
it looks like a second fault at the same address.
The same thing can happen for a processor error.
> On x86, you can't return to SMM. You can have a new SMI triggered while
> servicing the NMI, but you can't re-enter an SMI that you've returned
> from. SMM holds all the cores, and they all leave SMM at roughly the
> same time, so you don't deal with coexistence issues. This probably also
> avoids the multiple notifications that you are trying to implement on arm.
It's the new SMI case.
(We don't necessarily have all cores pulled into firmware, (although firmware
could do that in software), so different CPUs taking NMI-like notifications is
one of the things we have to handle.)
> On quick thought, I think the best way to go for both series is to leave
> the entry points arch-specific, and prevent hax86 and harm64 from
> stepping on each other's toes. Thoughts?
The behaviour should be driven from the standard. I agree there is some leeway,
which linux should handle on all architectures that support GHES.
The entry points from the arch code are already separate, as these have
different notification types in the HEST->GHES entries and different
registration requirements.
Once in ghes.c all the NMI-like notifications get merged together as the only
relevant thing about them is we have to use the estatus-queue, and can't call
any of the sleepy recovery helpers.
I don't think this is the issue though: An NMI notification could be any of the
eleven CPER notification record types.
Your argument is Linux probably can handle PCIe-AER errors, even if broken
firmware thinks they are fatal.
This is only one of the eleven notification record types. Returning to handle
the error definitely doesn't work for some classes of fatal error, especially
when interrupts are masked.
I think it's straightforward to test whether the CPER records contain only errors
where we can do better: (not tested/built, I haven't checked whether this is nmi
safe):
---------------------%<---------------------
/*
 * Return true if only PCIe-AER CPER sections are present. We ignore the fatal
 * severity in this case as linux knows better. Called in_nmi().
 */
static bool ghes_is_deferrable(struct ghes *ghes,
			       const struct acpi_hest_generic_status *estatus)
{
	guid_t *sec_type;
	struct acpi_hest_generic_data *gdata;

	apei_estatus_for_each_section(estatus, gdata) {
		sec_type = (guid_t *)gdata->section_type;

		if (!guid_equal(sec_type, &CPER_SEC_PCIE))
			return false;
	}

	return true;
}
---------------------%<---------------------
>> Wouldn't reasonably-intelligent firmware be able to spot the same fault
>> repeatedly in a loop? (last-fault-address == this-fault-address,
>> last-fault-time==now)
>
> I'm told that the BIOS on Dell machines will detect interrupt storms,
> and disable the specific error source in such cases. On an x86 server.
> 50us of SMM eats up several milliseconds-core of CPU time, so they try
> to be a little careful.
>
>>>> Your 'not re-arming the error' example makes this worrying.
>>>
>>> Those are one class of bugs you get when you let firmware run the show.
>>> It's been an issue since day one of EFI. The best you can do in such
>>> cases is accept you may later lose the link to a device.
>>> Side-note: We have a BIOS defect filed against this particular
>>> issue, and it only seems to affect PCIe SMIs.
>>
>> This sounds like policy applied too early
>
> That's firmware-first in one sentence.
>
>> fixing the firmware to at least make this configurable would be preferable.
>
> I can ask for "fixing the firmware" and we might even get the fix, but
> the fact remains that there are machines in the field with the buggy
> BIOSes, and there will continue to be such machines even after a fix is
> released.
>
>
>> My point was there is a class of reported-as-fatal error that really is fatal,
>> and for these trying to defer the work to irq_work is worse than panic()ing as
>> we get less information about what happened.
>
> How do we get less information? We save the GHES structure on the
> estatus_queue, and use it as the error source in either case. Is there a
> class of errors where we can reasonably expect things to be okay in NMI,
> but go horribly wrong on return from NMI?
Yes, (described above). This could happen if we see Arm64's 'SEA' CPER records
or x86's 'MCE' CPER records. I'm pretty sure it will happen on both
architectures if you return to a context with interrupts masked.
Things will go horribly wrong if you return from an imprecise exception notified
via firmware first.
> (snip)
>> Are all AER errors recoverable? (is an OS:'fatal' AER error ever valid?)
>
> PCIe "fatal" means that your link is gone and you might not get it back.
> You lose all the downstream devices. You may be able to reset the link
> and get it back. This sort of "fatal" has no impact on a machine's
> ability to continue operation.
>
> From ACPI, FFS should only send a "fatal" error if a machine needs reset
> [1], so it's clearly a firmware bug to mark any PCIe error "fatal". This
> won't stop BIOS vendors from taking shortcuts and deciding to crash an
> OS by sending the error as "fatal".
>
> [1] ACPI 6.2: 18.1 Hardware Errors and Error Sources
Great! It's never valid. So we can check the CPER records in the NMI handler, and
for AER errors, clear the fatal bit, printing a nasty message about the firmware.
>>> The alternative is to keep ghes_do_proc() and ghes_notify_nmi() in sync,
>>> duplicating code between the two. But we can also get ghes_proc_irq_work()
>>> as an entry point, and this easily turns into a maintenance nightmare.
>>
>> We can't call all the recovery-helpers from ghes_notify_nmi() as they sleep and
>> allocate memory. I think the trick would be to make the CPER parsing routines
>> NMI-safe so that we can check these 'fatal' records for cases where we're pretty
>> sure linux can do a better job.
>
> can_i_handle_this_nonblocking(struct four_letter_acronym_err *err) ?
Nothing so complicated. Just clearing the bit for AER-only CPER blocks should do.
(we can probably assume that AER & Vendor-Specific are describing the same thing)
> That's the approach I originally tried, but I didn't like it. The
> parsing gets deeper and deeper, and, by the time you realize if the
> error can be handled or not, you're already halfway through recovery.
>
> This is a perfectly viable approach. However, is there a palpable
> (non-theoretical) concern that warrants complicating things this way?
For AER your text above says no, we can always handle an error, they are never
fatal.
For memory-errors, yes. An ECC error for a location on the kernel stack while
the affected thread was preempt_disable()d. We can't kick the thread off the
CPU, and we can't return from the handler as we can't stop accesses to the stack
when we return.
We can avoid it being complicated by just separating AER errors out when they
come in. You've said they never mean the OS has to reboot.
>>>> For the PCI*/AER errors we should be able to try and handle it ... if we can get
>>>> back to process/IRQ context:
>>>> What happens if a PCIe driver has interrupts masked and is doing something to
>>>> cause this error? We can take the NMI and setup the irq-work, but it may never
>>>> run as we return to interrupts-masked context.
>>>
>>> The way the recovery from NMI is set up is confusing. On NMIs,
>>> ghes_proc_in_irq() is queued up to do further work. Despite the name, in
>>> this case it runs as a task. That further queues up AER work.
>>> Either way, PCIe endpoint interrupts are irrelevant here. AER recovery
>>> runs independently of drivers or interrupts:
>>
>> Hmm, we may be looking at different things, (and I'm not familiar with x86):
>> ghes_notify_nmi() uses irq_work_queue() to add the work to a percpu llist and
>> call arch_irq_work_raise(). arch_irq_work_raise() sends a self IPI, which, once
>> taken causes the arch code to call irq_work_run() to consume the llist.
>
> I abstract things at irq_work_queue(): "queue this work and let me
> return from this interrupt"
('from this NMI', to come back later as an IRQ, from where we can call
schedule_work_on(). This chaining is because the scheduler takes irqsave
spin-locks to schedule the work.)
>> My badly made point is we could take the NMI that triggers all this from a
>> context that has interrupts masked. We can't take the self-IPI until we return
>> from the NMI, and we return to a context that still has interrupts masked. The
>> original context may cause the NMI to trigger again before it unmasks
>> interrupts. In this case we've livelocked, the recovery work never runs.
> In order to get another NMI from firmware, you need SMM to be triggered
> again. Remember, SMM has to exit before the NMI can be serviced.
>
> I doubt the livelock is a concern in this place, because we're dealing
> with NMIs invoked by firmware(SMM). NMIs for other reasons would be
> handled elsewhere.
(aha, this might be the difference in the way we see this)
> In the event that FFS fails to prevent an SMI storm and keeps spamming
> the CPU with NMIs, that is a clear firmware bug. It's also the sort of
> bug I haven't observed in the wild.
Do these SMIs ever occur synchronously with the CPU's execution? i.e. if I execute
this instruction, I will get an SMI. (Or are these your handled-elsewhere-NMI case?)
On arm64 poisoned-cache locations triggering an external abort (leading to a
NOTIFY_SEA NMI-like notification to APEI) in response to a load is synchronous,
hence the live-lock problem.
I agree AER is very likely to be different.
>> I think something like polling a status register over a broken-link would do
>> this. (has my config write been acknowledge?)
> That would not be a valid reason to disable interrupts. Non-posted PCIe
> requests can take hundreds of milliseconds (memory-mapped or otherwise).
> I can't think of any sort of code that can cause errors to come in via
> NMI which would be a suitable atomic context for disabling interrupts.
>
> Maybe on arm it's normal to stuff in anything error-related via NMI? On
> x86, most error sources have their own (maskable) interrupt vector.
Maybe. On arm systems error sources might have their own interrupts that are
taken to firmware, and firmware may be able to mask them. This is all hidden in
firmware, which can use something like SDEI to push these onto linux as NMI-like
notifications.
> I am not concerned about livelocking the system.
This is what keeps me awake at night.
I think your platform has NMI handled as MCE/kernel-first for synchronous CPU
errors. SMI triggered by PCI-AER are notified by NMI too, but handled by
firmware-first.
This hybrid kernel/firmware first depending on the error-source isn't the only
way of building things. Appendix-N of the UEFI spec lists CPER records for
firmware-first handling of types of error that your platform must 'handle
elsewhere'.
We have arm systems where kernel-first isn't viable as the equivalent of the MCA
registers is platform/integration/silicon-revision specific. These systems
handle all their errors using firmware-first. These systems will generate CPER
records where fatal means fatal, and returning from the NMI-like notification
will either send us into the weeds, or trap us back into firmware.
Thanks,
James
Hi James,
On 04/06/2018 01:24 PM, James Morse wrote:
Do you have any ETA on when your SEA patches are going to make it
upstream? There's not much point in updating my patchset if it's going
to conflict with your work.
(snip)
>> On the contrary. No output, followed by a watchdog reboot is usually
>> what happens when the panic() is in NMI. One of the very few ways that
>> can be debugged is over a NULL modem cable.
>
> (IPMI-SOL?)
>
> Really? printk has NMI support, and panic()'s call to bust_spinlocks() should
> unblank the screen before printk_safe_flush_on_panic() dumps the messages. (I'm
> assuming some BMC-VGA exists to capture the 'screen').
Plausible. There are other mechanisms though that may reset the system
before it is practical to capture the screen.
(snip)
>> But if on arm64, you can return to firmware,
>> then we have wildly different constraints.
>
> We can't return to firmware, but we can take the error again, causing another
> trap to firmware.
>
> e.g.: If a program accesses a value in the cache and the cache location is
> discovered to be corrupt/marked-as-poisoned. This causes an 'external abort'
> exception, which firmware has configured to trap to firmware. Once there, it
> generates CPER records and sets-up an external-abort-exception for Linux.
> If linux returns from this external-abort, we return to whatever program
> accessed the bad-cache-value in the first case, re-run the instruction that
> generates the external abort, and the same thing happens again, but to firmware
> it looks like a second fault at the same address.
This is a very good example of why we should _not_ panic in NMI.
Invalidate the cache-line, next load causes a memory round-trip, and
everyone's happy. Of course, FFS can't do that because it doesn't know
if it's valid to invalidate (pun intended). But the OS does. So you
_want_ to reach the error handler downstream of NMI context, and you do
_not_ want to crash.
(snip)
>> On x86, you can't return to SMM. You can have a new SMI triggered while
>> servicing the NMI, but you can't re-enter an SMI that you've returned
>> from. SMM holds all the cores, and they all leave SMM at roughly the
>> same time, so you don't deal with coexistence issues. This probably also
>> avoids the multiple notifications that you are trying to implement on arm.
>
> It's the new SMI case.
>
> (We don't necessarily have all cores pulled into firmware, (although firmware
> could do that in software), so different CPUs taking NMI-like notifications is
> one of the things we have to handle.)
How does FFS handle race conditions that can occur when accessing HW
concurrently with the OS? I'm told it's the main reason why the BIOS
doesn't release unused cores from SMM early.
>> On quick thought, I think the best way to go for both series is to leave
>> the entry points arch-specific, and prevent hax86 and harm64 from
>> stepping on each other's toes. Thoughts?
>
> The behaviour should be driven from the standard. I agree there is some leeway,
> which linux should handle on all architectures that support GHES.
"The standard" deals mostly with firmware behavior. While I'm all for
following specifications in order to ensure smooth interaction and
integration, Linux is an OS and it deals with a plethora of "standards".
Its job is to serve user requests. When a "standard" deviates from this,
such as mandating a crash when one is not required, it's the OS's job to
_not_ follow this requirement.
> The entry points from the arch code are already separate, as these have
> different notification types in the HEST->GHES entries and different
> registration requirements.
> Once in ghes.c all the NMI-like notifications get merged together as the only
> relevant thing about them is we have to use the estatus-queue, and can't call
> any of the sleepy recovery helpers.
>
> I don't think this is the issue though: An NMI notification could be any of the
> eleven CPER notification record types.
> Your argument is Linux probably can handle PCIe-AER errors,
My argument is that Linux should make a best effort to recover from
hardware (or FFS) errors. In support of that argument, I presented
evidence that Linux already can handle a variety of errors. How those
errors are reported determines whether the OS crashes or not, and that's
not right. When errors are reported natively, we make it to the error
handlers before crashing. And the essence of my argument is that there
is no reason to have a different behavior with FFS.
> even if broken firmware thinks they are fatal.
I think the idea of firmware-first is broken. But it's there, it's
shipping in FW, so we have to accommodate it in SW.
> This is only one of the eleven notification record types. Returning to handle
> the error definitely doesn't work for some classes of fatal error, especially
> when interrupts are masked.
> I think it's straightforward to test whether the CPER records contain only errors
> where we can do better: (not tested/built, I haven't checked whether this is nmi
> safe):
That looks very similar to what I had originally implemented.
> ---------------------%<---------------------
> /*
>  * Return true if only PCIe-AER CPER sections are present. We ignore the fatal
>  * severity in this case as linux knows better. Called in_nmi().
>  */
> static bool ghes_is_deferrable(struct ghes *ghes,
> 			       const struct acpi_hest_generic_status *estatus)
> {
> 	guid_t *sec_type;
> 	struct acpi_hest_generic_data *gdata;
>
> 	apei_estatus_for_each_section(estatus, gdata) {
> 		sec_type = (guid_t *)gdata->section_type;
> 		if (!guid_equal(sec_type, &CPER_SEC_PCIE))
We also need to check whether the CPER record contains enough information
to identify the cause of the error. There's no guarantee the record even
has a valid PCI BDF, let alone error information. Possible? Yes. Easy?
Yes. Advisable? Maybe.
(snip)
>>> My point was there is a class of reported-as-fatal error that really is fatal,
>>> and for these trying to defer the work to irq_work is worse than panic()ing as
>>> we get less information about what happened.
>>
>> How do we get less information? We save the GHES structure on the
>> estatus_queue, and use it as the error source in either case. Is there a
>> class of errors where we can reasonably expect things to be okay in NMI,
>> but go horribly wrong on return from NMI?
>
> Yes, (described above). This could happen if we see Arm64's 'SEA' CPER records
> or x86's 'MCE' CPER records. I'm pretty sure it will happen on both
> architectures if you return to a context with interrupts masked.
And linux can handle a wide subset of MCEs just fine, so the
ghes_is_deferrable() logic would, under my argument, agree to pass
execution to the actual handlers. I do agree that this is one case where
splitting the handler in a top half and bottom half would make sense.
Though in that case, the problem is generalized to "how to best handle
error Y", rather than "how to handle error Y in FFS".
>> (snip)
>>> Are all AER errors recoverable? (is an OS:'fatal' AER error ever valid?)
>>
>> PCIe "fatal" means that your link is gone and you might not get it back.
>> You lose all the downstream devices. You may be able to reset the link
>> and get it back. This sort of "fatal" has no impact on a machine's
>> ability to continue operation.
>>
>> From ACPI, FFS should only send a "fatal" error if a machine needs reset
>> [1], so it's clearly a firmware bug to mark any PCIe error "fatal". This
>> won't stop BIOS vendors from taking shortcuts and deciding to crash an
>> OS by sending the error as "fatal".
>>
>> [1] ACPI 6.2: 18.1 Hardware Errors and Error Sources
>
> Great! It's never valid. So we can check the CPER records in the NMI handler, and
> for AER errors, clear the fatal bit, printing a nasty message about the firmware.
There's the counter-argument that the machine is a "XYZ" appliance, and
losing the PCIe link to the "XYZ" device is essentially "fatal" to the
(purpose of the) machine. As you noted, there is some leeway ;)
Sarcasm aside, I do like the idea of a message complaining about the
firmware.
(snip)
>> That's the approach I originally tried, but I didn't like it. The
>> parsing gets deeper and deeper, and, by the time you realize if the
>> error can be handled or not, you're already halfway through recovery.
>>
>> This is a perfectly viable approach. However, is there a palpable
>> (non-theoretical) concern that warrants complicating things this way?
>
> For AER your text above says no, we can always handle an error, they are never
> fatal.
>
> For memory-errors, yes. An ECC error for a location on the kernel stack while
> the affected thread was preempt_disable()d. We can't kick the thread off the
> CPU, and we can't return from the handler as we can't stop accesses to the stack
> when we return.
>
> We can avoid it being complicated by just separating AER errors out when they
> come in. You've said they never mean the OS has to reboot.
If we're going to split recovery paths, then it makes better sense to
have a system where handleable errors are passed down to a lesser
context. Or even run non-blocking handlers in whatever context they are
received. Much more than having a special case for a specific error.
(snip)
>> I abstract things at irq_work_queue(): "queue this work and let me
>> return from this interrupt"
>
> ('from this NMI', to come back later as an IRQ, from where we can call
> schedule_work_on(). This chaining is because the scheduler takes irqsave
> spin-locks to schedule the work.)
Hopefully, the handling is set up in such a way that I don't have to
worry about these details on a per-error basis. If I do have to worry,
then I must be doing something wrong.
(snip)
>> In the event that FFS fails to prevent an SMI storm and keeps spamming
>> the CPU with NMIs, that is a clear firmware bug. It's also the sort of
>> bug I haven't observed in the wild.
>
> Do these SMIs ever occur synchronously with the CPU's execution? i.e. if I execute
> this instruction, I will get an SMI. (Or are these your handled-elsewhere-NMI case?)
I'm not aware of synchronous exceptions that end up as SMIs, though I
imagine it is possible on x86. An IO write to an SMI trap is equivalent
to that.
> On arm64 poisoned-cache locations triggering an external abort (leading to a
> NOTIFY_SEA NMI-like notification to APEI) in response to a load is synchronous,
> hence the live-lock problem.
How is this currently handled in the kernel-first case?
(snip)
> Maybe. On arm systems error sources might have their own interrupts that are
> taken to firmware,
That sounds like an even bigger headache than x86.
>> I am not concerned about livelocking the system.
>
> This is what keeps me awake at night.
Nothing keeps me awake at night. I keep others awake at night.
> I think your platform has NMI handled as MCE/kernel-first for synchronous CPU
> errors. SMI triggered by PCI-AER are notified by NMI too, but handled by
> firmware-first.
MCEs come in via FFS. There's a school of thought that sees a certain
value in logging ECC errors to the BMC. It's arguable whether the ECC
errors are synchronous in 100% of cases.
> This hybrid kernel/firmware first depending on the error-source isn't the only
> way of building things. Appendix-N of the UEFI spec lists CPER records for
> firmware-first handling of types of error that your platform must 'handle
> elsewhere'.
>
> We have arm systems where kernel-first isn't viable as the equivalent of the MCA
> registers is platform/integration/silicon-revision specific. These systems
> handle all their errors using firmware-first. These systems will generate CPER
> records where fatal means fatal, and returning from the NMI-like notification
> will either send us into the weeds, or trap us back into firmware.
Without getting into details, the non-viability of kernel-first is
subjective.
Alex
Hi Alex,
On 09/04/18 19:11, Alex G. wrote:
> On 04/06/2018 01:24 PM, James Morse wrote:
> Do you have any ETA on when your SEA patches are going to make it
> upstream? There's not much point in updating my patchset if it's going
> to conflict with your work.
The SEA stuff went in with 7edda0886bc3 ("acpi: apei: handle SEA notification
type for ARMv8"). My series is moving it to use the estatus-queue in the same
way as x86's NOTIFY_NMI does. This lets us safely add the other two NMI-like
notifications. I have no idea on the ETA, it depends on review feedback!
>>> But if on arm64, you can return to firmware,
>>> then we have wildly different constraints.
>>
>> We can't return to firmware, but we can take the error again, causing another
>> trap to firmware.
>>
>> e.g.: If a program accesses a value in the cache and the cache location is
>> discovered to be corrupt/marked-as-poisoned. This causes an 'external abort'
>> exception, which firmware has configured to trap to firmware. Once there, it
>> generates CPER records and sets-up an external-abort-exception for Linux.
>> If linux returns from this external-abort, we return to whatever program
>> accessed the bad-cache-value in the first case, re-run the instruction that
>> generates the external abort, and the same thing happens again, but to firmware
>> it looks like a second fault at the same address.
> This is a very good example of why we should _not_ panic in NMI.
> Invalidate the cache-line, next load causes a memory round-trip, and
> everyone's happy. Of course, FFS can't do that because it doesn't know
> if it's valid to invalidate (pun intended). But the OS does. So you
> _want_ to reach the error handler downstream of NMI context, and you do
> _not_ want to crash.
This assumes a cache-invalidate will clear the error, which I don't think we're
guaranteed on arm.
It also destroys any adjacent data, "everyone's happy" includes the thread that
got a chunk of someone-else's stack frame, I don't think it will be happy for
very long!
(this is a side issue for AER though)
>>> On x86, you can't return to SMM. You can have a new SMI triggered while
>>> servicing the NMI, but you can't re-enter an SMI that you've returned
>>> from. SMM holds all the cores, and they all leave SMM at roughly the
>>> same time, so you don't deal with coexistence issues. This probably also
>>> avoids the multiple notifications that you are trying to implement on arm.
>>
>> It's the new SMI case.
>>
>> (We don't necessarily have all cores pulled into firmware, (although firmware
>> could do that in software), so different CPUs taking NMI-like notifications is
>> one of the things we have to handle.)
> How does FFS handle race conditions that can occur when accessing HW
> concurrently with the OS? I'm told it's the main reasons why BIOS
> doesn't release unused cores from SMM early.
This is firmware's problem, it depends on whether there is any hardware that is
shared with the OS. Some hardware can be marked 'secure' in which case only
firmware can access it, alternatively firmware can trap or just disable the OS's
access to the shared hardware.
For example, with the v8.2 RAS Extensions, there are some per-cpu error
registers. Firmware can disable these for the OS, so that it always reads 0 from
them. Instead firmware takes the error via FF, reads the registers from
firmware, and dumps CPER records into the OS's memory.
If there is a shared hardware resource that both the OS and firmware may be
accessing, yes firmware needs to pull the other CPUs in, but this depends on the
SoC design, it doesn't necessarily happen.
>>> On quick thought, I think the best way to go for both series is to leave
>>> the entry points arch-specific, and prevent hax86 and harm64 from
>>> stepping on each other's toes. Thoughts?
>>
>> The behaviour should be driven from the standard. I agree there is some leeway,
>> which linux should handle on all architectures that support GHES.
>
> "The standard" deals mostly with firmware behavior. While I'm all for
> following specifications in order to ensure smooth interaction and
> integration, Linux is an OS and it deals with a plethora of "standards".
> Its job is to serve user requests. When a "standard" deviates from this,
> such as mandating a crash when one is not required, it's the OS's job to
> _not_ follow this requirement.
Sure, we're quirking our behaviour based on a high level of mistrust for the
firmware. My point here was we shouldn't duplicate the implementation because we
want x86:{AER,CPU,MEM} to behave differently to arm64:{AER,CPU,MEM}. I'd rather
the quirked-behaviour was along the *:{AER} versus *:{CPU,MEM} line. If we have
extra code to spot deferrable errors, we should use it on both architectures.
>> Once in ghes.c all the NMI-like notifications get merged together as the only
>> relevant thing about them is we have to use the estatus-queue, and can't call
>> any of the sleepy recovery helpers.
>>
>> I don't think this is the issue though: An NMI notification could by any of the
>> eleven CPER notification record types.
>> Your argument is Linux probably can handle PCIe-AER errors,
>
> My argument is that Linux should do a best effort to recover from
> hardware (or FFS) errors. In support of that argument, I presented
> evidence that Linux already can handle a variety of errors. How those
> errors are reported determines whether the OS crashes or not, and that's
> not right.
For AER we agree, these never mean 'the CPU is on fire'.
> When errors are reported natively, we make it to the error
> handlers before crashing. And the essence of my argument is that there
> is no reason to have a different behavior with FFS.
For the native CPU-error case, the CPU generates some kind of interrupt out of
fiery-death, into the error handler, and we put out the fire or panic.
For the firmware-first case this patch returned us to the fiery-death, in the
hope of taking an IRQ that would bring us into the error handler.
This won't work if the CPU came out of a mode where interrupts are masked.
Even if they aren't, it's not guaranteed on arm64 that we take the interrupt 'in
time'. Doesn't x86 guarantee to execute at least one instruction after an iret?
(I recall something about interrupt storms and forward-progress guarantees).
Some triage of the error in the NMI handler is the right thing to do here. I
agree with your point that AER errors are always deferrable, let's tackle those
first.
(I think user-space memory errors can be deferred if we took the error out of
user-space and can prevent our return with TIF_NEED_RESCHED or equivalent. The
same might work for CPU errors affecting user-space context)
>> even if broken firmware thinks they are fatal.
> I think the idea of firmware-first is broken. But it's there, it's
> shipping in FW, so we have to accommodate it in SW.
Part of our different-views here is firmware-first is taking something away from
you, whereas for me it's giving me information that would otherwise be in
secret-soc-specific registers.
>> This is only one of the eleven notification record types. Returning to handle
>> the error definitely doesn't work for some classes of fatal error, especially
>> when interrupts are masked.
>
>> I think its straightforward to test whether the CPER records contain only errors
>> where we can do better: (not tested/built, I haven't checked whether this is nmi
>> safe):
> That looks very similar to what I had originally implemented.
>> ---------------------%<---------------------
>> /*
>> * Return true if only PCIe-AER CPER sections are present. We ignore the fatal
>> * severity in this case as linux knows better. Called in_nmi().
>> */
>> static bool ghes_is_deferrable(struct ghes *ghes,
>> const struct acpi_hest_generic_status *estatus)
>> {
>> guid_t *sec_type;
>> struct acpi_hest_generic_data *gdata;
>>
>> apei_estatus_for_each_section(estatus, gdata) {
>> sec_type = (guid_t *)gdata->section_type;
>> if (!guid_equal(sec_type, &CPER_SEC_PCIE))
>
> Also need to check if the CPER record contains enough information to
> identify the cause of the error. There's no guarantee the record even
> has a valid PCI BDF, let alone error information. Possible? yes. Easy?
> yes. Advisable? maybe.
If AER errors never prevent the CPU getting to the deferred error handler, we
can do that sort of thing there.
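For reference, a completed version of the ghes_is_deferrable() sketch quoted above might look like the following. This is an illustrative, untested stand-alone model: the guid_t and ACPI/CPER structures here are simplified mock-ups of the kernel's, and the section walk hand-rolls what the real apei_estatus_for_each_section() macro does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the kernel's guid_t and ACPI/CPER types. */
typedef struct { uint8_t b[16]; } guid_t;

struct acpi_hest_generic_data {
	guid_t section_type;
	uint32_t error_data_length;	/* payload bytes following this header */
};

struct acpi_hest_generic_status {
	uint32_t data_length;		/* total bytes of gdata entries below */
	/* variable-length gdata entries follow */
};

/* Placeholder GUID value; the real CPER_SEC_PCIE GUID lives in
 * include/linux/cper.h. */
static const guid_t CPER_SEC_PCIE = { .b = { 0xd9, 0x95, 0xe9, 0x54 } };

static bool guid_equal(const guid_t *a, const guid_t *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}

/*
 * Return true if only PCIe-AER CPER sections are present, in which case
 * a 'fatal' severity can be ignored and the error deferred.
 */
static bool ghes_is_deferrable(const struct acpi_hest_generic_status *estatus)
{
	const char *p = (const char *)(estatus + 1);
	const char *end = p + estatus->data_length;

	while (p < end) {
		const struct acpi_hest_generic_data *gdata =
			(const struct acpi_hest_generic_data *)p;

		/* Any non-PCIe section means we must not defer. */
		if (!guid_equal(&gdata->section_type, &CPER_SEC_PCIE))
			return false;

		p += sizeof(*gdata) + gdata->error_data_length;
	}

	return true;
}
```

As Alex notes in the reply, a real implementation would also have to verify that each PCIe section actually carries enough information (a valid BDF, error bits) to act on.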
>>>> My point was there is a class of reported-as-fatal error that really is fatal,
>>>> and for these trying to defer the work to irq_work is worse than panic()ing as
>>>> we get less information about what happened.
>>>
>>> How do we get less information? We save the GHES structure on the
>>> estatus_queue, and use it as the error source in either case. Is there a
>>> class of errors where we can reasonably expect things to be okay in NMI,
>>> but go horribly wrong on return from NMI?
>>
>> Yes, (described above). This could happen if we see Arm64's 'SEA' CPER records
>> or x86's 'MCE' CPER records. I'm pretty sure it will happen on both
>> architectures if you return to a context with interrupts masked.
> And linux can handle a wide subset of MCEs just fine, so the
> ghes_is_deferrable() logic would, under my argument, agree to pass
> execution to the actual handlers.
For some classes of error we can't safely get there.
> I do agree that this is one case where
> splitting the handler in a top half and bottom half would make sense.
Yes, splitting this stuff up to have all the NMI-safe stuff up-front, then drop
to IRQ-context for the next round would make this a lot easier.
> Though in that case, the problem is generalized to "how to best handle
> error Y", rather than "how to handle error Y in FFS".
(that's a good thing yes?) I assume x86 can take MCE errors out of IRQ-masked
code. Sharing the handle_foo_error_nmi() code between the two paths would be a
good thing.
>>>> Are all AER errors recoverable? (is an OS:'fatal' AER error ever valid?)
>>>
>>> PCIe "fatal" means that your link is gone and you might not get it back.
>>> You lose all the downstream devices. You may be able to reset the link
>>> and get it back. This sort of "fatal" has no impact on a machine's
>>> ability to continue operation.
>>>
>>> From ACPI, FFS should only send a "fatal" error if a machine needs reset
>>> [1], so it's clearly a firmware bug to mark any PCIe error "fatal". This
>>> won't stop BIOS vendors from taking shortcuts and deciding to crash an
>>> OS by sending the error as "fatal".
>>>
>>> [1] ACPI 6.2: 18.1 Hardware Errors and Error Sources
>>
>> Great! It's never valid. So we can check the CPER records in the NMI handler, and
>> for AER errors, clear the fatal bit printing a nasty message about the firmware.
> There's the counter-argument that the machine is a "XYZ" appliance, and
> losing the PCIe link to the "XYZ" device is essentially "fatal" to the
> (purpose of the) machine. As you noted, there is some leeway ;)
For any appliance I'd expect some health-monitor-thing to spot all the disks in
the disk-enclosure have disappeared and reboot.
(Losing the root-filesystem is probably the general case).
> Sarcasm aside, I do like the idea of a message complaining about the
> firmware.
It wouldn't be appropriate in all cases, but for AER it looks like 'fatal' is
always an over-reaction.
For example, given a memory error firmware can't know whether we can re-read
some page of kernel memory from disk and then reschedule the thread.
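The "clear the fatal bit and complain about the firmware" idea could be sketched roughly as below. Everything here is hypothetical: `all_sections_pcie` stands in for a ghes_is_deferrable()-style walk over the CPER sections, and the function and struct names are made up for illustration. The severity values follow the CPER encoding (0 recoverable, 1 fatal, 2 corrected, 3 informational).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* CPER error severity encoding, as in the UEFI spec and
 * include/linux/cper.h. */
enum {
	CPER_SEV_RECOVERABLE	= 0,
	CPER_SEV_FATAL		= 1,
	CPER_SEV_CORRECTED	= 2,
	CPER_SEV_INFORMATIONAL	= 3,
};

struct estatus_summary {
	int error_severity;
	bool all_sections_pcie;	/* result of a ghes_is_deferrable()-style walk */
};

/*
 * If every section is PCIe-AER, a 'fatal' mark is a firmware
 * over-reaction: downgrade it and warn, instead of panicking.
 */
static int ghes_effective_severity(const struct estatus_summary *e)
{
	if (e->error_severity == CPER_SEV_FATAL && e->all_sections_pcie) {
		fprintf(stderr,
			"FIRMWARE BUG: PCIe AER error marked fatal, downgrading\n");
		return CPER_SEV_RECOVERABLE;
	}
	return e->error_severity;
}
```

Memory or CPU errors keep their firmware-reported severity; only the AER case is second-guessed, matching the *:{AER} versus *:{CPU,MEM} split discussed earlier.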
>>> That's the approach I originally tried, but I didn't like it. The
>>> parsing gets deeper and deeper, and, by the time you realize if the
>>> error can be handled or not, you're already halfway through recovery.
>>>
>>> This is a perfectly viable approach. However, is there a palpable
>>> (non-theoretical) concern that warrants complicating things this way?
>>
>> For AER your text above says no, we can always handle an error, they are never
>> fatal.
>>
>> For memory-errors, yes. An ECC error for a location on the kernel stack while
>> the affected thread was preempt_disable()d. We can't kick the thread off the
>> CPU, and we can't return from the handler as we can't stop accesses to the stack
>> when we return.
>>
>> We can avoid it being complicated by just separating AER errors out when they
>> come in. You've said they never mean the OS has to reboot.
>
> If we're going to split recovery paths, then it makes better sense to
> have a system where handleable errors are passed down to a lesser
> context. Or even run non-blocking handlers in whatever context they are
> received. Much more than having a special case for a specific error.
Yes. I think running the NMI-safe parts of the handler first, then the IRQ-parts
when we drop to IRQ context is the way to do this.
>>> I abstract things at irq_work_queue(): "queue this work and let me
>>> return from this interrupt"
>>
>> ('from this NMI', to come back later as an IRQ, from where we can call
>> schedule_work_on(). This chaining is because the scheduler takes irqsave
>> spin-locks to schedule the work.)
>
> Hopefully, the handling is set up in such a way that I don't have to
> worry about these details on a per-error basis. If I do have to worry,
> then I must be doing something wrong.
I agree, but something has to know. Let's tackle AER first, where we have none of
these concerns, then tackle those that are more closely tied to the CPU state.
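The NMI → irq_work → schedule_work() chain being discussed can be modelled as a toy pipeline. This is purely illustrative: the queues below stand in for the kernel's irq_work_queue() and schedule_work() machinery, and all function names are invented.

```c
#include <assert.h>
#include <string.h>

/* Toy model of the chaining: the NMI handler may only queue work; an
 * IRQ-context callback may take irqsave spinlocks and so can schedule
 * process-context work, where sleepy recovery helpers finally run. */

#define QLEN 8

struct queue { void (*fn[QLEN])(void); int n; };
static struct queue irq_q, work_q;	/* irq_work and workqueue stand-ins */
static char trace[64];

static void enqueue(struct queue *q, void (*fn)(void)) { q->fn[q->n++] = fn; }

static void drain(struct queue *q)
{
	for (int i = 0; i < q->n; i++)
		q->fn[i]();
	q->n = 0;
}

/* Process context: blocking recovery helpers are allowed here. */
static void recovery_work(void) { strcat(trace, "process"); }

/* IRQ context: safe to schedule work, still no sleeping. */
static void ghes_irq_work(void)
{
	strcat(trace, "irq->");
	enqueue(&work_q, recovery_work);
}

/* NMI context: only queue the estatus and request an IRQ follow-up. */
static void ghes_nmi_handler(void)
{
	strcat(trace, "nmi->");
	enqueue(&irq_q, ghes_irq_work);
}
```

The point of the indirection is exactly the one James makes above: the scheduler takes irqsave spinlocks, so schedule_work_on() cannot be called from the NMI itself.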
>>> In the event that FFS fails to prevent an SMI storm, and keeps spamming
>>> the CPU with NMIs, that is a clear firmware bug. It's also the sort of
>>> bug I haven't observed in the wild.
>>
>> Do these SMI ever occur synchronously with the CPUs execution? i.e. if I execute
>> this instruction, I will get an SMI. (or are these your handled-elsewhere-NMI case?)
>
> I'm not aware of synchronous exceptions that end up as SMIs, though I
> imagine it is possible on x86. An IO write to an SMI trap is equivalent
> to that.
Okay, that's good to know.
>> On arm64 poisoned-cache locations triggering an external abort (leading to a
>> NOTIFY_SEA NMI-like notification to APEI) in response to a load is synchronous,
>> hence the live-lock problem.
>
> How is this currently handled in the kernel-first case?
We don't have any kernel-first support today; it would need the Arm version of
MCA (v8.2 RAS Extensions) which are relatively new. Everything we've seen so far
is using firmware-first as it has to know a little about the SoC topology.
How would it be handled? For the kernel-first version of NOTIFY_SEA the code
would get the interrupted pt_regs and can probe the RAS Error registers to get
the physical address and properties of the fault. It can then determine whether
this can be handled, and kick off the appropriate recovery helper.
If it interrupted the kernel, the helper has to schedule the recovery work, if
it interrupted interrupts-masked code, it would have to self-IPI to then be able
to schedule the work. The same return-to-a-broken-context problem exists for
kernel-first, we can't do it there either.
>>> I am not concerned about livelocking the system.
>>
>> This is what keeps me awake at night.
>
> Nothing keeps me awake at night. I keep others awake at night.
(Ha! I owe you a beer!)
>> I think your platform has NMI handled as MCE/kernel-first for synchronous CPU
>> errors. SMI triggered by PCI-AER are notified by NMI too, but handled by
>> firmware-first.
>
> MCEs come in via FFS.
Oh? I thought those were two separate worlds. I assumed MCE-NMIs didn't go via SMM.
> There's a school of thought that sees a certain
> value in logging ECC errors to the BMC. It's arguable whether the ECC
> errors are synchronous in 100% of cases.
>> This hybrid kernel/firmware first depending on the error-source isn't the only
>> way of building things. Appendix-N of the UEFI spec lists CPER records for
>> firmware-first handling of types of error that your platform must 'handle
>> elsewhere'.
>>
>> We have arm systems where kernel-first isn't viable as the equivalent of the MCA
>> registers is platform/integration/silicon-revision specific. These systems
>> handle all their errors using firmware-first. These systems will generate CPER
>> records where fatal means fatal, and returning from the NMI-like notification
>> will either send us into the weeds, or trap us back into firmware.
>
> Without getting into details, the non-viability of kernel-first is
> subjective.
Sure, there are some similar problems for both.
I think the way forward here is to spot defer-able errors, even if they are
marked fatal. Initially this will just be AER, but if we can split up the
handlers to have NMI/IRQ/Process context counterparts, we may be able to do
enough work early-on to defer other types of error.
Thanks,
James
On 04/13/2018 11:38 AM, James Morse wrote:
> Hi Alex,
>
> On 09/04/18 19:11, Alex G. wrote:
>> On 04/06/2018 01:24 PM, James Morse wrote:
>> Do you have any ETA on when your SEA patches are going to make it
>> upstream? There's not much point in updating my patchset if it's going
>> to conflict with your work.
>
> The SEA stuff went in with 7edda0886bc3 ("acpi: apei: handle SEA notification
> type for ARMv8"). My series is moving it to use the estatus-queue in the same
> way as x86's NOTIFY_NMI does. This lets us safely add the other two NMI-like
> notifications. I have no idea on the ETA, it depends on review feedback!
Okay. I'll get a v2 out soonish then.
(snip)
> This assumes a cache-invalidate will clear the error, which I don't think we're
> guaranteed on arm.
> It also destroys any adjacent data, "everyone's happy" includes the thread that
> got a chunk of someone-else's stack frame, I don't think it will be happy for
> very long!
Hmm, no cache-line (or page) invalidation on arm64? How does
dma_map/unmap_*() work then? You may not guarantee to fix the error, but
I don't buy into the "let's crash without trying" argument.
> (this is a side issue for AER though)
Somebody muddled up AER with these tables, so we now have to worry about
it. :)
(snip)
>> How does FFS handle race conditions that can occur when accessing HW
>> concurrently with the OS? I'm told it's the main reasons why BIOS
>> doesn't release unused cores from SMM early.
>
> This is firmware's problem, it depends on whether there is any hardware that is
> shared with the OS. Some hardware can be marked 'secure' in which case only
> firmware can access it, alternatively firmware can trap or just disable the OS's
> access to the shared hardware.
It's everyone's problem. It's the firmware's responsibility.
> For example, with the v8.2 RAS Extensions, there are some per-cpu error
> registers. Firmware can disable these for the OS, so that it always reads 0 from
> them. Instead firmware takes the error via FF, reads the registers from
> firmware, and dumps CPER records into the OS's memory.
>
> If there is a shared hardware resource that both the OS and firmware may be
> accessing, yes firmware needs to pull the other CPUs in, but this depends on the
> SoC design, it doesn't necessarily happen.
The problem with shared resources is just a problem. I've seen systems
where all 100 cores are held up for 300+ ms. In latency-critical
applications reliability drops exponentially. Am I correct in assuming
your answer would be to "hide" more stuff from the OS?
(snip)
> Sure, we're quirking our behaviour based on a high level of mistrust for the
> firmware. My point here was we shouldn't duplicate the implementation because we
> want x86:{AER,CPU,MEM} to behave differently to arm64:{AER,CPU,MEM}. I'd rather
> the quirked-behaviour was along the *:{AER} versus *:{CPU,MEM} line. If we have
> extra code to spot deferrable errors, we should use it on both architectures.
It's a well earned and well deserved mistrust. Firmware is evil (*).
(*) sarcastic overstatement of facts
(snip)
> For AER we agree, these never mean 'the CPU is on fire'.
Sounds like a good marketing slogan:
"ACME's new turbo-encabulated CPU -- it's on fire!"
(snip)
>>> even if broken firmware thinks they are fatal.
>
>> I think the idea of firmware-first is broken. But it's there, it's
>> shipping in FW, so we have to accommodate it in SW.
>
> Part of our different-views here is firmware-first is taking something away from
> you, whereas for me its giving me information that would otherwise be in
> secret-soc-specific registers.
Under this interpretation, FFS is a band-aid to the problem of "secret"
registers. "Secret" hardware doesn't really fit well into the idea of an
OS [1]. The irony of the solution is that the response is centered on
firmware extensions which were designed to stifle interoperability [2].
You are right, FFS is a poopstorm and headache for me. It takes my
sanity away. I once wrote FW for a laptop where the only use of SMM was
to run the "Enable ACPI" call -- which only disabled further SMM. We
didn't need an iota of FFS because at that time AMD wasn't secretive,
and there was no need to have "secret" registers in the first place.
[1] https://www.youtube.com/watch?v=_36yNWw_07g
[2]
http://antitrust.slated.org/www.iowaconsumercase.org/011607/3000/PX03020.pdf
(snip)
>> And linux can handle a wide subset of MCEs just fine, so the
>> ghes_is_deferrable() logic would, under my argument, agree to pass
>> execution to the actual handlers.
>
> For some classes of error we can't safely get there.
Optimize for the common case.
(snip)
>> Though in that case, the problem is generalized to "how to best handle
>> error Y", rather than "how to handle error Y in FFS".
>
> (that's a good thing yes?) I assume x86 can take MCE errors out of IRQ-masked
> code. Sharing the handle_foo_error_nmi() code between the two paths would be a
> good thing.
That's a good thing, yes. But FFS, oblivious to kernel context, doesn't always
let you do this. Getting to your kernel data structures is not IRQ-safe in the
general case. In the AER case, you need pci_get_domain_bus_and_slot(),
which is not IRQ-safe. So, while in the native case, you can stage
context pointers, and separate things based on interrupt vector, you
don't get this sort of flexibility with FFS.
There is the extra step of recovering your context, which is not the
sort of situation you often get with kernel-first. Most of the
discussion has been around whether to crash before we even get this
context. Stay tuned for v2 though.
(snip)
>> Sarcasm aside, I do like the idea of a message complaining about the
>> firmware.
>
> It wouldn't be appropriate in all cases, but for AER it looks like 'fatal' is
> always an over-reaction.
> For example, given a memory error firmware can't know whether we can re-read
> some page of kernel memory from disk and then reschedule the thread.
If firmware can't know the error is recoverable, it also can't know it's
fatal. When firmware does try to make a determination from insufficient
information, it opens itself up to sarky remarks. That's fair game.
(snip)
>> If we're going to split recovery paths, then it makes better sense to
>> have a system where handleable errors are passed down to a lesser
>> context. Or even run non-blocking handlers in whatever context they are
>> received. Much more than having a special case for a specific error.
>
> Yes. I think running the NMI-safe parts of the handler first, then the IRQ-parts
> when we drop to IRQ context is the way to do this.
Coming soon to a ML near you.
>>>> I abstract things at irq_work_queue(): "queue this work and let me
>>>> return from this interrupt"
>>>
>>> ('from this NMI', to come back later as an IRQ, from where we can call
>>> schedule_work_on(). This chaining is because the scheduler takes irqsave
>>> spin-locks to schedule the work.)
>>
>> Hopefully, the handling is set up in such a way that I don't have to
>> worry about these details on a per-error basis. If I do have to worry,
>> them I must be doing something wrong.
>
> I agree, but something has to know. Lets tackle AER first,
The actual handler knows. For AER, that's do_recovery().
> where we have none of theses concerns
Untamed PCIe errors can end up as MCEs and triple-fault the CPU :)
You didn't think we get a free ride with AER, did you?
(snip)
>>> On arm64 poisoned-cache locations triggering an external abort (leading to a
>>> NOTIFY_SEA NMI-like notification to APEI) in response to a load is synchronous,
>>> hence the live-lock problem.
>>
>> How is this currently handled in the kernel-first case?
>
> We don't have any kernel-first support today
:(
> it would need the Arm version of
> MCA (v8.2 RAS Extensions) which are relatively new. Everything we've seen so far
> is using firmware-first as it has to know a little about the SoC topology.
>
> How would it be handled? For the kernel-first version of NOTIFY_SEA the code
> would get the interrupted pt_regs and can probe the RAS Error registers (snip)
Sounds like a regular, standard, raceless error handler :)
>>>> I am not concerned about livelocking the system.
>>>
>>> This is what keeps me awake at night.
>>
>> Nothing keeps me awake at night. I keep others awake at night.
>
> (Ha! I owe you a beer!)
(I will not forget)
>>> I think your platform has NMI handled as MCE/kernel-first for synchronous CPU
>>> errors. SMI triggered by PCI-AER are notified by NMI too, but handled by
>>> firmware-first.
>>
>> MCEs come in via FFS.
>
> Oh? I thought those were two separate worlds. I assumed MCE-NMIs didn't go via SMM.
FFS is done in SMM. I'm certain there's a very simple and easy way in
which MCEs can be reported kernel-first, but I've only seen it done FFS
recently.
(snip)
> I think the way forward here is to spot defer-able errors, even if they are
> marked fatal. Initially this will just be AER, but if we can split up the
> handlers to have NMI/IRQ/Process context counterparts, we may be able to do
> enough work early-on to defer other types of error.
You're gonna love my v2.
Alex
Hi Alex,
On 04/16/2018 10:59 PM, Alex G. wrote:
> On 04/13/2018 11:38 AM, James Morse wrote:
>> This assumes a cache-invalidate will clear the error, which I don't think we're
>> guaranteed on arm.
>> It also destroys any adjacent data, "everyone's happy" includes the thread that
>> got a chunk of someone-else's stack frame, I don't think it will be happy for
>> very long!
>
> Hmm, no cache-line (or page) invalidation on arm64? How does
> dma_map/unmap_*() work then? You may not guarantee to fix the error, but
There are cache-invalidate instructions, but I don't think 'solving' a
RAS error with them is the right thing to do.
> I don't buy into the "let's crash without trying" argument.
Our 'cache writeback granule' may be as large as 2K, so we may have to
invalidate up to 2K of data to convince the hardware this address is
okay again.
All we've done here is differently-corrupt the data so that it no longer
generates a RAS fault, it just gives you the wrong data instead.
Cache-invalidation is destructive.
I don't think there is a one-size-fits-all solution here.
>> (this is a side issue for AER though)
>
> Somebody muddled up AER with these tables, so we now have to worry about
> it. :)
Eh? I see there is a v2, maybe I'll understand this comment once I read it.
>>> How does FFS handle race conditions that can occur when accessing HW
>>> concurrently with the OS? I'm told it's the main reasons why BIOS
>>> doesn't release unused cores from SMM early.
>>
>> This is firmware's problem, it depends on whether there is any hardware that is
>> shared with the OS. Some hardware can be marked 'secure' in which case only
>> firmware can access it, alternatively firmware can trap or just disable the OS's
>> access to the shared hardware.
>
> It's everyone's problem. It's the firmware's responsibility.
It depends on the SoC design. If there is no hardware that the OS and
firmware both need to access to handle an error then I don't think
firmware needs to do this.
>> For example, with the v8.2 RAS Extensions, there are some per-cpu error
>> registers. Firmware can disable these for the OS, so that it always reads 0 from
>> them. Instead firmware takes the error via FF, reads the registers from
>> firmware, and dumps CPER records into the OS's memory.
>>
>> If there is a shared hardware resource that both the OS and firmware may be
>> accessing, yes firmware needs to pull the other CPUs in, but this depends on the
>> SoC design, it doesn't necessarily happen.
>
> The problem with shared resources is just a problem. I've seen systems
> where all 100 cores are held up for 300+ ms. In latency-critical
> applications reliability drops exponentially. Am I correct in assuming
> your answer would be to "hide" more stuff from the OS?
No, I'm not a fan of firmware cycle stealing. If you can design the SoC or
firmware so that the 'all CPUs' stuff doesn't need to happen, then you won't
get these issues. (I don't design these things, I'm sure they're much more
complicated than I think!)
Because the firmware is SoC-specific, it only needs to do exactly
what is necessary.
>>> I think the idea of firmware-first is broken. But it's there, it's
>>> shipping in FW, so we have to accommodate it in SW.
>>
>> Part of our different-views here is firmware-first is taking something away from
>> you, whereas for me it's giving me information that would otherwise be in
>> secret-soc-specific registers.
>
> Under this interpretation, FFS is a band-aid to the problem of "secret"
> registers. "Secret" hardware doesn't really fit well into the idea of an
> OS [1].
Sorry, I'm being sloppy with my terminology, by secret-soc-specific I
mean either Linux can't access them (firmware privilege-level only) or
Linux can't reasonably know where these registers are, as they're
soc-specific and vary by manufacturer.
>>> And linux can handle a wide subset of MCEs just fine, so the
>>> ghes_is_deferrable() logic would, under my argument, agree to pass
>>> execution to the actual handlers.
>>
>> For some classes of error we can't safely get there.
>
> Optimize for the common case.
At the expense of reliability?
Thanks,
James
On 04/20/2018 02:27 AM, James Morse wrote:
> Hi Alex,
>
> On 04/16/2018 10:59 PM, Alex G. wrote:
>> On 04/13/2018 11:38 AM, James Morse wrote:
>>> This assumes a cache-invalidate will clear the error, which I don't think we're
>>> guaranteed on arm.
>>> It also destroys any adjacent data, "everyone's happy" includes the thread that
>>> got a chunk of someone-else's stack frame, I don't think it will be happy for
>>> very long!
>>
>> Hmm, no cache-line (or page) invalidation on arm64? How does
>> dma_map/unmap_*() work then? You may not guarantee to fix the error, but
>
> There are cache-invalidate instructions, but I don't think 'solving' a
> RAS error with them is the right thing to do.
You seem to be putting RAS on a pedestal on a very cloudy and foggy day.
I admit that I fail to see the specialness of RAS in comparison to other
errors.
>> I don't buy into the "let's crash without trying" argument.
>
> Our 'cache writeback granule' may be as large as 2K, so we may have to
> invalidate up to 2K of data to convince the hardware this address is
> okay again.
Eureka! The OS can invalidate the entire page; it maps 1:1 onto the memory
management data.
> All we've done here is differently-corrupt the data so that it no longer
> generates a RAS fault, it just gives you the wrong data instead.
> Cache-invalidation is destructive.
>
> I don't think there is a one-size-fits-all solution here.
Of course there isn't. That's not the issue.
A cache corruption is a special case of a memory access issue, and that,
we already know how to handle. Triple-fault and cpu-on-fire concerns
apply wrt returning to the context which triggered the problem. We've
already figured that out.
There is a lot of opportunity here for using well tested code paths and
not crashing on first go. Why let firmware make this a problem again?
>>> (this is a side issue for AER though)
>>
>> Somebody muddled up AER with these tables, so we now have to worry about
>> it. :)
>
> Eh? I see there is a v2, maybe I'll understand this comment once I read it.
I meant that somebody (the spec writers) decided to put ominous errors
(PCIe) on the same severity scale with "cpu is on fire" errors.
>>>> How does FFS handle race conditions that can occur when accessing HW
>>>> concurrently with the OS? I'm told it's the main reasons why BIOS
>>>> doesn't release unused cores from SMM early.
>>>
>>> This is firmware's problem, it depends on whether there is any hardware that is
>>> shared with the OS. Some hardware can be marked 'secure' in which case only
>>> firmware can access it, alternatively firmware can trap or just disable the OS's
>>> access to the shared hardware.
>>
>> It's everyone's problem. It's the firmware's responsibility.
>
> It depends on the SoC design. If there is no hardware that the OS and
> firmware both need to access to handle an error then I don't think
> firmware needs to do this.
>
>
>>> For example, with the v8.2 RAS Extensions, there are some per-cpu error
>>> registers. Firmware can disable these for the OS, so that it always
>>> reads 0 from them. Instead firmware takes the error via FF, reads the
>>> registers from firmware, and dumps CPER records into the OS's memory.
>>>
>>> If there is a shared hardware resource that both the OS and firmware
>>> may be accessing, yes firmware needs to pull the other CPUs in, but
>>> this depends on the SoC design, it doesn't necessarily happen.
>>
>> The problem with shared resources is just a problem. I've seen systems
>> where all 100 cores are held up for 300+ ms. In latency-critical
>> applications, reliability drops exponentially with stalls like that.
>> Am I correct in assuming your answer would be to "hide" more stuff
>> from the OS?
>
> No, I'm not a fan of firmware cycle stealing. If you can design the SoC
> or firmware so that the 'all CPUs' stuff doesn't need to happen, then
> you won't get these issues. (I don't design these things, I'm sure
> they're much more complicated than I think!)
>
> Because the firmware is SoC-specific, so it only needs to do exactly
> what is necessary.
Irrespective of the hardware design, there's devicetree, ACPI methods,
and a few other ways to inform the OS of non-standard bits. They don't
have the resource sharing problem. I'm confused as to why FFS is used
instead of these race-free alternatives when the concern is resource
conflicts.
>>>> I think the idea of firmware-first is broken. But it's there, it's
>>>> shipping in FW, so we have to accommodate it in SW.
>>>
>>> Part of our different-views here is firmware-first is taking
> something away from
>>> you, whereas for me its giving me information that would otherwise be in
>>> secret-soc-specific registers.
>>
>> Under this interpretation, FFS is a band-aid to the problem of "secret"
>> registers. "Secret" hardware doesn't really fit well into the idea of an
>> OS [1].
>
> Sorry, I'm being sloppy with my terminology, by secret-soc-specific I
> mean either Linux can't access them (firmware privilege-level only) or
> Linux can't reasonably know where these registers are, as they're
> soc-specific and vary by manufacture.
This is still a software problem. I'm assuming register access can be
granted to the OS, and I'm also assuming that there exists a non-FFS way
to describe the registers to the OS.
>>>> And linux can handle a wide subset of MCEs just fine, so the
>>>> ghes_is_deferrable() logic would, under my argument, agree to pass
>>>> execution to the actual handlers.
>>>
>>> For some classes of error we can't safely get there.
>>
>> Optimize for the common case.
>
> At the expense of reliability?
Who suggested sacrificing reliability?
Alex