2022-10-27 04:39:32

by Shuai Xue

Subject: [PATCH] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on action required events

There are two major types of uncorrected error (UC):

- Action Required: The error is detected and the processor has already
consumed the memory. The OS is required to take action (for example, offline
the failing page/kill the failing thread) to recover from this uncorrectable
error.

- Action Optional: The error is detected out of processor execution context.
Some data in the memory is corrupted, but the data has not been consumed.
The OS may optionally take action to recover from this uncorrectable error.

On X86 platforms, we can easily distinguish between these two types based
on the MCA bank. On arm64 platforms, however, the memory failure flags for
all UCs whose severity is GHES_SEV_RECOVERABLE are currently set to 0,
i.e. Action Optional.

If a UC is detected by a background scrubber, it is obviously an Action
Optional error. For other errors, we should conservatively regard them
as Action Required.

cper_sec_mem_err::error_type identifies the type of error that occurred
if CPER_MEM_VALID_ERROR_TYPE is set. So, set the memory failure flags to 0
for Scrub Uncorrected Error (type 14). Otherwise, set the memory failure
flags to MF_ACTION_REQUIRED.

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/acpi/apei/ghes.c | 10 ++++++++--
include/linux/cper.h | 3 +++
2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 80ad530583c9..6c03059cbfc6 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -474,8 +474,14 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
if (sec_sev == GHES_SEV_CORRECTED &&
(gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
flags = MF_SOFT_OFFLINE;
- if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
- flags = 0;
+ if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE) {
+ if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE)
+ flags = mem_err->error_type == CPER_MEM_SCRUB_UC ?
+ 0 :
+ MF_ACTION_REQUIRED;
+ else
+ flags = MF_ACTION_REQUIRED;
+ }

if (flags != -1)
return ghes_do_memory_failure(mem_err->physical_addr, flags);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index eacb7dd7b3af..b77ab7636614 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -235,6 +235,9 @@ enum {
#define CPER_MEM_VALID_BANK_ADDRESS 0x100000
#define CPER_MEM_VALID_CHIP_ID 0x200000

+#define CPER_MEM_SCRUB_CE 13
+#define CPER_MEM_SCRUB_UC 14
+
#define CPER_MEM_EXT_ROW_MASK 0x3
#define CPER_MEM_EXT_ROW_SHIFT 16

--
2.20.1.9.gb50a0d7



2022-10-28 17:57:11

by Rafael J. Wysocki

Subject: Re: [PATCH] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on action required events

On Thu, Oct 27, 2022 at 6:25 AM Shuai Xue <[email protected]> wrote:
>
> There are two major types of uncorrected error (UC) :
>
> - Action Required: The error is detected and the processor already consumes the
> memory. OS requires to take action (for example, offline failure page/kill
> failure thread) to recover this uncorrectable error.
>
> - Action Optional: The error is detected out of processor execution context.
> Some data in the memory are corrupted. But the data have not been consumed.
> OS is optional to take action to recover this uncorrectable error.
>
> For X86 platforms, we can easily distinguish between these two types
> based on the MCA Bank. While for arm64 platform, the memory failure
> flags for all UCs which severity are GHES_SEV_RECOVERABLE are set as 0,
> a.k.a, Action Optional now.
>
> If UC is detected by a background scrubber, it is obviously an Action
> Optional error. For other errors, we should conservatively regard them
> as Action Required.
>
> cper_sec_mem_err::error_type identifies the type of error that occurred
> if CPER_MEM_VALID_ERROR_TYPE is set. So, set memory failure flags as 0
> for Scrub Uncorrected Error (type 14). Otherwise, set memory failure
> flags as MF_ACTION_REQUIRED.
>
> Signed-off-by: Shuai Xue <[email protected]>

I need input from the APEI reviewers on this.

Thanks!

> ---
> drivers/acpi/apei/ghes.c | 10 ++++++++--
> include/linux/cper.h | 3 +++
> 2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 80ad530583c9..6c03059cbfc6 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -474,8 +474,14 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> if (sec_sev == GHES_SEV_CORRECTED &&
> (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
> flags = MF_SOFT_OFFLINE;
> - if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
> - flags = 0;
> + if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE) {
> + if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE)
> + flags = mem_err->error_type == CPER_MEM_SCRUB_UC ?
> + 0 :
> + MF_ACTION_REQUIRED;
> + else
> + flags = MF_ACTION_REQUIRED;
> + }
>
> if (flags != -1)
> return ghes_do_memory_failure(mem_err->physical_addr, flags);
> diff --git a/include/linux/cper.h b/include/linux/cper.h
> index eacb7dd7b3af..b77ab7636614 100644
> --- a/include/linux/cper.h
> +++ b/include/linux/cper.h
> @@ -235,6 +235,9 @@ enum {
> #define CPER_MEM_VALID_BANK_ADDRESS 0x100000
> #define CPER_MEM_VALID_CHIP_ID 0x200000
>
> +#define CPER_MEM_SCRUB_CE 13
> +#define CPER_MEM_SCRUB_UC 14
> +
> #define CPER_MEM_EXT_ROW_MASK 0x3
> #define CPER_MEM_EXT_ROW_SHIFT 16
>
> --
> 2.20.1.9.gb50a0d7
>

2022-10-28 18:02:24

by Luck, Tony

Subject: RE: [PATCH] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on action required events

>> cper_sec_mem_err::error_type identifies the type of error that occurred
>> if CPER_MEM_VALID_ERROR_TYPE is set. So, set memory failure flags as 0
>> for Scrub Uncorrected Error (type 14). Otherwise, set memory failure
>> flags as MF_ACTION_REQUIRED.

On x86 the "action required" cases are signaled by a synchronous machine check
that is delivered before the instruction that is attempting to consume the uncorrected
data retires. I.e., it is guaranteed that the uncorrected error has not been propagated
because it is not visible in any architectural state.

APEI signaled errors don't fall into that category on x86 ... the uncorrected data
could have been consumed and propagated long before the signaling used for
APEI can alert the OS.

Does ARM deliver APEI signals synchronously?

If not, then this patch might deliver a false sense of security to applications
about the state of uncorrected data in the system.

-Tony

2022-11-02 07:48:18

by Shuai Xue

Subject: Re: [PATCH] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on action required events



On 2022/10/29 AM1:08, Rafael J. Wysocki wrote:
> On Thu, Oct 27, 2022 at 6:25 AM Shuai Xue <[email protected]> wrote:
>>
>> There are two major types of uncorrected error (UC) :
>>
>> - Action Required: The error is detected and the processor already consumes the
>> memory. OS requires to take action (for example, offline failure page/kill
>> failure thread) to recover this uncorrectable error.
>>
>> - Action Optional: The error is detected out of processor execution context.
>> Some data in the memory are corrupted. But the data have not been consumed.
>> OS is optional to take action to recover this uncorrectable error.
>>
>> For X86 platforms, we can easily distinguish between these two types
>> based on the MCA Bank. While for arm64 platform, the memory failure
>> flags for all UCs which severity are GHES_SEV_RECOVERABLE are set as 0,
>> a.k.a, Action Optional now.
>>
>> If UC is detected by a background scrubber, it is obviously an Action
>> Optional error. For other errors, we should conservatively regard them
>> as Action Required.
>>
>> cper_sec_mem_err::error_type identifies the type of error that occurred
>> if CPER_MEM_VALID_ERROR_TYPE is set. So, set memory failure flags as 0
>> for Scrub Uncorrected Error (type 14). Otherwise, set memory failure
>> flags as MF_ACTION_REQUIRED.
>>
>> Signed-off-by: Shuai Xue <[email protected]>
>
> I need input from the APEI reviewers on this.
>
> Thanks!

Hi, Rafael,

Sorry, I missed this email. Thank you for your quick reply. Let's discuss
this with the reviewers.

Thank you.

Cheers,
Shuai


>
>> ---
>> drivers/acpi/apei/ghes.c | 10 ++++++++--
>> include/linux/cper.h | 3 +++
>> 2 files changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 80ad530583c9..6c03059cbfc6 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -474,8 +474,14 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>> if (sec_sev == GHES_SEV_CORRECTED &&
>> (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
>> flags = MF_SOFT_OFFLINE;
>> - if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
>> - flags = 0;
>> + if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE) {
>> + if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE)
>> + flags = mem_err->error_type == CPER_MEM_SCRUB_UC ?
>> + 0 :
>> + MF_ACTION_REQUIRED;
>> + else
>> + flags = MF_ACTION_REQUIRED;
>> + }
>>
>> if (flags != -1)
>> return ghes_do_memory_failure(mem_err->physical_addr, flags);
>> diff --git a/include/linux/cper.h b/include/linux/cper.h
>> index eacb7dd7b3af..b77ab7636614 100644
>> --- a/include/linux/cper.h
>> +++ b/include/linux/cper.h
>> @@ -235,6 +235,9 @@ enum {
>> #define CPER_MEM_VALID_BANK_ADDRESS 0x100000
>> #define CPER_MEM_VALID_CHIP_ID 0x200000
>>
>> +#define CPER_MEM_SCRUB_CE 13
>> +#define CPER_MEM_SCRUB_UC 14
>> +
>> #define CPER_MEM_EXT_ROW_MASK 0x3
>> #define CPER_MEM_EXT_ROW_SHIFT 16
>>
>> --
>> 2.20.1.9.gb50a0d7
>>

2022-11-02 11:57:26

by Shuai Xue

Subject: Re: [PATCH] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on action required events



On 2022/10/29 AM1:25, Luck, Tony wrote:
>>> cper_sec_mem_err::error_type identifies the type of error that occurred
>>> if CPER_MEM_VALID_ERROR_TYPE is set. So, set memory failure flags as 0
>>> for Scrub Uncorrected Error (type 14). Otherwise, set memory failure
>>> flags as MF_ACTION_REQUIRED.
>
> On x86 the "action required" cases are signaled by a synchronous machine check
> that is delivered before the instruction that is attempting to consume the uncorrected
> data retires. I.e., it is guaranteed that the uncorrected error has not been propagated
> because it is not visible in any architectural state.

On Arm, if a 2-bit (uncorrectable) error is detected and the memory access has
been architecturally executed, that error is considered "consumed". The CPU
will take a synchronous error exception, signaled as a synchronous external
abort (SEA), which is analogous to an MCE.

>
> APEI signaled errors don't fall into that category on x86 ... the uncorrected data
> could have been consumed and propagated long before the signaling used for
> APEI can alert the OS.
>
> Does ARM deliver APEI signals synchronously?
>
> If not, then this patch might deliver a false sense of security to applications
> about the state of uncorrected data in the system.
>

Well, not always. There are many APEI notification types, such as SCI, GSIV,
GPIO, SDEI, SEA, etc. Not all APEI notifications are synchronous; it depends
on the hardware signal. As far as I know, if a UE is detected and consumed, a
synchronous external abort is signaled to firmware; firmware then performs
first-level triage and synchronously notifies the OS via an SDEI or SEA
notification. On the other hand, if a CE is detected, an asynchronous
interrupt is signaled and firmware can notify the OS via GPIO or GSIV.
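
To make the above concrete, here is a rough sketch of how the OS side could
key its handling mode off the GHES notification type. ACPI_HEST_NOTIFY_* and
struct ghes are the kernel's existing ACPI/APEI definitions; the helper name
is only illustrative, and which notification types should count as
synchronous is exactly what is being discussed in this thread:

	/* Illustrative sketch: treat SEA-notified error sources as synchronous. */
	static bool ghes_notify_is_synchronous(struct ghes *ghes)
	{
		switch (ghes->generic->notify.type) {
		case ACPI_HEST_NOTIFY_SEA:	/* synchronous external abort */
			return true;
		case ACPI_HEST_NOTIFY_GSIV:	/* interrupt-driven, asynchronous */
		case ACPI_HEST_NOTIFY_GPIO:
		case ACPI_HEST_NOTIFY_SCI:
		default:
			return false;
		}
	}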

Best Regards,
Shuai



2022-11-22 12:08:54

by Shuai Xue

Subject: Re: [PATCH] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on action required events



On 2022/11/2 PM7:53, Shuai Xue wrote:
>
>
> On 2022/10/29 AM1:25, Luck, Tony wrote:
>>>> cper_sec_mem_err::error_type identifies the type of error that occurred
>>>> if CPER_MEM_VALID_ERROR_TYPE is set. So, set memory failure flags as 0
>>>> for Scrub Uncorrected Error (type 14). Otherwise, set memory failure
>>>> flags as MF_ACTION_REQUIRED.
>>
>> On x86 the "action required" cases are signaled by a synchronous machine check
>> that is delivered before the instruction that is attempting to consume the uncorrected
>> data retires. I.e., it is guaranteed that the uncorrected error has not been propagated
>> because it is not visible in any architectural state.
>
> On arm, if a 2-bit (uncorrectable) error is detected, and the memory access has been
> architecturally executed, that error is considered “consumed”. The CPU will take a
> synchronous error exception, signaled as synchronous external abort (SEA), which is
> analogously to MCE.
>
>>
>> APEI signaled errors don't fall into that category on x86 ... the uncorrected data
>> could have been consumed and propagated long before the signaling used for
>> APEI can alert the OS.
>>
>> Does ARM deliver APEI signals synchronously?
>>
>> If not, then this patch might deliver a false sense of security to applications
>> about the state of uncorrected data in the system.
>>
>
> Well, it does not always. There are many APEI notification, such as SCI, GSIV, GPIO,
> SDEI, SEA, etc. Not all APEI notifications are synchronously and it depends on
> hardware signal. As far as I know, if a UE is detected and consumed, synchronous external
> abort is signaled to firmware and firmware then performs a first-level triage and
> synchronously notify OS by SDEI or SEA notification. On the other hand, if CE is
> detected, a asynchronous interrupt will be signaled and firmware could notify OS
> by GPIO or GSIV.
>
> Best Regards,
> Shuai
>
>


Hi, Tony,

A prefetch of data with a UE triggers an asynchronous interrupt on both the
X86 and Arm64 platforms (CMCI on X86 and SPI on arm64). It does not belong to
the scrub UEs. I have to admit that cper_sec_mem_err::error_type is not an
appropriate basis for distinguishing "action required" cases.



acpi_hest_generic_data::flags (UEFI spec section N.2.2) could be used to indicate
Action Optional (Scrub/Prefetch).

Bit 5 – Latent error: If set this flag indicates that action has been
taken to ensure error containment (such as poisoning data), but
the error has not been fully corrected and the data has not been
consumed. System software may choose to take further
corrective action before the data is consumed.

Our hardware team has submitted a proposal to the UEFI community to add a new bit:

Bit 8 – sync flag; if set this flag indicates that
this event record is synchronous(e.g. cpu
core consumes poison data, then cause
instruction/data abort); if not set, this event
record is asynchronous.

With bit 8, we will know it is "Action Required".


I will send a new patch set to rework GHES error handling after the proposal is accepted.


Thank you.

Best Regards
Shuai




2022-12-06 15:48:59

by Shuai Xue

Subject: [RFC PATCH 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

There are two major types of uncorrected error (UC):

- Action Required: The error is detected and the processor has already
consumed the memory. The OS is required to take action (for example,
offline the failing page/kill the failing thread) to recover from this
uncorrectable error.

- Action Optional: The error is detected out of processor execution
context. Some data in the memory is corrupted, but the data has not
been consumed. The OS may optionally take action to recover from this
uncorrectable error.

On X86 platforms, we can easily distinguish between these two types based
on the MCA bank. On arm64 platforms, however, the memory failure flags for
all UCs whose severity is GHES_SEV_RECOVERABLE are currently set to 0, i.e.
Action Optional. Set the memory failure flags to MF_ACTION_REQUIRED on
synchronous events.

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/acpi/apei/ghes.c | 2 +-
include/linux/cper.h | 22 ++++++++++++++++++++++
2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 9952f3a792ba..a420759fce2d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -475,7 +475,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
(gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
flags = MF_SOFT_OFFLINE;
if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
- flags = 0;
+ flags = (gdata->flags & CPER_SEC_SYNC) ? MF_ACTION_REQUIRED : 0;

if (flags != -1)
return ghes_do_memory_failure(mem_err->physical_addr, flags);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index eacb7dd7b3af..a3571fa8a73d 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -144,6 +144,28 @@ enum {
* corrective action before the data is consumed
*/
#define CPER_SEC_LATENT_ERROR 0x0020
+/*
+ * If set, the section is to be associated with an error that has been
+ * propagated due to hardware poisoning. This implies the error is a symptom of
+ * another error. It is not always possible to ascertain whether this is the
+ * case for an error, therefore if the flag is not set, it is unknown whether
+ * the error was propagated. this helps determining FRU when dealing with HW
+ * failures
+ */
+#define CPER_SEC_PROPAGATED 0x0040
+/*
+ * If set this flag indicates the firmware has detected an overflow of
+ * buffers/queues that are used to accumulate, collect, or report errors (e.g.
+ * the error status control block exposed to the OS). When this occurs, some
+ * error records may be lost.
+ */
+#define CPER_SEC_OVERFLOW 0x0080
+/*
+ * If set, it indicates that this event record is synchronous(e.g. cpu core
+ * consumes poison data, then cause instruction/data abort); if not set,
+ * this event record is asynchronous.
+ */
+#define CPER_SEC_SYNC 0x00100

/*
* Section type definitions, used in section_type field in struct
--
2.20.1.12.g72788fdb

2022-12-06 15:51:13

by Shuai Xue

Subject: [RFC PATCH 0/2] ACPI: APEI: handle synchronous exceptions in task work

Currently, both synchronous and asynchronous error are queued and handled by a
dedicated kthread in workqueue. Memory failure for synchronous error is
synced by a trick.

Although the task could be killed by page fault, the memory failure is handled
in a kthread context so that the hwpoison-aware mechanisms, e.g. PF_MCE_EARLY,
early kill, does not work as expected.

To this end, separate synchronous and asynchronous error handling into
different paths like X86 does:

- task work for synchronous error.
- and workqueue for asynchronous error.
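
For readers unfamiliar with the mechanism, the synchronous path boils down to
roughly the following task_work pattern (an illustrative sketch only: the
struct and function names are placeholders, and the concrete implementation
is in patch 2 of this series):

	struct sync_mf_work {
		struct callback_head twork;	/* task_work callback head */
		unsigned long pfn;		/* corrupted page frame number */
		int flags;			/* memory_failure() flags */
	};

	/*
	 * Runs in the context of the task that consumed the poison, just
	 * before it returns to user-space.
	 */
	static void sync_mf_work_fn(struct callback_head *twork)
	{
		struct sync_mf_work *w = container_of(twork, struct sync_mf_work, twork);

		memory_failure(w->pfn, w->flags);
		kfree(w);
	}

	/* Called from the error handler instead of memory_failure_queue(). */
	static int queue_sync_mf(unsigned long pfn, int flags)
	{
		struct sync_mf_work *w = kmalloc(sizeof(*w), GFP_ATOMIC);

		if (!w)
			return -ENOMEM;

		w->pfn = pfn;
		w->flags = flags;
		init_task_work(&w->twork, sync_mf_work_fn);
		return task_work_add(current, &w->twork, TWA_RESUME);
	}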

This patch set is based on a new UEFI proposal submitted by our colleague Yingwen.[1]

> Background:
>
> In ARM world, two type events (Sync/Async) from hardware IP need OS/VMM take different actions.
> Current CPER memory error record is not able to distinguish sync/async type event right now.
> Current OS/VMM need to take extra actions beyond CPER which is heavy burden to identify the
> two type events
>
> Sync event (e.g. CPU consume poisoned data) --> Firmware -> CPER error log --> OS/VMM take recovery action.
> Async event (e.g. Memory controller detect UE event) --> Firmware --> CPER error log --> OS take page action.
>
>
> Proposal:
>
> - In section description Flags field(UEFI spec section N.2, add sync flag as below. OS/VMM
> could depend on this flag to distinguish sync/async events.
> - Bit8 – sync flag; if set this flag indicates that this event record is synchronous(e.g.
> cpu core consumes poison data, then cause instruction/data abort); if not set, this event record is asynchronous.
>
> Best regards,
> Yingwen Chen
>
> [ Shuai Xue: The thread is only open to members of the UEFI Workgroup.
> Pasted here for discussion. ]

[1] https://members.uefi.org/wg/uswg/mail/thread/9453

Shuai Xue (2):
ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
synchronous events
ACPI: APEI: separate synchronous error handling into task work

drivers/acpi/apei/ghes.c | 120 ++++++++++++++++++++++-----------------
include/linux/cper.h | 22 +++++++
2 files changed, 89 insertions(+), 53 deletions(-)

--
2.20.1.12.g72788fdb

2022-12-06 15:52:42

by Shuai Xue

Subject: [RFC PATCH 2/2] ACPI: APEI: separate synchronous error handling into task work

On the Arm64 platform, errors can be signaled by an asynchronous interrupt,
e.g. when an error is detected by a background scrubber, or by a synchronous
exception, e.g. when an uncorrected error is consumed. Both synchronous and
asynchronous errors are queued and handled by a dedicated kthread in a
workqueue.

Commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
synchronous errors") keeps track of whether memory_failure() work was
queued, and makes task_work pending to flush out the workqueue so that the
work for a synchronous error is processed before returning to user-space.
This trick ensures that the corrupted page is unmapped and poisoned. After
returning to user-space, the task restarts at the current instruction,
which triggers a page fault, and the kernel sends SIGBUS due to
VM_FAULT_HWPOISON.

Although the task could be killed by the page fault, the memory failure is
handled in a kthread context, so the hwpoison-aware mechanisms (e.g.
PF_MCE_EARLY, early kill) do not work as expected.

To this end, separate synchronous and asynchronous error handling into
different paths like X86 does:

- task work for synchronous errors.
- workqueue for asynchronous errors.

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/acpi/apei/ghes.c | 118 ++++++++++++++++++++++-----------------
1 file changed, 66 insertions(+), 52 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a420759fce2d..f13c298f47e6 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -421,46 +421,80 @@ static void ghes_clear_estatus(struct ghes *ghes,
ghes_ack_error(ghes->generic_v2);
}

-/*
- * Called as task_work before returning to user-space.
- * Ensure any queued work has been done before we return to the context that
- * triggered the notification.
+/**
+ * struct mce_task_work - for synchronous RAS event
+ *
+ * @twork: callback_head for task work
+ * @pfn: page frame number of corrupted page
+ * @flags: fine tune action taken
+ *
+ * Structure to pass task work to be handled before
+ * returning to userspace via task_work_add().
*/
-static void ghes_kick_task_work(struct callback_head *head)
+struct mce_task_work {
+ struct callback_head twork;
+ u64 pfn;
+ int flags;
+};
+
+static void memory_failure_cb(struct callback_head *twork)
{
- struct acpi_hest_generic_status *estatus;
- struct ghes_estatus_node *estatus_node;
- u32 node_len;
+ int ret;
+ struct mce_task_work *twcb =
+ container_of(twork, struct mce_task_work, twork);
+ ret = memory_failure(twcb->pfn, twcb->flags);
+ kfree(twcb);

- estatus_node = container_of(head, struct ghes_estatus_node, task_work);
- if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
- memory_failure_queue_kick(estatus_node->task_work_cpu);
+ if (!ret)
+ return;
+ /*
+ * -EHWPOISON from memory_failure() means that it already sent SIGBUS
+ * to the current process with the proper error info,
+ * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
+ *
+ * In both cases, no further processing is required.
+ */
+ if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
+ return;

- estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
- node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
- gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
+ pr_err("Memory error not recovered");
+ force_sig(SIGBUS);
}

-static bool ghes_do_memory_failure(u64 physical_addr, int flags)
+static void ghes_do_memory_failure(u64 physical_addr, int flags)
{
unsigned long pfn;
+ struct mce_task_work *twcb;

if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
- return false;
+ return;

pfn = PHYS_PFN(physical_addr);
if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
pr_warn_ratelimited(FW_WARN GHES_PFX
"Invalid address in generic error data: %#llx\n",
physical_addr);
- return false;
+ return;
}

- memory_failure_queue(pfn, flags);
- return true;
+ if (flags == MF_ACTION_REQUIRED && current->mm) {
+ twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
+ if (!twcb)
+ return;
+
+ twcb->pfn = pfn;
+ twcb->flags = flags;
+ init_task_work(&twcb->twork, memory_failure_cb);
+ task_work_add(current, &twcb->twork, TWA_RESUME);
+ return;
+ } else {
+ memory_failure_queue(pfn, flags);
+ }
+
+ return;
}

-static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
+static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
int sev)
{
int flags = -1;
@@ -468,7 +502,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);

if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
- return false;
+ return;

/* iff following two events can be handled properly by now */
if (sec_sev == GHES_SEV_CORRECTED &&
@@ -478,15 +512,12 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
flags = (gdata->flags & CPER_SEC_SYNC) ? MF_ACTION_REQUIRED : 0;

if (flags != -1)
- return ghes_do_memory_failure(mem_err->physical_addr, flags);
-
- return false;
+ ghes_do_memory_failure(mem_err->physical_addr, flags);
}

-static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
+static void ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
{
struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
- bool queued = false;
int sec_sev, i;
char *p;

@@ -494,7 +525,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s

sec_sev = ghes_severity(gdata->error_severity);
if (sev != GHES_SEV_RECOVERABLE || sec_sev != GHES_SEV_RECOVERABLE)
- return false;
+ return;

p = (char *)(err + 1);
for (i = 0; i < err->err_info_num; i++) {
@@ -510,7 +541,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
* and don't filter out 'corrected' error here.
*/
if (is_cache && has_pa) {
- queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
+ ghes_do_memory_failure(err_info->physical_fault_addr, 0);
p += err_info->length;
continue;
}
@@ -524,7 +555,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
p += err_info->length;
}

- return queued;
+ return;
}

/*
@@ -622,7 +653,7 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
schedule_work(&entry->work);
}

-static bool ghes_do_proc(struct ghes *ghes,
+static void ghes_do_proc(struct ghes *ghes,
const struct acpi_hest_generic_status *estatus)
{
int sev, sec_sev;
@@ -630,7 +661,6 @@ static bool ghes_do_proc(struct ghes *ghes,
guid_t *sec_type;
const guid_t *fru_id = &guid_null;
char *fru_text = "";
- bool queued = false;

sev = ghes_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
@@ -648,13 +678,13 @@ static bool ghes_do_proc(struct ghes *ghes,
ghes_edac_report_mem_error(sev, mem_err);

arch_apei_report_mem_error(sev, mem_err);
- queued = ghes_handle_memory_failure(gdata, sev);
+ ghes_handle_memory_failure(gdata, sev);
}
else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
ghes_handle_aer(gdata);
}
else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
- queued = ghes_handle_arm_hw_error(gdata, sev);
+ ghes_handle_arm_hw_error(gdata, sev);
} else {
void *err = acpi_hest_get_payload(gdata);

@@ -664,8 +694,6 @@ static bool ghes_do_proc(struct ghes *ghes,
gdata->error_data_length);
}
}
-
- return queued;
}

static void __ghes_print_estatus(const char *pfx,
@@ -961,9 +989,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
struct ghes_estatus_node *estatus_node;
struct acpi_hest_generic *generic;
struct acpi_hest_generic_status *estatus;
- bool task_work_pending;
u32 len, node_len;
- int ret;

llnode = llist_del_all(&ghes_estatus_llist);
/*
@@ -978,26 +1004,15 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
len = cper_estatus_len(estatus);
node_len = GHES_ESTATUS_NODE_LEN(len);
- task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
+ ghes_do_proc(estatus_node->ghes, estatus);
if (!ghes_estatus_cached(estatus)) {
generic = estatus_node->generic;
if (ghes_print_estatus(NULL, generic, estatus))
ghes_estatus_cache_add(generic, estatus);
}

- if (task_work_pending && current->mm) {
- estatus_node->task_work.func = ghes_kick_task_work;
- estatus_node->task_work_cpu = smp_processor_id();
- ret = task_work_add(current, &estatus_node->task_work,
- TWA_RESUME);
- if (ret)
- estatus_node->task_work.func = NULL;
- }
-
- if (!estatus_node->task_work.func)
- gen_pool_free(ghes_estatus_pool,
- (unsigned long)estatus_node, node_len);
-
+ gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
+ node_len);
llnode = next;
}
}
@@ -1057,7 +1072,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,

estatus_node->ghes = ghes;
estatus_node->generic = ghes->generic;
- estatus_node->task_work.func = NULL;
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);

if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
--
2.20.1.12.g72788fdb

2022-12-07 10:26:03

by Lv Ying

Subject: reply for ACPI: APEI: handle synchronous exceptions in task work

Hi Shuai Xue:

I noticed that we are both handling the same problem. My patchset:
RFC: https://lkml.org/lkml/fancy/2022/12/5/364
RFC PATCH v1: https://lkml.org/lkml/2022/12/7/244
has CC'd you.

Yingwen's proposal on 2022/12/06 [1]:
Add Bit 8 in "Common Platform Error Record" -> "Section Descriptor" ->
Flags (where currently Bits 8 through 31 are Reserved).

[1] https://members.uefi.org/wg/uswg/mail/thread/9453

Yingwen's proposal makes distinguishing synchronous errors via the CPER
report easier; however, it is not supported yet.
I am looking forward to your reply if there is any progress on the proposal,
and to your suggestions about my patchset.

2022-12-07 12:51:23

by Bixuan Cui

Subject: Re: reply for ACPI: APEI: handle synchronous exceptions in task work



On 2022/12/7 17:54, Lv Ying wrote:
> Yingwen's proposal makes distinguish synchronous error by CPER report more
> easy, however, it's not supported yet.
> Looking forward to your reply if there is any progress on the proposal and
> your suggestions about my patchset.

Originally, the Arm hardware can distinguish between synchronous and
asynchronous errors, but the OS cannot. Therefore, it is more reasonable to
make the distinction by adding a 'sync flag' bit for Arm.

Thanks,
Bixuan Cui

2022-12-07 13:35:12

by Shuai Xue

Subject: Re: reply for ACPI: APEI: handle synchronous exceptions in task work



On 2022/12/7 PM5:54, Lv Ying wrote:
> Hi Shuai Xue:
>
> I notice that we are both handling the same problem, my patchset:
> RFC: https://lkml.org/lkml/fancy/2022/12/5/364
> RFC PATCH v1: https://lkml.org/lkml/2022/12/7/244
> has CC to you

I am glad to see that the community is trying to address the same problems.
I have replied to your RFC version.

> Yingwen's proposal in 2022/12/06[1]:
> Add Bit 8 in "Common Platform Error Record" -> "Section Descriptor" ->
> Flags (which Now, Bit 8 through 31 – Reserved)
>
> [1] https://members.uefi.org/wg/uswg/mail/thread/9453
>
> Yingwen's proposal makes distinguish synchronous error by CPER report more
> easy, however, it's not supported yet.
> Looking forward to your reply if there is any progress on the proposal and
> your suggestions about my patchset.

Yes, it is not supported yet. So we have separated synchronous error handling
into task work based on a similar flag internally.

We submitted the proposal last month after discussing it with Tony. There is
still no progress yet; I will post an update here as soon as there is.

Cheers,
Shuai

2022-12-07 14:42:51

by Shuai Xue

Subject: Re: reply for ACPI: APEI: handle synchronous exceptions in task work



On 2022/12/7 PM8:56, Shuai Xue wrote:
>
>
> On 2022/12/7 PM5:54, Lv Ying wrote:
>> Hi Shuai Xue:
>>
>> I notice that we are both handling the same problem, my patchset:
>> RFC: https://lkml.org/lkml/fancy/2022/12/5/364
>> RFC PATCH v1: https://lkml.org/lkml/2022/12/7/244
>> has CC to you
>
> I am glad to see that the community is trying to address the same problems,
> I have replied to your RFC version.
>
>> Yingwen's proposal in 2022/12/06[1]:
>> Add Bit 8 in "Common Platform Error Record" -> "Section Descriptor" ->
>> Flags (which Now, Bit 8 through 31 – Reserved)
>>
>> [1] https://members.uefi.org/wg/uswg/mail/thread/9453
>>
>> Yingwen's proposal makes distinguish synchronous error by CPER report more
>> easy, however, it's not supported yet.
>> Looking forward to your reply if there is any progress on the proposal and
>> your suggestions about my patchset.
>
> Yes, it is not supported yet. So we separated synchronous error handling into
> task work based on a similar flag internally.
>
> We submitted the proposal last month after discussed with Tony. But there
> is still no progress, I will update it here in time.
>
> Cheers,
> Shuai

By the way, if you agree with the proposal, please vote to approve it in the
UEFI community on behalf of your organization; then we can make it happen soon. :)

Thank you.

Best Regards,
Shuai

2022-12-08 02:58:24

by Lv Ying

Subject: Re: reply for ACPI: APEI: handle synchronous exceptions in task work

On 2022/12/7 22:04, Shuai Xue wrote:
>
>
> On 2022/12/7 PM8:56, Shuai Xue wrote:
>>
>>
>> On 2022/12/7 PM5:54, Lv Ying wrote:
>>> Hi Shuai Xue:
>>>
>>> I notice that we are both handling the same problem, my patchset:
>>> RFC: https://lkml.org/lkml/fancy/2022/12/5/364
>>> RFC PATCH v1: https://lkml.org/lkml/2022/12/7/244
>>> has CC to you
>>
>> I am glad to see that the community is trying to address the same problems,
>> I have replied to your RFC version.
>>
>>> Yingwen's proposal in 2022/12/06[1]:
>>> Add Bit 8 in "Common Platform Error Record" -> "Section Descriptor" ->
>>> Flags (which Now, Bit 8 through 31 – Reserved)
>>>
>>> [1] https://members.uefi.org/wg/uswg/mail/thread/9453
>>>
>>> Yingwen's proposal makes distinguish synchronous error by CPER report more
>>> easy, however, it's not supported yet.
>>> Looking forward to your reply if there is any progress on the proposal and
>>> your suggestions about my patchset.
>>
>> Yes, it is not supported yet. So we separated synchronous error handling into
>> task work based on a similar flag internally.
>>
>> We submitted the proposal last month after discussed with Tony. But there
>> is still no progress, I will update it here in time.
>>
>> Cheers,
>> Shuai
>
> By the way, if you agree with the proposal, please vote to approve it in UEFI community
> with your right on behalf of your organization, then we can make it happen soon. :)
>

I noticed your proposal yesterday; however, I have no permission to access
it [1]. I already made my firmware colleagues aware of your suggestion
yesterday. I will make the meaning and background of this work clear to
them, and I also hope that this work will move forward :)

[1] https://members.uefi.org/wg/uswg/mail/thread/9453

--
Thanks!
Lv Ying

2023-02-27 05:03:29

by Shuai Xue

Subject: [PATCH v2 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code

changes since v1:
- synchronous events by notify type
- Link: https://lore.kernel.org/lkml/[email protected]/

Currently, both synchronous and asynchronous errors are queued and handled
by a dedicated kthread in a workqueue. Memory failure for a synchronous
error is synced by a cancel_work_sync trick, which ensures that the
corrupted page is unmapped and poisoned. After returning to user-space,
the task restarts at the current instruction, which triggers a page fault
in which the kernel sends SIGBUS to the current process due to
VM_FAULT_HWPOISON.

However, memory failure recovery for hwpoison-aware mechanisms does not
work as expected. For example, hwpoison-aware user-space processes like
QEMU register their customized SIGBUS handler and enable early kill mode by
setting PF_MCE_EARLY at initialization. The kernel will then directly
notify the process by sending a SIGBUS signal in memory_failure() with the
wrong si_code: BUS_MCEERR_AO is sent to the actual user-space process
instead of BUS_MCEERR_AR.

To address this problem:

- PATCH 1 sets mf_flags to MF_ACTION_REQUIRED on synchronous events, which
indicates that the error happened in the current execution context.
- PATCH 2 separates synchronous error handling into task work so that the
current context in memory_failure() belongs exactly to the task consuming
the poisoned data.

Then, the kernel will send SIGBUS with the proper si_code in kill_proc()
(see the sketch below for the user-space side).
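
As an illustration of the user-space side described above, a hwpoison-aware
process enables early kill and then distinguishes the two cases by the
si_code it receives with SIGBUS. This is a minimal sketch: the
prctl(PR_MCE_KILL, ...) interface and the BUS_MCEERR_* codes are the existing
kernel ABI, everything else (names, handling) is illustrative:

	#define _GNU_SOURCE
	#include <signal.h>
	#include <sys/prctl.h>

	static volatile sig_atomic_t mce_code;	/* BUS_MCEERR_AR or BUS_MCEERR_AO */
	static void *mce_addr;			/* faulting address from si_addr */

	static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
	{
		/*
		 * BUS_MCEERR_AR: poison consumed in the current execution
		 * context (action required). BUS_MCEERR_AO: asynchronous
		 * report (action optional). A real handler, e.g. QEMU, would
		 * recover or forward the error based on this.
		 */
		mce_code = si->si_code;
		mce_addr = si->si_addr;
	}

	int main(void)
	{
		struct sigaction sa = {
			.sa_sigaction	= sigbus_handler,
			.sa_flags	= SA_SIGINFO,
		};

		sigaction(SIGBUS, &sa, NULL);
		/* Opt in to early kill (sets PF_MCE_EARLY for this process). */
		prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);

		/* ... map memory and run the workload ... */
		return 0;
	}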

Lv Ying and XiuQi also proposed to address a similar problem, and we discussed
a new solution: adding a new flag (acpi_hest_generic_data::flags bit 8) to
distinguish synchronous events [2][3]. The UEFI community still has no
response. After a deep dive into the SDEI TRM, the SDEI notification should be
used for asynchronous errors. As the SDEI TRM [1] describes, "the dispatcher
can simulate an exception-like entry into the client, **with the client
providing an additional asynchronous entry point similar to an interrupt
entry point**". The client (kernel) lacks the complete synchronous context,
e.g. system registers (ELR, ESR, etc.). So the notification type is enough to
distinguish synchronous events.

[1] https://developer.arm.com/documentation/den0054/latest/
[2] https://lore.kernel.org/linux-arm-kernel/[email protected]/T/
[3] https://lore.kernel.org/lkml/[email protected]/

Shuai Xue (2):
ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
synchronous events
ACPI: APEI: handle synchronous exceptions in task work

drivers/acpi/apei/ghes.c | 134 ++++++++++++++++++++++++---------------
include/acpi/ghes.h | 3 -
mm/memory-failure.c | 13 ----
3 files changed, 82 insertions(+), 68 deletions(-)

--
2.20.1.12.g72788fdb


2023-02-27 05:03:35

by Shuai Xue

Subject: [PATCH v2 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

There are two major types of uncorrected recoverable (UCR) errors:

- Action Required (AR): The error is detected and the processor has already
consumed the memory. The OS is required to take action (for example,
offline the failing page/kill the failing thread) to recover from this
uncorrectable error.

- Action Optional (AO): The error is detected out of processor execution
context. Some data in the memory is corrupted, but the data has not
been consumed. The OS may optionally take action to recover from this
uncorrectable error.

The essential difference between AR and AO errors is that AR is a
synchronous event, while AO is an asynchronous event. The hardware will
signal a synchronous exception (Machine Check Exception on X86 and
Synchronous External Abort on Arm64) when an error is detected and the
memory access has been architecturally executed.

When APEI firmware-first is enabled, a platform may describe one error
source for the handling of synchronous errors (e.g. MCE or SEA
notification), or for handling asynchronous errors (e.g. SCI or External
Interrupt notification). In other words, we can distinguish synchronous
errors by the APEI notification. For AR errors, the kernel will kill the
current process accessing the poisoned page by sending SIGBUS with
BUS_MCEERR_AR. In addition, for AO errors, the kernel will notify the
process that owns the poisoned page by sending SIGBUS with BUS_MCEERR_AO in
early kill mode. However, the GHES driver always sets mf_flags to 0, so all
UCR errors are handled as AO errors in memory_failure().

To this end, set the memory failure flags to MF_ACTION_REQUIRED on
synchronous events.

Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")
Signed-off-by: Shuai Xue <[email protected]>
---
drivers/acpi/apei/ghes.c | 28 ++++++++++++++++++++++------
1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 34ad071a64e9..5d37fb4bca67 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -101,6 +101,19 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
}

+/*
+ * A platform may describe one error source for the handling of synchronous
+ * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
+ * or External Interrupt).
+ */
+static inline bool is_hest_sync_notify(struct ghes *ghes)
+{
+ int notify_type = ghes->generic->notify.type;
+
+ return notify_type == ACPI_HEST_NOTIFY_SEA ||
+ notify_type == ACPI_HEST_NOTIFY_MCE;
+}
+
/*
* This driver isn't really modular, however for the time being,
* continuing to use module_param is the easiest way to remain
@@ -477,7 +490,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
}

static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
- int sev)
+ int sev, bool sync)
{
int flags = -1;
int sec_sev = ghes_severity(gdata->error_severity);
@@ -491,7 +504,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
(gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
flags = MF_SOFT_OFFLINE;
if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
- flags = 0;
+ flags = sync ? MF_ACTION_REQUIRED : 0;

if (flags != -1)
return ghes_do_memory_failure(mem_err->physical_addr, flags);
@@ -499,12 +512,14 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
return false;
}

-static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
+static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
+ int sev, bool sync)
{
struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
bool queued = false;
int sec_sev, i;
char *p;
+ int flags = sync ? MF_ACTION_REQUIRED : 0;

log_arm_hw_error(err);

@@ -526,7 +541,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
* and don't filter out 'corrected' error here.
*/
if (is_cache && has_pa) {
- queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
+ queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
p += err_info->length;
continue;
}
@@ -647,6 +662,7 @@ static bool ghes_do_proc(struct ghes *ghes,
const guid_t *fru_id = &guid_null;
char *fru_text = "";
bool queued = false;
+ bool sync = is_hest_sync_notify(ghes);

sev = ghes_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
@@ -664,13 +680,13 @@ static bool ghes_do_proc(struct ghes *ghes,
atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);

arch_apei_report_mem_error(sev, mem_err);
- queued = ghes_handle_memory_failure(gdata, sev);
+ queued = ghes_handle_memory_failure(gdata, sev, sync);
}
else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
ghes_handle_aer(gdata);
}
else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
- queued = ghes_handle_arm_hw_error(gdata, sev);
+ queued = ghes_handle_arm_hw_error(gdata, sev, sync);
} else {
void *err = acpi_hest_get_payload(gdata);

--
2.20.1.12.g72788fdb


2023-02-27 05:03:39

by Shuai Xue

Subject: [PATCH v2 2/2] ACPI: APEI: handle synchronous exceptions in task work

Hardware errors can be signaled by an asynchronous interrupt, e.g. when an
error is detected by a background scrubber, or by a synchronous exception,
e.g. when an uncorrected error is consumed. Both synchronous and
asynchronous errors are queued and handled by a dedicated kthread in a
workqueue.

Commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
synchronous errors") keeps track of whether memory_failure() work was
queued, and makes task_work pending to flush out the workqueue so that the
work for a synchronous error is processed before returning to user-space.
This trick ensures that the corrupted page is unmapped and poisoned. After
returning to user-space, the task restarts at the current instruction, which
triggers a page fault in which the kernel sends SIGBUS to the current
process due to VM_FAULT_HWPOISON.

However, memory failure recovery for hwpoison-aware mechanisms does not
work as expected. For example, hwpoison-aware user-space processes like
QEMU register their customized SIGBUS handler and enable early kill mode by
setting PF_MCE_EARLY at initialization. The kernel will then directly notify
the process by sending a SIGBUS signal in memory_failure() with the wrong
si_code: the actual user-space process is accessing the corrupted memory
location, but its memory failure work is handled in a kthread context, so
kill_proc() sends SIGBUS with the BUS_MCEERR_AO si_code to that process
instead of BUS_MCEERR_AR.

To this end, separate synchronous and asynchronous error handling into
different paths, like the X86 platform does:

- task work for synchronous errors.
- workqueue for asynchronous errors.

Then, for synchronous errors, the current context in memory_failure()
belongs exactly to the task consuming the poisoned data, and it will send
SIGBUS with the proper si_code.

Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
Signed-off-by: Shuai Xue <[email protected]>
---
drivers/acpi/apei/ghes.c | 114 ++++++++++++++++++++++-----------------
include/acpi/ghes.h | 3 --
mm/memory-failure.c | 13 -----
3 files changed, 64 insertions(+), 66 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 5d37fb4bca67..b2fe309f395c 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -451,45 +451,79 @@ static void ghes_clear_estatus(struct ghes *ghes,
}

/*
- * Called as task_work before returning to user-space.
- * Ensure any queued work has been done before we return to the context that
- * triggered the notification.
+ * struct mce_task_work - for synchronous RAS event
+ *
+ * @twork: callback_head for task work
+ * @pfn: page frame number of corrupted page
+ * @flags: fine tune action taken
+ *
+ * Structure to pass task work to be handled before
+ * ret_to_user via task_work_add().
*/
-static void ghes_kick_task_work(struct callback_head *head)
+struct mce_task_work {
+ struct callback_head twork;
+ u64 pfn;
+ int flags;
+};
+
+static void memory_failure_cb(struct callback_head *twork)
{
- struct acpi_hest_generic_status *estatus;
- struct ghes_estatus_node *estatus_node;
- u32 node_len;
+ int ret;
+ struct mce_task_work *twcb =
+ container_of(twork, struct mce_task_work, twork);

- estatus_node = container_of(head, struct ghes_estatus_node, task_work);
- if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
- memory_failure_queue_kick(estatus_node->task_work_cpu);
+ ret = memory_failure(twcb->pfn, twcb->flags);
+ kfree(twcb);

- estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
- node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
- gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
+ if (!ret)
+ return;
+
+ /*
+ * -EHWPOISON from memory_failure() means that it already sent SIGBUS
+ * to the current process with the proper error info,
+ * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
+ *
+ * In both cases, no further processing is required.
+ */
+ if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
+ return;
+
+ pr_err("Memory error not recovered");
+ force_sig(SIGBUS);
}

-static bool ghes_do_memory_failure(u64 physical_addr, int flags)
+static void ghes_do_memory_failure(u64 physical_addr, int flags)
{
unsigned long pfn;
+ struct mce_task_work *twcb;

if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
- return false;
+ return;

pfn = PHYS_PFN(physical_addr);
if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
pr_warn_ratelimited(FW_WARN GHES_PFX
"Invalid address in generic error data: %#llx\n",
physical_addr);
- return false;
+ return;
+ }
+
+ if (flags == MF_ACTION_REQUIRED && current->mm) {
+ twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
+ if (!twcb)
+ return;
+
+ twcb->pfn = pfn;
+ twcb->flags = flags;
+ init_task_work(&twcb->twork, memory_failure_cb);
+ task_work_add(current, &twcb->twork, TWA_RESUME);
+ return;
}

memory_failure_queue(pfn, flags);
- return true;
}

-static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
+static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
int sev, bool sync)
{
int flags = -1;
@@ -497,7 +531,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);

if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
- return false;
+ return;

/* iff following two events can be handled properly by now */
if (sec_sev == GHES_SEV_CORRECTED &&
@@ -507,16 +541,15 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
flags = sync ? MF_ACTION_REQUIRED : 0;

if (flags != -1)
- return ghes_do_memory_failure(mem_err->physical_addr, flags);
+ ghes_do_memory_failure(mem_err->physical_addr, flags);

- return false;
+ return;
}

-static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
+static void ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
int sev, bool sync)
{
struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
- bool queued = false;
int sec_sev, i;
char *p;
int flags = sync ? MF_ACTION_REQUIRED : 0;
@@ -525,7 +558,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,

sec_sev = ghes_severity(gdata->error_severity);
if (sev != GHES_SEV_RECOVERABLE || sec_sev != GHES_SEV_RECOVERABLE)
- return false;
+ return;

p = (char *)(err + 1);
for (i = 0; i < err->err_info_num; i++) {
@@ -541,7 +574,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
* and don't filter out 'corrected' error here.
*/
if (is_cache && has_pa) {
- queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
+ ghes_do_memory_failure(err_info->physical_fault_addr, flags);
p += err_info->length;
continue;
}
@@ -554,8 +587,6 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
error_type);
p += err_info->length;
}
-
- return queued;
}

/*
@@ -653,7 +684,7 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
schedule_work(&entry->work);
}

-static bool ghes_do_proc(struct ghes *ghes,
+static void ghes_do_proc(struct ghes *ghes,
const struct acpi_hest_generic_status *estatus)
{
int sev, sec_sev;
@@ -661,7 +692,6 @@ static bool ghes_do_proc(struct ghes *ghes,
guid_t *sec_type;
const guid_t *fru_id = &guid_null;
char *fru_text = "";
- bool queued = false;
bool sync = is_hest_sync_notify(ghes);

sev = ghes_severity(estatus->error_severity);
@@ -680,13 +710,13 @@ static bool ghes_do_proc(struct ghes *ghes,
atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);

arch_apei_report_mem_error(sev, mem_err);
- queued = ghes_handle_memory_failure(gdata, sev, sync);
+ ghes_handle_memory_failure(gdata, sev, sync);
}
else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
ghes_handle_aer(gdata);
}
else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
- queued = ghes_handle_arm_hw_error(gdata, sev, sync);
+ ghes_handle_arm_hw_error(gdata, sev, sync);
} else {
void *err = acpi_hest_get_payload(gdata);

@@ -696,8 +726,6 @@ static bool ghes_do_proc(struct ghes *ghes,
gdata->error_data_length);
}
}
-
- return queued;
}

static void __ghes_print_estatus(const char *pfx,
@@ -999,9 +1027,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
struct ghes_estatus_node *estatus_node;
struct acpi_hest_generic *generic;
struct acpi_hest_generic_status *estatus;
- bool task_work_pending;
u32 len, node_len;
- int ret;

llnode = llist_del_all(&ghes_estatus_llist);
/*
@@ -1016,25 +1042,14 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
len = cper_estatus_len(estatus);
node_len = GHES_ESTATUS_NODE_LEN(len);
- task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
+ ghes_do_proc(estatus_node->ghes, estatus);
if (!ghes_estatus_cached(estatus)) {
generic = estatus_node->generic;
if (ghes_print_estatus(NULL, generic, estatus))
ghes_estatus_cache_add(generic, estatus);
}
-
- if (task_work_pending && current->mm) {
- estatus_node->task_work.func = ghes_kick_task_work;
- estatus_node->task_work_cpu = smp_processor_id();
- ret = task_work_add(current, &estatus_node->task_work,
- TWA_RESUME);
- if (ret)
- estatus_node->task_work.func = NULL;
- }
-
- if (!estatus_node->task_work.func)
- gen_pool_free(ghes_estatus_pool,
- (unsigned long)estatus_node, node_len);
+ gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
+ node_len);

llnode = next;
}
@@ -1095,7 +1110,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,

estatus_node->ghes = ghes;
estatus_node->generic = ghes->generic;
- estatus_node->task_work.func = NULL;
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);

if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 3c8bba9f1114..e5e0c308d27f 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -35,9 +35,6 @@ struct ghes_estatus_node {
struct llist_node llnode;
struct acpi_hest_generic *generic;
struct ghes *ghes;
-
- int task_work_cpu;
- struct callback_head task_work;
};

struct ghes_estatus_cache {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a1ede7bdce95..d4fd983dfc97 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2355,19 +2355,6 @@ static void memory_failure_work_func(struct work_struct *work)
}
}

-/*
- * Process memory_failure work queued on the specified CPU.
- * Used to avoid return-to-userspace racing with the memory_failure workqueue.
- */
-void memory_failure_queue_kick(int cpu)
-{
- struct memory_failure_cpu *mf_cpu;
-
- mf_cpu = &per_cpu(memory_failure_cpu, cpu);
- cancel_work_sync(&mf_cpu->work);
- memory_failure_work_func(&mf_cpu->work);
-}
-
static int __init memory_failure_init(void)
{
struct memory_failure_cpu *mf_cpu;
--
2.20.1.12.g72788fdb


2023-03-06 00:45:50

by Shuai Xue

Subject: Re: [PATCH v2 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code

Gentle ping.

On 2023/2/27 PM1:03, Shuai Xue wrote:
> changes since v1:
> - synchronous events by notify type
> - Link: https://lore.kernel.org/lkml/[email protected]/
>
> Currently, both synchronous and asynchronous error are queued and handled
> by a dedicated kthread in workqueue. And Memory failure for synchronous
> error is synced by a cancel_work_sync trick which ensures that the
> corrupted page is unmapped and poisoned. And after returning to user-space,
> the task starts at current instruction which triggering a page fault in
> which kernel will send SIGBUS to current process due to VM_FAULT_HWPOISON.
>
> However, the memory failure recovery for hwpoison-aware mechanisms does not
> work as expected. For example, hwpoison-aware user-space processes like
> QEMU register their customized SIGBUS handler and enable early kill mode by
> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
> the process by sending a SIGBUS signal in memory failure with wrong
> si_code: BUS_MCEERR_AO si_code to the actual user-space process instead of
> BUS_MCEERR_AR.
>
> To address this problem:
>
> - PATCH 1 sets mf_flags as MF_ACTION_REQUIRED on synchronous events which
> indicates error happened in current execution context
> - PATCH 2 separates synchronous error handling into task work so that the
> current context in memory failure is exactly belongs to the task
> consuming poison data.
>
> Then, kernel will send SIGBUS with proper si_code in kill_proc().
>
> Lv Ying and XiuQi also proposed to address similar problem and we discussed
> about new solution to add a new flag(acpi_hest_generic_data::flags bit 8) to
> distinguish synchronous event. [2][3] The UEFI community still has no response.
> After a deep dive into the SDEI TRM, the SDEI notification should be used for
> asynchronous error. As SDEI TRM[1] describes "the dispatcher can simulate an
> exception-like entry into the client, **with the client providing an additional
> asynchronous entry point similar to an interrupt entry point**". The client
> (kernel) lacks complete synchronous context, e.g. systeam register (ELR, ESR,
> etc). So notify type is enough to distinguish synchronous event.
>
> [1] https://developer.arm.com/documentation/den0054/latest/
> [2] https://lore.kernel.org/linux-arm-kernel/[email protected]/T/
> [3] https://lore.kernel.org/lkml/[email protected]/
>
> Shuai Xue (2):
> ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
> synchronous events
> ACPI: APEI: handle synchronous exceptions in task work
>
> drivers/acpi/apei/ghes.c | 134 ++++++++++++++++++++++++---------------
> include/acpi/ghes.h | 3 -
> mm/memory-failure.c | 13 ----
> 3 files changed, 82 insertions(+), 68 deletions(-)
>

Subject: Re: [PATCH v2 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

On Mon, Feb 27, 2023 at 01:03:14PM +0800, Shuai Xue wrote:
> There are two major types of uncorrected recoverable (UCR) errors :
>
> - Action Required (AR): The error is detected and the processor already
> consumes the memory. OS requires to take action (for example, offline
> failure page/kill failure thread) to recover this uncorrectable error.
>
> - Action Optional (AO): The error is detected out of processor execution
> context. Some data in the memory are corrupted. But the data have not
> been consumed. OS is optional to take action to recover this
> uncorrectable error.
>
> The essential difference between AR and AO errors is that AR is a
> synchronous event, while AO is an asynchronous event. The hardware will
> signal a synchronous exception (Machine Check Exception on X86 and
> Synchronous External Abort on Arm64) when an error is detected and the
> memory access has been architecturally executed.
>
> When APEI firmware first is enabled, a platform may describe one error
> source for the handling of synchronous errors (e.g. MCE or SEA notification
> ), or for handling asynchronous errors (e.g. SCI or External Interrupt
> notification). In other words, we can distinguish synchronous errors by
> APEI notification. For AR errors, kernel will kill current process
> accessing the poisoned page by sending SIGBUS with BUS_MCEERR_AR. In
> addition, for AO errors, kernel will notify the process who owns the
> poisoned page by sending SIGBUS with BUS_MCEERR_AO in early kill mode.
> However, the GHES driver always sets mf_flags to 0 so that all UCR errors
> are handled as AO errors in memory failure.
>
> To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
> events.
>
> Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")'
> Signed-off-by: Shuai Xue <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 28 ++++++++++++++++++++++------
> 1 file changed, 22 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 34ad071a64e9..5d37fb4bca67 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -101,6 +101,19 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
> return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
> }
>
> +/*
> + * A platform may describe one error source for the handling of synchronous
> + * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
> + * or External Interrupt).
> + */
> +static inline bool is_hest_sync_notify(struct ghes *ghes)
> +{
> + int notify_type = ghes->generic->notify.type;
> +
> + return notify_type == ACPI_HEST_NOTIFY_SEA ||
> + notify_type == ACPI_HEST_NOTIFY_MCE;
> +}

This code seems to read as if all MCEs are synchronous, which I think is
not correct. The scenario I'm worried about is that is_hest_sync_notify()
returns true when this code is called for an AO MCE (i.e. an asynchronous one).
Then, ghes_do_memory_failure() (updated by your patch 2/2) will choose to
use task_work instead of memory_failure_queue(). That is not what is expected.
Or does that never happen?

- Naoya Horiguchi

> +
> /*
> * This driver isn't really modular, however for the time being,
> * continuing to use module_param is the easiest way to remain
> @@ -477,7 +490,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> }
>
> static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> - int sev)
> + int sev, bool sync)
> {
> int flags = -1;
> int sec_sev = ghes_severity(gdata->error_severity);
> @@ -491,7 +504,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
> flags = MF_SOFT_OFFLINE;
> if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
> - flags = 0;
> + flags = sync ? MF_ACTION_REQUIRED : 0;
>
> if (flags != -1)
> return ghes_do_memory_failure(mem_err->physical_addr, flags);
> @@ -499,12 +512,14 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> return false;
> }
>
> -static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
> +static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
> + int sev, bool sync)
> {
> struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
> bool queued = false;
> int sec_sev, i;
> char *p;
> + int flags = sync ? MF_ACTION_REQUIRED : 0;
>
> log_arm_hw_error(err);
>
> @@ -526,7 +541,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
> * and don't filter out 'corrected' error here.
> */
> if (is_cache && has_pa) {
> - queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
> + queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
> p += err_info->length;
> continue;
> }
> @@ -647,6 +662,7 @@ static bool ghes_do_proc(struct ghes *ghes,
> const guid_t *fru_id = &guid_null;
> char *fru_text = "";
> bool queued = false;
> + bool sync = is_hest_sync_notify(ghes);
>
> sev = ghes_severity(estatus->error_severity);
> apei_estatus_for_each_section(estatus, gdata) {
> @@ -664,13 +680,13 @@ static bool ghes_do_proc(struct ghes *ghes,
> atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);
>
> arch_apei_report_mem_error(sev, mem_err);
> - queued = ghes_handle_memory_failure(gdata, sev);
> + queued = ghes_handle_memory_failure(gdata, sev, sync);
> }
> else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
> ghes_handle_aer(gdata);
> }
> else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
> - queued = ghes_handle_arm_hw_error(gdata, sev);
> + queued = ghes_handle_arm_hw_error(gdata, sev, sync);
> } else {
> void *err = acpi_hest_get_payload(gdata);
>
> --
> 2.20.1.12.g72788fdb

Subject: Re: [PATCH v2 2/2] ACPI: APEI: handle synchronous exceptions in task work

On Mon, Feb 27, 2023 at 01:03:15PM +0800, Shuai Xue wrote:
> Hardware errors could be signaled by synchronous interrupt, e.g. when an
> error is detected by a background scrubber, or signaled by synchronous
> exception, e.g. when an uncorrected error is consumed. Both synchronous and
> asynchronous error are queued and handled by a dedicated kthread in
> workqueue.
>
> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
> synchronous errors") keep track of whether memory_failure() work was
> queued, and make task_work pending to flush out the workqueue so that the
> work for synchronous error is processed before returning to user-space.
> The trick ensures that the corrupted page is unmapped and poisoned. And
> after returning to user-space, the task starts at current instruction which
> triggering a page fault in which kernel will send SIGBUS to current process
> due to VM_FAULT_HWPOISON.
>
> However, the memory failure recovery for hwpoison-aware mechanisms does not
> work as expected. For example, hwpoison-aware user-space processes like
> QEMU register their customized SIGBUS handler and enable early kill mode by
> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
> the process by sending a SIGBUS signal in memory failure with wrong
> si_code: the actual user-space process accessing the corrupt memory
> location, but its memory failure work is handled in a kthread context, so
> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
> process instead of BUS_MCEERR_AR in kill_proc().
>
> To this end, separate synchronous and asynchronous error handling into
> different paths like X86 platform does:
>
> - task work for synchronous errors.
> - and workqueue for asynchronous errors.
>
> Then for synchronous errors, the current context in memory failure is
> exactly belongs to the task consuming poison data and it will send SIBBUS
> with proper si_code.
>
> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
> Signed-off-by: Shuai Xue <[email protected]>
...
>
> /*
> - * Called as task_work before returning to user-space.
> - * Ensure any queued work has been done before we return to the context that
> - * triggered the notification.
> + * struct mce_task_work - for synchronous RAS event

This seems to handle synchronous memory errors in general, not just MCE?
So naming this struct more generally might be better.

> + *
> + * @twork: callback_head for task work
> + * @pfn: page frame number of corrupted page
> + * @flags: fine tune action taken
> + *
> + * Structure to pass task work to be handled before
> + * ret_to_user via task_work_add().
> */
...

> }
>
> -static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> +static void ghes_do_memory_failure(u64 physical_addr, int flags)
> {
> unsigned long pfn;
> + struct mce_task_work *twcb;
>
> if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
> - return false;
> + return;
>
> pfn = PHYS_PFN(physical_addr);
> if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
> pr_warn_ratelimited(FW_WARN GHES_PFX
> "Invalid address in generic error data: %#llx\n",
> physical_addr);
> - return false;
> + return;
> + }
> +
> + if (flags == MF_ACTION_REQUIRED && current->mm) {
> + twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
> + if (!twcb)
> + return;

When this kmalloc() fails, the error event might be silently dropped?
If so, a warning message could be helpful.

Thanks,
Naoya Horiguchi

> +
> + twcb->pfn = pfn;
> + twcb->flags = flags;
> + init_task_work(&twcb->twork, memory_failure_cb);
> + task_work_add(current, &twcb->twork, TWA_RESUME);
> + return;
> }
>
> memory_failure_queue(pfn, flags);

2023-03-16 09:58:39

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

+ Tony Luck for MCE

On 2023/3/16 PM3:21, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Mon, Feb 27, 2023 at 01:03:14PM +0800, Shuai Xue wrote:
>> There are two major types of uncorrected recoverable (UCR) errors :
>>
>> - Action Required (AR): The error is detected and the processor already
>> consumes the memory. OS requires to take action (for example, offline
>> failure page/kill failure thread) to recover this uncorrectable error.
>>
>> - Action Optional (AO): The error is detected out of processor execution
>> context. Some data in the memory are corrupted. But the data have not
>> been consumed. OS is optional to take action to recover this
>> uncorrectable error.
>>
>> The essential difference between AR and AO errors is that AR is a
>> synchronous event, while AO is an asynchronous event. The hardware will
>> signal a synchronous exception (Machine Check Exception on X86 and
>> Synchronous External Abort on Arm64) when an error is detected and the
>> memory access has been architecturally executed.
>>
>> When APEI firmware first is enabled, a platform may describe one error
>> source for the handling of synchronous errors (e.g. MCE or SEA notification
>> ), or for handling asynchronous errors (e.g. SCI or External Interrupt
>> notification). In other words, we can distinguish synchronous errors by
>> APEI notification. For AR errors, kernel will kill current process
>> accessing the poisoned page by sending SIGBUS with BUS_MCEERR_AR. In
>> addition, for AO errors, kernel will notify the process who owns the
>> poisoned page by sending SIGBUS with BUS_MCEERR_AO in early kill mode.
>> However, the GHES driver always sets mf_flags to 0 so that all UCR errors
>> are handled as AO errors in memory failure.
>>
>> To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
>> events.
>>
>> Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")'
>> Signed-off-by: Shuai Xue <[email protected]>
>> ---
>> drivers/acpi/apei/ghes.c | 28 ++++++++++++++++++++++------
>> 1 file changed, 22 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 34ad071a64e9..5d37fb4bca67 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -101,6 +101,19 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
>> return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
>> }
>>
>> +/*
>> + * A platform may describe one error source for the handling of synchronous
>> + * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
>> + * or External Interrupt).
>> + */
>> +static inline bool is_hest_sync_notify(struct ghes *ghes)
>> +{
>> + int notify_type = ghes->generic->notify.type;
>> +
>> + return notify_type == ACPI_HEST_NOTIFY_SEA ||
>> + notify_type == ACPI_HEST_NOTIFY_MCE;
>> +}
>
> This code seems to read that all MCEs are synchronous, which I think is
> not correct. The scenario I'm worrying about is that is_hest_sync_notify()
> returns true when this code is called for AO MCE (so asynchronous one).
> Then, ghes_do_memory_failure() (updated by your patch 2/2) will choose to
> use task_work instead of memory_failure_queue(). This should not be expected.
> Or does that never happen?

I think you are right.

On x86 platforms with MCA, patrol scrub errors are asynchronous errors, which are
by default signaled with MCE. It is possible to downgrade the patrol scrub SRAO
to UCNA or another correctable error in the logging/signaling behavior and signal
CMCI only.

As far as I know, on the X86 platform, MCE is handled in do_machine_check() and only
asynchronous errors are notified through HEST. Can we safely drop ACPI_HEST_NOTIFY_MCE
and only consider ACPI_HEST_NOTIFY_SEA as a synchronous notification?

Tony, do you have any comments on this? Please point out if I am wrong. Thank you.

>
> - Naoya Horiguchi

Glad to hear from you and thank you for your comments.

Best regards
Shuai

>
>> +
>> /*
>> * This driver isn't really modular, however for the time being,
>> * continuing to use module_param is the easiest way to remain
>> @@ -477,7 +490,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>> }
>>
>> static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>> - int sev)
>> + int sev, bool sync)
>> {
>> int flags = -1;
>> int sec_sev = ghes_severity(gdata->error_severity);
>> @@ -491,7 +504,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>> (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
>> flags = MF_SOFT_OFFLINE;
>> if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
>> - flags = 0;
>> + flags = sync ? MF_ACTION_REQUIRED : 0;
>>
>> if (flags != -1)
>> return ghes_do_memory_failure(mem_err->physical_addr, flags);
>> @@ -499,12 +512,14 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>> return false;
>> }
>>
>> -static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
>> +static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
>> + int sev, bool sync)
>> {
>> struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
>> bool queued = false;
>> int sec_sev, i;
>> char *p;
>> + int flags = sync ? MF_ACTION_REQUIRED : 0;
>>
>> log_arm_hw_error(err);
>>
>> @@ -526,7 +541,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
>> * and don't filter out 'corrected' error here.
>> */
>> if (is_cache && has_pa) {
>> - queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
>> + queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
>> p += err_info->length;
>> continue;
>> }
>> @@ -647,6 +662,7 @@ static bool ghes_do_proc(struct ghes *ghes,
>> const guid_t *fru_id = &guid_null;
>> char *fru_text = "";
>> bool queued = false;
>> + bool sync = is_hest_sync_notify(ghes);
>>
>> sev = ghes_severity(estatus->error_severity);
>> apei_estatus_for_each_section(estatus, gdata) {
>> @@ -664,13 +680,13 @@ static bool ghes_do_proc(struct ghes *ghes,
>> atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);
>>
>> arch_apei_report_mem_error(sev, mem_err);
>> - queued = ghes_handle_memory_failure(gdata, sev);
>> + queued = ghes_handle_memory_failure(gdata, sev, sync);
>> }
>> else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
>> ghes_handle_aer(gdata);
>> }
>> else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
>> - queued = ghes_handle_arm_hw_error(gdata, sev);
>> + queued = ghes_handle_arm_hw_error(gdata, sev, sync);
>> } else {
>> void *err = acpi_hest_get_payload(gdata);
>>
>> --
>> 2.20.1.12.g72788fdb

2023-03-16 11:11:14

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ACPI: APEI: handle synchronous exceptions in task work



On 2023/3/16 PM3:21, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Mon, Feb 27, 2023 at 01:03:15PM +0800, Shuai Xue wrote:
>> Hardware errors could be signaled by synchronous interrupt, e.g. when an
>> error is detected by a background scrubber, or signaled by synchronous
>> exception, e.g. when an uncorrected error is consumed. Both synchronous and
>> asynchronous error are queued and handled by a dedicated kthread in
>> workqueue.
>>
>> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
>> synchronous errors") keep track of whether memory_failure() work was
>> queued, and make task_work pending to flush out the workqueue so that the
>> work for synchronous error is processed before returning to user-space.
>> The trick ensures that the corrupted page is unmapped and poisoned. And
>> after returning to user-space, the task starts at current instruction which
>> triggering a page fault in which kernel will send SIGBUS to current process
>> due to VM_FAULT_HWPOISON.
>>
>> However, the memory failure recovery for hwpoison-aware mechanisms does not
>> work as expected. For example, hwpoison-aware user-space processes like
>> QEMU register their customized SIGBUS handler and enable early kill mode by
>> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
>> the process by sending a SIGBUS signal in memory failure with wrong
>> si_code: the actual user-space process accessing the corrupt memory
>> location, but its memory failure work is handled in a kthread context, so
>> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
>> process instead of BUS_MCEERR_AR in kill_proc().
>>
>> To this end, separate synchronous and asynchronous error handling into
>> different paths like X86 platform does:
>>
>> - task work for synchronous errors.
>> - and workqueue for asynchronous errors.
>>
>> Then for synchronous errors, the current context in memory failure is
>> exactly belongs to the task consuming poison data and it will send SIBBUS
>> with proper si_code.
>>
>> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
>> Signed-off-by: Shuai Xue <[email protected]>
> ...
>>
>> /*
>> - * Called as task_work before returning to user-space.
>> - * Ensure any queued work has been done before we return to the context that
>> - * triggered the notification.
>> + * struct mce_task_work - for synchronous RAS event
>
> This seems to handle synchronous memory errors, not limited to MCE?
> So naming this struct as such (more generally) might be better.

Yes. How about `sync_task_work`?

>
>> + *
>> + * @twork: callback_head for task work
>> + * @pfn: page frame number of corrupted page
>> + * @flags: fine tune action taken
>> + *
>> + * Structure to pass task work to be handled before
>> + * ret_to_user via task_work_add().
>> */
> ...
>
>> }
>>
>> -static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>> +static void ghes_do_memory_failure(u64 physical_addr, int flags)
>> {
>> unsigned long pfn;
>> + struct mce_task_work *twcb;
>>
>> if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>> - return false;
>> + return;
>>
>> pfn = PHYS_PFN(physical_addr);
>> if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
>> pr_warn_ratelimited(FW_WARN GHES_PFX
>> "Invalid address in generic error data: %#llx\n",
>> physical_addr);
>> - return false;
>> + return;
>> + }
>> +
>> + if (flags == MF_ACTION_REQUIRED && current->mm) {
>> + twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
>> + if (!twcb)
>> + return;
>
> When this kmalloc() fails, the error event might be silently dropped?
> If so, some warning messages could be helpful.

Yes, I was going to add a warning message like:

pr_err("Failed to handle memory failure due to out of memory\n");

But checkpatch.pl warned about the patch:

WARNING: Possible unnecessary 'out of memory' message

I will add it back in the next version :)

>
> Thanks,
> Naoya Horiguchi

Thank you for your comments.

Cheer,
Shuai

>
>> +
>> + twcb->pfn = pfn;
>> + twcb->flags = flags;
>> + init_task_work(&twcb->twork, memory_failure_cb);
>> + task_work_add(current, &twcb->twork, TWA_RESUME);
>> + return;
>> }
>>
>> memory_failure_queue(pfn, flags);

2023-03-16 16:45:53

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v2 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

> On x86 platform with MCA, patrol scrub errors are asynchronous error, which are
> by default signaled with MCE. It is possible to downgrade the patrol scrub SRAO
> to UCNA or other correctable error in the logging/signaling behavior and signal
> CMCI only.
>
> As far as I know, on X86 platform, MCE is handled in do_machche_check() and only
> asynchronous error is notified through HEST. Can we safely drop ACPI_HEST_NOTIFY_MCE
> and only consider ACPI_HEST_NOTIFY_SEA as synchronous notification?
>
> Tony, do you have any comments on this? Please point out if I am wrong. Thank you.

You are correct. On x86 the HEST notifications are always asynchronous. The only
synchronous events are machine checks with IA32_MCi_STATUS.AR == 1 (patrol
scrub and cache eviction machine checks are async and do not set this bit).

-Tony
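
For reference, below is a minimal, illustrative sketch of the x86-side distinction Tony describes. It is not part of this patch set; the helper name is made up, and the bit positions are assumed to mirror the MCI_STATUS_UC/MCI_STATUS_AR definitions in arch/x86/include/asm/mce.h.

#include <linux/bits.h>
#include <linux/types.h>

/* Assumed to mirror arch/x86/include/asm/mce.h */
#define MCI_STATUS_UC	BIT_ULL(61)	/* uncorrected error */
#define MCI_STATUS_AR	BIT_ULL(55)	/* action required */

/*
 * Illustration only: on x86 the AR/AO distinction comes from the
 * IA32_MCi_STATUS register rather than from the HEST notification
 * type. Patrol scrub and cache eviction machine checks leave AR clear.
 */
static bool mci_status_action_required(u64 status)
{
	return (status & MCI_STATUS_UC) && (status & MCI_STATUS_AR);
}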

Subject: Re: [PATCH v2 2/2] ACPI: APEI: handle synchronous exceptions in task work

On Thu, Mar 16, 2023 at 07:10:56PM +0800, Shuai Xue wrote:
>
>
> On 2023/3/16 PM3:21, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Mon, Feb 27, 2023 at 01:03:15PM +0800, Shuai Xue wrote:
> >> Hardware errors could be signaled by synchronous interrupt, e.g. when an
> >> error is detected by a background scrubber, or signaled by synchronous
> >> exception, e.g. when an uncorrected error is consumed. Both synchronous and
> >> asynchronous error are queued and handled by a dedicated kthread in
> >> workqueue.
> >>
> >> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
> >> synchronous errors") keep track of whether memory_failure() work was
> >> queued, and make task_work pending to flush out the workqueue so that the
> >> work for synchronous error is processed before returning to user-space.
> >> The trick ensures that the corrupted page is unmapped and poisoned. And
> >> after returning to user-space, the task starts at current instruction which
> >> triggering a page fault in which kernel will send SIGBUS to current process
> >> due to VM_FAULT_HWPOISON.
> >>
> >> However, the memory failure recovery for hwpoison-aware mechanisms does not
> >> work as expected. For example, hwpoison-aware user-space processes like
> >> QEMU register their customized SIGBUS handler and enable early kill mode by
> >> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
> >> the process by sending a SIGBUS signal in memory failure with wrong
> >> si_code: the actual user-space process accessing the corrupt memory
> >> location, but its memory failure work is handled in a kthread context, so
> >> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
> >> process instead of BUS_MCEERR_AR in kill_proc().
> >>
> >> To this end, separate synchronous and asynchronous error handling into
> >> different paths like X86 platform does:
> >>
> >> - task work for synchronous errors.
> >> - and workqueue for asynchronous errors.
> >>
> >> Then for synchronous errors, the current context in memory failure is
> >> exactly belongs to the task consuming poison data and it will send SIBBUS
> >> with proper si_code.
> >>
> >> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
> >> Signed-off-by: Shuai Xue <[email protected]>
> > ...
> >>
> >> /*
> >> - * Called as task_work before returning to user-space.
> >> - * Ensure any queued work has been done before we return to the context that
> >> - * triggered the notification.
> >> + * struct mce_task_work - for synchronous RAS event
> >
> > This seems to handle synchronous memory errors, not limited to MCE?
> > So naming this struct as such (more generally) might be better.
>
> Yes. How about `sync_task_work`?

Sounds better to me.

>
> >
> >> + *
> >> + * @twork: callback_head for task work
> >> + * @pfn: page frame number of corrupted page
> >> + * @flags: fine tune action taken
> >> + *
> >> + * Structure to pass task work to be handled before
> >> + * ret_to_user via task_work_add().
> >> */
> > ...
> >
> >> }
> >>
> >> -static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> >> +static void ghes_do_memory_failure(u64 physical_addr, int flags)
> >> {
> >> unsigned long pfn;
> >> + struct mce_task_work *twcb;
> >>
> >> if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
> >> - return false;
> >> + return;
> >>
> >> pfn = PHYS_PFN(physical_addr);
> >> if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
> >> pr_warn_ratelimited(FW_WARN GHES_PFX
> >> "Invalid address in generic error data: %#llx\n",
> >> physical_addr);
> >> - return false;
> >> + return;
> >> + }
> >> +
> >> + if (flags == MF_ACTION_REQUIRED && current->mm) {
> >> + twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
> >> + if (!twcb)
> >> + return;
> >
> > When this kmalloc() fails, the error event might be silently dropped?
> > If so, some warning messages could be helpful.
>
> Yes, I was going to add a warning messages like:
>
> pr_err("Failed to handle memory failure due to out of memory\n");
>
> But got a warning about patch when checked by checkpatch.pl.
>
> WARNING: Possible unnecessary 'out of memory' message
>
> I will add it back in next version :)

Oh, I didn't know about this warning. I checked the commit log that introduced
this message, and the justification makes sense to me. So I'd like to
withdraw my comment about this (I mean you don't have to add it back).

commit ebfdc40969f24fc0cdd1349835d36e8ebae05374
Author: Joe Perches <[email protected]>
Date: Wed Aug 6 16:10:27 2014 -0700

checkpatch: attempt to find unnecessary 'out of memory' messages

Logging messages that show some type of "out of memory" error are
generally unnecessary as there is a generic message and a stack dump
done by the memory subsystem.

These messages generally increase kernel size without much added value.


Thanks,
Naoya Horiguchi

2023-03-17 01:12:53

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events



On 2023/3/17 AM12:45, Luck, Tony wrote:
>> On x86 platform with MCA, patrol scrub errors are asynchronous error, which are
>> by default signaled with MCE. It is possible to downgrade the patrol scrub SRAO
>> to UCNA or other correctable error in the logging/signaling behavior and signal
>> CMCI only.
>>
>> As far as I know, on X86 platform, MCE is handled in do_machche_check() and only
>> asynchronous error is notified through HEST. Can we safely drop ACPI_HEST_NOTIFY_MCE
>> and only consider ACPI_HEST_NOTIFY_SEA as synchronous notification?
>>
>> Tony, do you have any comments on this? Please point out if I am wrong. Thank you.
>
> You are correct. On x86 the HEST notifications are always asynchronous. The only
> synchronous events are machine check with IA32_MCi_STATUS.AR == 1 (patrol
> scrub and cache eviction machine checks are async and do not set this bit).
>
> -Tony

Thank you for the confirmation. I will drop ACPI_HEST_NOTIFY_MCE in is_hest_sync_notify().

Best Regards.
Shuai

2023-03-17 01:24:30

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ACPI: APEI: handle synchronous exceptions in task work



On 2023/3/17 AM8:29, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Thu, Mar 16, 2023 at 07:10:56PM +0800, Shuai Xue wrote:
>>
>>
>> On 2023/3/16 PM3:21, HORIGUCHI NAOYA(堀口 直也) wrote:
>>> On Mon, Feb 27, 2023 at 01:03:15PM +0800, Shuai Xue wrote:
>>>> Hardware errors could be signaled by synchronous interrupt, e.g. when an
>>>> error is detected by a background scrubber, or signaled by synchronous
>>>> exception, e.g. when an uncorrected error is consumed. Both synchronous and
>>>> asynchronous error are queued and handled by a dedicated kthread in
>>>> workqueue.
>>>>
>>>> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
>>>> synchronous errors") keep track of whether memory_failure() work was
>>>> queued, and make task_work pending to flush out the workqueue so that the
>>>> work for synchronous error is processed before returning to user-space.
>>>> The trick ensures that the corrupted page is unmapped and poisoned. And
>>>> after returning to user-space, the task starts at current instruction which
>>>> triggering a page fault in which kernel will send SIGBUS to current process
>>>> due to VM_FAULT_HWPOISON.
>>>>
>>>> However, the memory failure recovery for hwpoison-aware mechanisms does not
>>>> work as expected. For example, hwpoison-aware user-space processes like
>>>> QEMU register their customized SIGBUS handler and enable early kill mode by
>>>> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
>>>> the process by sending a SIGBUS signal in memory failure with wrong
>>>> si_code: the actual user-space process accessing the corrupt memory
>>>> location, but its memory failure work is handled in a kthread context, so
>>>> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
>>>> process instead of BUS_MCEERR_AR in kill_proc().
>>>>
>>>> To this end, separate synchronous and asynchronous error handling into
>>>> different paths like X86 platform does:
>>>>
>>>> - task work for synchronous errors.
>>>> - and workqueue for asynchronous errors.
>>>>
>>>> Then for synchronous errors, the current context in memory failure is
>>>> exactly belongs to the task consuming poison data and it will send SIBBUS
>>>> with proper si_code.
>>>>
>>>> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
>>>> Signed-off-by: Shuai Xue <[email protected]>
>>> ...
>>>>
>>>> /*
>>>> - * Called as task_work before returning to user-space.
>>>> - * Ensure any queued work has been done before we return to the context that
>>>> - * triggered the notification.
>>>> + * struct mce_task_work - for synchronous RAS event
>>>
>>> This seems to handle synchronous memory errors, not limited to MCE?
>>> So naming this struct as such (more generally) might be better.
>>
>> Yes. How about `sync_task_work`?
>
> Sounds better to me.

Fine, I will change it in the next version.

>>
>>>
>>>> + *
>>>> + * @twork: callback_head for task work
>>>> + * @pfn: page frame number of corrupted page
>>>> + * @flags: fine tune action taken
>>>> + *
>>>> + * Structure to pass task work to be handled before
>>>> + * ret_to_user via task_work_add().
>>>> */
>>> ...
>>>
>>>> }
>>>>
>>>> -static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>>> +static void ghes_do_memory_failure(u64 physical_addr, int flags)
>>>> {
>>>> unsigned long pfn;
>>>> + struct mce_task_work *twcb;
>>>>
>>>> if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>>>> - return false;
>>>> + return;
>>>>
>>>> pfn = PHYS_PFN(physical_addr);
>>>> if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
>>>> pr_warn_ratelimited(FW_WARN GHES_PFX
>>>> "Invalid address in generic error data: %#llx\n",
>>>> physical_addr);
>>>> - return false;
>>>> + return;
>>>> + }
>>>> +
>>>> + if (flags == MF_ACTION_REQUIRED && current->mm) {
>>>> + twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
>>>> + if (!twcb)
>>>> + return;
>>>
>>> When this kmalloc() fails, the error event might be silently dropped?
>>> If so, some warning messages could be helpful.
>>
>> Yes, I was going to add a warning messages like:
>>
>> pr_err("Failed to handle memory failure due to out of memory\n");
>>
>> But got a warning about patch when checked by checkpatch.pl.
>>
>> WARNING: Possible unnecessary 'out of memory' message
>>
>> I will add it back in next version :)
>
> Oh, I didn't know about this warning. I checked the commit log introduced
> this meesages, and the justification makes sense to me. So I'd like to
> withdraw my comment about this (I mean you don't have to add it back).
>
> commit ebfdc40969f24fc0cdd1349835d36e8ebae05374
> Author: Joe Perches <[email protected]>
> Date: Wed Aug 6 16:10:27 2014 -0700
>
> checkpatch: attempt to find unnecessary 'out of memory' messages
>
> Logging messages that show some type of "out of memory" error are
> generally unnecessary as there is a generic message and a stack dump
> done by the memory subsystem.
>
> These messages generally increase kernel size without much added value.

Haha, that's exactly the patch I was referring to (sorry for forgetting to
attach a link in my last reply). So I will not add the warning message back.

>
> Thanks,
> Naoya Horiguchi

Thank you for comments.

Cheers.
Shuai


2023-03-17 07:25:05

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v3 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter
- Link: https://lore.kernel.org/lkml/[email protected]/T/#mb3dede6b7a6d189dc8de3cf9310071e38a192f8e

changes since v1:
- synchronous events by notify type
- Link: https://lore.kernel.org/lkml/[email protected]/

Currently, both synchronous and asynchronous errors are queued and handled
by a dedicated kthread in a workqueue. Memory failure handling for synchronous
errors is synchronized by a cancel_work_sync() trick which ensures that the
corrupted page is unmapped and poisoned. After returning to user-space,
the task restarts at the current instruction, which triggers a page fault in
which the kernel sends SIGBUS to the current process due to VM_FAULT_HWPOISON.

However, memory failure recovery for hwpoison-aware mechanisms does not
work as expected. For example, hwpoison-aware user-space processes like
QEMU register their customized SIGBUS handler and enable early kill mode by
setting PF_MCE_EARLY at initialization. Then the kernel will directly notify
the process by sending a SIGBUS signal in memory failure with the wrong
si_code: the BUS_MCEERR_AO si_code is sent to the actual user-space process instead of
BUS_MCEERR_AR.

To address this problem:

- PATCH 1 sets mf_flags as MF_ACTION_REQUIRED on synchronous events, which
indicates the error happened in the current execution context
- PATCH 2 separates synchronous error handling into task work so that the
current context in memory failure belongs exactly to the task
consuming the poisoned data.

Then, the kernel will send SIGBUS with the proper si_code in kill_proc().

Lv Ying and XiuQi also proposed to address a similar problem, and we discussed
a new solution that adds a new flag (acpi_hest_generic_data::flags bit 8) to
distinguish synchronous events. [2][3] The UEFI community still has no response.
After a deep dive into the SDEI TRM, the SDEI notification should be used for
asynchronous errors. As the SDEI TRM[1] describes, "the dispatcher can simulate an
exception-like entry into the client, **with the client providing an additional
asynchronous entry point similar to an interrupt entry point**". The client
(kernel) lacks the complete synchronous context, e.g. system registers (ELR, ESR,
etc). So the notification type is enough to distinguish synchronous events.

To reproduce this problem:

# STEP1: enable early kill mode
#sysctl -w vm.memory_failure_early_kill=1
vm.memory_failure_early_kill = 1

# STEP2: inject an UCE error and consume it to trigger a synchronous error
#einj_mem_uc single
0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
injecting ...
triggering ...
signal 7 code 5 addr 0xffffb0d75000
page not present
Test passed

The si_code (code 5) from einj_mem_uc indicates that it is a BUS_MCEERR_AO error,
which is not the case.

After this patch set:

# STEP1: enable early kill mode
#sysctl -w vm.memory_failure_early_kill=1
vm.memory_failure_early_kill = 1

# STEP2: inject an UCE error and consume it to trigger a synchronous error
#einj_mem_uc single
0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
injecting ...
triggering ...
signal 7 code 4 addr 0xffffb0d75000
page not present
Test passed

The si_code (code 4) from einj_mem_uc indicates that it is a BUS_MCEERR_AR error,
as we expected.
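
As a side note on why the si_code matters to user space, here is a minimal sketch (not part of this series) of a hwpoison-aware process in the spirit of what QEMU does: it enables early kill for itself and distinguishes BUS_MCEERR_AR (code 4) from BUS_MCEERR_AO (code 5) in its SIGBUS handler. The constants and prctl() flags are the standard ones from <signal.h> and <sys/prctl.h>; everything else is illustrative.

#include <signal.h>
#include <stddef.h>
#include <sys/prctl.h>

static volatile sig_atomic_t poison_is_ar;
static void *poison_addr;

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)ctx;
	/* BUS_MCEERR_AR: this task consumed the poison (code 4 above). */
	/* BUS_MCEERR_AO: a page it owns is poisoned but not yet consumed (code 5). */
	poison_addr = si->si_addr;
	poison_is_ar = (si->si_code == BUS_MCEERR_AR);
}

int main(void)
{
	struct sigaction sa = { 0 };

	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGBUS, &sa, NULL);

	/* Opt in to early (AO) notifications, i.e. set PF_MCE_EARLY for this task. */
	prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);

	/* ... map memory, consume the injected error, then inspect poison_is_ar ... */
	return 0;
}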

[1] https://developer.arm.com/documentation/den0054/latest/
[2] https://lore.kernel.org/linux-arm-kernel/[email protected]/T/
[3] https://lore.kernel.org/lkml/[email protected]/

Shuai Xue (2):
ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
synchronous events
ACPI: APEI: handle synchronous exceptions in task work

drivers/acpi/apei/ghes.c | 135 ++++++++++++++++++++++++---------------
include/acpi/ghes.h | 3 -
mm/memory-failure.c | 13 ----
3 files changed, 83 insertions(+), 68 deletions(-)

--
2.20.1.12.g72788fdb


2023-03-17 07:25:08

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v3 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

There are two major types of uncorrected recoverable (UCR) errors :

- Action Required (AR): The error is detected and the processor has already
consumed the memory. The OS is required to take action (for example, offline the
failing page/kill the failing thread) to recover from this uncorrectable error.

- Action Optional (AO): The error is detected out of processor execution
context. Some data in the memory are corrupted, but the data have not
been consumed. The OS may optionally take action to recover from this
uncorrectable error.

The essential difference between AR and AO errors is that AR is a
synchronous event, while AO is an asynchronous event. The hardware will
signal a synchronous exception (Machine Check Exception on X86 and
Synchronous External Abort on Arm64) when an error is detected and the
memory access has been architecturally executed.

When APEI firmware-first handling is enabled, a platform may describe one error
source for the handling of synchronous errors (e.g. MCE or SEA
notification), or for handling asynchronous errors (e.g. SCI or External
Interrupt notification). In other words, we can distinguish synchronous errors by the
APEI notification type. For AR errors, the kernel will kill the current process
accessing the poisoned page by sending SIGBUS with BUS_MCEERR_AR. In
addition, for AO errors, the kernel will notify the process that owns the
poisoned page by sending SIGBUS with BUS_MCEERR_AO in early kill mode.
However, the GHES driver always sets mf_flags to 0, so all UCR errors
are handled as AO errors in memory failure.

To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
events.

Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")
Signed-off-by: Shuai Xue <[email protected]>
---
drivers/acpi/apei/ghes.c | 29 +++++++++++++++++++++++------
1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 34ad071a64e9..cccd96596efe 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
}

+/*
+ * A platform may describe one error source for the handling of synchronous
+ * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
+ * or External Interrupt). On x86, the HEST notifications are always
+ * asynchronous, so only SEA on ARM is delivered as a synchronous
+ * notification.
+ */
+static inline bool is_hest_sync_notify(struct ghes *ghes)
+{
+ u8 notify_type = ghes->generic->notify.type;
+
+ return notify_type == ACPI_HEST_NOTIFY_SEA;
+}
+
/*
* This driver isn't really modular, however for the time being,
* continuing to use module_param is the easiest way to remain
@@ -477,7 +491,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
}

static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
- int sev)
+ int sev, bool sync)
{
int flags = -1;
int sec_sev = ghes_severity(gdata->error_severity);
@@ -491,7 +505,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
(gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
flags = MF_SOFT_OFFLINE;
if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
- flags = 0;
+ flags = sync ? MF_ACTION_REQUIRED : 0;

if (flags != -1)
return ghes_do_memory_failure(mem_err->physical_addr, flags);
@@ -499,12 +513,14 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
return false;
}

-static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
+static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
+ int sev, bool sync)
{
struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
bool queued = false;
int sec_sev, i;
char *p;
+ int flags = sync ? MF_ACTION_REQUIRED : 0;

log_arm_hw_error(err);

@@ -526,7 +542,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
* and don't filter out 'corrected' error here.
*/
if (is_cache && has_pa) {
- queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
+ queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
p += err_info->length;
continue;
}
@@ -647,6 +663,7 @@ static bool ghes_do_proc(struct ghes *ghes,
const guid_t *fru_id = &guid_null;
char *fru_text = "";
bool queued = false;
+ bool sync = is_hest_sync_notify(ghes);

sev = ghes_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
@@ -664,13 +681,13 @@ static bool ghes_do_proc(struct ghes *ghes,
atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);

arch_apei_report_mem_error(sev, mem_err);
- queued = ghes_handle_memory_failure(gdata, sev);
+ queued = ghes_handle_memory_failure(gdata, sev, sync);
}
else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
ghes_handle_aer(gdata);
}
else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
- queued = ghes_handle_arm_hw_error(gdata, sev);
+ queued = ghes_handle_arm_hw_error(gdata, sev, sync);
} else {
void *err = acpi_hest_get_payload(gdata);

--
2.20.1.12.g72788fdb


2023-03-17 07:25:20

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v3 2/2] ACPI: APEI: handle synchronous exceptions in task work

Hardware errors can be signaled by asynchronous interrupt, e.g. when an
error is detected by a background scrubber, or by synchronous
exception, e.g. when an uncorrected error is consumed. Both synchronous and
asynchronous errors are queued and handled by a dedicated kthread in a
workqueue.

commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
synchronous errors") keeps track of whether memory_failure() work was
queued, and makes a task_work pending to flush out the workqueue so that the
work for a synchronous error is processed before returning to user-space.
The trick ensures that the corrupted page is unmapped and poisoned. And
after returning to user-space, the task restarts at the current instruction, which
triggers a page fault in which the kernel sends SIGBUS to the current process
due to VM_FAULT_HWPOISON.

However, memory failure recovery for hwpoison-aware mechanisms does not
work as expected. For example, hwpoison-aware user-space processes like
QEMU register their customized SIGBUS handler and enable early kill mode by
setting PF_MCE_EARLY at initialization. Then the kernel will directly notify
the process by sending a SIGBUS signal in memory failure with the wrong
si_code: the actual user-space process is accessing the corrupted memory
location, but its memory failure work is handled in a kthread context, so
kill_proc() will send SIGBUS with the BUS_MCEERR_AO si_code to the actual
user-space process instead of BUS_MCEERR_AR.

To this end, separate synchronous and asynchronous error handling into
different paths like the X86 platform does:

- task work for synchronous errors.
- workqueue for asynchronous errors.

Then for synchronous errors, the current context in memory failure
belongs exactly to the task consuming the poisoned data, and it will send SIGBUS
with the proper si_code.

Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
Signed-off-by: Shuai Xue <[email protected]>
---
drivers/acpi/apei/ghes.c | 114 ++++++++++++++++++++++-----------------
include/acpi/ghes.h | 3 --
mm/memory-failure.c | 13 -----
3 files changed, 64 insertions(+), 66 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index cccd96596efe..1901ee3498c4 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -452,45 +452,79 @@ static void ghes_clear_estatus(struct ghes *ghes,
}

/*
- * Called as task_work before returning to user-space.
- * Ensure any queued work has been done before we return to the context that
- * triggered the notification.
+ * struct sync_task_work - for synchronous RAS event
+ *
+ * @twork: callback_head for task work
+ * @pfn: page frame number of corrupted page
+ * @flags: fine tune action taken
+ *
+ * Structure to pass task work to be handled before
+ * ret_to_user via task_work_add().
*/
-static void ghes_kick_task_work(struct callback_head *head)
+struct sync_task_work {
+ struct callback_head twork;
+ u64 pfn;
+ int flags;
+};
+
+static void memory_failure_cb(struct callback_head *twork)
{
- struct acpi_hest_generic_status *estatus;
- struct ghes_estatus_node *estatus_node;
- u32 node_len;
+ int ret;
+ struct sync_task_work *twcb =
+ container_of(twork, struct sync_task_work, twork);

- estatus_node = container_of(head, struct ghes_estatus_node, task_work);
- if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
- memory_failure_queue_kick(estatus_node->task_work_cpu);
+ ret = memory_failure(twcb->pfn, twcb->flags);
+ kfree(twcb);

- estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
- node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
- gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
+ if (!ret)
+ return;
+
+ /*
+ * -EHWPOISON from memory_failure() means that it already sent SIGBUS
+ * to the current process with the proper error info,
+ * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
+ *
+ * In both cases, no further processing is required.
+ */
+ if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
+ return;
+
+ pr_err("Memory error not recovered");
+ force_sig(SIGBUS);
}

-static bool ghes_do_memory_failure(u64 physical_addr, int flags)
+static void ghes_do_memory_failure(u64 physical_addr, int flags)
{
unsigned long pfn;
+ struct sync_task_work *twcb;

if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
- return false;
+ return;

pfn = PHYS_PFN(physical_addr);
if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
pr_warn_ratelimited(FW_WARN GHES_PFX
"Invalid address in generic error data: %#llx\n",
physical_addr);
- return false;
+ return;
+ }
+
+ if (flags == MF_ACTION_REQUIRED && current->mm) {
+ twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
+ if (!twcb)
+ return;
+
+ twcb->pfn = pfn;
+ twcb->flags = flags;
+ init_task_work(&twcb->twork, memory_failure_cb);
+ task_work_add(current, &twcb->twork, TWA_RESUME);
+ return;
}

memory_failure_queue(pfn, flags);
- return true;
}

-static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
+static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
int sev, bool sync)
{
int flags = -1;
@@ -498,7 +532,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);

if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
- return false;
+ return;

/* iff following two events can be handled properly by now */
if (sec_sev == GHES_SEV_CORRECTED &&
@@ -508,16 +542,15 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
flags = sync ? MF_ACTION_REQUIRED : 0;

if (flags != -1)
- return ghes_do_memory_failure(mem_err->physical_addr, flags);
+ ghes_do_memory_failure(mem_err->physical_addr, flags);

- return false;
+ return;
}

-static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
+static void ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
int sev, bool sync)
{
struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
- bool queued = false;
int sec_sev, i;
char *p;
int flags = sync ? MF_ACTION_REQUIRED : 0;
@@ -526,7 +559,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,

sec_sev = ghes_severity(gdata->error_severity);
if (sev != GHES_SEV_RECOVERABLE || sec_sev != GHES_SEV_RECOVERABLE)
- return false;
+ return;

p = (char *)(err + 1);
for (i = 0; i < err->err_info_num; i++) {
@@ -542,7 +575,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
* and don't filter out 'corrected' error here.
*/
if (is_cache && has_pa) {
- queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
+ ghes_do_memory_failure(err_info->physical_fault_addr, flags);
p += err_info->length;
continue;
}
@@ -555,8 +588,6 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
error_type);
p += err_info->length;
}
-
- return queued;
}

/*
@@ -654,7 +685,7 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
schedule_work(&entry->work);
}

-static bool ghes_do_proc(struct ghes *ghes,
+static void ghes_do_proc(struct ghes *ghes,
const struct acpi_hest_generic_status *estatus)
{
int sev, sec_sev;
@@ -662,7 +693,6 @@ static bool ghes_do_proc(struct ghes *ghes,
guid_t *sec_type;
const guid_t *fru_id = &guid_null;
char *fru_text = "";
- bool queued = false;
bool sync = is_hest_sync_notify(ghes);

sev = ghes_severity(estatus->error_severity);
@@ -681,13 +711,13 @@ static bool ghes_do_proc(struct ghes *ghes,
atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);

arch_apei_report_mem_error(sev, mem_err);
- queued = ghes_handle_memory_failure(gdata, sev, sync);
+ ghes_handle_memory_failure(gdata, sev, sync);
}
else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
ghes_handle_aer(gdata);
}
else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
- queued = ghes_handle_arm_hw_error(gdata, sev, sync);
+ ghes_handle_arm_hw_error(gdata, sev, sync);
} else {
void *err = acpi_hest_get_payload(gdata);

@@ -697,8 +727,6 @@ static bool ghes_do_proc(struct ghes *ghes,
gdata->error_data_length);
}
}
-
- return queued;
}

static void __ghes_print_estatus(const char *pfx,
@@ -1000,9 +1028,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
struct ghes_estatus_node *estatus_node;
struct acpi_hest_generic *generic;
struct acpi_hest_generic_status *estatus;
- bool task_work_pending;
u32 len, node_len;
- int ret;

llnode = llist_del_all(&ghes_estatus_llist);
/*
@@ -1017,25 +1043,14 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
len = cper_estatus_len(estatus);
node_len = GHES_ESTATUS_NODE_LEN(len);
- task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
+ ghes_do_proc(estatus_node->ghes, estatus);
if (!ghes_estatus_cached(estatus)) {
generic = estatus_node->generic;
if (ghes_print_estatus(NULL, generic, estatus))
ghes_estatus_cache_add(generic, estatus);
}
-
- if (task_work_pending && current->mm) {
- estatus_node->task_work.func = ghes_kick_task_work;
- estatus_node->task_work_cpu = smp_processor_id();
- ret = task_work_add(current, &estatus_node->task_work,
- TWA_RESUME);
- if (ret)
- estatus_node->task_work.func = NULL;
- }
-
- if (!estatus_node->task_work.func)
- gen_pool_free(ghes_estatus_pool,
- (unsigned long)estatus_node, node_len);
+ gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
+ node_len);

llnode = next;
}
@@ -1096,7 +1111,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,

estatus_node->ghes = ghes;
estatus_node->generic = ghes->generic;
- estatus_node->task_work.func = NULL;
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);

if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 3c8bba9f1114..e5e0c308d27f 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -35,9 +35,6 @@ struct ghes_estatus_node {
struct llist_node llnode;
struct acpi_hest_generic *generic;
struct ghes *ghes;
-
- int task_work_cpu;
- struct callback_head task_work;
};

struct ghes_estatus_cache {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fae9baf3be16..6ea8c325acb3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2355,19 +2355,6 @@ static void memory_failure_work_func(struct work_struct *work)
}
}

-/*
- * Process memory_failure work queued on the specified CPU.
- * Used to avoid return-to-userspace racing with the memory_failure workqueue.
- */
-void memory_failure_queue_kick(int cpu)
-{
- struct memory_failure_cpu *mf_cpu;
-
- mf_cpu = &per_cpu(memory_failure_cpu, cpu);
- cancel_work_sync(&mf_cpu->work);
- memory_failure_work_func(&mf_cpu->work);
-}
-
static int __init memory_failure_init(void)
{
struct memory_failure_cpu *mf_cpu;
--
2.20.1.12.g72788fdb


2023-03-20 18:10:38

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code

On Fri, Mar 17, 2023 at 8:25 AM Shuai Xue <[email protected]> wrote:
>
> changes since v2 by addressing comments from Naoya:
> - rename mce_task_work to sync_task_work
> - drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
> - add steps to reproduce this problem in cover letter
> - Link: https://lore.kernel.org/lkml/[email protected]/T/#mb3dede6b7a6d189dc8de3cf9310071e38a192f8e
>
> changes since v1:
> - synchronous events by notify type
> - Link: https://lore.kernel.org/lkml/[email protected]/
>
> Currently, both synchronous and asynchronous error are queued and handled
> by a dedicated kthread in workqueue. And Memory failure for synchronous
> error is synced by a cancel_work_sync trick which ensures that the
> corrupted page is unmapped and poisoned. And after returning to user-space,
> the task starts at current instruction which triggering a page fault in
> which kernel will send SIGBUS to current process due to VM_FAULT_HWPOISON.
>
> However, the memory failure recovery for hwpoison-aware mechanisms does not
> work as expected. For example, hwpoison-aware user-space processes like
> QEMU register their customized SIGBUS handler and enable early kill mode by
> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
> the process by sending a SIGBUS signal in memory failure with wrong
> si_code: BUS_MCEERR_AO si_code to the actual user-space process instead of
> BUS_MCEERR_AR.
>
> To address this problem:
>
> - PATCH 1 sets mf_flags as MF_ACTION_REQUIRED on synchronous events which
> indicates error happened in current execution context
> - PATCH 2 separates synchronous error handling into task work so that the
> current context in memory failure is exactly belongs to the task
> consuming poison data.
>
> Then, kernel will send SIGBUS with proper si_code in kill_proc().
>
> Lv Ying and XiuQi also proposed to address similar problem and we discussed
> about new solution to add a new flag(acpi_hest_generic_data::flags bit 8) to
> distinguish synchronous event. [2][3] The UEFI community still has no response.
> After a deep dive into the SDEI TRM, the SDEI notification should be used for
> asynchronous error. As SDEI TRM[1] describes "the dispatcher can simulate an
> exception-like entry into the client, **with the client providing an additional
> asynchronous entry point similar to an interrupt entry point**". The client
> (kernel) lacks complete synchronous context, e.g. systeam register (ELR, ESR,
> etc). So notify type is enough to distinguish synchronous event.
>
> To reproduce this problem:
>
> # STEP1: enable early kill mode
> #sysctl -w vm.memory_failure_early_kill=1
> vm.memory_failure_early_kill = 1
>
> # STEP2: inject an UCE error and consume it to trigger a synchronous error
> #einj_mem_uc single
> 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
> injecting ...
> triggering ...
> signal 7 code 5 addr 0xffffb0d75000
> page not present
> Test passed
>
> The si_code (code 5) from einj_mem_uc indicates that it is BUS_MCEERR_AO error
> and it is not fact.
>
> After this patch set:
>
> # STEP1: enable early kill mode
> #sysctl -w vm.memory_failure_early_kill=1
> vm.memory_failure_early_kill = 1
>
> # STEP2: inject an UCE error and consume it to trigger a synchronous error
> #einj_mem_uc single
> 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
> injecting ...
> triggering ...
> signal 7 code 4 addr 0xffffb0d75000
> page not present
> Test passed
>
> The si_code (code 4) from einj_mem_uc indicates that it is BUS_MCEERR_AR error
> as we expected.
>
> [1] https://developer.arm.com/documentation/den0054/latest/
> [2] https://lore.kernel.org/linux-arm-kernel/[email protected]/T/
> [3] https://lore.kernel.org/lkml/[email protected]/
>
> Shuai Xue (2):
> ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
> synchronous events
> ACPI: APEI: handle synchronous exceptions in task work
>
> drivers/acpi/apei/ghes.c | 135 ++++++++++++++++++++++++---------------
> include/acpi/ghes.h | 3 -
> mm/memory-failure.c | 13 ----
> 3 files changed, 83 insertions(+), 68 deletions(-)
>
> --

I really need the designated APEI reviewers to give their feedback on this.

2023-03-21 07:17:44

by mawupeng

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code

Tested-by: Ma Wupeng <[email protected]>

I have tested this on arm64 with the following steps:
1. make memory_failure() return -EBUSY
2. force a UCE with einj

Without this patchset, the user task will not be killed since memory_failure()
cannot handle this UCE properly and the user task is left in D state. The
stack can be found at the end.
With this patchset, the user task can be killed even when memory_failure()
returns -EBUSY without doing anything.
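For reference, a purely illustrative sketch of step 1 (this is an assumption
about how such a test could be done, not code taken from this thread):
short-circuit memory_failure() so that every recovery attempt reports failure.

/*
 * Illustrative test hack only -- not part of any posted patch.
 * Bailing out at the top of memory_failure() makes every recovery
 * attempt report failure, which exercises the "not recovered" paths
 * that this series changes.
 */
int memory_failure(unsigned long pfn, int flags)
{
	/* Test hack: pretend recovery always fails. */
	return -EBUSY;
}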

Here is the stack of the user task in D state:

# cat /proc/7001/stack
[<0>] __flush_work.isra.0+0x80/0xa8
[<0>] __cancel_work_timer+0x144/0x1c8
[<0>] cancel_work_sync+0x1c/0x30
[<0>] memory_failure_queue_kick+0x3c/0x88
[<0>] ghes_kick_task_work+0x28/0x78
[<0>] task_work_run+0xb8/0x188
[<0>] do_notify_resume+0x1e0/0x280
[<0>] el0_da+0x130/0x138
[<0>] el0t_64_sync_handler+0x68/0xc0
[<0>] el0t_64_sync+0x188/0x190

On 2023/3/17 15:24, Shuai Xue wrote:
> changes since v2 by addressing comments from Naoya:
> - rename mce_task_work to sync_task_work
> - drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
> - add steps to reproduce this problem in cover letter
> - Link: https://lore.kernel.org/lkml/[email protected]/T/#mb3dede6b7a6d189dc8de3cf9310071e38a192f8e
>
> changes since v1:
> - synchronous events by notify type
> - Link: https://lore.kernel.org/lkml/[email protected]/
>
> Currently, both synchronous and asynchronous error are queued and handled
> by a dedicated kthread in workqueue. And Memory failure for synchronous
> error is synced by a cancel_work_sync trick which ensures that the
> corrupted page is unmapped and poisoned. And after returning to user-space,
> the task starts at current instruction which triggering a page fault in
> which kernel will send SIGBUS to current process due to VM_FAULT_HWPOISON.
>
> However, the memory failure recovery for hwpoison-aware mechanisms does not
> work as expected. For example, hwpoison-aware user-space processes like
> QEMU register their customized SIGBUS handler and enable early kill mode by
> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
> the process by sending a SIGBUS signal in memory failure with wrong
> si_code: BUS_MCEERR_AO si_code to the actual user-space process instead of
> BUS_MCEERR_AR.
>
> To address this problem:
>
> - PATCH 1 sets mf_flags as MF_ACTION_REQUIRED on synchronous events which
> indicates error happened in current execution context
> - PATCH 2 separates synchronous error handling into task work so that the
> current context in memory failure is exactly belongs to the task
> consuming poison data.
>
> Then, kernel will send SIGBUS with proper si_code in kill_proc().
>
> Lv Ying and XiuQi also proposed to address similar problem and we discussed
> about new solution to add a new flag(acpi_hest_generic_data::flags bit 8) to
> distinguish synchronous event. [2][3] The UEFI community still has no response.
> After a deep dive into the SDEI TRM, the SDEI notification should be used for
> asynchronous error. As SDEI TRM[1] describes "the dispatcher can simulate an
> exception-like entry into the client, **with the client providing an additional
> asynchronous entry point similar to an interrupt entry point**". The client
> (kernel) lacks complete synchronous context, e.g. systeam register (ELR, ESR,
> etc). So notify type is enough to distinguish synchronous event.
>
> To reproduce this problem:
>
> # STEP1: enable early kill mode
> #sysctl -w vm.memory_failure_early_kill=1
> vm.memory_failure_early_kill = 1
>
> # STEP2: inject an UCE error and consume it to trigger a synchronous error
> #einj_mem_uc single
> 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
> injecting ...
> triggering ...
> signal 7 code 5 addr 0xffffb0d75000
> page not present
> Test passed
>
> The si_code (code 5) from einj_mem_uc indicates that it is BUS_MCEERR_AO error
> and it is not fact.
>
> After this patch set:
>
> # STEP1: enable early kill mode
> #sysctl -w vm.memory_failure_early_kill=1
> vm.memory_failure_early_kill = 1
>
> # STEP2: inject an UCE error and consume it to trigger a synchronous error
> #einj_mem_uc single
> 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
> injecting ...
> triggering ...
> signal 7 code 4 addr 0xffffb0d75000
> page not present
> Test passed
>
> The si_code (code 4) from einj_mem_uc indicates that it is BUS_MCEERR_AR error
> as we expected.
>
> [1] https://developer.arm.com/documentation/den0054/latest/
> [2] https://lore.kernel.org/linux-arm-kernel/[email protected]/T/
> [3] https://lore.kernel.org/lkml/[email protected]/
>
> Shuai Xue (2):
> ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
> synchronous events
> ACPI: APEI: handle synchronous exceptions in task work
>
> drivers/acpi/apei/ghes.c | 135 ++++++++++++++++++++++++---------------
> include/acpi/ghes.h | 3 -
> mm/memory-failure.c | 13 ----
> 3 files changed, 83 insertions(+), 68 deletions(-)
>

2023-03-22 01:35:04

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code



On 2023/3/21 PM3:17, mawupeng wrote:
> Test-by: Ma Wupeng <[email protected]>
>
> I have test this on arm64 with following steps:
> 1. make memory failure return EBUSY
> 2. force a UCE with einj
>
> Without this patchset, user task will not be kill since memory_failure can
> not handle this UCE properly and user task is in D state. The stack can
> be found in the end.
> With this patchset, user task can be killed even memory_failure return
> -EBUSY without doing anything.
>
> Here is the stack of user task with D state:
>
> # cat /proc/7001/stack
> [<0>] __flush_work.isra.0+0x80/0xa8
> [<0>] __cancel_work_timer+0x144/0x1c8
> [<0>] cancel_work_sync+0x1c/0x30
> [<0>] memory_failure_queue_kick+0x3c/0x88
> [<0>] ghes_kick_task_work+0x28/0x78
> [<0>] task_work_run+0xb8/0x188
> [<0>] do_notify_resume+0x1e0/0x280
> [<0>] el0_da+0x130/0x138
> [<0>] el0t_64_sync_handler+0x68/0xc0
> [<0>] el0t_64_sync+0x188/0x190

Thank you :)

Cheers,
Shuai

>
> On 2023/3/17 15:24, Shuai Xue wrote:
>> changes since v2 by addressing comments from Naoya:
>> - rename mce_task_work to sync_task_work
>> - drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
>> - add steps to reproduce this problem in cover letter
>> - Link: https://lore.kernel.org/lkml/[email protected]/T/#mb3dede6b7a6d189dc8de3cf9310071e38a192f8e
>>
>> changes since v1:
>> - synchronous events by notify type
>> - Link: https://lore.kernel.org/lkml/[email protected]/
>>
>> Currently, both synchronous and asynchronous error are queued and handled
>> by a dedicated kthread in workqueue. And Memory failure for synchronous
>> error is synced by a cancel_work_sync trick which ensures that the
>> corrupted page is unmapped and poisoned. And after returning to user-space,
>> the task starts at current instruction which triggering a page fault in
>> which kernel will send SIGBUS to current process due to VM_FAULT_HWPOISON.
>>
>> However, the memory failure recovery for hwpoison-aware mechanisms does not
>> work as expected. For example, hwpoison-aware user-space processes like
>> QEMU register their customized SIGBUS handler and enable early kill mode by
>> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
>> the process by sending a SIGBUS signal in memory failure with wrong
>> si_code: BUS_MCEERR_AO si_code to the actual user-space process instead of
>> BUS_MCEERR_AR.
>>
>> To address this problem:
>>
>> - PATCH 1 sets mf_flags as MF_ACTION_REQUIRED on synchronous events which
>> indicates error happened in current execution context
>> - PATCH 2 separates synchronous error handling into task work so that the
>> current context in memory failure is exactly belongs to the task
>> consuming poison data.
>>
>> Then, kernel will send SIGBUS with proper si_code in kill_proc().
>>
>> Lv Ying and XiuQi also proposed to address similar problem and we discussed
>> about new solution to add a new flag(acpi_hest_generic_data::flags bit 8) to
>> distinguish synchronous event. [2][3] The UEFI community still has no response.
>> After a deep dive into the SDEI TRM, the SDEI notification should be used for
>> asynchronous error. As SDEI TRM[1] describes "the dispatcher can simulate an
>> exception-like entry into the client, **with the client providing an additional
>> asynchronous entry point similar to an interrupt entry point**". The client
>> (kernel) lacks complete synchronous context, e.g. systeam register (ELR, ESR,
>> etc). So notify type is enough to distinguish synchronous event.
>>
>> To reproduce this problem:
>>
>> # STEP1: enable early kill mode
>> #sysctl -w vm.memory_failure_early_kill=1
>> vm.memory_failure_early_kill = 1
>>
>> # STEP2: inject an UCE error and consume it to trigger a synchronous error
>> #einj_mem_uc single
>> 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
>> injecting ...
>> triggering ...
>> signal 7 code 5 addr 0xffffb0d75000
>> page not present
>> Test passed
>>
>> The si_code (code 5) from einj_mem_uc indicates that it is BUS_MCEERR_AO error
>> and it is not fact.
>>
>> After this patch set:
>>
>> # STEP1: enable early kill mode
>> #sysctl -w vm.memory_failure_early_kill=1
>> vm.memory_failure_early_kill = 1
>>
>> # STEP2: inject an UCE error and consume it to trigger a synchronous error
>> #einj_mem_uc single
>> 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
>> injecting ...
>> triggering ...
>> signal 7 code 4 addr 0xffffb0d75000
>> page not present
>> Test passed
>>
>> The si_code (code 4) from einj_mem_uc indicates that it is BUS_MCEERR_AR error
>> as we expected.
>>
>> [1] https://developer.arm.com/documentation/den0054/latest/
>> [2] https://lore.kernel.org/linux-arm-kernel/[email protected]/T/
>> [3] https://lore.kernel.org/lkml/[email protected]/
>>
>> Shuai Xue (2):
>> ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
>> synchronous events
>> ACPI: APEI: handle synchronous exceptions in task work
>>
>> drivers/acpi/apei/ghes.c | 135 ++++++++++++++++++++++++---------------
>> include/acpi/ghes.h | 3 -
>> mm/memory-failure.c | 13 ----
>> 3 files changed, 83 insertions(+), 68 deletions(-)
>>

2023-03-30 06:20:49

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code


On 2023/3/21 AM2:03, Rafael J. Wysocki wrote:
> On Fri, Mar 17, 2023 at 8:25 AM Shuai Xue <[email protected]> wrote:
>>
>> changes since v2 by addressing comments from Naoya:
>> - rename mce_task_work to sync_task_work
>> - drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
>> - add steps to reproduce this problem in cover letter
>> - Link: https://lore.kernel.org/lkml/[email protected]/T/#mb3dede6b7a6d189dc8de3cf9310071e38a192f8e
>>
>> changes since v1:
>> - synchronous events by notify type
>> - Link: https://lore.kernel.org/lkml/[email protected]/
>>
>> Currently, both synchronous and asynchronous error are queued and handled
>> by a dedicated kthread in workqueue. And Memory failure for synchronous
>> error is synced by a cancel_work_sync trick which ensures that the
>> corrupted page is unmapped and poisoned. And after returning to user-space,
>> the task starts at current instruction which triggering a page fault in
>> which kernel will send SIGBUS to current process due to VM_FAULT_HWPOISON.
>>
>> However, the memory failure recovery for hwpoison-aware mechanisms does not
>> work as expected. For example, hwpoison-aware user-space processes like
>> QEMU register their customized SIGBUS handler and enable early kill mode by
>> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
>> the process by sending a SIGBUS signal in memory failure with wrong
>> si_code: BUS_MCEERR_AO si_code to the actual user-space process instead of
>> BUS_MCEERR_AR.
>>
>> To address this problem:
>>
>> - PATCH 1 sets mf_flags as MF_ACTION_REQUIRED on synchronous events which
>> indicates error happened in current execution context
>> - PATCH 2 separates synchronous error handling into task work so that the
>> current context in memory failure is exactly belongs to the task
>> consuming poison data.
>>
>> Then, kernel will send SIGBUS with proper si_code in kill_proc().
>>
>> Lv Ying and XiuQi also proposed to address similar problem and we discussed
>> about new solution to add a new flag(acpi_hest_generic_data::flags bit 8) to
>> distinguish synchronous event. [2][3] The UEFI community still has no response.
>> After a deep dive into the SDEI TRM, the SDEI notification should be used for
>> asynchronous error. As SDEI TRM[1] describes "the dispatcher can simulate an
>> exception-like entry into the client, **with the client providing an additional
>> asynchronous entry point similar to an interrupt entry point**". The client
>> (kernel) lacks complete synchronous context, e.g. systeam register (ELR, ESR,
>> etc). So notify type is enough to distinguish synchronous event.
>>
>> To reproduce this problem:
>>
>> # STEP1: enable early kill mode
>> #sysctl -w vm.memory_failure_early_kill=1
>> vm.memory_failure_early_kill = 1
>>
>> # STEP2: inject an UCE error and consume it to trigger a synchronous error
>> #einj_mem_uc single
>> 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
>> injecting ...
>> triggering ...
>> signal 7 code 5 addr 0xffffb0d75000
>> page not present
>> Test passed
>>
>> The si_code (code 5) from einj_mem_uc indicates that it is BUS_MCEERR_AO error
>> and it is not fact.
>>
>> After this patch set:
>>
>> # STEP1: enable early kill mode
>> #sysctl -w vm.memory_failure_early_kill=1
>> vm.memory_failure_early_kill = 1
>>
>> # STEP2: inject an UCE error and consume it to trigger a synchronous error
>> #einj_mem_uc single
>> 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
>> injecting ...
>> triggering ...
>> signal 7 code 4 addr 0xffffb0d75000
>> page not present
>> Test passed
>>
>> The si_code (code 4) from einj_mem_uc indicates that it is BUS_MCEERR_AR error
>> as we expected.
>>
>> [1] https://developer.arm.com/documentation/den0054/latest/
>> [2] https://lore.kernel.org/linux-arm-kernel/[email protected]/T/
>> [3] https://lore.kernel.org/lkml/[email protected]/
>>
>> Shuai Xue (2):
>> ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
>> synchronous events
>> ACPI: APEI: handle synchronous exceptions in task work
>>
>> drivers/acpi/apei/ghes.c | 135 ++++++++++++++++++++++++---------------
>> include/acpi/ghes.h | 3 -
>> mm/memory-failure.c | 13 ----
>> 3 files changed, 83 insertions(+), 68 deletions(-)
>>
>> --
>
> I really need the designated APEI reviewers to give their feedback on this.

Gentle ping.

Best Regards.
Shuai




2023-03-30 09:57:55

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code

On Thu, Mar 30, 2023 at 8:11 AM Shuai Xue <[email protected]> wrote:
>
>
> On 2023/3/21 AM2:03, Rafael J. Wysocki wrote:
> > On Fri, Mar 17, 2023 at 8:25 AM Shuai Xue <[email protected]> wrote:
> >>
> >> changes since v2 by addressing comments from Naoya:
> >> - rename mce_task_work to sync_task_work
> >> - drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
> >> - add steps to reproduce this problem in cover letter
> >> - Link: https://lore.kernel.org/lkml/[email protected]/T/#mb3dede6b7a6d189dc8de3cf9310071e38a192f8e
> >>
> >> changes since v1:
> >> - synchronous events by notify type
> >> - Link: https://lore.kernel.org/lkml/[email protected]/
> >>
> >> Currently, both synchronous and asynchronous error are queued and handled
> >> by a dedicated kthread in workqueue. And Memory failure for synchronous
> >> error is synced by a cancel_work_sync trick which ensures that the
> >> corrupted page is unmapped and poisoned. And after returning to user-space,
> >> the task starts at current instruction which triggering a page fault in
> >> which kernel will send SIGBUS to current process due to VM_FAULT_HWPOISON.
> >>
> >> However, the memory failure recovery for hwpoison-aware mechanisms does not
> >> work as expected. For example, hwpoison-aware user-space processes like
> >> QEMU register their customized SIGBUS handler and enable early kill mode by
> >> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
> >> the process by sending a SIGBUS signal in memory failure with wrong
> >> si_code: BUS_MCEERR_AO si_code to the actual user-space process instead of
> >> BUS_MCEERR_AR.
> >>
> >> To address this problem:
> >>
> >> - PATCH 1 sets mf_flags as MF_ACTION_REQUIRED on synchronous events which
> >> indicates error happened in current execution context
> >> - PATCH 2 separates synchronous error handling into task work so that the
> >> current context in memory failure is exactly belongs to the task
> >> consuming poison data.
> >>
> >> Then, kernel will send SIGBUS with proper si_code in kill_proc().
> >>
> >> Lv Ying and XiuQi also proposed to address similar problem and we discussed
> >> about new solution to add a new flag(acpi_hest_generic_data::flags bit 8) to
> >> distinguish synchronous event. [2][3] The UEFI community still has no response.
> >> After a deep dive into the SDEI TRM, the SDEI notification should be used for
> >> asynchronous error. As SDEI TRM[1] describes "the dispatcher can simulate an
> >> exception-like entry into the client, **with the client providing an additional
> >> asynchronous entry point similar to an interrupt entry point**". The client
> >> (kernel) lacks complete synchronous context, e.g. systeam register (ELR, ESR,
> >> etc). So notify type is enough to distinguish synchronous event.
> >>
> >> To reproduce this problem:
> >>
> >> # STEP1: enable early kill mode
> >> #sysctl -w vm.memory_failure_early_kill=1
> >> vm.memory_failure_early_kill = 1
> >>
> >> # STEP2: inject an UCE error and consume it to trigger a synchronous error
> >> #einj_mem_uc single
> >> 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
> >> injecting ...
> >> triggering ...
> >> signal 7 code 5 addr 0xffffb0d75000
> >> page not present
> >> Test passed
> >>
> >> The si_code (code 5) from einj_mem_uc indicates that it is BUS_MCEERR_AO error
> >> and it is not fact.
> >>
> >> After this patch set:
> >>
> >> # STEP1: enable early kill mode
> >> #sysctl -w vm.memory_failure_early_kill=1
> >> vm.memory_failure_early_kill = 1
> >>
> >> # STEP2: inject an UCE error and consume it to trigger a synchronous error
> >> #einj_mem_uc single
> >> 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
> >> injecting ...
> >> triggering ...
> >> signal 7 code 4 addr 0xffffb0d75000
> >> page not present
> >> Test passed
> >>
> >> The si_code (code 4) from einj_mem_uc indicates that it is BUS_MCEERR_AR error
> >> as we expected.
> >>
> >> [1] https://developer.arm.com/documentation/den0054/latest/
> >> [2] https://lore.kernel.org/linux-arm-kernel/[email protected]/T/
> >> [3] https://lore.kernel.org/lkml/[email protected]/
> >>
> >> Shuai Xue (2):
> >> ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
> >> synchronous events
> >> ACPI: APEI: handle synchronous exceptions in task work
> >>
> >> drivers/acpi/apei/ghes.c | 135 ++++++++++++++++++++++++---------------
> >> include/acpi/ghes.h | 3 -
> >> mm/memory-failure.c | 13 ----
> >> 3 files changed, 83 insertions(+), 68 deletions(-)
> >>
> >> --
> >
> > I really need the designated APEI reviewers to give their feedback on this.
>
> Gentle ping.

As already stated in this thread, this series requires reviews from
the designated APEI reviewers (Tony, Boris, James).

Thanks!

2023-04-06 12:49:25

by Xiaofei Tan

[permalink] [raw]
Subject: Re: [PATCH v3 2/2] ACPI: APEI: handle synchronous exceptions in task work

Hi Shuai,

Thanks for this effort, it's great.
Some comments below.

On 2023/3/17 15:24, Shuai Xue wrote:
> Hardware errors could be signaled by synchronous interrupt, e.g. when an
> error is detected by a background scrubber, or signaled by synchronous
> exception, e.g. when an uncorrected error is consumed. Both synchronous and
> asynchronous error are queued and handled by a dedicated kthread in
> workqueue.
>
> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
> synchronous errors") keep track of whether memory_failure() work was
> queued, and make task_work pending to flush out the workqueue so that the
> work for synchronous error is processed before returning to user-space.
> The trick ensures that the corrupted page is unmapped and poisoned. And
> after returning to user-space, the task starts at current instruction which
> triggering a page fault in which kernel will send SIGBUS to current process
> due to VM_FAULT_HWPOISON.
>
> However, the memory failure recovery for hwpoison-aware mechanisms does not
> work as expected. For example, hwpoison-aware user-space processes like
> QEMU register their customized SIGBUS handler and enable early kill mode by
> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
> the process by sending a SIGBUS signal in memory failure with wrong
> si_code: the actual user-space process accessing the corrupt memory
> location, but its memory failure work is handled in a kthread context, so
> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
> process instead of BUS_MCEERR_AR in kill_proc().
>
> To this end, separate synchronous and asynchronous error handling into
> different paths like X86 platform does:
>
> - task work for synchronous errors.
> - and workqueue for asynchronous errors.
>
> Then for synchronous errors, the current context in memory failure is
> exactly belongs to the task consuming poison data and it will send SIBBUS
> with proper si_code.
>
> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
> Signed-off-by: Shuai Xue <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 114 ++++++++++++++++++++++-----------------
> include/acpi/ghes.h | 3 --
> mm/memory-failure.c | 13 -----
> 3 files changed, 64 insertions(+), 66 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index cccd96596efe..1901ee3498c4 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -452,45 +452,79 @@ static void ghes_clear_estatus(struct ghes *ghes,
> }
>
> /*
> - * Called as task_work before returning to user-space.
> - * Ensure any queued work has been done before we return to the context that
> - * triggered the notification.
> + * struct sync_task_work - for synchronous RAS event
> + *
> + * @twork: callback_head for task work
> + * @pfn: page frame number of corrupted page
> + * @flags: fine tune action taken
> + *
> + * Structure to pass task work to be handled before
> + * ret_to_user via task_work_add().
> */
> -static void ghes_kick_task_work(struct callback_head *head)
> +struct sync_task_work {
> + struct callback_head twork;
> + u64 pfn;
> + int flags;
> +};
> +
> +static void memory_failure_cb(struct callback_head *twork)
> {
> - struct acpi_hest_generic_status *estatus;
> - struct ghes_estatus_node *estatus_node;
> - u32 node_len;
> + int ret;
> + struct sync_task_work *twcb =
> + container_of(twork, struct sync_task_work, twork);
>
> - estatus_node = container_of(head, struct ghes_estatus_node, task_work);
> - if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
> - memory_failure_queue_kick(estatus_node->task_work_cpu);
> + ret = memory_failure(twcb->pfn, twcb->flags);
> + kfree(twcb);
>
> - estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> - node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
> - gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
> + if (!ret)
> + return;
> +
> + /*
> + * -EHWPOISON from memory_failure() means that it already sent SIGBUS
> + * to the current process with the proper error info,
> + * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
> + *
> + * In both cases, no further processing is required.
> + */
> + if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
> + return;
> +
> + pr_err("Memory error not recovered");
> + force_sig(SIGBUS);
> }
>
> -static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> +static void ghes_do_memory_failure(u64 physical_addr, int flags)
> {
> unsigned long pfn;
> + struct sync_task_work *twcb;
>
> if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
> - return false;
> + return;
>
> pfn = PHYS_PFN(physical_addr);
> if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
> pr_warn_ratelimited(FW_WARN GHES_PFX
> "Invalid address in generic error data: %#llx\n",
> physical_addr);
> - return false;
> + return;

For synchronous errors, we need to send SIGBUS to the current task if the error
is not recovered, matching the behavior this patch adds in memory_failure_cb().
Such abnormal branches should also be treated as not recovered.
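A minimal sketch of the fallback being asked for here (the helper name is
hypothetical; it only mirrors the pr_err() + force_sig(SIGBUS) fallback this
patch already uses in memory_failure_cb()):

/*
 * Hypothetical helper, assuming ghes.c kernel context: when a
 * synchronous error cannot even be queued for recovery, do not let
 * the consuming task silently continue on poisoned data.
 */
static void ghes_force_kill_current(const char *reason)
{
	pr_err("Memory error not recovered: %s\n", reason);
	force_sig(SIGBUS);
}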


> + }
> +
> + if (flags == MF_ACTION_REQUIRED && current->mm) {
> + twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
> + if (!twcb)
> + return;

It's the same here.


> +
> + twcb->pfn = pfn;
> + twcb->flags = flags;
> + init_task_work(&twcb->twork, memory_failure_cb);
> + task_work_add(current, &twcb->twork, TWA_RESUME);
> + return;
> }
>
> memory_failure_queue(pfn, flags);
> - return true;
> }
>
> -static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> +static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> int sev, bool sync)
> {
> int flags = -1;
> @@ -498,7 +532,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
>
> if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
> - return false;
> + return;

and here.


>
> /* iff following two events can be handled properly by now */
> if (sec_sev == GHES_SEV_CORRECTED &&
> @@ -508,16 +542,15 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> flags = sync ? MF_ACTION_REQUIRED : 0;
>
> if (flags != -1)
> - return ghes_do_memory_failure(mem_err->physical_addr, flags);
> + ghes_do_memory_failure(mem_err->physical_addr, flags);
>
> - return false;
> + return;
> }
>
> -static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
> +static void ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
> int sev, bool sync)
> {
> struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
> - bool queued = false;
> int sec_sev, i;
> char *p;
> int flags = sync ? MF_ACTION_REQUIRED : 0;
> @@ -526,7 +559,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
>
> sec_sev = ghes_severity(gdata->error_severity);
> if (sev != GHES_SEV_RECOVERABLE || sec_sev != GHES_SEV_RECOVERABLE)
> - return false;
> + return;

and here.


>
> p = (char *)(err + 1);
> for (i = 0; i < err->err_info_num; i++) {
> @@ -542,7 +575,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
> * and don't filter out 'corrected' error here.
> */
> if (is_cache && has_pa) {
> - queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
> + ghes_do_memory_failure(err_info->physical_fault_addr, flags);
> p += err_info->length;
> continue;
> }
> @@ -555,8 +588,6 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
> error_type);
> p += err_info->length;
> }

and here, for the case where memory failure handling is not done because the PA is invalid.


> -
> - return queued;
> }
>
> /*
> @@ -654,7 +685,7 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
> schedule_work(&entry->work);
> }
>
> -static bool ghes_do_proc(struct ghes *ghes,
> +static void ghes_do_proc(struct ghes *ghes,
> const struct acpi_hest_generic_status *estatus)
> {
> int sev, sec_sev;
> @@ -662,7 +693,6 @@ static bool ghes_do_proc(struct ghes *ghes,
> guid_t *sec_type;
> const guid_t *fru_id = &guid_null;
> char *fru_text = "";
> - bool queued = false;
> bool sync = is_hest_sync_notify(ghes);
>
> sev = ghes_severity(estatus->error_severity);
> @@ -681,13 +711,13 @@ static bool ghes_do_proc(struct ghes *ghes,
> atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);
>
> arch_apei_report_mem_error(sev, mem_err);
> - queued = ghes_handle_memory_failure(gdata, sev, sync);
> + ghes_handle_memory_failure(gdata, sev, sync);
> }
> else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
> ghes_handle_aer(gdata);
> }
> else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
> - queued = ghes_handle_arm_hw_error(gdata, sev, sync);
> + ghes_handle_arm_hw_error(gdata, sev, sync);
> } else {
> void *err = acpi_hest_get_payload(gdata);
>
> @@ -697,8 +727,6 @@ static bool ghes_do_proc(struct ghes *ghes,
> gdata->error_data_length);
> }
> }
> -
> - return queued;
> }
>
> static void __ghes_print_estatus(const char *pfx,
> @@ -1000,9 +1028,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
> struct ghes_estatus_node *estatus_node;
> struct acpi_hest_generic *generic;
> struct acpi_hest_generic_status *estatus;
> - bool task_work_pending;
> u32 len, node_len;
> - int ret;
>
> llnode = llist_del_all(&ghes_estatus_llist);
> /*
> @@ -1017,25 +1043,14 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
> estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> len = cper_estatus_len(estatus);
> node_len = GHES_ESTATUS_NODE_LEN(len);
> - task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
> + ghes_do_proc(estatus_node->ghes, estatus);
> if (!ghes_estatus_cached(estatus)) {
> generic = estatus_node->generic;
> if (ghes_print_estatus(NULL, generic, estatus))
> ghes_estatus_cache_add(generic, estatus);
> }
> -
> - if (task_work_pending && current->mm) {
> - estatus_node->task_work.func = ghes_kick_task_work;
> - estatus_node->task_work_cpu = smp_processor_id();
> - ret = task_work_add(current, &estatus_node->task_work,
> - TWA_RESUME);
> - if (ret)
> - estatus_node->task_work.func = NULL;
> - }
> -
> - if (!estatus_node->task_work.func)
> - gen_pool_free(ghes_estatus_pool,
> - (unsigned long)estatus_node, node_len);
> + gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
> + node_len);
>
> llnode = next;
> }
> @@ -1096,7 +1111,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
>
> estatus_node->ghes = ghes;
> estatus_node->generic = ghes->generic;
> - estatus_node->task_work.func = NULL;
> estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>
> if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index 3c8bba9f1114..e5e0c308d27f 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -35,9 +35,6 @@ struct ghes_estatus_node {
> struct llist_node llnode;
> struct acpi_hest_generic *generic;
> struct ghes *ghes;
> -
> - int task_work_cpu;
> - struct callback_head task_work;
> };
>
> struct ghes_estatus_cache {
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fae9baf3be16..6ea8c325acb3 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2355,19 +2355,6 @@ static void memory_failure_work_func(struct work_struct *work)
> }
> }
>
> -/*
> - * Process memory_failure work queued on the specified CPU.
> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
> - */
> -void memory_failure_queue_kick(int cpu)
> -{
> - struct memory_failure_cpu *mf_cpu;
> -
> - mf_cpu = &per_cpu(memory_failure_cpu, cpu);
> - cancel_work_sync(&mf_cpu->work);
> - memory_failure_work_func(&mf_cpu->work);
> -}
> -
> static int __init memory_failure_init(void)
> {
> struct memory_failure_cpu *mf_cpu;

2023-04-07 02:39:02

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v3 2/2] ACPI: APEI: handle synchronous exceptions in task work



On 2023/4/6 PM8:39, Xiaofei Tan wrote:
> Hi Shuai,
>
> Thanks for your this effort, and it's great.
> Some comments below.
>
> On 2023/3/17 15:24, Shuai Xue wrote:
>> Hardware errors could be signaled by synchronous interrupt, e.g.  when an
>> error is detected by a background scrubber, or signaled by synchronous
>> exception, e.g. when an uncorrected error is consumed. Both synchronous and
>> asynchronous error are queued and handled by a dedicated kthread in
>> workqueue.
>>
>> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
>> synchronous errors") keep track of whether memory_failure() work was
>> queued, and make task_work pending to flush out the workqueue so that the
>> work for synchronous error is processed before returning to user-space.
>> The trick ensures that the corrupted page is unmapped and poisoned. And
>> after returning to user-space, the task starts at current instruction which
>> triggering a page fault in which kernel will send SIGBUS to current process
>> due to VM_FAULT_HWPOISON.
>>
>> However, the memory failure recovery for hwpoison-aware mechanisms does not
>> work as expected. For example, hwpoison-aware user-space processes like
>> QEMU register their customized SIGBUS handler and enable early kill mode by
>> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
>> the process by sending a SIGBUS signal in memory failure with wrong
>> si_code: the actual user-space process accessing the corrupt memory
>> location, but its memory failure work is handled in a kthread context, so
>> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
>> process instead of BUS_MCEERR_AR in kill_proc().
>>
>> To this end, separate synchronous and asynchronous error handling into
>> different paths like X86 platform does:
>>
>> - task work for synchronous errors.
>> - and workqueue for asynchronous errors.
>>
>> Then for synchronous errors, the current context in memory failure is
>> exactly belongs to the task consuming poison data and it will send SIBBUS
>> with proper si_code.
>>
>> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
>> Signed-off-by: Shuai Xue <[email protected]>
>> ---
>>   drivers/acpi/apei/ghes.c | 114 ++++++++++++++++++++++-----------------
>>   include/acpi/ghes.h      |   3 --
>>   mm/memory-failure.c      |  13 -----
>>   3 files changed, 64 insertions(+), 66 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index cccd96596efe..1901ee3498c4 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -452,45 +452,79 @@ static void ghes_clear_estatus(struct ghes *ghes,
>>   }
>>     /*
>> - * Called as task_work before returning to user-space.
>> - * Ensure any queued work has been done before we return to the context that
>> - * triggered the notification.
>> + * struct sync_task_work - for synchronous RAS event
>> + *
>> + * @twork:                callback_head for task work
>> + * @pfn:                  page frame number of corrupted page
>> + * @flags:                fine tune action taken
>> + *
>> + * Structure to pass task work to be handled before
>> + * ret_to_user via task_work_add().
>>    */
>> -static void ghes_kick_task_work(struct callback_head *head)
>> +struct sync_task_work {
>> +    struct callback_head twork;
>> +    u64 pfn;
>> +    int flags;
>> +};
>> +
>> +static void memory_failure_cb(struct callback_head *twork)
>>   {
>> -    struct acpi_hest_generic_status *estatus;
>> -    struct ghes_estatus_node *estatus_node;
>> -    u32 node_len;
>> +    int ret;
>> +    struct sync_task_work *twcb =
>> +        container_of(twork, struct sync_task_work, twork);
>>   -    estatus_node = container_of(head, struct ghes_estatus_node, task_work);
>> -    if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>> -        memory_failure_queue_kick(estatus_node->task_work_cpu);
>> +    ret = memory_failure(twcb->pfn, twcb->flags);
>> +    kfree(twcb);
>>   -    estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>> -    node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
>> -    gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
>> +    if (!ret)
>> +        return;
>> +
>> +    /*
>> +     * -EHWPOISON from memory_failure() means that it already sent SIGBUS
>> +     * to the current process with the proper error info,
>> +     * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
>> +     *
>> +     * In both cases, no further processing is required.
>> +     */
>> +    if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
>> +        return;
>> +
>> +    pr_err("Memory error not recovered");
>> +    force_sig(SIGBUS);
>>   }
>>   -static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>> +static void ghes_do_memory_failure(u64 physical_addr, int flags)
>>   {
>>       unsigned long pfn;
>> +    struct sync_task_work *twcb;
>>         if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>> -        return false;
>> +        return;
>>         pfn = PHYS_PFN(physical_addr);
>>       if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
>>           pr_warn_ratelimited(FW_WARN GHES_PFX
>>           "Invalid address in generic error data: %#llx\n",
>>           physical_addr);
>> -        return false;
>> +        return;
>
> For synchronous errors, we need send SIGBUS to the current task if not recovered,
> as the behavior of this patch  in the function memory_failure_cb().
> Such abnormal branches should also be taken as not recovered.

You are right. Thank you for pointing this out. I overlooked the abnormal
branches. To sum up, there are three cases:

- valid synchronous errors: queue a task_work to synchronously send SIGBUS
before ret_to_user.
- valid asynchronous errors: queue a work into the workqueue to asynchronously
handle memory failure.
- abnormal branches such as invalid PA, unexpected severity, no memory
failure config support, invalid GUID section, OOM, etc.

As you commented, the abnormal branches should also lead to a SIGBUS; I will
handle that in ghes_proc_in_irq() if no work (task work or workqueue work) was
queued.
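
A rough sketch of that plan (an assumption about the next revision, not code
from this thread; the sync/queued plumbing shown here is illustrative):

/*
 * Hypothetical fallback in ghes_proc_in_irq(), assuming ghes_do_proc()
 * is changed to report whether any recovery work was queued: if the
 * event arrived via a synchronous notification but nothing was queued
 * (invalid PA, OOM, unexpected severity, ...), force-kill the
 * consuming task so it cannot keep running on poisoned data.
 */
static void ghes_force_kill_if_unhandled(bool sync, bool queued)
{
	if (sync && !queued && current->mm) {
		pr_err("Sending SIGBUS to current task due to memory error not recovered\n");
		force_sig(SIGBUS);
	}
}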

Best Regards,
Shuai

>
>
>> +    }
>> +
>> +    if (flags == MF_ACTION_REQUIRED && current->mm) {
>> +        twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
>> +        if (!twcb)
>> +            return;
>
> It's the same here.
>
>
>> +
>> +        twcb->pfn = pfn;
>> +        twcb->flags = flags;
>> +        init_task_work(&twcb->twork, memory_failure_cb);
>> +        task_work_add(current, &twcb->twork, TWA_RESUME);
>> +        return;
>>       }
>>         memory_failure_queue(pfn, flags);
>> -    return true;
>>   }
>>   -static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>> +static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>>                          int sev, bool sync)
>>   {
>>       int flags = -1;
>> @@ -498,7 +532,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>>       struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
>>         if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>> -        return false;
>> +        return;
>
> and here.
>
>
>>         /* iff following two events can be handled properly by now */
>>       if (sec_sev == GHES_SEV_CORRECTED &&
>> @@ -508,16 +542,15 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>>           flags = sync ? MF_ACTION_REQUIRED : 0;
>>         if (flags != -1)
>> -        return ghes_do_memory_failure(mem_err->physical_addr, flags);
>> +        ghes_do_memory_failure(mem_err->physical_addr, flags);
>>   -    return false;
>> +    return;
>>   }
>>   -static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
>> +static void ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
>>                          int sev, bool sync)
>>   {
>>       struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
>> -    bool queued = false;
>>       int sec_sev, i;
>>       char *p;
>>       int flags = sync ? MF_ACTION_REQUIRED : 0;
>> @@ -526,7 +559,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
>>         sec_sev = ghes_severity(gdata->error_severity);
>>       if (sev != GHES_SEV_RECOVERABLE || sec_sev != GHES_SEV_RECOVERABLE)
>> -        return false;
>> +        return;
>
> and here.
>
>
>>         p = (char *)(err + 1);
>>       for (i = 0; i < err->err_info_num; i++) {
>> @@ -542,7 +575,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
>>            * and don't filter out 'corrected' error here.
>>            */
>>           if (is_cache && has_pa) {
>> -            queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
>> +            ghes_do_memory_failure(err_info->physical_fault_addr, flags);
>>               p += err_info->length;
>>               continue;
>>           }
>> @@ -555,8 +588,6 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
>>                       error_type);
>>           p += err_info->length;
>>       }
>
> and here, for the case that memory failure is not done, as PA is invalid.
>
>
>> -
>> -    return queued;
>>   }
>>     /*
>> @@ -654,7 +685,7 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
>>       schedule_work(&entry->work);
>>   }
>>   -static bool ghes_do_proc(struct ghes *ghes,
>> +static void ghes_do_proc(struct ghes *ghes,
>>                const struct acpi_hest_generic_status *estatus)
>>   {
>>       int sev, sec_sev;
>> @@ -662,7 +693,6 @@ static bool ghes_do_proc(struct ghes *ghes,
>>       guid_t *sec_type;
>>       const guid_t *fru_id = &guid_null;
>>       char *fru_text = "";
>> -    bool queued = false;
>>       bool sync = is_hest_sync_notify(ghes);
>>         sev = ghes_severity(estatus->error_severity);
>> @@ -681,13 +711,13 @@ static bool ghes_do_proc(struct ghes *ghes,
>>               atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);
>>                 arch_apei_report_mem_error(sev, mem_err);
>> -            queued = ghes_handle_memory_failure(gdata, sev, sync);
>> +            ghes_handle_memory_failure(gdata, sev, sync);
>>           }
>>           else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
>>               ghes_handle_aer(gdata);
>>           }
>>           else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
>> -            queued = ghes_handle_arm_hw_error(gdata, sev, sync);
>> +            ghes_handle_arm_hw_error(gdata, sev, sync);
>>           } else {
>>               void *err = acpi_hest_get_payload(gdata);
>>   @@ -697,8 +727,6 @@ static bool ghes_do_proc(struct ghes *ghes,
>>                              gdata->error_data_length);
>>           }
>>       }
>> -
>> -    return queued;
>>   }
>>     static void __ghes_print_estatus(const char *pfx,
>> @@ -1000,9 +1028,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>       struct ghes_estatus_node *estatus_node;
>>       struct acpi_hest_generic *generic;
>>       struct acpi_hest_generic_status *estatus;
>> -    bool task_work_pending;
>>       u32 len, node_len;
>> -    int ret;
>>         llnode = llist_del_all(&ghes_estatus_llist);
>>       /*
>> @@ -1017,25 +1043,14 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>           estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>           len = cper_estatus_len(estatus);
>>           node_len = GHES_ESTATUS_NODE_LEN(len);
>> -        task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
>> +        ghes_do_proc(estatus_node->ghes, estatus);
>>           if (!ghes_estatus_cached(estatus)) {
>>               generic = estatus_node->generic;
>>               if (ghes_print_estatus(NULL, generic, estatus))
>>                   ghes_estatus_cache_add(generic, estatus);
>>           }
>> -
>> -        if (task_work_pending && current->mm) {
>> -            estatus_node->task_work.func = ghes_kick_task_work;
>> -            estatus_node->task_work_cpu = smp_processor_id();
>> -            ret = task_work_add(current, &estatus_node->task_work,
>> -                        TWA_RESUME);
>> -            if (ret)
>> -                estatus_node->task_work.func = NULL;
>> -        }
>> -
>> -        if (!estatus_node->task_work.func)
>> -            gen_pool_free(ghes_estatus_pool,
>> -                      (unsigned long)estatus_node, node_len);
>> +        gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
>> +                  node_len);
>>             llnode = next;
>>       }
>> @@ -1096,7 +1111,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
>>         estatus_node->ghes = ghes;
>>       estatus_node->generic = ghes->generic;
>> -    estatus_node->task_work.func = NULL;
>>       estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>         if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
>> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
>> index 3c8bba9f1114..e5e0c308d27f 100644
>> --- a/include/acpi/ghes.h
>> +++ b/include/acpi/ghes.h
>> @@ -35,9 +35,6 @@ struct ghes_estatus_node {
>>       struct llist_node llnode;
>>       struct acpi_hest_generic *generic;
>>       struct ghes *ghes;
>> -
>> -    int task_work_cpu;
>> -    struct callback_head task_work;
>>   };
>>     struct ghes_estatus_cache {
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index fae9baf3be16..6ea8c325acb3 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -2355,19 +2355,6 @@ static void memory_failure_work_func(struct work_struct *work)
>>       }
>>   }
>>   -/*
>> - * Process memory_failure work queued on the specified CPU.
>> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
>> - */
>> -void memory_failure_queue_kick(int cpu)
>> -{
>> -    struct memory_failure_cpu *mf_cpu;
>> -
>> -    mf_cpu = &per_cpu(memory_failure_cpu, cpu);
>> -    cancel_work_sync(&mf_cpu->work);
>> -    memory_failure_work_func(&mf_cpu->work);
>> -}
>> -
>>   static int __init memory_failure_init(void)
>>   {
>>       struct memory_failure_cpu *mf_cpu;

2023-04-08 09:15:00

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v4 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

There are two major types of uncorrected recoverable (UCR) errors:

- Action Required (AR): The error is detected and the processor has already
consumed the memory. The OS is required to take action (for example,
offline the failing page or kill the failing thread) to recover from this
uncorrectable error.

- Action Optional (AO): The error is detected outside of the processor
execution context. Some data in memory are corrupted, but the data have
not been consumed yet. The OS may optionally take action to recover from
this uncorrectable error.

The essential difference between AR and AO errors is that AR is a
synchronous event, while AO is an asynchronous event. The hardware will
signal a synchronous exception (Machine Check Exception on X86 and
Synchronous External Abort on Arm64) when an error is detected and the
memory access has been architecturally executed.

When APEI firmware-first is enabled, a platform may describe one error
source for the handling of synchronous errors (e.g. MCE or SEA
notification), or for handling asynchronous errors (e.g. SCI or External
Interrupt notification). In other words, we can distinguish synchronous
errors by the APEI notification type. For AR errors, the kernel will kill
the current process accessing the poisoned page by sending SIGBUS with
BUS_MCEERR_AR. In addition, for AO errors, the kernel will notify the process
that owns the poisoned page by sending SIGBUS with BUS_MCEERR_AO in early
kill mode. However, the GHES driver always sets mf_flags to 0, so all UCR
errors are handled as AO errors in memory_failure().

To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
events.

Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")
Signed-off-by: Shuai Xue <[email protected]>
Tested-by: Ma Wupeng <[email protected]>
---
drivers/acpi/apei/ghes.c | 29 +++++++++++++++++++++++------
1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 34ad071a64e9..c479b85899f5 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
}

+/*
+ * A platform may describe one error source for the handling of synchronous
+ * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
+ * or External Interrupt). On x86, the HEST notifications are always
+ * asynchronous, so only SEA on ARM is delivered as a synchronous
+ * notification.
+ */
+static inline bool is_hest_sync_notify(struct ghes *ghes)
+{
+ u8 notify_type = ghes->generic->notify.type;
+
+ return notify_type == ACPI_HEST_NOTIFY_SEA;
+}
+
/*
* This driver isn't really modular, however for the time being,
* continuing to use module_param is the easiest way to remain
@@ -477,7 +491,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
}

static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
- int sev)
+ int sev, bool sync)
{
int flags = -1;
int sec_sev = ghes_severity(gdata->error_severity);
@@ -491,7 +505,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
(gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
flags = MF_SOFT_OFFLINE;
if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
- flags = 0;
+ flags = sync ? MF_ACTION_REQUIRED : 0;

if (flags != -1)
return ghes_do_memory_failure(mem_err->physical_addr, flags);
@@ -499,9 +513,11 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
return false;
}

-static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
+static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
+ int sev, bool sync)
{
struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
+ int flags = sync ? MF_ACTION_REQUIRED : 0;
bool queued = false;
int sec_sev, i;
char *p;
@@ -526,7 +542,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
* and don't filter out 'corrected' error here.
*/
if (is_cache && has_pa) {
- queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
+ queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
p += err_info->length;
continue;
}
@@ -647,6 +663,7 @@ static bool ghes_do_proc(struct ghes *ghes,
const guid_t *fru_id = &guid_null;
char *fru_text = "";
bool queued = false;
+ bool sync = is_hest_sync_notify(ghes);

sev = ghes_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
@@ -664,13 +681,13 @@ static bool ghes_do_proc(struct ghes *ghes,
atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);

arch_apei_report_mem_error(sev, mem_err);
- queued = ghes_handle_memory_failure(gdata, sev);
+ queued = ghes_handle_memory_failure(gdata, sev, sync);
}
else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
ghes_handle_aer(gdata);
}
else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
- queued = ghes_handle_arm_hw_error(gdata, sev);
+ queued = ghes_handle_arm_hw_error(gdata, sev, sync);
} else {
void *err = acpi_hest_get_payload(gdata);

--
2.20.1.12.g72788fdb

2023-04-08 09:15:31

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v4 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure errors such as invalid PA,
unexpected severity, OOM, etc.
- pick up Tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter
- Link: https://lore.kernel.org/lkml/[email protected]/

changes since v1:
- synchronous events by notify type
- Link: https://lore.kernel.org/lkml/[email protected]/

Currently, both synchronous and asynchronous errors are queued and handled
by a dedicated kthread in a workqueue. Memory failure for a synchronous
error is synced by a cancel_work_sync trick which ensures that the
corrupted page is unmapped and poisoned. After returning to user-space,
the task restarts at the current instruction, which triggers a page fault in
which the kernel sends SIGBUS to the current process due to VM_FAULT_HWPOISON.

However, the memory failure recovery for hwpoison-aware mechanisms does not
work as expected. For example, hwpoison-aware user-space processes like
QEMU register their customized SIGBUS handler and enable early kill mode by
setting PF_MCE_EARLY at initialization. Then the kernel will directly notify
the process by sending a SIGBUS signal in memory failure with the wrong
si_code: BUS_MCEERR_AO is delivered to the actual user-space process instead
of BUS_MCEERR_AR.
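
For context, a minimal sketch of such an hwpoison-aware process (illustrative
user-space code, not taken from QEMU): it installs a SIGBUS handler with
SA_SIGINFO, opts in to early kill via prctl(PR_MCE_KILL, ...), and inspects
si_code to tell BUS_MCEERR_AR from BUS_MCEERR_AO.

#include <signal.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <unistd.h>

#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4	/* action required: poison consumed by this task */
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5	/* action optional: poison reported asynchronously */
#endif

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	if (si->si_code == BUS_MCEERR_AR)
		fprintf(stderr, "AR: poison consumed at %p\n", si->si_addr);
	else if (si->si_code == BUS_MCEERR_AO)
		fprintf(stderr, "AO: poison reported at %p, not consumed yet\n",
			si->si_addr);
	_exit(1);
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction	= sigbus_handler,
		.sa_flags	= SA_SIGINFO,
	};

	sigaction(SIGBUS, &sa, NULL);
	/* Opt in to early kill (PF_MCE_EARLY) so AO errors are delivered too. */
	prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);

	pause();	/* wait for an injected error, e.g. from einj_mem_uc */
	return 0;
}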

To address this problem:

- PATCH 1 sets mf_flags as MF_ACTION_REQUIRED on synchronous events, which
indicates the error happened in the current execution context
- PATCH 2 separates synchronous error handling into task work so that the
current context in memory failure exactly belongs to the task
consuming the poisoned data.

Then, the kernel will send SIGBUS with the proper si_code in kill_proc().

Lv Ying and XiuQi also proposed to address a similar problem, and we discussed
a new solution of adding a new flag (acpi_hest_generic_data::flags bit 8) to
distinguish synchronous events. [2][3] The UEFI community still has no response.
After a deep dive into the SDEI TRM, the SDEI notification should be used for
asynchronous errors. As the SDEI TRM[1] describes, "the dispatcher can simulate an
exception-like entry into the client, **with the client providing an additional
asynchronous entry point similar to an interrupt entry point**". The client
(kernel) lacks complete synchronous context, e.g. system registers (ELR, ESR,
etc.). So the notification type is enough to distinguish synchronous events.

To reproduce this problem:

# STEP1: enable early kill mode
#sysctl -w vm.memory_failure_early_kill=1
vm.memory_failure_early_kill = 1

# STEP2: inject an UCE error and consume it to trigger a synchronous error
#einj_mem_uc single
0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
injecting ...
triggering ...
signal 7 code 5 addr 0xffffb0d75000
page not present
Test passed

The si_code (code 5) from einj_mem_uc indicates that it is a BUS_MCEERR_AO error,
which is not the fact.

After this patch set:

# STEP1: enable early kill mode
#sysctl -w vm.memory_failure_early_kill=1
vm.memory_failure_early_kill = 1

# STEP2: inject an UCE error and consume it to trigger a synchronous error
#einj_mem_uc single
0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
injecting ...
triggering ...
signal 7 code 4 addr 0xffffb0d75000
page not present
Test passed

The si_code (code 4) from einj_mem_uc indicates that it is a BUS_MCEERR_AR error,
as we expected.

[1] https://developer.arm.com/documentation/den0054/latest/
[2] https://lore.kernel.org/linux-arm-kernel/[email protected]/T/
[3] https://lore.kernel.org/lkml/[email protected]/

Shuai Xue (2):
ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
synchronous events
ACPI: APEI: handle synchronous exceptions in task work

drivers/acpi/apei/ghes.c | 120 +++++++++++++++++++++++++++------------
include/acpi/ghes.h | 3 -
mm/memory-failure.c | 13 -----
3 files changed, 84 insertions(+), 52 deletions(-)

--
2.20.1.12.g72788fdb

2023-04-08 09:16:01

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v4 2/2] ACPI: APEI: handle synchronous exceptions in task work

Hardware errors could be signaled by asynchronous interrupt, e.g. when an
error is detected by a background scrubber, or signaled by synchronous
exception, e.g. when an uncorrected error is consumed. Both synchronous and
asynchronous errors are queued and handled by a dedicated kthread in a
workqueue.

commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
synchronous errors") keeps track of whether memory_failure() work was
queued, and makes task_work pending to flush out the workqueue so that the
work for a synchronous error is processed before returning to user-space.
This trick ensures that the corrupted page is unmapped and poisoned. After
returning to user-space, the task restarts at the faulting instruction,
which triggers a page fault, and the kernel then sends SIGBUS to the current
process due to VM_FAULT_HWPOISON.

However, memory failure recovery for hwpoison-aware mechanisms does not
work as expected. For example, hwpoison-aware user-space processes such as
QEMU register their own SIGBUS handler and enable early kill mode by
setting PF_MCE_EARLY at initialization. Then the kernel notifies the
process directly by sending a SIGBUS signal from memory failure handling,
but with the wrong si_code: the user-space process is the one accessing the
corrupt memory location, yet its memory failure work is handled in a
kthread context, so kill_proc() sends SIGBUS with the BUS_MCEERR_AO si_code
to that process instead of BUS_MCEERR_AR.

To this end, separate synchronous and asynchronous error handling into
different paths, as the x86 platform does:

- valid synchronous errors: queue a task_work to synchronously send SIGBUS
  before ret_to_user.
- valid asynchronous errors: queue a work into the workqueue to
  asynchronously handle memory failure.
- abnormal branches such as invalid PA, unexpected severity, no memory
  failure config support, invalid GUID section, OOM, etc.: do a force kill
  (SIGBUS) of the current task.

Then, for valid synchronous errors, the context in memory_failure() is
exactly the task consuming the poisoned data, and it will send SIGBUS with
the proper si_code.

Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
Signed-off-by: Shuai Xue <[email protected]>
Tested-by: Ma Wupeng <[email protected]>
---
drivers/acpi/apei/ghes.c | 91 +++++++++++++++++++++++++++-------------
include/acpi/ghes.h | 3 --
mm/memory-failure.c | 13 ------
3 files changed, 61 insertions(+), 46 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index c479b85899f5..df5574264d1b 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -452,28 +452,51 @@ static void ghes_clear_estatus(struct ghes *ghes,
}

/*
- * Called as task_work before returning to user-space.
- * Ensure any queued work has been done before we return to the context that
- * triggered the notification.
+ * struct sync_task_work - for synchronous RAS event
+ *
+ * @twork: callback_head for task work
+ * @pfn: page frame number of corrupted page
+ * @flags: fine tune action taken
+ *
+ * Structure to pass task work to be handled before
+ * ret_to_user via task_work_add().
*/
-static void ghes_kick_task_work(struct callback_head *head)
+struct sync_task_work {
+ struct callback_head twork;
+ u64 pfn;
+ int flags;
+};
+
+static void memory_failure_cb(struct callback_head *twork)
{
- struct acpi_hest_generic_status *estatus;
- struct ghes_estatus_node *estatus_node;
- u32 node_len;
+ int ret;
+ struct sync_task_work *twcb =
+ container_of(twork, struct sync_task_work, twork);

- estatus_node = container_of(head, struct ghes_estatus_node, task_work);
- if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
- memory_failure_queue_kick(estatus_node->task_work_cpu);
+ ret = memory_failure(twcb->pfn, twcb->flags);
+ kfree(twcb);

- estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
- node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
- gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
+ if (!ret)
+ return;
+
+ /*
+ * -EHWPOISON from memory_failure() means that it already sent SIGBUS
+ * to the current process with the proper error info,
+ * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
+ *
+ * In both cases, no further processing is required.
+ */
+ if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
+ return;
+
+ pr_err("Memory error not recovered");
+ force_sig(SIGBUS);
}

static bool ghes_do_memory_failure(u64 physical_addr, int flags)
{
unsigned long pfn;
+ struct sync_task_work *twcb;

if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
return false;
@@ -486,6 +509,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
return false;
}

+ if (flags == MF_ACTION_REQUIRED && current->mm) {
+ twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
+ if (!twcb)
+ return false;
+
+ twcb->pfn = pfn;
+ twcb->flags = flags;
+ init_task_work(&twcb->twork, memory_failure_cb);
+ task_work_add(current, &twcb->twork, TWA_RESUME);
+ return true;
+ }
+
memory_failure_queue(pfn, flags);
return true;
}
@@ -1000,9 +1035,8 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
struct ghes_estatus_node *estatus_node;
struct acpi_hest_generic *generic;
struct acpi_hest_generic_status *estatus;
- bool task_work_pending;
+ bool queued;
u32 len, node_len;
- int ret;

llnode = llist_del_all(&ghes_estatus_llist);
/*
@@ -1017,25 +1051,23 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
len = cper_estatus_len(estatus);
node_len = GHES_ESTATUS_NODE_LEN(len);
- task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
+
+ queued = ghes_do_proc(estatus_node->ghes, estatus);
+ /*
+ * No memory failure work is queued into work queue or task queue
+ * due to invalid PA, unexpected severity, OOM, etc, do a force
+ * kill.
+ */
+ if (!queued && current->mm)
+ force_sig(SIGBUS);
+
if (!ghes_estatus_cached(estatus)) {
generic = estatus_node->generic;
if (ghes_print_estatus(NULL, generic, estatus))
ghes_estatus_cache_add(generic, estatus);
}
-
- if (task_work_pending && current->mm) {
- estatus_node->task_work.func = ghes_kick_task_work;
- estatus_node->task_work_cpu = smp_processor_id();
- ret = task_work_add(current, &estatus_node->task_work,
- TWA_RESUME);
- if (ret)
- estatus_node->task_work.func = NULL;
- }
-
- if (!estatus_node->task_work.func)
- gen_pool_free(ghes_estatus_pool,
- (unsigned long)estatus_node, node_len);
+ gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
+ node_len);

llnode = next;
}
@@ -1096,7 +1128,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,

estatus_node->ghes = ghes;
estatus_node->generic = ghes->generic;
- estatus_node->task_work.func = NULL;
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);

if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 3c8bba9f1114..e5e0c308d27f 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -35,9 +35,6 @@ struct ghes_estatus_node {
struct llist_node llnode;
struct acpi_hest_generic *generic;
struct ghes *ghes;
-
- int task_work_cpu;
- struct callback_head task_work;
};

struct ghes_estatus_cache {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fae9baf3be16..6ea8c325acb3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2355,19 +2355,6 @@ static void memory_failure_work_func(struct work_struct *work)
}
}

-/*
- * Process memory_failure work queued on the specified CPU.
- * Used to avoid return-to-userspace racing with the memory_failure workqueue.
- */
-void memory_failure_queue_kick(int cpu)
-{
- struct memory_failure_cpu *mf_cpu;
-
- mf_cpu = &per_cpu(memory_failure_cpu, cpu);
- cancel_work_sync(&mf_cpu->work);
- memory_failure_work_func(&mf_cpu->work);
-}
-
static int __init memory_failure_init(void)
{
struct memory_failure_cpu *mf_cpu;
--
2.20.1.12.g72788fdb

2023-04-11 02:00:11

by Xiaofei Tan

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ACPI: APEI: handle synchronous exceptions in task work


Hi Shuai,

On 2023/4/8 17:13, Shuai Xue wrote:
> Hardware errors could be signaled by synchronous interrupt, e.g. when an
> error is detected by a background scrubber, or signaled by synchronous
> exception, e.g. when an uncorrected error is consumed. Both synchronous and
> asynchronous error are queued and handled by a dedicated kthread in
> workqueue.
>
> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
> synchronous errors") keep track of whether memory_failure() work was
> queued, and make task_work pending to flush out the workqueue so that the
> work for synchronous error is processed before returning to user-space.
> The trick ensures that the corrupted page is unmapped and poisoned. And
> after returning to user-space, the task starts at current instruction which
> triggering a page fault in which kernel will send SIGBUS to current process
> due to VM_FAULT_HWPOISON.
>
> However, the memory failure recovery for hwpoison-aware mechanisms does not
> work as expected. For example, hwpoison-aware user-space processes like
> QEMU register their customized SIGBUS handler and enable early kill mode by
> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
> the process by sending a SIGBUS signal in memory failure with wrong
> si_code: the actual user-space process accessing the corrupt memory
> location, but its memory failure work is handled in a kthread context, so
> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
> process instead of BUS_MCEERR_AR in kill_proc().
>
> To this end, separate synchronous and asynchronous error handling into
> different paths like X86 platform does:
>
> - valid synchronous errors: queue a task_work to synchronously send SIGBUS
> before ret_to_user.
> - valid asynchronous errors: queue a work into workqueue to asynchronously
> handle memory failure.
> - abnormal branches such as invalid PA, unexpected severity, no memory
> failure config support, invalid GUID section, OOM, etc.
>
> Then for valid synchronous errors, the current context in memory failure is
> exactly belongs to the task consuming poison data and it will send SIBBUS
> with proper si_code.
>
> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
> Signed-off-by: Shuai Xue <[email protected]>
> Tested-by: Ma Wupeng <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 91 +++++++++++++++++++++++++++-------------
> include/acpi/ghes.h | 3 --
> mm/memory-failure.c | 13 ------
> 3 files changed, 61 insertions(+), 46 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index c479b85899f5..df5574264d1b 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -452,28 +452,51 @@ static void ghes_clear_estatus(struct ghes *ghes,
> }
>
> /*
> - * Called as task_work before returning to user-space.
> - * Ensure any queued work has been done before we return to the context that
> - * triggered the notification.
> + * struct sync_task_work - for synchronous RAS event
> + *
> + * @twork: callback_head for task work
> + * @pfn: page frame number of corrupted page
> + * @flags: fine tune action taken
> + *
> + * Structure to pass task work to be handled before
> + * ret_to_user via task_work_add().
> */
> -static void ghes_kick_task_work(struct callback_head *head)
> +struct sync_task_work {
> + struct callback_head twork;
> + u64 pfn;
> + int flags;
> +};
> +
> +static void memory_failure_cb(struct callback_head *twork)
> {
> - struct acpi_hest_generic_status *estatus;
> - struct ghes_estatus_node *estatus_node;
> - u32 node_len;
> + int ret;
> + struct sync_task_work *twcb =
> + container_of(twork, struct sync_task_work, twork);
>
> - estatus_node = container_of(head, struct ghes_estatus_node, task_work);
> - if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
> - memory_failure_queue_kick(estatus_node->task_work_cpu);
> + ret = memory_failure(twcb->pfn, twcb->flags);
> + kfree(twcb);
>
> - estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> - node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
> - gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
> + if (!ret)
> + return;
> +
> + /*
> + * -EHWPOISON from memory_failure() means that it already sent SIGBUS
> + * to the current process with the proper error info,
> + * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
> + *
> + * In both cases, no further processing is required.
> + */
> + if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
> + return;
> +
> + pr_err("Memory error not recovered");
> + force_sig(SIGBUS);
> }
>
> static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> {
> unsigned long pfn;
> + struct sync_task_work *twcb;
>
> if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
> return false;
> @@ -486,6 +509,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> return false;
> }
>
> + if (flags == MF_ACTION_REQUIRED && current->mm) {
> + twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
> + if (!twcb)
> + return false;
> +
> + twcb->pfn = pfn;
> + twcb->flags = flags;
> + init_task_work(&twcb->twork, memory_failure_cb);
> + task_work_add(current, &twcb->twork, TWA_RESUME);
> + return true;
> + }
> +
> memory_failure_queue(pfn, flags);
> return true;
> }
> @@ -1000,9 +1035,8 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
> struct ghes_estatus_node *estatus_node;
> struct acpi_hest_generic *generic;
> struct acpi_hest_generic_status *estatus;
> - bool task_work_pending;
> + bool queued;
> u32 len, node_len;
> - int ret;
>
> llnode = llist_del_all(&ghes_estatus_llist);
> /*
> @@ -1017,25 +1051,23 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
> estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> len = cper_estatus_len(estatus);
> node_len = GHES_ESTATUS_NODE_LEN(len);
> - task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
> +
> + queued = ghes_do_proc(estatus_node->ghes, estatus);
> + /*
> + * No memory failure work is queued into work queue or task queue
> + * due to invalid PA, unexpected severity, OOM, etc, do a force
> + * kill.
> + */
> + if (!queued && current->mm)
> + force_sig(SIGBUS);

The SIGBUS needs to be sent to current only for synchronous exceptions, and
the condition in this if statement does not guarantee that. The function
ghes_proc_in_irq() is used for NMI-like notifications, but those are not only
used for synchronous exceptions: one user, SEA, is a synchronous exception,
while other users, such as SDEI, may not be.

You could pass the sync flag out from ghes_do_proc() and check it here, or
change the meaning of the ghes_do_proc() return value to "recovered".


> +
> if (!ghes_estatus_cached(estatus)) {
> generic = estatus_node->generic;
> if (ghes_print_estatus(NULL, generic, estatus))
> ghes_estatus_cache_add(generic, estatus);
> }
> -
> - if (task_work_pending && current->mm) {
> - estatus_node->task_work.func = ghes_kick_task_work;
> - estatus_node->task_work_cpu = smp_processor_id();
> - ret = task_work_add(current, &estatus_node->task_work,
> - TWA_RESUME);
> - if (ret)
> - estatus_node->task_work.func = NULL;
> - }
> -
> - if (!estatus_node->task_work.func)
> - gen_pool_free(ghes_estatus_pool,
> - (unsigned long)estatus_node, node_len);
> + gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
> + node_len);
>
> llnode = next;
> }
> @@ -1096,7 +1128,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
>
> estatus_node->ghes = ghes;
> estatus_node->generic = ghes->generic;
> - estatus_node->task_work.func = NULL;
> estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>
> if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index 3c8bba9f1114..e5e0c308d27f 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -35,9 +35,6 @@ struct ghes_estatus_node {
> struct llist_node llnode;
> struct acpi_hest_generic *generic;
> struct ghes *ghes;
> -
> - int task_work_cpu;
> - struct callback_head task_work;
> };
>
> struct ghes_estatus_cache {
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fae9baf3be16..6ea8c325acb3 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2355,19 +2355,6 @@ static void memory_failure_work_func(struct work_struct *work)
> }
> }
>
> -/*
> - * Process memory_failure work queued on the specified CPU.
> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
> - */
> -void memory_failure_queue_kick(int cpu)
> -{
> - struct memory_failure_cpu *mf_cpu;
> -
> - mf_cpu = &per_cpu(memory_failure_cpu, cpu);
> - cancel_work_sync(&mf_cpu->work);
> - memory_failure_work_func(&mf_cpu->work);
> -}
> -
> static int __init memory_failure_init(void)
> {
> struct memory_failure_cpu *mf_cpu;

2023-04-11 03:25:05

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ACPI: APEI: handle synchronous exceptions in task work



On 2023/4/11 AM9:44, Xiaofei Tan wrote:
>
> Hi Shuai,
>
> On 2023/4/8 17:13, Shuai Xue wrote:
>> Hardware errors could be signaled by synchronous interrupt, e.g.  when an
>> error is detected by a background scrubber, or signaled by synchronous
>> exception, e.g. when an uncorrected error is consumed. Both synchronous and
>> asynchronous error are queued and handled by a dedicated kthread in
>> workqueue.
>>
>> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
>> synchronous errors") keep track of whether memory_failure() work was
>> queued, and make task_work pending to flush out the workqueue so that the
>> work for synchronous error is processed before returning to user-space.
>> The trick ensures that the corrupted page is unmapped and poisoned. And
>> after returning to user-space, the task starts at current instruction which
>> triggering a page fault in which kernel will send SIGBUS to current process
>> due to VM_FAULT_HWPOISON.
>>
>> However, the memory failure recovery for hwpoison-aware mechanisms does not
>> work as expected. For example, hwpoison-aware user-space processes like
>> QEMU register their customized SIGBUS handler and enable early kill mode by
>> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
>> the process by sending a SIGBUS signal in memory failure with wrong
>> si_code: the actual user-space process accessing the corrupt memory
>> location, but its memory failure work is handled in a kthread context, so
>> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
>> process instead of BUS_MCEERR_AR in kill_proc().
>>
>> To this end, separate synchronous and asynchronous error handling into
>> different paths like X86 platform does:
>>
>> - valid synchronous errors: queue a task_work to synchronously send SIGBUS
>>    before ret_to_user.
>> - valid asynchronous errors: queue a work into workqueue to asynchronously
>>    handle memory failure.
>> - abnormal branches such as invalid PA, unexpected severity, no memory
>>    failure config support, invalid GUID section, OOM, etc.
>>
>> Then for valid synchronous errors, the current context in memory failure is
>> exactly belongs to the task consuming poison data and it will send SIBBUS
>> with proper si_code.
>>
>> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
>> Signed-off-by: Shuai Xue <[email protected]>
>> Tested-by: Ma Wupeng <[email protected]>
>> ---
>>   drivers/acpi/apei/ghes.c | 91 +++++++++++++++++++++++++++-------------
>>   include/acpi/ghes.h      |  3 --
>>   mm/memory-failure.c      | 13 ------
>>   3 files changed, 61 insertions(+), 46 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index c479b85899f5..df5574264d1b 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -452,28 +452,51 @@ static void ghes_clear_estatus(struct ghes *ghes,
>>   }
>>     /*
>> - * Called as task_work before returning to user-space.
>> - * Ensure any queued work has been done before we return to the context that
>> - * triggered the notification.
>> + * struct sync_task_work - for synchronous RAS event
>> + *
>> + * @twork:                callback_head for task work
>> + * @pfn:                  page frame number of corrupted page
>> + * @flags:                fine tune action taken
>> + *
>> + * Structure to pass task work to be handled before
>> + * ret_to_user via task_work_add().
>>    */
>> -static void ghes_kick_task_work(struct callback_head *head)
>> +struct sync_task_work {
>> +    struct callback_head twork;
>> +    u64 pfn;
>> +    int flags;
>> +};
>> +
>> +static void memory_failure_cb(struct callback_head *twork)
>>   {
>> -    struct acpi_hest_generic_status *estatus;
>> -    struct ghes_estatus_node *estatus_node;
>> -    u32 node_len;
>> +    int ret;
>> +    struct sync_task_work *twcb =
>> +        container_of(twork, struct sync_task_work, twork);
>>   -    estatus_node = container_of(head, struct ghes_estatus_node, task_work);
>> -    if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>> -        memory_failure_queue_kick(estatus_node->task_work_cpu);
>> +    ret = memory_failure(twcb->pfn, twcb->flags);
>> +    kfree(twcb);
>>   -    estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>> -    node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
>> -    gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
>> +    if (!ret)
>> +        return;
>> +
>> +    /*
>> +     * -EHWPOISON from memory_failure() means that it already sent SIGBUS
>> +     * to the current process with the proper error info,
>> +     * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
>> +     *
>> +     * In both cases, no further processing is required.
>> +     */
>> +    if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
>> +        return;
>> +
>> +    pr_err("Memory error not recovered");
>> +    force_sig(SIGBUS);
>>   }
>>     static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>   {
>>       unsigned long pfn;
>> +    struct sync_task_work *twcb;
>>         if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>>           return false;
>> @@ -486,6 +509,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>           return false;
>>       }
>>   +    if (flags == MF_ACTION_REQUIRED && current->mm) {
>> +        twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
>> +        if (!twcb)
>> +            return false;
>> +
>> +        twcb->pfn = pfn;
>> +        twcb->flags = flags;
>> +        init_task_work(&twcb->twork, memory_failure_cb);
>> +        task_work_add(current, &twcb->twork, TWA_RESUME);
>> +        return true;
>> +    }
>> +
>>       memory_failure_queue(pfn, flags);
>>       return true;
>>   }
>> @@ -1000,9 +1035,8 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>       struct ghes_estatus_node *estatus_node;
>>       struct acpi_hest_generic *generic;
>>       struct acpi_hest_generic_status *estatus;
>> -    bool task_work_pending;
>> +    bool queued;
>>       u32 len, node_len;
>> -    int ret;
>>         llnode = llist_del_all(&ghes_estatus_llist);
>>       /*
>> @@ -1017,25 +1051,23 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>           estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>           len = cper_estatus_len(estatus);
>>           node_len = GHES_ESTATUS_NODE_LEN(len);
>> -        task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
>> +
>> +        queued = ghes_do_proc(estatus_node->ghes, estatus);
>> +        /*
>> +         * No memory failure work is queued into work queue or task queue
>> +         * due to invalid PA, unexpected severity, OOM, etc, do a force
>> +         * kill.
>> +         */
>> +        if (!queued && current->mm)
>> +            force_sig(SIGBUS);
>
> The SIGBUS needs to be sent to the current only for synchronous exceptions. The judgment of this if statement does not guarantee this.
> Because the function ghes_proc_in_irq() is used for NMI, but NMI not only used for synchronous exception. One user SEA is synchronous
> exception, and some other users, such as SDEI, may be not synchronous exception.

Yes, you are right. I was going to handle the abnormal cases for both sync
and async errors, but sending SIGBUS to the current task for an asynchronous
error is totally wrong. Is it safe to keep running when an asynchronous
error is not handled?

And should we add some warning message for the abnormal cases,
e.g. pr_warn_ratelimited() on an invalid PA?

>
> You could transfer the sync flag out from ghes_do_proc() and judge it here, or change meaning of the ghes_do_proc() return value
> as if recovered.

I think we could get the sync flag from the estatus_node, e.g.:

	bool sync = is_hest_sync_notify(estatus_node->ghes);

Then the condition in the if statement should be:

	if (sync && !queued)

I dropped current->mm from the if statement. For sync errors, current is
guaranteed to be a user task; a kernel task hitting a sync error will panic
in do_sea(), the caller of ghes_proc_in_irq(). For async errors, SIGBUS to
current is meaningless.
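
To make the proposal concrete, a rough sketch of how the hunk in
ghes_proc_in_irq() could then look (illustrative only, using the names from
the patch above; not the final v5 change):

	bool sync = is_hest_sync_notify(estatus_node->ghes);

	queued = ghes_do_proc(estatus_node->ghes, estatus);
	/*
	 * Force-kill only when a synchronous error could not be queued
	 * (invalid PA, unexpected severity, OOM, ...); for asynchronous
	 * errors a SIGBUS to current would be meaningless.
	 */
	if (sync && !queued)
		force_sig(SIGBUS);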

Thank you.

Best Regards,
Shuai

>
>
>> +
>>           if (!ghes_estatus_cached(estatus)) {
>>               generic = estatus_node->generic;
>>               if (ghes_print_estatus(NULL, generic, estatus))
>>                   ghes_estatus_cache_add(generic, estatus);
>>           }
>> -
>> -        if (task_work_pending && current->mm) {
>> -            estatus_node->task_work.func = ghes_kick_task_work;
>> -            estatus_node->task_work_cpu = smp_processor_id();
>> -            ret = task_work_add(current, &estatus_node->task_work,
>> -                        TWA_RESUME);
>> -            if (ret)
>> -                estatus_node->task_work.func = NULL;
>> -        }
>> -
>> -        if (!estatus_node->task_work.func)
>> -            gen_pool_free(ghes_estatus_pool,
>> -                      (unsigned long)estatus_node, node_len);
>> +        gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
>> +                  node_len);
>>             llnode = next;
>>       }
>> @@ -1096,7 +1128,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
>>         estatus_node->ghes = ghes;
>>       estatus_node->generic = ghes->generic;
>> -    estatus_node->task_work.func = NULL;
>>       estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>         if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
>> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
>> index 3c8bba9f1114..e5e0c308d27f 100644
>> --- a/include/acpi/ghes.h
>> +++ b/include/acpi/ghes.h
>> @@ -35,9 +35,6 @@ struct ghes_estatus_node {
>>       struct llist_node llnode;
>>       struct acpi_hest_generic *generic;
>>       struct ghes *ghes;
>> -
>> -    int task_work_cpu;
>> -    struct callback_head task_work;
>>   };
>>     struct ghes_estatus_cache {
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index fae9baf3be16..6ea8c325acb3 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -2355,19 +2355,6 @@ static void memory_failure_work_func(struct work_struct *work)
>>       }
>>   }
>>   -/*
>> - * Process memory_failure work queued on the specified CPU.
>> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
>> - */
>> -void memory_failure_queue_kick(int cpu)
>> -{
>> -    struct memory_failure_cpu *mf_cpu;
>> -
>> -    mf_cpu = &per_cpu(memory_failure_cpu, cpu);
>> -    cancel_work_sync(&mf_cpu->work);
>> -    memory_failure_work_func(&mf_cpu->work);
>> -}
>> -
>>   static int __init memory_failure_init(void)
>>   {
>>       struct memory_failure_cpu *mf_cpu;

2023-04-11 09:06:14

by Xiaofei Tan

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ACPI: APEI: handle synchronous exceptions in task work


On 2023/4/11 11:16, Shuai Xue wrote:
>
> On 2023/4/11 AM9:44, Xiaofei Tan wrote:
>> Hi Shuai,
>>
>> On 2023/4/8 17:13, Shuai Xue wrote:
>>> Hardware errors could be signaled by synchronous interrupt, e.g.  when an
>>> error is detected by a background scrubber, or signaled by synchronous
>>> exception, e.g. when an uncorrected error is consumed. Both synchronous and
>>> asynchronous error are queued and handled by a dedicated kthread in
>>> workqueue.
>>>
>>> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
>>> synchronous errors") keep track of whether memory_failure() work was
>>> queued, and make task_work pending to flush out the workqueue so that the
>>> work for synchronous error is processed before returning to user-space.
>>> The trick ensures that the corrupted page is unmapped and poisoned. And
>>> after returning to user-space, the task starts at current instruction which
>>> triggering a page fault in which kernel will send SIGBUS to current process
>>> due to VM_FAULT_HWPOISON.
>>>
>>> However, the memory failure recovery for hwpoison-aware mechanisms does not
>>> work as expected. For example, hwpoison-aware user-space processes like
>>> QEMU register their customized SIGBUS handler and enable early kill mode by
>>> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
>>> the process by sending a SIGBUS signal in memory failure with wrong
>>> si_code: the actual user-space process accessing the corrupt memory
>>> location, but its memory failure work is handled in a kthread context, so
>>> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
>>> process instead of BUS_MCEERR_AR in kill_proc().
>>>
>>> To this end, separate synchronous and asynchronous error handling into
>>> different paths like X86 platform does:
>>>
>>> - valid synchronous errors: queue a task_work to synchronously send SIGBUS
>>>    before ret_to_user.
>>> - valid asynchronous errors: queue a work into workqueue to asynchronously
>>>    handle memory failure.
>>> - abnormal branches such as invalid PA, unexpected severity, no memory
>>>    failure config support, invalid GUID section, OOM, etc.
>>>
>>> Then for valid synchronous errors, the current context in memory failure is
>>> exactly belongs to the task consuming poison data and it will send SIBBUS
>>> with proper si_code.
>>>
>>> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
>>> Signed-off-by: Shuai Xue <[email protected]>
>>> Tested-by: Ma Wupeng <[email protected]>
>>> ---
>>>   drivers/acpi/apei/ghes.c | 91 +++++++++++++++++++++++++++-------------
>>>   include/acpi/ghes.h      |  3 --
>>>   mm/memory-failure.c      | 13 ------
>>>   3 files changed, 61 insertions(+), 46 deletions(-)
>>>
>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>> index c479b85899f5..df5574264d1b 100644
>>> --- a/drivers/acpi/apei/ghes.c
>>> +++ b/drivers/acpi/apei/ghes.c
>>> @@ -452,28 +452,51 @@ static void ghes_clear_estatus(struct ghes *ghes,
>>>   }
>>>     /*
>>> - * Called as task_work before returning to user-space.
>>> - * Ensure any queued work has been done before we return to the context that
>>> - * triggered the notification.
>>> + * struct sync_task_work - for synchronous RAS event
>>> + *
>>> + * @twork:                callback_head for task work
>>> + * @pfn:                  page frame number of corrupted page
>>> + * @flags:                fine tune action taken
>>> + *
>>> + * Structure to pass task work to be handled before
>>> + * ret_to_user via task_work_add().
>>>    */
>>> -static void ghes_kick_task_work(struct callback_head *head)
>>> +struct sync_task_work {
>>> +    struct callback_head twork;
>>> +    u64 pfn;
>>> +    int flags;
>>> +};
>>> +
>>> +static void memory_failure_cb(struct callback_head *twork)
>>>   {
>>> -    struct acpi_hest_generic_status *estatus;
>>> -    struct ghes_estatus_node *estatus_node;
>>> -    u32 node_len;
>>> +    int ret;
>>> +    struct sync_task_work *twcb =
>>> +        container_of(twork, struct sync_task_work, twork);
>>>   -    estatus_node = container_of(head, struct ghes_estatus_node, task_work);
>>> -    if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>>> -        memory_failure_queue_kick(estatus_node->task_work_cpu);
>>> +    ret = memory_failure(twcb->pfn, twcb->flags);
>>> +    kfree(twcb);
>>>   -    estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>> -    node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
>>> -    gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
>>> +    if (!ret)
>>> +        return;
>>> +
>>> +    /*
>>> +     * -EHWPOISON from memory_failure() means that it already sent SIGBUS
>>> +     * to the current process with the proper error info,
>>> +     * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
>>> +     *
>>> +     * In both cases, no further processing is required.
>>> +     */
>>> +    if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
>>> +        return;
>>> +
>>> +    pr_err("Memory error not recovered");
>>> +    force_sig(SIGBUS);
>>>   }
>>>     static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>>   {
>>>       unsigned long pfn;
>>> +    struct sync_task_work *twcb;
>>>         if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>>>           return false;
>>> @@ -486,6 +509,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>>           return false;
>>>       }
>>>   +    if (flags == MF_ACTION_REQUIRED && current->mm) {
>>> +        twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
>>> +        if (!twcb)
>>> +            return false;
>>> +
>>> +        twcb->pfn = pfn;
>>> +        twcb->flags = flags;
>>> +        init_task_work(&twcb->twork, memory_failure_cb);
>>> +        task_work_add(current, &twcb->twork, TWA_RESUME);
>>> +        return true;
>>> +    }
>>> +
>>>       memory_failure_queue(pfn, flags);
>>>       return true;
>>>   }
>>> @@ -1000,9 +1035,8 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>>       struct ghes_estatus_node *estatus_node;
>>>       struct acpi_hest_generic *generic;
>>>       struct acpi_hest_generic_status *estatus;
>>> -    bool task_work_pending;
>>> +    bool queued;
>>>       u32 len, node_len;
>>> -    int ret;
>>>         llnode = llist_del_all(&ghes_estatus_llist);
>>>       /*
>>> @@ -1017,25 +1051,23 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>>           estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>>           len = cper_estatus_len(estatus);
>>>           node_len = GHES_ESTATUS_NODE_LEN(len);
>>> -        task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
>>> +
>>> +        queued = ghes_do_proc(estatus_node->ghes, estatus);
>>> +        /*
>>> +         * No memory failure work is queued into work queue or task queue
>>> +         * due to invalid PA, unexpected severity, OOM, etc, do a force
>>> +         * kill.
>>> +         */
>>> +        if (!queued && current->mm)
>>> +            force_sig(SIGBUS);
>> The SIGBUS needs to be sent to the current only for synchronous exceptions. The judgment of this if statement does not guarantee this.
>> Because the function ghes_proc_in_irq() is used for NMI, but NMI not only used for synchronous exception. One user SEA is synchronous
>> exception, and some other users, such as SDEI, may be not synchronous exception.
> Yes, you are right. I was going to handle abnormal cases for sync error
> and async error. But SIGBUS sent to the current task for an asynchronous
> error is totally wrong.

yes

> Is it safe to keep running when an asynchronous
> error is not handled?

I think so. Corrupt data should not be consumed silently; that should be
guaranteed by the chip platform. If the platform can't guarantee this, it
will still not be 100% safe even if we panic the system here on receiving an
uncorrected memory error section.


>
> And should we add some warning message in abnormal cases?
> e.g pr_warn_ratelimited on invalid PA?

Do you mean here? It is not needed, as ghes_print_estatus() already includes this info.

    if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
        return false;

>> You could transfer the sync flag out from ghes_do_proc() and judge it here, or change meaning of the ghes_do_proc() return value
>> as if recovered.
> I think we could get sync flag by estatus_node, e.g:
>
> bool sync = is_hest_sync_notify(estatus_node->ghes);

It's ok for me.

>
> Then the condition in if statement should be:
>
> if (sync && !queued)
>
> I drop out current->mm from if statement. For sync errors, the current
> is guaranteed to be in user task, kernel task for sync error will panic
> in do_sea(), the caller of ghes_proc_in_irq(). For async errors, SIGBUS
> to current is meaningless.

OK. It is correct for ARM SEA; if we want to support more synchronous
notification types, that should be considered in the future.

>
> Thank you.
>
> Best Regards,
> Shuai
>
>>
>>> +
>>>           if (!ghes_estatus_cached(estatus)) {
>>>               generic = estatus_node->generic;
>>>               if (ghes_print_estatus(NULL, generic, estatus))
>>>                   ghes_estatus_cache_add(generic, estatus);
>>>           }
>>> -
>>> -        if (task_work_pending && current->mm) {
>>> -            estatus_node->task_work.func = ghes_kick_task_work;
>>> -            estatus_node->task_work_cpu = smp_processor_id();
>>> -            ret = task_work_add(current, &estatus_node->task_work,
>>> -                        TWA_RESUME);
>>> -            if (ret)
>>> -                estatus_node->task_work.func = NULL;
>>> -        }
>>> -
>>> -        if (!estatus_node->task_work.func)
>>> -            gen_pool_free(ghes_estatus_pool,
>>> -                      (unsigned long)estatus_node, node_len);
>>> +        gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
>>> +                  node_len);
>>>             llnode = next;
>>>       }
>>> @@ -1096,7 +1128,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
>>>         estatus_node->ghes = ghes;
>>>       estatus_node->generic = ghes->generic;
>>> -    estatus_node->task_work.func = NULL;
>>>       estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>>         if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
>>> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
>>> index 3c8bba9f1114..e5e0c308d27f 100644
>>> --- a/include/acpi/ghes.h
>>> +++ b/include/acpi/ghes.h
>>> @@ -35,9 +35,6 @@ struct ghes_estatus_node {
>>>       struct llist_node llnode;
>>>       struct acpi_hest_generic *generic;
>>>       struct ghes *ghes;
>>> -
>>> -    int task_work_cpu;
>>> -    struct callback_head task_work;
>>>   };
>>>     struct ghes_estatus_cache {
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index fae9baf3be16..6ea8c325acb3 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -2355,19 +2355,6 @@ static void memory_failure_work_func(struct work_struct *work)
>>>       }
>>>   }
>>>   -/*
>>> - * Process memory_failure work queued on the specified CPU.
>>> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
>>> - */
>>> -void memory_failure_queue_kick(int cpu)
>>> -{
>>> -    struct memory_failure_cpu *mf_cpu;
>>> -
>>> -    mf_cpu = &per_cpu(memory_failure_cpu, cpu);
>>> -    cancel_work_sync(&mf_cpu->work);
>>> -    memory_failure_work_func(&mf_cpu->work);
>>> -}
>>> -
>>>   static int __init memory_failure_init(void)
>>>   {
>>>       struct memory_failure_cpu *mf_cpu;
> .

2023-04-11 09:56:08

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ACPI: APEI: handle synchronous exceptions in task work



On 2023/4/11 PM5:02, Xiaofei Tan wrote:
>
> On 2023/4/11 11:16, Shuai Xue wrote:
>>
>> On 2023/4/11 AM9:44, Xiaofei Tan wrote:
>>> Hi Shuai,
>>>
>>> On 2023/4/8 17:13, Shuai Xue wrote:
>>>> Hardware errors could be signaled by synchronous interrupt, e.g.  when an
>>>> error is detected by a background scrubber, or signaled by synchronous
>>>> exception, e.g. when an uncorrected error is consumed. Both synchronous and
>>>> asynchronous error are queued and handled by a dedicated kthread in
>>>> workqueue.
>>>>
>>>> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
>>>> synchronous errors") keep track of whether memory_failure() work was
>>>> queued, and make task_work pending to flush out the workqueue so that the
>>>> work for synchronous error is processed before returning to user-space.
>>>> The trick ensures that the corrupted page is unmapped and poisoned. And
>>>> after returning to user-space, the task starts at current instruction which
>>>> triggering a page fault in which kernel will send SIGBUS to current process
>>>> due to VM_FAULT_HWPOISON.
>>>>
>>>> However, the memory failure recovery for hwpoison-aware mechanisms does not
>>>> work as expected. For example, hwpoison-aware user-space processes like
>>>> QEMU register their customized SIGBUS handler and enable early kill mode by
>>>> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
>>>> the process by sending a SIGBUS signal in memory failure with wrong
>>>> si_code: the actual user-space process accessing the corrupt memory
>>>> location, but its memory failure work is handled in a kthread context, so
>>>> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
>>>> process instead of BUS_MCEERR_AR in kill_proc().
>>>>
>>>> To this end, separate synchronous and asynchronous error handling into
>>>> different paths like X86 platform does:
>>>>
>>>> - valid synchronous errors: queue a task_work to synchronously send SIGBUS
>>>>     before ret_to_user.
>>>> - valid asynchronous errors: queue a work into workqueue to asynchronously
>>>>     handle memory failure.
>>>> - abnormal branches such as invalid PA, unexpected severity, no memory
>>>>     failure config support, invalid GUID section, OOM, etc.
>>>>
>>>> Then for valid synchronous errors, the current context in memory failure is
>>>> exactly belongs to the task consuming poison data and it will send SIBBUS
>>>> with proper si_code.
>>>>
>>>> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
>>>> Signed-off-by: Shuai Xue <[email protected]>
>>>> Tested-by: Ma Wupeng <[email protected]>
>>>> ---
>>>>    drivers/acpi/apei/ghes.c | 91 +++++++++++++++++++++++++++-------------
>>>>    include/acpi/ghes.h      |  3 --
>>>>    mm/memory-failure.c      | 13 ------
>>>>    3 files changed, 61 insertions(+), 46 deletions(-)
>>>>
>>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>>> index c479b85899f5..df5574264d1b 100644
>>>> --- a/drivers/acpi/apei/ghes.c
>>>> +++ b/drivers/acpi/apei/ghes.c
>>>> @@ -452,28 +452,51 @@ static void ghes_clear_estatus(struct ghes *ghes,
>>>>    }
>>>>      /*
>>>> - * Called as task_work before returning to user-space.
>>>> - * Ensure any queued work has been done before we return to the context that
>>>> - * triggered the notification.
>>>> + * struct sync_task_work - for synchronous RAS event
>>>> + *
>>>> + * @twork:                callback_head for task work
>>>> + * @pfn:                  page frame number of corrupted page
>>>> + * @flags:                fine tune action taken
>>>> + *
>>>> + * Structure to pass task work to be handled before
>>>> + * ret_to_user via task_work_add().
>>>>     */
>>>> -static void ghes_kick_task_work(struct callback_head *head)
>>>> +struct sync_task_work {
>>>> +    struct callback_head twork;
>>>> +    u64 pfn;
>>>> +    int flags;
>>>> +};
>>>> +
>>>> +static void memory_failure_cb(struct callback_head *twork)
>>>>    {
>>>> -    struct acpi_hest_generic_status *estatus;
>>>> -    struct ghes_estatus_node *estatus_node;
>>>> -    u32 node_len;
>>>> +    int ret;
>>>> +    struct sync_task_work *twcb =
>>>> +        container_of(twork, struct sync_task_work, twork);
>>>>    -    estatus_node = container_of(head, struct ghes_estatus_node, task_work);
>>>> -    if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>>>> -        memory_failure_queue_kick(estatus_node->task_work_cpu);
>>>> +    ret = memory_failure(twcb->pfn, twcb->flags);
>>>> +    kfree(twcb);
>>>>    -    estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>>> -    node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
>>>> -    gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
>>>> +    if (!ret)
>>>> +        return;
>>>> +
>>>> +    /*
>>>> +     * -EHWPOISON from memory_failure() means that it already sent SIGBUS
>>>> +     * to the current process with the proper error info,
>>>> +     * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
>>>> +     *
>>>> +     * In both cases, no further processing is required.
>>>> +     */
>>>> +    if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
>>>> +        return;
>>>> +
>>>> +    pr_err("Memory error not recovered");
>>>> +    force_sig(SIGBUS);
>>>>    }
>>>>      static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>>>    {
>>>>        unsigned long pfn;
>>>> +    struct sync_task_work *twcb;
>>>>          if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>>>>            return false;
>>>> @@ -486,6 +509,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>>>            return false;
>>>>        }
>>>>    +    if (flags == MF_ACTION_REQUIRED && current->mm) {
>>>> +        twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
>>>> +        if (!twcb)
>>>> +            return false;
>>>> +
>>>> +        twcb->pfn = pfn;
>>>> +        twcb->flags = flags;
>>>> +        init_task_work(&twcb->twork, memory_failure_cb);
>>>> +        task_work_add(current, &twcb->twork, TWA_RESUME);
>>>> +        return true;
>>>> +    }
>>>> +
>>>>        memory_failure_queue(pfn, flags);
>>>>        return true;
>>>>    }
>>>> @@ -1000,9 +1035,8 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>>>        struct ghes_estatus_node *estatus_node;
>>>>        struct acpi_hest_generic *generic;
>>>>        struct acpi_hest_generic_status *estatus;
>>>> -    bool task_work_pending;
>>>> +    bool queued;
>>>>        u32 len, node_len;
>>>> -    int ret;
>>>>          llnode = llist_del_all(&ghes_estatus_llist);
>>>>        /*
>>>> @@ -1017,25 +1051,23 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>>>            estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>>>            len = cper_estatus_len(estatus);
>>>>            node_len = GHES_ESTATUS_NODE_LEN(len);
>>>> -        task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
>>>> +
>>>> +        queued = ghes_do_proc(estatus_node->ghes, estatus);
>>>> +        /*
>>>> +         * No memory failure work is queued into work queue or task queue
>>>> +         * due to invalid PA, unexpected severity, OOM, etc, do a force
>>>> +         * kill.
>>>> +         */
>>>> +        if (!queued && current->mm)
>>>> +            force_sig(SIGBUS);
>>> The SIGBUS needs to be sent to the current only for synchronous exceptions. The judgment of this if statement does not guarantee this.
>>> Because the function ghes_proc_in_irq() is used for NMI, but NMI not only used for synchronous exception. One user SEA is synchronous
>>> exception, and some other users, such as SDEI, may be not synchronous exception.
>> Yes, you are right. I was going to handle abnormal cases for sync error
>> and async error. But SIGBUS sent to the current task for an asynchronous
>> error is totally wrong.
>
> yes
>
>> Is it safe to keep running when an asynchronous
>> error is not handled?
>
> I think so. Corrupt data should not be consumed silently. It should be guaranteed by Chip platorm.
> If platform can't support this, it will still not be 100% safe even we panic the system here, once received
> uncorrected memory error section.
>
>
>>
>> And should we add some warning message in abnormal cases?
>> e.g pr_warn_ratelimited on invalid PA?
>
> Do you mean here ? it is not needed, as ghes_print_estatus() has included this info.
>
>     if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>         return false;
>
>>> You could transfer the sync flag out from ghes_do_proc() and judge it here, or change meaning of the ghes_do_proc() return value
>>> as if recovered.
>> I think we could get sync flag by estatus_node, e.g:
>>
>>     bool sync = is_hest_sync_notify(estatus_node->ghes);
>
> It's ok for me.
>
>>
>> Then the condition in if statement should be:
>>
>>     if (sync && !queued)
>>
>> I drop out current->mm from if statement. For sync errors, the current
>> is guaranteed to be in user task, kernel task for sync error will panic
>> in do_sea(), the caller of ghes_proc_in_irq(). For async errors, SIGBUS
>> to current is meaningless.
>
> OK. It is correct for ARM SEA, if want to support more sync notify type,
> should consider in the future.


Thanks for the confirmation. I will send a new version later.

Best Regards,
Shuai

>
>>
>> Thank you.
>>
>> Best Regards,
>> Shuai
>>
>>>
>>>> +
>>>>            if (!ghes_estatus_cached(estatus)) {
>>>>                generic = estatus_node->generic;
>>>>                if (ghes_print_estatus(NULL, generic, estatus))
>>>>                    ghes_estatus_cache_add(generic, estatus);
>>>>            }
>>>> -
>>>> -        if (task_work_pending && current->mm) {
>>>> -            estatus_node->task_work.func = ghes_kick_task_work;
>>>> -            estatus_node->task_work_cpu = smp_processor_id();
>>>> -            ret = task_work_add(current, &estatus_node->task_work,
>>>> -                        TWA_RESUME);
>>>> -            if (ret)
>>>> -                estatus_node->task_work.func = NULL;
>>>> -        }
>>>> -
>>>> -        if (!estatus_node->task_work.func)
>>>> -            gen_pool_free(ghes_estatus_pool,
>>>> -                      (unsigned long)estatus_node, node_len);
>>>> +        gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
>>>> +                  node_len);
>>>>              llnode = next;
>>>>        }
>>>> @@ -1096,7 +1128,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
>>>>          estatus_node->ghes = ghes;
>>>>        estatus_node->generic = ghes->generic;
>>>> -    estatus_node->task_work.func = NULL;
>>>>        estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>>>          if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
>>>> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
>>>> index 3c8bba9f1114..e5e0c308d27f 100644
>>>> --- a/include/acpi/ghes.h
>>>> +++ b/include/acpi/ghes.h
>>>> @@ -35,9 +35,6 @@ struct ghes_estatus_node {
>>>>        struct llist_node llnode;
>>>>        struct acpi_hest_generic *generic;
>>>>        struct ghes *ghes;
>>>> -
>>>> -    int task_work_cpu;
>>>> -    struct callback_head task_work;
>>>>    };
>>>>      struct ghes_estatus_cache {
>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>> index fae9baf3be16..6ea8c325acb3 100644
>>>> --- a/mm/memory-failure.c
>>>> +++ b/mm/memory-failure.c
>>>> @@ -2355,19 +2355,6 @@ static void memory_failure_work_func(struct work_struct *work)
>>>>        }
>>>>    }
>>>>    -/*
>>>> - * Process memory_failure work queued on the specified CPU.
>>>> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
>>>> - */
>>>> -void memory_failure_queue_kick(int cpu)
>>>> -{
>>>> -    struct memory_failure_cpu *mf_cpu;
>>>> -
>>>> -    mf_cpu = &per_cpu(memory_failure_cpu, cpu);
>>>> -    cancel_work_sync(&mf_cpu->work);
>>>> -    memory_failure_work_func(&mf_cpu->work);
>>>> -}
>>>> -
>>>>    static int __init memory_failure_init(void)
>>>>    {
>>>>        struct memory_failure_cpu *mf_cpu;
>> .

2023-04-11 10:49:56

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v5 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors
- Link: https://lore.kernel.org/lkml/[email protected]/

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure errors such as invalid PA,
unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng
- Link: https://lore.kernel.org/lkml/[email protected]/

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter
- Link: https://lore.kernel.org/lkml/[email protected]/

changes since v1:
- distinguish synchronous events by notify type
- Link: https://lore.kernel.org/lkml/[email protected]/

Currently, both synchronous and asynchronous errors are queued and handled
by a dedicated kthread in a workqueue, and memory failure for a synchronous
error is synced by a cancel_work_sync trick that ensures the corrupted page
is unmapped and poisoned. After returning to user-space, the task restarts
at the faulting instruction, which triggers a page fault, and the kernel
then sends SIGBUS to the current process due to VM_FAULT_HWPOISON.

However, memory failure recovery for hwpoison-aware mechanisms does not
work as expected. For example, hwpoison-aware user-space processes such as
QEMU register their own SIGBUS handler and enable early kill mode by
setting PF_MCE_EARLY at initialization. Then the kernel notifies the
process directly by sending a SIGBUS signal from memory failure handling,
but with the wrong si_code: BUS_MCEERR_AO is delivered to the user-space
process that actually consumed the poison, instead of BUS_MCEERR_AR.

To address this problem:

- PATCH 1 sets mf_flags as MF_ACTION_REQUIRED on synchronous events, which
  indicates that the error happened in the current execution context
- PATCH 2 separates synchronous error handling into task work so that the
  context in memory_failure() is exactly the task consuming the poisoned
  data

Then, the kernel will send SIGBUS with the proper si_code in kill_proc().

Lv Ying and XiuQi also proposed to address a similar problem, and we
discussed a new solution that adds a new flag (acpi_hest_generic_data::flags
bit 8) to distinguish synchronous events. [2][3] The UEFI community has not
responded yet. After a deep dive into the SDEI TRM, the SDEI notification
should be used for asynchronous errors: as the SDEI TRM[1] describes, "the
dispatcher can simulate an exception-like entry into the client, **with the
client providing an additional asynchronous entry point similar to an
interrupt entry point**". The client (kernel) lacks the complete synchronous
context, e.g. system registers (ELR, ESR, etc.). So the notification type is
enough to distinguish synchronous events.

To reproduce this problem:

# STEP1: enable early kill mode
#sysctl -w vm.memory_failure_early_kill=1
vm.memory_failure_early_kill = 1

# STEP2: inject an UCE error and consume it to trigger a synchronous error
#einj_mem_uc single
0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
injecting ...
triggering ...
signal 7 code 5 addr 0xffffb0d75000
page not present
Test passed

The si_code (code 5) reported by einj_mem_uc indicates a BUS_MCEERR_AO error,
which does not match what actually happened (the poison was consumed in the
current context).

After this patch set:

# STEP1: enable early kill mode
#sysctl -w vm.memory_failure_early_kill=1
vm.memory_failure_early_kill = 1

# STEP2: inject an UCE error and consume it to trigger a synchronous error
#einj_mem_uc single
0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
injecting ...
triggering ...
signal 7 code 4 addr 0xffffb0d75000
page not present
Test passed

The si_code (code 4) reported by einj_mem_uc indicates a BUS_MCEERR_AR error,
as expected.

[1] https://developer.arm.com/documentation/den0054/latest/
[2] https://lore.kernel.org/linux-arm-kernel/[email protected]/T/
[3] https://lore.kernel.org/lkml/[email protected]/

Shuai Xue (2):
ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
synchronous events
ACPI: APEI: handle synchronous exceptions in task work

drivers/acpi/apei/ghes.c | 120 +++++++++++++++++++++++++++------------
include/acpi/ghes.h | 3 -
mm/memory-failure.c | 13 -----
3 files changed, 84 insertions(+), 52 deletions(-)

--
2.20.1.12.g72788fdb

2023-04-11 10:50:07

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v5 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

There are two major types of uncorrected recoverable (UCR) errors:

- Action Required (AR): The error is detected and the processor has already
consumed the memory. The OS is required to take action (for example,
off-lining the failed page or killing the failing thread) to recover from
this uncorrectable error.

- Action Optional (AO): The error is detected out of processor execution
context. Some data in the memory is corrupted, but the data has not been
consumed. The OS may optionally take action to recover from this
uncorrectable error.

The essential difference between AR and AO errors is that AR is a
synchronous event, while AO is an asynchronous event. The hardware will
signal a synchronous exception (Machine Check Exception on X86 and
Synchronous External Abort on Arm64) when an error is detected and the
memory access has been architecturally executed.

When APEI firmware-first is enabled, a platform may describe one error
source for handling synchronous errors (e.g. MCE or SEA notification), or
for handling asynchronous errors (e.g. SCI or External Interrupt
notification). In other words, we can distinguish synchronous errors by the
APEI notification type. For AR errors, the kernel kills the current process
accessing the poisoned page by sending SIGBUS with BUS_MCEERR_AR. In
addition, for AO errors, the kernel notifies the process that owns the
poisoned page by sending SIGBUS with BUS_MCEERR_AO in early kill mode.
However, the GHES driver always sets mf_flags to 0, so all UCR errors are
handled as AO errors in memory_failure().

To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
events.

Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")
Signed-off-by: Shuai Xue <[email protected]>
Tested-by: Ma Wupeng <[email protected]>
---
drivers/acpi/apei/ghes.c | 29 +++++++++++++++++++++++------
1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 34ad071a64e9..c479b85899f5 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
}

+/*
+ * A platform may describe one error source for the handling of synchronous
+ * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
+ * or External Interrupt). On x86, the HEST notifications are always
+ * asynchronous, so only SEA on ARM is delivered as a synchronous
+ * notification.
+ */
+static inline bool is_hest_sync_notify(struct ghes *ghes)
+{
+ u8 notify_type = ghes->generic->notify.type;
+
+ return notify_type == ACPI_HEST_NOTIFY_SEA;
+}
+
/*
* This driver isn't really modular, however for the time being,
* continuing to use module_param is the easiest way to remain
@@ -477,7 +491,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
}

static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
- int sev)
+ int sev, bool sync)
{
int flags = -1;
int sec_sev = ghes_severity(gdata->error_severity);
@@ -491,7 +505,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
(gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
flags = MF_SOFT_OFFLINE;
if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
- flags = 0;
+ flags = sync ? MF_ACTION_REQUIRED : 0;

if (flags != -1)
return ghes_do_memory_failure(mem_err->physical_addr, flags);
@@ -499,9 +513,11 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
return false;
}

-static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
+static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
+ int sev, bool sync)
{
struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
+ int flags = sync ? MF_ACTION_REQUIRED : 0;
bool queued = false;
int sec_sev, i;
char *p;
@@ -526,7 +542,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
* and don't filter out 'corrected' error here.
*/
if (is_cache && has_pa) {
- queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
+ queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
p += err_info->length;
continue;
}
@@ -647,6 +663,7 @@ static bool ghes_do_proc(struct ghes *ghes,
const guid_t *fru_id = &guid_null;
char *fru_text = "";
bool queued = false;
+ bool sync = is_hest_sync_notify(ghes);

sev = ghes_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
@@ -664,13 +681,13 @@ static bool ghes_do_proc(struct ghes *ghes,
atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);

arch_apei_report_mem_error(sev, mem_err);
- queued = ghes_handle_memory_failure(gdata, sev);
+ queued = ghes_handle_memory_failure(gdata, sev, sync);
}
else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
ghes_handle_aer(gdata);
}
else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
- queued = ghes_handle_arm_hw_error(gdata, sev);
+ queued = ghes_handle_arm_hw_error(gdata, sev, sync);
} else {
void *err = acpi_hest_get_payload(gdata);

--
2.20.1.12.g72788fdb

2023-04-11 10:50:53

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v5 2/2] ACPI: APEI: handle synchronous exceptions in task work

Hardware errors can be signaled by an asynchronous interrupt, e.g. when an
error is detected by a background scrubber, or by a synchronous exception,
e.g. when an uncorrected error is consumed. Currently, both synchronous and
asynchronous errors are queued and handled by a dedicated kthread in a
workqueue.

Commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
synchronous errors") keeps track of whether memory_failure() work was
queued, and makes task_work pending to flush out the workqueue so that the
work for a synchronous error is processed before returning to user-space.
The trick ensures that the corrupted page is unmapped and poisoned. After
returning to user-space, the task restarts at the current instruction,
which triggers a page fault in which the kernel will send SIGBUS to the
current process due to VM_FAULT_HWPOISON.

However, memory failure recovery for hwpoison-aware mechanisms does not
work as expected. For example, hwpoison-aware user-space processes such as
QEMU register a customized SIGBUS handler and enable early kill mode by
setting PF_MCE_EARLY at initialization. The kernel will then directly
notify the process by sending a SIGBUS signal from memory failure, but with
the wrong si_code: the error is consumed by the actual user-space process,
yet because its memory failure work is handled in a kthread context,
kill_proc() sends SIGBUS with the BUS_MCEERR_AO si_code to that process
instead of BUS_MCEERR_AR.

To this end, separate synchronous and asynchronous error handling into
different paths, as the X86 platform does:

- valid synchronous errors: queue a task_work to synchronously send SIGBUS
before ret_to_user.
- valid asynchronous errors: queue a work into the workqueue to
asynchronously handle memory failure.
- abnormal branches such as invalid PA, unexpected severity, no memory
failure config support, invalid GUID section, OOM, etc.: for synchronous
errors, do a force kill.

Then, for valid synchronous errors, the current context in memory failure
exactly belongs to the task consuming the poison data, and it will send
SIGBUS with the proper si_code.

Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
Signed-off-by: Shuai Xue <[email protected]>
Tested-by: Ma Wupeng <[email protected]>
---
drivers/acpi/apei/ghes.c | 91 +++++++++++++++++++++++++++-------------
include/acpi/ghes.h | 3 --
mm/memory-failure.c | 13 ------
3 files changed, 61 insertions(+), 46 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index c479b85899f5..4b70955e25f9 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -452,28 +452,51 @@ static void ghes_clear_estatus(struct ghes *ghes,
}

/*
- * Called as task_work before returning to user-space.
- * Ensure any queued work has been done before we return to the context that
- * triggered the notification.
+ * struct sync_task_work - for synchronous RAS event
+ *
+ * @twork: callback_head for task work
+ * @pfn: page frame number of corrupted page
+ * @flags: fine tune action taken
+ *
+ * Structure to pass task work to be handled before
+ * ret_to_user via task_work_add().
*/
-static void ghes_kick_task_work(struct callback_head *head)
+struct sync_task_work {
+ struct callback_head twork;
+ u64 pfn;
+ int flags;
+};
+
+static void memory_failure_cb(struct callback_head *twork)
{
- struct acpi_hest_generic_status *estatus;
- struct ghes_estatus_node *estatus_node;
- u32 node_len;
+ int ret;
+ struct sync_task_work *twcb =
+ container_of(twork, struct sync_task_work, twork);

- estatus_node = container_of(head, struct ghes_estatus_node, task_work);
- if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
- memory_failure_queue_kick(estatus_node->task_work_cpu);
+ ret = memory_failure(twcb->pfn, twcb->flags);
+ kfree(twcb);

- estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
- node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
- gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
+ if (!ret)
+ return;
+
+ /*
+ * -EHWPOISON from memory_failure() means that it already sent SIGBUS
+ * to the current process with the proper error info,
+ * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
+ *
+ * In both cases, no further processing is required.
+ */
+ if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
+ return;
+
+ pr_err("Memory error not recovered");
+ force_sig(SIGBUS);
}

static bool ghes_do_memory_failure(u64 physical_addr, int flags)
{
unsigned long pfn;
+ struct sync_task_work *twcb;

if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
return false;
@@ -486,6 +509,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
return false;
}

+ if (flags == MF_ACTION_REQUIRED && current->mm) {
+ twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
+ if (!twcb)
+ return false;
+
+ twcb->pfn = pfn;
+ twcb->flags = flags;
+ init_task_work(&twcb->twork, memory_failure_cb);
+ task_work_add(current, &twcb->twork, TWA_RESUME);
+ return true;
+ }
+
memory_failure_queue(pfn, flags);
return true;
}
@@ -1000,9 +1035,8 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
struct ghes_estatus_node *estatus_node;
struct acpi_hest_generic *generic;
struct acpi_hest_generic_status *estatus;
- bool task_work_pending;
+ bool queued, sync;
u32 len, node_len;
- int ret;

llnode = llist_del_all(&ghes_estatus_llist);
/*
@@ -1015,27 +1049,25 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
estatus_node = llist_entry(llnode, struct ghes_estatus_node,
llnode);
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
+ sync = is_hest_sync_notify(estatus_node->ghes);
len = cper_estatus_len(estatus);
node_len = GHES_ESTATUS_NODE_LEN(len);
- task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
+
+ queued = ghes_do_proc(estatus_node->ghes, estatus);
+ /*
+ * If no memory failure work is queued for abnormal synchronous
+ * errors, do a force kill.
+ */
+ if (sync && !queued)
+ force_sig(SIGBUS);
+
if (!ghes_estatus_cached(estatus)) {
generic = estatus_node->generic;
if (ghes_print_estatus(NULL, generic, estatus))
ghes_estatus_cache_add(generic, estatus);
}
-
- if (task_work_pending && current->mm) {
- estatus_node->task_work.func = ghes_kick_task_work;
- estatus_node->task_work_cpu = smp_processor_id();
- ret = task_work_add(current, &estatus_node->task_work,
- TWA_RESUME);
- if (ret)
- estatus_node->task_work.func = NULL;
- }
-
- if (!estatus_node->task_work.func)
- gen_pool_free(ghes_estatus_pool,
- (unsigned long)estatus_node, node_len);
+ gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
+ node_len);

llnode = next;
}
@@ -1096,7 +1128,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,

estatus_node->ghes = ghes;
estatus_node->generic = ghes->generic;
- estatus_node->task_work.func = NULL;
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);

if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 3c8bba9f1114..e5e0c308d27f 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -35,9 +35,6 @@ struct ghes_estatus_node {
struct llist_node llnode;
struct acpi_hest_generic *generic;
struct ghes *ghes;
-
- int task_work_cpu;
- struct callback_head task_work;
};

struct ghes_estatus_cache {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fae9baf3be16..6ea8c325acb3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2355,19 +2355,6 @@ static void memory_failure_work_func(struct work_struct *work)
}
}

-/*
- * Process memory_failure work queued on the specified CPU.
- * Used to avoid return-to-userspace racing with the memory_failure workqueue.
- */
-void memory_failure_queue_kick(int cpu)
-{
- struct memory_failure_cpu *mf_cpu;
-
- mf_cpu = &per_cpu(memory_failure_cpu, cpu);
- cancel_work_sync(&mf_cpu->work);
- memory_failure_work_func(&mf_cpu->work);
-}
-
static int __init memory_failure_init(void)
{
struct memory_failure_cpu *mf_cpu;
--
2.20.1.12.g72788fdb

2023-04-11 14:18:58

by Kefeng Wang

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

Hi Shuai Xue,

On 2023/4/11 18:48, Shuai Xue wrote:
> There are two major types of uncorrected recoverable (UCR) errors :
>
> - Action Required (AR): The error is detected and the processor already
> consumes the memory. OS requires to take action (for example, offline
> failure page/kill failure thread) to recover this uncorrectable error.
>
> - Action Optional (AO): The error is detected out of processor execution
> context. Some data in the memory are corrupted. But the data have not
> been consumed. OS is optional to take action to recover this
> uncorrectable error.
>
> The essential difference between AR and AO errors is that AR is a
> synchronous event, while AO is an asynchronous event. The hardware will
> signal a synchronous exception (Machine Check Exception on X86 and
> Synchronous External Abort on Arm64) when an error is detected and the
> memory access has been architecturally executed.
>
> When APEI firmware first is enabled, a platform may describe one error
> source for the handling of synchronous errors (e.g. MCE or SEA notification
> ), or for handling asynchronous errors (e.g. SCI or External Interrupt
> notification). In other words, we can distinguish synchronous errors by
> APEI notification. For AR errors, kernel will kill current process
> accessing the poisoned page by sending SIGBUS with BUS_MCEERR_AR. In
> addition, for AO errors, kernel will notify the process who owns the
> poisoned page by sending SIGBUS with BUS_MCEERR_AO in early kill mode.
> However, the GHES driver always sets mf_flags to 0 so that all UCR errors
> are handled as AO errors in memory failure.
>
> To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
> events.

As you mentioned in the cover letter, we met the same issue and hope it
could be fixed ASAP. This patch looks good to me,

Reviewed-by: Kefeng Wang <[email protected]>


>
> Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")'
> Signed-off-by: Shuai Xue <[email protected]>
> Tested-by: Ma Wupeng <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 29 +++++++++++++++++++++++------
> 1 file changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 34ad071a64e9..c479b85899f5 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
> return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
> }
>
> +/*
> + * A platform may describe one error source for the handling of synchronous
> + * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
> + * or External Interrupt). On x86, the HEST notifications are always
> + * asynchronous, so only SEA on ARM is delivered as a synchronous
> + * notification.
> + */
> +static inline bool is_hest_sync_notify(struct ghes *ghes)
> +{
> + u8 notify_type = ghes->generic->notify.type;
> +
> + return notify_type == ACPI_HEST_NOTIFY_SEA;
> +}
> +
> /*
> * This driver isn't really modular, however for the time being,
> * continuing to use module_param is the easiest way to remain
> @@ -477,7 +491,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> }
>
> static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> - int sev)
> + int sev, bool sync)
> {
> int flags = -1;
> int sec_sev = ghes_severity(gdata->error_severity);
> @@ -491,7 +505,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
> flags = MF_SOFT_OFFLINE;
> if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
> - flags = 0;
> + flags = sync ? MF_ACTION_REQUIRED : 0;
>
> if (flags != -1)
> return ghes_do_memory_failure(mem_err->physical_addr, flags);
> @@ -499,9 +513,11 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> return false;
> }
>
> -static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
> +static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
> + int sev, bool sync)
> {
> struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
> + int flags = sync ? MF_ACTION_REQUIRED : 0;
> bool queued = false;
> int sec_sev, i;
> char *p;
> @@ -526,7 +542,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
> * and don't filter out 'corrected' error here.
> */
> if (is_cache && has_pa) {
> - queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
> + queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
> p += err_info->length;
> continue;
> }
> @@ -647,6 +663,7 @@ static bool ghes_do_proc(struct ghes *ghes,
> const guid_t *fru_id = &guid_null;
> char *fru_text = "";
> bool queued = false;
> + bool sync = is_hest_sync_notify(ghes);
>
> sev = ghes_severity(estatus->error_severity);
> apei_estatus_for_each_section(estatus, gdata) {
> @@ -664,13 +681,13 @@ static bool ghes_do_proc(struct ghes *ghes,
> atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);
>
> arch_apei_report_mem_error(sev, mem_err);
> - queued = ghes_handle_memory_failure(gdata, sev);
> + queued = ghes_handle_memory_failure(gdata, sev, sync);
> }
> else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
> ghes_handle_aer(gdata);
> }
> else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
> - queued = ghes_handle_arm_hw_error(gdata, sev);
> + queued = ghes_handle_arm_hw_error(gdata, sev, sync);
> } else {
> void *err = acpi_hest_get_payload(gdata);
>

2023-04-11 14:36:34

by Kefeng Wang

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] ACPI: APEI: handle synchronous exceptions in task work



On 2023/4/11 18:48, Shuai Xue wrote:
> Hardware errors could be signaled by synchronous interrupt, e.g. when an
> error is detected by a background scrubber, or signaled by synchronous
> exception, e.g. when an uncorrected error is consumed. Both synchronous and
> asynchronous error are queued and handled by a dedicated kthread in
> workqueue.
>
> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
> synchronous errors") keep track of whether memory_failure() work was
> queued, and make task_work pending to flush out the workqueue so that the
> work for synchronous error is processed before returning to user-space.
> The trick ensures that the corrupted page is unmapped and poisoned. And
> after returning to user-space, the task starts at current instruction which
> triggering a page fault in which kernel will send SIGBUS to current process
> due to VM_FAULT_HWPOISON.
>
> However, the memory failure recovery for hwpoison-aware mechanisms does not
> work as expected. For example, hwpoison-aware user-space processes like
> QEMU register their customized SIGBUS handler and enable early kill mode by
> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
> the process by sending a SIGBUS signal in memory failure with wrong
> si_code: the actual user-space process accessing the corrupt memory
> location, but its memory failure work is handled in a kthread context, so
> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
> process instead of BUS_MCEERR_AR in kill_proc().
>
> To this end, separate synchronous and asynchronous error handling into
> different paths like X86 platform does:
>
> - valid synchronous errors: queue a task_work to synchronously send SIGBUS
> before ret_to_user.
> - valid asynchronous errors: queue a work into workqueue to asynchronously
> handle memory failure.
> - abnormal branches such as invalid PA, unexpected severity, no memory
> failure config support, invalid GUID section, OOM, etc.
>
> Then for valid synchronous errors, the current context in memory failure is
> exactly belongs to the task consuming poison data and it will send SIBBUS
> with proper si_code.
>
> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
> Signed-off-by: Shuai Xue <[email protected]>
> Tested-by: Ma Wupeng <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 91 +++++++++++++++++++++++++++-------------
> include/acpi/ghes.h | 3 --
> mm/memory-failure.c | 13 ------
> 3 files changed, 61 insertions(+), 46 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index c479b85899f5..4b70955e25f9 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -452,28 +452,51 @@ static void ghes_clear_estatus(struct ghes *ghes,
> }
>
> /*
> - * Called as task_work before returning to user-space.
> - * Ensure any queued work has been done before we return to the context that
> - * triggered the notification.
> + * struct sync_task_work - for synchronous RAS event
> + *
> + * @twork: callback_head for task work
> + * @pfn: page frame number of corrupted page
> + * @flags: fine tune action taken
> + *
> + * Structure to pass task work to be handled before
> + * ret_to_user via task_work_add().
> */
> -static void ghes_kick_task_work(struct callback_head *head)
> +struct sync_task_work {
> + struct callback_head twork;
> + u64 pfn;
> + int flags;
> +};
> +
> +static void memory_failure_cb(struct callback_head *twork)
> {
> - struct acpi_hest_generic_status *estatus;
> - struct ghes_estatus_node *estatus_node;
> - u32 node_len;
> + int ret;
> + struct sync_task_work *twcb =
> + container_of(twork, struct sync_task_work, twork);
>
> - estatus_node = container_of(head, struct ghes_estatus_node, task_work);
> - if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
> - memory_failure_queue_kick(estatus_node->task_work_cpu);
> + ret = memory_failure(twcb->pfn, twcb->flags);
> + kfree(twcb);
>
> - estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> - node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
> - gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
> + if (!ret)
> + return;
> +
> + /*
> + * -EHWPOISON from memory_failure() means that it already sent SIGBUS
> + * to the current process with the proper error info,

This should be part of the comments of function memory_failure(),

> + * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
> + *
and this part is already there
> + * In both cases, no further processing is required.
> + */
So, after that, I think we could drop this comment, along with the same
comment in x86's kill_me_maybe().

> + if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
> + return;
> +
> + pr_err("Memory error not recovered");
> + force_sig(SIGBUS);
> }
>
> static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> {
> unsigned long pfn;
> + struct sync_task_work *twcb;
>
> if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
> return false;
> @@ -486,6 +509,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> return false;
> }
>
> + if (flags == MF_ACTION_REQUIRED && current->mm) {
> + twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
> + if (!twcb)
> + return false;
> +
> + twcb->pfn = pfn;
> + twcb->flags = flags;
> + init_task_work(&twcb->twork, memory_failure_cb);
> + task_work_add(current, &twcb->twork, TWA_RESUME);
> + return true;
> + }
> +
> memory_failure_queue(pfn, flags);
> return true;
> }
> @@ -1000,9 +1035,8 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
> struct ghes_estatus_node *estatus_node;
> struct acpi_hest_generic *generic;
> struct acpi_hest_generic_status *estatus;
> - bool task_work_pending;
> + bool queued, sync;
> u32 len, node_len;
> - int ret;
>
> llnode = llist_del_all(&ghes_estatus_llist);
> /*
> @@ -1015,27 +1049,25 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
> estatus_node = llist_entry(llnode, struct ghes_estatus_node,
> llnode);
> estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> + sync = is_hest_sync_notify(estatus_node->ghes);
> len = cper_estatus_len(estatus);
> node_len = GHES_ESTATUS_NODE_LEN(len);
> - task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
> +
> + queued = ghes_do_proc(estatus_node->ghes, estatus) > + /*
> + * If no memory failure work is queued for abnormal synchronous
> + * errors, do a force kill.
> + */
> + if (sync && !queued)
> + force_sig(SIGBUS);

It's better to move this part into ghes_do_proc(), because there is
already an is_hest_sync_notify() there and the return value is no longer
needed, so ghes_do_proc() can become a void function. Apart from this,

Reviewed-by: Kefeng Wang <[email protected]>

> +
> if (!ghes_estatus_cached(estatus)) {
> generic = estatus_node->generic;
> if (ghes_print_estatus(NULL, generic, estatus))
> ghes_estatus_cache_add(generic, estatus);
> }
> -
> - if (task_work_pending && current->mm) {
> - estatus_node->task_work.func = ghes_kick_task_work;
> - estatus_node->task_work_cpu = smp_processor_id();
> - ret = task_work_add(current, &estatus_node->task_work,
> - TWA_RESUME);
> - if (ret)
> - estatus_node->task_work.func = NULL;
> - }
> -
> - if (!estatus_node->task_work.func)
> - gen_pool_free(ghes_estatus_pool,
> - (unsigned long)estatus_node, node_len);
> + gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
> + node_len);
>
> llnode = next;
> }
> @@ -1096,7 +1128,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
>
> estatus_node->ghes = ghes;
> estatus_node->generic = ghes->generic;
> - estatus_node->task_work.func = NULL;
> estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>
> if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index 3c8bba9f1114..e5e0c308d27f 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -35,9 +35,6 @@ struct ghes_estatus_node {
> struct llist_node llnode;
> struct acpi_hest_generic *generic;
> struct ghes *ghes;
> -
> - int task_work_cpu;
> - struct callback_head task_work;
> };
>
> struct ghes_estatus_cache {
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fae9baf3be16..6ea8c325acb3 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2355,19 +2355,6 @@ static void memory_failure_work_func(struct work_struct *work)
> }
> }
>
> -/*
> - * Process memory_failure work queued on the specified CPU.
> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
> - */
> -void memory_failure_queue_kick(int cpu)
> -{
> - struct memory_failure_cpu *mf_cpu;
> -
> - mf_cpu = &per_cpu(memory_failure_cpu, cpu);
> - cancel_work_sync(&mf_cpu->work);
> - memory_failure_work_func(&mf_cpu->work);
> -}
> -
> static int __init memory_failure_init(void)
> {
> struct memory_failure_cpu *mf_cpu;

2023-04-12 02:59:24

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events



On 2023/4/11 PM10:17, Kefeng Wang wrote:
> Hi Shuai Xue,
>
> On 2023/4/11 18:48, Shuai Xue wrote:
>> There are two major types of uncorrected recoverable (UCR) errors :
>>
>> - Action Required (AR): The error is detected and the processor already
>>    consumes the memory. OS requires to take action (for example, offline
>>    failure page/kill failure thread) to recover this uncorrectable error.
>>
>> - Action Optional (AO): The error is detected out of processor execution
>>    context. Some data in the memory are corrupted. But the data have not
>>    been consumed. OS is optional to take action to recover this
>>    uncorrectable error.
>>
>> The essential difference between AR and AO errors is that AR is a
>> synchronous event, while AO is an asynchronous event. The hardware will
>> signal a synchronous exception (Machine Check Exception on X86 and
>> Synchronous External Abort on Arm64) when an error is detected and the
>> memory access has been architecturally executed.
>>
>> When APEI firmware first is enabled, a platform may describe one error
>> source for the handling of synchronous errors (e.g. MCE or SEA notification
>> ), or for handling asynchronous errors (e.g. SCI or External Interrupt
>> notification). In other words, we can distinguish synchronous errors by
>> APEI notification. For AR errors, kernel will kill current process
>> accessing the poisoned page by sending SIGBUS with BUS_MCEERR_AR. In
>> addition, for AO errors, kernel will notify the process who owns the
>> poisoned page by sending SIGBUS with BUS_MCEERR_AO in early kill mode.
>> However, the GHES driver always sets mf_flags to 0 so that all UCR errors
>> are handled as AO errors in memory failure.
>>
>> To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
>> events.
>
> As your mentioned in cover-letter, we met same issue, and hope it could be fixed ASAP, this patch looks good to me,
>
> Reviewed-by: Kefeng Wang <[email protected]>

Thank you.

Cheers,
Shuai

>
>>
>> Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")'
>> Signed-off-by: Shuai Xue <[email protected]>
>> Tested-by: Ma Wupeng <[email protected]>
>> ---
>>   drivers/acpi/apei/ghes.c | 29 +++++++++++++++++++++++------
>>   1 file changed, 23 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 34ad071a64e9..c479b85899f5 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
>>       return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
>>   }
>>   +/*
>> + * A platform may describe one error source for the handling of synchronous
>> + * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
>> + * or External Interrupt). On x86, the HEST notifications are always
>> + * asynchronous, so only SEA on ARM is delivered as a synchronous
>> + * notification.
>> + */
>> +static inline bool is_hest_sync_notify(struct ghes *ghes)
>> +{
>> +    u8 notify_type = ghes->generic->notify.type;
>> +
>> +    return notify_type == ACPI_HEST_NOTIFY_SEA;
>> +}
>> +
>>   /*
>>    * This driver isn't really modular, however for the time being,
>>    * continuing to use module_param is the easiest way to remain
>> @@ -477,7 +491,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>   }
>>     static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>> -                       int sev)
>> +                       int sev, bool sync)
>>   {
>>       int flags = -1;
>>       int sec_sev = ghes_severity(gdata->error_severity);
>> @@ -491,7 +505,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>>           (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
>>           flags = MF_SOFT_OFFLINE;
>>       if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
>> -        flags = 0;
>> +        flags = sync ? MF_ACTION_REQUIRED : 0;
>>         if (flags != -1)
>>           return ghes_do_memory_failure(mem_err->physical_addr, flags);
>> @@ -499,9 +513,11 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>>       return false;
>>   }
>>   -static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
>> +static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
>> +                       int sev, bool sync)
>>   {
>>       struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
>> +    int flags = sync ? MF_ACTION_REQUIRED : 0;
>>       bool queued = false;
>>       int sec_sev, i;
>>       char *p;
>> @@ -526,7 +542,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
>>            * and don't filter out 'corrected' error here.
>>            */
>>           if (is_cache && has_pa) {
>> -            queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
>> +            queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
>>               p += err_info->length;
>>               continue;
>>           }
>> @@ -647,6 +663,7 @@ static bool ghes_do_proc(struct ghes *ghes,
>>       const guid_t *fru_id = &guid_null;
>>       char *fru_text = "";
>>       bool queued = false;
>> +    bool sync = is_hest_sync_notify(ghes);
>>         sev = ghes_severity(estatus->error_severity);
>>       apei_estatus_for_each_section(estatus, gdata) {
>> @@ -664,13 +681,13 @@ static bool ghes_do_proc(struct ghes *ghes,
>>               atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);
>>                 arch_apei_report_mem_error(sev, mem_err);
>> -            queued = ghes_handle_memory_failure(gdata, sev);
>> +            queued = ghes_handle_memory_failure(gdata, sev, sync);
>>           }
>>           else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
>>               ghes_handle_aer(gdata);
>>           }
>>           else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
>> -            queued = ghes_handle_arm_hw_error(gdata, sev);
>> +            queued = ghes_handle_arm_hw_error(gdata, sev, sync);
>>           } else {
>>               void *err = acpi_hest_get_payload(gdata);
>>  

2023-04-12 03:00:28

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] ACPI: APEI: handle synchronous exceptions in task work



On 2023/4/11 PM10:28, Kefeng Wang wrote:
>
>
> On 2023/4/11 18:48, Shuai Xue wrote:
>> Hardware errors could be signaled by synchronous interrupt, e.g.  when an
>> error is detected by a background scrubber, or signaled by synchronous
>> exception, e.g. when an uncorrected error is consumed. Both synchronous and
>> asynchronous error are queued and handled by a dedicated kthread in
>> workqueue.
>>
>> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
>> synchronous errors") keep track of whether memory_failure() work was
>> queued, and make task_work pending to flush out the workqueue so that the
>> work for synchronous error is processed before returning to user-space.
>> The trick ensures that the corrupted page is unmapped and poisoned. And
>> after returning to user-space, the task starts at current instruction which
>> triggering a page fault in which kernel will send SIGBUS to current process
>> due to VM_FAULT_HWPOISON.
>>
>> However, the memory failure recovery for hwpoison-aware mechanisms does not
>> work as expected. For example, hwpoison-aware user-space processes like
>> QEMU register their customized SIGBUS handler and enable early kill mode by
>> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
>> the process by sending a SIGBUS signal in memory failure with wrong
>> si_code: the actual user-space process accessing the corrupt memory
>> location, but its memory failure work is handled in a kthread context, so
>> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
>> process instead of BUS_MCEERR_AR in kill_proc().
>>
>> To this end, separate synchronous and asynchronous error handling into
>> different paths like X86 platform does:
>>
>> - valid synchronous errors: queue a task_work to synchronously send SIGBUS
>>    before ret_to_user.
>> - valid asynchronous errors: queue a work into workqueue to asynchronously
>>    handle memory failure.
>> - abnormal branches such as invalid PA, unexpected severity, no memory
>>    failure config support, invalid GUID section, OOM, etc.
>>
>> Then for valid synchronous errors, the current context in memory failure is
>> exactly belongs to the task consuming poison data and it will send SIBBUS
>> with proper si_code.
>>
>> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
>> Signed-off-by: Shuai Xue <[email protected]>
>> Tested-by: Ma Wupeng <[email protected]>
>> ---
>>   drivers/acpi/apei/ghes.c | 91 +++++++++++++++++++++++++++-------------
>>   include/acpi/ghes.h      |  3 --
>>   mm/memory-failure.c      | 13 ------
>>   3 files changed, 61 insertions(+), 46 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index c479b85899f5..4b70955e25f9 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -452,28 +452,51 @@ static void ghes_clear_estatus(struct ghes *ghes,
>>   }
>>     /*
>> - * Called as task_work before returning to user-space.
>> - * Ensure any queued work has been done before we return to the context that
>> - * triggered the notification.
>> + * struct sync_task_work - for synchronous RAS event
>> + *
>> + * @twork:                callback_head for task work
>> + * @pfn:                  page frame number of corrupted page
>> + * @flags:                fine tune action taken
>> + *
>> + * Structure to pass task work to be handled before
>> + * ret_to_user via task_work_add().
>>    */
>> -static void ghes_kick_task_work(struct callback_head *head)
>> +struct sync_task_work {
>> +    struct callback_head twork;
>> +    u64 pfn;
>> +    int flags;
>> +};
>> +
>> +static void memory_failure_cb(struct callback_head *twork)
>>   {
>> -    struct acpi_hest_generic_status *estatus;
>> -    struct ghes_estatus_node *estatus_node;
>> -    u32 node_len;
>> +    int ret;
>> +    struct sync_task_work *twcb =
>> +        container_of(twork, struct sync_task_work, twork);
>>   -    estatus_node = container_of(head, struct ghes_estatus_node, task_work);
>> -    if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>> -        memory_failure_queue_kick(estatus_node->task_work_cpu);
>> +    ret = memory_failure(twcb->pfn, twcb->flags);
>> +    kfree(twcb);
>>   -    estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>> -    node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
>> -    gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
>> +    if (!ret)
>> +        return;
>> +
>> +    /*
>> +     * -EHWPOISON from memory_failure() means that it already sent SIGBUS
>> +     * to the current process with the proper error info,
>
> This should be part of the comments of function memory_failure(),
>
>> +     * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
>> +     *
> and this part is already there
>> +     * In both cases, no further processing is required.
>> +     */
> so, after that, I think we could drop this comment, also the same comment in x86's kill_me_maybe().

OK, I will add comments on the return value of memory_failure() and drop
both this comment and the one in kill_me_maybe().
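
For reference, the documentation being discussed could look roughly like the
following sketch on top of memory_failure()'s kernel-doc (the wording is only
an example based on the description above, not the final patch):

    /*
     * Return values:
     *   0              - recovery succeeded,
     *   -EOPNOTSUPP    - hwpoison_filter() filtered the error event,
     *   -EHWPOISON     - already sent SIGBUS to the current process with
     *                    the proper error info,
     *   other negative - failed to recover from the memory error.
     */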

>
>> +    if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
>> +        return;
>> +
>> +    pr_err("Memory error not recovered");
>> +    force_sig(SIGBUS);
>>   }
>>     static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>   {
>>       unsigned long pfn;
>> +    struct sync_task_work *twcb;
>>         if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>>           return false;
>> @@ -486,6 +509,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>           return false;
>>       }
>>   +    if (flags == MF_ACTION_REQUIRED && current->mm) {
>> +        twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
>> +        if (!twcb)
>> +            return false;
>> +
>> +        twcb->pfn = pfn;
>> +        twcb->flags = flags;
>> +        init_task_work(&twcb->twork, memory_failure_cb);
>> +        task_work_add(current, &twcb->twork, TWA_RESUME);
>> +        return true;
>> +    }
>> +
>>       memory_failure_queue(pfn, flags);
>>       return true;
>>   }
>> @@ -1000,9 +1035,8 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>       struct ghes_estatus_node *estatus_node;
>>       struct acpi_hest_generic *generic;
>>       struct acpi_hest_generic_status *estatus;
>> -    bool task_work_pending;
>> +    bool queued, sync;
>>       u32 len, node_len;
>> -    int ret;
>>         llnode = llist_del_all(&ghes_estatus_llist);
>>       /*
>> @@ -1015,27 +1049,25 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>           estatus_node = llist_entry(llnode, struct ghes_estatus_node,
>>                          llnode);
>>           estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>> +        sync = is_hest_sync_notify(estatus_node->ghes);
>>           len = cper_estatus_len(estatus);
>>           node_len = GHES_ESTATUS_NODE_LEN(len);
>> -        task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
>> +
>> +        queued = ghes_do_proc(estatus_node->ghes, estatus);
>> +        /*
>> +         * If no memory failure work is queued for abnormal synchronous
>> +         * errors, do a force kill.
>> +         */
>> +        if (sync && !queued)
>> +            force_sig(SIGBUS);
>
> It's better to move this part into function ghes_do_proc(), because there is already an is_hest_sync_notify(), and no need return value,
> so make ghes_do_proc() a void function, Apart from this,

Good idea. I will do this and send a new version.
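
For what it's worth, the reshuffle could end up looking roughly like the
sketch below (only the tail of ghes_do_proc() is shown, based on the hunks
above; not the final v6 code):

    static void ghes_do_proc(struct ghes *ghes,
                             const struct acpi_hest_generic_status *estatus)
    {
            ...
            bool queued = false;
            bool sync = is_hest_sync_notify(ghes);

            apei_estatus_for_each_section(estatus, gdata) {
                    /* existing per-section handling, which may set 'queued' */
                    ...
            }

            /*
             * If no memory failure work is queued for abnormal synchronous
             * errors, do a force kill.
             */
            if (sync && !queued)
                    force_sig(SIGBUS);
    }

ghes_proc_in_irq() would then simply call ghes_do_proc() without looking at
a return value.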

>
> Reviewed-by: Kefeng Wang <[email protected]>

Thank you.

Cheers,
Shuai

>
>> +
>>           if (!ghes_estatus_cached(estatus)) {
>>               generic = estatus_node->generic;
>>               if (ghes_print_estatus(NULL, generic, estatus))
>>                   ghes_estatus_cache_add(generic, estatus);
>>           }
>> -
>> -        if (task_work_pending && current->mm) {
>> -            estatus_node->task_work.func = ghes_kick_task_work;
>> -            estatus_node->task_work_cpu = smp_processor_id();
>> -            ret = task_work_add(current, &estatus_node->task_work,
>> -                        TWA_RESUME);
>> -            if (ret)
>> -                estatus_node->task_work.func = NULL;
>> -        }
>> -
>> -        if (!estatus_node->task_work.func)
>> -            gen_pool_free(ghes_estatus_pool,
>> -                      (unsigned long)estatus_node, node_len);
>> +        gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
>> +                  node_len);
>>             llnode = next;
>>       }
>> @@ -1096,7 +1128,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
>>         estatus_node->ghes = ghes;
>>       estatus_node->generic = ghes->generic;
>> -    estatus_node->task_work.func = NULL;
>>       estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>         if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
>> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
>> index 3c8bba9f1114..e5e0c308d27f 100644
>> --- a/include/acpi/ghes.h
>> +++ b/include/acpi/ghes.h
>> @@ -35,9 +35,6 @@ struct ghes_estatus_node {
>>       struct llist_node llnode;
>>       struct acpi_hest_generic *generic;
>>       struct ghes *ghes;
>> -
>> -    int task_work_cpu;
>> -    struct callback_head task_work;
>>   };
>>     struct ghes_estatus_cache {
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index fae9baf3be16..6ea8c325acb3 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -2355,19 +2355,6 @@ static void memory_failure_work_func(struct work_struct *work)
>>       }
>>   }
>>   -/*
>> - * Process memory_failure work queued on the specified CPU.
>> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
>> - */
>> -void memory_failure_queue_kick(int cpu)
>> -{
>> -    struct memory_failure_cpu *mf_cpu;
>> -
>> -    mf_cpu = &per_cpu(memory_failure_cpu, cpu);
>> -    cancel_work_sync(&mf_cpu->work);
>> -    memory_failure_work_func(&mf_cpu->work);
>> -}
>> -
>>   static int __init memory_failure_init(void)
>>   {
>>       struct memory_failure_cpu *mf_cpu;

2023-04-12 04:05:45

by Xiaofei Tan

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events


Reviewed-by: Xiaofei Tan <[email protected]>

On 2023/4/11 18:48, Shuai Xue wrote:
> There are two major types of uncorrected recoverable (UCR) errors :
>
> - Action Required (AR): The error is detected and the processor already
> consumes the memory. OS requires to take action (for example, offline
> failure page/kill failure thread) to recover this uncorrectable error.
>
> - Action Optional (AO): The error is detected out of processor execution
> context. Some data in the memory are corrupted. But the data have not
> been consumed. OS is optional to take action to recover this
> uncorrectable error.
>
> The essential difference between AR and AO errors is that AR is a
> synchronous event, while AO is an asynchronous event. The hardware will
> signal a synchronous exception (Machine Check Exception on X86 and
> Synchronous External Abort on Arm64) when an error is detected and the
> memory access has been architecturally executed.
>
> When APEI firmware first is enabled, a platform may describe one error
> source for the handling of synchronous errors (e.g. MCE or SEA notification
> ), or for handling asynchronous errors (e.g. SCI or External Interrupt
> notification). In other words, we can distinguish synchronous errors by
> APEI notification. For AR errors, kernel will kill current process
> accessing the poisoned page by sending SIGBUS with BUS_MCEERR_AR. In
> addition, for AO errors, kernel will notify the process who owns the
> poisoned page by sending SIGBUS with BUS_MCEERR_AO in early kill mode.
> However, the GHES driver always sets mf_flags to 0 so that all UCR errors
> are handled as AO errors in memory failure.
>
> To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
> events.
>
> Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")'
> Signed-off-by: Shuai Xue <[email protected]>
> Tested-by: Ma Wupeng <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 29 +++++++++++++++++++++++------
> 1 file changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 34ad071a64e9..c479b85899f5 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
> return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
> }
>
> +/*
> + * A platform may describe one error source for the handling of synchronous
> + * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
> + * or External Interrupt). On x86, the HEST notifications are always
> + * asynchronous, so only SEA on ARM is delivered as a synchronous
> + * notification.
> + */
> +static inline bool is_hest_sync_notify(struct ghes *ghes)
> +{
> + u8 notify_type = ghes->generic->notify.type;
> +
> + return notify_type == ACPI_HEST_NOTIFY_SEA;
> +}
> +
> /*
> * This driver isn't really modular, however for the time being,
> * continuing to use module_param is the easiest way to remain
> @@ -477,7 +491,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> }
>
> static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> - int sev)
> + int sev, bool sync)
> {
> int flags = -1;
> int sec_sev = ghes_severity(gdata->error_severity);
> @@ -491,7 +505,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
> flags = MF_SOFT_OFFLINE;
> if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
> - flags = 0;
> + flags = sync ? MF_ACTION_REQUIRED : 0;
>
> if (flags != -1)
> return ghes_do_memory_failure(mem_err->physical_addr, flags);
> @@ -499,9 +513,11 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> return false;
> }
>
> -static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
> +static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
> + int sev, bool sync)
> {
> struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
> + int flags = sync ? MF_ACTION_REQUIRED : 0;
> bool queued = false;
> int sec_sev, i;
> char *p;
> @@ -526,7 +542,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
> * and don't filter out 'corrected' error here.
> */
> if (is_cache && has_pa) {
> - queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
> + queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
> p += err_info->length;
> continue;
> }
> @@ -647,6 +663,7 @@ static bool ghes_do_proc(struct ghes *ghes,
> const guid_t *fru_id = &guid_null;
> char *fru_text = "";
> bool queued = false;
> + bool sync = is_hest_sync_notify(ghes);
>
> sev = ghes_severity(estatus->error_severity);
> apei_estatus_for_each_section(estatus, gdata) {
> @@ -664,13 +681,13 @@ static bool ghes_do_proc(struct ghes *ghes,
> atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);
>
> arch_apei_report_mem_error(sev, mem_err);
> - queued = ghes_handle_memory_failure(gdata, sev);
> + queued = ghes_handle_memory_failure(gdata, sev, sync);
> }
> else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
> ghes_handle_aer(gdata);
> }
> else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
> - queued = ghes_handle_arm_hw_error(gdata, sev);
> + queued = ghes_handle_arm_hw_error(gdata, sev, sync);
> } else {
> void *err = acpi_hest_get_payload(gdata);
>

2023-04-12 04:07:03

by Xiaofei Tan

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] ACPI: APEI: handle synchronous exceptions in task work


On 2023/4/11 18:48, Shuai Xue wrote:
> Hardware errors could be signaled by synchronous interrupt, e.g. when an
> error is detected by a background scrubber, or signaled by synchronous
> exception, e.g. when an uncorrected error is consumed. Both synchronous and
> asynchronous error are queued and handled by a dedicated kthread in
> workqueue.
>
> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
> synchronous errors") keep track of whether memory_failure() work was
> queued, and make task_work pending to flush out the workqueue so that the
> work for synchronous error is processed before returning to user-space.
> The trick ensures that the corrupted page is unmapped and poisoned. And
> after returning to user-space, the task starts at current instruction which
> triggering a page fault in which kernel will send SIGBUS to current process
> due to VM_FAULT_HWPOISON.
>
> However, the memory failure recovery for hwpoison-aware mechanisms does not
> work as expected. For example, hwpoison-aware user-space processes like
> QEMU register their customized SIGBUS handler and enable early kill mode by
> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
> the process by sending a SIGBUS signal in memory failure with wrong
> si_code: the actual user-space process accessing the corrupt memory
> location, but its memory failure work is handled in a kthread context, so
> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
> process instead of BUS_MCEERR_AR in kill_proc().
>
> To this end, separate synchronous and asynchronous error handling into
> different paths like X86 platform does:
>
> - valid synchronous errors: queue a task_work to synchronously send SIGBUS
> before ret_to_user.
> - valid asynchronous errors: queue a work into workqueue to asynchronously
> handle memory failure.
> - abnormal branches such as invalid PA, unexpected severity, no memory
> failure config support, invalid GUID section, OOM, etc.
>
> Then for valid synchronous errors, the current context in memory failure is
> exactly belongs to the task consuming poison data and it will send SIBBUS
> with proper si_code.
>
> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
> Signed-off-by: Shuai Xue <[email protected]>
> Tested-by: Ma Wupeng <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 91 +++++++++++++++++++++++++++-------------
> include/acpi/ghes.h | 3 --
> mm/memory-failure.c | 13 ------
> 3 files changed, 61 insertions(+), 46 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index c479b85899f5..4b70955e25f9 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -452,28 +452,51 @@ static void ghes_clear_estatus(struct ghes *ghes,
> }
>
> /*
> - * Called as task_work before returning to user-space.
> - * Ensure any queued work has been done before we return to the context that
> - * triggered the notification.
> + * struct sync_task_work - for synchronous RAS event
> + *
> + * @twork: callback_head for task work
> + * @pfn: page frame number of corrupted page
> + * @flags: fine tune action taken
> + *
> + * Structure to pass task work to be handled before
> + * ret_to_user via task_work_add().
> */
> -static void ghes_kick_task_work(struct callback_head *head)
> +struct sync_task_work {
> + struct callback_head twork;
> + u64 pfn;
> + int flags;
> +};
> +
> +static void memory_failure_cb(struct callback_head *twork)
> {
> - struct acpi_hest_generic_status *estatus;
> - struct ghes_estatus_node *estatus_node;
> - u32 node_len;
> + int ret;
> + struct sync_task_work *twcb =
> + container_of(twork, struct sync_task_work, twork);
>
> - estatus_node = container_of(head, struct ghes_estatus_node, task_work);
> - if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
> - memory_failure_queue_kick(estatus_node->task_work_cpu);
> + ret = memory_failure(twcb->pfn, twcb->flags);
> + kfree(twcb);
>
> - estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> - node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
> - gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
> + if (!ret)
> + return;
> +
> + /*
> + * -EHWPOISON from memory_failure() means that it already sent SIGBUS
> + * to the current process with the proper error info,
> + * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
> + *
> + * In both cases, no further processing is required.
> + */
> + if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
> + return;
> +
> + pr_err("Memory error not recovered");

The print could mention the SIGBUS signal sending that follows, for
example "Sending SIGBUS to current task due to memory error not recovered".

> + force_sig(SIGBUS);
> }
>
> static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> {
> unsigned long pfn;
> + struct sync_task_work *twcb;
>
> if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
> return false;
> @@ -486,6 +509,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> return false;
> }
>
> + if (flags == MF_ACTION_REQUIRED && current->mm) {
> + twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
> + if (!twcb)
> + return false;
> +
> + twcb->pfn = pfn;
> + twcb->flags = flags;
> + init_task_work(&twcb->twork, memory_failure_cb);
> + task_work_add(current, &twcb->twork, TWA_RESUME);
> + return true;
> + }
> +
> memory_failure_queue(pfn, flags);
> return true;
> }
> @@ -1000,9 +1035,8 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
> struct ghes_estatus_node *estatus_node;
> struct acpi_hest_generic *generic;
> struct acpi_hest_generic_status *estatus;
> - bool task_work_pending;
> + bool queued, sync;
> u32 len, node_len;
> - int ret;
>
> llnode = llist_del_all(&ghes_estatus_llist);
> /*
> @@ -1015,27 +1049,25 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
> estatus_node = llist_entry(llnode, struct ghes_estatus_node,
> llnode);
> estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> + sync = is_hest_sync_notify(estatus_node->ghes);
> len = cper_estatus_len(estatus);
> node_len = GHES_ESTATUS_NODE_LEN(len);
> - task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
> +
> + queued = ghes_do_proc(estatus_node->ghes, estatus);
> + /*
> + * If no memory failure work is queued for abnormal synchronous
> + * errors, do a force kill.
> + */
> + if (sync && !queued)
> + force_sig(SIGBUS);

A similar print could also be added here, as above. Apart from this,
Reviewed-by: Xiaofei Tan <[email protected]>

> +
> if (!ghes_estatus_cached(estatus)) {
> generic = estatus_node->generic;
> if (ghes_print_estatus(NULL, generic, estatus))
> ghes_estatus_cache_add(generic, estatus);
> }
> -
> - if (task_work_pending && current->mm) {
> - estatus_node->task_work.func = ghes_kick_task_work;
> - estatus_node->task_work_cpu = smp_processor_id();
> - ret = task_work_add(current, &estatus_node->task_work,
> - TWA_RESUME);
> - if (ret)
> - estatus_node->task_work.func = NULL;
> - }
> -
> - if (!estatus_node->task_work.func)
> - gen_pool_free(ghes_estatus_pool,
> - (unsigned long)estatus_node, node_len);
> + gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
> + node_len);
>
> llnode = next;
> }
> @@ -1096,7 +1128,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
>
> estatus_node->ghes = ghes;
> estatus_node->generic = ghes->generic;
> - estatus_node->task_work.func = NULL;
> estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>
> if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index 3c8bba9f1114..e5e0c308d27f 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -35,9 +35,6 @@ struct ghes_estatus_node {
> struct llist_node llnode;
> struct acpi_hest_generic *generic;
> struct ghes *ghes;
> -
> - int task_work_cpu;
> - struct callback_head task_work;
> };
>
> struct ghes_estatus_cache {
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fae9baf3be16..6ea8c325acb3 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2355,19 +2355,6 @@ static void memory_failure_work_func(struct work_struct *work)
> }
> }
>
> -/*
> - * Process memory_failure work queued on the specified CPU.
> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
> - */
> -void memory_failure_queue_kick(int cpu)
> -{
> - struct memory_failure_cpu *mf_cpu;
> -
> - mf_cpu = &per_cpu(memory_failure_cpu, cpu);
> - cancel_work_sync(&mf_cpu->work);
> - memory_failure_work_func(&mf_cpu->work);
> -}
> -
> static int __init memory_failure_init(void)
> {
> struct memory_failure_cpu *mf_cpu;

2023-04-12 11:30:12

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v6 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code

changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure()
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure error such as invalid PA,
unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter
- Link: https://lore.kernel.org/lkml/[email protected]/

changes since v1:
- synchronous events by notify type
- Link: https://lore.kernel.org/lkml/[email protected]/

Shuai Xue (2):
ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
synchronous events
ACPI: APEI: handle synchronous exceptions in task work

arch/x86/kernel/cpu/mce/core.c | 7 ---
drivers/acpi/apei/ghes.c | 111 ++++++++++++++++++++++-----------
include/acpi/ghes.h | 3 -
mm/memory-failure.c | 17 +----
4 files changed, 76 insertions(+), 62 deletions(-)

--
2.20.1.12.g72788fdb

2023-04-12 11:30:30

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v6 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

There are two major types of uncorrected recoverable (UCR) errors:

- Action Required (AR): The error is detected and the processor has already
consumed the memory. The OS is required to take action (for example, offline
the failing page or kill the failing thread) to recover from this
uncorrectable error.

- Action Optional (AO): The error is detected outside of processor execution
context. Some data in the memory are corrupted, but the data have not
been consumed. The OS may optionally take action to recover from this
uncorrectable error.

The essential difference between AR and AO errors is that AR is a
synchronous event, while AO is an asynchronous event. The hardware will
signal a synchronous exception (Machine Check Exception on X86 and
Synchronous External Abort on Arm64) when an error is detected and the
memory access has been architecturally executed.

When APEI firmware-first is enabled, a platform may describe one error
source for the handling of synchronous errors (e.g. MCE or SEA notification),
or for handling asynchronous errors (e.g. SCI or External Interrupt
notification). In other words, we can distinguish synchronous errors by the
APEI notification type. For AR errors, the kernel kills the current process
accessing the poisoned page by sending SIGBUS with BUS_MCEERR_AR. In
addition, for AO errors, the kernel notifies the process that owns the
poisoned page by sending SIGBUS with BUS_MCEERR_AO in early kill mode.
However, the GHES driver always sets mf_flags to 0, so all UCR errors
are handled as AO errors in memory failure.

To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
events.

Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")
Signed-off-by: Shuai Xue <[email protected]>
Tested-by: Ma Wupeng <[email protected]>
Reviewed-by: Kefeng Wang <[email protected]>
---
drivers/acpi/apei/ghes.c | 29 +++++++++++++++++++++++------
1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 34ad071a64e9..c479b85899f5 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
}

+/*
+ * A platform may describe one error source for the handling of synchronous
+ * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
+ * or External Interrupt). On x86, the HEST notifications are always
+ * asynchronous, so only SEA on ARM is delivered as a synchronous
+ * notification.
+ */
+static inline bool is_hest_sync_notify(struct ghes *ghes)
+{
+ u8 notify_type = ghes->generic->notify.type;
+
+ return notify_type == ACPI_HEST_NOTIFY_SEA;
+}
+
/*
* This driver isn't really modular, however for the time being,
* continuing to use module_param is the easiest way to remain
@@ -477,7 +491,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
}

static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
- int sev)
+ int sev, bool sync)
{
int flags = -1;
int sec_sev = ghes_severity(gdata->error_severity);
@@ -491,7 +505,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
(gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
flags = MF_SOFT_OFFLINE;
if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
- flags = 0;
+ flags = sync ? MF_ACTION_REQUIRED : 0;

if (flags != -1)
return ghes_do_memory_failure(mem_err->physical_addr, flags);
@@ -499,9 +513,11 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
return false;
}

-static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
+static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
+ int sev, bool sync)
{
struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
+ int flags = sync ? MF_ACTION_REQUIRED : 0;
bool queued = false;
int sec_sev, i;
char *p;
@@ -526,7 +542,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
* and don't filter out 'corrected' error here.
*/
if (is_cache && has_pa) {
- queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
+ queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
p += err_info->length;
continue;
}
@@ -647,6 +663,7 @@ static bool ghes_do_proc(struct ghes *ghes,
const guid_t *fru_id = &guid_null;
char *fru_text = "";
bool queued = false;
+ bool sync = is_hest_sync_notify(ghes);

sev = ghes_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
@@ -664,13 +681,13 @@ static bool ghes_do_proc(struct ghes *ghes,
atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);

arch_apei_report_mem_error(sev, mem_err);
- queued = ghes_handle_memory_failure(gdata, sev);
+ queued = ghes_handle_memory_failure(gdata, sev, sync);
}
else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
ghes_handle_aer(gdata);
}
else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
- queued = ghes_handle_arm_hw_error(gdata, sev);
+ queued = ghes_handle_arm_hw_error(gdata, sev, sync);
} else {
void *err = acpi_hest_get_payload(gdata);

--
2.20.1.12.g72788fdb

2023-04-12 11:31:02

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v6 2/2] ACPI: APEI: handle synchronous exceptions in task work

Hardware errors could be signaled by an asynchronous interrupt, e.g. when an
error is detected by a background scrubber, or by a synchronous exception,
e.g. when an uncorrected error is consumed. Both synchronous and
asynchronous errors are queued and handled by a dedicated kthread in a
workqueue.

Commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
synchronous errors") keeps track of whether memory_failure() work was
queued, and makes task_work pending to flush out the workqueue so that the
work for a synchronous error is processed before returning to user-space.
The trick ensures that the corrupted page is unmapped and poisoned. After
returning to user-space, the task restarts at the current instruction, which
triggers a page fault in which the kernel sends SIGBUS to the current
process due to VM_FAULT_HWPOISON.

However, memory failure recovery for hwpoison-aware mechanisms does not
work as expected. For example, hwpoison-aware user-space processes like
QEMU register their customized SIGBUS handler and enable early kill mode by
setting PF_MCE_EARLY at initialization. The kernel will then directly notify
the process by sending a SIGBUS signal in memory failure, but with the wrong
si_code: the actual user-space process is accessing the corrupt memory
location, yet its memory failure work is handled in a kthread context, so
kill_proc() will send SIGBUS with si_code BUS_MCEERR_AO to that process
instead of BUS_MCEERR_AR.

To this end, separate synchronous and asynchronous error handling into
different paths, as the x86 platform does:

- valid synchronous errors: queue a task_work to synchronously send SIGBUS
before ret_to_user.
- valid asynchronous errors: queue a work into workqueue to asynchronously
handle memory failure.
- abnormal branches such as invalid PA, unexpected severity, no memory
failure config support, invalid GUID section, OOM, etc.: do a force kill.

Then, for valid synchronous errors, the current context in memory failure
belongs exactly to the task consuming the poisoned data, and it will send
SIGBUS with the proper si_code.

Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
Signed-off-by: Shuai Xue <[email protected]>
Tested-by: Ma Wupeng <[email protected]>
Reviewed-by: Kefeng Wang <[email protected]>
---
arch/x86/kernel/cpu/mce/core.c | 7 ---
drivers/acpi/apei/ghes.c | 82 +++++++++++++++++++++-------------
include/acpi/ghes.h | 3 --
mm/memory-failure.c | 17 ++-----
4 files changed, 53 insertions(+), 56 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2eec60f50057..0badc97920c6 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1311,13 +1311,6 @@ static void kill_me_maybe(struct callback_head *cb)
return;
}

- /*
- * -EHWPOISON from memory_failure() means that it already sent SIGBUS
- * to the current process with the proper error info,
- * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
- *
- * In both cases, no further processing is required.
- */
if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
return;

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index c479b85899f5..836c829795ee 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -452,28 +452,41 @@ static void ghes_clear_estatus(struct ghes *ghes,
}

/*
- * Called as task_work before returning to user-space.
- * Ensure any queued work has been done before we return to the context that
- * triggered the notification.
+ * struct sync_task_work - for synchronous RAS event
+ *
+ * @twork: callback_head for task work
+ * @pfn: page frame number of corrupted page
+ * @flags: fine tune action taken
+ *
+ * Structure to pass task work to be handled before
+ * ret_to_user via task_work_add().
*/
-static void ghes_kick_task_work(struct callback_head *head)
+struct sync_task_work {
+ struct callback_head twork;
+ u64 pfn;
+ int flags;
+};
+
+static void memory_failure_cb(struct callback_head *twork)
{
- struct acpi_hest_generic_status *estatus;
- struct ghes_estatus_node *estatus_node;
- u32 node_len;
+ int ret;
+ struct sync_task_work *twcb =
+ container_of(twork, struct sync_task_work, twork);

- estatus_node = container_of(head, struct ghes_estatus_node, task_work);
- if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
- memory_failure_queue_kick(estatus_node->task_work_cpu);
+ ret = memory_failure(twcb->pfn, twcb->flags);
+ kfree(twcb);

- estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
- node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
- gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
+ if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP)
+ return;
+
+ pr_err("Memory error not recovered");
+ force_sig(SIGBUS);
}

static bool ghes_do_memory_failure(u64 physical_addr, int flags)
{
unsigned long pfn;
+ struct sync_task_work *twcb;

if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
return false;
@@ -486,6 +499,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
return false;
}

+ if (flags == MF_ACTION_REQUIRED && current->mm) {
+ twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
+ if (!twcb)
+ return false;
+
+ twcb->pfn = pfn;
+ twcb->flags = flags;
+ init_task_work(&twcb->twork, memory_failure_cb);
+ task_work_add(current, &twcb->twork, TWA_RESUME);
+ return true;
+ }
+
memory_failure_queue(pfn, flags);
return true;
}
@@ -654,7 +679,7 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
schedule_work(&entry->work);
}

-static bool ghes_do_proc(struct ghes *ghes,
+static void ghes_do_proc(struct ghes *ghes,
const struct acpi_hest_generic_status *estatus)
{
int sev, sec_sev;
@@ -698,7 +723,12 @@ static bool ghes_do_proc(struct ghes *ghes,
}
}

- return queued;
+ /*
+ * If no memory failure work is queued for abnormal synchronous
+ * errors, do a force kill.
+ */
+ if (sync && !queued)
+ force_sig(SIGBUS);
}

static void __ghes_print_estatus(const char *pfx,
@@ -1000,9 +1030,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
struct ghes_estatus_node *estatus_node;
struct acpi_hest_generic *generic;
struct acpi_hest_generic_status *estatus;
- bool task_work_pending;
u32 len, node_len;
- int ret;

llnode = llist_del_all(&ghes_estatus_llist);
/*
@@ -1017,25 +1045,16 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
len = cper_estatus_len(estatus);
node_len = GHES_ESTATUS_NODE_LEN(len);
- task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
+
+ ghes_do_proc(estatus_node->ghes, estatus);
+
if (!ghes_estatus_cached(estatus)) {
generic = estatus_node->generic;
if (ghes_print_estatus(NULL, generic, estatus))
ghes_estatus_cache_add(generic, estatus);
}
-
- if (task_work_pending && current->mm) {
- estatus_node->task_work.func = ghes_kick_task_work;
- estatus_node->task_work_cpu = smp_processor_id();
- ret = task_work_add(current, &estatus_node->task_work,
- TWA_RESUME);
- if (ret)
- estatus_node->task_work.func = NULL;
- }
-
- if (!estatus_node->task_work.func)
- gen_pool_free(ghes_estatus_pool,
- (unsigned long)estatus_node, node_len);
+ gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
+ node_len);

llnode = next;
}
@@ -1096,7 +1115,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,

estatus_node->ghes = ghes;
estatus_node->generic = ghes->generic;
- estatus_node->task_work.func = NULL;
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);

if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 3c8bba9f1114..e5e0c308d27f 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -35,9 +35,6 @@ struct ghes_estatus_node {
struct llist_node llnode;
struct acpi_hest_generic *generic;
struct ghes *ghes;
-
- int task_work_cpu;
- struct callback_head task_work;
};

struct ghes_estatus_cache {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fae9baf3be16..3aef483ca3c6 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2073,7 +2073,9 @@ static DEFINE_MUTEX(mf_mutex);
*
* Return: 0 for successfully handled the memory error,
* -EOPNOTSUPP for hwpoison_filter() filtered the error event,
- * < 0(except -EOPNOTSUPP) on failure.
+ * -EHWPOISON for already sent SIGBUS to the current process with
+ * the proper error info,
+ * other negative error code on failure.
*/
int memory_failure(unsigned long pfn, int flags)
{
@@ -2355,19 +2357,6 @@ static void memory_failure_work_func(struct work_struct *work)
}
}

-/*
- * Process memory_failure work queued on the specified CPU.
- * Used to avoid return-to-userspace racing with the memory_failure workqueue.
- */
-void memory_failure_queue_kick(int cpu)
-{
- struct memory_failure_cpu *mf_cpu;
-
- mf_cpu = &per_cpu(memory_failure_cpu, cpu);
- cancel_work_sync(&mf_cpu->work);
- memory_failure_work_func(&mf_cpu->work);
-}
-
static int __init memory_failure_init(void)
{
struct memory_failure_cpu *mf_cpu;
--
2.20.1.12.g72788fdb

2023-04-13 01:55:22

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events



On 2023/4/12 AM11:55, Xiaofei Tan wrote:
>
> Reviewed-by: Xiaofei Tan <[email protected]>

Thank you :)

Cheers,
Shuai

>
> 在 2023/4/11 18:48, Shuai Xue 写道:
>> There are two major types of uncorrected recoverable (UCR) errors :
>>
>> - Action Required (AR): The error is detected and the processor already
>>    consumes the memory. OS requires to take action (for example, offline
>>    failure page/kill failure thread) to recover this uncorrectable error.
>>
>> - Action Optional (AO): The error is detected out of processor execution
>>    context. Some data in the memory are corrupted. But the data have not
>>    been consumed. OS is optional to take action to recover this
>>    uncorrectable error.
>>
>> The essential difference between AR and AO errors is that AR is a
>> synchronous event, while AO is an asynchronous event. The hardware will
>> signal a synchronous exception (Machine Check Exception on X86 and
>> Synchronous External Abort on Arm64) when an error is detected and the
>> memory access has been architecturally executed.
>>
>> When APEI firmware first is enabled, a platform may describe one error
>> source for the handling of synchronous errors (e.g. MCE or SEA notification
>> ), or for handling asynchronous errors (e.g. SCI or External Interrupt
>> notification). In other words, we can distinguish synchronous errors by
>> APEI notification. For AR errors, kernel will kill current process
>> accessing the poisoned page by sending SIGBUS with BUS_MCEERR_AR. In
>> addition, for AO errors, kernel will notify the process who owns the
>> poisoned page by sending SIGBUS with BUS_MCEERR_AO in early kill mode.
>> However, the GHES driver always sets mf_flags to 0 so that all UCR errors
>> are handled as AO errors in memory failure.
>>
>> To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
>> events.
>>
>> Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")'
>> Signed-off-by: Shuai Xue <[email protected]>
>> Tested-by: Ma Wupeng <[email protected]>
>> ---
>>   drivers/acpi/apei/ghes.c | 29 +++++++++++++++++++++++------
>>   1 file changed, 23 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 34ad071a64e9..c479b85899f5 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
>>       return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
>>   }
>>   +/*
>> + * A platform may describe one error source for the handling of synchronous
>> + * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
>> + * or External Interrupt). On x86, the HEST notifications are always
>> + * asynchronous, so only SEA on ARM is delivered as a synchronous
>> + * notification.
>> + */
>> +static inline bool is_hest_sync_notify(struct ghes *ghes)
>> +{
>> +    u8 notify_type = ghes->generic->notify.type;
>> +
>> +    return notify_type == ACPI_HEST_NOTIFY_SEA;
>> +}
>> +
>>   /*
>>    * This driver isn't really modular, however for the time being,
>>    * continuing to use module_param is the easiest way to remain
>> @@ -477,7 +491,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>   }
>>     static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>> -                       int sev)
>> +                       int sev, bool sync)
>>   {
>>       int flags = -1;
>>       int sec_sev = ghes_severity(gdata->error_severity);
>> @@ -491,7 +505,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>>           (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
>>           flags = MF_SOFT_OFFLINE;
>>       if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
>> -        flags = 0;
>> +        flags = sync ? MF_ACTION_REQUIRED : 0;
>>         if (flags != -1)
>>           return ghes_do_memory_failure(mem_err->physical_addr, flags);
>> @@ -499,9 +513,11 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>>       return false;
>>   }
>>   -static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
>> +static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
>> +                       int sev, bool sync)
>>   {
>>       struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
>> +    int flags = sync ? MF_ACTION_REQUIRED : 0;
>>       bool queued = false;
>>       int sec_sev, i;
>>       char *p;
>> @@ -526,7 +542,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
>>            * and don't filter out 'corrected' error here.
>>            */
>>           if (is_cache && has_pa) {
>> -            queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
>> +            queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
>>               p += err_info->length;
>>               continue;
>>           }
>> @@ -647,6 +663,7 @@ static bool ghes_do_proc(struct ghes *ghes,
>>       const guid_t *fru_id = &guid_null;
>>       char *fru_text = "";
>>       bool queued = false;
>> +    bool sync = is_hest_sync_notify(ghes);
>>         sev = ghes_severity(estatus->error_severity);
>>       apei_estatus_for_each_section(estatus, gdata) {
>> @@ -664,13 +681,13 @@ static bool ghes_do_proc(struct ghes *ghes,
>>               atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);
>>                 arch_apei_report_mem_error(sev, mem_err);
>> -            queued = ghes_handle_memory_failure(gdata, sev);
>> +            queued = ghes_handle_memory_failure(gdata, sev, sync);
>>           }
>>           else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
>>               ghes_handle_aer(gdata);
>>           }
>>           else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
>> -            queued = ghes_handle_arm_hw_error(gdata, sev);
>> +            queued = ghes_handle_arm_hw_error(gdata, sev, sync);
>>           } else {
>>               void *err = acpi_hest_get_payload(gdata);
>>  

2023-04-13 01:57:20

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] ACPI: APEI: handle synchronous exceptions in task work



On 2023/4/12 PM12:05, Xiaofei Tan wrote:
>
> 在 2023/4/11 18:48, Shuai Xue 写道:
>> Hardware errors could be signaled by synchronous interrupt, e.g.  when an
>> error is detected by a background scrubber, or signaled by synchronous
>> exception, e.g. when an uncorrected error is consumed. Both synchronous and
>> asynchronous error are queued and handled by a dedicated kthread in
>> workqueue.
>>
>> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
>> synchronous errors") keep track of whether memory_failure() work was
>> queued, and make task_work pending to flush out the workqueue so that the
>> work for synchronous error is processed before returning to user-space.
>> The trick ensures that the corrupted page is unmapped and poisoned. And
>> after returning to user-space, the task starts at current instruction which
>> triggering a page fault in which kernel will send SIGBUS to current process
>> due to VM_FAULT_HWPOISON.
>>
>> However, the memory failure recovery for hwpoison-aware mechanisms does not
>> work as expected. For example, hwpoison-aware user-space processes like
>> QEMU register their customized SIGBUS handler and enable early kill mode by
>> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
>> the process by sending a SIGBUS signal in memory failure with wrong
>> si_code: the actual user-space process accessing the corrupt memory
>> location, but its memory failure work is handled in a kthread context, so
>> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
>> process instead of BUS_MCEERR_AR in kill_proc().
>>
>> To this end, separate synchronous and asynchronous error handling into
>> different paths like X86 platform does:
>>
>> - valid synchronous errors: queue a task_work to synchronously send SIGBUS
>>    before ret_to_user.
>> - valid asynchronous errors: queue a work into workqueue to asynchronously
>>    handle memory failure.
>> - abnormal branches such as invalid PA, unexpected severity, no memory
>>    failure config support, invalid GUID section, OOM, etc.
>>
>> Then for valid synchronous errors, the current context in memory failure is
>> exactly belongs to the task consuming poison data and it will send SIBBUS
>> with proper si_code.
>>
>> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
>> Signed-off-by: Shuai Xue <[email protected]>
>> Tested-by: Ma Wupeng <[email protected]>
>> ---
>>   drivers/acpi/apei/ghes.c | 91 +++++++++++++++++++++++++++-------------
>>   include/acpi/ghes.h      |  3 --
>>   mm/memory-failure.c      | 13 ------
>>   3 files changed, 61 insertions(+), 46 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index c479b85899f5..4b70955e25f9 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -452,28 +452,51 @@ static void ghes_clear_estatus(struct ghes *ghes,
>>   }
>>     /*
>> - * Called as task_work before returning to user-space.
>> - * Ensure any queued work has been done before we return to the context that
>> - * triggered the notification.
>> + * struct sync_task_work - for synchronous RAS event
>> + *
>> + * @twork:                callback_head for task work
>> + * @pfn:                  page frame number of corrupted page
>> + * @flags:                fine tune action taken
>> + *
>> + * Structure to pass task work to be handled before
>> + * ret_to_user via task_work_add().
>>    */
>> -static void ghes_kick_task_work(struct callback_head *head)
>> +struct sync_task_work {
>> +    struct callback_head twork;
>> +    u64 pfn;
>> +    int flags;
>> +};
>> +
>> +static void memory_failure_cb(struct callback_head *twork)
>>   {
>> -    struct acpi_hest_generic_status *estatus;
>> -    struct ghes_estatus_node *estatus_node;
>> -    u32 node_len;
>> +    int ret;
>> +    struct sync_task_work *twcb =
>> +        container_of(twork, struct sync_task_work, twork);
>>   -    estatus_node = container_of(head, struct ghes_estatus_node, task_work);
>> -    if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>> -        memory_failure_queue_kick(estatus_node->task_work_cpu);
>> +    ret = memory_failure(twcb->pfn, twcb->flags);
>> +    kfree(twcb);
>>   -    estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>> -    node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
>> -    gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
>> +    if (!ret)
>> +        return;
>> +
>> +    /*
>> +     * -EHWPOISON from memory_failure() means that it already sent SIGBUS
>> +     * to the current process with the proper error info,
>> +     * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
>> +     *
>> +     * In both cases, no further processing is required.
>> +     */
>> +    if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
>> +        return;
>> +
>> +    pr_err("Memory error not recovered");
>
> The print could add the following SIGBUS signal sending.
> Such as "Sending SIGBUS to current task due to memory error not recovered"
>
>> +    force_sig(SIGBUS);
>>   }
>>     static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>   {
>>       unsigned long pfn;
>> +    struct sync_task_work *twcb;
>>         if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>>           return false;
>> @@ -486,6 +509,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>           return false;
>>       }
>>   +    if (flags == MF_ACTION_REQUIRED && current->mm) {
>> +        twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
>> +        if (!twcb)
>> +            return false;
>> +
>> +        twcb->pfn = pfn;
>> +        twcb->flags = flags;
>> +        init_task_work(&twcb->twork, memory_failure_cb);
>> +        task_work_add(current, &twcb->twork, TWA_RESUME);
>> +        return true;
>> +    }
>> +
>>       memory_failure_queue(pfn, flags);
>>       return true;
>>   }
>> @@ -1000,9 +1035,8 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>       struct ghes_estatus_node *estatus_node;
>>       struct acpi_hest_generic *generic;
>>       struct acpi_hest_generic_status *estatus;
>> -    bool task_work_pending;
>> +    bool queued, sync;
>>       u32 len, node_len;
>> -    int ret;
>>         llnode = llist_del_all(&ghes_estatus_llist);
>>       /*
>> @@ -1015,27 +1049,25 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>           estatus_node = llist_entry(llnode, struct ghes_estatus_node,
>>                          llnode);
>>           estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>> +        sync = is_hest_sync_notify(estatus_node->ghes);
>>           len = cper_estatus_len(estatus);
>>           node_len = GHES_ESTATUS_NODE_LEN(len);
>> -        task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
>> +
>> +        queued = ghes_do_proc(estatus_node->ghes, estatus);
>> +        /*
>> +         * If no memory failure work is queued for abnormal synchronous
>> +         * errors, do a force kill.
>> +         */
>> +        if (sync && !queued)
>> +            force_sig(SIGBUS);
>
> Could also add one similar print here as above
> Apart from this,
> Reviewed-by: Xiaofei Tan <[email protected]>

Thanks :)

Sorry, I missed your replies because Thunderbird marked the emails as junk
and moved them to the Junk folder.

I'd like to add the above warning message and pick up your reviewed-by tag.

Cheers,
Shuai



>
>> +
>>           if (!ghes_estatus_cached(estatus)) {
>>               generic = estatus_node->generic;
>>               if (ghes_print_estatus(NULL, generic, estatus))
>>                   ghes_estatus_cache_add(generic, estatus);
>>           }
>> -
>> -        if (task_work_pending && current->mm) {
>> -            estatus_node->task_work.func = ghes_kick_task_work;
>> -            estatus_node->task_work_cpu = smp_processor_id();
>> -            ret = task_work_add(current, &estatus_node->task_work,
>> -                        TWA_RESUME);
>> -            if (ret)
>> -                estatus_node->task_work.func = NULL;
>> -        }
>> -
>> -        if (!estatus_node->task_work.func)
>> -            gen_pool_free(ghes_estatus_pool,
>> -                      (unsigned long)estatus_node, node_len);
>> +        gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
>> +                  node_len);
>>             llnode = next;
>>       }
>> @@ -1096,7 +1128,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
>>         estatus_node->ghes = ghes;
>>       estatus_node->generic = ghes->generic;
>> -    estatus_node->task_work.func = NULL;
>>       estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>         if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
>> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
>> index 3c8bba9f1114..e5e0c308d27f 100644
>> --- a/include/acpi/ghes.h
>> +++ b/include/acpi/ghes.h
>> @@ -35,9 +35,6 @@ struct ghes_estatus_node {
>>       struct llist_node llnode;
>>       struct acpi_hest_generic *generic;
>>       struct ghes *ghes;
>> -
>> -    int task_work_cpu;
>> -    struct callback_head task_work;
>>   };
>>     struct ghes_estatus_cache {
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index fae9baf3be16..6ea8c325acb3 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -2355,19 +2355,6 @@ static void memory_failure_work_func(struct work_struct *work)
>>       }
>>   }
>>   -/*
>> - * Process memory_failure work queued on the specified CPU.
>> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
>> - */
>> -void memory_failure_queue_kick(int cpu)
>> -{
>> -    struct memory_failure_cpu *mf_cpu;
>> -
>> -    mf_cpu = &per_cpu(memory_failure_cpu, cpu);
>> -    cancel_work_sync(&mf_cpu->work);
>> -    memory_failure_work_func(&mf_cpu->work);
>> -}
>> -
>>   static int __init memory_failure_init(void)
>>   {
>>       struct memory_failure_cpu *mf_cpu;

2023-04-17 01:35:41

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v7 2/2] ACPI: APEI: handle synchronous exceptions in task work

Hardware errors could be signaled by an asynchronous interrupt, e.g. when an
error is detected by a background scrubber, or by a synchronous exception,
e.g. when an uncorrected error is consumed. Both synchronous and
asynchronous errors are queued and handled by a dedicated kthread in a
workqueue.

Commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
synchronous errors") keeps track of whether memory_failure() work was
queued, and makes task_work pending to flush out the workqueue so that the
work for a synchronous error is processed before returning to user-space.
The trick ensures that the corrupted page is unmapped and poisoned. After
returning to user-space, the task restarts at the current instruction, which
triggers a page fault in which the kernel sends SIGBUS to the current
process due to VM_FAULT_HWPOISON.

However, memory failure recovery for hwpoison-aware mechanisms does not
work as expected. For example, hwpoison-aware user-space processes like
QEMU register their customized SIGBUS handler and enable early kill mode by
setting PF_MCE_EARLY at initialization. The kernel will then directly notify
the process by sending a SIGBUS signal in memory failure, but with the wrong
si_code: the actual user-space process is accessing the corrupt memory
location, yet its memory failure work is handled in a kthread context, so
kill_proc() will send SIGBUS with si_code BUS_MCEERR_AO to that process
instead of BUS_MCEERR_AR.

To this end, separate synchronous and asynchronous error handling into
different paths, as the x86 platform does:

- valid synchronous errors: queue a task_work to synchronously send SIGBUS
before ret_to_user.
- valid asynchronous errors: queue a work into workqueue to asynchronously
handle memory failure.
- abnormal branches such as invalid PA, unexpected severity, no memory
failure config support, invalid GUID section, OOM, etc.: do a force kill.

Then, for valid synchronous errors, the current context in memory failure
belongs exactly to the task consuming the poisoned data, and it will send
SIGBUS with the proper si_code.

Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
Signed-off-by: Shuai Xue <[email protected]>
Tested-by: Ma Wupeng <[email protected]>
Reviewed-by: Kefeng Wang <[email protected]>
Reviewed-by: Xiaofei Tan <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
---
arch/x86/kernel/cpu/mce/core.c | 9 +---
drivers/acpi/apei/ghes.c | 84 +++++++++++++++++++++-------------
include/acpi/ghes.h | 3 --
mm/memory-failure.c | 17 ++-----
4 files changed, 56 insertions(+), 57 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2eec60f50057..2ebaaa494ac4 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1311,17 +1311,10 @@ static void kill_me_maybe(struct callback_head *cb)
return;
}

- /*
- * -EHWPOISON from memory_failure() means that it already sent SIGBUS
- * to the current process with the proper error info,
- * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
- *
- * In both cases, no further processing is required.
- */
if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
return;

- pr_err("Memory error not recovered");
+ pr_err("Sending SIGBUS to current task due to memory error not recovered");
kill_me_now(cb);
}

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index c479b85899f5..b41d4e462b36 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -452,28 +452,41 @@ static void ghes_clear_estatus(struct ghes *ghes,
}

/*
- * Called as task_work before returning to user-space.
- * Ensure any queued work has been done before we return to the context that
- * triggered the notification.
+ * struct sync_task_work - for synchronous RAS event
+ *
+ * @twork: callback_head for task work
+ * @pfn: page frame number of corrupted page
+ * @flags: fine tune action taken
+ *
+ * Structure to pass task work to be handled before
+ * ret_to_user via task_work_add().
*/
-static void ghes_kick_task_work(struct callback_head *head)
+struct sync_task_work {
+ struct callback_head twork;
+ u64 pfn;
+ int flags;
+};
+
+static void memory_failure_cb(struct callback_head *twork)
{
- struct acpi_hest_generic_status *estatus;
- struct ghes_estatus_node *estatus_node;
- u32 node_len;
+ int ret;
+ struct sync_task_work *twcb =
+ container_of(twork, struct sync_task_work, twork);

- estatus_node = container_of(head, struct ghes_estatus_node, task_work);
- if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
- memory_failure_queue_kick(estatus_node->task_work_cpu);
+ ret = memory_failure(twcb->pfn, twcb->flags);
+ kfree(twcb);

- estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
- node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
- gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
+ if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP)
+ return;
+
+ pr_err("Sending SIGBUS to current task due to memory error not recovered");
+ force_sig(SIGBUS);
}

static bool ghes_do_memory_failure(u64 physical_addr, int flags)
{
unsigned long pfn;
+ struct sync_task_work *twcb;

if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
return false;
@@ -486,6 +499,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
return false;
}

+ if (flags == MF_ACTION_REQUIRED && current->mm) {
+ twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
+ if (!twcb)
+ return false;
+
+ twcb->pfn = pfn;
+ twcb->flags = flags;
+ init_task_work(&twcb->twork, memory_failure_cb);
+ task_work_add(current, &twcb->twork, TWA_RESUME);
+ return true;
+ }
+
memory_failure_queue(pfn, flags);
return true;
}
@@ -654,7 +679,7 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
schedule_work(&entry->work);
}

-static bool ghes_do_proc(struct ghes *ghes,
+static void ghes_do_proc(struct ghes *ghes,
const struct acpi_hest_generic_status *estatus)
{
int sev, sec_sev;
@@ -698,7 +723,14 @@ static bool ghes_do_proc(struct ghes *ghes,
}
}

- return queued;
+ /*
+ * If no memory failure work is queued for abnormal synchronous
+ * errors, do a force kill.
+ */
+ if (sync && !queued) {
+ pr_err("Sending SIGBUS to current task due to memory error not recovered");
+ force_sig(SIGBUS);
+ }
}

static void __ghes_print_estatus(const char *pfx,
@@ -1000,9 +1032,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
struct ghes_estatus_node *estatus_node;
struct acpi_hest_generic *generic;
struct acpi_hest_generic_status *estatus;
- bool task_work_pending;
u32 len, node_len;
- int ret;

llnode = llist_del_all(&ghes_estatus_llist);
/*
@@ -1017,25 +1047,16 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
len = cper_estatus_len(estatus);
node_len = GHES_ESTATUS_NODE_LEN(len);
- task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
+
+ ghes_do_proc(estatus_node->ghes, estatus);
+
if (!ghes_estatus_cached(estatus)) {
generic = estatus_node->generic;
if (ghes_print_estatus(NULL, generic, estatus))
ghes_estatus_cache_add(generic, estatus);
}
-
- if (task_work_pending && current->mm) {
- estatus_node->task_work.func = ghes_kick_task_work;
- estatus_node->task_work_cpu = smp_processor_id();
- ret = task_work_add(current, &estatus_node->task_work,
- TWA_RESUME);
- if (ret)
- estatus_node->task_work.func = NULL;
- }
-
- if (!estatus_node->task_work.func)
- gen_pool_free(ghes_estatus_pool,
- (unsigned long)estatus_node, node_len);
+ gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
+ node_len);

llnode = next;
}
@@ -1096,7 +1117,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,

estatus_node->ghes = ghes;
estatus_node->generic = ghes->generic;
- estatus_node->task_work.func = NULL;
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);

if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 3c8bba9f1114..e5e0c308d27f 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -35,9 +35,6 @@ struct ghes_estatus_node {
struct llist_node llnode;
struct acpi_hest_generic *generic;
struct ghes *ghes;
-
- int task_work_cpu;
- struct callback_head task_work;
};

struct ghes_estatus_cache {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fae9baf3be16..3aef483ca3c6 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2073,7 +2073,9 @@ static DEFINE_MUTEX(mf_mutex);
*
* Return: 0 for successfully handled the memory error,
* -EOPNOTSUPP for hwpoison_filter() filtered the error event,
- * < 0(except -EOPNOTSUPP) on failure.
+ * -EHWPOISON for already sent SIGBUS to the current process with
+ * the proper error info,
+ * other negative error code on failure.
*/
int memory_failure(unsigned long pfn, int flags)
{
@@ -2355,19 +2357,6 @@ static void memory_failure_work_func(struct work_struct *work)
}
}

-/*
- * Process memory_failure work queued on the specified CPU.
- * Used to avoid return-to-userspace racing with the memory_failure workqueue.
- */
-void memory_failure_queue_kick(int cpu)
-{
- struct memory_failure_cpu *mf_cpu;
-
- mf_cpu = &per_cpu(memory_failure_cpu, cpu);
- cancel_work_sync(&mf_cpu->work);
- memory_failure_work_func(&mf_cpu->work);
-}
-
static int __init memory_failure_init(void)
{
struct memory_failure_cpu *mf_cpu;
--
2.20.1.12.g72788fdb

2023-04-17 01:35:41

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v7 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code

changes since v6:
- add a more explicit error message, as suggested by Xiaofei
- pick up reviewed-by tag from Xiaofei
- pick up internal reviewed-by tag from Baolin

changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure()
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure error such as invalid PA,
unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter

changes since v1:
- synchronous events by notify type
- Link: https://lore.kernel.org/lkml/[email protected]/

Shuai Xue (2):
ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
synchronous events
ACPI: APEI: handle synchronous exceptions in task work

arch/x86/kernel/cpu/mce/core.c | 9 +--
drivers/acpi/apei/ghes.c | 113 ++++++++++++++++++++++-----------
include/acpi/ghes.h | 3 -
mm/memory-failure.c | 17 +----
4 files changed, 79 insertions(+), 63 deletions(-)

--
2.20.1.12.g72788fdb

2023-04-17 01:35:46

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v7 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

There are two major types of uncorrected recoverable (UCR) errors:

- Action Required (AR): The error is detected and the processor has already
consumed the memory. The OS is required to take action (for example, offline
the failing page or kill the failing thread) to recover from this
uncorrectable error.

- Action Optional (AO): The error is detected outside of processor execution
context. Some data in the memory are corrupted, but the data have not
been consumed. The OS may optionally take action to recover from this
uncorrectable error.

The essential difference between AR and AO errors is that AR is a
synchronous event, while AO is an asynchronous event. The hardware will
signal a synchronous exception (Machine Check Exception on X86 and
Synchronous External Abort on Arm64) when an error is detected and the
memory access has been architecturally executed.

When APEI firmware-first is enabled, a platform may describe one error
source for the handling of synchronous errors (e.g. MCE or SEA notification),
or for handling asynchronous errors (e.g. SCI or External Interrupt
notification). In other words, we can distinguish synchronous errors by the
APEI notification type. For AR errors, the kernel kills the current process
accessing the poisoned page by sending SIGBUS with BUS_MCEERR_AR. In
addition, for AO errors, the kernel notifies the process that owns the
poisoned page by sending SIGBUS with BUS_MCEERR_AO in early kill mode.
However, the GHES driver always sets mf_flags to 0, so all UCR errors
are handled as AO errors in memory failure.

To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
events.

Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")
Signed-off-by: Shuai Xue <[email protected]>
Tested-by: Ma Wupeng <[email protected]>
Reviewed-by: Kefeng Wang <[email protected]>
Reviewed-by: Xiaofei Tan <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
---
drivers/acpi/apei/ghes.c | 29 +++++++++++++++++++++++------
1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 34ad071a64e9..c479b85899f5 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
}

+/*
+ * A platform may describe one error source for the handling of synchronous
+ * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
+ * or External Interrupt). On x86, the HEST notifications are always
+ * asynchronous, so only SEA on ARM is delivered as a synchronous
+ * notification.
+ */
+static inline bool is_hest_sync_notify(struct ghes *ghes)
+{
+ u8 notify_type = ghes->generic->notify.type;
+
+ return notify_type == ACPI_HEST_NOTIFY_SEA;
+}
+
/*
* This driver isn't really modular, however for the time being,
* continuing to use module_param is the easiest way to remain
@@ -477,7 +491,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
}

static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
- int sev)
+ int sev, bool sync)
{
int flags = -1;
int sec_sev = ghes_severity(gdata->error_severity);
@@ -491,7 +505,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
(gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
flags = MF_SOFT_OFFLINE;
if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
- flags = 0;
+ flags = sync ? MF_ACTION_REQUIRED : 0;

if (flags != -1)
return ghes_do_memory_failure(mem_err->physical_addr, flags);
@@ -499,9 +513,11 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
return false;
}

-static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
+static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
+ int sev, bool sync)
{
struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
+ int flags = sync ? MF_ACTION_REQUIRED : 0;
bool queued = false;
int sec_sev, i;
char *p;
@@ -526,7 +542,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
* and don't filter out 'corrected' error here.
*/
if (is_cache && has_pa) {
- queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
+ queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
p += err_info->length;
continue;
}
@@ -647,6 +663,7 @@ static bool ghes_do_proc(struct ghes *ghes,
const guid_t *fru_id = &guid_null;
char *fru_text = "";
bool queued = false;
+ bool sync = is_hest_sync_notify(ghes);

sev = ghes_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
@@ -664,13 +681,13 @@ static bool ghes_do_proc(struct ghes *ghes,
atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);

arch_apei_report_mem_error(sev, mem_err);
- queued = ghes_handle_memory_failure(gdata, sev);
+ queued = ghes_handle_memory_failure(gdata, sev, sync);
}
else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
ghes_handle_aer(gdata);
}
else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
- queued = ghes_handle_arm_hw_error(gdata, sev);
+ queued = ghes_handle_arm_hw_error(gdata, sev, sync);
} else {
void *err = acpi_hest_get_payload(gdata);

--
2.20.1.12.g72788fdb

2023-04-24 06:38:00

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v7 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code



On 2023/4/17 AM9:14, Shuai Xue wrote:
> changes since v6:
> - add more explicty error message suggested by Xiaofei
> - pick up reviewed-by tag from Xiaofei
> - pick up internal reviewed-by tag from Baolin
>
> changes since v5 by addressing comments from Kefeng:
> - document return value of memory_failure()
> - drop redundant comments in call site of memory_failure()
> - make ghes_do_proc void and handle abnormal case within it
> - pick up reviewed-by tag from Kefeng Wang
>
> changes since v4 by addressing comments from Xiaofei:
> - do a force kill only for abnormal sync errors
>
> changes since v3 by addressing comments from Xiaofei:
> - do a force kill for abnormal memory failure error such as invalid PA,
> unexpected severity, OOM, etc
> - pcik up tested-by tag from Ma Wupeng
>
> changes since v2 by addressing comments from Naoya:
> - rename mce_task_work to sync_task_work
> - drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
> - add steps to reproduce this problem in cover letter
>
> changes since v1:
> - synchronous events by notify type
> - Link: https://lore.kernel.org/lkml/[email protected]/
>
> Shuai Xue (2):
> ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
> synchronous events
> ACPI: APEI: handle synchronous exceptions in task work
>
> arch/x86/kernel/cpu/mce/core.c | 9 +--
> drivers/acpi/apei/ghes.c | 113 ++++++++++++++++++++++-----------
> include/acpi/ghes.h | 3 -
> mm/memory-failure.c | 17 +----
> 4 files changed, 79 insertions(+), 63 deletions(-)
>

Hi, Rafael,

Gentle ping. Are you happy to queue this patch set into your next tree so
that it can be merged in the next merge window?

Thank you.

Best Regards,
Shuai

2023-05-08 02:28:20

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v7 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code



On 2023/4/24 14:24, Shuai Xue wrote:
>
>
> On 2023/4/17 AM9:14, Shuai Xue wrote:
>> changes since v6:
>> - add more explicty error message suggested by Xiaofei
>> - pick up reviewed-by tag from Xiaofei
>> - pick up internal reviewed-by tag from Baolin
>>
>> changes since v5 by addressing comments from Kefeng:
>> - document return value of memory_failure()
>> - drop redundant comments in call site of memory_failure()
>> - make ghes_do_proc void and handle abnormal case within it
>> - pick up reviewed-by tag from Kefeng Wang
>>
>> changes since v4 by addressing comments from Xiaofei:
>> - do a force kill only for abnormal sync errors
>>
>> changes since v3 by addressing comments from Xiaofei:
>> - do a force kill for abnormal memory failure error such as invalid PA,
>> unexpected severity, OOM, etc
>> - pcik up tested-by tag from Ma Wupeng
>>
>> changes since v2 by addressing comments from Naoya:
>> - rename mce_task_work to sync_task_work
>> - drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
>> - add steps to reproduce this problem in cover letter
>>
>> changes since v1:
>> - synchronous events by notify type
>> - Link: https://lore.kernel.org/lkml/[email protected]/
>>
>> Shuai Xue (2):
>> ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
>> synchronous events
>> ACPI: APEI: handle synchronous exceptions in task work
>>
>> arch/x86/kernel/cpu/mce/core.c | 9 +--
>> drivers/acpi/apei/ghes.c | 113 ++++++++++++++++++++++-----------
>> include/acpi/ghes.h | 3 -
>> mm/memory-failure.c | 17 +----
>> 4 files changed, 79 insertions(+), 63 deletions(-)
>>
>
> Hi, Rafael,
>
> Gentle ping. Are you happy to queue this patch set into your next tree, so that we can merge
> that in next merge window.
>
> Thank you.
>

Gentle ping :)

Thanks.

> Best Regards,
> Shuai


2023-09-19 02:22:07

by Shuai Xue

[permalink] [raw]
Subject: [RESEND PATCH v8 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code

Hi, ALL,

I have rewritten the cover letter in the hope that the maintainers will fully
understand the necessity of this patch set. Both Alibaba and Huawei have hit
the same issue in production, and we hope it can be fixed as soon as possible.

changes since v7:
- rebase to Linux v6.6-rc2 (no code changed)
- rewritten the cover letter to explain the motivation of this patchset

changes since v6:
- add a more explicit error message, as suggested by Xiaofei
- pick up reviewed-by tag from Xiaofei
- pick up internal reviewed-by tag from Baolin

changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure()
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure error such as invalid PA,
unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter

changes since v1:
- synchronous events by notify type
- Link: https://lore.kernel.org/lkml/[email protected]/


There are two major types of uncorrected recoverable (UCR) errors:

- Action Required (AR): The error is detected and the processor has already
consumed the memory. The OS is required to take action (for example, offline
the failing page or kill the failing thread) to recover from this error.

- Action Optional (AO): The error is detected outside of processor execution
context. Some data in the memory are corrupted, but the data have not
been consumed. The OS may optionally take action to recover from this error.

The main difference between AR and AO errors is that AR errors are synchronous
events, while AO errors are asynchronous events. Synchronous exceptions, such as
Machine Check Exception (MCE) on X86 and Synchronous External Abort (SEA) on
Arm64, are signaled by the hardware when an error is detected and the memory
access has architecturally been executed.

Currently, both synchronous and asynchronous errors are queued as AO errors and
handled by a dedicated kernel thread in a work queue on the ARM64 platform. For
synchronous errors, memory_failure() is synced using a cancel_work_sync trick to
ensure that the corrupted page is unmapped and poisoned. Upon returning to
user-space, the process resumes at the current instruction, triggering a page
fault. As a result, the kernel sends a SIGBUS signal to the current process due
to VM_FAULT_HWPOISON.

However, this trick is not always effective, so this patch set improves the
recovery process in three specific aspects:

1. Handle synchronous exceptions with proper si_code

ghes_handle_memory_failure() queues both synchronous and asynchronous errors
with flags=0. The kernel then notifies the process by sending a SIGBUS signal
in memory_failure() with the wrong si_code: BUS_MCEERR_AO is delivered to the
actual user-space process instead of BUS_MCEERR_AR. User-space processes rely
on the si_code to decide how to handle the memory failure.

For example, hwpoison-aware user-space processes use the si_code:
BUS_MCEERR_AO for 'action optional' early notifications, and BUS_MCEERR_AR
for 'action required' synchronous/late notifications. Specifically, when a
SIGBUS with si_code BUS_MCEERR_AR is delivered to QEMU, it will inject a vSEA
into the guest kernel. In contrast, a SIGBUS with BUS_MCEERR_AO will be
ignored by QEMU.[1]

Fix it by setting the memory failure flags to MF_ACTION_REQUIRED on synchronous events. (PATCH 1)
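
As an illustration only (this is not QEMU's actual handler, nor part of this
patch set), a hwpoison-aware user-space process might distinguish the two
cases in its SIGBUS handler roughly as follows; BUS_MCEERR_* are exposed by
<signal.h> when _GNU_SOURCE is defined:

#define _GNU_SOURCE
#include <signal.h>
#include <unistd.h>

/* Sketch of a SIGBUS handler that tells AR from AO by si_code. */
static void sigbus_handler(int sig, siginfo_t *si, void *ucontext)
{
	if (si->si_code == BUS_MCEERR_AR)
		/* Poison was consumed synchronously: must act now. */
		write(STDERR_FILENO, "BUS_MCEERR_AR\n", 14);
	else if (si->si_code == BUS_MCEERR_AO)
		/* Early notification: may act later or ignore. */
		write(STDERR_FILENO, "BUS_MCEERR_AO\n", 14);
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction = sigbus_handler,
		.sa_flags = SA_SIGINFO,
	};

	sigaction(SIGBUS, &sa, NULL);
	/* ... run the workload that may touch poisoned memory ... */
	return 0;
}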

2. Handle abnormal memory_failure() failures to avoid an unnecessary reboot

If a process maps the faulting page but memory_failure() returns abnormally
before try_to_unmap() (for example, because the faulting page is mapped as a
KSM page), arm64 cannot use the page fault path to terminate the
synchronous exception loop.[4]

This loop can potentially exceed the platform firmware threshold or even trigger
a kernel hard lockup, leading to a system reboot. However, kernel has the
capability to recover from this error.

Fix it by performing a force kill when memory_failure() fails abnormally or when
other abnormal synchronous errors occur. These errors can include situations
such as invalid PA, unexpected severity, no memory failure config support,
invalid GUID section, OOM, etc. (PATCH 2)

3. Handle memory_failure() in the context of the process that consumed the poison

When a synchronous error occurs, memory_failure() assumes that the current
process context is exactly the one that consumed the poison.

For example, kill_accessing_process() holds the mmap lock of current->mm, does
a page table walk to find the error virtual address, and sends SIGBUS to the
current process with the error info. However, the mm of a kworker is not valid,
resulting in a null-pointer dereference. I have fixed this in [3]:

commit 77677cdbc2aa ("mm,hwpoison: check mm when killing accessing process")

Another example is that collect_procs()/kill_procs() walk the task list and
only collect and send SIGBUS to the task that consumed the poison. But
memory_failure() is queued and handled by a dedicated kernel thread on the
arm64 platform.

Fix it by queuing memory_failure() as a task_work which runs in the current
execution context to synchronously send SIGBUS before ret_to_user; see the
condensed sketch below. (PATCH 2)
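
For reference, here is a condensed sketch of the task_work pattern that
PATCH 2 uses, simplified from the patch later in this thread (allocation
failure handling and config checks omitted):

struct sync_task_work {
	struct callback_head twork;
	u64 pfn;
	int flags;
};

static void memory_failure_cb(struct callback_head *twork)
{
	struct sync_task_work *twcb =
		container_of(twork, struct sync_task_work, twork);

	/* Runs in the context of the task that consumed the poison. */
	memory_failure(twcb->pfn, twcb->flags);
	kfree(twcb);
}

/* In the GHES path, for a synchronous (MF_ACTION_REQUIRED) error: */
twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
twcb->pfn = pfn;
twcb->flags = flags;
init_task_work(&twcb->twork, memory_failure_cb);
task_work_add(current, &twcb->twork, TWA_RESUME);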

** In summary, this patch set handles synchronous errors in task work with
the proper si_code so that hwpoison-aware processes can recover from errors,
and fixes (potentially) abnormal cases. **

Lv Ying and XiuQi from Huawei also proposed to address a similar problem [2][4].
Thanks to them for the discussions.

To reproduce this problem:

# STEP1: enable early kill mode
#sysctl -w vm.memory_failure_early_kill=1
vm.memory_failure_early_kill = 1

# STEP2: inject an UCE error and consume it to trigger a synchronous error
#einj_mem_uc single
0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
injecting ...
triggering ...
signal 7 code 5 addr 0xffffb0d75000
page not present
Test passed

The si_code (code 5) reported by einj_mem_uc indicates a BUS_MCEERR_AO error,
which is not what actually happened.

After this patch set:

# STEP1: enable early kill mode
#sysctl -w vm.memory_failure_early_kill=1
vm.memory_failure_early_kill = 1

# STEP2: inject an UCE error and consume it to trigger a synchronous error
#einj_mem_uc single
0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
injecting ...
triggering ...
signal 7 code 4 addr 0xffffb0d75000
page not present
Test passed

The si_code (code 4) reported by einj_mem_uc indicates a BUS_MCEERR_AR error,
as we expected.

[1] Add ARMv8 RAS virtualization support in QEMU https://patchew.org/QEMU/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lkml.kernel.org/r/[email protected]
[4] https://lore.kernel.org/lkml/[email protected]/

Shuai Xue (2):
ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
synchronous events
ACPI: APEI: handle synchronous exceptions in task work

arch/x86/kernel/cpu/mce/core.c | 9 +--
drivers/acpi/apei/ghes.c | 113 ++++++++++++++++++++++-----------
include/acpi/ghes.h | 3 -
mm/memory-failure.c | 17 +----
4 files changed, 79 insertions(+), 63 deletions(-)

--
2.39.3

2023-09-19 02:22:17

by Shuai Xue

[permalink] [raw]
Subject: [RESEND PATCH v8 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

There are two major types of uncorrected recoverable (UCR) errors:

- Action Required (AR): The error is detected and the processor has already
  consumed the memory. The OS is required to take action (for example, offline
  the failing page or kill the failing thread) to recover from this
  uncorrectable error.

- Action Optional (AO): The error is detected out of processor execution
  context. Some data in the memory are corrupted, but the data have not
  been consumed. The OS may optionally take action to recover from this
  uncorrectable error.

The essential difference between AR and AO errors is that AR is a
synchronous event, while AO is an asynchronous event. The hardware will
signal a synchronous exception (Machine Check Exception on X86 and
Synchronous External Abort on Arm64) when an error is detected and the
memory access has been architecturally executed.

When APEI firmware-first is enabled, a platform may describe one error
source for the handling of synchronous errors (e.g. MCE or SEA
notification), or for handling asynchronous errors (e.g. SCI or External
Interrupt notification). In other words, we can distinguish synchronous
errors by the APEI notification type. For AR errors, the kernel will kill
the current process that is accessing the poisoned page by sending SIGBUS
with BUS_MCEERR_AR. In addition, for AO errors, the kernel will notify the
process that owns the poisoned page by sending SIGBUS with BUS_MCEERR_AO in
early kill mode. However, the GHES driver always sets mf_flags to 0, so all
UCR errors are handled as AO errors in memory_failure().

To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
events.

Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")
Signed-off-by: Shuai Xue <[email protected]>
Tested-by: Ma Wupeng <[email protected]>
Reviewed-by: Kefeng Wang <[email protected]>
Reviewed-by: Xiaofei Tan <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
---
drivers/acpi/apei/ghes.c | 29 +++++++++++++++++++++++------
1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index ef59d6ea16da..88178aa6222d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
}

+/*
+ * A platform may describe one error source for the handling of synchronous
+ * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
+ * or External Interrupt). On x86, the HEST notifications are always
+ * asynchronous, so only SEA on ARM is delivered as a synchronous
+ * notification.
+ */
+static inline bool is_hest_sync_notify(struct ghes *ghes)
+{
+ u8 notify_type = ghes->generic->notify.type;
+
+ return notify_type == ACPI_HEST_NOTIFY_SEA;
+}
+
/*
* This driver isn't really modular, however for the time being,
* continuing to use module_param is the easiest way to remain
@@ -475,7 +489,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
}

static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
- int sev)
+ int sev, bool sync)
{
int flags = -1;
int sec_sev = ghes_severity(gdata->error_severity);
@@ -489,7 +503,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
(gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
flags = MF_SOFT_OFFLINE;
if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
- flags = 0;
+ flags = sync ? MF_ACTION_REQUIRED : 0;

if (flags != -1)
return ghes_do_memory_failure(mem_err->physical_addr, flags);
@@ -497,9 +511,11 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
return false;
}

-static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
+static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
+ int sev, bool sync)
{
struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
+ int flags = sync ? MF_ACTION_REQUIRED : 0;
bool queued = false;
int sec_sev, i;
char *p;
@@ -524,7 +540,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
* and don't filter out 'corrected' error here.
*/
if (is_cache && has_pa) {
- queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
+ queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
p += err_info->length;
continue;
}
@@ -645,6 +661,7 @@ static bool ghes_do_proc(struct ghes *ghes,
const guid_t *fru_id = &guid_null;
char *fru_text = "";
bool queued = false;
+ bool sync = is_hest_sync_notify(ghes);

sev = ghes_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
@@ -662,13 +679,13 @@ static bool ghes_do_proc(struct ghes *ghes,
atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);

arch_apei_report_mem_error(sev, mem_err);
- queued = ghes_handle_memory_failure(gdata, sev);
+ queued = ghes_handle_memory_failure(gdata, sev, sync);
}
else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
ghes_handle_aer(gdata);
}
else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
- queued = ghes_handle_arm_hw_error(gdata, sev);
+ queued = ghes_handle_arm_hw_error(gdata, sev, sync);
} else {
void *err = acpi_hest_get_payload(gdata);

--
2.39.3

2023-09-19 02:23:08

by Shuai Xue

[permalink] [raw]
Subject: [RESEND PATCH v8 2/2] ACPI: APEI: handle synchronous exceptions in task work

Hardware errors could be signaled by an asynchronous interrupt, e.g. when an
error is detected by a background scrubber, or signaled by a synchronous
exception, e.g. when an uncorrected error is consumed. Currently, both
synchronous and asynchronous errors are queued and handled by a dedicated
kthread in a workqueue.

commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
synchronous errors") keeps track of whether memory_failure() work was
queued, and makes task_work pending to flush out the workqueue so that the
work for a synchronous error is processed before returning to user-space.
The trick ensures that the corrupted page is unmapped and poisoned. After
returning to user-space, the task restarts at the current instruction, which
triggers a page fault, in which the kernel will send SIGBUS to the current
process due to VM_FAULT_HWPOISON.

However, the memory failure recovery for hwpoison-aware mechanisms does not
work as expected. For example, hwpoison-aware user-space processes like
QEMU register their customized SIGBUS handler and enable early kill mode by
setting PF_MCE_EARLY at initialization. Then the kernel will directly notify
the process by sending a SIGBUS signal in memory_failure(), but with the
wrong si_code: the actual user-space process is accessing the corrupted
memory location, but its memory failure work is handled in a kthread
context, so kill_proc() will send SIGBUS with si_code BUS_MCEERR_AO to that
process instead of BUS_MCEERR_AR.
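
As an illustration of the user-space side (this is not QEMU's actual
initialization code and is not part of this patch), a process can opt into
early kill mode, which sets PF_MCE_EARLY on the task, via prctl():

#include <sys/prctl.h>
#include <stdio.h>

int main(void)
{
	/* Request early BUS_MCEERR_AO notifications for this task. */
	if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0))
		perror("prctl(PR_MCE_KILL)");

	/* ... register a SIGBUS handler with SA_SIGINFO here ... */
	return 0;
}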

To this end, separate synchronous and asynchronous error handling into
different paths like the X86 platform does:

- valid synchronous errors: queue a task_work to synchronously send SIGBUS
before ret_to_user.
- valid asynchronous errors: queue a work into a workqueue to asynchronously
handle the memory failure.
- abnormal branches such as invalid PA, unexpected severity, no memory
failure config support, invalid GUID section, OOM, etc.: do a force kill.

Then, for valid synchronous errors, the current context in memory_failure()
belongs exactly to the task consuming the poisoned data, and it will send
SIGBUS with the proper si_code.

Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
Signed-off-by: Shuai Xue <[email protected]>
Tested-by: Ma Wupeng <[email protected]>
Reviewed-by: Kefeng Wang <[email protected]>
Reviewed-by: Xiaofei Tan <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
---
arch/x86/kernel/cpu/mce/core.c | 9 +---
drivers/acpi/apei/ghes.c | 84 +++++++++++++++++++++-------------
include/acpi/ghes.h | 3 --
mm/memory-failure.c | 17 ++-----
4 files changed, 56 insertions(+), 57 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6f35f724cc14..1675ff77033d 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1334,17 +1334,10 @@ static void kill_me_maybe(struct callback_head *cb)
return;
}

- /*
- * -EHWPOISON from memory_failure() means that it already sent SIGBUS
- * to the current process with the proper error info,
- * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
- *
- * In both cases, no further processing is required.
- */
if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
return;

- pr_err("Memory error not recovered");
+ pr_err("Sending SIGBUS to current task due to memory error not recovered");
kill_me_now(cb);
}

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 88178aa6222d..014401a65ed5 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -450,28 +450,41 @@ static void ghes_clear_estatus(struct ghes *ghes,
}

/*
- * Called as task_work before returning to user-space.
- * Ensure any queued work has been done before we return to the context that
- * triggered the notification.
+ * struct sync_task_work - for synchronous RAS event
+ *
+ * @twork: callback_head for task work
+ * @pfn: page frame number of corrupted page
+ * @flags: fine tune action taken
+ *
+ * Structure to pass task work to be handled before
+ * ret_to_user via task_work_add().
*/
-static void ghes_kick_task_work(struct callback_head *head)
+struct sync_task_work {
+ struct callback_head twork;
+ u64 pfn;
+ int flags;
+};
+
+static void memory_failure_cb(struct callback_head *twork)
{
- struct acpi_hest_generic_status *estatus;
- struct ghes_estatus_node *estatus_node;
- u32 node_len;
+ int ret;
+ struct sync_task_work *twcb =
+ container_of(twork, struct sync_task_work, twork);

- estatus_node = container_of(head, struct ghes_estatus_node, task_work);
- if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
- memory_failure_queue_kick(estatus_node->task_work_cpu);
+ ret = memory_failure(twcb->pfn, twcb->flags);
+ kfree(twcb);

- estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
- node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
- gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
+ if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP)
+ return;
+
+ pr_err("Sending SIGBUS to current task due to memory error not recovered");
+ force_sig(SIGBUS);
}

static bool ghes_do_memory_failure(u64 physical_addr, int flags)
{
unsigned long pfn;
+ struct sync_task_work *twcb;

if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
return false;
@@ -484,6 +497,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
return false;
}

+ if (flags == MF_ACTION_REQUIRED && current->mm) {
+ twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
+ if (!twcb)
+ return false;
+
+ twcb->pfn = pfn;
+ twcb->flags = flags;
+ init_task_work(&twcb->twork, memory_failure_cb);
+ task_work_add(current, &twcb->twork, TWA_RESUME);
+ return true;
+ }
+
memory_failure_queue(pfn, flags);
return true;
}
@@ -652,7 +677,7 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
schedule_work(&entry->work);
}

-static bool ghes_do_proc(struct ghes *ghes,
+static void ghes_do_proc(struct ghes *ghes,
const struct acpi_hest_generic_status *estatus)
{
int sev, sec_sev;
@@ -696,7 +721,14 @@ static bool ghes_do_proc(struct ghes *ghes,
}
}

- return queued;
+ /*
+ * If no memory failure work is queued for abnormal synchronous
+ * errors, do a force kill.
+ */
+ if (sync && !queued) {
+ pr_err("Sending SIGBUS to current task due to memory error not recovered");
+ force_sig(SIGBUS);
+ }
}

static void __ghes_print_estatus(const char *pfx,
@@ -998,9 +1030,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
struct ghes_estatus_node *estatus_node;
struct acpi_hest_generic *generic;
struct acpi_hest_generic_status *estatus;
- bool task_work_pending;
u32 len, node_len;
- int ret;

llnode = llist_del_all(&ghes_estatus_llist);
/*
@@ -1015,25 +1045,16 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
len = cper_estatus_len(estatus);
node_len = GHES_ESTATUS_NODE_LEN(len);
- task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
+
+ ghes_do_proc(estatus_node->ghes, estatus);
+
if (!ghes_estatus_cached(estatus)) {
generic = estatus_node->generic;
if (ghes_print_estatus(NULL, generic, estatus))
ghes_estatus_cache_add(generic, estatus);
}
-
- if (task_work_pending && current->mm) {
- estatus_node->task_work.func = ghes_kick_task_work;
- estatus_node->task_work_cpu = smp_processor_id();
- ret = task_work_add(current, &estatus_node->task_work,
- TWA_RESUME);
- if (ret)
- estatus_node->task_work.func = NULL;
- }
-
- if (!estatus_node->task_work.func)
- gen_pool_free(ghes_estatus_pool,
- (unsigned long)estatus_node, node_len);
+ gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
+ node_len);

llnode = next;
}
@@ -1094,7 +1115,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,

estatus_node->ghes = ghes;
estatus_node->generic = ghes->generic;
- estatus_node->task_work.func = NULL;
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);

if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 3c8bba9f1114..e5e0c308d27f 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -35,9 +35,6 @@ struct ghes_estatus_node {
struct llist_node llnode;
struct acpi_hest_generic *generic;
struct ghes *ghes;
-
- int task_work_cpu;
- struct callback_head task_work;
};

struct ghes_estatus_cache {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4d6e43c88489..80e1ea1cc56d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2163,7 +2163,9 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
*
* Return: 0 for successfully handled the memory error,
* -EOPNOTSUPP for hwpoison_filter() filtered the error event,
- * < 0(except -EOPNOTSUPP) on failure.
+ * -EHWPOISON for already sent SIGBUS to the current process with
+ * the proper error info,
+ * other negative error code on failure.
*/
int memory_failure(unsigned long pfn, int flags)
{
@@ -2445,19 +2447,6 @@ static void memory_failure_work_func(struct work_struct *work)
}
}

-/*
- * Process memory_failure work queued on the specified CPU.
- * Used to avoid return-to-userspace racing with the memory_failure workqueue.
- */
-void memory_failure_queue_kick(int cpu)
-{
- struct memory_failure_cpu *mf_cpu;
-
- mf_cpu = &per_cpu(memory_failure_cpu, cpu);
- cancel_work_sync(&mf_cpu->work);
- memory_failure_work_func(&mf_cpu->work);
-}
-
static int __init memory_failure_init(void)
{
struct memory_failure_cpu *mf_cpu;
--
2.39.3

2023-09-25 14:54:11

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [RESEND PATCH v8 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

On Tue Sep 19, 2023 at 5:21 AM EEST, Shuai Xue wrote:
> There are two major types of uncorrected recoverable (UCR) errors :
>
> - Action Required (AR): The error is detected and the processor already
> consumes the memory. OS requires to take action (for example, offline
> failure page/kill failure thread) to recover this uncorrectable error.
>
> - Action Optional (AO): The error is detected out of processor execution
> context. Some data in the memory are corrupted. But the data have not
> been consumed. OS is optional to take action to recover this
> uncorrectable error.
>
> The essential difference between AR and AO errors is that AR is a
> synchronous event, while AO is an asynchronous event. The hardware will
> signal a synchronous exception (Machine Check Exception on X86 and
> Synchronous External Abort on Arm64) when an error is detected and the
> memory access has been architecturally executed.
>
> When APEI firmware first is enabled, a platform may describe one error
> source for the handling of synchronous errors (e.g. MCE or SEA notification
> ), or for handling asynchronous errors (e.g. SCI or External Interrupt
> notification). In other words, we can distinguish synchronous errors by
> APEI notification. For AR errors, kernel will kill current process
> accessing the poisoned page by sending SIGBUS with BUS_MCEERR_AR. In
> addition, for AO errors, kernel will notify the process who owns the
> poisoned page by sending SIGBUS with BUS_MCEERR_AO in early kill mode.
> However, the GHES driver always sets mf_flags to 0 so that all UCR errors
> are handled as AO errors in memory failure.
>
> To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
> events.
>
> Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")'
> Signed-off-by: Shuai Xue <[email protected]>
> Tested-by: Ma Wupeng <[email protected]>
> Reviewed-by: Kefeng Wang <[email protected]>
> Reviewed-by: Xiaofei Tan <[email protected]>
> Reviewed-by: Baolin Wang <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 29 +++++++++++++++++++++++------
> 1 file changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index ef59d6ea16da..88178aa6222d 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
> return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
> }
>
> +/*
> + * A platform may describe one error source for the handling of synchronous
> + * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
> + * or External Interrupt). On x86, the HEST notifications are always
> + * asynchronous, so only SEA on ARM is delivered as a synchronous
> + * notification.
> + */
> +static inline bool is_hest_sync_notify(struct ghes *ghes)
> +{
> + u8 notify_type = ghes->generic->notify.type;
> +
> + return notify_type == ACPI_HEST_NOTIFY_SEA;
> +}
> +
> /*
> * This driver isn't really modular, however for the time being,
> * continuing to use module_param is the easiest way to remain
> @@ -475,7 +489,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> }
>
> static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> - int sev)
> + int sev, bool sync)
> {
> int flags = -1;
> int sec_sev = ghes_severity(gdata->error_severity);
> @@ -489,7 +503,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
> flags = MF_SOFT_OFFLINE;
> if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
> - flags = 0;
> + flags = sync ? MF_ACTION_REQUIRED : 0;

Not my territory, but this branching looks a bit weird to my eyes, so I'm
putting in a comment just in case.

What *if* the previous condition sets MF_SOFT_OFFLINE and
this condition overwrites the value?

I know that earlier it could have been overwritten by zero.

Nor does the function comment have any explanation of why it is
OK to overwrite like this.

Or, if these cannot happen simultaneously, why is there not an
immediate return after setting MF_SOFT_OFFLINE?

For someone like me, the function's logic is tediously hard
to understand, tbh.

BR, Jarkko

2023-09-25 15:10:57

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [RESEND PATCH v8 2/2] ACPI: APEI: handle synchronous exceptions in task work

On Tue Sep 19, 2023 at 5:21 AM EEST, Shuai Xue wrote:
> Hardware errors could be signaled by synchronous interrupt, e.g. when an
> error is detected by a background scrubber, or signaled by synchronous
> exception, e.g. when an uncorrected error is consumed. Both synchronous and
> asynchronous error are queued and handled by a dedicated kthread in
> workqueue.
>
> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
> synchronous errors") keep track of whether memory_failure() work was
> queued, and make task_work pending to flush out the workqueue so that the
> work for synchronous error is processed before returning to user-space.
> The trick ensures that the corrupted page is unmapped and poisoned. And
> after returning to user-space, the task starts at current instruction which
> triggering a page fault in which kernel will send SIGBUS to current process
> due to VM_FAULT_HWPOISON.
>
> However, the memory failure recovery for hwpoison-aware mechanisms does not
> work as expected. For example, hwpoison-aware user-space processes like
> QEMU register their customized SIGBUS handler and enable early kill mode by
> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
> the process by sending a SIGBUS signal in memory failure with wrong
> si_code: the actual user-space process accessing the corrupt memory
> location, but its memory failure work is handled in a kthread context, so
> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
> process instead of BUS_MCEERR_AR in kill_proc().
>
> To this end, separate synchronous and asynchronous error handling into
> different paths like X86 platform does:
>
> - valid synchronous errors: queue a task_work to synchronously send SIGBUS
> before ret_to_user.
> - valid asynchronous errors: queue a work into workqueue to asynchronously
> handle memory failure.
> - abnormal branches such as invalid PA, unexpected severity, no memory
> failure config support, invalid GUID section, OOM, etc.
>
> Then for valid synchronous errors, the current context in memory failure is
> exactly belongs to the task consuming poison data and it will send SIBBUS
> with proper si_code.
>
> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
> Signed-off-by: Shuai Xue <[email protected]>
> Tested-by: Ma Wupeng <[email protected]>
> Reviewed-by: Kefeng Wang <[email protected]>
> Reviewed-by: Xiaofei Tan <[email protected]>
> Reviewed-by: Baolin Wang <[email protected]>

Did 7f17b4a121d0 actually break something that was not broken before?

If not, this is (afaik) not a bug fix.

BR, Jarkko

2023-09-26 06:26:24

by Shuai Xue

[permalink] [raw]
Subject: Re: [RESEND PATCH v8 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events



On 2023/9/25 22:43, Jarkko Sakkinen wrote:
> On Tue Sep 19, 2023 at 5:21 AM EEST, Shuai Xue wrote:
>> There are two major types of uncorrected recoverable (UCR) errors :
>>
>> - Action Required (AR): The error is detected and the processor already
>> consumes the memory. OS requires to take action (for example, offline
>> failure page/kill failure thread) to recover this uncorrectable error.
>>
>> - Action Optional (AO): The error is detected out of processor execution
>> context. Some data in the memory are corrupted. But the data have not
>> been consumed. OS is optional to take action to recover this
>> uncorrectable error.
>>
>> The essential difference between AR and AO errors is that AR is a
>> synchronous event, while AO is an asynchronous event. The hardware will
>> signal a synchronous exception (Machine Check Exception on X86 and
>> Synchronous External Abort on Arm64) when an error is detected and the
>> memory access has been architecturally executed.
>>
>> When APEI firmware first is enabled, a platform may describe one error
>> source for the handling of synchronous errors (e.g. MCE or SEA notification
>> ), or for handling asynchronous errors (e.g. SCI or External Interrupt
>> notification). In other words, we can distinguish synchronous errors by
>> APEI notification. For AR errors, kernel will kill current process
>> accessing the poisoned page by sending SIGBUS with BUS_MCEERR_AR. In
>> addition, for AO errors, kernel will notify the process who owns the
>> poisoned page by sending SIGBUS with BUS_MCEERR_AO in early kill mode.
>> However, the GHES driver always sets mf_flags to 0 so that all UCR errors
>> are handled as AO errors in memory failure.
>>
>> To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
>> events.
>>
>> Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")'
>> Signed-off-by: Shuai Xue <[email protected]>
>> Tested-by: Ma Wupeng <[email protected]>
>> Reviewed-by: Kefeng Wang <[email protected]>
>> Reviewed-by: Xiaofei Tan <[email protected]>
>> Reviewed-by: Baolin Wang <[email protected]>
>> ---
>> drivers/acpi/apei/ghes.c | 29 +++++++++++++++++++++++------
>> 1 file changed, 23 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index ef59d6ea16da..88178aa6222d 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
>> return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
>> }
>>
>> +/*
>> + * A platform may describe one error source for the handling of synchronous
>> + * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
>> + * or External Interrupt). On x86, the HEST notifications are always
>> + * asynchronous, so only SEA on ARM is delivered as a synchronous
>> + * notification.
>> + */
>> +static inline bool is_hest_sync_notify(struct ghes *ghes)
>> +{
>> + u8 notify_type = ghes->generic->notify.type;
>> +
>> + return notify_type == ACPI_HEST_NOTIFY_SEA;
>> +}
>> +
>> /*
>> * This driver isn't really modular, however for the time being,
>> * continuing to use module_param is the easiest way to remain
>> @@ -475,7 +489,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>> }
>>
>> static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>> - int sev)
>> + int sev, bool sync)
>> {
>> int flags = -1;
>> int sec_sev = ghes_severity(gdata->error_severity);
>> @@ -489,7 +503,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>> (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
>> flags = MF_SOFT_OFFLINE;
>> if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
>> - flags = 0;
>> + flags = sync ? MF_ACTION_REQUIRED : 0;
>
> Not my territory but this branching looks a bit weird to my
> eyes so just in case putting a comment.
>
> What *if* the previous condition sets MF_SOFT_OFFLINE and
> this condition overwrites the value?
>
> I know that earlier it could have been overwritten by zero.
>
> Neither the function comment has any explanation why it is
> ok overwrite like this.
>
> Or if these cannot happen simultaenously why there is not
> immediate return after settting MF_SOFT_OFFLINE?
>
> For someone like me the functions logic is tediously hard
> to understand tbh.
>
> BR, Jarkko

Hi, Jarkko,

I hope the original source code can help with understanding:

/* iff following two events can be handled properly by now */
if (sec_sev == GHES_SEV_CORRECTED &&
(gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
flags = MF_SOFT_OFFLINE;
if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
flags = 0;

if (flags != -1)
return ghes_do_memory_failure(mem_err->physical_addr, flags);

The sec_sev of gdata is either GHES_SEV_CORRECTED or GHES_SEV_RECOVERABLE.
So the two if-conditions are independent of each other and cannot be true
simultaneously. ghes_do_memory_failure() then handles the two events with
properly set flags, as sketched below.
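
Purely as an illustration of that mutual exclusivity (not a proposed change
to the patch), the same logic could be written with early returns:

if (sec_sev == GHES_SEV_CORRECTED &&
    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
	return ghes_do_memory_failure(mem_err->physical_addr,
				      MF_SOFT_OFFLINE);

if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
	/* with this patch, 0 becomes sync ? MF_ACTION_REQUIRED : 0 */
	return ghes_do_memory_failure(mem_err->physical_addr, 0);

return false;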

Thanks.

Best Regards,
Shuai

2023-09-26 06:56:20

by Shuai Xue

[permalink] [raw]
Subject: Re: [RESEND PATCH v8 2/2] ACPI: APEI: handle synchronous exceptions in task work



On 2023/9/25 23:00, Jarkko Sakkinen wrote:
> On Tue Sep 19, 2023 at 5:21 AM EEST, Shuai Xue wrote:
>> Hardware errors could be signaled by synchronous interrupt, e.g. when an
>> error is detected by a background scrubber, or signaled by synchronous
>> exception, e.g. when an uncorrected error is consumed. Both synchronous and
>> asynchronous error are queued and handled by a dedicated kthread in
>> workqueue.
>>
>> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
>> synchronous errors") keep track of whether memory_failure() work was
>> queued, and make task_work pending to flush out the workqueue so that the
>> work for synchronous error is processed before returning to user-space.
>> The trick ensures that the corrupted page is unmapped and poisoned. And
>> after returning to user-space, the task starts at current instruction which
>> triggering a page fault in which kernel will send SIGBUS to current process
>> due to VM_FAULT_HWPOISON.
>>
>> However, the memory failure recovery for hwpoison-aware mechanisms does not
>> work as expected. For example, hwpoison-aware user-space processes like
>> QEMU register their customized SIGBUS handler and enable early kill mode by
>> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
>> the process by sending a SIGBUS signal in memory failure with wrong
>> si_code: the actual user-space process accessing the corrupt memory
>> location, but its memory failure work is handled in a kthread context, so
>> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
>> process instead of BUS_MCEERR_AR in kill_proc().
>>
>> To this end, separate synchronous and asynchronous error handling into
>> different paths like X86 platform does:
>>
>> - valid synchronous errors: queue a task_work to synchronously send SIGBUS
>> before ret_to_user.
>> - valid asynchronous errors: queue a work into workqueue to asynchronously
>> handle memory failure.
>> - abnormal branches such as invalid PA, unexpected severity, no memory
>> failure config support, invalid GUID section, OOM, etc.
>>
>> Then for valid synchronous errors, the current context in memory failure is
>> exactly belongs to the task consuming poison data and it will send SIBBUS
>> with proper si_code.
>>
>> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
>> Signed-off-by: Shuai Xue <[email protected]>
>> Tested-by: Ma Wupeng <[email protected]>
>> Reviewed-by: Kefeng Wang <[email protected]>
>> Reviewed-by: Xiaofei Tan <[email protected]>
>> Reviewed-by: Baolin Wang <[email protected]>
>
> Did 7f17b4a121d0 actually break something that was not broken before?
>
> If not, this is (afaik) not a bug fix.

Hi, Jarkko,

It did not. It keeps track of whether memory_failure() work was queued,
and makes task_work pending to flush out the queue. But if no work is queued
for a synchronous error because of an abnormal branch, it does not do a force
kill on the current process, resulting in a hard lockup due to the exception loop.

It is fine with me to remove the bug-fix tag if you insist on removing it.

Best Regards,
Shuai

2023-10-03 08:29:23

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [RESEND PATCH v8 2/2] ACPI: APEI: handle synchronous exceptions in task work

On Tue, Sep 19, 2023 at 10:21:27AM +0800, Shuai Xue wrote:
> Hardware errors could be signaled by synchronous interrupt, e.g. when an
> error is detected by a background scrubber, or signaled by synchronous
> exception, e.g. when an uncorrected error is consumed. Both synchronous and
> asynchronous error are queued and handled by a dedicated kthread in
> workqueue.
>
> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
> synchronous errors") keep track of whether memory_failure() work was
> queued, and make task_work pending to flush out the workqueue so that the
> work for synchronous error is processed before returning to user-space.
> The trick ensures that the corrupted page is unmapped and poisoned. And
> after returning to user-space, the task starts at current instruction which
> triggering a page fault in which kernel will send SIGBUS to current process
> due to VM_FAULT_HWPOISON.
>
> However, the memory failure recovery for hwpoison-aware mechanisms does not
> work as expected. For example, hwpoison-aware user-space processes like
> QEMU register their customized SIGBUS handler and enable early kill mode by
> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
> the process by sending a SIGBUS signal in memory failure with wrong
> si_code: the actual user-space process accessing the corrupt memory
> location, but its memory failure work is handled in a kthread context, so
> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
> process instead of BUS_MCEERR_AR in kill_proc().
>
> To this end, separate synchronous and asynchronous error handling into
> different paths like X86 platform does:
>
> - valid synchronous errors: queue a task_work to synchronously send SIGBUS
> before ret_to_user.
> - valid asynchronous errors: queue a work into workqueue to asynchronously
> handle memory failure.
> - abnormal branches such as invalid PA, unexpected severity, no memory
> failure config support, invalid GUID section, OOM, etc.
>
> Then for valid synchronous errors, the current context in memory failure is
> exactly belongs to the task consuming poison data and it will send SIBBUS
> with proper si_code.
>
> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
> Signed-off-by: Shuai Xue <[email protected]>
> Tested-by: Ma Wupeng <[email protected]>
> Reviewed-by: Kefeng Wang <[email protected]>
> Reviewed-by: Xiaofei Tan <[email protected]>
> Reviewed-by: Baolin Wang <[email protected]>
> ---
> arch/x86/kernel/cpu/mce/core.c | 9 +---
> drivers/acpi/apei/ghes.c | 84 +++++++++++++++++++++-------------
> include/acpi/ghes.h | 3 --
> mm/memory-failure.c | 17 ++-----
> 4 files changed, 56 insertions(+), 57 deletions(-)
>
...

> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 4d6e43c88489..80e1ea1cc56d 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2163,7 +2163,9 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> *
> * Return: 0 for successfully handled the memory error,
> * -EOPNOTSUPP for hwpoison_filter() filtered the error event,
> - * < 0(except -EOPNOTSUPP) on failure.
> + * -EHWPOISON for already sent SIGBUS to the current process with
> + * the proper error info,

The meaning of this comment is understood, but the sentence seems to be
a little too long. Could you sort this out with bullet points (like below)?

* Return values:
* 0 - success
* -EOPNOTSUPP - hwpoison_filter() filtered the error event.
* -EHWPOISON - sent SIGBUS to the current process with the proper
* error info by kill_accessing_process().
* other negative values - failure

> + * other negative error code on failure.
> */
> int memory_failure(unsigned long pfn, int flags)
> {
> @@ -2445,19 +2447,6 @@ static void memory_failure_work_func(struct work_struct *work)
> }
> }
>
> -/*
> - * Process memory_failure work queued on the specified CPU.
> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
> - */
> -void memory_failure_queue_kick(int cpu)
> -{
> - struct memory_failure_cpu *mf_cpu;
> -
> - mf_cpu = &per_cpu(memory_failure_cpu, cpu);
> - cancel_work_sync(&mf_cpu->work);
> - memory_failure_work_func(&mf_cpu->work);
> -}
> -

The declaration of memory_failure_queue_kick() still remains in include/linux/mm.h,
so you can remove it together.

Thanks,
Naoya Horiguchi

2023-10-07 02:03:23

by Shuai Xue

[permalink] [raw]
Subject: Re: [RESEND PATCH v8 2/2] ACPI: APEI: handle synchronous exceptions in task work



On 2023/10/3 16:28, Naoya Horiguchi wrote:
> On Tue, Sep 19, 2023 at 10:21:27AM +0800, Shuai Xue wrote:
>> Hardware errors could be signaled by synchronous interrupt, e.g. when an
>> error is detected by a background scrubber, or signaled by synchronous
>> exception, e.g. when an uncorrected error is consumed. Both synchronous and
>> asynchronous error are queued and handled by a dedicated kthread in
>> workqueue.
>>
>> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
>> synchronous errors") keep track of whether memory_failure() work was
>> queued, and make task_work pending to flush out the workqueue so that the
>> work for synchronous error is processed before returning to user-space.
>> The trick ensures that the corrupted page is unmapped and poisoned. And
>> after returning to user-space, the task starts at current instruction which
>> triggering a page fault in which kernel will send SIGBUS to current process
>> due to VM_FAULT_HWPOISON.
>>
>> However, the memory failure recovery for hwpoison-aware mechanisms does not
>> work as expected. For example, hwpoison-aware user-space processes like
>> QEMU register their customized SIGBUS handler and enable early kill mode by
>> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify
>> the process by sending a SIGBUS signal in memory failure with wrong
>> si_code: the actual user-space process accessing the corrupt memory
>> location, but its memory failure work is handled in a kthread context, so
>> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
>> process instead of BUS_MCEERR_AR in kill_proc().
>>
>> To this end, separate synchronous and asynchronous error handling into
>> different paths like X86 platform does:
>>
>> - valid synchronous errors: queue a task_work to synchronously send SIGBUS
>> before ret_to_user.
>> - valid asynchronous errors: queue a work into workqueue to asynchronously
>> handle memory failure.
>> - abnormal branches such as invalid PA, unexpected severity, no memory
>> failure config support, invalid GUID section, OOM, etc.
>>
>> Then for valid synchronous errors, the current context in memory failure is
>> exactly belongs to the task consuming poison data and it will send SIBBUS
>> with proper si_code.
>>
>> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
>> Signed-off-by: Shuai Xue <[email protected]>
>> Tested-by: Ma Wupeng <[email protected]>
>> Reviewed-by: Kefeng Wang <[email protected]>
>> Reviewed-by: Xiaofei Tan <[email protected]>
>> Reviewed-by: Baolin Wang <[email protected]>
>> ---
>> arch/x86/kernel/cpu/mce/core.c | 9 +---
>> drivers/acpi/apei/ghes.c | 84 +++++++++++++++++++++-------------
>> include/acpi/ghes.h | 3 --
>> mm/memory-failure.c | 17 ++-----
>> 4 files changed, 56 insertions(+), 57 deletions(-)
>>
> ...
>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 4d6e43c88489..80e1ea1cc56d 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -2163,7 +2163,9 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>> *
>> * Return: 0 for successfully handled the memory error,
>> * -EOPNOTSUPP for hwpoison_filter() filtered the error event,
>> - * < 0(except -EOPNOTSUPP) on failure.
>> + * -EHWPOISON for already sent SIGBUS to the current process with
>> + * the proper error info,
>
> The meaning of this comment is understood, but the sentence seems to be
> a little too long. Could you sort this out with bullet points (like below)?
>
> * Return values:
> * 0 - success
> * -EOPNOTSUPP - hwpoison_filter() filtered the error event.
> * -EHWPOISON - sent SIGBUS to the current process with the proper
> * error info by kill_accessing_process().
> * other negative values - failure
>

Of course, will do it.


>> + * other negative error code on failure.
>> */
>> int memory_failure(unsigned long pfn, int flags)
>> {
>> @@ -2445,19 +2447,6 @@ static void memory_failure_work_func(struct work_struct *work)
>> }
>> }
>>
>> -/*
>> - * Process memory_failure work queued on the specified CPU.
>> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
>> - */
>> -void memory_failure_queue_kick(int cpu)
>> -{
>> - struct memory_failure_cpu *mf_cpu;
>> -
>> - mf_cpu = &per_cpu(memory_failure_cpu, cpu);
>> - cancel_work_sync(&mf_cpu->work);
>> - memory_failure_work_func(&mf_cpu->work);
>> -}
>> -
>
> The declaration of memory_failure_queue_kick() still remains in include/linux/mm.h,
> so you can remove it together.

Good catch, will remove it too.

>
> Thanks,
> Naoya Horiguchi


Thank you for valuable comments.

Best Regards,
Shuai

2023-10-07 07:28:57

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code

Hi, ALL,

I have rewritten the cover letter in the hope that the maintainers will truly
understand the necessity of this patch set. Both Alibaba and Huawei have hit the
same issue in production, and we hope it can be fixed as soon as possible.

## Changelog

changes since v8:
- remove the bug fix tag of patch 2 (per Jarkko Sakkinen)
- remove the declaration of memory_failure_queue_kick (per Naoya Horiguchi)
- rewrite the return value comments of memory_failure (per Naoya Horiguchi)

changes since v7:
- rebase to Linux v6.6-rc2 (no code changed)
- rewritten the cover letter to explain the motivation of this patchset

changes since v6:
- add a more explicit error message as suggested by Xiaofei
- pick up reviewed-by tag from Xiaofei
- pick up internal reviewed-by tag from Baolin

changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure()
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure error such as invalid PA,
unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter

changes since v1:
- synchronous events by notify type
- Link: https://lore.kernel.org/lkml/[email protected]/


## Cover Letter

There are two major types of uncorrected recoverable (UCR) errors:

- Action Required (AR): The error is detected and the processor has already
  consumed the memory. The OS is required to take action (for example, offline
  the failing page or kill the failing thread) to recover from this error.

- Action Optional (AO): The error is detected out of processor execution
  context. Some data in the memory are corrupted, but the data have not
  been consumed. The OS may optionally take action to recover from this error.

The main difference between AR and AO errors is that AR errors are synchronous
events, while AO errors are asynchronous events. Synchronous exceptions, such as
Machine Check Exception (MCE) on X86 and Synchronous External Abort (SEA) on
Arm64, are signaled by the hardware when an error is detected and the memory
access has architecturally been executed.

Currently, both synchronous and asynchronous errors are queued as AO errors and
handled by a dedicated kernel thread in a work queue on the ARM64 platform. For
synchronous errors, memory_failure() is synced using a cancel_work_sync trick to
ensure that the corrupted page is unmapped and poisoned. Upon returning to
user-space, the process resumes at the current instruction, triggering a page
fault. As a result, the kernel sends a SIGBUS signal to the current process due
to VM_FAULT_HWPOISON.

However, this trick is not always effective. This patch set improves the
recovery process in three specific aspects:

1. Handle synchronous exceptions with proper si_code

ghes_handle_memory_failure() queues both synchronous and asynchronous errors
with flags=0. Then the kernel will notify the process by sending a SIGBUS
signal in memory_failure(), but with the wrong si_code: BUS_MCEERR_AO is sent
to the actual user-space process instead of BUS_MCEERR_AR. User-space
processes rely on the si_code to decide how to handle a memory failure.

For example, hwpoison-aware user-space processes use the si_code:
BUS_MCEERR_AO for 'action optional' early notifications, and BUS_MCEERR_AR
for 'action required' synchronous/late notifications. Specifically, when a
SIGBUS with si_code BUS_MCEERR_AR is delivered to QEMU, it will inject a vSEA
into the guest kernel. In contrast, a SIGBUS with BUS_MCEERR_AO will be
ignored by QEMU.[1]

Fix it by setting the memory failure flags to MF_ACTION_REQUIRED on synchronous events. (PATCH 1)

2. Handle abnormal memory_failure() failures to avoid an unnecessary reboot

If a process maps the faulting page but memory_failure() returns abnormally
before try_to_unmap() (for example, because the faulting page is mapped as a
KSM page), arm64 cannot use the page fault path to terminate the
synchronous exception loop.[4]

This loop can potentially exceed the platform firmware threshold or even trigger
a kernel hard lockup, leading to a system reboot. However, kernel has the
capability to recover from this error.

Fix it by performing a force kill when memory_failure() fails abnormally or when
other abnormal synchronous errors occur. These errors can include situations
such as invalid PA, unexpected severity, no memory failure config support,
invalid GUID section, OOM, etc. (PATCH 2)

3. Handle memory_failure() in the context of the process that consumed the poison

When a synchronous error occurs, memory_failure() assumes that the current
process context is exactly the one that consumed the poison.

For example, kill_accessing_process() holds the mmap lock of current->mm, does
a page table walk to find the error virtual address, and sends SIGBUS to the
current process with the error info. However, the mm of a kworker is not valid,
resulting in a null-pointer dereference. I have fixed this in [3]:

commit 77677cdbc2aa ("mm,hwpoison: check mm when killing accessing process")

Another example is that collect_procs()/kill_procs() walk the task list and
only collect and send SIGBUS to the task that consumed the poison. But
memory_failure() is queued and handled by a dedicated kernel thread on the
arm64 platform.

Fix it by queuing memory_failure() as a task_work which runs in the current
execution context to synchronously send SIGBUS before ret_to_user. (PATCH 2)

** In summary, this patch set handles synchronous errors in task work with
the proper si_code so that hwpoison-aware processes can recover from errors,
and fixes (potentially) abnormal cases. **

Lv Ying and XiuQi from Huawei also proposed to address a similar problem [2][4].
Thanks to them for the discussions.

## Steps to Reproduce This Problem

To reproduce this problem:

# STEP1: enable early kill mode
#sysctl -w vm.memory_failure_early_kill=1
vm.memory_failure_early_kill = 1

# STEP2: inject an UCE error and consume it to trigger a synchronous error
#einj_mem_uc single
0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
injecting ...
triggering ...
signal 7 code 5 addr 0xffffb0d75000
page not present
Test passed

The si_code (code 5) reported by einj_mem_uc indicates a BUS_MCEERR_AO error,
which is not what actually happened.

After this patch set:

# STEP1: enable early kill mode
#sysctl -w vm.memory_failure_early_kill=1
vm.memory_failure_early_kill = 1

# STEP2: inject an UCE error and consume it to trigger a synchronous error
#einj_mem_uc single
0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
injecting ...
triggering ...
signal 7 code 4 addr 0xffffb0d75000
page not present
Test passed

The si_code (code 4) reported by einj_mem_uc indicates a BUS_MCEERR_AR error,
as we expected.

[1] Add ARMv8 RAS virtualization support in QEMU https://patchew.org/QEMU/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lkml.kernel.org/r/[email protected]
[4] https://lore.kernel.org/lkml/[email protected]/

Shuai Xue (2):
ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
synchronous events
ACPI: APEI: handle synchronous exceptions in task work

arch/x86/kernel/cpu/mce/core.c | 9 +--
drivers/acpi/apei/ghes.c | 113 ++++++++++++++++++++++-----------
include/acpi/ghes.h | 3 -
include/linux/mm.h | 1 -
mm/memory-failure.c | 22 ++-----
5 files changed, 82 insertions(+), 66 deletions(-)

--
2.39.3

2023-10-07 07:29:12

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v9 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

There are two major types of uncorrected recoverable (UCR) errors:

- Action Required (AR): The error is detected and the processor has already
  consumed the memory. The OS is required to take action (for example, offline
  the failing page or kill the failing thread) to recover from this
  uncorrectable error.

- Action Optional (AO): The error is detected out of processor execution
  context. Some data in the memory are corrupted, but the data have not
  been consumed. The OS may optionally take action to recover from this
  uncorrectable error.

The essential difference between AR and AO errors is that AR is a
synchronous event, while AO is an asynchronous event. The hardware will
signal a synchronous exception (Machine Check Exception on X86 and
Synchronous External Abort on Arm64) when an error is detected and the
memory access has been architecturally executed.

When APEI firmware-first is enabled, a platform may describe one error
source for the handling of synchronous errors (e.g. MCE or SEA
notification), or for handling asynchronous errors (e.g. SCI or External
Interrupt notification). In other words, we can distinguish synchronous
errors by the APEI notification type. For AR errors, the kernel will kill
the current process that is accessing the poisoned page by sending SIGBUS
with BUS_MCEERR_AR. In addition, for AO errors, the kernel will notify the
process that owns the poisoned page by sending SIGBUS with BUS_MCEERR_AO in
early kill mode. However, the GHES driver always sets mf_flags to 0, so all
UCR errors are handled as AO errors in memory_failure().

To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
events.

Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")
Signed-off-by: Shuai Xue <[email protected]>
Tested-by: Ma Wupeng <[email protected]>
Reviewed-by: Kefeng Wang <[email protected]>
Reviewed-by: Xiaofei Tan <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
---
drivers/acpi/apei/ghes.c | 29 +++++++++++++++++++++++------
1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index ef59d6ea16da..88178aa6222d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
}

+/*
+ * A platform may describe one error source for the handling of synchronous
+ * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
+ * or External Interrupt). On x86, the HEST notifications are always
+ * asynchronous, so only SEA on ARM is delivered as a synchronous
+ * notification.
+ */
+static inline bool is_hest_sync_notify(struct ghes *ghes)
+{
+ u8 notify_type = ghes->generic->notify.type;
+
+ return notify_type == ACPI_HEST_NOTIFY_SEA;
+}
+
/*
* This driver isn't really modular, however for the time being,
* continuing to use module_param is the easiest way to remain
@@ -475,7 +489,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
}

static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
- int sev)
+ int sev, bool sync)
{
int flags = -1;
int sec_sev = ghes_severity(gdata->error_severity);
@@ -489,7 +503,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
(gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
flags = MF_SOFT_OFFLINE;
if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
- flags = 0;
+ flags = sync ? MF_ACTION_REQUIRED : 0;

if (flags != -1)
return ghes_do_memory_failure(mem_err->physical_addr, flags);
@@ -497,9 +511,11 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
return false;
}

-static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
+static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
+ int sev, bool sync)
{
struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
+ int flags = sync ? MF_ACTION_REQUIRED : 0;
bool queued = false;
int sec_sev, i;
char *p;
@@ -524,7 +540,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
* and don't filter out 'corrected' error here.
*/
if (is_cache && has_pa) {
- queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
+ queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
p += err_info->length;
continue;
}
@@ -645,6 +661,7 @@ static bool ghes_do_proc(struct ghes *ghes,
const guid_t *fru_id = &guid_null;
char *fru_text = "";
bool queued = false;
+ bool sync = is_hest_sync_notify(ghes);

sev = ghes_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
@@ -662,13 +679,13 @@ static bool ghes_do_proc(struct ghes *ghes,
atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);

arch_apei_report_mem_error(sev, mem_err);
- queued = ghes_handle_memory_failure(gdata, sev);
+ queued = ghes_handle_memory_failure(gdata, sev, sync);
}
else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
ghes_handle_aer(gdata);
}
else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
- queued = ghes_handle_arm_hw_error(gdata, sev);
+ queued = ghes_handle_arm_hw_error(gdata, sev, sync);
} else {
void *err = acpi_hest_get_payload(gdata);

--
2.39.3

2023-10-07 07:29:12

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v9 2/2] ACPI: APEI: handle synchronous exceptions in task work

Hardware errors could be signaled by synchronous interrupt, e.g. when an
error is detected by a background scrubber, or signaled by synchronous
exception, e.g. when an uncorrected error is consumed. Both synchronous and
asynchronous error are queued and handled by a dedicated kthread in
workqueue.

commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
synchronous errors") keep track of whether memory_failure() work was
queued, and make task_work pending to flush out the workqueue so that the
work for synchronous error is processed before returning to user-space.
The trick ensures that the corrupted page is unmapped and poisoned. After
returning to user-space, the task restarts at the faulting instruction,
which triggers a page fault, and the kernel then sends SIGBUS to the
current process due to VM_FAULT_HWPOISON.

However, the memory failure recovery for hwpoison-aware mechanisms does not
work as expected. For example, hwpoison-aware user-space processes like
QEMU register their customized SIGBUS handler and enable early kill mode by
setting PF_MCE_EARLY at initialization. Then the kernel directly notifies
the process by sending a SIGBUS signal from memory failure handling, but
with the wrong si_code: the user-space process is the one accessing the
corrupt memory location, yet its memory failure work is handled in a
kthread context, so kill_proc() sends SIGBUS with the BUS_MCEERR_AO si_code
instead of BUS_MCEERR_AR.

To this end, separate synchronous and asynchronous error handling into
different paths, as the X86 platform does:

- valid synchronous errors: queue a task_work to synchronously send SIGBUS
before ret_to_user.
- valid asynchronous errors: queue a work into workqueue to asynchronously
handle memory failure.
- abnormal branches such as invalid PA, unexpected severity, no memory
failure config support, invalid GUID section, OOM, etc.: fall back to a
force kill (SIGBUS) for synchronous errors.

Then, for valid synchronous errors, the current context in memory failure
belongs exactly to the task consuming the poisoned data, and it will send
SIGBUS with the proper si_code.

Signed-off-by: Shuai Xue <[email protected]>
Tested-by: Ma Wupeng <[email protected]>
Reviewed-by: Kefeng Wang <[email protected]>
Reviewed-by: Xiaofei Tan <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
---
arch/x86/kernel/cpu/mce/core.c | 9 +---
drivers/acpi/apei/ghes.c | 84 +++++++++++++++++++++-------------
include/acpi/ghes.h | 3 --
include/linux/mm.h | 1 -
mm/memory-failure.c | 22 +++------
5 files changed, 59 insertions(+), 60 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6f35f724cc14..1675ff77033d 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1334,17 +1334,10 @@ static void kill_me_maybe(struct callback_head *cb)
return;
}

- /*
- * -EHWPOISON from memory_failure() means that it already sent SIGBUS
- * to the current process with the proper error info,
- * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
- *
- * In both cases, no further processing is required.
- */
if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
return;

- pr_err("Memory error not recovered");
+ pr_err("Sending SIGBUS to current task due to memory error not recovered");
kill_me_now(cb);
}

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 88178aa6222d..014401a65ed5 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -450,28 +450,41 @@ static void ghes_clear_estatus(struct ghes *ghes,
}

/*
- * Called as task_work before returning to user-space.
- * Ensure any queued work has been done before we return to the context that
- * triggered the notification.
+ * struct sync_task_work - for synchronous RAS event
+ *
+ * @twork: callback_head for task work
+ * @pfn: page frame number of corrupted page
+ * @flags: fine tune action taken
+ *
+ * Structure to pass task work to be handled before
+ * ret_to_user via task_work_add().
*/
-static void ghes_kick_task_work(struct callback_head *head)
+struct sync_task_work {
+ struct callback_head twork;
+ u64 pfn;
+ int flags;
+};
+
+static void memory_failure_cb(struct callback_head *twork)
{
- struct acpi_hest_generic_status *estatus;
- struct ghes_estatus_node *estatus_node;
- u32 node_len;
+ int ret;
+ struct sync_task_work *twcb =
+ container_of(twork, struct sync_task_work, twork);

- estatus_node = container_of(head, struct ghes_estatus_node, task_work);
- if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
- memory_failure_queue_kick(estatus_node->task_work_cpu);
+ ret = memory_failure(twcb->pfn, twcb->flags);
+ kfree(twcb);

- estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
- node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
- gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
+ if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP)
+ return;
+
+ pr_err("Sending SIGBUS to current task due to memory error not recovered");
+ force_sig(SIGBUS);
}

static bool ghes_do_memory_failure(u64 physical_addr, int flags)
{
unsigned long pfn;
+ struct sync_task_work *twcb;

if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
return false;
@@ -484,6 +497,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
return false;
}

+ if (flags == MF_ACTION_REQUIRED && current->mm) {
+ twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
+ if (!twcb)
+ return false;
+
+ twcb->pfn = pfn;
+ twcb->flags = flags;
+ init_task_work(&twcb->twork, memory_failure_cb);
+ task_work_add(current, &twcb->twork, TWA_RESUME);
+ return true;
+ }
+
memory_failure_queue(pfn, flags);
return true;
}
@@ -652,7 +677,7 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
schedule_work(&entry->work);
}

-static bool ghes_do_proc(struct ghes *ghes,
+static void ghes_do_proc(struct ghes *ghes,
const struct acpi_hest_generic_status *estatus)
{
int sev, sec_sev;
@@ -696,7 +721,14 @@ static bool ghes_do_proc(struct ghes *ghes,
}
}

- return queued;
+ /*
+ * If no memory failure work is queued for abnormal synchronous
+ * errors, do a force kill.
+ */
+ if (sync && !queued) {
+ pr_err("Sending SIGBUS to current task due to memory error not recovered");
+ force_sig(SIGBUS);
+ }
}

static void __ghes_print_estatus(const char *pfx,
@@ -998,9 +1030,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
struct ghes_estatus_node *estatus_node;
struct acpi_hest_generic *generic;
struct acpi_hest_generic_status *estatus;
- bool task_work_pending;
u32 len, node_len;
- int ret;

llnode = llist_del_all(&ghes_estatus_llist);
/*
@@ -1015,25 +1045,16 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
len = cper_estatus_len(estatus);
node_len = GHES_ESTATUS_NODE_LEN(len);
- task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
+
+ ghes_do_proc(estatus_node->ghes, estatus);
+
if (!ghes_estatus_cached(estatus)) {
generic = estatus_node->generic;
if (ghes_print_estatus(NULL, generic, estatus))
ghes_estatus_cache_add(generic, estatus);
}
-
- if (task_work_pending && current->mm) {
- estatus_node->task_work.func = ghes_kick_task_work;
- estatus_node->task_work_cpu = smp_processor_id();
- ret = task_work_add(current, &estatus_node->task_work,
- TWA_RESUME);
- if (ret)
- estatus_node->task_work.func = NULL;
- }
-
- if (!estatus_node->task_work.func)
- gen_pool_free(ghes_estatus_pool,
- (unsigned long)estatus_node, node_len);
+ gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
+ node_len);

llnode = next;
}
@@ -1094,7 +1115,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,

estatus_node->ghes = ghes;
estatus_node->generic = ghes->generic;
- estatus_node->task_work.func = NULL;
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);

if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 3c8bba9f1114..e5e0c308d27f 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -35,9 +35,6 @@ struct ghes_estatus_node {
struct llist_node llnode;
struct acpi_hest_generic *generic;
struct ghes *ghes;
-
- int task_work_cpu;
- struct callback_head task_work;
};

struct ghes_estatus_cache {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf5d0b1b16f4..3ce9e4371659 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3835,7 +3835,6 @@ enum mf_flags {
int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
unsigned long count, int mf_flags);
extern int memory_failure(unsigned long pfn, int flags);
-extern void memory_failure_queue_kick(int cpu);
extern int unpoison_memory(unsigned long pfn);
extern void shake_page(struct page *p);
extern atomic_long_t num_poisoned_pages __read_mostly;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4d6e43c88489..0d02f8a0b556 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2161,9 +2161,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
* Must run in process context (e.g. a work queue) with interrupts
* enabled and no spinlocks held.
*
- * Return: 0 for successfully handled the memory error,
- * -EOPNOTSUPP for hwpoison_filter() filtered the error event,
- * < 0(except -EOPNOTSUPP) on failure.
+ * Return values:
+ * 0 - success
+ * -EOPNOTSUPP - hwpoison_filter() filtered the error event.
+ * -EHWPOISON - sent SIGBUS to the current process with the proper
+ * error info by kill_accessing_process().
+ * other negative values - failure
*/
int memory_failure(unsigned long pfn, int flags)
{
@@ -2445,19 +2448,6 @@ static void memory_failure_work_func(struct work_struct *work)
}
}

-/*
- * Process memory_failure work queued on the specified CPU.
- * Used to avoid return-to-userspace racing with the memory_failure workqueue.
- */
-void memory_failure_queue_kick(int cpu)
-{
- struct memory_failure_cpu *mf_cpu;
-
- mf_cpu = &per_cpu(memory_failure_cpu, cpu);
- cancel_work_sync(&mf_cpu->work);
- memory_failure_work_func(&mf_cpu->work);
-}
-
static int __init memory_failure_init(void)
{
struct memory_failure_cpu *mf_cpu;
--
2.39.3

2023-11-21 01:49:08

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code

Hi, ALL,

Gentle ping.

Best Regards,
Shuai

On 2023/10/7 15:28, Shuai Xue wrote:
> Hi, ALL,
>
> I have rewritten the cover letter with the hope that the maintainer will truly
> understand the necessity of this patch. Both Alibaba and Huawei met the same
> issue in products, and we hope it could be fixed ASAP.
>
> ## Changes Log
>
> changes since v8:
> - remove the bug fix tag of patch 2 (per Jarkko Sakkinen)
> - remove the declaration of memory_failure_queue_kick (per Naoya Horiguchi)
> - rewrite the return value comments of memory_failure (per Naoya Horiguchi)
>
> changes since v7:
> - rebase to Linux v6.6-rc2 (no code changed)
> - rewritten the cover letter to explain the motivation of this patchset
>
> changes since v6:
> - add more explicit error message suggested by Xiaofei
> - pick up reviewed-by tag from Xiaofei
> - pick up internal reviewed-by tag from Baolin
>
> changes since v5 by addressing comments from Kefeng:
> - document return value of memory_failure()
> - drop redundant comments in call site of memory_failure()
> - make ghes_do_proc void and handle abnormal case within it
> - pick up reviewed-by tag from Kefeng Wang
>
> changes since v4 by addressing comments from Xiaofei:
> - do a force kill only for abnormal sync errors
>
> changes since v3 by addressing comments from Xiaofei:
> - do a force kill for abnormal memory failure error such as invalid PA,
> unexpected severity, OOM, etc
> - pick up tested-by tag from Ma Wupeng
>
> changes since v2 by addressing comments from Naoya:
> - rename mce_task_work to sync_task_work
> - drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
> - add steps to reproduce this problem in cover letter
>
> changes since v1:
> - synchronous events by notify type
> - Link: https://lore.kernel.org/lkml/[email protected]/
>
>
> ## Cover Letter
>
> There are two major types of uncorrected recoverable (UCR) errors :
>
> - Action Required (AR): The error is detected and the processor already
> consumes the memory. OS requires to take action (for example, offline
> failure page/kill failure thread) to recover this error.
>
> - Action Optional (AO): The error is detected out of processor execution
> context. Some data in the memory are corrupted. But the data have not
> been consumed. OS is optional to take action to recover this error.
>
> The main difference between AR and AO errors is that AR errors are synchronous
> events, while AO errors are asynchronous events. Synchronous exceptions, such as
> Machine Check Exception (MCE) on X86 and Synchronous External Abort (SEA) on
> Arm64, are signaled by the hardware when an error is detected and the memory
> access has architecturally been executed.
>
> Currently, both synchronous and asynchronous errors are queued as AO errors and
> handled by a dedicated kernel thread in a work queue on the ARM64 platform. For
> synchronous errors, memory_failure() is synced using a cancel_work_sync trick to
> ensure that the corrupted page is unmapped and poisoned. Upon returning to
> user-space, the process resumes at the current instruction, triggering a page
> fault. As a result, the kernel sends a SIGBUS signal to the current process due
> to VM_FAULT_HWPOISON.
>
> However, this trick is not always effective; this patch set improves the
> recovery process in three specific aspects:
>
> 1. Handle synchronous exceptions with proper si_code
>
> ghes_handle_memory_failure() queues both synchronous and asynchronous errors with
> flag=0. Then the kernel will notify the process by sending a SIGBUS signal in
> memory_failure() with the wrong si_code: BUS_MCEERR_AO to the actual user-space
> process instead of BUS_MCEERR_AR. User-space processes rely on the si_code
> to distinguish how to handle the memory failure.
>
> For example, hwpoison-aware user-space processes use the si_code:
> BUS_MCEERR_AO for 'action optional' early notifications, and BUS_MCEERR_AR
> for 'action required' synchronous/late notifications. Specifically, when a
> signal with SIGBUS_MCEERR_AR is delivered to QEMU, it will inject a vSEA to
> Guest kernel. In contrast, a signal with SIGBUS_MCEERR_AO will be ignored
> by QEMU.[1]
>
> Fix it by setting memory failure flags as MF_ACTION_REQUIRED on synchronous events. (PATCH 1)
>
> 2. Handle memory_failure() abnormal fails to avoid a unnecessary reboot
>
> If process mapping fault page, but memory_failure() abnormal return before
> try_to_unmap(), for example, the fault page process mapping is KSM page.
> In this case, arm64 cannot use the page fault process to terminate the
> synchronous exception loop.[4]
>
> This loop can potentially exceed the platform firmware threshold or even trigger
> a kernel hard lockup, leading to a system reboot. However, kernel has the
> capability to recover from this error.
>
> Fix it by performing a force kill when memory_failure() abnormal fails or when
> other abnormal synchronous errors occur. These errors can include situations
> such as invalid PA, unexpected severity, no memory failure config support,
> invalid GUID section, OOM, etc. (PATCH 2)
>
> 3. Handle memory_failure() in current process context which consuming poison
>
> When synchronous errors occur, memory_failure() assume that current process
> context is exactly that consuming poison synchronous error.
>
> For example, kill_accessing_process() holds mmap locking of current->mm, does
> pagetable walk to find the error virtual address, and sends SIGBUS to the
> current process with error info. However, the mm of kworker is not valid,
> resulting in a null-pointer dereference. I have fixed this in[3].
>
> commit 77677cdbc2aa mm,hwpoison: check mm when killing accessing process
>
> Another example is that collect_procs()/kill_procs() walk the task list, only
> collect and send sigbus to task which consuming poison. But memory_failure() is
> queued and handled by a dedicated kernel thread on arm64 platform.
>
> Fix it by queuing memory_failure() as a task work which runs in current
> execution context to synchronously send SIGBUS before ret_to_user. (PATCH 2)
>
> ** In summary, this patch set handles synchronous errors in task work with
> proper si_code so that hwpoison-aware process can recover from errors, and
> fixes (potentially) abnormal cases. **
>
> Lv Ying and XiuQi from Huawei also proposed to address similar problem[2][4].
> Acknowledge to discussion with them.
>
> ## Steps to Reproduce This Problem
>
> To reproduce this problem:
>
> # STEP1: enable early kill mode
> #sysctl -w vm.memory_failure_early_kill=1
> vm.memory_failure_early_kill = 1
>
> # STEP2: inject an UCE error and consume it to trigger a synchronous error
> #einj_mem_uc single
> 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
> injecting ...
> triggering ...
> signal 7 code 5 addr 0xffffb0d75000
> page not present
> Test passed
>
> The si_code (code 5) from einj_mem_uc indicates that it is BUS_MCEERR_AO error
> and it is not fact.
>
> After this patch set:
>
> # STEP1: enable early kill mode
> #sysctl -w vm.memory_failure_early_kill=1
> vm.memory_failure_early_kill = 1
>
> # STEP2: inject an UCE error and consume it to trigger a synchronous error
> #einj_mem_uc single
> 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
> injecting ...
> triggering ...
> signal 7 code 4 addr 0xffffb0d75000
> page not present
> Test passed
>
> The si_code (code 4) from einj_mem_uc indicates that it is BUS_MCEERR_AR error
> as we expected.
>
> [1] Add ARMv8 RAS virtualization support in QEMU https://patchew.org/QEMU/[email protected]/
> [2] https://lore.kernel.org/lkml/[email protected]/
> [3] https://lkml.kernel.org/r/[email protected]
> [4] https://lore.kernel.org/lkml/[email protected]/
>
> Shuai Xue (2):
> ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
> synchronous events
> ACPI: APEI: handle synchronous exceptions in task work
>
> arch/x86/kernel/cpu/mce/core.c | 9 +--
> drivers/acpi/apei/ghes.c | 113 ++++++++++++++++++++++-----------
> include/acpi/ghes.h | 3 -
> include/linux/mm.h | 1 -
> mm/memory-failure.c | 22 ++-----
> 5 files changed, 82 insertions(+), 66 deletions(-)
>

2023-11-23 15:30:13

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code

On Sat, Oct 07, 2023 at 03:28:16PM +0800, Shuai Xue wrote:
> However, this trick is not always effective

So far so good.

What's missing here is why "this trick" is not always effective.

Basically to explain what exactly the problem is.

> For example, hwpoison-aware user-space processes use the si_code:
> BUS_MCEERR_AO for 'action optional' early notifications, and BUS_MCEERR_AR
> for 'action required' synchronous/late notifications. Specifically, when a
> signal with SIGBUS_MCEERR_AR is delivered to QEMU, it will inject a vSEA to
> Guest kernel. In contrast, a signal with SIGBUS_MCEERR_AO will be ignored
> by QEMU.[1]
>
> Fix it by setting memory failure flags as MF_ACTION_REQUIRED on synchronous events. (PATCH 1)

So you're fixing qemu by "fixing" the kernel?

This doesn't make any sense.

Make errors which are ACPI_HEST_NOTIFY_SEA type return
MF_ACTION_REQUIRED so that it *happens* to fix your use case.

Sounds like a lot of nonsense to me.

What is the issue here you're trying to solve?

> 2. Handle memory_failure() abnormal fails to avoid a unnecessary reboot
>
> If process mapping fault page, but memory_failure() abnormal return before
> try_to_unmap(), for example, the fault page process mapping is KSM page.
> In this case, arm64 cannot use the page fault process to terminate the
> synchronous exception loop.[4]
>
> This loop can potentially exceed the platform firmware threshold or even trigger
> a kernel hard lockup, leading to a system reboot. However, kernel has the
> capability to recover from this error.
>
> Fix it by performing a force kill when memory_failure() abnormal fails or when
> other abnormal synchronous errors occur.

Just like that?

Without giving the process the opportunity to even save its other data?

So this all is still very confusing, patches definitely need splitting
and this whole thing needs restraint.

You go and do this: you split *each* issue you're addressing into
a separate patch and explain it like this:

---
1. Prepare the context for the explanation briefly.

2. Explain the problem at hand.

3. "It happens because of <...>"

4. "Fix it by doing X"

5. "(Potentially do Y)."
---

and each patch explains *exactly* *one* issue, what happens, why it
happens and just the fix for it and *why* it is needed.

Otherwise, this is unreviewable.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-11-25 06:47:13

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code



On 2023/11/23 23:07, Borislav Petkov wrote:

Hi, Borislav,

Thank you for your reply and advice.


> On Sat, Oct 07, 2023 at 03:28:16PM +0800, Shuai Xue wrote:
>> However, this trick is not always effective
>
> So far so good.
>
> What's missing here is why "this trick" is not always effective.

>
> Basically to explain what exactly the problem is.

I think the main point is that this trick is not effective for AR errors,
because:

- an AR error consumed by current process is deferred to handle in a
dedicated kernel thread, but memory_failure() assumes that it runs in the
current context
- another page fault is not unnecessary, we can send sigbus to current
process in the first Synchronous External Abort SEA on arm64 (analogy
Machine Check Exception on x86)

>
>> For example, hwpoison-aware user-space processes use the si_code:
>> BUS_MCEERR_AO for 'action optional' early notifications, and BUS_MCEERR_AR
>> for 'action required' synchronous/late notifications. Specifically, when a
>> signal with SIGBUS_MCEERR_AR is delivered to QEMU, it will inject a vSEA to
>> Guest kernel. In contrast, a signal with SIGBUS_MCEERR_AO will be ignored
>> by QEMU.[1]
>>
>> Fix it by setting memory failure flags as MF_ACTION_REQUIRED on synchronous events. (PATCH 1)
>
> So you're fixing qemu by "fixing" the kernel?
>
> This doesn't make any sense.

I just give an example that the user space process *really* relies on the
si_code of signal to handle hardware errors

>
> Make errors which are ACPI_HEST_NOTIFY_SEA type return
> MF_ACTION_REQUIRED so that it *happens* to fix your use case.
>
> Sounds like a lot of nonsense to me.
>
> What is the issue here you're trying to solve?

The SIGBUS si_codes defined in include/uapi/asm-generic/siginfo.h says:

/* hardware memory error consumed on a machine check: action required */
#define BUS_MCEERR_AR 4
/* hardware memory error detected in process but not consumed: action optional*/
#define BUS_MCEERR_AO 5

When a synchronous error is consumed by Guest, the kernel should send a
signal with BUS_MCEERR_AR instead of BUS_MCEERR_AO.
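
For illustration only (this is not part of the series; the handler body and
the pause() workload are placeholders), a hwpoison-aware process such as a
VMM is expected to wire this up roughly as follows:

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>

static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	if (info->si_code == BUS_MCEERR_AR) {
		/* poison was consumed by this thread: do not continue */
		fprintf(stderr, "AR error at %p\n", info->si_addr);
		_exit(1);
	} else if (info->si_code == BUS_MCEERR_AO) {
		/* poison detected but not yet consumed: recovery is optional */
		fprintf(stderr, "AO error at %p\n", info->si_addr);
	}
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction	= sigbus_handler,
		.sa_flags	= SA_SIGINFO,
	};

	sigaction(SIGBUS, &sa, NULL);
	/* opt in to early BUS_MCEERR_AO notifications (PF_MCE_EARLY) */
	prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);

	pause();	/* placeholder for the real workload */
	return 0;
}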

>
>> 2. Handle memory_failure() abnormal fails to avoid a unnecessary reboot
>>
>> If process mapping fault page, but memory_failure() abnormal return before
>> try_to_unmap(), for example, the fault page process mapping is KSM page.
>> In this case, arm64 cannot use the page fault process to terminate the
>> synchronous exception loop.[4]
>>
>> This loop can potentially exceed the platform firmware threshold or even trigger
>> a kernel hard lockup, leading to a system reboot. However, kernel has the
>> capability to recover from this error.
>>
>> Fix it by performing a force kill when memory_failure() abnormal fails or when
>> other abnormal synchronous errors occur.
>
> Just like that?
>
> Without giving the process the opportunity to even save its other data?

Exactly.

>
> So this all is still very confusing, patches definitely need splitting
> and this whole thing needs restraint.
>
> You go and do this: you split *each* issue you're addressing into
> a separate patch and explain it like this:
>
> ---
> 1. Prepare the context for the explanation briefly.
>
> 2. Explain the problem at hand.
>
> 3. "It happens because of <...>"
>
> 4. "Fix it by doing X"
>
> 5. "(Potentially do Y)."
> ---
>
> and each patch explains *exactly* *one* issue, what happens, why it
> happens and just the fix for it and *why* it is needed.
>
> Otherwise, this is unreviewable.

Thank you for your valuable suggestion, I will split the patches and
resubmit a new patch set.

>
> Thx.
>

Best Regards,
Shuai

2023-11-25 16:31:16

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code

On Sat, Nov 25, 2023 at 02:44:52PM +0800, Shuai Xue wrote:
> - an AR error consumed by current process is deferred to handle in a
> dedicated kernel thread, but memory_failure() assumes that it runs in the
> current context

On x86? ARM?

Please point to the exact code flow.

> - another page fault is not unnecessary, we can send sigbus to current
> process in the first Synchronous External Abort SEA on arm64 (analogy
> Machine Check Exception on x86)

I have no clue what that means. What page fault?

> I just give an example that the user space process *really* relies on the
> si_code of signal to handle hardware errors

No, don't give examples.

Explain what the exact problem is you're seeing, in your use case, point
to the code and then state how you think it should be fixed and why.

Right now your text is "all over the place" and I have no clue what you
even want.

> The SIGBUS si_codes defined in include/uapi/asm-generic/siginfo.h says:
>
> /* hardware memory error consumed on a machine check: action required */
> #define BUS_MCEERR_AR 4
> /* hardware memory error detected in process but not consumed: action optional*/
> #define BUS_MCEERR_AO 5
>
> When a synchronous error is consumed by Guest, the kernel should send a
> signal with BUS_MCEERR_AR instead of BUS_MCEERR_AO.

Can you drop this "synchronous" bla and concentrate on the error
*severity*?

I think you want to say that there are some types of errors for which
error handling needs to happen immediately and for some reason that
doesn't happen.

Which errors are those? Types?

Why do you need them to be handled immediately?

> Exactly.

No, not exactly. Why is it ok to do that? What are the implications of
this?

Is immediate killing the right decision?

Is this ok for *every* possible kernel running out there - not only for
your use case?

And so on and so on...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-11-26 12:26:25

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code



On 2023/11/25 20:10, Borislav Petkov wrote:

Hi, Borislav,

Thank you for your reply, and sorry for the confusion I made. Please see my reply
inline.

Best Regards,
Shuai

> On Sat, Nov 25, 2023 at 02:44:52PM +0800, Shuai Xue wrote:
>> - an AR error consumed by current process is deferred to handle in a
>> dedicated kernel thread, but memory_failure() assumes that it runs in the
>> current context
>
> On x86? ARM?
>
> Pease point to the exact code flow.

An AR error consumed by current process is deferred to handle in a
dedicated kernel thread on ARM platform. The AR error is handled in bellow
flow:

-----------------------------------------------------------------------------
[usr space task einj_mem_uc consumd data poison, CPU 3] STEP 0

-----------------------------------------------------------------------------
[ghes_sdei_critical_callback: current einj_mem_uc, CPU 3] STEP 1
ghes_sdei_critical_callback
=> __ghes_sdei_callback
=> ghes_in_nmi_queue_one_entry // peak and read estatus
=> irq_work_queue(&ghes_proc_irq_work) <=> ghes_proc_in_irq // irq_work
[ghes_sdei_critical_callback: return]
-----------------------------------------------------------------------------
[ghes_proc_in_irq: current einj_mem_uc, CPU 3] STEP 2
=> ghes_do_proc
=> ghes_handle_memory_failure
=> ghes_do_memory_failure
=> memory_failure_queue // put work task on current CPU
=> if (kfifo_put(&mf_cpu->fifo, entry))
schedule_work_on(smp_processor_id(), &mf_cpu->work);
=> task_work_add(current, &estatus_node->task_work, TWA_RESUME);
[ghes_proc_in_irq: return]
-----------------------------------------------------------------------------
// kworker preempts einj_mem_uc on CPU 3 due to RESCHED flag STEP 3
[memory_failure_work_func: current kworker, CPU 3]
=> memory_failure_work_func(&mf_cpu->work)
=> while kfifo_get(&mf_cpu->fifo, &entry); // until get no work
=> memory_failure(entry.pfn, entry.flags);
-----------------------------------------------------------------------------
[ghes_kick_task_work: current einj_mem_uc, other cpu] STEP 4
=> memory_failure_queue_kick
=> cancel_work_sync - waiting memory_failure_work_func finish
=> memory_failure_work_func(&mf_cpu->work)
=> kfifo_get(&mf_cpu->fifo, &entry); // no work
-----------------------------------------------------------------------------
[einj_mem_uc resume at the same PC, trigger a page fault STEP 5

STEP 0: A user space task, named einj_mem_uc consume a poison. The firmware
notifies hardware error to kernel through is SDEI
(ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED).

STEP 1: The swapper running on CPU 3 is interrupted. irq_work_queue() rasie
a irq_work to handle hardware errors in IRQ context

STEP2: In IRQ context, ghes_proc_in_irq() queues memory failure work on
current CPU in workqueue and add task work to sync with the workqueue.

STEP3: The kworker preempts the current running thread and get CPU 3. Then
memory_failure() is processed in kworker.

STEP4: ghes_kick_task_work() is called as task_work to ensure any queued
workqueue has been done before returning to user-space.

STEP5: Upon returning to user-space, the task einj_mem_uc resumes at the
current instruction, because the poison page is unmapped by
memory_failure() in step 3, so a page fault will be triggered.

memory_failure() assumes that it runs in the current context on both x86
and ARM platform.


for example:
memory_failure() in mm/memory-failure.c:

if (flags & MF_ACTION_REQUIRED) {
folio = page_folio(p);
res = kill_accessing_process(current, folio_pfn(folio), flags);
}

>
>> - another page fault is not unnecessary, we can send sigbus to current
>> process in the first Synchronous External Abort SEA on arm64 (analogy
>> Machine Check Exception on x86)
>
> I have no clue what that means. What page fault?

I mean the page fault in step 5. We can simplify the above flow by queuing
memory_failure() as a task work for AR errors directly in step 3.
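
A rough sketch of that simplification (this is essentially what PATCH 2
implements; the names below are illustrative and the error handling of the
real patch is omitted):

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/task_work.h>

struct sync_mf_work {
	struct callback_head twork;
	unsigned long pfn;
	int flags;
};

static void sync_mf_cb(struct callback_head *head)
{
	struct sync_mf_work *w = container_of(head, struct sync_mf_work, twork);

	/* runs in the context of the task that consumed the poison,
	 * just before it returns to user-space */
	memory_failure(w->pfn, w->flags);
	kfree(w);
}

static int queue_sync_mf(unsigned long pfn, int flags)
{
	/* GFP_ATOMIC: this may be called from the IRQ-like GHES path */
	struct sync_mf_work *w = kmalloc(sizeof(*w), GFP_ATOMIC);

	if (!w)
		return -ENOMEM;

	w->pfn = pfn;
	w->flags = flags;
	init_task_work(&w->twork, sync_mf_cb);
	return task_work_add(current, &w->twork, TWA_RESUME);
}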

>
>> I just give an example that the user space process *really* relys on the
>> si_code of signal to handle hardware errors
>
> No, don't give examples.
>
> Explain what the exact problem is you're seeing, in your use case, point
> to the code and then state how you think it should be fixed and why.
>
> Right now your text is "all over the place" and I have no clue what you
> even want.

Ok, got it. Thank you.

>
>> The SIGBUS si_codes defined in include/uapi/asm-generic/siginfo.h says:
>>
>> /* hardware memory error consumed on a machine check: action required */
>> #define BUS_MCEERR_AR 4
>> /* hardware memory error detected in process but not consumed: action optional*/
>> #define BUS_MCEERR_AO 5
>>
>> When a synchronous error is consumed by Guest, the kernel should send a
>> signal with BUS_MCEERR_AR instead of BUS_MCEERR_AO.
>
> Can you drop this "synchronous" bla and concentrate on the error
> *severity*?
>
> I think you want to say that there are some types of errors for which
> error handling needs to happen immediately and for some reason that
> doesn't happen.
>
> Which errors are those? Types?
>
> Why do you need them to be handled immediately?

Well, the severities defined on the x86 and ARM platforms are quite different. I
guess you mean the taxonomy of producer error types.

- X86: Software recoverable action required (SRAR)

A UCR error that *requires* system software to take a recovery action on
this processor *before scheduling another stream of execution on this
processor*.
(15.6.3 UCR Error Classification in Intel® 64 and IA-32 Architectures
Software Developer’s Manual Volume 3)

- ARM: Recoverable state (UER)

The PE determines that software *must* take action to locate and repair
the error to successfully recover execution. This might be because the
exception was taken before the error was architecturally consumed by
the PE, at the point when the PE was not be able to make correct
progress without either consuming the error or *otherwise making the
state of the PE unrecoverable*.
(2.3.2 PE error state classification in Arm RAS Supplement
https://documentation-service.arm.com/static/63185614f72fad1903828eda)

I think the above two types of errors need to be handled immediately.

>
>> Exactly.
>
> No, not exactly. Why is it ok to do that? What are the implications of
> this?
>
> Is immediate killing the right decision?
>
> Is this ok for *every* possible kernel running out there - not only for
> your use case?
>
> And so on and so on...
>

I don't have a clear answer here. I guess the poisoned data only affects the
user space task which triggers the exception. A panic is not necessary.

On the x86 platform, the current error handling of memory_failure() in
kill_me_maybe() is just to forcibly send a SIGBUS.

kill_me_maybe():

ret = memory_failure(pfn, flags);
if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
return;

pr_err("Memory error not recovered");
kill_me_now(cb);

Do you have any comments or suggestions about this? I don't change the x86
behavior.

On the arm64 platform, in step 3 of the above flow, memory_failure_work_func(),
the call site of memory_failure(), does not handle the return code of
memory_failure(). I just add the same behavior.


2023-11-30 03:01:44

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code



On 2023/11/30 02:54, Borislav Petkov wrote:
> Moving James to To:
>
> On Sun, Nov 26, 2023 at 08:25:38PM +0800, Shuai Xue wrote:
>>> On Sat, Nov 25, 2023 at 02:44:52PM +0800, Shuai Xue wrote:
>>>> - an AR error consumed by current process is deferred to handle in a
>>>> dedicated kernel thread, but memory_failure() assumes that it runs in the
>>>> current context
>>>
>>> On x86? ARM?
>>>
>>> Pease point to the exact code flow.
>>
>> An AR error consumed by current process is deferred to handle in a
>> dedicated kernel thread on ARM platform. The AR error is handled in bellow
>> flow:
>>
>> -----------------------------------------------------------------------------
>> [usr space task einj_mem_uc consumd data poison, CPU 3] STEP 0
>>
>> -----------------------------------------------------------------------------
>> [ghes_sdei_critical_callback: current einj_mem_uc, CPU 3] STEP 1
>> ghes_sdei_critical_callback
>> => __ghes_sdei_callback
>> => ghes_in_nmi_queue_one_entry // peak and read estatus
>> => irq_work_queue(&ghes_proc_irq_work) <=> ghes_proc_in_irq // irq_work
>> [ghes_sdei_critical_callback: return]
>> -----------------------------------------------------------------------------
>> [ghes_proc_in_irq: current einj_mem_uc, CPU 3] STEP 2
>> => ghes_do_proc
>> => ghes_handle_memory_failure
>> => ghes_do_memory_failure
>> => memory_failure_queue // put work task on current CPU
>> => if (kfifo_put(&mf_cpu->fifo, entry))
>> schedule_work_on(smp_processor_id(), &mf_cpu->work);
>> => task_work_add(current, &estatus_node->task_work, TWA_RESUME);
>> [ghes_proc_in_irq: return]
>> -----------------------------------------------------------------------------
>> // kworker preempts einj_mem_uc on CPU 3 due to RESCHED flag STEP 3
>> [memory_failure_work_func: current kworker, CPU 3]
>> => memory_failure_work_func(&mf_cpu->work)
>> => while kfifo_get(&mf_cpu->fifo, &entry); // until get no work
>> => memory_failure(entry.pfn, entry.flags);
>
> From the comment above that function:
>
> * The function is primarily of use for corruptions that
> * happen outside the current execution context (e.g. when
> * detected by a background scrubber)
> *
> * Must run in process context (e.g. a work queue) with interrupts
> * enabled and no spinlocks held.

Hi, Borislav,

Thank you for your comments.

But we are talking about an Action Required error; it does happen *inside the
current execution context*. The Action Required error does not match the
function's comments.

>
>> -----------------------------------------------------------------------------
>> [ghes_kick_task_work: current einj_mem_uc, other cpu] STEP 4
>> => memory_failure_queue_kick
>> => cancel_work_sync - waiting memory_failure_work_func finish
>> => memory_failure_work_func(&mf_cpu->work)
>> => kfifo_get(&mf_cpu->fifo, &entry); // no work
>> -----------------------------------------------------------------------------
>> [einj_mem_uc resume at the same PC, trigger a page fault STEP 5
>>
>> STEP 0: A user space task, named einj_mem_uc consume a poison. The firmware
>> notifies hardware error to kernel through is SDEI
>> (ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED).
>>
>> STEP 1: The swapper running on CPU 3 is interrupted. irq_work_queue() rasie
>> a irq_work to handle hardware errors in IRQ context
>>
>> STEP2: In IRQ context, ghes_proc_in_irq() queues memory failure work on
>> current CPU in workqueue and add task work to sync with the workqueue.
>>
>> STEP3: The kworker preempts the current running thread and get CPU 3. Then
>> memory_failure() is processed in kworker.
>
> See above.
>
>> STEP4: ghes_kick_task_work() is called as task_work to ensure any queued
>> workqueue has been done before returning to user-space.
>>
>> STEP5: Upon returning to user-space, the task einj_mem_uc resumes at the
>> current instruction, because the poison page is unmapped by
>> memory_failure() in step 3, so a page fault will be triggered.
>>
>> memory_failure() assumes that it runs in the current context on both x86
>> and ARM platform.
>>
>>
>> for example:
>> memory_failure() in mm/memory-failure.c:
>>
>> if (flags & MF_ACTION_REQUIRED) {
>> folio = page_folio(p);
>> res = kill_accessing_process(current, folio_pfn(folio), flags);
>> }
>
> And?
>
> Do you see the check above it?
>
> if (TestSetPageHWPoison(p)) {
>
> test_and_set_bit() returns true only when the page was poisoned already.
>
> * This function is intended to handle "Action Required" MCEs on already
> * hardware poisoned pages. They could happen, for example, when
> * memory_failure() failed to unmap the error page at the first call, or
> * when multiple local machine checks happened on different CPUs.
>
> And that's kill_accessing_process().
>
> So AFAIU, the kworker running memory_failure() would only mark the page
> as poison.
>
> The killing happens when memory_failure() runs again and the process
> touches the page again.

When an Action Required error occurs, it triggers an MCE-like exception
(SEA). In the first call of memory_failure(), it will poison the page. If
it fails to unmap the error page, the user space task resumes at the
current PC and triggers another SEA exception; then the second call of
memory_failure() will run into kill_accessing_process(), which does nothing
and just returns -EFAULT. As a result, a third SEA exception will be
triggered. Finally, an exception loop happens, resulting in a hard lockup
panic.

>
> But I'd let James confirm here.
>
>
> I still don't know what you're fixing here.

On the ARM64 platform, when an Action Required error occurs, the kernel should
send SIGBUS with si_code BUS_MCEERR_AR instead of BUS_MCEERR_AO. (This is
also the subject of this thread.)

>
> Is this something you're encountering on some machine or you simply
> stared at code?

I met the wrong si_code problem on a Yitian 710 machine, which is based on the
ARM64 platform. And I think it is generic to the ARM64 platform.

To reproduce this problem:

# STEP1: enable early kill mode
#sysctl -w vm.memory_failure_early_kill=1
vm.memory_failure_early_kill = 1

# STEP2: inject an UCE error and consume it to trigger a synchronous error
#einj_mem_uc single
0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
injecting ...
triggering ...
signal 7 code 5 addr 0xffffb0d75000
page not present
Test passed

The si_code (code 5) from einj_mem_uc indicates that it is a BUS_MCEERR_AO error,
which is not actually the case.

After this patch set:

# STEP1: enable early kill mode
#sysctl -w vm.memory_failure_early_kill=1
vm.memory_failure_early_kill = 1

# STEP2: inject an UCE error and consume it to trigger a synchronous error
#einj_mem_uc single
0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
injecting ...
triggering ...
signal 7 code 4 addr 0xffffb0d75000
page not present
Test passed

The si_code (code 4) from einj_mem_uc indicates that it is BUS_MCEERR_AR error
as we expected.


>
> What does that
>
> "Both Alibaba and Huawei met the same issue in products, and we hope it
> could be fixed ASAP."
>
> mean?
>
> What did you meet?
>
> What was the problem?

We both got the wrong si_code in the SIGBUS sent from the kernel side on the ARM64 platform.

The VMM in our product relies on the si_code of SIGBUS to handle memory
failure in userspace.

- For BUS_MCEERR_AO, we regard the corruption as happening *outside the
current execution context*, e.g. detected by a background scrubber; the
VMM will ignore the error and the VM will not be killed immediately.
- For BUS_MCEERR_AR, we regard the corruption as happening *inside the
current execution context*, e.g. when a data poison is consumed; the VMM
will kill the VM immediately to avoid any further potential data
propagation.

>
> I still note that you're avoiding answering the question what the issue
> is and if you keep avoiding it, I'll ignore this whole thread.
>

Sorry, Borislav, and thank you for your patience and time. I really appreciate
that you are involved in reviewing this patchset. But I have to say it is not
true that I am avoiding anything. I tried my best to answer every comment
you raised, giving the details of the ARM RAS specifics and the code flow.

Best Regards,
Shuai

2023-11-30 15:31:41

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code

Moving James to To:

On Sun, Nov 26, 2023 at 08:25:38PM +0800, Shuai Xue wrote:
> > On Sat, Nov 25, 2023 at 02:44:52PM +0800, Shuai Xue wrote:
> >> - an AR error consumed by current process is deferred to handle in a
> >> dedicated kernel thread, but memory_failure() assumes that it runs in the
> >> current context
> >
> > On x86? ARM?
> >
> > Pease point to the exact code flow.
>
> An AR error consumed by current process is deferred to handle in a
> dedicated kernel thread on ARM platform. The AR error is handled in bellow
> flow:
>
> -----------------------------------------------------------------------------
> [usr space task einj_mem_uc consumd data poison, CPU 3] STEP 0
>
> -----------------------------------------------------------------------------
> [ghes_sdei_critical_callback: current einj_mem_uc, CPU 3] STEP 1
> ghes_sdei_critical_callback
> => __ghes_sdei_callback
> => ghes_in_nmi_queue_one_entry // peak and read estatus
> => irq_work_queue(&ghes_proc_irq_work) <=> ghes_proc_in_irq // irq_work
> [ghes_sdei_critical_callback: return]
> -----------------------------------------------------------------------------
> [ghes_proc_in_irq: current einj_mem_uc, CPU 3] STEP 2
> => ghes_do_proc
> => ghes_handle_memory_failure
> => ghes_do_memory_failure
> => memory_failure_queue // put work task on current CPU
> => if (kfifo_put(&mf_cpu->fifo, entry))
> schedule_work_on(smp_processor_id(), &mf_cpu->work);
> => task_work_add(current, &estatus_node->task_work, TWA_RESUME);
> [ghes_proc_in_irq: return]
> -----------------------------------------------------------------------------
> // kworker preempts einj_mem_uc on CPU 3 due to RESCHED flag STEP 3
> [memory_failure_work_func: current kworker, CPU 3]
> => memory_failure_work_func(&mf_cpu->work)
> => while kfifo_get(&mf_cpu->fifo, &entry); // until get no work
> => memory_failure(entry.pfn, entry.flags);

From the comment above that function:

* The function is primarily of use for corruptions that
* happen outside the current execution context (e.g. when
* detected by a background scrubber)
*
* Must run in process context (e.g. a work queue) with interrupts
* enabled and no spinlocks held.

> -----------------------------------------------------------------------------
> [ghes_kick_task_work: current einj_mem_uc, other cpu] STEP 4
> => memory_failure_queue_kick
> => cancel_work_sync - waiting memory_failure_work_func finish
> => memory_failure_work_func(&mf_cpu->work)
> => kfifo_get(&mf_cpu->fifo, &entry); // no work
> -----------------------------------------------------------------------------
> [einj_mem_uc resume at the same PC, trigger a page fault STEP 5
>
> STEP 0: A user space task, named einj_mem_uc consume a poison. The firmware
> notifies hardware error to kernel through is SDEI
> (ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED).
>
> STEP 1: The swapper running on CPU 3 is interrupted. irq_work_queue() rasie
> a irq_work to handle hardware errors in IRQ context
>
> STEP2: In IRQ context, ghes_proc_in_irq() queues memory failure work on
> current CPU in workqueue and add task work to sync with the workqueue.
>
> STEP3: The kworker preempts the current running thread and get CPU 3. Then
> memory_failure() is processed in kworker.

See above.

> STEP4: ghes_kick_task_work() is called as task_work to ensure any queued
> workqueue has been done before returning to user-space.
>
> STEP5: Upon returning to user-space, the task einj_mem_uc resumes at the
> current instruction, because the poison page is unmapped by
> memory_failure() in step 3, so a page fault will be triggered.
>
> memory_failure() assumes that it runs in the current context on both x86
> and ARM platform.
>
>
> for example:
> memory_failure() in mm/memory-failure.c:
>
> if (flags & MF_ACTION_REQUIRED) {
> folio = page_folio(p);
> res = kill_accessing_process(current, folio_pfn(folio), flags);
> }

And?

Do you see the check above it?

if (TestSetPageHWPoison(p)) {

test_and_set_bit() returns true only when the page was poisoned already.

* This function is intended to handle "Action Required" MCEs on already
* hardware poisoned pages. They could happen, for example, when
* memory_failure() failed to unmap the error page at the first call, or
* when multiple local machine checks happened on different CPUs.

And that's kill_accessing_process().

So AFAIU, the kworker running memory_failure() would only mark the page
as poison.

The killing happens when memory_failure() runs again and the process
touches the page again.

But I'd let James confirm here.

I still don't know what you're fixing here.

Is this something you're encountering on some machine or you simply
stared at code?

What does that

"Both Alibaba and Huawei met the same issue in products, and we hope it
could be fixed ASAP."

mean?

What did you meet?

What was the problem?

I still note that you're avoiding answering the question what the issue
is and if you keep avoiding it, I'll ignore this whole thread.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-11-30 15:36:35

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code

FTR, this is starting to make sense, thanks for explaining.

Replying only to this one for now:

On Thu, Nov 30, 2023 at 10:58:53AM +0800, Shuai Xue wrote:
> To reproduce this problem:
>
> # STEP1: enable early kill mode
> #sysctl -w vm.memory_failure_early_kill=1
> vm.memory_failure_early_kill = 1
>
> # STEP2: inject an UCE error and consume it to trigger a synchronous error

So this is for ARM folks to deal with, BUT:

A consumed uncorrectable error on x86 means panic. On some hw like on
AMD, that error doesn't even get seen by the OS but the hw does
something called syncflood to prevent further error propagation. So
there's no action required - the hw does that.

But I'd like to hear from ARM folks whether consuming an uncorrectable
error even lets software run. Dunno.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-11-30 17:39:57

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code

Hi Boris, Shuai,

On 29/11/2023 18:54, Borislav Petkov wrote:
> On Sun, Nov 26, 2023 at 08:25:38PM +0800, Shuai Xue wrote:
>>> On Sat, Nov 25, 2023 at 02:44:52PM +0800, Shuai Xue wrote:
>>>> - an AR error consumed by current process is deferred to handle in a
>>>> dedicated kernel thread, but memory_failure() assumes that it runs in the
>>>> current context
>>>
>>> On x86? ARM?
>>>
>>> Pease point to the exact code flow.


>> An AR error consumed by current process is deferred to handle in a
>> dedicated kernel thread on ARM platform. The AR error is handled in bellow
>> flow:

Please don't think of errors as "action required" - that's a user-space signal code. If
the page could be fixed by memory-failure(), you may never get a signal. (all this was the
fix for always sending an action-required signal)

I assume you mean the CPU accessed a poisoned location and took a synchronous error.


>> -----------------------------------------------------------------------------
>> [usr space task einj_mem_uc consumd data poison, CPU 3] STEP 0
>>
>> -----------------------------------------------------------------------------
>> [ghes_sdei_critical_callback: current einj_mem_uc, CPU 3] STEP 1
>> ghes_sdei_critical_callback
>> => __ghes_sdei_callback
>> => ghes_in_nmi_queue_one_entry // peak and read estatus
>> => irq_work_queue(&ghes_proc_irq_work) <=> ghes_proc_in_irq // irq_work
>> [ghes_sdei_critical_callback: return]
>> -----------------------------------------------------------------------------
>> [ghes_proc_in_irq: current einj_mem_uc, CPU 3] STEP 2
>> => ghes_do_proc
>> => ghes_handle_memory_failure
>> => ghes_do_memory_failure
>> => memory_failure_queue // put work task on current CPU
>> => if (kfifo_put(&mf_cpu->fifo, entry))
>> schedule_work_on(smp_processor_id(), &mf_cpu->work);
>> => task_work_add(current, &estatus_node->task_work, TWA_RESUME);
>> [ghes_proc_in_irq: return]
>> -----------------------------------------------------------------------------
>> // kworker preempts einj_mem_uc on CPU 3 due to RESCHED flag STEP 3
>> [memory_failure_work_func: current kworker, CPU 3]
>> => memory_failure_work_func(&mf_cpu->work)
>> => while kfifo_get(&mf_cpu->fifo, &entry); // until get no work
>> => memory_failure(entry.pfn, entry.flags);
>
> From the comment above that function:
>
> * The function is primarily of use for corruptions that
> * happen outside the current execution context (e.g. when
> * detected by a background scrubber)
> *
> * Must run in process context (e.g. a work queue) with interrupts
> * enabled and no spinlocks held.
>
>> -----------------------------------------------------------------------------
>> [ghes_kick_task_work: current einj_mem_uc, other cpu] STEP 4
>> => memory_failure_queue_kick
>> => cancel_work_sync - waiting memory_failure_work_func finish
>> => memory_failure_work_func(&mf_cpu->work)
>> => kfifo_get(&mf_cpu->fifo, &entry); // no work
>> -----------------------------------------------------------------------------
>> [einj_mem_uc resume at the same PC, trigger a page fault STEP 5
>>
>> STEP 0: A user space task, named einj_mem_uc consume a poison. The firmware
>> notifies hardware error to kernel through is SDEI
>> (ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED).
>>
>> STEP 1: The swapper running on CPU 3 is interrupted. irq_work_queue() rasie
>> a irq_work to handle hardware errors in IRQ context
>>
>> STEP2: In IRQ context, ghes_proc_in_irq() queues memory failure work on
>> current CPU in workqueue and add task work to sync with the workqueue.
>>
>> STEP3: The kworker preempts the current running thread and get CPU 3. Then
>> memory_failure() is processed in kworker.
>
> See above.
>
>> STEP4: ghes_kick_task_work() is called as task_work to ensure any queued
>> workqueue has been done before returning to user-space.
>>
>> STEP5: Upon returning to user-space, the task einj_mem_uc resumes at the
>> current instruction, because the poison page is unmapped by
>> memory_failure() in step 3, so a page fault will be triggered.
>>
>> memory_failure() assumes that it runs in the current context on both x86
>> and ARM platform.
>>
>>
>> for example:
>> memory_failure() in mm/memory-failure.c:
>>
>> if (flags & MF_ACTION_REQUIRED) {
>> folio = page_folio(p);
>> res = kill_accessing_process(current, folio_pfn(folio), flags);
>> }
>
> And?
>
> Do you see the check above it?
>
> if (TestSetPageHWPoison(p)) {
>
> test_and_set_bit() returns true only when the page was poisoned already.
>
> * This function is intended to handle "Action Required" MCEs on already
> * hardware poisoned pages. They could happen, for example, when
> * memory_failure() failed to unmap the error page at the first call, or
> * when multiple local machine checks happened on different CPUs.
>
> And that's kill_accessing_process().
>
> So AFAIU, the kworker running memory_failure() would only mark the page
> as poison.
>
> The killing happens when memory_failure() runs again and the process
> touches the page again.
>
> But I'd let James confirm here.

Yes, this is what is expected to happen with the existing code.

The first pass will remove the page from all processes that have it mapped before this
user-space task can restart. Restarting the task will make it access a poisoned page,
kicking off the second pass, which delivers the signal.

The reason for two passes is send_sig_mceerr() likes to clear_siginfo(), so even if you
queued action-required before leaving GHES, memory-failure() would stomp on it.


> I still don't know what you're fixing here.

The problem is if the user-space process registered for early messages, it gets a signal
on the first pass. If it returns from that signal, it will access the poisoned page and
get the action-required signal.

How is this making Qemu go wrong?


As to how this works for you given Boris' comments above: kill_procs() is also called from
hwpoison_user_mappings(), which takes the flags given to memory-failure(). This is where
the action-optional signals come from.


Thanks,

James

2023-11-30 17:40:05

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v9 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

Hi Shuai,

On 07/10/2023 08:28, Shuai Xue wrote:
> There are two major types of uncorrected recoverable (UCR) errors :

Is UCR a well known x86 acronym? It's best to just spell this out each time,
there is enough jargon in this area already.

>
> - Action Required (AR): The error is detected and the processor already
> consumes the memory. OS requires to take action (for example, offline
> failure page/kill failure thread) to recover this uncorrectable error.
>
> - Action Optional (AO): The error is detected out of processor execution
> context. Some data in the memory are corrupted. But the data have not
> been consumed. OS is optional to take action to recover this
> uncorrectable error.

As elsewhere, please don't think of errors as 'action required'; this is how
things get reported to user-space. Action-required for one thread may be
action-optional for another that has the same page mapped - it's really not a
property of the error.
It would be better to describe this as synchronous and asynchronous, or in-band
and out-of-band.


> The essential difference between AR and AO errors is that AR is a
> synchronous event, while AO is an asynchronous event. The hardware will
> signal a synchronous exception (Machine Check Exception on X86 and
> Synchronous External Abort on Arm64) when an error is detected and the
> memory access has been architecturally executed.

> When APEI firmware first is enabled, a platform may describe one error
> source for the handling of synchronous errors (e.g. MCE or SEA notification
> ), or for handling asynchronous errors (e.g. SCI or External Interrupt
> notification). In other words, we can distinguish synchronous errors by
> APEI notification. For AR errors, kernel will kill current process
> accessing the poisoned page by sending SIGBUS with BUS_MCEERR_AR. In
> addition, for AO errors, kernel will notify the process who owns the
> poisoned page by sending SIGBUS with BUS_MCEERR_AO in early kill mode.
> However, the GHES driver always sets mf_flags to 0 so that all UCR errors
> are handled as AO errors in memory failure.

To make this easier to read:
UCR and AR -> synchronous
AO -> asynchronous



> To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
> events.

> Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")'

Erm, this predates arm64 support, and what you have here doesn't change the behaviour on x86.

You can blame 7f17b4a121d0d50 ("ACPI: APEI: Kick the memory_failure() queue for
synchronous errors"), which should have covered this.

> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index ef59d6ea16da..88178aa6222d 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
> return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
> }
>
> +/*
> + * A platform may describe one error source for the handling of synchronous
> + * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
> + * or External Interrupt). On x86, the HEST notifications are always
> + * asynchronous, so only SEA on ARM is delivered as a synchronous
> + * notification.
> + */
> +static inline bool is_hest_sync_notify(struct ghes *ghes)
> +{
> + u8 notify_type = ghes->generic->notify.type;
> +
> + return notify_type == ACPI_HEST_NOTIFY_SEA;
> +}

and as you had in earlier versions, sometimes SDEI.
SDEI can report both synchronous and asynchronous errors; I wouldn't be too
surprised if the hardware NMI can be used for the same. It would be good to
chase up having a hint of this in the CPER records and pass that in here as
a hint.

Unfortunately, it's not safe to assume either way for SDEI.

Reviewed-by: James Morse <[email protected]>


Thanks,

James

2023-11-30 17:40:34

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v9 2/2] ACPI: APEI: handle synchronous exceptions in task work

Hi Shuai,

On 07/10/2023 08:28, Shuai Xue wrote:
> Hardware errors could be signaled by synchronous interrupt,

I'm struggling with 'synchronous interrupt'. Do you mean arm64's 'precise' (all
instructions before the exception were executed, and none after).
Otherwise, surely any interrupt from a background scrubber is inherently asynchronous!


> e.g. when an
> error is detected by a background scrubber, or signaled by synchronous
> exception, e.g. when an uncorrected error is consumed. Both synchronous and
> asynchronous error are queued and handled by a dedicated kthread in
> workqueue.
>
> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
> synchronous errors") keep track of whether memory_failure() work was
> queued, and make task_work pending to flush out the workqueue so that the
> work for synchronous error is processed before returning to user-space.

It does it regardless, if user-space was interrupted by APEI any work queued as a result
of that should be completed before we go back to user-space. Otherwise we can bounce
between user-space and firmware, with the kernel only running the APEI code, and never
making progress.


> The trick ensures that the corrupted page is unmapped and poisoned. And
> after returning to user-space, the task starts at current instruction which
> triggering a page fault in which kernel will send SIGBUS to current process
> due to VM_FAULT_HWPOISON.
>
> However, the memory failure recovery for hwpoison-aware mechanisms does not
> work as expected. For example, hwpoison-aware user-space processes like
> QEMU register their customized SIGBUS handler and enable early kill mode by
> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify

(setting, directly)

> the process by sending a SIGBUS signal in memory failure with wrong

> si_code: the actual user-space process accessing the corrupt memory
> location, but its memory failure work is handled in a kthread context, so
> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
> process instead of BUS_MCEERR_AR in kill_proc().

This is hard to parse, "the user-space process is accessing"? (dropping 'actual' and
adding 'is')


Wasn't this behaviour fixed by the previous patch?

What problem are you fixing here?


> To this end, separate synchronous and asynchronous error handling into
> different paths like X86 platform does:
>
> - valid synchronous errors: queue a task_work to synchronously send SIGBUS
> before ret_to_user.

> - valid asynchronous errors: queue a work into workqueue to asynchronously
> handle memory failure.

Why? The signal issue was fixed by the previous patch. Why delay the handling of a
poisoned memory location further?


> - abnormal branches such as invalid PA, unexpected severity, no memory
> failure config support, invalid GUID section, OOM, etc.

... do what?


> Then for valid synchronous errors, the current context in memory failure is
> exactly belongs to the task consuming poison data and it will send SIBBUS
> with proper si_code.


> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 6f35f724cc14..1675ff77033d 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1334,17 +1334,10 @@ static void kill_me_maybe(struct callback_head *cb)
> return;
> }
>
> - /*
> - * -EHWPOISON from memory_failure() means that it already sent SIGBUS
> - * to the current process with the proper error info,
> - * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
> - *
> - * In both cases, no further processing is required.
> - */
> if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
> return;
>
> - pr_err("Memory error not recovered");
> + pr_err("Sending SIGBUS to current task due to memory error not recovered");
> kill_me_now(cb);
> }
>

I'm not sure how this hunk is relevant to the commit message.


> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 88178aa6222d..014401a65ed5 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -484,6 +497,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> return false;
> }
>
> + if (flags == MF_ACTION_REQUIRED && current->mm) {
> + twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
> + if (!twcb)
> + return false;

Yuck - New failure modes! This is why the existing code always has this memory allocated
in struct ghes_estatus_node.


> + twcb->pfn = pfn;
> + twcb->flags = flags;
> + init_task_work(&twcb->twork, memory_failure_cb);
> + task_work_add(current, &twcb->twork, TWA_RESUME);
> + return true;
> + }
> +
> memory_failure_queue(pfn, flags);
> return true;
> }

[..]

> @@ -696,7 +721,14 @@ static bool ghes_do_proc(struct ghes *ghes,
> }
> }
>
> - return queued;
> + /*
> + * If no memory failure work is queued for abnormal synchronous
> + * errors, do a force kill.
> + */
> + if (sync && !queued) {
> + pr_err("Sending SIGBUS to current task due to memory error not recovered");
> + force_sig(SIGBUS);
> + }
> }

I think this is a lot of churn, and this hunk is the only meaningful change in
behaviour. Can you explain how this happens?


Wouldn't it be simpler to split ghes_kick_task_work() to have a sync/async version.
The synchronous version can unconditionally force_sig_mceerr(BUS_MCEERR_AR, ...) after
memory_failure_queue_kick() - but that still means memory_failure() is unable to disappear
errors that it fixed - see MF_RECOVERED.
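Roughly, the synchronous half of such a split could look like the sketch
below. This is only an illustration, not code from this series; it assumes
hypothetical fault_addr/fault_lsb fields were plumbed into struct
ghes_estatus_node, which the current structure does not carry:

	/* Illustrative only: a synchronous variant of ghes_kick_task_work(). */
	static void ghes_kick_task_work_sync(struct callback_head *head)
	{
		struct ghes_estatus_node *node;

		node = container_of(head, struct ghes_estatus_node, task_work);

		/* Flush the memory_failure() work queued on this CPU first. */
		if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
			memory_failure_queue_kick(node->task_work_cpu);

		/*
		 * Then deliver the action-required signal unconditionally.
		 * fault_addr/fault_lsb are hypothetical fields here.
		 */
		force_sig_mceerr(BUS_MCEERR_AR,
				 (void __user *)node->fault_addr,
				 node->fault_lsb);

		/* Freeing of the estatus node is omitted for brevity. */
	}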



> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 4d6e43c88489..0d02f8a0b556 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2161,9 +2161,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> * Must run in process context (e.g. a work queue) with interrupts
> * enabled and no spinlocks held.
> *
> - * Return: 0 for successfully handled the memory error,
> - * -EOPNOTSUPP for hwpoison_filter() filtered the error event,
> - * < 0(except -EOPNOTSUPP) on failure.
> + * Return values:
> + * 0 - success
> + * -EOPNOTSUPP - hwpoison_filter() filtered the error event.
> + * -EHWPOISON - sent SIGBUS to the current process with the proper
> + * error info by kill_accessing_process().
> + * other negative values - failure
> */
> int memory_failure(unsigned long pfn, int flags)
> {

I'm not sure how this hunk is relevant to the commit message.


Thanks,

James

2023-11-30 17:44:00

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code

Hi Boris,

On 30/11/2023 14:40, Borislav Petkov wrote:
> FTR, this is starting to make sense, thanks for explaining.
>
> Replying only to this one for now:
>
> On Thu, Nov 30, 2023 at 10:58:53AM +0800, Shuai Xue wrote:
>> To reproduce this problem:
>>
>> # STEP1: enable early kill mode
>> #sysctl -w vm.memory_failure_early_kill=1
>> vm.memory_failure_early_kill = 1
>>
>> # STEP2: inject an UCE error and consume it to trigger a synchronous error
>
> So this is for ARM folks to deal with, BUT:
>
> A consumed uncorrectable error on x86 means panic. On some hw like on
> AMD, that error doesn't even get seen by the OS but the hw does
> something called syncflood to prevent further error propagation. So
> there's no any action required - the hw does that.
>
> But I'd like to hear from ARM folks whether consuming an uncorrectable
> error even lets software run. Dunno.

I think we mean different things by 'consume' here.

I'd assume Shuai's test is poisoning a cache-line. When the CPU tries to access that
cache-line it will get an 'external abort' signal back from the memory system. Shuai - is
this what you mean by 'consume' - the CPU received external abort from the poisoned cache
line?

It's then up to the CPU whether it can put the world back in order to take this as
synchronous-external-abort or asynchronous-external-abort, which for arm64 are two
different interrupt/exception types.
The synchronous exceptions can't be masked, but the asynchronous one can.
If by the time the asynchronous-external-abort interrupt/exception has been unmasked, the
CPU has used the poisoned value in some calculation (which is what we usually mean by
consume) which has resulted in a memory access - it will report the error as 'uncontained'
because the error has been silently propagated. APEI should always report those as 'fatal',
and there is little point getting the OS involved at this point. Also in this category are
things like 'tag ram corruption', where you can no longer trust anything about memory.

Everything in this thread is about synchronous errors where this can't happen. The CPU
stops and takes an interrupt/exception instead.


Thanks,

James

2023-12-01 02:59:05

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code



On 2023/12/1 01:43, James Morse wrote:
> Hi Boris,
>
> On 30/11/2023 14:40, Borislav Petkov wrote:
>> FTR, this is starting to make sense, thanks for explaining.
>>
>> Replying only to this one for now:
>>
>> On Thu, Nov 30, 2023 at 10:58:53AM +0800, Shuai Xue wrote:
>>> To reproduce this problem:
>>>
>>> # STEP1: enable early kill mode
>>> #sysctl -w vm.memory_failure_early_kill=1
>>> vm.memory_failure_early_kill = 1
>>>
>>> # STEP2: inject an UCE error and consume it to trigger a synchronous error
>>
>> So this is for ARM folks to deal with, BUT:
>>
>> A consumed uncorrectable error on x86 means panic. On some hw like on
>> AMD, that error doesn't even get seen by the OS but the hw does
>> something called syncflood to prevent further error propagation. So
>> there's no any action required - the hw does that.

The "consume" is from the application's point of view, e.g. a memory read. If
poison is enabled, then an SRAR error will be detected and an MCE is raised
at the point of consumption in the execution flow.

A generic Intel x86 hw behaves like below:

1. UE Error Inject at a known Physical Address. (by einj_mem_uc through EINJ interface)
2. Core Issue a Memory Read to the same Physical Address (by a single memory read)
3. iMC Detects the error.
4. HA logs UCA error and signals CMCI if enabled
5. HA Forward data with poison indication bit set.
6. CBo detects the Poison data. Does not log any error.
7. MLC detects the Poison data.
8. DCU detects the Poison data, logs SRAR error and trigger MCERR if recoverable
9. OS/VMM takes corresponding recovery action based on affected state.

In our example:
- step 2 is triggered by a single memory read.
- step 8: UCR errors detected on data load, MCACOD 134H, triggering MCERR
- step 9: the kernel is expected to send a SIGBUS with si_code BUS_MCEERR_AR (code 4)
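For reference, the "trigger" step in such tests is typically nothing more
than a single dependent load from the virtual address that maps the injected
physical address; a minimal sketch (not the actual einj_mem_uc source) is:

	/* Hypothetical trigger: one load from the poisoned virtual address. */
	static long consume_poison(volatile long *poisoned_va)
	{
		/* The synchronous MCE/SEA is raised on this load. */
		return *poisoned_va;
	}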

I also ran the same test on AMD EPYC platforms, e.g. Milan and Genoa, which
behave the same as the Intel Xeon platforms, e.g. Icelake and SPR.

The ARMv8.2 RAS extension supports a similar data poisoning mechanism: a
Synchronous External Abort on arm64 (analogous to a Machine Check Exception
on x86) will be triggered in step 8. See James' comments for details. But the
kernel sends a SIGBUS with si_code BUS_MCEERR_AO (code 5), tested on
Alibaba Yitian 710 and Huawei Kunpeng 920.


>>
>> But I'd like to hear from ARM folks whether consuming an uncorrectable
>> error even lets software run. Dunno.
>
> I think we mean different things by 'consume' here.
>
> I'd assume Shuai's test is poisoning a cache-line. When the CPU tries to access that
> cache-line it will get an 'external abort' signal back from the memory system. Shuai - is
> this what you mean by 'consume' - the CPU received external abort from the poisoned cache
> line?
>

Yes, exactly. Thank you for pointing it out. We are talking about synchronous errors.

> It's then up to the CPU whether it can put the world back in order to take this as
> synchronous-external-abort or asynchronous-external-abort, which for arm64 are two
> different interrupt/exception types.
> The synchronous exceptions can't be masked, but the asynchronous one can.
> If by the time the asynchronous-external-abort interrupt/exception has been unmasked, the
> CPU has used the poisoned value in some calculation (which is what we usually mean by
> consume) which has resulted in a memory access - it will report the error as 'uncontained'
> because the error has been silently propagated. APEI should always report those as 'fatal',
> and there is little point getting the OS involved at this point. Also in this category are
> things like 'tag ram corruption', where you can no longer trust anything about memory.
>
> Everything in this thread is about synchronous errors where this can't happen. The CPU
> stops and takes an interrupt/exception instead.
>
>

Thank you for explaining.

Best Regards,
Shuai

2023-12-01 03:38:13

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code



On 2023/12/1 01:39, James Morse wrote:
> Hi Boris, Shuai,
>
> On 29/11/2023 18:54, Borislav Petkov wrote:
>> On Sun, Nov 26, 2023 at 08:25:38PM +0800, Shuai Xue wrote:
>>>> On Sat, Nov 25, 2023 at 02:44:52PM +0800, Shuai Xue wrote:
>>>>> - an AR error consumed by current process is deferred to handle in a
>>>>> dedicated kernel thread, but memory_failure() assumes that it runs in the
>>>>> current context
>>>>
>>>> On x86? ARM?
>>>>
>>>> Pease point to the exact code flow.
>
>
>>> An AR error consumed by current process is deferred to handle in a
>>> dedicated kernel thread on ARM platform. The AR error is handled in bellow
>>> flow:
>
> Please don't think of errors as "action required" - that's a user-space signal code. If
> the page could be fixed by memory-failure(), you may never get a signal. (all this was the
> fix for always sending an action-required signal)
>
> I assume you mean the CPU accessed a poisoned location and took a synchronous error.

Yes, I mean that the CPU accessed a poisoned location and took a synchronous error.
>
>
>>> -----------------------------------------------------------------------------
>>> [usr space task einj_mem_uc consumd data poison, CPU 3] STEP 0
>>>
>>> -----------------------------------------------------------------------------
>>> [ghes_sdei_critical_callback: current einj_mem_uc, CPU 3] STEP 1
>>> ghes_sdei_critical_callback
>>> => __ghes_sdei_callback
>>> => ghes_in_nmi_queue_one_entry // peak and read estatus
>>> => irq_work_queue(&ghes_proc_irq_work) <=> ghes_proc_in_irq // irq_work
>>> [ghes_sdei_critical_callback: return]
>>> -----------------------------------------------------------------------------
>>> [ghes_proc_in_irq: current einj_mem_uc, CPU 3] STEP 2
>>> => ghes_do_proc
>>> => ghes_handle_memory_failure
>>> => ghes_do_memory_failure
>>> => memory_failure_queue // put work task on current CPU
>>> => if (kfifo_put(&mf_cpu->fifo, entry))
>>> schedule_work_on(smp_processor_id(), &mf_cpu->work);
>>> => task_work_add(current, &estatus_node->task_work, TWA_RESUME);
>>> [ghes_proc_in_irq: return]
>>> -----------------------------------------------------------------------------
>>> // kworker preempts einj_mem_uc on CPU 3 due to RESCHED flag STEP 3
>>> [memory_failure_work_func: current kworker, CPU 3]
>>> => memory_failure_work_func(&mf_cpu->work)
>>> => while kfifo_get(&mf_cpu->fifo, &entry); // until get no work
>>> => memory_failure(entry.pfn, entry.flags);
>>
>> From the comment above that function:
>>
>> * The function is primarily of use for corruptions that
>> * happen outside the current execution context (e.g. when
>> * detected by a background scrubber)
>> *
>> * Must run in process context (e.g. a work queue) with interrupts
>> * enabled and no spinlocks held.
>>
>>> -----------------------------------------------------------------------------
>>> [ghes_kick_task_work: current einj_mem_uc, other cpu] STEP 4
>>> => memory_failure_queue_kick
>>> => cancel_work_sync - waiting memory_failure_work_func finish
>>> => memory_failure_work_func(&mf_cpu->work)
>>> => kfifo_get(&mf_cpu->fifo, &entry); // no work
>>> -----------------------------------------------------------------------------
>>> [einj_mem_uc resume at the same PC, trigger a page fault STEP 5
>>>
>>> STEP 0: A user space task, named einj_mem_uc consume a poison. The firmware
>>> notifies hardware error to kernel through is SDEI
>>> (ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED).
>>>
>>> STEP 1: The swapper running on CPU 3 is interrupted. irq_work_queue() rasie
>>> a irq_work to handle hardware errors in IRQ context
>>>
>>> STEP2: In IRQ context, ghes_proc_in_irq() queues memory failure work on
>>> current CPU in workqueue and add task work to sync with the workqueue.
>>>
>>> STEP3: The kworker preempts the current running thread and get CPU 3. Then
>>> memory_failure() is processed in kworker.
>>
>> See above.
>>
>>> STEP4: ghes_kick_task_work() is called as task_work to ensure any queued
>>> workqueue has been done before returning to user-space.
>>>
>>> STEP5: Upon returning to user-space, the task einj_mem_uc resumes at the
>>> current instruction, because the poison page is unmapped by
>>> memory_failure() in step 3, so a page fault will be triggered.
>>>
>>> memory_failure() assumes that it runs in the current context on both x86
>>> and ARM platform.
>>>
>>>
>>> for example:
>>> memory_failure() in mm/memory-failure.c:
>>>
>>> if (flags & MF_ACTION_REQUIRED) {
>>> folio = page_folio(p);
>>> res = kill_accessing_process(current, folio_pfn(folio), flags);
>>> }
>>
>> And?
>>
>> Do you see the check above it?
>>
>> if (TestSetPageHWPoison(p)) {
>>
>> test_and_set_bit() returns true only when the page was poisoned already.
>>
>> * This function is intended to handle "Action Required" MCEs on already
>> * hardware poisoned pages. They could happen, for example, when
>> * memory_failure() failed to unmap the error page at the first call, or
>> * when multiple local machine checks happened on different CPUs.
>>
>> And that's kill_accessing_process().
>>
>> So AFAIU, the kworker running memory_failure() would only mark the page
>> as poison.
>>
>> The killing happens when memory_failure() runs again and the process
>> touches the page again.
>>
>> But I'd let James confirm here.
>
> Yes, this is what is expected to happen with the existing code.
>
> The first pass will remove the pages from all processes that have it mapped before this
> user-space task can restart. Restarting the task will make it access a poisoned page,
> kicking off the second path which delivers the signal.
>
> The reason for two passes is send_sig_mceerr() likes to clear_siginfo(), so even if you
> queued action-required before leaving GHES, memory-failure() would stomp on it.
>
>
>> I still don't know what you're fixing here.
>
> The problem is if the user-space process registered for early messages, it gets a signal
> on the first pass. If it returns from that signal, it will access the poisoned page and
> get the action-required signal.
>
> How is this making Qemu go wrong?

The problem here is that we need to assume the first pass of memory failure
handling unmaps the poisoned page successfully.

- If so, it may work via the second-pass action-required signal, because the
  task accesses an unmapped page. But IMHO, we can improve this by sending
  only one signal, so that the Guest will vmexit only once, right?

- If not, there is no second-pass signal. The existing code does not handle
  the error code from memory_failure(), so an exception loop happens,
  resulting in a hard lockup panic.

Besides, in a production environment, a second access to an already-known
poisoned page introduces more risk of error propagation.

>
>
> As to how this works for you given Boris' comments above: kill_procs() is also called from
> hwpoison_user_mappings(), which takes the flags given to memory-failure(). This is where
> the action-optional signals come from.
>
>

Thank you very much for taking the time to review and comment.

Best Regards,
Shuai

2023-12-01 05:22:50

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v9 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events



On 2023/12/1 01:39, James Morse wrote:
> Hi Shuai,
>
> On 07/10/2023 08:28, Shuai Xue wrote:
>> There are two major types of uncorrected recoverable (UCR) errors :
>
> Is UCR a well known x86 acronym? It's best to just spell this out each time,
> there is enough jargon in this area already.

Quite agreed; I will spell it out as "uncorrected recoverable error" in the commit log.

>
>>
>> - Action Required (AR): The error is detected and the processor already
>> consumes the memory. OS requires to take action (for example, offline
>> failure page/kill failure thread) to recover this uncorrectable error.
>>
>> - Action Optional (AO): The error is detected out of processor execution
>> context. Some data in the memory are corrupted. But the data have not
>> been consumed. OS is optional to take action to recover this
>> uncorrectable error.
>
> As elsewhere, please don't think of errors as 'action required', this is how
> things get reported to user-space. Action-required for one thread may be
> action-optional for another that has the same page mapped - its really not a
> property of the error.
> It would be better to describe this as synchronous and asynchronous, or in-band
> and out-of-band.

Thank you for the explanation. I will change it to "synchronous and asynchronous".

>
>
>> The essential difference between AR and AO errors is that AR is a
>> synchronous event, while AO is an asynchronous event. The hardware will
>> signal a synchronous exception (Machine Check Exception on X86 and
>> Synchronous External Abort on Arm64) when an error is detected and the
>> memory access has been architecturally executed.
>
>> When APEI firmware first is enabled, a platform may describe one error
>> source for the handling of synchronous errors (e.g. MCE or SEA notification
>> ), or for handling asynchronous errors (e.g. SCI or External Interrupt
>> notification). In other words, we can distinguish synchronous errors by
>> APEI notification. For AR errors, kernel will kill current process
>> accessing the poisoned page by sending SIGBUS with BUS_MCEERR_AR. In
>> addition, for AO errors, kernel will notify the process who owns the
>> poisoned page by sending SIGBUS with BUS_MCEERR_AO in early kill mode.
>> However, the GHES driver always sets mf_flags to 0 so that all UCR errors
>> are handled as AO errors in memory failure.
>
> To make this easier to read:
> UCR and AR -> synchronous
> AO -> asynchronous
>

Will do that.

>
>> To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
>> events.
>
>> Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")'
>
> Erm, this predates arm64 support, and what you have here doesn't change the behaviour on x86.
>
> You can blame 7f17b4a121d0d50 ("ACPI: APEI: Kick the memory_failure() queue for
> synchronous errors"), which should have covered this.

Do you mean just drop the "Fixes" tags?

>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index ef59d6ea16da..88178aa6222d 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
>> return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
>> }
>>
>> +/*
>> + * A platform may describe one error source for the handling of synchronous
>> + * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
>> + * or External Interrupt). On x86, the HEST notifications are always
>> + * asynchronous, so only SEA on ARM is delivered as a synchronous
>> + * notification.
>> + */
>> +static inline bool is_hest_sync_notify(struct ghes *ghes)
>> +{
>> + u8 notify_type = ghes->generic->notify.type;
>> +
>> + return notify_type == ACPI_HEST_NOTIFY_SEA;
>> +}
>
> and as you had in earlier versions, sometimes SDEI.
> SDEI can report both synchronous and asynchronous errors; I wouldn't be too
> surprised if the hardware NMI can be used for the same. It would be good to
> chase up having a hint of this in the CPER records and pass that in here as
> a hint.
>
> Unfortunately, it's not safe to assume either way for SDEI.

For SDEI notification, only x0-x17 are preserved by the firmware. As the SDEI
TRM[1] describes, "the dispatcher can simulate an exception-like entry into
the client, **with the client providing an additional asynchronous entry
point similar to an interrupt entry point**". The client (kernel) lacks the
complete synchronous context, e.g. system registers (ELR, ESR, etc). So I
think SDEI notification should not be used for asynchronous errors; can you
help to confirm this?

For NMI notification, as far as I know, AArch64 (aka arm64 in the Linux
tree) does not provide architected NMIs.

>
> Reviewed-by: James Morse <[email protected]>
>

Thank you for valuable comments.

Best Regards,
Shuai

[1] https://developer.arm.com/documentation/den0054/latest/

2023-12-01 07:04:04

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v9 2/2] ACPI: APEI: handle synchronous exceptions in task work



On 2023/12/1 01:39, James Morse wrote:
> Hi Shuai,
>
> On 07/10/2023 08:28, Shuai Xue wrote:
>> Hardware errors could be signaled by synchronous interrupt,
>
> I'm struggling with 'synchronous interrupt'. Do you mean arm64's 'precise' (all
> instructions before the exception were executed, and none after).
> Otherwise, surely any interrupt from a background scrubber is inherently asynchronous!
>

I am sorry, this is a typo. I mean asynchronous interrupt.

>
>> e.g. when an
>> error is detected by a background scrubber, or signaled by synchronous
>> exception, e.g. when an uncorrected error is consumed. Both synchronous and
>> asynchronous error are queued and handled by a dedicated kthread in
>> workqueue.
>>
>> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for
>> synchronous errors") keep track of whether memory_failure() work was
>> queued, and make task_work pending to flush out the workqueue so that the
>> work for synchronous error is processed before returning to user-space.
>
> It does it regardless, if user-space was interrupted by APEI any work queued as a result
> of that should be completed before we go back to user-space. Otherwise we can bounce
> between user-space and firmware, with the kernel only running the APEI code, and never
> making progress.
>

Agreed.

>
>> The trick ensures that the corrupted page is unmapped and poisoned. And
>> after returning to user-space, the task starts at current instruction which
>> triggering a page fault in which kernel will send SIGBUS to current process
>> due to VM_FAULT_HWPOISON.
>>
>> However, the memory failure recovery for hwpoison-aware mechanisms does not
>> work as expected. For example, hwpoison-aware user-space processes like
>> QEMU register their customized SIGBUS handler and enable early kill mode by
>> seting PF_MCE_EARLY at initialization. Then the kernel will directly notify
>
> (setting, directly)

Thank you. Will fix it.

>
>> the process by sending a SIGBUS signal in memory failure with wrong
>
>> si_code: the actual user-space process accessing the corrupt memory
>> location, but its memory failure work is handled in a kthread context, so
>> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space
>> process instead of BUS_MCEERR_AR in kill_proc().
>
> This is hard to parse, "the user-space process is accessing"? (dropping 'actual' and
> adding 'is')

Will fix it.


>
>
> Wasn't this behaviour fixed by the previous patch?
>
> What problem are you fixing here?


Nope. memory_failure() runs in a kthread context, not in the context of the
user-space process which is consuming the poisoned data.


// kill_proc() in memory-failure.c

if ((flags & MF_ACTION_REQUIRED) && (t == current))
ret = force_sig_mceerr(BUS_MCEERR_AR,
(void __user *)tk->addr, addr_lsb);
else
ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr,
addr_lsb, t);

So, even if we queue memory_failure() with the MF_ACTION_REQUIRED flag as in
the previous patch, it will still send a SIGBUS with BUS_MCEERR_AO from the
else branch of kill_proc().

>
>
>> To this end, separate synchronous and asynchronous error handling into
>> different paths like X86 platform does:
>>
>> - valid synchronous errors: queue a task_work to synchronously send SIGBUS
>> before ret_to_user.
>
>> - valid asynchronous errors: queue a work into workqueue to asynchronously
>> handle memory failure.
>
> Why? The signal issue was fixed by the previous patch. Why delay the handling of a
> poisoned memory location further?

The signal issue is not fixed completely. See my reply above.

>
>
>> - abnormal branches such as invalid PA, unexpected severity, no memory
>> failure config support, invalid GUID section, OOM, etc.
>
> ... do what?

If no memory failure work is queued for abnormal errors, do a force kill.
Will also add this comment to commit log.

>
>
>> Then for valid synchronous errors, the current context in memory failure is
>> exactly belongs to the task consuming poison data and it will send SIBBUS
>> with proper si_code.
>
>
>> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>> index 6f35f724cc14..1675ff77033d 100644
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -1334,17 +1334,10 @@ static void kill_me_maybe(struct callback_head *cb)
>> return;
>> }
>>
>> - /*
>> - * -EHWPOISON from memory_failure() means that it already sent SIGBUS
>> - * to the current process with the proper error info,
>> - * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
>> - *
>> - * In both cases, no further processing is required.
>> - */
>> if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
>> return;
>>
>> - pr_err("Memory error not recovered");
>> + pr_err("Sending SIGBUS to current task due to memory error not recovered");
>> kill_me_now(cb);
>> }
>>
>
> I'm not sure how this hunk is relevant to the commit message.

I handle the memory_failure() error code in its arm64 call site,
memory_failure_cb(), with some comments, similar to the x86 call site
kill_me_maybe(). I moved these two comments to the function declaration,
following review comments from Kefeng.

I should split this into a separate patch. Will do it in the next version.

>
>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 88178aa6222d..014401a65ed5 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -484,6 +497,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>> return false;
>> }
>>
>> + if (flags == MF_ACTION_REQUIRED && current->mm) {
>> + twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
>> + if (!twcb)
>> + return false;
>
> Yuck - New failure modes! This is why the existing code always has this memory allocated
> in struct ghes_estatus_node.

Are you suggesting moving the fields of struct sync_task_work into struct
ghes_estatus_node, and using ghes_estatus_node here? Or we could just allocate
struct sync_task_work with gen_pool_alloc() from ghes_estatus_pool.

>
>
>> + twcb->pfn = pfn;
>> + twcb->flags = flags;
>> + init_task_work(&twcb->twork, memory_failure_cb);
>> + task_work_add(current, &twcb->twork, TWA_RESUME);
>> + return true;
>> + }
>> +
>> memory_failure_queue(pfn, flags);
>> return true;
>> }
>
> [..]
>
>> @@ -696,7 +721,14 @@ static bool ghes_do_proc(struct ghes *ghes,
>> }
>> }
>>
>> - return queued;
>> + /*
>> + * If no memory failure work is queued for abnormal synchronous
>> + * errors, do a force kill.
>> + */
>> + if (sync && !queued) {
>> + pr_err("Sending SIGBUS to current task due to memory error not recovered");
>> + force_sig(SIGBUS);
>> + }
>> }
>
> I think this is a lot of churn, and this hunk is the only meaningful change in
> behaviour. Can you explain how this happens?

For example:
- invalid GUID section in ghes_do_proc()
- CPER_MEM_VALID_PA is not set, unexpected severity in
ghes_handle_memory_failure().
- CONFIG_ACPI_APEI_MEMORY_FAILURE is not enabled, !pfn_valid(pfn) in
ghes_do_memory_failure()

>
>
> Wouldn't it be simpler to split ghes_kick_task_work() to have a sync/async version.
> The synchronous version can unconditionally force_sig_mceerr(BUS_MCEERR_AR, ...) after
> memory_failure_queue_kick() - but that still means memory_failure() is unable to disappear
> errors that it fixed - see MF_RECOVERED.

Sorry, I don't think so. Unconditionally sending a SIGBUS is not a good
choice. For example, if a synchronous memory error is detected in instruction
memory, the kernel should fix it up transparently and no signal should be sent.

./einj_mem_uc instr
[168522.751671] Memory failure: 0x89dedd: corrupted page was clean: dropped without side effects
[168522.751679] Memory failure: 0x89dedd: recovery action for clean LRU page: Recovered

With this patch set, the instr case behaves consistently on both the arm64 and x86 platforms.

The complex page error_states are handled in memory_failure(). IMHO, we
should leave this part to it.

>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 4d6e43c88489..0d02f8a0b556 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -2161,9 +2161,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>> * Must run in process context (e.g. a work queue) with interrupts
>> * enabled and no spinlocks held.
>> *
>> - * Return: 0 for successfully handled the memory error,
>> - * -EOPNOTSUPP for hwpoison_filter() filtered the error event,
>> - * < 0(except -EOPNOTSUPP) on failure.
>> + * Return values:
>> + * 0 - success
>> + * -EOPNOTSUPP - hwpoison_filter() filtered the error event.
>> + * -EHWPOISON - sent SIGBUS to the current process with the proper
>> + * error info by kill_accessing_process().
>> + * other negative values - failure
>> */
>> int memory_failure(unsigned long pfn, int flags)
>> {
>
> I'm not sure how this hunk is relevant to the commit message.


As mentioned, I will split this into a separate patch.

>
>
> Thanks,
>
> James


Thank you for valuable comments.
Best Regards,
Shuai

2023-12-18 06:45:46

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v10 0/4] ACPI: APEI: handle synchronous errors in task work with proper si_code

## Changes Log

changes since v9:
- split patch 2 to address exactly one issue in one patch (per Borislav)
- rewrite commit log according to template (per Borislav)
- pickup reviewed-by tag of patch 1 from James Morse
- alloc and free twcb through gen_pool_{alloc,free}() (per James)
- rewrite cover letter

changes since v8:
- remove the bug fix tag of patch 2 (per Jarkko Sakkinen)
- remove the declaration of memory_failure_queue_kick (per Naoya Horiguchi)
- rewrite the return value comments of memory_failure (per Naoya Horiguchi)

changes since v7:
- rebase to Linux v6.6-rc2 (no code changed)
- rewritten the cover letter to explain the motivation of this patchset

changes since v6:
- add a more explicit error message, as suggested by Xiaofei
- pick up reviewed-by tag from Xiaofei
- pick up internal reviewed-by tag from Baolin

changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure()
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure error such as invalid PA,
unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter

changes since v1:
- synchronous events by notify type
- Link: https://lore.kernel.org/lkml/[email protected]/

## Cover Letter

There are two major types of uncorrected recoverable (UCR) errors :

- Synchronous error: The error is detected and raised at the point of the
consumption in the execution flow, e.g. when a CPU tries to access
a poisoned cache line. The CPU will take a synchronous error exception
such as Synchronous External Abort (SEA) on Arm64 and Machine Check
  Exception (MCE) on X86. The OS is required to take action (for example,
  offline the failing page/kill the failing thread) to recover from this
  uncorrectable error.

- Asynchronous error: The error is detected out of processor execution
context, e.g. when an error is detected by a background scrubber. Some data
  in memory are corrupted, but the data have not been consumed. The OS may
  optionally take action to recover from this uncorrectable error.

Currently, both synchronous and asynchronous errors are queued by
ghes_handle_memory_failure() with flag 0, and handled by a dedicated kernel
thread in a work queue on the ARM64 platform. As a result, the memory
failure recovery sends SIGBUS with the wrong BUS_MCEERR_AO si_code for
synchronous errors in early kill mode. The main problem is that the
memory_failure() work is handled in a kthread context, not in the context of
the user-space process which is accessing the corrupt memory location, so it
will send SIGBUS with BUS_MCEERR_AO si_code to the user-space process instead
of BUS_MCEERR_AR in kill_proc().

Fix the problem by:
- Patch 1: setting memory_failure() flags as MF_ACTION_REQUIRED on synchronous
errors.
- Patch 2: performing a force kill if no memory_failure() work is queued for
synchronous errors.
- Patch 3: a minor comments improve.
- Patch 4: queueing memory_failure() as a task_work so that the current
context in memory_failure() exactly belongs to the process
consuming poison data.

Lv Ying and XiuQi from Huawei also proposed to address a similar problem[2][4].
Thanks to them for the discussion.

## Steps to Reproduce This Problem

To reproduce this problem:

# STEP1: enable early kill mode
#sysctl -w vm.memory_failure_early_kill=1
vm.memory_failure_early_kill = 1

# STEP2: inject an UCE error and consume it to trigger a synchronous error
#einj_mem_uc single
0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
injecting ...
triggering ...
signal 7 code 5 addr 0xffffb0d75000
page not present
Test passed

The si_code (code 5) from einj_mem_uc indicates that it is a BUS_MCEERR_AO
error, which is not what actually happened.

After this patch set:

# STEP1: enable early kill mode
#sysctl -w vm.memory_failure_early_kill=1
vm.memory_failure_early_kill = 1

# STEP2: inject an UCE error and consume it to trigger a synchronous error
#einj_mem_uc single
0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
injecting ...
triggering ...
signal 7 code 4 addr 0xffffb0d75000
page not present
Test passed

The si_code (code 4) from einj_mem_uc indicates that it is a BUS_MCEERR_AR
error, as we expected.
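For completeness, a minimal sketch of what a hwpoison-aware process (in the
spirit of QEMU or einj_mem_uc, but not their actual source) does to receive
and distinguish these signals:

	#include <signal.h>
	#include <string.h>
	#include <sys/prctl.h>
	#include <unistd.h>

	static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
	{
		/*
		 * BUS_MCEERR_AR (code 4): poison consumed synchronously.
		 * BUS_MCEERR_AO (code 5): page poisoned but not yet consumed.
		 */
		const char *msg = (si->si_code == BUS_MCEERR_AR) ?
			"SIGBUS: action required (AR)\n" :
			"SIGBUS: action optional (AO)\n";

		write(STDERR_FILENO, msg, strlen(msg));
		/* A real handler would unmap/replace si->si_addr; just exit here. */
		_exit(1);
	}

	int main(void)
	{
		struct sigaction sa = {
			.sa_sigaction	= sigbus_handler,
			.sa_flags	= SA_SIGINFO,
		};

		sigaction(SIGBUS, &sa, NULL);
		/* Per-process early kill (PF_MCE_EARLY), as QEMU does. */
		prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);

		pause();	/* wait for an injected error to be reported */
		return 0;
	}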

[1] Add ARMv8 RAS virtualization support in QEMU https://patchew.org/QEMU/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lkml.kernel.org/r/[email protected]
[4] https://lore.kernel.org/lkml/[email protected]/

Shuai Xue (4):
ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
synchronous events
ACPI: APEI: send SIGBUS to current task if synchronous memory error
not recovered
mm: memory-failure: move memory_failure() return value documentation
to function declaration
ACPI: APEI: handle synchronous exceptions in task work

arch/x86/kernel/cpu/mce/core.c | 9 +--
drivers/acpi/apei/ghes.c | 113 ++++++++++++++++++++++-----------
include/acpi/ghes.h | 3 -
mm/memory-failure.c | 22 ++-----
4 files changed, 82 insertions(+), 65 deletions(-)

--
2.39.3


2023-12-18 06:46:00

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v10 3/4] mm: memory-failure: move memory_failure() return value documentation to function declaration

Part of return value comments for memory_failure() were originally
documented at the call site. Move those comments to the function
declaration to improve code readability and to provide developers with
immediate access to function usage and return information.

Signed-off-by: Shuai Xue <[email protected]>
---
arch/x86/kernel/cpu/mce/core.c | 9 +--------
mm/memory-failure.c | 9 ++++++---
2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7b397370b4d6..43e542f06ad5 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1324,17 +1324,10 @@ static void kill_me_maybe(struct callback_head *cb)
return;
}

- /*
- * -EHWPOISON from memory_failure() means that it already sent SIGBUS
- * to the current process with the proper error info,
- * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
- *
- * In both cases, no further processing is required.
- */
if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
return;

- pr_err("Memory error not recovered");
+ pr_err("Sending SIGBUS to current task due to memory error not recovered");
kill_me_now(cb);
}

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 660c21859118..bd3dcafdfa4a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2164,9 +2164,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
* Must run in process context (e.g. a work queue) with interrupts
* enabled and no spinlocks held.
*
- * Return: 0 for successfully handled the memory error,
- * -EOPNOTSUPP for hwpoison_filter() filtered the error event,
- * < 0(except -EOPNOTSUPP) on failure.
+ * Return values:
+ * 0 - success
+ * -EOPNOTSUPP - hwpoison_filter() filtered the error event.
+ * -EHWPOISON - sent SIGBUS to the current process with the proper
+ * error info by kill_accessing_process().
+ * other negative values - failure
*/
int memory_failure(unsigned long pfn, int flags)
{
--
2.39.3


2023-12-18 06:46:11

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v10 1/4] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

There are two major types of uncorrected recoverable (UCR) errors :

- Synchronous error: The error is detected and raised at the point of the
consumption in the execution flow, e.g. when a CPU tries to access
a poisoned cache line. The CPU will take a synchronous error exception
such as Synchronous External Abort (SEA) on Arm64 and Machine Check
  Exception (MCE) on X86. The OS is required to take action (for example,
  offline the failing page/kill the failing thread) to recover from this
  uncorrectable error.

- Asynchronous error: The error is detected out of processor execution
context, e.g. when an error is detected by a background scrubber. Some data
  in memory are corrupted, but the data have not been consumed. The OS may
  optionally take action to recover from this uncorrectable error.

When APEI firmware-first is enabled, a platform may describe one error
source for the handling of synchronous errors (e.g. MCE or SEA notification),
or for handling asynchronous errors (e.g. SCI or External Interrupt
notification). In other words, we can distinguish synchronous errors by the
APEI notification type. For synchronous errors, the kernel will kill the
current process which is accessing the poisoned page by sending SIGBUS with
BUS_MCEERR_AR. In addition, for asynchronous errors, the kernel will notify
the process that owns the poisoned page by sending SIGBUS with BUS_MCEERR_AO
in early kill mode. However, the GHES driver always sets mf_flags to 0, so
that all synchronous errors are handled as asynchronous errors in memory failure.

To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
events.

Signed-off-by: Shuai Xue <[email protected]>
Tested-by: Ma Wupeng <[email protected]>
Reviewed-by: Kefeng Wang <[email protected]>
Reviewed-by: Xiaofei Tan <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Reviewed-by: James Morse <[email protected]>
---
drivers/acpi/apei/ghes.c | 29 +++++++++++++++++++++++------
1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 63ad0541db38..ab2a82cb1b0b 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
}

+/*
+ * A platform may describe one error source for the handling of synchronous
+ * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
+ * or External Interrupt). On x86, the HEST notifications are always
+ * asynchronous, so only SEA on ARM is delivered as a synchronous
+ * notification.
+ */
+static inline bool is_hest_sync_notify(struct ghes *ghes)
+{
+ u8 notify_type = ghes->generic->notify.type;
+
+ return notify_type == ACPI_HEST_NOTIFY_SEA;
+}
+
/*
* This driver isn't really modular, however for the time being,
* continuing to use module_param is the easiest way to remain
@@ -489,7 +503,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
}

static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
- int sev)
+ int sev, bool sync)
{
int flags = -1;
int sec_sev = ghes_severity(gdata->error_severity);
@@ -503,7 +517,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
(gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
flags = MF_SOFT_OFFLINE;
if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
- flags = 0;
+ flags = sync ? MF_ACTION_REQUIRED : 0;

if (flags != -1)
return ghes_do_memory_failure(mem_err->physical_addr, flags);
@@ -511,9 +525,11 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
return false;
}

-static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
+static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
+ int sev, bool sync)
{
struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
+ int flags = sync ? MF_ACTION_REQUIRED : 0;
bool queued = false;
int sec_sev, i;
char *p;
@@ -538,7 +554,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
* and don't filter out 'corrected' error here.
*/
if (is_cache && has_pa) {
- queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
+ queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
p += err_info->length;
continue;
}
@@ -666,6 +682,7 @@ static bool ghes_do_proc(struct ghes *ghes,
const guid_t *fru_id = &guid_null;
char *fru_text = "";
bool queued = false;
+ bool sync = is_hest_sync_notify(ghes);

sev = ghes_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
@@ -683,13 +700,13 @@ static bool ghes_do_proc(struct ghes *ghes,
atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);

arch_apei_report_mem_error(sev, mem_err);
- queued = ghes_handle_memory_failure(gdata, sev);
+ queued = ghes_handle_memory_failure(gdata, sev, sync);
}
else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
ghes_handle_aer(gdata);
}
else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
- queued = ghes_handle_arm_hw_error(gdata, sev);
+ queued = ghes_handle_arm_hw_error(gdata, sev, sync);
} else {
void *err = acpi_hest_get_payload(gdata);

--
2.39.3


2023-12-18 06:51:19

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v10 2/4] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered

A synchronous error is detected as a result of a user-space process accessing
a 2-bit uncorrected error. The CPU will take a synchronous error exception
such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
memory_failure() work which poisons the related page, unmaps the page, and
then sends a SIGBUS to the process, so that a system-wide panic can be
avoided.

However, no memory_failure() work will be queued when abnormal synchronous
errors occur. These errors can include situations such as invalid PA,
unexpected severity, no memory failure config support, invalid GUID
section, etc. In such cases, the user-space process will trigger an SEA again.
This loop can potentially exceed the platform firmware threshold or even
trigger a kernel hard lockup, leading to a system reboot.

Fix it by performing a force kill if no memory_failure() work is queued for synchronous errors.

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/acpi/apei/ghes.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index ab2a82cb1b0b..f832ffc5a88d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -717,6 +717,15 @@ static bool ghes_do_proc(struct ghes *ghes,
}
}

+ /*
+ * If no memory failure work is queued for abnormal synchronous
+ * errors, do a force kill.
+ */
+ if (sync && !queued) {
+ pr_err("Sending SIGBUS to current task due to memory error not recovered");
+ force_sig(SIGBUS);
+ }
+
return queued;
}

--
2.39.3


2023-12-18 06:51:35

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v10 4/4] ACPI: APEI: handle synchronous exceptions in task work

Hardware errors could be signaled by asynchronous interrupt, e.g. when an
error is detected by a background scrubber, or signaled by synchronous
exception, e.g. when a CPU tries to access a poisoned cache line. Both
synchronous and asynchronous errors are queued as memory_failure() work
and handled by a dedicated kthread in a workqueue.

However, the memory failure recovery sends SIGBUS with the wrong BUS_MCEERR_AO
si_code for synchronous errors in early kill mode, even if MF_ACTION_REQUIRED
is set. The main problem is that the memory failure work is handled in a
kthread context, not in the context of the user-space process which is
accessing the corrupt memory location, so it will send SIGBUS with the
BUS_MCEERR_AO si_code to the user-space process instead of BUS_MCEERR_AR in
kill_proc().

To this end, queue memory_failure() as a task_work so that the current
context in memory_failure() belongs exactly to the process consuming the
poisoned data, and it will send SIGBUS with the proper si_code.

Signed-off-by: Shuai Xue <[email protected]>
Tested-by: Ma Wupeng <[email protected]>
Reviewed-by: Kefeng Wang <[email protected]>
Reviewed-by: Xiaofei Tan <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
---
drivers/acpi/apei/ghes.c | 77 +++++++++++++++++++++++-----------------
include/acpi/ghes.h | 3 --
mm/memory-failure.c | 13 -------
3 files changed, 44 insertions(+), 49 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index f832ffc5a88d..a6b4907cfe47 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -464,28 +464,41 @@ static void ghes_clear_estatus(struct ghes *ghes,
}

/*
- * Called as task_work before returning to user-space.
- * Ensure any queued work has been done before we return to the context that
- * triggered the notification.
+ * struct sync_task_work - for synchronous RAS event
+ *
+ * @twork: callback_head for task work
+ * @pfn: page frame number of corrupted page
+ * @flags: fine tune action taken
+ *
+ * Structure to pass task work to be handled before
+ * ret_to_user via task_work_add().
*/
-static void ghes_kick_task_work(struct callback_head *head)
+struct sync_task_work {
+ struct callback_head twork;
+ u64 pfn;
+ int flags;
+};
+
+static void memory_failure_cb(struct callback_head *twork)
{
- struct acpi_hest_generic_status *estatus;
- struct ghes_estatus_node *estatus_node;
- u32 node_len;
+ int ret;
+ struct sync_task_work *twcb =
+ container_of(twork, struct sync_task_work, twork);

- estatus_node = container_of(head, struct ghes_estatus_node, task_work);
- if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
- memory_failure_queue_kick(estatus_node->task_work_cpu);
+ ret = memory_failure(twcb->pfn, twcb->flags);
+ gen_pool_free(ghes_estatus_pool, (unsigned long)twcb, sizeof(*twcb));

- estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
- node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
- gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
+ if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP)
+ return;
+
+ pr_err("Sending SIGBUS to current task due to memory error not recovered");
+ force_sig(SIGBUS);
}

static bool ghes_do_memory_failure(u64 physical_addr, int flags)
{
unsigned long pfn;
+ struct sync_task_work *twcb;

if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
return false;
@@ -498,6 +511,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
return false;
}

+ if (flags == MF_ACTION_REQUIRED && current->mm) {
+ twcb = (void *)gen_pool_alloc(ghes_estatus_pool, sizeof(*twcb));
+ if (!twcb)
+ return false;
+
+ twcb->pfn = pfn;
+ twcb->flags = flags;
+ init_task_work(&twcb->twork, memory_failure_cb);
+ task_work_add(current, &twcb->twork, TWA_RESUME);
+ return true;
+ }
+
memory_failure_queue(pfn, flags);
return true;
}
@@ -673,7 +698,7 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
schedule_work(&entry->work);
}

-static bool ghes_do_proc(struct ghes *ghes,
+static void ghes_do_proc(struct ghes *ghes,
const struct acpi_hest_generic_status *estatus)
{
int sev, sec_sev;
@@ -725,8 +750,6 @@ static bool ghes_do_proc(struct ghes *ghes,
pr_err("Sending SIGBUS to current task due to memory error not recovered");
force_sig(SIGBUS);
}
-
- return queued;
}

static void __ghes_print_estatus(const char *pfx,
@@ -1028,9 +1051,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
struct ghes_estatus_node *estatus_node;
struct acpi_hest_generic *generic;
struct acpi_hest_generic_status *estatus;
- bool task_work_pending;
u32 len, node_len;
- int ret;

llnode = llist_del_all(&ghes_estatus_llist);
/*
@@ -1045,25 +1066,16 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
len = cper_estatus_len(estatus);
node_len = GHES_ESTATUS_NODE_LEN(len);
- task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
+
+ ghes_do_proc(estatus_node->ghes, estatus);
+
if (!ghes_estatus_cached(estatus)) {
generic = estatus_node->generic;
if (ghes_print_estatus(NULL, generic, estatus))
ghes_estatus_cache_add(generic, estatus);
}
-
- if (task_work_pending && current->mm) {
- estatus_node->task_work.func = ghes_kick_task_work;
- estatus_node->task_work_cpu = smp_processor_id();
- ret = task_work_add(current, &estatus_node->task_work,
- TWA_RESUME);
- if (ret)
- estatus_node->task_work.func = NULL;
- }
-
- if (!estatus_node->task_work.func)
- gen_pool_free(ghes_estatus_pool,
- (unsigned long)estatus_node, node_len);
+ gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
+ node_len);

llnode = next;
}
@@ -1124,7 +1136,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,

estatus_node->ghes = ghes;
estatus_node->generic = ghes->generic;
- estatus_node->task_work.func = NULL;
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);

if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index be1dd4c1a917..ebd21b05fe6e 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -35,9 +35,6 @@ struct ghes_estatus_node {
struct llist_node llnode;
struct acpi_hest_generic *generic;
struct ghes *ghes;
-
- int task_work_cpu;
- struct callback_head task_work;
};

struct ghes_estatus_cache {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index bd3dcafdfa4a..6bff57444928 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2451,19 +2451,6 @@ static void memory_failure_work_func(struct work_struct *work)
}
}

-/*
- * Process memory_failure work queued on the specified CPU.
- * Used to avoid return-to-userspace racing with the memory_failure workqueue.
- */
-void memory_failure_queue_kick(int cpu)
-{
- struct memory_failure_cpu *mf_cpu;
-
- mf_cpu = &per_cpu(memory_failure_cpu, cpu);
- cancel_work_sync(&mf_cpu->work);
- memory_failure_work_func(&mf_cpu->work);
-}
-
static int __init memory_failure_init(void)
{
struct memory_failure_cpu *mf_cpu;
--
2.39.3
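
For readers less familiar with the mechanism used in the patch above: it relies
on the kernel's generic task_work facility, where a struct callback_head is
embedded in some per-event data and queued on a task so that the callback runs
in that task's context right before it returns to user space. A minimal sketch
of the pattern in isolation is shown below; handle_event() is a hypothetical
consumer, and kmalloc() is used only for brevity (the patch allocates from
ghes_estatus_pool instead because the GHES path may run in NMI-like context
where kmalloc() is not allowed).

#include <linux/task_work.h>
#include <linux/sched.h>
#include <linux/slab.h>

extern void handle_event(u64 pfn, int flags);	/* hypothetical consumer */

struct my_event {
	struct callback_head twork;	/* must stay valid until the callback runs */
	u64 pfn;
	int flags;
};

static void my_event_cb(struct callback_head *twork)
{
	struct my_event *ev = container_of(twork, struct my_event, twork);

	/* Runs in the queued task's context, just before return to user space. */
	handle_event(ev->pfn, ev->flags);
	kfree(ev);
}

static int queue_my_event(u64 pfn, int flags)
{
	struct my_event *ev = kmalloc(sizeof(*ev), GFP_KERNEL);

	if (!ev)
		return -ENOMEM;

	ev->pfn = pfn;
	ev->flags = flags;
	init_task_work(&ev->twork, my_event_cb);
	/* TWA_RESUME: run the callback when 'current' returns to user space. */
	return task_work_add(current, &ev->twork, TWA_RESUME);
}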


2023-12-18 06:54:06

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v10 1/4] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

On Mon, Dec 18, 2023 at 02:45:18PM +0800, Shuai Xue wrote:
> There are two major types of uncorrected recoverable (UCR) errors :
>
> - Synchronous error: The error is detected and raised at the point of the
> consumption in the execution flow, e.g. when a CPU tries to access
> a poisoned cache line. The CPU will take a synchronous error exception
> such as Synchronous External Abort (SEA) on Arm64 and Machine Check
> Exception (MCE) on X86. OS requires to take action (for example, offline
> failure page/kill failure thread) to recover this uncorrectable error.
>
> - Asynchronous error: The error is detected out of processor execution
> context, e.g. when an error is detected by a background scrubber. Some data
> in the memory are corrupted. But the data have not been consumed. OS is
> optional to take action to recover this uncorrectable error.
>
> When APEI firmware first is enabled, a platform may describe one error
> source for the handling of synchronous errors (e.g. MCE or SEA notification
> ), or for handling asynchronous errors (e.g. SCI or External Interrupt
> notification). In other words, we can distinguish synchronous errors by
> APEI notification. For synchronous errors, kernel will kill the current
> process which accessing the poisoned page by sending SIGBUS with
> BUS_MCEERR_AR. In addition, for asynchronous errors, kernel will notify the
> process who owns the poisoned page by sending SIGBUS with BUS_MCEERR_AO in
> early kill mode. However, the GHES driver always sets mf_flags to 0 so that
> all synchronous errors are handled as asynchronous errors in memory failure.
>
> To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
> events.
>
> Signed-off-by: Shuai Xue <[email protected]>
> Tested-by: Ma Wupeng <[email protected]>
> Reviewed-by: Kefeng Wang <[email protected]>
> Reviewed-by: Xiaofei Tan <[email protected]>
> Reviewed-by: Baolin Wang <[email protected]>
> Reviewed-by: James Morse <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 29 +++++++++++++++++++++++------
> 1 file changed, 23 insertions(+), 6 deletions(-)
>

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree. Please read:
https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>

2023-12-18 06:54:21

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v10 2/4] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered

On Mon, Dec 18, 2023 at 02:45:19PM +0800, Shuai Xue wrote:
> Synchronous error was detected as a result of user-space process accessing
> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> memory_failure() work which poisons the related page, unmaps the page, and
> then sends a SIGBUS to the process, so that a system wide panic can be
> avoided.
>
> However, no memory_failure() work will be queued when abnormal synchronous
> errors occur. These errors can include situations such as invalid PA,
> unexpected severity, no memory failure config support, invalid GUID
> section, etc. In such case, the user-space process will trigger SEA again.
> This loop can potentially exceed the platform firmware threshold or even
> trigger a kernel hard lockup, leading to a system reboot.
>
> Fix it by performing a force kill if no memory_failure() work is queued for synchronous errors.
>
> Signed-off-by: Shuai Xue <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 9 +++++++++
> 1 file changed, 9 insertions(+)

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree. Please read:
https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>

2023-12-18 06:54:37

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v10 3/4] mm: memory-failure: move memory_failure() return value documentation to function declaration

On Mon, Dec 18, 2023 at 02:45:20PM +0800, Shuai Xue wrote:
> Part of return value comments for memory_failure() were originally
> documented at the call site. Move those comments to the function
> declaration to improve code readability and to provide developers with
> immediate access to function usage and return information.
>
> Signed-off-by: Shuai Xue <[email protected]>
> ---
> arch/x86/kernel/cpu/mce/core.c | 9 +--------
> mm/memory-failure.c | 9 ++++++---
> 2 files changed, 7 insertions(+), 11 deletions(-)
>

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree. Please read:
https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>

2023-12-18 06:54:58

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v10 4/4] ACPI: APEI: handle synchronous exceptions in task work

On Mon, Dec 18, 2023 at 02:45:21PM +0800, Shuai Xue wrote:
> Hardware errors could be signaled by asynchronous interrupt, e.g. when an
> error is detected by a background scrubber, or signaled by synchronous
> exception, e.g. when a CPU tries to access a poisoned cache line. Both
> synchronous and asynchronous error are queued as a memory_failure() work
> and handled by a dedicated kthread in workqueue.
>
> However, the memory failure recovery sends SIGBUS with the wrong BUS_MCEERR_AO
> si_code for synchronous errors in early kill mode, even though MF_ACTION_REQUIRED
> is set. The main problem is that the memory failure work is handled in
> kthread context but not the user-space process which is accessing the
> corrupt memory location, so it will send SIGBUS with BUS_MCEERR_AO si_code
> to the user-space process instead of BUS_MCEERR_AR in kill_proc().
>
> To this end, queue memory_failure() as a task_work so that the current
> context in memory_failure() belongs exactly to the process consuming the
> poison data, and it will send SIGBUS with the proper si_code.
>
> Signed-off-by: Shuai Xue <[email protected]>
> Tested-by: Ma Wupeng <[email protected]>
> Reviewed-by: Kefeng Wang <[email protected]>
> Reviewed-by: Xiaofei Tan <[email protected]>
> Reviewed-by: Baolin Wang <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 77 +++++++++++++++++++++++-----------------
> include/acpi/ghes.h | 3 --
> mm/memory-failure.c | 13 -------
> 3 files changed, 44 insertions(+), 49 deletions(-)
>


<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree. Please read:
https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>

2023-12-21 14:03:19

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v10 1/4] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

On Mon, Dec 18, 2023 at 7:45 AM Shuai Xue <[email protected]> wrote:
>
> There are two major types of uncorrected recoverable (UCR) errors :
>
> - Synchronous error: The error is detected and raised at the point of the
> consumption in the execution flow, e.g. when a CPU tries to access
> a poisoned cache line. The CPU will take a synchronous error exception
> such as Synchronous External Abort (SEA) on Arm64 and Machine Check
> Exception (MCE) on X86. OS requires to take action (for example, offline
> failure page/kill failure thread) to recover this uncorrectable error.
>
> - Asynchronous error: The error is detected out of processor execution
> context, e.g. when an error is detected by a background scrubber. Some data
> in the memory are corrupted. But the data have not been consumed. OS is
> optional to take action to recover this uncorrectable error.
>
> When APEI firmware first is enabled, a platform may describe one error
> source for the handling of synchronous errors (e.g. MCE or SEA notification
> ), or for handling asynchronous errors (e.g. SCI or External Interrupt
> notification). In other words, we can distinguish synchronous errors by
> APEI notification. For synchronous errors, kernel will kill the current
> process which accessing the poisoned page by sending SIGBUS with
> BUS_MCEERR_AR. In addition, for asynchronous errors, kernel will notify the
> process who owns the poisoned page by sending SIGBUS with BUS_MCEERR_AO in
> early kill mode. However, the GHES driver always sets mf_flags to 0 so that
> all synchronous errors are handled as asynchronous errors in memory failure.
>
> To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
> events.
>
> Signed-off-by: Shuai Xue <[email protected]>
> Tested-by: Ma Wupeng <[email protected]>
> Reviewed-by: Kefeng Wang <[email protected]>
> Reviewed-by: Xiaofei Tan <[email protected]>
> Reviewed-by: Baolin Wang <[email protected]>
> Reviewed-by: James Morse <[email protected]>

Applied as 6.8 material.

The other patches in the series still need to receive tags from the
APEI designated reviewers (as per MAINTAINERS).

Thanks!


> ---
> drivers/acpi/apei/ghes.c | 29 +++++++++++++++++++++++------
> 1 file changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 63ad0541db38..ab2a82cb1b0b 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
> return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
> }
>
> +/*
> + * A platform may describe one error source for the handling of synchronous
> + * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
> + * or External Interrupt). On x86, the HEST notifications are always
> + * asynchronous, so only SEA on ARM is delivered as a synchronous
> + * notification.
> + */
> +static inline bool is_hest_sync_notify(struct ghes *ghes)
> +{
> + u8 notify_type = ghes->generic->notify.type;
> +
> + return notify_type == ACPI_HEST_NOTIFY_SEA;
> +}
> +
> /*
> * This driver isn't really modular, however for the time being,
> * continuing to use module_param is the easiest way to remain
> @@ -489,7 +503,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> }
>
> static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> - int sev)
> + int sev, bool sync)
> {
> int flags = -1;
> int sec_sev = ghes_severity(gdata->error_severity);
> @@ -503,7 +517,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
> flags = MF_SOFT_OFFLINE;
> if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
> - flags = 0;
> + flags = sync ? MF_ACTION_REQUIRED : 0;
>
> if (flags != -1)
> return ghes_do_memory_failure(mem_err->physical_addr, flags);
> @@ -511,9 +525,11 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> return false;
> }
>
> -static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
> +static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
> + int sev, bool sync)
> {
> struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
> + int flags = sync ? MF_ACTION_REQUIRED : 0;
> bool queued = false;
> int sec_sev, i;
> char *p;
> @@ -538,7 +554,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
> * and don't filter out 'corrected' error here.
> */
> if (is_cache && has_pa) {
> - queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
> + queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
> p += err_info->length;
> continue;
> }
> @@ -666,6 +682,7 @@ static bool ghes_do_proc(struct ghes *ghes,
> const guid_t *fru_id = &guid_null;
> char *fru_text = "";
> bool queued = false;
> + bool sync = is_hest_sync_notify(ghes);
>
> sev = ghes_severity(estatus->error_severity);
> apei_estatus_for_each_section(estatus, gdata) {
> @@ -683,13 +700,13 @@ static bool ghes_do_proc(struct ghes *ghes,
> atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);
>
> arch_apei_report_mem_error(sev, mem_err);
> - queued = ghes_handle_memory_failure(gdata, sev);
> + queued = ghes_handle_memory_failure(gdata, sev, sync);
> }
> else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
> ghes_handle_aer(gdata);
> }
> else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
> - queued = ghes_handle_arm_hw_error(gdata, sev);
> + queued = ghes_handle_arm_hw_error(gdata, sev, sync);
> } else {
> void *err = acpi_hest_get_payload(gdata);
>
> --
> 2.39.3
>

2023-12-22 01:08:13

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v10 1/4] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events



On 2023/12/21 21:55, Rafael J. Wysocki wrote:
> On Mon, Dec 18, 2023 at 7:45 AM Shuai Xue <[email protected]> wrote:
>>
>> There are two major types of uncorrected recoverable (UCR) errors :
>>
>> - Synchronous error: The error is detected and raised at the point of the
>> consumption in the execution flow, e.g. when a CPU tries to access
>> a poisoned cache line. The CPU will take a synchronous error exception
>> such as Synchronous External Abort (SEA) on Arm64 and Machine Check
>> Exception (MCE) on X86. OS requires to take action (for example, offline
>> failure page/kill failure thread) to recover this uncorrectable error.
>>
>> - Asynchronous error: The error is detected out of processor execution
>> context, e.g. when an error is detected by a background scrubber. Some data
>> in the memory are corrupted. But the data have not been consumed. OS is
>> optional to take action to recover this uncorrectable error.
>>
>> When APEI firmware first is enabled, a platform may describe one error
>> source for the handling of synchronous errors (e.g. MCE or SEA notification
>> ), or for handling asynchronous errors (e.g. SCI or External Interrupt
>> notification). In other words, we can distinguish synchronous errors by
>> APEI notification. For synchronous errors, kernel will kill the current
>> process which accessing the poisoned page by sending SIGBUS with
>> BUS_MCEERR_AR. In addition, for asynchronous errors, kernel will notify the
>> process who owns the poisoned page by sending SIGBUS with BUS_MCEERR_AO in
>> early kill mode. However, the GHES driver always sets mf_flags to 0 so that
>> all synchronous errors are handled as asynchronous errors in memory failure.
>>
>> To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
>> events.
>>
>> Signed-off-by: Shuai Xue <[email protected]>
>> Tested-by: Ma Wupeng <[email protected]>
>> Reviewed-by: Kefeng Wang <[email protected]>
>> Reviewed-by: Xiaofei Tan <[email protected]>
>> Reviewed-by: Baolin Wang <[email protected]>
>> Reviewed-by: James Morse <[email protected]>
>
> Applied as 6.8 material.
>
> The other patches in the series still need to receive tags from the
> APEI designated reviewers (as per MAINTAINERS).
>
> Thanks!
>

Thank you :)

I will wait more feedback of other patches from MAINTAINERS.

Cheers,
Shuai

2024-02-04 08:02:20

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v11 0/3] ACPI: APEI: handle synchronous exceptions in task work to send correct SIGBUS si_code

## Changes Log
changes since v10:
- rebase to v6.8-rc2

changes since v9:
- split patch 2 to address exactly one issue in one patch (per Borislav)
- rewrite commit log according to template (per Borislav)
- pickup reviewed-by tag of patch 1 from James Morse
- alloc and free twcb through gen_pool_{alloc, free} (per James)
- rewrite cover letter

changes since v8:
- remove the bug fix tag of patch 2 (per Jarkko Sakkinen)
- remove the declaration of memory_failure_queue_kick (per Naoya Horiguchi)
- rewrite the return value comments of memory_failure (per Naoya Horiguchi)

changes since v7:
- rebase to Linux v6.6-rc2 (no code changed)
- rewritten the cover letter to explain the motivation of this patchset

changes since v6:
- add a more explicit error message as suggested by Xiaofei
- pick up reviewed-by tag from Xiaofei
- pick up internal reviewed-by tag from Baolin

changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure()
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure error such as invalid PA,
unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter

changes since v1:
- synchronous events by notify type
- Link: https://lore.kernel.org/lkml/[email protected]/

## Cover Letter

There are two major types of uncorrected recoverable (UCR) errors :

- Synchronous error: The error is detected and raised at the point of the
consumption in the execution flow, e.g. when a CPU tries to access
a poisoned cache line. The CPU will take a synchronous error exception
such as Synchronous External Abort (SEA) on Arm64 and Machine Check
Exception (MCE) on X86. OS requires to take action (for example, offline
failure page/kill failure thread) to recover this uncorrectable error.

- Asynchronous error: The error is detected out of processor execution
context, e.g. when an error is detected by a background scrubber. Some data
in the memory are corrupted. But the data have not been consumed. OS is
optional to take action to recover this uncorrectable error.

Since commit a70297d22132 ("ACPI: APEI: set memory failure flags as
MF_ACTION_REQUIRED on synchronous events")', the flag MF_ACTION_REQUIRED
could be used to determine whether a synchronous exception occurs on ARM64
platform. When a synchronous exception is detected, the kernel should
terminate the current process which is accessing the poisoned page. This is
done by sending a SIGBUS signal with an error code BUS_MCEERR_AR,
indicating an action-required machine check error on read.

However, the memory failure recovery is incorrectly sending a SIGBUS
with the wrong error code BUS_MCEERR_AO for synchronous errors in early kill
mode, even though MF_ACTION_REQUIRED is set. The main problem is that
synchronous errors are queued as a memory_failure() work, and are
executed within a kernel thread context, not the user-space process that
encountered the corrupted memory on ARM64 platform. As a result, when
kill_proc() is called to terminate the process, it sends the incorrect
SIGBUS error code because the context in which it operates is not the
one where the error was triggered.
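
The si_code flip follows directly from the decision in kill_proc() in
mm/memory-failure.c: BUS_MCEERR_AR is only forced when the flags carry
MF_ACTION_REQUIRED and the task being signalled is current, which can never be
true when memory_failure() runs from the workqueue kthread. A condensed sketch
of that logic (simplified, not the verbatim kernel code):

static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
{
	struct task_struct *t = tk->tsk;
	short addr_lsb = tk->size_shift;

	pr_err("%#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
	       pfn, t->comm, t->pid);

	if ((flags & MF_ACTION_REQUIRED) && (t == current))
		/* Only reachable from the consuming task's own context. */
		return force_sig_mceerr(BUS_MCEERR_AR,
					(void __user *)tk->addr, addr_lsb);

	/* Any other context (e.g. the memory_failure workqueue kthread). */
	return send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr,
			       addr_lsb, t);
}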

To this end, fix the problem by:

- Patch 1: performing a force kill if no memory_failure() work is queued for
synchronous errors.
- Patch 2: a minor comments improvement.
- Patch 3: queue memory_failure() as a task_work so that it runs in the
context of the process that is actually consuming the poisoned
           data, and it will send SIGBUS with si_code BUS_MCEERR_AR.

Lv Ying and XiuQi from Huawei also proposed to address a similar problem[2][4].
Thanks to them for the discussion.

## Steps to Reproduce This Problem

To reproduce this problem:

# STEP1: enable early kill mode
#sysctl -w vm.memory_failure_early_kill=1
vm.memory_failure_early_kill = 1

# STEP2: inject an UCE error and consume it to trigger a synchronous error
#einj_mem_uc single
0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
injecting ...
triggering ...
signal 7 code 5 addr 0xffffb0d75000
page not present
Test passed

The si_code (code 5) from einj_mem_uc indicates that it is a BUS_MCEERR_AO error,
which is not the fact.

After this patch set:

# STEP1: enable early kill mode
#sysctl -w vm.memory_failure_early_kill=1
vm.memory_failure_early_kill = 1

# STEP2: inject an UCE error and consume it to trigger a synchronous error
#einj_mem_uc single
0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
injecting ...
triggering ...
signal 7 code 4 addr 0xffffb0d75000
page not present
Test passed

The si_code (code 4) from einj_mem_uc indicates that it is a BUS_MCEERR_AR error,
as we expected.
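
For reference, a user-space process can observe the difference itself with a
SA_SIGINFO handler; below is a minimal sketch of roughly what a test helper
such as einj_mem_uc is assumed to do:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4		/* hardware memory error consumed (action required) */
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5		/* hardware memory error not yet consumed (action optional) */
#endif

static void sigbus_handler(int sig, siginfo_t *si, void *ucontext)
{
	const char *kind = "other bus error";

	(void)ucontext;
	if (si->si_code == BUS_MCEERR_AR)
		kind = "BUS_MCEERR_AR (action required)";
	else if (si->si_code == BUS_MCEERR_AO)
		kind = "BUS_MCEERR_AO (action optional)";

	/* printf() is not async-signal-safe; acceptable for a demo only. */
	printf("signal %d code %d addr %p: %s\n",
	       sig, si->si_code, si->si_addr, kind);
}

int main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGBUS, &sa, NULL);

	pause();	/* trigger the poison consumption elsewhere, e.g. via EINJ */
	return 0;
}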

[1] Add ARMv8 RAS virtualization support in QEMU https://patchew.org/QEMU/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lkml.kernel.org/r/[email protected]
[4] https://lore.kernel.org/lkml/[email protected]/

Shuai Xue (3):
ACPI: APEI: send SIGBUS to current task if synchronous memory error
not recovered
mm: memory-failure: move return value documentation to function
declaration
ACPI: APEI: handle synchronous exceptions in task work to send correct
SIGBUS si_code

arch/x86/kernel/cpu/mce/core.c | 9 +---
drivers/acpi/apei/ghes.c | 84 +++++++++++++++++++++-------------
include/acpi/ghes.h | 3 --
mm/memory-failure.c | 22 +++------
4 files changed, 59 insertions(+), 59 deletions(-)

--
2.39.3


2024-02-04 08:02:28

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v11 1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered

Synchronous error was detected as a result of user-space process accessing
a 2-bit uncorrected error. The CPU will take a synchronous error exception
such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
memory_failure() work which poisons the related page, unmaps the page, and
then sends a SIGBUS to the process, so that a system wide panic can be
avoided.

However, no memory_failure() work will be queued when abnormal synchronous
errors occur. These errors can include situations such as invalid PA,
unexpected severity, no memory failure config support, invalid GUID
section, etc. In such case, the user-space process will trigger SEA again.
This loop can potentially exceed the platform firmware threshold or even
trigger a kernel hard lockup, leading to a system reboot.

Fix it by performing a force kill if no memory_failure() work is queued
for synchronous errors.

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/acpi/apei/ghes.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 7b7c605166e0..0892550732d4 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
}
}

+ /*
+ * If no memory failure work is queued for abnormal synchronous
+ * errors, do a force kill.
+ */
+ if (sync && !queued) {
+ pr_err("Sending SIGBUS to current task due to memory error not recovered");
+ force_sig(SIGBUS);
+ }
+
return queued;
}

--
2.39.3


2024-02-04 08:02:49

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v11 2/3] mm: memory-failure: move return value documentation to function declaration

Part of return value comments for memory_failure() were originally
documented at the call site. Move those comments to the function
declaration to improve code readability and to provide developers with
immediate access to function usage and return information.

Signed-off-by: Shuai Xue <[email protected]>
---
arch/x86/kernel/cpu/mce/core.c | 9 +--------
mm/memory-failure.c | 9 ++++++---
2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index bc39252bc54f..822b21eb48ad 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1365,17 +1365,10 @@ static void kill_me_maybe(struct callback_head *cb)
return;
}

- /*
- * -EHWPOISON from memory_failure() means that it already sent SIGBUS
- * to the current process with the proper error info,
- * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
- *
- * In both cases, no further processing is required.
- */
if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
return;

- pr_err("Memory error not recovered");
+ pr_err("Sending SIGBUS to current task due to memory error not recovered");
kill_me_now(cb);
}

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 636280d04008..d33729c48eff 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2175,9 +2175,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
* Must run in process context (e.g. a work queue) with interrupts
* enabled and no spinlocks held.
*
- * Return: 0 for successfully handled the memory error,
- * -EOPNOTSUPP for hwpoison_filter() filtered the error event,
- * < 0(except -EOPNOTSUPP) on failure.
+ * Return values:
+ * 0 - success
+ * -EOPNOTSUPP - hwpoison_filter() filtered the error event.
+ * -EHWPOISON - sent SIGBUS to the current process with the proper
+ * error info by kill_accessing_process().
+ * other negative values - failure
*/
int memory_failure(unsigned long pfn, int flags)
{
--
2.39.3
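
For context, callers are expected to consume these return values in the same
way; below is a condensed sketch of the calling convention (modelled on
kill_me_maybe() above and on memory_failure_cb() added in the next patch; the
function name itself is hypothetical):

static void example_consume_memory_failure(unsigned long pfn, int flags)
{
	int ret = memory_failure(pfn, flags);

	/*
	 * 0           - recovered;
	 * -EHWPOISON  - SIGBUS already sent to current with the proper info;
	 * -EOPNOTSUPP - event filtered by hwpoison_filter();
	 * nothing more to do in these three cases.
	 */
	if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP)
		return;

	/* Any other negative value: not recovered, kill the consumer. */
	pr_err("Sending SIGBUS to current task due to memory error not recovered");
	force_sig(SIGBUS);
}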


2024-02-04 08:03:26

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v11 3/3] ACPI: APEI: handle synchronous exceptions in task work to send correct SIGBUS si_code

Hardware errors could be signaled by asynchronous interrupt, e.g. when an
error is detected by a background scrubber, or signaled by synchronous
exception, e.g. when a CPU tries to access a poisoned cache line. Since
commit a70297d22132 ("ACPI: APEI: set memory failure flags as
MF_ACTION_REQUIRED on synchronous events")', the flag MF_ACTION_REQUIRED
could be used to determine whether a synchronous exception occurs on ARM64
platform. When a synchronous exception is detected, the kernel should
terminate the current process which is accessing the poisoned page. This is
done by sending a SIGBUS signal with an error code BUS_MCEERR_AR,
indicating an action-required machine check error on read.

However, the memory failure recovery is incorrectly sending a SIGBUS
with the wrong error code BUS_MCEERR_AO for synchronous errors in early kill
mode, even though MF_ACTION_REQUIRED is set. The main problem is that
synchronous errors are queued as a memory_failure() work, and are
executed within a kernel thread context, not the user-space process that
encountered the corrupted memory on ARM64 platform. As a result, when
kill_proc() is called to terminate the process, it sends the incorrect
SIGBUS error code because the context in which it operates is not the
one where the error was triggered.

To this end, queue memory_failure() as a task_work so that it runs in
the context of the process that is actually consuming the poisoned data,
and it will send SIGBUS with si_code BUS_MCEERR_AR.

Signed-off-by: Shuai Xue <[email protected]>
Tested-by: Ma Wupeng <[email protected]>
Reviewed-by: Kefeng Wang <[email protected]>
Reviewed-by: Xiaofei Tan <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
---
drivers/acpi/apei/ghes.c | 77 +++++++++++++++++++++++-----------------
include/acpi/ghes.h | 3 --
mm/memory-failure.c | 13 -------
3 files changed, 44 insertions(+), 49 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 0892550732d4..e5086d795bee 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -465,28 +465,41 @@ static void ghes_clear_estatus(struct ghes *ghes,
}

/*
- * Called as task_work before returning to user-space.
- * Ensure any queued work has been done before we return to the context that
- * triggered the notification.
+ * struct sync_task_work - for synchronous RAS event
+ *
+ * @twork: callback_head for task work
+ * @pfn: page frame number of corrupted page
+ * @flags: fine tune action taken
+ *
+ * Structure to pass task work to be handled before
+ * ret_to_user via task_work_add().
*/
-static void ghes_kick_task_work(struct callback_head *head)
+struct sync_task_work {
+ struct callback_head twork;
+ u64 pfn;
+ int flags;
+};
+
+static void memory_failure_cb(struct callback_head *twork)
{
- struct acpi_hest_generic_status *estatus;
- struct ghes_estatus_node *estatus_node;
- u32 node_len;
+ int ret;
+ struct sync_task_work *twcb =
+ container_of(twork, struct sync_task_work, twork);

- estatus_node = container_of(head, struct ghes_estatus_node, task_work);
- if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
- memory_failure_queue_kick(estatus_node->task_work_cpu);
+ ret = memory_failure(twcb->pfn, twcb->flags);
+ gen_pool_free(ghes_estatus_pool, (unsigned long)twcb, sizeof(*twcb));

- estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
- node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
- gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
+ if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP)
+ return;
+
+ pr_err("Sending SIGBUS to current task due to memory error not recovered");
+ force_sig(SIGBUS);
}

static bool ghes_do_memory_failure(u64 physical_addr, int flags)
{
unsigned long pfn;
+ struct sync_task_work *twcb;

if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
return false;
@@ -499,6 +512,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
return false;
}

+ if (flags == MF_ACTION_REQUIRED && current->mm) {
+ twcb = (void *)gen_pool_alloc(ghes_estatus_pool, sizeof(*twcb));
+ if (!twcb)
+ return false;
+
+ twcb->pfn = pfn;
+ twcb->flags = flags;
+ init_task_work(&twcb->twork, memory_failure_cb);
+ task_work_add(current, &twcb->twork, TWA_RESUME);
+ return true;
+ }
+
memory_failure_queue(pfn, flags);
return true;
}
@@ -746,7 +771,7 @@ int cxl_cper_unregister_callback(cxl_cper_callback callback)
}
EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_callback, CXL);

-static bool ghes_do_proc(struct ghes *ghes,
+static void ghes_do_proc(struct ghes *ghes,
const struct acpi_hest_generic_status *estatus)
{
int sev, sec_sev;
@@ -814,8 +839,6 @@ static bool ghes_do_proc(struct ghes *ghes,
pr_err("Sending SIGBUS to current task due to memory error not recovered");
force_sig(SIGBUS);
}
-
- return queued;
}

static void __ghes_print_estatus(const char *pfx,
@@ -1117,9 +1140,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
struct ghes_estatus_node *estatus_node;
struct acpi_hest_generic *generic;
struct acpi_hest_generic_status *estatus;
- bool task_work_pending;
u32 len, node_len;
- int ret;

llnode = llist_del_all(&ghes_estatus_llist);
/*
@@ -1134,25 +1155,16 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
len = cper_estatus_len(estatus);
node_len = GHES_ESTATUS_NODE_LEN(len);
- task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
+
+ ghes_do_proc(estatus_node->ghes, estatus);
+
if (!ghes_estatus_cached(estatus)) {
generic = estatus_node->generic;
if (ghes_print_estatus(NULL, generic, estatus))
ghes_estatus_cache_add(generic, estatus);
}
-
- if (task_work_pending && current->mm) {
- estatus_node->task_work.func = ghes_kick_task_work;
- estatus_node->task_work_cpu = smp_processor_id();
- ret = task_work_add(current, &estatus_node->task_work,
- TWA_RESUME);
- if (ret)
- estatus_node->task_work.func = NULL;
- }
-
- if (!estatus_node->task_work.func)
- gen_pool_free(ghes_estatus_pool,
- (unsigned long)estatus_node, node_len);
+ gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
+ node_len);

llnode = next;
}
@@ -1213,7 +1225,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,

estatus_node->ghes = ghes;
estatus_node->generic = ghes->generic;
- estatus_node->task_work.func = NULL;
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);

if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index be1dd4c1a917..ebd21b05fe6e 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -35,9 +35,6 @@ struct ghes_estatus_node {
struct llist_node llnode;
struct acpi_hest_generic *generic;
struct ghes *ghes;
-
- int task_work_cpu;
- struct callback_head task_work;
};

struct ghes_estatus_cache {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d33729c48eff..4ad663bdc1d5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2462,19 +2462,6 @@ static void memory_failure_work_func(struct work_struct *work)
}
}

-/*
- * Process memory_failure work queued on the specified CPU.
- * Used to avoid return-to-userspace racing with the memory_failure workqueue.
- */
-void memory_failure_queue_kick(int cpu)
-{
- struct memory_failure_cpu *mf_cpu;
-
- mf_cpu = &per_cpu(memory_failure_cpu, cpu);
- cancel_work_sync(&mf_cpu->work);
- memory_failure_work_func(&mf_cpu->work);
-}
-
static int __init memory_failure_init(void)
{
struct memory_failure_cpu *mf_cpu;
--
2.39.3


2024-02-19 01:47:29

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v11 0/3] ACPI: APEI: handle synchronous exceptions in task work to send correct SIGBUS si_code

Hi, James and Borislav,

Gentle Ping. Any feedback to this new version?

Thank you.

Best Regards,
Shuai

On 2024/2/4 16:01, Shuai Xue wrote:
> ## Changes Log
> changes since v10:
> - rebase to v6.8-rc2
>
> changes since v9:
> - split patch 2 to address exactly one issue in one patch (per Borislav)
> - rewrite commit log according to template (per Borislav)
> - pickup reviewed-by tag of patch 1 from James Morse
> - alloc and free twcb through gen_pool_{alloc, free) (Per James)
> - rewrite cover letter
>
> changes since v8:
> - remove the bug fix tag of patch 2 (per Jarkko Sakkinen)
> - remove the declaration of memory_failure_queue_kick (per Naoya Horiguchi)
> - rewrite the return value comments of memory_failure (per Naoya Horiguchi)
>
> changes since v7:
> - rebase to Linux v6.6-rc2 (no code changed)
> - rewritten the cover letter to explain the motivation of this patchset
>
> changes since v6:
> - add more explicty error message suggested by Xiaofei
> - pick up reviewed-by tag from Xiaofei
> - pick up internal reviewed-by tag from Baolin
>
> changes since v5 by addressing comments from Kefeng:
> - document return value of memory_failure()
> - drop redundant comments in call site of memory_failure()
> - make ghes_do_proc void and handle abnormal case within it
> - pick up reviewed-by tag from Kefeng Wang
>
> changes since v4 by addressing comments from Xiaofei:
> - do a force kill only for abnormal sync errors
>
> changes since v3 by addressing comments from Xiaofei:
> - do a force kill for abnormal memory failure error such as invalid PA,
> unexpected severity, OOM, etc
> - pcik up tested-by tag from Ma Wupeng
>
> changes since v2 by addressing comments from Naoya:
> - rename mce_task_work to sync_task_work
> - drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
> - add steps to reproduce this problem in cover letter
>
> changes since v1:
> - synchronous events by notify type
> - Link: https://lore.kernel.org/lkml/[email protected]/
>
> ## Cover Letter
>
> There are two major types of uncorrected recoverable (UCR) errors :
>
> - Synchronous error: The error is detected and raised at the point of the
> consumption in the execution flow, e.g. when a CPU tries to access
> a poisoned cache line. The CPU will take a synchronous error exception
> such as Synchronous External Abort (SEA) on Arm64 and Machine Check
> Exception (MCE) on X86. OS requires to take action (for example, offline
> failure page/kill failure thread) to recover this uncorrectable error.
>
> - Asynchronous error: The error is detected out of processor execution
> context, e.g. when an error is detected by a background scrubber. Some data
> in the memory are corrupted. But the data have not been consumed. OS is
> optional to take action to recover this uncorrectable error.
>
> Since commit a70297d22132 ("ACPI: APEI: set memory failure flags as
> MF_ACTION_REQUIRED on synchronous events")', the flag MF_ACTION_REQUIRED
> could be used to determine whether a synchronous exception occurs on ARM64
> platform. When a synchronous exception is detected, the kernel should
> terminate the current process which accessing the poisoned page. This is
> done by sending a SIGBUS signal with an error code BUS_MCEERR_AR,
> indicating an action-required machine check error on read.
>
> However, the memory failure recovery is incorrectly sending a SIGBUS
> with wrong error code BUS_MCEERR_AO for synchronous errors in early kill
> mode, even MF_ACTION_REQUIRED is set. The main problem is that
> synchronous errors are queued as a memory_failure() work, and are
> executed within a kernel thread context, not the user-space process that
> encountered the corrupted memory on ARM64 platform. As a result, when
> kill_proc() is called to terminate the process, it sends the incorrect
> SIGBUS error code because the context in which it operates is not the
> one where the error was triggered.
>
> To this end, fix the problem by:
>
> - Patch 1: performing a force kill if no memory_failure() work is queued for
> synchronous errors.
> - Patch 2: a minor comments improvement.
> - Patch 3: queue memory_failure() as a task_work so that it runs in the
> context of the process that is actually consuming the poisoned
> data, and it will send SIBBUS with si_code BUS_MCEERR_AR.
>
> Lv Ying and XiuQi from Huawei also proposed to address similar problem[2][4].
> Acknowledge to discussion with them.
>
> ## Steps to Reproduce This Problem
>
> To reproduce this problem:
>
> # STEP1: enable early kill mode
> #sysctl -w vm.memory_failure_early_kill=1
> vm.memory_failure_early_kill = 1
>
> # STEP2: inject an UCE error and consume it to trigger a synchronous error
> #einj_mem_uc single
> 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
> injecting ...
> triggering ...
> signal 7 code 5 addr 0xffffb0d75000
> page not present
> Test passed
>
> The si_code (code 5) from einj_mem_uc indicates that it is BUS_MCEERR_AO error
> and it is not fact.
>
> After this patch set:
>
> # STEP1: enable early kill mode
> #sysctl -w vm.memory_failure_early_kill=1
> vm.memory_failure_early_kill = 1
>
> # STEP2: inject an UCE error and consume it to trigger a synchronous error
> #einj_mem_uc single
> 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400
> injecting ...
> triggering ...
> signal 7 code 4 addr 0xffffb0d75000
> page not present
> Test passed
>
> The si_code (code 4) from einj_mem_uc indicates that it is BUS_MCEERR_AR error
> as we expected.
>
> [1] Add ARMv8 RAS virtualization support in QEMU https://patchew.org/QEMU/[email protected]/
> [2] https://lore.kernel.org/lkml/[email protected]/
> [3] https://lkml.kernel.org/r/[email protected]
> [4] https://lore.kernel.org/lkml/[email protected]/
>
> Shuai Xue (3):
> ACPI: APEI: send SIGBUS to current task if synchronous memory error
> not recovered
> mm: memory-failure: move return value documentation to function
> declaration
> ACPI: APEI: handle synchronous exceptions in task work to send correct
> SIGBUS si_code
>
> arch/x86/kernel/cpu/mce/core.c | 9 +---
> drivers/acpi/apei/ghes.c | 84 +++++++++++++++++++++-------------
> include/acpi/ghes.h | 3 --
> mm/memory-failure.c | 22 +++------
> 4 files changed, 59 insertions(+), 59 deletions(-)
>

2024-02-19 09:26:27

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v11 1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered

On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote:
> Synchronous error was detected as a result of user-space process accessing
> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> memory_failure() work which poisons the related page, unmaps the page, and
> then sends a SIGBUS to the process, so that a system wide panic can be
> avoided.
>
> However, no memory_failure() work will be queued when abnormal synchronous
> errors occur. These errors can include situations such as invalid PA,
> unexpected severity, no memory failure config support, invalid GUID
> section, etc. In such case, the user-space process will trigger SEA again.
> This loop can potentially exceed the platform firmware threshold or even
> trigger a kernel hard lockup, leading to a system reboot.
>
> Fix it by performing a force kill if no memory_failure() work is queued
> for synchronous errors.
>
> Signed-off-by: Shuai Xue <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 7b7c605166e0..0892550732d4 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
> }
> }
>
> + /*
> + * If no memory failure work is queued for abnormal synchronous
> + * errors, do a force kill.
> + */
> + if (sync && !queued) {
> + pr_err("Sending SIGBUS to current task due to memory error not recovered");
> + force_sig(SIGBUS);
> + }

Except that there are a bunch of CXL GUIDs being handled there too and
this will sigbus those processes now automatically.

Lemme add the whole bunch from

671a794c33c6 ("acpi/ghes: Process CXL Component Events")

for comment to Cc.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-22 02:07:50

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v11 1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered



On 2024/2/19 17:25, Borislav Petkov wrote:
> On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote:
>> Synchronous error was detected as a result of user-space process accessing
>> a 2-bit uncorrected error. The CPU will take a synchronous error exception
>> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
>> memory_failure() work which poisons the related page, unmaps the page, and
>> then sends a SIGBUS to the process, so that a system wide panic can be
>> avoided.
>>
>> However, no memory_failure() work will be queued when abnormal synchronous
>> errors occur. These errors can include situations such as invalid PA,
>> unexpected severity, no memory failure config support, invalid GUID
>> section, etc. In such case, the user-space process will trigger SEA again.
>> This loop can potentially exceed the platform firmware threshold or even
>> trigger a kernel hard lockup, leading to a system reboot.
>>
>> Fix it by performing a force kill if no memory_failure() work is queued
>> for synchronous errors.
>>
>> Signed-off-by: Shuai Xue <[email protected]>
>> ---
>> drivers/acpi/apei/ghes.c | 9 +++++++++
>> 1 file changed, 9 insertions(+)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 7b7c605166e0..0892550732d4 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
>> }
>> }
>>
>> + /*
>> + * If no memory failure work is queued for abnormal synchronous
>> + * errors, do a force kill.
>> + */
>> + if (sync && !queued) {
>> + pr_err("Sending SIGBUS to current task due to memory error not recovered");
>> + force_sig(SIGBUS);
>> + }
>
> Except that there are a bunch of CXL GUIDs being handled there too and
> this will sigbus those processes now automatically.

Before the CXL GUIDs were added, @Tony confirmed that the HEST notifications are always
asynchronous on the x86 platform, so only Synchronous External Abort (SEA) on ARM is
delivered as a synchronous notification.

Will the CXL component trigger synchronous events for which we need to terminate the
current process by sending SIGBUS to it?

>
> Lemme add the whole bunch from
>
> 671a794c33c6 ("acpi/ghes: Process CXL Component Events")
>
> for comment to Cc.
>

Thank you.

Best Regards,
Shuai

2024-02-23 05:28:58

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v11 1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered

Shuai Xue wrote:
>
>
> On 2024/2/19 17:25, Borislav Petkov wrote:
> > On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote:
> >> Synchronous error was detected as a result of user-space process accessing
> >> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> >> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> >> memory_failure() work which poisons the related page, unmaps the page, and
> >> then sends a SIGBUS to the process, so that a system wide panic can be
> >> avoided.
> >>
> >> However, no memory_failure() work will be queued when abnormal synchronous
> >> errors occur. These errors can include situations such as invalid PA,
> >> unexpected severity, no memory failure config support, invalid GUID
> >> section, etc. In such case, the user-space process will trigger SEA again.
> >> This loop can potentially exceed the platform firmware threshold or even
> >> trigger a kernel hard lockup, leading to a system reboot.
> >>
> >> Fix it by performing a force kill if no memory_failure() work is queued
> >> for synchronous errors.
> >>
> >> Signed-off-by: Shuai Xue <[email protected]>
> >> ---
> >> drivers/acpi/apei/ghes.c | 9 +++++++++
> >> 1 file changed, 9 insertions(+)
> >>
> >> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> >> index 7b7c605166e0..0892550732d4 100644
> >> --- a/drivers/acpi/apei/ghes.c
> >> +++ b/drivers/acpi/apei/ghes.c
> >> @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
> >> }
> >> }
> >>
> >> + /*
> >> + * If no memory failure work is queued for abnormal synchronous
> >> + * errors, do a force kill.
> >> + */
> >> + if (sync && !queued) {
> >> + pr_err("Sending SIGBUS to current task due to memory error not recovered");
> >> + force_sig(SIGBUS);
> >> + }
> >
> > Except that there are a bunch of CXL GUIDs being handled there too and
> > this will sigbus those processes now automatically.
>
> Before the CXL GUIDs added, @Tony confirmed that the HEST notifications are always
> asynchronous on x86 platform, so only Synchronous External Abort (SEA) on ARM is
> delivered as a synchronous notification.
>
> Will the CXL component trigger synchronous events for which we need to terminate the
> current process by sending sigbus to process?

None of the CXL component errors should be handled as synchronous
events. They are either asynchronous protocol errors, or effectively
equivalent to CPER_SEC_PLATFORM_MEM notifications.

2024-02-23 12:09:03

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH v11 1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered

On Thu, 22 Feb 2024 21:26:43 -0800
Dan Williams <[email protected]> wrote:

> Shuai Xue wrote:
> >
> >
> > On 2024/2/19 17:25, Borislav Petkov wrote:
> > > On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote:
> > >> Synchronous error was detected as a result of user-space process accessing
> > >> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> > >> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> > >> memory_failure() work which poisons the related page, unmaps the page, and
> > >> then sends a SIGBUS to the process, so that a system wide panic can be
> > >> avoided.
> > >>
> > >> However, no memory_failure() work will be queued when abnormal synchronous
> > >> errors occur. These errors can include situations such as invalid PA,
> > >> unexpected severity, no memory failure config support, invalid GUID
> > >> section, etc. In such case, the user-space process will trigger SEA again.
> > >> This loop can potentially exceed the platform firmware threshold or even
> > >> trigger a kernel hard lockup, leading to a system reboot.
> > >>
> > >> Fix it by performing a force kill if no memory_failure() work is queued
> > >> for synchronous errors.
> > >>
> > >> Signed-off-by: Shuai Xue <[email protected]>
> > >> ---
> > >> drivers/acpi/apei/ghes.c | 9 +++++++++
> > >> 1 file changed, 9 insertions(+)
> > >>
> > >> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > >> index 7b7c605166e0..0892550732d4 100644
> > >> --- a/drivers/acpi/apei/ghes.c
> > >> +++ b/drivers/acpi/apei/ghes.c
> > >> @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
> > >> }
> > >> }
> > >>
> > >> + /*
> > >> + * If no memory failure work is queued for abnormal synchronous
> > >> + * errors, do a force kill.
> > >> + */
> > >> + if (sync && !queued) {
> > >> + pr_err("Sending SIGBUS to current task due to memory error not recovered");
> > >> + force_sig(SIGBUS);
> > >> + }
> > >
> > > Except that there are a bunch of CXL GUIDs being handled there too and
> > > this will sigbus those processes now automatically.
> >
> > Before the CXL GUIDs added, @Tony confirmed that the HEST notifications are always
> > asynchronous on x86 platform, so only Synchronous External Abort (SEA) on ARM is
> > delivered as a synchronous notification.
> >
> > Will the CXL component trigger synchronous events for which we need to terminate the
> > current process by sending sigbus to process?
>
> None of the CXL component errors should be handled as synchronous
> events. They are either asynchronous protocol errors, or effectively
> equivalent to CPER_SEC_PLATFORM_MEM notifications.

Not a good example, CPER_SEC_PLATFORM_MEM is sometimes signaled via SEA.



2024-02-23 12:17:17

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH v11 1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered

On Fri, 23 Feb 2024 12:08:13 +0000
Jonathan Cameron <[email protected]> wrote:

> On Thu, 22 Feb 2024 21:26:43 -0800
> Dan Williams <[email protected]> wrote:
>
> > Shuai Xue wrote:
> > >
> > >
> > > On 2024/2/19 17:25, Borislav Petkov wrote:
> > > > On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote:
> > > >> Synchronous error was detected as a result of user-space process accessing
> > > >> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> > > >> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> > > >> memory_failure() work which poisons the related page, unmaps the page, and
> > > >> then sends a SIGBUS to the process, so that a system wide panic can be
> > > >> avoided.
> > > >>
> > > >> However, no memory_failure() work will be queued when abnormal synchronous
> > > >> errors occur. These errors can include situations such as invalid PA,
> > > >> unexpected severity, no memory failure config support, invalid GUID
> > > >> section, etc. In such case, the user-space process will trigger SEA again.
> > > >> This loop can potentially exceed the platform firmware threshold or even
> > > >> trigger a kernel hard lockup, leading to a system reboot.
> > > >>
> > > >> Fix it by performing a force kill if no memory_failure() work is queued
> > > >> for synchronous errors.
> > > >>
> > > >> Signed-off-by: Shuai Xue <[email protected]>
> > > >> ---
> > > >> drivers/acpi/apei/ghes.c | 9 +++++++++
> > > >> 1 file changed, 9 insertions(+)
> > > >>
> > > >> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > > >> index 7b7c605166e0..0892550732d4 100644
> > > >> --- a/drivers/acpi/apei/ghes.c
> > > >> +++ b/drivers/acpi/apei/ghes.c
> > > >> @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
> > > >> }
> > > >> }
> > > >>
> > > >> + /*
> > > >> + * If no memory failure work is queued for abnormal synchronous
> > > >> + * errors, do a force kill.
> > > >> + */
> > > >> + if (sync && !queued) {
> > > >> + pr_err("Sending SIGBUS to current task due to memory error not recovered");
> > > >> + force_sig(SIGBUS);
> > > >> + }
> > > >
> > > > Except that there are a bunch of CXL GUIDs being handled there too and
> > > > this will sigbus those processes now automatically.
> > >
> > > Before the CXL GUIDs added, @Tony confirmed that the HEST notifications are always
> > > asynchronous on x86 platform, so only Synchronous External Abort (SEA) on ARM is
> > > delivered as a synchronous notification.
> > >
> > > Will the CXL component trigger synchronous events for which we need to terminate the
> > > current process by sending sigbus to process?
> >
> > None of the CXL component errors should be handled as synchronous
> > events. They are either asynchronous protocol errors, or effectively
> > equivalent to CPER_SEC_PLATFORM_MEM notifications.
>
> Not a good example, CPER_SEC_PLATFORM_MEM is sometimes signaled via SEA.
>

Premature send.:(

One example I can point at is how we do signaling of memory
errors detected by the host into a VM on arm64.
https://elixir.bootlin.com/qemu/latest/source/hw/acpi/ghes.c#L391
CPER_SEC_PLATFORM_MEM via ARM Synchronous External Abort (SEA).

Right now we've only used async in QEMU for the proposed CXL error
CPER records signalling, but your reference to them being similar
to CPER_SEC_PLATFORM_MEM is valid, so 'maybe' they will be
synchronous in some physical systems, as it's one viable way to
provide rich information for synchronous reception of poison.
For the VM case my assumption today is we don't care about providing the
VM with rich data, so CPER_SEC_PLATFORM_MEM is fine as a path for
errors whether from CXL CPER records or not.

Jonathan


2024-02-24 06:09:07

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v11 1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered



On 2024/2/23 20:17, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 12:08:13 +0000
> Jonathan Cameron <[email protected]> wrote:
>
>> On Thu, 22 Feb 2024 21:26:43 -0800
>> Dan Williams <[email protected]> wrote:
>>
>>> Shuai Xue wrote:
>>>>
>>>>
>>>> On 2024/2/19 17:25, Borislav Petkov wrote:
>>>>> On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote:
>>>>>> Synchronous error was detected as a result of user-space process accessing
>>>>>> a 2-bit uncorrected error. The CPU will take a synchronous error exception
>>>>>> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
>>>>>> memory_failure() work which poisons the related page, unmaps the page, and
>>>>>> then sends a SIGBUS to the process, so that a system wide panic can be
>>>>>> avoided.
>>>>>>
>>>>>> However, no memory_failure() work will be queued when abnormal synchronous
>>>>>> errors occur. These errors can include situations such as invalid PA,
>>>>>> unexpected severity, no memory failure config support, invalid GUID
>>>>>> section, etc. In such case, the user-space process will trigger SEA again.
>>>>>> This loop can potentially exceed the platform firmware threshold or even
>>>>>> trigger a kernel hard lockup, leading to a system reboot.
>>>>>>
>>>>>> Fix it by performing a force kill if no memory_failure() work is queued
>>>>>> for synchronous errors.
>>>>>>
>>>>>> Signed-off-by: Shuai Xue <[email protected]>
>>>>>> ---
>>>>>> drivers/acpi/apei/ghes.c | 9 +++++++++
>>>>>> 1 file changed, 9 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>>>>> index 7b7c605166e0..0892550732d4 100644
>>>>>> --- a/drivers/acpi/apei/ghes.c
>>>>>> +++ b/drivers/acpi/apei/ghes.c
>>>>>> @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> + /*
>>>>>> + * If no memory failure work is queued for abnormal synchronous
>>>>>> + * errors, do a force kill.
>>>>>> + */
>>>>>> + if (sync && !queued) {
>>>>>> + pr_err("Sending SIGBUS to current task due to memory error not recovered");
>>>>>> + force_sig(SIGBUS);
>>>>>> + }
>>>>>
>>>>> Except that there are a bunch of CXL GUIDs being handled there too and
>>>>> this will sigbus those processes now automatically.
>>>>
>>>> Before the CXL GUIDs added, @Tony confirmed that the HEST notifications are always
>>>> asynchronous on x86 platform, so only Synchronous External Abort (SEA) on ARM is
>>>> delivered as a synchronous notification.
>>>>
>>>> Will the CXL component trigger synchronous events for which we need to terminate the
>>>> current process by sending sigbus to process?
>>>
>>> None of the CXL component errors should be handled as synchronous
>>> events. They are either asynchronous protocol errors, or effectively
>>> equivalent to CPER_SEC_PLATFORM_MEM notifications.
>>
>> Not a good example, CPER_SEC_PLATFORM_MEM is sometimes signaled via SEA.
>>
>
> Premature send.:(
>
> One example I can point at is how we do signaling of memory
> errors detected by the host into a VM on arm64.
> https://elixir.bootlin.com/qemu/latest/source/hw/acpi/ghes.c#L391
> CPER_SEC_PLATFORM_MEM via ARM Synchronous External Abort (SEA).
>
> Right now we've only used async in QEMU for proposed CXL error
> CPER records signalling but your reference to them being similar
> to CPER_SEC_PLATFORM_MEM is valid so 'maybe' they will be
> synchronous in some physical systems as it's one viable way to
> provide rich information for synchronous reception of poison.
> For the VM case my assumption today is we don't care about providing the
> VM with rich data, so CPER_SEC_PLATFORM_MEM is fine as a path for
> errors whether from CXL CPER records or not.
>
> Jonathan

Thank you for your confirmation and explanation.

So I think the condition:

- `sync` for synchronous event,
- `!queued` for notifications (e.g. CPER_SEC_PLATFORM_MEM records) for which no memory_failure() work was queued.

is fine.

@Borislav, do you have any other concerns?

Best Regards,
Shuai

2024-02-24 19:43:02

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v11 1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered

Jonathan Cameron wrote:
[..]
> > > None of the CXL component errors should be handled as synchronous
> > > events. They are either asynchronous protocol errors, or effectively
> > > equivalent to CPER_SEC_PLATFORM_MEM notifications.
> >
> > Not a good example, CPER_SEC_PLATFORM_MEM is sometimes signaled via SEA.
> >
>
> Premature send.:(
>
> One example I can point at is how we do signaling of memory
> errors detected by the host into a VM on arm64.
> https://elixir.bootlin.com/qemu/latest/source/hw/acpi/ghes.c#L391
> CPER_SEC_PLATFORM_MEM via ARM Synchronous External Abort (SEA).
>
> Right now we've only used async in QEMU for proposed CXL error
> CPER records signalling but your reference to them being similar
> to CPER_SEC_PLATFORM_MEM is valid so 'maybe' they will be
> synchronous in some physical systems as it's one viable way to
> provide rich information for synchronous reception of poison.
> For the VM case my assumption today is we don't care about providing the
> VM with rich data, so CPER_SEC_PLATFORM_MEM is fine as a path for
> errors whether from CXL CPER records or not.

Makes sense... and I was not precise when I mentioned the equivalency, I
was only considering x86.

2024-02-24 19:43:17

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v11 1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered

Borislav Petkov wrote:
> On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote:
> > Synchronous error was detected as a result of user-space process accessing
> > a 2-bit uncorrected error. The CPU will take a synchronous error exception
> > such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> > memory_failure() work which poisons the related page, unmaps the page, and
> > then sends a SIGBUS to the process, so that a system wide panic can be
> > avoided.
> >
> > However, no memory_failure() work will be queued when abnormal synchronous
> > errors occur. These errors can include situations such as invalid PA,
> > unexpected severity, no memory failure config support, invalid GUID
> > section, etc. In such case, the user-space process will trigger SEA again.
> > This loop can potentially exceed the platform firmware threshold or even
> > trigger a kernel hard lockup, leading to a system reboot.
> >
> > Fix it by performing a force kill if no memory_failure() work is queued
> > for synchronous errors.
> >
> > Signed-off-by: Shuai Xue <[email protected]>
> > ---
> > drivers/acpi/apei/ghes.c | 9 +++++++++
> > 1 file changed, 9 insertions(+)
> >
> > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > index 7b7c605166e0..0892550732d4 100644
> > --- a/drivers/acpi/apei/ghes.c
> > +++ b/drivers/acpi/apei/ghes.c
> > @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
> > }
> > }
> >
> > + /*
> > + * If no memory failure work is queued for abnormal synchronous
> > + * errors, do a force kill.
> > + */
> > + if (sync && !queued) {
> > + pr_err("Sending SIGBUS to current task due to memory error not recovered");
> > + force_sig(SIGBUS);
> > + }
>
> Except that there are a bunch of CXL GUIDs being handled there too and
> this will sigbus those processes now automatically.
>
> Lemme add the whole bunch from
>
> 671a794c33c6 ("acpi/ghes: Process CXL Component Events")
>
> for comment to Cc.

BTW, I am about to revert all the CXL GUIDs for v6.8 to try again for
v6.9:

http://lore.kernel.org/r/170820177849.631006.8893584762602010898.stgit@dwillia2-xfh.jf.intel.com

2024-02-26 10:50:40

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v11 1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered

On Sat, Feb 24, 2024 at 02:08:42PM +0800, Shuai Xue wrote:
> @Borislav, do you have any other concerns?

Yes, this change needs to be further reviewed by an ARM person: I have
no clue what those "abnormal synchronous errors" on ARM are and how
they're supposed to be handled properly there:

- what happens if you get such an error when ghes is disabled there?

- is that even the right place to handle them?

James?

--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette

2024-02-26 10:56:48

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v11 2/3] mm: memory-failure: move return value documentation to function declaration

On Sun, Feb 04, 2024 at 04:01:43PM +0800, Shuai Xue wrote:
> Part of return value comments for memory_failure() were originally
> documented at the call site. Move those comments to the function
> declaration to improve code readability and to provide developers with
> immediate access to function usage and return information.
>
> Signed-off-by: Shuai Xue <[email protected]>
> ---
> arch/x86/kernel/cpu/mce/core.c | 9 +--------
> mm/memory-failure.c | 9 ++++++---
> 2 files changed, 7 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index bc39252bc54f..822b21eb48ad 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1365,17 +1365,10 @@ static void kill_me_maybe(struct callback_head *cb)
> return;
> }
>
> - /*
> - * -EHWPOISON from memory_failure() means that it already sent SIGBUS
> - * to the current process with the proper error info,
> - * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
> - *
> - * In both cases, no further processing is required.
> - */
> if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
> return;
>
> - pr_err("Memory error not recovered");
> + pr_err("Sending SIGBUS to current task due to memory error not recovered");

Unrelated change.

> kill_me_now(cb);
> }
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 636280d04008..d33729c48eff 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2175,9 +2175,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> * Must run in process context (e.g. a work queue) with interrupts
> * enabled and no spinlocks held.
> *
> - * Return: 0 for successfully handled the memory error,
> - * -EOPNOTSUPP for hwpoison_filter() filtered the error event,
> - * < 0(except -EOPNOTSUPP) on failure.
> + * Return values:
> + * 0 - success
> + * -EOPNOTSUPP - hwpoison_filter() filtered the error event.
> + * -EHWPOISON - sent SIGBUS to the current process with the proper
> + * error info by kill_accessing_process().

kill_accessing_process() is not the only one returning -EHWPOISON.

And if you look at the code, it should be:

-EHWPOISON - the page was already poisoned, potentially
kill process

or so.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-27 01:24:09

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v11 1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered



On 2024/2/26 18:29, Borislav Petkov wrote:
> On Sat, Feb 24, 2024 at 02:08:42PM +0800, Shuai Xue wrote:
>> @Borislav, do you have any other concerns?
>
> Yes, this change needs to be further reviewed by an ARM person: I have
> no clue what those "abnormal synchronous errors" on ARM are

Hi, Borislav,

Maybe the term `abnormal` is inaccurate and misled you. I mean the
precondition checks before memory_failure_queue():

- `if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))` in ghes_handle_memory_failure()
- `if (flags == -1)` in ghes_handle_memory_failure()
- `if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))` in ghes_do_memory_failure()
- `if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) ` in ghes_do_memory_failure()

If any of these precondition checks fails, no memory_failure() work is
queued and the user-space process will trigger the SEA again. This loop can
potentially exceed the platform firmware threshold or even trigger a kernel
hard lockup, leading to a system reboot.
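
For completeness, these checks combine roughly as below (a simplified
sketch merging ghes_handle_memory_failure() and ghes_do_memory_failure(),
not the exact upstream code):

	/* ghes_handle_memory_failure(): nothing is queued if ... */
	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
		return false;			/* no valid PA */
	if (flags == -1)
		return false;			/* unexpected severity */

	/* ghes_do_memory_failure(): nothing is queued if ... */
	if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
		return false;
	pfn = PHYS_PFN(physical_addr);
	if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr))
		return false;			/* invalid PA */

	memory_failure_queue(pfn, flags);
	return true;				/* queued */

Whenever one of the early returns is taken, `queued` stays false, and for
a synchronous notification the new `sync && !queued` branch does the force
kill instead of letting the task re-trigger the SEA.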

> and how
> they're supposed to be handled properly there:
>
> - what happens if you get such an error when ghes is disabled there?

If ghes_disable is set, the GHES driver will not be initialized by
acpi_ghes_init(), so none of the error notifications will be handled. IMHO,
that is expected.

>
> - is that even the right place to handle them?
>
> James?
>

Leave this to @James.

Thank you.

Best Regards,
Shuai


2024-02-27 01:28:42

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v11 2/3] mm: memory-failure: move return value documentation to function declaration



On 2024/2/26 18:46, Borislav Petkov wrote:
> On Sun, Feb 04, 2024 at 04:01:43PM +0800, Shuai Xue wrote:
>> Part of return value comments for memory_failure() were originally
>> documented at the call site. Move those comments to the function
>> declaration to improve code readability and to provide developers with
>> immediate access to function usage and return information.
>>
>> Signed-off-by: Shuai Xue <[email protected]>
>> ---
>> arch/x86/kernel/cpu/mce/core.c | 9 +--------
>> mm/memory-failure.c | 9 ++++++---
>> 2 files changed, 7 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>> index bc39252bc54f..822b21eb48ad 100644
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -1365,17 +1365,10 @@ static void kill_me_maybe(struct callback_head *cb)
>> return;
>> }
>>
>> - /*
>> - * -EHWPOISON from memory_failure() means that it already sent SIGBUS
>> - * to the current process with the proper error info,
>> - * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
>> - *
>> - * In both cases, no further processing is required.
>> - */
>> if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
>> return;
>>
>> - pr_err("Memory error not recovered");
>> + pr_err("Sending SIGBUS to current task due to memory error not recovered");
>
> Unrelated change.

Yes, I will drop the error message change.

>
>> kill_me_now(cb);
>> }
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 636280d04008..d33729c48eff 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -2175,9 +2175,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>> * Must run in process context (e.g. a work queue) with interrupts
>> * enabled and no spinlocks held.
>> *
>> - * Return: 0 for successfully handled the memory error,
>> - * -EOPNOTSUPP for hwpoison_filter() filtered the error event,
>> - * < 0(except -EOPNOTSUPP) on failure.
>> + * Return values:
>> + * 0 - success
>> + * -EOPNOTSUPP - hwpoison_filter() filtered the error event.
>> + * -EHWPOISON - sent SIGBUS to the current process with the proper
>> + * error info by kill_accessing_process().
>
> kill_accessing_process() is not the only one returning -EHWPOISON.
>
> And if you look at the code, it should be:
>
> -EHWPOISON - the page was already poisoned, potentially
> kill process
>
> or so.
>

You are right, will fix it in next version.

Thank you.

Best Regards,
Shuai


2024-02-29 07:05:39

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v11 3/3] ACPI: APEI: handle synchronous exceptions in task work to send correct SIGBUS si_code



On 2024/2/4 16:01, Shuai Xue wrote:
> Hardware errors could be signaled by asynchronous interrupt, e.g. when an
> error is detected by a background scrubber, or signaled by synchronous
> exception, e.g. when a CPU tries to access a poisoned cache line. Since
> commit a70297d22132 ("ACPI: APEI: set memory failure flags as
> MF_ACTION_REQUIRED on synchronous events")', the flag MF_ACTION_REQUIRED
> could be used to determine whether a synchronous exception occurs on ARM64
> platform. When a synchronous exception is detected, the kernel should
> terminate the current process which accessing the poisoned page. This is
> done by sending a SIGBUS signal with an error code BUS_MCEERR_AR,
> indicating an action-required machine check error on read.
>
> However, the memory failure recovery is incorrectly sending a SIGBUS
> with wrong error code BUS_MCEERR_AO for synchronous errors in early kill
> mode, even MF_ACTION_REQUIRED is set. The main problem is that
> synchronous errors are queued as a memory_failure() work, and are
> executed within a kernel thread context, not the user-space process that
> encountered the corrupted memory on ARM64 platform. As a result, when
> kill_proc() is called to terminate the process, it sends the incorrect
> SIGBUS error code because the context in which it operates is not the
> one where the error was triggered.
>
> To this end, queue memory_failure() as a task_work so that it runs in
> the context of the process that is actually consuming the poisoned data,
> and it will send SIBBUS with si_code BUS_MCEERR_AR.
>
> Signed-off-by: Shuai Xue <[email protected]>
> Tested-by: Ma Wupeng <[email protected]>
> Reviewed-by: Kefeng Wang <[email protected]>
> Reviewed-by: Xiaofei Tan <[email protected]>
> Reviewed-by: Baolin Wang <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 77 +++++++++++++++++++++++-----------------
> include/acpi/ghes.h | 3 --
> mm/memory-failure.c | 13 -------
> 3 files changed, 44 insertions(+), 49 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 0892550732d4..e5086d795bee 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -465,28 +465,41 @@ static void ghes_clear_estatus(struct ghes *ghes,
> }
>
> /*
> - * Called as task_work before returning to user-space.
> - * Ensure any queued work has been done before we return to the context that
> - * triggered the notification.
> + * struct sync_task_work - for synchronous RAS event
> + *
> + * @twork: callback_head for task work
> + * @pfn: page frame number of corrupted page
> + * @flags: fine tune action taken
> + *
> + * Structure to pass task work to be handled before
> + * ret_to_user via task_work_add().
> */
> -static void ghes_kick_task_work(struct callback_head *head)
> +struct sync_task_work {
> + struct callback_head twork;
> + u64 pfn;
> + int flags;
> +};
> +
> +static void memory_failure_cb(struct callback_head *twork)
> {
> - struct acpi_hest_generic_status *estatus;
> - struct ghes_estatus_node *estatus_node;
> - u32 node_len;
> + int ret;
> + struct sync_task_work *twcb =
> + container_of(twork, struct sync_task_work, twork);
>
> - estatus_node = container_of(head, struct ghes_estatus_node, task_work);
> - if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
> - memory_failure_queue_kick(estatus_node->task_work_cpu);
> + ret = memory_failure(twcb->pfn, twcb->flags);
> + gen_pool_free(ghes_estatus_pool, (unsigned long)twcb, sizeof(*twcb));
>
> - estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> - node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
> - gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
> + if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP)
> + return;
> +
> + pr_err("Sending SIGBUS to current task due to memory error not recovered");
> + force_sig(SIGBUS);
> }
>
> static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> {
> unsigned long pfn;
> + struct sync_task_work *twcb;
>
> if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
> return false;
> @@ -499,6 +512,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> return false;
> }
>
> + if (flags == MF_ACTION_REQUIRED && current->mm) {
> + twcb = (void *)gen_pool_alloc(ghes_estatus_pool, sizeof(*twcb));
> + if (!twcb)
> + return false;
> +
> + twcb->pfn = pfn;
> + twcb->flags = flags;
> + init_task_work(&twcb->twork, memory_failure_cb);
> + task_work_add(current, &twcb->twork, TWA_RESUME);
> + return true;
> + }
> +
> memory_failure_queue(pfn, flags);
> return true;
> }
> @@ -746,7 +771,7 @@ int cxl_cper_unregister_callback(cxl_cper_callback callback)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_callback, CXL);
>
> -static bool ghes_do_proc(struct ghes *ghes,
> +static void ghes_do_proc(struct ghes *ghes,
> const struct acpi_hest_generic_status *estatus)
> {
> int sev, sec_sev;
> @@ -814,8 +839,6 @@ static bool ghes_do_proc(struct ghes *ghes,
> pr_err("Sending SIGBUS to current task due to memory error not recovered");
> force_sig(SIGBUS);
> }
> -
> - return queued;
> }
>
> static void __ghes_print_estatus(const char *pfx,
> @@ -1117,9 +1140,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
> struct ghes_estatus_node *estatus_node;
> struct acpi_hest_generic *generic;
> struct acpi_hest_generic_status *estatus;
> - bool task_work_pending;
> u32 len, node_len;
> - int ret;
>
> llnode = llist_del_all(&ghes_estatus_llist);
> /*
> @@ -1134,25 +1155,16 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
> estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> len = cper_estatus_len(estatus);
> node_len = GHES_ESTATUS_NODE_LEN(len);
> - task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
> +
> + ghes_do_proc(estatus_node->ghes, estatus);
> +
> if (!ghes_estatus_cached(estatus)) {
> generic = estatus_node->generic;
> if (ghes_print_estatus(NULL, generic, estatus))
> ghes_estatus_cache_add(generic, estatus);
> }
> -
> - if (task_work_pending && current->mm) {
> - estatus_node->task_work.func = ghes_kick_task_work;
> - estatus_node->task_work_cpu = smp_processor_id();
> - ret = task_work_add(current, &estatus_node->task_work,
> - TWA_RESUME);
> - if (ret)
> - estatus_node->task_work.func = NULL;
> - }
> -
> - if (!estatus_node->task_work.func)
> - gen_pool_free(ghes_estatus_pool,
> - (unsigned long)estatus_node, node_len);
> + gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
> + node_len);
>
> llnode = next;
> }
> @@ -1213,7 +1225,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes,
>
> estatus_node->ghes = ghes;
> estatus_node->generic = ghes->generic;
> - estatus_node->task_work.func = NULL;
> estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>
> if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) {
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index be1dd4c1a917..ebd21b05fe6e 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -35,9 +35,6 @@ struct ghes_estatus_node {
> struct llist_node llnode;
> struct acpi_hest_generic *generic;
> struct ghes *ghes;
> -
> - int task_work_cpu;
> - struct callback_head task_work;
> };
>
> struct ghes_estatus_cache {
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index d33729c48eff..4ad663bdc1d5 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2462,19 +2462,6 @@ static void memory_failure_work_func(struct work_struct *work)
> }
> }
>
> -/*
> - * Process memory_failure work queued on the specified CPU.
> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
> - */
> -void memory_failure_queue_kick(int cpu)
> -{
> - struct memory_failure_cpu *mf_cpu;
> -
> - mf_cpu = &per_cpu(memory_failure_cpu, cpu);
> - cancel_work_sync(&mf_cpu->work);
> - memory_failure_work_func(&mf_cpu->work);
> -}
> -
> static int __init memory_failure_init(void)
> {
> struct memory_failure_cpu *mf_cpu;


Hi, Tony, Borislav, and James:

Any comments on this patch?

Looking forward to hearing from you.

Best Regards,
Shuai

2024-03-08 10:20:09

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v11 3/3] ACPI: APEI: handle synchronous exceptions in task work to send correct SIGBUS si_code

On Sun, Feb 04, 2024 at 04:01:44PM +0800, Shuai Xue wrote:
> Hardware errors could be signaled by asynchronous interrupt, e.g. when an
> error is detected by a background scrubber, or signaled by synchronous
> exception, e.g. when a CPU tries to access a poisoned cache line. Since
> commit a70297d22132 ("ACPI: APEI: set memory failure flags as
> MF_ACTION_REQUIRED on synchronous events")', the flag MF_ACTION_REQUIRED
> could be used to determine whether a synchronous exception occurs on ARM64
> platform. When a synchronous exception is detected, the kernel should
> terminate the current process which accessing the poisoned page. This is

"which has accessed poison data"

> done by sending a SIGBUS signal with an error code BUS_MCEERR_AR,
> indicating an action-required machine check error on read.
>
> However, the memory failure recovery is incorrectly sending a SIGBUS
> with wrong error code BUS_MCEERR_AO for synchronous errors in early kill
> mode, even MF_ACTION_REQUIRED is set. The main problem is that

"even if"

> synchronous errors are queued as a memory_failure() work, and are
> executed within a kernel thread context, not the user-space process that
> encountered the corrupted memory on ARM64 platform. As a result, when
> kill_proc() is called to terminate the process, it sends the incorrect
> SIGBUS error code because the context in which it operates is not the
> one where the error was triggered.
>
> To this end, queue memory_failure() as a task_work so that it runs in
> the context of the process that is actually consuming the poisoned data,
> and it will send SIBBUS with si_code BUS_MCEERR_AR.

SIGBUS

>
> Signed-off-by: Shuai Xue <[email protected]>
> Tested-by: Ma Wupeng <[email protected]>
> Reviewed-by: Kefeng Wang <[email protected]>
> Reviewed-by: Xiaofei Tan <[email protected]>
> Reviewed-by: Baolin Wang <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 77 +++++++++++++++++++++++-----------------
> include/acpi/ghes.h | 3 --
> mm/memory-failure.c | 13 -------
> 3 files changed, 44 insertions(+), 49 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 0892550732d4..e5086d795bee 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -465,28 +465,41 @@ static void ghes_clear_estatus(struct ghes *ghes,
> }
>
> /*
> - * Called as task_work before returning to user-space.
> - * Ensure any queued work has been done before we return to the context that
> - * triggered the notification.
> + * struct sync_task_work - for synchronous RAS event

What's so special about it being a "sync_"?

task_work is just fine and something else could use it too.

> + *
> + * @twork: callback_head for task work
> + * @pfn: page frame number of corrupted page
> + * @flags: fine tune action taken

s/fine tune action taken/work control flags/

> + *
> + * Structure to pass task work to be handled before
> + * ret_to_user via task_work_add().

What is "ret_to_user"?

If this is an ARM thing, then make sure you explain stuff properly and
detailed. This driver is used by multiple architectures.

> */
> -static void ghes_kick_task_work(struct callback_head *head)
> +struct sync_task_work {
> + struct callback_head twork;
> + u64 pfn;
> + int flags;
> +};
> +
> +static void memory_failure_cb(struct callback_head *twork)
> {
> - struct acpi_hest_generic_status *estatus;
> - struct ghes_estatus_node *estatus_node;
> - u32 node_len;
> + int ret;
> + struct sync_task_work *twcb =
> + container_of(twork, struct sync_task_work, twork);

Ugly linebreak - no need for it.

> - estatus_node = container_of(head, struct ghes_estatus_node, task_work);
> - if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
> - memory_failure_queue_kick(estatus_node->task_work_cpu);
> + ret = memory_failure(twcb->pfn, twcb->flags);
> + gen_pool_free(ghes_estatus_pool, (unsigned long)twcb, sizeof(*twcb));
>
> - estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> - node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
> - gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
> + if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP)
> + return;
> +
> + pr_err("Sending SIGBUS to current task due to memory error not recovered");
> + force_sig(SIGBUS);
> }
>
> static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> {
> unsigned long pfn;
> + struct sync_task_work *twcb;
>
> if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
> return false;
> @@ -499,6 +512,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> return false;
> }
>
> + if (flags == MF_ACTION_REQUIRED && current->mm) {
> + twcb = (void *)gen_pool_alloc(ghes_estatus_pool, sizeof(*twcb));
> + if (!twcb)
> + return false;
> +
> + twcb->pfn = pfn;
> + twcb->flags = flags;
> + init_task_work(&twcb->twork, memory_failure_cb);
> + task_work_add(current, &twcb->twork, TWA_RESUME);
> + return true;
> + }
> +
> memory_failure_queue(pfn, flags);
> return true;
> }
> @@ -746,7 +771,7 @@ int cxl_cper_unregister_callback(cxl_cper_callback callback)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_callback, CXL);
>
> -static bool ghes_do_proc(struct ghes *ghes,
> +static void ghes_do_proc(struct ghes *ghes,
> const struct acpi_hest_generic_status *estatus)
> {
> int sev, sec_sev;
> @@ -814,8 +839,6 @@ static bool ghes_do_proc(struct ghes *ghes,
> pr_err("Sending SIGBUS to current task due to memory error not recovered");
> force_sig(SIGBUS);
> }
> -
> - return queued;
> }
>
> static void __ghes_print_estatus(const char *pfx,
> @@ -1117,9 +1140,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
> struct ghes_estatus_node *estatus_node;
> struct acpi_hest_generic *generic;
> struct acpi_hest_generic_status *estatus;
> - bool task_work_pending;
> u32 len, node_len;
> - int ret;
>
> llnode = llist_del_all(&ghes_estatus_llist);
> /*
> @@ -1134,25 +1155,16 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
> estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> len = cper_estatus_len(estatus);
> node_len = GHES_ESTATUS_NODE_LEN(len);
> - task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
> +
> + ghes_do_proc(estatus_node->ghes, estatus);
> +
> if (!ghes_estatus_cached(estatus)) {
> generic = estatus_node->generic;
> if (ghes_print_estatus(NULL, generic, estatus))
> ghes_estatus_cache_add(generic, estatus);
> }
> -
> - if (task_work_pending && current->mm) {
> - estatus_node->task_work.func = ghes_kick_task_work;
> - estatus_node->task_work_cpu = smp_processor_id();
> - ret = task_work_add(current, &estatus_node->task_work,
> - TWA_RESUME);
> - if (ret)
> - estatus_node->task_work.func = NULL;
> - }
> -
> - if (!estatus_node->task_work.func)
> - gen_pool_free(ghes_estatus_pool,
> - (unsigned long)estatus_node, node_len);

I have no clue why this is being removed.

Why doesn't a synchronous exception on ARM call into ghes_proc_in_irq()?

That SDEI thing certainly does.

Well looka here:

7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")

that thing does exactly what you're trying to "fix". So why doesn't that
work for you?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-12 06:06:04

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v11 3/3] ACPI: APEI: handle synchronous exceptions in task work to send correct SIGBUS si_code



On 2024/3/8 18:18, Borislav Petkov wrote:
> On Sun, Feb 04, 2024 at 04:01:44PM +0800, Shuai Xue wrote:
>> Hardware errors could be signaled by asynchronous interrupt, e.g. when an
>> error is detected by a background scrubber, or signaled by synchronous
>> exception, e.g. when a CPU tries to access a poisoned cache line. Since
>> commit a70297d22132 ("ACPI: APEI: set memory failure flags as
>> MF_ACTION_REQUIRED on synchronous events")', the flag MF_ACTION_REQUIRED
>> could be used to determine whether a synchronous exception occurs on ARM64
>> platform. When a synchronous exception is detected, the kernel should
>> terminate the current process which accessing the poisoned page. This is
>
> "which has accessed poison data"

Thank you. Will fix the grammar.

>
>> done by sending a SIGBUS signal with an error code BUS_MCEERR_AR,
>> indicating an action-required machine check error on read.
>>
>> However, the memory failure recovery is incorrectly sending a SIGBUS
>> with wrong error code BUS_MCEERR_AO for synchronous errors in early kill
>> mode, even MF_ACTION_REQUIRED is set. The main problem is that
>
> "even if"

Thank you. Will fix the grammar.


>
>> synchronous errors are queued as a memory_failure() work, and are
>> executed within a kernel thread context, not the user-space process that
>> encountered the corrupted memory on ARM64 platform. As a result, when
>> kill_proc() is called to terminate the process, it sends the incorrect
>> SIGBUS error code because the context in which it operates is not the
>> one where the error was triggered.
>>
>> To this end, queue memory_failure() as a task_work so that it runs in
>> the context of the process that is actually consuming the poisoned data,
>> and it will send SIBBUS with si_code BUS_MCEERR_AR.
>
> SIGBUS

Sorry, will fix the typo.
>
>>
>> Signed-off-by: Shuai Xue <[email protected]>
>> Tested-by: Ma Wupeng <[email protected]>
>> Reviewed-by: Kefeng Wang <[email protected]>
>> Reviewed-by: Xiaofei Tan <[email protected]>
>> Reviewed-by: Baolin Wang <[email protected]>
>> ---
>> drivers/acpi/apei/ghes.c | 77 +++++++++++++++++++++++-----------------
>> include/acpi/ghes.h | 3 --
>> mm/memory-failure.c | 13 -------
>> 3 files changed, 44 insertions(+), 49 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 0892550732d4..e5086d795bee 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -465,28 +465,41 @@ static void ghes_clear_estatus(struct ghes *ghes,
>> }
>>
>> /*
>> - * Called as task_work before returning to user-space.
>> - * Ensure any queued work has been done before we return to the context that
>> - * triggered the notification.
>> + * struct sync_task_work - for synchronous RAS event
>
> What's so special about it being a "sync_"?
>
> task_work is just fine and something else could use it too.

You are right, `sync_task_work` is only used for synchronous RAS events right
now, but it could also be used for other purposes in the future. The purpose
can be specified through the flags.

I will remove the `sync_` prefix.

>
>> + *
>> + * @twork: callback_head for task work
>> + * @pfn: page frame number of corrupted page
>> + * @flags: fine tune action taken
>
> s/fine tune action taken/work control flags/
>

Will fix it.

>> + *
>> + * Structure to pass task work to be handled before
>> + * ret_to_user via task_work_add().
>
> What is "ret_to_user"?
>
> If this is an ARM thing, then make sure you explain stuff properly and
> detailed. This driver is used by multiple architectures.

It is not an ARM-specific thing. I mean the task work is handled before
returning to user space.

+ * Structure to pass task work to be handled before
+ * returning to user-space via task_work_add().
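
For reference, the pattern this comment describes is roughly (a minimal
sketch of the existing task_work API, using the names from this patch):

	#include <linux/task_work.h>

	static void memory_failure_cb(struct callback_head *twork)
	{
		/* runs in the context of the task that queued it, just
		 * before that task returns to user space */
	}

	init_task_work(&twcb->twork, memory_failure_cb);
	task_work_add(current, &twcb->twork, TWA_RESUME);

so the callback is guaranteed to run in the faulting task itself, not in
a kworker.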

>
>> */
>> -static void ghes_kick_task_work(struct callback_head *head)
>> +struct sync_task_work {
>> + struct callback_head twork;
>> + u64 pfn;
>> + int flags;
>> +};
>> +
>> +static void memory_failure_cb(struct callback_head *twork)
>> {
>> - struct acpi_hest_generic_status *estatus;
>> - struct ghes_estatus_node *estatus_node;
>> - u32 node_len;
>> + int ret;
>> + struct sync_task_work *twcb =
>> + container_of(twork, struct sync_task_work, twork);
>
> Ugly linebreak - no need for it.

Will fix it.
>
>> - estatus_node = container_of(head, struct ghes_estatus_node, task_work);
>> - if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>> - memory_failure_queue_kick(estatus_node->task_work_cpu);
>> + ret = memory_failure(twcb->pfn, twcb->flags);
>> + gen_pool_free(ghes_estatus_pool, (unsigned long)twcb, sizeof(*twcb));
>>
>> - estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>> - node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
>> - gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
>> + if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP)
>> + return;
>> +
>> + pr_err("Sending SIGBUS to current task due to memory error not recovered");
>> + force_sig(SIGBUS);
>> }
>>
>> static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>> {
>> unsigned long pfn;
>> + struct sync_task_work *twcb;
>>
>> if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>> return false;
>> @@ -499,6 +512,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>> return false;
>> }
>>
>> + if (flags == MF_ACTION_REQUIRED && current->mm) {
>> + twcb = (void *)gen_pool_alloc(ghes_estatus_pool, sizeof(*twcb));
>> + if (!twcb)
>> + return false;
>> +
>> + twcb->pfn = pfn;
>> + twcb->flags = flags;
>> + init_task_work(&twcb->twork, memory_failure_cb);
>> + task_work_add(current, &twcb->twork, TWA_RESUME);
>> + return true;
>> + }
>> +
>> memory_failure_queue(pfn, flags);
>> return true;
>> }
>> @@ -746,7 +771,7 @@ int cxl_cper_unregister_callback(cxl_cper_callback callback)
>> }
>> EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_callback, CXL);
>>
>> -static bool ghes_do_proc(struct ghes *ghes,
>> +static void ghes_do_proc(struct ghes *ghes,
>> const struct acpi_hest_generic_status *estatus)
>> {
>> int sev, sec_sev;
>> @@ -814,8 +839,6 @@ static bool ghes_do_proc(struct ghes *ghes,
>> pr_err("Sending SIGBUS to current task due to memory error not recovered");
>> force_sig(SIGBUS);
>> }
>> -
>> - return queued;
>> }
>>
>> static void __ghes_print_estatus(const char *pfx,
>> @@ -1117,9 +1140,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>> struct ghes_estatus_node *estatus_node;
>> struct acpi_hest_generic *generic;
>> struct acpi_hest_generic_status *estatus;
>> - bool task_work_pending;
>> u32 len, node_len;
>> - int ret;
>>
>> llnode = llist_del_all(&ghes_estatus_llist);
>> /*
>> @@ -1134,25 +1155,16 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>> estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>> len = cper_estatus_len(estatus);
>> node_len = GHES_ESTATUS_NODE_LEN(len);
>> - task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
>> +
>> + ghes_do_proc(estatus_node->ghes, estatus);
>> +
>> if (!ghes_estatus_cached(estatus)) {
>> generic = estatus_node->generic;
>> if (ghes_print_estatus(NULL, generic, estatus))
>> ghes_estatus_cache_add(generic, estatus);
>> }
>> -
>> - if (task_work_pending && current->mm) {
>> - estatus_node->task_work.func = ghes_kick_task_work;
>> - estatus_node->task_work_cpu = smp_processor_id();
>> - ret = task_work_add(current, &estatus_node->task_work,
>> - TWA_RESUME);
>> - if (ret)
>> - estatus_node->task_work.func = NULL;
>> - }
>> -
>> - if (!estatus_node->task_work.func)
>> - gen_pool_free(ghes_estatus_pool,
>> - (unsigned long)estatus_node, node_len);
>
> I have no clue why this is being removed.

Before this patch, a memory_failure() work item is queued into the
workqueue for both asynchronous interrupts and synchronous exceptions,
so memory_failure() is executed asynchronously. For NMI-like
notifications, commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure()
queue for synchronous errors") keeps track of whether a memory_failure()
work item was queued, and adds a pending task_work to flush out the
queue. It ensures any queued work has been done before we return to the
context that triggered the notification.

In this patch:

- a memory_failure() work item is queued into the workqueue for asynchronous interrupts
- a memory_failure() task_work is queued via task_work_add() for synchronous exceptions

The memory_failure() task_work is handled before returning to user
space, so we no longer need to queue a flushing task_work; see the
sketch below.
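
Roughly (a simplified sketch of the two paths after this patch):

	/* asynchronous notification, e.g. an error found by a scrubber:
	 * defer to the memory_failure workqueue, handled in a kworker */
	memory_failure_queue(pfn, flags);

	/* synchronous exception, e.g. SEA on poison consumption: run
	 * memory_failure() as task_work in the faulting task itself,
	 * before it returns to user space */
	init_task_work(&twcb->twork, memory_failure_cb);
	task_work_add(current, &twcb->twork, TWA_RESUME);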

>
> Why doesn't a synchronous exception on ARM call into ghes_proc_in_irq()?

/*
* SEA can interrupt SError, mask it and describe this as an NMI so
* that APEI defers the handling.
*/
local_daif_restore(DAIF_ERRCTX);
nmi_enter();
=> ghes_notify_sea
=> ghes_in_nmi_spool_from_list
=> ghes_in_nmi_queue_one_entry // also called in __ghes_sdei_callback
=> irq_work_queue(&ghes_proc_irq_work);
nmi_exit();
>
> That SDEI thing certainly does.
>
> Well looka here:
>
> 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
>
> that thing does exactly what you're trying to "fix". So why doesn't that
> work for you?
>

Commit a70297d22132 (ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events)
set MF_ACTION_REQUIRED for synchronous events.

/*
* Send all the processes who have the page mapped a signal.
* ``action optional'' if they are not immediately affected by the error
* ``action required'' if error happened in current execution context
*/
static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
{
...
if ((flags & MF_ACTION_REQUIRED) && (t == current))
ret = force_sig_mceerr(BUS_MCEERR_AR,
(void __user *)tk->addr, addr_lsb);
else
ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr,
addr_lsb, t);
...
}

Because memory_failure() runs in a kthread context, the else branch in
kill_proc() sends SIGBUS with BUS_MCEERR_AO. But we expect BUS_MCEERR_AR.

Thank you for valuable comments :)

Best Regards,
Shuai