2022-12-05 12:07:40

by Lv Ying

[permalink] [raw]
Subject: [RFC 0/2] ACPI: APEI: Make synchronization errors call and

This patchset tries to fix two problems:
- memory_failure() triggered by synchronous errors runs in kernel
thread context
- ignoring a memory_failure() failure causes a synchronous error infinite
loop, so the platform firmware exceeds some threshold and reboots

Lv Ying (2):
ACPI: APEI: Make memory_failure() triggered by synchronization errors
execute in the current context
ACPI: APEI: fix reboot caused by synchronous error loop because of
memory_failure() failed

drivers/acpi/apei/ghes.c | 27 ++++++++++++++-------------
mm/memory-failure.c | 38 ++++++++++++++++++++++++++------------
2 files changed, 40 insertions(+), 25 deletions(-)

--
2.33.0


2022-12-05 12:25:58

by Lv Ying

[permalink] [raw]
Subject: [RFC 1/2] ACPI: APEI: Make memory_failure() triggered by synchronization errors execute in the current context

A memory uncorrected error which is detected by an external component and
notified via an IRQ can be called an asynchronous error. If an error is
detected as a result of a user-space process accessing a corrupt memory
location, the CPU may take an abort. On arm64 this is a
'synchronous external abort', and on a firmware-first system it is notified
via NOTIFY_SEA; this can be called a synchronous error.

Currently, synchronous errors and asynchronous errors both use
memory_failure_queue() to schedule memory_failure() to execute in kworker
context. Commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue
for synchronous errors") made task_work pending to flush out the queue, but
cancel_work_sync() in memory_failure_queue_kick() makes memory_failure()
execute in kworker context first, which consumes the synchronous error info
from the kfifo, so the later task_work gets nothing from the kfifo and does
not work as expected. Even worse, synchronous error notification has
NMI-like properties (it can interrupt IRQ-masked code), so the task_work
may pick up the wrong kfifo entry, coming from an interrupted asynchronous
error which was notified by IRQ.

Since the memory_failure() triggered by a synchronous exception is
executed in kworker context, the early_kill mode of memory_failure()
will send the wrong si_code with the SIGBUS signal: the current process is
a kworker thread, so the actual user-space process accessing the corrupt
memory location is collected by find_early_kill_thread() and is then sent
SIGBUS with the BUS_MCEERR_AO si_code instead of BUS_MCEERR_AR. The machine
manager (kvm) uses the si_code BUS_MCEERR_AO for 'action optional' early
notifications, and BUS_MCEERR_AR for 'action required' synchronous/late
notifications.

Make memory_failure() triggered by synchronous errors execute in the
current context: we no longer need the workqueue for synchronous errors and
can use task_work to handle them directly. Since synchronous errors and
asynchronous errors share the same kfifo, use the MF_ACTION_REQUIRED flag
to distinguish them. The asynchronous error path keeps the same behavior as
before.

Signed-off-by: Lv Ying <[email protected]>
---
drivers/acpi/apei/ghes.c | 27 ++++++++++++++-------------
mm/memory-failure.c | 34 ++++++++++++++++++++++------------
2 files changed, 36 insertions(+), 25 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 9952f3a792ba..2ec71fc8a8dd 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -423,8 +423,8 @@ static void ghes_clear_estatus(struct ghes *ghes,

/*
* Called as task_work before returning to user-space.
- * Ensure any queued work has been done before we return to the context that
- * triggered the notification.
+ * Ensure any queued corrupt page in synchronous errors has been handled before
+ * we return to the user context that triggered the notification.
*/
static void ghes_kick_task_work(struct callback_head *head)
{
@@ -461,7 +461,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
}

static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
- int sev)
+ int sev, bool sync)
{
int flags = -1;
int sec_sev = ghes_severity(gdata->error_severity);
@@ -475,7 +475,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
(gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
flags = MF_SOFT_OFFLINE;
if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
- flags = 0;
+ flags = sync ? MF_ACTION_REQUIRED : 0;

if (flags != -1)
return ghes_do_memory_failure(mem_err->physical_addr, flags);
@@ -483,7 +483,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
return false;
}

-static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
+static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev, bool sync)
{
struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
bool queued = false;
@@ -510,7 +510,8 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
* and don't filter out 'corrected' error here.
*/
if (is_cache && has_pa) {
- queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
+ queued = ghes_do_memory_failure(err_info->physical_fault_addr,
+ sync ? MF_ACTION_REQUIRED : 0);
p += err_info->length;
continue;
}
@@ -623,7 +624,7 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
}

static bool ghes_do_proc(struct ghes *ghes,
- const struct acpi_hest_generic_status *estatus)
+ const struct acpi_hest_generic_status *estatus, bool sync)
{
int sev, sec_sev;
struct acpi_hest_generic_data *gdata;
@@ -648,13 +649,13 @@ static bool ghes_do_proc(struct ghes *ghes,
ghes_edac_report_mem_error(sev, mem_err);

arch_apei_report_mem_error(sev, mem_err);
- queued = ghes_handle_memory_failure(gdata, sev);
+ queued = ghes_handle_memory_failure(gdata, sev, sync);
}
else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
ghes_handle_aer(gdata);
}
else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
- queued = ghes_handle_arm_hw_error(gdata, sev);
+ queued = ghes_handle_arm_hw_error(gdata, sev, sync);
} else {
void *err = acpi_hest_get_payload(gdata);

@@ -868,7 +869,7 @@ static int ghes_proc(struct ghes *ghes)
if (ghes_print_estatus(NULL, ghes->generic, estatus))
ghes_estatus_cache_add(ghes->generic, estatus);
}
- ghes_do_proc(ghes, estatus);
+ ghes_do_proc(ghes, estatus, false);

out:
ghes_clear_estatus(ghes, estatus, buf_paddr, FIX_APEI_GHES_IRQ);
@@ -961,7 +962,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
struct ghes_estatus_node *estatus_node;
struct acpi_hest_generic *generic;
struct acpi_hest_generic_status *estatus;
- bool task_work_pending;
+ bool corruption_page_pending;
u32 len, node_len;
int ret;

@@ -978,14 +979,14 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
len = cper_estatus_len(estatus);
node_len = GHES_ESTATUS_NODE_LEN(len);
- task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
+ corruption_page_pending = ghes_do_proc(estatus_node->ghes, estatus, true);
if (!ghes_estatus_cached(estatus)) {
generic = estatus_node->generic;
if (ghes_print_estatus(NULL, generic, estatus))
ghes_estatus_cache_add(generic, estatus);
}

- if (task_work_pending && current->mm) {
+ if (corruption_page_pending && current->mm) {
estatus_node->task_work.func = ghes_kick_task_work;
estatus_node->task_work_cpu = smp_processor_id();
ret = task_work_add(current, &estatus_node->task_work,
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index bead6bccc7f2..3b6ac3694b8d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2204,7 +2204,11 @@ struct memory_failure_cpu {
static DEFINE_PER_CPU(struct memory_failure_cpu, memory_failure_cpu);

/**
- * memory_failure_queue - Schedule handling memory failure of a page.
+ * memory_failure_queue
+ * - Schedule handling memory failure of a page for asynchronous error, memory
+ * failure page will be executed in kworker thread
+ * - put corrupt memory info into kfifo for synchronous error, task_work will
+ * handle them before returning to the user
* @pfn: Page Number of the corrupted page
* @flags: Flags for memory failure handling
*
@@ -2217,6 +2221,11 @@ static DEFINE_PER_CPU(struct memory_failure_cpu, memory_failure_cpu);
* happen outside the current execution context (e.g. when
* detected by a background scrubber)
*
+ * This function can also be used in synchronous errors which was detected as a
+ * result of user-space accessing a corrupt memory location, just put memory
+ * error info into kfifo, and then, task_work get and handle it in current
+ * execution context instead of scheduling kworker to handle it
+ *
* Can run in IRQ context.
*/
void memory_failure_queue(unsigned long pfn, int flags)
@@ -2230,9 +2239,10 @@ void memory_failure_queue(unsigned long pfn, int flags)

mf_cpu = &get_cpu_var(memory_failure_cpu);
spin_lock_irqsave(&mf_cpu->lock, proc_flags);
- if (kfifo_put(&mf_cpu->fifo, entry))
- schedule_work_on(smp_processor_id(), &mf_cpu->work);
- else
+ if (kfifo_put(&mf_cpu->fifo, entry)) {
+ if (!(entry.flags & MF_ACTION_REQUIRED))
+ schedule_work_on(smp_processor_id(), &mf_cpu->work);
+ } else
pr_err("buffer overflow when queuing memory failure at %#lx\n",
pfn);
spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
@@ -2240,7 +2250,7 @@ void memory_failure_queue(unsigned long pfn, int flags)
}
EXPORT_SYMBOL_GPL(memory_failure_queue);

-static void memory_failure_work_func(struct work_struct *work)
+static void __memory_failure_work_func(struct work_struct *work, bool sync)
{
struct memory_failure_cpu *mf_cpu;
struct memory_failure_entry entry = { 0, };
@@ -2256,22 +2266,22 @@ static void memory_failure_work_func(struct work_struct *work)
break;
if (entry.flags & MF_SOFT_OFFLINE)
soft_offline_page(entry.pfn, entry.flags);
- else
+ else if (!sync || (entry.flags & MF_ACTION_REQUIRED))
memory_failure(entry.pfn, entry.flags);
}
}

-/*
- * Process memory_failure work queued on the specified CPU.
- * Used to avoid return-to-userspace racing with the memory_failure workqueue.
- */
+static void memory_failure_work_func(struct work_struct *work)
+{
+ __memory_failure_work_func(work, false);
+}
+
void memory_failure_queue_kick(int cpu)
{
struct memory_failure_cpu *mf_cpu;

mf_cpu = &per_cpu(memory_failure_cpu, cpu);
- cancel_work_sync(&mf_cpu->work);
- memory_failure_work_func(&mf_cpu->work);
+ __memory_failure_work_func(&mf_cpu->work, true);
}

static int __init memory_failure_init(void)
--
2.33.0

2022-12-05 12:29:51

by Lv Ying

[permalink] [raw]
Subject: [RFC 2/2] ACPI: APEI: fix reboot caused by synchronous error loop because of memory_failure() failed

A synchronous error is detected as a result of user-space accessing a
corrupt memory location; the CPU may take an abort. On arm64 this is a
'synchronous external abort', which can be notified via NOTIFY_SEA.

If memory_failure() fails, returning to user-space will trigger the SEA
again; such a loop may cause the platform firmware to exceed some threshold
and reboot when Linux could have recovered from this error. Not every
memory_failure() processing failure causes the reboot:
VM_FAULT_HWPOISON[_LARGE] handling in the arm64 page fault path sends a
SIGBUS signal to the accessing user-space process, which terminates the
loop.

However, if the process has the faulting page mapped but memory_failure()
returns abnormally before try_to_unmap(), for example when the faulting
page is a KSM page, arm64 cannot rely on the page fault path to terminate
the loop.

Add a check of the memory_failure() result in task_work before returning to
user-space. If memory_failure() failed, send a SIGBUS signal to the current
process to avoid the SEA loop.

Signed-off-by: Lv Ying <[email protected]>
---
mm/memory-failure.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3b6ac3694b8d..4c1c558f7161 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2266,7 +2266,11 @@ static void __memory_failure_work_func(struct work_struct *work, bool sync)
break;
if (entry.flags & MF_SOFT_OFFLINE)
soft_offline_page(entry.pfn, entry.flags);
- else if (!sync || (entry.flags & MF_ACTION_REQUIRED))
+ else if (sync) {
+ if ((entry.flags & MF_ACTION_REQUIRED) &&
+ memory_failure(entry.pfn, entry.flags))
+ force_sig_mceerr(BUS_MCEERR_AR, 0, 0);
+ } else
memory_failure(entry.pfn, entry.flags);
}
}
--
2.33.0

2022-12-07 11:06:55

by Shuai Xue

[permalink] [raw]
Subject: Re: [RFC 1/2] ACPI: APEI: Make memory_failure() triggered by synchronization errors execute in the current context



On 2022/12/5 PM7:51, Lv Ying wrote:
> The memory uncorrected error which is detected by an external component and
> notified via an IRQ, can be called asynchronization error. If an error is
> detected as a result of user-space process accessing a corrupt memory
> location, the CPU may take an abort. On arm64 this is a
> 'synchronous external abort', and on a firmware first system it is notified
> via NOTIFY_SEA, this can be called synchronization error.
>
> Currently, synchronization error and asynchronization error both use
> memory_failure_queue to schedule memory_failure() exectute in kworker
> context. Commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue
> for synchronous errors") make task_work pending to flush out the queue,
> cancel_work_sync() in memory_failure_queue_kick() will make
> memory_failure() exectute in kworker context first which will get
> synchronization error info from kfifo, so task_work later will get nothing
> from kfifo which doesn't work as expected. Even worse, synchronization
> error notification has NMI like properties, (it can interrupt IRQ-masked
> code), task_work may get wrong kfifo entry from interrupted
> asynchronization error which is notified by IRQ.
>
> Since the memory_failure() triggered by a synchronous exception is
> executed in the kworker context, the early_kill mode of memory_failure()
> will send wrong si_code by SIGBUS signal: current process is kworker
> thread, the actual user-space process accessing the corrupt memory location
> will be collected by find_early_kill_thread(), and then send SIGBUS with
> BUS_MCEERR_AO si_code to the actual user-space process instead of
> BUS_MCEERR_AR. The machine-manager(kvm) use the si_code: BUS_MCEERR_AO for
> 'action optional' early notifications, and BUS_MCEERR_AR for
> 'action required' synchronous/late notifications.
>
> Make memory_failure() triggered by synchronization errors execute in the
> current context, we do not need workqueue for synchronization error
> anymore, use task_work handle synchronization errors directly. Since,
> synchronization errors and asynchronization errors share the same kfifo,
> use MF_ACTION_REQUIRED flag to distinguish them. And the asynchronization
> error keeps the same as before.


Hi, Lv Ying,

Thank you for your great work.

We also encountered this problem in a production environment, and tried to
solve it by dividing synchronous and asynchronous error handling into
different paths: task_work for synchronous errors and a workqueue for
asynchronous errors.

The main challenge is how to distinguish synchronous errors in kernel first
mode through APEI; a related discussion is here [1].

> @@ -978,14 +979,14 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
> estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> len = cper_estatus_len(estatus);
> node_len = GHES_ESTATUS_NODE_LEN(len);
> - task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
> + corruption_page_pending = ghes_do_proc(estatus_node->ghes, estatus, true);
> if (!ghes_estatus_cached(estatus)) {
> generic = estatus_node->generic;
> if (ghes_print_estatus(NULL, generic, estatus))
> ghes_estatus_cache_add(generic, estatus);
> }

In the case of your patch, it is inappropriate to assume that ghes_proc_in_irq() is only
called to handle synchronous errors. Firmware could notify all synchronous and asynchronous
error signals to the kernel through an NMI-like notification, e.g. SDEI. In that case,
asynchronous errors would be treated as synchronous errors.

Our colleague Yingwen has submitted a proposal to extend acpi_hest_generic_data::flags (bit 8)
to indicate that the error is synchronous [2]. Personally, I think it is a more general
solution and completely solves the problem.


> Background:
>
> In ARM world, two type events (Sync/Async) from hardware IP need OS/VMM take different actions.
> Current CPER memory error record is not able to distinguish sync/async type event right now.
> Current OS/VMM need to take extra actions beyond CPER which is heavy burden to identify the
> two type events
>
> Sync event (e.g. CPU consume poisoned data) --> Firmware -> CPER error log --> OS/VMM take recovery action.
> Async event (e.g. Memory controller detect UE event) --> Firmware --> CPER error log --> OS take page action.
>
>
> Proposal:
>
> - In section description Flags field(UEFI spec section N.2, add sync flag as below. OS/VMM
> could depend on this flag to distinguish sync/async events.
> - Bit8 – sync flag; if set this flag indicates that this event record is synchronous(e.g.
> cpu core consumes poison data, then cause instruction/data abort); if not set, this event record is asynchronous.
>
> Best regards,
> Yingwen Chen

A RFC patch set based on above proposal is here[3].
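
For illustration only, here is a rough sketch of how ghes_handle_memory_failure()
could consume such a per-section sync bit once it lands in the spec. Both the
CPER_SEC_SYNC name and the exact field layout are hypothetical; the real
definition will come from the UEFI update and the RFC patch set [3]:

/* Hypothetical: the proposed sync bit (bit 8) of the section Flags field. */
#define CPER_SEC_SYNC		BIT(8)

	...
	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
		/*
		 * A synchronous record must be recovered before the poisoned
		 * user context is allowed to run again, hence action required.
		 */
		flags = (gdata->flags & CPER_SEC_SYNC) ? MF_ACTION_REQUIRED : 0;
	...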

Thank you.

Best Regards,
Shuai


[1] https://lore.kernel.org/lkml/[email protected]/T/
[2] https://members.uefi.org/wg/uswg/mail/thread/9453
[3] https://lore.kernel.org/lkml/[email protected]/


>
> Signed-off-by: Lv Ying <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 27 ++++++++++++++-------------
> mm/memory-failure.c | 34 ++++++++++++++++++++++------------
> 2 files changed, 36 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 9952f3a792ba..2ec71fc8a8dd 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -423,8 +423,8 @@ static void ghes_clear_estatus(struct ghes *ghes,
>
> /*
> * Called as task_work before returning to user-space.
> - * Ensure any queued work has been done before we return to the context that
> - * triggered the notification.
> + * Ensure any queued corrupt page in synchronous errors has been handled before
> + * we return to the user context that triggered the notification.
> */
> static void ghes_kick_task_work(struct callback_head *head)
> {
> @@ -461,7 +461,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
> }
>
> static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> - int sev)
> + int sev, bool sync)
> {
> int flags = -1;
> int sec_sev = ghes_severity(gdata->error_severity);
> @@ -475,7 +475,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
> flags = MF_SOFT_OFFLINE;
> if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
> - flags = 0;
> + flags = sync ? MF_ACTION_REQUIRED : 0;
>
> if (flags != -1)
> return ghes_do_memory_failure(mem_err->physical_addr, flags);
> @@ -483,7 +483,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> return false;
> }
>
> -static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
> +static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev, bool sync)
> {
> struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
> bool queued = false;
> @@ -510,7 +510,8 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
> * and don't filter out 'corrected' error here.
> */
> if (is_cache && has_pa) {
> - queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
> + queued = ghes_do_memory_failure(err_info->physical_fault_addr,
> + sync ? MF_ACTION_REQUIRED : 0);
> p += err_info->length;
> continue;
> }
> @@ -623,7 +624,7 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
> }
>
> static bool ghes_do_proc(struct ghes *ghes,
> - const struct acpi_hest_generic_status *estatus)
> + const struct acpi_hest_generic_status *estatus, bool sync)
> {
> int sev, sec_sev;
> struct acpi_hest_generic_data *gdata;
> @@ -648,13 +649,13 @@ static bool ghes_do_proc(struct ghes *ghes,
> ghes_edac_report_mem_error(sev, mem_err);
>
> arch_apei_report_mem_error(sev, mem_err);
> - queued = ghes_handle_memory_failure(gdata, sev);
> + queued = ghes_handle_memory_failure(gdata, sev, sync);
> }
> else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
> ghes_handle_aer(gdata);
> }
> else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
> - queued = ghes_handle_arm_hw_error(gdata, sev);
> + queued = ghes_handle_arm_hw_error(gdata, sev, sync);
> } else {
> void *err = acpi_hest_get_payload(gdata);
>
> @@ -868,7 +869,7 @@ static int ghes_proc(struct ghes *ghes)
> if (ghes_print_estatus(NULL, ghes->generic, estatus))
> ghes_estatus_cache_add(ghes->generic, estatus);
> }
> - ghes_do_proc(ghes, estatus);
> + ghes_do_proc(ghes, estatus, false);
>
> out:
> ghes_clear_estatus(ghes, estatus, buf_paddr, FIX_APEI_GHES_IRQ);
> @@ -961,7 +962,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
> struct ghes_estatus_node *estatus_node;
> struct acpi_hest_generic *generic;
> struct acpi_hest_generic_status *estatus;
> - bool task_work_pending;
> + bool corruption_page_pending;
> u32 len, node_len;
> int ret;
>
> @@ -978,14 +979,14 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
> estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
> len = cper_estatus_len(estatus);
> node_len = GHES_ESTATUS_NODE_LEN(len);
> - task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
> + corruption_page_pending = ghes_do_proc(estatus_node->ghes, estatus, true);
> if (!ghes_estatus_cached(estatus)) {
> generic = estatus_node->generic;
> if (ghes_print_estatus(NULL, generic, estatus))
> ghes_estatus_cache_add(generic, estatus);
> }
>
> - if (task_work_pending && current->mm) {
> + if (corruption_page_pending && current->mm) {
> estatus_node->task_work.func = ghes_kick_task_work;
> estatus_node->task_work_cpu = smp_processor_id();
> ret = task_work_add(current, &estatus_node->task_work,
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index bead6bccc7f2..3b6ac3694b8d 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2204,7 +2204,11 @@ struct memory_failure_cpu {
> static DEFINE_PER_CPU(struct memory_failure_cpu, memory_failure_cpu);
>
> /**
> - * memory_failure_queue - Schedule handling memory failure of a page.
> + * memory_failure_queue
> + * - Schedule handling memory failure of a page for asynchronous error, memory
> + * failure page will be executed in kworker thread
> + * - put corrupt memory info into kfifo for synchronous error, task_work will
> + * handle them before returning to the user
> * @pfn: Page Number of the corrupted page
> * @flags: Flags for memory failure handling
> *
> @@ -2217,6 +2221,11 @@ static DEFINE_PER_CPU(struct memory_failure_cpu, memory_failure_cpu);
> * happen outside the current execution context (e.g. when
> * detected by a background scrubber)
> *
> + * This function can also be used in synchronous errors which was detected as a
> + * result of user-space accessing a corrupt memory location, just put memory
> + * error info into kfifo, and then, task_work get and handle it in current
> + * execution context instead of scheduling kworker to handle it
> + *
> * Can run in IRQ context.
> */
> void memory_failure_queue(unsigned long pfn, int flags)
> @@ -2230,9 +2239,10 @@ void memory_failure_queue(unsigned long pfn, int flags)
>
> mf_cpu = &get_cpu_var(memory_failure_cpu);
> spin_lock_irqsave(&mf_cpu->lock, proc_flags);
> - if (kfifo_put(&mf_cpu->fifo, entry))
> - schedule_work_on(smp_processor_id(), &mf_cpu->work);
> - else
> + if (kfifo_put(&mf_cpu->fifo, entry)) {
> + if (!(entry.flags & MF_ACTION_REQUIRED))
> + schedule_work_on(smp_processor_id(), &mf_cpu->work);
> + } else
> pr_err("buffer overflow when queuing memory failure at %#lx\n",
> pfn);
> spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
> @@ -2240,7 +2250,7 @@ void memory_failure_queue(unsigned long pfn, int flags)
> }
> EXPORT_SYMBOL_GPL(memory_failure_queue);
>
> -static void memory_failure_work_func(struct work_struct *work)
> +static void __memory_failure_work_func(struct work_struct *work, bool sync)
> {
> struct memory_failure_cpu *mf_cpu;
> struct memory_failure_entry entry = { 0, };
> @@ -2256,22 +2266,22 @@ static void memory_failure_work_func(struct work_struct *work)
> break;
> if (entry.flags & MF_SOFT_OFFLINE)
> soft_offline_page(entry.pfn, entry.flags);
> - else
> + else if (!sync || (entry.flags & MF_ACTION_REQUIRED))
> memory_failure(entry.pfn, entry.flags);
> }
> }
>
> -/*
> - * Process memory_failure work queued on the specified CPU.
> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
> - */
> +static void memory_failure_work_func(struct work_struct *work)
> +{
> + __memory_failure_work_func(work, false);
> +}
> +
> void memory_failure_queue_kick(int cpu)
> {
> struct memory_failure_cpu *mf_cpu;
>
> mf_cpu = &per_cpu(memory_failure_cpu, cpu);
> - cancel_work_sync(&mf_cpu->work);
> - memory_failure_work_func(&mf_cpu->work);
> + __memory_failure_work_func(&mf_cpu->work, true);
> }
>
> static int __init memory_failure_init(void)

2022-12-07 12:40:57

by Bixuan Cui

[permalink] [raw]
Subject: Re: [RFC 2/2] ACPI: APEI: fix reboot caused by synchronous error loop because of memory_failure() failed



On 2022/12/5 19:51, Lv Ying wrote:
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3b6ac3694b8d..4c1c558f7161 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2266,7 +2266,11 @@ static void __memory_failure_work_func(struct work_struct *work, bool sync)
> break;
> if (entry.flags & MF_SOFT_OFFLINE)
> soft_offline_page(entry.pfn, entry.flags);
> - else if (!sync || (entry.flags & MF_ACTION_REQUIRED))
> + else if (sync) {
> + if ((entry.flags & MF_ACTION_REQUIRED) &&
> + memory_failure(entry.pfn, entry.flags))
> + force_sig_mceerr(BUS_MCEERR_AR, 0, 0);
> + } else
> memory_failure(entry.pfn, entry.flags);
Hi,

Some of the ideas in this patch are wrong :-(

1. As Shuai Xue said, it is wrong to tell synchronous errors and
asynchronous errors apart by which function is on the path, such as
memory_failure_queue_kick()/ghes_proc()/ghes_proc_in_irq(), because both
synchronous and asynchronous errors may arrive via the same notification.

2. There is no need to pass 'sync' to __memory_failure_work_func(),
because memory_failure() can directly handle synchronous and
asynchronous errors according to entry.flags & MF_ACTION_REQUIRED:

- MF_ACTION_REQUIRED set in entry.flags: poison the page and kill the task
  (synchronous error)
- MF_ACTION_REQUIRED clear in entry.flags: poison the page only
  (asynchronous error)

Reference x86:

do_machine_check			# MCE, synchronous
  -> kill_me_maybe
    -> memory_failure(p->mce_addr >> PAGE_SHIFT, MF_ACTION_REQUIRED);

uc_decode_notifier			# CMCI, asynchronous
  -> memory_failure(pfn, 0)

Also, the modification here duplicates what your patch 01 already does:
if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
- flags = 0;
+ flags = sync ? MF_ACTION_REQUIRED : 0;
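
For illustration, a minimal sketch of what point 2 implies (an assumed
simplification, not a posted patch): with MF_ACTION_REQUIRED carried in
entry.flags, the dequeue side can stay exactly like the existing work
function and needs no extra 'sync' argument:

static void memory_failure_work_func(struct work_struct *work)
{
	struct memory_failure_cpu *mf_cpu;
	struct memory_failure_entry entry = { 0, };
	unsigned long proc_flags;
	int gotten;

	mf_cpu = container_of(work, struct memory_failure_cpu, work);
	for (;;) {
		spin_lock_irqsave(&mf_cpu->lock, proc_flags);
		gotten = kfifo_get(&mf_cpu->fifo, &entry);
		spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
		if (!gotten)
			break;
		if (entry.flags & MF_SOFT_OFFLINE)
			soft_offline_page(entry.pfn, entry.flags);
		else
			/* memory_failure() keys off MF_ACTION_REQUIRED itself */
			memory_failure(entry.pfn, entry.flags);
	}
}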

3. Why add force_sig_mceerr(BUS_MCEERR_AR, 0, 0) after
memory_failure(pfn, MF_ACTION_REQUIRED)?
The task will already be killed inside memory_failure():
- if the page is already poisoned: kill_accessing_process() -> kill_proc()
- if not yet poisoned: hwpoison_user_mappings() -> collect_procs() -> kill_procs()

Reference x86 to handle synchronous error:

kill_me_maybe()
{
	int flags = MF_ACTION_REQUIRED;

	ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
	if (!ret) {
		...
		return;
	}
	if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
		return;

	pr_err("Memory error not recovered");
	kill_me_now(cb);
}


Thanks,
Bixuan Cui

2022-12-08 02:27:05

by Lv Ying

[permalink] [raw]
Subject: Re: [RFC 1/2] ACPI: APEI: Make memory_failure() triggered by synchronization errors execute in the current context


>
> We also encountered this problem in production environment, and tried to
> solve it by dividing synchronous and asynchronous error handling into different
> paths: task work for synchronous error and workqueue for asynchronous error.
>
> The main challenge is how to distinguish synchronous errors in kernel first
> mode through APEI, a related discussion is here.[1]
>
Hi Shuai:

I'm very happy to have received your response; we encountered this
problem in our production environment too. I spent a lot of time
researching why synchronous errors and asynchronous errors share the same
handling path. As I mention in my patch commit message: memory_failure()
triggered by a synchronous error is executed in kworker context, so the
early_kill mode of memory_failure() will send the wrong si_code with the
SIGBUS signal because the current process is the wrong one.

The challenge for my patch is to justify how synchronous errors are
distinguished. I could not find a good way to distinguish synchronous
errors by looking through the ACPI/UEFI spec, so I sent this patchset for
more input, and I resent RFC PATCH v1 [1] with this added as a TODO.

>> @@ -978,14 +979,14 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>> estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>> len = cper_estatus_len(estatus);
>> node_len = GHES_ESTATUS_NODE_LEN(len);
>> - task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
>> + corruption_page_pending = ghes_do_proc(estatus_node->ghes, estatus, true);
>> if (!ghes_estatus_cached(estatus)) {
>> generic = estatus_node->generic;
>> if (ghes_print_estatus(NULL, generic, estatus))
>> ghes_estatus_cache_add(generic, estatus);
>> }
>
> In the case of your patch, it is inappropriate to assume that ghes_proc_in_irq() is only
> called to handle synchronous error. Firmware could notify all synchronous and asynchronous
> error signals to kernel through NMI notification, e.g. SDEI. In this case, asynchronous
> error will be treated as synchronous error.
>

Yes, as I mentioned above, I do not have a good way of distinguishing
synchronous errors. I agree with you that ghes_proc_in_irq() is called for
the SDEI, SEA and NMI notify types; they are NMI-like notifications, and
this function runs in IRQ context some work which may not be NMI safe. And
an NMI may carry an asynchronous error.

However, the current kernel uses ghes_kick_task_work() in
ghes_proc_in_irq(), so there is already an assumption that
ghes_proc_in_irq() runs in the context of a synchronous exception, although
this is not appropriate.

> Our colleague Yingwen has submitted a proposal to extend acpi_hest_generic_data::flag (bit 8)
> to indicate that the error is a synchronous[2]. Personally speaking, it is a more general
> solution and completely solves the problem.
>
>
>> Background:
>>
>> In ARM world, two type events (Sync/Async) from hardware IP need OS/VMM take different actions.
>> Current CPER memory error record is not able to distinguish sync/async type event right now.
>> Current OS/VMM need to take extra actions beyond CPER which is heavy burden to identify the
>> two type events
>>
>> Sync event (e.g. CPU consume poisoned data) --> Firmware -> CPER error log --> OS/VMM take recovery action.
>> Async event (e.g. Memory controller detect UE event) --> Firmware --> CPER error log --> OS take page action.
>>
>>
>> Proposal:
>>
>> - In section description Flags field(UEFI spec section N.2, add sync flag as below. OS/VMM
>> could depend on this flag to distinguish sync/async events.
>> - Bit8 – sync flag; if set this flag indicates that this event record is synchronous(e.g.
>> cpu core consumes poison data, then cause instruction/data abort); if not set, this event record is asynchronous.
>>
>> Best regards,
>> Yingwen Chen
>
> A RFC patch set based on above proposal is here[3].
>

I'm glad your team is working on advancing this in the UEFI community.
It will make identifying synchronous errors in the kernel an easy job, and
based on your proposal I think your work is the suitable approach.

However, there are real problems in running production environments on the
LTS kernel:
1. the new proposal has not yet been accepted into the UEFI spec
2. it will require a firmware update in the production environment, which
may be more difficult than a kernel livepatch upgrade.

If there are other ways to distinguish synchronous errors in the kernel's
APEI code, I can try them. For example, in APEI mode, is it suitable to use
the notification type to distinguish synchronous errors, as sketched below?
synchronous error: SDEI, SEA
asynchronous error: NMI
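
A rough sketch of what I mean (illustrative only, not a patch):

/* Classify by notification type: SEA and SDEI synchronous, NMI not. */
static bool ghes_notify_is_sync(struct ghes *ghes)
{
	switch (ghes->generic->notify.type) {
	case ACPI_HEST_NOTIFY_SEA:
	case ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED:
		return true;
	default:
		return false;
	}
}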

[1]
https://lore.kernel.org/linux-mm/[email protected]/T/


--
Thanks!
Lv Ying

2022-12-08 02:54:25

by Xie XiuQi

[permalink] [raw]
Subject: Re: [RFC 1/2] ACPI: APEI: Make memory_failure() triggered by synchronization errors execute in the current context


On 2022/12/7 18:57, Shuai Xue wrote:
>
>
> On 2022/12/5 PM7:51, Lv Ying wrote:
>> The memory uncorrected error which is detected by an external component and
>> notified via an IRQ, can be called asynchronization error. If an error is
>> detected as a result of user-space process accessing a corrupt memory
>> location, the CPU may take an abort. On arm64 this is a
>> 'synchronous external abort', and on a firmware first system it is notified
>> via NOTIFY_SEA, this can be called synchronization error.
>>
>> Currently, synchronization error and asynchronization error both use
>> memory_failure_queue to schedule memory_failure() exectute in kworker
>> context. Commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue
>> for synchronous errors") make task_work pending to flush out the queue,
>> cancel_work_sync() in memory_failure_queue_kick() will make
>> memory_failure() exectute in kworker context first which will get
>> synchronization error info from kfifo, so task_work later will get nothing
>> from kfifo which doesn't work as expected. Even worse, synchronization
>> error notification has NMI like properties, (it can interrupt IRQ-masked
>> code), task_work may get wrong kfifo entry from interrupted
>> asynchronization error which is notified by IRQ.
>>
>> Since the memory_failure() triggered by a synchronous exception is
>> executed in the kworker context, the early_kill mode of memory_failure()
>> will send wrong si_code by SIGBUS signal: current process is kworker
>> thread, the actual user-space process accessing the corrupt memory location
>> will be collected by find_early_kill_thread(), and then send SIGBUS with
>> BUS_MCEERR_AO si_code to the actual user-space process instead of
>> BUS_MCEERR_AR. The machine-manager(kvm) use the si_code: BUS_MCEERR_AO for
>> 'action optional' early notifications, and BUS_MCEERR_AR for
>> 'action required' synchronous/late notifications.
>>
>> Make memory_failure() triggered by synchronization errors execute in the
>> current context, we do not need workqueue for synchronization error
>> anymore, use task_work handle synchronization errors directly. Since,
>> synchronization errors and asynchronization errors share the same kfifo,
>> use MF_ACTION_REQUIRED flag to distinguish them. And the asynchronization
>> error keeps the same as before.
>
>
> Hi, Lv Ying,
>
> Thank you for your great work.
>
> We also encountered this problem in production environment, and tried to
> solve it by dividing synchronous and asynchronous error handling into different
> paths: task work for synchronous error and workqueue for asynchronous error.
>
> The main challenge is how to distinguish synchronous errors in kernel first
> mode through APEI, a related discussion is here.[1]
>
>> @@ -978,14 +979,14 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>> estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>> len = cper_estatus_len(estatus);
>> node_len = GHES_ESTATUS_NODE_LEN(len);
>> - task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
>> + corruption_page_pending = ghes_do_proc(estatus_node->ghes, estatus, true);
>> if (!ghes_estatus_cached(estatus)) {
>> generic = estatus_node->generic;
>> if (ghes_print_estatus(NULL, generic, estatus))
>> ghes_estatus_cache_add(generic, estatus);
>> }
>
> In the case of your patch, it is inappropriate to assume that ghes_proc_in_irq() is only
> called to handle synchronous error. Firmware could notify all synchronous and asynchronous
> error signals to kernel through NMI notification, e.g. SDEI. In this case, asynchronous
> error will be treated as synchronous error.
>
> Our colleague Yingwen has submitted a proposal to extend acpi_hest_generic_data::flag (bit 8)
> to indicate that the error is a synchronous[2]. Personally speaking, it is a more general
> solution and completely solves the problem.
>
>
>> Background:
>>
>> In ARM world, two type events (Sync/Async) from hardware IP need OS/VMM take different actions.
>> Current CPER memory error record is not able to distinguish sync/async type event right now.
>> Current OS/VMM need to take extra actions beyond CPER which is heavy burden to identify the
>> two type events
>>
>> Sync event (e.g. CPU consume poisoned data) --> Firmware -> CPER error log --> OS/VMM take recovery action.
>> Async event (e.g. Memory controller detect UE event) --> Firmware --> CPER error log --> OS take page action.
>>
>>
>> Proposal:
>>
>> - In section description Flags field(UEFI spec section N.2, add sync flag as below. OS/VMM
>> could depend on this flag to distinguish sync/async events.
>> - Bit8 – sync flag; if set this flag indicates that this event record is synchronous(e.g.
>> cpu core consumes poison data, then cause instruction/data abort); if not set, this event record is asynchronous.
>>
>> Best regards,
>> Yingwen Chen
>
> A RFC patch set based on above proposal is here[3].

Hi Shuai & Lv Ying,

Thanks for your great work; I'm also trying to improve the handling of ARM SEA,
but I had missed the issue with cancel_work_sync().

I also agree that adding a SYNC flag to CPER is the better way.
Otherwise, the notification type has to be used to distinguish them.
Let's try to fix this problem before the new UEFI version comes out.

After all, we would have to wait for a new UEFI version and a new BIOS
version, which may take a long time.

This is the previous patchset:
https://lore.kernel.org/linux-arm-kernel/[email protected]/T/


>
> Thank you.
>
> Best Regards,
> Shuai
>
>
> [1] https://lore.kernel.org/lkml/[email protected]/T/
> [2] https://members.uefi.org/wg/uswg/mail/thread/9453
> [3] https://lore.kernel.org/lkml/[email protected]/
>
>
>>
>> Signed-off-by: Lv Ying <[email protected]>
>> ---
>> drivers/acpi/apei/ghes.c | 27 ++++++++++++++-------------
>> mm/memory-failure.c | 34 ++++++++++++++++++++++------------
>> 2 files changed, 36 insertions(+), 25 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 9952f3a792ba..2ec71fc8a8dd 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -423,8 +423,8 @@ static void ghes_clear_estatus(struct ghes *ghes,
>>
>> /*
>> * Called as task_work before returning to user-space.
>> - * Ensure any queued work has been done before we return to the context that
>> - * triggered the notification.
>> + * Ensure any queued corrupt page in synchronous errors has been handled before
>> + * we return to the user context that triggered the notification.
>> */
>> static void ghes_kick_task_work(struct callback_head *head)
>> {
>> @@ -461,7 +461,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>> }
>>
>> static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>> - int sev)
>> + int sev, bool sync)
>> {
>> int flags = -1;
>> int sec_sev = ghes_severity(gdata->error_severity);
>> @@ -475,7 +475,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>> (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
>> flags = MF_SOFT_OFFLINE;
>> if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
>> - flags = 0;
>> + flags = sync ? MF_ACTION_REQUIRED : 0;
>>
>> if (flags != -1)
>> return ghes_do_memory_failure(mem_err->physical_addr, flags);
>> @@ -483,7 +483,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>> return false;
>> }
>>
>> -static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
>> +static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev, bool sync)
>> {
>> struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
>> bool queued = false;
>> @@ -510,7 +510,8 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
>> * and don't filter out 'corrected' error here.
>> */
>> if (is_cache && has_pa) {
>> - queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
>> + queued = ghes_do_memory_failure(err_info->physical_fault_addr,
>> + sync ? MF_ACTION_REQUIRED : 0);
>> p += err_info->length;
>> continue;
>> }
>> @@ -623,7 +624,7 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
>> }
>>
>> static bool ghes_do_proc(struct ghes *ghes,
>> - const struct acpi_hest_generic_status *estatus)
>> + const struct acpi_hest_generic_status *estatus, bool sync)
>> {
>> int sev, sec_sev;
>> struct acpi_hest_generic_data *gdata;
>> @@ -648,13 +649,13 @@ static bool ghes_do_proc(struct ghes *ghes,
>> ghes_edac_report_mem_error(sev, mem_err);
>>
>> arch_apei_report_mem_error(sev, mem_err);
>> - queued = ghes_handle_memory_failure(gdata, sev);
>> + queued = ghes_handle_memory_failure(gdata, sev, sync);
>> }
>> else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
>> ghes_handle_aer(gdata);
>> }
>> else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
>> - queued = ghes_handle_arm_hw_error(gdata, sev);
>> + queued = ghes_handle_arm_hw_error(gdata, sev, sync);
>> } else {
>> void *err = acpi_hest_get_payload(gdata);
>>
>> @@ -868,7 +869,7 @@ static int ghes_proc(struct ghes *ghes)
>> if (ghes_print_estatus(NULL, ghes->generic, estatus))
>> ghes_estatus_cache_add(ghes->generic, estatus);
>> }
>> - ghes_do_proc(ghes, estatus);
>> + ghes_do_proc(ghes, estatus, false);
>>
>> out:
>> ghes_clear_estatus(ghes, estatus, buf_paddr, FIX_APEI_GHES_IRQ);
>> @@ -961,7 +962,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>> struct ghes_estatus_node *estatus_node;
>> struct acpi_hest_generic *generic;
>> struct acpi_hest_generic_status *estatus;
>> - bool task_work_pending;
>> + bool corruption_page_pending;
>> u32 len, node_len;
>> int ret;
>>
>> @@ -978,14 +979,14 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>> estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>> len = cper_estatus_len(estatus);
>> node_len = GHES_ESTATUS_NODE_LEN(len);
>> - task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
>> + corruption_page_pending = ghes_do_proc(estatus_node->ghes, estatus, true);
>> if (!ghes_estatus_cached(estatus)) {
>> generic = estatus_node->generic;
>> if (ghes_print_estatus(NULL, generic, estatus))
>> ghes_estatus_cache_add(generic, estatus);
>> }
>>
>> - if (task_work_pending && current->mm) {
>> + if (corruption_page_pending && current->mm) {
>> estatus_node->task_work.func = ghes_kick_task_work;
>> estatus_node->task_work_cpu = smp_processor_id();
>> ret = task_work_add(current, &estatus_node->task_work,
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index bead6bccc7f2..3b6ac3694b8d 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -2204,7 +2204,11 @@ struct memory_failure_cpu {
>> static DEFINE_PER_CPU(struct memory_failure_cpu, memory_failure_cpu);
>>
>> /**
>> - * memory_failure_queue - Schedule handling memory failure of a page.
>> + * memory_failure_queue
>> + * - Schedule handling memory failure of a page for asynchronous error, memory
>> + * failure page will be executed in kworker thread
>> + * - put corrupt memory info into kfifo for synchronous error, task_work will
>> + * handle them before returning to the user
>> * @pfn: Page Number of the corrupted page
>> * @flags: Flags for memory failure handling
>> *
>> @@ -2217,6 +2221,11 @@ static DEFINE_PER_CPU(struct memory_failure_cpu, memory_failure_cpu);
>> * happen outside the current execution context (e.g. when
>> * detected by a background scrubber)
>> *
>> + * This function can also be used in synchronous errors which was detected as a
>> + * result of user-space accessing a corrupt memory location, just put memory
>> + * error info into kfifo, and then, task_work get and handle it in current
>> + * execution context instead of scheduling kworker to handle it
>> + *
>> * Can run in IRQ context.
>> */
>> void memory_failure_queue(unsigned long pfn, int flags)
>> @@ -2230,9 +2239,10 @@ void memory_failure_queue(unsigned long pfn, int flags)
>>
>> mf_cpu = &get_cpu_var(memory_failure_cpu);
>> spin_lock_irqsave(&mf_cpu->lock, proc_flags);
>> - if (kfifo_put(&mf_cpu->fifo, entry))
>> - schedule_work_on(smp_processor_id(), &mf_cpu->work);
>> - else
>> + if (kfifo_put(&mf_cpu->fifo, entry)) {
>> + if (!(entry.flags & MF_ACTION_REQUIRED))
>> + schedule_work_on(smp_processor_id(), &mf_cpu->work);
>> + } else
>> pr_err("buffer overflow when queuing memory failure at %#lx\n",
>> pfn);
>> spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
>> @@ -2240,7 +2250,7 @@ void memory_failure_queue(unsigned long pfn, int flags)
>> }
>> EXPORT_SYMBOL_GPL(memory_failure_queue);
>>
>> -static void memory_failure_work_func(struct work_struct *work)
>> +static void __memory_failure_work_func(struct work_struct *work, bool sync)
>> {
>> struct memory_failure_cpu *mf_cpu;
>> struct memory_failure_entry entry = { 0, };
>> @@ -2256,22 +2266,22 @@ static void memory_failure_work_func(struct work_struct *work)
>> break;
>> if (entry.flags & MF_SOFT_OFFLINE)
>> soft_offline_page(entry.pfn, entry.flags);
>> - else
>> + else if (!sync || (entry.flags & MF_ACTION_REQUIRED))
>> memory_failure(entry.pfn, entry.flags);
>> }
>> }
>>
>> -/*
>> - * Process memory_failure work queued on the specified CPU.
>> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
>> - */
>> +static void memory_failure_work_func(struct work_struct *work)
>> +{
>> + __memory_failure_work_func(work, false);
>> +}
>> +
>> void memory_failure_queue_kick(int cpu)
>> {
>> struct memory_failure_cpu *mf_cpu;
>>
>> mf_cpu = &per_cpu(memory_failure_cpu, cpu);
>> - cancel_work_sync(&mf_cpu->work);
>> - memory_failure_work_func(&mf_cpu->work);
>> + __memory_failure_work_func(&mf_cpu->work, true);
>> }
>>
>> static int __init memory_failure_init(void)
>
> .
>

2022-12-08 03:19:36

by Lv Ying

[permalink] [raw]
Subject: Re: [RFC 2/2] ACPI: APEI: fix reboot caused by synchronous error loop because of memory_failure() failed

>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 3b6ac3694b8d..4c1c558f7161 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -2266,7 +2266,11 @@ static void __memory_failure_work_func(struct
>> work_struct *work, bool sync)
>>               break;
>>           if (entry.flags & MF_SOFT_OFFLINE)
>>               soft_offline_page(entry.pfn, entry.flags);
>> -        else if (!sync || (entry.flags & MF_ACTION_REQUIRED))
>> +        else if (sync) {
>> +            if ((entry.flags & MF_ACTION_REQUIRED) &&
>> +                    memory_failure(entry.pfn, entry.flags))
>> +                force_sig_mceerr(BUS_MCEERR_AR, 0, 0);
>> +        } else
>>               memory_failure(entry.pfn, entry.flags);
> Hi,
>
> Some of the ideas in this patch are wrong :-(
>
> 1. As Shuai Xue said, it is wrong to judge synchronization error and
> asynchronization error through functions such as
> memory_failure_queue_kick()/ghes_proc()/ghes_proc_in_irq(), because both
> synchronization error and asynchronization error may go to the same
> notification.
>
Hi Bixuan:

Thanks for your review. I agree with you that ghes_proc_in_irq() is called
for the SDEI, SEA and NMI notify types; they are NMI-like notifications,
and this function runs in IRQ context some work which may not be NMI safe.
And an NMI may carry an asynchronous error.

However, the current kernel uses ghes_kick_task_work() in
ghes_proc_in_irq(), so there is already an assumption that
ghes_proc_in_irq() runs in the context of a synchronous exception, although
this is not appropriate.

The challenge for my patch is to justify how synchronous errors are
distinguished. I do not yet have a good way of distinguishing synchronous
errors by looking through the ACPI/UEFI spec, so I sent this patchset for
more input, and I resent RFC PATCH v1 [1] with this added as a TODO.

> 2. There is no need to pass 'sync' to __memory_failure_work_func(),
> because memory_failure() can directly handle synchronous and
> asynchronous errors according to entry.flags & MF_ACTION_REQUIRED:
>
> entry.flags & MF_ACTION_REQUIRED == 1: Action: poison page and kill task
> for synchronous error
> entry.flags & MF_ACTION_REQUIRED == 0: Action: poison page for
> asynchronous error
>
> Reference x86:
> do_machine_check # MCE, synchronous
>    ->kill_me_maybe
>      ->memory_failure(p->mce_addr >> PAGE_SHIFT, MF_ACTION_REQUIRED);
>
> uc_decode_notifier # CMCI, asynchronous
>    ->memory_failure(pfn, 0)
>
> At the same time, the modification here is repeated with your patch 01
>      if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
> -        flags = 0;
> +        flags = sync ? MF_ACTION_REQUIRED : 0;
>

Thanks, there is indeed no need to pass 'sync' to
__memory_failure_work_func(); MF_ACTION_REQUIRED can cover this. I will
update it in the next version of the patchset.

> 3. Why add 'force_sig_mceerr(BUS_MCEERR_AR, 0, 0)' after
> memory_failure(pfn, MF_ACTION_REQUIRED)?
> The task will be killed in memory_failure():
> if poisoned, kill_accessing_process()->kill_proc()
> if not poisoned, hwpoison_user_mappings()->collect_procs()->kill_procs()
>
> Reference x86 to handle synchronous error:
> kill_me_maybe()
> {
>     int flags = MF_ACTION_REQUIRED;
>     ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
>     if (!ret) {
>     ...
>         return;
>     }
>     if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
>         return;
>
>     pr_err("Memory error not recovered");
>     kill_me_now(cb);
> }
>

Thanks again. This patch was written assuming synchronous errors are not
distinguished from asynchronous errors; in that case,
kill_accessing_process() running in a kthread worker may not kill the
current thread. Now, based on the first patch, this SEA loop can be
handled, but this patch still needs to follow the x86 kill_me_maybe()
approach, and I updated it accordingly in RFC PATCH v1 [1]. I will fold
this patch into the first patch, because its commit message no longer fits
once the first patch is applied.

[1]
https://lore.kernel.org/linux-mm/[email protected]/T/
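
Roughly what I have in mind for the task_work path (a sketch only, with a
hypothetical helper name, not the exact code in v1):

/*
 * Mirror x86 kill_me_maybe(): if recovery fails, kill the current task so
 * that returning to user space cannot re-trigger the SEA forever.
 */
static void ghes_sync_memory_failure(unsigned long pfn)
{
	int ret = memory_failure(pfn, MF_ACTION_REQUIRED);

	if (ret == 0 || ret == -EHWPOISON || ret == -EOPNOTSUPP)
		return;

	pr_err("Memory error not recovered\n");
	force_sig(SIGBUS);
}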


--
Thanks!
Lv Ying

2022-12-08 03:34:37

by Shuai Xue

[permalink] [raw]
Subject: Re: [RFC 1/2] ACPI: APEI: Make memory_failure() triggered by synchronization errors execute in the current context

+ Xie XiuQi

On 2022/12/8 AM10:20, Lv Ying wrote:
>
>>
>> We also encountered this problem in production environment, and tried to
>> solve it by dividing synchronous and asynchronous error handling into different
>> paths: task work for synchronous error and workqueue for asynchronous error.
>>
>> The main challenge is how to distinguish synchronous errors in kernel first
>> mode through APEI, a related discussion is here.[1]
>>
> Hi Shuai:
>
> I'm very happy to have received your response, we encountered this problem in our production environment too. I spent a lot of time researching the rationale for synchronous errors and asynchronous errors to share the same process. I mention in my patch commit message: memory_failure() triggered by synchronous error is
> executed in the kworker context, the early_kill mode of memory_failure()
> will send wrong si_code by SIGBUS signal because of wrong current process.
>
> The challenge for my patch is to prove the rationality of distinguishing synchronous errors. I do not have a good idea of distinguishing synchronous error by looking through ACPI/UEFI spec, so I sent this patchset for more input. And I resent RFC PATCH v1 [1]add this as TODO.
>
>>> @@ -978,14 +979,14 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>>           estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>>           len = cper_estatus_len(estatus);
>>>           node_len = GHES_ESTATUS_NODE_LEN(len);
>>> -        task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
>>> +        corruption_page_pending = ghes_do_proc(estatus_node->ghes, estatus, true);
>>>           if (!ghes_estatus_cached(estatus)) {
>>>               generic = estatus_node->generic;
>>>               if (ghes_print_estatus(NULL, generic, estatus))
>>>                   ghes_estatus_cache_add(generic, estatus);
>>>           }
>>
>> In the case of your patch, it is inappropriate to assume that ghes_proc_in_irq() is only
>> called to handle synchronous error. Firmware could notify all synchronous and asynchronous
>> error signals to kernel through NMI notification, e.g. SDEI. In this case, asynchronous
>> error will be treated as synchronous error.
>>
>
> Yes, as I mentioned above, I do not have a good idea of distinguishing synchronous error. I agree with you that ghes_proc_in_irq() is called in SDEI, SEA, NMI notify type, they are NMI-like notify, this function run some job which may not be NMI safe in IRQ context. And NMI may be asynchronous error.
>
> However, cureent kernel use ghes_kick_task_work in ghes_proc_in_irq(), there is an assumption here that ghes_proc_in_irq() are currently in the context of a synchronous exception, although this is not appropriate.
>
>> Our colleague Yingwen has submitted a proposal to extend acpi_hest_generic_data::flag (bit 8)
>> to indicate that the error is a synchronous[2]. Personally speaking, it is a more general
>> solution and completely solves the problem.
>>
>>
>>> Background:
>>>
>>> In ARM world, two type events (Sync/Async) from hardware IP need OS/VMM take different actions.
>>> Current CPER memory error record is not able to distinguish sync/async type event right now.
>>> Current OS/VMM need to take extra actions beyond CPER which is heavy burden to identify the
>>> two type events
>>>
>>> Sync event (e.g. CPU consume poisoned data) --> Firmware  -> CPER error log  --> OS/VMM take recovery action.
>>> Async event (e.g. Memory controller detect UE event)  --> Firmware  --> CPER error log  --> OS take page action.
>>>
>>>
>>> Proposal:
>>>
>>> - In section description Flags field(UEFI spec section N.2, add sync flag as below. OS/VMM
>>>   could depend on this flag to distinguish sync/async events.
>>> - Bit8 – sync flag; if set this flag indicates that this event record is synchronous(e.g.
>>>   cpu core consumes poison data, then cause instruction/data abort); if not set, this event record is asynchronous.
>>>
>>> Best regards,
>>> Yingwen Chen
>>
>> A RFC patch set based on above proposal is here[3].
>>
>
> I'm glad your team is working on advancing this thing in the UEFI community. This will make kernel identify synchronous as an easy job. Based on your proposal, I think your work is suitable.
>
> However, there are real problems here in the running production environment of the LTS kernel:
> 1. this new proposal has not yet been accepted by the UEFI spec
> 2. it will require firmware update in production environment which may be more difficult than kernel livepatch upgrade.
>
> If there are more ways to distinguish synchronous error in kernel APEI,
> I can try it. For example, in APEI mode, is it suitable to use notify type to distinguish synchronous error?
> synchronous error:
> SDEI SEA
> asynchronous error:
> NMI

Sorry, I'm afraid it is not suitable. There is no doubt that the SEA
notification is synchronous, but we should not assume that platforms use the
NMI notification only for asynchronous errors and the SDEI notification only
for synchronous errors.

If we want to address the problem now, I would prefer to only consider the
SEA notification as synchronous and add MF_ACTION_REQUIRED when SEA is used,
roughly as sketched below. Please refer to XiuQi's work [1].

[1]https://lore.kernel.org/linux-arm-kernel/[email protected]/T/
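
Roughly (a sketch only, not XiuQi's actual patch; see [1] for that):

/* Only the SEA notification is known to be synchronous. */
static bool is_hest_sync_notify(struct ghes *ghes)
{
	return ghes->generic->notify.type == ACPI_HEST_NOTIFY_SEA;
}

	...
	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
		flags = is_hest_sync_notify(ghes) ? MF_ACTION_REQUIRED : 0;
	...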

Best Regards,
Shuai

>
> [1] https://lore.kernel.org/linux-mm/[email protected]/T/
>
>

2022-12-08 04:46:23

by Shuai Xue

[permalink] [raw]
Subject: Re: [RFC 1/2] ACPI: APEI: Make memory_failure() triggered by synchronization errors execute in the current context



On 2022/12/8 AM10:37, Xie XiuQi wrote:
>
> On 2022/12/7 18:57, Shuai Xue wrote:
>>
>>
>> On 2022/12/5 PM7:51, Lv Ying wrote:
>>> The memory uncorrected error which is detected by an external component and
>>> notified via an IRQ, can be called asynchronization error. If an error is
>>> detected as a result of user-space process accessing a corrupt memory
>>> location, the CPU may take an abort. On arm64 this is a
>>> 'synchronous external abort', and on a firmware first system it is notified
>>> via NOTIFY_SEA, this can be called synchronization error.
>>>
>>> Currently, synchronization error and asynchronization error both use
>>> memory_failure_queue to schedule memory_failure() exectute in kworker
>>> context. Commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue
>>> for synchronous errors") make task_work pending to flush out the queue,
>>> cancel_work_sync() in memory_failure_queue_kick() will make
>>> memory_failure() exectute in kworker context first which will get
>>> synchronization error info from kfifo, so task_work later will get nothing
>>> from kfifo which doesn't work as expected. Even worse, synchronization
>>> error notification has NMI like properties, (it can interrupt IRQ-masked
>>> code), task_work may get wrong kfifo entry from interrupted
>>> asynchronization error which is notified by IRQ.
>>>
>>> Since the memory_failure() triggered by a synchronous exception is
>>> executed in the kworker context, the early_kill mode of memory_failure()
>>> will send wrong si_code by SIGBUS signal: current process is kworker
>>> thread, the actual user-space process accessing the corrupt memory location
>>> will be collected by find_early_kill_thread(), and then send SIGBUS with
>>> BUS_MCEERR_AO si_code to the actual user-space process instead of
>>> BUS_MCEERR_AR. The machine-manager(kvm) use the si_code: BUS_MCEERR_AO for
>>> 'action optional' early notifications, and BUS_MCEERR_AR for
>>> 'action required' synchronous/late notifications.
>>>
>>> Make memory_failure() triggered by synchronization errors execute in the
>>> current context, we do not need workqueue for synchronization error
>>> anymore, use task_work handle synchronization errors directly. Since,
>>> synchronization errors and asynchronization errors share the same kfifo,
>>> use MF_ACTION_REQUIRED flag to distinguish them. And the asynchronization
>>> error keeps the same as before.
>>
>>
>> Hi, Lv Ying,
>>
>> Thank you for your great work.
>>
>> We also encountered this problem in the production environment, and tried to
>> solve it by dividing synchronous and asynchronous error handling into different
>> paths: task work for synchronous errors and a workqueue for asynchronous errors.
>>
>> The main challenge is how to distinguish synchronous errors in kernel-first
>> mode through APEI; a related discussion is here. [1]
>>
>>> @@ -978,14 +979,14 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>> estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>> len = cper_estatus_len(estatus);
>>> node_len = GHES_ESTATUS_NODE_LEN(len);
>>> - task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
>>> + corruption_page_pending = ghes_do_proc(estatus_node->ghes, estatus, true);
>>> if (!ghes_estatus_cached(estatus)) {
>>> generic = estatus_node->generic;
>>> if (ghes_print_estatus(NULL, generic, estatus))
>>> ghes_estatus_cache_add(generic, estatus);
>>> }
>>
>> In the case of your patch, it is inappropriate to assume that ghes_proc_in_irq() is only
>> called to handle synchronous errors. Firmware could notify all synchronous and asynchronous
>> error signals to the kernel through an NMI notification, e.g. SDEI. In that case, asynchronous
>> errors would be treated as synchronous errors.
>>
>> Our colleague Yingwen has submitted a proposal to extend acpi_hest_generic_data::flags (bit 8)
>> to indicate that the error is synchronous [2]. Personally speaking, it is a more general
>> solution and completely solves the problem.
>>
>>
>>> Background:
>>>
>>> In ARM world, two type events (Sync/Async) from hardware IP need OS/VMM take different actions.
>>> Current CPER memory error record is not able to distinguish sync/async type event right now.
>>> Current OS/VMM need to take extra actions beyond CPER which is heavy burden to identify the
>>> two type events
>>>
>>> Sync event (e.g. CPU consume poisoned data) --> Firmware -> CPER error log --> OS/VMM take recovery action.
>>> Async event (e.g. Memory controller detect UE event) --> Firmware --> CPER error log --> OS take page action.
>>>
>>>
>>> Proposal:
>>>
>>> - In section description Flags field(UEFI spec section N.2, add sync flag as below. OS/VMM
>>> could depend on this flag to distinguish sync/async events.
>>> - Bit8 – sync flag; if set this flag indicates that this event record is synchronous(e.g.
>>> cpu core consumes poison data, then cause instruction/data abort); if not set, this event record is asynchronous.
>>>
>>> Best regards,
>>> Yingwen Chen
>>
>> A RFC patch set based on above proposal is here[3].
>
> Hi Shuai & Lv Ying,
>
> Thanks for your great work, I'm also trying to improve the handling for ARM SEA.
> But I missed the issue about cancel_work_sync().

Maybe part of my RFC [1] can address this issue :)

[1][RFC PATCH 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events
https://lore.kernel.org/lkml/[email protected]/T/#m2bd61100552ea3aeb1b0aae3ab71df85a580de4c

>
> I also agree that adding a SYNC flag in CPER is a better way.
> Otherwise, the notification type has to be used to distinguish them.
> Let's try to fix this problem before the new UEFI version comes out.
>
> After all, we have to wait for a new UEFI version and a new BIOS version,
> which may take a long time.

Agree. At least we could address the problem for SEA notification before a new UEFI version
is ready.
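If the sync flag does land later, the interim SEA-only check could simply become one more input. Everything below is an assumption: CPER_SEC_SYNC is not defined anywhere yet, today's acpi_hest_generic_data::flags field is only 8 bits wide so it would need widening first, and the helper name is made up.

/* Hypothetical flag from the pending UEFI proposal; not defined anywhere yet. */
#define CPER_SEC_SYNC	BIT(8)

/*
 * sec_flags stands in for the widened section Flags field once the spec
 * and ACPICA changes land.  Until then, only the notification type can
 * be used, and only SEA can safely be called synchronous.
 */
static bool ghes_error_is_sync(struct ghes *ghes, u32 sec_flags)
{
	if (sec_flags & CPER_SEC_SYNC)
		return true;

	return ghes->generic->notify.type == ACPI_HEST_NOTIFY_SEA;
}

That way the SEA-only fix would not need to be unwound once the spec change arrives; the flag check just becomes a more precise source of truth.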

>
> This is the previous patchset.
> https://lore.kernel.org/linux-arm-kernel/[email protected]/T/

Great work, thank you.

Best regards,
Shuai
>
>>
>> Thank you.
>>
>> Best Regards,
>> Shuai
>>
>>
>> [1] https://lore.kernel.org/lkml/[email protected]/T/
>> [2] https://members.uefi.org/wg/uswg/mail/thread/9453
>> [3] https://lore.kernel.org/lkml/[email protected]/
>>
>>
>>>
>>> Signed-off-by: Lv Ying <[email protected]>
>>> ---
>>> drivers/acpi/apei/ghes.c | 27 ++++++++++++++-------------
>>> mm/memory-failure.c | 34 ++++++++++++++++++++++------------
>>> 2 files changed, 36 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>> index 9952f3a792ba..2ec71fc8a8dd 100644
>>> --- a/drivers/acpi/apei/ghes.c
>>> +++ b/drivers/acpi/apei/ghes.c
>>> @@ -423,8 +423,8 @@ static void ghes_clear_estatus(struct ghes *ghes,
>>>
>>> /*
>>> * Called as task_work before returning to user-space.
>>> - * Ensure any queued work has been done before we return to the context that
>>> - * triggered the notification.
>>> + * Ensure any queued corrupt page in synchronous errors has been handled before
>>> + * we return to the user context that triggered the notification.
>>> */
>>> static void ghes_kick_task_work(struct callback_head *head)
>>> {
>>> @@ -461,7 +461,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>>> }
>>>
>>> static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>>> - int sev)
>>> + int sev, bool sync)
>>> {
>>> int flags = -1;
>>> int sec_sev = ghes_severity(gdata->error_severity);
>>> @@ -475,7 +475,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>>> (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
>>> flags = MF_SOFT_OFFLINE;
>>> if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
>>> - flags = 0;
>>> + flags = sync ? MF_ACTION_REQUIRED : 0;
>>>
>>> if (flags != -1)
>>> return ghes_do_memory_failure(mem_err->physical_addr, flags);
>>> @@ -483,7 +483,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>>> return false;
>>> }
>>>
>>> -static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
>>> +static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev, bool sync)
>>> {
>>> struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
>>> bool queued = false;
>>> @@ -510,7 +510,8 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
>>> * and don't filter out 'corrected' error here.
>>> */
>>> if (is_cache && has_pa) {
>>> - queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
>>> + queued = ghes_do_memory_failure(err_info->physical_fault_addr,
>>> + sync ? MF_ACTION_REQUIRED : 0);
>>> p += err_info->length;
>>> continue;
>>> }
>>> @@ -623,7 +624,7 @@ static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
>>> }
>>>
>>> static bool ghes_do_proc(struct ghes *ghes,
>>> - const struct acpi_hest_generic_status *estatus)
>>> + const struct acpi_hest_generic_status *estatus, bool sync)
>>> {
>>> int sev, sec_sev;
>>> struct acpi_hest_generic_data *gdata;
>>> @@ -648,13 +649,13 @@ static bool ghes_do_proc(struct ghes *ghes,
>>> ghes_edac_report_mem_error(sev, mem_err);
>>>
>>> arch_apei_report_mem_error(sev, mem_err);
>>> - queued = ghes_handle_memory_failure(gdata, sev);
>>> + queued = ghes_handle_memory_failure(gdata, sev, sync);
>>> }
>>> else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
>>> ghes_handle_aer(gdata);
>>> }
>>> else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
>>> - queued = ghes_handle_arm_hw_error(gdata, sev);
>>> + queued = ghes_handle_arm_hw_error(gdata, sev, sync);
>>> } else {
>>> void *err = acpi_hest_get_payload(gdata);
>>>
>>> @@ -868,7 +869,7 @@ static int ghes_proc(struct ghes *ghes)
>>> if (ghes_print_estatus(NULL, ghes->generic, estatus))
>>> ghes_estatus_cache_add(ghes->generic, estatus);
>>> }
>>> - ghes_do_proc(ghes, estatus);
>>> + ghes_do_proc(ghes, estatus, false);
>>>
>>> out:
>>> ghes_clear_estatus(ghes, estatus, buf_paddr, FIX_APEI_GHES_IRQ);
>>> @@ -961,7 +962,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>> struct ghes_estatus_node *estatus_node;
>>> struct acpi_hest_generic *generic;
>>> struct acpi_hest_generic_status *estatus;
>>> - bool task_work_pending;
>>> + bool corruption_page_pending;
>>> u32 len, node_len;
>>> int ret;
>>>
>>> @@ -978,14 +979,14 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>> estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>> len = cper_estatus_len(estatus);
>>> node_len = GHES_ESTATUS_NODE_LEN(len);
>>> - task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
>>> + corruption_page_pending = ghes_do_proc(estatus_node->ghes, estatus, true);
>>> if (!ghes_estatus_cached(estatus)) {
>>> generic = estatus_node->generic;
>>> if (ghes_print_estatus(NULL, generic, estatus))
>>> ghes_estatus_cache_add(generic, estatus);
>>> }
>>>
>>> - if (task_work_pending && current->mm) {
>>> + if (corruption_page_pending && current->mm) {
>>> estatus_node->task_work.func = ghes_kick_task_work;
>>> estatus_node->task_work_cpu = smp_processor_id();
>>> ret = task_work_add(current, &estatus_node->task_work,
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index bead6bccc7f2..3b6ac3694b8d 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -2204,7 +2204,11 @@ struct memory_failure_cpu {
>>> static DEFINE_PER_CPU(struct memory_failure_cpu, memory_failure_cpu);
>>>
>>> /**
>>> - * memory_failure_queue - Schedule handling memory failure of a page.
>>> + * memory_failure_queue
>>> + * - Schedule handling memory failure of a page for asynchronous error, memory
>>> + * failure page will be executed in kworker thread
>>> + * - put corrupt memory info into kfifo for synchronous error, task_work will
>>> + * handle them before returning to the user
>>> * @pfn: Page Number of the corrupted page
>>> * @flags: Flags for memory failure handling
>>> *
>>> @@ -2217,6 +2221,11 @@ static DEFINE_PER_CPU(struct memory_failure_cpu, memory_failure_cpu);
>>> * happen outside the current execution context (e.g. when
>>> * detected by a background scrubber)
>>> *
>>> + * This function can also be used in synchronous errors which was detected as a
>>> + * result of user-space accessing a corrupt memory location, just put memory
>>> + * error info into kfifo, and then, task_work get and handle it in current
>>> + * execution context instead of scheduling kworker to handle it
>>> + *
>>> * Can run in IRQ context.
>>> */
>>> void memory_failure_queue(unsigned long pfn, int flags)
>>> @@ -2230,9 +2239,10 @@ void memory_failure_queue(unsigned long pfn, int flags)
>>>
>>> mf_cpu = &get_cpu_var(memory_failure_cpu);
>>> spin_lock_irqsave(&mf_cpu->lock, proc_flags);
>>> - if (kfifo_put(&mf_cpu->fifo, entry))
>>> - schedule_work_on(smp_processor_id(), &mf_cpu->work);
>>> - else
>>> + if (kfifo_put(&mf_cpu->fifo, entry)) {
>>> + if (!(entry.flags & MF_ACTION_REQUIRED))
>>> + schedule_work_on(smp_processor_id(), &mf_cpu->work);
>>> + } else
>>> pr_err("buffer overflow when queuing memory failure at %#lx\n",
>>> pfn);
>>> spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
>>> @@ -2240,7 +2250,7 @@ void memory_failure_queue(unsigned long pfn, int flags)
>>> }
>>> EXPORT_SYMBOL_GPL(memory_failure_queue);
>>>
>>> -static void memory_failure_work_func(struct work_struct *work)
>>> +static void __memory_failure_work_func(struct work_struct *work, bool sync)
>>> {
>>> struct memory_failure_cpu *mf_cpu;
>>> struct memory_failure_entry entry = { 0, };
>>> @@ -2256,22 +2266,22 @@ static void memory_failure_work_func(struct work_struct *work)
>>> break;
>>> if (entry.flags & MF_SOFT_OFFLINE)
>>> soft_offline_page(entry.pfn, entry.flags);
>>> - else
>>> + else if (!sync || (entry.flags & MF_ACTION_REQUIRED))
>>> memory_failure(entry.pfn, entry.flags);
>>> }
>>> }
>>>
>>> -/*
>>> - * Process memory_failure work queued on the specified CPU.
>>> - * Used to avoid return-to-userspace racing with the memory_failure workqueue.
>>> - */
>>> +static void memory_failure_work_func(struct work_struct *work)
>>> +{
>>> + __memory_failure_work_func(work, false);
>>> +}
>>> +
>>> void memory_failure_queue_kick(int cpu)
>>> {
>>> struct memory_failure_cpu *mf_cpu;
>>>
>>> mf_cpu = &per_cpu(memory_failure_cpu, cpu);
>>> - cancel_work_sync(&mf_cpu->work);
>>> - memory_failure_work_func(&mf_cpu->work);
>>> + __memory_failure_work_func(&mf_cpu->work, true);
>>> }
>>>
>>> static int __init memory_failure_init(void)
>>
>> .
>>

2022-12-08 07:29:35

by Lv Ying

[permalink] [raw]
Subject: Re: [RFC 1/2] ACPI: APEI: Make memory_failure() triggered by synchronization errors execute in the current context

On 2022/12/8 11:25, Shuai Xue wrote:
> + Xie XiuQi
>
> On 2022/12/8 AM10:20, Lv Ying wrote:
>>
>>>
>>> We also encountered this problem in the production environment, and tried to
>>> solve it by dividing synchronous and asynchronous error handling into different
>>> paths: task work for synchronous errors and a workqueue for asynchronous errors.
>>>
>>> The main challenge is how to distinguish synchronous errors in kernel-first
>>> mode through APEI; a related discussion is here. [1]
>>>
>> Hi Shuai:
>>
>> I'm very happy to have received your response; we encountered this problem in our production environment too. I spent a lot of time researching why synchronous and asynchronous errors ended up sharing the same processing path. As I mention in my patch commit message: when memory_failure() triggered by a synchronous error is
>> executed in the kworker context, the early_kill mode of memory_failure()
>> will send the wrong si_code with the SIGBUS signal, because current is the kworker rather than the faulting process.
>>
>> The challenge for my patch is to justify how synchronous errors are distinguished. I did not find a good way of distinguishing synchronous errors by looking through the ACPI/UEFI specs, so I sent this patchset for more input, and I resent RFC PATCH v1 [1], adding this as a TODO.
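For context on why the si_code matters so much: a VMM such as kvm/qemu keys its recovery decision off it. A minimal user-space illustration (not from any patch in this thread) of how a SIGBUS handler tells the two apart:

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Illustration only (fprintf is not async-signal-safe). */
static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	if (si->si_code == BUS_MCEERR_AR)	/* "action required": poison was consumed */
		fprintf(stderr, "sync poison consumed at %p\n", si->si_addr);
	else if (si->si_code == BUS_MCEERR_AO)	/* "action optional": advisory notification */
		fprintf(stderr, "async poison reported at %p\n", si->si_addr);
}

int main(void)
{
	struct sigaction sa;

	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, NULL);

	pause();	/* wait for a (hardware-injected) SIGBUS */
	return 0;
}

If the kernel runs memory_failure() from a kworker for a synchronous fault, a handler like this sees BUS_MCEERR_AO for a fault the process actually must act on immediately.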
>>
>>>> @@ -978,14 +979,14 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>>>           estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
>>>>           len = cper_estatus_len(estatus);
>>>>           node_len = GHES_ESTATUS_NODE_LEN(len);
>>>> -        task_work_pending = ghes_do_proc(estatus_node->ghes, estatus);
>>>> +        corruption_page_pending = ghes_do_proc(estatus_node->ghes, estatus, true);
>>>>           if (!ghes_estatus_cached(estatus)) {
>>>>               generic = estatus_node->generic;
>>>>               if (ghes_print_estatus(NULL, generic, estatus))
>>>>                   ghes_estatus_cache_add(generic, estatus);
>>>>           }
>>>
>>> In the case of your patch, it is inappropriate to assume that ghes_proc_in_irq() is only
>>> called to handle synchronous errors. Firmware could notify all synchronous and asynchronous
>>> error signals to the kernel through an NMI notification, e.g. SDEI. In that case, asynchronous
>>> errors would be treated as synchronous errors.
>>>
>>
>> Yes, as I mentioned above, I do not have a good way of distinguishing synchronous errors. I agree with you that ghes_proc_in_irq() is called for the SDEI, SEA and NMI notify types; they are NMI-like notifications, and this function runs some work in IRQ context that may not be NMI safe. And an NMI may report an asynchronous error.
>>
>> However, the current kernel uses ghes_kick_task_work() in ghes_proc_in_irq(), so there is an implicit assumption that ghes_proc_in_irq() runs in the context of a synchronous exception, although this is not appropriate.
>>
>>> Our colleague Yingwen has submitted a proposal to extend acpi_hest_generic_data::flags (bit 8)
>>> to indicate that the error is synchronous [2]. Personally speaking, it is a more general
>>> solution and completely solves the problem.
>>>
>>>
>>>> Background:
>>>>
>>>> In ARM world, two type events (Sync/Async) from hardware IP need OS/VMM take different actions.
>>>> Current CPER memory error record is not able to distinguish sync/async type event right now.
>>>> Current OS/VMM need to take extra actions beyond CPER which is heavy burden to identify the
>>>> two type events
>>>>
>>>> Sync event (e.g. CPU consume poisoned data) --> Firmware  -> CPER error log  --> OS/VMM take recovery action.
>>>> Async event (e.g. Memory controller detect UE event)  --> Firmware  --> CPER error log  --> OS take page action.
>>>>
>>>>
>>>> Proposal:
>>>>
>>>> - In section description Flags field(UEFI spec section N.2, add sync flag as below. OS/VMM
>>>>   could depend on this flag to distinguish sync/async events.
>>>> - Bit8 – sync flag; if set this flag indicates that this event record is synchronous(e.g.
>>>>   cpu core consumes poison data, then cause instruction/data abort); if not set, this event record is asynchronous.
>>>>
>>>> Best regards,
>>>> Yingwen Chen
>>>
>>> A RFC patch set based on above proposal is here[3].
>>>
>>
>> I'm glad your team is working on advancing this in the UEFI community. It will make it easy for the kernel to identify synchronous errors. Based on your proposal, I think your approach is suitable.
>>
>> However, there are real problems in production environments running the LTS kernel:
>> 1. this new proposal has not yet been accepted into the UEFI spec
>> 2. it will require a firmware update in the production environment, which may be harder to roll out than a kernel livepatch upgrade.
>>
>> If there are other ways to distinguish synchronous errors in the kernel APEI code,
>> I can try them. For example, in APEI mode, is it suitable to use the notify type to distinguish synchronous errors?
>> synchronous error:
>> SDEI SEA
>> asynchronous error:
>> NMI
>
> Sorry, I'm afraid it's not suitable. There is no doubt that the SEA notification is synchronous,
> but we should not assume that platforms use the NMI notification for asynchronous errors
> and the SDEI notification for synchronous errors.
>
> If we want to address the problem now, I prefer to just treat the SEA notification as synchronous
> and add MF_ACTION_REQUIRED when SEA is used. Please refer to XiuQi's work. [1]
>
> [1]https://lore.kernel.org/linux-arm-kernel/[email protected]/T/
>
> Best Regards,
> Shuai
>
Thanks for your advice. It may currently be more practical to solve the
problem of synchronous and asynchronous errors sharing the same processing
flow only for the SEA scenario. I will refer to XiuQi's work and update
RFC PATCH v2.


--
Thanks!
Lv Ying