2023-12-15 08:12:45

by Qiuxu Zhuo

Subject: [PATCH 1/1] mm: memory-failure: Re-split hw-poisoned huge page on -EAGAIN

While a hw-poisoned huge page is being split, threads within the
affected process can increase its reference count, causing the split
to fail with -EAGAIN.

This issue can be reproduced by injecting a memory error into a
multi-threaded process such that the error lands within a huge page.
The call path that returns -EAGAIN during testing is shown below:

memory_failure()
  try_to_split_thp_page()
    split_huge_page()
      split_huge_page_to_list() {
        ...
        Step A: can_split_folio()  - Checked that the thp can be split.
        Step B: unmap_folio()
        Step C: folio_ref_freeze() - Failed and returned -EAGAIN.
        ...
      }

The testing logs indicated that some huge pages were split successfully
via the call path above (Step C succeeded for them). However, other
huge pages failed to split because Step C failed, and for those pages
the reference count was observed to increase between Step A and Step C.
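
(For reference, Step C boils down to a compare-and-exchange on the folio's
reference count. Below is a simplified sketch of that logic - not the exact
upstream source - showing why any get_page() that lands between Step A and
Step C makes the expected count stale and the freeze fail:)

        /* Simplified sketch of the Step C check: freezing succeeds only
         * if the refcount still equals the expected value; a concurrent
         * get_page() makes the cmpxchg miss, and the split returns -EAGAIN.
         */
        static inline bool folio_ref_freeze_sketch(struct folio *folio, int count)
        {
                return atomic_cmpxchg(&folio->page._refcount, count, 0) == count;
        }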

Testing has shown that after receiving -EAGAIN, simply retrying the
split within memory_failure() always results in the same -EAGAIN. This
is because memory_failure() executes in the context of the affected
process: before that process leaves memory_failure() and is terminated,
its threads can keep increasing the reference count of the hw-poisoned
page.

To address this issue, employ a kernel worker to re-split the
hw-poisoned huge page. By the time the worker begins re-splitting, the
affected process has already been terminated, so its threads can no
longer increase the reference count. Experimental results have
consistently shown that this worker successfully re-splits these
hw-poisoned huge pages on its first attempt.

The kernel log (before):
[ 1116.862895] Memory failure: 0x4097fa7: recovery action for unsplit thp: Ignored

The kernel log (after):
[ 793.573536] Memory failure: 0x2100dda: recovery action for unsplit thp: Delayed
[ 793.574666] Memory failure: 0x2100dda: split unsplit thp successfully.

Signed-off-by: Qiuxu Zhuo <[email protected]>
---
mm/memory-failure.c | 73 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 71 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 660c21859118..0db4cf712a78 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -72,6 +72,60 @@ atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
 static bool hw_memory_failure __read_mostly = false;
 
+#define SPLIT_THP_MAX_RETRY_CNT		10
+#define SPLIT_THP_INIT_DELAYED_MS	1
+
+static bool split_thp_pending;
+
+struct split_thp_req {
+        struct delayed_work work;
+        struct page *thp;
+        int retries;
+};
+
+static void split_thp_work_fn(struct work_struct *work)
+{
+        struct split_thp_req *req = container_of(work, typeof(*req), work.work);
+        int ret;
+
+        /* Split the thp. */
+        get_page(req->thp);
+        lock_page(req->thp);
+        ret = split_huge_page(req->thp);
+        unlock_page(req->thp);
+        put_page(req->thp);
+
+        /* Retry with an exponential backoff. */
+        if (ret && ++req->retries < SPLIT_THP_MAX_RETRY_CNT) {
+                schedule_delayed_work(to_delayed_work(work),
+                                      msecs_to_jiffies(SPLIT_THP_INIT_DELAYED_MS << req->retries));
+                return;
+        }
+
+        pr_err("%#lx: split unsplit thp %ssuccessfully.\n", page_to_pfn(req->thp), ret ? "un" : "");
+        kfree(req);
+        split_thp_pending = false;
+}
+
+static bool split_thp_delayed(struct page *thp)
+{
+        struct split_thp_req *req;
+
+        if (split_thp_pending)
+                return false;
+
+        req = kmalloc(sizeof(*req), GFP_ATOMIC);
+        if (!req)
+                return false;
+
+        req->thp = thp;
+        req->retries = 0;
+        INIT_DELAYED_WORK(&req->work, split_thp_work_fn);
+        split_thp_pending = true;
+        schedule_delayed_work(&req->work, msecs_to_jiffies(SPLIT_THP_INIT_DELAYED_MS));
+        return true;
+}
+
 static DEFINE_MUTEX(mf_mutex);
 
 void num_poisoned_pages_inc(unsigned long pfn)
@@ -2275,8 +2329,23 @@ int memory_failure(unsigned long pfn, int flags)
          * page is a valid handlable page.
          */
         SetPageHasHWPoisoned(hpage);
-        if (try_to_split_thp_page(p) < 0) {
-                res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
+        res = try_to_split_thp_page(p);
+        if (res < 0) {
+                /*
+                 * Re-attempting try_to_split_thp_page() here could consistently
+                 * yield -EAGAIN, as the threads of the process may increment the
+                 * reference count of the huge page before the process exits
+                 * memory_failure() and terminates.
+                 *
+                 * Employ a kernel worker to re-split the huge page. By the time
+                 * the worker initiates the re-splitting, the affected process
+                 * has already been terminated, preventing its threads from
+                 * incrementing the reference count.
+                 */
+                if (res == -EAGAIN && split_thp_delayed(p))
+                        res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_DELAYED);
+                else
+                        res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
                 goto unlock_mutex;
         }
         VM_BUG_ON_PAGE(!page_count(p), p);
--
2.17.1



2023-12-19 02:18:09

by Naoya Horiguchi

Subject: Re: [PATCH 1/1] mm: memory-failure: Re-split hw-poisoned huge page on -EAGAIN

On Fri, Dec 15, 2023 at 04:12:04PM +0800, Qiuxu Zhuo wrote:
> While a hw-poisoned huge page is being split, threads within the
> affected process can increase its reference count, causing the split
> to fail with -EAGAIN.
>
> This issue can be reproduced by injecting a memory error into a
> multi-threaded process such that the error lands within a huge page.
> The call path that returns -EAGAIN during testing is shown below:
>
> memory_failure()
>   try_to_split_thp_page()
>     split_huge_page()
>       split_huge_page_to_list() {
>         ...
>         Step A: can_split_folio()  - Checked that the thp can be split.
>         Step B: unmap_folio()
>         Step C: folio_ref_freeze() - Failed and returned -EAGAIN.
>         ...
>       }
>
> The testing logs indicated that some huge pages were split successfully
> via the call path above (Step C succeeded for them). However, other
> huge pages failed to split because Step C failed, and for those pages
> the reference count was observed to increase between Step A and Step C.
>
> Testing has shown that after receiving -EAGAIN, simply retrying the
> split within memory_failure() always results in the same -EAGAIN. This
> is because memory_failure() executes in the context of the affected
> process: before that process leaves memory_failure() and is terminated,
> its threads can keep increasing the reference count of the hw-poisoned
> page.
>
> To address this issue, employ a kernel worker to re-split the
> hw-poisoned huge page. By the time the worker begins re-splitting, the
> affected process has already been terminated, so its threads can no
> longer increase the reference count. Experimental results have
> consistently shown that this worker successfully re-splits these
> hw-poisoned huge pages on its first attempt.
>
> The kernel log (before):
> [ 1116.862895] Memory failure: 0x4097fa7: recovery action for unsplit thp: Ignored
>
> The kernel log (after):
> [ 793.573536] Memory failure: 0x2100dda: recovery action for unsplit thp: Delayed
> [ 793.574666] Memory failure: 0x2100dda: split unsplit thp successfully.

I'm unclear about the user-visible benefit of ensuring that the error thp is split.
Could you explain it?

I think the raw error page is not unmapped (with a hwpoisoned entry) after the
delayed re-splitting, so the recovery action seems incomplete even with this patch.
This patch seems to just convert a hwpoisoned, unrecovered thp into a hwpoisoned,
unrecovered raw page.

Thanks,
Naoya Horiguchi

2023-12-19 11:51:25

by Miaohe Lin

Subject: Re: [PATCH 1/1] mm: memory-failure: Re-split hw-poisoned huge page on -EAGAIN

On 2023/12/15 16:12, Qiuxu Zhuo wrote:
> While a hw-poisoned huge page is being split, threads within the
> affected process can increase its reference count, causing the split
> to fail with -EAGAIN.
>
> This issue can be reproduced by injecting a memory error into a
> multi-threaded process such that the error lands within a huge page.
> The call path that returns -EAGAIN during testing is shown below:
>
> memory_failure()
>   try_to_split_thp_page()
>     split_huge_page()
>       split_huge_page_to_list() {
>         ...
>         Step A: can_split_folio()  - Checked that the thp can be split.
>         Step B: unmap_folio()
>         Step C: folio_ref_freeze() - Failed and returned -EAGAIN.
>         ...
>       }
>
> The testing logs indicated that some huge pages were split successfully
> via the call path above (Step C succeeded for them). However, other
> huge pages failed to split because Step C failed, and for those pages
> the reference count was observed to increase between Step A and Step C.
>
> Testing has shown that after receiving -EAGAIN, simply retrying the
> split within memory_failure() always results in the same -EAGAIN. This
> is because memory_failure() executes in the context of the affected
> process: before that process leaves memory_failure() and is terminated,
> its threads can keep increasing the reference count of the hw-poisoned
> page.
>
> To address this issue, employ a kernel worker to re-split the
> hw-poisoned huge page. By the time the worker begins re-splitting, the
> affected process has already been terminated, so its threads can no
> longer increase the reference count. Experimental results have
> consistently shown that this worker successfully re-splits these
> hw-poisoned huge pages on its first attempt.
>
> The kernel log (before):
> [ 1116.862895] Memory failure: 0x4097fa7: recovery action for unsplit thp: Ignored
>
> The kernel log (after):
> [ 793.573536] Memory failure: 0x2100dda: recovery action for unsplit thp: Delayed
> [ 793.574666] Memory failure: 0x2100dda: split unsplit thp successfully.
>
> Signed-off-by: Qiuxu Zhuo <[email protected]>

Thanks for your patch. In addition to Naoya's comment, I have some questions about the code itself.

> ---
> mm/memory-failure.c | 73 +++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 71 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 660c21859118..0db4cf712a78 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -72,6 +72,60 @@ atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
>
> static bool hw_memory_failure __read_mostly = false;
>
> +#define SPLIT_THP_MAX_RETRY_CNT		10
> +#define SPLIT_THP_INIT_DELAYED_MS	1
> +
> +static bool split_thp_pending;
> +
> +struct split_thp_req {
> +        struct delayed_work work;
> +        struct page *thp;
> +        int retries;
> +};
> +
> +static void split_thp_work_fn(struct work_struct *work)
> +{
> +        struct split_thp_req *req = container_of(work, typeof(*req), work.work);
> +        int ret;
> +
> +        /* Split the thp. */
> +        get_page(req->thp);

Can req->thp be freed by the time split_thp_work_fn() is scheduled?

> +        lock_page(req->thp);
> +        ret = split_huge_page(req->thp);
> +        unlock_page(req->thp);
> +        put_page(req->thp);
> +
> +        /* Retry with an exponential backoff. */
> +        if (ret && ++req->retries < SPLIT_THP_MAX_RETRY_CNT) {
> +                schedule_delayed_work(to_delayed_work(work),
> +                                      msecs_to_jiffies(SPLIT_THP_INIT_DELAYED_MS << req->retries));
> +                return;
> +        }
> +
> +        pr_err("%#lx: split unsplit thp %ssuccessfully.\n", page_to_pfn(req->thp), ret ? "un" : "");
> +        kfree(req);
> +        split_thp_pending = false;

The write to split_thp_pending here is not protected against split_thp_delayed()? Though this race should be benign.

Thanks.

2023-12-20 08:44:47

by Qiuxu Zhuo

Subject: RE: [PATCH 1/1] mm: memory-failure: Re-split hw-poisoned huge page on -EAGAIN

Hi Naoya Horiguchi,

Thanks for the review.
See the comments below.

> From: Naoya Horiguchi <[email protected]>
> Sent: Tuesday, December 19, 2023 10:17 AM
> ...
> > The kernel log (before):
> > [ 1116.862895] Memory failure: 0x4097fa7: recovery action for unsplit thp: Ignored
> >
> > The kernel log (after):
> > [ 793.573536] Memory failure: 0x2100dda: recovery action for unsplit thp: Delayed
> > [ 793.574666] Memory failure: 0x2100dda: split unsplit thp successfully.
>
> I'm unclear about the user-visible benefit of ensuring that the error thp is
> split.
> So could you explain about it?

During our testing, we observed that the hw-poisoned huge page had been
mapped for the victim application's text and was present in the file cache.
Unfortunately, when attempting to restart the application without splitting
the thp, the restart failed. This was because its text was remapped to the
hw-poisoned huge page from the file cache, leading to its swift termination
due to another MCE.

So, after re-splitting the unsplit thp successfully (which drops the text
mapping), the application restarts successfully. I'll also add this
description to the commit message in v2.

> I think that the raw error page is not unmapped (with hwpoisoned entry)
> after delayed re-splitting, so recovery action seems not complete even with
> this patch.
> So this patch seems to just convert a hwpoisoned unrecovered thp into a
> hwpoisoned unrecovered raw page.

You're correct. Thanks for catching this.
Instead of creating a new work item just to split the thp, I'll leverage the
existing memory_failure_queue() to re-split the thp in v2, which should make
the recovery action more complete.

-Qiuxu


2023-12-20 08:59:47

by Qiuxu Zhuo

Subject: RE: [PATCH 1/1] mm: memory-failure: Re-split hw-poisoned huge page on -EAGAIN

Hi Miaohe,

Thanks for the review.
Please see the comments below.

> From: Miaohe Lin <[email protected]>
> ...
> > +
> > +static void split_thp_work_fn(struct work_struct *work)
> > +{
> > +        struct split_thp_req *req = container_of(work, typeof(*req), work.work);
> > +        int ret;
> > +
> > +        /* Split the thp. */
> > +        get_page(req->thp);
>
> Can req->thp be freed by the time split_thp_work_fn() is scheduled?

It's possible. Thanks for catching this.

Instead of creating a new work item to re-split the thp,
I'll leverage the existing memory_failure_queue() to re-split the thp in v2.

>
> > + lock_page(req->thp);
> > + ret = split_huge_page(req->thp);
> > + unlock_page(req->thp);
> > + put_page(req->thp);
> > +
> > + /* Retry with an exponential backoff. */
> > + if (ret && ++req->retries < SPLIT_THP_MAX_RETRY_CNT) {
> > + schedule_delayed_work(to_delayed_work(work),
> > +
> msecs_to_jiffies(SPLIT_THP_INIT_DELAYED_MS << req->retries));
> > + return;
> > + }
> > +
> > + pr_err("%#lx: split unsplit thp %ssuccessfully.\n", page_to_pfn(req-
> >thp), ret ? "un" : "");
> > + kfree(req);
> > + split_thp_pending = false;
>
> The write to split_thp_pending here is not protected against
> split_thp_delayed()? Though this race should be benign.

Thanks for raising this.

The read-check-modify of "split_thp_pending" is protected by the mutex
mf_mutex, and the worker only sets it to false (it never reads it).
In theory, there is no race here.

I'll leverage the existing memory_failure_queue() in v2, so there should be
no such race to worry about.
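
(For illustration, a minimal sketch of the ordering being argued - this is
not upstream code, and it assumes split_thp_delayed() is only ever reached
with mf_mutex held, as it is when called from memory_failure():)

        /*
         * Sketch: the read-check-set of split_thp_pending happens only
         * under mf_mutex; the worker merely clears the flag, so the
         * worst case is a spuriously declined request, not corruption.
         */
        mutex_lock(&mf_mutex);
        if (!split_thp_pending) {               /* read + check */
                split_thp_pending = true;       /* set */
                schedule_delayed_work(&req->work,
                                      msecs_to_jiffies(SPLIT_THP_INIT_DELAYED_MS));
        }
        mutex_unlock(&mf_mutex);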

-Qiuxu


2023-12-22 06:28:44

by Qiuxu Zhuo

Subject: [PATCH v2 1/2] mm: memory-failure: Make memory_failure_queue_delayed() helper

Introduce the memory_failure_queue_delayed() helper for deferred handling
of memory failure tasks. This prepares for later re-splitting of a
hardware-poisoned huge page.

Signed-off-by: Qiuxu Zhuo <[email protected]>
---
Prepares for patch 2.
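
(For illustration, a deferred caller - such as the one patch 2 adds - could
retry handling a pfn after a delay roughly like this; a hypothetical call
site, not part of this patch:)

        /* Hypothetical: queue a deferred retry for this pfn after ~4 ms. */
        memory_failure_queue_delayed(pfn, flags, msecs_to_jiffies(4));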

mm/memory-failure.c | 51 +++++++++++++++++++++++++--------------------
1 file changed, 28 insertions(+), 23 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 660c21859118..8f2c11863bae 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2385,28 +2385,12 @@ struct memory_failure_cpu {
         DECLARE_KFIFO(fifo, struct memory_failure_entry,
                       MEMORY_FAILURE_FIFO_SIZE);
         spinlock_t lock;
-        struct work_struct work;
+        struct delayed_work work;
 };
 
 static DEFINE_PER_CPU(struct memory_failure_cpu, memory_failure_cpu);
 
-/**
- * memory_failure_queue - Schedule handling memory failure of a page.
- * @pfn: Page Number of the corrupted page
- * @flags: Flags for memory failure handling
- *
- * This function is called by the low level hardware error handler
- * when it detects hardware memory corruption of a page. It schedules
- * the recovering of error page, including dropping pages, killing
- * processes etc.
- *
- * The function is primarily of use for corruptions that
- * happen outside the current execution context (e.g. when
- * detected by a background scrubber)
- *
- * Can run in IRQ context.
- */
-void memory_failure_queue(unsigned long pfn, int flags)
+static void memory_failure_queue_delayed(unsigned long pfn, int flags, unsigned long delay)
 {
         struct memory_failure_cpu *mf_cpu;
         unsigned long proc_flags;
@@ -2418,13 +2402,34 @@ void memory_failure_queue(unsigned long pfn, int flags)
         mf_cpu = &get_cpu_var(memory_failure_cpu);
         spin_lock_irqsave(&mf_cpu->lock, proc_flags);
         if (kfifo_put(&mf_cpu->fifo, entry))
-                schedule_work_on(smp_processor_id(), &mf_cpu->work);
+                schedule_delayed_work_on(smp_processor_id(), &mf_cpu->work, delay);
         else
                 pr_err("buffer overflow when queuing memory failure at %#lx\n",
                        pfn);
         spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
         put_cpu_var(memory_failure_cpu);
 }
+
+/**
+ * memory_failure_queue - Schedule handling memory failure of a page.
+ * @pfn: Page Number of the corrupted page
+ * @flags: Flags for memory failure handling
+ *
+ * This function is called by the low level hardware error handler
+ * when it detects hardware memory corruption of a page. It schedules
+ * the recovering of error page, including dropping pages, killing
+ * processes etc.
+ *
+ * The function is primarily of use for corruptions that
+ * happen outside the current execution context (e.g. when
+ * detected by a background scrubber)
+ *
+ * Can run in IRQ context.
+ */
+void memory_failure_queue(unsigned long pfn, int flags)
+{
+        memory_failure_queue_delayed(pfn, flags, 0);
+}
 EXPORT_SYMBOL_GPL(memory_failure_queue);
 
 static void memory_failure_work_func(struct work_struct *work)
@@ -2434,7 +2439,7 @@ static void memory_failure_work_func(struct work_struct *work)
         unsigned long proc_flags;
         int gotten;
 
-        mf_cpu = container_of(work, struct memory_failure_cpu, work);
+        mf_cpu = container_of(work, struct memory_failure_cpu, work.work);
         for (;;) {
                 spin_lock_irqsave(&mf_cpu->lock, proc_flags);
                 gotten = kfifo_get(&mf_cpu->fifo, &entry);
@@ -2457,8 +2462,8 @@ void memory_failure_queue_kick(int cpu)
         struct memory_failure_cpu *mf_cpu;
 
         mf_cpu = &per_cpu(memory_failure_cpu, cpu);
-        cancel_work_sync(&mf_cpu->work);
-        memory_failure_work_func(&mf_cpu->work);
+        cancel_delayed_work_sync(&mf_cpu->work);
+        memory_failure_work_func(&mf_cpu->work.work);
 }
 
 static int __init memory_failure_init(void)
@@ -2470,7 +2475,7 @@ static int __init memory_failure_init(void)
                 mf_cpu = &per_cpu(memory_failure_cpu, cpu);
                 spin_lock_init(&mf_cpu->lock);
                 INIT_KFIFO(mf_cpu->fifo);
-                INIT_WORK(&mf_cpu->work, memory_failure_work_func);
+                INIT_DELAYED_WORK(&mf_cpu->work, memory_failure_work_func);
         }
 
         register_sysctl_init("vm", memory_failure_table);
--
2.17.1


2023-12-22 06:29:42

by Qiuxu Zhuo

Subject: [PATCH v2 2/2] mm: memory-failure: Re-split hw-poisoned huge page on -EAGAIN

While a hw-poisoned huge page is being split, threads within the
affected process can increase its reference count, causing the split
to fail with -EAGAIN.

This issue can be reproduced by injecting a memory error into a
multi-threaded process such that the error lands within a huge page.
The call path that returns -EAGAIN during testing is shown below:

memory_failure()
  try_to_split_thp_page()
    split_huge_page()
      split_huge_page_to_list() {
        ...
        Step A: can_split_folio()  - Checked that the thp can be split.
        Step B: unmap_folio()
        Step C: folio_ref_freeze() - Failed and returned -EAGAIN.
        ...
      }

The testing logs indicated that some huge pages were split successfully
via the call path above (Step C succeeded for them). However, other
huge pages failed to split because Step C failed, and for those pages
the reference count was observed to increase between Step A and Step C.

Testing has shown that after receiving -EAGAIN, simply retrying the
split within memory_failure() always results in the same -EAGAIN. This
is because memory_failure() executes in the context of the affected
process: before that process leaves memory_failure() and is terminated,
its threads can keep increasing the reference count of the hw-poisoned
page.

Furthermore, if the hw-poisoned huge page had been mapped for the victim
application's text, was present in the file cache, and failed to be
split, then restarting the application also failed: its text was
remapped to the hw-poisoned huge page from the file cache, leading to
its swift termination due to another MCE.

To address this issue, leverage memory_failure_queue_delayed() to have a
kernel worker re-split the hw-poisoned huge page. By the time the worker
begins re-splitting, the affected process has already been terminated, so
its threads can no longer increase the reference count. Experimental
results have consistently shown that the worker successfully re-splits
these hw-poisoned huge pages on its first attempt, after which the victim
application restarted successfully.

The kernel log (before):
[ 1116.862895] Memory failure: 0x4097fa7: recovery action for unsplit thp: Ignored

After that, the victim application may fail to restart.

The kernel log (after):
[ 405.195308] Memory failure: 0x410756b: recovery action for unsplit thp: Delayed
[ 405.200267] Memory failure: 0x410756b: recovery action for clean LRU page: Recovered

After that, the victim application restarted successfully.

Signed-off-by: Qiuxu Zhuo <[email protected]>
---
v1->v2:

1. Include user-visible benefits in both the commit message and code comments.
[ Address Naoya Horiguchi's comment on the unclear user-visible benefits in v1. ]

2. Leverage memory_failure_queue_delayed() to re-split the hw-poisoned huge page,
which makes the recovery action more complete.
[ Address Naoya Horiguchi's comment regarding the incomplete recovery in v1.
Address Miaohe's concern about the page possibly being freed by the time the worker is scheduled in v1. ]
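
(For clarity, the retry count rides in bits 31:28 of the mf_flags value.
A short sketch of how the encoding round-trips, using the macros added by
the diff below:)

        int flags = MF_ACTION_REQUIRED;

        flags = MF_RETRY_CNT_SET(flags, 3);     /* encode retry #3 */
        /* MF_RETRY_CNT_GET(flags) == 3 */
        /* next backoff: INIT_DELAYED_MS << 3 == 8 ms */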

include/linux/mm.h | 1 +
mm/memory-failure.c | 54 ++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 418d26608ece..87bd67d0a046 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3904,6 +3904,7 @@ enum mf_flags {
         MF_UNPOISON = 1 << 4,
         MF_SW_SIMULATED = 1 << 5,
         MF_NO_RETRY = 1 << 6,
+        /* Bits {31:28} are reserved for encoding the THP split retry count. */
 };
 int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
                       unsigned long count, int mf_flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 8f2c11863bae..2334d0417258 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2147,6 +2147,28 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
         return rc;
 }
 
+static void memory_failure_queue_delayed(unsigned long pfn, int flags, unsigned long delay);
+
+#define MF_RETRY_CNT_MASK	GENMASK(31, 28)
+#define MF_RETRY_CNT_SHIFT	28
+#define MF_RETRY_CNT_GET(f)	(((f) & MF_RETRY_CNT_MASK) >> MF_RETRY_CNT_SHIFT)
+#define MF_RETRY_CNT_SET(f, c)	(((f) & ~MF_RETRY_CNT_MASK) | ((c) << MF_RETRY_CNT_SHIFT))
+
+#define MAX_RETRY_CNT		10
+#define INIT_DELAYED_MS		1
+
+static bool split_thp_delayed(unsigned long pfn, int flags)
+{
+        int cnt = MF_RETRY_CNT_GET(flags);
+
+        if (cnt >= MAX_RETRY_CNT)
+                return false;
+
+        memory_failure_queue_delayed(pfn, MF_RETRY_CNT_SET(flags, cnt + 1),
+                                     msecs_to_jiffies(INIT_DELAYED_MS << cnt));
+        return true;
+}
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -2211,7 +2233,7 @@ int memory_failure(unsigned long pfn, int flags)
         if (hugetlb)
                 goto unlock_mutex;
 
-        if (TestSetPageHWPoison(p)) {
+        if (TestSetPageHWPoison(p) && !MF_RETRY_CNT_GET(flags)) {
                 pr_err("%#lx: already hardware poisoned\n", pfn);
                 res = -EHWPOISON;
                 if (flags & MF_ACTION_REQUIRED)
@@ -2275,8 +2297,34 @@ int memory_failure(unsigned long pfn, int flags)
          * page is a valid handlable page.
          */
         SetPageHasHWPoisoned(hpage);
-        if (try_to_split_thp_page(p) < 0) {
-                res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
+        res = try_to_split_thp_page(p);
+        if (res < 0) {
+                /*
+                 * Re-attempting try_to_split_thp_page() here could consistently
+                 * yield -EAGAIN, as the threads of the process may increment the
+                 * reference count of the huge page before the process exits
+                 * memory_failure() and terminates.
+                 *
+                 * Furthermore, if the hw-poisoned huge page had been mapped for
+                 * the victim application's text, was present in the file cache,
+                 * and failed to be split, then restarting the application also
+                 * failed: its text was remapped to the hw-poisoned huge page
+                 * from the file cache, leading to its swift termination due to
+                 * another MCE.
+                 *
+                 * Leverage memory_failure_queue_delayed() to have a kernel
+                 * worker re-split the hw-poisoned huge page. By the time the
+                 * worker initiates the re-splitting and recovery process, the
+                 * affected process has already been terminated, preventing its
+                 * threads from incrementing the reference count. After the
+                 * hw-poisoned huge page is re-split successfully (dropping
+                 * the text mapping), the application can be restarted
+                 * successfully.
+                 */
+                if (res == -EAGAIN && split_thp_delayed(pfn, flags))
+                        res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_DELAYED);
+                else
+                        res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
                 goto unlock_mutex;
         }
         VM_BUG_ON_PAGE(!page_count(p), p);
--
2.17.1


2023-12-22 19:42:41

by Andrew Morton

Subject: Re: [PATCH v2 2/2] mm: memory-failure: Re-split hw-poisoned huge page on -EAGAIN

On Fri, 22 Dec 2023 14:27:06 +0800 Qiuxu Zhuo <[email protected]> wrote:

> While a hw-poisoned huge page is being split, threads within the
> affected process can increase its reference count, causing the split
> to fail with -EAGAIN.
>
> This issue can be reproduced by injecting a memory error into a
> multi-threaded process such that the error lands within a huge page.
> The call path that returns -EAGAIN during testing is shown below:
>
> memory_failure()
>   try_to_split_thp_page()
>     split_huge_page()
>       split_huge_page_to_list() {
>         ...
>         Step A: can_split_folio()  - Checked that the thp can be split.
>         Step B: unmap_folio()
>         Step C: folio_ref_freeze() - Failed and returned -EAGAIN.
>         ...
>       }
>
> The testing logs indicated that some huge pages were split successfully
> via the call path above (Step C succeeded for them). However, other
> huge pages failed to split because Step C failed, and for those pages
> the reference count was observed to increase between Step A and Step C.
>
> Testing has shown that after receiving -EAGAIN, simply retrying the
> split within memory_failure() always results in the same -EAGAIN. This
> is because memory_failure() executes in the context of the affected
> process: before that process leaves memory_failure() and is terminated,
> its threads can keep increasing the reference count of the hw-poisoned
> page.
>
> Furthermore, if the hw-poisoned huge page had been mapped for the victim
> application's text, was present in the file cache, and failed to be
> split, then restarting the application also failed: its text was
> remapped to the hw-poisoned huge page from the file cache, leading to
> its swift termination due to another MCE.

So we're hoping that when the worker runs to split the page, the
process and its threads have exited. What guarantees this timing?

And we're hoping that the worker has split the page before userspace
attempts to restart the process. What guarantees this timing?

All this reliance upon fortunate timing sounds rather unreliable,
doesn't it?

2024-01-02 02:41:27

by Qiuxu Zhuo

Subject: RE: [PATCH v2 2/2] mm: memory-failure: Re-split hw-poisoned huge page on -EAGAIN

> From: Andrew Morton <[email protected]>

Hi Andrew,

Happy New Year.
Thanks for reviewing the patch.
Please see the comments inline.

> ...
>
> So we're hoping that when the worker runs to split the page, the process and
> its threads have exited. What guarantees this timing?

Case 1: If the threads of the victim process do not access the new mapping
to the hw-poisoned huge page (no refcount increase), the hw-poisoned huge
page should be split successfully in the process context. There is no need
for the worker to split this hw-poisoned page.

Case 2: If the threads of the victim process do access the new mapping to
the hw-poisoned huge page (refcount increase), causing the split of the
hw-poisoned huge page to fail, a new MCE will be triggered immediately.
Consequently, the process will be promptly terminated upon re-entering the
code below:

MCE occurs:
  memory_failure()
  {
          ...
          if (TestSetPageHWPoison(p)) {
                  ...
                  kill_accessing_process(current, pfn, flags);
                  ...
          }
          ...
  }

The worker splits the hw-poisoned huge page in the background with retry
delays of 1 ms, 2 ms, 4 ms, ..., 512 ms. Before the maximum 512 ms delay
is reached, the process and its threads should already have exited, so the
retry delays guarantee the timing.
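
(For reference, the cumulative backoff across the 10 attempts is
1 + 2 + 4 + ... + 512 = 2^10 - 1 = 1023 ms, so the worker keeps retrying
for roughly one second in total before giving up.)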

> And we're hoping that the worker has split the page before userspace
> attempts to restart the process. What guarantees this timing?

Our experiments showed that an immediate restart of the victim process was
consistently successful. This success could be attributed to the interval
between the process being killed and its subsequent restart being long
enough for the worker to split the hw-poisoned page. In theory, however,
this timing isn't guaranteed.

> All this reliance upon fortunate timing sounds rather unreliable, doesn't it?

The timing of the victim process's exit can be guaranteed.
The timing of the process's restart cannot be guaranteed in theory.

The patch is not perfect, but it still gives the victim process the
opportunity to be restarted successfully.

Thanks!
-Qiuxu

2024-01-03 02:47:50

by Miaohe Lin

Subject: Re: [PATCH v2 2/2] mm: memory-failure: Re-split hw-poisoned huge page on -EAGAIN

On 2024/1/2 10:41, Zhuo, Qiuxu wrote:
>> From: Andrew Morton <[email protected]>
>
> Hi Andrew,
>
> Happy New Year.
> Thanks for reviewing the patch.
> Please see the comments inline.
>
>> ...
>>
>> So we're hoping that when the worker runs to split the page, the process and
>> its threads have exited. What guarantees this timing?
>
> Case 1: If the threads of the victim process do not access the new mapping
> to the hw-poisoned huge page (no refcount increase), the hw-poisoned huge
> page should be split successfully in the process context. There is no need
> for the worker to split this hw-poisoned page.
>
> Case 2: If the threads of the victim process do access the new mapping to
> the hw-poisoned huge page (refcount increase), causing the split of the
> hw-poisoned huge page to fail, a new MCE will be triggered immediately.
> Consequently, the process will be promptly terminated upon re-entering the
> code below:
>
> MCE occurs:
>   memory_failure()
>   {
>           ...
>           if (TestSetPageHWPoison(p)) {
>                   ...
>                   kill_accessing_process(current, pfn, flags);
>                   ...
>           }
>           ...
>   }
>
> The worker splits the hw-poisoned huge page in the background with retry
> delays of 1 ms, 2 ms, 4 ms, ..., 512 ms. Before the maximum 512 ms delay
> is reached, the process and its threads should already have exited, so the
> retry delays guarantee the timing.
>
>> And we're hoping that the worker has split the page before userspace
>> attempts to restart the process. What guarantees this timing?
>
> Our experiments showed that an immediate restart of the victim process was
> consistently successful. This success could be attributed to the interval
> between the process being killed and its subsequent restart being long
> enough for the worker to split the hw-poisoned page. In theory, however,
> this timing isn't guaranteed.
>
>> All this reliance upon fortunate timing sounds rather unreliable, doesn't it?
>
> The timing of the victim process's exit can be guaranteed.
> The timing of the process's restart cannot be guaranteed in theory.
>
> The patch is not perfect, but it still gives the victim process the
> opportunity to be restarted successfully.

Would it be better if the affected process tried re-splitting the hw-poisoned
huge page itself before returning to userspace? In that case each affected
process (including any later restarted process) would try re-splitting the
huge page, and the last one, with no competitor left, would get the work done.
The delayed work would then not be needed. Would this be more reliable?
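
(A rough sketch of that alternative - hypothetical code, not a tested
proposal; it assumes a hook on each affected task's return-to-userspace
path inside mm/memory-failure.c, where try_to_split_thp_page() lives:)

        /* Hypothetical: each affected task retries the split on its way
         * back to userspace; the last task, with no competitors left,
         * should succeed in freezing the refcount and splitting.
         */
        static void mf_try_resplit_before_return(struct page *p)
        {
                struct page *head = compound_head(p);

                if (PageTransHuge(head) && PageHWPoison(head))
                        try_to_split_thp_page(p);
        }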

Thanks.

>
> Thanks!
> -Qiuxu
>