2022-05-20 11:39:32

by zhenwei pi

Subject: [PATCH 0/3] recover hardware corrupted page by virtio balloon

Hi,

I'm trying to recover hardware-corrupted pages by virtio balloon; the
workflow of this feature looks like this:

Guest        5.MF -> 6.RVQ FE            10.Unpoison page
              /          \              /
--------------+----------+--------------+-----------
              |          |              |
            4.MCE    7.RVQ BE      9.RVQ Event
QEMU        /              \        /
        3.SIGBUS           8.Remap
           /
-----------+------------------------------------
           |
        +--2.MF
Host   /
    1.HW error

1, Hardware page error occurs randomly.
2, host side handles the corrupted page by the Memory Failure mechanism, and
sends SIGBUS to the user process if early-kill is enabled.
3, QEMU handles SIGBUS; if the address belongs to guest RAM, then:
4, QEMU tries to inject an MCE into the guest.
5, guest handles the memory failure again.

Steps 1-5 have been supported for a long time; the next steps are added by
this patch set (together with the related driver patch):

6, guest balloon driver is notified of the corrupted PFN, and sends a
request to the host side via the Recover VQ frontend.
7, QEMU handles the request from the Recover VQ backend, then:
8, QEMU remaps the corrupted HVA to fix the memory failure, then:
9, QEMU acks the result to the guest side via the Recover VQ.
10, guest unpoisons the page if the corrupted page gets recovered
successfully.
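The request/response exchanged over the recover VQ in steps 6-10 can be
sketched as below (layout taken from the driver patch in this series; the
protocol is still in flux, so treat the names and codes as provisional):

```c
#include <assert.h>
#include <stdint.h>

/* Provisional command and status codes from the driver patch. */
#define VIRTIO_BALLOON_R_CMD_RECOVER      0     /* guest -> host request */
#define VIRTIO_BALLOON_R_CMD_RESPONSE     0x80  /* host -> guest reply */

#define VIRTIO_BALLOON_R_STATUS_CORRUPTED 0
#define VIRTIO_BALLOON_R_STATUS_RECOVERED 1
#define VIRTIO_BALLOON_R_STATUS_FAILED    2

/* 8-byte message header; the guest places the balloon PFNs of the
 * corrupted page right after it in the VQ buffer. */
struct virtio_balloon_recover {
        uint8_t cmd;
        uint8_t status;
        uint8_t padding[6];
};
```

The guest sends CMD_RECOVER with the PFNs of the poisoned page; the host
answers CMD_RESPONSE with status RECOVERED or FAILED, which drives step 10.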

Test:
This patch set can be tested with QEMU (also under development):
https://github.com/pizhenwei/qemu/tree/balloon-recover

Emulate an MCE with QEMU (normal guest RAM pages only; hugepage is not supported):
virsh qemu-monitor-command vm --hmp mce 0 9 0xbd000000000000c0 0xd 0x61646678 0x8c

The guest works fine (on an Intel Platinum 8260):
mce: [Hardware Error]: Machine check events logged
Memory failure: 0x61646: recovery action for dirty LRU page: Recovered
virtio_balloon virtio5: recovered pfn 0x61646
Unpoison: Unpoisoned page 0x61646 by virtio-balloon
MCE: Killing stress:24502 due to hardware memory corruption fault at 7f5be2e5a010

And the 'HardwareCorrupted' in /proc/meminfo also shows 0 kB.

The protocol of the virtio balloon recover VQ is still undefined and under
development:
- 'struct virtio_balloon_recover' defines the structure which is used to
exchange messages between guest and host.
- '__le32 corrupted_pages' in struct virtio_balloon_config is used in the next
step:
1, a VM uses RAM backed by 2M huge pages; once an MCE occurs, the whole 2M
becomes inaccessible. By reporting 512 * 4K 'corrupted_pages' to the guest,
the guest gets a chance to isolate the 512 pages ahead of time.

2, after migrating to another host, the corrupted pages are actually recovered;
once the guest reads 'corrupted_pages' as 0, it can unpoison all the
poisoned pages which are recorded in the balloon driver.
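A quick sanity check on the 512 figure above (a hypothetical helper, just
making the arithmetic explicit):

```c
#include <assert.h>

#define HPAGE_2M_SIZE (2UL * 1024 * 1024)
#define PAGE_4K_SIZE  (4UL * 1024)

/* Number of 4K 'corrupted_pages' entries that one failed 2M huge page
 * turns into when reported to the guest via the balloon config space. */
static unsigned long corrupted_pages_per_2m_hugepage(void)
{
        return HPAGE_2M_SIZE / PAGE_4K_SIZE;
}
```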

zhenwei pi (3):
memory-failure: Introduce memory failure notifier
mm/memory-failure.c: support reset PTE during unpoison
virtio_balloon: Introduce memory recover

drivers/virtio/virtio_balloon.c | 243 ++++++++++++++++++++++++++++
include/linux/mm.h | 4 +-
include/uapi/linux/virtio_balloon.h | 16 ++
mm/hwpoison-inject.c | 2 +-
mm/memory-failure.c | 59 ++++++-
5 files changed, 315 insertions(+), 9 deletions(-)

--
2.20.1



2022-05-22 05:31:12

by zhenwei pi

Subject: [PATCH 3/3] virtio_balloon: Introduce memory recover

Introduce a new queue, the 'recover VQ', which allows the guest to recover
hardware-corrupted pages:

Guest        5.MF -> 6.RVQ FE            10.Unpoison page
              /          \              /
--------------+----------+--------------+-----------
              |          |              |
            4.MCE    7.RVQ BE      9.RVQ Event
QEMU        /              \        /
        3.SIGBUS           8.Remap
           /
-----------+------------------------------------
           |
        +--2.MF
Host   /
    1.HW error

The workflow:
1, Hardware page error occurs randomly.
2, host side handles the corrupted page by the Memory Failure mechanism, and
sends SIGBUS to the user process if early-kill is enabled.
3, QEMU handles SIGBUS; if the address belongs to guest RAM, then:
4, QEMU tries to inject an MCE into the guest.
5, guest handles the memory failure again.

Steps 1-5 have been supported for a long time; the next steps are added by
this patch (together with the related driver patch):
6, guest balloon driver is notified of the corrupted PFN, and sends a
request to the host side via the Recover VQ frontend.
7, QEMU handles the request from the Recover VQ backend, then:
8, QEMU remaps the corrupted HVA to fix the memory failure, then:
9, QEMU acks the result to the guest side via the Recover VQ.
10, guest unpoisons the page if the corrupted page gets recovered
successfully.

Then the guest fixes the HW page error dynamically without rebooting.

Emulate an MCE with QEMU; the guest works fine:
mce: [Hardware Error]: Machine check events logged
Memory failure: 0x61646: recovery action for dirty LRU page: Recovered
virtio_balloon virtio5: recovered pfn 0x61646
Unpoison: Unpoisoned page 0x61646 by virtio-balloon
MCE: Killing stress:24502 due to hardware memory corruption fault at 7f5be2e5a010

The 'HardwareCorrupted' in /proc/meminfo also shows 0 kB.

Signed-off-by: zhenwei pi <[email protected]>
---
drivers/virtio/virtio_balloon.c | 243 ++++++++++++++++++++++++++++
include/uapi/linux/virtio_balloon.h | 16 ++
2 files changed, 259 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f4c34a2a6b8e..f9d95d1d8a4d 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -52,6 +52,7 @@ enum virtio_balloon_vq {
VIRTIO_BALLOON_VQ_STATS,
VIRTIO_BALLOON_VQ_FREE_PAGE,
VIRTIO_BALLOON_VQ_REPORTING,
+ VIRTIO_BALLOON_VQ_RECOVER,
VIRTIO_BALLOON_VQ_MAX
};

@@ -59,6 +60,12 @@ enum virtio_balloon_config_read {
VIRTIO_BALLOON_CONFIG_READ_CMD_ID = 0,
};

+/* the request body to communicate with the host side */
+struct __virtio_balloon_recover {
+ struct virtio_balloon_recover vbr;
+ __virtio32 pfns[VIRTIO_BALLOON_PAGES_PER_PAGE];
+};
+
struct virtio_balloon {
struct virtio_device *vdev;
struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
@@ -126,6 +133,16 @@ struct virtio_balloon {
/* Free page reporting device */
struct virtqueue *reporting_vq;
struct page_reporting_dev_info pr_dev_info;
+
+ /* Memory recover VQ - VIRTIO_BALLOON_F_RECOVER */
+ struct virtqueue *recover_vq;
+ spinlock_t recover_vq_lock;
+ struct notifier_block memory_failure_nb;
+ struct list_head corrupted_page_list;
+ struct list_head recovered_page_list;
+ spinlock_t recover_page_list_lock;
+ struct __virtio_balloon_recover in_vbr;
+ struct work_struct unpoison_memory_work;
};

static const struct virtio_device_id id_table[] = {
@@ -494,6 +511,198 @@ static void update_balloon_size_func(struct work_struct *work)
queue_work(system_freezable_wq, work);
}

+/*
+ * virtballoon_memory_failure - notified by memory failure, try to fix the
+ * corrupted page.
+ * The memory failure notifier is designed to call back only when the kernel
+ * handled the failure successfully; WARN_ON_ONCE on the unlikely conditions
+ * to catch any error (memory error handling is a best effort, not 100% covered).
+ */
+static int virtballoon_memory_failure(struct notifier_block *notifier,
+ unsigned long pfn, void *parm)
+{
+ struct virtio_balloon *vb = container_of(notifier, struct virtio_balloon,
+ memory_failure_nb);
+ struct page *page;
+ struct __virtio_balloon_recover *out_vbr;
+ struct scatterlist sg;
+ unsigned long flags;
+ int err;
+
+ page = pfn_to_online_page(pfn);
+ if (WARN_ON_ONCE(!page))
+ return NOTIFY_DONE;
+
+ if (PageHuge(page))
+ return NOTIFY_DONE;
+
+ if (WARN_ON_ONCE(!PageHWPoison(page)))
+ return NOTIFY_DONE;
+
+ if (WARN_ON_ONCE(page_count(page) != 1))
+ return NOTIFY_DONE;
+
+ get_page(page); /* balloon reference */
+
+ out_vbr = kzalloc(sizeof(*out_vbr), GFP_KERNEL);
+ if (WARN_ON_ONCE(!out_vbr))
+ return NOTIFY_BAD;
+
+ spin_lock(&vb->recover_page_list_lock);
+ balloon_page_push(&vb->corrupted_page_list, page);
+ spin_unlock(&vb->recover_page_list_lock);
+
+ out_vbr->vbr.cmd = VIRTIO_BALLOON_R_CMD_RECOVER;
+ set_page_pfns(vb, out_vbr->pfns, page);
+ sg_init_one(&sg, out_vbr, sizeof(*out_vbr));
+
+ spin_lock_irqsave(&vb->recover_vq_lock, flags);
+ err = virtqueue_add_outbuf(vb->recover_vq, &sg, 1, out_vbr, GFP_KERNEL);
+ if (unlikely(err)) {
+ spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
+ return NOTIFY_DONE;
+ }
+ virtqueue_kick(vb->recover_vq);
+ spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
+
+ return NOTIFY_OK;
+}
+
+static int recover_vq_get_response(struct virtio_balloon *vb)
+{
+ struct __virtio_balloon_recover *in_vbr;
+ struct scatterlist sg;
+ unsigned long flags;
+ int err;
+
+ spin_lock_irqsave(&vb->recover_vq_lock, flags);
+ in_vbr = &vb->in_vbr;
+ memset(in_vbr, 0x00, sizeof(*in_vbr));
+ sg_init_one(&sg, in_vbr, sizeof(*in_vbr));
+ err = virtqueue_add_inbuf(vb->recover_vq, &sg, 1, in_vbr, GFP_KERNEL);
+ if (unlikely(err)) {
+ spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
+ return err;
+ }
+
+ virtqueue_kick(vb->recover_vq);
+ spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
+
+ return 0;
+}
+
+static void recover_vq_handle_response(struct virtio_balloon *vb, unsigned int len)
+{
+ struct __virtio_balloon_recover *in_vbr;
+ struct virtio_balloon_recover *vbr;
+ struct page *page;
+ unsigned int pfns;
+ u32 pfn0, pfn1;
+ __u8 status;
+
+ /* the response is not expected */
+ if (unlikely(len != sizeof(struct __virtio_balloon_recover)))
+ return;
+
+ in_vbr = &vb->in_vbr;
+ vbr = &in_vbr->vbr;
+ if (unlikely(vbr->cmd != VIRTIO_BALLOON_R_CMD_RESPONSE))
+ return;
+
+ /* make sure the balloon PFNs are contiguous */
+ for (pfns = 1; pfns < VIRTIO_BALLOON_PAGES_PER_PAGE; pfns++) {
+ pfn0 = virtio32_to_cpu(vb->vdev, in_vbr->pfns[pfns - 1]);
+ pfn1 = virtio32_to_cpu(vb->vdev, in_vbr->pfns[pfns]);
+ if (pfn1 - pfn0 != 1)
+ return;
+ }
+
+ pfn0 = virtio32_to_cpu(vb->vdev, in_vbr->pfns[0]);
+ if (!pfn_valid(pfn0))
+ return;
+
+ pfn1 = -1;
+ spin_lock(&vb->recover_page_list_lock);
+ list_for_each_entry(page, &vb->corrupted_page_list, lru) {
+ pfn1 = page_to_pfn(page);
+ if (pfn1 == pfn0)
+ break;
+ }
+ spin_unlock(&vb->recover_page_list_lock);
+
+ status = vbr->status;
+ switch (status) {
+ case VIRTIO_BALLOON_R_STATUS_RECOVERED:
+ if (pfn1 == pfn0) {
+ spin_lock(&vb->recover_page_list_lock);
+ list_del(&page->lru);
+ balloon_page_push(&vb->recovered_page_list, page);
+ spin_unlock(&vb->recover_page_list_lock);
+ queue_work(system_freezable_wq, &vb->unpoison_memory_work);
+ dev_info_ratelimited(&vb->vdev->dev, "recovered pfn 0x%x", pfn0);
+ }
+ break;
+ case VIRTIO_BALLOON_R_STATUS_FAILED:
+ /* the hypervisor can't fix this corrupted page, balloon puts page */
+ if (pfn1 == pfn0) {
+ spin_lock(&vb->recover_page_list_lock);
+ list_del(&page->lru);
+ spin_unlock(&vb->recover_page_list_lock);
+ put_page(page);
+ dev_info_ratelimited(&vb->vdev->dev, "failed to recover pfn 0x%x", pfn0);
+ }
+ break;
+ default:
+ break;
+ }
+
+ /* queue another inbuf to keep receiving responses from the host side */
+ recover_vq_get_response(vb);
+}
+
+static void unpoison_memory_func(struct work_struct *work)
+{
+ struct virtio_balloon *vb;
+ struct page *page;
+
+ vb = container_of(work, struct virtio_balloon, unpoison_memory_work);
+
+ do {
+ spin_lock(&vb->recover_page_list_lock);
+ page = list_first_entry_or_null(&vb->recovered_page_list,
+ struct page, lru);
+ if (page)
+ list_del(&page->lru);
+ spin_unlock(&vb->recover_page_list_lock);
+
+ if (page) {
+ put_page(page);
+ unpoison_memory(page_to_pfn(page), true, "virtio-balloon");
+ }
+ } while (page);
+}
+
+static void recover_vq_cb(struct virtqueue *vq)
+{
+ struct virtio_balloon *vb = vq->vdev->priv;
+ struct __virtio_balloon_recover *vbr;
+ unsigned long flags;
+ unsigned int len;
+
+ spin_lock_irqsave(&vb->recover_vq_lock, flags);
+ do {
+ virtqueue_disable_cb(vq);
+ while ((vbr = virtqueue_get_buf(vq, &len)) != NULL) {
+ spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
+ if (vbr == &vb->in_vbr)
+ recover_vq_handle_response(vb, len);
+ else
+ kfree(vbr); /* just free the memory for out vbr request */
+ spin_lock_irqsave(&vb->recover_vq_lock, flags);
+ }
+ } while (!virtqueue_enable_cb(vq));
+ spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
+}
+
static int init_vqs(struct virtio_balloon *vb)
{
struct virtqueue *vqs[VIRTIO_BALLOON_VQ_MAX];
@@ -515,6 +724,7 @@ static int init_vqs(struct virtio_balloon *vb)
callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;
+ names[VIRTIO_BALLOON_VQ_RECOVER] = NULL;

if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
names[VIRTIO_BALLOON_VQ_STATS] = "stats";
@@ -531,6 +741,11 @@ static int init_vqs(struct virtio_balloon *vb)
callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;
}

+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_RECOVER)) {
+ names[VIRTIO_BALLOON_VQ_RECOVER] = "recover_vq";
+ callbacks[VIRTIO_BALLOON_VQ_RECOVER] = recover_vq_cb;
+ }
+
err = virtio_find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX, vqs,
callbacks, names, NULL);
if (err)
@@ -566,6 +781,9 @@ static int init_vqs(struct virtio_balloon *vb)
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];

+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_RECOVER))
+ vb->recover_vq = vqs[VIRTIO_BALLOON_VQ_RECOVER];
+
return 0;
}

@@ -1015,12 +1233,31 @@ static int virtballoon_probe(struct virtio_device *vdev)
goto out_unregister_oom;
}

+ if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_RECOVER)) {
+ err = recover_vq_get_response(vb);
+ if (err)
+ goto out_unregister_reporting;
+
+ vb->memory_failure_nb.notifier_call = virtballoon_memory_failure;
+ spin_lock_init(&vb->recover_page_list_lock);
+ spin_lock_init(&vb->recover_vq_lock);
+ INIT_LIST_HEAD(&vb->corrupted_page_list);
+ INIT_LIST_HEAD(&vb->recovered_page_list);
+ INIT_WORK(&vb->unpoison_memory_work, unpoison_memory_func);
+ err = register_memory_failure_notifier(&vb->memory_failure_nb);
+ if (err)
+ goto out_unregister_reporting;
+ }
+
virtio_device_ready(vdev);

if (towards_target(vb))
virtballoon_changed(vdev);
return 0;

+out_unregister_reporting:
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
+ page_reporting_unregister(&vb->pr_dev_info);
out_unregister_oom:
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
unregister_oom_notifier(&vb->oom_nb);
@@ -1082,6 +1319,11 @@ static void virtballoon_remove(struct virtio_device *vdev)
destroy_workqueue(vb->balloon_wq);
}

+ if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_RECOVER)) {
+ unregister_memory_failure_notifier(&vb->memory_failure_nb);
+ cancel_work_sync(&vb->unpoison_memory_work);
+ }
+
remove_common(vb);
#ifdef CONFIG_BALLOON_COMPACTION
if (vb->vb_dev_info.inode)
@@ -1147,6 +1389,7 @@ static unsigned int features[] = {
VIRTIO_BALLOON_F_FREE_PAGE_HINT,
VIRTIO_BALLOON_F_PAGE_POISON,
VIRTIO_BALLOON_F_REPORTING,
+ VIRTIO_BALLOON_F_RECOVER,
};

static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index ddaa45e723c4..41d0ffa2fb54 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -37,6 +37,7 @@
#define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
#define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
#define VIRTIO_BALLOON_F_REPORTING 5 /* Page reporting virtqueue */
+#define VIRTIO_BALLOON_F_RECOVER 6 /* Memory recover virtqueue */

/* Size of a PFN in the balloon interface. */
#define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -59,6 +60,8 @@ struct virtio_balloon_config {
};
/* Stores PAGE_POISON if page poisoning is in use */
__le32 poison_val;
+ /* Number of hardware corrupted pages, guest read only */
+ __le32 corrupted_pages;
};

#define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */
@@ -116,4 +119,17 @@ struct virtio_balloon_stat {
__virtio64 val;
} __attribute__((packed));

+#define VIRTIO_BALLOON_R_CMD_RECOVER 0
+#define VIRTIO_BALLOON_R_CMD_RESPONSE 0x80
+
+#define VIRTIO_BALLOON_R_STATUS_CORRUPTED 0
+#define VIRTIO_BALLOON_R_STATUS_RECOVERED 1
+#define VIRTIO_BALLOON_R_STATUS_FAILED 2
+
+struct virtio_balloon_recover {
+ __u8 cmd;
+ __u8 status;
+ __u8 padding[6];
+};
+
#endif /* _LINUX_VIRTIO_BALLOON_H */
--
2.20.1


2022-05-23 06:30:24

by zhenwei pi

Subject: [PATCH 2/3] mm/memory-failure.c: support reset PTE during unpoison

Originally, unpoison_memory() is only used by hwpoison-inject, and
unpoisons a page which was poisoned by hwpoison-inject too. The kernel PTE
entry does not change during software poison/unpoison.

On a virtualization platform, it's possible for the hypervisor to fix a
hardware-corrupted page, typically by remapping the erroneous HVA (host
virtual address). So add a new parameter 'const char *reason' to record
who requested the unpoison and why.

Once the corrupted page gets fixed, the guest kernel needs to put the page
back to the buddy allocator. Reusing the page then hits the following issue
(Intel Platinum 8260):
BUG: unable to handle page fault for address: ffff888061646000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 2c01067 P4D 2c01067 PUD 61aaa063 PMD 10089b063 PTE 800fffff9e9b9062
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 2 PID: 31106 Comm: stress Kdump: loaded Tainted: G M OE 5.18.0-rc6.bm.1-amd64 #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:clear_page_erms+0x7/0x10

The kernel PTE entry of the fixed page is still uncorrected, so the kernel
hits a page fault during prep_new_page. Add 'bool reset_kpte' to get a
chance to fix the PTE entry when the page has been fixed by the hypervisor.
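The resulting call contract can be sketched with an illustrative stub (the
real implementation lives in mm/memory-failure.c; this only shows how the
two callers in this series use the new parameters):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Stub mirroring the new signature. hwpoison-inject passes
 * reset_kpte=false because software poison never touched the kernel
 * PTE; virtio-balloon passes true so the direct-map PTE is made
 * present again before the page returns to the buddy allocator. */
static int unpoison_memory(unsigned long pfn, bool reset_kpte,
                           const char *reason)
{
        printf("Unpoison: Unpoisoned page %#lx by %s%s\n",
               pfn, reason, reset_kpte ? " (PTE reset)" : "");
        return 0; /* real code may return -EBUSY and friends */
}
```

For example, the balloon driver would call `unpoison_memory(pfn, true,
"virtio-balloon")`, while the debugfs injector keeps calling
`unpoison_memory(val, false, "hwpoison-inject")`.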

Signed-off-by: zhenwei pi <[email protected]>
---
include/linux/mm.h | 2 +-
mm/hwpoison-inject.c | 2 +-
mm/memory-failure.c | 26 +++++++++++++++++++-------
3 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 665873c2788c..7ba210e86401 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3191,7 +3191,7 @@ enum mf_flags {
extern int memory_failure(unsigned long pfn, int flags);
extern void memory_failure_queue(unsigned long pfn, int flags);
extern void memory_failure_queue_kick(int cpu);
-extern int unpoison_memory(unsigned long pfn);
+extern int unpoison_memory(unsigned long pfn, bool reset_kpte, const char *reason);
extern int sysctl_memory_failure_early_kill;
extern int sysctl_memory_failure_recovery;
extern void shake_page(struct page *p);
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 5c0cddd81505..0dd17ba98ade 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -57,7 +57,7 @@ static int hwpoison_unpoison(void *data, u64 val)
if (!capable(CAP_SYS_ADMIN))
return -EPERM;

- return unpoison_memory(val);
+ return unpoison_memory(val, false, "hwpoison-inject");
}

DEFINE_DEBUGFS_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n");
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 95c218bb0a37..a46de3be1dd7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2132,21 +2132,26 @@ core_initcall(memory_failure_init);
/**
* unpoison_memory - Unpoison a previously poisoned page
* @pfn: Page number of the to be unpoisoned page
+ * @reset_kpte: Reset the kernel PTE entry of the page
+ * @reason: The caller tells why the page is being unpoisoned
*
- * Software-unpoison a page that has been poisoned by
- * memory_failure() earlier.
+ * Unpoison a page that has been poisoned by memory_failure() earlier.
*
- * This is only done on the software-level, so it only works
- * for linux injected failures, not real hardware failures
+ * For Linux-injected failures, there is no need to reset the PTE entry.
+ * On a virtualization platform it's possible to fix a hardware memory
+ * failure: once the hypervisor fixes it, the guest needs to put the page
+ * back to the buddy allocator and reset its kernel PTE entry.
*
* Returns 0 for success, otherwise -errno.
*/
-int unpoison_memory(unsigned long pfn)
+int unpoison_memory(unsigned long pfn, bool reset_kpte, const char *reason)
{
struct page *page;
struct page *p;
int ret = -EBUSY;
int freeit = 0;
+ pte_t *kpte;
+ unsigned long addr;
static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);

@@ -2208,8 +2213,15 @@ int unpoison_memory(unsigned long pfn)
mutex_unlock(&mf_mutex);
if (!ret || freeit) {
num_poisoned_pages_dec();
- unpoison_pr_info("Unpoison: Software-unpoisoned page %#lx\n",
- page_to_pfn(p), &unpoison_rs);
+ pr_info("Unpoison: Unpoisoned page %#lx by %s\n",
+ page_to_pfn(p), reason);
+ if (reset_kpte) {
+ preempt_disable();
+ addr = (unsigned long)page_to_virt(p);
+ kpte = virt_to_kpte(addr);
+ set_pte_at(&init_mm, addr, kpte, pfn_pte(pfn, PAGE_KERNEL));
+ preempt_enable();
+ }
}
return ret;
}
--
2.20.1


2022-05-23 07:53:09

by kernel test robot

Subject: Re: [PATCH 3/3] virtio_balloon: Introduce memory recover

Hi zhenwei,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20220519]
[cannot apply to linux/master linus/master v5.18-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/intel-lab-lkp/linux/commits/zhenwei-pi/recover-hardware-corrupted-page-by-virtio-balloon/20220520-151328
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
config: arm-randconfig-r026-20220519 (https://download.01.org/0day-ci/archive/20220520/[email protected]/config)
compiler: arm-linux-gnueabi-gcc (GCC) 11.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/a42127073dd4adb6354649c8235c5cde033d01f2
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review zhenwei-pi/recover-hardware-corrupted-page-by-virtio-balloon/20220520-151328
git checkout a42127073dd4adb6354649c8235c5cde033d01f2
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.3.0 make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

arm-linux-gnueabi-ld: drivers/virtio/virtio_balloon.o: in function `unpoison_memory_func':
>> virtio_balloon.c:(.text+0x89a): undefined reference to `unpoison_memory'
arm-linux-gnueabi-ld: drivers/virtio/virtio_balloon.o: in function `virtballoon_probe':
>> virtio_balloon.c:(.text+0x111a): undefined reference to `register_memory_failure_notifier'
arm-linux-gnueabi-ld: drivers/virtio/virtio_balloon.o: in function `virtballoon_remove':
>> virtio_balloon.c:(.text+0x12a2): undefined reference to `unregister_memory_failure_notifier'

--
0-DAY CI Kernel Test Service
https://01.org/lkp

2022-05-25 05:39:23

by zhenwei pi

Subject: Re: Re: [PATCH 3/3] virtio_balloon: Introduce memory recover



On 5/25/22 03:35, Sean Christopherson wrote:
> On Fri, May 20, 2022, zhenwei pi wrote:
>> @@ -59,6 +60,12 @@ enum virtio_balloon_config_read {
>> VIRTIO_BALLOON_CONFIG_READ_CMD_ID = 0,
>> };
>>
>> +/* the request body to commucate with host side */
>> +struct __virtio_balloon_recover {
>> + struct virtio_balloon_recover vbr;
>> + __virtio32 pfns[VIRTIO_BALLOON_PAGES_PER_PAGE];
>
> I assume this is copied from virtio_balloon.pfns, which also uses __virtio32, but
> isn't that horribly broken? PFNs are 'unsigned long', i.e. 64 bits on 64-bit kernels.
> x86-64 at least most definitely generates 64-bit PFNs. Unless there's magic I'm
> missing, page_to_balloon_pfn() will truncate PFNs and feed the host bad info.
>

Yes, I also noticed this point. I suppose the balloon device cannot
work on a virtual machine whose physical address space is larger than 16TB.

I still keep the recover VQ aligned with the inflate VQ and deflate
VQ: I prefer the recover VQ to be workable/unworkable together with the
inflate/deflate VQs. So I leave this to the virtio balloon
maintainer to decide ...

>> @@ -494,6 +511,198 @@ static void update_balloon_size_func(struct work_struct *work)
>> queue_work(system_freezable_wq, work);
>> }
>>
>> +/*
>> + * virtballoon_memory_failure - notified by memory failure, try to fix the
>> + * corrupted page.
>> + * The memory failure notifier is designed to call back when the kernel handled
>> + * successfully only, WARN_ON_ONCE on the unlikely condition to find out any
>> + * error(memory error handling is a best effort, not 100% coverd).
>> + */
>> +static int virtballoon_memory_failure(struct notifier_block *notifier,
>> + unsigned long pfn, void *parm)
>> +{
>> + struct virtio_balloon *vb = container_of(notifier, struct virtio_balloon,
>> + memory_failure_nb);
>> + struct page *page;
>> + struct __virtio_balloon_recover *out_vbr;
>> + struct scatterlist sg;
>> + unsigned long flags;
>> + int err;
>> +
>> + page = pfn_to_online_page(pfn);
>> + if (WARN_ON_ONCE(!page))
>> + return NOTIFY_DONE;
>> +
>> + if (PageHuge(page))
>> + return NOTIFY_DONE;
>> +
>> + if (WARN_ON_ONCE(!PageHWPoison(page)))
>> + return NOTIFY_DONE;
>> +
>> + if (WARN_ON_ONCE(page_count(page) != 1))
>> + return NOTIFY_DONE;
>> +
>> + get_page(page); /* balloon reference */
>> +
>> + out_vbr = kzalloc(sizeof(*out_vbr), GFP_KERNEL);
>> + if (WARN_ON_ONCE(!out_vbr))
>> + return NOTIFY_BAD;
>
> Not that it truly matters, but won't failure at this point leak the poisoned page?

I'll fix this, thanks!

--
zhenwei pi

2022-05-25 17:41:39

by David Hildenbrand

Subject: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon

On 20.05.22 09:06, zhenwei pi wrote:
> Hi,
>
> I'm trying to recover hardware corrupted page by virtio balloon, the
> workflow of this feature like this:
>
> Guest        5.MF -> 6.RVQ FE            10.Unpoison page
>               /          \              /
> --------------+----------+--------------+-----------
>               |          |              |
>             4.MCE    7.RVQ BE      9.RVQ Event
> QEMU        /              \        /
>         3.SIGBUS           8.Remap
>            /
> -----------+------------------------------------
>            |
>         +--2.MF
> Host   /
>     1.HW error
>
> 1, HardWare page error occurs randomly.
> 2, host side handles corrupted page by Memory Failure mechanism, sends
> SIGBUS to the user process if early-kill is enabled.
> 3, QEMU handles SIGBUS, if the address belongs to guest RAM, then:
> 4, QEMU tries to inject MCE into guest.
> 5, guest handles memory failure again.
>
> 1-5 is already supported for a long time, the next steps are supported
> in this patch(also related driver patch):
>
> 6, guest balloon driver gets noticed of the corrupted PFN, and sends
> request to host side by Recover VQ FrontEnd.
> 7, QEMU handles request from Recover VQ BackEnd, then:
> 8, QEMU remaps the corrupted HVA fo fix the memory failure, then:
> 9, QEMU acks the guest side the result by Recover VQ.
> 10, guest unpoisons the page if the corrupted page gets recoverd
> successfully.
>
> Test:
> This patch set can be tested with QEMU(also in developing):
> https://github.com/pizhenwei/qemu/tree/balloon-recover
>
> Emulate MCE by QEMU(guest RAM normal page only, hugepage is not supported):
> virsh qemu-monitor-command vm --hmp mce 0 9 0xbd000000000000c0 0xd 0x61646678 0x8c
>
> The guest works fine(on Intel Platinum 8260):
> mce: [Hardware Error]: Machine check events logged
> Memory failure: 0x61646: recovery action for dirty LRU page: Recovered
> virtio_balloon virtio5: recovered pfn 0x61646
> Unpoison: Unpoisoned page 0x61646 by virtio-balloon
> MCE: Killing stress:24502 due to hardware memory corruption fault at 7f5be2e5a010
>
> And the 'HardwareCorrupted' in /proc/meminfo also shows 0 kB.
>
> About the protocol of virtio balloon recover VQ, it's undefined and in
> developing currently:
> - 'struct virtio_balloon_recover' defines the structure which is used to
> exchange message between guest and host.
> - '__le32 corrupted_pages' in struct virtio_balloon_config is used in the next
> step:
> 1, a VM uses RAM of 2M huge page, once a MCE occurs, the 2M becomes
> unaccessible. Reporting 512 * 4K 'corrupted_pages' to the guest, the guest
> has a chance to isolate the 512 pages ahead of time.
>
> 2, after migrating to another host, the corrupted pages are actually recovered,
> once the guest gets the 'corrupted_pages' with 0, then the guest could
> unpoison all the poisoned pages which are recorded in the balloon driver.
>

Hi,

I'm still on vacation this week, I'll try to have a look when I'm back
(and flushed out my overflowing inbox :D).


--
Thanks,

David / dhildenb


2022-05-25 20:11:53

by Sean Christopherson

Subject: Re: [PATCH 3/3] virtio_balloon: Introduce memory recover

On Fri, May 20, 2022, zhenwei pi wrote:
> @@ -59,6 +60,12 @@ enum virtio_balloon_config_read {
> VIRTIO_BALLOON_CONFIG_READ_CMD_ID = 0,
> };
>
> +/* the request body to commucate with host side */
> +struct __virtio_balloon_recover {
> + struct virtio_balloon_recover vbr;
> + __virtio32 pfns[VIRTIO_BALLOON_PAGES_PER_PAGE];

I assume this is copied from virtio_balloon.pfns, which also uses __virtio32, but
isn't that horribly broken? PFNs are 'unsigned long', i.e. 64 bits on 64-bit kernels.
x86-64 at least most definitely generates 64-bit PFNs. Unless there's magic I'm
missing, page_to_balloon_pfn() will truncate PFNs and feed the host bad info.
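A numeric sketch of the concern (hypothetical helpers, not part of the
patch): with VIRTIO_BALLOON_PFN_SHIFT of 12, a 32-bit balloon PFN covers at
most 16TB of guest physical address space, and anything above silently wraps.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_BALLOON_PFN_SHIFT 12

/* Storing a 64-bit PFN in a 32-bit field silently drops the high bits. */
static uint32_t pfn_to_virtio32(uint64_t pfn)
{
        return (uint32_t)pfn;
}

/* First byte address that a 32-bit balloon PFN can no longer express:
 * 2^32 pages of 4K each, i.e. 16TB. */
static uint64_t balloon_pfn_limit(void)
{
        return (uint64_t)1 << (32 + VIRTIO_BALLOON_PFN_SHIFT);
}
```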

> @@ -494,6 +511,198 @@ static void update_balloon_size_func(struct work_struct *work)
> queue_work(system_freezable_wq, work);
> }
>
> +/*
> + * virtballoon_memory_failure - notified by memory failure, try to fix the
> + * corrupted page.
> + * The memory failure notifier is designed to call back when the kernel handled
> + * successfully only, WARN_ON_ONCE on the unlikely condition to find out any
> + * error(memory error handling is a best effort, not 100% coverd).
> + */
> +static int virtballoon_memory_failure(struct notifier_block *notifier,
> + unsigned long pfn, void *parm)
> +{
> + struct virtio_balloon *vb = container_of(notifier, struct virtio_balloon,
> + memory_failure_nb);
> + struct page *page;
> + struct __virtio_balloon_recover *out_vbr;
> + struct scatterlist sg;
> + unsigned long flags;
> + int err;
> +
> + page = pfn_to_online_page(pfn);
> + if (WARN_ON_ONCE(!page))
> + return NOTIFY_DONE;
> +
> + if (PageHuge(page))
> + return NOTIFY_DONE;
> +
> + if (WARN_ON_ONCE(!PageHWPoison(page)))
> + return NOTIFY_DONE;
> +
> + if (WARN_ON_ONCE(page_count(page) != 1))
> + return NOTIFY_DONE;
> +
> + get_page(page); /* balloon reference */
> +
> + out_vbr = kzalloc(sizeof(*out_vbr), GFP_KERNEL);
> + if (WARN_ON_ONCE(!out_vbr))
> + return NOTIFY_BAD;

Not that it truly matters, but won't failure at this point leak the poisoned page?

2022-05-27 06:47:06

by Jue Wang

Subject: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon

Some points to consider:

The injected MCE has _done_ the damage to the guest workload. Recovering
the guest's poisoned memory doesn't help with the guest workload memory
corruption / loss / interruption that has already happened due to the
injected MCEs.

The hypervisor _must_ emulate poisons identified in guest physical
address space (they could be transported from the source VM); this is to
prevent silent data corruption in the guest. With a paravirtual
approach like this patch series, the hypervisor can clear some of the
poisoned HVAs knowing for certain that the guest OS has isolated the
poisoned page. I wonder how much value it provides to the guest if the
guest and workload are _not_ in a pressing need for the extra KB/MB
worth of memory.

Thanks,
-Jue

2022-05-27 09:03:12

by zhenwei pi

Subject: Re: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon

On 5/27/22 02:37, Peter Xu wrote:
> On Wed, May 25, 2022 at 01:16:34PM -0700, Jue Wang wrote:
>> The hypervisor _must_ emulate poisons identified in guest physical
>> address space (could be transported from the source VM), this is to
>> prevent silent data corruption in the guest. With a paravirtual
>> approach like this patch series, the hypervisor can clear some of the
>> poisoned HVAs knowing for certain that the guest OS has isolated the
>> poisoned page. I wonder how much value it provides to the guest if the
>> guest and workload are _not_ in a pressing need for the extra KB/MB
>> worth of memory.
>
> I'm curious the same on how unpoisoning could help here. The reasoning
> behind would be great material to be mentioned in the next cover letter.
>
> Shouldn't we consider migrating serious workloads off the host already
> where there's a sign of more severe hardware issues, instead?
>
> Thanks,
>

I'm maintaining 1,000,000+ virtual machines; from my experience:
UE is quite unusual and occurs randomly, and I have not hit a UE storm
case in the past years. Memory also shows no obvious performance drop
after hitting a UE.

I have hit several CE storm cases where memory performance drops a lot,
but I can't find an obvious relationship between UE and CE.

So from my point of view, fixing the corrupted pages for a VM seems good
enough. And yes, unpoisoning several pages does not help significantly,
but it is still a chance to make virtualization better.

--
zhenwei pi

2022-05-27 11:39:28

by zhenwei pi

[permalink] [raw]
Subject: Re: Re: [PATCH 3/3] virtio_balloon: Introduce memory recover

On 5/27/22 03:18, Michael S. Tsirkin wrote:
> On Fri, May 20, 2022 at 03:06:48PM +0800, zhenwei pi wrote:
>> Introduce a new queue 'recover VQ', this allows guest to recover
>> hardware corrupted page:
>>
>> Guest          5.MF -> 6.RVQ FE            10.Unpoison page
>>                  /               \            /
>> -------------------+-------------+----------+-----------
>>                    |             |          |
>>                 4.MCE        7.RVQ BE   9.RVQ Event
>> QEMU            /                 \       /
>>             3.SIGBUS            8.Remap
>>               /
>> ----------------+------------------------------------
>>                 |
>>            +--2.MF
>> Host      /
>>          1.HW error
>>
>> The workflow:
>> 1, HardWare page error occurs randomly.
>> 2, host side handles corrupted page by Memory Failure mechanism, sends
>> SIGBUS to the user process if early-kill is enabled.
>> 3, QEMU handles SIGBUS, if the address belongs to guest RAM, then:
>> 4, QEMU tries to inject MCE into guest.
>> 5, guest handles memory failure again.
>>
>> 1-5 is already supported for a long time, the next steps are supported
>> in this patch(also related driver patch):
>> 6, guest balloon driver gets notified of the corrupted PFN, and sends
>> request to host side by Recover VQ FrontEnd.
>> 7, QEMU handles request from Recover VQ BackEnd, then:
>> 8, QEMU remaps the corrupted HVA to fix the memory failure, then:
>> 9, QEMU acks the guest side the result by Recover VQ.
>> 10, guest unpoisons the page if the corrupted page gets recovered
>> successfully.
>>
>> Then the guest fixes the HW page error dynamically without rebooting.
>>
>> Emulate MCE by QEMU, the guest works fine:
>> mce: [Hardware Error]: Machine check events logged
>> Memory failure: 0x61646: recovery action for dirty LRU page: Recovered
>> virtio_balloon virtio5: recovered pfn 0x61646
>> Unpoison: Unpoisoned page 0x61646 by virtio-balloon
>> MCE: Killing stress:24502 due to hardware memory corruption fault at 7f5be2e5a010
>>
>> The 'HardwareCorrupted' in /proc/meminfo also shows 0 kB.
>>
>> Signed-off-by: zhenwei pi <[email protected]>
>> ---
>> drivers/virtio/virtio_balloon.c | 243 ++++++++++++++++++++++++++++
>> include/uapi/linux/virtio_balloon.h | 16 ++
>> 2 files changed, 259 insertions(+)
>>
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index f4c34a2a6b8e..f9d95d1d8a4d 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -52,6 +52,7 @@ enum virtio_balloon_vq {
>> VIRTIO_BALLOON_VQ_STATS,
>> VIRTIO_BALLOON_VQ_FREE_PAGE,
>> VIRTIO_BALLOON_VQ_REPORTING,
>> + VIRTIO_BALLOON_VQ_RECOVER,
>> VIRTIO_BALLOON_VQ_MAX
>> };
>>
>> @@ -59,6 +60,12 @@ enum virtio_balloon_config_read {
>> VIRTIO_BALLOON_CONFIG_READ_CMD_ID = 0,
>> };
>>
>> +/* the request body to communicate with host side */
>> +struct __virtio_balloon_recover {
>> + struct virtio_balloon_recover vbr;
>> + __virtio32 pfns[VIRTIO_BALLOON_PAGES_PER_PAGE];
>> +};
>> +
>
>
> I don't think this idea of passing 32 bit pfns is going to fly.
> What is wrong with just passing the pages normally as a s/g list?
> this is what is done for the hints at the moment.
>
> neither should you use __virtio types for new functionality
> (should all be __le), nor use __virtio for the struct name.
>
>

The guest side can send the GPA/PFN to the host side by passing the
pages normally as an s/g list; that part is OK.

But in this scenario, the guest also needs to get the status of each
page (recovered? corrupted? failed to recover?) back from the host side.

For a normal page (e.g. 4K), the host can return the status almost
immediately. But when the guest RAM is a 2M hugetlb page, the host has
to keep the request pending until the guest has requested recovery of
all 512 * 4K subpages. Once the 2M hugepage gets recovered (or fails to
recover), the host returns the 512 PFNs with their status to the guest.
So there can be up to 512 pending recover requests for a single 2M huge
page.

For example, the guest tries to recover a corrupted page:
struct scatterlist status_sg, page_sg, *sgs[2];

sg_init_one(&status_sg, status, sizeof(*status));
sgs[0] = &status_sg;

p = page_address(page);
sg_init_one(&page_sg, p, PAGE_SIZE);
sgs[1] = &page_sg;

virtqueue_add_sgs(recover_vq, sgs, 1, 1, NULL, GFP_ATOMIC);

When the host handles a 4K recover request that falls within a 2M
hugepage, that request stays pending until the full 2M huge page gets
recovered (or fails).

To avoid too many pending requests in the virtqueue, I designed it as in
this patch (which should indeed use __le): the PFNs are passed in the
request body, and only a single IN request is used.
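To illustrate the pending logic described above, here is a rough userspace model (struct huge_recover and its helpers are invented for this sketch; they are not part of the patch or of QEMU): 4K requests landing in one 2M huge page are tracked until all 512 distinct subpages have been requested, at which point the whole huge page can be completed in one go.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of the host-side pending logic: 4K recover
 * requests that land in the same 2M huge page are held until all 512
 * subpages have been requested, then the whole huge page is completed
 * at once. Names and sizes here are illustrative. */

#define SUBPAGES_PER_2M	512	/* 2M / 4K */

struct huge_recover {
	unsigned long base_pfn;			/* first 4K PFN of the 2M page */
	unsigned char requested[SUBPAGES_PER_2M];
	int pending;				/* distinct subpages seen so far */
};

static void huge_recover_init(struct huge_recover *hr, unsigned long base_pfn)
{
	memset(hr, 0, sizeof(*hr));
	hr->base_pfn = base_pfn;
}

/* Returns 1 when the last missing subpage arrives and the 2M page
 * can be recovered (or declared failed) as a whole. */
static int huge_recover_request(struct huge_recover *hr, unsigned long pfn)
{
	unsigned long idx = pfn - hr->base_pfn;

	if (idx >= SUBPAGES_PER_2M || hr->requested[idx])
		return 0;			/* out of range or duplicate */
	hr->requested[idx] = 1;
	return ++hr->pending == SUBPAGES_PER_2M;
}
```

This is the bookkeeping that would otherwise leave up to 512 requests sitting in the virtqueue; keeping the PFNs in the request body lets the host hold the state outside the queue.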


...
>> --- a/include/uapi/linux/virtio_balloon.h
>> +++ b/include/uapi/linux/virtio_balloon.h
>> @@ -37,6 +37,7 @@
>> #define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
>> #define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
>> #define VIRTIO_BALLOON_F_REPORTING 5 /* Page reporting virtqueue */
>> +#define VIRTIO_BALLOON_F_RECOVER 6 /* Memory recover virtqueue */
>>
>> /* Size of a PFN in the balloon interface. */
>> #define VIRTIO_BALLOON_PFN_SHIFT 12
>
> Please get this feature recorded in the spec with the virtio TC.
> They will also ask you to supply minimal documentation.
>

Sure!
By the way, this feature depends on the memory/memory-failure mechanism.
What about sending the spec change to the virtio TC after Andrew and
Naoya ack?

--
zhenwei pi

2022-05-27 11:42:42

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH 3/3] virtio_balloon: Introduce memory recover

On Fri, May 20, 2022 at 03:06:48PM +0800, zhenwei pi wrote:
> Introduce a new queue 'recover VQ', this allows guest to recover
> hardware corrupted page:
>
> Guest          5.MF -> 6.RVQ FE            10.Unpoison page
>                  /               \            /
> -------------------+-------------+----------+-----------
>                    |             |          |
>                 4.MCE        7.RVQ BE   9.RVQ Event
> QEMU            /                 \       /
>             3.SIGBUS            8.Remap
>               /
> ----------------+------------------------------------
>                 |
>            +--2.MF
> Host      /
>          1.HW error
>
> The workflow:
> 1, HardWare page error occurs randomly.
> 2, host side handles corrupted page by Memory Failure mechanism, sends
> SIGBUS to the user process if early-kill is enabled.
> 3, QEMU handles SIGBUS, if the address belongs to guest RAM, then:
> 4, QEMU tries to inject MCE into guest.
> 5, guest handles memory failure again.
>
> 1-5 is already supported for a long time, the next steps are supported
> in this patch(also related driver patch):
> 6, guest balloon driver gets notified of the corrupted PFN, and sends
> request to host side by Recover VQ FrontEnd.
> 7, QEMU handles request from Recover VQ BackEnd, then:
> 8, QEMU remaps the corrupted HVA to fix the memory failure, then:
> 9, QEMU acks the guest side the result by Recover VQ.
> 10, guest unpoisons the page if the corrupted page gets recovered
> successfully.
>
> Then the guest fixes the HW page error dynamically without rebooting.
>
> Emulate MCE by QEMU, the guest works fine:
> mce: [Hardware Error]: Machine check events logged
> Memory failure: 0x61646: recovery action for dirty LRU page: Recovered
> virtio_balloon virtio5: recovered pfn 0x61646
> Unpoison: Unpoisoned page 0x61646 by virtio-balloon
> MCE: Killing stress:24502 due to hardware memory corruption fault at 7f5be2e5a010
>
> The 'HardwareCorrupted' in /proc/meminfo also shows 0 kB.
>
> Signed-off-by: zhenwei pi <[email protected]>
> ---
> drivers/virtio/virtio_balloon.c | 243 ++++++++++++++++++++++++++++
> include/uapi/linux/virtio_balloon.h | 16 ++
> 2 files changed, 259 insertions(+)
>
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f4c34a2a6b8e..f9d95d1d8a4d 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -52,6 +52,7 @@ enum virtio_balloon_vq {
> VIRTIO_BALLOON_VQ_STATS,
> VIRTIO_BALLOON_VQ_FREE_PAGE,
> VIRTIO_BALLOON_VQ_REPORTING,
> + VIRTIO_BALLOON_VQ_RECOVER,
> VIRTIO_BALLOON_VQ_MAX
> };
>
> @@ -59,6 +60,12 @@ enum virtio_balloon_config_read {
> VIRTIO_BALLOON_CONFIG_READ_CMD_ID = 0,
> };
>
> +/* the request body to communicate with host side */
> +struct __virtio_balloon_recover {
> + struct virtio_balloon_recover vbr;
> + __virtio32 pfns[VIRTIO_BALLOON_PAGES_PER_PAGE];
> +};
> +


I don't think this idea of passing 32 bit pfns is going to fly.
What is wrong with just passing the pages normally as a s/g list?
this is what is done for the hints at the moment.

neither should you use __virtio types for new functionality
(should all be __le), nor use __virtio for the struct name.



> struct virtio_balloon {
> struct virtio_device *vdev;
> struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> @@ -126,6 +133,16 @@ struct virtio_balloon {
> /* Free page reporting device */
> struct virtqueue *reporting_vq;
> struct page_reporting_dev_info pr_dev_info;
> +
> + /* Memory recover VQ - VIRTIO_BALLOON_F_RECOVER */
> + struct virtqueue *recover_vq;
> + spinlock_t recover_vq_lock;
> + struct notifier_block memory_failure_nb;
> + struct list_head corrupted_page_list;
> + struct list_head recovered_page_list;
> + spinlock_t recover_page_list_lock;
> + struct __virtio_balloon_recover in_vbr;
> + struct work_struct unpoison_memory_work;
> };
>
> static const struct virtio_device_id id_table[] = {
> @@ -494,6 +511,198 @@ static void update_balloon_size_func(struct work_struct *work)
> queue_work(system_freezable_wq, work);
> }
>
> +/*
> + * virtballoon_memory_failure - notified by memory failure, try to fix the
> + * corrupted page.
> + * The memory failure notifier is designed to call back when the kernel handled
> + * successfully only, WARN_ON_ONCE on the unlikely condition to find out any
> + * error(memory error handling is a best effort, not 100% coverd).

covered

> + */
> +static int virtballoon_memory_failure(struct notifier_block *notifier,
> + unsigned long pfn, void *parm)
> +{
> + struct virtio_balloon *vb = container_of(notifier, struct virtio_balloon,
> + memory_failure_nb);
> + struct page *page;
> + struct __virtio_balloon_recover *out_vbr;
> + struct scatterlist sg;
> + unsigned long flags;
> + int err;
> +
> + page = pfn_to_online_page(pfn);
> + if (WARN_ON_ONCE(!page))
> + return NOTIFY_DONE;
> +
> + if (PageHuge(page))
> + return NOTIFY_DONE;
> +
> + if (WARN_ON_ONCE(!PageHWPoison(page)))
> + return NOTIFY_DONE;
> +
> + if (WARN_ON_ONCE(page_count(page) != 1))
> + return NOTIFY_DONE;
> +
> + get_page(page); /* balloon reference */
> +
> + out_vbr = kzalloc(sizeof(*out_vbr), GFP_KERNEL);
> + if (WARN_ON_ONCE(!out_vbr))
> + return NOTIFY_BAD;
> +
> + spin_lock(&vb->recover_page_list_lock);
> + balloon_page_push(&vb->corrupted_page_list, page);
> + spin_unlock(&vb->recover_page_list_lock);
> +
> + out_vbr->vbr.cmd = VIRTIO_BALLOON_R_CMD_RECOVER;
> + set_page_pfns(vb, out_vbr->pfns, page);
> + sg_init_one(&sg, out_vbr, sizeof(*out_vbr));
> +
> + spin_lock_irqsave(&vb->recover_vq_lock, flags);
> + err = virtqueue_add_outbuf(vb->recover_vq, &sg, 1, out_vbr, GFP_KERNEL);
> + if (unlikely(err)) {
> + spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
> + return NOTIFY_DONE;
> + }
> + virtqueue_kick(vb->recover_vq);
> + spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
> +
> + return NOTIFY_OK;
> +}
> +
> +static int recover_vq_get_response(struct virtio_balloon *vb)
> +{
> + struct __virtio_balloon_recover *in_vbr;
> + struct scatterlist sg;
> + unsigned long flags;
> + int err;
> +
> + spin_lock_irqsave(&vb->recover_vq_lock, flags);
> + in_vbr = &vb->in_vbr;
> + memset(in_vbr, 0x00, sizeof(*in_vbr));
> + sg_init_one(&sg, in_vbr, sizeof(*in_vbr));
> + err = virtqueue_add_inbuf(vb->recover_vq, &sg, 1, in_vbr, GFP_KERNEL);
> + if (unlikely(err)) {
> + spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
> + return err;
> + }
> +
> + virtqueue_kick(vb->recover_vq);
> + spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
> +
> + return 0;
> +}
> +
> +static void recover_vq_handle_response(struct virtio_balloon *vb, unsigned int len)
> +{
> + struct __virtio_balloon_recover *in_vbr;
> + struct virtio_balloon_recover *vbr;
> + struct page *page;
> + unsigned int pfns;
> + u32 pfn0, pfn1;
> + __u8 status;
> +
> + /* the response is not expected */
> + if (unlikely(len != sizeof(struct __virtio_balloon_recover)))
> + return;
> +
> + in_vbr = &vb->in_vbr;
> + vbr = &in_vbr->vbr;
> + if (unlikely(vbr->cmd != VIRTIO_BALLOON_R_CMD_RESPONSE))
> + return;
> +
> + /* to make sure the contiguous balloon PFNs */
> + for (pfns = 1; pfns < VIRTIO_BALLOON_PAGES_PER_PAGE; pfns++) {
> + pfn0 = virtio32_to_cpu(vb->vdev, in_vbr->pfns[pfns - 1]);
> + pfn1 = virtio32_to_cpu(vb->vdev, in_vbr->pfns[pfns]);
> + if (pfn1 - pfn0 != 1)
> + return;
> + }
> +
> + pfn0 = virtio32_to_cpu(vb->vdev, in_vbr->pfns[0]);
> + if (!pfn_valid(pfn0))
> + return;
> +
> + pfn1 = -1;
> + spin_lock(&vb->recover_page_list_lock);
> + list_for_each_entry(page, &vb->corrupted_page_list, lru) {
> + pfn1 = page_to_pfn(page);
> + if (pfn1 == pfn0)
> + break;
> + }
> + spin_unlock(&vb->recover_page_list_lock);
> +
> + status = vbr->status;
> + switch (status) {
> + case VIRTIO_BALLOON_R_STATUS_RECOVERED:
> + if (pfn1 == pfn0) {
> + spin_lock(&vb->recover_page_list_lock);
> + list_del(&page->lru);
> + balloon_page_push(&vb->recovered_page_list, page);
> + spin_unlock(&vb->recover_page_list_lock);
> + queue_work(system_freezable_wq, &vb->unpoison_memory_work);
> + dev_info_ratelimited(&vb->vdev->dev, "recovered pfn 0x%x", pfn0);
> + }
> + break;
> + case VIRTIO_BALLOON_R_STATUS_FAILED:
> + /* the hypervisor can't fix this corrupted page, balloon puts page */
> + if (pfn1 == pfn0) {
> + spin_lock(&vb->recover_page_list_lock);
> + list_del(&page->lru);
> + spin_unlock(&vb->recover_page_list_lock);
> + put_page(page);
> + dev_info_ratelimited(&vb->vdev->dev, "failed to recover pfn 0x%x", pfn0);
> + }
> + default:
> + break;
> + };
> +
> + /* continue to get response from host side if the response gets handled successfully */
> + recover_vq_get_response(vb);
> +}
> +
> +static void unpoison_memory_func(struct work_struct *work)
> +{
> + struct virtio_balloon *vb;
> + struct page *page;
> +
> + vb = container_of(work, struct virtio_balloon, unpoison_memory_work);
> +
> + do {
> + spin_lock(&vb->recover_page_list_lock);
> + page = list_first_entry_or_null(&vb->recovered_page_list,
> + struct page, lru);
> + if (page)
> + list_del(&page->lru);
> + spin_unlock(&vb->recover_page_list_lock);
> +
> + if (page) {
> + put_page(page);
> + unpoison_memory(page_to_pfn(page), true, "virtio-balloon");
> + }
> + } while (page);
> +}
> +
> +static void recover_vq_cb(struct virtqueue *vq)
> +{
> + struct virtio_balloon *vb = vq->vdev->priv;
> + struct __virtio_balloon_recover *vbr;
> + unsigned long flags;
> + unsigned int len;
> +
> + spin_lock_irqsave(&vb->recover_vq_lock, flags);
> + do {
> + virtqueue_disable_cb(vq);
> + while ((vbr = virtqueue_get_buf(vq, &len)) != NULL) {
> + spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
> + if (vbr == &vb->in_vbr)
> + recover_vq_handle_response(vb, len);
> + else
> + kfree(vbr); /* just free the memory for out vbr request */
> + spin_lock_irqsave(&vb->recover_vq_lock, flags);
> + }
> + } while (!virtqueue_enable_cb(vq));
> + spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
> +}
> +
> static int init_vqs(struct virtio_balloon *vb)
> {
> struct virtqueue *vqs[VIRTIO_BALLOON_VQ_MAX];
> @@ -515,6 +724,7 @@ static int init_vqs(struct virtio_balloon *vb)
> callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;
> + names[VIRTIO_BALLOON_VQ_RECOVER] = NULL;
>
> if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> @@ -531,6 +741,11 @@ static int init_vqs(struct virtio_balloon *vb)
> callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;
> }
>
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_RECOVER)) {
> + names[VIRTIO_BALLOON_VQ_RECOVER] = "recover_vq";
> + callbacks[VIRTIO_BALLOON_VQ_RECOVER] = recover_vq_cb;
> + }
> +
> err = virtio_find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX, vqs,
> callbacks, names, NULL);
> if (err)
> @@ -566,6 +781,9 @@ static int init_vqs(struct virtio_balloon *vb)
> if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
> vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
>
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_RECOVER))
> + vb->recover_vq = vqs[VIRTIO_BALLOON_VQ_RECOVER];
> +
> return 0;
> }
>
> @@ -1015,12 +1233,31 @@ static int virtballoon_probe(struct virtio_device *vdev)
> goto out_unregister_oom;
> }
>
> + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_RECOVER)) {
> + err = recover_vq_get_response(vb);
> + if (err)
> + goto out_unregister_reporting;
> +
> + vb->memory_failure_nb.notifier_call = virtballoon_memory_failure;
> + spin_lock_init(&vb->recover_page_list_lock);
> + spin_lock_init(&vb->recover_vq_lock);
> + INIT_LIST_HEAD(&vb->corrupted_page_list);
> + INIT_LIST_HEAD(&vb->recovered_page_list);
> + INIT_WORK(&vb->unpoison_memory_work, unpoison_memory_func);
> + err = register_memory_failure_notifier(&vb->memory_failure_nb);
> + if (err)
> + goto out_unregister_reporting;
> + }
> +
> virtio_device_ready(vdev);
>
> if (towards_target(vb))
> virtballoon_changed(vdev);
> return 0;
>
> +out_unregister_reporting:
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
> + page_reporting_unregister(&vb->pr_dev_info);
> out_unregister_oom:
> if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> unregister_oom_notifier(&vb->oom_nb);
> @@ -1082,6 +1319,11 @@ static void virtballoon_remove(struct virtio_device *vdev)
> destroy_workqueue(vb->balloon_wq);
> }
>
> + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_RECOVER)) {
> + unregister_memory_failure_notifier(&vb->memory_failure_nb);
> + cancel_work_sync(&vb->unpoison_memory_work);
> + }
> +
> remove_common(vb);
> #ifdef CONFIG_BALLOON_COMPACTION
> if (vb->vb_dev_info.inode)
> @@ -1147,6 +1389,7 @@ static unsigned int features[] = {
> VIRTIO_BALLOON_F_FREE_PAGE_HINT,
> VIRTIO_BALLOON_F_PAGE_POISON,
> VIRTIO_BALLOON_F_REPORTING,
> + VIRTIO_BALLOON_F_RECOVER,
> };
>
> static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index ddaa45e723c4..41d0ffa2fb54 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -37,6 +37,7 @@
> #define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
> #define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
> #define VIRTIO_BALLOON_F_REPORTING 5 /* Page reporting virtqueue */
> +#define VIRTIO_BALLOON_F_RECOVER 6 /* Memory recover virtqueue */
>
> /* Size of a PFN in the balloon interface. */
> #define VIRTIO_BALLOON_PFN_SHIFT 12

Please get this feature recorded in the spec with the virtio TC.
They will also ask you to supply minimal documentation.



> @@ -59,6 +60,8 @@ struct virtio_balloon_config {
> };
> /* Stores PAGE_POISON if page poisoning is in use */
> __le32 poison_val;
> + /* Number of hardware corrupted pages, guest read only */
> + __le32 corrupted_pages;
> };
>
> #define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */
> @@ -116,4 +119,17 @@ struct virtio_balloon_stat {
> __virtio64 val;
> } __attribute__((packed));
>
> +#define VIRTIO_BALLOON_R_CMD_RECOVER 0
> +#define VIRTIO_BALLOON_R_CMD_RESPONSE 0x80
> +
> +#define VIRTIO_BALLOON_R_STATUS_CORRUPTED 0
> +#define VIRTIO_BALLOON_R_STATUS_RECOVERED 1
> +#define VIRTIO_BALLOON_R_STATUS_FAILED 2
> +
> +struct virtio_balloon_recover {
> + __u8 cmd;
> + __u8 status;
> + __u8 padding[6];
> +};
> +
> #endif /* _LINUX_VIRTIO_BALLOON_H */
> --
> 2.20.1


2022-05-28 18:30:19

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon

On Wed, May 25, 2022 at 01:16:34PM -0700, Jue Wang wrote:
> The hypervisor _must_ emulate poisons identified in guest physical
> address space (could be transported from the source VM), this is to
> prevent silent data corruption in the guest. With a paravirtual
> approach like this patch series, the hypervisor can clear some of the
> poisoned HVAs knowing for certain that the guest OS has isolated the
> poisoned page. I wonder how much value it provides to the guest if the
> guest and workload are _not_ in a pressing need for the extra KB/MB
> worth of memory.

I'm curious the same on how unpoisoning could help here. The reasoning
behind would be great material to be mentioned in the next cover letter.

Shouldn't we consider migrating serious workloads off the host already
where there's a sign of more severe hardware issues, instead?

Thanks,

--
Peter Xu


2022-05-28 20:39:32

by zhenwei pi

[permalink] [raw]
Subject: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon

Hi, Andrew & Naoya

I would appreciate it if you could give me any hints about the
memory/memory-failure changes!

On 5/20/22 15:06, zhenwei pi wrote:
> Hi,
>
> I'm trying to recover hardware corrupted page by virtio balloon, the
> workflow of this feature like this:
>
> Guest          5.MF -> 6.RVQ FE            10.Unpoison page
>                  /               \            /
> -------------------+-------------+----------+-----------
>                    |             |          |
>                 4.MCE        7.RVQ BE   9.RVQ Event
> QEMU            /                 \       /
>             3.SIGBUS            8.Remap
>               /
> ----------------+------------------------------------
>                 |
>            +--2.MF
> Host      /
>          1.HW error
>
> 1, HardWare page error occurs randomly.
> 2, host side handles corrupted page by Memory Failure mechanism, sends
> SIGBUS to the user process if early-kill is enabled.
> 3, QEMU handles SIGBUS, if the address belongs to guest RAM, then:
> 4, QEMU tries to inject MCE into guest.
> 5, guest handles memory failure again.
>
> 1-5 is already supported for a long time, the next steps are supported
> in this patch(also related driver patch):
>
> 6, guest balloon driver gets notified of the corrupted PFN, and sends
> request to host side by Recover VQ FrontEnd.
> 7, QEMU handles request from Recover VQ BackEnd, then:
> 8, QEMU remaps the corrupted HVA to fix the memory failure, then:
> 9, QEMU acks the guest side the result by Recover VQ.
> 10, guest unpoisons the page if the corrupted page gets recovered
> successfully.
>
> Test:
> This patch set can be tested with QEMU(also in developing):
> https://github.com/pizhenwei/qemu/tree/balloon-recover
>
> Emulate MCE by QEMU(guest RAM normal page only, hugepage is not supported):
> virsh qemu-monitor-command vm --hmp mce 0 9 0xbd000000000000c0 0xd 0x61646678 0x8c
>
> The guest works fine(on Intel Platinum 8260):
> mce: [Hardware Error]: Machine check events logged
> Memory failure: 0x61646: recovery action for dirty LRU page: Recovered
> virtio_balloon virtio5: recovered pfn 0x61646
> Unpoison: Unpoisoned page 0x61646 by virtio-balloon
> MCE: Killing stress:24502 due to hardware memory corruption fault at 7f5be2e5a010
>
> And the 'HardwareCorrupted' in /proc/meminfo also shows 0 kB.
>
> About the protocol of virtio balloon recover VQ, it's undefined and in
> developing currently:
> - 'struct virtio_balloon_recover' defines the structure which is used to
> exchange message between guest and host.
> - '__le32 corrupted_pages' in struct virtio_balloon_config is used in the next
> step:
> 1, a VM uses RAM of 2M huge page, once a MCE occurs, the 2M becomes
> unaccessible. Reporting 512 * 4K 'corrupted_pages' to the guest, the guest
> has a chance to isolate the 512 pages ahead of time.
>
> 2, after migrating to another host, the corrupted pages are actually recovered,
> once the guest gets the 'corrupted_pages' with 0, then the guest could
> unpoison all the poisoned pages which are recorded in the balloon driver.
>
> zhenwei pi (3):
> memory-failure: Introduce memory failure notifier
> mm/memory-failure.c: support reset PTE during unpoison
> virtio_balloon: Introduce memory recover
>
> drivers/virtio/virtio_balloon.c | 243 ++++++++++++++++++++++++++++
> include/linux/mm.h | 4 +-
> include/uapi/linux/virtio_balloon.h | 16 ++
> mm/hwpoison-inject.c | 2 +-
> mm/memory-failure.c | 59 ++++++-
> 5 files changed, 315 insertions(+), 9 deletions(-)
>

--
zhenwei pi

2022-05-30 11:07:41

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon

On 27.05.22 08:32, zhenwei pi wrote:
> On 5/27/22 02:37, Peter Xu wrote:
>> On Wed, May 25, 2022 at 01:16:34PM -0700, Jue Wang wrote:
>>> The hypervisor _must_ emulate poisons identified in guest physical
>>> address space (could be transported from the source VM), this is to
>>> prevent silent data corruption in the guest. With a paravirtual
>>> approach like this patch series, the hypervisor can clear some of the
>>> poisoned HVAs knowing for certain that the guest OS has isolated the
>>> poisoned page. I wonder how much value it provides to the guest if the
>>> guest and workload are _not_ in a pressing need for the extra KB/MB
>>> worth of memory.
>>
>> I'm curious the same on how unpoisoning could help here. The reasoning
>> behind would be great material to be mentioned in the next cover letter.
>>
>> Shouldn't we consider migrating serious workloads off the host already
>> where there's a sign of more severe hardware issues, instead?
>>
>> Thanks,
>>
>
> I'm maintaining 1000,000+ virtual machines, from my experience:
> UE is quite unusual and occurs randomly, and I did not hit UE storm case
> in the past years. The memory also has no obvious performance drop after
> hitting UE.
>
> I hit several CE storm case, the performance memory drops a lot. But I
> can't find obvious relationship between UE and CE.
>
> So from the point of my view, to fix the corrupted page for VM seems
> good enough. And yes, unpoisoning several pages does not help
> significantly, but it is still a chance to make the virtualization better.
>

I'm curious why we should care about resurrecting a handful of poisoned
pages in a VM. The cover letter doesn't touch on that.

IOW, I'm missing the motivation why we should add additional
code+complexity to unpoison pages at all.

If we're talking about individual 4k pages, it's certainly sub-optimal,
but does it matter in practice? I could understand if we're losing
megabytes of memory. But then, I assume the workload might be seriously
harmed either way already?


I assume when talking about "the performance memory drops a lot", you
imply that this patch set can mitigate that performance drop?

But why do you see a performance drop? Because we might lose some
possible THP candidates (in the host or the guest) and you want to plug
those holes? I assume you'll see a performance drop simply because
poisoning memory is expensive, including migrating pages around on CE.

If you have some numbers to share, especially before/after this change,
that would be great.

--
Thanks,

David / dhildenb


2022-05-30 13:28:51

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/memory-failure.c: support reset PTE during unpoison

On 20.05.22 09:06, zhenwei pi wrote:
> Originally, unpoison_memory() is only used by hwpoison-inject, and
> unpoisons a page which is poisoned by hwpoison-inject too. The kernel PTE
> entry has no change during software poison/unpoison.
>
> On a virtualization platform, it's possible to fix hardware corrupted page
> by hypervisor, typically the hypervisor remaps the error HVA(host virtual
> address). So add a new parameter 'const char *reason' to show the reason
> called by.
>
> Once the corrupted page gets fixed, the guest kernel needs put page to
> buddy. Reuse the page and hit the following issue(Intel Platinum 8260):
> BUG: unable to handle page fault for address: ffff888061646000
> #PF: supervisor write access in kernel mode
> #PF: error_code(0x0002) - not-present page
> PGD 2c01067 P4D 2c01067 PUD 61aaa063 PMD 10089b063 PTE 800fffff9e9b9062
> Oops: 0002 [#1] PREEMPT SMP NOPTI
> CPU: 2 PID: 31106 Comm: stress Kdump: loaded Tainted: G M OE 5.18.0-rc6.bm.1-amd64 #6
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> RIP: 0010:clear_page_erms+0x7/0x10
>
> The kernel PTE entry of the fixed page is still uncorrected, kernel hits
> page fault during prep_new_page. So add 'bool reset_kpte' to get a change
> to fix the PTE entry if the page is fixed by hypervisor.

Why don't we want to do that for the hwpoison case?

>
> Signed-off-by: zhenwei pi <[email protected]>
> ---
> include/linux/mm.h | 2 +-
> mm/hwpoison-inject.c | 2 +-
> mm/memory-failure.c | 26 +++++++++++++++++++-------
> 3 files changed, 21 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 665873c2788c..7ba210e86401 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3191,7 +3191,7 @@ enum mf_flags {
> extern int memory_failure(unsigned long pfn, int flags);
> extern void memory_failure_queue(unsigned long pfn, int flags);
> extern void memory_failure_queue_kick(int cpu);
> -extern int unpoison_memory(unsigned long pfn);
> +extern int unpoison_memory(unsigned long pfn, bool reset_kpte, const char *reason);
> extern int sysctl_memory_failure_early_kill;
> extern int sysctl_memory_failure_recovery;
> extern void shake_page(struct page *p);
> diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
> index 5c0cddd81505..0dd17ba98ade 100644
> --- a/mm/hwpoison-inject.c
> +++ b/mm/hwpoison-inject.c
> @@ -57,7 +57,7 @@ static int hwpoison_unpoison(void *data, u64 val)
> if (!capable(CAP_SYS_ADMIN))
> return -EPERM;
>
> - return unpoison_memory(val);
> + return unpoison_memory(val, false, "hwpoison-inject");

s/hwpoison-inject/hwpoison/

or maybe

s/hwpoison-inject/debugfs/

> }
>
> DEFINE_DEBUGFS_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n");
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 95c218bb0a37..a46de3be1dd7 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2132,21 +2132,26 @@ core_initcall(memory_failure_init);
> /**
> * unpoison_memory - Unpoison a previously poisoned page
> * @pfn: Page number of the to be unpoisoned page
> + * @reset_kpte: Reset the PTE entry for kmap
> + * @reason: The callers tells why unpoisoning the page
> *
> - * Software-unpoison a page that has been poisoned by
> - * memory_failure() earlier.
> + * Unpoison a page that has been poisoned by memory_failure() earlier.
> *
> - * This is only done on the software-level, so it only works
> - * for linux injected failures, not real hardware failures
> + * For linux injected failures, there is no need to reset PTE entry.
> + * It's possible to fix hardware memory failure on a virtualization platform,
> + * once hypervisor fixes the failure, guest needs put page back to buddy and
> + * reset the PTE entry in kernel.

Why can't we do this unconditionally? Just check if the PTE is poisoned,
and if so, reset it.

--
Thanks,

David / dhildenb


2022-05-30 13:48:44

by zhenwei pi

[permalink] [raw]
Subject: Re: Re: [PATCH 2/3] mm/memory-failure.c: support reset PTE during unpoison



On 5/30/22 13:02, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Fri, May 20, 2022 at 03:06:47PM +0800, zhenwei pi wrote:
>> Originally, unpoison_memory() is only used by hwpoison-inject, and
>> unpoisons a page which was poisoned by hwpoison-inject too. The kernel PTE
>> entry does not change during software poison/unpoison.
>>
>> On a virtualization platform, it's possible for the hypervisor to fix a
>> hardware corrupted page; typically the hypervisor remaps the faulty HVA (host
>> virtual address). So add a new parameter 'const char *reason' to record why
>> the caller is unpoisoning.
>>
>> Once the corrupted page gets fixed, the guest kernel needs to put the page
>> back to the buddy allocator. Reusing the page then hits the following issue (Intel Platinum 8260):
>> BUG: unable to handle page fault for address: ffff888061646000
>> #PF: supervisor write access in kernel mode
>> #PF: error_code(0x0002) - not-present page
>> PGD 2c01067 P4D 2c01067 PUD 61aaa063 PMD 10089b063 PTE 800fffff9e9b9062
>> Oops: 0002 [#1] PREEMPT SMP NOPTI
>> CPU: 2 PID: 31106 Comm: stress Kdump: loaded Tainted: G M OE 5.18.0-rc6.bm.1-amd64 #6
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
>> RIP: 0010:clear_page_erms+0x7/0x10
>>
>> The kernel PTE entry of the fixed page is still not restored, so the kernel
>> hits a page fault during prep_new_page. Add 'bool reset_kpte' to get a chance
>> to fix the PTE entry if the page is fixed by the hypervisor.
>>
>> Signed-off-by: zhenwei pi <[email protected]>
>> ---
>> include/linux/mm.h | 2 +-
>> mm/hwpoison-inject.c | 2 +-
>> mm/memory-failure.c | 26 +++++++++++++++++++-------
>> 3 files changed, 21 insertions(+), 9 deletions(-)
>>
>
> Do you need undoing rate limiting here? In the original unpoison's usage,
> avoiding flood of "Unpoison: Software-unpoisoned page" messages is helpful.
>
> And unpoison seems to be called from virtio-balloon multiple times when
> the backend is 2MB hugepages. If it's right, printing out 512 lines of
> "Unpoison: Unpoisoned page 0xXXX by virtio-balloon" messages might not be
> so helpful?
>

All the suggestions (including '[PATCH 1/3] memory-failure: Introduce
memory failure notifier') are reasonable; I'll fix them in the next
version. Thanks a lot!


--
zhenwei pi

2022-05-30 13:54:04

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 3/3] virtio_balloon: Introduce memory recover

On 25.05.22 01:32, zhenwei pi wrote:
>
>
> On 5/25/22 03:35, Sean Christopherson wrote:
>> On Fri, May 20, 2022, zhenwei pi wrote:
>>> @@ -59,6 +60,12 @@ enum virtio_balloon_config_read {
>>> VIRTIO_BALLOON_CONFIG_READ_CMD_ID = 0,
>>> };
>>>
>>> +/* the request body to communicate with host side */
>>> +struct __virtio_balloon_recover {
>>> + struct virtio_balloon_recover vbr;
>>> + __virtio32 pfns[VIRTIO_BALLOON_PAGES_PER_PAGE];
>>
>> I assume this is copied from virtio_balloon.pfns, which also uses __virtio32, but
>> isn't that horribly broken? PFNs are 'unsigned long', i.e. 64 bits on 64-bit kernels.
>> x86-64 at least most definitely generates 64-bit PFNs. Unless there's magic I'm
>> missing, page_to_balloon_pfn() will truncate PFNs and feed the host bad info.
>>
>
> Yes, I also noticed this point. I suppose the balloon device cannot
> work on a virtual machine whose physical addresses exceed 16 TiB.

Yes, that's a historical artifact and we never ran into it in practice
-- because 16TB VMs are still rare, especially when paired with
virtio-balloon inflation/deflation. Most probably the guest should just
stop inflating when hitting such a big PFN. In the future, we might want
a proper sg interface instead.

--
Thanks,

David / dhildenb


Subject: Re: [PATCH 2/3] mm/memory-failure.c: support reset PTE during unpoison

On Fri, May 20, 2022 at 03:06:47PM +0800, zhenwei pi wrote:
> Originally, unpoison_memory() is only used by hwpoison-inject, and
> unpoisons a page which was poisoned by hwpoison-inject too. The kernel PTE
> entry does not change during software poison/unpoison.
>
> On a virtualization platform, it's possible for the hypervisor to fix a
> hardware corrupted page; typically the hypervisor remaps the faulty HVA (host
> virtual address). So add a new parameter 'const char *reason' to record why
> the caller is unpoisoning.
>
> Once the corrupted page gets fixed, the guest kernel needs to put the page
> back to the buddy allocator. Reusing the page then hits the following issue (Intel Platinum 8260):
> BUG: unable to handle page fault for address: ffff888061646000
> #PF: supervisor write access in kernel mode
> #PF: error_code(0x0002) - not-present page
> PGD 2c01067 P4D 2c01067 PUD 61aaa063 PMD 10089b063 PTE 800fffff9e9b9062
> Oops: 0002 [#1] PREEMPT SMP NOPTI
> CPU: 2 PID: 31106 Comm: stress Kdump: loaded Tainted: G M OE 5.18.0-rc6.bm.1-amd64 #6
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> RIP: 0010:clear_page_erms+0x7/0x10
>
> The kernel PTE entry of the fixed page is still not restored, so the kernel
> hits a page fault during prep_new_page. Add 'bool reset_kpte' to get a chance
> to fix the PTE entry if the page is fixed by the hypervisor.
>
> Signed-off-by: zhenwei pi <[email protected]>
> ---
> include/linux/mm.h | 2 +-
> mm/hwpoison-inject.c | 2 +-
> mm/memory-failure.c | 26 +++++++++++++++++++-------
> 3 files changed, 21 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 665873c2788c..7ba210e86401 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3191,7 +3191,7 @@ enum mf_flags {
> extern int memory_failure(unsigned long pfn, int flags);
> extern void memory_failure_queue(unsigned long pfn, int flags);
> extern void memory_failure_queue_kick(int cpu);
> -extern int unpoison_memory(unsigned long pfn);
> +extern int unpoison_memory(unsigned long pfn, bool reset_kpte, const char *reason);
> extern int sysctl_memory_failure_early_kill;
> extern int sysctl_memory_failure_recovery;
> extern void shake_page(struct page *p);
> diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
> index 5c0cddd81505..0dd17ba98ade 100644
> --- a/mm/hwpoison-inject.c
> +++ b/mm/hwpoison-inject.c
> @@ -57,7 +57,7 @@ static int hwpoison_unpoison(void *data, u64 val)
> if (!capable(CAP_SYS_ADMIN))
> return -EPERM;
>
> - return unpoison_memory(val);
> + return unpoison_memory(val, false, "hwpoison-inject");
> }
>
> DEFINE_DEBUGFS_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n");
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 95c218bb0a37..a46de3be1dd7 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2132,21 +2132,26 @@ core_initcall(memory_failure_init);
> /**
> * unpoison_memory - Unpoison a previously poisoned page
> * @pfn: Page number of the to be unpoisoned page
> + * @reset_kpte: Reset the PTE entry for kmap
> + * @reason: The caller tells why it is unpoisoning the page
> *
> - * Software-unpoison a page that has been poisoned by
> - * memory_failure() earlier.
> + * Unpoison a page that has been poisoned by memory_failure() earlier.
> *
> - * This is only done on the software-level, so it only works
> - * for linux injected failures, not real hardware failures
> + * For linux injected failures, there is no need to reset the PTE entry.
> + * It's possible to fix a hardware memory failure on a virtualization platform;
> + * once the hypervisor fixes the failure, the guest needs to put the page back
> + * to the buddy allocator and reset the PTE entry in the kernel.
> *
> * Returns 0 for success, otherwise -errno.
> */
> -int unpoison_memory(unsigned long pfn)
> +int unpoison_memory(unsigned long pfn, bool reset_kpte, const char *reason)
> {
> struct page *page;
> struct page *p;
> int ret = -EBUSY;
> int freeit = 0;
> + pte_t *kpte;
> + unsigned long addr;

These variables are used only in "if (reset_kpte)" block, so you can
move the definitions in it.

> static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL,
> DEFAULT_RATELIMIT_BURST);
>
> @@ -2208,8 +2213,15 @@ int unpoison_memory(unsigned long pfn)
> mutex_unlock(&mf_mutex);
> if (!ret || freeit) {
> num_poisoned_pages_dec();
> - unpoison_pr_info("Unpoison: Software-unpoisoned page %#lx\n",
> - page_to_pfn(p), &unpoison_rs);
> + pr_info("Unpoison: Unpoisoned page %#lx by %s\n",
> + page_to_pfn(p), reason);

Do you need undoing rate limiting here? In the original unpoison's usage,
avoiding flood of "Unpoison: Software-unpoisoned page" messages is helpful.

And unpoison seems to be called from virtio-balloon multiple times when
the backend is 2MB hugepages. If it's right, printing out 512 lines of
"Unpoison: Unpoisoned page 0xXXX by virtio-balloon" messages might not be
so helpful?

Thanks,
Naoya Horiguchi

> + if (reset_kpte) {
> + preempt_disable();
> + addr = (unsigned long)page_to_virt(p);
> + kpte = virt_to_kpte(addr);
> + set_pte_at(&init_mm, addr, kpte, pfn_pte(pfn, PAGE_KERNEL));
> + preempt_enable();
> + }
> }
> return ret;
> }
> --
> 2.20.1

2022-05-31 10:10:53

by zhenwei pi

[permalink] [raw]
Subject: Re: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon



On 5/30/22 15:41, David Hildenbrand wrote:
> On 27.05.22 08:32, zhenwei pi wrote:
>> On 5/27/22 02:37, Peter Xu wrote:
>>> On Wed, May 25, 2022 at 01:16:34PM -0700, Jue Wang wrote:
>>>> The hypervisor _must_ emulate poisons identified in guest physical
>>>> address space (could be transported from the source VM), this is to
>>>> prevent silent data corruption in the guest. With a paravirtual
>>>> approach like this patch series, the hypervisor can clear some of the
>>>> poisoned HVAs knowing for certain that the guest OS has isolated the
>>>> poisoned page. I wonder how much value it provides to the guest if the
>>>> guest and workload are _not_ in a pressing need for the extra KB/MB
>>>> worth of memory.
>>>
>>> I'm curious the same on how unpoisoning could help here. The reasoning
>>> behind would be great material to be mentioned in the next cover letter.
>>>
>>> Shouldn't we consider migrating serious workloads off the host already
>>> where there's a sign of more severe hardware issues, instead?
>>>
>>> Thanks,
>>>
>>
>> I'm maintaining 1,000,000+ virtual machines; from my experience:
>> UE is quite unusual and occurs randomly, and I have not hit a UE storm case
>> in the past years. Memory also shows no obvious performance drop after
>> hitting a UE.
>>
>> I have hit several CE storm cases, where memory performance drops a lot. But I
>> can't find an obvious relationship between UE and CE.
>>
>> So from my point of view, fixing the corrupted page for the VM seems
>> good enough. And yes, unpoisoning several pages does not help
>> significantly, but it is still a chance to make virtualization better.
>>
>
> I'm curious why we should care about resurrecting a handful of poisoned
> pages in a VM. The cover letter doesn't touch on that.
>
> IOW, I'm missing the motivation why we should add additional
> code+complexity to unpoison pages at all.
>
> If we're talking about individual 4k pages, it's certainly sub-optimal,
> but does it matter in practice? I could understand if we're losing
> megabytes of memory. But then, I assume the workload might be seriously
> harmed either way already?
>

Yes, resurrecting a handful of poisoned pages does not help
significantly. And, in some ways, it seems nice to have. :D

A VM uses RAM backed by 2M huge pages. Once an MCE (@HVAy in [HVAx,HVAz))
occurs, the 2M range [HVAx,HVAz) of the hypervisor becomes inaccessible, but
the guest poisons only 4K (@GPAy in [GPAx,GPAz)); it may hit another 511 MCEs
([GPAx,GPAz) except GPAy). This is the worst case, so I want to add
'__le32 corrupted_pages' in struct virtio_balloon_config, to be used in the
next step: reporting 512 * 4K 'corrupted_pages' to the guest, so the guest
has a chance to isolate the other 511 pages ahead of time. Since the guest
actually loses the whole 2M, fixing 512*4K seems to help significantly.

>
> I assume when talking about "the performance memory drops a lot", you
> imply that this patch set can mitigate that performance drop?
>
> But why do you see a performance drop? Because we might lose some
> possible THP candidates (in the host or the guest) and you want to plug
> does holes? I assume you'll see a performance drop simply because
> poisoning memory is expensive, including migrating pages around on CE.
>
> If you have some numbers to share, especially before/after this change,
> that would be great.
>

A CE storm leads to 2 problems I have seen:
1, memory bandwidth slows down to 10%~20% of normal, and the cycles per
instruction of the CPU increase a lot.
2, the THR interrupt (/proc/interrupts) fires frequently, and the CPU has to
spend a lot of time handling IRQs.

But no corrupted page occurs. Migrating the VM to another healthy host seems
a good choice. This patch set does not handle the CE storm case.

--
zhenwei pi

2022-05-31 20:56:08

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 3/3] virtio_balloon: Introduce memory recover


> +
> struct virtio_balloon {
> struct virtio_device *vdev;
> struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> @@ -126,6 +133,16 @@ struct virtio_balloon {
> /* Free page reporting device */
> struct virtqueue *reporting_vq;
> struct page_reporting_dev_info pr_dev_info;
> +
> + /* Memory recover VQ - VIRTIO_BALLOON_F_RECOVER */
> + struct virtqueue *recover_vq;
> + spinlock_t recover_vq_lock;
> + struct notifier_block memory_failure_nb;
> + struct list_head corrupted_page_list;
> + struct list_head recovered_page_list;
> + spinlock_t recover_page_list_lock;
> + struct __virtio_balloon_recover in_vbr;
> + struct work_struct unpoison_memory_work;

I assume we want all that only with CONFIG_MEMORY_FAILURE.

> };
>
> static const struct virtio_device_id id_table[] = {
> @@ -494,6 +511,198 @@ static void update_balloon_size_func(struct work_struct *work)
> queue_work(system_freezable_wq, work);
> }
>
> +/*
> + * virtballoon_memory_failure - notified by memory failure, try to fix the
> + * corrupted page.
> + * The memory failure notifier is designed to call back when the kernel handled
> + * successfully only; WARN_ON_ONCE on the unlikely condition to find out any
> + * error (memory error handling is best effort, not 100% covered).
> + */
> +static int virtballoon_memory_failure(struct notifier_block *notifier,
> + unsigned long pfn, void *parm)
> +{
> + struct virtio_balloon *vb = container_of(notifier, struct virtio_balloon,
> + memory_failure_nb);
> + struct page *page;
> + struct __virtio_balloon_recover *out_vbr;
> + struct scatterlist sg;
> + unsigned long flags;
> + int err;
> +
> + page = pfn_to_online_page(pfn);
> + if (WARN_ON_ONCE(!page))
> + return NOTIFY_DONE;
> +
> + if (PageHuge(page))
> + return NOTIFY_DONE;
> +
> + if (WARN_ON_ONCE(!PageHWPoison(page)))
> + return NOTIFY_DONE;
> +
> + if (WARN_ON_ONCE(page_count(page) != 1))
> + return NOTIFY_DONE;

Relying on the page_count to be 1 for correctness is usually a bit
shaky, for example, when racing against isolate_movable_page() that
might temporarily bump up the refcount.

> +
> + get_page(page); /* balloon reference */
> +
> + out_vbr = kzalloc(sizeof(*out_vbr), GFP_KERNEL);

Are we always guaranteed to be able to use GFP_KERNEL out of MCE
context? (IOW, are we never atomic?)

> + if (WARN_ON_ONCE(!out_vbr))
> + return NOTIFY_BAD;
> +
> + spin_lock(&vb->recover_page_list_lock);
> + balloon_page_push(&vb->corrupted_page_list, page);
> + spin_unlock(&vb->recover_page_list_lock);
> +
> + out_vbr->vbr.cmd = VIRTIO_BALLOON_R_CMD_RECOVER;

This makes me wonder if we should have a more generic guest->host
request queue, similar to what e.g., virtio-mem uses, instead of adding
a separate VIRTIO_BALLOON_VQ_RECOVER vq.

> + set_page_pfns(vb, out_vbr->pfns, page);
> + sg_init_one(&sg, out_vbr, sizeof(*out_vbr));
> +
> + spin_lock_irqsave(&vb->recover_vq_lock, flags);
> + err = virtqueue_add_outbuf(vb->recover_vq, &sg, 1, out_vbr, GFP_KERNEL);
> + if (unlikely(err)) {
> + spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
> + return NOTIFY_DONE;
> + }
> + virtqueue_kick(vb->recover_vq);
> + spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
> +
> + return NOTIFY_OK;
> +}
> +
> +static int recover_vq_get_response(struct virtio_balloon *vb)
> +{
> + struct __virtio_balloon_recover *in_vbr;
> + struct scatterlist sg;
> + unsigned long flags;
> + int err;
> +
> + spin_lock_irqsave(&vb->recover_vq_lock, flags);
> + in_vbr = &vb->in_vbr;
> + memset(in_vbr, 0x00, sizeof(*in_vbr));
> + sg_init_one(&sg, in_vbr, sizeof(*in_vbr));
> + err = virtqueue_add_inbuf(vb->recover_vq, &sg, 1, in_vbr, GFP_KERNEL);
> + if (unlikely(err)) {
> + spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
> + return err;
> + }
> +
> + virtqueue_kick(vb->recover_vq);
> + spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
> +
> + return 0;
> +}
> +
> +static void recover_vq_handle_response(struct virtio_balloon *vb, unsigned int len)
> +{
> + struct __virtio_balloon_recover *in_vbr;
> + struct virtio_balloon_recover *vbr;
> + struct page *page;
> + unsigned int pfns;
> + u32 pfn0, pfn1;
> + __u8 status;
> +
> + /* the response is not expected */
> + if (unlikely(len != sizeof(struct __virtio_balloon_recover)))
> + return;
> +
> + in_vbr = &vb->in_vbr;
> + vbr = &in_vbr->vbr;
> + if (unlikely(vbr->cmd != VIRTIO_BALLOON_R_CMD_RESPONSE))
> + return;
> +
> + /* to make sure the contiguous balloon PFNs */
> + for (pfns = 1; pfns < VIRTIO_BALLOON_PAGES_PER_PAGE; pfns++) {
> + pfn0 = virtio32_to_cpu(vb->vdev, in_vbr->pfns[pfns - 1]);
> + pfn1 = virtio32_to_cpu(vb->vdev, in_vbr->pfns[pfns]);
> + if (pfn1 - pfn0 != 1)
> + return;

Yeah, we really shouldn't be dealing with (legacy) 4k PFNs here, but
instead, proper ranges I guess.

> + }
> +
> + pfn0 = virtio32_to_cpu(vb->vdev, in_vbr->pfns[0]);
> + if (!pfn_valid(pfn0))
> + return;
> +
> + pfn1 = -1;
> + spin_lock(&vb->recover_page_list_lock);
> + list_for_each_entry(page, &vb->corrupted_page_list, lru) {
> + pfn1 = page_to_pfn(page);
> + if (pfn1 == pfn0)
> + break;
> + }
> + spin_unlock(&vb->recover_page_list_lock);
> +
> + status = vbr->status;
> + switch (status) {
> + case VIRTIO_BALLOON_R_STATUS_RECOVERED:
> + if (pfn1 == pfn0) {
> + spin_lock(&vb->recover_page_list_lock);
> + list_del(&page->lru);
> + balloon_page_push(&vb->recovered_page_list, page);

We'd rather not reuse actual balloon functions in !balloon context. Just
move the page to the proper list directly.

> + spin_unlock(&vb->recover_page_list_lock);
> + queue_work(system_freezable_wq, &vb->unpoison_memory_work);
> + dev_info_ratelimited(&vb->vdev->dev, "recovered pfn 0x%x", pfn0);

Well, not yet. Shouldn't this go into unpoison_memory_func() ?

> + }
> + break;
> + case VIRTIO_BALLOON_R_STATUS_FAILED:
> + /* the hypervisor can't fix this corrupted page, balloon puts page */
> + if (pfn1 == pfn0) {
> + spin_lock(&vb->recover_page_list_lock);
> + list_del(&page->lru);
> + spin_unlock(&vb->recover_page_list_lock);
> + put_page(page);
> + dev_info_ratelimited(&vb->vdev->dev, "failed to recover pfn 0x%x", pfn0);
> + }
> + default:
> + break;
> + };
> +
> + /* continue to get response from host side if the response gets handled successfully */
> + recover_vq_get_response(vb);
> +}
> +
> +static void unpoison_memory_func(struct work_struct *work)
> +{
> + struct virtio_balloon *vb;
> + struct page *page;
> +
> + vb = container_of(work, struct virtio_balloon, unpoison_memory_work);
> +
> + do {
> + spin_lock(&vb->recover_page_list_lock);
> + page = list_first_entry_or_null(&vb->recovered_page_list,
> + struct page, lru);
> + if (page)
> + list_del(&page->lru);
> + spin_unlock(&vb->recover_page_list_lock);
> +
> + if (page) {
> + put_page(page);
> + unpoison_memory(page_to_pfn(page), true, "virtio-balloon");
> + }
> + } while (page);
> +}
> +
> +static void recover_vq_cb(struct virtqueue *vq)
> +{
> + struct virtio_balloon *vb = vq->vdev->priv;
> + struct __virtio_balloon_recover *vbr;
> + unsigned long flags;
> + unsigned int len;
> +
> + spin_lock_irqsave(&vb->recover_vq_lock, flags);
> + do {
> + virtqueue_disable_cb(vq);
> + while ((vbr = virtqueue_get_buf(vq, &len)) != NULL) {
> + spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
> + if (vbr == &vb->in_vbr)
> + recover_vq_handle_response(vb, len);
> + else
> + kfree(vbr); /* just free the memory for out vbr request */
> + spin_lock_irqsave(&vb->recover_vq_lock, flags);
> + }
> + } while (!virtqueue_enable_cb(vq));
> + spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
> +}
> +


[...]

>
> +out_unregister_reporting:
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
> + page_reporting_unregister(&vb->pr_dev_info);
> out_unregister_oom:
> if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> unregister_oom_notifier(&vb->oom_nb);
> @@ -1082,6 +1319,11 @@ static void virtballoon_remove(struct virtio_device *vdev)
> destroy_workqueue(vb->balloon_wq);
> }
>
> + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_RECOVER)) {

Could the notifier already have been triggered, so that we're using the
device from the notifier before it is fully initialized, and might we end
up leaking memory here that we allocated?

> + unregister_memory_failure_notifier(&vb->memory_failure_nb);
> + cancel_work_sync(&vb->unpoison_memory_work);
> + }
> +

Could we be leaking memory from the virtballoon_remove() path?

--
Thanks,

David / dhildenb


2022-05-31 22:47:34

by Jue Wang

[permalink] [raw]
Subject: Re: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon

On Mon, May 30, 2022 at 8:49 AM Peter Xu <[email protected]> wrote:
>
> On Mon, May 30, 2022 at 07:33:35PM +0800, zhenwei pi wrote:
> > A VM uses RAM backed by 2M huge pages. Once an MCE (@HVAy in [HVAx,HVAz)) occurs, the
> > 2M ([HVAx,HVAz)) of the hypervisor becomes inaccessible, but the guest poisons 4K
> > (@GPAy in [GPAx, GPAz)) only; it may hit another 511 MCEs ([GPAx, GPAz)
> > except GPAy). This is the worst case, so I want to add
> > '__le32 corrupted_pages' in struct virtio_balloon_config, it is used in the
> > next step: reporting 512 * 4K 'corrupted_pages' to the guest, the guest has
> > a chance to isolate the other 511 pages ahead of time. And the guest
> > actually loses 2M, fixing 512*4K seems to help significantly.
>
> It sounds hackish to teach a virtio device to assume one page will always
> be poisoned in huge page granule. That's only a limitation to host kernel
> not virtio itself.
>
> E.g. there're upstream effort ongoing with enabling doublemap on hugetlbfs
> pages so hugetlb pages can be mapped in 4k with it. It provides potential
> possibility to do page poisoning with huge pages in 4k too. When that'll
> be ready the assumption can go away, and that does sound like a better
> approach towards this problem.

+1.

A hypervisor should always strive to minimize the guest memory loss.

The HugeTLB double mapping enlightened memory poisoning behavior (only
poison 4K out of a 2MB huge page and 4K in guest) is a much better
solution here. To be completely transparent, it's not _strictly_
required to poison the page (whatever the granularity it is) on the
host side, as long as the following are true:

1. A hypervisor can emulate the _minimized_ (e.g., 4K) the poison to the guest.
2. The host page with the UC error is "isolated" (could be PG_HWPOISON
or in some other way) and prevented from being reused by other
processes.

For #2, PG_HWPOISON and HugeTLB double mapping enlightened memory
poisoning is a good solution.

>
> >
> > >
> > > I assume when talking about "the performance memory drops a lot", you
> > > imply that this patch set can mitigate that performance drop?
> > >
> > > But why do you see a performance drop? Because we might lose some
> > > possible THP candidates (in the host or the guest) and you want to plug
> > > does holes? I assume you'll see a performance drop simply because
> > > poisoning memory is expensive, including migrating pages around on CE.
> > >
> > > If you have some numbers to share, especially before/after this change,
> > > that would be great.
> > >
> >
> > The CE storm leads to 2 problems I have seen:
> > 1, memory bandwidth slows down to 10%~20% of normal, and the cycles per
> > instruction of the CPU increase a lot.
> > 2, the THR interrupt (/proc/interrupts) fires frequently, and the CPU has to
> > spend a lot of time handling IRQs.
>
> Totally no good knowledge on CMCI, but if 2) is true then I'm wondering
> whether it's necessary to handle the interrupts that frequently. When I
> was reading the Intel CMCI vector handler I stumbled over this comment:
>
> /*
> * The interrupt handler. This is called on every event.
> * Just call the poller directly to log any events.
> * This could in theory increase the threshold under high load,
> * but doesn't for now.
> */
> static void intel_threshold_interrupt(void)
>
> I think that matches with what I was thinking.. I mean for 2) not sure
> whether it can be seen as a CMCI problem and potentially can be optimized
> by adjusting the cmci threshold dynamically.

The performance drop during a CE storm is caused by the extra cycles
spent on ECC steps in the memory controller, not by CMCI handling.
This is observed in the Google fleet as well. A good solution is to
monitor the CE rate closely in user space via /dev/mcelog and migrate
all VMs to another host once the CE rate exceeds some threshold.

CMCI is a _background_ interrupt that is not handled in the process
execution context and its handler is setup to switch to poll (1 / 5
min) mode if there are more than ~ a dozen CEs reported via CMCI per
second.
>
> --
> Peter Xu
>

2022-06-01 10:17:03

by Peter Xu

[permalink] [raw]
Subject: Re: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon

On Mon, May 30, 2022 at 07:33:35PM +0800, zhenwei pi wrote:
> A VM uses RAM backed by 2M huge pages. Once an MCE (@HVAy in [HVAx,HVAz)) occurs, the
> 2M ([HVAx,HVAz)) of the hypervisor becomes inaccessible, but the guest poisons 4K
> (@GPAy in [GPAx, GPAz)) only; it may hit another 511 MCEs ([GPAx, GPAz)
> except GPAy). This is the worst case, so I want to add
> '__le32 corrupted_pages' in struct virtio_balloon_config, it is used in the
> next step: reporting 512 * 4K 'corrupted_pages' to the guest, the guest has
> a chance to isolate the other 511 pages ahead of time. And the guest
> actually loses 2M, fixing 512*4K seems to help significantly.

It sounds hackish to teach a virtio device to assume one page will always
be poisoned in huge page granule. That's only a limitation to host kernel
not virtio itself.

E.g. there're upstream effort ongoing with enabling doublemap on hugetlbfs
pages so hugetlb pages can be mapped in 4k with it. It provides potential
possibility to do page poisoning with huge pages in 4k too. When that'll
be ready the assumption can go away, and that does sound like a better
approach towards this problem.

>
> >
> > I assume when talking about "the performance memory drops a lot", you
> > imply that this patch set can mitigate that performance drop?
> >
> > But why do you see a performance drop? Because we might lose some
> > possible THP candidates (in the host or the guest) and you want to plug
> > does holes? I assume you'll see a performance drop simply because
> > poisoning memory is expensive, including migrating pages around on CE.
> >
> > If you have some numbers to share, especially before/after this change,
> > that would be great.
> >
>
> The CE storm leads to 2 problems I have seen:
> 1, memory bandwidth slows down to 10%~20% of normal, and the cycles per
> instruction of the CPU increase a lot.
> 2, the THR interrupt (/proc/interrupts) fires frequently, and the CPU has to
> spend a lot of time handling IRQs.

Totally no good knowledge on CMCI, but if 2) is true then I'm wondering
whether it's necessary to handle the interrupts that frequently. When I
was reading the Intel CMCI vector handler I stumbled over this comment:

/*
* The interrupt handler. This is called on every event.
* Just call the poller directly to log any events.
* This could in theory increase the threshold under high load,
* but doesn't for now.
*/
static void intel_threshold_interrupt(void)

I think that matches with what I was thinking.. I mean for 2) not sure
whether it can be seen as a CMCI problem and potentially can be optimized
by adjusting the cmci threshold dynamically.

--
Peter Xu


2022-06-01 17:21:06

by zhenwei pi

[permalink] [raw]
Subject: Re: Re: [PATCH 3/3] virtio_balloon: Introduce memory recover



On 5/30/22 15:48, David Hildenbrand wrote:
>
>> +
>> struct virtio_balloon {
>> struct virtio_device *vdev;
>> struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>> @@ -126,6 +133,16 @@ struct virtio_balloon {
>> /* Free page reporting device */
>> struct virtqueue *reporting_vq;
>> struct page_reporting_dev_info pr_dev_info;
>> +
>> + /* Memory recover VQ - VIRTIO_BALLOON_F_RECOVER */
>> + struct virtqueue *recover_vq;
>> + spinlock_t recover_vq_lock;
>> + struct notifier_block memory_failure_nb;
>> + struct list_head corrupted_page_list;
>> + struct list_head recovered_page_list;
>> + spinlock_t recover_page_list_lock;
>> + struct __virtio_balloon_recover in_vbr;
>> + struct work_struct unpoison_memory_work;
>
> I assume we want all that only with CONFIG_MEMORY_FAILURE.
>

Sorry, I missed this.

>> };
>>
>> static const struct virtio_device_id id_table[] = {
>> @@ -494,6 +511,198 @@ static void update_balloon_size_func(struct work_struct *work)
>> queue_work(system_freezable_wq, work);
>> }
>>
>> +/*
>> + * virtballoon_memory_failure - notified by memory failure, try to fix the
>> + * corrupted page.
>> + * The memory failure notifier is designed to call back when the kernel handled
>> + * successfully only; WARN_ON_ONCE on the unlikely condition to find out any
>> + * error (memory error handling is best effort, not 100% covered).
>> + */
>> +static int virtballoon_memory_failure(struct notifier_block *notifier,
>> + unsigned long pfn, void *parm)
>> +{
>> + struct virtio_balloon *vb = container_of(notifier, struct virtio_balloon,
>> + memory_failure_nb);
>> + struct page *page;
>> + struct __virtio_balloon_recover *out_vbr;
>> + struct scatterlist sg;
>> + unsigned long flags;
>> + int err;
>> +
>> + page = pfn_to_online_page(pfn);
>> + if (WARN_ON_ONCE(!page))
>> + return NOTIFY_DONE;
>> +
>> + if (PageHuge(page))
>> + return NOTIFY_DONE;
>> +
>> + if (WARN_ON_ONCE(!PageHWPoison(page)))
>> + return NOTIFY_DONE;
>> +
>> + if (WARN_ON_ONCE(page_count(page) != 1))
>> + return NOTIFY_DONE;
>
> Relying on the page_count to be 1 for correctness is usually a bit
> shaky, for example, when racing against isolate_movable_page() that
> might temporarily bump up the refcount.
>

The memory failure notifier is designed to call the chain only if a page
gets the result MF_RECOVERED:
if (result == MF_RECOVERED)
blocking_notifier_call_chain(&mf_notifier_list, pfn, NULL);


>> +
>> + get_page(page); /* balloon reference */
>> +
>> + out_vbr = kzalloc(sizeof(*out_vbr), GFP_KERNEL);
>
> Are we always guaranteed to be able to use GFP_KERNEL out of MCE
> context? (IOW, are we never atomic?)
>
>> + if (WARN_ON_ONCE(!out_vbr))
>> + return NOTIFY_BAD;
>> +
>> + spin_lock(&vb->recover_page_list_lock);
>> + balloon_page_push(&vb->corrupted_page_list, page);
>> + spin_unlock(&vb->recover_page_list_lock);
>> +
>> + out_vbr->vbr.cmd = VIRTIO_BALLOON_R_CMD_RECOVER;
>
> This makes me wonder if we should have a more generic guest->host
> request queue, similar to what e.g., virtio-mem uses, instead of adding
> a separate VIRTIO_BALLOON_VQ_RECOVER vq.
>

I'm OK with either one, I'll follow your decision! :D

>> + set_page_pfns(vb, out_vbr->pfns, page);
>> + sg_init_one(&sg, out_vbr, sizeof(*out_vbr));
>> +
>> + spin_lock_irqsave(&vb->recover_vq_lock, flags);
>> + err = virtqueue_add_outbuf(vb->recover_vq, &sg, 1, out_vbr, GFP_KERNEL);
>> + if (unlikely(err)) {
>> + spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
>> + return NOTIFY_DONE;
>> + }
>> + virtqueue_kick(vb->recover_vq);
>> + spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
>> +
>> + return NOTIFY_OK;
>> +}
>> +
>> +static int recover_vq_get_response(struct virtio_balloon *vb)
>> +{
>> +	struct __virtio_balloon_recover *in_vbr;
>> +	struct scatterlist sg;
>> +	unsigned long flags;
>> +	int err;
>> +
>> +	spin_lock_irqsave(&vb->recover_vq_lock, flags);
>> +	in_vbr = &vb->in_vbr;
>> +	memset(in_vbr, 0x00, sizeof(*in_vbr));
>> +	sg_init_one(&sg, in_vbr, sizeof(*in_vbr));
>> +	err = virtqueue_add_inbuf(vb->recover_vq, &sg, 1, in_vbr, GFP_KERNEL);
>> +	if (unlikely(err)) {
>> +		spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
>> +		return err;
>> +	}
>> +
>> +	virtqueue_kick(vb->recover_vq);
>> +	spin_unlock_irqrestore(&vb->recover_vq_lock, flags);
>> +
>> +	return 0;
>> +}
>> +
>> +static void recover_vq_handle_response(struct virtio_balloon *vb, unsigned int len)
>> +{
>> +	struct __virtio_balloon_recover *in_vbr;
>> +	struct virtio_balloon_recover *vbr;
>> +	struct page *page;
>> +	unsigned int pfns;
>> +	u32 pfn0, pfn1;
>> +	__u8 status;
>> +
>> +	/* the response is not expected */
>> +	if (unlikely(len != sizeof(struct __virtio_balloon_recover)))
>> +		return;
>> +
>> +	in_vbr = &vb->in_vbr;
>> +	vbr = &in_vbr->vbr;
>> +	if (unlikely(vbr->cmd != VIRTIO_BALLOON_R_CMD_RESPONSE))
>> +		return;
>> +
>> +	/* to make sure the contiguous balloon PFNs */
>> +	for (pfns = 1; pfns < VIRTIO_BALLOON_PAGES_PER_PAGE; pfns++) {
>> +		pfn0 = virtio32_to_cpu(vb->vdev, in_vbr->pfns[pfns - 1]);
>> +		pfn1 = virtio32_to_cpu(vb->vdev, in_vbr->pfns[pfns]);
>> +		if (pfn1 - pfn0 != 1)
>> +			return;
>
> Yeah, we really shouldn't be dealing with (legacy) 4k PFNs here, but
> instead, proper ranges I guess.
>

MST also pointed this out; I explained it in this link:
https://lkml.org/lkml/2022/5/26/942

Rather than the page-reporting style, the virtio-mem style should be
fine, e.g.:

struct virtio_memory_recover {
	__virtio64 addr;
	__virtio32 length;
	__virtio16 padding[2];
};

>> +	}
>> +
>> +	pfn0 = virtio32_to_cpu(vb->vdev, in_vbr->pfns[0]);
>> +	if (!pfn_valid(pfn0))
>> +		return;
>> +
>> +	pfn1 = -1;
>> +	spin_lock(&vb->recover_page_list_lock);
>> +	list_for_each_entry(page, &vb->corrupted_page_list, lru) {
>> +		pfn1 = page_to_pfn(page);
>> +		if (pfn1 == pfn0)
>> +			break;
>> +	}
>> +	spin_unlock(&vb->recover_page_list_lock);
>> +
>> +	status = vbr->status;
>> +	switch (status) {
>> +	case VIRTIO_BALLOON_R_STATUS_RECOVERED:
>> +		if (pfn1 == pfn0) {
>> +			spin_lock(&vb->recover_page_list_lock);
>> +			list_del(&page->lru);
>> +			balloon_page_push(&vb->recovered_page_list, page);
>
> We rather not reuse actual balloon functions in !balloon context. Just
> move the page to the proper list directly.
>

OK.

>> +			spin_unlock(&vb->recover_page_list_lock);
>> +			queue_work(system_freezable_wq, &vb->unpoison_memory_work);
>> +			dev_info_ratelimited(&vb->vdev->dev, "recovered pfn 0x%x", pfn0);
>
> Well, not yet. Shouldn't this go into unpoison_memory_func() ?
>

OK.

[...]

>
>>
>> +out_unregister_reporting:
>> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
>> +		page_reporting_unregister(&vb->pr_dev_info);
>>  out_unregister_oom:
>>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
>>  		unregister_oom_notifier(&vb->oom_nb);
>> @@ -1082,6 +1319,11 @@ static void virtballoon_remove(struct virtio_device *vdev)
>>  		destroy_workqueue(vb->balloon_wq);
>>  	}
>> 
>> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_RECOVER)) {
>
> Could the notifier already have been triggered and we might be using the
> device before already fully initialized from the notifier and might end
> up leaking memory here that we allocated?
>
>> +		unregister_memory_failure_notifier(&vb->memory_failure_nb);
>> +		cancel_work_sync(&vb->unpoison_memory_work);
>> +	}
>> +
>
> Could we be leaking memory from the virtballoon_remove() path?
>

Yes, I'll fix the possible memory leak here.

Thanks a lot.

--
zhenwei pi

2022-06-01 20:53:59

by zhenwei pi

[permalink] [raw]
Subject: Re: Re: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon

On 5/31/22 12:08, Jue Wang wrote:
> On Mon, May 30, 2022 at 8:49 AM Peter Xu <[email protected]> wrote:
>>
>> On Mon, May 30, 2022 at 07:33:35PM +0800, zhenwei pi wrote:
>>> A VM uses RAM of 2M huge page. Once a MCE(@HVAy in [HVAx,HVAz)) occurs, the
>>> 2M([HVAx,HVAz)) of hypervisor becomes unaccessible, but the guest poisons 4K
>>> (@GPAy in [GPAx, GPAz)) only, it may hit another 511 MCE ([GPAx, GPAz)
>>> except GPAy). This is the worse case, so I want to add
>>> '__le32 corrupted_pages' in struct virtio_balloon_config, it is used in the
>>> next step: reporting 512 * 4K 'corrupted_pages' to the guest, the guest has
>>> a chance to isolate the other 511 pages ahead of time. And the guest
>>> actually loses 2M, fixing 512*4K seems to help significantly.
>>
>> It sounds hackish to teach a virtio device to assume one page will always
>> be poisoned in huge page granule. That's only a limitation to host kernel
>> not virtio itself.
>>
>> E.g. there're upstream effort ongoing with enabling doublemap on hugetlbfs
>> pages so hugetlb pages can be mapped in 4k with it. It provides potential
>> possibility to do page poisoning with huge pages in 4k too. When that'll
>> be ready the assumption can go away, and that does sound like a better
>> approach towards this problem.
>
> +1.
>
> A hypervisor should always strive to minimize the guest memory loss.
>
> The HugeTLB double mapping enlightened memory poisoning behavior (only
> poison 4K out of a 2MB huge page and 4K in guest) is a much better
> solution here. To be completely transparent, it's not _strictly_
> required to poison the page (whatever the granularity it is) on the
> host side, as long as the following are true:
>
> 1. A hypervisor can emulate the _minimized_ (e.g., 4K) the poison to the guest.
> 2. The host page with the UC error is "isolated" (could be PG_HWPOISON
> or in some other way) and prevented from being reused by other
> processes.
>
> For #2, PG_HWPOISON and HugeTLB double mapping enlightened memory
> poisoning is a good solution.
>
>>
>>>
>>>>
>>>> I assume when talking about "the performance memory drops a lot", you
>>>> imply that this patch set can mitigate that performance drop?
>>>>
>>>> But why do you see a performance drop? Because we might lose some
>>>> possible THP candidates (in the host or the guest) and you want to plug
>>>> does holes? I assume you'll see a performance drop simply because
>>>> poisoning memory is expensive, including migrating pages around on CE.
>>>>
>>>> If you have some numbers to share, especially before/after this change,
>>>> that would be great.
>>>>
>>>
>>> The CE storm leads 2 problems I have even seen:
>>> 1, the memory bandwidth slows down to 10%~20%, and the cycles per
>>> instruction of CPU increases a lot.
>>> 2, the THR (/proc/interrupts) interrupts frequently, the CPU has to use a
>>> lot time to handle IRQ.
>>
>> Totally no good knowledge on CMCI, but if 2) is true then I'm wondering
>> whether it's necessary to handle the interrupts that frequently. When I
>> was reading the Intel CMCI vector handler I stumbled over this comment:
>>
>> /*
>> * The interrupt handler. This is called on every event.
>> * Just call the poller directly to log any events.
>> * This could in theory increase the threshold under high load,
>> * but doesn't for now.
>> */
>> static void intel_threshold_interrupt(void)
>>
>> I think that matches with what I was thinking.. I mean for 2) not sure
>> whether it can be seen as a CMCI problem and potentially can be optimized
>> by adjust the cmci threshold dynamically.
>
> The CE storm caused performance drop is caused by the extra cycles
> spent by the ECC steps in memory controller, not in CMCI handling.
> This is observed in the Google fleet as well. A good solution is to
> monitor the CE rate closely in user space via /dev/mcelog and migrate
> all VMs to another host once the CE rate exceeds some threshold.
>
> CMCI is a _background_ interrupt that is not handled in the process
> execution context and its handler is setup to switch to poll (1 / 5
> min) mode if there are more than ~ a dozen CEs reported via CMCI per
> second.
>>
>> --
>> Peter Xu
>>

Hi, Andrew, David, Naoya

According to the suggestions, I'd give up the improvement of memory
failure on huge page in this series.

Is it worth recovering corrupted pages for the guest kernel? I'd follow
your decision.

--
zhenwei pi

2022-06-01 21:23:26

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon

On 01.06.22 04:17, zhenwei pi wrote:
> [...]
>
> Hi, Andrew, David, Naoya
>
> According to the suggestions, I'd give up the improvement of memory
> failure on huge page in this series.
>
> Is it worth recovering corrupted pages for the guest kernel? I'd follow
> your decision.

Well, as I said, I am not sure if we really need/want this for a handful
of 4k poisoned pages in a VM. As I suspected, doing so might primarily
be interesting for some sort of de-fragmentation (allow again a higher
order page to be placed at the affected PFNs), not because of the slight
reduction of available memory. A simple VM reboot would get the job
similarly done.

As the poisoning refcount code is already a bit shaky as I learned
recently in the context of memory offlining, I do wonder if we really
want to expose the unpoisoning code outside of debugfs (hwpoison) usage.

Interestingly, unpoison_memory() documents: "This is only done on the
software-level, so it only works for linux injected failures, not real
hardware failures" -- ehm?

--
Thanks,

David / dhildenb


2022-06-02 19:39:01

by zhenwei pi

[permalink] [raw]
Subject: Re: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon

On 6/1/22 15:59, David Hildenbrand wrote:
> On 01.06.22 04:17, zhenwei pi wrote:
>> [...]
>>
>> Hi, Andrew, David, Naoya
>>
>> According to the suggestions, I'd give up the improvement of memory
>> failure on huge page in this series.
>>
>> Is it worth recovering corrupted pages for the guest kernel? I'd follow
>> your decision.
>
> Well, as I said, I am not sure if we really need/want this for a handful
> of 4k poisoned pages in a VM. As I suspected, doing so might primarily
> be interesting for some sort of de-fragmentation (allow again a higher
> order page to be placed at the affected PFNs), not because of the slight
> reduction of available memory. A simple VM reboot would get the job
> similarly done.
>

Sure, let's drop this idea. Thanks to all for the suggestions.

Hi, Naoya

It seems that the memory failure notifier is not required currently, so
I'll not push the next version of:
[PATCH 1/3] memory-failure: Introduce memory failure notifier
[PATCH 2/3] mm/memory-failure.c: support reset PTE during unpoison

Thank you for the review work!

> As the poisoning refcount code is already a bit shaky as I learned
> recently in the context of memory offlining, I do wonder if we really
> want to expose the unpoisoning code outside of debugfs (hwpoison) usage.
>
> Interestingly, unpoison_memory() documents: "This is only done on the
> software-level, so it only works for linux injected failures, not real
> hardware failures" -- ehm?
>

I guess unpoison_memory() was designed and tested with hwpoison-inject
only; I have no idea how to fix a memory failure on a real hardware
platform. I suppose this is the first time unpoison_memory() would be
needed at the hardware level (via the balloon recover VQ).

--
zhenwei pi

2022-06-03 09:51:57

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon

On 02.06.22 11:28, zhenwei pi wrote:
> On 6/1/22 15:59, David Hildenbrand wrote:
>> [...]
>
> Sure, Let's drop this idea. Thanks to all for the suggestions.

Thanks for the interesting idea + discussions.

Just a note that if you believe that we want/need something like that,
and there is a reasonable use case, please tell us we're wrong and push
back :)

--
Thanks,

David / dhildenb