From: Isaku Yamahata <[email protected]>
Background
==========
TDX is a VM-based confidential computing technology. It encrypts guest
memory to protect it from software outside of the TCB (host OS, VMM,
firmware, etc.). Software in the host can still write to the protected guest
memory to corrupt it, and the protected guest can then consume the corrupted
memory. TDX uses machine checks to notify the TDX vcpu that it consumed
corrupted memory.
For VM-based confidential computing (AMD SEV-SNP and Intel TDX), the KVM guest
memfd effort [1] is ongoing. It allows guests to use file descriptors as
protected guest memory without a user-space virtual address mapping. KVM
handles machine checks for guest vcpus specially: it sets up the guest so that
the vcpu exits from the guest on a machine check, checks the exit reason, and
manually raises the machine check by calling do_machine_check().
Although Linux supports the hwpoison and MCE injection frameworks, there are
gaps in testing hwpoison or MCE for TDX KVM [2] with KVM guest memfd [1].
a) The hwpoison framework (debugfs
/sys/kernel/debug/hwpoison/{corrupt-pfn, unpoison-pfn}) uses a physical
address, and MADV_{HWPOISON, SOFT_OFFLINE} uses a virtual address of the user
process. However, KVM guest memfd requires a file descriptor and offset.
b) The x86 MCE injection framework, /dev/mce-log (a legacy, deprecated device
driver interface) or debugfs /sys/kernel/debug/mce-inject/addr, also uses a
physical address. c) The x86 MCE injection framework injects machine checks in
the context of the injector process, whereas KVM wants to inject the machine
check on behalf of a running vcpu.
Proposed solution
=================
This patch series fills those gaps so that KVM can be tested by injecting
machine checks. The proposed solution is:
a) Introduce new flags FADV_{HWPOISON, SOFT_OFFLINE} for posix_fadvise() to
hwpoison memory:
Possible options are a1) add new flags to posix_fadvise(), because hwpoison by
file descriptor and offset is a generic operation, not specific to KVM (this
patch series); a2) add a KVM guest memfd specific ioctl to inject
hwpoison/trigger MCE, reusing the same values as the MADV_* counterparts; or
a3) add a KVM-specific debugfs entry for guest memfd, say
/sys/kernel/debug/kvm/<pid>-<vm-fd>/guest-memfd<fd>/hwpoison/{corrupt-offset,
unpoison-offset}. A minimal usage sketch of the fadvise() interface follows
the comparison below.
- fadvise(FADV_{HWPOISON, SOFT_OFFLINE}): this patch series.
  Generic interface, not specific to KVM.
- KVM ioctl:
  KVM specific; the KVM debugfs would be a better fit.
- KVM debugfs:
  Debugfs is a natural fit because this feature is for debug/test.
  Specific to KVM guest_memfd.
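For illustration, a minimal sketch of the fadvise() path (the flag value is
the one this series defines in <linux/fadvise.h>; memfd and offset are
whatever the test uses for the guest memory, and CAP_SYS_ADMIN is required):

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>

	#define FADV_HWPOISON	100	/* from this series */

	/* Poison one page of the guest memfd at 'offset' for testing. */
	static int poison_guest_page(int memfd, off_t offset)
	{
		/* posix_fadvise() returns the error number; it doesn't set errno. */
		int ret = posix_fadvise(memfd, offset, sizeof(long), FADV_HWPOISON);

		if (ret)
			fprintf(stderr, "FADV_HWPOISON: %s\n", strerror(ret));
		return ret;
	}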
b) Enhance the x86 MCE injector:
Add debugfs entries under /sys/kernel/debug/mce-inject/ for the necessary
parameters: mcgstatus and notrigger. mcgstatus allows setting LMCE_S;
notrigger suppresses triggering the machine check handler so that KVM can
trigger it instead.
c) Add a debugfs entry for a KVM vcpu to trigger MCE injection:
Because setting the MCE parameters is not specific to KVM, reuse the existing
mce-inject debugfs interface. The new debugfs entry only causes KVM to trigger
the MCE. An alternative is to add an interface to set the parameters without
b), similar to /dev/mce-log. The combined flow is sketched after the list
below.
- KVM debugfs: this patch series.
  Debugfs is a natural fit because this feature is for debug/test.
- New KVM ioctl:
  The KVM debugfs seems better.
- Enhance /dev/mce-log:
  Because this interface is legacy, the debugfs MCE injector is better.
- Enhance /sys/kernel/debug/mce-inject/:
  Because the trigger is KVM specific, adding it to the KVM interface is
  better than adding it to the x86 MCE injector.
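Putting a), b), and c) together, the expected flow is (paths as introduced by
this series):

	$ echo 1 > /sys/kernel/debug/mce-inject/notrigger
	$ echo 0xe > /sys/kernel/debug/mce-inject/mcgstatus	# EIPV|MCIP|LMCE_S
	... set status, misc, and flags as needed ...
	$ echo 0 > /sys/kernel/debug/mce-inject/bank		# no longer triggers
	... posix_fadvise(memfd, offset, len, FADV_MCE_INJECT) to set the address ...
	$ echo 1 > /sys/kernel/debug/kvm/<pid>-<vm-fd>/vcpu<N>/mce-inject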
[1] https://lore.kernel.org/all/[email protected]/
KVM guest_memfd() and per-page attributes
https://lore.kernel.org/all/[email protected]/
[PATCH 00/13] KVM: guest_memfd fixes
[2] https://lore.kernel.org/all/[email protected]/
v15 KVM TDX basic feature support
Isaku Yamahata (12):
x86/mce: Fix hw MCE injection feature detection
x86/mce/inject: Add mcgstatus for mce-inject debugfs
x86/mce/inject: Add notrigger entry to suppress MCE injection
x86/mce: Move and export inject_mce() from inject.c to core.c
mm/fadvise: Add flags to inject hwpoison for posix_fadvise()
mm/fadvise: Add FADV_MCE_INJECT flag for posix_fadvise()
x86/mce/inject: Wire up the x86 MCE injector to FADV_MCE_INJECT
x86/mce: Define a notifier chain for mce injector
KVM: X86: Add debugfs to inject machine check on VM exit
KVM: selftests: Allow mapping guest memory without host alias
KVM: selftests: lib: Add src memory type for hwpoison test
KVM: selftests: hwpoison/mce failure injection
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/asm/mce.h | 16 +
arch/x86/kernel/cpu/mce/core.c | 56 ++
arch/x86/kernel/cpu/mce/inject.c | 91 ++-
arch/x86/kvm/debugfs.c | 22 +
arch/x86/kvm/x86.c | 14 +
include/linux/fs.h | 8 +
include/uapi/linux/fadvise.h | 5 +
mm/fadvise.c | 120 ++-
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/include/kvm_util_base.h | 4 +
.../testing/selftests/kvm/include/test_util.h | 2 +
tools/testing/selftests/kvm/lib/kvm_util.c | 30 +-
tools/testing/selftests/kvm/lib/test_util.c | 8 +
.../testing/selftests/kvm/mem_hwpoison_test.c | 721 ++++++++++++++++++
15 files changed, 1066 insertions(+), 33 deletions(-)
create mode 100644 tools/testing/selftests/kvm/mem_hwpoison_test.c
base-commit: 6465e260f48790807eef06b583b38ca9789b6072
--
2.25.1
From: Isaku Yamahata <[email protected]>
For VM-based confidential computing (AMD SEV-SNP and Intel TDX), the KVM
guest memfd effort [1] is ongoing. It allows KVM guests to use file
descriptors and their offsets as protected guest memory without a user-space
virtual address mapping.
Intel TDX uses machine checks to notify the host OS/VMM that the TDX guest
consumed corrupted memory. KVM/x86 handles machine checks for guest vcpus
specially: it sets up the guest so that the vcpu exits from the guest on a
machine check, checks the exit reason, and manually raises the machine check
by calling do_machine_check().
To test the KVM machine check path, KVM wants to inject an x86 machine check
based on a file descriptor and offset. Add a new flag, FADV_MCE_INJECT, for
posix_fadvise() to tell the x86 MCE injector the physical address. The x86
MCE injector can be built into the kernel or a kernel module
(CONFIG_X86_MCE_INJECT=y, m, or n). Because posix_fadvise() should not depend
on the x86 MCE injector, add a notifier chain for it, and have the x86 MCE
injector register with the chain.
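As a sketch of the resulting API (the handler body is illustrative; the
actual wiring of the injector happens in a later patch of this series):

	static int fadvise_inject_addr(struct notifier_block *nb,
				       unsigned long physaddr, void *data)
	{
		/* Record the physical address for the next MCE injection. */
		return NOTIFY_DONE;
	}

	static struct notifier_block fadvise_nb = {
		.notifier_call = fadvise_inject_addr,
	};

	/* On injector module init/exit: */
	fadvise_register_mce_injector_chain(&fadvise_nb);
	fadvise_unregister_mce_injector_chain(&fadvise_nb);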
The possible options would be
- Add FADV flags for memory poisoning. This patch.
- Introduce a new ioctl specific to KVM guest memfd.
- Introduce debugfs entries for KVM guest memfd like
  /sys/kernel/debug/kvm/<pid>-<vm-fd>/guest-memfd<fd>/hwpoison/
  {corrupt-offset, unpoison-offset}.
  This option follows /sys/kernel/debug/hwpoison/
  {corrupt-pfn, unpoison-pfn}.
[1] https://lore.kernel.org/all/[email protected]/
KVM guest_memfd() and per-page attributes
https://lore.kernel.org/all/[email protected]/
[PATCH 00/13] KVM: guest_memfd fixes
Signed-off-by: Isaku Yamahata <[email protected]>
---
include/linux/fs.h | 8 +++++
include/uapi/linux/fadvise.h | 1 +
mm/fadvise.c | 62 ++++++++++++++++++++++++++++++++++++
3 files changed, 71 insertions(+)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b528f063e8ff..c458f1e86245 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3377,5 +3377,13 @@ extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len,
int advice);
extern int generic_fadvise(struct file *file, loff_t offset, loff_t len,
int advice);
+#ifdef CONFIG_X86_MCE_INJECT
+void fadvise_register_mce_injector_chain(struct notifier_block *nb);
+void fadvise_unregister_mce_injector_chain(struct notifier_block *nb);
+#else
+static inline void fadvise_register_mce_injector_chain(struct notifier_block *nb) {}
+static inline void fadvise_unregister_mce_injector_chain(struct notifier_block *nb) {}
+#endif
+
#endif /* _LINUX_FS_H */
diff --git a/include/uapi/linux/fadvise.h b/include/uapi/linux/fadvise.h
index 3699b4b0adcd..90e1b5a94f37 100644
--- a/include/uapi/linux/fadvise.h
+++ b/include/uapi/linux/fadvise.h
@@ -22,5 +22,6 @@
/* Same as MADV_HWPOISON and MADV_SOFT_OFFLINE */
#define FADV_HWPOISON 100 /* poison a page for testing */
#define FADV_SOFT_OFFLINE 101 /* soft offline page for testing */
+#define FADV_MCE_INJECT 102 /* inject machine check for testing */
#endif /* FADVISE_H_INCLUDED */
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 1f028a6e1d90..8168d08f7455 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -74,6 +74,65 @@ static int fadvise_inject_error(struct file *file, struct address_space *mapping
return 0;
}
+#ifdef CONFIG_X86_MCE_INJECT
+static BLOCKING_NOTIFIER_HEAD(mce_injector_chain);
+
+void fadvise_register_mce_injector_chain(struct notifier_block *nb)
+{
+ blocking_notifier_chain_register(&mce_injector_chain, nb);
+}
+EXPORT_SYMBOL_GPL(fadvise_register_mce_injector_chain);
+
+void fadvise_unregister_mce_injector_chain(struct notifier_block *nb)
+{
+ blocking_notifier_chain_unregister(&mce_injector_chain, nb);
+}
+EXPORT_SYMBOL_GPL(fadvise_unregister_mce_injector_chain);
+
+static void fadvise_call_mce_injector_chain(unsigned long physaddr)
+{
+ blocking_notifier_call_chain(&mce_injector_chain, physaddr, NULL);
+}
+#else
+static void fadvise_call_mce_injector_chain(unsigned long physaddr)
+{
+}
+#endif
+
+static int fadvise_inject_mce(struct file *file, struct address_space *mapping,
+ loff_t offset)
+{
+ unsigned long page_mask, physaddr;
+ struct folio *folio;
+ pgoff_t index;
+
+ if (!IS_ENABLED(CONFIG_X86_MCE_INJECT))
+ return -EOPNOTSUPP;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (is_file_hugepages(file)) {
+ index = offset >> huge_page_shift(hstate_file(file));
+ page_mask = huge_page_mask(hstate_file(file));
+ } else {
+ index = offset >> PAGE_SHIFT;
+ page_mask = PAGE_MASK;
+ }
+
+ folio = filemap_get_folio(mapping, index);
+ if (IS_ERR(folio))
+ return PTR_ERR(folio);
+
+ physaddr = (folio_pfn(folio) << PAGE_SHIFT) | (offset & ~page_mask);
+ folio_put(folio);
+
+ pr_info("Injecting mce for address %#lx at file offset %#llx\n",
+ physaddr, offset);
+ fadvise_call_mce_injector_chain(physaddr);
+ return 0;
+}
+
/*
* POSIX_FADV_WILLNEED could set PG_Referenced, and POSIX_FADV_NOREUSE could
* deactivate the pages and clear PG_Referenced.
@@ -111,6 +170,7 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
return 0;
case FADV_HWPOISON:
case FADV_SOFT_OFFLINE:
+ case FADV_MCE_INJECT:
break;
default:
return -EINVAL;
@@ -226,6 +286,8 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
case FADV_HWPOISON:
case FADV_SOFT_OFFLINE:
return fadvise_inject_error(file, mapping, offset, endbyte, advice);
+ case FADV_MCE_INJECT:
+ return fadvise_inject_mce(file, mapping, offset);
default:
return -EINVAL;
}
--
2.25.1
From: Isaku Yamahata <[email protected]>
The current x86 MCE injection framework injects the MCE when writing to
/sys/kernel/debug/mce-inject/bank. KVM wants to inject the machine check in
the vcpu context instead of the context of the process writing to the bank
file. ACPI APEI has a similar requirement and adds
/sys/kernel/debug/apei/einj/notrigger to suppress immediate injection.
Following it, add /sys/kernel/debug/mce-inject/notrigger to suppress
MCE injection.
The alternative is to add a new value "notrigger" to
/sys/kernel/debug/mce-inject/flags in addition to "sw", "hw", "df", and
"th". Because that may break the user-space ABI, this patch follows the
ACPI APEI error injection approach instead.
Expected usage flow:
$ echo 1 > notrigger
... set up the MCE values ...
$ echo 0 > bank
The last step sets up the MCE value to inject with MC bank 0.
Originally this step injects the MCE; with notrigger=1, it doesn't.
$ echo 1 > /sys/kernel/debug/kvm/<pid>-<fd>/vcpu<N>/mce-inject
Tell KVM to inject the MCE in the context of the vcpu.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kernel/cpu/mce/inject.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/arch/x86/kernel/cpu/mce/inject.c b/arch/x86/kernel/cpu/mce/inject.c
index 461858ae18f9..88603a6c0afe 100644
--- a/arch/x86/kernel/cpu/mce/inject.c
+++ b/arch/x86/kernel/cpu/mce/inject.c
@@ -34,6 +34,7 @@
#include "internal.h"
static bool hw_injection_possible;
+static u64 notrigger;
/*
* Collect all the MCi_XXX settings
@@ -598,6 +599,8 @@ static int inj_bank_set(void *data, u64 val)
}
m->bank = val;
+ if (notrigger)
+ return 0;
/*
* sw-only injection allows to write arbitrary values into the MCA
@@ -637,6 +640,21 @@ MCE_INJECT_GET(bank);
DEFINE_SIMPLE_ATTRIBUTE(bank_fops, inj_bank_get, inj_bank_set, "%llu\n");
+static int inj_notrigger_get(void *data, u64 *val)
+{
+ *val = notrigger;
+ return 0;
+}
+
+static int inj_notrigger_set(void *data, u64 val)
+{
+ notrigger = val;
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(notrigger_fops, inj_notrigger_get, inj_notrigger_set,
+ "%llx\n");
+
static const char readme_msg[] =
"Description of the files and their usages:\n"
"\n"
@@ -685,6 +703,9 @@ static const char readme_msg[] =
"\n"
"mcgstatus:\t Set MCG_STATUS: the bits in that MSR describes the current state\n"
"\t of the processor after the MCE.\n"
+"\n"
+"notrigger:\t Suppress triggering the injection when set to non-zero\n"
+"\t The injection is triggered by other way.\n"
"\n";
static ssize_t
@@ -714,6 +735,8 @@ static struct dfs_node {
{ .name = "cpu", .fops = &extcpu_fops, .perm = S_IRUSR | S_IWUSR },
{ .name = "mcgstatus", .fops = &mcgstatus_fops,
.perm = S_IRUSR | S_IWUSR },
+ { .name = "notrigger", .fops = ¬rigger_fops,
+ .perm = S_IRUSR | S_IWUSR },
{ .name = "README", .fops = &readme_fops, .perm = S_IRUSR | S_IRGRP | S_IROTH },
};
--
2.25.1
From: Isaku Yamahata <[email protected]>
x86 MCE support is kernel built-in (CONFIG_X86_MCE=y or n), while the x86 MCE
injection framework can be built-in or a kernel module
(CONFIG_X86_MCE_INJECT=y, m, or n). We don't want KVM to depend on
CONFIG_X86_MCE_INJECT=y, so allow CONFIG_X86_MCE_INJECT=m for KVM MCE
injection.
Move the necessary symbols, with a rename, from
arch/x86/kernel/cpu/mce/inject.c to arch/x86/kernel/cpu/mce/core.c and
export them for KVM to inject MCEs.
Opportunistically add lockdep_assert_held(&mce_inject_mutex) to document
that injectm is protected by mce_inject_mutex.
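The resulting API, as a caller such as KVM would use it later in this series
(sketch; filling in the struct mce fields is elided):

	struct mce m;

	mce_setup(&m);		/* also exported by this patch */
	/* ... fill in status/misc/addr for the fake error ... */
	mce_inject_lock();
	mce_inject(&m);		/* publish to the per-CPU injectm */
	mce_inject_unlock();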
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/mce.h | 10 +++++++++
arch/x86/kernel/cpu/mce/core.c | 36 ++++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/mce/inject.c | 26 +++--------------------
3 files changed, 49 insertions(+), 23 deletions(-)
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 180b1cbfcc4e..459066ecd922 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -265,6 +265,16 @@ int mce_notify_irq(void);
DECLARE_PER_CPU(struct mce, injectm);
+#ifdef CONFIG_X86_MCE
+void mce_inject_lock(void);
+void mce_inject_unlock(void);
+void mce_inject(struct mce *m);
+#else
+static inline void mce_inject_lock(void) {}
+static inline void mce_inject_unlock(void) {}
+static inline void mce_inject(struct mce *m) {}
+#endif
+
/* Disable CMCI/polling for MCA bank claimed by firmware */
extern void mce_disable_bank(int bank);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6f35f724cc14..6929c3cad278 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -129,9 +129,45 @@ void mce_setup(struct mce *m)
m->ppin = cpu_data(m->extcpu).ppin;
m->microcode = boot_cpu_data.microcode;
}
+EXPORT_SYMBOL_GPL(mce_setup);
DEFINE_PER_CPU(struct mce, injectm);
EXPORT_PER_CPU_SYMBOL_GPL(injectm);
+static DEFINE_MUTEX(mce_inject_mutex);
+
+void mce_inject_lock(void)
+{
+ mutex_lock(&mce_inject_mutex);
+}
+EXPORT_SYMBOL_GPL(mce_inject_lock);
+
+void mce_inject_unlock(void)
+{
+ mutex_unlock(&mce_inject_mutex);
+}
+EXPORT_SYMBOL_GPL(mce_inject_unlock);
+
+/* Update fake mce registers on current CPU. */
+void mce_inject(struct mce *m)
+{
+ struct mce *i = &per_cpu(injectm, m->extcpu);
+
+ lockdep_assert_held(&mce_inject_mutex);
+
+ /* Make sure no one reads partially written injectm */
+ i->finished = 0;
+ mb();
+ m->finished = 0;
+ /* First set the fields after finished */
+ i->extcpu = m->extcpu;
+ mb();
+ /* Now write record in order, finished last (except above) */
+ memcpy(i, m, sizeof(struct mce));
+ /* Finally activate it */
+ mb();
+ i->finished = 1;
+}
+EXPORT_SYMBOL_GPL(mce_inject);
void mce_log(struct mce *m)
{
diff --git a/arch/x86/kernel/cpu/mce/inject.c b/arch/x86/kernel/cpu/mce/inject.c
index 88603a6c0afe..ae3efbeb78bd 100644
--- a/arch/x86/kernel/cpu/mce/inject.c
+++ b/arch/x86/kernel/cpu/mce/inject.c
@@ -126,25 +126,6 @@ static void setup_inj_struct(struct mce *m)
m->microcode = boot_cpu_data.microcode;
}
-/* Update fake mce registers on current CPU. */
-static void inject_mce(struct mce *m)
-{
- struct mce *i = &per_cpu(injectm, m->extcpu);
-
- /* Make sure no one reads partially written injectm */
- i->finished = 0;
- mb();
- m->finished = 0;
- /* First set the fields after finished */
- i->extcpu = m->extcpu;
- mb();
- /* Now write record in order, finished last (except above) */
- memcpy(i, m, sizeof(struct mce));
- /* Finally activate it */
- mb();
- i->finished = 1;
-}
-
static void raise_poll(struct mce *m)
{
unsigned long flags;
@@ -176,7 +157,6 @@ static void raise_exception(struct mce *m, struct pt_regs *pregs)
}
static cpumask_var_t mce_inject_cpumask;
-static DEFINE_MUTEX(mce_inject_mutex);
static int mce_raise_notify(unsigned int cmd, struct pt_regs *regs)
{
@@ -245,7 +225,7 @@ static void __maybe_unused raise_mce(struct mce *m)
{
int context = MCJ_CTX(m->inject_flags);
- inject_mce(m);
+ mce_inject(m);
if (context == MCJ_CTX_RANDOM)
return;
@@ -303,9 +283,9 @@ static int mce_inject_raise(struct notifier_block *nb, unsigned long val,
if (!m)
return NOTIFY_DONE;
- mutex_lock(&mce_inject_mutex);
+ mce_inject_lock();
raise_mce(m);
- mutex_unlock(&mce_inject_mutex);
+ mce_inject_unlock();
return NOTIFY_DONE;
}
--
2.25.1
From: Isaku Yamahata <[email protected]>
For the hwpoison test to recover hwpoisoned memory, use madvise(MADV_FREE),
which requires a private mapping. Add new src memory types for it, selected
as shown below.
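The hwpoison test added later in this series selects the new types like so:

	if (huge_page)
		src_type = VM_MEM_SRC_ANONYMOUS_MEMFD_HUGETLB;
	else
		src_type = VM_MEM_SRC_ANONYMOUS_MEMFD;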
Signed-off-by: Isaku Yamahata <[email protected]>
---
tools/testing/selftests/kvm/include/test_util.h | 2 ++
tools/testing/selftests/kvm/lib/kvm_util.c | 7 +++++--
tools/testing/selftests/kvm/lib/test_util.c | 8 ++++++++
3 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index 7e614adc6cf4..c97af4ff0699 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -111,6 +111,8 @@ enum vm_mem_backing_src_type {
VM_MEM_SRC_ANONYMOUS_HUGETLB_16GB,
VM_MEM_SRC_SHMEM,
VM_MEM_SRC_SHARED_HUGETLB,
+ VM_MEM_SRC_ANONYMOUS_MEMFD,
+ VM_MEM_SRC_ANONYMOUS_MEMFD_HUGETLB,
NUM_SRC_TYPES,
};
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 310c3a760cb8..95ffb3b97dcf 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1007,9 +1007,12 @@ void __vm_userspace_mem_region_add(struct kvm_vm *vm,
region->mmap_size += alignment;
region->fd = -1;
- if (backing_src_is_shared(src_type))
+ if (backing_src_is_shared(src_type) ||
+ src_type == VM_MEM_SRC_ANONYMOUS_MEMFD ||
+ src_type == VM_MEM_SRC_ANONYMOUS_MEMFD_HUGETLB)
region->fd = kvm_memfd_alloc(region->mmap_size,
- src_type == VM_MEM_SRC_SHARED_HUGETLB);
+ src_type == VM_MEM_SRC_SHARED_HUGETLB ||
+ src_type == VM_MEM_SRC_ANONYMOUS_MEMFD_HUGETLB);
region->mmap_start = mmap(NULL, region->mmap_size,
PROT_READ | PROT_WRITE,
diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c
index 5d7f28b02d73..a09bba5c074e 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -281,6 +281,14 @@ const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i)
*/
.flag = MAP_SHARED,
},
+ [VM_MEM_SRC_ANONYMOUS_MEMFD] = {
+ .name = "anonymous_memfd",
+ .flag = ANON_FLAGS,
+ },
+ [VM_MEM_SRC_ANONYMOUS_MEMFD_HUGETLB] = {
+ .name = "anonymous_memfd_hugetlb",
+ .flag = ANON_HUGE_FLAGS,
+ },
};
_Static_assert(ARRAY_SIZE(aliases) == NUM_SRC_TYPES,
"Missing new backing src types?");
--
2.25.1
From: Isaku Yamahata <[email protected]>
When initializing the x86 MCE injection framework, it checks whether hardware
MCE injection is available. When it's not available on AMD, it sets the
boolean variable hw_injection_possible to false. The variable is true by
default even though, judging from the code, the feature is AMD specific.
Because the variable defaults to true, it stays true on Intel (and probably
other non-AMD x86) platforms. This results in an unchecked MSR access to
MSR_K7_HWCR=0xc0010015 when injecting an MCE on those platforms.
Make the variable false by default, and set it to true only when the hardware
feature is usable.
The procedure to reproduce the issue on an Intel platform:
echo hw > /sys/kernel/debug/mce-inject/flags
This succeeds. Set other MCE parameters if necessary.
echo 0 > /sys/kernel/debug/mce-inject/bank
This triggers the MCE injection; the kernel accesses MSR_K7_HWCR=0xc0010015,
resulting in an unchecked MSR access.
Fixes: 891e465a1bd8 ("x86/mce: Check whether writes to MCA_STATUS are getting ignored")
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kernel/cpu/mce/inject.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/mce/inject.c b/arch/x86/kernel/cpu/mce/inject.c
index 4d8d4bcf915d..881898a1d2f4 100644
--- a/arch/x86/kernel/cpu/mce/inject.c
+++ b/arch/x86/kernel/cpu/mce/inject.c
@@ -33,7 +33,7 @@
#include "internal.h"
-static bool hw_injection_possible = true;
+static bool hw_injection_possible;
/*
* Collect all the MCi_XXX settings
@@ -732,6 +732,7 @@ static void check_hw_inj_possible(void)
if (!cpu_feature_enabled(X86_FEATURE_SMCA))
return;
+ hw_injection_possible = true;
cpu = get_cpu();
for (bank = 0; bank < MAX_NR_BANKS; ++bank) {
--
2.25.1
From: Isaku Yamahata <[email protected]>
To test KVM/x86 machine check injection with memory failure, the test wants
to set LMCE_S (local machine check exception signaled) for the memory failure
injection test. The current code derives mcgstatus from the other values and
can't set the LMCE_S flag.
Add an mcgstatus entry to allow setting the mcgstatus value directly.
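For example, to signal a local machine check, the selftest later in this
series writes MCG_STATUS_EIPV (bit 1) | MCG_STATUS_MCIP (bit 2) |
MCG_STATUS_LMCES (bit 3):

	$ echo 0xe > /sys/kernel/debug/mce-inject/mcgstatus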
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kernel/cpu/mce/inject.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/kernel/cpu/mce/inject.c b/arch/x86/kernel/cpu/mce/inject.c
index 881898a1d2f4..461858ae18f9 100644
--- a/arch/x86/kernel/cpu/mce/inject.c
+++ b/arch/x86/kernel/cpu/mce/inject.c
@@ -72,6 +72,7 @@ static int inj_##reg##_set(void *data, u64 val) \
return 0; \
}
+MCE_INJECT_SET(mcgstatus);
MCE_INJECT_SET(status);
MCE_INJECT_SET(misc);
MCE_INJECT_SET(addr);
@@ -86,12 +87,14 @@ static int inj_##reg##_get(void *data, u64 *val) \
return 0; \
}
+MCE_INJECT_GET(mcgstatus);
MCE_INJECT_GET(status);
MCE_INJECT_GET(misc);
MCE_INJECT_GET(addr);
MCE_INJECT_GET(synd);
MCE_INJECT_GET(ipid);
+DEFINE_SIMPLE_ATTRIBUTE(mcgstatus_fops, inj_mcgstatus_get, inj_mcgstatus_set, "%llx\n");
DEFINE_SIMPLE_ATTRIBUTE(status_fops, inj_status_get, inj_status_set, "%llx\n");
DEFINE_SIMPLE_ATTRIBUTE(misc_fops, inj_misc_get, inj_misc_set, "%llx\n");
DEFINE_SIMPLE_ATTRIBUTE(addr_fops, inj_addr_get, inj_addr_set, "%llx\n");
@@ -679,6 +682,9 @@ static const char readme_msg[] =
"\t APIC interrupt handler to handle the error. \n"
"\n"
"ipid:\t IPID (AMD-specific)\n"
+"\n"
+"mcgstatus:\t Set MCG_STATUS: the bits in that MSR describes the current state\n"
+"\t of the processor after the MCE.\n"
"\n";
static ssize_t
@@ -706,6 +712,8 @@ static struct dfs_node {
{ .name = "bank", .fops = &bank_fops, .perm = S_IRUSR | S_IWUSR },
{ .name = "flags", .fops = &flags_fops, .perm = S_IRUSR | S_IWUSR },
{ .name = "cpu", .fops = &extcpu_fops, .perm = S_IRUSR | S_IWUSR },
+ { .name = "mcgstatus", .fops = &mcgstatus_fops,
+ .perm = S_IRUSR | S_IWUSR },
{ .name = "README", .fops = &readme_fops, .perm = S_IRUSR | S_IRGRP | S_IROTH },
};
--
2.25.1
From: Isaku Yamahata <[email protected]>
For VM-based confidential computing (AMD SEV-SNP and Intel TDX), the KVM
guest memfd effort [1] is ongoing. It allows KVM guests to use file
descriptors and their offsets as protected guest memory without a user-space
virtual address mapping.
Intel TDX uses machine checks to notify the host OS/VMM that the TDX guest
consumed corrupted memory. KVM/x86 handles machine checks for guest vcpus
specially: it sets up the guest so that the vcpu exits from the guest on a
machine check, checks the exit reason, and manually raises the machine check
by calling do_machine_check(). To test the KVM machine check path, KVM wants
to poison memory based on a file descriptor and its offset. The current
memory poisoning is based on either the physical address,
/sys/kernel/debug/hwpoison/{corrupt-pfn, unpoison-pfn}, or the virtual
address, MADV_HWPOISON and MADV_SOFT_OFFLINE.
Add new flags FADV_HWPOISON and FADV_SOFT_OFFLINE to posix_fadvise(),
following MADV_HWPOISON and MADV_SOFT_OFFLINE from commit 9893e49d64a4
("HWPOISON: Add madvise() based injector for hardware poisoned pages v4").
The possible options would be
- Add FADV flags for memory poisoning. This patch.
- Introduce an ioctl specific to KVM guest memfd.
- Introduce debugfs entries for KVM guest memfd like
  /sys/kernel/debug/kvm/<pid>-<vm-fd>/guest-memfd<fd>/hwpoison/
  {corrupt-offset, unpoison-offset}.
  This option follows /sys/kernel/debug/hwpoison/{corrupt-pfn, unpoison-pfn}.
[1] https://lore.kernel.org/all/[email protected]/
KVM guest_memfd() and per-page attributes
https://lore.kernel.org/all/[email protected]/
[PATCH 00/13] KVM: guest_memfd fixes
Signed-off-by: Isaku Yamahata <[email protected]>
---
include/uapi/linux/fadvise.h | 4 +++
mm/fadvise.c | 58 +++++++++++++++++++++++++++++++++++-
2 files changed, 61 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/fadvise.h b/include/uapi/linux/fadvise.h
index 0862b87434c2..3699b4b0adcd 100644
--- a/include/uapi/linux/fadvise.h
+++ b/include/uapi/linux/fadvise.h
@@ -19,4 +19,8 @@
#define POSIX_FADV_NOREUSE 5 /* Data will be accessed once. */
#endif
+/* Same as MADV_HWPOISON and MADV_SOFT_OFFLINE */
+#define FADV_HWPOISON 100 /* poison a page for testing */
+#define FADV_SOFT_OFFLINE 101 /* soft offline page for testing */
+
#endif /* FADVISE_H_INCLUDED */
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 6c39d42f16dc..1f028a6e1d90 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -18,11 +18,62 @@
#include <linux/writeback.h>
#include <linux/syscalls.h>
#include <linux/swap.h>
+#include <linux/hugetlb.h>
#include <asm/unistd.h>
#include "internal.h"
+static int fadvise_inject_error(struct file *file, struct address_space *mapping,
+ loff_t offset, off_t endbyte, int advice)
+{
+ pgoff_t start_index, end_index, index, next;
+ struct folio *folio;
+ unsigned int shift;
+ unsigned long pfn;
+ int ret;
+
+ if (!IS_ENABLED(CONFIG_MEMORY_FAILURE))
+ return -EOPNOTSUPP;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (is_file_hugepages(file))
+ shift = huge_page_shift(hstate_file(file));
+ else
+ shift = PAGE_SHIFT;
+ start_index = offset >> shift;
+ end_index = endbyte >> shift;
+
+ index = start_index;
+ while (index <= end_index) {
+ folio = filemap_get_folio(mapping, index);
+ if (IS_ERR(folio))
+ return PTR_ERR(folio);
+
+ next = folio_next_index(folio);
+ pfn = folio_pfn(folio);
+ if (advice == FADV_SOFT_OFFLINE) {
+ pr_info("Soft offlining pfn %#lx at file index %#lx\n",
+ pfn, index);
+ ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
+ } else {
+ pr_info("Injecting memory failure for pfn %#lx at file index %#lx\n",
+ pfn, index);
+ ret = memory_failure(pfn, MF_COUNT_INCREASED | MF_SW_SIMULATED);
+ if (ret == -EOPNOTSUPP)
+ ret = 0;
+ }
+
+ if (ret)
+ return ret;
+ index = next;
+ }
+
+ return 0;
+}
+
/*
* POSIX_FADV_WILLNEED could set PG_Referenced, and POSIX_FADV_NOREUSE could
* deactivate the pages and clear PG_Referenced.
@@ -57,11 +108,13 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
case POSIX_FADV_NOREUSE:
case POSIX_FADV_DONTNEED:
/* no bad return value, but ignore advice */
+ return 0;
+ case FADV_HWPOISON:
+ case FADV_SOFT_OFFLINE:
break;
default:
return -EINVAL;
}
- return 0;
}
/*
@@ -170,6 +223,9 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
}
}
break;
+ case FADV_HWPOISON:
+ case FADV_SOFT_OFFLINE:
+ return fadvise_inject_error(file, mapping, offset, endbyte, advice);
default:
return -EINVAL;
}
--
2.25.1
From: Isaku Yamahata <[email protected]>
When testing KVM hwpoison of memory, the recovery path unmaps the virtual
memory and sends SIGBUS with its address. If the poisoned page is mapped
twice or more in virtual memory, the address in the SIGBUS siginfo isn't
deterministic.
To make the hwpoison test cases deterministic, allow tests to avoid the guest
memory alias mapping, as sketched below.
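A test opts out of the alias mapping by passing false for the new argument
(sketch; the other arguments are as with the existing helper):

	__vm_userspace_mem_region_add(vm, src_type, DATA_GPA, DATA_SLOT,
				      size / vm->page_size, 0 /* flags */,
				      false /* no alias mapping */);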
Signed-off-by: Isaku Yamahata <[email protected]>
---
.../selftests/kvm/include/kvm_util_base.h | 4 ++++
tools/testing/selftests/kvm/lib/kvm_util.c | 23 ++++++++++++++-----
2 files changed, 21 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index a18db6a7b3cf..e980776a2c94 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -435,6 +435,10 @@ void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
uint64_t gpa, uint64_t size, void *hva);
int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
uint64_t gpa, uint64_t size, void *hva);
+void __vm_userspace_mem_region_add(struct kvm_vm *vm,
+ enum vm_mem_backing_src_type src_type,
+ uint64_t guest_paddr, uint32_t slot, uint64_t npages,
+ uint32_t flags, bool alias);
void vm_userspace_mem_region_add(struct kvm_vm *vm,
enum vm_mem_backing_src_type src_type,
uint64_t guest_paddr, uint32_t slot, uint64_t npages,
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 7a8af1821f5d..310c3a760cb8 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -691,12 +691,14 @@ static void __vm_mem_region_delete(struct kvm_vm *vm,
sparsebit_free(®ion->unused_phy_pages);
ret = munmap(region->mmap_start, region->mmap_size);
TEST_ASSERT(!ret, __KVM_SYSCALL_ERROR("munmap()", ret));
- if (region->fd >= 0) {
- /* There's an extra map when using shared memory. */
+
+ /* There's an extra map when using shared memory. */
+ if (region->mmap_alias) {
ret = munmap(region->mmap_alias, region->mmap_size);
TEST_ASSERT(!ret, __KVM_SYSCALL_ERROR("munmap()", ret));
- close(region->fd);
}
+ if (region->fd >= 0)
+ close(region->fd);
free(region);
}
@@ -920,10 +922,10 @@ void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
* given by slot, which must be unique and < KVM_MEM_SLOTS_NUM. The
* region is created with the flags given by flags.
*/
-void vm_userspace_mem_region_add(struct kvm_vm *vm,
+void __vm_userspace_mem_region_add(struct kvm_vm *vm,
enum vm_mem_backing_src_type src_type,
uint64_t guest_paddr, uint32_t slot, uint64_t npages,
- uint32_t flags)
+ uint32_t flags, bool create_alias)
{
int ret;
struct userspace_mem_region *region;
@@ -1057,7 +1059,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
hash_add(vm->regions.slot_hash, ®ion->slot_node, slot);
/* If shared memory, create an alias. */
- if (region->fd >= 0) {
+ if (region->fd >= 0 && create_alias) {
region->mmap_alias = mmap(NULL, region->mmap_size,
PROT_READ | PROT_WRITE,
vm_mem_backing_src_alias(src_type)->flag,
@@ -1070,6 +1072,15 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
}
}
+void vm_userspace_mem_region_add(struct kvm_vm *vm,
+ enum vm_mem_backing_src_type src_type,
+ uint64_t guest_paddr, uint32_t slot, uint64_t npages,
+ uint32_t flags)
+{
+ __vm_userspace_mem_region_add(vm, src_type, guest_paddr, slot, npages,
+ flags, true);
+}
+
/*
* Memslot to region
*
--
2.25.1
From: Isaku Yamahata <[email protected]>
To test the KVM machine check injection path, KVM wants to trigger the
machine check on its own, not on a write to the debugfs bank file of the x86
MCE injector.
Because KVM shouldn't depend on the x86 MCE injector, define a notifier chain
in the x86 MCE core and make the x86 MCE injector register with it,
decoupling the MCE injector from its user. This follows how dev-mcelog does
it.
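KVM then fires the chain from the vcpu context on VM exit, as wired up later
in this series (sketch):

	/* KVM exit path, preemption still disabled. */
	mce_call_atomic_injector_chain(smp_processor_id());
	kvm_machine_check();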
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/mce.h | 6 ++++++
arch/x86/kernel/cpu/mce/core.c | 20 ++++++++++++++++++++
arch/x86/kernel/cpu/mce/inject.c | 18 ++++++++++++++++++
3 files changed, 44 insertions(+)
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 459066ecd922..1a832a207b6a 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -269,10 +269,16 @@ DECLARE_PER_CPU(struct mce, injectm);
void mce_inject_lock(void);
void mce_inject_unlock(void);
void mce_inject(struct mce *m);
+void mce_register_atomic_injector_chain(struct notifier_block *nb);
+void mce_unregister_atomic_injector_chain(struct notifier_block *nb);
+void mce_call_atomic_injector_chain(unsigned long cpu);
#else
static inline void mce_inject_lock(void) {}
static inline void mce_inject_unlock(void) {}
static inline void mce_inject(struct mce *m) {}
+static inline void mce_register_atomic_injector_chain(struct notifier_block *nb) {}
+static inline void mce_unregister_atomic_injector_chain(struct notifier_block *nb) {}
+static inline void mce_call_atomic_injector_chain(unsigned long cpu) {}
#endif
/* Disable CMCI/polling for MCA bank claimed by firmware */
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6929c3cad278..9a6b04be7404 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -169,6 +169,26 @@ void mce_inject(struct mce *m)
}
EXPORT_SYMBOL_GPL(mce_inject);
+static ATOMIC_NOTIFIER_HEAD(mce_atomic_injector_chain);
+
+void mce_register_atomic_injector_chain(struct notifier_block *nb)
+{
+ atomic_notifier_chain_register(&mce_atomic_injector_chain, nb);
+}
+EXPORT_SYMBOL_GPL(mce_register_atomic_injector_chain);
+
+void mce_unregister_atomic_injector_chain(struct notifier_block *nb)
+{
+ atomic_notifier_chain_unregister(&mce_atomic_injector_chain, nb);
+}
+EXPORT_SYMBOL_GPL(mce_unregister_atomic_injector_chain);
+
+void mce_call_atomic_injector_chain(unsigned long cpu)
+{
+ atomic_notifier_call_chain(&mce_atomic_injector_chain, cpu, NULL);
+}
+EXPORT_SYMBOL_GPL(mce_call_atomic_injector_chain);
+
void mce_log(struct mce *m)
{
if (!mce_gen_pool_add(m))
diff --git a/arch/x86/kernel/cpu/mce/inject.c b/arch/x86/kernel/cpu/mce/inject.c
index 43d896a89648..5dac4dcb25aa 100644
--- a/arch/x86/kernel/cpu/mce/inject.c
+++ b/arch/x86/kernel/cpu/mce/inject.c
@@ -303,8 +303,24 @@ static int fadvise_inject_addr(struct notifier_block *nb, unsigned long val,
static struct notifier_block fadvise_nb = {
.notifier_call = fadvise_inject_addr,
+};
+
+static int mce_atomic_injector(struct notifier_block *nb, unsigned long val,
+ void *data)
+{
+ lockdep_assert_preemption_disabled();
+
+ i_mce.extcpu = val;
+ mce_inject(&i_mce);
+ setup_inj_struct(&i_mce);
+
+ return NOTIFY_DONE;
}
+static struct notifier_block atomic_injector_nb = {
+ .notifier_call = mce_atomic_injector,
+};
+
/*
* Caller needs to be make sure this cpu doesn't disappear
* from under us, i.e.: get_cpu/put_cpu.
@@ -796,6 +812,7 @@ static int __init inject_init(void)
register_nmi_handler(NMI_LOCAL, mce_raise_notify, 0, "mce_notify");
mce_register_injector_chain(&inject_nb);
fadvise_register_mce_injector_chain(&fadvise_nb);
+ mce_register_atomic_injector_chain(&atomic_injector_nb);
setup_inj_struct(&i_mce);
@@ -807,6 +824,7 @@ static int __init inject_init(void)
static void __exit inject_exit(void)
{
+ mce_unregister_atomic_injector_chain(&atomic_injector_nb);
fadvise_unregister_mce_injector_chain(&fadvise_nb);
mce_unregister_injector_chain(&inject_nb);
unregister_nmi_handler(NMI_LOCAL, "mce_notify");
--
2.25.1
From: Isaku Yamahata <[email protected]>
Add test cases for KVM to inject hwpoison or an x86 MCE. They exercise
MADV_HWPOISON, FADV_HWPOISON, and KVM/x86 machine check injection.
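The injection method is selected on the command line, per the test's help
text:

	$ ./mem_hwpoison_test -m	# madvise(MADV_HWPOISON)
	$ ./mem_hwpoison_test -f	# posix_fadvise(FADV_HWPOISON)
	$ ./mem_hwpoison_test -i	# x86 MCE injection via KVM
	$ ./mem_hwpoison_test -r	# MCE injection from a remote thread
	$ ./mem_hwpoison_test -f -h	# fadvise with huge pages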
Signed-off-by: Isaku Yamahata <[email protected]>
---
tools/testing/selftests/kvm/Makefile | 1 +
.../testing/selftests/kvm/mem_hwpoison_test.c | 721 ++++++++++++++++++
2 files changed, 722 insertions(+)
create mode 100644 tools/testing/selftests/kvm/mem_hwpoison_test.c
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index a3bb36fb3cfc..ec9b4c6f1e75 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -128,6 +128,7 @@ TEST_GEN_PROGS_x86_64 += hardware_disable_test
TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
TEST_GEN_PROGS_x86_64 += kvm_page_table_test
TEST_GEN_PROGS_x86_64 += max_guest_memory_test
+TEST_GEN_PROGS_x86_64 += mem_hwpoison_test
TEST_GEN_PROGS_x86_64 += memslot_modification_stress_test
TEST_GEN_PROGS_x86_64 += memslot_perf_test
TEST_GEN_PROGS_x86_64 += rseq_test
diff --git a/tools/testing/selftests/kvm/mem_hwpoison_test.c b/tools/testing/selftests/kvm/mem_hwpoison_test.c
new file mode 100644
index 000000000000..cc4cdc6420fe
--- /dev/null
+++ b/tools/testing/selftests/kvm/mem_hwpoison_test.c
@@ -0,0 +1,721 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022, Google LLC.
+ * Copyright (C) 2023, Intel Corp.
+ */
+#define _GNU_SOURCE /* for program_invocation_short_name */
+
+#include <pthread.h>
+#include <signal.h>
+#include <setjmp.h>
+
+#include <linux/fadvise.h>
+#include <linux/sizes.h>
+
+#include <processor.h>
+
+#define DATA_SLOT 10
+#define DATA_GPA ((uint64_t)(1ull << 32))
+#define DATA_SIZE ((uint64_t)(SZ_2M))
+
+enum ucall_syncs {
+ /* Give kvm a chance to inject mce */
+ GUEST_NOP,
+ GUEST_HWPOISON,
+};
+
+static void guest_sync_nop(void)
+{
+ ucall(UCALL_SYNC, 1, GUEST_NOP);
+}
+
+static void guest_sync_hwpoison(uint64_t gpa)
+{
+ ucall(UCALL_SYNC, 2, GUEST_HWPOISON, gpa);
+}
+
+static void memcmp_g(const char *file, unsigned int line, uint64_t gpa,
+ uint8_t pattern, uint64_t size)
+{
+ uint8_t *mem = (uint8_t *)gpa;
+ size_t i;
+
+ for (i = 0; i < size; i++) {
+ if (mem[i] == pattern)
+ continue;
+
+ ucall_assert(UCALL_ABORT, "mem[i] == pattern",
+ file, line,
+ "Expected 0x%x at offset %lu (gpa 0x%lx), got 0x%x",
+ pattern, i, gpa + i, mem[i]);
+ }
+}
+#define MEMCMP_G(gpa, pattern, size) \
+ memcmp_g(__FILE__, __LINE__, (gpa), (pattern), (size))
+
+static const uint64_t test_offsets[] = {
+ 0,
+ PAGE_SIZE,
+};
+
+static void guest_code(uint64_t gpa, bool huge_page, bool do_wait)
+{
+ const uint8_t init_p = 0xcc;
+ uint64_t r;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(test_offsets); i++) {
+ uint64_t offset = test_offsets[i];
+ uint64_t base = gpa + offset;
+
+ /*
+ * Set the test region to non-zero to differentiate it from a
+ * newly assigned page.
+ */
+ memset((void *)gpa, init_p, SZ_2M);
+ if (do_wait)
+ /* Hold mce injector. */
+ WRITE_ONCE(*(uint8_t *)gpa, 0);
+
+ /* Ask VMM to poison the page. */
+ guest_sync_hwpoison(base);
+
+ if (do_wait) {
+ /* Allow mce injector to continue. */
+ WRITE_ONCE(*(uint8_t *)gpa, init_p);
+
+ /* Wait for poisoned page zeroed. */
+ while (READ_ONCE(*(uint8_t *)base) == init_p)
+ ;
+ }
+
+ /* When injecting mce, give KVM a chance to inject it. */
+ guest_sync_nop();
+
+ /* Consume poisoned data. */
+ r = READ_ONCE(*(uint64_t *)base);
+ /* VMM discarded the poisoned page and assigned a new page. */
+ GUEST_ASSERT_EQ(r, 0);
+
+ /* Check if the page is zeroed or it keeps its contents. */
+ if (huge_page) {
+ MEMCMP_G(gpa, 0, SZ_2M);
+ } else {
+ if (offset > 0)
+ MEMCMP_G(gpa, init_p, offset);
+ MEMCMP_G(base, 0, PAGE_SIZE);
+ if (offset + PAGE_SIZE < SZ_2M)
+ MEMCMP_G(gpa + offset + PAGE_SIZE, init_p,
+ SZ_2M - (offset + PAGE_SIZE));
+ }
+ }
+
+ GUEST_DONE();
+}
+
+#define MCE_INJECT_DIR "/sys/kernel/debug/mce-inject/"
+#define MCE_STATUS MCE_INJECT_DIR"status"
+#define MCE_MISC MCE_INJECT_DIR"misc"
+#define MCE_BANK MCE_INJECT_DIR"bank"
+#define MCE_FLAGS MCE_INJECT_DIR"flags"
+#define MCE_MCGSTATUS MCE_INJECT_DIR"mcgstatus"
+#define MCE_NOTRIGGER MCE_INJECT_DIR"notrigger"
+
+static void write_buf(const char *path, const char *buf, size_t size)
+{
+ ssize_t r;
+ int fd;
+
+ fd = open(path, O_WRONLY);
+ TEST_ASSERT(fd >= 0, "failed to open %s\n", path);
+
+ r = write(fd, buf, size);
+ TEST_ASSERT(r == size, "failed to write(%s:%s:%zd) = %zd\n",
+ path, buf, size, r);
+
+ close(fd);
+}
+
+static void mce_write(const char *path, uint64_t val)
+{
+ char buf[64];
+ int len;
+
+ len = sprintf(buf, "%"PRIu64"\n", val);
+ write_buf(path, buf, len);
+}
+
+static void mce_flags(void)
+{
+ char *buf = "sw\n";
+
+ write_buf(MCE_FLAGS, buf, strlen(buf));
+}
+
+/* From asm/mce.h */
+/* MCG_STATUS register defines */
+#define MCG_STATUS_EIPV BIT_ULL(1) /* ip points to correct instruction */
+#define MCG_STATUS_MCIP BIT_ULL(2) /* machine check in progress */
+#define MCG_STATUS_LMCES BIT_ULL(3) /* LMCE signaled */
+
+/* MCi_STATUS register defines */
+#define MCI_STATUS_VAL BIT_ULL(63) /* valid error */
+#define MCI_STATUS_UC BIT_ULL(61) /* uncorrected error */
+#define MCI_STATUS_EN BIT_ULL(60) /* error enabled */
+#define MCI_STATUS_MISCV BIT_ULL(59) /* misc error reg. valid */
+#define MCI_STATUS_ADDRV BIT_ULL(58) /* addr reg. valid */
+#define MCI_STATUS_AR BIT_ULL(55) /* Action required */
+#define MCI_STATUS_S BIT_ULL(56) /* Signaled machine check */
+
+#define MCACOD_DATA 0x0134 /* Data Load */
+
+/* MCi_MISC register defines */
+#define MCI_MISC_ADDR_MODE_SHIFT 6
+#define MCI_MISC_ADDR_PHYS 2
+
+#define KVM_MCE_INJECT "/sys/kernel/debug/kvm/%d-%d/vcpu%d/mce-inject"
+
+/* Worst case buffer size needed for holding an integer. */
+#define ITOA_MAX_LEN 12
+
+static void vcpu_inject_mce(struct kvm_vcpu *vcpu)
+{
+ char path[sizeof(KVM_MCE_INJECT) + ITOA_MAX_LEN * 3 + 1];
+ char data[] = "0";
+ ssize_t r;
+ int fd;
+
+ snprintf(path, sizeof(path), KVM_MCE_INJECT,
+ getpid(), vcpu->vm->fd, vcpu->id);
+
+ fd = open(path, O_WRONLY);
+ TEST_ASSERT(fd >= 0, "failed to open %s\n", path);
+
+ data[0] = '0';
+ r = write(fd, data, sizeof(data));
+ TEST_ASSERT(r == -1 && errno == EINVAL,
+ "succeeded to write(%s:%s:%zd) = %zd\n",
+ path, data, sizeof(data), r);
+
+ data[0] = '1';
+ r = write(fd, data, sizeof(data));
+ TEST_ASSERT(r == sizeof(data),
+ "failed to write(%s:%s:%zd) = %zd\n",
+ path, data, sizeof(data), r);
+
+ close(fd);
+}
+
+static void inject_mce(struct kvm_vcpu *vcpu, int memfd, uint64_t gpa)
+{
+ /* See vm_mem_add() in test_mem_failure() */
+ uint64_t offset = gpa - DATA_GPA;
+ int ret;
+
+ /* Writing to the bank file would trigger injection; suppress it. */
+ mce_write(MCE_NOTRIGGER, 1);
+
+ /* FIXME: These values are vendor specific. */
+ mce_write(MCE_MCGSTATUS,
+ MCG_STATUS_EIPV | MCG_STATUS_MCIP | MCG_STATUS_LMCES);
+ mce_write(MCE_MISC,
+ (MCI_MISC_ADDR_PHYS << MCI_MISC_ADDR_MODE_SHIFT) | 3);
+ /*
+ * MCI_STATUS_UC: Uncorrected error:
+ * MCI_STATUS_EN | MCI_STATUS_AR | MCI_STATUS_S:
+ * SRAR: Software Recoverable Action Required
+ */
+ mce_write(MCE_STATUS,
+ MCACOD_DATA |
+ MCI_STATUS_EN | MCI_STATUS_UC | MCI_STATUS_S | MCI_STATUS_AR |
+ MCI_STATUS_VAL | MCI_STATUS_MISCV | MCI_STATUS_ADDRV);
+ mce_flags();
+ mce_write(MCE_BANK, 0);
+
+ /* Set physical address to MCE_ADDR. */
+ ret = posix_fadvise(memfd, offset, sizeof(u64), FADV_MCE_INJECT);
+ /* posix_fadvise() doesn't set errno, but returns the error number. */
+ if (ret)
+ errno = ret;
+ __TEST_REQUIRE(ret != EPERM,
+ "Injecting mcet requires CAP_SYS_ADMIN ret %d", ret);
+ TEST_ASSERT(!ret || ret == EBUSY,
+ "posix_fadvise(FADV_MCE_INJECT) should success ret %d", ret);
+
+ /* Schedule to fire MCE on the next vcpu run. */
+ vcpu_inject_mce(vcpu);
+}
+
+static sigjmp_buf sigbuf;
+static volatile void *fault_addr;
+
+static void sigbus_handler(int sig, siginfo_t *info, void *data)
+{
+ int lsb = info->si_addr_lsb;
+
+ TEST_ASSERT(info->si_signo == SIGBUS,
+ "Unknown signal number expected %d received %d\n",
+ SIGBUS, info->si_signo);
+ TEST_ASSERT(info->si_code == BUS_MCEERR_AR,
+ "Unknown signal code received expected %d code %d\n",
+ BUS_MCEERR_AR, info->si_code);
+ TEST_ASSERT((info->si_addr_lsb == PAGE_SHIFT ||
+ info->si_addr_lsb == HUGEPAGE_SHIFT(2)),
+ "Unknown signal addr_lsb expected lsb %d or %d received %d\n",
+ PAGE_SHIFT, HUGEPAGE_SHIFT(2), info->si_addr_lsb);
+ TEST_ASSERT(((intptr_t)info->si_addr >> lsb) == ((intptr_t)fault_addr >> lsb),
+ "Unknown signal addr expected %p received %p lsb %d\n",
+ fault_addr, info->si_addr, lsb);
+
+ fault_addr = NULL;
+ siglongjmp(sigbuf, 1);
+}
+
+static void sigbus_handler_set(bool set)
+{
+ struct sigaction sa;
+ int r;
+
+ if (set)
+ sa = (struct sigaction) {
+ .sa_sigaction = sigbus_handler,
+ .sa_flags = SA_SIGINFO,
+ };
+ else
+ sa = (struct sigaction) {
+ .sa_handler = SIG_DFL,
+ };
+
+ r = sigaction(SIGBUS, &sa, NULL);
+ TEST_ASSERT(!r, "sigaction should success");
+}
+
+static void discard_nop(struct kvm_vcpu *vcpu)
+{
+ struct kvm_run *run = vcpu->run;
+ struct ucall uc;
+ uint64_t cmd;
+
+ vcpu_run(vcpu);
+ TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+ "Wanted KVM_EXIT_IO, got exit reason: %u (%s)",
+ run->exit_reason, exit_reason_str(run->exit_reason));
+ cmd = get_ucall(vcpu, &uc);
+ TEST_ASSERT(cmd == UCALL_SYNC, "UCALL_SYNC is expected %lu", cmd);
+ TEST_ASSERT(uc.args[0] == GUEST_NOP, "GUEST_NOP is expected %lu",
+ uc.args[0]);
+}
+
+static void test_madvise(struct kvm_vcpu *vcpu, int memfd, bool huge_page,
+ uint64_t gpa)
+{
+ uint8_t *hva = addr_gpa2hva(vcpu->vm, gpa);
+ int r;
+
+ discard_nop(vcpu);
+ r = madvise(hva, sizeof(u64), MADV_HWPOISON);
+ __TEST_REQUIRE(!(r == -1 && errno == EPERM),
+ "madvise(MADV_HWPOISON) requires CAP_SYS_ADMIN");
+ TEST_ASSERT(!r, "madvise(MADV_HWPOISON) should succeed");
+
+ if (!sigsetjmp(sigbuf, 1)) {
+ sigbus_handler_set(true);
+
+ /* Trigger SIGBUS */
+ fault_addr = hva;
+ vcpu_run(vcpu);
+ TEST_FAIL("SIGBUS isn't triggered");
+ }
+
+ /*
+ * Discard poisoned page and add a new page so that guest vcpu
+ * can continue.
+ */
+ if (huge_page) {
+ void *v;
+
+ /*
+ * madvise(MADV_FREE) doesn't work for huge page.
+ * Resort to munmap() and mmap().
+ */
+ gpa = align_down(gpa, SZ_2M);
+ hva = addr_gpa2hva(vcpu->vm, gpa);
+
+ r = munmap(hva, SZ_2M);
+ TEST_ASSERT(!r, "munmap() should succeed");
+
+ /* Map it again with new page for guest to continue. */
+ v = mmap(hva, SZ_2M, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED,
+ memfd, gpa - DATA_GPA);
+ TEST_ASSERT(v != MAP_FAILED,
+ __KVM_SYSCALL_ERROR("mmap()",
+ (int)(unsigned long)MAP_FAILED));
+ TEST_ASSERT(v == hva, "mmap(MAP_FIXED) v %p hva %p",
+ v, hva);
+ } else {
+ fault_addr = hva;
+ r = madvise(hva, PAGE_SIZE, MADV_FREE);
+ TEST_ASSERT(!r, "madvise(MADV_FREE) should success");
+ }
+
+ sigbus_handler_set(false);
+}
+
+static void punch_hole(struct kvm_vcpu *vcpu, int memfd, uint64_t gpa, uint64_t len)
+{
+ int r;
+
+ /* Discard poisoned page. */
+ r = fallocate(memfd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+ gpa - DATA_GPA, len);
+ TEST_ASSERT(!r || (r == -1 && errno == ENOENT),
+ "fallocate(PUNCH_HOLE, PAGE_SIZE) failed at fd = %d, offset = %lx\n",
+ memfd, gpa - DATA_GPA);
+}
+
+static void punch_hole_page(struct kvm_vcpu *vcpu, int memfd, bool huge_page,
+ uint64_t gpa)
+{
+ /* Discard the poisoned page and assign new page. */
+ if (huge_page) {
+ gpa = align_down(gpa, SZ_2M);
+ punch_hole(vcpu, memfd, gpa, SZ_2M);
+ } else
+ punch_hole(vcpu, memfd, gpa, PAGE_SIZE);
+}
+
+static void munmap_hole(struct kvm_vcpu *vcpu, int memfd, uint64_t gpa,
+ uint64_t len)
+{
+ uint8_t *hva = addr_gpa2hva(vcpu->vm, gpa);
+ void *v;
+ int r;
+
+ /* hwpoison is also recorded in the PTE entry. Clear it. */
+ r = munmap(hva, len);
+ TEST_ASSERT(!r, "munmap() should succeed");
+
+ /* Map it again with new page for guest to continue. */
+ v = mmap(hva, len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
+ memfd, gpa - DATA_GPA);
+ TEST_ASSERT(v != MAP_FAILED,
+ __KVM_SYSCALL_ERROR("mmap()", (int)(unsigned long)MAP_FAILED));
+ TEST_ASSERT(v == hva,
+ "mmap(MAP_FIXED) v %p hva %p", v, hva);
+}
+
+static void munmap_page(struct kvm_vcpu *vcpu, int memfd, bool huge_page,
+ uint64_t gpa)
+{
+ if (huge_page) {
+ gpa = align_down(gpa, SZ_2M);
+ munmap_hole(vcpu, memfd, gpa, SZ_2M);
+ } else
+ munmap_hole(vcpu, memfd, gpa, PAGE_SIZE);
+}
+
+static void test_fadvise(struct kvm_vcpu *vcpu, int memfd, bool huge_page,
+ uint64_t gpa)
+{
+ /* See vm_mem_add() in test_mem_failure() */
+ uint64_t offset = gpa - DATA_GPA;
+ int ret;
+
+ discard_nop(vcpu);
+ ret = posix_fadvise(memfd, offset, sizeof(u64), FADV_HWPOISON);
+ /* posix_fadvise() doesn't set errno, but returns the error number. */
+ if (ret)
+ errno = ret;
+ __TEST_REQUIRE(ret != EPERM,
+ "Injecting memory fault requires CAP_SYS_ADMIN ret %d",
+ ret);
+ TEST_ASSERT(!ret || ret == EBUSY,
+ "posix_fadvise(FADV_HWPOISON) should success ret %d", ret);
+
+ sigbus_handler_set(true);
+ if (!sigsetjmp(sigbuf, 1)) {
+ /* Trigger SIGBUS */
+ fault_addr = addr_gpa2hva(vcpu->vm, gpa);
+ vcpu_run(vcpu);
+ TEST_FAIL("SIGBUS isn't triggered");
+ }
+
+ punch_hole_page(vcpu, memfd, huge_page, gpa);
+ if (!huge_page && !sigsetjmp(sigbuf, 1)) {
+ fault_addr = addr_gpa2hva(vcpu->vm, gpa);
+ vcpu_run(vcpu);
+ TEST_FAIL("SIGBUS isn't triggered");
+ }
+ munmap_page(vcpu, memfd, huge_page, gpa);
+ sigbus_handler_set(false);
+}
+
+static void test_mce(struct kvm_vcpu *vcpu, int memfd, bool huge_page,
+ uint64_t gpa)
+{
+ inject_mce(vcpu, memfd, gpa);
+
+ sigbus_handler_set(true);
+ if (!sigsetjmp(sigbuf, 1)) {
+ /* Give KVM a chance to inject MCE and trigger SIGBUS. */
+ fault_addr = addr_gpa2hva(vcpu->vm, gpa);
+ discard_nop(vcpu);
+
+ /* As the mce framework uses a workqueue, give it time. */
+ sleep(1);
+ TEST_FAIL("SIGBUS isn't triggered");
+ }
+
+ if (!sigsetjmp(sigbuf, 1)) {
+ /* Trigger SIGBUS */
+ fault_addr = addr_gpa2hva(vcpu->vm, gpa);
+ vcpu_run(vcpu);
+ TEST_FAIL("SIGBUS isn't triggered");
+ }
+
+ punch_hole_page(vcpu, memfd, huge_page, gpa);
+ if (!huge_page && !sigsetjmp(sigbuf, 1)) {
+ fault_addr = addr_gpa2hva(vcpu->vm, gpa);
+ vcpu_run(vcpu);
+ TEST_FAIL("SIGBUS isn't triggered");
+ }
+ munmap_page(vcpu, memfd, huge_page, gpa);
+ sigbus_handler_set(false);
+}
+
+struct inject_mce_args {
+ struct kvm_vcpu *vcpu;
+ uint64_t gpa;
+ int memfd;
+};
+
+static void *inject_mce_remote(void *args_)
+{
+ struct inject_mce_args *args = args_;
+ struct kvm_vcpu *vcpu = args->vcpu;
+ uint8_t *hva = addr_gpa2hva(vcpu->vm, DATA_GPA);
+
+ /* Wait for vcpu running */
+ while (!READ_ONCE(*hva))
+ ;
+
+ inject_mce(vcpu, args->memfd, args->gpa);
+ return NULL;
+}
+
+static void test_mce_remote(struct kvm_vcpu *vcpu, int memfd, bool huge_page,
+ uint64_t gpa)
+{
+ uint64_t *hva = addr_gpa2hva(vcpu->vm, gpa);
+ struct inject_mce_args args = {
+ .memfd = memfd,
+ .vcpu = vcpu,
+ .gpa = gpa,
+ };
+ pthread_t thread;
+
+ /* Make another thread inject mce while vcpu running. */
+ fault_addr = hva;
+ pthread_create(&thread, NULL, inject_mce_remote, &args);
+
+ sigbus_handler_set(true);
+ if (!sigsetjmp(sigbuf, 1)) {
+ vcpu_run(vcpu);
+ TEST_FAIL("SIGBUS isn't triggered");
+ }
+ pthread_join(thread, NULL);
+
+ punch_hole_page(vcpu, memfd, huge_page, gpa);
+ if (!huge_page && !sigsetjmp(sigbuf, 1)) {
+ fault_addr = hva;
+ vcpu_run(vcpu);
+ TEST_FAIL("SIGBUS isn't triggered");
+ }
+
+ munmap_page(vcpu, memfd, huge_page, gpa);
+ sigbus_handler_set(false);
+ discard_nop(vcpu);
+}
+
+/* How to inject failure */
+enum failure_injection {
+ INJECT_MADVISE,
+ INJECT_FADVISE,
+ INJECT_MCE,
+ INJECT_MCE_REMOTE,
+};
+
+struct test_args {
+ struct kvm_vcpu *vcpu;
+ enum failure_injection how;
+ bool huge_page;
+ int memfd;
+};
+
+static void *__test_mem_failure(void *args_)
+{
+ struct test_args *args = args_;
+ enum failure_injection how = args->how;
+ struct kvm_vcpu *vcpu = args->vcpu;
+ bool huge_page = args->huge_page;
+ struct kvm_run *run = vcpu->run;
+ int memfd = args->memfd;
+ struct ucall uc;
+
+ for ( ;; ) {
+ vcpu_run(vcpu);
+
+ TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+ "Wanted KVM_EXIT_IO, got exit reason: %u (%s)",
+ run->exit_reason, exit_reason_str(run->exit_reason));
+
+ switch (get_ucall(vcpu, &uc)) {
+ case UCALL_ABORT:
+ REPORT_GUEST_ASSERT(uc);
+ break;
+ case UCALL_SYNC: {
+ uint64_t gpa = uc.args[1];
+
+ if (uc.args[0] == GUEST_NOP)
+ break;
+
+ TEST_ASSERT(uc.args[0] == GUEST_HWPOISON,
+ "Unknown sync command '%ld'", uc.args[0]);
+
+ switch (how) {
+ case INJECT_MADVISE:
+ test_madvise(vcpu, memfd, huge_page, gpa);
+ break;
+ case INJECT_FADVISE:
+ test_fadvise(vcpu, memfd, huge_page, gpa);
+ break;
+ case INJECT_MCE:
+ test_mce(vcpu, memfd, huge_page, gpa);
+ break;
+ case INJECT_MCE_REMOTE:
+ test_mce_remote(vcpu, memfd, huge_page, gpa);
+ break;
+ default:
+ TEST_FAIL("Unknown sync ucall %lu", uc.args[0]);
+ break;
+ }
+ break;
+ }
+ case UCALL_DONE:
+ return NULL;
+ default:
+ TEST_FAIL("Unknown ucall 0x%lx.", uc.cmd);
+ }
+ }
+}
+
+static void test_mem_failure(enum failure_injection how, bool huge_page)
+{
+ enum vm_mem_backing_src_type src_type;
+ struct userspace_mem_region *region;
+ struct test_args args;
+ struct kvm_vcpu *vcpu;
+ pthread_t thread;
+ struct kvm_vm *vm;
+ size_t size;
+ int r;
+
+ if (how == INJECT_MADVISE) {
+ /* madvise(MADV_FREE) requires anonymous region. */
+ if (huge_page)
+ src_type = VM_MEM_SRC_ANONYMOUS_MEMFD_HUGETLB;
+ else
+ src_type = VM_MEM_SRC_ANONYMOUS_MEMFD;
+ } else {
+ /*
+ * Use memfd_create() for fadvise() because fadvise() doesn't
+ * work on a private anonymous page.
+ */
+ if (huge_page)
+ src_type = VM_MEM_SRC_SHARED_HUGETLB;
+ else
+ src_type = VM_MEM_SRC_SHMEM;
+ }
+
+ vm = vm_create_with_one_vcpu(&vcpu, guest_code);
+
+ /*
+ * When poisoning memory, the kernel searches the mapped pages and
+ * injects SIGBUS with the virtual address. Avoid the alias mapping
+ * for a deterministic result.
+ */
+ size = align_up(DATA_SIZE, get_backing_src_pagesz(src_type));
+ __vm_userspace_mem_region_add(vm, src_type, DATA_GPA, DATA_SLOT,
+ size / vm->page_size, 0, false);
+
+ region = memslot2region(vm, DATA_SLOT);
+ if (huge_page) {
+ r = madvise(region->host_mem, size, MADV_HUGEPAGE);
+ TEST_ASSERT(!r, "madvise(MADV_HUGEPAGE) should succeed");
+ } else {
+ r = madvise(region->host_mem, size, MADV_NOHUGEPAGE);
+ TEST_ASSERT(!r, "madvise(MADV_NOHUGEPAGE) should succeed");
+ }
+
+ virt_map(vm, DATA_GPA, DATA_GPA, size / vm->page_size);
+ vcpu_args_set(vcpu, 3, DATA_GPA, huge_page, how == INJECT_MCE_REMOTE);
+
+ args = (struct test_args) {
+ .vcpu = vcpu,
+ .how = how,
+ .huge_page = huge_page,
+ .memfd = region->fd,
+ };
+ pthread_create(&thread, NULL, __test_mem_failure, &args);
+
+ pthread_join(thread, NULL);
+ kvm_vm_free(vm);
+}
+
+static void help(const char *prog_name)
+{
+ printf("usage: %s [-f] [-h] [-i] [-m] [-r] [-?]\n"
+ " -f: use fadvise for memory poison\n"
+ " -h: use huge page\n"
+ " -i: use mce injection\n"
+ " -m: use madvise for memory poison\n"
+ " -r: use mce injection remotely\n"
+ " -?: print this message\n",
+ prog_name);
+}
+
+int main(int argc, char *argv[])
+{
+ enum failure_injection how = INJECT_MADVISE;
+ bool huge_page = false;
+ int opt;
+
+ while ((opt = getopt(argc, argv, "fhimr?")) != -1) {
+ switch (opt) {
+ case 'f':
+ how = INJECT_FADVISE;
+ break;
+ case 'i':
+ how = INJECT_MCE;
+ break;
+ case 'h':
+ huge_page = true;
+ break;
+ case 'm':
+ how = INJECT_MADVISE;
+ break;
+ case 'r':
+ how = INJECT_MCE_REMOTE;
+ break;
+ case '?':
+ default:
+ help(argv[0]);
+ exit(0);
+ }
+ }
+
+ test_mem_failure(how, huge_page);
+
+ return 0;
+}
--
2.25.1
From: Isaku Yamahata <[email protected]>
KVM/x86 handles a machine check in the guest specially: it sets up the guest
so that the vcpu exits from the guest on a machine check, checks the exit
reason, and manually raises the machine check by calling do_machine_check().
To test the KVM machine check execution path, KVM wants to inject the machine
check in the vcpu context instead of the context of the MCE-injecting
process. Wire up the MCE injection framework for KVM to trigger the MCE in
the vcpu context, and add a KVM vcpu debugfs entry for an operator to tell
KVM to inject the MCE.
The operation flow is as follows:
- Set notrigger to 1 to tell the x86 MCE injector not to inject the
  machine check itself.
echo 1 > /sys/kernel/debug/mce-inject/notrigger
- Set MCE parameters via x86 MCE injector debugfs
/sys/kernel/debug/mce-inject/{addr, bank, flags, mcgstatus, misc, status}
- Tell KVM to inject MCE
echo 1 > /sys/kernel/debug/kvm/<pid>-<vm-fd>/vcpu<vcpuid>/mce-inject
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/debugfs.c | 22 ++++++++++++++++++++++
arch/x86/kvm/x86.c | 14 ++++++++++++++
3 files changed, 37 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 17715cb8731d..9286f3d02f30 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -113,6 +113,7 @@
KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQ_HV_TLB_FLUSH \
KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_MCE_INJECT KVM_ARCH_REQ(33)
#define CR0_RESERVED_BITS \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
index ee8c4c3496ed..fee208f30400 100644
--- a/arch/x86/kvm/debugfs.c
+++ b/arch/x86/kvm/debugfs.c
@@ -56,6 +56,22 @@ static int vcpu_get_tsc_scaling_frac_bits(void *data, u64 *val)
DEFINE_SIMPLE_ATTRIBUTE(vcpu_tsc_scaling_frac_fops, vcpu_get_tsc_scaling_frac_bits, NULL, "%llu\n");
+static int vcpu_mce_inject_set(void *data, u64 val)
+{
+ struct kvm_vcpu *vcpu = (struct kvm_vcpu *) data;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (val != 1)
+ return -EINVAL;
+ kvm_make_request(KVM_REQ_MCE_INJECT, vcpu);
+ kvm_vcpu_kick(vcpu);
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(vcpu_mce_inject_fops, NULL, vcpu_mce_inject_set, "%llx\n");
+
void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry)
{
debugfs_create_file("guest_mode", 0444, debugfs_dentry, vcpu,
@@ -76,6 +92,12 @@ void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_
debugfs_dentry, vcpu,
&vcpu_tsc_scaling_frac_fops);
}
+
+ if (IS_ENABLED(CONFIG_X86_MCE_INJECT) &&
+ boot_cpu_has(X86_FEATURE_MCE) && boot_cpu_has(X86_FEATURE_MCA))
+ debugfs_create_file("mce-inject", 0200,
+ debugfs_dentry, vcpu,
+ &vcpu_mce_inject_fops);
}
/*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9f18b06bbda6..e4c63ded4c9a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10496,6 +10496,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
fastpath_t exit_fastpath;
bool req_immediate_exit = false;
+ bool req_mce_inject = false;
if (kvm_request_pending(vcpu)) {
if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
@@ -10642,6 +10643,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
+
+ req_mce_inject = kvm_check_request(KVM_REQ_MCE_INJECT, vcpu);
}
if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
@@ -10676,6 +10679,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
goto cancel_injection;
}
+ if (unlikely(req_mce_inject))
+ mce_inject_lock();
preempt_disable();
static_call(kvm_x86_prepare_switch_to_guest)(vcpu);
@@ -10721,6 +10726,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
smp_wmb();
local_irq_enable();
preempt_enable();
+ if (unlikely(req_mce_inject)) {
+ kvm_make_request(KVM_REQ_MCE_INJECT, vcpu);
+ mce_inject_unlock();
+ }
kvm_vcpu_srcu_read_lock(vcpu);
r = 1;
goto cancel_injection;
@@ -10814,6 +10823,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
fpu_sync_guest_vmexit_xfd_state();
static_call(kvm_x86_handle_exit_irqoff)(vcpu);
+ if (unlikely(req_mce_inject)) {
+ mce_call_atomic_injector_chain(smp_processor_id());
+ kvm_machine_check();
+ mce_inject_unlock();
+ }
if (vcpu->arch.guest_fpu.xfd_err)
wrmsrl(MSR_IA32_XFD_ERR, 0);
--
2.25.1
From: Isaku Yamahata <[email protected]>
To inject a machine check based on FADV_MCE_INJECT, wire up the x86 MCE
injector to register a notifier that receives the physical address passed
to FADV_MCE_INJECT.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kernel/cpu/mce/inject.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/arch/x86/kernel/cpu/mce/inject.c b/arch/x86/kernel/cpu/mce/inject.c
index ae3efbeb78bd..43d896a89648 100644
--- a/arch/x86/kernel/cpu/mce/inject.c
+++ b/arch/x86/kernel/cpu/mce/inject.c
@@ -294,6 +294,17 @@ static struct notifier_block inject_nb = {
.notifier_call = mce_inject_raise,
};
+static int fadvise_inject_addr(struct notifier_block *nb, unsigned long val,
+ void *data)
+{
+ inj_addr_set(&i_mce, val);
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block fadvise_nb = {
+ .notifier_call = fadvise_inject_addr,
+};
+
/*
* Caller needs to be make sure this cpu doesn't disappear
* from under us, i.e.: get_cpu/put_cpu.
@@ -784,6 +795,7 @@ static int __init inject_init(void)
register_nmi_handler(NMI_LOCAL, mce_raise_notify, 0, "mce_notify");
mce_register_injector_chain(&inject_nb);
+ fadvise_register_mce_injector_chain(&fadvise_nb);
setup_inj_struct(&i_mce);
@@ -795,6 +807,7 @@ static int __init inject_init(void)
static void __exit inject_exit(void)
{
+ fadvise_unregister_mce_injector_chain(&fadvise_nb);
mce_unregister_injector_chain(&inject_nb);
unregister_nmi_handler(NMI_LOCAL, "mce_notify");
--
2.25.1
Isaku, when you report a bug, it would be great to include the folks who
authored and worked on the original patch that introduced the bug. I've
gone ahead and done that for you here.
On 10/10/23 01:35, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> When initializing x86 MCE injection framework, it checks if hardware mce
> injection is available or not. When it's not available on AMD, set the
> boolean variable to false to not use it. The variable is on by default and
> the feature is AMD specific based on the code.
>
> Because the variable is default on, it is true on Intel platform (probably
> on other non-AMD x86 platform). It results in unchecked msr access of
> MSR_K7_HWCR=0xc0010015 when injecting MCE on Intel platform. (Probably on
> other x86 platform.)
>
> Make the variable of by default, and set the variable on when the hardware
> feature is usable.
Gah, I'm finding that changelog impenetrable. Here's what's missing:
* The entirety of check_hw_inj_possible() is for AMD hardware:
X86_FEATURE_SMCA, the MSRs, hw_injection_possible, everything.
* Only AMD systems with SMCA support hardware error injection
(anything other than "echo sw > /sys/kernel/debug/mce-inject/flags")
* That AMD-only restriction is enforced by 'hw_injection_possible'
* 'hw_injection_possible' is true by default and only set to false in
check_hw_inj_possible() ... the AMD-only code
The end result is that everyone except SMCA-enabled AMD systems (Intel
included) leaves hw_injection_possible=true. They are free to try and
inject hardware errors. If they do, they'll get errors when writing to
the MSRs.
To fix this, disable hw_injection_possible by default. Only enable
it on SMCA hardware that actually succeeds in ... whatever:
wrmsrl_safe(mca_msr_reg(bank, MCA_STATUS), status);
rdmsrl_safe(mca_msr_reg(bank, MCA_STATUS), &status);
is doing.
... and don't do it at the top of the function. Why bother setting it
to true only to disable it a moment later?
Do something like the following instead.
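Something along these lines (a rough, untested sketch; the SMCA bank
probing and the toggle_hw_mce_inject() dance are assumed from the
existing check_hw_inj_possible() code):

static bool hw_injection_possible;	/* false until proven otherwise */

static void check_hw_inj_possible(void)
{
	int cpu;
	u8 bank;

	/* Hardware injection only exists on SMCA-enabled AMD systems: */
	if (!cpu_feature_enabled(X86_FEATURE_SMCA))
		return;

	cpu = get_cpu();

	for (bank = 0; bank < MAX_NR_BANKS; ++bank) {
		u64 status = MCI_STATUS_VAL, ipid;

		/* Skip unpopulated banks: */
		rdmsrl(MSR_AMD64_SMCA_MCx_IPID(bank), ipid);
		if (!ipid)
			continue;

		toggle_hw_mce_inject(cpu, true);

		/* Does a write to MCA_STATUS actually stick? */
		wrmsrl_safe(mca_msr_reg(bank, MCA_STATUS), status);
		rdmsrl_safe(mca_msr_reg(bank, MCA_STATUS), &status);
		wrmsrl_safe(mca_msr_reg(bank, MCA_STATUS), 0);

		toggle_hw_mce_inject(cpu, false);

		if (status) {
			/* Yes: hardware injection is usable here. */
			hw_injection_possible = true;
			break;
		}
	}

	put_cpu();
}

That way, Intel and other non-SMCA systems never see
hw_injection_possible=true in the first place.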
On Tue, Oct 10, 2023, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> KVM/x86 handles machine checks in the guest specially. It sets up the
> guest so that the vcpu exits from the running guest on a machine check,
> checks the exit reason, and manually raises the machine check by calling
> do_machine_check().
>
> To test the KVM machine-check execution path, KVM wants to inject the
> machine check in the context of the vcpu instead of the context of the
> MCE injector process. Wire up the MCE injection framework so that KVM can
> trigger the MCE in vcpu context, and add a KVM vcpu debugfs entry for an
> operator to tell KVM to inject the MCE.
> tell KVM to inject MCE.
But this isn't "injecting" a #MC, it's just having KVM call do_machine_check()
before enabling IRQs after a VM-Exit. I don't see how that is interesting enough
to warrant a dedicated knob and code in KVM's run loop.
> @@ -10814,6 +10823,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> fpu_sync_guest_vmexit_xfd_state();
>
> static_call(kvm_x86_handle_exit_irqoff)(vcpu);
> + if (unlikely(req_mce_inject)) {
> + mce_call_atomic_injector_chain(smp_processor_id());
> + kvm_machine_check();
> + mce_inject_unlock();
> + }
>
> if (vcpu->arch.guest_fpu.xfd_err)
> wrmsrl(MSR_IA32_XFD_ERR, 0);
> --
> 2.25.1
>