CVE-2018-12207 is a microarchitectural implementation issue
that could allow an unprivileged local attacker to cause system wide
denial-of-service condition.
Privileged software may change the page size (ex. 4KB, 2MB, 1GB) in the
paging structures, without following such paging structure changes with
invalidation of the TLB entries corresponding to the changed pages. In
this case, the attacker could invoke instruction fetch, which will result
in the processor hitting multiple TLB entries, reporting a machine check
error exception, and ultimately hanging the system.
The attached patches mitigate the vulnerability by making huge pages
non-executable. The processor will not be able to execute an instruction
residing in a large page (ie. 2MB, 1GB, etc.) without causing a trap into
the host kernel/hypervisor; KVM will then break the large page into 4KB
pages and gives executable permission to 4KB pages.
Thanks to everyone that was involved in the development of these patches,
especially Junaid Shahid, who provided the first version of the code,
and Thomas Gleixner.
Paolo
Gomez Iglesias, Antonio (1):
Documentation: Add ITLB_MULTIHIT documentation
Junaid Shahid (2):
kvm: Add helper function for creating VM worker threads
kvm: x86: mmu: Recovery of shattered NX large pages
Paolo Bonzini (1):
kvm: mmu: ITLB_MULTIHIT mitigation
Pawan Gupta (1):
x86/cpu: Add Tremont to the cpu vulnerability whitelist
Tyler Hicks (1):
cpu/speculation: Uninline and export CPU mitigations helpers
Vineela Tummalapalli (1):
x86/bugs: Add ITLB_MULTIHIT bug infrastructure
Documentation/ABI/testing/sysfs-devices-system-cpu | 1 +
Documentation/admin-guide/hw-vuln/index.rst | 1 +
Documentation/admin-guide/hw-vuln/multihit.rst | 163 +++++++++++++
Documentation/admin-guide/kernel-parameters.txt | 25 ++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/kvm_host.h | 6 +
arch/x86/include/asm/msr-index.h | 7 +
arch/x86/kernel/cpu/bugs.c | 24 ++
arch/x86/kernel/cpu/common.c | 67 ++---
arch/x86/kvm/mmu.c | 270 ++++++++++++++++++++-
arch/x86/kvm/mmu.h | 4 +
arch/x86/kvm/paging_tmpl.h | 29 ++-
arch/x86/kvm/x86.c | 20 ++
drivers/base/cpu.c | 8 +
include/linux/cpu.h | 27 +--
include/linux/kvm_host.h | 6 +
kernel/cpu.c | 27 ++-
virt/kvm/kvm_main.c | 112 +++++++++
18 files changed, 732 insertions(+), 66 deletions(-)
create mode 100644 Documentation/admin-guide/hw-vuln/multihit.rst
--
1.8.3.1
From: Pawan Gupta <[email protected]>
Add the new cpu family ATOM_TREMONT_D to the cpu vunerability
whitelist. ATOM_TREMONT_D is not affected by X86_BUG_ITLB_MULTIHIT.
ATOM_TREMONT_D might have mitigations against other issues as well, but
only the ITLB multihit mitigation is confirmed at this point.
Signed-off-by: Pawan Gupta <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/kernel/cpu/common.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index d29b71ca3ca7..fffe21945374 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1074,6 +1074,8 @@ static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
* good enough for our purposes.
*/
+ VULNWL_INTEL(ATOM_TREMONT_D, NO_ITLB_MULTIHIT),
+
/* AMD Family 0xf - 0x12 */
VULNWL_AMD(0x0f, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
VULNWL_AMD(0x10, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
--
1.8.3.1
With some Intel processors, putting the same virtual address in the TLB
as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit
and cause the processor to issue a machine check resulting in a CPU lockup.
Unfortunately when EPT page tables use huge pages, it is possible for a
malicious guest to cause this situation.
Add a knob to mark huge pages as non-executable. When the nx_huge_pages
parameter is enabled (and we are using EPT), all huge pages are marked as
NX. If the guest attempts to execute in one of those pages, the page is
broken down into 4K pages, which are then marked executable.
This is not an issue for shadow paging (except nested EPT), because then
the host is in control of TLB flushes and the problematic situation cannot
happen. With nested EPT, again the nested guest can cause problems shadow
and direct EPT is treated in the same way.
[ tglx: Fixup default to auto and massage wording a bit ]
Originally-by: Junaid Shahid <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 19 ++++
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kernel/cpu/bugs.c | 13 ++-
arch/x86/kvm/mmu.c | 141 +++++++++++++++++++++++-
arch/x86/kvm/paging_tmpl.h | 29 ++++-
arch/x86/kvm/x86.c | 9 ++
6 files changed, 200 insertions(+), 13 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fa8f03ddff24..9d5f123cc218 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2055,6 +2055,19 @@
KVM MMU at runtime.
Default is 0 (off)
+ kvm.nx_huge_pages=
+ [KVM] Controls the software workaround for the
+ X86_BUG_ITLB_MULTIHIT bug.
+ force : Always deploy workaround.
+ off : Never deploy workaround.
+ auto : Deploy workaround based on the presence of
+ X86_BUG_ITLB_MULTIHIT.
+
+ Default is 'auto'.
+
+ If the software workaround is enabled for the host,
+ guests do need not to enable it for nested guests.
+
kvm-amd.nested= [KVM,AMD] Allow nested virtualization in KVM/SVM.
Default is 1 (enabled)
@@ -2637,6 +2650,12 @@
l1tf=off [X86]
mds=off [X86]
tsx_async_abort=off [X86]
+ kvm.nx_huge_pages=off [X86]
+
+ Exceptions:
+ This does not have any effect on
+ kvm.nx_huge_pages when
+ kvm.nx_huge_pages=force.
auto (default)
Mitigate all CPU vulnerabilities, but leave SMT
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 24d6598dea29..a37b03483b66 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -315,6 +315,7 @@ struct kvm_mmu_page {
bool unsync;
u8 mmu_valid_gen;
bool mmio_cached;
+ bool lpage_disallowed; /* Can't be replaced by an equiv large page */
/*
* The following two entries are used to key the shadow page in the
@@ -946,6 +947,7 @@ struct kvm_vm_stat {
ulong mmu_unsync;
ulong remote_tlb_flush;
ulong lpages;
+ ulong nx_lpage_splits;
ulong max_mmu_page_hash_collisions;
};
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 5364beda8c61..850005590167 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1257,6 +1257,9 @@ void x86_spec_ctrl_setup_ap(void)
x86_amd_ssb_disable();
}
+bool itlb_multihit_kvm_mitigation;
+EXPORT_SYMBOL_GPL(itlb_multihit_kvm_mitigation);
+
#undef pr_fmt
#define pr_fmt(fmt) "L1TF: " fmt
@@ -1412,17 +1415,25 @@ static ssize_t l1tf_show_state(char *buf)
l1tf_vmx_states[l1tf_vmx_mitigation],
sched_smt_active() ? "vulnerable" : "disabled");
}
+
+static ssize_t itlb_multihit_show_state(char *buf)
+{
+ if (itlb_multihit_kvm_mitigation)
+ return sprintf(buf, "KVM: Mitigation: Split huge pages\n");
+ else
+ return sprintf(buf, "KVM: Vulnerable\n");
+}
#else
static ssize_t l1tf_show_state(char *buf)
{
return sprintf(buf, "%s\n", L1TF_DEFAULT_MSG);
}
-#endif
static ssize_t itlb_multihit_show_state(char *buf)
{
return sprintf(buf, "Processor vulnerable\n");
}
+#endif
static ssize_t mds_show_state(char *buf)
{
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24c23c66b226..bedf6864b092 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -47,6 +47,20 @@
#include <asm/kvm_page_track.h>
#include "trace.h"
+extern bool itlb_multihit_kvm_mitigation;
+
+static int __read_mostly nx_huge_pages = -1;
+
+static int set_nx_huge_pages(const char *val, const struct kernel_param *kp);
+
+static struct kernel_param_ops nx_huge_pages_ops = {
+ .set = set_nx_huge_pages,
+ .get = param_get_bool,
+};
+
+module_param_cb(nx_huge_pages, &nx_huge_pages_ops, &nx_huge_pages, 0644);
+__MODULE_PARM_TYPE(nx_huge_pages, "bool");
+
/*
* When setting this variable to true it enables Two-Dimensional-Paging
* where the hardware walks 2 page tables:
@@ -352,6 +366,11 @@ static inline bool spte_ad_need_write_protect(u64 spte)
return (spte & SPTE_SPECIAL_MASK) != SPTE_AD_ENABLED_MASK;
}
+static bool is_nx_huge_page_enabled(void)
+{
+ return READ_ONCE(nx_huge_pages);
+}
+
static inline u64 spte_shadow_accessed_mask(u64 spte)
{
MMU_WARN_ON(is_mmio_spte(spte));
@@ -1190,6 +1209,15 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
kvm_mmu_gfn_disallow_lpage(slot, gfn);
}
+static void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ if (sp->lpage_disallowed)
+ return;
+
+ ++kvm->stat.nx_lpage_splits;
+ sp->lpage_disallowed = true;
+}
+
static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
{
struct kvm_memslots *slots;
@@ -1207,6 +1235,12 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
kvm_mmu_gfn_allow_lpage(slot, gfn);
}
+static void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ --kvm->stat.nx_lpage_splits;
+ sp->lpage_disallowed = false;
+}
+
static bool __mmu_gfn_lpage_is_disallowed(gfn_t gfn, int level,
struct kvm_memory_slot *slot)
{
@@ -2792,6 +2826,9 @@ static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
kvm_reload_remote_mmus(kvm);
}
+ if (sp->lpage_disallowed)
+ unaccount_huge_nx_page(kvm, sp);
+
sp->role.invalid = 1;
return list_unstable;
}
@@ -3013,6 +3050,11 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
if (!speculative)
spte |= spte_shadow_accessed_mask(spte);
+ if (level > PT_PAGE_TABLE_LEVEL && (pte_access & ACC_EXEC_MASK) &&
+ is_nx_huge_page_enabled()) {
+ pte_access &= ~ACC_EXEC_MASK;
+ }
+
if (pte_access & ACC_EXEC_MASK)
spte |= shadow_x_mask;
else
@@ -3233,9 +3275,32 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
__direct_pte_prefetch(vcpu, sp, sptep);
}
+static void disallowed_hugepage_adjust(struct kvm_shadow_walk_iterator it,
+ gfn_t gfn, kvm_pfn_t *pfnp, int *levelp)
+{
+ int level = *levelp;
+ u64 spte = *it.sptep;
+
+ if (it.level == level && level > PT_PAGE_TABLE_LEVEL &&
+ is_nx_huge_page_enabled() &&
+ is_shadow_present_pte(spte) &&
+ !is_large_pte(spte)) {
+ /*
+ * A small SPTE exists for this pfn, but FNAME(fetch)
+ * and __direct_map would like to create a large PTE
+ * instead: just force them to go down another level,
+ * patching back for them into pfn the next 9 bits of
+ * the address.
+ */
+ u64 page_mask = KVM_PAGES_PER_HPAGE(level) - KVM_PAGES_PER_HPAGE(level - 1);
+ *pfnp |= gfn & page_mask;
+ (*levelp)--;
+ }
+}
+
static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, int write,
int map_writable, int level, kvm_pfn_t pfn,
- bool prefault)
+ bool prefault, bool lpage_disallowed)
{
struct kvm_shadow_walk_iterator it;
struct kvm_mmu_page *sp;
@@ -3248,6 +3313,12 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, int write,
trace_kvm_mmu_spte_requested(gpa, level, pfn);
for_each_shadow_entry(vcpu, gpa, it) {
+ /*
+ * We cannot overwrite existing page tables with an NX
+ * large page, as the leaf could be executable.
+ */
+ disallowed_hugepage_adjust(it, gfn, &pfn, &level);
+
base_gfn = gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
if (it.level == level)
break;
@@ -3258,6 +3329,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, int write,
it.level - 1, true, ACC_ALL);
link_shadow_page(vcpu, it.sptep, sp);
+ if (lpage_disallowed)
+ account_huge_nx_page(vcpu->kvm, sp);
}
}
@@ -3550,11 +3623,14 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
{
int r;
int level;
- bool force_pt_level = false;
+ bool force_pt_level;
kvm_pfn_t pfn;
unsigned long mmu_seq;
bool map_writable, write = error_code & PFERR_WRITE_MASK;
+ bool lpage_disallowed = (error_code & PFERR_FETCH_MASK) &&
+ is_nx_huge_page_enabled();
+ force_pt_level = lpage_disallowed;
level = mapping_level(vcpu, gfn, &force_pt_level);
if (likely(!force_pt_level)) {
/*
@@ -3588,7 +3664,8 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
goto out_unlock;
if (likely(!force_pt_level))
transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
- r = __direct_map(vcpu, v, write, map_writable, level, pfn, prefault);
+ r = __direct_map(vcpu, v, write, map_writable, level, pfn,
+ prefault, false);
out_unlock:
spin_unlock(&vcpu->kvm->mmu_lock);
kvm_release_pfn_clean(pfn);
@@ -4174,6 +4251,8 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
unsigned long mmu_seq;
int write = error_code & PFERR_WRITE_MASK;
bool map_writable;
+ bool lpage_disallowed = (error_code & PFERR_FETCH_MASK) &&
+ is_nx_huge_page_enabled();
MMU_WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa));
@@ -4184,8 +4263,9 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
if (r)
return r;
- force_pt_level = !check_hugepage_cache_consistency(vcpu, gfn,
- PT_DIRECTORY_LEVEL);
+ force_pt_level =
+ lpage_disallowed ||
+ !check_hugepage_cache_consistency(vcpu, gfn, PT_DIRECTORY_LEVEL);
level = mapping_level(vcpu, gfn, &force_pt_level);
if (likely(!force_pt_level)) {
if (level > PT_DIRECTORY_LEVEL &&
@@ -4214,7 +4294,8 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
goto out_unlock;
if (likely(!force_pt_level))
transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
- r = __direct_map(vcpu, gpa, write, map_writable, level, pfn, prefault);
+ r = __direct_map(vcpu, gpa, write, map_writable, level, pfn,
+ prefault, lpage_disallowed);
out_unlock:
spin_unlock(&vcpu->kvm->mmu_lock);
kvm_release_pfn_clean(pfn);
@@ -6155,10 +6236,58 @@ static void kvm_set_mmio_spte_mask(void)
kvm_mmu_set_mmio_spte_mask(mask, mask, ACC_WRITE_MASK | ACC_USER_MASK);
}
+static bool get_nx_auto_mode(void)
+{
+ /* Return true when CPU has the bug, and mitigations are ON */
+ return boot_cpu_has_bug(X86_BUG_ITLB_MULTIHIT) && !cpu_mitigations_off();
+}
+
+static void __set_nx_huge_pages(bool val)
+{
+ nx_huge_pages = itlb_multihit_kvm_mitigation = val;
+}
+
+static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
+{
+ bool old_val = nx_huge_pages;
+ bool new_val;
+
+ /* In "auto" mode deploy workaround only if CPU has the bug. */
+ if (sysfs_streq(val, "off"))
+ new_val = 0;
+ else if (sysfs_streq(val, "force"))
+ new_val = 1;
+ else if (sysfs_streq(val, "auto"))
+ new_val = get_nx_auto_mode();
+ else if (strtobool(val, &new_val) < 0)
+ return -EINVAL;
+
+ __set_nx_huge_pages(new_val);
+
+ if (new_val != old_val) {
+ struct kvm *kvm;
+ int idx;
+
+ mutex_lock(&kvm_lock);
+
+ list_for_each_entry(kvm, &vm_list, vm_list) {
+ idx = srcu_read_lock(&kvm->srcu);
+ kvm_mmu_zap_all_fast(kvm);
+ srcu_read_unlock(&kvm->srcu, idx);
+ }
+ mutex_unlock(&kvm_lock);
+ }
+
+ return 0;
+}
+
int kvm_mmu_module_init(void)
{
int ret = -ENOMEM;
+ if (nx_huge_pages == -1)
+ __set_nx_huge_pages(get_nx_auto_mode());
+
/*
* MMU roles use union aliasing which is, generally speaking, an
* undefined behavior. However, we supposedly know how compilers behave
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 7d5cdb3af594..97b21e7fd013 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -614,13 +614,14 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
struct guest_walker *gw,
int write_fault, int hlevel,
- kvm_pfn_t pfn, bool map_writable, bool prefault)
+ kvm_pfn_t pfn, bool map_writable, bool prefault,
+ bool lpage_disallowed)
{
struct kvm_mmu_page *sp = NULL;
struct kvm_shadow_walk_iterator it;
unsigned direct_access, access = gw->pt_access;
int top_level, ret;
- gfn_t base_gfn;
+ gfn_t gfn, base_gfn;
direct_access = gw->pte_access;
@@ -665,13 +666,25 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
link_shadow_page(vcpu, it.sptep, sp);
}
- base_gfn = gw->gfn;
+ /*
+ * FNAME(page_fault) might have clobbered the bottom bits of
+ * gw->gfn, restore them from the virtual address.
+ */
+ gfn = gw->gfn | ((addr & PT_LVL_OFFSET_MASK(gw->level)) >> PAGE_SHIFT);
+ base_gfn = gfn;
trace_kvm_mmu_spte_requested(addr, gw->level, pfn);
for (; shadow_walk_okay(&it); shadow_walk_next(&it)) {
clear_sp_write_flooding_count(it.sptep);
- base_gfn = gw->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
+
+ /*
+ * We cannot overwrite existing page tables with an NX
+ * large page, as the leaf could be executable.
+ */
+ disallowed_hugepage_adjust(it, gfn, &pfn, &hlevel);
+
+ base_gfn = gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
if (it.level == hlevel)
break;
@@ -683,6 +696,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
sp = kvm_mmu_get_page(vcpu, base_gfn, addr,
it.level - 1, true, direct_access);
link_shadow_page(vcpu, it.sptep, sp);
+ if (lpage_disallowed)
+ account_huge_nx_page(vcpu->kvm, sp);
}
}
@@ -759,9 +774,11 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
int r;
kvm_pfn_t pfn;
int level = PT_PAGE_TABLE_LEVEL;
- bool force_pt_level = false;
unsigned long mmu_seq;
bool map_writable, is_self_change_mapping;
+ bool lpage_disallowed = (error_code & PFERR_FETCH_MASK) &&
+ is_nx_huge_page_enabled();
+ bool force_pt_level = lpage_disallowed;
pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code);
@@ -851,7 +868,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
if (!force_pt_level)
transparent_hugepage_adjust(vcpu, walker.gfn, &pfn, &level);
r = FNAME(fetch)(vcpu, addr, &walker, write_fault,
- level, pfn, map_writable, prefault);
+ level, pfn, map_writable, prefault, lpage_disallowed);
kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
out_unlock:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 32d70ca2a7fd..b087d178a774 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -213,6 +213,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ "mmu_unsync", VM_STAT(mmu_unsync) },
{ "remote_tlb_flush", VM_STAT(remote_tlb_flush) },
{ "largepages", VM_STAT(lpages, .mode = 0444) },
+ { "nx_largepages_splitted", VM_STAT(nx_lpage_splits, .mode = 0444) },
{ "max_mmu_page_hash_collisions",
VM_STAT(max_mmu_page_hash_collisions) },
{ NULL }
@@ -1280,6 +1281,14 @@ static u64 kvm_get_arch_capabilities(void)
rdmsrl(MSR_IA32_ARCH_CAPABILITIES, data);
/*
+ * If nx_huge_pages is enabled, KVM's shadow paging will ensure that
+ * the nested hypervisor runs with NX huge pages. If it is not,
+ * L1 is anyway vulnerable to ITLB_MULTIHIT explots from other
+ * L1 guests, so it need not worry about its own (L2) guests.
+ */
+ data |= ARCH_CAP_PSCHANGE_MC_NO;
+
+ /*
* If we're doing cache flushes (either "always" or "cond")
* we will do one whenever the guest does a vmlaunch/vmresume.
* If an outer hypervisor is doing the cache flush for us
--
1.8.3.1
From: "Gomez Iglesias, Antonio" <[email protected]>
Add the initial ITLB_MULTIHIT documentation.
[ tglx: Add it to the index so it gets actually built. ]
Signed-off-by: Antonio Gomez Iglesias <[email protected]>
Signed-off-by: Nelson D'Souza <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
Documentation/admin-guide/hw-vuln/index.rst | 1 +
Documentation/admin-guide/hw-vuln/multihit.rst | 163 +++++++++++++++++++++++++
2 files changed, 164 insertions(+)
create mode 100644 Documentation/admin-guide/hw-vuln/multihit.rst
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index 0802b1c67452..0795e3c2643f 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -13,3 +13,4 @@ are configurable at compile, boot or run time.
l1tf
mds
tsx_async_abort
+ multihit.rst
diff --git a/Documentation/admin-guide/hw-vuln/multihit.rst b/Documentation/admin-guide/hw-vuln/multihit.rst
new file mode 100644
index 000000000000..ba9988d8bce5
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/multihit.rst
@@ -0,0 +1,163 @@
+iTLB multihit
+=============
+
+iTLB multihit is an erratum where some processors may incur a machine check
+error, possibly resulting in an unrecoverable CPU lockup, when an
+instruction fetch hits multiple entries in the instruction TLB. This can
+occur when the page size is changed along with either the physical address
+or cache type. A malicious guest running on a virtualized system can
+exploit this erratum to perform a denial of service attack.
+
+
+Affected processors
+-------------------
+
+Variations of this erratum are present on most Intel Core and Xeon processor
+models. The erratum is not present on:
+
+ - non-Intel processors
+
+ - Some Atoms (Airmont, Bonnell, Goldmont, GoldmontPlus, Saltwell, Silvermont)
+
+ - Intel processors that have the PSCHANGE_MC_NO bit set in the
+ IA32_ARCH_CAPABILITIES MSR.
+
+
+Related CVEs
+------------
+
+The following CVE entry is related to this issue:
+
+ ============== =================================================
+ CVE-2018-12207 Machine Check Error Avoidance on Page Size Change
+ ============== =================================================
+
+
+Problem
+-------
+
+Privileged software, including OS and virtual machine managers (VMM), are in
+charge of memory management. A key component in memory management is the control
+of the page tables. Modern processors use virtual memory, a technique that creates
+the illusion of a very large memory for processors. This virtual space is split
+into pages of a given size. Page tables translate virtual addresses to physical
+addresses.
+
+To reduce latency when performing a virtual to physical address translation,
+processors include a structure, called TLB, that caches recent translations.
+There are separate TLBs for instruction (iTLB) and data (dTLB).
+
+Under this errata, instructions are fetched from a linear address translated
+using a 4 KB translation cached in the iTLB. Privileged software modifies the
+paging structure so that the same linear address using large page size (2 MB, 4
+MB, 1 GB) with a different physical address or memory type. After the page
+structure modification but before the software invalidates any iTLB entries for
+the linear address, a code fetch that happens on the same linear address may
+cause a machine-check error which can result in a system hang or shutdown.
+
+
+Attack scenarios
+----------------
+
+Attacks against the iTLB multihit erratum can be mounted from malicious
+guests in a virtualized system.
+
+
+iTLB multihit system information
+--------------------------------
+
+The Linux kernel provides a sysfs interface to enumerate the current iTLB
+multihit status of the system:whether the system is vulnerable and which
+mitigations are active. The relevant sysfs file is:
+
+/sys/devices/system/cpu/vulnerabilities/itlb_multihit
+
+The possible values in this file are:
+
+.. list-table::
+
+ * - Not affected
+ - The processor is not vulnerable.
+ * - KVM: Mitigation: Split huge pages
+ - Software changes mitigate this issue.
+ * - KVM: Vulnerable
+ - The processor is vulnerable, but no mitigation enabled
+
+
+Enumeration of the erratum
+--------------------------------
+
+A new bit has been allocated in the IA32_ARCH_CAPABILITIES (PSCHANGE_MC_NO) msr
+and will be set on CPU's which are mitigated against this issue.
+
+ ======================================= =========== ===============================
+ IA32_ARCH_CAPABILITIES MSR Not present Possibly vulnerable,check model
+ IA32_ARCH_CAPABILITIES[PSCHANGE_MC_NO] '0' Likely vulnerable,check model
+ IA32_ARCH_CAPABILITIES[PSCHANGE_MC_NO] '1' Not vulnerable
+ ======================================= =========== ===============================
+
+
+Mitigation mechanism
+-------------------------
+
+This erratum can be mitigated by restricting the use of large page sizes to
+non-executable pages. This forces all iTLB entries to be 4K, and removes
+the possibility of multiple hits.
+
+In order to mitigate the vulnerability, KVM initially marks all huge pages
+as non-executable. If the guest attempts to execute in one of those pages,
+the page is broken down into 4K pages, which are then marked executable.
+
+If EPT is disabled or not available on the host, KVM is in control of TLB
+flushes and the problematic situation cannot happen. However, the shadow
+EPT paging mechanism used by nested virtualization is vulnerable, because
+the nested guest can trigger multiple iTLB hits by modifying its own
+(non-nested) page tables. For simplicity, KVM will make large pages
+non-executable in all shadow paging modes.
+
+Mitigation control on the kernel command line and KVM - module parameter
+------------------------------------------------------------------------
+
+The KVM hypervisor mitigation mechanism for marking huge pages as
+non-executable can be controlled with a module parameter "nx_huge_pages=".
+The kernel command line allows to control the iTLB multihit mitigations at
+boot time with the option "kvm.nx_huge_pages=".
+
+The valid arguments for these options are:
+
+ ========== ================================================================
+ force Mitigation is enabled. In this case, the mitigation implements
+ non-executable huge pages in Linux kernel KVM module. All huge
+ pages in the EPT are marked as non-executable.
+ If a guest attempts to execute in one of those pages, the page is
+ broken down into 4K pages, which are then marked executable.
+
+ off Mitigation is disabled.
+
+ auto Enable mitigation only if the platform is affected and the kernel
+ was not booted with the "mitigations=off" command line parameter.
+ This is the default option.
+ ========== ================================================================
+
+
+Mitigation selection guide
+--------------------------
+
+1. No virtualization in use
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The system is protected by the kernel unconditionally and no further
+ action is required.
+
+2. Virtualization with trusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ If the guest comes from a trusted source, you may assume that the guest will
+ not attempt to maliciously exploit these errata and no further action is
+ required.
+
+3. Virtualization with untrusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ If the guest comes from an untrusted source, the guest host kernel will need
+ to apply iTLB multihit mitigation via the kernel command line or kvm
+ module parameter.
--
1.8.3.1
From: Junaid Shahid <[email protected]>
The page table pages corresponding to broken down large pages are zapped in
FIFO order, so that the large page can potentially be recovered, if it is
not longer being used for execution. This removes the performance penalty
for walking deeper EPT page tables.
By default, one large page will last about one hour once the guest
reaches a steady state.
Signed-off-by: Junaid Shahid <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 6 ++
arch/x86/include/asm/kvm_host.h | 4 +
arch/x86/kvm/mmu.c | 129 ++++++++++++++++++++++++
arch/x86/kvm/mmu.h | 4 +
arch/x86/kvm/x86.c | 11 ++
virt/kvm/kvm_main.c | 28 +++++
6 files changed, 182 insertions(+)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9d5f123cc218..8dee8f68fe15 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2068,6 +2068,12 @@
If the software workaround is enabled for the host,
guests do need not to enable it for nested guests.
+ kvm.nx_huge_pages_recovery_ratio=
+ [KVM] Controls how many 4KiB pages are periodically zapped
+ back to huge pages. 0 disables the recovery, otherwise if
+ the value is N KVM will zap 1/Nth of the 4KiB pages every
+ minute. The default is 60.
+
kvm-amd.nested= [KVM,AMD] Allow nested virtualization in KVM/SVM.
Default is 1 (enabled)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a37b03483b66..4fc61483919a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -312,6 +312,8 @@ struct kvm_rmap_head {
struct kvm_mmu_page {
struct list_head link;
struct hlist_node hash_link;
+ struct list_head lpage_disallowed_link;
+
bool unsync;
u8 mmu_valid_gen;
bool mmio_cached;
@@ -860,6 +862,7 @@ struct kvm_arch {
*/
struct list_head active_mmu_pages;
struct list_head zapped_obsolete_pages;
+ struct list_head lpage_disallowed_mmu_pages;
struct kvm_page_track_notifier_node mmu_sp_tracker;
struct kvm_page_track_notifier_head track_notifier_head;
@@ -934,6 +937,7 @@ struct kvm_arch {
bool exception_payload_enabled;
struct kvm_pmu_event_filter *pmu_event_filter;
+ struct task_struct *nx_lpage_recovery_thread;
};
struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index bedf6864b092..529589a42afb 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -37,6 +37,7 @@
#include <linux/uaccess.h>
#include <linux/hash.h>
#include <linux/kern_levels.h>
+#include <linux/kthread.h>
#include <asm/page.h>
#include <asm/pat.h>
@@ -50,16 +51,26 @@
extern bool itlb_multihit_kvm_mitigation;
static int __read_mostly nx_huge_pages = -1;
+static uint __read_mostly nx_huge_pages_recovery_ratio = 60;
static int set_nx_huge_pages(const char *val, const struct kernel_param *kp);
+static int set_nx_huge_pages_recovery_ratio(const char *val, const struct kernel_param *kp);
static struct kernel_param_ops nx_huge_pages_ops = {
.set = set_nx_huge_pages,
.get = param_get_bool,
};
+static struct kernel_param_ops nx_huge_pages_recovery_ratio_ops = {
+ .set = set_nx_huge_pages_recovery_ratio,
+ .get = param_get_uint,
+};
+
module_param_cb(nx_huge_pages, &nx_huge_pages_ops, &nx_huge_pages, 0644);
__MODULE_PARM_TYPE(nx_huge_pages, "bool");
+module_param_cb(nx_huge_pages_recovery_ratio, &nx_huge_pages_recovery_ratio_ops,
+ &nx_huge_pages_recovery_ratio, 0644);
+__MODULE_PARM_TYPE(nx_huge_pages_recovery_ratio, "uint");
/*
* When setting this variable to true it enables Two-Dimensional-Paging
@@ -1215,6 +1226,8 @@ static void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
return;
++kvm->stat.nx_lpage_splits;
+ list_add_tail(&sp->lpage_disallowed_link,
+ &kvm->arch.lpage_disallowed_mmu_pages);
sp->lpage_disallowed = true;
}
@@ -1239,6 +1252,7 @@ static void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
--kvm->stat.nx_lpage_splits;
sp->lpage_disallowed = false;
+ list_del(&sp->lpage_disallowed_link);
}
static bool __mmu_gfn_lpage_is_disallowed(gfn_t gfn, int level,
@@ -6274,6 +6288,8 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
idx = srcu_read_lock(&kvm->srcu);
kvm_mmu_zap_all_fast(kvm);
srcu_read_unlock(&kvm->srcu, idx);
+
+ wake_up_process(kvm->arch.nx_lpage_recovery_thread);
}
mutex_unlock(&kvm_lock);
}
@@ -6367,3 +6383,116 @@ void kvm_mmu_module_exit(void)
unregister_shrinker(&mmu_shrinker);
mmu_audit_disable();
}
+
+static int set_nx_huge_pages_recovery_ratio(const char *val, const struct kernel_param *kp)
+{
+ unsigned int old_val;
+ int err;
+
+ old_val = nx_huge_pages_recovery_ratio;
+ err = param_set_uint(val, kp);
+ if (err)
+ return err;
+
+ if (READ_ONCE(nx_huge_pages) &&
+ !old_val && nx_huge_pages_recovery_ratio) {
+ struct kvm *kvm;
+
+ mutex_lock(&kvm_lock);
+
+ list_for_each_entry(kvm, &vm_list, vm_list)
+ wake_up_process(kvm->arch.nx_lpage_recovery_thread);
+
+ mutex_unlock(&kvm_lock);
+ }
+
+ return err;
+}
+
+static void kvm_recover_nx_lpages(struct kvm *kvm)
+{
+ int rcu_idx;
+ struct kvm_mmu_page *sp;
+ unsigned int ratio;
+ LIST_HEAD(invalid_list);
+ ulong to_zap;
+
+ rcu_idx = srcu_read_lock(&kvm->srcu);
+ spin_lock(&kvm->mmu_lock);
+
+ ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
+ to_zap = ratio ? DIV_ROUND_UP(kvm->stat.nx_lpage_splits, ratio) : 0;
+ while (to_zap && !list_empty(&kvm->arch.lpage_disallowed_mmu_pages)) {
+ /*
+ * We use a separate list instead of just using active_mmu_pages
+ * because the number of lpage_disallowed pages is expected to
+ * be relatively small compared to the total.
+ */
+ sp = list_first_entry(&kvm->arch.lpage_disallowed_mmu_pages,
+ struct kvm_mmu_page,
+ lpage_disallowed_link);
+ WARN_ON_ONCE(!sp->lpage_disallowed);
+ kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+ WARN_ON_ONCE(sp->lpage_disallowed);
+
+ if (!--to_zap || need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ if (to_zap)
+ cond_resched_lock(&kvm->mmu_lock);
+ }
+ }
+
+ spin_unlock(&kvm->mmu_lock);
+ srcu_read_unlock(&kvm->srcu, rcu_idx);
+}
+
+static long get_nx_lpage_recovery_timeout(u64 start_time)
+{
+ return READ_ONCE(nx_huge_pages) && READ_ONCE(nx_huge_pages_recovery_ratio)
+ ? start_time + 60 * HZ - get_jiffies_64()
+ : MAX_SCHEDULE_TIMEOUT;
+}
+
+static int kvm_nx_lpage_recovery_worker(struct kvm *kvm, uintptr_t data)
+{
+ u64 start_time;
+ long remaining_time;
+
+ while (true) {
+ start_time = get_jiffies_64();
+ remaining_time = get_nx_lpage_recovery_timeout(start_time);
+
+ set_current_state(TASK_INTERRUPTIBLE);
+ while (!kthread_should_stop() && remaining_time > 0) {
+ schedule_timeout(remaining_time);
+ remaining_time = get_nx_lpage_recovery_timeout(start_time);
+ set_current_state(TASK_INTERRUPTIBLE);
+ }
+
+ set_current_state(TASK_RUNNING);
+
+ if (kthread_should_stop())
+ return 0;
+
+ kvm_recover_nx_lpages(kvm);
+ }
+}
+
+int kvm_mmu_post_init_vm(struct kvm *kvm)
+{
+ int err;
+
+ err = kvm_vm_create_worker_thread(kvm, kvm_nx_lpage_recovery_worker, 0,
+ "kvm-nx-lpage-recovery",
+ &kvm->arch.nx_lpage_recovery_thread);
+ if (!err)
+ kthread_unpark(kvm->arch.nx_lpage_recovery_thread);
+
+ return err;
+}
+
+void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
+{
+ if (kvm->arch.nx_lpage_recovery_thread)
+ kthread_stop(kvm->arch.nx_lpage_recovery_thread);
+}
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 11f8ec89433b..d55674f44a18 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -210,4 +210,8 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
struct kvm_memory_slot *slot, u64 gfn);
int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
+
+int kvm_mmu_post_init_vm(struct kvm *kvm);
+void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
+
#endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b087d178a774..a30e9962a6ef 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9456,6 +9456,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
INIT_HLIST_HEAD(&kvm->arch.mask_notifier_list);
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
+ INIT_LIST_HEAD(&kvm->arch.lpage_disallowed_mmu_pages);
INIT_LIST_HEAD(&kvm->arch.assigned_dev_head);
atomic_set(&kvm->arch.noncoherent_dma_count, 0);
@@ -9484,6 +9485,11 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
return kvm_x86_ops->vm_init(kvm);
}
+int kvm_arch_post_init_vm(struct kvm *kvm)
+{
+ return kvm_mmu_post_init_vm(kvm);
+}
+
static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
{
vcpu_load(vcpu);
@@ -9585,6 +9591,11 @@ int x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
}
EXPORT_SYMBOL_GPL(x86_set_memory_region);
+void kvm_arch_pre_destroy_vm(struct kvm *kvm)
+{
+ kvm_mmu_pre_destroy_vm(kvm);
+}
+
void kvm_arch_destroy_vm(struct kvm *kvm)
{
if (current->mm == kvm->mm) {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8aed32b604d9..4aab3547a165 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -626,6 +626,23 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
return 0;
}
+/*
+ * Called after the VM is otherwise initialized, but just before adding it to
+ * the vm_list.
+ */
+int __weak kvm_arch_post_init_vm(struct kvm *kvm)
+{
+ return 0;
+}
+
+/*
+ * Called just after removing the VM from the vm_list, but before doing any
+ * other destruction.
+ */
+void __weak kvm_arch_pre_destroy_vm(struct kvm *kvm)
+{
+}
+
static struct kvm *kvm_create_vm(unsigned long type)
{
struct kvm *kvm = kvm_arch_alloc_vm();
@@ -683,6 +700,10 @@ static struct kvm *kvm_create_vm(unsigned long type)
r = kvm_init_mmu_notifier(kvm);
if (r)
+ goto out_err_no_mmu_notifier;
+
+ r = kvm_arch_post_init_vm(kvm);
+ if (r)
goto out_err;
mutex_lock(&kvm_lock);
@@ -694,6 +715,11 @@ static struct kvm *kvm_create_vm(unsigned long type)
return kvm;
out_err:
+#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+ if (kvm->mmu_notifier.ops)
+ mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
+#endif
+out_err_no_mmu_notifier:
cleanup_srcu_struct(&kvm->irq_srcu);
out_err_no_irq_srcu:
cleanup_srcu_struct(&kvm->srcu);
@@ -738,6 +764,8 @@ static void kvm_destroy_vm(struct kvm *kvm)
mutex_lock(&kvm_lock);
list_del(&kvm->vm_list);
mutex_unlock(&kvm_lock);
+ kvm_arch_pre_destroy_vm(kvm);
+
kvm_free_irq_routing(kvm);
for (i = 0; i < KVM_NR_BUSES; i++) {
struct kvm_io_bus *bus = kvm_get_bus(kvm, i);
--
1.8.3.1
From: Junaid Shahid <[email protected]>
Add a function to create a kernel thread associated with a given VM. In
particular, it ensures that the worker thread inherits the priority and
cgroups of the calling thread.
Signed-off-by: Junaid Shahid <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
include/linux/kvm_host.h | 6 ++++
virt/kvm/kvm_main.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 90 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 719fc3e15ea4..52ed5f66e8f9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1382,4 +1382,10 @@ static inline int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
}
#endif /* CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE */
+typedef int (*kvm_vm_thread_fn_t)(struct kvm *kvm, uintptr_t data);
+
+int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
+ uintptr_t data, const char *name,
+ struct task_struct **thread_ptr);
+
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d6f0696d98ef..8aed32b604d9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -50,6 +50,7 @@
#include <linux/bsearch.h>
#include <linux/io.h>
#include <linux/lockdep.h>
+#include <linux/kthread.h>
#include <asm/processor.h>
#include <asm/ioctl.h>
@@ -4371,3 +4372,86 @@ void kvm_exit(void)
kvm_vfio_ops_exit();
}
EXPORT_SYMBOL_GPL(kvm_exit);
+
+struct kvm_vm_worker_thread_context {
+ struct kvm *kvm;
+ struct task_struct *parent;
+ struct completion init_done;
+ kvm_vm_thread_fn_t thread_fn;
+ uintptr_t data;
+ int err;
+};
+
+static int kvm_vm_worker_thread(void *context)
+{
+ /*
+ * The init_context is allocated on the stack of the parent thread, so
+ * we have to locally copy anything that is needed beyond initialization
+ */
+ struct kvm_vm_worker_thread_context *init_context = context;
+ struct kvm *kvm = init_context->kvm;
+ kvm_vm_thread_fn_t thread_fn = init_context->thread_fn;
+ uintptr_t data = init_context->data;
+ int err;
+
+ err = kthread_park(current);
+ /* kthread_park(current) is never supposed to return an error */
+ WARN_ON(err != 0);
+ if (err)
+ goto init_complete;
+
+ err = cgroup_attach_task_all(init_context->parent, current);
+ if (err) {
+ kvm_err("%s: cgroup_attach_task_all failed with err %d\n",
+ __func__, err);
+ goto init_complete;
+ }
+
+ set_user_nice(current, task_nice(init_context->parent));
+
+init_complete:
+ init_context->err = err;
+ complete(&init_context->init_done);
+ init_context = NULL;
+
+ if (err)
+ return err;
+
+ /* Wait to be woken up by the spawner before proceeding. */
+ kthread_parkme();
+
+ if (!kthread_should_stop())
+ err = thread_fn(kvm, data);
+
+ return err;
+}
+
+int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
+ uintptr_t data, const char *name,
+ struct task_struct **thread_ptr)
+{
+ struct kvm_vm_worker_thread_context init_context = {};
+ struct task_struct *thread;
+
+ *thread_ptr = NULL;
+ init_context.kvm = kvm;
+ init_context.parent = current;
+ init_context.thread_fn = thread_fn;
+ init_context.data = data;
+ init_completion(&init_context.init_done);
+
+ thread = kthread_run(kvm_vm_worker_thread, &init_context,
+ "%s-%d", name, task_pid_nr(current));
+ if (IS_ERR(thread))
+ return PTR_ERR(thread);
+
+ /* kthread_run is never supposed to return NULL */
+ WARN_ON(thread == NULL);
+
+ wait_for_completion(&init_context.init_done);
+
+ if (!init_context.err)
+ *thread_ptr = thread;
+
+ return init_context.err;
+}
--
1.8.3.1
From: Tyler Hicks <[email protected]>
A kernel module may need to check the value of the "mitigations=" kernel
command line parameter as part of its setup when the module needs
to perform software mitigations for a CPU flaw.
Uninline and export the helper functions surrounding the cpu_mitigations
enum to allow for their usage from a module.
Lastly, privatize the enum and cpu_mitigations variable since the value of
cpu_mitigations can be checked with the exported helper functions.
Signed-off-by: Tyler Hicks <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
include/linux/cpu.h | 25 ++-----------------------
kernel/cpu.c | 27 ++++++++++++++++++++++++++-
2 files changed, 28 insertions(+), 24 deletions(-)
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 2a093434e975..bc6c879bd110 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -218,28 +218,7 @@ static inline void cpu_smt_check_topology(void) { }
static inline int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) { return 0; }
#endif
-/*
- * These are used for a global "mitigations=" cmdline option for toggling
- * optional CPU mitigations.
- */
-enum cpu_mitigations {
- CPU_MITIGATIONS_OFF,
- CPU_MITIGATIONS_AUTO,
- CPU_MITIGATIONS_AUTO_NOSMT,
-};
-
-extern enum cpu_mitigations cpu_mitigations;
-
-/* mitigations=off */
-static inline bool cpu_mitigations_off(void)
-{
- return cpu_mitigations == CPU_MITIGATIONS_OFF;
-}
-
-/* mitigations=auto,nosmt */
-static inline bool cpu_mitigations_auto_nosmt(void)
-{
- return cpu_mitigations == CPU_MITIGATIONS_AUTO_NOSMT;
-}
+extern bool cpu_mitigations_off(void);
+extern bool cpu_mitigations_auto_nosmt(void);
#endif /* _LINUX_CPU_H_ */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index fc28e17940e0..e2cad3ee2ead 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2373,7 +2373,18 @@ void __init boot_cpu_hotplug_init(void)
this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);
}
-enum cpu_mitigations cpu_mitigations __ro_after_init = CPU_MITIGATIONS_AUTO;
+/*
+ * These are used for a global "mitigations=" cmdline option for toggling
+ * optional CPU mitigations.
+ */
+enum cpu_mitigations {
+ CPU_MITIGATIONS_OFF,
+ CPU_MITIGATIONS_AUTO,
+ CPU_MITIGATIONS_AUTO_NOSMT,
+};
+
+static enum cpu_mitigations cpu_mitigations __ro_after_init =
+ CPU_MITIGATIONS_AUTO;
static int __init mitigations_parse_cmdline(char *arg)
{
@@ -2390,3 +2401,17 @@ static int __init mitigations_parse_cmdline(char *arg)
return 0;
}
early_param("mitigations", mitigations_parse_cmdline);
+
+/* mitigations=off */
+bool cpu_mitigations_off(void)
+{
+ return cpu_mitigations == CPU_MITIGATIONS_OFF;
+}
+EXPORT_SYMBOL_GPL(cpu_mitigations_off);
+
+/* mitigations=auto,nosmt */
+bool cpu_mitigations_auto_nosmt(void)
+{
+ return cpu_mitigations == CPU_MITIGATIONS_AUTO_NOSMT;
+}
+EXPORT_SYMBOL_GPL(cpu_mitigations_auto_nosmt);
--
1.8.3.1
From: Vineela Tummalapalli <[email protected]>
Some processors may incur a machine check error possibly resulting in an
unrecoverable CPU lockup when an instruction fetch encounters a TLB
multi-hit in the instruction TLB. This can occur when the page size is
changed along with either the physical address or cache type. The relevant
erratum can be found here:
https://bugzilla.kernel.org/show_bug.cgi?id=205195
There are other processors affected for which the erratum does not fully
disclose the impact.
This issue affects both bare-metal x86 page tables and EPT.
It can be mitigated by either eliminating the use of large pages or by
using careful TLB invalidations when changing the page size in the page
tables.
Just like Spectre, Meltdown, L1TF and MDS, a new bit has been allocated in
MSR_IA32_ARCH_CAPABILITIES (PSCHANGE_MC_NO) and will be set on CPUs which
are mitigated against this issue.
Signed-off-by: Vineela Tummalapalli <[email protected]>
Co-developed-by: Pawan Gupta <[email protected]>
Signed-off-by: Pawan Gupta <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
Documentation/ABI/testing/sysfs-devices-system-cpu | 1 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/msr-index.h | 7 +++
arch/x86/kernel/cpu/bugs.c | 13 +++++
arch/x86/kernel/cpu/common.c | 65 ++++++++++++----------
drivers/base/cpu.c | 8 +++
include/linux/cpu.h | 2 +
7 files changed, 67 insertions(+), 30 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 0e77569bd5e0..fc20cde63d1e 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -487,6 +487,7 @@ What: /sys/devices/system/cpu/vulnerabilities
/sys/devices/system/cpu/vulnerabilities/l1tf
/sys/devices/system/cpu/vulnerabilities/mds
/sys/devices/system/cpu/vulnerabilities/tsx_async_abort
+ /sys/devices/system/cpu/vulnerabilities/itlb_multihit
Date: January 2018
Contact: Linux kernel mailing list <[email protected]>
Description: Information about CPU vulnerabilities
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 989e03544f18..c4fbe379cc0b 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -400,5 +400,6 @@
#define X86_BUG_MSBDS_ONLY X86_BUG(20) /* CPU is only affected by the MSDBS variant of BUG_MDS */
#define X86_BUG_SWAPGS X86_BUG(21) /* CPU is affected by speculation through SWAPGS */
#define X86_BUG_TAA X86_BUG(22) /* CPU is affected by TSX Async Abort(TAA) */
+#define X86_BUG_ITLB_MULTIHIT X86_BUG(23) /* CPU may incur MCE during certain page attribute changes */
#endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index b3a8bb2af0b6..6a3124664289 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -93,6 +93,13 @@
* Microarchitectural Data
* Sampling (MDS) vulnerabilities.
*/
+#define ARCH_CAP_PSCHANGE_MC_NO BIT(6) /*
+ * The processor is not susceptible to a
+ * machine check error due to modifying the
+ * code page size along with either the
+ * physical address or cache type
+ * without TLB invalidation.
+ */
#define ARCH_CAP_TSX_CTRL_MSR BIT(7) /* MSR for TSX control is available. */
#define ARCH_CAP_TAA_NO BIT(8) /*
* Not susceptible to
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 43c647e19439..5364beda8c61 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1419,6 +1419,11 @@ static ssize_t l1tf_show_state(char *buf)
}
#endif
+static ssize_t itlb_multihit_show_state(char *buf)
+{
+ return sprintf(buf, "Processor vulnerable\n");
+}
+
static ssize_t mds_show_state(char *buf)
{
if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
@@ -1524,6 +1529,9 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr
case X86_BUG_TAA:
return tsx_async_abort_show_state(buf);
+ case X86_BUG_ITLB_MULTIHIT:
+ return itlb_multihit_show_state(buf);
+
default:
break;
}
@@ -1565,4 +1573,9 @@ ssize_t cpu_show_tsx_async_abort(struct device *dev, struct device_attribute *at
{
return cpu_show_common(dev, attr, buf, X86_BUG_TAA);
}
+
+ssize_t cpu_show_itlb_multihit(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_ITLB_MULTIHIT);
+}
#endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f8b8afc8f5b5..d29b71ca3ca7 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1016,13 +1016,14 @@ static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
#endif
}
-#define NO_SPECULATION BIT(0)
-#define NO_MELTDOWN BIT(1)
-#define NO_SSB BIT(2)
-#define NO_L1TF BIT(3)
-#define NO_MDS BIT(4)
-#define MSBDS_ONLY BIT(5)
-#define NO_SWAPGS BIT(6)
+#define NO_SPECULATION BIT(0)
+#define NO_MELTDOWN BIT(1)
+#define NO_SSB BIT(2)
+#define NO_L1TF BIT(3)
+#define NO_MDS BIT(4)
+#define MSBDS_ONLY BIT(5)
+#define NO_SWAPGS BIT(6)
+#define NO_ITLB_MULTIHIT BIT(7)
#define VULNWL(_vendor, _family, _model, _whitelist) \
{ X86_VENDOR_##_vendor, _family, _model, X86_FEATURE_ANY, _whitelist }
@@ -1043,27 +1044,27 @@ static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
VULNWL(NSC, 5, X86_MODEL_ANY, NO_SPECULATION),
/* Intel Family 6 */
- VULNWL_INTEL(ATOM_SALTWELL, NO_SPECULATION),
- VULNWL_INTEL(ATOM_SALTWELL_TABLET, NO_SPECULATION),
- VULNWL_INTEL(ATOM_SALTWELL_MID, NO_SPECULATION),
- VULNWL_INTEL(ATOM_BONNELL, NO_SPECULATION),
- VULNWL_INTEL(ATOM_BONNELL_MID, NO_SPECULATION),
-
- VULNWL_INTEL(ATOM_SILVERMONT, NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
- VULNWL_INTEL(ATOM_SILVERMONT_D, NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
- VULNWL_INTEL(ATOM_SILVERMONT_MID, NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
- VULNWL_INTEL(ATOM_AIRMONT, NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
- VULNWL_INTEL(XEON_PHI_KNL, NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
- VULNWL_INTEL(XEON_PHI_KNM, NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
+ VULNWL_INTEL(ATOM_SALTWELL, NO_SPECULATION | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(ATOM_SALTWELL_TABLET, NO_SPECULATION | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(ATOM_SALTWELL_MID, NO_SPECULATION | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(ATOM_BONNELL, NO_SPECULATION | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(ATOM_BONNELL_MID, NO_SPECULATION | NO_ITLB_MULTIHIT),
+
+ VULNWL_INTEL(ATOM_SILVERMONT, NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(ATOM_SILVERMONT_D, NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(ATOM_SILVERMONT_MID, NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(ATOM_AIRMONT, NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(XEON_PHI_KNL, NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(XEON_PHI_KNM, NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
VULNWL_INTEL(CORE_YONAH, NO_SSB),
- VULNWL_INTEL(ATOM_AIRMONT_MID, NO_L1TF | MSBDS_ONLY | NO_SWAPGS),
- VULNWL_INTEL(ATOM_AIRMONT_NP, NO_L1TF | NO_SWAPGS),
+ VULNWL_INTEL(ATOM_AIRMONT_MID, NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(ATOM_AIRMONT_NP, NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT),
- VULNWL_INTEL(ATOM_GOLDMONT, NO_MDS | NO_L1TF | NO_SWAPGS),
- VULNWL_INTEL(ATOM_GOLDMONT_D, NO_MDS | NO_L1TF | NO_SWAPGS),
- VULNWL_INTEL(ATOM_GOLDMONT_PLUS, NO_MDS | NO_L1TF | NO_SWAPGS),
+ VULNWL_INTEL(ATOM_GOLDMONT, NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(ATOM_GOLDMONT_D, NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(ATOM_GOLDMONT_PLUS, NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT),
/*
* Technically, swapgs isn't serializing on AMD (despite it previously
@@ -1074,14 +1075,14 @@ static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
*/
/* AMD Family 0xf - 0x12 */
- VULNWL_AMD(0x0f, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS),
- VULNWL_AMD(0x10, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS),
- VULNWL_AMD(0x11, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS),
- VULNWL_AMD(0x12, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS),
+ VULNWL_AMD(0x0f, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
+ VULNWL_AMD(0x10, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
+ VULNWL_AMD(0x11, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
+ VULNWL_AMD(0x12, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
/* FAMILY_ANY must be last, otherwise 0x0f - 0x12 matches won't work */
- VULNWL_AMD(X86_FAMILY_ANY, NO_MELTDOWN | NO_L1TF | NO_MDS | NO_SWAPGS),
- VULNWL_HYGON(X86_FAMILY_ANY, NO_MELTDOWN | NO_L1TF | NO_MDS | NO_SWAPGS),
+ VULNWL_AMD(X86_FAMILY_ANY, NO_MELTDOWN | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
+ VULNWL_HYGON(X86_FAMILY_ANY, NO_MELTDOWN | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT),
{}
};
@@ -1106,6 +1107,10 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
{
u64 ia32_cap = x86_read_arch_cap_msr();
+ /* Set ITLB_MULTIHIT bug if cpu is not in the whitelist and not mitigated */
+ if (!cpu_matches(NO_ITLB_MULTIHIT) && !(ia32_cap & ARCH_CAP_PSCHANGE_MC_NO))
+ setup_force_cpu_bug(X86_BUG_ITLB_MULTIHIT);
+
if (cpu_matches(NO_SPECULATION))
return;
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 0fccd8c0312e..6265871a4af2 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -561,6 +561,12 @@ ssize_t __weak cpu_show_tsx_async_abort(struct device *dev,
return sprintf(buf, "Not affected\n");
}
+ssize_t __weak cpu_show_itlb_multihit(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "Not affected\n");
+}
+
static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
@@ -568,6 +574,7 @@ ssize_t __weak cpu_show_tsx_async_abort(struct device *dev,
static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);
static DEVICE_ATTR(mds, 0444, cpu_show_mds, NULL);
static DEVICE_ATTR(tsx_async_abort, 0444, cpu_show_tsx_async_abort, NULL);
+static DEVICE_ATTR(itlb_multihit, 0444, cpu_show_itlb_multihit, NULL);
static struct attribute *cpu_root_vulnerabilities_attrs[] = {
&dev_attr_meltdown.attr,
@@ -577,6 +584,7 @@ ssize_t __weak cpu_show_tsx_async_abort(struct device *dev,
&dev_attr_l1tf.attr,
&dev_attr_mds.attr,
&dev_attr_tsx_async_abort.attr,
+ &dev_attr_itlb_multihit.attr,
NULL
};
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index f35369f79771..2a093434e975 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -62,6 +62,8 @@ extern ssize_t cpu_show_mds(struct device *dev,
extern ssize_t cpu_show_tsx_async_abort(struct device *dev,
struct device_attribute *attr,
char *buf);
+extern ssize_t cpu_show_itlb_multihit(struct device *dev,
+ struct device_attribute *attr, char *buf);
extern __printf(4, 5)
struct device *cpu_device_create(struct device *parent, void *drvdata,
--
1.8.3.1
On 12.11.19 22:21, Paolo Bonzini wrote:
> CVE-2018-12207 is a microarchitectural implementation issue
> that could allow an unprivileged local attacker to cause system wide
> denial-of-service condition.
>
> Privileged software may change the page size (ex. 4KB, 2MB, 1GB) in the
> paging structures, without following such paging structure changes with
> invalidation of the TLB entries corresponding to the changed pages. In
> this case, the attacker could invoke instruction fetch, which will result
> in the processor hitting multiple TLB entries, reporting a machine check
> error exception, and ultimately hanging the system.
>
> The attached patches mitigate the vulnerability by making huge pages
> non-executable. The processor will not be able to execute an instruction
> residing in a large page (ie. 2MB, 1GB, etc.) without causing a trap into
> the host kernel/hypervisor; KVM will then break the large page into 4KB
> pages and gives executable permission to 4KB pages.
When reading MCE, error code 0150h, ie. SRAR, I was wondering if that
couldn't simply be handled by the host. But I suppose the symptom of
that erratum is not "just" regular recoverable MCE, rather
sometimes/always an unrecoverable CPU state, despite the error code, right?
Jan
--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux
On 13/11/19 07:38, Jan Kiszka wrote:
> When reading MCE, error code 0150h, ie. SRAR, I was wondering if that
> couldn't simply be handled by the host. But I suppose the symptom of
> that erratum is not "just" regular recoverable MCE, rather
> sometimes/always an unrecoverable CPU state, despite the error code, right?
The erratum documentation talks explicitly about hanging the system, but
it's not clear if it's just a result of the OS mishandling the MCE, or
something worse. So I don't know. :( Pawan, do you?
Paolo
Paolo Bonzini <[email protected]> 于2019年11月12日周二 下午10:23写道:
>
> CVE-2018-12207 is a microarchitectural implementation issue
> that could allow an unprivileged local attacker to cause system wide
> denial-of-service condition.
>
> Privileged software may change the page size (ex. 4KB, 2MB, 1GB) in the
> paging structures, without following such paging structure changes with
> invalidation of the TLB entries corresponding to the changed pages. In
> this case, the attacker could invoke instruction fetch, which will result
> in the processor hitting multiple TLB entries, reporting a machine check
> error exception, and ultimately hanging the system.
>
> The attached patches mitigate the vulnerability by making huge pages
> non-executable. The processor will not be able to execute an instruction
> residing in a large page (ie. 2MB, 1GB, etc.) without causing a trap into
> the host kernel/hypervisor; KVM will then break the large page into 4KB
> pages and gives executable permission to 4KB pages.
>
> Thanks to everyone that was involved in the development of these patches,
> especially Junaid Shahid, who provided the first version of the code,
> and Thomas Gleixner.
>
> Paolo
Hi Paolo, hi list,
Thanks for info, do we need qemu patch for full mitigation?
Debian mentioned:
https://linuxsecurity.com/advisories/debian/debian-dsa-4566-1-qemu-security-update-17-10-10
"
A qemu update adding support for the PSCHANGE_MC_NO feature, which
allows to disable iTLB Multihit mitigations in nested hypervisors
will be provided via DSA 4566-1.
"
But It's not yet available publicly.
About the performance hit, do you know any number? probably the answer
is workload dependent.
Regards,
Jack Wang
On 13/11/19 14:00, Jinpu Wang wrote:
> Hi Paolo, hi list,
>
> Thanks for info, do we need qemu patch for full mitigation?
> Debian mentioned:
> https://linuxsecurity.com/advisories/debian/debian-dsa-4566-1-qemu-security-update-17-10-10
> "
> A qemu update adding support for the PSCHANGE_MC_NO feature, which
> allows to disable iTLB Multihit mitigations in nested hypervisors
> will be provided via DSA 4566-1.
>
> "
> But It's not yet available publicly.
I will send it today, but it's not needed for full mitigation. It just
provides a knob to turn it on and off in nested hypervisors.
> About the performance hit, do you know any number? probably the answer
> is workload dependent.
We generally measured 0-4%. There can be latency spikes for RT, which I
will send a patch for soon.
Paolo
> On Nov 12, 2019, at 1:21 PM, Paolo Bonzini <[email protected]> wrote:
>
> CVE-2018-12207 is a microarchitectural implementation issue
> that could allow an unprivileged local attacker to cause system wide
> denial-of-service condition.
>
> Privileged software may change the page size (ex. 4KB, 2MB, 1GB) in the
> paging structures, without following such paging structure changes with
> invalidation of the TLB entries corresponding to the changed pages. In
> this case, the attacker could invoke instruction fetch, which will result
> in the processor hitting multiple TLB entries, reporting a machine check
> error exception, and ultimately hanging the system.
>
> The attached patches mitigate the vulnerability by making huge pages
> non-executable. The processor will not be able to execute an instruction
> residing in a large page (ie. 2MB, 1GB, etc.) without causing a trap into
> the host kernel/hypervisor; KVM will then break the large page into 4KB
> pages and gives executable permission to 4KB pages.
It sounds that this mitigation will trigger the “page fracturing” problem
I once encountered [1], causing frequent full TLB flushes when invlpg
runs. I wonder if VMs would benefit in performance from changing
/sys/kernel/debug/x86/tlb_single_page_flush_ceiling to zero.
On a different note - I am not sure I fully understand the exact scenario.
Any chance of getting a kvm-unit-test for this case?
[1] https://patchwork.kernel.org/patch/9099311/
On 13/11/19 19:10, Nadav Amit wrote:
>
>> On Nov 12, 2019, at 1:21 PM, Paolo Bonzini <[email protected]> wrote:
>>
>> CVE-2018-12207 is a microarchitectural implementation issue
>> that could allow an unprivileged local attacker to cause system wide
>> denial-of-service condition.
>>
>> Privileged software may change the page size (ex. 4KB, 2MB, 1GB) in the
>> paging structures, without following such paging structure changes with
>> invalidation of the TLB entries corresponding to the changed pages. In
>> this case, the attacker could invoke instruction fetch, which will result
>> in the processor hitting multiple TLB entries, reporting a machine check
>> error exception, and ultimately hanging the system.
>>
>> The attached patches mitigate the vulnerability by making huge pages
>> non-executable. The processor will not be able to execute an instruction
>> residing in a large page (ie. 2MB, 1GB, etc.) without causing a trap into
>> the host kernel/hypervisor; KVM will then break the large page into 4KB
>> pages and gives executable permission to 4KB pages.
>
> It sounds that this mitigation will trigger the “page fracturing” problem
> I once encountered [1], causing frequent full TLB flushes when invlpg
> runs. I wonder if VMs would benefit in performance from changing
> /sys/kernel/debug/x86/tlb_single_page_flush_ceiling to zero.
>
> On a different note - I am not sure I fully understand the exact scenario.
> Any chance of getting a kvm-unit-test for this case?
No, for now I only have a test that causes lots of 2 MiB pages to be
shattered to 4 KiB. But I wanted to get and post a test case in the
next week.
Paolo
> [1] https://patchwork.kernel.org/patch/9099311/
>
On 11/13/19 12:23 AM, Paolo Bonzini wrote:
> On 13/11/19 07:38, Jan Kiszka wrote:
>> When reading MCE, error code 0150h, ie. SRAR, I was wondering if that
>> couldn't simply be handled by the host. But I suppose the symptom of
>> that erratum is not "just" regular recoverable MCE, rather
>> sometimes/always an unrecoverable CPU state, despite the error code, right?
> The erratum documentation talks explicitly about hanging the system, but
> it's not clear if it's just a result of the OS mishandling the MCE, or
> something worse. So I don't know. :( Pawan, do you?
It's "something worse".
I built a kernel module reproducer for this a long time ago. The
symptom I observed was the whole system hanging hard, requiring me to go
hit the power button. The MCE software machinery was not involved at
all from what I could tell.
About creating a unit test, I'd be personally happy to share my
reproducer, but I built it before this issue was root-caused. There are
actually quite a few underlying variants and a good unit test would make
sure to exercise all of them. My reproducer probably only exercised a
single case.
On Wed, Nov 13, 2019 at 09:23:30AM +0100, Paolo Bonzini wrote:
> On 13/11/19 07:38, Jan Kiszka wrote:
> > When reading MCE, error code 0150h, ie. SRAR, I was wondering if that
> > couldn't simply be handled by the host. But I suppose the symptom of
> > that erratum is not "just" regular recoverable MCE, rather
> > sometimes/always an unrecoverable CPU state, despite the error code, right?
>
> The erratum documentation talks explicitly about hanging the system, but
> it's not clear if it's just a result of the OS mishandling the MCE, or
> something worse. So I don't know. :( Pawan, do you?
As Dave mentioned in the other email its "something worse".
Although this erratum results in a machine check with the same MCACOD
signature as an SRAR error (0x150) the MCi_STATUS.PCC bit will be set to
one. The Intel Software Developers manual says that PCC=1 errors are
fatal and cannot be recovered.
15.10.4.1 Machine-Check Exception Handler for Error Recovery [1]
[...]
The PCC flag in each IA32_MCi_STATUS register indicates whether recovery
from the error is possible for uncorrected errors (UC=1). If the PCC
flag is set for enabled uncorrected errors (UC=1 and EN=1), recovery is
not possible.
Thanks,
Pawan
[1]
https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html
> On Nov 13, 2019, at 1:24 PM, Dave Hansen <[email protected]> wrote:
>
> On 11/13/19 12:23 AM, Paolo Bonzini wrote:
>> On 13/11/19 07:38, Jan Kiszka wrote:
>>> When reading MCE, error code 0150h, ie. SRAR, I was wondering if that
>>> couldn't simply be handled by the host. But I suppose the symptom of
>>> that erratum is not "just" regular recoverable MCE, rather
>>> sometimes/always an unrecoverable CPU state, despite the error code, right?
>> The erratum documentation talks explicitly about hanging the system, but
>> it's not clear if it's just a result of the OS mishandling the MCE, or
>> something worse. So I don't know. :( Pawan, do you?
>
> It's "something worse".
>
> I built a kernel module reproducer for this a long time ago. The
> symptom I observed was the whole system hanging hard, requiring me to go
> hit the power button. The MCE software machinery was not involved at
> all from what I could tell.
>
> About creating a unit test, I'd be personally happy to share my
> reproducer, but I built it before this issue was root-caused. There are
> actually quite a few underlying variants and a good unit test would make
> sure to exercise all of them. My reproducer probably only exercised a
> single case.
So please correct me if I am wrong. My understanding is that the reason that
only KVM needs to be fixed is that there is a strong assumption that the
kernel does not hold both 4k and 2M mappings at the same time. There is indeed
documentation that this is the intention in __split_huge_pmd_locked(), for
instance, due to other AMD issues with such setup.
But is it always the case? Looking at __split_large_page(), it seems that the
TLB invalidation is only done after the PMD is changed. Can't this leave a
small time window in which a malicious actor triggers a machine-check on
another core than the one that runs __split_large_page()?
On 11/13/19 5:17 PM, Nadav Amit wrote:
> But is it always the case? Looking at __split_large_page(), it seems that the
> TLB invalidation is only done after the PMD is changed. Can't this leave a
> small time window in which a malicious actor triggers a machine-check on
> another core than the one that runs __split_large_page()?
It's not just a split. It has to be a change that results in
inconsistencies between two entries in the TLB. A normal split doesn't
change the resulting final translations and is never inconsistent
between the two translations.
To have an inconsistency, you need to change the backing physical
address (or cache attributes?). I'd need to go double-check the erratum
to be sure about the cache attributes.
In any case, that's why we decided that normal kernel mapping
split/merges don't need to be mitigated. But, we should probably
document this somewhere if it's not clear.
Pawan, did we document the results of the audit you did anywhere?
> On Nov 13, 2019, at 9:26 PM, Dave Hansen <[email protected]> wrote:
>
> On 11/13/19 5:17 PM, Nadav Amit wrote:
>> But is it always the case? Looking at __split_large_page(), it seems that the
>> TLB invalidation is only done after the PMD is changed. Can't this leave a
>> small time window in which a malicious actor triggers a machine-check on
>> another core than the one that runs __split_large_page()?
>
> It's not just a split. It has to be a change that results in
> inconsistencies between two entries in the TLB. A normal split doesn't
> change the resulting final translations and is never inconsistent
> between the two translations.
>
> To have an inconsistency, you need to change the backing physical
> address (or cache attributes?). I'd need to go double-check the erratum
> to be sure about the cache attributes.
>
> In any case, that's why we decided that normal kernel mapping
> split/merges don't need to be mitigated. But, we should probably
> document this somewhere if it's not clear.
>
> Pawan, did we document the results of the audit you did anywhere?
Thank you for explaining. I now see that it is clearly written:
"Software modifies the paging structures so that the same linear address
is translated using a large page (2 MB, 4 MB, or 1 GB) with a different
physical address or memory type.” [1]
My bad.
[1] https://software.intel.com/security-software-guidance/insights/deep-dive-machine-check-error-avoidance-page-size-change-0
On 13.11.19 22:24, Dave Hansen wrote:
> On 11/13/19 12:23 AM, Paolo Bonzini wrote:
>> On 13/11/19 07:38, Jan Kiszka wrote:
>>> When reading MCE, error code 0150h, ie. SRAR, I was wondering if that
>>> couldn't simply be handled by the host. But I suppose the symptom of
>>> that erratum is not "just" regular recoverable MCE, rather
>>> sometimes/always an unrecoverable CPU state, despite the error code, right?
>> The erratum documentation talks explicitly about hanging the system, but
>> it's not clear if it's just a result of the OS mishandling the MCE, or
>> something worse. So I don't know. :( Pawan, do you?
>
> It's "something worse".
>
> I built a kernel module reproducer for this a long time ago. The
> symptom I observed was the whole system hanging hard, requiring me to go
> hit the power button. The MCE software machinery was not involved at
> all from what I could tell.
Thanks for clarifying this - too bad.
>
> About creating a unit test, I'd be personally happy to share my
> reproducer, but I built it before this issue was root-caused. There are
> actually quite a few underlying variants and a good unit test would make
> sure to exercise all of them. My reproducer probably only exercised a
> single case.
>
Would be interesting to see this. Ralf and tried something quickly, but
there seems to be a detail missing or wrong.
Jan
--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux
On 14.11.19 00:25, Pawan Gupta wrote:
> On Wed, Nov 13, 2019 at 09:23:30AM +0100, Paolo Bonzini wrote:
>> On 13/11/19 07:38, Jan Kiszka wrote:
>>> When reading MCE, error code 0150h, ie. SRAR, I was wondering if that
>>> couldn't simply be handled by the host. But I suppose the symptom of
>>> that erratum is not "just" regular recoverable MCE, rather
>>> sometimes/always an unrecoverable CPU state, despite the error code, right?
>>
>> The erratum documentation talks explicitly about hanging the system, but
>> it's not clear if it's just a result of the OS mishandling the MCE, or
>> something worse. So I don't know. :( Pawan, do you?
>
> As Dave mentioned in the other email its "something worse".
>
> Although this erratum results in a machine check with the same MCACOD
> signature as an SRAR error (0x150) the MCi_STATUS.PCC bit will be set to
> one. The Intel Software Developers manual says that PCC=1 errors are
> fatal and cannot be recovered.
>
> 15.10.4.1 Machine-Check Exception Handler for Error Recovery [1]
>
> [...]
> The PCC flag in each IA32_MCi_STATUS register indicates whether recovery
> from the error is possible for uncorrected errors (UC=1). If the PCC
> flag is set for enabled uncorrected errors (UC=1 and EN=1), recovery is
> not possible.
>
And, as Dave observed, even that event is not delivered to software
(maybe just logged by firmware for post-reset analysis) but can or does
cause a machine lock-up, right?
Thanks,
Jan
--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux
On Wed, Nov 13, 2019 at 09:26:24PM -0800, Dave Hansen wrote:
> On 11/13/19 5:17 PM, Nadav Amit wrote:
> > But is it always the case? Looking at __split_large_page(), it seems that the
> > TLB invalidation is only done after the PMD is changed. Can't this leave a
> > small time window in which a malicious actor triggers a machine-check on
> > another core than the one that runs __split_large_page()?
>
> It's not just a split. It has to be a change that results in
> inconsistencies between two entries in the TLB. A normal split doesn't
> change the resulting final translations and is never inconsistent
> between the two translations.
>
> To have an inconsistency, you need to change the backing physical
> address (or cache attributes?). I'd need to go double-check the erratum
> to be sure about the cache attributes.
>
> In any case, that's why we decided that normal kernel mapping
> split/merges don't need to be mitigated. But, we should probably
> document this somewhere if it's not clear.
>
> Pawan, did we document the results of the audit you did anywhere?
Kirill Shutemov did the heavy lifting, thank you Kirill. Below were the
major areas probed:
1. Can a non-privileged user application induce this erratum?
Userspace can trigger switching between 4k and 2M (in both
directions), but kernel already follows the protocol to avoid
this issue due to similar errata in AMD CPUs. [1][2]
2. If kernel can accidentally induce this?
__split_large_page() in arch/x86/mm/pageattr.c was the suspect [3].
The locking scheme described in the comment only guarantees that
TLB entries for 4k and 2M/1G will have the same page attributes
until TLB flush. There is nothing that would protect from having
multiple TLB entries of different sizes with the same attributes.
But the erratum can be triggered only when:
Software modifies the paging structures so that the same
linear address is translated using a large page (2 MB, 4
MB, or 1 GB) with a different physical address or memory
type.
And in this case the physical address and memory type is
preserved until TLB is flushed, so it should be safe.
Thanks,
Pawan
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/huge_memory.c#n2190
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/khugepaged.c#n1038
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/mm/pageattr.c#n1020
On Thu, Nov 14, 2019 at 09:13:22AM +0100, Jan Kiszka wrote:
> On 14.11.19 00:25, Pawan Gupta wrote:
> > On Wed, Nov 13, 2019 at 09:23:30AM +0100, Paolo Bonzini wrote:
> > > On 13/11/19 07:38, Jan Kiszka wrote:
> > > > When reading MCE, error code 0150h, ie. SRAR, I was wondering if that
> > > > couldn't simply be handled by the host. But I suppose the symptom of
> > > > that erratum is not "just" regular recoverable MCE, rather
> > > > sometimes/always an unrecoverable CPU state, despite the error code, right?
> > >
> > > The erratum documentation talks explicitly about hanging the system, but
> > > it's not clear if it's just a result of the OS mishandling the MCE, or
> > > something worse. So I don't know. :( Pawan, do you?
> >
> > As Dave mentioned in the other email its "something worse".
> >
> > Although this erratum results in a machine check with the same MCACOD
> > signature as an SRAR error (0x150) the MCi_STATUS.PCC bit will be set to
> > one. The Intel Software Developers manual says that PCC=1 errors are
> > fatal and cannot be recovered.
> >
> > 15.10.4.1 Machine-Check Exception Handler for Error Recovery [1]
> >
> > [...]
> > The PCC flag in each IA32_MCi_STATUS register indicates whether recovery
> > from the error is possible for uncorrected errors (UC=1). If the PCC
> > flag is set for enabled uncorrected errors (UC=1 and EN=1), recovery is
> > not possible.
> >
>
> And, as Dave observed, even that event is not delivered to software (maybe
> just logged by firmware for post-reset analysis) but can or does cause a
> machine lock-up, right?
It can either cause a machine lock-up or a reset and the event delivery
to the software is not guaranteed.
Thanks,
Pawan
Hi Dave,
On 11/14/19 9:09 AM, Jan Kiszka wrote:
> On 13.11.19 22:24, Dave Hansen wrote:
>> On 11/13/19 12:23 AM, Paolo Bonzini wrote:
>>> On 13/11/19 07:38, Jan Kiszka wrote:
>>>> When reading MCE, error code 0150h, ie. SRAR, I was wondering if that
>>>> couldn't simply be handled by the host. But I suppose the symptom of
>>>> that erratum is not "just" regular recoverable MCE, rather
>>>> sometimes/always an unrecoverable CPU state, despite the error code,
>>>> right?
>>> The erratum documentation talks explicitly about hanging the system, but
>>> it's not clear if it's just a result of the OS mishandling the MCE, or
>>> something worse. So I don't know. :( Pawan, do you?
>>
>> It's "something worse".
>>
>> I built a kernel module reproducer for this a long time ago. The
>> symptom I observed was the whole system hanging hard, requiring me to go
>> hit the power button. The MCE software machinery was not involved at
>> all from what I could tell.
>
> Thanks for clarifying this - too bad.
>
>>
>> About creating a unit test, I'd be personally happy to share my
>> reproducer, but I built it before this issue was root-caused. There are
I'd appreciate if you could share your code.
>> actually quite a few underlying variants and a good unit test would make
>> sure to exercise all of them. My reproducer probably only exercised a
>> single case.
Still, it triggers the issue, that's enough to compare it to my reproducer.
>>
>
> Would be interesting to see this. Ralf and tried something quickly, but
> there seems to be a detail missing or wrong.
Yep, we still can't reproduce the issue on an affected CPU, and don't
know what we miss.
Thanks,
Ralf
>
> Jan
>
On 18.11.19 14:58, Ralf Ramsauer wrote:
> Hi Dave,
>
> On 11/14/19 9:09 AM, Jan Kiszka wrote:
>> On 13.11.19 22:24, Dave Hansen wrote:
>>> On 11/13/19 12:23 AM, Paolo Bonzini wrote:
>>>> On 13/11/19 07:38, Jan Kiszka wrote:
>>>>> When reading MCE, error code 0150h, ie. SRAR, I was wondering if that
>>>>> couldn't simply be handled by the host. But I suppose the symptom of
>>>>> that erratum is not "just" regular recoverable MCE, rather
>>>>> sometimes/always an unrecoverable CPU state, despite the error code,
>>>>> right?
>>>> The erratum documentation talks explicitly about hanging the system, but
>>>> it's not clear if it's just a result of the OS mishandling the MCE, or
>>>> something worse. So I don't know. :( Pawan, do you?
>>>
>>> It's "something worse".
>>>
>>> I built a kernel module reproducer for this a long time ago. The
>>> symptom I observed was the whole system hanging hard, requiring me to go
>>> hit the power button. The MCE software machinery was not involved at
>>> all from what I could tell.
>>
>> Thanks for clarifying this - too bad.
>>
>>>
>>> About creating a unit test, I'd be personally happy to share my
>>> reproducer, but I built it before this issue was root-caused. There are
>
> I'd appreciate if you could share your code.
>
>>> actually quite a few underlying variants and a good unit test would make
>>> sure to exercise all of them. My reproducer probably only exercised a
>>> single case.
>
> Still, it triggers the issue, that's enough to compare it to my reproducer.
>
>>>
>>
>> Would be interesting to see this. Ralf and tried something quickly, but
>> there seems to be a detail missing or wrong.
>
> Yep, we still can't reproduce the issue on an affected CPU, and don't
> know what we miss.
I just realized that this thread stranded. Ralf told me that he got no
access to that reproducer which would be very valuable for us right now
to validate a static mitigation method in Jailhouse. Any chance to get
the access?
Thanks a lot,
Jan
--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux
On 1/21/20 10:08 AM, Jan Kiszka wrote:
>>> Would be interesting to see this. Ralf and tried something quickly, but
>>> there seems to be a detail missing or wrong.
>>
>> Yep, we still can't reproduce the issue on an affected CPU, and don't
>> know what we miss.
>
> I just realized that this thread stranded. Ralf told me that he got no
> access to that reproducer which would be very valuable for us right now
> to validate a static mitigation method in Jailhouse. Any chance to get
> the access?
I can't share mine. Although, if anyone else can share theirs, I can
take a look and make sure that it covers the same cases that mine did.
Does anyone else have a reproducer they can share?