Many thanks to everyone who reviewed v1. Here are some notes on the
delta and feedback that was given. Thank you all for taking the time to
review and provide suggestions. I attempted to capture everything I
heard, please let me know if anything was missed. kexec support is
still pending, due to what follows about command line parameters, opt
in, and being that those changes wouldn't be localized to KVM (and this
patchset is on top of kvm/next).
- Hibernation and suspend-to-RAM are supported.
- As Paolo requested, a patch will be sent immediately following this
patchset to kvm-unit-tests with the unit tests for general
functionality. selftests are included for SMM specific functionality.
- Liran's catch of ensuring bits stay pinned after returning from SMM
is now present. As a side note, CR values per AMD and Intel SDMs
should be read-only within SMRAM. I attempted a patch to keep a copy
of smstate in vcpu->arch which would later be used as the source of
the read-only values. However, I ran into issues with VM/save load
within the selftests. If someone has a better solution or knows where
might be a good place to persist that across save/load please let me
know.
- Andy's suggestion of a boot option has been incorporated as the
pv_cr_pin command line option. Boris mentioned that short-term
solutions become immutable. However, for the reasons outlined below
we need a way for the user to opt-in to pinning over kexec if both
are compiled in, and the command line parameter seems to be a good
way to do that. Liran's proposed solution of a flag within the ELF
would allow us to identify which kernels have support is assumed to
be implemented in the following scenarios.
We then have the following cases (without the addition of pv_cr_pin):
- Kernel running without pinning enabled kexecs kernel with pinning.
- Loaded kernel has kexec
- Do not enable pinning
- Loaded kernel lacks kexec
- Enable pinning
- Kernel running with pinning enabled kexecs kernel with pinning (as
identified by ELF addition).
- Okay
- Kernel running with pinning enabled kexecs kernel without pinning
(as identified by lack of ELF addition).
- User is presented with an error saying that they may not kexec
a kernel without pinning support.
With the addition of pv_cr_pin we have the following situations:
- Kernel running without pinning enabled kexecs kernel with pinning.
- Loaded kernel has kexec
- pv_cr_pin command line parameter present
- Enable pinning
- pv_cr_pin command line parameter not present
- Do not enable pinning
- Loaded kernel lacks kexec
- Enable pinning
- Kernel running with pinning enabled kexecs kernel with pinning (as
identified by ELF addition).
- Okay
- Kernel running with pinning enabled kexecs kernel without pinning
(as identified by lack of ELF addition).
- pv_cr_pin command line parameter present
- User is presented with an error saying that they have opted
into pinning support and may not kexec a kernel without pinning
support.
- pv_cr_pin command line parameter not present
- Pinning not enabled (not opted into), kexec succeeds
Without the command line parameter I'm not sure how we could preserve
users workflows which might rely on kexecing older kernels (ones
which wouldn't have support). I see the benefit here being that users
have to opt-in to the possibility of breaking their workflow, via
their addition of the pv_cr_pin command line flag. Which could of
course also be called nokexec. A deprecation period could then be
chosen where eventually pinning takes preference over kexec and users
are presented with the error if they try to kexec an older kernel.
Input on this would be much appreciated, as well as if this is the
best way to handle things or if there's another way that would be
preferred. This is just what we were able to come up with to ensure
users didn't get anything broken they didn't agree to have broken.
Paravirtualized Control Register pinning is a strengthened version of
existing protections on the Write Protect, Supervisor Mode Execution /
Access Protection, and User-Mode Instruction Prevention bits. The
existing protections prevent native_write_cr*() functions from writing
values which disable those bits. This patchset prevents any guest
writes to control registers from disabling pinned bits, not just writes
from native_write_cr*(). This stops attackers within the guest from
using ROP to disable protection bits.
https://web.archive.org/web/20171029060939/http://www.blackbunny.io/linux-kernel-x86-64-bypass-smep-kaslr-kptr_restric/
The protection is implemented by adding MSRs to KVM which contain the
bits that are allowed to be pinned, and the bits which are pinned. The
guest or userspace can enable bit pinning by reading MSRs to check
which bits are allowed to be pinned, and then writing MSRs to set which
bits they want pinned.
Other hypervisors such as HyperV have implemented similar protections
for Control Registers and MSRs; which security researchers have found
effective.
https://www.abatchy.com/2018/01/kernel-exploitation-4
We add a CR pin feature bit to the KVM cpuid, read only MSRs which
guests use to identify which bits they may request be pinned, and CR
pinned MSRs which contain the pinned bits. Guests can request that KVM
pin bits within control register 0 or 4 via the CR pinned MSRs. Writes
to the MSRs fail if they include bits that aren't allowed to be pinned.
Host userspace may clear or modify pinned bits at any time. Once
pinned bits are set, the guest may pin more allowed bits, but may never
clear pinned bits.
In the event that the guest vCPU attempts to disable any of the pinned
bits, the vCPU that issued the write is sent a general protection
fault, and the register is left unchanged.
When running with KVM guest support and paravirtualized CR pinning
enabled, paravirtualized and existing pinning are setup at the same
point on the boot CPU. Non-boot CPUs setup pinning upon identification.
Pinning is not active when running in SMM. Entering SMM disables pinned
bits. Writes to control registers within SMM would therefore trigger
general protection faults if pinning was enforced. Upon exit from SMM,
SMRAM is modified to ensure the values of CR0/4 that will be restored
contain the correct values for pinned bits. CR0/4 values are then
restored from SMRAM as usual.
Should userspace expose the CR pining CPUID feature bit, it must zero
CR pinned MSRs on reboot. If it does not, it runs the risk of having
the guest enable pinning and subsequently cause general protection
faults on next boot due to early boot code setting control registers to
values which do not contain the pinned bits.
Hibernation to disk and suspend-to-RAM are supported. identify_cpu was
updated to ensure SMEP/SMAP/UMIP are present in mmu_cr4_features. This
is necessary to ensure protections stay active during hibernation image
restoration.
Guests using the kexec system call currently do not support
paravirtualized control register pinning. This is due to early boot
code writing known good values to control registers, these values do
not contain the protected bits. This is due to CPU feature
identification being done at a later time, when the kernel properly
checks if it can enable protections. As such, the pv_cr_pin command
line option has been added which instructs the kernel to disable kexec
in favor of enabling paravirtualized control register pinning.
crashkernel is also disabled when the pv_cr_pin parameter is specified
due to its reliance on kexec.
When we make kexec compatible, we will still need a way for a kernel
with support to know if the kernel it is attempting to load has
support. If a kernel with this enabled attempts to kexec a kernel where
this is not supported, it would trigger a fault almost immediately.
Liran suggested adding a section to the built image acting as a flag to
signify support for being kexec'd by a kernel with pinning enabled.
Should that approach be implemented, it is likely that the command line
flag (pv_cr_pin) would still be desired for some deprecation period. We
wouldn't want the default behavior to change from being able to kexec
older kernels to not being able to, as this might break some users
workflows. Since we require that the user opt-in to break kexec we've
held off on attempting to fix kexec in this patchset. This way no one
sees any behavior they are not explicitly opting in to.
Security conscious kernel configurations disable kexec already, per
KSPP guidelines. Projects such as Kata Containers, AWS Lambda, ChromeOS
Termina, and others using KVM to virtualize Linux will benefit from
this protection without the need to specify pv_cr_pin on the command
line.
Pinning of sensitive CR bits has already been implemented to protect
against exploits directly calling native_write_cr*(). The current
protection cannot stop ROP attacks which jump directly to a MOV CR
instruction. Guests running with paravirtualized CR pinning are now
protected against the use of ROP to disable CR bits. The same bits that
are being pinned natively may be pinned via the CR pinned MSRs. These
bits are WP in CR0, and SMEP, SMAP, and UMIP in CR4.
Future patches could implement similar MSRs to protect bits in MSRs.
The NXE bit of the EFER MSR is a prime candidate.
John Andersen (4):
X86: Update mmu_cr4_features during feature identification
KVM: X86: Add CR pin MSRs
selftests: kvm: add test for CR pinning with SMM
X86: Use KVM CR pin MSRs
.../admin-guide/kernel-parameters.txt | 11 ++
Documentation/virt/kvm/msr.rst | 38 ++++
arch/x86/Kconfig | 10 +
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/include/asm/kvm_para.h | 17 ++
arch/x86/include/uapi/asm/kvm_para.h | 5 +
arch/x86/kernel/cpu/common.c | 11 +-
arch/x86/kernel/kvm.c | 39 ++++
arch/x86/kernel/setup.c | 12 +-
arch/x86/kvm/cpuid.c | 3 +-
arch/x86/kvm/x86.c | 130 ++++++++++++-
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/include/x86_64/processor.h | 9 +
.../selftests/kvm/x86_64/smm_cr_pin_test.c | 180 ++++++++++++++++++
15 files changed, 462 insertions(+), 7 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/smm_cr_pin_test.c
--
2.21.0
Add a CR pin feature bit to the KVM cpuid. Add read only MSRs to KVM
which guests use to identify which bits they may request be pinned. Add
CR pinned MSRs to KVM. Allow guests to request that KVM pin certain
bits within control register 0 or 4 via the CR pinned MSRs. Writes to
the MSRs fail if they include bits which aren't allowed. Host userspace
may clear or modify pinned bits at any time. Once pinned bits are set,
the guest may pin additional allowed bits, but not clear any. Clear
pinning on vCPU reset.
In the event that the guest vCPU attempts to disable any of the pinned
bits, send that vCPU a general protection fault, and leave the register
unchanged. Entering SMM unconditionally clears various CR0/4 bits, some
of which may be pinned by the OS. To avoid triggering a fault during
SMIs, pinning isn't enforced when the vCPU is running in SMM. Upon
exiting SMM CR0/4 values are restored from SMRAM. kvm_pre_leave_smm
ensures CR0/4 values in SMRAM have pinned bits set appropriately before
restoration. Clearing pinning on vCPU reset avoids faulting when
non-boot CPUs are disabled and then re-enabled, which is done when
hibernating.
Should userspace expose the CR pinning CPUID feature bit, it must zero
CR pinned MSRs on reboot. If it does not, it runs the risk of having the
guest enable pinning and subsequently cause general protection faults on
next boot due to early boot code setting control registers to values
which do not contain the pinned bits. Userspace is responsible for
migrating the contents of the CR* pinned MSRs. If userspace fails to
migrate the MSRs the protection will no longer be active.
Pinning of sensitive CR bits has already been implemented to protect
against exploits directly calling native_write_cr*(). The current
protection cannot stop ROP attacks which jump directly to a MOV CR
instruction.
https://web.archive.org/web/20171029060939/http://www.blackbunny.io/linux-kernel-x86-64-bypass-smep-kaslr-kptr_restric/
Guests running with paravirtualized CR pinning can now be protected
against the use of ROP to disable CR bits. The same bits that are being
pinned natively may be pinned via the CR pinned MSRs. These bits are WP
in CR0, and SMEP, SMAP, and UMIP in CR4.
Other hypervisors such as HyperV have implemented similar protections
for Control Registers and MSRs; which security researchers have found
effective.
https://www.abatchy.com/2018/01/kernel-exploitation-4
Future work could implement similar MSRs to protect bits elsewhere, such
as MSRs. The NXE bit of the EFER MSR is a prime candidate.
Changes for QEMU are required to expose the CR pin cpuid feature bit. As
well as clear the MSRs on reboot and enable migration.
https://github.com/qemu/qemu/commit/e7a0ff8a8dcde1ef2b83a9d93129614f512752ae
https://github.com/qemu/qemu/commit/7e8c770c91616ae8d2d6b15bcc2865be594c8852
Signed-off-by: John Andersen <[email protected]>
---
Documentation/virt/kvm/msr.rst | 38 ++++++++
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/include/uapi/asm/kvm_para.h | 5 ++
arch/x86/kvm/cpuid.c | 3 +-
arch/x86/kvm/x86.c | 130 ++++++++++++++++++++++++++-
5 files changed, 176 insertions(+), 2 deletions(-)
diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
index 33892036672d..075e15a2c246 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/msr.rst
@@ -319,3 +319,41 @@ data:
KVM guests can request the host not to poll on HLT, for example if
they are performing polling themselves.
+
+MSR_KVM_CR0_PIN_ALLOWED: 0x4b564d06
+MSR_KVM_CR4_PIN_ALLOWED: 0x4b564d07
+ Read only registers informing the guest which bits may be pinned for
+ each control register respectively via the CR pinned MSRs.
+
+ data: Bits which may be pinned.
+
+ Attempting to pin bits other than these will result in a failure when
+ writing to the respective CR pinned MSR.
+
+ Bits which are allowed to be pinned are WP for CR0 and SMEP, SMAP, and
+ UMIP for CR4.
+
+MSR_KVM_CR0_PINNED: 0x4b564d08
+MSR_KVM_CR4_PINNED: 0x4b564d09
+ Used to configure pinned bits in control registers
+
+ data: Bits to be pinned.
+
+ Fails if data contains bits which are not allowed to be pinned. Bits
+ which are allowed to be pinned can be found by reading the CR pin
+ allowed MSRs.
+
+ The MSRs are read/write for host userspace, and write-only for the
+ guest.
+
+ Once set to a non-zero value, the guest cannot clear any of the bits
+ that have been pinned to 1. The guest can set more bits to 1, so long
+ as those bits appear in the allowed MSR.
+
+ Host userspace may clear or change pinned bits at any point. Host
+ userspace must clear pinned bits on reboot.
+
+ The MSR enables bit pinning for control registers. Pinning is active
+ when the guest is not in SMM. If the guest attempts to write values to
+ cr* where bits differ from pinned bits, the write will fail and the
+ guest will be sent a general protection fault.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 40a0c0fd95ca..69625a18aa88 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -569,10 +569,12 @@ struct kvm_vcpu_arch {
unsigned long cr0;
unsigned long cr0_guest_owned_bits;
+ unsigned long cr0_pinned;
unsigned long cr2;
unsigned long cr3;
unsigned long cr4;
unsigned long cr4_guest_owned_bits;
+ unsigned long cr4_pinned;
unsigned long cr8;
u32 pkru;
u32 hflags;
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 2a8e0b6b9805..e6c61e455adf 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -31,6 +31,7 @@
#define KVM_FEATURE_PV_SEND_IPI 11
#define KVM_FEATURE_POLL_CONTROL 12
#define KVM_FEATURE_PV_SCHED_YIELD 13
+#define KVM_FEATURE_CR_PIN 14
#define KVM_HINTS_REALTIME 0
@@ -50,6 +51,10 @@
#define MSR_KVM_STEAL_TIME 0x4b564d03
#define MSR_KVM_PV_EOI_EN 0x4b564d04
#define MSR_KVM_POLL_CONTROL 0x4b564d05
+#define MSR_KVM_CR0_PIN_ALLOWED 0x4b564d06
+#define MSR_KVM_CR4_PIN_ALLOWED 0x4b564d07
+#define MSR_KVM_CR0_PINNED 0x4b564d08
+#define MSR_KVM_CR4_PINNED 0x4b564d09
struct kvm_steal_time {
__u64 steal;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index b1c469446b07..94f0b2032524 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -716,7 +716,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function,
(1 << KVM_FEATURE_ASYNC_PF_VMEXIT) |
(1 << KVM_FEATURE_PV_SEND_IPI) |
(1 << KVM_FEATURE_POLL_CONTROL) |
- (1 << KVM_FEATURE_PV_SCHED_YIELD);
+ (1 << KVM_FEATURE_PV_SCHED_YIELD) |
+ (1 << KVM_FEATURE_CR_PIN);
if (sched_info_on())
entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fb5d64ebc35d..2ee0e9886a6e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -733,6 +733,9 @@ bool pdptrs_changed(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(pdptrs_changed);
+#define KVM_CR0_PIN_ALLOWED (X86_CR0_WP)
+#define KVM_CR4_PIN_ALLOWED (X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP)
+
int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
{
unsigned long old_cr0 = kvm_read_cr0(vcpu);
@@ -753,6 +756,11 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
if ((cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PE))
return 1;
+ if (!is_smm(vcpu)
+ && vcpu->arch.cr0_pinned
+ && ((cr0 ^ vcpu->arch.cr0_pinned) & KVM_CR0_PIN_ALLOWED))
+ return 1;
+
if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) {
#ifdef CONFIG_X86_64
if ((vcpu->arch.efer & EFER_LME)) {
@@ -932,6 +940,11 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
if (kvm_valid_cr4(vcpu, cr4))
return 1;
+ if (!is_smm(vcpu)
+ && vcpu->arch.cr4_pinned
+ && ((cr4 ^ vcpu->arch.cr4_pinned) & KVM_CR4_PIN_ALLOWED))
+ return 1;
+
if (is_long_mode(vcpu)) {
if (!(cr4 & X86_CR4_PAE))
return 1;
@@ -1255,6 +1268,10 @@ static const u32 emulated_msrs_all[] = {
MSR_K7_HWCR,
MSR_KVM_POLL_CONTROL,
+ MSR_KVM_CR0_PIN_ALLOWED,
+ MSR_KVM_CR4_PIN_ALLOWED,
+ MSR_KVM_CR0_PINNED,
+ MSR_KVM_CR4_PINNED,
};
static u32 emulated_msrs[ARRAY_SIZE(emulated_msrs_all)];
@@ -2878,6 +2895,28 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vcpu->arch.msr_kvm_poll_control = data;
break;
+ case MSR_KVM_CR0_PIN_ALLOWED:
+ case MSR_KVM_CR4_PIN_ALLOWED:
+ if (report_ignored_msrs)
+ vcpu_debug_ratelimited(vcpu, "unhandled wrmsr: 0x%x data 0x%llx\n",
+ msr, data);
+ break;
+ case MSR_KVM_CR0_PINNED:
+ if (data & ~KVM_CR0_PIN_ALLOWED)
+ return 1;
+ if (msr_info->host_initiated)
+ vcpu->arch.cr0_pinned = data;
+ else
+ vcpu->arch.cr0_pinned |= data;
+ break;
+ case MSR_KVM_CR4_PINNED:
+ if (data & ~KVM_CR4_PIN_ALLOWED)
+ return 1;
+ if (msr_info->host_initiated)
+ vcpu->arch.cr4_pinned = data;
+ else
+ vcpu->arch.cr4_pinned |= data;
+ break;
case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
@@ -3124,6 +3163,18 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_KVM_POLL_CONTROL:
msr_info->data = vcpu->arch.msr_kvm_poll_control;
break;
+ case MSR_KVM_CR0_PIN_ALLOWED:
+ msr_info->data = KVM_CR0_PIN_ALLOWED;
+ break;
+ case MSR_KVM_CR4_PIN_ALLOWED:
+ msr_info->data = KVM_CR4_PIN_ALLOWED;
+ break;
+ case MSR_KVM_CR0_PINNED:
+ msr_info->data = vcpu->arch.cr0_pinned;
+ break;
+ case MSR_KVM_CR4_PINNED:
+ msr_info->data = vcpu->arch.cr4_pinned;
+ break;
case MSR_IA32_P5_MC_ADDR:
case MSR_IA32_P5_MC_TYPE:
case MSR_IA32_MCG_CAP:
@@ -6316,10 +6367,84 @@ static void emulator_set_hflags(struct x86_emulate_ctxt *ctxt, unsigned emul_fla
emul_to_vcpu(ctxt)->arch.hflags = emul_flags;
}
+static inline u64 restore_pinned(u64 val, u64 subset, u64 pinned)
+{
+ u64 pinned_high = pinned & subset;
+ u64 pinned_low = ~pinned & subset;
+
+ val |= pinned_high;
+ val &= ~pinned_low;
+
+ return val;
+}
+
+static void kvm_pre_leave_smm_32_restore_crX_pinned(struct kvm_vcpu *vcpu,
+ const char *smstate,
+ u16 offset,
+ unsigned long allowed,
+ unsigned long cr_pinned)
+{
+ u32 cr;
+
+ cr = GET_SMSTATE(u32, smstate, offset);
+ cr = (u32)restore_pinned(cr, allowed, cr_pinned);
+ put_smstate(u32, smstate, offset, cr);
+}
+
+static void kvm_pre_leave_smm_32_restore_cr_pinned(struct kvm_vcpu *vcpu,
+ const char *smstate)
+{
+ if (vcpu->arch.cr0_pinned)
+ kvm_pre_leave_smm_32_restore_crX_pinned(vcpu, smstate, 0x7ffc,
+ KVM_CR0_PIN_ALLOWED,
+ vcpu->arch.cr0_pinned);
+
+ if (vcpu->arch.cr4_pinned)
+ kvm_pre_leave_smm_32_restore_crX_pinned(vcpu, smstate, 0x7f14,
+ KVM_CR4_PIN_ALLOWED,
+ vcpu->arch.cr4_pinned);
+}
+
+static void kvm_pre_leave_smm_64_restore_crX_pinned(struct kvm_vcpu *vcpu,
+ const char *smstate,
+ u16 offset,
+ unsigned long allowed,
+ unsigned long cr_pinned)
+{
+ u32 cr;
+
+ cr = GET_SMSTATE(u64, smstate, offset);
+ cr = restore_pinned(cr, allowed, cr_pinned);
+ put_smstate(u64, smstate, offset, cr);
+}
+
+static void kvm_pre_leave_smm_64_restore_cr_pinned(struct kvm_vcpu *vcpu,
+ const char *smstate)
+{
+ if (vcpu->arch.cr0_pinned)
+ kvm_pre_leave_smm_64_restore_crX_pinned(vcpu, smstate, 0x7f58,
+ KVM_CR0_PIN_ALLOWED,
+ vcpu->arch.cr0_pinned);
+
+ if (vcpu->arch.cr4_pinned)
+ kvm_pre_leave_smm_64_restore_crX_pinned(vcpu, smstate, 0x7f48,
+ KVM_CR4_PIN_ALLOWED,
+ vcpu->arch.cr4_pinned);
+}
+
static int emulator_pre_leave_smm(struct x86_emulate_ctxt *ctxt,
const char *smstate)
{
- return kvm_x86_ops->pre_leave_smm(emul_to_vcpu(ctxt), smstate);
+ struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+
+#ifdef CONFIG_X86_64
+ if (guest_cpuid_has(vcpu, X86_FEATURE_LM))
+ kvm_pre_leave_smm_64_restore_cr_pinned(vcpu, smstate);
+ else
+#endif
+ kvm_pre_leave_smm_32_restore_cr_pinned(vcpu, smstate);
+
+ return kvm_x86_ops->pre_leave_smm(vcpu, smstate);
}
static void emulator_post_leave_smm(struct x86_emulate_ctxt *ctxt)
@@ -9490,6 +9615,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vcpu->arch.ia32_xss = 0;
+ vcpu->arch.cr0_pinned = 0;
+ vcpu->arch.cr4_pinned = 0;
+
kvm_x86_ops->vcpu_reset(vcpu, init_event);
}
--
2.21.0
Check that paravirtualized control register pinning blocks modifications
of pinned CR values stored in SMRAM on exit from SMM.
Signed-off-by: John Andersen <[email protected]>
---
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/include/x86_64/processor.h | 9 +
.../selftests/kvm/x86_64/smm_cr_pin_test.c | 180 ++++++++++++++++++
4 files changed, 191 insertions(+)
create mode 100644 tools/testing/selftests/kvm/x86_64/smm_cr_pin_test.c
diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index 30072c3f52fb..08e18ae1b80f 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -7,6 +7,7 @@
/x86_64/platform_info_test
/x86_64/set_sregs_test
/x86_64/smm_test
+/x86_64/smm_cr_pin_test
/x86_64/state_test
/x86_64/sync_regs_test
/x86_64/vmx_close_while_nested_test
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index d91c53b726e6..f3fdac72fc74 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -19,6 +19,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/mmio_warning_test
TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
TEST_GEN_PROGS_x86_64 += x86_64/smm_test
+TEST_GEN_PROGS_x86_64 += x86_64/smm_cr_pin_test
TEST_GEN_PROGS_x86_64 += x86_64/state_test
TEST_GEN_PROGS_x86_64 += x86_64/sync_regs_test
TEST_GEN_PROGS_x86_64 += x86_64/vmx_close_while_nested_test
diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 7428513a4c68..70394d2ffa5d 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -197,6 +197,11 @@ static inline uint64_t get_cr0(void)
return cr0;
}
+static inline void set_cr0(uint64_t val)
+{
+ __asm__ __volatile__("mov %0, %%cr0" : : "r" (val) : "memory");
+}
+
static inline uint64_t get_cr3(void)
{
uint64_t cr3;
@@ -380,4 +385,8 @@ void kvm_get_cpu_address_width(unsigned int *pa_bits, unsigned int *va_bits);
/* VMX_EPT_VPID_CAP bits */
#define VMX_EPT_VPID_CAP_AD_BITS (1ULL << 21)
+/* KVM MSRs */
+#define MSR_KVM_CR0_PINNED 0x4b564d08
+#define MSR_KVM_CR4_PINNED 0x4b564d09
+
#endif /* SELFTEST_KVM_PROCESSOR_H */
diff --git a/tools/testing/selftests/kvm/x86_64/smm_cr_pin_test.c b/tools/testing/selftests/kvm/x86_64/smm_cr_pin_test.c
new file mode 100644
index 000000000000..013983bb4ba4
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/smm_cr_pin_test.c
@@ -0,0 +1,180 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Tests for control register pinning not being affected by SMRAM writes.
+ */
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <sys/ioctl.h>
+
+#include "test_util.h"
+
+#include "kvm_util.h"
+
+#include "processor.h"
+
+#define VCPU_ID 1
+
+#define PAGE_SIZE 4096
+
+#define SMRAM_SIZE 65536
+#define SMRAM_MEMSLOT ((1 << 16) | 1)
+#define SMRAM_PAGES (SMRAM_SIZE / PAGE_SIZE)
+#define SMRAM_GPA 0x1000000
+#define SMRAM_STAGE 0xfe
+
+#define STR(x) #x
+#define XSTR(s) STR(s)
+
+#define SYNC_PORT 0xe
+#define DONE 0xff
+
+#define CR0_PINNED X86_CR0_WP
+#define CR4_PINNED (X86_CR4_SMAP | X86_CR4_UMIP)
+#define CR4_ALL (CR4_PINNED | X86_CR4_SMEP)
+
+/*
+ * This is compiled as normal 64-bit code, however, SMI handler is executed
+ * in real-address mode. To stay simple we're limiting ourselves to a mode
+ * independent subset of asm here.
+ * SMI handler always report back fixed stage SMRAM_STAGE.
+ */
+uint8_t smi_handler[] = {
+ 0xb0, SMRAM_STAGE, /* mov $SMRAM_STAGE, %al */
+ 0xe4, SYNC_PORT, /* in $SYNC_PORT, %al */
+ 0x0f, 0xaa, /* rsm */
+};
+
+void sync_with_host(uint64_t phase)
+{
+ asm volatile("in $" XSTR(SYNC_PORT)", %%al \n"
+ : : "a" (phase));
+}
+
+void self_smi(void)
+{
+ wrmsr(APIC_BASE_MSR + (APIC_ICR >> 4),
+ APIC_DEST_SELF | APIC_INT_ASSERT | APIC_DM_SMI);
+}
+
+void guest_code(void *unused)
+{
+ uint64_t apicbase = rdmsr(MSR_IA32_APICBASE);
+
+ (void)unused;
+
+ sync_with_host(1);
+
+ wrmsr(MSR_IA32_APICBASE, apicbase | X2APIC_ENABLE);
+
+ sync_with_host(2);
+
+ set_cr0(get_cr0() | CR0_PINNED);
+
+ wrmsr(MSR_KVM_CR0_PINNED, CR0_PINNED);
+
+ sync_with_host(3);
+
+ set_cr4(get_cr4() | CR4_PINNED);
+
+ sync_with_host(4);
+
+ /* Pin SMEP low */
+ wrmsr(MSR_KVM_CR4_PINNED, CR4_PINNED);
+
+ sync_with_host(5);
+
+ self_smi();
+
+ sync_with_host(DONE);
+}
+
+int main(int argc, char *argv[])
+{
+ struct kvm_regs regs;
+ struct kvm_sregs sregs;
+ struct kvm_vm *vm;
+ struct kvm_run *run;
+ struct kvm_x86_state *state;
+ int stage, stage_reported;
+ u64 *cr;
+
+ /* Create VM */
+ vm = vm_create_default(VCPU_ID, 0, guest_code);
+
+ vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
+
+ run = vcpu_state(vm, VCPU_ID);
+
+ vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, SMRAM_GPA,
+ SMRAM_MEMSLOT, SMRAM_PAGES, 0);
+ TEST_ASSERT(vm_phy_pages_alloc(vm, SMRAM_PAGES, SMRAM_GPA, SMRAM_MEMSLOT)
+ == SMRAM_GPA, "could not allocate guest physical addresses?");
+
+ memset(addr_gpa2hva(vm, SMRAM_GPA), 0x0, SMRAM_SIZE);
+ memcpy(addr_gpa2hva(vm, SMRAM_GPA) + 0x8000, smi_handler,
+ sizeof(smi_handler));
+
+ vcpu_set_msr(vm, VCPU_ID, MSR_IA32_SMBASE, SMRAM_GPA);
+
+ vcpu_args_set(vm, VCPU_ID, 1, 0);
+
+ for (stage = 1;; stage++) {
+ _vcpu_run(vm, VCPU_ID);
+
+ TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+ "Stage %d: unexpected exit reason: %u (%s),\n",
+ stage, run->exit_reason,
+ exit_reason_str(run->exit_reason));
+
+ memset(®s, 0, sizeof(regs));
+ vcpu_regs_get(vm, VCPU_ID, ®s);
+
+ memset(&sregs, 0, sizeof(sregs));
+ vcpu_sregs_get(vm, VCPU_ID, &sregs);
+
+ stage_reported = regs.rax & 0xff;
+
+ if (stage_reported == DONE) {
+ TEST_ASSERT((sregs.cr0 & CR0_PINNED) == CR0_PINNED,
+ "Unexpected cr0. Bits missing: %llx",
+ sregs.cr0 ^ (CR0_PINNED | sregs.cr0));
+ TEST_ASSERT((sregs.cr4 & CR4_ALL) == CR4_PINNED,
+ "Unexpected cr4. Bits missing: %llx, cr4: %llx",
+ sregs.cr4 ^ (CR4_ALL | sregs.cr4),
+ sregs.cr4);
+ goto done;
+ }
+
+ TEST_ASSERT(stage_reported == stage ||
+ stage_reported == SMRAM_STAGE,
+ "Unexpected stage: #%x, got %x",
+ stage, stage_reported);
+
+ /* Within SMM modify CR0/4 to not contain pinned bits. */
+ if (stage_reported == SMRAM_STAGE) {
+ cr = (u64 *)(addr_gpa2hva(vm, SMRAM_GPA + 0x8000 + 0x7f58));
+ *cr &= ~CR0_PINNED;
+
+ cr = (u64 *)(addr_gpa2hva(vm, SMRAM_GPA + 0x8000 + 0x7f48));
+ /* Unset pinned, set one that was pinned low */
+ *cr &= ~CR4_PINNED;
+ *cr |= X86_CR4_SMEP;
+ }
+
+ state = vcpu_save_state(vm, VCPU_ID);
+ kvm_vm_release(vm);
+ kvm_vm_restart(vm, O_RDWR);
+ vm_vcpu_add(vm, VCPU_ID);
+ vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
+ vcpu_load_state(vm, VCPU_ID, state);
+ run = vcpu_state(vm, VCPU_ID);
+ free(state);
+ }
+
+done:
+ kvm_vm_free(vm);
+}
--
2.21.0
Strengthen existing control register pinning when running
paravirtualized under KVM. Check which bits KVM supports pinning for
each control register and only pin supported bits which are already
pinned via the existing native protection. Write to KVM CR0/4 pinned
MSRs to enable pinning.
Initiate KVM assisted pinning directly following the setup of native
pinning on boot CPU. For non-boot CPUs initiate paravirtualized pinning
on CPU identification.
Identification of non-boot CPUs takes place after the boot CPU has setup
native CR pinning. Therefore, non-boot CPUs access pinned bits setup by
the boot CPU and request that those be pinned. All CPUs request
paravirtualized pinning of the same bits which are already pinned
natively.
Guests using the kexec system call currently do not support
paravirtualized control register pinning. This is due to early boot
code writing known good values to control registers, these values do
not contain the protected bits. This is due to CPU feature
identification being done at a later time, when the kernel properly
checks if it can enable protections. As such, the pv_cr_pin command line
option has been added which instructs the kernel to disable kexec in
favor of enabling paravirtualized control register pinning. crashkernel
is also disabled when the pv_cr_pin parameter is specified due to its
reliance on kexec.
When we fix kexec, we will still need a way for a kernel with support to
know if the kernel it is attempting to load has support. If a kernel
with this enabled attempts to kexec a kernel where this is not
supported, it would trigger a fault almost immediately.
Liran suggested adding a section to the built image acting as a flag to
signify support for being kexec'd by a kernel with pinning enabled.
Should that approach be implemented, it is likely that the command line
flag (pv_cr_pin) would still be desired for some deprecation period. We
wouldn't want the default behavior to change from being able to kexec
older kernels to not being able to, as this might break some users
workflows.
Signed-off-by: John Andersen <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 11 ++++++
arch/x86/Kconfig | 10 +++++
arch/x86/include/asm/kvm_para.h | 17 ++++++++
arch/x86/kernel/cpu/common.c | 5 +++
arch/x86/kernel/kvm.c | 39 +++++++++++++++++++
arch/x86/kernel/setup.c | 8 ++++
6 files changed, 90 insertions(+)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index dbc22d684627..8552501a1579 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3836,6 +3836,17 @@
[KNL] Number of legacy pty's. Overwrites compiled-in
default number.
+ pv_cr_pin [SECURITY,X86]
+ Enable paravirtualized control register pinning. When
+ running paravirutalized under KVM, request that KVM not
+ allow the guest to disable kernel protection features
+ set in CPU control registers. Specifying this option
+ will disable kexec (and crashkernel). If kexec support
+ has not been compiled into the kernel and host KVM
+ supports paravirtualized control register pinning, it
+ will be active by default without the need to specify
+ this parameter.
+
quiet [KNL] Disable most log messages
r128= [HW,DRM]
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index beea77046f9b..d47b09ae23bd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -800,6 +800,7 @@ config KVM_GUEST
bool "KVM Guest support (including kvmclock)"
depends on PARAVIRT
select PARAVIRT_CLOCK
+ select PARAVIRT_CR_PIN
select ARCH_CPUIDLE_HALTPOLL
default y
---help---
@@ -843,6 +844,15 @@ config PARAVIRT_TIME_ACCOUNTING
config PARAVIRT_CLOCK
bool
+config PARAVIRT_CR_PIN
+ bool "Paravirtual bit pinning for CR0 and CR4"
+ depends on KVM_GUEST
+ help
+ Select this option to have the virtualised guest request that the
+ hypervisor disallow it from disabling protections set in control
+ registers. The hypervisor will prevent exploits from disabling
+ features such as SMEP, SMAP, UMIP, and WP.
+
config JAILHOUSE_GUEST
bool "Jailhouse non-root cell support"
depends on X86_64 && PCI
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 9b4df6eaa11a..342b475e7adf 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -102,6 +102,23 @@ static inline void kvm_spinlock_init(void)
}
#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+#ifdef CONFIG_PARAVIRT_CR_PIN
+void __init kvm_paravirt_cr_pinning_init(void);
+void kvm_setup_paravirt_cr_pinning(unsigned long cr0_pinned_bits,
+ unsigned long cr4_pinned_bits);
+#else
+static inline void kvm_paravirt_cr_pinning_init(void)
+{
+ return;
+}
+
+static inline void kvm_setup_paravirt_cr_pinning(unsigned long cr0_pinned_bits,
+ unsigned long cr4_pinned_bits)
+{
+ return;
+}
+#endif /* CONFIG_PARAVIRT_CR_PIN */
+
#else /* CONFIG_KVM_GUEST */
#define kvm_async_pf_task_wait(T, I) do {} while(0)
#define kvm_async_pf_task_wake(T) do {} while(0)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 08682757e547..bbf1de0cb253 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -21,6 +21,7 @@
#include <linux/smp.h>
#include <linux/io.h>
#include <linux/syscore_ops.h>
+#include <linux/kvm_para.h>
#include <asm/stackprotector.h>
#include <asm/perf_event.h>
@@ -416,6 +417,8 @@ static void __init setup_cr_pinning(void)
mask = (X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP);
cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & mask;
static_key_enable(&cr_pinning.key);
+
+ kvm_setup_paravirt_cr_pinning(X86_CR0_WP, cr4_pinned_bits);
}
/*
@@ -1589,6 +1592,8 @@ void identify_secondary_cpu(struct cpuinfo_x86 *c)
mtrr_ap_init();
validate_apic_and_package_id(c);
x86_spec_ctrl_setup_ap();
+
+ kvm_setup_paravirt_cr_pinning(X86_CR0_WP, cr4_pinned_bits);
}
static __init int setup_noclflush(char *arg)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index d817f255aed8..f6c159229e35 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -24,6 +24,8 @@
#include <linux/debugfs.h>
#include <linux/nmi.h>
#include <linux/swait.h>
+#include <linux/init.h>
+#include <linux/kexec.h>
#include <asm/timer.h>
#include <asm/cpu.h>
#include <asm/traps.h>
@@ -34,6 +36,7 @@
#include <asm/hypervisor.h>
#include <asm/tlb.h>
#include <asm/cpuidle_haltpoll.h>
+#include <asm/cmdline.h>
static int kvmapf = 1;
@@ -708,6 +711,7 @@ static void __init kvm_apic_init(void)
static void __init kvm_init_platform(void)
{
kvmclock_init();
+ kvm_paravirt_cr_pinning_init();
x86_platform.apic_post_init = kvm_apic_init;
}
@@ -857,6 +861,41 @@ void __init kvm_spinlock_init(void)
#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+#ifdef CONFIG_PARAVIRT_CR_PIN
+static int kvm_paravirt_cr_pinning_enabled __ro_after_init = 0;
+
+void __init kvm_paravirt_cr_pinning_init()
+{
+#ifdef CONFIG_KEXEC_CORE
+ if (!cmdline_find_option_bool(boot_command_line, "pv_cr_pin"))
+ return;
+
+ /* Paravirtualized CR pinning is currently incompatible with kexec */
+ kexec_load_disabled = 1;
+#endif
+
+ kvm_paravirt_cr_pinning_enabled = 1;
+}
+
+void kvm_setup_paravirt_cr_pinning(unsigned long cr0_pinned_bits,
+ unsigned long cr4_pinned_bits)
+{
+ u64 mask;
+
+ if (!kvm_paravirt_cr_pinning_enabled)
+ return;
+
+ if (!kvm_para_has_feature(KVM_FEATURE_CR_PIN))
+ return;
+
+ rdmsrl(MSR_KVM_CR0_PIN_ALLOWED, mask);
+ wrmsrl(MSR_KVM_CR0_PINNED, cr0_pinned_bits & mask);
+
+ rdmsrl(MSR_KVM_CR4_PIN_ALLOWED, mask);
+ wrmsrl(MSR_KVM_CR4_PINNED, cr4_pinned_bits & mask);
+}
+#endif
+
#ifdef CONFIG_ARCH_CPUIDLE_HALTPOLL
static void kvm_disable_host_haltpoll(void *i)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 9e71d6f6e564..a75f9e730cc3 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -26,6 +26,9 @@
#include <asm/apic.h>
#include <asm/bios_ebda.h>
#include <asm/bugs.h>
+#include <asm/kasan.h>
+#include <asm/cmdline.h>
+
#include <asm/cpu.h>
#include <asm/efi.h>
#include <asm/gart.h>
@@ -496,6 +499,11 @@ static void __init reserve_crashkernel(void)
return;
}
+ if (cmdline_find_option_bool(boot_command_line, "pv_cr_pin")) {
+ pr_info("Ignoring crashkernel since pv_cr_pin present in cmdline\n");
+ return;
+ }
+
/* 0 means: find the address automatically */
if (!crash_base) {
/*
--
2.21.0
In identify_cpu when setting up SMEP/SMAP/UMIP call
cr4_set_bits_and_update_boot instead of cr4_set_bits. This ensures that
mmu_cr4_features contains those bits, and does not disable those
protections when in hibernation asm.
setup_arch updates mmu_cr4_features to save what identified features are
supported for later use in hibernation asm when cr4 needs to be modified
to toggle PGE. cr4 writes happen in restore_image and restore_registers.
setup_arch occurs before identify_cpu, this leads to mmu_cr4_features
not containing some of the cr4 features which were enabled via
identify_cpu when hibernation asm is executed.
On CPU bringup when cr4_set_bits_and_update_boot is called
mmu_cr4_features will now be written to. For the boot CPU,
the __ro_after_init on mmu_cr4_features does not cause a fault. However,
__ro_after_init was removed due to it triggering faults on non-boot
CPUs.
Signed-off-by: John Andersen <[email protected]>
---
arch/x86/kernel/cpu/common.c | 6 +++---
arch/x86/kernel/setup.c | 4 ++--
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 52c9bfbbdb2a..08682757e547 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -297,7 +297,7 @@ __setup("nosmep", setup_disable_smep);
static __always_inline void setup_smep(struct cpuinfo_x86 *c)
{
if (cpu_has(c, X86_FEATURE_SMEP))
- cr4_set_bits(X86_CR4_SMEP);
+ cr4_set_bits_and_update_boot(X86_CR4_SMEP);
}
static __init int setup_disable_smap(char *arg)
@@ -316,7 +316,7 @@ static __always_inline void setup_smap(struct cpuinfo_x86 *c)
if (cpu_has(c, X86_FEATURE_SMAP)) {
#ifdef CONFIG_X86_SMAP
- cr4_set_bits(X86_CR4_SMAP);
+ cr4_set_bits_and_update_boot(X86_CR4_SMAP);
#else
cr4_clear_bits(X86_CR4_SMAP);
#endif
@@ -333,7 +333,7 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
if (!cpu_has(c, X86_FEATURE_UMIP))
goto out;
- cr4_set_bits(X86_CR4_UMIP);
+ cr4_set_bits_and_update_boot(X86_CR4_UMIP);
pr_info_once("x86/cpu: User Mode Instruction Prevention (UMIP) activated\n");
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index a74262c71484..9e71d6f6e564 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -138,9 +138,9 @@ EXPORT_SYMBOL(boot_cpu_data);
#if !defined(CONFIG_X86_PAE) || defined(CONFIG_X86_64)
-__visible unsigned long mmu_cr4_features __ro_after_init;
+__visible unsigned long mmu_cr4_features;
#else
-__visible unsigned long mmu_cr4_features __ro_after_init = X86_CR4_PAE;
+__visible unsigned long mmu_cr4_features = X86_CR4_PAE;
#endif
/* Boot loader ID and version as integers, for the benefit of proc_dointvec */
--
2.21.0
On Tue, Feb 18, 2020 at 01:59:01PM -0800, John Andersen wrote:
> Check that paravirtualized control register pinning blocks modifications
> of pinned CR values stored in SMRAM on exit from SMM.
>
> Signed-off-by: John Andersen <[email protected]>
> ---
> tools/testing/selftests/kvm/.gitignore | 1 +
> tools/testing/selftests/kvm/Makefile | 1 +
> .../selftests/kvm/include/x86_64/processor.h | 9 +
> .../selftests/kvm/x86_64/smm_cr_pin_test.c | 180 ++++++++++++++++++
> 4 files changed, 191 insertions(+)
> create mode 100644 tools/testing/selftests/kvm/x86_64/smm_cr_pin_test.c
>
> diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
> index 30072c3f52fb..08e18ae1b80f 100644
> --- a/tools/testing/selftests/kvm/.gitignore
> +++ b/tools/testing/selftests/kvm/.gitignore
> @@ -7,6 +7,7 @@
> /x86_64/platform_info_test
> /x86_64/set_sregs_test
> /x86_64/smm_test
> +/x86_64/smm_cr_pin_test
> /x86_64/state_test
> /x86_64/sync_regs_test
> /x86_64/vmx_close_while_nested_test
> diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> index d91c53b726e6..f3fdac72fc74 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -19,6 +19,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/mmio_warning_test
> TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
> TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
> TEST_GEN_PROGS_x86_64 += x86_64/smm_test
> +TEST_GEN_PROGS_x86_64 += x86_64/smm_cr_pin_test
> TEST_GEN_PROGS_x86_64 += x86_64/state_test
> TEST_GEN_PROGS_x86_64 += x86_64/sync_regs_test
> TEST_GEN_PROGS_x86_64 += x86_64/vmx_close_while_nested_test
> diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
> index 7428513a4c68..70394d2ffa5d 100644
> --- a/tools/testing/selftests/kvm/include/x86_64/processor.h
> +++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
> @@ -197,6 +197,11 @@ static inline uint64_t get_cr0(void)
> return cr0;
> }
>
> +static inline void set_cr0(uint64_t val)
> +{
> + __asm__ __volatile__("mov %0, %%cr0" : : "r" (val) : "memory");
> +}
> +
> static inline uint64_t get_cr3(void)
> {
> uint64_t cr3;
> @@ -380,4 +385,8 @@ void kvm_get_cpu_address_width(unsigned int *pa_bits, unsigned int *va_bits);
> /* VMX_EPT_VPID_CAP bits */
> #define VMX_EPT_VPID_CAP_AD_BITS (1ULL << 21)
>
> +/* KVM MSRs */
> +#define MSR_KVM_CR0_PINNED 0x4b564d08
> +#define MSR_KVM_CR4_PINNED 0x4b564d09
> +
> #endif /* SELFTEST_KVM_PROCESSOR_H */
> diff --git a/tools/testing/selftests/kvm/x86_64/smm_cr_pin_test.c b/tools/testing/selftests/kvm/x86_64/smm_cr_pin_test.c
> new file mode 100644
> index 000000000000..013983bb4ba4
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/x86_64/smm_cr_pin_test.c
> @@ -0,0 +1,180 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Tests for control register pinning not being affected by SMRAM writes.
> + */
> +#define _GNU_SOURCE /* for program_invocation_short_name */
> +#include <fcntl.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <stdint.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +
> +#include "test_util.h"
> +
> +#include "kvm_util.h"
> +
> +#include "processor.h"
> +
> +#define VCPU_ID 1
> +
> +#define PAGE_SIZE 4096
> +
> +#define SMRAM_SIZE 65536
> +#define SMRAM_MEMSLOT ((1 << 16) | 1)
> +#define SMRAM_PAGES (SMRAM_SIZE / PAGE_SIZE)
> +#define SMRAM_GPA 0x1000000
> +#define SMRAM_STAGE 0xfe
> +
> +#define STR(x) #x
> +#define XSTR(s) STR(s)
linux/stringify.h is in tools/
> +
> +#define SYNC_PORT 0xe
> +#define DONE 0xff
> +
> +#define CR0_PINNED X86_CR0_WP
> +#define CR4_PINNED (X86_CR4_SMAP | X86_CR4_UMIP)
> +#define CR4_ALL (CR4_PINNED | X86_CR4_SMEP)
> +
> +/*
> + * This is compiled as normal 64-bit code, however, SMI handler is executed
> + * in real-address mode. To stay simple we're limiting ourselves to a mode
> + * independent subset of asm here.
> + * SMI handler always report back fixed stage SMRAM_STAGE.
> + */
> +uint8_t smi_handler[] = {
> + 0xb0, SMRAM_STAGE, /* mov $SMRAM_STAGE, %al */
> + 0xe4, SYNC_PORT, /* in $SYNC_PORT, %al */
> + 0x0f, 0xaa, /* rsm */
> +};
> +
> +void sync_with_host(uint64_t phase)
> +{
> + asm volatile("in $" XSTR(SYNC_PORT)", %%al \n"
> + : : "a" (phase));
> +}
Any reason not to use GUEST_SYNC() ?
> +
> +void self_smi(void)
> +{
> + wrmsr(APIC_BASE_MSR + (APIC_ICR >> 4),
> + APIC_DEST_SELF | APIC_INT_ASSERT | APIC_DM_SMI);
> +}
> +
> +void guest_code(void *unused)
> +{
Why not just define guest_code as 'void guest_code(void)' ?
> + uint64_t apicbase = rdmsr(MSR_IA32_APICBASE);
> +
> + (void)unused;
> +
> + sync_with_host(1);
> +
> + wrmsr(MSR_IA32_APICBASE, apicbase | X2APIC_ENABLE);
> +
> + sync_with_host(2);
> +
> + set_cr0(get_cr0() | CR0_PINNED);
> +
> + wrmsr(MSR_KVM_CR0_PINNED, CR0_PINNED);
> +
> + sync_with_host(3);
> +
> + set_cr4(get_cr4() | CR4_PINNED);
> +
> + sync_with_host(4);
> +
> + /* Pin SMEP low */
> + wrmsr(MSR_KVM_CR4_PINNED, CR4_PINNED);
> +
> + sync_with_host(5);
> +
> + self_smi();
> +
> + sync_with_host(DONE);
GUEST_DONE() ?
> +}
> +
> +int main(int argc, char *argv[])
> +{
> + struct kvm_regs regs;
> + struct kvm_sregs sregs;
> + struct kvm_vm *vm;
> + struct kvm_run *run;
> + struct kvm_x86_state *state;
> + int stage, stage_reported;
> + u64 *cr;
> +
> + /* Create VM */
> + vm = vm_create_default(VCPU_ID, 0, guest_code);
> +
> + vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
> +
> + run = vcpu_state(vm, VCPU_ID);
> +
> + vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, SMRAM_GPA,
> + SMRAM_MEMSLOT, SMRAM_PAGES, 0);
> + TEST_ASSERT(vm_phy_pages_alloc(vm, SMRAM_PAGES, SMRAM_GPA, SMRAM_MEMSLOT)
> + == SMRAM_GPA, "could not allocate guest physical addresses?");
> +
> + memset(addr_gpa2hva(vm, SMRAM_GPA), 0x0, SMRAM_SIZE);
> + memcpy(addr_gpa2hva(vm, SMRAM_GPA) + 0x8000, smi_handler,
> + sizeof(smi_handler));
> +
> + vcpu_set_msr(vm, VCPU_ID, MSR_IA32_SMBASE, SMRAM_GPA);
> +
> + vcpu_args_set(vm, VCPU_ID, 1, 0);
guest_code() doesn't use inputs, so why set rdi to zero?
> +
> + for (stage = 1;; stage++) {
> + _vcpu_run(vm, VCPU_ID);
> +
> + TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
> + "Stage %d: unexpected exit reason: %u (%s),\n",
> + stage, run->exit_reason,
> + exit_reason_str(run->exit_reason));
> +
> + memset(®s, 0, sizeof(regs));
> + vcpu_regs_get(vm, VCPU_ID, ®s);
> +
> + memset(&sregs, 0, sizeof(sregs));
> + vcpu_sregs_get(vm, VCPU_ID, &sregs);
> +
> + stage_reported = regs.rax & 0xff;
If you use GUEST_ASSERT() and get_ucall() then stage_reported is uc.args[1].
Why mask it with 0xff? Shouldn't the test assert if the stage is an
unexpected value?
> +
> + if (stage_reported == DONE) {
uc.cmd == UCALL_DONE
> + TEST_ASSERT((sregs.cr0 & CR0_PINNED) == CR0_PINNED,
> + "Unexpected cr0. Bits missing: %llx",
> + sregs.cr0 ^ (CR0_PINNED | sregs.cr0));
> + TEST_ASSERT((sregs.cr4 & CR4_ALL) == CR4_PINNED,
> + "Unexpected cr4. Bits missing: %llx, cr4: %llx",
> + sregs.cr4 ^ (CR4_ALL | sregs.cr4),
> + sregs.cr4);
> + goto done;
> + }
> +
> + TEST_ASSERT(stage_reported == stage ||
> + stage_reported == SMRAM_STAGE,
> + "Unexpected stage: #%x, got %x",
> + stage, stage_reported);
> +
> + /* Within SMM modify CR0/4 to not contain pinned bits. */
> + if (stage_reported == SMRAM_STAGE) {
> + cr = (u64 *)(addr_gpa2hva(vm, SMRAM_GPA + 0x8000 + 0x7f58));
> + *cr &= ~CR0_PINNED;
> +
> + cr = (u64 *)(addr_gpa2hva(vm, SMRAM_GPA + 0x8000 + 0x7f48));
> + /* Unset pinned, set one that was pinned low */
> + *cr &= ~CR4_PINNED;
> + *cr |= X86_CR4_SMEP;
> + }
> +
> + state = vcpu_save_state(vm, VCPU_ID);
> + kvm_vm_release(vm);
> + kvm_vm_restart(vm, O_RDWR);
> + vm_vcpu_add(vm, VCPU_ID);
> + vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
> + vcpu_load_state(vm, VCPU_ID, state);
> + run = vcpu_state(vm, VCPU_ID);
> + free(state);
> + }
> +
> +done:
> + kvm_vm_free(vm);
> +}
> --
> 2.21.0
>
Thanks,
drew
On Mon, 2020-02-24 at 16:50 +0100, Andrew Jones wrote:
> On Tue, Feb 18, 2020 at 01:59:01PM -0800, John Andersen wrote:
> > Check that paravirtualized control register pinning blocks
> > modifications
> > of pinned CR values stored in SMRAM on exit from SMM.
> >
> > Signed-off-by: John Andersen <[email protected]>
> > ---
> > tools/testing/selftests/kvm/.gitignore | 1 +
> > tools/testing/selftests/kvm/Makefile | 1 +
> > .../selftests/kvm/include/x86_64/processor.h | 9 +
> > .../selftests/kvm/x86_64/smm_cr_pin_test.c | 180
> > ++++++++++++++++++
> > 4 files changed, 191 insertions(+)
> > create mode 100644
> > tools/testing/selftests/kvm/x86_64/smm_cr_pin_test.c
> >
> > diff --git a/tools/testing/selftests/kvm/.gitignore
> > b/tools/testing/selftests/kvm/.gitignore
> > index 30072c3f52fb..08e18ae1b80f 100644
> > --- a/tools/testing/selftests/kvm/.gitignore
> > +++ b/tools/testing/selftests/kvm/.gitignore
> > @@ -7,6 +7,7 @@
> > /x86_64/platform_info_test
> > /x86_64/set_sregs_test
> > /x86_64/smm_test
> > +/x86_64/smm_cr_pin_test
> > /x86_64/state_test
> > /x86_64/sync_regs_test
> > /x86_64/vmx_close_while_nested_test
> > diff --git a/tools/testing/selftests/kvm/Makefile
> > b/tools/testing/selftests/kvm/Makefile
> > index d91c53b726e6..f3fdac72fc74 100644
> > --- a/tools/testing/selftests/kvm/Makefile
> > +++ b/tools/testing/selftests/kvm/Makefile
> > @@ -19,6 +19,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/mmio_warning_test
> > TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
> > TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
> > TEST_GEN_PROGS_x86_64 += x86_64/smm_test
> > +TEST_GEN_PROGS_x86_64 += x86_64/smm_cr_pin_test
> > TEST_GEN_PROGS_x86_64 += x86_64/state_test
> > TEST_GEN_PROGS_x86_64 += x86_64/sync_regs_test
> > TEST_GEN_PROGS_x86_64 += x86_64/vmx_close_while_nested_test
> > diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h
> > b/tools/testing/selftests/kvm/include/x86_64/processor.h
> > index 7428513a4c68..70394d2ffa5d 100644
> > --- a/tools/testing/selftests/kvm/include/x86_64/processor.h
> > +++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
> > @@ -197,6 +197,11 @@ static inline uint64_t get_cr0(void)
> > return cr0;
> > }
> >
> > +static inline void set_cr0(uint64_t val)
> > +{
> > + __asm__ __volatile__("mov %0, %%cr0" : : "r" (val) : "memory");
> > +}
> > +
> > static inline uint64_t get_cr3(void)
> > {
> > uint64_t cr3;
> > @@ -380,4 +385,8 @@ void kvm_get_cpu_address_width(unsigned int
> > *pa_bits, unsigned int *va_bits);
> > /* VMX_EPT_VPID_CAP bits */
> > #define VMX_EPT_VPID_CAP_AD_BITS (1ULL << 21)
> >
> > +/* KVM MSRs */
> > +#define MSR_KVM_CR0_PINNED 0x4b564d08
> > +#define MSR_KVM_CR4_PINNED 0x4b564d09
> > +
> > #endif /* SELFTEST_KVM_PROCESSOR_H */
> > diff --git a/tools/testing/selftests/kvm/x86_64/smm_cr_pin_test.c
> > b/tools/testing/selftests/kvm/x86_64/smm_cr_pin_test.c
> > new file mode 100644
> > index 000000000000..013983bb4ba4
> > --- /dev/null
> > +++ b/tools/testing/selftests/kvm/x86_64/smm_cr_pin_test.c
> > @@ -0,0 +1,180 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Tests for control register pinning not being affected by SMRAM
> > writes.
> > + */
> > +#define _GNU_SOURCE /* for program_invocation_short_name */
> > +#include <fcntl.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +#include <string.h>
> > +#include <sys/ioctl.h>
> > +
> > +#include "test_util.h"
> > +
> > +#include "kvm_util.h"
> > +
> > +#include "processor.h"
> > +
> > +#define VCPU_ID 1
> > +
> > +#define PAGE_SIZE 4096
> > +
> > +#define SMRAM_SIZE 65536
> > +#define SMRAM_MEMSLOT ((1 << 16) | 1)
> > +#define SMRAM_PAGES (SMRAM_SIZE / PAGE_SIZE)
> > +#define SMRAM_GPA 0x1000000
> > +#define SMRAM_STAGE 0xfe
> > +
> > +#define STR(x) #x
> > +#define XSTR(s) STR(s)
>
> linux/stringify.h is in tools/
>
> > +
> > +#define SYNC_PORT 0xe
> > +#define DONE 0xff
> > +
> > +#define CR0_PINNED X86_CR0_WP
> > +#define CR4_PINNED (X86_CR4_SMAP | X86_CR4_UMIP)
> > +#define CR4_ALL (CR4_PINNED | X86_CR4_SMEP)
> > +
> > +/*
> > + * This is compiled as normal 64-bit code, however, SMI handler is
> > executed
> > + * in real-address mode. To stay simple we're limiting ourselves
> > to a mode
> > + * independent subset of asm here.
> > + * SMI handler always report back fixed stage SMRAM_STAGE.
> > + */
> > +uint8_t smi_handler[] = {
> > + 0xb0, SMRAM_STAGE, /* mov $SMRAM_STAGE, %al */
> > + 0xe4, SYNC_PORT, /* in $SYNC_PORT, %al */
> > + 0x0f, 0xaa, /* rsm */
> > +};
> > +
> > +void sync_with_host(uint64_t phase)
> > +{
> > + asm volatile("in $" XSTR(SYNC_PORT)", %%al \n"
> > + : : "a" (phase));
> > +}
>
> Any reason not to use GUEST_SYNC() ?
>
> > +
> > +void self_smi(void)
> > +{
> > + wrmsr(APIC_BASE_MSR + (APIC_ICR >> 4),
> > + APIC_DEST_SELF | APIC_INT_ASSERT | APIC_DM_SMI);
> > +}
> > +
> > +void guest_code(void *unused)
> > +{
>
> Why not just define guest_code as 'void guest_code(void)' ?
>
> > + uint64_t apicbase = rdmsr(MSR_IA32_APICBASE);
> > +
> > + (void)unused;
> > +
> > + sync_with_host(1);
> > +
> > + wrmsr(MSR_IA32_APICBASE, apicbase | X2APIC_ENABLE);
> > +
> > + sync_with_host(2);
> > +
> > + set_cr0(get_cr0() | CR0_PINNED);
> > +
> > + wrmsr(MSR_KVM_CR0_PINNED, CR0_PINNED);
> > +
> > + sync_with_host(3);
> > +
> > + set_cr4(get_cr4() | CR4_PINNED);
> > +
> > + sync_with_host(4);
> > +
> > + /* Pin SMEP low */
> > + wrmsr(MSR_KVM_CR4_PINNED, CR4_PINNED);
> > +
> > + sync_with_host(5);
> > +
> > + self_smi();
> > +
> > + sync_with_host(DONE);
>
> GUEST_DONE() ?
>
> > +}
> > +
> > +int main(int argc, char *argv[])
> > +{
> > + struct kvm_regs regs;
> > + struct kvm_sregs sregs;
> > + struct kvm_vm *vm;
> > + struct kvm_run *run;
> > + struct kvm_x86_state *state;
> > + int stage, stage_reported;
> > + u64 *cr;
> > +
> > + /* Create VM */
> > + vm = vm_create_default(VCPU_ID, 0, guest_code);
> > +
> > + vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
> > +
> > + run = vcpu_state(vm, VCPU_ID);
> > +
> > + vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
> > SMRAM_GPA,
> > + SMRAM_MEMSLOT, SMRAM_PAGES, 0);
> > + TEST_ASSERT(vm_phy_pages_alloc(vm, SMRAM_PAGES, SMRAM_GPA,
> > SMRAM_MEMSLOT)
> > + == SMRAM_GPA, "could not allocate guest physical
> > addresses?");
> > +
> > + memset(addr_gpa2hva(vm, SMRAM_GPA), 0x0, SMRAM_SIZE);
> > + memcpy(addr_gpa2hva(vm, SMRAM_GPA) + 0x8000, smi_handler,
> > + sizeof(smi_handler));
> > +
> > + vcpu_set_msr(vm, VCPU_ID, MSR_IA32_SMBASE, SMRAM_GPA);
> > +
> > + vcpu_args_set(vm, VCPU_ID, 1, 0);
>
> guest_code() doesn't use inputs, so why set rdi to zero?
>
> > +
> > + for (stage = 1;; stage++) {
> > + _vcpu_run(vm, VCPU_ID);
> > +
> > + TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
> > + "Stage %d: unexpected exit reason: %u
> > (%s),\n",
> > + stage, run->exit_reason,
> > + exit_reason_str(run->exit_reason));
> > +
> > + memset(®s, 0, sizeof(regs));
> > + vcpu_regs_get(vm, VCPU_ID, ®s);
> > +
> > + memset(&sregs, 0, sizeof(sregs));
> > + vcpu_sregs_get(vm, VCPU_ID, &sregs);
> > +
> > + stage_reported = regs.rax & 0xff;
>
> If you use GUEST_ASSERT() and get_ucall() then stage_reported is
> uc.args[1].
> Why mask it with 0xff? Shouldn't the test assert if the stage is an
> unexpected value?
>
> > +
> > + if (stage_reported == DONE) {
>
> uc.cmd == UCALL_DONE
>
> > + TEST_ASSERT((sregs.cr0 & CR0_PINNED) ==
> > CR0_PINNED,
> > + "Unexpected cr0. Bits missing:
> > %llx",
> > + sregs.cr0 ^ (CR0_PINNED |
> > sregs.cr0));
> > + TEST_ASSERT((sregs.cr4 & CR4_ALL) ==
> > CR4_PINNED,
> > + "Unexpected cr4. Bits missing:
> > %llx, cr4: %llx",
> > + sregs.cr4 ^ (CR4_ALL | sregs.cr4),
> > + sregs.cr4);
> > + goto done;
> > + }
> > +
> > + TEST_ASSERT(stage_reported == stage ||
> > + stage_reported == SMRAM_STAGE,
> > + "Unexpected stage: #%x, got %x",
> > + stage, stage_reported);
> > +
> > + /* Within SMM modify CR0/4 to not contain pinned bits.
> > */
> > + if (stage_reported == SMRAM_STAGE) {
> > + cr = (u64 *)(addr_gpa2hva(vm, SMRAM_GPA +
> > 0x8000 + 0x7f58));
> > + *cr &= ~CR0_PINNED;
> > +
> > + cr = (u64 *)(addr_gpa2hva(vm, SMRAM_GPA +
> > 0x8000 + 0x7f48));
> > + /* Unset pinned, set one that was pinned low */
> > + *cr &= ~CR4_PINNED;
> > + *cr |= X86_CR4_SMEP;
> > + }
> > +
> > + state = vcpu_save_state(vm, VCPU_ID);
> > + kvm_vm_release(vm);
> > + kvm_vm_restart(vm, O_RDWR);
> > + vm_vcpu_add(vm, VCPU_ID);
> > + vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
> > + vcpu_load_state(vm, VCPU_ID, state);
> > + run = vcpu_state(vm, VCPU_ID);
> > + free(state);
> > + }
> > +
> > +done:
> > + kvm_vm_free(vm);
> > +}
> > --
> > 2.21.0
> >
>
> Thanks,
> drew
>
Thank you Drew,
I didn't know about these functions and macros, I'll use them.
- John
The subject should be something like
KVM: x86: Introduce paravirt feature CR0/CR4 pinning
This patch obviously does a lot more than add a few MSRs :-)
On Tue, Feb 18, 2020 at 01:59:00PM -0800, John Andersen wrote:
...
> +MSR_KVM_CR0_PIN_ALLOWED: 0x4b564d06
> +MSR_KVM_CR4_PIN_ALLOWED: 0x4b564d07
> + Read only registers informing the guest which bits may be pinned for
> + each control register respectively via the CR pinned MSRs.
> +
> + data: Bits which may be pinned.
> +
> + Attempting to pin bits other than these will result in a failure when
> + writing to the respective CR pinned MSR.
> +
> + Bits which are allowed to be pinned are WP for CR0 and SMEP, SMAP, and
> + UMIP for CR4.
> +
> +MSR_KVM_CR0_PINNED: 0x4b564d08
> +MSR_KVM_CR4_PINNED: 0x4b564d09
> + Used to configure pinned bits in control registers
> +
> + data: Bits to be pinned.
> +
> + Fails if data contains bits which are not allowed to be pinned. Bits
> + which are allowed to be pinned can be found by reading the CR pin
> + allowed MSRs.
> +
> + The MSRs are read/write for host userspace, and write-only for the
> + guest.
> +
> + Once set to a non-zero value, the guest cannot clear any of the bits
> + that have been pinned to 1. The guest can set more bits to 1, so long
> + as those bits appear in the allowed MSR.
Why not allow pinning a bit to 0? That would make the logic in set_cr0/4
more intuitive. This approach also means that WRMSR interception needs to
inject a #GP if the guest attempts to pin a bit and the bit isn't currently
set in the MSR, which this patch doesn't do.
The downside to allowing pinning to '0' is that KVM will need to track the
pinned value, e.g. for RSM and VM-Exit, but that's a fairly small cost and
might be valuable in the long run, e.g. if TSX retroactively gained CR
enable bit...
> +
> + Host userspace may clear or change pinned bits at any point. Host
> + userspace must clear pinned bits on reboot.
> +
> + The MSR enables bit pinning for control registers. Pinning is active
> + when the guest is not in SMM. If the guest attempts to write values to
> + cr* where bits differ from pinned bits, the write will fail and the
> + guest will be sent a general protection fault.
...
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fb5d64ebc35d..2ee0e9886a6e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -733,6 +733,9 @@ bool pdptrs_changed(struct kvm_vcpu *vcpu)
> }
> EXPORT_SYMBOL_GPL(pdptrs_changed);
>
> +#define KVM_CR0_PIN_ALLOWED (X86_CR0_WP)
> +#define KVM_CR4_PIN_ALLOWED (X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP)
> +
> int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
> {
> unsigned long old_cr0 = kvm_read_cr0(vcpu);
> @@ -753,6 +756,11 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
> if ((cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PE))
> return 1;
>
> + if (!is_smm(vcpu)
Hmm, so there's some work to be done for nested virtualization. I'm
guessing the story will be the same as SMM, i.e. L2 is allowed to "unpin"
bits, and pinned bits are rechecked on VM-Exit to L1.
Assuming you want to go that route, this needs to also check for
!is_guest_mode(), i.e. allow L2 to change bits that are pinned by L1. It
likely works for standard nesting because nested VM-Enter bypasses
kvm_set_cr0(), e.g. prepare_vmcs02() calls vmx_set_cr0() to avoid fault
logic, but it'll break if L1 doesn't intercept pinned bits.
Speaking of the bypass, you'll need to incorporate pinning into the
VM-Enter consistency checks, otherwise L1 could bypass pinning by loading
CR0/CR4 via VM-Exit, e.g. load_vmcs12_host_state() bypasses pinning checks.
> + && vcpu->arch.cr0_pinned
Operators go on the previous line, newline only when needed, and align
indentation when continuing a statement, e.g. (whitespace damaged)
if (!is_smm(vcpu) && vcpu->arch.cr0_pinned &&
((cr0 ^ vcpu->arch.cr0_pinned) & KVM_CR0_PIN_ALLOWED))
return 1;
> + && ((cr0 ^ vcpu->arch.cr0_pinned) & KVM_CR0_PIN_ALLOWED))
If this ends up only allowing "pin-to-one" semantics, then the fields
should be named vcpu->arch.cr0_pinned_1 or so. It took a lot of staring
to figure out what this code was doing. The logic can also be optimized,
e.g.:
if (!is_smm(vcpu) && !is_guest_mode(vcpu) &&
(~cr0 & vcpu->arch.cr0_pinned_1))
return 1;
If the feature is defined to pin bits to their current value, as opposed to
pinning bits to '1', then this code becomes:
if (!is_smm(vcpu) && !is_guest_mode(vcpu) &&
((cr0 ^ old_cr0) & vcpu->arch.cr0_pinned.mask))
return 1;
or if you wanted to be more paranoid
if (!is_smm(vcpu) && !is_guest_mode(vcpu) &&
((cr0 ^ vcpu->arch.cr0_pinned.val) & vcpu->arch.cr0_pinned.mask))
return 1;
or
if (!is_smm(vcpu) && !is_guest_mode(vcpu) &&
((cr0 & vcpu->arch.cr0_pinned.mask) != vcpu->arch.cr0_pinned.val))
return 1;
> + && ((cr0 ^ vcpu->arch.cr0_pinned) & KVM_CR0_PIN_ALLOWED))
> + return 1;
> +
> if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) {
> #ifdef CONFIG_X86_64
> if ((vcpu->arch.efer & EFER_LME)) {
> @@ -932,6 +940,11 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> if (kvm_valid_cr4(vcpu, cr4))
> return 1;
>
> + if (!is_smm(vcpu)
> + && vcpu->arch.cr4_pinned
> + && ((cr4 ^ vcpu->arch.cr4_pinned) & KVM_CR4_PIN_ALLOWED))
> + return 1;
> +
> if (is_long_mode(vcpu)) {
> if (!(cr4 & X86_CR4_PAE))
> return 1;
> @@ -1255,6 +1268,10 @@ static const u32 emulated_msrs_all[] = {
>
> MSR_K7_HWCR,
> MSR_KVM_POLL_CONTROL,
> + MSR_KVM_CR0_PIN_ALLOWED,
> + MSR_KVM_CR4_PIN_ALLOWED,
> + MSR_KVM_CR0_PINNED,
> + MSR_KVM_CR4_PINNED,
> };
>
> static u32 emulated_msrs[ARRAY_SIZE(emulated_msrs_all)];
> @@ -2878,6 +2895,28 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> vcpu->arch.msr_kvm_poll_control = data;
> break;
>
> + case MSR_KVM_CR0_PIN_ALLOWED:
> + case MSR_KVM_CR4_PIN_ALLOWED:
> + if (report_ignored_msrs)
> + vcpu_debug_ratelimited(vcpu, "unhandled wrmsr: 0x%x data 0x%llx\n",
> + msr, data);
report_ignored_msrs is only for MSRs that KVM doesn't know how to handle.
The *_PIN_ALLOWED MSRs should simply do "return 1", even for host initiated
writes. I.e. inject #GP on attempted WRMSR in the guest, return -EINVAL if
the userspace VMM tries to redefine what's allowed.
> + break;
> + case MSR_KVM_CR0_PINNED:
> + if (data & ~KVM_CR0_PIN_ALLOWED)
> + return 1;
> + if (msr_info->host_initiated)
> + vcpu->arch.cr0_pinned = data;
> + else
> + vcpu->arch.cr0_pinned |= data;
Hmm, I understand what you're doing, but having fundamentally different
behavior for host vs. guest is probably a bad idea. I don't think it's
too onerous to require a RMW operation from the guest, e.g.
if (data & ~KVM_CR0_PIN_ALLOWED)
return 1;
if (!msr_info->host_initiated &&
(~data & vcpu->arch.cr0_pinned.mask))
return 1;
vcpu->arch.cr0_pinned.mask = data;
vcpu->arch.cr0_pinned.val = kvm_read_cr0(vcpu) & data;
break;
> + break;
> + case MSR_KVM_CR4_PINNED:
> + if (data & ~KVM_CR4_PIN_ALLOWED)
> + return 1;
> + if (msr_info->host_initiated)
> + vcpu->arch.cr4_pinned = data;
> + else
> + vcpu->arch.cr4_pinned |= data;
> + break;
> case MSR_IA32_MCG_CTL:
> case MSR_IA32_MCG_STATUS:
> case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
> @@ -3124,6 +3163,18 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> case MSR_KVM_POLL_CONTROL:
> msr_info->data = vcpu->arch.msr_kvm_poll_control;
> break;
> + case MSR_KVM_CR0_PIN_ALLOWED:
> + msr_info->data = KVM_CR0_PIN_ALLOWED;
> + break;
> + case MSR_KVM_CR4_PIN_ALLOWED:
> + msr_info->data = KVM_CR4_PIN_ALLOWED;
> + break;
> + case MSR_KVM_CR0_PINNED:
> + msr_info->data = vcpu->arch.cr0_pinned;
> + break;
> + case MSR_KVM_CR4_PINNED:
> + msr_info->data = vcpu->arch.cr4_pinned;
> + break;
> case MSR_IA32_P5_MC_ADDR:
> case MSR_IA32_P5_MC_TYPE:
> case MSR_IA32_MCG_CAP:
> @@ -6316,10 +6367,84 @@ static void emulator_set_hflags(struct x86_emulate_ctxt *ctxt, unsigned emul_fla
> emul_to_vcpu(ctxt)->arch.hflags = emul_flags;
> }
>
> +static inline u64 restore_pinned(u64 val, u64 subset, u64 pinned)
> +{
> + u64 pinned_high = pinned & subset;
> + u64 pinned_low = ~pinned & subset;
> +
> + val |= pinned_high;
> + val &= ~pinned_low;
> +
> + return val;
> +}
> +
> +static void kvm_pre_leave_smm_32_restore_crX_pinned(struct kvm_vcpu *vcpu,
> + const char *smstate,
> + u16 offset,
> + unsigned long allowed,
> + unsigned long cr_pinned)
> +{
> + u32 cr;
> +
> + cr = GET_SMSTATE(u32, smstate, offset);
> + cr = (u32)restore_pinned(cr, allowed, cr_pinned);
> + put_smstate(u32, smstate, offset, cr);
> +}
> +
> +static void kvm_pre_leave_smm_32_restore_cr_pinned(struct kvm_vcpu *vcpu,
> + const char *smstate)
> +{
> + if (vcpu->arch.cr0_pinned)
> + kvm_pre_leave_smm_32_restore_crX_pinned(vcpu, smstate, 0x7ffc,
> + KVM_CR0_PIN_ALLOWED,
> + vcpu->arch.cr0_pinned);
> +
> + if (vcpu->arch.cr4_pinned)
> + kvm_pre_leave_smm_32_restore_crX_pinned(vcpu, smstate, 0x7f14,
> + KVM_CR4_PIN_ALLOWED,
> + vcpu->arch.cr4_pinned);
> +}
> +
> +static void kvm_pre_leave_smm_64_restore_crX_pinned(struct kvm_vcpu *vcpu,
> + const char *smstate,
> + u16 offset,
> + unsigned long allowed,
> + unsigned long cr_pinned)
> +{
> + u32 cr;
> +
> + cr = GET_SMSTATE(u64, smstate, offset);
> + cr = restore_pinned(cr, allowed, cr_pinned);
> + put_smstate(u64, smstate, offset, cr);
> +}
> +
> +static void kvm_pre_leave_smm_64_restore_cr_pinned(struct kvm_vcpu *vcpu,
> + const char *smstate)
> +{
> + if (vcpu->arch.cr0_pinned)
> + kvm_pre_leave_smm_64_restore_crX_pinned(vcpu, smstate, 0x7f58,
> + KVM_CR0_PIN_ALLOWED,
> + vcpu->arch.cr0_pinned);
> +
> + if (vcpu->arch.cr4_pinned)
> + kvm_pre_leave_smm_64_restore_crX_pinned(vcpu, smstate, 0x7f48,
> + KVM_CR4_PIN_ALLOWED,
> + vcpu->arch.cr4_pinned);
Oof, that's a fair bit of ugly just to prevent SMM from poking into SMRAM
to attack the kernel.
If we do want to enforce pinning on RSM, I think it'd be better to inject
shutdown (well, return X86EMUL_UNHANDLEABLE) if the SMM handler changed the
value of a pinned CR bit. The SDM blurb on RSM states:
If the processor detects invalid state information during state
restoration, it enters the shutdown state.
IMO it's fair to say that attempting to unpin CR bits qualifies as invalid
state.
The logic is also a lot cleaner, e.g. (plus updating prototypes)
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index dd19fb3539e0..b1252de853c1 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2674,7 +2674,8 @@ static int em_rsm(struct x86_emulate_ctxt *ctxt)
return X86EMUL_UNHANDLEABLE;
}
- ctxt->ops->post_leave_smm(ctxt);
+ if (ctxt->ops->post_leave_smm(ctxt))
+ return X86EMUL_UNHANDLEABLE;
return X86EMUL_CONTINUE;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index aa2a085f115c..7c33c22b18fe 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6263,8 +6263,14 @@ static int emulator_pre_leave_smm(struct x86_emulate_ctxt *ctxt,
return kvm_x86_ops->pre_leave_smm(emul_to_vcpu(ctxt), smstate);
}
-static void emulator_post_leave_smm(struct x86_emulate_ctxt *ctxt)
+static int emulator_post_leave_smm(struct x86_emulate_ctxt *ctxt)
{
+ struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+ unsigned long cr0 = kvm_read_cr0(vcpu);
+
+ if ((cr0 ^ vcpu->arch.cr0_pinned.val) & vcpu->arch.cr0_pinned.mask)
+ return 1;
+
kvm_smm_changed(emul_to_vcpu(ctxt));
}
> +}
> +
> static int emulator_pre_leave_smm(struct x86_emulate_ctxt *ctxt,
> const char *smstate)
> {
> - return kvm_x86_ops->pre_leave_smm(emul_to_vcpu(ctxt), smstate);
> + struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
> +
> +#ifdef CONFIG_X86_64
> + if (guest_cpuid_has(vcpu, X86_FEATURE_LM))
> + kvm_pre_leave_smm_64_restore_cr_pinned(vcpu, smstate);
> + else
> +#endif
> + kvm_pre_leave_smm_32_restore_cr_pinned(vcpu, smstate);
> +
> + return kvm_x86_ops->pre_leave_smm(vcpu, smstate);
> }
>
> static void emulator_post_leave_smm(struct x86_emulate_ctxt *ctxt)
> @@ -9490,6 +9615,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>
> vcpu->arch.ia32_xss = 0;
>
> + vcpu->arch.cr0_pinned = 0;
> + vcpu->arch.cr4_pinned = 0;
> +
> kvm_x86_ops->vcpu_reset(vcpu, init_event);
> }
>
> --
> 2.21.0
>