While tying to add support for the MSR_CORE_THREAD_COUNT MSR in KVM,
I realized that we were still in a world where user space has no control
over what happens with MSR emulation in KVM.
That is bad for multiple reasons. In my case, I wanted to emulate the
MSR in user space, because it's a CPU specific register that does not
exist on older CPUs and that really only contains informational data that
is on the package level, so it's a natural fit for user space to provide
it.
However, it is also bad on a platform compatibility level. Currrently,
KVM has no way to expose different MSRs based on the selected target CPU
type.
This patch set introduces a way for user space to indicate to KVM which
MSRs should be handled in kernel space. With that, we can solve part of
the platform compatibility story. Or at least we can not handle AMD specific
MSRs on an Intel platform and vice versa.
In addition, it introduces a way for user space to get into the loop
when an MSR access would generate a #GP fault, such as when KVM finds an
MSR that is not handled by the in-kernel MSR emulation or when the guest
is trying to access reserved registers.
In combination with the allow list, the user space trapping allows us
to emulate arbitrary MSRs in user space, paving the way for target CPU
specific MSR implementations from user space.
v1 -> v2:
- s/ETRAP_TO_USER_SPACE/ENOENT/g
- deflect all #GP injection events to user space, not just unknown MSRs.
That was we can also deflect allowlist errors later
- fix emulator case
- new patch: KVM: x86: Introduce allow list for MSR emulation
- new patch: KVM: selftests: Add test for user space MSR handling
v2 -> v3:
- return r if r == X86EMUL_IO_NEEDED
- s/KVM_EXIT_RDMSR/KVM_EXIT_X86_RDMSR/g
- s/KVM_EXIT_WRMSR/KVM_EXIT_X86_WRMSR/g
- Use complete_userspace_io logic instead of reply field
- Simplify trapping code
- document flags for KVM_X86_ADD_MSR_ALLOWLIST
- generalize exit path, always unlock when returning
- s/KVM_CAP_ADD_MSR_ALLOWLIST/KVM_CAP_X86_MSR_ALLOWLIST/g
- Add KVM_X86_CLEAR_MSR_ALLOWLIST
- Add test to clear whitelist
- Adjust to reply-less API
- Fix asserts
- Actually trap on MSR_IA32_POWER_CTL writes
v3 -> v4:
- Mention exit reasons in re-enter mandatory section of API documentation
- Clear padding bytes
- Generalize get/set deflect functions
- Remove redundant pending_user_msr field
- lock allow check and clearing
- free bitmaps on clear
Alexander Graf (3):
KVM: x86: Deflect unknown MSR accesses to user space
KVM: x86: Introduce allow list for MSR emulation
KVM: selftests: Add test for user space MSR handling
Documentation/virt/kvm/api.rst | 157 ++++++++++-
arch/x86/include/asm/kvm_host.h | 13 +
arch/x86/include/uapi/asm/kvm.h | 15 +
arch/x86/kvm/emulate.c | 18 +-
arch/x86/kvm/x86.c | 259 +++++++++++++++++-
include/trace/events/kvm.h | 2 +-
include/uapi/linux/kvm.h | 15 +
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/x86_64/user_msr_test.c | 221 +++++++++++++++
9 files changed, 692 insertions(+), 9 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/user_msr_test.c
--
2.17.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
MSRs are weird. Some of them are normal control registers, such as EFER.
Some however are registers that really are model specific, not very
interesting to virtualization workloads, and not performance critical.
Others again are really just windows into package configuration.
Out of these MSRs, only the first category is necessary to implement in
kernel space. Rarely accessed MSRs, MSRs that should be fine tunes against
certain CPU models and MSRs that contain information on the package level
are much better suited for user space to process. However, over time we have
accumulated a lot of MSRs that are not the first category, but still handled
by in-kernel KVM code.
This patch adds a generic interface to handle WRMSR and RDMSR from user
space. With this, any future MSR that is part of the latter categories can
be handled in user space.
Furthermore, it allows us to replace the existing "ignore_msrs" logic with
something that applies per-VM rather than on the full system. That way you
can run productive VMs in parallel to experimental ones where you don't care
about proper MSR handling.
Signed-off-by: Alexander Graf <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
---
v1 -> v2:
- s/ETRAP_TO_USER_SPACE/ENOENT/g
- deflect all #GP injection events to user space, not just unknown MSRs.
That was we can also deflect allowlist errors later
- fix emulator case
v2 -> v3:
- return r if r == X86EMUL_IO_NEEDED
- s/KVM_EXIT_RDMSR/KVM_EXIT_X86_RDMSR/g
- s/KVM_EXIT_WRMSR/KVM_EXIT_X86_WRMSR/g
- Use complete_userspace_io logic instead of reply field
- Simplify trapping code
v3 -> v4:
- Mention exit reasons in re-enter mandatory section of API documentation
- Clear padding bytes
- Generalize get/set deflect functions
- Remove redundant pending_user_msr field
---
Documentation/virt/kvm/api.rst | 66 +++++++++++++++++++-
arch/x86/include/asm/kvm_host.h | 3 +
arch/x86/kvm/emulate.c | 18 +++++-
arch/x86/kvm/x86.c | 106 ++++++++++++++++++++++++++++++--
include/trace/events/kvm.h | 2 +-
include/uapi/linux/kvm.h | 10 +++
6 files changed, 196 insertions(+), 9 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 320788f81a05..2ca38649b3d4 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4861,8 +4861,8 @@ to the byte array.
.. note::
- For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR and
- KVM_EXIT_EPR the corresponding
+ For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR,
+ KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR the corresponding
operations are complete (and guest state is consistent) only after userspace
has re-entered the kernel with KVM_RUN. The kernel side will first finish
@@ -5155,6 +5155,35 @@ Note that KVM does not skip the faulting instruction as it does for
KVM_EXIT_MMIO, but userspace has to emulate any change to the processing state
if it decides to decode and emulate the instruction.
+::
+
+ /* KVM_EXIT_X86_RDMSR / KVM_EXIT_X86_WRMSR */
+ struct {
+ __u8 error;
+ __u8 pad[3];
+ __u32 index;
+ __u64 data;
+ } msr;
+
+Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is
+enabled, MSR accesses to registers that would invoke a #GP by KVM kernel code
+will instead trigger a KVM_EXIT_X86_RDMSR exit for reads and KVM_EXIT_X86_WRMSR
+exit for writes.
+
+For KVM_EXIT_X86_RDMSR, the "index" field tells user space which MSR the guest
+wants to read. To respond to this request with a successful read, user space
+writes the respective data into the "data" field and must continue guest
+execution to ensure the read data is transferred into guest register state.
+
+If the RDMSR request was unsuccessful, user space indicates that with a "1" in
+the "error" field. This will inject a #GP into the guest when the VCPU is
+executed again.
+
+For KVM_EXIT_X86_WRMSR, the "index" field tells user space which MSR the guest
+wants to write. Once finished processing the event, user space must continue
+vCPU execution. If the MSR write was unsuccessful, user space also sets the
+"error" field to "1".
+
::
/* Fix the size of the union. */
@@ -5844,6 +5873,28 @@ controlled by the kvm module parameter halt_poll_ns. This capability allows
the maximum halt time to specified on a per-VM basis, effectively overriding
the module parameter for the target VM.
+7.21 KVM_CAP_X86_USER_SPACE_MSR
+-------------------------------
+
+:Architectures: x86
+:Target: VM
+:Parameters: args[0] is 1 if user space MSR handling is enabled, 0 otherwise
+:Returns: 0 on success; -1 on error
+
+This capability enables trapping of #GP invoking RDMSR and WRMSR instructions
+into user space.
+
+When a guest requests to read or write an MSR, KVM may not implement all MSRs
+that are relevant to a respective system. It also does not differentiate by
+CPU type.
+
+To allow more fine grained control over MSR handling, user space may enable
+this capability. With it enabled, MSR accesses that would usually trigger
+a #GP event inside the guest by KVM will instead trigger KVM_EXIT_X86_RDMSR
+and KVM_EXIT_X86_WRMSR exit notifications which user space can then handle to
+implement model specific MSR handling and/or user notifications to inform
+a user that an MSR was not handled.
+
8. Other capabilities.
======================
@@ -6151,3 +6202,14 @@ KVM can therefore start protected VMs.
This capability governs the KVM_S390_PV_COMMAND ioctl and the
KVM_MP_STATE_LOAD MP_STATE. KVM_SET_MP_STATE can fail for protected
guests when the state change is invalid.
+
+8.24 KVM_CAP_X86_USER_SPACE_MSR
+----------------------------
+
+:Architectures: x86
+
+This capability indicates that KVM supports deflection of MSR reads and
+writes to user space. It can be enabled on a VM level. If enabled, MSR
+accesses that would usually trigger a #GP by KVM into the guest will
+instead get bounced to user space through the KVM_EXIT_X86_RDMSR and
+KVM_EXIT_X86_WRMSR exit notifications.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index be5363b21540..2f2307e71342 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1002,6 +1002,9 @@ struct kvm_arch {
bool guest_can_read_msr_platform_info;
bool exception_payload_enabled;
+ /* Deflect RDMSR and WRMSR to user space when they trigger a #GP */
+ bool user_space_msr_enabled;
+
struct kvm_pmu_event_filter *pmu_event_filter;
struct task_struct *nx_lpage_recovery_thread;
};
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index d0e2825ae617..744ab9c92b73 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3689,11 +3689,18 @@ static int em_dr_write(struct x86_emulate_ctxt *ctxt)
static int em_wrmsr(struct x86_emulate_ctxt *ctxt)
{
+ u64 msr_index = reg_read(ctxt, VCPU_REGS_RCX);
u64 msr_data;
+ int r;
msr_data = (u32)reg_read(ctxt, VCPU_REGS_RAX)
| ((u64)reg_read(ctxt, VCPU_REGS_RDX) << 32);
- if (ctxt->ops->set_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), msr_data))
+ r = ctxt->ops->set_msr(ctxt, msr_index, msr_data);
+
+ if (r == X86EMUL_IO_NEEDED)
+ return r;
+
+ if (r)
return emulate_gp(ctxt, 0);
return X86EMUL_CONTINUE;
@@ -3701,9 +3708,16 @@ static int em_wrmsr(struct x86_emulate_ctxt *ctxt)
static int em_rdmsr(struct x86_emulate_ctxt *ctxt)
{
+ u64 msr_index = reg_read(ctxt, VCPU_REGS_RCX);
u64 msr_data;
+ int r;
+
+ r = ctxt->ops->get_msr(ctxt, msr_index, &msr_data);
+
+ if (r == X86EMUL_IO_NEEDED)
+ return r;
- if (ctxt->ops->get_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), &msr_data))
+ if (r)
return emulate_gp(ctxt, 0);
*reg_write(ctxt, VCPU_REGS_RAX) = (u32)msr_data;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 88c593f83b28..e1139124350f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1549,12 +1549,75 @@ int kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)
}
EXPORT_SYMBOL_GPL(kvm_set_msr);
+static int complete_emulated_msr(struct kvm_vcpu *vcpu, bool is_read)
+{
+ if (vcpu->run->msr.error) {
+ kvm_inject_gp(vcpu, 0);
+ } else if (is_read) {
+ kvm_rax_write(vcpu, (u32)vcpu->run->msr.data);
+ kvm_rdx_write(vcpu, vcpu->run->msr.data >> 32);
+ }
+
+ return kvm_skip_emulated_instruction(vcpu);
+}
+
+static int complete_emulated_rdmsr(struct kvm_vcpu *vcpu)
+{
+ return complete_emulated_msr(vcpu, true);
+}
+
+static int complete_emulated_wrmsr(struct kvm_vcpu *vcpu)
+{
+ return complete_emulated_msr(vcpu, false);
+}
+
+static int kvm_msr_user_space(struct kvm_vcpu *vcpu, u32 index,
+ u32 exit_reason, u64 data,
+ int (*completion)(struct kvm_vcpu *vcpu))
+{
+ if (!vcpu->kvm->arch.user_space_msr_enabled)
+ return 0;
+
+ vcpu->run->exit_reason = exit_reason;
+ vcpu->run->msr.error = 0;
+ vcpu->run->msr.pad[0] = 0;
+ vcpu->run->msr.pad[1] = 0;
+ vcpu->run->msr.pad[2] = 0;
+ vcpu->run->msr.index = index;
+ vcpu->run->msr.data = data;
+ vcpu->arch.complete_userspace_io = completion;
+
+ return 1;
+}
+
+static int kvm_get_msr_user_space(struct kvm_vcpu *vcpu, u32 index)
+{
+ return kvm_msr_user_space(vcpu, index, KVM_EXIT_X86_RDMSR, 0,
+ complete_emulated_rdmsr);
+}
+
+static int kvm_set_msr_user_space(struct kvm_vcpu *vcpu, u32 index, u64 data)
+{
+ return kvm_msr_user_space(vcpu, index, KVM_EXIT_X86_WRMSR, data,
+ complete_emulated_wrmsr);
+}
+
int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu)
{
u32 ecx = kvm_rcx_read(vcpu);
u64 data;
+ int r;
- if (kvm_get_msr(vcpu, ecx, &data)) {
+ r = kvm_get_msr(vcpu, ecx, &data);
+
+ /* MSR read failed? See if we should ask user space */
+ if (r && kvm_get_msr_user_space(vcpu, ecx)) {
+ /* Bounce to user space */
+ return 0;
+ }
+
+ /* MSR read failed? Inject a #GP */
+ if (r) {
trace_kvm_msr_read_ex(ecx);
kvm_inject_gp(vcpu, 0);
return 1;
@@ -1572,8 +1635,18 @@ int kvm_emulate_wrmsr(struct kvm_vcpu *vcpu)
{
u32 ecx = kvm_rcx_read(vcpu);
u64 data = kvm_read_edx_eax(vcpu);
+ int r;
+
+ r = kvm_set_msr(vcpu, ecx, data);
+
+ /* MSR write failed? See if we should ask user space */
+ if (r && kvm_set_msr_user_space(vcpu, ecx, data)) {
+ /* Bounce to user space */
+ return 0;
+ }
- if (kvm_set_msr(vcpu, ecx, data)) {
+ /* MSR write failed? Inject a #GP */
+ if (r) {
trace_kvm_msr_write_ex(ecx, data);
kvm_inject_gp(vcpu, 0);
return 1;
@@ -3476,6 +3549,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_MSR_PLATFORM_INFO:
case KVM_CAP_EXCEPTION_PAYLOAD:
case KVM_CAP_SET_GUEST_DEBUG:
+ case KVM_CAP_X86_USER_SPACE_MSR:
r = 1;
break;
case KVM_CAP_SYNC_REGS:
@@ -4990,6 +5064,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
kvm->arch.exception_payload_enabled = cap->args[0];
r = 0;
break;
+ case KVM_CAP_X86_USER_SPACE_MSR:
+ kvm->arch.user_space_msr_enabled = cap->args[0];
+ r = 0;
+ break;
default:
r = -EINVAL;
break;
@@ -6319,13 +6397,33 @@ static void emulator_set_segment(struct x86_emulate_ctxt *ctxt, u16 selector,
static int emulator_get_msr(struct x86_emulate_ctxt *ctxt,
u32 msr_index, u64 *pdata)
{
- return kvm_get_msr(emul_to_vcpu(ctxt), msr_index, pdata);
+ struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+ int r;
+
+ r = kvm_get_msr(vcpu, msr_index, pdata);
+
+ if (r && kvm_get_msr_user_space(vcpu, msr_index)) {
+ /* Bounce to user space */
+ return X86EMUL_IO_NEEDED;
+ }
+
+ return r;
}
static int emulator_set_msr(struct x86_emulate_ctxt *ctxt,
u32 msr_index, u64 data)
{
- return kvm_set_msr(emul_to_vcpu(ctxt), msr_index, data);
+ struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+ int r;
+
+ r = kvm_set_msr(emul_to_vcpu(ctxt), msr_index, data);
+
+ if (r && kvm_set_msr_user_space(vcpu, msr_index, data)) {
+ /* Bounce to user space */
+ return X86EMUL_IO_NEEDED;
+ }
+
+ return r;
}
static u64 emulator_get_smbase(struct x86_emulate_ctxt *ctxt)
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 9417a34aad08..26cfb0fa8e7e 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -17,7 +17,7 @@
ERSN(NMI), ERSN(INTERNAL_ERROR), ERSN(OSI), ERSN(PAPR_HCALL), \
ERSN(S390_UCONTROL), ERSN(WATCHDOG), ERSN(S390_TSCH), ERSN(EPR),\
ERSN(SYSTEM_EVENT), ERSN(S390_STSI), ERSN(IOAPIC_EOI), \
- ERSN(HYPERV), ERSN(ARM_NISV)
+ ERSN(HYPERV), ERSN(ARM_NISV), ERSN(X86_RDMSR), ERSN(X86_WRMSR)
TRACE_EVENT(kvm_userspace_exit,
TP_PROTO(__u32 reason, int errno),
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4fdf30316582..13fc7de1eb50 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -248,6 +248,8 @@ struct kvm_hyperv_exit {
#define KVM_EXIT_IOAPIC_EOI 26
#define KVM_EXIT_HYPERV 27
#define KVM_EXIT_ARM_NISV 28
+#define KVM_EXIT_X86_RDMSR 29
+#define KVM_EXIT_X86_WRMSR 30
/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -412,6 +414,13 @@ struct kvm_run {
__u64 esr_iss;
__u64 fault_ipa;
} arm_nisv;
+ /* KVM_EXIT_X86_RDMSR / KVM_EXIT_X86_WRMSR */
+ struct {
+ __u8 error;
+ __u8 pad[3];
+ __u32 index;
+ __u64 data;
+ } msr;
/* Fix the size of the union. */
char padding[256];
};
@@ -1031,6 +1040,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_PPC_SECURE_GUEST 181
#define KVM_CAP_HALT_POLL 182
#define KVM_CAP_ASYNC_PF_INT 183
+#define KVM_CAP_X86_USER_SPACE_MSR 184
#ifdef KVM_CAP_IRQ_ROUTING
--
2.17.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
On Mon, Aug 3, 2020 at 2:14 PM Alexander Graf <[email protected]> wrote:
>
> While tying to add support for the MSR_CORE_THREAD_COUNT MSR in KVM,
> I realized that we were still in a world where user space has no control
> over what happens with MSR emulation in KVM.
>
> That is bad for multiple reasons. In my case, I wanted to emulate the
> MSR in user space, because it's a CPU specific register that does not
> exist on older CPUs and that really only contains informational data that
> is on the package level, so it's a natural fit for user space to provide
> it.
>
> However, it is also bad on a platform compatibility level. Currrently,
> KVM has no way to expose different MSRs based on the selected target CPU
> type.
>
> This patch set introduces a way for user space to indicate to KVM which
> MSRs should be handled in kernel space. With that, we can solve part of
> the platform compatibility story. Or at least we can not handle AMD specific
> MSRs on an Intel platform and vice versa.
>
> In addition, it introduces a way for user space to get into the loop
> when an MSR access would generate a #GP fault, such as when KVM finds an
> MSR that is not handled by the in-kernel MSR emulation or when the guest
> is trying to access reserved registers.
>
> In combination with the allow list, the user space trapping allows us
> to emulate arbitrary MSRs in user space, paving the way for target CPU
> specific MSR implementations from user space.
This is somewhat misleading. If you don't modify the MSR permission
bitmaps, as Aaron has done, you cannot emulate *arbitrary* MSRs in
userspace. You can only emulate MSRs that kvm is going to intercept.
Moreover, since the set of intercepted MSRs evolves over time, this
isn't a stable API.
> v1 -> v2:
>
> - s/ETRAP_TO_USER_SPACE/ENOENT/g
> - deflect all #GP injection events to user space, not just unknown MSRs.
> That was we can also deflect allowlist errors later
> - fix emulator case
> - new patch: KVM: x86: Introduce allow list for MSR emulation
> - new patch: KVM: selftests: Add test for user space MSR handling
>
> v2 -> v3:
>
> - return r if r == X86EMUL_IO_NEEDED
> - s/KVM_EXIT_RDMSR/KVM_EXIT_X86_RDMSR/g
> - s/KVM_EXIT_WRMSR/KVM_EXIT_X86_WRMSR/g
> - Use complete_userspace_io logic instead of reply field
> - Simplify trapping code
> - document flags for KVM_X86_ADD_MSR_ALLOWLIST
> - generalize exit path, always unlock when returning
> - s/KVM_CAP_ADD_MSR_ALLOWLIST/KVM_CAP_X86_MSR_ALLOWLIST/g
> - Add KVM_X86_CLEAR_MSR_ALLOWLIST
> - Add test to clear whitelist
> - Adjust to reply-less API
> - Fix asserts
> - Actually trap on MSR_IA32_POWER_CTL writes
>
> v3 -> v4:
>
> - Mention exit reasons in re-enter mandatory section of API documentation
> - Clear padding bytes
> - Generalize get/set deflect functions
> - Remove redundant pending_user_msr field
> - lock allow check and clearing
> - free bitmaps on clear
>
> Alexander Graf (3):
> KVM: x86: Deflect unknown MSR accesses to user space
> KVM: x86: Introduce allow list for MSR emulation
> KVM: selftests: Add test for user space MSR handling
>
> Documentation/virt/kvm/api.rst | 157 ++++++++++-
> arch/x86/include/asm/kvm_host.h | 13 +
> arch/x86/include/uapi/asm/kvm.h | 15 +
> arch/x86/kvm/emulate.c | 18 +-
> arch/x86/kvm/x86.c | 259 +++++++++++++++++-
> include/trace/events/kvm.h | 2 +-
> include/uapi/linux/kvm.h | 15 +
> tools/testing/selftests/kvm/Makefile | 1 +
> .../selftests/kvm/x86_64/user_msr_test.c | 221 +++++++++++++++
> 9 files changed, 692 insertions(+), 9 deletions(-)
> create mode 100644 tools/testing/selftests/kvm/x86_64/user_msr_test.c
>
> --
> 2.17.1
>
>
>
>
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
> Sitz: Berlin
> Ust-ID: DE 289 237 879
>
>
>
> Am 19.08.2020 um 23:27 schrieb Jim Mattson <[email protected]>:
>
>> On Mon, Aug 3, 2020 at 2:14 PM Alexander Graf <[email protected]> wrote:
>>
>> While tying to add support for the MSR_CORE_THREAD_COUNT MSR in KVM,
>> I realized that we were still in a world where user space has no control
>> over what happens with MSR emulation in KVM.
>>
>> That is bad for multiple reasons. In my case, I wanted to emulate the
>> MSR in user space, because it's a CPU specific register that does not
>> exist on older CPUs and that really only contains informational data that
>> is on the package level, so it's a natural fit for user space to provide
>> it.
>>
>> However, it is also bad on a platform compatibility level. Currrently,
>> KVM has no way to expose different MSRs based on the selected target CPU
>> type.
>>
>> This patch set introduces a way for user space to indicate to KVM which
>> MSRs should be handled in kernel space. With that, we can solve part of
>> the platform compatibility story. Or at least we can not handle AMD specific
>> MSRs on an Intel platform and vice versa.
>>
>> In addition, it introduces a way for user space to get into the loop
>> when an MSR access would generate a #GP fault, such as when KVM finds an
>> MSR that is not handled by the in-kernel MSR emulation or when the guest
>> is trying to access reserved registers.
>>
>> In combination with the allow list, the user space trapping allows us
>> to emulate arbitrary MSRs in user space, paving the way for target CPU
>> specific MSR implementations from user space.
>
> This is somewhat misleading. If you don't modify the MSR permission
> bitmaps, as Aaron has done, you cannot emulate *arbitrary* MSRs in
> userspace. You can only emulate MSRs that kvm is going to intercept.
> Moreover, since the set of intercepted MSRs evolves over time, this
> isn't a stable API.
Yeah, I wrote up a patch today to do the passthrough bitmap masking dynamically, partly based on Aaron's patches. I have not verified it yet though and will need to clean up SVM as well. I'll see how far I get with it for the rest of the week.
Special MSRs like EFER also irritate me a bit. We can't really trap on them - most code paths just know they're handled in kernel. Maybe I'll add some sanity checks as well...
Alex
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
On Wed, Aug 19, 2020 at 2:46 PM Graf (AWS), Alexander <[email protected]> wrote:
> Special MSRs like EFER also irritate me a bit. We can't really trap on them - most code paths just know they're handled in kernel. Maybe I'll add some sanity checks as well...
Why can't we intercept EFER?
On Wed, Aug 19, 2020 at 3:09 PM Jim Mattson <[email protected]> wrote:
>
> On Wed, Aug 19, 2020 at 2:46 PM Graf (AWS), Alexander <[email protected]> wrote:
>
> > Special MSRs like EFER also irritate me a bit. We can't really trap on them - most code paths just know they're handled in kernel. Maybe I'll add some sanity checks as well...
>
> Why can't we intercept EFER?
Some MSRs (IA32_GSBASE comes to mind) can't be completely handled by
userspace, even if we do intercept RDMSR and WRMSR. The EFER.LMA bit
also falls into that category, though the rest of the register isn't a
problem (and EFER.LMA is always derivable). Is that what you meant?