2023-05-06 03:20:42

by Chao Gao

Subject: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

to avoid computing the supported value at runtime every time.

No functional change intended.

Signed-off-by: Chao Gao <[email protected]>
---

A new call site of kvm_get_arch_capabilities() is added by [1]. It should be
replaced with the cached value in kvm_caps.

[1] https://lore.kernel.org/all/[email protected]/


arch/x86/kvm/x86.c | 5 +++--
arch/x86/kvm/x86.h | 1 +
2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 523c39a03c00..94aa70ec169c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1670,7 +1670,7 @@ static int kvm_get_msr_feature(struct kvm_msr_entry *msr)
 {
 	switch (msr->index) {
 	case MSR_IA32_ARCH_CAPABILITIES:
-		msr->data = kvm_get_arch_capabilities();
+		msr->data = kvm_caps.supported_arch_cap;
 		break;
 	case MSR_IA32_PERF_CAPABILITIES:
 		msr->data = kvm_caps.supported_perf_cap;
@@ -9523,6 +9523,7 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
 		kvm_caps.max_guest_tsc_khz = max;
 	}
 	kvm_caps.default_tsc_scaling_ratio = 1ULL << kvm_caps.tsc_scaling_ratio_frac_bits;
+	kvm_caps.supported_arch_cap = kvm_get_arch_capabilities();
 	kvm_init_msr_lists();
 	return 0;

@@ -11879,7 +11880,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	if (r)
 		goto free_guest_fpu;

-	vcpu->arch.arch_capabilities = kvm_get_arch_capabilities();
+	vcpu->arch.arch_capabilities = kvm_caps.supported_arch_cap;
 	vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT;
 	kvm_xen_init_vcpu(vcpu);
 	kvm_vcpu_mtrr_init(vcpu);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index c544602d07a3..d3e524bcc169 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -29,6 +29,7 @@ struct kvm_caps {
 	u64 supported_xcr0;
 	u64 supported_xss;
 	u64 supported_perf_cap;
+	u64 supported_arch_cap;
 };

 void kvm_spurious_fault(void);

base-commit: b3c98052d46948a8d65d2778c7f306ff38366aac
--
2.40.0


2023-05-18 09:42:06

by Xiaoyao Li

Subject: Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

On 5/6/2023 11:04 AM, Chao Gao wrote:
> to avoid computing the supported value at runtime every time.
>
> No functional change intended.

The value of kvm_get_arch_capabilities() can change due to

	if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)
		data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;

and l1tf_vmx_mitigation can be changed at runtime via the vmentry_l1d_flush
module param.

We would need a detailed analysis showing that the
ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit cannot change at runtime in any real case.

> Signed-off-by: Chao Gao <[email protected]>
> ---
>
> A new call site of kvm_get_arch_capabilities() is added by [1]. It should be
> replaced with the cached value in kvm_caps.
>
> [1] https://lore.kernel.org/all/[email protected]/
>
>
> arch/x86/kvm/x86.c | 5 +++--
> arch/x86/kvm/x86.h | 1 +
> 2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 523c39a03c00..94aa70ec169c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1670,7 +1670,7 @@ static int kvm_get_msr_feature(struct kvm_msr_entry *msr)
> {
> switch (msr->index) {
> case MSR_IA32_ARCH_CAPABILITIES:
> - msr->data = kvm_get_arch_capabilities();
> + msr->data = kvm_caps.supported_arch_cap;
> break;
> case MSR_IA32_PERF_CAPABILITIES:
> msr->data = kvm_caps.supported_perf_cap;
> @@ -9523,6 +9523,7 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
> kvm_caps.max_guest_tsc_khz = max;
> }
> kvm_caps.default_tsc_scaling_ratio = 1ULL << kvm_caps.tsc_scaling_ratio_frac_bits;
> + kvm_caps.supported_arch_cap = kvm_get_arch_capabilities();
> kvm_init_msr_lists();
> return 0;
>
> @@ -11879,7 +11880,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> if (r)
> goto free_guest_fpu;
>
> - vcpu->arch.arch_capabilities = kvm_get_arch_capabilities();
> + vcpu->arch.arch_capabilities = kvm_caps.supported_arch_cap;
> vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT;
> kvm_xen_init_vcpu(vcpu);
> kvm_vcpu_mtrr_init(vcpu);
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index c544602d07a3..d3e524bcc169 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -29,6 +29,7 @@ struct kvm_caps {
> u64 supported_xcr0;
> u64 supported_xss;
> u64 supported_perf_cap;
> + u64 supported_arch_cap;
> };
>
> void kvm_spurious_fault(void);
>
> base-commit: b3c98052d46948a8d65d2778c7f306ff38366aac


2023-05-18 17:45:30

by Sean Christopherson

Subject: Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

+Jim

On Thu, May 18, 2023, Xiaoyao Li wrote:
> On 5/6/2023 11:04 AM, Chao Gao wrote:
> > to avoid computing the supported value at runtime every time.
> >
> > No functional change intended.
>
> the value of kvm_get_arch_capabilities() can be changed due to
>
> if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)
> data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;
>
> and l1tf_vmx_mitigation can be runtime changed by vmentry_l1d_flush module
> param.

Nice catch!

> We need a detailed analysis that in no real case can
> ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit change runtime.

No, the fact that it _can_ be modified by a writable module param is enough to
make this patch buggy.

I do like snapshotting and then updating the value, even though there's likely no
meaningful performance benefit, as that would provide a place to document that
the "supported" value is dynamic. Though the fact that it's dynamic is arguably a bug
in its own right, e.g. if userspace isn't careful, a VM can have vCPUs with different
values for ARCH_CAPABILITIES. But fixing that is probably a fool's errand. So
I vote to snapshot the value and toggle the ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit
when l1tf_vmx_mitigation is modified.

On a somewhat related topic, what in the absolute #$#$ is going on with FB_CLEAR_DIS!?!?
I made the mistake of digging into why KVM doesn't advertise ARCH_CAP_FB_CLEAR_CTRL...

1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable Fill
buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR on every
entry+exit.

2. I'm pretty sure conditioning mmio_stale_data_clear on kvm_arch_has_assigned_device()
is a bug. AIUI, the vulnerability applies to _any_ MMIO accesses. Assigning
a device is necessary to let the device DMA into the guest, but it's not
necessary to let the guest access MMIO addresses, that's done purely via
memslots.

3. Irrespective of whether or not there is a performance benefit, toggling the
MSR on every entry+exit is completely unnecessary if KVM won't do VERW before
VM-Enter, i.e. if (!mds_user_clear && !mmio_stale_data_clear), then the
toggling can be done in vmx_prepare_switch_to_{guest,host}(). This probably
isn't worth pursuing though, as #4 below is more likely, especially since
X86_BUG_MSBDS_ONLY is limited to Atom (and MIC, lol) CPUs.

4. If the host will _never_ do VERW, i.e. #3 + !X86_BUG_MSBDS_ONLY, then
KVM just needs to context switch the MSR between guests since the value that's
loaded while running in the host is irrelevant. E.g. use a percpu cache to
track the current value.

5. MSR_IA32_MCU_OPT_CTRL is not modified by the host after a CPU is brought up,
i.e. the host's desired value is effectively static post-boot, and barring
a buggy configuration (running KVM as a guest), the boot CPU's value will be
the same as every other CPU.

6. Performance aside, KVM should not be speculating (ha!) on what the guest
will and will not do, and should instead honor whatever behavior is presented
to the guest. If the guest CPU model indicates that VERW flushes buffers,
then KVM damn well needs to let VERW flush buffers.

7. Why on earth did Intel provide a knob that affects both the host and guest,
since AFAICT the intent of the MSR is purely to suppress FB clearing for an
unsuspecting (or misconfigured?) guest!?!?!

FWIW, this trainwreck is another reason why I'm not going to look at the proposed
"Intel IA32_SPEC_CTRL Virtualization" crud until external forces dictate that I
do so. I have zero confidence that a paravirt interface defined by hardware
vendors to fiddle with mitigations will be sane, flexible, and extensible.

Anyways, can someone from Intel do some basic performance analysis to justify
doing RDMSR + WRMSRx2 on every transition? Unless someone provides numbers that
show a clear, meaningful benefit to the aggressive toggling, I'm inclined to have
KVM do #4, e.g. end up with something like:

	/* L1D Flush includes CPU buffer clear to mitigate MDS */
	if (static_branch_unlikely(&vmx_l1d_should_flush)) {
		vmx_l1d_flush(vcpu);
	} else if (static_branch_unlikely(&mds_user_clear) ||
		   static_branch_unlikely(&mmio_stale_data_clear)) {
		mds_clear_cpu_buffers();
	} else if (static_branch_unlikely(&kvm_toggle_fb_clear)) {
		bool enable_fb_clear = !!(vcpu->arch.arch_capabilities & ARCH_CAP_FB_CLEAR);

		if (this_cpu_read(kvm_fb_clear_enabled) != enable_fb_clear) {
			u64 mcu_opt_ctrl = host_mcu_opt_ctrl;

			if (enable_fb_clear)
				mcu_opt_ctrl &= ~FB_CLEAR_DIS;
			else
				mcu_opt_ctrl |= FB_CLEAR_DIS;
			native_wrmsrl(MSR_IA32_MCU_OPT_CTRL, mcu_opt_ctrl);
			this_cpu_write(kvm_fb_clear_enabled, enable_fb_clear);
		}
	}

2023-05-19 08:51:04

by Chao Gao

Subject: Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

+Pawan, could you share your thoughts on questions about FB_CLEAR?

On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
>+Jim
>
>On Thu, May 18, 2023, Xiaoyao Li wrote:
>> On 5/6/2023 11:04 AM, Chao Gao wrote:
>> > to avoid computing the supported value at runtime every time.
>> >
>> > No functional change intended.
>>
>> the value of kvm_get_arch_capabilities() can be changed due to
>>
>> if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)
>> data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;
>>
>> and l1tf_vmx_mitigation can be runtime changed by vmentry_l1d_flush module
>> param.

Thanks for pointing this out. I noticed l1tf_vmx_mitigation and analyzed
whether it could be changed at runtime; obviously my analysis was wrong.

>
>Nice catch!
>
>> We need a detailed analysis that in no real case can
>> ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit change runtime.
>
>No, the fact that it _can_ be modified by a writable module param is enough to
>make this patch buggy.
>
>I do like snapshotting and then updating the value, even though there's likely no
>meaningful performance benefit, as that would provide a place to document that
>the "supported" value is dynamic. Though the fact that it's dynamic is arguably a bug
>in its own right, e.g. if userspace isn't careful, a VM can have vCPUs with different
>values for ARCH_CAPABILITIES. But fixing that is probably a fool's errand. So

I am not sure that fixing it is foolish. There would be another problem:

KVM enables the L1D flush mitigation and creates a guest, so
ARCH_CAP_SKIP_VMENTRY_L1DFLUSH is exposed to the guest. If the L1D flush is
disabled at runtime in KVM, the guest doesn't know about the change and won't
do an L1D flush when entering L2. L2 may then use L1TF to leak secrets from L1.

>I vote to snapshot the value and toggle the ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit
>when l1tf_vmx_mitigation is modified.

Sure. Will do.

>
>On a somewhat related topic, what in the absolute #$#$ is going on with FB_CLEAR_DIS!?!?
>I made the mistake of digging into why KVM doesn't advertise ARCH_CAP_FB_CLEAR_CTRL...
>
> 1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable Fill
> buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR on every
> entry+exit.
>
> 2. I'm pretty sure conditioning mmio_stale_data_clear on kvm_arch_has_assigned_device()
> is a bug. AIUI, the vulnerability applies to _any_ MMIO accesses. Assigning
> a device is necessary to let the device DMA into the guest, but it's not
> necessary to let the guest access MMIO addresses, that's done purely via
> memslots.
>
> 3. Irrespective of whether or not there is a performance benefit, toggling the
> MSR on every entry+exit is completely unnecessary if KVM won't do VERW before
> VM-Enter, i.e. if (!mds_user_clear && !mmio_stale_data_clear), then the
> toggling can be done in vmx_prepare_switch_to_{guest,host}(). This probably
> isn't worth pursuing though, as #4 below is more likely, especially since
> X86_BUG_MSBDS_ONLY is limited to Atom (and MIC, lol) CPUs.
>
> 4. If the host will will _never_ do VERW, i.e. #3 + !X86_BUG_MSBDS_ONLY, then
> KVM just needs to context switch the MSR between guests since the value that's
> loaded while running in the host is irrelevant. E.g. use a percpu cache to
> track the current value.

Agreed.

It looks like VERW can be used at CPL3; should we restore the MSR on returning
to userspace, i.e. leverage the uret MSR mechanism?

>
> 5. MSR_IA32_MCU_OPT_CTRL is not modified by the host after a CPU is brought up,
> i.e. the host's desired value is effectively static post-boot, and barring
> a buggy configuration (running KVM as a guest), the boot CPU's value will be
> the same as every other CPU.
>
> 6. Performance aside, KVM should not be speculating (ha!) on what the guest
> will and will not do, and should instead honor whatever behavior is presented
> to the guest. If the guest CPU model indicates that VERW flushes buffers,
> then KVM damn well needs to let VERW flush buffers.
>
> 7. Why on earth did Intel provide a knob that affects both the host and guest,
> since AFAICT the intent of the MSR is purely to suppress FB clearing for an
> unsuspecting (or misconfigured?) guest!?!?!

I doubt it is purely for guests. Is there any chance a userspace application
may use VERW?

And I don't think the original patch is for misconfigured guests. IIUC, it is
about migrating a guest from a vulnerable host to an invulnerable host.

>
>FWIW, this trainwreck is another reason why I'm not going to look at the proposed
>"Intel IA32_SPEC_CTRL Virtualization" crud until external forces dictate that I
>do so. I have zero confidence that a paravirt interface defined by hardware
>vendors to fiddle with mitigations will be sane, flexible, and extensible.
>
>Anyways, can someone from Intel do some basic performance analysis to justify
>doing RDMSR + WRMSRx2 on every transition? Unless someone provides numbers that

Pawan, could you help to answer this question?

>show a clear, meaningful benefit to the aggressive toggling, I'm inclined to have
>KVM do #4, e.g. end up with something like:
>
> /* L1D Flush includes CPU buffer clear to mitigate MDS */
> if (static_branch_unlikely(&vmx_l1d_should_flush)) {
> vmx_l1d_flush(vcpu);
> } else if (static_branch_unlikely(&mds_user_clear) ||
> static_branch_unlikely(&mmio_stale_data_clear)) {
> mds_clear_cpu_buffers();
> } else if (static_branch_unlikely(&kvm_toggle_fb_clear) {
> bool enable_fb_clear = !!(vcpu->arch.arch_capabilities & ARCH_CAP_FB_CLEAR);
>
> if (this_cpu_read(kvm_fb_clear_enabled) != enable_fb_clear) {
> u64 mcu_opt_ctrl = host_mcu_opt_ctrl;
>
> if (enable_fb_clear)
> mcu_opt_ctrl &= ~FB_CLEAR_DIS;
> else
> mcu_opt_ctrl |= FB_CLEAR_DIS;
> native_wrmsrl(MSR_IA32_MCU_OPT_CTRL, mcu_opt_ctrl);
> this_cpu_write(kvm_fb_clear_enabled, enable_fb_clear);
> }
> }

2023-05-19 15:53:14

by Sean Christopherson

Subject: Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

On Fri, May 19, 2023, Chao Gao wrote:
> +Pawan, could you share your thoughts on questions about FB_CLEAR?
>
> On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
> >I do like snapshotting and then updating the value, even though there's likely no
> >meaningful performance benefit, as that would provide a place to document that
> >the "supported" value is dynamic. Though the fact that it's dynamic is arguably a bug
> >in its own right, e.g. if userspace isn't careful, a VM can have vCPUs with different
> >values for ARCH_CAPABILITIES. But fixing that is probably a fool's errand. So
>
> I am not sure if fixing it is fool. There would be some other problem:

Heh, "fool's errand" is an idiom that means doing something has almost no chance
of succeeding, not that doing something is foolish. I 100% agree that there's
value in presenting a consistent model to the guest, but there are conflicting
requirements in play. To present a consistent model, KVM essentially needs to
disallow changing the module param after VMs/vCPUs have been created, but that
would prevent userspace from toggling the param while VMs are running, e.g. in
response to a new vulnerability.

The only feasible idea I can think of is to disallow *disabling* the mitigation
while VMs/vCPUs are active. But then that prevents turning the L1D flush mitigation
back off if some other mitigation is deployed, e.g. via livepatch, policy update,
etc.

That's why I said trying to fix the knob is probably a fool's errand. AFAICT,
there's no straightforward solution that makes everybody happy. :-/

2023-05-20 01:43:35

by Pawan Gupta

Subject: Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
> +Jim
>
> On Thu, May 18, 2023, Xiaoyao Li wrote:
> > On 5/6/2023 11:04 AM, Chao Gao wrote:
> > > to avoid computing the supported value at runtime every time.
> > >
> > > No functional change intended.
> >
> > the value of kvm_get_arch_capabilities() can be changed due to
> >
> > if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)
> > data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;
> >
> > and l1tf_vmx_mitigation can be runtime changed by vmentry_l1d_flush module
> > param.
>
> Nice catch!
>
> > We need a detailed analysis that in no real case can
> > ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit change runtime.
>
> No, the fact that it _can_ be modified by a writable module param is enough to
> make this patch buggy.
>
> I do like snapshotting and then updating the value, even though there's likely no
> meaningful performance benefit, as that would provide a place to document that
> the "supported" value is dynamic. Though the fact that it's dynamic is arguably a bug
> in its own right, e.g. if userspace isn't careful, a VM can have vCPUs with different
> values for ARCH_CAPABILITIES. But fixing that is probably a fool's errand. So
> I vote to snapshot the value and toggle the ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit
> when l1tf_vmx_mitigation is modified.
>
> On a somewhat related topic, what in the absolute #$#$ is going on with FB_CLEAR_DIS!?!?
> I made the mistake of digging into why KVM doesn't advertise ARCH_CAP_FB_CLEAR_CTRL...
>
> 1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable Fill
> buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR on every
> entry+exit.

Unnecessary VERWs in the guest will have a much higher impact than the MSR
read/write at VM-entry/exit. On an Ice Lake system it is pointless for a
guest to incur the VERW penalty when the system is not affected by MDS/TAA and
guests don't need the mitigation for MMIO Stale Data. MSR writes are only
done when the guest is likely to execute unnecessary VERWs (e.g. when the
guest thinks it's running on an older generation CPU).

> 2. I'm pretty sure conditioning mmio_stale_data_clear on kvm_arch_has_assigned_device()
> is a bug. AIUI, the vulnerability applies to _any_ MMIO accesses.

The vulnerability applies to MMIO accesses to devices that don't respect "byte
enables" (which indicate the valid bytes in a transaction) and don't error on
incorrectly sized reads or writes.

> Assigning a device is necessary to let the device DMA into the
> guest, but it's not necessary to let the guest access MMIO
> addresses, that's done purely via memslots.

I will get back to you on this. The guest would typically need access to an
area that doesn't fail an incorrectly sized MMIO access.

> 3. Irrespective of whether or not there is a performance benefit, toggling the
> MSR on every entry+exit is completely unnecessary if KVM won't do VERW before
> VM-Enter, i.e. if (!mds_user_clear && !mmio_stale_data_clear), then the
> toggling can be done in vmx_prepare_switch_to_{guest,host}(). This probably
> isn't worth pursuing though, as #4 below is more likely, especially since
> X86_BUG_MSBDS_ONLY is limited to Atom (and MIC, lol) CPUs.
>
> 4. If the host will will _never_ do VERW, i.e. #3 + !X86_BUG_MSBDS_ONLY, then

Is it likely that KVM will not do VERW when affected by MMIO Stale Data?
If you mean on hardware that is not vulnerable to MDS and MMIO Stale
Data, then the MSR writes are unnecessary and will be skipped, because
FB_CLEAR_DIS is not available on unaffected hardware.

> KVM just needs to context switch the MSR between guests since the value that's
> loaded while running in the host is irrelevant. E.g. use a percpu cache to

I will be happy to avoid the MSR read/write, but it's worth considering
that this MSR may gain more bits that the host wants to toggle, in which
case the percpu cache implementation would likely have to change.

> track the current value.
>
> 5. MSR_IA32_MCU_OPT_CTRL is not modified by the host after a CPU is brought up,
> i.e. the host's desired value is effectively static post-boot, and barring
> a buggy configuration (running KVM as a guest), the boot CPU's value will be
> the same as every other CPU.

Would the MSR value be the same on every CPU if only some guests have
enumerated FB_CLEAR and others haven't? MSR writes (to disable FB_CLEAR)
are not done when a guest enumerates FB_CLEAR, and enumeration of FB_CLEAR
in the guest depends on its configuration.

> 6. Performance aside, KVM should not be speculating (ha!) on what the guest
> will and will not do, and should instead honor whatever behavior is presented
> to the guest. If the guest CPU model indicates that VERW flushes buffers,
> then KVM damn well needs to let VERW flush buffers.

The current implementation allows guests to have VERW flush buffers when
they enumerate FB_CLEAR. It only restricts the flush behavior when the
guest is trying to mitigate a vulnerability (like MDS) on hardware that
is not affected. I guess it's common for guests to be running with an
older generation CPU configuration on newer hardware.

> 7. Why on earth did Intel provide a knob that affects both the host and guest,
> since AFAICT the intent of the MSR is purely to suppress FB clearing for an
> unsuspecting (or misconfigured?) guest!?!?!

That's true.

2023-05-22 14:28:56

by Dave Hansen

Subject: Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

On 5/18/23 10:33, Sean Christopherson wrote:
>
> 2. I'm pretty sure conditioning mmio_stale_data_clear on kvm_arch_has_assigned_device()
> is a bug. AIUI, the vulnerability applies to _any_ MMIO accesses. Assigning
> a device is necessary to let the device DMA into the guest, but it's not
> necessary to let the guest access MMIO addresses, that's done purely via
> memslots.

Just to make sure we're all on the same page: KVM needs mitigations when
real, hardware MMIO is exposed to the guest. None of this has anything
to do with virtio or what guests _normally_ see as devices or MMIO. Right?

But direct device assignment does that "real hardware MMIO" for sure
because it's mapping parts of the PCI address space (which is all MMIO)
into the guest. That's what the kvm_arch_has_assigned_device() check
was going for.

But I think you're also saying that, in the end, memory gets exposed to
the guest by KVM userspace setting up a memslot. KVM userspace _could_
have mapped a piece of MMIO and could just pass that down to a guest
without kvm_arch_has_assigned_device() being involved. That makes the
kvm_arch_has_assigned_device() check useless.

In other words, all guests with kvm_arch_has_assigned_device() need
mitigation. But there are potentially situations where the guest can
see real hardware MMIO and yet also be !kvm_arch_has_assigned_device().

2023-05-22 17:01:01

by Sean Christopherson

Subject: Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

On Mon, May 22, 2023, Dave Hansen wrote:
> On 5/18/23 10:33, Sean Christopherson wrote:
> >
> > 2. I'm pretty sure conditioning mmio_stale_data_clear on kvm_arch_has_assigned_device()
> > is a bug. AIUI, the vulnerability applies to _any_ MMIO accesses. Assigning
> > a device is necessary to let the device DMA into the guest, but it's not
> > necessary to let the guest access MMIO addresses, that's done purely via
> > memslots.
>
> Just to make sure we're all on the same page: KVM needs mitigations when
> real, hardware MMIO is exposed to the guest. None of this has anything
> to do with virtio or what guests _normally_ see as devices or MMIO. Right?

Yes. I try to always call MMIO that is handled by a synthetic/virtual/emulated
device "emulated MMIO", specifically to differentiate between the two cases.

> But direct device assignment does that "real hardware MMIO" for sure
> because it's mapping parts of the PCI address space (which is all MMIO)
> into the guest. That's what the kvm_arch_has_assigned_device() check
> was going for.
>
> But I think you're also saying that, in the end, memory gets exposed to
> the guest by KVM userspace setting up a memslot. KVM userspace _could_
> have mapped a piece of MMIO and could just pass that down to a guest
> without kvm_arch_has_assigned_device() being involved. That makes the
> kvm_arch_has_assigned_device() useless.

Yep.

> In other words, all guests with kvm_arch_has_assigned_device() need
> mitigation.

Yes, assuming the guest wants to actually use the device :-)

> But there are potentially situations where the guest can see real hardware MMIO
> and yet also be !kvm_arch_has_assigned_device().

Yes. There may or may not be _legitimate_ scenarios for exposing host MMIO to the
guest without an assigned device, but as far as the mitigation is concerned, being
legitimate or not doesn't matter, all that matters is that userspace can expose
host MMIO to the guest irrespective of VFIO.

FWIW, I think this would be a minimal fix without having to apply the mitigation
blindly. My only concern is that there might be gaps in the kvm_is_mmio_pfn()
heuristic, but if that's the case then KVM likely has other issues, e.g. would
potentially map MMIO with the wrong memtype.

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2865c3cb3501..ac3c535ae3b9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1274,6 +1274,7 @@ struct kvm_arch {

 	bool apic_access_memslot_enabled;
 	bool apic_access_memslot_inhibited;
+	bool vm_has_passthrough_mmio;

 	/* Protects apicv_inhibit_reasons */
 	struct rw_semaphore apicv_update_lock;
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index cf2c6426a6fc..83d235488e56 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -189,6 +189,10 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	if (level > PG_LEVEL_4K)
 		spte |= PT_PAGE_SIZE_MASK;

+	if (static_branch_unlikely(&mmio_stale_data_clear) &&
+	    !vcpu->kvm->arch.vm_has_passthrough_mmio && kvm_is_mmio_pfn(pfn))
+		vcpu->kvm->arch.vm_has_passthrough_mmio = true;
+
 	if (shadow_memtype_mask)
 		spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
							 kvm_is_mmio_pfn(pfn));
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 44fb619803b8..9c66ba35af92 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7159,7 +7159,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 	else if (static_branch_unlikely(&mds_user_clear))
 		mds_clear_cpu_buffers();
 	else if (static_branch_unlikely(&mmio_stale_data_clear) &&
-		 kvm_arch_has_assigned_device(vcpu->kvm))
+		 to_kvm_vmx(vcpu->kvm)->vm_has_passthrough_mmio)
 		mds_clear_cpu_buffers();

 	vmx_disable_fb_clear(vmx);

2023-05-22 17:47:42

by Sean Christopherson

Subject: Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

On Fri, May 19, 2023, Pawan Gupta wrote:
> On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
> > I made the mistake of digging into why KVM doesn't advertise ARCH_CAP_FB_CLEAR_CTRL...
> >
> > 1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable Fill
> > buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR on every
> > entry+exit.
>
> Unnecessary VERWs in guest will have much higher impact than due to MSR
> read/write at vmentry/exit.

Can you provide numbers for something closeish to a real world workload?

> On an Icelake system it is pointless for a guest to incur VERW penalty when
> the system is not affected by MDS/TAA and guests don't need mitigation for
> MMIO Stale Data. MSR writes are only done when the guest is likely to execute
> unnecessary VERWs(e.g. when the guest thinks its running on an older gen
> CPU).
>
> > KVM just needs to context switch the MSR between guests since the value that's
> > loaded while running in the host is irrelevant. E.g. use a percpu cache to
>
> I will be happy to avoid the MSR read/write, but its worth considering
> that this MSR can receive more bits that host may want to toggle, then
> percpu cache implementation would likely change.

Change in and of itself isn't problematic, so long as whatever code we write won't
fall over if/when new bits are added, i.e. doesn't clobber unknown bits.

> > 5. MSR_IA32_MCU_OPT_CTRL is not modified by the host after a CPU is brought up,
> > i.e. the host's desired value is effectively static post-boot, and barring
> > a buggy configuration (running KVM as a guest), the boot CPU's value will be
> > the same as every other CPU.
>
> Would the MSR value be same on every CPU, if only some guests have
> enumerated FB_CLEAR and others haven't?

Ignore the guest, I'm talking purely about the host. Specifically, there's no
reason to do a RDMSR to get the host value on every VM-Enter since the host's
value is effectively static post-boot.

> MSR writes (to disable FB_CLEAR) are not done when a guest enumerates
> FB_CLEAR. Enumeration of FB_CLEAR in guest will depend on its configuration.
>
> > 6. Performance aside, KVM should not be speculating (ha!) on what the guest
> > will and will not do, and should instead honor whatever behavior is presented
> > to the guest. If the guest CPU model indicates that VERW flushes buffers,
> > then KVM damn well needs to let VERW flush buffers.
>
> The current implementation allows guests to have VERW flush buffers when
> they enumerate FB_CLEAR. It only restricts the flush behavior when the
> guest is trying to mitigate against a vulnerability(like MDS) on a
> hardware that is not affected. I guess its common for guests to be
> running with older gen configuration on a newer hardware.

Right, I'm saying that that behavior is wrong. KVM shouldn't assume the guest
will do things a certain way and should instead honor the "architectural"
definition, in quotes because I realize there probably is no architectural
definition for any of this.

It might be that the code does (unintentionally?) honor the "architecture", i.e.
this code might actually be accurate with respect to when the guest can expect
VERW to flush buffers. But the comment is so, so wrong.

	/*
	 * If guest will not execute VERW, there is no need to set FB_CLEAR_DIS
	 * at VMEntry. Skip the MSR read/write when a guest has no use case to
	 * execute VERW.
	 */
	if ((vcpu->arch.arch_capabilities & ARCH_CAP_FB_CLEAR) ||
	    ((vcpu->arch.arch_capabilities & ARCH_CAP_MDS_NO) &&
	     (vcpu->arch.arch_capabilities & ARCH_CAP_TAA_NO) &&
	     (vcpu->arch.arch_capabilities & ARCH_CAP_PSDP_NO) &&
	     (vcpu->arch.arch_capabilities & ARCH_CAP_FBSDP_NO) &&
	     (vcpu->arch.arch_capabilities & ARCH_CAP_SBDR_SSDP_NO)))
		vmx->disable_fb_clear = false;

2023-05-22 20:17:24

by Xiaoyao Li

Subject: Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

On 5/23/2023 1:43 AM, Sean Christopherson wrote:
>>> 6. Performance aside, KVM should not be speculating (ha!) on what the guest
>>> will and will not do, and should instead honor whatever behavior is presented
>>> to the guest. If the guest CPU model indicates that VERW flushes buffers,
>>> then KVM damn well needs to let VERW flush buffers.
>> The current implementation allows guests to have VERW flush buffers when
>> they enumerate FB_CLEAR. It only restricts the flush behavior when the
>> guest is trying to mitigate against a vulnerability(like MDS) on a
>> hardware that is not affected. I guess its common for guests to be
>> running with older gen configuration on a newer hardware.
> Right, I'm saying that that behavior is wrong. KVM shouldn't assume the guest
> will do things a certain way and should instead honor the "architectural"
> definition, in quotes because I realize there probably is no architectural
> definition for any of this.
>
> It might be that the code does (unintentionally?) honor the "architecture", i.e.
> this code might actually be accurate with respect to when the guest can expect
> VERW to flush buffers. But the comment is so, so wrong.

The comment is wrong and the code is wrong in some cases as well.

If none of ARCH_CAP_FB_CLEAR, ARCH_CAP_MDS_NO, ARCH_CAP_TAA_NO,
ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO and ARCH_CAP_SBDR_SSDP_NO are
exposed to the VM, the VM is of the "affected by MDS" type.

And according to the page
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html

if the VM enumerates support for both L1D_FLUSH and MD_CLEAR, it
implicitly enumerates FB_CLEAR as part of its MD_CLEAR support.

However, the code will leave vmx->disable_fb_clear as 1 if the hardware
supports it, and the VERW instruction doesn't clear the FB in the VM,
which conflicts with the "architectural" definition.

> /*
> * If guest will not execute VERW, there is no need to set FB_CLEAR_DIS
> * at VMEntry. Skip the MSR read/write when a guest has no use case to
> * execute VERW.
> */
> if ((vcpu->arch.arch_capabilities & ARCH_CAP_FB_CLEAR) ||
> ((vcpu->arch.arch_capabilities & ARCH_CAP_MDS_NO) &&
> (vcpu->arch.arch_capabilities & ARCH_CAP_TAA_NO) &&
> (vcpu->arch.arch_capabilities & ARCH_CAP_PSDP_NO) &&
> (vcpu->arch.arch_capabilities & ARCH_CAP_FBSDP_NO) &&
> (vcpu->arch.arch_capabilities & ARCH_CAP_SBDR_SSDP_NO)))
> vmx->disable_fb_clear = false;


2023-05-22 21:00:51

by Pawan Gupta

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

On Mon, May 22, 2023 at 10:43:49AM -0700, Sean Christopherson wrote:
> On Fri, May 19, 2023, Pawan Gupta wrote:
> > On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
> > > I made the mistake of digging into why KVM doesn't advertise ARCH_CAP_FB_CLEAR_CTRL...
> > >
> > > 1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable Fill
> > > buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR on every
> > > entry+exit.
> >
> > Unnecessary VERWs in guest will have much higher impact than due to MSR
> > read/write at vmentry/exit.
>
> Can you provide numbers for something closeish to a real world workload?

I am collecting the numbers, will update here soon.

> > On an Icelake system it is pointless for a guest to incur VERW penalty when
> > the system is not affected by MDS/TAA and guests don't need mitigation for
> > MMIO Stale Data. MSR writes are only done when the guest is likely to execute
> > unnecessary VERWs(e.g. when the guest thinks its running on an older gen
> > CPU).
> >
> > > KVM just needs to context switch the MSR between guests since the value that's
> > > loaded while running in the host is irrelevant. E.g. use a percpu cache to
> >
> > I will be happy to avoid the MSR read/write, but its worth considering
> > that this MSR can receive more bits that host may want to toggle, then
> > percpu cache implementation would likely change.
>
> Change in and of itself isn't problematic, so long as whatever code we write won't
> fall over if/when new bits are added, i.e. doesn't clobber unknown bits.

Ok.

> > > 5. MSR_IA32_MCU_OPT_CTRL is not modified by the host after a CPU is brought up,
> > > i.e. the host's desired value is effectively static post-boot, and barring
> > > a buggy configuration (running KVM as a guest), the boot CPU's value will be
> > > the same as every other CPU.
> >
> > Would the MSR value be same on every CPU, if only some guests have
> > enumerated FB_CLEAR and others haven't?
>
> Ignore the guest, I'm talking purely about the host. Specifically, there's no
> reason to do a RDMSR to get the host value on every VM-Enter since the host's
> value is effectively static post-boot.

That's right (ignoring a late microcode load adding stuff to the MSR, or
msr-tools fiddling).

> > MSR writes (to disable FB_CLEAR) are not done when a guest enumerates
> > FB_CLEAR. Enumeration of FB_CLEAR in guest will depend on its configuration.
> >
> > > 6. Performance aside, KVM should not be speculating (ha!) on what the guest
> > > will and will not do, and should instead honor whatever behavior is presented
> > > to the guest. If the guest CPU model indicates that VERW flushes buffers,
> > > then KVM damn well needs to let VERW flush buffers.
> >
> > The current implementation allows guests to have VERW flush buffers when
> > they enumerate FB_CLEAR. It only restricts the flush behavior when the
> > guest is trying to mitigate against a vulnerability(like MDS) on a
> > hardware that is not affected. I guess its common for guests to be
> > running with older gen configuration on a newer hardware.
>
> Right, I'm saying that that behavior is wrong. KVM shouldn't assume the guest
> will do things a certain way and should instead honor the "architectural"
> definition, in quotes because I realize there probably is no architectural
> definition for any of this.

Before MMIO Stale Data, processors that were not affected by MDS/TAA did
not clear CPU buffers, even if they enumerated MD_CLEAR. On such
processors, guests that deployed VERW (thinking they were vulnerable to
MDS) did not clear the CPU buffers. After MMIO Stale Data was
discovered, FB_CLEAR_DIS was introduced to restore this behavior.

> It might be that the code does (unintentionally?) honor the "architecture", i.e.
> this code might actually be accurate with respect to when the guest can expect
> VERW to flush buffers. But the comment is so, so wrong.

Agree, the comment needs to explain this well.

> /*
> * If guest will not execute VERW, there is no need to set FB_CLEAR_DIS
> * at VMEntry. Skip the MSR read/write when a guest has no use case to
> * execute VERW.
> */
> if ((vcpu->arch.arch_capabilities & ARCH_CAP_FB_CLEAR) ||
> ((vcpu->arch.arch_capabilities & ARCH_CAP_MDS_NO) &&
> (vcpu->arch.arch_capabilities & ARCH_CAP_TAA_NO) &&
> (vcpu->arch.arch_capabilities & ARCH_CAP_PSDP_NO) &&
> (vcpu->arch.arch_capabilities & ARCH_CAP_FBSDP_NO) &&
> (vcpu->arch.arch_capabilities & ARCH_CAP_SBDR_SSDP_NO)))
> vmx->disable_fb_clear = false;

2023-05-22 21:48:41

by Pawan Gupta

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

On Tue, May 23, 2023 at 03:31:44AM +0800, Xiaoyao Li wrote:
> On 5/23/2023 1:43 AM, Sean Christopherson wrote:
> > > > 6. Performance aside, KVM should not be speculating (ha!) on what the guest
> > > > will and will not do, and should instead honor whatever behavior is presented
> > > > to the guest. If the guest CPU model indicates that VERW flushes buffers,
> > > > then KVM damn well needs to let VERW flush buffers.
> > > The current implementation allows guests to have VERW flush buffers when
> > > they enumerate FB_CLEAR. It only restricts the flush behavior when the
> > > guest is trying to mitigate against a vulnerability(like MDS) on a
> > > hardware that is not affected. I guess its common for guests to be
> > > running with older gen configuration on a newer hardware.
> > Right, I'm saying that that behavior is wrong. KVM shouldn't assume the guest
> > the guest will do things a certain way and should instead honor the "architectural"
> > definition, in quotes because I realize there probably is no architectural
> > definition for any of this.
> >
> > It might be that the code does (unintentionally?) honor the "architecture", i.e.
> > this code might actually be accurrate with respect to when the guest can expect
> > VERW to flush buffers. But the comment is so, so wrong.
>
> The comment is wrong and the code is wrong in some case as well.
>
> If none of ARCH_CAP_FB_CLEAR, ARCH_CAP_MDS_NO, ARCH_CAP_TAA_NO,
> ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO and ARCH_CAP_SBDR_SSDP_NO are exposed to
> VM, the VM is type of "affected by MDS".
>
> And according to the page https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
>
> if the VM enumerates support for both L1D_FLUSH and MD_CLEAR, it implicitly
> enumerates FB_CLEAR as part of their MD_CLEAR support.

This is the excerpt from the link that you mentioned:

"For processors that are affected by MDS and support L1D_FLUSH
operations and MD_CLEAR operations, the VERW instruction flushes fill
buffers."

You are missing an important piece of information here: "For the
processors _affected_ by MDS". On such processors ...

> However, the code will leave vmx->disable_fb_clear as 1 if hardware supports
> it, and VERW instruction doesn't clear FB in the VM, which conflicts
> "architectural" definition.

... Fill buffer clear is not enabled at all:

vmx_setup_fb_clear_ctrl()
{
u64 msr;

if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES) &&
!boot_cpu_has_bug(X86_BUG_MDS) &&
!boot_cpu_has_bug(X86_BUG_TAA)) {
rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr);
if (msr & ARCH_CAP_FB_CLEAR_CTRL)
vmx_fb_clear_ctrl_available = true;
}
}

2023-05-23 01:35:10

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

On 5/23/2023 5:23 AM, Pawan Gupta wrote:
> On Tue, May 23, 2023 at 03:31:44AM +0800, Xiaoyao Li wrote:
>> On 5/23/2023 1:43 AM, Sean Christopherson wrote:
>>>>> 6. Performance aside, KVM should not be speculating (ha!) on what the guest
>>>>> will and will not do, and should instead honor whatever behavior is presented
>>>>> to the guest. If the guest CPU model indicates that VERW flushes buffers,
>>>>> then KVM damn well needs to let VERW flush buffers.
>>>> The current implementation allows guests to have VERW flush buffers when
>>>> they enumerate FB_CLEAR. It only restricts the flush behavior when the
>>>> guest is trying to mitigate against a vulnerability(like MDS) on a
>>>> hardware that is not affected. I guess its common for guests to be
>>>> running with older gen configuration on a newer hardware.
>>> Right, I'm saying that that behavior is wrong. KVM shouldn't assume the guest
>>> the guest will do things a certain way and should instead honor the "architectural"
>>> definition, in quotes because I realize there probably is no architectural
>>> definition for any of this.
>>>
>>> It might be that the code does (unintentionally?) honor the "architecture", i.e.
>>> this code might actually be accurrate with respect to when the guest can expect
>>> VERW to flush buffers. But the comment is so, so wrong.
>>
>> The comment is wrong and the code is wrong in some case as well.
>>
>> If none of ARCH_CAP_FB_CLEAR, ARCH_CAP_MDS_NO, ARCH_CAP_TAA_NO,
>> ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO and ARCH_CAP_SBDR_SSDP_NO are exposed to
>> VM, the VM is type of "affected by MDS".
>>
>> And accroding to the page https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
>>
>> if the VM enumerates support for both L1D_FLUSH and MD_CLEAR, it implicitly
>> enumerates FB_CLEAR as part of their MD_CLEAR support.
>
> This is the excerpt from the link that you mentioned:
>
> "For processors that are affected by MDS and support L1D_FLUSH
> operations and MD_CLEAR operations, the VERW instruction flushes fill
> buffers."
>
> You are missing an important information here "For the processors
> _affected_ by MDS". On such processors ...
>
>> However, the code will leave vmx->disable_fb_clear as 1 if hardware supports
>> it, and VERW intruction doesn't clear FB in the VM, which conflicts
>> "architectural" definition.
>
> ... Fill buffer clear is not enabled at all:
>
> vmx_setup_fb_clear_ctrl()
> {
> u64 msr;
>
> if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES) &&
> !boot_cpu_has_bug(X86_BUG_MDS) &&
> !boot_cpu_has_bug(X86_BUG_TAA)) {
> rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr);
> if (msr & ARCH_CAP_FB_CLEAR_CTRL)
> vmx_fb_clear_ctrl_available = true;
> }
> }

This is the check for bare metal, while the check in
vmx_update_fb_clear_dis() is for the guest VM.

For example, if the hardware (host) enumerates ARCH_CAP_TAA_NO,
ARCH_CAP_MDS_NO, ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO,
ARCH_CAP_SBDR_SSDP_NO, ARCH_CAP_FB_CLEAR, and ARCH_CAP_FB_CLEAR_CTRL,
the VERW on this hardware clears Fill Buffer (if FB_CLEAR_DIS is not
enabled in MSR_IA32_MCU_OPT_CTRL). vmx_setup_fb_clear_ctrl() does set
vmx_fb_clear_ctrl_available to true.

If a guest is exposed to none of ARCH_CAP_TAA_NO, ARCH_CAP_MDS_NO,
ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, ARCH_CAP_SBDR_SSDP_NO and
ARCH_CAP_FB_CLEAR, vmx_update_fb_clear_dis() will leave
vmx->disable_fb_clear as true. So VERW doesn't clear the Fill Buffer for
the guest. But from the guest's point of view, it expects VERW to clear
the Fill Buffer.

2023-05-23 03:48:14

by Pawan Gupta

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

On Tue, May 23, 2023 at 09:00:50AM +0800, Xiaoyao Li wrote:
> On 5/23/2023 5:23 AM, Pawan Gupta wrote:
> > On Tue, May 23, 2023 at 03:31:44AM +0800, Xiaoyao Li wrote:
> > > On 5/23/2023 1:43 AM, Sean Christopherson wrote:
> > > > > > 6. Performance aside, KVM should not be speculating (ha!) on what the guest
> > > > > > will and will not do, and should instead honor whatever behavior is presented
> > > > > > to the guest. If the guest CPU model indicates that VERW flushes buffers,
> > > > > > then KVM damn well needs to let VERW flush buffers.
> > > > > The current implementation allows guests to have VERW flush buffers when
> > > > > they enumerate FB_CLEAR. It only restricts the flush behavior when the
> > > > > guest is trying to mitigate against a vulnerability(like MDS) on a
> > > > > hardware that is not affected. I guess its common for guests to be
> > > > > running with older gen configuration on a newer hardware.
> > > > Right, I'm saying that that behavior is wrong. KVM shouldn't assume the guest
> > > > the guest will do things a certain way and should instead honor the "architectural"
> > > > definition, in quotes because I realize there probably is no architectural
> > > > definition for any of this.
> > > >
> > > > It might be that the code does (unintentionally?) honor the "architecture", i.e.
> > > > this code might actually be accurrate with respect to when the guest can expect
> > > > VERW to flush buffers. But the comment is so, so wrong.
> > >
> > > The comment is wrong and the code is wrong in some case as well.
> > >
> > > If none of ARCH_CAP_FB_CLEAR, ARCH_CAP_MDS_NO, ARCH_CAP_TAA_NO,
> > > ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO and ARCH_CAP_SBDR_SSDP_NO are exposed to
> > > VM, the VM is type of "affected by MDS".
> > >
> > > And accroding to the page https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
> > >
> > > if the VM enumerates support for both L1D_FLUSH and MD_CLEAR, it implicitly
> > > enumerates FB_CLEAR as part of their MD_CLEAR support.
> >
> > This is the excerpt from the link that you mentioned:
> >
> > "For processors that are affected by MDS and support L1D_FLUSH
> > operations and MD_CLEAR operations, the VERW instruction flushes fill
> > buffers."
> >
> > You are missing an important information here "For the processors
> > _affected_ by MDS". On such processors ...
> >
> > > However, the code will leave vmx->disable_fb_clear as 1 if hardware supports
> > > it, and VERW intruction doesn't clear FB in the VM, which conflicts
> > > "architectural" definition.
> >
> > ... Fill buffer clear is not enabled at all:
> >
> > vmx_setup_fb_clear_ctrl()
> > {
> > u64 msr;
> > if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES) &&
> > !boot_cpu_has_bug(X86_BUG_MDS) &&
> > !boot_cpu_has_bug(X86_BUG_TAA)) {
> > rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr);
> > if (msr & ARCH_CAP_FB_CLEAR_CTRL)
> > vmx_fb_clear_ctrl_available = true;
> > }
> > }
>
> This is the check of bare metal, while the check in
> vmx_update_fb_clear_dis() is of guest VM.
>
> For example, if the hardware (host) enumerates ARCH_CAP_TAA_NO,
> ARCH_CAP_MDS_NO, ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, ARCH_CAP_SBDR_SSDP_NO,
> ARCH_CAP_FB_CLEAR, and ARCH_CAP_FB_CLEAR_CTRL, the VERW on this hardware
> clears Fill Buffer (if FB_CLEAR_DIS is not enabled in
> MSR_IA32_MCU_OPT_CTRL). vmx_setup_fb_clear_ctrl() does set
> vmx_fb_clear_ctrl_available to true.
>
> If a guest is exposed without ARCH_CAP_TAA_NO, ARCH_CAP_MDS_NO,
> ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, ARCH_CAP_SBDR_SSDP_NO and
> ARCH_CAP_FB_CLEAR, vmx_update_fb_clear_dis() will leave
> vmx->disable_fb_clear as true. So VERW doesn't clear Fill Buffer for guest.
> But in the view of guest, it expects VERW to clear Fill Buffer.

That is correct, but whether VERW clears the CPU buffers also depends on
whether the hardware is affected or not; enumerating MD_CLEAR alone does
not guarantee that VERW will flush CPU buffers. This was true even
before MMIO Stale Data was discovered.

If host(hardware) enumerates:

MD_CLEAR | MDS_NO | VERW behavior
---------|--------|-------------------
1 | 0 | Clears CPU buffers

But on MDS-mitigated hardware (MDS_NO=1), if the guest enumerates:

MD_CLEAR | MDS_NO | VERW behavior
---------|--------|-----------------------
1 | 0 | Not guaranteed to clear
CPU buffers

After MMIO Stale Data, FB_CLEAR_DIS was introduced to keep this behavior
intact (for hardware that is not affected by MDS/TAA). If userspace
truly wants the guest to have the VERW flush behavior, it can expose
FB_CLEAR.

I see your point that, from the guest's perspective, it is being lied to
about the VERW behavior. OTOH, I am not sure that is a good enough
reason for mitigated hardware to keep the overhead of clearing
micro-architectural buffers for generations of CPUs.

2023-05-23 05:16:19

by Pawan Gupta

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

On Mon, May 22, 2023 at 01:54:55PM -0700, Pawan Gupta wrote:
> On Mon, May 22, 2023 at 10:43:49AM -0700, Sean Christopherson wrote:
> > On Fri, May 19, 2023, Pawan Gupta wrote:
> > > On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
> > > > I made the mistake of digging into why KVM doesn't advertise ARCH_CAP_FB_CLEAR_CTRL...
> > > >
> > > > 1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable Fill
> > > > buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR on every
> > > > entry+exit.
> > >
> > > Unnecessary VERWs in guest will have much higher impact than due to MSR
> > > read/write at vmentry/exit.
> >
> > Can you provide numbers for something closeish to a real world workload?
>
> I am collecting the numbers, will update here soon.

Looks like avoiding the VERW flush behavior in the guest results in a
2-5% improvement in performance.

Pybench and the CPU-bound sysbench test show a minor improvement when
avoiding reading/writing MSR_MCU_OPT_CTRL at VM-entry/VM-exit.

Baseline : v6.4-rc3
Target : v6.4-rc3 + No read/write to MSR_MCU_OPT_CTRL at VMentry/VMexit

| Test | Configuration | Relative |
| --------- | ------------- | -------- |
| nginx | 200 | 0.977 |
| hackbench | 8 - Process | 0.975 |
| sysbench | RAM / Memory | 0.946 |
| sysbench | CPU | 1.002 |
| pybench | T.F.A.T.T | 1.007 |

Host configuration (Icelake server):
CPU family: 6
Model: 106
Stepping: 6
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Vulnerable: eIBRS with unprivileged eBPF
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected


Guest configuration (Skylake Client):
CPU family: 6
Model: 94
Stepping: 3
Vulnerabilities:
Itlb multihit: KVM: Mitigation: VMX unsupported
L1tf: Mitigation; PTE Inversion
Mds: Vulnerable: Clear CPU buffers attempted, no microcode;
SMT Host state unknown
Meltdown: Mitigation; PTI
Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode;
SMT Host state unknown
Retbleed: Vulnerable
Spec store bypass: Vulnerable
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer
sanitization
Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PB
RSB-eIBRS Not affected
Srbds: Unknown: Dependent on hypervisor status
Tsx async abort: Vulnerable: Clear CPU buffers attempted, no microcode;
SMT Host state unknown

2023-05-25 15:59:05

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

On 5/23/2023 11:34 AM, Pawan Gupta wrote:
> On Tue, May 23, 2023 at 09:00:50AM +0800, Xiaoyao Li wrote:
>> On 5/23/2023 5:23 AM, Pawan Gupta wrote:
>>> On Tue, May 23, 2023 at 03:31:44AM +0800, Xiaoyao Li wrote:
>>>> On 5/23/2023 1:43 AM, Sean Christopherson wrote:
>>>>>>> 6. Performance aside, KVM should not be speculating (ha!) on what the guest
>>>>>>> will and will not do, and should instead honor whatever behavior is presented
>>>>>>> to the guest. If the guest CPU model indicates that VERW flushes buffers,
>>>>>>> then KVM damn well needs to let VERW flush buffers.
>>>>>> The current implementation allows guests to have VERW flush buffers when
>>>>>> they enumerate FB_CLEAR. It only restricts the flush behavior when the
>>>>>> guest is trying to mitigate against a vulnerability(like MDS) on a
>>>>>> hardware that is not affected. I guess its common for guests to be
>>>>>> running with older gen configuration on a newer hardware.
>>>>> Right, I'm saying that that behavior is wrong. KVM shouldn't assume the guest
>>>>> the guest will do things a certain way and should instead honor the "architectural"
>>>>> definition, in quotes because I realize there probably is no architectural
>>>>> definition for any of this.
>>>>>
>>>>> It might be that the code does (unintentionally?) honor the "architecture", i.e.
>>>>> this code might actually be accurrate with respect to when the guest can expect
>>>>> VERW to flush buffers. But the comment is so, so wrong.
>>>>
>>>> The comment is wrong and the code is wrong in some case as well.
>>>>
>>>> If none of ARCH_CAP_FB_CLEAR, ARCH_CAP_MDS_NO, ARCH_CAP_TAA_NO,
>>>> ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO and ARCH_CAP_SBDR_SSDP_NO are exposed to
>>>> VM, the VM is type of "affected by MDS".
>>>>
>>>> And accroding to the page https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
>>>>
>>>> if the VM enumerates support for both L1D_FLUSH and MD_CLEAR, it implicitly
>>>> enumerates FB_CLEAR as part of their MD_CLEAR support.
>>>
>>> This is the excerpt from the link that you mentioned:
>>>
>>> "For processors that are affected by MDS and support L1D_FLUSH
>>> operations and MD_CLEAR operations, the VERW instruction flushes fill
>>> buffers."
>>>
>>> You are missing an important information here "For the processors
>>> _affected_ by MDS". On such processors ...
>>>
>>>> However, the code will leave vmx->disable_fb_clear as 1 if hardware supports
>>>> it, and VERW intruction doesn't clear FB in the VM, which conflicts
>>>> "architectural" definition.
>>>
>>> ... Fill buffer clear is not enabled at all:
>>>
>>> vmx_setup_fb_clear_ctrl()
>>> {
>>> u64 msr;
>>> if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES) &&
>>> !boot_cpu_has_bug(X86_BUG_MDS) &&
>>> !boot_cpu_has_bug(X86_BUG_TAA)) {
>>> rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr);
>>> if (msr & ARCH_CAP_FB_CLEAR_CTRL)
>>> vmx_fb_clear_ctrl_available = true;
>>> }
>>> }
>>
>> This is the check of bare metal, while the check in
>> vmx_update_fb_clear_dis() is of guest VM.
>>
>> For example, if the hardware (host) enumerates ARCH_CAP_TAA_NO,
>> ARCH_CAP_MDS_NO, ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, ARCH_CAP_SBDR_SSDP_NO,
>> ARCH_CAP_FB_CLEAR, and ARCH_CAP_FB_CLEAR_CTRL, the VERW on this hardware
>> clears Fill Buffer (if FB_CLEAR_DIS is not enabled in
>> MSR_IA32_MCU_OPT_CTRL). vmx_setup_fb_clear_ctrl() does set
>> vmx_fb_clear_ctrl_available to true.
>>
>> If a guest is exposed without ARCH_CAP_TAA_NO, ARCH_CAP_MDS_NO,
>> ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, ARCH_CAP_SBDR_SSDP_NO and
>> ARCH_CAP_FB_CLEAR, vmx_update_fb_clear_dis() will leave
>> vmx->disable_fb_clear as true. So VERW doesn't clear Fill Buffer for guest.
>> But in the view of guset, it expects VERW to clear Fill Buffer.
>
> That is correct, but whether VERW clears the CPU buffers also depends on
> if the hardware is affected or not, enumerating MD_CLEAR solely does not
> guarantee that VERW will flush CPU buffers. This was true even before
> MMIO Stale Data was discovered.
>
> If host(hardware) enumerates:
>
> MD_CLEAR | MDS_NO | VERW behavior
> ---------|--------|-------------------
> 1 | 0 | Clears CPU buffers
>
> But on an MDS mitigated hardware(MDS_NO=1) if guest enumerates:
>
> MD_CLEAR | MDS_NO | VERW behavior
> ---------|--------|-----------------------
> 1 | 0 | Not guaranteed to clear
> CPU buffers
>
> After MMIO Stale Data, FB_CLEAR_DIS was introduced to keep this behavior
> intact(for hardware that is not affected by MDS/TAA).

Sorry, I don't understand. What is the behavior?

> If the userspace
> truly wants the guest to have VERW flush behavior, it can export
> FB_CLEAR.
>
> I see your point that from a guest's perspective it is being lied about
> VERW behavior. OTOH, I am not sure if it is a good enough reason for
> mitigated hardware to keep the overhead of clearing micro-architectural
> buffers for generations of CPUs.

The user takes the responsibility because it itself requests the
specific feature combination for its guest.


2023-05-25 21:02:38

by Pawan Gupta

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

On Thu, May 25, 2023 at 11:42:24PM +0800, Xiaoyao Li wrote:
> On 5/23/2023 11:34 AM, Pawan Gupta wrote:
> > > If a guest is exposed without ARCH_CAP_TAA_NO, ARCH_CAP_MDS_NO,
> > > ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, ARCH_CAP_SBDR_SSDP_NO and
> > > ARCH_CAP_FB_CLEAR, vmx_update_fb_clear_dis() will leave
> > > vmx->disable_fb_clear as true. So VERW doesn't clear Fill Buffer for guest.
> > > But in the view of guset, it expects VERW to clear Fill Buffer.
> >
> > That is correct, but whether VERW clears the CPU buffers also depends on
> > if the hardware is affected or not, enumerating MD_CLEAR solely does not
> > guarantee that VERW will flush CPU buffers. This was true even before
> > MMIO Stale Data was discovered.
> >
> > If host(hardware) enumerates:
> >
> > MD_CLEAR | MDS_NO | VERW behavior
> > ---------|--------|-------------------
> > 1 | 0 | Clears CPU buffers
> >
> > But on an MDS mitigated hardware(MDS_NO=1) if guest enumerates:
> >
> > MD_CLEAR | MDS_NO | VERW behavior
> > ---------|--------|-----------------------
> > 1 | 0 | Not guaranteed to clear
> > CPU buffers
> >
> > After MMIO Stale Data, FB_CLEAR_DIS was introduced to keep this behavior
> > intact(for hardware that is not affected by MDS/TAA).
>
> Sorry, I don't understand it. What the behavior is?

That on a mitigated hardware VERW may not clear the micro-architectural
buffers.

There are many micro-architectural buffers, VERW only clears the
affected ones. This is indicated in section "Fill Buffer Clearing
Operations" of [1].

Some processors may enumerate MD_CLEAR because they overwrite all
buffers affected by MDS/TAA, but they do not overwrite fill buffer
values. This is because fill buffers are not susceptible to MDS or TAA
on those processors.

For processors affected by FBSDP where MD_CLEAR may not overwrite fill
buffer values, Intel has released microcode updates that enumerate
FB_CLEAR so that VERW does overwrite fill buffer values.

> > If the userspace
> > truly wants the guest to have VERW flush behavior, it can export
> > FB_CLEAR.
> >
> > I see your point that from a guest's perspective it is being lied about
> > VERW behavior. OTOH, I am not sure if it is a good enough reason for
> > mitigated hardware to keep the overhead of clearing micro-architectural
> > buffers for generations of CPUs.
>
> User takes the responsibility because itself requests the specific feature
> combination for its guest.

As I understand, the MD_CLEAR enumeration on mitigated hardware is done
purely for VM migration compatibility. Software is not expected to use
VERW on mitigated hardware, below is from MDS documentation [2]:

Future processors set the MDS_NO bit in IA32_ARCH_CAPABILITIES to
indicate they are not affected by microarchitectural data sampling.
Such processors will continue to enumerate the MD_CLEAR bit in CPUID.
As none of these data buffers are vulnerable to exposure on such
parts, no data buffer overwriting is required or expected for such
parts, despite the MD_CLEAR indication. Software should look to the
MDS_NO bit to determine whether buffer overwriting mitigations are
required.

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html

[2] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/resources/processors-affected-mds.html

2023-05-29 04:21:39

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
>FWIW, this trainwreck is another reason why I'm not going to look at the proposed
>"Intel IA32_SPEC_CTRL Virtualization" crud until external forces dictate that I
>do so. I have zero confidence that a paravirt interface defined by hardware
>vendors to fiddle with mitigations will be sane, flexible, and extensible.

Hi Sean,

Just to confirm we are on the same page:

"Intel IA32_SPEC_CTRL Virtualization" series consists of 3 parts:

1. Expose BHI_CTRL, RRSBA_CTRL to guests. They are hardware mitigations to
disable BHI and RRSBA behaviors and can be used by both guest/host.

2. Enable IA32_SPEC_CTRL Virtualization which is a VMX feature. This is for KVM
to effectively lock some bits of IA32_SPEC_CTRL MSR when guests are running.

3. Implement the paravirt interface (the virtual MSRs) for guests to report
software mitigations in-use. KVM can utilize such information to enable
hardware mitigations for guests transparently to address software mitigation
effectiveness issues caused by CPU microarchitecture changes (RRSBA behavior,
size of branch history table).

As per my understanding, your concerns are primarily focused on #3, the
paravirt interface, rather than the entire series. Am I correct in assuming that
you do not oppose #1 and #2?


You previously mentioned that the paravirt interface was not common [1], and
this time, you expressed an expectation for the interface to be "sane, flexible,
and extensible." To ensure clarity, I want to confirm my interpretation of
your expectations:

1. The interface should not be tied to a specific CPU vendor but instead be
beneficial for Intel and AMD (and even ARM, and potentially others).

2. The interface should have the capability to solve other issues (e.g.,
coordinate mitigations in guest/host to address other performance/functional
issues), not limited to software mitigation effectiveness on Intel CPUs.

3. The interface should be extensible by VMMs rather than relying on hardware
vendors rolling out new specs. This enables VMM developers to propose new
ideas for coordinating mitigations in guest/host.

Please let me know if I missed any key points or if any of the above statements
do not align with your expectations.

[1]: https://lore.kernel.org/all/[email protected]/

2023-06-06 17:31:25

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

On Mon, May 29, 2023, Chao Gao wrote:
> On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
> >FWIW, this trainwreck is another reason why I'm not going to look at the proposed
> >"Intel IA32_SPEC_CTRL Virtualization" crud until external forces dictate that I
> >do so. I have zero confidence that a paravirt interface defined by hardware
> >vendors to fiddle with mitigations will be sane, flexible, and extensible.
>
> Hi Sean,
>
> Just to confirm we are on the same page:
>
> "Intel IA32_SPEC_CTRL Virtualization" series consists of 3 parts:
>
> 1. Expose BHI_CTRL, RRSBA_CTRL to guests. They are hardware mitigations to
> disable BHI and RRSBA behaviors and can be used by both guest/host.
>
> 2. Enable IA32_SPEC_CTRL Virtualization which is a VMX feature. This is for KVM
> to effectively lock some bits of IA32_SPEC_CTRL MSR when guests are running.
>
> 3. Implement the paravirt interface (the virtual MSRs) for guests to report
> software mitigations in-use. KVM can utilize such information to enable
> hardware mitigations for guests transparently to address software mitigation
> effectiveness issues caused by CPU microarchitecture changes (RRSBA behavior,
> size of branch history table).
>
> As per my understanding, your concerns are primarily focused on #3, the
> paravirt interface, rather than the entire series. Am I correct in assuming that
> you do not oppose #1 and #2?

Yes, correct. I definitely recommend posting #1 and #2 separately from the
paravirt interface; I ignored the entire series without realizing there is real
hardware support in there.

> You previously mentioned that the paravirt interface was not common [1], and
> this time, you expressed an expectation for the interface to be "sane, flexible,
> and extensible." To ensure clarity, I want to confirm my interpretation of
> your expectations:
>
> 1. The interface should not be tied to a specific CPU vendor but instead be
> beneficial for Intel and AMD (and even ARM, and potentially others).
>
> 2. The interface should have the capability to solve other issues (e.g.,
> coordinate mitigations in guest/host to address other performance/functional
> issues), not limited to software mitigation effectiveness on Intel CPUs.
>
> 3. The interface should be extensible by VMMs rather than relying on hardware
> vendors rolling out new specs. This enables VMM developers to propose new
> ideas for coordinating mitigations in guest/host.

Ya, that's more or less my opinion. Even more than allowing VMM developers to
extend/define the interface, I want the definition of the interface to be a
community/collaborative effort. LKML has active representatives from all of the
major (known) hypervisors, so it shouldn't be *that* hard to figure out a way to
make the interface community driven.

Note that it doesn't necessarily have to be VMM developers, e.g. many of the
people that are intimately familiar with the mitigations aren't virtualization
folks.

> Please let me know if I missed any key points or if any of the above statements
> do not align with your expectations.
>
> [1]: https://lore.kernel.org/all/[email protected]/