2021-11-10 13:06:51

by Liu, Jing2

[permalink] [raw]
Subject: Thoughts of AMX KVM support based on latest kernel

Hi Thomas and Paolo,

Thanks for your thoughts and suggestions. After reading the emails
and looking at the code, we'd like to explain our thoughts of AMX
KVM support based on latest kernel and the code from git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/fpu-kvm

AMX support based on existing design concepts

One of our objectives is to have a simple and clean KVM implementation by
utilizing the new dynamic extended-features handling in the FPU core.

Dynamic reallocation and "lazy passthrough"

The new code allows us to implement "lazy passthrough" of the XFD MSRs
by coupling a buffer reallocation request, which is indirectly made by vcpus
(VM exit). With "lazy passthrough" of the XFD MSR, we can avoid unnecessary
save/restore of the MSR and allocation of the extended features until the
guest really requires and is allowed to use. Until that point, the XFD MSR is
virtual, and thus we do not need to save/restore the actual MSR at VM
entry/exit time. And the vcpu does not have an extended state until that point.

Once the guest starts using the XFD feature (e.g. AMX) and it is permitted to
use it, we allow the guest to directly modify the MSR (passthrough) to avoid
(potentially frequent) VM exits.

Triggering of a reallocation request and error handling

First, we want to avoid weird guest failures at runtime due to (more likely)
permission failures of a reallocation request, checking the permissions of the
vcpu (for the extend features) at kvm_vcpu_ioctl_set_cpuid2() time, when
QEMU wants to advertise the extended features (e.g. AMX) for the first time.
We have no idea at vcpu_create() time whether QEMU wants to enable AMX
or not at that time. If kvm_vcpu_ioctl_set_cpuid2() succeeds, then there is
no need to further check permission in reallocation path.

Upon detection (interception) of an attempt by a vcpu to write to XCR0 (XSETBV)
and XFD (WRMSR), we check if the write is valid, and we start passthrough of
the XFD MSRs if the dynamic feature[i] meets the condition
XCR0[i]=1 && XFD[i]=0. And we make a reallocation request to the FPU core.

We simplify the KVM implementation by assuming that the reallocation
request was successful when the vcpu comes back to KVM. For such VM exit
handling that requires a buffer-reallocation request, we don't resume the
guest immediately. Instead, we go back to the userspace, to rely on the
userspace VMM (e.g. QEMU) for handling error cases. The actual reallocation
happens when control is transferred from KVM to the kernel (FPU core). If
no error, QEMU will come back to KVM by repeating vcpu_ioctl_run().

Potential failures there are due to lack of memory. But this would not be
interesting cases; the host should have more resource problems at that
time if that is the case.

Additional KVM-specific or and virtualization requirements

KVM needs to virtualize the XFD features, and we have additional
requirements.

XFD reset value
The XFD reset value needs to be 0.

KVM-specific XFD handling in XSAVES/XRSTORS

Once we start passthrough the XFD MSR, we need to save/restore
them at VM exit/entry time. If we immediately resume the guest
without enabling interrupts/preemptions (exit fast-path), we have no
issues. We don't need to save the MSR. The question is how the host
XFD MSR is restored while control is in KVM.

The XSAVE(S) instruction saves the (guest) state component[x] as 0 or
doesn't save when XFD[x] != 0. Accordingly, XRSTOR(S) cannot restore
that (guest state). And it is possible that XFD != 0 and the guest is using
extended feature at VM exit; we can check the XINUSE state-component
bitmap by XGETBV(1). By adding more meaning to the existing field:
fpstate->in_use, it can be useful for KVM to set the XINUSE value.

The usual VM exit handling in KVM, however, is done with
interrupt/preemption enabled. If a guest has a non-zero XFD and AMX
is in use at VM exit, the host and KVM need to maintain the guest state.
There are two cases where the host and KVM may lose the state:

a). KVM is scheduled out and kernel context switch does XSAVES,
b). KVM is interrupted and the softirq path calls
kernel_fpu_begin_mask(), which may execute XSAVES.

One crude way (Option 1) would be clear XFD temporarily at VM exit
time if the extended feature (AMX) is in use (XINUSE). It also causes
unnecessary overhead because interrupt/preemption may not always
happen.

Given the new unified handling of the XFD state management and
guest awareness in the FPU core, we think it might be better to defer
this to the host (Option 2):

a). Before the host kernel executes XSAVES, it clears XFD by checking if
this is a KVM guest fpu and if guest AMX is in use (XINUSE). KVM can
convey the condition by using fpstate->is_guest and fpstate->in_use,
for example. We need to add more meaning (and code changes) to
those fields.
b). Same for XRSTORS.

One of potential drawbacks of the Option 2 might be additional
checks in the host, although we can minimize the impact by having
CONFIG_KVM_TBD. We believe that the case
"XFD != 0 and XINUSE != 0" should be very infrequent.

Propagation of reallocation errors

As noted above, a reallocation request can fail, and we need to
propagate the error code to the userspace (e.g. QEMU) so that
it can handle the failure properly. Since we do not want to
terminate the guest after running due to permission errors
("weird failure"), we think we should check the permission at
set_cpuid2 time, return failure if no permission.

Looking forward to your comments.

Thanks,
Jing


2021-11-16 12:18:20

by Thomas Gleixner

[permalink] [raw]
Subject: Re: Thoughts of AMX KVM support based on latest kernel

Jing,

On Wed, Nov 10 2021 at 13:01, Jing2 Liu wrote:
> Triggering of a reallocation request and error handling
>
> First, we want to avoid weird guest failures at runtime due to (more likely)
> permission failures of a reallocation request, checking the permissions of the
> vcpu (for the extend features) at kvm_vcpu_ioctl_set_cpuid2() time, when
> QEMU wants to advertise the extended features (e.g. AMX) for the first
> time.

That's the right thing to do. If there is no permission for the guest
granted via the prctl() extension I suggested then exposing AMX should
be rejected.

> We have no idea at vcpu_create() time whether QEMU wants to enable AMX
> or not at that time. If kvm_vcpu_ioctl_set_cpuid2() succeeds, then there is
> no need to further check permission in reallocation path.

That's correct.

> Upon detection (interception) of an attempt by a vcpu to write to XCR0 (XSETBV)
> and XFD (WRMSR), we check if the write is valid, and we start passthrough of
> the XFD MSRs if the dynamic feature[i] meets the condition
> XCR0[i]=1 && XFD[i]=0. And we make a reallocation request to the FPU core.
>
> We simplify the KVM implementation by assuming that the reallocation
> request was successful when the vcpu comes back to KVM. For such VM exit
> handling that requires a buffer-reallocation request, we don't resume the
> guest immediately. Instead, we go back to the userspace, to rely on the
> userspace VMM (e.g. QEMU) for handling error cases. The actual reallocation
> happens when control is transferred from KVM to the kernel (FPU core). If
> no error, QEMU will come back to KVM by repeating vcpu_ioctl_run().
>
> Potential failures there are due to lack of memory. But this would not be
> interesting cases; the host should have more resource problems at that
> time if that is the case.

Indeed.

> One of potential drawbacks of the Option 2 might be additional
> checks in the host, although we can minimize the impact by having
> CONFIG_KVM_TBD. We believe that the case
> "XFD != 0 and XINUSE != 0" should be very infrequent.

I really don't like the idea of having an extra check in switch_to().

Can we start simple and do something like the uncompiled below and see
how much overhead it creates?

Thanks,

tglx
---
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 0f8b90ab18c9..6175a78e0be8 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -122,4 +122,12 @@ static __always_inline __pure bool fpu_state_size_dynamic(void)
}
#endif

+void fpu_update_guest_xfd_state(void);
+
+static inline void kvm_update_guest_xfd_state(void)
+{
+ if (fpu_state_size_dynamic())
+ fpu_update_guest_xfd_state();
+}
+
#endif
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 8ea306b1bf8e..161db48c9052 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -199,6 +199,17 @@ void fpu_reset_from_exception_fixup(void)
}

#if IS_ENABLED(CONFIG_KVM)
+void fpu_update_guest_xfd_state(void)
+{
+ u64 xfd;
+
+ /* FIXME: Add debug */
+ rdmsrl(MSR_IA32_XFD, xfd);
+ current->thread.fpu.fpstate->xfd = xfd;
+ __this_cpu_write(xfd_state, xfd);
+}
+EXPORT_SYMBOL_GPL(fpu_update_guest_xfd_state);
+
static void __fpstate_reset(struct fpstate *fpstate);

bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2686f2edb47c..9425fdbb4806 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9576,6 +9576,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
vcpu->arch.last_vmentry_cpu = vcpu->cpu;
vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());

+ kvm_update_guest_xfd_state();
+
vcpu->mode = OUTSIDE_GUEST_MODE;
smp_wmb();




2021-11-16 16:06:24

by Sean Christopherson

[permalink] [raw]
Subject: Re: Thoughts of AMX KVM support based on latest kernel

On Tue, Nov 16, 2021, Thomas Gleixner wrote:
> > One of potential drawbacks of the Option 2 might be additional
> > checks in the host, although we can minimize the impact by having
> > CONFIG_KVM_TBD. We believe that the case
> > "XFD != 0 and XINUSE != 0" should be very infrequent.
>
> I really don't like the idea of having an extra check in switch_to().
>
> Can we start simple and do something like the uncompiled below and see
> how much overhead it creates?
>
> Thanks,
>
> tglx
> ---

...

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2686f2edb47c..9425fdbb4806 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9576,6 +9576,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> vcpu->arch.last_vmentry_cpu = vcpu->cpu;
> vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>
> + kvm_update_guest_xfd_state();

Is there a reason the XFD switch can't key off TIF_NEED_FPU_LOAD a la the other
FPU stuff? I.e. piggyback this snippet in vcpu_enter_guest():

if (test_thread_flag(TIF_NEED_FPU_LOAD))
switch_fpu_return();
> +
> vcpu->mode = OUTSIDE_GUEST_MODE;
> smp_wmb();
>
>
>

2021-11-16 18:55:54

by Thomas Gleixner

[permalink] [raw]
Subject: Re: Thoughts of AMX KVM support based on latest kernel

Sean,

On Tue, Nov 16 2021 at 16:05, Sean Christopherson wrote:
> On Tue, Nov 16, 2021, Thomas Gleixner wrote:
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 2686f2edb47c..9425fdbb4806 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -9576,6 +9576,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>> vcpu->arch.last_vmentry_cpu = vcpu->cpu;
>> vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>>
>> + kvm_update_guest_xfd_state();
>
> Is there a reason the XFD switch can't key off TIF_NEED_FPU_LOAD a la the other
> FPU stuff? I.e. piggyback this snippet in vcpu_enter_guest():

TIF_NEED_FPU_LOAD is not set here.

> if (test_thread_flag(TIF_NEED_FPU_LOAD))
> switch_fpu_return();

Assume guest has control of XFD and XFD writes are not trapped. That
means on vmexit the XFD state of the guest is unknown.

vcpu_run()
kvm_load_guest_fpu()
wrmsrl(XFD, guest_fpstate->xfd);
XRSTORS

do {

local_irq_disable();

// Covers the case of softirq usage and preemption
if (test_thread_flag(TIF_NEED_FPU_LOAD))
switch_fpu_return()
wrmsrl(XFD, guest_fpstate->xfd);

do {
vmenter(); // Guest modifies XFD
} while (reenter);

local_irq_enable(); <- Problem starts here

preempt_enable(); <- Becomes wider here

} while (!breakout);

kvm_put_guest_fpu(); // Switch back to user FPU state

So we have the following cases:

guest_fpstate.xfd XFD at vmexit

0 0 // consistent state
1 0 // inconsistent state
0 1 // inconsistent state
1 1 // consistent state

Now assume that after reenabling interrupts a interrupt/softirq happens
which uses FPU. It will save the correct state because XFD is still
guest state, but the subsequent restore will operate on the stale
guest_fpstate.xfd value.

Same problem vs schedule after reenabling preemption or if not preempted
in kvm_put_guest_fpu()

Now you could argue that the interrupt/softirq XSAVES should also read
the XFD MSR and save it in guest_fpstate.xfd. Same in schedule()
and kvm_put_guest_fpu(), i.e:

XSAVES
if (fpstate->is_guest) {
rdmsrl(XFD, xfd);
fpstate->xfd = xfd;
__this_cpu_write(..., xfd);
}

We can do that, but I'm unhappy about this conditional in schedule(). So
I was asking for doing a simple KVM only solution first:

vcpu_run()
kvm_load_guest_fpu()
wrmsrl(XFD, guest_fpstate->xfd);
XRSTORS

do {

local_irq_disable();

if (test_thread_flag(TIF_NEED_FPU_LOAD))
switch_fpu_return()
wrmsrl(XFD, guest_fpstate->xfd);

do {
vmenter(); // Guest modifies XFD
} while (reenter);

update_xfd_state(); // Restore consistency

local_irq_enable();

and check how bad that is for KVM in terms of overhead on AMX systems.

If it really matters we can look at the conditional in XSAVES path.

Thanks,

tglx

2021-11-16 19:20:33

by Thomas Gleixner

[permalink] [raw]
Subject: Re: Thoughts of AMX KVM support based on latest kernel

Jing,

On Wed, Nov 10 2021 at 13:01, Liu, Jing2 wrote:

more thoughts.

> Once we start passthrough the XFD MSR, we need to save/restore
> them at VM exit/entry time. If we immediately resume the guest
> without enabling interrupts/preemptions (exit fast-path), we have no
> issues. We don't need to save the MSR.

Correct.

> The question is how the host XFD MSR is restored while control is in
> KVM.
>
> The XSAVE(S) instruction saves the (guest) state component[x] as 0 or
> doesn't save when XFD[x] != 0. Accordingly, XRSTOR(S) cannot restore
> that (guest state). And it is possible that XFD != 0 and the guest is using
> extended feature at VM exit;

You mean on creative guests which just keep AMX state alive and set
XFD[AMX] = 1 to later restore it to XFD[AMX] = 0?

> we can check the XINUSE state-component bitmap by XGETBV(1). By adding
> more meaning to the existing field: fpstate->in_use, it can be useful
> for KVM to set the XINUSE value.

As I pointed out to Sean, the problem is inconsistent state. Checking
XGETBV(1) cannot make that go away. And I have no idea how you want to
abuse fpstate->in_use for anything related to the XINUSE bitmap. It's a
single bit for a particular purpose and has absolutely nothing to do
with XINUSE. Trying to overload that is just wrong.

If XFD is not trapped then you have exactly three options:

1) Make it an autosave MSR and grab the guest XFD value from that
memory to update fpstate and the shadow memory before reenabling
interrupts

2) Do the MSR read how I suggested before reenabling interrupts

3) Conditionally post XSAVES when fpstate->is_guest == true

Anything else wont work.

Thanks,

tglx

2021-11-16 19:49:21

by Paolo Bonzini

[permalink] [raw]
Subject: Re: Thoughts of AMX KVM support based on latest kernel

On 11/16/21 19:55, Thomas Gleixner wrote:
> We can do that, but I'm unhappy about this conditional in schedule(). So
> I was asking for doing a simple KVM only solution first:
>
> vcpu_run()
> kvm_load_guest_fpu()
> wrmsrl(XFD, guest_fpstate->xfd);
> XRSTORS
>
> do {
>
> local_irq_disable();
>
> if (test_thread_flag(TIF_NEED_FPU_LOAD))
> switch_fpu_return()
> wrmsrl(XFD, guest_fpstate->xfd);
>
> do {
> vmenter(); // Guest modifies XFD
> } while (reenter);
>
> update_xfd_state(); // Restore consistency
>
> local_irq_enable();
>
> and check how bad that is for KVM in terms of overhead on AMX systems.

I agree, this is how we handle SPEC_CTRL for example and it can be
extended to XFD. We should first do that, then switch to the MSR lists.
Hacking into schedule() should really be the last resort.

> local_irq_enable(); <- Problem starts here
>
> preempt_enable(); <- Becomes wider here

It doesn't become that much wider because there's always preempt
notifiers. So if it's okay to save XFD in the XSAVES wrapper and in
kvm_arch_vcpu_put(), that might be already remove the need to do it
schedule().

Paolo


2021-11-16 20:12:58

by Sean Christopherson

[permalink] [raw]
Subject: Re: Thoughts of AMX KVM support based on latest kernel

On Tue, Nov 16, 2021, Thomas Gleixner wrote:
> Now you could argue that the interrupt/softirq XSAVES should also read
> the XFD MSR and save it in guest_fpstate.xfd. Same in schedule()
> and kvm_put_guest_fpu(), i.e:
>
> XSAVES
> if (fpstate->is_guest) {
> rdmsrl(XFD, xfd);
> fpstate->xfd = xfd;
> __this_cpu_write(..., xfd);
> }
>
> We can do that, but I'm unhappy about this conditional in schedule(). So
> I was asking for doing a simple KVM only solution first:

Ah, the schedule() conditional is the part I was missing. Thanks!

2021-11-16 20:14:42

by Sean Christopherson

[permalink] [raw]
Subject: Re: Thoughts of AMX KVM support based on latest kernel

On Tue, Nov 16, 2021, Paolo Bonzini wrote:
> On 11/16/21 19:55, Thomas Gleixner wrote:
> > We can do that, but I'm unhappy about this conditional in schedule(). So
> > I was asking for doing a simple KVM only solution first:
> >
> > vcpu_run()
> > kvm_load_guest_fpu()
> > wrmsrl(XFD, guest_fpstate->xfd);
> > XRSTORS
> > do {
> >
> > local_irq_disable();
> >
> > if (test_thread_flag(TIF_NEED_FPU_LOAD))
> > switch_fpu_return()
> > wrmsrl(XFD, guest_fpstate->xfd);
> >
> > do {
> > vmenter(); // Guest modifies XFD
> > } while (reenter);
> >
> > update_xfd_state(); // Restore consistency
> >
> > local_irq_enable();
> >
> > and check how bad that is for KVM in terms of overhead on AMX systems.
>
> I agree, this is how we handle SPEC_CTRL for example and it can be extended
> to XFD. We should first do that, then switch to the MSR lists. Hacking
> into schedule() should really be the last resort.

Agreed as well.

> > local_irq_enable(); <- Problem starts here
> >
> > preempt_enable(); <- Becomes wider here
>
> It doesn't become that much wider because there's always preempt notifiers.
> So if it's okay to save XFD in the XSAVES wrapper and in
> kvm_arch_vcpu_put(), that might be already remove the need to do it
> schedule().

Assuming AMX can be accessed from (soft) IRQ context, hooking the preempt notifiers
isn't sufficient. That's also why KVM waits until IRQs are disabled before
handling TIF_NEED_FPU_LOAD.

2021-11-16 20:36:08

by Thomas Gleixner

[permalink] [raw]
Subject: Re: Thoughts of AMX KVM support based on latest kernel

Paolo,

On Tue, Nov 16 2021 at 20:49, Paolo Bonzini wrote:
> On 11/16/21 19:55, Thomas Gleixner wrote:
>> We can do that, but I'm unhappy about this conditional in schedule(). So
>> I was asking for doing a simple KVM only solution first:
>>
>> vcpu_run()
>> kvm_load_guest_fpu()
>> wrmsrl(XFD, guest_fpstate->xfd);
>> XRSTORS
>>
>> do {
>>
>> local_irq_disable();
>>
>> if (test_thread_flag(TIF_NEED_FPU_LOAD))
>> switch_fpu_return()
>> wrmsrl(XFD, guest_fpstate->xfd);
>>
>> do {
>> vmenter(); // Guest modifies XFD
>> } while (reenter);
>>
>> update_xfd_state(); // Restore consistency
>>
>> local_irq_enable();
>>
>> and check how bad that is for KVM in terms of overhead on AMX systems.
>
> I agree, this is how we handle SPEC_CTRL for example and it can be
> extended to XFD.

SPEC_CTRL is different because it's done right after each VMEXIT.

XFD can be done lazy when breaking out of the exit fastpath loop before
enabling interrupts.

> We should first do that, then switch to the MSR lists.
> Hacking into schedule() should really be the last resort.
>
>> local_irq_enable(); <- Problem starts here
>>
>> preempt_enable(); <- Becomes wider here
>
> It doesn't become that much wider because there's always preempt
> notifiers. So if it's okay to save XFD in the XSAVES wrapper and in
> kvm_arch_vcpu_put(), that might be already remove the need to do it
> schedule().

Did not think about preemption notifiers. Probably because I hate
notifiers with a passion since I had to deal with the CPU hotplug
notifier trainwreck.

But yes that would work. So the places to do that would be:

1) kvm_sched_out() -> kvm_arch_vcpu_put()
2) kernel_fpu_begin_mask()
3) kvm_put_guest_fpu()

But I really would start with the trivial version I suggested because
that's already in the slow path and not at every VMEXIT.

I'd be really surprised if that RDMSR is truly noticeable within all the
other crud this path is doing.

Thanks,

tglx

2021-11-16 22:11:54

by Nakajima, Jun

[permalink] [raw]
Subject: Re: Thoughts of AMX KVM support based on latest kernel

>
> On Nov 16, 2021, at 12:36 PM, Thomas Gleixner <[email protected]> wrote:
>
> Paolo,
>
> On Tue, Nov 16 2021 at 20:49, Paolo Bonzini wrote:
>> On 11/16/21 19:55, Thomas Gleixner wrote:
>>> We can do that, but I'm unhappy about this conditional in schedule(). So
>>> I was asking for doing a simple KVM only solution first:
>>>
>>> vcpu_run()
>>> kvm_load_guest_fpu()
>>> wrmsrl(XFD, guest_fpstate->xfd);
>>> XRSTORS
>>>
>>> do {
>>>
>>> local_irq_disable();
>>>
>>> if (test_thread_flag(TIF_NEED_FPU_LOAD))
>>> switch_fpu_return()
>>> wrmsrl(XFD, guest_fpstate->xfd);
>>>
>>> do {
>>> vmenter(); // Guest modifies XFD
>>> } while (reenter);
>>>
>>> update_xfd_state(); // Restore consistency
>>>
>>> local_irq_enable();
>>>
>>> and check how bad that is for KVM in terms of overhead on AMX systems.
>>
>> I agree, this is how we handle SPEC_CTRL for example and it can be
>> extended to XFD.
>
> SPEC_CTRL is different because it's done right after each VMEXIT.
>
> XFD can be done lazy when breaking out of the exit fastpath loop before
> enabling interrupts.

I agree. The XFD features are for user-space.


>
>> We should first do that, then switch to the MSR lists.
>> Hacking into schedule() should really be the last resort.
>>
>>> local_irq_enable(); <- Problem starts here
>>>
>>> preempt_enable(); <- Becomes wider here
>>
>> It doesn't become that much wider because there's always preempt
>> notifiers. So if it's okay to save XFD in the XSAVES wrapper and in
>> kvm_arch_vcpu_put(), that might be already remove the need to do it
>> schedule().
>
> Did not think about preemption notifiers. Probably because I hate
> notifiers with a passion since I had to deal with the CPU hotplug
> notifier trainwreck.
>
> But yes that would work. So the places to do that would be:
>
> 1) kvm_sched_out() -> kvm_arch_vcpu_put()
> 2) kernel_fpu_begin_mask()
> 3) kvm_put_guest_fpu()
>
> But I really would start with the trivial version I suggested because
> that's already in the slow path and not at every VMEXIT.
>
> I'd be really surprised if that RDMSR is truly noticeable within all the
> other crud this path is doing.
>

I also agree here, and we’ll measure the effect to double-check.

We don’t want to complicate or optimize the system for very rare cases.
I like your "trivial version" because all the things KVM needs to do is just restore the consistent state.


---
Jun


2021-11-16 23:35:58

by Thomas Gleixner

[permalink] [raw]
Subject: Re: Thoughts of AMX KVM support based on latest kernel

On Tue, Nov 16 2021 at 20:12, Sean Christopherson wrote:
> On Tue, Nov 16, 2021, Thomas Gleixner wrote:
>> Now you could argue that the interrupt/softirq XSAVES should also read
>> the XFD MSR and save it in guest_fpstate.xfd. Same in schedule()
>> and kvm_put_guest_fpu(), i.e:
>>
>> XSAVES
>> if (fpstate->is_guest) {
>> rdmsrl(XFD, xfd);
>> fpstate->xfd = xfd;
>> __this_cpu_write(..., xfd);
>> }
>>
>> We can do that, but I'm unhappy about this conditional in schedule(). So
>> I was asking for doing a simple KVM only solution first:
>
> Ah, the schedule() conditional is the part I was missing. Thanks!

As I was missing the preempt notifier... which makes it a different
story, but I still think that the simple version is good enough at least
for a start.

Thanks,

tglx

2021-11-17 04:52:41

by Nakajima, Jun

[permalink] [raw]
Subject: Re: Thoughts of AMX KVM support based on latest kernel

>
> On Nov 16, 2021, at 11:20 AM, Thomas Gleixner <[email protected]> wrote:
>
> Jing,
>
> On Wed, Nov 10 2021 at 13:01, Liu, Jing2 wrote:
>
> more thoughts.
>
>> Once we start passthrough the XFD MSR, we need to save/restore
>> them at VM exit/entry time. If we immediately resume the guest
>> without enabling interrupts/preemptions (exit fast-path), we have no
>> issues. We don't need to save the MSR.
>
> Correct.
>
>> The question is how the host XFD MSR is restored while control is in
>> KVM.
>>
>> The XSAVE(S) instruction saves the (guest) state component[x] as 0 or
>> doesn't save when XFD[x] != 0. Accordingly, XRSTOR(S) cannot restore
>> that (guest state). And it is possible that XFD != 0 and the guest is using
>> extended feature at VM exit;
>
> You mean on creative guests which just keep AMX state alive and set
> XFD[AMX] = 1 to later restore it to XFD[AMX] = 0?


Typically a (usual) guest saves the AMX state for the previous process and sets XFD[AMX] = 1 for the next at context switch time, and a VM exit can happen anytime, e.g. right after XFD[AMX] = 1.
But this case is okay because the state is already saved by the guest.

If a (creative) guest wants to set XFD[AMX] = 1 for fun while keeping AMX state alive without saving the AXM state, it may lose the state after VM exit/entry. I think the right thing to do is to avoid such programming in the first place. Let me find out if we can add such notes in the programming references.


---
Jun



2021-11-17 07:31:32

by Paolo Bonzini

[permalink] [raw]
Subject: Re: Thoughts of AMX KVM support based on latest kernel

On 11/17/21 05:52, Nakajima, Jun wrote:
> If a (creative) guest wants to set XFD[AMX] = 1 for fun while keeping
> AMX state alive without saving the AXM state, it may lose the state
> after VM exit/entry.

I think this should not happen, unless you also document that other
random events (hypothetically, it could be some other core using AMX?)
can cause the loss of XTILEDATA if XFD[AMX]=1. Virtualization should
not be special, I'd prefer that the guest has the same behavior as bare
metal in this respect.

Paolo


2021-11-17 07:39:39

by Paolo Bonzini

[permalink] [raw]
Subject: Re: Thoughts of AMX KVM support based on latest kernel

On 11/16/21 21:36, Thomas Gleixner wrote:
> local_irq_enable(); <- Problem starts here
>
> preempt_enable(); <- Becomes wider here
>
>> It doesn't become that much wider because there's always preempt
>> notifiers. So if it's okay to save XFD in the XSAVES wrapper and in
>> kvm_arch_vcpu_put(), that might be already remove the need to do it
>> schedule().
>
> Did not think about preemption notifiers. Probably because I hate
> notifiers with a passion since I had to deal with the CPU hotplug
> notifier trainwreck.
>
> But yes that would work. So the places to do that would be:
>
> 1) kvm_sched_out() -> kvm_arch_vcpu_put() > 2) kernel_fpu_begin_mask()

... which calls save_fpregs_to_fpstate

> 3) kvm_put_guest_fpu()

... which calls save_fpregs_to_fpstate via fpu_swap_kvm_fpstate

So perhaps it could be done in save_fpregs_to_fpstate (for the sched out
path, it would be called by switch_fpu_prepare()). But for now I would
also start with the trivial version.

> I'd be really surprised if that RDMSR is truly noticeable within all the
> other crud this path is doing.

I agree.

Paolo


2021-11-17 10:15:39

by Tian, Kevin

[permalink] [raw]
Subject: RE: Thoughts of AMX KVM support based on latest kernel

> From: Paolo Bonzini <[email protected]>
> Sent: Wednesday, November 17, 2021 3:31 PM
>
> On 11/17/21 05:52, Nakajima, Jun wrote:
> > If a (creative) guest wants to set XFD[AMX] = 1 for fun while keeping
> > AMX state alive without saving the AXM state, it may lose the state
> > after VM exit/entry.
>
> I think this should not happen, unless you also document that other
> random events (hypothetically, it could be some other core using AMX?)
> can cause the loss of XTILEDATA if XFD[AMX]=1. Virtualization should
> not be special, I'd prefer that the guest has the same behavior as bare
> metal in this respect.
>

The state may be lost with the simple version suggested by Thomas,
because with XFD[AMX]=1 the AMX state won't be stored to
guest_fpstate when the vcpu thread is preempted and then get lost
when restored. To emulate the hardware behavior this will need
additional trick to force XFD[AMX]=0 if XGETBV(1) is true (ignoring
guest XFD) in fpu_update_guest_xfd_state(void). It also implies
that the actual guest XFD needs to be saved in a place before forcing
XFD[AMX] to 0 and recovered (after interrupt is disabled) before
entering the guest.

We are not sure whether such trick is worthwhile, since a sane
guest shouldn't set XFD[AMX]=1 before storing the AMX state. This
is why we want to seek SDM change to mark out that the software
should not assume XTILEDATA is still valid when XFD[AMX]=1.

Then KVM can make a simple assumption that once XFD[AMX]=1
it implies the guest doesn't care about the AMX state. Then the
simple version from Thomas can work perfectly.

Thanks
Kevin

2021-11-17 12:24:24

by Thomas Gleixner

[permalink] [raw]
Subject: RE: Thoughts of AMX KVM support based on latest kernel

Tian,

On Wed, Nov 17 2021 at 10:15, Tian, Kevin wrote:
> We are not sure whether such trick is worthwhile, since a sane
> guest shouldn't set XFD[AMX]=1 before storing the AMX state. This
> is why we want to seek SDM change to mark out that the software
> should not assume XTILEDATA is still valid when XFD[AMX]=1.

Yes, please. Anything else is just causing too much hackery for no
value.

Thanks,

tglx

2021-11-17 12:53:58

by Paolo Bonzini

[permalink] [raw]
Subject: Re: Thoughts of AMX KVM support based on latest kernel

On 11/17/21 11:15, Tian, Kevin wrote:
> We are not sure whether such trick is worthwhile, since a sane
> guest shouldn't set XFD[AMX]=1 before storing the AMX state. This
> is why we want to seek SDM change to mark out that the software
> should not assume XTILEDATA is still valid when XFD[AMX]=1.

Okay, I just don't want it to be called out as virtualization specific.

It doesn't have to happen in current processors, but it should be
architecturally valid behavior to clear the processor's state as soon as
a bit in XFD is set to 1.

Paolo


2021-11-18 23:17:48

by Nakajima, Jun

[permalink] [raw]
Subject: Re: Thoughts of AMX KVM support based on latest kernel

On Nov 17, 2021, at 4:53 AM, Paolo Bonzini <[email protected]> wrote:
>
> On 11/17/21 11:15, Tian, Kevin wrote:
>> We are not sure whether such trick is worthwhile, since a sane
>> guest shouldn't set XFD[AMX]=1 before storing the AMX state. This
>> is why we want to seek SDM change to mark out that the software
>> should not assume XTILEDATA is still valid when XFD[AMX]=1.
>
> Okay, I just don't want it to be called out as virtualization specific.
>
> It doesn't have to happen in current processors, but it should be architecturally valid behavior to clear the processor's state as soon as a bit in XFD is set to 1.
>
> Paolo
>

We recommend that "system software initialize AMX state _before_ doing so" (below). Also, I think what the “creative” guest is doing is "lazy restore”, and "This approach will not operate correctly for a variety of reasons."

https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf


3.3 RECOMMENDATIONS FOR SYSTEM SOFTWARE

System software may disable use of Intel AMX by clearing XCR0[18:17], by clearing CR4.OSXSAVE, or by setting
IA32_XFD[18]. It is recommended that system software initialize AMX state (e.g., by executing TILERELEASE)
before doing so. This is because maintaining AMX state in a non-initialized state may have negative power and performance implications.

System software should not use XFD to implement a “lazy restore” approach to management of the XTILEDATA
state component. This approach will not operate correctly for a variety of reasons. One is that the LDTILECFG and TILERELEASE instructions initialize XTILEDATA and do not cause an #NM exception. Another is that an execution of XSAVE by a user thread will save XTILEDATA as initialized instead of the data expected by the user thread.

---
Jun





2021-11-19 10:13:10

by Thomas Gleixner

[permalink] [raw]
Subject: Re: Thoughts of AMX KVM support based on latest kernel

Jun,

On Thu, Nov 18 2021 at 23:17, Jun Nakajima wrote:
> On Nov 17, 2021, at 4:53 AM, Paolo Bonzini <[email protected]> wrote:
>>
>> It doesn't have to happen in current processors, but it should be
>> architecturally valid behavior to clear the processor's state as soon
>> as a bit in XFD is set to 1.
>
> 3.3 RECOMMENDATIONS FOR SYSTEM SOFTWARE
>
> System software may disable use of Intel AMX by clearing XCR0[18:17],
> by clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended
> that system software initialize AMX state (e.g., by executing
> TILERELEASE) before doing so. This is because maintaining AMX state in
> a non-initialized state may have negative power and performance
> implications.
>
> System software should not use XFD to implement a “lazy restore”
> approach to management of the XTILEDATA state component. This approach
> will not operate correctly for a variety of reasons. One is that the
> LDTILECFG and TILERELEASE instructions initialize XTILEDATA and do not
> cause an #NM exception. Another is that an execution of XSAVE by a
> user thread will save XTILEDATA as initialized instead of the data
> expected by the user thread.

Can this pretty please be reworded so that it says:

When setting IA32_XFD[18] the AMX register state is not guaranteed to
be preserved. The resulting register state depends on the
implementation.

Also it's a real design disaster that component 17 cannot be fenced off
via XFD. That's really inconsistent and leads exactly to this half
defined state.

Thanks,

tglx

2021-11-19 15:42:04

by Nakajima, Jun

[permalink] [raw]
Subject: Re: Thoughts of AMX KVM support based on latest kernel

Hi Thomas,

> On Nov 19, 2021, at 2:13 AM, Thomas Gleixner <[email protected]> wrote:
>
> Jun,
>
> On Thu, Nov 18 2021 at 23:17, Jun Nakajima wrote:
>> On Nov 17, 2021, at 4:53 AM, Paolo Bonzini <[email protected]> wrote:
>>>
>>> It doesn't have to happen in current processors, but it should be
>>> architecturally valid behavior to clear the processor's state as soon
>>> as a bit in XFD is set to 1.
>>
>> 3.3 RECOMMENDATIONS FOR SYSTEM SOFTWARE
>>
>> System software may disable use of Intel AMX by clearing XCR0[18:17],
>> by clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended
>> that system software initialize AMX state (e.g., by executing
>> TILERELEASE) before doing so. This is because maintaining AMX state in
>> a non-initialized state may have negative power and performance
>> implications.
>>
>> System software should not use XFD to implement a “lazy restore”
>> approach to management of the XTILEDATA state component. This approach
>> will not operate correctly for a variety of reasons. One is that the
>> LDTILECFG and TILERELEASE instructions initialize XTILEDATA and do not
>> cause an #NM exception. Another is that an execution of XSAVE by a
>> user thread will save XTILEDATA as initialized instead of the data
>> expected by the user thread.
>
> Can this pretty please be reworded so that it says:
>
> When setting IA32_XFD[18] the AMX register state is not guaranteed to
> be preserved. The resulting register state depends on the
> implementation.
>
> Also it's a real design disaster that component 17 cannot be fenced off
> via XFD. That's really inconsistent and leads exactly to this half
> defined state.
>

I’ll work with the H/W team on those (i.e. rewording and the component 17 issue).

Thanks,
---
Jun






2021-12-08 00:50:52

by Nakajima, Jun

[permalink] [raw]
Subject: Rewording of Setting IA32_XFD[18] (Re: Thoughts of AMX KVM support based on latest kernel)



> On Nov 19, 2021, at 7:41 AM, Nakajima, Jun <[email protected]> wrote:
>
> Hi Thomas,
>
>> On Nov 19, 2021, at 2:13 AM, Thomas Gleixner <[email protected]> wrote:
>>
>> Jun,
>>
>> On Thu, Nov 18 2021 at 23:17, Jun Nakajima wrote:
>>> On Nov 17, 2021, at 4:53 AM, Paolo Bonzini <[email protected]> wrote:
>>>>
>>>> It doesn't have to happen in current processors, but it should be
>>>> architecturally valid behavior to clear the processor's state as soon
>>>> as a bit in XFD is set to 1.
>>>
>>> 3.3 RECOMMENDATIONS FOR SYSTEM SOFTWARE
>>>
>>> System software may disable use of Intel AMX by clearing XCR0[18:17],
>>> by clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended
>>> that system software initialize AMX state (e.g., by executing
>>> TILERELEASE) before doing so. This is because maintaining AMX state in
>>> a non-initialized state may have negative power and performance
>>> implications.
>>>
>>> System software should not use XFD to implement a “lazy restore”
>>> approach to management of the XTILEDATA state component. This approach
>>> will not operate correctly for a variety of reasons. One is that the
>>> LDTILECFG and TILERELEASE instructions initialize XTILEDATA and do not
>>> cause an #NM exception. Another is that an execution of XSAVE by a
>>> user thread will save XTILEDATA as initialized instead of the data
>>> expected by the user thread.
>>
>> Can this pretty please be reworded so that it says:
>>
>> When setting IA32_XFD[18] the AMX register state is not guaranteed to
>> be preserved. The resulting register state depends on the
>> implementation.
>>
>> Also it's a real design disaster that component 17 cannot be fenced off
>> via XFD. That's really inconsistent and leads exactly to this half
>> defined state.
>>
>
> I’ll work with the H/W team on those (i.e. rewording and the component 17 issue).
>

The following is a possible documentation update that may convey the rewording you requested.
Does this (the last sentence, “In addition, “) work for you?


3.3 RECOMMENDATIONS FOR SYSTEM SOFTWARE

System software may disable use of Intel AMX by clearing XCR0[18:17], by clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. System software should initialize AMX state (e.g., by executing TILERELEASE) when doing so because maintaining AMX state in a non-initialized state may have negative power and performance implications. In addition, software should not rely on the state of the tile data after setting IA32_XFD[18]; software should always reload or reinitialize the tile data after clearing IA32_XFD[18].

Thanks,
---
Jun



2021-12-08 13:57:37

by Paolo Bonzini

[permalink] [raw]
Subject: Re: Rewording of Setting IA32_XFD[18] (Re: Thoughts of AMX KVM support based on latest kernel)

On 12/8/21 01:50, Nakajima, Jun wrote:
> 3.3 RECOMMENDATIONS FOR SYSTEM SOFTWARE
>
> System software may disable use of Intel AMX by clearing XCR0[18:17],
> by clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. System software
> should initialize AMX state (e.g., by executing TILERELEASE) when
> doing so because maintaining AMX state in a non-initialized state may
> have negative power and performance implications. In addition,
> software should not rely on the state of the tile data after setting

I would change this to "must not rely", otherwise looks good. Thanks!

Paolo

> IA32_XFD[18]; software should always reload or reinitialize the tile
> data after clearing IA32_XFD[18].



2021-12-10 21:53:43

by Thomas Gleixner

[permalink] [raw]
Subject: Re: Rewording of Setting IA32_XFD[18] (Re: Thoughts of AMX KVM support based on latest kernel)

Jun,

On Wed, Dec 08 2021 at 00:50, Jun Nakajima wrote:
>> On Nov 19, 2021, at 7:41 AM, Nakajima, Jun <[email protected]> wrote:
>> I’ll work with the H/W team on those (i.e. rewording and the component 17 issue).
>>
>
> The following is a possible documentation update that may convey the
> rewording you requested. Does this (the last sentence, “In addition,
> “) work for you?
>
> 3.3 RECOMMENDATIONS FOR SYSTEM SOFTWARE
>
> System software may disable use of Intel AMX by clearing XCR0[18:17],
> by clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. System software
> should initialize AMX state (e.g., by executing TILERELEASE) when
> doing so because maintaining AMX state in a non-initialized state may
> have negative power and performance implications. In addition,
> software should not rely on the state of the tile data after setting
> IA32_XFD[18]; software should always reload or reinitialize the tile
> data after clearing IA32_XFD[18].

looks good to me.

Thanks,

tglx