From:   Thomas Gleixner <tglx@linutronix.de>
To:     "Tian, Kevin" <kevin.tian@intel.com>,
        "Zhong, Yang" <yang.zhong@intel.com>,
        "x86@kernel.org" <x86@kernel.org>,
        "kvm@vger.kernel.org" <kvm@vger.kernel.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        "mingo@redhat.com" <mingo@redhat.com>,
        "bp@alien8.de" <bp@alien8.de>,
        "dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
        "pbonzini@redhat.com" <pbonzini@redhat.com>
Cc:     "seanjc@google.com" <seanjc@google.com>,
        "Nakajima, Jun" <jun.nakajima@intel.com>,
        "jing2.liu@linux.intel.com" <jing2.liu@linux.intel.com>,
        "Liu, Jing2" <jing2.liu@intel.com>,
        "Zhong, Yang" <yang.zhong@intel.com>
Subject: RE: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly
In-Reply-To: <BN9PR11MB5276DF25E38EE7C4F4D29F288C729@BN9PR11MB5276.namprd11.prod.outlook.com>
References: <20211208000359.2853257-1-yang.zhong@intel.com>
 <20211208000359.2853257-16-yang.zhong@intel.com> <87pmq4vw54.ffs@tglx>
 <BN9PR11MB5276DF25E38EE7C4F4D29F288C729@BN9PR11MB5276.namprd11.prod.outlook.com>
Date:   Sat, 11 Dec 2021 14:29:11 +0100
Message-ID: <87zgp7uv6g.ffs@tglx>
MIME-Version: 1.0
Content-Type: text/plain
Precedence: bulk

Kevin,

On Sat, Dec 11 2021 at 03:07, Kevin Tian wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>> #NM in the guest is slow path, right? So why are you trying to optimize
>> for it?
>
> This is really good information. The current logic is obviously
> based on the assumption that #NM is frequently triggered.

More context.

When an application wants to use AMX, it invokes the prctl() which
grants permission. Even after permission is granted, the kernel FPU
state buffers stay at default size and XFD remains armed.

When a thread of that process issues the first AMX (tile) instruction,
then #NM is raised.

The #NM handler does:

    1) Read MSR_XFD_ERR. If 0, goto regular #NM

    2) Write MSR_XFD_ERR to 0

    3) Check whether the process has permission granted. If not,
       raise SIGILL and return.

    4) Allocate and install a larger FPU state buffer for the task.
       If allocation fails, raise SIGSEGV and return.

    5) Disarm XFD for that task
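
A minimal user-space sketch of those five steps, as a model only and
not the kernel's actual code. struct task_model, handle_nm(), the
nm_result codes and the alloc_fails knob are made up for illustration;
the MSRs are modelled as plain struct fields:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome codes for the modelled #NM handler. */
enum nm_result { NM_REGULAR, NM_SIGILL, NM_SIGSEGV, NM_HANDLED };

/* Hypothetical per-task model; field names are illustrative only. */
struct task_model {
	uint64_t xfd_err;	/* stands in for MSR_XFD_ERR */
	uint64_t xfd;		/* stands in for MSR_XFD, set bit == armed */
	uint64_t perm;		/* features granted via the permission call */
	bool large_buf;		/* larger FPU state buffer installed */
	bool alloc_fails;	/* knob to simulate allocation failure */
};

static enum nm_result handle_nm(struct task_model *t)
{
	/* 1) Read MSR_XFD_ERR. If 0, this is a regular #NM. */
	uint64_t err = t->xfd_err;

	if (!err)
		return NM_REGULAR;

	/* 2) Write MSR_XFD_ERR to 0. */
	t->xfd_err = 0;

	/* 3) No permission for the faulting feature(s): SIGILL. */
	if ((t->perm & err) != err)
		return NM_SIGILL;

	/* 4) Allocate and install the larger buffer, SIGSEGV on failure. */
	if (t->alloc_fails)
		return NM_SIGSEGV;
	t->large_buf = true;

	/* 5) Disarm XFD for the feature(s) which faulted. */
	t->xfd &= ~err;
	return NM_HANDLED;
}
```

In this model a second handle_nm() call on the same task returns
NM_REGULAR, because the modelled XFD_ERR was already cleared in step 2.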

That means one thread takes at most one AMX/XFD related #NM during its
lifetime, which means two VMEXITs.

If there are other XFD controlled facilities in the future, then it will
be NR_USED_XFD_CONTROLLED_FACILITIES * 2 VMEXITs per thread which uses
them. Not the end of the world either.

Looking at the targeted application space, it's pretty unlikely that
tasks which utilize AMX are going to be so short lived that the overhead
of these VMEXITs really matters.

This of course can be revisited when there is a sane use case, but
optimizing for it prematurely does not buy us anything other than
pointless complexity.

>> The straight forward solution to this is:
>> 
>>     1) Trap #NM and MSR_XFD_ERR write
>
> and #NM vmexit handler should be called in kvm_x86_handle_exit_irqoff()
> before preemption is enabled, otherwise there is still a small window
> where MSR_XFD_ERR might be clobbered after preemption enable and
> before #NM handler is actually called.

Yes.

Thanks,

        tglx