LinuxLists.cc - [RFC 00/26] Intel Thread Director Virtualization

2024-02-03 14:57:43

Subject: [RFC 00/26] Intel Thread Director Virtualization

From: Zhao Liu <[email protected]>

Hi list,

This is our RFC to virtualize the Intel Thread Director (ITD) feature
for Guests. It is based on Ricardo's patch series adding ITD-related
support to the HFI driver ("[PATCH 0/9] thermal: intel: hfi: Prework for
the virtualization of HFI" [1]).

In short, the purpose of this patch set is to enable the ITD-based
scheduling logic in Guest so that Guest can better schedule Guest tasks
on Intel hybrid platforms.

Currently, ITD is necessary for Windows VMs. With ITD virtualization
support, a Windows 11 Guest can see a significant performance
improvement (for example, on i9-13900K, up to 14%+ improvement on
3DMARK).

Our ITD virtualization is not bound to the VM's hybrid topology or the
vCPUs' CPU affinity. However, in our practice, the ITD scheduling
optimization for Win11 VMs works best when combined with hybrid topology
and CPU affinity (this is related to the specific implementation of
Win11 scheduling). For more details, please see Section 1.2, "About
hybrid topology and vCPU pinning".

To enable ITD-related scheduling optimization in a Win11 VM, some other
thermal-related support is also needed (HWP, CPPC), but we can emulate
it with dummy values in the VMM (we'll send out extra patches for these
in the future).

Your feedback is welcome!

1. Background and Motivation
============================

1.1. Background
^^^^^^^^^^^^^^^

Our use case is running games in a client Windows VM as a cloud gaming
solution.

Gaming VMs are performance-sensitive VMs on Client, so they usually
have two characteristics to ensure interactivity and performance:

i) The number of vCPUs is equal to or close to the number of Host pCPUs.

ii) The vCPUs of Gaming VM are often bound to the pCPUs to achieve
exclusive resources and avoid the overhead of migration.

In this case, Host can't provide effective scheduling for Guest, so we
need to deliver more hardware-assisted scheduling capabilities to Guest
to enhance Guest's scheduling.

Windows 11 (and future Windows products) is heavily optimized for the
Intel hybrid platform. To get the best performance, we need to
virtualize hybrid scheduling features (HFI/ITD) for Windows Guest.

1.2. About hybrid topology and vCPU pinning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Our ITD virtualization can support most vCPU topologies (except multiple
packages/dies, see details in 3.5 Restrictions on Guest Topology), and
can also support the case of non-pinning vCPUs (i.e. it can handle vCPU
thread migration).

The following is our performance measurement on an i9-13900K machine
(2995 MHz, 24 cores, 32 threads (8+16), RAM: 14 GB (16 GB physical)),
with iGPU passthrough, running 3DMARK in a Win11 Professional Guest:

compared with smp topo case     smp topo     smp topo     smp topo     hybrid topo  hybrid topo  hybrid topo  hybrid topo
                                + affinity   + ITD        + ITD                     + affinity   + ITD        + ITD
                                                          + affinity                                          + affinity
Time Spy - Overall              0.179%       -0.250%      0.179%       -0.107%      0.143%       -0.179%      -0.107%
  Graphics score                0.124%       -0.249%      0.124%       -0.083%      0.124%       -0.166%      -0.249%
  CPU score                     0.916%       -0.485%      1.149%       -0.076%      0.722%       -0.324%      11.915%
Fire Strike Extreme - Overall   0.149%       0.000%       0.224%       -1.021%      -3.361%      -1.319%      -3.361%
  Graphics score                0.100%       0.050%       0.150%       -1.376%      -3.427%      -1.676%      -3.652%
  Physics score                 5.060%       0.759%       0.518%       -2.907%      -10.914%     -0.897%      14.638%
  Combined score                0.120%       -0.179%      0.418%       0.060%       -2.929%      -0.179%      -2.809%
Fire Strike - Overall           0.350%       -0.085%      0.193%       -1.377%      -1.365%      -1.509%      -1.787%
  Graphics score                0.256%       -0.047%      0.210%       -1.527%      -1.376%      -1.504%      -2.320%
  Physics score                 3.695%       -2.180%      0.629%       -1.581%      -6.846%      -1.444%      14.100%
  Combined score                0.415%       -0.128%      0.128%       -0.957%      -1.052%      -1.594%      -0.957%
CPU Profile - Max Threads       1.836%       0.298%       1.786%       -0.069%      1.545%       0.025%       9.472%
  16 Threads                    4.290%       0.989%       3.588%       0.595%       1.580%       0.848%       11.295%
  8 Threads                     -22.632%     -0.602%      -23.167%     -0.988%      -1.345%      -1.340%      8.648%
  4 Threads                     -21.598%     0.449%       -21.429%     -0.817%      1.951%       -0.832%      2.084%
  2 Threads                     -12.912%     -0.014%      -12.006%     -0.481%      -0.609%      -0.595%      1.161%
  1 Thread                      -3.793%      -0.137%      -3.793%      -0.495%      -3.189%      -0.495%      1.154%

Based on the above results, exposing only HFI/ITD to Win11 VMs without
hybrid topology or CPU affinity (case "smp topo + ITD") won't hurt
performance, but won't yield any performance improvement either.

With both hybrid topology and CPU affinity set for ITD, Win11 VMs get a
significant performance improvement (up to 14%+, compared with the case
of smp topology without CPU affinity).

Beyond the 3DMARK numbers, in practice there is also a significant
improvement in the frame rate of games.

Also, the more powerful the machine, the more significant the
performance gains!

Therefore, the best practice for enabling ITD scheduling optimization
is to set up both CPU affinity and hybrid topology for win11 Guest while
enabling our ITD virtualization.

Our earlier QEMU prototype RFC [2] presented the initial hybrid
topology support for VMs. Our other proposal, "QOM topology" [3], has
since been raised in the QEMU community; it is the first step towards a
QOM-based hybrid topology implementation.

2. Introduction of HFI and ITD
==============================

Intel provides the Hardware Feedback Interface (HFI) feature to allow
hardware to provide guidance to the OS scheduler to perform optimal
workload scheduling through a hardware feedback interface structure in
memory [4]. This HFI structure is called the HFI table.

For now, the guidance includes performance and energy efficiency
hints, and it can be updated via thermal interrupt as the actual
operating conditions of the processor change during run time.

The Intel Thread Director (ITD) feature extends HFI to provide
performance and energy efficiency data for advanced classes of
instructions.

Since ITD is an extension of HFI, our ITD virtualization also
virtualizes the native HFI feature.

3. Dependencies of ITD
======================

ITD is a thermal feature that requires:
* PTM (Package Thermal Management, alias, PTS)
* HFI (Hardware Feedback Interface)

In order to support the notification mechanism of ITD/HFI dynamic
update, we also need to add thermal interrupt related support,
including the following two features:
* ACPI (Thermal Monitor and Software Controlled Clock Facilities)
* TM (Thermal Monitor, alias, TM1/ACC)

Therefore, we must also consider support for the emulation of all
the above dependencies.

3.1. ACPI emulation
^^^^^^^^^^^^^^^^^^^

For ACPI, we can support it by emulating the RDMSR/WRMSR of the
associated MSRs and adding the ability to inject thermal interrupts.
But in fact, we don't really inject thermal interrupts into the Guest
for the thermal conditions corresponding to ACPI. Here the thermal
interrupt is prepared for the subsequent HFI/ITD.

3.2. TM emulation
^^^^^^^^^^^^^^^^^

TM is a hardware feature and its CPUID bit only indicates the presence
of the automatic thermal monitoring facilities. For TM, there's no
interactive interface between OS and hardware, but its flag is one of
the prerequisites for the OS to enable thermal interrupt.

Therefore, to support TM, it is enough for us to expose its CPUID flag
to the Guest.

3.3. PTM emulation
^^^^^^^^^^^^^^^^^^

PTM is a package-scope feature that includes package-level MSR and
package-level thermal interrupt. Unfortunately, KVM currently only
supports thread-scope MSR handling, and also doesn't care about the
specific Guest's topology.

But since our purpose in supporting PTM in KVM is to further support
ITD, and the current platforms with ITD all have 1 package, we emulate
the package-scope MSRs provided by PTM at the VM level.

In this way, the VMM is required to set only one package topology for
the PTM. In order to alleviate this limitation, we only expose the PTM
feature bit to Guest when ITD needs to be supported.

3.4. HFI emulation
^^^^^^^^^^^^^^^^^^

ITD is the extension of HFI, so both HFI and ITD depend on HFI table.
HFI itself is used on the Host for power-related management control, so
we should only expose HFI to Guest when we need to enable ITD.

HFI also relies on PTM interrupt control, so it also has requirements
for package topology, and we also emulate HFI (including ITD) at the VM
level.

In addition, because the HFI driver allocates HFI instances per die,
HFI (and ITD) emulation also requires the Guest to be configured with
only one die.

3.5. Restrictions on Guest Topology
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Due to KVM's incomplete support for MSR topology and the requirement for
HFI instance management in the kernel, PTM, HFI, and ITD limit the
topology of the Guest (mainly restricting the topology types created on
the VMM side).

Therefore, we only expose PTM, HFI, and ITD to userspace when we need to
support ITD. At the same time, considering that currently, ITD is only
used on the client platform with 1 package and 1 die, such temporary
restrictions will not have too much impact.

4. Overview of ITD (and HFI) virtualization
===========================================

The main tasks of ITD (including HFI) virtualization are:
* maintain a virtual HFI table for VM.
* inject thermal interrupt when HFI table updates.
* handle related MSRs' emulation and adjust HFI table based on MSR's
control bits.
* expose ITD/HFI configuration info in related CPUID leaves.

The most important of these is the maintenance of the virtual HFI table.
Although the HFI table should also be per package, since ITD/HFI related
MSRs are treated as per VM in KVM, we also treat the virtual HFI table
as per VM.

4.1. HFI table building
^^^^^^^^^^^^^^^^^^^^^^^

The HFI table contains a table header and many table entries. Each
table entry is identified by an HFI table index, and each CPU
corresponds to one of the HFI table indexes.

ITD and HFI features both depend on the HFI table, but their HFI tables
are a little different. The HFI table provided by the ITD feature has
more classes (in terms of more columns in the table) than the HFI table
of the native HFI feature.

The virtual HFI table in KVM is built based on the actual HFI table,
which is maintained by the HFI instance in the HFI driver. We extract
the HFI data of the pCPUs that the vCPUs are running on to form the
virtual HFI table.

4.2. HFI table index
^^^^^^^^^^^^^^^^^^^^

There are many entries in the HFI table, and each vCPU is assigned an
HFI table index to specify the entry it maps to. KVM fills the pCPU's
HFI data (the pCPU that the vCPU is running on) into the entry
corresponding to the HFI table index of the vCPU in the virtual HFI
table.

This index is set by VMM in CPUID.

4.3. HFI table updating
^^^^^^^^^^^^^^^^^^^^^^^

On some platforms, the HFI table will be dynamically updated with
thermal interrupts. In order to update the virtual HFI table in time, we
added the per-VM notifier to the HFI driver to notify KVM to update the
virtual HFI table for the VM, and then inject thermal interrupt into the
VM to notify the Guest.

There is another case that requires updating the virtual HFI table:
when a vCPU is migrated, the pCPU it runs on changes, and the
corresponding virtual HFI data should also be updated to the new pCPU's
data. In this case, to reduce overhead, we update only the data of that
single vCPU without traversing the entire virtual HFI table.
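
As a rough illustration of the two update paths above, the following
standalone C sketch models a single-class virtual HFI table. It is not
the code from this series: struct hfi_caps, struct vcpu_model, the
virt_hfi_*() helpers, and the table sizes are all hypothetical, and the
"changed" return value is only an assumed way for a caller to decide
whether the Guest needs to be notified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_PCPUS	4	/* hypothetical number of Host pCPUs */
#define NR_ENTRIES	4	/* hypothetical number of virtual table entries */

/* Simplified single-class HFI data: one hint per capability. */
struct hfi_caps {
	uint8_t perf;	/* performance hint */
	uint8_t ee;	/* energy efficiency hint */
};

struct vcpu_model {
	unsigned int pcpu;	/* pCPU the vCPU currently runs on */
	unsigned int hfi_index;	/* HFI table index set by the VMM in CPUID */
};

struct virt_hfi_table {
	struct hfi_caps entry[NR_ENTRIES];
};

/* Copy the pCPU's data into the vCPU's entry; report whether it changed. */
static bool virt_hfi_update_vcpu(struct virt_hfi_table *vt,
				 const struct hfi_caps *host,
				 const struct vcpu_model *vcpu)
{
	struct hfi_caps *dst;
	const struct hfi_caps *src;
	bool changed;

	assert(vcpu->hfi_index < NR_ENTRIES);
	assert(vcpu->pcpu < NR_PCPUS);

	dst = &vt->entry[vcpu->hfi_index];
	src = &host[vcpu->pcpu];
	changed = dst->perf != src->perf || dst->ee != src->ee;
	*dst = *src;

	return changed;
}

/* Host HFI table update: walk every vCPU and refresh its entry. */
static bool virt_hfi_refresh_all(struct virt_hfi_table *vt,
				 const struct hfi_caps *host,
				 const struct vcpu_model *vcpus,
				 size_t nr_vcpus)
{
	bool changed = false;
	size_t i;

	for (i = 0; i < nr_vcpus; i++) {
		if (virt_hfi_update_vcpu(vt, host, &vcpus[i]))
			changed = true;
	}

	return changed;
}

/* vCPU migration: only the migrated vCPU's entry is touched. */
static bool virt_hfi_migrate(struct virt_hfi_table *vt,
			     const struct hfi_caps *host,
			     struct vcpu_model *vcpu, unsigned int new_pcpu)
{
	vcpu->pcpu = new_pcpu;
	return virt_hfi_update_vcpu(vt, host, vcpu);
}
```

In this model the migration path costs one entry copy regardless of the
number of vCPUs, which is the overhead reduction described above.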

5. Patch Summary
================

Patch 01-03: Prepare the bit definition, the hfi helpers and hfi data
structures that KVM needs.
Patch 04-05: Add the sched_out arch hook and reset the classification
history at sched_in()/sched_out().
Patch 06-10: Add emulations of ACPI, TM and PTM, mainly about CPUID and
related MSRs.
Patch 11-20: Add the emulation support for HFI, including maintaining
the HFI table for VM.
Patch 21-23: Add the emulation support for ITD, including extending HFI
to ITD and passing through the classification MSRs.
Patch 24-25: Add HRESET emulation support, which is also used by IPC
classes feature.
Patch 26: Add the brief doc about the per-VM lock - pkg_therm_lock.

6. References
=============

[1]: [PATCH 0/9] thermal: intel: hfi: Prework for the virtualization of HFI
https://lore.kernel.org/lkml/[email protected]/
[2]: [RFC 00/52] Introduce hybrid CPU topology,
https://lore.kernel.org/qemu-devel/[email protected]/
[3]: [RFC 00/41] qom-topo: Abstract Everything about CPU Topology,
https://lore.kernel.org/qemu-devel/[email protected]/
[4]: SDM, vol. 3B, section 15.6 HARDWARE FEEDBACK INTERFACE AND INTEL
THREAD DIRECTOR

Thanks and Best Regards,
Zhao
---
Zhao Liu (17):
thermal: Add bit definition for x86 thermal related MSRs
KVM: Add kvm_arch_sched_out() hook
KVM: x86: Reset hardware history at vCPU's sched_in/out
KVM: VMX: Add helpers to handle the writes to MSR's R/O and R/WC0 bits
KVM: x86: cpuid: Define CPUID 0x06.eax by kvm_cpu_cap_mask()
KVM: VMX: Introduce HFI description structure
KVM: VMX: Introduce HFI table index for vCPU
KVM: x86: Introduce the HFI dynamic update request and kvm_x86_ops
KVM: VMX: Allow to inject thermal interrupt without HFI update
KVM: VMX: Emulate HFI related bits in package thermal MSRs
KVM: VMX: Emulate the MSRs of HFI feature
KVM: x86: Expose HFI feature bit and HFI info in CPUID
KVM: VMX: Extend HFI table and MSR emulation to support ITD
KVM: VMX: Pass through ITD classification related MSRs to Guest
KVM: x86: Expose ITD feature bit and related info in CPUID
KVM: VMX: Emulate the MSR of HRESET feature
Documentation: KVM: Add description of pkg_therm_lock

Zhuocheng Ding (9):
thermal: intel: hfi: Add helpers to build HFI/ITD structures
thermal: intel: hfi: Add HFI notifier helpers to notify HFI update
KVM: VMX: Emulate ACPI (CPUID.0x01.edx[bit 22]) feature
KVM: x86: Expose TM/ACC (CPUID.0x01.edx[bit 29]) feature bit to VM
KVM: VMX: Emulate PTM/PTS (CPUID.0x06.eax[bit 6]) feature
KVM: VMX: Support virtual HFI table for VM
KVM: VMX: Sync update of Host HFI table to Guest
KVM: VMX: Update HFI table when vCPU migrates
KVM: x86: Expose HRESET feature's CPUID to Guest

Documentation/virt/kvm/locking.rst | 13 +-
arch/arm64/include/asm/kvm_host.h | 1 +
arch/mips/include/asm/kvm_host.h | 1 +
arch/powerpc/include/asm/kvm_host.h | 1 +
arch/riscv/include/asm/kvm_host.h | 1 +
arch/s390/include/asm/kvm_host.h | 1 +
arch/x86/include/asm/hfi.h | 28 ++
arch/x86/include/asm/kvm-x86-ops.h | 3 +-
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/include/asm/msr-index.h | 54 +-
arch/x86/kvm/cpuid.c | 201 +++++++-
arch/x86/kvm/irq.h | 1 +
arch/x86/kvm/lapic.c | 9 +
arch/x86/kvm/svm/svm.c | 8 +
arch/x86/kvm/vmx/vmx.c | 751 +++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.h | 79 ++-
arch/x86/kvm/x86.c | 18 +
drivers/thermal/intel/intel_hfi.c | 212 +++++++-
drivers/thermal/intel/therm_throt.c | 1 -
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 1 +
21 files changed, 1343 insertions(+), 44 deletions(-)

--
2.34.1

2024-02-03 15:06:39

by Zhao Liu

Subject: [RFC 24/26] KVM: VMX: Emulate the MSR of HRESET feature

From: Zhao Liu <[email protected]>

HRESET is a feature associated with ITD. It provides the HRESET
instruction, which resets the ITD-related history accumulated on the
logical processor it executes on [1]. The HRESET instruction does not
cause a VM exit and is therefore available to the Guest by default when
the HRESET feature bit is set for the Guest.

The HRESET feature also provides a thread scope MSR to control the
enabling of the ITD history reset via the HRESET instruction [2]:
MSR_IA32_HW_HRESET_ENABLE.

This MSR can control the hardware, so we emulate it for the Guest, which
keeps the Guest's changes to the hardware under the control of the Host.

Since the HRESET enabling status may differ between Guest and Host, we
store the MSR_IA32_HW_HRESET_ENABLE values of Host and Guest in vcpu_vmx
and save/load their respective configurations on Guest/Host switch.
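
The save/load scheme above can be modeled outside the kernel with a
plain variable standing in for the MSR. This is only a sketch under that
assumption: fake_hreset_msr, fake_wrmsr(), struct hreset_state, and the
hreset_enter()/hreset_exit() names are hypothetical, while the patch
itself uses rdmsrl()/wrmsrl() and the fields added to vcpu_hfi_desc.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the hardware MSR; the real code uses rdmsrl()/wrmsrl(). */
static uint64_t fake_hreset_msr;
static unsigned int fake_msr_writes;	/* counts simulated WRMSRs */

struct hreset_state {
	uint64_t host_enable;	/* Host value, sampled at Guest entry */
	uint64_t guest_enable;	/* Guest value, set by the emulated WRMSR */
};

static void fake_wrmsr(uint64_t val)
{
	fake_hreset_msr = val;
	fake_msr_writes++;
}

/* Before entering the Guest: save the Host value, load the Guest value. */
static void hreset_enter(struct hreset_state *s)
{
	assert(s);

	s->host_enable = fake_hreset_msr;
	if (s->host_enable != s->guest_enable)
		fake_wrmsr(s->guest_enable);
}

/*
 * After leaving the Guest: restore the Host value if it was replaced.
 * The Guest cannot write the MSR directly, so its value is not re-read.
 */
static void hreset_exit(const struct hreset_state *s)
{
	assert(s);

	if (s->host_enable != s->guest_enable)
		fake_wrmsr(s->host_enable);
}
```

When Host and Guest hold the same value, both helpers skip the write, so
the common case adds one MSR read per entry and no MSR writes.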

[1]: SDM, vol. 3B, section 15.6.11 Logical Processor Scope History
[2]: SDM, vol. 2A, chap. CPUID--CPU Identification, CPUID.07H.01H.EAX
[Bit 22], HRESET.

Tested-by: Yanting Jiang <[email protected]>
Co-developed-by: Zhuocheng Ding <[email protected]>
Signed-off-by: Zhuocheng Ding <[email protected]>
Signed-off-by: Zhao Liu <[email protected]>
---
arch/x86/kvm/svm/svm.c | 1 +
arch/x86/kvm/vmx/vmx.c | 54 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.h | 2 ++
arch/x86/kvm/x86.c | 1 +
4 files changed, 58 insertions(+)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 980d93c70eb6..d847dd8eb193 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4295,6 +4295,7 @@ static bool svm_has_emulated_msr(struct kvm *kvm, u32 index)
case MSR_IA32_PACKAGE_THERM_STATUS:
case MSR_IA32_HW_FEEDBACK_CONFIG:
case MSR_IA32_HW_FEEDBACK_PTR:
+ case MSR_IA32_HW_HRESET_ENABLE:
return false;
case MSR_IA32_SMBASE:
if (!IS_ENABLED(CONFIG_KVM_SMM))
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 11d42e0a208b..2d733c959f32 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1314,6 +1314,35 @@ static void itd_guest_exit(struct vcpu_vmx *vmx)
wrmsrl(MSR_IA32_HW_FEEDBACK_THREAD_CONFIG, vcpu_hfi->host_thread_cfg);
}

+static void hreset_guest_enter(struct vcpu_vmx *vmx)
+{
+ struct vcpu_hfi_desc *vcpu_hfi = &vmx->vcpu_hfi_desc;
+
+ if (!kvm_cpu_cap_has(X86_FEATURE_HRESET) ||
+ !guest_cpuid_has(&vmx->vcpu, X86_FEATURE_HRESET))
+ return;
+
+ rdmsrl(MSR_IA32_HW_HRESET_ENABLE, vcpu_hfi->host_hreset_enable);
+ if (unlikely(vcpu_hfi->host_hreset_enable != vcpu_hfi->guest_hreset_enable))
+ wrmsrl(MSR_IA32_HW_HRESET_ENABLE, vcpu_hfi->guest_hreset_enable);
+}
+
+static void hreset_guest_exit(struct vcpu_vmx *vmx)
+{
+ struct vcpu_hfi_desc *vcpu_hfi = &vmx->vcpu_hfi_desc;
+
+ if (!kvm_cpu_cap_has(X86_FEATURE_HRESET) ||
+ !guest_cpuid_has(&vmx->vcpu, X86_FEATURE_HRESET))
+ return;
+
+ /*
+ * MSR_IA32_HW_HRESET_ENABLE is not passed through to Guest, so there
+ * is no need to read the MSR to save the Guest's value.
+ */
+ if (unlikely(vcpu_hfi->host_hreset_enable != vcpu_hfi->guest_hreset_enable))
+ wrmsrl(MSR_IA32_HW_HRESET_ENABLE, vcpu_hfi->host_hreset_enable);
+}
+
void vmx_set_host_fs_gs(struct vmcs_host_state *host, u16 fs_sel, u16 gs_sel,
unsigned long fs_base, unsigned long gs_base)
{
@@ -2462,6 +2491,12 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
msr_info->data = kvm_vmx->pkg_therm.msr_ia32_hfi_ptr;
break;
+ case MSR_IA32_HW_HRESET_ENABLE:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(&vmx->vcpu, X86_FEATURE_HRESET))
+ return 1;
+ msr_info->data = vmx->vcpu_hfi_desc.guest_hreset_enable;
+ break;
default:
find_uret_msr:
msr = vmx_find_uret_msr(vmx, msr_info->index);
@@ -3091,6 +3126,21 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
ret = vmx_set_hfi_ptr_msr(vcpu, msr_info);
mutex_unlock(&kvm_vmx->pkg_therm.pkg_therm_lock);
break;
+ case MSR_IA32_HW_HRESET_ENABLE: {
+ struct kvm_cpuid_entry2 *entry;
+
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(&vmx->vcpu, X86_FEATURE_HRESET))
+ return 1;
+
+ entry = kvm_find_cpuid_entry_index(&vmx->vcpu, 0x20, 0);
+ /* Reserved bits: generate the exception. */
+ if (!msr_info->host_initiated && data & ~entry->ebx)
+ return 1;
+ /* hreset_guest_enter() will update MSR for Guest. */
+ vmx->vcpu_hfi_desc.guest_hreset_enable = data;
+ break;
+ }
default:
find_uret_msr:
msr = vmx_find_uret_msr(vmx, msr_index);
@@ -5513,6 +5563,8 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx->msr_ia32_therm_status = 0;
vmx->vcpu_hfi_desc.host_thread_cfg = 0;
vmx->vcpu_hfi_desc.guest_thread_cfg = 0;
+ vmx->vcpu_hfi_desc.host_hreset_enable = 0;
+ vmx->vcpu_hfi_desc.guest_hreset_enable = 0;

vmx->hv_deadline_tsc = -1;
kvm_set_cr8(vcpu, 0);
@@ -8006,6 +8058,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)

pt_guest_enter(vmx);
itd_guest_enter(vmx);
+ hreset_guest_enter(vmx);

atomic_switch_perf_msrs(vmx);
if (intel_pmu_lbr_is_enabled(vcpu))
@@ -8044,6 +8097,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
loadsegment(es, __USER_DS);
#endif

+ hreset_guest_exit(vmx);
itd_guest_exit(vmx);
pt_guest_exit(vmx);

diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 3d3238dd8fc3..c5b4684a5b51 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -74,6 +74,8 @@ struct pt_desc {
struct vcpu_hfi_desc {
u64 host_thread_cfg;
u64 guest_thread_cfg;
+ u64 host_hreset_enable;
+ u64 guest_hreset_enable;
};

union vmx_exit_reason {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 27bec359907c..04489efc2fb4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1552,6 +1552,7 @@ static const u32 emulated_msrs_all[] = {
MSR_IA32_PACKAGE_THERM_STATUS,
MSR_IA32_HW_FEEDBACK_CONFIG,
MSR_IA32_HW_FEEDBACK_PTR,
+ MSR_IA32_HW_HRESET_ENABLE,

/*
* KVM always supports the "true" VMX control MSRs, even if the host
--
2.34.1

2024-02-03 15:30:36

by Zhao Liu

Subject: [RFC 05/26] KVM: x86: Reset hardware history at vCPU's sched_in/out

From: Zhao Liu <[email protected]>

Reset the classification history of the vCPU thread when it's scheduled
in and scheduled out, so that hardware starts the classification of the
vCPU thread from scratch.

This prevents Host history from leaking to VMs and VM history from
leaking to sibling VMs.

Tested-by: Yanting Jiang <[email protected]>
Signed-off-by: Zhao Liu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 --
arch/x86/kvm/x86.c | 8 ++++++++
2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2be78549bec8..b5b2d0fde579 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2280,8 +2280,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)

int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);

-static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu) {}
-
#define KVM_CLOCK_VALID_FLAGS \
(KVM_CLOCK_TSC_STABLE | KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 363b1c080205..cd9a7251c768 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -79,6 +79,7 @@
#include <asm/div64.h>
#include <asm/irq_remapping.h>
#include <asm/mshyperv.h>
+#include <asm/hreset.h>
#include <asm/hypervisor.h>
#include <asm/tlbflush.h>
#include <asm/intel_pt.h>
@@ -12491,9 +12492,16 @@ void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
pmu->need_cleanup = true;
kvm_make_request(KVM_REQ_PMU, vcpu);
}
+
+ reset_hardware_history();
static_call(kvm_x86_sched_in)(vcpu, cpu);
}

+void kvm_arch_sched_out(struct kvm_vcpu *vcpu)
+{
+ reset_hardware_history();
+}
+
void kvm_arch_free_vm(struct kvm *kvm)
{
#if IS_ENABLED(CONFIG_HYPERV)
--
2.34.1

2024-02-03 16:04:30

by Zhao Liu

Subject: [RFC 20/26] KVM: x86: Expose HFI feature bit and HFI info in CPUID

From: Zhao Liu <[email protected]>

The HFI feature contains the following relevant CPUID fields:

* 0x06.eax[bit 19]: HFI feature bit
* 0x06.ecx[bits 08-15]: Number of HFI/ITD supported classes
* 0x06.edx[bits 00-07]: Bitmap of supported HFI capabilities
* 0x06.edx[bits 08-11]: Enumerates the size of the HFI table in number
of 4 KB pages
* 0x06.edx[bits 16-31]: HFI table index of processor
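
For readers who want to check the bit positions listed above, here is a
small standalone decoder. It is a sketch, not the patch's code: the
patch uses the kernel's union cpuid6_ecx/cpuid6_edx, whereas struct
hfi_cpuid_info and both helper names below are made up for
illustration. The "+ 1" in hfi_nr_table_pages() mirrors the comparison
in kvm_check_hfi_cpuid() in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decoded HFI-related fields of CPUID leaf 0x06 (illustrative names). */
struct hfi_cpuid_info {
	bool has_hfi;		/* EAX[19] */
	uint8_t nr_classes;	/* ECX[15:8] */
	uint8_t capabilities;	/* EDX[7:0], bitmap */
	uint8_t table_pages;	/* EDX[11:8], raw field */
	uint16_t table_index;	/* EDX[31:16] */
};

static struct hfi_cpuid_info hfi_decode_cpuid6(uint32_t eax, uint32_t ecx,
					       uint32_t edx)
{
	struct hfi_cpuid_info info;

	info.has_hfi = (eax >> 19) & 0x1;
	info.nr_classes = (ecx >> 8) & 0xff;
	info.capabilities = edx & 0xff;
	info.table_pages = (edx >> 8) & 0xf;
	info.table_index = (edx >> 16) & 0xffff;

	return info;
}

/* The patch compares the raw field plus one with the table's page count. */
static unsigned int hfi_nr_table_pages(const struct hfi_cpuid_info *info)
{
	assert(info);
	return info->table_pages + 1u;
}
```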

Guest's HFI feature bit (0x06.eax[bit 19]) is based on Host's HFI
enabling.

For the other HFI-related CPUID fields, since they affect the memory
allocation and HFI data filling of the virtual HFI table in KVM, check
them after KVM_SET_CPUID/KVM_SET_CPUID2 to ensure valid HFI feature
information and a valid memory size.
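
The memory-size part of this check boils down to verifying that the row
selected by the HFI table index still lies inside the table. A
standalone sketch of that arithmetic follows; hfi_index_fits() and
SKETCH_PAGE_SIZE are hypothetical names, and the header and row sizes
used in the note after the code are made-up values, not the ones
intel_hfi_build_virt_features() would return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* 4 KB pages, as in the CPUID field */

/*
 * Return true if the row for HFI table index @index lies entirely
 * inside a table of @nr_table_pages pages. Names are illustrative.
 */
static bool hfi_index_fits(uint32_t hdr_size, uint32_t cpu_stride,
			   uint32_t index, uint32_t nr_table_pages)
{
	uint64_t data_size, table_size;

	assert(cpu_stride > 0);

	/* Header plus every row up to and including the one at @index. */
	data_size = (uint64_t)hdr_size + ((uint64_t)index + 1) * cpu_stride;
	table_size = (uint64_t)nr_table_pages * SKETCH_PAGE_SIZE;

	return data_size <= table_size;
}
```

With an assumed 16-byte header and 8-byte rows, a one-page table holds
indexes 0 through 509, and index 510 would be rejected.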

As for the HFI table index, since KVM currently creates the same CPUID
template for all vCPUs, we follow the CPU topology handling and leave
the specific filling of the HFI table index to the user. If the user
does not explicitly specify the HFI index, all vCPUs will share the HFI
entry with HFI index 0.

A shared HFI index is valid per the spec [1]. However, since the data of
the virtual HFI table all comes from the pCPU on which the vCPU is
running, an HFI index shared by vCPUs on different pCPUs might cause
frequent HFI updates, and the virtual HFI table cannot accurately
reflect the actual processor situation, which might have a negative
impact on Guest performance. Therefore, it is better to assign different
HFI table indexes to different vCPUs.

[1]: SDM, vol. 2A, chap. CPUID--CPU Identification, CPUID.06H.EDX[Bits
31-16], which says about HFI table index sharing: "Note that on some
parts the index may be same for multiple logical processors".

Tested-by: Yanting Jiang <[email protected]>
Co-developed-by: Zhuocheng Ding <[email protected]>
Signed-off-by: Zhuocheng Ding <[email protected]>
Signed-off-by: Zhao Liu <[email protected]>
---
arch/x86/kvm/cpuid.c | 136 ++++++++++++++++++++++++++++++++++++-----
arch/x86/kvm/vmx/vmx.c | 7 +++
2 files changed, 128 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index eaac2c8d98b9..4da8f3319917 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -17,6 +17,7 @@
#include <linux/uaccess.h>
#include <linux/sched/stat.h>

+#include <asm/hfi.h>
#include <asm/processor.h>
#include <asm/user.h>
#include <asm/fpu/xstate.h>
@@ -130,12 +131,77 @@ static inline struct kvm_cpuid_entry2 *cpuid_entry2_find(
return NULL;
}

+static int kvm_check_hfi_cpuid(struct kvm_vcpu *vcpu,
+ struct kvm_cpuid_entry2 *entries,
+ int nent)
+{
+ struct hfi_features hfi_features;
+ struct kvm_cpuid_entry2 *best = NULL;
+ bool has_hfi;
+ int nr_classes, ret;
+ union cpuid6_ecx ecx;
+ union cpuid6_edx edx;
+ unsigned int data_size;
+
+ best = cpuid_entry2_find(entries, nent, 0x6, 0);
+ if (!best)
+ return 0;
+
+ has_hfi = cpuid_entry_has(best, X86_FEATURE_HFI);
+ if (!has_hfi)
+ return 0;
+
+ /*
+ * Only the platform with 1 HFI instance (i.e., client platform)
+ * can enable HFI in Guest. For more information, please refer to
+ * the comment in kvm_set_cpu_caps().
+ */
+ if (intel_hfi_max_instances() != 1)
+ return -EINVAL;
+
+ /*
+ * Currently we haven't supported ITD. HFI is the default feature
+ * with 1 class.
+ */
+ nr_classes = 1;
+ ret = intel_hfi_build_virt_features(&hfi_features,
+ nr_classes,
+ vcpu->kvm->created_vcpus);
+ if (ret)
+ return ret;
+
+ ecx.full = best->ecx;
+ edx.full = best->edx;
+
+ if (ecx.split.nr_classes != hfi_features.nr_classes)
+ return -EINVAL;
+
+ if (hweight8(edx.split.capabilities.bits) != hfi_features.class_stride)
+ return -EINVAL;
+
+ if (edx.split.table_pages + 1 != hfi_features.nr_table_pages)
+ return -EINVAL;
+
+ /*
+ * The total size of the row corresponding to index and all
+ * previous data.
+ */
+ data_size = hfi_features.hdr_size + (edx.split.index + 1) *
+ hfi_features.cpu_stride;
+ /* Invalid index. */
+ if (data_size > hfi_features.nr_table_pages << PAGE_SHIFT)
+ return -EINVAL;
+
+ return 0;
+}
+
static int kvm_check_cpuid(struct kvm_vcpu *vcpu,
struct kvm_cpuid_entry2 *entries,
int nent)
{
struct kvm_cpuid_entry2 *best;
u64 xfeatures;
+ int ret;

/*
* The existing code assumes virtual address is 48-bit or 57-bit in the
@@ -155,15 +221,18 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu,
* enabling in the FPU, e.g. to expand the guest XSAVE state size.
*/
best = cpuid_entry2_find(entries, nent, 0xd, 0);
- if (!best)
- return 0;
-
- xfeatures = best->eax | ((u64)best->edx << 32);
- xfeatures &= XFEATURE_MASK_USER_DYNAMIC;
- if (!xfeatures)
- return 0;
+ if (best) {
+ xfeatures = best->eax | ((u64)best->edx << 32);
+ xfeatures &= XFEATURE_MASK_USER_DYNAMIC;
+ if (xfeatures) {
+ ret = fpu_enable_guest_xfd_features(&vcpu->arch.guest_fpu,
+ xfeatures);
+ if (ret)
+ return ret;
+ }
+ }

- return fpu_enable_guest_xfd_features(&vcpu->arch.guest_fpu, xfeatures);
+ return kvm_check_hfi_cpuid(vcpu, entries, nent);
}

/* Check whether the supplied CPUID data is equal to what is already set for the vCPU. */
@@ -633,14 +702,27 @@ void kvm_set_cpu_caps(void)
);

/*
- * PTS is the dependency of ITD, currently we only use PTS for
- * enabling ITD in KVM. Since KVM does not support msr topology at
- * present, the emulation of PTS has restrictions on the topology of
- * Guest, so we only expose PTS when Host enables ITD.
+ * PTS and HFI are the dependencies of ITD, currently we only use PTS/HFI
+ * for enabling ITD in KVM. Since KVM does not support msr topology at
+ * present, the emulation of PTS/HFI has restrictions on the topology of
+ * Guest, so we only expose PTS/HFI when Host enables ITD.
+ *
+ * We also restrict HFI virtualization support to platforms with only 1 HFI
+ * instance (i.e., this is the client platform, and ITD is currently a
+ * client-specific feature), while server platforms with multiple instances
+ * do not require HFI virtualization. This restriction avoids adding
+ * additional complex logic to handle notification register updates when
+ * vCPUs migrate between different HFI instances.
*/
- if (cpu_feature_enabled(X86_FEATURE_ITD)) {
+ if (cpu_feature_enabled(X86_FEATURE_ITD) && intel_hfi_max_instances() == 1) {
if (boot_cpu_has(X86_FEATURE_PTS))
kvm_cpu_cap_set(X86_FEATURE_PTS);
+ /*
+ * Set HFI based on hardware capability. Only when the Host has
+ * the valid HFI instance, KVM can build the virtual HFI table.
+ */
+ if (intel_hfi_enabled())
+ kvm_cpu_cap_set(X86_FEATURE_HFI);
}

kvm_cpu_cap_mask(CPUID_7_0_EBX,
@@ -986,8 +1068,32 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
entry->eax |= 0x4;

entry->ebx = 0;
- entry->ecx = 0;
- entry->edx = 0;
+
+ if (kvm_cpu_cap_has(X86_FEATURE_HFI)) {
+ union cpuid6_ecx ecx;
+ union cpuid6_edx edx;
+
+ ecx.full = 0;
+ edx.full = 0;
+ /* Number of supported HFI classes */
+ ecx.split.nr_classes = 1;
+ /* HFI supports performance and energy efficiency capabilities. */
+ edx.split.capabilities.split.performance = 1;
+ edx.split.capabilities.split.energy_efficiency = 1;
+ /* As default, keep the same HFI table size as host. */
+ edx.split.table_pages = ((union cpuid6_edx)entry->edx).split.table_pages;
+ /*
+ * Default HFI index = 0. User should be careful that
+ * the index differ for each CPUs.
+ */
+ edx.split.index = 0;
+
+ entry->ecx = ecx.full;
+ entry->edx = edx.full;
+ } else {
+ entry->ecx = 0;
+ entry->edx = 0;
+ }
break;
/* function 7 has additional index. */
case 7:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9c28d4ea0b2d..636f2bd68546 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8434,6 +8434,13 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
vmx->msr_ia32_feature_control_valid_bits &=
~FEAT_CTL_SGX_LC_ENABLED;

+ if (guest_cpuid_has(vcpu, X86_FEATURE_HFI) && intel_hfi_enabled()) {
+ struct kvm_cpuid_entry2 *best = kvm_find_cpuid_entry_index(vcpu, 0x6, 0);
+
+ if (best)
+ vmx->hfi_table_idx = ((union cpuid6_edx)best->edx).split.index;
+ }
+
/* Refresh #PF interception to account for MAXPHYADDR changes. */
vmx_update_exception_bitmap(vcpu);
}
--
2.34.1

2024-02-03 17:09:47

by Zhao Liu

Subject: [RFC 07/26] KVM: VMX: Emulate ACPI (CPUID.0x01.edx[bit 22]) feature

From: Zhuocheng Ding <[email protected]>

The ACPI (Thermal Monitor and Software Controlled Clock Facilities)
feature is a dependency of thermal interrupt processing, so it is
required for handling the HFI notification (a thermal interrupt).

To let the VM handle thermal interrupts, we need to emulate the ACPI
feature in KVM:

1. Emulate MSR_IA32_THERM_CONTROL (alias, IA32_CLOCK_MODULATION),
MSR_IA32_THERM_INTERRUPT and MSR_IA32_THERM_STATUS with dummy values.

According to SDM [1], the ACPI feature means:

"The ACPI flag (bit 22) of the CPUID feature flags indicates the
presence of the IA32_THERM_STATUS, IA32_THERM_INTERRUPT,
IA32_CLOCK_MODULATION MSRs, and the xAPIC thermal LVT entry."

Dummy values are enough for KVM to emulate RDMSR/WRMSR on these MSRs.

2. Add the thermal interrupt injection interface.

This interface completes the ACPI emulation. No thermal interrupt is
actually injected into the Guest yet; the following HFI/ITD emulation
patches inject one once the conditions are met.

3. Additionally, expose the CPUID bit of the ACPI feature to the VM,
which helps the VM enable thermal interrupt handling.

[1]: SDM, vol. 3B, section 15.8.4.1, Detection of Software Controlled
Clock Modulation Extension.

Tested-by: Yanting Jiang <[email protected]>
Signed-off-by: Zhuocheng Ding <[email protected]>
Co-developed-by: Zhao Liu <[email protected]>
Signed-off-by: Zhao Liu <[email protected]>
---
arch/x86/kvm/cpuid.c | 2 +-
arch/x86/kvm/irq.h | 1 +
arch/x86/kvm/lapic.c | 9 ++++
arch/x86/kvm/svm/svm.c | 3 ++
arch/x86/kvm/vmx/vmx.c | 94 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.h | 3 ++
arch/x86/kvm/x86.c | 3 ++
7 files changed, 114 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index adba49afb5fe..1ad547651022 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -623,7 +623,7 @@ void kvm_set_cpu_caps(void)
F(CX8) | F(APIC) | 0 /* Reserved */ | F(SEP) |
F(MTRR) | F(PGE) | F(MCA) | F(CMOV) |
F(PAT) | F(PSE36) | 0 /* PSN */ | F(CLFLUSH) |
- 0 /* Reserved, DS, ACPI */ | F(MMX) |
+ 0 /* Reserved, DS */ | F(ACPI) | F(MMX) |
F(FXSR) | F(XMM) | F(XMM2) | F(SELFSNOOP) |
0 /* HTT, TM, Reserved, PBE */
);
diff --git a/arch/x86/kvm/irq.h b/arch/x86/kvm/irq.h
index c2d7cfe82d00..e11c1fb6e1e6 100644
--- a/arch/x86/kvm/irq.h
+++ b/arch/x86/kvm/irq.h
@@ -99,6 +99,7 @@ static inline int irqchip_in_kernel(struct kvm *kvm)
void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu);
void kvm_inject_apic_timer_irqs(struct kvm_vcpu *vcpu);
void kvm_apic_nmi_wd_deliver(struct kvm_vcpu *vcpu);
+void kvm_apic_therm_deliver(struct kvm_vcpu *vcpu);
void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu);
void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu);
void __kvm_migrate_timers(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 3242f3da2457..af8572798976 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2783,6 +2783,15 @@ void kvm_apic_nmi_wd_deliver(struct kvm_vcpu *vcpu)
kvm_apic_local_deliver(apic, APIC_LVT0);
}

+void kvm_apic_therm_deliver(struct kvm_vcpu *vcpu)
+{
+ struct kvm_lapic *apic = vcpu->arch.apic;
+
+ if (apic)
+ kvm_apic_local_deliver(apic, APIC_LVTTHMR);
+}
+EXPORT_SYMBOL_GPL(kvm_apic_therm_deliver);
+
static const struct kvm_io_device_ops apic_mmio_ops = {
.read = apic_mmio_read,
.write = apic_mmio_write,
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index e90b429c84f1..2e22d5e86768 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4288,6 +4288,9 @@ static bool svm_has_emulated_msr(struct kvm *kvm, u32 index)
switch (index) {
case MSR_IA32_MCG_EXT_CTL:
case KVM_FIRST_EMULATED_VMX_MSR ... KVM_LAST_EMULATED_VMX_MSR:
+ case MSR_IA32_THERM_CONTROL:
+ case MSR_IA32_THERM_INTERRUPT:
+ case MSR_IA32_THERM_STATUS:
return false;
case MSR_IA32_SMBASE:
if (!IS_ENABLED(CONFIG_KVM_SMM))
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8f5981635fe5..aa37b55cf045 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -157,6 +157,32 @@ module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
RTIT_STATUS_ERROR | RTIT_STATUS_STOPPED | \
RTIT_STATUS_BYTECNT))

+/*
+ * TM2 (CPUID.01H:ECX[8]), DTHERM (CPUID.06H:EAX[0]), PLN (CPUID.06H:EAX[4]),
+ * and HWP (CPUID.06H:EAX[7]) are not emulated in kvm.
+ */
+#define MSR_IA32_THERM_STATUS_RO_MASK (THERM_STATUS_PROCHOT | \
+ THERM_STATUS_PROCHOT_FORCEPR_EVENT | THERM_STATUS_CRITICAL_TEMP)
+#define MSR_IA32_THERM_STATUS_RWC0_MASK (THERM_STATUS_PROCHOT_LOG | \
+ THERM_STATUS_PROCHOT_FORCEPR_LOG | THERM_STATUS_CRITICAL_TEMP_LOG)
+/* MSR_IA32_THERM_STATUS unavailable bits mask: unsupported and reserved bits. */
+#define MSR_IA32_THERM_STATUS_UNAVAIL_MASK (~(MSR_IA32_THERM_STATUS_RO_MASK | \
+ MSR_IA32_THERM_STATUS_RWC0_MASK))
+
+/* ECMD (CPUID.06H:EAX[5]) is not emulated in kvm. */
+#define MSR_IA32_THERM_CONTROL_AVAIL_MASK (THERM_ON_DEM_CLO_MOD_ENABLE | \
+ THERM_ON_DEM_CLO_MOD_DUTY_CYC_MASK)
+
+/*
+ * MSR_IA32_THERM_INTERRUPT available bits mask.
+ * PLN (CPUID.06H:EAX[4]) and HFN (CPUID.06H:EAX[24]) are not emulated in kvm.
+ */
+#define MSR_IA32_THERM_INTERRUPT_AVAIL_MASK (THERM_INT_HIGH_ENABLE | \
+ THERM_INT_LOW_ENABLE | THERM_INT_PROCHOT_ENABLE | \
+ THERM_INT_FORCEPR_ENABLE | THERM_INT_CRITICAL_TEM_ENABLE | \
+ THERM_MASK_THRESHOLD0 | THERM_INT_THRESHOLD0_ENABLE | \
+ THERM_MASK_THRESHOLD1 | THERM_INT_THRESHOLD1_ENABLE)
+
/*
* List of MSRs that can be directly passed to the guest.
* In addition to these x2apic and PT MSRs are handled specially.
@@ -1470,6 +1496,19 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
}
}

+static void vmx_inject_therm_interrupt(struct kvm_vcpu *vcpu)
+{
+ /*
+ * From SDM, the ACPI flag also indicates the presence of the
+ * xAPIC thermal LVT entry.
+ */
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_ACPI))
+ return;
+
+ if (irqchip_in_kernel(vcpu->kvm))
+ kvm_apic_therm_deliver(vcpu);
+}
+
/*
* Switches to specified vcpu, until a matching vcpu_put(), but assumes
* vcpu mutex is already taken.
@@ -2109,6 +2148,24 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_DEBUGCTLMSR:
msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
break;
+ case MSR_IA32_THERM_CONTROL:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_ACPI))
+ return 1;
+ msr_info->data = vmx->msr_ia32_therm_control;
+ break;
+ case MSR_IA32_THERM_INTERRUPT:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_ACPI))
+ return 1;
+ msr_info->data = vmx->msr_ia32_therm_interrupt;
+ break;
+ case MSR_IA32_THERM_STATUS:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_ACPI))
+ return 1;
+ msr_info->data = vmx->msr_ia32_therm_status;
+ break;
default:
find_uret_msr:
msr = vmx_find_uret_msr(vmx, msr_info->index);
@@ -2452,6 +2509,40 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
}
ret = kvm_set_msr_common(vcpu, msr_info);
break;
+ case MSR_IA32_THERM_CONTROL:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_ACPI))
+ return 1;
+ if (!msr_info->host_initiated &&
+ data & ~MSR_IA32_THERM_CONTROL_AVAIL_MASK)
+ return 1;
+ vmx->msr_ia32_therm_control = data;
+ break;
+ case MSR_IA32_THERM_INTERRUPT:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_ACPI))
+ return 1;
+ if (!msr_info->host_initiated &&
+ data & ~MSR_IA32_THERM_INTERRUPT_AVAIL_MASK)
+ return 1;
+ vmx->msr_ia32_therm_interrupt = data;
+ break;
+ case MSR_IA32_THERM_STATUS:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_ACPI))
+ return 1;
+ /* Unsupported and reserved bits: generate the exception. */
+ if (!msr_info->host_initiated &&
+ data & MSR_IA32_THERM_STATUS_UNAVAIL_MASK)
+ return 1;
+ if (!msr_info->host_initiated) {
+ data = vmx_set_msr_rwc0_bits(data, vmx->msr_ia32_therm_status,
+ MSR_IA32_THERM_STATUS_RWC0_MASK);
+ data = vmx_set_msr_ro_bits(data, vmx->msr_ia32_therm_status,
+ MSR_IA32_THERM_STATUS_RO_MASK);
+ }
+ vmx->msr_ia32_therm_status = data;
+ break;

default:
find_uret_msr:
@@ -4870,6 +4961,9 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx->spec_ctrl = 0;

vmx->msr_ia32_umwait_control = 0;
+ vmx->msr_ia32_therm_control = 0;
+ vmx->msr_ia32_therm_interrupt = 0;
+ vmx->msr_ia32_therm_status = 0;

vmx->hv_deadline_tsc = -1;
kvm_set_cr8(vcpu, 0);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index e3b0985bb74a..e159dd5b7a66 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -282,6 +282,9 @@ struct vcpu_vmx {

u64 spec_ctrl;
u32 msr_ia32_umwait_control;
+ u64 msr_ia32_therm_control;
+ u64 msr_ia32_therm_interrupt;
+ u64 msr_ia32_therm_status;

/*
* loaded_vmcs points to the VMCS currently used in this vcpu. For a
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cd9a7251c768..50aceb0ce4ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1545,6 +1545,9 @@ static const u32 emulated_msrs_all[] = {
MSR_AMD64_TSC_RATIO,
MSR_IA32_POWER_CTL,
MSR_IA32_UCODE_REV,
+ MSR_IA32_THERM_CONTROL,
+ MSR_IA32_THERM_INTERRUPT,
+ MSR_IA32_THERM_STATUS,

/*
* KVM always supports the "true" VMX control MSRs, even if the host
--
2.34.1
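
The MSR_IA32_THERM_STATUS write path above relies on vmx_set_msr_rwc0_bits() and vmx_set_msr_ro_bits(), which come from another patch in this series and are not shown in this mail. Below is a minimal sketch of the semantics such helpers would need. It assumes that read-only bits keep their current value and that R/WC0 (write 0 to clear) bits can be cleared but never set by a guest write. The function names and the two demo bit positions are hypothetical stand-ins, not the series' implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bits only: bit 0 is a read-only status bit, bit 1 its log bit. */
#define DEMO_STATUS_RO   (1ULL << 0)
#define DEMO_STATUS_LOG  (1ULL << 1)

/* Read-only bits: ignore what the guest wrote and keep the current value. */
static uint64_t keep_ro_bits(uint64_t data, uint64_t old, uint64_t ro_mask)
{
	return (data & ~ro_mask) | (old & ro_mask);
}

/*
 * R/WC0 bits: writing 0 clears the bit, writing 1 leaves it unchanged.
 * On those bits the result is therefore (written value AND old value).
 */
static uint64_t apply_rwc0_bits(uint64_t data, uint64_t old, uint64_t rwc0_mask)
{
	return (data & ~rwc0_mask) | (data & old & rwc0_mask);
}

/* Combine both rules the way the guest write path above orders them. */
static uint64_t status_guest_write(uint64_t old, uint64_t data,
				   uint64_t ro_mask, uint64_t rwc0_mask)
{
	assert((ro_mask & rwc0_mask) == 0);   /* a bit has exactly one rule */

	data = apply_rwc0_bits(data, old, rwc0_mask);
	data = keep_ro_bits(data, old, ro_mask);
	return data;
}
```

With these rules a guest can acknowledge a logged event by writing 0 to the log bit, but it cannot fabricate either a status or a log bit on its own.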

2024-02-22 08:36:04

by Zhao Liu

Subject: Re: [RFC 00/26] Intel Thread Director Virtualization

Ping Paolo & Sean,

Do you have any comment? Or do you think ITD virtualization is
appropriate to discuss at PUCK?

Thanks,
Zhao

On Sat, Feb 03, 2024 at 05:11:48PM +0800, Zhao Liu wrote:
> Date: Sat, 3 Feb 2024 17:11:48 +0800
> From: Zhao Liu <[email protected]>
> Subject: [RFC 00/26] Intel Thread Director Virtualization
> X-Mailer: git-send-email 2.34.1
>
> From: Zhao Liu <[email protected]>
>
> Hi list,
>
> This is our RFC to virtualize Intel Thread Director (ITD) feature for
> Guest, which is based on Ricardo's patch series about ITD related
> support in HFI driver ("[PATCH 0/9] thermal: intel: hfi: Prework for the
> virtualization of HFI" [1]).
>
> In short, the purpose of this patch set is to enable the ITD-based
> scheduling logic in Guest so that Guest can better schedule Guest tasks
> on Intel hybrid platforms.
>
> Currently, ITD is necessary for Windows VMs. Based on ITD virtualization
> support, the Windows 11 Guest could have significant performance
> improvement (for example, on i9-13900K, up to 14%+ improvement on
> 3DMARK).
>
> Our ITD virtualization is not bound to VMs' hybrid topology or vCPUs'
> CPU affinity. However, in our practice, the ITD scheduling optimization
> for win11 VMs works best when combined with hybrid topology and CPU
> affinity (this is related to the specific implementation of Win11
> scheduling). For more details, please see the Section.1.2 "About hybrid
> topology and vCPU pinning".
>
> To enable ITD related scheduling optimization in Win11 VM, some other
> thermal related support is also needed (HWP, CPPC), but we could emulate
> it with dummy value in the VMM (We'll also be sending out extra patches
> in the future for these).
>
> Welcome your feedback!
>
>
> 1. Background and Motivation
> ============================
>
> 1.1. Background
> ^^^^^^^^^^^^^^^
>
> We have the use case to run games in the client Windows VM as the cloud
> gaming solution.
>
> Gaming VMs are performance-sensitive VMs on Client, so that they usually
> have two characteristics to ensure interactivity and performance:
>
> i) There will be vCPUs equal to or close to the number of Host pCPUs.
>
> ii) The vCPUs of Gaming VM are often bound to the pCPUs to achieve
> exclusive resources and avoid the overhead of migration.
>
> In this case, Host can't provide effective scheduling for Guest, so we
> need to deliver more hardware-assisted scheduling capabilities to Guest
> to enhance Guest's scheduling.
>
> Windows 11 (and future Windows products) is heavily optimized for the
> Intel hybrid platform. To get the best performance, we need to
> virtualize hybrid scheduling features (HFI/ITD) for Windows Guest.
>
>
> 1.2. About hybrid topology and vCPU pinning
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Our ITD virtualization can support most vCPU topologies (except multiple
> packages/dies, see details in 3.5 Restrictions on Guest Topology), and
> can also support the case of non-pinning vCPUs (i.e. it can handle vCPU
> thread migration).
>
> The following is our performance measurement on an i9-13900K machine
> (2995 MHz, 24 cores, 32 threads (8+16), RAM: 14 GB (16 GB physical)),
> with iGPU passthrough, running 3DMARK in a Win11 Professional Guest:
>
>
> compared with smp topo case     smp topo     smp topo     smp topo     hybrid topo  hybrid topo  hybrid topo  hybrid topo
>                                 + affinity   + ITD        + ITD                     + affinity   + ITD        + ITD
>                                                           + affinity                                          + affinity
> Time Spy - Overall              0.179%       -0.250%      0.179%       -0.107%      0.143%       -0.179%      -0.107%
> Graphics score                  0.124%       -0.249%      0.124%       -0.083%      0.124%       -0.166%      -0.249%
> CPU score                       0.916%       -0.485%      1.149%       -0.076%      0.722%       -0.324%      11.915%
> Fire Strike Extreme - Overall   0.149%       0.000%       0.224%       -1.021%      -3.361%      -1.319%      -3.361%
> Graphics score                  0.100%       0.050%       0.150%       -1.376%      -3.427%      -1.676%      -3.652%
> Physics score                   5.060%       0.759%       0.518%       -2.907%      -10.914%     -0.897%      14.638%
> Combined score                  0.120%       -0.179%      0.418%       0.060%       -2.929%      -0.179%      -2.809%
> Fire Strike - Overall           0.350%       -0.085%      0.193%       -1.377%      -1.365%      -1.509%      -1.787%
> Graphics score                  0.256%       -0.047%      0.210%       -1.527%      -1.376%      -1.504%      -2.320%
> Physics score                   3.695%       -2.180%      0.629%       -1.581%      -6.846%      -1.444%      14.100%
> Combined score                  0.415%       -0.128%      0.128%       -0.957%      -1.052%      -1.594%      -0.957%
> CPU Profile Max Threads         1.836%       0.298%       1.786%       -0.069%      1.545%       0.025%       9.472%
> 16 Threads                      4.290%       0.989%       3.588%       0.595%       1.580%       0.848%       11.295%
> 8 Threads                       -22.632%     -0.602%      -23.167%     -0.988%      -1.345%      -1.340%      8.648%
> 4 Threads                       -21.598%     0.449%       -21.429%     -0.817%      1.951%       -0.832%      2.084%
> 2 Threads                       -12.912%     -0.014%      -12.006%     -0.481%      -0.609%      -0.595%      1.161%
> 1 Threads                       -3.793%      -0.137%      -3.793%      -0.495%      -3.189%      -0.495%      1.154%
>
>
> Based on the above results, exposing only HFI/ITD to win11 VMs without
> hybrid topology or CPU affinity (case "smp topo + ITD") does not hurt
> performance, but does not bring any performance improvement either.
>
> With both hybrid topology and CPU affinity set for ITD, win11 VMs get a
> significant performance improvement (up to 14%+, compared with the case
> of smp topology without CPU affinity).
>
> Beyond the 3DMARK numbers, in practice there is also a significant
> improvement in the frame rate of the games.
>
> Also, the more powerful the machine, the more significant the
> performance gains!
>
> Therefore, the best practice for enabling ITD scheduling optimization
> is to set up both CPU affinity and hybrid topology for win11 Guest while
> enabling our ITD virtualization.
>
> Our earlier QEMU prototype RFC [2] presented the initial hybrid
> topology support for VMs. Our other proposal, "QOM topology" [3], has
> since been raised in the QEMU community and is the first step towards
> a QOM-based hybrid topology implementation.
>
>
> 2. Introduction of HFI and ITD
> ==============================
>
> Intel provides Hardware Feedback Interface (HFI) feature to allow
> hardware to provide guidance to the OS scheduler to perform optimal
> workload scheduling through a hardware feedback interface structure in
> memory [4]. This HFI structure is called HFI table.
>
> For now, the guidance includes performance and energy efficiency
> hints, and it can be updated via thermal interrupt as the actual
> operating conditions of the processor change during run time.
>
> Intel Thread Director (ITD) feature extends the HFI to provide
> performance and energy efficiency data for advanced classes of
> instructions.
>
> Since ITD is an extension of HFI, our ITD virtualization also
> virtualizes the native HFI feature.
>
>
> 3. Dependencies of ITD
> ======================
>
> ITD is a thermal feature that requires:
> * PTM (Package Thermal Management, alias, PTS)
> * HFI (Hardware Feedback Interface)
>
> In order to support the notification mechanism of ITD/HFI dynamic
> update, we also need to add thermal interrupt related support,
> including the following two features:
> * ACPI (Thermal Monitor and Software Controlled Clock Facilities)
> * TM (Thermal Monitor, alias, TM1/ACC)
>
> Therefore, we must also consider support for the emulation of all
> the above dependencies.
>
>
> 3.1. ACPI emulation
> ^^^^^^^^^^^^^^^^^^^
>
> For ACPI, we can support it by emulating the RDMSR/WRMSR of the
> associated MSRs and adding the ability to inject thermal interrupts.
> In practice, however, we don't inject thermal interrupts into Guest for
> the thermal conditions corresponding to ACPI. The thermal interrupt
> support is prepared for the subsequent HFI/ITD emulation.
>
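
To make the interrupt injection part more concrete: delivering through the thermal LVT entry only reaches the Guest when the Guest has unmasked that entry. Below is a minimal sketch of decoding an xAPIC LVT register value (vector in bits 7:0, delivery mode in bits 10:8, mask in bit 16). The helper names are hypothetical and this is a simplification, not KVM's local APIC code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LVT_VECTOR_MASK 0xffu        /* bits 7:0 */
#define LVT_MODE_SHIFT  8            /* delivery mode, bits 10:8 */
#define LVT_MODE_MASK   0x7u
#define LVT_MASKED      (1u << 16)   /* bit 16: entry is masked */

/*
 * Hypothetical helper: returns true and stores the vector if an interrupt
 * raised through this LVT entry would be delivered. A masked entry
 * suppresses delivery and leaves *vector untouched.
 */
static bool lvt_should_deliver(uint32_t lvt, uint8_t *vector)
{
	assert(vector != NULL);

	if (lvt & LVT_MASKED)
		return false;

	*vector = (uint8_t)(lvt & LVT_VECTOR_MASK);
	return true;
}

/* Extract the 3-bit delivery mode field (0 = fixed, 4 = NMI). */
static unsigned int lvt_delivery_mode(uint32_t lvt)
{
	return (lvt >> LVT_MODE_SHIFT) & LVT_MODE_MASK;
}
```

Since LVT entries come out of reset masked, a Guest that never programs the thermal LVT entry would simply not see these interrupts.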
>
> 3.2. TM emulation
> ^^^^^^^^^^^^^^^^^
>
> TM is a hardware feature and its CPUID bit only indicates the presence
> of the automatic thermal monitoring facilities. For TM, there's no
> interactive interface between OS and hardware, but its flag is one of
> the prerequisites for the OS to enable thermal interrupt.
>
> Therefore, to support TM, it is enough for us to expose its CPUID flag
> to Guest.
>
>
> 3.3. PTM emulation
> ^^^^^^^^^^^^^^^^^^
>
> PTM is a package-scope feature that includes package-level MSR and
> package-level thermal interrupt. Unfortunately, KVM currently only
> supports thread-scope MSR handling, and also doesn't care about the
> specific Guest's topology.
>
> But considering that our purpose of supporting PTM in KVM is to further
> support ITD, and the current platforms with ITD are all 1 package, so we
> emulate the MSRs of the package scope provided by PTM at the VM level.
>
> This requires the VMM to configure a single-package topology when PTM
> is exposed. To limit the impact of this restriction, we only expose the
> PTM feature bit to Guest when ITD needs to be supported.
>
>
> 3.4. HFI emulation
> ^^^^^^^^^^^^^^^^^^
>
> ITD is the extension of HFI, so both HFI and ITD depend on HFI table.
> HFI itself is used on the Host for power-related management control, so
> we should only expose HFI to Guest when we need to enable ITD.
>
> HFI also relies on PTM interrupt control, so it also has requirements
> for package topology, and we also emulate HFI (including ITD) at the VM
> level.
>
> In addition, because the HFI driver allocates HFI instances per die,
> the Guest must also be limited to a single die when HFI (and ITD) is
> exposed.
>
>
> 3.5. Restrictions on Guest Topology
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Due to KVM's incomplete support for MSR topology and the requirement for
> HFI instance management in the kernel, PTM, HFI, and ITD limit the
> topology of the Guest (mainly restricting the topology types created on
> the VMM side).
>
> Therefore, we only expose PTM, HFI, and ITD to userspace when we need to
> support ITD. At the same time, considering that currently, ITD is only
> used on the client platform with 1 package and 1 die, such temporary
> restrictions will not have too much impact.
>
>
> 4. Overview of ITD (and HFI) virtualization
> ===========================================
>
> The main tasks of ITD (including HFI) virtualization are:
> * maintain a virtual HFI table for VM.
> * inject thermal interrupt when HFI table updates.
> * handle related MSRs' emulation and adjust HFI table based on MSR's
> control bits.
> * expose ITD/HFI configuration info in related CPUID leaves.
>
> The most important of these is the maintenance of the virtual HFI table.
> Although the HFI table should also be per package, since ITD/HFI related
> MSRs are treated as per VM in KVM, we also treat the virtual HFI table
> as per VM.
>
>
> 4.1. HFI table building
> ^^^^^^^^^^^^^^^^^^^^^^^
>
> HFI table contains a table header and many table entries. Each table
> entry is identified by an hfi table index, and each CPU corresponds to
> one of the hfi table indexes.
>
> ITD and HFI features both depend on the HFI table, but their HFI tables
> differ slightly. The HFI table provided by the ITD feature has more
> classes (that is, more columns in the table) than the HFI table of the
> native HFI feature.
>
> The virtual HFI table in KVM is built based on the actual HFI table,
> which is maintained by the HFI instance in the HFI driver. We extract
> the HFI data of the pCPUs that the vCPUs are running on to form the
> virtual HFI table.
>
>
> 4.2. HFI table index
> ^^^^^^^^^^^^^^^^^^^^
>
> There are many entries in the HFI table, and each vCPU is assigned an
> HFI table index that specifies the entry it maps to. KVM fills the HFI
> data of the pCPU that the vCPU is running on into the entry of the
> virtual HFI table that corresponds to the vCPU's HFI table index.
>
> This index is set by VMM in CPUID.
>
>
> 4.3. HFI table updating
> ^^^^^^^^^^^^^^^^^^^^^^^
>
> On some platforms, the HFI table will be dynamically updated with
> thermal interrupts. In order to update the virtual HFI table in time, we
> added the per-VM notifier to the HFI driver to notify KVM to update the
> virtual HFI table for the VM, and then inject thermal interrupt into the
> VM to notify the Guest.
>
> The virtual HFI table also needs an update when a vCPU is migrated:
> the pCPU it runs on changes, so the corresponding virtual HFI data
> should be updated to the new pCPU's data. In this case, to reduce
> overhead, we update only the data of that single vCPU without
> traversing the entire virtual HFI table.
>
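
To illustrate the flow described in 4.1 to 4.3, here is a minimal sketch. It assumes one row per HFI table index holding a performance and an energy efficiency capability byte, and it uses a plain array in place of the data the HFI driver would provide. All type and function names are hypothetical. Using the "changed" return value as the trigger for injecting a thermal interrupt is also an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRT_HFI_MAX_ROWS 64

struct hfi_row {
	uint8_t perf_cap;   /* performance capability */
	uint8_t ee_cap;     /* energy efficiency capability */
};

struct vcpu_sketch {
	size_t pcpu;        /* pCPU this vCPU currently runs on */
	size_t hfi_index;   /* row index the VMM set in CPUID */
};

struct virt_hfi_table {
	struct hfi_row rows[VIRT_HFI_MAX_ROWS];
};

/* Copy the pCPU's row into the row owned by this vCPU; 1 if it changed. */
static int virt_hfi_update_one(struct virt_hfi_table *vt,
			       const struct hfi_row *host_rows, size_t nr_pcpus,
			       const struct vcpu_sketch *v)
{
	struct hfi_row *dst;
	const struct hfi_row *src;

	assert(v->pcpu < nr_pcpus && v->hfi_index < VIRT_HFI_MAX_ROWS);

	dst = &vt->rows[v->hfi_index];
	src = &host_rows[v->pcpu];
	if (dst->perf_cap == src->perf_cap && dst->ee_cap == src->ee_cap)
		return 0;

	*dst = *src;
	return 1;
}

/* Host table changed: refresh every vCPU's row, return how many changed. */
static int virt_hfi_rebuild(struct virt_hfi_table *vt,
			    const struct hfi_row *host_rows, size_t nr_pcpus,
			    const struct vcpu_sketch *vcpus, size_t nr_vcpus)
{
	int changed = 0;
	size_t i;

	for (i = 0; i < nr_vcpus; i++)
		changed += virt_hfi_update_one(vt, host_rows, nr_pcpus, &vcpus[i]);
	return changed;
}

/* vCPU migration: only the migrating vCPU's row needs a refresh. */
static int virt_hfi_migrate(struct virt_hfi_table *vt,
			    const struct hfi_row *host_rows, size_t nr_pcpus,
			    struct vcpu_sketch *v, size_t new_pcpu)
{
	v->pcpu = new_pcpu;
	return virt_hfi_update_one(vt, host_rows, nr_pcpus, v);
}
```

A non-zero return from either path would be the point where the Guest gets notified, so an update that leaves the virtual table unchanged costs no interrupt.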
>
> 5. Patch Summary
> ================
>
> Patch 01-03: Prepare the bit definition, the hfi helpers and hfi data
> structures that KVM needs.
> Patch 04-05: Add the sched_out arch hook and reset the classification
> history at sched_in()/sched_out().
> Patch 06-10: Add emulations of ACPI, TM and PTM, mainly about CPUID and
> related MSRs.
> Patch 11-20: Add the emulation support for HFI, including maintaining
> the HFI table for VM.
> Patch 21-23: Add the emulation support for ITD, including extending HFI
> to ITD and passing through the classification MSRs.
> Patch 24-25: Add HRESET emulation support, which is also used by IPC
> classes feature.
> Patch 26: Add the brief doc about the per-VM lock - pkg_therm_lock.
>
>
> 6. References
> =============
>
> [1]: [PATCH 0/9] thermal: intel: hfi: Prework for the virtualization of HFI
> https://lore.kernel.org/lkml/[email protected]/
> [2]: [RFC 00/52] Introduce hybrid CPU topology,
> https://lore.kernel.org/qemu-devel/[email protected]/
> [3]: [RFC 00/41] qom-topo: Abstract Everything about CPU Topology,
> https://lore.kernel.org/qemu-devel/[email protected]/
> [4]: SDM, vol. 3B, section 15.6 HARDWARE FEEDBACK INTERFACE AND INTEL
> THREAD DIRECTOR
>
>
> Thanks and Best Regards,
> Zhao
> ---
> Zhao Liu (17):
> thermal: Add bit definition for x86 thermal related MSRs
> KVM: Add kvm_arch_sched_out() hook
> KVM: x86: Reset hardware history at vCPU's sched_in/out
> KVM: VMX: Add helpers to handle the writes to MSR's R/O and R/WC0 bits
> KVM: x86: cpuid: Define CPUID 0x06.eax by kvm_cpu_cap_mask()
> KVM: VMX: Introduce HFI description structure
> KVM: VMX: Introduce HFI table index for vCPU
> KVM: x86: Introduce the HFI dynamic update request and kvm_x86_ops
> KVM: VMX: Allow to inject thermal interrupt without HFI update
> KVM: VMX: Emulate HFI related bits in package thermal MSRs
> KVM: VMX: Emulate the MSRs of HFI feature
> KVM: x86: Expose HFI feature bit and HFI info in CPUID
> KVM: VMX: Extend HFI table and MSR emulation to support ITD
> KVM: VMX: Pass through ITD classification related MSRs to Guest
> KVM: x86: Expose ITD feature bit and related info in CPUID
> KVM: VMX: Emulate the MSR of HRESET feature
> Documentation: KVM: Add description of pkg_therm_lock
>
> Zhuocheng Ding (9):
> thermal: intel: hfi: Add helpers to build HFI/ITD structures
> thermal: intel: hfi: Add HFI notifier helpers to notify HFI update
> KVM: VMX: Emulate ACPI (CPUID.0x01.edx[bit 22]) feature
> KVM: x86: Expose TM/ACC (CPUID.0x01.edx[bit 29]) feature bit to VM
> KVM: VMX: Emulate PTM/PTS (CPUID.0x06.eax[bit 6]) feature
> KVM: VMX: Support virtual HFI table for VM
> KVM: VMX: Sync update of Host HFI table to Guest
> KVM: VMX: Update HFI table when vCPU migrates
> KVM: x86: Expose HRESET feature's CPUID to Guest
>
> Documentation/virt/kvm/locking.rst | 13 +-
> arch/arm64/include/asm/kvm_host.h | 1 +
> arch/mips/include/asm/kvm_host.h | 1 +
> arch/powerpc/include/asm/kvm_host.h | 1 +
> arch/riscv/include/asm/kvm_host.h | 1 +
> arch/s390/include/asm/kvm_host.h | 1 +
> arch/x86/include/asm/hfi.h | 28 ++
> arch/x86/include/asm/kvm-x86-ops.h | 3 +-
> arch/x86/include/asm/kvm_host.h | 2 +
> arch/x86/include/asm/msr-index.h | 54 +-
> arch/x86/kvm/cpuid.c | 201 +++++++-
> arch/x86/kvm/irq.h | 1 +
> arch/x86/kvm/lapic.c | 9 +
> arch/x86/kvm/svm/svm.c | 8 +
> arch/x86/kvm/vmx/vmx.c | 751 +++++++++++++++++++++++++++-
> arch/x86/kvm/vmx/vmx.h | 79 ++-
> arch/x86/kvm/x86.c | 18 +
> drivers/thermal/intel/intel_hfi.c | 212 +++++++-
> drivers/thermal/intel/therm_throt.c | 1 -
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 1 +
> 21 files changed, 1343 insertions(+), 44 deletions(-)
>
> --
> 2.34.1
>