2023-02-13 16:31:51

by Mathias Krause

Subject: [PATCH 0/5] KVM: Put struct kvm_vcpu on a diet

Relayout members of struct kvm_vcpu and embedded structs to reduce its
memory footprint. Not that it makes sense from a memory usage point of
view (given how few such objects get allocated), but this series
manages to make it consume two cachelines less, which should provide a
micro-architectural net win. However, I wasn't able to see a noticeable
difference running benchmarks within a guest VM -- the VMEXIT costs are
likely still high enough to mask any gains. Pointers to other benchmark
tools / tests are highly appreciated!

I made sure that the mutex and spinlocks still stick to their own
disjoint cachelines, so this series shouldn't cause any regression in
that regard.

Below is the high level pahole(1) diff. Most significant is the overall
size change from 6688 to 6560 bytes, i.e. -128 bytes.
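
For reference, a diff like this can be regenerated roughly as follows, assuming
kvm.ko was built with debug info (the module paths are illustrative, matching
the ones used for the size comparison further below):

$ pahole -C kvm_vcpu kvm-6.2-rc5/arch/x86/kvm/kvm.ko        > kvm_vcpu.before
$ pahole -C kvm_vcpu kvm-6.2-rc5+patch/arch/x86/kvm/kvm.ko  > kvm_vcpu.after
$ diff -u kvm_vcpu.before kvm_vcpu.after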

--- kvm_vcpu.before 2023-02-13 14:13:49.919952154 +0100
+++ kvm_vcpu.after 2023-02-13 14:13:53.559952140 +0100
@@ -6,78 +6,60 @@
int vcpu_idx; /* 40 4 */
int ____srcu_idx; /* 44 4 */
int mode; /* 48 4 */
-
- /* XXX 4 bytes hole, try to pack */
-
+ unsigned int guest_debug; /* 52 4 */
u64 requests; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
- long unsigned int guest_debug; /* 64 8 */
- struct mutex mutex; /* 72 32 */
- struct kvm_run * run; /* 104 8 */
- struct rcuwait wait; /* 112 8 */
- struct pid * pid; /* 120 8 */
+ struct mutex mutex; /* 64 32 */
+ struct kvm_run * run; /* 96 8 */
+ struct rcuwait wait; /* 104 8 */
+ struct pid * pid; /* 112 8 */
+ sigset_t sigset; /* 120 8 */
/* --- cacheline 2 boundary (128 bytes) --- */
int sigset_active; /* 128 4 */
-
- /* XXX 4 bytes hole, try to pack */
-
- sigset_t sigset; /* 136 8 */
- unsigned int halt_poll_ns; /* 144 4 */
- bool valid_wakeup; /* 148 1 */
+ unsigned int halt_poll_ns; /* 132 4 */
+ bool valid_wakeup; /* 136 1 */

/* XXX 3 bytes hole, try to pack */

- int mmio_needed; /* 152 4 */
- int mmio_read_completed; /* 156 4 */
- int mmio_is_write; /* 160 4 */
- int mmio_cur_fragment; /* 164 4 */
- int mmio_nr_fragments; /* 168 4 */
-
- /* XXX 4 bytes hole, try to pack */
-
- struct kvm_mmio_fragment mmio_fragments[2]; /* 176 48 */
- /* --- cacheline 3 boundary (192 bytes) was 32 bytes ago --- */
+ int mmio_needed; /* 140 4 */
+ int mmio_read_completed; /* 144 4 */
+ int mmio_is_write; /* 148 4 */
+ int mmio_cur_fragment; /* 152 4 */
+ int mmio_nr_fragments; /* 156 4 */
+ struct kvm_mmio_fragment mmio_fragments[2]; /* 160 48 */
+ /* --- cacheline 3 boundary (192 bytes) was 16 bytes ago --- */
struct {
- u32 queued; /* 224 4 */
-
- /* XXX 4 bytes hole, try to pack */
-
- struct list_head queue; /* 232 16 */
- struct list_head done; /* 248 16 */
- /* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
- spinlock_t lock; /* 264 4 */
- } async_pf; /* 224 48 */
-
- /* XXX last struct has 4 bytes of padding */
-
+ struct list_head queue; /* 208 16 */
+ struct list_head done; /* 224 16 */
+ spinlock_t lock; /* 240 4 */
+ u32 queued; /* 244 4 */
+ } async_pf; /* 208 40 */
struct {
- bool in_spin_loop; /* 272 1 */
- bool dy_eligible; /* 273 1 */
- } spin_loop; /* 272 2 */
- bool preempted; /* 274 1 */
- bool ready; /* 275 1 */
+ bool in_spin_loop; /* 248 1 */
+ bool dy_eligible; /* 249 1 */
+ } spin_loop; /* 248 2 */
+ bool preempted; /* 250 1 */
+ bool ready; /* 251 1 */

/* XXX 4 bytes hole, try to pack */

- struct kvm_vcpu_arch arch __attribute__((__aligned__(8))); /* 280 5208 */
-
- /* XXX last struct has 6 bytes of padding */
-
- /* --- cacheline 85 boundary (5440 bytes) was 48 bytes ago --- */
- struct kvm_vcpu_stat stat; /* 5488 1104 */
- /* --- cacheline 103 boundary (6592 bytes) --- */
- char stats_id[48]; /* 6592 48 */
- struct kvm_dirty_ring dirty_ring; /* 6640 32 */
+ /* --- cacheline 4 boundary (256 bytes) --- */
+ struct kvm_vcpu_arch arch __attribute__((__aligned__(8))); /* 256 5104 */
+ /* --- cacheline 83 boundary (5312 bytes) was 48 bytes ago --- */
+ struct kvm_vcpu_stat stat; /* 5360 1104 */
+ /* --- cacheline 101 boundary (6464 bytes) --- */
+ char stats_id[48]; /* 6464 48 */
+ struct kvm_dirty_ring dirty_ring; /* 6512 32 */

/* XXX last struct has 4 bytes of padding */

- /* --- cacheline 104 boundary (6656 bytes) was 16 bytes ago --- */
- struct kvm_memory_slot * last_used_slot; /* 6672 8 */
- u64 last_used_slot_gen; /* 6680 8 */
-
- /* size: 6688, cachelines: 105, members: 33 */
- /* sum members: 6669, holes: 5, sum holes: 19 */
- /* paddings: 3, sum paddings: 14 */
+ /* --- cacheline 102 boundary (6528 bytes) was 16 bytes ago --- */
+ struct kvm_memory_slot * last_used_slot; /* 6544 8 */
+ u64 last_used_slot_gen; /* 6552 8 */
+
+ /* size: 6560, cachelines: 103, members: 33 */
+ /* sum members: 6553, holes: 2, sum holes: 7 */
+ /* paddings: 1, sum paddings: 4 */
/* forced alignments: 1, forced holes: 1, sum forced holes: 4 */
/* last cacheline: 32 bytes */
} __attribute__((__aligned__(8)));


There are still holes left, i.e. still room for squeezing out a few more
bytes. But I didn't want to destroy the grouping of members nor sprinkle
random structs with __packed just for that. That said, we might still
want to do it to save yet another cacheline (only 32 more bytes to go).

Code-wise, it seems to save a few instructions as well, likely because of
patch 2, simplifying some of the struct kvm_queued_exception users:

$ size kvm-6.2-rc5{,+patch}/arch/x86/kvm/kvm.ko
text data bss dec hex filename
806277 129095 1168 936540 e4a5c kvm-6.2-rc5/arch/x86/kvm/kvm.ko
806026 129063 1168 936257 e4941 kvm-6.2-rc5+patch/arch/x86/kvm/kvm.ko

$ bloat-o-meter kvm-6.2-rc5{,+patch}/arch/x86/kvm/kvm.ko
add/remove: 0/2 grow/shrink: 7/11 up/down: 102/-340 (-238)
Function old new delta
kvm_vcpu_ioctl_x86_set_vcpu_events 664 707 +43
kvm_multiple_exception 581 613 +32
kvm_inject_exception 153 161 +8
kvm_vcpu_ioctl_x86_get_vcpu_events 510 517 +7
__UNIQUE_ID_vermagic166 67 73 +6
kvm_inject_page_fault 108 113 +5
kvm_mmu_x86_module_init 174 175 +1
kvm_vcpu_ready_for_interrupt_injection 134 131 -3
kvm_vcpu_ioctl 1717 1714 -3
kvm_vcpu_compat_ioctl 274 271 -3
kvm_sigset_activate 53 50 -3
kvm_can_do_async_pf 150 147 -3
kvm_mtrr_get_msr 290 286 -4
kvm_deliver_exception_payload 139 135 -4
kvm_arch_vcpu_ioctl_set_guest_debug 662 657 -5
x86_emulate_instruction 1952 1945 -7
kvm_vcpu_reset 1382 1375 -7
__pfx_kvm_deliver_exception_payload.part 16 - -16
kvm_deliver_exception_payload.part 122 - -122
kvm_arch_vcpu_ioctl_run 7782 7622 -160
Total: Before=635427, After=635189, chg -0.04%


Please apply!

Thanks,

Mathias Krause (5):
KVM: x86: Shrink struct kvm_pmu
KVM: x86: Shrink struct kvm_queued_exception
KVM: Shrink struct kvm_mmu_memory_cache
KVM: x86: Shrink struct kvm_vcpu_arch
KVM: Shrink struct kvm_vcpu

arch/x86/include/asm/kvm_host.h | 30 +++++++++++++++---------------
include/linux/kvm_host.h | 6 +++---
include/linux/kvm_types.h | 2 +-
3 files changed, 19 insertions(+), 19 deletions(-)

--
2.39.1



2023-02-13 16:31:53

by Mathias Krause

Subject: [PATCH 1/5] KVM: x86: Shrink struct kvm_pmu

Move the 'version' member to the beginning of the structure to reuse an
existing hole instead of introducing another one.

This allows us to save 8 bytes for 64 bit builds.

Signed-off-by: Mathias Krause <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6aaae18f1854..43329c60a6b5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -516,6 +516,7 @@ struct kvm_pmc {
#define KVM_PMC_MAX_FIXED 3
#define KVM_AMD_PMC_MAX_GENERIC 6
struct kvm_pmu {
+ u8 version;
unsigned nr_arch_gp_counters;
unsigned nr_arch_fixed_counters;
unsigned available_event_types;
@@ -528,7 +529,6 @@ struct kvm_pmu {
u64 global_ovf_ctrl_mask;
u64 reserved_bits;
u64 raw_event_mask;
- u8 version;
struct kvm_pmc gp_counters[KVM_INTEL_PMC_MAX_GENERIC];
struct kvm_pmc fixed_counters[KVM_PMC_MAX_FIXED];
struct irq_work irq_work;
--
2.39.1


2023-02-13 16:31:57

by Mathias Krause

Subject: [PATCH 2/5] KVM: x86: Shrink struct kvm_queued_exception

Reshuffle the boolean members of struct kvm_queued_exception and make
them individual bits, allowing denser packing.

This allows us to shrink the object size from 24 to 16 bytes for 64 bit
builds.

Signed-off-by: Mathias Krause <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 43329c60a6b5..040eee3e9583 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -701,13 +701,13 @@ struct kvm_vcpu_xen {
};

struct kvm_queued_exception {
- bool pending;
- bool injected;
- bool has_error_code;
+ u8 pending : 1;
+ u8 injected : 1;
+ u8 has_error_code : 1;
+ u8 has_payload : 1;
u8 vector;
u32 error_code;
unsigned long payload;
- bool has_payload;
};

struct kvm_vcpu_arch {
--
2.39.1


2023-02-13 16:32:00

by Mathias Krause

Subject: [PATCH 3/5] KVM: Shrink struct kvm_mmu_memory_cache

Move the 'capacity' member around to make use of the padding hole on 64
bit systems instead of introducing yet another one.

This allows us to save 8 bytes per instance for 64 bit builds of which,
e.g., x86's struct kvm_vcpu_arch has a few.

Signed-off-by: Mathias Krause <[email protected]>
---
include/linux/kvm_types.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 76de36e56cdf..8e4f8fa31457 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -92,10 +92,10 @@ struct gfn_to_pfn_cache {
*/
struct kvm_mmu_memory_cache {
int nobjs;
+ int capacity;
gfp_t gfp_zero;
gfp_t gfp_custom;
struct kmem_cache *kmem_cache;
- int capacity;
void **objects;
};
#endif
--
2.39.1


2023-02-13 16:32:02

by Mathias Krause

Subject: [PATCH 4/5] KVM: x86: Shrink struct kvm_vcpu_arch

Reshuffle the members of struct kvm_vcpu_arch to make use of otherwise
unused padding holes, allowing denser packing without disrupting their
grouping.

This allows us to shrink the object size by 48 bytes for 64 bit builds.

Signed-off-by: Mathias Krause <[email protected]>
---
Instead of attempting to create an optimal shuffle by sorting members by
their alignment constraints, I opted to keep the members grouped by their
meaning to preserve the maintainability of the code.

arch/x86/include/asm/kvm_host.h | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 040eee3e9583..5036456b05b0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -824,18 +824,18 @@ struct kvm_vcpu_arch {

int halt_request; /* real mode on Intel only */

+ u32 kvm_cpuid_base;
int cpuid_nent;
struct kvm_cpuid_entry2 *cpuid_entries;
- u32 kvm_cpuid_base;

u64 reserved_gpa_bits;
int maxphyaddr;

/* emulate context */

- struct x86_emulate_ctxt *emulate_ctxt;
bool emulate_regs_need_sync_to_vcpu;
bool emulate_regs_need_sync_from_vcpu;
+ struct x86_emulate_ctxt *emulate_ctxt;
int (*complete_userspace_io)(struct kvm_vcpu *vcpu);

gpa_t time;
@@ -916,17 +916,17 @@ struct kvm_vcpu_arch {
unsigned long last_retry_addr;

struct {
- bool halted;
gfn_t gfns[ASYNC_PF_PER_VCPU];
struct gfn_to_hva_cache data;
u64 msr_en_val; /* MSR_KVM_ASYNC_PF_EN */
u64 msr_int_val; /* MSR_KVM_ASYNC_PF_INT */
- u16 vec;
u32 id;
+ u16 vec;
bool send_user_only;
u32 host_apf_flags;
bool delivery_as_pf_vmexit;
bool pageready_pending;
+ bool halted;
} apf;

/* OSVW MSRs (AMD only) */
@@ -942,6 +942,9 @@ struct kvm_vcpu_arch {

u64 msr_kvm_poll_control;

+ /* set at EPT violation at this point */
+ unsigned long exit_qualification;
+
/*
* Indicates the guest is trying to write a gfn that contains one or
* more of the PTEs used to translate the write itself, i.e. the access
@@ -959,9 +962,6 @@ struct kvm_vcpu_arch {
*/
bool write_fault_to_shadow_pgtable;

- /* set at EPT violation at this point */
- unsigned long exit_qualification;
-
/* pv related host specific info */
struct {
bool pv_unhalted;
@@ -979,9 +979,6 @@ struct kvm_vcpu_arch {
/* Host CPU on which VM-entry was most recently attempted */
int last_vmentry_cpu;

- /* AMD MSRC001_0015 Hardware Configuration */
- u64 msr_hwcr;
-
/* pv related cpuid info */
struct {
/*
@@ -1006,6 +1003,9 @@ struct kvm_vcpu_arch {
*/
bool pdptrs_from_userspace;

+ /* AMD MSRC001_0015 Hardware Configuration */
+ u64 msr_hwcr;
+
#if IS_ENABLED(CONFIG_HYPERV)
hpa_t hv_root_tdp;
#endif
--
2.39.1


2023-02-13 16:32:05

by Mathias Krause

Subject: [PATCH 5/5] KVM: Shrink struct kvm_vcpu

Reshuffle the members of struct kvm_vcpu to make use of otherwise unused
padding holes, allowing denser packing without disrupting the grouping
or introducing unwanted cacheline sharing.

The embedded mutex and spinlocks continue to not share cachelines, so no
regressions because of lock contention leading to cacheline thrashing are
expected.

This allows us to save 40 bytes for 64 bit builds.

Signed-off-by: Mathias Krause <[email protected]>
---
include/linux/kvm_host.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4f26b244f6d0..6e3e5a540037 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -330,8 +330,8 @@ struct kvm_vcpu {
int srcu_depth;
#endif
int mode;
+ unsigned int guest_debug;
u64 requests;
- unsigned long guest_debug;

struct mutex mutex;
struct kvm_run *run;
@@ -340,8 +340,8 @@ struct kvm_vcpu {
struct rcuwait wait;
#endif
struct pid __rcu *pid;
- int sigset_active;
sigset_t sigset;
+ int sigset_active;
unsigned int halt_poll_ns;
bool valid_wakeup;

@@ -356,10 +356,10 @@ struct kvm_vcpu {

#ifdef CONFIG_KVM_ASYNC_PF
struct {
- u32 queued;
struct list_head queue;
struct list_head done;
spinlock_t lock;
+ u32 queued;
} async_pf;
#endif

--
2.39.1


2023-02-13 17:05:50

by Sean Christopherson

Subject: Re: [PATCH 0/5] KVM: Put struct kvm_vcpu on a diet

On Mon, Feb 13, 2023, Mathias Krause wrote:
> Relayout members of struct kvm_vcpu and embedded structs to reduce its
> memory footprint. Not that it makes sense from a memory usage point of
> view (given how few such objects get allocated), but this series
> manages to make it consume two cachelines less, which should provide a
> micro-architectural net win. However, I wasn't able to see a noticeable
> difference running benchmarks within a guest VM -- the VMEXIT costs are
> likely still high enough to mask any gains.

...

> Below is the high level pahole(1) diff. Most significant is the overall
> size change from 6688 to 6560 bytes, i.e. -128 bytes.

While part of me wishes KVM were more careful about struct layouts, IMO fiddling
with per vCPU or per VM structures isn't worth the ongoing maintenance cost.

Unless the size of the vCPU allocation (vcpu_vmx or vcpu_svm in x86 land) crosses
a meaningful boundary, e.g. drops the size from an order-3 to order-2 allocation,
the memory savings are negligible in the grand scheme. Assuming the kernel is
even capable of perfectly packing vCPU allocations, saving even a few hundred bytes
per vCPU is uninteresting unless the vCPU count gets reaaally high, and at that
point the host likely has hundreds of GiB of memory, i.e. saving a few KiB is again
uninteresting.

And as you observed, imperfect struct layouts are highly unlikely to have a
measurable impact on performance. The types of operations that are involved in
a world switch are just too costly for the layout to matter much. I do like to
shave cycles in the VM-Enter/VM-Exit paths, but only when a change is inarguably
more performant, doesn't require ongoing maintenance, and/or also improves the code
quality.

I am in favor of cleaning up kvm_mmu_memory_cache as there's no reason to carry
a sub-optimal layout and the change is arguably warranted even without the change
in size. Ditto for kvm_pmu, logically I think it makes sense to have the version
at the very top.

But I dislike using bitfields instead of bools in kvm_queued_exception, and shuffling
fields in kvm_vcpu, kvm_vcpu_arch, vcpu_vmx, vcpu_svm, etc. unless there's a truly
egregious field(s) just isn't worth the cost in the long term.

2023-02-13 17:08:20

by Sean Christopherson

Subject: Re: [PATCH 3/5] KVM: Shrink struct kvm_mmu_memory_cache

On Mon, Feb 13, 2023, Mathias Krause wrote:
> Move the 'capacity' member around to make use of the padding hole on 64
> bit systems instead of introducing yet another one.
>
> This allows us to save 8 bytes per instance for 64 bit builds of which,
> e.g., x86's struct kvm_vcpu_arch has a few.
>
> Signed-off-by: Mathias Krause <[email protected]>
> ---
> include/linux/kvm_types.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 76de36e56cdf..8e4f8fa31457 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -92,10 +92,10 @@ struct gfn_to_pfn_cache {
> */
> struct kvm_mmu_memory_cache {
> int nobjs;
> + int capacity;
> gfp_t gfp_zero;
> gfp_t gfp_custom;
> struct kmem_cache *kmem_cache;
> - int capacity;
> void **objects;

If we touch this, I vote to do a more aggressive cleanup and place nobjs next
to objects, e.g.

struct kvm_mmu_memory_cache {
gfp_t gfp_zero;
gfp_t gfp_custom;
struct kmem_cache *kmem_cache;
int capacity;
int nobjs;
void **objects;
};

2023-02-14 12:19:59

by Mathias Krause

Subject: Re: [PATCH 0/5] KVM: Put struct kvm_vcpu on a diet

On 13.02.23 18:05, Sean Christopherson wrote:
> On Mon, Feb 13, 2023, Mathias Krause wrote:
>> Relayout members of struct kvm_vcpu and embedded structs to reduce its
>> memory footprint. Not that it makes sense from a memory usage point of
>> view (given how few of such objects get allocated), but this series
>> achieves to make it consume two cachelines less, which should provide a
>> micro-architectural net win. However, I wasn't able to see a noticeable
>> difference running benchmarks within a guest VM -- the VMEXIT costs are
>> likely still high enough to mask any gains.
>
> ...
>
>> Below is the high level pahole(1) diff. Most significant is the overall
>> size change from 6688 to 6560 bytes, i.e. -128 bytes.
>
> While part of me wishes KVM were more careful about struct layouts, IMO fiddling
> with per vCPU or per VM structures isn't worth the ongoing maintenance cost.
>
> Unless the size of the vCPU allocation (vcpu_vmx or vcpu_svm in x86 land) crosses
> a meaningful boundary, e.g. drops the size from an order-3 to order-2 allocation,
> the memory savings are negligible in the grand scheme. Assuming the kernel is
> even capable of perfectly packing vCPU allocations, saving even a few hundred bytes
> per vCPU is uninteresting unless the vCPU count gets reaaally high, and at that
> point the host likely has hundreds of GiB of memory, i.e. saving a few KiB is again
> uninteresting.

Fully agree! That's why I said this change makes no sense from a memory
usage point of view. The overall memory savings are not visible at all,
especially as the slab allocator isn't able to put more vCPU objects
in a given slab page. However, I still remain confident that this makes
sense from a uarch point of view. Touching less cache lines should be a
win -- even if I'm unable to measure it. By preserving more cachelines
during a VMEXIT, guests should be able to resume their work faster
(assuming they still need these cachelines).

> And as you observed, imperfect struct layouts are highly unlikely to have a
> measurable impact on performance. The types of operations that are involved in
> a world switch are just too costly for the layout to matter much. I do like to
> shave cycles in the VM-Enter/VM-Exit paths, but only when a change is inarguably
> more performant, doesn't require ongoing maintenance, and/or also improves the code
> quality.

Any pointers to measure the "more performant" aspect? I tried to make
use of the vmx_vmcs_shadow_test in kvm-unit-tests, as it's already
counting cycles, but the numbers are too unstable, even if I pin the
test to a given CPU, disable turbo mode, SMT, use the performance cpu
governor, etc.

> I am in favor of cleaning up kvm_mmu_memory_cache as there's no reason to carry
> a sub-optimal layout and the change is arguably warranted even without the change
> in size. Ditto for kvm_pmu, logically I think it makes sense to have the version
> at the very top.

Yeah, was exactly thinking the same when modifying kvm_pmu.

> But I dislike using bitfields instead of bools in kvm_queued_exception, and shuffling
> fields in kvm_vcpu, kvm_vcpu_arch, vcpu_vmx, vcpu_svm, etc. unless there's a truly
> egregious field(s) just isn't worth the cost in the long term.

Heh, just found this gem in vcpu_vmx:

struct vcpu_vmx {
[...]
union vmx_exit_reason exit_reason;

/* XXX 44 bytes hole, try to pack */

/* --- cacheline 123 boundary (7872 bytes) --- */
struct pi_desc pi_desc __attribute__((__aligned__(64)));
[...]

So there are, in fact, some bigger holes left.

Would be nice if pahole had a --density flag that would output some
ASCII art, visualizing which bytes of a struct are allocated by real
members and which ones are pure padding.

2023-02-14 12:20:44

by Mathias Krause

Subject: Re: [PATCH 3/5] KVM: Shrink struct kvm_mmu_memory_cache



On 13.02.23 18:08, Sean Christopherson wrote:
> On Mon, Feb 13, 2023, Mathias Krause wrote:
>> Move the 'capacity' member around to make use of the padding hole on 64
>> bit systems instead of introducing yet another one.
>>
>> This allows us to save 8 bytes per instance for 64 bit builds of which,
>> e.g., x86's struct kvm_vcpu_arch has a few.
>>
>> Signed-off-by: Mathias Krause <[email protected]>
>> ---
>> include/linux/kvm_types.h | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
>> index 76de36e56cdf..8e4f8fa31457 100644
>> --- a/include/linux/kvm_types.h
>> +++ b/include/linux/kvm_types.h
>> @@ -92,10 +92,10 @@ struct gfn_to_pfn_cache {
>> */
>> struct kvm_mmu_memory_cache {
>> int nobjs;
>> + int capacity;
>> gfp_t gfp_zero;
>> gfp_t gfp_custom;
>> struct kmem_cache *kmem_cache;
>> - int capacity;
>> void **objects;
>
> If we touch this, I vote to do a more aggressive cleanup and place nobjs next
> to objects, e.g.
>
> struct kvm_mmu_memory_cache {
> gfp_t gfp_zero;
> gfp_t gfp_custom;
> struct kmem_cache *kmem_cache;
> int capacity;
> int nobjs;
> void **objects;
> };

Ok

2023-02-16 17:32:13

by Sean Christopherson

Subject: Re: [PATCH 0/5] KVM: Put struct kvm_vcpu on a diet

On Tue, Feb 14, 2023, Mathias Krause wrote:
> On 13.02.23 18:05, Sean Christopherson wrote:
> However, I still remain confident that this makes sense from a uarch point of
> view. Touching less cache lines should be a win -- even if I'm unable to
> measure it. By preserving more cachelines during a VMEXIT, guests should be
> able to resume their work faster (assuming they still need these cachelines).

Yes, but I'm skeptical that compacting struct kvm_vcpu will actually result in
fewer cache lines being touched, at least not in a persistent, maintainable way.
While every VM-Exit accesses a lot of state, it's most definitely still a small
percentage of kvm_vcpu. And for the VM-Exits that really benefit from optimized
handling, a.k.a. the fastpath exits, that percentage is even lower.

On x86, kvm_vcpu is actually comprised of three "major" structs: kvm_vcpu,
kvm_vcpu_arch, and vcpu_{vmx,svm}. The majority of fields accessed on every VM-Exit
are in the vendor part, vcpu_{vmx,svm}, but there are still high-touch fields in
the common structures, e.g. registers in kvm_vcpu_arch and things like requests
in kvm_vcpu.

Naively compacting any of the three (well, four) structures might result in fewer
cache lines being touched, but mostly by coincidence. E.g. because compacting
part of kvm_vcpu_arch happened to better align vcpu_vmx, not because of the
compaction itself.

If we really wanted to optimize cache line usage, our best bet would be to identify
the fields that are accessed in the fastpath run loop and then pack those fields
into as few cache lines as possible. And to do that in a maintainable way, we'd
probably need something like ASI[*] to allow us to detect changes that result in
the fastpath handling accessing fields that aren't in the cache-optimized region.

I'm not necessarily opposed to such aggressive optimization, but the ROI is likely
very, very low. For optimized workloads, there simply aren't very many VM-Exits,
e.g. the majority of exits on a modern CPU are due to timer ticks. And even those
will hopefully be eliminated in the not-too-distant future, e.g. by having hardware
virtualize the TSC deadline timer, and by moving to a vCPU scheduling scheme that
allows for a tickless host.

https://lore.kernel.org/all/[email protected]

> > And as you observed, imperfect struct layouts are highly unlikely to have a
> > measurable impact on performance. The types of operations that are involved in
> > a world switch are just too costly for the layout to matter much. I do like to
> > shave cycles in the VM-Enter/VM-Exit paths, but only when a change is inarguably
> > more performant, doesn't require ongoing maintenance, and/or also improves the code
> > quality.
>
> Any pointers to measure the "more performant" aspect?

TL;DR: not really.

What I've done in the past is run a very tight loop in the guest, and then measure
latency from the host by hacking KVM. Measuring from the guest works, e.g. we have
a variety of selftests that do exactly that, but when trying to shave cycles in
the VM-Exit path, it doesn't take many host IRQs arriving at the "wrong" time to
skew the measurement. My quick-and-dirty solution has always been to just hack
KVM to measure latency with IRQs disabled, but a more maintainable approach would
be to add smarts somewhere to sanitize the results, e.g. to throw away outliers
where the guest likely got interrupted.

I believe we've talked about adding a selftest to measure fastpath latency, e.g.
by writing MSR_IA32_TSC_DEADLINE in a tight loop.
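
A bare-bones, guest-side sketch of that idea, purely for illustration -- cycle
counting via RDTSC, no filtering of host IRQ noise, and the deadline offset is
arbitrary:

#include <stdint.h>

#define MSR_IA32_TSC_DEADLINE	0x6e0

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

static inline void wrmsr(uint32_t msr, uint64_t val)
{
	asm volatile("wrmsr" : : "c"(msr), "a"((uint32_t)val),
				 "d"((uint32_t)(val >> 32)));
}

/* Each WRMSR should trigger a (hopefully fastpath-handled) VM-Exit. */
static uint64_t avg_wrmsr_cycles(unsigned long iters)
{
	uint64_t start = rdtsc();
	unsigned long i;

	for (i = 0; i < iters; i++)
		wrmsr(MSR_IA32_TSC_DEADLINE, rdtsc() + 0x100000);

	return (rdtsc() - start) / iters;
}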

However, that's not going to be useful in this case since you are interested in
measuring the impact of reducing the host's L1D footprint. If the guest isn't
cache-constrained, reducing the host's cache footprint isn't going to impact
performance since there's no contention.

Running a micro benchmark in the guest that aims to measure cache performance might
work, but presumably those are all userspace tests, i.e. you'd end up measuring
the impact of the guest kernel too. And they wouldn't consistently trigger VM-Exits,
so it would be difficult to prove the validity of the results.

I suppose you could write a dedicated selftest or port a micro benchmark to run
as a selftest (or KVM-unit-test)?

> I tried to make use of the vmx_vmcs_shadow_test in kvm-unit-tests, as it's
> already counting cycles, but the numbers are too unstable, even if I pin the
> test to a given CPU, disable turbo mode, SMT, use the performance cpu
> governor, etc.

Heh, you might have picked quite possibly the worst way to measure VM-Exit
performance :-)

The guest code in that test that's measuring latency runs at L2. For VMREADs
and VMWRITEs that are "passed-through" all the way to L2, no VM-Exit will occur
(the access will be handled purely in ucode). And for accesses that do cause a
VM-Exit, I'm pretty sure they all result in a nested VM-Exit, which is a _very_
heavy path (~10k cycles). Even if the exit is handled by KVM (in L0), it's still
a relatively slow, heavy path.

> > I am in favor of cleaning up kvm_mmu_memory_cache as there's no reason to carry
> > a sub-optimal layout and the change is arguably warranted even without the change
> > in size. Ditto for kvm_pmu, logically I think it makes sense to have the version
> > at the very top.
>
> Yeah, was exactly thinking the same when modifying kvm_pmu.
>
> > But I dislike using bitfields instead of bools in kvm_queued_exception, and shuffling
> > fields in kvm_vcpu, kvm_vcpu_arch, vcpu_vmx, vcpu_svm, etc. unless there's a truly
> > egregious field(s) just isn't worth the cost in the long term.
>
> Heh, just found this gem in vcpu_vmx:
>
> struct vcpu_vmx {
> [...]
> union vmx_exit_reason exit_reason;
>
> /* XXX 44 bytes hole, try to pack */
>
> /* --- cacheline 123 boundary (7872 bytes) --- */
> struct pi_desc pi_desc __attribute__((__aligned__(64)));
> [...]
>
> So there are, in fact, some bigger holes left.

Ya. Again, I'm definitely ok cleaning up the truly heinous warts and/or doing
a targeted, deliberate refactor of structures. What I don't want to do is
shuffle fields around purely to save a few bytes here and there.

2023-02-17 16:07:58

by Mathias Krause

Subject: Re: [PATCH 0/5] KVM: Put struct kvm_vcpu on a diet

On 16.02.23 18:32, Sean Christopherson wrote:
> On Tue, Feb 14, 2023, Mathias Krause wrote:
>> On 13.02.23 18:05, Sean Christopherson wrote:
>> However, I still remain confident that this makes sense from a uarch point of
>> view. Touching less cache lines should be a win -- even if I'm unable to
>> measure it. By preserving more cachelines during a VMEXIT, guests should be
>> able to resume their work faster (assuming they still need these cachelines).
>
> Yes, but I'm skeptical that compacting struct kvm_vcpu will actually result in
> fewer cache lines being touched, at least not in a persistent, maintainable way.
> While every VM-Exit accesses a lot of state, it's most definitely still a small
> percentage of kvm_vcpu. And for the VM-Exits that really benefit from optimized
> handling, a.k.a. the fastpath exits, that percentage is even lower.

Yeah, that's probably true.

> On x86, kvm_vcpu is actually comprised of three "major" structs: kvm_vcpu,
> kvm_vcpu_arch, and vcpu_{vmx,svm}. The majority of fields accessed on every VM-Exit
> are in the vendor part, vcpu_{vmx,svm}, but there are still high-touch fields in
> the common structures, e.g. registers in kvm_vcpu_arch and things like requests
> in kvm_vcpu.
>
> Naively compacting any of the three (well, four) structures might result in fewer
> cache lines being touched, but mostly by coincidence. E.g. because compacting
> part of kvm_vcpu_arch happened to better align vcpu_vmx, not because of the
> compaction itself.

Fortunately, kvm_vcpu is embedded as first member in vcpu_{vmx,svm}, so
all three share a common "header." Optimizations done for kvm_vcpu will
therefore benefit the vendor specific structures too. However, you're
right that this will implicitly change the layout for the remainder of
vcpu_{vmx,svm} and might even have a negative impact regarding cacheline
usage. But, as my changes chop off exactly 128 bytes from kvm_vcpu,
that's not the case here. Still, I can see that this is "coincidence" and
fragile in the long run.
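
Roughly, heavily elided and just to illustrate the nesting -- these are not the
actual definitions:

struct kvm_vcpu {
	/* ... common state ... */
	struct kvm_vcpu_arch arch;
	/* ... */
};

struct vcpu_vmx {
	struct kvm_vcpu vcpu;	/* embedded as the first member */
	/* ... VMX-specific state; its offsets shift with sizeof(struct kvm_vcpu) ... */
};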

> If we really wanted to optimize cache line usage, our best bet would be to identify
> the fields that are accessed in the fastpath run loop and then pack those fields
> into as few cache lines as possible. And to do that in a maintainable way, we'd
> probably need something like ASI[*] to allow us to detect changes that result in
> the fastpath handling accessing fields that aren't in the cache-optimized region.
>

> I'm not necessarily opposed to such aggressive optimization, but the ROI is likely
> very, very low. For optimized workloads, there simply aren't very many VM-Exits,
> e.g. the majority of exits on a modern CPU are due to timer ticks. And even those
> will hopefully be eliminated in the not-too-distant future, e.g. by having hardware
> virtualize the TSC deadline timer, and by moving to a vCPU scheduling scheme that
> allows for a tickless host.

Well, for guests running grsecurity kernels, there's also the CR0.WP
toggling triggering VMEXITs, which happens a lot! -- at least until
something along the lines of [1] gets merged *hint ;)*

[1]
https://lore.kernel.org/all/[email protected]/

>
> https://lore.kernel.org/all/[email protected]

Heh, that RFC is from February last year and it looks like it stalled at
that point. But I guess you only meant patch 44 anyway, which splits up
kvm_vcpu_arch:
https://lore.kernel.org/all/[email protected]/.
It does that for other purposes, though, which might conflict with the
performance aspect I'm mostly after here. Anyways, I got your point. If
we care about cacheline footprint, we should do a more radical change
and group hot members together instead of simply shrinking the structs
involved.

>>> And as you observed, imperfect struct layouts are highly unlikely to have a
>>> measurable impact on performance. The types of operations that are involved in
>>> a world switch are just too costly for the layout to matter much. I do like to
>>> shave cycles in the VM-Enter/VM-Exit paths, but only when a change is inarguably
>>> more performant, doesn't require ongoing maintenance, and/or also improves the code
>>> quality.
>>
>> Any pointers to measure the "more performant" aspect?
>
> TL;DR: not really.
>
> What I've done in the past is run a very tight loop in the guest, and then measure
> latency from the host by hacking KVM. Measuring from the guest works, e.g. we have
> a variety of selftests that do exactly that, but when trying to shave cycles in
> the VM-Exit path, it doesn't take many host IRQs arriving at the "wrong" time to
> skew the measurement. My quick-and-dirty solution has always been to just hack
> KVM to measure latency with IRQs disabled, but a more maintainable approach would
> be to add smarts somewhere to sanitize the results, e.g. to throw away outliers
> where the guest likely got interrupted.
>
> I believe we've talked about adding a selftest to measure fastpath latency, e.g.
> by writing MSR_IA32_TSC_DEADLINE in a tight loop.
>
> However, that's not going to be useful in this case since you are interested in
> measuring the impact of reducing the host's L1D footprint. If the guest isn't
> cache-constrained, reducing the host's cache footprint isn't going to impact
> performance since there's no contention.

Yeah, it's hard to find a test case measuring the gains. I looked into
running Linux userland workloads initially, but saw no real impact, as
the stddev was already too high. But, as you pointed out, a
micro-benchmark is of no use either, so it's all hand-waving only. :D

> Running a micro benchmark in the guest that aims to measure cache performance might
> work, but presumably those are all userspace tests, i.e. you'd end up measuring
> the impact of the guest kernel too. And they wouldn't consistently trigger VM-Exits,
> so it would be difficult to prove the validity of the results.

Yep. It's all just gut feeling, unfortunately.

> I suppose you could write a dedicated selftest or port a micro benchmark to run
> as a selftest (or KVM-unit-test)?
>
>> I tried to make use of the vmx_vmcs_shadow_test in kvm-unit-tests, as it's
>> already counting cycles, but the numbers are too unstable, even if I pin the
>> test to a given CPU, disable turbo mode, SMT, use the performance cpu
>> governor, etc.
>
> Heh, you might have picked quite possibly the worst way to measure VM-Exit
> performance :-)
>
> The guest code in that test that's measuring latency runs at L2. For VMREADs
> and VMWRITEs that are "passed-through" all the way to L2, no VM-Exit will occur
> (the access will be handled purely in ucode). And for accesses that do cause a
> VM-Exit, I'm pretty sure they all result in a nested VM-Exit, which is a _very_
> heavy path (~10k cycles). Even if the exit is handled by KVM (in L0), it's still
> a relatively slow, heavy path.

I see. I'll have a look at the selftests and see if I can repurpose one
of them. But, as you noted, a microbenchmark might not be what I'm
after. It's more about identifying the usage patterns for hot VMEXIT
paths and optimizing those.

>>> I am in favor of cleaning up kvm_mmu_memory_cache as there's no reason to carry
>>> a sub-optimal layout and the change is arguably warranted even without the change
>>> in size. Ditto for kvm_pmu, logically I think it makes sense to have the version
>>> at the very top.
>>
>> Yeah, was exactly thinking the same when modifying kvm_pmu.
>>
>>> But I dislike using bitfields instead of bools in kvm_queued_exception, and shuffling
>>> fields in kvm_vcpu, kvm_vcpu_arch, vcpu_vmx, vcpu_svm, etc. unless there's a truly
>>> egregious field(s) just isn't worth the cost in the long term.
>>
>> Heh, just found this gem in vcpu_vmx:
>>
>> struct vcpu_vmx {
>> [...]
>> union vmx_exit_reason exit_reason;
>>
>> /* XXX 44 bytes hole, try to pack */
>>
>> /* --- cacheline 123 boundary (7872 bytes) --- */
>> struct pi_desc pi_desc __attribute__((__aligned__(64)));
>> [...]
>>
>> So there are, in fact, some bigger holes left.
>
> Ya. Again, I'm definitely ok cleaning up the truly heinous warts and/or doing
> a targeted, deliberate refactor of structures. What I don't want to do is
> shuffle fields around purely to save a few bytes here and there.

Got it. I'll back out the reshuffling ones and only keep the ones for
kvm_pmu and kvm_mmu_memory_cache, as these are more like straight cleanups.

Thanks,
Mathias

2023-02-17 16:35:42

by Sean Christopherson

Subject: Re: [PATCH 0/5] KVM: Put struct kvm_vcpu on a diet

On Fri, Feb 17, 2023, Mathias Krause wrote:
> On 16.02.23 18:32, Sean Christopherson wrote:
> > I'm not necessarily opposed to such aggressive optimization, but the ROI is likely
> > very, very low. For optimized workloads, there simply aren't very many VM-Exits,
> > e.g. the majority of exits on a modern CPU are due to timer ticks. And even those
> > will hopefully be eliminated in the not-too-distant future, e.g. by having hardware
> > virtualize the TSC deadline timer, and by moving to a vCPU scheduling scheme that
> > allows for a tickless host.
>
> Well, for guests running grsecurity kernels, there's also the CR0.WP
> toggling triggering VMEXITs, which happens a lot! -- at least until
> something along the lines of [1] gets merged *hint ;)*

Ha! It's high on my todo list for 6.4, catching up on other stuff at the moment.

That series is also _exactly_ why the ROI for aggressive cache line optimization
is low. The better long term answer is almost always to avoid the VM-Exit in the
first place, or failing that, to handle the exit in a fastpath. Sometimes it takes
a few years, e.g. to get necessary hardware support, but x86 virtualization is fast
approaching the point where anything remotely performance critical is handled entirely
within the guest.

2023-02-18 14:55:51

by Tom Lendacky

Subject: Re: [PATCH 2/5] KVM: x86: Shrink struct kvm_queued_exception

On 2/13/23 10:33, Mathias Krause wrote:
> Reshuffle the boolean members of struct kvm_queued_exception and make
> them individual bits, allowing denser packing.
>
> This allows us to shrink the object size from 24 to 16 bytes for 64 bit
> builds.
>
> Signed-off-by: Mathias Krause <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 43329c60a6b5..040eee3e9583 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -701,13 +701,13 @@ struct kvm_vcpu_xen {
> };
>
> struct kvm_queued_exception {
> - bool pending;
> - bool injected;
> - bool has_error_code;
> + u8 pending : 1;
> + u8 injected : 1;
> + u8 has_error_code : 1;
> + u8 has_payload : 1;

I find doing something like this is clearer and easier to read:

u8 pending : 1,
injected : 1,
has_error_code : 1,
has_payload : 1,
__reserved : 4;

(you don't have to have the __reserved, though). Just throwing it out there.

Thanks,
Tom

> u8 vector;
> u32 error_code;
> unsigned long payload;
> - bool has_payload;
> };
>
> struct kvm_vcpu_arch {