Subject: Re: [PATCH 0/5] KVM: Put struct kvm_vcpu on a diet
From: Mathias Krause
To: Sean Christopherson
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Paolo Bonzini
Date: Tue, 14 Feb 2023 13:19:12 +0100
Message-ID: <13deaeb6-dfb2-224c-0aa3-5546ad426f63@grsecurity.net>
References: <20230213163351.30704-1-minipli@grsecurity.net>

On 13.02.23 18:05, Sean Christopherson wrote:
> On Mon, Feb 13, 2023, Mathias Krause wrote:
>> Relayout members of struct kvm_vcpu and embedded structs to reduce its
>> memory footprint. Not that it makes sense from a memory usage point of
>> view (given how few of these objects get allocated), but this series
>> manages to make it consume two cachelines less, which should provide a
>> micro-architectural net win. However, I wasn't able to see a noticeable
>> difference running benchmarks within a guest VM -- the VMEXIT costs are
>> likely still high enough to mask any gains.
>
> ...
>
>> Below is the high level pahole(1) diff. Most significant is the overall
>> size change from 6688 to 6560 bytes, i.e. -128 bytes.
>
> While part of me wishes KVM were more careful about struct layouts, IMO
> fiddling with per vCPU or per VM structures isn't worth the ongoing
> maintenance cost.
>
> Unless the size of the vCPU allocation (vcpu_vmx or vcpu_svm in x86 land)
> crosses a meaningful boundary, e.g. drops the size from an order-3 to
> order-2 allocation, the memory savings are negligible in the grand scheme.
> Assuming the kernel is even capable of perfectly packing vCPU allocations,
> saving even a few hundred bytes per vCPU is uninteresting unless the vCPU
> count gets reaaally high, and at that point the host likely has hundreds
> of GiB of memory, i.e. saving a few KiB is again uninteresting.

Fully agree! That's why I said this change makes no sense from a memory
usage point of view. The overall memory savings aren't visible at all,
as the slab allocator cannot fit more vCPU objects into a given slab
page anyway.

However, I remain confident that this makes sense from a uarch point of
view. Touching fewer cache lines should be a win -- even if I'm unable
to measure it. By preserving more cachelines across a VMEXIT, guests
should be able to resume their work faster (assuming they still need
those cachelines).

> And as you observed, imperfect struct layouts are highly unlikely to have
> a measurable impact on performance. The types of operations that are
> involved in a world switch are just too costly for the layout to matter
> much. I do like to shave cycles in the VM-Enter/VM-Exit paths, but only
> when a change is inarguably more performant, doesn't require ongoing
> maintenance, and/or also improves the code quality.

Any pointers on how to measure the "more performant" aspect? I tried to
make use of the vmx_vmcs_shadow_test in kvm-unit-tests, as it already
counts cycles, but the numbers are too unstable, even if I pin the test
to a given CPU, disable turbo mode and SMT, use the performance cpu
governor, etc.
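For what it's worth, the usual trick for getting more stable cycle
counts is to serialize around the timestamp reads so neither the
measured code nor the surrounding instructions can drift across them.
A minimal user-space sketch (not from kvm-unit-tests; the measured
section is just a placeholder):

  #include <stdio.h>
  #include <stdint.h>
  #include <x86intrin.h>  /* __rdtsc(), __rdtscp(), _mm_lfence() */

  /* LFENCE before RDTSC keeps earlier instructions from drifting past
   * the first read; RDTSCP waits for the measured code to retire before
   * taking the second read, and the trailing LFENCE stops later
   * instructions from being hoisted above it. */
  static inline uint64_t cycles_begin(void)
  {
          _mm_lfence();
          return __rdtsc();
  }

  static inline uint64_t cycles_end(void)
  {
          unsigned int aux;
          uint64_t t = __rdtscp(&aux);
          _mm_lfence();
          return t;
  }

  int main(void)
  {
          uint64_t start = cycles_begin();
          /* code under test goes here */
          uint64_t end = cycles_end();

          printf("%llu cycles\n", (unsigned long long)(end - start));
          return 0;
  }

Even then, individual runs jitter; taking the minimum over many
iterations tends to filter out interrupts and other noise better than
averaging.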
> I am in favor of cleaning up kvm_mmu_memory_cache as there's no reason
> to carry a sub-optimal layout and the change is arguably warranted even
> without the change in size. Ditto for kvm_pmu, logically I think it
> makes sense to have the version at the very top.

Yeah, I was thinking exactly the same when modifying kvm_pmu.

> But I dislike using bitfields instead of bools in kvm_queued_exception,
> and shuffling fields in kvm_vcpu, kvm_vcpu_arch, vcpu_vmx, vcpu_svm,
> etc. unless there's a truly egregious field(s) just isn't worth the
> cost in the long term.

Heh, just found this gem in vcpu_vmx:

  struct vcpu_vmx {
          [...]
          union vmx_exit_reason exit_reason;

          /* XXX 44 bytes hole, try to pack */

          /* --- cacheline 123 boundary (7872 bytes) --- */
          struct pi_desc pi_desc __attribute__((__aligned__(64)));
          [...]

So there are, in fact, some bigger holes left. It would be nice if
pahole had a --density flag that would output some ASCII art,
visualizing which bytes of a struct are allocated by real members and
which ones are pure padding.
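To make explicit why that hole exists: the __aligned__(64) attribute on
pi_desc forces the compiler to pad up to the next 64-byte boundary
regardless of what precedes it, so the hole can only be filled by
moving members in front of pi_desc. Here's a reduced, hypothetical
stand-in (the real struct lives in arch/x86/kvm/vmx/vmx.h; the demo's
hole is 60 bytes instead of 44 because nothing but exit_reason precedes
the aligned member here):

  #include <stdio.h>
  #include <stddef.h>

  /* Stand-in for the tail of struct vcpu_vmx: a 4-byte member directly
   * followed by a 64-byte-aligned one. The compiler has to insert
   * padding up to the next 64-byte boundary, which pahole reports as a
   * hole. */
  struct demo {
          unsigned int exit_reason;       /* 4 bytes at offset 0 */
          /* 60-byte hole */
          char pi_desc[64] __attribute__((__aligned__(64)));
  };

  int main(void)
  {
          printf("offsetof(pi_desc): %zu\n", offsetof(struct demo, pi_desc)); /* 64 */
          printf("sizeof(struct demo): %zu\n", sizeof(struct demo));          /* 128 */
          return 0;
  }

If I remember correctly, pahole's --reorganize mode can already suggest
a packing for such structs, though it doesn't visualize byte density
the way a --density flag would.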