Date: Mon, 4 Mar 2024 09:55:06 -0800
From: Sean Christopherson
To: Zheyun Shen
Cc: pbonzini@redhat.com, tglx@linutronix.de, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Tom Lendacky
Subject: Re: [PATCH] KVM:SVM: Flush cache only on CPUs running SEV guest
In-Reply-To: <1880816055.4545532.1709260250219.JavaMail.zimbra@sjtu.edu.cn>

+Tom

"KVM: SVM:" for the shortlog scope.

On Fri, Mar 01, 2024, Zheyun Shen wrote:
> On AMD CPUs that do not ensure cache consistency, each memory page
> reclamation in an SEV guest triggers a call to wbinvd_on_all_cpus, thereby
> affecting the performance of other programs on the host.
>
> Typically, an AMD server may have 128 cores or more, while the SEV guest
> might only utilize 8 of these cores.  Meanwhile, the host can use
> qemu-affinity to bind these 8 vCPUs to specific physical CPUs.
>
> Therefore, keeping a record of the physical core numbers each time a vCPU
> runs can help avoid flushing the cache for all CPUs every time.

This needs an unequivocal statement from AMD that flushing caches only on
CPUs that do VMRUN is sufficient.  That sounds like it should be obviously
correct, as I don't see how else a cache line can be dirtied for the
encrypted PA, but this entire non-coherent caches mess makes me more than
a bit paranoid.

> Signed-off-by: Zheyun Shen
> ---
>  arch/x86/include/asm/smp.h |  1 +
>  arch/x86/kvm/svm/sev.c     | 28 ++++++++++++++++++++++++----
>  arch/x86/kvm/svm/svm.c     |  4 ++++
>  arch/x86/kvm/svm/svm.h     |  3 +++
>  arch/x86/lib/cache-smp.c   |  7 +++++++
>  5 files changed, 39 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
> index 4fab2ed45..19297202b 100644
> --- a/arch/x86/include/asm/smp.h
> +++ b/arch/x86/include/asm/smp.h
> @@ -120,6 +120,7 @@ void native_play_dead(void);
>  void play_dead_common(void);
>  void wbinvd_on_cpu(int cpu);
>  int wbinvd_on_all_cpus(void);
> +int wbinvd_on_cpus(struct cpumask *cpumask);

KVM already has an internal helper that does this, see
kvm_emulate_wbinvd_noskip().  I'm not necessarily advocating that we keep
KVM's internal code, but I don't want two ways of doing the same thing.
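The cache-smp.c hunk isn't quoted here, but presumably the new helper is just
the masked variant of wbinvd_on_all_cpus(), something like this untested
sketch (assuming it reuses the existing __wbinvd() IPI callback in
arch/x86/lib/cache-smp.c and the stock on_each_cpu_mask() API):

	/* Existing IPI callback in arch/x86/lib/cache-smp.c. */
	static void __wbinvd(void *dummy)
	{
		wbinvd();
	}

	/* Flush caches only on the CPUs in @cpumask, waiting for completion. */
	int wbinvd_on_cpus(struct cpumask *cpumask)
	{
		on_each_cpu_mask(cpumask, __wbinvd, NULL, 1);
		return 0;
	}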
>  void smp_kick_mwait_play_dead(void);
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index f760106c3..b6ed9a878 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -215,6 +215,21 @@ static void sev_asid_free(struct kvm_sev_info *sev)
>  	sev->misc_cg = NULL;
>  }
>
> +struct cpumask *sev_get_cpumask(struct kvm *kvm)
> +{
> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +
> +	return &sev->cpumask;
> +}
> +
> +void sev_clear_cpumask(struct kvm *kvm)
> +{
> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +
> +	cpumask_clear(&sev->cpumask);
> +}
> +
> +

Unnecessary newline.  But I would just delete these helpers.

>  static void sev_decommission(unsigned int handle)
>  {
>  	struct sev_data_decommission decommission;
> @@ -255,6 +270,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
>  	if (unlikely(sev->active))
>  		return ret;
>
> +	cpumask_clear(&sev->cpumask);

This is unnecessary, the mask is zero allocated.

>  	sev->active = true;
>  	sev->es_active = argp->id == KVM_SEV_ES_INIT;
>  	asid = sev_asid_new(sev);
> @@ -2048,7 +2064,8 @@ int sev_mem_enc_unregister_region(struct kvm *kvm,
>  	 * releasing the pages back to the system for use. CLFLUSH will
>  	 * not do this, so issue a WBINVD.
>  	 */
> -	wbinvd_on_all_cpus();
> +	wbinvd_on_cpus(sev_get_cpumask(kvm));
> +	sev_clear_cpumask(kvm);

Instead of copy+paste WBINVD+cpumask_clear() everywhere, add a prep patch to
replace relevant open coded calls to wbinvd_on_all_cpus() with calls to
sev_guest_memory_reclaimed().  Then only sev_guest_memory_reclaimed() needs
to be updated, and IMO it helps document why KVM is blasting WBINVD.

That's why I recommend deleting sev_get_cpumask() and sev_clear_cpumask();
there really should only be two places that touch the mask itself:
svm_vcpu_run() and sev_guest_memory_reclaimed().
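I.e. after the prep patch, every reclaim-ish path funnels into the one helper
that already exists (its current body is quoted below), and only that helper
changes when the cpumask tracking lands.  Roughly:

	/*
	 * Single choke point for the "flush caches before reclaiming guest
	 * memory" WBINVD; today it simply blasts all CPUs.
	 */
	void sev_guest_memory_reclaimed(struct kvm *kvm)
	{
		if (!sev_guest(kvm))
			return;

		wbinvd_on_all_cpus();
	}

	/* And in callers such as sev_vm_destroy(): */
	-	wbinvd_on_all_cpus();
	+	sev_guest_memory_reclaimed(kvm);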
>  	__unregister_enc_region_locked(kvm, region);
>
> @@ -2152,7 +2169,8 @@ void sev_vm_destroy(struct kvm *kvm)
>  	 * releasing the pages back to the system for use. CLFLUSH will
>  	 * not do this, so issue a WBINVD.
>  	 */
> -	wbinvd_on_all_cpus();
> +	wbinvd_on_cpus(sev_get_cpumask(kvm));
> +	sev_clear_cpumask(kvm);
>
>  	/*
>  	 * if userspace was terminated before unregistering the memory regions
> @@ -2343,7 +2361,8 @@ static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)
>  		return;
>
>  do_wbinvd:
> -	wbinvd_on_all_cpus();
> +	wbinvd_on_cpus(sev_get_cpumask(vcpu->kvm));
> +	sev_clear_cpumask(vcpu->kvm);
>  }
>
>  void sev_guest_memory_reclaimed(struct kvm *kvm)
> @@ -2351,7 +2370,8 @@ void sev_guest_memory_reclaimed(struct kvm *kvm)
>  	if (!sev_guest(kvm))
>  		return;
>
> -	wbinvd_on_all_cpus();
> +	wbinvd_on_cpus(sev_get_cpumask(kvm));
> +	sev_clear_cpumask(kvm);

This is unsafe from a correctness perspective, as sev_guest_memory_reclaimed()
is called without holding any KVM locks.  E.g. if a vCPU runs between blasting
WBINVD and cpumask_clear(), KVM will fail to emit WBINVD on a future reclaim.

Making the mask per-vCPU, a la vcpu->arch.wbinvd_dirty_mask, doesn't solve the
problem as KVM can't take vcpu->mutex in this path (sleeping may not be
allowed), and that would create an unnecessary/unwanted bottleneck.

The simplest solution I can think of is to iterate over all possible CPUs
using cpumask_test_and_clear_cpu().

>  }
>
>  void sev_free_vcpu(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index e90b429c8..f9bfa6e57 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4107,6 +4107,10 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu, bool spec_ctrl_intercepted)
>
>  	amd_clear_divider();
>
> +        if (sev_guest(vcpu->kvm))

Use tabs, not spaces.

> +                cpumask_set_cpu(smp_processor_id(), sev_get_cpumask(vcpu->kvm));

This does not need to be in the noinstr region, and it _shouldn't_ be in the
noinstr region.  There's already a handy dandy pre_sev_run() that provides a
convenient location to bury this stuff in SEV specific code.

> +
>  	if (sev_es_guest(vcpu->kvm))
>  		__svm_sev_es_vcpu_run(svm, spec_ctrl_intercepted);
>  	else
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 8ef95139c..1577e200e 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -90,6 +90,7 @@ struct kvm_sev_info {
>  	struct list_head mirror_entry; /* Use as a list entry of mirrors */
>  	struct misc_cg *misc_cg; /* For misc cgroup accounting */
>  	atomic_t migration_in_progress;
> +	struct cpumask cpumask; /* CPU list to flush */

That is not a helpful comment.  Flush what?  What adds to the list?  When is
the list cleared?  Even the name is fairly useless.

I'm also pretty sure this should be a cpumask_var_t, and dynamically allocated
as appropriate.  And at that point, it should be allocated and filled if and
only if the CPU doesn't have X86_FEATURE_SME_COHERENT.
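Pulling those pieces together, the end state could look something like the
completely untested sketch below.  "have_run_cpus" is a made-up name for the
cpumask_var_t field in struct kvm_sev_info; it would be set from pre_sev_run()
via cpumask_set_cpu(), and allocated at SEV init only on hosts with
non-coherent caches, e.g.

	if (!boot_cpu_has(X86_FEATURE_SME_COHERENT) &&
	    !zalloc_cpumask_var(&sev->have_run_cpus, GFP_KERNEL_ACCOUNT))
		return -ENOMEM;

	void sev_guest_memory_reclaimed(struct kvm *kvm)
	{
		struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
		int cpu;

		/* No mask is allocated when the caches are coherent. */
		if (!sev_guest(kvm) || boot_cpu_has(X86_FEATURE_SME_COHERENT))
			return;

		/*
		 * cpumask_test_and_clear_cpu() is atomic, so a bit set by a
		 * concurrently running vCPU is either flushed here or left
		 * set for the next reclaim; a flush is never lost.  Serial
		 * per-CPU IPIs trade speed for simplicity.
		 */
		for_each_possible_cpu(cpu) {
			if (cpumask_test_and_clear_cpu(cpu, sev->have_run_cpus))
				wbinvd_on_cpu(cpu);
		}
	}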