Date: Tue, 26 Jul 2022 16:41:10 -0700
From: Isaku Yamahata
To: Yuan Yao
Cc: isaku.yamahata@intel.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
    isaku.yamahata@gmail.com, Paolo Bonzini, Kai Huang
Subject: Re: [PATCH v7 046/102] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
Message-ID: <20220726234110.GD1379820@ls.amr.corp.intel.com>
References: <94524fe1d3ead13019a2b502f37797727296fbd1.1656366338.git.isaku.yamahata@intel.com>
 <20220711082818.ksm5hxg3zho66oct@yy-desk-7060>
In-Reply-To: <20220711082818.ksm5hxg3zho66oct@yy-desk-7060>

On Mon, Jul 11, 2022 at 04:28:18PM +0800, Yuan Yao wrote:
> On Mon, Jun 27, 2022 at 02:53:38PM -0700, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata
> >
> > Allocate a mirrored private page table for the private page table, and add
> > hooks to operate on the mirrored private page table. This patch adds only
> > the hooks. Because kvm_gfn_shared_mask() always returns false, those hooks
> > aren't called yet.
> >
> > Because the private guest page is protected, page copy via mmu_notifier to
> > migrate the page doesn't work; a callback from the backing store is needed.
> >
> > When the faulting GPA is private, the KVM fault is also called private.
> > When resolving a private KVM fault, allocate a mirrored private page table
> > and call the hooks to operate on it. On a change of a private PTE entry,
> > invoke the kvm_x86_ops hook in __handle_changed_spte() to propagate the
> > change to the mirrored private page table. The following depicts the
> > relationship.
> >
> >   private KVM page fault   |
> >        |                   |
> >        V                   |
> >   private GPA              |
> >        |                   |
> >        V                   |
> >   KVM private PT root      |  CPU private PT root
> >        |                   |           |
> >        V                   |           V
> >   private PT ---hook to mirror--->mirrored private PT
> >        |                   |           |
> >        \-------------------+------\    |
> >                            |      |    |
> >                            |      V    V
> >                            |    private guest page
> >                            |
> >                            |
> >   non-encrypted memory     |    encrypted memory
> >                            |
> >   PT: page table
> >
> > The existing KVM TDP MMU code uses atomic updates of the SPTE. On
> > populating the EPT entry, the entry is set atomically. However, zapping an
> > SPTE requires a TLB shootdown. To address this, the entry is frozen with a
> > special SPTE value that clears the present bit. After the TLB shootdown,
> > the entry is set to the eventual value (unfrozen).
> >
> > For the mirrored private page table, hooks are called to update the
> > mirrored private page table in addition to the direct access to the
> > private SPTE. For the zapping case, freezing the SPTE works; the hooks can
> > be called in addition to the TLB shootdown. For populating the private
> > SPTE entry, there can be a race condition without further protection:
> >
> >   vcpu 1: populating 2M private SPTE
> >   vcpu 2: populating 4K private SPTE
> >   vcpu 2: TDX SEAMCALL to update 4K mirrored private SPTE => error
> >   vcpu 1: TDX SEAMCALL to update 2M mirrored private SPTE
> >
> > To avoid the race, the frozen SPTE is utilized. Instead of an atomic
> > update of the private entry, freeze the entry, call the hook that updates
> > the mirrored private SPTE, then set the entry to the final value.
> >
> > Support 4K pages only at this stage. 2M page support can be done in future
> > patches.
> >
> > Add an is_private member to kvm_page_fault to indicate that the fault is
> > private. Also add an is_private member to struct tdp_iter to propagate it.
> >
> > Co-developed-by: Kai Huang
> > Signed-off-by: Kai Huang
> > Signed-off-by: Isaku Yamahata
> > ---
> >  arch/x86/include/asm/kvm-x86-ops.h |   2 +
> >  arch/x86/include/asm/kvm_host.h    |  20 +++
> >  arch/x86/kvm/mmu/mmu.c             |  86 +++++++++-
> >  arch/x86/kvm/mmu/mmu_internal.h    |  37 +++++
> >  arch/x86/kvm/mmu/paging_tmpl.h     |   2 +-
> >  arch/x86/kvm/mmu/tdp_iter.c        |   1 +
> >  arch/x86/kvm/mmu/tdp_iter.h        |   5 +-
> >  arch/x86/kvm/mmu/tdp_mmu.c         | 247 +++++++++++++++++++++++------
> >  arch/x86/kvm/mmu/tdp_mmu.h         |   7 +-
> >  virt/kvm/kvm_main.c                |   1 +
> >  10 files changed, 346 insertions(+), 62 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > index 32a6df784ea6..6982d57e4518 100644
> > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > @@ -93,6 +93,8 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
> >  KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> >  KVM_X86_OP(get_mt_mask)
> >  KVM_X86_OP(load_mmu_pgd)
> > +KVM_X86_OP_OPTIONAL(free_private_sp)
> > +KVM_X86_OP_OPTIONAL(handle_changed_private_spte)
> >  KVM_X86_OP(has_wbinvd_exit)
> >  KVM_X86_OP(get_l2_tsc_offset)
> >  KVM_X86_OP(get_l2_tsc_multiplier)
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index bfc934dc9a33..f2a4d5a18851 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -440,6 +440,7 @@ struct kvm_mmu {
> >                           struct kvm_mmu_page *sp);
> >          void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa);
> >          struct kvm_mmu_root_info root;
> > +        hpa_t private_root_hpa;
> >          union kvm_cpu_role cpu_role;
> >          union kvm_mmu_page_role root_role;
> >
> > @@ -1435,6 +1436,20 @@ static inline u16 kvm_lapic_irq_dest_mode(bool dest_mode_logical)
> >          return dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
> >  }
> >
> > +struct kvm_spte {
> > +        kvm_pfn_t pfn;
> > +        bool is_present;
> > +        bool is_leaf;
> > +};
> > +
> > +struct kvm_spte_change {
> > +        gfn_t gfn;
> > +        enum pg_level level;
> > +        struct kvm_spte old;
> > +        struct kvm_spte new;
> > +        void *sept_page;
> > +};
> > +
> >  struct kvm_x86_ops {
> >          const char *name;
> >
> > @@ -1547,6 +1562,11 @@ struct kvm_x86_ops {
> >          void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> >                               int root_level);
> >
> > +        int (*free_private_sp)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > +                               void *private_sp);
> > +        void (*handle_changed_private_spte)(
> > +                struct kvm *kvm, const struct kvm_spte_change *change);
> > +
> >          bool (*has_wbinvd_exit)(void);
> >
> >          u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index a5bf3e40e209..ef925722ee28 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1577,7 +1577,11 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> >                  flush = kvm_handle_gfn_range(kvm, range, kvm_unmap_rmapp);
> >
> >          if (is_tdp_mmu_enabled(kvm))
> > -                flush = kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush);
> > +                /*
> > +                 * private page needs to be kept and handle page migration
> > +                 * on next EPT violation.
> > +                 */
> > +                flush = kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush, false);
> >
> >          return flush;
> >  }
> > @@ -3082,7 +3086,8 @@ static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fau
> >                   * SPTE value without #VE suppress bit cleared
> >                   * (kvm->arch.shadow_mmio_value = 0).
> >                   */
> > -                if (unlikely(!vcpu->kvm->arch.enable_mmio_caching) ||
> > +                if (unlikely(!vcpu->kvm->arch.enable_mmio_caching &&
> > +                             !kvm_gfn_shared_mask(vcpu->kvm)) ||
> >                      unlikely(fault->gfn > kvm_mmu_max_gfn()))
> >                          return RET_PF_EMULATE;
> >          }
> > @@ -3454,7 +3459,12 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> >                  goto out_unlock;
> >
> >          if (is_tdp_mmu_enabled(vcpu->kvm)) {
> > -                root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
> > +                if (kvm_gfn_shared_mask(vcpu->kvm) &&
> > +                    !VALID_PAGE(mmu->private_root_hpa)) {
> > +                        root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, true);
> > +                        mmu->private_root_hpa = root;
> > +                }
> > +                root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, false);
> >                  mmu->root.hpa = root;
> >          } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> >                  root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
> > @@ -4026,6 +4036,32 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> >          kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> >  }
> >
> > +/*
> > + * Private page can't be release on mmu_notifier without losing page contents.
> > + * The help, callback, from backing store is needed to allow page migration.
> > + * For now, pin the page.
> > + */
> > +static int kvm_faultin_pfn_private_mapped(struct kvm_vcpu *vcpu,
> > +                                          struct kvm_page_fault *fault)
> > +{
> > +        hva_t hva = gfn_to_hva_memslot(fault->slot, fault->gfn);
> > +        struct page *page[1];
> > +
> > +        fault->map_writable = false;
> > +        fault->pfn = KVM_PFN_ERR_FAULT;
> > +        if (hva == KVM_HVA_ERR_RO_BAD || hva == KVM_HVA_ERR_BAD)
> > +                return RET_PF_CONTINUE;
> > +
> > +        /* TDX allows only RWX. Read-only isn't supported. */
> > +        WARN_ON_ONCE(!fault->write);
> > +        if (pin_user_pages_fast(hva, 1, FOLL_WRITE, page) != 1)
> > +                return RET_PF_INVALID;
> > +
> > +        fault->map_writable = true;
> > +        fault->pfn = page_to_pfn(page[0]);
> > +        return RET_PF_CONTINUE;
> > +}
> > +
> >  static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  {
> >          struct kvm_memory_slot *slot = fault->slot;
> > @@ -4058,6 +4094,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >                  return RET_PF_EMULATE;
> >          }
> >
> > +        if (fault->is_private)
> > +                return kvm_faultin_pfn_private_mapped(vcpu, fault);
> > +
> >          async = false;
> >          fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
> >                                            fault->write, &fault->map_writable,
> > @@ -4110,6 +4149,17 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> >                  mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> >  }
> >
> > +void kvm_mmu_release_fault(struct kvm *kvm, struct kvm_page_fault *fault, int r)
> > +{
> > +        if (is_error_noslot_pfn(fault->pfn) || kvm_is_reserved_pfn(fault->pfn))
> > +                return;
> > +
> > +        if (fault->is_private)
> > +                put_page(pfn_to_page(fault->pfn));
>
> The pin_user_pages_fast() used above sets FOLL_PIN internally, so should we
> use unpin_user_page() here?  FOLL_PIN means the unpin should be done with
> unpin_user_page(), not put_page(); please see
> Documentation/core-api/pin_user_pages.rst and the comments on FOLL_PIN.

To align with large page support, I'll change it to use get_user_pages_fast()
and put_page().
--
Isaku Yamahata
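For reference, a minimal sketch of the two pairings discussed above:
pin_user_pages_fast() takes a FOLL_PIN reference that must be dropped with
unpin_user_page(), while get_user_pages_fast() takes a plain page reference
that is dropped with put_page(). The kvm_private_* helper names below are
hypothetical and only illustrate the pairing; they are not part of the patch.

    /*
     * Illustration only -- not part of the patch.  Helper names are made up.
     */
    #include <linux/mm.h>

    /* Pairing A: FOLL_PIN reference, released with unpin_user_page(). */
    static int kvm_private_pin_page(unsigned long hva, struct page **page)
    {
            /* pin_user_pages_fast() sets FOLL_PIN internally. */
            if (pin_user_pages_fast(hva, 1, FOLL_WRITE, page) != 1)
                    return -EFAULT;
            return 0;
    }

    static void kvm_private_unpin_page(struct page *page)
    {
            unpin_user_page(page);          /* not put_page() */
    }

    /* Pairing B: plain reference, released with put_page(). */
    static int kvm_private_get_page(unsigned long hva, struct page **page)
    {
            if (get_user_pages_fast(hva, 1, FOLL_WRITE, page) != 1)
                    return -EFAULT;
            return 0;
    }

    static void kvm_private_put_page(struct page *page)
    {
            put_page(page);                 /* pairs with get_user_pages_fast() */
    }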