Date: Fri, 17 Nov 2023 12:15:23 -0800
From: Isaku Yamahata
To: "Wang, Wei W"
Cc: "Yamahata, Isaku", "kvm@vger.kernel.org", "linux-kernel@vger.kernel.org",
    "isaku.yamahata@gmail.com", Paolo Bonzini, "Aktas, Erdem",
    "Christopherson, Sean", "Shahar, Sagi", David Matlack, "Huang, Kai",
    Zhi Wang, "Chen, Bo2", "Yuan, Hang", "Zhang, Tina",
    "gkirkpatrick@google.com", isaku.yamahata@linux.intel.com
Subject: Re: [PATCH v16 059/116] KVM: TDX: Create initial guest memory
Message-ID: <20231117201523.GD1109547@ls.amr.corp.intel.com>

On Fri, Nov 17, 2023 at 12:56:32PM +0000, "Wang, Wei W" wrote:
> On Tuesday, October 17, 2023 12:14 AM, isaku.yamahata@intel.com wrote:
> > Because the guest memory is protected in TDX, the creation of the initial guest
> > memory requires a dedicated TDX module API, tdh_mem_page_add, instead of
> > directly copying the memory contents into the guest memory in the case of
> > the default VM type. KVM MMU page fault handler callback, private_page_add,
> > handles it.
> >
> > Define new subcommand, KVM_TDX_INIT_MEM_REGION, of VM-scoped
> > KVM_MEMORY_ENCRYPT_OP. It assigns the guest page, copies the initial
> > memory contents into the guest memory, encrypts the guest memory. At the
> > same time, optionally it extends memory measurement of the TDX guest. It
> > calls the KVM MMU page fault (EPT-violation) handler to trigger the callbacks
> > for it.
> >
> > Reported-by: gkirkpatrick@google.com
> > Signed-off-by: Isaku Yamahata
> >
> > ---
> > v15 -> v16:
> > - add check if nr_pages isn't large with
> >   (nr_page << PAGE_SHIFT) >> PAGE_SHIFT
> >
> > v14 -> v15:
> > - add a check if TD is finalized or not to tdx_init_mem_region()
> > - return -EAGAIN when partial population
> > ---
> >  arch/x86/include/uapi/asm/kvm.h       |   9 ++
> >  arch/x86/kvm/mmu/mmu.c                |   1 +
> >  arch/x86/kvm/vmx/tdx.c                | 167 +++++++++++++++++++++++++-
> >  arch/x86/kvm/vmx/tdx.h                |   2 +
> >  tools/arch/x86/include/uapi/asm/kvm.h |   9 ++
> >  5 files changed, 185 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> > index 311a7894b712..a1815fcbb0be 100644
> > --- a/arch/x86/include/uapi/asm/kvm.h
> > +++ b/arch/x86/include/uapi/asm/kvm.h
> > @@ -572,6 +572,7 @@ enum kvm_tdx_cmd_id {
> >          KVM_TDX_CAPABILITIES = 0,
> >          KVM_TDX_INIT_VM,
> >          KVM_TDX_INIT_VCPU,
> > +        KVM_TDX_INIT_MEM_REGION,
> >
> >          KVM_TDX_CMD_NR_MAX,
> >  };
> > @@ -645,4 +646,12 @@ struct kvm_tdx_init_vm {
> >          struct kvm_cpuid2 cpuid;
> >  };
> >
> > +#define KVM_TDX_MEASURE_MEMORY_REGION        (1UL << 0)
> > +
> > +struct kvm_tdx_init_mem_region {
> > +        __u64 source_addr;
> > +        __u64 gpa;
> > +        __u64 nr_pages;
> > +};
> > +
> >  #endif /* _ASM_X86_KVM_H */
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 107cf27505fe..63a4efd1e40a 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5652,6 +5652,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
> >  out:
> >          return r;
> >  }
> > +EXPORT_SYMBOL(kvm_mmu_load);
> >
> >  void kvm_mmu_unload(struct kvm_vcpu *vcpu)
> >  {
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index a5f1b3e75764..dc17c212cb38 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -470,6 +470,21 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
> >          td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
> >  }
> >
> > +static void tdx_measure_page(struct kvm_tdx *kvm_tdx, hpa_t gpa)
> > +{
> > +        struct tdx_module_args out;
> > +        u64 err;
> > +        int i;
> > +
> > +        for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> > +                err = tdh_mr_extend(kvm_tdx->tdr_pa, gpa + i, &out);
> > +                if (KVM_BUG_ON(err, &kvm_tdx->kvm)) {
> > +                        pr_tdx_error(TDH_MR_EXTEND, err, &out);
> > +                        break;
> > +                }
> > +        }
> > +}
> > +
> >  static void tdx_unpin(struct kvm *kvm, kvm_pfn_t pfn)
> >  {
> >          struct page *page = pfn_to_page(pfn);
> > @@ -533,6 +548,61 @@ static int tdx_sept_page_aug(struct kvm *kvm, gfn_t gfn,
> >          return 0;
> >  }
> >
> > +static int tdx_sept_page_add(struct kvm *kvm, gfn_t gfn,
> > +                             enum pg_level level, kvm_pfn_t pfn)
> > +{
> > +        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +        hpa_t hpa = pfn_to_hpa(pfn);
> > +        gpa_t gpa = gfn_to_gpa(gfn);
> > +        struct tdx_module_args out;
> > +        hpa_t source_pa;
> > +        bool measure;
> > +        u64 err;
> > +
> > +        /*
> > +         * KVM_INIT_MEM_REGION, tdx_init_mem_region(), supports only 4K page
> > +         * because tdh_mem_page_add() supports only 4K page.
> > +         */
> > +        if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> > +                return -EINVAL;
> > +
> > +        /*
> > +         * In case of TDP MMU, fault handler can run concurrently. Note
> > +         * 'source_pa' is a TD scope variable, meaning if there are multiple
> > +         * threads reaching here with all needing to access 'source_pa', it
> > +         * will break. However fortunately this won't happen, because below
> > +         * TDH_MEM_PAGE_ADD code path is only used when VM is being created
> > +         * before it is running, using KVM_TDX_INIT_MEM_REGION ioctl (which
> > +         * always uses vcpu 0's page table and protected by vcpu->mutex).
> > +         */
> > +        if (KVM_BUG_ON(kvm_tdx->source_pa == INVALID_PAGE, kvm)) {
> > +                tdx_unpin(kvm, pfn);
> > +                return -EINVAL;
> > +        }
> > +
> > +        source_pa = kvm_tdx->source_pa & ~KVM_TDX_MEASURE_MEMORY_REGION;
> > +        measure = kvm_tdx->source_pa & KVM_TDX_MEASURE_MEMORY_REGION;
> > +        kvm_tdx->source_pa = INVALID_PAGE;
> > +
> > +        do {
> > +                err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, hpa, source_pa,
> > +                                       &out);
> > +                /*
> > +                 * This path is executed during populating initial guest memory
> > +                 * image. i.e. before running any vcpu.  Race is rare.
> > +                 */
> > +        } while (unlikely(err == TDX_ERROR_SEPT_BUSY));
> > +        if (KVM_BUG_ON(err, kvm)) {
> > +                pr_tdx_error(TDH_MEM_PAGE_ADD, err, &out);
> > +                tdx_unpin(kvm, pfn);
> > +                return -EIO;
> > +        } else if (measure)
> > +                tdx_measure_page(kvm_tdx, gpa);
> > +
> > +        return 0;
> > +
> > +}
> > +
> >  static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> >                                       enum pg_level level, kvm_pfn_t pfn)
> >  {
> > @@ -555,9 +625,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> >          if (likely(is_td_finalized(kvm_tdx)))
> >                  return tdx_sept_page_aug(kvm, gfn, level, pfn);
> >
> > -        /* TODO: tdh_mem_page_add() comes here for the initial memory. */
> > -
> > -        return 0;
> > +        return tdx_sept_page_add(kvm, gfn, level, pfn);
> >  }
> >
> >  static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > @@ -1265,6 +1333,96 @@ void tdx_flush_tlb_current(struct kvm_vcpu *vcpu)
> >          tdx_track(vcpu->kvm);
> >  }
> >
> > +#define TDX_SEPT_PFERR        (PFERR_WRITE_MASK | PFERR_GUEST_ENC_MASK)
> > +
> > +static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> > +{
> > +        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +        struct kvm_tdx_init_mem_region region;
> > +        struct kvm_vcpu *vcpu;
> > +        struct page *page;
> > +        int idx, ret = 0;
> > +        bool added = false;
> > +
> > +        /* Once TD is finalized, the initial guest memory is fixed. */
> > +        if (is_td_finalized(kvm_tdx))
> > +                return -EINVAL;
> > +
> > +        /* The BSP vCPU must be created before initializing memory regions. */
> > +        if (!atomic_read(&kvm->online_vcpus))
> > +                return -EINVAL;
> > +
> > +        if (cmd->flags & ~KVM_TDX_MEASURE_MEMORY_REGION)
> > +                return -EINVAL;
> > +
> > +        if (copy_from_user(&region, (void __user *)cmd->data, sizeof(region)))
> > +                return -EFAULT;
> > +
> > +        /* Sanity check */
> > +        if (!IS_ALIGNED(region.source_addr, PAGE_SIZE) ||
> > +            !IS_ALIGNED(region.gpa, PAGE_SIZE) ||
> > +            !region.nr_pages ||
> > +            region.nr_pages & GENMASK_ULL(63, 63 - PAGE_SHIFT) ||
> > +            region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa ||
> > +            !kvm_is_private_gpa(kvm, region.gpa) ||
> > +            !kvm_is_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT)))
> > +                return -EINVAL;
> > +
> > +        vcpu = kvm_get_vcpu(kvm, 0);
> > +        if (mutex_lock_killable(&vcpu->mutex))
> > +                return -EINTR;
> > +
> > +        vcpu_load(vcpu);
> > +        idx = srcu_read_lock(&kvm->srcu);
> > +
> > +        kvm_mmu_reload(vcpu);
> > +
> > +        while (region.nr_pages) {
> > +                if (signal_pending(current)) {
> > +                        ret = -ERESTARTSYS;
> > +                        break;
> > +                }
> > +
> > +                if (need_resched())
> > +                        cond_resched();
> > +
> > +                /* Pin the source page. */
> > +                ret = get_user_pages_fast(region.source_addr, 1, 0, &page);
> > +                if (ret < 0)
> > +                        break;
> > +                if (ret != 1) {
> > +                        ret = -ENOMEM;
> > +                        break;
> > +                }
> > +
> > +                kvm_tdx->source_pa = pfn_to_hpa(page_to_pfn(page)) |
> > +                                     (cmd->flags & KVM_TDX_MEASURE_MEMORY_REGION);
> > +
>
> Is it fundamentally correct to take a userspace mapped page to add as a TD private page?
> Maybe take the corresponding page from gmem and do a copy to it?
> For example:
>         ret = get_user_pages_fast(region.source_addr, 1, 0, &user_page);
>         ...
>         kvm_gmem_get_pfn(kvm, gfn_to_memslot(kvm, gfn), gfn, &gmem_pfn, NULL);
>         memcpy(__va(gmem_pfn << PAGE_SHIFT), page_to_virt(user_page), PAGE_SIZE);
>         kvm_tdx->source_pa = pfn_to_hpa(gmem_pfn) |
>                              (cmd->flags & KVM_TDX_MEASURE_MEMORY_REGION)

Please refer to

  static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
                                       enum pg_level level, kvm_pfn_t pfn)

The guest memfd provides the page for the gfn, and that page is different from
kvm_tdx->source_pa.  The function calls tdh_mem_page_add():

  tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, hpa, source_pa, &out);

gpa: corresponds to the page from guest memfd.
source_pa: corresponds to the page that tdx_init_mem_region() pinned down.

tdh_mem_page_add() copies the page contents from source_pa to gpa, and gives
gpa, not source_pa, to the TD guest.
--
Isaku Yamahata
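
As a reference for how a VMM might drive this path, below is a minimal,
hypothetical userspace sketch of the KVM_TDX_INIT_MEM_REGION call discussed
above.  It is not part of the patch: the helper name td_init_mem_region() and
its parameters are made up, and it assumes the series' uapi headers are
installed (struct kvm_tdx_cmd with id/flags/data/error fields) and that the
caller has already created the TD, created vCPU 0, and prepared a page-aligned
source buffer plus a private, page-aligned GPA range backed by guest memfd.

  /*
   * Hypothetical VMM-side sketch, not part of the patch: populate the TD's
   * initial memory through the VM-scoped KVM_MEMORY_ENCRYPT_OP ioctl and
   * optionally extend the TD measurement over it.
   */
  #include <stdbool.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static int td_init_mem_region(int vm_fd, void *src, uint64_t gpa,
                                uint64_t nr_pages, bool measure)
  {
          struct kvm_tdx_init_mem_region region = {
                  /* Page-aligned buffer in the VMM holding the initial image. */
                  .source_addr = (uint64_t)(uintptr_t)src,
                  /* Page-aligned private GPA where the image is placed. */
                  .gpa = gpa,
                  .nr_pages = nr_pages,
          };
          struct kvm_tdx_cmd cmd = {
                  .id = KVM_TDX_INIT_MEM_REGION,
                  .flags = measure ? KVM_TDX_MEASURE_MEMORY_REGION : 0,
                  .data = (uint64_t)(uintptr_t)&region,
          };

          /*
           * Must be issued after KVM_TDX_INIT_VM and after vCPU 0 exists, and
           * before the TD is finalized; tdx_init_mem_region() rejects the call
           * otherwise.
           */
          return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
  }

Per the v14 -> v15 changelog above, a partially populated region is reported
as -EAGAIN, so a caller may need to retry the remaining pages.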