Subject: Re: [PATCH v3 00/21] TDX host kernel support
From: Kai Huang
To: Dan Williams
Cc: Dave Hansen, Linux Kernel Mailing List, KVM list, Sean Christopherson,
    Paolo Bonzini, "Brown, Len", "Luck, Tony", Rafael J Wysocki,
    Reinette Chatre, Peter Zijlstra, Andi Kleen, "Kirill A. Shutemov",
    Kuppuswamy Sathyanarayanan, Isaku Yamahata, Mike Rapoport
Date: Fri, 06 May 2022 12:45:54 +1200
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 2022-05-05 at 17:22 -0700, Dan
Williams wrote:
> On Thu, May 5, 2022 at 3:14 PM Kai Huang wrote:
> >
> > Thanks for the feedback!
> >
> > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> > > [ add Mike ]
> > >
> > > On Thu, May 5, 2022 at 2:54 AM Kai Huang wrote:
> > > [..]
> > > >
> > > > Hi Dave,
> > > >
> > > > Sorry to ping (trying to close this).
> > > >
> > > > Given we don't need to consider kmem hot-add of legacy PMEM after TDX
> > > > module initialization, I think for now it's totally fine to exclude
> > > > legacy PMEMs from TDMRs.  The worst case is that when someone tries to
> > > > use them directly as TD guest backend, the TD will fail to create.
> > > > IMO that's acceptable, as supposedly no one should just use some
> > > > random backend to run a TD.
> > >
> > > The platform will already do this, right?
> >
> > In the current v3 implementation, we don't have any code to handle memory
> > hotplug, therefore nothing prevents people from adding legacy PMEMs as
> > system RAM using the kmem driver.  In order to guarantee all pages
> > managed by the page
>
> That's the fundamental question I am asking: why "guarantee all
> pages managed by the page allocator are TDX memory"?  That seems overkill
> compared to indicating the incompatibility after the fact.

As I explained, the reason is that I don't want to modify the page allocator
to distinguish TDX and non-TDX allocations, for instance by having to add a
ZONE_TDX and a GFP_TDX.  KVM depends on the host's page fault handler to
allocate the page; in fact, KVM only consumes PFNs from the host's page
tables.  For now only RAM is TDX memory.  By guaranteeing that all pages in
the page allocator are TDX memory, we can easily use anonymous pages as TD
guest memory.
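To make that contrast concrete, here is a toy userspace model (plain C, not
kernel code; the ZONE_TDX/GFP_TDX-style flag and both helper names are
hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy model of a page and the two designs discussed above. */
struct page {
	bool is_tdx;
};

/* Rejected design: the allocator distinguishes TDX allocations, so every
 * caller (KVM fault path, shmem, anonymous memory, ...) must pass a
 * hypothetical GFP_TDX-style flag to get a convertible page. */
struct page *alloc_page_with_flag(bool gfp_tdx)
{
	struct page *p = malloc(sizeof(*p));
	if (p)
		p->is_tdx = gfp_tdx;	/* allocator must track capability */
	return p;
}

/* Chosen design: the invariant "every page in the page allocator is TDX
 * memory" holds unconditionally, so callers need no TDX awareness. */
struct page *alloc_page_uniform(void)
{
	struct page *p = malloc(sizeof(*p));
	if (p)
		p->is_tdx = true;	/* invariant holds for every page */
	return p;
}
```

In the flagged design, any caller that forgets the flag hands a TD an
unusable page; in the uniform design that failure mode cannot occur.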
This also allows us to easily extend shmem to support a new fd-based backend
which doesn't require mmap()-ing TD guest memory to host userspace:

https://lore.kernel.org/kvm/20220310140911.50924-1-chao.p.peng@linux.intel.com/

Also, besides TD guest memory, there are some per-TD control data structures
(which must be TDX memory too) that need to be allocated for each TD.  Normal
memory allocation APIs can be used for such allocations if we guarantee all
pages in the page allocator are TDX memory.

> > allocator are all TDX memory, the v3 implementation needs to always
> > include legacy PMEMs as TDX memory, so that even if people truly add
> > legacy PMEMs as system RAM, we can still guarantee all pages in the
> > page allocator are TDX memory.
>
> Why?

If we don't include legacy PMEMs as TDX memory, then after they are hot-added
as system RAM using the kmem driver, the assumption that "all pages in the
page allocator are TDX memory" is broken, and a TD can be killed during
runtime.

> > Of course, a side benefit of always including legacy PMEMs is that
> > people theoretically can use them directly as TD guest backend, but this
> > is just a bonus, not something that we need to guarantee.
> >
> > > I don't understand why this
> > > is trying to take proactive action versus documenting the error
> > > conditions and steps someone needs to take to avoid unconvertible
> > > memory. There is already the CONFIG_HMEM_REPORTING that describes
> > > relative performance properties between initiators and targets, it
> > > seems fitting to also add security properties between initiators and
> > > targets so someone can enumerate the numa-mempolicy that avoids
> > > unconvertible memory.
> >
> > I don't think there's anything related to performance properties here.
> > The only goal here is to make sure all pages in the page allocator are
> > TDX memory pages.
>
> Please reconsider or re-clarify that goal.
>
> > > > No, special casing in hotplug code paths needed.
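(Concretely, such "special casing" in the hotplug paths would amount to a
range check against the TDX-convertible ranges recorded at boot.  A toy
userspace model follows; the range values and helper names are illustrative,
not the actual kernel API:)

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct mem_range {
	uint64_t start, end;	/* [start, end) physical address range */
};

/* TDX-convertible ranges recorded at boot (illustrative values). */
static const struct mem_range tdx_ranges[] = {
	{ 0x000000000ULL, 0x080000000ULL },	/* 0  .. 2G */
	{ 0x100000000ULL, 0x200000000ULL },	/* 4G .. 8G */
};

static bool range_is_tdx_memory(uint64_t start, uint64_t end)
{
	for (size_t i = 0; i < sizeof(tdx_ranges) / sizeof(tdx_ranges[0]); i++)
		if (start >= tdx_ranges[i].start && end <= tdx_ranges[i].end)
			return true;
	return false;
}

/* In the kernel this check would sit in a memory-hotplug notifier
 * (MEM_GOING_ONLINE) and return NOTIFY_BAD for non-TDX ranges; here it
 * simply returns -1 to reject the hot-add. */
static int tdx_check_hotadd(uint64_t start, uint64_t end)
{
	return range_is_tdx_memory(start, end) ? 0 : -1;
}
```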
> > > >
> > > > I think w/o needing to include legacy PMEM, it's better to get all
> > > > TDX memory blocks based on memblock, not e820.  The pages managed by
> > > > the page allocator come from memblock anyway (except those from
> > > > memory hotplug).
> > > >
> > > > And I also think it makes more sense to introduce 'tdx_memblock' and
> > > > 'tdx_memory' data structures to gather all TDX memory blocks during
> > > > boot, when memblock is still alive.  When the TDX module is
> > > > initialized at runtime, TDMRs can be created based on the 'struct
> > > > tdx_memory' which contains all the TDX memory blocks we gathered from
> > > > memblock during boot.  This is also more flexible for supporting TDX
> > > > memory from other sources, such as CXL memory, in the future.
> > > >
> > > > Please let me know if you have any objection.  Thanks!
> > >
> > > It's already the case that x86 maintains sideband structures to
> > > preserve memory after exiting the early memblock code.
> >
> > May I ask what data structures you are referring to?
>
> struct numa_meminfo.

> > Btw, the purpose of 'tdx_memblock' and 'tdx_memory' is not just to
> > preserve memblock info during boot.  They also provide a common data
> > structure that the "constructing TDMRs" code can work on.  If you look at
> > patches 11-14, the logic (create TDMRs, allocate PAMTs, set up reserved
> > areas) around how to construct TDMRs doesn't have a hard dependency on
> > e820.  If we construct TDMRs based on a common 'tdx_memory' like below:
> >
> >         int construct_tdmrs(struct tdx_memory *tmem, ...);
> >
> > it would be much easier to support other TDX memory resources in the
> > future.
>
> "in the future" is a prompt to ask "Why not wait until that future /
> need arrives before adding new infrastructure?"

Fine to me.

-- 
Thanks,
-Kai
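For reference, a userspace sketch of the 'tdx_memblock' / 'tdx_memory' shape
discussed above (the field names and the construct_tdmrs() stub are
illustrative guesses based on this thread, not the actual patches):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One contiguous TDX-convertible block gathered from memblock at boot. */
struct tdx_memblock {
	uint64_t start_pfn;
	uint64_t end_pfn;
	int nid;		/* NUMA node the block belongs to */
};

/* All TDX memory blocks: the common input that construct_tdmrs() works on,
 * independent of whether they came from e820, memblock, or (later) CXL. */
struct tdx_memory {
	struct tdx_memblock *blocks;
	size_t nr_blocks;
};

static int tdx_memory_add_block(struct tdx_memory *tmem,
				uint64_t start_pfn, uint64_t end_pfn, int nid)
{
	struct tdx_memblock *nb;

	nb = realloc(tmem->blocks, (tmem->nr_blocks + 1) * sizeof(*nb));
	if (!nb)
		return -1;
	nb[tmem->nr_blocks] = (struct tdx_memblock){ start_pfn, end_pfn, nid };
	tmem->blocks = nb;
	tmem->nr_blocks++;
	return 0;
}

/* Stub: would create TDMRs, allocate PAMTs, and set up reserved areas
 * purely from 'tmem', with no hard dependency on e820. */
static int construct_tdmrs(struct tdx_memory *tmem)
{
	return tmem->nr_blocks ? 0 : -1;
}
```

The point of the shape is that the TDMR-construction code only sees 'struct
tdx_memory', so a new memory source only needs a new gathering step.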