From: Dan Williams
Date: Fri, 29 Apr 2022 10:48:06 -0700
Subject: Re: [PATCH v3 00/21] TDX host kernel support
To: Dave Hansen
Cc: Kai Huang, Linux Kernel Mailing List, KVM list, Sean Christopherson,
    Paolo Bonzini, "Brown, Len", "Luck, Tony", Rafael J Wysocki,
    Reinette Chatre, Peter Zijlstra, Andi Kleen, "Kirill A. Shutemov",
    Kuppuswamy Sathyanarayanan, Isaku Yamahata

On Fri, Apr 29, 2022 at 10:18 AM Dave Hansen wrote:
>
> On 4/29/22 08:18, Dan Williams wrote:
> > Yes, I want to challenge the idea that all core-mm memory must be TDX
> > capable. Instead, this feels more like something that wants a
> > hugetlbfs / dax-device like capability to ask the kernel to gather /
> > set aside the enumerated TDX memory out of all the general purpose
> > memory it knows about, and then VMs use that ABI to get access to
> > convertible memory. Trying to ensure that all page allocator memory is
> > TDX capable feels too restrictive with all the different ways pfns can
> > get into the allocator.
>
> The KVM users are the problem here. They use a variety of ABIs to get
> memory and then hand it to KVM. KVM basically just consumes the
> physical addresses from the page tables.
>
> Also, there's no _practical_ problem here today. I can't actually think
> of a case where any memory that ends up in the allocator on today's TDX
> systems is not TDX capable.
>
> Tomorrow's systems are going to be the problem. They'll (presumably)
> have a mix of CXL devices with varying capabilities. Some will surely
> lack the metadata storage for checksums and TD-owner bits. TDX use will
> be *safe* on those systems: if you take this code and run it on one of
> tomorrow's systems, it will notice the TDX-incompatible memory and will
> disable TDX.
>
> The only way around this that I can see is to introduce ABI today that
> anticipates the needs of future systems. We could require that all the
> KVM memory be "validated" before handing it to TDX. Maybe a new syscall
> that says: "make sure this mapping works for TDX". Or it could be new
> sysfs ABI that specifies which NUMA nodes contain TDX-capable memory.

Yes, node-id seems the only reasonable handle that can be used, and it
does not seem too onerous for a KVM user to have to set a node policy
preferring all the TDX / confidential-computing capable nodes.

> But neither of those really helps with, say, a device-DAX mapping of
> TDX-*IN*capable memory handed to KVM. The "new syscall" would just
> throw up its hands and leave users with the same result: TDX can't be
> used. The new sysfs ABI for NUMA nodes wouldn't clearly apply to
> device-DAX either, because device-DAX doesn't respect the NUMA policy
> ABI.

Device-DAX does have "target_node" attributes to associate node-specific
metadata, and it could certainly express target_node capabilities in its
own ABI. Then it's just a matter of making pfn_to_nid() do the right
thing so the KVM kernel side can validate the capabilities of all
inbound pfns.

> I'm open to ideas here. If there's a viable ABI we can introduce to
> train TDX users today that will work tomorrow too, I'm all for it.

In general, expressing NUMA node performance and capabilities is
something Linux needs to get better at. HMAT data, for example, still
exists as sideband information ignored by numactl, but it feels
inevitable that performance and capability details become more of a
first-class citizen for applications that have these
mem-allocation-policy constraints in the presence of disparate memory
types.