Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
MIME-Version: 1.0
References: <cover.1649219184.git.kai.huang@intel.com> <522e37eb-68fc-35db-44d5-479d0088e43f@intel.com>
 <ecf718abf864bbb2366209f00d4315ada090aedc.camel@intel.com>
 <de24ac7e-349c-e49a-70bb-31b9bc867b10@intel.com> <9b388f54f13b34fe684ef77603fc878952e48f87.camel@intel.com>
 <d98ca73b-2d2d-757d-e937-acc83cfedfb0@intel.com> <c90a10763969077826f42be6f492e3a3e062326b.camel@intel.com>
 <fc1ca04d94ad45e79c0297719d5ef50a7c33c352.camel@intel.com>
In-Reply-To: <fc1ca04d94ad45e79c0297719d5ef50a7c33c352.camel@intel.com>
From:   Dan Williams <dan.j.williams@intel.com>
Date:   Thu, 28 Apr 2022 20:04:19 -0700
Message-ID: <CAPcyv4gEwjnNE9cWb_KLZ6C7-UxKdUMZKFPF+LAJ4L1SjByisw@mail.gmail.com>
Subject: Re: [PATCH v3 00/21] TDX host kernel support
To:     Kai Huang <kai.huang@intel.com>
Cc:     Dave Hansen <dave.hansen@intel.com>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
        KVM list <kvm@vger.kernel.org>,
        Sean Christopherson <seanjc@google.com>,
        Paolo Bonzini <pbonzini@redhat.com>,
        "Brown, Len" <len.brown@intel.com>,
        "Luck, Tony" <tony.luck@intel.com>,
        Rafael J Wysocki <rafael.j.wysocki@intel.com>,
        Reinette Chatre <reinette.chatre@intel.com>,
        Peter Zijlstra <peterz@infradead.org>,
        Andi Kleen <ak@linux.intel.com>,
        "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
        Kuppuswamy Sathyanarayanan 
        <sathyanarayanan.kuppuswamy@linux.intel.com>,
        Isaku Yamahata <isaku.yamahata@intel.com>
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

On Thu, Apr 28, 2022 at 6:40 PM Kai Huang <kai.huang@intel.com> wrote:
>
> On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > > On 4/27/22 17:37, Kai Huang wrote:
> > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > > >
> > > > I thought we could document this in the documentation saying that this code can
> > > > only work on TDX machines that don't have above capabilities (SPR for now).  We
> > > > can change the code and the documentation  when we add the support of those
> > > > features in the future, and update the documentation.
> > > >
> > > > If 5 years later someone takes this code, he/she should take a look at the
> > > > documentation and figure out that he/she should choose a newer kernel if the
> > > > machine support those features.
> > > >
> > > > I'll think about design solutions if above doesn't look good for you.
> > >
> > > No, it doesn't look good to me.
> > >
> > > You can't just say:
> > >
> > >     /*
> > >      * This code will eat puppies if used on systems with hotplug.
> > >      */
> > >
> > > and merrily await the puppy bloodbath.
> > >
> > > If it's not compatible, then you have to *MAKE* it not compatible in a
> > > safe, controlled way.
> > >
> > > > > You can't just ignore the problems because they're not present on one
> > > > > version of the hardware.
> > >
> > > Please, please read this again ^^
> >
> > OK.  I'll think about solutions and come back later.
> > >
>
> Hi Dave,
>
> I think we have two approaches to handle memory hotplug interaction with the TDX
> module initialization.
>
> The first approach is simple.  We just block memory from being added as system
> RAM managed by page allocator when the platform supports TDX [1]. It seems we
> can add some arch-specific-check to __add_memory_resource() and reject the new
> memory resource if platform supports TDX.  __add_memory_resource() is called by
> both __add_memory() and add_memory_driver_managed() so it prevents from adding
> NVDIMM as system RAM and normal ACPI memory hotplug [2].

What if the memory being added *is* TDX capable? What if someone
wanted to manage a memory range as soft-reserved and move it back and
forth from the core-mm to device access. That should be perfectly
acceptable as long as the memory is TDX capable.

> The second approach is relatively more complicated.  Instead of directly
> rejecting the new memory resource in __add_memory_resource(), we check whether
> the memory resource can be added based on CMR and the TDX module initialization
> status.   This is feasible as with the latest public P-SEAMLDR spec, we can get
> CMR from P-SEAMLDR SEAMCALL[3].  So we can detect P-SEAMLDR and get CMR info
> during kernel boots.  And in __add_memory_resource() we do below check:
>
>         tdx_init_disable();     /*similar to cpu_hotplug_disable() */
>         if (tdx_module_initialized())
>                 // reject memory hotplug
>         else if (new_memory_resource NOT in CMRs)
>                 // reject memory hotplug
>         else
>                 allow memory hotplug
>         tdx_init_enable();      /*similar to cpu_hotplug_enable() */
>
> tdx_init_disable() temporarily disables TDX module initialization by trying to
> grab the mutex.  If the TDX module initialization is already on going, then it
> waits until it completes.
>
> This should work better for future platforms, but would requires non-trivial
> more code as we need to add VMXON/VMXOFF support to the core-kernel to detect
> CMR using  SEAMCALL.  A side advantage is with VMXON in core-kernel we can
> shutdown the TDX module in kexec().
>
> But for this series I think the second approach is overkill and we can choose to
> use the first simple approach?

This still sounds like it is trying to solve symptoms and not the root
problem. Why must the core-mm never have non-TDX memory when VMs are
fine to operate with either core-mm pages or memory from other sources
like hugetlbfs and device-dax?