References: <20240112055251.36101-1-vannapurve@google.com>
In-Reply-To: <20240112055251.36101-1-vannapurve@google.com>
From: Vishal Annapurve <vannapurve@google.com>
Date: Tue, 30 Jan 2024 22:12:54 +0530
Message-ID: <CAGtprH9FA3-RetE=6i7ezxfV0qEV-8z3HLgPEPY=pzuxSiOD+w@mail.gmail.com>
Subject: Re: [RFC V1 0/5] x86: CVMs: Align memory conversions to 2M granularity
To: x86@kernel.org, linux-kernel@vger.kernel.org, hch@lst.de, 
	petrtesarik@huaweicloud.com, Dave Hansen <dave.hansen@linux.intel.com>, 
	Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, 
	m.szyprowski@samsung.com, robin.murphy@arm.com
Cc: pbonzini@redhat.com, rientjes@google.com, seanjc@google.com, 
	erdemaktas@google.com, ackerleytng@google.com, jxgao@google.com, 
	sagis@google.com, oupton@google.com, peterx@redhat.com, vkuznets@redhat.com, 
	dmatlack@google.com, pgonda@google.com, michael.roth@amd.com, 
	kirill@shutemov.name, thomas.lendacky@amd.com, linux-coco@lists.linux.dev, 
	chao.p.peng@linux.intel.com, isaku.yamahata@gmail.com, andrew.jones@linux.dev, 
	corbet@lwn.net, rostedt@goodmis.org, iommu@lists.linux.dev

On Fri, Jan 12, 2024 at 11:22 AM Vishal Annapurve <vannapurve@google.com> wrote:
>
> The goal of this series is to align memory conversion requests from CVMs to
> huge page sizes, allowing better host side management of guest memory and
> optimized page table walks.
>
> This patch series is partially tested and needs more work; I am seeking
> feedback from the wider community before making further progress.
>
> Background
> =====================
> Confidential VMs (CVMs) support two types of guest memory ranges:
> 1) Private Memory: Intended to be consumed/modified only by the CVM.
> 2) Shared Memory: Visible to both guest/host components, used for
> non-trusted IO.
>
> Guest memfd [1] support is set to be merged upstream to handle guest private
> memory isolation from host userspace. The guest memfd approach allows the
> following setup:
> * Private memory backed by the guest memfd file, which is not accessible
>   from host userspace.
> * Shared memory backed by tmpfs/hugetlbfs files that are accessible from
>   host userspace.
>
> Userspace VMM needs to register two backing stores for all of the guest
> memory ranges:
> * HVA for shared memory
> * Guest memfd ranges for private memory
>
> KVM keeps track of shared/private guest memory ranges that can be updated at
> runtime using IOCTLs. This allows KVM to back the guest memory using either HVA
> (shared) or guest memfd file offsets (private) based on the attributes of the
> guest memory ranges.
>
> In this setup, there is a possibility of "double allocation", i.e. scenarios where
> both shared and private memory backing stores mapped to the same guest memory
> ranges have memory allocated.
>
> The guest issues a hypercall to convert the memory type, which KVM forwards
> to the host userspace.
> The userspace VMM is supposed to handle conversion as follows:
> 1) Private to shared conversion:
>   * Update guest memory attributes for the range to be shared using KVM
>     supported IOCTLs.
>     - While handling this IOCTL, KVM will unmap EPT/NPT entries corresponding
>       to the guest memory being converted.
>   * Unback the guest memfd range.
> 2) Shared to private conversion:
>   * Update guest memory attributes for the range to be private using KVM
>     supported IOCTLs.
>     - While handling this IOCTL, KVM will unmap EPT/NPT entries corresponding
>       to the guest memory being converted.
>   * Unback the shared memory file.
>
> Note that unbacking needs to be done for both kinds of conversions in order to
> avoid double allocation.
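
To make the double allocation argument concrete, here is a minimal sketch
that models a single guest range in plain C. It is only a toy model: the
struct, enum and function names are hypothetical and are not KVM or VMM
interfaces. convert() stands in for the attribute-update IOCTL followed by
the optional unbacking step described above.

```c
#include <stdbool.h>

/* Hypothetical model of one guest memory range; not a KVM data structure. */
enum mem_attr { MEM_SHARED, MEM_PRIVATE };

struct guest_range {
	enum mem_attr attr;	/* attribute KVM tracks for the range */
	bool shared_backed;	/* shared file (HVA) has memory allocated */
	bool private_backed;	/* guest memfd has memory allocated */
};

/* The guest touches the range: the store selected by attr gets populated. */
void guest_access(struct guest_range *r)
{
	if (r->attr == MEM_PRIVATE)
		r->private_backed = true;
	else
		r->shared_backed = true;
}

/*
 * Convert the range to the given type. Updating attr models the attribute
 * IOCTL; clearing the other store's flag models unbacking it. Passing
 * unback=false models a VMM that skips the unbacking step.
 */
void convert(struct guest_range *r, enum mem_attr to, bool unback)
{
	r->attr = to;
	if (!unback)
		return;
	if (to == MEM_SHARED)
		r->private_backed = false;
	else
		r->shared_backed = false;
}

/* Both backing stores hold memory for the same guest range. */
bool double_allocated(const struct guest_range *r)
{
	return r->shared_backed && r->private_backed;
}
```

In this model, a range that is accessed as private, converted to shared
without unbacking and then accessed again ends up with both stores
populated; with unback=true only the shared store stays allocated.
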
>
> Problem
> =====================
> CVMs can convert memory between these two types at 4K granularity. Conversion
> done at 4K granularity causes issues when using guest memfd support
> with hugetlb/hugepage-backed guest private memory:
> 1) Hugetlb fs doesn't allow freeing subpage ranges when punching holes,
> causing all the private to shared memory conversions to result in double
> allocation.
> 2) Even if a new fs is implemented for guest memfd that allows splitting
> hugepages, punching holes at 4K will cause:
>    - loss of vmemmap optimization [2]
>    - more memory for EPT/NPT entries and extra pagetable walks for guest
>      side accesses.
>    - shared memory mappings consuming more host pagetable entries and
>      extra pagetable walks for host side access.
>    - Higher number of conversions with additional overhead of VM exits
>      serviced by host userspace.
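
The arithmetic behind the points above can be sketched as follows. This is
an illustration only: the helper names are hypothetical, and the sketch
assumes that a hole punched in hugepage-backed memory frees only the 2M
pages it covers completely, which is the behaviour described in 1).

```c
#include <stdint.h>

#define SZ_4K	(UINT64_C(1) << 12)
#define SZ_2M	(UINT64_C(1) << 21)

/* Round x down or up to a power-of-two boundary a. */
static inline uint64_t round_down_pow2(uint64_t x, uint64_t a)
{
	return x & ~(a - 1);
}

static inline uint64_t round_up_pow2(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Bytes released when a hole [off, off + len) is punched in a file backed
 * by 2M huge pages, assuming only fully covered huge pages are freed.
 */
uint64_t hugepage_bytes_freed(uint64_t off, uint64_t len)
{
	uint64_t start = round_up_pow2(off, SZ_2M);
	uint64_t end = round_down_pow2(off + len, SZ_2M);

	return end > start ? end - start : 0;
}

/* Conversion requests needed to convert len bytes at a given granule. */
uint64_t conversions_needed(uint64_t len, uint64_t granule)
{
	return (len + granule - 1) / granule;
}
```

Under these assumptions a 4K hole frees nothing, so the private backing
stays allocated after the conversion. Taking a 64M region purely as an
example size, converting it takes 16384 requests at 4K granularity and 32
at 2M granularity, and each 2M range needs 512 leaf entries when mapped at
4K instead of one.
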
>
> Memory conversion scenarios in the guest that are of major concern:
> - SWIOTLB area conversion early during boot.
>    * dma_map_* API invocations for CVMs result in using bounce buffers
>      from SWIOTLB region which is already marked as shared.
> - Device drivers allocating memory using dma_alloc_* APIs at runtime
>   that bypass SWIOTLB.
>
> Proposal
> =====================
> To counter the above issues, this series proposes the following:
> 1) Use boot time allocated SWIOTLB pools for all DMA memory allocated
> using dma_alloc_* APIs.
> 2) Increase memory allocated at boot for SWIOTLB from 6% to 8% for CVMs.
> 3) Enable dynamic SWIOTLB [4] by default for CVMs so that SWIOTLB can be
> scaled up as needed.
> 4) Ensure SWIOTLB pool is 2MB aligned so that all the conversions happen at
> 2M granularity once during boot.
> 5) Add a check to ensure all conversions happen at 2M granularity.
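
As a rough userspace illustration of points 2), 4) and 5), the sizing and
alignment rules could look like the sketch below. The function names and
the CONV_ALIGN macro are hypothetical and are not the helpers used in the
patches; the sizing helper only shows the percentage and rounding steps and
ignores any other limits the kernel may apply.

```c
#include <stdbool.h>
#include <stdint.h>

#define CONV_ALIGN	(UINT64_C(1) << 21)	/* 2M conversion granule */

/* Round a pool size up to the next 2M boundary. */
uint64_t pool_size_align_up(uint64_t bytes)
{
	return (bytes + CONV_ALIGN - 1) & ~(CONV_ALIGN - 1);
}

/*
 * Boot-time pool size: a percentage of guest memory, rounded up to 2M.
 * The division is done first so the multiplication cannot overflow.
 */
uint64_t boot_pool_size(uint64_t guest_mem_bytes, unsigned int percent)
{
	return pool_size_align_up(guest_mem_bytes / 100 * percent);
}

/* True if [addr, addr + len) is a non-empty, 2M aligned range. */
bool conversion_is_aligned(uint64_t addr, uint64_t len)
{
	return len != 0 && ((addr | len) & (CONV_ALIGN - 1)) == 0;
}
```

With these helpers, a request such as (0x200000, 0x200000) passes the
check, while a 4K request at the same address does not.
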
>
> ** This series leaves out some of the conversion sites which might not
> be 2M aligned but should be easy to fix once the approach is finalized. **
>
> 1G alignment for conversion:
> * Using 1G alignment may cause over-allocated SWIOTLB buffers but might
>   be acceptable for CVMs depending on more considerations.
> * It might be challenging to use 1G aligned conversion in OVMF. 2M
>   alignment should be achievable with OVMF changes [3].
>
> Alternatives could be:
> 1) Separate hugepage aligned DMA pools set up by individual device drivers in
> case of CVMs.
>
> [1] https://lore.kernel.org/linux-mips/20231105163040.14904-1-pbonzini@redhat.com/
> [2] https://www.kernel.org/doc/html/next/mm/vmemmap_dedup.html
> [3] https://github.com/tianocore/edk2/pull/3784
> [4] https://lore.kernel.org/lkml/20230908080031.GA7848@lst.de/T/
>
> Vishal Annapurve (5):
>   swiotlb: Support allocating DMA memory from SWIOTLB
>   swiotlb: Allow setting up default alignment of SWIOTLB region
>   x86: CVMs: Enable dynamic swiotlb by default for CVMs
>   x86: CVMs: Allow allocating all DMA memory from SWIOTLB
>   x86: CVMs: Ensure that memory conversions happen at 2M alignment
>
>  arch/x86/Kconfig             |  2 ++
>  arch/x86/kernel/pci-dma.c    |  2 +-
>  arch/x86/mm/mem_encrypt.c    |  8 ++++++--
>  arch/x86/mm/pat/set_memory.c |  6 ++++--
>  include/linux/swiotlb.h      | 22 ++++++----------------
>  kernel/dma/direct.c          |  4 ++--
>  kernel/dma/swiotlb.c         | 17 ++++++++++++-----
>  7 files changed, 33 insertions(+), 28 deletions(-)
>
> --
> 2.43.0.275.g3460e3d667-goog
>

Ping for review of this series.

Thanks,
Vishal