The goal of this series is to align memory conversion requests from CVMs to
huge page sizes, allowing better host-side management of guest memory and
optimized page table walks.
This patch series is partially tested and needs more work; I am seeking
feedback from the wider community before making further progress.
Background
=====================
Confidential VMs (CVMs) support two types of guest memory ranges:
1) Private Memory: Intended to be consumed/modified only by the CVM.
2) Shared Memory: Visible to both guest and host components, used for
non-trusted IO.
Guest memfd [1] support is set to be merged upstream to handle guest private
memory isolation from host userspace. The guest memfd approach allows the
following setup:
* Private memory backed by the guest memfd file, which is not accessible
from host userspace.
* Shared memory backed by tmpfs/hugetlbfs files that are accessible from
host userspace.
The userspace VMM needs to register two backing stores for all of the guest
memory ranges:
* HVA for shared memory
* Guest memfd ranges for private memory
KVM keeps track of shared/private guest memory ranges that can be updated at
runtime using IOCTLs. This allows KVM to back the guest memory using either HVA
(shared) or guest memfd file offsets (private) based on the attributes of the
guest memory ranges.
In this setup, there is a possibility of "double allocation", i.e. scenarios
where both the shared and private backing stores mapped to the same guest
memory ranges have memory allocated.
The guest issues a hypercall to convert memory between the two types, which
KVM forwards to host userspace.
The userspace VMM is supposed to handle the conversion as follows:
1) Private to shared conversion:
* Update guest memory attributes for the range to be shared using KVM
supported IOCTLs.
- While handling this IOCTL, KVM will unmap EPT/NPT entries corresponding
to the guest memory being converted.
* Unback the guest memfd range.
2) Shared to private conversion:
* Update guest memory attributes for the range to be private using KVM
supported IOCTLs.
- While handling this IOCTL, KVM will unmap EPT/NPT entries corresponding
to the guest memory being converted.
* Unback the shared memory file.
Note that unbacking needs to be done for both kinds of conversions in order to
avoid double allocation.
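To make the flow concrete, below is a minimal userspace VMM sketch of the
private-to-shared direction, assuming the guest memfd UAPI proposed in [1]
(KVM_SET_MEMORY_ATTRIBUTES plus a guest_memfd file); names and error handling
are simplified for illustration:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/falloc.h>
#include <linux/kvm.h>

/* Private -> shared: clear the PRIVATE attribute for the GPA range, then
 * unback the guest_memfd range so the private side releases its pages. */
static int convert_to_shared(int vm_fd, int gmem_fd, __u64 gpa,
			     __u64 gmem_offset, __u64 size)
{
	struct kvm_memory_attributes attrs = {
		.address = gpa,
		.size = size,
		.attributes = 0,	/* i.e. not KVM_MEMORY_ATTRIBUTE_PRIVATE */
	};

	/* KVM unmaps the EPT/NPT entries for this range while handling this. */
	if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
		return -1;

	/* Unback the private backing store to avoid double allocation. */
	return fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 gmem_offset, size);
}

The shared-to-private direction is symmetric: set KVM_MEMORY_ATTRIBUTE_PRIVATE
and then unback the shared tmpfs/hugetlbfs range (e.g. by hole punching or
madvise()) instead.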
Problem
=====================
CVMs can convert memory between these two types at 4K granularity. Conversion
done at 4K granularity causes issues when using guest memfd support
with hugetlb/hugepage-backed guest private memory:
1) Hugetlbfs doesn't allow freeing subpage ranges when punching holes,
causing all private-to-shared memory conversions to result in double
allocation.
2) Even if a new fs is implemented for guest memfd that allows splitting
hugepages, punching holes at 4K will cause:
- Loss of the vmemmap optimization [2].
- More memory consumed by EPT/NPT entries and extra pagetable walks for
guest-side accesses.
- Shared memory mappings consuming more host pagetable entries and
extra pagetable walks for host-side accesses.
- A higher number of conversions, with the additional overhead of VM exits
serviced by host userspace.
Memory conversion scenarios in the guest that are of major concern:
- SWIOTLB area conversion early during boot.
* dma_map_* API invocations for CVMs result in using bounce buffers
from the SWIOTLB region, which is already marked as shared.
- Device drivers allocating memory using dma_alloc_* APIs at runtime
that bypass SWIOTLB.
Proposal
=====================
To counter the above issues, this series proposes the following:
1) Use boot-time-allocated SWIOTLB pools for all DMA memory allocated
using dma_alloc_* APIs.
2) Increase the memory allocated at boot for SWIOTLB from 6% to 8% of guest
memory for CVMs.
3) Enable dynamic SWIOTLB [4] by default for CVMs so that SWIOTLB can be
scaled up as needed.
4) Ensure the SWIOTLB pool is 2M aligned so that all the conversions happen
at 2M granularity, once during boot.
5) Add a check to ensure all conversions happen at 2M granularity.
** This series leaves out some of the conversion sites which might not
be 2M aligned but should be easy to fix once the approach is finalized. **
1G alignment for conversion:
* Using 1G alignment may cause over-allocation of SWIOTLB buffers, but that
might be acceptable for CVMs depending on further considerations.
* It might be challenging to use 1G-aligned conversion in OVMF; 2M
alignment should be achievable with OVMF changes [3].
Alternatives could be:
1) Separate hugepage-aligned DMA pools set up by individual device drivers
in the case of CVMs.
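For a rough sense of the sizing heuristic above (illustrative arithmetic only,
using a hypothetical 16 GiB CoCo guest and the limits from patch 4):

    size = 16 GiB * 8 / 100                                  ~= 1.28 GiB
    size = clamp(size, IO_TLB_DEFAULT_SIZE, SZ_1G + SZ_256M)  = 1.25 GiB

i.e. larger guests end up at the 1.25 GiB cap, and any spillover is expected
to be absorbed by dynamic SWIOTLB (item 3 above).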
[1] https://lore.kernel.org/linux-mips/[email protected]/
[2] https://www.kernel.org/doc/html/next/mm/vmemmap_dedup.html
[3] https://github.com/tianocore/edk2/pull/3784
[4] https://lore.kernel.org/lkml/[email protected]/T/
Vishal Annapurve (5):
swiotlb: Support allocating DMA memory from SWIOTLB
swiotlb: Allow setting up default alignment of SWIOTLB region
x86: CVMs: Enable dynamic swiotlb by default for CVMs
x86: CVMs: Allow allocating all DMA memory from SWIOTLB
x86: CVMs: Ensure that memory conversions happen at 2M alignment
arch/x86/Kconfig | 2 ++
arch/x86/kernel/pci-dma.c | 2 +-
arch/x86/mm/mem_encrypt.c | 8 ++++++--
arch/x86/mm/pat/set_memory.c | 6 ++++--
include/linux/swiotlb.h | 22 ++++++----------------
kernel/dma/direct.c | 4 ++--
kernel/dma/swiotlb.c | 17 ++++++++++++-----
7 files changed, 33 insertions(+), 28 deletions(-)
--
2.43.0.275.g3460e3d667-goog
Modify the SWIOTLB framework to always allocate DMA memory from SWIOTLB.
CVMs use SWIOTLB buffers for bouncing memory when using dma_map_* APIs
to set up memory for IO operations. SWIOTLB buffers are marked as shared
once during early boot.
Buffers allocated using dma_alloc_* APIs are allocated from kernel memory
and then converted to shared during each API invocation. This patch ensures
that such buffers are also allocated from the already-shared SWIOTLB
regions. This allows enforcing alignment requirements on the regions marked
as shared.
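As an illustration of how this is consumed, a simplified sketch (not the
actual kernel/dma/direct.c code; the function name is hypothetical) of an
allocation path once the default io_tlb_mem has for_alloc set via
SWIOTLB_ALLOC:

static struct page *example_dma_alloc_page(struct device *dev, size_t size)
{
	/* The allocation comes out of the already-shared SWIOTLB region,
	 * so no per-allocation set_memory_decrypted() call is needed. */
	if (is_swiotlb_for_alloc(dev))
		return swiotlb_alloc(dev, size);

	/* Otherwise allocate normally; the caller would then have to convert
	 * the pages to shared, which is the path this patch avoids. */
	return alloc_pages(GFP_KERNEL, get_order(size));
}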
Signed-off-by: Vishal Annapurve <[email protected]>
---
include/linux/swiotlb.h | 17 +----------------
kernel/dma/direct.c | 4 ++--
kernel/dma/swiotlb.c | 5 +++--
3 files changed, 6 insertions(+), 20 deletions(-)
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index ecde0312dd52..058901313405 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -17,6 +17,7 @@ struct scatterlist;
#define SWIOTLB_VERBOSE (1 << 0) /* verbose initialization */
#define SWIOTLB_FORCE (1 << 1) /* force bounce buffering */
#define SWIOTLB_ANY (1 << 2) /* allow any memory for the buffer */
+#define SWIOTLB_ALLOC (1 << 4) /* force dma allocation through swiotlb */
/*
* Maximum allowable number of contiguous slabs to map,
@@ -259,7 +260,6 @@ static inline phys_addr_t default_swiotlb_limit(void)
extern void swiotlb_print_info(void);
-#ifdef CONFIG_DMA_RESTRICTED_POOL
struct page *swiotlb_alloc(struct device *dev, size_t size);
bool swiotlb_free(struct device *dev, struct page *page, size_t size);
@@ -267,20 +267,5 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)
{
return dev->dma_io_tlb_mem->for_alloc;
}
-#else
-static inline struct page *swiotlb_alloc(struct device *dev, size_t size)
-{
- return NULL;
-}
-static inline bool swiotlb_free(struct device *dev, struct page *page,
- size_t size)
-{
- return false;
-}
-static inline bool is_swiotlb_for_alloc(struct device *dev)
-{
- return false;
-}
-#endif /* CONFIG_DMA_RESTRICTED_POOL */
#endif /* __LINUX_SWIOTLB_H */
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 73c95815789a..a7d3266d3d83 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -78,7 +78,7 @@ bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
static int dma_set_decrypted(struct device *dev, void *vaddr, size_t size)
{
- if (!force_dma_unencrypted(dev))
+ if (!force_dma_unencrypted(dev) || is_swiotlb_for_alloc(dev))
return 0;
return set_memory_decrypted((unsigned long)vaddr, PFN_UP(size));
}
@@ -87,7 +87,7 @@ static int dma_set_encrypted(struct device *dev, void *vaddr, size_t size)
{
int ret;
- if (!force_dma_unencrypted(dev))
+ if (!force_dma_unencrypted(dev) || is_swiotlb_for_alloc(dev))
return 0;
ret = set_memory_encrypted((unsigned long)vaddr, PFN_UP(size));
if (ret)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 33d942615be5..a056d2f8b9ee 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -363,6 +363,7 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
io_tlb_default_mem.force_bounce =
swiotlb_force_bounce || (flags & SWIOTLB_FORCE);
+ io_tlb_default_mem.for_alloc = (flags & SWIOTLB_ALLOC);
#ifdef CONFIG_SWIOTLB_DYNAMIC
if (!remap)
@@ -1601,8 +1602,6 @@ static inline void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
#endif /* CONFIG_DEBUG_FS */
-#ifdef CONFIG_DMA_RESTRICTED_POOL
-
struct page *swiotlb_alloc(struct device *dev, size_t size)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
@@ -1634,6 +1633,8 @@ bool swiotlb_free(struct device *dev, struct page *page, size_t size)
return true;
}
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+
static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
struct device *dev)
{
--
2.43.0.275.g3460e3d667-goog
Allow adjusting the alignment of SWIOTLB memory. CVMs can use this
framework to align the shared memory regions as needed.
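A minimal usage sketch (mirroring what patch 4 of this series does in
mem_encrypt_setup_arch(); the wrapper name below is hypothetical): adjust the
size and alignment before the pool is allocated from memblock.

void __init example_coco_swiotlb_setup(unsigned long total_mem)
{
	unsigned long size = total_mem * 8 / 100;

	size = clamp_val(size, IO_TLB_DEFAULT_SIZE, SZ_1G + SZ_256M);
	swiotlb_adjust_size(size);
	/* The memblock allocation for the pool is now rounded up to, and
	 * aligned on, 2M, so its one-time shared conversion is 2M aligned. */
	swiotlb_adjust_alignment(SZ_2M);
}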
Signed-off-by: Vishal Annapurve <[email protected]>
---
include/linux/swiotlb.h | 5 +++++
kernel/dma/swiotlb.c | 12 +++++++++---
2 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 058901313405..450bd82cdb9f 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -206,6 +206,7 @@ size_t swiotlb_max_mapping_size(struct device *dev);
bool is_swiotlb_allocated(void);
bool is_swiotlb_active(struct device *dev);
void __init swiotlb_adjust_size(unsigned long size);
+void __init swiotlb_adjust_alignment(unsigned long alignment);
phys_addr_t default_swiotlb_base(void);
phys_addr_t default_swiotlb_limit(void);
#else
@@ -247,6 +248,10 @@ static inline void swiotlb_adjust_size(unsigned long size)
{
}
+static inline void swiotlb_adjust_alignment(unsigned long alignment)
+{
+}
+
static inline phys_addr_t default_swiotlb_base(void)
{
return 0;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index a056d2f8b9ee..eeab0607a028 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -97,6 +97,7 @@ static struct io_tlb_mem io_tlb_default_mem;
#endif /* CONFIG_SWIOTLB_DYNAMIC */
static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
+static unsigned long default_alignment = PAGE_SIZE;
static unsigned long default_nareas;
/**
@@ -223,6 +224,11 @@ void __init swiotlb_adjust_size(unsigned long size)
pr_info("SWIOTLB bounce buffer size adjusted to %luMB", size >> 20);
}
+void __init swiotlb_adjust_alignment(unsigned long alignment)
+{
+ default_alignment = alignment;
+}
+
void swiotlb_print_info(void)
{
struct io_tlb_pool *mem = &io_tlb_default_mem.defpool;
@@ -315,7 +321,7 @@ static void __init *swiotlb_memblock_alloc(unsigned long nslabs,
unsigned int flags,
int (*remap)(void *tlb, unsigned long nslabs))
{
- size_t bytes = PAGE_ALIGN(nslabs << IO_TLB_SHIFT);
+ size_t bytes = ALIGN(nslabs << IO_TLB_SHIFT, default_alignment);
void *tlb;
/*
@@ -324,9 +330,9 @@ static void __init *swiotlb_memblock_alloc(unsigned long nslabs,
* memory encryption.
*/
if (flags & SWIOTLB_ANY)
- tlb = memblock_alloc(bytes, PAGE_SIZE);
+ tlb = memblock_alloc(bytes, default_alignment);
else
- tlb = memblock_alloc_low(bytes, PAGE_SIZE);
+ tlb = memblock_alloc_low(bytes, default_alignment);
if (!tlb) {
pr_warn("%s: Failed to allocate %zu bytes tlb structure\n",
--
2.43.0.275.g3460e3d667-goog
CVMs use SWIOTLB for non-trusted IO via the dma_map_* APIs. This series
ensures that the dma_alloc_* APIs also allocate from SWIOTLB pools.
The initial size of the SWIOTLB pool is decided using heuristics and has
worked for CVM use cases so far.
It's better to allow SWIOTLB to scale up on demand rather than keeping the
size fixed, to avoid failures as SWIOTLB usage possibly increases with
dma_alloc_* APIs allocating from SWIOTLB pools. This should also help in
the future as more devices are used from CVMs for non-trusted IO.
Signed-off-by: Vishal Annapurve <[email protected]>
---
arch/x86/Kconfig | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1566748f16c4..035c8a022c4c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
select X86_MEM_ENCRYPT
select X86_MCE
select UNACCEPTED_MEMORY
+ select SWIOTLB_DYNAMIC
help
Support running as a guest under Intel TDX. Without this support,
the guest kernel can not boot or run under TDX.
@@ -1534,6 +1535,7 @@ config AMD_MEM_ENCRYPT
select ARCH_HAS_CC_PLATFORM
select X86_MEM_ENCRYPT
select UNACCEPTED_MEMORY
+ select SWIOTLB_DYNAMIC
help
Say yes to enable support for the encryption of system memory.
This requires an AMD processor that supports Secure Memory
--
2.43.0.275.g3460e3d667-goog
Changes include:
1) Allocate all DMA memory from SWIOTLB buffers.
2) Increase the size of the SWIOTLB region to accommodate dma_alloc_*
invocations.
3) Align SWIOTLB regions to 2M size.
Signed-off-by: Vishal Annapurve <[email protected]>
---
arch/x86/kernel/pci-dma.c | 2 +-
arch/x86/mm/mem_encrypt.c | 8 ++++++--
2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index f323d83e40a7..3dcc3104b2a8 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -61,7 +61,7 @@ static void __init pci_swiotlb_detect(void)
*/
if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
x86_swiotlb_enable = true;
- x86_swiotlb_flags |= SWIOTLB_FORCE;
+ x86_swiotlb_flags |= (SWIOTLB_FORCE | SWIOTLB_ALLOC);
}
}
#else
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index c290c55b632b..0cf3365b051f 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -112,10 +112,14 @@ void __init mem_encrypt_setup_arch(void)
* The percentage of guest memory used here for SWIOTLB buffers
* is more of an approximation of the static adjustment which
* 64MB for <1G, and ~128M to 256M for 1G-to-4G, i.e., the 6%
+ *
+ * Extra 2% is added to accommodate the requirement of DMA allocations
+ * done using dma_alloc_* APIs.
*/
- size = total_mem * 6 / 100;
- size = clamp_val(size, IO_TLB_DEFAULT_SIZE, SZ_1G);
+ size = total_mem * 8 / 100;
+ size = clamp_val(size, IO_TLB_DEFAULT_SIZE, (SZ_1G + SZ_256M));
swiotlb_adjust_size(size);
+ swiotlb_adjust_alignment(SZ_2M);
/* Set restricted memory access for virtio. */
virtio_set_mem_acc_cb(virtio_require_restricted_mem_acc);
--
2.43.0.275.g3460e3d667-goog
Return an error on conversion of memory ranges that are not 2M aligned.
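For illustration, a sketch of what a caller would have to do under this
check (hypothetical helper, simplified error handling; an order-9 buddy
allocation is naturally 2M aligned):

static void *example_alloc_shared_buf(void)
{
	unsigned int order = get_order(HPAGE_SIZE);
	void *buf = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);

	if (!buf)
		return NULL;

	/* A 2M-aligned address and a 2M multiple of pages pass the check; a
	 * single-page conversion would now return -EINVAL instead. */
	if (set_memory_decrypted((unsigned long)buf, HPAGE_SIZE >> PAGE_SHIFT)) {
		free_pages((unsigned long)buf, order);
		return NULL;
	}
	return buf;
}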
Signed-off-by: Vishal Annapurve <[email protected]>
---
arch/x86/mm/pat/set_memory.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index bda9f129835e..6f7b06a502f4 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2133,8 +2133,10 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
int ret;
/* Should not be working on unaligned addresses */
- if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
- addr &= PAGE_MASK;
+ if (WARN_ONCE(addr & ~HPAGE_MASK, "misaligned address: %#lx\n", addr)
+ || WARN_ONCE((numpages << PAGE_SHIFT) & ~HPAGE_MASK,
+ "misaligned numpages: %#lx\n", numpages))
+ return -EINVAL;
memset(&cpa, 0, sizeof(cpa));
cpa.vaddr = &addr;
--
2.43.0.275.g3460e3d667-goog
On Fri, Jan 12, 2024 at 11:22 AM Vishal Annapurve <[email protected]> wrote:
Ping for review of this series.
Thanks,
Vishal
On 1/11/24 21:52, Vishal Annapurve wrote:
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c
> @@ -112,10 +112,14 @@ void __init mem_encrypt_setup_arch(void)
> * The percentage of guest memory used here for SWIOTLB buffers
> * is more of an approximation of the static adjustment which
> * 64MB for <1G, and ~128M to 256M for 1G-to-4G, i.e., the 6%
> + *
> + * Extra 2% is added to accommodate the requirement of DMA allocations
> + * done using dma_alloc_* APIs.
> */
> - size = total_mem * 6 / 100;
> - size = clamp_val(size, IO_TLB_DEFAULT_SIZE, SZ_1G);
> + size = total_mem * 8 / 100;
> + size = clamp_val(size, IO_TLB_DEFAULT_SIZE, (SZ_1G + SZ_256M));
> swiotlb_adjust_size(size);
> + swiotlb_adjust_alignment(SZ_2M);
FWIW, this appears superficially to just be fiddling with random
numbers. The changelog basically says: "did stuff".
What *are* "the requirement of DMA allocations done using dma_alloc_* APIs"?
On 1/11/24 21:52, Vishal Annapurve wrote:
> @@ -2133,8 +2133,10 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
> int ret;
>
> /* Should not be working on unaligned addresses */
> - if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
> - addr &= PAGE_MASK;
> + if (WARN_ONCE(addr & ~HPAGE_MASK, "misaligned address: %#lx\n", addr)
> + || WARN_ONCE((numpages << PAGE_SHIFT) & ~HPAGE_MASK,
> + "misaligned numpages: %#lx\n", numpages))
> + return -EINVAL;
This series is talking about swiotlb and DMA, then this applies a
restriction to what I *thought* was a much more generic function:
__set_memory_enc_pgtable(). What prevents this function from getting
used on 4k mappings?
There's a bunch of code in the kernel for TDX and SEV guests. How much
of it uses the "CVM" nomenclature?
What do you do when you need to dynamically scale up the SWIOTLB size
and can't allocate a 2M page? I guess you're saying here that you'd
rather run with a too-small 2M pool than a large-enough mixed 4k/2M pool.
I also had a really hard time parsing through the problem statement and
solution here. I'd really suggest cleaning up the problem statement and
more clearly differentiating the host and guest sides in the description.
On Wed, Jan 31, 2024 at 9:47 PM Dave Hansen <[email protected]> wrote:
>
> On 1/11/24 21:52, Vishal Annapurve wrote:
> > --- a/arch/x86/mm/mem_encrypt.c
> > +++ b/arch/x86/mm/mem_encrypt.c
> > @@ -112,10 +112,14 @@ void __init mem_encrypt_setup_arch(void)
> > * The percentage of guest memory used here for SWIOTLB buffers
> > * is more of an approximation of the static adjustment which
> > * 64MB for <1G, and ~128M to 256M for 1G-to-4G, i.e., the 6%
> > + *
> > + * Extra 2% is added to accommodate the requirement of DMA allocations
> > + * done using dma_alloc_* APIs.
> > */
> > - size = total_mem * 6 / 100;
> > - size = clamp_val(size, IO_TLB_DEFAULT_SIZE, SZ_1G);
> > + size = total_mem * 8 / 100;
> > + size = clamp_val(size, IO_TLB_DEFAULT_SIZE, (SZ_1G + SZ_256M));
> > swiotlb_adjust_size(size);
> > + swiotlb_adjust_alignment(SZ_2M);
>
> FWIW, this appears superficially to just be fiddling with random
> numbers. The changelog basically says: "did stuff".
>
> What *are* "the requirement of DMA allocations done using dma_alloc_* APIs"?
dma_alloc_* invocations depend on the devices used and may change with
time, so it's difficult to calculate the memory required for such
allocations.
Though one could note the following points about memory allocations done
using dma_alloc_* APIs:
1) They generally happen during early setup of device drivers.
2) They should be relatively small in size compared to the runtime bounce
buffering done via dma_map_* APIs.
Based on the above observations, this change increases the SWIOTLB memory
area by roughly 30% (the fraction goes from 6% to 8% of guest memory and
the cap from SZ_1G to SZ_1G + SZ_256M). The strategy here is to take a safe
enough heuristic and let dynamic SWIOTLB allocations handle any spillover.
On Wed, Jan 31, 2024 at 10:03 PM Dave Hansen <[email protected]> wrote:
>
> On 1/11/24 21:52, Vishal Annapurve wrote:
> > @@ -2133,8 +2133,10 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
> > int ret;
> >
> > /* Should not be working on unaligned addresses */
> > - if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
> > - addr &= PAGE_MASK;
> > + if (WARN_ONCE(addr & ~HPAGE_MASK, "misaligned address: %#lx\n", addr)
> > + || WARN_ONCE((numpages << PAGE_SHIFT) & ~HPAGE_MASK,
> > + "misaligned numpages: %#lx\n", numpages))
> > + return -EINVAL;
>
> This series is talking about swiotlb and DMA, then this applies a
> restriction to what I *thought* was a much more generic function:
> __set_memory_enc_pgtable(). What prevents this function from getting
> used on 4k mappings?
>
>
The end goal here is to limit the conversion granularity to hugepage
sizes. SWIOTLB allocations are the major source of unaligned
allocations (and hence conversions) that need to be fixed before
achieving this goal.
This change will ensure that conversion fails for unaligned ranges, as
I don't foresee the need for 4K-aligned conversions apart from DMA
allocations.
On Wed, Jan 31, 2024 at 10:23 PM Dave Hansen <[email protected]> wrote:
>
> There's a bunch of code in the kernel for TDX and SEV guests. How much
> of it uses the "CVM" nomenclature?
Right, I see that "CoCo VMs" is a more accepted term in the kernel
codebase so far; I will update the references in the next version.
>
> What do you do when you need to dynamically scale up the SWIOTLB size
> and can't allocate a 2M page? I guess you're saying here that you'd
> rather run with a too-small 2M pool than a large-enough mixed 4k/2M pool.
I am not yet certain how to ensure a 2M page is always available, or can be
made available, at runtime for CoCo VMs. A few options that I can think of:
1) Reserve additional memory via CMA to satisfy runtime requests for 2M
allocations.
2) Pre-reserve SWIOTLB to a safe value, just like it's done today, and
not rely on dynamic scaling.
Any suggestions are welcome.
>
> I also had a really hard time parsing through the problem statement and
> solution here. I'd really suggest cleaning up the problem statement and
> more clearly differentiating the host and guest sides in the description.
Thanks for taking a look at this series. I will reword the description
in the next version. The goal basically is to ensure private and
shared memory regions are always huge page aligned.
On 01/02/2024 04:46, Vishal Annapurve wrote:
> On Wed, Jan 31, 2024 at 10:03 PM Dave Hansen <[email protected]> wrote:
>>
>> On 1/11/24 21:52, Vishal Annapurve wrote:
>>> @@ -2133,8 +2133,10 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>>> int ret;
>>>
>>> /* Should not be working on unaligned addresses */
>>> - if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
>>> - addr &= PAGE_MASK;
>>> + if (WARN_ONCE(addr & ~HPAGE_MASK, "misaligned address: %#lx\n", addr)
>>> + || WARN_ONCE((numpages << PAGE_SHIFT) & ~HPAGE_MASK,
>>> + "misaligned numpages: %#lx\n", numpages))
>>> + return -EINVAL;
>>
>> This series is talking about swiotlb and DMA, then this applies a
>> restriction to what I *thought* was a much more generic function:
>> __set_memory_enc_pgtable(). What prevents this function from getting
>> used on 4k mappings?
>>
>>
>
> The end goal here is to limit the conversion granularity to hugepage
> sizes. SWIOTLB allocations are the major source of unaligned
> allocations(and so the conversions) that need to be fixed before
> achieving this goal.
>
> This change will ensure that conversion fails for unaligned ranges, as
> I don't foresee the need for 4K aligned conversions apart from DMA
> allocations.
Hi Vishal,
This assumption is wrong. set_memory_decrypted is called from various
parts of the kernel: kexec, sev-guest, kvmclock, hyperv code. These conversions
are for non-DMA allocations that need to be done at 4KB granularity
because the data structures in question are page sized.
Thanks,
Jeremi
On 12/01/2024 06:52, Vishal Annapurve wrote:
> CVMs used SWIOTLB for non-trusted IO using dma_map_* APIs. This series
> will ensure that dma_alloc_* API to also allocate from SWIOTLB pools.
> Initial size of SWIOTLB pool is decided using heuristics and has been
> working with CVM usecases so far.
>
> It's better to allow SWIOTLB to scale up on demand rather than keeping
> the size fixed to avoid failures with possibly increased usage of
> SWIOTLB with dma_alloc_* APIs allocating from SWIOTLB pools. This should
> also help in future with more devices getting used from CVMs for
> non-trusted IO.
>
> Signed-off-by: Vishal Annapurve <[email protected]>
> ---
> arch/x86/Kconfig | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 1566748f16c4..035c8a022c4c 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
> select X86_MEM_ENCRYPT
> select X86_MCE
> select UNACCEPTED_MEMORY
> + select SWIOTLB_DYNAMIC
> help
> Support running as a guest under Intel TDX. Without this support,
> the guest kernel can not boot or run under TDX.
> @@ -1534,6 +1535,7 @@ config AMD_MEM_ENCRYPT
> select ARCH_HAS_CC_PLATFORM
> select X86_MEM_ENCRYPT
> select UNACCEPTED_MEMORY
> + select SWIOTLB_DYNAMIC
> help
> Say yes to enable support for the encryption of system memory.
> This requires an AMD processor that supports Secure Memory
What this does is unconditionally enable SWIOTLB_DYNAMIC for every kernel compiled
to support memory encryption, regardless of whether it runs inside a confidential guest.
I don't think that is what you intended.
Best wishes,
Jeremi
On Thu, Feb 1, 2024 at 5:50 PM Jeremi Piotrowski
<[email protected]> wrote:
>
> On 12/01/2024 06:52, Vishal Annapurve wrote:
> > CVMs used SWIOTLB for non-trusted IO using dma_map_* APIs. This series
> > will ensure that dma_alloc_* API to also allocate from SWIOTLB pools.
> > Initial size of SWIOTLB pool is decided using heuristics and has been
> > working with CVM usecases so far.
> >
> > It's better to allow SWIOTLB to scale up on demand rather than keeping
> > the size fixed to avoid failures with possibly increased usage of
> > SWIOTLB with dma_alloc_* APIs allocating from SWIOTLB pools. This should
> > also help in future with more devices getting used from CVMs for
> > non-trusted IO.
> >
> > Signed-off-by: Vishal Annapurve <[email protected]>
> > ---
> > arch/x86/Kconfig | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 1566748f16c4..035c8a022c4c 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
> > select X86_MEM_ENCRYPT
> > select X86_MCE
> > select UNACCEPTED_MEMORY
> > + select SWIOTLB_DYNAMIC
> > help
> > Support running as a guest under Intel TDX. Without this support,
> > the guest kernel can not boot or run under TDX.
> > @@ -1534,6 +1535,7 @@ config AMD_MEM_ENCRYPT
> > select ARCH_HAS_CC_PLATFORM
> > select X86_MEM_ENCRYPT
> > select UNACCEPTED_MEMORY
> > + select SWIOTLB_DYNAMIC
> > help
> > Say yes to enable support for the encryption of system memory.
> > This requires an AMD processor that supports Secure Memory
>
> What this does is unconditionally enable SWIOTLB_DYNAMIC for every kernel compiled
> to support memory encryption, regardless of whether it runs inside a confidential guest.
> I don't think that is what you intended.
>
Right, I will have to add a way to toggle this support dynamically in
addition to having the support built in, based on more discussion
around 2M allocations and dynamic scaling.
> Best wishes,
> Jeremi
>
>
On 02/02/2024 06:08, Vishal Annapurve wrote:
> On Thu, Feb 1, 2024 at 5:32 PM Jeremi Piotrowski
> <[email protected]> wrote:
>>
>> On 01/02/2024 04:46, Vishal Annapurve wrote:
>>> On Wed, Jan 31, 2024 at 10:03 PM Dave Hansen <[email protected]> wrote:
>>>>
>>>> On 1/11/24 21:52, Vishal Annapurve wrote:
>>>>> @@ -2133,8 +2133,10 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>>>>> int ret;
>>>>>
>>>>> /* Should not be working on unaligned addresses */
>>>>> - if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
>>>>> - addr &= PAGE_MASK;
>>>>> + if (WARN_ONCE(addr & ~HPAGE_MASK, "misaligned address: %#lx\n", addr)
>>>>> + || WARN_ONCE((numpages << PAGE_SHIFT) & ~HPAGE_MASK,
>>>>> + "misaligned numpages: %#lx\n", numpages))
>>>>> + return -EINVAL;
>>>>
>>>> This series is talking about swiotlb and DMA, then this applies a
>>>> restriction to what I *thought* was a much more generic function:
>>>> __set_memory_enc_pgtable(). What prevents this function from getting
>>>> used on 4k mappings?
>>>>
>>>>
>>>
>>> The end goal here is to limit the conversion granularity to hugepage
>>> sizes. SWIOTLB allocations are the major source of unaligned
>>> allocations(and so the conversions) that need to be fixed before
>>> achieving this goal.
>>>
>>> This change will ensure that conversion fails for unaligned ranges, as
>>> I don't foresee the need for 4K aligned conversions apart from DMA
>>> allocations.
>>
>> Hi Vishal,
>>
>> This assumption is wrong. set_memory_decrypted is called from various
>> parts of the kernel: kexec, sev-guest, kvmclock, hyperv code. These conversions
>> are for non-DMA allocations that need to be done at 4KB granularity
>> because the data structures in question are page sized.
>>
>> Thanks,
>> Jeremi
>
> Thanks Jeremi for pointing out these usecases.
>
> My brief analysis for these call sites:
> 1) machine_kexec_64.c, realmode/init.c, kvm/mmu/mmu.c - shared memory
> allocation/conversion happens when host side memory encryption
> (CC_ATTR_HOST_MEM_ENCRYPT) is enabled.
> 2) kernel/kvmclock.c - Shared memory allocation can be made to align
> 2M even if the memory needed is lesser.
> 3) drivers/virt/coco/sev-guest/sev-guest.c,
> drivers/virt/coco/tdx-guest/tdx-guest.c - Shared memory allocation can
> be made to align 2M even if the memory needed is lesser.
>
> I admit I haven't analyzed hyperv code in context of these changes,
> but will take a better look to see if the calls for memory conversion
> here can fit the category of "Shared memory allocation can be made to
> align 2M even if the memory needed is lesser".
>
> Agree that this patch should be modified to look something like
> (subject to more changes on the call sites)
No, this patch is still built on the wrong assumptions. You're trying
to alter a generic function in the guest for the constraints of a very
specific hypervisor + host userspace + memory backend combination.
That's not right.
Is the numpages check supposed to ensure that the guest *only* toggles
visibility in chunks of 2MB? Then you're exposing more memory to the host
than the guest intends.
If you must - focus on getting swiotlb conversions to happen at the desired
granularity but don't try to force every single conversion to be >4K.
Thanks,
Jeremi
>
> =============
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index e9b448d1b1b7..8c608d6913c4 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -2132,10 +2132,15 @@ static int __set_memory_enc_pgtable(unsigned
> long addr, int numpages, bool enc)
> struct cpa_data cpa;
> int ret;
>
> /* Should not be working on unaligned addresses */
> if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
> addr &= PAGE_MASK;
>
> + if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) &&
> + (WARN_ONCE(addr & ~HPAGE_MASK, "misaligned address:
> %#lx\n", addr)
> + || WARN_ONCE((numpages << PAGE_SHIFT) & ~HPAGE_MASK,
> + "misaligned numpages: %#lx\n", numpages)))
> + return -EINVAL;
> +
> memset(&cpa, 0, sizeof(cpa));
> cpa.vaddr = &addr;
> cpa.numpages = numpages;
On Fri, Feb 2, 2024 at 1:30 PM Jeremi Piotrowski
<[email protected]> wrote:
>
> On 02/02/2024 06:08, Vishal Annapurve wrote:
> > On Thu, Feb 1, 2024 at 5:32 PM Jeremi Piotrowski
> > <[email protected]> wrote:
> >>
> >> On 01/02/2024 04:46, Vishal Annapurve wrote:
> >>> On Wed, Jan 31, 2024 at 10:03 PM Dave Hansen <[email protected]> wrote:
> >>>>
> >>>> On 1/11/24 21:52, Vishal Annapurve wrote:
> >>>>> @@ -2133,8 +2133,10 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
> >>>>> int ret;
> >>>>>
> >>>>> /* Should not be working on unaligned addresses */
> >>>>> - if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
> >>>>> - addr &= PAGE_MASK;
> >>>>> + if (WARN_ONCE(addr & ~HPAGE_MASK, "misaligned address: %#lx\n", addr)
> >>>>> + || WARN_ONCE((numpages << PAGE_SHIFT) & ~HPAGE_MASK,
> >>>>> + "misaligned numpages: %#lx\n", numpages))
> >>>>> + return -EINVAL;
> >>>>
> >>>> This series is talking about swiotlb and DMA, then this applies a
> >>>> restriction to what I *thought* was a much more generic function:
> >>>> __set_memory_enc_pgtable(). What prevents this function from getting
> >>>> used on 4k mappings?
> >>>>
> >>>>
> >>>
> >>> The end goal here is to limit the conversion granularity to hugepage
> >>> sizes. SWIOTLB allocations are the major source of unaligned
> >>> allocations(and so the conversions) that need to be fixed before
> >>> achieving this goal.
> >>>
> >>> This change will ensure that conversion fails for unaligned ranges, as
> >>> I don't foresee the need for 4K aligned conversions apart from DMA
> >>> allocations.
> >>
> >> Hi Vishal,
> >>
> >> This assumption is wrong. set_memory_decrypted is called from various
> >> parts of the kernel: kexec, sev-guest, kvmclock, hyperv code. These conversions
> >> are for non-DMA allocations that need to be done at 4KB granularity
> >> because the data structures in question are page sized.
> >>
> >> Thanks,
> >> Jeremi
> >
> > Thanks Jeremi for pointing out these usecases.
> >
> > My brief analysis for these call sites:
> > 1) machine_kexec_64.c, realmode/init.c, kvm/mmu/mmu.c - shared memory
> > allocation/conversion happens when host side memory encryption
> > (CC_ATTR_HOST_MEM_ENCRYPT) is enabled.
> > 2) kernel/kvmclock.c - Shared memory allocation can be made to align
> > 2M even if the memory needed is lesser.
> > 3) drivers/virt/coco/sev-guest/sev-guest.c,
> > drivers/virt/coco/tdx-guest/tdx-guest.c - Shared memory allocation can
> > be made to align 2M even if the memory needed is lesser.
> >
> > I admit I haven't analyzed hyperv code in context of these changes,
> > but will take a better look to see if the calls for memory conversion
> > here can fit the category of "Shared memory allocation can be made to
> > align 2M even if the memory needed is lesser".
> >
> > Agree that this patch should be modified to look something like
> > (subject to more changes on the call sites)
>
> No, this patch is still built on the wrong assumptions. You're trying
> to alter a generic function in the guest for the constraints of a very
> specific hypervisor + host userspace + memory backend combination.
> That's not right.
Agreed, I focused on KVM for these changes. I plan to spend some time
understanding the relevance of guest memfd for other hypervisors when
dealing with CoCo VMs.
>
> Is the numpages check supposed to ensure that the guest *only* toggles
> visibility in chunks of 2MB?
Yes.
> Then you're exposing more memory to the host than the guest intends.
The goal of the series is to ensure that CoCo VMs convert memory
at hugepage granularity. So such guests would need to allocate any
memory to be converted to shared at hugepage granularity. This would
not expose any guest memory that needs to be used privately.
I agree that the extra memory allocated for 2M alignment is effectively
wasted. Optimizing this extra memory usage can be pursued further
depending on how significant the wastage is. One possible way could be to
preallocate a large enough shared memory region and use it to back smaller
allocations from these call sites (very similar to SWIOTLB).
>
> If you must - focus on getting swiotlb conversions to happen at the desired
> granularity but don't try to force every single conversion to be >4K.
If any conversion within a guest happens at 4K granularity, then this
will effectively cause non-hugepage aligned EPT/NPT entries. This
series is trying to get all private and shared memory regions to be
hugepage aligned to address the problem statement.
>
> Thanks,
> Jeremi
>
>
> >
> > =============
> > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> > index e9b448d1b1b7..8c608d6913c4 100644
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -2132,10 +2132,15 @@ static int __set_memory_enc_pgtable(unsigned
> > long addr, int numpages, bool enc)
> > struct cpa_data cpa;
> > int ret;
> >
> > /* Should not be working on unaligned addresses */
> > if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
> > addr &= PAGE_MASK;
> >
> > + if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) &&
> > + (WARN_ONCE(addr & ~HPAGE_MASK, "misaligned address:
> > %#lx\n", addr)
> > + || WARN_ONCE((numpages << PAGE_SHIFT) & ~HPAGE_MASK,
> > + "misaligned numpages: %#lx\n", numpages)))
> > + return -EINVAL;
> > +
> > memset(&cpa, 0, sizeof(cpa));
> > cpa.vaddr = &addr;
> > cpa.numpages = numpages;
>
On 2/2/24 08:22, Vishal Annapurve wrote:
>> If you must - focus on getting swiotlb conversions to happen at the desired
>> granularity but don't try to force every single conversion to be >4K.
> If any conversion within a guest happens at 4K granularity, then this
> will effectively cause non-hugepage aligned EPT/NPT entries. This
> series is trying to get all private and shared memory regions to be
> hugepage aligned to address the problem statement.
Yeah, but the series is trying to do that by being awfully myopic at
this stage and without being _declared_ to be so myopic.
Take a look at all of the set_memory_decrypted() calls. How many of
them even operate on the part of the guest address space rooted in the
memfd where splits matter? They're not doing conversions. They're just
setting up shared mappings in the page tables of gunk that was never
private in the first place.
On Fri, Feb 2, 2024 at 10:06 PM Dave Hansen <[email protected]> wrote:
>
> On 2/2/24 08:22, Vishal Annapurve wrote:
> >> If you must - focus on getting swiotlb conversions to happen at the desired
> >> granularity but don't try to force every single conversion to be >4K.
> > If any conversion within a guest happens at 4K granularity, then this
> > will effectively cause non-hugepage aligned EPT/NPT entries. This
> > series is trying to get all private and shared memory regions to be
> > hugepage aligned to address the problem statement.
>
> Yeah, but the series is trying to do that by being awfully myopic at
> this stage and without being _declared_ to be so myopic.
>
Agreed. I was being overly optimistic when I mentioned the following in
the cover message:
"** This series leaves out some of the conversion sites which might not
be 2M aligned but should be easy to fix once the approach is finalized. **"
> Take a look at all of the set_memory_decrypted() calls. How many of
> them even operate on the part of the guest address space rooted in the
> memfd where splits matter? They're not doing conversions. They're just
> setting up shared mappings in the page tables of gunk that was never
> private in the first place.
Thinking it over again, yeah, the conversions happening outside SWIOTLB
should impact significantly fewer memory ranges. As Jeremi and you are
suggesting, it would be a big step forward if just the memory conversions
happening for DMA requests were aligned to hugepage sizes.
On Fri, Jan 12, 2024 at 05:52:47AM +0000, Vishal Annapurve wrote:
> Modify SWIOTLB framework to allocate DMA memory always from SWIOTLB.
>
> CVMs use SWIOTLB buffers for bouncing memory when using dma_map_* APIs
> to setup memory for IO operations. SWIOTLB buffers are marked as shared
> once during early boot.
>
> Buffers allocated using dma_alloc_* APIs are allocated from kernel memory
> and then converted to shared during each API invocation. This patch ensures
> that such buffers are also allocated from already shared SWIOTLB
> regions. This allows enforcing alignment requirements on regions marked
> as shared.
But does it work in practice?
Some devices (like GPUs) require a lot of DMA memory. So with this approach
we would need to have a huge SWIOTLB buffer that is unused in most VMs.
--
Kiryl Shutsemau / Kirill A. Shutemov
On 15.02.24 04:33, Vishal Annapurve wrote:
> On Wed, Feb 14, 2024 at 8:20 PM Kirill A. Shutemov <[email protected]> wrote:
>> On Fri, Jan 12, 2024 at 05:52:47AM +0000, Vishal Annapurve wrote:
>>> Modify SWIOTLB framework to allocate DMA memory always from SWIOTLB.
>>>
>>> CVMs use SWIOTLB buffers for bouncing memory when using dma_map_* APIs
>>> to setup memory for IO operations. SWIOTLB buffers are marked as shared
>>> once during early boot.
>>>
>>> Buffers allocated using dma_alloc_* APIs are allocated from kernel memory
>>> and then converted to shared during each API invocation. This patch ensures
>>> that such buffers are also allocated from already shared SWIOTLB
>>> regions. This allows enforcing alignment requirements on regions marked
>>> as shared.
>> But does it work in practice?
>>
>> Some devices (like GPUs) require a lot of DMA memory. So with this approach
>> we would need to have huge SWIOTLB buffer that is unused in most VMs.
>>
> Current implementation limits the size of statically allocated SWIOTLB
> memory pool to 1G. I was proposing to enable dynamic SWIOTLB for CVMs
> in addition to aligning the memory allocations to hugepage sizes, so
> that the SWIOTLB pool can be scaled up on demand.
>
> The issue with aligning the pool areas to hugepages is that hugepage
> allocation at runtime is not guaranteed. Guaranteeing the hugepage
> allocation might need calculating the upper bound in advance, which
> defeats the purpose of enabling dynamic SWIOTLB. I am open to
> suggestions here.
You could allocate a max bound at boot using CMA and then only fill into
the CMA area when SWIOTLB size requirements increase? The CMA region
will allow movable allocations as long as you don't require the CMA space.
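A rough sketch of that idea, assuming the kernel's CMA API
(cma_declare_contiguous()/cma_alloc()); names below are hypothetical and
error handling is omitted:

static struct cma *swiotlb_cma;

/* Reserve an upper bound at boot; the range stays usable for movable
 * allocations until SWIOTLB actually claims chunks from it. */
static int __init example_swiotlb_cma_reserve(void)
{
	return cma_declare_contiguous(0, SZ_1G, 0, SZ_2M, 0, false,
				      "swiotlb", &swiotlb_cma);
}

/* Carve a 2M-aligned chunk out of the reservation when SWIOTLB must grow. */
static struct page *example_swiotlb_grow_2m(void)
{
	return cma_alloc(swiotlb_cma, SZ_2M >> PAGE_SHIFT,
			 get_order(SZ_2M), false);
}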
Alex
From: Alexander Graf <[email protected]> Sent: Thursday, February 15, 2024 1:44 AM
>
> On 15.02.24 04:33, Vishal Annapurve wrote:
> > On Wed, Feb 14, 2024 at 8:20 PM Kirill A. Shutemov
> <[email protected]> wrote:
> >> On Fri, Jan 12, 2024 at 05:52:47AM +0000, Vishal Annapurve wrote:
> >>> Modify SWIOTLB framework to allocate DMA memory always from SWIOTLB.
> >>>
> >>> CVMs use SWIOTLB buffers for bouncing memory when using dma_map_* APIs
> >>> to setup memory for IO operations. SWIOTLB buffers are marked as shared
> >>> once during early boot.
> >>>
> >>> Buffers allocated using dma_alloc_* APIs are allocated from kernel memory
> >>> and then converted to shared during each API invocation. This patch ensures
> >>> that such buffers are also allocated from already shared SWIOTLB
> >>> regions. This allows enforcing alignment requirements on regions marked
> >>> as shared.
> >> But does it work in practice?
> >>
> >> Some devices (like GPUs) require a lot of DMA memory. So with this approach
> >> we would need to have huge SWIOTLB buffer that is unused in most VMs.
> >>
> > Current implementation limits the size of statically allocated SWIOTLB
> > memory pool to 1G. I was proposing to enable dynamic SWIOTLB for CVMs
> > in addition to aligning the memory allocations to hugepage sizes, so
> > that the SWIOTLB pool can be scaled up on demand.
> >
Vishal --
When the dynamic swiotlb mechanism tries to grow swiotlb space
by adding another pool, it gets the additional memory as a single
physically contiguous chunk using alloc_pages(). It starts by trying
to allocate a chunk the size of the original swiotlb size, and if that
fails, halves the size until it gets a size where the allocation succeeds.
The minimum size is 1 Mbyte, and if that fails, the "grow" fails.
So it seems like dynamic swiotlb is subject to almost the same
memory fragmentation limitations as trying to allocate huge pages.
swiotlb needs a minimum of 1 Mbyte contiguous in order to grow,
while huge pages need 2 Mbytes, but either is likely to be
problematic in a VM that has been running a while. With that
in mind, I'm not clear on the benefit of enabling dynamic swiotlb.
It seems like it just moves around the problem of needing high order
contiguous memory allocations. Or am I missing something?
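In pseudocode, the growth behaviour described above is roughly the following
(an illustrative paraphrase, not the actual swiotlb code):

static struct page *example_swiotlb_grow(size_t orig_pool_bytes,
					 unsigned int *order)
{
	size_t bytes = orig_pool_bytes;

	while (bytes >= SZ_1M) {
		struct page *page;

		*order = get_order(bytes);
		page = alloc_pages(GFP_KERNEL | __GFP_NOWARN, *order);
		if (page)
			return page;
		bytes /= 2;		/* halve the request and retry */
	}
	return NULL;			/* below 1 MB the grow fails */
}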
Michael
> > The issue with aligning the pool areas to hugepages is that hugepage
> > allocation at runtime is not guaranteed. Guaranteeing the hugepage
> > allocation might need calculating the upper bound in advance, which
> > defeats the purpose of enabling dynamic SWIOTLB. I am open to
> > suggestions here.
>
>
> You could allocate a max bound at boot using CMA and then only fill into
> the CMA area when SWIOTLB size requirements increase? The CMA region
> will allow movable allocations as long as you don't require the CMA space.
>
>
> Alex
On Fri, Feb 16, 2024 at 1:56 AM Michael Kelley <[email protected]> wrote:
>
> From: Alexander Graf <[email protected]> Sent: Thursday, February 15, 2024 1:44 AM
> >
> > On 15.02.24 04:33, Vishal Annapurve wrote:
> > > On Wed, Feb 14, 2024 at 8:20 PM Kirill A. Shutemov
> > <[email protected]> wrote:
> > >> On Fri, Jan 12, 2024 at 05:52:47AM +0000, Vishal Annapurve wrote:
> > >>> Modify SWIOTLB framework to allocate DMA memory always from SWIOTLB.
> > >>>
> > >>> CVMs use SWIOTLB buffers for bouncing memory when using dma_map_* APIs
> > >>> to setup memory for IO operations. SWIOTLB buffers are marked as shared
> > >>> once during early boot.
> > >>>
> > >>> Buffers allocated using dma_alloc_* APIs are allocated from kernel memory
> > >>> and then converted to shared during each API invocation. This patch ensures
> > >>> that such buffers are also allocated from already shared SWIOTLB
> > >>> regions. This allows enforcing alignment requirements on regions marked
> > >>> as shared.
> > >> But does it work in practice?
> > >>
> > >> Some devices (like GPUs) require a lot of DMA memory. So with this approach
> > >> we would need to have huge SWIOTLB buffer that is unused in most VMs.
> > >>
> > > Current implementation limits the size of statically allocated SWIOTLB
> > > memory pool to 1G. I was proposing to enable dynamic SWIOTLB for CVMs
> > > in addition to aligning the memory allocations to hugepage sizes, so
> > > that the SWIOTLB pool can be scaled up on demand.
> > >
>
> Vishal --
>
> When the dynamic swiotlb mechanism tries to grow swiotlb space
> by adding another pool, it gets the additional memory as a single
> physically contiguous chunk using alloc_pages(). It starts by trying
> to allocate a chunk the size of the original swiotlb size, and if that
> fails, halves the size until it gets a size where the allocation succeeds.
> The minimum size is 1 Mbyte, and if that fails, the "grow" fails.
>
Thanks for pointing this out.
> So it seems like dynamic swiotlb is subject to the almost the same
> memory fragmentation limitations as trying to allocate huge pages.
> swiotlb needs a minimum of 1 Mbyte contiguous in order to grow,
> while huge pages need 2 Mbytes, but either is likely to be
> problematic in a VM that has been running a while. With that
> in mind, I'm not clear on the benefit of enabling dynamic swiotlb.
> It seems like it just moves around the problem of needing high order
> contiguous memory allocations. Or am I missing something?
>
Currently the SWIOTLB pool is limited to 1GB in size. Kirill has
pointed out that devices like GPUs could need a significant amount of
memory to be allocated from the SWIOTLB pool. Without dynamic SWIOTLB,
such devices run the risk of memory exhaustion without any hope of
recovery.
In addition, I am proposing to have dma_alloc_* APIs use the SWIOTLB area
as well, adding to the memory pressure. If there were a way to calculate
the maximum amount of memory needed for all DMA allocations across all
possible devices used by CoCo VMs, then one could use that number to
preallocate the SWIOTLB pool. I am arguing that calculating that maximum
bound would be difficult, and instead of trying to calculate it, allowing
SWIOTLB to scale dynamically would be better.
So if the above argument for enabling dynamic SWIOTLB makes sense, then it
should be relatively easy to add hugepage alignment restrictions for
SWIOTLB pool increments (in line with the fact that 2MB and 1MB allocations
are nearly equally prone to failure due to memory fragmentation).
> Michael
>
> > > The issue with aligning the pool areas to hugepages is that hugepage
> > > allocation at runtime is not guaranteed. Guaranteeing the hugepage
> > > allocation might need calculating the upper bound in advance, which
> > > defeats the purpose of enabling dynamic SWIOTLB. I am open to
> > > suggestions here.
> >
> >
> > You could allocate a max bound at boot using CMA and then only fill into
> > the CMA area when SWIOTLB size requirements increase? The CMA region
> > will allow movable allocations as long as you don't require the CMA space.
> >
> >
> > Alex
>
From: Vishal Annapurve <[email protected]> Sent: Saturday, February 24, 2024 9:07 AM
>
> On Fri, Feb 16, 2024 at 1:56 AM Michael Kelley <[email protected]> wrote:
> >
> > From: Alexander Graf <[email protected]> Sent: Thursday, February 15, 2024 1:44 AM
> > >
> > > On 15.02.24 04:33, Vishal Annapurve wrote:
> > > > On Wed, Feb 14, 2024 at 8:20 PM Kirill A. Shutemov <[email protected]> wrote:
> > > >> On Fri, Jan 12, 2024 at 05:52:47AM +0000, Vishal Annapurve wrote:
> > > >>> Modify SWIOTLB framework to allocate DMA memory always from SWIOTLB.
> > > >>>
> > > >>> CVMs use SWIOTLB buffers for bouncing memory when using dma_map_* APIs
> > > >>> to setup memory for IO operations. SWIOTLB buffers are marked as shared
> > > >>> once during early boot.
> > > >>>
> > > >>> Buffers allocated using dma_alloc_* APIs are allocated from kernel memory
> > > >>> and then converted to shared during each API invocation. This patch ensures
> > > >>> that such buffers are also allocated from already shared SWIOTLB
> > > >>> regions. This allows enforcing alignment requirements on regions marked
> > > >>> as shared.
> > > >> But does it work in practice?
> > > >>
> > > >> Some devices (like GPUs) require a lot of DMA memory. So with this approach
> > > >> we would need to have huge SWIOTLB buffer that is unused in most VMs.
> > > >>
> > > > Current implementation limits the size of statically allocated SWIOTLB
> > > > memory pool to 1G. I was proposing to enable dynamic SWIOTLB for CVMs
> > > > in addition to aligning the memory allocations to hugepage sizes, so
> > > > that the SWIOTLB pool can be scaled up on demand.
> > > >
> >
> > Vishal --
> >
> > When the dynamic swiotlb mechanism tries to grow swiotlb space
> > by adding another pool, it gets the additional memory as a single
> > physically contiguous chunk using alloc_pages(). It starts by trying
> > to allocate a chunk the size of the original swiotlb size, and if that
> > fails, halves the size until it gets a size where the allocation succeeds.
> > The minimum size is 1 Mbyte, and if that fails, the "grow" fails.
> >
>
> Thanks for pointing this out.
>
> > So it seems like dynamic swiotlb is subject to almost the same
> > memory fragmentation limitations as trying to allocate huge pages.
> > swiotlb needs a minimum of 1 Mbyte contiguous in order to grow,
> > while huge pages need 2 Mbytes, but either is likely to be
> > problematic in a VM that has been running a while. With that
> > in mind, I'm not clear on the benefit of enabling dynamic swiotlb.
> > It seems like it just moves around the problem of needing high order
> > contiguous memory allocations. Or am I missing something?
> >
>
> Currently the SWIOTLB pool is limited to 1GB in size. Kirill has
> pointed out that devices like GPUs could need a significant amount of
> memory to be allocated from the SWIOTLB pool. Without dynamic SWIOTLB,
> such devices run the risk of memory exhaustion without any hope of
> recovery.
Actually, in a CoCo VM the swiotlb pool *defaults* to 6% of
memory, with a max of 1 Gbyte. See mem_encrypt_setup_arch().
So in a CoCo VM with 16 Gbytes of memory or more, you
typically see 1 Gbyte as the swiotlb size. But that's only the
default. Using the kernel boot line parameter "swiotlb=<nnn>"
you can set the initial swiotlb size to something larger, with the
max typically being in the 3 Gbyte range. The 3 Gbyte limit arises
because the swiotlb pool must reside below the 4 Gbyte line
in the CoCo VM's guest physical memory, and there are other
things like 32-bit MMIO space that also must fit below the
4 Gbyte line. Related, I've contemplated submitting a patch to
allow the swiotlb pool in a CoCo VM to be above the 4 Gbyte
line, since the original reasons for being below the 4 Gbyte line
probably don’t apply in a CoCo VM. With such a change, the
kernel boot line could specify an even larger initial swiotlb pool.
But as you point out, calculating a maximum ahead of time
is not really possible, and choosing a really large value to be
"safe" is likely to waste a lot of memory in most cases. So
using dynamic swiotlb makes sense, except that you can't
be sure that dynamic growth will really work because
fragmentation may prevent getting enough contiguous
memory.
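For reference, the default sizing described above (6% of memory, capped
at 1 Gbyte, overridable with swiotlb=<nnn>) boils down to roughly the
following; this is a simplified sketch, not the actual
mem_encrypt_setup_arch() code:

#include <linux/minmax.h>
#include <linux/sizes.h>

/*
 * Simplified sketch of the CoCo VM swiotlb default sizing described
 * above: 6% of system memory, capped at 1 GiB, unless overridden on
 * the kernel command line with swiotlb=<nnn>.
 */
static unsigned long coco_swiotlb_default_bytes(unsigned long total_mem)
{
        return min(total_mem * 6 / 100, (unsigned long)SZ_1G);
}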
One other consideration with dynamic swiotlb pools:
The initially allocated swiotlb pool is divided into "areas",
defaulting to one area for each vCPU in the VM. This allows
CPUs to do swiotlb allocations from their area without having
to contend for a shared spin lock. But if a CPU finds its own
area is full, then it will search the areas of other CPUs, which
can produce spin lock contention, though hopefully that's
rare. In the case of a dynamically allocated addition of 2
Mbytes (for example), the number of areas is limited to
2M/256K = 8. In a VM with more than 8 vCPUs, multiple
CPUs will immediately be contending for the same area
in the dynamically allocated addition, and we've seen
that swiotlb spin lock contention can be a perf issue in
CoCo VMs. Being able to allocate a memory chunk bigger
than 2 Mbytes would allow for more areas, but of course
success in allocating bigger chunks is less likely and the
alloc_pages() limit on x86/x64 is 4 Mbytes anyway.
Overall, my point is that there are tradeoffs. Dynamic
swiotlb may look like a good approach, but it has some
downsides that aren't immediately obvious. My
understanding is that the original motivation for dynamic
swiotlb was small systems with limited memory, where you
could start with a really small swiotlb, and then grow as
necessary. It's less of a good fit on large CoCo VMs with
dozens of vCPUs, for the reasons described above.
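To put numbers on the area limit mentioned above (a back-of-the-envelope
sketch; 2 KiB slots and 128 slots per area correspond to the kernel's
IO_TLB_SHIFT and IO_TLB_SEGSIZE constants):

/*
 * Each swiotlb area spans 128 slots of 2 KiB, i.e. 256 KiB, so a 2 MiB
 * dynamically added pool holds at most 8 areas regardless of vCPU count.
 */
#define SLOT_BYTES      (1UL << 11)     /* IO_TLB_SHIFT = 11 -> 2 KiB */
#define SLOTS_PER_AREA  128UL           /* IO_TLB_SEGSIZE */

static unsigned long max_swiotlb_areas(unsigned long pool_bytes)
{
        return pool_bytes / (SLOT_BYTES * SLOTS_PER_AREA);  /* 2 MiB -> 8 */
}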
>
> In addition, I am proposing to have the dma_alloc_* APIs use the
> SWIOTLB area as well, adding to the memory pressure. If there were a
> way to calculate the maximum amount of memory needed for all DMA
> allocations for all possible devices used by CoCo VMs, then one could
> use that number to preallocate the SWIOTLB pool. I am arguing that
> calculating the maximum bound would be difficult; instead of trying to
> calculate it, allowing SWIOTLB to scale dynamically would be better,
> since it provides better memory utilization than preallocating for a
> worst-case estimate.
Agreed, but see above. I'm not saying dynamic swiotlb
can't be used, but just to be aware of the tradeoffs.
Michael
>
> So if the above argument for enabling dynamic SWIOTLB makes sense, then
> it should be relatively easy to append hugepage alignment restrictions
> for SWIOTLB pool increments (in line with the fact that 2MB and 1MB
> allocations are nearly equally prone to failure due to memory
> fragmentation).
>
> > Michael
> >
> > > > The issue with aligning the pool areas to hugepages is that hugepage
> > > > allocation at runtime is not guaranteed. Guaranteeing the hugepage
> > > > allocation might need calculating the upper bound in advance, which
> > > > defeats the purpose of enabling dynamic SWIOTLB. I am open to
> > > > suggestions here.
> > >
> > >
> > > You could allocate a max bound at boot using CMA and then only fill into
> > > the CMA area when SWIOTLB size requirements increase? The CMA region
> > > will allow movable allocations as long as you don't require the CMA space.
> > >
> > >
> > > Alex
> >
On Thu, Feb 15, 2024 at 3:14 PM Alexander Graf <[email protected]> wrote:
> > ...
> > The issue with aligning the pool areas to hugepages is that hugepage
> > allocation at runtime is not guaranteed. Guaranteeing the hugepage
> > allocation might need calculating the upper bound in advance, which
> > defeats the purpose of enabling dynamic SWIOTLB. I am open to
> > suggestions here.
>
>
> You could allocate a max bound at boot using CMA and then only fill into
> the CMA area when SWIOTLB size requirements increase? The CMA region
> will allow movable allocations as long as you don't require the CMA space.
Thanks Alex for the inputs. I wanted to understand CMA better before
responding here.
I am trying to get the following requirements satisfied:
1) SWIOTLB pools are aligned to hugepage sizes.
2) SWIOTLB pool areas can be scaled up dynamically on demand to avoid
pre-allocating large memory ranges.
Using CMA to back SWIOTLB pools for CoCo VMs as per this suggestion would need:
1) Pre-configuring CMA areas with a certain amount of memory at boot,
either with
- a command line argument to the kernel (tedious to specify across
different memory shapes), or
- a kernel init time hook called by architecture-specific code to
set up CMA areas according to the amount of memory available (possibly
a percentage of memory, as done for the SWIOTLB configuration)
2) SWIOTLB pool dynamic allocation logic can first scan the CMA areas
to find hugepage-aligned ranges and, if none are found, fall back to
allocating the ranges using the buddy allocator (which should ideally
happen only after the CMA area is depleted); see the rough sketch
after this list.
3) SWIOTLB pool regions would need to be allocatable from >4G ranges as well.
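A rough sketch of the allocation order in item 2 (swiotlb_cma is a
hypothetical handle to the CMA area from item 1, and the function name
is made up):

#include <linux/cma.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* Hypothetical CMA area reserved at boot for SWIOTLB growth. */
static struct cma *swiotlb_cma;

static struct page *swiotlb_grow_alloc(size_t bytes, gfp_t gfp)
{
        unsigned int order = get_order(ALIGN(bytes, PMD_SIZE));
        struct page *page = NULL;

        /*
         * Prefer the pre-reserved CMA area; align to the allocation
         * order so 2MB-sized increments come back hugepage aligned.
         */
        if (swiotlb_cma)
                page = cma_alloc(swiotlb_cma, 1UL << order, order, true);

        /* Fall back to the buddy allocator once CMA is depleted. */
        if (!page)
                page = alloc_pages(gfp | __GFP_NOWARN, order);

        return page;
}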
Setting up a suitable percentage of memory for the CMA area in the case
of CoCo VMs will allow larger SWIOTLB pool area additions; this should
help alleviate Michael Kelley's concern about spin lock contention due
to smaller pool areas. This will need some analysis of shared memory
usage with the devices currently in use with CoCo VMs, especially GPUs,
which might need large amounts of shared memory.
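As a minimal sketch of such an init-time hook (the 6% figure is only a
placeholder mirroring the existing swiotlb default, and the function
name is made up):

#include <linux/cma.h>
#include <linux/memblock.h>

static struct cma *swiotlb_cma;         /* same hypothetical handle as above */

void __init swiotlb_cma_reserve(void)
{
        /*
         * Placeholder policy: reserve a percentage of system memory,
         * rounded to hugepage size, as a CMA area for SWIOTLB growth.
         */
        phys_addr_t size = ALIGN(memblock_phys_mem_size() * 6 / 100, PMD_SIZE);

        if (cma_declare_contiguous(0, size, 0, PMD_SIZE, 0, false,
                                   "swiotlb", &swiotlb_cma))
                pr_warn("swiotlb: CMA reservation failed\n");
}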
>
>
> Alex