2023-01-28 08:33:15

by Guorui Yu

Subject: [RFC] swiotlb: Add a new cc-swiotlb implementation for Confidential VMs

This patch series adds a new swiotlb implementation, cc-swiotlb, for
Confidential VMs (such as TDX and SEV-SNP). The new cc-swiotlb allocates
DMA TLB buffers dynamically at runtime instead of reserving a fixed-size
pool at boot. Furthermore, future optimizations and security
enhancements can be applied to cc-swiotlb without "infecting" the
legacy swiotlb.

Background
==========
Under COnfidential COmputing (CoCo) scenarios, the VMM cannot access
guest memory directly but requires the guest to explicitly mark the
memory as shared (decrypted). To make streaming DMA mappings work,
the current implementation relies on the legacy SWIOTLB to bounce the
DMA buffer between private (encrypted) and shared (decrypted) memory.

However, the legacy swiotlb is designed for compatibility rather than
for efficiency or CoCo use cases, which inevitably introduces some
unnecessary restrictions.

1. The fixed, immutable swiotlb size cannot accommodate the requirements
of multiple devices: 1GiB (the current maximum size) of swiotlb in our
testbed cannot sustain simultaneous reads/writes on multiple disks.

2. The fixed, immutable IO_TLB_SIZE (2KiB) cannot satisfy various kinds
of devices. At the moment, the minimal size of a swiotlb buffer is 2KiB,
which wastes memory on small network packets (under 256 bytes) and
decreases efficiency on large-block (up to 256KiB) disk reads/writes.
It is hard to find a single trade-off in the legacy swiotlb that rules
them all.

3. The legacy swiotlb cannot efficiently support larger swiotlb buffers:
in the worst case, the current implementation requires a full scan of
the entire swiotlb buffer, which can cause severe performance hits.

Changes in this patch set
=========================
Instead of further "infecting" the legacy swiotlb code with CoCo logic,
this patch set introduces a new cc-swiotlb for Confidential VMs.

Confidential VMs usually have reasonably modern devices (virtio devices,
NVMe, etc.) that can access memory above 4GiB, so cc-swiotlb can
allocate TLB buffers at any position dynamically. Since
set_memory_{decrypted,encrypted} is time-consuming and cannot be used in
interrupt context, a new kernel thread, "kccd", populates new TLB
buffers on demand, which solves problem 1.

In addition, cc-swiotlb manages TLB buffers in different sizes
(512B, 2KiB, 4KiB, 16KiB, and 512KiB). These values come from the
following observations (booting with 8 cores, 32GiB of memory, 1 NVMe
disk, and 1 virtio-net device):
- Allocations of 512 bytes and below account for 3.5% of the total DMA
cache allocations;
- Allocations of 2 KiB and below account for 57.7%;
- Allocations of 4 KiB and below account for 91.3%;
- Allocations of 16 KiB and below account for 96.0%;
- Allocations of 512 KiB and below account for 100%;
- At the end of booting, cc-swiotlb uses 288 MiB in total.

For comparison, the legacy swiotlb reserves 6% of memory, which here
means min(1GiB, 32GiB * 0.06) = 1GiB, and it hangs when operating
multiple disks simultaneously because the swiotlb buffer runs out.

These patches were tested with fio (using different iodepth and block
size settings) on a platform with 96 cores, 384GiB of memory, and 20
NVMe disks, and no IO hang or error was observed.

For simplicity, the current RFC version cannot switch between the
legacy implementation and cc-swiotlb via the cmdline, only through
compile options. I am open to discussing how to integrate cc-swiotlb
into the legacy one.

Patch Organization
==================
- swiotlb: Split common code from swiotlb.{c,h}
- swiotlb: Add a new cc-swiotlb implementation for Confidential VMs
- swiotlb: Add tracepoint swiotlb_unbounced
- cc-swiotlb: Allow set swiotlb watermark from cmdline

Thanks for your time!

Have a nice day,
Guorui




2023-01-28 08:33:17

by Guorui Yu

Subject: [PATCH 1/4] swiotlb: Split common code from swiotlb.{c,h}

Split swiotlb_bounce, swiotlb_release_slots, and is_swiotlb_buffer out
of swiotlb.{c,h} into common-swiotlb.c, in preparation for the new
swiotlb implementation.

Signed-off-by: GuoRui.Yu <[email protected]>
---
include/linux/swiotlb.h | 10 ++---
kernel/dma/Makefile | 2 +-
kernel/dma/common-swiotlb.c | 74 +++++++++++++++++++++++++++++++++
kernel/dma/swiotlb.c | 82 ++++---------------------------------
4 files changed, 88 insertions(+), 80 deletions(-)
create mode 100644 kernel/dma/common-swiotlb.c

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 35bc4e281c21..c5e74d3f9cbf 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -58,6 +58,9 @@ void swiotlb_sync_single_for_cpu(struct device *dev, phys_addr_t tlb_addr,
size_t size, enum dma_data_direction dir);
dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
size_t size, enum dma_data_direction dir, unsigned long attrs);
+void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
+ enum dma_data_direction dir);
+void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr);

#ifdef CONFIG_SWIOTLB

@@ -105,12 +108,7 @@ struct io_tlb_mem {
};
extern struct io_tlb_mem io_tlb_default_mem;

-static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
-{
- struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
-
- return mem && paddr >= mem->start && paddr < mem->end;
-}
+bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr);

static inline bool is_swiotlb_force_bounce(struct device *dev)
{
diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
index 21926e46ef4f..fc0ea13bc089 100644
--- a/kernel/dma/Makefile
+++ b/kernel/dma/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_DMA_OPS) += dummy.o
obj-$(CONFIG_DMA_CMA) += contiguous.o
obj-$(CONFIG_DMA_DECLARE_COHERENT) += coherent.o
obj-$(CONFIG_DMA_API_DEBUG) += debug.o
-obj-$(CONFIG_SWIOTLB) += swiotlb.o
+obj-$(CONFIG_SWIOTLB) += swiotlb.o common-swiotlb.o
obj-$(CONFIG_DMA_COHERENT_POOL) += pool.o
obj-$(CONFIG_MMU) += remap.o
obj-$(CONFIG_DMA_MAP_BENCHMARK) += map_benchmark.o
diff --git a/kernel/dma/common-swiotlb.c b/kernel/dma/common-swiotlb.c
new file mode 100644
index 000000000000..d477d5f2a71b
--- /dev/null
+++ b/kernel/dma/common-swiotlb.c
@@ -0,0 +1,74 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/dma-direct.h>
+#include <linux/swiotlb.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/swiotlb.h>
+
+/*
+ * Create a swiotlb mapping for the buffer at @paddr, and in case of DMAing
+ * to the device copy the data into it as well.
+ */
+dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
+ enum dma_data_direction dir, unsigned long attrs)
+{
+ phys_addr_t swiotlb_addr;
+ dma_addr_t dma_addr;
+
+ trace_swiotlb_bounced(dev, phys_to_dma(dev, paddr), size);
+
+ swiotlb_addr = swiotlb_tbl_map_single(dev, paddr, size, size, 0, dir,
+ attrs);
+ if (swiotlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
+ return DMA_MAPPING_ERROR;
+
+ /* Ensure that the address returned is DMA'ble */
+ dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
+ if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
+ swiotlb_tbl_unmap_single(dev, swiotlb_addr, size, dir,
+ attrs | DMA_ATTR_SKIP_CPU_SYNC);
+ dev_WARN_ONCE(dev, 1,
+ "swiotlb addr %pad+%zu overflow (mask %llx, bus limit %llx).\n",
+ &dma_addr, size, *dev->dma_mask, dev->bus_dma_limit);
+ return DMA_MAPPING_ERROR;
+ }
+
+ if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+ arch_sync_dma_for_device(swiotlb_addr, size, dir);
+ return dma_addr;
+}
+
+/*
+ * tlb_addr is the physical address of the bounce buffer to unmap.
+ */
+void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
+ size_t mapping_size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ /*
+ * First, sync the memory before unmapping the entry
+ */
+ if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
+ (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
+ swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_FROM_DEVICE);
+
+ swiotlb_release_slots(dev, tlb_addr);
+}
+
+void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
+ size_t size, enum dma_data_direction dir)
+{
+ if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
+ swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
+ else
+ BUG_ON(dir != DMA_FROM_DEVICE);
+}
+
+void swiotlb_sync_single_for_cpu(struct device *dev, phys_addr_t tlb_addr,
+ size_t size, enum dma_data_direction dir)
+{
+ if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
+ swiotlb_bounce(dev, tlb_addr, size, DMA_FROM_DEVICE);
+ else
+ BUG_ON(dir != DMA_TO_DEVICE);
+}
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index a34c38bbe28f..f3ff4de08653 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -48,9 +48,6 @@
#include <linux/slab.h>
#endif

-#define CREATE_TRACE_POINTS
-#include <trace/events/swiotlb.h>
-
#define SLABS_PER_PAGE (1 << (PAGE_SHIFT - IO_TLB_SHIFT))

/*
@@ -523,7 +520,7 @@ static unsigned int swiotlb_align_offset(struct device *dev, u64 addr)
/*
* Bounce: copy the swiotlb buffer from or back to the original dma location
*/
-static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
+void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
enum dma_data_direction dir)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
@@ -793,7 +790,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
return tlb_addr;
}

-static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
+void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
unsigned long flags;
@@ -840,74 +837,6 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
spin_unlock_irqrestore(&area->lock, flags);
}

-/*
- * tlb_addr is the physical address of the bounce buffer to unmap.
- */
-void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
- size_t mapping_size, enum dma_data_direction dir,
- unsigned long attrs)
-{
- /*
- * First, sync the memory before unmapping the entry
- */
- if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
- (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
- swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_FROM_DEVICE);
-
- swiotlb_release_slots(dev, tlb_addr);
-}
-
-void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
- size_t size, enum dma_data_direction dir)
-{
- if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
- swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
- else
- BUG_ON(dir != DMA_FROM_DEVICE);
-}
-
-void swiotlb_sync_single_for_cpu(struct device *dev, phys_addr_t tlb_addr,
- size_t size, enum dma_data_direction dir)
-{
- if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
- swiotlb_bounce(dev, tlb_addr, size, DMA_FROM_DEVICE);
- else
- BUG_ON(dir != DMA_TO_DEVICE);
-}
-
-/*
- * Create a swiotlb mapping for the buffer at @paddr, and in case of DMAing
- * to the device copy the data into it as well.
- */
-dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
- enum dma_data_direction dir, unsigned long attrs)
-{
- phys_addr_t swiotlb_addr;
- dma_addr_t dma_addr;
-
- trace_swiotlb_bounced(dev, phys_to_dma(dev, paddr), size);
-
- swiotlb_addr = swiotlb_tbl_map_single(dev, paddr, size, size, 0, dir,
- attrs);
- if (swiotlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
- return DMA_MAPPING_ERROR;
-
- /* Ensure that the address returned is DMA'ble */
- dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
- if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
- swiotlb_tbl_unmap_single(dev, swiotlb_addr, size, dir,
- attrs | DMA_ATTR_SKIP_CPU_SYNC);
- dev_WARN_ONCE(dev, 1,
- "swiotlb addr %pad+%zu overflow (mask %llx, bus limit %llx).\n",
- &dma_addr, size, *dev->dma_mask, dev->bus_dma_limit);
- return DMA_MAPPING_ERROR;
- }
-
- if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
- arch_sync_dma_for_device(swiotlb_addr, size, dir);
- return dma_addr;
-}
-
size_t swiotlb_max_mapping_size(struct device *dev)
{
int min_align_mask = dma_get_min_align_mask(dev);
@@ -924,6 +853,13 @@ size_t swiotlb_max_mapping_size(struct device *dev)
return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE - min_align;
}

+bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
+{
+ struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+
+ return mem && paddr >= mem->start && paddr < mem->end;
+}
+
bool is_swiotlb_active(struct device *dev)
{
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
--
2.31.1


2023-01-28 08:33:20

by Guorui Yu

Subject: [PATCH 2/4] swiotlb: Add a new cc-swiotlb implementation for Confidential VMs

Under COnfidential COmputing (CoCo) scenarios, the VMM cannot access
guest memory directly but requires the guest to explicitly mark the
memory as shared (decrypted). To make streaming DMA mappings work,
the current implementation relies on the legacy SWIOTLB to bounce the
DMA buffer between private (encrypted) and shared (decrypted) memory.

However, the legacy swiotlb is designed for compatibility rather than
for efficiency or CoCo use cases, which inevitably introduces some
unnecessary restrictions.
1. The fixed, immutable swiotlb size cannot accommodate the requirements
   of multiple devices: 1GiB (the current maximum size) of swiotlb in
   our testbed cannot sustain simultaneous reads/writes on multiple
   disks.
2. The fixed, immutable IO_TLB_SIZE (2KiB) cannot satisfy various kinds
   of devices. At the moment, the minimal size of a swiotlb buffer is
   2KiB, which wastes memory on small network packets (under 512 bytes)
   and decreases efficiency on large-block (up to 256KiB) disk
   reads/writes. It is hard to find a single trade-off in the legacy
   swiotlb that rules them all.
3. The legacy swiotlb cannot efficiently support larger swiotlb buffers:
   in the worst case, the current implementation requires a full scan of
   the entire swiotlb buffer, which can cause severe performance hits.

Instead of further "infecting" the legacy swiotlb code with CoCo logic,
this patch set introduces a new cc-swiotlb for Confidential VMs.

Confidential VMs usually have reasonably modern devices (virtio devices,
NVMe, etc.) that can access memory above 4GiB, so cc-swiotlb can
allocate TLB buffers dynamically on demand, and this design solves
problem 1.

In addition, cc-swiotlb manages TLB buffers in different sizes
(512B, 2KiB, 4KiB, 16KiB, and 512KiB), which solves problems 2 and 3.

Signed-off-by: GuoRui.Yu <[email protected]>
---
arch/x86/Kconfig | 3 +-
arch/x86/include/asm/iommu.h | 2 +-
arch/x86/kernel/pci-dma.c | 6 +-
drivers/base/core.c | 2 +-
drivers/iommu/amd/Kconfig | 2 +-
drivers/iommu/dma-iommu.c | 2 +-
drivers/iommu/intel/Kconfig | 2 +-
include/linux/device.h | 2 +-
include/linux/swiotlb.h | 19 +-
kernel/dma/Kconfig | 10 +
kernel/dma/Makefile | 1 +
kernel/dma/cc-swiotlb.c | 492 +++++++++++++++++++++++++++++++++++
kernel/dma/common-swiotlb.c | 1 +
kernel/dma/direct.c | 4 +-
kernel/dma/direct.h | 4 +-
15 files changed, 537 insertions(+), 15 deletions(-)
create mode 100644 kernel/dma/cc-swiotlb.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3604074a878b..bce367fc4b55 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -31,7 +31,7 @@ config X86_64
select HAVE_ARCH_SOFT_DIRTY
select MODULES_USE_ELF_RELA
select NEED_DMA_MAP_STATE
- select SWIOTLB
+ select SWIOTLB if !X86_MEM_ENCRYPT
select ARCH_HAS_ELFCORE_COMPAT
select ZONE_DMA32

@@ -1535,6 +1535,7 @@ config X86_CPA_STATISTICS
config X86_MEM_ENCRYPT
select ARCH_HAS_FORCE_DMA_UNENCRYPTED
select DYNAMIC_PHYSICAL_MASK
+ select SWIOTLB if !CC_SWIOTLB
def_bool n

config AMD_MEM_ENCRYPT
diff --git a/arch/x86/include/asm/iommu.h b/arch/x86/include/asm/iommu.h
index 2fd52b65deac..a4fbe92b2321 100644
--- a/arch/x86/include/asm/iommu.h
+++ b/arch/x86/include/asm/iommu.h
@@ -11,7 +11,7 @@ extern int iommu_detected;
extern int iommu_merge;
extern int panic_on_overflow;

-#ifdef CONFIG_SWIOTLB
+#if defined(CONFIG_SWIOTLB) || defined(CONFIG_CC_SWIOTLB)
extern bool x86_swiotlb_enable;
#else
#define x86_swiotlb_enable false
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 30bbe4abb5d6..41ad97bfc5be 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -37,7 +37,7 @@ int no_iommu __read_mostly;
/* Set this to 1 if there is a HW IOMMU in the system */
int iommu_detected __read_mostly = 0;

-#ifdef CONFIG_SWIOTLB
+#if defined(CONFIG_SWIOTLB) || defined(CONFIG_CC_SWIOTLB)
bool x86_swiotlb_enable;
static unsigned int x86_swiotlb_flags;

@@ -169,7 +169,7 @@ static __init int iommu_setup(char *p)
disable_dac_quirk = true;
return 1;
}
-#ifdef CONFIG_SWIOTLB
+#if defined(CONFIG_SWIOTLB) || defined(CONFIG_CC_SWIOTLB)
if (!strncmp(p, "soft", 4))
x86_swiotlb_enable = true;
#endif
@@ -192,7 +192,7 @@ static int __init pci_iommu_init(void)
{
x86_init.iommu.iommu_init();

-#ifdef CONFIG_SWIOTLB
+#if defined(CONFIG_SWIOTLB) || defined(CONFIG_CC_SWIOTLB)
/* An IOMMU turned us off. */
if (x86_swiotlb_enable) {
pr_info("PCI-DMA: Using software bounce buffering for IO (SWIOTLB)\n");
diff --git a/drivers/base/core.c b/drivers/base/core.c
index a3e14143ec0c..0a0b35f8b5c3 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2955,7 +2955,7 @@ void device_initialize(struct device *dev)
defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
dev->dma_coherent = dma_default_coherent;
#endif
-#ifdef CONFIG_SWIOTLB
+#if defined(CONFIG_SWIOTLB) || defined(CONFIG_CC_SWIOTLB)
dev->dma_io_tlb_mem = &io_tlb_default_mem;
#endif
}
diff --git a/drivers/iommu/amd/Kconfig b/drivers/iommu/amd/Kconfig
index 9b5fc3356bf2..f8dcb67d02cc 100644
--- a/drivers/iommu/amd/Kconfig
+++ b/drivers/iommu/amd/Kconfig
@@ -2,7 +2,7 @@
# AMD IOMMU support
config AMD_IOMMU
bool "AMD IOMMU support"
- select SWIOTLB
+ select SWIOTLB if !X86_MEM_ENCRYPT
select PCI_MSI
select PCI_ATS
select PCI_PRI
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f798c44e0903..06f6ad1ef59c 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -519,7 +519,7 @@ static bool dev_is_untrusted(struct device *dev)

static bool dev_use_swiotlb(struct device *dev)
{
- return IS_ENABLED(CONFIG_SWIOTLB) && dev_is_untrusted(dev);
+ return (IS_ENABLED(CONFIG_SWIOTLB) || IS_ENABLED(CONFIG_CC_SWIOTLB)) && dev_is_untrusted(dev);
}

/**
diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
index b7dff5092fd2..1f56f78b1880 100644
--- a/drivers/iommu/intel/Kconfig
+++ b/drivers/iommu/intel/Kconfig
@@ -17,7 +17,7 @@ config INTEL_IOMMU
select IOMMU_IOVA
select NEED_DMA_MAP_STATE
select DMAR_TABLE
- select SWIOTLB
+ select SWIOTLB if !X86_MEM_ENCRYPT
select IOASID
select PCI_ATS
select PCI_PRI
diff --git a/include/linux/device.h b/include/linux/device.h
index 44e3acae7b36..b5b082d7d1de 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -607,7 +607,7 @@ struct device {
struct cma *cma_area; /* contiguous memory area for dma
allocations */
#endif
-#ifdef CONFIG_SWIOTLB
+#if defined(CONFIG_SWIOTLB) || defined(CONFIG_CC_SWIOTLB)
struct io_tlb_mem *dma_io_tlb_mem;
#endif
/* arch specific additions */
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index c5e74d3f9cbf..0384784ebce4 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -106,6 +106,23 @@ struct io_tlb_mem {
struct io_tlb_area *areas;
struct io_tlb_slot *slots;
};
+#elif defined(CONFIG_CC_SWIOTLB)
+
+/**
+ * struct io_tlb_mem - io tlb memory pool descriptor
+ *
+ * @force_bounce: %true if swiotlb bouncing is forced
+ * @mapping: the mapping relations between the END address of backing memory and
+ * original DMA address.
+ */
+struct io_tlb_mem {
+ bool force_bounce;
+ struct xarray* mapping;
+ struct dentry *debugfs;
+};
+#endif
+
+#if defined(CONFIG_SWIOTLB) || defined(CONFIG_CC_SWIOTLB)
extern struct io_tlb_mem io_tlb_default_mem;

bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr);
@@ -155,7 +172,7 @@ static inline bool is_swiotlb_active(struct device *dev)
static inline void swiotlb_adjust_size(unsigned long size)
{
}
-#endif /* CONFIG_SWIOTLB */
+#endif /* CONFIG_SWIOTLB || CONFIG_CC_SWIOTLB */

extern void swiotlb_print_info(void);

diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index 56866aaa2ae1..7e6b20d2091f 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -78,8 +78,18 @@ config ARCH_HAS_FORCE_DMA_UNENCRYPTED

config SWIOTLB
bool
+ depends on !CC_SWIOTLB
select NEED_DMA_MAP_STATE

+config CC_SWIOTLB
+ bool "Enable cc-swiotlb for Confidential VMs"
+ default n
+ select NEED_DMA_MAP_STATE
+ help
+ This enables a cc-swiotlb implementation for Confidential VMs,
+ which allows allocating the SWIOTLB bounce buffers at runtime.
+ If unsure, say "n".
+
config DMA_RESTRICTED_POOL
bool "DMA Restricted Pool"
depends on OF && OF_RESERVED_MEM && SWIOTLB
diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
index fc0ea13bc089..80c68292da3d 100644
--- a/kernel/dma/Makefile
+++ b/kernel/dma/Makefile
@@ -7,6 +7,7 @@ obj-$(CONFIG_DMA_CMA) += contiguous.o
obj-$(CONFIG_DMA_DECLARE_COHERENT) += coherent.o
obj-$(CONFIG_DMA_API_DEBUG) += debug.o
obj-$(CONFIG_SWIOTLB) += swiotlb.o common-swiotlb.o
+obj-$(CONFIG_CC_SWIOTLB) += cc-swiotlb.o common-swiotlb.o
obj-$(CONFIG_DMA_COHERENT_POOL) += pool.o
obj-$(CONFIG_MMU) += remap.o
obj-$(CONFIG_DMA_MAP_BENCHMARK) += map_benchmark.o
diff --git a/kernel/dma/cc-swiotlb.c b/kernel/dma/cc-swiotlb.c
new file mode 100644
index 000000000000..e85d3ce2272f
--- /dev/null
+++ b/kernel/dma/cc-swiotlb.c
@@ -0,0 +1,492 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Confidential Computing VM DMA mapping support.
+ * Copyright (C) 2023 Guorui Yu <[email protected]>
+ */
+
+#define pr_fmt(fmt) "cc-swiotlb: " fmt
+
+#include <linux/cc_platform.h>
+#include <linux/types.h>
+#include <linux/ctype.h>
+#include <linux/init.h>
+#include <linux/printk.h>
+#include <linux/xarray.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/io.h>
+#include <linux/dma-direct.h>
+#include <linux/set_memory.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+#include <linux/highmem.h>
+#include <linux/pfn.h>
+#ifdef CONFIG_DEBUG_FS
+#include <linux/debugfs.h>
+#endif
+
+#include <linux/swiotlb.h>
+
+#define INVALID_PHYS_ADDR (~(phys_addr_t)0)
+
+struct io_tlb_slot {
+ phys_addr_t orig_addr;
+ unsigned int alloc_size;
+ unsigned int orig_size;
+ struct list_head node;
+ void *va;
+};
+
+struct io_tlb_mem io_tlb_default_mem;
+static struct task_struct *kccd_tsk;
+static DECLARE_WAIT_QUEUE_HEAD(kccd_waitq);
+
+enum SLOT_SIZE {
+ SIZE_512B,
+ SIZE_2K,
+ SIZE_4K,
+ SIZE_16K,
+ SIZE_512K,
+ NR_OF_SIZES
+};
+
+const unsigned int SLOT_SIZES[NR_OF_SIZES] = {512, 2048, 4096, 16 * 1024, 512 * 1024};
+unsigned int WATERMARK_SLOT_SIZE[NR_OF_SIZES] = {256, 16384, 8192, 1024, 32};
+
+struct slot_terrace {
+ struct list_head slots_by_size[NR_OF_SIZES];
+ spinlock_t lock_by_size[NR_OF_SIZES];
+ atomic_t free_count_by_size[NR_OF_SIZES];
+ atomic_t total_count_by_size[NR_OF_SIZES];
+};
+
+static struct slot_terrace terrace;
+
+bool is_swiotlb_active(struct device *dev)
+{
+ return dev->dma_io_tlb_mem->mapping != NULL;
+}
+EXPORT_SYMBOL_GPL(is_swiotlb_active);
+
+bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
+{
+ struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+ struct io_tlb_slot *slot;
+ phys_addr_t slot_end = paddr;
+ bool found = false;
+
+ if (!is_swiotlb_active(dev))
+ return false;
+
+ slot = xa_find_after(mem->mapping, (unsigned long*)&slot_end, ULONG_MAX, XA_PRESENT);
+ if (slot && slot_end - slot->orig_size <= paddr && paddr < slot_end)
+ found = true;
+
+ return found;
+}
+
+static inline ssize_t tr_index(size_t size, unsigned long align_mask) {
+ ssize_t i;
+
+ for (i = 0; i < NR_OF_SIZES; i++) {
+ if (size <= SLOT_SIZES[i] && !(align_mask & SLOT_SIZES[i]))
+ break;
+ }
+ if (unlikely(i == NR_OF_SIZES)) {
+ WARN(1, "Try to allocate too large TLB with size %zu align_mask %lx\n", size, align_mask);
+ return -ENOMEM;
+ }
+ return i;
+}
+
+static bool cc_should_populate(void)
+{
+ size_t i;
+
+ for (i = 0; i < NR_OF_SIZES; i++) {
+ if (atomic_read(&terrace.free_count_by_size[i]) < WATERMARK_SLOT_SIZE[i] / 2)
+ return true;
+ }
+ return false;
+}
+
+static struct io_tlb_slot *__cc_alloc_slot_from_list(size_t alloc_size, unsigned long align_mask)
+{
+ struct io_tlb_slot *slot = NULL;
+ unsigned long flags;
+ spinlock_t *lock;
+ struct list_head *head;
+ atomic_t *count;
+ ssize_t i;
+
+ i = tr_index(alloc_size, align_mask);
+ if (unlikely(i < 0))
+ return NULL;
+
+ lock = &terrace.lock_by_size[i];
+ head = &terrace.slots_by_size[i];
+ count = &terrace.free_count_by_size[i];
+
+ spin_lock_irqsave(lock, flags);
+
+ if (list_empty(head)) {
+ spin_unlock_irqrestore(lock, flags);
+ return NULL;
+ }
+
+ slot = list_first_entry(head, struct io_tlb_slot, node);
+ list_del_init(&slot->node);
+ atomic_dec(count);
+
+ spin_unlock_irqrestore(lock, flags);
+
+ if (cc_should_populate())
+ wake_up(&kccd_waitq);
+
+ return slot;
+}
+
+static void __cc_free_slot_to_list(struct io_tlb_slot *slot)
+{
+ unsigned long flags;
+ ssize_t i;
+
+ for (i = 0; i < NR_OF_SIZES; i++) {
+ if (slot->orig_size <= SLOT_SIZES[i])
+ break;
+ }
+
+ slot->orig_addr = INVALID_PHYS_ADDR;
+ slot->alloc_size = 0;
+
+ spin_lock_irqsave(&terrace.lock_by_size[i], flags);
+ list_add(&slot->node, &terrace.slots_by_size[i]);
+ atomic_inc(&terrace.free_count_by_size[i]);
+ spin_unlock_irqrestore(&terrace.lock_by_size[i], flags);
+}
+
+/*
+ * Bounce: copy the swiotlb buffer from or back to the original dma location
+ */
+void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
+ enum dma_data_direction dir)
+{
+ phys_addr_t orig_addr, tlb_end, slot_start, slot_end = tlb_addr;
+ struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+ struct io_tlb_slot *slot;
+ unsigned long pfn;
+ void *backing_addr;
+ size_t alloc_size;
+
+ slot = xa_find_after(mem->mapping, (unsigned long*)&slot_end, ULONG_MAX, XA_PRESENT);
+ if (!slot) {
+ dev_WARN_ONCE(dev, 1, "TLB buffer not found.\n");
+ return;
+ }
+
+ orig_addr = slot->orig_addr;
+ alloc_size = slot->alloc_size;
+ slot_start = slot_end - slot->orig_size;
+ tlb_end = tlb_addr + size;
+
+ pfn = PFN_DOWN(orig_addr);
+ if (PageHighMem(pfn_to_page(pfn))) {
+ dev_WARN_ONCE(dev, 1, "Unexpected high memory.\n");
+ return;
+ }
+
+ if (size > alloc_size) {
+ dev_WARN_ONCE(dev, 1,
+ "Buffer overflow detected. Allocation size: %zu. Mapping size: %zu.\n",
+ alloc_size, size);
+ size = alloc_size;
+ }
+
+ backing_addr = slot->va + (tlb_addr - slot_start);
+
+ if (dir == DMA_TO_DEVICE) {
+ memcpy(backing_addr, phys_to_virt(orig_addr), size);
+ } else {
+ memcpy(phys_to_virt(orig_addr), backing_addr, size);
+ }
+}
+
+static struct io_tlb_slot* cc_swiotlb_get_slots(struct device *dev,
+ phys_addr_t orig_addr,
+ size_t alloc_size,
+ unsigned long alloc_align_mask,
+ phys_addr_t *slot_end)
+{
+ struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+ struct io_tlb_slot *slot;
+ unsigned long index, flags;
+ int ret;
+
+ alloc_align_mask = max((unsigned long)dma_get_min_align_mask(dev), alloc_align_mask);
+ slot = __cc_alloc_slot_from_list(alloc_size, alloc_align_mask);
+ if (!slot) {
+ dev_warn_once(dev, "Fail to alloc cc swiotlb slot\n");
+ return NULL;
+ }
+
+ slot->alloc_size = alloc_size;
+ slot->orig_addr = orig_addr;
+ index = __pa(slot->va) + slot->orig_size;
+ *slot_end = index;
+
+ xa_lock_irqsave(mem->mapping, flags);
+ ret = __xa_insert(mem->mapping, (unsigned long)index, slot, GFP_NOWAIT);
+ xa_unlock_irqrestore(mem->mapping, flags);
+ if (ret) {
+ dev_warn_ratelimited(dev, "Fail to insert index with error 0x%x\n", ret);
+ __cc_free_slot_to_list(slot);
+ return NULL;
+ }
+
+ return slot;
+}
+
+void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
+{
+ struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+ struct io_tlb_slot *slot;
+ phys_addr_t slot_end = tlb_addr;
+ unsigned long flags;
+
+ slot = xa_find_after(mem->mapping, (unsigned long*)&slot_end, ULONG_MAX, XA_PRESENT);
+ if (!slot || tlb_addr >= slot_end || tlb_addr < slot_end - slot->orig_size) {
+ dev_warn_ratelimited(dev, "Failed to release slots since no corresponding slot is found\n");
+ return;
+ }
+
+ xa_lock_irqsave(mem->mapping, flags);
+ slot = __xa_erase(mem->mapping, slot_end);
+ xa_unlock_irqrestore(mem->mapping, flags);
+ if (!slot)
+ return;
+
+ __cc_free_slot_to_list(slot);
+}
+
+phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
+ size_t mapping_size, size_t alloc_size,
+ unsigned int alloc_align_mask, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+ unsigned int offset = orig_addr & dma_get_min_align_mask(dev);
+ phys_addr_t tlb_addr;
+ struct io_tlb_slot *slot;
+ phys_addr_t slot_end;
+
+ if (!mem)
+ panic("Can not allocate SWIOTLB buffer earlier and can't now provide you with the DMA bounce buffer");
+
+ if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
+ pr_warn_once("Memory encryption is active and system is using DMA bounce buffers\n");
+
+ if (mapping_size > alloc_size) {
+ dev_warn_once(dev, "Invalid sizes (mapping: %zd bytes, alloc: %zd bytes)",
+ mapping_size, alloc_size);
+ return (phys_addr_t)DMA_MAPPING_ERROR;
+ }
+
+ slot = cc_swiotlb_get_slots(dev, orig_addr, alloc_size + offset, alloc_align_mask, &slot_end);
+ if (slot == NULL) {
+ if (!(attrs & DMA_ATTR_NO_WARN))
+ dev_warn_ratelimited(dev, "No memory for cc-swiotlb buffer (sz: %zd bytes)\n",
+ alloc_size);
+ return (phys_addr_t)DMA_MAPPING_ERROR;
+ }
+
+ tlb_addr = slot_end - slot->orig_size + offset;
+
+ /*
+ * When dir == DMA_FROM_DEVICE we could omit the copy from the orig
+ * to the tlb buffer, if we knew for sure the device will
+ * overwrite the entire current content. But we don't. Thus
+ * unconditional bounce may prevent leaking swiotlb content (i.e.
+ * kernel memory) to user-space.
+ */
+ swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
+ return tlb_addr;
+}
+
+void swiotlb_print_info(void)
+{
+ pr_info("Using dynamically allocated swiotlb buffers\n");
+}
+
+size_t swiotlb_max_mapping_size(struct device *dev)
+{
+ return SLOT_SIZES[NR_OF_SIZES - 1];
+}
+
+static void cc_populate_pages(void)
+{
+ void *vaddr;
+ unsigned long flags;
+ struct io_tlb_slot *slots;
+ size_t i, j, k;
+ const size_t max_bytes_per_alloc = (MAX_ORDER_NR_PAGES << PAGE_SHIFT);
+ size_t remain_alloc_size, rounds, round_size, round_slot_nr;
+
+ for (i = 0; i < NR_OF_SIZES; i++) {
+ if (atomic_read(&terrace.free_count_by_size[i]) > WATERMARK_SLOT_SIZE[i])
+ continue;
+
+ remain_alloc_size = SLOT_SIZES[i] * WATERMARK_SLOT_SIZE[i];
+ rounds = (remain_alloc_size + max_bytes_per_alloc - 1) / max_bytes_per_alloc;
+
+ for (k = 0; k < rounds && remain_alloc_size; k++) {
+ round_size = (remain_alloc_size / max_bytes_per_alloc > 0)?
+ max_bytes_per_alloc:
+ remain_alloc_size;
+ remain_alloc_size -= round_size;
+
+ vaddr = alloc_pages_exact(round_size, GFP_KERNEL);
+ if (!vaddr) {
+ pr_warn_ratelimited("No memory for cc swiotlb!\n");
+ return;
+ }
+
+ if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
+ set_memory_decrypted((unsigned long)vaddr, round_size >> PAGE_SHIFT);
+
+ round_slot_nr = round_size / SLOT_SIZES[i];
+ slots = kmalloc(sizeof(struct io_tlb_slot) * round_slot_nr, GFP_KERNEL);
+ if (!slots) {
+ pr_warn_ratelimited("No memory for cc swiotlb metadata!\n");
+ if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
+ set_memory_encrypted((unsigned long)vaddr, round_size >> PAGE_SHIFT);
+ free_pages_exact(vaddr, round_size);
+ return;
+ }
+
+ spin_lock_irqsave(&terrace.lock_by_size[i], flags);
+ for (j = 0; j < round_slot_nr; j++) {
+ slots[j].va = vaddr + SLOT_SIZES[i] * j;
+ slots[j].alloc_size = 0;
+ slots[j].orig_addr = INVALID_PHYS_ADDR;
+ slots[j].orig_size = SLOT_SIZES[i];
+
+ list_add_tail(&slots[j].node, &terrace.slots_by_size[i]);
+ }
+ atomic_add(round_slot_nr, &terrace.free_count_by_size[i]);
+ atomic_add(round_slot_nr, &terrace.total_count_by_size[i]);
+ spin_unlock_irqrestore(&terrace.lock_by_size[i], flags);
+ }
+ }
+}
+
+static int kccd(void *p)
+{
+ set_freezable();
+
+ if (cc_should_populate())
+ cc_populate_pages();
+
+ while (!kthread_should_stop()) {
+ if (try_to_freeze())
+ continue;
+
+ wait_event_freezable(kccd_waitq,
+ kthread_should_stop() ||
+ cc_should_populate());
+
+ if (cc_should_populate())
+ cc_populate_pages();
+
+ cond_resched();
+ }
+
+ return 0;
+}
+
+static int __init cc_page_populator_init(void)
+{
+ struct task_struct *tsk;
+ size_t i;
+
+ if (!io_tlb_default_mem.force_bounce)
+ return 0;
+
+ for (i = 0; i < NR_OF_SIZES; i++) {
+ spin_lock_init(&terrace.lock_by_size[i]);
+ atomic_set(&terrace.free_count_by_size[i], 0);
+ atomic_set(&terrace.total_count_by_size[i], 0);
+ INIT_LIST_HEAD(&terrace.slots_by_size[i]);
+ }
+
+ tsk = kthread_run(kccd, NULL, "kccd");
+ if (IS_ERR(tsk))
+ return 1;
+
+ kccd_tsk = tsk;
+
+ return 0;
+}
+
+arch_initcall(cc_page_populator_init);
+
+void __init swiotlb_init(bool addressing_limit, unsigned int flags)
+{
+ io_tlb_default_mem.mapping = memblock_alloc(sizeof(struct xarray), PAGE_SIZE);
+ if (!io_tlb_default_mem.mapping) {
+ pr_warn("%s: failed to allocate mapping structure\n", __func__);
+ return;
+ }
+
+ io_tlb_default_mem.force_bounce = cc_platform_has(CC_ATTR_MEM_ENCRYPT);
+
+ xa_init_flags(io_tlb_default_mem.mapping, 0);
+
+ /* cc swiotlb allocates IO TLB at runtime, no need to reserve memory. */
+}
+
+void __init swiotlb_exit(void)
+{
+ pr_warn("cc-swiotlb does not support being turned off at runtime.\n");
+}
+
+#ifdef CONFIG_DEBUG_FS
+static void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
+ const char *dirname)
+{
+ mem->debugfs = debugfs_create_dir(dirname, io_tlb_default_mem.debugfs);
+ debugfs_create_atomic_t("io_tlb_free_512b", 0400, mem->debugfs,
+ &terrace.free_count_by_size[SIZE_512B]);
+ debugfs_create_atomic_t("io_tlb_free_2k", 0400, mem->debugfs,
+ &terrace.free_count_by_size[SIZE_2K]);
+ debugfs_create_atomic_t("io_tlb_free_4k", 0400, mem->debugfs,
+ &terrace.free_count_by_size[SIZE_4K]);
+ debugfs_create_atomic_t("io_tlb_free_16k", 0400, mem->debugfs,
+ &terrace.free_count_by_size[SIZE_16K]);
+ debugfs_create_atomic_t("io_tlb_free_512k", 0400, mem->debugfs,
+ &terrace.free_count_by_size[SIZE_512K]);
+
+ debugfs_create_atomic_t("io_tlb_total_512b", 0400, mem->debugfs,
+ &terrace.total_count_by_size[SIZE_512B]);
+ debugfs_create_atomic_t("io_tlb_total_2k", 0400, mem->debugfs,
+ &terrace.total_count_by_size[SIZE_2K]);
+ debugfs_create_atomic_t("io_tlb_total_4k", 0400, mem->debugfs,
+ &terrace.total_count_by_size[SIZE_4K]);
+ debugfs_create_atomic_t("io_tlb_total_16k", 0400, mem->debugfs,
+ &terrace.total_count_by_size[SIZE_16K]);
+ debugfs_create_atomic_t("io_tlb_total_512k", 0400, mem->debugfs,
+ &terrace.total_count_by_size[SIZE_512K]);
+}
+
+static int __init __maybe_unused swiotlb_create_default_debugfs(void)
+{
+ swiotlb_create_debugfs_files(&io_tlb_default_mem, "swiotlb");
+ return 0;
+}
+
+late_initcall(swiotlb_create_default_debugfs);
+#endif
+
+unsigned long swiotlb_size_or_default(void) { return 0; }
+void __init swiotlb_update_mem_attributes(void) {}
+void __init swiotlb_adjust_size(unsigned long size) {}
+void __init swiotlb_hint_cpus(int cpus) {}
diff --git a/kernel/dma/common-swiotlb.c b/kernel/dma/common-swiotlb.c
index d477d5f2a71b..553325d5c9cd 100644
--- a/kernel/dma/common-swiotlb.c
+++ b/kernel/dma/common-swiotlb.c
@@ -1,4 +1,5 @@
/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/cc_platform.h>
#include <linux/dma-direct.h>
#include <linux/swiotlb.h>

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 63859a101ed8..67a0ff973732 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -405,7 +405,7 @@ void dma_direct_free_pages(struct device *dev, size_t size,
}

#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \
- defined(CONFIG_SWIOTLB)
+ defined(CONFIG_SWIOTLB) || defined(CONFIG_CC_SWIOTLB)
void dma_direct_sync_sg_for_device(struct device *dev,
struct scatterlist *sgl, int nents, enum dma_data_direction dir)
{
@@ -428,7 +428,7 @@ void dma_direct_sync_sg_for_device(struct device *dev,

#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) || \
defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL) || \
- defined(CONFIG_SWIOTLB)
+ defined(CONFIG_SWIOTLB) || defined(CONFIG_CC_SWIOTLB)
void dma_direct_sync_sg_for_cpu(struct device *dev,
struct scatterlist *sgl, int nents, enum dma_data_direction dir)
{
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index e38ffc5e6bdd..e3050ae59839 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -23,7 +23,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
size_t dma_direct_max_mapping_size(struct device *dev);

#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \
- defined(CONFIG_SWIOTLB)
+ defined(CONFIG_SWIOTLB) || defined(CONFIG_CC_SWIOTLB)
void dma_direct_sync_sg_for_device(struct device *dev, struct scatterlist *sgl,
int nents, enum dma_data_direction dir);
#else
@@ -35,7 +35,7 @@ static inline void dma_direct_sync_sg_for_device(struct device *dev,

#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) || \
defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL) || \
- defined(CONFIG_SWIOTLB)
+ defined(CONFIG_SWIOTLB) || defined(CONFIG_CC_SWIOTLB)
void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
int nents, enum dma_data_direction dir, unsigned long attrs);
void dma_direct_sync_sg_for_cpu(struct device *dev,
--
2.31.1


2023-01-28 08:33:22

by Guorui Yu

Subject: [PATCH 3/4] swiotlb: Add tracepoint swiotlb_unbounced

Add a new tracepoint, swiotlb_unbounced, to facilitate gathering
statistics on swiotlb buffer usage.

Signed-off-by: GuoRui.Yu <[email protected]>
---
include/trace/events/swiotlb.h | 28 ++++++++++++++++++++++++++++
kernel/dma/common-swiotlb.c | 1 +
2 files changed, 29 insertions(+)

diff --git a/include/trace/events/swiotlb.h b/include/trace/events/swiotlb.h
index da05c9ebd224..0707adc34df5 100644
--- a/include/trace/events/swiotlb.h
+++ b/include/trace/events/swiotlb.h
@@ -35,6 +35,34 @@ TRACE_EVENT(swiotlb_bounced,
__entry->force ? "FORCE" : "NORMAL")
);

+TRACE_EVENT(swiotlb_unbounced,
+ TP_PROTO(struct device *dev, dma_addr_t dev_addr, size_t size),
+ TP_ARGS(dev, dev_addr, size),
+
+ TP_STRUCT__entry(
+ __string(dev_name, dev_name(dev))
+ __field(u64, dma_mask)
+ __field(dma_addr_t, dev_addr)
+ __field(size_t, size)
+ __field(bool, force)
+ ),
+
+ TP_fast_assign(
+ __assign_str(dev_name, dev_name(dev));
+ __entry->dma_mask = (dev->dma_mask ? *dev->dma_mask : 0);
+ __entry->dev_addr = dev_addr;
+ __entry->size = size;
+ __entry->force = is_swiotlb_force_bounce(dev);
+ ),
+
+ TP_printk("dev_name: %s dma_mask=%llx dev_addr=%llx size=%zu %s",
+ __get_str(dev_name),
+ __entry->dma_mask,
+ (unsigned long long)__entry->dev_addr,
+ __entry->size,
+ __entry->force ? "FORCE" : "NORMAL")
+);
+
#endif /* _TRACE_SWIOTLB_H */

/* This part must be outside protection */
diff --git a/kernel/dma/common-swiotlb.c b/kernel/dma/common-swiotlb.c
index 553325d5c9cd..e3676a8358f2 100644
--- a/kernel/dma/common-swiotlb.c
+++ b/kernel/dma/common-swiotlb.c
@@ -46,6 +46,7 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
size_t mapping_size, enum dma_data_direction dir,
unsigned long attrs)
{
+ trace_swiotlb_unbounced(dev, phys_to_dma(dev, tlb_addr), mapping_size);
/*
* First, sync the memory before unmapping the entry
*/
--
2.31.1


2023-01-28 08:33:36

by Guorui Yu

Subject: [PATCH 4/4] cc-swiotlb: Allow set swiotlb watermark from cmdline

Signed-off-by: GuoRui.Yu <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 11 ++++++
kernel/dma/cc-swiotlb.c | 34 ++++++++++++++++++-
2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6cfa6e3996cf..ad510e392998 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6090,6 +6090,17 @@
wouldn't be automatically used by the kernel
noforce -- Never use bounce buffers (for debugging)

+ cc_swiotlb= [X86]
+ Format: [<int>,<int>,<int>,<int>,<int>][,force]
+ Set the initial cc-swiotlb TLB buffer watermarks for Confidential Computing guests such as TDX and SEV-SNP.
+ <int> -- Number of I/O TLB slots of 512 bytes
+ <int> -- Number of I/O TLB slots of 2 KiB
+ <int> -- Number of I/O TLB slots of 4 KiB
+ <int> -- Number of I/O TLB slots of 16 KiB
+ <int> -- Number of I/O TLB slots of 512 KiB
+ force -- force use of cc bounce buffers even if they
+ wouldn't be automatically used by the kernel (for debugging)
+
switches= [HW,M68k]

sysctl.*= [KNL]
diff --git a/kernel/dma/cc-swiotlb.c b/kernel/dma/cc-swiotlb.c
index e85d3ce2272f..ff1cc52189b9 100644
--- a/kernel/dma/cc-swiotlb.c
+++ b/kernel/dma/cc-swiotlb.c
@@ -52,6 +52,32 @@ enum SLOT_SIZE {

const unsigned int SLOT_SIZES[NR_OF_SIZES] = {512, 2048, 4096, 16 * 1024, 512 * 1024};
unsigned int WATERMARK_SLOT_SIZE[NR_OF_SIZES] = {256, 16384, 8192, 1024, 32};
+static bool swiotlb_force_bounce = false;
+
+static int __init
+setup_io_tlb_watermark(char *str)
+{
+ int i;
+ size_t size;
+
+ for (i = 0; i < NR_OF_SIZES; i++) {
+ if (isdigit(*str)) {
+ size = simple_strtoul(str, &str, 0);
+ WATERMARK_SLOT_SIZE[i] = size;
+
+ if (*str == ',')
+ ++str;
+ }
+ else
+ break;
+ }
+
+ if (!strcmp(str, "force"))
+ swiotlb_force_bounce = true;
+
+ return 0;
+}
+early_param("cc_swiotlb", setup_io_tlb_watermark);

struct slot_terrace {
struct list_head slots_by_size[NR_OF_SIZES];
@@ -330,6 +356,7 @@ static void cc_populate_pages(void)
size_t i, j, k;
const size_t max_bytes_per_alloc = (MAX_ORDER_NR_PAGES << PAGE_SHIFT);
size_t remain_alloc_size, rounds, round_size, round_slot_nr;
+ size_t size = 0;

for (i = 0; i < NR_OF_SIZES; i++) {
if (atomic_read(&terrace.free_count_by_size[i]) > WATERMARK_SLOT_SIZE[i])
@@ -377,6 +404,11 @@ static void cc_populate_pages(void)
spin_unlock_irqrestore(&terrace.lock_by_size[i], flags);
}
}
+
+ for (i = 0; i < NR_OF_SIZES; i++) {
+ size += (size_t)atomic_read(&terrace.total_count_by_size[i]) * SLOT_SIZES[i];
+ }
+ pr_info("bounce buffer size adjusted to %luMB", size >> 20);
}

static int kccd(void *p)
@@ -437,7 +469,7 @@ void __init swiotlb_init(bool addressing_limit, unsigned int flags)
return;
}

- io_tlb_default_mem.force_bounce = cc_platform_has(CC_ATTR_MEM_ENCRYPT);
+ io_tlb_default_mem.force_bounce = cc_platform_has(CC_ATTR_MEM_ENCRYPT) || swiotlb_force_bounce;

xa_init_flags(io_tlb_default_mem.mapping, 0);

--
2.31.1


2023-01-28 09:03:14

by Guorui Yu

Subject: Re: [RFC] swiotlb: Add a new cc-swiotlb implementation for Confidential VMs

+ Marek Szyprowski and Robin Murphy.

On 2023/1/28 16:32, GuoRui.Yu wrote:
> This patch series adds a new swiotlb implementation, cc-swiotlb, for
> Confidential VMs (such as TDX and SEV-SNP). The new cc-swiotlb allocates
> the DMA TLB buffer dynamically in runtime instead of allocating at boot
> with a fixed size. Furthermore, future optimization and security
> enhancement could be applied on cc-swiotlb without "infecting" the
> legacy swiotlb.
>
> Background
> ==========
> Under COnfidential COmputing (CoCo) scenarios, the VMM cannot access
> guest memory directly but requires the guest to explicitly mark the
> memory as shared (decrypted). To make the streaming DMA mappings work,
> the current implementation relies on legacy SWIOTLB to bounce the DMA
> buffer between private (encrypted) and shared (decrypted) memory.
>
> However, the legacy swiotlb is designed for compatibility rather than
> efficiency and CoCo purpose, which will inevitably introduce some
> unnecessary restrictions.
>
> 1. Fixed immutable swiotlb size cannot accommodate the requirements of
> multiple devices. And 1GiB (current maximum size) of swiotlb in our
> testbed cannot afford multiple disks reads/writes simultaneously.
>
> 2. Fixed immutable IO_TLB_SIZE (2KiB) cannot satisfy various kinds of
> devices. At the moment, the minimal size of a swiotlb buffer is 2KiB,
> which will waste memory on small network packets (under 256 bytes) and
> decrease efficiency on a large block (up to 256KiB) size reads/writes of
> disks. And it is hard to have a trade-off on legacy swiotlb to rule them
> all.
>
> 3. The legacy swiotlb cannot efficiently support larger swiotlb buffers.
> In the worst case, the current implementation requires a full scan of
> the entire swiotlb buffer, which can cause severe performance hits.
>
> Changes in this patch set
> =========================
> Instead of keeping "infecting" the legacy swiotlb code with CoCo logic,
> this patch tries to introduce a new cc-swiotlb for Confidential VMs.
>
> Confidential VMs usually have reasonable modern devices (virtio devices,
> NVME, etc.), which can access memory above 4GiB, cc-swiotlb could
> allocate TLB buffers at any position dynamically. Since
> set_memory_{decrypted,encrypted} is time-consuming and cannot be used in
> interrupt context, a new kernel thread "kccd" has been added to populate
> new TLB buffers on demand, which solves problem 1.
>
> In addition, the cc-swiotlb manages TLB buffers by different sizes
> (512B, 2KiB, 4KiB, 16KiB, and 512KiB). The above values come from the
> following observations (boot with 8core, 32 GiB, 1 nvme disk, and 1
> virtio-net):
> - Allocations of 512 bytes and below account for 3.5% of the total DMA
> cache allocations;
> - Allocations of 2 KiB and below account for 57.7%;
> - Allocations of 4 KiB and below account for 91.3%;
> - Allocations of 16 KiB and below account for 96.0%;
> - Allocations of 512 KiB and below accounted for 100%;
> - At the end of booting, cc-swiotlb uses 288 MiB in total.
>
> For comparison, legacy swiotlb reserves memory at 6%, which requires
> min(1GiB, 32GiB * 0.06) = 1GiB, and will hang when operating multiple
> disks simultaneously because the swiotlb buffer runs out of memory.
>
> These patches were tested with fio (using different iodepth and block
> size) on a platform with 96 cores, 384 GiB, and 20 NVME disks, and no IO
> hang or error was observed.
>
> For simplicity, the current RFC version cannot switch between the legacy
> and new implementations via cmdline; the choice is made through compile
> options. I am open to discussing how to integrate cc-swiotlb into the legacy one.
>
> Patch Organization
> ==================
> - swiotlb: Split common code from swiotlb.{c,h}
> - swiotlb: Add a new cc-swiotlb implementation for Confidential VMs
> - swiotlb: Add tracepoint swiotlb_unbounced
> - cc-swiotlb: Allow set swiotlb watermark from cmdline
>
> Thanks for your time!
>
> Have a nice day,
> Guorui
>

2023-01-28 12:03:38

by kernel test robot

Subject: Re: [PATCH 2/4] swiotlb: Add a new cc-swiotlb implementation for Confidential VMs

Hi GuoRui.Yu,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/x86/core]
[also build test WARNING on joro-iommu/next driver-core/driver-core-testing driver-core/driver-core-next driver-core/driver-core-linus linus/master v6.2-rc5 next-20230127]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/GuoRui-Yu/swiotlb-Add-a-new-cc-swiotlb-implementation-for-Confidential-VMs/20230128-181154
patch link: https://lore.kernel.org/r/20230128083254.86012-3-GuoRui.Yu%40linux.alibaba.com
patch subject: [PATCH 2/4] swiotlb: Add a new cc-swiotlb implementation for Confidential VMs
config: alpha-allyesconfig (https://download.01.org/0day-ci/archive/20230128/[email protected]/config)
compiler: alpha-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/9345d2c75b5994cec46ad0abf12f01ed91742e81
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review GuoRui-Yu/swiotlb-Add-a-new-cc-swiotlb-implementation-for-Confidential-VMs/20230128-181154
git checkout 9345d2c75b5994cec46ad0abf12f01ed91742e81
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=alpha olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=alpha SHELL=/bin/bash kernel/dma/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

kernel/dma/cc-swiotlb.c: In function 'swiotlb_bounce':
>> kernel/dma/cc-swiotlb.c:174:32: warning: variable 'tlb_end' set but not used [-Wunused-but-set-variable]
174 | phys_addr_t orig_addr, tlb_end, slot_start, slot_end = tlb_addr ;
| ^~~~~~~
kernel/dma/cc-swiotlb.c: At top level:
>> kernel/dma/cc-swiotlb.c:492:13: warning: no previous prototype for 'swiotlb_hint_cpus' [-Wmissing-prototypes]
492 | void __init swiotlb_hint_cpus(int cpus) {}
| ^~~~~~~~~~~~~~~~~


vim +/tlb_end +174 kernel/dma/cc-swiotlb.c

167
168 /*
169 * Bounce: copy the swiotlb buffer from or back to the original dma location
170 */
171 void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
172 enum dma_data_direction dir)
173 {
> 174 phys_addr_t orig_addr, tlb_end, slot_start, slot_end = tlb_addr ;
175 struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
176 struct io_tlb_slot *slot;
177 unsigned long pfn;
178 void *backing_addr;
179 size_t alloc_size;
180
181 slot = xa_find_after(mem->mapping, (unsigned long*)&slot_end, ULONG_MAX, XA_PRESENT);
182 if (!slot) {
183 dev_WARN_ONCE(dev, 1, "TLB buffer not found.\n");
184 return;
185 }
186
187 orig_addr = slot->orig_addr;
188 alloc_size = slot->alloc_size;
189 slot_start = slot_end - slot->orig_size;
190 tlb_end = tlb_addr + size;
191
192 pfn = PFN_DOWN(orig_addr);
193 if (PageHighMem(pfn_to_page(pfn))) {
194 dev_WARN_ONCE(dev, 1, "Unexpected high memory.\n");
195 return;
196 }
197
198 if (size > alloc_size) {
199 dev_WARN_ONCE(dev, 1,
200 "Buffer overflow detected. Allocation size: %zu. Mapping size: %zu.\n",
201 alloc_size, size);
202 size = alloc_size;
203 }
204
205 backing_addr = slot->va + (tlb_addr - slot_start);
206
207 if (dir == DMA_TO_DEVICE) {
208 memcpy(backing_addr, phys_to_virt(orig_addr), size);
209 } else {
210 memcpy(phys_to_virt(orig_addr), backing_addr, size);
211 }
212 }
213

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-01-28 16:41:48

by Randy Dunlap

Subject: Re: [PATCH 2/4] swiotlb: Add a new cc-swiotlb implementation for Confidential VMs

Hi--

On 1/28/23 00:32, GuoRui.Yu wrote:
> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> index 56866aaa2ae1..7e6b20d2091f 100644
> --- a/kernel/dma/Kconfig
> +++ b/kernel/dma/Kconfig
> @@ -78,8 +78,18 @@ config ARCH_HAS_FORCE_DMA_UNENCRYPTED
>
> config SWIOTLB
> bool
> + depends on !CC_SWIOTLB
> select NEED_DMA_MAP_STATE
>
> +config CC_SWIOTLB
> + bool "Enable cc-swiotlb for Confidential VMs"
> + default n
> + select NEED_DMA_MAP_STATE
> + help
> + This enables a cc-swiotlb implementation for Confidential VMs,
> + which allows allocating the SWIOTLB buffer allocation on runtime.

Two "allocat..." words seems to be redundant. Probably the second one
can be dropped.
Also, instead of "on runtime", how about "at runtime"?

> + If unsure, say "n".

Thanks.
--
~Randy

2023-01-28 20:20:59

by kernel test robot

Subject: Re: [PATCH 4/4] cc-swiotlb: Allow set swiotlb watermark from cmdline

Hi GuoRui.Yu,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/x86/core]
[also build test WARNING on joro-iommu/next driver-core/driver-core-testing driver-core/driver-core-next driver-core/driver-core-linus linus/master v6.2-rc5 next-20230127]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/GuoRui-Yu/swiotlb-Add-a-new-cc-swiotlb-implementation-for-Confidential-VMs/20230128-181154
patch link: https://lore.kernel.org/r/20230128083254.86012-5-GuoRui.Yu%40linux.alibaba.com
patch subject: [PATCH 4/4] cc-swiotlb: Allow set swiotlb watermark from cmdline
config: nios2-buildonly-randconfig-r002-20230129 (https://download.01.org/0day-ci/archive/20230129/[email protected]/config)
compiler: nios2-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/fcbd538c875f5f1086682829671935d7155e4029
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review GuoRui-Yu/swiotlb-Add-a-new-cc-swiotlb-implementation-for-Confidential-VMs/20230128-181154
git checkout fcbd538c875f5f1086682829671935d7155e4029
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=nios2 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=nios2 SHELL=/bin/bash kernel/dma/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

kernel/dma/cc-swiotlb.c: In function 'swiotlb_bounce':
kernel/dma/cc-swiotlb.c:200:32: warning: variable 'tlb_end' set but not used [-Wunused-but-set-variable]
200 | phys_addr_t orig_addr, tlb_end, slot_start, slot_end = tlb_addr ;
| ^~~~~~~
In file included from kernel/dma/cc-swiotlb.c:13:
kernel/dma/cc-swiotlb.c: In function 'cc_populate_pages':
>> include/linux/kern_levels.h:5:25: warning: format '%lu' expects argument of type 'long unsigned int', but argument 2 has type 'size_t' {aka 'unsigned int'} [-Wformat=]
5 | #define KERN_SOH "\001" /* ASCII Start Of Header */
| ^~~~~~
include/linux/printk.h:429:25: note: in definition of macro 'printk_index_wrap'
429 | _p_func(_fmt, ##__VA_ARGS__); \
| ^~~~
include/linux/printk.h:530:9: note: in expansion of macro 'printk'
530 | printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
| ^~~~~~
include/linux/kern_levels.h:14:25: note: in expansion of macro 'KERN_SOH'
14 | #define KERN_INFO KERN_SOH "6" /* informational */
| ^~~~~~~~
include/linux/printk.h:530:16: note: in expansion of macro 'KERN_INFO'
530 | printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
| ^~~~~~~~~
kernel/dma/cc-swiotlb.c:411:9: note: in expansion of macro 'pr_info'
411 | pr_info("bounce buffer size adjusted to %luMB", size >> 20);
| ^~~~~~~
kernel/dma/cc-swiotlb.c: At top level:
kernel/dma/cc-swiotlb.c:524:13: warning: no previous prototype for 'swiotlb_hint_cpus' [-Wmissing-prototypes]
524 | void __init swiotlb_hint_cpus(int cpus) {}
| ^~~~~~~~~~~~~~~~~


vim +5 include/linux/kern_levels.h

314ba3520e513a Joe Perches 2012-07-30 4
04d2c8c83d0e3a Joe Perches 2012-07-30 @5 #define KERN_SOH "\001" /* ASCII Start Of Header */
04d2c8c83d0e3a Joe Perches 2012-07-30 6 #define KERN_SOH_ASCII '\001'
04d2c8c83d0e3a Joe Perches 2012-07-30 7

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-01-29 01:54:46

by Guorui Yu

Subject: Re: [PATCH 2/4] swiotlb: Add a new cc-swiotlb implementation for Confidential VMs

Hi Randy,

On 2023/1/29 00:41, Randy Dunlap wrote:
> Hi--
>
> On 1/28/23 00:32, GuoRui.Yu wrote:
>> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
>> index 56866aaa2ae1..7e6b20d2091f 100644
>> --- a/kernel/dma/Kconfig
>> +++ b/kernel/dma/Kconfig
>> @@ -78,8 +78,18 @@ config ARCH_HAS_FORCE_DMA_UNENCRYPTED
>>
>> config SWIOTLB
>> bool
>> + depends on !CC_SWIOTLB
>> select NEED_DMA_MAP_STATE
>>
>> +config CC_SWIOTLB
>> + bool "Enable cc-swiotlb for Confidential VMs"
>> + default n
>> + select NEED_DMA_MAP_STATE
>> + help
>> + This enables a cc-swiotlb implementation for Confidential VMs,
>> + which allows allocating the SWIOTLB buffer allocation on runtime.
>
> Two "allocat..." words seems to be redundant. Probably the second one
> can be dropped.
> Also, instead of "on runtime", how about "at runtime"?

Thanks for your kindly advice, and I will fix it in the next version.

>
>> + If unsure, say "n".
>
> Thanks.

Sincerely,
Guorui

2023-01-29 16:58:59

by Andi Kleen

Subject: Re: [PATCH 2/4] swiotlb: Add a new cc-swiotlb implementation for Confidential VMs


On 1/28/2023 12:32 AM, GuoRui.Yu wrote:
> Under COnfidential COmputing (CoCo) scenarios, the VMM cannot access
> guest memory directly but requires the guest to explicitly mark the
> memory as shared (decrypted). To make the streaming DMA mappings work,
> the current implementation relies on legacy SWIOTLB to bounce the DMA
> buffer between private (encrypted) and shared (decrypted) memory.
>
> However, the legacy swiotlb is designed for compatibility rather than
> efficiency and CoCo purpose, which will inevitably introduce some
> unnecessary restrictions.
> 1. Fixed immutable swiotlb size cannot accommodate the requirements of
> multiple devices. And 1GiB (current maximum size) of swiotlb in our
> testbed cannot afford multiple disks reads/writes simultaneously.
> 2. Fixed immutable IO_TLB_SIZE (2KiB) cannot satisfy various kinds of
> devices. At the moment, the minimal size of a swiotlb buffer is 2KiB,
> which will waste memory on small network packets (under 512 bytes)
> and decrease efficiency on a large block (up to 256KiB) size
> reads/writes of disks. And it is hard to have a trade-off on legacy
> swiotlb to rule them all.
> 3. The legacy swiotlb cannot efficiently support larger swiotlb buffers.
> In the worst case, the current implementation requires a full scan of
> the entire swiotlb buffer, which can cause severe performance hits.
>
> Instead of keeping "infecting" the legacy swiotlb code with CoCo logic,
> this patch tries to introduce a new cc-swiotlb for Confidential VMs.
>
> Confidential VMs usually have reasonable modern devices (virtio devices,
> NVME, etc.), which can access memory above 4GiB, cc-swiotlb could
> allocate TLB buffers dynamically on-demand, and this design solves
> problem 1.

When you say solving you mean support for growing the size dynamically
without pre-allocation?

The IOMMU is traditionally called in non preemptible regions in drivers,
and also allocating memory in IO paths is still not considered fully
safe due to potential deadlocks. Both makes it difficult to allocate
large memory regions dynamically.

It's not clear how you would solve that?

-Andi


2023-01-30 02:25:16

by Guorui Yu

Subject: Re: [PATCH 2/4] swiotlb: Add a new cc-swiotlb implementation for Confidential VMs

On 2023/1/30 00:58, Andi Kleen wrote:
>
> On 1/28/2023 12:32 AM, GuoRui.Yu wrote:
>> Under COnfidential COmputing (CoCo) scenarios, the VMM cannot access
>> guest memory directly but requires the guest to explicitly mark the
>> memory as shared (decrypted). To make the streaming DMA mappings work,
>> the current implementation relies on legacy SWIOTLB to bounce the DMA
>> buffer between private (encrypted) and shared (decrypted) memory.
>>
>> However, the legacy swiotlb is designed for compatibility rather than
>> efficiency and CoCo purpose, which will inevitably introduce some
>> unnecessary restrictions.
>> 1. Fixed immutable swiotlb size cannot accommodate the requirements of
>>     multiple devices. And 1GiB (current maximum size) of swiotlb in our
>>     testbed cannot afford multiple disks reads/writes simultaneously.
>> 2. Fixed immutable IO_TLB_SIZE (2KiB) cannot satisfy various kinds of
>>     devices. At the moment, the minimal size of a swiotlb buffer is 2KiB,
>>     which will waste memory on small network packets (under 512 bytes)
>>     and decrease efficiency on a large block (up to 256KiB) size
>>     reads/writes of disks. And it is hard to have a trade-off on legacy
>>     swiotlb to rule them all.
>> 3. The legacy swiotlb cannot efficiently support larger swiotlb buffers.
>>     In the worst case, the current implementation requires a full scan of
>>     the entire swiotlb buffer, which can cause severe performance hits.
>>
>> Instead of keeping "infecting" the legacy swiotlb code with CoCo logic,
>> this patch tries to introduce a new cc-swiotlb for Confidential VMs.
>>
>> Confidential VMs usually have reasonable modern devices (virtio devices,
>> NVME, etc.), which can access memory above 4GiB, cc-swiotlb could
>> allocate TLB buffers dynamically on-demand, and this design solves
>> problem 1.
>
> When you say solving you mean support for growing the size dynamically
> without pre-allocation?
>
> The IOMMU is traditionally called in non preemptible regions in drivers,
> and also allocating memory in IO paths is still not considered fully
> safe due to potential deadlocks. Both makes it difficult to allocate
> large memory regions dynamically.
>
> It's not clear how you would solve that?
>
> -Andi

Hi Andi,

Thanks for your question!

I try to solve this problem by creating a new kernel thread, "kccd", to
populate the TLB buffer in the background.

Specifically,
1. A new kernel thread is created with the help of "arch_initcall", and
this kthread is responsible for memory allocation and setting memory
attributes (private or shared);
2. The "swiotlb_tbl_map_single" routine only uses the spin_lock protected
TLB buffers pre-allocated by the kthread;
a) which actually includes ONE memory allocation brought by xarray
insertion "__xa_insert__".
3. After each allocation, the water level of TLB resources will be
checked. If the current TLB resources are found to be lower than the
preset value (half of the watermark), the kthread will be awakened to
fill them.
4. The TLB buffer allocation in the kthread is batched to
"(MAX_ORDER_NR_PAGES << PAGE_SHIFT)" to reduce the hold time of the
spin_lock and the number of calls to set_memory_decrypted().

Thanks,
Guorui

2023-01-30 06:46:45

by Andi Kleen

Subject: Re: [PATCH 2/4] swiotlb: Add a new cc-swiotlb implementation for Confidential VMs


> I try to solve this problem by creating a new kernel thread, "kccd",
> to populate the TLB buffer in the background.
>
> Specifically,
> 1. A new kernel thread is created with the help of "arch_initcall",
> and this kthread is responsible for memory allocation and setting
> memory attributes (private or shared);
> 2. The "swiotlb_tbl_map_single" routine only uses the spin_lock
> protected TLB buffers pre-allocated by the kthread;
>   a) which actually includes ONE memory allocation brought by xarray
> insertion "__xa_insert__".

That already seems dangerous with all the usual problems of memory
allocations in IO paths. Normally code at least uses a mempool to avoid
the worst dead lock potential.

> 3. After each allocation, the water level of TLB resources will be
> checked. If the current TLB resources are found to be lower than the
> preset value (half of the watermark), the kthread will be awakened to
> fill them.
> 4. The TLB buffer allocation in the kthread is batched to
> "(MAX_ORDER_NR_PAGES << PAGE_SHIFT)" to reduce the hold time of the
> spin_lock and the number of calls to set_memory_decrypted().

Okay, but does this guarantee that it will never run out of memory?

It seems difficult to make such guarantees. What happens for example if
the background thread gets starved by something higher priority?

Or if the allocators have such high bandwidth that they can overwhelm
any reasonable background thread.

-Andi



2023-01-30 06:54:57

by Christoph Hellwig

Subject: Re: [RFC] swiotlb: Add a new cc-swiotlb implementation for Confidential VMs

Just as the last time around: stop duplicating code pointlessly. NAK.


2023-01-30 13:03:41

by Robin Murphy

Subject: Re: [RFC] swiotlb: Add a new cc-swiotlb implementation for Confidential VMs

On 2023-01-28 08:32, GuoRui.Yu wrote:
> This patch series adds a new swiotlb implementation, cc-swiotlb, for
> Confidential VMs (such as TDX and SEV-SNP). The new cc-swiotlb allocates
> the DMA TLB buffer dynamically in runtime instead of allocating at boot
> with a fixed size. Furthermore, future optimization and security
> enhancement could be applied on cc-swiotlb without "infecting" the
> legacy swiotlb.
>
> Background
> ==========
> Under COnfidential COmputing (CoCo) scenarios, the VMM cannot access
> guest memory directly but requires the guest to explicitly mark the
> memory as shared (decrypted). To make the streaming DMA mappings work,
> the current implementation relies on the legacy SWIOTLB to bounce the DMA
> buffer between private (encrypted) and shared (decrypted) memory.
>
> However, the legacy swiotlb is designed for compatibility rather than
> efficiency and CoCo purpose, which will inevitably introduce some
> unnecessary restrictions.
>
> 1. Fixed immutable swiotlb size cannot accommodate the requirements of
> multiple devices, and 1GiB (the current maximum size) of swiotlb in our
> testbed cannot sustain simultaneous reads/writes on multiple disks.

That's not a very logical argument - if a particular use-case needs a
particular total amount of SWIOTLB capacity, then that's how much it
needs. Whether that capacity is allocated up-front or allocated
gradually over time doesn't affect that. The obvious solution to this
issue as presented is "make the maximum size bigger", not "add a whole
other SWIOTLB implementation".

> 2. The fixed, immutable IO_TLB_SIZE (2KiB) cannot satisfy various kinds of
> devices. At the moment, the minimum size of a swiotlb buffer is 2KiB,
> which wastes memory on small network packets (under 256 bytes) and
> decreases efficiency on large-block (up to 256KiB) disk reads/writes.
> It is hard to find one trade-off in the legacy swiotlb that rules them
> all.

That's clearly a general argument, so why should any improvement be
arbitrarily restricted to Confidential Compute scenarios?

> 3. The legacy swiotlb cannot efficiently support larger swiotlb buffers.
> In the worst case, the current implementation requires a full scan of
> the entire swiotlb buffer, which can cause severe performance hits.

Isn't that the main reason we just recently introduced the multiple area
stuff?

Robin.

> Changes in this patch set
> =========================
> Instead of keeping "infecting" the legacy swiotlb code with CoCo logic,
> this patch tries to introduce a new cc-swiotlb for Confidential VMs.
>
> Confidential VMs usually have reasonably modern devices (virtio devices,
> NVME, etc.) which can access memory above 4GiB, so cc-swiotlb can
> allocate TLB buffers at any position dynamically. Since
> set_memory_{decrypted,encrypted} is time-consuming and cannot be used in
> interrupt context, a new kernel thread "kccd" has been added to populate
> new TLB buffers on-demand, which solves problem 1.
>
> In addition, the cc-swiotlb manages TLB buffers by different sizes
> (512B, 2KiB, 4KiB, 16KiB, and 512KiB). The above values come from the
> following observations (boot with 8core, 32 GiB, 1 nvme disk, and 1
> virtio-net):
> - Allocations of 512 bytes and below account for 3.5% of the total DMA
> cache allocations;
> - Allocations of 2 KiB and below account for 57.7%;
> - Allocations of 4 KiB and below account for 91.3%;
> - Allocations of 16 KiB and below account for 96.0%;
> - Allocations of 512 KiB and below accounted for 100%;
> - At the end of booting, cc-swiotlb uses 288 MiB in total.
>
> For comparison, legacy swiotlb reserves memory at 6%, which requires
> min(1GiB, 32GiB * 0.06) = 1GiB, and will hang when operating multiple
> disks simultaneously due to no memory for the swiotlb buffer.
>
> These patches were tested with fio (using different iodepth and block
> size) on a platform with 96 cores, 384 GiB, and 20 NVME disks, and no IO
> hang or error was observed.
>
> For simplicity, the current RFC version cannot switch between the legacy
> implementation and cc-swiotlb via cmdline, only through compile options.
> I am open to discussing how to integrate cc-swiotlb into the legacy one.
>
> Patch Organization
> ==================
> - swiotlb: Split common code from swiotlb.{c,h}
> - swiotlb: Add a new cc-swiotlb implementation for Confidential VMs
> - swiotlb: Add tracepoint swiotlb_unbounced
> - cc-swiotlb: Allow set swiotlb watermark from cmdline
>
> Thanks for your time!
>
> Have a nice day,
> Guorui
>
>

2023-01-30 13:45:32

by Guorui Yu

[permalink] [raw]
Subject: Re: [PATCH 2/4] swiotlb: Add a new cc-swiotlb implementation for Confidential VMs

Hi Andi,

On 2023/1/30 14:46, Andi Kleen wrote:
>
>> I try to solve this problem by creating a new kernel thread, "kccd",
>> to populate the TLB buffer in the background.
>>
>> Specifically,
>> 1. A new kernel thread is created with the help of "arch_initcall",
>> and this kthread is responsible for memory allocation and setting
>> memory attributes (private or shared);
>> 2. The "swiotlb_tbl_map_single" routine only uses the spin_lock-protected
>> TLB buffers pre-allocated by the kthread;
>>   a) which actually includes ONE memory allocation brought by xarray
>> insertion "__xa_insert__".
>
> That already seems dangerous with all the usual problems of memory
> allocations in IO paths. Normally code at least uses a mempool to avoid
> the worst dead lock potential.
>

The __xa_insert__ is called with GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH),
and I will dig into this to check whether there is any chance of a
deadlock.

I also tried my best to test this piece of code, and no issues have been
found at up to 700,000 IOPS.

Thanks for your advice on this point; I had not noticed this
possibility.

>> 3. After each allocation, the water level of TLB resources will be
>> checked. If the current TLB resources are found to be lower than the
>> preset value (half of the watermark), the kthread will be awakened to
>> fill them.
>> 4. The TLB buffer allocation in the kthread is batched to
>> "(MAX_ORDER_NR_PAGES << PAGE_SHIFT)" to reduce the holding time of
>> spin_lock and number of calls to set_memory_decrypted().
>
> Okay, but does this guarantee that it will never run out of memory?
>
> It seems difficult to make such guarantees. What happens for example if
> the background thread gets starved by something higher priority?
>
No, this cannot guarantee we always have sufficient TLB caches, so we
can also have a "No memory for cc-swiotlb buffer" warning.

But I want to emphasize that in this case, the current implementation is
no worse than the legacy implementation. Moreover, dynamic TLB
allocation is more suitable for situations where more disks/network
devices will be hotplugged, in which case you cannot pre-set a
reasonable value.

> Or if the allocators have such high bandwidth that they can overwhelm
> any reasonable background thread.
>
> -Andi
>

Sincerely,
Guorui

2023-01-30 14:38:14

by Guorui Yu

[permalink] [raw]
Subject: Re: [RFC] swiotlb: Add a new cc-swiotlb implementation for Confidential VMs

Hi Robin,

On 2023/1/30 21:03, Robin Murphy wrote:
> On 2023-01-28 08:32, GuoRui.Yu wrote:
>> This patch series adds a new swiotlb implementation, cc-swiotlb, for
>> Confidential VMs (such as TDX and SEV-SNP). The new cc-swiotlb allocates
>> the DMA TLB buffer dynamically in runtime instead of allocating at boot
>> with a fixed size. Furthermore, future optimization and security
>> enhancement could be applied on cc-swiotlb without "infecting" the
>> legacy swiotlb.
>>
>> Background
>> ==========
>> Under COnfidential COmputing (CoCo) scenarios, the VMM cannot access
>> guest memory directly but requires the guest to explicitly mark the
>> memory as shared (decrypted). To make the streaming DMA mappings work,
>> the current implementation relies on the legacy SWIOTLB to bounce the DMA
>> buffer between private (encrypted) and shared (decrypted) memory.
>>
>> However, the legacy swiotlb is designed for compatibility rather than
>> efficiency and CoCo purpose, which will inevitably introduce some
>> unnecessary restrictions.
>>
>> 1. Fixed immutable swiotlb size cannot accommodate the requirements of
>> multiple devices, and 1GiB (the current maximum size) of swiotlb in our
>> testbed cannot sustain simultaneous reads/writes on multiple disks.
>
> That's not a very logical argument - if a particular use-case needs a
> particular total amount of SWIOTLB capacity, then that's how much it
> needs. Whether that capacity is allocated up-front or allocated
> gradually over time doesn't affect that. The obvious solution to this
> issue as presented is "make the maximum size bigger", not "add a whole
> other SWIOTLB implementation".
>

The reason is all about hotplugged devices.

In public cloud scenarios, customers often need to hotplug disks. The
number of disks can reach about 50, and the maximum IOPS of each disk
may reach 1 million. It is difficult for the user to estimate the
SWIOTLB size needed when the system starts. The same story also applies
to NIC devices.

>> 2. The fixed, immutable IO_TLB_SIZE (2KiB) cannot satisfy various kinds of
>> devices. At the moment, the minimum size of a swiotlb buffer is 2KiB,
>> which wastes memory on small network packets (under 256 bytes) and
>> decreases efficiency on large-block (up to 256KiB) disk reads/writes.
>> It is hard to find one trade-off in the legacy swiotlb that rules them
>> all.
>
> That's clearly a general argument, so why should any improvement be
> arbitrarily restricted to Confidential Compute scenarios?
>
My idea is that other scenarios can generally support modern high-speed
devices without swiotlb.

But almost all DMA allocations in CoCo scenarios need to go through
swiotlb to bounce the DMA buffer between private and shared memory,
which makes this problem more prominent in CoCo scenarios.

I guess the same logic may also apply to Xen, but I am not sure, so it
is a CoCo-limited feature for now.

>> 3. The legacy swiotlb cannot efficiently support larger swiotlb buffers.
>> In the worst case, the current implementation requires a full scan of
>> the entire swiotlb buffer, which can cause severe performance hits.
>
> Isn't that the main reason we just recently introduced the multiple area
> stuff?
>

Yes, this issue is partially mitigated by the multiple-area stuff, but I
am actually facing some issues with the current scanning logic (for
example, it keeps scanning the TLB buffer forever and causes soft
lockups[1]). I will keep tracking this issue to give you more details.

But my more intuitive thought is: why do we need such a complex
allocation algorithm? If we only supported CoCo scenarios (or maybe
Xen), we might be able to have a more deterministic algorithm instead of
waiting for the results of the scan.

[1] some details about the soft lockup:

$ cat ./test.sh
count=0
while true; do
    for i in {1..21}; do
        mkfs.xfs -f /dev/nvme${i}n1 &> /dev/null &
    done
    wait
    count=$((count+1))
    echo $count
done

$ ./test.sh # It may take some time...

$ # The system is not fully operational from here...

$ echo l > /proc/sysrq-trigger
...

Every other CPU is waiting for this core to exit the
swiotlb_do_find_slots loop.

[10199.924385] NMI backtrace for cpu 49
[10199.924386] CPU: 49 PID: 951 Comm: kworker/49:1H Kdump: loaded
Tainted: G EL 6.2.0-rc6-swiotlb-upstream-bug-repreduce+ #70
[10199.924388] Workqueue: kblockd blk_mq_run_work_fn
[10199.924391] RIP: 0010:swiotlb_do_find_slots+0x1fe/0x3e0
[10199.924393] Code: 77 21 48 8b 54 24 28 44 8b 7c 24 38 4c 8b 52 48 48
8d 14 49 48 c1 e2 03 45 39 7c 12 10 0f 83 f7 00 00 00 01 e8 bb 00 00 00
00 <41> 39 c0 0f 46 c3 39 f8 0f 85 4f ff ff f
f 48 8b 74 24 40 48 8b 7c
[10199.924394] RSP: 0018:ffffc90003913bb0 EFLAGS: 00000002
[10199.924395] RAX: 00000000000000e5 RBX: 0000000000000000 RCX:
00000000001880e1
[10199.924396] RDX: 00000000024c1518 RSI: 0000000000188000 RDI:
000000000000030c
[10199.924397] RBP: 0000000000000004 R08: 0000000000008000 R09:
0000000000000002
[10199.924397] R10: ffff88dad2400000 R11: 00000000001fffff R12:
00000000d8400000
[10199.924398] R13: 000000012cc2a800 R14: 0000000000200000 R15:
0000000000000002
[10199.924399] FS: 0000000000000000(0000) GS:ffff88dce5040000(0000)
knlGS:0000000000000000
[10199.924400] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10199.924401] CR2: 00007f8eb9dece80 CR3: 000000000640a004 CR4:
0000000000770ee0
[10199.924401] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[10199.924402] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
0000000000000400
[10199.924403] PKRU: 55555554
[10199.924403] Call Trace:
[10199.924404] <TASK>
[10199.924405] swiotlb_tbl_map_single+0xec/0x1f0
[10199.924407] swiotlb_map+0x5c/0x260
[10199.924409] ? nvme_pci_setup_prps+0x1ed/0x340
[10199.924411] dma_direct_map_page+0x12e/0x1c0
[10199.924413] nvme_map_data+0x304/0x370
[10199.924415] nvme_prep_rq.part.0+0x31/0x120
[10199.924417] nvme_queue_rq+0x77/0x1f0
[10199.924420] blk_mq_dispatch_rq_list+0x17e/0x670
[10199.924422] __blk_mq_sched_dispatch_requests+0x129/0x140
[10199.924424] blk_mq_sched_dispatch_requests+0x34/0x60
[10199.924426] __blk_mq_run_hw_queue+0x91/0xb0
[10199.924428] process_one_work+0x1df/0x3b0
[10199.924430] worker_thread+0x49/0x2e0
[10199.924432] ? rescuer_thread+0x390/0x390
[10199.924433] kthread+0xe5/0x110
[10199.924435] ? kthread_complete_and_exit+0x20/0x20
[10199.924436] ret_from_fork+0x1f/0x30
[10199.924439] </TASK>

...
[ 9639.596311] NMI backtrace for cpu 48
[ 9639.596313] CPU: 48 PID: 1215 Comm: kworker/48:1H Kdump: loaded
Tainted: G E 6.2.0-rc6-swiotlb-upstream-bug-repreduce+ #70
[ 9639.596315] Workqueue: kblockd blk_mq_run_work_fn
[ 9639.596319] RIP: 0010:native_queued_spin_lock_slowpath+0x18/0x2c0
[ 9639.596322] Code: 75 02 5a c3 56 0f b6 f0 e8 c5 ff ff ff 5e 5a c3 66
90 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 53 66 90 ba 01 00 00 00 8b 45
00 <85> c0 75 10 f0 0f b1 55 00 85 c0 75 f0 5
b 5d 41 5c 41 5d c3 f3 90
[ 9639.596323] RSP: 0018:ffffc90003e5fb38 EFLAGS: 00000002
[ 9639.596325] RAX: 0000000000000001 RBX: ffff88de5e459adc RCX:
0000000000001000
[ 9639.596326] RDX: 0000000000000001 RSI: 0000000000000001 RDI:
ffff88de5e459adc
[ 9639.596327] RBP: ffff88de5e459adc R08: 0000000000000000 R09:
ffff888102ab6090
[ 9639.596328] R10: ffff88dad2400000 R11: 00000000001fffff R12:
0000000000000293
[ 9639.596329] R13: 000000000730d000 R14: 0000000000200000 R15:
ffffffff83540160
[ 9639.596330] FS: 0000000000000000(0000) GS:ffff88dce5000000(0000)
knlGS:0000000000000000
[ 9639.596331] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9639.596332] CR2: 00007fc091da0000 CR3: 000000000640a002 CR4:
0000000000770ee0
[ 9639.596334] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 9639.596334] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
0000000000000400
[ 9639.596335] PKRU: 55555554
[ 9639.596336] Call Trace:
[ 9639.596337] <TASK>
[ 9639.596338] _raw_spin_lock_irqsave+0x37/0x40
[ 9639.596341] swiotlb_do_find_slots+0xef/0x3e0
[ 9639.596344] swiotlb_tbl_map_single+0xec/0x1f0
[ 9639.596347] swiotlb_map+0x5c/0x260
[ 9639.596349] dma_direct_map_sg+0x7a/0x280
[ 9639.596352] __dma_map_sg_attrs+0x30/0x70
[ 9639.596355] dma_map_sgtable+0x1d/0x30
[ 9639.596356] nvme_map_data+0xce/0x370
[ 9639.596359] nvme_prep_rq.part.0+0x31/0x120
[ 9639.596362] nvme_queue_rq+0x77/0x1f0
[ 9639.596364] blk_mq_dispatch_rq_list+0x17e/0x670
[ 9639.596367] __blk_mq_sched_dispatch_requests+0x129/0x140
[ 9639.596370] blk_mq_sched_dispatch_requests+0x34/0x60
[ 9639.596372] __blk_mq_run_hw_queue+0x91/0xb0
[ 9639.596374] process_one_work+0x1df/0x3b0
[ 9639.596377] worker_thread+0x49/0x2e0
[ 9639.596379] ? rescuer_thread+0x390/0x390
[ 9639.596380] kthread+0xe5/0x110
[ 9639.596382] ? kthread_complete_and_exit+0x20/0x20
[ 9639.596383] ret_from_fork+0x1f/0x30
[ 9639.596387] </TASK>

...

[ 9639.595665] NMI backtrace for cpu 50
[ 9639.595667] CPU: 50 PID: 0 Comm: swapper/50 Kdump: loaded Tainted: G
E 6.2.0-rc6-swiotlb-upstream-bug-repreduce+ #70
[ 9639.595669] RIP: 0010:native_queued_spin_lock_slowpath+0x2e/0x2c0
[ 9639.595672] Code: 00 41 55 41 54 55 48 89 fd 53 66 90 ba 01 00 00 00
8b 45 00 85 c0 75 10 f0 0f b1 55 00 85 c0 75 f0 5b 5d 41 5c 41 5d c3 f3
90 <eb> e5 81 fe 00 01 00 00 74 4e 40 30 f6 8
5 f6 75 71 f0 0f ba 6d 00
[ 9639.595673] RSP: 0018:ffffc90000c38e28 EFLAGS: 00000002
[ 9639.595674] RAX: 0000000000000001 RBX: ffff88de5e459adc RCX:
0000000000000001
[ 9639.595675] RDX: 0000000000000001 RSI: 0000000000000001 RDI:
ffff88de5e459adc
[ 9639.595675] RBP: ffff88de5e459adc R08: ffff88dad2400000 R09:
0000000000001000
[ 9639.595676] R10: 0000005b9c570000 R11: ffffc90000c38ff8 R12:
0000000000000087
[ 9639.595677] R13: ffff88de5e459ad0 R14: ffff88de5e459adc R15:
00000000001882e0
[ 9639.595678] FS: 0000000000000000(0000) GS:ffff88dce5080000(0000)
knlGS:0000000000000000
[ 9639.595679] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9639.595679] CR2: 00007f256aa4d10c CR3: 000000010ade0005 CR4:
0000000000770ee0
[ 9639.595680] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 9639.595681] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
0000000000000400
[ 9639.595681] PKRU: 55555554
[ 9639.595682] Call Trace:
[ 9639.595682] <IRQ>
[ 9639.595683] _raw_spin_lock_irqsave+0x37/0x40
[ 9639.595686] swiotlb_release_slots.isra.0+0x86/0x180
[ 9639.595688] dma_direct_unmap_sg+0xcf/0x1a0
[ 9639.595690] nvme_unmap_data.part.0+0x43/0xc0
[ 9639.595693] nvme_pci_complete_batch+0x71/0xe0
[ 9639.595695] nvme_irq+0x7b/0x90
[ 9639.595697] ? nvme_prep_rq_batch+0xc0/0xc0
[ 9639.595700] __handle_irq_event_percpu+0x46/0x170
[ 9639.595701] handle_irq_event+0x3a/0x80
[ 9639.595703] handle_edge_irq+0xae/0x290
[ 9639.595705] __common_interrupt+0x62/0x100
[ 9639.595707] common_interrupt+0xb3/0xd0
[ 9639.595709] </IRQ>
[ 9639.595709] <TASK>
[ 9639.595710] asm_common_interrupt+0x22/0x40
[ 9639.595712] RIP: 0010:__tdx_hypercall+0x34/0x80

$ git diff # which allows me to allocate swiotlb memory bigger than 4GiB.
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index a34c38bbe28f..a6d7aae3c040 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -311,6 +311,7 @@ static void *swiotlb_memblock_alloc(unsigned long nslabs, unsigned int flags,
* allow to pick a location everywhere for hypervisors with guest
* memory encryption.
*/
+ flags |= SWIOTLB_ANY;
if (flags & SWIOTLB_ANY)
tlb = memblock_alloc(bytes, PAGE_SIZE);
else



Thank you very much for your time and careful reading.
Guorui

> Robin.
>
>> Changes in this patch set
>> =========================
>> Instead of keeping "infecting" the legacy swiotlb code with CoCo logic,
>> this patch tries to introduce a new cc-swiotlb for Confidential VMs.
>>
>> Confidential VMs usually have reasonably modern devices (virtio devices,
>> NVME, etc.) which can access memory above 4GiB, so cc-swiotlb can
>> allocate TLB buffers at any position dynamically. Since
>> set_memory_{decrypted,encrypted} is time-consuming and cannot be used in
>> interrupt context, a new kernel thread "kccd" has been added to populate
>> new TLB buffers on-demand, which solves problem 1.
>>
>> In addition, the cc-swiotlb manages TLB buffers by different sizes
>> (512B, 2KiB, 4KiB, 16KiB, and 512KiB). The above values come from the
>> following observations (boot with 8core, 32 GiB, 1 nvme disk, and 1
>> virtio-net):
>> - Allocations of 512 bytes and below account for 3.5% of the total DMA
>>    cache allocations;
>> - Allocations of 2 KiB and below account for 57.7%;
>> - Allocations of 4 KiB and below account for 91.3%;
>> - Allocations of 16 KiB and below account for 96.0%;
>> - Allocations of 512 KiB and below accounted for 100%;
>> - At the end of booting, cc-swiotlb uses 288 MiB in total.
>>
>> For comparison, legacy swiotlb reserves memory at 6%, which requires
>> min(1GiB, 32GiB * 0.06) = 1GiB, and will hang when operating multiple
>> disks simultaneously due to no memory for the swiotlb buffer.
>>
>> These patches were tested with fio (using different iodepth and block
>> size) on a platform with 96 cores, 384 GiB, and 20 NVME disks, and no IO
>> hang or error was observed.
>>
>> For simplicity, the current RFC version cannot switch between the legacy
>> implementation and cc-swiotlb via cmdline, only through compile options.
>> I am open to discussing how to integrate cc-swiotlb into the legacy one.
>>
>> Patch Organization
>> ==================
>> - swiotlb: Split common code from swiotlb.{c,h}
>> - swiotlb: Add a new cc-swiotlb implementation for Confidential VMs
>> - swiotlb: Add tracepoint swiotlb_unbounced
>> - cc-swiotlb: Allow set swiotlb watermark from cmdline
>>
>> Thanks for your time!
>>
>> Have a nice day,
>> Guorui
>>
>>

2023-01-31 17:18:12

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 2/4] swiotlb: Add a new cc-swiotlb implementation for Confidential VMs

> No, this cannot guarantee we always have sufficient TLB caches, so we
> can also have a "No memory for cc-swiotlb buffer" warning.

It's not just a warning, it will be IO errors, right?

>
> But I want to emphasize that in this case, the current implementation
> is no worse than the legacy implementation. Moreover, dynamic TLB
> allocation is more suitable for situations where more disks/network
> devices will be hotplugged, in which case you cannot pre-set a
> reasonable value.

That's a reasonable standpoint, but you have to emphasize that it is
"probabilistic" in all the descriptions and comments.

I assume you did some stress testing (E.g. all cores submitting at full
bandwidth) to validate that it works for you?

-Andi



2023-02-01 02:08:59

by Guorui Yu

[permalink] [raw]
Subject: Re: [PATCH 2/4] swiotlb: Add a new cc-swiotlb implementation for Confidential VMs



On 2023/2/1 01:16, Andi Kleen wrote:
>> No, this cannot guarantee we always have sufficient TLB caches, so we
>> can also have a "No memory for cc-swiotlb buffer" warning.
>
> It's not just a warning, it will be IO errors, right?
>

Yes, they are IO errors, but such non-sustained IO errors have not been
fatal in my limited testing so far, and the system can survive them.
Again, the legacy swiotlb occasionally suffers from TLB starvation as
well.

However, if dynamic allocation of TLBs is not allowed at all, the system
will be more likely to be overwhelmed by a large burst of IOs and unable
to respond. Such problems are generally transient, so they are difficult
to reproduce and debug in a production environment. Users can only set
an unreasonably large fixed size and REBOOT to mitigate this problem as
much as possible.

>>
>> But I want to emphasize that in this case, the current implementation
>> is no worse than the legacy implementation. Moreover, dynamic TLB
>> allocation is more suitable for situations where more disks/network
>> devices will be hotplugged, in which case you cannot pre-set a
>> reasonable value.
>
> That's a reasonable standpoint, but you have to emphasize that it is
> "probabilistic" in all the descriptions and comments.
>

Agreed, but one point to add is that the user can adjust the watermark
setting to reduce the possibility of a TLB allocation failure in
interrupt context.

According to the current design, the kthread will be awakened to
allocate new TLBs when the pool falls below half of the watermark, so
more headroom can be left by increasing the watermark.

> I assume you did some stress testing (E.g. all cores submitting at full
> bandwidth) to validate that it works for you?
>
> -Andi
>

Yes, I tested with fio using different block sizes, iodepths and job
counts on my testbed.

And I have noticed some "IO errors" of `No memory for cc-swiotlb buffer`
at the beginning of the test, but they eventually disappear as long as
there is enough free memory.

Thanks for your time,
Guorui