set_memory_decrypted() may block, so it is not possible to do non-blocking
allocations through the DMA API for devices that require unencrypted
memory.
The solution is to expand the atomic DMA pools for the various possible
gfp requirements as a means to prevent an unnecessary depletion of lowmem.
These atomic pools are separated from the remap code and can be selected
for configurations that need them outside the scope of
CONFIG_DMA_DIRECT_REMAP, such as CONFIG_AMD_MEM_ENCRYPT.
These atomic DMA pools are kept unencrypted so they can immediately be
used for non-blocking allocations. Since the need for this type of memory
depends on the kernel config and devices being used, these pools are also
dynamically expandable.
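To illustrate the dynamic expansion, here is a rough sketch of the idea only
(the names, the 1/4 threshold, and the helper are illustrative, not the code
in this series): the atomic allocation path cannot add memory itself, so when
a pool runs low the refill is deferred to process context through a workqueue.

  #include <linux/genalloc.h>
  #include <linux/workqueue.h>

  /* Illustrative only: assume pool_refill_work was set up with INIT_WORK(). */
  static struct work_struct pool_refill_work;

  static void maybe_expand_pool(struct gen_pool *pool, size_t pool_size)
  {
          /* kick a background refill once less than 1/4 of the pool is free */
          if (gen_pool_avail(pool) < pool_size / 4)
                  schedule_work(&pool_refill_work);
  }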
The sizes of the various atomic DMA pools are exported through debugfs at
/sys/kernel/debug/dma_pools.
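For a concrete picture of the consumer side, a minimal driver-style sketch
follows (the helper name is hypothetical and not part of this series): with
SEV enabled, a GFP_ATOMIC coherent allocation cannot go through
set_memory_decrypted(), so it is expected to be served from one of these
pre-decrypted pools.

  #include <linux/dma-mapping.h>
  #include <linux/sizes.h>

  /* Hypothetical helper: in atomic context the DMA API must not sleep. */
  static void *my_drv_alloc_desc(struct device *dev, dma_addr_t *handle)
  {
          /* with SEV, this comes from an unencrypted atomic pool */
          return dma_alloc_coherent(dev, SZ_4K, handle, GFP_ATOMIC);
  }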
This patchset is based on the latest Linus HEAD:
commit 8632e9b5645bbc2331d21d892b0d6961c1a08429
Merge: 6cc9306b8fc0 f3a99e761efa
Author: Linus Torvalds <[email protected]>
Date: Tue Apr 14 11:58:04 2020 -0700
Merge tag 'hyperv-fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
---
arch/x86/Kconfig | 1 +
drivers/iommu/dma-iommu.c | 5 +-
include/linux/dma-direct.h | 2 +
include/linux/dma-mapping.h | 6 +-
kernel/dma/Kconfig | 6 +-
kernel/dma/Makefile | 1 +
kernel/dma/direct.c | 56 ++++++--
kernel/dma/pool.c | 275 ++++++++++++++++++++++++++++++++++++
kernel/dma/remap.c | 114 ---------------
9 files changed, 334 insertions(+), 132 deletions(-)
create mode 100644 kernel/dma/pool.c
When AMD memory encryption is enabled, some devices may use more than
256KB/sec from the atomic pools. It would be more appropriate to scale
the default size based on memory capacity unless the coherent_pool
option is used on the kernel command line.
This provides a slight optimization on initial expansion and is deemed
appropriate due to the increased reliance on the atomic pools. Note that
the default size of 128KB per pool will normally yield more total pool
memory than the previous single coherent pool implementation, since there
are now up to three coherent pools (DMA, DMA32, and kernel).
Note that even prior to this patch, coherent_pool= for sizes larger than
1 << (PAGE_SHIFT + MAX_ORDER-1) can fail. With the new dynamic expansion
support, this could trivially be extended to allow even larger initial
sizes.
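As a rough worked example of the heuristic, taking the stated policy of 128KB
per 1GB at face value and assuming x86-64 defaults (PAGE_SHIFT = 12,
MAX_ORDER = 11): a 16GB guest would get 2MB per pool, while 32GB or more is
clamped to the 1 << (PAGE_SHIFT + MAX_ORDER-1) = 4MB per-pool limit. Passing
something like coherent_pool=2M on the command line still overrides the
heuristic entirely.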
Signed-off-by: David Rientjes <[email protected]>
---
kernel/dma/pool.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
index 3e22022c933b..763b687569b0 100644
--- a/kernel/dma/pool.c
+++ b/kernel/dma/pool.c
@@ -22,8 +22,8 @@ static unsigned long pool_size_dma32;
static unsigned long pool_size_kernel;
#endif
-#define DEFAULT_DMA_COHERENT_POOL_SIZE SZ_256K
-static size_t atomic_pool_size = DEFAULT_DMA_COHERENT_POOL_SIZE;
+/* Size can be defined by the coherent_pool command line */
+static size_t atomic_pool_size;
/* Dynamic background expansion when the atomic pool is near capacity */
static struct work_struct atomic_pool_work;
@@ -181,6 +181,16 @@ static int __init dma_atomic_pool_init(void)
{
int ret = 0;
+ /*
+ * If coherent_pool was not used on the command line, default the pool
+ * sizes to 128KB per 1GB of memory, min 128KB, max MAX_ORDER-1.
+ */
+ if (!atomic_pool_size) {
+ atomic_pool_size = max(totalram_pages() >> PAGE_SHIFT, 1UL) *
+ SZ_128K;
+ atomic_pool_size = min_t(size_t, atomic_pool_size,
+ 1 << (PAGE_SHIFT + MAX_ORDER-1));
+ }
INIT_WORK(&atomic_pool_work, atomic_pool_work_fn);
atomic_pool_kernel = __dma_atomic_pool_init(atomic_pool_size,
The single atomic pool is allocated from the lowest zone possible since
it is guaranteed to be applicable for any DMA allocation.
Devices may allocate through the DMA API but not have a strict reliance
on GFP_DMA memory. Since the atomic pool will be used for all
non-blocking allocations, returning all memory from ZONE_DMA may
unnecessarily deplete the zone.
Provide multiple atomic pools that map to the optimal gfp mask of the
device.
When allocating non-blocking memory, determine the optimal gfp mask of
the device and use the appropriate atomic pool.
The coherent DMA mask will remain the same between allocation and free
and, thus, memory will be freed to the same atomic pool it was allocated
from.
__dma_atomic_pool_init() will be changed to return struct gen_pool *
later once dynamic expansion is added.
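As a hedged usage sketch of that alloc/free symmetry (the helper name and the
30-bit mask are illustrative; dev_to_pool() in the hunk below is the real
selection logic): since the coherent mask does not change between allocation
and free, both calls resolve to the same pool.

  #include <linux/dma-mapping.h>
  #include <linux/sizes.h>

  /* Illustrative only: a device limited to 30-bit coherent DMA addressing. */
  static int my_pool_roundtrip(struct device *dev)
  {
          dma_addr_t handle;
          void *buf;

          if (dma_set_coherent_mask(dev, DMA_BIT_MASK(30)))
                  return -EIO;

          /*
           * Non-blocking: when unencrypted/uncached DMA is required, this is
           * served from the GFP_DMA32 (or GFP_DMA, depending on the
           * architecture's ZONE_DMA limit) atomic pool.
           */
          buf = dma_alloc_coherent(dev, SZ_4K, &handle, GFP_ATOMIC);
          if (!buf)
                  return -ENOMEM;

          /* the unchanged coherent mask selects the same pool on free */
          dma_free_coherent(dev, SZ_4K, buf, handle);
          return 0;
  }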
Signed-off-by: David Rientjes <[email protected]>
---
drivers/iommu/dma-iommu.c | 5 +-
include/linux/dma-direct.h | 2 +
include/linux/dma-mapping.h | 6 +-
kernel/dma/direct.c | 12 ++--
kernel/dma/pool.c | 120 +++++++++++++++++++++++-------------
5 files changed, 91 insertions(+), 54 deletions(-)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index ba128d1cdaee..4959f5df21bd 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -952,7 +952,7 @@ static void __iommu_dma_free(struct device *dev, size_t size, void *cpu_addr)
/* Non-coherent atomic allocation? Easy */
if (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
- dma_free_from_pool(cpu_addr, alloc_size))
+ dma_free_from_pool(dev, cpu_addr, alloc_size))
return;
if (IS_ENABLED(CONFIG_DMA_REMAP) && is_vmalloc_addr(cpu_addr)) {
@@ -1035,7 +1035,8 @@ static void *iommu_dma_alloc(struct device *dev, size_t size,
if (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
!gfpflags_allow_blocking(gfp) && !coherent)
- cpu_addr = dma_alloc_from_pool(PAGE_ALIGN(size), &page, gfp);
+ cpu_addr = dma_alloc_from_pool(dev, PAGE_ALIGN(size), &page,
+ gfp);
else
cpu_addr = iommu_dma_alloc_pages(dev, size, &page, gfp, attrs);
if (!cpu_addr)
diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
index 24b8684aa21d..136f984df0d9 100644
--- a/include/linux/dma-direct.h
+++ b/include/linux/dma-direct.h
@@ -67,6 +67,8 @@ static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size,
}
u64 dma_direct_get_required_mask(struct device *dev);
+gfp_t dma_direct_optimal_gfp_mask(struct device *dev, u64 dma_mask,
+ u64 *phys_mask);
void *dma_direct_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle,
gfp_t gfp, unsigned long attrs);
void dma_direct_free(struct device *dev, size_t size, void *cpu_addr,
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 330ad58fbf4d..b43116a6405d 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -630,9 +630,9 @@ void *dma_common_pages_remap(struct page **pages, size_t size,
pgprot_t prot, const void *caller);
void dma_common_free_remap(void *cpu_addr, size_t size);
-bool dma_in_atomic_pool(void *start, size_t size);
-void *dma_alloc_from_pool(size_t size, struct page **ret_page, gfp_t flags);
-bool dma_free_from_pool(void *start, size_t size);
+void *dma_alloc_from_pool(struct device *dev, size_t size,
+ struct page **ret_page, gfp_t flags);
+bool dma_free_from_pool(struct device *dev, void *start, size_t size);
int
dma_common_get_sgtable(struct device *dev, struct sg_table *sgt, void *cpu_addr,
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 8f4bbdaf965e..a834ee22f8ff 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -45,8 +45,8 @@ u64 dma_direct_get_required_mask(struct device *dev)
return (1ULL << (fls64(max_dma) - 1)) * 2 - 1;
}
-static gfp_t __dma_direct_optimal_gfp_mask(struct device *dev, u64 dma_mask,
- u64 *phys_limit)
+gfp_t dma_direct_optimal_gfp_mask(struct device *dev, u64 dma_mask,
+ u64 *phys_limit)
{
u64 dma_limit = min_not_zero(dma_mask, dev->bus_dma_limit);
@@ -89,8 +89,8 @@ struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
/* we always manually zero the memory once we are done: */
gfp &= ~__GFP_ZERO;
- gfp |= __dma_direct_optimal_gfp_mask(dev, dev->coherent_dma_mask,
- &phys_limit);
+ gfp |= dma_direct_optimal_gfp_mask(dev, dev->coherent_dma_mask,
+ &phys_limit);
page = dma_alloc_contiguous(dev, alloc_size, gfp);
if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
dma_free_contiguous(dev, page, alloc_size);
@@ -128,7 +128,7 @@ void *dma_direct_alloc_pages(struct device *dev, size_t size,
if (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
dma_alloc_need_uncached(dev, attrs) &&
!gfpflags_allow_blocking(gfp)) {
- ret = dma_alloc_from_pool(PAGE_ALIGN(size), &page, gfp);
+ ret = dma_alloc_from_pool(dev, PAGE_ALIGN(size), &page, gfp);
if (!ret)
return NULL;
goto done;
@@ -212,7 +212,7 @@ void dma_direct_free_pages(struct device *dev, size_t size, void *cpu_addr,
}
if (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
- dma_free_from_pool(cpu_addr, PAGE_ALIGN(size)))
+ dma_free_from_pool(dev, cpu_addr, PAGE_ALIGN(size)))
return;
if (force_dma_unencrypted(dev))
diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
index 6612c2d51d3c..5c98ab991b16 100644
--- a/kernel/dma/pool.c
+++ b/kernel/dma/pool.c
@@ -10,7 +10,9 @@
#include <linux/slab.h>
#include <linux/vmalloc.h>
-static struct gen_pool *atomic_pool __ro_after_init;
+static struct gen_pool *atomic_pool_dma __ro_after_init;
+static struct gen_pool *atomic_pool_dma32 __ro_after_init;
+static struct gen_pool *atomic_pool_kernel __ro_after_init;
#define DEFAULT_DMA_COHERENT_POOL_SIZE SZ_256K
static size_t atomic_pool_size __initdata = DEFAULT_DMA_COHERENT_POOL_SIZE;
@@ -22,89 +24,119 @@ static int __init early_coherent_pool(char *p)
}
early_param("coherent_pool", early_coherent_pool);
-static gfp_t dma_atomic_pool_gfp(void)
+static int __init __dma_atomic_pool_init(struct gen_pool **pool,
+ size_t pool_size, gfp_t gfp)
{
- if (IS_ENABLED(CONFIG_ZONE_DMA))
- return GFP_DMA;
- if (IS_ENABLED(CONFIG_ZONE_DMA32))
- return GFP_DMA32;
- return GFP_KERNEL;
-}
-
-static int __init dma_atomic_pool_init(void)
-{
- unsigned int pool_size_order = get_order(atomic_pool_size);
- unsigned long nr_pages = atomic_pool_size >> PAGE_SHIFT;
+ const unsigned int order = get_order(pool_size);
+ const unsigned long nr_pages = pool_size >> PAGE_SHIFT;
struct page *page;
void *addr;
int ret;
if (dev_get_cma_area(NULL))
- page = dma_alloc_from_contiguous(NULL, nr_pages,
- pool_size_order, false);
+ page = dma_alloc_from_contiguous(NULL, nr_pages, order, false);
else
- page = alloc_pages(dma_atomic_pool_gfp(), pool_size_order);
+ page = alloc_pages(gfp, order);
if (!page)
goto out;
- arch_dma_prep_coherent(page, atomic_pool_size);
+ arch_dma_prep_coherent(page, pool_size);
- atomic_pool = gen_pool_create(PAGE_SHIFT, -1);
- if (!atomic_pool)
+ *pool = gen_pool_create(PAGE_SHIFT, -1);
+ if (!*pool)
goto free_page;
- addr = dma_common_contiguous_remap(page, atomic_pool_size,
+ addr = dma_common_contiguous_remap(page, pool_size,
pgprot_dmacoherent(PAGE_KERNEL),
__builtin_return_address(0));
if (!addr)
goto destroy_genpool;
- ret = gen_pool_add_virt(atomic_pool, (unsigned long)addr,
- page_to_phys(page), atomic_pool_size, -1);
+ ret = gen_pool_add_virt(*pool, (unsigned long)addr, page_to_phys(page),
+ pool_size, -1);
if (ret)
goto remove_mapping;
- gen_pool_set_algo(atomic_pool, gen_pool_first_fit_order_align, NULL);
+ gen_pool_set_algo(*pool, gen_pool_first_fit_order_align, NULL);
- pr_info("DMA: preallocated %zu KiB pool for atomic allocations\n",
- atomic_pool_size / 1024);
+ pr_info("DMA: preallocated %zu KiB %pGg pool for atomic allocations\n",
+ pool_size >> 10, &gfp);
return 0;
remove_mapping:
- dma_common_free_remap(addr, atomic_pool_size);
+ dma_common_free_remap(addr, pool_size);
destroy_genpool:
- gen_pool_destroy(atomic_pool);
- atomic_pool = NULL;
+ gen_pool_destroy(*pool);
+ *pool = NULL;
free_page:
if (!dma_release_from_contiguous(NULL, page, nr_pages))
- __free_pages(page, pool_size_order);
+ __free_pages(page, order);
out:
- pr_err("DMA: failed to allocate %zu KiB pool for atomic coherent allocation\n",
- atomic_pool_size / 1024);
+ pr_err("DMA: failed to allocate %zu KiB %pGg pool for atomic allocation\n",
+ pool_size >> 10, &gfp);
return -ENOMEM;
}
+
+static int __init dma_atomic_pool_init(void)
+{
+ int ret = 0;
+ int err;
+
+ ret = __dma_atomic_pool_init(&atomic_pool_kernel, atomic_pool_size,
+ GFP_KERNEL);
+ if (IS_ENABLED(CONFIG_ZONE_DMA)) {
+ err = __dma_atomic_pool_init(&atomic_pool_dma,
+ atomic_pool_size, GFP_DMA);
+ if (!ret && err)
+ ret = err;
+ }
+ if (IS_ENABLED(CONFIG_ZONE_DMA32)) {
+ err = __dma_atomic_pool_init(&atomic_pool_dma32,
+ atomic_pool_size, GFP_DMA32);
+ if (!ret && err)
+ ret = err;
+ }
+ return ret;
+}
postcore_initcall(dma_atomic_pool_init);
-bool dma_in_atomic_pool(void *start, size_t size)
+static inline struct gen_pool *dev_to_pool(struct device *dev)
{
- if (unlikely(!atomic_pool))
- return false;
+ u64 phys_mask;
+ gfp_t gfp;
+
+ gfp = dma_direct_optimal_gfp_mask(dev, dev->coherent_dma_mask,
+ &phys_mask);
+ if (IS_ENABLED(CONFIG_ZONE_DMA) && gfp == GFP_DMA)
+ return atomic_pool_dma;
+ if (IS_ENABLED(CONFIG_ZONE_DMA32) && gfp == GFP_DMA32)
+ return atomic_pool_dma32;
+ return atomic_pool_kernel;
+}
- return gen_pool_has_addr(atomic_pool, (unsigned long)start, size);
+static bool dma_in_atomic_pool(struct device *dev, void *start, size_t size)
+{
+ struct gen_pool *pool = dev_to_pool(dev);
+
+ if (unlikely(!pool))
+ return false;
+ return gen_pool_has_addr(pool, (unsigned long)start, size);
}
-void *dma_alloc_from_pool(size_t size, struct page **ret_page, gfp_t flags)
+void *dma_alloc_from_pool(struct device *dev, size_t size,
+ struct page **ret_page, gfp_t flags)
{
+ struct gen_pool *pool = dev_to_pool(dev);
unsigned long val;
void *ptr = NULL;
- if (!atomic_pool) {
- WARN(1, "coherent pool not initialised!\n");
+ if (!pool) {
+ WARN(1, "%pGg atomic pool not initialised!\n", &flags);
return NULL;
}
- val = gen_pool_alloc(atomic_pool, size);
+ val = gen_pool_alloc(pool, size);
if (val) {
- phys_addr_t phys = gen_pool_virt_to_phys(atomic_pool, val);
+ phys_addr_t phys = gen_pool_virt_to_phys(pool, val);
*ret_page = pfn_to_page(__phys_to_pfn(phys));
ptr = (void *)val;
@@ -114,10 +146,12 @@ void *dma_alloc_from_pool(size_t size, struct page **ret_page, gfp_t flags)
return ptr;
}
-bool dma_free_from_pool(void *start, size_t size)
+bool dma_free_from_pool(struct device *dev, void *start, size_t size)
{
- if (!dma_in_atomic_pool(start, size))
+ struct gen_pool *pool = dev_to_pool(dev);
+
+ if (!dma_in_atomic_pool(dev, start, size))
return false;
- gen_pool_free(atomic_pool, (unsigned long)start, size);
+ gen_pool_free(pool, (unsigned long)start, size);
return true;
}
When CONFIG_AMD_MEM_ENCRYPT is enabled and a device requires unencrypted
DMA, all non-blocking allocations must originate from the atomic DMA
coherent pools.
Select CONFIG_DMA_COHERENT_POOL for CONFIG_AMD_MEM_ENCRYPT.
Signed-off-by: David Rientjes <[email protected]>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d6104ea8af0..2bf2222819d3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1520,6 +1520,7 @@ config X86_CPA_STATISTICS
config AMD_MEM_ENCRYPT
bool "AMD Secure Memory Encryption (SME) support"
depends on X86_64 && CPU_SUP_AMD
+ select DMA_COHERENT_POOL
select DYNAMIC_PHYSICAL_MASK
select ARCH_USE_MEMREMAP_PROT
select ARCH_HAS_FORCE_DMA_UNENCRYPTED
So modulo a few comments that I can fix up myself this looks good. Unless
you want to resend for some reason I'm ready to pick this up once I open
the dma-mapping tree after -rc2.
On Fri, 17 Apr 2020, Christoph Hellwig wrote:
> So modulo a few comments that I can fix up myself this looks good. Unless
> you want to resend for some reason I'm ready to pick this up once I open
> the dma-mapping tree after -rc2.
>
Yes, please do, and thanks to both you and Thomas for the guidance and
code reviews.
Once these patches take on their final form in your branch, how supportive
would you be of stable backports going back to 4.19 LTS?
There have been several changes to this area over time, so there are
varying levels of rework that need to be done for each stable kernel back
to 4.19. But I'd be happy to do that work if you are receptive to it.
For rationale: without these fixes, all SEV-enabled guests warn of
blocking in RCU read-side critical sections when using drivers that
allocate atomically through the DMA API, which ends up calling
set_memory_decrypted().
Users can see warnings such as these in the guest:
BUG: sleeping function called from invalid context at mm/vmalloc.c:1710
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 3383, name: fio
2 locks held by fio/3383:
#0: ffff93b6a8568348 (&sb->s_type->i_mutex_key#16){+.+.}, at: ext4_file_write_iter+0xa2/0x5d0
#1: ffffffffa52a61a0 (rcu_read_lock){....}, at: hctx_lock+0x1a/0xe0
CPU: 0 PID: 3383 Comm: fio Tainted: G W 5.5.10 #14
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
dump_stack+0x98/0xd5
___might_sleep+0x175/0x260
__might_sleep+0x4a/0x80
_vm_unmap_aliases+0x45/0x250
vm_unmap_aliases+0x19/0x20
__set_memory_enc_dec+0xa4/0x130
set_memory_decrypted+0x10/0x20
dma_direct_alloc_pages+0x148/0x150
dma_direct_alloc+0xe/0x10
dma_alloc_attrs+0x86/0xc0
dma_pool_alloc+0x16f/0x2b0
nvme_queue_rq+0x878/0xc30 [nvme]
__blk_mq_try_issue_directly+0x135/0x200
blk_mq_request_issue_directly+0x4f/0x80
blk_mq_try_issue_list_directly+0x46/0xb0
blk_mq_sched_insert_requests+0x19b/0x2b0
blk_mq_flush_plug_list+0x22f/0x3b0
blk_flush_plug_list+0xd1/0x100
blk_finish_plug+0x2c/0x40
iomap_dio_rw+0x427/0x490
ext4_file_write_iter+0x181/0x5d0
aio_write+0x109/0x1b0
io_submit_one+0x7d0/0xfa0
__x64_sys_io_submit+0xa2/0x280
do_syscall_64+0x5f/0x250