2023-01-18 18:03:25

by Jason Gunthorpe

Subject: [PATCH v2 00/10] Let iommufd charge IOPTE allocations to the memory cgroup

iommufd follows the same design as KVM and uses memory cgroups to limit
the amount of kernel memory an iommufd file descriptor can pin down. The
various internal data structures already use GFP_KERNEL_ACCOUNT to charge
their own memory.

However, one of the biggest consumers of kernel memory is the IOPTEs
stored under the iommu_domain, and these allocations are not tracked.

This series is the first step in fixing it.

The iommu driver contract already includes a 'gfp' argument to the
map_pages op; letting iommufd specify GFP_KERNEL_ACCOUNT, and having the
driver allocate the IOPTE tables with that flag, will capture a significant
amount of the allocations.

Update the iommu_map() API to pass in the GFP argument, and fix all call
sites. Replace iommu_map_atomic().
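
For example, a caller that wants the driver's IOPTE table allocations
charged would now do something like this (a minimal sketch: the iommu_map()
signature is from patch 01 and the GFP_KERNEL_ACCOUNT usage is what patch 05
does in iommufd, but the wrapper function here is hypothetical):

	/*
	 * Sketch only, not part of the series. IOPTE tables the driver
	 * allocates while creating this mapping are charged to the calling
	 * task's memory cgroup via GFP_KERNEL_ACCOUNT.
	 */
	static int example_map_one_page(struct iommu_domain *domain,
					unsigned long iova, phys_addr_t paddr)
	{
		return iommu_map(domain, iova, paddr, PAGE_SIZE,
				 IOMMU_READ | IOMMU_WRITE, GFP_KERNEL_ACCOUNT);
	}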

Audit the "enterprise" iommu drivers to make sure they do the right thing.
Intel and S390 ignore the GFP argument and always use GFP_ATOMIC. This is
problematic for iommufd anyhow, so fix it. AMD and ARM SMMUv2/3 are
already correct.

A follow up series will be needed to capture the allocations made when the
iommu_domain itself is allocated, which will complete the job.

v2:
- Prohibit bad GFP flags in the iommu wrappers
- Split out the new GFP_KERNEL usages into dedicated patches so it is
easier to check. No code change after the full series
v1: https://lore.kernel.org/r/[email protected]

Jason Gunthorpe (10):
iommu: Add a gfp parameter to iommu_map()
iommu: Remove iommu_map_atomic()
iommu: Add a gfp parameter to iommu_map_sg()
iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous()
iommufd: Use GFP_KERNEL_ACCOUNT for iommu_map()
iommu/intel: Add a gfp parameter to alloc_pgtable_page()
iommu/intel: Support the gfp argument to the map_pages op
iommu/intel: Use GFP_KERNEL in sleepable contexts
iommu/s390: Push the gfp parameter to the kmem_cache_alloc()'s
iommu/s390: Use GFP_KERNEL in sleepable contexts

arch/arm/mm/dma-mapping.c | 11 ++--
arch/s390/include/asm/pci_dma.h | 5 +-
arch/s390/pci/pci_dma.c | 31 ++++++-----
.../drm/nouveau/nvkm/subdev/instmem/gk20a.c | 3 +-
drivers/gpu/drm/tegra/drm.c | 2 +-
drivers/gpu/host1x/cdma.c | 2 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 4 +-
drivers/iommu/dma-iommu.c | 11 ++--
drivers/iommu/intel/iommu.c | 36 +++++++------
drivers/iommu/intel/iommu.h | 2 +-
drivers/iommu/intel/pasid.c | 2 +-
drivers/iommu/iommu.c | 53 +++++++------------
drivers/iommu/iommufd/pages.c | 6 ++-
drivers/iommu/s390-iommu.c | 15 +++---
drivers/media/platform/qcom/venus/firmware.c | 2 +-
drivers/net/ipa/ipa_mem.c | 6 ++-
drivers/net/wireless/ath/ath10k/snoc.c | 2 +-
drivers/net/wireless/ath/ath11k/ahb.c | 4 +-
drivers/remoteproc/remoteproc_core.c | 5 +-
drivers/vfio/vfio_iommu_type1.c | 9 ++--
drivers/vhost/vdpa.c | 2 +-
include/linux/iommu.h | 31 +++--------
22 files changed, 119 insertions(+), 125 deletions(-)


base-commit: 5dc4c995db9eb45f6373a956eb1f69460e69e6d4
--
2.39.0


2023-01-18 18:03:27

by Jason Gunthorpe

Subject: [PATCH v2 07/10] iommu/intel: Support the gfp argument to the map_pages op

Flow it down to alloc_pgtable_page() via pfn_to_dma_pte() and
__domain_mapping().

Signed-off-by: Jason Gunthorpe <[email protected]>
---
drivers/iommu/intel/iommu.c | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index aa29561d3549b3..e95f7703ce7b83 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -908,7 +908,8 @@ void dmar_fault_dump_ptes(struct intel_iommu *iommu, u16 source_id,
#endif

static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
- unsigned long pfn, int *target_level)
+ unsigned long pfn, int *target_level,
+ gfp_t gfp)
{
struct dma_pte *parent, *pte;
int level = agaw_to_level(domain->agaw);
@@ -935,7 +936,7 @@ static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
if (!dma_pte_present(pte)) {
uint64_t pteval;

- tmp_page = alloc_pgtable_page(domain->nid, GFP_ATOMIC);
+ tmp_page = alloc_pgtable_page(domain->nid, gfp);

if (!tmp_page)
return NULL;
@@ -2150,7 +2151,8 @@ static void switch_to_super_page(struct dmar_domain *domain,

while (start_pfn <= end_pfn) {
if (!pte)
- pte = pfn_to_dma_pte(domain, start_pfn, &level);
+ pte = pfn_to_dma_pte(domain, start_pfn, &level,
+ GFP_ATOMIC);

if (dma_pte_present(pte)) {
dma_pte_free_pagetable(domain, start_pfn,
@@ -2172,7 +2174,8 @@ static void switch_to_super_page(struct dmar_domain *domain,

static int
__domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
- unsigned long phys_pfn, unsigned long nr_pages, int prot)
+ unsigned long phys_pfn, unsigned long nr_pages, int prot,
+ gfp_t gfp)
{
struct dma_pte *first_pte = NULL, *pte = NULL;
unsigned int largepage_lvl = 0;
@@ -2202,7 +2205,8 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
largepage_lvl = hardware_largepage_caps(domain, iov_pfn,
phys_pfn, nr_pages);

- pte = pfn_to_dma_pte(domain, iov_pfn, &largepage_lvl);
+ pte = pfn_to_dma_pte(domain, iov_pfn, &largepage_lvl,
+ gfp);
if (!pte)
return -ENOMEM;
first_pte = pte;
@@ -2368,7 +2372,7 @@ static int iommu_domain_identity_map(struct dmar_domain *domain,

return __domain_mapping(domain, first_vpfn,
first_vpfn, last_vpfn - first_vpfn + 1,
- DMA_PTE_READ|DMA_PTE_WRITE);
+ DMA_PTE_READ|DMA_PTE_WRITE, GFP_ATOMIC);
}

static int md_domain_init(struct dmar_domain *domain, int guest_width);
@@ -4298,7 +4302,7 @@ static int intel_iommu_map(struct iommu_domain *domain,
the low bits of hpa would take us onto the next page */
size = aligned_nrpages(hpa, size);
return __domain_mapping(dmar_domain, iova >> VTD_PAGE_SHIFT,
- hpa >> VTD_PAGE_SHIFT, size, prot);
+ hpa >> VTD_PAGE_SHIFT, size, prot, gfp);
}

static int intel_iommu_map_pages(struct iommu_domain *domain,
@@ -4333,7 +4337,8 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,

/* Cope with horrid API which requires us to unmap more than the
size argument if it happens to be a large-page mapping. */
- BUG_ON(!pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level));
+ BUG_ON(!pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level,
+ GFP_ATOMIC));

if (size < VTD_PAGE_SIZE << level_to_offset_bits(level))
size = VTD_PAGE_SIZE << level_to_offset_bits(level);
@@ -4392,7 +4397,8 @@ static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,
int level = 0;
u64 phys = 0;

- pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level);
+ pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level,
+ GFP_ATOMIC);
if (pte && dma_pte_present(pte))
phys = dma_pte_addr(pte) +
(iova & (BIT_MASK(level_to_offset_bits(level) +
--
2.39.0

2023-01-18 18:03:27

by Jason Gunthorpe

Subject: [PATCH v2 04/10] iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous()

Change the sg_alloc_table_from_pages() allocation that was hardwired to
GFP_KERNEL to use the gfp parameter like the other allocations in this
function.

Auditing says this is never called from an atomic context, so it is safe
as is, but reads wrong.

Signed-off-by: Jason Gunthorpe <[email protected]>
---
drivers/iommu/dma-iommu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 8c2788633c1766..e4bf1bb159f7c7 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -822,7 +822,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
if (!iova)
goto out_free_pages;

- if (sg_alloc_table_from_pages(sgt, pages, count, 0, size, GFP_KERNEL))
+ if (sg_alloc_table_from_pages(sgt, pages, count, 0, size, gfp))
goto out_free_iova;

if (!(ioprot & IOMMU_CACHE)) {
--
2.39.0

2023-01-18 18:04:57

by Jason Gunthorpe

Subject: [PATCH v2 01/10] iommu: Add a gfp parameter to iommu_map()

The internal mechanisms support this, but instead of exposing the gfp to
the caller it is wrapped inside iommu_map() and iommu_map_atomic().

Fix this instead of adding more variants for GFP_KERNEL_ACCOUNT.

Signed-off-by: Jason Gunthorpe <[email protected]>
---
arch/arm/mm/dma-mapping.c | 11 ++++++----
.../drm/nouveau/nvkm/subdev/instmem/gk20a.c | 3 ++-
drivers/gpu/drm/tegra/drm.c | 2 +-
drivers/gpu/host1x/cdma.c | 2 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 4 ++--
drivers/iommu/dma-iommu.c | 2 +-
drivers/iommu/iommu.c | 22 +++++++++----------
drivers/iommu/iommufd/pages.c | 6 +++--
drivers/media/platform/qcom/venus/firmware.c | 2 +-
drivers/net/ipa/ipa_mem.c | 6 +++--
drivers/net/wireless/ath/ath10k/snoc.c | 2 +-
drivers/net/wireless/ath/ath11k/ahb.c | 4 ++--
drivers/remoteproc/remoteproc_core.c | 5 +++--
drivers/vfio/vfio_iommu_type1.c | 9 ++++----
drivers/vhost/vdpa.c | 2 +-
include/linux/iommu.h | 4 ++--
16 files changed, 48 insertions(+), 38 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index c135f6e37a00ca..8bc01071474ab7 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -984,7 +984,8 @@ __iommu_create_mapping(struct device *dev, struct page **pages, size_t size,

len = (j - i) << PAGE_SHIFT;
ret = iommu_map(mapping->domain, iova, phys, len,
- __dma_info_to_prot(DMA_BIDIRECTIONAL, attrs));
+ __dma_info_to_prot(DMA_BIDIRECTIONAL, attrs),
+ GFP_KERNEL);
if (ret < 0)
goto fail;
iova += len;
@@ -1207,7 +1208,8 @@ static int __map_sg_chunk(struct device *dev, struct scatterlist *sg,

prot = __dma_info_to_prot(dir, attrs);

- ret = iommu_map(mapping->domain, iova, phys, len, prot);
+ ret = iommu_map(mapping->domain, iova, phys, len, prot,
+ GFP_KERNEL);
if (ret < 0)
goto fail;
count += len >> PAGE_SHIFT;
@@ -1379,7 +1381,8 @@ static dma_addr_t arm_iommu_map_page(struct device *dev, struct page *page,

prot = __dma_info_to_prot(dir, attrs);

- ret = iommu_map(mapping->domain, dma_addr, page_to_phys(page), len, prot);
+ ret = iommu_map(mapping->domain, dma_addr, page_to_phys(page), len,
+ prot, GFP_KERNEL);
if (ret < 0)
goto fail;

@@ -1443,7 +1446,7 @@ static dma_addr_t arm_iommu_map_resource(struct device *dev,

prot = __dma_info_to_prot(dir, attrs) | IOMMU_MMIO;

- ret = iommu_map(mapping->domain, dma_addr, addr, len, prot);
+ ret = iommu_map(mapping->domain, dma_addr, addr, len, prot, GFP_KERNEL);
if (ret < 0)
goto fail;

diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c b/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c
index 648ecf5a8fbc2a..a4ac94a2ab57fc 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c
@@ -475,7 +475,8 @@ gk20a_instobj_ctor_iommu(struct gk20a_instmem *imem, u32 npages, u32 align,
u32 offset = (r->offset + i) << imem->iommu_pgshift;

ret = iommu_map(imem->domain, offset, node->dma_addrs[i],
- PAGE_SIZE, IOMMU_READ | IOMMU_WRITE);
+ PAGE_SIZE, IOMMU_READ | IOMMU_WRITE,
+ GFP_KERNEL);
if (ret < 0) {
nvkm_error(subdev, "IOMMU mapping failure: %d\n", ret);

diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
index 7bd2e65c2a16c5..6ca9f396e55be4 100644
--- a/drivers/gpu/drm/tegra/drm.c
+++ b/drivers/gpu/drm/tegra/drm.c
@@ -1057,7 +1057,7 @@ void *tegra_drm_alloc(struct tegra_drm *tegra, size_t size, dma_addr_t *dma)

*dma = iova_dma_addr(&tegra->carveout.domain, alloc);
err = iommu_map(tegra->domain, *dma, virt_to_phys(virt),
- size, IOMMU_READ | IOMMU_WRITE);
+ size, IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);
if (err < 0)
goto free_iova;

diff --git a/drivers/gpu/host1x/cdma.c b/drivers/gpu/host1x/cdma.c
index 103fda055394ab..4ddfcd2138c95b 100644
--- a/drivers/gpu/host1x/cdma.c
+++ b/drivers/gpu/host1x/cdma.c
@@ -105,7 +105,7 @@ static int host1x_pushbuffer_init(struct push_buffer *pb)

pb->dma = iova_dma_addr(&host1x->iova, alloc);
err = iommu_map(host1x->domain, pb->dma, pb->phys, size,
- IOMMU_READ);
+ IOMMU_READ, GFP_KERNEL);
if (err)
goto iommu_free_iova;
} else {
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index c301b3be9f303d..aeeaca65ace96a 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -277,7 +277,7 @@ static int usnic_uiom_map_sorted_intervals(struct list_head *intervals,
usnic_dbg("va 0x%lx pa %pa size 0x%zx flags 0x%x",
va_start, &pa_start, size, flags);
err = iommu_map(pd->domain, va_start, pa_start,
- size, flags);
+ size, flags, GFP_KERNEL);
if (err) {
usnic_err("Failed to map va 0x%lx pa %pa size 0x%zx with err %d\n",
va_start, &pa_start, size, err);
@@ -294,7 +294,7 @@ static int usnic_uiom_map_sorted_intervals(struct list_head *intervals,
usnic_dbg("va 0x%lx pa %pa size 0x%zx flags 0x%x\n",
va_start, &pa_start, size, flags);
err = iommu_map(pd->domain, va_start, pa_start,
- size, flags);
+ size, flags, GFP_KERNEL);
if (err) {
usnic_err("Failed to map va 0x%lx pa %pa size 0x%zx with err %d\n",
va_start, &pa_start, size, err);
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f798c44e090337..8bdb65e7686ff9 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1615,7 +1615,7 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
if (!iova)
goto out_free_page;

- if (iommu_map(domain, iova, msi_addr, size, prot))
+ if (iommu_map(domain, iova, msi_addr, size, prot, GFP_KERNEL))
goto out_free_iova;

INIT_LIST_HEAD(&msi_page->list);
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 5f6a85aea501ec..7dac062b58f039 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -930,7 +930,7 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group,
if (map_size) {
ret = iommu_map(domain, addr - map_size,
addr - map_size, map_size,
- entry->prot);
+ entry->prot, GFP_KERNEL);
if (ret)
goto out;
map_size = 0;
@@ -2360,31 +2360,31 @@ static int __iommu_map(struct iommu_domain *domain, unsigned long iova,
return ret;
}

-static int _iommu_map(struct iommu_domain *domain, unsigned long iova,
- phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
+int iommu_map(struct iommu_domain *domain, unsigned long iova,
+ phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
{
const struct iommu_domain_ops *ops = domain->ops;
int ret;

+ might_sleep_if(gfpflags_allow_blocking(gfp));
+
+ /* Discourage passing strange GFP flags */
+ if (WARN_ON_ONCE(gfp & (__GFP_COMP | __GFP_DMA | __GFP_DMA32 |
+ __GFP_HIGHMEM)))
+ return -EINVAL;
+
ret = __iommu_map(domain, iova, paddr, size, prot, gfp);
if (ret == 0 && ops->iotlb_sync_map)
ops->iotlb_sync_map(domain, iova, size);

return ret;
}
-
-int iommu_map(struct iommu_domain *domain, unsigned long iova,
- phys_addr_t paddr, size_t size, int prot)
-{
- might_sleep();
- return _iommu_map(domain, iova, paddr, size, prot, GFP_KERNEL);
-}
EXPORT_SYMBOL_GPL(iommu_map);

int iommu_map_atomic(struct iommu_domain *domain, unsigned long iova,
phys_addr_t paddr, size_t size, int prot)
{
- return _iommu_map(domain, iova, paddr, size, prot, GFP_ATOMIC);
+ return iommu_map(domain, iova, paddr, size, prot, GFP_ATOMIC);
}
EXPORT_SYMBOL_GPL(iommu_map_atomic);

diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 1e1d3509efae5e..22cc3bb0c6c55a 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -456,7 +456,8 @@ static int batch_iommu_map_small(struct iommu_domain *domain,
size % PAGE_SIZE);

while (size) {
- rc = iommu_map(domain, iova, paddr, PAGE_SIZE, prot);
+ rc = iommu_map(domain, iova, paddr, PAGE_SIZE, prot,
+ GFP_KERNEL);
if (rc)
goto err_unmap;
iova += PAGE_SIZE;
@@ -500,7 +501,8 @@ static int batch_to_domain(struct pfn_batch *batch, struct iommu_domain *domain,
else
rc = iommu_map(domain, iova,
PFN_PHYS(batch->pfns[cur]) + page_offset,
- next_iova - iova, area->iommu_prot);
+ next_iova - iova, area->iommu_prot,
+ GFP_KERNEL);
if (rc)
goto err_unmap;
iova = next_iova;
diff --git a/drivers/media/platform/qcom/venus/firmware.c b/drivers/media/platform/qcom/venus/firmware.c
index 142d4c74017c04..07d4dceb5e72c7 100644
--- a/drivers/media/platform/qcom/venus/firmware.c
+++ b/drivers/media/platform/qcom/venus/firmware.c
@@ -158,7 +158,7 @@ static int venus_boot_no_tz(struct venus_core *core, phys_addr_t mem_phys,
core->fw.mapped_mem_size = mem_size;

ret = iommu_map(iommu, VENUS_FW_START_ADDR, mem_phys, mem_size,
- IOMMU_READ | IOMMU_WRITE | IOMMU_PRIV);
+ IOMMU_READ | IOMMU_WRITE | IOMMU_PRIV, GFP_KERNEL);
if (ret) {
dev_err(dev, "could not map video firmware region\n");
return ret;
diff --git a/drivers/net/ipa/ipa_mem.c b/drivers/net/ipa/ipa_mem.c
index 9ec5af323f731d..991a7d39f06661 100644
--- a/drivers/net/ipa/ipa_mem.c
+++ b/drivers/net/ipa/ipa_mem.c
@@ -466,7 +466,8 @@ static int ipa_imem_init(struct ipa *ipa, unsigned long addr, size_t size)
size = PAGE_ALIGN(size + addr - phys);
iova = phys; /* We just want a direct mapping */

- ret = iommu_map(domain, iova, phys, size, IOMMU_READ | IOMMU_WRITE);
+ ret = iommu_map(domain, iova, phys, size, IOMMU_READ | IOMMU_WRITE,
+ GFP_KERNEL);
if (ret)
return ret;

@@ -574,7 +575,8 @@ static int ipa_smem_init(struct ipa *ipa, u32 item, size_t size)
size = PAGE_ALIGN(size + addr - phys);
iova = phys; /* We just want a direct mapping */

- ret = iommu_map(domain, iova, phys, size, IOMMU_READ | IOMMU_WRITE);
+ ret = iommu_map(domain, iova, phys, size, IOMMU_READ | IOMMU_WRITE,
+ GFP_KERNEL);
if (ret)
return ret;

diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
index cfcb759a87deac..9a82f0336d9537 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -1639,7 +1639,7 @@ static int ath10k_fw_init(struct ath10k *ar)

ret = iommu_map(iommu_dom, ar_snoc->fw.fw_start_addr,
ar->msa.paddr, ar->msa.mem_size,
- IOMMU_READ | IOMMU_WRITE);
+ IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);
if (ret) {
ath10k_err(ar, "failed to map firmware region: %d\n", ret);
goto err_iommu_detach;
diff --git a/drivers/net/wireless/ath/ath11k/ahb.c b/drivers/net/wireless/ath/ath11k/ahb.c
index d34a4d6325b2b4..df8fdc7067f99c 100644
--- a/drivers/net/wireless/ath/ath11k/ahb.c
+++ b/drivers/net/wireless/ath/ath11k/ahb.c
@@ -1021,7 +1021,7 @@ static int ath11k_ahb_fw_resources_init(struct ath11k_base *ab)

ret = iommu_map(iommu_dom, ab_ahb->fw.msa_paddr,
ab_ahb->fw.msa_paddr, ab_ahb->fw.msa_size,
- IOMMU_READ | IOMMU_WRITE);
+ IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);
if (ret) {
ath11k_err(ab, "failed to map firmware region: %d\n", ret);
goto err_iommu_detach;
@@ -1029,7 +1029,7 @@ static int ath11k_ahb_fw_resources_init(struct ath11k_base *ab)

ret = iommu_map(iommu_dom, ab_ahb->fw.ce_paddr,
ab_ahb->fw.ce_paddr, ab_ahb->fw.ce_size,
- IOMMU_READ | IOMMU_WRITE);
+ IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);
if (ret) {
ath11k_err(ab, "failed to map firmware CE region: %d\n", ret);
goto err_iommu_unmap;
diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index 1cd4815a6dd197..80072b6b628358 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -643,7 +643,8 @@ static int rproc_handle_devmem(struct rproc *rproc, void *ptr,
if (!mapping)
return -ENOMEM;

- ret = iommu_map(rproc->domain, rsc->da, rsc->pa, rsc->len, rsc->flags);
+ ret = iommu_map(rproc->domain, rsc->da, rsc->pa, rsc->len, rsc->flags,
+ GFP_KERNEL);
if (ret) {
dev_err(dev, "failed to map devmem: %d\n", ret);
goto out;
@@ -737,7 +738,7 @@ static int rproc_alloc_carveout(struct rproc *rproc,
}

ret = iommu_map(rproc->domain, mem->da, dma, mem->len,
- mem->flags);
+ mem->flags, GFP_KERNEL);
if (ret) {
dev_err(dev, "iommu_map failed: %d\n", ret);
goto free_mapping;
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 23c24fe98c00d4..e14f86a8ef5258 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1480,7 +1480,8 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,

list_for_each_entry(d, &iommu->domain_list, next) {
ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
- npage << PAGE_SHIFT, prot | IOMMU_CACHE);
+ npage << PAGE_SHIFT, prot | IOMMU_CACHE,
+ GFP_KERNEL);
if (ret)
goto unwind;

@@ -1777,8 +1778,8 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
size = npage << PAGE_SHIFT;
}

- ret = iommu_map(domain->domain, iova, phys,
- size, dma->prot | IOMMU_CACHE);
+ ret = iommu_map(domain->domain, iova, phys, size,
+ dma->prot | IOMMU_CACHE, GFP_KERNEL);
if (ret) {
if (!dma->iommu_mapped) {
vfio_unpin_pages_remote(dma, iova,
@@ -1866,7 +1867,7 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
return;

ret = iommu_map(domain->domain, 0, page_to_phys(pages), PAGE_SIZE * 2,
- IOMMU_READ | IOMMU_WRITE | IOMMU_CACHE);
+ IOMMU_READ | IOMMU_WRITE | IOMMU_CACHE, GFP_KERNEL);
if (!ret) {
size_t unmapped = iommu_unmap(domain->domain, 0, PAGE_SIZE);

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ec32f785dfdec1..fd1536de5b1df0 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -792,7 +792,7 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
r = ops->set_map(vdpa, asid, iotlb);
} else {
r = iommu_map(v->domain, iova, pa, size,
- perm_to_iommu_flags(perm));
+ perm_to_iommu_flags(perm), GFP_KERNEL);
}
if (r) {
vhost_iotlb_del_range(iotlb, iova, iova + size - 1);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 46e1347bfa2286..d2020994f292db 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -467,7 +467,7 @@ extern int iommu_sva_unbind_gpasid(struct iommu_domain *domain,
extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
- phys_addr_t paddr, size_t size, int prot);
+ phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
extern int iommu_map_atomic(struct iommu_domain *domain, unsigned long iova,
phys_addr_t paddr, size_t size, int prot);
extern size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova,
@@ -773,7 +773,7 @@ static inline struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
}

static inline int iommu_map(struct iommu_domain *domain, unsigned long iova,
- phys_addr_t paddr, size_t size, int prot)
+ phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
{
return -ENODEV;
}
--
2.39.0

2023-01-18 18:05:49

by Jason Gunthorpe

Subject: [PATCH v2 09/10] iommu/s390: Push the gfp parameter to the kmem_cache_alloc()'s

dma_alloc_cpu_table() and dma_alloc_page_table() are eventually called by
iommufd through s390_iommu_map_pages() and it should not be forced to
atomic. Thread the gfp parameter through the call chain starting from
s390_iommu_map_pages().

Reviewed-by: Niklas Schnelle <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
---
arch/s390/include/asm/pci_dma.h | 5 +++--
arch/s390/pci/pci_dma.c | 31 +++++++++++++++++--------------
drivers/iommu/s390-iommu.c | 15 +++++++++------
3 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/arch/s390/include/asm/pci_dma.h b/arch/s390/include/asm/pci_dma.h
index 91e63426bdc53f..7119c04c51c5c8 100644
--- a/arch/s390/include/asm/pci_dma.h
+++ b/arch/s390/include/asm/pci_dma.h
@@ -186,9 +186,10 @@ static inline unsigned long *get_st_pto(unsigned long entry)

/* Prototypes */
void dma_free_seg_table(unsigned long);
-unsigned long *dma_alloc_cpu_table(void);
+unsigned long *dma_alloc_cpu_table(gfp_t gfp);
void dma_cleanup_tables(unsigned long *);
-unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr);
+unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr,
+ gfp_t gfp);
void dma_update_cpu_trans(unsigned long *entry, phys_addr_t page_addr, int flags);

extern const struct dma_map_ops s390_pci_dma_ops;
diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c
index ea478d11fbd132..2f6d05d6da4f76 100644
--- a/arch/s390/pci/pci_dma.c
+++ b/arch/s390/pci/pci_dma.c
@@ -27,11 +27,11 @@ static int zpci_refresh_global(struct zpci_dev *zdev)
zdev->iommu_pages * PAGE_SIZE);
}

-unsigned long *dma_alloc_cpu_table(void)
+unsigned long *dma_alloc_cpu_table(gfp_t gfp)
{
unsigned long *table, *entry;

- table = kmem_cache_alloc(dma_region_table_cache, GFP_ATOMIC);
+ table = kmem_cache_alloc(dma_region_table_cache, gfp);
if (!table)
return NULL;

@@ -45,11 +45,11 @@ static void dma_free_cpu_table(void *table)
kmem_cache_free(dma_region_table_cache, table);
}

-static unsigned long *dma_alloc_page_table(void)
+static unsigned long *dma_alloc_page_table(gfp_t gfp)
{
unsigned long *table, *entry;

- table = kmem_cache_alloc(dma_page_table_cache, GFP_ATOMIC);
+ table = kmem_cache_alloc(dma_page_table_cache, gfp);
if (!table)
return NULL;

@@ -63,7 +63,7 @@ static void dma_free_page_table(void *table)
kmem_cache_free(dma_page_table_cache, table);
}

-static unsigned long *dma_get_seg_table_origin(unsigned long *rtep)
+static unsigned long *dma_get_seg_table_origin(unsigned long *rtep, gfp_t gfp)
{
unsigned long old_rte, rte;
unsigned long *sto;
@@ -72,7 +72,7 @@ static unsigned long *dma_get_seg_table_origin(unsigned long *rtep)
if (reg_entry_isvalid(rte)) {
sto = get_rt_sto(rte);
} else {
- sto = dma_alloc_cpu_table();
+ sto = dma_alloc_cpu_table(gfp);
if (!sto)
return NULL;

@@ -90,7 +90,7 @@ static unsigned long *dma_get_seg_table_origin(unsigned long *rtep)
return sto;
}

-static unsigned long *dma_get_page_table_origin(unsigned long *step)
+static unsigned long *dma_get_page_table_origin(unsigned long *step, gfp_t gfp)
{
unsigned long old_ste, ste;
unsigned long *pto;
@@ -99,7 +99,7 @@ static unsigned long *dma_get_page_table_origin(unsigned long *step)
if (reg_entry_isvalid(ste)) {
pto = get_st_pto(ste);
} else {
- pto = dma_alloc_page_table();
+ pto = dma_alloc_page_table(gfp);
if (!pto)
return NULL;
set_st_pto(&ste, virt_to_phys(pto));
@@ -116,18 +116,19 @@ static unsigned long *dma_get_page_table_origin(unsigned long *step)
return pto;
}

-unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr)
+unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr,
+ gfp_t gfp)
{
unsigned long *sto, *pto;
unsigned int rtx, sx, px;

rtx = calc_rtx(dma_addr);
- sto = dma_get_seg_table_origin(&rto[rtx]);
+ sto = dma_get_seg_table_origin(&rto[rtx], gfp);
if (!sto)
return NULL;

sx = calc_sx(dma_addr);
- pto = dma_get_page_table_origin(&sto[sx]);
+ pto = dma_get_page_table_origin(&sto[sx], gfp);
if (!pto)
return NULL;

@@ -170,7 +171,8 @@ static int __dma_update_trans(struct zpci_dev *zdev, phys_addr_t pa,
return -EINVAL;

for (i = 0; i < nr_pages; i++) {
- entry = dma_walk_cpu_trans(zdev->dma_table, dma_addr);
+ entry = dma_walk_cpu_trans(zdev->dma_table, dma_addr,
+ GFP_ATOMIC);
if (!entry) {
rc = -ENOMEM;
goto undo_cpu_trans;
@@ -186,7 +188,8 @@ static int __dma_update_trans(struct zpci_dev *zdev, phys_addr_t pa,
while (i-- > 0) {
page_addr -= PAGE_SIZE;
dma_addr -= PAGE_SIZE;
- entry = dma_walk_cpu_trans(zdev->dma_table, dma_addr);
+ entry = dma_walk_cpu_trans(zdev->dma_table, dma_addr,
+ GFP_ATOMIC);
if (!entry)
break;
dma_update_cpu_trans(entry, page_addr, flags);
@@ -576,7 +579,7 @@ int zpci_dma_init_device(struct zpci_dev *zdev)

spin_lock_init(&zdev->iommu_bitmap_lock);

- zdev->dma_table = dma_alloc_cpu_table();
+ zdev->dma_table = dma_alloc_cpu_table(GFP_ATOMIC);
if (!zdev->dma_table) {
rc = -ENOMEM;
goto out;
diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index ed33c6cce08362..654ec4411fe36c 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -52,7 +52,7 @@ static struct iommu_domain *s390_domain_alloc(unsigned domain_type)
if (!s390_domain)
return NULL;

- s390_domain->dma_table = dma_alloc_cpu_table();
+ s390_domain->dma_table = dma_alloc_cpu_table(GFP_ATOMIC);
if (!s390_domain->dma_table) {
kfree(s390_domain);
return NULL;
@@ -260,7 +260,8 @@ static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,

static int s390_iommu_validate_trans(struct s390_domain *s390_domain,
phys_addr_t pa, dma_addr_t dma_addr,
- unsigned long nr_pages, int flags)
+ unsigned long nr_pages, int flags,
+ gfp_t gfp)
{
phys_addr_t page_addr = pa & PAGE_MASK;
unsigned long *entry;
@@ -268,7 +269,8 @@ static int s390_iommu_validate_trans(struct s390_domain *s390_domain,
int rc;

for (i = 0; i < nr_pages; i++) {
- entry = dma_walk_cpu_trans(s390_domain->dma_table, dma_addr);
+ entry = dma_walk_cpu_trans(s390_domain->dma_table, dma_addr,
+ gfp);
if (unlikely(!entry)) {
rc = -ENOMEM;
goto undo_cpu_trans;
@@ -284,7 +286,7 @@ static int s390_iommu_validate_trans(struct s390_domain *s390_domain,
while (i-- > 0) {
dma_addr -= PAGE_SIZE;
entry = dma_walk_cpu_trans(s390_domain->dma_table,
- dma_addr);
+ dma_addr, gfp);
if (!entry)
break;
dma_update_cpu_trans(entry, 0, ZPCI_PTE_INVALID);
@@ -301,7 +303,8 @@ static int s390_iommu_invalidate_trans(struct s390_domain *s390_domain,
int rc = 0;

for (i = 0; i < nr_pages; i++) {
- entry = dma_walk_cpu_trans(s390_domain->dma_table, dma_addr);
+ entry = dma_walk_cpu_trans(s390_domain->dma_table, dma_addr,
+ GFP_ATOMIC);
if (unlikely(!entry)) {
rc = -EINVAL;
break;
@@ -339,7 +342,7 @@ static int s390_iommu_map_pages(struct iommu_domain *domain,
flags |= ZPCI_TABLE_PROTECTED;

rc = s390_iommu_validate_trans(s390_domain, paddr, iova,
- pgcount, flags);
+ pgcount, flags, gfp);
if (!rc)
*mapped = size;

--
2.39.0

2023-01-19 04:55:59

by Tian, Kevin

Subject: RE: [PATCH v2 01/10] iommu: Add a gfp parameter to iommu_map()

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, January 19, 2023 2:01 AM
>
> The internal mechanisms support this, but instead of exposing the gfp to
> the caller it is wrapped inside iommu_map() and iommu_map_atomic().
>
> Fix this instead of adding more variants for GFP_KERNEL_ACCOUNT.
>
> Signed-off-by: Jason Gunthorpe <[email protected]>

Reviewed-by: Kevin Tian <[email protected]>

2023-01-19 04:56:09

by Tian, Kevin

Subject: RE: [PATCH v2 07/10] iommu/intel: Support the gfp argument to the map_pages op

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, January 19, 2023 2:01 AM
>
> Flow it down to alloc_pgtable_page() via pfn_to_dma_pte() and
> __domain_mapping().
>
> Signed-off-by: Jason Gunthorpe <[email protected]>

Reviewed-by: Kevin Tian <[email protected]>

2023-01-19 04:57:55

by Tian, Kevin

Subject: RE: [PATCH v2 04/10] iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous()

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, January 19, 2023 2:01 AM
>
> Change the sg_alloc_table_from_pages() allocation that was hardwired to
> GFP_KERNEL to use the gfp parameter like the other allocations in this
> function.
>
> Auditing says this is never called from an atomic context, so it is safe
> as is, but reads wrong.
>
> Signed-off-by: Jason Gunthorpe <[email protected]>

Reviewed-by: Kevin Tian <[email protected]>

2023-01-19 12:08:25

by Lu Baolu

Subject: Re: [PATCH v2 07/10] iommu/intel: Support the gfp argument to the map_pages op

On 2023/1/19 2:00, Jason Gunthorpe wrote:
> Flow it down to alloc_pgtable_page() via pfn_to_dma_pte() and
> __domain_mapping().
>
> Signed-off-by: Jason Gunthorpe<[email protected]>

Irrelevant to this patch, GFP_ATOMIC could be changed to GFP_KERNEL in
some places. I will follow up further to clean it up.

For this patch,

Reviewed-by: Lu Baolu <[email protected]>

Best regards,
baolu

2023-01-19 12:09:06

by Lu Baolu

Subject: Re: [PATCH v2 07/10] iommu/intel: Support the gfp argument to the map_pages op

On 2023/1/19 19:57, Baolu Lu wrote:
> On 2023/1/19 2:00, Jason Gunthorpe wrote:
>> Flow it down to alloc_pgtable_page() via pfn_to_dma_pte() and
>> __domain_mapping().
>>
>> Signed-off-by: Jason Gunthorpe<[email protected]>
>
> Irrelevant to this patch, GFP_ATOMIC could be changed to GFP_KERNEL in
> some places. I will follow up further to clean it up.

It has been done in the next patch. Sorry for the noise.

Best regards,
baolu

2023-01-19 22:17:54

by Matthew Rosato

Subject: Re: [PATCH v2 09/10] iommu/s390: Push the gfp parameter to the kmem_cache_alloc()'s

On 1/18/23 1:00 PM, Jason Gunthorpe wrote:
> dma_alloc_cpu_table() and dma_alloc_page_table() are eventually called by
> iommufd through s390_iommu_map_pages() and it should not be forced to
> atomic. Thread the gfp parameter through the call chain starting from
> s390_iommu_map_pages().
>
> Reviewed-by: Niklas Schnelle <[email protected]>
> Signed-off-by: Jason Gunthorpe <[email protected]>
> ---
> arch/s390/include/asm/pci_dma.h | 5 +++--
> arch/s390/pci/pci_dma.c | 31 +++++++++++++++++--------------
> drivers/iommu/s390-iommu.c | 15 +++++++++------
> 3 files changed, 29 insertions(+), 22 deletions(-)
>

Reviewed-by: Matthew Rosato <[email protected]>

2023-01-20 19:44:12

by Robin Murphy

Subject: Re: [PATCH v2 04/10] iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous()

On 2023-01-18 18:00, Jason Gunthorpe wrote:
> Change the sg_alloc_table_from_pages() allocation that was hardwired to
> GFP_KERNEL to use the gfp parameter like the other allocations in this
> function.
>
> Auditing says this is never called from an atomic context, so it is safe
> as is, but reads wrong.

I think the point may have been that the sgtable metadata is a
logically-distinct allocation from the buffer pages themselves. Much
like the allocation of the pages array itself further down in
__iommu_dma_alloc_pages(). I see these days it wouldn't be catastrophic
to pass GFP_HIGHMEM into __get_free_page() via sg_kmalloc(), but still,
allocating implementation-internal metadata with all the same
constraints as a DMA buffer has just as much smell of wrong about it IMO.

I'd say the more confusing thing about this particular context is why
we're using iommu_map_sg_atomic() further down - that seems to have been
an oversight in 781ca2de89ba, since this particular path has never
supported being called in atomic context.

Overall I'm starting to wonder if it might not be better to stick a "use
GFP_KERNEL_ACCOUNT if you allocate" flag in the domain for any level of
the API internals to pick up as appropriate, rather than propagate
per-call gfp flags everywhere. As it stands we're still missing
potential pagetable and other domain-related allocations by drivers in
.attach_dev and even (in probably-shouldn't-really-happen cases)
.unmap_pages...
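
A rough sketch of that idea, with the field and helper names invented here
purely for illustration (GFP_KERNEL_ACCOUNT is just GFP_KERNEL |
__GFP_ACCOUNT):

	/* Hypothetical: a per-domain hint instead of a per-call gfp flag */
	static inline gfp_t domain_gfp(struct iommu_domain *domain, gfp_t gfp)
	{
		return domain->charge_to_memcg ? gfp | __GFP_ACCOUNT : gfp;
	}

Any allocation inside the API internals or the drivers could then route its
gfp through such a helper rather than having the flag threaded through every
call.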

Thanks,
Robin.

> Signed-off-by: Jason Gunthorpe <[email protected]>
> ---
> drivers/iommu/dma-iommu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 8c2788633c1766..e4bf1bb159f7c7 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -822,7 +822,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
> if (!iova)
> goto out_free_pages;
>
> - if (sg_alloc_table_from_pages(sgt, pages, count, 0, size, GFP_KERNEL))
> + if (sg_alloc_table_from_pages(sgt, pages, count, 0, size, gfp))
> goto out_free_iova;
>
> if (!(ioprot & IOMMU_CACHE)) {

2023-01-20 21:01:50

by Jason Gunthorpe

Subject: Re: [PATCH v2 04/10] iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous()

On Fri, Jan 20, 2023 at 07:28:19PM +0000, Robin Murphy wrote:
> On 2023-01-18 18:00, Jason Gunthorpe wrote:
> > Change the sg_alloc_table_from_pages() allocation that was hardwired to
> > GFP_KERNEL to use the gfp parameter like the other allocations in this
> > function.
> >
> > Auditing says this is never called from an atomic context, so it is safe
> > as is, but reads wrong.
>
> I think the point may have been that the sgtable metadata is a
> logically-distinct allocation from the buffer pages themselves. Much like
> the allocation of the pages array itself further down in
> __iommu_dma_alloc_pages().

That makes sense, and it is a good reason to mask off the allocation
policy flags from the gfp.

On the other hand it also makes sense to continue to pass in things
like NOWAIT|NOWARN to all the allocations. Even to the iommu driver.

So I'd prefer to change this to mask and make all the following calls
consistently use the input gfp.
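
i.e. something along these lines near the top of
__iommu_dma_alloc_noncontiguous() (a sketch of the intent only, not a
posted hunk):

	/*
	 * Strip the zone/placement modifiers that only make sense for the
	 * DMA buffer itself, keep behaviour flags like __GFP_NOWARN or
	 * __GFP_NOWAIT, then use 'gfp' for the metadata allocations and the
	 * mapping calls below.
	 */
	gfp &= ~(__GFP_DMA | __GFP_DMA32 | __GFP_HIGHMEM);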

> I'd say the more confusing thing about this particular context is why we're
> using iommu_map_sg_atomic() further down - that seems to have been an
> oversight in 781ca2de89ba, since this particular path has never supported
> being called in atomic context.

Huh. I had fixed that in v1; this patch was supposed to have that
hunk, which was the main point of making this patch, actually.

> Overall I'm starting to wonder if it might not be better to stick a "use
> GFP_KERNEL_ACCOUNT if you allocate" flag in the domain for any level of the
> API internals to pick up as appropriate, rather than propagate per-call gfp
> flags everywhere.

We might get to something like that, but it requires more parts that
are not ready yet. Most likely this would take the form of some kind
of 'this is an iommufd created domain' indication. This happens
naturally as part of the nesting patches.

Right now I want to get people to start testing with this because the
charge from the IOPTEs is far and away the largest memory draw. Parts
like fixing the iommu drivers to actually use gfp are necessary to
make it work.

If we flip the two places using KERNEL_ACCOUNT to something else later
it doesn't really matter. I think the removal of the two _atomic
wrappers is still appropriate stand-alone.

> As it stands we're still missing potential pagetable and other
> domain-related allocations by drivers in .attach_dev and even (in

Yes, I plan to get to those when we add an alloc_domain_iommufd() or
whatever op. The driver will know the calling context and can set the
gfp flags for any allocations made under alloc_domain at that time.

Then we can go and figure out if there are other allocations and if
all or only some drivers need a flag - eg at attach time. Though this
is less worrying because you can only scale attach up to num_pasids *
num open vfios.

iommufd will let userspace create and populate an unlimited number of
iommu_domains, so everything linked to an unattached iommu_domain
should be charged.

> probably-shouldn't-really-happen cases) .unmap_pages...

Gah, unmap_pages isn't allowed to fail. There is no way to recover from
this. iommufd will spew a warn and then have a small race where
userspace can UAF kernel memory.

I'd call such a driver implementation broken. Why would you need to do
this?? :(

Thanks,
Jason

2023-01-23 13:58:16

by Jason Gunthorpe

Subject: Re: [PATCH v2 04/10] iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous()

On Fri, Jan 20, 2023 at 07:28:19PM +0000, Robin Murphy wrote:

> Overall I'm starting to wonder if it might not be better to stick a "use
> GFP_KERNEL_ACCOUNT if you allocate" flag in the domain for any level of the
> API internals to pick up as appropriate, rather than propagate per-call gfp
> flags everywhere.

I was thinking about this some more, and I don't think hiding the
GFP_KERNEL_ACCOUNT in the iommu driver will be very maintainable.

GFP_KERNEL_ACCOUNT is sensitive to current, since that is where it
gets the cgroup from. If we start putting it in driver code directly
it becomes very hard to tell whether the call chains actually
originate from a syscall or not. I'd prefer we try to keep things so
that iommufd provides the GFP_KERNEL_ACCOUNT on a call-by-call basis,
where it is clearer which call chains originate from a system call and
which do not.

So, I think we will strive for adding a gfp flag to the future 'alloc
domain iommufd' and pass GFP_KERNEL_ACCOUNT there. Then we can see
what is left.

Jason