2023-11-28 20:50:09

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 00/16] IOMMU memory observability

From: Pasha Tatashin <[email protected]>

IOMMU subsystem may contain state that is in gigabytes. Majority of that
state is iommu page tables. Yet, there is currently, no way to observe
how much memory is actually used by the iommu subsystem.

This patch series solves this problem by adding both observability to
all pages that are allocated by IOMMU, and also accountability, so
admins can limit the amount if via cgroups.

The system-wide observability is using /proc/meminfo:
SecPageTables: 438176 kB

Contains IOMMU and KVM memory.

Per-node observability:
/sys/devices/system/node/nodeN/meminfo
Node N SecPageTables: 422204 kB

Contains IOMMU and KVM memory memory in the given NUMA node.

Per-node IOMMU only observability:
/sys/devices/system/node/nodeN/vmstat
nr_iommu_pages 105555

Contains number of pages IOMMU allocated in the given node.

Accountability: using sec_pagetables cgroup-v2 memory.stat entry.

With the change, iova_stress[1] stops as limit is reached:

# ./iova_stress
iova space: 0T free memory: 497G
iova space: 1T free memory: 495G
iova space: 2T free memory: 493G
iova space: 3T free memory: 491G

stops as limit is reached.

This series encorporates suggestions that came from the discussion
at LPC [2].

[1] https://github.com/soleen/iova_stress
[2] https://lpc.events/event/17/contributions/1466

Pasha Tatashin (16):
iommu/vt-d: add wrapper functions for page allocations
iommu/amd: use page allocation function provided by iommu-pages.h
iommu/io-pgtable-arm: use page allocation function provided by
iommu-pages.h
iommu/io-pgtable-dart: use page allocation function provided by
iommu-pages.h
iommu/io-pgtable-arm-v7s: use page allocation function provided by
iommu-pages.h
iommu/dma: use page allocation function provided by iommu-pages.h
iommu/exynos: use page allocation function provided by iommu-pages.h
iommu/fsl: use page allocation function provided by iommu-pages.h
iommu/iommufd: use page allocation function provided by iommu-pages.h
iommu/rockchip: use page allocation function provided by iommu-pages.h
iommu/sun50i: use page allocation function provided by iommu-pages.h
iommu/tegra-smmu: use page allocation function provided by
iommu-pages.h
iommu: observability of the IOMMU allocations
iommu: account IOMMU allocated memory
vhost-vdpa: account iommu allocations
vfio: account iommu allocations

Documentation/admin-guide/cgroup-v2.rst | 2 +-
Documentation/filesystems/proc.rst | 4 +-
drivers/iommu/amd/amd_iommu.h | 8 -
drivers/iommu/amd/init.c | 91 +++++-----
drivers/iommu/amd/io_pgtable.c | 13 +-
drivers/iommu/amd/io_pgtable_v2.c | 20 +-
drivers/iommu/amd/iommu.c | 13 +-
drivers/iommu/dma-iommu.c | 8 +-
drivers/iommu/exynos-iommu.c | 14 +-
drivers/iommu/fsl_pamu.c | 5 +-
drivers/iommu/intel/dmar.c | 10 +-
drivers/iommu/intel/iommu.c | 47 ++---
drivers/iommu/intel/iommu.h | 2 -
drivers/iommu/intel/irq_remapping.c | 10 +-
drivers/iommu/intel/pasid.c | 12 +-
drivers/iommu/intel/svm.c | 7 +-
drivers/iommu/io-pgtable-arm-v7s.c | 9 +-
drivers/iommu/io-pgtable-arm.c | 7 +-
drivers/iommu/io-pgtable-dart.c | 37 ++--
drivers/iommu/iommu-pages.h | 231 ++++++++++++++++++++++++
drivers/iommu/iommufd/iova_bitmap.c | 6 +-
drivers/iommu/rockchip-iommu.c | 14 +-
drivers/iommu/sun50i-iommu.c | 7 +-
drivers/iommu/tegra-smmu.c | 18 +-
drivers/vfio/vfio_iommu_type1.c | 8 +-
drivers/vhost/vdpa.c | 3 +-
include/linux/mmzone.h | 5 +-
mm/vmstat.c | 3 +
28 files changed, 415 insertions(+), 199 deletions(-)
create mode 100644 drivers/iommu/iommu-pages.h

--
2.43.0.rc2.451.g8631bc7472-goog


2023-11-28 20:50:11

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 01/16] iommu/vt-d: add wrapper functions for page allocations

In order to improve observability and accountability of IOMMU layer, we
must account the number of pages that are allocated by functions that
are calling directly into buddy allocator.

This is achieved by first wrapping the allocation related functions into a
separate inline functions in new file:

drivers/iommu/iommu-pages.h

Convert all page allocation calls under iommu/intel to use these new
functions.

Signed-off-by: Pasha Tatashin <[email protected]>
---
drivers/iommu/intel/dmar.c | 10 +-
drivers/iommu/intel/iommu.c | 47 +++----
drivers/iommu/intel/iommu.h | 2 -
drivers/iommu/intel/irq_remapping.c | 10 +-
drivers/iommu/intel/pasid.c | 12 +-
drivers/iommu/intel/svm.c | 7 +-
drivers/iommu/iommu-pages.h | 199 ++++++++++++++++++++++++++++
7 files changed, 236 insertions(+), 51 deletions(-)
create mode 100644 drivers/iommu/iommu-pages.h

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index a3414afe11b0..a6937e1e20a5 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -32,6 +32,7 @@

#include "iommu.h"
#include "../irq_remapping.h"
+#include "../iommu-pages.h"
#include "perf.h"
#include "trace.h"
#include "perfmon.h"
@@ -1185,7 +1186,7 @@ static void free_iommu(struct intel_iommu *iommu)
}

if (iommu->qi) {
- free_page((unsigned long)iommu->qi->desc);
+ iommu_free_page(iommu->qi->desc);
kfree(iommu->qi->desc_status);
kfree(iommu->qi);
}
@@ -1714,6 +1715,7 @@ int dmar_enable_qi(struct intel_iommu *iommu)
{
struct q_inval *qi;
struct page *desc_page;
+ int order;

if (!ecap_qis(iommu->ecap))
return -ENOENT;
@@ -1734,8 +1736,8 @@ int dmar_enable_qi(struct intel_iommu *iommu)
* Need two pages to accommodate 256 descriptors of 256 bits each
* if the remapping hardware supports scalable mode translation.
*/
- desc_page = alloc_pages_node(iommu->node, GFP_ATOMIC | __GFP_ZERO,
- !!ecap_smts(iommu->ecap));
+ order = ecap_smts(iommu->ecap) ? 1 : 0;
+ desc_page = __iommu_alloc_pages_node(iommu->node, GFP_ATOMIC, order);
if (!desc_page) {
kfree(qi);
iommu->qi = NULL;
@@ -1746,7 +1748,7 @@ int dmar_enable_qi(struct intel_iommu *iommu)

qi->desc_status = kcalloc(QI_LENGTH, sizeof(int), GFP_ATOMIC);
if (!qi->desc_status) {
- free_page((unsigned long) qi->desc);
+ iommu_free_page(qi->desc);
kfree(qi);
iommu->qi = NULL;
return -ENOMEM;
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 3531b956556c..04f852175cbe 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -28,6 +28,7 @@
#include "../dma-iommu.h"
#include "../irq_remapping.h"
#include "../iommu-sva.h"
+#include "../iommu-pages.h"
#include "pasid.h"
#include "cap_audit.h"
#include "perfmon.h"
@@ -367,22 +368,6 @@ static int __init intel_iommu_setup(char *str)
}
__setup("intel_iommu=", intel_iommu_setup);

-void *alloc_pgtable_page(int node, gfp_t gfp)
-{
- struct page *page;
- void *vaddr = NULL;
-
- page = alloc_pages_node(node, gfp | __GFP_ZERO, 0);
- if (page)
- vaddr = page_address(page);
- return vaddr;
-}
-
-void free_pgtable_page(void *vaddr)
-{
- free_page((unsigned long)vaddr);
-}
-
static inline int domain_type_is_si(struct dmar_domain *domain)
{
return domain->domain.type == IOMMU_DOMAIN_IDENTITY;
@@ -617,7 +602,7 @@ struct context_entry *iommu_context_addr(struct intel_iommu *iommu, u8 bus,
if (!alloc)
return NULL;

- context = alloc_pgtable_page(iommu->node, GFP_ATOMIC);
+ context = iommu_alloc_page_node(iommu->node, GFP_ATOMIC);
if (!context)
return NULL;

@@ -791,17 +776,17 @@ static void free_context_table(struct intel_iommu *iommu)
for (i = 0; i < ROOT_ENTRY_NR; i++) {
context = iommu_context_addr(iommu, i, 0, 0);
if (context)
- free_pgtable_page(context);
+ iommu_free_page(context);

if (!sm_supported(iommu))
continue;

context = iommu_context_addr(iommu, i, 0x80, 0);
if (context)
- free_pgtable_page(context);
+ iommu_free_page(context);
}

- free_pgtable_page(iommu->root_entry);
+ iommu_free_page(iommu->root_entry);
iommu->root_entry = NULL;
}

@@ -939,7 +924,7 @@ static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
if (!dma_pte_present(pte)) {
uint64_t pteval;

- tmp_page = alloc_pgtable_page(domain->nid, gfp);
+ tmp_page = iommu_alloc_page_node(domain->nid, gfp);

if (!tmp_page)
return NULL;
@@ -951,7 +936,7 @@ static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,

if (cmpxchg64(&pte->val, 0ULL, pteval))
/* Someone else set it while we were thinking; use theirs. */
- free_pgtable_page(tmp_page);
+ iommu_free_page(tmp_page);
else
domain_flush_cache(domain, pte, sizeof(*pte));
}
@@ -1064,7 +1049,7 @@ static void dma_pte_free_level(struct dmar_domain *domain, int level,
last_pfn < level_pfn + level_size(level) - 1)) {
dma_clear_pte(pte);
domain_flush_cache(domain, pte, sizeof(*pte));
- free_pgtable_page(level_pte);
+ iommu_free_page(level_pte);
}
next:
pfn += level_size(level);
@@ -1088,7 +1073,7 @@ static void dma_pte_free_pagetable(struct dmar_domain *domain,

/* free pgd */
if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {
- free_pgtable_page(domain->pgd);
+ iommu_free_page(domain->pgd);
domain->pgd = NULL;
}
}
@@ -1190,7 +1175,7 @@ static int iommu_alloc_root_entry(struct intel_iommu *iommu)
{
struct root_entry *root;

- root = alloc_pgtable_page(iommu->node, GFP_ATOMIC);
+ root = iommu_alloc_page_node(iommu->node, GFP_ATOMIC);
if (!root) {
pr_err("Allocating root entry for %s failed\n",
iommu->name);
@@ -1863,7 +1848,7 @@ static void domain_exit(struct dmar_domain *domain)
LIST_HEAD(freelist);

domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist);
- put_pages_list(&freelist);
+ iommu_free_pages_list(&freelist);
}

if (WARN_ON(!list_empty(&domain->devices)))
@@ -2637,7 +2622,7 @@ static int copy_context_table(struct intel_iommu *iommu,
if (!old_ce)
goto out;

- new_ce = alloc_pgtable_page(iommu->node, GFP_KERNEL);
+ new_ce = iommu_alloc_page_node(iommu->node, GFP_KERNEL);
if (!new_ce)
goto out_unmap;

@@ -3570,7 +3555,7 @@ static int intel_iommu_memory_notifier(struct notifier_block *nb,
start_vpfn, mhp->nr_pages,
list_empty(&freelist), 0);
rcu_read_unlock();
- put_pages_list(&freelist);
+ iommu_free_pages_list(&freelist);
}
break;
}
@@ -4001,7 +3986,7 @@ static int md_domain_init(struct dmar_domain *domain, int guest_width)
domain->max_addr = 0;

/* always allocate the top pgd */
- domain->pgd = alloc_pgtable_page(domain->nid, GFP_ATOMIC);
+ domain->pgd = iommu_alloc_page_node(domain->nid, GFP_ATOMIC);
if (!domain->pgd)
return -ENOMEM;
domain_flush_cache(domain, domain->pgd, PAGE_SIZE);
@@ -4148,7 +4133,7 @@ int prepare_domain_attach_device(struct iommu_domain *domain,
pte = dmar_domain->pgd;
if (dma_pte_present(pte)) {
dmar_domain->pgd = phys_to_virt(dma_pte_addr(pte));
- free_pgtable_page(pte);
+ iommu_free_page(pte);
}
dmar_domain->agaw--;
}
@@ -4295,7 +4280,7 @@ static void intel_iommu_tlb_sync(struct iommu_domain *domain,
start_pfn, nrpages,
list_empty(&gather->freelist), 0);

- put_pages_list(&gather->freelist);
+ iommu_free_pages_list(&gather->freelist);
}

static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 65d37a138c75..b505f3f44d0a 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -894,8 +894,6 @@ void domain_update_iommu_cap(struct dmar_domain *domain);

int dmar_ir_support(void);

-void *alloc_pgtable_page(int node, gfp_t gfp);
-void free_pgtable_page(void *vaddr);
void iommu_flush_write_buffer(struct intel_iommu *iommu);
struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devfn);
struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *parent,
diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index 29b9e55dcf26..72e1c1342c13 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -22,6 +22,7 @@

#include "iommu.h"
#include "../irq_remapping.h"
+#include "../iommu-pages.h"
#include "cap_audit.h"

enum irq_mode {
@@ -536,8 +537,8 @@ static int intel_setup_irq_remapping(struct intel_iommu *iommu)
if (!ir_table)
return -ENOMEM;

- pages = alloc_pages_node(iommu->node, GFP_KERNEL | __GFP_ZERO,
- INTR_REMAP_PAGE_ORDER);
+ pages = __iommu_alloc_pages_node(iommu->node, GFP_KERNEL,
+ INTR_REMAP_PAGE_ORDER);
if (!pages) {
pr_err("IR%d: failed to allocate pages of order %d\n",
iommu->seq_id, INTR_REMAP_PAGE_ORDER);
@@ -622,7 +623,7 @@ static int intel_setup_irq_remapping(struct intel_iommu *iommu)
out_free_bitmap:
bitmap_free(bitmap);
out_free_pages:
- __free_pages(pages, INTR_REMAP_PAGE_ORDER);
+ __iommu_free_pages(pages, INTR_REMAP_PAGE_ORDER);
out_free_table:
kfree(ir_table);

@@ -643,8 +644,7 @@ static void intel_teardown_irq_remapping(struct intel_iommu *iommu)
irq_domain_free_fwnode(fn);
iommu->ir_domain = NULL;
}
- free_pages((unsigned long)iommu->ir_table->base,
- INTR_REMAP_PAGE_ORDER);
+ iommu_free_pages(iommu->ir_table->base, INTR_REMAP_PAGE_ORDER);
bitmap_free(iommu->ir_table->bitmap);
kfree(iommu->ir_table);
iommu->ir_table = NULL;
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 74e8e4c17e81..1856e74bba78 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -20,6 +20,7 @@

#include "iommu.h"
#include "pasid.h"
+#include "../iommu-pages.h"

/*
* Intel IOMMU system wide PASID name space:
@@ -116,8 +117,7 @@ int intel_pasid_alloc_table(struct device *dev)

size = max_pasid >> (PASID_PDE_SHIFT - 3);
order = size ? get_order(size) : 0;
- pages = alloc_pages_node(info->iommu->node,
- GFP_KERNEL | __GFP_ZERO, order);
+ pages = __iommu_alloc_pages_node(info->iommu->node, GFP_KERNEL, order);
if (!pages) {
kfree(pasid_table);
return -ENOMEM;
@@ -154,10 +154,10 @@ void intel_pasid_free_table(struct device *dev)
max_pde = pasid_table->max_pasid >> PASID_PDE_SHIFT;
for (i = 0; i < max_pde; i++) {
table = get_pasid_table_from_pde(&dir[i]);
- free_pgtable_page(table);
+ iommu_free_page(table);
}

- free_pages((unsigned long)pasid_table->table, pasid_table->order);
+ iommu_free_pages(pasid_table->table, pasid_table->order);
kfree(pasid_table);
}

@@ -203,7 +203,7 @@ static struct pasid_entry *intel_pasid_get_entry(struct device *dev, u32 pasid)
retry:
entries = get_pasid_table_from_pde(&dir[dir_index]);
if (!entries) {
- entries = alloc_pgtable_page(info->iommu->node, GFP_ATOMIC);
+ entries = iommu_alloc_page_node(info->iommu->node, GFP_ATOMIC);
if (!entries)
return NULL;

@@ -215,7 +215,7 @@ static struct pasid_entry *intel_pasid_get_entry(struct device *dev, u32 pasid)
*/
if (cmpxchg64(&dir[dir_index].val, 0ULL,
(u64)virt_to_phys(entries) | PASID_PTE_PRESENT)) {
- free_pgtable_page(entries);
+ iommu_free_page(entries);
goto retry;
}
if (!ecap_coherent(info->iommu->ecap)) {
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 50a481c895b8..4cf8826b30e1 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -23,6 +23,7 @@
#include "pasid.h"
#include "perf.h"
#include "../iommu-sva.h"
+#include "../iommu-pages.h"
#include "trace.h"

static irqreturn_t prq_event_thread(int irq, void *d);
@@ -67,7 +68,7 @@ int intel_svm_enable_prq(struct intel_iommu *iommu)
struct page *pages;
int irq, ret;

- pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, PRQ_ORDER);
+ pages = __iommu_alloc_pages(GFP_KERNEL, PRQ_ORDER);
if (!pages) {
pr_warn("IOMMU: %s: Failed to allocate page request queue\n",
iommu->name);
@@ -118,7 +119,7 @@ int intel_svm_enable_prq(struct intel_iommu *iommu)
dmar_free_hwirq(irq);
iommu->pr_irq = 0;
free_prq:
- free_pages((unsigned long)iommu->prq, PRQ_ORDER);
+ iommu_free_pages(iommu->prq, PRQ_ORDER);
iommu->prq = NULL;

return ret;
@@ -141,7 +142,7 @@ int intel_svm_finish_prq(struct intel_iommu *iommu)
iommu->iopf_queue = NULL;
}

- free_pages((unsigned long)iommu->prq, PRQ_ORDER);
+ iommu_free_pages(iommu->prq, PRQ_ORDER);
iommu->prq = NULL;

return 0;
diff --git a/drivers/iommu/iommu-pages.h b/drivers/iommu/iommu-pages.h
new file mode 100644
index 000000000000..2332f807d514
--- /dev/null
+++ b/drivers/iommu/iommu-pages.h
@@ -0,0 +1,199 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2023, Google LLC.
+ * Pasha Tatashin <[email protected]>
+ */
+
+#ifndef __IOMMU_PAGES_H
+#define __IOMMU_PAGES_H
+
+#include <linux/vmstat.h>
+#include <linux/gfp.h>
+#include <linux/mm.h>
+
+/*
+ * All page allocation that are performed in the IOMMU subsystem must use one of
+ * the functions below. This is necessary for the proper accounting as IOMMU
+ * state can be rather large, i.e. multiple gigabytes in size.
+ */
+
+/**
+ * __iommu_alloc_pages_node - allocate a zeroed page of a given order from
+ * specific NUMA node.
+ * @nid: memory NUMA node id
+ * @gfp: buddy allocator flags
+ * @order: page order
+ *
+ * returns the head struct page of the allocated page.
+ */
+static inline struct page *__iommu_alloc_pages_node(int nid, gfp_t gfp,
+ int order)
+{
+ struct page *pages;
+
+ pages = alloc_pages_node(nid, gfp | __GFP_ZERO, order);
+ if (!pages)
+ return NULL;
+
+ return pages;
+}
+
+/**
+ * __iommu_alloc_pages - allocate a zeroed page of a given order.
+ * @gfp: buddy allocator flags
+ * @order: page order
+ *
+ * returns the head struct page of the allocated page.
+ */
+static inline struct page *__iommu_alloc_pages(gfp_t gfp, int order)
+{
+ struct page *pages;
+
+ pages = alloc_pages(gfp | __GFP_ZERO, order);
+ if (!pages)
+ return NULL;
+
+ return pages;
+}
+
+/**
+ * __iommu_alloc_page_node - allocate a zeroed page at specific NUMA node.
+ * @nid: memory NUMA node id
+ * @gfp: buddy allocator flags
+ *
+ * returns the struct page of the allocated page.
+ */
+static inline struct page *__iommu_alloc_page_node(int nid, gfp_t gfp)
+{
+ return __iommu_alloc_pages_node(nid, gfp, 0);
+}
+
+/**
+ * __iommu_alloc_page - allocate a zeroed page
+ * @gfp: buddy allocator flags
+ *
+ * returns the struct page of the allocated page.
+ */
+static inline struct page *__iommu_alloc_page(gfp_t gfp)
+{
+ return __iommu_alloc_pages(gfp, 0);
+}
+
+/**
+ * __iommu_free_pages - free page of a given order
+ * @pages: head struct page of the page
+ * @order: page order
+ */
+static inline void __iommu_free_pages(struct page *pages, int order)
+{
+ if (!pages)
+ return;
+
+ __free_pages(pages, order);
+}
+
+/**
+ * __iommu_free_page - free page
+ * @page: struct page of the page
+ */
+static inline void __iommu_free_page(struct page *page)
+{
+ __iommu_free_pages(page, 0);
+}
+
+/**
+ * iommu_alloc_pages_node - allocate a zeroed page of a given order from
+ * specific NUMA node.
+ * @nid: memory NUMA node id
+ * @gfp: buddy allocator flags
+ * @order: page order
+ *
+ * returns the virtual address of the allocated page
+ */
+static inline void *iommu_alloc_pages_node(int nid, gfp_t gfp, int order)
+{
+ struct page *pages = __iommu_alloc_pages_node(nid, gfp, order);
+
+ if (!pages)
+ return NULL;
+
+ return page_address(pages);
+}
+
+/**
+ * iommu_alloc_pages - allocate a zeroed page of a given order
+ * @gfp: buddy allocator flags
+ * @order: page order
+ *
+ * returns the virtual address of the allocated page
+ */
+static inline void *iommu_alloc_pages(gfp_t gfp, int order)
+{
+ struct page *pages = __iommu_alloc_pages(gfp, order);
+
+ if (!pages)
+ return NULL;
+
+ return page_address(pages);
+}
+
+/**
+ * iommu_alloc_page_node - allocate a zeroed page at specific NUMA node.
+ * @nid: memory NUMA node id
+ * @gfp: buddy allocator flags
+ *
+ * returns the virtual address of the allocated page
+ */
+static inline void *iommu_alloc_page_node(int nid, gfp_t gfp)
+{
+ return iommu_alloc_pages_node(nid, gfp, 0);
+}
+
+/**
+ * iommu_alloc_page - allocate a zeroed page
+ * @gfp: buddy allocator flags
+ *
+ * returns the virtual address of the allocated page
+ */
+static inline void *iommu_alloc_page(gfp_t gfp)
+{
+ return iommu_alloc_pages(gfp, 0);
+}
+
+/**
+ * iommu_free_pages - free page of a given order
+ * @virt: virtual address of the page to be freed.
+ * @order: page order
+ */
+static inline void iommu_free_pages(void *virt, int order)
+{
+ if (!virt)
+ return;
+
+ __iommu_free_pages(virt_to_page(virt), order);
+}
+
+/**
+ * iommu_free_page - free page
+ * @virt: virtual address of the page to be freed.
+ */
+static inline void iommu_free_page(void *virt)
+{
+ iommu_free_pages(virt, 0);
+}
+
+/**
+ * iommu_free_pages_list - free a list of pages.
+ * @pages: the head of the lru list to be freed.
+ */
+static inline void iommu_free_pages_list(struct list_head *pages)
+{
+ while (!list_empty(pages)) {
+ struct page *p = list_entry(pages->prev, struct page, lru);
+
+ list_del(&p->lru);
+ put_page(p);
+ }
+}
+
+#endif /* __IOMMU_PAGES_H */
--
2.43.0.rc2.451.g8631bc7472-goog

2023-11-28 20:50:14

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 05/16] iommu/io-pgtable-arm-v7s: use page allocation function provided by iommu-pages.h

Convert iommu/io-pgtable-arm-v7s.c to use the new page allocation functions
provided in iommu-pages.h.

Signed-off-by: Pasha Tatashin <[email protected]>
---
drivers/iommu/io-pgtable-arm-v7s.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm-v7s.c b/drivers/iommu/io-pgtable-arm-v7s.c
index 75f244a3e12d..3d494ca1f671 100644
--- a/drivers/iommu/io-pgtable-arm-v7s.c
+++ b/drivers/iommu/io-pgtable-arm-v7s.c
@@ -34,6 +34,7 @@
#include <linux/types.h>

#include <asm/barrier.h>
+#include "iommu-pages.h"

/* Struct accessors */
#define io_pgtable_to_data(x) \
@@ -255,7 +256,7 @@ static void *__arm_v7s_alloc_table(int lvl, gfp_t gfp,
GFP_KERNEL : ARM_V7S_TABLE_GFP_DMA;

if (lvl == 1)
- table = (void *)__get_free_pages(gfp_l1 | __GFP_ZERO, get_order(size));
+ table = iommu_alloc_pages(gfp_l1, get_order(size));
else if (lvl == 2)
table = kmem_cache_zalloc(data->l2_tables, gfp);

@@ -283,6 +284,7 @@ static void *__arm_v7s_alloc_table(int lvl, gfp_t gfp,
}
if (lvl == 2)
kmemleak_ignore(table);
+
return table;

out_unmap:
@@ -290,7 +292,7 @@ static void *__arm_v7s_alloc_table(int lvl, gfp_t gfp,
dma_unmap_single(dev, dma, size, DMA_TO_DEVICE);
out_free:
if (lvl == 1)
- free_pages((unsigned long)table, get_order(size));
+ iommu_free_pages(table, get_order(size));
else
kmem_cache_free(data->l2_tables, table);
return NULL;
@@ -306,8 +308,9 @@ static void __arm_v7s_free_table(void *table, int lvl,
if (!cfg->coherent_walk)
dma_unmap_single(dev, __arm_v7s_dma_addr(table), size,
DMA_TO_DEVICE);
+
if (lvl == 1)
- free_pages((unsigned long)table, get_order(size));
+ iommu_free_pages(table, get_order(size));
else
kmem_cache_free(data->l2_tables, table);
}
--
2.43.0.rc2.451.g8631bc7472-goog

2023-11-28 20:50:19

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 03/16] iommu/io-pgtable-arm: use page allocation function provided by iommu-pages.h

Convert iommu/io-pgtable-arm.c to use the new page allocation functions
provided in iommu-pages.h.

Signed-off-by: Pasha Tatashin <[email protected]>
---
drivers/iommu/io-pgtable-arm.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 72dcdd468cf3..21d315151ad6 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -21,6 +21,7 @@
#include <asm/barrier.h>

#include "io-pgtable-arm.h"
+#include "iommu-pages.h"

#define ARM_LPAE_MAX_ADDR_BITS 52
#define ARM_LPAE_S2_MAX_CONCAT_PAGES 16
@@ -197,7 +198,7 @@ static void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp,
void *pages;

VM_BUG_ON((gfp & __GFP_HIGHMEM));
- p = alloc_pages_node(dev_to_node(dev), gfp | __GFP_ZERO, order);
+ p = __iommu_alloc_pages_node(dev_to_node(dev), gfp, order);
if (!p)
return NULL;

@@ -221,7 +222,7 @@ static void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp,
dev_err(dev, "Cannot accommodate DMA translation for IOMMU page tables\n");
dma_unmap_single(dev, dma, size, DMA_TO_DEVICE);
out_free:
- __free_pages(p, order);
+ __iommu_free_pages(p, order);
return NULL;
}

@@ -231,7 +232,7 @@ static void __arm_lpae_free_pages(void *pages, size_t size,
if (!cfg->coherent_walk)
dma_unmap_single(cfg->iommu_dev, __arm_lpae_dma_addr(pages),
size, DMA_TO_DEVICE);
- free_pages((unsigned long)pages, get_order(size));
+ iommu_free_pages(pages, get_order(size));
}

static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep, int num_entries,
--
2.43.0.rc2.451.g8631bc7472-goog

2023-11-28 20:50:20

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 09/16] iommu/iommufd: use page allocation function provided by iommu-pages.h

Convert iommu/iommufd/* files to use the new page allocation functions
provided in iommu-pages.h.

Signed-off-by: Pasha Tatashin <[email protected]>
---
drivers/iommu/iommufd/iova_bitmap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommufd/iova_bitmap.c b/drivers/iommu/iommufd/iova_bitmap.c
index 0a92c9eeaf7f..6b8575b93f17 100644
--- a/drivers/iommu/iommufd/iova_bitmap.c
+++ b/drivers/iommu/iommufd/iova_bitmap.c
@@ -253,7 +253,7 @@ struct iova_bitmap *iova_bitmap_alloc(unsigned long iova, size_t length,
bitmap->iova = iova;
bitmap->length = length;
mapped->iova = iova;
- mapped->pages = (struct page **)__get_free_page(GFP_KERNEL);
+ mapped->pages = iommu_alloc_page(GFP_KERNEL);
if (!mapped->pages) {
rc = -ENOMEM;
goto err;
@@ -284,7 +284,7 @@ void iova_bitmap_free(struct iova_bitmap *bitmap)
iova_bitmap_put(bitmap);

if (mapped->pages) {
- free_page((unsigned long)mapped->pages);
+ iommu_free_page(mapped->pages);
mapped->pages = NULL;
}

--
2.43.0.rc2.451.g8631bc7472-goog

2023-11-28 20:50:27

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 06/16] iommu/dma: use page allocation function provided by iommu-pages.h

Convert iommu/dma-iommu.c to use the new page allocation functions
provided in iommu-pages.h.

Signed-off-by: Pasha Tatashin <[email protected]>
---
drivers/iommu/dma-iommu.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 85163a83df2f..822adad464c2 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -31,6 +31,7 @@
#include <linux/vmalloc.h>

#include "dma-iommu.h"
+#include "iommu-pages.h"

struct iommu_dma_msi_page {
struct list_head list;
@@ -874,7 +875,7 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
static void __iommu_dma_free_pages(struct page **pages, int count)
{
while (count--)
- __free_page(pages[count]);
+ __iommu_free_page(pages[count]);
kvfree(pages);
}

@@ -912,7 +913,8 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
order_size = 1U << order;
if (order_mask > order_size)
alloc_flags |= __GFP_NORETRY;
- page = alloc_pages_node(nid, alloc_flags, order);
+ page = __iommu_alloc_pages_node(nid, alloc_flags,
+ order);
if (!page)
continue;
if (order)
@@ -1572,7 +1574,7 @@ static void *iommu_dma_alloc_pages(struct device *dev, size_t size,

page = dma_alloc_contiguous(dev, alloc_size, gfp);
if (!page)
- page = alloc_pages_node(node, gfp, get_order(alloc_size));
+ page = __iommu_alloc_pages_node(node, gfp, get_order(alloc_size));
if (!page)
return NULL;

--
2.43.0.rc2.451.g8631bc7472-goog

2023-11-28 20:50:28

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 12/16] iommu/tegra-smmu: use page allocation function provided by iommu-pages.h

Convert iommu/tegra-smmu.c to use the new page allocation functions
provided in iommu-pages.h.

Signed-off-by: Pasha Tatashin <[email protected]>
---
drivers/iommu/tegra-smmu.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
index 310871728ab4..5e0730dc1b0e 100644
--- a/drivers/iommu/tegra-smmu.c
+++ b/drivers/iommu/tegra-smmu.c
@@ -19,6 +19,8 @@
#include <soc/tegra/ahb.h>
#include <soc/tegra/mc.h>

+#include "iommu-pages.h"
+
struct tegra_smmu_group {
struct list_head list;
struct tegra_smmu *smmu;
@@ -282,7 +284,7 @@ static struct iommu_domain *tegra_smmu_domain_alloc_paging(struct device *dev)

as->attr = SMMU_PD_READABLE | SMMU_PD_WRITABLE | SMMU_PD_NONSECURE;

- as->pd = alloc_page(GFP_KERNEL | __GFP_DMA | __GFP_ZERO);
+ as->pd = __iommu_alloc_page(GFP_KERNEL | __GFP_DMA);
if (!as->pd) {
kfree(as);
return NULL;
@@ -290,7 +292,7 @@ static struct iommu_domain *tegra_smmu_domain_alloc_paging(struct device *dev)

as->count = kcalloc(SMMU_NUM_PDE, sizeof(u32), GFP_KERNEL);
if (!as->count) {
- __free_page(as->pd);
+ __iommu_free_page(as->pd);
kfree(as);
return NULL;
}
@@ -298,7 +300,7 @@ static struct iommu_domain *tegra_smmu_domain_alloc_paging(struct device *dev)
as->pts = kcalloc(SMMU_NUM_PDE, sizeof(*as->pts), GFP_KERNEL);
if (!as->pts) {
kfree(as->count);
- __free_page(as->pd);
+ __iommu_free_page(as->pd);
kfree(as);
return NULL;
}
@@ -599,14 +601,14 @@ static u32 *as_get_pte(struct tegra_smmu_as *as, dma_addr_t iova,
dma = dma_map_page(smmu->dev, page, 0, SMMU_SIZE_PT,
DMA_TO_DEVICE);
if (dma_mapping_error(smmu->dev, dma)) {
- __free_page(page);
+ __iommu_free_page(page);
return NULL;
}

if (!smmu_dma_addr_valid(smmu, dma)) {
dma_unmap_page(smmu->dev, dma, SMMU_SIZE_PT,
DMA_TO_DEVICE);
- __free_page(page);
+ __iommu_free_page(page);
return NULL;
}

@@ -649,7 +651,7 @@ static void tegra_smmu_pte_put_use(struct tegra_smmu_as *as, unsigned long iova)
tegra_smmu_set_pde(as, iova, 0);

dma_unmap_page(smmu->dev, pte_dma, SMMU_SIZE_PT, DMA_TO_DEVICE);
- __free_page(page);
+ __iommu_free_page(page);
as->pts[pde] = NULL;
}
}
@@ -688,7 +690,7 @@ static struct page *as_get_pde_page(struct tegra_smmu_as *as,
if (gfpflags_allow_blocking(gfp))
spin_unlock_irqrestore(&as->lock, *flags);

- page = alloc_page(gfp | __GFP_DMA | __GFP_ZERO);
+ page = __iommu_alloc_page(gfp | __GFP_DMA);

if (gfpflags_allow_blocking(gfp))
spin_lock_irqsave(&as->lock, *flags);
@@ -700,7 +702,7 @@ static struct page *as_get_pde_page(struct tegra_smmu_as *as,
*/
if (as->pts[pde]) {
if (page)
- __free_page(page);
+ __iommu_free_page(page);

page = as->pts[pde];
}
--
2.43.0.rc2.451.g8631bc7472-goog

2023-11-28 21:00:21

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 13/16] iommu: observability of the IOMMU allocations

Add NR_IOMMU_PAGES into node_stat_item that counts number of pages
that are allocated by the IOMMU subsystem.

The allocations can be view per-node via:
/sys/devices/system/node/nodeN/vmstat.

For example:

$ grep iommu /sys/devices/system/node/node*/vmstat
/sys/devices/system/node/node0/vmstat:nr_iommu_pages 106025
/sys/devices/system/node/node1/vmstat:nr_iommu_pages 3464

The value is in page-count, therefore, in the above example
the iommu allocations amount to ~428M.

Signed-off-by: Pasha Tatashin <[email protected]>
---
drivers/iommu/iommu-pages.h | 30 ++++++++++++++++++++++++++++++
include/linux/mmzone.h | 3 +++
mm/vmstat.c | 3 +++
3 files changed, 36 insertions(+)

diff --git a/drivers/iommu/iommu-pages.h b/drivers/iommu/iommu-pages.h
index 2332f807d514..69895a355c0c 100644
--- a/drivers/iommu/iommu-pages.h
+++ b/drivers/iommu/iommu-pages.h
@@ -17,6 +17,30 @@
* state can be rather large, i.e. multiple gigabytes in size.
*/

+/**
+ * __iommu_alloc_account - account for newly allocated page.
+ * @pages: head struct page of the page.
+ * @order: order of the page
+ */
+static inline void __iommu_alloc_account(struct page *pages, int order)
+{
+ const long pgcnt = 1l << order;
+
+ mod_node_page_state(page_pgdat(pages), NR_IOMMU_PAGES, pgcnt);
+}
+
+/**
+ * __iommu_free_account - account a page that is about to be freed.
+ * @pages: head struct page of the page.
+ * @order: order of the page
+ */
+static inline void __iommu_free_account(struct page *pages, int order)
+{
+ const long pgcnt = 1l << order;
+
+ mod_node_page_state(page_pgdat(pages), NR_IOMMU_PAGES, -pgcnt);
+}
+
/**
* __iommu_alloc_pages_node - allocate a zeroed page of a given order from
* specific NUMA node.
@@ -35,6 +59,8 @@ static inline struct page *__iommu_alloc_pages_node(int nid, gfp_t gfp,
if (!pages)
return NULL;

+ __iommu_alloc_account(pages, order);
+
return pages;
}

@@ -53,6 +79,8 @@ static inline struct page *__iommu_alloc_pages(gfp_t gfp, int order)
if (!pages)
return NULL;

+ __iommu_alloc_account(pages, order);
+
return pages;
}

@@ -89,6 +117,7 @@ static inline void __iommu_free_pages(struct page *pages, int order)
if (!pages)
return;

+ __iommu_free_account(pages, order);
__free_pages(pages, order);
}

@@ -192,6 +221,7 @@ static inline void iommu_free_pages_list(struct list_head *pages)
struct page *p = list_entry(pages->prev, struct page, lru);

list_del(&p->lru);
+ __iommu_free_account(p, 0);
put_page(p);
}
}
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3c25226beeed..1a4d0bba3e8b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -200,6 +200,9 @@ enum node_stat_item {
#endif
NR_PAGETABLE, /* used for pagetables */
NR_SECONDARY_PAGETABLE, /* secondary pagetables, e.g. KVM pagetables */
+#ifdef CONFIG_IOMMU_SUPPORT
+ NR_IOMMU_PAGES, /* # of pages allocated by IOMMU */
+#endif
#ifdef CONFIG_SWAP
NR_SWAPCACHE,
#endif
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 359460deb377..801b58890b6c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1242,6 +1242,9 @@ const char * const vmstat_text[] = {
#endif
"nr_page_table_pages",
"nr_sec_page_table_pages",
+#ifdef CONFIG_IOMMU_SUPPORT
+ "nr_iommu_pages",
+#endif
#ifdef CONFIG_SWAP
"nr_swapcached",
#endif
--
2.43.0.rc2.451.g8631bc7472-goog

2023-11-28 21:00:24

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 11/16] iommu/sun50i: use page allocation function provided by iommu-pages.h

Convert iommu/sun50i-iommu.c to use the new page allocation functions
provided in iommu-pages.h.

Signed-off-by: Pasha Tatashin <[email protected]>
---
drivers/iommu/sun50i-iommu.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/sun50i-iommu.c b/drivers/iommu/sun50i-iommu.c
index 41484a5a399b..172ddb717eb5 100644
--- a/drivers/iommu/sun50i-iommu.c
+++ b/drivers/iommu/sun50i-iommu.c
@@ -26,6 +26,8 @@
#include <linux/spinlock.h>
#include <linux/types.h>

+#include "iommu-pages.h"
+
#define IOMMU_RESET_REG 0x010
#define IOMMU_RESET_RELEASE_ALL 0xffffffff
#define IOMMU_ENABLE_REG 0x020
@@ -679,8 +681,7 @@ sun50i_iommu_domain_alloc_paging(struct device *dev)
if (!sun50i_domain)
return NULL;

- sun50i_domain->dt = (u32 *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
- get_order(DT_SIZE));
+ sun50i_domain->dt = iommu_alloc_pages(GFP_KERNEL, get_order(DT_SIZE));
if (!sun50i_domain->dt)
goto err_free_domain;

@@ -702,7 +703,7 @@ static void sun50i_iommu_domain_free(struct iommu_domain *domain)
{
struct sun50i_iommu_domain *sun50i_domain = to_sun50i_domain(domain);

- free_pages((unsigned long)sun50i_domain->dt, get_order(DT_SIZE));
+ iommu_free_pages(sun50i_domain->dt, get_order(DT_SIZE));
sun50i_domain->dt = NULL;

kfree(sun50i_domain);
--
2.43.0.rc2.451.g8631bc7472-goog

2023-11-28 21:00:26

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 04/16] iommu/io-pgtable-dart: use page allocation function provided by iommu-pages.h

Convert iommu/io-pgtable-dart.c to use the new page allocation functions
provided in iommu-pages.h.

Signed-off-by: Pasha Tatashin <[email protected]>
---
drivers/iommu/io-pgtable-dart.c | 37 +++++++++++++--------------------
1 file changed, 14 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/io-pgtable-dart.c b/drivers/iommu/io-pgtable-dart.c
index 74b1ef2b96be..ad28031e1e93 100644
--- a/drivers/iommu/io-pgtable-dart.c
+++ b/drivers/iommu/io-pgtable-dart.c
@@ -23,6 +23,7 @@
#include <linux/types.h>

#include <asm/barrier.h>
+#include "iommu-pages.h"

#define DART1_MAX_ADDR_BITS 36

@@ -106,18 +107,12 @@ static phys_addr_t iopte_to_paddr(dart_iopte pte,
return paddr;
}

-static void *__dart_alloc_pages(size_t size, gfp_t gfp,
- struct io_pgtable_cfg *cfg)
+static void *__dart_alloc_pages(size_t size, gfp_t gfp)
{
int order = get_order(size);
- struct page *p;

VM_BUG_ON((gfp & __GFP_HIGHMEM));
- p = alloc_pages(gfp | __GFP_ZERO, order);
- if (!p)
- return NULL;
-
- return page_address(p);
+ return iommu_alloc_pages(gfp, order);
}

static int dart_init_pte(struct dart_io_pgtable *data,
@@ -262,13 +257,13 @@ static int dart_map_pages(struct io_pgtable_ops *ops, unsigned long iova,

/* no L2 table present */
if (!pte) {
- cptep = __dart_alloc_pages(tblsz, gfp, cfg);
+ cptep = __dart_alloc_pages(tblsz, gfp);
if (!cptep)
return -ENOMEM;

pte = dart_install_table(cptep, ptep, 0, data);
if (pte)
- free_pages((unsigned long)cptep, get_order(tblsz));
+ iommu_free_pages(cptep, get_order(tblsz));

/* L2 table is present (now) */
pte = READ_ONCE(*ptep);
@@ -419,8 +414,7 @@ apple_dart_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
cfg->apple_dart_cfg.n_ttbrs = 1 << data->tbl_bits;

for (i = 0; i < cfg->apple_dart_cfg.n_ttbrs; ++i) {
- data->pgd[i] = __dart_alloc_pages(DART_GRANULE(data), GFP_KERNEL,
- cfg);
+ data->pgd[i] = __dart_alloc_pages(DART_GRANULE(data), GFP_KERNEL);
if (!data->pgd[i])
goto out_free_data;
cfg->apple_dart_cfg.ttbr[i] = virt_to_phys(data->pgd[i]);
@@ -429,9 +423,10 @@ apple_dart_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
return &data->iop;

out_free_data:
- while (--i >= 0)
- free_pages((unsigned long)data->pgd[i],
- get_order(DART_GRANULE(data)));
+ while (--i >= 0) {
+ iommu_free_pages(data->pgd[i],
+ get_order(DART_GRANULE(data)));
+ }
kfree(data);
return NULL;
}
@@ -439,6 +434,7 @@ apple_dart_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
static void apple_dart_free_pgtable(struct io_pgtable *iop)
{
struct dart_io_pgtable *data = io_pgtable_to_data(iop);
+ int order = get_order(DART_GRANULE(data));
dart_iopte *ptep, *end;
int i;

@@ -449,15 +445,10 @@ static void apple_dart_free_pgtable(struct io_pgtable *iop)
while (ptep != end) {
dart_iopte pte = *ptep++;

- if (pte) {
- unsigned long page =
- (unsigned long)iopte_deref(pte, data);
-
- free_pages(page, get_order(DART_GRANULE(data)));
- }
+ if (pte)
+ iommu_free_pages(iopte_deref(pte, data), order);
}
- free_pages((unsigned long)data->pgd[i],
- get_order(DART_GRANULE(data)));
+ iommu_free_pages(data->pgd[i], order);
}

kfree(data);
--
2.43.0.rc2.451.g8631bc7472-goog

2023-11-28 21:02:03

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 14/16] iommu: account IOMMU allocated memory

In order to be able to limit the amount of memory that is allocated
by IOMMU subsystem, the memory must be accounted.

Account IOMMU as part of the secondary pagetables as it was discussed
at LPC.

The value of SecPageTables now contains mmeory allocation by IOMMU
and KVM.

Signed-off-by: Pasha Tatashin <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 2 +-
Documentation/filesystems/proc.rst | 4 ++--
drivers/iommu/iommu-pages.h | 2 ++
include/linux/mmzone.h | 2 +-
4 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3f85254f3cef..e004e05a7cde 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1418,7 +1418,7 @@ PAGE_SIZE multiple when read back.
sec_pagetables
Amount of memory allocated for secondary page tables,
this currently includes KVM mmu allocations on x86
- and arm64.
+ and arm64 and IOMMU page tables.

percpu (npn)
Amount of memory used for storing per-cpu kernel
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 49ef12df631b..86f137a9b66b 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1110,8 +1110,8 @@ KernelStack
PageTables
Memory consumed by userspace page tables
SecPageTables
- Memory consumed by secondary page tables, this currently
- currently includes KVM mmu allocations on x86 and arm64.
+ Memory consumed by secondary page tables, this currently includes
+ KVM mmu and IOMMU allocations on x86 and arm64.
NFS_Unstable
Always zero. Previous counted pages which had been written to
the server, but has not been committed to stable storage.
diff --git a/drivers/iommu/iommu-pages.h b/drivers/iommu/iommu-pages.h
index 69895a355c0c..cdd257585284 100644
--- a/drivers/iommu/iommu-pages.h
+++ b/drivers/iommu/iommu-pages.h
@@ -27,6 +27,7 @@ static inline void __iommu_alloc_account(struct page *pages, int order)
const long pgcnt = 1l << order;

mod_node_page_state(page_pgdat(pages), NR_IOMMU_PAGES, pgcnt);
+ mod_lruvec_page_state(pages, NR_SECONDARY_PAGETABLE, pgcnt);
}

/**
@@ -39,6 +40,7 @@ static inline void __iommu_free_account(struct page *pages, int order)
const long pgcnt = 1l << order;

mod_node_page_state(page_pgdat(pages), NR_IOMMU_PAGES, -pgcnt);
+ mod_lruvec_page_state(pages, NR_SECONDARY_PAGETABLE, -pgcnt);
}

/**
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1a4d0bba3e8b..aaabb385663c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -199,7 +199,7 @@ enum node_stat_item {
NR_KERNEL_SCS_KB, /* measured in KiB */
#endif
NR_PAGETABLE, /* used for pagetables */
- NR_SECONDARY_PAGETABLE, /* secondary pagetables, e.g. KVM pagetables */
+ NR_SECONDARY_PAGETABLE, /* secondary pagetables, KVM & IOMMU */
#ifdef CONFIG_IOMMU_SUPPORT
NR_IOMMU_PAGES, /* # of pages allocated by IOMMU */
#endif
--
2.43.0.rc2.451.g8631bc7472-goog

2023-11-28 21:12:00

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 02/16] iommu/amd: use page allocation function provided by iommu-pages.h

Convert iommu/amd/* files to use the new page allocation functions
provided in iommu-pages.h.

Signed-off-by: Pasha Tatashin <[email protected]>
---
drivers/iommu/amd/amd_iommu.h | 8 ---
drivers/iommu/amd/init.c | 91 ++++++++++++++-----------------
drivers/iommu/amd/io_pgtable.c | 13 +++--
drivers/iommu/amd/io_pgtable_v2.c | 20 +++----
drivers/iommu/amd/iommu.c | 13 +++--
5 files changed, 64 insertions(+), 81 deletions(-)

diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index 86be1edd50ee..bf697d566e0b 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -136,14 +136,6 @@ static inline int get_pci_sbdf_id(struct pci_dev *pdev)
return PCI_SEG_DEVID_TO_SBDF(seg, devid);
}

-static inline void *alloc_pgtable_page(int nid, gfp_t gfp)
-{
- struct page *page;
-
- page = alloc_pages_node(nid, gfp | __GFP_ZERO, 0);
- return page ? page_address(page) : NULL;
-}
-
bool translation_pre_enabled(struct amd_iommu *iommu);
bool amd_iommu_is_attach_deferred(struct device *dev);
int __init add_special_device(u8 type, u8 id, u32 *devid, bool cmd_line);
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 64bcf3df37ee..5b8a80fc7e50 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -35,6 +35,7 @@

#include "amd_iommu.h"
#include "../irq_remapping.h"
+#include "../iommu-pages.h"

/*
* definitions for the ACPI scanning code
@@ -648,8 +649,8 @@ static int __init find_last_devid_acpi(struct acpi_table_header *table, u16 pci_
/* Allocate per PCI segment device table */
static inline int __init alloc_dev_table(struct amd_iommu_pci_seg *pci_seg)
{
- pci_seg->dev_table = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO | GFP_DMA32,
- get_order(pci_seg->dev_table_size));
+ pci_seg->dev_table = iommu_alloc_pages(GFP_KERNEL | GFP_DMA32,
+ get_order(pci_seg->dev_table_size));
if (!pci_seg->dev_table)
return -ENOMEM;

@@ -658,17 +659,16 @@ static inline int __init alloc_dev_table(struct amd_iommu_pci_seg *pci_seg)

static inline void free_dev_table(struct amd_iommu_pci_seg *pci_seg)
{
- free_pages((unsigned long)pci_seg->dev_table,
- get_order(pci_seg->dev_table_size));
+ iommu_free_pages(pci_seg->dev_table,
+ get_order(pci_seg->dev_table_size));
pci_seg->dev_table = NULL;
}

/* Allocate per PCI segment IOMMU rlookup table. */
static inline int __init alloc_rlookup_table(struct amd_iommu_pci_seg *pci_seg)
{
- pci_seg->rlookup_table = (void *)__get_free_pages(
- GFP_KERNEL | __GFP_ZERO,
- get_order(pci_seg->rlookup_table_size));
+ pci_seg->rlookup_table = iommu_alloc_pages(GFP_KERNEL,
+ get_order(pci_seg->rlookup_table_size));
if (pci_seg->rlookup_table == NULL)
return -ENOMEM;

@@ -677,16 +677,15 @@ static inline int __init alloc_rlookup_table(struct amd_iommu_pci_seg *pci_seg)

static inline void free_rlookup_table(struct amd_iommu_pci_seg *pci_seg)
{
- free_pages((unsigned long)pci_seg->rlookup_table,
- get_order(pci_seg->rlookup_table_size));
+ iommu_free_pages(pci_seg->rlookup_table,
+ get_order(pci_seg->rlookup_table_size));
pci_seg->rlookup_table = NULL;
}

static inline int __init alloc_irq_lookup_table(struct amd_iommu_pci_seg *pci_seg)
{
- pci_seg->irq_lookup_table = (void *)__get_free_pages(
- GFP_KERNEL | __GFP_ZERO,
- get_order(pci_seg->rlookup_table_size));
+ pci_seg->irq_lookup_table = iommu_alloc_pages(GFP_KERNEL,
+ get_order(pci_seg->rlookup_table_size));
kmemleak_alloc(pci_seg->irq_lookup_table,
pci_seg->rlookup_table_size, 1, GFP_KERNEL);
if (pci_seg->irq_lookup_table == NULL)
@@ -698,8 +697,8 @@ static inline int __init alloc_irq_lookup_table(struct amd_iommu_pci_seg *pci_se
static inline void free_irq_lookup_table(struct amd_iommu_pci_seg *pci_seg)
{
kmemleak_free(pci_seg->irq_lookup_table);
- free_pages((unsigned long)pci_seg->irq_lookup_table,
- get_order(pci_seg->rlookup_table_size));
+ iommu_free_pages(pci_seg->irq_lookup_table,
+ get_order(pci_seg->rlookup_table_size));
pci_seg->irq_lookup_table = NULL;
}

@@ -707,8 +706,8 @@ static int __init alloc_alias_table(struct amd_iommu_pci_seg *pci_seg)
{
int i;

- pci_seg->alias_table = (void *)__get_free_pages(GFP_KERNEL,
- get_order(pci_seg->alias_table_size));
+ pci_seg->alias_table = iommu_alloc_pages(GFP_KERNEL,
+ get_order(pci_seg->alias_table_size));
if (!pci_seg->alias_table)
return -ENOMEM;

@@ -723,8 +722,8 @@ static int __init alloc_alias_table(struct amd_iommu_pci_seg *pci_seg)

static void __init free_alias_table(struct amd_iommu_pci_seg *pci_seg)
{
- free_pages((unsigned long)pci_seg->alias_table,
- get_order(pci_seg->alias_table_size));
+ iommu_free_pages(pci_seg->alias_table,
+ get_order(pci_seg->alias_table_size));
pci_seg->alias_table = NULL;
}

@@ -735,8 +734,8 @@ static void __init free_alias_table(struct amd_iommu_pci_seg *pci_seg)
*/
static int __init alloc_command_buffer(struct amd_iommu *iommu)
{
- iommu->cmd_buf = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
- get_order(CMD_BUFFER_SIZE));
+ iommu->cmd_buf = iommu_alloc_pages(GFP_KERNEL,
+ get_order(CMD_BUFFER_SIZE));

return iommu->cmd_buf ? 0 : -ENOMEM;
}
@@ -844,19 +843,19 @@ static void iommu_disable_command_buffer(struct amd_iommu *iommu)

static void __init free_command_buffer(struct amd_iommu *iommu)
{
- free_pages((unsigned long)iommu->cmd_buf, get_order(CMD_BUFFER_SIZE));
+ iommu_free_pages(iommu->cmd_buf, get_order(CMD_BUFFER_SIZE));
}

static void *__init iommu_alloc_4k_pages(struct amd_iommu *iommu,
gfp_t gfp, size_t size)
{
int order = get_order(size);
- void *buf = (void *)__get_free_pages(gfp, order);
+ void *buf = iommu_alloc_pages(gfp, order);

if (buf &&
check_feature(FEATURE_SNP) &&
set_memory_4k((unsigned long)buf, (1 << order))) {
- free_pages((unsigned long)buf, order);
+ iommu_free_pages(buf, order);
buf = NULL;
}

@@ -866,7 +865,7 @@ static void *__init iommu_alloc_4k_pages(struct amd_iommu *iommu,
/* allocates the memory where the IOMMU will log its events to */
static int __init alloc_event_buffer(struct amd_iommu *iommu)
{
- iommu->evt_buf = iommu_alloc_4k_pages(iommu, GFP_KERNEL | __GFP_ZERO,
+ iommu->evt_buf = iommu_alloc_4k_pages(iommu, GFP_KERNEL,
EVT_BUFFER_SIZE);

return iommu->evt_buf ? 0 : -ENOMEM;
@@ -900,14 +899,13 @@ static void iommu_disable_event_buffer(struct amd_iommu *iommu)

static void __init free_event_buffer(struct amd_iommu *iommu)
{
- free_pages((unsigned long)iommu->evt_buf, get_order(EVT_BUFFER_SIZE));
+ iommu_free_pages(iommu->evt_buf, get_order(EVT_BUFFER_SIZE));
}

/* allocates the memory where the IOMMU will log its events to */
static int __init alloc_ppr_log(struct amd_iommu *iommu)
{
- iommu->ppr_log = iommu_alloc_4k_pages(iommu, GFP_KERNEL | __GFP_ZERO,
- PPR_LOG_SIZE);
+ iommu->ppr_log = iommu_alloc_4k_pages(iommu, GFP_KERNEL, PPR_LOG_SIZE);

return iommu->ppr_log ? 0 : -ENOMEM;
}
@@ -936,14 +934,14 @@ static void iommu_enable_ppr_log(struct amd_iommu *iommu)

static void __init free_ppr_log(struct amd_iommu *iommu)
{
- free_pages((unsigned long)iommu->ppr_log, get_order(PPR_LOG_SIZE));
+ iommu_free_pages(iommu->ppr_log, get_order(PPR_LOG_SIZE));
}

static void free_ga_log(struct amd_iommu *iommu)
{
#ifdef CONFIG_IRQ_REMAP
- free_pages((unsigned long)iommu->ga_log, get_order(GA_LOG_SIZE));
- free_pages((unsigned long)iommu->ga_log_tail, get_order(8));
+ iommu_free_pages(iommu->ga_log, get_order(GA_LOG_SIZE));
+ iommu_free_pages(iommu->ga_log_tail, get_order(8));
#endif
}

@@ -988,13 +986,11 @@ static int iommu_init_ga_log(struct amd_iommu *iommu)
if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir))
return 0;

- iommu->ga_log = (u8 *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
- get_order(GA_LOG_SIZE));
+ iommu->ga_log = iommu_alloc_pages(GFP_KERNEL, get_order(GA_LOG_SIZE));
if (!iommu->ga_log)
goto err_out;

- iommu->ga_log_tail = (u8 *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
- get_order(8));
+ iommu->ga_log_tail = iommu_alloc_pages(GFP_KERNEL, get_order(8));
if (!iommu->ga_log_tail)
goto err_out;

@@ -1007,7 +1003,7 @@ static int iommu_init_ga_log(struct amd_iommu *iommu)

static int __init alloc_cwwb_sem(struct amd_iommu *iommu)
{
- iommu->cmd_sem = iommu_alloc_4k_pages(iommu, GFP_KERNEL | __GFP_ZERO, 1);
+ iommu->cmd_sem = iommu_alloc_4k_pages(iommu, GFP_KERNEL, 1);

return iommu->cmd_sem ? 0 : -ENOMEM;
}
@@ -1015,7 +1011,7 @@ static int __init alloc_cwwb_sem(struct amd_iommu *iommu)
static void __init free_cwwb_sem(struct amd_iommu *iommu)
{
if (iommu->cmd_sem)
- free_page((unsigned long)iommu->cmd_sem);
+ iommu_free_page((void *)iommu->cmd_sem);
}

static void iommu_enable_xt(struct amd_iommu *iommu)
@@ -1080,7 +1076,6 @@ static bool __copy_device_table(struct amd_iommu *iommu)
u32 lo, hi, devid, old_devtb_size;
phys_addr_t old_devtb_phys;
u16 dom_id, dte_v, irq_v;
- gfp_t gfp_flag;
u64 tmp;

/* Each IOMMU use separate device table with the same size */
@@ -1114,9 +1109,8 @@ static bool __copy_device_table(struct amd_iommu *iommu)
if (!old_devtb)
return false;

- gfp_flag = GFP_KERNEL | __GFP_ZERO | GFP_DMA32;
- pci_seg->old_dev_tbl_cpy = (void *)__get_free_pages(gfp_flag,
- get_order(pci_seg->dev_table_size));
+ pci_seg->old_dev_tbl_cpy = iommu_alloc_pages(GFP_KERNEL | GFP_DMA32,
+ get_order(pci_seg->dev_table_size));
if (pci_seg->old_dev_tbl_cpy == NULL) {
pr_err("Failed to allocate memory for copying old device table!\n");
memunmap(old_devtb);
@@ -2800,8 +2794,8 @@ static void early_enable_iommus(void)

for_each_pci_segment(pci_seg) {
if (pci_seg->old_dev_tbl_cpy != NULL) {
- free_pages((unsigned long)pci_seg->old_dev_tbl_cpy,
- get_order(pci_seg->dev_table_size));
+ iommu_free_pages(pci_seg->old_dev_tbl_cpy,
+ get_order(pci_seg->dev_table_size));
pci_seg->old_dev_tbl_cpy = NULL;
}
}
@@ -2814,8 +2808,8 @@ static void early_enable_iommus(void)
pr_info("Copied DEV table from previous kernel.\n");

for_each_pci_segment(pci_seg) {
- free_pages((unsigned long)pci_seg->dev_table,
- get_order(pci_seg->dev_table_size));
+ iommu_free_pages(pci_seg->dev_table,
+ get_order(pci_seg->dev_table_size));
pci_seg->dev_table = pci_seg->old_dev_tbl_cpy;
}

@@ -3018,8 +3012,8 @@ static bool __init check_ioapic_information(void)

static void __init free_dma_resources(void)
{
- free_pages((unsigned long)amd_iommu_pd_alloc_bitmap,
- get_order(MAX_DOMAIN_ID/8));
+ iommu_free_pages(amd_iommu_pd_alloc_bitmap,
+ get_order(MAX_DOMAIN_ID / 8));
amd_iommu_pd_alloc_bitmap = NULL;

free_unity_maps();
@@ -3091,9 +3085,8 @@ static int __init early_amd_iommu_init(void)
/* Device table - directly used by all IOMMUs */
ret = -ENOMEM;

- amd_iommu_pd_alloc_bitmap = (void *)__get_free_pages(
- GFP_KERNEL | __GFP_ZERO,
- get_order(MAX_DOMAIN_ID/8));
+ amd_iommu_pd_alloc_bitmap = iommu_alloc_pages(GFP_KERNEL,
+ get_order(MAX_DOMAIN_ID / 8));
if (amd_iommu_pd_alloc_bitmap == NULL)
goto out;

diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
index 6c0621f6f572..f8b7d4c39a9f 100644
--- a/drivers/iommu/amd/io_pgtable.c
+++ b/drivers/iommu/amd/io_pgtable.c
@@ -22,6 +22,7 @@

#include "amd_iommu_types.h"
#include "amd_iommu.h"
+#include "../iommu-pages.h"

static void v1_tlb_flush_all(void *cookie)
{
@@ -156,7 +157,7 @@ static bool increase_address_space(struct protection_domain *domain,
bool ret = true;
u64 *pte;

- pte = alloc_pgtable_page(domain->nid, gfp);
+ pte = iommu_alloc_page_node(domain->nid, gfp);
if (!pte)
return false;

@@ -187,7 +188,7 @@ static bool increase_address_space(struct protection_domain *domain,

out:
spin_unlock_irqrestore(&domain->lock, flags);
- free_page((unsigned long)pte);
+ iommu_free_page(pte);

return ret;
}
@@ -250,7 +251,7 @@ static u64 *alloc_pte(struct protection_domain *domain,

if (!IOMMU_PTE_PRESENT(__pte) ||
pte_level == PAGE_MODE_NONE) {
- page = alloc_pgtable_page(domain->nid, gfp);
+ page = iommu_alloc_page_node(domain->nid, gfp);

if (!page)
return NULL;
@@ -259,7 +260,7 @@ static u64 *alloc_pte(struct protection_domain *domain,

/* pte could have been changed somewhere. */
if (!try_cmpxchg64(pte, &__pte, __npte))
- free_page((unsigned long)page);
+ iommu_free_page(page);
else if (IOMMU_PTE_PRESENT(__pte))
*updated = true;

@@ -430,7 +431,7 @@ static int iommu_v1_map_pages(struct io_pgtable_ops *ops, unsigned long iova,
}

/* Everything flushed out, free pages now */
- put_pages_list(&freelist);
+ iommu_free_pages_list(&freelist);

return ret;
}
@@ -579,7 +580,7 @@ static void v1_free_pgtable(struct io_pgtable *iop)
/* Make changes visible to IOMMUs */
amd_iommu_domain_update(dom);

- put_pages_list(&freelist);
+ iommu_free_pages_list(&freelist);
}

static struct io_pgtable *v1_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
diff --git a/drivers/iommu/amd/io_pgtable_v2.c b/drivers/iommu/amd/io_pgtable_v2.c
index f818a7e254d4..1e08dab93686 100644
--- a/drivers/iommu/amd/io_pgtable_v2.c
+++ b/drivers/iommu/amd/io_pgtable_v2.c
@@ -18,6 +18,7 @@

#include "amd_iommu_types.h"
#include "amd_iommu.h"
+#include "../iommu-pages.h"

#define IOMMU_PAGE_PRESENT BIT_ULL(0) /* Is present */
#define IOMMU_PAGE_RW BIT_ULL(1) /* Writeable */
@@ -99,11 +100,6 @@ static inline int page_size_to_level(u64 pg_size)
return PAGE_MODE_1_LEVEL;
}

-static inline void free_pgtable_page(u64 *pt)
-{
- free_page((unsigned long)pt);
-}
-
static void free_pgtable(u64 *pt, int level)
{
u64 *p;
@@ -125,10 +121,10 @@ static void free_pgtable(u64 *pt, int level)
if (level > 2)
free_pgtable(p, level - 1);
else
- free_pgtable_page(p);
+ iommu_free_page(p);
}

- free_pgtable_page(pt);
+ iommu_free_page(pt);
}

/* Allocate page table */
@@ -156,14 +152,14 @@ static u64 *v2_alloc_pte(int nid, u64 *pgd, unsigned long iova,
}

if (!IOMMU_PTE_PRESENT(__pte)) {
- page = alloc_pgtable_page(nid, gfp);
+ page = iommu_alloc_page_node(nid, gfp);
if (!page)
return NULL;

__npte = set_pgtable_attr(page);
/* pte could have been changed somewhere. */
if (cmpxchg64(pte, __pte, __npte) != __pte)
- free_pgtable_page(page);
+ iommu_free_page(page);
else if (IOMMU_PTE_PRESENT(__pte))
*updated = true;

@@ -185,7 +181,7 @@ static u64 *v2_alloc_pte(int nid, u64 *pgd, unsigned long iova,
if (pg_size == IOMMU_PAGE_SIZE_1G)
free_pgtable(__pte, end_level - 1);
else if (pg_size == IOMMU_PAGE_SIZE_2M)
- free_pgtable_page(__pte);
+ iommu_free_page(__pte);
}

return pte;
@@ -380,7 +376,7 @@ static struct io_pgtable *v2_alloc_pgtable(struct io_pgtable_cfg *cfg, void *coo
int ret;
int ias = IOMMU_IN_ADDR_BIT_SIZE;

- pgtable->pgd = alloc_pgtable_page(pdom->nid, GFP_ATOMIC);
+ pgtable->pgd = iommu_alloc_page_node(pdom->nid, GFP_ATOMIC);
if (!pgtable->pgd)
return NULL;

@@ -403,7 +399,7 @@ static struct io_pgtable *v2_alloc_pgtable(struct io_pgtable_cfg *cfg, void *coo
return &pgtable->iop;

err_free_pgd:
- free_pgtable_page(pgtable->pgd);
+ iommu_free_page(pgtable->pgd);

return NULL;
}
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index fcc987f5d4ed..9a228a95da0e 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -42,6 +42,7 @@
#include "amd_iommu.h"
#include "../dma-iommu.h"
#include "../irq_remapping.h"
+#include "../iommu-pages.h"

#define CMD_SET_TYPE(cmd, t) ((cmd)->data[1] |= ((t) << 28))

@@ -1642,7 +1643,7 @@ static void free_gcr3_tbl_level1(u64 *tbl)

ptr = iommu_phys_to_virt(tbl[i] & PAGE_MASK);

- free_page((unsigned long)ptr);
+ iommu_free_page(ptr);
}
}

@@ -1670,7 +1671,7 @@ static void free_gcr3_table(struct protection_domain *domain)
else
BUG_ON(domain->glx != 0);

- free_page((unsigned long)domain->gcr3_tbl);
+ iommu_free_page(domain->gcr3_tbl);
}

/*
@@ -1697,7 +1698,7 @@ static int setup_gcr3_table(struct protection_domain *domain, int pasids)
if (levels > amd_iommu_max_glx_val)
return -EINVAL;

- domain->gcr3_tbl = alloc_pgtable_page(domain->nid, GFP_ATOMIC);
+ domain->gcr3_tbl = iommu_alloc_page_node(domain->nid, GFP_ATOMIC);
if (domain->gcr3_tbl == NULL)
return -ENOMEM;

@@ -2092,7 +2093,7 @@ static void protection_domain_free(struct protection_domain *domain)
free_gcr3_table(domain);

if (domain->iop.root)
- free_page((unsigned long)domain->iop.root);
+ iommu_free_page(domain->iop.root);

if (domain->id)
domain_id_free(domain->id);
@@ -2107,7 +2108,7 @@ static int protection_domain_init_v1(struct protection_domain *domain, int mode)
BUG_ON(mode < PAGE_MODE_NONE || mode > PAGE_MODE_6_LEVEL);

if (mode != PAGE_MODE_NONE) {
- pt_root = (void *)get_zeroed_page(GFP_KERNEL);
+ pt_root = iommu_alloc_page(GFP_KERNEL);
if (!pt_root)
return -ENOMEM;
}
@@ -2783,7 +2784,7 @@ static u64 *__get_gcr3_pte(u64 *root, int level, u32 pasid, bool alloc)
if (!alloc)
return NULL;

- root = (void *)get_zeroed_page(GFP_ATOMIC);
+ root = iommu_alloc_page(GFP_ATOMIC);
if (root == NULL)
return NULL;

--
2.43.0.rc2.451.g8631bc7472-goog

2023-11-28 21:12:43

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 15/16] vhost-vdpa: account iommu allocations

iommu allocations should be accounted in order to allow admins to
monitor and limit the amount of iommu memory.

Signed-off-by: Pasha Tatashin <[email protected]>
---
drivers/vhost/vdpa.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index da7ec77cdaff..a51c69c078d9 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -968,7 +968,8 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
r = ops->set_map(vdpa, asid, iotlb);
} else {
r = iommu_map(v->domain, iova, pa, size,
- perm_to_iommu_flags(perm), GFP_KERNEL);
+ perm_to_iommu_flags(perm),
+ GFP_KERNEL_ACCOUNT);
}
if (r) {
vhost_iotlb_del_range(iotlb, iova, iova + size - 1);
--
2.43.0.rc2.451.g8631bc7472-goog

2023-11-28 21:16:55

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 16/16] vfio: account iommu allocations

iommu allocations should be accounted in order to allow admins to
monitor and limit the amount of iommu memory.

Signed-off-by: Pasha Tatashin <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index eacd6ec04de5..b2854d7939ce 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1436,7 +1436,7 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
list_for_each_entry(d, &iommu->domain_list, next) {
ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
npage << PAGE_SHIFT, prot | IOMMU_CACHE,
- GFP_KERNEL);
+ GFP_KERNEL_ACCOUNT);
if (ret)
goto unwind;

@@ -1750,7 +1750,8 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
}

ret = iommu_map(domain->domain, iova, phys, size,
- dma->prot | IOMMU_CACHE, GFP_KERNEL);
+ dma->prot | IOMMU_CACHE,
+ GFP_KERNEL_ACCOUNT);
if (ret) {
if (!dma->iommu_mapped) {
vfio_unpin_pages_remote(dma, iova,
@@ -1845,7 +1846,8 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
continue;

ret = iommu_map(domain->domain, start, page_to_phys(pages), PAGE_SIZE * 2,
- IOMMU_READ | IOMMU_WRITE | IOMMU_CACHE, GFP_KERNEL);
+ IOMMU_READ | IOMMU_WRITE | IOMMU_CACHE,
+ GFP_KERNEL_ACCOUNT);
if (!ret) {
size_t unmapped = iommu_unmap(domain->domain, start, PAGE_SIZE);

--
2.43.0.rc2.451.g8631bc7472-goog

2023-11-28 21:30:11

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 08/16] iommu/fsl: use page allocation function provided by iommu-pages.h

Convert iommu/fsl_pamu.c to use the new page allocation functions
provided in iommu-pages.h.

Signed-off-by: Pasha Tatashin <[email protected]>
---
drivers/iommu/fsl_pamu.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/fsl_pamu.c b/drivers/iommu/fsl_pamu.c
index f37d3b044131..7bfb49940f0c 100644
--- a/drivers/iommu/fsl_pamu.c
+++ b/drivers/iommu/fsl_pamu.c
@@ -16,6 +16,7 @@
#include <linux/platform_device.h>

#include <asm/mpc85xx.h>
+#include "iommu-pages.h"

/* define indexes for each operation mapping scenario */
#define OMI_QMAN 0x00
@@ -828,7 +829,7 @@ static int fsl_pamu_probe(struct platform_device *pdev)
(PAGE_SIZE << get_order(OMT_SIZE));
order = get_order(mem_size);

- p = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
+ p = __iommu_alloc_pages(GFP_KERNEL, order);
if (!p) {
dev_err(dev, "unable to allocate PAACT/SPAACT/OMT block\n");
ret = -ENOMEM;
@@ -916,7 +917,7 @@ static int fsl_pamu_probe(struct platform_device *pdev)
iounmap(guts_regs);

if (ppaact)
- free_pages((unsigned long)ppaact, order);
+ iommu_free_pages(ppaact, order);

ppaact = NULL;

--
2.43.0.rc2.451.g8631bc7472-goog

2023-11-28 21:30:12

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 10/16] iommu/rockchip: use page allocation function provided by iommu-pages.h

Convert iommu/rockchip-iommu.c to use the new page allocation functions
provided in iommu-pages.h.

Signed-off-by: Pasha Tatashin <[email protected]>
---
drivers/iommu/iommufd/iova_bitmap.c | 2 ++
drivers/iommu/rockchip-iommu.c | 14 ++++++++------
2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/iommufd/iova_bitmap.c b/drivers/iommu/iommufd/iova_bitmap.c
index 6b8575b93f17..4d5d1be807fe 100644
--- a/drivers/iommu/iommufd/iova_bitmap.c
+++ b/drivers/iommu/iommufd/iova_bitmap.c
@@ -8,6 +8,8 @@
#include <linux/slab.h>
#include <linux/highmem.h>

+#include "../iommu-pages.h"
+
#define BITS_PER_PAGE (PAGE_SIZE * BITS_PER_BYTE)

/*
diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c
index 2685861c0a12..e04f22d481d0 100644
--- a/drivers/iommu/rockchip-iommu.c
+++ b/drivers/iommu/rockchip-iommu.c
@@ -26,6 +26,8 @@
#include <linux/slab.h>
#include <linux/spinlock.h>

+#include "iommu-pages.h"
+
/** MMU register offsets */
#define RK_MMU_DTE_ADDR 0x00 /* Directory table address */
#define RK_MMU_STATUS 0x04
@@ -727,14 +729,14 @@ static u32 *rk_dte_get_page_table(struct rk_iommu_domain *rk_domain,
if (rk_dte_is_pt_valid(dte))
goto done;

- page_table = (u32 *)get_zeroed_page(GFP_ATOMIC | rk_ops->gfp_flags);
+ page_table = iommu_alloc_page(GFP_ATOMIC | rk_ops->gfp_flags);
if (!page_table)
return ERR_PTR(-ENOMEM);

pt_dma = dma_map_single(dma_dev, page_table, SPAGE_SIZE, DMA_TO_DEVICE);
if (dma_mapping_error(dma_dev, pt_dma)) {
dev_err(dma_dev, "DMA mapping error while allocating page table\n");
- free_page((unsigned long)page_table);
+ iommu_free_page(page_table);
return ERR_PTR(-ENOMEM);
}

@@ -1061,7 +1063,7 @@ static struct iommu_domain *rk_iommu_domain_alloc_paging(struct device *dev)
* Each level1 (dt) and level2 (pt) table has 1024 4-byte entries.
* Allocate one 4 KiB page for each table.
*/
- rk_domain->dt = (u32 *)get_zeroed_page(GFP_KERNEL | rk_ops->gfp_flags);
+ rk_domain->dt = iommu_alloc_page(GFP_KERNEL | rk_ops->gfp_flags);
if (!rk_domain->dt)
goto err_free_domain;

@@ -1083,7 +1085,7 @@ static struct iommu_domain *rk_iommu_domain_alloc_paging(struct device *dev)
return &rk_domain->domain;

err_free_dt:
- free_page((unsigned long)rk_domain->dt);
+ iommu_free_page(rk_domain->dt);
err_free_domain:
kfree(rk_domain);

@@ -1104,13 +1106,13 @@ static void rk_iommu_domain_free(struct iommu_domain *domain)
u32 *page_table = phys_to_virt(pt_phys);
dma_unmap_single(dma_dev, pt_phys,
SPAGE_SIZE, DMA_TO_DEVICE);
- free_page((unsigned long)page_table);
+ iommu_free_page(page_table);
}
}

dma_unmap_single(dma_dev, rk_domain->dt_dma,
SPAGE_SIZE, DMA_TO_DEVICE);
- free_page((unsigned long)rk_domain->dt);
+ iommu_free_page(rk_domain->dt);

kfree(rk_domain);
}
--
2.43.0.rc2.451.g8631bc7472-goog

2023-11-28 21:30:31

by Pasha Tatashin

[permalink] [raw]
Subject: [PATCH 07/16] iommu/exynos: use page allocation function provided by iommu-pages.h

Convert iommu/exynos-iommu.c to use the new page allocation functions
provided in iommu-pages.h.

Signed-off-by: Pasha Tatashin <[email protected]>
---
drivers/iommu/exynos-iommu.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index 2c6e9094f1e9..3eab0ae65a4f 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -22,6 +22,8 @@
#include <linux/pm_runtime.h>
#include <linux/slab.h>

+#include "iommu-pages.h"
+
typedef u32 sysmmu_iova_t;
typedef u32 sysmmu_pte_t;
static struct iommu_domain exynos_identity_domain;
@@ -900,11 +902,11 @@ static struct iommu_domain *exynos_iommu_domain_alloc_paging(struct device *dev)
if (!domain)
return NULL;

- domain->pgtable = (sysmmu_pte_t *)__get_free_pages(GFP_KERNEL, 2);
+ domain->pgtable = iommu_alloc_pages(GFP_KERNEL, 2);
if (!domain->pgtable)
goto err_pgtable;

- domain->lv2entcnt = (short *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, 1);
+ domain->lv2entcnt = iommu_alloc_pages(GFP_KERNEL, 1);
if (!domain->lv2entcnt)
goto err_counter;

@@ -930,9 +932,9 @@ static struct iommu_domain *exynos_iommu_domain_alloc_paging(struct device *dev)
return &domain->domain;

err_lv2ent:
- free_pages((unsigned long)domain->lv2entcnt, 1);
+ iommu_free_pages(domain->lv2entcnt, 1);
err_counter:
- free_pages((unsigned long)domain->pgtable, 2);
+ iommu_free_pages(domain->pgtable, 2);
err_pgtable:
kfree(domain);
return NULL;
@@ -973,8 +975,8 @@ static void exynos_iommu_domain_free(struct iommu_domain *iommu_domain)
phys_to_virt(base));
}

- free_pages((unsigned long)domain->pgtable, 2);
- free_pages((unsigned long)domain->lv2entcnt, 1);
+ iommu_free_pages(domain->pgtable, 2);
+ iommu_free_pages(domain->lv2entcnt, 1);
kfree(domain);
}

--
2.43.0.rc2.451.g8631bc7472-goog

2023-11-28 21:34:53

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH 00/16] IOMMU memory observability

On Tue, Nov 28, 2023 at 12:49 PM Pasha Tatashin
<[email protected]> wrote:
>
> From: Pasha Tatashin <[email protected]>
>
> IOMMU subsystem may contain state that is in gigabytes. Majority of that
> state is iommu page tables. Yet, there is currently, no way to observe
> how much memory is actually used by the iommu subsystem.
>
> This patch series solves this problem by adding both observability to
> all pages that are allocated by IOMMU, and also accountability, so
> admins can limit the amount if via cgroups.
>
> The system-wide observability is using /proc/meminfo:
> SecPageTables: 438176 kB
>
> Contains IOMMU and KVM memory.
>
> Per-node observability:
> /sys/devices/system/node/nodeN/meminfo
> Node N SecPageTables: 422204 kB
>
> Contains IOMMU and KVM memory memory in the given NUMA node.
>
> Per-node IOMMU only observability:
> /sys/devices/system/node/nodeN/vmstat
> nr_iommu_pages 105555
>
> Contains number of pages IOMMU allocated in the given node.

Does it make sense to have a KVM-only entry there as well?

In that case, if SecPageTables in /proc/meminfo is found to be
suspiciously high, it should be easy to tell which component is
contributing most usage through vmstat. I understand that users can do
the subtraction, but we wouldn't want userspace depending on that, in
case a third class of "secondary" page tables emerges that we want to
add to SecPageTables. The in-kernel implementation can do the
subtraction for now if it makes sense though.

>
> Accountability: using sec_pagetables cgroup-v2 memory.stat entry.
>
> With the change, iova_stress[1] stops as limit is reached:
>
> # ./iova_stress
> iova space: 0T free memory: 497G
> iova space: 1T free memory: 495G
> iova space: 2T free memory: 493G
> iova space: 3T free memory: 491G
>
> stops as limit is reached.
>
> This series encorporates suggestions that came from the discussion
> at LPC [2].
>
> [1] https://github.com/soleen/iova_stress
> [2] https://lpc.events/event/17/contributions/1466
>
> Pasha Tatashin (16):
> iommu/vt-d: add wrapper functions for page allocations
> iommu/amd: use page allocation function provided by iommu-pages.h
> iommu/io-pgtable-arm: use page allocation function provided by
> iommu-pages.h
> iommu/io-pgtable-dart: use page allocation function provided by
> iommu-pages.h
> iommu/io-pgtable-arm-v7s: use page allocation function provided by
> iommu-pages.h
> iommu/dma: use page allocation function provided by iommu-pages.h
> iommu/exynos: use page allocation function provided by iommu-pages.h
> iommu/fsl: use page allocation function provided by iommu-pages.h
> iommu/iommufd: use page allocation function provided by iommu-pages.h
> iommu/rockchip: use page allocation function provided by iommu-pages.h
> iommu/sun50i: use page allocation function provided by iommu-pages.h
> iommu/tegra-smmu: use page allocation function provided by
> iommu-pages.h
> iommu: observability of the IOMMU allocations
> iommu: account IOMMU allocated memory
> vhost-vdpa: account iommu allocations
> vfio: account iommu allocations
>
> Documentation/admin-guide/cgroup-v2.rst | 2 +-
> Documentation/filesystems/proc.rst | 4 +-
> drivers/iommu/amd/amd_iommu.h | 8 -
> drivers/iommu/amd/init.c | 91 +++++-----
> drivers/iommu/amd/io_pgtable.c | 13 +-
> drivers/iommu/amd/io_pgtable_v2.c | 20 +-
> drivers/iommu/amd/iommu.c | 13 +-
> drivers/iommu/dma-iommu.c | 8 +-
> drivers/iommu/exynos-iommu.c | 14 +-
> drivers/iommu/fsl_pamu.c | 5 +-
> drivers/iommu/intel/dmar.c | 10 +-
> drivers/iommu/intel/iommu.c | 47 ++---
> drivers/iommu/intel/iommu.h | 2 -
> drivers/iommu/intel/irq_remapping.c | 10 +-
> drivers/iommu/intel/pasid.c | 12 +-
> drivers/iommu/intel/svm.c | 7 +-
> drivers/iommu/io-pgtable-arm-v7s.c | 9 +-
> drivers/iommu/io-pgtable-arm.c | 7 +-
> drivers/iommu/io-pgtable-dart.c | 37 ++--
> drivers/iommu/iommu-pages.h | 231 ++++++++++++++++++++++++
> drivers/iommu/iommufd/iova_bitmap.c | 6 +-
> drivers/iommu/rockchip-iommu.c | 14 +-
> drivers/iommu/sun50i-iommu.c | 7 +-
> drivers/iommu/tegra-smmu.c | 18 +-
> drivers/vfio/vfio_iommu_type1.c | 8 +-
> drivers/vhost/vdpa.c | 3 +-
> include/linux/mmzone.h | 5 +-
> mm/vmstat.c | 3 +
> 28 files changed, 415 insertions(+), 199 deletions(-)
> create mode 100644 drivers/iommu/iommu-pages.h
>
> --
> 2.43.0.rc2.451.g8631bc7472-goog
>
>

2023-11-28 22:32:40

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 00/16] IOMMU memory observability

On Tue, Nov 28, 2023 at 4:34 PM Yosry Ahmed <[email protected]> wrote:
>
> On Tue, Nov 28, 2023 at 12:49 PM Pasha Tatashin
> <[email protected]> wrote:
> >
> > From: Pasha Tatashin <[email protected]>
> >
> > IOMMU subsystem may contain state that is in gigabytes. Majority of that
> > state is iommu page tables. Yet, there is currently, no way to observe
> > how much memory is actually used by the iommu subsystem.
> >
> > This patch series solves this problem by adding both observability to
> > all pages that are allocated by IOMMU, and also accountability, so
> > admins can limit the amount if via cgroups.
> >
> > The system-wide observability is using /proc/meminfo:
> > SecPageTables: 438176 kB
> >
> > Contains IOMMU and KVM memory.
> >
> > Per-node observability:
> > /sys/devices/system/node/nodeN/meminfo
> > Node N SecPageTables: 422204 kB
> >
> > Contains IOMMU and KVM memory memory in the given NUMA node.
> >
> > Per-node IOMMU only observability:
> > /sys/devices/system/node/nodeN/vmstat
> > nr_iommu_pages 105555
> >
> > Contains number of pages IOMMU allocated in the given node.
>
> Does it make sense to have a KVM-only entry there as well?
>
> In that case, if SecPageTables in /proc/meminfo is found to be
> suspiciously high, it should be easy to tell which component is
> contributing most usage through vmstat. I understand that users can do
> the subtraction, but we wouldn't want userspace depending on that, in
> case a third class of "secondary" page tables emerges that we want to
> add to SecPageTables. The in-kernel implementation can do the
> subtraction for now if it makes sense though.

Hi Yosry,

Yes, another counter for KVM could be added. On the other hand KVM
only can be computed by subtracting one from another as there are only
two types of secondary page tables, KVM and IOMMU:

/sys/devices/system/node/node0/meminfo
Node 0 SecPageTables: 422204 kB

/sys/devices/system/node/nodeN/vmstat
nr_iommu_pages 105555

KVM only = SecPageTables - nr_iommu_pages * PAGE_SIZE / 1024

Pasha

>
> >
> > Accountability: using sec_pagetables cgroup-v2 memory.stat entry.
> >
> > With the change, iova_stress[1] stops as limit is reached:
> >
> > # ./iova_stress
> > iova space: 0T free memory: 497G
> > iova space: 1T free memory: 495G
> > iova space: 2T free memory: 493G
> > iova space: 3T free memory: 491G
> >
> > stops as limit is reached.
> >
> > This series encorporates suggestions that came from the discussion
> > at LPC [2].
> >
> > [1] https://github.com/soleen/iova_stress
> > [2] https://lpc.events/event/17/contributions/1466
> >
> > Pasha Tatashin (16):
> > iommu/vt-d: add wrapper functions for page allocations
> > iommu/amd: use page allocation function provided by iommu-pages.h
> > iommu/io-pgtable-arm: use page allocation function provided by
> > iommu-pages.h
> > iommu/io-pgtable-dart: use page allocation function provided by
> > iommu-pages.h
> > iommu/io-pgtable-arm-v7s: use page allocation function provided by
> > iommu-pages.h
> > iommu/dma: use page allocation function provided by iommu-pages.h
> > iommu/exynos: use page allocation function provided by iommu-pages.h
> > iommu/fsl: use page allocation function provided by iommu-pages.h
> > iommu/iommufd: use page allocation function provided by iommu-pages.h
> > iommu/rockchip: use page allocation function provided by iommu-pages.h
> > iommu/sun50i: use page allocation function provided by iommu-pages.h
> > iommu/tegra-smmu: use page allocation function provided by
> > iommu-pages.h
> > iommu: observability of the IOMMU allocations
> > iommu: account IOMMU allocated memory
> > vhost-vdpa: account iommu allocations
> > vfio: account iommu allocations
> >
> > Documentation/admin-guide/cgroup-v2.rst | 2 +-
> > Documentation/filesystems/proc.rst | 4 +-
> > drivers/iommu/amd/amd_iommu.h | 8 -
> > drivers/iommu/amd/init.c | 91 +++++-----
> > drivers/iommu/amd/io_pgtable.c | 13 +-
> > drivers/iommu/amd/io_pgtable_v2.c | 20 +-
> > drivers/iommu/amd/iommu.c | 13 +-
> > drivers/iommu/dma-iommu.c | 8 +-
> > drivers/iommu/exynos-iommu.c | 14 +-
> > drivers/iommu/fsl_pamu.c | 5 +-
> > drivers/iommu/intel/dmar.c | 10 +-
> > drivers/iommu/intel/iommu.c | 47 ++---
> > drivers/iommu/intel/iommu.h | 2 -
> > drivers/iommu/intel/irq_remapping.c | 10 +-
> > drivers/iommu/intel/pasid.c | 12 +-
> > drivers/iommu/intel/svm.c | 7 +-
> > drivers/iommu/io-pgtable-arm-v7s.c | 9 +-
> > drivers/iommu/io-pgtable-arm.c | 7 +-
> > drivers/iommu/io-pgtable-dart.c | 37 ++--
> > drivers/iommu/iommu-pages.h | 231 ++++++++++++++++++++++++
> > drivers/iommu/iommufd/iova_bitmap.c | 6 +-
> > drivers/iommu/rockchip-iommu.c | 14 +-
> > drivers/iommu/sun50i-iommu.c | 7 +-
> > drivers/iommu/tegra-smmu.c | 18 +-
> > drivers/vfio/vfio_iommu_type1.c | 8 +-
> > drivers/vhost/vdpa.c | 3 +-
> > include/linux/mmzone.h | 5 +-
> > mm/vmstat.c | 3 +
> > 28 files changed, 415 insertions(+), 199 deletions(-)
> > create mode 100644 drivers/iommu/iommu-pages.h
> >
> > --
> > 2.43.0.rc2.451.g8631bc7472-goog
> >
> >

2023-11-28 22:34:18

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH 06/16] iommu/dma: use page allocation function provided by iommu-pages.h

On 2023-11-28 8:49 pm, Pasha Tatashin wrote:
> Convert iommu/dma-iommu.c to use the new page allocation functions
> provided in iommu-pages.h.

These have nothing to do with IOMMU pagetables, they are DMA buffers and
they belong to whoever called the corresponding dma_alloc_* function.

Thanks,
Robin.

> Signed-off-by: Pasha Tatashin <[email protected]>
> ---
> drivers/iommu/dma-iommu.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 85163a83df2f..822adad464c2 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -31,6 +31,7 @@
> #include <linux/vmalloc.h>
>
> #include "dma-iommu.h"
> +#include "iommu-pages.h"
>
> struct iommu_dma_msi_page {
> struct list_head list;
> @@ -874,7 +875,7 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
> static void __iommu_dma_free_pages(struct page **pages, int count)
> {
> while (count--)
> - __free_page(pages[count]);
> + __iommu_free_page(pages[count]);
> kvfree(pages);
> }
>
> @@ -912,7 +913,8 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
> order_size = 1U << order;
> if (order_mask > order_size)
> alloc_flags |= __GFP_NORETRY;
> - page = alloc_pages_node(nid, alloc_flags, order);
> + page = __iommu_alloc_pages_node(nid, alloc_flags,
> + order);
> if (!page)
> continue;
> if (order)
> @@ -1572,7 +1574,7 @@ static void *iommu_dma_alloc_pages(struct device *dev, size_t size,
>
> page = dma_alloc_contiguous(dev, alloc_size, gfp);
> if (!page)
> - page = alloc_pages_node(node, gfp, get_order(alloc_size));
> + page = __iommu_alloc_pages_node(node, gfp, get_order(alloc_size));
> if (!page)
> return NULL;
>

2023-11-28 22:47:22

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH 05/16] iommu/io-pgtable-arm-v7s: use page allocation function provided by iommu-pages.h

On 2023-11-28 8:49 pm, Pasha Tatashin wrote:
> Convert iommu/io-pgtable-arm-v7s.c to use the new page allocation functions
> provided in iommu-pages.h.
>
> Signed-off-by: Pasha Tatashin <[email protected]>
> ---
> drivers/iommu/io-pgtable-arm-v7s.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/iommu/io-pgtable-arm-v7s.c b/drivers/iommu/io-pgtable-arm-v7s.c
> index 75f244a3e12d..3d494ca1f671 100644
> --- a/drivers/iommu/io-pgtable-arm-v7s.c
> +++ b/drivers/iommu/io-pgtable-arm-v7s.c
> @@ -34,6 +34,7 @@
> #include <linux/types.h>
>
> #include <asm/barrier.h>
> +#include "iommu-pages.h"
>
> /* Struct accessors */
> #define io_pgtable_to_data(x) \
> @@ -255,7 +256,7 @@ static void *__arm_v7s_alloc_table(int lvl, gfp_t gfp,
> GFP_KERNEL : ARM_V7S_TABLE_GFP_DMA;
>
> if (lvl == 1)
> - table = (void *)__get_free_pages(gfp_l1 | __GFP_ZERO, get_order(size));
> + table = iommu_alloc_pages(gfp_l1, get_order(size));
> else if (lvl == 2)
> table = kmem_cache_zalloc(data->l2_tables, gfp);

Is it really meaningful to account the L1 table which is always
allocated upon initial creation, yet not the L2 tables which are
allocated in use?

Thanks,
Robin.

> @@ -283,6 +284,7 @@ static void *__arm_v7s_alloc_table(int lvl, gfp_t gfp,
> }
> if (lvl == 2)
> kmemleak_ignore(table);
> +
> return table;
>
> out_unmap:
> @@ -290,7 +292,7 @@ static void *__arm_v7s_alloc_table(int lvl, gfp_t gfp,
> dma_unmap_single(dev, dma, size, DMA_TO_DEVICE);
> out_free:
> if (lvl == 1)
> - free_pages((unsigned long)table, get_order(size));
> + iommu_free_pages(table, get_order(size));
> else
> kmem_cache_free(data->l2_tables, table);
> return NULL;
> @@ -306,8 +308,9 @@ static void __arm_v7s_free_table(void *table, int lvl,
> if (!cfg->coherent_walk)
> dma_unmap_single(dev, __arm_v7s_dma_addr(table), size,
> DMA_TO_DEVICE);
> +
> if (lvl == 1)
> - free_pages((unsigned long)table, get_order(size));
> + iommu_free_pages(table, get_order(size));
> else
> kmem_cache_free(data->l2_tables, table);
> }

2023-11-28 22:50:59

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 06/16] iommu/dma: use page allocation function provided by iommu-pages.h

On Tue, Nov 28, 2023 at 5:34 PM Robin Murphy <[email protected]> wrote:
>
> On 2023-11-28 8:49 pm, Pasha Tatashin wrote:
> > Convert iommu/dma-iommu.c to use the new page allocation functions
> > provided in iommu-pages.h.
>
> These have nothing to do with IOMMU pagetables, they are DMA buffers and
> they belong to whoever called the corresponding dma_alloc_* function.

Hi Robin,

This is true, however, we want to account and observe the pages
allocated by IOMMU subsystem for DMA buffers, as they are essentially
unmovable locked pages. Should we separate IOMMU memory from KVM
memory all together and add another field to /proc/meminfo, something
like "iommu -> iommu pagetable and dma memory", or do we want to
export DMA memory separately from IOMMU page tables?

Since, I included DMA memory, I specifically removed mentioning of
IOMMU page tables in the most of places, and only report it as IOMMU
memory. However, since it is still bundled together with SecPageTables
it can be confusing.

Pasha

2023-11-28 22:53:23

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH 08/16] iommu/fsl: use page allocation function provided by iommu-pages.h

On 2023-11-28 8:49 pm, Pasha Tatashin wrote:
> Convert iommu/fsl_pamu.c to use the new page allocation functions
> provided in iommu-pages.h.

Again, this is not a pagetable. This thing doesn't even *have* pagetables.

Similar to patches #1 and #2 where you're lumping in configuration
tables which belong to the IOMMU driver itself, as opposed to pagetables
which effectively belong to an IOMMU domain's user. But then there are
still drivers where you're *not* accounting similar configuration
structures, so I really struggle to see how this metric is useful when
it's so completely inconsistent in what it's counting :/

Thanks,
Robin.

> Signed-off-by: Pasha Tatashin <[email protected]>
> ---
> drivers/iommu/fsl_pamu.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/fsl_pamu.c b/drivers/iommu/fsl_pamu.c
> index f37d3b044131..7bfb49940f0c 100644
> --- a/drivers/iommu/fsl_pamu.c
> +++ b/drivers/iommu/fsl_pamu.c
> @@ -16,6 +16,7 @@
> #include <linux/platform_device.h>
>
> #include <asm/mpc85xx.h>
> +#include "iommu-pages.h"
>
> /* define indexes for each operation mapping scenario */
> #define OMI_QMAN 0x00
> @@ -828,7 +829,7 @@ static int fsl_pamu_probe(struct platform_device *pdev)
> (PAGE_SIZE << get_order(OMT_SIZE));
> order = get_order(mem_size);
>
> - p = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
> + p = __iommu_alloc_pages(GFP_KERNEL, order);
> if (!p) {
> dev_err(dev, "unable to allocate PAACT/SPAACT/OMT block\n");
> ret = -ENOMEM;
> @@ -916,7 +917,7 @@ static int fsl_pamu_probe(struct platform_device *pdev)
> iounmap(guts_regs);
>
> if (ppaact)
> - free_pages((unsigned long)ppaact, order);
> + iommu_free_pages(ppaact, order);
>
> ppaact = NULL;
>

2023-11-28 22:56:07

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 05/16] iommu/io-pgtable-arm-v7s: use page allocation function provided by iommu-pages.h

> > kmem_cache_free(data->l2_tables, table);

We only account page allocations, not subpages, however, this is
something I was surprised about this particular architecture of why do
we allocate l2 using kmem ? Are the second level tables on arm v7s
really sub-page in size?

Pasha

2023-11-28 23:00:22

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH 06/16] iommu/dma: use page allocation function provided by iommu-pages.h

On 2023-11-28 10:50 pm, Pasha Tatashin wrote:
> On Tue, Nov 28, 2023 at 5:34 PM Robin Murphy <[email protected]> wrote:
>>
>> On 2023-11-28 8:49 pm, Pasha Tatashin wrote:
>>> Convert iommu/dma-iommu.c to use the new page allocation functions
>>> provided in iommu-pages.h.
>>
>> These have nothing to do with IOMMU pagetables, they are DMA buffers and
>> they belong to whoever called the corresponding dma_alloc_* function.
>
> Hi Robin,
>
> This is true, however, we want to account and observe the pages
> allocated by IOMMU subsystem for DMA buffers, as they are essentially
> unmovable locked pages. Should we separate IOMMU memory from KVM
> memory all together and add another field to /proc/meminfo, something
> like "iommu -> iommu pagetable and dma memory", or do we want to
> export DMA memory separately from IOMMU page tables?

These are not allocated by "the IOMMU subsystem", they are allocated by
the DMA API. Even if you want to claim that a driver pinning memory via
iommu_dma_ops is somehow different from the same driver pinning the same
amount of memory via dma-direct when iommu.passthrough=1, it's still
nonsense because you're failing to account the pages which iommu_dma_ops
gets from CMA, dma_common_alloc_pages(), dynamic SWIOTLB, the various
pools, and so on.

Thanks,
Robin.

> Since, I included DMA memory, I specifically removed mentioning of
> IOMMU page tables in the most of places, and only report it as IOMMU
> memory. However, since it is still bundled together with SecPageTables
> it can be confusing.
>
> Pasha

2023-11-28 23:01:06

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 08/16] iommu/fsl: use page allocation function provided by iommu-pages.h

On Tue, Nov 28, 2023 at 5:53 PM Robin Murphy <[email protected]> wrote:
>
> On 2023-11-28 8:49 pm, Pasha Tatashin wrote:
> > Convert iommu/fsl_pamu.c to use the new page allocation functions
> > provided in iommu-pages.h.
>
> Again, this is not a pagetable. This thing doesn't even *have* pagetables.
>
> Similar to patches #1 and #2 where you're lumping in configuration
> tables which belong to the IOMMU driver itself, as opposed to pagetables
> which effectively belong to an IOMMU domain's user. But then there are
> still drivers where you're *not* accounting similar configuration
> structures, so I really struggle to see how this metric is useful when
> it's so completely inconsistent in what it's counting :/

The whole IOMMU subsystem allocates a significant amount of kernel
locked memory that we want to at least observe. The new field in
vmstat does just that: it reports ALL buddy allocator memory that
IOMMU allocates. However, for accounting purposes, I agree, we need to
do better, and separate at least iommu pagetables from the rest.

We can separate the metric into two:
iommu pagetable only
iommu everything

or into three:
iommu pagetable only
iommu dma
iommu everything

What do you think?

Pasha

>
> Thanks,
> Robin.
>
> > Signed-off-by: Pasha Tatashin <[email protected]>
> > ---
> > drivers/iommu/fsl_pamu.c | 5 +++--
> > 1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/iommu/fsl_pamu.c b/drivers/iommu/fsl_pamu.c
> > index f37d3b044131..7bfb49940f0c 100644
> > --- a/drivers/iommu/fsl_pamu.c
> > +++ b/drivers/iommu/fsl_pamu.c
> > @@ -16,6 +16,7 @@
> > #include <linux/platform_device.h>
> >
> > #include <asm/mpc85xx.h>
> > +#include "iommu-pages.h"
> >
> > /* define indexes for each operation mapping scenario */
> > #define OMI_QMAN 0x00
> > @@ -828,7 +829,7 @@ static int fsl_pamu_probe(struct platform_device *pdev)
> > (PAGE_SIZE << get_order(OMT_SIZE));
> > order = get_order(mem_size);
> >
> > - p = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
> > + p = __iommu_alloc_pages(GFP_KERNEL, order);
> > if (!p) {
> > dev_err(dev, "unable to allocate PAACT/SPAACT/OMT block\n");
> > ret = -ENOMEM;
> > @@ -916,7 +917,7 @@ static int fsl_pamu_probe(struct platform_device *pdev)
> > iounmap(guts_regs);
> >
> > if (ppaact)
> > - free_pages((unsigned long)ppaact, order);
> > + iommu_free_pages(ppaact, order);
> >
> > ppaact = NULL;
> >

2023-11-28 23:04:34

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH 00/16] IOMMU memory observability

On Tue, Nov 28, 2023 at 2:32 PM Pasha Tatashin
<[email protected]> wrote:
>
> On Tue, Nov 28, 2023 at 4:34 PM Yosry Ahmed <[email protected]> wrote:
> >
> > On Tue, Nov 28, 2023 at 12:49 PM Pasha Tatashin
> > <[email protected]> wrote:
> > >
> > > From: Pasha Tatashin <[email protected]>
> > >
> > > IOMMU subsystem may contain state that is in gigabytes. Majority of that
> > > state is iommu page tables. Yet, there is currently, no way to observe
> > > how much memory is actually used by the iommu subsystem.
> > >
> > > This patch series solves this problem by adding both observability to
> > > all pages that are allocated by IOMMU, and also accountability, so
> > > admins can limit the amount if via cgroups.
> > >
> > > The system-wide observability is using /proc/meminfo:
> > > SecPageTables: 438176 kB
> > >
> > > Contains IOMMU and KVM memory.
> > >
> > > Per-node observability:
> > > /sys/devices/system/node/nodeN/meminfo
> > > Node N SecPageTables: 422204 kB
> > >
> > > Contains IOMMU and KVM memory memory in the given NUMA node.
> > >
> > > Per-node IOMMU only observability:
> > > /sys/devices/system/node/nodeN/vmstat
> > > nr_iommu_pages 105555
> > >
> > > Contains number of pages IOMMU allocated in the given node.
> >
> > Does it make sense to have a KVM-only entry there as well?
> >
> > In that case, if SecPageTables in /proc/meminfo is found to be
> > suspiciously high, it should be easy to tell which component is
> > contributing most usage through vmstat. I understand that users can do
> > the subtraction, but we wouldn't want userspace depending on that, in
> > case a third class of "secondary" page tables emerges that we want to
> > add to SecPageTables. The in-kernel implementation can do the
> > subtraction for now if it makes sense though.
>
> Hi Yosry,
>
> Yes, another counter for KVM could be added. On the other hand KVM
> only can be computed by subtracting one from another as there are only
> two types of secondary page tables, KVM and IOMMU:
>
> /sys/devices/system/node/node0/meminfo
> Node 0 SecPageTables: 422204 kB
>
> /sys/devices/system/node/nodeN/vmstat
> nr_iommu_pages 105555
>
> KVM only = SecPageTables - nr_iommu_pages * PAGE_SIZE / 1024
>

Right, but as I mention above, if userspace starts depending on this
equation, we won't be able to add any more classes of "secondary" page
tables to SecPageTables. I'd like to avoid that if possible. We can do
the subtraction in the kernel.

2023-11-28 23:07:28

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 06/16] iommu/dma: use page allocation function provided by iommu-pages.h

On Tue, Nov 28, 2023 at 5:59 PM Robin Murphy <[email protected]> wrote:
>
> On 2023-11-28 10:50 pm, Pasha Tatashin wrote:
> > On Tue, Nov 28, 2023 at 5:34 PM Robin Murphy <[email protected]> wrote:
> >>
> >> On 2023-11-28 8:49 pm, Pasha Tatashin wrote:
> >>> Convert iommu/dma-iommu.c to use the new page allocation functions
> >>> provided in iommu-pages.h.
> >>
> >> These have nothing to do with IOMMU pagetables, they are DMA buffers and
> >> they belong to whoever called the corresponding dma_alloc_* function.
> >
> > Hi Robin,
> >
> > This is true, however, we want to account and observe the pages
> > allocated by IOMMU subsystem for DMA buffers, as they are essentially
> > unmovable locked pages. Should we separate IOMMU memory from KVM
> > memory all together and add another field to /proc/meminfo, something
> > like "iommu -> iommu pagetable and dma memory", or do we want to
> > export DMA memory separately from IOMMU page tables?
>
> These are not allocated by "the IOMMU subsystem", they are allocated by
> the DMA API. Even if you want to claim that a driver pinning memory via
> iommu_dma_ops is somehow different from the same driver pinning the same
> amount of memory via dma-direct when iommu.passthrough=1, it's still
> nonsense because you're failing to account the pages which iommu_dma_ops
> gets from CMA, dma_common_alloc_pages(), dynamic SWIOTLB, the various
> pools, and so on.
>
> Thanks,
> Robin.
>
> > Since, I included DMA memory, I specifically removed mentioning of
> > IOMMU page tables in the most of places, and only report it as IOMMU
> > memory. However, since it is still bundled together with SecPageTables
> > it can be confusing.
> >
> > Pasha

2023-11-28 23:08:08

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH 05/16] iommu/io-pgtable-arm-v7s: use page allocation function provided by iommu-pages.h

On 2023-11-28 10:55 pm, Pasha Tatashin wrote:
>>> kmem_cache_free(data->l2_tables, table);
>
> We only account page allocations, not subpages, however, this is
> something I was surprised about this particular architecture of why do
> we allocate l2 using kmem ? Are the second level tables on arm v7s
> really sub-page in size?

Yes, L2 tables are 1KB, so the kmem_cache could still quite easily end
up consuming significantly more memory than the L1 table, which is
usually 16KB (but could potentially be smaller depending on the config,
or up to 64KB with the Mediatek hacks).

Thanks,
Robin.

2023-11-28 23:09:02

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 06/16] iommu/dma: use page allocation function provided by iommu-pages.h

> > This is true, however, we want to account and observe the pages
> > allocated by IOMMU subsystem for DMA buffers, as they are essentially
> > unmovable locked pages. Should we separate IOMMU memory from KVM
> > memory all together and add another field to /proc/meminfo, something
> > like "iommu -> iommu pagetable and dma memory", or do we want to
> > export DMA memory separately from IOMMU page tables?
>
> These are not allocated by "the IOMMU subsystem", they are allocated by
> the DMA API. Even if you want to claim that a driver pinning memory via
> iommu_dma_ops is somehow different from the same driver pinning the same
> amount of memory via dma-direct when iommu.passthrough=1, it's still
> nonsense because you're failing to account the pages which iommu_dma_ops
> gets from CMA, dma_common_alloc_pages(), dynamic SWIOTLB, the various
> pools, and so on.

I see, IOMMU variants are used only for discontiguous allocations, and
the common ones are defined outside of driver/iommu. Alright, I can
remove all the changes for all no-page table related IOMMU
allocations.

Pasha

2023-11-28 23:33:09

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 05/16] iommu/io-pgtable-arm-v7s: use page allocation function provided by iommu-pages.h

On Tue, Nov 28, 2023 at 6:08 PM Robin Murphy <[email protected]> wrote:
>
> On 2023-11-28 10:55 pm, Pasha Tatashin wrote:
> >>> kmem_cache_free(data->l2_tables, table);
> >
> > We only account page allocations, not subpages, however, this is
> > something I was surprised about this particular architecture of why do
> > we allocate l2 using kmem ? Are the second level tables on arm v7s
> > really sub-page in size?
>
> Yes, L2 tables are 1KB, so the kmem_cache could still quite easily end
> up consuming significantly more memory than the L1 table, which is
> usually 16KB (but could potentially be smaller depending on the config,
> or up to 64KB with the Mediatek hacks).

I am OK removing support for this architecture, or keeping only info
for L1, I do not think there is a reason to worry about sub-page
accounting only for v7s.

Pasha

>
> Thanks,
> Robin.

2023-11-28 23:51:17

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 08/16] iommu/fsl: use page allocation function provided by iommu-pages.h

On Tue, Nov 28, 2023 at 06:00:13PM -0500, Pasha Tatashin wrote:
> On Tue, Nov 28, 2023 at 5:53 PM Robin Murphy <[email protected]> wrote:
> >
> > On 2023-11-28 8:49 pm, Pasha Tatashin wrote:
> > > Convert iommu/fsl_pamu.c to use the new page allocation functions
> > > provided in iommu-pages.h.
> >
> > Again, this is not a pagetable. This thing doesn't even *have* pagetables.
> >
> > Similar to patches #1 and #2 where you're lumping in configuration
> > tables which belong to the IOMMU driver itself, as opposed to pagetables
> > which effectively belong to an IOMMU domain's user. But then there are
> > still drivers where you're *not* accounting similar configuration
> > structures, so I really struggle to see how this metric is useful when
> > it's so completely inconsistent in what it's counting :/
>
> The whole IOMMU subsystem allocates a significant amount of kernel
> locked memory that we want to at least observe. The new field in
> vmstat does just that: it reports ALL buddy allocator memory that
> IOMMU allocates. However, for accounting purposes, I agree, we need to
> do better, and separate at least iommu pagetables from the rest.
>
> We can separate the metric into two:
> iommu pagetable only
> iommu everything
>
> or into three:
> iommu pagetable only
> iommu dma
> iommu everything
>
> What do you think?

I think I said this at LPC - if you want to have fine grained
accounting of memory by owner you need to go talk to the cgroup people
and come up with something generic. Adding ever open coded finer
category breakdowns just for iommu doesn't make alot of sense.

You can make some argument that the pagetable memory should be counted
because kvm counts it's shadow memory, but I wouldn't go into further
detail than that with hand coded counters..

Jason

2023-11-28 23:52:22

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 00/16] IOMMU memory observability

On Tue, Nov 28, 2023 at 03:03:30PM -0800, Yosry Ahmed wrote:
> > Yes, another counter for KVM could be added. On the other hand KVM
> > only can be computed by subtracting one from another as there are only
> > two types of secondary page tables, KVM and IOMMU:
> >
> > /sys/devices/system/node/node0/meminfo
> > Node 0 SecPageTables: 422204 kB
> >
> > /sys/devices/system/node/nodeN/vmstat
> > nr_iommu_pages 105555
> >
> > KVM only = SecPageTables - nr_iommu_pages * PAGE_SIZE / 1024
> >
>
> Right, but as I mention above, if userspace starts depending on this
> equation, we won't be able to add any more classes of "secondary" page
> tables to SecPageTables. I'd like to avoid that if possible. We can do
> the subtraction in the kernel.

What Sean had suggested was that SecPageTables was always intended to
account all the non-primary mmu memory used by page tables. If this is
the case we shouldn't be trying to break it apart into finer
counters. These are big picture counters, not detailed allocation by
owner counters.

Jason

2023-11-28 23:53:05

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 09/16] iommu/iommufd: use page allocation function provided by iommu-pages.h

On Tue, Nov 28, 2023 at 08:49:31PM +0000, Pasha Tatashin wrote:
> Convert iommu/iommufd/* files to use the new page allocation functions
> provided in iommu-pages.h.
>
> Signed-off-by: Pasha Tatashin <[email protected]>
> ---
> drivers/iommu/iommufd/iova_bitmap.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)

This is a short term allocation, it should not be counted, that is why
it is already not using GFP_KERNEL_ACCOUNT.

Jason

2023-11-28 23:54:09

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 16/16] vfio: account iommu allocations

On Tue, Nov 28, 2023 at 08:49:38PM +0000, Pasha Tatashin wrote:
> iommu allocations should be accounted in order to allow admins to
> monitor and limit the amount of iommu memory.
>
> Signed-off-by: Pasha Tatashin <[email protected]>
> ---
> drivers/vfio/vfio_iommu_type1.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)

You should send the seperately and directly to Alex.

Jason

2023-11-29 00:25:57

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH 00/16] IOMMU memory observability

On Tue, Nov 28, 2023 at 3:52 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Tue, Nov 28, 2023 at 03:03:30PM -0800, Yosry Ahmed wrote:
> > > Yes, another counter for KVM could be added. On the other hand KVM
> > > only can be computed by subtracting one from another as there are only
> > > two types of secondary page tables, KVM and IOMMU:
> > >
> > > /sys/devices/system/node/node0/meminfo
> > > Node 0 SecPageTables: 422204 kB
> > >
> > > /sys/devices/system/node/nodeN/vmstat
> > > nr_iommu_pages 105555
> > >
> > > KVM only = SecPageTables - nr_iommu_pages * PAGE_SIZE / 1024
> > >
> >
> > Right, but as I mention above, if userspace starts depending on this
> > equation, we won't be able to add any more classes of "secondary" page
> > tables to SecPageTables. I'd like to avoid that if possible. We can do
> > the subtraction in the kernel.
>
> What Sean had suggested was that SecPageTables was always intended to
> account all the non-primary mmu memory used by page tables. If this is
> the case we shouldn't be trying to break it apart into finer
> counters. These are big picture counters, not detailed allocation by
> owner counters.

Right, I agree with that, but if SecPageTables includes page tables
from multiple sources, and it is observed to be suspiciously high, the
logical next step is to try to find the culprit, right?

>
> Jason

2023-11-29 00:28:53

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 00/16] IOMMU memory observability

On Tue, Nov 28, 2023 at 04:25:03PM -0800, Yosry Ahmed wrote:

> > > Right, but as I mention above, if userspace starts depending on this
> > > equation, we won't be able to add any more classes of "secondary" page
> > > tables to SecPageTables. I'd like to avoid that if possible. We can do
> > > the subtraction in the kernel.
> >
> > What Sean had suggested was that SecPageTables was always intended to
> > account all the non-primary mmu memory used by page tables. If this is
> > the case we shouldn't be trying to break it apart into finer
> > counters. These are big picture counters, not detailed allocation by
> > owner counters.
>
> Right, I agree with that, but if SecPageTables includes page tables
> from multiple sources, and it is observed to be suspiciously high, the
> logical next step is to try to find the culprit, right?

You can make that case already, if it is high wouldn't you want to
find the exact VMM process that was making it high?

It is a sign of fire, not a detailed debug tool.

Jason

2023-11-29 00:31:56

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH 00/16] IOMMU memory observability

On Tue, Nov 28, 2023 at 4:28 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Tue, Nov 28, 2023 at 04:25:03PM -0800, Yosry Ahmed wrote:
>
> > > > Right, but as I mention above, if userspace starts depending on this
> > > > equation, we won't be able to add any more classes of "secondary" page
> > > > tables to SecPageTables. I'd like to avoid that if possible. We can do
> > > > the subtraction in the kernel.
> > >
> > > What Sean had suggested was that SecPageTables was always intended to
> > > account all the non-primary mmu memory used by page tables. If this is
> > > the case we shouldn't be trying to break it apart into finer
> > > counters. These are big picture counters, not detailed allocation by
> > > owner counters.
> >
> > Right, I agree with that, but if SecPageTables includes page tables
> > from multiple sources, and it is observed to be suspiciously high, the
> > logical next step is to try to find the culprit, right?
>
> You can make that case already, if it is high wouldn't you want to
> find the exact VMM process that was making it high?
>
> It is a sign of fire, not a detailed debug tool.

Fair enough. We can always add separate counters later if needed,
potentially under KVM stats to get more fine-grained details as you
mentioned.

I am only worried about users subtracting the iommu-only counter to
get a KVM counter. We should at least document that SecPageTables may
be expanded to include other sources later to avoid that.

>
> Jason

2023-11-29 00:55:09

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 00/16] IOMMU memory observability

On Tue, Nov 28, 2023 at 04:30:27PM -0800, Yosry Ahmed wrote:
> On Tue, Nov 28, 2023 at 4:28 PM Jason Gunthorpe <[email protected]> wrote:
> >
> > On Tue, Nov 28, 2023 at 04:25:03PM -0800, Yosry Ahmed wrote:
> >
> > > > > Right, but as I mention above, if userspace starts depending on this
> > > > > equation, we won't be able to add any more classes of "secondary" page
> > > > > tables to SecPageTables. I'd like to avoid that if possible. We can do
> > > > > the subtraction in the kernel.
> > > >
> > > > What Sean had suggested was that SecPageTables was always intended to
> > > > account all the non-primary mmu memory used by page tables. If this is
> > > > the case we shouldn't be trying to break it apart into finer
> > > > counters. These are big picture counters, not detailed allocation by
> > > > owner counters.
> > >
> > > Right, I agree with that, but if SecPageTables includes page tables
> > > from multiple sources, and it is observed to be suspiciously high, the
> > > logical next step is to try to find the culprit, right?
> >
> > You can make that case already, if it is high wouldn't you want to
> > find the exact VMM process that was making it high?
> >
> > It is a sign of fire, not a detailed debug tool.
>
> Fair enough. We can always add separate counters later if needed,
> potentially under KVM stats to get more fine-grained details as you
> mentioned.
>
> I am only worried about users subtracting the iommu-only counter to
> get a KVM counter. We should at least document that SecPageTables may
> be expanded to include other sources later to avoid that.

Well, we just broke it already, anyone thinking it was only kvm
counters is going to be sad now :) As I understand it was already
described to be more general that kvm so probably nothing to do really

Jason

2023-11-29 07:56:09

by Janne Grunau

[permalink] [raw]
Subject: Re: [PATCH 04/16] iommu/io-pgtable-dart: use page allocation function provided by iommu-pages.h

Hej,

On Tue, Nov 28, 2023, at 21:49, Pasha Tatashin wrote:
> Convert iommu/io-pgtable-dart.c to use the new page allocation functions
> provided in iommu-pages.h.
>
> Signed-off-by: Pasha Tatashin <[email protected]>
> ---
> drivers/iommu/io-pgtable-dart.c | 37 +++++++++++++--------------------
> 1 file changed, 14 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/iommu/io-pgtable-dart.c b/drivers/iommu/io-pgtable-dart.c
> index 74b1ef2b96be..ad28031e1e93 100644
> --- a/drivers/iommu/io-pgtable-dart.c
> +++ b/drivers/iommu/io-pgtable-dart.c
> @@ -23,6 +23,7 @@
> #include <linux/types.h>
>
> #include <asm/barrier.h>
> +#include "iommu-pages.h"
>
> #define DART1_MAX_ADDR_BITS 36
>
> @@ -106,18 +107,12 @@ static phys_addr_t iopte_to_paddr(dart_iopte pte,
> return paddr;
> }
>
> -static void *__dart_alloc_pages(size_t size, gfp_t gfp,
> - struct io_pgtable_cfg *cfg)
> +static void *__dart_alloc_pages(size_t size, gfp_t gfp)
> {
> int order = get_order(size);
> - struct page *p;
>
> VM_BUG_ON((gfp & __GFP_HIGHMEM));
> - p = alloc_pages(gfp | __GFP_ZERO, order);
> - if (!p)
> - return NULL;
> -
> - return page_address(p);
> + return iommu_alloc_pages(gfp, order);
> }
>
> static int dart_init_pte(struct dart_io_pgtable *data,
> @@ -262,13 +257,13 @@ static int dart_map_pages(struct io_pgtable_ops
> *ops, unsigned long iova,
>
> /* no L2 table present */
> if (!pte) {
> - cptep = __dart_alloc_pages(tblsz, gfp, cfg);
> + cptep = __dart_alloc_pages(tblsz, gfp);
> if (!cptep)
> return -ENOMEM;
>
> pte = dart_install_table(cptep, ptep, 0, data);
> if (pte)
> - free_pages((unsigned long)cptep, get_order(tblsz));
> + iommu_free_pages(cptep, get_order(tblsz));
>
> /* L2 table is present (now) */
> pte = READ_ONCE(*ptep);
> @@ -419,8 +414,7 @@ apple_dart_alloc_pgtable(struct io_pgtable_cfg
> *cfg, void *cookie)
> cfg->apple_dart_cfg.n_ttbrs = 1 << data->tbl_bits;
>
> for (i = 0; i < cfg->apple_dart_cfg.n_ttbrs; ++i) {
> - data->pgd[i] = __dart_alloc_pages(DART_GRANULE(data), GFP_KERNEL,
> - cfg);
> + data->pgd[i] = __dart_alloc_pages(DART_GRANULE(data), GFP_KERNEL);
> if (!data->pgd[i])
> goto out_free_data;
> cfg->apple_dart_cfg.ttbr[i] = virt_to_phys(data->pgd[i]);
> @@ -429,9 +423,10 @@ apple_dart_alloc_pgtable(struct io_pgtable_cfg
> *cfg, void *cookie)
> return &data->iop;
>
> out_free_data:
> - while (--i >= 0)
> - free_pages((unsigned long)data->pgd[i],
> - get_order(DART_GRANULE(data)));
> + while (--i >= 0) {
> + iommu_free_pages(data->pgd[i],
> + get_order(DART_GRANULE(data)));
> + }
> kfree(data);
> return NULL;
> }
> @@ -439,6 +434,7 @@ apple_dart_alloc_pgtable(struct io_pgtable_cfg
> *cfg, void *cookie)
> static void apple_dart_free_pgtable(struct io_pgtable *iop)
> {
> struct dart_io_pgtable *data = io_pgtable_to_data(iop);
> + int order = get_order(DART_GRANULE(data));
> dart_iopte *ptep, *end;
> int i;
>
> @@ -449,15 +445,10 @@ static void apple_dart_free_pgtable(struct
> io_pgtable *iop)
> while (ptep != end) {
> dart_iopte pte = *ptep++;
>
> - if (pte) {
> - unsigned long page =
> - (unsigned long)iopte_deref(pte, data);
> -
> - free_pages(page, get_order(DART_GRANULE(data)));
> - }
> + if (pte)
> + iommu_free_pages(iopte_deref(pte, data), order);
> }
> - free_pages((unsigned long)data->pgd[i],
> - get_order(DART_GRANULE(data)));
> + iommu_free_pages(data->pgd[i], order);
> }
>
> kfree(data);

Reviewed-by: Janne Grunau <[email protected]>

Janne

2023-11-29 16:49:10

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH 08/16] iommu/fsl: use page allocation function provided by iommu-pages.h

On 28/11/2023 11:50 pm, Jason Gunthorpe wrote:
> On Tue, Nov 28, 2023 at 06:00:13PM -0500, Pasha Tatashin wrote:
>> On Tue, Nov 28, 2023 at 5:53 PM Robin Murphy <[email protected]> wrote:
>>>
>>> On 2023-11-28 8:49 pm, Pasha Tatashin wrote:
>>>> Convert iommu/fsl_pamu.c to use the new page allocation functions
>>>> provided in iommu-pages.h.
>>>
>>> Again, this is not a pagetable. This thing doesn't even *have* pagetables.
>>>
>>> Similar to patches #1 and #2 where you're lumping in configuration
>>> tables which belong to the IOMMU driver itself, as opposed to pagetables
>>> which effectively belong to an IOMMU domain's user. But then there are
>>> still drivers where you're *not* accounting similar configuration
>>> structures, so I really struggle to see how this metric is useful when
>>> it's so completely inconsistent in what it's counting :/
>>
>> The whole IOMMU subsystem allocates a significant amount of kernel
>> locked memory that we want to at least observe. The new field in
>> vmstat does just that: it reports ALL buddy allocator memory that
>> IOMMU allocates. However, for accounting purposes, I agree, we need to
>> do better, and separate at least iommu pagetables from the rest.
>>
>> We can separate the metric into two:
>> iommu pagetable only
>> iommu everything
>>
>> or into three:
>> iommu pagetable only
>> iommu dma
>> iommu everything
>>
>> What do you think?
>
> I think I said this at LPC - if you want to have fine grained
> accounting of memory by owner you need to go talk to the cgroup people
> and come up with something generic. Adding ever open coded finer
> category breakdowns just for iommu doesn't make alot of sense.
>
> You can make some argument that the pagetable memory should be counted
> because kvm counts it's shadow memory, but I wouldn't go into further
> detail than that with hand coded counters..

Right, pagetable memory is interesting since it's something that any
random kernel user can indirectly allocate via iommu_domain_alloc() and
iommu_map(), and some of those users may even be doing so on behalf of
userspace. I have no objection to accounting and potentially applying
limits to *that*.

Beyond that, though, there is nothing special about "the IOMMU
subsystem". The amount of memory an IOMMU driver needs to allocate for
itself in order to function is not of interest beyond curiosity, it just
is what it is; limiting it would only break the IOMMU, and if a user
thinks it's "too much", the only actionable thing that might help is to
physically remove devices from the system. Similar for DMA buffers; it
might be intriguing to account those, but it's not really an actionable
metric - in the overwhelming majority of cases you can't simply tell a
driver to allocate less than what it needs. And that is of course
assuming if we were to account *all* DMA buffers, since whether they
happen to have an IOMMU translation or not is irrelevant (we'd have
already accounted the pagetables as pagetables if so).

I bet "the networking subsystem" also consumes significant memory on the
same kind of big systems where IOMMU pagetables would be of any concern.
I believe some of the some of the "serious" NICs can easily run up
hundreds of megabytes if not gigabytes worth of queues, SKB pools, etc.
- would you propose accounting those too?

Thanks,
Robin.

2023-11-29 19:45:59

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 08/16] iommu/fsl: use page allocation function provided by iommu-pages.h

> >> We can separate the metric into two:
> >> iommu pagetable only
> >> iommu everything
> >>
> >> or into three:
> >> iommu pagetable only
> >> iommu dma
> >> iommu everything
> >>
> >> What do you think?
> >
> > I think I said this at LPC - if you want to have fine grained
> > accounting of memory by owner you need to go talk to the cgroup people
> > and come up with something generic. Adding ever open coded finer
> > category breakdowns just for iommu doesn't make alot of sense.
> >
> > You can make some argument that the pagetable memory should be counted
> > because kvm counts it's shadow memory, but I wouldn't go into further
> > detail than that with hand coded counters..
>
> Right, pagetable memory is interesting since it's something that any
> random kernel user can indirectly allocate via iommu_domain_alloc() and
> iommu_map(), and some of those users may even be doing so on behalf of
> userspace. I have no objection to accounting and potentially applying
> limits to *that*.

Yes, in the next version, I will separate pagetable only from the
rest, for the limits.

> Beyond that, though, there is nothing special about "the IOMMU
> subsystem". The amount of memory an IOMMU driver needs to allocate for
> itself in order to function is not of interest beyond curiosity, it just
> is what it is; limiting it would only break the IOMMU, and if a user

Agree about the amount of memory IOMMU allocates for itself, but that
should be small, if it is not, we have to at least show where the
memory is used.

> thinks it's "too much", the only actionable thing that might help is to
> physically remove devices from the system. Similar for DMA buffers; it
> might be intriguing to account those, but it's not really an actionable
> metric - in the overwhelming majority of cases you can't simply tell a
> driver to allocate less than what it needs. And that is of course
> assuming if we were to account *all* DMA buffers, since whether they
> happen to have an IOMMU translation or not is irrelevant (we'd have
> already accounted the pagetables as pagetables if so).

DMA mappings should be observable (do not have to be limited). At the
very least, it can help with explaining the kernel memory overhead
anomalies on production systems.

> I bet "the networking subsystem" also consumes significant memory on the

It does, and GPU drivers also may consume a significant amount of memory.

> same kind of big systems where IOMMU pagetables would be of any concern.
> I believe some of the some of the "serious" NICs can easily run up
> hundreds of megabytes if not gigabytes worth of queues, SKB pools, etc.
> - would you propose accounting those too?

Yes. Any kind of kernel memory that is proportional to the workload
should be accountable. Someone is using those resources compared to
the idling system, and that someone should be charged.

Pasha

2023-11-29 20:03:22

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 08/16] iommu/fsl: use page allocation function provided by iommu-pages.h

On Wed, Nov 29, 2023 at 02:45:03PM -0500, Pasha Tatashin wrote:

> > same kind of big systems where IOMMU pagetables would be of any concern.
> > I believe some of the some of the "serious" NICs can easily run up
> > hundreds of megabytes if not gigabytes worth of queues, SKB pools, etc.
> > - would you propose accounting those too?
>
> Yes. Any kind of kernel memory that is proportional to the workload
> should be accountable. Someone is using those resources compared to
> the idling system, and that someone should be charged.

There is a difference between charged and accounted

You should be running around adding GFP_KERNEL_ACCOUNT, yes. I already
did a bunch of that work. Split that out from this series and send it
to the right maintainers.

Adding a counter for allocations and showing in procfs is a very
different question. IMHO that should not be done in micro, the
threshold to add a new counter should be high.

There is definately room for a generic debugging feature to break down
GFP_KERNEL_ACCOUNT by owernship somehow. Maybe it can already be done
with BPF. IDK

Jason

2023-11-29 20:45:34

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 08/16] iommu/fsl: use page allocation function provided by iommu-pages.h

On Wed, Nov 29, 2023 at 3:03 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Wed, Nov 29, 2023 at 02:45:03PM -0500, Pasha Tatashin wrote:
>
> > > same kind of big systems where IOMMU pagetables would be of any concern.
> > > I believe some of the some of the "serious" NICs can easily run up
> > > hundreds of megabytes if not gigabytes worth of queues, SKB pools, etc.
> > > - would you propose accounting those too?
> >
> > Yes. Any kind of kernel memory that is proportional to the workload
> > should be accountable. Someone is using those resources compared to
> > the idling system, and that someone should be charged.
>
> There is a difference between charged and accounted
>
> You should be running around adding GFP_KERNEL_ACCOUNT, yes. I already
> did a bunch of that work. Split that out from this series and send it
> to the right maintainers.

I will do that.

>
> Adding a counter for allocations and showing in procfs is a very
> different question. IMHO that should not be done in micro, the
> threshold to add a new counter should be high.

I agree, /proc/meminfo, should not include everything, however overall
network consumption that includes memory allocated by network driver
would be useful to have, may be it should be exported by device
drivers and added to the protocol memory. We already have network
protocol memory consumption in procfs:

# awk '{printf "%-10s %s\n", $1, $4}' /proc/net/protocols | grep -v '\-1'
protocol memory
UDPv6 22673
TCPv6 16961

> There is definately room for a generic debugging feature to break down
> GFP_KERNEL_ACCOUNT by owernship somehow. Maybe it can already be done
> with BPF. IDK

2023-11-29 21:37:09

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 16/16] vfio: account iommu allocations

On Tue, Nov 28, 2023 at 6:53 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Tue, Nov 28, 2023 at 08:49:38PM +0000, Pasha Tatashin wrote:
> > iommu allocations should be accounted in order to allow admins to
> > monitor and limit the amount of iommu memory.
> >
> > Signed-off-by: Pasha Tatashin <[email protected]>
> > ---
> > drivers/vfio/vfio_iommu_type1.c | 8 +++++---
> > 1 file changed, 5 insertions(+), 3 deletions(-)
>
> You should send the seperately and directly to Alex.

Thanks, I will.

>
> Jason

2023-11-29 22:01:31

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 09/16] iommu/iommufd: use page allocation function provided by iommu-pages.h

On Tue, Nov 28, 2023 at 6:52 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Tue, Nov 28, 2023 at 08:49:31PM +0000, Pasha Tatashin wrote:
> > Convert iommu/iommufd/* files to use the new page allocation functions
> > provided in iommu-pages.h.
> >
> > Signed-off-by: Pasha Tatashin <[email protected]>
> > ---
> > drivers/iommu/iommufd/iova_bitmap.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
>
> This is a short term allocation, it should not be counted, that is why
> it is already not using GFP_KERNEL_ACCOUNT.

I made this change for completeness. I changed all calls to
get_free_page/alloc_page etc under driver/iommu to use the
iommu_alloc_* variants, this also helps future developers in this area
to use the right allocation functions.
The accounting is implemented using cheap per-cpu counters, so should
not affect the performance, I think it is OK to keep them here.

2023-11-30 00:03:12

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 09/16] iommu/iommufd: use page allocation function provided by iommu-pages.h

On Wed, Nov 29, 2023 at 04:59:43PM -0500, Pasha Tatashin wrote:
> On Tue, Nov 28, 2023 at 6:52 PM Jason Gunthorpe <[email protected]> wrote:
> >
> > On Tue, Nov 28, 2023 at 08:49:31PM +0000, Pasha Tatashin wrote:
> > > Convert iommu/iommufd/* files to use the new page allocation functions
> > > provided in iommu-pages.h.
> > >
> > > Signed-off-by: Pasha Tatashin <[email protected]>
> > > ---
> > > drivers/iommu/iommufd/iova_bitmap.c | 4 ++--
> > > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > This is a short term allocation, it should not be counted, that is why
> > it is already not using GFP_KERNEL_ACCOUNT.
>
> I made this change for completeness. I changed all calls to
> get_free_page/alloc_page etc under driver/iommu to use the
> iommu_alloc_* variants, this also helps future developers in this area
> to use the right allocation functions.
> The accounting is implemented using cheap per-cpu counters, so should
> not affect the performance, I think it is OK to keep them here.

Except it is a mis use of an API that should only be used for page
table memory :(

Jason

2023-11-30 14:05:15

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 13/16] iommu: observability of the IOMMU allocations

Hi Pasha,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on awilliam-vfio/for-linus linus/master v6.7-rc3]
[cannot apply to joro-iommu/next awilliam-vfio/next next-20231130]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Pasha-Tatashin/iommu-vt-d-add-wrapper-functions-for-page-allocations/20231129-054908
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20231128204938.1453583-14-pasha.tatashin%40soleen.com
patch subject: [PATCH 13/16] iommu: observability of the IOMMU allocations
config: sparc64-randconfig-r054-20231130 (https://download.01.org/0day-ci/archive/20231130/[email protected]/config)
compiler: sparc64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231130/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

In file included from drivers/iommu/iommufd/iova_bitmap.c:11:
drivers/iommu/iommufd/../iommu-pages.h: In function '__iommu_alloc_account':
>> drivers/iommu/iommufd/../iommu-pages.h:29:48: error: 'NR_IOMMU_PAGES' undeclared (first use in this function)
29 | mod_node_page_state(page_pgdat(pages), NR_IOMMU_PAGES, pgcnt);
| ^~~~~~~~~~~~~~
drivers/iommu/iommufd/../iommu-pages.h:29:48: note: each undeclared identifier is reported only once for each function it appears in
drivers/iommu/iommufd/../iommu-pages.h: In function '__iommu_free_account':
drivers/iommu/iommufd/../iommu-pages.h:41:48: error: 'NR_IOMMU_PAGES' undeclared (first use in this function)
41 | mod_node_page_state(page_pgdat(pages), NR_IOMMU_PAGES, -pgcnt);
| ^~~~~~~~~~~~~~


vim +/NR_IOMMU_PAGES +29 drivers/iommu/iommufd/../iommu-pages.h

13
14 /*
15 * All page allocation that are performed in the IOMMU subsystem must use one of
16 * the functions below. This is necessary for the proper accounting as IOMMU
17 * state can be rather large, i.e. multiple gigabytes in size.
18 */
19
20 /**
21 * __iommu_alloc_account - account for newly allocated page.
22 * @pages: head struct page of the page.
23 * @order: order of the page
24 */
25 static inline void __iommu_alloc_account(struct page *pages, int order)
26 {
27 const long pgcnt = 1l << order;
28
> 29 mod_node_page_state(page_pgdat(pages), NR_IOMMU_PAGES, pgcnt);
30 }
31

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-12-25 16:10:30

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH 15/16] vhost-vdpa: account iommu allocations

On Tue, Nov 28, 2023 at 08:49:37PM +0000, Pasha Tatashin wrote:
> iommu allocations should be accounted in order to allow admins to
> monitor and limit the amount of iommu memory.
>
> Signed-off-by: Pasha Tatashin <[email protected]>


Acked-by: Michael S. Tsirkin <[email protected]>


> ---
> drivers/vhost/vdpa.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index da7ec77cdaff..a51c69c078d9 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -968,7 +968,8 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
> r = ops->set_map(vdpa, asid, iotlb);
> } else {
> r = iommu_map(v->domain, iova, pa, size,
> - perm_to_iommu_flags(perm), GFP_KERNEL);
> + perm_to_iommu_flags(perm),
> + GFP_KERNEL_ACCOUNT);
> }
> if (r) {
> vhost_iotlb_del_range(iotlb, iova, iova + size - 1);
> --
> 2.43.0.rc2.451.g8631bc7472-goog


2023-12-26 18:24:35

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [PATCH 15/16] vhost-vdpa: account iommu allocations

On Mon, Dec 25, 2023 at 11:09 AM Michael S. Tsirkin <[email protected]> wrote:
>
> On Tue, Nov 28, 2023 at 08:49:37PM +0000, Pasha Tatashin wrote:
> > iommu allocations should be accounted in order to allow admins to
> > monitor and limit the amount of iommu memory.
> >
> > Signed-off-by: Pasha Tatashin <[email protected]>
>
>
> Acked-by: Michael S. Tsirkin <[email protected]>

Thank you,
Pasha