2017-09-21 15:52:53

by Robin Murphy

[permalink] [raw]
Subject: [PATCH v5 0/6] Optimise 64-bit IOVA allocations

v4: https://www.mail-archive.com/[email protected]/msg1493704.html

Right, this is hopefully the last version - I've put things back in a
sensible order with the new additions at the end, so if they prove
contentious the first 4 previously-tested patches can still get their
time in -next. Patch #3 is updated to fix the bug brought to light by
Tomasz, patch #6 includes the cleanup afterthought from v4, and I've
nobbled one or two rogue whitespace changes I'd missed before.

And the diffstat is still even more negative than before, hooray! For
the whole series, the total code size reduction of alloc_iova() comes
to just over 26% (AArch64 GCC 6.3.1).

Robin.


Robin Murphy (3):
iommu/iova: Extend rbtree node caching
iommu/iova: Add rbtree anchor node
iommu/iova: Simplify cached node logic

Zhen Lei (3):
iommu/iova: Optimise rbtree searching
iommu/iova: Optimise the padding calculation
iommu/iova: Make dma_32bit_pfn implicit

drivers/gpu/drm/tegra/drm.c | 3 +-
drivers/gpu/host1x/dev.c | 3 +-
drivers/iommu/amd_iommu.c | 7 +-
drivers/iommu/dma-iommu.c | 18 +-----
drivers/iommu/intel-iommu.c | 11 +---
drivers/iommu/iova.c | 135 ++++++++++++++++-----------------------
drivers/misc/mic/scif/scif_rma.c | 3 +-
include/linux/iova.h | 9 +--
8 files changed, 69 insertions(+), 120 deletions(-)

--
2.13.4.dirty


2017-09-21 15:53:00

by Robin Murphy

[permalink] [raw]
Subject: [PATCH v5 1/6] iommu/iova: Optimise rbtree searching

From: Zhen Lei <[email protected]>

Checking the IOVA bounds separately before deciding which direction to
continue the search (if necessary) results in redundantly comparing both
pfns twice each. GCC can already determine that the final comparison op
is redundant and optimise it down to 3 in total, but we can go one
further with a little tweak of the ordering (which makes the intent of
the code that much cleaner as a bonus).

Signed-off-by: Zhen Lei <[email protected]>
Tested-by: Ard Biesheuvel <[email protected]>
Tested-by: Zhen Lei <[email protected]>
Tested-by: Nate Watterson <[email protected]>
[rm: rewrote commit message to clarify]
Signed-off-by: Robin Murphy <[email protected]>
---

v5: No change

drivers/iommu/iova.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 33edfa794ae9..f129ff4f5c89 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -342,15 +342,12 @@ private_find_iova(struct iova_domain *iovad, unsigned long pfn)
while (node) {
struct iova *iova = rb_entry(node, struct iova, node);

- /* If pfn falls within iova's range, return iova */
- if ((pfn >= iova->pfn_lo) && (pfn <= iova->pfn_hi)) {
- return iova;
- }
-
if (pfn < iova->pfn_lo)
node = node->rb_left;
- else if (pfn > iova->pfn_lo)
+ else if (pfn > iova->pfn_hi)
node = node->rb_right;
+ else
+ return iova; /* pfn falls within iova's range */
}

return NULL;
--
2.13.4.dirty

2017-09-21 15:53:04

by Robin Murphy

[permalink] [raw]
Subject: [PATCH v5 2/6] iommu/iova: Optimise the padding calculation

From: Zhen Lei <[email protected]>

The mask for calculating the padding size doesn't change, so there's no
need to recalculate it every loop iteration. Furthermore, Once we've
done that, it becomes clear that we don't actually need to calculate a
padding size at all - by flipping the arithmetic around, we can just
combine the upper limit, size, and mask directly to check against the
lower limit.

For an arm64 build, this alone knocks 20% off the object code size of
the entire alloc_iova() function!

Signed-off-by: Zhen Lei <[email protected]>
Tested-by: Ard Biesheuvel <[email protected]>
Tested-by: Zhen Lei <[email protected]>
Tested-by: Nate Watterson <[email protected]>
[rm: simplified more of the arithmetic, rewrote commit message]
Signed-off-by: Robin Murphy <[email protected]>
---

v5: No change

drivers/iommu/iova.c | 42 +++++++++++++++---------------------------
1 file changed, 15 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index f129ff4f5c89..20be9a8b3188 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -182,24 +182,17 @@ iova_insert_rbtree(struct rb_root *root, struct iova *iova,
rb_insert_color(&iova->node, root);
}

-/*
- * Computes the padding size required, to make the start address
- * naturally aligned on the power-of-two order of its size
- */
-static unsigned int
-iova_get_pad_size(unsigned int size, unsigned int limit_pfn)
-{
- return (limit_pfn - size) & (__roundup_pow_of_two(size) - 1);
-}
-
static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
unsigned long size, unsigned long limit_pfn,
struct iova *new, bool size_aligned)
{
struct rb_node *prev, *curr = NULL;
unsigned long flags;
- unsigned long saved_pfn;
- unsigned int pad_size = 0;
+ unsigned long saved_pfn, new_pfn;
+ unsigned long align_mask = ~0UL;
+
+ if (size_aligned)
+ align_mask <<= fls_long(size - 1);

/* Walk the tree backwards */
spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
@@ -209,31 +202,26 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
while (curr) {
struct iova *curr_iova = rb_entry(curr, struct iova, node);

- if (limit_pfn <= curr_iova->pfn_lo) {
+ if (limit_pfn <= curr_iova->pfn_lo)
goto move_left;
- } else if (limit_pfn > curr_iova->pfn_hi) {
- if (size_aligned)
- pad_size = iova_get_pad_size(size, limit_pfn);
- if ((curr_iova->pfn_hi + size + pad_size) < limit_pfn)
- break; /* found a free slot */
- }
+
+ if (((limit_pfn - size) & align_mask) > curr_iova->pfn_hi)
+ break; /* found a free slot */
+
limit_pfn = curr_iova->pfn_lo;
move_left:
prev = curr;
curr = rb_prev(curr);
}

- if (!curr) {
- if (size_aligned)
- pad_size = iova_get_pad_size(size, limit_pfn);
- if ((iovad->start_pfn + size + pad_size) > limit_pfn) {
- spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
- return -ENOMEM;
- }
+ new_pfn = (limit_pfn - size) & align_mask;
+ if (limit_pfn < size || new_pfn < iovad->start_pfn) {
+ spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+ return -ENOMEM;
}

/* pfn_lo will point to size aligned address if size_aligned is set */
- new->pfn_lo = limit_pfn - (size + pad_size);
+ new->pfn_lo = new_pfn;
new->pfn_hi = new->pfn_lo + size - 1;

/* If we have 'prev', it's a valid place to start the insertion. */
--
2.13.4.dirty

2017-09-21 15:53:09

by Robin Murphy

[permalink] [raw]
Subject: [PATCH v5 4/6] iommu/iova: Make dma_32bit_pfn implicit

From: Zhen Lei <[email protected]>

Now that the cached node optimisation can apply to all allocations, the
couple of users which were playing tricks with dma_32bit_pfn in order to
benefit from it can stop doing so. Conversely, there is also no need for
all the other users to explicitly calculate a 'real' 32-bit PFN, when
init_iova_domain() can happily do that itself from the page granularity.

CC: Thierry Reding <[email protected]>
CC: Jonathan Hunter <[email protected]>
CC: David Airlie <[email protected]>
CC: Sudeep Dutt <[email protected]>
CC: Ashutosh Dixit <[email protected]>
Signed-off-by: Zhen Lei <[email protected]>
Tested-by: Ard Biesheuvel <[email protected]>
Tested-by: Zhen Lei <[email protected]>
Tested-by: Nate Watterson <[email protected]>
[rm: use iova_shift(), rewrote commit message]
Signed-off-by: Robin Murphy <[email protected]>
---

v5: No change

drivers/gpu/drm/tegra/drm.c | 3 +--
drivers/gpu/host1x/dev.c | 3 +--
drivers/iommu/amd_iommu.c | 7 ++-----
drivers/iommu/dma-iommu.c | 18 +-----------------
drivers/iommu/intel-iommu.c | 11 +++--------
drivers/iommu/iova.c | 4 ++--
drivers/misc/mic/scif/scif_rma.c | 3 +--
include/linux/iova.h | 5 ++---
8 files changed, 13 insertions(+), 41 deletions(-)

diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
index 597d563d636a..b822e484b7e5 100644
--- a/drivers/gpu/drm/tegra/drm.c
+++ b/drivers/gpu/drm/tegra/drm.c
@@ -155,8 +155,7 @@ static int tegra_drm_load(struct drm_device *drm, unsigned long flags)

order = __ffs(tegra->domain->pgsize_bitmap);
init_iova_domain(&tegra->carveout.domain, 1UL << order,
- carveout_start >> order,
- carveout_end >> order);
+ carveout_start >> order);

tegra->carveout.shift = iova_shift(&tegra->carveout.domain);
tegra->carveout.limit = carveout_end >> tegra->carveout.shift;
diff --git a/drivers/gpu/host1x/dev.c b/drivers/gpu/host1x/dev.c
index 7f22c5c37660..5267c62e8896 100644
--- a/drivers/gpu/host1x/dev.c
+++ b/drivers/gpu/host1x/dev.c
@@ -198,8 +198,7 @@ static int host1x_probe(struct platform_device *pdev)

order = __ffs(host->domain->pgsize_bitmap);
init_iova_domain(&host->iova, 1UL << order,
- geometry->aperture_start >> order,
- geometry->aperture_end >> order);
+ geometry->aperture_start >> order);
host->iova_end = geometry->aperture_end;
}

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 51f8215877f5..647ab7691aee 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -63,7 +63,6 @@
/* IO virtual address start page frame number */
#define IOVA_START_PFN (1)
#define IOVA_PFN(addr) ((addr) >> PAGE_SHIFT)
-#define DMA_32BIT_PFN IOVA_PFN(DMA_BIT_MASK(32))

/* Reserved IOVA ranges */
#define MSI_RANGE_START (0xfee00000)
@@ -1788,8 +1787,7 @@ static struct dma_ops_domain *dma_ops_domain_alloc(void)
if (!dma_dom->domain.pt_root)
goto free_dma_dom;

- init_iova_domain(&dma_dom->iovad, PAGE_SIZE,
- IOVA_START_PFN, DMA_32BIT_PFN);
+ init_iova_domain(&dma_dom->iovad, PAGE_SIZE, IOVA_START_PFN);

if (init_iova_flush_queue(&dma_dom->iovad, iova_domain_flush_tlb, NULL))
goto free_dma_dom;
@@ -2696,8 +2694,7 @@ static int init_reserved_iova_ranges(void)
struct pci_dev *pdev = NULL;
struct iova *val;

- init_iova_domain(&reserved_iova_ranges, PAGE_SIZE,
- IOVA_START_PFN, DMA_32BIT_PFN);
+ init_iova_domain(&reserved_iova_ranges, PAGE_SIZE, IOVA_START_PFN);

lockdep_set_class(&reserved_iova_ranges.iova_rbtree_lock,
&reserved_rbtree_key);
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 9d1cebe7f6cb..191be9c80a8a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -292,18 +292,7 @@ int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
/* ...then finally give it a kicking to make sure it fits */
base_pfn = max_t(unsigned long, base_pfn,
domain->geometry.aperture_start >> order);
- end_pfn = min_t(unsigned long, end_pfn,
- domain->geometry.aperture_end >> order);
}
- /*
- * PCI devices may have larger DMA masks, but still prefer allocating
- * within a 32-bit mask to avoid DAC addressing. Such limitations don't
- * apply to the typical platform device, so for those we may as well
- * leave the cache limit at the top of their range to save an rb_last()
- * traversal on every allocation.
- */
- if (dev && dev_is_pci(dev))
- end_pfn &= DMA_BIT_MASK(32) >> order;

/* start_pfn is always nonzero for an already-initialised domain */
if (iovad->start_pfn) {
@@ -312,16 +301,11 @@ int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
pr_warn("Incompatible range for DMA domain\n");
return -EFAULT;
}
- /*
- * If we have devices with different DMA masks, move the free
- * area cache limit down for the benefit of the smaller one.
- */
- iovad->dma_32bit_pfn = min(end_pfn + 1, iovad->dma_32bit_pfn);

return 0;
}

- init_iova_domain(iovad, 1UL << order, base_pfn, end_pfn);
+ init_iova_domain(iovad, 1UL << order, base_pfn);
if (!dev)
return 0;

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 6784a05dd6b2..ebb48353dd39 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -82,8 +82,6 @@
#define IOVA_START_PFN (1)

#define IOVA_PFN(addr) ((addr) >> PAGE_SHIFT)
-#define DMA_32BIT_PFN IOVA_PFN(DMA_BIT_MASK(32))
-#define DMA_64BIT_PFN IOVA_PFN(DMA_BIT_MASK(64))

/* page table handling */
#define LEVEL_STRIDE (9)
@@ -1878,8 +1876,7 @@ static int dmar_init_reserved_ranges(void)
struct iova *iova;
int i;

- init_iova_domain(&reserved_iova_list, VTD_PAGE_SIZE, IOVA_START_PFN,
- DMA_32BIT_PFN);
+ init_iova_domain(&reserved_iova_list, VTD_PAGE_SIZE, IOVA_START_PFN);

lockdep_set_class(&reserved_iova_list.iova_rbtree_lock,
&reserved_rbtree_key);
@@ -1938,8 +1935,7 @@ static int domain_init(struct dmar_domain *domain, struct intel_iommu *iommu,
unsigned long sagaw;
int err;

- init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN,
- DMA_32BIT_PFN);
+ init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN);

err = init_iova_flush_queue(&domain->iovad,
iommu_flush_iova, iova_entry_free);
@@ -4897,8 +4893,7 @@ static int md_domain_init(struct dmar_domain *domain, int guest_width)
{
int adjust_width;

- init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN,
- DMA_32BIT_PFN);
+ init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN);
domain_reserve_special_ranges(domain);

/* calculate AGAW */
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index c6f5a22f8d20..65032e60a5d1 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -37,7 +37,7 @@ static void fq_flush_timeout(unsigned long data);

void
init_iova_domain(struct iova_domain *iovad, unsigned long granule,
- unsigned long start_pfn, unsigned long pfn_32bit)
+ unsigned long start_pfn)
{
/*
* IOVA granularity will normally be equal to the smallest
@@ -52,7 +52,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
iovad->cached32_node = NULL;
iovad->granule = granule;
iovad->start_pfn = start_pfn;
- iovad->dma_32bit_pfn = pfn_32bit + 1;
+ iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
iovad->flush_cb = NULL;
iovad->fq = NULL;
init_iova_rcaches(iovad);
diff --git a/drivers/misc/mic/scif/scif_rma.c b/drivers/misc/mic/scif/scif_rma.c
index 329727e00e97..c824329f7012 100644
--- a/drivers/misc/mic/scif/scif_rma.c
+++ b/drivers/misc/mic/scif/scif_rma.c
@@ -39,8 +39,7 @@ void scif_rma_ep_init(struct scif_endpt *ep)
struct scif_endpt_rma_info *rma = &ep->rma_info;

mutex_init(&rma->rma_lock);
- init_iova_domain(&rma->iovad, PAGE_SIZE, SCIF_IOVA_START_PFN,
- SCIF_DMA_64BIT_PFN);
+ init_iova_domain(&rma->iovad, PAGE_SIZE, SCIF_IOVA_START_PFN);
spin_lock_init(&rma->tc_lock);
mutex_init(&rma->mmn_lock);
INIT_LIST_HEAD(&rma->reg_list);
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 69ea3e258ff2..953cfd20f152 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -154,7 +154,7 @@ struct iova *reserve_iova(struct iova_domain *iovad, unsigned long pfn_lo,
unsigned long pfn_hi);
void copy_reserved_iova(struct iova_domain *from, struct iova_domain *to);
void init_iova_domain(struct iova_domain *iovad, unsigned long granule,
- unsigned long start_pfn, unsigned long pfn_32bit);
+ unsigned long start_pfn);
int init_iova_flush_queue(struct iova_domain *iovad,
iova_flush_cb flush_cb, iova_entry_dtor entry_dtor);
struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
@@ -230,8 +230,7 @@ static inline void copy_reserved_iova(struct iova_domain *from,

static inline void init_iova_domain(struct iova_domain *iovad,
unsigned long granule,
- unsigned long start_pfn,
- unsigned long pfn_32bit)
+ unsigned long start_pfn)
{
}

--
2.13.4.dirty

2017-09-21 15:53:12

by Robin Murphy

[permalink] [raw]
Subject: [PATCH v5 6/6] iommu/iova: Simplify cached node logic

The logic of __get_cached_rbnode() is a little obtuse, but then
__get_prev_node_of_cached_rbnode_or_last_node_and_update_limit_pfn()
wouldn't exactly roll off the tongue...

Now that we have the invariant that there is always a valid node to
start searching downwards from, everything gets a bit easier to follow
if we simplify that function to do what it says on the tin and return
the cached node (or anchor node as appropriate) directly. In turn, we
can then deduplicate the rb_prev() and limit_pfn logic into the main
loop itself, further reduce the amount of code under the lock, and
generally make the inner workings a bit less subtle.

Signed-off-by: Robin Murphy <[email protected]>
---

v5: Make cached nodes always valid

drivers/iommu/iova.c | 51 +++++++++++++++++----------------------------------
1 file changed, 17 insertions(+), 34 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 9e04c1f3e740..7b7363518733 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -51,8 +51,8 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,

spin_lock_init(&iovad->iova_rbtree_lock);
iovad->rbroot = RB_ROOT;
- iovad->cached_node = NULL;
- iovad->cached32_node = NULL;
+ iovad->cached_node = &iovad->anchor.node;
+ iovad->cached32_node = &iovad->anchor.node;
iovad->granule = granule;
iovad->start_pfn = start_pfn;
iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
@@ -115,22 +115,12 @@ int init_iova_flush_queue(struct iova_domain *iovad,
EXPORT_SYMBOL_GPL(init_iova_flush_queue);

static struct rb_node *
-__get_cached_rbnode(struct iova_domain *iovad, unsigned long *limit_pfn)
+__get_cached_rbnode(struct iova_domain *iovad, unsigned long limit_pfn)
{
- struct rb_node *cached_node = NULL;
- struct iova *curr_iova;
+ if (limit_pfn <= iovad->dma_32bit_pfn)
+ return iovad->cached32_node;

- if (*limit_pfn <= iovad->dma_32bit_pfn)
- cached_node = iovad->cached32_node;
- if (!cached_node)
- cached_node = iovad->cached_node;
- if (!cached_node)
- return rb_prev(&iovad->anchor.node);
-
- curr_iova = rb_entry(cached_node, struct iova, node);
- *limit_pfn = min(*limit_pfn, curr_iova->pfn_lo);
-
- return rb_prev(cached_node);
+ return iovad->cached_node;
}

static void
@@ -149,11 +139,11 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)

cached_iova = rb_entry(iovad->cached32_node, struct iova, node);
if (free->pfn_hi < iovad->dma_32bit_pfn &&
- iovad->cached32_node && free->pfn_lo >= cached_iova->pfn_lo)
+ free->pfn_lo >= cached_iova->pfn_lo)
iovad->cached32_node = rb_next(&free->node);

cached_iova = rb_entry(iovad->cached_node, struct iova, node);
- if (iovad->cached_node && free->pfn_lo >= cached_iova->pfn_lo)
+ if (free->pfn_lo >= cached_iova->pfn_lo)
iovad->cached_node = rb_next(&free->node);
}

@@ -189,7 +179,8 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
unsigned long size, unsigned long limit_pfn,
struct iova *new, bool size_aligned)
{
- struct rb_node *prev, *curr = NULL;
+ struct rb_node *curr, *prev;
+ struct iova *curr_iova;
unsigned long flags;
unsigned long new_pfn;
unsigned long align_mask = ~0UL;
@@ -199,24 +190,16 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,

/* Walk the tree backwards */
spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
- curr = __get_cached_rbnode(iovad, &limit_pfn);
- prev = curr;
- while (curr) {
- struct iova *curr_iova = rb_entry(curr, struct iova, node);
-
- if (limit_pfn <= curr_iova->pfn_lo)
- goto move_left;
-
- if (((limit_pfn - size) & align_mask) > curr_iova->pfn_hi)
- break; /* found a free slot */
-
- limit_pfn = curr_iova->pfn_lo;
-move_left:
+ curr = __get_cached_rbnode(iovad, limit_pfn);
+ curr_iova = rb_entry(curr, struct iova, node);
+ do {
+ limit_pfn = min(limit_pfn, curr_iova->pfn_lo);
+ new_pfn = (limit_pfn - size) & align_mask;
prev = curr;
curr = rb_prev(curr);
- }
+ curr_iova = rb_entry(curr, struct iova, node);
+ } while (curr && new_pfn <= curr_iova->pfn_hi);

- new_pfn = (limit_pfn - size) & align_mask;
if (limit_pfn < size || new_pfn < iovad->start_pfn) {
spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
return -ENOMEM;
--
2.13.4.dirty

2017-09-21 15:53:31

by Robin Murphy

[permalink] [raw]
Subject: [PATCH v5 5/6] iommu/iova: Add rbtree anchor node

Add a permanent dummy IOVA reservation to the rbtree, such that we can
always access the top of the address space instantly. The immediate
benefit is that we remove the overhead of the rb_last() traversal when
not using the cached node, but it also paves the way for further
simplifications.

Signed-off-by: Robin Murphy <[email protected]>
---

v5: No change

drivers/iommu/iova.c | 15 +++++++++++++--
include/linux/iova.h | 1 +
2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 65032e60a5d1..9e04c1f3e740 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -24,6 +24,9 @@
#include <linux/bitops.h>
#include <linux/cpu.h>

+/* The anchor node sits above the top of the usable address space */
+#define IOVA_ANCHOR ~0UL
+
static bool iova_rcache_insert(struct iova_domain *iovad,
unsigned long pfn,
unsigned long size);
@@ -55,6 +58,9 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
iovad->flush_cb = NULL;
iovad->fq = NULL;
+ iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
+ rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node);
+ rb_insert_color(&iovad->anchor.node, &iovad->rbroot);
init_iova_rcaches(iovad);
}
EXPORT_SYMBOL_GPL(init_iova_domain);
@@ -119,7 +125,7 @@ __get_cached_rbnode(struct iova_domain *iovad, unsigned long *limit_pfn)
if (!cached_node)
cached_node = iovad->cached_node;
if (!cached_node)
- return rb_last(&iovad->rbroot);
+ return rb_prev(&iovad->anchor.node);

curr_iova = rb_entry(cached_node, struct iova, node);
*limit_pfn = min(*limit_pfn, curr_iova->pfn_lo);
@@ -242,7 +248,8 @@ EXPORT_SYMBOL(alloc_iova_mem);

void free_iova_mem(struct iova *iova)
{
- kmem_cache_free(iova_cache, iova);
+ if (iova->pfn_lo != IOVA_ANCHOR)
+ kmem_cache_free(iova_cache, iova);
}
EXPORT_SYMBOL(free_iova_mem);

@@ -676,6 +683,10 @@ reserve_iova(struct iova_domain *iovad,
struct iova *iova;
unsigned int overlap = 0;

+ /* Don't allow nonsensical pfns */
+ if (WARN_ON((pfn_hi | pfn_lo) > (ULLONG_MAX >> iova_shift(iovad))))
+ return NULL;
+
spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
for (node = rb_first(&iovad->rbroot); node; node = rb_next(node)) {
if (__is_range_overlap(node, pfn_lo, pfn_hi)) {
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 953cfd20f152..c696ee81054e 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -75,6 +75,7 @@ struct iova_domain {
unsigned long granule; /* pfn granularity for this domain */
unsigned long start_pfn; /* Lower limit for this domain */
unsigned long dma_32bit_pfn;
+ struct iova anchor; /* rbtree lookup anchor */
struct iova_rcache rcaches[IOVA_RANGE_CACHE_MAX_SIZE]; /* IOVA range caches */

iova_flush_cb flush_cb; /* Call-Back function to flush IOMMU
--
2.13.4.dirty

2017-09-21 15:53:54

by Robin Murphy

[permalink] [raw]
Subject: [PATCH v5 3/6] iommu/iova: Extend rbtree node caching

The cached node mechanism provides a significant performance benefit for
allocations using a 32-bit DMA mask, but in the case of non-PCI devices
or where the 32-bit space is full, the loss of this benefit can be
significant - on large systems there can be many thousands of entries in
the tree, such that walking all the way down to find free space every
time becomes increasingly awful.

Maintain a similar cached node for the whole IOVA space as a superset of
the 32-bit space so that performance can remain much more consistent.

Inspired by work by Zhen Lei <[email protected]>.

Tested-by: Ard Biesheuvel <[email protected]>
Tested-by: Zhen Lei <[email protected]>
Tested-by: Nate Watterson <[email protected]>
Signed-off-by: Robin Murphy <[email protected]>
---

v5: Fixed __cached_rbnode_delete_update() logic to update both nodes
when necessary

drivers/iommu/iova.c | 60 ++++++++++++++++++++++++----------------------------
include/linux/iova.h | 3 ++-
2 files changed, 30 insertions(+), 33 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 20be9a8b3188..c6f5a22f8d20 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -48,6 +48,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,

spin_lock_init(&iovad->iova_rbtree_lock);
iovad->rbroot = RB_ROOT;
+ iovad->cached_node = NULL;
iovad->cached32_node = NULL;
iovad->granule = granule;
iovad->start_pfn = start_pfn;
@@ -110,48 +111,44 @@ EXPORT_SYMBOL_GPL(init_iova_flush_queue);
static struct rb_node *
__get_cached_rbnode(struct iova_domain *iovad, unsigned long *limit_pfn)
{
- if ((*limit_pfn > iovad->dma_32bit_pfn) ||
- (iovad->cached32_node == NULL))
+ struct rb_node *cached_node = NULL;
+ struct iova *curr_iova;
+
+ if (*limit_pfn <= iovad->dma_32bit_pfn)
+ cached_node = iovad->cached32_node;
+ if (!cached_node)
+ cached_node = iovad->cached_node;
+ if (!cached_node)
return rb_last(&iovad->rbroot);
- else {
- struct rb_node *prev_node = rb_prev(iovad->cached32_node);
- struct iova *curr_iova =
- rb_entry(iovad->cached32_node, struct iova, node);
- *limit_pfn = curr_iova->pfn_lo;
- return prev_node;
- }
+
+ curr_iova = rb_entry(cached_node, struct iova, node);
+ *limit_pfn = min(*limit_pfn, curr_iova->pfn_lo);
+
+ return rb_prev(cached_node);
}

static void
-__cached_rbnode_insert_update(struct iova_domain *iovad,
- unsigned long limit_pfn, struct iova *new)
+__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new)
{
- if (limit_pfn != iovad->dma_32bit_pfn)
- return;
- iovad->cached32_node = &new->node;
+ if (new->pfn_hi < iovad->dma_32bit_pfn)
+ iovad->cached32_node = &new->node;
+ else
+ iovad->cached_node = &new->node;
}

static void
__cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
{
struct iova *cached_iova;
- struct rb_node *curr;

- if (!iovad->cached32_node)
- return;
- curr = iovad->cached32_node;
- cached_iova = rb_entry(curr, struct iova, node);
+ cached_iova = rb_entry(iovad->cached32_node, struct iova, node);
+ if (free->pfn_hi < iovad->dma_32bit_pfn &&
+ iovad->cached32_node && free->pfn_lo >= cached_iova->pfn_lo)
+ iovad->cached32_node = rb_next(&free->node);

- if (free->pfn_lo >= cached_iova->pfn_lo) {
- struct rb_node *node = rb_next(&free->node);
- struct iova *iova = rb_entry(node, struct iova, node);
-
- /* only cache if it's below 32bit pfn */
- if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
- iovad->cached32_node = node;
- else
- iovad->cached32_node = NULL;
- }
+ cached_iova = rb_entry(iovad->cached_node, struct iova, node);
+ if (iovad->cached_node && free->pfn_lo >= cached_iova->pfn_lo)
+ iovad->cached_node = rb_next(&free->node);
}

/* Insert the iova into domain rbtree by holding writer lock */
@@ -188,7 +185,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
{
struct rb_node *prev, *curr = NULL;
unsigned long flags;
- unsigned long saved_pfn, new_pfn;
+ unsigned long new_pfn;
unsigned long align_mask = ~0UL;

if (size_aligned)
@@ -196,7 +193,6 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,

/* Walk the tree backwards */
spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
- saved_pfn = limit_pfn;
curr = __get_cached_rbnode(iovad, &limit_pfn);
prev = curr;
while (curr) {
@@ -226,7 +222,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,

/* If we have 'prev', it's a valid place to start the insertion. */
iova_insert_rbtree(&iovad->rbroot, new, prev);
- __cached_rbnode_insert_update(iovad, saved_pfn, new);
+ __cached_rbnode_insert_update(iovad, new);

spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);

diff --git a/include/linux/iova.h b/include/linux/iova.h
index d179b9bf7814..69ea3e258ff2 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -70,7 +70,8 @@ struct iova_fq {
struct iova_domain {
spinlock_t iova_rbtree_lock; /* Lock to protect update of rbtree */
struct rb_root rbroot; /* iova domain rbtree root */
- struct rb_node *cached32_node; /* Save last alloced node */
+ struct rb_node *cached_node; /* Save last alloced node */
+ struct rb_node *cached32_node; /* Save last 32-bit alloced node */
unsigned long granule; /* pfn granularity for this domain */
unsigned long start_pfn; /* Lower limit for this domain */
unsigned long dma_32bit_pfn;
--
2.13.4.dirty

2017-09-22 16:21:19

by Tomasz Nowicki

[permalink] [raw]
Subject: Re: [PATCH v5 3/6] iommu/iova: Extend rbtree node caching

Hi Robin,

On 21.09.2017 17:52, Robin Murphy wrote:
> The cached node mechanism provides a significant performance benefit for
> allocations using a 32-bit DMA mask, but in the case of non-PCI devices
> or where the 32-bit space is full, the loss of this benefit can be
> significant - on large systems there can be many thousands of entries in
> the tree, such that walking all the way down to find free space every
> time becomes increasingly awful.
>
> Maintain a similar cached node for the whole IOVA space as a superset of
> the 32-bit space so that performance can remain much more consistent.
>
> Inspired by work by Zhen Lei <[email protected]>.
>
> Tested-by: Ard Biesheuvel <[email protected]>
> Tested-by: Zhen Lei <[email protected]>
> Tested-by: Nate Watterson <[email protected]>
> Signed-off-by: Robin Murphy <[email protected]>
> ---
>
> v5: Fixed __cached_rbnode_delete_update() logic to update both nodes
> when necessary
>
> drivers/iommu/iova.c | 60 ++++++++++++++++++++++++----------------------------
> include/linux/iova.h | 3 ++-
> 2 files changed, 30 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index 20be9a8b3188..c6f5a22f8d20 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -48,6 +48,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
>
> spin_lock_init(&iovad->iova_rbtree_lock);
> iovad->rbroot = RB_ROOT;
> + iovad->cached_node = NULL;
> iovad->cached32_node = NULL;
> iovad->granule = granule;
> iovad->start_pfn = start_pfn;
> @@ -110,48 +111,44 @@ EXPORT_SYMBOL_GPL(init_iova_flush_queue);
> static struct rb_node *
> __get_cached_rbnode(struct iova_domain *iovad, unsigned long *limit_pfn)
> {
> - if ((*limit_pfn > iovad->dma_32bit_pfn) ||
> - (iovad->cached32_node == NULL))
> + struct rb_node *cached_node = NULL;
> + struct iova *curr_iova;
> +
> + if (*limit_pfn <= iovad->dma_32bit_pfn)
> + cached_node = iovad->cached32_node;
> + if (!cached_node)
> + cached_node = iovad->cached_node;
> + if (!cached_node)
> return rb_last(&iovad->rbroot);
> - else {
> - struct rb_node *prev_node = rb_prev(iovad->cached32_node);
> - struct iova *curr_iova =
> - rb_entry(iovad->cached32_node, struct iova, node);
> - *limit_pfn = curr_iova->pfn_lo;
> - return prev_node;
> - }
> +
> + curr_iova = rb_entry(cached_node, struct iova, node);
> + *limit_pfn = min(*limit_pfn, curr_iova->pfn_lo);

I guess this it the fix for stale pointer in iovad->cached32_node from
previous series but I think this is something more.

With this min() here we have real control over the limit_pfn we would
like to allocate at most. In other works, without your series two
subsequent calls can give us:
iova (ffff) = alloc_iova_fast(iovad, 1, DMA_BIT_MASK(32) >> shift);

iova (fffe)= alloc_iova_fast(iovad, 1, DMA_BIT_MASK(16) >> shift);

We do not see this since nobody uses limit_pfn below DMA_BIT_MASK(32)
now. That might be done intentionally so I ask for your opinion.

Also, with your patch and two the same alloc_iova_fast() calls in
iteration may get 32-bit space full much faster. Please correct me if I
messed things up.

Thanks,
Tomasz


> +
> + return rb_prev(cached_node);
> }
>
> static void
> -__cached_rbnode_insert_update(struct iova_domain *iovad,
> - unsigned long limit_pfn, struct iova *new)
> +__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new)
> {
> - if (limit_pfn != iovad->dma_32bit_pfn)
> - return;
> - iovad->cached32_node = &new->node;
> + if (new->pfn_hi < iovad->dma_32bit_pfn)
> + iovad->cached32_node = &new->node;
> + else
> + iovad->cached_node = &new->node;
> }
>
> static void
> __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
> {
> struct iova *cached_iova;
> - struct rb_node *curr;
>
> - if (!iovad->cached32_node)
> - return;
> - curr = iovad->cached32_node;
> - cached_iova = rb_entry(curr, struct iova, node);
> + cached_iova = rb_entry(iovad->cached32_node, struct iova, node);
> + if (free->pfn_hi < iovad->dma_32bit_pfn &&
> + iovad->cached32_node && free->pfn_lo >= cached_iova->pfn_lo)
> + iovad->cached32_node = rb_next(&free->node);
>
> - if (free->pfn_lo >= cached_iova->pfn_lo) {
> - struct rb_node *node = rb_next(&free->node);
> - struct iova *iova = rb_entry(node, struct iova, node);
> -
> - /* only cache if it's below 32bit pfn */
> - if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
> - iovad->cached32_node = node;
> - else
> - iovad->cached32_node = NULL;
> - }
> + cached_iova = rb_entry(iovad->cached_node, struct iova, node);
> + if (iovad->cached_node && free->pfn_lo >= cached_iova->pfn_lo)
> + iovad->cached_node = rb_next(&free->node);
> }
>
> /* Insert the iova into domain rbtree by holding writer lock */
> @@ -188,7 +185,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
> {
> struct rb_node *prev, *curr = NULL;
> unsigned long flags;
> - unsigned long saved_pfn, new_pfn;
> + unsigned long new_pfn;
> unsigned long align_mask = ~0UL;
>
> if (size_aligned)
> @@ -196,7 +193,6 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>
> /* Walk the tree backwards */
> spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
> - saved_pfn = limit_pfn;
> curr = __get_cached_rbnode(iovad, &limit_pfn);
> prev = curr;
> while (curr) {
> @@ -226,7 +222,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>
> /* If we have 'prev', it's a valid place to start the insertion. */
> iova_insert_rbtree(&iovad->rbroot, new, prev);
> - __cached_rbnode_insert_update(iovad, saved_pfn, new);
> + __cached_rbnode_insert_update(iovad, new);
>
> spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>
> diff --git a/include/linux/iova.h b/include/linux/iova.h
> index d179b9bf7814..69ea3e258ff2 100644
> --- a/include/linux/iova.h
> +++ b/include/linux/iova.h
> @@ -70,7 +70,8 @@ struct iova_fq {
> struct iova_domain {
> spinlock_t iova_rbtree_lock; /* Lock to protect update of rbtree */
> struct rb_root rbroot; /* iova domain rbtree root */
> - struct rb_node *cached32_node; /* Save last alloced node */
> + struct rb_node *cached_node; /* Save last alloced node */
> + struct rb_node *cached32_node; /* Save last 32-bit alloced node */
> unsigned long granule; /* pfn granularity for this domain */
> unsigned long start_pfn; /* Lower limit for this domain */
> unsigned long dma_32bit_pfn;
>

2017-09-22 16:51:17

by Tomasz Nowicki

[permalink] [raw]
Subject: Re: [PATCH v5 3/6] iommu/iova: Extend rbtree node caching

On 22.09.2017 18:21, Tomasz Nowicki wrote:
> Hi Robin,
>
> On 21.09.2017 17:52, Robin Murphy wrote:
>> The cached node mechanism provides a significant performance benefit for
>> allocations using a 32-bit DMA mask, but in the case of non-PCI devices
>> or where the 32-bit space is full, the loss of this benefit can be
>> significant - on large systems there can be many thousands of entries in
>> the tree, such that walking all the way down to find free space every
>> time becomes increasingly awful.
>>
>> Maintain a similar cached node for the whole IOVA space as a superset of
>> the 32-bit space so that performance can remain much more consistent.
>>
>> Inspired by work by Zhen Lei <[email protected]>.
>>
>> Tested-by: Ard Biesheuvel <[email protected]>
>> Tested-by: Zhen Lei <[email protected]>
>> Tested-by: Nate Watterson <[email protected]>
>> Signed-off-by: Robin Murphy <[email protected]>
>> ---
>>
>> v5: Fixed __cached_rbnode_delete_update() logic to update both nodes
>>      when necessary
>>
>>   drivers/iommu/iova.c | 60
>> ++++++++++++++++++++++++----------------------------
>>   include/linux/iova.h |  3 ++-
>>   2 files changed, 30 insertions(+), 33 deletions(-)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index 20be9a8b3188..c6f5a22f8d20 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -48,6 +48,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned
>> long granule,
>>       spin_lock_init(&iovad->iova_rbtree_lock);
>>       iovad->rbroot = RB_ROOT;
>> +    iovad->cached_node = NULL;
>>       iovad->cached32_node = NULL;
>>       iovad->granule = granule;
>>       iovad->start_pfn = start_pfn;
>> @@ -110,48 +111,44 @@ EXPORT_SYMBOL_GPL(init_iova_flush_queue);
>>   static struct rb_node *
>>   __get_cached_rbnode(struct iova_domain *iovad, unsigned long
>> *limit_pfn)
>>   {
>> -    if ((*limit_pfn > iovad->dma_32bit_pfn) ||
>> -        (iovad->cached32_node == NULL))
>> +    struct rb_node *cached_node = NULL;
>> +    struct iova *curr_iova;
>> +
>> +    if (*limit_pfn <= iovad->dma_32bit_pfn)
>> +        cached_node = iovad->cached32_node;
>> +    if (!cached_node)
>> +        cached_node = iovad->cached_node;
>> +    if (!cached_node)
>>           return rb_last(&iovad->rbroot);
>> -    else {
>> -        struct rb_node *prev_node = rb_prev(iovad->cached32_node);
>> -        struct iova *curr_iova =
>> -            rb_entry(iovad->cached32_node, struct iova, node);
>> -        *limit_pfn = curr_iova->pfn_lo;
>> -        return prev_node;
>> -    }
>> +
>> +    curr_iova = rb_entry(cached_node, struct iova, node);
>> +    *limit_pfn = min(*limit_pfn, curr_iova->pfn_lo);
>
> I guess this it the fix for stale pointer in iovad->cached32_node from
> previous series but I think this is something more.
>
> With this min() here we have real control over the limit_pfn we would
> like to allocate at most. In other works, without your series two
> subsequent calls can give us:
> iova (ffff) = alloc_iova_fast(iovad, 1, DMA_BIT_MASK(32) >> shift);
>
> iova (fffe)= alloc_iova_fast(iovad, 1, DMA_BIT_MASK(16) >> shift);
>
> We do not see this since nobody uses limit_pfn below DMA_BIT_MASK(32)
> now. That might be done intentionally so I ask for your opinion.
>
> Also, with your patch and two the same alloc_iova_fast() calls in
> iteration may get 32-bit space full much faster. Please correct me if I
> messed things up.

After few more thoughts, I think this is unrealistic case.

In any case, please consider below fix:
static void
-__cached_rbnode_insert_update(struct iova_domain *iovad,
- unsigned long limit_pfn, struct iova *new)
+__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new)
{
- if (limit_pfn != iovad->dma_32bit_pfn)
- return;
- iovad->cached32_node = &new->node;
+ if (new->pfn_hi == iovad->dma_32bit_pfn)
+ iovad->cached32_node = &new->node;
+ else
+ iovad->cached_node = &new->node;
}

Thanks,
Tomasz

2017-09-22 17:08:31

by Tomasz Nowicki

[permalink] [raw]
Subject: Re: [PATCH v5 3/6] iommu/iova: Extend rbtree node caching

On 22.09.2017 18:50, tn wrote:
> On 22.09.2017 18:21, Tomasz Nowicki wrote:
>> Hi Robin,
>>
>> On 21.09.2017 17:52, Robin Murphy wrote:
>>> The cached node mechanism provides a significant performance benefit for
>>> allocations using a 32-bit DMA mask, but in the case of non-PCI devices
>>> or where the 32-bit space is full, the loss of this benefit can be
>>> significant - on large systems there can be many thousands of entries in
>>> the tree, such that walking all the way down to find free space every
>>> time becomes increasingly awful.
>>>
>>> Maintain a similar cached node for the whole IOVA space as a superset of
>>> the 32-bit space so that performance can remain much more consistent.
>>>
>>> Inspired by work by Zhen Lei <[email protected]>.
>>>
>>> Tested-by: Ard Biesheuvel <[email protected]>
>>> Tested-by: Zhen Lei <[email protected]>
>>> Tested-by: Nate Watterson <[email protected]>
>>> Signed-off-by: Robin Murphy <[email protected]>
>>> ---
>>>
>>> v5: Fixed __cached_rbnode_delete_update() logic to update both nodes
>>>      when necessary
>>>
>>>   drivers/iommu/iova.c | 60
>>> ++++++++++++++++++++++++----------------------------
>>>   include/linux/iova.h |  3 ++-
>>>   2 files changed, 30 insertions(+), 33 deletions(-)
>>>
>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>> index 20be9a8b3188..c6f5a22f8d20 100644
>>> --- a/drivers/iommu/iova.c
>>> +++ b/drivers/iommu/iova.c
>>> @@ -48,6 +48,7 @@ init_iova_domain(struct iova_domain *iovad,
>>> unsigned long granule,
>>>       spin_lock_init(&iovad->iova_rbtree_lock);
>>>       iovad->rbroot = RB_ROOT;
>>> +    iovad->cached_node = NULL;
>>>       iovad->cached32_node = NULL;
>>>       iovad->granule = granule;
>>>       iovad->start_pfn = start_pfn;
>>> @@ -110,48 +111,44 @@ EXPORT_SYMBOL_GPL(init_iova_flush_queue);
>>>   static struct rb_node *
>>>   __get_cached_rbnode(struct iova_domain *iovad, unsigned long
>>> *limit_pfn)
>>>   {
>>> -    if ((*limit_pfn > iovad->dma_32bit_pfn) ||
>>> -        (iovad->cached32_node == NULL))
>>> +    struct rb_node *cached_node = NULL;
>>> +    struct iova *curr_iova;
>>> +
>>> +    if (*limit_pfn <= iovad->dma_32bit_pfn)
>>> +        cached_node = iovad->cached32_node;
>>> +    if (!cached_node)
>>> +        cached_node = iovad->cached_node;
>>> +    if (!cached_node)
>>>           return rb_last(&iovad->rbroot);
>>> -    else {
>>> -        struct rb_node *prev_node = rb_prev(iovad->cached32_node);
>>> -        struct iova *curr_iova =
>>> -            rb_entry(iovad->cached32_node, struct iova, node);
>>> -        *limit_pfn = curr_iova->pfn_lo;
>>> -        return prev_node;
>>> -    }
>>> +
>>> +    curr_iova = rb_entry(cached_node, struct iova, node);
>>> +    *limit_pfn = min(*limit_pfn, curr_iova->pfn_lo);
>>
>> I guess this it the fix for stale pointer in iovad->cached32_node from
>> previous series but I think this is something more.
>>
>> With this min() here we have real control over the limit_pfn we would
>> like to allocate at most. In other works, without your series two
>> subsequent calls can give us:
>> iova (ffff) = alloc_iova_fast(iovad, 1, DMA_BIT_MASK(32) >> shift);
>>
>> iova (fffe)= alloc_iova_fast(iovad, 1, DMA_BIT_MASK(16) >> shift);
>>
>> We do not see this since nobody uses limit_pfn below DMA_BIT_MASK(32)
>> now. That might be done intentionally so I ask for your opinion.
>>
>> Also, with your patch and two the same alloc_iova_fast() calls in
>> iteration may get 32-bit space full much faster. Please correct me if
>> I messed things up.
>
> After few more thoughts, I think this is unrealistic case.
>
> In any case, please consider below fix:
>  static void
> -__cached_rbnode_insert_update(struct iova_domain *iovad,
> -    unsigned long limit_pfn, struct iova *new)
> +__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new)
>  {
> -    if (limit_pfn != iovad->dma_32bit_pfn)
> -        return;
> -    iovad->cached32_node = &new->node;
> +    if (new->pfn_hi == iovad->dma_32bit_pfn)
> +        iovad->cached32_node = &new->node;
> +    else
> +        iovad->cached_node = &new->node;
>  }

Sorry :(, this is still wrong.

It should look like this:
static void
__cached_rbnode_insert_update(struct iova_domain *iovad,
unsigned long limit_pfn, struct iova *new)
{
- if (limit_pfn != iovad->dma_32bit_pfn)
- return;
- iovad->cached32_node = &new->node;
+ if (limit_pfn == iovad->dma_32bit_pfn)
+ iovad->cached32_node = &new->node;
+ else if (new->pfn_hi > iovad->dma_32bit_pfn)
+ iovad->cached_node = &new->node;
}

but then we still need to save initial limit_pfn in
__alloc_and_insert_iova_range() for this decision.

Tomasz

2017-09-22 17:25:07

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH v5 3/6] iommu/iova: Extend rbtree node caching

On 22/09/17 17:21, Tomasz Nowicki wrote:
> Hi Robin,
>
> On 21.09.2017 17:52, Robin Murphy wrote:
>> The cached node mechanism provides a significant performance benefit for
>> allocations using a 32-bit DMA mask, but in the case of non-PCI devices
>> or where the 32-bit space is full, the loss of this benefit can be
>> significant - on large systems there can be many thousands of entries in
>> the tree, such that walking all the way down to find free space every
>> time becomes increasingly awful.
>>
>> Maintain a similar cached node for the whole IOVA space as a superset of
>> the 32-bit space so that performance can remain much more consistent.
>>
>> Inspired by work by Zhen Lei <[email protected]>.
>>
>> Tested-by: Ard Biesheuvel <[email protected]>
>> Tested-by: Zhen Lei <[email protected]>
>> Tested-by: Nate Watterson <[email protected]>
>> Signed-off-by: Robin Murphy <[email protected]>
>> ---
>>
>> v5: Fixed __cached_rbnode_delete_update() logic to update both nodes
>>      when necessary
>>
>>   drivers/iommu/iova.c | 60
>> ++++++++++++++++++++++++----------------------------
>>   include/linux/iova.h |  3 ++-
>>   2 files changed, 30 insertions(+), 33 deletions(-)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index 20be9a8b3188..c6f5a22f8d20 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -48,6 +48,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned
>> long granule,
>>         spin_lock_init(&iovad->iova_rbtree_lock);
>>       iovad->rbroot = RB_ROOT;
>> +    iovad->cached_node = NULL;
>>       iovad->cached32_node = NULL;
>>       iovad->granule = granule;
>>       iovad->start_pfn = start_pfn;
>> @@ -110,48 +111,44 @@ EXPORT_SYMBOL_GPL(init_iova_flush_queue);
>>   static struct rb_node *
>>   __get_cached_rbnode(struct iova_domain *iovad, unsigned long
>> *limit_pfn)
>>   {
>> -    if ((*limit_pfn > iovad->dma_32bit_pfn) ||
>> -        (iovad->cached32_node == NULL))
>> +    struct rb_node *cached_node = NULL;
>> +    struct iova *curr_iova;
>> +
>> +    if (*limit_pfn <= iovad->dma_32bit_pfn)
>> +        cached_node = iovad->cached32_node;
>> +    if (!cached_node)
>> +        cached_node = iovad->cached_node;
>> +    if (!cached_node)
>>           return rb_last(&iovad->rbroot);
>> -    else {
>> -        struct rb_node *prev_node = rb_prev(iovad->cached32_node);
>> -        struct iova *curr_iova =
>> -            rb_entry(iovad->cached32_node, struct iova, node);
>> -        *limit_pfn = curr_iova->pfn_lo;
>> -        return prev_node;
>> -    }
>> +
>> +    curr_iova = rb_entry(cached_node, struct iova, node);
>> +    *limit_pfn = min(*limit_pfn, curr_iova->pfn_lo);
>
> I guess this it the fix for stale pointer in iovad->cached32_node from
> previous series but I think this is something more.
>
> With this min() here we have real control over the limit_pfn we would
> like to allocate at most. In other works, without your series two
> subsequent calls can give us:
> iova (ffff) = alloc_iova_fast(iovad, 1, DMA_BIT_MASK(32) >> shift);
>
> iova (fffe)= alloc_iova_fast(iovad, 1, DMA_BIT_MASK(16) >> shift);
>
> We do not see this since nobody uses limit_pfn below DMA_BIT_MASK(32)
> now. That might be done intentionally so I ask for your opinion.

I realise it's not called out in the commit message, but this patch does
make one small change to how the existing 32-bit caching behaves when
freeing the topmost 32-bit IOVA. With the previous behaviour,
__cached_rbnode_delete_update() set cached32_node to NULL when the
rb_next(free) node lies above dma_32bit_pfn, meaning the next 32-bit
allocation returns from here early with rb_last(). Therefore limit_pfn
is updated unconditionally when a cached node exists on the expectation
that it will only ever move downwards.

...and having worked through that, I now realise I failed to take it
into account in 62280cf2e8bb, so yes, the theoretical case of a 32-bit
allocation followed by a <32-bit allocation has in fact been broken (in
that it could adjust limit_pfn upwards and return an too-high IOVA for
the second call) for a while. Grrr...

With the new behaviour from this patch, freeing the topmost 32-bit IOVA
can let cached32_node point at the lowest >32-bit IOVA to avoid the
rb_last() overhead on the next allocation - that's how the stale pointer
bug could now happen (which is fixed by always checking both nodes in
__cached_rbnode_delete_update() below). Because cached32_node may now
occasionally point at something for which pfn_lo > dma_32bit_pfn, we add
the min() here to make sure we never move limit_pfn upwards (and thus
inadvertently fix the <32-bit case as well).

I admit that one of my motivations for rewriting so much here is just
because the existing code is so horrendously subtle and tricky. I'd like
to think that by patch #6 it's actually understandable without spending
several days picking through it...

Robin.

> Also, with your patch and two the same alloc_iova_fast() calls in
> iteration may get 32-bit space full much faster. Please correct me if I
> messed things up.
>
> Thanks,
> Tomasz
>
>
>> +
>> +    return rb_prev(cached_node);
>>   }
>>     static void
>> -__cached_rbnode_insert_update(struct iova_domain *iovad,
>> -    unsigned long limit_pfn, struct iova *new)
>> +__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova
>> *new)
>>   {
>> -    if (limit_pfn != iovad->dma_32bit_pfn)
>> -        return;
>> -    iovad->cached32_node = &new->node;
>> +    if (new->pfn_hi < iovad->dma_32bit_pfn)
>> +        iovad->cached32_node = &new->node;
>> +    else
>> +        iovad->cached_node = &new->node;
>>   }
>>     static void
>>   __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova
>> *free)
>>   {
>>       struct iova *cached_iova;
>> -    struct rb_node *curr;
>>   -    if (!iovad->cached32_node)
>> -        return;
>> -    curr = iovad->cached32_node;
>> -    cached_iova = rb_entry(curr, struct iova, node);
>> +    cached_iova = rb_entry(iovad->cached32_node, struct iova, node);
>> +    if (free->pfn_hi < iovad->dma_32bit_pfn &&
>> +        iovad->cached32_node && free->pfn_lo >= cached_iova->pfn_lo)
>> +        iovad->cached32_node = rb_next(&free->node);
>>   -    if (free->pfn_lo >= cached_iova->pfn_lo) {
>> -        struct rb_node *node = rb_next(&free->node);
>> -        struct iova *iova = rb_entry(node, struct iova, node);
>> -
>> -        /* only cache if it's below 32bit pfn */
>> -        if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
>> -            iovad->cached32_node = node;
>> -        else
>> -            iovad->cached32_node = NULL;
>> -    }
>> +    cached_iova = rb_entry(iovad->cached_node, struct iova, node);
>> +    if (iovad->cached_node && free->pfn_lo >= cached_iova->pfn_lo)
>> +        iovad->cached_node = rb_next(&free->node);
>>   }
>>     /* Insert the iova into domain rbtree by holding writer lock */
>> @@ -188,7 +185,7 @@ static int __alloc_and_insert_iova_range(struct
>> iova_domain *iovad,
>>   {
>>       struct rb_node *prev, *curr = NULL;
>>       unsigned long flags;
>> -    unsigned long saved_pfn, new_pfn;
>> +    unsigned long new_pfn;
>>       unsigned long align_mask = ~0UL;
>>         if (size_aligned)
>> @@ -196,7 +193,6 @@ static int __alloc_and_insert_iova_range(struct
>> iova_domain *iovad,
>>         /* Walk the tree backwards */
>>       spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
>> -    saved_pfn = limit_pfn;
>>       curr = __get_cached_rbnode(iovad, &limit_pfn);
>>       prev = curr;
>>       while (curr) {
>> @@ -226,7 +222,7 @@ static int __alloc_and_insert_iova_range(struct
>> iova_domain *iovad,
>>         /* If we have 'prev', it's a valid place to start the
>> insertion. */
>>       iova_insert_rbtree(&iovad->rbroot, new, prev);
>> -    __cached_rbnode_insert_update(iovad, saved_pfn, new);
>> +    __cached_rbnode_insert_update(iovad, new);
>>         spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>>   diff --git a/include/linux/iova.h b/include/linux/iova.h
>> index d179b9bf7814..69ea3e258ff2 100644
>> --- a/include/linux/iova.h
>> +++ b/include/linux/iova.h
>> @@ -70,7 +70,8 @@ struct iova_fq {
>>   struct iova_domain {
>>       spinlock_t    iova_rbtree_lock; /* Lock to protect update of
>> rbtree */
>>       struct rb_root    rbroot;        /* iova domain rbtree root */
>> -    struct rb_node    *cached32_node; /* Save last alloced node */
>> +    struct rb_node    *cached_node;    /* Save last alloced node */
>> +    struct rb_node    *cached32_node; /* Save last 32-bit alloced
>> node */
>>       unsigned long    granule;    /* pfn granularity for this domain */
>>       unsigned long    start_pfn;    /* Lower limit for this domain */
>>       unsigned long    dma_32bit_pfn;
>>
>

2017-09-27 15:11:37

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH v5 0/6] Optimise 64-bit IOVA allocations

On Thu, Sep 21, 2017 at 04:52:41PM +0100, Robin Murphy wrote:
> v4: https://www.mail-archive.com/[email protected]/msg1493704.html
>
> Right, this is hopefully the last version - I've put things back in a
> sensible order with the new additions at the end, so if they prove
> contentious the first 4 previously-tested patches can still get their
> time in -next. Patch #3 is updated to fix the bug brought to light by
> Tomasz, patch #6 includes the cleanup afterthought from v4, and I've
> nobbled one or two rogue whitespace changes I'd missed before.
>
> And the diffstat is still even more negative than before, hooray! For
> the whole series, the total code size reduction of alloc_iova() comes
> to just over 26% (AArch64 GCC 6.3.1).
>
> Robin.
>
>
> Robin Murphy (3):
> iommu/iova: Extend rbtree node caching
> iommu/iova: Add rbtree anchor node
> iommu/iova: Simplify cached node logic
>
> Zhen Lei (3):
> iommu/iova: Optimise rbtree searching
> iommu/iova: Optimise the padding calculation
> iommu/iova: Make dma_32bit_pfn implicit

Applied, thanks Robin.