2020-08-17 23:41:56

by Leonardo Brás

Subject: [PATCH v1 00/10] DDW indirect mapping

This patchset must be applied on top of:
http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=194179&state=%2A&archive=both

As of today, if the biggest DDW that can be created can't map the whole
partition, its creation is skipped and the default DMA window
"ibm,dma-window" is used instead.

Usually, the available DDW will be 16x bigger than the default DMA window,
as it keeps the same page count and raises the page size from 4k to 64k.
Besides the increased window size, it performs better on allocations
bigger than 4k, so it would be nice to use it instead.
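
As a rough back-of-the-envelope sketch (the numbers below are hypothetical;
only the 4k -> 64k page-size change comes from the patchset), keeping the
TCE count and raising the page size is what gives the 16x factor:

#include <stdio.h>

int main(void)
{
	/* Same TCE count, only the IOMMU page size changes (4k -> 64k). */
	unsigned long tce_count   = 1UL << 21;          /* e.g. 2M TCEs */
	unsigned long default_win = tce_count * 4096;   /* 8GB with 4k pages */
	unsigned long ddw         = tce_count * 65536;  /* 128GB with 64k pages */

	printf("default: %luGB, DDW: %luGB (%lux bigger)\n",
	       default_win >> 30, ddw >> 30, ddw / default_win);
	return 0;
}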

Patch #1 replaces hard-coded 4K page size with a variable containing the
correct page size for the window.

Patch #2 makes sure alignment is correct in iommu_*_coherent().

Patch #3 lets small allocations use the largepool if there is no more space
left in the other pools, thus allowing the whole DMA window to be used by
smaller allocations.

Patch #4 introduces iommu_table_in_use() and replaces manual bit-field
checking where it's used. It will later be used to abort enable_ddw() if
there is any current iommu allocation while attempting single-window
indirect mapping.

Patch #5 introduces iommu_pseries_alloc_table() that will be helpful
when indirect mapping needs to replace the iommu_table.

Patch #6 adds helpers for adding and removing DDWs in the list.

Patch #7 refactors enable_ddw() so it returns whether direct mapping is
possible, instead of the DMA offset. This helps the next patches on
indirect DMA mapping and also allows DMA windows starting at 0x00.

Patch #8 brings a new helper to simplify enable_ddw(), allowing
some reorganization for introducing indirect mapping DDW.

Patch #9:
Instead of destroying the created DDW if it doesn't map the whole
partition, make use of it in place of the default DMA window, as it improves
performance. Also, update the iommu_table and re-generate the pools.

Patch #10:
Renames 'direct window' to 'dma window', given that the created DDW can now
also be used for indirect mapping when direct mapping is not available.

All patches were tested on an LPAR with an Ethernet VF:
4005:01:00.0 Ethernet controller: Mellanox Technologies MT27700 Family
[ConnectX-4 Virtual Function]

The patchset was tested with a 64GB DDW, which did not map the whole
partition (128GB).

Leonardo Bras (10):
powerpc/pseries/iommu: Replace hard-coded page shift
powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on
iommu_*_coherent()
powerpc/kernel/iommu: Use largepool as a last resort when !largealloc
powerpc/kernel/iommu: Add new iommu_table_in_use() helper
powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper
powerpc/pseries/iommu: Add ddw_list_add() helper
powerpc/pseries/iommu: Allow DDW windows starting at 0x00
powerpc/pseries/iommu: Add ddw_property_create() and refactor
enable_ddw()
powerpc/pseries/iommu: Make use of DDW even if it does not map the
partition
powerpc/pseries/iommu: Rename "direct window" to "dma window"

arch/powerpc/include/asm/iommu.h | 1 +
arch/powerpc/include/asm/tce.h | 10 +-
arch/powerpc/kernel/iommu.c | 88 +++---
arch/powerpc/platforms/pseries/iommu.c | 394 ++++++++++++++++---------
4 files changed, 305 insertions(+), 188 deletions(-)

--
2.25.4


2020-08-17 23:42:08

by Leonardo Brás

Subject: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift

Some functions assume IOMMU page size can only be 4K (pageshift == 12).
Update them to accept any page size passed, so we can use 64K pages.

In the process, some defines like TCE_SHIFT were made obsolete, and then
removed. TCE_RPN_MASK was updated to generate a mask according to
the pageshift used.
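
As a standalone illustration (userspace-only, not part of the patch), the new
macro reproduces the old 40-bit mask for 4k pages and yields a 36-bit mask
for 64k pages:

#include <stdio.h>

#define TCE_RPN_BITS	52
#define TCE_RPN_MASK(ps)	((1ul << (TCE_RPN_BITS - (ps))) - 1)

int main(void)
{
	printf("4k  pages: %#lx\n", TCE_RPN_MASK(12)); /* 0xfffffffffful, as before */
	printf("64k pages: %#lx\n", TCE_RPN_MASK(16)); /* 0xffffffffful */
	return 0;
}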

Most places had a tbl struct, so using tbl->it_page_shift was simple.
tce_free_pSeriesLP() was a special case, since its callers do not always
have a tbl struct, so adding a tceshift parameter seemed the right thing to do.

Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/include/asm/tce.h | 10 ++----
arch/powerpc/platforms/pseries/iommu.c | 42 ++++++++++++++++----------
2 files changed, 28 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index db5fc2f2262d..971cba2d87cc 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -19,15 +19,9 @@
#define TCE_VB 0
#define TCE_PCI 1

-/* TCE page size is 4096 bytes (1 << 12) */
-
-#define TCE_SHIFT 12
-#define TCE_PAGE_SIZE (1 << TCE_SHIFT)
-
#define TCE_ENTRY_SIZE 8 /* each TCE is 64 bits */
-
-#define TCE_RPN_MASK 0xfffffffffful /* 40-bit RPN (4K pages) */
-#define TCE_RPN_SHIFT 12
+#define TCE_RPN_BITS 52 /* Bits 0-51 represent RPN on TCE */
+#define TCE_RPN_MASK(ps) ((1ul << (TCE_RPN_BITS - (ps))) - 1)
#define TCE_VALID 0x800 /* TCE valid */
#define TCE_ALLIO 0x400 /* TCE valid for all lpars */
#define TCE_PCI_WRITE 0x2 /* write from PCI allowed */
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index e4198700ed1a..8fe23b7dff3a 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -107,6 +107,9 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
u64 proto_tce;
__be64 *tcep;
u64 rpn;
+ const unsigned long tceshift = tbl->it_page_shift;
+ const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);
+ const u64 rpn_mask = TCE_RPN_MASK(tceshift);

proto_tce = TCE_PCI_READ; // Read allowed

@@ -117,10 +120,10 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,

while (npages--) {
/* can't move this out since we might cross MEMBLOCK boundary */
- rpn = __pa(uaddr) >> TCE_SHIFT;
- *tcep = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT);
+ rpn = __pa(uaddr) >> tceshift;
+ *tcep = cpu_to_be64(proto_tce | (rpn & rpn_mask) << tceshift);

- uaddr += TCE_PAGE_SIZE;
+ uaddr += pagesize;
tcep++;
}
return 0;
@@ -146,7 +149,7 @@ static unsigned long tce_get_pseries(struct iommu_table *tbl, long index)
return be64_to_cpu(*tcep);
}

-static void tce_free_pSeriesLP(unsigned long liobn, long, long);
+static void tce_free_pSeriesLP(unsigned long liobn, long, long, long);
static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);

static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
@@ -159,6 +162,7 @@ static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
u64 rpn;
int ret = 0;
long tcenum_start = tcenum, npages_start = npages;
+ const u64 rpn_mask = TCE_RPN_MASK(tceshift);

rpn = __pa(uaddr) >> tceshift;
proto_tce = TCE_PCI_READ;
@@ -166,12 +170,12 @@ static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
proto_tce |= TCE_PCI_WRITE;

while (npages--) {
- tce = proto_tce | (rpn & TCE_RPN_MASK) << tceshift;
+ tce = proto_tce | (rpn & rpn_mask) << tceshift;
rc = plpar_tce_put((u64)liobn, (u64)tcenum << tceshift, tce);

if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
ret = (int)rc;
- tce_free_pSeriesLP(liobn, tcenum_start,
+ tce_free_pSeriesLP(liobn, tcenum_start, tceshift,
(npages_start - (npages + 1)));
break;
}
@@ -205,10 +209,12 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
long tcenum_start = tcenum, npages_start = npages;
int ret = 0;
unsigned long flags;
+ const unsigned long tceshift = tbl->it_page_shift;
+ const u64 rpn_mask = TCE_RPN_MASK(tceshift);

if ((npages == 1) || !firmware_has_feature(FW_FEATURE_PUT_TCE_IND)) {
return tce_build_pSeriesLP(tbl->it_index, tcenum,
- tbl->it_page_shift, npages, uaddr,
+ tceshift, npages, uaddr,
direction, attrs);
}

@@ -225,13 +231,13 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
if (!tcep) {
local_irq_restore(flags);
return tce_build_pSeriesLP(tbl->it_index, tcenum,
- tbl->it_page_shift,
+ tceshift,
npages, uaddr, direction, attrs);
}
__this_cpu_write(tce_page, tcep);
}

- rpn = __pa(uaddr) >> TCE_SHIFT;
+ rpn = __pa(uaddr) >> tceshift;
proto_tce = TCE_PCI_READ;
if (direction != DMA_TO_DEVICE)
proto_tce |= TCE_PCI_WRITE;
@@ -245,12 +251,12 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
limit = min_t(long, npages, 4096/TCE_ENTRY_SIZE);

for (l = 0; l < limit; l++) {
- tcep[l] = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT);
+ tcep[l] = cpu_to_be64(proto_tce | (rpn & rpn_mask) << tceshift);
rpn++;
}

rc = plpar_tce_put_indirect((u64)tbl->it_index,
- (u64)tcenum << 12,
+ (u64)tcenum << tceshift,
(u64)__pa(tcep),
limit);

@@ -277,12 +283,13 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
return ret;
}

-static void tce_free_pSeriesLP(unsigned long liobn, long tcenum, long npages)
+static void tce_free_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
+ long npages)
{
u64 rc;

while (npages--) {
- rc = plpar_tce_put((u64)liobn, (u64)tcenum << 12, 0);
+ rc = plpar_tce_put((u64)liobn, (u64)tcenum << tceshift, 0);

if (rc && printk_ratelimit()) {
printk("tce_free_pSeriesLP: plpar_tce_put failed. rc=%lld\n", rc);
@@ -301,9 +308,11 @@ static void tce_freemulti_pSeriesLP(struct iommu_table *tbl, long tcenum, long n
u64 rc;

if (!firmware_has_feature(FW_FEATURE_STUFF_TCE))
- return tce_free_pSeriesLP(tbl->it_index, tcenum, npages);
+ return tce_free_pSeriesLP(tbl->it_index, tcenum,
+ tbl->it_page_shift, npages);

- rc = plpar_tce_stuff((u64)tbl->it_index, (u64)tcenum << 12, 0, npages);
+ rc = plpar_tce_stuff((u64)tbl->it_index,
+ (u64)tcenum << tbl->it_page_shift, 0, npages);

if (rc && printk_ratelimit()) {
printk("tce_freemulti_pSeriesLP: plpar_tce_stuff failed\n");
@@ -319,7 +328,8 @@ static unsigned long tce_get_pSeriesLP(struct iommu_table *tbl, long tcenum)
u64 rc;
unsigned long tce_ret;

- rc = plpar_tce_get((u64)tbl->it_index, (u64)tcenum << 12, &tce_ret);
+ rc = plpar_tce_get((u64)tbl->it_index,
+ (u64)tcenum << tbl->it_page_shift, &tce_ret);

if (rc && printk_ratelimit()) {
printk("tce_get_pSeriesLP: plpar_tce_get failed. rc=%lld\n", rc);
--
2.25.4

2020-08-17 23:42:16

by Leonardo Brás

Subject: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()

Both iommu_alloc_coherent() and iommu_free_coherent() assume that once
size is aligned to PAGE_SIZE it will be aligned to IOMMU_PAGE_SIZE.

Update those functions to guarantee the requested size is aligned to
IOMMU_PAGE_SIZE, by using IOMMU_PAGE_ALIGN() before calling
iommu_alloc() / iommu_free().
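
To see why PAGE_SIZE alignment alone is not enough, consider a configuration
with 4k system pages and a 64k IOMMU page size; a minimal userspace sketch
(hypothetical values, not kernel code):

#include <stdio.h>

#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	unsigned long page_size = 4096, iommu_page_size = 65536;
	unsigned long size = 8192;	/* already PAGE_SIZE-aligned */

	printf("PAGE_ALIGN:       %lu\n", ALIGN_UP(size, page_size));       /* 8192 */
	printf("IOMMU_PAGE_ALIGN: %lu\n", ALIGN_UP(size, iommu_page_size)); /* 65536 */
	return 0;
}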

Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
with IOMMU_PAGE_ALIGN(n, tbl), which seems easier to read.

Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/kernel/iommu.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 9704f3f76e63..d7086087830f 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -237,10 +237,9 @@ static unsigned long iommu_range_alloc(struct device *dev,
}

if (dev)
- boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
- 1 << tbl->it_page_shift);
+ boundary_size = IOMMU_PAGE_ALIGN(dma_get_seg_boundary(dev) + 1, tbl);
else
- boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
+ boundary_size = IOMMU_PAGE_ALIGN(1UL << 32, tbl);
/* 4GB boundary for iseries_hv_alloc and iseries_hv_map */

n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset,
@@ -858,6 +857,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
unsigned int order;
unsigned int nio_pages, io_order;
struct page *page;
+ size_t size_io = size;

size = PAGE_ALIGN(size);
order = get_order(size);
@@ -884,8 +884,9 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
memset(ret, 0, size);

/* Set up tces to cover the allocated range */
- nio_pages = size >> tbl->it_page_shift;
- io_order = get_iommu_order(size, tbl);
+ size_io = IOMMU_PAGE_ALIGN(size_io, tbl);
+ nio_pages = size_io >> tbl->it_page_shift;
+ io_order = get_iommu_order(size_io, tbl);
mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
mask >> tbl->it_page_shift, io_order, 0);
if (mapping == DMA_MAPPING_ERROR) {
@@ -900,11 +901,11 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
void *vaddr, dma_addr_t dma_handle)
{
if (tbl) {
- unsigned int nio_pages;
+ size_t size_io = IOMMU_PAGE_ALIGN(size, tbl);
+ unsigned int nio_pages = size_io >> tbl->it_page_shift;

- size = PAGE_ALIGN(size);
- nio_pages = size >> tbl->it_page_shift;
iommu_free(tbl, dma_handle, nio_pages);
+
size = PAGE_ALIGN(size);
free_pages((unsigned long)vaddr, get_order(size));
}
--
2.25.4

2020-08-17 23:42:32

by Leonardo Brás

Subject: [PATCH v1 04/10] powerpc/kernel/iommu: Add new iommu_table_in_use() helper

Having a function to check whether the iommu table has any allocation helps
decide if a tbl can be reset in order to use a new DMA window.

It should be enough to replace all instances of !bitmap_empty(tbl...).

iommu_table_in_use() skips reserved memory, so we don't need to worry about
releasing it before testing. This makes iommu_table_release_pages()
unnecessary, since it was only used to clear the reserved memory before such
a test.
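
A simplified userspace model of the intended check (hypothetical helper,
using a plain bool array instead of the kernel bitmap and ignoring the
bit-0 special case):

#include <stdbool.h>
#include <stdio.h>

/* Scan the map in two segments, skipping the reserved range entirely. */
static bool table_in_use(const bool *map, unsigned long size,
			 unsigned long res_start, unsigned long res_end)
{
	unsigned long i;

	for (i = 0; i < res_start; i++)		/* before the reserved range */
		if (map[i])
			return true;

	for (i = res_end + 1; i < size; i++)	/* after the reserved range */
		if (map[i])
			return true;

	return false;
}

int main(void)
{
	bool map[16] = { 0 };

	map[5] = true;					/* bit inside the reserved range */
	printf("%d\n", table_in_use(map, 16, 4, 7));	/* 0: reserved range is skipped */
	map[10] = true;					/* real allocation */
	printf("%d\n", table_in_use(map, 16, 4, 7));	/* 1 */
	return 0;
}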

Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/include/asm/iommu.h | 1 +
arch/powerpc/kernel/iommu.c | 62 ++++++++++++++++++--------------
2 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 5032f1593299..2913e5c8b1f8 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -154,6 +154,7 @@ extern int iommu_tce_table_put(struct iommu_table *tbl);
*/
extern struct iommu_table *iommu_init_table(struct iommu_table *tbl,
int nid, unsigned long res_start, unsigned long res_end);
+bool iommu_table_in_use(struct iommu_table *tbl);

#define IOMMU_TABLE_GROUP_MAX_TABLES 2

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 7f603d4e62d4..c5d5d36ab65e 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -668,21 +668,6 @@ static void iommu_table_reserve_pages(struct iommu_table *tbl,
set_bit(i - tbl->it_offset, tbl->it_map);
}

-static void iommu_table_release_pages(struct iommu_table *tbl)
-{
- int i;
-
- /*
- * In case we have reserved the first bit, we should not emit
- * the warning below.
- */
- if (tbl->it_offset == 0)
- clear_bit(0, tbl->it_map);
-
- for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
- clear_bit(i - tbl->it_offset, tbl->it_map);
-}
-
/*
* Build a iommu_table structure. This contains a bit map which
* is used to manage allocation of the tce space.
@@ -743,6 +728,38 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
return tbl;
}

+bool iommu_table_in_use(struct iommu_table *tbl)
+{
+ bool in_use;
+ unsigned long p1_start = 0, p1_end, p2_start, p2_end;
+
+ /*ignore reserved bit0*/
+ if (tbl->it_offset == 0)
+ p1_start = 1;
+
+ /* Check if reserved memory is valid*/
+ if (tbl->it_reserved_start >= tbl->it_offset &&
+ tbl->it_reserved_start <= (tbl->it_offset + tbl->it_size) &&
+ tbl->it_reserved_end >= tbl->it_offset &&
+ tbl->it_reserved_end <= (tbl->it_offset + tbl->it_size)) {
+ p1_end = tbl->it_reserved_start - tbl->it_offset;
+ p2_start = tbl->it_reserved_end - tbl->it_offset + 1;
+ p2_end = tbl->it_size;
+ } else {
+ p1_end = tbl->it_size;
+ p2_start = 0;
+ p2_end = 0;
+ }
+
+ in_use = (find_next_bit(tbl->it_map, p1_end, p1_start) != p1_end);
+ if (in_use || p2_start == 0)
+ return in_use;
+
+ in_use = (find_next_bit(tbl->it_map, p2_end, p2_start) != p2_end);
+
+ return in_use;
+}
+
static void iommu_table_free(struct kref *kref)
{
unsigned long bitmap_sz;
@@ -759,10 +776,8 @@ static void iommu_table_free(struct kref *kref)
return;
}

- iommu_table_release_pages(tbl);
-
/* verify that table contains no entries */
- if (!bitmap_empty(tbl->it_map, tbl->it_size))
+ if (iommu_table_in_use(tbl))
pr_warn("%s: Unexpected TCEs\n", __func__);

/* calculate bitmap size in bytes */
@@ -1069,18 +1084,13 @@ int iommu_take_ownership(struct iommu_table *tbl)
for (i = 0; i < tbl->nr_pools; i++)
spin_lock(&tbl->pools[i].lock);

- iommu_table_release_pages(tbl);
-
- if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
+ if (iommu_table_in_use(tbl)) {
pr_err("iommu_tce: it_map is not empty");
ret = -EBUSY;
- /* Undo iommu_table_release_pages, i.e. restore bit#0, etc */
- iommu_table_reserve_pages(tbl, tbl->it_reserved_start,
- tbl->it_reserved_end);
- } else {
- memset(tbl->it_map, 0xff, sz);
}

+ memset(tbl->it_map, 0xff, sz);
+
for (i = 0; i < tbl->nr_pools; i++)
spin_unlock(&tbl->pools[i].lock);
spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
--
2.25.4

2020-08-17 23:42:40

by Leonardo Brás

Subject: [PATCH v1 03/10] powerpc/kernel/iommu: Use largepool as a last resort when !largealloc

As of today, iommu_range_alloc() for !largealloc (npages <= 15) can only
use 3/4 of the available pages, since the pages in the largepool are not
available for !largealloc allocations.

This could mean some drivers are unable to use all the pages available in
the DMA window.

Add pages on largepool as a last resort for !largealloc, making all pages
of the DMA window available.

Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/kernel/iommu.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d7086087830f..7f603d4e62d4 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -261,6 +261,15 @@ static unsigned long iommu_range_alloc(struct device *dev,
pass++;
goto again;

+ } else if (pass == tbl->nr_pools + 1) {
+ /* Last resort: try largepool */
+ spin_unlock(&pool->lock);
+ pool = &tbl->large_pool;
+ spin_lock(&pool->lock);
+ pool->hint = pool->start;
+ pass++;
+ goto again;
+
} else {
/* Give up */
spin_unlock_irqrestore(&(pool->lock), flags);
--
2.25.4

2020-08-17 23:43:01

by Leonardo Brás

Subject: [PATCH v1 06/10] powerpc/pseries/iommu: Add ddw_list_add() helper

There are two functions adding a DDW to the direct_window_list in a
similar way, so create ddw_list_add() to avoid duplication and
simplify those functions.

Also, on enable_ddw(), add list_del() in out_free_window to allow
removing the window from the list if any error occurs.

Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/platforms/pseries/iommu.c | 42 ++++++++++++++++----------
1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 39617ce0ec83..fcdefcc0f365 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -872,6 +872,24 @@ static u64 find_existing_ddw(struct device_node *pdn)
return dma_addr;
}

+static struct direct_window *ddw_list_add(struct device_node *pdn,
+ const struct dynamic_dma_window_prop *dma64)
+{
+ struct direct_window *window;
+
+ window = kzalloc(sizeof(*window), GFP_KERNEL);
+ if (!window)
+ return NULL;
+
+ window->device = pdn;
+ window->prop = dma64;
+ spin_lock(&direct_window_list_lock);
+ list_add(&window->list, &direct_window_list);
+ spin_unlock(&direct_window_list_lock);
+
+ return window;
+}
+
static int find_existing_ddw_windows(void)
{
int len;
@@ -887,18 +905,11 @@ static int find_existing_ddw_windows(void)
if (!direct64)
continue;

- window = kzalloc(sizeof(*window), GFP_KERNEL);
- if (!window || len < sizeof(struct dynamic_dma_window_prop)) {
+ window = ddw_list_add(pdn, direct64);
+ if (!window || len < sizeof(*direct64)) {
kfree(window);
remove_ddw(pdn, true);
- continue;
}
-
- window->device = pdn;
- window->prop = direct64;
- spin_lock(&direct_window_list_lock);
- list_add(&window->list, &direct_window_list);
- spin_unlock(&direct_window_list_lock);
}

return 0;
@@ -1261,7 +1272,8 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
create.liobn, dn);

- window = kzalloc(sizeof(*window), GFP_KERNEL);
+ /* Add new window to existing DDW list */
+ window = ddw_list_add(pdn, ddwprop);
if (!window)
goto out_clear_window;

@@ -1280,16 +1292,14 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
goto out_free_window;
}

- window->device = pdn;
- window->prop = ddwprop;
- spin_lock(&direct_window_list_lock);
- list_add(&window->list, &direct_window_list);
- spin_unlock(&direct_window_list_lock);
-
dma_addr = be64_to_cpu(ddwprop->dma_base);
goto out_unlock;

out_free_window:
+ spin_lock(&direct_window_list_lock);
+ list_del(&window->list);
+ spin_unlock(&direct_window_list_lock);
+
kfree(window);

out_clear_window:
--
2.25.4

2020-08-17 23:43:12

by Leonardo Brás

Subject: [PATCH v1 08/10] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()

The code that creates a ddw property, previously scattered across
enable_ddw(), is now gathered in ddw_property_create(), which handles
allocating and filling the property, leaving it ready for
of_add_property(), which now happens in sequence.

This created an opportunity to reorganize the second part of enable_ddw():

Without this patch enable_ddw() does, in order:
kzalloc() property & members, create_ddw(), fill ddwprop inside property,
ddw_list_add(), do tce_setrange_multi_pSeriesLP_walk in all memory,
of_add_property().

With this patch enable_ddw() does, in order:
create_ddw(), ddw_property_create(), of_add_property(), ddw_list_add(),
do tce_setrange_multi_pSeriesLP_walk in all memory.

This change requires of_remove_property() in case anything fails after
of_add_property(), but tce_setrange_multi_pSeriesLP_walk over all memory,
which looks like the most expensive operation, is now done only if
everything else succeeds.

Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/platforms/pseries/iommu.c | 97 +++++++++++++++-----------
1 file changed, 57 insertions(+), 40 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 4031127c9537..3a1ef02ad9d5 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1123,6 +1123,31 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
ret);
}

+static int ddw_property_create(struct property **ddw_win, const char *propname,
+ u32 liobn, u64 dma_addr, u32 page_shift, u32 window_shift)
+{
+ struct dynamic_dma_window_prop *ddwprop;
+ struct property *win64;
+
+ *ddw_win = win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
+ if (!win64)
+ return -ENOMEM;
+
+ win64->name = kstrdup(propname, GFP_KERNEL);
+ ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
+ win64->value = ddwprop;
+ win64->length = sizeof(*ddwprop);
+ if (!win64->name || !win64->value)
+ return -ENOMEM;
+
+ ddwprop->liobn = cpu_to_be32(liobn);
+ ddwprop->dma_base = cpu_to_be64(dma_addr);
+ ddwprop->tce_shift = cpu_to_be32(page_shift);
+ ddwprop->window_shift = cpu_to_be32(window_shift);
+
+ return 0;
+}
+
/*
* If the PE supports dynamic dma windows, and there is space for a table
* that can map all pages in a linear offset, then setup such a table,
@@ -1140,12 +1165,11 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
struct ddw_query_response query;
struct ddw_create_response create;
int page_shift;
- u64 max_addr;
+ u64 max_addr, win_addr;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
- struct property *win64;
- struct dynamic_dma_window_prop *ddwprop;
+ struct property *win64 = NULL;
struct failed_ddw_pdn *fpdn;
bool default_win_removed = false;

@@ -1244,38 +1268,34 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
goto out_failed;
}
len = order_base_2(max_addr);
- win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
- if (!win64) {
- dev_info(&dev->dev,
- "couldn't allocate property for 64bit dma window\n");
+
+ ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
+ if (ret != 0)
goto out_failed;
- }
- win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL);
- win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL);
- win64->length = sizeof(*ddwprop);
- if (!win64->name || !win64->value) {
+
+ dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
+ create.liobn, dn);
+
+ win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
+ ret = ddw_property_create(&win64, DIRECT64_PROPNAME, create.liobn, win_addr,
+ page_shift, len);
+ if (ret) {
dev_info(&dev->dev,
- "couldn't allocate property name and value\n");
+ "couldn't allocate property, property name, or value\n");
goto out_free_prop;
}

- ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
- if (ret != 0)
+ ret = of_add_property(pdn, win64);
+ if (ret) {
+ dev_err(&dev->dev, "unable to add dma window property for %pOF: %d",
+ pdn, ret);
goto out_free_prop;
-
- ddwprop->liobn = cpu_to_be32(create.liobn);
- ddwprop->dma_base = cpu_to_be64(((u64)create.addr_hi << 32) |
- create.addr_lo);
- ddwprop->tce_shift = cpu_to_be32(page_shift);
- ddwprop->window_shift = cpu_to_be32(len);
-
- dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
- create.liobn, dn);
+ }

/* Add new window to existing DDW list */
- window = ddw_list_add(pdn, ddwprop);
+ window = ddw_list_add(pdn, win64->value);
if (!window)
- goto out_clear_window;
+ goto out_prop_del;

ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
win64->value, tce_setrange_multi_pSeriesLP_walk);
@@ -1285,14 +1305,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
goto out_free_window;
}

- ret = of_add_property(pdn, win64);
- if (ret) {
- dev_err(&dev->dev, "unable to add dma window property for %pOF: %d",
- pdn, ret);
- goto out_free_window;
- }
-
- dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);
+ dev->dev.archdata.dma_offset = win_addr;
goto out_unlock;

out_free_window:
@@ -1302,14 +1315,18 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)

kfree(window);

-out_clear_window:
- remove_ddw(pdn, true);
+out_prop_del:
+ of_remove_property(pdn, win64);

out_free_prop:
- kfree(win64->name);
- kfree(win64->value);
- kfree(win64);
- win64 = NULL;
+ if (win64) {
+ kfree(win64->name);
+ kfree(win64->value);
+ kfree(win64);
+ win64 = NULL;
+ }
+
+ remove_ddw(pdn, true);

out_failed:
if (default_win_removed)
--
2.25.4

2020-08-17 23:43:31

by Leonardo Brás

Subject: [PATCH v1 07/10] powerpc/pseries/iommu: Allow DDW windows starting at 0x00

enable_ddw() currently returns the address of the DMA window, which is
considered invalid if it has the value 0x00.

Also, it only considers an address returned from find_existing_ddw()
valid if it's not 0x00.

Changing this behavior makes sense, given the users of enable_ddw() only
need to know whether direct mapping is possible. It also allows a DMA
window starting at 0x00 to be used.

This will be helpful for using a DDW with indirect mapping, as the window
address will be different from 0x00, but it will not map the whole
partition.

Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/platforms/pseries/iommu.c | 30 ++++++++++++--------------
1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index fcdefcc0f365..4031127c9537 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -852,24 +852,25 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
np, ret);
}

-static u64 find_existing_ddw(struct device_node *pdn)
+static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
{
struct direct_window *window;
const struct dynamic_dma_window_prop *direct64;
- u64 dma_addr = 0;
+ bool found = false;

spin_lock(&direct_window_list_lock);
/* check if we already created a window and dupe that config if so */
list_for_each_entry(window, &direct_window_list, list) {
if (window->device == pdn) {
direct64 = window->prop;
- dma_addr = be64_to_cpu(direct64->dma_base);
+ *dma_addr = be64_to_cpu(direct64->dma_base);
+ found = true;
break;
}
}
spin_unlock(&direct_window_list_lock);

- return dma_addr;
+ return found;
}

static struct direct_window *ddw_list_add(struct device_node *pdn,
@@ -1131,15 +1132,15 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
* pdn: the parent pe node with the ibm,dma_window property
* Future: also check if we can remap the base window for our base page size
*
- * returns the dma offset for use by the direct mapped DMA code.
+ * returns true if it can map all pages (direct mapping), false otherwise.
*/
-static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
{
int len, ret;
struct ddw_query_response query;
struct ddw_create_response create;
int page_shift;
- u64 dma_addr, max_addr;
+ u64 max_addr;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
@@ -1150,8 +1151,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)

mutex_lock(&direct_window_init_mutex);

- dma_addr = find_existing_ddw(pdn);
- if (dma_addr != 0)
+ if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset))
goto out_unlock;

/*
@@ -1292,7 +1292,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
goto out_free_window;
}

- dma_addr = be64_to_cpu(ddwprop->dma_base);
+ dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);
goto out_unlock;

out_free_window:
@@ -1309,6 +1309,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
kfree(win64->name);
kfree(win64->value);
kfree(win64);
+ win64 = NULL;

out_failed:
if (default_win_removed)
@@ -1322,7 +1323,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)

out_unlock:
mutex_unlock(&direct_window_init_mutex);
- return dma_addr;
+ return win64;
}

static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
@@ -1401,11 +1402,8 @@ static bool iommu_bypass_supported_pSeriesLP(struct pci_dev *pdev, u64 dma_mask)
break;
}

- if (pdn && PCI_DN(pdn)) {
- pdev->dev.archdata.dma_offset = enable_ddw(pdev, pdn);
- if (pdev->dev.archdata.dma_offset)
- return true;
- }
+ if (pdn && PCI_DN(pdn))
+ return enable_ddw(pdev, pdn);

return false;
}
--
2.25.4

2020-08-17 23:43:47

by Leonardo Brás

Subject: [PATCH v1 09/10] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition

As of today, if the biggest DDW that can be created can't map the whole
partition, its creation is skipped and the default DMA window
"ibm,dma-window" is used instead.

The DDW is 16x bigger than the default DMA window, having the same number
of pages but a page size of 64k instead of 4k.
Besides the larger DMA window, it performs better for allocations over 4k,
so it would be nice to use it instead.

The DDW created will be used for direct mapping by default.
If it's not available, indirect mapping will be used instead.

For indirect mapping, it's necessary to update the iommu_table so
iommu_alloc() can use the DDW created. For this,
iommu_table_update_window() is called when everything else succeeds
at enable_ddw().

Removing the default DMA window for using DDW with indirect mapping
is only allowed if there is no current IOMMU memory allocated in
the iommu_table. enable_ddw() is aborted otherwise.

As there will never be both direct and indirect mappings at the same
time, the same property name can be used for the created DDW.

So renaming
define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
to
define DMA64_PROPNAME "linux,dma64-ddr-window-info"
looks like the right thing to do.

To make sure the property differentiates both cases, a new u32 for flags
was added at the end of the property, where BIT(0) set means direct
mapping.

Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/platforms/pseries/iommu.c | 108 +++++++++++++++++++------
1 file changed, 84 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 3a1ef02ad9d5..9544e3c91ced 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -350,8 +350,11 @@ struct dynamic_dma_window_prop {
__be64 dma_base; /* address hi,lo */
__be32 tce_shift; /* ilog2(tce_page_size) */
__be32 window_shift; /* ilog2(tce_window_size) */
+ __be32 flags; /* DDW properties, see below */
};

+#define DDW_FLAGS_DIRECT 0x01
+
struct direct_window {
struct device_node *device;
const struct dynamic_dma_window_prop *prop;
@@ -377,7 +380,7 @@ static LIST_HEAD(direct_window_list);
static DEFINE_SPINLOCK(direct_window_list_lock);
/* protects initializing window twice for same device */
static DEFINE_MUTEX(direct_window_init_mutex);
-#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
+#define DMA64_PROPNAME "linux,dma64-ddr-window-info"

static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
unsigned long num_pfn, const void *arg)
@@ -836,7 +839,7 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
if (ret)
return;

- win = of_find_property(np, DIRECT64_PROPNAME, NULL);
+ win = of_find_property(np, DMA64_PROPNAME, NULL);
if (!win)
return;

@@ -852,7 +855,7 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
np, ret);
}

-static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
+static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, bool *direct_mapping)
{
struct direct_window *window;
const struct dynamic_dma_window_prop *direct64;
@@ -864,6 +867,7 @@ static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
if (window->device == pdn) {
direct64 = window->prop;
*dma_addr = be64_to_cpu(direct64->dma_base);
+ *direct_mapping = be32_to_cpu(direct64->flags) & DDW_FLAGS_DIRECT;
found = true;
break;
}
@@ -901,8 +905,8 @@ static int find_existing_ddw_windows(void)
if (!firmware_has_feature(FW_FEATURE_LPAR))
return 0;

- for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
- direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
+ for_each_node_with_property(pdn, DMA64_PROPNAME) {
+ direct64 = of_get_property(pdn, DMA64_PROPNAME, &len);
if (!direct64)
continue;

@@ -1124,7 +1128,8 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
}

static int ddw_property_create(struct property **ddw_win, const char *propname,
- u32 liobn, u64 dma_addr, u32 page_shift, u32 window_shift)
+ u32 liobn, u64 dma_addr, u32 page_shift,
+ u32 window_shift, bool direct_mapping)
{
struct dynamic_dma_window_prop *ddwprop;
struct property *win64;
@@ -1144,6 +1149,36 @@ static int ddw_property_create(struct property **ddw_win, const char *propname,
ddwprop->dma_base = cpu_to_be64(dma_addr);
ddwprop->tce_shift = cpu_to_be32(page_shift);
ddwprop->window_shift = cpu_to_be32(window_shift);
+ if (direct_mapping)
+ ddwprop->flags = cpu_to_be32(DDW_FLAGS_DIRECT);
+
+ return 0;
+}
+
+static int iommu_table_update_window(struct iommu_table **tbl, int nid, unsigned long liobn,
+ unsigned long win_addr, unsigned long page_shift,
+ unsigned long window_size)
+{
+ struct iommu_table *new_tbl, *old_tbl;
+
+ new_tbl = iommu_pseries_alloc_table(nid);
+ if (!new_tbl)
+ return -ENOMEM;
+
+ old_tbl = *tbl;
+ new_tbl->it_index = liobn;
+ new_tbl->it_offset = win_addr >> page_shift;
+ new_tbl->it_page_shift = page_shift;
+ new_tbl->it_size = window_size >> page_shift;
+ new_tbl->it_base = old_tbl->it_base;
+ new_tbl->it_busno = old_tbl->it_busno;
+ new_tbl->it_blocksize = old_tbl->it_blocksize;
+ new_tbl->it_type = old_tbl->it_type;
+ new_tbl->it_ops = old_tbl->it_ops;
+
+ iommu_init_table(new_tbl, nid, old_tbl->it_reserved_start, old_tbl->it_reserved_end);
+ iommu_tce_table_put(old_tbl);
+ *tbl = new_tbl;

return 0;
}
@@ -1171,12 +1206,16 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
struct direct_window *window;
struct property *win64 = NULL;
struct failed_ddw_pdn *fpdn;
- bool default_win_removed = false;
+ bool default_win_removed = false, maps_whole_partition = false;
+ struct pci_dn *pci = PCI_DN(pdn);
+ struct iommu_table *tbl = pci->table_group->tables[0];

mutex_lock(&direct_window_init_mutex);

- if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset))
- goto out_unlock;
+ if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &maps_whole_partition)) {
+ mutex_unlock(&direct_window_init_mutex);
+ return maps_whole_partition;
+ }

/*
* If we already went through this for a previous function of
@@ -1258,16 +1297,24 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
query.page_size);
goto out_failed;
}
+
/* verify the window * number of ptes will map the partition */
- /* check largest block * page size > max memory hotplug addr */
max_addr = ddw_memory_hotplug_max();
if (query.largest_available_block < (max_addr >> page_shift)) {
- dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
- "%llu-sized pages\n", max_addr, query.largest_available_block,
- 1ULL << page_shift);
- goto out_failed;
+ dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu %llu-sized pages\n",
+ max_addr, query.largest_available_block,
+ 1ULL << page_shift);
+
+ len = order_base_2(query.largest_available_block << page_shift);
+ } else {
+ maps_whole_partition = true;
+ len = order_base_2(max_addr);
}
- len = order_base_2(max_addr);
+
+ /* DDW + IOMMU on single window may fail if there is any allocation */
+ if (default_win_removed && !maps_whole_partition &&
+ iommu_table_in_use(tbl))
+ goto out_failed;

ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
if (ret != 0)
@@ -1277,8 +1324,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
create.liobn, dn);

win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
- ret = ddw_property_create(&win64, DIRECT64_PROPNAME, create.liobn, win_addr,
- page_shift, len);
+ ret = ddw_property_create(&win64, DMA64_PROPNAME, create.liobn, win_addr,
+ page_shift, len, maps_whole_partition);
if (ret) {
dev_info(&dev->dev,
"couldn't allocate property, property name, or value\n");
@@ -1297,12 +1344,25 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
if (!window)
goto out_prop_del;

- ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
- win64->value, tce_setrange_multi_pSeriesLP_walk);
- if (ret) {
- dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
- dn, ret);
- goto out_free_window;
+ if (maps_whole_partition) {
+ /* DDW maps the whole partition, so enable direct DMA mapping */
+ ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
+ win64->value, tce_setrange_multi_pSeriesLP_walk);
+ if (ret) {
+ dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
+ dn, ret);
+ goto out_free_window;
+ }
+ } else {
+ /* New table for using DDW instead of the default DMA window */
+ if (iommu_table_update_window(&tbl, pci->phb->node, create.liobn,
+ win_addr, page_shift, 1UL << len))
+ goto out_free_window;
+
+ set_iommu_table_base(&dev->dev, tbl);
+ WARN_ON(dev->dev.archdata.dma_offset >= SZ_4G);
+ goto out_unlock;
+
}

dev->dev.archdata.dma_offset = win_addr;
@@ -1340,7 +1400,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)

out_unlock:
mutex_unlock(&direct_window_init_mutex);
- return win64;
+ return win64 && maps_whole_partition;
}

static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
--
2.25.4

2020-08-17 23:43:51

by Leonardo Brás

Subject: [PATCH v1 10/10] powerpc/pseries/iommu: Rename "direct window" to "dma window"

A previous change introduced the use of DDW as a bigger indirect DMA
mapping when the available DDW size does not map the whole partition.

As most of the code that manipulates direct mappings was reused for
indirect mappings, it's necessary to rename those identifiers and debug/info
messages to reflect that they can be used for both kinds of mapping.

Also, define DEFAULT_DMA_WIN as "ibm,dma-window" to document that
it's the name of the default DMA window.

Those changes are not supposed to change how the code works in any
way, just adjust naming.

Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/platforms/pseries/iommu.c | 110 +++++++++++++------------
1 file changed, 57 insertions(+), 53 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 9544e3c91ced..c1454f9cd254 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -355,7 +355,7 @@ struct dynamic_dma_window_prop {

#define DDW_FLAGS_DIRECT 0x01

-struct direct_window {
+struct dma_win {
struct device_node *device;
const struct dynamic_dma_window_prop *prop;
struct list_head list;
@@ -375,12 +375,13 @@ struct ddw_create_response {
u32 addr_lo;
};

-static LIST_HEAD(direct_window_list);
+static LIST_HEAD(dma_win_list);
/* prevents races between memory on/offline and window creation */
-static DEFINE_SPINLOCK(direct_window_list_lock);
+static DEFINE_SPINLOCK(dma_win_list_lock);
/* protects initializing window twice for same device */
-static DEFINE_MUTEX(direct_window_init_mutex);
+static DEFINE_MUTEX(dma_win_init_mutex);
#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
+#define DEFAULT_DMA_WIN "ibm,dma-window"

static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
unsigned long num_pfn, const void *arg)
@@ -713,15 +714,18 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
pr_debug("pci_dma_bus_setup_pSeriesLP: setting up bus %pOF\n",
dn);

- /* Find nearest ibm,dma-window, walking up the device tree */
+ /*
+ * Find nearest ibm,dma-window (default DMA window), walking up the
+ * device tree
+ */
for (pdn = dn; pdn != NULL; pdn = pdn->parent) {
- dma_window = of_get_property(pdn, "ibm,dma-window", NULL);
+ dma_window = of_get_property(pdn, DEFAULT_DMA_WIN, NULL);
if (dma_window != NULL)
break;
}

if (dma_window == NULL) {
- pr_debug(" no ibm,dma-window property !\n");
+ pr_debug(" no %s property !\n", DEFAULT_DMA_WIN);
return;
}

@@ -819,11 +823,11 @@ static void remove_dma_window(struct device_node *np, u32 *ddw_avail,

ret = rtas_call(ddw_avail[DDW_REMOVE_PE_DMA_WIN], 1, 1, NULL, liobn);
if (ret)
- pr_warn("%pOF: failed to remove direct window: rtas returned "
+ pr_warn("%pOF: failed to remove dma window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
else
- pr_debug("%pOF: successfully removed direct window: rtas returned "
+ pr_debug("%pOF: successfully removed dma window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
}
@@ -851,36 +855,36 @@ static void remove_ddw(struct device_node *np, bool remove_prop)

ret = of_remove_property(np, win);
if (ret)
- pr_warn("%pOF: failed to remove direct window property: %d\n",
+ pr_warn("%pOF: failed to remove dma window property: %d\n",
np, ret);
}

static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, bool *direct_mapping)
{
- struct direct_window *window;
- const struct dynamic_dma_window_prop *direct64;
+ struct dma_win *window;
+ const struct dynamic_dma_window_prop *dma64;
bool found = false;

- spin_lock(&direct_window_list_lock);
+ spin_lock(&dma_win_list_lock);
/* check if we already created a window and dupe that config if so */
- list_for_each_entry(window, &direct_window_list, list) {
+ list_for_each_entry(window, &dma_win_list, list) {
if (window->device == pdn) {
- direct64 = window->prop;
- *dma_addr = be64_to_cpu(direct64->dma_base);
- *direct_mapping = be32_to_cpu(direct64->flags) & DDW_FLAGS_DIRECT;
+ dma64 = window->prop;
+ *dma_addr = be64_to_cpu(dma64->dma_base);
+ *direct_mapping = be32_to_cpu(dma64->flags) & DDW_FLAGS_DIRECT;
found = true;
break;
}
}
- spin_unlock(&direct_window_list_lock);
+ spin_unlock(&dma_win_list_lock);

return found;
}

-static struct direct_window *ddw_list_add(struct device_node *pdn,
- const struct dynamic_dma_window_prop *dma64)
+static struct dma_win *ddw_list_add(struct device_node *pdn,
+ const struct dynamic_dma_window_prop *dma64)
{
- struct direct_window *window;
+ struct dma_win *window;

window = kzalloc(sizeof(*window), GFP_KERNEL);
if (!window)
@@ -888,9 +892,9 @@ static struct direct_window *ddw_list_add(struct device_node *pdn,

window->device = pdn;
window->prop = dma64;
- spin_lock(&direct_window_list_lock);
- list_add(&window->list, &direct_window_list);
- spin_unlock(&direct_window_list_lock);
+ spin_lock(&dma_win_list_lock);
+ list_add(&window->list, &dma_win_list);
+ spin_unlock(&dma_win_list_lock);

return window;
}
@@ -899,19 +903,19 @@ static int find_existing_ddw_windows(void)
{
int len;
struct device_node *pdn;
- struct direct_window *window;
- const struct dynamic_dma_window_prop *direct64;
+ struct dma_win *window;
+ const struct dynamic_dma_window_prop *dma64;

if (!firmware_has_feature(FW_FEATURE_LPAR))
return 0;

for_each_node_with_property(pdn, DMA64_PROPNAME) {
- direct64 = of_get_property(pdn, DMA64_PROPNAME, &len);
- if (!direct64)
+ dma64 = of_get_property(pdn, DMA64_PROPNAME, &len);
+ if (!dma64)
continue;

- window = ddw_list_add(pdn, direct64);
- if (!window || len < sizeof(*direct64)) {
+ window = ddw_list_add(pdn, dma64);
+ if (!window || len < sizeof(*dma64)) {
kfree(window);
remove_ddw(pdn, true);
}
@@ -1203,17 +1207,17 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
u64 max_addr, win_addr;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
- struct direct_window *window;
+ struct dma_win *window;
struct property *win64 = NULL;
struct failed_ddw_pdn *fpdn;
bool default_win_removed = false, maps_whole_partition = false;
struct pci_dn *pci = PCI_DN(pdn);
struct iommu_table *tbl = pci->table_group->tables[0];

- mutex_lock(&direct_window_init_mutex);
+ mutex_lock(&dma_win_init_mutex);

if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &maps_whole_partition)) {
- mutex_unlock(&direct_window_init_mutex);
+ mutex_unlock(&dma_win_init_mutex);
return maps_whole_partition;
}

@@ -1264,7 +1268,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
struct property *default_win;
int reset_win_ext;

- default_win = of_find_property(pdn, "ibm,dma-window", NULL);
+ default_win = of_find_property(pdn, DEFAULT_DMA_WIN, NULL);
if (!default_win)
goto out_failed;

@@ -1293,8 +1297,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
} else if (query.page_size & 1) {
page_shift = 12; /* 4kB */
} else {
- dev_dbg(&dev->dev, "no supported direct page size in mask %x",
- query.page_size);
+ dev_dbg(&dev->dev, "no supported page size in mask %x",
+ query.page_size);
goto out_failed;
}

@@ -1349,7 +1353,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
win64->value, tce_setrange_multi_pSeriesLP_walk);
if (ret) {
- dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
+ dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n",
dn, ret);
goto out_free_window;
}
@@ -1369,9 +1373,9 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
goto out_unlock;

out_free_window:
- spin_lock(&direct_window_list_lock);
+ spin_lock(&dma_win_list_lock);
list_del(&window->list);
- spin_unlock(&direct_window_list_lock);
+ spin_unlock(&dma_win_list_lock);

kfree(window);

@@ -1399,7 +1403,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
list_add(&fpdn->list, &failed_ddw_pdn_list);

out_unlock:
- mutex_unlock(&direct_window_init_mutex);
+ mutex_unlock(&dma_win_init_mutex);
return win64 && maps_whole_partition;
}

@@ -1423,7 +1427,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)

for (pdn = dn; pdn && PCI_DN(pdn) && !PCI_DN(pdn)->table_group;
pdn = pdn->parent) {
- dma_window = of_get_property(pdn, "ibm,dma-window", NULL);
+ dma_window = of_get_property(pdn, DEFAULT_DMA_WIN, NULL);
if (dma_window)
break;
}
@@ -1474,7 +1478,7 @@ static bool iommu_bypass_supported_pSeriesLP(struct pci_dev *pdev, u64 dma_mask)
*/
for (pdn = dn; pdn && PCI_DN(pdn) && !PCI_DN(pdn)->table_group;
pdn = pdn->parent) {
- dma_window = of_get_property(pdn, "ibm,dma-window", NULL);
+ dma_window = of_get_property(pdn, DEFAULT_DMA_WIN, NULL);
if (dma_window)
break;
}
@@ -1488,29 +1492,29 @@ static bool iommu_bypass_supported_pSeriesLP(struct pci_dev *pdev, u64 dma_mask)
static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
void *data)
{
- struct direct_window *window;
+ struct dma_win *window;
struct memory_notify *arg = data;
int ret = 0;

switch (action) {
case MEM_GOING_ONLINE:
- spin_lock(&direct_window_list_lock);
- list_for_each_entry(window, &direct_window_list, list) {
+ spin_lock(&dma_win_list_lock);
+ list_for_each_entry(window, &dma_win_list, list) {
ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
arg->nr_pages, window->prop);
/* XXX log error */
}
- spin_unlock(&direct_window_list_lock);
+ spin_unlock(&dma_win_list_lock);
break;
case MEM_CANCEL_ONLINE:
case MEM_OFFLINE:
- spin_lock(&direct_window_list_lock);
- list_for_each_entry(window, &direct_window_list, list) {
+ spin_lock(&dma_win_list_lock);
+ list_for_each_entry(window, &dma_win_list, list) {
ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
arg->nr_pages, window->prop);
/* XXX log error */
}
- spin_unlock(&direct_window_list_lock);
+ spin_unlock(&dma_win_list_lock);
break;
default:
break;
@@ -1531,7 +1535,7 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti
struct of_reconfig_data *rd = data;
struct device_node *np = rd->dn;
struct pci_dn *pci = PCI_DN(np);
- struct direct_window *window;
+ struct dma_win *window;

switch (action) {
case OF_RECONFIG_DETACH_NODE:
@@ -1547,15 +1551,15 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti
iommu_pseries_free_group(pci->table_group,
np->full_name);

- spin_lock(&direct_window_list_lock);
- list_for_each_entry(window, &direct_window_list, list) {
+ spin_lock(&dma_win_list_lock);
+ list_for_each_entry(window, &dma_win_list, list) {
if (window->device == np) {
list_del(&window->list);
kfree(window);
break;
}
}
- spin_unlock(&direct_window_list_lock);
+ spin_unlock(&dma_win_list_lock);
break;
default:
err = NOTIFY_DONE;
--
2.25.4

2020-08-17 23:45:49

by Leonardo Brás

Subject: [PATCH v1 05/10] powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper

Creates a helper to allow allocating a new iommu_table without the need
to reallocate the iommu_group.

This will be helpful for replacing the iommu_table for the new DMA window,
after we remove the old one with iommu_tce_table_put().

Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/platforms/pseries/iommu.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 8fe23b7dff3a..39617ce0ec83 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,28 +53,31 @@ enum {
DDW_EXT_QUERY_OUT_SIZE = 2
};

-static struct iommu_table_group *iommu_pseries_alloc_group(int node)
+static struct iommu_table *iommu_pseries_alloc_table(int node)
{
- struct iommu_table_group *table_group;
struct iommu_table *tbl;

- table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL,
- node);
- if (!table_group)
- return NULL;
-
tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
if (!tbl)
- goto free_group;
+ return NULL;

INIT_LIST_HEAD_RCU(&tbl->it_group_list);
kref_init(&tbl->it_kref);
+ return tbl;
+}

- table_group->tables[0] = tbl;
+static struct iommu_table_group *iommu_pseries_alloc_group(int node)
+{
+ struct iommu_table_group *table_group;
+
+ table_group = kzalloc_node(sizeof(*table_group), GFP_KERNEL, node);
+ if (!table_group)
+ return NULL;

- return table_group;
+ table_group->tables[0] = iommu_pseries_alloc_table(node);
+ if (table_group->tables[0])
+ return table_group;

-free_group:
kfree(table_group);
return NULL;
}
--
2.25.4

2020-08-22 09:35:25

by Alexey Kardashevskiy

Subject: Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift



On 18/08/2020 09:40, Leonardo Bras wrote:
> Some functions assume IOMMU page size can only be 4K (pageshift == 12).
> Update them to accept any page size passed, so we can use 64K pages.
>
> In the process, some defines like TCE_SHIFT were made obsolete, and then
> removed. TCE_RPN_MASK was updated to generate a mask according to
> the pageshift used.
>
> Most places had a tbl struct, so using tbl->it_page_shift was simple.
> tce_free_pSeriesLP() was a special case, since its callers do not always
> have a tbl struct, so adding a tceshift parameter seemed the right thing to do.
>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---
> arch/powerpc/include/asm/tce.h | 10 ++----
> arch/powerpc/platforms/pseries/iommu.c | 42 ++++++++++++++++----------
> 2 files changed, 28 insertions(+), 24 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
> index db5fc2f2262d..971cba2d87cc 100644
> --- a/arch/powerpc/include/asm/tce.h
> +++ b/arch/powerpc/include/asm/tce.h
> @@ -19,15 +19,9 @@
> #define TCE_VB 0
> #define TCE_PCI 1
>
> -/* TCE page size is 4096 bytes (1 << 12) */
> -
> -#define TCE_SHIFT 12
> -#define TCE_PAGE_SIZE (1 << TCE_SHIFT)
> -
> #define TCE_ENTRY_SIZE 8 /* each TCE is 64 bits */
> -
> -#define TCE_RPN_MASK 0xfffffffffful /* 40-bit RPN (4K pages) */
> -#define TCE_RPN_SHIFT 12
> +#define TCE_RPN_BITS 52 /* Bits 0-51 represent RPN on TCE */


Ditch this one and use MAX_PHYSMEM_BITS instead? I am pretty sure this
is the actual limit.


> +#define TCE_RPN_MASK(ps) ((1ul << (TCE_RPN_BITS - (ps))) - 1)
> #define TCE_VALID 0x800 /* TCE valid */
> #define TCE_ALLIO 0x400 /* TCE valid for all lpars */
> #define TCE_PCI_WRITE 0x2 /* write from PCI allowed */
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index e4198700ed1a..8fe23b7dff3a 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -107,6 +107,9 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
> u64 proto_tce;
> __be64 *tcep;
> u64 rpn;
> + const unsigned long tceshift = tbl->it_page_shift;
> + const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);
> + const u64 rpn_mask = TCE_RPN_MASK(tceshift);

Using the IOMMU_PAGE_SIZE macro for the page size and not using
IOMMU_PAGE_MASK for the mask - this inconsistency makes my small brain
explode :) I understand the history but maaaaan... Oh well, ok.

Good, otherwise. Thanks,

>
> proto_tce = TCE_PCI_READ; // Read allowed
>
> @@ -117,10 +120,10 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
>
> while (npages--) {
> /* can't move this out since we might cross MEMBLOCK boundary */
> - rpn = __pa(uaddr) >> TCE_SHIFT;
> - *tcep = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT);
> + rpn = __pa(uaddr) >> tceshift;
> + *tcep = cpu_to_be64(proto_tce | (rpn & rpn_mask) << tceshift);
>
> - uaddr += TCE_PAGE_SIZE;
> + uaddr += pagesize;
> tcep++;
> }
> return 0;
> @@ -146,7 +149,7 @@ static unsigned long tce_get_pseries(struct iommu_table *tbl, long index)
> return be64_to_cpu(*tcep);
> }
>
> -static void tce_free_pSeriesLP(unsigned long liobn, long, long);
> +static void tce_free_pSeriesLP(unsigned long liobn, long, long, long);
> static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
>
> static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
> @@ -159,6 +162,7 @@ static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
> u64 rpn;
> int ret = 0;
> long tcenum_start = tcenum, npages_start = npages;
> + const u64 rpn_mask = TCE_RPN_MASK(tceshift);
>
> rpn = __pa(uaddr) >> tceshift;
> proto_tce = TCE_PCI_READ;
> @@ -166,12 +170,12 @@ static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
> proto_tce |= TCE_PCI_WRITE;
>
> while (npages--) {
> - tce = proto_tce | (rpn & TCE_RPN_MASK) << tceshift;
> + tce = proto_tce | (rpn & rpn_mask) << tceshift;
> rc = plpar_tce_put((u64)liobn, (u64)tcenum << tceshift, tce);
>
> if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
> ret = (int)rc;
> - tce_free_pSeriesLP(liobn, tcenum_start,
> + tce_free_pSeriesLP(liobn, tcenum_start, tceshift,
> (npages_start - (npages + 1)));
> break;
> }
> @@ -205,10 +209,12 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
> long tcenum_start = tcenum, npages_start = npages;
> int ret = 0;
> unsigned long flags;
> + const unsigned long tceshift = tbl->it_page_shift;
> + const u64 rpn_mask = TCE_RPN_MASK(tceshift);
>
> if ((npages == 1) || !firmware_has_feature(FW_FEATURE_PUT_TCE_IND)) {
> return tce_build_pSeriesLP(tbl->it_index, tcenum,
> - tbl->it_page_shift, npages, uaddr,
> + tceshift, npages, uaddr,
> direction, attrs);
> }
>
> @@ -225,13 +231,13 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
> if (!tcep) {
> local_irq_restore(flags);
> return tce_build_pSeriesLP(tbl->it_index, tcenum,
> - tbl->it_page_shift,
> + tceshift,
> npages, uaddr, direction, attrs);
> }
> __this_cpu_write(tce_page, tcep);
> }
>
> - rpn = __pa(uaddr) >> TCE_SHIFT;
> + rpn = __pa(uaddr) >> tceshift;
> proto_tce = TCE_PCI_READ;
> if (direction != DMA_TO_DEVICE)
> proto_tce |= TCE_PCI_WRITE;
> @@ -245,12 +251,12 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
> limit = min_t(long, npages, 4096/TCE_ENTRY_SIZE);
>
> for (l = 0; l < limit; l++) {
> - tcep[l] = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT);
> + tcep[l] = cpu_to_be64(proto_tce | (rpn & rpn_mask) << tceshift);
> rpn++;
> }
>
> rc = plpar_tce_put_indirect((u64)tbl->it_index,
> - (u64)tcenum << 12,
> + (u64)tcenum << tceshift,
> (u64)__pa(tcep),
> limit);
>
> @@ -277,12 +283,13 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
> return ret;
> }
>
> -static void tce_free_pSeriesLP(unsigned long liobn, long tcenum, long npages)
> +static void tce_free_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
> + long npages)
> {
> u64 rc;
>
> while (npages--) {
> - rc = plpar_tce_put((u64)liobn, (u64)tcenum << 12, 0);
> + rc = plpar_tce_put((u64)liobn, (u64)tcenum << tceshift, 0);
>
> if (rc && printk_ratelimit()) {
> printk("tce_free_pSeriesLP: plpar_tce_put failed. rc=%lld\n", rc);
> @@ -301,9 +308,11 @@ static void tce_freemulti_pSeriesLP(struct iommu_table *tbl, long tcenum, long n
> u64 rc;
>
> if (!firmware_has_feature(FW_FEATURE_STUFF_TCE))
> - return tce_free_pSeriesLP(tbl->it_index, tcenum, npages);
> + return tce_free_pSeriesLP(tbl->it_index, tcenum,
> + tbl->it_page_shift, npages);
>
> - rc = plpar_tce_stuff((u64)tbl->it_index, (u64)tcenum << 12, 0, npages);
> + rc = plpar_tce_stuff((u64)tbl->it_index,
> + (u64)tcenum << tbl->it_page_shift, 0, npages);
>
> if (rc && printk_ratelimit()) {
> printk("tce_freemulti_pSeriesLP: plpar_tce_stuff failed\n");
> @@ -319,7 +328,8 @@ static unsigned long tce_get_pSeriesLP(struct iommu_table *tbl, long tcenum)
> u64 rc;
> unsigned long tce_ret;
>
> - rc = plpar_tce_get((u64)tbl->it_index, (u64)tcenum << 12, &tce_ret);
> + rc = plpar_tce_get((u64)tbl->it_index,
> + (u64)tcenum << tbl->it_page_shift, &tce_ret);
>
> if (rc && printk_ratelimit()) {
> printk("tce_get_pSeriesLP: plpar_tce_get failed. rc=%lld\n", rc);
>

--
Alexey

2020-08-22 10:11:24

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()



On 18/08/2020 09:40, Leonardo Bras wrote:
> Both iommu_alloc_coherent() and iommu_free_coherent() assume that once
> size is aligned to PAGE_SIZE it will be aligned to IOMMU_PAGE_SIZE.

The only case when it is not aligned is when IOMMU_PAGE_SIZE > PAGE_SIZE
which is unlikely but not impossible, we could configure the kernel for
4K system pages and 64K IOMMU pages I suppose. Do we really want to do
this here, or simply put WARN_ON(tbl->it_page_shift > PAGE_SHIFT)?
Because if we want the former (==support), then we'll have to align the
size up to the bigger page size when allocating/zeroing system pages,
etc. Bigger pages are not the case here as I understand it.
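For the WARN_ON option it would be just something like this at the top of
iommu_alloc_coherent() (untested sketch):

	/* Refuse the unlikely 4K system pages / 64K IOMMU pages combination */
	if (WARN_ON(tbl->it_page_shift > PAGE_SHIFT))
		return NULL;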


>
> Update those functions to guarantee alignment with requested size
> using IOMMU_PAGE_ALIGN() before doing iommu_alloc() / iommu_free().
>
> Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
> with IOMMU_PAGE_ALIGN(n, tbl), which seems easier to read.
>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---
> arch/powerpc/kernel/iommu.c | 17 +++++++++--------
> 1 file changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 9704f3f76e63..d7086087830f 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -237,10 +237,9 @@ static unsigned long iommu_range_alloc(struct device *dev,
> }
>
> if (dev)
> - boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> - 1 << tbl->it_page_shift);
> + boundary_size = IOMMU_PAGE_ALIGN(dma_get_seg_boundary(dev) + 1, tbl);


Run checkpatch.pl, should complain about a long line.


> else
> - boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
> + boundary_size = IOMMU_PAGE_ALIGN(1UL << 32, tbl);
> /* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
>
> n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset,
> @@ -858,6 +857,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
> unsigned int order;
> unsigned int nio_pages, io_order;
> struct page *page;
> + size_t size_io = size;
>
> size = PAGE_ALIGN(size);
> order = get_order(size);
> @@ -884,8 +884,9 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
> memset(ret, 0, size);
>
> /* Set up tces to cover the allocated range */
> - nio_pages = size >> tbl->it_page_shift;
> - io_order = get_iommu_order(size, tbl);
> + size_io = IOMMU_PAGE_ALIGN(size_io, tbl);
> + nio_pages = size_io >> tbl->it_page_shift;
> + io_order = get_iommu_order(size_io, tbl);
> mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
> mask >> tbl->it_page_shift, io_order, 0);
> if (mapping == DMA_MAPPING_ERROR) {
> @@ -900,11 +901,11 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
> void *vaddr, dma_addr_t dma_handle)
> {
> if (tbl) {
> - unsigned int nio_pages;
> + size_t size_io = IOMMU_PAGE_ALIGN(size, tbl);
> + unsigned int nio_pages = size_io >> tbl->it_page_shift;
>
> - size = PAGE_ALIGN(size);
> - nio_pages = size >> tbl->it_page_shift;
> iommu_free(tbl, dma_handle, nio_pages);
> +

Unrelated new line.


> size = PAGE_ALIGN(size);
> free_pages((unsigned long)vaddr, get_order(size));
> }
>

--
Alexey

2020-08-22 10:14:09

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 03/10] powerpc/kernel/iommu: Use largepool as a last resort when !largealloc



On 18/08/2020 09:40, Leonardo Bras wrote:
> As of today, doing iommu_range_alloc() only for !largealloc (npages <= 15)
> will only be able to use 3/4 of the available pages, given pages on
> largepool not being available for !largealloc.
>
> This could mean some drivers not being able to fully use all the available
> pages for the DMA window.
>
> Add pages on largepool as a last resort for !largealloc, making all pages
> of the DMA window available.
>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---
> arch/powerpc/kernel/iommu.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index d7086087830f..7f603d4e62d4 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -261,6 +261,15 @@ static unsigned long iommu_range_alloc(struct device *dev,
> pass++;
> goto again;
>
> + } else if (pass == tbl->nr_pools + 1) {
> + /* Last resort: try largepool */
> + spin_unlock(&pool->lock);
> + pool = &tbl->large_pool;
> + spin_lock(&pool->lock);
> + pool->hint = pool->start;
> + pass++;
> + goto again;
> +


A nit: unnecessary new line.


Reviewed-by: Alexey Kardashevskiy <[email protected]>



> } else {
> /* Give up */
> spin_unlock_irqrestore(&(pool->lock), flags);
>

--
Alexey

2020-08-22 10:37:13

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 04/10] powerpc/kernel/iommu: Add new iommu_table_in_use() helper



On 18/08/2020 09:40, Leonardo Bras wrote:
> Having a function to check if the iommu table has any allocation helps
> deciding if a tbl can be reset for using a new DMA window.
>
> It should be enough to replace all instances of !bitmap_empty(tbl...).
>
> iommu_table_in_use() skips reserved memory, so we don't need to worry about
> releasing it before testing. This causes iommu_table_release_pages() to
> become unnecessary, given it is only used to remove reserved memory for
> testing.
>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---
> arch/powerpc/include/asm/iommu.h | 1 +
> arch/powerpc/kernel/iommu.c | 62 ++++++++++++++++++--------------
> 2 files changed, 37 insertions(+), 26 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 5032f1593299..2913e5c8b1f8 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -154,6 +154,7 @@ extern int iommu_tce_table_put(struct iommu_table *tbl);
> */
> extern struct iommu_table *iommu_init_table(struct iommu_table *tbl,
> int nid, unsigned long res_start, unsigned long res_end);
> +bool iommu_table_in_use(struct iommu_table *tbl);
>
> #define IOMMU_TABLE_GROUP_MAX_TABLES 2
>
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 7f603d4e62d4..c5d5d36ab65e 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -668,21 +668,6 @@ static void iommu_table_reserve_pages(struct iommu_table *tbl,
> set_bit(i - tbl->it_offset, tbl->it_map);
> }
>
> -static void iommu_table_release_pages(struct iommu_table *tbl)
> -{
> - int i;
> -
> - /*
> - * In case we have reserved the first bit, we should not emit
> - * the warning below.
> - */
> - if (tbl->it_offset == 0)
> - clear_bit(0, tbl->it_map);
> -
> - for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
> - clear_bit(i - tbl->it_offset, tbl->it_map);
> -}
> -
> /*
> * Build a iommu_table structure. This contains a bit map which
> * is used to manage allocation of the tce space.
> @@ -743,6 +728,38 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
> return tbl;
> }
>
> +bool iommu_table_in_use(struct iommu_table *tbl)
> +{
> + bool in_use;
> + unsigned long p1_start = 0, p1_end, p2_start, p2_end;
> +
> + /*ignore reserved bit0*/

s/ignore reserved bit0/ ignore reserved bit0 / (add spaces)

> + if (tbl->it_offset == 0)
> + p1_start = 1;
> +
> + /* Check if reserved memory is valid*/

A missing space here.

> + if (tbl->it_reserved_start >= tbl->it_offset &&
> + tbl->it_reserved_start <= (tbl->it_offset + tbl->it_size) &&
> + tbl->it_reserved_end >= tbl->it_offset &&
> + tbl->it_reserved_end <= (tbl->it_offset + tbl->it_size)) {


Uff. What if tbl->it_reserved_end is bigger than tbl->it_offset +
tbl->it_size?

The reserved area is to preserve MMIO32 so it is for it_offset==0 only
and the boundaries are checked in the only callsite, and it is unlikely
to change soon or ever.

Rather than bothering with fixing that, maybe just add (did not test):

if (WARN_ON(((tbl->it_reserved_start || tbl->it_reserved_end) && (it_offset != 0)) ||
	    ((tbl->it_reserved_start > it_offset && tbl->it_reserved_end < it_offset + it_size) &&
	     (it_offset == 0))))
	return true;

Or simply always look for it_offset..it_reserved_start and
it_reserved_end..it_offset+it_size and if there is no reserved area,
initialize it_reserved_start=it_reserved_end=it_offset so the first
it_offset..it_reserved_start becomes a no-op.
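I.e. roughly this (untested sketch; it assumes it_reserved_end is treated as
exclusive and it_reserved_start == it_reserved_end == it_offset means "no
reserved area"):

bool iommu_table_in_use(struct iommu_table *tbl)
{
	unsigned long start = 0, end;

	/* ignore reserved bit0 */
	if (tbl->it_offset == 0)
		start = 1;

	/* it_offset .. it_reserved_start */
	end = tbl->it_reserved_start - tbl->it_offset;
	if (find_next_bit(tbl->it_map, end, start) != end)
		return true;

	/* it_reserved_end .. it_offset + it_size */
	start = max(start, tbl->it_reserved_end - tbl->it_offset);
	end = tbl->it_size;

	return find_next_bit(tbl->it_map, end, start) != end;
}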


> + p1_end = tbl->it_reserved_start - tbl->it_offset;
> + p2_start = tbl->it_reserved_end - tbl->it_offset + 1;
> + p2_end = tbl->it_size;
> + } else {
> + p1_end = tbl->it_size;
> + p2_start = 0;
> + p2_end = 0;
> + }
> +
> + in_use = (find_next_bit(tbl->it_map, p1_end, p1_start) != p1_end);
> + if (in_use || p2_start == 0)
> + return in_use;
> +
> + in_use = (find_next_bit(tbl->it_map, p2_end, p2_start) != p2_end);
> +
> + return in_use;
> +}
> +
> static void iommu_table_free(struct kref *kref)
> {
> unsigned long bitmap_sz;
> @@ -759,10 +776,8 @@ static void iommu_table_free(struct kref *kref)
> return;
> }
>
> - iommu_table_release_pages(tbl);
> -
> /* verify that table contains no entries */
> - if (!bitmap_empty(tbl->it_map, tbl->it_size))
> + if (iommu_table_in_use(tbl))
> pr_warn("%s: Unexpected TCEs\n", __func__);
>
> /* calculate bitmap size in bytes */
> @@ -1069,18 +1084,13 @@ int iommu_take_ownership(struct iommu_table *tbl)
> for (i = 0; i < tbl->nr_pools; i++)
> spin_lock(&tbl->pools[i].lock);
>
> - iommu_table_release_pages(tbl);
> -
> - if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
> + if (iommu_table_in_use(tbl)) {
> pr_err("iommu_tce: it_map is not empty");
> ret = -EBUSY;
> - /* Undo iommu_table_release_pages, i.e. restore bit#0, etc */
> - iommu_table_reserve_pages(tbl, tbl->it_reserved_start,
> - tbl->it_reserved_end);
> - } else {
> - memset(tbl->it_map, 0xff, sz);
> }
>
> + memset(tbl->it_map, 0xff, sz);
> +
> for (i = 0; i < tbl->nr_pools; i++)
> spin_unlock(&tbl->pools[i].lock);
> spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
>

--
Alexey

2020-08-24 01:02:33

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 05/10] powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper



On 18/08/2020 09:40, Leonardo Bras wrote:
> Creates a helper to allow allocating a new iommu_table without the need
> to reallocate the iommu_group.
>
> This will be helpful for replacing the iommu_table for the new DMA window,
> after we remove the old one with iommu_tce_table_put().
>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---
> arch/powerpc/platforms/pseries/iommu.c | 25 ++++++++++++++-----------
> 1 file changed, 14 insertions(+), 11 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 8fe23b7dff3a..39617ce0ec83 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -53,28 +53,31 @@ enum {
> DDW_EXT_QUERY_OUT_SIZE = 2
> };
>
> -static struct iommu_table_group *iommu_pseries_alloc_group(int node)
> +static struct iommu_table *iommu_pseries_alloc_table(int node)
> {
> - struct iommu_table_group *table_group;
> struct iommu_table *tbl;
>
> - table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL,
> - node);
> - if (!table_group)
> - return NULL;
> -
> tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
> if (!tbl)
> - goto free_group;
> + return NULL;
>
> INIT_LIST_HEAD_RCU(&tbl->it_group_list);
> kref_init(&tbl->it_kref);
> + return tbl;
> +}
>
> - table_group->tables[0] = tbl;
> +static struct iommu_table_group *iommu_pseries_alloc_group(int node)
> +{
> + struct iommu_table_group *table_group;
> +
> + table_group = kzalloc_node(sizeof(*table_group), GFP_KERNEL, node);


I'd prefer you did not make unrelated changes (sizeof(struct
iommu_table_group) -> sizeof(*table_group)) so the diff stays shorter
and easier to follow. You changed sizeof(struct iommu_table_group) but
not sizeof(struct iommu_table) and this confused me enough to spend more
time than this straightforward change deserves.

Not important in this case though so

Reviewed-by: Alexey Kardashevskiy <[email protected]>




> + if (!table_group)
> + return NULL;
>
> - return table_group;
> + table_group->tables[0] = iommu_pseries_alloc_table(node);
> + if (table_group->tables[0])
> + return table_group;
>
> -free_group:
> kfree(table_group);
> return NULL;
> }
>

--
Alexey

2020-08-24 04:42:52

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 06/10] powerpc/pseries/iommu: Add ddw_list_add() helper



On 18/08/2020 09:40, Leonardo Bras wrote:
> There are two functions adding DDW to the direct_window_list in a
> similar way, so create a ddw_list_add() to avoid duplicity and
> simplify those functions.
>
> Also, on enable_ddw(), add list_del() on out_free_window to allow
> removing the window from list if any error occurs.
>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---
> arch/powerpc/platforms/pseries/iommu.c | 42 ++++++++++++++++----------
> 1 file changed, 26 insertions(+), 16 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 39617ce0ec83..fcdefcc0f365 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -872,6 +872,24 @@ static u64 find_existing_ddw(struct device_node *pdn)
> return dma_addr;
> }
>
> +static struct direct_window *ddw_list_add(struct device_node *pdn,
> + const struct dynamic_dma_window_prop *dma64)
> +{
> + struct direct_window *window;
> +
> + window = kzalloc(sizeof(*window), GFP_KERNEL);
> + if (!window)
> + return NULL;
> +
> + window->device = pdn;
> + window->prop = dma64;
> + spin_lock(&direct_window_list_lock);
> + list_add(&window->list, &direct_window_list);
> + spin_unlock(&direct_window_list_lock);
> +
> + return window;
> +}
> +
> static int find_existing_ddw_windows(void)
> {
> int len;
> @@ -887,18 +905,11 @@ static int find_existing_ddw_windows(void)
> if (!direct64)
> continue;
>
> - window = kzalloc(sizeof(*window), GFP_KERNEL);
> - if (!window || len < sizeof(struct dynamic_dma_window_prop)) {
> + window = ddw_list_add(pdn, direct64);
> + if (!window || len < sizeof(*direct64)) {


Since you are touching this code, it looks like the "len <
sizeof(*direct64)" part should go above to "if (!direct64)".



> kfree(window);
> remove_ddw(pdn, true);
> - continue;
> }
> -
> - window->device = pdn;
> - window->prop = direct64;
> - spin_lock(&direct_window_list_lock);
> - list_add(&window->list, &direct_window_list);
> - spin_unlock(&direct_window_list_lock);
> }
>
> return 0;
> @@ -1261,7 +1272,8 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
> create.liobn, dn);
>
> - window = kzalloc(sizeof(*window), GFP_KERNEL);
> + /* Add new window to existing DDW list */

The comment seems to duplicate what the ddw_list_add name already suggests.


> + window = ddw_list_add(pdn, ddwprop);
> if (!window)
> goto out_clear_window;
>
> @@ -1280,16 +1292,14 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> goto out_free_window;
> }
>
> - window->device = pdn;
> - window->prop = ddwprop;
> - spin_lock(&direct_window_list_lock);
> - list_add(&window->list, &direct_window_list);
> - spin_unlock(&direct_window_list_lock);

I'd leave these 3 lines here and in find_existing_ddw_windows() (which
would make ddw_list_add -> ddw_prop_alloc). In general you want to have
less stuff to do on the failure path. kmalloc may fail and needs kfree
but you can safely delay list_add (which cannot fail) and avoid having
the lock held twice in the same function (one of them is hidden inside
ddw_list_add).
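I.e. something like this (untested), with the locking and list_add left at
the two call sites:

static struct direct_window *ddw_prop_alloc(struct device_node *pdn,
					    const struct dynamic_dma_window_prop *dma64)
{
	struct direct_window *window;

	window = kzalloc(sizeof(*window), GFP_KERNEL);
	if (!window)
		return NULL;

	window->device = pdn;
	window->prop = dma64;

	return window;
}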

Not sure if this change is really needed after all. Thanks,

> -
> dma_addr = be64_to_cpu(ddwprop->dma_base);
> goto out_unlock;
>
> out_free_window:
> + spin_lock(&direct_window_list_lock);
> + list_del(&window->list);
> + spin_unlock(&direct_window_list_lock);
> +
> kfree(window);
>
> out_clear_window:
>

--
Alexey

2020-08-24 05:11:48

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 07/10] powerpc/pseries/iommu: Allow DDW windows starting at 0x00



On 18/08/2020 09:40, Leonardo Bras wrote:
> enable_ddw() currently returns the address of the DMA window, which is
> considered invalid if has the value 0x00.
>
> Also, it only considers valid an address returned from find_existing_ddw
> if it's not 0x00.
>
> Changing this behavior makes sense, given the users of enable_ddw() only
> need to know if direct mapping is possible. It can also allow a DMA window
> starting at 0x00 to be used.
>
> This will be helpful for using a DDW with indirect mapping, as the window
> address will be different than 0x00, but it will not map the whole
> partition.
>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---
> arch/powerpc/platforms/pseries/iommu.c | 30 ++++++++++++--------------
> 1 file changed, 14 insertions(+), 16 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index fcdefcc0f365..4031127c9537 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -852,24 +852,25 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
> np, ret);
> }
>
> -static u64 find_existing_ddw(struct device_node *pdn)
> +static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
> {
> struct direct_window *window;
> const struct dynamic_dma_window_prop *direct64;
> - u64 dma_addr = 0;
> + bool found = false;
>
> spin_lock(&direct_window_list_lock);
> /* check if we already created a window and dupe that config if so */
> list_for_each_entry(window, &direct_window_list, list) {
> if (window->device == pdn) {
> direct64 = window->prop;
> - dma_addr = be64_to_cpu(direct64->dma_base);
> + *dma_addr = be64_to_cpu(direct64->dma_base);
> + found = true;
> break;
> }
> }
> spin_unlock(&direct_window_list_lock);
>
> - return dma_addr;
> + return found;
> }
>
> static struct direct_window *ddw_list_add(struct device_node *pdn,
> @@ -1131,15 +1132,15 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
> * pdn: the parent pe node with the ibm,dma_window property
> * Future: also check if we can remap the base window for our base page size
> *
> - * returns the dma offset for use by the direct mapped DMA code.
> + * returns true if can map all pages (direct mapping), false otherwise..
> */
> -static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> +static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> {
> int len, ret;
> struct ddw_query_response query;
> struct ddw_create_response create;
> int page_shift;
> - u64 dma_addr, max_addr;
> + u64 max_addr;
> struct device_node *dn;
> u32 ddw_avail[DDW_APPLICABLE_SIZE];
> struct direct_window *window;
> @@ -1150,8 +1151,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>
> mutex_lock(&direct_window_init_mutex);
>
> - dma_addr = find_existing_ddw(pdn);
> - if (dma_addr != 0)
> + if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset))
> goto out_unlock;
>
> /*
> @@ -1292,7 +1292,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> goto out_free_window;
> }
>
> - dma_addr = be64_to_cpu(ddwprop->dma_base);
> + dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);


Do not you need the same chunk in the find_existing_ddw() case above as
well? Thanks,


> goto out_unlock;
>
> out_free_window:
> @@ -1309,6 +1309,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> kfree(win64->name);
> kfree(win64->value);
> kfree(win64);
> + win64 = NULL;
>
> out_failed:
> if (default_win_removed)
> @@ -1322,7 +1323,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>
> out_unlock:
> mutex_unlock(&direct_window_init_mutex);
> - return dma_addr;
> + return win64;
> }
>
> static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
> @@ -1401,11 +1402,8 @@ static bool iommu_bypass_supported_pSeriesLP(struct pci_dev *pdev, u64 dma_mask)
> break;
> }
>
> - if (pdn && PCI_DN(pdn)) {
> - pdev->dev.archdata.dma_offset = enable_ddw(pdev, pdn);
> - if (pdev->dev.archdata.dma_offset)
> - return true;
> - }
> + if (pdn && PCI_DN(pdn))
> + return enable_ddw(pdev, pdn);
>
> return false;
> }
>

--
Alexey

2020-08-24 05:43:56

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 08/10] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()



On 18/08/2020 09:40, Leonardo Bras wrote:
> Code used to create a ddw property that was previously scattered in
> enable_ddw() is now gathered in ddw_property_create(), which deals with
> allocation and filling the property, letting it ready for
> of_property_add(), which now occurs in sequence.
>
> This created an opportunity to reorganize the second part of enable_ddw():
>
> Without this patch enable_ddw() does, in order:
> kzalloc() property & members, create_ddw(), fill ddwprop inside property,
> ddw_list_add(), do tce_setrange_multi_pSeriesLP_walk in all memory,
> of_add_property().
>
> With this patch enable_ddw() does, in order:
> create_ddw(), ddw_property_create(), of_add_property(), ddw_list_add(),
> do tce_setrange_multi_pSeriesLP_walk in all memory.
>
> This change requires of_remove_property() in case anything fails after
> of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk
> in all memory, which looks the most expensive operation, only if
> everything else succeeds.
>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---
> arch/powerpc/platforms/pseries/iommu.c | 97 +++++++++++++++-----------
> 1 file changed, 57 insertions(+), 40 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 4031127c9537..3a1ef02ad9d5 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -1123,6 +1123,31 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
> ret);
> }
>
> +static int ddw_property_create(struct property **ddw_win, const char *propname,

@propname is always the same, do you really want to pass it every time?

> + u32 liobn, u64 dma_addr, u32 page_shift, u32 window_shift)
> +{
> + struct dynamic_dma_window_prop *ddwprop;
> + struct property *win64;
> +
> + *ddw_win = win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
> + if (!win64)
> + return -ENOMEM;
> +
> + win64->name = kstrdup(propname, GFP_KERNEL);

Not clear why "win64->name = DIRECT64_PROPNAME" would not work here; the
generic OF code does not try to kfree() it, but it is probably out of scope
here.


> + ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
> + win64->value = ddwprop;
> + win64->length = sizeof(*ddwprop);
> + if (!win64->name || !win64->value)
> + return -ENOMEM;


Up to 2 memory leaks here. I see the cleanup at "out_free_prop:" but
still looks fragile. Instead you could simply return win64 as the only
error possible here is -ENOMEM and returning NULL is equally good.
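Something like this maybe (untested sketch of the suggested signature,
keeping the cleanup local so the caller only sees NULL on failure):

static struct property *ddw_property_create(const char *propname, u32 liobn, u64 dma_addr,
					    u32 page_shift, u32 window_shift)
{
	struct dynamic_dma_window_prop *ddwprop;
	struct property *win64;

	win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
	if (!win64)
		return NULL;

	win64->name = kstrdup(propname, GFP_KERNEL);
	ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
	win64->value = ddwprop;
	win64->length = sizeof(*ddwprop);
	if (!win64->name || !ddwprop) {
		kfree(win64->name);
		kfree(ddwprop);
		kfree(win64);
		return NULL;
	}

	ddwprop->liobn = cpu_to_be32(liobn);
	ddwprop->dma_base = cpu_to_be64(dma_addr);
	ddwprop->tce_shift = cpu_to_be32(page_shift);
	ddwprop->window_shift = cpu_to_be32(window_shift);

	return win64;
}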


> +
> + ddwprop->liobn = cpu_to_be32(liobn);
> + ddwprop->dma_base = cpu_to_be64(dma_addr);
> + ddwprop->tce_shift = cpu_to_be32(page_shift);
> + ddwprop->window_shift = cpu_to_be32(window_shift);
> +
> + return 0;
> +}
> +
> /*
> * If the PE supports dynamic dma windows, and there is space for a table
> * that can map all pages in a linear offset, then setup such a table,
> @@ -1140,12 +1165,11 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> struct ddw_query_response query;
> struct ddw_create_response create;
> int page_shift;
> - u64 max_addr;
> + u64 max_addr, win_addr;
> struct device_node *dn;
> u32 ddw_avail[DDW_APPLICABLE_SIZE];
> struct direct_window *window;
> - struct property *win64;
> - struct dynamic_dma_window_prop *ddwprop;
> + struct property *win64 = NULL;
> struct failed_ddw_pdn *fpdn;
> bool default_win_removed = false;
>
> @@ -1244,38 +1268,34 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> goto out_failed;
> }
> len = order_base_2(max_addr);
> - win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
> - if (!win64) {
> - dev_info(&dev->dev,
> - "couldn't allocate property for 64bit dma window\n");
> +
> + ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
> + if (ret != 0)

It is usually just "if (ret)"


> goto out_failed;
> - }
> - win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL);
> - win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL);
> - win64->length = sizeof(*ddwprop);
> - if (!win64->name || !win64->value) {
> +
> + dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
> + create.liobn, dn);
> +
> + win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
> + ret = ddw_property_create(&win64, DIRECT64_PROPNAME, create.liobn, win_addr,
> + page_shift, len);
> + if (ret) {
> dev_info(&dev->dev,
> - "couldn't allocate property name and value\n");
> + "couldn't allocate property, property name, or value\n");
> goto out_free_prop;
> }
>
> - ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
> - if (ret != 0)
> + ret = of_add_property(pdn, win64);
> + if (ret) {
> + dev_err(&dev->dev, "unable to add dma window property for %pOF: %d",
> + pdn, ret);
> goto out_free_prop;
> -
> - ddwprop->liobn = cpu_to_be32(create.liobn);
> - ddwprop->dma_base = cpu_to_be64(((u64)create.addr_hi << 32) |
> - create.addr_lo);
> - ddwprop->tce_shift = cpu_to_be32(page_shift);
> - ddwprop->window_shift = cpu_to_be32(len);
> -
> - dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
> - create.liobn, dn);
> + }
>
> /* Add new window to existing DDW list */
> - window = ddw_list_add(pdn, ddwprop);
> + window = ddw_list_add(pdn, win64->value);
> if (!window)
> - goto out_clear_window;
> + goto out_prop_del;
>
> ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> win64->value, tce_setrange_multi_pSeriesLP_walk);
> @@ -1285,14 +1305,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> goto out_free_window;
> }
>
> - ret = of_add_property(pdn, win64);
> - if (ret) {
> - dev_err(&dev->dev, "unable to add dma window property for %pOF: %d",
> - pdn, ret);
> - goto out_free_window;
> - }
> -
> - dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);
> + dev->dev.archdata.dma_offset = win_addr;
> goto out_unlock;
>
> out_free_window:
> @@ -1302,14 +1315,18 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>
> kfree(window);
>
> -out_clear_window:
> - remove_ddw(pdn, true);
> +out_prop_del:
> + of_remove_property(pdn, win64);
>
> out_free_prop:
> - kfree(win64->name);
> - kfree(win64->value);
> - kfree(win64);
> - win64 = NULL;
> + if (win64) {
> + kfree(win64->name);
> + kfree(win64->value);
> + kfree(win64);
> + win64 = NULL;
> + }
> +
> + remove_ddw(pdn, true);
>
> out_failed:
> if (default_win_removed)
>

--
Alexey

2020-08-24 05:47:04

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 09/10] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition



On 18/08/2020 09:40, Leonardo Bras wrote:
> As of today, if the biggest DDW that can be created can't map the whole
> partition, it's creation is skipped and the default DMA window
> "ibm,dma-window" is used instead.
>
> DDW is 16x bigger than the default DMA window,

16x only under very specific circumstances which are
1. phyp
2. sriov
3. device class in hmc (or what that priority number is in the lpar config).

> having the same amount of
> pages, but increasing the page size to 64k.
> Besides larger DMA window,

"Besides being larger"?

> it performs better for allocations over 4k,

Better how?

> so it would be nice to use it instead.


I'd rather say something like:
===
So far we assumed we can map the guest RAM 1:1 to the bus which worked
with a small number of devices. SRIOV changes it as the user can
configure hundreds VFs and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per PE to limit waste of physical pages.
===


>
> The DDW created will be used for direct mapping by default.
> If it's not available, indirect mapping will be used instead.
>
> For indirect mapping, it's necessary to update the iommu_table so
> iommu_alloc() can use the DDW created. For this,
> iommu_table_update_window() is called when everything else succeeds
> at enable_ddw().
>
> Removing the default DMA window for using DDW with indirect mapping
> is only allowed if there is no current IOMMU memory allocated in
> the iommu_table. enable_ddw() is aborted otherwise.
>
> As there will never have both direct and indirect mappings at the same
> time, the same property name can be used for the created DDW.
>
> So renaming
> define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
> to
> define DMA64_PROPNAME "linux,dma64-ddr-window-info"
> looks the right thing to do.

I know I suggested this but this does not look so good anymore as I
suspect it breaks kexec (from older kernel to this one) so you either
need to check for both DT names or just keep the old one. Changing the
macro name is fine.


>
> To make sure the property differentiates both cases, a new u32 for flags
> was added at the end of the property, where BIT(0) set means direct
> mapping.
>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---
> arch/powerpc/platforms/pseries/iommu.c | 108 +++++++++++++++++++------
> 1 file changed, 84 insertions(+), 24 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 3a1ef02ad9d5..9544e3c91ced 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -350,8 +350,11 @@ struct dynamic_dma_window_prop {
> __be64 dma_base; /* address hi,lo */
> __be32 tce_shift; /* ilog2(tce_page_size) */
> __be32 window_shift; /* ilog2(tce_window_size) */
> + __be32 flags; /* DDW properties, see bellow */
> };
>
> +#define DDW_FLAGS_DIRECT 0x01

This is set if ((1<<window_shift) >= ddw_memory_hotplug_max()), you
could simply check window_shift and drop the flags.
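I.e. in find_existing_ddw() just something like (untested):

	*direct_mapping = ((1ULL << be32_to_cpu(direct64->window_shift)) >=
			   ddw_memory_hotplug_max());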


> +
> struct direct_window {
> struct device_node *device;
> const struct dynamic_dma_window_prop *prop;
> @@ -377,7 +380,7 @@ static LIST_HEAD(direct_window_list);
> static DEFINE_SPINLOCK(direct_window_list_lock);
> /* protects initializing window twice for same device */
> static DEFINE_MUTEX(direct_window_init_mutex);
> -#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
> +#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
>
> static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
> unsigned long num_pfn, const void *arg)
> @@ -836,7 +839,7 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
> if (ret)
> return;
>
> - win = of_find_property(np, DIRECT64_PROPNAME, NULL);
> + win = of_find_property(np, DMA64_PROPNAME, NULL);
> if (!win)
> return;
>
> @@ -852,7 +855,7 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
> np, ret);
> }
>
> -static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
> +static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, bool *direct_mapping)
> {
> struct direct_window *window;
> const struct dynamic_dma_window_prop *direct64;
> @@ -864,6 +867,7 @@ static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
> if (window->device == pdn) {
> direct64 = window->prop;
> *dma_addr = be64_to_cpu(direct64->dma_base);
> + *direct_mapping = be32_to_cpu(direct64->flags) & DDW_FLAGS_DIRECT;
> found = true;
> break;
> }
> @@ -901,8 +905,8 @@ static int find_existing_ddw_windows(void)
> if (!firmware_has_feature(FW_FEATURE_LPAR))
> return 0;
>
> - for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
> - direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
> + for_each_node_with_property(pdn, DMA64_PROPNAME) {
> + direct64 = of_get_property(pdn, DMA64_PROPNAME, &len);
> if (!direct64)
> continue;
>
> @@ -1124,7 +1128,8 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
> }
>
> static int ddw_property_create(struct property **ddw_win, const char *propname,
> - u32 liobn, u64 dma_addr, u32 page_shift, u32 window_shift)
> + u32 liobn, u64 dma_addr, u32 page_shift,
> + u32 window_shift, bool direct_mapping)
> {
> struct dynamic_dma_window_prop *ddwprop;
> struct property *win64;
> @@ -1144,6 +1149,36 @@ static int ddw_property_create(struct property **ddw_win, const char *propname,
> ddwprop->dma_base = cpu_to_be64(dma_addr);
> ddwprop->tce_shift = cpu_to_be32(page_shift);
> ddwprop->window_shift = cpu_to_be32(window_shift);
> + if (direct_mapping)
> + ddwprop->flags = cpu_to_be32(DDW_FLAGS_DIRECT);
> +
> + return 0;
> +}
> +
> +static int iommu_table_update_window(struct iommu_table **tbl, int nid, unsigned long liobn,
> + unsigned long win_addr, unsigned long page_shift,
> + unsigned long window_size)

Rather strange helper imho. I'd extract most of
iommu_table_setparms_lpar() into iommu_table_setparms() (except
of_parse_dma_window), call the new helper from where you call
iommu_table_update_window, and do
iommu_pseries_alloc_table/iommu_tce_table_put there.
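Roughly this at the enable_ddw() call site (untested;
iommu_table_setparms_common is just a made-up name for the extracted helper):

	struct iommu_table *new_tbl = iommu_pseries_alloc_table(pci->phb->node);

	if (!new_tbl)
		goto out_free_window;

	/* hypothetical helper: everything iommu_table_setparms_lpar() sets
	 * today (it_index, it_offset, it_size, it_page_shift, it_ops, ...)
	 * except the of_parse_dma_window() part */
	iommu_table_setparms_common(new_tbl, pci->phb->bus->number, create.liobn,
				    win_addr, 1UL << len, page_shift);

	iommu_init_table(new_tbl, pci->phb->node, 0, 0);
	iommu_tce_table_put(tbl);
	pci->table_group->tables[0] = new_tbl;
	set_iommu_table_base(&dev->dev, new_tbl);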


> +{
> + struct iommu_table *new_tbl, *old_tbl;
> +
> + new_tbl = iommu_pseries_alloc_table(nid);
> + if (!new_tbl)
> + return -ENOMEM;
> +
> + old_tbl = *tbl;
> + new_tbl->it_index = liobn;
> + new_tbl->it_offset = win_addr >> page_shift;
> + new_tbl->it_page_shift = page_shift;
> + new_tbl->it_size = window_size >> page_shift;
> + new_tbl->it_base = old_tbl->it_base;

Should not be used in pseries.


> + new_tbl->it_busno = old_tbl->it_busno;
> + new_tbl->it_blocksize = old_tbl->it_blocksize;

16 for pseries and does not change (maybe even make it a macro).

> + new_tbl->it_type = old_tbl->it_type;

TCE_PCI.


> + new_tbl->it_ops = old_tbl->it_ops;
> +
> + iommu_init_table(new_tbl, nid, old_tbl->it_reserved_start, old_tbl->it_reserved_end);
> + iommu_tce_table_put(old_tbl);
> + *tbl = new_tbl;
>
> return 0;
> }
> @@ -1171,12 +1206,16 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> struct direct_window *window;
> struct property *win64 = NULL;
> struct failed_ddw_pdn *fpdn;
> - bool default_win_removed = false;
> + bool default_win_removed = false, maps_whole_partition = false;


s/maps_whole_partition/direct_mapping/


> + struct pci_dn *pci = PCI_DN(pdn);
> + struct iommu_table *tbl = pci->table_group->tables[0];
>
> mutex_lock(&direct_window_init_mutex);
>
> - if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset))
> - goto out_unlock;
> + if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &maps_whole_partition)) {
> + mutex_unlock(&direct_window_init_mutex);
> + return maps_whole_partition;
> + }
>
> /*
> * If we already went through this for a previous function of
> @@ -1258,16 +1297,24 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> query.page_size);
> goto out_failed;
> }
> +
> /* verify the window * number of ptes will map the partition */
> - /* check largest block * page size > max memory hotplug addr */
> max_addr = ddw_memory_hotplug_max();
> if (query.largest_available_block < (max_addr >> page_shift)) {
> - dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
> - "%llu-sized pages\n", max_addr, query.largest_available_block,
> - 1ULL << page_shift);
> - goto out_failed;
> + dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu %llu-sized pages\n",
> + max_addr, query.largest_available_block,
> + 1ULL << page_shift);
> +
> + len = order_base_2(query.largest_available_block << page_shift);
> + } else {
> + maps_whole_partition = true;
> + len = order_base_2(max_addr);
> }
> - len = order_base_2(max_addr);
> +
> + /* DDW + IOMMU on single window may fail if there is any allocation */
> + if (default_win_removed && !maps_whole_partition &&
> + iommu_table_in_use(tbl))
> + goto out_failed;
>
> ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
> if (ret != 0)
> @@ -1277,8 +1324,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> create.liobn, dn);
>
> win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
> - ret = ddw_property_create(&win64, DIRECT64_PROPNAME, create.liobn, win_addr,
> - page_shift, len);
> + ret = ddw_property_create(&win64, DMA64_PROPNAME, create.liobn, win_addr,
> + page_shift, len, maps_whole_partition);
> if (ret) {
> dev_info(&dev->dev,
> "couldn't allocate property, property name, or value\n");
> @@ -1297,12 +1344,25 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> if (!window)
> goto out_prop_del;
>
> - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> - win64->value, tce_setrange_multi_pSeriesLP_walk);
> - if (ret) {
> - dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
> - dn, ret);
> - goto out_free_window;
> + if (maps_whole_partition) {
> + /* DDW maps the whole partition, so enable direct DMA mapping */
> + ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> + win64->value, tce_setrange_multi_pSeriesLP_walk);
> + if (ret) {
> + dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
> + dn, ret);
> + goto out_free_window;
> + }
> + } else {
> + /* New table for using DDW instead of the default DMA window */
> + if (iommu_table_update_window(&tbl, pci->phb->node, create.liobn,
> + win_addr, page_shift, 1UL << len))
> + goto out_free_window;
> +
> + set_iommu_table_base(&dev->dev, tbl);
> + WARN_ON(dev->dev.archdata.dma_offset >= SZ_4G);

What is this check for exactly? Why 4G, not >= 0, for example?

> + goto out_unlock;
> +
> }
>
> dev->dev.archdata.dma_offset = win_addr;
> @@ -1340,7 +1400,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>
> out_unlock:
> mutex_unlock(&direct_window_init_mutex);
> - return win64;
> + return win64 && maps_whole_partition;
> }
>
> static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
>

--
Alexey

2020-08-27 15:34:20

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift

Hello Alexey, thank you for this feedback!

On Sat, 2020-08-22 at 19:33 +1000, Alexey Kardashevskiy wrote:
> > +#define TCE_RPN_BITS 52 /* Bits 0-51 represent RPN on TCE */
>
> Ditch this one and use MAX_PHYSMEM_BITS instead? I am pretty sure this
> is the actual limit.

I understand this MAX_PHYSMEM_BITS (51) comes from the maximum physical memory addressable in the machine. IIUC, it means we can access physical addresses up to (1ul << MAX_PHYSMEM_BITS).

This 52 comes from PAPR "Table 9. TCE Definition" which defines bits
0-51 as the RPN. By looking at the code, I understand that it means we may input any address < (1ul << 52) to a TCE.

In practice, MAX_PHYSMEM_BITS should be enough as of today, because I suppose we can't ever pass a physical page address over
(1ul << 51), and TCE accepts up to (1ul << 52).
But if we ever increase MAX_PHYSMEM_BITS, it doesn't necessarily mean that TCE_RPN_BITS will also be increased, so I think they are independent values.

Does it make sense? Please let me know if I am missing something.

>
>
> > +#define TCE_RPN_MASK(ps) ((1ul << (TCE_RPN_BITS - (ps))) - 1)
> > #define TCE_VALID 0x800 /* TCE valid */
> > #define TCE_ALLIO 0x400 /* TCE valid for all lpars */
> > #define TCE_PCI_WRITE 0x2 /* write from PCI allowed */
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > index e4198700ed1a..8fe23b7dff3a 100644
> > --- a/arch/powerpc/platforms/pseries/iommu.c
> > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > @@ -107,6 +107,9 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
> > u64 proto_tce;
> > __be64 *tcep;
> > u64 rpn;
> > + const unsigned long tceshift = tbl->it_page_shift;
> > + const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);
> > + const u64 rpn_mask = TCE_RPN_MASK(tceshift);
>
> Using IOMMU_PAGE_SIZE macro for the page size and not using
> > IOMMU_PAGE_MASK for the mask - this inconsistency makes my small brain
> explode :) I understand the history but maaaaan... Oh well, ok.
>

Yeah, it feels kind of weird after two IOMMU related consts. :)
But sure IOMMU_PAGE_MASK() would not be useful here :)

And this kind of got me thinking:
> > + rpn = __pa(uaddr) >> tceshift;
> > + *tcep = cpu_to_be64(proto_tce | (rpn & rpn_mask) << tceshift);
Why not:
rpn_mask = TCE_RPN_MASK(tceshift) << tceshift;

rpn = __pa(uaddr) & rpn_mask;
*tcep = cpu_to_be64(proto_tce | rpn);

I am usually afraid of changing stuff like this, but I think it's safe.

> Good, otherwise. Thanks,

Thank you for reviewing!



2020-08-27 16:53:34

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()

On Sat, 2020-08-22 at 20:07 +1000, Alexey Kardashevskiy wrote:
>
> On 18/08/2020 09:40, Leonardo Bras wrote:
> > Both iommu_alloc_coherent() and iommu_free_coherent() assume that once
> > size is aligned to PAGE_SIZE it will be aligned to IOMMU_PAGE_SIZE.
>
> The only case when it is not aligned is when IOMMU_PAGE_SIZE > PAGE_SIZE
> which is unlikely but not impossible, we could configure the kernel for
> 4K system pages and 64K IOMMU pages I suppose. Do we really want to do
> this here, or simply put WARN_ON(tbl->it_page_shift > PAGE_SHIFT)?

I think it would be better to keep the code as generic as possible
regarding page sizes.

> Because if we want the former (==support), then we'll have to align the
> size up to the bigger page size when allocating/zeroing system pages,
> etc.

This part I don't understand. Why do we need to align everything to the
bigger pagesize?

I mean, isn't it enough that the range [ret, ret + size[ is both
allocated by mm and mapped on an iommu range?

Suppose an iommu_alloc_coherent() of 16kB on PAGE_SIZE = 4k and
IOMMU_PAGE_SIZE() == 64k.
Why aren't 4 * cpu_pages mapped by a 64k IOMMU page enough?
All the space the user asked for is allocated and mapped for DMA.


> Bigger pages are not the case here as I understand it.

I did not get this part, what do you mean?

> > Update those functions to guarantee alignment with requested size
> > using IOMMU_PAGE_ALIGN() before doing iommu_alloc() / iommu_free().
> >
> > Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
> > with IOMMU_PAGE_ALIGN(n, tbl), which seems easier to read.
> >
> > Signed-off-by: Leonardo Bras <[email protected]>
> > ---
> > arch/powerpc/kernel/iommu.c | 17 +++++++++--------
> > 1 file changed, 9 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> > index 9704f3f76e63..d7086087830f 100644
> > --- a/arch/powerpc/kernel/iommu.c
> > +++ b/arch/powerpc/kernel/iommu.c
> > @@ -237,10 +237,9 @@ static unsigned long iommu_range_alloc(struct device *dev,
> > }
> >
> > if (dev)
> > - boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> > - 1 << tbl->it_page_shift);
> > + boundary_size = IOMMU_PAGE_ALIGN(dma_get_seg_boundary(dev) + 1, tbl);
>
> Run checkpatch.pl, should complain about a long line.

It's 86 columns long, which is less than the new limit of 100 columns
Linus announced a few weeks ago. checkpatch.pl was updated too:
https://www.phoronix.com/scan.php?page=news_item&px=Linux-Kernel-Deprecates-80-Col


>
>
> > else
> > - boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
> > + boundary_size = IOMMU_PAGE_ALIGN(1UL << 32, tbl);
> > /* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
> >
> > n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset,
> > @@ -858,6 +857,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
> > unsigned int order;
> > unsigned int nio_pages, io_order;
> > struct page *page;
> > + size_t size_io = size;
> >
> > size = PAGE_ALIGN(size);
> > order = get_order(size);
> > @@ -884,8 +884,9 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
> > memset(ret, 0, size);
> >
> > /* Set up tces to cover the allocated range */
> > - nio_pages = size >> tbl->it_page_shift;
> > - io_order = get_iommu_order(size, tbl);
> > + size_io = IOMMU_PAGE_ALIGN(size_io, tbl);
> > + nio_pages = size_io >> tbl->it_page_shift;
> > + io_order = get_iommu_order(size_io, tbl);
> > mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
> > mask >> tbl->it_page_shift, io_order, 0);
> > if (mapping == DMA_MAPPING_ERROR) {
> > @@ -900,11 +901,11 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
> > void *vaddr, dma_addr_t dma_handle)
> > {
> > if (tbl) {
> > - unsigned int nio_pages;
> > + size_t size_io = IOMMU_PAGE_ALIGN(size, tbl);
> > + unsigned int nio_pages = size_io >> tbl->it_page_shift;
> >
> > - size = PAGE_ALIGN(size);
> > - nio_pages = size >> tbl->it_page_shift;
> > iommu_free(tbl, dma_handle, nio_pages);
> > +
>
> Unrelated new line.

Will be removed. Thanks!

>
>
> > size = PAGE_ALIGN(size);
> > free_pages((unsigned long)vaddr, get_order(size));
> > }
> >

2020-08-27 17:03:11

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v1 03/10] powerpc/kernel/iommu: Use largepool as a last resort when !largealloc

On Sat, 2020-08-22 at 20:09 +1000, Alexey Kardashevskiy wrote:
> > + goto again;
> > +
>
> A nit: unnecessary new line.

I was following the pattern used above. There is a newline after every
"goto again" in this 'if'.

> Reviewed-by: Alexey Kardashevskiy <[email protected]>

Thank you!


2020-08-27 18:36:05

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v1 04/10] powerpc/kernel/iommu: Add new iommu_table_in_use() helper

On Sat, 2020-08-22 at 20:34 +1000, Alexey Kardashevskiy wrote:
> > +
> > + /*ignore reserved bit0*/
>
> s/ignore reserved bit0/ ignore reserved bit0 / (add spaces)

Fixed

> > + if (tbl->it_offset == 0)
> > + p1_start = 1;
> > +
> > + /* Check if reserved memory is valid*/
>
> A missing space here.

Fixed

>
> > + if (tbl->it_reserved_start >= tbl->it_offset &&
> > + tbl->it_reserved_start <= (tbl->it_offset + tbl->it_size) &&
> > + tbl->it_reserved_end >= tbl->it_offset &&
> > + tbl->it_reserved_end <= (tbl->it_offset + tbl->it_size)) {
>
> Uff. What if tbl->it_reserved_end is bigger than tbl->it_offset +
> tbl->it_size?
>
> The reserved area is to preserve MMIO32 so it is for it_offset==0 only
> and the boundaries are checked in the only callsite, and it is unlikely
> to change soon or ever.
>
> Rather than bothering with fixing that, maybe just add (did not test):
>
> if (WARN_ON(((tbl->it_reserved_start || tbl->it_reserved_end) && (it_offset != 0)) ||
> 	    ((tbl->it_reserved_start > it_offset && tbl->it_reserved_end < it_offset + it_size) &&
> 	     (it_offset == 0))))
> 	return true;
>
> Or simply always look for it_offset..it_reserved_start and
> it_reserved_end..it_offset+it_size and if there is no reserved area,
> initialize it_reserved_start=it_reserved_end=it_offset so the first
> it_offset..it_reserved_start becomes a no-op.

The problem here is that the values of it_reserved_{start,end} are not
necessarily valid. I mean, on iommu_table_reserve_pages() the values
are stored however they are given (bit reserving is done only if they
are valid).

Having an it_reserved_{start,end} value outside the valid ranges would
cause find_next_bit() to run over memory outside the bitmap.
Even if those values are < tbl->it_offset, the resulting
subtraction on unsigned would cause it to become a big value and run
over memory outside the bitmap.

But I think you are right. That is not the place to check if the
reserved values are valid. It should just trust them here.
I intend to change iommu_table_reserve_pages() to only store the
parameters in it_reserved_{start,end} if they are in the range, or
it_offset in both of them if they are not.
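Something like this is what I have in mind (untested sketch, keeping the
bit0 reservation as it is today):

	/* In iommu_table_reserve_pages(), before storing anything: */
	if (res_start < tbl->it_offset || res_end > tbl->it_offset + tbl->it_size) {
		/* invalid range: make both scans in iommu_table_in_use() no-ops */
		tbl->it_reserved_start = tbl->it_offset;
		tbl->it_reserved_end = tbl->it_offset;
		return;
	}

	tbl->it_reserved_start = res_start;
	tbl->it_reserved_end = res_end;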

What do you think?

Thanks for the feedback!
Leonardo Bras



2020-08-27 21:26:07

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v1 05/10] powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper

On Mon, 2020-08-24 at 10:38 +1000, Alexey Kardashevskiy wrote:
>
> On 18/08/2020 09:40, Leonardo Bras wrote:
> > Creates a helper to allow allocating a new iommu_table without the need
> > to reallocate the iommu_group.
> >
> > This will be helpful for replacing the iommu_table for the new DMA window,
> > after we remove the old one with iommu_tce_table_put().
> >
> > Signed-off-by: Leonardo Bras <[email protected]>
> > ---
> > arch/powerpc/platforms/pseries/iommu.c | 25 ++++++++++++++-----------
> > 1 file changed, 14 insertions(+), 11 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > index 8fe23b7dff3a..39617ce0ec83 100644
> > --- a/arch/powerpc/platforms/pseries/iommu.c
> > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > @@ -53,28 +53,31 @@ enum {
> > DDW_EXT_QUERY_OUT_SIZE = 2
> > };
> >
> > -static struct iommu_table_group *iommu_pseries_alloc_group(int node)
> > +static struct iommu_table *iommu_pseries_alloc_table(int node)
> > {
> > - struct iommu_table_group *table_group;
> > struct iommu_table *tbl;
> >
> > - table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL,
> > - node);
> > - if (!table_group)
> > - return NULL;
> > -
> > tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
> > if (!tbl)
> > - goto free_group;
> > + return NULL;
> >
> > INIT_LIST_HEAD_RCU(&tbl->it_group_list);
> > kref_init(&tbl->it_kref);
> > + return tbl;
> > +}
> >
> > - table_group->tables[0] = tbl;
> > +static struct iommu_table_group *iommu_pseries_alloc_group(int node)
> > +{
> > + struct iommu_table_group *table_group;
> > +
> > + table_group = kzalloc_node(sizeof(*table_group), GFP_KERNEL, node);
>
> I'd prefer you did not make unrelated changes (sizeof(struct
> iommu_table_group) -> sizeof(*table_group)) so the diff stays shorter
> and easier to follow. You changed sizeof(struct iommu_table_group) but
> not sizeof(struct iommu_table) and this confused me enough to spend more
> time than this straightforward change deserves.

Sorry, I will keep this in mind for future patches.
Thank you for the tip!

>
> Not important in this case though so
>
> Reviewed-by: Alexey Kardashevskiy <[email protected]>

Thank you!


2020-08-27 22:12:54

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v1 06/10] powerpc/pseries/iommu: Add ddw_list_add() helper

On Mon, 2020-08-24 at 13:46 +1000, Alexey Kardashevskiy wrote:
> > static int find_existing_ddw_windows(void)
> > {
> > int len;
> > @@ -887,18 +905,11 @@ static int find_existing_ddw_windows(void)
> > if (!direct64)
> > continue;
> >
> > - window = kzalloc(sizeof(*window), GFP_KERNEL);
> > - if (!window || len < sizeof(struct dynamic_dma_window_prop)) {
> > + window = ddw_list_add(pdn, direct64);
> > + if (!window || len < sizeof(*direct64)) {
>
> Since you are touching this code, it looks like the "len <
> sizeof(*direct64)" part should go above to "if (!direct64)".

Sure, makes sense.
It will be fixed for v2.

>
>
>
> > kfree(window);
> > remove_ddw(pdn, true);
> > - continue;
> > }
> > -
> > - window->device = pdn;
> > - window->prop = direct64;
> > - spin_lock(&direct_window_list_lock);
> > - list_add(&window->list, &direct_window_list);
> > - spin_unlock(&direct_window_list_lock);
> > }
> >
> > return 0;
> > @@ -1261,7 +1272,8 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
> > create.liobn, dn);
> >
> > - window = kzalloc(sizeof(*window), GFP_KERNEL);
> > + /* Add new window to existing DDW list */
>
> The comment seems to duplicate what the ddw_list_add name already suggests.

Ok, I will remove it then.

> > + window = ddw_list_add(pdn, ddwprop);
> > if (!window)
> > goto out_clear_window;
> >
> > @@ -1280,16 +1292,14 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > goto out_free_window;
> > }
> >
> > - window->device = pdn;
> > - window->prop = ddwprop;
> > - spin_lock(&direct_window_list_lock);
> > - list_add(&window->list, &direct_window_list);
> > - spin_unlock(&direct_window_list_lock);
>
> I'd leave these 3 lines here and in find_existing_ddw_windows() (which
> would make ddw_list_add -> ddw_prop_alloc). In general you want to have
> less stuff to do on the failure path. kmalloc may fail and needs kfree
> but you can safely delay list_add (which cannot fail) and avoid having
> the lock held twice in the same function (one of them is hidden inside
> ddw_list_add).
> Not sure if this change is really needed after all. Thanks,

I understand this leads to better performance in case anything fails.
Also, I think list_add happening at the end is less error-prone (in
case the list is checked between list_add and a failure).

But what if we put it at the end?
What is the chance of a kzalloc of 4 pointers (struct direct_window)
failing after walk_system_ram_range?

Isn't it worth doing that to make enable_ddw() easier to
understand?

Best regards,
Leonardo

2020-08-28 01:42:37

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()



On 28/08/2020 02:51, Leonardo Bras wrote:
> On Sat, 2020-08-22 at 20:07 +1000, Alexey Kardashevskiy wrote:
>>
>> On 18/08/2020 09:40, Leonardo Bras wrote:
>>> Both iommu_alloc_coherent() and iommu_free_coherent() assume that once
>>> size is aligned to PAGE_SIZE it will be aligned to IOMMU_PAGE_SIZE.
>>
>> The only case when it is not aligned is when IOMMU_PAGE_SIZE > PAGE_SIZE
>> which is unlikely but not impossible, we could configure the kernel for
>> 4K system pages and 64K IOMMU pages I suppose. Do we really want to do
>> this here, or simply put WARN_ON(tbl->it_page_shift > PAGE_SHIFT)?
>
> I think it would be better to keep the code as generic as possible
> regarding page sizes.

Then you need to test it. Does a 4K guest even boot (it should, but I
would not bet much on it)?

>
>> Because if we want the former (==support), then we'll have to align the
>> size up to the bigger page size when allocating/zeroing system pages,
>> etc.
>
> This part I don't understand. Why do we need to align everything to the
> bigger pagesize?
>
> I mean, is not that enough that the range [ret, ret + size[ is both
> allocated by mm and mapped on a iommu range?
>
> Suppose a iommu_alloc_coherent() of 16kB on PAGESIZE = 4k and
> IOMMU_PAGE_SIZE() == 64k.
> Why 4 * cpu_pages mapped by a 64k IOMMU page is not enough?
> All the space the user asked for is allocated and mapped for DMA.


The user asked to map 16K; the rest - 48K - is used for something else
(maybe even mapped to another device), but you are making all 64K
accessible to the device, which should only be able to access 16K.

In practice, if this happens, H_PUT_TCE will simply fail.
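
For example (assuming PAGE_SIZE = 4K and IOMMU_PAGE_SIZE(tbl) = 64K, as
above):

	size asked to map           = 16K (4 system pages)
	IOMMU_PAGE_ALIGN(size, tbl) = 64K -> nio_pages = 1
	the single 64K TCE covers the 16K buffer plus 48K of memory the
	device was never meant to access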


>
>> Bigger pages are not the case here as I understand it.
>
> I did not get this part, what do you mean?


Possible IOMMU page sizes are 4K, 64K, 2M, 16M, 256M, 1GB, and the
supported set of sizes is different for P8/P9 and type of IO (PHB,
NVLink/CAPI).


>
>>> Update those functions to guarantee alignment with requested size
>>> using IOMMU_PAGE_ALIGN() before doing iommu_alloc() / iommu_free().
>>>
>>> Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
>>> with IOMMU_PAGE_ALIGN(n, tbl), which seems easier to read.
>>>
>>> Signed-off-by: Leonardo Bras <[email protected]>
>>> ---
>>> arch/powerpc/kernel/iommu.c | 17 +++++++++--------
>>> 1 file changed, 9 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>>> index 9704f3f76e63..d7086087830f 100644
>>> --- a/arch/powerpc/kernel/iommu.c
>>> +++ b/arch/powerpc/kernel/iommu.c
>>> @@ -237,10 +237,9 @@ static unsigned long iommu_range_alloc(struct device *dev,
>>> }
>>>
>>> if (dev)
>>> - boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
>>> - 1 << tbl->it_page_shift);
>>> + boundary_size = IOMMU_PAGE_ALIGN(dma_get_seg_boundary(dev) + 1, tbl);
>>
>> Run checkpatch.pl, should complain about a long line.
>
> It's 86 columns long, which is less than the new limit of 100 columns
> Linus announced a few weeks ago. checkpatch.pl was updated too:
> https://www.phoronix.com/scan.php?page=news_item&px=Linux-Kernel-Deprecates-80-Col

Yay finally :) Thanks,


>
>>
>>
>>> else
>>> - boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
>>> + boundary_size = IOMMU_PAGE_ALIGN(1UL << 32, tbl);
>>> /* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
>>>
>>> n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset,
>>> @@ -858,6 +857,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
>>> unsigned int order;
>>> unsigned int nio_pages, io_order;
>>> struct page *page;
>>> + size_t size_io = size;
>>>
>>> size = PAGE_ALIGN(size);
>>> order = get_order(size);
>>> @@ -884,8 +884,9 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
>>> memset(ret, 0, size);
>>>
>>> /* Set up tces to cover the allocated range */
>>> - nio_pages = size >> tbl->it_page_shift;
>>> - io_order = get_iommu_order(size, tbl);
>>> + size_io = IOMMU_PAGE_ALIGN(size_io, tbl);
>>> + nio_pages = size_io >> tbl->it_page_shift;
>>> + io_order = get_iommu_order(size_io, tbl);
>>> mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
>>> mask >> tbl->it_page_shift, io_order, 0);
>>> if (mapping == DMA_MAPPING_ERROR) {
>>> @@ -900,11 +901,11 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
>>> void *vaddr, dma_addr_t dma_handle)
>>> {
>>> if (tbl) {
>>> - unsigned int nio_pages;
>>> + size_t size_io = IOMMU_PAGE_ALIGN(size, tbl);
>>> + unsigned int nio_pages = size_io >> tbl->it_page_shift;
>>>
>>> - size = PAGE_ALIGN(size);
>>> - nio_pages = size >> tbl->it_page_shift;
>>> iommu_free(tbl, dma_handle, nio_pages);
>>> +
>>
>> Unrelated new line.
>
> Will be removed. Thanks!
>
>>
>>
>>> size = PAGE_ALIGN(size);
>>> free_pages((unsigned long)vaddr, get_order(size));
>>> }
>>>
>

--
Alexey

2020-08-28 01:52:33

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 04/10] powerpc/kernel/iommu: Add new iommu_table_in_use() helper



On 28/08/2020 04:34, Leonardo Bras wrote:
> On Sat, 2020-08-22 at 20:34 +1000, Alexey Kardashevskiy wrote:
>>> +
>>> + /*ignore reserved bit0*/
>>
>> s/ignore reserved bit0/ ignore reserved bit0 / (add spaces)
>
> Fixed
>
>>> + if (tbl->it_offset == 0)
>>> + p1_start = 1;
>>> +
>>> + /* Check if reserved memory is valid*/
>>
>> A missing space here.
>
> Fixed
>
>>
>>> + if (tbl->it_reserved_start >= tbl->it_offset &&
>>> + tbl->it_reserved_start <= (tbl->it_offset + tbl->it_size) &&
>>> + tbl->it_reserved_end >= tbl->it_offset &&
>>> + tbl->it_reserved_end <= (tbl->it_offset + tbl->it_size)) {
>>
>> Uff. What if tbl->it_reserved_end is bigger than tbl->it_offset +
>> tbl->it_size?
>>
>> The reserved area is to preserve MMIO32 so it is for it_offset==0 only
>> and the boundaries are checked in the only callsite, and it is unlikely
>> to change soon or ever.
>>
>> Rather that bothering with fixing that, may be just add (did not test):
>>
>> if (WARN_ON((
>> (tbl->it_reserved_start || tbl->it_reserved_end) && (it_offset != 0))
>> (tbl->it_reserved_start > it_offset && tbl->it_reserved_end < it_offset
>> + it_size) && (it_offset == 0)) )
>> return true;
>>
>> Or simply always look for it_offset..it_reserved_start and
>> it_reserved_end..it_offset+it_size and if there is no reserved area,
>> initialize it_reserved_start=it_reserved_end=it_offset so the first
>> it_offset..it_reserved_start becomes a no-op.
>
> The problem here is that the values of it_reserved_{start,end} are not
> necessarily valid. I mean, on iommu_table_reserve_pages() the values
> are stored however they are given (bit reserving is done only if they
> are valid).
>
> Having a it_reserved_{start,end} value outside the valid ranges would
> cause find_next_bit() to run over memory outside the bitmap.
> Even if the those values are < tbl->it_offset, the resulting
> subtraction on unsigned would cause it to become a big value and run
> over memory outside the bitmap.
>
> But I think you are right. That is not the place to check if the
> reserved values are valid. It should just trust them here.
> I intend to change iommu_table_reserve_pages() to only store the
> parameters in it_reserved_{start,end} if they are in the range, or store
> it_offset in both of them if they are not.
>
> What do you think?

This should work, yes.
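
Something like this inside iommu_table_reserve_pages(), I guess
(untested sketch, assuming the usual res_start/res_end parameters):

	unsigned long end = tbl->it_offset + tbl->it_size;

	/* Keep the stored reserved area trustworthy for later users */
	if (res_start >= tbl->it_offset && res_start <= end &&
	    res_end >= tbl->it_offset && res_end <= end) {
		tbl->it_reserved_start = res_start;
		tbl->it_reserved_end = res_end;
	} else {
		/* No valid reserved area: collapse it to an empty range */
		tbl->it_reserved_start = tbl->it_offset;
		tbl->it_reserved_end = tbl->it_offset;
	}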


>
> Thanks for the feedback!
> Leonardo Bras
>
>
>

--
Alexey

2020-08-28 02:02:19

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 06/10] powerpc/pseries/iommu: Add ddw_list_add() helper



On 28/08/2020 08:11, Leonardo Bras wrote:
> On Mon, 2020-08-24 at 13:46 +1000, Alexey Kardashevskiy wrote:
>>> static int find_existing_ddw_windows(void)
>>> {
>>> int len;
>>> @@ -887,18 +905,11 @@ static int find_existing_ddw_windows(void)
>>> if (!direct64)
>>> continue;
>>>
>>> - window = kzalloc(sizeof(*window), GFP_KERNEL);
>>> - if (!window || len < sizeof(struct dynamic_dma_window_prop)) {
>>> + window = ddw_list_add(pdn, direct64);
>>> + if (!window || len < sizeof(*direct64)) {
>>
>> Since you are touching this code, it looks like the "len <
>> sizeof(*direct64)" part should go above to "if (!direct64)".
>
> Sure, makes sense.
> It will be fixed for v2.
>
>>
>>
>>
>>> kfree(window);
>>> remove_ddw(pdn, true);
>>> - continue;
>>> }
>>> -
>>> - window->device = pdn;
>>> - window->prop = direct64;
>>> - spin_lock(&direct_window_list_lock);
>>> - list_add(&window->list, &direct_window_list);
>>> - spin_unlock(&direct_window_list_lock);
>>> }
>>>
>>> return 0;
>>> @@ -1261,7 +1272,8 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>> dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
>>> create.liobn, dn);
>>>
>>> - window = kzalloc(sizeof(*window), GFP_KERNEL);
>>> + /* Add new window to existing DDW list */
>>
>> The comment seems to duplicate what the ddw_list_add name already suggests.
>
> Ok, I will remove it then.
>
>>> + window = ddw_list_add(pdn, ddwprop);
>>> if (!window)
>>> goto out_clear_window;
>>>
>>> @@ -1280,16 +1292,14 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>> goto out_free_window;
>>> }
>>>
>>> - window->device = pdn;
>>> - window->prop = ddwprop;
>>> - spin_lock(&direct_window_list_lock);
>>> - list_add(&window->list, &direct_window_list);
>>> - spin_unlock(&direct_window_list_lock);
>>
>> I'd leave these 3 lines here and in find_existing_ddw_windows() (which
>> would make ddw_list_add -> ddw_prop_alloc). In general you want to have
>> less stuff to do on the failure path. kmalloc may fail and needs kfree
>> but you can safely delay list_add (which cannot fail) and avoid having
>> the lock held twice in the same function (one of them is hidden inside
>> ddw_list_add).
>> Not sure if this change is really needed after all. Thanks,
>
> I understand this leads to better performance in case anything fails.
> Also, I think list_add happening in the end is less error-prone (in
> case the list is checked between list_add and a fail).

Performance was not in my mind at all.

I noticed you remove from a list with the lock held, which was not there
before, and there is a bunch of labels on the exit path, so I started
looking at list_add() and whether you double remove from the list.


> But what if we put it at the end?
> What is the chance of a kzalloc of 4 pointers (struct direct_window)
> failing after walk_system_ram_range?

This is not about chances really, it is about readability. If, let's say,
kmalloc failed, you just go to the error exit label and simply call kfree()
on that pointer; kfree() will do nothing if it is NULL already, simple.
list_del() does not have this simplicity.


> Is it not worthy doing that for making enable_ddw() easier to
> understand?

This is my goal here :)


>
> Best regards,
> Leonardo
>

--
Alexey

2020-08-28 02:30:57

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift



On 28/08/2020 01:32, Leonardo Bras wrote:
> Hello Alexey, thank you for this feedback!
>
> On Sat, 2020-08-22 at 19:33 +1000, Alexey Kardashevskiy wrote:
>>> +#define TCE_RPN_BITS 52 /* Bits 0-51 represent RPN on TCE */
>>
>> Ditch this one and use MAX_PHYSMEM_BITS instead? I am pretty sure this
>> is the actual limit.
>
> I understand this MAX_PHYSMEM_BITS(51) comes from the maximum physical memory addressable in the machine. IIUC, it means we can access physical address up to (1ul << MAX_PHYSMEM_BITS).
>
> This 52 comes from PAPR "Table 9. TCE Definition" which defines bits
> 0-51 as the RPN. By looking at code, I understand that it means we may input any address < (1ul << 52) to TCE.
>
> In practice, MAX_PHYSMEM_BITS should be enough as of today, because I suppose we can't ever pass a physical page address over
> (1ul << 51), and TCE accepts up to (1ul << 52).
> But if we ever increase MAX_PHYSMEM_BITS, it doesn't necessarily means that TCE_RPN_BITS will also be increased, so I think they are independent values.
>
> Does it make sense? Please let me know if I am missing something.

The underlying hardware is PHB3/4 about which the IODA2 Version 2.4
6Apr2012.pdf spec says:

"The number of most significant RPN bits implemented in the TCE is
dependent on the max size of System Memory to be supported by the platform".

IODA3 is the same on this matter.

This is MAX_PHYSMEM_BITS and PHB itself does not have any other limits
on top of that. So the only real limit comes from MAX_PHYSMEM_BITS and
where TCE_RPN_BITS comes from exactly - I have no idea.


>
>>
>>
>>> +#define TCE_RPN_MASK(ps) ((1ul << (TCE_RPN_BITS - (ps))) - 1)
>>> #define TCE_VALID 0x800 /* TCE valid */
>>> #define TCE_ALLIO 0x400 /* TCE valid for all lpars */
>>> #define TCE_PCI_WRITE 0x2 /* write from PCI allowed */
>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>>> index e4198700ed1a..8fe23b7dff3a 100644
>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>> @@ -107,6 +107,9 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
>>> u64 proto_tce;
>>> __be64 *tcep;
>>> u64 rpn;
>>> + const unsigned long tceshift = tbl->it_page_shift;
>>> + const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);
>>> + const u64 rpn_mask = TCE_RPN_MASK(tceshift);
>>
>> Using IOMMU_PAGE_SIZE macro for the page size and not using
>> IOMMU_PAGE_MASK for the mask - this inconsistency makes my small brain
>> explode :) I understand the history but maaaaan... Oh well, ok.
>>
>
> Yeah, it feels kind of weird after two IOMMU related consts. :)
> But sure IOMMU_PAGE_MASK() would not be useful here :)
>
> And this kind of let me thinking:
>>> + rpn = __pa(uaddr) >> tceshift;
>>> + *tcep = cpu_to_be64(proto_tce | (rpn & rpn_mask) << tceshift);
> Why not:
> rpn_mask = TCE_RPN_MASK(tceshift) << tceshift;


A mask for a page number (but not the address!) hurts my brain; masks
are good against addresses, but numbers should already have all bits
adjusted imho, maybe it is just me :-/


>
> rpn = __pa(uaddr) & rpn_mask;
> *tcep = cpu_to_be64(proto_tce | rpn)
>
> I am usually afraid of changing stuff like this, but I think it's safe.
>
>> Good, otherwise. Thanks,
>
> Thank you for reviewing!
>
>
>

--
Alexey

2020-08-28 14:06:37

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v1 07/10] powerpc/pseries/iommu: Allow DDW windows starting at 0x00

On Mon, 2020-08-24 at 13:44 +1000, Alexey Kardashevskiy wrote:
>
> > On 18/08/2020 09:40, Leonardo Bras wrote:
> > enable_ddw() currently returns the address of the DMA window, which is
> > considered invalid if has the value 0x00.
> >
> > Also, it only considers valid an address returned from find_existing_ddw
> > if it's not 0x00.
> >
> > Changing this behavior makes sense, given the users of enable_ddw() only
> > need to know if direct mapping is possible. It can also allow a DMA window
> > starting at 0x00 to be used.
> >
> > This will be helpful for using a DDW with indirect mapping, as the window
> > address will be different than 0x00, but it will not map the whole
> > partition.
> >
> > Signed-off-by: Leonardo Bras <[email protected]>
> > ---
> > arch/powerpc/platforms/pseries/iommu.c | 30 ++++++++++++--------------
> > 1 file changed, 14 insertions(+), 16 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > index fcdefcc0f365..4031127c9537 100644
> > --- a/arch/powerpc/platforms/pseries/iommu.c
> > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > @@ -852,24 +852,25 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
> > np, ret);
> > }
> > >
> > -static u64 find_existing_ddw(struct device_node *pdn)
> > +static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
> > {
> > struct direct_window *window;
> > const struct dynamic_dma_window_prop *direct64;
> > - u64 dma_addr = 0;
> > + bool found = false;
> >
> > spin_lock(&direct_window_list_lock);
> > /* check if we already created a window and dupe that config if so */
> > list_for_each_entry(window, &direct_window_list, list) {
> > if (window->device == pdn) {
> > direct64 = window->prop;
> > - dma_addr = be64_to_cpu(direct64->dma_base);
> > + *dma_addr = be64_to_cpu(direct64->dma_base);
> > + found = true;
> > break;
> > }
> > }
> > spin_unlock(&direct_window_list_lock);
> >
> > - return dma_addr;
> > + return found;
> > }
> >
> > static struct direct_window *ddw_list_add(struct device_node *pdn,
> > @@ -1131,15 +1132,15 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
> > * pdn: the parent pe node with the ibm,dma_window property
> > * Future: also check if we can remap the base window for our base page size
> > *
> > - * returns the dma offset for use by the direct mapped DMA code.
> > + * returns true if can map all pages (direct mapping), false otherwise..
> > */
> > -static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > +static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > {
> > int len, ret;
> > struct ddw_query_response query;
> > struct ddw_create_response create;
> > int page_shift;
> > - u64 dma_addr, max_addr;
> > + u64 max_addr;
> > struct device_node *dn;
> > u32 ddw_avail[DDW_APPLICABLE_SIZE];
> > struct direct_window *window;
> > @@ -1150,8 +1151,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> >
> > mutex_lock(&direct_window_init_mutex);
> >
> > - dma_addr = find_existing_ddw(pdn);
> > - if (dma_addr != 0)
> > + if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset))
> > goto out_unlock;
> >
> > /*
> > @@ -1292,7 +1292,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > goto out_free_window;
> > }
> >
> > - dma_addr = be64_to_cpu(ddwprop->dma_base);
> > + dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);
>
> Do not you need the same chunk in the find_existing_ddw() case above as
> well? Thanks,

The new signature of find_existing_ddw() is
static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)

And on enable_ddw(), we call
find_existing_ddw(pdn, &dev->dev.archdata.dma_offset)

And inside the function we do:
*dma_addr = be64_to_cpu(direct64->dma_base);

I think it's the same as the chunk before.
Am I missing something?

>
>
> > goto out_unlock;
> >
> > out_free_window:
> > @@ -1309,6 +1309,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > kfree(win64->name);
> > kfree(win64->value);
> > kfree(win64);
> > + win64 = NULL;
> >
> > out_failed:
> > if (default_win_removed)
> > @@ -1322,7 +1323,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> >
> > out_unlock:
> > mutex_unlock(&direct_window_init_mutex);
> > - return dma_addr;
> > + return win64;
> > }
> >
> > static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
> > @@ -1401,11 +1402,8 @@ static bool iommu_bypass_supported_pSeriesLP(struct pci_dev *pdev, u64 dma_mask)
> > break;
> > }
> >
> > - if (pdn && PCI_DN(pdn)) {
> > - pdev->dev.archdata.dma_offset = enable_ddw(pdev, pdn);
> > - if (pdev->dev.archdata.dma_offset)
> > - return true;
> > - }
> > + if (pdn && PCI_DN(pdn))
> > + return enable_ddw(pdev, pdn);
> >
> > return false;
> > }
> >

2020-08-28 15:27:05

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v1 08/10] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()

On Mon, 2020-08-24 at 15:07 +1000, Alexey Kardashevskiy wrote:
>
> On 18/08/2020 09:40, Leonardo Bras wrote:
> > Code used to create a ddw property that was previously scattered in
> > enable_ddw() is now gathered in ddw_property_create(), which deals with
> > allocation and filling the property, letting it ready for
> > of_property_add(), which now occurs in sequence.
> >
> > This created an opportunity to reorganize the second part of enable_ddw():
> >
> > Without this patch enable_ddw() does, in order:
> > kzalloc() property & members, create_ddw(), fill ddwprop inside property,
> > ddw_list_add(), do tce_setrange_multi_pSeriesLP_walk in all memory,
> > of_add_property().
> >
> > With this patch enable_ddw() does, in order:
> > create_ddw(), ddw_property_create(), of_add_property(), ddw_list_add(),
> > do tce_setrange_multi_pSeriesLP_walk in all memory.
> >
> > This change requires of_remove_property() in case anything fails after
> > of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk
> > in all memory, which looks the most expensive operation, only if
> > everything else succeeds.
> >
> > Signed-off-by: Leonardo Bras <[email protected]>
> > ---
> > arch/powerpc/platforms/pseries/iommu.c | 97 +++++++++++++++-----------
> > 1 file changed, 57 insertions(+), 40 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > index 4031127c9537..3a1ef02ad9d5 100644
> > --- a/arch/powerpc/platforms/pseries/iommu.c
> > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > @@ -1123,6 +1123,31 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
> > ret);
> > }
> >
> > +static int ddw_property_create(struct property **ddw_win, const char *propname,
>
> @propname is always the same, do you really want to pass it every time?

I think it reads better, like "create a ddw property with this name".
Also, it makes it possible to create ddw properties with other names, in
case we decide to create properties with different names depending on
the window created.

Also, it's probably optimized / inlined at this point.
Is it ok doing it like this?

>
> > + u32 liobn, u64 dma_addr, u32 page_shift, u32 window_shift)
> > +{
> > + struct dynamic_dma_window_prop *ddwprop;
> > + struct property *win64;
> > +
> > + *ddw_win = win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
> > + if (!win64)
> > + return -ENOMEM;
> > +
> > + win64->name = kstrdup(propname, GFP_KERNEL);
>
> Not clear why "win64->name = DIRECT64_PROPNAME" would not work here, the
> generic OF code does not try kfree() it but it is probably out of scope
> here.

Yeah, I had that question too.
The previous code was like that, and I was trying not to mess too much
with how it's done.

> > + ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
> > + win64->value = ddwprop;
> > + win64->length = sizeof(*ddwprop);
> > + if (!win64->name || !win64->value)
> > + return -ENOMEM;
>
> Up to 2 memory leaks here. I see the cleanup at "out_free_prop:" but
> still looks fragile. Instead you could simply return win64 as the only
> error possible here is -ENOMEM and returning NULL is equally good.

I agree. It's better if this function has its own cleanup routine.
It will be fixed in the next version.
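
Something along these lines, maybe (untested sketch of what I have in
mind for v2):

static struct property *ddw_property_create(const char *propname, u32 liobn,
					    u64 dma_addr, u32 page_shift,
					    u32 window_shift)
{
	struct dynamic_dma_window_prop *ddwprop;
	struct property *win64;

	win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
	if (!win64)
		return NULL;

	win64->name = kstrdup(propname, GFP_KERNEL);
	ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
	win64->value = ddwprop;
	win64->length = sizeof(*ddwprop);
	if (!win64->name || !ddwprop) {
		/* clean up here instead of leaving it to the callers */
		kfree(win64->name);
		kfree(ddwprop);
		kfree(win64);
		return NULL;
	}

	ddwprop->liobn = cpu_to_be32(liobn);
	ddwprop->dma_base = cpu_to_be64(dma_addr);
	ddwprop->tce_shift = cpu_to_be32(page_shift);
	ddwprop->window_shift = cpu_to_be32(window_shift);

	return win64;
}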

>
>
> > +
> > + ddwprop->liobn = cpu_to_be32(liobn);
> > + ddwprop->dma_base = cpu_to_be64(dma_addr);
> > + ddwprop->tce_shift = cpu_to_be32(page_shift);
> > + ddwprop->window_shift = cpu_to_be32(window_shift);
> > +
> > + return 0;
> > +}
> > +
> > /*
> > * If the PE supports dynamic dma windows, and there is space for a table
> > * that can map all pages in a linear offset, then setup such a table,
> > @@ -1140,12 +1165,11 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > struct ddw_query_response query;
> > struct ddw_create_response create;
> > int page_shift;
> > - u64 max_addr;
> > + u64 max_addr, win_addr;
> > struct device_node *dn;
> > u32 ddw_avail[DDW_APPLICABLE_SIZE];
> > struct direct_window *window;
> > - struct property *win64;
> > - struct dynamic_dma_window_prop *ddwprop;
> > + struct property *win64 = NULL;
> > struct failed_ddw_pdn *fpdn;
> > bool default_win_removed = false;
> >
> > @@ -1244,38 +1268,34 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > goto out_failed;
> > }
> > len = order_base_2(max_addr);
> > - win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
> > - if (!win64) {
> > - dev_info(&dev->dev,
> > - "couldn't allocate property for 64bit dma window\n");
> > +
> > + ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
> > + if (ret != 0)
>
> It is usually just "if (ret)"

It was previously like that, and all query_ddw() calls check the return
value this way. Should I update them all or just this one?

Thanks!

>
>
> > goto out_failed;
> > - }
> > - win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL);
> > - win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL);
> > - win64->length = sizeof(*ddwprop);
> > - if (!win64->name || !win64->value) {
> > +
> > + dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
> > + create.liobn, dn);
> > +
> > + win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
> > + ret = ddw_property_create(&win64, DIRECT64_PROPNAME, create.liobn, win_addr,
> > + page_shift, len);
> > + if (ret) {
> > dev_info(&dev->dev,
> > - "couldn't allocate property name and value\n");
> > + "couldn't allocate property, property name, or value\n");
> > goto out_free_prop;
> > }
> >
> > - ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
> > - if (ret != 0)
> > + ret = of_add_property(pdn, win64);
> > + if (ret) {
> > + dev_err(&dev->dev, "unable to add dma window property for %pOF: %d",
> > + pdn, ret);
> > goto out_free_prop;
> > -
> > - ddwprop->liobn = cpu_to_be32(create.liobn);
> > - ddwprop->dma_base = cpu_to_be64(((u64)create.addr_hi << 32) |
> > - create.addr_lo);
> > - ddwprop->tce_shift = cpu_to_be32(page_shift);
> > - ddwprop->window_shift = cpu_to_be32(len);
> > -
> > - dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
> > - create.liobn, dn);
> > + }
> >
> > /* Add new window to existing DDW list */
> > - window = ddw_list_add(pdn, ddwprop);
> > + window = ddw_list_add(pdn, win64->value);
> > if (!window)
> > - goto out_clear_window;
> > + goto out_prop_del;
> >
> > ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> > win64->value, tce_setrange_multi_pSeriesLP_walk);
> > @@ -1285,14 +1305,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > goto out_free_window;
> > }
> >
> > - ret = of_add_property(pdn, win64);
> > - if (ret) {
> > - dev_err(&dev->dev, "unable to add dma window property for %pOF: %d",
> > - pdn, ret);
> > - goto out_free_window;
> > - }
> > -
> > - dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);
> > + dev->dev.archdata.dma_offset = win_addr;
> > goto out_unlock;
> >
> > out_free_window:
> > @@ -1302,14 +1315,18 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> >
> > kfree(window);
> >
> > -out_clear_window:
> > - remove_ddw(pdn, true);
> > +out_prop_del:
> > + of_remove_property(pdn, win64);
> >
> > out_free_prop:
> > - kfree(win64->name);
> > - kfree(win64->value);
> > - kfree(win64);
> > - win64 = NULL;
> > + if (win64) {
> > + kfree(win64->name);
> > + kfree(win64->value);
> > + kfree(win64);
> > + win64 = NULL;
> > + }
> > +
> > + remove_ddw(pdn, true);
> >
> > out_failed:
> > if (default_win_removed)
> >

2020-08-28 18:38:07

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v1 09/10] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition

On Mon, 2020-08-24 at 15:17 +1000, Alexey Kardashevskiy wrote:
>
> On 18/08/2020 09:40, Leonardo Bras wrote:
> > As of today, if the biggest DDW that can be created can't map the whole
> > partition, it's creation is skipped and the default DMA window
> > "ibm,dma-window" is used instead.
> >
> > DDW is 16x bigger than the default DMA window,
>
> 16x only under very specific circumstances which are
> 1. phyp
> 2. sriov
> 3. device class in hmc (or what that priority number is in the lpar config).

Yeah, missing details.

> > having the same amount of
> > pages, but increasing the page size to 64k.
> > Besides larger DMA window,
>
> "Besides being larger"?

You are right there.

>
> > it performs better for allocations over 4k,
>
> Better how?

I was thinking of allocations larger than (512 * 4k), since more than one
hypercall is needed there, while for 64k pages it would still be just one
hypercall for anything up to (512 * 64k).
But yeah, not the usual case anyway.
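
Just to spell out what I mean (assuming the batched TCE hypercall takes
up to 512 TCEs per call, which is where the 512 above comes from):

	4k  IOMMU pages: 512 * 4k  = 2MB  mapped per hypercall
	64k IOMMU pages: 512 * 64k = 32MB mapped per hypercall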

>
> > so it would be nice to use it instead.
>
> I'd rather say something like:
> ===
> So far we assumed we can map the guest RAM 1:1 to the bus which worked
> with a small number of devices. SRIOV changes it as the user can
> configure hundreds VFs and since phyp preallocates TCEs and does not
> allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
> per a PE to limit waste of physical pages.
> ===

I mixed this into my commit message; it looks like this:

===
powerpc/pseries/iommu: Make use of DDW for indirect mapping

So far it's assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes it as the user can
configure hundreds of VFs, and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per PE to limit waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW
creation is skipped and the default DMA window "ibm,dma-window" is used
instead.

The default DMA window uses 4k pages instead of 64k pages, and since
the amount of pages is the same, making use of DDW instead of the
default DMA window for indirect mapping will expand by 16x the amount
of memory that can be mapped for DMA.

The DDW created will be used for direct mapping by default. [...]
===

What do you think?

> > The DDW created will be used for direct mapping by default.
> > If it's not available, indirect mapping will be used instead.
> >
> > For indirect mapping, it's necessary to update the iommu_table so
> > iommu_alloc() can use the DDW created. For this,
> > iommu_table_update_window() is called when everything else succeeds
> > at enable_ddw().
> >
> > Removing the default DMA window for using DDW with indirect mapping
> > is only allowed if there is no current IOMMU memory allocated in
> > the iommu_table. enable_ddw() is aborted otherwise.
> >
> > As there will never have both direct and indirect mappings at the same
> > time, the same property name can be used for the created DDW.
> >
> > So renaming
> > define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
> > to
> > define DMA64_PROPNAME "linux,dma64-ddr-window-info"
> > looks the right thing to do.
>
> I know I suggested this but this does not look so good anymore as I
> suspect it breaks kexec (from older kernel to this one) so you either
> need to check for both DT names or just keep the old one. Changing the
> macro name is fine.
>

Yeah, having 'direct' in the name doesn't really make sense if it's used
for indirect mapping. I will just add this new define instead of
replacing the old one, and check for both.
Is that ok?
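
I mean something like this in remove_ddw() (just a sketch), and the
equivalent check for both names in find_existing_ddw_windows():

	win = of_find_property(np, DMA64_PROPNAME, NULL);
	if (!win)
		win = of_find_property(np, DIRECT64_PROPNAME, NULL);
	if (!win)
		return;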

>
> > To make sure the property differentiates both cases, a new u32 for flags
> > was added at the end of the property, where BIT(0) set means direct
> > mapping.
> >
> > Signed-off-by: Leonardo Bras <[email protected]>
> > ---
> > arch/powerpc/platforms/pseries/iommu.c | 108 +++++++++++++++++++------
> > 1 file changed, 84 insertions(+), 24 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > index 3a1ef02ad9d5..9544e3c91ced 100644
> > --- a/arch/powerpc/platforms/pseries/iommu.c
> > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > @@ -350,8 +350,11 @@ struct dynamic_dma_window_prop {
> > __be64 dma_base; /* address hi,lo */
> > __be32 tce_shift; /* ilog2(tce_page_size) */
> > __be32 window_shift; /* ilog2(tce_window_size) */
> > + __be32 flags; /* DDW properties, see bellow */
> > };
> >
> > +#define DDW_FLAGS_DIRECT 0x01
>
> This is set if ((1<<window_shift) >= ddw_memory_hotplug_max()), you
> could simply check window_shift and drop the flags.
>

Yeah, it's better this way, I will revert this.
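
find_existing_ddw() can then derive it from the property instead of a
flag, e.g. (sketch):

	*direct_mapping = ((1ULL << be32_to_cpu(direct64->window_shift)) >=
			   ddw_memory_hotplug_max());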

>
> > +
> > struct direct_window {
> > struct device_node *device;
> > const struct dynamic_dma_window_prop *prop;
> > @@ -377,7 +380,7 @@ static LIST_HEAD(direct_window_list);
> > static DEFINE_SPINLOCK(direct_window_list_lock);
> > /* protects initializing window twice for same device */
> > static DEFINE_MUTEX(direct_window_init_mutex);
> > -#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
> > +#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
> >
> > static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
> > unsigned long num_pfn, const void *arg)
> > @@ -836,7 +839,7 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
> > if (ret)
> > return;
> >
> > - win = of_find_property(np, DIRECT64_PROPNAME, NULL);
> > + win = of_find_property(np, DMA64_PROPNAME, NULL);
> > if (!win)
> > return;
> >
> > @@ -852,7 +855,7 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
> > np, ret);
> > }
> >
> > -static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
> > +static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, bool *direct_mapping)
> > {
> > struct direct_window *window;
> > const struct dynamic_dma_window_prop *direct64;
> > @@ -864,6 +867,7 @@ static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
> > if (window->device == pdn) {
> > direct64 = window->prop;
> > *dma_addr = be64_to_cpu(direct64->dma_base);
> > + *direct_mapping = be32_to_cpu(direct64->flags) & DDW_FLAGS_DIRECT;
> > found = true;
> > break;
> > }
> > @@ -901,8 +905,8 @@ static int find_existing_ddw_windows(void)
> > if (!firmware_has_feature(FW_FEATURE_LPAR))
> > return 0;
> >
> > - for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
> > - direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
> > + for_each_node_with_property(pdn, DMA64_PROPNAME) {
> > + direct64 = of_get_property(pdn, DMA64_PROPNAME, &len);
> > if (!direct64)
> > continue;
> >
> > @@ -1124,7 +1128,8 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
> > }
> >
> > static int ddw_property_create(struct property **ddw_win, const char *propname,
> > - u32 liobn, u64 dma_addr, u32 page_shift, u32 window_shift)
> > + u32 liobn, u64 dma_addr, u32 page_shift,
> > + u32 window_shift, bool direct_mapping)
> > {
> > struct dynamic_dma_window_prop *ddwprop;
> > struct property *win64;
> > @@ -1144,6 +1149,36 @@ static int ddw_property_create(struct property **ddw_win, const char *propname,
> > ddwprop->dma_base = cpu_to_be64(dma_addr);
> > ddwprop->tce_shift = cpu_to_be32(page_shift);
> > ddwprop->window_shift = cpu_to_be32(window_shift);
> > + if (direct_mapping)
> > + ddwprop->flags = cpu_to_be32(DDW_FLAGS_DIRECT);
> > +
> > + return 0;
> > +}
> > +
> > +static int iommu_table_update_window(struct iommu_table **tbl, int nid, unsigned long liobn,
> > + unsigned long win_addr, unsigned long page_shift,
> > + unsigned long window_size)
>
> Rather strange helper imho. I'd extract the most of
> iommu_table_setparms_lpar() into iommu_table_setparms() (except
> of_parse_dma_window) and call new helper from where you call
> iommu_table_update_window; and do
> iommu_pseries_alloc_table/iommu_tce_table_put there.
>

I don't see how to extract iommu_table_setparms_lpar() into
iommu_table_setparms(); they look to be used for different machine
types.

Do you mean extracting most of iommu_table_setparms_lpar() (and maybe
iommu_table_setparms()) into a new helper that is called in both
functions, and using it instead of iommu_table_update_window()?

>
> > +{
> > + struct iommu_table *new_tbl, *old_tbl;
> > +
> > + new_tbl = iommu_pseries_alloc_table(nid);
> > + if (!new_tbl)
> > + return -ENOMEM;
> > +
> > + old_tbl = *tbl;
> > + new_tbl->it_index = liobn;
> > + new_tbl->it_offset = win_addr >> page_shift;
> > + new_tbl->it_page_shift = page_shift;
> > + new_tbl->it_size = window_size >> page_shift;
> > + new_tbl->it_base = old_tbl->it_base;
>
> Should not be used in pseries.
>

The point here is to migrate the values from the old tbl to the new
one. I would like to understand why this is bad, if it will remain just
as 'unused' as in the old tbl.

>
> > + new_tbl->it_busno = old_tbl->it_busno;
> > + new_tbl->it_blocksize = old_tbl->it_blocksize;
>
> 16 for pseries and does not change (may be even make it a macro).
>
> > + new_tbl->it_type = old_tbl->it_type;
>
> TCE_PCI.
>

Same as above.

>
> > + new_tbl->it_ops = old_tbl->it_ops;
> > +
> > + iommu_init_table(new_tbl, nid, old_tbl->it_reserved_start, old_tbl->it_reserved_end);
> > + iommu_tce_table_put(old_tbl);
> > + *tbl = new_tbl;
> >
> > return 0;
> > }
> > @@ -1171,12 +1206,16 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > struct direct_window *window;
> > struct property *win64 = NULL;
> > struct failed_ddw_pdn *fpdn;
> > - bool default_win_removed = false;
> > + bool default_win_removed = false, maps_whole_partition = false;
>
> s/maps_whole_partition/direct_mapping/
>

Sure, I will get it replaced.

>
> > + struct pci_dn *pci = PCI_DN(pdn);
> > + struct iommu_table *tbl = pci->table_group->tables[0];
> >
> > mutex_lock(&direct_window_init_mutex);
> >
> > - if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset))
> > - goto out_unlock;
> > + if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &maps_whole_partition)) {
> > + mutex_unlock(&direct_window_init_mutex);
> > + return maps_whole_partition;
> > + }
> >
> > /*
> > * If we already went through this for a previous function of
> > @@ -1258,16 +1297,24 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > query.page_size);
> > goto out_failed;
> > }
> > +
> > /* verify the window * number of ptes will map the partition */
> > - /* check largest block * page size > max memory hotplug addr */
> > max_addr = ddw_memory_hotplug_max();
> > if (query.largest_available_block < (max_addr >> page_shift)) {
> > - dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
> > - "%llu-sized pages\n", max_addr, query.largest_available_block,
> > - 1ULL << page_shift);
> > - goto out_failed;
> > + dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu %llu-sized pages\n",
> > + max_addr, query.largest_available_block,
> > + 1ULL << page_shift);
> > +
> > + len = order_base_2(query.largest_available_block << page_shift);
> > + } else {
> > + maps_whole_partition = true;
> > + len = order_base_2(max_addr);
> > }
> > - len = order_base_2(max_addr);
> > +
> > + /* DDW + IOMMU on single window may fail if there is any allocation */
> > + if (default_win_removed && !maps_whole_partition &&
> > + iommu_table_in_use(tbl))
> > + goto out_failed;
> >
> > ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
> > if (ret != 0)
> > @@ -1277,8 +1324,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > create.liobn, dn);
> >
> > win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
> > - ret = ddw_property_create(&win64, DIRECT64_PROPNAME, create.liobn, win_addr,
> > - page_shift, len);
> > + ret = ddw_property_create(&win64, DMA64_PROPNAME, create.liobn, win_addr,
> > + page_shift, len, maps_whole_partition);
> > if (ret) {
> > dev_info(&dev->dev,
> > "couldn't allocate property, property name, or value\n");
> > @@ -1297,12 +1344,25 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > if (!window)
> > goto out_prop_del;
> >
> > - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> > - win64->value, tce_setrange_multi_pSeriesLP_walk);
> > - if (ret) {
> > - dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
> > - dn, ret);
> > - goto out_free_window;
> > + if (maps_whole_partition) {
> > + /* DDW maps the whole partition, so enable direct DMA mapping */
> > + ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> > + win64->value, tce_setrange_multi_pSeriesLP_walk);
> > + if (ret) {
> > + dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
> > + dn, ret);
> > + goto out_free_window;
> > + }
> > + } else {
> > + /* New table for using DDW instead of the default DMA window */
> > + if (iommu_table_update_window(&tbl, pci->phb->node, create.liobn,
> > + win_addr, page_shift, 1UL << len))
> > + goto out_free_window;
> > +
> > + set_iommu_table_base(&dev->dev, tbl);
> > + WARN_ON(dev->dev.archdata.dma_offset >= SZ_4G);
>
> What is this check for exactly? Why 4G, not >= 0, for example?

I am not really sure; you suggested adding it here:
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/[email protected]/#2488874

I can remove it if it's ok.

>
> > + goto out_unlock;
> > +
> > }
> >
> > dev->dev.archdata.dma_offset = win_addr;
> > @@ -1340,7 +1400,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> >
> > out_unlock:
> > mutex_unlock(&direct_window_init_mutex);
> > - return win64;
> > + return win64 && maps_whole_partition;
> > }
> >
> > static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
> >

2020-08-28 19:59:41

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift

On Fri, 2020-08-28 at 12:27 +1000, Alexey Kardashevskiy wrote:
>
> On 28/08/2020 01:32, Leonardo Bras wrote:
> > Hello Alexey, thank you for this feedback!
> >
> > On Sat, 2020-08-22 at 19:33 +1000, Alexey Kardashevskiy wrote:
> > > > +#define TCE_RPN_BITS 52 /* Bits 0-51 represent RPN on TCE */
> > >
> > > Ditch this one and use MAX_PHYSMEM_BITS instead? I am pretty sure this
> > > is the actual limit.
> >
> > I understand this MAX_PHYSMEM_BITS(51) comes from the maximum physical memory addressable in the machine. IIUC, it means we can access physical address up to (1ul << MAX_PHYSMEM_BITS).
> >
> > This 52 comes from PAPR "Table 9. TCE Definition" which defines bits
> > 0-51 as the RPN. By looking at code, I understand that it means we may input any address < (1ul << 52) to TCE.
> >
> > In practice, MAX_PHYSMEM_BITS should be enough as of today, because I suppose we can't ever pass a physical page address over
> > (1ul << 51), and TCE accepts up to (1ul << 52).
> > But if we ever increase MAX_PHYSMEM_BITS, it doesn't necessarily means that TCE_RPN_BITS will also be increased, so I think they are independent values.
> >
> > Does it make sense? Please let me know if I am missing something.
>
> The underlying hardware is PHB3/4 about which the IODA2 Version 2.4
> 6Apr2012.pdf spec says:
>
> "The number of most significant RPN bits implemented in the TCE is
> dependent on the max size of System Memory to be supported by the platform".
>
> IODA3 is the same on this matter.
>
> This is MAX_PHYSMEM_BITS and PHB itself does not have any other limits
> on top of that. So the only real limit comes from MAX_PHYSMEM_BITS and
> where TCE_RPN_BITS comes from exactly - I have no idea.

Well, I created TCE_RPN_BITS = 52 because the previous mask was a
hardcoded 40-bit mask (0xfffffffffful) for a hard-coded 12-bit (4k)
page size, and PAPR+/LoPAR also defines the TCE as having bits 0-51
described as the RPN, as mentioned before.

IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figures 3.4 and 3.5,
shows system memory mapping into a TCE, and the TCE also has bits 0-51
for the RPN (52 bits). "Table 3.6. TCE Definition" also shows it.

In fact, by the looks of those figures, the RPN_MASK should always be a
52-bit mask, and RPN = (page >> tceshift) & RPN_MASK.

Maybe that's it?
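
I.e., something like (just a sketch):

	#define TCE_RPN_MASK	((1ul << 52) - 1)	/* RPN is bits 0-51 */

	rpn = (__pa(uaddr) >> tceshift) & TCE_RPN_MASK;
	*tcep = cpu_to_be64(proto_tce | (rpn << tceshift));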

>
>
> > >
> > > > +#define TCE_RPN_MASK(ps) ((1ul << (TCE_RPN_BITS - (ps))) - 1)
> > > > #define TCE_VALID 0x800 /* TCE valid */
> > > > #define TCE_ALLIO 0x400 /* TCE valid for all lpars */
> > > > #define TCE_PCI_WRITE 0x2 /* write from PCI allowed */
> > > > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > > > index e4198700ed1a..8fe23b7dff3a 100644
> > > > --- a/arch/powerpc/platforms/pseries/iommu.c
> > > > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > > > @@ -107,6 +107,9 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
> > > > u64 proto_tce;
> > > > __be64 *tcep;
> > > > u64 rpn;
> > > > + const unsigned long tceshift = tbl->it_page_shift;
> > > > + const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);
> > > > + const u64 rpn_mask = TCE_RPN_MASK(tceshift);
> > >
> > > Using IOMMU_PAGE_SIZE macro for the page size and not using
> > > IOMMU_PAGE_MASK for the mask - this inconsistency makes my small brain
> > > explode :) I understand the history but maaaaan... Oh well, ok.
> > >
> >
> > Yeah, it feels kind of weird after two IOMMU related consts. :)
> > But sure IOMMU_PAGE_MASK() would not be useful here :)
> >
> > And this kind of let me thinking:
> > > > + rpn = __pa(uaddr) >> tceshift;
> > > > + *tcep = cpu_to_be64(proto_tce | (rpn & rpn_mask) << tceshift);
> > Why not:
> > rpn_mask = TCE_RPN_MASK(tceshift) << tceshift;
>
> A mask for a page number (but not the address!) hurts my brain, masks
> are good against addresses but numbers should already have all bits
> adjusted imho, may be it is just me :-/
>
>
> >
> > rpn = __pa(uaddr) & rpn_mask;
> > *tcep = cpu_to_be64(proto_tce | rpn)
> >
> > I am usually afraid of changing stuff like this, but I think it's safe.
> >
> > > Good, otherwise. Thanks,
> >
> > Thank you for reviewing!
> >
> >
> >

2020-08-28 20:43:04

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()

On Fri, 2020-08-28 at 11:40 +1000, Alexey Kardashevskiy wrote:
> > I think it would be better to keep the code as much generic as possible
> > regarding page sizes.
>
> Then you need to test it. Does 4K guest even boot (it should but I would
> not bet much on it)?

Maybe testing with a 64k host pagesize and a 16MB IOMMU pagesize in qemu
would be enough. Is there any chance to get indirect mapping in qemu
like this (DDW but with a smaller DMA window available)?

> > > Because if we want the former (==support), then we'll have to align the
> > > size up to the bigger page size when allocating/zeroing system pages,
> > > etc.
> >
> > This part I don't understand. Why do we need to align everything to the
> > bigger pagesize?
> >
> > I mean, is not that enough that the range [ret, ret + size[ is both
> > allocated by mm and mapped on a iommu range?
> >
> > Suppose a iommu_alloc_coherent() of 16kB on PAGESIZE = 4k and
> > IOMMU_PAGE_SIZE() == 64k.
> > Why 4 * cpu_pages mapped by a 64k IOMMU page is not enough?
> > All the space the user asked for is allocated and mapped for DMA.
>
> The user asked to map 16K, the rest - 48K - is used for something else
> (may be even mapped to another device) but you are making all 64K
> accessible by the device which only should be able to access 16K.
>
> In practice, if this happens, H_PUT_TCE will simply fail.

I have noticed the mlx5 driver getting a few bytes in a buffer and using
iommu_map_page(). It maps a whole page even when the user wants only a
few bytes mapped, and the other bytes get used for something else, or
just mapped on another DMA page.
It seems to work fine.

>
>
> > > Bigger pages are not the case here as I understand it.
> >
> > I did not get this part, what do you mean?
>
> Possible IOMMU page sizes are 4K, 64K, 2M, 16M, 256M, 1GB, and the
> supported set of sizes is different for P8/P9 and type of IO (PHB,
> NVLink/CAPI).
>
>
> > > > Update those functions to guarantee alignment with requested size
> > > > using IOMMU_PAGE_ALIGN() before doing iommu_alloc() / iommu_free().
> > > >
> > > > Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
> > > > with IOMMU_PAGE_ALIGN(n, tbl), which seems easier to read.
> > > >
> > > > Signed-off-by: Leonardo Bras <[email protected]>
> > > > ---
> > > > arch/powerpc/kernel/iommu.c | 17 +++++++++--------
> > > > 1 file changed, 9 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> > > > index 9704f3f76e63..d7086087830f 100644
> > > > --- a/arch/powerpc/kernel/iommu.c
> > > > +++ b/arch/powerpc/kernel/iommu.c
> > > > @@ -237,10 +237,9 @@ static unsigned long iommu_range_alloc(struct device *dev,
> > > > }
> > > >
> > > > if (dev)
> > > > - boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> > > > - 1 << tbl->it_page_shift);
> > > > + boundary_size = IOMMU_PAGE_ALIGN(dma_get_seg_boundary(dev) + 1, tbl);
> > >
> > > Run checkpatch.pl, should complain about a long line.
> >
> > It's 86 columns long, which is less than the new limit of 100 columns
> > Linus announced a few weeks ago. checkpatch.pl was updated too:
> > https://www.phoronix.com/scan.php?page=news_item&px=Linux-Kernel-Deprecates-80-Col
>
> Yay finally :) Thanks,

:)

>
>
> > >
> > > > else
> > > > - boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
> > > > + boundary_size = IOMMU_PAGE_ALIGN(1UL << 32, tbl);
> > > > /* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
> > > >
> > > > n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset,
> > > > @@ -858,6 +857,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
> > > > unsigned int order;
> > > > unsigned int nio_pages, io_order;
> > > > struct page *page;
> > > > + size_t size_io = size;
> > > >
> > > > size = PAGE_ALIGN(size);
> > > > order = get_order(size);
> > > > @@ -884,8 +884,9 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
> > > > memset(ret, 0, size);
> > > >
> > > > /* Set up tces to cover the allocated range */
> > > > - nio_pages = size >> tbl->it_page_shift;
> > > > - io_order = get_iommu_order(size, tbl);
> > > > + size_io = IOMMU_PAGE_ALIGN(size_io, tbl);
> > > > + nio_pages = size_io >> tbl->it_page_shift;
> > > > + io_order = get_iommu_order(size_io, tbl);
> > > > mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
> > > > mask >> tbl->it_page_shift, io_order, 0);
> > > > if (mapping == DMA_MAPPING_ERROR) {
> > > > @@ -900,11 +901,11 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
> > > > void *vaddr, dma_addr_t dma_handle)
> > > > {
> > > > if (tbl) {
> > > > - unsigned int nio_pages;
> > > > + size_t size_io = IOMMU_PAGE_ALIGN(size, tbl);
> > > > + unsigned int nio_pages = size_io >> tbl->it_page_shift;
> > > >
> > > > - size = PAGE_ALIGN(size);
> > > > - nio_pages = size >> tbl->it_page_shift;
> > > > iommu_free(tbl, dma_handle, nio_pages);
> > > > +
> > >
> > > Unrelated new line.
> >
> > Will be removed. Thanks!
> >
> > >
> > > > size = PAGE_ALIGN(size);
> > > > free_pages((unsigned long)vaddr, get_order(size));
> > > > }
> > > >

2020-08-28 21:29:45

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v1 06/10] powerpc/pseries/iommu: Add ddw_list_add() helper

On Fri, 2020-08-28 at 11:58 +1000, Alexey Kardashevskiy wrote:
>
> On 28/08/2020 08:11, Leonardo Bras wrote:
> > On Mon, 2020-08-24 at 13:46 +1000, Alexey Kardashevskiy wrote:
> > > > static int find_existing_ddw_windows(void)
> > > > {
> > > > int len;
> > > > @@ -887,18 +905,11 @@ static int find_existing_ddw_windows(void)
> > > > if (!direct64)
> > > > continue;
> > > >
> > > > - window = kzalloc(sizeof(*window), GFP_KERNEL);
> > > > - if (!window || len < sizeof(struct dynamic_dma_window_prop)) {
> > > > + window = ddw_list_add(pdn, direct64);
> > > > + if (!window || len < sizeof(*direct64)) {
> > >
> > > Since you are touching this code, it looks like the "len <
> > > sizeof(*direct64)" part should go above to "if (!direct64)".
> >
> > Sure, makes sense.
> > It will be fixed for v2.
> >
> > >
> > >
> > > > kfree(window);
> > > > remove_ddw(pdn, true);
> > > > - continue;
> > > > }
> > > > -
> > > > - window->device = pdn;
> > > > - window->prop = direct64;
> > > > - spin_lock(&direct_window_list_lock);
> > > > - list_add(&window->list, &direct_window_list);
> > > > - spin_unlock(&direct_window_list_lock);
> > > > }
> > > >
> > > > return 0;
> > > > @@ -1261,7 +1272,8 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > > > dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
> > > > create.liobn, dn);
> > > >
> > > > - window = kzalloc(sizeof(*window), GFP_KERNEL);
> > > > + /* Add new window to existing DDW list */
> > >
> > > The comment seems to duplicate what the ddw_list_add name already suggests.
> >
> > Ok, I will remove it then.
> >
> > > > + window = ddw_list_add(pdn, ddwprop);
> > > > if (!window)
> > > > goto out_clear_window;
> > > >
> > > > @@ -1280,16 +1292,14 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > > > goto out_free_window;
> > > > }
> > > >
> > > > - window->device = pdn;
> > > > - window->prop = ddwprop;
> > > > - spin_lock(&direct_window_list_lock);
> > > > - list_add(&window->list, &direct_window_list);
> > > > - spin_unlock(&direct_window_list_lock);
> > >
> > > I'd leave these 3 lines here and in find_existing_ddw_windows() (which
> > > would make ddw_list_add -> ddw_prop_alloc). In general you want to have
> > > less stuff to do on the failure path. kmalloc may fail and needs kfree
> > > but you can safely delay list_add (which cannot fail) and avoid having
> > > the lock held twice in the same function (one of them is hidden inside
> > > ddw_list_add).
> > > Not sure if this change is really needed after all. Thanks,
> >
> > I understand this leads to better performance in case anything fails.
> > Also, I think list_add happening in the end is less error-prone (in
> > case the list is checked between list_add and a fail).
>
> Performance was not in my mind at all.
>
> I noticed you remove from a list with the lock held, which was not there
> before, and there is a bunch of labels on the exit path, so I started
> looking at list_add() and whether you double remove from the list.
>
>
> > But what if we put it at the end?
> > What is the chance of a kzalloc of 4 pointers (struct direct_window)
> > failing after walk_system_ram_range?
>
> This is not about chances really, it is about readability. If, let's say,
> kmalloc failed, you just go to the error exit label and simply call kfree()
> on that pointer; kfree() will do nothing if it is NULL already, simple.
> list_del() does not have this simplicity.
>
>
> > Is it not worthy doing that for making enable_ddw() easier to
> > understand?
>
> This is my goal here :)

Ok, it makes sense to me now.
I tried creating ddw_list_add() to keep everything related to list-adding
in a single place, instead of splitting it around the other stuff,
but now I understand that the code may look more complex than it was
before, because the failure path increases in size.

For me it was strange creating a list entry and not list_add()ing it
right away, but maybe it's something worth getting used to, as it may
keep the failure path simpler, since list_add() doesn't fail.

I will try to see if the ddw_list_add() routine would become a useful
ddw_list_entry(), but if not, I will remove this patch.
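
Maybe something like this (just a sketch), leaving the spin_lock() +
list_add() to the callers at the very end of their paths:

static struct direct_window *ddw_list_entry(struct device_node *pdn,
		const struct dynamic_dma_window_prop *dma64)
{
	struct direct_window *window;

	window = kzalloc(sizeof(*window), GFP_KERNEL);
	if (!window)
		return NULL;

	window->device = pdn;
	window->prop = dma64;

	return window;
}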

Alexey, Thank you for reviewing this series!
Best regards,

Leonardo

2020-08-31 00:08:12

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift



On 29/08/2020 05:55, Leonardo Bras wrote:
> On Fri, 2020-08-28 at 12:27 +1000, Alexey Kardashevskiy wrote:
>>
>> On 28/08/2020 01:32, Leonardo Bras wrote:
>>> Hello Alexey, thank you for this feedback!
>>>
>>> On Sat, 2020-08-22 at 19:33 +1000, Alexey Kardashevskiy wrote:
>>>>> +#define TCE_RPN_BITS 52 /* Bits 0-51 represent RPN on TCE */
>>>>
>>>> Ditch this one and use MAX_PHYSMEM_BITS instead? I am pretty sure this
>>>> is the actual limit.
>>>
>>> I understand this MAX_PHYSMEM_BITS(51) comes from the maximum physical memory addressable in the machine. IIUC, it means we can access physical address up to (1ul << MAX_PHYSMEM_BITS).
>>>
>>> This 52 comes from PAPR "Table 9. TCE Definition" which defines bits
>>> 0-51 as the RPN. By looking at code, I understand that it means we may input any address < (1ul << 52) to TCE.
>>>
>>> In practice, MAX_PHYSMEM_BITS should be enough as of today, because I suppose we can't ever pass a physical page address over
>>> (1ul << 51), and TCE accepts up to (1ul << 52).
>>> But if we ever increase MAX_PHYSMEM_BITS, it doesn't necessarily means that TCE_RPN_BITS will also be increased, so I think they are independent values.
>>>
>>> Does it make sense? Please let me know if I am missing something.
>>
>> The underlying hardware is PHB3/4 about which the IODA2 Version 2.4
>> 6Apr2012.pdf spec says:
>>
>> "The number of most significant RPN bits implemented in the TCE is
>> dependent on the max size of System Memory to be supported by the platform".
>>
>> IODA3 is the same on this matter.
>>
>> This is MAX_PHYSMEM_BITS and PHB itself does not have any other limits
>> on top of that. So the only real limit comes from MAX_PHYSMEM_BITS and
>> where TCE_RPN_BITS comes from exactly - I have no idea.
>
> Well, I created this TCE_RPN_BITS = 52 because the previous mask was a
> hardcoded 40-bit mask (0xfffffffffful), for hard-coded 12-bit (4k)
> pagesize, and on PAPR+/LoPAR also defines TCE as having bits 0-51
> described as RPN, as described before.
>
> IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figure 3.4 and 3.5.
> shows system memory mapping into a TCE, and the TCE also has bits 0-51
> for the RPN (52 bits). "Table 3.6. TCE Definition" also shows it.
>
> In fact, by the looks of those figures, the RPN_MASK should always be a
> 52-bit mask, and RPN = (page >> tceshift) & RPN_MASK.


I suspect the mask is there in the first place for extra protection
against too-big addresses going to the TCE table (and/or for virtual vs
physical addresses). Using a 52-bit mask makes no sense for anything; you
could just drop the mask and let the C compiler deal with the 64bit "uint"
as it is basically a 4K page address anywhere in the 64bit space. Thanks,


> Maybe that's it?




>
>>
>>
>>>>
>>>>> +#define TCE_RPN_MASK(ps) ((1ul << (TCE_RPN_BITS - (ps))) - 1)
>>>>> #define TCE_VALID 0x800 /* TCE valid */
>>>>> #define TCE_ALLIO 0x400 /* TCE valid for all lpars */
>>>>> #define TCE_PCI_WRITE 0x2 /* write from PCI allowed */
>>>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>>>>> index e4198700ed1a..8fe23b7dff3a 100644
>>>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>>>> @@ -107,6 +107,9 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
>>>>> u64 proto_tce;
>>>>> __be64 *tcep;
>>>>> u64 rpn;
>>>>> + const unsigned long tceshift = tbl->it_page_shift;
>>>>> + const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);
>>>>> + const u64 rpn_mask = TCE_RPN_MASK(tceshift);
>>>>
>>>> Using IOMMU_PAGE_SIZE macro for the page size and not using
>>>> IOMMU_PAGE_MASK for the mask - this incosistency makes my small brain
>>>> explode :) I understand the history but maaaaan... Oh well, ok.
>>>>
>>>
>>> Yeah, it feels kind of weird after two IOMMU related consts. :)
>>> But sure IOMMU_PAGE_MASK() would not be useful here :)
>>>
>>> And this kind of let me thinking:
>>>>> + rpn = __pa(uaddr) >> tceshift;
>>>>> + *tcep = cpu_to_be64(proto_tce | (rpn & rpn_mask) << tceshift);
>>> Why not:
>>> rpn_mask = TCE_RPN_MASK(tceshift) << tceshift;
>>
>> A mask for a page number (but not the address!) hurts my brain, masks
>> are good against addresses but numbers should already have all bits
>> adjusted imho, may be it is just me :-/
>>
>>
>>>
>>> rpn = __pa(uaddr) & rpn_mask;
>>> *tcep = cpu_to_be64(proto_tce | rpn)
>>>
>>> I am usually afraid of changing stuff like this, but I think it's safe.
>>>
>>>> Good, otherwise. Thanks,
>>>
>>> Thank you for reviewing!
>>>
>>>
>>>
>

--
Alexey

2020-08-31 00:48:58

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()



On 29/08/2020 06:41, Leonardo Bras wrote:
> On Fri, 2020-08-28 at 11:40 +1000, Alexey Kardashevskiy wrote:
>>> I think it would be better to keep the code as much generic as possible
>>> regarding page sizes.
>>
>> Then you need to test it. Does 4K guest even boot (it should but I would
>> not bet much on it)?
>
> Maybe testing with host 64k pagesize and IOMMU 16MB pagesize in qemu
> should be enough, is there any chance to get indirect mapping in qemu
> like this? (DDW but with smaller DMA window available)


You will have to hack the guest kernel to always do indirect mapping or
hack QEMU's rtas_ibm_query_pe_dma_window() to return a small number of
available TCEs. But you will be testing QEMU/KVM which behave quite
differently to pHyp in this particular case.



>>>> Because if we want the former (==support), then we'll have to align the
>>>> size up to the bigger page size when allocating/zeroing system pages,
>>>> etc.
>>>
>>> This part I don't understand. Why do we need to align everything to the
>>> bigger pagesize?
>>>
>>> I mean, is not that enough that the range [ret, ret + size[ is both
>>> allocated by mm and mapped on a iommu range?
>>>
>>> Suppose a iommu_alloc_coherent() of 16kB on PAGESIZE = 4k and
>>> IOMMU_PAGE_SIZE() == 64k.
>>> Why 4 * cpu_pages mapped by a 64k IOMMU page is not enough?
>>> All the space the user asked for is allocated and mapped for DMA.
>>
>> The user asked to map 16K, the rest - 48K - is used for something else
>> (may be even mapped to another device) but you are making all 64K
>> accessible by the device which only should be able to access 16K.
>>
>> In practice, if this happens, H_PUT_TCE will simply fail.
>
> I have noticed mlx5 driver getting a few bytes in a buffer, and using
> iommu_map_page(). It does map a whole page for as few bytes as the user


Whole 4K system page or whole 64K iommu page?

> wants mapped, and the other bytes get used for something else, or just
> mapped on another DMA page.
> It seems to work fine.



With a 4K system page and a 64K IOMMU page? In practice it would take an
effort and/or bad luck to see it crashing. Thanks,
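
To put numbers on the 16K example, this is what the patch's
iommu_alloc_coherent() arithmetic gives (a sketch, assuming PAGE_SHIFT == 12
and tbl->it_page_shift == 16):

	size_t size    = 16 * 1024;				/* caller asks for 16K */
	size_t size_io = IOMMU_PAGE_ALIGN(size, tbl);		/* rounded up to 64K */
	unsigned int nio_pages = size_io >> tbl->it_page_shift;	/* == 1 */
	unsigned int io_order  = get_iommu_order(size_io, tbl);	/* == 0 */

	/* the single 64K TCE covers the 16K that was requested plus 48K more */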



>
>>
>>
>>>> Bigger pages are not the case here as I understand it.
>>>
>>> I did not get this part, what do you mean?
>>
>> Possible IOMMU page sizes are 4K, 64K, 2M, 16M, 256M, 1GB, and the
>> supported set of sizes is different for P8/P9 and type of IO (PHB,
>> NVLink/CAPI).
>>
>>
>>>>> Update those functions to guarantee alignment with requested size
>>>>> using IOMMU_PAGE_ALIGN() before doing iommu_alloc() / iommu_free().
>>>>>
>>>>> Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
>>>>> with IOMMU_PAGE_ALIGN(n, tbl), which seems easier to read.
>>>>>
>>>>> Signed-off-by: Leonardo Bras <[email protected]>
>>>>> ---
>>>>> arch/powerpc/kernel/iommu.c | 17 +++++++++--------
>>>>> 1 file changed, 9 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>>>>> index 9704f3f76e63..d7086087830f 100644
>>>>> --- a/arch/powerpc/kernel/iommu.c
>>>>> +++ b/arch/powerpc/kernel/iommu.c
>>>>> @@ -237,10 +237,9 @@ static unsigned long iommu_range_alloc(struct device *dev,
>>>>> }
>>>>>
>>>>> if (dev)
>>>>> - boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
>>>>> - 1 << tbl->it_page_shift);
>>>>> + boundary_size = IOMMU_PAGE_ALIGN(dma_get_seg_boundary(dev) + 1, tbl);
>>>>
>>>> Run checkpatch.pl, should complain about a long line.
>>>
>>> It's 86 columns long, which is less than the new limit of 100 columns
>>> Linus announced a few weeks ago. checkpatch.pl was updated too:
>>> https://www.phoronix.com/scan.php?page=news_item&px=Linux-Kernel-Deprecates-80-Col
>>
>> Yay finally :) Thanks,
>
> :)
>
>>
>>
>>>>
>>>>> else
>>>>> - boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
>>>>> + boundary_size = IOMMU_PAGE_ALIGN(1UL << 32, tbl);
>>>>> /* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
>>>>>
>>>>> n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset,
>>>>> @@ -858,6 +857,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
>>>>> unsigned int order;
>>>>> unsigned int nio_pages, io_order;
>>>>> struct page *page;
>>>>> + size_t size_io = size;
>>>>>
>>>>> size = PAGE_ALIGN(size);
>>>>> order = get_order(size);
>>>>> @@ -884,8 +884,9 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
>>>>> memset(ret, 0, size);
>>>>>
>>>>> /* Set up tces to cover the allocated range */
>>>>> - nio_pages = size >> tbl->it_page_shift;
>>>>> - io_order = get_iommu_order(size, tbl);
>>>>> + size_io = IOMMU_PAGE_ALIGN(size_io, tbl);
>>>>> + nio_pages = size_io >> tbl->it_page_shift;
>>>>> + io_order = get_iommu_order(size_io, tbl);
>>>>> mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
>>>>> mask >> tbl->it_page_shift, io_order, 0);
>>>>> if (mapping == DMA_MAPPING_ERROR) {
>>>>> @@ -900,11 +901,11 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
>>>>> void *vaddr, dma_addr_t dma_handle)
>>>>> {
>>>>> if (tbl) {
>>>>> - unsigned int nio_pages;
>>>>> + size_t size_io = IOMMU_PAGE_ALIGN(size, tbl);
>>>>> + unsigned int nio_pages = size_io >> tbl->it_page_shift;
>>>>>
>>>>> - size = PAGE_ALIGN(size);
>>>>> - nio_pages = size >> tbl->it_page_shift;
>>>>> iommu_free(tbl, dma_handle, nio_pages);
>>>>> +
>>>>
>>>> Unrelated new line.
>>>
>>> Will be removed. Thanks!
>>>
>>>>
>>>>> size = PAGE_ALIGN(size);
>>>>> free_pages((unsigned long)vaddr, get_order(size));
>>>>> }
>>>>>
>

--
Alexey

2020-08-31 00:51:24

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 07/10] powerpc/pseries/iommu: Allow DDW windows starting at 0x00



On 29/08/2020 00:04, Leonardo Bras wrote:
> On Mon, 2020-08-24 at 13:44 +1000, Alexey Kardashevskiy wrote:
>>
>>> On 18/08/2020 09:40, Leonardo Bras wrote:
>>> enable_ddw() currently returns the address of the DMA window, which is
>>> considered invalid if has the value 0x00.
>>>
>>> Also, it only considers valid an address returned from find_existing_ddw
>>> if it's not 0x00.
>>>
>>> Changing this behavior makes sense, given the users of enable_ddw() only
>>> need to know if direct mapping is possible. It can also allow a DMA window
>>> starting at 0x00 to be used.
>>>
>>> This will be helpful for using a DDW with indirect mapping, as the window
>>> address will be different than 0x00, but it will not map the whole
>>> partition.
>>>
>>> Signed-off-by: Leonardo Bras <[email protected]>
>>> ---
>>> arch/powerpc/platforms/pseries/iommu.c | 30 ++++++++++++--------------
>>> 1 file changed, 14 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>>> index fcdefcc0f365..4031127c9537 100644
>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>> @@ -852,24 +852,25 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
>>> np, ret);
>>> }
>>>>
>>> -static u64 find_existing_ddw(struct device_node *pdn)
>>> +static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
>>> {
>>> struct direct_window *window;
>>> const struct dynamic_dma_window_prop *direct64;
>>> - u64 dma_addr = 0;
>>> + bool found = false;
>>>
>>> spin_lock(&direct_window_list_lock);
>>> /* check if we already created a window and dupe that config if so */
>>> list_for_each_entry(window, &direct_window_list, list) {
>>> if (window->device == pdn) {
>>> direct64 = window->prop;
>>> - dma_addr = be64_to_cpu(direct64->dma_base);
>>> + *dma_addr = be64_to_cpu(direct64->dma_base);
>>> + found = true;
>>> break;
>>> }
>>> }
>>> spin_unlock(&direct_window_list_lock);
>>>
>>> - return dma_addr;
>>> + return found;
>>> }
>>>
>>> static struct direct_window *ddw_list_add(struct device_node *pdn,
>>> @@ -1131,15 +1132,15 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
>>> * pdn: the parent pe node with the ibm,dma_window property
>>> * Future: also check if we can remap the base window for our base page size
>>> *
>>> - * returns the dma offset for use by the direct mapped DMA code.
>>> + * returns true if can map all pages (direct mapping), false otherwise..
>>> */
>>> -static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>> +static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>> {
>>> int len, ret;
>>> struct ddw_query_response query;
>>> struct ddw_create_response create;
>>> int page_shift;
>>> - u64 dma_addr, max_addr;
>>> + u64 max_addr;
>>> struct device_node *dn;
>>> u32 ddw_avail[DDW_APPLICABLE_SIZE];
>>> struct direct_window *window;
>>> @@ -1150,8 +1151,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>>
>>> mutex_lock(&direct_window_init_mutex);
>>>
>>> - dma_addr = find_existing_ddw(pdn);
>>> - if (dma_addr != 0)
>>> + if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset))
>>> goto out_unlock;
>>>
>>> /*
>>> @@ -1292,7 +1292,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>> goto out_free_window;
>>> }
>>>
>>> - dma_addr = be64_to_cpu(ddwprop->dma_base);
>>> + dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);
>>
>> Do not you need the same chunk in the find_existing_ddw() case above as
>> well? Thanks,
>
> The new signature of find_existing_ddw() is
> static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
>
> And on enable_ddw(), we call
> find_existing_ddw(pdn, &dev->dev.archdata.dma_offset)
>
> And inside the function we do:
> *dma_addr = be64_to_cpu(direct64->dma_base);
>
> I think it's the same as the chunk before.
> Am I missing something?

ah no, sorry, you are not missing anything.


Reviewed-by: Alexey Kardashevskiy <[email protected]>




--
Alexey

2020-08-31 01:43:16

by Oliver O'Halloran

[permalink] [raw]
Subject: Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift

On Mon, Aug 31, 2020 at 10:08 AM Alexey Kardashevskiy <[email protected]> wrote:
>
> On 29/08/2020 05:55, Leonardo Bras wrote:
> > On Fri, 2020-08-28 at 12:27 +1000, Alexey Kardashevskiy wrote:
> >>
> >> On 28/08/2020 01:32, Leonardo Bras wrote:
> >>> Hello Alexey, thank you for this feedback!
> >>>
> >>> On Sat, 2020-08-22 at 19:33 +1000, Alexey Kardashevskiy wrote:
> >>>>> +#define TCE_RPN_BITS 52 /* Bits 0-51 represent RPN on TCE */
> >>>>
> >>>> Ditch this one and use MAX_PHYSMEM_BITS instead? I am pretty sure this
> >>>> is the actual limit.
> >>>
> >>> I understand this MAX_PHYSMEM_BITS(51) comes from the maximum physical memory addressable in the machine. IIUC, it means we can access physical address up to (1ul << MAX_PHYSMEM_BITS).
> >>>
> >>> This 52 comes from PAPR "Table 9. TCE Definition" which defines bits
> >>> 0-51 as the RPN. By looking at code, I understand that it means we may input any address < (1ul << 52) to TCE.
> >>>
> >>> In practice, MAX_PHYSMEM_BITS should be enough as of today, because I suppose we can't ever pass a physical page address over
> >>> (1ul << 51), and TCE accepts up to (1ul << 52).
> >>> But if we ever increase MAX_PHYSMEM_BITS, it doesn't necessarily means that TCE_RPN_BITS will also be increased, so I think they are independent values.
> >>>
> >>> Does it make sense? Please let me know if I am missing something.
> >>
> >> The underlying hardware is PHB3/4 about which the IODA2 Version 2.4
> >> 6Apr2012.pdf spec says:
> >>
> >> "The number of most significant RPN bits implemented in the TCE is
> >> dependent on the max size of System Memory to be supported by the platform".
> >>
> >> IODA3 is the same on this matter.
> >>
> >> This is MAX_PHYSMEM_BITS and PHB itself does not have any other limits
> >> on top of that. So the only real limit comes from MAX_PHYSMEM_BITS and
> >> where TCE_RPN_BITS comes from exactly - I have no idea.
> >
> > Well, I created this TCE_RPN_BITS = 52 because the previous mask was a
> > hardcoded 40-bit mask (0xfffffffffful), for hard-coded 12-bit (4k)
> > pagesize, and on PAPR+/LoPAR also defines TCE as having bits 0-51
> > described as RPN, as described before.
> >
> > IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figure 3.4 and 3.5.
> > shows system memory mapping into a TCE, and the TCE also has bits 0-51
> > for the RPN (52 bits). "Table 3.6. TCE Definition" also shows it.
> >> In fact, by the looks of those figures, the RPN_MASK should always be a
> > 52-bit mask, and RPN = (page >> tceshift) & RPN_MASK.
>
> I suspect the mask is there in the first place for extra protection
> against too big addresses going to the TCE table (or/and for virtial vs
> physical addresses). Using 52bit mask makes no sense for anything, you
> could just drop the mask and let c compiler deal with 64bit "uint" as it
> is basically a 4K page address anywhere in the 64bit space. Thanks,

Assuming 4K pages you need 52 RPN bits to cover the whole 64-bit
physical address space. The IODA3 spec does explicitly say the upper
bits are optional and the implementation only needs to support enough
to cover up to the physical address limit, which is 56 bits on P9 /
PHB4. If you want to validate that the address will fit inside
MAX_PHYSMEM_BITS then fine, but I think that should be done as a
WARN_ON or similar rather than just silently masking off the bits.
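
Something along these lines is what I mean, in tce_build_pSeries() (just a
sketch; where exactly the check lives and what it returns would need
thought):

	rpn = __pa(uaddr) >> tceshift;

	/* validate the RPN instead of silently truncating it */
	if (WARN_ON(rpn >= (1UL << (MAX_PHYSMEM_BITS - tceshift))))
		return -EINVAL;

	*tcep = cpu_to_be64(proto_tce | (rpn << tceshift));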

2020-08-31 03:50:00

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift



On 31/08/2020 11:41, Oliver O'Halloran wrote:
> On Mon, Aug 31, 2020 at 10:08 AM Alexey Kardashevskiy <[email protected]> wrote:
>>
>> On 29/08/2020 05:55, Leonardo Bras wrote:
>>> On Fri, 2020-08-28 at 12:27 +1000, Alexey Kardashevskiy wrote:
>>>>
>>>> On 28/08/2020 01:32, Leonardo Bras wrote:
>>>>> Hello Alexey, thank you for this feedback!
>>>>>
>>>>> On Sat, 2020-08-22 at 19:33 +1000, Alexey Kardashevskiy wrote:
>>>>>>> +#define TCE_RPN_BITS 52 /* Bits 0-51 represent RPN on TCE */
>>>>>>
>>>>>> Ditch this one and use MAX_PHYSMEM_BITS instead? I am pretty sure this
>>>>>> is the actual limit.
>>>>>
>>>>> I understand this MAX_PHYSMEM_BITS(51) comes from the maximum physical memory addressable in the machine. IIUC, it means we can access physical address up to (1ul << MAX_PHYSMEM_BITS).
>>>>>
>>>>> This 52 comes from PAPR "Table 9. TCE Definition" which defines bits
>>>>> 0-51 as the RPN. By looking at code, I understand that it means we may input any address < (1ul << 52) to TCE.
>>>>>
>>>>> In practice, MAX_PHYSMEM_BITS should be enough as of today, because I suppose we can't ever pass a physical page address over
>>>>> (1ul << 51), and TCE accepts up to (1ul << 52).
>>>>> But if we ever increase MAX_PHYSMEM_BITS, it doesn't necessarily means that TCE_RPN_BITS will also be increased, so I think they are independent values.
>>>>>
>>>>> Does it make sense? Please let me know if I am missing something.
>>>>
>>>> The underlying hardware is PHB3/4 about which the IODA2 Version 2.4
>>>> 6Apr2012.pdf spec says:
>>>>
>>>> "The number of most significant RPN bits implemented in the TCE is
>>>> dependent on the max size of System Memory to be supported by the platform".
>>>>
>>>> IODA3 is the same on this matter.
>>>>
>>>> This is MAX_PHYSMEM_BITS and PHB itself does not have any other limits
>>>> on top of that. So the only real limit comes from MAX_PHYSMEM_BITS and
>>>> where TCE_RPN_BITS comes from exactly - I have no idea.
>>>
>>> Well, I created this TCE_RPN_BITS = 52 because the previous mask was a
>>> hardcoded 40-bit mask (0xfffffffffful), for hard-coded 12-bit (4k)
>>> pagesize, and on PAPR+/LoPAR also defines TCE as having bits 0-51
>>> described as RPN, as described before.
>>>
>>> IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figure 3.4 and 3.5.
>>> shows system memory mapping into a TCE, and the TCE also has bits 0-51
>>> for the RPN (52 bits). "Table 3.6. TCE Definition" also shows it.
>>>> In fact, by the looks of those figures, the RPN_MASK should always be a
>>> 52-bit mask, and RPN = (page >> tceshift) & RPN_MASK.
>>
>> I suspect the mask is there in the first place for extra protection
>> against too big addresses going to the TCE table (or/and for virtial vs
>> physical addresses). Using 52bit mask makes no sense for anything, you
>> could just drop the mask and let c compiler deal with 64bit "uint" as it
>> is basically a 4K page address anywhere in the 64bit space. Thanks,
>
> Assuming 4K pages you need 52 RPN bits to cover the whole 64bit
> physical address space. The IODA3 spec does explicitly say the upper
> bits are optional and the implementation only needs to support enough
> to cover up to the physical address limit, which is 56bits of P9 /
> PHB4. If you want to validate that the address will fit inside of
> MAX_PHYSMEM_BITS then fine, but I think that should be done as a
> WARN_ON or similar rather than just silently masking off the bits.

We can do this and probably should anyway but I am also pretty sure we
can just ditch the mask and have the hypervisor return an error which
will show up in dmesg.


--
Alexey

2020-08-31 04:37:02

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 09/10] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition



On 29/08/2020 04:36, Leonardo Bras wrote:
> On Mon, 2020-08-24 at 15:17 +1000, Alexey Kardashevskiy wrote:
>>
>> On 18/08/2020 09:40, Leonardo Bras wrote:
>>> As of today, if the biggest DDW that can be created can't map the whole
>>> partition, it's creation is skipped and the default DMA window
>>> "ibm,dma-window" is used instead.
>>>
>>> DDW is 16x bigger than the default DMA window,
>>
>> 16x only under very specific circumstances which are
>> 1. phyp
>> 2. sriov
>> 3. device class in hmc (or what that priority number is in the lpar config).
>
> Yeah, missing details.
>
>>> having the same amount of
>>> pages, but increasing the page size to 64k.
>>> Besides larger DMA window,
>>
>> "Besides being larger"?
>
> You are right there.
>
>>
>>> it performs better for allocations over 4k,
>>
>> Better how?
>
> I was thinking for allocations larger than (512 * 4k), since >2
> hypercalls are needed here, and for 64k pages would still be just 1
> hypercall up to (512 * 64k).
> But yeah, not the usual case anyway.

Yup.


>
>>
>>> so it would be nice to use it instead.
>>
>> I'd rather say something like:
>> ===
>> So far we assumed we can map the guest RAM 1:1 to the bus which worked
>> with a small number of devices. SRIOV changes it as the user can
>> configure hundreds VFs and since phyp preallocates TCEs and does not
>> allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
>> per a PE to limit waste of physical pages.
>> ===
>
> I mixed this in my commit message, it looks like this:
>
> ===
> powerpc/pseries/iommu: Make use of DDW for indirect mapping
>
> So far it's assumed possible to map the guest RAM 1:1 to the bus, which
> works with a small number of devices. SRIOV changes it as the user can
> configure hundreds VFs and since phyp preallocates TCEs and does not
> allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
> per a PE to limit waste of physical pages.
>
> As of today, if the assumed direct mapping is not possible, DDW
> creation is skipped and the default DMA window "ibm,dma-window" is used
> instead.
>
> The default DMA window uses 4k pages instead of 64k pages, and since
> the amount of pages is the same,


Is the amount really the same? I thought you can prioritize some VFs
over others (== allocate different number of TCEs). Does it really
matter if it is the same?


> making use of DDW instead of the
> default DMA window for indirect mapping will expand in 16x the amount
> of memory that can be mapped on DMA.

Stop saying "16x", it is not guaranteed by anything :)


>
> The DDW created will be used for direct mapping by default. [...]
> ===
>
> What do you think?
>
>>> The DDW created will be used for direct mapping by default.
>>> If it's not available, indirect mapping will be used instead.
>>>
>>> For indirect mapping, it's necessary to update the iommu_table so
>>> iommu_alloc() can use the DDW created. For this,
>>> iommu_table_update_window() is called when everything else succeeds
>>> at enable_ddw().
>>>
>>> Removing the default DMA window for using DDW with indirect mapping
>>> is only allowed if there is no current IOMMU memory allocated in
>>> the iommu_table. enable_ddw() is aborted otherwise.
>>>
>>> As there will never have both direct and indirect mappings at the same
>>> time, the same property name can be used for the created DDW.
>>>
>>> So renaming
>>> define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
>>> to
>>> define DMA64_PROPNAME "linux,dma64-ddr-window-info"
>>> looks the right thing to do.
>>
>> I know I suggested this but this does not look so good anymore as I
>> suspect it breaks kexec (from older kernel to this one) so you either
>> need to check for both DT names or just keep the old one. Changing the
>> macro name is fine.
>>
>
> Yeah, having 'direct' in the name don't really makes sense if it's used
> for indirect mapping. I will just add this new define instead of
> replacing the old one, and check for both.
> Is that ok?


No, having two of these does not seem right or useful. It is pseries,
which does not use petitboot (it relies on grub instead, so until the target
kernel is started there will be no DDW), so realistically we need this
property for kexec/kdump, which uses the same kernel but a different
initramdisk; for that purpose we need the same property name.

But I can see myself annoyed when I try petitboot in the hacked pseries
qemu and things may crash :) On this basis I'd suggest keeping the name
and adding a comment next to it that it is not always "direct" anymore.


>
>>
>>> To make sure the property differentiates both cases, a new u32 for flags
>>> was added at the end of the property, where BIT(0) set means direct
>>> mapping.
>>>
>>> Signed-off-by: Leonardo Bras <[email protected]>
>>> ---
>>> arch/powerpc/platforms/pseries/iommu.c | 108 +++++++++++++++++++------
>>> 1 file changed, 84 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>>> index 3a1ef02ad9d5..9544e3c91ced 100644
>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>> @@ -350,8 +350,11 @@ struct dynamic_dma_window_prop {
>>> __be64 dma_base; /* address hi,lo */
>>> __be32 tce_shift; /* ilog2(tce_page_size) */
>>> __be32 window_shift; /* ilog2(tce_window_size) */
>>> + __be32 flags; /* DDW properties, see bellow */
>>> };
>>>
>>> +#define DDW_FLAGS_DIRECT 0x01
>>
>> This is set if ((1<<window_shift) >= ddw_memory_hotplug_max()), you
>> could simply check window_shift and drop the flags.
>>
>
> Yeah, it's better this way, I will revert this.
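
i.e. just something like this in find_existing_ddw(), as a sketch, instead of
carrying a flag in the property:

	*direct_mapping = ((1ULL << be32_to_cpu(direct64->window_shift)) >=
			   ddw_memory_hotplug_max());
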
>
>>
>>> +
>>> struct direct_window {
>>> struct device_node *device;
>>> const struct dynamic_dma_window_prop *prop;
>>> @@ -377,7 +380,7 @@ static LIST_HEAD(direct_window_list);
>>> static DEFINE_SPINLOCK(direct_window_list_lock);
>>> /* protects initializing window twice for same device */
>>> static DEFINE_MUTEX(direct_window_init_mutex);
>>> -#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
>>> +#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
>>>
>>> static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
>>> unsigned long num_pfn, const void *arg)
>>> @@ -836,7 +839,7 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
>>> if (ret)
>>> return;
>>>
>>> - win = of_find_property(np, DIRECT64_PROPNAME, NULL);
>>> + win = of_find_property(np, DMA64_PROPNAME, NULL);
>>> if (!win)
>>> return;
>>>
>>> @@ -852,7 +855,7 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
>>> np, ret);
>>> }
>>>
>>> -static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
>>> +static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, bool *direct_mapping)
>>> {
>>> struct direct_window *window;
>>> const struct dynamic_dma_window_prop *direct64;
>>> @@ -864,6 +867,7 @@ static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
>>> if (window->device == pdn) {
>>> direct64 = window->prop;
>>> *dma_addr = be64_to_cpu(direct64->dma_base);
>>> + *direct_mapping = be32_to_cpu(direct64->flags) & DDW_FLAGS_DIRECT;
>>> found = true;
>>> break;
>>> }
>>> @@ -901,8 +905,8 @@ static int find_existing_ddw_windows(void)
>>> if (!firmware_has_feature(FW_FEATURE_LPAR))
>>> return 0;
>>>
>>> - for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
>>> - direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
>>> + for_each_node_with_property(pdn, DMA64_PROPNAME) {
>>> + direct64 = of_get_property(pdn, DMA64_PROPNAME, &len);
>>> if (!direct64)
>>> continue;
>>>
>>> @@ -1124,7 +1128,8 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
>>> }
>>>
>>> static int ddw_property_create(struct property **ddw_win, const char *propname,
>>> - u32 liobn, u64 dma_addr, u32 page_shift, u32 window_shift)
>>> + u32 liobn, u64 dma_addr, u32 page_shift,
>>> + u32 window_shift, bool direct_mapping)
>>> {
>>> struct dynamic_dma_window_prop *ddwprop;
>>> struct property *win64;
>>> @@ -1144,6 +1149,36 @@ static int ddw_property_create(struct property **ddw_win, const char *propname,
>>> ddwprop->dma_base = cpu_to_be64(dma_addr);
>>> ddwprop->tce_shift = cpu_to_be32(page_shift);
>>> ddwprop->window_shift = cpu_to_be32(window_shift);
>>> + if (direct_mapping)
>>> + ddwprop->flags = cpu_to_be32(DDW_FLAGS_DIRECT);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int iommu_table_update_window(struct iommu_table **tbl, int nid, unsigned long liobn,
>>> + unsigned long win_addr, unsigned long page_shift,
>>> + unsigned long window_size)
>>
>> Rather strange helper imho. I'd extract the most of
>> iommu_table_setparms_lpar() into iommu_table_setparms() (except
>> of_parse_dma_window) and call new helper from where you call
>> iommu_table_update_window; and do
>> iommu_pseries_alloc_table/iommu_tce_table_put there.
>>
>
> I don't see how to extract iommu_table_setparms_lpar() into
> iommu_table_setparms(), they look to be used for different machine
> types.
>
> Do mean you extracting most of iommu_table_setparms_lpar() (and maybe
> iommu_table_setparms() ) into a new helper, which is called in both
> functions and use it instead of iommu_table_update_window() ?

Yes, this.
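
Roughly this (a sketch; the helper name and the exact parameter list are just
a guess), i.e. one place that fills the fields both iommu_table_setparms()
and iommu_table_setparms_lpar() currently set:

static void iommu_table_setparms_common(struct iommu_table *tbl,
					unsigned long busno, unsigned long liobn,
					unsigned long win_addr, unsigned long window_size,
					unsigned long page_shift, void *base,
					struct iommu_table_ops *table_ops)
{
	tbl->it_busno = busno;
	tbl->it_index = liobn;
	tbl->it_offset = win_addr >> page_shift;
	tbl->it_size = window_size >> page_shift;
	tbl->it_page_shift = page_shift;
	tbl->it_base = (unsigned long)base;
	tbl->it_blocksize = 16;		/* constant on pseries */
	tbl->it_type = TCE_PCI;
	tbl->it_ops = table_ops;
}

Then the DDW path can call it (plus iommu_pseries_alloc_table() and
iommu_tce_table_put()) instead of copying fields from the old table.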


>
>>
>>> +{
>>> + struct iommu_table *new_tbl, *old_tbl;
>>> +
>>> + new_tbl = iommu_pseries_alloc_table(nid);
>>> + if (!new_tbl)
>>> + return -ENOMEM;
>>> +
>>> + old_tbl = *tbl;
>>> + new_tbl->it_index = liobn;
>>> + new_tbl->it_offset = win_addr >> page_shift;
>>> + new_tbl->it_page_shift = page_shift;
>>> + new_tbl->it_size = window_size >> page_shift;
>>> + new_tbl->it_base = old_tbl->it_base;
>>
>> Should not be used in pseries.
>>
>
> The point here is to migrate the values from the older tbl to the


The actual window/table is new (on the hypervisor side): you are not
migrating a single TCE, you deleted one whole window and created another
whole window. Calling it "migration" is confusing, especially since PAPR
actually defines TCE migration.


> newer. I Would like to understand why this is bad, if it will still be
> 'unused' as the older tbl.


Having explicit values is more readable imho.


>>
>>> + new_tbl->it_busno = old_tbl->it_busno;
>>> + new_tbl->it_blocksize = old_tbl->it_blocksize;
>>
>> 16 for pseries and does not change (may be even make it a macro).
>>
>>> + new_tbl->it_type = old_tbl->it_type;
>>
>> TCE_PCI.
>>
>
> Same as above.
>
>>
>>> + new_tbl->it_ops = old_tbl->it_ops;
>>> +
>>> + iommu_init_table(new_tbl, nid, old_tbl->it_reserved_start, old_tbl->it_reserved_end);
>>> + iommu_tce_table_put(old_tbl);
>>> + *tbl = new_tbl;
>>>
>>> return 0;
>>> }
>>> @@ -1171,12 +1206,16 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>> struct direct_window *window;
>>> struct property *win64 = NULL;
>>> struct failed_ddw_pdn *fpdn;
>>> - bool default_win_removed = false;
>>> + bool default_win_removed = false, maps_whole_partition = false;
>>
>> s/maps_whole_partition/direct_mapping/
>>
>
> Sure, I will get it replaced.
>
>>
>>> + struct pci_dn *pci = PCI_DN(pdn);
>>> + struct iommu_table *tbl = pci->table_group->tables[0];
>>>
>>> mutex_lock(&direct_window_init_mutex);
>>>
>>> - if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset))
>>> - goto out_unlock;
>>> + if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &maps_whole_partition)) {
>>> + mutex_unlock(&direct_window_init_mutex);
>>> + return maps_whole_partition;
>>> + }
>>>
>>> /*
>>> * If we already went through this for a previous function of
>>> @@ -1258,16 +1297,24 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>> query.page_size);
>>> goto out_failed;
>>> }
>>> +
>>> /* verify the window * number of ptes will map the partition */
>>> - /* check largest block * page size > max memory hotplug addr */
>>> max_addr = ddw_memory_hotplug_max();
>>> if (query.largest_available_block < (max_addr >> page_shift)) {
>>> - dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
>>> - "%llu-sized pages\n", max_addr, query.largest_available_block,
>>> - 1ULL << page_shift);
>>> - goto out_failed;
>>> + dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu %llu-sized pages\n",
>>> + max_addr, query.largest_available_block,
>>> + 1ULL << page_shift);
>>> +
>>> + len = order_base_2(query.largest_available_block << page_shift);
>>> + } else {
>>> + maps_whole_partition = true;
>>> + len = order_base_2(max_addr);
>>> }
>>> - len = order_base_2(max_addr);
>>> +
>>> + /* DDW + IOMMU on single window may fail if there is any allocation */
>>> + if (default_win_removed && !maps_whole_partition &&
>>> + iommu_table_in_use(tbl))
>>> + goto out_failed;
>>>
>>> ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
>>> if (ret != 0)
>>> @@ -1277,8 +1324,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>> create.liobn, dn);
>>>
>>> win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
>>> - ret = ddw_property_create(&win64, DIRECT64_PROPNAME, create.liobn, win_addr,
>>> - page_shift, len);
>>> + ret = ddw_property_create(&win64, DMA64_PROPNAME, create.liobn, win_addr,
>>> + page_shift, len, maps_whole_partition);
>>> if (ret) {
>>> dev_info(&dev->dev,
>>> "couldn't allocate property, property name, or value\n");
>>> @@ -1297,12 +1344,25 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>> if (!window)
>>> goto out_prop_del;
>>>
>>> - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
>>> - win64->value, tce_setrange_multi_pSeriesLP_walk);
>>> - if (ret) {
>>> - dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
>>> - dn, ret);
>>> - goto out_free_window;
>>> + if (maps_whole_partition) {
>>> + /* DDW maps the whole partition, so enable direct DMA mapping */
>>> + ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
>>> + win64->value, tce_setrange_multi_pSeriesLP_walk);
>>> + if (ret) {
>>> + dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
>>> + dn, ret);
>>> + goto out_free_window;
>>> + }
>>> + } else {
>>> + /* New table for using DDW instead of the default DMA window */
>>> + if (iommu_table_update_window(&tbl, pci->phb->node, create.liobn,
>>> + win_addr, page_shift, 1UL << len))
>>> + goto out_free_window;
>>> +
>>> + set_iommu_table_base(&dev->dev, tbl);
>>> + WARN_ON(dev->dev.archdata.dma_offset >= SZ_4G);
>>
>> What is this check for exactly? Why 4G, not >= 0, for example?
>
> I am not really sure, you suggested adding it here:
> http://patchwork.ozlabs.org/project/linuxppc-dev/patch/[email protected]/#2488874


Ah right, I did suggest this :) My bad. I think I suggested it before
suggesting to keep the reserved area boundaries checked/adjusted to the
window boundaries; we may as well drop this. Thanks,


>
> I can remove it if it's ok.
>
>>
>>> + goto out_unlock;
>>> +
>>> }
>>>
>>> dev->dev.archdata.dma_offset = win_addr;
>>> @@ -1340,7 +1400,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>>
>>> out_unlock:
>>> mutex_unlock(&direct_window_init_mutex);
>>> - return win64;
>>> + return win64 && maps_whole_partition;
>>> }
>>>
>>> static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
>>>
>

--
Alexey

2020-08-31 04:37:04

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 08/10] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()



On 29/08/2020 01:25, Leonardo Bras wrote:
> On Mon, 2020-08-24 at 15:07 +1000, Alexey Kardashevskiy wrote:
>>
>> On 18/08/2020 09:40, Leonardo Bras wrote:
>>> Code used to create a ddw property that was previously scattered in
>>> enable_ddw() is now gathered in ddw_property_create(), which deals with
>>> allocation and filling the property, letting it ready for
>>> of_property_add(), which now occurs in sequence.
>>>
>>> This created an opportunity to reorganize the second part of enable_ddw():
>>>
>>> Without this patch enable_ddw() does, in order:
>>> kzalloc() property & members, create_ddw(), fill ddwprop inside property,
>>> ddw_list_add(), do tce_setrange_multi_pSeriesLP_walk in all memory,
>>> of_add_property().
>>>
>>> With this patch enable_ddw() does, in order:
>>> create_ddw(), ddw_property_create(), of_add_property(), ddw_list_add(),
>>> do tce_setrange_multi_pSeriesLP_walk in all memory.
>>>
>>> This change requires of_remove_property() in case anything fails after
>>> of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk
>>> in all memory, which looks the most expensive operation, only if
>>> everything else succeeds.
>>>
>>> Signed-off-by: Leonardo Bras <[email protected]>
>>> ---
>>> arch/powerpc/platforms/pseries/iommu.c | 97 +++++++++++++++-----------
>>> 1 file changed, 57 insertions(+), 40 deletions(-)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>>> index 4031127c9537..3a1ef02ad9d5 100644
>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>> @@ -1123,6 +1123,31 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
>>> ret);
>>> }
>>>
>>> +static int ddw_property_create(struct property **ddw_win, const char *propname,
>>
>> @propname is always the same, do you really want to pass it every time?
>
> I think it reads better, like "create a ddw property with this name".

This reads as "there are at least two ddw properties".

> Also, it makes possible to create ddw properties with other names, in
> case we decide to create properties with different names depending on
> the window created.

It is one window at any given moment, why call it different names... I
get the part that it is not always "direct" anymore but still...


> Also, it's probably optimized / inlined at this point.
> Is it ok doing it like this?
>
>>
>>> + u32 liobn, u64 dma_addr, u32 page_shift, u32 window_shift)
>>> +{
>>> + struct dynamic_dma_window_prop *ddwprop;
>>> + struct property *win64;
>>> +
>>> + *ddw_win = win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
>>> + if (!win64)
>>> + return -ENOMEM;
>>> +
>>> + win64->name = kstrdup(propname, GFP_KERNEL);
>>
>> Not clear why "win64->name = DIRECT64_PROPNAME" would not work here, the
>> generic OF code does not try kfree() it but it is probably out of scope
>> here.
>
> Yeah, I had that question too.
> Previous code was like that, and I as trying not to mess too much on
> how it's done.
>
>>> + ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
>>> + win64->value = ddwprop;
>>> + win64->length = sizeof(*ddwprop);
>>> + if (!win64->name || !win64->value)
>>> + return -ENOMEM;
>>
>> Up to 2 memory leaks here. I see the cleanup at "out_free_prop:" but
>> still looks fragile. Instead you could simply return win64 as the only
>> error possible here is -ENOMEM and returning NULL is equally good.
>
> I agree. It's better if this function have it's own cleaning routine.
> It will be fixed for next version.
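
To be clear, what I had in mind is roughly this (sketch only): return the
property or NULL, and let the function clean up after itself on failure:

static struct property *ddw_property_create(const char *propname, u32 liobn,
					    u64 dma_addr, u32 page_shift,
					    u32 window_shift)
{
	struct dynamic_dma_window_prop *ddwprop;
	struct property *win64;

	win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
	if (!win64)
		return NULL;

	win64->name = kstrdup(propname, GFP_KERNEL);
	ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
	win64->value = ddwprop;
	win64->length = sizeof(*ddwprop);
	if (!win64->name || !ddwprop) {
		kfree(win64->name);
		kfree(ddwprop);
		kfree(win64);
		return NULL;
	}

	ddwprop->liobn = cpu_to_be32(liobn);
	ddwprop->dma_base = cpu_to_be64(dma_addr);
	ddwprop->tce_shift = cpu_to_be32(page_shift);
	ddwprop->window_shift = cpu_to_be32(window_shift);

	return win64;
}
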
>
>>
>>
>>> +
>>> + ddwprop->liobn = cpu_to_be32(liobn);
>>> + ddwprop->dma_base = cpu_to_be64(dma_addr);
>>> + ddwprop->tce_shift = cpu_to_be32(page_shift);
>>> + ddwprop->window_shift = cpu_to_be32(window_shift);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> /*
>>> * If the PE supports dynamic dma windows, and there is space for a table
>>> * that can map all pages in a linear offset, then setup such a table,
>>> @@ -1140,12 +1165,11 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>> struct ddw_query_response query;
>>> struct ddw_create_response create;
>>> int page_shift;
>>> - u64 max_addr;
>>> + u64 max_addr, win_addr;
>>> struct device_node *dn;
>>> u32 ddw_avail[DDW_APPLICABLE_SIZE];
>>> struct direct_window *window;
>>> - struct property *win64;
>>> - struct dynamic_dma_window_prop *ddwprop;
>>> + struct property *win64 = NULL;
>>> struct failed_ddw_pdn *fpdn;
>>> bool default_win_removed = false;
>>>
>>> @@ -1244,38 +1268,34 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>> goto out_failed;
>>> }
>>> len = order_base_2(max_addr);
>>> - win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
>>> - if (!win64) {
>>> - dev_info(&dev->dev,
>>> - "couldn't allocate property for 64bit dma window\n");
>>> +
>>> + ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
>>> + if (ret != 0)
>>
>> It is usually just "if (ret)"
>
> It was previously like that, and all query_ddw() checks return value
> this way.

ah I see.

> Should I update them all or just this one?

Pick one variant and make sure all new lines use just that. In this
patch you add both variants. Thanks,

>
> Thanks!
>
>>
>>
>>> goto out_failed;
>>> - }
>>> - win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL);
>>> - win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL);
>>> - win64->length = sizeof(*ddwprop);
>>> - if (!win64->name || !win64->value) {
>>> +
>>> + dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
>>> + create.liobn, dn);
>>> +
>>> + win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
>>> + ret = ddw_property_create(&win64, DIRECT64_PROPNAME, create.liobn, win_addr,
>>> + page_shift, len);
>>> + if (ret) {
>>> dev_info(&dev->dev,
>>> - "couldn't allocate property name and value\n");
>>> + "couldn't allocate property, property name, or value\n");
>>> goto out_free_prop;
>>> }
>>>
>>> - ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
>>> - if (ret != 0)
>>> + ret = of_add_property(pdn, win64);
>>> + if (ret) {
>>> + dev_err(&dev->dev, "unable to add dma window property for %pOF: %d",
>>> + pdn, ret);
>>> goto out_free_prop;
>>> -
>>> - ddwprop->liobn = cpu_to_be32(create.liobn);
>>> - ddwprop->dma_base = cpu_to_be64(((u64)create.addr_hi << 32) |
>>> - create.addr_lo);
>>> - ddwprop->tce_shift = cpu_to_be32(page_shift);
>>> - ddwprop->window_shift = cpu_to_be32(len);
>>> -
>>> - dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
>>> - create.liobn, dn);
>>> + }
>>>
>>> /* Add new window to existing DDW list */
>>> - window = ddw_list_add(pdn, ddwprop);
>>> + window = ddw_list_add(pdn, win64->value);
>>> if (!window)
>>> - goto out_clear_window;
>>> + goto out_prop_del;
>>>
>>> ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
>>> win64->value, tce_setrange_multi_pSeriesLP_walk);
>>> @@ -1285,14 +1305,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>> goto out_free_window;
>>> }
>>>
>>> - ret = of_add_property(pdn, win64);
>>> - if (ret) {
>>> - dev_err(&dev->dev, "unable to add dma window property for %pOF: %d",
>>> - pdn, ret);
>>> - goto out_free_window;
>>> - }
>>> -
>>> - dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);
>>> + dev->dev.archdata.dma_offset = win_addr;
>>> goto out_unlock;
>>>
>>> out_free_window:
>>> @@ -1302,14 +1315,18 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>>
>>> kfree(window);
>>>
>>> -out_clear_window:
>>> - remove_ddw(pdn, true);
>>> +out_prop_del:
>>> + of_remove_property(pdn, win64);
>>>
>>> out_free_prop:
>>> - kfree(win64->name);
>>> - kfree(win64->value);
>>> - kfree(win64);
>>> - win64 = NULL;
>>> + if (win64) {
>>> + kfree(win64->name);
>>> + kfree(win64->value);
>>> + kfree(win64);
>>> + win64 = NULL;
>>> + }
>>> +
>>> + remove_ddw(pdn, true);
>>>
>>> out_failed:
>>> if (default_win_removed)
>>>
>

--
Alexey

2020-09-01 21:40:06

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift

On Mon, 2020-08-31 at 13:48 +1000, Alexey Kardashevskiy wrote:
> > > > Well, I created this TCE_RPN_BITS = 52 because the previous mask was a
> > > > hardcoded 40-bit mask (0xfffffffffful), for hard-coded 12-bit (4k)
> > > > pagesize, and on PAPR+/LoPAR also defines TCE as having bits 0-51
> > > > described as RPN, as described before.
> > > >
> > > > IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figure 3.4 and 3.5.
> > > > shows system memory mapping into a TCE, and the TCE also has bits 0-51
> > > > for the RPN (52 bits). "Table 3.6. TCE Definition" also shows it.
> > > > In fact, by the looks of those figures, the RPN_MASK should always be a
> > > > 52-bit mask, and RPN = (page >> tceshift) & RPN_MASK.
> > >
> > > I suspect the mask is there in the first place for extra protection
> > > against too big addresses going to the TCE table (or/and for virtial vs
> > > physical addresses). Using 52bit mask makes no sense for anything, you
> > > could just drop the mask and let c compiler deal with 64bit "uint" as it
> > > is basically a 4K page address anywhere in the 64bit space. Thanks,
> >
> > Assuming 4K pages you need 52 RPN bits to cover the whole 64bit
> > physical address space. The IODA3 spec does explicitly say the upper
> > bits are optional and the implementation only needs to support enough
> > to cover up to the physical address limit, which is 56bits of P9 /
> > PHB4. If you want to validate that the address will fit inside of
> > MAX_PHYSMEM_BITS then fine, but I think that should be done as a
> > WARN_ON or similar rather than just silently masking off the bits.
>
> We can do this and probably should anyway but I am also pretty sure we
> can just ditch the mask and have the hypervisor return an error which
> will show up in dmesg.

Ok then, ditching the mask.
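So the TCE build in v2 would end up as simple as (sketch):

	rpn = __pa(uaddr) >> tceshift;
	*tcep = cpu_to_be64(proto_tce | (rpn << tceshift));
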
Thanks!

2020-09-01 22:35:59

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()

On Mon, 2020-08-31 at 10:47 +1000, Alexey Kardashevskiy wrote:
> >
> > Maybe testing with host 64k pagesize and IOMMU 16MB pagesize in qemu
> > should be enough, is there any chance to get indirect mapping in qemu
> > like this? (DDW but with smaller DMA window available)
>
> You will have to hack the guest kernel to always do indirect mapping or
> hack QEMU's rtas_ibm_query_pe_dma_window() to return a small number of
> available TCEs. But you will be testing QEMU/KVM which behave quite
> differently to pHyp in this particular case.
>

As you suggested before, building for 4k cpu pagesize should be the
best approach. It would allow testing for both pHyp and qemu scenarios.
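(In practice I understand that means building the guest with
CONFIG_PPC_4K_PAGES=y instead of CONFIG_PPC_64K_PAGES=y, if I got the config
options right.)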

> > > > > Because if we want the former (==support), then we'll have to align the
> > > > > size up to the bigger page size when allocating/zeroing system pages,
> > > > > etc.
> > > >
> > > > This part I don't understand. Why do we need to align everything to the
> > > > bigger pagesize?
> > > >
> > > > I mean, is not that enough that the range [ret, ret + size[ is both
> > > > allocated by mm and mapped on a iommu range?
> > > >
> > > > Suppose a iommu_alloc_coherent() of 16kB on PAGESIZE = 4k and
> > > > IOMMU_PAGE_SIZE() == 64k.
> > > > Why 4 * cpu_pages mapped by a 64k IOMMU page is not enough?
> > > > All the space the user asked for is allocated and mapped for DMA.
> > >
> > > The user asked to map 16K, the rest - 48K - is used for something else
> > > (may be even mapped to another device) but you are making all 64K
> > > accessible by the device which only should be able to access 16K.
> > >
> > > In practice, if this happens, H_PUT_TCE will simply fail.
> >
> > I have noticed mlx5 driver getting a few bytes in a buffer, and using
> > iommu_map_page(). It does map a whole page for as few bytes as the user
>
> Whole 4K system page or whole 64K iommu page?

I tested it with a 64k system page + 64k IOMMU page.

The 64K system page may be used for anything, and a small portion of it
(say 128 bytes) needs to be used for DMA.
The whole page is mapped by the IOMMU, and the driver is told which
memory range it should access / modify.
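
What I observed is essentially this pattern (illustrative only, the sizes are
made up): with a 64K IOMMU page, the TCE ends up covering the whole page
around the buffer, not just the bytes that were requested.

	buf = kmalloc(128, GFP_KERNEL);	/* 128 bytes somewhere inside a 64K page */
	dma_handle = dma_map_single(dev, buf, 128, DMA_TO_DEVICE);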

>
> > wants mapped, and the other bytes get used for something else, or just
> > mapped on another DMA page.
> > It seems to work fine.
>
>
> With 4K system page and 64K IOMMU page? In practice it would take an
> effort or/and bad luck to see it crashing. Thanks,

I haven't tested it yet. On a 64k system page and 4k/64k iommu page, it
works as described above.

I am new to this, so I am trying to understand how a memory page that is
mapped for DMA and also used for something else could be a problem.

Thanks!

>
> > >
> > > > > Bigger pages are not the case here as I understand it.
> > > >
> > > > I did not get this part, what do you mean?
> > >
> > > Possible IOMMU page sizes are 4K, 64K, 2M, 16M, 256M, 1GB, and the
> > > supported set of sizes is different for P8/P9 and type of IO (PHB,
> > > NVLink/CAPI).
> > >
> > >
> > > > > > Update those functions to guarantee alignment with requested size
> > > > > > using IOMMU_PAGE_ALIGN() before doing iommu_alloc() / iommu_free().
> > > > > >
> > > > > > Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
> > > > > > with IOMMU_PAGE_ALIGN(n, tbl), which seems easier to read.
> > > > > >
> > > > > > Signed-off-by: Leonardo Bras <[email protected]>
> > > > > > ---
> > > > > > arch/powerpc/kernel/iommu.c | 17 +++++++++--------
> > > > > > 1 file changed, 9 insertions(+), 8 deletions(-)
> > > > > >
> > > > > > diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> > > > > > index 9704f3f76e63..d7086087830f 100644
> > > > > > --- a/arch/powerpc/kernel/iommu.c
> > > > > > +++ b/arch/powerpc/kernel/iommu.c
> > > > > > @@ -237,10 +237,9 @@ static unsigned long iommu_range_alloc(struct device *dev,
> > > > > > }
> > > > > >
> > > > > > if (dev)
> > > > > > - boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> > > > > > - 1 << tbl->it_page_shift);
> > > > > > + boundary_size = IOMMU_PAGE_ALIGN(dma_get_seg_boundary(dev) + 1, tbl);
> > > > >
> > > > > Run checkpatch.pl, should complain about a long line.
> > > >
> > > > It's 86 columns long, which is less than the new limit of 100 columns
> > > > Linus announced a few weeks ago. checkpatch.pl was updated too:
> > > > https://www.phoronix.com/scan.php?page=news_item&px=Linux-Kernel-Deprecates-80-Col
> > >
> > > Yay finally :) Thanks,
> >
> > :)
> >
> > >
> > > > > > else
> > > > > > - boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
> > > > > > + boundary_size = IOMMU_PAGE_ALIGN(1UL << 32, tbl);
> > > > > > /* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
> > > > > >
> > > > > > n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset,
> > > > > > @@ -858,6 +857,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
> > > > > > unsigned int order;
> > > > > > unsigned int nio_pages, io_order;
> > > > > > struct page *page;
> > > > > > + size_t size_io = size;
> > > > > >
> > > > > > size = PAGE_ALIGN(size);
> > > > > > order = get_order(size);
> > > > > > @@ -884,8 +884,9 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
> > > > > > memset(ret, 0, size);
> > > > > >
> > > > > > /* Set up tces to cover the allocated range */
> > > > > > - nio_pages = size >> tbl->it_page_shift;
> > > > > > - io_order = get_iommu_order(size, tbl);
> > > > > > + size_io = IOMMU_PAGE_ALIGN(size_io, tbl);
> > > > > > + nio_pages = size_io >> tbl->it_page_shift;
> > > > > > + io_order = get_iommu_order(size_io, tbl);
> > > > > > mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
> > > > > > mask >> tbl->it_page_shift, io_order, 0);
> > > > > > if (mapping == DMA_MAPPING_ERROR) {
> > > > > > @@ -900,11 +901,11 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
> > > > > > void *vaddr, dma_addr_t dma_handle)
> > > > > > {
> > > > > > if (tbl) {
> > > > > > - unsigned int nio_pages;
> > > > > > + size_t size_io = IOMMU_PAGE_ALIGN(size, tbl);
> > > > > > + unsigned int nio_pages = size_io >> tbl->it_page_shift;
> > > > > >
> > > > > > - size = PAGE_ALIGN(size);
> > > > > > - nio_pages = size >> tbl->it_page_shift;
> > > > > > iommu_free(tbl, dma_handle, nio_pages);
> > > > > > +
> > > > >
> > > > > Unrelated new line.
> > > >
> > > > Will be removed. Thanks!
> > > >
> > > > > > size = PAGE_ALIGN(size);
> > > > > > free_pages((unsigned long)vaddr, get_order(size));
> > > > > > }
> > > > > >

2020-09-02 05:29:03

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v1 08/10] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()

On Mon, 2020-08-31 at 14:34 +1000, Alexey Kardashevskiy wrote:
>
> On 29/08/2020 01:25, Leonardo Bras wrote:
> > On Mon, 2020-08-24 at 15:07 +1000, Alexey Kardashevskiy wrote:
> > > On 18/08/2020 09:40, Leonardo Bras wrote:
> > > > Code used to create a ddw property that was previously scattered in
> > > > enable_ddw() is now gathered in ddw_property_create(), which deals with
> > > > allocation and filling the property, letting it ready for
> > > > of_property_add(), which now occurs in sequence.
> > > >
> > > > This created an opportunity to reorganize the second part of enable_ddw():
> > > >
> > > > Without this patch enable_ddw() does, in order:
> > > > kzalloc() property & members, create_ddw(), fill ddwprop inside property,
> > > > ddw_list_add(), do tce_setrange_multi_pSeriesLP_walk in all memory,
> > > > of_add_property().
> > > >
> > > > With this patch enable_ddw() does, in order:
> > > > create_ddw(), ddw_property_create(), of_add_property(), ddw_list_add(),
> > > > do tce_setrange_multi_pSeriesLP_walk in all memory.
> > > >
> > > > This change requires of_remove_property() in case anything fails after
> > > > of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk
> > > > in all memory, which looks the most expensive operation, only if
> > > > everything else succeeds.
> > > >
> > > > Signed-off-by: Leonardo Bras <[email protected]>
> > > > ---
> > > > arch/powerpc/platforms/pseries/iommu.c | 97 +++++++++++++++-----------
> > > > 1 file changed, 57 insertions(+), 40 deletions(-)
> > > >
> > > > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > > > index 4031127c9537..3a1ef02ad9d5 100644
> > > > --- a/arch/powerpc/platforms/pseries/iommu.c
> > > > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > > > @@ -1123,6 +1123,31 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
> > > > ret);
> > > > }
> > > >
> > > > +static int ddw_property_create(struct property **ddw_win, const char *propname,
> > >
> > > @propname is always the same, do you really want to pass it every time?
> >
> > I think it reads better, like "create a ddw property with this name".
>
> This reads as "there are at least two ddw properties".
>
> > Also, it makes possible to create ddw properties with other names, in
> > case we decide to create properties with different names depending on
> > the window created.
>
> It is one window at any given moment, why call it different names... I
> get the part that it is not always "direct" anymore but still...
>

It seems to be the case, given one of the options you suggested on patch [09/10]:

>> I suspect it breaks kexec (from older kernel to this one) so you
>> either need to check for both DT names or just keep the old one.

>
> > Also, it's probably optimized / inlined at this point.
> > Is it ok doing it like this?
> >
> > > > + u32 liobn, u64 dma_addr, u32 page_shift, u32 window_shift)
> > > > +{
> > > > + struct dynamic_dma_window_prop *ddwprop;
> > > > + struct property *win64;
> > > > +
> > > > + *ddw_win = win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
> > > > + if (!win64)
> > > > + return -ENOMEM;
> > > > +
> > > > + win64->name = kstrdup(propname, GFP_KERNEL);
> > >
> > > Not clear why "win64->name = DIRECT64_PROPNAME" would not work here, the
> > > generic OF code does not try kfree() it but it is probably out of scope
> > > here.
> >
> > Yeah, I had that question too.
> > Previous code was like that, and I as trying not to mess too much on
> > how it's done.
> >
> > > > + ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
> > > > + win64->value = ddwprop;
> > > > + win64->length = sizeof(*ddwprop);
> > > > + if (!win64->name || !win64->value)
> > > > + return -ENOMEM;
> > >
> > > Up to 2 memory leaks here. I see the cleanup at "out_free_prop:" but
> > > still looks fragile. Instead you could simply return win64 as the only
> > > error possible here is -ENOMEM and returning NULL is equally good.
> >
> > I agree. It's better if this function have it's own cleaning routine.
> > It will be fixed for next version.
> >
> > >
> > > > +
> > > > + ddwprop->liobn = cpu_to_be32(liobn);
> > > > + ddwprop->dma_base = cpu_to_be64(dma_addr);
> > > > + ddwprop->tce_shift = cpu_to_be32(page_shift);
> > > > + ddwprop->window_shift = cpu_to_be32(window_shift);
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > /*
> > > > * If the PE supports dynamic dma windows, and there is space for a table
> > > > * that can map all pages in a linear offset, then setup such a table,
> > > > @@ -1140,12 +1165,11 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > > > struct ddw_query_response query;
> > > > struct ddw_create_response create;
> > > > int page_shift;
> > > > - u64 max_addr;
> > > > + u64 max_addr, win_addr;
> > > > struct device_node *dn;
> > > > u32 ddw_avail[DDW_APPLICABLE_SIZE];
> > > > struct direct_window *window;
> > > > - struct property *win64;
> > > > - struct dynamic_dma_window_prop *ddwprop;
> > > > + struct property *win64 = NULL;
> > > > struct failed_ddw_pdn *fpdn;
> > > > bool default_win_removed = false;
> > > >
> > > > @@ -1244,38 +1268,34 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > > > goto out_failed;
> > > > }
> > > > len = order_base_2(max_addr);
> > > > - win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
> > > > - if (!win64) {
> > > > - dev_info(&dev->dev,
> > > > - "couldn't allocate property for 64bit dma window\n");
> > > > +
> > > > + ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
> > > > + if (ret != 0)
> > >
> > > It is usually just "if (ret)"
> >
> > It was previously like that, and all query_ddw() checks return value
> > this way.
>
> ah I see.
>
> > Should I update them all or just this one?
>
> Pick one variant and make sure all new lines use just that. In this
> patch you add both variants. Thanks,

Ok, I will do that from now on.
Thanks!



2020-09-02 06:12:19

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v1 09/10] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition

On Mon, 2020-08-31 at 14:35 +1000, Alexey Kardashevskiy wrote:
>
> On 29/08/2020 04:36, Leonardo Bras wrote:
> > On Mon, 2020-08-24 at 15:17 +1000, Alexey Kardashevskiy wrote:
> > > On 18/08/2020 09:40, Leonardo Bras wrote:
> > > > As of today, if the biggest DDW that can be created can't map the whole
> > > > partition, it's creation is skipped and the default DMA window
> > > > "ibm,dma-window" is used instead.
> > > >
> > > > DDW is 16x bigger than the default DMA window,
> > >
> > > 16x only under very specific circumstances which are
> > > 1. phyp
> > > 2. sriov
> > > 3. device class in hmc (or what that priority number is in the lpar config).
> >
> > Yeah, missing details.
> >
> > > > having the same amount of
> > > > pages, but increasing the page size to 64k.
> > > > Besides larger DMA window,
> > >
> > > "Besides being larger"?
> >
> > You are right there.
> >
> > > > it performs better for allocations over 4k,
> > >
> > > Better how?
> >
> > I was thinking for allocations larger than (512 * 4k), since >2
> > hypercalls are needed here, and for 64k pages would still be just 1
> > hypercall up to (512 * 64k).
> > But yeah, not the usual case anyway.
>
> Yup.
>
>
> > > > so it would be nice to use it instead.
> > >
> > > I'd rather say something like:
> > > ===
> > > So far we assumed we can map the guest RAM 1:1 to the bus which worked
> > > with a small number of devices. SRIOV changes it as the user can
> > > configure hundreds VFs and since phyp preallocates TCEs and does not
> > > allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
> > > per a PE to limit waste of physical pages.
> > > ===
> >
> > I mixed this in my commit message, it looks like this:
> >
> > ===
> > powerpc/pseries/iommu: Make use of DDW for indirect mapping
> >
> > So far it's assumed possible to map the guest RAM 1:1 to the bus, which
> > works with a small number of devices. SRIOV changes it as the user can
> > configure hundreds VFs and since phyp preallocates TCEs and does not
> > allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
> > per a PE to limit waste of physical pages.
> >
> > As of today, if the assumed direct mapping is not possible, DDW
> > creation is skipped and the default DMA window "ibm,dma-window" is used
> > instead.
> >
> > The default DMA window uses 4k pages instead of 64k pages, and since
> > the amount of pages is the same,
>
> Is the amount really the same? I thought you can prioritize some VFs
> over others (== allocate different number of TCEs). Does it really
> matter if it is the same?

In a conversation with Travis Pizel, he explained how it's supposed to
work, and this is my understanding:

When a VF is created, it will be assigned a capacity, like 4%, 20%, and
so on. The number of 'TCE entries' available to that partition
is proportional to that capacity.

If we use the default DMA window, the IOMMU page size per entry will be 4k,
and if we use DDW, we will get a 64k page size. As the number of entries
will be the same (for the same capacity), the total space that can be
addressed by the IOMMU will be 16 times bigger. This sometimes enables
direct mapping, but sometimes it's still not enough.

In Travis' words:
"A low capacity VF, with less resources available, will certainly have
less DMA window capability than a high capacity VF. But, an 8GB DMA
window (with 64k pages) is still 16x larger than an 512MB window (with
4K pages).
A high capacity VF - for example, one that Leonardo has in his scenario
- will go from 8GB (using 4K pages) to 128GB (using 64K pages) - again,
16x larger - but it's obviously still possible to create a partition
that exceeds 128GB of memory in size."
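Just as a rough sanity check of those numbers (throwaway userspace arithmetic, not
kernel code), assuming the TCE count stays fixed and only the page size changes:

#include <stdio.h>

int main(void)
{
	unsigned long long entries = (512ULL << 20) >> 12;	/* 512MB window, 4K pages */

	printf("TCE entries: %llu\n", entries);			/* 131072 */
	printf("same entries, 64K pages: %llu GB\n",
	       (entries << 16) >> 30);				/* 8 GB, i.e. 16x larger */
	return 0;
}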

>
>
> > making use of DDW instead of the
> > default DMA window for indirect mapping will expand in 16x the amount
> > of memory that can be mapped on DMA.
>
> Stop saying "16x", it is not guaranteed by anything :)
>
>
> > The DDW created will be used for direct mapping by default. [...]
> > ===
> >
> > What do you think?
> >
> > > > The DDW created will be used for direct mapping by default.
> > > > If it's not available, indirect mapping will be used instead.
> > > >
> > > > For indirect mapping, it's necessary to update the iommu_table so
> > > > iommu_alloc() can use the DDW created. For this,
> > > > iommu_table_update_window() is called when everything else succeeds
> > > > at enable_ddw().
> > > >
> > > > Removing the default DMA window for using DDW with indirect mapping
> > > > is only allowed if there is no current IOMMU memory allocated in
> > > > the iommu_table. enable_ddw() is aborted otherwise.
> > > >
> > > > As there will never have both direct and indirect mappings at the same
> > > > time, the same property name can be used for the created DDW.
> > > >
> > > > So renaming
> > > > define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
> > > > to
> > > > define DMA64_PROPNAME "linux,dma64-ddr-window-info"
> > > > looks the right thing to do.
> > >
> > > I know I suggested this but this does not look so good anymore as I
> > > suspect it breaks kexec (from older kernel to this one) so you either
> > > need to check for both DT names or just keep the old one. Changing the
> > > macro name is fine.
> > >
> >
> > Yeah, having 'direct' in the name don't really makes sense if it's used
> > for indirect mapping. I will just add this new define instead of
> > replacing the old one, and check for both.
> > Is that ok?
>
> No, having two of these does not seem right or useful. It is pseries
> which does not use petitboot (relies on grub instead so until the target
> kernel is started, there will be no ddw) so realistically we need this
> property for kexec/kdump which uses the same kernel but different
> initramdisk so for that purpose we need the same property name.
>
> But I can see myself annoyed when I try petitboot in the hacked pseries
> qemu and things may crash :) On this basis I'd suggest keeping the name
> and adding a comment next to it that it is not always "direct" anymore.
>

Keeping the same name would bring more problems than it solves.
If we have indirect mapping and kexec() into an older kernel, that
kernel will think direct mapping is enabled, and using a DMA address
without doing H_PUT_* first may cause a crash.

I tested with a new property name, and it doesn't crash.
As the property is not found, the older kernel tries to create a new DDW,
which fails, so it falls back to using the default DMA window.
The devices that need the IOMMU don't work well, but when iommu_map()
fails, they don't try to use the DMA address as valid.

>
> > > > To make sure the property differentiates both cases, a new u32 for flags
> > > > was added at the end of the property, where BIT(0) set means direct
> > > > mapping.
> > > >
> > > > Signed-off-by: Leonardo Bras <[email protected]>
> > > > ---
> > > > arch/powerpc/platforms/pseries/iommu.c | 108 +++++++++++++++++++------
> > > > 1 file changed, 84 insertions(+), 24 deletions(-)
> > > >
> > > > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > > > index 3a1ef02ad9d5..9544e3c91ced 100644
> > > > --- a/arch/powerpc/platforms/pseries/iommu.c
> > > > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > > > @@ -350,8 +350,11 @@ struct dynamic_dma_window_prop {
> > > > __be64 dma_base; /* address hi,lo */
> > > > __be32 tce_shift; /* ilog2(tce_page_size) */
> > > > __be32 window_shift; /* ilog2(tce_window_size) */
> > > > + __be32 flags; /* DDW properties, see bellow */
> > > > };
> > > >
> > > > +#define DDW_FLAGS_DIRECT 0x01
> > >
> > > This is set if ((1<<window_shift) >= ddw_memory_hotplug_max()), you
> > > could simply check window_shift and drop the flags.
> > >
> >
> > Yeah, it's better this way, I will revert this.
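i.e. roughly this in find_existing_ddw(), instead of reading a flag (just a sketch,
reusing the existing direct64 pointer and ddw_memory_hotplug_max() from this file):

	*direct_mapping = (1ULL << be32_to_cpu(direct64->window_shift)) >= ddw_memory_hotplug_max();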
> >
> > > > +
> > > > struct direct_window {
> > > > struct device_node *device;
> > > > const struct dynamic_dma_window_prop *prop;
> > > > @@ -377,7 +380,7 @@ static LIST_HEAD(direct_window_list);
> > > > static DEFINE_SPINLOCK(direct_window_list_lock);
> > > > /* protects initializing window twice for same device */
> > > > static DEFINE_MUTEX(direct_window_init_mutex);
> > > > -#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
> > > > +#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
> > > >
> > > > static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
> > > > unsigned long num_pfn, const void *arg)
> > > > @@ -836,7 +839,7 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
> > > > if (ret)
> > > > return;
> > > >
> > > > - win = of_find_property(np, DIRECT64_PROPNAME, NULL);
> > > > + win = of_find_property(np, DMA64_PROPNAME, NULL);
> > > > if (!win)
> > > > return;
> > > >
> > > > @@ -852,7 +855,7 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
> > > > np, ret);
> > > > }
> > > >
> > > > -static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
> > > > +static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, bool *direct_mapping)
> > > > {
> > > > struct direct_window *window;
> > > > const struct dynamic_dma_window_prop *direct64;
> > > > @@ -864,6 +867,7 @@ static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
> > > > if (window->device == pdn) {
> > > > direct64 = window->prop;
> > > > *dma_addr = be64_to_cpu(direct64->dma_base);
> > > > + *direct_mapping = be32_to_cpu(direct64->flags) & DDW_FLAGS_DIRECT;
> > > > found = true;
> > > > break;
> > > > }
> > > > @@ -901,8 +905,8 @@ static int find_existing_ddw_windows(void)
> > > > if (!firmware_has_feature(FW_FEATURE_LPAR))
> > > > return 0;
> > > >
> > > > - for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
> > > > - direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
> > > > + for_each_node_with_property(pdn, DMA64_PROPNAME) {
> > > > + direct64 = of_get_property(pdn, DMA64_PROPNAME, &len);
> > > > if (!direct64)
> > > > continue;
> > > >
> > > > @@ -1124,7 +1128,8 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
> > > > }
> > > >
> > > > static int ddw_property_create(struct property **ddw_win, const char *propname,
> > > > - u32 liobn, u64 dma_addr, u32 page_shift, u32 window_shift)
> > > > + u32 liobn, u64 dma_addr, u32 page_shift,
> > > > + u32 window_shift, bool direct_mapping)
> > > > {
> > > > struct dynamic_dma_window_prop *ddwprop;
> > > > struct property *win64;
> > > > @@ -1144,6 +1149,36 @@ static int ddw_property_create(struct property **ddw_win, const char *propname,
> > > > ddwprop->dma_base = cpu_to_be64(dma_addr);
> > > > ddwprop->tce_shift = cpu_to_be32(page_shift);
> > > > ddwprop->window_shift = cpu_to_be32(window_shift);
> > > > + if (direct_mapping)
> > > > + ddwprop->flags = cpu_to_be32(DDW_FLAGS_DIRECT);
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static int iommu_table_update_window(struct iommu_table **tbl, int nid, unsigned long liobn,
> > > > + unsigned long win_addr, unsigned long page_shift,
> > > > + unsigned long window_size)
> > >
> > > Rather strange helper imho. I'd extract the most of
> > > iommu_table_setparms_lpar() into iommu_table_setparms() (except
> > > of_parse_dma_window) and call new helper from where you call
> > > iommu_table_update_window; and do
> > > iommu_pseries_alloc_table/iommu_tce_table_put there.
> > >
> >
> > I don't see how to extract iommu_table_setparms_lpar() into
> > iommu_table_setparms(), they look to be used for different machine
> > types.
> >
> > Do mean you extracting most of iommu_table_setparms_lpar() (and maybe
> > iommu_table_setparms() ) into a new helper, which is called in both
> > functions and use it instead of iommu_table_update_window() ?
>
> Yes, this.

I will do that then, seems better. :)
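Roughly like this, I imagine (field values taken from the existing setparms code and
from your comments above; the helper name below is just a placeholder):

static void iommu_table_setparms_common(struct iommu_table *tbl, unsigned long busno,
					unsigned long liobn, unsigned long win_addr,
					unsigned long window_size, unsigned long page_shift,
					unsigned long base, struct iommu_table_ops *table_ops)
{
	tbl->it_busno = busno;
	tbl->it_index = liobn;
	tbl->it_offset = win_addr >> page_shift;
	tbl->it_size = window_size >> page_shift;
	tbl->it_page_shift = page_shift;
	tbl->it_base = base;
	tbl->it_blocksize = 16;		/* fixed on pseries, as you noted */
	tbl->it_type = TCE_PCI;
	tbl->it_ops = table_ops;
}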

>
>
> > > > +{
> > > > + struct iommu_table *new_tbl, *old_tbl;
> > > > +
> > > > + new_tbl = iommu_pseries_alloc_table(nid);
> > > > + if (!new_tbl)
> > > > + return -ENOMEM;
> > > > +
> > > > + old_tbl = *tbl;
> > > > + new_tbl->it_index = liobn;
> > > > + new_tbl->it_offset = win_addr >> page_shift;
> > > > + new_tbl->it_page_shift = page_shift;
> > > > + new_tbl->it_size = window_size >> page_shift;
> > > > + new_tbl->it_base = old_tbl->it_base;
> > >
> > > Should not be used in pseries.
> > >
> >
> > The point here is to migrate the values from the older tbl to the
>
> The actual window/table is new (on the hypervisor side), you are not
> migrating a single TCE, you deleted one whole window and created another
> whole window, calling it "migration" is confusing, especially when PAPR
> actually defines TCE migration.

Ok, I understand it's confusing now. I will avoid using this term from
now on.

>
>
> > newer. I Would like to understand why this is bad, if it will still be
> > 'unused' as the older tbl.
>
> Having explicit values is more readable imho.

Ok, I understand why it should be improved!

Alexey, thank you for reviewing, and for helping me with my questions!

Best regards,

>
>
> > > > + new_tbl->it_busno = old_tbl->it_busno;
> > > > + new_tbl->it_blocksize = old_tbl->it_blocksize;
> > >
> > > 16 for pseries and does not change (may be even make it a macro).
> > >
> > > > + new_tbl->it_type = old_tbl->it_type;
> > >
> > > TCE_PCI.
> > >
> >
> > Same as above.
> >
> > > > + new_tbl->it_ops = old_tbl->it_ops;
> > > > +
> > > > + iommu_init_table(new_tbl, nid, old_tbl->it_reserved_start, old_tbl->it_reserved_end);
> > > > + iommu_tce_table_put(old_tbl);
> > > > + *tbl = new_tbl;
> > > >
> > > > return 0;
> > > > }
> > > > @@ -1171,12 +1206,16 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > > > struct direct_window *window;
> > > > struct property *win64 = NULL;
> > > > struct failed_ddw_pdn *fpdn;
> > > > - bool default_win_removed = false;
> > > > + bool default_win_removed = false, maps_whole_partition = false;
> > >
> > > s/maps_whole_partition/direct_mapping/
> > >
> >
> > Sure, I will get it replaced.
> >
> > > > + struct pci_dn *pci = PCI_DN(pdn);
> > > > + struct iommu_table *tbl = pci->table_group->tables[0];
> > > >
> > > > mutex_lock(&direct_window_init_mutex);
> > > >
> > > > - if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset))
> > > > - goto out_unlock;
> > > > + if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &maps_whole_partition)) {
> > > > + mutex_unlock(&direct_window_init_mutex);
> > > > + return maps_whole_partition;
> > > > + }
> > > >
> > > > /*
> > > > * If we already went through this for a previous function of
> > > > @@ -1258,16 +1297,24 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > > > query.page_size);
> > > > goto out_failed;
> > > > }
> > > > +
> > > > /* verify the window * number of ptes will map the partition */
> > > > - /* check largest block * page size > max memory hotplug addr */
> > > > max_addr = ddw_memory_hotplug_max();
> > > > if (query.largest_available_block < (max_addr >> page_shift)) {
> > > > - dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
> > > > - "%llu-sized pages\n", max_addr, query.largest_available_block,
> > > > - 1ULL << page_shift);
> > > > - goto out_failed;
> > > > + dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu %llu-sized pages\n",
> > > > + max_addr, query.largest_available_block,
> > > > + 1ULL << page_shift);
> > > > +
> > > > + len = order_base_2(query.largest_available_block << page_shift);
> > > > + } else {
> > > > + maps_whole_partition = true;
> > > > + len = order_base_2(max_addr);
> > > > }
> > > > - len = order_base_2(max_addr);
> > > > +
> > > > + /* DDW + IOMMU on single window may fail if there is any allocation */
> > > > + if (default_win_removed && !maps_whole_partition &&
> > > > + iommu_table_in_use(tbl))
> > > > + goto out_failed;
> > > >
> > > > ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
> > > > if (ret != 0)
> > > > @@ -1277,8 +1324,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > > > create.liobn, dn);
> > > >
> > > > win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
> > > > - ret = ddw_property_create(&win64, DIRECT64_PROPNAME, create.liobn, win_addr,
> > > > - page_shift, len);
> > > > + ret = ddw_property_create(&win64, DMA64_PROPNAME, create.liobn, win_addr,
> > > > + page_shift, len, maps_whole_partition);
> > > > if (ret) {
> > > > dev_info(&dev->dev,
> > > > "couldn't allocate property, property name, or value\n");
> > > > @@ -1297,12 +1344,25 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > > > if (!window)
> > > > goto out_prop_del;
> > > >
> > > > - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> > > > - win64->value, tce_setrange_multi_pSeriesLP_walk);
> > > > - if (ret) {
> > > > - dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
> > > > - dn, ret);
> > > > - goto out_free_window;
> > > > + if (maps_whole_partition) {
> > > > + /* DDW maps the whole partition, so enable direct DMA mapping */
> > > > + ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> > > > + win64->value, tce_setrange_multi_pSeriesLP_walk);
> > > > + if (ret) {
> > > > + dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
> > > > + dn, ret);
> > > > + goto out_free_window;
> > > > + }
> > > > + } else {
> > > > + /* New table for using DDW instead of the default DMA window */
> > > > + if (iommu_table_update_window(&tbl, pci->phb->node, create.liobn,
> > > > + win_addr, page_shift, 1UL << len))
> > > > + goto out_free_window;
> > > > +
> > > > + set_iommu_table_base(&dev->dev, tbl);
> > > > + WARN_ON(dev->dev.archdata.dma_offset >= SZ_4G);
> > >
> > > What is this check for exactly? Why 4G, not >= 0, for example?
> >
> > I am not really sure, you suggested adding it here:
> > http://patchwork.ozlabs.org/project/linuxppc-dev/patch/[email protected]/#2488874
>
> Ah right I did suggest this :) My bad. I think I suggested it before
> suggesting to keep the reserved area boundaries checked/adjusted to the
> window boundaries, may as well drop this. Thanks,
>
>
> > I can remove it if it's ok.
> >
> > > > + goto out_unlock;
> > > > +
> > > > }
> > > >
> > > > dev->dev.archdata.dma_offset = win_addr;
> > > > @@ -1340,7 +1400,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > > >
> > > > out_unlock:
> > > > mutex_unlock(&direct_window_init_mutex);
> > > > - return win64;
> > > > + return win64 && maps_whole_partition;
> > > > }
> > > >
> > > > static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
> > > >

2020-09-03 04:31:07

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift



On 02/09/2020 07:38, Leonardo Bras wrote:
> On Mon, 2020-08-31 at 13:48 +1000, Alexey Kardashevskiy wrote:
>>>>> Well, I created this TCE_RPN_BITS = 52 because the previous mask was a
>>>>> hardcoded 40-bit mask (0xfffffffffful), for hard-coded 12-bit (4k)
>>>>> pagesize, and on PAPR+/LoPAR also defines TCE as having bits 0-51
>>>>> described as RPN, as described before.
>>>>>
>>>>> IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figure 3.4 and 3.5.
>>>>> shows system memory mapping into a TCE, and the TCE also has bits 0-51
>>>>> for the RPN (52 bits). "Table 3.6. TCE Definition" also shows it.
>>>>> In fact, by the looks of those figures, the RPN_MASK should always be a
>>>>> 52-bit mask, and RPN = (page >> tceshift) & RPN_MASK.
>>>>
>>>> I suspect the mask is there in the first place for extra protection
>>>> against too big addresses going to the TCE table (or/and for virtial vs
>>>> physical addresses). Using 52bit mask makes no sense for anything, you
>>>> could just drop the mask and let c compiler deal with 64bit "uint" as it
>>>> is basically a 4K page address anywhere in the 64bit space. Thanks,
>>>
>>> Assuming 4K pages you need 52 RPN bits to cover the whole 64bit
>>> physical address space. The IODA3 spec does explicitly say the upper
>>> bits are optional and the implementation only needs to support enough
>>> to cover up to the physical address limit, which is 56bits of P9 /
>>> PHB4. If you want to validate that the address will fit inside of
>>> MAX_PHYSMEM_BITS then fine, but I think that should be done as a
>>> WARN_ON or similar rather than just silently masking off the bits.
>>
>> We can do this and probably should anyway but I am also pretty sure we
>> can just ditch the mask and have the hypervisor return an error which
>> will show up in dmesg.
>
> Ok then, ditching the mask.


Well, you could run a little experiment and set some bits above that old
mask and see how phyp reacts :)


> Thanks!
>

--
Alexey

2020-09-03 04:45:30

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()



On 02/09/2020 08:34, Leonardo Bras wrote:
> On Mon, 2020-08-31 at 10:47 +1000, Alexey Kardashevskiy wrote:
>>>
>>> Maybe testing with host 64k pagesize and IOMMU 16MB pagesize in qemu
>>> should be enough, is there any chance to get indirect mapping in qemu
>>> like this? (DDW but with smaller DMA window available)
>>
>> You will have to hack the guest kernel to always do indirect mapping or
>> hack QEMU's rtas_ibm_query_pe_dma_window() to return a small number of
>> available TCEs. But you will be testing QEMU/KVM which behave quite
>> differently to pHyp in this particular case.
>>
>
> As you suggested before, building for 4k cpu pagesize should be the
> best approach. It would allow testing for both pHyp and qemu scenarios.
>
>>>>>> Because if we want the former (==support), then we'll have to align the
>>>>>> size up to the bigger page size when allocating/zeroing system pages,
>>>>>> etc.
>>>>>
>>>>> This part I don't understand. Why do we need to align everything to the
>>>>> bigger pagesize?
>>>>>
>>>>> I mean, is not that enough that the range [ret, ret + size[ is both
>>>>> allocated by mm and mapped on a iommu range?
>>>>>
>>>>> Suppose a iommu_alloc_coherent() of 16kB on PAGESIZE = 4k and
>>>>> IOMMU_PAGE_SIZE() == 64k.
>>>>> Why 4 * cpu_pages mapped by a 64k IOMMU page is not enough?
>>>>> All the space the user asked for is allocated and mapped for DMA.
>>>>
>>>> The user asked to map 16K, the rest - 48K - is used for something else
>>>> (may be even mapped to another device) but you are making all 64K
>>>> accessible by the device which only should be able to access 16K.
>>>>
>>>> In practice, if this happens, H_PUT_TCE will simply fail.
>>>
>>> I have noticed mlx5 driver getting a few bytes in a buffer, and using
>>> iommu_map_page(). It does map a whole page for as few bytes as the user
>>
>> Whole 4K system page or whole 64K iommu page?
>
> I tested it in 64k system page + 64k iommu page.
>
> The 64K system page may be used for anything, and a small portion of it
> (say 128 bytes) needs to be used for DMA.
> The whole page is mapped by IOMMU, and the driver gets info of the
> memory range it should access / modify.


This works because the whole system page belongs to the same memory
context and the IOMMU allows a device to access that page. You can still
have problems if there is a bug within the page, but it will go mostly
unnoticed as it will be memory corruption.

If your system page is smaller (4K) than the IOMMU page (64K), then the
device gets wider access than it should, but it is still going to be
silent memory corruption.
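To put numbers on it (plain userspace arithmetic, nothing kernel specific): mapping a
small buffer still pins a whole IOMMU page, so the device can reach everything else
that happens to share it.

#include <stdio.h>

int main(void)
{
	unsigned long sys_page = 4UL << 10;	/* 4K system page */
	unsigned long io_page  = 64UL << 10;	/* 64K IOMMU page */
	unsigned long asked    = 16UL << 10;	/* driver maps 16K */

	/* one IOMMU page is the smallest mapping granule */
	unsigned long mapped = (asked + io_page - 1) & ~(io_page - 1);

	printf("%lu extra bytes reachable by the device (%lu system pages)\n",
	       mapped - asked, (mapped - asked) / sys_page);	/* 49152 bytes, 12 pages */
	return 0;
}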


>
>>
>>> wants mapped, and the other bytes get used for something else, or just
>>> mapped on another DMA page.
>>> It seems to work fine.
>>
>>
>> With 4K system page and 64K IOMMU page? In practice it would take an
>> effort or/and bad luck to see it crashing. Thanks,
>
> I haven't tested it yet. On a 64k system page and 4k/64k iommu page, it
> works as described above.
>
> I am new to this, so I am trying to understand how a memory page mapped
> as DMA, and used for something else could be a problem.

From the device's perspective, there is PCI space and everything from 0
to 1<<64 is accessible, and what that is mapped to the device does
not know. The PHB's IOMMU is the thing that notices an invalid access and raises
EEH, but the PHB only knows about the PCI->physical memory mapping (with IOMMU
pages) and nothing about the host kernel pages. Does this help? Thanks,


>
> Thanks!
>
>>
>>>>
>>>>>> Bigger pages are not the case here as I understand it.
>>>>>
>>>>> I did not get this part, what do you mean?
>>>>
>>>> Possible IOMMU page sizes are 4K, 64K, 2M, 16M, 256M, 1GB, and the
>>>> supported set of sizes is different for P8/P9 and type of IO (PHB,
>>>> NVLink/CAPI).
>>>>
>>>>
>>>>>>> Update those functions to guarantee alignment with requested size
>>>>>>> using IOMMU_PAGE_ALIGN() before doing iommu_alloc() / iommu_free().
>>>>>>>
>>>>>>> Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
>>>>>>> with IOMMU_PAGE_ALIGN(n, tbl), which seems easier to read.
>>>>>>>
>>>>>>> Signed-off-by: Leonardo Bras <[email protected]>
>>>>>>> ---
>>>>>>> arch/powerpc/kernel/iommu.c | 17 +++++++++--------
>>>>>>> 1 file changed, 9 insertions(+), 8 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>>>>>>> index 9704f3f76e63..d7086087830f 100644
>>>>>>> --- a/arch/powerpc/kernel/iommu.c
>>>>>>> +++ b/arch/powerpc/kernel/iommu.c
>>>>>>> @@ -237,10 +237,9 @@ static unsigned long iommu_range_alloc(struct device *dev,
>>>>>>> }
>>>>>>>
>>>>>>> if (dev)
>>>>>>> - boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
>>>>>>> - 1 << tbl->it_page_shift);
>>>>>>> + boundary_size = IOMMU_PAGE_ALIGN(dma_get_seg_boundary(dev) + 1, tbl);
>>>>>>
>>>>>> Run checkpatch.pl, should complain about a long line.
>>>>>
>>>>> It's 86 columns long, which is less than the new limit of 100 columns
>>>>> Linus announced a few weeks ago. checkpatch.pl was updated too:
>>>>> https://www.phoronix.com/scan.php?page=news_item&px=Linux-Kernel-Deprecates-80-Col
>>>>
>>>> Yay finally :) Thanks,
>>>
>>> :)
>>>
>>>>
>>>>>>> else
>>>>>>> - boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
>>>>>>> + boundary_size = IOMMU_PAGE_ALIGN(1UL << 32, tbl);
>>>>>>> /* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
>>>>>>>
>>>>>>> n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset,
>>>>>>> @@ -858,6 +857,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
>>>>>>> unsigned int order;
>>>>>>> unsigned int nio_pages, io_order;
>>>>>>> struct page *page;
>>>>>>> + size_t size_io = size;
>>>>>>>
>>>>>>> size = PAGE_ALIGN(size);
>>>>>>> order = get_order(size);
>>>>>>> @@ -884,8 +884,9 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
>>>>>>> memset(ret, 0, size);
>>>>>>>
>>>>>>> /* Set up tces to cover the allocated range */
>>>>>>> - nio_pages = size >> tbl->it_page_shift;
>>>>>>> - io_order = get_iommu_order(size, tbl);
>>>>>>> + size_io = IOMMU_PAGE_ALIGN(size_io, tbl);
>>>>>>> + nio_pages = size_io >> tbl->it_page_shift;
>>>>>>> + io_order = get_iommu_order(size_io, tbl);
>>>>>>> mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
>>>>>>> mask >> tbl->it_page_shift, io_order, 0);
>>>>>>> if (mapping == DMA_MAPPING_ERROR) {
>>>>>>> @@ -900,11 +901,11 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
>>>>>>> void *vaddr, dma_addr_t dma_handle)
>>>>>>> {
>>>>>>> if (tbl) {
>>>>>>> - unsigned int nio_pages;
>>>>>>> + size_t size_io = IOMMU_PAGE_ALIGN(size, tbl);
>>>>>>> + unsigned int nio_pages = size_io >> tbl->it_page_shift;
>>>>>>>
>>>>>>> - size = PAGE_ALIGN(size);
>>>>>>> - nio_pages = size >> tbl->it_page_shift;
>>>>>>> iommu_free(tbl, dma_handle, nio_pages);
>>>>>>> +
>>>>>>
>>>>>> Unrelated new line.
>>>>>
>>>>> Will be removed. Thanks!
>>>>>
>>>>>>> size = PAGE_ALIGN(size);
>>>>>>> free_pages((unsigned long)vaddr, get_order(size));
>>>>>>> }
>>>>>>>
>

--
Alexey

2020-09-04 01:02:39

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 09/10] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition



On 02/09/2020 16:11, Leonardo Bras wrote:
> On Mon, 2020-08-31 at 14:35 +1000, Alexey Kardashevskiy wrote:
>>
>> On 29/08/2020 04:36, Leonardo Bras wrote:
>>> On Mon, 2020-08-24 at 15:17 +1000, Alexey Kardashevskiy wrote:
>>>> On 18/08/2020 09:40, Leonardo Bras wrote:
>>>>> As of today, if the biggest DDW that can be created can't map the whole
>>>>> partition, it's creation is skipped and the default DMA window
>>>>> "ibm,dma-window" is used instead.
>>>>>
>>>>> DDW is 16x bigger than the default DMA window,
>>>>
>>>> 16x only under very specific circumstances which are
>>>> 1. phyp
>>>> 2. sriov
>>>> 3. device class in hmc (or what that priority number is in the lpar config).
>>>
>>> Yeah, missing details.
>>>
>>>>> having the same amount of
>>>>> pages, but increasing the page size to 64k.
>>>>> Besides larger DMA window,
>>>>
>>>> "Besides being larger"?
>>>
>>> You are right there.
>>>
>>>>> it performs better for allocations over 4k,
>>>>
>>>> Better how?
>>>
>>> I was thinking for allocations larger than (512 * 4k), since >2
>>> hypercalls are needed here, and for 64k pages would still be just 1
>>> hypercall up to (512 * 64k).
>>> But yeah, not the usual case anyway.
>>
>> Yup.
>>
>>
>>>>> so it would be nice to use it instead.
>>>>
>>>> I'd rather say something like:
>>>> ===
>>>> So far we assumed we can map the guest RAM 1:1 to the bus which worked
>>>> with a small number of devices. SRIOV changes it as the user can
>>>> configure hundreds VFs and since phyp preallocates TCEs and does not
>>>> allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
>>>> per a PE to limit waste of physical pages.
>>>> ===
>>>
>>> I mixed this in my commit message, it looks like this:
>>>
>>> ===
>>> powerpc/pseries/iommu: Make use of DDW for indirect mapping
>>>
>>> So far it's assumed possible to map the guest RAM 1:1 to the bus, which
>>> works with a small number of devices. SRIOV changes it as the user can
>>> configure hundreds VFs and since phyp preallocates TCEs and does not
>>> allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
>>> per a PE to limit waste of physical pages.
>>>
>>> As of today, if the assumed direct mapping is not possible, DDW
>>> creation is skipped and the default DMA window "ibm,dma-window" is used
>>> instead.
>>>
>>> The default DMA window uses 4k pages instead of 64k pages, and since
>>> the amount of pages is the same,
>>
>> Is the amount really the same? I thought you can prioritize some VFs
>> over others (== allocate different number of TCEs). Does it really
>> matter if it is the same?
>
> On a conversation with Travis Pizel, he explained how it's supposed to
> work, and I understood this:
>
> When a VF is created, it will be assigned a capacity, like 4%, 20%, and
> so on. The number of 'TCE entries' that are available to that partition
> are proportional to that capacity.
>
> If we use the default DMA window, the IOMMU pagesize/entry will be 4k,
> and if we use DDW, we will get 64k pagesize. As the number of entries
> will be the same (for the same capacity), the total space that can be
> addressed by the IOMMU will be 16 times bigger. This sometimes enable
> direct mapping, but sometimes it's still not enough.


Good to know. This is still an implementation detail though; QEMU does not
allocate TCEs like this.


>
> On Travis words :
> "A low capacity VF, with less resources available, will certainly have
> less DMA window capability than a high capacity VF. But, an 8GB DMA
> window (with 64k pages) is still 16x larger than an 512MB window (with
> 4K pages).
> A high capacity VF - for example, one that Leonardo has in his scenario
> - will go from 8GB (using 4K pages) to 128GB (using 64K pages) - again,
> 16x larger - but it's obviously still possible to create a partition
> that exceeds 128GB of memory in size."


Right, except the default DMA window is not 8GB, it is <=2GB.


>
>>
>>
>>> making use of DDW instead of the
>>> default DMA window for indirect mapping will expand in 16x the amount
>>> of memory that can be mapped on DMA.
>>
>> Stop saying "16x", it is not guaranteed by anything :)
>>
>>
>>> The DDW created will be used for direct mapping by default. [...]
>>> ===
>>>
>>> What do you think?
>>>
>>>>> The DDW created will be used for direct mapping by default.
>>>>> If it's not available, indirect mapping will be used instead.
>>>>>
>>>>> For indirect mapping, it's necessary to update the iommu_table so
>>>>> iommu_alloc() can use the DDW created. For this,
>>>>> iommu_table_update_window() is called when everything else succeeds
>>>>> at enable_ddw().
>>>>>
>>>>> Removing the default DMA window for using DDW with indirect mapping
>>>>> is only allowed if there is no current IOMMU memory allocated in
>>>>> the iommu_table. enable_ddw() is aborted otherwise.
>>>>>
>>>>> As there will never have both direct and indirect mappings at the same
>>>>> time, the same property name can be used for the created DDW.
>>>>>
>>>>> So renaming
>>>>> define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
>>>>> to
>>>>> define DMA64_PROPNAME "linux,dma64-ddr-window-info"
>>>>> looks the right thing to do.
>>>>
>>>> I know I suggested this but this does not look so good anymore as I
>>>> suspect it breaks kexec (from older kernel to this one) so you either
>>>> need to check for both DT names or just keep the old one. Changing the
>>>> macro name is fine.
>>>>
>>>
>>> Yeah, having 'direct' in the name don't really makes sense if it's used
>>> for indirect mapping. I will just add this new define instead of
>>> replacing the old one, and check for both.
>>> Is that ok?
>>
>> No, having two of these does not seem right or useful. It is pseries
>> which does not use petitboot (relies on grub instead so until the target
>> kernel is started, there will be no ddw) so realistically we need this
>> property for kexec/kdump which uses the same kernel but different
>> initramdisk so for that purpose we need the same property name.
>>
>> But I can see myself annoyed when I try petitboot in the hacked pseries
>> qemu and things may crash :) On this basis I'd suggest keeping the name
>> and adding a comment next to it that it is not always "direct" anymore.
>>
>
> Keeping the same name should bring more problems than solve.
> If we have indirect mapping and kexec() to an older kernel, it will
> think direct mapping is enabled, and trying to use a DMA address
> without doing H_PUT_* first may cause a crash.
>
> I tested with a new property name, and it doesn't crash.
> As the property is not found, it does try to create a new DDW, which
> fails and it falls back to using the default DMA window.
> The device that need the IOMMU don't work well, but when iommu_map()
> fails, it doesn't try to use the DMA address as valid.


Right, as discussed on slack.

>
>>
>>>>> To make sure the property differentiates both cases, a new u32 for flags
>>>>> was added at the end of the property, where BIT(0) set means direct
>>>>> mapping.
>>>>>
>>>>> Signed-off-by: Leonardo Bras <[email protected]>
>>>>> ---
>>>>> arch/powerpc/platforms/pseries/iommu.c | 108 +++++++++++++++++++------
>>>>> 1 file changed, 84 insertions(+), 24 deletions(-)
>>>>>
>>>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>>>>> index 3a1ef02ad9d5..9544e3c91ced 100644
>>>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>>>> @@ -350,8 +350,11 @@ struct dynamic_dma_window_prop {
>>>>> __be64 dma_base; /* address hi,lo */
>>>>> __be32 tce_shift; /* ilog2(tce_page_size) */
>>>>> __be32 window_shift; /* ilog2(tce_window_size) */
>>>>> + __be32 flags; /* DDW properties, see bellow */
>>>>> };
>>>>>
>>>>> +#define DDW_FLAGS_DIRECT 0x01
>>>>
>>>> This is set if ((1<<window_shift) >= ddw_memory_hotplug_max()), you
>>>> could simply check window_shift and drop the flags.
>>>>
>>>
>>> Yeah, it's better this way, I will revert this.
>>>
>>>>> +
>>>>> struct direct_window {
>>>>> struct device_node *device;
>>>>> const struct dynamic_dma_window_prop *prop;
>>>>> @@ -377,7 +380,7 @@ static LIST_HEAD(direct_window_list);
>>>>> static DEFINE_SPINLOCK(direct_window_list_lock);
>>>>> /* protects initializing window twice for same device */
>>>>> static DEFINE_MUTEX(direct_window_init_mutex);
>>>>> -#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
>>>>> +#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
>>>>>
>>>>> static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
>>>>> unsigned long num_pfn, const void *arg)
>>>>> @@ -836,7 +839,7 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
>>>>> if (ret)
>>>>> return;
>>>>>
>>>>> - win = of_find_property(np, DIRECT64_PROPNAME, NULL);
>>>>> + win = of_find_property(np, DMA64_PROPNAME, NULL);
>>>>> if (!win)
>>>>> return;
>>>>>
>>>>> @@ -852,7 +855,7 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
>>>>> np, ret);
>>>>> }
>>>>>
>>>>> -static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
>>>>> +static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, bool *direct_mapping)
>>>>> {
>>>>> struct direct_window *window;
>>>>> const struct dynamic_dma_window_prop *direct64;
>>>>> @@ -864,6 +867,7 @@ static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
>>>>> if (window->device == pdn) {
>>>>> direct64 = window->prop;
>>>>> *dma_addr = be64_to_cpu(direct64->dma_base);
>>>>> + *direct_mapping = be32_to_cpu(direct64->flags) & DDW_FLAGS_DIRECT;
>>>>> found = true;
>>>>> break;
>>>>> }
>>>>> @@ -901,8 +905,8 @@ static int find_existing_ddw_windows(void)
>>>>> if (!firmware_has_feature(FW_FEATURE_LPAR))
>>>>> return 0;
>>>>>
>>>>> - for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
>>>>> - direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
>>>>> + for_each_node_with_property(pdn, DMA64_PROPNAME) {
>>>>> + direct64 = of_get_property(pdn, DMA64_PROPNAME, &len);
>>>>> if (!direct64)
>>>>> continue;
>>>>>
>>>>> @@ -1124,7 +1128,8 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
>>>>> }
>>>>>
>>>>> static int ddw_property_create(struct property **ddw_win, const char *propname,
>>>>> - u32 liobn, u64 dma_addr, u32 page_shift, u32 window_shift)
>>>>> + u32 liobn, u64 dma_addr, u32 page_shift,
>>>>> + u32 window_shift, bool direct_mapping)
>>>>> {
>>>>> struct dynamic_dma_window_prop *ddwprop;
>>>>> struct property *win64;
>>>>> @@ -1144,6 +1149,36 @@ static int ddw_property_create(struct property **ddw_win, const char *propname,
>>>>> ddwprop->dma_base = cpu_to_be64(dma_addr);
>>>>> ddwprop->tce_shift = cpu_to_be32(page_shift);
>>>>> ddwprop->window_shift = cpu_to_be32(window_shift);
>>>>> + if (direct_mapping)
>>>>> + ddwprop->flags = cpu_to_be32(DDW_FLAGS_DIRECT);
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> +static int iommu_table_update_window(struct iommu_table **tbl, int nid, unsigned long liobn,
>>>>> + unsigned long win_addr, unsigned long page_shift,
>>>>> + unsigned long window_size)
>>>>
>>>> Rather strange helper imho. I'd extract the most of
>>>> iommu_table_setparms_lpar() into iommu_table_setparms() (except
>>>> of_parse_dma_window) and call new helper from where you call
>>>> iommu_table_update_window; and do
>>>> iommu_pseries_alloc_table/iommu_tce_table_put there.
>>>>
>>>
>>> I don't see how to extract iommu_table_setparms_lpar() into
>>> iommu_table_setparms(), they look to be used for different machine
>>> types.
>>>
>>> Do mean you extracting most of iommu_table_setparms_lpar() (and maybe
>>> iommu_table_setparms() ) into a new helper, which is called in both
>>> functions and use it instead of iommu_table_update_window() ?
>>
>> Yes, this.
>
> I will do that then, seems better. :)
>
>>
>>
>>>>> +{
>>>>> + struct iommu_table *new_tbl, *old_tbl;
>>>>> +
>>>>> + new_tbl = iommu_pseries_alloc_table(nid);
>>>>> + if (!new_tbl)
>>>>> + return -ENOMEM;
>>>>> +
>>>>> + old_tbl = *tbl;
>>>>> + new_tbl->it_index = liobn;
>>>>> + new_tbl->it_offset = win_addr >> page_shift;
>>>>> + new_tbl->it_page_shift = page_shift;
>>>>> + new_tbl->it_size = window_size >> page_shift;
>>>>> + new_tbl->it_base = old_tbl->it_base;
>>>>
>>>> Should not be used in pseries.
>>>>
>>>
>>> The point here is to migrate the values from the older tbl to the
>>
>> The actual window/table is new (on the hypervisor side), you are not
>> migrating a single TCE, you deleted one whole window and created another
>> whole window, calling it "migration" is confusing, especially when PAPR
>> actually defines TCE migration.
>
> Ok, I understand it's confusing now. I will avoid using this term from
> now on.
>
>>
>>
>>> newer. I Would like to understand why this is bad, if it will still be
>>> 'unused' as the older tbl.
>>
>> Having explicit values is more readable imho.
>
> Ok, I understand why it should be improved.!
>
> Alexey, thank you for reviewing, and for helping me with my questions!


Thanks for doing this all!


>
> Best regards,
>
>>
>>
>>>>> + new_tbl->it_busno = old_tbl->it_busno;
>>>>> + new_tbl->it_blocksize = old_tbl->it_blocksize;
>>>>
>>>> 16 for pseries and does not change (may be even make it a macro).
>>>>
>>>>> + new_tbl->it_type = old_tbl->it_type;
>>>>
>>>> TCE_PCI.
>>>>
>>>
>>> Same as above.
>>>
>>>>> + new_tbl->it_ops = old_tbl->it_ops;
>>>>> +
>>>>> + iommu_init_table(new_tbl, nid, old_tbl->it_reserved_start, old_tbl->it_reserved_end);
>>>>> + iommu_tce_table_put(old_tbl);
>>>>> + *tbl = new_tbl;
>>>>>
>>>>> return 0;
>>>>> }
>>>>> @@ -1171,12 +1206,16 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>>>> struct direct_window *window;
>>>>> struct property *win64 = NULL;
>>>>> struct failed_ddw_pdn *fpdn;
>>>>> - bool default_win_removed = false;
>>>>> + bool default_win_removed = false, maps_whole_partition = false;
>>>>
>>>> s/maps_whole_partition/direct_mapping/
>>>>
>>>
>>> Sure, I will get it replaced.
>>>
>>>>> + struct pci_dn *pci = PCI_DN(pdn);
>>>>> + struct iommu_table *tbl = pci->table_group->tables[0];
>>>>>
>>>>> mutex_lock(&direct_window_init_mutex);
>>>>>
>>>>> - if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset))
>>>>> - goto out_unlock;
>>>>> + if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &maps_whole_partition)) {
>>>>> + mutex_unlock(&direct_window_init_mutex);
>>>>> + return maps_whole_partition;
>>>>> + }
>>>>>
>>>>> /*
>>>>> * If we already went through this for a previous function of
>>>>> @@ -1258,16 +1297,24 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>>>> query.page_size);
>>>>> goto out_failed;
>>>>> }
>>>>> +
>>>>> /* verify the window * number of ptes will map the partition */
>>>>> - /* check largest block * page size > max memory hotplug addr */
>>>>> max_addr = ddw_memory_hotplug_max();
>>>>> if (query.largest_available_block < (max_addr >> page_shift)) {
>>>>> - dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
>>>>> - "%llu-sized pages\n", max_addr, query.largest_available_block,
>>>>> - 1ULL << page_shift);
>>>>> - goto out_failed;
>>>>> + dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu %llu-sized pages\n",
>>>>> + max_addr, query.largest_available_block,
>>>>> + 1ULL << page_shift);
>>>>> +
>>>>> + len = order_base_2(query.largest_available_block << page_shift);
>>>>> + } else {
>>>>> + maps_whole_partition = true;
>>>>> + len = order_base_2(max_addr);
>>>>> }
>>>>> - len = order_base_2(max_addr);
>>>>> +
>>>>> + /* DDW + IOMMU on single window may fail if there is any allocation */
>>>>> + if (default_win_removed && !maps_whole_partition &&
>>>>> + iommu_table_in_use(tbl))
>>>>> + goto out_failed;
>>>>>
>>>>> ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
>>>>> if (ret != 0)
>>>>> @@ -1277,8 +1324,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>>>> create.liobn, dn);
>>>>>
>>>>> win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
>>>>> - ret = ddw_property_create(&win64, DIRECT64_PROPNAME, create.liobn, win_addr,
>>>>> - page_shift, len);
>>>>> + ret = ddw_property_create(&win64, DMA64_PROPNAME, create.liobn, win_addr,
>>>>> + page_shift, len, maps_whole_partition);
>>>>> if (ret) {
>>>>> dev_info(&dev->dev,
>>>>> "couldn't allocate property, property name, or value\n");
>>>>> @@ -1297,12 +1344,25 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>>>> if (!window)
>>>>> goto out_prop_del;
>>>>>
>>>>> - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
>>>>> - win64->value, tce_setrange_multi_pSeriesLP_walk);
>>>>> - if (ret) {
>>>>> - dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
>>>>> - dn, ret);
>>>>> - goto out_free_window;
>>>>> + if (maps_whole_partition) {
>>>>> + /* DDW maps the whole partition, so enable direct DMA mapping */
>>>>> + ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
>>>>> + win64->value, tce_setrange_multi_pSeriesLP_walk);
>>>>> + if (ret) {
>>>>> + dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
>>>>> + dn, ret);
>>>>> + goto out_free_window;
>>>>> + }
>>>>> + } else {
>>>>> + /* New table for using DDW instead of the default DMA window */
>>>>> + if (iommu_table_update_window(&tbl, pci->phb->node, create.liobn,
>>>>> + win_addr, page_shift, 1UL << len))
>>>>> + goto out_free_window;
>>>>> +
>>>>> + set_iommu_table_base(&dev->dev, tbl);
>>>>> + WARN_ON(dev->dev.archdata.dma_offset >= SZ_4G);
>>>>
>>>> What is this check for exactly? Why 4G, not >= 0, for example?
>>>
>>> I am not really sure, you suggested adding it here:
>>> http://patchwork.ozlabs.org/project/linuxppc-dev/patch/[email protected]/#2488874
>>
>> Ah right I did suggest this :) My bad. I think I suggested it before
>> suggesting to keep the reserved area boundaries checked/adjusted to the
>> window boundaries, may as well drop this. Thanks,
>>
>>
>>> I can remove it if it's ok.
>>>
>>>>> + goto out_unlock;
>>>>> +
>>>>> }
>>>>>
>>>>> dev->dev.archdata.dma_offset = win_addr;
>>>>> @@ -1340,7 +1400,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>>>>
>>>>> out_unlock:
>>>>> mutex_unlock(&direct_window_init_mutex);
>>>>> - return win64;
>>>>> + return win64 && maps_whole_partition;
>>>>> }
>>>>>
>>>>> static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
>>>>>
>

--
Alexey

2020-09-04 06:07:47

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()

On Thu, 2020-09-03 at 14:41 +1000, Alexey Kardashevskiy wrote:
> I am new to this, so I am trying to understand how a memory page mapped
> > as DMA, and used for something else could be a problem.
>
> From the device prospective, there is PCI space and everything from 0
> till 1<<64 is accessible and what is that mapped to - the device does
> not know. PHB's IOMMU is the thing to notice invalid access and raise
> EEH but PHB only knows about PCI->physical memory mapping (with IOMMU
> pages) but nothing about the host kernel pages. Does this help? Thanks,

According to our conversation on Slack:
1- There is a problem if a hypervisor gives its VMs contiguous
memory blocks that are not aligned to IOMMU pages, because then an
iommu_map_page() could map some memory in this VM and some memory in
another VM / process.
2- To avoid this, we should have system pagesize >= iommu_pagesize

One way to get (2) is by doing this in enable_ddw():
if ((query.page_size & 4) && PAGE_SHIFT >= 24) {
	page_shift = 24; /* 16MB */
} else if ((query.page_size & 2) && PAGE_SHIFT >= 16) {
	page_shift = 16; /* 64kB */
} else if ((query.page_size & 1) && PAGE_SHIFT >= 12) {
	page_shift = 12; /* 4kB */
[...]

Another way of solving this would be adding to the LoPAR documentation
that the block size of contiguous memory the hypervisor gives a VM
should always be aligned to the IOMMU page sizes offered.

I think the best approach would be to first send the above patch, which
is faster, and then work on adding that to the documentation, so
hypervisors guarantee this.

If this gets into the docs, we can revert the patch.

What do you think?

Best regards!

2020-09-08 03:20:46

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()



On 04/09/2020 16:04, Leonardo Bras wrote:
> On Thu, 2020-09-03 at 14:41 +1000, Alexey Kardashevskiy wrote:
>> I am new to this, so I am trying to understand how a memory page mapped
>>> as DMA, and used for something else could be a problem.
>>
>> From the device prospective, there is PCI space and everything from 0
>> till 1<<64 is accessible and what is that mapped to - the device does
>> not know. PHB's IOMMU is the thing to notice invalid access and raise
>> EEH but PHB only knows about PCI->physical memory mapping (with IOMMU
>> pages) but nothing about the host kernel pages. Does this help? Thanks,
>
> According to our conversation on Slack:
> 1- There is a problem if a hypervisor gives to it's VMs contiguous
> memory blocks that are not aligned to IOMMU pages, because then an
> iommu_map_page() could map some memory in this VM and some memory in
> other VM / process.
> 2- To guarantee this, we should have system pagesize >= iommu_pagesize
>
> One way to get (2) is by doing this in enable_ddw():
> if ((query.page_size & 4) && PAGE_SHIFT >= 24) {

You won't ever (well, not any time soon) see PAGE_SHIFT == 24; it is either 4K or 64K.
However, 16MB IOMMU pages are fine - if the hypervisor uses huge pages for the VM's
RAM, it also advertises huge IOMMU pages in ddw-query. So for the
1:1 case there must be no "PAGE_SHIFT >= 24" check.


> page_shift = 24; /* 16MB */
> } else if ((query.page_size & 2) && PAGE_SHIFT >= 16 ) {
> page_shift = 16; /* 64kB */
> } else if (query.page_size & 1 && PAGE_SHIFT >= 12) {
> page_shift = 12; /* 4kB */
> [...]
>
> Another way of solving this, would be adding in LoPAR documentation
> that the blocksize of contiguous memory the hypervisor gives a VM
> should always be aligned to IOMMU pagesize offered.

I think this is assumed already by the design of the DDW API.

>
> I think the best approach would be first sending the above patch, which
> is faster, and then get working into adding that to documentation, so
> hypervisors guarantee this.
>
> If this gets into the docs, we can revert the patch.
>
> What do you think?
I think we diverged from the original patch :) I am not quite sure what
you were fixing there. Thanks,


--
Alexey