2021-08-17 06:41:09

by Leonardo Brás

[permalink] [raw]
Subject: [PATCH v6 00/11] DDW + Indirect Mapping

So far it's assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes it as the user can
configure hundreds VFs and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per a PE to limit waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation
is skipped and the default DMA window "ibm,dma-window" is used instead.

Using the DDW instead of the default DMA window may allow to expand the
amount of memory that can be DMA-mapped, given the number of pages (TCEs)
may stay the same (or increase) and the default DMA window offers only
4k-pages while DDW may offer larger pages (4k, 64k, 16M ...).

Patch #1 replaces hard-coded 4K page size with a variable containing the
correct page size for the window.

Patch #2 introduces iommu_table_in_use(), and replace manual bit-field
checking where it's used. It will be used for aborting enable_ddw() if
there is any current iommu allocation and we are trying single window
indirect mapping.

Patch #3 introduces iommu_pseries_alloc_table() that will be helpful
when indirect mapping needs to replace the iommu_table.

Patch #4 adds helpers for adding DDWs in the list.

Patch #5 refactors enable_ddw() so it returns if direct mapping is
possible, instead of DMA offset. It helps for next patches on
indirect DMA mapping and also allows DMA windows starting at 0x00.

Patch #6 bring new helper to simplify enable_ddw(), allowing
some reorganization for introducing indirect mapping DDW.

Patch #7 adds new helper _iommu_table_setparms() and use it in other
*setparams*() to fill iommu_table. It will also be used for creating a
new iommu_table for indirect mapping.

Patch #8 updates remove_dma_window() to accept different property names,
so we can introduce a new property for indirect mapping.

Patch #9 extracts find_existing_ddw_windows() into
find_existing_ddw_windows_named(), and calls it by it's property name.
This will be useful when the property for indirect mapping is created,
so we can search the device-tree for both properties.

Patch #10:
Instead of destroying the created DDW if it doesn't map the whole
partition, make use of it instead of the default DMA window as it improves
performance. Also, update the iommu_table and re-generate the pools.
It introduces a new property name for DDW with indirect DMA mapping.

Patch #11:
Does some renaming of 'direct window' to 'dma window', given the DDW
created can now be also used in indirect mapping if direct mapping is not
available.

All patches were tested into an LPAR with an virtio-net interface that
allows default DMA window and DDW to coexist.

Changes since v5:
- Reviews from Frederic Barrat
- 02/11 : memset bitmap only if tbl not in use
- 06/11 : remove_ddw() is not used in enable_ddw() error path anymore
New helpers were created for that.
- 10/11 : There was a typo, but got replaced due to 06/11 fix.
v5 Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=253799&state=%2A&archive=both

Changes since v4:
- Solve conflicts with new upstream versions
- Avoid unecessary code moving by doing variable declaration before definition
- Rename _iommu_table_setparms to iommu_table_setparms_common and changed base
parameter from unsigned long to void* in order to avoid unecessary casting.
- Fix breaking case for existing direct-mapping.
- Fix IORESOURCE_MEM bound issue
- Move new tbl to pci->table_group->tables[1] instead of replacing [0]
v4 Link: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=241597&state=%2A&archive=both

Changes since v3:
- Fixed inverted free order at ddw_property_create()
- Updated goto tag naming
v3 Link: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=240287&state=%2A&archive=both

Changes since v2:
- Some patches got removed from the series and sent by themselves,
- New tbl created for DDW + indirect mapping reserves MMIO32 space,
- Improved reserved area algorithm,
- Improved commit messages,
- Removed define for default DMA window prop name,
- Avoided some unnecessary renaming,
- Removed some unnecessary empty lines,
- Changed some code moving to forward declarations.
v2 Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=201210&state=%2A&archive=both


Leonardo Bras (11):
powerpc/pseries/iommu: Replace hard-coded page shift
powerpc/kernel/iommu: Add new iommu_table_in_use() helper
powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper
powerpc/pseries/iommu: Add ddw_list_new_entry() helper
powerpc/pseries/iommu: Allow DDW windows starting at 0x00
powerpc/pseries/iommu: Add ddw_property_create() and refactor
enable_ddw()
powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new
helper
powerpc/pseries/iommu: Update remove_dma_window() to accept property
name
powerpc/pseries/iommu: Find existing DDW with given property name
powerpc/pseries/iommu: Make use of DDW for indirect mapping
powerpc/pseries/iommu: Rename "direct window" to "dma window"

arch/powerpc/include/asm/iommu.h | 1 +
arch/powerpc/include/asm/tce.h | 8 -
arch/powerpc/kernel/iommu.c | 65 ++--
arch/powerpc/platforms/pseries/iommu.c | 481 +++++++++++++++----------
4 files changed, 330 insertions(+), 225 deletions(-)

--
2.32.0


2021-08-17 06:41:15

by Leonardo Brás

[permalink] [raw]
Subject: [PATCH v6 01/11] powerpc/pseries/iommu: Replace hard-coded page shift

Some functions assume IOMMU page size can only be 4K (pageshift == 12).
Update them to accept any page size passed, so we can use 64K pages.

In the process, some defines like TCE_SHIFT were made obsolete, and then
removed.

IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figures 3.4 and 3.5 show
a RPN of 52-bit, and considers a 12-bit pageshift, so there should be
no need of using TCE_RPN_MASK, which masks out any bit after 40 in rpn.
It's usage removed from tce_build_pSeries(), tce_build_pSeriesLP(), and
tce_buildmulti_pSeriesLP().

Most places had a tbl struct, so using tbl->it_page_shift was simple.
tce_free_pSeriesLP() was a special case, since callers not always have a
tbl struct, so adding a tceshift parameter seems the right thing to do.

Signed-off-by: Leonardo Bras <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: Frederic Barrat <[email protected]>
---
arch/powerpc/include/asm/tce.h | 8 ------
arch/powerpc/platforms/pseries/iommu.c | 39 +++++++++++++++-----------
2 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index db5fc2f2262d..0c34d2756d92 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -19,15 +19,7 @@
#define TCE_VB 0
#define TCE_PCI 1

-/* TCE page size is 4096 bytes (1 << 12) */
-
-#define TCE_SHIFT 12
-#define TCE_PAGE_SIZE (1 << TCE_SHIFT)
-
#define TCE_ENTRY_SIZE 8 /* each TCE is 64 bits */
-
-#define TCE_RPN_MASK 0xfffffffffful /* 40-bit RPN (4K pages) */
-#define TCE_RPN_SHIFT 12
#define TCE_VALID 0x800 /* TCE valid */
#define TCE_ALLIO 0x400 /* TCE valid for all lpars */
#define TCE_PCI_WRITE 0x2 /* write from PCI allowed */
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 0c55b991f665..b1b8d12bab39 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -107,6 +107,8 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
u64 proto_tce;
__be64 *tcep;
u64 rpn;
+ const unsigned long tceshift = tbl->it_page_shift;
+ const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);

proto_tce = TCE_PCI_READ; // Read allowed

@@ -117,10 +119,10 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,

while (npages--) {
/* can't move this out since we might cross MEMBLOCK boundary */
- rpn = __pa(uaddr) >> TCE_SHIFT;
- *tcep = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT);
+ rpn = __pa(uaddr) >> tceshift;
+ *tcep = cpu_to_be64(proto_tce | rpn << tceshift);

- uaddr += TCE_PAGE_SIZE;
+ uaddr += pagesize;
tcep++;
}
return 0;
@@ -146,7 +148,7 @@ static unsigned long tce_get_pseries(struct iommu_table *tbl, long index)
return be64_to_cpu(*tcep);
}

-static void tce_free_pSeriesLP(unsigned long liobn, long, long);
+static void tce_free_pSeriesLP(unsigned long liobn, long, long, long);
static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);

static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
@@ -166,12 +168,12 @@ static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
proto_tce |= TCE_PCI_WRITE;

while (npages--) {
- tce = proto_tce | (rpn & TCE_RPN_MASK) << tceshift;
+ tce = proto_tce | rpn << tceshift;
rc = plpar_tce_put((u64)liobn, (u64)tcenum << tceshift, tce);

if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
ret = (int)rc;
- tce_free_pSeriesLP(liobn, tcenum_start,
+ tce_free_pSeriesLP(liobn, tcenum_start, tceshift,
(npages_start - (npages + 1)));
break;
}
@@ -205,10 +207,11 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
long tcenum_start = tcenum, npages_start = npages;
int ret = 0;
unsigned long flags;
+ const unsigned long tceshift = tbl->it_page_shift;

if ((npages == 1) || !firmware_has_feature(FW_FEATURE_PUT_TCE_IND)) {
return tce_build_pSeriesLP(tbl->it_index, tcenum,
- tbl->it_page_shift, npages, uaddr,
+ tceshift, npages, uaddr,
direction, attrs);
}

@@ -225,13 +228,13 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
if (!tcep) {
local_irq_restore(flags);
return tce_build_pSeriesLP(tbl->it_index, tcenum,
- tbl->it_page_shift,
+ tceshift,
npages, uaddr, direction, attrs);
}
__this_cpu_write(tce_page, tcep);
}

- rpn = __pa(uaddr) >> TCE_SHIFT;
+ rpn = __pa(uaddr) >> tceshift;
proto_tce = TCE_PCI_READ;
if (direction != DMA_TO_DEVICE)
proto_tce |= TCE_PCI_WRITE;
@@ -245,12 +248,12 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
limit = min_t(long, npages, 4096/TCE_ENTRY_SIZE);

for (l = 0; l < limit; l++) {
- tcep[l] = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT);
+ tcep[l] = cpu_to_be64(proto_tce | rpn << tceshift);
rpn++;
}

rc = plpar_tce_put_indirect((u64)tbl->it_index,
- (u64)tcenum << 12,
+ (u64)tcenum << tceshift,
(u64)__pa(tcep),
limit);

@@ -277,12 +280,13 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
return ret;
}

-static void tce_free_pSeriesLP(unsigned long liobn, long tcenum, long npages)
+static void tce_free_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
+ long npages)
{
u64 rc;

while (npages--) {
- rc = plpar_tce_put((u64)liobn, (u64)tcenum << 12, 0);
+ rc = plpar_tce_put((u64)liobn, (u64)tcenum << tceshift, 0);

if (rc && printk_ratelimit()) {
printk("tce_free_pSeriesLP: plpar_tce_put failed. rc=%lld\n", rc);
@@ -301,9 +305,11 @@ static void tce_freemulti_pSeriesLP(struct iommu_table *tbl, long tcenum, long n
u64 rc;

if (!firmware_has_feature(FW_FEATURE_STUFF_TCE))
- return tce_free_pSeriesLP(tbl->it_index, tcenum, npages);
+ return tce_free_pSeriesLP(tbl->it_index, tcenum,
+ tbl->it_page_shift, npages);

- rc = plpar_tce_stuff((u64)tbl->it_index, (u64)tcenum << 12, 0, npages);
+ rc = plpar_tce_stuff((u64)tbl->it_index,
+ (u64)tcenum << tbl->it_page_shift, 0, npages);

if (rc && printk_ratelimit()) {
printk("tce_freemulti_pSeriesLP: plpar_tce_stuff failed\n");
@@ -319,7 +325,8 @@ static unsigned long tce_get_pSeriesLP(struct iommu_table *tbl, long tcenum)
u64 rc;
unsigned long tce_ret;

- rc = plpar_tce_get((u64)tbl->it_index, (u64)tcenum << 12, &tce_ret);
+ rc = plpar_tce_get((u64)tbl->it_index,
+ (u64)tcenum << tbl->it_page_shift, &tce_ret);

if (rc && printk_ratelimit()) {
printk("tce_get_pSeriesLP: plpar_tce_get failed. rc=%lld\n", rc);
--
2.32.0

2021-08-17 06:41:18

by Leonardo Brás

[permalink] [raw]
Subject: [PATCH v6 02/11] powerpc/kernel/iommu: Add new iommu_table_in_use() helper

Having a function to check if the iommu table has any allocation helps
deciding if a tbl can be reset for using a new DMA window.

It should be enough to replace all instances of !bitmap_empty(tbl...).

iommu_table_in_use() skips reserved memory, so we don't need to worry about
releasing it before testing. This causes iommu_table_release_pages() to
become unnecessary, given it is only used to remove reserved memory for
testing.

Also, only allow storing reserved memory values in tbl if they are valid
in the table, so there is no need to check it in the new helper.

Signed-off-by: Leonardo Bras <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/iommu.h | 1 +
arch/powerpc/kernel/iommu.c | 61 ++++++++++++++++----------------
2 files changed, 32 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index deef7c94d7b6..bf3b84128525 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -154,6 +154,7 @@ extern int iommu_tce_table_put(struct iommu_table *tbl);
*/
extern struct iommu_table *iommu_init_table(struct iommu_table *tbl,
int nid, unsigned long res_start, unsigned long res_end);
+bool iommu_table_in_use(struct iommu_table *tbl);

#define IOMMU_TABLE_GROUP_MAX_TABLES 2

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 2af89a5e379f..ed98ad63633e 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -690,32 +690,24 @@ static void iommu_table_reserve_pages(struct iommu_table *tbl,
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);

- tbl->it_reserved_start = res_start;
- tbl->it_reserved_end = res_end;
-
- /* Check if res_start..res_end isn't empty and overlaps the table */
- if (res_start && res_end &&
- (tbl->it_offset + tbl->it_size < res_start ||
- res_end < tbl->it_offset))
- return;
+ if (res_start < tbl->it_offset)
+ res_start = tbl->it_offset;

- for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
- set_bit(i - tbl->it_offset, tbl->it_map);
-}
+ if (res_end > (tbl->it_offset + tbl->it_size))
+ res_end = tbl->it_offset + tbl->it_size;

-static void iommu_table_release_pages(struct iommu_table *tbl)
-{
- int i;
+ /* Check if res_start..res_end is a valid range in the table */
+ if (res_start >= res_end) {
+ tbl->it_reserved_start = tbl->it_offset;
+ tbl->it_reserved_end = tbl->it_offset;
+ return;
+ }

- /*
- * In case we have reserved the first bit, we should not emit
- * the warning below.
- */
- if (tbl->it_offset == 0)
- clear_bit(0, tbl->it_map);
+ tbl->it_reserved_start = res_start;
+ tbl->it_reserved_end = res_end;

for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
- clear_bit(i - tbl->it_offset, tbl->it_map);
+ set_bit(i - tbl->it_offset, tbl->it_map);
}

/*
@@ -779,6 +771,22 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
return tbl;
}

+bool iommu_table_in_use(struct iommu_table *tbl)
+{
+ unsigned long start = 0, end;
+
+ /* ignore reserved bit0 */
+ if (tbl->it_offset == 0)
+ start = 1;
+ end = tbl->it_reserved_start - tbl->it_offset;
+ if (find_next_bit(tbl->it_map, end, start) != end)
+ return true;
+
+ start = tbl->it_reserved_end - tbl->it_offset;
+ end = tbl->it_size;
+ return find_next_bit(tbl->it_map, end, start) != end;
+}
+
static void iommu_table_free(struct kref *kref)
{
struct iommu_table *tbl;
@@ -795,10 +803,8 @@ static void iommu_table_free(struct kref *kref)

iommu_debugfs_del(tbl);

- iommu_table_release_pages(tbl);
-
/* verify that table contains no entries */
- if (!bitmap_empty(tbl->it_map, tbl->it_size))
+ if (iommu_table_in_use(tbl))
pr_warn("%s: Unexpected TCEs\n", __func__);

/* free bitmap */
@@ -1099,14 +1105,9 @@ int iommu_take_ownership(struct iommu_table *tbl)
for (i = 0; i < tbl->nr_pools; i++)
spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);

- iommu_table_release_pages(tbl);
-
- if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
+ if (iommu_table_in_use(tbl)) {
pr_err("iommu_tce: it_map is not empty");
ret = -EBUSY;
- /* Undo iommu_table_release_pages, i.e. restore bit#0, etc */
- iommu_table_reserve_pages(tbl, tbl->it_reserved_start,
- tbl->it_reserved_end);
} else {
memset(tbl->it_map, 0xff, sz);
}
--
2.32.0

2021-08-17 06:41:31

by Leonardo Brás

[permalink] [raw]
Subject: [PATCH v6 03/11] powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper

Creates a helper to allow allocating a new iommu_table without the need
to reallocate the iommu_group.

This will be helpful for replacing the iommu_table for the new DMA window,
after we remove the old one with iommu_tce_table_put().

Signed-off-by: Leonardo Bras <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: Frederic Barrat <[email protected]>
---
arch/powerpc/platforms/pseries/iommu.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index b1b8d12bab39..33d82865d6e6 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,28 +53,31 @@ enum {
DDW_EXT_QUERY_OUT_SIZE = 2
};

-static struct iommu_table_group *iommu_pseries_alloc_group(int node)
+static struct iommu_table *iommu_pseries_alloc_table(int node)
{
- struct iommu_table_group *table_group;
struct iommu_table *tbl;

- table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL,
- node);
- if (!table_group)
- return NULL;
-
tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
if (!tbl)
- goto free_group;
+ return NULL;

INIT_LIST_HEAD_RCU(&tbl->it_group_list);
kref_init(&tbl->it_kref);
+ return tbl;
+}

- table_group->tables[0] = tbl;
+static struct iommu_table_group *iommu_pseries_alloc_group(int node)
+{
+ struct iommu_table_group *table_group;
+
+ table_group = kzalloc_node(sizeof(*table_group), GFP_KERNEL, node);
+ if (!table_group)
+ return NULL;

- return table_group;
+ table_group->tables[0] = iommu_pseries_alloc_table(node);
+ if (table_group->tables[0])
+ return table_group;

-free_group:
kfree(table_group);
return NULL;
}
--
2.32.0

2021-08-17 06:41:41

by Leonardo Brás

[permalink] [raw]
Subject: [PATCH v6 05/11] powerpc/pseries/iommu: Allow DDW windows starting at 0x00

enable_ddw() currently returns the address of the DMA window, which is
considered invalid if has the value 0x00.

Also, it only considers valid an address returned from find_existing_ddw
if it's not 0x00.

Changing this behavior makes sense, given the users of enable_ddw() only
need to know if direct mapping is possible. It can also allow a DMA window
starting at 0x00 to be used.

This will be helpful for using a DDW with indirect mapping, as the window
address will be different than 0x00, but it will not map the whole
partition.

Signed-off-by: Leonardo Bras <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: Frederic Barrat <[email protected]>
---
arch/powerpc/platforms/pseries/iommu.c | 36 +++++++++++++-------------
1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 712d1667144a..b34b473bbdc1 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -853,25 +853,26 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
np, ret);
}

-static u64 find_existing_ddw(struct device_node *pdn, int *window_shift)
+static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift)
{
struct direct_window *window;
const struct dynamic_dma_window_prop *direct64;
- u64 dma_addr = 0;
+ bool found = false;

spin_lock(&direct_window_list_lock);
/* check if we already created a window and dupe that config if so */
list_for_each_entry(window, &direct_window_list, list) {
if (window->device == pdn) {
direct64 = window->prop;
- dma_addr = be64_to_cpu(direct64->dma_base);
+ *dma_addr = be64_to_cpu(direct64->dma_base);
*window_shift = be32_to_cpu(direct64->window_shift);
+ found = true;
break;
}
}
spin_unlock(&direct_window_list_lock);

- return dma_addr;
+ return found;
}

static struct direct_window *ddw_list_new_entry(struct device_node *pdn,
@@ -1161,20 +1162,20 @@ static int iommu_get_page_shift(u32 query_page_size)
* pdn: the parent pe node with the ibm,dma_window property
* Future: also check if we can remap the base window for our base page size
*
- * returns the dma offset for use by the direct mapped DMA code.
+ * returns true if can map all pages (direct mapping), false otherwise..
*/
-static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
{
int len = 0, ret;
int max_ram_len = order_base_2(ddw_memory_hotplug_max());
struct ddw_query_response query;
struct ddw_create_response create;
int page_shift;
- u64 dma_addr;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
struct property *win64;
+ bool ddw_enabled = false;
struct dynamic_dma_window_prop *ddwprop;
struct failed_ddw_pdn *fpdn;
bool default_win_removed = false;
@@ -1186,9 +1187,10 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)

mutex_lock(&direct_window_init_mutex);

- dma_addr = find_existing_ddw(pdn, &len);
- if (dma_addr != 0)
+ if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {
+ ddw_enabled = true;
goto out_unlock;
+ }

/*
* If we already went through this for a previous function of
@@ -1342,7 +1344,8 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
list_add(&window->list, &direct_window_list);
spin_unlock(&direct_window_list_lock);

- dma_addr = be64_to_cpu(ddwprop->dma_base);
+ dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);
+ ddw_enabled = true;
goto out_unlock;

out_free_window:
@@ -1374,10 +1377,10 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
* as RAM, then we failed to create a window to cover persistent
* memory and need to set the DMA limit.
*/
- if (pmem_present && dma_addr && (len == max_ram_len))
- dev->dev.bus_dma_limit = dma_addr + (1ULL << len);
+ if (pmem_present && ddw_enabled && (len == max_ram_len))
+ dev->dev.bus_dma_limit = dev->dev.archdata.dma_offset + (1ULL << len);

- return dma_addr;
+ return ddw_enabled;
}

static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
@@ -1456,11 +1459,8 @@ static bool iommu_bypass_supported_pSeriesLP(struct pci_dev *pdev, u64 dma_mask)
break;
}

- if (pdn && PCI_DN(pdn)) {
- pdev->dev.archdata.dma_offset = enable_ddw(pdev, pdn);
- if (pdev->dev.archdata.dma_offset)
- return true;
- }
+ if (pdn && PCI_DN(pdn))
+ return enable_ddw(pdev, pdn);

return false;
}
--
2.32.0

2021-08-17 06:41:49

by Leonardo Brás

[permalink] [raw]
Subject: [PATCH v6 06/11] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()

Code used to create a ddw property that was previously scattered in
enable_ddw() is now gathered in ddw_property_create(), which deals with
allocation and filling the property, letting it ready for
of_property_add(), which now occurs in sequence.

This created an opportunity to reorganize the second part of enable_ddw():

Without this patch enable_ddw() does, in order:
kzalloc() property & members, create_ddw(), fill ddwprop inside property,
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
of_add_property(), and list_add().

With this patch enable_ddw() does, in order:
create_ddw(), ddw_property_create(), of_add_property(),
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
and list_add().

This change requires of_remove_property() in case anything fails after
of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk
in all memory, which looks the most expensive operation, only if
everything else succeeds.

Also, the error path got remove_ddw() replaced by a new helper
__remove_dma_window(), which only removes the new DDW with an rtas-call.
For this, a new helper clean_dma_window() was needed to clean anything
that could left if walk_system_ram_range() fails.

Signed-off-by: Leonardo Bras <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/platforms/pseries/iommu.c | 129 ++++++++++++++++---------
1 file changed, 84 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index b34b473bbdc1..00392582fe10 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -795,17 +795,10 @@ static int __init disable_ddw_setup(char *str)

early_param("disable_ddw", disable_ddw_setup);

-static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
- struct property *win)
+static void clean_dma_window(struct device_node *np, struct dynamic_dma_window_prop *dwp)
{
- struct dynamic_dma_window_prop *dwp;
- u64 liobn;
int ret;

- dwp = win->value;
- liobn = (u64)be32_to_cpu(dwp->liobn);
-
- /* clear the whole window, note the arg is in kernel pages */
ret = tce_clearrange_multi_pSeriesLP(0,
1ULL << (be32_to_cpu(dwp->window_shift) - PAGE_SHIFT), dwp);
if (ret)
@@ -814,18 +807,39 @@ static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
else
pr_debug("%pOF successfully cleared tces in window.\n",
np);
+}
+
+/*
+ * Call only if DMA window is clean.
+ */
+static void __remove_dma_window(struct device_node *np, u32 *ddw_avail, u64 liobn)
+{
+ int ret;

ret = rtas_call(ddw_avail[DDW_REMOVE_PE_DMA_WIN], 1, 1, NULL, liobn);
if (ret)
- pr_warn("%pOF: failed to remove direct window: rtas returned "
+ pr_warn("%pOF: failed to remove DMA window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
else
- pr_debug("%pOF: successfully removed direct window: rtas returned "
+ pr_debug("%pOF: successfully removed DMA window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
}

+static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
+ struct property *win)
+{
+ struct dynamic_dma_window_prop *dwp;
+ u64 liobn;
+
+ dwp = win->value;
+ liobn = (u64)be32_to_cpu(dwp->liobn);
+
+ clean_dma_window(np, dwp);
+ __remove_dma_window(np, ddw_avail, liobn);
+}
+
static void remove_ddw(struct device_node *np, bool remove_prop)
{
struct property *win;
@@ -1153,6 +1167,35 @@ static int iommu_get_page_shift(u32 query_page_size)
return 0;
}

+static struct property *ddw_property_create(const char *propname, u32 liobn, u64 dma_addr,
+ u32 page_shift, u32 window_shift)
+{
+ struct dynamic_dma_window_prop *ddwprop;
+ struct property *win64;
+
+ win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
+ if (!win64)
+ return NULL;
+
+ win64->name = kstrdup(propname, GFP_KERNEL);
+ ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
+ win64->value = ddwprop;
+ win64->length = sizeof(*ddwprop);
+ if (!win64->name || !win64->value) {
+ kfree(win64->name);
+ kfree(win64->value);
+ kfree(win64);
+ return NULL;
+ }
+
+ ddwprop->liobn = cpu_to_be32(liobn);
+ ddwprop->dma_base = cpu_to_be64(dma_addr);
+ ddwprop->tce_shift = cpu_to_be32(page_shift);
+ ddwprop->window_shift = cpu_to_be32(window_shift);
+
+ return win64;
+}
+
/*
* If the PE supports dynamic dma windows, and there is space for a table
* that can map all pages in a linear offset, then setup such a table,
@@ -1171,12 +1214,12 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
struct ddw_query_response query;
struct ddw_create_response create;
int page_shift;
+ u64 win_addr;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
struct property *win64;
bool ddw_enabled = false;
- struct dynamic_dma_window_prop *ddwprop;
struct failed_ddw_pdn *fpdn;
bool default_win_removed = false;
bool pmem_present;
@@ -1293,72 +1336,68 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
1ULL << page_shift);
goto out_failed;
}
- win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
- if (!win64) {
- dev_info(&dev->dev,
- "couldn't allocate property for 64bit dma window\n");
- goto out_failed;
- }
- win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL);
- win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL);
- win64->length = sizeof(*ddwprop);
- if (!win64->name || !win64->value) {
- dev_info(&dev->dev,
- "couldn't allocate property name and value\n");
- goto out_free_prop;
- }

ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
if (ret != 0)
- goto out_free_prop;
-
- ddwprop->liobn = cpu_to_be32(create.liobn);
- ddwprop->dma_base = cpu_to_be64(((u64)create.addr_hi << 32) |
- create.addr_lo);
- ddwprop->tce_shift = cpu_to_be32(page_shift);
- ddwprop->window_shift = cpu_to_be32(len);
+ goto out_failed;

dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
create.liobn, dn);

- window = ddw_list_new_entry(pdn, ddwprop);
+ win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
+ win64 = ddw_property_create(DIRECT64_PROPNAME, create.liobn, win_addr,
+ page_shift, len);
+ if (!win64) {
+ dev_info(&dev->dev,
+ "couldn't allocate property, property name, or value\n");
+ goto out_remove_win;
+ }
+
+ ret = of_add_property(pdn, win64);
+ if (ret) {
+ dev_err(&dev->dev, "unable to add dma window property for %pOF: %d",
+ pdn, ret);
+ goto out_free_prop;
+ }
+
+ window = ddw_list_new_entry(pdn, win64->value);
if (!window)
- goto out_clear_window;
+ goto out_del_prop;

ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
win64->value, tce_setrange_multi_pSeriesLP_walk);
if (ret) {
dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
dn, ret);
- goto out_free_window;
- }

- ret = of_add_property(pdn, win64);
- if (ret) {
- dev_err(&dev->dev, "unable to add dma window property for %pOF: %d",
- pdn, ret);
- goto out_free_window;
+ /* Make sure to clean DDW if any TCE was set*/
+ clean_dma_window(pdn, win64->value);
+ goto out_del_list;
}

spin_lock(&direct_window_list_lock);
list_add(&window->list, &direct_window_list);
spin_unlock(&direct_window_list_lock);

- dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);
+ dev->dev.archdata.dma_offset = win_addr;
ddw_enabled = true;
goto out_unlock;

-out_free_window:
+out_del_list:
kfree(window);

-out_clear_window:
- remove_ddw(pdn, true);
+out_del_prop:
+ of_remove_property(pdn, win64);

out_free_prop:
kfree(win64->name);
kfree(win64->value);
kfree(win64);

+out_remove_win:
+ /* DDW is clean, so it's ok to call this directly. */
+ __remove_dma_window(pdn, ddw_avail, create.liobn);
+
out_failed:
if (default_win_removed)
reset_dma_window(dev, pdn);
--
2.32.0

2021-08-17 06:42:11

by Leonardo Brás

[permalink] [raw]
Subject: [PATCH v6 09/11] powerpc/pseries/iommu: Find existing DDW with given property name

At the moment pseries stores information about created directly mapped
DDW window in DIRECT64_PROPNAME.

With the objective of implementing indirect DMA mapping with DDW, it's
necessary to have another propriety name to make sure kexec'ing into older
kernels does not break, as it would if we reuse DIRECT64_PROPNAME.

In order to have this, find_existing_ddw_windows() needs to be able to
look for different property names.

Extract find_existing_ddw_windows() into find_existing_ddw_windows_named()
and calls it with current property name.

Signed-off-by: Leonardo Bras <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: Frederic Barrat <[email protected]>
---
arch/powerpc/platforms/pseries/iommu.c | 25 +++++++++++++++----------
1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 901f290999d0..e11c00b2dc1e 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -910,24 +910,21 @@ static struct direct_window *ddw_list_new_entry(struct device_node *pdn,
return window;
}

-static int find_existing_ddw_windows(void)
+static void find_existing_ddw_windows_named(const char *name)
{
int len;
struct device_node *pdn;
struct direct_window *window;
- const struct dynamic_dma_window_prop *direct64;
-
- if (!firmware_has_feature(FW_FEATURE_LPAR))
- return 0;
+ const struct dynamic_dma_window_prop *dma64;

- for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
- direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
- if (!direct64 || len < sizeof(*direct64)) {
- remove_ddw(pdn, true, DIRECT64_PROPNAME);
+ for_each_node_with_property(pdn, name) {
+ dma64 = of_get_property(pdn, name, &len);
+ if (!dma64 || len < sizeof(*dma64)) {
+ remove_ddw(pdn, true, name);
continue;
}

- window = ddw_list_new_entry(pdn, direct64);
+ window = ddw_list_new_entry(pdn, dma64);
if (!window)
break;

@@ -935,6 +932,14 @@ static int find_existing_ddw_windows(void)
list_add(&window->list, &direct_window_list);
spin_unlock(&direct_window_list_lock);
}
+}
+
+static int find_existing_ddw_windows(void)
+{
+ if (!firmware_has_feature(FW_FEATURE_LPAR))
+ return 0;
+
+ find_existing_ddw_windows_named(DIRECT64_PROPNAME);

return 0;
}
--
2.32.0

2021-08-17 06:42:14

by Leonardo Brás

[permalink] [raw]
Subject: [PATCH v6 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

So far it's assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes it as the user can
configure hundreds VFs and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per a PE to limit waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation
is skipped and the default DMA window "ibm,dma-window" is used instead.

By using DDW, indirect mapping can get more TCEs than available for the
default DMA window, and also get access to using much larger pagesizes
(16MB as implemented in qemu vs 4k from default DMA window), causing a
significant increase on the maximum amount of memory that can be IOMMU
mapped at the same time.

Indirect mapping will only be used if direct mapping is not a
possibility.

For indirect mapping, it's necessary to re-create the iommu_table with
the new DMA window parameters, so iommu_alloc() can use it.

Removing the default DMA window for using DDW with indirect mapping
is only allowed if there is no current IOMMU memory allocated in
the iommu_table. enable_ddw() is aborted otherwise.

Even though there won't be both direct and indirect mappings at the
same time, we can't reuse the DIRECT64_PROPNAME property name, or else
an older kexec()ed kernel can assume direct mapping, and skip
iommu_alloc(), causing undesirable behavior.
So a new property name DMA64_PROPNAME "linux,dma64-ddr-window-info"
was created to represent a DDW that does not allow direct mapping.

Signed-off-by: Leonardo Bras <[email protected]>
---
arch/powerpc/platforms/pseries/iommu.c | 89 +++++++++++++++++++++-----
1 file changed, 74 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index e11c00b2dc1e..0eccc29f5573 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -375,6 +375,7 @@ static DEFINE_SPINLOCK(direct_window_list_lock);
/* protects initializing window twice for same device */
static DEFINE_MUTEX(direct_window_init_mutex);
#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
+#define DMA64_PROPNAME "linux,dma64-ddr-window-info"

static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
unsigned long num_pfn, const void *arg)
@@ -940,6 +941,7 @@ static int find_existing_ddw_windows(void)
return 0;

find_existing_ddw_windows_named(DIRECT64_PROPNAME);
+ find_existing_ddw_windows_named(DMA64_PROPNAME);

return 0;
}
@@ -1226,14 +1228,17 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
struct ddw_create_response create;
int page_shift;
u64 win_addr;
+ const char *win_name;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
struct property *win64;
bool ddw_enabled = false;
struct failed_ddw_pdn *fpdn;
- bool default_win_removed = false;
+ bool default_win_removed = false, direct_mapping = false;
bool pmem_present;
+ struct pci_dn *pci = PCI_DN(pdn);
+ struct iommu_table *tbl = pci->table_group->tables[0];

dn = of_find_node_by_type(NULL, "ibm,pmemory");
pmem_present = dn != NULL;
@@ -1242,6 +1247,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
mutex_lock(&direct_window_init_mutex);

if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {
+ direct_mapping = (len >= max_ram_len);
ddw_enabled = true;
goto out_unlock;
}
@@ -1322,8 +1328,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
query.page_size);
goto out_failed;
}
- /* verify the window * number of ptes will map the partition */
- /* check largest block * page size > max memory hotplug addr */
+
+
/*
* The "ibm,pmemory" can appear anywhere in the address space.
* Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
@@ -1339,13 +1345,25 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
dev_info(&dev->dev, "Skipping ibm,pmemory");
}

+ /* check if the available block * number of ptes will map everything */
if (query.largest_available_block < (1ULL << (len - page_shift))) {
dev_dbg(&dev->dev,
"can't map partition max 0x%llx with %llu %llu-sized pages\n",
1ULL << len,
query.largest_available_block,
1ULL << page_shift);
- goto out_failed;
+
+ /* DDW + IOMMU on single window may fail if there is any allocation */
+ if (default_win_removed && iommu_table_in_use(tbl)) {
+ dev_dbg(&dev->dev, "current IOMMU table in use, can't be replaced.\n");
+ goto out_failed;
+ }
+
+ len = order_base_2(query.largest_available_block << page_shift);
+ win_name = DMA64_PROPNAME;
+ } else {
+ direct_mapping = true;
+ win_name = DIRECT64_PROPNAME;
}

ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
@@ -1356,8 +1374,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
create.liobn, dn);

win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
- win64 = ddw_property_create(DIRECT64_PROPNAME, create.liobn, win_addr,
- page_shift, len);
+ win64 = ddw_property_create(win_name, create.liobn, win_addr, page_shift, len);
+
if (!win64) {
dev_info(&dev->dev,
"couldn't allocate property, property name, or value\n");
@@ -1375,15 +1393,54 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
if (!window)
goto out_del_prop;

- ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
- win64->value, tce_setrange_multi_pSeriesLP_walk);
- if (ret) {
- dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
- dn, ret);
+ if (direct_mapping) {
+ /* DDW maps the whole partition, so enable direct DMA mapping */
+ ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
+ win64->value, tce_setrange_multi_pSeriesLP_walk);
+ if (ret) {
+ dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
+ dn, ret);

/* Make sure to clean DDW if any TCE was set*/
clean_dma_window(pdn, win64->value);
- goto out_del_list;
+ goto out_del_list;
+ }
+ } else {
+ struct iommu_table *newtbl;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
+ const unsigned long mask = IORESOURCE_MEM_64 | IORESOURCE_MEM;
+
+ /* Look for MMIO32 */
+ if ((pci->phb->mem_resources[i].flags & mask) == IORESOURCE_MEM)
+ break;
+ }
+
+ if (i == ARRAY_SIZE(pci->phb->mem_resources))
+ goto out_del_list;
+
+ /* New table for using DDW instead of the default DMA window */
+ newtbl = iommu_pseries_alloc_table(pci->phb->node);
+ if (!newtbl) {
+ dev_dbg(&dev->dev, "couldn't create new IOMMU table\n");
+ goto out_del_list;
+ }
+
+ iommu_table_setparms_common(newtbl, pci->phb->bus->number, create.liobn, win_addr,
+ 1UL << len, page_shift, NULL, &iommu_table_lpar_multi_ops);
+ iommu_init_table(newtbl, pci->phb->node, pci->phb->mem_resources[i].start,
+ pci->phb->mem_resources[i].end);
+
+ pci->table_group->tables[1] = newtbl;
+
+ /* Keep default DMA window stuct if removed */
+ if (default_win_removed) {
+ tbl->it_size = 0;
+ kfree(tbl->it_map);
+ }
+
+ set_iommu_table_base(&dev->dev, newtbl);
}

spin_lock(&direct_window_list_lock);
@@ -1427,10 +1484,10 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
* as RAM, then we failed to create a window to cover persistent
* memory and need to set the DMA limit.
*/
- if (pmem_present && ddw_enabled && (len == max_ram_len))
+ if (pmem_present && ddw_enabled && direct_mapping && len == max_ram_len)
dev->dev.bus_dma_limit = dev->dev.archdata.dma_offset + (1ULL << len);

- return ddw_enabled;
+ return ddw_enabled && direct_mapping;
}

static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
@@ -1572,7 +1629,9 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti
* we have to remove the property when releasing
* the device node.
*/
- remove_ddw(np, false, DIRECT64_PROPNAME);
+ if (remove_ddw(np, false, DIRECT64_PROPNAME))
+ remove_ddw(np, false, DMA64_PROPNAME);
+
if (pci && pci->table_group)
iommu_pseries_free_group(pci->table_group,
np->full_name);
--
2.32.0

2021-08-17 06:42:32

by Leonardo Brás

[permalink] [raw]
Subject: [PATCH v6 04/11] powerpc/pseries/iommu: Add ddw_list_new_entry() helper

There are two functions creating direct_window_list entries in a
similar way, so create a ddw_list_new_entry() to avoid duplicity and
simplify those functions.

Signed-off-by: Leonardo Bras <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: Frederic Barrat <[email protected]>
---
arch/powerpc/platforms/pseries/iommu.c | 32 +++++++++++++++++---------
1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 33d82865d6e6..712d1667144a 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -874,6 +874,21 @@ static u64 find_existing_ddw(struct device_node *pdn, int *window_shift)
return dma_addr;
}

+static struct direct_window *ddw_list_new_entry(struct device_node *pdn,
+ const struct dynamic_dma_window_prop *dma64)
+{
+ struct direct_window *window;
+
+ window = kzalloc(sizeof(*window), GFP_KERNEL);
+ if (!window)
+ return NULL;
+
+ window->device = pdn;
+ window->prop = dma64;
+
+ return window;
+}
+
static int find_existing_ddw_windows(void)
{
int len;
@@ -886,18 +901,15 @@ static int find_existing_ddw_windows(void)

for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
- if (!direct64)
- continue;
-
- window = kzalloc(sizeof(*window), GFP_KERNEL);
- if (!window || len < sizeof(struct dynamic_dma_window_prop)) {
- kfree(window);
+ if (!direct64 || len < sizeof(*direct64)) {
remove_ddw(pdn, true);
continue;
}

- window->device = pdn;
- window->prop = direct64;
+ window = ddw_list_new_entry(pdn, direct64);
+ if (!window)
+ break;
+
spin_lock(&direct_window_list_lock);
list_add(&window->list, &direct_window_list);
spin_unlock(&direct_window_list_lock);
@@ -1307,7 +1319,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
create.liobn, dn);

- window = kzalloc(sizeof(*window), GFP_KERNEL);
+ window = ddw_list_new_entry(pdn, ddwprop);
if (!window)
goto out_clear_window;

@@ -1326,8 +1338,6 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
goto out_free_window;
}

- window->device = pdn;
- window->prop = ddwprop;
spin_lock(&direct_window_list_lock);
list_add(&window->list, &direct_window_list);
spin_unlock(&direct_window_list_lock);
--
2.32.0

2021-08-17 06:42:57

by Leonardo Brás

[permalink] [raw]
Subject: [PATCH v6 07/11] powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new helper

Add a new helper _iommu_table_setparms(), and use it in
iommu_table_setparms() and iommu_table_setparms_lpar() to avoid duplicated
code.

Also, setting tbl->it_ops was happening outsite iommu_table_setparms*(),
so move it to the new helper. Since we need the iommu_table_ops to be
declared before used, declare iommu_table_lpar_multi_ops and
iommu_table_pseries_ops to before their respective iommu_table_setparms*().

Signed-off-by: Leonardo Bras <[email protected]>
Reviewed-by: Frederic Barrat <[email protected]>
---
arch/powerpc/platforms/pseries/iommu.c | 72 ++++++++++++++------------
1 file changed, 38 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 00392582fe10..a47f59a8f107 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -501,6 +501,24 @@ static int tce_setrange_multi_pSeriesLP_walk(unsigned long start_pfn,
return tce_setrange_multi_pSeriesLP(start_pfn, num_pfn, arg);
}

+static void iommu_table_setparms_common(struct iommu_table *tbl, unsigned long busno,
+ unsigned long liobn, unsigned long win_addr,
+ unsigned long window_size, unsigned long page_shift,
+ void *base, struct iommu_table_ops *table_ops)
+{
+ tbl->it_busno = busno;
+ tbl->it_index = liobn;
+ tbl->it_offset = win_addr >> page_shift;
+ tbl->it_size = window_size >> page_shift;
+ tbl->it_page_shift = page_shift;
+ tbl->it_base = (unsigned long)base;
+ tbl->it_blocksize = 16;
+ tbl->it_type = TCE_PCI;
+ tbl->it_ops = table_ops;
+}
+
+struct iommu_table_ops iommu_table_pseries_ops;
+
static void iommu_table_setparms(struct pci_controller *phb,
struct device_node *dn,
struct iommu_table *tbl)
@@ -509,8 +527,13 @@ static void iommu_table_setparms(struct pci_controller *phb,
const unsigned long *basep;
const u32 *sizep;

- node = phb->dn;
+ /* Test if we are going over 2GB of DMA space */
+ if (phb->dma_window_base_cur + phb->dma_window_size > SZ_2G) {
+ udbg_printf("PCI_DMA: Unexpected number of IOAs under this PHB.\n");
+ panic("PCI_DMA: Unexpected number of IOAs under this PHB.\n");
+ }

+ node = phb->dn;
basep = of_get_property(node, "linux,tce-base", NULL);
sizep = of_get_property(node, "linux,tce-size", NULL);
if (basep == NULL || sizep == NULL) {
@@ -519,33 +542,18 @@ static void iommu_table_setparms(struct pci_controller *phb,
return;
}

- tbl->it_base = (unsigned long)__va(*basep);
+ iommu_table_setparms_common(tbl, phb->bus->number, 0, phb->dma_window_base_cur,
+ phb->dma_window_size, IOMMU_PAGE_SHIFT_4K,
+ __va(*basep), &iommu_table_pseries_ops);

if (!is_kdump_kernel())
memset((void *)tbl->it_base, 0, *sizep);

- tbl->it_busno = phb->bus->number;
- tbl->it_page_shift = IOMMU_PAGE_SHIFT_4K;
-
- /* Units of tce entries */
- tbl->it_offset = phb->dma_window_base_cur >> tbl->it_page_shift;
-
- /* Test if we are going over 2GB of DMA space */
- if (phb->dma_window_base_cur + phb->dma_window_size > 0x80000000ul) {
- udbg_printf("PCI_DMA: Unexpected number of IOAs under this PHB.\n");
- panic("PCI_DMA: Unexpected number of IOAs under this PHB.\n");
- }
-
phb->dma_window_base_cur += phb->dma_window_size;
-
- /* Set the tce table size - measured in entries */
- tbl->it_size = phb->dma_window_size >> tbl->it_page_shift;
-
- tbl->it_index = 0;
- tbl->it_blocksize = 16;
- tbl->it_type = TCE_PCI;
}

+struct iommu_table_ops iommu_table_lpar_multi_ops;
+
/*
* iommu_table_setparms_lpar
*
@@ -557,17 +565,13 @@ static void iommu_table_setparms_lpar(struct pci_controller *phb,
struct iommu_table_group *table_group,
const __be32 *dma_window)
{
- unsigned long offset, size;
+ unsigned long offset, size, liobn;

- of_parse_dma_window(dn, dma_window, &tbl->it_index, &offset, &size);
+ of_parse_dma_window(dn, dma_window, &liobn, &offset, &size);
+
+ iommu_table_setparms_common(tbl, phb->bus->number, liobn, offset, size, IOMMU_PAGE_SHIFT_4K, NULL,
+ &iommu_table_lpar_multi_ops);

- tbl->it_busno = phb->bus->number;
- tbl->it_page_shift = IOMMU_PAGE_SHIFT_4K;
- tbl->it_base = 0;
- tbl->it_blocksize = 16;
- tbl->it_type = TCE_PCI;
- tbl->it_offset = offset >> tbl->it_page_shift;
- tbl->it_size = size >> tbl->it_page_shift;

table_group->tce32_start = offset;
table_group->tce32_size = size;
@@ -647,7 +651,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
tbl = pci->table_group->tables[0];

iommu_table_setparms(pci->phb, dn, tbl);
- tbl->it_ops = &iommu_table_pseries_ops;
+
if (!iommu_init_table(tbl, pci->phb->node, 0, 0))
panic("Failed to initialize iommu table");

@@ -730,7 +734,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
tbl = ppci->table_group->tables[0];
iommu_table_setparms_lpar(ppci->phb, pdn, tbl,
ppci->table_group, dma_window);
- tbl->it_ops = &iommu_table_lpar_multi_ops;
+
if (!iommu_init_table(tbl, ppci->phb->node, 0, 0))
panic("Failed to initialize iommu table");
iommu_register_group(ppci->table_group,
@@ -760,7 +764,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
PCI_DN(dn)->table_group = iommu_pseries_alloc_group(phb->node);
tbl = PCI_DN(dn)->table_group->tables[0];
iommu_table_setparms(phb, dn, tbl);
- tbl->it_ops = &iommu_table_pseries_ops;
+
if (!iommu_init_table(tbl, phb->node, 0, 0))
panic("Failed to initialize iommu table");

@@ -1461,7 +1465,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
tbl = pci->table_group->tables[0];
iommu_table_setparms_lpar(pci->phb, pdn, tbl,
pci->table_group, dma_window);
- tbl->it_ops = &iommu_table_lpar_multi_ops;
+
iommu_init_table(tbl, pci->phb->node, 0, 0);
iommu_register_group(pci->table_group,
pci_domain_nr(pci->phb->bus), 0);
--
2.32.0

2021-08-17 06:43:07

by Leonardo Brás

[permalink] [raw]
Subject: [PATCH v6 08/11] powerpc/pseries/iommu: Update remove_dma_window() to accept property name

Update remove_dma_window() so it can be used to remove DDW with a given
property name.

This enables the creation of new property names for DDW, so we can
have different usage for it, like indirect mapping.

Also, add return values to it so we can check if the property was found
while removing the active DDW. This allows skipping the remaining property
names while reducing the impact of multiple property names.

Signed-off-by: Leonardo Bras <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/platforms/pseries/iommu.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index a47f59a8f107..901f290999d0 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -844,31 +844,33 @@ static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
__remove_dma_window(np, ddw_avail, liobn);
}

-static void remove_ddw(struct device_node *np, bool remove_prop)
+static int remove_ddw(struct device_node *np, bool remove_prop, const char *win_name)
{
struct property *win;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
int ret = 0;

+ win = of_find_property(np, win_name, NULL);
+ if (!win)
+ return -EINVAL;
+
ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
&ddw_avail[0], DDW_APPLICABLE_SIZE);
if (ret)
- return;
+ return 0;

- win = of_find_property(np, DIRECT64_PROPNAME, NULL);
- if (!win)
- return;

if (win->length >= sizeof(struct dynamic_dma_window_prop))
remove_dma_window(np, ddw_avail, win);

if (!remove_prop)
- return;
+ return 0;

ret = of_remove_property(np, win);
if (ret)
pr_warn("%pOF: failed to remove direct window property: %d\n",
np, ret);
+ return 0;
}

static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift)
@@ -921,7 +923,7 @@ static int find_existing_ddw_windows(void)
for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
if (!direct64 || len < sizeof(*direct64)) {
- remove_ddw(pdn, true);
+ remove_ddw(pdn, true, DIRECT64_PROPNAME);
continue;
}

@@ -1565,7 +1567,7 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti
* we have to remove the property when releasing
* the device node.
*/
- remove_ddw(np, false);
+ remove_ddw(np, false, DIRECT64_PROPNAME);
if (pci && pci->table_group)
iommu_pseries_free_group(pci->table_group,
np->full_name);
--
2.32.0

2021-08-17 06:43:11

by Leonardo Brás

[permalink] [raw]
Subject: [PATCH v6 11/11] powerpc/pseries/iommu: Rename "direct window" to "dma window"

A previous change introduced the usage of DDW as a bigger indirect DMA
mapping when the DDW available size does not map the whole partition.

As most of the code that manipulates direct mappings was reused for
indirect mappings, it's necessary to rename all names and debug/info
messages to reflect that it can be used for both kinds of mapping.

This should cause no behavioural change, just adjust naming.

Signed-off-by: Leonardo Bras <[email protected]>
Reviewed-by: Frederic Barrat <[email protected]>
---
arch/powerpc/platforms/pseries/iommu.c | 87 +++++++++++++-------------
1 file changed, 45 insertions(+), 42 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 0eccc29f5573..dab5c56ffd0e 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -349,7 +349,7 @@ struct dynamic_dma_window_prop {
__be32 window_shift; /* ilog2(tce_window_size) */
};

-struct direct_window {
+struct dma_win {
struct device_node *device;
const struct dynamic_dma_window_prop *prop;
struct list_head list;
@@ -369,11 +369,11 @@ struct ddw_create_response {
u32 addr_lo;
};

-static LIST_HEAD(direct_window_list);
+static LIST_HEAD(dma_win_list);
/* prevents races between memory on/offline and window creation */
-static DEFINE_SPINLOCK(direct_window_list_lock);
+static DEFINE_SPINLOCK(dma_win_list_lock);
/* protects initializing window twice for same device */
-static DEFINE_MUTEX(direct_window_init_mutex);
+static DEFINE_MUTEX(dma_win_init_mutex);
#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
#define DMA64_PROPNAME "linux,dma64-ddr-window-info"

@@ -713,7 +713,10 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
pr_debug("pci_dma_bus_setup_pSeriesLP: setting up bus %pOF\n",
dn);

- /* Find nearest ibm,dma-window, walking up the device tree */
+ /*
+ * Find nearest ibm,dma-window (default DMA window), walking up the
+ * device tree
+ */
for (pdn = dn; pdn != NULL; pdn = pdn->parent) {
dma_window = of_get_property(pdn, "ibm,dma-window", NULL);
if (dma_window != NULL)
@@ -869,37 +872,37 @@ static int remove_ddw(struct device_node *np, bool remove_prop, const char *win_

ret = of_remove_property(np, win);
if (ret)
- pr_warn("%pOF: failed to remove direct window property: %d\n",
+ pr_warn("%pOF: failed to remove DMA window property: %d\n",
np, ret);
return 0;
}

static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift)
{
- struct direct_window *window;
- const struct dynamic_dma_window_prop *direct64;
+ struct dma_win *window;
+ const struct dynamic_dma_window_prop *dma64;
bool found = false;

- spin_lock(&direct_window_list_lock);
+ spin_lock(&dma_win_list_lock);
/* check if we already created a window and dupe that config if so */
- list_for_each_entry(window, &direct_window_list, list) {
+ list_for_each_entry(window, &dma_win_list, list) {
if (window->device == pdn) {
- direct64 = window->prop;
- *dma_addr = be64_to_cpu(direct64->dma_base);
- *window_shift = be32_to_cpu(direct64->window_shift);
+ dma64 = window->prop;
+ *dma_addr = be64_to_cpu(dma64->dma_base);
+ *window_shift = be32_to_cpu(dma64->window_shift);
found = true;
break;
}
}
- spin_unlock(&direct_window_list_lock);
+ spin_unlock(&dma_win_list_lock);

return found;
}

-static struct direct_window *ddw_list_new_entry(struct device_node *pdn,
- const struct dynamic_dma_window_prop *dma64)
+static struct dma_win *ddw_list_new_entry(struct device_node *pdn,
+ const struct dynamic_dma_window_prop *dma64)
{
- struct direct_window *window;
+ struct dma_win *window;

window = kzalloc(sizeof(*window), GFP_KERNEL);
if (!window)
@@ -915,7 +918,7 @@ static void find_existing_ddw_windows_named(const char *name)
{
int len;
struct device_node *pdn;
- struct direct_window *window;
+ struct dma_win *window;
const struct dynamic_dma_window_prop *dma64;

for_each_node_with_property(pdn, name) {
@@ -929,9 +932,9 @@ static void find_existing_ddw_windows_named(const char *name)
if (!window)
break;

- spin_lock(&direct_window_list_lock);
- list_add(&window->list, &direct_window_list);
- spin_unlock(&direct_window_list_lock);
+ spin_lock(&dma_win_list_lock);
+ list_add(&window->list, &dma_win_list);
+ spin_unlock(&dma_win_list_lock);
}
}

@@ -1231,7 +1234,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
const char *win_name;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
- struct direct_window *window;
+ struct dma_win *window;
struct property *win64;
bool ddw_enabled = false;
struct failed_ddw_pdn *fpdn;
@@ -1244,7 +1247,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
pmem_present = dn != NULL;
of_node_put(dn);

- mutex_lock(&direct_window_init_mutex);
+ mutex_lock(&dma_win_init_mutex);

if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {
direct_mapping = (len >= max_ram_len);
@@ -1324,8 +1327,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)

page_shift = iommu_get_page_shift(query.page_size);
if (!page_shift) {
- dev_dbg(&dev->dev, "no supported direct page size in mask %x",
- query.page_size);
+ dev_dbg(&dev->dev, "no supported page size in mask %x",
+ query.page_size);
goto out_failed;
}

@@ -1384,7 +1387,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)

ret = of_add_property(pdn, win64);
if (ret) {
- dev_err(&dev->dev, "unable to add dma window property for %pOF: %d",
+ dev_err(&dev->dev, "unable to add DMA window property for %pOF: %d",
pdn, ret);
goto out_free_prop;
}
@@ -1398,7 +1401,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
win64->value, tce_setrange_multi_pSeriesLP_walk);
if (ret) {
- dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
+ dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n",
dn, ret);

/* Make sure to clean DDW if any TCE was set*/
@@ -1443,9 +1446,9 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
set_iommu_table_base(&dev->dev, newtbl);
}

- spin_lock(&direct_window_list_lock);
- list_add(&window->list, &direct_window_list);
- spin_unlock(&direct_window_list_lock);
+ spin_lock(&dma_win_list_lock);
+ list_add(&window->list, &dma_win_list);
+ spin_unlock(&dma_win_list_lock);

dev->dev.archdata.dma_offset = win_addr;
ddw_enabled = true;
@@ -1477,7 +1480,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
list_add(&fpdn->list, &failed_ddw_pdn_list);

out_unlock:
- mutex_unlock(&direct_window_init_mutex);
+ mutex_unlock(&dma_win_init_mutex);

/*
* If we have persistent memory and the window size is only as big
@@ -1575,29 +1578,29 @@ static bool iommu_bypass_supported_pSeriesLP(struct pci_dev *pdev, u64 dma_mask)
static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
void *data)
{
- struct direct_window *window;
+ struct dma_win *window;
struct memory_notify *arg = data;
int ret = 0;

switch (action) {
case MEM_GOING_ONLINE:
- spin_lock(&direct_window_list_lock);
- list_for_each_entry(window, &direct_window_list, list) {
+ spin_lock(&dma_win_list_lock);
+ list_for_each_entry(window, &dma_win_list, list) {
ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
arg->nr_pages, window->prop);
/* XXX log error */
}
- spin_unlock(&direct_window_list_lock);
+ spin_unlock(&dma_win_list_lock);
break;
case MEM_CANCEL_ONLINE:
case MEM_OFFLINE:
- spin_lock(&direct_window_list_lock);
- list_for_each_entry(window, &direct_window_list, list) {
+ spin_lock(&dma_win_list_lock);
+ list_for_each_entry(window, &dma_win_list, list) {
ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
arg->nr_pages, window->prop);
/* XXX log error */
}
- spin_unlock(&direct_window_list_lock);
+ spin_unlock(&dma_win_list_lock);
break;
default:
break;
@@ -1618,7 +1621,7 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti
struct of_reconfig_data *rd = data;
struct device_node *np = rd->dn;
struct pci_dn *pci = PCI_DN(np);
- struct direct_window *window;
+ struct dma_win *window;

switch (action) {
case OF_RECONFIG_DETACH_NODE:
@@ -1636,15 +1639,15 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti
iommu_pseries_free_group(pci->table_group,
np->full_name);

- spin_lock(&direct_window_list_lock);
- list_for_each_entry(window, &direct_window_list, list) {
+ spin_lock(&dma_win_list_lock);
+ list_for_each_entry(window, &dma_win_list, list) {
if (window->device == np) {
list_del(&window->list);
kfree(window);
break;
}
}
- spin_unlock(&direct_window_list_lock);
+ spin_unlock(&dma_win_list_lock);
break;
default:
err = NOTIFY_DONE;
--
2.32.0

2021-08-27 16:54:23

by Frederic Barrat

[permalink] [raw]
Subject: Re: [PATCH v6 02/11] powerpc/kernel/iommu: Add new iommu_table_in_use() helper



On 17/08/2021 08:39, Leonardo Bras wrote:
> Having a function to check if the iommu table has any allocation helps
> deciding if a tbl can be reset for using a new DMA window.
>
> It should be enough to replace all instances of !bitmap_empty(tbl...).
>
> iommu_table_in_use() skips reserved memory, so we don't need to worry about
> releasing it before testing. This causes iommu_table_release_pages() to
> become unnecessary, given it is only used to remove reserved memory for
> testing.
>
> Also, only allow storing reserved memory values in tbl if they are valid
> in the table, so there is no need to check it in the new helper.
>
> Signed-off-by: Leonardo Bras <[email protected]>
> Reviewed-by: Alexey Kardashevskiy <[email protected]>
> ---

Looks ok to me now, thanks!
Reviewed-by: Frederic Barrat <[email protected]>


> arch/powerpc/include/asm/iommu.h | 1 +
> arch/powerpc/kernel/iommu.c | 61 ++++++++++++++++----------------
> 2 files changed, 32 insertions(+), 30 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index deef7c94d7b6..bf3b84128525 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -154,6 +154,7 @@ extern int iommu_tce_table_put(struct iommu_table *tbl);
> */
> extern struct iommu_table *iommu_init_table(struct iommu_table *tbl,
> int nid, unsigned long res_start, unsigned long res_end);
> +bool iommu_table_in_use(struct iommu_table *tbl);
>
> #define IOMMU_TABLE_GROUP_MAX_TABLES 2
>
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 2af89a5e379f..ed98ad63633e 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -690,32 +690,24 @@ static void iommu_table_reserve_pages(struct iommu_table *tbl,
> if (tbl->it_offset == 0)
> set_bit(0, tbl->it_map);
>
> - tbl->it_reserved_start = res_start;
> - tbl->it_reserved_end = res_end;
> -
> - /* Check if res_start..res_end isn't empty and overlaps the table */
> - if (res_start && res_end &&
> - (tbl->it_offset + tbl->it_size < res_start ||
> - res_end < tbl->it_offset))
> - return;
> + if (res_start < tbl->it_offset)
> + res_start = tbl->it_offset;
>
> - for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
> - set_bit(i - tbl->it_offset, tbl->it_map);
> -}
> + if (res_end > (tbl->it_offset + tbl->it_size))
> + res_end = tbl->it_offset + tbl->it_size;
>
> -static void iommu_table_release_pages(struct iommu_table *tbl)
> -{
> - int i;
> + /* Check if res_start..res_end is a valid range in the table */
> + if (res_start >= res_end) {
> + tbl->it_reserved_start = tbl->it_offset;
> + tbl->it_reserved_end = tbl->it_offset;
> + return;
> + }
>
> - /*
> - * In case we have reserved the first bit, we should not emit
> - * the warning below.
> - */
> - if (tbl->it_offset == 0)
> - clear_bit(0, tbl->it_map);
> + tbl->it_reserved_start = res_start;
> + tbl->it_reserved_end = res_end;
>
> for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
> - clear_bit(i - tbl->it_offset, tbl->it_map);
> + set_bit(i - tbl->it_offset, tbl->it_map);
> }
>
> /*
> @@ -779,6 +771,22 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
> return tbl;
> }
>
> +bool iommu_table_in_use(struct iommu_table *tbl)
> +{
> + unsigned long start = 0, end;
> +
> + /* ignore reserved bit0 */
> + if (tbl->it_offset == 0)
> + start = 1;
> + end = tbl->it_reserved_start - tbl->it_offset;
> + if (find_next_bit(tbl->it_map, end, start) != end)
> + return true;
> +
> + start = tbl->it_reserved_end - tbl->it_offset;
> + end = tbl->it_size;
> + return find_next_bit(tbl->it_map, end, start) != end;
> +}
> +
> static void iommu_table_free(struct kref *kref)
> {
> struct iommu_table *tbl;
> @@ -795,10 +803,8 @@ static void iommu_table_free(struct kref *kref)
>
> iommu_debugfs_del(tbl);
>
> - iommu_table_release_pages(tbl);
> -
> /* verify that table contains no entries */
> - if (!bitmap_empty(tbl->it_map, tbl->it_size))
> + if (iommu_table_in_use(tbl))
> pr_warn("%s: Unexpected TCEs\n", __func__);
>
> /* free bitmap */
> @@ -1099,14 +1105,9 @@ int iommu_take_ownership(struct iommu_table *tbl)
> for (i = 0; i < tbl->nr_pools; i++)
> spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);
>
> - iommu_table_release_pages(tbl);
> -
> - if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
> + if (iommu_table_in_use(tbl)) {
> pr_err("iommu_tce: it_map is not empty");
> ret = -EBUSY;
> - /* Undo iommu_table_release_pages, i.e. restore bit#0, etc */
> - iommu_table_reserve_pages(tbl, tbl->it_reserved_start,
> - tbl->it_reserved_end);
> } else {
> memset(tbl->it_map, 0xff, sz);
> }
>

2021-08-27 16:58:15

by Frederic Barrat

[permalink] [raw]
Subject: Re: [PATCH v6 06/11] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()



On 17/08/2021 08:39, Leonardo Bras wrote:
> Code used to create a ddw property that was previously scattered in
> enable_ddw() is now gathered in ddw_property_create(), which deals with
> allocation and filling the property, letting it ready for
> of_property_add(), which now occurs in sequence.
>
> This created an opportunity to reorganize the second part of enable_ddw():
>
> Without this patch enable_ddw() does, in order:
> kzalloc() property & members, create_ddw(), fill ddwprop inside property,
> ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
> of_add_property(), and list_add().
>
> With this patch enable_ddw() does, in order:
> create_ddw(), ddw_property_create(), of_add_property(),
> ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
> and list_add().
>
> This change requires of_remove_property() in case anything fails after
> of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk
> in all memory, which looks the most expensive operation, only if
> everything else succeeds.
>
> Also, the error path got remove_ddw() replaced by a new helper
> __remove_dma_window(), which only removes the new DDW with an rtas-call.
> For this, a new helper clean_dma_window() was needed to clean anything
> that could left if walk_system_ram_range() fails.
>
> Signed-off-by: Leonardo Bras <[email protected]>
> Reviewed-by: Alexey Kardashevskiy <[email protected]>
> ---


Thanks for fixing the error paths
Reviewed-by: Frederic Barrat <[email protected]>


> arch/powerpc/platforms/pseries/iommu.c | 129 ++++++++++++++++---------
> 1 file changed, 84 insertions(+), 45 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index b34b473bbdc1..00392582fe10 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -795,17 +795,10 @@ static int __init disable_ddw_setup(char *str)
>
> early_param("disable_ddw", disable_ddw_setup);
>
> -static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
> - struct property *win)
> +static void clean_dma_window(struct device_node *np, struct dynamic_dma_window_prop *dwp)
> {
> - struct dynamic_dma_window_prop *dwp;
> - u64 liobn;
> int ret;
>
> - dwp = win->value;
> - liobn = (u64)be32_to_cpu(dwp->liobn);
> -
> - /* clear the whole window, note the arg is in kernel pages */
> ret = tce_clearrange_multi_pSeriesLP(0,
> 1ULL << (be32_to_cpu(dwp->window_shift) - PAGE_SHIFT), dwp);
> if (ret)
> @@ -814,18 +807,39 @@ static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
> else
> pr_debug("%pOF successfully cleared tces in window.\n",
> np);
> +}
> +
> +/*
> + * Call only if DMA window is clean.
> + */
> +static void __remove_dma_window(struct device_node *np, u32 *ddw_avail, u64 liobn)
> +{
> + int ret;
>
> ret = rtas_call(ddw_avail[DDW_REMOVE_PE_DMA_WIN], 1, 1, NULL, liobn);
> if (ret)
> - pr_warn("%pOF: failed to remove direct window: rtas returned "
> + pr_warn("%pOF: failed to remove DMA window: rtas returned "
> "%d to ibm,remove-pe-dma-window(%x) %llx\n",
> np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
> else
> - pr_debug("%pOF: successfully removed direct window: rtas returned "
> + pr_debug("%pOF: successfully removed DMA window: rtas returned "
> "%d to ibm,remove-pe-dma-window(%x) %llx\n",
> np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
> }
>
> +static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
> + struct property *win)
> +{
> + struct dynamic_dma_window_prop *dwp;
> + u64 liobn;
> +
> + dwp = win->value;
> + liobn = (u64)be32_to_cpu(dwp->liobn);
> +
> + clean_dma_window(np, dwp);
> + __remove_dma_window(np, ddw_avail, liobn);
> +}
> +
> static void remove_ddw(struct device_node *np, bool remove_prop)
> {
> struct property *win;
> @@ -1153,6 +1167,35 @@ static int iommu_get_page_shift(u32 query_page_size)
> return 0;
> }
>
> +static struct property *ddw_property_create(const char *propname, u32 liobn, u64 dma_addr,
> + u32 page_shift, u32 window_shift)
> +{
> + struct dynamic_dma_window_prop *ddwprop;
> + struct property *win64;
> +
> + win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
> + if (!win64)
> + return NULL;
> +
> + win64->name = kstrdup(propname, GFP_KERNEL);
> + ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
> + win64->value = ddwprop;
> + win64->length = sizeof(*ddwprop);
> + if (!win64->name || !win64->value) {
> + kfree(win64->name);
> + kfree(win64->value);
> + kfree(win64);
> + return NULL;
> + }
> +
> + ddwprop->liobn = cpu_to_be32(liobn);
> + ddwprop->dma_base = cpu_to_be64(dma_addr);
> + ddwprop->tce_shift = cpu_to_be32(page_shift);
> + ddwprop->window_shift = cpu_to_be32(window_shift);
> +
> + return win64;
> +}
> +
> /*
> * If the PE supports dynamic dma windows, and there is space for a table
> * that can map all pages in a linear offset, then setup such a table,
> @@ -1171,12 +1214,12 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> struct ddw_query_response query;
> struct ddw_create_response create;
> int page_shift;
> + u64 win_addr;
> struct device_node *dn;
> u32 ddw_avail[DDW_APPLICABLE_SIZE];
> struct direct_window *window;
> struct property *win64;
> bool ddw_enabled = false;
> - struct dynamic_dma_window_prop *ddwprop;
> struct failed_ddw_pdn *fpdn;
> bool default_win_removed = false;
> bool pmem_present;
> @@ -1293,72 +1336,68 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> 1ULL << page_shift);
> goto out_failed;
> }
> - win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
> - if (!win64) {
> - dev_info(&dev->dev,
> - "couldn't allocate property for 64bit dma window\n");
> - goto out_failed;
> - }
> - win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL);
> - win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL);
> - win64->length = sizeof(*ddwprop);
> - if (!win64->name || !win64->value) {
> - dev_info(&dev->dev,
> - "couldn't allocate property name and value\n");
> - goto out_free_prop;
> - }
>
> ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
> if (ret != 0)
> - goto out_free_prop;
> -
> - ddwprop->liobn = cpu_to_be32(create.liobn);
> - ddwprop->dma_base = cpu_to_be64(((u64)create.addr_hi << 32) |
> - create.addr_lo);
> - ddwprop->tce_shift = cpu_to_be32(page_shift);
> - ddwprop->window_shift = cpu_to_be32(len);
> + goto out_failed;
>
> dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
> create.liobn, dn);
>
> - window = ddw_list_new_entry(pdn, ddwprop);
> + win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
> + win64 = ddw_property_create(DIRECT64_PROPNAME, create.liobn, win_addr,
> + page_shift, len);
> + if (!win64) {
> + dev_info(&dev->dev,
> + "couldn't allocate property, property name, or value\n");
> + goto out_remove_win;
> + }
> +
> + ret = of_add_property(pdn, win64);
> + if (ret) {
> + dev_err(&dev->dev, "unable to add dma window property for %pOF: %d",
> + pdn, ret);
> + goto out_free_prop;
> + }
> +
> + window = ddw_list_new_entry(pdn, win64->value);
> if (!window)
> - goto out_clear_window;
> + goto out_del_prop;
>
> ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> win64->value, tce_setrange_multi_pSeriesLP_walk);
> if (ret) {
> dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
> dn, ret);
> - goto out_free_window;
> - }
>
> - ret = of_add_property(pdn, win64);
> - if (ret) {
> - dev_err(&dev->dev, "unable to add dma window property for %pOF: %d",
> - pdn, ret);
> - goto out_free_window;
> + /* Make sure to clean DDW if any TCE was set*/
> + clean_dma_window(pdn, win64->value);
> + goto out_del_list;
> }
>
> spin_lock(&direct_window_list_lock);
> list_add(&window->list, &direct_window_list);
> spin_unlock(&direct_window_list_lock);
>
> - dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);
> + dev->dev.archdata.dma_offset = win_addr;
> ddw_enabled = true;
> goto out_unlock;
>
> -out_free_window:
> +out_del_list:
> kfree(window);
>
> -out_clear_window:
> - remove_ddw(pdn, true);
> +out_del_prop:
> + of_remove_property(pdn, win64);
>
> out_free_prop:
> kfree(win64->name);
> kfree(win64->value);
> kfree(win64);
>
> +out_remove_win:
> + /* DDW is clean, so it's ok to call this directly. */
> + __remove_dma_window(pdn, ddw_avail, create.liobn);
> +
> out_failed:
> if (default_win_removed)
> reset_dma_window(dev, pdn);
>

2021-08-27 17:00:45

by Frederic Barrat

[permalink] [raw]
Subject: Re: [PATCH v6 08/11] powerpc/pseries/iommu: Update remove_dma_window() to accept property name



On 17/08/2021 08:39, Leonardo Bras wrote:
> Update remove_dma_window() so it can be used to remove DDW with a given
> property name.
>
> This enables the creation of new property names for DDW, so we can
> have different usage for it, like indirect mapping.
>
> Also, add return values to it so we can check if the property was found
> while removing the active DDW. This allows skipping the remaining property
> names while reducing the impact of multiple property names.
>
> Signed-off-by: Leonardo Bras <[email protected]>
> Reviewed-by: Alexey Kardashevskiy <[email protected]>
> ---


Reviewed-by: Frederic Barrat <[email protected]>



> arch/powerpc/platforms/pseries/iommu.c | 18 ++++++++++--------
> 1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index a47f59a8f107..901f290999d0 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -844,31 +844,33 @@ static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
> __remove_dma_window(np, ddw_avail, liobn);
> }
>
> -static void remove_ddw(struct device_node *np, bool remove_prop)
> +static int remove_ddw(struct device_node *np, bool remove_prop, const char *win_name)
> {
> struct property *win;
> u32 ddw_avail[DDW_APPLICABLE_SIZE];
> int ret = 0;
>
> + win = of_find_property(np, win_name, NULL);
> + if (!win)
> + return -EINVAL;
> +
> ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
> &ddw_avail[0], DDW_APPLICABLE_SIZE);
> if (ret)
> - return;
> + return 0;
>
> - win = of_find_property(np, DIRECT64_PROPNAME, NULL);
> - if (!win)
> - return;
>
> if (win->length >= sizeof(struct dynamic_dma_window_prop))
> remove_dma_window(np, ddw_avail, win);
>
> if (!remove_prop)
> - return;
> + return 0;
>
> ret = of_remove_property(np, win);
> if (ret)
> pr_warn("%pOF: failed to remove direct window property: %d\n",
> np, ret);
> + return 0;
> }
>
> static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift)
> @@ -921,7 +923,7 @@ static int find_existing_ddw_windows(void)
> for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
> direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
> if (!direct64 || len < sizeof(*direct64)) {
> - remove_ddw(pdn, true);
> + remove_ddw(pdn, true, DIRECT64_PROPNAME);
> continue;
> }
>
> @@ -1565,7 +1567,7 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti
> * we have to remove the property when releasing
> * the device node.
> */
> - remove_ddw(np, false);
> + remove_ddw(np, false, DIRECT64_PROPNAME);
> if (pci && pci->table_group)
> iommu_pseries_free_group(pci->table_group,
> np->full_name);
>

2021-08-27 17:08:38

by Frederic Barrat

[permalink] [raw]
Subject: Re: [PATCH v6 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping



On 17/08/2021 08:39, Leonardo Bras wrote:
> So far it's assumed possible to map the guest RAM 1:1 to the bus, which
> works with a small number of devices. SRIOV changes it as the user can
> configure hundreds VFs and since phyp preallocates TCEs and does not
> allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
> per a PE to limit waste of physical pages.
>
> As of today, if the assumed direct mapping is not possible, DDW creation
> is skipped and the default DMA window "ibm,dma-window" is used instead.
>
> By using DDW, indirect mapping can get more TCEs than available for the
> default DMA window, and also get access to using much larger pagesizes
> (16MB as implemented in qemu vs 4k from default DMA window), causing a
> significant increase on the maximum amount of memory that can be IOMMU
> mapped at the same time.
>
> Indirect mapping will only be used if direct mapping is not a
> possibility.
>
> For indirect mapping, it's necessary to re-create the iommu_table with
> the new DMA window parameters, so iommu_alloc() can use it.
>
> Removing the default DMA window for using DDW with indirect mapping
> is only allowed if there is no current IOMMU memory allocated in
> the iommu_table. enable_ddw() is aborted otherwise.
>
> Even though there won't be both direct and indirect mappings at the
> same time, we can't reuse the DIRECT64_PROPNAME property name, or else
> an older kexec()ed kernel can assume direct mapping, and skip
> iommu_alloc(), causing undesirable behavior.
> So a new property name DMA64_PROPNAME "linux,dma64-ddr-window-info"
> was created to represent a DDW that does not allow direct mapping.
>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---


I think it looks ok now as it was mostly me who was misunderstanding one
part of the previous iteration.
Reviewed-by: Frederic Barrat <[email protected]>

Sorry for the late review, I was enjoying some time off. And thanks for
that series, I believe it should help with those bugs complaining about
lack of DMA space. It was also very educational for me, thanks to you
and Alexey for your detailed answers.

Fred


> arch/powerpc/platforms/pseries/iommu.c | 89 +++++++++++++++++++++-----
> 1 file changed, 74 insertions(+), 15 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index e11c00b2dc1e..0eccc29f5573 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -375,6 +375,7 @@ static DEFINE_SPINLOCK(direct_window_list_lock);
> /* protects initializing window twice for same device */
> static DEFINE_MUTEX(direct_window_init_mutex);
> #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
> +#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
>
> static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
> unsigned long num_pfn, const void *arg)
> @@ -940,6 +941,7 @@ static int find_existing_ddw_windows(void)
> return 0;
>
> find_existing_ddw_windows_named(DIRECT64_PROPNAME);
> + find_existing_ddw_windows_named(DMA64_PROPNAME);
>
> return 0;
> }
> @@ -1226,14 +1228,17 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> struct ddw_create_response create;
> int page_shift;
> u64 win_addr;
> + const char *win_name;
> struct device_node *dn;
> u32 ddw_avail[DDW_APPLICABLE_SIZE];
> struct direct_window *window;
> struct property *win64;
> bool ddw_enabled = false;
> struct failed_ddw_pdn *fpdn;
> - bool default_win_removed = false;
> + bool default_win_removed = false, direct_mapping = false;
> bool pmem_present;
> + struct pci_dn *pci = PCI_DN(pdn);
> + struct iommu_table *tbl = pci->table_group->tables[0];
>
> dn = of_find_node_by_type(NULL, "ibm,pmemory");
> pmem_present = dn != NULL;
> @@ -1242,6 +1247,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> mutex_lock(&direct_window_init_mutex);
>
> if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {
> + direct_mapping = (len >= max_ram_len);
> ddw_enabled = true;
> goto out_unlock;
> }
> @@ -1322,8 +1328,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> query.page_size);
> goto out_failed;
> }
> - /* verify the window * number of ptes will map the partition */
> - /* check largest block * page size > max memory hotplug addr */
> +
> +
> /*
> * The "ibm,pmemory" can appear anywhere in the address space.
> * Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
> @@ -1339,13 +1345,25 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> dev_info(&dev->dev, "Skipping ibm,pmemory");
> }
>
> + /* check if the available block * number of ptes will map everything */
> if (query.largest_available_block < (1ULL << (len - page_shift))) {
> dev_dbg(&dev->dev,
> "can't map partition max 0x%llx with %llu %llu-sized pages\n",
> 1ULL << len,
> query.largest_available_block,
> 1ULL << page_shift);
> - goto out_failed;
> +
> + /* DDW + IOMMU on single window may fail if there is any allocation */
> + if (default_win_removed && iommu_table_in_use(tbl)) {
> + dev_dbg(&dev->dev, "current IOMMU table in use, can't be replaced.\n");
> + goto out_failed;
> + }
> +
> + len = order_base_2(query.largest_available_block << page_shift);
> + win_name = DMA64_PROPNAME;
> + } else {
> + direct_mapping = true;
> + win_name = DIRECT64_PROPNAME;
> }
>
> ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
> @@ -1356,8 +1374,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> create.liobn, dn);
>
> win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
> - win64 = ddw_property_create(DIRECT64_PROPNAME, create.liobn, win_addr,
> - page_shift, len);
> + win64 = ddw_property_create(win_name, create.liobn, win_addr, page_shift, len);
> +
> if (!win64) {
> dev_info(&dev->dev,
> "couldn't allocate property, property name, or value\n");
> @@ -1375,15 +1393,54 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> if (!window)
> goto out_del_prop;
>
> - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> - win64->value, tce_setrange_multi_pSeriesLP_walk);
> - if (ret) {
> - dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
> - dn, ret);
> + if (direct_mapping) {
> + /* DDW maps the whole partition, so enable direct DMA mapping */
> + ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> + win64->value, tce_setrange_multi_pSeriesLP_walk);
> + if (ret) {
> + dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
> + dn, ret);
>
> /* Make sure to clean DDW if any TCE was set*/
> clean_dma_window(pdn, win64->value);
> - goto out_del_list;
> + goto out_del_list;
> + }
> + } else {
> + struct iommu_table *newtbl;
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
> + const unsigned long mask = IORESOURCE_MEM_64 | IORESOURCE_MEM;
> +
> + /* Look for MMIO32 */
> + if ((pci->phb->mem_resources[i].flags & mask) == IORESOURCE_MEM)
> + break;
> + }
> +
> + if (i == ARRAY_SIZE(pci->phb->mem_resources))
> + goto out_del_list;
> +
> + /* New table for using DDW instead of the default DMA window */
> + newtbl = iommu_pseries_alloc_table(pci->phb->node);
> + if (!newtbl) {
> + dev_dbg(&dev->dev, "couldn't create new IOMMU table\n");
> + goto out_del_list;
> + }
> +
> + iommu_table_setparms_common(newtbl, pci->phb->bus->number, create.liobn, win_addr,
> + 1UL << len, page_shift, NULL, &iommu_table_lpar_multi_ops);
> + iommu_init_table(newtbl, pci->phb->node, pci->phb->mem_resources[i].start,
> + pci->phb->mem_resources[i].end);
> +
> + pci->table_group->tables[1] = newtbl;
> +
> + /* Keep default DMA window stuct if removed */
> + if (default_win_removed) {
> + tbl->it_size = 0;
> + kfree(tbl->it_map);
> + }
> +
> + set_iommu_table_base(&dev->dev, newtbl);
> }
>
> spin_lock(&direct_window_list_lock);
> @@ -1427,10 +1484,10 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> * as RAM, then we failed to create a window to cover persistent
> * memory and need to set the DMA limit.
> */
> - if (pmem_present && ddw_enabled && (len == max_ram_len))
> + if (pmem_present && ddw_enabled && direct_mapping && len == max_ram_len)
> dev->dev.bus_dma_limit = dev->dev.archdata.dma_offset + (1ULL << len);
>
> - return ddw_enabled;
> + return ddw_enabled && direct_mapping;
> }
>
> static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
> @@ -1572,7 +1629,9 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti
> * we have to remove the property when releasing
> * the device node.
> */
> - remove_ddw(np, false, DIRECT64_PROPNAME);
> + if (remove_ddw(np, false, DIRECT64_PROPNAME))
> + remove_ddw(np, false, DMA64_PROPNAME);
> +
> if (pci && pci->table_group)
> iommu_pseries_free_group(pci->table_group,
> np->full_name);
>

2021-08-27 17:56:41

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v6 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

Hello Fred,

On Fri, 2021-08-27 at 19:06 +0200, Frederic Barrat wrote:
>
> I think it looks ok now as it was mostly me who was misunderstanding
> one
> part of the previous iteration.
> Reviewed-by: Frederic Barrat <[email protected]>
>

Thank you for reviewing this series!

> Sorry for the late review, I was enjoying some time off. And thanks
> for
> that series, I believe it should help with those bugs complaining
> about
> lack of DMA space. It was also very educational for me, thanks to you
> and Alexey for your detailed answers.

Working on this series caused me to learn a lot about how DMA and IOMMU
works in Power, and it was a great experience. Thanks to Alexey who
helped and guided me through this!

Best regards,
Leonardo Bras

2021-08-31 14:01:41

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH v6 00/11] DDW + Indirect Mapping

On Tue, 17 Aug 2021 03:39:18 -0300, Leonardo Bras wrote:
> So far it's assumed possible to map the guest RAM 1:1 to the bus, which
> works with a small number of devices. SRIOV changes it as the user can
> configure hundreds VFs and since phyp preallocates TCEs and does not
> allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
> per a PE to limit waste of physical pages.
>
> As of today, if the assumed direct mapping is not possible, DDW creation
> is skipped and the default DMA window "ibm,dma-window" is used instead.
>
> [...]

Applied to powerpc/next.

[01/11] powerpc/pseries/iommu: Replace hard-coded page shift
https://git.kernel.org/powerpc/c/0c634bafe3bbee7a36dca7f1277057e05bf14d91
[02/11] powerpc/kernel/iommu: Add new iommu_table_in_use() helper
https://git.kernel.org/powerpc/c/3c33066a21903076722a2881556a92aa3cd7d359
[03/11] powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper
https://git.kernel.org/powerpc/c/4ff8677a0b192a58d998d1d34fc5168203041a24
[04/11] powerpc/pseries/iommu: Add ddw_list_new_entry() helper
https://git.kernel.org/powerpc/c/92a23219299cedde52e3298788484f4875d5ce0f
[05/11] powerpc/pseries/iommu: Allow DDW windows starting at 0x00
https://git.kernel.org/powerpc/c/2ca73c54ce24489518a56d816331b774044c2445
[06/11] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()
https://git.kernel.org/powerpc/c/7ed2ed2db2685a285cb09ab330dc4efea0b64022
[07/11] powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new helper
https://git.kernel.org/powerpc/c/fc8cba8f989fb98e496b33a78476861e246c42a0
[08/11] powerpc/pseries/iommu: Update remove_dma_window() to accept property name
https://git.kernel.org/powerpc/c/a5fd95120c653962a9e75e260a35436b96d2c991
[09/11] powerpc/pseries/iommu: Find existing DDW with given property name
https://git.kernel.org/powerpc/c/8599395d34f2dd7b77bef42da1d99798e7a3d58f
[10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping
https://git.kernel.org/powerpc/c/381ceda88c4c4c8345cad1cffa6328892f15dca6
[11/11] powerpc/pseries/iommu: Rename "direct window" to "dma window"
https://git.kernel.org/powerpc/c/57dbbe590f152e5e8a3ff8bf5ba163df34eeae0b

cheers

2021-08-31 20:40:58

by David Christensen

[permalink] [raw]
Subject: Re: [PATCH v6 00/11] DDW + Indirect Mapping



On 8/31/21 1:18 PM, Leonardo Brás wrote:
> Hello David,
>
> Sorry for the delay, I did not get your mail because I was not CC'd
> in your reply (you sent the mail just to the mailing list).
>
> Replies bellow:
>
> On Mon, 2021-08-30 at 10:48 -0700, David Christensen wrote:
>> On 8/16/21 11:39 PM, Leonardo Bras wrote:
>>> So far it's assumed possible to map the guest RAM 1:1 to the bus,
>>> which
>>> works with a small number of devices. SRIOV changes it as the user
>>> can
>>> configure hundreds VFs and since phyp preallocates TCEs and does not
>>> allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
>>> per a PE to limit waste of physical pages.
>>>
>>> As of today, if the assumed direct mapping is not possible, DDW
>>> creation
>>> is skipped and the default DMA window "ibm,dma-window" is used
>>> instead.
>>>
>>> Using the DDW instead of the default DMA window may allow to expand
>>> the
>>> amount of memory that can be DMA-mapped, given the number of pages
>>> (TCEs)
>>> may stay the same (or increase) and the default DMA window offers
>>> only
>>> 4k-pages while DDW may offer larger pages (4k, 64k, 16M ...).
>>
>> So if I'm reading this correctly, VFIO applications requiring hugepage
>> DMA mappings (e.g. 16M or 2GB) can be supported on an LPAR or DLPAR
>> after this change, is that correct?
>
> Different DDW IOMMU page sizes were already supported in Linux (4k,
> 64k, 16M) for a while now, and the remaining page sizes in LoPAR were
> enabled in the following patch:
> http://patchwork.ozlabs.org/project/linuxppc-dev/patch/[email protected]/
> (commit 472724111f0f72042deb6a9dcee9578e5398a1a1)
>
> The thing is there are two ways of using DMA:
> - Direct DMA, mapping the whole memory space of the host, which
> requires a lot of DMA space if the guest memory is huge. This already
> supports DDW and allows using the bigger pagesizes.
> This happens on device/bus probe.
>
> - Indirect DMA with IOMMU, mapping memory regions on demand, and un-
> mapping after use. This requires much less DMA space, but causes an
> overhead because an hcall is necessary for mapping and un-mapping.
> Before this series, Indirect DMA was only possible with the 'default
> DMA window' which allows using only 4k pages.
>
> This series allow Indirect DMA using DDW when available, which usually
> means bigger pagesizes and more TCEs, and so more DMA space.

How is the mapping method selected? LPAR creation via the HMC, Linux
kernel load parameter, or some other method?

The hcall overhead doesn't seem too worrisome when mapping 1GB pages so
the Indirect DMA method might be best in my situation (DPDK).

Dave

2021-08-31 20:50:51

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v6 00/11] DDW + Indirect Mapping

On Tue, 2021-08-31 at 13:39 -0700, David Christensen wrote:
> >
> > This series allow Indirect DMA using DDW when available, which
> > usually
> > means bigger pagesizes and more TCEs, and so more DMA space.
>
> How is the mapping method selected?  LPAR creation via the HMC, Linux
> kernel load parameter, or some other method?

At device/bus probe, if there is enough DMA space available for Direct
DMA, then it's used. If not, it uses indirect DMA.

>
> The hcall overhead doesn't seem too worrisome when mapping 1GB pages
> so
> the Indirect DMA method might be best in my situation (DPDK).

Well, it depends on usage.
I mean, the recommended use of IOMMU is to map, transmit and then
unmap, but this will vary on the implementation of the driver.

If, for example, there is some reuse of the DMA mapping, as in a
previous patchset I sent (IOMMU Pagecache), then the hcall overhead can
be reduced drastically.

>
> Dave
Best regards,
Leonardo

2021-08-31 21:50:21

by Leonardo Brás

[permalink] [raw]
Subject: Re: [PATCH v6 00/11] DDW + Indirect Mapping

Hello David,

Sorry for the delay, I did not get your mail because I was not CC'd
in your reply (you sent the mail just to the mailing list).

Replies bellow:

On Mon, 2021-08-30 at 10:48 -0700, David Christensen wrote:
> On 8/16/21 11:39 PM, Leonardo Bras wrote:
> > So far it's assumed possible to map the guest RAM 1:1 to the bus,
> > which
> > works with a small number of devices. SRIOV changes it as the user
> > can
> > configure hundreds VFs and since phyp preallocates TCEs and does not
> > allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
> > per a PE to limit waste of physical pages.
> >
> > As of today, if the assumed direct mapping is not possible, DDW
> > creation
> > is skipped and the default DMA window "ibm,dma-window" is used
> > instead.
> >
> > Using the DDW instead of the default DMA window may allow to expand
> > the
> > amount of memory that can be DMA-mapped, given the number of pages
> > (TCEs)
> > may stay the same (or increase) and the default DMA window offers
> > only
> > 4k-pages while DDW may offer larger pages (4k, 64k, 16M ...).
>
> So if I'm reading this correctly, VFIO applications requiring hugepage
> DMA mappings (e.g. 16M or 2GB) can be supported on an LPAR or DLPAR
> after this change, is that correct?  

Different DDW IOMMU page sizes were already supported in Linux (4k,
64k, 16M) for a while now, and the remaining page sizes in LoPAR were
enabled in the following patch:
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/[email protected]/
(commit 472724111f0f72042deb6a9dcee9578e5398a1a1)

The thing is there are two ways of using DMA:
- Direct DMA, mapping the whole memory space of the host, which
requires a lot of DMA space if the guest memory is huge. This already
supports DDW and allows using the bigger pagesizes.
This happens on device/bus probe.

- Indirect DMA with IOMMU, mapping memory regions on demand, and un-
mapping after use. This requires much less DMA space, but causes an
overhead because an hcall is necessary for mapping and un-mapping.
Before this series, Indirect DMA was only possible with the 'default
DMA window' which allows using only 4k pages.

This series allow Indirect DMA using DDW when available, which usually
means bigger pagesizes and more TCEs, and so more DMA space.


tl;dr this patchset means you can have more DMA space in Indirect DMA,
because you are using DDW instead of the Default DMA window.

> Any limitations based on processor
> or pHyp revision levels?

The IOMMU page size will be limited by the sizes offered by processor
and hypervisor. They are announced at "IO Page Sizes" output of
ibm,query-pe-dma-window, but now the biggest pagesize automatically
selected with commit 472724111f0f72042deb6a9dcee9578e5398a1a1 above
mentioned.

Hope this helps, please let me know if there is any remaining question.

Best regards,
Leonardo