2015-04-25 12:16:09

by Alexey Kardashevskiy

Subject: [PATCH kernel v9 00/32] powerpc/iommu/vfio: Enable Dynamic DMA windows


This enables the sPAPR-defined feature called Dynamic DMA windows (DDW).

Each Partitionable Endpoint (IOMMU group) has an address range on a PCI bus
where devices are allowed to do DMA. These ranges are called DMA windows.
By default, there is a single DMA window, 1GB or 2GB in size, mapped at zero
on a PCI bus.

High-speed devices may suffer from the limited size of this window.
Recent host kernels use a TCE bypass window on the POWER8 CPU which implements
a direct PCI bus address range mapping (at an offset of 1<<59) to the host memory.

For guests, PAPR defines a DDW RTAS API which allows pseries guests
to query the hypervisor about DDW support and capabilities (page size mask
for now). A pseries guest may request an additional (to the default)
DMA window using this RTAS API.
The existing pseries Linux guests request an additional window as big as
the guest RAM and map the entire guest RAM into it, which effectively creates
a direct mapping of the guest memory to a PCI bus.

The multiple DMA windows feature is supported by POWER7/POWER8 CPUs; however
this patchset only adds support for POWER8 as TCE tables are implemented
quite differently on POWER7 and POWER7 is not the highest priority.

This patchset reworks PPC64 IOMMU code and adds necessary structures
to support big windows.

Once a Linux guest discovers the presence of DDW, it does:
1. query the hypervisor about the number of available windows and page size masks;
2. create a window with the biggest possible page size (today 4K/64K/16M);
3. map the entire guest RAM via H_PUT_TCE* hypercalls;
4. switch dma_ops to direct_dma_ops on the selected PE.

Once this is done, H_PUT_TCE is not called anymore for 64-bit devices and
the guest does not waste time on DMA map/unmap operations.
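
As a rough guest-side illustration of steps 1-4 above (the host-side support
is what this patchset adds), the flow looks roughly like below. The helpers
and types here are made up for illustration only; the real code lives in
enable_ddw() in arch/powerpc/platforms/pseries/iommu.c:

/* Sketch only: ddw_query/ddw_create/ddw_map_all_ram/biggest_page_shift
 * and the ddw_* types are hypothetical names. */
static int ddw_enable_sketch(struct pci_dev *pdev, unsigned long ram_size)
{
	struct ddw_caps caps;
	struct ddw_window win;

	/* 1. "ibm,query-pe-dma-window": number of windows, page size mask */
	if (ddw_query(pdev, &caps))
		return -ENODEV;

	/* 2. "ibm,create-pe-dma-window" with the largest supported page size */
	if (ddw_create(pdev, biggest_page_shift(caps.page_size_mask),
			order_base_2(ram_size), &win))
		return -ENOSPC;

	/* 3. Map all of guest RAM via H_PUT_TCE/H_PUT_TCE_INDIRECT */
	ddw_map_all_ram(&win, ram_size);

	/* 4. 64-bit DMA now bypasses per-mapping hypercalls entirely */
	set_dma_ops(&pdev->dev, &dma_direct_ops);

	return 0;
}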

Note that 32-bit devices won't use DDW and will keep using the default
DMA window, so KVM optimizations will be required (to be posted later).

This is pushed to [email protected]:aik/linux.git
+ d9b711d...4d0247b 4d0247b -> vfio-for-github (forced update)

Changes:
v9:
* rebased on top of SRIOV (which is in upstream now)
* fixed multiple comments from David
* reworked ownership patches
* removed vfio: powerpc/spapr: Do cleanup when releasing the group (used to be #2)
as updated #1 should do this
* moved "powerpc/powernv: Implement accessor to TCE entry" to a separate patch
* added a patch which moves TCE Kill register address to PE from IOMMU table

v8:
* fixed a bug in error fallback in "powerpc/mmu: Add userspace-to-physical
addresses translation cache"
* fixed subject in "vfio: powerpc/spapr: Check that IOMMU page is fully
contained by system page"
* moved v2 documentation to the correct patch
* added checks for failed vzalloc() in "powerpc/iommu: Add userspace view
of TCE table"

v7:
* moved memory preregistration to the current process's MMU context
* added code preventing unregistration if some pages are still mapped;
for this, a userspace view of the table is stored in iommu_table
* added locked_vm counting for DDW tables (including userspace view of those)

v6:
* fixed a bunch of errors in "vfio: powerpc/spapr: Support Dynamic DMA windows"
* moved static IOMMU properties from iommu_table_group to iommu_table_group_ops

v5:
* added SPAPR_TCE_IOMMU_v2 to tell the userspace that there is a memory
pre-registration feature
* added backward compatibility
* renamed few things (mostly powerpc_iommu -> iommu_table_group)

v4:
* moved patches around to have VFIO and PPC patches separated as much as
possible
* now works with the existing upstream QEMU

v3:
* redesigned the whole thing
* multiple IOMMU groups per PHB -> one PHB is needed for VFIO in the guest ->
no problems with locked_vm counting; also we save memory on actual tables
* guest RAM preregistration is required for DDW
* PEs (IOMMU groups) are passed to VFIO with no DMA windows at all so
we do not bother with iommu_table::it_map anymore
* added multilevel TCE tables support to support really huge guests

v2:
* added missing __pa() in "powerpc/powernv: Release replaced TCE"
* reposted to make some noise




Alexey Kardashevskiy (32):
powerpc/iommu: Split iommu_free_table into 2 helpers
Revert "powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table
dynamically"
vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU
driver
vfio: powerpc/spapr: Check that IOMMU page is fully contained by
system page
vfio: powerpc/spapr: Use it_page_size
vfio: powerpc/spapr: Move locked_vm accounting to helpers
vfio: powerpc/spapr: Disable DMA mappings on disabled container
vfio: powerpc/spapr: Moving pinning/unpinning to helpers
vfio: powerpc/spapr: Rework groups attaching
powerpc/powernv: Do not set "read" flag if direction==DMA_NONE
powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table
powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group
vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership
control
powerpc/iommu: Fix IOMMU ownership control functions
powerpc/powernv/ioda/ioda2: Rework TCE invalidation in
tce_build()/tce_free()
powerpc/powernv/ioda: Move TCE kill register address to PE
powerpc/powernv: Implement accessor to TCE entry
powerpc/iommu/powernv: Release replaced TCE
powerpc/powernv/ioda2: Rework iommu_table creation
powerpc/powernv/ioda2: Introduce
pnv_pci_create_table/pnv_pci_free_table
powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window
powerpc/powernv: Implement multilevel TCE tables
powerpc/powernv/ioda: Define and implement DMA table/window management
callbacks
powerpc/powernv/ioda2: Use new helpers to do proper cleanup on PE
release
vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework ownership
powerpc/iommu: Add userspace view of TCE table
powerpc/iommu/ioda2: Add get_table_size() to calculate the size of
future table
powerpc/mmu: Add userspace-to-physical addresses translation cache
vfio: powerpc/spapr: Register memory and define IOMMU v2
vfio: powerpc/spapr: Use 32bit DMA window properties from table_group
vfio: powerpc/spapr: Support multiple groups in one container if
possible
vfio: powerpc/spapr: Support Dynamic DMA windows

Documentation/vfio.txt | 50 +-
arch/powerpc/include/asm/iommu.h | 111 ++-
arch/powerpc/include/asm/machdep.h | 25 -
arch/powerpc/include/asm/mmu-hash64.h | 3 +
arch/powerpc/include/asm/mmu_context.h | 17 +
arch/powerpc/include/asm/pci-bridge.h | 2 +-
arch/powerpc/kernel/eeh.c | 2 +-
arch/powerpc/kernel/iommu.c | 303 ++++----
arch/powerpc/kernel/vio.c | 5 +
arch/powerpc/mm/Makefile | 1 +
arch/powerpc/mm/mmu_context_hash64.c | 6 +
arch/powerpc/mm/mmu_context_hash64_iommu.c | 215 ++++++
arch/powerpc/platforms/cell/iommu.c | 8 +-
arch/powerpc/platforms/pasemi/iommu.c | 7 +-
arch/powerpc/platforms/powernv/pci-ioda.c | 520 ++++++++++----
arch/powerpc/platforms/powernv/pci-p5ioc2.c | 33 +-
arch/powerpc/platforms/powernv/pci.c | 275 +++++--
arch/powerpc/platforms/powernv/pci.h | 20 +-
arch/powerpc/platforms/pseries/iommu.c | 138 ++--
arch/powerpc/sysdev/dart_iommu.c | 12 +-
drivers/vfio/vfio_iommu_spapr_tce.c | 1034 ++++++++++++++++++++++++---
include/uapi/linux/vfio.h | 88 ++-
22 files changed, 2304 insertions(+), 571 deletions(-)
create mode 100644 arch/powerpc/mm/mmu_context_hash64_iommu.c

--
2.0.0


2015-04-25 12:16:06

by Alexey Kardashevskiy

Subject: [PATCH kernel v9 01/32] powerpc/iommu: Split iommu_free_table into 2 helpers

The iommu_free_table helper releases the memory it uses (the TCE table and
@it_map) and releases the iommu_table struct as well. We might not want
the very last step as we sometimes store iommu_table in a parent structure.

This splits the helper: the new iommu_reset_table() releases the table's
resources and clears the struct, while iommu_free_table() additionally
frees the struct itself.
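
Roughly, the intended usage after the split (a sketch, not code from this
patch; pe and tbl stand for any embedded or standalone table):

	/* table embedded in a parent structure: release contents only */
	iommu_reset_table(&pe->tce32_table, "ioda2");

	/* standalone, kzalloc'ed table: release contents and the struct */
	iommu_free_table(tbl, "dart");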

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/iommu.h | 1 +
arch/powerpc/kernel/iommu.c | 58 +++++++++++++++++++++++-----------------
2 files changed, 35 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 1e27d63..e2cef38 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -105,6 +105,7 @@ static inline void *get_iommu_table_base(struct device *dev)
}

/* Frees table for an individual device node */
+extern void iommu_reset_table(struct iommu_table *tbl, const char *node_name);
extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);

/* Initializes an iommu_table based in values set in the passed-in
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b054f33..5c154e1 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -708,23 +708,44 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
return tbl;
}

+void iommu_reset_table(struct iommu_table *tbl, const char *node_name)
+{
+ if (!tbl)
+ return;
+
+ if (tbl->it_map) {
+ unsigned long bitmap_sz;
+ unsigned int order;
+
+ /*
+ * In case we have reserved the first bit, we should not emit
+ * the warning below.
+ */
+ if (tbl->it_offset == 0)
+ clear_bit(0, tbl->it_map);
+
+ /* verify that table contains no entries */
+ if (!bitmap_empty(tbl->it_map, tbl->it_size))
+ pr_warn("%s: Unexpected TCEs for %s\n", __func__,
+ node_name);
+
+ /* calculate bitmap size in bytes */
+ bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
+
+ /* free bitmap */
+ order = get_order(bitmap_sz);
+ free_pages((unsigned long) tbl->it_map, order);
+ }
+
+ memset(tbl, 0, sizeof(*tbl));
+}
+
void iommu_free_table(struct iommu_table *tbl, const char *node_name)
{
- unsigned long bitmap_sz;
- unsigned int order;
-
- if (!tbl || !tbl->it_map) {
- printk(KERN_ERR "%s: expected TCE map for %s\n", __func__,
- node_name);
+ if (!tbl)
return;
- }

- /*
- * In case we have reserved the first bit, we should not emit
- * the warning below.
- */
- if (tbl->it_offset == 0)
- clear_bit(0, tbl->it_map);
+ iommu_reset_table(tbl, node_name);

#ifdef CONFIG_IOMMU_API
if (tbl->it_group) {
@@ -733,17 +754,6 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
}
#endif

- /* verify that table contains no entries */
- if (!bitmap_empty(tbl->it_map, tbl->it_size))
- pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
-
- /* calculate bitmap size in bytes */
- bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
-
- /* free bitmap */
- order = get_order(bitmap_sz);
- free_pages((unsigned long) tbl->it_map, order);
-
/* free table */
kfree(tbl);
}
--
2.0.0

2015-04-25 12:23:45

by Alexey Kardashevskiy

Subject: [PATCH kernel v9 02/32] Revert "powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically"

This reverts commit 9e8d4a19ab66ec9e132d405357b9108a4f26efd3 as
tce32_table has exactly the same lifetime as the whole PE.

This makes use of a new iommu_reset_table() helper instead.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/iommu.h | 3 ---
arch/powerpc/platforms/powernv/pci-ioda.c | 35 +++++++++++++------------------
arch/powerpc/platforms/powernv/pci.h | 2 +-
3 files changed, 15 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index e2cef38..9d320e0 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -79,9 +79,6 @@ struct iommu_table {
struct iommu_group *it_group;
#endif
void (*set_bypass)(struct iommu_table *tbl, bool enable);
-#ifdef CONFIG_PPC_POWERNV
- void *data;
-#endif
};

/* Pure 2^n version of get_order */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 920c252..eff26ed 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1086,10 +1086,6 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
return;
}

- pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
- GFP_KERNEL, hose->node);
- pe->tce32_table->data = pe;
-
/* Associate it with all child devices */
pnv_ioda_setup_same_PE(bus, pe);

@@ -1295,7 +1291,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
bus = dev->bus;
hose = pci_bus_to_host(bus);
phb = hose->private_data;
- tbl = pe->tce32_table;
+ tbl = &pe->tce32_table;
addr = tbl->it_base;

opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
@@ -1310,9 +1306,8 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
if (rc)
pe_warn(pe, "OPAL error %ld release DMA window\n", rc);

- iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
+ iommu_reset_table(tbl, of_node_full_name(dev->dev.of_node));
free_pages(addr, get_order(TCE32_TABLE_SIZE));
- pe->tce32_table = NULL;
}

static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs)
@@ -1460,10 +1455,6 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
continue;
}

- pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
- GFP_KERNEL, hose->node);
- pe->tce32_table->data = pe;
-
/* Put PE to the list */
mutex_lock(&phb->ioda.pe_list_mutex);
list_add_tail(&pe->list, &phb->ioda.pe_list);
@@ -1598,7 +1589,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev

pe = &phb->ioda.pe_array[pdn->pe_number];
WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
- set_iommu_table_base_and_group(&pdev->dev, pe->tce32_table);
+ set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
}

static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
@@ -1625,7 +1616,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
} else {
dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
set_dma_ops(&pdev->dev, &dma_iommu_ops);
- set_iommu_table_base(&pdev->dev, pe->tce32_table);
+ set_iommu_table_base(&pdev->dev, &pe->tce32_table);
}
*pdev->dev.dma_mask = dma_mask;
return 0;
@@ -1662,9 +1653,9 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
list_for_each_entry(dev, &bus->devices, bus_list) {
if (add_to_iommu_group)
set_iommu_table_base_and_group(&dev->dev,
- pe->tce32_table);
+ &pe->tce32_table);
else
- set_iommu_table_base(&dev->dev, pe->tce32_table);
+ set_iommu_table_base(&dev->dev, &pe->tce32_table);

if (dev->subordinate)
pnv_ioda_setup_bus_dma(pe, dev->subordinate,
@@ -1754,7 +1745,8 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
__be64 *startp, __be64 *endp, bool rm)
{
- struct pnv_ioda_pe *pe = tbl->data;
+ struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
+ tce32_table);
struct pnv_phb *phb = pe->phb;

if (phb->type == PNV_PHB_IODA1)
@@ -1817,7 +1809,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
}

/* Setup linux iommu table */
- tbl = pe->tce32_table;
+ tbl = &pe->tce32_table;
pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
base << 28, IOMMU_PAGE_SHIFT_4K);

@@ -1862,7 +1854,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,

static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
{
- struct pnv_ioda_pe *pe = tbl->data;
+ struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
+ tce32_table);
uint16_t window_id = (pe->pe_number << 1 ) + 1;
int64_t rc;

@@ -1907,10 +1900,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
pe->tce_bypass_base = 1ull << 59;

/* Install set_bypass callback for VFIO */
- pe->tce32_table->set_bypass = pnv_pci_ioda2_set_bypass;
+ pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;

/* Enable bypass by default */
- pnv_pci_ioda2_set_bypass(pe->tce32_table, true);
+ pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
}

static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
@@ -1958,7 +1951,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
}

/* Setup linux iommu table */
- tbl = pe->tce32_table;
+ tbl = &pe->tce32_table;
pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
IOMMU_PAGE_SHIFT_4K);

diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 070ee88..c954c64 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -57,7 +57,7 @@ struct pnv_ioda_pe {
/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
int tce32_seg;
int tce32_segcount;
- struct iommu_table *tce32_table;
+ struct iommu_table tce32_table;
phys_addr_t tce_inval_reg_phys;

/* 64-bit TCE bypass region */
--
2.0.0

2015-04-25 12:24:26

by Alexey Kardashevskiy

Subject: [PATCH kernel v9 03/32] vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU driver

This moves the page pinning (get_user_pages_fast()/put_page()) code out of
the platform IOMMU code and puts it into the VFIO IOMMU driver where it
belongs, as the platform code does not deal with page pinning.

This makes iommu_take_ownership()/iommu_release_ownership() deal with
the IOMMU table bitmap only.

This removes page unpinning from iommu_take_ownership() as the actual
TCE table might contain garbage and doing put_page() on it is undefined
behaviour.

Besides the last part, the rest of the patch is mechanical.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <[email protected]>
Reviewed-by: David Gibson <[email protected]>
---
Changes:
v9:
* added missing tce_iommu_clear call after iommu_release_ownership()
* brought @offset (a local variable) back to make patch even more
mechanical

v4:
* s/iommu_tce_build(tbl, entry + 1/iommu_tce_build(tbl, entry + i/
---
arch/powerpc/include/asm/iommu.h | 4 --
arch/powerpc/kernel/iommu.c | 55 -------------------------
drivers/vfio/vfio_iommu_spapr_tce.c | 80 +++++++++++++++++++++++++++++++------
3 files changed, 67 insertions(+), 72 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 9d320e0..4955233 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -199,10 +199,6 @@ extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
unsigned long hwaddr, enum dma_data_direction direction);
extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
unsigned long entry);
-extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
- unsigned long entry, unsigned long pages);
-extern int iommu_put_tce_user_mode(struct iommu_table *tbl,
- unsigned long entry, unsigned long tce);

extern void iommu_flush_tce(struct iommu_table *tbl);
extern int iommu_take_ownership(struct iommu_table *tbl);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 5c154e1..fc8b253 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1001,30 +1001,6 @@ unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
}
EXPORT_SYMBOL_GPL(iommu_clear_tce);

-int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
- unsigned long entry, unsigned long pages)
-{
- unsigned long oldtce;
- struct page *page;
-
- for ( ; pages; --pages, ++entry) {
- oldtce = iommu_clear_tce(tbl, entry);
- if (!oldtce)
- continue;
-
- page = pfn_to_page(oldtce >> PAGE_SHIFT);
- WARN_ON(!page);
- if (page) {
- if (oldtce & TCE_PCI_WRITE)
- SetPageDirty(page);
- put_page(page);
- }
- }
-
- return 0;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages);
-
/*
* hwaddr is a kernel virtual address here (0xc... bazillion),
* tce_build converts it to a physical address.
@@ -1054,35 +1030,6 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
}
EXPORT_SYMBOL_GPL(iommu_tce_build);

-int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
- unsigned long tce)
-{
- int ret;
- struct page *page = NULL;
- unsigned long hwaddr, offset = tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
- enum dma_data_direction direction = iommu_tce_direction(tce);
-
- ret = get_user_pages_fast(tce & PAGE_MASK, 1,
- direction != DMA_TO_DEVICE, &page);
- if (unlikely(ret != 1)) {
- /* pr_err("iommu_tce: get_user_pages_fast failed tce=%lx ioba=%lx ret=%d\n",
- tce, entry << tbl->it_page_shift, ret); */
- return -EFAULT;
- }
- hwaddr = (unsigned long) page_address(page) + offset;
-
- ret = iommu_tce_build(tbl, entry, hwaddr, direction);
- if (ret)
- put_page(page);
-
- if (ret < 0)
- pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
- __func__, entry << tbl->it_page_shift, tce, ret);
-
- return ret;
-}
-EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
-
int iommu_take_ownership(struct iommu_table *tbl)
{
unsigned long sz = (tbl->it_size + 7) >> 3;
@@ -1096,7 +1043,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
}

memset(tbl->it_map, 0xff, sz);
- iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);

/*
* Disable iommu bypass, otherwise the user can DMA to all of
@@ -1114,7 +1060,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
{
unsigned long sz = (tbl->it_size + 7) >> 3;

- iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
memset(tbl->it_map, 0, sz);

/* Restore bit#0 set by iommu_init_table() */
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 730b4ef..b95fa2b 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -147,6 +147,67 @@ static void tce_iommu_release(void *iommu_data)
kfree(container);
}

+static int tce_iommu_clear(struct tce_container *container,
+ struct iommu_table *tbl,
+ unsigned long entry, unsigned long pages)
+{
+ unsigned long oldtce;
+ struct page *page;
+
+ for ( ; pages; --pages, ++entry) {
+ oldtce = iommu_clear_tce(tbl, entry);
+ if (!oldtce)
+ continue;
+
+ page = pfn_to_page(oldtce >> PAGE_SHIFT);
+ WARN_ON(!page);
+ if (page) {
+ if (oldtce & TCE_PCI_WRITE)
+ SetPageDirty(page);
+ put_page(page);
+ }
+ }
+
+ return 0;
+}
+
+static long tce_iommu_build(struct tce_container *container,
+ struct iommu_table *tbl,
+ unsigned long entry, unsigned long tce, unsigned long pages)
+{
+ long i, ret = 0;
+ struct page *page = NULL;
+ unsigned long hva;
+ enum dma_data_direction direction = iommu_tce_direction(tce);
+
+ for (i = 0; i < pages; ++i) {
+ unsigned long offset = tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
+
+ ret = get_user_pages_fast(tce & PAGE_MASK, 1,
+ direction != DMA_TO_DEVICE, &page);
+ if (unlikely(ret != 1)) {
+ ret = -EFAULT;
+ break;
+ }
+ hva = (unsigned long) page_address(page) + offset;
+
+ ret = iommu_tce_build(tbl, entry + i, hva, direction);
+ if (ret) {
+ put_page(page);
+ pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
+ __func__, entry << tbl->it_page_shift,
+ tce, ret);
+ break;
+ }
+ tce += IOMMU_PAGE_SIZE_4K;
+ }
+
+ if (ret)
+ tce_iommu_clear(container, tbl, entry, i);
+
+ return ret;
+}
+
static long tce_iommu_ioctl(void *iommu_data,
unsigned int cmd, unsigned long arg)
{
@@ -195,7 +256,7 @@ static long tce_iommu_ioctl(void *iommu_data,
case VFIO_IOMMU_MAP_DMA: {
struct vfio_iommu_type1_dma_map param;
struct iommu_table *tbl = container->tbl;
- unsigned long tce, i;
+ unsigned long tce;

if (!tbl)
return -ENXIO;
@@ -229,17 +290,9 @@ static long tce_iommu_ioctl(void *iommu_data,
if (ret)
return ret;

- for (i = 0; i < (param.size >> IOMMU_PAGE_SHIFT_4K); ++i) {
- ret = iommu_put_tce_user_mode(tbl,
- (param.iova >> IOMMU_PAGE_SHIFT_4K) + i,
- tce);
- if (ret)
- break;
- tce += IOMMU_PAGE_SIZE_4K;
- }
- if (ret)
- iommu_clear_tces_and_put_pages(tbl,
- param.iova >> IOMMU_PAGE_SHIFT_4K, i);
+ ret = tce_iommu_build(container, tbl,
+ param.iova >> IOMMU_PAGE_SHIFT_4K,
+ tce, param.size >> IOMMU_PAGE_SHIFT_4K);

iommu_flush_tce(tbl);

@@ -273,7 +326,7 @@ static long tce_iommu_ioctl(void *iommu_data,
if (ret)
return ret;

- ret = iommu_clear_tces_and_put_pages(tbl,
+ ret = tce_iommu_clear(container, tbl,
param.iova >> IOMMU_PAGE_SHIFT_4K,
param.size >> IOMMU_PAGE_SHIFT_4K);
iommu_flush_tce(tbl);
@@ -357,6 +410,7 @@ static void tce_iommu_detach_group(void *iommu_data,
/* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
iommu_group_id(iommu_group), iommu_group); */
container->tbl = NULL;
+ tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
iommu_release_ownership(tbl);
}
mutex_unlock(&container->lock);
--
2.0.0

2015-04-25 12:16:12

by Alexey Kardashevskiy

Subject: [PATCH kernel v9 04/32] vfio: powerpc/spapr: Check that IOMMU page is fully contained by system page

This checks that the TCE table page size is not bigger than the size of
the page we have just pinned and are about to put the physical address of
into the table.

Otherwise the hardware gets unwanted access to physical memory between
the end of the actual page and the end of the aligned-up TCE page.

Since compound_order() and compound_head() work correctly on non-huge
pages, there is no need for an additional check of whether the page is huge.
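
To illustrate the check, which boils down to
PAGE_SHIFT + compound_order(compound_head(page)) >= tbl->it_page_shift,
with 4K system pages (PAGE_SHIFT = 12):

	order-0 4K page,  4K IOMMU page:  12 + 0  = 12 >= 12  -> allowed
	order-0 4K page, 64K IOMMU page:  12 + 0  = 12 <  16  -> rejected (-EPERM)
	16M huge page,   16M IOMMU page:  12 + 12 = 24 >= 24  -> allowed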

Signed-off-by: Alexey Kardashevskiy <[email protected]>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <[email protected]>
Reviewed-by: David Gibson <[email protected]>
---
Changes:
v8: changed subject

v6:
* the helper is simplified to one line

v4:
* s/tce_check_page_size/tce_page_is_contained/
---
drivers/vfio/vfio_iommu_spapr_tce.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index b95fa2b..735b308 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -47,6 +47,16 @@ struct tce_container {
bool enabled;
};

+static bool tce_page_is_contained(struct page *page, unsigned page_shift)
+{
+ /*
+ * Check that the TCE table granularity is not bigger than the size of
+ * a page we just found. Otherwise the hardware can get access to
+ * a bigger memory chunk that it should.
+ */
+ return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift;
+}
+
static int tce_iommu_enable(struct tce_container *container)
{
int ret = 0;
@@ -189,6 +199,12 @@ static long tce_iommu_build(struct tce_container *container,
ret = -EFAULT;
break;
}
+
+ if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+ ret = -EPERM;
+ break;
+ }
+
hva = (unsigned long) page_address(page) + offset;

ret = iommu_tce_build(tbl, entry + i, hva, direction);
--
2.0.0

2015-04-25 12:20:27

by Alexey Kardashevskiy

Subject: [PATCH kernel v9 05/32] vfio: powerpc/spapr: Use it_page_size

This makes use of the it_page_size from the iommu_table struct
as the page size can differ.

This replaces the missing IOMMU_PAGE_SHIFT macro in commented-out debug code
as the recently introduced IOMMU_PAGE_XXX macros do not include
IOMMU_PAGE_SHIFT.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <[email protected]>
---
drivers/vfio/vfio_iommu_spapr_tce.c | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 735b308..64300cc 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -91,7 +91,7 @@ static int tce_iommu_enable(struct tce_container *container)
* enforcing the limit based on the max that the guest can map.
*/
down_write(&current->mm->mmap_sem);
- npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+ npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
locked = current->mm->locked_vm + npages;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
@@ -120,7 +120,7 @@ static void tce_iommu_disable(struct tce_container *container)

down_write(&current->mm->mmap_sem);
current->mm->locked_vm -= (container->tbl->it_size <<
- IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+ container->tbl->it_page_shift) >> PAGE_SHIFT;
up_write(&current->mm->mmap_sem);
}

@@ -215,7 +215,7 @@ static long tce_iommu_build(struct tce_container *container,
tce, ret);
break;
}
- tce += IOMMU_PAGE_SIZE_4K;
+ tce += IOMMU_PAGE_SIZE(tbl);
}

if (ret)
@@ -260,8 +260,8 @@ static long tce_iommu_ioctl(void *iommu_data,
if (info.argsz < minsz)
return -EINVAL;

- info.dma32_window_start = tbl->it_offset << IOMMU_PAGE_SHIFT_4K;
- info.dma32_window_size = tbl->it_size << IOMMU_PAGE_SHIFT_4K;
+ info.dma32_window_start = tbl->it_offset << tbl->it_page_shift;
+ info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
info.flags = 0;

if (copy_to_user((void __user *)arg, &info, minsz))
@@ -291,8 +291,8 @@ static long tce_iommu_ioctl(void *iommu_data,
VFIO_DMA_MAP_FLAG_WRITE))
return -EINVAL;

- if ((param.size & ~IOMMU_PAGE_MASK_4K) ||
- (param.vaddr & ~IOMMU_PAGE_MASK_4K))
+ if ((param.size & ~IOMMU_PAGE_MASK(tbl)) ||
+ (param.vaddr & ~IOMMU_PAGE_MASK(tbl)))
return -EINVAL;

/* iova is checked by the IOMMU API */
@@ -307,8 +307,8 @@ static long tce_iommu_ioctl(void *iommu_data,
return ret;

ret = tce_iommu_build(container, tbl,
- param.iova >> IOMMU_PAGE_SHIFT_4K,
- tce, param.size >> IOMMU_PAGE_SHIFT_4K);
+ param.iova >> tbl->it_page_shift,
+ tce, param.size >> tbl->it_page_shift);

iommu_flush_tce(tbl);

@@ -334,17 +334,17 @@ static long tce_iommu_ioctl(void *iommu_data,
if (param.flags)
return -EINVAL;

- if (param.size & ~IOMMU_PAGE_MASK_4K)
+ if (param.size & ~IOMMU_PAGE_MASK(tbl))
return -EINVAL;

ret = iommu_tce_clear_param_check(tbl, param.iova, 0,
- param.size >> IOMMU_PAGE_SHIFT_4K);
+ param.size >> tbl->it_page_shift);
if (ret)
return ret;

ret = tce_iommu_clear(container, tbl,
- param.iova >> IOMMU_PAGE_SHIFT_4K,
- param.size >> IOMMU_PAGE_SHIFT_4K);
+ param.iova >> tbl->it_page_shift,
+ param.size >> tbl->it_page_shift);
iommu_flush_tce(tbl);

return ret;
--
2.0.0

2015-04-25 12:20:16

by Alexey Kardashevskiy

Subject: [PATCH kernel v9 06/32] vfio: powerpc/spapr: Move locked_vm accounting to helpers

This moves locked pages accounting to helpers.
Later they will be reused for Dynamic DMA windows (DDW).

This reworks debug messages to show the current value and the limit.

This stores the number of locked pages in the container so the iommu
table pointer won't be needed when unlocking. This does not have an effect
now but it will with multiple tables per container as then we will
allow attaching/detaching groups on the fly and we may end up having
a container with no group attached but with the counter incremented.

While we are here, update the comment explaining why RLIMIT_MEMLOCK
might be required to be bigger than the guest RAM. This also prints
the pid of the current process in pr_warn/pr_debug.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <[email protected]>
Reviewed-by: David Gibson <[email protected]>
---
Changes:
v4:
* new helpers do nothing if @npages == 0
* tce_iommu_disable() now can decrement the counter if the group was
detached (not possible now but will be in the future)
---
drivers/vfio/vfio_iommu_spapr_tce.c | 82 ++++++++++++++++++++++++++++---------
1 file changed, 63 insertions(+), 19 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 64300cc..40583f9 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -29,6 +29,51 @@
static void tce_iommu_detach_group(void *iommu_data,
struct iommu_group *iommu_group);

+static long try_increment_locked_vm(long npages)
+{
+ long ret = 0, locked, lock_limit;
+
+ if (!current || !current->mm)
+ return -ESRCH; /* process exited */
+
+ if (!npages)
+ return 0;
+
+ down_write(&current->mm->mmap_sem);
+ locked = current->mm->locked_vm + npages;
+ lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+ if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+ ret = -ENOMEM;
+ else
+ current->mm->locked_vm += npages;
+
+ pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
+ npages << PAGE_SHIFT,
+ current->mm->locked_vm << PAGE_SHIFT,
+ rlimit(RLIMIT_MEMLOCK),
+ ret ? " - exceeded" : "");
+
+ up_write(&current->mm->mmap_sem);
+
+ return ret;
+}
+
+static void decrement_locked_vm(long npages)
+{
+ if (!current || !current->mm || !npages)
+ return; /* process exited */
+
+ down_write(&current->mm->mmap_sem);
+ if (npages > current->mm->locked_vm)
+ npages = current->mm->locked_vm;
+ current->mm->locked_vm -= npages;
+ pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
+ npages << PAGE_SHIFT,
+ current->mm->locked_vm << PAGE_SHIFT,
+ rlimit(RLIMIT_MEMLOCK));
+ up_write(&current->mm->mmap_sem);
+}
+
/*
* VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
*
@@ -45,6 +90,7 @@ struct tce_container {
struct mutex lock;
struct iommu_table *tbl;
bool enabled;
+ unsigned long locked_pages;
};

static bool tce_page_is_contained(struct page *page, unsigned page_shift)
@@ -60,7 +106,7 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift)
static int tce_iommu_enable(struct tce_container *container)
{
int ret = 0;
- unsigned long locked, lock_limit, npages;
+ unsigned long locked;
struct iommu_table *tbl = container->tbl;

if (!container->tbl)
@@ -89,21 +135,22 @@ static int tce_iommu_enable(struct tce_container *container)
* Also we don't have a nice way to fail on H_PUT_TCE due to ulimits,
* that would effectively kill the guest at random points, much better
* enforcing the limit based on the max that the guest can map.
+ *
+ * Unfortunately at the moment it counts whole tables, no matter how
+ * much memory the guest has. I.e. for 4GB guest and 4 IOMMU groups
+ * each with 2GB DMA window, 8GB will be counted here. The reason for
+ * this is that we cannot tell here the amount of RAM used by the guest
+ * as this information is only available from KVM and VFIO is
+ * KVM agnostic.
*/
- down_write(&current->mm->mmap_sem);
- npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
- locked = current->mm->locked_vm + npages;
- lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
- if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
- pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
- rlimit(RLIMIT_MEMLOCK));
- ret = -ENOMEM;
- } else {
+ locked = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
+ ret = try_increment_locked_vm(locked);
+ if (ret)
+ return ret;

- current->mm->locked_vm += npages;
- container->enabled = true;
- }
- up_write(&current->mm->mmap_sem);
+ container->locked_pages = locked;
+
+ container->enabled = true;

return ret;
}
@@ -115,13 +162,10 @@ static void tce_iommu_disable(struct tce_container *container)

container->enabled = false;

- if (!container->tbl || !current->mm)
+ if (!current->mm)
return;

- down_write(&current->mm->mmap_sem);
- current->mm->locked_vm -= (container->tbl->it_size <<
- container->tbl->it_page_shift) >> PAGE_SHIFT;
- up_write(&current->mm->mmap_sem);
+ decrement_locked_vm(container->locked_pages);
}

static void *tce_iommu_open(unsigned long arg)
--
2.0.0

2015-04-25 12:20:06

by Alexey Kardashevskiy

Subject: [PATCH kernel v9 07/32] vfio: powerpc/spapr: Disable DMA mappings on disabled container

At the moment DMA map/unmap requests are handled irrespective of
the container's state. This allows userspace to pin memory which
it might not be allowed to pin.

This adds checks to MAP/UNMAP that the container is enabled; otherwise
-EPERM is returned.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <[email protected]>
Reviewed-by: David Gibson <[email protected]>
---
drivers/vfio/vfio_iommu_spapr_tce.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 40583f9..e21479c 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -318,6 +318,9 @@ static long tce_iommu_ioctl(void *iommu_data,
struct iommu_table *tbl = container->tbl;
unsigned long tce;

+ if (!container->enabled)
+ return -EPERM;
+
if (!tbl)
return -ENXIO;

@@ -362,6 +365,9 @@ static long tce_iommu_ioctl(void *iommu_data,
struct vfio_iommu_type1_dma_unmap param;
struct iommu_table *tbl = container->tbl;

+ if (!container->enabled)
+ return -EPERM;
+
if (WARN_ON(!tbl))
return -ENXIO;

--
2.0.0

2015-04-25 12:16:48

by Alexey Kardashevskiy

Subject: [PATCH kernel v9 08/32] vfio: powerpc/spapr: Moving pinning/unpinning to helpers

This is a pretty mechanical patch to make the next patches simpler.

The new tce_iommu_unuse_page() helper does put_page() now but it may skip
that once the memory registration patch is applied.

While we are here, this removes the unnecessary check of the value returned
by pfn_to_page() as it cannot possibly return NULL.

This moves tce_iommu_disable() later to let tce_iommu_clear() know whether
the container has been enabled, because if it has not been, then
put_page() must not be called on TCEs from the TCE table. This situation
is not yet possible but it will be after the KVM acceleration patchset is
applied.

This changes the code to work with physical addresses rather than linear
mapping addresses for better code readability. Following patches will
add an xchg() callback for the IOMMU table which will accept/return
physical addresses (unlike the current tce_build()), which will eliminate
redundant conversions.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <[email protected]>
---
Changes:
v9:
* changed helpers to work with physical addresses rather than linear
(for simplicity - later ::xchg() will receive physical and avoid
additional conversions)

v6:
* tce_get_hva() returns hva via a pointer
---
drivers/vfio/vfio_iommu_spapr_tce.c | 61 +++++++++++++++++++++++++------------
1 file changed, 41 insertions(+), 20 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index e21479c..115d5e6 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -191,69 +191,90 @@ static void tce_iommu_release(void *iommu_data)
struct tce_container *container = iommu_data;

WARN_ON(container->tbl && !container->tbl->it_group);
- tce_iommu_disable(container);

if (container->tbl && container->tbl->it_group)
tce_iommu_detach_group(iommu_data, container->tbl->it_group);

+ tce_iommu_disable(container);
mutex_destroy(&container->lock);

kfree(container);
}

+static void tce_iommu_unuse_page(struct tce_container *container,
+ unsigned long oldtce)
+{
+ struct page *page;
+
+ if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
+ return;
+
+ page = pfn_to_page(oldtce >> PAGE_SHIFT);
+
+ if (oldtce & TCE_PCI_WRITE)
+ SetPageDirty(page);
+
+ put_page(page);
+}
+
static int tce_iommu_clear(struct tce_container *container,
struct iommu_table *tbl,
unsigned long entry, unsigned long pages)
{
unsigned long oldtce;
- struct page *page;

for ( ; pages; --pages, ++entry) {
oldtce = iommu_clear_tce(tbl, entry);
if (!oldtce)
continue;

- page = pfn_to_page(oldtce >> PAGE_SHIFT);
- WARN_ON(!page);
- if (page) {
- if (oldtce & TCE_PCI_WRITE)
- SetPageDirty(page);
- put_page(page);
- }
+ tce_iommu_unuse_page(container, oldtce);
}

return 0;
}

+static int tce_iommu_use_page(unsigned long tce, unsigned long *hpa)
+{
+ struct page *page = NULL;
+ enum dma_data_direction direction = iommu_tce_direction(tce);
+
+ if (get_user_pages_fast(tce & PAGE_MASK, 1,
+ direction != DMA_TO_DEVICE, &page) != 1)
+ return -EFAULT;
+
+ *hpa = __pa((unsigned long) page_address(page));
+
+ return 0;
+}
+
static long tce_iommu_build(struct tce_container *container,
struct iommu_table *tbl,
unsigned long entry, unsigned long tce, unsigned long pages)
{
long i, ret = 0;
- struct page *page = NULL;
- unsigned long hva;
+ struct page *page;
+ unsigned long hpa;
enum dma_data_direction direction = iommu_tce_direction(tce);

for (i = 0; i < pages; ++i) {
unsigned long offset = tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;

- ret = get_user_pages_fast(tce & PAGE_MASK, 1,
- direction != DMA_TO_DEVICE, &page);
- if (unlikely(ret != 1)) {
- ret = -EFAULT;
+ ret = tce_iommu_use_page(tce, &hpa);
+ if (ret)
break;
- }

+ page = pfn_to_page(hpa >> PAGE_SHIFT);
if (!tce_page_is_contained(page, tbl->it_page_shift)) {
ret = -EPERM;
break;
}

- hva = (unsigned long) page_address(page) + offset;
-
- ret = iommu_tce_build(tbl, entry + i, hva, direction);
+ hpa |= offset;
+ ret = iommu_tce_build(tbl, entry + i, (unsigned long) __va(hpa),
+ direction);
if (ret) {
- put_page(page);
+ tce_iommu_unuse_page(container, hpa);
pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
__func__, entry << tbl->it_page_shift,
tce, ret);
--
2.0.0

2015-04-25 12:19:52

by Alexey Kardashevskiy

Subject: [PATCH kernel v9 09/32] vfio: powerpc/spapr: Rework groups attaching

This is to make extended ownership and multiple groups support patches
simpler for review.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <[email protected]>
Reviewed-by: David Gibson <[email protected]>
---
drivers/vfio/vfio_iommu_spapr_tce.c | 40 ++++++++++++++++++++++---------------
1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 115d5e6..0fbe03e 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -460,16 +460,21 @@ static int tce_iommu_attach_group(void *iommu_data,
iommu_group_id(container->tbl->it_group),
iommu_group_id(iommu_group));
ret = -EBUSY;
- } else if (container->enabled) {
+ goto unlock_exit;
+ }
+
+ if (container->enabled) {
pr_err("tce_vfio: attaching group #%u to enabled container\n",
iommu_group_id(iommu_group));
ret = -EBUSY;
- } else {
- ret = iommu_take_ownership(tbl);
- if (!ret)
- container->tbl = tbl;
+ goto unlock_exit;
}

+ ret = iommu_take_ownership(tbl);
+ if (!ret)
+ container->tbl = tbl;
+
+unlock_exit:
mutex_unlock(&container->lock);

return ret;
@@ -487,19 +492,22 @@ static void tce_iommu_detach_group(void *iommu_data,
pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
iommu_group_id(iommu_group),
iommu_group_id(tbl->it_group));
- } else {
- if (container->enabled) {
- pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
- iommu_group_id(tbl->it_group));
- tce_iommu_disable(container);
- }
+ goto unlock_exit;
+ }

- /* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
- iommu_group_id(iommu_group), iommu_group); */
- container->tbl = NULL;
- tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
- iommu_release_ownership(tbl);
+ if (container->enabled) {
+ pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
+ iommu_group_id(tbl->it_group));
+ tce_iommu_disable(container);
}
+
+ /* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
+ iommu_group_id(iommu_group), iommu_group); */
+ container->tbl = NULL;
+ tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
+ iommu_release_ownership(tbl);
+
+unlock_exit:
mutex_unlock(&container->lock);
}

--
2.0.0

2015-04-25 12:17:06

by Alexey Kardashevskiy

Subject: [PATCH kernel v9 10/32] powerpc/powernv: Do not set "read" flag if direction==DMA_NONE

Normally a bitmap from the iommu_table is used to track which TCE entries
are in use. Since we are going to use the iommu_table without its locks and
do xchg() instead, it becomes essential not to set bits which are not
implied by the direction flag, as the old TCE value (more precisely, its
permission bits) will be used to decide whether to put the page or not.

This adds iommu_direction_to_tce_perm() (its counterpart is there already)
and uses it for powernv's pnv_tce_build().

Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
---
Changes:
v9:
* added comment why we must put only valid permission bits
---
arch/powerpc/include/asm/iommu.h | 1 +
arch/powerpc/kernel/iommu.c | 15 +++++++++++++++
arch/powerpc/platforms/powernv/pci.c | 7 +------
3 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 4955233..5eb6e76 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -205,6 +205,7 @@ extern int iommu_take_ownership(struct iommu_table *tbl);
extern void iommu_release_ownership(struct iommu_table *tbl);

extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
+extern unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir);

#endif /* __KERNEL__ */
#endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index fc8b253..e0e94c7 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -881,6 +881,21 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
}
}

+unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir)
+{
+ switch (dir) {
+ case DMA_BIDIRECTIONAL:
+ return TCE_PCI_READ | TCE_PCI_WRITE;
+ case DMA_FROM_DEVICE:
+ return TCE_PCI_WRITE;
+ case DMA_TO_DEVICE:
+ return TCE_PCI_READ;
+ default:
+ return 0;
+ }
+}
+EXPORT_SYMBOL_GPL(iommu_direction_to_tce_perm);
+
#ifdef CONFIG_IOMMU_API
/*
* SPAPR TCE API
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index bca2aeb..b7ea245 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -576,15 +576,10 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
unsigned long uaddr, enum dma_data_direction direction,
struct dma_attrs *attrs, bool rm)
{
- u64 proto_tce;
+ u64 proto_tce = iommu_direction_to_tce_perm(direction);
__be64 *tcep, *tces;
u64 rpn;

- proto_tce = TCE_PCI_READ; // Read allowed
-
- if (direction != DMA_TO_DEVICE)
- proto_tce |= TCE_PCI_WRITE;
-
tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
rpn = __pa(uaddr) >> tbl->it_page_shift;

--
2.0.0

2015-04-25 12:20:10

by Alexey Kardashevskiy

Subject: [PATCH kernel v9 11/32] powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table

This adds an iommu_table_ops struct and puts a pointer to it into
the iommu_table struct. This moves the tce_build/tce_free/tce_get/tce_flush
callbacks from ppc_md to the new struct where they really belong.

This adds the requirement for @it_ops to be initialized before calling
iommu_init_table() to make sure that we do not leave any IOMMU table
with iommu_table_ops uninitialized. This is not a parameter of
iommu_init_table() though as there will be cases when iommu_init_table()
will not be called on TCE tables, for example - VFIO.

This does s/tce_build/set/, s/tce_free/clear/ and removes the redundant
"tce_" prefixes.

This removes the tce_xxx_rm handlers from ppc_md but does not add
them to iommu_table_ops as this will be done later if we decide to
support TCE hypercalls in real mode. This removes the _vm callbacks as
only virtual mode is supported for now, so this also removes the @rm
parameter.

For pSeries, this always uses tce_buildmulti_pSeriesLP/
tce_freemulti_pSeriesLP. This changes the multi callbacks to fall back to
tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not
present. The reason for this is that we still have to support the
"multitce=off" boot parameter in disable_multitce() and we do not want to
walk through all IOMMU tables in the system and replace "multi" callbacks
with single ones.

For powernv, this defines _ops per PHB type, which are P5IOC2/IODA1/IODA2.
This makes the callbacks for them public. Later patches will extend the
callbacks for IODA1/2.

No change in behaviour is expected.
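
As a minimal sketch of the new contract (the names below are illustrative;
real instances are cell_iommu_ops, iommu_table_iobmap_ops,
pnv_ioda1/2_iommu_ops and the pSeries ops in the diff below):

static struct iommu_table_ops my_iommu_ops = {
	.set	= my_tce_build,
	.clear	= my_tce_free,
	.get	= my_tce_get,
	.flush	= my_tce_flush,	/* core checks for NULL before calling flush */
};

	tbl->it_ops = &my_iommu_ops;	/* must be set first... */
	iommu_init_table(tbl, nid);	/* ...as it now does BUG_ON(!tbl->it_ops) */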

Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
---
Changes:
v9:
* pnv_tce_build/pnv_tce_free/pnv_tce_get have been made public and lost
"rm" parameters to make following patches simpler (realmode is not
supported here anyway)
* got rid of _vm versions of callbacks
---
arch/powerpc/include/asm/iommu.h | 17 +++++++++++
arch/powerpc/include/asm/machdep.h | 25 ---------------
arch/powerpc/kernel/iommu.c | 46 ++++++++++++++--------------
arch/powerpc/kernel/vio.c | 5 +++
arch/powerpc/platforms/cell/iommu.c | 8 +++--
arch/powerpc/platforms/pasemi/iommu.c | 7 +++--
arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++++++++
arch/powerpc/platforms/powernv/pci-p5ioc2.c | 7 +++++
arch/powerpc/platforms/powernv/pci.c | 47 +++++------------------------
arch/powerpc/platforms/powernv/pci.h | 5 +++
arch/powerpc/platforms/pseries/iommu.c | 34 ++++++++++++---------
arch/powerpc/sysdev/dart_iommu.c | 12 +++++---
12 files changed, 116 insertions(+), 111 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 5eb6e76..f0cab49 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -44,6 +44,22 @@
extern int iommu_is_off;
extern int iommu_force_on;

+struct iommu_table_ops {
+ int (*set)(struct iommu_table *tbl,
+ long index, long npages,
+ unsigned long uaddr,
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs);
+ void (*clear)(struct iommu_table *tbl,
+ long index, long npages);
+ unsigned long (*get)(struct iommu_table *tbl, long index);
+ void (*flush)(struct iommu_table *tbl);
+};
+
+/* These are used by VIO */
+extern struct iommu_table_ops iommu_table_lpar_multi_ops;
+extern struct iommu_table_ops iommu_table_pseries_ops;
+
/*
* IOMAP_MAX_ORDER defines the largest contiguous block
* of dma space we can get. IOMAP_MAX_ORDER = 13
@@ -78,6 +94,7 @@ struct iommu_table {
#ifdef CONFIG_IOMMU_API
struct iommu_group *it_group;
#endif
+ struct iommu_table_ops *it_ops;
void (*set_bypass)(struct iommu_table *tbl, bool enable);
};

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index ef889943..ab721b4 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -65,31 +65,6 @@ struct machdep_calls {
* destroyed as well */
void (*hpte_clear_all)(void);

- int (*tce_build)(struct iommu_table *tbl,
- long index,
- long npages,
- unsigned long uaddr,
- enum dma_data_direction direction,
- struct dma_attrs *attrs);
- void (*tce_free)(struct iommu_table *tbl,
- long index,
- long npages);
- unsigned long (*tce_get)(struct iommu_table *tbl,
- long index);
- void (*tce_flush)(struct iommu_table *tbl);
-
- /* _rm versions are for real mode use only */
- int (*tce_build_rm)(struct iommu_table *tbl,
- long index,
- long npages,
- unsigned long uaddr,
- enum dma_data_direction direction,
- struct dma_attrs *attrs);
- void (*tce_free_rm)(struct iommu_table *tbl,
- long index,
- long npages);
- void (*tce_flush_rm)(struct iommu_table *tbl);
-
void __iomem * (*ioremap)(phys_addr_t addr, unsigned long size,
unsigned long flags, void *caller);
void (*iounmap)(volatile void __iomem *token);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index e0e94c7..e289f91 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -322,11 +322,11 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
ret = entry << tbl->it_page_shift; /* Set the return dma address */

/* Put the TCEs in the HW table */
- build_fail = ppc_md.tce_build(tbl, entry, npages,
+ build_fail = tbl->it_ops->set(tbl, entry, npages,
(unsigned long)page &
IOMMU_PAGE_MASK(tbl), direction, attrs);

- /* ppc_md.tce_build() only returns non-zero for transient errors.
+ /* tbl->it_ops->set() only returns non-zero for transient errors.
* Clean up the table bitmap in this case and return
* DMA_ERROR_CODE. For all other errors the functionality is
* not altered.
@@ -337,8 +337,8 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
}

/* Flush/invalidate TLB caches if necessary */
- if (ppc_md.tce_flush)
- ppc_md.tce_flush(tbl);
+ if (tbl->it_ops->flush)
+ tbl->it_ops->flush(tbl);

/* Make sure updates are seen by hardware */
mb();
@@ -408,7 +408,7 @@ static void __iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
if (!iommu_free_check(tbl, dma_addr, npages))
return;

- ppc_md.tce_free(tbl, entry, npages);
+ tbl->it_ops->clear(tbl, entry, npages);

spin_lock_irqsave(&(pool->lock), flags);
bitmap_clear(tbl->it_map, free_entry, npages);
@@ -424,8 +424,8 @@ static void iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
* not do an mb() here on purpose, it is not needed on any of
* the current platforms.
*/
- if (ppc_md.tce_flush)
- ppc_md.tce_flush(tbl);
+ if (tbl->it_ops->flush)
+ tbl->it_ops->flush(tbl);
}

int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
@@ -495,7 +495,7 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
npages, entry, dma_addr);

/* Insert into HW table */
- build_fail = ppc_md.tce_build(tbl, entry, npages,
+ build_fail = tbl->it_ops->set(tbl, entry, npages,
vaddr & IOMMU_PAGE_MASK(tbl),
direction, attrs);
if(unlikely(build_fail))
@@ -534,8 +534,8 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
}

/* Flush/invalidate TLB caches if necessary */
- if (ppc_md.tce_flush)
- ppc_md.tce_flush(tbl);
+ if (tbl->it_ops->flush)
+ tbl->it_ops->flush(tbl);

DBG("mapped %d elements:\n", outcount);

@@ -600,8 +600,8 @@ void ppc_iommu_unmap_sg(struct iommu_table *tbl, struct scatterlist *sglist,
* do not do an mb() here, the affected platforms do not need it
* when freeing.
*/
- if (ppc_md.tce_flush)
- ppc_md.tce_flush(tbl);
+ if (tbl->it_ops->flush)
+ tbl->it_ops->flush(tbl);
}

static void iommu_table_clear(struct iommu_table *tbl)
@@ -613,17 +613,17 @@ static void iommu_table_clear(struct iommu_table *tbl)
*/
if (!is_kdump_kernel() || is_fadump_active()) {
/* Clear the table in case firmware left allocations in it */
- ppc_md.tce_free(tbl, tbl->it_offset, tbl->it_size);
+ tbl->it_ops->clear(tbl, tbl->it_offset, tbl->it_size);
return;
}

#ifdef CONFIG_CRASH_DUMP
- if (ppc_md.tce_get) {
+ if (tbl->it_ops->get) {
unsigned long index, tceval, tcecount = 0;

/* Reserve the existing mappings left by the first kernel. */
for (index = 0; index < tbl->it_size; index++) {
- tceval = ppc_md.tce_get(tbl, index + tbl->it_offset);
+ tceval = tbl->it_ops->get(tbl, index + tbl->it_offset);
/*
* Freed TCE entry contains 0x7fffffffffffffff on JS20
*/
@@ -657,6 +657,8 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
unsigned int i;
struct iommu_pool *p;

+ BUG_ON(!tbl->it_ops);
+
/* number of bytes needed for the bitmap */
sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);

@@ -944,8 +946,8 @@ EXPORT_SYMBOL_GPL(iommu_tce_direction);
void iommu_flush_tce(struct iommu_table *tbl)
{
/* Flush/invalidate TLB caches if necessary */
- if (ppc_md.tce_flush)
- ppc_md.tce_flush(tbl);
+ if (tbl->it_ops->flush)
+ tbl->it_ops->flush(tbl);

/* Make sure updates are seen by hardware */
mb();
@@ -956,7 +958,7 @@ int iommu_tce_clear_param_check(struct iommu_table *tbl,
unsigned long ioba, unsigned long tce_value,
unsigned long npages)
{
- /* ppc_md.tce_free() does not support any value but 0 */
+ /* tbl->it_ops->clear() does not support any value but 0 */
if (tce_value)
return -EINVAL;

@@ -1004,9 +1006,9 @@ unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)

spin_lock(&(pool->lock));

- oldtce = ppc_md.tce_get(tbl, entry);
+ oldtce = tbl->it_ops->get(tbl, entry);
if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
- ppc_md.tce_free(tbl, entry, 1);
+ tbl->it_ops->clear(tbl, entry, 1);
else
oldtce = 0;

@@ -1029,10 +1031,10 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,

spin_lock(&(pool->lock));

- oldtce = ppc_md.tce_get(tbl, entry);
+ oldtce = tbl->it_ops->get(tbl, entry);
/* Add new entry if it is not busy */
if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
- ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, direction, NULL);
+ ret = tbl->it_ops->set(tbl, entry, 1, hwaddr, direction, NULL);

spin_unlock(&(pool->lock));

diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index 5bfdab9..b41426c 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -1196,6 +1196,11 @@ static struct iommu_table *vio_build_iommu_table(struct vio_dev *dev)
tbl->it_type = TCE_VB;
tbl->it_blocksize = 16;

+ if (firmware_has_feature(FW_FEATURE_LPAR))
+ tbl->it_ops = &iommu_table_lpar_multi_ops;
+ else
+ tbl->it_ops = &iommu_table_pseries_ops;
+
return iommu_init_table(tbl, -1);
}

diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
index 21b5023..14a582b 100644
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -466,6 +466,11 @@ static inline u32 cell_iommu_get_ioid(struct device_node *np)
return *ioid;
}

+static struct iommu_table_ops cell_iommu_ops = {
+ .set = tce_build_cell,
+ .clear = tce_free_cell
+};
+
static struct iommu_window * __init
cell_iommu_setup_window(struct cbe_iommu *iommu, struct device_node *np,
unsigned long offset, unsigned long size,
@@ -492,6 +497,7 @@ cell_iommu_setup_window(struct cbe_iommu *iommu, struct device_node *np,
window->table.it_offset =
(offset >> window->table.it_page_shift) + pte_offset;
window->table.it_size = size >> window->table.it_page_shift;
+ window->table.it_ops = &cell_iommu_ops;

iommu_init_table(&window->table, iommu->nid);

@@ -1201,8 +1207,6 @@ static int __init cell_iommu_init(void)
/* Setup various callbacks */
cell_pci_controller_ops.dma_dev_setup = cell_pci_dma_dev_setup;
ppc_md.dma_get_required_mask = cell_dma_get_required_mask;
- ppc_md.tce_build = tce_build_cell;
- ppc_md.tce_free = tce_free_cell;

if (!iommu_fixed_disabled && cell_iommu_fixed_mapping_init() == 0)
goto bail;
diff --git a/arch/powerpc/platforms/pasemi/iommu.c b/arch/powerpc/platforms/pasemi/iommu.c
index b8f567b..c929644 100644
--- a/arch/powerpc/platforms/pasemi/iommu.c
+++ b/arch/powerpc/platforms/pasemi/iommu.c
@@ -134,6 +134,10 @@ static void iobmap_free(struct iommu_table *tbl, long index,
}
}

+static struct iommu_table_ops iommu_table_iobmap_ops = {
+ .set = iobmap_build,
+ .clear = iobmap_free
+};

static void iommu_table_iobmap_setup(void)
{
@@ -153,6 +157,7 @@ static void iommu_table_iobmap_setup(void)
* Should probably be 8 (64 bytes)
*/
iommu_table_iobmap.it_blocksize = 4;
+ iommu_table_iobmap.it_ops = &iommu_table_iobmap_ops;
iommu_init_table(&iommu_table_iobmap, 0);
pr_debug(" <- %s\n", __func__);
}
@@ -252,8 +257,6 @@ void __init iommu_init_early_pasemi(void)

pasemi_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pasemi;
pasemi_pci_controller_ops.dma_bus_setup = pci_dma_bus_setup_pasemi;
- ppc_md.tce_build = iobmap_build;
- ppc_md.tce_free = iobmap_free;
set_pci_dma_ops(&dma_iommu_ops);
}

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index eff26ed..7a9137a 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1710,6 +1710,12 @@ static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
*/
}

+static struct iommu_table_ops pnv_ioda1_iommu_ops = {
+ .set = pnv_tce_build,
+ .clear = pnv_tce_free,
+ .get = pnv_tce_get,
+};
+
static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
struct iommu_table *tbl,
__be64 *startp, __be64 *endp, bool rm)
@@ -1755,6 +1761,12 @@ void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
}

+static struct iommu_table_ops pnv_ioda2_iommu_ops = {
+ .set = pnv_tce_build,
+ .clear = pnv_tce_free,
+ .get = pnv_tce_get,
+};
+
static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
struct pnv_ioda_pe *pe, unsigned int base,
unsigned int segs)
@@ -1828,6 +1840,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
TCE_PCI_SWINV_FREE |
TCE_PCI_SWINV_PAIR);
}
+ tbl->it_ops = &pnv_ioda1_iommu_ops;
iommu_init_table(tbl, phb->hose->node);

if (pe->flags & PNV_IODA_PE_DEV) {
@@ -1968,6 +1981,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
8);
tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
}
+ tbl->it_ops = &pnv_ioda2_iommu_ops;
iommu_init_table(tbl, phb->hose->node);

if (pe->flags & PNV_IODA_PE_DEV) {
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index 4729ca7..f05057e 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -83,10 +83,17 @@ static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb)
static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { }
#endif /* CONFIG_PCI_MSI */

+static struct iommu_table_ops pnv_p5ioc2_iommu_ops = {
+ .set = pnv_tce_build,
+ .clear = pnv_tce_free,
+ .get = pnv_tce_get,
+};
+
static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
struct pci_dev *pdev)
{
if (phb->p5ioc2.iommu_table.it_map == NULL) {
+ phb->p5ioc2.iommu_table.it_ops = &pnv_p5ioc2_iommu_ops;
iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
iommu_register_group(&phb->p5ioc2.iommu_table,
pci_domain_nr(phb->hose->bus), phb->opal_id);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index b7ea245..4c3bbb1 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -572,9 +572,9 @@ struct pci_ops pnv_pci_ops = {
.write = pnv_pci_write_config,
};

-static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
- unsigned long uaddr, enum dma_data_direction direction,
- struct dma_attrs *attrs, bool rm)
+int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
+ unsigned long uaddr, enum dma_data_direction direction,
+ struct dma_attrs *attrs)
{
u64 proto_tce = iommu_direction_to_tce_perm(direction);
__be64 *tcep, *tces;
@@ -592,22 +592,12 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
* of flags if that becomes the case
*/
if (tbl->it_type & TCE_PCI_SWINV_CREATE)
- pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
+ pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, false);

return 0;
}

-static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long npages,
- unsigned long uaddr,
- enum dma_data_direction direction,
- struct dma_attrs *attrs)
-{
- return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs,
- false);
-}
-
-static void pnv_tce_free(struct iommu_table *tbl, long index, long npages,
- bool rm)
+void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
{
__be64 *tcep, *tces;

@@ -617,32 +607,14 @@ static void pnv_tce_free(struct iommu_table *tbl, long index, long npages,
*(tcep++) = cpu_to_be64(0);

if (tbl->it_type & TCE_PCI_SWINV_FREE)
- pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
+ pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, false);
}

-static void pnv_tce_free_vm(struct iommu_table *tbl, long index, long npages)
-{
- pnv_tce_free(tbl, index, npages, false);
-}
-
-static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
+unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
{
return ((u64 *)tbl->it_base)[index - tbl->it_offset];
}

-static int pnv_tce_build_rm(struct iommu_table *tbl, long index, long npages,
- unsigned long uaddr,
- enum dma_data_direction direction,
- struct dma_attrs *attrs)
-{
- return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, true);
-}
-
-static void pnv_tce_free_rm(struct iommu_table *tbl, long index, long npages)
-{
- pnv_tce_free(tbl, index, npages, true);
-}
-
void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
void *tce_mem, u64 tce_size,
u64 dma_offset, unsigned page_shift)
@@ -757,11 +729,6 @@ void __init pnv_pci_init(void)
pci_devs_phb_init();

/* Configure IOMMU DMA hooks */
- ppc_md.tce_build = pnv_tce_build_vm;
- ppc_md.tce_free = pnv_tce_free_vm;
- ppc_md.tce_build_rm = pnv_tce_build_rm;
- ppc_md.tce_free_rm = pnv_tce_free_rm;
- ppc_md.tce_get = pnv_tce_get;
set_pci_dma_ops(&dma_iommu_ops);

/* Configure MSIs */
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index c954c64..7eb6076 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -200,6 +200,11 @@ struct pnv_phb {
};

extern struct pci_ops pnv_pci_ops;
+extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
+ unsigned long uaddr, enum dma_data_direction direction,
+ struct dma_attrs *attrs);
+extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
+extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);

void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
unsigned char *log_buff);
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 61d5a17..e379acf 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -193,7 +193,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
int ret = 0;
unsigned long flags;

- if (npages == 1) {
+ if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
direction, attrs);
}
@@ -285,6 +285,9 @@ static void tce_freemulti_pSeriesLP(struct iommu_table *tbl, long tcenum, long n
{
u64 rc;

+ if (!firmware_has_feature(FW_FEATURE_MULTITCE))
+ return tce_free_pSeriesLP(tbl, tcenum, npages);
+
rc = plpar_tce_stuff((u64)tbl->it_index, (u64)tcenum << 12, 0, npages);

if (rc && printk_ratelimit()) {
@@ -460,7 +463,6 @@ static int tce_setrange_multi_pSeriesLP_walk(unsigned long start_pfn,
return tce_setrange_multi_pSeriesLP(start_pfn, num_pfn, arg);
}

-
#ifdef CONFIG_PCI
static void iommu_table_setparms(struct pci_controller *phb,
struct device_node *dn,
@@ -546,6 +548,12 @@ static void iommu_table_setparms_lpar(struct pci_controller *phb,
tbl->it_size = size >> tbl->it_page_shift;
}

+struct iommu_table_ops iommu_table_pseries_ops = {
+ .set = tce_build_pSeries,
+ .clear = tce_free_pSeries,
+ .get = tce_get_pseries
+};
+
static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
{
struct device_node *dn;
@@ -614,6 +622,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
pci->phb->node);

iommu_table_setparms(pci->phb, dn, tbl);
+ tbl->it_ops = &iommu_table_pseries_ops;
pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
iommu_register_group(tbl, pci_domain_nr(bus), 0);

@@ -625,6 +634,11 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
pr_debug("ISA/IDE, window size is 0x%llx\n", pci->phb->dma_window_size);
}

+struct iommu_table_ops iommu_table_lpar_multi_ops = {
+ .set = tce_buildmulti_pSeriesLP,
+ .clear = tce_freemulti_pSeriesLP,
+ .get = tce_get_pSeriesLP
+};

static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
{
@@ -659,6 +673,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
ppci->phb->node);
iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
+ tbl->it_ops = &iommu_table_lpar_multi_ops;
ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
iommu_register_group(tbl, pci_domain_nr(bus), 0);
pr_debug(" created table: %p\n", ppci->iommu_table);
@@ -686,6 +701,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
phb->node);
iommu_table_setparms(phb, dn, tbl);
+ tbl->it_ops = &iommu_table_pseries_ops;
PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
set_iommu_table_base_and_group(&dev->dev,
@@ -1108,6 +1124,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
pci->phb->node);
iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
+ tbl->it_ops = &iommu_table_lpar_multi_ops;
pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
iommu_register_group(tbl, pci_domain_nr(pci->phb->bus), 0);
pr_debug(" created table: %p\n", pci->iommu_table);
@@ -1300,22 +1317,11 @@ void iommu_init_early_pSeries(void)
return;

if (firmware_has_feature(FW_FEATURE_LPAR)) {
- if (firmware_has_feature(FW_FEATURE_MULTITCE)) {
- ppc_md.tce_build = tce_buildmulti_pSeriesLP;
- ppc_md.tce_free = tce_freemulti_pSeriesLP;
- } else {
- ppc_md.tce_build = tce_build_pSeriesLP;
- ppc_md.tce_free = tce_free_pSeriesLP;
- }
- ppc_md.tce_get = tce_get_pSeriesLP;
pseries_pci_controller_ops.dma_bus_setup = pci_dma_bus_setup_pSeriesLP;
pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeriesLP;
ppc_md.dma_set_mask = dma_set_mask_pSeriesLP;
ppc_md.dma_get_required_mask = dma_get_required_mask_pSeriesLP;
} else {
- ppc_md.tce_build = tce_build_pSeries;
- ppc_md.tce_free = tce_free_pSeries;
- ppc_md.tce_get = tce_get_pseries;
pseries_pci_controller_ops.dma_bus_setup = pci_dma_bus_setup_pSeries;
pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeries;
}
@@ -1333,8 +1339,6 @@ static int __init disable_multitce(char *str)
firmware_has_feature(FW_FEATURE_LPAR) &&
firmware_has_feature(FW_FEATURE_MULTITCE)) {
printk(KERN_INFO "Disabling MULTITCE firmware feature\n");
- ppc_md.tce_build = tce_build_pSeriesLP;
- ppc_md.tce_free = tce_free_pSeriesLP;
powerpc_firmware_features &= ~FW_FEATURE_MULTITCE;
}
return 1;
diff --git a/arch/powerpc/sysdev/dart_iommu.c b/arch/powerpc/sysdev/dart_iommu.c
index d00a566..90bcdfe 100644
--- a/arch/powerpc/sysdev/dart_iommu.c
+++ b/arch/powerpc/sysdev/dart_iommu.c
@@ -286,6 +286,12 @@ static int __init dart_init(struct device_node *dart_node)
return 0;
}

+static struct iommu_table_ops iommu_dart_ops = {
+ .set = dart_build,
+ .clear = dart_free,
+ .flush = dart_flush,
+};
+
static void iommu_table_dart_setup(void)
{
iommu_table_dart.it_busno = 0;
@@ -298,6 +304,7 @@ static void iommu_table_dart_setup(void)
iommu_table_dart.it_base = (unsigned long)dart_vbase;
iommu_table_dart.it_index = 0;
iommu_table_dart.it_blocksize = 1;
+ iommu_table_dart.it_ops = &iommu_dart_ops;
iommu_init_table(&iommu_table_dart, -1);

/* Reserve the last page of the DART to avoid possible prefetch
@@ -386,11 +393,6 @@ void __init iommu_init_early_dart(struct pci_controller_ops *controller_ops)
if (dart_init(dn) != 0)
goto bail;

- /* Setup low level TCE operations for the core IOMMU code */
- ppc_md.tce_build = dart_build;
- ppc_md.tce_free = dart_free;
- ppc_md.tce_flush = dart_flush;
-
/* Setup bypass if supported */
if (dart_is_u4)
ppc_md.dma_set_mask = dart_dma_set_mask;
--
2.0.0

2015-04-25 12:20:00

by Alexey Kardashevskiy

Subject: [PATCH kernel v9 12/32] powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group

Modern IBM POWERPC systems support multiple (currently two) TCE tables
per IOMMU group (a.k.a. PE). This adds an iommu_table_group container
for TCE tables. Right now just one table is supported.

For P5IOC2 and IODA, iommu_table_group is embedded into the PE struct
(pnv_ioda_pe and pnv_phb) and does not require iommu_free_table(), only
iommu_reset_table().

For pSeries, this replaces multiple calls to kzalloc_node() with a new
iommu_pseries_group_alloc() helper and stores the table group struct
pointer in the pci_dn struct. For release, an iommu_table_group_free()
helper is added.

This should cause no behavioural change.
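
As an illustration, the pSeries setup path condenses to the pattern
below (a sketch of the hunks that follow, with error handling and
debug output dropped): the group is allocated once per pci_dn and the
only table so far is tables[0] inside it.

	/* Sketch: pci_dn now carries a table group instead of a bare
	 * iommu_table pointer; the default DMA window is tables[0]. */
	pci->table_group = iommu_pseries_group_alloc(pci->phb->node);
	tbl = &pci->table_group->tables[0];

	iommu_table_setparms(pci->phb, dn, tbl);
	tbl->it_ops = &iommu_table_pseries_ops;
	iommu_init_table(tbl, pci->phb->node);
	iommu_register_group(pci->table_group, pci_domain_nr(bus), 0);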

Signed-off-by: Alexey Kardashevskiy <[email protected]>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <[email protected]>
---
Changes:
v9:
* s/it_group/it_table_group/
* added and used iommu_table_group_free(), from now iommu_free_table()
is only used for VIO
* added iommu_pseries_group_alloc()
* squashed "powerpc/iommu: Introduce iommu_table_alloc() helper" into this
---
arch/powerpc/include/asm/iommu.h | 18 +++--
arch/powerpc/include/asm/pci-bridge.h | 2 +-
arch/powerpc/kernel/eeh.c | 2 +-
arch/powerpc/kernel/iommu.c | 24 +++---
arch/powerpc/platforms/powernv/pci-ioda.c | 46 ++++++-----
arch/powerpc/platforms/powernv/pci-p5ioc2.c | 19 +++--
arch/powerpc/platforms/powernv/pci.h | 4 +-
arch/powerpc/platforms/pseries/iommu.c | 104 +++++++++++++++++--------
drivers/vfio/vfio_iommu_spapr_tce.c | 114 ++++++++++++++++++++--------
9 files changed, 222 insertions(+), 111 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index f0cab49..fa37519 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -91,9 +91,7 @@ struct iommu_table {
struct iommu_pool pools[IOMMU_NR_POOLS];
unsigned long *it_map; /* A simple allocation bitmap for now */
unsigned long it_page_shift;/* table iommu page size */
-#ifdef CONFIG_IOMMU_API
- struct iommu_group *it_group;
-#endif
+ struct iommu_table_group *it_table_group;
struct iommu_table_ops *it_ops;
void (*set_bypass)(struct iommu_table *tbl, bool enable);
};
@@ -127,14 +125,24 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
*/
extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
int nid);
+
+#define IOMMU_TABLE_GROUP_MAX_TABLES 1
+
+struct iommu_table_group {
#ifdef CONFIG_IOMMU_API
-extern void iommu_register_group(struct iommu_table *tbl,
+ struct iommu_group *group;
+#endif
+ struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
+};
+
+#ifdef CONFIG_IOMMU_API
+extern void iommu_register_group(struct iommu_table_group *table_group,
int pci_domain_number, unsigned long pe_num);
extern int iommu_add_device(struct device *dev);
extern void iommu_del_device(struct device *dev);
extern int __init tce_iommu_bus_notifier_init(void);
#else
-static inline void iommu_register_group(struct iommu_table *tbl,
+static inline void iommu_register_group(struct iommu_table_group *table_group,
int pci_domain_number,
unsigned long pe_num)
{
diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 1811c44..e2d7479 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -185,7 +185,7 @@ struct pci_dn {

struct pci_dn *parent;
struct pci_controller *phb; /* for pci devices */
- struct iommu_table *iommu_table; /* for phb's or bridges */
+ struct iommu_table_group *table_group; /* for phb's or bridges */
struct device_node *node; /* back-pointer to the device_node */

int pci_ext_config_space; /* for pci devices */
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index a4c62eb..6bab695 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1407,7 +1407,7 @@ static int dev_has_iommu_table(struct device *dev, void *data)
return 0;

tbl = get_iommu_table_base(dev);
- if (tbl && tbl->it_group) {
+ if (tbl && tbl->it_table_group) {
*ppdev = pdev;
return 1;
}
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index e289f91..005146b 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -749,12 +749,8 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)

iommu_reset_table(tbl, node_name);

-#ifdef CONFIG_IOMMU_API
- if (tbl->it_group) {
- iommu_group_put(tbl->it_group);
- BUG_ON(tbl->it_group);
- }
-#endif
+ /* iommu_free_table() is only used by VIO so no groups expected here */
+ BUG_ON(tbl->it_table_group);

/* free table */
kfree(tbl);
@@ -904,11 +900,11 @@ EXPORT_SYMBOL_GPL(iommu_direction_to_tce_perm);
*/
static void group_release(void *iommu_data)
{
- struct iommu_table *tbl = iommu_data;
- tbl->it_group = NULL;
+ struct iommu_table_group *table_group = iommu_data;
+ table_group->group = NULL;
}

-void iommu_register_group(struct iommu_table *tbl,
+void iommu_register_group(struct iommu_table_group *table_group,
int pci_domain_number, unsigned long pe_num)
{
struct iommu_group *grp;
@@ -920,8 +916,8 @@ void iommu_register_group(struct iommu_table *tbl,
PTR_ERR(grp));
return;
}
- tbl->it_group = grp;
- iommu_group_set_iommudata(grp, tbl, group_release);
+ table_group->group = grp;
+ iommu_group_set_iommudata(grp, table_group, group_release);
name = kasprintf(GFP_KERNEL, "domain%d-pe%lx",
pci_domain_number, pe_num);
if (!name)
@@ -1109,7 +1105,7 @@ int iommu_add_device(struct device *dev)
}

tbl = get_iommu_table_base(dev);
- if (!tbl || !tbl->it_group) {
+ if (!tbl || !tbl->it_table_group || !tbl->it_table_group->group) {
pr_debug("%s: Skipping device %s with no tbl\n",
__func__, dev_name(dev));
return 0;
@@ -1117,7 +1113,7 @@ int iommu_add_device(struct device *dev)

pr_debug("%s: Adding %s to iommu group %d\n",
__func__, dev_name(dev),
- iommu_group_id(tbl->it_group));
+ iommu_group_id(tbl->it_table_group->group));

if (PAGE_SIZE < IOMMU_PAGE_SIZE(tbl)) {
pr_err("%s: Invalid IOMMU page size %lx (%lx) on %s\n",
@@ -1126,7 +1122,7 @@ int iommu_add_device(struct device *dev)
return -EINVAL;
}

- return iommu_group_add_device(tbl->it_group, dev);
+ return iommu_group_add_device(tbl->it_table_group->group, dev);
}
EXPORT_SYMBOL_GPL(iommu_add_device);

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 7a9137a..88472cb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -23,6 +23,7 @@
#include <linux/io.h>
#include <linux/msi.h>
#include <linux/memblock.h>
+#include <linux/iommu.h>

#include <asm/sections.h>
#include <asm/io.h>
@@ -1291,7 +1292,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
bus = dev->bus;
hose = pci_bus_to_host(bus);
phb = hose->private_data;
- tbl = &pe->tce32_table;
+ tbl = &pe->table_group.tables[0];
addr = tbl->it_base;

opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
@@ -1589,7 +1590,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev

pe = &phb->ioda.pe_array[pdn->pe_number];
WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
- set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
+ set_iommu_table_base_and_group(&pdev->dev, &pe->table_group.tables[0]);
}

static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
@@ -1616,7 +1617,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
} else {
dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
set_dma_ops(&pdev->dev, &dma_iommu_ops);
- set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+ set_iommu_table_base(&pdev->dev, &pe->table_group.tables[0]);
}
*pdev->dev.dma_mask = dma_mask;
return 0;
@@ -1653,9 +1654,10 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
list_for_each_entry(dev, &bus->devices, bus_list) {
if (add_to_iommu_group)
set_iommu_table_base_and_group(&dev->dev,
- &pe->tce32_table);
+ &pe->table_group.tables[0]);
else
- set_iommu_table_base(&dev->dev, &pe->tce32_table);
+ set_iommu_table_base(&dev->dev,
+ &pe->table_group.tables[0]);

if (dev->subordinate)
pnv_ioda_setup_bus_dma(pe, dev->subordinate,
@@ -1751,8 +1753,8 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
__be64 *startp, __be64 *endp, bool rm)
{
- struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
- tce32_table);
+ struct pnv_ioda_pe *pe = container_of(tbl->it_table_group,
+ struct pnv_ioda_pe, table_group);
struct pnv_phb *phb = pe->phb;

if (phb->type == PNV_PHB_IODA1)
@@ -1820,8 +1822,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
}
}

+ /* Setup iommu */
+ pe->table_group.tables[0].it_table_group = &pe->table_group;
+
/* Setup linux iommu table */
- tbl = &pe->tce32_table;
+ tbl = &pe->table_group.tables[0];
pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
base << 28, IOMMU_PAGE_SHIFT_4K);

@@ -1844,15 +1849,15 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
iommu_init_table(tbl, phb->hose->node);

if (pe->flags & PNV_IODA_PE_DEV) {
- iommu_register_group(tbl, phb->hose->global_number,
+ iommu_register_group(&pe->table_group, phb->hose->global_number,
pe->pe_number);
set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
} else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
- iommu_register_group(tbl, phb->hose->global_number,
+ iommu_register_group(&pe->table_group, phb->hose->global_number,
pe->pe_number);
pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
} else if (pe->flags & PNV_IODA_PE_VF) {
- iommu_register_group(tbl, phb->hose->global_number,
+ iommu_register_group(&pe->table_group, phb->hose->global_number,
pe->pe_number);
}

@@ -1867,8 +1872,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,

static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
{
- struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
- tce32_table);
+ struct pnv_ioda_pe *pe = container_of(tbl->it_table_group,
+ struct pnv_ioda_pe, table_group);
uint16_t window_id = (pe->pe_number << 1 ) + 1;
int64_t rc;

@@ -1913,10 +1918,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
pe->tce_bypass_base = 1ull << 59;

/* Install set_bypass callback for VFIO */
- pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
+ pe->table_group.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;

/* Enable bypass by default */
- pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
+ pnv_pci_ioda2_set_bypass(&pe->table_group.tables[0], true);
}

static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
@@ -1963,8 +1968,11 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
goto fail;
}

+ /* Setup iommu */
+ pe->table_group.tables[0].it_table_group = &pe->table_group;
+
/* Setup linux iommu table */
- tbl = &pe->tce32_table;
+ tbl = &pe->table_group.tables[0];
pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
IOMMU_PAGE_SHIFT_4K);

@@ -1985,15 +1993,15 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
iommu_init_table(tbl, phb->hose->node);

if (pe->flags & PNV_IODA_PE_DEV) {
- iommu_register_group(tbl, phb->hose->global_number,
+ iommu_register_group(&pe->table_group, phb->hose->global_number,
pe->pe_number);
set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
} else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
- iommu_register_group(tbl, phb->hose->global_number,
+ iommu_register_group(&pe->table_group, phb->hose->global_number,
pe->pe_number);
pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
} else if (pe->flags & PNV_IODA_PE_VF) {
- iommu_register_group(tbl, phb->hose->global_number,
+ iommu_register_group(&pe->table_group, phb->hose->global_number,
pe->pe_number);
}

diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index f05057e..a073af0 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -92,14 +92,17 @@ static struct iommu_table_ops pnv_p5ioc2_iommu_ops = {
static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
struct pci_dev *pdev)
{
- if (phb->p5ioc2.iommu_table.it_map == NULL) {
- phb->p5ioc2.iommu_table.it_ops = &pnv_p5ioc2_iommu_ops;
- iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
- iommu_register_group(&phb->p5ioc2.iommu_table,
+ if (phb->p5ioc2.table_group.tables[0].it_map == NULL) {
+ phb->p5ioc2.table_group.tables[0].it_ops =
+ &pnv_p5ioc2_iommu_ops;
+ iommu_init_table(&phb->p5ioc2.table_group.tables[0],
+ phb->hose->node);
+ iommu_register_group(&phb->p5ioc2.table_group,
pci_domain_nr(phb->hose->bus), phb->opal_id);
}

- set_iommu_table_base_and_group(&pdev->dev, &phb->p5ioc2.iommu_table);
+ set_iommu_table_base_and_group(&pdev->dev,
+ &phb->p5ioc2.table_group.tables[0]);
}

static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
@@ -174,9 +177,13 @@ static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
/* Setup MSI support */
pnv_pci_init_p5ioc2_msis(phb);

+ /* Setup iommu */
+ phb->p5ioc2.table_group.tables[0].it_table_group =
+ &phb->p5ioc2.table_group;
+
/* Setup TCEs */
phb->dma_dev_setup = pnv_pci_p5ioc2_dma_dev_setup;
- pnv_pci_setup_iommu_table(&phb->p5ioc2.iommu_table,
+ pnv_pci_setup_iommu_table(&phb->p5ioc2.table_group.tables[0],
tce_mem, tce_size, 0,
IOMMU_PAGE_SHIFT_4K);
}
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 7eb6076..368d4ed 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -57,7 +57,7 @@ struct pnv_ioda_pe {
/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
int tce32_seg;
int tce32_segcount;
- struct iommu_table tce32_table;
+ struct iommu_table_group table_group;
phys_addr_t tce_inval_reg_phys;

/* 64-bit TCE bypass region */
@@ -122,7 +122,7 @@ struct pnv_phb {

union {
struct {
- struct iommu_table iommu_table;
+ struct iommu_table_group table_group;
} p5ioc2;

struct {
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index e379acf..8c29919 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -36,6 +36,7 @@
#include <linux/crash_dump.h>
#include <linux/memory.h>
#include <linux/of.h>
+#include <linux/iommu.h>
#include <asm/io.h>
#include <asm/prom.h>
#include <asm/rtas.h>
@@ -51,6 +52,42 @@

#include "pseries.h"

+static struct iommu_table_group *iommu_pseries_group_alloc(int node)
+{
+ struct iommu_table_group *table_group;
+
+ table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL,
+ node);
+ if (!table_group) {
+ pr_debug("%s failed\n", __func__);
+ return NULL;
+ }
+ table_group->tables[0].it_table_group = table_group;
+
+ return table_group;
+}
+
+static void iommu_table_group_free(struct iommu_table_group *table_group)
+{
+ long i;
+
+ if (!table_group)
+ return;
+
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i)
+ iommu_reset_table(&table_group->tables[i], "noname");
+
+#ifdef CONFIG_IOMMU_API
+ if (table_group->group) {
+ iommu_group_put(table_group->group);
+ BUG_ON(table_group->group);
+ }
+#endif
+
+ /* free table group */
+ kfree(table_group);
+}
+
static void tce_invalidate_pSeries_sw(struct iommu_table *tbl,
__be64 *startp, __be64 *endp)
{
@@ -618,13 +655,13 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
pci->phb->dma_window_size = 0x8000000ul;
pci->phb->dma_window_base_cur = 0x8000000ul;

- tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
- pci->phb->node);
+ pci->table_group = iommu_pseries_group_alloc(pci->phb->node);
+ tbl = &pci->table_group->tables[0];

iommu_table_setparms(pci->phb, dn, tbl);
tbl->it_ops = &iommu_table_pseries_ops;
- pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
- iommu_register_group(tbl, pci_domain_nr(bus), 0);
+ iommu_init_table(tbl, pci->phb->node);
+ iommu_register_group(pci->table_group, pci_domain_nr(bus), 0);

/* Divide the rest (1.75GB) among the children */
pci->phb->dma_window_size = 0x80000000ul;
@@ -667,16 +704,17 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
ppci = PCI_DN(pdn);

pr_debug(" parent is %s, iommu_table: 0x%p\n",
- pdn->full_name, ppci->iommu_table);
+ pdn->full_name, ppci->table_group);

- if (!ppci->iommu_table) {
- tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
- ppci->phb->node);
+ if (!ppci->table_group) {
+ ppci->table_group = iommu_pseries_group_alloc(ppci->phb->node);
+ tbl = &ppci->table_group->tables[0];
iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
tbl->it_ops = &iommu_table_lpar_multi_ops;
- ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
- iommu_register_group(tbl, pci_domain_nr(bus), 0);
- pr_debug(" created table: %p\n", ppci->iommu_table);
+ iommu_init_table(tbl, ppci->phb->node);
+ iommu_register_group(ppci->table_group,
+ pci_domain_nr(bus), 0);
+ pr_debug(" created table: %p\n", ppci->table_group);
}
}

@@ -698,14 +736,16 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
struct pci_controller *phb = PCI_DN(dn)->phb;

pr_debug(" --> first child, no bridge. Allocating iommu table.\n");
- tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
- phb->node);
+ PCI_DN(dn)->table_group = iommu_pseries_group_alloc(phb->node);
+ tbl = &PCI_DN(dn)->table_group->tables[0];
iommu_table_setparms(phb, dn, tbl);
tbl->it_ops = &iommu_table_pseries_ops;
- PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
- iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
+ iommu_init_table(tbl, phb->node);
+ iommu_register_group(PCI_DN(dn)->table_group,
+ pci_domain_nr(phb->bus), 0);
set_iommu_table_base_and_group(&dev->dev,
- PCI_DN(dn)->iommu_table);
+ &PCI_DN(dn)->
+ table_group->tables[0]);
return;
}

@@ -713,12 +753,13 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
* an already allocated iommu table is found and use that.
*/

- while (dn && PCI_DN(dn) && PCI_DN(dn)->iommu_table == NULL)
+ while (dn && PCI_DN(dn) && PCI_DN(dn)->table_group == NULL)
dn = dn->parent;

if (dn && PCI_DN(dn))
set_iommu_table_base_and_group(&dev->dev,
- PCI_DN(dn)->iommu_table);
+ &PCI_DN(dn)->
+ table_group->tables[0]);
else
printk(KERN_WARNING "iommu: Device %s has no iommu table\n",
pci_name(dev));
@@ -1104,7 +1145,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
dn = pci_device_to_OF_node(dev);
pr_debug(" node is %s\n", dn->full_name);

- for (pdn = dn; pdn && PCI_DN(pdn) && !PCI_DN(pdn)->iommu_table;
+ for (pdn = dn; pdn && PCI_DN(pdn) && !PCI_DN(pdn)->table_group;
pdn = pdn->parent) {
dma_window = of_get_property(pdn, "ibm,dma-window", NULL);
if (dma_window)
@@ -1120,19 +1161,20 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
pr_debug(" parent is %s\n", pdn->full_name);

pci = PCI_DN(pdn);
- if (!pci->iommu_table) {
- tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
- pci->phb->node);
+ if (!pci->table_group) {
+ pci->table_group = iommu_pseries_group_alloc(pci->phb->node);
+ tbl = &pci->table_group->tables[0];
iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
tbl->it_ops = &iommu_table_lpar_multi_ops;
- pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
- iommu_register_group(tbl, pci_domain_nr(pci->phb->bus), 0);
- pr_debug(" created table: %p\n", pci->iommu_table);
+ iommu_init_table(tbl, pci->phb->node);
+ iommu_register_group(pci->table_group,
+ pci_domain_nr(pci->phb->bus), 0);
+ pr_debug(" created table: %p\n", pci->table_group);
} else {
- pr_debug(" found DMA window, table: %p\n", pci->iommu_table);
+ pr_debug(" found DMA window, table: %p\n", pci->table_group);
}

- set_iommu_table_base_and_group(&dev->dev, pci->iommu_table);
+ set_iommu_table_base_and_group(&dev->dev, &pci->table_group->tables[0]);
}

static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
@@ -1162,7 +1204,7 @@ static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
* search upwards in the tree until we either hit a dma-window
* property, OR find a parent with a table already allocated.
*/
- for (pdn = dn; pdn && PCI_DN(pdn) && !PCI_DN(pdn)->iommu_table;
+ for (pdn = dn; pdn && PCI_DN(pdn) && !PCI_DN(pdn)->table_group;
pdn = pdn->parent) {
dma_window = of_get_property(pdn, "ibm,dma-window", NULL);
if (dma_window)
@@ -1206,7 +1248,7 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev)
dn = pci_device_to_OF_node(pdev);

/* search upwards for ibm,dma-window */
- for (; dn && PCI_DN(dn) && !PCI_DN(dn)->iommu_table;
+ for (; dn && PCI_DN(dn) && !PCI_DN(dn)->table_group;
dn = dn->parent)
if (of_get_property(dn, "ibm,dma-window", NULL))
break;
@@ -1286,8 +1328,8 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti
* the device node.
*/
remove_ddw(np, false);
- if (pci && pci->iommu_table)
- iommu_free_table(pci->iommu_table, np->full_name);
+ if (pci && pci->table_group)
+ iommu_table_group_free(pci->table_group);

spin_lock(&direct_window_list_lock);
list_for_each_entry(window, &direct_window_list, list) {
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 0fbe03e..17e884a 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -88,7 +88,7 @@ static void decrement_locked_vm(long npages)
*/
struct tce_container {
struct mutex lock;
- struct iommu_table *tbl;
+ struct iommu_group *grp;
bool enabled;
unsigned long locked_pages;
};
@@ -103,13 +103,41 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift)
return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift;
}

+static struct iommu_table *spapr_tce_find_table(
+ struct tce_container *container,
+ phys_addr_t ioba)
+{
+ long i;
+ struct iommu_table *ret = NULL;
+ struct iommu_table_group *table_group;
+
+ table_group = iommu_group_get_iommudata(container->grp);
+ if (!table_group)
+ return NULL;
+
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+ struct iommu_table *tbl = &table_group->tables[i];
+ unsigned long entry = ioba >> tbl->it_page_shift;
+ unsigned long start = tbl->it_offset;
+ unsigned long end = start + tbl->it_size;
+
+ if ((start <= entry) && (entry < end)) {
+ ret = tbl;
+ break;
+ }
+ }
+
+ return ret;
+}
+
static int tce_iommu_enable(struct tce_container *container)
{
int ret = 0;
unsigned long locked;
- struct iommu_table *tbl = container->tbl;
+ struct iommu_table *tbl;
+ struct iommu_table_group *table_group;

- if (!container->tbl)
+ if (!container->grp)
return -ENXIO;

if (!current->mm)
@@ -143,6 +171,11 @@ static int tce_iommu_enable(struct tce_container *container)
* as this information is only available from KVM and VFIO is
* KVM agnostic.
*/
+ table_group = iommu_group_get_iommudata(container->grp);
+ if (!table_group)
+ return -ENODEV;
+
+ tbl = &table_group->tables[0];
locked = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
ret = try_increment_locked_vm(locked);
if (ret)
@@ -190,10 +223,10 @@ static void tce_iommu_release(void *iommu_data)
{
struct tce_container *container = iommu_data;

- WARN_ON(container->tbl && !container->tbl->it_group);
+ WARN_ON(container->grp);

- if (container->tbl && container->tbl->it_group)
- tce_iommu_detach_group(iommu_data, container->tbl->it_group);
+ if (container->grp)
+ tce_iommu_detach_group(iommu_data, container->grp);

tce_iommu_disable(container);
mutex_destroy(&container->lock);
@@ -311,9 +344,16 @@ static long tce_iommu_ioctl(void *iommu_data,

case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
struct vfio_iommu_spapr_tce_info info;
- struct iommu_table *tbl = container->tbl;
+ struct iommu_table *tbl;
+ struct iommu_table_group *table_group;

- if (WARN_ON(!tbl))
+ if (WARN_ON(!container->grp))
+ return -ENXIO;
+
+ table_group = iommu_group_get_iommudata(container->grp);
+
+ tbl = &table_group->tables[0];
+ if (WARN_ON_ONCE(!tbl))
return -ENXIO;

minsz = offsetofend(struct vfio_iommu_spapr_tce_info,
@@ -336,17 +376,12 @@ static long tce_iommu_ioctl(void *iommu_data,
}
case VFIO_IOMMU_MAP_DMA: {
struct vfio_iommu_type1_dma_map param;
- struct iommu_table *tbl = container->tbl;
+ struct iommu_table *tbl;
unsigned long tce;

if (!container->enabled)
return -EPERM;

- if (!tbl)
- return -ENXIO;
-
- BUG_ON(!tbl->it_group);
-
minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);

if (copy_from_user(&param, (void __user *)arg, minsz))
@@ -359,6 +394,10 @@ static long tce_iommu_ioctl(void *iommu_data,
VFIO_DMA_MAP_FLAG_WRITE))
return -EINVAL;

+ tbl = spapr_tce_find_table(container, param.iova);
+ if (!tbl)
+ return -ENXIO;
+
if ((param.size & ~IOMMU_PAGE_MASK(tbl)) ||
(param.vaddr & ~IOMMU_PAGE_MASK(tbl)))
return -EINVAL;
@@ -384,14 +423,11 @@ static long tce_iommu_ioctl(void *iommu_data,
}
case VFIO_IOMMU_UNMAP_DMA: {
struct vfio_iommu_type1_dma_unmap param;
- struct iommu_table *tbl = container->tbl;
+ struct iommu_table *tbl;

if (!container->enabled)
return -EPERM;

- if (WARN_ON(!tbl))
- return -ENXIO;
-
minsz = offsetofend(struct vfio_iommu_type1_dma_unmap,
size);

@@ -405,6 +441,10 @@ static long tce_iommu_ioctl(void *iommu_data,
if (param.flags)
return -EINVAL;

+ tbl = spapr_tce_find_table(container, param.iova);
+ if (!tbl)
+ return -ENXIO;
+
if (param.size & ~IOMMU_PAGE_MASK(tbl))
return -EINVAL;

@@ -433,10 +473,10 @@ static long tce_iommu_ioctl(void *iommu_data,
mutex_unlock(&container->lock);
return 0;
case VFIO_EEH_PE_OP:
- if (!container->tbl || !container->tbl->it_group)
+ if (!container->grp)
return -ENODEV;

- return vfio_spapr_iommu_eeh_ioctl(container->tbl->it_group,
+ return vfio_spapr_iommu_eeh_ioctl(container->grp,
cmd, arg);
}

@@ -448,16 +488,15 @@ static int tce_iommu_attach_group(void *iommu_data,
{
int ret;
struct tce_container *container = iommu_data;
- struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+ struct iommu_table_group *table_group;

- BUG_ON(!tbl);
mutex_lock(&container->lock);

/* pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
iommu_group_id(iommu_group), iommu_group); */
- if (container->tbl) {
+ if (container->grp) {
pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
- iommu_group_id(container->tbl->it_group),
+ iommu_group_id(container->grp),
iommu_group_id(iommu_group));
ret = -EBUSY;
goto unlock_exit;
@@ -470,9 +509,15 @@ static int tce_iommu_attach_group(void *iommu_data,
goto unlock_exit;
}

- ret = iommu_take_ownership(tbl);
+ table_group = iommu_group_get_iommudata(iommu_group);
+ if (!table_group) {
+ ret = -ENXIO;
+ goto unlock_exit;
+ }
+
+ ret = iommu_take_ownership(&table_group->tables[0]);
if (!ret)
- container->tbl = tbl;
+ container->grp = iommu_group;

unlock_exit:
mutex_unlock(&container->lock);
@@ -484,26 +529,31 @@ static void tce_iommu_detach_group(void *iommu_data,
struct iommu_group *iommu_group)
{
struct tce_container *container = iommu_data;
- struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+ struct iommu_table_group *table_group;
+ struct iommu_table *tbl;

- BUG_ON(!tbl);
mutex_lock(&container->lock);
- if (tbl != container->tbl) {
+ if (iommu_group != container->grp) {
pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
iommu_group_id(iommu_group),
- iommu_group_id(tbl->it_group));
+ iommu_group_id(container->grp));
goto unlock_exit;
}

if (container->enabled) {
pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
- iommu_group_id(tbl->it_group));
+ iommu_group_id(container->grp));
tce_iommu_disable(container);
}

/* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
iommu_group_id(iommu_group), iommu_group); */
- container->tbl = NULL;
+ container->grp = NULL;
+
+ table_group = iommu_group_get_iommudata(iommu_group);
+ BUG_ON(!table_group);
+
+ tbl = &table_group->tables[0];
tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
iommu_release_ownership(tbl);

--
2.0.0

2015-04-25 12:19:55

by Alexey Kardashevskiy

Subject: [PATCH kernel v9 13/32] vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control

This adds tce_iommu_take_ownership() and tce_iommu_release_ownership(),
which call iommu_take_ownership()/iommu_release_ownership() in a loop
for every table in the group. As there is just one table now, no change
in behaviour is expected.

At the moment the iommu_table struct has a set_bypass() callback which
enables/disables DMA bypass on an IODA2 PHB. This is exposed to the
POWERPC IOMMU code, which calls the callback when external IOMMU users
such as VFIO are about to take control over a PHB.

The set_bypass() callback is not really an iommu_table function but an
IOMMU/PE function. This introduces an iommu_table_group_ops struct and
adds take_ownership()/release_ownership() callbacks to it, which are
called when an external user takes/releases control over the IOMMU.

This replaces set_bypass() with ownership callbacks as the operation is
not necessarily just enabling bypass; it can be something else or more,
so the callbacks get a more generic name.

The callbacks are implemented for IODA2 only. Other platforms (P5IOC2,
IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
The following patches will replace iommu_take_ownership/
iommu_release_ownership calls in IODA2 with full IOMMU table release/
create.
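
The resulting attach path in vfio_iommu_spapr_tce.c is a two-way
dispatch; a condensed sketch of the hunk below (locking and the error
path are trimmed):

	table_group = iommu_group_get_iommudata(iommu_group);

	if (!table_group->ops || !table_group->ops->take_ownership ||
			!table_group->ops->release_ownership) {
		/* P5IOC2/IODA1: grab each table's allocation bitmap */
		ret = tce_iommu_take_ownership(table_group);
	} else {
		/* IODA2: the PE-level callback also disables bypass */
		table_group->ops->take_ownership(table_group);
		ret = 0;
	}

	if (!ret)
		container->grp = iommu_group;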

Signed-off-by: Alexey Kardashevskiy <[email protected]>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <[email protected]>
---
Changes:
v9:
* squashed "vfio: powerpc/spapr: powerpc/iommu: Rework IOMMU ownership control"
and "vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework IOMMU ownership control"
into a single patch
* moved helpers with a loop through tables in a group
to vfio_iommu_spapr_tce.c to keep the platform code free of IOMMU table
groups as much as possible
* added missing tce_iommu_clear() to tce_iommu_release_ownership()
* replaced the set_ownership(enable) callback with take_ownership() and
release_ownership()
---
arch/powerpc/include/asm/iommu.h | 13 +++++-
arch/powerpc/kernel/iommu.c | 11 ------
arch/powerpc/platforms/powernv/pci-ioda.c | 40 +++++++++++++++----
drivers/vfio/vfio_iommu_spapr_tce.c | 66 +++++++++++++++++++++++++++----
4 files changed, 103 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index fa37519..e63419e 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -93,7 +93,6 @@ struct iommu_table {
unsigned long it_page_shift;/* table iommu page size */
struct iommu_table_group *it_table_group;
struct iommu_table_ops *it_ops;
- void (*set_bypass)(struct iommu_table *tbl, bool enable);
};

/* Pure 2^n version of get_order */
@@ -128,11 +127,23 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,

#define IOMMU_TABLE_GROUP_MAX_TABLES 1

+struct iommu_table_group;
+
+struct iommu_table_group_ops {
+ /*
+ * Switches ownership from the kernel itself to an external
+ * user. While onwership is taken, the kernel cannot use IOMMU itself.
+ */
+ void (*take_ownership)(struct iommu_table_group *table_group);
+ void (*release_ownership)(struct iommu_table_group *table_group);
+};
+
struct iommu_table_group {
#ifdef CONFIG_IOMMU_API
struct iommu_group *group;
#endif
struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
+ struct iommu_table_group_ops *ops;
};

#ifdef CONFIG_IOMMU_API
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 005146b..2856d27 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1057,13 +1057,6 @@ int iommu_take_ownership(struct iommu_table *tbl)

memset(tbl->it_map, 0xff, sz);

- /*
- * Disable iommu bypass, otherwise the user can DMA to all of
- * our physical memory via the bypass window instead of just
- * the pages that has been explicitly mapped into the iommu
- */
- if (tbl->set_bypass)
- tbl->set_bypass(tbl, false);

return 0;
}
@@ -1078,10 +1071,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
/* Restore bit#0 set by iommu_init_table() */
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);
-
- /* The kernel owns the device now, we can restore the iommu bypass */
- if (tbl->set_bypass)
- tbl->set_bypass(tbl, true);
}
EXPORT_SYMBOL_GPL(iommu_release_ownership);

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 88472cb..718d5cc 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1870,10 +1870,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
}

-static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
+static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
{
- struct pnv_ioda_pe *pe = container_of(tbl->it_table_group,
- struct pnv_ioda_pe, table_group);
uint16_t window_id = (pe->pe_number << 1 ) + 1;
int64_t rc;

@@ -1901,7 +1899,8 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
* host side.
*/
if (pe->pdev)
- set_iommu_table_base(&pe->pdev->dev, tbl);
+ set_iommu_table_base(&pe->pdev->dev,
+ &pe->table_group.tables[0]);
else
pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
}
@@ -1917,13 +1916,35 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
/* TVE #1 is selected by PCI address bit 59 */
pe->tce_bypass_base = 1ull << 59;

- /* Install set_bypass callback for VFIO */
- pe->table_group.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;
-
/* Enable bypass by default */
- pnv_pci_ioda2_set_bypass(&pe->table_group.tables[0], true);
+ pnv_pci_ioda2_set_bypass(pe, true);
}

+#ifdef CONFIG_IOMMU_API
+static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
+{
+ struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
+ table_group);
+
+ iommu_take_ownership(&table_group->tables[0]);
+ pnv_pci_ioda2_set_bypass(pe, false);
+}
+
+static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
+{
+ struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
+ table_group);
+
+ iommu_release_ownership(&table_group->tables[0]);
+ pnv_pci_ioda2_set_bypass(pe, true);
+}
+
+static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
+ .take_ownership = pnv_ioda2_take_ownership,
+ .release_ownership = pnv_ioda2_release_ownership,
+};
+#endif
+
static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
struct pnv_ioda_pe *pe)
{
@@ -1991,6 +2012,9 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
}
tbl->it_ops = &pnv_ioda2_iommu_ops;
iommu_init_table(tbl, phb->hose->node);
+#ifdef CONFIG_IOMMU_API
+ pe->table_group.ops = &pnv_pci_ioda2_ops;
+#endif

if (pe->flags & PNV_IODA_PE_DEV) {
iommu_register_group(&pe->table_group, phb->hose->global_number,
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 17e884a..dacc738 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -483,6 +483,43 @@ static long tce_iommu_ioctl(void *iommu_data,
return -ENOTTY;
}

+static void tce_iommu_release_ownership(struct tce_container *container,
+ struct iommu_table_group *table_group)
+{
+ int i;
+
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+ struct iommu_table *tbl = &table_group->tables[i];
+
+ tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
+ if (tbl->it_map)
+ iommu_release_ownership(tbl);
+ }
+}
+
+static int tce_iommu_take_ownership(struct iommu_table_group *table_group)
+{
+ int i, j, rc = 0;
+
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+ struct iommu_table *tbl = &table_group->tables[i];
+
+ if (!tbl->it_map)
+ continue;
+
+ rc = iommu_take_ownership(tbl);
+ if (rc) {
+ for (j = 0; j < i; ++j)
+ iommu_release_ownership(
+ &table_group->tables[j]);
+
+ return rc;
+ }
+ }
+
+ return 0;
+}
+
static int tce_iommu_attach_group(void *iommu_data,
struct iommu_group *iommu_group)
{
@@ -515,9 +552,23 @@ static int tce_iommu_attach_group(void *iommu_data,
goto unlock_exit;
}

- ret = iommu_take_ownership(&table_group->tables[0]);
- if (!ret)
- container->grp = iommu_group;
+ if (!table_group->ops || !table_group->ops->take_ownership ||
+ !table_group->ops->release_ownership) {
+ ret = tce_iommu_take_ownership(table_group);
+ } else {
+ /*
+ * Disable iommu bypass, otherwise the user can DMA to all of
+ * our physical memory via the bypass window instead of just
+ * the pages that has been explicitly mapped into the iommu
+ */
+ table_group->ops->take_ownership(table_group);
+ ret = 0;
+ }
+
+ if (ret)
+ goto unlock_exit;
+
+ container->grp = iommu_group;

unlock_exit:
mutex_unlock(&container->lock);
@@ -530,7 +581,6 @@ static void tce_iommu_detach_group(void *iommu_data,
{
struct tce_container *container = iommu_data;
struct iommu_table_group *table_group;
- struct iommu_table *tbl;

mutex_lock(&container->lock);
if (iommu_group != container->grp) {
@@ -553,9 +603,11 @@ static void tce_iommu_detach_group(void *iommu_data,
table_group = iommu_group_get_iommudata(iommu_group);
BUG_ON(!table_group);

- tbl = &table_group->tables[0];
- tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
- iommu_release_ownership(tbl);
+ /* Kernel owns the device now, we can restore bypass */
+ if (!table_group->ops || !table_group->ops->release_ownership)
+ tce_iommu_release_ownership(container, table_group);
+ else
+ table_group->ops->release_ownership(table_group);

unlock_exit:
mutex_unlock(&container->lock);
--
2.0.0

2015-04-25 12:19:50

by Alexey Kardashevskiy

Subject: [PATCH kernel v9 14/32] powerpc/iommu: Fix IOMMU ownership control functions

This adds missing locks in iommu_take_ownership()/
iommu_release_ownership().

This marks all pages busy in iommu_table::it_map in order to catch
errors if there is an attempt to use this table while ownership over it
is taken.

This only clears the TCE content if there is no page marked busy in
it_map. Clearing must be done outside of the table locks as
iommu_clear_tce(), called from iommu_clear_tces_and_put_pages(), takes
those locks itself.

In order to use bitmap_empty(), the existing code clears bit#0, which
is set even in an empty table if the table is bus-mapped at 0:
iommu_init_table() reserves page#0 to prevent buggy drivers from
crashing when an allocated page is bus-mapped at zero (which is
correct). This restores the bit in the case of failure so that it_map
is brought back to the state it was in when iommu_take_ownership()
was called.
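
Put differently, the busy check performed under the pool locks boils
down to the following (a simplified restatement of the hunk below):

	/* Page 0 is reserved by iommu_init_table() for zero-based tables,
	 * so drop its bit before the emptiness check and put it back if
	 * the table turns out to be in use. */
	if (tbl->it_offset == 0)
		clear_bit(0, tbl->it_map);

	if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
		ret = -EBUSY;
		if (tbl->it_offset == 0)
			set_bit(0, tbl->it_map);
	} else {
		memset(tbl->it_map, 0xff, sz);	/* mark all entries busy */
	}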

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v9:
* iommu_table_take_ownership() did not return @ret (and ignored EBUSY),
now it does return correct error.
* updated commit log about setting bit#0 in the case of failure

v5:
* do not store bit#0 value, it has to be set for zero-based table
anyway
* removed test_and_clear_bit
---
arch/powerpc/kernel/iommu.c | 31 +++++++++++++++++++++++++------
1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 2856d27..ea2c8ba 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1045,32 +1045,51 @@ EXPORT_SYMBOL_GPL(iommu_tce_build);

int iommu_take_ownership(struct iommu_table *tbl)
{
- unsigned long sz = (tbl->it_size + 7) >> 3;
+ unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+ int ret = 0;
+
+ spin_lock_irqsave(&tbl->large_pool.lock, flags);
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_lock(&tbl->pools[i].lock);

if (tbl->it_offset == 0)
clear_bit(0, tbl->it_map);

if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
pr_err("iommu_tce: it_map is not empty");
- return -EBUSY;
+ ret = -EBUSY;
+ /* Restore bit#0 set by iommu_init_table() */
+ if (tbl->it_offset == 0)
+ set_bit(0, tbl->it_map);
+ } else {
+ memset(tbl->it_map, 0xff, sz);
}

- memset(tbl->it_map, 0xff, sz);
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_unlock(&tbl->pools[i].lock);
+ spin_unlock_irqrestore(&tbl->large_pool.lock, flags);

-
- return 0;
+ return ret;
}
EXPORT_SYMBOL_GPL(iommu_take_ownership);

void iommu_release_ownership(struct iommu_table *tbl)
{
- unsigned long sz = (tbl->it_size + 7) >> 3;
+ unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+
+ spin_lock_irqsave(&tbl->large_pool.lock, flags);
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_lock(&tbl->pools[i].lock);

memset(tbl->it_map, 0, sz);

/* Restore bit#0 set by iommu_init_table() */
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);
+
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_unlock(&tbl->pools[i].lock);
+ spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
}
EXPORT_SYMBOL_GPL(iommu_release_ownership);

--
2.0.0

2015-04-25 12:20:37

by Alexey Kardashevskiy

Subject: [PATCH kernel v9 15/32] powerpc/powernv/ioda/ioda2: Rework TCE invalidation in tce_build()/tce_free()

The pnv_pci_ioda_tce_invalidate() helper invalidates the TCE cache. It
is supposed to be called on IODA1/2 and not on p5ioc2. It receives the
start and end host addresses of the TCE table.

IODA2 actually needs PCI addresses to invalidate the cache. Those can
be calculated from host addresses but, since we are going to implement
multi-level TCE tables, calculating a PCI address from a host address
might get either tricky or ugly as the TCE table remains flat on the
PCI bus but not in RAM.

This moves pnv_pci_ioda_tce_invalidate() out of the generic
pnv_tce_build()/pnv_tce_free() and defines IODA1/2-specific callbacks
which call the generic ones and do PHB-model-specific TCE cache
invalidation. P5IOC2 keeps using the generic callbacks as before.

This changes pnv_pci_ioda2_tce_invalidate() to receive a TCE index and
a number of pages, which are PCI addresses shifted by the IOMMU page
shift.

No change in behaviour is expected.
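
For reference, the range calculation in pnv_pci_ioda2_tce_invalidate()
changes roughly as follows (a sketch extracted from the hunks below;
the invalidation register writes themselves are unchanged):

	/* before: recover the entry number from host pointers */
	inc = tbl->it_offset + (((u64)startp - tbl->it_base) / sizeof(u64));
	start |= (inc << shift);

	/* after: @index is already a bus-relative TCE index */
	start |= (index << shift);
	end |= ((index + npages - 1) << shift);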

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v9:
* removed confusing comment from commit log about unintentional calling of
pnv_pci_ioda_tce_invalidate()
* moved mechanical changes away to "powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table"
* fixed bug with broken invalidation in pnv_pci_ioda2_tce_invalidate -
@index includes @tbl->it_offset but old code added it anyway which later broke
DDW
---
arch/powerpc/platforms/powernv/pci-ioda.c | 86 +++++++++++++++++++++----------
arch/powerpc/platforms/powernv/pci.c | 17 ++----
2 files changed, 64 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 718d5cc..f070c44 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1665,18 +1665,20 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
}
}

-static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
- struct iommu_table *tbl,
- __be64 *startp, __be64 *endp, bool rm)
+static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
+ unsigned long index, unsigned long npages, bool rm)
{
+ struct pnv_ioda_pe *pe = container_of(tbl->it_table_group,
+ struct pnv_ioda_pe, table_group);
__be64 __iomem *invalidate = rm ?
(__be64 __iomem *)pe->tce_inval_reg_phys :
(__be64 __iomem *)tbl->it_index;
unsigned long start, end, inc;
const unsigned shift = tbl->it_page_shift;

- start = __pa(startp);
- end = __pa(endp);
+ start = __pa((__be64 *)tbl->it_base + index - tbl->it_offset);
+ end = __pa((__be64 *)tbl->it_base + index - tbl->it_offset +
+ npages - 1);

/* BML uses this case for p6/p7/galaxy2: Shift addr and put in node */
if (tbl->it_busno) {
@@ -1712,16 +1714,40 @@ static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
*/
}

+static int pnv_ioda1_tce_build(struct iommu_table *tbl, long index,
+ long npages, unsigned long uaddr,
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs)
+{
+ long ret = pnv_tce_build(tbl, index, npages, uaddr, direction,
+ attrs);
+
+ if (!ret && (tbl->it_type & TCE_PCI_SWINV_CREATE))
+ pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false);
+
+ return ret;
+}
+
+static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
+ long npages)
+{
+ pnv_tce_free(tbl, index, npages);
+
+ if (tbl->it_type & TCE_PCI_SWINV_FREE)
+ pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false);
+}
+
static struct iommu_table_ops pnv_ioda1_iommu_ops = {
- .set = pnv_tce_build,
- .clear = pnv_tce_free,
+ .set = pnv_ioda1_tce_build,
+ .clear = pnv_ioda1_tce_free,
.get = pnv_tce_get,
};

-static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
- struct iommu_table *tbl,
- __be64 *startp, __be64 *endp, bool rm)
+static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
+ unsigned long index, unsigned long npages, bool rm)
{
+ struct pnv_ioda_pe *pe = container_of(tbl->it_table_group,
+ struct pnv_ioda_pe, table_group);
unsigned long start, end, inc;
__be64 __iomem *invalidate = rm ?
(__be64 __iomem *)pe->tce_inval_reg_phys :
@@ -1734,10 +1760,8 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
end = start;

/* Figure out the start, end and step */
- inc = tbl->it_offset + (((u64)startp - tbl->it_base) / sizeof(u64));
- start |= (inc << shift);
- inc = tbl->it_offset + (((u64)endp - tbl->it_base) / sizeof(u64));
- end |= (inc << shift);
+ start |= (index << shift);
+ end |= ((index + npages - 1) << shift);
inc = (0x1ull << shift);
mb();

@@ -1750,22 +1774,32 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
}
}

-void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
- __be64 *startp, __be64 *endp, bool rm)
+static int pnv_ioda2_tce_build(struct iommu_table *tbl, long index,
+ long npages, unsigned long uaddr,
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs)
{
- struct pnv_ioda_pe *pe = container_of(tbl->it_table_group,
- struct pnv_ioda_pe, table_group);
- struct pnv_phb *phb = pe->phb;
-
- if (phb->type == PNV_PHB_IODA1)
- pnv_pci_ioda1_tce_invalidate(pe, tbl, startp, endp, rm);
- else
- pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
+ long ret = pnv_tce_build(tbl, index, npages, uaddr, direction,
+ attrs);
+
+ if (!ret && (tbl->it_type & TCE_PCI_SWINV_CREATE))
+ pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
+
+ return ret;
+}
+
+static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
+ long npages)
+{
+ pnv_tce_free(tbl, index, npages);
+
+ if (tbl->it_type & TCE_PCI_SWINV_FREE)
+ pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
}

static struct iommu_table_ops pnv_ioda2_iommu_ops = {
- .set = pnv_tce_build,
- .clear = pnv_tce_free,
+ .set = pnv_ioda2_tce_build,
+ .clear = pnv_ioda2_tce_free,
.get = pnv_tce_get,
};

diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 4c3bbb1..84b4ea4 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -577,37 +577,28 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
struct dma_attrs *attrs)
{
u64 proto_tce = iommu_direction_to_tce_perm(direction);
- __be64 *tcep, *tces;
+ __be64 *tcep;
u64 rpn;

- tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
+ tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
rpn = __pa(uaddr) >> tbl->it_page_shift;

while (npages--)
*(tcep++) = cpu_to_be64(proto_tce |
(rpn++ << tbl->it_page_shift));

- /* Some implementations won't cache invalid TCEs and thus may not
- * need that flush. We'll probably turn it_type into a bit mask
- * of flags if that becomes the case
- */
- if (tbl->it_type & TCE_PCI_SWINV_CREATE)
- pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, false);

return 0;
}

void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
{
- __be64 *tcep, *tces;
+ __be64 *tcep;

- tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
+ tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;

while (npages--)
*(tcep++) = cpu_to_be64(0);
-
- if (tbl->it_type & TCE_PCI_SWINV_FREE)
- pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, false);
}

unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
--
2.0.0

2015-04-25 12:16:54

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH kernel v9 16/32] powerpc/powernv/ioda: Move TCE kill register address to PE

At the moment the DMA setup code looks for the "ibm,opal-tce-kill" property
which contains the TCE kill register address. Writing to this register
invalidates the TCE cache on an IODA/IODA2 hub.

This moves the register address from iommu_table to pnv_ioda_pe as
later there will be 2 tables per PE and it will be used for both tables.

This moves the property reading/remapping code to a helper to reduce
code duplication.

This adds a new pnv_pci_ioda2_tvt_invalidate() helper which invalidates
the entire table. It should be called after every call to
opal_pci_map_pe_dma_window(). It was not required before because
there is just a single TCE table and 64bit DMA is handled via the bypass
window (which has no table so no cache is used) but this is going
to change with Dynamic DMA windows (DDW).
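
A rough sketch of the intended setup order (using names from this patch;
the surrounding code is heavily simplified and only illustrative):

/* Sketch: per-PE DMA setup order after this change (illustrative only). */
static void example_setup_dma(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
{
	/* remap "ibm,opal-tce-kill" once per PE; fills pe->tce_inval_reg */
	pnv_pci_ioda_setup_opal_tce_kill(phb, pe);

	/* ... allocate the TCE table and call opal_pci_map_pe_dma_window() ... */

	/* flush any stale cached TCEs for this PE after (re)programming */
	pnv_pci_ioda2_tvt_invalidate(pe);
}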

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v9:
* new in the series
---
arch/powerpc/platforms/powernv/pci-ioda.c | 69 +++++++++++++++++++------------
arch/powerpc/platforms/powernv/pci.h | 1 +
2 files changed, 44 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index f070c44..b22b3ca 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1672,7 +1672,7 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
struct pnv_ioda_pe, table_group);
__be64 __iomem *invalidate = rm ?
(__be64 __iomem *)pe->tce_inval_reg_phys :
- (__be64 __iomem *)tbl->it_index;
+ pe->tce_inval_reg;
unsigned long start, end, inc;
const unsigned shift = tbl->it_page_shift;

@@ -1743,6 +1743,18 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
.get = pnv_tce_get,
};

+static inline void pnv_pci_ioda2_tvt_invalidate(struct pnv_ioda_pe *pe)
+{
+ /* 01xb - invalidate TCEs that match the specified PE# */
+ unsigned long addr = (0x4ull << 60) | (pe->pe_number & 0xFF);
+
+ if (!pe->tce_inval_reg)
+ return;
+
+ mb(); /* Ensure above stores are visible */
+ __raw_writeq(cpu_to_be64(addr), pe->tce_inval_reg);
+}
+
static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
unsigned long index, unsigned long npages, bool rm)
{
@@ -1751,7 +1763,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
unsigned long start, end, inc;
__be64 __iomem *invalidate = rm ?
(__be64 __iomem *)pe->tce_inval_reg_phys :
- (__be64 __iomem *)tbl->it_index;
+ pe->tce_inval_reg;
const unsigned shift = tbl->it_page_shift;

/* We'll invalidate DMA address in PE scope */
@@ -1803,13 +1815,31 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
.get = pnv_tce_get,
};

+static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
+ struct pnv_ioda_pe *pe)
+{
+ const __be64 *swinvp;
+
+ /* OPAL variant of PHB3 invalidated TCEs */
+ swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
+ if (!swinvp)
+ return;
+
+ /* We need a couple more fields -- an address and a data
+ * to or. Since the bus is only printed out on table free
+ * errors, and on the first pass the data will be a relative
+ * bus number, print that out instead.
+ */
+ pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
+ pe->tce_inval_reg = ioremap(pe->tce_inval_reg_phys, 8);
+}
+
static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
struct pnv_ioda_pe *pe, unsigned int base,
unsigned int segs)
{

struct page *tce_mem = NULL;
- const __be64 *swinvp;
struct iommu_table *tbl;
unsigned int i;
int64_t rc;
@@ -1823,6 +1853,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
if (WARN_ON(pe->tce32_seg >= 0))
return;

+ pnv_pci_ioda_setup_opal_tce_kill(phb, pe);
+
/* Grab a 32-bit TCE table */
pe->tce32_seg = base;
pe_info(pe, " Setting up 32-bit TCE table at %08x..%08x\n",
@@ -1865,20 +1897,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
base << 28, IOMMU_PAGE_SHIFT_4K);

/* OPAL variant of P7IOC SW invalidated TCEs */
- swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
- if (swinvp) {
- /* We need a couple more fields -- an address and a data
- * to or. Since the bus is only printed out on table free
- * errors, and on the first pass the data will be a relative
- * bus number, print that out instead.
- */
- pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
- tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys,
- 8);
+ if (pe->tce_inval_reg)
tbl->it_type |= (TCE_PCI_SWINV_CREATE |
TCE_PCI_SWINV_FREE |
TCE_PCI_SWINV_PAIR);
- }
+
tbl->it_ops = &pnv_ioda1_iommu_ops;
iommu_init_table(tbl, phb->hose->node);

@@ -1984,7 +2007,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
{
struct page *tce_mem = NULL;
void *addr;
- const __be64 *swinvp;
struct iommu_table *tbl;
unsigned int tce_table_size, end;
int64_t rc;
@@ -1993,6 +2015,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
if (WARN_ON(pe->tce32_seg >= 0))
return;

+ pnv_pci_ioda_setup_opal_tce_kill(phb, pe);
+
/* The PE will reserve all possible 32-bits space */
pe->tce32_seg = 0;
end = (1 << ilog2(phb->ioda.m32_pci_base));
@@ -2023,6 +2047,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
goto fail;
}

+ pnv_pci_ioda2_tvt_invalidate(pe);
+
/* Setup iommu */
pe->table_group.tables[0].it_table_group = &pe->table_group;

@@ -2032,18 +2058,9 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
IOMMU_PAGE_SHIFT_4K);

/* OPAL variant of PHB3 invalidated TCEs */
- swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
- if (swinvp) {
- /* We need a couple more fields -- an address and a data
- * to or. Since the bus is only printed out on table free
- * errors, and on the first pass the data will be a relative
- * bus number, print that out instead.
- */
- pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
- tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys,
- 8);
+ if (pe->tce_inval_reg)
tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
- }
+
tbl->it_ops = &pnv_ioda2_iommu_ops;
iommu_init_table(tbl, phb->hose->node);
#ifdef CONFIG_IOMMU_API
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 368d4ed..bd83d85 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -59,6 +59,7 @@ struct pnv_ioda_pe {
int tce32_segcount;
struct iommu_table_group table_group;
phys_addr_t tce_inval_reg_phys;
+ __be64 __iomem *tce_inval_reg;

/* 64-bit TCE bypass region */
bool tce_bypass_enabled;
--
2.0.0

2015-04-25 12:22:52

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH kernel v9 17/32] powerpc/powernv: Implement accessor to TCE entry

This replaces direct accesses to the TCE table with a helper which
returns a TCE entry address. This does not make a difference now but will
when multi-level TCE tables are introduced.

No change in behavior is expected.
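
For reference, the access pattern after this change looks roughly as below
(a sketch; the real callers are pnv_tce_build()/pnv_tce_free()/pnv_tce_get()
in the diff):

/* Sketch: callers turn an absolute TCE index into a table-relative one
 * and dereference the address returned by pnv_tce().
 */
static void example_write_tce(struct iommu_table *tbl, long index, u64 tce)
{
	long idx = index - tbl->it_offset;	/* table-relative index */

	*(pnv_tce(tbl, idx)) = cpu_to_be64(tce);
}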

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v9:
* new patch in the series to separate this mechanical change from
functional changes; it is placed here rather than right before
"powerpc/powernv: Implement multilevel TCE tables" in order
to let the next patch - "powerpc/iommu/powernv: Release replaced TCE" -
use pnv_tce() and avoid changing the same code twice
---
arch/powerpc/platforms/powernv/pci.c | 34 +++++++++++++++++++++-------------
1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 84b4ea4..ba75aa5 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -572,38 +572,46 @@ struct pci_ops pnv_pci_ops = {
.write = pnv_pci_write_config,
};

+static __be64 *pnv_tce(struct iommu_table *tbl, long idx)
+{
+ __be64 *tmp = ((__be64 *)tbl->it_base);
+
+ return tmp + idx;
+}
+
int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
unsigned long uaddr, enum dma_data_direction direction,
struct dma_attrs *attrs)
{
u64 proto_tce = iommu_direction_to_tce_perm(direction);
- __be64 *tcep;
- u64 rpn;
+ u64 rpn = __pa(uaddr) >> tbl->it_page_shift;
+ long i;

- tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
- rpn = __pa(uaddr) >> tbl->it_page_shift;
-
- while (npages--)
- *(tcep++) = cpu_to_be64(proto_tce |
- (rpn++ << tbl->it_page_shift));
+ for (i = 0; i < npages; i++) {
+ unsigned long newtce = proto_tce |
+ ((rpn + i) << tbl->it_page_shift);
+ unsigned long idx = index - tbl->it_offset + i;

+ *(pnv_tce(tbl, idx)) = cpu_to_be64(newtce);
+ }

return 0;
}

void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
{
- __be64 *tcep;
+ long i;

- tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
+ for (i = 0; i < npages; i++) {
+ unsigned long idx = index - tbl->it_offset + i;

- while (npages--)
- *(tcep++) = cpu_to_be64(0);
+ *(pnv_tce(tbl, idx)) = cpu_to_be64(0);
+ }
}

unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
{
- return ((u64 *)tbl->it_base)[index - tbl->it_offset];
+ return *(pnv_tce(tbl, index - tbl->it_offset));
}

void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
--
2.0.0

2015-04-25 12:19:22

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH kernel v9 18/32] powerpc/iommu/powernv: Release replaced TCE

At the moment writing a new TCE value to the IOMMU table fails with EBUSY
if there is a valid entry already. However the PAPR specification allows
the guest to write a new TCE value without clearing the old one first.

Another problem this patch is addressing is the use of pool locks for
external IOMMU users such as VFIO. The pool locks are there to protect
the DMA page allocator rather than the entries, and since the host kernel
does not control what pages are in use, there is no point in pool locks;
exchange()+put_page(oldtce) is sufficient to avoid possible races.

This adds an exchange() callback to iommu_table_ops which does the same
thing as set() plus it returns the replaced TCE and DMA direction so
the caller can release the pages afterwards. exchange() receives
a physical address, unlike set() which receives a linear mapping address,
and returns a physical address as get() does.

This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
for a platform to have exchange() implemented in order to support VFIO.

This replaces iommu_tce_build() and iommu_clear_tce() with
a single iommu_tce_xchg().

This makes sure that TCE permission bits are not set in the TCE passed to
the IOMMU API as those are to be calculated by platform code from the DMA
direction.

This moves SetPageDirty() to the IOMMU code to make it work for both
the VFIO ioctl interface and in-kernel TCE acceleration (when it becomes
available later).
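
A minimal sketch of the caller pattern enabled by exchange(), mirroring
what tce_iommu_clear()/tce_iommu_build() do in the diff (the function
name here is made up for illustration):

/* Sketch: write a new TCE, get the old physical address and direction
 * back, and release the old page if one was mapped.  SetPageDirty() is
 * already handled inside iommu_tce_xchg().
 */
static long example_replace_tce(struct iommu_table *tbl, unsigned long entry,
		unsigned long hpa, enum dma_data_direction dir)
{
	unsigned long oldhpa = hpa;
	enum dma_data_direction olddir = dir;
	long ret;

	ret = iommu_tce_xchg(tbl, entry, &oldhpa, &olddir);
	if (ret)
		return ret;

	if (olddir != DMA_NONE)		/* an old page was mapped, release it */
		put_page(pfn_to_page(oldhpa >> PAGE_SHIFT));

	return 0;
}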

Signed-off-by: Alexey Kardashevskiy <[email protected]>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <[email protected]>
---
Changes:
v9:
* changed exchange() to work with physical addresses as these addresses
are never accessed by the code and physical addresses are actual values
we put into the IOMMU table
---
arch/powerpc/include/asm/iommu.h | 22 +++++++++--
arch/powerpc/kernel/iommu.c | 57 +++++++++-------------------
arch/powerpc/platforms/powernv/pci-ioda.c | 34 +++++++++++++++++
arch/powerpc/platforms/powernv/pci-p5ioc2.c | 3 ++
arch/powerpc/platforms/powernv/pci.c | 17 +++++++++
arch/powerpc/platforms/powernv/pci.h | 2 +
drivers/vfio/vfio_iommu_spapr_tce.c | 58 ++++++++++++++++++-----------
7 files changed, 128 insertions(+), 65 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index e63419e..7e7ca0a 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -45,13 +45,29 @@ extern int iommu_is_off;
extern int iommu_force_on;

struct iommu_table_ops {
+ /*
+ * When called with direction==DMA_NONE, it is equal to clear().
+ * uaddr is a linear map address.
+ */
int (*set)(struct iommu_table *tbl,
long index, long npages,
unsigned long uaddr,
enum dma_data_direction direction,
struct dma_attrs *attrs);
+#ifdef CONFIG_IOMMU_API
+ /*
+ * Exchanges existing TCE with new TCE plus direction bits;
+ * returns old TCE and DMA direction mask.
+ * @tce is a physical address.
+ */
+ int (*exchange)(struct iommu_table *tbl,
+ long index,
+ unsigned long *tce,
+ enum dma_data_direction *direction);
+#endif
void (*clear)(struct iommu_table *tbl,
long index, long npages);
+ /* get() returns a physical address */
unsigned long (*get)(struct iommu_table *tbl, long index);
void (*flush)(struct iommu_table *tbl);
};
@@ -152,6 +168,8 @@ extern void iommu_register_group(struct iommu_table_group *table_group,
extern int iommu_add_device(struct device *dev);
extern void iommu_del_device(struct device *dev);
extern int __init tce_iommu_bus_notifier_init(void);
+extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
+ unsigned long *tce, enum dma_data_direction *direction);
#else
static inline void iommu_register_group(struct iommu_table_group *table_group,
int pci_domain_number,
@@ -231,10 +249,6 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
unsigned long npages);
extern int iommu_tce_put_param_check(struct iommu_table *tbl,
unsigned long ioba, unsigned long tce);
-extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
- unsigned long hwaddr, enum dma_data_direction direction);
-extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
- unsigned long entry);

extern void iommu_flush_tce(struct iommu_table *tbl);
extern int iommu_take_ownership(struct iommu_table *tbl);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ea2c8ba..2eaba0c 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -975,9 +975,6 @@ EXPORT_SYMBOL_GPL(iommu_tce_clear_param_check);
int iommu_tce_put_param_check(struct iommu_table *tbl,
unsigned long ioba, unsigned long tce)
{
- if (!(tce & (TCE_PCI_WRITE | TCE_PCI_READ)))
- return -EINVAL;
-
if (tce & ~(IOMMU_PAGE_MASK(tbl) | TCE_PCI_WRITE | TCE_PCI_READ))
return -EINVAL;

@@ -995,44 +992,16 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
}
EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);

-unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
+long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
+ unsigned long *tce, enum dma_data_direction *direction)
{
- unsigned long oldtce;
- struct iommu_pool *pool = get_pool(tbl, entry);
+ long ret;

- spin_lock(&(pool->lock));
+ ret = tbl->it_ops->exchange(tbl, entry, tce, direction);

- oldtce = tbl->it_ops->get(tbl, entry);
- if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
- tbl->it_ops->clear(tbl, entry, 1);
- else
- oldtce = 0;
-
- spin_unlock(&(pool->lock));
-
- return oldtce;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tce);
-
-/*
- * hwaddr is a kernel virtual address here (0xc... bazillion),
- * tce_build converts it to a physical address.
- */
-int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
- unsigned long hwaddr, enum dma_data_direction direction)
-{
- int ret = -EBUSY;
- unsigned long oldtce;
- struct iommu_pool *pool = get_pool(tbl, entry);
-
- spin_lock(&(pool->lock));
-
- oldtce = tbl->it_ops->get(tbl, entry);
- /* Add new entry if it is not busy */
- if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
- ret = tbl->it_ops->set(tbl, entry, 1, hwaddr, direction, NULL);
-
- spin_unlock(&(pool->lock));
+ if (!ret && ((*direction == DMA_FROM_DEVICE) ||
+ (*direction == DMA_BIDIRECTIONAL)))
+ SetPageDirty(pfn_to_page(*tce >> PAGE_SHIFT));

/* if (unlikely(ret))
pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
@@ -1041,13 +1010,23 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,

return ret;
}
-EXPORT_SYMBOL_GPL(iommu_tce_build);
+EXPORT_SYMBOL_GPL(iommu_tce_xchg);

int iommu_take_ownership(struct iommu_table *tbl)
{
unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
int ret = 0;

+ /*
+ * VFIO does not control TCE entries allocation and the guest
+ * can write new TCEs on top of existing ones so iommu_tce_build()
+ * must be able to release old pages. This functionality
+ * requires exchange() callback defined so if it is not
+ * implemented, we disallow taking ownership over the table.
+ */
+ if (!tbl->it_ops->exchange)
+ return -EINVAL;
+
spin_lock_irqsave(&tbl->large_pool.lock, flags);
for (i = 0; i < tbl->nr_pools; i++)
spin_lock(&tbl->pools[i].lock);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index b22b3ca..fb765af 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1728,6 +1728,20 @@ static int pnv_ioda1_tce_build(struct iommu_table *tbl, long index,
return ret;
}

+#ifdef CONFIG_IOMMU_API
+static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
+ unsigned long *tce, enum dma_data_direction *direction)
+{
+ long ret = pnv_tce_xchg(tbl, index, tce, direction);
+
+ if (!ret && (tbl->it_type &
+ (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
+ pnv_pci_ioda1_tce_invalidate(tbl, index, 1, false);
+
+ return ret;
+}
+#endif
+
static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
long npages)
{
@@ -1739,6 +1753,9 @@ static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,

static struct iommu_table_ops pnv_ioda1_iommu_ops = {
.set = pnv_ioda1_tce_build,
+#ifdef CONFIG_IOMMU_API
+ .exchange = pnv_ioda1_tce_xchg,
+#endif
.clear = pnv_ioda1_tce_free,
.get = pnv_tce_get,
};
@@ -1800,6 +1817,20 @@ static int pnv_ioda2_tce_build(struct iommu_table *tbl, long index,
return ret;
}

+#ifdef CONFIG_IOMMU_API
+static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
+ unsigned long *tce, enum dma_data_direction *direction)
+{
+ long ret = pnv_tce_xchg(tbl, index, tce, direction);
+
+ if (!ret && (tbl->it_type &
+ (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
+ pnv_pci_ioda2_tce_invalidate(tbl, index, 1, false);
+
+ return ret;
+}
+#endif
+
static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
long npages)
{
@@ -1811,6 +1842,9 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,

static struct iommu_table_ops pnv_ioda2_iommu_ops = {
.set = pnv_ioda2_tce_build,
+#ifdef CONFIG_IOMMU_API
+ .exchange = pnv_ioda2_tce_xchg,
+#endif
.clear = pnv_ioda2_tce_free,
.get = pnv_tce_get,
};
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index a073af0..7a6fd92 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -85,6 +85,9 @@ static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { }

static struct iommu_table_ops pnv_p5ioc2_iommu_ops = {
.set = pnv_tce_build,
+#ifdef CONFIG_IOMMU_API
+ .exchange = pnv_tce_xchg,
+#endif
.clear = pnv_tce_free,
.get = pnv_tce_get,
};
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index ba75aa5..e8802ac 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -598,6 +598,23 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
return 0;
}

+#ifdef CONFIG_IOMMU_API
+int pnv_tce_xchg(struct iommu_table *tbl, long index,
+ unsigned long *tce, enum dma_data_direction *direction)
+{
+ u64 proto_tce = iommu_direction_to_tce_perm(*direction);
+ unsigned long newtce = *tce | proto_tce;
+ unsigned long idx = index - tbl->it_offset;
+
+ *tce = xchg(pnv_tce(tbl, idx), cpu_to_be64(newtce));
+ *tce = be64_to_cpu(*tce);
+ *direction = iommu_tce_direction(*tce);
+ *tce &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+ return 0;
+}
+#endif
+
void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
{
long i;
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index bd83d85..b15cce5 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -205,6 +205,8 @@ extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
unsigned long uaddr, enum dma_data_direction direction,
struct dma_attrs *attrs);
extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
+extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
+ unsigned long *tce, enum dma_data_direction *direction);
extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);

void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index dacc738..2d51bbf 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -239,14 +239,7 @@ static void tce_iommu_unuse_page(struct tce_container *container,
{
struct page *page;

- if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
- return;
-
page = pfn_to_page(oldtce >> PAGE_SHIFT);
-
- if (oldtce & TCE_PCI_WRITE)
- SetPageDirty(page);
-
put_page(page);
}

@@ -255,10 +248,17 @@ static int tce_iommu_clear(struct tce_container *container,
unsigned long entry, unsigned long pages)
{
unsigned long oldtce;
+ long ret;
+ enum dma_data_direction direction;

for ( ; pages; --pages, ++entry) {
- oldtce = iommu_clear_tce(tbl, entry);
- if (!oldtce)
+ direction = DMA_NONE;
+ oldtce = 0;
+ ret = iommu_tce_xchg(tbl, entry, &oldtce, &direction);
+ if (ret)
+ continue;
+
+ if (direction == DMA_NONE)
continue;

tce_iommu_unuse_page(container, oldtce);
@@ -283,12 +283,13 @@ static int tce_iommu_use_page(unsigned long tce, unsigned long *hpa)

static long tce_iommu_build(struct tce_container *container,
struct iommu_table *tbl,
- unsigned long entry, unsigned long tce, unsigned long pages)
+ unsigned long entry, unsigned long tce, unsigned long pages,
+ enum dma_data_direction direction)
{
long i, ret = 0;
struct page *page;
unsigned long hpa;
- enum dma_data_direction direction = iommu_tce_direction(tce);
+ enum dma_data_direction dirtmp;

for (i = 0; i < pages; ++i) {
unsigned long offset = tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
@@ -304,8 +305,8 @@ static long tce_iommu_build(struct tce_container *container,
}

hpa |= offset;
- ret = iommu_tce_build(tbl, entry + i, (unsigned long) __va(hpa),
- direction);
+ dirtmp = direction;
+ ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp);
if (ret) {
tce_iommu_unuse_page(container, hpa);
pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
@@ -313,6 +314,10 @@ static long tce_iommu_build(struct tce_container *container,
tce, ret);
break;
}
+
+ if (dirtmp != DMA_NONE)
+ tce_iommu_unuse_page(container, hpa);
+
tce += IOMMU_PAGE_SIZE(tbl);
}

@@ -377,7 +382,7 @@ static long tce_iommu_ioctl(void *iommu_data,
case VFIO_IOMMU_MAP_DMA: {
struct vfio_iommu_type1_dma_map param;
struct iommu_table *tbl;
- unsigned long tce;
+ enum dma_data_direction direction;

if (!container->enabled)
return -EPERM;
@@ -398,24 +403,33 @@ static long tce_iommu_ioctl(void *iommu_data,
if (!tbl)
return -ENXIO;

- if ((param.size & ~IOMMU_PAGE_MASK(tbl)) ||
- (param.vaddr & ~IOMMU_PAGE_MASK(tbl)))
+ if (param.size & ~IOMMU_PAGE_MASK(tbl))
+ return -EINVAL;
+
+ if (param.vaddr & (TCE_PCI_READ | TCE_PCI_WRITE))
return -EINVAL;

/* iova is checked by the IOMMU API */
- tce = param.vaddr;
if (param.flags & VFIO_DMA_MAP_FLAG_READ)
- tce |= TCE_PCI_READ;
- if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
- tce |= TCE_PCI_WRITE;
+ if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
+ direction = DMA_BIDIRECTIONAL;
+ else
+ direction = DMA_TO_DEVICE;
+ else
+ if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
+ direction = DMA_FROM_DEVICE;
+ else
+ return -EINVAL;

- ret = iommu_tce_put_param_check(tbl, param.iova, tce);
+ ret = iommu_tce_put_param_check(tbl, param.iova, param.vaddr);
if (ret)
return ret;

ret = tce_iommu_build(container, tbl,
param.iova >> tbl->it_page_shift,
- tce, param.size >> tbl->it_page_shift);
+ param.vaddr,
+ param.size >> tbl->it_page_shift,
+ direction);

iommu_flush_tce(tbl);

--
2.0.0

2015-04-25 12:22:42

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH kernel v9 19/32] powerpc/powernv/ioda2: Rework iommu_table creation

This moves iommu_table creation to the beginning to make the following changes
easier to review. This starts using table parameters from the iommu_table
struct.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v9:
* updated commit log and did minor cleanup
---
arch/powerpc/platforms/powernv/pci-ioda.c | 33 +++++++++++++++----------------
1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index fb765af..a80be34 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2041,7 +2041,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
{
struct page *tce_mem = NULL;
void *addr;
- struct iommu_table *tbl;
+ struct iommu_table *tbl = &pe->table_group.tables[0];
unsigned int tce_table_size, end;
int64_t rc;

@@ -2068,13 +2068,26 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
addr = page_address(tce_mem);
memset(addr, 0, tce_table_size);

+ /* Setup iommu */
+ tbl->it_table_group = &pe->table_group;
+
+ /* Setup linux iommu table */
+ pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
+ IOMMU_PAGE_SHIFT_4K);
+
+ tbl->it_ops = &pnv_ioda2_iommu_ops;
+ iommu_init_table(tbl, phb->hose->node);
+#ifdef CONFIG_IOMMU_API
+ pe->table_group.ops = &pnv_pci_ioda2_ops;
+#endif
+
/*
* Map TCE table through TVT. The TVE index is the PE number
* shifted by 1 bit for 32-bits DMA space.
*/
rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
- pe->pe_number << 1, 1, __pa(addr),
- tce_table_size, 0x1000);
+ pe->pe_number << 1, 1, __pa(tbl->it_base),
+ tbl->it_size << 3, 1ULL << tbl->it_page_shift);
if (rc) {
pe_err(pe, "Failed to configure 32-bit TCE table,"
" err %ld\n", rc);
@@ -2083,24 +2096,10 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,

pnv_pci_ioda2_tvt_invalidate(pe);

- /* Setup iommu */
- pe->table_group.tables[0].it_table_group = &pe->table_group;
-
- /* Setup linux iommu table */
- tbl = &pe->table_group.tables[0];
- pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
- IOMMU_PAGE_SHIFT_4K);
-
/* OPAL variant of PHB3 invalidated TCEs */
if (pe->tce_inval_reg)
tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);

- tbl->it_ops = &pnv_ioda2_iommu_ops;
- iommu_init_table(tbl, phb->hose->node);
-#ifdef CONFIG_IOMMU_API
- pe->table_group.ops = &pnv_pci_ioda2_ops;
-#endif
-
if (pe->flags & PNV_IODA_PE_DEV) {
iommu_register_group(&pe->table_group, phb->hose->global_number,
pe->pe_number);
--
2.0.0

2015-04-25 12:20:22

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH kernel v9 20/32] powerpc/powernv/ioda2: Introduce pnv_pci_create_table/pnv_pci_free_table

This is a part of moving TCE table allocation into an iommu_ops
callback to support multiple IOMMU groups per VFIO container.

This moves the table creation code to the file with common powernv PCI
helpers as it does not do anything IODA2-specific.

This adds a pnv_pci_free_table() helper to release the actual TCE table.

This enforces the window size to be a power of two.

This should cause no behavioural change.
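
For a sense of the sizes involved (a worked example, not part of the patch):
each TCE is 8 bytes, so a flat table covering a power-of-two window holds
window_size >> page_shift entries.

/* Sketch: bytes needed for a flat TCE table covering a given window.
 * E.g. a 2GB window with 4K IOMMU pages needs (2G / 4K) * 8 = 4MB.
 */
static unsigned long tce_table_bytes(unsigned long window_size,
				     unsigned int page_shift)
{
	unsigned long entries = window_size >> page_shift;

	return entries * 8;
}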

Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
---
Changes:
v9:
* moved helpers to the common powernv pci.c file from pci-ioda.c
* moved bits from pnv_pci_create_table() to pnv_alloc_tce_table_pages()
---
arch/powerpc/platforms/powernv/pci-ioda.c | 36 ++++++------------
arch/powerpc/platforms/powernv/pci.c | 61 +++++++++++++++++++++++++++++++
arch/powerpc/platforms/powernv/pci.h | 4 ++
3 files changed, 76 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index a80be34..b9b3773 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1307,8 +1307,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
if (rc)
pe_warn(pe, "OPAL error %ld release DMA window\n", rc);

- iommu_reset_table(tbl, of_node_full_name(dev->dev.of_node));
- free_pages(addr, get_order(TCE32_TABLE_SIZE));
+ pnv_pci_free_table(tbl);
}

static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs)
@@ -2039,10 +2038,7 @@ static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
struct pnv_ioda_pe *pe)
{
- struct page *tce_mem = NULL;
- void *addr;
struct iommu_table *tbl = &pe->table_group.tables[0];
- unsigned int tce_table_size, end;
int64_t rc;

/* We shouldn't already have a 32-bit DMA associated */
@@ -2053,29 +2049,20 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,

/* The PE will reserve all possible 32-bits space */
pe->tce32_seg = 0;
- end = (1 << ilog2(phb->ioda.m32_pci_base));
- tce_table_size = (end / 0x1000) * 8;
pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
- end);
+ phb->ioda.m32_pci_base);

- /* Allocate TCE table */
- tce_mem = alloc_pages_node(phb->hose->node, GFP_KERNEL,
- get_order(tce_table_size));
- if (!tce_mem) {
- pe_err(pe, "Failed to allocate a 32-bit TCE memory\n");
- goto fail;
+ rc = pnv_pci_create_table(&pe->table_group, pe->phb->hose->node,
+ 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base, tbl);
+ if (rc) {
+ pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc);
+ return;
}
- addr = page_address(tce_mem);
- memset(addr, 0, tce_table_size);
-
- /* Setup iommu */
- tbl->it_table_group = &pe->table_group;
-
- /* Setup linux iommu table */
- pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
- IOMMU_PAGE_SHIFT_4K);

tbl->it_ops = &pnv_ioda2_iommu_ops;
+
+ /* Setup iommu */
+ tbl->it_table_group = &pe->table_group;
iommu_init_table(tbl, phb->hose->node);
#ifdef CONFIG_IOMMU_API
pe->table_group.ops = &pnv_pci_ioda2_ops;
@@ -2121,8 +2108,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
fail:
if (pe->tce32_seg >= 0)
pe->tce32_seg = -1;
- if (tce_mem)
- __free_pages(tce_mem, get_order(tce_table_size));
+ pnv_pci_free_table(tbl);
}

static void pnv_ioda_setup_dma(struct pnv_phb *phb)
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index e8802ac..6bcfad5 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -20,7 +20,9 @@
#include <linux/io.h>
#include <linux/msi.h>
#include <linux/iommu.h>
+#include <linux/memblock.h>

+#include <asm/mmzone.h>
#include <asm/sections.h>
#include <asm/io.h>
#include <asm/prom.h>
@@ -645,6 +647,65 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
tbl->it_type = TCE_PCI;
}

+static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift,
+ unsigned long *tce_table_allocated)
+{
+ struct page *tce_mem = NULL;
+ __be64 *addr;
+ unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT;
+ unsigned long local_allocated = 1UL << (order + PAGE_SHIFT);
+
+ tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
+ if (!tce_mem) {
+ pr_err("Failed to allocate a TCE memory, order=%d\n", order);
+ return NULL;
+ }
+ addr = page_address(tce_mem);
+ memset(addr, 0, local_allocated);
+ *tce_table_allocated = local_allocated;
+
+ return addr;
+}
+
+long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
+ __u64 bus_offset, __u32 page_shift, __u64 window_size,
+ struct iommu_table *tbl)
+{
+ void *addr;
+ unsigned long tce_table_allocated = 0;
+ const unsigned window_shift = ilog2(window_size);
+ unsigned entries_shift = window_shift - page_shift;
+ unsigned table_shift = entries_shift + 3;
+ const unsigned long tce_table_size = max(0x1000UL, 1UL << table_shift);
+
+ if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size))
+ return -EINVAL;
+
+ /* Allocate TCE table */
+ addr = pnv_alloc_tce_table_pages(nid, table_shift,
+ &tce_table_allocated);
+
+ /* Setup linux iommu table */
+ pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, bus_offset,
+ page_shift);
+
+ pr_info("Created TCE table: window size = %08llx, "
+ "tablesize = %lx (%lx), start @%08llx\n",
+ window_size, tce_table_size, tce_table_allocated,
+ bus_offset);
+
+ return 0;
+}
+
+void pnv_pci_free_table(struct iommu_table *tbl)
+{
+ if (!tbl->it_size)
+ return;
+
+ free_pages(tbl->it_base, get_order(tbl->it_size << 3));
+ iommu_reset_table(tbl, "pnv");
+}
+
static void pnv_pci_dma_dev_setup(struct pci_dev *pdev)
{
struct pci_controller *hose = pci_bus_to_host(pdev->bus);
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index b15cce5..e6cbbec 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -218,6 +218,10 @@ int pnv_pci_cfg_write(struct pci_dn *pdn,
extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
void *tce_mem, u64 tce_size,
u64 dma_offset, unsigned page_shift);
+extern long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
+ __u64 bus_offset, __u32 page_shift, __u64 window_size,
+ struct iommu_table *tbl);
+extern void pnv_pci_free_table(struct iommu_table *tbl);
extern void pnv_pci_init_p5ioc2_hub(struct device_node *np);
extern void pnv_pci_init_ioda_hub(struct device_node *np);
extern void pnv_pci_init_ioda2_phb(struct device_node *np);
--
2.0.0

2015-04-25 12:22:48

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH kernel v9 21/32] powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window

This is a part of moving DMA window programming to an iommu_ops
callback. pnv_pci_ioda2_set_window() takes an iommu_table_group as
its first parameter (not pnv_ioda_pe) as it is going to be used as
a callback for the VFIO DDW code.

This adds pnv_pci_ioda2_tvt_invalidate() to invalidate the TVT as it is
a good thing to do. It does not have an immediate effect now as the table
is never recreated after reboot, but it will in the following patches.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
---
Changes:
v9:
* initialize pe->table_group.tables[0] at the very end when
tbl is fully initialized
* moved pnv_pci_ioda2_tvt_invalidate() from earlier patch
---
arch/powerpc/platforms/powernv/pci-ioda.c | 67 +++++++++++++++++++++++--------
1 file changed, 51 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index b9b3773..59baa15 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1960,6 +1960,52 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
}

+static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
+ struct iommu_table *tbl)
+{
+ struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
+ table_group);
+ struct pnv_phb *phb = pe->phb;
+ int64_t rc;
+ const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
+ const __u64 win_size = tbl->it_size << tbl->it_page_shift;
+
+ pe_info(pe, "Setting up window at %llx..%llx "
+ "pgsize=0x%x tablesize=0x%lx\n",
+ start_addr, start_addr + win_size - 1,
+ 1UL << tbl->it_page_shift, tbl->it_size << 3);
+
+ tbl->it_table_group = &pe->table_group;
+
+ /*
+ * Map TCE table through TVT. The TVE index is the PE number
+ * shifted by 1 bit for 32-bits DMA space.
+ */
+ rc = opal_pci_map_pe_dma_window(phb->opal_id,
+ pe->pe_number,
+ pe->pe_number << 1,
+ 1,
+ __pa(tbl->it_base),
+ tbl->it_size << 3,
+ 1ULL << tbl->it_page_shift);
+ if (rc) {
+ pe_err(pe, "Failed to configure TCE table, err %ld\n", rc);
+ goto fail;
+ }
+
+ pnv_pci_ioda2_tvt_invalidate(pe);
+
+ /* Store fully initialized *tbl (may be external) in PE */
+ pe->table_group.tables[0] = *tbl;
+
+ return 0;
+fail:
+ if (pe->tce32_seg >= 0)
+ pe->tce32_seg = -1;
+
+ return rc;
+}
+
static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
{
uint16_t window_id = (pe->pe_number << 1 ) + 1;
@@ -2068,21 +2114,16 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
pe->table_group.ops = &pnv_pci_ioda2_ops;
#endif

- /*
- * Map TCE table through TVT. The TVE index is the PE number
- * shifted by 1 bit for 32-bits DMA space.
- */
- rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
- pe->pe_number << 1, 1, __pa(tbl->it_base),
- tbl->it_size << 3, 1ULL << tbl->it_page_shift);
+ rc = pnv_pci_ioda2_set_window(&pe->table_group, tbl);
if (rc) {
pe_err(pe, "Failed to configure 32-bit TCE table,"
" err %ld\n", rc);
- goto fail;
+ pnv_pci_free_table(tbl);
+ if (pe->tce32_seg >= 0)
+ pe->tce32_seg = -1;
+ return;
}

- pnv_pci_ioda2_tvt_invalidate(pe);
-
/* OPAL variant of PHB3 invalidated TCEs */
if (pe->tce_inval_reg)
tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
@@ -2103,12 +2144,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
/* Also create a bypass window */
if (!pnv_iommu_bypass_disabled)
pnv_pci_ioda2_setup_bypass_pe(phb, pe);
-
- return;
-fail:
- if (pe->tce32_seg >= 0)
- pe->tce32_seg = -1;
- pnv_pci_free_table(tbl);
}

static void pnv_ioda_setup_dma(struct pnv_phb *phb)
--
2.0.0

2015-04-25 12:17:32

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH kernel v9 22/32] powerpc/powernv: Implement multilevel TCE tables

TCE tables might get too big in case of 4K IOMMU pages and DDW enabled
on huge guests (hundreds of GB of RAM) so the kernel might be unable to
allocate a contiguous chunk of physical memory to store the TCE table.

To address this, the POWER8 CPU (actually, IODA2) supports multi-level TCE
tables, up to 5 levels, which split the table into a tree of smaller subtables.

This adds multi-level TCE table support to the pnv_pci_create_table()
and pnv_pci_free_table() helpers.
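
A worked example of the index split a multi-level walk performs (a sketch
of what pnv_tce() does in the diff, using division instead of the
shift/mask form for clarity): with it_level_size = 512 (one 4K page of
8-byte entries) and two levels, idx = 1000 resolves to entry 1 in the top
level and entry 488 in the bottom level, since 1000 = 1 * 512 + 488.

/* Sketch: split a flat TCE index across two levels. */
static void split_index_two_levels(unsigned long idx, unsigned long level_size,
				   unsigned long *top, unsigned long *bottom)
{
	*top = idx / level_size;
	*bottom = idx % level_size;
}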

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v9:
* moved from ioda2 to common powernv pci code
* fixed cleanup if allocation fails in the middle
* removed check for the size - all boundary checks happen in the calling code
anyway
---
arch/powerpc/include/asm/iommu.h | 2 +
arch/powerpc/platforms/powernv/pci-ioda.c | 15 +++--
arch/powerpc/platforms/powernv/pci.c | 94 +++++++++++++++++++++++++++++--
arch/powerpc/platforms/powernv/pci.h | 4 +-
4 files changed, 104 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 7e7ca0a..0f50ee2 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -96,6 +96,8 @@ struct iommu_pool {
struct iommu_table {
unsigned long it_busno; /* Bus number this table belongs to */
unsigned long it_size; /* Size of iommu table in entries */
+ unsigned long it_indirect_levels;
+ unsigned long it_level_size;
unsigned long it_offset; /* Offset into global table */
unsigned long it_base; /* mapped address of tce table */
unsigned long it_index; /* which iommu table this is */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 59baa15..cc1d09c 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1967,13 +1967,17 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
table_group);
struct pnv_phb *phb = pe->phb;
int64_t rc;
+ const unsigned long size = tbl->it_indirect_levels ?
+ tbl->it_level_size : tbl->it_size;
const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
const __u64 win_size = tbl->it_size << tbl->it_page_shift;

pe_info(pe, "Setting up window at %llx..%llx "
- "pgsize=0x%x tablesize=0x%lx\n",
+ "pgsize=0x%x tablesize=0x%lx "
+ "levels=%d levelsize=%x\n",
start_addr, start_addr + win_size - 1,
- 1UL << tbl->it_page_shift, tbl->it_size << 3);
+ 1UL << tbl->it_page_shift, tbl->it_size << 3,
+ tbl->it_indirect_levels + 1, tbl->it_level_size << 3);

tbl->it_table_group = &pe->table_group;

@@ -1984,9 +1988,9 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
rc = opal_pci_map_pe_dma_window(phb->opal_id,
pe->pe_number,
pe->pe_number << 1,
- 1,
+ tbl->it_indirect_levels + 1,
__pa(tbl->it_base),
- tbl->it_size << 3,
+ size << 3,
1ULL << tbl->it_page_shift);
if (rc) {
pe_err(pe, "Failed to configure TCE table, err %ld\n", rc);
@@ -2099,7 +2103,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
phb->ioda.m32_pci_base);

rc = pnv_pci_create_table(&pe->table_group, pe->phb->hose->node,
- 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base, tbl);
+ 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base,
+ POWERNV_IOMMU_DEFAULT_LEVELS, tbl);
if (rc) {
pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc);
return;
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 6bcfad5..fc129c4 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -46,6 +46,8 @@
#define cfg_dbg(fmt...) do { } while(0)
//#define cfg_dbg(fmt...) printk(fmt)

+#define ROUND_UP(x, n) (((x) + (n) - 1ULL) & ~((n) - 1ULL))
+
#ifdef CONFIG_PCI_MSI
static int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
{
@@ -577,6 +579,19 @@ struct pci_ops pnv_pci_ops = {
static __be64 *pnv_tce(struct iommu_table *tbl, long idx)
{
__be64 *tmp = ((__be64 *)tbl->it_base);
+ int level = tbl->it_indirect_levels;
+ const long shift = ilog2(tbl->it_level_size);
+ unsigned long mask = (tbl->it_level_size - 1) << (level * shift);
+
+ while (level) {
+ int n = (idx & mask) >> (level * shift);
+ unsigned long tce = be64_to_cpu(tmp[n]);
+
+ tmp = __va(tce & ~(TCE_PCI_READ | TCE_PCI_WRITE));
+ idx &= ~mask;
+ mask >>= shift;
+ --level;
+ }

return tmp + idx;
}
@@ -648,12 +663,18 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
}

static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift,
+ unsigned levels, unsigned long limit,
unsigned long *tce_table_allocated)
{
struct page *tce_mem = NULL;
- __be64 *addr;
+ __be64 *addr, *tmp;
unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT;
unsigned long local_allocated = 1UL << (order + PAGE_SHIFT);
+ unsigned entries = 1UL << (shift - 3);
+ long i;
+
+ if (limit == *tce_table_allocated)
+ return NULL;

tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
if (!tce_mem) {
@@ -662,14 +683,33 @@ static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift,
}
addr = page_address(tce_mem);
memset(addr, 0, local_allocated);
- *tce_table_allocated = local_allocated;
+
+ --levels;
+ if (!levels) {
+ /* Update tce_table_allocated with bottom level table size only */
+ *tce_table_allocated += local_allocated;
+ return addr;
+ }
+
+ for (i = 0; i < entries; ++i) {
+ tmp = pnv_alloc_tce_table_pages(nid, shift, levels, limit,
+ tce_table_allocated);
+ if (!tmp)
+ break;
+
+ addr[i] = cpu_to_be64(__pa(tmp) |
+ TCE_PCI_READ | TCE_PCI_WRITE);
+ }

return addr;
}

+static void pnv_free_tce_table_pages(unsigned long addr, unsigned long size,
+ unsigned level);
+
long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
__u64 bus_offset, __u32 page_shift, __u64 window_size,
- struct iommu_table *tbl)
+ __u32 levels, struct iommu_table *tbl)
{
void *addr;
unsigned long tce_table_allocated = 0;
@@ -678,16 +718,34 @@ long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
unsigned table_shift = entries_shift + 3;
const unsigned long tce_table_size = max(0x1000UL, 1UL << table_shift);

+ if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS))
+ return -EINVAL;
+
if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size))
return -EINVAL;

+ /* Adjust direct table size from window_size and levels */
+ entries_shift = ROUND_UP(entries_shift, levels) / levels;
+ table_shift = entries_shift + 3;
+ table_shift = max_t(unsigned, table_shift, PAGE_SHIFT);
+
/* Allocate TCE table */
addr = pnv_alloc_tce_table_pages(nid, table_shift,
- &tce_table_allocated);
+ levels, tce_table_size, &tce_table_allocated);
+ if (!addr)
+ return -ENOMEM;
+
+ if (tce_table_size != tce_table_allocated) {
+ pnv_free_tce_table_pages((unsigned long) addr,
+ tbl->it_level_size, tbl->it_indirect_levels);
+ return -ENOMEM;
+ }

/* Setup linux iommu table */
pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, bus_offset,
page_shift);
+ tbl->it_level_size = 1ULL << (table_shift - 3);
+ tbl->it_indirect_levels = levels - 1;

pr_info("Created TCE table: window size = %08llx, "
"tablesize = %lx (%lx), start @%08llx\n",
@@ -697,12 +755,38 @@ long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
return 0;
}

+static void pnv_free_tce_table_pages(unsigned long addr, unsigned long size,
+ unsigned level)
+{
+ addr &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+ if (level) {
+ long i;
+ u64 *tmp = (u64 *) addr;
+
+ for (i = 0; i < size; ++i) {
+ unsigned long hpa = be64_to_cpu(tmp[i]);
+
+ if (!(hpa & (TCE_PCI_READ | TCE_PCI_WRITE)))
+ continue;
+
+ pnv_free_tce_table_pages((unsigned long) __va(hpa),
+ size, level - 1);
+ }
+ }
+
+ free_pages(addr, get_order(size << 3));
+}
+
void pnv_pci_free_table(struct iommu_table *tbl)
{
+ const unsigned long size = tbl->it_indirect_levels ?
+ tbl->it_level_size : tbl->it_size;
+
if (!tbl->it_size)
return;

- free_pages(tbl->it_base, get_order(tbl->it_size << 3));
+ pnv_free_tce_table_pages(tbl->it_base, size, tbl->it_indirect_levels);
iommu_reset_table(tbl, "pnv");
}

diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index e6cbbec..3d1ff584 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -218,9 +218,11 @@ int pnv_pci_cfg_write(struct pci_dn *pdn,
extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
void *tce_mem, u64 tce_size,
u64 dma_offset, unsigned page_shift);
+#define POWERNV_IOMMU_DEFAULT_LEVELS 1
+#define POWERNV_IOMMU_MAX_LEVELS 5
extern long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
__u64 bus_offset, __u32 page_shift, __u64 window_size,
- struct iommu_table *tbl);
+ __u32 levels, struct iommu_table *tbl);
extern void pnv_pci_free_table(struct iommu_table *tbl);
extern void pnv_pci_init_p5ioc2_hub(struct device_node *np);
extern void pnv_pci_init_ioda_hub(struct device_node *np);
--
2.0.0

2015-04-25 12:17:11

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH kernel v9 23/32] powerpc/powernv/ioda: Define and implement DMA table/window management callbacks

This extends iommu_table_group_ops by a set of callbacks to support
dynamic DMA windows management.

create_table() creates a TCE table with specific parameters.
It receives an iommu_table_group to know the node ID in order to allocate
the TCE table memory closer to the PHB. The exact format of the allocated
multi-level table might also be specific to the PHB model (not
the case now though).
This callback calculates the DMA window offset on a PCI bus from @num
and stores it in the just created table.

set_window() sets the window at the specified TVT index + @num on the PHB.

unset_window() unsets the window from specified TVT.

This adds a free() callback to iommu_table_ops to free the memory
(potentially a tree of tables) allocated for the TCE table.

create_table() and free() are supposed to be called once per
VFIO container and set_window()/unset_window() are supposed to be
called for every group in a container.

This adds IOMMU capabilities to iommu_table_group such as the default
32bit window parameters and others.
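
A hedged sketch of the calling sequence these callbacks are designed for
(the actual VFIO code arrives later in the series; the wrapper name is
made up for illustration): create the table once per container, program it
into a group's TVT, and free it if programming fails.

/* Sketch: add DMA window @num to a group using the new callbacks. */
static long example_add_window(struct iommu_table_group *grp, int num,
		__u32 page_shift, __u64 window_size, __u32 levels,
		struct iommu_table *tbl)
{
	long ret;

	ret = grp->ops->create_table(grp, num, page_shift, window_size,
			levels, tbl);
	if (ret)
		return ret;

	ret = grp->ops->set_window(grp, num, tbl);
	if (ret)
		tbl->it_ops->free(tbl);	/* undo the allocation on failure */

	return ret;
}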

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/iommu.h | 19 ++++++++
arch/powerpc/platforms/powernv/pci-ioda.c | 75 ++++++++++++++++++++++++++---
arch/powerpc/platforms/powernv/pci-p5ioc2.c | 12 +++--
3 files changed, 96 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 0f50ee2..7694546 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -70,6 +70,7 @@ struct iommu_table_ops {
/* get() returns a physical address */
unsigned long (*get)(struct iommu_table *tbl, long index);
void (*flush)(struct iommu_table *tbl);
+ void (*free)(struct iommu_table *tbl);
};

/* These are used by VIO */
@@ -148,6 +149,17 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
struct iommu_table_group;

struct iommu_table_group_ops {
+ long (*create_table)(struct iommu_table_group *table_group,
+ int num,
+ __u32 page_shift,
+ __u64 window_size,
+ __u32 levels,
+ struct iommu_table *tbl);
+ long (*set_window)(struct iommu_table_group *table_group,
+ int num,
+ struct iommu_table *tblnew);
+ long (*unset_window)(struct iommu_table_group *table_group,
+ int num);
/*
* Switches ownership from the kernel itself to an external
* user. While onwership is taken, the kernel cannot use IOMMU itself.
@@ -160,6 +172,13 @@ struct iommu_table_group {
#ifdef CONFIG_IOMMU_API
struct iommu_group *group;
#endif
+ /* Some key properties of IOMMU */
+ __u32 tce32_start;
+ __u32 tce32_size;
+ __u64 pgsizes; /* Bitmap of supported page sizes */
+ __u32 max_dynamic_windows_supported;
+ __u32 max_levels;
+
struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
struct iommu_table_group_ops *ops;
};
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index cc1d09c..4828837 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -24,6 +24,7 @@
#include <linux/msi.h>
#include <linux/memblock.h>
#include <linux/iommu.h>
+#include <linux/sizes.h>

#include <asm/sections.h>
#include <asm/io.h>
@@ -1846,6 +1847,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
#endif
.clear = pnv_ioda2_tce_free,
.get = pnv_tce_get,
+ .free = pnv_pci_free_table,
};

static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
@@ -1936,6 +1938,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
TCE_PCI_SWINV_PAIR);

tbl->it_ops = &pnv_ioda1_iommu_ops;
+ pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift;
+ pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift;
iommu_init_table(tbl, phb->hose->node);

if (pe->flags & PNV_IODA_PE_DEV) {
@@ -1961,7 +1965,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
}

static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
- struct iommu_table *tbl)
+ int num, struct iommu_table *tbl)
{
struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
table_group);
@@ -1972,9 +1976,10 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
const __u64 win_size = tbl->it_size << tbl->it_page_shift;

- pe_info(pe, "Setting up window at %llx..%llx "
+ pe_info(pe, "Setting up window#%d at %llx..%llx "
"pgsize=0x%x tablesize=0x%lx "
"levels=%d levelsize=%x\n",
+ num,
start_addr, start_addr + win_size - 1,
1UL << tbl->it_page_shift, tbl->it_size << 3,
tbl->it_indirect_levels + 1, tbl->it_level_size << 3);
@@ -1987,7 +1992,7 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
*/
rc = opal_pci_map_pe_dma_window(phb->opal_id,
pe->pe_number,
- pe->pe_number << 1,
+ (pe->pe_number << 1) + num,
tbl->it_indirect_levels + 1,
__pa(tbl->it_base),
size << 3,
@@ -2000,7 +2005,7 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
pnv_pci_ioda2_tvt_invalidate(pe);

/* Store fully initialized *tbl (may be external) in PE */
- pe->table_group.tables[0] = *tbl;
+ pe->table_group.tables[num] = *tbl;

return 0;
fail:
@@ -2061,6 +2066,53 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
}

#ifdef CONFIG_IOMMU_API
+static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
+ int num, __u32 page_shift, __u64 window_size, __u32 levels,
+ struct iommu_table *tbl)
+{
+ struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
+ table_group);
+ int nid = pe->phb->hose->node;
+ __u64 bus_offset = num ? pe->tce_bypass_base : 0;
+ long ret;
+
+ ret = pnv_pci_create_table(table_group, nid, bus_offset, page_shift,
+ window_size, levels, tbl);
+ if (ret)
+ return ret;
+
+ tbl->it_ops = &pnv_ioda2_iommu_ops;
+ if (pe->tce_inval_reg)
+ tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
+
+ return 0;
+}
+
+static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
+ int num)
+{
+ struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
+ table_group);
+ struct pnv_phb *phb = pe->phb;
+ struct iommu_table *tbl = &pe->table_group.tables[num];
+ long ret;
+
+ pe_info(pe, "Removing DMA window #%d\n", num);
+
+ ret = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
+ (pe->pe_number << 1) + num,
+ 0/* levels */, 0/* table address */,
+ 0/* table size */, 0/* page size */);
+ if (ret)
+ pe_warn(pe, "Unmapping failed, ret = %ld\n", ret);
+ else
+ pnv_pci_ioda2_tvt_invalidate(pe);
+
+ memset(tbl, 0, sizeof(*tbl));
+
+ return ret;
+}
+
static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
{
struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
@@ -2080,6 +2132,9 @@ static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
}

static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
+ .create_table = pnv_pci_ioda2_create_table,
+ .set_window = pnv_pci_ioda2_set_window,
+ .unset_window = pnv_pci_ioda2_unset_window,
.take_ownership = pnv_ioda2_take_ownership,
.release_ownership = pnv_ioda2_release_ownership,
};
@@ -2102,8 +2157,16 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
phb->ioda.m32_pci_base);

+ pe->table_group.tce32_start = 0;
+ pe->table_group.tce32_size = phb->ioda.m32_pci_base;
+ pe->table_group.max_dynamic_windows_supported =
+ IOMMU_TABLE_GROUP_MAX_TABLES;
+ pe->table_group.max_levels = POWERNV_IOMMU_MAX_LEVELS;
+ pe->table_group.pgsizes = SZ_4K | SZ_64K | SZ_16M;
+
rc = pnv_pci_create_table(&pe->table_group, pe->phb->hose->node,
- 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base,
+ pe->table_group.tce32_start, IOMMU_PAGE_SHIFT_4K,
+ pe->table_group.tce32_size,
POWERNV_IOMMU_DEFAULT_LEVELS, tbl);
if (rc) {
pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc);
@@ -2119,7 +2182,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
pe->table_group.ops = &pnv_pci_ioda2_ops;
#endif

- rc = pnv_pci_ioda2_set_window(&pe->table_group, tbl);
+ rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
if (rc) {
pe_err(pe, "Failed to configure 32-bit TCE table,"
" err %ld\n", rc);
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index 7a6fd92..d9de4c7 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -116,6 +116,8 @@ static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
u64 phb_id;
int64_t rc;
static int primary = 1;
+ struct iommu_table_group *table_group;
+ struct iommu_table *tbl;

pr_info(" Initializing p5ioc2 PHB %s\n", np->full_name);

@@ -181,14 +183,16 @@ static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
pnv_pci_init_p5ioc2_msis(phb);

/* Setup iommu */
- phb->p5ioc2.table_group.tables[0].it_table_group =
- &phb->p5ioc2.table_group;
+ table_group = &phb->p5ioc2.table_group;
+ tbl = &phb->p5ioc2.table_group.tables[0];
+ tbl->it_table_group = table_group;

/* Setup TCEs */
phb->dma_dev_setup = pnv_pci_p5ioc2_dma_dev_setup;
- pnv_pci_setup_iommu_table(&phb->p5ioc2.table_group.tables[0],
- tce_mem, tce_size, 0,
+ pnv_pci_setup_iommu_table(tbl, tce_mem, tce_size, 0,
IOMMU_PAGE_SHIFT_4K);
+ table_group->tce32_start = tbl->it_offset << tbl->it_page_shift;
+ table_group->tce32_size = tbl->it_size << tbl->it_page_shift;
}

void __init pnv_pci_init_p5ioc2_hub(struct device_node *np)
--
2.0.0

2015-04-25 12:17:23

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH kernel v9 24/32] powerpc/powernv/ioda2: Use new helpers to do proper cleanup on PE release

The existing code programmed TVT#0 with some address and then
immediately released that memory.

This makes use of pnv_pci_ioda2_unset_window() and
pnv_pci_ioda2_set_bypass() which do correct resource release and
TVT update.
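
For reference, the release path now boils down to the following sequence
(a condensed sketch of the new code below; error handling omitted):

    /* clear TVE#0 (the default 32-bit window) and invalidate the TVT cache */
    pnv_pci_ioda2_unset_window(&pe->table_group, 0);
    /* clear TVE#1 (the 64-bit bypass window) */
    pnv_pci_ioda2_set_bypass(pe, false);
    /* finally release the TCE table pages themselves */
    pnv_pci_free_table(&pe->table_group.tables[0]);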

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/platforms/powernv/pci-ioda.c | 33 ++++++++++---------------------
1 file changed, 10 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 4828837..2a4b2b2 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1281,34 +1281,21 @@ m64_failed:
return -EBUSY;
}

+static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
+ int num);
+static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
+
static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe *pe)
{
- struct pci_bus *bus;
- struct pci_controller *hose;
- struct pnv_phb *phb;
- struct iommu_table *tbl;
- unsigned long addr;
- int64_t rc;
+ long rc;

- bus = dev->bus;
- hose = pci_bus_to_host(bus);
- phb = hose->private_data;
- tbl = &pe->table_group.tables[0];
- addr = tbl->it_base;
-
- opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
- pe->pe_number << 1, 1, __pa(addr),
- 0, 0x1000);
-
- rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id,
- pe->pe_number,
- (pe->pe_number << 1) + 1,
- pe->tce_bypass_base,
- 0);
+ rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0);
if (rc)
- pe_warn(pe, "OPAL error %ld release DMA window\n", rc);
+ pe_warn(pe, "OPAL error %ld release default DMA window\n", rc);

- pnv_pci_free_table(tbl);
+ pnv_pci_ioda2_set_bypass(pe, false);
+
+ pnv_pci_free_table(&pe->table_group.tables[0]);
}

static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs)
--
2.0.0

2015-04-25 12:17:27

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH kernel v9 25/32] vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework ownership

Previously the IOMMU user (VFIO) would take control over the IOMMU table
belonging to a specific IOMMU group. This approach did not allow sharing
tables between IOMMU groups attached to the same container.

This introduces a new IOMMU ownership flavour where the user does not
just control the existing IOMMU table but can remove/create tables on demand.
If an IOMMU implements take/release_ownership() callbacks, this lets
the user have full control over the IOMMU group. When the ownership is taken,
the platform code removes all the windows so the caller must create them.
Before returning the ownership back to the platform code, VFIO
unprograms and removes all the tables it created.

This changes IODA2's ownership handlers to remove the existing table
rather than manipulate the existing one. From now on,
iommu_take_ownership() and iommu_release_ownership() are only called
from the vfio_iommu_spapr_tce driver.

In tce_iommu_detach_group(), this copies an iommu_table descriptor to the stack
as IODA2's unset_window() clears the descriptor embedded in the PE
and we would not be able to free the table afterwards.
This is a transitional hack and following patches will replace this code
anyway.

Old-style ownership is still supported allowing VFIO to run on older
P5IOC2 and IODA IO controllers.

No change in userspace-visible behaviour is expected. Since it recreates
TCE tables on each ownership change, related kernel traces will appear
more often.
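
For reference, the resulting ownership flow on IODA2 looks roughly like this
(a simplified sketch of the call sequence, not literal kernel code; error
handling omitted):

    /* VFIO takes the group: the platform drops the default window and bypass */
    table_group->ops->take_ownership(table_group);

    /* VFIO creates and programs whatever windows it needs */
    ret = table_group->ops->create_table(table_group, 0 /* window# */,
            IOMMU_PAGE_SHIFT_4K, table_group->tce32_size,
            1 /* levels */, &table_group->tables[0]);
    if (!ret)
        ret = table_group->ops->set_window(table_group, 0,
                &table_group->tables[0]);

    /* ... on detach, VFIO unprograms and frees the windows it created ... */
    table_group->ops->unset_window(table_group, 0);
    if (table_group->tables[0].it_ops->free)
        table_group->tables[0].it_ops->free(&table_group->tables[0]);

    /* the platform then recreates the default window and reenables bypass */
    table_group->ops->release_ownership(table_group);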

Signed-off-by: Alexey Kardashevskiy <[email protected]>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <[email protected]>
---
Changes:
v9:
* fixed crash in tce_iommu_detach_group() on tbl->it_ops->free as
tce_iommu_attach_group() used to initialize the table from a descriptor
on the stack (it does not matter for the series as this bit is changed later anyway
but it ruins bisectability)

v6:
* fixed commit log that VFIO removes tables before passing ownership
back to the platform code, not userspace

---
arch/powerpc/platforms/powernv/pci-ioda.c | 27 +++++++++++++++++++++++--
drivers/vfio/vfio_iommu_spapr_tce.c | 33 +++++++++++++++++++++++++++++--
2 files changed, 56 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 2a4b2b2..45bc131 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2105,16 +2105,39 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
table_group);

- iommu_take_ownership(&table_group->tables[0]);
pnv_pci_ioda2_set_bypass(pe, false);
+ pnv_pci_ioda2_unset_window(&pe->table_group, 0);
+ pnv_pci_free_table(&pe->table_group.tables[0]);
}

static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
{
struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
table_group);
+ struct iommu_table *tbl = &pe->table_group.tables[0];
+ int64_t rc;
+
+ rc = pnv_pci_ioda2_create_table(&pe->table_group, 0,
+ IOMMU_PAGE_SHIFT_4K,
+ pe->phb->ioda.m32_pci_base,
+ POWERNV_IOMMU_DEFAULT_LEVELS, tbl);
+ if (rc) {
+ pe_err(pe, "Failed to create 32-bit TCE table, err %ld",
+ rc);
+ return;
+ }
+
+ tbl->it_table_group = &pe->table_group;
+ iommu_init_table(tbl, pe->phb->hose->node);
+
+ rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
+ if (rc) {
+ pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
+ rc);
+ pnv_pci_free_table(tbl);
+ return;
+ }

- iommu_release_ownership(&table_group->tables[0]);
pnv_pci_ioda2_set_bypass(pe, true);
}

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 2d51bbf..892a584 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -569,6 +569,10 @@ static int tce_iommu_attach_group(void *iommu_data,
if (!table_group->ops || !table_group->ops->take_ownership ||
!table_group->ops->release_ownership) {
ret = tce_iommu_take_ownership(table_group);
+ } else if (!table_group->ops->create_table ||
+ !table_group->ops->set_window) {
+ WARN_ON_ONCE(1);
+ ret = -EFAULT;
} else {
/*
* Disable iommu bypass, otherwise the user can DMA to all of
@@ -576,7 +580,15 @@ static int tce_iommu_attach_group(void *iommu_data,
* the pages that has been explicitly mapped into the iommu
*/
table_group->ops->take_ownership(table_group);
- ret = 0;
+ ret = table_group->ops->create_table(table_group,
+ 0, /* window number */
+ IOMMU_PAGE_SHIFT_4K,
+ table_group->tce32_size,
+ 1, /* default levels */
+ &table_group->tables[0]);
+ if (!ret)
+ ret = table_group->ops->set_window(table_group, 0,
+ &table_group->tables[0]);
}

if (ret)
@@ -595,6 +607,7 @@ static void tce_iommu_detach_group(void *iommu_data,
{
struct tce_container *container = iommu_data;
struct iommu_table_group *table_group;
+ long i;

mutex_lock(&container->lock);
if (iommu_group != container->grp) {
@@ -620,8 +633,24 @@ static void tce_iommu_detach_group(void *iommu_data,
/* Kernel owns the device now, we can restore bypass */
if (!table_group->ops || !table_group->ops->release_ownership)
tce_iommu_release_ownership(container, table_group);
- else
+ else if (!table_group->ops->unset_window)
+ WARN_ON_ONCE(1);
+ else {
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+ struct iommu_table tbl = table_group->tables[i];
+
+ if (!tbl.it_size)
+ continue;
+
+ table_group->ops->unset_window(table_group, i);
+ tce_iommu_clear(container, &tbl,
+ tbl.it_offset, tbl.it_size);
+ if (tbl.it_ops->free)
+ tbl.it_ops->free(&tbl);
+ }
+
table_group->ops->release_ownership(table_group);
+ }

unlock_exit:
mutex_unlock(&container->lock);
--
2.0.0

2015-04-25 12:17:19

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table

In order to support memory pre-registration, we need a way to track
the use of every registered memory region and only allow unregistration
if a region is not in use anymore. So we need a way to tell which
region a just-cleared TCE came from.

This adds a userspace view of the TCE table into the iommu_table struct.
It contains one userspace address per TCE entry. The table is only
allocated when ownership over an IOMMU group is taken, which means
it is only used from outside of the powernv code (such as VFIO).
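
The intended use (a sketch based on how VFIO consumes this later in the series;
mm_iommu_lookup()/mm_iommu_mapped_update() come from the pre-registration
patch) is to remember the userspace address behind every mapped TCE and to
look it up again when the TCE is cleared:

    /* on map: remember which userspace address backs this entry */
    unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
    if (pua)
        *pua = tce;

    /* on clear: find the registered region and drop its mapped counter */
    pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
    if (pua && *pua) {
        struct mm_iommu_table_group_mem_t *mem =
                mm_iommu_lookup(*pua, IOMMU_PAGE_SIZE(tbl));
        if (mem)
            mm_iommu_mapped_update(mem, false);
        *pua = 0;
    }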

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v9:
* fixed code flow in error cases added in v8

v8:
* added ENOMEM on failed vzalloc()
---
arch/powerpc/include/asm/iommu.h | 6 ++++++
arch/powerpc/kernel/iommu.c | 18 ++++++++++++++++++
arch/powerpc/platforms/powernv/pci-ioda.c | 22 ++++++++++++++++++++--
3 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 7694546..1472de3 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -111,9 +111,15 @@ struct iommu_table {
unsigned long *it_map; /* A simple allocation bitmap for now */
unsigned long it_page_shift;/* table iommu page size */
struct iommu_table_group *it_table_group;
+ unsigned long *it_userspace; /* userspace view of the table */
struct iommu_table_ops *it_ops;
};

+#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
+ ((tbl)->it_userspace ? \
+ &((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \
+ NULL)
+
/* Pure 2^n version of get_order */
static inline __attribute_const__
int get_iommu_order(unsigned long size, struct iommu_table *tbl)
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 2eaba0c..74a3f52 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -38,6 +38,7 @@
#include <linux/pci.h>
#include <linux/iommu.h>
#include <linux/sched.h>
+#include <linux/vmalloc.h>
#include <asm/io.h>
#include <asm/prom.h>
#include <asm/iommu.h>
@@ -739,6 +740,8 @@ void iommu_reset_table(struct iommu_table *tbl, const char *node_name)
free_pages((unsigned long) tbl->it_map, order);
}

+ WARN_ON(tbl->it_userspace);
+
memset(tbl, 0, sizeof(*tbl));
}

@@ -1016,6 +1019,7 @@ int iommu_take_ownership(struct iommu_table *tbl)
{
unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
int ret = 0;
+ unsigned long *uas;

/*
* VFIO does not control TCE entries allocation and the guest
@@ -1027,6 +1031,10 @@ int iommu_take_ownership(struct iommu_table *tbl)
if (!tbl->it_ops->exchange)
return -EINVAL;

+ uas = vzalloc(sizeof(*uas) * tbl->it_size);
+ if (!uas)
+ return -ENOMEM;
+
spin_lock_irqsave(&tbl->large_pool.lock, flags);
for (i = 0; i < tbl->nr_pools; i++)
spin_lock(&tbl->pools[i].lock);
@@ -1044,6 +1052,13 @@ int iommu_take_ownership(struct iommu_table *tbl)
memset(tbl->it_map, 0xff, sz);
}

+ if (ret) {
+ vfree(uas);
+ } else {
+ BUG_ON(tbl->it_userspace);
+ tbl->it_userspace = uas;
+ }
+
for (i = 0; i < tbl->nr_pools; i++)
spin_unlock(&tbl->pools[i].lock);
spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
@@ -1056,6 +1071,9 @@ void iommu_release_ownership(struct iommu_table *tbl)
{
unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;

+ vfree(tbl->it_userspace);
+ tbl->it_userspace = NULL;
+
spin_lock_irqsave(&tbl->large_pool.lock, flags);
for (i = 0; i < tbl->nr_pools; i++)
spin_lock(&tbl->pools[i].lock);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 45bc131..e0be556 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -25,6 +25,7 @@
#include <linux/memblock.h>
#include <linux/iommu.h>
#include <linux/sizes.h>
+#include <linux/vmalloc.h>

#include <asm/sections.h>
#include <asm/io.h>
@@ -1827,6 +1828,14 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
}

+void pnv_pci_ioda2_free_table(struct iommu_table *tbl)
+{
+ vfree(tbl->it_userspace);
+ tbl->it_userspace = NULL;
+
+ pnv_pci_free_table(tbl);
+}
+
static struct iommu_table_ops pnv_ioda2_iommu_ops = {
.set = pnv_ioda2_tce_build,
#ifdef CONFIG_IOMMU_API
@@ -1834,7 +1843,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
#endif
.clear = pnv_ioda2_tce_free,
.get = pnv_tce_get,
- .free = pnv_pci_free_table,
+ .free = pnv_pci_ioda2_free_table,
};

static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
@@ -2062,12 +2071,21 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
int nid = pe->phb->hose->node;
__u64 bus_offset = num ? pe->tce_bypass_base : 0;
long ret;
+ unsigned long *uas, uas_cb = sizeof(*uas) * (window_size >> page_shift);
+
+ uas = vzalloc(uas_cb);
+ if (!uas)
+ return -ENOMEM;

ret = pnv_pci_create_table(table_group, nid, bus_offset, page_shift,
window_size, levels, tbl);
- if (ret)
+ if (ret) {
+ vfree(uas);
return ret;
+ }

+ BUG_ON(tbl->it_userspace);
+ tbl->it_userspace = uas;
tbl->it_ops = &pnv_ioda2_iommu_ops;
if (pe->tce_inval_reg)
tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
--
2.0.0

2015-04-25 12:17:00

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

This adds a way for the IOMMU user to know how much memory a new table will
use so it can be accounted against the locked_vm limit before the allocation
happens.

This stores the allocated table size in pnv_pci_create_table()
so the locked_vm counter can be updated correctly when a table is
being disposed.

This defines an iommu_table_group_ops callback to let VFIO know
how much memory will be locked if a table is created.
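
The expected calling sequence on the VFIO side is roughly (a sketch mirroring
the helper added later in this series):

    table_size = table_group->ops->get_table_size(page_shift,
            window_size, levels);
    if (!table_size)
        return -EINVAL;     /* unsupported window geometry */

    ret = try_increment_locked_vm(table_size >> PAGE_SHIFT);
    if (ret)
        return ret;

    ret = table_group->ops->create_table(table_group, num, page_shift,
            window_size, levels, tbl);
    if (ret)
        decrement_locked_vm(table_size >> PAGE_SHIFT);
    /* on success, tbl->it_allocated_size is expected to equal table_size */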

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v9:
* reimplemented the whole patch
---
arch/powerpc/include/asm/iommu.h | 5 +++++
arch/powerpc/platforms/powernv/pci-ioda.c | 14 ++++++++++++
arch/powerpc/platforms/powernv/pci.c | 36 +++++++++++++++++++++++++++++++
arch/powerpc/platforms/powernv/pci.h | 2 ++
4 files changed, 57 insertions(+)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 1472de3..9844c106 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -99,6 +99,7 @@ struct iommu_table {
unsigned long it_size; /* Size of iommu table in entries */
unsigned long it_indirect_levels;
unsigned long it_level_size;
+ unsigned long it_allocated_size;
unsigned long it_offset; /* Offset into global table */
unsigned long it_base; /* mapped address of tce table */
unsigned long it_index; /* which iommu table this is */
@@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
struct iommu_table_group;

struct iommu_table_group_ops {
+ unsigned long (*get_table_size)(
+ __u32 page_shift,
+ __u64 window_size,
+ __u32 levels);
long (*create_table)(struct iommu_table_group *table_group,
int num,
__u32 page_shift,
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index e0be556..7f548b4 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
}

#ifdef CONFIG_IOMMU_API
+static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
+ __u64 window_size, __u32 levels)
+{
+ unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
+
+ if (!ret)
+ return ret;
+
+ /* Add size of it_userspace */
+ return ret + (window_size >> page_shift) * sizeof(unsigned long);
+}
+
static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
int num, __u32 page_shift, __u64 window_size, __u32 levels,
struct iommu_table *tbl)
@@ -2086,6 +2098,7 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,

BUG_ON(tbl->it_userspace);
tbl->it_userspace = uas;
+ tbl->it_allocated_size += uas_cb;
tbl->it_ops = &pnv_ioda2_iommu_ops;
if (pe->tce_inval_reg)
tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
@@ -2160,6 +2173,7 @@ static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
}

static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
+ .get_table_size = pnv_pci_ioda2_get_table_size,
.create_table = pnv_pci_ioda2_create_table,
.set_window = pnv_pci_ioda2_set_window,
.unset_window = pnv_pci_ioda2_unset_window,
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index fc129c4..1b5b48a 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -662,6 +662,38 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
tbl->it_type = TCE_PCI;
}

+unsigned long pnv_get_table_size(__u32 page_shift,
+ __u64 window_size, __u32 levels)
+{
+ unsigned long bytes = 0;
+ const unsigned window_shift = ilog2(window_size);
+ unsigned entries_shift = window_shift - page_shift;
+ unsigned table_shift = entries_shift + 3;
+ unsigned long tce_table_size = max(0x1000UL, 1UL << table_shift);
+ unsigned long direct_table_size;
+
+ if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS) ||
+ (window_size > memory_hotplug_max()) ||
+ !is_power_of_2(window_size))
+ return 0;
+
+ /* Calculate a direct table size from window_size and levels */
+ entries_shift = ROUND_UP(entries_shift, levels) / levels;
+ table_shift = entries_shift + 3;
+ table_shift = max_t(unsigned, table_shift, PAGE_SHIFT);
+ direct_table_size = 1UL << table_shift;
+
+ for ( ; levels; --levels) {
+ bytes += ROUND_UP(tce_table_size, direct_table_size);
+
+ tce_table_size /= direct_table_size;
+ tce_table_size <<= 3;
+ tce_table_size = ROUND_UP(tce_table_size, direct_table_size);
+ }
+
+ return bytes;
+}
+
static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift,
unsigned levels, unsigned long limit,
unsigned long *tce_table_allocated)
@@ -741,6 +773,10 @@ long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
return -ENOMEM;
}

+ tbl->it_allocated_size = pnv_get_table_size(page_shift, window_size,
+ levels);
+ WARN_ON(!tbl->it_allocated_size);
+
/* Setup linux iommu table */
pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, bus_offset,
page_shift);
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 3d1ff584..ce4bc3c 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -224,6 +224,8 @@ extern long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
__u64 bus_offset, __u32 page_shift, __u64 window_size,
__u32 levels, struct iommu_table *tbl);
extern void pnv_pci_free_table(struct iommu_table *tbl);
+extern unsigned long pnv_get_table_size(__u32 page_shift,
+ __u64 window_size, __u32 levels);
extern void pnv_pci_init_p5ioc2_hub(struct device_node *np);
extern void pnv_pci_init_ioda_hub(struct device_node *np);
extern void pnv_pci_init_ioda2_phb(struct device_node *np);
--
2.0.0

2015-04-25 12:18:59

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH kernel v9 28/32] powerpc/mmu: Add userspace-to-physical addresses translation cache

We are adding support for DMA memory pre-registration to be used in
conjunction with VFIO. The idea is that the userspace which is going to
run a guest may want to pre-register a user space memory region so
it all gets pinned once and never goes away. Once this is done,
a hypervisor will not have to pin/unpin pages on every DMA map/unmap
request. This is going to help with multiple pinning of the same memory
and in-kernel acceleration of DMA requests.

This adds a list of memory regions to mm_context_t. Each region consists
of a header and a list of physical addresses. This adds API to:
1. register/unregister memory regions;
2. do final cleanup (which puts all pre-registered pages);
3. do userspace to physical address translation;
4. manage a mapped pages counter; when it is zero, it is safe to
unregister the region.

Multiple registration of the same region is allowed, kref is used to
track the number of registrations.
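
A rough sketch of how an IOMMU user such as VFIO is expected to drive the API
(error handling trimmed; the real callers appear in the following patches):

    struct mm_iommu_table_group_mem_t *mem;
    unsigned long hpa;

    /* registration: pin the whole region once and account it */
    ret = mm_iommu_alloc(ua, size >> PAGE_SHIFT, &mem);

    /* per-TCE translation while the region stays registered */
    mem = mm_iommu_lookup(ua, IOMMU_PAGE_SIZE(tbl));
    if (mem && !mm_iommu_ua_to_hpa(mem, ua, &hpa))
        mm_iommu_mapped_update(mem, true);  /* a TCE now references it */

    /* ... when the TCE is cleared ... */
    mm_iommu_mapped_update(mem, false);

    /* unregistration fails with -EBUSY while any page is still mapped */
    ret = mm_iommu_put(mem);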

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v8:
* s/mm_iommu_table_group_mem_t/struct mm_iommu_table_group_mem_t/
* fixed error fallback look (s/[i]/[j]/)
---
arch/powerpc/include/asm/mmu-hash64.h | 3 +
arch/powerpc/include/asm/mmu_context.h | 17 +++
arch/powerpc/mm/Makefile | 1 +
arch/powerpc/mm/mmu_context_hash64.c | 6 +
arch/powerpc/mm/mmu_context_hash64_iommu.c | 215 +++++++++++++++++++++++++++++
5 files changed, 242 insertions(+)
create mode 100644 arch/powerpc/mm/mmu_context_hash64_iommu.c

diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
index 1da6a81..a82f534 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -536,6 +536,9 @@ typedef struct {
/* for 4K PTE fragment support */
void *pte_frag;
#endif
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+ struct list_head iommu_group_mem_list;
+#endif
} mm_context_t;


diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 73382eb..d6116ca 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -16,6 +16,23 @@
*/
extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
extern void destroy_context(struct mm_struct *mm);
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+struct mm_iommu_table_group_mem_t;
+
+extern bool mm_iommu_preregistered(void);
+extern long mm_iommu_alloc(unsigned long ua, unsigned long entries,
+ struct mm_iommu_table_group_mem_t **pmem);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_get(unsigned long ua,
+ unsigned long entries);
+extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
+extern void mm_iommu_cleanup(mm_context_t *ctx);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
+ unsigned long size);
+extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
+ unsigned long ua, unsigned long *hpa);
+extern long mm_iommu_mapped_update(struct mm_iommu_table_group_mem_t *mem,
+ bool inc);
+#endif

extern void switch_mmu_context(struct mm_struct *prev, struct mm_struct *next);
extern void switch_slb(struct task_struct *tsk, struct mm_struct *mm);
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 9c8770b..e216704 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_PPC_SUBPAGE_PROT) += subpage-prot.o
obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
obj-$(CONFIG_HIGHMEM) += highmem.o
obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o
+obj-$(CONFIG_SPAPR_TCE_IOMMU) += mmu_context_hash64_iommu.o
diff --git a/arch/powerpc/mm/mmu_context_hash64.c b/arch/powerpc/mm/mmu_context_hash64.c
index 178876ae..eb3080c 100644
--- a/arch/powerpc/mm/mmu_context_hash64.c
+++ b/arch/powerpc/mm/mmu_context_hash64.c
@@ -89,6 +89,9 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
#ifdef CONFIG_PPC_64K_PAGES
mm->context.pte_frag = NULL;
#endif
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+ INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
+#endif
return 0;
}

@@ -132,6 +135,9 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)

void destroy_context(struct mm_struct *mm)
{
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+ mm_iommu_cleanup(&mm->context);
+#endif

#ifdef CONFIG_PPC_ICSWX
drop_cop(mm->context.acop, mm);
diff --git a/arch/powerpc/mm/mmu_context_hash64_iommu.c b/arch/powerpc/mm/mmu_context_hash64_iommu.c
new file mode 100644
index 0000000..af7668c
--- /dev/null
+++ b/arch/powerpc/mm/mmu_context_hash64_iommu.c
@@ -0,0 +1,215 @@
+/*
+ * IOMMU helpers in MMU context.
+ *
+ * Copyright (C) 2015 IBM Corp. <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/rculist.h>
+#include <linux/vmalloc.h>
+#include <linux/kref.h>
+#include <asm/mmu_context.h>
+
+struct mm_iommu_table_group_mem_t {
+ struct list_head next;
+ struct rcu_head rcu;
+ struct kref kref; /* one reference per VFIO container */
+ atomic_t mapped; /* number of currently mapped pages */
+ u64 ua; /* userspace address */
+ u64 entries; /* number of entries in hpas[] */
+ u64 *hpas; /* vmalloc'ed */
+};
+
+bool mm_iommu_preregistered(void)
+{
+ if (!current || !current->mm)
+ return false;
+
+ return !list_empty(&current->mm->context.iommu_group_mem_list);
+}
+EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
+
+long mm_iommu_alloc(unsigned long ua, unsigned long entries,
+ struct mm_iommu_table_group_mem_t **pmem)
+{
+ struct mm_iommu_table_group_mem_t *mem;
+ long i, j;
+ struct page *page = NULL;
+
+ list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
+ next) {
+ if ((mem->ua == ua) && (mem->entries == entries))
+ return -EBUSY;
+
+ /* Overlap? */
+ if ((mem->ua < (ua + (entries << PAGE_SHIFT))) &&
+ (ua < (mem->ua + (mem->entries << PAGE_SHIFT))))
+ return -EINVAL;
+ }
+
+ mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+ if (!mem)
+ return -ENOMEM;
+
+ mem->hpas = vzalloc(entries * sizeof(mem->hpas[0]));
+ if (!mem->hpas) {
+ kfree(mem);
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < entries; ++i) {
+ if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
+ 1/* pages */, 1/* iswrite */, &page)) {
+ for (j = 0; j < i; ++j)
+ put_page(pfn_to_page(
+ mem->hpas[j] >> PAGE_SHIFT));
+ vfree(mem->hpas);
+ kfree(mem);
+ return -EFAULT;
+ }
+
+ mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
+ }
+
+ kref_init(&mem->kref);
+ atomic_set(&mem->mapped, 0);
+ mem->ua = ua;
+ mem->entries = entries;
+ *pmem = mem;
+
+ list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_alloc);
+
+static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
+{
+ long i;
+ struct page *page = NULL;
+
+ for (i = 0; i < mem->entries; ++i) {
+ if (!mem->hpas[i])
+ continue;
+
+ page = pfn_to_page(mem->hpas[i] >> PAGE_SHIFT);
+ if (!page)
+ continue;
+
+ put_page(page);
+ mem->hpas[i] = 0;
+ }
+}
+
+static void mm_iommu_free(struct rcu_head *head)
+{
+ struct mm_iommu_table_group_mem_t *mem = container_of(head,
+ struct mm_iommu_table_group_mem_t, rcu);
+
+ mm_iommu_unpin(mem);
+ vfree(mem->hpas);
+ kfree(mem);
+}
+
+static void mm_iommu_release(struct kref *kref)
+{
+ struct mm_iommu_table_group_mem_t *mem = container_of(kref,
+ struct mm_iommu_table_group_mem_t, kref);
+
+ list_del_rcu(&mem->next);
+ call_rcu(&mem->rcu, mm_iommu_free);
+}
+
+struct mm_iommu_table_group_mem_t *mm_iommu_get(unsigned long ua,
+ unsigned long entries)
+{
+ struct mm_iommu_table_group_mem_t *mem;
+
+ list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
+ next) {
+ if ((mem->ua == ua) && (mem->entries == entries)) {
+ kref_get(&mem->kref);
+ return mem;
+ }
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_get);
+
+long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
+{
+ if (atomic_read(&mem->mapped))
+ return -EBUSY;
+
+ kref_put(&mem->kref, mm_iommu_release);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_put);
+
+struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
+ unsigned long size)
+{
+ struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
+
+ list_for_each_entry_rcu(mem,
+ &current->mm->context.iommu_group_mem_list,
+ next) {
+ if ((mem->ua <= ua) &&
+ (ua + size <= mem->ua +
+ (mem->entries << PAGE_SHIFT))) {
+ ret = mem;
+ break;
+ }
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_lookup);
+
+long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
+ unsigned long ua, unsigned long *hpa)
+{
+ const long entry = (ua - mem->ua) >> PAGE_SHIFT;
+ u64 *va = &mem->hpas[entry];
+
+ if (entry >= mem->entries)
+ return -EFAULT;
+
+ *hpa = *va | (ua & ~PAGE_MASK);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
+
+long mm_iommu_mapped_update(struct mm_iommu_table_group_mem_t *mem, bool inc)
+{
+ long ret = 0;
+
+ if (inc)
+ atomic_inc(&mem->mapped);
+ else
+ ret = atomic_dec_if_positive(&mem->mapped);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_mapped_update);
+
+void mm_iommu_cleanup(mm_context_t *ctx)
+{
+ while (!list_empty(&ctx->iommu_group_mem_list)) {
+ struct mm_iommu_table_group_mem_t *mem;
+
+ mem = list_first_entry(&ctx->iommu_group_mem_list,
+ struct mm_iommu_table_group_mem_t, next);
+ mm_iommu_release(&mem->kref);
+ }
+}
--
2.0.0

2015-04-25 12:16:38

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2

The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to be worse with multiple
containers and huge DMA windows. Also, real-time accounting would require
additional tracking of accounted pages due to the page size difference -
the IOMMU uses 4K pages and the system uses 4K or 64K pages.

Another issue is that actual page pinning/unpinning happens on every
DMA map/unmap request. This does not affect performance much now as
we spend far more time switching context between
guest/userspace/host, but it will start to matter when we add in-kernel
DMA map/unmap acceleration.

This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
The new IOMMU type deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
2 new ioctls to register/unregister DMA memory -
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
which receive the user space address and size of a memory region which
needs to be pinned/unpinned and counted in locked_vm.
The new IOMMU type splits physical page pinning and TCE table updates into 2 different
operations. It requires 1) guest pages to be registered first and 2) subsequent
map/unmap requests to work only with pre-registered memory.
For the default single window case this means that the entire guest RAM
(instead of 2GB) needs to be pinned before using VFIO.
When a huge DMA window is added, no additional pinning will be
required; otherwise it would be guest RAM + 2GB.

The new memory registration ioctls are not supported by
VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
will require memory to be preregistered in order to work.

The accounting is done per user process.

This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
can do with v1 or v2 IOMMUs.
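
From the userspace side the expected flow is roughly as follows (a sketch;
container_fd, qemu_vaddr and ram_size are placeholders, vaddr and size must be
page aligned, group open/attach and the usual VFIO DMA map/unmap structs are
omitted):

    struct vfio_iommu_spapr_register_memory reg = {
        .argsz = sizeof(reg),
        .flags = 0,
        .vaddr = (__u64)(uintptr_t)qemu_vaddr,
        .size  = ram_size,
    };

    ioctl(container_fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_v2_IOMMU);
    ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);

    /* MAP/UNMAP_DMA now only update TCEs, pinning/accounting is already done */
    ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);

    ioctl(container_fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);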

Signed-off-by: Alexey Kardashevskiy <[email protected]>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <[email protected]>
---
Changes:
v9:
* s/tce_get_hva_cached/tce_iommu_use_page_v2/

v7:
* now memory is registered per mm (i.e. process)
* moved memory registration code to powerpc/mmu
* merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
* limited new ioctls to v2 IOMMU
* updated doc
* unsupported ioctls return -ENOTTY instead of -EPERM

v6:
* tce_get_hva_cached() returns hva via a pointer

v4:
* updated docs
* s/kzmalloc/vzalloc/
* in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
replaced offset with index
* renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
and removed duplicating vfio_iommu_spapr_register_memory
---
Documentation/vfio.txt | 23 ++++
drivers/vfio/vfio_iommu_spapr_tce.c | 230 +++++++++++++++++++++++++++++++++++-
include/uapi/linux/vfio.h | 27 +++++
3 files changed, 274 insertions(+), 6 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 96978ec..94328c8 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -427,6 +427,29 @@ The code flow from the example above should be slightly changed:

....

+5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
+VFIO_IOMMU_DISABLE and implements 2 new ioctls:
+VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
+(which are unsupported in v1 IOMMU).
+
+PPC64 paravirtualized guests generate a lot of map/unmap requests,
+and the handling of those includes pinning/unpinning pages and updating
+mm::locked_vm counter to make sure we do not exceed the rlimit.
+The v2 IOMMU splits accounting and pinning into separate operations:
+
+- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
+receive a user space address and size of the block to be pinned.
+Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
+be called with the exact address and size used for registering
+the memory block. The userspace is not expected to call these often.
+The ranges are stored in a linked list in a VFIO container.
+
+- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
+IOMMU table and do not do pinning; instead these check that the userspace
+address is from pre-registered range.
+
+This separation helps in optimizing DMA for guests.
+
-------------------------------------------------------------------------------

[1] VFIO was originally an acronym for "Virtual Function I/O" in its
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 892a584..4cfc2c1 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -21,6 +21,7 @@
#include <linux/vfio.h>
#include <asm/iommu.h>
#include <asm/tce.h>
+#include <asm/mmu_context.h>

#define DRIVER_VERSION "0.1"
#define DRIVER_AUTHOR "[email protected]"
@@ -91,8 +92,58 @@ struct tce_container {
struct iommu_group *grp;
bool enabled;
unsigned long locked_pages;
+ bool v2;
};

+static long tce_unregister_pages(struct tce_container *container,
+ __u64 vaddr, __u64 size)
+{
+ long ret;
+ struct mm_iommu_table_group_mem_t *mem;
+
+ if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
+ return -EINVAL;
+
+ mem = mm_iommu_get(vaddr, size >> PAGE_SHIFT);
+ if (!mem)
+ return -EINVAL;
+
+ ret = mm_iommu_put(mem); /* undo kref_get() from mm_iommu_get() */
+ if (!ret)
+ ret = mm_iommu_put(mem);
+
+ return ret;
+}
+
+static long tce_register_pages(struct tce_container *container,
+ __u64 vaddr, __u64 size)
+{
+ long ret = 0;
+ struct mm_iommu_table_group_mem_t *mem;
+ unsigned long entries = size >> PAGE_SHIFT;
+
+ if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
+ ((vaddr + size) < vaddr))
+ return -EINVAL;
+
+ mem = mm_iommu_get(vaddr, entries);
+ if (!mem) {
+ ret = try_increment_locked_vm(entries);
+ if (ret)
+ return ret;
+
+ ret = mm_iommu_alloc(vaddr, entries, &mem);
+ if (ret) {
+ decrement_locked_vm(entries);
+ return ret;
+ }
+ }
+
+ container->enabled = true;
+
+ return 0;
+}
+
static bool tce_page_is_contained(struct page *page, unsigned page_shift)
{
/*
@@ -205,7 +256,7 @@ static void *tce_iommu_open(unsigned long arg)
{
struct tce_container *container;

- if (arg != VFIO_SPAPR_TCE_IOMMU) {
+ if ((arg != VFIO_SPAPR_TCE_IOMMU) && (arg != VFIO_SPAPR_TCE_v2_IOMMU)) {
pr_err("tce_vfio: Wrong IOMMU type\n");
return ERR_PTR(-EINVAL);
}
@@ -215,6 +266,7 @@ static void *tce_iommu_open(unsigned long arg)
return ERR_PTR(-ENOMEM);

mutex_init(&container->lock);
+ container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;

return container;
}
@@ -243,6 +295,47 @@ static void tce_iommu_unuse_page(struct tce_container *container,
put_page(page);
}

+static int tce_iommu_use_page_v2(unsigned long tce, unsigned long size,
+ unsigned long *phpa, struct mm_iommu_table_group_mem_t **pmem)
+{
+ long ret = 0;
+ struct mm_iommu_table_group_mem_t *mem;
+
+ mem = mm_iommu_lookup(tce, size);
+ if (!mem)
+ return -EINVAL;
+
+ ret = mm_iommu_ua_to_hpa(mem, tce, phpa);
+ if (ret)
+ return -EINVAL;
+
+ *pmem = mem;
+
+ return 0;
+}
+
+static void tce_iommu_unuse_page_v2(struct iommu_table *tbl,
+ unsigned long entry)
+{
+ struct mm_iommu_table_group_mem_t *mem = NULL;
+ int ret;
+ unsigned long hpa = 0;
+ unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+ if (!pua || !current || !current->mm)
+ return;
+
+ ret = tce_iommu_use_page_v2(*pua, IOMMU_PAGE_SIZE(tbl),
+ &hpa, &mem);
+ if (ret)
+ pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
+ __func__, *pua, entry, ret);
+ if (mem)
+ mm_iommu_mapped_update(mem, false);
+
+ *pua = 0;
+}
+
static int tce_iommu_clear(struct tce_container *container,
struct iommu_table *tbl,
unsigned long entry, unsigned long pages)
@@ -261,6 +354,11 @@ static int tce_iommu_clear(struct tce_container *container,
if (direction == DMA_NONE)
continue;

+ if (container->v2) {
+ tce_iommu_unuse_page_v2(tbl, entry);
+ continue;
+ }
+
tce_iommu_unuse_page(container, oldtce);
}

@@ -327,6 +425,62 @@ static long tce_iommu_build(struct tce_container *container,
return ret;
}

+static long tce_iommu_build_v2(struct tce_container *container,
+ struct iommu_table *tbl,
+ unsigned long entry, unsigned long tce, unsigned long pages,
+ enum dma_data_direction direction)
+{
+ long i, ret = 0;
+ struct page *page;
+ unsigned long hpa;
+ enum dma_data_direction dirtmp;
+
+ for (i = 0; i < pages; ++i) {
+ struct mm_iommu_table_group_mem_t *mem = NULL;
+ unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl,
+ entry + i);
+
+ ret = tce_iommu_use_page_v2(tce, IOMMU_PAGE_SIZE(tbl),
+ &hpa, &mem);
+ if (ret)
+ break;
+
+ page = pfn_to_page(hpa >> PAGE_SHIFT);
+ if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+ ret = -EPERM;
+ break;
+ }
+
+ /* Preserve offset within IOMMU page */
+ hpa |= tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
+ dirtmp = direction;
+
+ ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp);
+ if (ret) {
+ /* dirtmp cannot be DMA_NONE here */
+ tce_iommu_unuse_page_v2(tbl, entry + i);
+ pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
+ __func__, entry << tbl->it_page_shift,
+ tce, ret);
+ break;
+ }
+
+ mm_iommu_mapped_update(mem, true);
+
+ if (dirtmp != DMA_NONE)
+ tce_iommu_unuse_page_v2(tbl, entry + i);
+
+ *pua = tce;
+
+ tce += IOMMU_PAGE_SIZE(tbl);
+ }
+
+ if (ret)
+ tce_iommu_clear(container, tbl, entry, i);
+
+ return ret;
+}
+
static long tce_iommu_ioctl(void *iommu_data,
unsigned int cmd, unsigned long arg)
{
@@ -338,6 +492,7 @@ static long tce_iommu_ioctl(void *iommu_data,
case VFIO_CHECK_EXTENSION:
switch (arg) {
case VFIO_SPAPR_TCE_IOMMU:
+ case VFIO_SPAPR_TCE_v2_IOMMU:
ret = 1;
break;
default:
@@ -425,11 +580,18 @@ static long tce_iommu_ioctl(void *iommu_data,
if (ret)
return ret;

- ret = tce_iommu_build(container, tbl,
- param.iova >> tbl->it_page_shift,
- param.vaddr,
- param.size >> tbl->it_page_shift,
- direction);
+ if (container->v2)
+ ret = tce_iommu_build_v2(container, tbl,
+ param.iova >> tbl->it_page_shift,
+ param.vaddr,
+ param.size >> tbl->it_page_shift,
+ direction);
+ else
+ ret = tce_iommu_build(container, tbl,
+ param.iova >> tbl->it_page_shift,
+ param.vaddr,
+ param.size >> tbl->it_page_shift,
+ direction);

iommu_flush_tce(tbl);

@@ -474,7 +636,60 @@ static long tce_iommu_ioctl(void *iommu_data,

return ret;
}
+ case VFIO_IOMMU_SPAPR_REGISTER_MEMORY: {
+ struct vfio_iommu_spapr_register_memory param;
+
+ if (!container->v2)
+ break;
+
+ minsz = offsetofend(struct vfio_iommu_spapr_register_memory,
+ size);
+
+ if (copy_from_user(&param, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (param.argsz < minsz)
+ return -EINVAL;
+
+ /* No flag is supported now */
+ if (param.flags)
+ return -EINVAL;
+
+ mutex_lock(&container->lock);
+ ret = tce_register_pages(container, param.vaddr, param.size);
+ mutex_unlock(&container->lock);
+
+ return ret;
+ }
+ case VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY: {
+ struct vfio_iommu_spapr_register_memory param;
+
+ if (!container->v2)
+ break;
+
+ minsz = offsetofend(struct vfio_iommu_spapr_register_memory,
+ size);
+
+ if (copy_from_user(&param, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (param.argsz < minsz)
+ return -EINVAL;
+
+ /* No flag is supported now */
+ if (param.flags)
+ return -EINVAL;
+
+ mutex_lock(&container->lock);
+ tce_unregister_pages(container, param.vaddr, param.size);
+ mutex_unlock(&container->lock);
+
+ return 0;
+ }
case VFIO_IOMMU_ENABLE:
+ if (container->v2)
+ break;
+
mutex_lock(&container->lock);
ret = tce_iommu_enable(container);
mutex_unlock(&container->lock);
@@ -482,6 +697,9 @@ static long tce_iommu_ioctl(void *iommu_data,


case VFIO_IOMMU_DISABLE:
+ if (container->v2)
+ break;
+
mutex_lock(&container->lock);
tce_iommu_disable(container);
mutex_unlock(&container->lock);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index b57b750..8fdcfb9 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -36,6 +36,8 @@
/* Two-stage IOMMU */
#define VFIO_TYPE1_NESTING_IOMMU 6 /* Implies v2 */

+#define VFIO_SPAPR_TCE_v2_IOMMU 7
+
/*
* The IOCTL interface is designed for extensibility by embedding the
* structure length (argsz) and flags into structures passed between
@@ -495,6 +497,31 @@ struct vfio_eeh_pe_op {

#define VFIO_EEH_PE_OP _IO(VFIO_TYPE, VFIO_BASE + 21)

+/**
+ * VFIO_IOMMU_SPAPR_REGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 17, struct vfio_iommu_spapr_register_memory)
+ *
+ * Registers user space memory where DMA is allowed. It pins
+ * user pages and does the locked memory accounting so
+ * subsequent VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA calls
+ * get faster.
+ */
+struct vfio_iommu_spapr_register_memory {
+ __u32 argsz;
+ __u32 flags;
+ __u64 vaddr; /* Process virtual address */
+ __u64 size; /* Size of mapping (bytes) */
+};
+#define VFIO_IOMMU_SPAPR_REGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 17)
+
+/**
+ * VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 18, struct vfio_iommu_spapr_register_memory)
+ *
+ * Unregisters user space memory registered with
+ * VFIO_IOMMU_SPAPR_REGISTER_MEMORY.
+ * Uses vfio_iommu_spapr_register_memory for parameters.
+ */
+#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 18)
+
/* ***************************************************************** */

#endif /* _UAPIVFIO_H */
--
2.0.0

2015-04-25 12:17:14

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH kernel v9 30/32] vfio: powerpc/spapr: Use 32bit DMA window properties from table_group

A table group might not have a table, but it always has the default 32bit
window parameters, so use these.

No change in behavior is expected.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v9:
* new in the series - to make the next patch simpler
---
drivers/vfio/vfio_iommu_spapr_tce.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 4cfc2c1..a7d6729 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -185,7 +185,6 @@ static int tce_iommu_enable(struct tce_container *container)
{
int ret = 0;
unsigned long locked;
- struct iommu_table *tbl;
struct iommu_table_group *table_group;

if (!container->grp)
@@ -221,13 +220,19 @@ static int tce_iommu_enable(struct tce_container *container)
* this is that we cannot tell here the amount of RAM used by the guest
* as this information is only available from KVM and VFIO is
* KVM agnostic.
+ *
+ * So we do not allow enabling a container without a group attached
+ * as there is no way to know how much we should increment
+ * the locked_vm counter.
*/
table_group = iommu_group_get_iommudata(container->grp);
if (!table_group)
return -ENODEV;

- tbl = &table_group->tables[0];
- locked = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
+ if (!table_group->tce32_size)
+ return -EPERM;
+
+ locked = table_group->tce32_size >> PAGE_SHIFT;
ret = try_increment_locked_vm(locked);
if (ret)
return ret;
@@ -504,7 +509,6 @@ static long tce_iommu_ioctl(void *iommu_data,

case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
struct vfio_iommu_spapr_tce_info info;
- struct iommu_table *tbl;
struct iommu_table_group *table_group;

if (WARN_ON(!container->grp))
@@ -512,8 +516,7 @@ static long tce_iommu_ioctl(void *iommu_data,

table_group = iommu_group_get_iommudata(container->grp);

- tbl = &table_group->tables[0];
- if (WARN_ON_ONCE(!tbl))
+ if (!table_group)
return -ENXIO;

minsz = offsetofend(struct vfio_iommu_spapr_tce_info,
@@ -525,8 +528,8 @@ static long tce_iommu_ioctl(void *iommu_data,
if (info.argsz < minsz)
return -EINVAL;

- info.dma32_window_start = tbl->it_offset << tbl->it_page_shift;
- info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
+ info.dma32_window_start = table_group->tce32_start;
+ info.dma32_window_size = table_group->tce32_size;
info.flags = 0;

if (copy_to_user((void __user *)arg, &info, minsz))
--
2.0.0

2015-04-25 12:20:33

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible

At the moment only one group per container is supported.
POWER8 CPUs have a more flexible design and allow having 2 TCE tables per
IOMMU group, so we can relax this limitation and support multiple groups
per container.

This adds TCE table descriptors to a container and uses iommu_table_group_ops
to create/set DMA windows on IOMMU groups so the same TCE tables will be
shared between several IOMMU groups.
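
For example, a userspace sequence could look like this (hypothetical file
descriptors; both PEs must be backed by hardware providing
iommu_table_group_ops, i.e. IODA2):

    ioctl(group1_fd, VFIO_GROUP_SET_CONTAINER, &container_fd);
    ioctl(container_fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_v2_IOMMU);
    ioctl(group2_fd, VFIO_GROUP_SET_CONTAINER, &container_fd);
    /*
     * The container now owns the TCE table(s): the default window is
     * created once and ops->set_window() is called for every attached
     * group, so both PEs share the same table.
     */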

Signed-off-by: Alexey Kardashevskiy <[email protected]>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <[email protected]>
---
Changes:
v7:
* updated doc
---
Documentation/vfio.txt | 8 +-
drivers/vfio/vfio_iommu_spapr_tce.c | 268 ++++++++++++++++++++++++++----------
2 files changed, 199 insertions(+), 77 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 94328c8..7dcf2b5 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -289,10 +289,12 @@ PPC64 sPAPR implementation note

This implementation has some specifics:

-1) Only one IOMMU group per container is supported as an IOMMU group
-represents the minimal entity which isolation can be guaranteed for and
-groups are allocated statically, one per a Partitionable Endpoint (PE)
+1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
+container is supported as an IOMMU table is allocated at boot time,
+one table per IOMMU group, which is a Partitionable Endpoint (PE)
(PE is often a PCI domain but not always).
+Newer systems (POWER8 with IODA2) have an improved hardware design which allows
+removing this limitation and having multiple IOMMU groups per VFIO container.

2) The hardware supports so called DMA windows - the PCI address range
within which DMA transfer is allowed, any attempt to access address space
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index a7d6729..970e3a2 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -82,6 +82,11 @@ static void decrement_locked_vm(long npages)
* into DMA'ble space using the IOMMU
*/

+struct tce_iommu_group {
+ struct list_head next;
+ struct iommu_group *grp;
+};
+
/*
* The container descriptor supports only a single group per container.
* Required by the API as the container is not supplied with the IOMMU group
@@ -89,10 +94,11 @@ static void decrement_locked_vm(long npages)
*/
struct tce_container {
struct mutex lock;
- struct iommu_group *grp;
bool enabled;
unsigned long locked_pages;
bool v2;
+ struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
+ struct list_head group_list;
};

static long tce_unregister_pages(struct tce_container *container,
@@ -154,20 +160,20 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift)
return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift;
}

+static inline bool tce_groups_attached(struct tce_container *container)
+{
+ return !list_empty(&container->group_list);
+}
+
static struct iommu_table *spapr_tce_find_table(
struct tce_container *container,
phys_addr_t ioba)
{
long i;
struct iommu_table *ret = NULL;
- struct iommu_table_group *table_group;
-
- table_group = iommu_group_get_iommudata(container->grp);
- if (!table_group)
- return NULL;

for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
- struct iommu_table *tbl = &table_group->tables[i];
+ struct iommu_table *tbl = &container->tables[i];
unsigned long entry = ioba >> tbl->it_page_shift;
unsigned long start = tbl->it_offset;
unsigned long end = start + tbl->it_size;
@@ -186,9 +192,7 @@ static int tce_iommu_enable(struct tce_container *container)
int ret = 0;
unsigned long locked;
struct iommu_table_group *table_group;
-
- if (!container->grp)
- return -ENXIO;
+ struct tce_iommu_group *tcegrp;

if (!current->mm)
return -ESRCH; /* process exited */
@@ -225,7 +229,12 @@ static int tce_iommu_enable(struct tce_container *container)
* as there is no way to know how much we should increment
* the locked_vm counter.
*/
- table_group = iommu_group_get_iommudata(container->grp);
+ if (!tce_groups_attached(container))
+ return -ENODEV;
+
+ tcegrp = list_first_entry(&container->group_list,
+ struct tce_iommu_group, next);
+ table_group = iommu_group_get_iommudata(tcegrp->grp);
if (!table_group)
return -ENODEV;

@@ -257,6 +266,48 @@ static void tce_iommu_disable(struct tce_container *container)
decrement_locked_vm(container->locked_pages);
}

+static long tce_iommu_create_table(struct iommu_table_group *table_group,
+ int num,
+ __u32 page_shift,
+ __u64 window_size,
+ __u32 levels,
+ struct iommu_table *tbl)
+{
+ long ret, table_size;
+
+ table_size = table_group->ops->get_table_size(page_shift, window_size,
+ levels);
+ if (!table_size)
+ return -EINVAL;
+
+ ret = try_increment_locked_vm(table_size >> PAGE_SHIFT);
+ if (ret)
+ return ret;
+
+ ret = table_group->ops->create_table(table_group, num,
+ page_shift, window_size, levels, tbl);
+
+ WARN_ON(!ret && !tbl->it_ops->free);
+ WARN_ON(!ret && (tbl->it_allocated_size != table_size));
+
+ if (ret)
+ decrement_locked_vm(table_size >> PAGE_SHIFT);
+
+ return ret;
+}
+
+static void tce_iommu_free_table(struct iommu_table *tbl)
+{
+ unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
+
+ if (!tbl->it_size)
+ return;
+
+ tbl->it_ops->free(tbl);
+ decrement_locked_vm(pages);
+ memset(tbl, 0, sizeof(*tbl));
+}
+
static void *tce_iommu_open(unsigned long arg)
{
struct tce_container *container;
@@ -271,19 +322,41 @@ static void *tce_iommu_open(unsigned long arg)
return ERR_PTR(-ENOMEM);

mutex_init(&container->lock);
+ INIT_LIST_HEAD_RCU(&container->group_list);
+
container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;

return container;
}

+static int tce_iommu_clear(struct tce_container *container,
+ struct iommu_table *tbl,
+ unsigned long entry, unsigned long pages);
+
static void tce_iommu_release(void *iommu_data)
{
struct tce_container *container = iommu_data;
+ struct iommu_table_group *table_group;
+ struct tce_iommu_group *tcegrp;
+ long i;

- WARN_ON(container->grp);
+ while (tce_groups_attached(container)) {
+ tcegrp = list_first_entry(&container->group_list,
+ struct tce_iommu_group, next);
+ table_group = iommu_group_get_iommudata(tcegrp->grp);
+ tce_iommu_detach_group(iommu_data, tcegrp->grp);
+ }

- if (container->grp)
- tce_iommu_detach_group(iommu_data, container->grp);
+ /*
+ * If VFIO created a table, it was not disposed
+ * by tce_iommu_detach_group() so do it now.
+ */
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+ struct iommu_table *tbl = &container->tables[i];
+
+ tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
+ tce_iommu_free_table(tbl);
+ }

tce_iommu_disable(container);
mutex_destroy(&container->lock);
@@ -509,12 +582,15 @@ static long tce_iommu_ioctl(void *iommu_data,

case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
struct vfio_iommu_spapr_tce_info info;
+ struct tce_iommu_group *tcegrp;
struct iommu_table_group *table_group;

- if (WARN_ON(!container->grp))
+ if (!tce_groups_attached(container))
return -ENXIO;

- table_group = iommu_group_get_iommudata(container->grp);
+ tcegrp = list_first_entry(&container->group_list,
+ struct tce_iommu_group, next);
+ table_group = iommu_group_get_iommudata(tcegrp->grp);

if (!table_group)
return -ENXIO;
@@ -707,12 +783,20 @@ static long tce_iommu_ioctl(void *iommu_data,
tce_iommu_disable(container);
mutex_unlock(&container->lock);
return 0;
- case VFIO_EEH_PE_OP:
- if (!container->grp)
- return -ENODEV;

- return vfio_spapr_iommu_eeh_ioctl(container->grp,
- cmd, arg);
+ case VFIO_EEH_PE_OP: {
+ struct tce_iommu_group *tcegrp;
+
+ ret = 0;
+ list_for_each_entry(tcegrp, &container->group_list, next) {
+ ret = vfio_spapr_iommu_eeh_ioctl(tcegrp->grp,
+ cmd, arg);
+ if (ret)
+ return ret;
+ }
+ return ret;
+ }
+
}

return -ENOTTY;
@@ -724,11 +808,14 @@ static void tce_iommu_release_ownership(struct tce_container *container,
int i;

for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
- struct iommu_table *tbl = &table_group->tables[i];
+ struct iommu_table *tbl = &container->tables[i];

tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
if (tbl->it_map)
iommu_release_ownership(tbl);
+
+ /* Reset the container's copy of the table descriptor */
+ memset(tbl, 0, sizeof(*tbl));
}
}

@@ -758,38 +845,56 @@ static int tce_iommu_take_ownership(struct iommu_table_group *table_group)
static int tce_iommu_attach_group(void *iommu_data,
struct iommu_group *iommu_group)
{
- int ret;
+ int ret, i;
struct tce_container *container = iommu_data;
struct iommu_table_group *table_group;
+ struct tce_iommu_group *tcegrp = NULL;
+ bool first_group = !tce_groups_attached(container);

mutex_lock(&container->lock);

/* pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
iommu_group_id(iommu_group), iommu_group); */
- if (container->grp) {
- pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
- iommu_group_id(container->grp),
- iommu_group_id(iommu_group));
- ret = -EBUSY;
- goto unlock_exit;
- }
-
- if (container->enabled) {
- pr_err("tce_vfio: attaching group #%u to enabled container\n",
- iommu_group_id(iommu_group));
- ret = -EBUSY;
- goto unlock_exit;
- }
-
table_group = iommu_group_get_iommudata(iommu_group);
- if (!table_group) {
- ret = -ENXIO;
+
+ if (!first_group && (!table_group->ops ||
+ !table_group->ops->take_ownership ||
+ !table_group->ops->release_ownership)) {
+ ret = -EBUSY;
+ goto unlock_exit;
+ }
+
+ /* Check if new group has the same iommu_ops (i.e. compatible) */
+ list_for_each_entry(tcegrp, &container->group_list, next) {
+ struct iommu_table_group *table_group_tmp;
+
+ if (tcegrp->grp == iommu_group) {
+ pr_warn("tce_vfio: Group %d is already attached\n",
+ iommu_group_id(iommu_group));
+ ret = -EBUSY;
+ goto unlock_exit;
+ }
+ table_group_tmp = iommu_group_get_iommudata(tcegrp->grp);
+ if (table_group_tmp->ops != table_group->ops) {
+ pr_warn("tce_vfio: Group %d is incompatible with group %d\n",
+ iommu_group_id(iommu_group),
+ iommu_group_id(tcegrp->grp));
+ ret = -EPERM;
+ goto unlock_exit;
+ }
+ }
+
+ tcegrp = kzalloc(sizeof(*tcegrp), GFP_KERNEL);
+ if (!tcegrp) {
+ ret = -ENOMEM;
goto unlock_exit;
}

if (!table_group->ops || !table_group->ops->take_ownership ||
!table_group->ops->release_ownership) {
ret = tce_iommu_take_ownership(table_group);
+ if (!ret)
+ container->tables[0] = table_group->tables[0];
} else if (!table_group->ops->create_table ||
!table_group->ops->set_window) {
WARN_ON_ONCE(1);
@@ -801,23 +906,46 @@ static int tce_iommu_attach_group(void *iommu_data,
* the pages that has been explicitly mapped into the iommu
*/
table_group->ops->take_ownership(table_group);
- ret = table_group->ops->create_table(table_group,
- 0, /* window number */
- IOMMU_PAGE_SHIFT_4K,
- table_group->tce32_size,
- 1, /* default levels */
- &table_group->tables[0]);
- if (!ret)
- ret = table_group->ops->set_window(table_group, 0,
- &table_group->tables[0]);
+ /*
+ * If this is the first group attached, check if there is
+ * a default DMA window and create one if there is none, as
+ * userspace expects it to exist.
+ */
+ if (first_group && !container->tables[0].it_size) {
+ ret = tce_iommu_create_table(table_group,
+ 0, /* window number */
+ IOMMU_PAGE_SHIFT_4K,
+ table_group->tce32_size,
+ 1, /* default levels */
+ &container->tables[0]);
+ if (ret)
+ goto unlock_exit;
+ }
+
+ /* Set all windows to the new group */
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+ struct iommu_table *tbl = &container->tables[i];
+
+ if (!tbl->it_size)
+ continue;
+
+ /* Set the default window to a new group */
+ ret = table_group->ops->set_window(table_group, i, tbl);
+ if (ret)
+ break;
+ }
}

if (ret)
goto unlock_exit;

- container->grp = iommu_group;
+ tcegrp->grp = iommu_group;
+ list_add(&tcegrp->next, &container->group_list);

unlock_exit:
+ if (ret && tcegrp)
+ kfree(tcegrp);
+
mutex_unlock(&container->lock);

return ret;
@@ -828,25 +956,27 @@ static void tce_iommu_detach_group(void *iommu_data,
{
struct tce_container *container = iommu_data;
struct iommu_table_group *table_group;
+ struct tce_iommu_group *tcegrp;
long i;
+ bool found = false;

mutex_lock(&container->lock);
- if (iommu_group != container->grp) {
- pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
- iommu_group_id(iommu_group),
- iommu_group_id(container->grp));
+
+ list_for_each_entry(tcegrp, &container->group_list, next) {
+ if (tcegrp->grp == iommu_group) {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found) {
+ pr_warn("tce_vfio: detaching unattached group #%u\n",
+ iommu_group_id(iommu_group));
goto unlock_exit;
}

- if (container->enabled) {
- pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
- iommu_group_id(container->grp));
- tce_iommu_disable(container);
- }
-
- /* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
- iommu_group_id(iommu_group), iommu_group); */
- container->grp = NULL;
+ list_del(&tcegrp->next);
+ kfree(tcegrp);

table_group = iommu_group_get_iommudata(iommu_group);
BUG_ON(!table_group);
@@ -857,18 +987,8 @@ static void tce_iommu_detach_group(void *iommu_data,
else if (!table_group->ops->unset_window)
WARN_ON_ONCE(1);
else {
- for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
- struct iommu_table tbl = table_group->tables[i];
-
- if (!tbl.it_size)
- continue;
-
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i)
table_group->ops->unset_window(table_group, i);
- tce_iommu_clear(container, &tbl,
- tbl.it_offset, tbl.it_size);
- if (tbl.it_ops->free)
- tbl.it_ops->free(&tbl);
- }

table_group->ops->release_ownership(table_group);
}
--
2.0.0
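
For context, the attach/detach rework above is what lets a single VFIO container serve
several IOMMU groups. A minimal userspace sketch of that flow follows; it is illustrative
only, the group numbers and the helper name are invented, and error handling is omitted.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Hypothetical helper: share one SPAPR TCE v2 container between two groups */
static int open_shared_container(void)
{
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group1 = open("/dev/vfio/4", O_RDWR);	/* group ids are made up */
	int group2 = open("/dev/vfio/5", O_RDWR);

	/* The first group attaches and the IOMMU backend is selected */
	ioctl(group1, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_v2_IOMMU);

	/*
	 * The second group attaches to the same, already set up container;
	 * tce_iommu_attach_group() checks it is compatible with the first one.
	 */
	ioctl(group2, VFIO_GROUP_SET_CONTAINER, &container);

	return container;
}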

2015-04-25 12:22:37

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH kernel v9 32/32] vfio: powerpc/spapr: Support Dynamic DMA windows

This adds create/remove window ioctls to create and remove DMA windows.
sPAPR defines a Dynamic DMA windows capability which allows
para-virtualized guests to create additional DMA windows on a PCI bus.
Existing Linux kernels use this new window to map the entire guest
memory and switch to direct DMA operations, saving time on map/unmap
requests which would otherwise happen in big numbers.

This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and
VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows.
The hardware and this driver currently support up to 2 windows.

This changes the VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional
information such as the number of supported windows and the maximum number
of TCE table levels.

DDW is added as a capability rather than a feature unique to SPAPR TCE
IOMMU v2, as we still want to support v2 on platforms which cannot do DDW,
for the sake of TCE acceleration in KVM (coming soon).
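
To make the flow concrete, here is a rough userspace sketch of driving the two new
ioctls. It is illustrative only: it assumes a container fd already set to
VFIO_SPAPR_TCE_v2_IOMMU with a group attached, the window geometry is an arbitrary
example, and error handling is trimmed.

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int ddw_create_and_remove(int container)
{
	struct vfio_iommu_spapr_tce_create create;
	struct vfio_iommu_spapr_tce_remove remove;

	memset(&create, 0, sizeof(create));
	create.argsz = sizeof(create);
	create.page_shift = 16;			/* 64K IOMMU pages */
	create.window_size = 1ULL << 30;	/* 1GB window */
	create.levels = 1;			/* single-level TCE table */

	if (ioctl(container, VFIO_IOMMU_SPAPR_TCE_CREATE, &create))
		return -1;

	/* The platform chose the bus address; it is returned in start_addr */

	memset(&remove, 0, sizeof(remove));
	remove.argsz = sizeof(remove);
	remove.start_addr = create.start_addr;

	return ioctl(container, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
}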

Signed-off-by: Alexey Kardashevskiy <[email protected]>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <[email protected]>
---
Changes:
v7:
* s/VFIO_IOMMU_INFO_DDW/VFIO_IOMMU_SPAPR_INFO_DDW/
* fixed typos in and updated vfio.txt
* fixed VFIO_IOMMU_SPAPR_TCE_GET_INFO handler
* moved ddw properties to vfio_iommu_spapr_tce_ddw_info

v6:
* added explicit VFIO_IOMMU_INFO_DDW flag to vfio_iommu_spapr_tce_info,
it used to be page mask flags from platform code
* added explicit pgsizes field
* added cleanup if tce_iommu_create_window() failed in a middle
* added checks for callbacks in tce_iommu_create_window and remove those
from tce_iommu_remove_window when it is too late to test anyway
* spapr_tce_find_free_table returns sensible error code now
* updated description of VFIO_IOMMU_SPAPR_TCE_CREATE/
VFIO_IOMMU_SPAPR_TCE_REMOVE

v4:
* moved code to tce_iommu_create_window()/tce_iommu_remove_window()
helpers
* added docs
---
Documentation/vfio.txt | 19 ++++
arch/powerpc/include/asm/iommu.h | 2 +-
drivers/vfio/vfio_iommu_spapr_tce.c | 197 +++++++++++++++++++++++++++++++++++-
include/uapi/linux/vfio.h | 61 ++++++++++-
4 files changed, 274 insertions(+), 5 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 7dcf2b5..8b1ec51 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -452,6 +452,25 @@ address is from pre-registered range.

This separation helps in optimizing DMA for guests.

+6) The sPAPR specification allows guests to have one or more additional DMA
+windows on a PCI bus with a variable page size. Two ioctls have been added
+to support this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE.
+The platform has to support the functionality or an error will be returned to
+userspace. The existing hardware supports up to 2 DMA windows: one is 2GB
+long, uses 4K pages and is called the "default 32bit window"; the other can
+be as big as the entire RAM, can use a different page size, and is optional -
+guests create it at run time if the guest driver supports 64bit DMA.
+
+VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and
+a number of TCE table levels (useful if a TCE table is going to be big and
+the kernel may not be able to allocate enough physically contiguous memory).
+It creates a new window in an available slot and returns the bus address where
+the new window starts. Due to hardware limitations, userspace cannot choose
+the location of DMA windows.
+
+VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window
+and removes it.
+
-------------------------------------------------------------------------------

[1] VFIO was originally an acronym for "Virtual Function I/O" in its
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 9844c106..282767f 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -151,7 +151,7 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
int nid);

-#define IOMMU_TABLE_GROUP_MAX_TABLES 1
+#define IOMMU_TABLE_GROUP_MAX_TABLES 2

struct iommu_table_group;

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 970e3a2..f04c6f5 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -266,6 +266,20 @@ static void tce_iommu_disable(struct tce_container *container)
decrement_locked_vm(container->locked_pages);
}

+static int spapr_tce_find_free_table(struct tce_container *container)
+{
+ int i;
+
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+ struct iommu_table *tbl = &container->tables[i];
+
+ if (!tbl->it_size)
+ return i;
+ }
+
+ return -ENOSPC;
+}
+
static long tce_iommu_create_table(struct iommu_table_group *table_group,
int num,
__u32 page_shift,
@@ -559,11 +573,114 @@ static long tce_iommu_build_v2(struct tce_container *container,
return ret;
}

+static long tce_iommu_create_window(struct tce_container *container,
+ __u32 page_shift, __u64 window_size, __u32 levels,
+ __u64 *start_addr)
+{
+ struct tce_iommu_group *tcegrp;
+ struct iommu_table_group *table_group;
+ struct iommu_table *tbl;
+ long ret, num;
+
+ num = spapr_tce_find_free_table(container);
+ if (num < 0)
+ return num;
+
+ tbl = &container->tables[num];
+
+ /* Get the first group for ops::create_table */
+ tcegrp = list_first_entry(&container->group_list,
+ struct tce_iommu_group, next);
+ table_group = iommu_group_get_iommudata(tcegrp->grp);
+ if (!table_group)
+ return -EFAULT;
+
+ if (!(table_group->pgsizes & (1ULL << page_shift)))
+ return -EINVAL;
+
+ if (!table_group->ops->set_window || !table_group->ops->unset_window ||
+ !table_group->ops->get_table_size ||
+ !table_group->ops->create_table)
+ return -EPERM;
+
+ /* Create TCE table */
+ ret = tce_iommu_create_table(table_group, num,
+ page_shift, window_size, levels, tbl);
+ if (ret)
+ return ret;
+
+ BUG_ON(!tbl->it_ops->free);
+
+ /*
+ * Program the table to every group.
+ * Groups have been tested for compatibility at the attach time.
+ */
+ list_for_each_entry(tcegrp, &container->group_list, next) {
+ table_group = iommu_group_get_iommudata(tcegrp->grp);
+
+ ret = table_group->ops->set_window(table_group, num, tbl);
+ if (ret)
+ goto unset_exit;
+ }
+
+ /* Return start address assigned by platform in create_table() */
+ *start_addr = tbl->it_offset << tbl->it_page_shift;
+
+ return 0;
+
+unset_exit:
+ list_for_each_entry(tcegrp, &container->group_list, next) {
+ table_group = iommu_group_get_iommudata(tcegrp->grp);
+ table_group->ops->unset_window(table_group, num);
+ }
+ tce_iommu_free_table(tbl);
+
+ return ret;
+}
+
+static long tce_iommu_remove_window(struct tce_container *container,
+ __u64 start_addr)
+{
+ struct iommu_table_group *table_group = NULL;
+ struct iommu_table *tbl;
+ struct tce_iommu_group *tcegrp;
+ int num;
+
+ tbl = spapr_tce_find_table(container, start_addr);
+ if (!tbl)
+ return -EINVAL;
+
+ /* Detach groups from IOMMUs */
+ num = tbl - container->tables;
+ list_for_each_entry(tcegrp, &container->group_list, next) {
+ table_group = iommu_group_get_iommudata(tcegrp->grp);
+
+ /*
+ * SPAPR TCE IOMMU exposes the default DMA window to
+ * the guest via dma32_window_start/size of
+ * VFIO_IOMMU_SPAPR_TCE_GET_INFO. Some platforms allow
+ * the userspace to remove this window, some do not so
+ * here we check for the platform capability.
+ */
+ if (!table_group->ops || !table_group->ops->unset_window)
+ return -EPERM;
+
+ if (container->tables[num].it_size)
+ table_group->ops->unset_window(table_group, num);
+ }
+
+ /* Free table */
+ tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
+ tce_iommu_free_table(tbl);
+
+ return 0;
+}
+
static long tce_iommu_ioctl(void *iommu_data,
unsigned int cmd, unsigned long arg)
{
struct tce_container *container = iommu_data;
- unsigned long minsz;
+ unsigned long minsz, ddwsz;
long ret;

switch (cmd) {
@@ -607,6 +724,21 @@ static long tce_iommu_ioctl(void *iommu_data,
info.dma32_window_start = table_group->tce32_start;
info.dma32_window_size = table_group->tce32_size;
info.flags = 0;
+ memset(&info.ddw, 0, sizeof(info.ddw));
+
+ if (table_group->max_dynamic_windows_supported &&
+ container->v2) {
+ info.flags |= VFIO_IOMMU_SPAPR_INFO_DDW;
+ info.ddw.pgsizes = table_group->pgsizes;
+ info.ddw.max_dynamic_windows_supported =
+ table_group->max_dynamic_windows_supported;
+ info.ddw.levels = table_group->max_levels;
+ }
+
+ ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info, ddw);
+
+ if (info.argsz >= ddwsz)
+ minsz = ddwsz;

if (copy_to_user((void __user *)arg, &info, minsz))
return -EFAULT;
@@ -797,6 +929,69 @@ static long tce_iommu_ioctl(void *iommu_data,
return ret;
}

+ case VFIO_IOMMU_SPAPR_TCE_CREATE: {
+ struct vfio_iommu_spapr_tce_create create;
+
+ if (!container->v2)
+ break;
+
+ if (!tce_groups_attached(container))
+ return -ENXIO;
+
+ minsz = offsetofend(struct vfio_iommu_spapr_tce_create,
+ start_addr);
+
+ if (copy_from_user(&create, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (create.argsz < minsz)
+ return -EINVAL;
+
+ if (create.flags)
+ return -EINVAL;
+
+ mutex_lock(&container->lock);
+
+ ret = tce_iommu_create_window(container, create.page_shift,
+ create.window_size, create.levels,
+ &create.start_addr);
+
+ mutex_unlock(&container->lock);
+
+ if (!ret && copy_to_user((void __user *)arg, &create, minsz))
+ ret = -EFAULT;
+
+ return ret;
+ }
+ case VFIO_IOMMU_SPAPR_TCE_REMOVE: {
+ struct vfio_iommu_spapr_tce_remove remove;
+
+ if (!container->v2)
+ break;
+
+ if (!tce_groups_attached(container))
+ return -ENXIO;
+
+ minsz = offsetofend(struct vfio_iommu_spapr_tce_remove,
+ start_addr);
+
+ if (copy_from_user(&remove, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (remove.argsz < minsz)
+ return -EINVAL;
+
+ if (remove.flags)
+ return -EINVAL;
+
+ mutex_lock(&container->lock);
+
+ ret = tce_iommu_remove_window(container, remove.start_addr);
+
+ mutex_unlock(&container->lock);
+
+ return ret;
+ }
}

return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 8fdcfb9..dde0fe5 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -445,6 +445,23 @@ struct vfio_iommu_type1_dma_unmap {
/* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */

/*
+ * The SPAPR TCE DDW info struct provides information about
+ * the Dynamic DMA windows capability.
+ *
+ * @pgsizes contains a page size bitmask, 4K/64K/16M are supported.
+ * @max_dynamic_windows_supported tells the maximum number of windows
+ * which the platform can create.
+ * @levels tells the maximum number of levels in multi-level IOMMU tables;
+ * this allows splitting a table into smaller chunks which reduces
+ * the amount of physically contiguous memory required for the table.
+ */
+struct vfio_iommu_spapr_tce_ddw_info {
+ __u64 pgsizes; /* Bitmap of supported page sizes */
+ __u32 max_dynamic_windows_supported;
+ __u32 levels;
+};
+
+/*
* The SPAPR TCE info struct provides the information about the PCI bus
* address ranges available for DMA, these values are programmed into
* the hardware so the guest has to know that information.
@@ -454,14 +471,17 @@ struct vfio_iommu_type1_dma_unmap {
* addresses too so the window works as a filter rather than an offset
* for IOVA addresses.
*
- * A flag will need to be added if other page sizes are supported,
- * so as defined here, it is always 4k.
+ * Flags supported:
+ * - VFIO_IOMMU_SPAPR_INFO_DDW: informs the userspace that dynamic DMA windows
+ * (DDW) support is present. @ddw is only supported when DDW is present.
*/
struct vfio_iommu_spapr_tce_info {
__u32 argsz;
- __u32 flags; /* reserved for future use */
+ __u32 flags;
+#define VFIO_IOMMU_SPAPR_INFO_DDW (1 << 0) /* DDW supported */
__u32 dma32_window_start; /* 32 bit window start (bytes) */
__u32 dma32_window_size; /* 32 bit window size (bytes) */
+ struct vfio_iommu_spapr_tce_ddw_info ddw;
};

#define VFIO_IOMMU_SPAPR_TCE_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
@@ -522,6 +542,41 @@ struct vfio_iommu_spapr_register_memory {
*/
#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 18)

+/**
+ * VFIO_IOMMU_SPAPR_TCE_CREATE - _IOWR(VFIO_TYPE, VFIO_BASE + 19, struct vfio_iommu_spapr_tce_create)
+ *
+ * Creates an additional TCE table and programs it (sets a new DMA window)
+ * to every IOMMU group in the container. It receives page shift, window
+ * size and number of levels in the TCE table being created.
+ *
+ * It allocates and returns an offset on a PCI bus of the new DMA window.
+ */
+struct vfio_iommu_spapr_tce_create {
+ __u32 argsz;
+ __u32 flags;
+ /* in */
+ __u32 page_shift;
+ __u64 window_size;
+ __u32 levels;
+ /* out */
+ __u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_CREATE _IO(VFIO_TYPE, VFIO_BASE + 19)
+
+/**
+ * VFIO_IOMMU_SPAPR_TCE_REMOVE - _IOW(VFIO_TYPE, VFIO_BASE + 20, struct vfio_iommu_spapr_tce_remove)
+ *
+ * Unprograms a TCE table from all groups in the container and destroys it.
+ * It receives a PCI bus offset as a window id.
+ */
+struct vfio_iommu_spapr_tce_remove {
+ __u32 argsz;
+ __u32 flags;
+ /* in */
+ __u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_REMOVE _IO(VFIO_TYPE, VFIO_BASE + 20)
+
/* ***************************************************************** */

#endif /* _UAPIVFIO_H */
--
2.0.0
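
A short, illustrative sketch of how userspace could probe for the DDW capability with
the extended VFIO_IOMMU_SPAPR_TCE_GET_INFO before trying to create a window. The
container is assumed to be a v2 container with a group attached and the helper name
is made up; this is a sketch of the uapi added above, not part of the patch.

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int ddw_supported(int container, unsigned int page_shift)
{
	struct vfio_iommu_spapr_tce_info info;

	memset(&info, 0, sizeof(info));
	info.argsz = sizeof(info);	/* large enough to cover the new ddw field */

	if (ioctl(container, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info))
		return 0;

	if (!(info.flags & VFIO_IOMMU_SPAPR_INFO_DDW))
		return 0;

	/* ddw.pgsizes is a bitmask of supported IOMMU page sizes (4K/64K/16M) */
	return !!(info.ddw.pgsizes & (1ULL << page_shift));
}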

2015-04-27 21:05:24

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 02/32] Revert "powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically"

On Sat, 2015-04-25 at 22:14 +1000, Alexey Kardashevskiy wrote:
> This reverts commit 9e8d4a19ab66ec9e132d405357b9108a4f26efd3 as
> tce32_table has exactly the same life time as the whole PE.

scripts/checkpatch.pl would like your commit reference to appear as:

commit 9e8d4a19ab66 ("powerpc/powernv: Allocate struct pnv_ioda_pe
iommu_table dynamically")

>
> This makes use of a new iommu_reset_table() helper instead.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> arch/powerpc/include/asm/iommu.h | 3 ---
> arch/powerpc/platforms/powernv/pci-ioda.c | 35 +++++++++++++------------------
> arch/powerpc/platforms/powernv/pci.h | 2 +-
> 3 files changed, 15 insertions(+), 25 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index e2cef38..9d320e0 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -79,9 +79,6 @@ struct iommu_table {
> struct iommu_group *it_group;
> #endif
> void (*set_bypass)(struct iommu_table *tbl, bool enable);
> -#ifdef CONFIG_PPC_POWERNV
> - void *data;
> -#endif
> };
>
> /* Pure 2^n version of get_order */
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 920c252..eff26ed 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1086,10 +1086,6 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
> return;
> }
>
> - pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
> - GFP_KERNEL, hose->node);
> - pe->tce32_table->data = pe;
> -
> /* Associate it with all child devices */
> pnv_ioda_setup_same_PE(bus, pe);
>
> @@ -1295,7 +1291,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
> bus = dev->bus;
> hose = pci_bus_to_host(bus);
> phb = hose->private_data;
> - tbl = pe->tce32_table;
> + tbl = &pe->tce32_table;
> addr = tbl->it_base;
>
> opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
> @@ -1310,9 +1306,8 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
> if (rc)
> pe_warn(pe, "OPAL error %ld release DMA window\n", rc);
>
> - iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
> + iommu_reset_table(tbl, of_node_full_name(dev->dev.of_node));
> free_pages(addr, get_order(TCE32_TABLE_SIZE));
> - pe->tce32_table = NULL;
> }
>
> static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs)
> @@ -1460,10 +1455,6 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
> continue;
> }
>
> - pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
> - GFP_KERNEL, hose->node);
> - pe->tce32_table->data = pe;
> -
> /* Put PE to the list */
> mutex_lock(&phb->ioda.pe_list_mutex);
> list_add_tail(&pe->list, &phb->ioda.pe_list);
> @@ -1598,7 +1589,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
>
> pe = &phb->ioda.pe_array[pdn->pe_number];
> WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
> - set_iommu_table_base_and_group(&pdev->dev, pe->tce32_table);
> + set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
> }
>
> static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
> @@ -1625,7 +1616,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
> } else {
> dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
> set_dma_ops(&pdev->dev, &dma_iommu_ops);
> - set_iommu_table_base(&pdev->dev, pe->tce32_table);
> + set_iommu_table_base(&pdev->dev, &pe->tce32_table);
> }
> *pdev->dev.dma_mask = dma_mask;
> return 0;
> @@ -1662,9 +1653,9 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
> list_for_each_entry(dev, &bus->devices, bus_list) {
> if (add_to_iommu_group)
> set_iommu_table_base_and_group(&dev->dev,
> - pe->tce32_table);
> + &pe->tce32_table);
> else
> - set_iommu_table_base(&dev->dev, pe->tce32_table);
> + set_iommu_table_base(&dev->dev, &pe->tce32_table);
>
> if (dev->subordinate)
> pnv_ioda_setup_bus_dma(pe, dev->subordinate,
> @@ -1754,7 +1745,8 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
> void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
> __be64 *startp, __be64 *endp, bool rm)
> {
> - struct pnv_ioda_pe *pe = tbl->data;
> + struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
> + tce32_table);
> struct pnv_phb *phb = pe->phb;
>
> if (phb->type == PNV_PHB_IODA1)
> @@ -1817,7 +1809,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> }
>
> /* Setup linux iommu table */
> - tbl = pe->tce32_table;
> + tbl = &pe->tce32_table;
> pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
> base << 28, IOMMU_PAGE_SHIFT_4K);
>
> @@ -1862,7 +1854,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>
> static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
> {
> - struct pnv_ioda_pe *pe = tbl->data;
> + struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
> + tce32_table);
> uint16_t window_id = (pe->pe_number << 1 ) + 1;
> int64_t rc;
>
> @@ -1907,10 +1900,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
> pe->tce_bypass_base = 1ull << 59;
>
> /* Install set_bypass callback for VFIO */
> - pe->tce32_table->set_bypass = pnv_pci_ioda2_set_bypass;
> + pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
>
> /* Enable bypass by default */
> - pnv_pci_ioda2_set_bypass(pe->tce32_table, true);
> + pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
> }
>
> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> @@ -1958,7 +1951,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> }
>
> /* Setup linux iommu table */
> - tbl = pe->tce32_table;
> + tbl = &pe->tce32_table;
> pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
> IOMMU_PAGE_SHIFT_4K);
>
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 070ee88..c954c64 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -57,7 +57,7 @@ struct pnv_ioda_pe {
> /* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
> int tce32_seg;
> int tce32_segcount;
> - struct iommu_table *tce32_table;
> + struct iommu_table tce32_table;
> phys_addr_t tce_inval_reg_phys;
>
> /* 64-bit TCE bypass region */


2015-04-27 21:05:40

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 16/32] powerpc/powernv/ioda: Move TCE kill register address to PE

On Sat, 2015-04-25 at 22:14 +1000, Alexey Kardashevskiy wrote:
> At the moment the DMA setup code looks for the "ibm,opal-tce-kill" property
> which contains the TCE kill register address. Writes to this register
> invalidates TCE cache on IODA/IODA2 hub.
>
> This moves the register address from iommu_table to pnv_ioda_pe as
> later there will be 2 tables per PE and it will be used for both tables.
>
> This moves the property reading/remapping code to a helper to reduce
> code duplication.
>
> This adds a new pnv_pci_ioda2_tvt_invalidate() helper which invalidates
> the entire table. It should be called after every call to
> opal_pci_map_pe_dma_window(). It was not required before because
> there is just a single TCE table and 64bit DMA is handled via bypass
> window (which has no table so no cache is used) but this is going
> to change with Dynamic DMA windows (DDW).
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> Changes:
> v9:
> * new in the series
> ---
> arch/powerpc/platforms/powernv/pci-ioda.c | 69 +++++++++++++++++++------------
> arch/powerpc/platforms/powernv/pci.h | 1 +
> 2 files changed, 44 insertions(+), 26 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index f070c44..b22b3ca 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1672,7 +1672,7 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
> struct pnv_ioda_pe, table_group);
> __be64 __iomem *invalidate = rm ?
> (__be64 __iomem *)pe->tce_inval_reg_phys :
> - (__be64 __iomem *)tbl->it_index;
> + pe->tce_inval_reg;
> unsigned long start, end, inc;
> const unsigned shift = tbl->it_page_shift;
>
> @@ -1743,6 +1743,18 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
> .get = pnv_tce_get,
> };
>
> +static inline void pnv_pci_ioda2_tvt_invalidate(struct pnv_ioda_pe *pe)
> +{
> + /* 01xb - invalidate TCEs that match the specified PE# */
> + unsigned long addr = (0x4ull << 60) | (pe->pe_number & 0xFF);
> +
> + if (!pe->tce_inval_reg)
> + return;
> +
> + mb(); /* Ensure above stores are visible */


ERROR: code indent should use tabs where possible


> + __raw_writeq(cpu_to_be64(addr), pe->tce_inval_reg);
> +}
> +
> static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
> unsigned long index, unsigned long npages, bool rm)
> {
> @@ -1751,7 +1763,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
> unsigned long start, end, inc;
> __be64 __iomem *invalidate = rm ?
> (__be64 __iomem *)pe->tce_inval_reg_phys :
> - (__be64 __iomem *)tbl->it_index;
> + pe->tce_inval_reg;
> const unsigned shift = tbl->it_page_shift;
>
> /* We'll invalidate DMA address in PE scope */
> @@ -1803,13 +1815,31 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> .get = pnv_tce_get,
> };
>
> +static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
> + struct pnv_ioda_pe *pe)
> +{
> + const __be64 *swinvp;
> +
> + /* OPAL variant of PHB3 invalidated TCEs */
> + swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
> + if (!swinvp)
> + return;
> +
> + /* We need a couple more fields -- an address and a data
> + * to or. Since the bus is only printed out on table free
> + * errors, and on the first pass the data will be a relative
> + * bus number, print that out instead.
> + */
> + pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
> + pe->tce_inval_reg = ioremap(pe->tce_inval_reg_phys, 8);
> +}
> +
> static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> struct pnv_ioda_pe *pe, unsigned int base,
> unsigned int segs)
> {
>
> struct page *tce_mem = NULL;
> - const __be64 *swinvp;
> struct iommu_table *tbl;
> unsigned int i;
> int64_t rc;
> @@ -1823,6 +1853,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> if (WARN_ON(pe->tce32_seg >= 0))
> return;
>
> + pnv_pci_ioda_setup_opal_tce_kill(phb, pe);
> +
> /* Grab a 32-bit TCE table */
> pe->tce32_seg = base;
> pe_info(pe, " Setting up 32-bit TCE table at %08x..%08x\n",
> @@ -1865,20 +1897,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> base << 28, IOMMU_PAGE_SHIFT_4K);
>
> /* OPAL variant of P7IOC SW invalidated TCEs */
> - swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
> - if (swinvp) {
> - /* We need a couple more fields -- an address and a data
> - * to or. Since the bus is only printed out on table free
> - * errors, and on the first pass the data will be a relative
> - * bus number, print that out instead.
> - */
> - pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
> - tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys,
> - 8);
> + if (pe->tce_inval_reg)
> tbl->it_type |= (TCE_PCI_SWINV_CREATE |
> TCE_PCI_SWINV_FREE |
> TCE_PCI_SWINV_PAIR);
> - }
> +
> tbl->it_ops = &pnv_ioda1_iommu_ops;
> iommu_init_table(tbl, phb->hose->node);
>
> @@ -1984,7 +2007,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> {
> struct page *tce_mem = NULL;
> void *addr;
> - const __be64 *swinvp;
> struct iommu_table *tbl;
> unsigned int tce_table_size, end;
> int64_t rc;
> @@ -1993,6 +2015,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> if (WARN_ON(pe->tce32_seg >= 0))
> return;
>
> + pnv_pci_ioda_setup_opal_tce_kill(phb, pe);
> +
> /* The PE will reserve all possible 32-bits space */
> pe->tce32_seg = 0;
> end = (1 << ilog2(phb->ioda.m32_pci_base));
> @@ -2023,6 +2047,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> goto fail;
> }
>
> + pnv_pci_ioda2_tvt_invalidate(pe);
> +
> /* Setup iommu */
> pe->table_group.tables[0].it_table_group = &pe->table_group;
>
> @@ -2032,18 +2058,9 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> IOMMU_PAGE_SHIFT_4K);
>
> /* OPAL variant of PHB3 invalidated TCEs */
> - swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
> - if (swinvp) {
> - /* We need a couple more fields -- an address and a data
> - * to or. Since the bus is only printed out on table free
> - * errors, and on the first pass the data will be a relative
> - * bus number, print that out instead.
> - */
> - pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
> - tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys,
> - 8);
> + if (pe->tce_inval_reg)
> tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
> - }
> +
> tbl->it_ops = &pnv_ioda2_iommu_ops;
> iommu_init_table(tbl, phb->hose->node);
> #ifdef CONFIG_IOMMU_API
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 368d4ed..bd83d85 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -59,6 +59,7 @@ struct pnv_ioda_pe {
> int tce32_segcount;
> struct iommu_table_group table_group;
> phys_addr_t tce_inval_reg_phys;
> + __be64 __iomem *tce_inval_reg;
>
> /* 64-bit TCE bypass region */
> bool tce_bypass_enabled;


2015-04-27 22:18:27

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 30/32] vfio: powerpc/spapr: Use 32bit DMA window properties from table_group

On Sat, 2015-04-25 at 22:14 +1000, Alexey Kardashevskiy wrote:
> A table group might not have a table but it always has the default 32bit
> window parameters so use these.
>
> No change in behavior is expected.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> Changes:
> v9:
> * new in the series - to make the next patch simpler
> ---
> drivers/vfio/vfio_iommu_spapr_tce.c | 19 +++++++++++--------
> 1 file changed, 11 insertions(+), 8 deletions(-)


Acked-by: Alex Williamson <[email protected]>


> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 4cfc2c1..a7d6729 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -185,7 +185,6 @@ static int tce_iommu_enable(struct tce_container *container)
> {
> int ret = 0;
> unsigned long locked;
> - struct iommu_table *tbl;
> struct iommu_table_group *table_group;
>
> if (!container->grp)
> @@ -221,13 +220,19 @@ static int tce_iommu_enable(struct tce_container *container)
> * this is that we cannot tell here the amount of RAM used by the guest
> * as this information is only available from KVM and VFIO is
> * KVM agnostic.
> + *
> + * So we do not allow enabling a container without a group attached
> + * as there is no way to know how much we should increment
> + * the locked_vm counter.
> */
> table_group = iommu_group_get_iommudata(container->grp);
> if (!table_group)
> return -ENODEV;
>
> - tbl = &table_group->tables[0];
> - locked = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
> + if (!table_group->tce32_size)
> + return -EPERM;
> +
> + locked = table_group->tce32_size >> PAGE_SHIFT;
> ret = try_increment_locked_vm(locked);
> if (ret)
> return ret;
> @@ -504,7 +509,6 @@ static long tce_iommu_ioctl(void *iommu_data,
>
> case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
> struct vfio_iommu_spapr_tce_info info;
> - struct iommu_table *tbl;
> struct iommu_table_group *table_group;
>
> if (WARN_ON(!container->grp))
> @@ -512,8 +516,7 @@ static long tce_iommu_ioctl(void *iommu_data,
>
> table_group = iommu_group_get_iommudata(container->grp);
>
> - tbl = &table_group->tables[0];
> - if (WARN_ON_ONCE(!tbl))
> + if (!table_group)
> return -ENXIO;
>
> minsz = offsetofend(struct vfio_iommu_spapr_tce_info,
> @@ -525,8 +528,8 @@ static long tce_iommu_ioctl(void *iommu_data,
> if (info.argsz < minsz)
> return -EINVAL;
>
> - info.dma32_window_start = tbl->it_offset << tbl->it_page_shift;
> - info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
> + info.dma32_window_start = table_group->tce32_start;
> + info.dma32_window_size = table_group->tce32_size;
> info.flags = 0;
>
> if (copy_to_user((void __user *)arg, &info, minsz))


2015-04-29 03:26:09

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 01/32] powerpc/iommu: Split iommu_free_table into 2 helpers

On Sat, Apr 25, 2015 at 10:14:25PM +1000, Alexey Kardashevskiy wrote:
> The iommu_free_table helper release memory it is using (the TCE table and
> @it_map) and release the iommu_table struct as well. We might not want
> the very last step as we store iommu_table in parent structures.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>

Reviewed-by: David Gibson <[email protected]>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-29 03:26:20

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 02/32] Revert "powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically"

On Sat, Apr 25, 2015 at 10:14:26PM +1000, Alexey Kardashevskiy wrote:
> This reverts commit 9e8d4a19ab66ec9e132d405357b9108a4f26efd3 as
> tce32_table has exactly the same life time as the whole PE.
>
> This makes use of a new iommu_reset_table() helper instead.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>

Reviewed-by: David Gibson <[email protected]>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-29 03:26:13

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 08/32] vfio: powerpc/spapr: Moving pinning/unpinning to helpers

On Sat, Apr 25, 2015 at 10:14:32PM +1000, Alexey Kardashevskiy wrote:
> This is a pretty mechanical patch to make next patches simpler.
>
> New tce_iommu_unuse_page() helper does put_page() now but it might skip
> that after the memory registering patch applied.
>
> As we are here, this removes unnecessary checks for a value returned
> by pfn_to_page() as it cannot possibly return NULL.
>
> This moves tce_iommu_disable() later to let tce_iommu_clear() know if
> the container has been enabled because if it has not been, then
> put_page() must not be called on TCEs from the TCE table. This situation
> is not yet possible but it will after KVM acceleration patchset is
> applied.
>
> This changes code to work with physical addresses rather than linear
> mapping addresses for better code readability. Following patches will
> add an xchg() callback for an IOMMU table which will accept/return
> physical addresses (unlike current tce_build()) which will eliminate
> redundant conversions.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> [aw: for the vfio related changes]
> Acked-by: Alex Williamson <[email protected]>

Reviewed-by: David Gibson <[email protected]>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-29 03:27:45

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 09/32] vfio: powerpc/spapr: Rework groups attaching

On Sat, Apr 25, 2015 at 10:14:33PM +1000, Alexey Kardashevskiy wrote:
> This is to make extended ownership and multiple groups support patches
> simpler for review.
>
> This should cause no behavioural change.

Um.. this doesn't appear to be true. Previously removing a group from
an enabled container would fail with EBUSY, now it forces a disable.

>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> [aw: for the vfio related changes]
> Acked-by: Alex Williamson <[email protected]>
> Reviewed-by: David Gibson <[email protected]>
> ---
> drivers/vfio/vfio_iommu_spapr_tce.c | 40 ++++++++++++++++++++++---------------
> 1 file changed, 24 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 115d5e6..0fbe03e 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -460,16 +460,21 @@ static int tce_iommu_attach_group(void *iommu_data,
> iommu_group_id(container->tbl->it_group),
> iommu_group_id(iommu_group));
> ret = -EBUSY;
> - } else if (container->enabled) {
> + goto unlock_exit;
> + }
> +
> + if (container->enabled) {
> pr_err("tce_vfio: attaching group #%u to enabled container\n",
> iommu_group_id(iommu_group));
> ret = -EBUSY;
> - } else {
> - ret = iommu_take_ownership(tbl);
> - if (!ret)
> - container->tbl = tbl;
> + goto unlock_exit;
> }
>
> + ret = iommu_take_ownership(tbl);
> + if (!ret)
> + container->tbl = tbl;
> +
> +unlock_exit:
> mutex_unlock(&container->lock);
>
> return ret;
> @@ -487,19 +492,22 @@ static void tce_iommu_detach_group(void *iommu_data,
> pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
> iommu_group_id(iommu_group),
> iommu_group_id(tbl->it_group));
> - } else {
> - if (container->enabled) {
> - pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
> - iommu_group_id(tbl->it_group));
> - tce_iommu_disable(container);
> - }
> + goto unlock_exit;
> + }
>
> - /* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
> - iommu_group_id(iommu_group), iommu_group); */
> - container->tbl = NULL;
> - tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
> - iommu_release_ownership(tbl);
> + if (container->enabled) {
> + pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
> + iommu_group_id(tbl->it_group));
> + tce_iommu_disable(container);
> }
> +
> + /* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
> + iommu_group_id(iommu_group), iommu_group); */
> + container->tbl = NULL;
> + tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
> + iommu_release_ownership(tbl);
> +
> +unlock_exit:
> mutex_unlock(&container->lock);
> }
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-29 03:26:11

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 12/32] powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group

On Sat, Apr 25, 2015 at 10:14:36PM +1000, Alexey Kardashevskiy wrote:
> Modern IBM POWERPC systems support multiple (currently two) TCE tables
> per IOMMU group (a.k.a. PE). This adds a iommu_table_group container
> for TCE tables. Right now just one table is supported.
>
> For P5IOC2 and IODA, iommu_table_group is embedded into PE struct
> (pnv_ioda_pe and pnv_phb) and does not require iommu_free_table(), only
> iommu_reset_table().
>
> For pSeries, this replaces multiple calls of kzalloc_node() with a new
> iommu_pseries_group_alloc() helper and stores the table group struct
> pointer into the pci_dn struct. For release, a iommu_table_group_free()
> helper is added.
>
> This should cause no behavioural change.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> [aw: for the vfio related changes]
> Acked-by: Alex Williamson <[email protected]>

I'm not particularly fond of the "table_group" name, but I can't
really think of a better name for now. So,

Reviewed-by: David Gibson <[email protected]>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-29 03:27:31

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 13/32] vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control

On Sat, Apr 25, 2015 at 10:14:37PM +1000, Alexey Kardashevskiy wrote:
> This adds tce_iommu_take_ownership() and tce_iommu_release_ownership
> which call in a loop iommu_take_ownership()/iommu_release_ownership()
> for every table on the group. As there is just one now, no change in
> behaviour is expected.
>
> At the moment the iommu_table struct has a set_bypass() which enables/
> disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
> which calls this callback when external IOMMU users such as VFIO are
> about to get over a PHB.
>
> The set_bypass() callback is not really an iommu_table function but
> IOMMU/PE function. This introduces a iommu_table_group_ops struct and
> adds take_ownership()/release_ownership() callbacks to it which are
> called when an external user takes/releases control over the IOMMU.
>
> This replaces set_bypass() with ownership callbacks as it is not
> necessarily just bypass enabling, it can be something else/more
> so let's give it more generic name.
>
> The callbacks is implemented for IODA2 only. Other platforms (P5IOC2,
> IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
> The following patches will replace iommu_take_ownership/
> iommu_release_ownership calls in IODA2 with full IOMMU table release/
> create.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> [aw: for the vfio related changes]
> Acked-by: Alex Williamson <[email protected]>
> ---
> Changes:
> v9:
> * squashed "vfio: powerpc/spapr: powerpc/iommu: Rework IOMMU ownership control"
> and "vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework IOMMU ownership control"
> into a single patch
> * moved helpers with a loop through tables in a group
> to vfio_iommu_spapr_tce.c to keep the platform code free of IOMMU table
> groups as much as possible
> * added missing tce_iommu_clear() to tce_iommu_release_ownership()
> * replaced the set_ownership(enable) callback with take_ownership() and
> release_ownership()
> ---
> arch/powerpc/include/asm/iommu.h | 13 +++++-
> arch/powerpc/kernel/iommu.c | 11 ------
> arch/powerpc/platforms/powernv/pci-ioda.c | 40 +++++++++++++++----
> drivers/vfio/vfio_iommu_spapr_tce.c | 66 +++++++++++++++++++++++++++----
> 4 files changed, 103 insertions(+), 27 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index fa37519..e63419e 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -93,7 +93,6 @@ struct iommu_table {
> unsigned long it_page_shift;/* table iommu page size */
> struct iommu_table_group *it_table_group;
> struct iommu_table_ops *it_ops;
> - void (*set_bypass)(struct iommu_table *tbl, bool enable);
> };
>
> /* Pure 2^n version of get_order */
> @@ -128,11 +127,23 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>
> #define IOMMU_TABLE_GROUP_MAX_TABLES 1
>
> +struct iommu_table_group;
> +
> +struct iommu_table_group_ops {
> + /*
> + * Switches ownership from the kernel itself to an external
> + * user. While onwership is taken, the kernel cannot use IOMMU itself.

Typo in "onwership". I'd also like to see this be even more explicit
that "take" is the "core kernel -> vfio/whatever" transition and
release is the reverse.

> + */
> + void (*take_ownership)(struct iommu_table_group *table_group);
> + void (*release_ownership)(struct iommu_table_group *table_group);
> +};
> +
> struct iommu_table_group {
> #ifdef CONFIG_IOMMU_API
> struct iommu_group *group;
> #endif
> struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
> + struct iommu_table_group_ops *ops;
> };
>
> #ifdef CONFIG_IOMMU_API
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 005146b..2856d27 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -1057,13 +1057,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
>
> memset(tbl->it_map, 0xff, sz);
>
> - /*
> - * Disable iommu bypass, otherwise the user can DMA to all of
> - * our physical memory via the bypass window instead of just
> - * the pages that has been explicitly mapped into the iommu
> - */
> - if (tbl->set_bypass)
> - tbl->set_bypass(tbl, false);
>
> return 0;
> }
> @@ -1078,10 +1071,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
> /* Restore bit#0 set by iommu_init_table() */
> if (tbl->it_offset == 0)
> set_bit(0, tbl->it_map);
> -
> - /* The kernel owns the device now, we can restore the iommu bypass */
> - if (tbl->set_bypass)
> - tbl->set_bypass(tbl, true);
> }
> EXPORT_SYMBOL_GPL(iommu_release_ownership);
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 88472cb..718d5cc 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1870,10 +1870,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
> }
>
> -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
> +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
> {
> - struct pnv_ioda_pe *pe = container_of(tbl->it_table_group,
> - struct pnv_ioda_pe, table_group);
> uint16_t window_id = (pe->pe_number << 1 ) + 1;
> int64_t rc;
>
> @@ -1901,7 +1899,8 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
> * host side.
> */
> if (pe->pdev)
> - set_iommu_table_base(&pe->pdev->dev, tbl);
> + set_iommu_table_base(&pe->pdev->dev,
> + &pe->table_group.tables[0]);
> else
> pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
> }
> @@ -1917,13 +1916,35 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
> /* TVE #1 is selected by PCI address bit 59 */
> pe->tce_bypass_base = 1ull << 59;
>
> - /* Install set_bypass callback for VFIO */
> - pe->table_group.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;
> -
> /* Enable bypass by default */
> - pnv_pci_ioda2_set_bypass(&pe->table_group.tables[0], true);
> + pnv_pci_ioda2_set_bypass(pe, true);
> }
>
> +#ifdef CONFIG_IOMMU_API
> +static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
> +{
> + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> + table_group);
> +
> + iommu_take_ownership(&table_group->tables[0]);
> + pnv_pci_ioda2_set_bypass(pe, false);
> +}
> +
> +static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
> +{
> + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> + table_group);
> +
> + iommu_release_ownership(&table_group->tables[0]);
> + pnv_pci_ioda2_set_bypass(pe, true);
> +}
> +
> +static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
> + .take_ownership = pnv_ioda2_take_ownership,
> + .release_ownership = pnv_ioda2_release_ownership,
> +};
> +#endif
> +
> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> struct pnv_ioda_pe *pe)
> {
> @@ -1991,6 +2012,9 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> }
> tbl->it_ops = &pnv_ioda2_iommu_ops;
> iommu_init_table(tbl, phb->hose->node);
> +#ifdef CONFIG_IOMMU_API
> + pe->table_group.ops = &pnv_pci_ioda2_ops;
> +#endif
>
> if (pe->flags & PNV_IODA_PE_DEV) {
> iommu_register_group(&pe->table_group, phb->hose->global_number,
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 17e884a..dacc738 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -483,6 +483,43 @@ static long tce_iommu_ioctl(void *iommu_data,
> return -ENOTTY;
> }
>
> +static void tce_iommu_release_ownership(struct tce_container *container,
> + struct iommu_table_group *table_group)
> +{
> + int i;
> +
> + for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> + struct iommu_table *tbl = &table_group->tables[i];
> +
> + tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
> + if (tbl->it_map)
> + iommu_release_ownership(tbl);
> + }
> +}
> +
> +static int tce_iommu_take_ownership(struct iommu_table_group *table_group)
> +{
> + int i, j, rc = 0;
> +
> + for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> + struct iommu_table *tbl = &table_group->tables[i];
> +
> + if (!tbl->it_map)
> + continue;
> +
> + rc = iommu_take_ownership(tbl);
> + if (rc) {
> + for (j = 0; j < i; ++j)
> + iommu_release_ownership(
> + &table_group->tables[j]);
> +
> + return rc;
> + }
> + }
> +
> + return 0;
> +}
> +
> static int tce_iommu_attach_group(void *iommu_data,
> struct iommu_group *iommu_group)
> {
> @@ -515,9 +552,23 @@ static int tce_iommu_attach_group(void *iommu_data,
> goto unlock_exit;
> }
>
> - ret = iommu_take_ownership(&table_group->tables[0]);
> - if (!ret)
> - container->grp = iommu_group;
> + if (!table_group->ops || !table_group->ops->take_ownership ||
> + !table_group->ops->release_ownership) {
> + ret = tce_iommu_take_ownership(table_group);

Haven't looked at the rest of the series. I'm hoping that you're
eventually planning to replace this fallback with setting the
take_ownership call for p5ioc etc. to point to
tce_iommu_take_ownership.

> + } else {
> + /*
> + * Disable iommu bypass, otherwise the user can DMA to all of
> + * our physical memory via the bypass window instead of just
> + * the pages that has been explicitly mapped into the iommu
> + */
> + table_group->ops->take_ownership(table_group);
> + ret = 0;
> + }
> +
> + if (ret)
> + goto unlock_exit;
> +
> + container->grp = iommu_group;
>
> unlock_exit:
> mutex_unlock(&container->lock);
> @@ -530,7 +581,6 @@ static void tce_iommu_detach_group(void *iommu_data,
> {
> struct tce_container *container = iommu_data;
> struct iommu_table_group *table_group;
> - struct iommu_table *tbl;
>
> mutex_lock(&container->lock);
> if (iommu_group != container->grp) {
> @@ -553,9 +603,11 @@ static void tce_iommu_detach_group(void *iommu_data,
> table_group = iommu_group_get_iommudata(iommu_group);
> BUG_ON(!table_group);
>
> - tbl = &table_group->tables[0];
> - tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
> - iommu_release_ownership(tbl);
> + /* Kernel owns the device now, we can restore bypass */
> + if (!table_group->ops || !table_group->ops->release_ownership)
> + tce_iommu_release_ownership(container, table_group);
> + else
> + table_group->ops->release_ownership(table_group);
>
> unlock_exit:
> mutex_unlock(&container->lock);

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-29 03:26:16

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 14/32] powerpc/iommu: Fix IOMMU ownership control functions

On Sat, Apr 25, 2015 at 10:14:38PM +1000, Alexey Kardashevskiy wrote:
> This adds missing locks in iommu_take_ownership()/
> iommu_release_ownership().
>
> This marks all pages busy in iommu_table::it_map in order to catch
> errors if there is an attempt to use this table while ownership over it
> is taken.
>
> This only clears TCE content if there is no page marked busy in it_map.
> Clearing must be done outside of the table locks as iommu_clear_tce()
> called from iommu_clear_tces_and_put_pages() does this.
>
> In order to use bitmap_empty(), the existing code clears bit#0 which
> is set even in an empty table if it is bus-mapped at 0 as
> iommu_init_table() reserves page#0 to prevent buggy drivers
> from crashing when allocated page is bus-mapped at zero
> (which is correct). This restores the bit in the case of failure
> to bring the it_map to the state it was in when we called
> iommu_take_ownership().

Ah! I finally understand what all this bit#0 stuff is about. Thanks
for the explanation.

>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>

Reviewed-by: David Gibson <[email protected]>

With one small comment..


> ---
> Changes:
> v9:
> * iommu_table_take_ownership() did not return @ret (and ignored EBUSY),
> now it does return correct error.
> * updated commit log about setting bit#0 in the case of failure
>
> v5:
> * do not store bit#0 value, it has to be set for zero-based table
> anyway
> * removed test_and_clear_bit
> ---
> arch/powerpc/kernel/iommu.c | 31 +++++++++++++++++++++++++------
> 1 file changed, 25 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 2856d27..ea2c8ba 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -1045,32 +1045,51 @@ EXPORT_SYMBOL_GPL(iommu_tce_build);
>
> int iommu_take_ownership(struct iommu_table *tbl)
> {
> - unsigned long sz = (tbl->it_size + 7) >> 3;
> + unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
> + int ret = 0;
> +
> + spin_lock_irqsave(&tbl->large_pool.lock, flags);
> + for (i = 0; i < tbl->nr_pools; i++)
> + spin_lock(&tbl->pools[i].lock);

> if (tbl->it_offset == 0)
> clear_bit(0, tbl->it_map);
>
> if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
> pr_err("iommu_tce: it_map is not empty");
> - return -EBUSY;
> + ret = -EBUSY;
> + /* Restore bit#0 set by iommu_init_table() */
> + if (tbl->it_offset == 0)
> + set_bit(0, tbl->it_map);
> + } else {
> + memset(tbl->it_map, 0xff, sz);
> }
>
> - memset(tbl->it_map, 0xff, sz);
> + for (i = 0; i < tbl->nr_pools; i++)
> + spin_unlock(&tbl->pools[i].lock);

I *think* it's safe in this case, but releasing locks not in the
reverse order you acquired them makes me a bit nervous.

> + spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
>
> -
> - return 0;
> + return ret;
> }
> EXPORT_SYMBOL_GPL(iommu_take_ownership);
>
> void iommu_release_ownership(struct iommu_table *tbl)
> {
> - unsigned long sz = (tbl->it_size + 7) >> 3;
> + unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
> +
> + spin_lock_irqsave(&tbl->large_pool.lock, flags);
> + for (i = 0; i < tbl->nr_pools; i++)
> + spin_lock(&tbl->pools[i].lock);
>
> memset(tbl->it_map, 0, sz);
>
> /* Restore bit#0 set by iommu_init_table() */
> if (tbl->it_offset == 0)
> set_bit(0, tbl->it_map);
> +
> + for (i = 0; i < tbl->nr_pools; i++)
> + spin_unlock(&tbl->pools[i].lock);
> + spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
> }
> EXPORT_SYMBOL_GPL(iommu_release_ownership);
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-29 03:27:47

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 15/32] powerpc/powernv/ioda/ioda2: Rework TCE invalidation in tce_build()/tce_free()

On Sat, Apr 25, 2015 at 10:14:39PM +1000, Alexey Kardashevskiy wrote:
> The pnv_pci_ioda_tce_invalidate() helper invalidates the TCE cache. It is
> supposed to be called on IODA1/2 and not called on p5ioc2. It receives
> the start and end host addresses of the TCE table.
>
> IODA2 actually needs PCI addresses to invalidate the cache. Those
> can be calculated from host addresses but since we are going
> to implement multi-level TCE tables, calculating PCI address from
> a host address might get either tricky or ugly as TCE table remains flat
> on PCI bus but not in RAM.
>
> This moves pnv_pci_ioda_tce_invalidate() from generic pnv_tce_build/
> pnt_tce_free and defines IODA1/2-specific callbacks which call generic
> ones and do PHB-model-specific TCE cache invalidation. P5IOC2 keeps
> using generic callbacks as before.
>
> This changes pnv_pci_ioda2_tce_invalidate() to receive a TCE index and
> a number of pages, which are PCI addresses shifted by the IOMMU page shift.
>
> No change in behaviour is expected.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> Changes:
> v9:
> * removed confusing comment from commit log about unintentional calling of
> pnv_pci_ioda_tce_invalidate()
> * moved mechanical changes away to "powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table"
> * fixed bug with broken invalidation in pnv_pci_ioda2_tce_invalidate -
> @index includes @tbl->it_offset but old code added it anyway which later broke
> DDW
> ---
> arch/powerpc/platforms/powernv/pci-ioda.c | 86 +++++++++++++++++++++----------
> arch/powerpc/platforms/powernv/pci.c | 17 ++----
> 2 files changed, 64 insertions(+), 39 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 718d5cc..f070c44 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1665,18 +1665,20 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
> }
> }
>
> -static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
> - struct iommu_table *tbl,
> - __be64 *startp, __be64 *endp, bool rm)
> +static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
> + unsigned long index, unsigned long npages, bool rm)
> {
> + struct pnv_ioda_pe *pe = container_of(tbl->it_table_group,
> + struct pnv_ioda_pe, table_group);
> __be64 __iomem *invalidate = rm ?
> (__be64 __iomem *)pe->tce_inval_reg_phys :
> (__be64 __iomem *)tbl->it_index;
> unsigned long start, end, inc;
> const unsigned shift = tbl->it_page_shift;
>
> - start = __pa(startp);
> - end = __pa(endp);
> + start = __pa((__be64 *)tbl->it_base + index - tbl->it_offset);
> + end = __pa((__be64 *)tbl->it_base + index - tbl->it_offset +
> + npages - 1);

This doesn't look right. The arguments to __pa don't appear to be
addresses (since index and it_offset are in units of (TCE) pages, not
bytes).
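
For reference, what the quoted lines compute, assuming pointer arithmetic
on a __be64 * scales the index by the 8-byte entry size (sketch only):

	__be64 *tcep = (__be64 *)tbl->it_base + (index - tbl->it_offset);

	start = __pa(tcep);			/* host byte address of the first TCE */
	end   = __pa(tcep + npages - 1);	/* host byte address of the last TCE */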

>
> /* BML uses this case for p6/p7/galaxy2: Shift addr and put in node */
> if (tbl->it_busno) {
> @@ -1712,16 +1714,40 @@ static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
> */
> }
>
> +static int pnv_ioda1_tce_build(struct iommu_table *tbl, long index,
> + long npages, unsigned long uaddr,
> + enum dma_data_direction direction,
> + struct dma_attrs *attrs)
> +{
> + long ret = pnv_tce_build(tbl, index, npages, uaddr, direction,
> + attrs);
> +
> + if (!ret && (tbl->it_type & TCE_PCI_SWINV_CREATE))
> + pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false);
> +
> + return ret;
> +}
> +
> +static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
> + long npages)
> +{
> + pnv_tce_free(tbl, index, npages);
> +
> + if (tbl->it_type & TCE_PCI_SWINV_FREE)
> + pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false);
> +}
> +
> static struct iommu_table_ops pnv_ioda1_iommu_ops = {
> - .set = pnv_tce_build,
> - .clear = pnv_tce_free,
> + .set = pnv_ioda1_tce_build,
> + .clear = pnv_ioda1_tce_free,
> .get = pnv_tce_get,
> };
>
> -static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
> - struct iommu_table *tbl,
> - __be64 *startp, __be64 *endp, bool rm)
> +static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
> + unsigned long index, unsigned long npages, bool rm)
> {
> + struct pnv_ioda_pe *pe = container_of(tbl->it_table_group,
> + struct pnv_ioda_pe, table_group);
> unsigned long start, end, inc;
> __be64 __iomem *invalidate = rm ?
> (__be64 __iomem *)pe->tce_inval_reg_phys :
> @@ -1734,10 +1760,8 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
> end = start;
>
> /* Figure out the start, end and step */
> - inc = tbl->it_offset + (((u64)startp - tbl->it_base) / sizeof(u64));
> - start |= (inc << shift);
> - inc = tbl->it_offset + (((u64)endp - tbl->it_base) / sizeof(u64));
> - end |= (inc << shift);
> + start |= (index << shift);
> + end |= ((index + npages - 1) << shift);
> inc = (0x1ull << shift);
> mb();
>
> @@ -1750,22 +1774,32 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
> }
> }
>
> -void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
> - __be64 *startp, __be64 *endp, bool rm)
> +static int pnv_ioda2_tce_build(struct iommu_table *tbl, long index,
> + long npages, unsigned long uaddr,
> + enum dma_data_direction direction,
> + struct dma_attrs *attrs)
> {
> - struct pnv_ioda_pe *pe = container_of(tbl->it_table_group,
> - struct pnv_ioda_pe, table_group);
> - struct pnv_phb *phb = pe->phb;
> -
> - if (phb->type == PNV_PHB_IODA1)
> - pnv_pci_ioda1_tce_invalidate(pe, tbl, startp, endp, rm);
> - else
> - pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
> + long ret = pnv_tce_build(tbl, index, npages, uaddr, direction,
> + attrs);
> +
> + if (!ret && (tbl->it_type & TCE_PCI_SWINV_CREATE))
> + pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
> +
> + return ret;
> +}
> +
> +static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
> + long npages)
> +{
> + pnv_tce_free(tbl, index, npages);
> +
> + if (tbl->it_type & TCE_PCI_SWINV_FREE)
> + pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
> }
>
> static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> - .set = pnv_tce_build,
> - .clear = pnv_tce_free,
> + .set = pnv_ioda2_tce_build,
> + .clear = pnv_ioda2_tce_free,
> .get = pnv_tce_get,
> };
>
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index 4c3bbb1..84b4ea4 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -577,37 +577,28 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> struct dma_attrs *attrs)
> {
> u64 proto_tce = iommu_direction_to_tce_perm(direction);
> - __be64 *tcep, *tces;
> + __be64 *tcep;
> u64 rpn;
>
> - tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
> + tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
> rpn = __pa(uaddr) >> tbl->it_page_shift;
>
> while (npages--)
> *(tcep++) = cpu_to_be64(proto_tce |
> (rpn++ << tbl->it_page_shift));
>
> - /* Some implementations won't cache invalid TCEs and thus may not
> - * need that flush. We'll probably turn it_type into a bit mask
> - * of flags if that becomes the case
> - */
> - if (tbl->it_type & TCE_PCI_SWINV_CREATE)
> - pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, false);
>
> return 0;
> }
>
> void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
> {
> - __be64 *tcep, *tces;
> + __be64 *tcep;
>
> - tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
> + tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
>
> while (npages--)
> *(tcep++) = cpu_to_be64(0);
> -
> - if (tbl->it_type & TCE_PCI_SWINV_FREE)
> - pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, false);
> }
>
> unsigned long pnv_tce_get(struct iommu_table *tbl, long index)

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-29 03:26:17

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 16/32] powerpc/powernv/ioda: Move TCE kill register address to PE

On Sat, Apr 25, 2015 at 10:14:40PM +1000, Alexey Kardashevskiy wrote:
> At the moment the DMA setup code looks for the "ibm,opal-tce-kill" property
> which contains the TCE kill register address. Writes to this register
> invalidate the TCE cache on the IODA/IODA2 hub.
>
> This moves the register address from iommu_table to pnv_ioda_pe as
> later there will be 2 tables per PE and it will be used for both tables.
>
> This moves the property reading/remapping code to a helper to reduce
> code duplication.
>
> This adds a new pnv_pci_ioda2_tvt_invalidate() helper which invalidates
> the entire table. It should be called after every call to
> opal_pci_map_pe_dma_window(). It was not required before because
> there is just a single TCE table and 64bit DMA is handled via the bypass
> window (which has no table, so no cache is used) but this is going
> to change with Dynamic DMA windows (DDW).
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> Changes:
> v9:
> * new in the series
> ---
> arch/powerpc/platforms/powernv/pci-ioda.c | 69 +++++++++++++++++++------------
> arch/powerpc/platforms/powernv/pci.h | 1 +
> 2 files changed, 44 insertions(+), 26 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index f070c44..b22b3ca 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1672,7 +1672,7 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
> struct pnv_ioda_pe, table_group);
> __be64 __iomem *invalidate = rm ?
> (__be64 __iomem *)pe->tce_inval_reg_phys :
> - (__be64 __iomem *)tbl->it_index;
> + pe->tce_inval_reg;
> unsigned long start, end, inc;
> const unsigned shift = tbl->it_page_shift;
>
> @@ -1743,6 +1743,18 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
> .get = pnv_tce_get,
> };
>
> +static inline void pnv_pci_ioda2_tvt_invalidate(struct pnv_ioda_pe *pe)
> +{
> + /* 01xb - invalidate TCEs that match the specified PE# */
> + unsigned long addr = (0x4ull << 60) | (pe->pe_number & 0xFF);

This doesn't really look like an address, but rather the data you're
writing to the register.
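
Something like this would read better (hypothetical rename only):

	/* the value is invalidation data for the kill register, not an address */
	unsigned long val = (0x4ull << 60) | (pe->pe_number & 0xFF);

	__raw_writeq(cpu_to_be64(val), pe->tce_inval_reg);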

> + if (!pe->tce_inval_reg)
> + return;
> +
> + mb(); /* Ensure above stores are visible */
> + __raw_writeq(cpu_to_be64(addr), pe->tce_inval_reg);
> +}
> +
> static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
> unsigned long index, unsigned long npages, bool rm)
> {
> @@ -1751,7 +1763,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
> unsigned long start, end, inc;
> __be64 __iomem *invalidate = rm ?
> (__be64 __iomem *)pe->tce_inval_reg_phys :
> - (__be64 __iomem *)tbl->it_index;
> + pe->tce_inval_reg;
> const unsigned shift = tbl->it_page_shift;
>
> /* We'll invalidate DMA address in PE scope */
> @@ -1803,13 +1815,31 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> .get = pnv_tce_get,
> };
>
> +static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
> + struct pnv_ioda_pe *pe)
> +{
> + const __be64 *swinvp;
> +
> + /* OPAL variant of PHB3 invalidated TCEs */
> + swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
> + if (!swinvp)
> + return;
> +
> + /* We need a couple more fields -- an address and a data
> + * to or. Since the bus is only printed out on table free
> + * errors, and on the first pass the data will be a relative
> + * bus number, print that out instead.
> + */

The comment above appears to have nothing to do with the surrounding code.

> + pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
> + pe->tce_inval_reg = ioremap(pe->tce_inval_reg_phys, 8);
> +}
> +
> static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> struct pnv_ioda_pe *pe, unsigned int base,
> unsigned int segs)
> {
>
> struct page *tce_mem = NULL;
> - const __be64 *swinvp;
> struct iommu_table *tbl;
> unsigned int i;
> int64_t rc;
> @@ -1823,6 +1853,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> if (WARN_ON(pe->tce32_seg >= 0))
> return;
>
> + pnv_pci_ioda_setup_opal_tce_kill(phb, pe);
> +
> /* Grab a 32-bit TCE table */
> pe->tce32_seg = base;
> pe_info(pe, " Setting up 32-bit TCE table at %08x..%08x\n",
> @@ -1865,20 +1897,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> base << 28, IOMMU_PAGE_SHIFT_4K);
>
> /* OPAL variant of P7IOC SW invalidated TCEs */
> - swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
> - if (swinvp) {
> - /* We need a couple more fields -- an address and a data
> - * to or. Since the bus is only printed out on table free
> - * errors, and on the first pass the data will be a relative
> - * bus number, print that out instead.
> - */

.. although I guess it didn't make any more sense in its original context.

> - pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
> - tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys,
> - 8);
> + if (pe->tce_inval_reg)
> tbl->it_type |= (TCE_PCI_SWINV_CREATE |
> TCE_PCI_SWINV_FREE |
> TCE_PCI_SWINV_PAIR);
> - }
> +
> tbl->it_ops = &pnv_ioda1_iommu_ops;
> iommu_init_table(tbl, phb->hose->node);
>
> @@ -1984,7 +2007,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> {
> struct page *tce_mem = NULL;
> void *addr;
> - const __be64 *swinvp;
> struct iommu_table *tbl;
> unsigned int tce_table_size, end;
> int64_t rc;
> @@ -1993,6 +2015,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> if (WARN_ON(pe->tce32_seg >= 0))
> return;
>
> + pnv_pci_ioda_setup_opal_tce_kill(phb, pe);
> +
> /* The PE will reserve all possible 32-bits space */
> pe->tce32_seg = 0;
> end = (1 << ilog2(phb->ioda.m32_pci_base));
> @@ -2023,6 +2047,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> goto fail;
> }
>
> + pnv_pci_ioda2_tvt_invalidate(pe);
> +

This looks to be a change in behaviour - if it's replacing a previous
invalidation, I'm not seeing where.

> /* Setup iommu */
> pe->table_group.tables[0].it_table_group = &pe->table_group;
>
> @@ -2032,18 +2058,9 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> IOMMU_PAGE_SHIFT_4K);
>
> /* OPAL variant of PHB3 invalidated TCEs */
> - swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
> - if (swinvp) {
> - /* We need a couple more fields -- an address and a data
> - * to or. Since the bus is only printed out on table free
> - * errors, and on the first pass the data will be a relative
> - * bus number, print that out instead.
> - */
> - pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
> - tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys,
> - 8);
> + if (pe->tce_inval_reg)
> tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
> - }
> +
> tbl->it_ops = &pnv_ioda2_iommu_ops;
> iommu_init_table(tbl, phb->hose->node);
> #ifdef CONFIG_IOMMU_API
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 368d4ed..bd83d85 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -59,6 +59,7 @@ struct pnv_ioda_pe {
> int tce32_segcount;
> struct iommu_table_group table_group;
> phys_addr_t tce_inval_reg_phys;
> + __be64 __iomem *tce_inval_reg;
>
> /* 64-bit TCE bypass region */
> bool tce_bypass_enabled;

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-29 05:50:06

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 17/32] powerpc/powernv: Implement accessor to TCE entry

On Sat, Apr 25, 2015 at 10:14:41PM +1000, Alexey Kardashevskiy wrote:
> This replaces direct accesses to the TCE table with a helper which
> returns a TCE entry address. This does not make a difference now but will
> when multi-level TCE tables get introduced.
>
> No change in behavior is expected.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>

Reviewed-by: David Gibson <[email protected]>


> ---
> Changes:
> v9:
> * new patch in the series to separate this mechanical change from
> functional changes; this is not right before
> "powerpc/powernv: Implement multilevel TCE tables" but here in order
> to let the next patch - "powerpc/iommu/powernv: Release replaced TCE" -
> use pnv_tce() and avoid changing the same code twice
> ---
> arch/powerpc/platforms/powernv/pci.c | 34 +++++++++++++++++++++-------------
> 1 file changed, 21 insertions(+), 13 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index 84b4ea4..ba75aa5 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -572,38 +572,46 @@ struct pci_ops pnv_pci_ops = {
> .write = pnv_pci_write_config,
> };
>
> +static __be64 *pnv_tce(struct iommu_table *tbl, long idx)
> +{
> + __be64 *tmp = ((__be64 *)tbl->it_base);
> +
> + return tmp + idx;
> +}
> +
> int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> unsigned long uaddr, enum dma_data_direction direction,
> struct dma_attrs *attrs)
> {
> u64 proto_tce = iommu_direction_to_tce_perm(direction);
> - __be64 *tcep;
> - u64 rpn;
> + u64 rpn = __pa(uaddr) >> tbl->it_page_shift;

I guess this was a problem in the existing code, not this patch. But
"uaddr" is a really bad name (and unsigned long is a bad type) for
what must actually be a kernel linear mapping address.

> + long i;
>
> - tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
> - rpn = __pa(uaddr) >> tbl->it_page_shift;
> -
> - while (npages--)
> - *(tcep++) = cpu_to_be64(proto_tce |
> - (rpn++ << tbl->it_page_shift));
> + for (i = 0; i < npages; i++) {
> + unsigned long newtce = proto_tce |
> + ((rpn + i) << tbl->it_page_shift);
> + unsigned long idx = index - tbl->it_offset + i;
>
> + *(pnv_tce(tbl, idx)) = cpu_to_be64(newtce);
> + }
>
> return 0;
> }
>
> void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
> {
> - __be64 *tcep;
> + long i;
>
> - tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
> + for (i = 0; i < npages; i++) {
> + unsigned long idx = index - tbl->it_offset + i;
>
> - while (npages--)
> - *(tcep++) = cpu_to_be64(0);
> + *(pnv_tce(tbl, idx)) = cpu_to_be64(0);
> + }
> }
>
> unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
> {
> - return ((u64 *)tbl->it_base)[index - tbl->it_offset];
> + return *(pnv_tce(tbl, index - tbl->it_offset));
> }
>
> void pnv_pci_setup_iommu_table(struct iommu_table *tbl,

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-29 05:50:50

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 18/32] powerpc/iommu/powernv: Release replaced TCE

On Sat, Apr 25, 2015 at 10:14:42PM +1000, Alexey Kardashevskiy wrote:
> At the moment, writing a new TCE value to the IOMMU table fails with EBUSY
> if there is a valid entry already. However, the PAPR specification allows
> the guest to write a new TCE value without clearing the old one first.
>
> Another problem this patch is addressing is the use of pool locks for
> external IOMMU users such as VFIO. The pool locks are there to protect
> the DMA page allocator rather than the entries, and since the host kernel
> does not control what pages are in use, there is no point in the pool locks;
> exchange()+put_page(oldtce) is sufficient to avoid possible races.
>
> This adds an exchange() callback to iommu_table_ops which does the same
> thing as set() plus it returns the replaced TCE and DMA direction so
> the caller can release the pages afterwards. exchange() receives
> a physical address, unlike set() which receives a linear mapping address,
> and returns a physical address as clear() does.
>
> This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
> for a platform to have exchange() implemented in order to support VFIO.
>
> This replaces iommu_tce_build() and iommu_clear_tce() with
> a single iommu_tce_xchg().
>
> This makes sure that TCE permission bits are not set in the TCE passed to
> the IOMMU API as those are to be calculated by platform code from the DMA
> direction.
>
> This moves SetPageDirty() to the IOMMU code to make it work for both
> the VFIO ioctl interface and in-kernel TCE acceleration (when the latter
> becomes available later).
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> [aw: for the vfio related changes]
> Acked-by: Alex Williamson <[email protected]>

This looks mostly good, but there are a couple of details that need fixing.
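
As an aside, the exchange()+put_page() pattern the commit message describes
ends up looking roughly like this on the caller side (sketch, simplified
from the VFIO hunk below):

	unsigned long hpa = 0;
	enum dma_data_direction dir = DMA_NONE;

	/* swap in the new entry (or a clear), then release whatever was there */
	if (!iommu_tce_xchg(tbl, entry, &hpa, &dir) && (dir != DMA_NONE))
		put_page(pfn_to_page(hpa >> PAGE_SHIFT));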

> ---
> Changes:
> v9:
> * changed exchange() to work with physical addresses as these addresses
> are never accessed by the code and physical addresses are actual values
> we put into the IOMMU table
> ---
> arch/powerpc/include/asm/iommu.h | 22 +++++++++--
> arch/powerpc/kernel/iommu.c | 57 +++++++++-------------------
> arch/powerpc/platforms/powernv/pci-ioda.c | 34 +++++++++++++++++
> arch/powerpc/platforms/powernv/pci-p5ioc2.c | 3 ++
> arch/powerpc/platforms/powernv/pci.c | 17 +++++++++
> arch/powerpc/platforms/powernv/pci.h | 2 +
> drivers/vfio/vfio_iommu_spapr_tce.c | 58 ++++++++++++++++++-----------
> 7 files changed, 128 insertions(+), 65 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index e63419e..7e7ca0a 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -45,13 +45,29 @@ extern int iommu_is_off;
> extern int iommu_force_on;
>
> struct iommu_table_ops {
> + /*
> + * When called with direction==DMA_NONE, it is equal to clear().
> + * uaddr is a linear map address.
> + */
> int (*set)(struct iommu_table *tbl,
> long index, long npages,
> unsigned long uaddr,
> enum dma_data_direction direction,
> struct dma_attrs *attrs);
> +#ifdef CONFIG_IOMMU_API
> + /*
> + * Exchanges existing TCE with new TCE plus direction bits;
> + * returns old TCE and DMA direction mask.
> + * @tce is a physical address.
> + */
> + int (*exchange)(struct iommu_table *tbl,
> + long index,
> + unsigned long *tce,

I'd prefer to call this "address" or "paddr" or something, since it's
not a full TCE entry (which would contain permission bits).
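
i.e. something like (hypothetical naming only, same shape as the quoted
prototype):

	int (*exchange)(struct iommu_table *tbl, long index,
			unsigned long *hpa,	/* physical address, no permission bits */
			enum dma_data_direction *direction);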

> + enum dma_data_direction *direction);
> +#endif
> void (*clear)(struct iommu_table *tbl,
> long index, long npages);
> + /* get() returns a physical address */
> unsigned long (*get)(struct iommu_table *tbl, long index);
> void (*flush)(struct iommu_table *tbl);
> };
> @@ -152,6 +168,8 @@ extern void iommu_register_group(struct iommu_table_group *table_group,
> extern int iommu_add_device(struct device *dev);
> extern void iommu_del_device(struct device *dev);
> extern int __init tce_iommu_bus_notifier_init(void);
> +extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
> + unsigned long *tce, enum dma_data_direction *direction);
> #else
> static inline void iommu_register_group(struct iommu_table_group *table_group,
> int pci_domain_number,
> @@ -231,10 +249,6 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
> unsigned long npages);
> extern int iommu_tce_put_param_check(struct iommu_table *tbl,
> unsigned long ioba, unsigned long tce);
> -extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
> - unsigned long hwaddr, enum dma_data_direction direction);
> -extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
> - unsigned long entry);
>
> extern void iommu_flush_tce(struct iommu_table *tbl);
> extern int iommu_take_ownership(struct iommu_table *tbl);
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index ea2c8ba..2eaba0c 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -975,9 +975,6 @@ EXPORT_SYMBOL_GPL(iommu_tce_clear_param_check);
> int iommu_tce_put_param_check(struct iommu_table *tbl,
> unsigned long ioba, unsigned long tce)
> {
> - if (!(tce & (TCE_PCI_WRITE | TCE_PCI_READ)))
> - return -EINVAL;
> -
> if (tce & ~(IOMMU_PAGE_MASK(tbl) | TCE_PCI_WRITE | TCE_PCI_READ))
> return -EINVAL;

Since the value you're passing is now an address rather than a full
TCE, can't you remove the permission bits from this check, rather than
checking that elsewhere?
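
i.e. something like (sketch of the suggested simplification, assuming the
value is now a plain, IOMMU-page-aligned address):

	if (tce & ~IOMMU_PAGE_MASK(tbl))
		return -EINVAL;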

> @@ -995,44 +992,16 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
> }
> EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
>
> -unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
> +long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
> + unsigned long *tce, enum dma_data_direction *direction)
> {
> - unsigned long oldtce;
> - struct iommu_pool *pool = get_pool(tbl, entry);
> + long ret;
>
> - spin_lock(&(pool->lock));
> + ret = tbl->it_ops->exchange(tbl, entry, tce, direction);
>
> - oldtce = tbl->it_ops->get(tbl, entry);
> - if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
> - tbl->it_ops->clear(tbl, entry, 1);
> - else
> - oldtce = 0;
> -
> - spin_unlock(&(pool->lock));
> -
> - return oldtce;
> -}
> -EXPORT_SYMBOL_GPL(iommu_clear_tce);
> -
> -/*
> - * hwaddr is a kernel virtual address here (0xc... bazillion),
> - * tce_build converts it to a physical address.
> - */
> -int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
> - unsigned long hwaddr, enum dma_data_direction direction)
> -{
> - int ret = -EBUSY;
> - unsigned long oldtce;
> - struct iommu_pool *pool = get_pool(tbl, entry);
> -
> - spin_lock(&(pool->lock));
> -
> - oldtce = tbl->it_ops->get(tbl, entry);
> - /* Add new entry if it is not busy */
> - if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
> - ret = tbl->it_ops->set(tbl, entry, 1, hwaddr, direction, NULL);
> -
> - spin_unlock(&(pool->lock));
> + if (!ret && ((*direction == DMA_FROM_DEVICE) ||
> + (*direction == DMA_BIDIRECTIONAL)))
> + SetPageDirty(pfn_to_page(*tce >> PAGE_SHIFT));
>
> /* if (unlikely(ret))
> pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
> @@ -1041,13 +1010,23 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
>
> return ret;
> }
> -EXPORT_SYMBOL_GPL(iommu_tce_build);
> +EXPORT_SYMBOL_GPL(iommu_tce_xchg);
>
> int iommu_take_ownership(struct iommu_table *tbl)
> {
> unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
> int ret = 0;
>
> + /*
> + * VFIO does not control TCE entries allocation and the guest
> + * can write new TCEs on top of existing ones so iommu_tce_build()
> + * must be able to release old pages. This functionality
> + * requires exchange() callback defined so if it is not
> + * implemented, we disallow taking ownership over the table.
> + */
> + if (!tbl->it_ops->exchange)
> + return -EINVAL;
> +
> spin_lock_irqsave(&tbl->large_pool.lock, flags);
> for (i = 0; i < tbl->nr_pools; i++)
> spin_lock(&tbl->pools[i].lock);
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index b22b3ca..fb765af 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1728,6 +1728,20 @@ static int pnv_ioda1_tce_build(struct iommu_table *tbl, long index,
> return ret;
> }
>
> +#ifdef CONFIG_IOMMU_API
> +static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
> + unsigned long *tce, enum dma_data_direction *direction)
> +{
> + long ret = pnv_tce_xchg(tbl, index, tce, direction);
> +
> + if (!ret && (tbl->it_type &
> + (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
> + pnv_pci_ioda1_tce_invalidate(tbl, index, 1, false);
> +
> + return ret;
> +}
> +#endif
> +
> static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
> long npages)
> {
> @@ -1739,6 +1753,9 @@ static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
>
> static struct iommu_table_ops pnv_ioda1_iommu_ops = {
> .set = pnv_ioda1_tce_build,
> +#ifdef CONFIG_IOMMU_API
> + .exchange = pnv_ioda1_tce_xchg,
> +#endif
> .clear = pnv_ioda1_tce_free,
> .get = pnv_tce_get,
> };
> @@ -1800,6 +1817,20 @@ static int pnv_ioda2_tce_build(struct iommu_table *tbl, long index,
> return ret;
> }
>
> +#ifdef CONFIG_IOMMU_API
> +static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
> + unsigned long *tce, enum dma_data_direction *direction)
> +{
> + long ret = pnv_tce_xchg(tbl, index, tce, direction);
> +
> + if (!ret && (tbl->it_type &
> + (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
> + pnv_pci_ioda2_tce_invalidate(tbl, index, 1, false);
> +
> + return ret;
> +}
> +#endif
> +
> static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
> long npages)
> {
> @@ -1811,6 +1842,9 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
>
> static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> .set = pnv_ioda2_tce_build,
> +#ifdef CONFIG_IOMMU_API
> + .exchange = pnv_ioda2_tce_xchg,
> +#endif
> .clear = pnv_ioda2_tce_free,
> .get = pnv_tce_get,
> };
> diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> index a073af0..7a6fd92 100644
> --- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> +++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> @@ -85,6 +85,9 @@ static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { }
>
> static struct iommu_table_ops pnv_p5ioc2_iommu_ops = {
> .set = pnv_tce_build,
> +#ifdef CONFIG_IOMMU_API
> + .exchange = pnv_tce_xchg,
> +#endif
> .clear = pnv_tce_free,
> .get = pnv_tce_get,
> };
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index ba75aa5..e8802ac 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -598,6 +598,23 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> return 0;
> }
>
> +#ifdef CONFIG_IOMMU_API
> +int pnv_tce_xchg(struct iommu_table *tbl, long index,
> + unsigned long *tce, enum dma_data_direction *direction)
> +{
> + u64 proto_tce = iommu_direction_to_tce_perm(*direction);
> + unsigned long newtce = *tce | proto_tce;
> + unsigned long idx = index - tbl->it_offset;

Should this have a BUG_ON or WARN_ON if the supplied tce has bits set
below the page mask?
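
e.g. (sketch):

	/* catch stray low bits (permission bits or misalignment) in the address */
	if (WARN_ON(*tce & ~IOMMU_PAGE_MASK(tbl)))
		return -EINVAL;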

> + *tce = xchg(pnv_tce(tbl, idx), cpu_to_be64(newtce));
> + *tce = be64_to_cpu(*tce);
> + *direction = iommu_tce_direction(*tce);
> + *tce &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> + return 0;
> +}
> +#endif
> +
> void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
> {
> long i;
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index bd83d85..b15cce5 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -205,6 +205,8 @@ extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> unsigned long uaddr, enum dma_data_direction direction,
> struct dma_attrs *attrs);
> extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
> +extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
> + unsigned long *tce, enum dma_data_direction *direction);
> extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
>
> void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index dacc738..2d51bbf 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -239,14 +239,7 @@ static void tce_iommu_unuse_page(struct tce_container *container,
> {
> struct page *page;
>
> - if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
> - return;
> -
> page = pfn_to_page(oldtce >> PAGE_SHIFT);
> -
> - if (oldtce & TCE_PCI_WRITE)
> - SetPageDirty(page);
> -
> put_page(page);
> }
>
> @@ -255,10 +248,17 @@ static int tce_iommu_clear(struct tce_container *container,
> unsigned long entry, unsigned long pages)
> {
> unsigned long oldtce;
> + long ret;
> + enum dma_data_direction direction;
>
> for ( ; pages; --pages, ++entry) {
> - oldtce = iommu_clear_tce(tbl, entry);
> - if (!oldtce)
> + direction = DMA_NONE;
> + oldtce = 0;
> + ret = iommu_tce_xchg(tbl, entry, &oldtce, &direction);
> + if (ret)
> + continue;
> +
> + if (direction == DMA_NONE)
> continue;
>
> tce_iommu_unuse_page(container, oldtce);
> @@ -283,12 +283,13 @@ static int tce_iommu_use_page(unsigned long tce, unsigned long *hpa)
>
> static long tce_iommu_build(struct tce_container *container,
> struct iommu_table *tbl,
> - unsigned long entry, unsigned long tce, unsigned long pages)
> + unsigned long entry, unsigned long tce, unsigned long pages,
> + enum dma_data_direction direction)
> {
> long i, ret = 0;
> struct page *page;
> unsigned long hpa;
> - enum dma_data_direction direction = iommu_tce_direction(tce);
> + enum dma_data_direction dirtmp;
>
> for (i = 0; i < pages; ++i) {
> unsigned long offset = tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
> @@ -304,8 +305,8 @@ static long tce_iommu_build(struct tce_container *container,
> }
>
> hpa |= offset;
> - ret = iommu_tce_build(tbl, entry + i, (unsigned long) __va(hpa),
> - direction);
> + dirtmp = direction;
> + ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp);
> if (ret) {
> tce_iommu_unuse_page(container, hpa);
> pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
> @@ -313,6 +314,10 @@ static long tce_iommu_build(struct tce_container *container,
> tce, ret);
> break;
> }
> +
> + if (dirtmp != DMA_NONE)
> + tce_iommu_unuse_page(container, hpa);
> +
> tce += IOMMU_PAGE_SIZE(tbl);
> }
>
> @@ -377,7 +382,7 @@ static long tce_iommu_ioctl(void *iommu_data,
> case VFIO_IOMMU_MAP_DMA: {
> struct vfio_iommu_type1_dma_map param;
> struct iommu_table *tbl;
> - unsigned long tce;
> + enum dma_data_direction direction;
>
> if (!container->enabled)
> return -EPERM;
> @@ -398,24 +403,33 @@ static long tce_iommu_ioctl(void *iommu_data,
> if (!tbl)
> return -ENXIO;
>
> - if ((param.size & ~IOMMU_PAGE_MASK(tbl)) ||
> - (param.vaddr & ~IOMMU_PAGE_MASK(tbl)))
> + if (param.size & ~IOMMU_PAGE_MASK(tbl))
> + return -EINVAL;
> +
> + if (param.vaddr & (TCE_PCI_READ | TCE_PCI_WRITE))
> return -EINVAL;

This doesn't look right - the existing check against PAGE_MASK
is still correct and included the check for the permission bits as well.

> /* iova is checked by the IOMMU API */
> - tce = param.vaddr;
> if (param.flags & VFIO_DMA_MAP_FLAG_READ)
> - tce |= TCE_PCI_READ;
> - if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
> - tce |= TCE_PCI_WRITE;
> + if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
> + direction = DMA_BIDIRECTIONAL;
> + else
> + direction = DMA_TO_DEVICE;
> + else
> + if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
> + direction = DMA_FROM_DEVICE;
> + else
> + return -EINVAL;
>
> - ret = iommu_tce_put_param_check(tbl, param.iova, tce);
> + ret = iommu_tce_put_param_check(tbl, param.iova, param.vaddr);
> if (ret)
> return ret;
>
> ret = tce_iommu_build(container, tbl,
> param.iova >> tbl->it_page_shift,
> - tce, param.size >> tbl->it_page_shift);
> + param.vaddr,
> + param.size >> tbl->it_page_shift,
> + direction);
>
> iommu_flush_tce(tbl);
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-29 05:50:07

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 19/32] powerpc/powernv/ioda2: Rework iommu_table creation

On Sat, Apr 25, 2015 at 10:14:43PM +1000, Alexey Kardashevskiy wrote:
> This moves iommu_table creation to the beginning to make the following
> changes easier to review. This starts using table parameters from the
> iommu_table struct.
>
> This should cause no behavioural change.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>

Reviewed-by: David Gibson <[email protected]>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-29 05:50:54

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 20/32] powerpc/powernv/ioda2: Introduce pnv_pci_create_table/pnv_pci_free_table

On Sat, Apr 25, 2015 at 10:14:44PM +1000, Alexey Kardashevskiy wrote:
> This is a part of moving TCE table allocation into an iommu_ops
> callback to support multiple IOMMU groups per one VFIO container.
>
> This moves the table creation code to the file with common powernv-pci
> helpers as it does not do anything IODA2-specific.
>
> This adds a pnv_pci_free_table() helper to release the actual TCE table.
>
> This enforces window size to be a power of two.
>
> This should cause no behavioural change.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> Reviewed-by: David Gibson <[email protected]>
> ---
> Changes:
> v9:
> * moved helpers to the common powernv pci.c file from pci-ioda.c
> * moved bits from pnv_pci_create_table() to pnv_alloc_tce_table_pages()
> ---
> arch/powerpc/platforms/powernv/pci-ioda.c | 36 ++++++------------
> arch/powerpc/platforms/powernv/pci.c | 61 +++++++++++++++++++++++++++++++
> arch/powerpc/platforms/powernv/pci.h | 4 ++
> 3 files changed, 76 insertions(+), 25 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index a80be34..b9b3773 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1307,8 +1307,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
> if (rc)
> pe_warn(pe, "OPAL error %ld release DMA window\n", rc);
>
> - iommu_reset_table(tbl, of_node_full_name(dev->dev.of_node));
> - free_pages(addr, get_order(TCE32_TABLE_SIZE));
> + pnv_pci_free_table(tbl);
> }
>
> static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs)
> @@ -2039,10 +2038,7 @@ static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> struct pnv_ioda_pe *pe)
> {
> - struct page *tce_mem = NULL;
> - void *addr;
> struct iommu_table *tbl = &pe->table_group.tables[0];
> - unsigned int tce_table_size, end;
> int64_t rc;
>
> /* We shouldn't already have a 32-bit DMA associated */
> @@ -2053,29 +2049,20 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>
> /* The PE will reserve all possible 32-bits space */
> pe->tce32_seg = 0;
> - end = (1 << ilog2(phb->ioda.m32_pci_base));
> - tce_table_size = (end / 0x1000) * 8;
> pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
> - end);
> + phb->ioda.m32_pci_base);
>
> - /* Allocate TCE table */
> - tce_mem = alloc_pages_node(phb->hose->node, GFP_KERNEL,
> - get_order(tce_table_size));
> - if (!tce_mem) {
> - pe_err(pe, "Failed to allocate a 32-bit TCE memory\n");
> - goto fail;
> + rc = pnv_pci_create_table(&pe->table_group, pe->phb->hose->node,
> + 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base, tbl);
> + if (rc) {
> + pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc);
> + return;
> }
> - addr = page_address(tce_mem);
> - memset(addr, 0, tce_table_size);
> -
> - /* Setup iommu */
> - tbl->it_table_group = &pe->table_group;
> -
> - /* Setup linux iommu table */
> - pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
> - IOMMU_PAGE_SHIFT_4K);
>
> tbl->it_ops = &pnv_ioda2_iommu_ops;
> +
> + /* Setup iommu */
> + tbl->it_table_group = &pe->table_group;
> iommu_init_table(tbl, phb->hose->node);
> #ifdef CONFIG_IOMMU_API
> pe->table_group.ops = &pnv_pci_ioda2_ops;
> @@ -2121,8 +2108,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> fail:
> if (pe->tce32_seg >= 0)
> pe->tce32_seg = -1;
> - if (tce_mem)
> - __free_pages(tce_mem, get_order(tce_table_size));
> + pnv_pci_free_table(tbl);
> }
>
> static void pnv_ioda_setup_dma(struct pnv_phb *phb)
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index e8802ac..6bcfad5 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -20,7 +20,9 @@
> #include <linux/io.h>
> #include <linux/msi.h>
> #include <linux/iommu.h>
> +#include <linux/memblock.h>
>
> +#include <asm/mmzone.h>
> #include <asm/sections.h>
> #include <asm/io.h>
> #include <asm/prom.h>
> @@ -645,6 +647,65 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
> tbl->it_type = TCE_PCI;
> }
>
> +static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift,
> + unsigned long *tce_table_allocated)

I'm a bit confused by the tce_table_allocated parameter. What's the
circumstance where more memory is requested than required, and why
does it matter to the caller?

> +{
> + struct page *tce_mem = NULL;
> + __be64 *addr;
> + unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT;
> + unsigned long local_allocated = 1UL << (order + PAGE_SHIFT);
> +
> + tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
> + if (!tce_mem) {
> + pr_err("Failed to allocate a TCE memory, order=%d\n", order);
> + return NULL;
> + }
> + addr = page_address(tce_mem);
> + memset(addr, 0, local_allocated);
> + *tce_table_allocated = local_allocated;
> +
> + return addr;
> +}
> +
> +long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
> + __u64 bus_offset, __u32 page_shift, __u64 window_size,
> + struct iommu_table *tbl)

The table_group parameter is redundant, isn't it? It must be equal to
tbl->table_group, yes?

Or would it make more sense for this function to set
tbl->table_group? And for that matter wouldn't it make more sense for
this to set it_size as well?
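
A sketch of that alternative, purely hypothetical:

	/* let the helper wire up the table it just populated */
	tbl->it_table_group = table_group;
	tbl->it_size = tce_table_size >> 3;	/* entries, assuming 8-byte TCEs */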

> +{
> + void *addr;
> + unsigned long tce_table_allocated = 0;
> + const unsigned window_shift = ilog2(window_size);
> + unsigned entries_shift = window_shift - page_shift;
> + unsigned table_shift = entries_shift + 3;
> + const unsigned long tce_table_size = max(0x1000UL, 1UL << table_shift);

So, here you round up to 4k, then in the alloc function you round up to
PAGE_SIZE (which may or may not be the same). It's not clear to me why
there are two rounds of rounding up.

> + if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size))
> + return -EINVAL;
> +
> + /* Allocate TCE table */
> + addr = pnv_alloc_tce_table_pages(nid, table_shift,
> + &tce_table_allocated);
> +
> + /* Setup linux iommu table */
> + pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, bus_offset,
> + page_shift);
> +
> + pr_info("Created TCE table: window size = %08llx, "
> + "tablesize = %lx (%lx), start @%08llx\n",
> + window_size, tce_table_size, tce_table_allocated,
> + bus_offset);
> +
> + return 0;
> +}
> +
> +void pnv_pci_free_table(struct iommu_table *tbl)
> +{
> + if (!tbl->it_size)
> + return;
> +
> + free_pages(tbl->it_base, get_order(tbl->it_size << 3));
> + iommu_reset_table(tbl, "pnv");
> +}
> +
> static void pnv_pci_dma_dev_setup(struct pci_dev *pdev)
> {
> struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index b15cce5..e6cbbec 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -218,6 +218,10 @@ int pnv_pci_cfg_write(struct pci_dn *pdn,
> extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
> void *tce_mem, u64 tce_size,
> u64 dma_offset, unsigned page_shift);
> +extern long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
> + __u64 bus_offset, __u32 page_shift, __u64 window_size,
> + struct iommu_table *tbl);
> +extern void pnv_pci_free_table(struct iommu_table *tbl);
> extern void pnv_pci_init_p5ioc2_hub(struct device_node *np);
> extern void pnv_pci_init_ioda_hub(struct device_node *np);
> extern void pnv_pci_init_ioda2_phb(struct device_node *np);

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-29 05:51:54

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 21/32] powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window

On Sat, Apr 25, 2015 at 10:14:45PM +1000, Alexey Kardashevskiy wrote:
> This is a part of moving DMA window programming to an iommu_ops
> callback. pnv_pci_ioda2_set_window() takes an iommu_table_group as
> a first parameter (not pnv_ioda_pe) as it is going to be used as
> a callback for VFIO DDW code.
>
> This adds pnv_pci_ioda2_tvt_invalidate() to invalidate TVT as it is
> a good thing to do.

What's the TVT and why is invalidating it a good thing?

Also, it looks like it didn't add it, just moved it.

> It does not have immediate effect now as the table
> is never recreated after reboot but it will in the following patches.
>
> This should cause no behavioural change.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> Reviewed-by: David Gibson <[email protected]>

Really? I don't remember this one.

> ---
> Changes:
> v9:
> * initialize pe->table_group.tables[0] at the very end when
> tbl is fully initialized
> * moved pnv_pci_ioda2_tvt_invalidate() from earlier patch
> ---
> arch/powerpc/platforms/powernv/pci-ioda.c | 67 +++++++++++++++++++++++--------
> 1 file changed, 51 insertions(+), 16 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index b9b3773..59baa15 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1960,6 +1960,52 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
> }
>
> +static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> + struct iommu_table *tbl)
> +{
> + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> + table_group);
> + struct pnv_phb *phb = pe->phb;
> + int64_t rc;
> + const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
> + const __u64 win_size = tbl->it_size << tbl->it_page_shift;
> +
> + pe_info(pe, "Setting up window at %llx..%llx "
> + "pgsize=0x%x tablesize=0x%lx\n",
> + start_addr, start_addr + win_size - 1,
> + 1UL << tbl->it_page_shift, tbl->it_size << 3);
> +
> + tbl->it_table_group = &pe->table_group;
> +
> + /*
> + * Map TCE table through TVT. The TVE index is the PE number
> + * shifted by 1 bit for 32-bits DMA space.
> + */
> + rc = opal_pci_map_pe_dma_window(phb->opal_id,
> + pe->pe_number,
> + pe->pe_number << 1,
> + 1,
> + __pa(tbl->it_base),
> + tbl->it_size << 3,
> + 1ULL << tbl->it_page_shift);
> + if (rc) {
> + pe_err(pe, "Failed to configure TCE table, err %ld\n", rc);
> + goto fail;
> + }
> +
> + pnv_pci_ioda2_tvt_invalidate(pe);
> +
> + /* Store fully initialized *tbl (may be external) in PE */
> + pe->table_group.tables[0] = *tbl;

Hrm, a non-atomic copy of a whole structure into the array. Is that
really what you want?

> + return 0;
> +fail:
> + if (pe->tce32_seg >= 0)
> + pe->tce32_seg = -1;
> +
> + return rc;
> +}
> +
> static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
> {
> uint16_t window_id = (pe->pe_number << 1 ) + 1;
> @@ -2068,21 +2114,16 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> pe->table_group.ops = &pnv_pci_ioda2_ops;
> #endif
>
> - /*
> - * Map TCE table through TVT. The TVE index is the PE number
> - * shifted by 1 bit for 32-bits DMA space.
> - */
> - rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
> - pe->pe_number << 1, 1, __pa(tbl->it_base),
> - tbl->it_size << 3, 1ULL << tbl->it_page_shift);
> + rc = pnv_pci_ioda2_set_window(&pe->table_group, tbl);
> if (rc) {
> pe_err(pe, "Failed to configure 32-bit TCE table,"
> " err %ld\n", rc);
> - goto fail;
> + pnv_pci_free_table(tbl);
> + if (pe->tce32_seg >= 0)
> + pe->tce32_seg = -1;
> + return;
> }
>
> - pnv_pci_ioda2_tvt_invalidate(pe);
> -
> /* OPAL variant of PHB3 invalidated TCEs */
> if (pe->tce_inval_reg)
> tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
> @@ -2103,12 +2144,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> /* Also create a bypass window */
> if (!pnv_iommu_bypass_disabled)
> pnv_pci_ioda2_setup_bypass_pe(phb, pe);
> -
> - return;
> -fail:
> - if (pe->tce32_seg >= 0)
> - pe->tce32_seg = -1;
> - pnv_pci_free_table(tbl);
> }
>
> static void pnv_ioda_setup_dma(struct pnv_phb *phb)

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-29 05:52:00

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 22/32] powerpc/powernv: Implement multilevel TCE tables

On Sat, Apr 25, 2015 at 10:14:46PM +1000, Alexey Kardashevskiy wrote:
> TCE tables might get too big in the case of 4K IOMMU pages and DDW enabled
> on huge guests (hundreds of GB of RAM), so the kernel might be unable to
> allocate a contiguous chunk of physical memory to store the TCE table.
>
> To address this, the POWER8 CPU (actually, IODA2) supports multi-level TCE
> tables of up to 5 levels, which split the table into a tree of smaller subtables.
>
> This adds multi-level TCE tables support to pnv_pci_create_table()
> and pnv_pci_free_table() helpers.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> Changes:
> v9:
> * moved from ioda2 to common powernv pci code
> * fixed cleanup if allocation fails in a middle
> * removed check for the size - all boundary checks happen in the calling code
> anyway
> ---
> arch/powerpc/include/asm/iommu.h | 2 +
> arch/powerpc/platforms/powernv/pci-ioda.c | 15 +++--
> arch/powerpc/platforms/powernv/pci.c | 94 +++++++++++++++++++++++++++++--
> arch/powerpc/platforms/powernv/pci.h | 4 +-
> 4 files changed, 104 insertions(+), 11 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 7e7ca0a..0f50ee2 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -96,6 +96,8 @@ struct iommu_pool {
> struct iommu_table {
> unsigned long it_busno; /* Bus number this table belongs to */
> unsigned long it_size; /* Size of iommu table in entries */
> + unsigned long it_indirect_levels;
> + unsigned long it_level_size;
> unsigned long it_offset; /* Offset into global table */
> unsigned long it_base; /* mapped address of tce table */
> unsigned long it_index; /* which iommu table this is */
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 59baa15..cc1d09c 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1967,13 +1967,17 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> table_group);
> struct pnv_phb *phb = pe->phb;
> int64_t rc;
> + const unsigned long size = tbl->it_indirect_levels ?
> + tbl->it_level_size : tbl->it_size;
> const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
> const __u64 win_size = tbl->it_size << tbl->it_page_shift;
>
> pe_info(pe, "Setting up window at %llx..%llx "
> - "pgsize=0x%x tablesize=0x%lx\n",
> + "pgsize=0x%x tablesize=0x%lx "
> + "levels=%d levelsize=%x\n",
> start_addr, start_addr + win_size - 1,
> - 1UL << tbl->it_page_shift, tbl->it_size << 3);
> + 1UL << tbl->it_page_shift, tbl->it_size << 3,
> + tbl->it_indirect_levels + 1, tbl->it_level_size << 3);
>
> tbl->it_table_group = &pe->table_group;
>
> @@ -1984,9 +1988,9 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> rc = opal_pci_map_pe_dma_window(phb->opal_id,
> pe->pe_number,
> pe->pe_number << 1,
> - 1,
> + tbl->it_indirect_levels + 1,
> __pa(tbl->it_base),
> - tbl->it_size << 3,
> + size << 3,
> 1ULL << tbl->it_page_shift);
> if (rc) {
> pe_err(pe, "Failed to configure TCE table, err %ld\n", rc);
> @@ -2099,7 +2103,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> phb->ioda.m32_pci_base);
>
> rc = pnv_pci_create_table(&pe->table_group, pe->phb->hose->node,
> - 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base, tbl);
> + 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base,
> + POWERNV_IOMMU_DEFAULT_LEVELS, tbl);
> if (rc) {
> pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc);
> return;
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index 6bcfad5..fc129c4 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -46,6 +46,8 @@
> #define cfg_dbg(fmt...) do { } while(0)
> //#define cfg_dbg(fmt...) printk(fmt)
>
> +#define ROUND_UP(x, n) (((x) + (n) - 1ULL) & ~((n) - 1ULL))

Use the existing ALIGN_UP macro instead of creating a new one.

> #ifdef CONFIG_PCI_MSI
> static int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
> {
> @@ -577,6 +579,19 @@ struct pci_ops pnv_pci_ops = {
> static __be64 *pnv_tce(struct iommu_table *tbl, long idx)
> {
> __be64 *tmp = ((__be64 *)tbl->it_base);
> + int level = tbl->it_indirect_levels;
> + const long shift = ilog2(tbl->it_level_size);
> + unsigned long mask = (tbl->it_level_size - 1) << (level * shift);
> +
> + while (level) {
> + int n = (idx & mask) >> (level * shift);
> + unsigned long tce = be64_to_cpu(tmp[n]);
> +
> + tmp = __va(tce & ~(TCE_PCI_READ | TCE_PCI_WRITE));
> + idx &= ~mask;
> + mask >>= shift;
> + --level;
> + }
>
> return tmp + idx;
> }
> @@ -648,12 +663,18 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
> }
>
> static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift,
> + unsigned levels, unsigned long limit,
> unsigned long *tce_table_allocated)
> {
> struct page *tce_mem = NULL;
> - __be64 *addr;
> + __be64 *addr, *tmp;
> unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT;
> unsigned long local_allocated = 1UL << (order + PAGE_SHIFT);
> + unsigned entries = 1UL << (shift - 3);
> + long i;
> +
> + if (limit == *tce_table_allocated)
> + return NULL;

If this is for what I think, it seems a bit unsafe. Shouldn't it
be >=? Otherwise it could fail to trip if the limit isn't exactly
a multiple of the bottom level allocation unit.

> tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
> if (!tce_mem) {
> @@ -662,14 +683,33 @@ static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift,
> }
> addr = page_address(tce_mem);
> memset(addr, 0, local_allocated);
> - *tce_table_allocated = local_allocated;
> +
> + --levels;
> + if (!levels) {
> + /* Update tce_table_allocated with bottom level table size only */
> + *tce_table_allocated += local_allocated;
> + return addr;
> + }
> +
> + for (i = 0; i < entries; ++i) {
> + tmp = pnv_alloc_tce_table_pages(nid, shift, levels, limit,
> + tce_table_allocated);

Urgh.. it's a limited depth so it *might* be ok, but recursion is
generally avoided in the kernel, because of the very limited stack
size.

> + if (!tmp)
> + break;
> +
> + addr[i] = cpu_to_be64(__pa(tmp) |
> + TCE_PCI_READ | TCE_PCI_WRITE);
> + }

It also seems like it would make sense for this function to set
it_indirect_levels and it_level_size, rather than leaving it to the
caller.
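
i.e. roughly (sketch, assuming the allocator is handed the iommu_table):

	tbl->it_level_size = 1ULL << (shift - 3);	/* entries per level */
	tbl->it_indirect_levels = levels - 1;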

> return addr;
> }
>
> +static void pnv_free_tce_table_pages(unsigned long addr, unsigned long size,
> + unsigned level);
> +
> long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
> __u64 bus_offset, __u32 page_shift, __u64 window_size,
> - struct iommu_table *tbl)
> + __u32 levels, struct iommu_table *tbl)
> {
> void *addr;
> unsigned long tce_table_allocated = 0;
> @@ -678,16 +718,34 @@ long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
> unsigned table_shift = entries_shift + 3;
> const unsigned long tce_table_size = max(0x1000UL, 1UL << table_shift);
>
> + if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS))
> + return -EINVAL;
> +
> if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size))
> return -EINVAL;
>
> + /* Adjust direct table size from window_size and levels */
> + entries_shift = ROUND_UP(entries_shift, levels) / levels;

ROUND_UP() only works if the second parameter is a power of 2. Is
that always true for levels?

For division rounding up, the usual idiom is just ((a + (b - 1)) / b)
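
Concretely, a sketch of the adjustment using DIV_ROUND_UP() from
<linux/kernel.h>, which is safe for any divisor:

	/* split the window's entries_shift evenly across the levels, rounding up */
	entries_shift = DIV_ROUND_UP(entries_shift, levels);
	table_shift = entries_shift + 3;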


> + table_shift = entries_shift + 3;
> + table_shift = max_t(unsigned, table_shift, PAGE_SHIFT);

Does the PAGE_SHIFT rounding make sense any more? I would have
thought you'd round the level size up to page size, rather than the
whole thing.

> /* Allocate TCE table */
> addr = pnv_alloc_tce_table_pages(nid, table_shift,
> - &tce_table_allocated);
> + levels, tce_table_size, &tce_table_allocated);
> + if (!addr)
> + return -ENOMEM;
> +
> + if (tce_table_size != tce_table_allocated) {
> + pnv_free_tce_table_pages((unsigned long) addr,
> + tbl->it_level_size, tbl->it_indirect_levels);
> + return -ENOMEM;
> + }
>
> /* Setup linux iommu table */
> pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, bus_offset,
> page_shift);
> + tbl->it_level_size = 1ULL << (table_shift - 3);
> + tbl->it_indirect_levels = levels - 1;
>
> pr_info("Created TCE table: window size = %08llx, "
> "tablesize = %lx (%lx), start @%08llx\n",
> @@ -697,12 +755,38 @@ long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
> return 0;
> }
>
> +static void pnv_free_tce_table_pages(unsigned long addr, unsigned long size,
> + unsigned level)
> +{
> + addr &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> + if (level) {
> + long i;
> + u64 *tmp = (u64 *) addr;
> +
> + for (i = 0; i < size; ++i) {
> + unsigned long hpa = be64_to_cpu(tmp[i]);
> +
> + if (!(hpa & (TCE_PCI_READ | TCE_PCI_WRITE)))
> + continue;
> +
> + pnv_free_tce_table_pages((unsigned long) __va(hpa),
> + size, level - 1);
> + }
> + }
> +
> + free_pages(addr, get_order(size << 3));
> +}
> +
> void pnv_pci_free_table(struct iommu_table *tbl)
> {
> + const unsigned long size = tbl->it_indirect_levels ?
> + tbl->it_level_size : tbl->it_size;
> +
> if (!tbl->it_size)
> return;
>
> - free_pages(tbl->it_base, get_order(tbl->it_size << 3));
> + pnv_free_tce_table_pages(tbl->it_base, size, tbl->it_indirect_levels);
> iommu_reset_table(tbl, "pnv");
> }
>
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index e6cbbec..3d1ff584 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -218,9 +218,11 @@ int pnv_pci_cfg_write(struct pci_dn *pdn,
> extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
> void *tce_mem, u64 tce_size,
> u64 dma_offset, unsigned page_shift);
> +#define POWERNV_IOMMU_DEFAULT_LEVELS 1
> +#define POWERNV_IOMMU_MAX_LEVELS 5
> extern long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
> __u64 bus_offset, __u32 page_shift, __u64 window_size,
> - struct iommu_table *tbl);
> + __u32 levels, struct iommu_table *tbl);
> extern void pnv_pci_free_table(struct iommu_table *tbl);
> extern void pnv_pci_init_p5ioc2_hub(struct device_node *np);
> extern void pnv_pci_init_ioda_hub(struct device_node *np);

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


2015-04-29 05:50:09

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 23/32] powerpc/powernv/ioda: Define and implement DMA table/window management callbacks

On Sat, Apr 25, 2015 at 10:14:47PM +1000, Alexey Kardashevskiy wrote:
> This extends iommu_table_group_ops by a set of callbacks to support
> dynamic DMA windows management.
>
> create_table() creates a TCE table with specific parameters.
> it receives iommu_table_group to know nodeid in order to allocate
> TCE table memory closer to the PHB. The exact format of allocated
> multi-level table might be also specific to the PHB model (not
> the case now though).
> This callback calculated the DMA window offset on a PCI bus from @num
> and stores it in a just created table.
>
> set_window() sets the window at specified TVT index + @num on PHB.
>
> unset_window() unsets the window from specified TVT.
>
> This adds a free() callback to iommu_table_ops to free the memory
> (potentially a tree of tables) allocated for the TCE table.

Doesn't the free callback belong with the previous patch introducing
multi-level tables?

> create_table() and free() are supposed to be called once per
> VFIO container and set_window()/unset_window() are supposed to be
> called for every group in a container.
>
> This adds IOMMU capabilities to iommu_table_group such as default
> 32bit window parameters and others.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> arch/powerpc/include/asm/iommu.h | 19 ++++++++
> arch/powerpc/platforms/powernv/pci-ioda.c | 75 ++++++++++++++++++++++++++---
> arch/powerpc/platforms/powernv/pci-p5ioc2.c | 12 +++--
> 3 files changed, 96 insertions(+), 10 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 0f50ee2..7694546 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -70,6 +70,7 @@ struct iommu_table_ops {
> /* get() returns a physical address */
> unsigned long (*get)(struct iommu_table *tbl, long index);
> void (*flush)(struct iommu_table *tbl);
> + void (*free)(struct iommu_table *tbl);
> };
>
> /* These are used by VIO */
> @@ -148,6 +149,17 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
> struct iommu_table_group;
>
> struct iommu_table_group_ops {
> + long (*create_table)(struct iommu_table_group *table_group,
> + int num,
> + __u32 page_shift,
> + __u64 window_size,
> + __u32 levels,
> + struct iommu_table *tbl);
> + long (*set_window)(struct iommu_table_group *table_group,
> + int num,
> + struct iommu_table *tblnew);
> + long (*unset_window)(struct iommu_table_group *table_group,
> + int num);
> /*
> * Switches ownership from the kernel itself to an external
> * user. While onwership is taken, the kernel cannot use IOMMU itself.
> @@ -160,6 +172,13 @@ struct iommu_table_group {
> #ifdef CONFIG_IOMMU_API
> struct iommu_group *group;
> #endif
> + /* Some key properties of IOMMU */
> + __u32 tce32_start;
> + __u32 tce32_size;
> + __u64 pgsizes; /* Bitmap of supported page sizes */
> + __u32 max_dynamic_windows_supported;
> + __u32 max_levels;

With this information, table_group seems even more like a bad name.
"iommu_state" maybe?

> struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
> struct iommu_table_group_ops *ops;
> };
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index cc1d09c..4828837 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -24,6 +24,7 @@
> #include <linux/msi.h>
> #include <linux/memblock.h>
> #include <linux/iommu.h>
> +#include <linux/sizes.h>
>
> #include <asm/sections.h>
> #include <asm/io.h>
> @@ -1846,6 +1847,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> #endif
> .clear = pnv_ioda2_tce_free,
> .get = pnv_tce_get,
> + .free = pnv_pci_free_table,
> };
>
> static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
> @@ -1936,6 +1938,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> TCE_PCI_SWINV_PAIR);
>
> tbl->it_ops = &pnv_ioda1_iommu_ops;
> + pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift;
> + pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift;
> iommu_init_table(tbl, phb->hose->node);
>
> if (pe->flags & PNV_IODA_PE_DEV) {
> @@ -1961,7 +1965,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> }
>
> static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> - struct iommu_table *tbl)
> + int num, struct iommu_table *tbl)
> {
> struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> table_group);
> @@ -1972,9 +1976,10 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
> const __u64 win_size = tbl->it_size << tbl->it_page_shift;
>
> - pe_info(pe, "Setting up window at %llx..%llx "
> + pe_info(pe, "Setting up window#%d at %llx..%llx "
> "pgsize=0x%x tablesize=0x%lx "
> "levels=%d levelsize=%x\n",
> + num,
> start_addr, start_addr + win_size - 1,
> 1UL << tbl->it_page_shift, tbl->it_size << 3,
> tbl->it_indirect_levels + 1, tbl->it_level_size << 3);
> @@ -1987,7 +1992,7 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> */
> rc = opal_pci_map_pe_dma_window(phb->opal_id,
> pe->pe_number,
> - pe->pe_number << 1,
> + (pe->pe_number << 1) + num,

Heh, yes, well, that makes it rather clear that only 2 tables are possible.
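
For concreteness, the TVE index computed by the hunk above is (window_id is
just an illustrative name):

	/* bit 0 selects the window, the remaining bits carry the PE number */
	window_id = (pe->pe_number << 1) + num;

so num can only ever be 0 or 1, i.e. at most two windows per PE with this
encoding.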

> tbl->it_indirect_levels + 1,
> __pa(tbl->it_base),
> size << 3,
> @@ -2000,7 +2005,7 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> pnv_pci_ioda2_tvt_invalidate(pe);
>
> /* Store fully initialized *tbl (may be external) in PE */
> - pe->table_group.tables[0] = *tbl;
> + pe->table_group.tables[num] = *tbl;

I'm a bit confused by this whole set_window thing. Is the idea that
with multiple groups in a container you have multiple table_groups,
each with different copies of the iommu_table structures, but pointing
to the same actual TCE entries (it_base)? It seems to me not terribly
obvious when you "create" a table and when you "set" a window.
It's also kind of hard to assess whether the relative lifetimes of the
table_group, struct iommu_table and the actual TCE tables are correct.

Would it make more sense for table_group to become the
non-vfio-specific counterpart to the vfio container?
i.e. representing one set of DMA mappings, which one or more PEs could
be bound to.

> return 0;
> fail:
> @@ -2061,6 +2066,53 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
> }
>
> #ifdef CONFIG_IOMMU_API
> +static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
> + int num, __u32 page_shift, __u64 window_size, __u32 levels,
> + struct iommu_table *tbl)
> +{
> + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> + table_group);
> + int nid = pe->phb->hose->node;
> + __u64 bus_offset = num ? pe->tce_bypass_base : 0;
> + long ret;
> +
> + ret = pnv_pci_create_table(table_group, nid, bus_offset, page_shift,
> + window_size, levels, tbl);
> + if (ret)
> + return ret;
> +
> + tbl->it_ops = &pnv_ioda2_iommu_ops;
> + if (pe->tce_inval_reg)
> + tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
> +
> + return 0;
> +}
> +
> +static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
> + int num)
> +{
> + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> + table_group);
> + struct pnv_phb *phb = pe->phb;
> + struct iommu_table *tbl = &pe->table_group.tables[num];
> + long ret;
> +
> + pe_info(pe, "Removing DMA window #%d\n", num);
> +
> + ret = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
> + (pe->pe_number << 1) + num,
> + 0/* levels */, 0/* table address */,
> + 0/* table size */, 0/* page size */);
> + if (ret)
> + pe_warn(pe, "Unmapping failed, ret = %ld\n", ret);
> + else
> + pnv_pci_ioda2_tvt_invalidate(pe);
> +
> + memset(tbl, 0, sizeof(*tbl));
> +
> + return ret;
> +}
> +
> static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
> {
> struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> @@ -2080,6 +2132,9 @@ static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
> }
>
> static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
> + .create_table = pnv_pci_ioda2_create_table,
> + .set_window = pnv_pci_ioda2_set_window,
> + .unset_window = pnv_pci_ioda2_unset_window,
> .take_ownership = pnv_ioda2_take_ownership,
> .release_ownership = pnv_ioda2_release_ownership,
> };
> @@ -2102,8 +2157,16 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
> phb->ioda.m32_pci_base);
>
> + pe->table_group.tce32_start = 0;
> + pe->table_group.tce32_size = phb->ioda.m32_pci_base;
> + pe->table_group.max_dynamic_windows_supported =
> + IOMMU_TABLE_GROUP_MAX_TABLES;
> + pe->table_group.max_levels = POWERNV_IOMMU_MAX_LEVELS;
> + pe->table_group.pgsizes = SZ_4K | SZ_64K | SZ_16M;
> +
> rc = pnv_pci_create_table(&pe->table_group, pe->phb->hose->node,
> - 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base,
> + pe->table_group.tce32_start, IOMMU_PAGE_SHIFT_4K,
> + pe->table_group.tce32_size,
> POWERNV_IOMMU_DEFAULT_LEVELS, tbl);
> if (rc) {
> pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc);
> @@ -2119,7 +2182,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> pe->table_group.ops = &pnv_pci_ioda2_ops;
> #endif
>
> - rc = pnv_pci_ioda2_set_window(&pe->table_group, tbl);
> + rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
> if (rc) {
> pe_err(pe, "Failed to configure 32-bit TCE table,"
> " err %ld\n", rc);
> diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> index 7a6fd92..d9de4c7 100644
> --- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> +++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> @@ -116,6 +116,8 @@ static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
> u64 phb_id;
> int64_t rc;
> static int primary = 1;
> + struct iommu_table_group *table_group;
> + struct iommu_table *tbl;
>
> pr_info(" Initializing p5ioc2 PHB %s\n", np->full_name);
>
> @@ -181,14 +183,16 @@ static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
> pnv_pci_init_p5ioc2_msis(phb);
>
> /* Setup iommu */
> - phb->p5ioc2.table_group.tables[0].it_table_group =
> - &phb->p5ioc2.table_group;
> + table_group = &phb->p5ioc2.table_group;
> + tbl = &phb->p5ioc2.table_group.tables[0];
> + tbl->it_table_group = table_group;
>
> /* Setup TCEs */
> phb->dma_dev_setup = pnv_pci_p5ioc2_dma_dev_setup;
> - pnv_pci_setup_iommu_table(&phb->p5ioc2.table_group.tables[0],
> - tce_mem, tce_size, 0,
> + pnv_pci_setup_iommu_table(tbl, tce_mem, tce_size, 0,
> IOMMU_PAGE_SHIFT_4K);
> + table_group->tce32_start = tbl->it_offset << tbl->it_page_shift;
> + table_group->tce32_size = tbl->it_size << tbl->it_page_shift;

Doesn't pgsizes need to be set here (although it will only include 4K,
I'm assuming)?
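
Presumably a single line mirroring the IODA2 setup, e.g. (an assumption on
my part, since p5ioc2 here only uses 4K IOMMU pages):

	table_group->pgsizes = SZ_4K;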

> }
>
> void __init pnv_pci_init_p5ioc2_hub(struct device_node *np)

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


2015-04-29 05:50:57

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 25/32] vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework ownership

On Sat, Apr 25, 2015 at 10:14:49PM +1000, Alexey Kardashevskiy wrote:
> Before the IOMMU user (VFIO) would take control over the IOMMU table
> belonging to a specific IOMMU group. This approach did not allow sharing
> tables between IOMMU groups attached to the same container.
>
> This introduces a new IOMMU ownership flavour when the user can not
> just control the existing IOMMU table but remove/create tables on demand.
> If an IOMMU implements take/release_ownership() callbacks, this lets
> the user have full control over the IOMMU group. When the ownership is taken,
> the platform code removes all the windows so the caller must create them.
> Before returning the ownership back to the platform code, VFIO
> unprograms and removes all the tables it created.
>
> This changes IODA2's onwership handler to remove the existing table

"onwership"

> rather than manipulating with the existing one. From now on,
> iommu_take_ownership() and iommu_release_ownership() are only called
> from the vfio_iommu_spapr_tce driver.
>
> In tce_iommu_detach_group(), this copies a iommu_table descriptor on stack
> as IODA2's unset_window() will clear the descriptor embedded into PE
> and we will not be able to free the table afterwards.
> This is a transitional hack and following patches will replace this code
> anyway.
>
> Old-style ownership is still supported allowing VFIO to run on older
> P5IOC2 and IODA IO controllers.
>
> No change in userspace-visible behaviour is expected. Since it recreates
> TCE tables on each ownership change, related kernel traces will appear
> more often.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> [aw: for the vfio related changes]
> Acked-by: Alex Williamson <[email protected]>
> ---
> Changes:
> v9:
> * fixed crash in tce_iommu_detach_group() on tbl->it_ops->free as
> tce_iommu_attach_group() used to initialize the table from a descriptor
> on stack (it does not matter for the series as this bit is changed later anyway
> but it ruing bisectability)
>
> v6:
> * fixed commit log that VFIO removes tables before passing ownership
> back to the platform code, not userspace
>
> 1
> ---
> arch/powerpc/platforms/powernv/pci-ioda.c | 27 +++++++++++++++++++++++--
> drivers/vfio/vfio_iommu_spapr_tce.c | 33 +++++++++++++++++++++++++++++--
> 2 files changed, 56 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 2a4b2b2..45bc131 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2105,16 +2105,39 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
> struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> table_group);
>
> - iommu_take_ownership(&table_group->tables[0]);
> pnv_pci_ioda2_set_bypass(pe, false);
> + pnv_pci_ioda2_unset_window(&pe->table_group, 0);
> + pnv_pci_free_table(&pe->table_group.tables[0]);
> }
>
> static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
> {
> struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> table_group);
> + struct iommu_table *tbl = &pe->table_group.tables[0];
> + int64_t rc;
> +
> + rc = pnv_pci_ioda2_create_table(&pe->table_group, 0,
> + IOMMU_PAGE_SHIFT_4K,
> + pe->phb->ioda.m32_pci_base,
> + POWERNV_IOMMU_DEFAULT_LEVELS, tbl);
> + if (rc) {
> + pe_err(pe, "Failed to create 32-bit TCE table, err %ld",
> + rc);
> + return;
> + }
> +
> + tbl->it_table_group = &pe->table_group;
> + iommu_init_table(tbl, pe->phb->hose->node);
> +
> + rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
> + if (rc) {
> + pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
> + rc);
> + pnv_pci_free_table(tbl);
> + return;
> + }

It seems like you want a helper function called both here and in the
initial PE setup. Otherwise you encourage future bugs where the
initial PE setup changes, but taking and releasing IOMMU ownership
from VFIO no longer sets up exactly the same thing again.
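
Something along these lines, perhaps (a sketch only;
pnv_pci_ioda2_setup_default_window() is a hypothetical name and error
reporting is left to the callers):

	static long pnv_pci_ioda2_setup_default_window(struct pnv_ioda_pe *pe)
	{
		struct iommu_table *tbl = &pe->table_group.tables[0];
		long rc;

		rc = pnv_pci_ioda2_create_table(&pe->table_group, 0,
				IOMMU_PAGE_SHIFT_4K, pe->phb->ioda.m32_pci_base,
				POWERNV_IOMMU_DEFAULT_LEVELS, tbl);
		if (rc)
			return rc;

		tbl->it_table_group = &pe->table_group;
		iommu_init_table(tbl, pe->phb->hose->node);

		rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
		if (rc)
			pnv_pci_free_table(tbl);

		return rc;
	}

Both pnv_pci_ioda2_setup_dma_pe() and the release path could then call it,
which keeps the two setups from drifting apart.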

> - iommu_release_ownership(&table_group->tables[0]);
> pnv_pci_ioda2_set_bypass(pe, true);
> }
>
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 2d51bbf..892a584 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -569,6 +569,10 @@ static int tce_iommu_attach_group(void *iommu_data,
> if (!table_group->ops || !table_group->ops->take_ownership ||
> !table_group->ops->release_ownership) {
> ret = tce_iommu_take_ownership(table_group);
> + } else if (!table_group->ops->create_table ||
> + !table_group->ops->set_window) {
> + WARN_ON_ONCE(1);
> + ret = -EFAULT;
> } else {
> /*
> * Disable iommu bypass, otherwise the user can DMA to all of
> @@ -576,7 +580,15 @@ static int tce_iommu_attach_group(void *iommu_data,
> * the pages that has been explicitly mapped into the iommu
> */
> table_group->ops->take_ownership(table_group);
> - ret = 0;
> + ret = table_group->ops->create_table(table_group,
> + 0, /* window number */
> + IOMMU_PAGE_SHIFT_4K,
> + table_group->tce32_size,
> + 1, /* default levels */
> + &table_group->tables[0]);
> + if (!ret)
> + ret = table_group->ops->set_window(table_group, 0,
> + &table_group->tables[0]);
> }
>
> if (ret)
> @@ -595,6 +607,7 @@ static void tce_iommu_detach_group(void *iommu_data,
> {
> struct tce_container *container = iommu_data;
> struct iommu_table_group *table_group;
> + long i;
>
> mutex_lock(&container->lock);
> if (iommu_group != container->grp) {
> @@ -620,8 +633,24 @@ static void tce_iommu_detach_group(void *iommu_data,
> /* Kernel owns the device now, we can restore bypass */
> if (!table_group->ops || !table_group->ops->release_ownership)
> tce_iommu_release_ownership(container, table_group);
> - else
> + else if (!table_group->ops->unset_window)
> + WARN_ON_ONCE(1);
> + else {
> + for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> + struct iommu_table tbl = table_group->tables[i];
> +
> + if (!tbl.it_size)
> + continue;
> +
> + table_group->ops->unset_window(table_group, i);
> + tce_iommu_clear(container, &tbl,
> + tbl.it_offset, tbl.it_size);
> + if (tbl.it_ops->free)
> + tbl.it_ops->free(&tbl);
> + }
> +
> table_group->ops->release_ownership(table_group);
> + }
>
> unlock_exit:
> mutex_unlock(&container->lock);

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


2015-04-29 22:33:56

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table

On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote:
> In order to support memory pre-registration, we need a way to track
> the use of every registered memory region and only allow unregistration
> if a region is not in use anymore. So we need a way to tell from what
> region the just cleared TCE was from.
>
> This adds a userspace view of the TCE table into iommu_table struct.
> It contains userspace address, one per TCE entry. The table is only
> allocated when the ownership over an IOMMU group is taken which means
> it is only used from outside of the powernv code (such as VFIO).
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> Changes:
> v9:
> * fixed code flow in error cases added in v8
>
> v8:
> * added ENOMEM on failed vzalloc()
> ---
> arch/powerpc/include/asm/iommu.h | 6 ++++++
> arch/powerpc/kernel/iommu.c | 18 ++++++++++++++++++
> arch/powerpc/platforms/powernv/pci-ioda.c | 22 ++++++++++++++++++++--
> 3 files changed, 44 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 7694546..1472de3 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -111,9 +111,15 @@ struct iommu_table {
> unsigned long *it_map; /* A simple allocation bitmap for now */
> unsigned long it_page_shift;/* table iommu page size */
> struct iommu_table_group *it_table_group;
> + unsigned long *it_userspace; /* userspace view of the table */

A single unsigned long doesn't seem like enough. How do you know
which process's address space this address refers to?

> struct iommu_table_ops *it_ops;
> };
>
> +#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
> + ((tbl)->it_userspace ? \
> + &((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \
> + NULL)
> +
> /* Pure 2^n version of get_order */
> static inline __attribute_const__
> int get_iommu_order(unsigned long size, struct iommu_table *tbl)
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 2eaba0c..74a3f52 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -38,6 +38,7 @@
> #include <linux/pci.h>
> #include <linux/iommu.h>
> #include <linux/sched.h>
> +#include <linux/vmalloc.h>
> #include <asm/io.h>
> #include <asm/prom.h>
> #include <asm/iommu.h>
> @@ -739,6 +740,8 @@ void iommu_reset_table(struct iommu_table *tbl, const char *node_name)
> free_pages((unsigned long) tbl->it_map, order);
> }
>
> + WARN_ON(tbl->it_userspace);
> +
> memset(tbl, 0, sizeof(*tbl));
> }
>
> @@ -1016,6 +1019,7 @@ int iommu_take_ownership(struct iommu_table *tbl)
> {
> unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
> int ret = 0;
> + unsigned long *uas;
>
> /*
> * VFIO does not control TCE entries allocation and the guest
> @@ -1027,6 +1031,10 @@ int iommu_take_ownership(struct iommu_table *tbl)
> if (!tbl->it_ops->exchange)
> return -EINVAL;
>
> + uas = vzalloc(sizeof(*uas) * tbl->it_size);
> + if (!uas)
> + return -ENOMEM;
> +
> spin_lock_irqsave(&tbl->large_pool.lock, flags);
> for (i = 0; i < tbl->nr_pools; i++)
> spin_lock(&tbl->pools[i].lock);
> @@ -1044,6 +1052,13 @@ int iommu_take_ownership(struct iommu_table *tbl)
> memset(tbl->it_map, 0xff, sz);
> }
>
> + if (ret) {
> + vfree(uas);
> + } else {
> + BUG_ON(tbl->it_userspace);
> + tbl->it_userspace = uas;
> + }
> +
> for (i = 0; i < tbl->nr_pools; i++)
> spin_unlock(&tbl->pools[i].lock);
> spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
> @@ -1056,6 +1071,9 @@ void iommu_release_ownership(struct iommu_table *tbl)
> {
> unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
>
> + vfree(tbl->it_userspace);
> + tbl->it_userspace = NULL;
> +
> spin_lock_irqsave(&tbl->large_pool.lock, flags);
> for (i = 0; i < tbl->nr_pools; i++)
> spin_lock(&tbl->pools[i].lock);
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 45bc131..e0be556 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -25,6 +25,7 @@
> #include <linux/memblock.h>
> #include <linux/iommu.h>
> #include <linux/sizes.h>
> +#include <linux/vmalloc.h>
>
> #include <asm/sections.h>
> #include <asm/io.h>
> @@ -1827,6 +1828,14 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
> pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
> }
>
> +void pnv_pci_ioda2_free_table(struct iommu_table *tbl)
> +{
> + vfree(tbl->it_userspace);
> + tbl->it_userspace = NULL;
> +
> + pnv_pci_free_table(tbl);
> +}
> +
> static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> .set = pnv_ioda2_tce_build,
> #ifdef CONFIG_IOMMU_API
> @@ -1834,7 +1843,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> #endif
> .clear = pnv_ioda2_tce_free,
> .get = pnv_tce_get,
> - .free = pnv_pci_free_table,
> + .free = pnv_pci_ioda2_free_table,
> };
>
> static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
> @@ -2062,12 +2071,21 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
> int nid = pe->phb->hose->node;
> __u64 bus_offset = num ? pe->tce_bypass_base : 0;
> long ret;
> + unsigned long *uas, uas_cb = sizeof(*uas) * (window_size >> page_shift);
> +
> + uas = vzalloc(uas_cb);
> + if (!uas)
> + return -ENOMEM;

I don't see why this is allocated both here and in take_ownership.
Isn't this function used for core-kernel users of the iommu as well,
in which case it shouldn't need the it_userspace?


> ret = pnv_pci_create_table(table_group, nid, bus_offset, page_shift,
> window_size, levels, tbl);
> - if (ret)
> + if (ret) {
> + vfree(uas);
> return ret;
> + }
>
> + BUG_ON(tbl->it_userspace);
> + tbl->it_userspace = uas;
> tbl->it_ops = &pnv_ioda2_iommu_ops;
> if (pe->tce_inval_reg)
> tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


2015-04-29 22:33:24

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote:
> This adds a way for the IOMMU user to know how much a new table will
> use so it can be accounted in the locked_vm limit before allocation
> happens.
>
> This stores the allocated table size in pnv_pci_create_table()
> so the locked_vm counter can be updated correctly when a table is
> being disposed.
>
> This defines an iommu_table_group_ops callback to let VFIO know
> how much memory will be locked if a table is created.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> Changes:
> v9:
> * reimplemented the whole patch
> ---
> arch/powerpc/include/asm/iommu.h | 5 +++++
> arch/powerpc/platforms/powernv/pci-ioda.c | 14 ++++++++++++
> arch/powerpc/platforms/powernv/pci.c | 36 +++++++++++++++++++++++++++++++
> arch/powerpc/platforms/powernv/pci.h | 2 ++
> 4 files changed, 57 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 1472de3..9844c106 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -99,6 +99,7 @@ struct iommu_table {
> unsigned long it_size; /* Size of iommu table in entries */
> unsigned long it_indirect_levels;
> unsigned long it_level_size;
> + unsigned long it_allocated_size;
> unsigned long it_offset; /* Offset into global table */
> unsigned long it_base; /* mapped address of tce table */
> unsigned long it_index; /* which iommu table this is */
> @@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
> struct iommu_table_group;
>
> struct iommu_table_group_ops {
> + unsigned long (*get_table_size)(
> + __u32 page_shift,
> + __u64 window_size,
> + __u32 levels);
> long (*create_table)(struct iommu_table_group *table_group,
> int num,
> __u32 page_shift,
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index e0be556..7f548b4 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
> }
>
> #ifdef CONFIG_IOMMU_API
> +static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
> + __u64 window_size, __u32 levels)
> +{
> + unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
> +
> + if (!ret)
> + return ret;
> +
> + /* Add size of it_userspace */
> + return ret + (window_size >> page_shift) * sizeof(unsigned long);

This doesn't make much sense. The userspace view can't possibly be a
property of the specific low-level IOMMU model.

> +}
> +
> static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
> int num, __u32 page_shift, __u64 window_size, __u32 levels,
> struct iommu_table *tbl)
> @@ -2086,6 +2098,7 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
>
> BUG_ON(tbl->it_userspace);
> tbl->it_userspace = uas;
> + tbl->it_allocated_size += uas_cb;
> tbl->it_ops = &pnv_ioda2_iommu_ops;
> if (pe->tce_inval_reg)
> tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
> @@ -2160,6 +2173,7 @@ static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
> }
>
> static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
> + .get_table_size = pnv_pci_ioda2_get_table_size,
> .create_table = pnv_pci_ioda2_create_table,
> .set_window = pnv_pci_ioda2_set_window,
> .unset_window = pnv_pci_ioda2_unset_window,
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index fc129c4..1b5b48a 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -662,6 +662,38 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
> tbl->it_type = TCE_PCI;
> }
>
> +unsigned long pnv_get_table_size(__u32 page_shift,
> + __u64 window_size, __u32 levels)
> +{
> + unsigned long bytes = 0;
> + const unsigned window_shift = ilog2(window_size);
> + unsigned entries_shift = window_shift - page_shift;
> + unsigned table_shift = entries_shift + 3;
> + unsigned long tce_table_size = max(0x1000UL, 1UL << table_shift);
> + unsigned long direct_table_size;
> +
> + if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS) ||
> + (window_size > memory_hotplug_max()) ||
> + !is_power_of_2(window_size))
> + return 0;
> +
> + /* Calculate a direct table size from window_size and levels */
> + entries_shift = ROUND_UP(entries_shift, levels) / levels;
> + table_shift = entries_shift + 3;
> + table_shift = max_t(unsigned, table_shift, PAGE_SHIFT);
> + direct_table_size = 1UL << table_shift;
> +
> + for ( ; levels; --levels) {
> + bytes += ROUND_UP(tce_table_size, direct_table_size);
> +
> + tce_table_size /= direct_table_size;
> + tce_table_size <<= 3;
> + tce_table_size = ROUND_UP(tce_table_size, direct_table_size);
> + }
> +
> + return bytes;
> +}
> +
> static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift,
> unsigned levels, unsigned long limit,
> unsigned long *tce_table_allocated)
> @@ -741,6 +773,10 @@ long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
> return -ENOMEM;
> }
>
> + tbl->it_allocated_size = pnv_get_table_size(page_shift, window_size,
> + levels);
> + WARN_ON(!tbl->it_allocated_size);
> +
> /* Setup linux iommu table */
> pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, bus_offset,
> page_shift);
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 3d1ff584..ce4bc3c 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -224,6 +224,8 @@ extern long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
> __u64 bus_offset, __u32 page_shift, __u64 window_size,
> __u32 levels, struct iommu_table *tbl);
> extern void pnv_pci_free_table(struct iommu_table *tbl);
> +extern unsigned long pnv_get_table_size(__u32 page_shift,
> + __u64 window_size, __u32 levels);
> extern void pnv_pci_init_p5ioc2_hub(struct device_node *np);
> extern void pnv_pci_init_ioda_hub(struct device_node *np);
> extern void pnv_pci_init_ioda2_phb(struct device_node *np);

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


2015-04-29 22:33:26

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 28/32] powerpc/mmu: Add userspace-to-physical addresses translation cache

On Sat, Apr 25, 2015 at 10:14:52PM +1000, Alexey Kardashevskiy wrote:
> We are adding support for DMA memory pre-registration to be used in
> conjunction with VFIO. The idea is that the userspace which is going to
> run a guest may want to pre-register a user space memory region so
> it all gets pinned once and never goes away. Having this done,
> a hypervisor will not have to pin/unpin pages on every DMA map/unmap
> request. This is going to help with multiple pinning of the same memory
> and in-kernel acceleration of DMA requests.
>
> This adds a list of memory regions to mm_context_t. Each region consists
> of a header and a list of physical addresses. This adds API to:
> 1. register/unregister memory regions;
> 2. do final cleanup (which puts all pre-registered pages);
> 3. do userspace to physical address translation;
> 4. manage a mapped pages counter; when it is zero, it is safe to
> unregister the region.
>
> Multiple registration of the same region is allowed, kref is used to
> track the number of registrations.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> Changes:
> v8:
> * s/mm_iommu_table_group_mem_t/struct mm_iommu_table_group_mem_t/
> * fixed error fallback look (s/[i]/[j]/)
> ---
> arch/powerpc/include/asm/mmu-hash64.h | 3 +
> arch/powerpc/include/asm/mmu_context.h | 17 +++
> arch/powerpc/mm/Makefile | 1 +
> arch/powerpc/mm/mmu_context_hash64.c | 6 +
> arch/powerpc/mm/mmu_context_hash64_iommu.c | 215 +++++++++++++++++++++++++++++
> 5 files changed, 242 insertions(+)
> create mode 100644 arch/powerpc/mm/mmu_context_hash64_iommu.c
>
> diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
> index 1da6a81..a82f534 100644
> --- a/arch/powerpc/include/asm/mmu-hash64.h
> +++ b/arch/powerpc/include/asm/mmu-hash64.h
> @@ -536,6 +536,9 @@ typedef struct {
> /* for 4K PTE fragment support */
> void *pte_frag;
> #endif
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> + struct list_head iommu_group_mem_list;
> +#endif

Urgh. I know I'm not one to talk, having done the hugepage crap in
there, but man mm_context_t has grown to a bloated mess from originally
being just intended as a context ID integer :/.

> } mm_context_t;
>
>
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index 73382eb..d6116ca 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -16,6 +16,23 @@
> */
> extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
> extern void destroy_context(struct mm_struct *mm);
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +struct mm_iommu_table_group_mem_t;
> +
> +extern bool mm_iommu_preregistered(void);
> +extern long mm_iommu_alloc(unsigned long ua, unsigned long entries,
> + struct mm_iommu_table_group_mem_t **pmem);
> +extern struct mm_iommu_table_group_mem_t *mm_iommu_get(unsigned long ua,
> + unsigned long entries);
> +extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
> +extern void mm_iommu_cleanup(mm_context_t *ctx);
> +extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
> + unsigned long size);
> +extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
> + unsigned long ua, unsigned long *hpa);
> +extern long mm_iommu_mapped_update(struct mm_iommu_table_group_mem_t *mem,
> + bool inc);
> +#endif
>
> extern void switch_mmu_context(struct mm_struct *prev, struct mm_struct *next);
> extern void switch_slb(struct task_struct *tsk, struct mm_struct *mm);
> diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
> index 9c8770b..e216704 100644
> --- a/arch/powerpc/mm/Makefile
> +++ b/arch/powerpc/mm/Makefile
> @@ -36,3 +36,4 @@ obj-$(CONFIG_PPC_SUBPAGE_PROT) += subpage-prot.o
> obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
> obj-$(CONFIG_HIGHMEM) += highmem.o
> obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o
> +obj-$(CONFIG_SPAPR_TCE_IOMMU) += mmu_context_hash64_iommu.o
> diff --git a/arch/powerpc/mm/mmu_context_hash64.c b/arch/powerpc/mm/mmu_context_hash64.c
> index 178876ae..eb3080c 100644
> --- a/arch/powerpc/mm/mmu_context_hash64.c
> +++ b/arch/powerpc/mm/mmu_context_hash64.c
> @@ -89,6 +89,9 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
> #ifdef CONFIG_PPC_64K_PAGES
> mm->context.pte_frag = NULL;
> #endif
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> + INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
> +#endif
> return 0;
> }
>
> @@ -132,6 +135,9 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
>
> void destroy_context(struct mm_struct *mm)
> {
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> + mm_iommu_cleanup(&mm->context);
> +#endif
>
> #ifdef CONFIG_PPC_ICSWX
> drop_cop(mm->context.acop, mm);
> diff --git a/arch/powerpc/mm/mmu_context_hash64_iommu.c b/arch/powerpc/mm/mmu_context_hash64_iommu.c
> new file mode 100644
> index 0000000..af7668c
> --- /dev/null
> +++ b/arch/powerpc/mm/mmu_context_hash64_iommu.c
> @@ -0,0 +1,215 @@
> +/*
> + * IOMMU helpers in MMU context.
> + *
> + * Copyright (C) 2015 IBM Corp. <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + *
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/rculist.h>
> +#include <linux/vmalloc.h>
> +#include <linux/kref.h>
> +#include <asm/mmu_context.h>
> +
> +struct mm_iommu_table_group_mem_t {
> + struct list_head next;
> + struct rcu_head rcu;
> + struct kref kref; /* one reference per VFIO container */
> + atomic_t mapped; /* number of currently mapped pages */
> + u64 ua; /* userspace address */
> + u64 entries; /* number of entries in hpas[] */

Maybe 'npages', since this is used to determine the range of user
addresses covered, not just the number of entries in hpas.

> + u64 *hpas; /* vmalloc'ed */
> +};
> +
> +bool mm_iommu_preregistered(void)
> +{
> + if (!current || !current->mm)
> + return false;
> +
> + return !list_empty(&current->mm->context.iommu_group_mem_list);
> +}
> +EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
> +
> +long mm_iommu_alloc(unsigned long ua, unsigned long entries,
> + struct mm_iommu_table_group_mem_t **pmem)
> +{
> + struct mm_iommu_table_group_mem_t *mem;
> + long i, j;
> + struct page *page = NULL;
> +
> + list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
> + next) {
> + if ((mem->ua == ua) && (mem->entries == entries))
> + return -EBUSY;
> +
> + /* Overlap? */
> + if ((mem->ua < (ua + (entries << PAGE_SHIFT))) &&
> + (ua < (mem->ua + (mem->entries << PAGE_SHIFT))))
> + return -EINVAL;
> + }
> +
> + mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> + if (!mem)
> + return -ENOMEM;
> +
> + mem->hpas = vzalloc(entries * sizeof(mem->hpas[0]));
> + if (!mem->hpas) {
> + kfree(mem);
> + return -ENOMEM;
> + }
> +
> + for (i = 0; i < entries; ++i) {
> + if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
> + 1/* pages */, 1/* iswrite */, &page)) {

Do you really need to call gup() in a loop? It can do more than one
page at a time..

That might work better if you kept a list of struct page *s instead of
hpas.
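
A rough sketch of the batched variant (assuming the whole range is handed to
get_user_pages_fast() at once; the pages[] array and its error path are
additions, not something from the patch):

	struct page **pages;
	long pinned, i;

	pages = vzalloc(entries * sizeof(pages[0]));
	if (!pages) {
		vfree(mem->hpas);
		kfree(mem);
		return -ENOMEM;
	}

	pinned = get_user_pages_fast(ua, entries, 1/* iswrite */, pages);
	if (pinned != entries) {
		/* put back whatever did get pinned and bail out */
		for (i = 0; i < pinned; ++i)
			put_page(pages[i]);
		vfree(pages);
		vfree(mem->hpas);
		kfree(mem);
		return -EFAULT;
	}

	for (i = 0; i < entries; ++i)
		mem->hpas[i] = page_to_pfn(pages[i]) << PAGE_SHIFT;
	vfree(pages);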

> + for (j = 0; j < i; ++j)
> + put_page(pfn_to_page(
> + mem->hpas[j] >> PAGE_SHIFT));
> + vfree(mem->hpas);
> + kfree(mem);
> + return -EFAULT;
> + }
> +
> + mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
> + }
> +
> + kref_init(&mem->kref);
> + atomic_set(&mem->mapped, 0);
> + mem->ua = ua;
> + mem->entries = entries;
> + *pmem = mem;
> +
> + list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(mm_iommu_alloc);
> +
> +static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
> +{
> + long i;
> + struct page *page = NULL;
> +
> + for (i = 0; i < mem->entries; ++i) {
> + if (!mem->hpas[i])
> + continue;
> +
> + page = pfn_to_page(mem->hpas[i] >> PAGE_SHIFT);
> + if (!page)
> + continue;
> +
> + put_page(page);
> + mem->hpas[i] = 0;
> + }
> +}
> +
> +static void mm_iommu_free(struct rcu_head *head)
> +{
> + struct mm_iommu_table_group_mem_t *mem = container_of(head,
> + struct mm_iommu_table_group_mem_t, rcu);
> +
> + mm_iommu_unpin(mem);
> + vfree(mem->hpas);
> + kfree(mem);
> +}
> +
> +static void mm_iommu_release(struct kref *kref)
> +{
> + struct mm_iommu_table_group_mem_t *mem = container_of(kref,
> + struct mm_iommu_table_group_mem_t, kref);
> +
> + list_del_rcu(&mem->next);
> + call_rcu(&mem->rcu, mm_iommu_free);
> +}
> +
> +struct mm_iommu_table_group_mem_t *mm_iommu_get(unsigned long ua,
> + unsigned long entries)
> +{
> + struct mm_iommu_table_group_mem_t *mem;
> +
> + list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
> + next) {
> + if ((mem->ua == ua) && (mem->entries == entries)) {
> + kref_get(&mem->kref);
> + return mem;
> + }
> + }
> +
> + return NULL;
> +}
> +EXPORT_SYMBOL_GPL(mm_iommu_get);
> +
> +long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
> +{
> + if (atomic_read(&mem->mapped))
> + return -EBUSY;

What prevents a race between the atomic_read() above and the release below?

> + kref_put(&mem->kref, mm_iommu_release);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(mm_iommu_put);
> +
> +struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
> + unsigned long size)
> +{
> + struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
> +
> + list_for_each_entry_rcu(mem,
> + &current->mm->context.iommu_group_mem_list,
> + next) {
> + if ((mem->ua <= ua) &&
> + (ua + size <= mem->ua +
> + (mem->entries << PAGE_SHIFT))) {
> + ret = mem;
> + break;
> + }
> + }
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(mm_iommu_lookup);
> +
> +long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
> + unsigned long ua, unsigned long *hpa)

Return type should be int; it's just an error code.

> +{
> + const long entry = (ua - mem->ua) >> PAGE_SHIFT;
> + u64 *va = &mem->hpas[entry];
> +
> + if (entry >= mem->entries)
> + return -EFAULT;
> +
> + *hpa = *va | (ua & ~PAGE_MASK);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
> +
> +long mm_iommu_mapped_update(struct mm_iommu_table_group_mem_t *mem, bool inc)
> +{
> + long ret = 0;
> +
> + if (inc)
> + atomic_inc(&mem->mapped);
> + else
> + ret = atomic_dec_if_positive(&mem->mapped);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(mm_iommu_mapped_update);

I think this would be clearer as separate inc and dec functions.
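
e.g. something like (a sketch; the names are only suggestions):

	void mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
	{
		atomic_inc(&mem->mapped);
	}

	long mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem)
	{
		/* returns the new count, or a negative value if it was already 0 */
		return atomic_dec_if_positive(&mem->mapped);
	}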

> +
> +void mm_iommu_cleanup(mm_context_t *ctx)
> +{
> + while (!list_empty(&ctx->iommu_group_mem_list)) {
> + struct mm_iommu_table_group_mem_t *mem;
> +
> + mem = list_first_entry(&ctx->iommu_group_mem_list,
> + struct mm_iommu_table_group_mem_t, next);
> + mm_iommu_release(&mem->kref);
> + }
> +}

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


2015-04-29 09:00:43

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 16/32] powerpc/powernv/ioda: Move TCE kill register address to PE

On 04/29/2015 01:25 PM, David Gibson wrote:
> On Sat, Apr 25, 2015 at 10:14:40PM +1000, Alexey Kardashevskiy wrote:
>> At the moment the DMA setup code looks for the "ibm,opal-tce-kill" property
>> which contains the TCE kill register address. Writes to this register
>> invalidates TCE cache on IODA/IODA2 hub.
>>
>> This moves the register address from iommu_table to pnv_ioda_pe as
>> later there will be 2 tables per PE and it will be used for both tables.
>>
>> This moves the property reading/remapping code to a helper to reduce
>> code duplication.
>>
>> This adds a new pnv_pci_ioda2_tvt_invalidate() helper which invalidates
>> the entire table. It should be called after every call to
>> opal_pci_map_pe_dma_window(). It was not required before because
>> there is just a single TCE table and 64bit DMA is handled via bypass
>> window (which has no table so no chache is used) but this is going
>> to change with Dynamic DMA windows (DDW).
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>> Changes:
>> v9:
>> * new in the series
>> ---
>> arch/powerpc/platforms/powernv/pci-ioda.c | 69 +++++++++++++++++++------------
>> arch/powerpc/platforms/powernv/pci.h | 1 +
>> 2 files changed, 44 insertions(+), 26 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index f070c44..b22b3ca 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -1672,7 +1672,7 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
>> struct pnv_ioda_pe, table_group);
>> __be64 __iomem *invalidate = rm ?
>> (__be64 __iomem *)pe->tce_inval_reg_phys :
>> - (__be64 __iomem *)tbl->it_index;
>> + pe->tce_inval_reg;
>> unsigned long start, end, inc;
>> const unsigned shift = tbl->it_page_shift;
>>
>> @@ -1743,6 +1743,18 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
>> .get = pnv_tce_get,
>> };
>>
>> +static inline void pnv_pci_ioda2_tvt_invalidate(struct pnv_ioda_pe *pe)
>> +{
>> + /* 01xb - invalidate TCEs that match the specified PE# */
>> + unsigned long addr = (0x4ull << 60) | (pe->pe_number & 0xFF);
>
> This doesn't really look like an address, but rather the data you're
> writing to the register.


This thing is made of an "invalidate operation" (0x4 here), an "invalidate
address" (a PCI bus address, zero here as we reset everything; that is what
most of the bits are for) and an "invalidate PE number". So what should I
call it? :)



>> + if (!pe->tce_inval_reg)
>> + return;
>> +
>> + mb(); /* Ensure above stores are visible */
>> + __raw_writeq(cpu_to_be64(addr), pe->tce_inval_reg);
>> +}
>> +
>> static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
>> unsigned long index, unsigned long npages, bool rm)
>> {
>> @@ -1751,7 +1763,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
>> unsigned long start, end, inc;
>> __be64 __iomem *invalidate = rm ?
>> (__be64 __iomem *)pe->tce_inval_reg_phys :
>> - (__be64 __iomem *)tbl->it_index;
>> + pe->tce_inval_reg;
>> const unsigned shift = tbl->it_page_shift;
>>
>> /* We'll invalidate DMA address in PE scope */
>> @@ -1803,13 +1815,31 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
>> .get = pnv_tce_get,
>> };
>>
>> +static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
>> + struct pnv_ioda_pe *pe)
>> +{
>> + const __be64 *swinvp;
>> +
>> + /* OPAL variant of PHB3 invalidated TCEs */
>> + swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
>> + if (!swinvp)
>> + return;
>> +
>> + /* We need a couple more fields -- an address and a data
>> + * to or. Since the bus is only printed out on table free
>> + * errors, and on the first pass the data will be a relative
>> + * bus number, print that out instead.
>> + */
>
> The comment above appears to have nothing to do with the surrounding code.

I'll just remove it.


>
>> + pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
>> + pe->tce_inval_reg = ioremap(pe->tce_inval_reg_phys, 8);
>> +}
>> +
>> static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>> struct pnv_ioda_pe *pe, unsigned int base,
>> unsigned int segs)
>> {
>>
>> struct page *tce_mem = NULL;
>> - const __be64 *swinvp;
>> struct iommu_table *tbl;
>> unsigned int i;
>> int64_t rc;
>> @@ -1823,6 +1853,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>> if (WARN_ON(pe->tce32_seg >= 0))
>> return;
>>
>> + pnv_pci_ioda_setup_opal_tce_kill(phb, pe);
>> +
>> /* Grab a 32-bit TCE table */
>> pe->tce32_seg = base;
>> pe_info(pe, " Setting up 32-bit TCE table at %08x..%08x\n",
>> @@ -1865,20 +1897,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>> base << 28, IOMMU_PAGE_SHIFT_4K);
>>
>> /* OPAL variant of P7IOC SW invalidated TCEs */
>> - swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
>> - if (swinvp) {
>> - /* We need a couple more fields -- an address and a data
>> - * to or. Since the bus is only printed out on table free
>> - * errors, and on the first pass the data will be a relative
>> - * bus number, print that out instead.
>> - */
>
> .. although I guess it didn't make any more sense in its original context.
>
>> - pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
>> - tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys,
>> - 8);
>> + if (pe->tce_inval_reg)
>> tbl->it_type |= (TCE_PCI_SWINV_CREATE |
>> TCE_PCI_SWINV_FREE |
>> TCE_PCI_SWINV_PAIR);
>> - }
>> +
>> tbl->it_ops = &pnv_ioda1_iommu_ops;
>> iommu_init_table(tbl, phb->hose->node);
>>
>> @@ -1984,7 +2007,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> {
>> struct page *tce_mem = NULL;
>> void *addr;
>> - const __be64 *swinvp;
>> struct iommu_table *tbl;
>> unsigned int tce_table_size, end;
>> int64_t rc;
>> @@ -1993,6 +2015,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> if (WARN_ON(pe->tce32_seg >= 0))
>> return;
>>
>> + pnv_pci_ioda_setup_opal_tce_kill(phb, pe);
>> +
>> /* The PE will reserve all possible 32-bits space */
>> pe->tce32_seg = 0;
>> end = (1 << ilog2(phb->ioda.m32_pci_base));
>> @@ -2023,6 +2047,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> goto fail;
>> }
>>
>> + pnv_pci_ioda2_tvt_invalidate(pe);
>> +
>
> This looks to be a change in behaviour - if it's replacing a previous
> invalidation, I'm not seeing where.


It is a new thing and the patch adds it. And it does not say anywhere that
this patch does not change behavior.


>
>> /* Setup iommu */
>> pe->table_group.tables[0].it_table_group = &pe->table_group;
>>
>> @@ -2032,18 +2058,9 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> IOMMU_PAGE_SHIFT_4K);
>>
>> /* OPAL variant of PHB3 invalidated TCEs */
>> - swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
>> - if (swinvp) {
>> - /* We need a couple more fields -- an address and a data
>> - * to or. Since the bus is only printed out on table free
>> - * errors, and on the first pass the data will be a relative
>> - * bus number, print that out instead.
>> - */
>> - pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
>> - tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys,
>> - 8);
>> + if (pe->tce_inval_reg)
>> tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
>> - }
>> +
>> tbl->it_ops = &pnv_ioda2_iommu_ops;
>> iommu_init_table(tbl, phb->hose->node);
>> #ifdef CONFIG_IOMMU_API
>> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>> index 368d4ed..bd83d85 100644
>> --- a/arch/powerpc/platforms/powernv/pci.h
>> +++ b/arch/powerpc/platforms/powernv/pci.h
>> @@ -59,6 +59,7 @@ struct pnv_ioda_pe {
>> int tce32_segcount;
>> struct iommu_table_group table_group;
>> phys_addr_t tce_inval_reg_phys;
>> + __be64 __iomem *tce_inval_reg;
>>
>> /* 64-bit TCE bypass region */
>> bool tce_bypass_enabled;
>


--
Alexey

2015-04-29 09:02:28

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 17/32] powerpc/powernv: Implement accessor to TCE entry

On 04/29/2015 02:04 PM, David Gibson wrote:
> On Sat, Apr 25, 2015 at 10:14:41PM +1000, Alexey Kardashevskiy wrote:
>> This replaces direct accesses to TCE table with a helper which
>> returns an TCE entry address. This does not make difference now but will
>> when multi-level TCE tables get introduces.
>>
>> No change in behavior is expected.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>
> Reviewed-by: David Gibson <[email protected]>
>
>
>> ---
>> Changes:
>> v9:
>> * new patch in the series to separate this mechanical change from
>> functional changes; this is not right before
>> "powerpc/powernv: Implement multilevel TCE tables" but here in order
>> to let the next patch - "powerpc/iommu/powernv: Release replaced TCE" -
>> use pnv_tce() and avoid changing the same code twice
>> ---
>> arch/powerpc/platforms/powernv/pci.c | 34 +++++++++++++++++++++-------------
>> 1 file changed, 21 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
>> index 84b4ea4..ba75aa5 100644
>> --- a/arch/powerpc/platforms/powernv/pci.c
>> +++ b/arch/powerpc/platforms/powernv/pci.c
>> @@ -572,38 +572,46 @@ struct pci_ops pnv_pci_ops = {
>> .write = pnv_pci_write_config,
>> };
>>
>> +static __be64 *pnv_tce(struct iommu_table *tbl, long idx)
>> +{
>> + __be64 *tmp = ((__be64 *)tbl->it_base);
>> +
>> + return tmp + idx;
>> +}
>> +
>> int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
>> unsigned long uaddr, enum dma_data_direction direction,
>> struct dma_attrs *attrs)
>> {
>> u64 proto_tce = iommu_direction_to_tce_perm(direction);
>> - __be64 *tcep;
>> - u64 rpn;
>> + u64 rpn = __pa(uaddr) >> tbl->it_page_shift;
>
> I guess this was a problem in the existing code, not this patch. But
> "uaddr" is a really bad name (and unsigned long is a bad type) for
> what must actually be a kernel linear mapping address.


Yes, and maybe one day I'll clean this up. s/uaddr/linear/ and
s/hwaddr/hpa/ are the first things to do globally but not in this patchset.


>
>> + long i;
>>
>> - tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
>> - rpn = __pa(uaddr) >> tbl->it_page_shift;
>> -
>> - while (npages--)
>> - *(tcep++) = cpu_to_be64(proto_tce |
>> - (rpn++ << tbl->it_page_shift));
>> + for (i = 0; i < npages; i++) {
>> + unsigned long newtce = proto_tce |
>> + ((rpn + i) << tbl->it_page_shift);
>> + unsigned long idx = index - tbl->it_offset + i;
>>
>> + *(pnv_tce(tbl, idx)) = cpu_to_be64(newtce);
>> + }
>>
>> return 0;
>> }
>>
>> void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
>> {
>> - __be64 *tcep;
>> + long i;
>>
>> - tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
>> + for (i = 0; i < npages; i++) {
>> + unsigned long idx = index - tbl->it_offset + i;
>>
>> - while (npages--)
>> - *(tcep++) = cpu_to_be64(0);
>> + *(pnv_tce(tbl, idx)) = cpu_to_be64(0);
>> + }
>> }
>>
>> unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
>> {
>> - return ((u64 *)tbl->it_base)[index - tbl->it_offset];
>> + return *(pnv_tce(tbl, index - tbl->it_offset));
>> }
>>
>> void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
>


--
Alexey

2015-04-29 09:12:49

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 20/32] powerpc/powernv/ioda2: Introduce pnv_pci_create_table/pnv_pci_free_table

On 04/29/2015 02:39 PM, David Gibson wrote:
> On Sat, Apr 25, 2015 at 10:14:44PM +1000, Alexey Kardashevskiy wrote:
>> This is a part of moving TCE table allocation into an iommu_ops
>> callback to support multiple IOMMU groups per one VFIO container.
>>
>> This moves a table creation window to the file with common powernv-pci
>> helpers as it does not do anything IODA2-specific.
>>
>> This adds pnv_pci_free_table() helper to release the actual TCE table.
>>
>> This enforces window size to be a power of two.
>>
>> This should cause no behavioural change.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> Reviewed-by: David Gibson <[email protected]>
>> ---
>> Changes:
>> v9:
>> * moved helpers to the common powernv pci.c file from pci-ioda.c
>> * moved bits from pnv_pci_create_table() to pnv_alloc_tce_table_pages()
>> ---
>> arch/powerpc/platforms/powernv/pci-ioda.c | 36 ++++++------------
>> arch/powerpc/platforms/powernv/pci.c | 61 +++++++++++++++++++++++++++++++
>> arch/powerpc/platforms/powernv/pci.h | 4 ++
>> 3 files changed, 76 insertions(+), 25 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index a80be34..b9b3773 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -1307,8 +1307,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
>> if (rc)
>> pe_warn(pe, "OPAL error %ld release DMA window\n", rc);
>>
>> - iommu_reset_table(tbl, of_node_full_name(dev->dev.of_node));
>> - free_pages(addr, get_order(TCE32_TABLE_SIZE));
>> + pnv_pci_free_table(tbl);
>> }
>>
>> static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>> @@ -2039,10 +2038,7 @@ static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
>> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> struct pnv_ioda_pe *pe)
>> {
>> - struct page *tce_mem = NULL;
>> - void *addr;
>> struct iommu_table *tbl = &pe->table_group.tables[0];
>> - unsigned int tce_table_size, end;
>> int64_t rc;
>>
>> /* We shouldn't already have a 32-bit DMA associated */
>> @@ -2053,29 +2049,20 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>>
>> /* The PE will reserve all possible 32-bits space */
>> pe->tce32_seg = 0;
>> - end = (1 << ilog2(phb->ioda.m32_pci_base));
>> - tce_table_size = (end / 0x1000) * 8;
>> pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
>> - end);
>> + phb->ioda.m32_pci_base);
>>
>> - /* Allocate TCE table */
>> - tce_mem = alloc_pages_node(phb->hose->node, GFP_KERNEL,
>> - get_order(tce_table_size));
>> - if (!tce_mem) {
>> - pe_err(pe, "Failed to allocate a 32-bit TCE memory\n");
>> - goto fail;
>> + rc = pnv_pci_create_table(&pe->table_group, pe->phb->hose->node,
>> + 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base, tbl);
>> + if (rc) {
>> + pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc);
>> + return;
>> }
>> - addr = page_address(tce_mem);
>> - memset(addr, 0, tce_table_size);
>> -
>> - /* Setup iommu */
>> - tbl->it_table_group = &pe->table_group;
>> -
>> - /* Setup linux iommu table */
>> - pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
>> - IOMMU_PAGE_SHIFT_4K);
>>
>> tbl->it_ops = &pnv_ioda2_iommu_ops;
>> +
>> + /* Setup iommu */
>> + tbl->it_table_group = &pe->table_group;
>> iommu_init_table(tbl, phb->hose->node);
>> #ifdef CONFIG_IOMMU_API
>> pe->table_group.ops = &pnv_pci_ioda2_ops;
>> @@ -2121,8 +2108,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> fail:
>> if (pe->tce32_seg >= 0)
>> pe->tce32_seg = -1;
>> - if (tce_mem)
>> - __free_pages(tce_mem, get_order(tce_table_size));
>> + pnv_pci_free_table(tbl);
>> }
>>
>> static void pnv_ioda_setup_dma(struct pnv_phb *phb)
>> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
>> index e8802ac..6bcfad5 100644
>> --- a/arch/powerpc/platforms/powernv/pci.c
>> +++ b/arch/powerpc/platforms/powernv/pci.c
>> @@ -20,7 +20,9 @@
>> #include <linux/io.h>
>> #include <linux/msi.h>
>> #include <linux/iommu.h>
>> +#include <linux/memblock.h>
>>
>> +#include <asm/mmzone.h>
>> #include <asm/sections.h>
>> #include <asm/io.h>
>> #include <asm/prom.h>
>> @@ -645,6 +647,65 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
>> tbl->it_type = TCE_PCI;
>> }
>>
>> +static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift,
>> + unsigned long *tce_table_allocated)
>
> I'm a bit confused by the tce_table_allocated parameter. What's the
> circumstance where more memory is requested than required, and why
> does it matter to the caller?

It does not make much sense here but it does for "powerpc/powernv:
Implement multilevel TCE tables" - I was trying to avoid changing the same
lines many times.

The idea is that if a multilevel table is requested, I do not really want
to allocate the whole tree. For example, if the userspace asked for a 64K
table and 5 levels, the result will be a list of just 5 pages - the last
one will be the actual table and the upper levels will each have a single
valid TCE entry pointing to the next level.

But I change the prototype there anyway so I'll just move this
tce_table_allocated thing there.
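
To make that a bit more concrete, here is a very rough sketch of what I
mean - not the actual patch code, using single pages per level and with
error unwinding omitted:

/*
 * Sketch only: allocate one page per level, chain the levels with a
 * single valid entry each, and report how much memory was really
 * allocated back to the caller.
 */
static __be64 *sketch_alloc_levels(int nid, unsigned levels,
                unsigned long *allocated)
{
        __be64 *root = NULL, *prev = NULL;
        unsigned i;

        *allocated = 0;
        for (i = 0; i < levels; ++i) {
                struct page *p = alloc_pages_node(nid,
                                GFP_KERNEL | __GFP_ZERO, 0);
                __be64 *level;

                if (!p)
                        return NULL;    /* unwinding omitted */

                level = page_address(p);
                *allocated += PAGE_SIZE;

                if (!root)
                        root = level;
                if (prev)
                        /* single valid entry pointing to the next level */
                        prev[0] = cpu_to_be64(__pa(level) |
                                        TCE_PCI_READ | TCE_PCI_WRITE);
                prev = level;
        }

        /* only the last level holds real TCEs, the rest is filled lazily */
        return root;
}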



>> +{
>> + struct page *tce_mem = NULL;
>> + __be64 *addr;
>> + unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT;
>> + unsigned long local_allocated = 1UL << (order + PAGE_SHIFT);
>> +
>> + tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
>> + if (!tce_mem) {
>> + pr_err("Failed to allocate a TCE memory, order=%d\n", order);
>> + return NULL;
>> + }
>> + addr = page_address(tce_mem);
>> + memset(addr, 0, local_allocated);
>> + *tce_table_allocated = local_allocated;
>> +
>> + return addr;
>> +}
>> +
>> +long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
>> + __u64 bus_offset, __u32 page_shift, __u64 window_size,
>> + struct iommu_table *tbl)
>
> The table_group parameter is redundant, isn't it? It must be equal to
> tbl->table_group, yes?
>
> Or would it make more sense for this function to set
> tbl->table_group? And for that matter wouldn't it make more sense for
> this to set it_size as well?


It is too many changes already :)


>> +{
>> + void *addr;
>> + unsigned long tce_table_allocated = 0;
>> + const unsigned window_shift = ilog2(window_size);
>> + unsigned entries_shift = window_shift - page_shift;
>> + unsigned table_shift = entries_shift + 3;
>> + const unsigned long tce_table_size = max(0x1000UL, 1UL << table_shift);
>
> So, here you round up to 4k, the in the alloc function you round up to
> PAGE_SIZE (which may or may not be the same). It's not clear to me why
> there are two rounds of rounding up.
>
>> + if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size))
>> + return -EINVAL;
>> +
>> + /* Allocate TCE table */
>> + addr = pnv_alloc_tce_table_pages(nid, table_shift,
>> + &tce_table_allocated);
>> +
>> + /* Setup linux iommu table */
>> + pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, bus_offset,
>> + page_shift);
>> +
>> + pr_info("Created TCE table: window size = %08llx, "
>> + "tablesize = %lx (%lx), start @%08llx\n",
>> + window_size, tce_table_size, tce_table_allocated,
>> + bus_offset);
>> +
>> + return 0;
>> +}
>> +
>> +void pnv_pci_free_table(struct iommu_table *tbl)
>> +{
>> + if (!tbl->it_size)
>> + return;
>> +
>> + free_pages(tbl->it_base, get_order(tbl->it_size << 3));
>> + iommu_reset_table(tbl, "pnv");
>> +}
>> +
>> static void pnv_pci_dma_dev_setup(struct pci_dev *pdev)
>> {
>> struct pci_controller *hose = pci_bus_to_host(pdev->bus);
>> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>> index b15cce5..e6cbbec 100644
>> --- a/arch/powerpc/platforms/powernv/pci.h
>> +++ b/arch/powerpc/platforms/powernv/pci.h
>> @@ -218,6 +218,10 @@ int pnv_pci_cfg_write(struct pci_dn *pdn,
>> extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
>> void *tce_mem, u64 tce_size,
>> u64 dma_offset, unsigned page_shift);
>> +extern long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
>> + __u64 bus_offset, __u32 page_shift, __u64 window_size,
>> + struct iommu_table *tbl);
>> +extern void pnv_pci_free_table(struct iommu_table *tbl);
>> extern void pnv_pci_init_p5ioc2_hub(struct device_node *np);
>> extern void pnv_pci_init_ioda_hub(struct device_node *np);
>> extern void pnv_pci_init_ioda2_phb(struct device_node *np);
>


--
Alexey

2015-04-29 09:20:01

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 13/32] vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control

On 04/29/2015 01:02 PM, David Gibson wrote:
> On Sat, Apr 25, 2015 at 10:14:37PM +1000, Alexey Kardashevskiy wrote:
>> This adds tce_iommu_take_ownership() and tce_iommu_release_ownership
>> which call in a loop iommu_take_ownership()/iommu_release_ownership()
>> for every table on the group. As there is just one now, no change in
>> behaviour is expected.
>>
>> At the moment the iommu_table struct has a set_bypass() which enables/
>> disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
>> which calls this callback when external IOMMU users such as VFIO are
>> about to get over a PHB.
>>
>> The set_bypass() callback is not really an iommu_table function but
>> IOMMU/PE function. This introduces a iommu_table_group_ops struct and
>> adds take_ownership()/release_ownership() callbacks to it which are
>> called when an external user takes/releases control over the IOMMU.
>>
>> This replaces set_bypass() with ownership callbacks as it is not
>> necessarily just bypass enabling, it can be something else/more
>> so let's give it more generic name.
>>
>> The callbacks is implemented for IODA2 only. Other platforms (P5IOC2,
>> IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
>> The following patches will replace iommu_take_ownership/
>> iommu_release_ownership calls in IODA2 with full IOMMU table release/
>> create.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> [aw: for the vfio related changes]
>> Acked-by: Alex Williamson <[email protected]>
>> ---
>> Changes:
>> v9:
>> * squashed "vfio: powerpc/spapr: powerpc/iommu: Rework IOMMU ownership control"
>> and "vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework IOMMU ownership control"
>> into a single patch
>> * moved helpers with a loop through tables in a group
>> to vfio_iommu_spapr_tce.c to keep the platform code free of IOMMU table
>> groups as much as possible
>> * added missing tce_iommu_clear() to tce_iommu_release_ownership()
>> * replaced the set_ownership(enable) callback with take_ownership() and
>> release_ownership()
>> ---
>> arch/powerpc/include/asm/iommu.h | 13 +++++-
>> arch/powerpc/kernel/iommu.c | 11 ------
>> arch/powerpc/platforms/powernv/pci-ioda.c | 40 +++++++++++++++----
>> drivers/vfio/vfio_iommu_spapr_tce.c | 66 +++++++++++++++++++++++++++----
>> 4 files changed, 103 insertions(+), 27 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>> index fa37519..e63419e 100644
>> --- a/arch/powerpc/include/asm/iommu.h
>> +++ b/arch/powerpc/include/asm/iommu.h
>> @@ -93,7 +93,6 @@ struct iommu_table {
>> unsigned long it_page_shift;/* table iommu page size */
>> struct iommu_table_group *it_table_group;
>> struct iommu_table_ops *it_ops;
>> - void (*set_bypass)(struct iommu_table *tbl, bool enable);
>> };
>>
>> /* Pure 2^n version of get_order */
>> @@ -128,11 +127,23 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>>
>> #define IOMMU_TABLE_GROUP_MAX_TABLES 1
>>
>> +struct iommu_table_group;
>> +
>> +struct iommu_table_group_ops {
>> + /*
>> + * Switches ownership from the kernel itself to an external
>> + * user. While onwership is taken, the kernel cannot use IOMMU itself.
>
> Typo in "onwership". I'd also like to see this be even more explicit
> that "take" is the "core kernel -> vfio/whatever" transition and
> release is the reverse.


Will this work?

/*
* Switches ownership from the kernel itself to an external
* user.
* The ownership is taken when VFIO starts using the IOMMU group
* and released when the platform code gets the control over the group back.
* While ownership is taken, the platform code cannot use IOMMU itself.
*/


>> + */
>> + void (*take_ownership)(struct iommu_table_group *table_group);
>> + void (*release_ownership)(struct iommu_table_group *table_group);
>> +};
>> +
>> struct iommu_table_group {
>> #ifdef CONFIG_IOMMU_API
>> struct iommu_group *group;
>> #endif
>> struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>> + struct iommu_table_group_ops *ops;
>> };
>>
>> #ifdef CONFIG_IOMMU_API
>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>> index 005146b..2856d27 100644
>> --- a/arch/powerpc/kernel/iommu.c
>> +++ b/arch/powerpc/kernel/iommu.c
>> @@ -1057,13 +1057,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
>>
>> memset(tbl->it_map, 0xff, sz);
>>
>> - /*
>> - * Disable iommu bypass, otherwise the user can DMA to all of
>> - * our physical memory via the bypass window instead of just
>> - * the pages that has been explicitly mapped into the iommu
>> - */
>> - if (tbl->set_bypass)
>> - tbl->set_bypass(tbl, false);
>>
>> return 0;
>> }
>> @@ -1078,10 +1071,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
>> /* Restore bit#0 set by iommu_init_table() */
>> if (tbl->it_offset == 0)
>> set_bit(0, tbl->it_map);
>> -
>> - /* The kernel owns the device now, we can restore the iommu bypass */
>> - if (tbl->set_bypass)
>> - tbl->set_bypass(tbl, true);
>> }
>> EXPORT_SYMBOL_GPL(iommu_release_ownership);
>>
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 88472cb..718d5cc 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -1870,10 +1870,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>> __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
>> }
>>
>> -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
>> +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
>> {
>> - struct pnv_ioda_pe *pe = container_of(tbl->it_table_group,
>> - struct pnv_ioda_pe, table_group);
>> uint16_t window_id = (pe->pe_number << 1 ) + 1;
>> int64_t rc;
>>
>> @@ -1901,7 +1899,8 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
>> * host side.
>> */
>> if (pe->pdev)
>> - set_iommu_table_base(&pe->pdev->dev, tbl);
>> + set_iommu_table_base(&pe->pdev->dev,
>> + &pe->table_group.tables[0]);
>> else
>> pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
>> }
>> @@ -1917,13 +1916,35 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
>> /* TVE #1 is selected by PCI address bit 59 */
>> pe->tce_bypass_base = 1ull << 59;
>>
>> - /* Install set_bypass callback for VFIO */
>> - pe->table_group.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;
>> -
>> /* Enable bypass by default */
>> - pnv_pci_ioda2_set_bypass(&pe->table_group.tables[0], true);
>> + pnv_pci_ioda2_set_bypass(pe, true);
>> }
>>
>> +#ifdef CONFIG_IOMMU_API
>> +static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
>> +{
>> + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
>> + table_group);
>> +
>> + iommu_take_ownership(&table_group->tables[0]);
>> + pnv_pci_ioda2_set_bypass(pe, false);
>> +}
>> +
>> +static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
>> +{
>> + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
>> + table_group);
>> +
>> + iommu_release_ownership(&table_group->tables[0]);
>> + pnv_pci_ioda2_set_bypass(pe, true);
>> +}
>> +
>> +static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
>> + .take_ownership = pnv_ioda2_take_ownership,
>> + .release_ownership = pnv_ioda2_release_ownership,
>> +};
>> +#endif
>> +
>> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> struct pnv_ioda_pe *pe)
>> {
>> @@ -1991,6 +2012,9 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> }
>> tbl->it_ops = &pnv_ioda2_iommu_ops;
>> iommu_init_table(tbl, phb->hose->node);
>> +#ifdef CONFIG_IOMMU_API
>> + pe->table_group.ops = &pnv_pci_ioda2_ops;
>> +#endif
>>
>> if (pe->flags & PNV_IODA_PE_DEV) {
>> iommu_register_group(&pe->table_group, phb->hose->global_number,
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index 17e884a..dacc738 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -483,6 +483,43 @@ static long tce_iommu_ioctl(void *iommu_data,
>> return -ENOTTY;
>> }
>>
>> +static void tce_iommu_release_ownership(struct tce_container *container,
>> + struct iommu_table_group *table_group)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>> + struct iommu_table *tbl = &table_group->tables[i];
>> +
>> + tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
>> + if (tbl->it_map)
>> + iommu_release_ownership(tbl);
>> + }
>> +}
>> +
>> +static int tce_iommu_take_ownership(struct iommu_table_group *table_group)
>> +{
>> + int i, j, rc = 0;
>> +
>> + for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>> + struct iommu_table *tbl = &table_group->tables[i];
>> +
>> + if (!tbl->it_map)
>> + continue;
>> +
>> + rc = iommu_take_ownership(tbl);
>> + if (rc) {
>> + for (j = 0; j < i; ++j)
>> + iommu_release_ownership(
>> + &table_group->tables[j]);
>> +
>> + return rc;
>> + }
>> + }
>> +
>> + return 0;
>> +}
>> +
>> static int tce_iommu_attach_group(void *iommu_data,
>> struct iommu_group *iommu_group)
>> {
>> @@ -515,9 +552,23 @@ static int tce_iommu_attach_group(void *iommu_data,
>> goto unlock_exit;
>> }
>>
>> - ret = iommu_take_ownership(&table_group->tables[0]);
>> - if (!ret)
>> - container->grp = iommu_group;
>> + if (!table_group->ops || !table_group->ops->take_ownership ||
>> + !table_group->ops->release_ownership) {
>> + ret = tce_iommu_take_ownership(table_group);
>
> Haven't looked at the rest of the series. I'm hoping that you're
> eventually planning to replace this fallback with setting the
> take_ownership call for p5ioc etc. to point to
> tce_iommu_take_ownership.


Why? I do not really want p5ioc2 or ioda1 to have
take_ownership/release_ownership callbacks defined, as they would only do
this default stuff, which is never going to change: this hardware is quite
old and extremely rare, so there is no real customer for it. Should I
still convert these to callbacks?



>> + } else {
>> + /*
>> + * Disable iommu bypass, otherwise the user can DMA to all of
>> + * our physical memory via the bypass window instead of just
>> + * the pages that has been explicitly mapped into the iommu
>> + */
>> + table_group->ops->take_ownership(table_group);
>> + ret = 0;
>> + }
>> +
>> + if (ret)
>> + goto unlock_exit;
>> +
>> + container->grp = iommu_group;
>>
>> unlock_exit:
>> mutex_unlock(&container->lock);
>> @@ -530,7 +581,6 @@ static void tce_iommu_detach_group(void *iommu_data,
>> {
>> struct tce_container *container = iommu_data;
>> struct iommu_table_group *table_group;
>> - struct iommu_table *tbl;
>>
>> mutex_lock(&container->lock);
>> if (iommu_group != container->grp) {
>> @@ -553,9 +603,11 @@ static void tce_iommu_detach_group(void *iommu_data,
>> table_group = iommu_group_get_iommudata(iommu_group);
>> BUG_ON(!table_group);
>>
>> - tbl = &table_group->tables[0];
>> - tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
>> - iommu_release_ownership(tbl);
>> + /* Kernel owns the device now, we can restore bypass */
>> + if (!table_group->ops || !table_group->ops->release_ownership)
>> + tce_iommu_release_ownership(container, table_group);
>> + else
>> + table_group->ops->release_ownership(table_group);
>>
>> unlock_exit:
>> mutex_unlock(&container->lock);
>


--
Alexey

2015-04-29 09:26:38

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 21/32] powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window

On 04/29/2015 02:45 PM, David Gibson wrote:
> On Sat, Apr 25, 2015 at 10:14:45PM +1000, Alexey Kardashevskiy wrote:
>> This is a part of moving DMA window programming to an iommu_ops
>> callback. pnv_pci_ioda2_set_window() takes an iommu_table_group as
>> a first parameter (not pnv_ioda_pe) as it is going to be used as
>> a callback for VFIO DDW code.
>>
>> This adds pnv_pci_ioda2_tvt_invalidate() to invalidate TVT as it is
>> a good thing to do.
>
> What's the TVT and why is invalidating it a good thing?


"TCE Validation Table". Yeah, I need to rephrase it. Will do.


> Also, it looks like it didn't add it, just move it.

Agrh. Lost it in rebases. Will fix.


>> It does not have immediate effect now as the table
>> is never recreated after reboot but it will in the following patches.
>>
>> This should cause no behavioural change.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> Reviewed-by: David Gibson <[email protected]>
>
> Really? I don't remember this one.


Message-ID: <[email protected]>
:)

But I believe it did not have TVT stuff then so I should have removed your
RB from here.

>
>> ---
>> Changes:
>> v9:
>> * initialize pe->table_group.tables[0] at the very end when
>> tbl is fully initialized
>> * moved pnv_pci_ioda2_tvt_invalidate() from earlier patch
>> ---
>> arch/powerpc/platforms/powernv/pci-ioda.c | 67 +++++++++++++++++++++++--------
>> 1 file changed, 51 insertions(+), 16 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index b9b3773..59baa15 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -1960,6 +1960,52 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>> __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
>> }
>>
>> +static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
>> + struct iommu_table *tbl)
>> +{
>> + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
>> + table_group);
>> + struct pnv_phb *phb = pe->phb;
>> + int64_t rc;
>> + const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
>> + const __u64 win_size = tbl->it_size << tbl->it_page_shift;
>> +
>> + pe_info(pe, "Setting up window at %llx..%llx "
>> + "pgsize=0x%x tablesize=0x%lx\n",
>> + start_addr, start_addr + win_size - 1,
>> + 1UL << tbl->it_page_shift, tbl->it_size << 3);
>> +
>> + tbl->it_table_group = &pe->table_group;
>> +
>> + /*
>> + * Map TCE table through TVT. The TVE index is the PE number
>> + * shifted by 1 bit for 32-bits DMA space.
>> + */
>> + rc = opal_pci_map_pe_dma_window(phb->opal_id,
>> + pe->pe_number,
>> + pe->pe_number << 1,
>> + 1,
>> + __pa(tbl->it_base),
>> + tbl->it_size << 3,
>> + 1ULL << tbl->it_page_shift);
>> + if (rc) {
>> + pe_err(pe, "Failed to configure TCE table, err %ld\n", rc);
>> + goto fail;
>> + }
>> +
>> + pnv_pci_ioda2_tvt_invalidate(pe);
>> +
>> + /* Store fully initialized *tbl (may be external) in PE */
>> + pe->table_group.tables[0] = *tbl;
>
> Hrm, a non-atomic copy of a whole structure into the array. Is that
> really what you want?


set_window() is called from VFIO (protected by a mutex there) and from the
platform code, which I believe is not racy (or hotplug takes care of it
anyway). Or am I missing something else?
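
I.e. on the VFIO side the call is roughly this (simplified, not the exact
code):

        mutex_lock(&container->lock);
        ret = table_group->ops->set_window(table_group, num, tbl);
        mutex_unlock(&container->lock);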


>> + return 0;
>> +fail:
>> + if (pe->tce32_seg >= 0)
>> + pe->tce32_seg = -1;
>> +
>> + return rc;
>> +}
>> +
>> static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
>> {
>> uint16_t window_id = (pe->pe_number << 1 ) + 1;
>> @@ -2068,21 +2114,16 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> pe->table_group.ops = &pnv_pci_ioda2_ops;
>> #endif
>>
>> - /*
>> - * Map TCE table through TVT. The TVE index is the PE number
>> - * shifted by 1 bit for 32-bits DMA space.
>> - */
>> - rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
>> - pe->pe_number << 1, 1, __pa(tbl->it_base),
>> - tbl->it_size << 3, 1ULL << tbl->it_page_shift);
>> + rc = pnv_pci_ioda2_set_window(&pe->table_group, tbl);
>> if (rc) {
>> pe_err(pe, "Failed to configure 32-bit TCE table,"
>> " err %ld\n", rc);
>> - goto fail;
>> + pnv_pci_free_table(tbl);
>> + if (pe->tce32_seg >= 0)
>> + pe->tce32_seg = -1;
>> + return;
>> }
>>
>> - pnv_pci_ioda2_tvt_invalidate(pe);
>> -
>> /* OPAL variant of PHB3 invalidated TCEs */
>> if (pe->tce_inval_reg)
>> tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
>> @@ -2103,12 +2144,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> /* Also create a bypass window */
>> if (!pnv_iommu_bypass_disabled)
>> pnv_pci_ioda2_setup_bypass_pe(phb, pe);
>> -
>> - return;
>> -fail:
>> - if (pe->tce32_seg >= 0)
>> - pe->tce32_seg = -1;
>> - pnv_pci_free_table(tbl);
>> }
>>
>> static void pnv_ioda_setup_dma(struct pnv_phb *phb)
>


--
Alexey

2015-04-29 09:44:39

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 23/32] powerpc/powernv/ioda: Define and implement DMA table/window management callbacks

On 04/29/2015 03:30 PM, David Gibson wrote:
> On Sat, Apr 25, 2015 at 10:14:47PM +1000, Alexey Kardashevskiy wrote:
>> This extends iommu_table_group_ops by a set of callbacks to support
>> dynamic DMA windows management.
>>
>> create_table() creates a TCE table with specific parameters.
>> it receives iommu_table_group to know nodeid in order to allocate
>> TCE table memory closer to the PHB. The exact format of allocated
>> multi-level table might be also specific to the PHB model (not
>> the case now though).
>> This callback calculated the DMA window offset on a PCI bus from @num
>> and stores it in a just created table.
>>
>> set_window() sets the window at specified TVT index + @num on PHB.
>>
>> unset_window() unsets the window from specified TVT.
>>
>> This adds a free() callback to iommu_table_ops to free the memory
>> (potentially a tree of tables) allocated for the TCE table.
>
> Doesn't the free callback belong with the previous patch introducing
> multi-level tables?



If I did that, you would say "why is it here if nothing calls it" on the
"multilevel" patch, and "I see the allocation but I do not see the memory
release" ;)

I need some rule of thumb here. I think it is a bit cleaner if the same
patch adds a callback for memory allocation and its counterpart, no?



>> create_table() and free() are supposed to be called once per
>> VFIO container and set_window()/unset_window() are supposed to be
>> called for every group in a container.
>>
>> This adds IOMMU capabilities to iommu_table_group such as default
>> 32bit window parameters and others.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>> arch/powerpc/include/asm/iommu.h | 19 ++++++++
>> arch/powerpc/platforms/powernv/pci-ioda.c | 75 ++++++++++++++++++++++++++---
>> arch/powerpc/platforms/powernv/pci-p5ioc2.c | 12 +++--
>> 3 files changed, 96 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>> index 0f50ee2..7694546 100644
>> --- a/arch/powerpc/include/asm/iommu.h
>> +++ b/arch/powerpc/include/asm/iommu.h
>> @@ -70,6 +70,7 @@ struct iommu_table_ops {
>> /* get() returns a physical address */
>> unsigned long (*get)(struct iommu_table *tbl, long index);
>> void (*flush)(struct iommu_table *tbl);
>> + void (*free)(struct iommu_table *tbl);
>> };
>>
>> /* These are used by VIO */
>> @@ -148,6 +149,17 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>> struct iommu_table_group;
>>
>> struct iommu_table_group_ops {
>> + long (*create_table)(struct iommu_table_group *table_group,
>> + int num,
>> + __u32 page_shift,
>> + __u64 window_size,
>> + __u32 levels,
>> + struct iommu_table *tbl);
>> + long (*set_window)(struct iommu_table_group *table_group,
>> + int num,
>> + struct iommu_table *tblnew);
>> + long (*unset_window)(struct iommu_table_group *table_group,
>> + int num);
>> /*
>> * Switches ownership from the kernel itself to an external
>> * user. While onwership is taken, the kernel cannot use IOMMU itself.
>> @@ -160,6 +172,13 @@ struct iommu_table_group {
>> #ifdef CONFIG_IOMMU_API
>> struct iommu_group *group;
>> #endif
>> + /* Some key properties of IOMMU */
>> + __u32 tce32_start;
>> + __u32 tce32_size;
>> + __u64 pgsizes; /* Bitmap of supported page sizes */
>> + __u32 max_dynamic_windows_supported;
>> + __u32 max_levels;
>
> With this information, table_group seems even more like a bad name.
> "iommu_state" maybe?


Please, no. We will never come to an agreement then :( And "iommu_state" is
too general anyway; it won't pass.


>> struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>> struct iommu_table_group_ops *ops;
>> };
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index cc1d09c..4828837 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -24,6 +24,7 @@
>> #include <linux/msi.h>
>> #include <linux/memblock.h>
>> #include <linux/iommu.h>
>> +#include <linux/sizes.h>
>>
>> #include <asm/sections.h>
>> #include <asm/io.h>
>> @@ -1846,6 +1847,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
>> #endif
>> .clear = pnv_ioda2_tce_free,
>> .get = pnv_tce_get,
>> + .free = pnv_pci_free_table,
>> };
>>
>> static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
>> @@ -1936,6 +1938,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>> TCE_PCI_SWINV_PAIR);
>>
>> tbl->it_ops = &pnv_ioda1_iommu_ops;
>> + pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift;
>> + pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift;
>> iommu_init_table(tbl, phb->hose->node);
>>
>> if (pe->flags & PNV_IODA_PE_DEV) {
>> @@ -1961,7 +1965,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>> }
>>
>> static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
>> - struct iommu_table *tbl)
>> + int num, struct iommu_table *tbl)
>> {
>> struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
>> table_group);
>> @@ -1972,9 +1976,10 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
>> const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
>> const __u64 win_size = tbl->it_size << tbl->it_page_shift;
>>
>> - pe_info(pe, "Setting up window at %llx..%llx "
>> + pe_info(pe, "Setting up window#%d at %llx..%llx "
>> "pgsize=0x%x tablesize=0x%lx "
>> "levels=%d levelsize=%x\n",
>> + num,
>> start_addr, start_addr + win_size - 1,
>> 1UL << tbl->it_page_shift, tbl->it_size << 3,
>> tbl->it_indirect_levels + 1, tbl->it_level_size << 3);
>> @@ -1987,7 +1992,7 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
>> */
>> rc = opal_pci_map_pe_dma_window(phb->opal_id,
>> pe->pe_number,
>> - pe->pe_number << 1,
>> + (pe->pe_number << 1) + num,
>
> Heh, yes, well, that makes it rather clear that only 2 tables are possible.
>
>> tbl->it_indirect_levels + 1,
>> __pa(tbl->it_base),
>> size << 3,
>> @@ -2000,7 +2005,7 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
>> pnv_pci_ioda2_tvt_invalidate(pe);
>>
>> /* Store fully initialized *tbl (may be external) in PE */
>> - pe->table_group.tables[0] = *tbl;
>> + pe->table_group.tables[num] = *tbl;
>
> I'm a bit confused by this whole set_window thing. Is the idea that
> with multiple groups in a container you have multiple table_group s
> each with different copies of the iommu_table structures, but pointing
> to the same actual TCE entries (it_base)?

Yes.

> It seems to me not terribly
> obvious when you "create" a table and when you "set" a window.


A table is not attached anywhere until its address is programmed into the
hardware (in set_window()); until then it is just a table in memory. For
POWER8/IODA2, I create a table before I attach any group to a container and
then program this table into every attached group; right now this is done
in the container's attach_group(). So later we can hotplug any host PCI
device into a container - it will program the same TCE table into every new
group in the container.
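
Roughly, the flow is like this (only an illustration, not the code from
this series; tables[0], table_created and window_size are made-up container
fields for the example):

static long sketch_attach_group(struct tce_container *container,
                struct iommu_table_group *table_group)
{
        struct iommu_table *tbl = &container->tables[0];
        long ret;

        if (!container->table_created) {
                /* once per container: only allocates the table in memory */
                ret = table_group->ops->create_table(table_group, 0,
                                IOMMU_PAGE_SHIFT_4K, container->window_size,
                                1 /* levels */, tbl);
                if (ret)
                        return ret;
                container->table_created = true;
        }

        /* for every attached group: programs the same table into the TVT */
        return table_group->ops->set_window(table_group, 0, tbl);
}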


> It's also kind of hard to assess whether the relative lifetimes are
> correct of the table_group, struct iommu_table and the actual TCE tables.

That is true. I do not know how to improve this, though.


> Would it make more sense for table_group to become the
> non-vfio-specific counterpart to the vfio container?
> i.e. representing one set of DMA mappings, which one or more PEs could
> be bound to.


table_group is embedded into the PE, and the table/table_group callbacks
access the PE when invalidating the TCE table. So I will need something to
get to the PE. Or just have an array of 2 iommu_table structs.
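
I.e. the callbacks rely on the same pattern as in the patches above - the
embedded table_group is what lets them get back to the PE (sketch):

static void sketch_group_invalidate(struct iommu_table_group *table_group)
{
        struct pnv_ioda_pe *pe = container_of(table_group,
                        struct pnv_ioda_pe, table_group);

        /* pe->pe_number, pe->tce_inval_reg and so on are reachable now */
        pnv_pci_ioda2_tvt_invalidate(pe);
}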



>
>> return 0;
>> fail:
>> @@ -2061,6 +2066,53 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
>> }
>>
>> #ifdef CONFIG_IOMMU_API
>> +static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
>> + int num, __u32 page_shift, __u64 window_size, __u32 levels,
>> + struct iommu_table *tbl)
>> +{
>> + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
>> + table_group);
>> + int nid = pe->phb->hose->node;
>> + __u64 bus_offset = num ? pe->tce_bypass_base : 0;
>> + long ret;
>> +
>> + ret = pnv_pci_create_table(table_group, nid, bus_offset, page_shift,
>> + window_size, levels, tbl);
>> + if (ret)
>> + return ret;
>> +
>> + tbl->it_ops = &pnv_ioda2_iommu_ops;
>> + if (pe->tce_inval_reg)
>> + tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
>> +
>> + return 0;
>> +}
>> +
>> +static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
>> + int num)
>> +{
>> + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
>> + table_group);
>> + struct pnv_phb *phb = pe->phb;
>> + struct iommu_table *tbl = &pe->table_group.tables[num];
>> + long ret;
>> +
>> + pe_info(pe, "Removing DMA window #%d\n", num);
>> +
>> + ret = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
>> + (pe->pe_number << 1) + num,
>> + 0/* levels */, 0/* table address */,
>> + 0/* table size */, 0/* page size */);
>> + if (ret)
>> + pe_warn(pe, "Unmapping failed, ret = %ld\n", ret);
>> + else
>> + pnv_pci_ioda2_tvt_invalidate(pe);
>> +
>> + memset(tbl, 0, sizeof(*tbl));
>> +
>> + return ret;
>> +}
>> +
>> static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
>> {
>> struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
>> @@ -2080,6 +2132,9 @@ static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
>> }
>>
>> static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
>> + .create_table = pnv_pci_ioda2_create_table,
>> + .set_window = pnv_pci_ioda2_set_window,
>> + .unset_window = pnv_pci_ioda2_unset_window,
>> .take_ownership = pnv_ioda2_take_ownership,
>> .release_ownership = pnv_ioda2_release_ownership,
>> };
>> @@ -2102,8 +2157,16 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
>> phb->ioda.m32_pci_base);
>>
>> + pe->table_group.tce32_start = 0;
>> + pe->table_group.tce32_size = phb->ioda.m32_pci_base;
>> + pe->table_group.max_dynamic_windows_supported =
>> + IOMMU_TABLE_GROUP_MAX_TABLES;
>> + pe->table_group.max_levels = POWERNV_IOMMU_MAX_LEVELS;
>> + pe->table_group.pgsizes = SZ_4K | SZ_64K | SZ_16M;
>> +
>> rc = pnv_pci_create_table(&pe->table_group, pe->phb->hose->node,
>> - 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base,
>> + pe->table_group.tce32_start, IOMMU_PAGE_SHIFT_4K,
>> + pe->table_group.tce32_size,
>> POWERNV_IOMMU_DEFAULT_LEVELS, tbl);
>> if (rc) {
>> pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc);
>> @@ -2119,7 +2182,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> pe->table_group.ops = &pnv_pci_ioda2_ops;
>> #endif
>>
>> - rc = pnv_pci_ioda2_set_window(&pe->table_group, tbl);
>> + rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
>> if (rc) {
>> pe_err(pe, "Failed to configure 32-bit TCE table,"
>> " err %ld\n", rc);
>> diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
>> index 7a6fd92..d9de4c7 100644
>> --- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
>> +++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
>> @@ -116,6 +116,8 @@ static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
>> u64 phb_id;
>> int64_t rc;
>> static int primary = 1;
>> + struct iommu_table_group *table_group;
>> + struct iommu_table *tbl;
>>
>> pr_info(" Initializing p5ioc2 PHB %s\n", np->full_name);
>>
>> @@ -181,14 +183,16 @@ static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
>> pnv_pci_init_p5ioc2_msis(phb);
>>
>> /* Setup iommu */
>> - phb->p5ioc2.table_group.tables[0].it_table_group =
>> - &phb->p5ioc2.table_group;
>> + table_group = &phb->p5ioc2.table_group;
>> + tbl = &phb->p5ioc2.table_group.tables[0];
>> + tbl->it_table_group = table_group;
>>
>> /* Setup TCEs */
>> phb->dma_dev_setup = pnv_pci_p5ioc2_dma_dev_setup;
>> - pnv_pci_setup_iommu_table(&phb->p5ioc2.table_group.tables[0],
>> - tce_mem, tce_size, 0,
>> + pnv_pci_setup_iommu_table(tbl, tce_mem, tce_size, 0,
>> IOMMU_PAGE_SHIFT_4K);
>> + table_group->tce32_start = tbl->it_offset << tbl->it_page_shift;
>> + table_group->tce32_size = tbl->it_size << tbl->it_page_shift;
>
> Doesn't pgsizes need to be set here (although it will only include 4K,
> I'm assuming).


No, pgsizes are not returned to userspace for p5ioc2/ioda1 as they do
not support DDW. No pgsizes => no DDW.



>> }
>>
>> void __init pnv_pci_init_p5ioc2_hub(struct device_node *np)
>


--
Alexey

2015-04-29 09:51:46

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 18/32] powerpc/iommu/powernv: Release replaced TCE

On 04/29/2015 02:18 PM, David Gibson wrote:
> On Sat, Apr 25, 2015 at 10:14:42PM +1000, Alexey Kardashevskiy wrote:
>> At the moment writing new TCE value to the IOMMU table fails with EBUSY
>> if there is a valid entry already. However PAPR specification allows
>> the guest to write new TCE value without clearing it first.
>>
>> Another problem this patch is addressing is the use of pool locks for
>> external IOMMU users such as VFIO. The pool locks are to protect
>> DMA page allocator rather than entries and since the host kernel does
>> not control what pages are in use, there is no point in pool locks and
>> exchange()+put_page(oldtce) is sufficient to avoid possible races.
>>
>> This adds an exchange() callback to iommu_table_ops which does the same
>> thing as set() plus it returns replaced TCE and DMA direction so
>> the caller can release the pages afterwards. The exchange() receives
>> a physical address unlike set() which receives linear mapping address;
>> and returns a physical address as the clear() does.
>>
>> This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
>> for a platform to have exchange() implemented in order to support VFIO.
>>
>> This replaces iommu_tce_build() and iommu_clear_tce() with
>> a single iommu_tce_xchg().
>>
>> This makes sure that TCE permission bits are not set in TCE passed to
>> IOMMU API as those are to be calculated by platform code from DMA direction.
>>
>> This moves SetPageDirty() to the IOMMU code to make it work for both
>> VFIO ioctl interface in in-kernel TCE acceleration (when it becomes
>> available later).
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> [aw: for the vfio related changes]
>> Acked-by: Alex Williamson <[email protected]>
>
> This looks mostly good, but there are couple of details that need fixing.
>


[...]

>> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
>> index ba75aa5..e8802ac 100644
>> --- a/arch/powerpc/platforms/powernv/pci.c
>> +++ b/arch/powerpc/platforms/powernv/pci.c
>> @@ -598,6 +598,23 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
>> return 0;
>> }
>>
>> +#ifdef CONFIG_IOMMU_API
>> +int pnv_tce_xchg(struct iommu_table *tbl, long index,
>> + unsigned long *tce, enum dma_data_direction *direction)
>> +{
>> + u64 proto_tce = iommu_direction_to_tce_perm(*direction);
>> + unsigned long newtce = *tce | proto_tce;
>> + unsigned long idx = index - tbl->it_offset;
>
> Should this have a BUG_ON or WARN_ON if the supplied tce has bits set
> below the page mask?


Why? The caller already checks these bits; do we really need to duplicate the check here?


>> + *tce = xchg(pnv_tce(tbl, idx), cpu_to_be64(newtce));
>> + *tce = be64_to_cpu(*tce);
>> + *direction = iommu_tce_direction(*tce);
>> + *tce &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +
>> + return 0;
>> +}
>> +#endif



--
Alexey

2015-04-30 00:38:30

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 17/32] powerpc/powernv: Implement accessor to TCE entry

On Wed, Apr 29, 2015 at 07:02:17PM +1000, Alexey Kardashevskiy wrote:
> On 04/29/2015 02:04 PM, David Gibson wrote:
> >On Sat, Apr 25, 2015 at 10:14:41PM +1000, Alexey Kardashevskiy wrote:
> >>This replaces direct accesses to TCE table with a helper which
> >>returns an TCE entry address. This does not make difference now but will
> >>when multi-level TCE tables get introduces.
> >>
> >>No change in behavior is expected.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >
> >Reviewed-by: David Gibson <[email protected]>
> >
> >
> >>---
> >>Changes:
> >>v9:
> >>* new patch in the series to separate this mechanical change from
> >>functional changes; this is not right before
> >>"powerpc/powernv: Implement multilevel TCE tables" but here in order
> >>to let the next patch - "powerpc/iommu/powernv: Release replaced TCE" -
> >>use pnv_tce() and avoid changing the same code twice
> >>---
> >> arch/powerpc/platforms/powernv/pci.c | 34 +++++++++++++++++++++-------------
> >> 1 file changed, 21 insertions(+), 13 deletions(-)
> >>
> >>diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> >>index 84b4ea4..ba75aa5 100644
> >>--- a/arch/powerpc/platforms/powernv/pci.c
> >>+++ b/arch/powerpc/platforms/powernv/pci.c
> >>@@ -572,38 +572,46 @@ struct pci_ops pnv_pci_ops = {
> >> .write = pnv_pci_write_config,
> >> };
> >>
> >>+static __be64 *pnv_tce(struct iommu_table *tbl, long idx)
> >>+{
> >>+ __be64 *tmp = ((__be64 *)tbl->it_base);
> >>+
> >>+ return tmp + idx;
> >>+}
> >>+
> >> int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> >> unsigned long uaddr, enum dma_data_direction direction,
> >> struct dma_attrs *attrs)
> >> {
> >> u64 proto_tce = iommu_direction_to_tce_perm(direction);
> >>- __be64 *tcep;
> >>- u64 rpn;
> >>+ u64 rpn = __pa(uaddr) >> tbl->it_page_shift;
> >
> >I guess this was a problem in the existing code, not this patch. But
> >"uaddr" is a really bad name (and unsigned long is a bad type) for
> >what must actually be a kernel linear mapping address.
>
>
> Yes and may be one day I'll clean this up. s/uaddr/linear/ and s/hwaddr/hpa/
> are the first things to do globally but not in this patchset.

Ok.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-30 02:29:38

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 09/32] vfio: powerpc/spapr: Rework groups attaching

On 04/29/2015 12:16 PM, David Gibson wrote:
> On Sat, Apr 25, 2015 at 10:14:33PM +1000, Alexey Kardashevskiy wrote:
>> This is to make extended ownership and multiple groups support patches
>> simpler for review.
>>
>> This should cause no behavioural change.
>
> Um.. this doesn't appear to be true. Previously removing a group from
> an enabled container would fail with EBUSY, now it forces a disable.


This is the original tce_iommu_detach_group(), in which I cannot find the
EBUSY you are referring to; it did and still does force a disable. What am
I missing here?

static void tce_iommu_detach_group(void *iommu_data,
                struct iommu_group *iommu_group)
{
        struct tce_container *container = iommu_data;
        struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);

        BUG_ON(!tbl);
        mutex_lock(&container->lock);
        if (tbl != container->tbl) {
                pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
                                iommu_group_id(iommu_group),
                                iommu_group_id(tbl->it_group));
        } else {
                if (container->enabled) {
                        pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
                                        iommu_group_id(tbl->it_group));
                        tce_iommu_disable(container);
                }

                /* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
                                iommu_group_id(iommu_group), iommu_group); */
                container->tbl = NULL;
                tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
                iommu_release_ownership(tbl);
        }
        mutex_unlock(&container->lock);
}



>
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> [aw: for the vfio related changes]
>> Acked-by: Alex Williamson <[email protected]>
>> Reviewed-by: David Gibson <[email protected]>
>> ---
>> drivers/vfio/vfio_iommu_spapr_tce.c | 40 ++++++++++++++++++++++---------------
>> 1 file changed, 24 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index 115d5e6..0fbe03e 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -460,16 +460,21 @@ static int tce_iommu_attach_group(void *iommu_data,
>> iommu_group_id(container->tbl->it_group),
>> iommu_group_id(iommu_group));
>> ret = -EBUSY;
>> - } else if (container->enabled) {
>> + goto unlock_exit;
>> + }
>> +
>> + if (container->enabled) {
>> pr_err("tce_vfio: attaching group #%u to enabled container\n",
>> iommu_group_id(iommu_group));
>> ret = -EBUSY;
>> - } else {
>> - ret = iommu_take_ownership(tbl);
>> - if (!ret)
>> - container->tbl = tbl;
>> + goto unlock_exit;
>> }
>>
>> + ret = iommu_take_ownership(tbl);
>> + if (!ret)
>> + container->tbl = tbl;
>> +
>> +unlock_exit:
>> mutex_unlock(&container->lock);
>>
>> return ret;
>> @@ -487,19 +492,22 @@ static void tce_iommu_detach_group(void *iommu_data,
>> pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
>> iommu_group_id(iommu_group),
>> iommu_group_id(tbl->it_group));
>> - } else {
>> - if (container->enabled) {
>> - pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
>> - iommu_group_id(tbl->it_group));
>> - tce_iommu_disable(container);
>> - }
>> + goto unlock_exit;
>> + }
>>
>> - /* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
>> - iommu_group_id(iommu_group), iommu_group); */
>> - container->tbl = NULL;
>> - tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
>> - iommu_release_ownership(tbl);
>> + if (container->enabled) {
>> + pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
>> + iommu_group_id(tbl->it_group));
>> + tce_iommu_disable(container);
>> }
>> +
>> + /* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
>> + iommu_group_id(iommu_group), iommu_group); */
>> + container->tbl = NULL;
>> + tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
>> + iommu_release_ownership(tbl);
>> +
>> +unlock_exit:
>> mutex_unlock(&container->lock);
>> }
>>
>


--
Alexey

2015-04-30 02:30:57

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 12/32] powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group

On 04/29/2015 12:49 PM, David Gibson wrote:
> On Sat, Apr 25, 2015 at 10:14:36PM +1000, Alexey Kardashevskiy wrote:
>> Modern IBM POWERPC systems support multiple (currently two) TCE tables
>> per IOMMU group (a.k.a. PE). This adds a iommu_table_group container
>> for TCE tables. Right now just one table is supported.
>>
>> For P5IOC2 and IODA, iommu_table_group is embedded into PE struct
>> (pnv_ioda_pe and pnv_phb) and does not require iommu_free_table(), only .
>> iommu_reset_table().
>>
>> For pSeries, this replaces multiple calls of kzalloc_node() with a new
>> iommu_pseries_group_alloc() helper and stores the table group struct
>> pointer into the pci_dn struct. For release, a iommu_table_group_free()
>> helper is added.
>>
>> This should cause no behavioural change.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> [aw: for the vfio related changes]
>> Acked-by: Alex Williamson <[email protected]>
>
> I'm not particularly fond of the "table_group" name, but I can't
> really think of a better name for now. So,


I asked Ben again. iommu_state is not much better either. I'd stick to
iommu_table_group.


> Reviewed-by: David Gibson <[email protected]>





--
Alexey

2015-04-30 02:58:22

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 15/32] powerpc/powernv/ioda/ioda2: Rework TCE invalidation in tce_build()/tce_free()

On 04/29/2015 01:18 PM, David Gibson wrote:
> On Sat, Apr 25, 2015 at 10:14:39PM +1000, Alexey Kardashevskiy wrote:
>> The pnv_pci_ioda_tce_invalidate() helper invalidates TCE cache. It is
>> supposed to be called on IODA1/2 and not called on p5ioc2. It receives
>> start and end host addresses of TCE table.
>>
>> IODA2 actually needs PCI addresses to invalidate the cache. Those
>> can be calculated from host addresses but since we are going
>> to implement multi-level TCE tables, calculating PCI address from
>> a host address might get either tricky or ugly as TCE table remains flat
>> on PCI bus but not in RAM.
>>
>> This moves pnv_pci_ioda_tce_invalidate() from generic pnv_tce_build/
>> pnt_tce_free and defines IODA1/2-specific callbacks which call generic
>> ones and do PHB-model-specific TCE cache invalidation. P5IOC2 keeps
>> using generic callbacks as before.
>>
>> This changes pnv_pci_ioda2_tce_invalidate() to receives TCE index and
>> number of pages which are PCI addresses shifted by IOMMU page shift.
>>
>> No change in behaviour is expected.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>> Changes:
>> v9:
>> * removed confusing comment from commit log about unintentional calling of
>> pnv_pci_ioda_tce_invalidate()
>> * moved mechanical changes away to "powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table"
>> * fixed bug with broken invalidation in pnv_pci_ioda2_tce_invalidate -
>> @index includes @tbl->it_offset but old code added it anyway which later broke
>> DDW
>> ---
>> arch/powerpc/platforms/powernv/pci-ioda.c | 86 +++++++++++++++++++++----------
>> arch/powerpc/platforms/powernv/pci.c | 17 ++----
>> 2 files changed, 64 insertions(+), 39 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 718d5cc..f070c44 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -1665,18 +1665,20 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
>> }
>> }
>>
>> -static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
>> - struct iommu_table *tbl,
>> - __be64 *startp, __be64 *endp, bool rm)
>> +static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
>> + unsigned long index, unsigned long npages, bool rm)
>> {
>> + struct pnv_ioda_pe *pe = container_of(tbl->it_table_group,
>> + struct pnv_ioda_pe, table_group);
>> __be64 __iomem *invalidate = rm ?
>> (__be64 __iomem *)pe->tce_inval_reg_phys :
>> (__be64 __iomem *)tbl->it_index;
>> unsigned long start, end, inc;
>> const unsigned shift = tbl->it_page_shift;
>>
>> - start = __pa(startp);
>> - end = __pa(endp);
>> + start = __pa((__be64 *)tbl->it_base + index - tbl->it_offset);
>> + end = __pa((__be64 *)tbl->it_base + index - tbl->it_offset +
>> + npages - 1);
>
> This doesn't look right. The arguments to __pa don't appear to be
> addresses (since index and if_offset are in units of (TCE) pages, not
> bytes).


tbl->it_base is an address and it is cast to __be64 *, which means:

(char *)tbl->it_base + (index - tbl->it_offset) * sizeof(__be64)

which seems to be correct (I just removed extra parentheses compared to the
old code), no?
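
Or, spelled out as a trivial helper (sketch only, not patch code):

static __be64 *tce_addr_check(struct iommu_table *tbl, long index)
{
        /* pointer arithmetic on __be64 * already scales by sizeof(__be64) */
        __be64 *a = (__be64 *)tbl->it_base + (index - tbl->it_offset);
        /* the same address computed in bytes */
        __be64 *b = (__be64 *)((char *)tbl->it_base +
                        (index - tbl->it_offset) * sizeof(__be64));

        WARN_ON(a != b);
        return a;
}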


>
>>
>> /* BML uses this case for p6/p7/galaxy2: Shift addr and put in node */
>> if (tbl->it_busno) {
>> @@ -1712,16 +1714,40 @@ static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
>> */
>> }
>>
>> +static int pnv_ioda1_tce_build(struct iommu_table *tbl, long index,
>> + long npages, unsigned long uaddr,
>> + enum dma_data_direction direction,
>> + struct dma_attrs *attrs)
>> +{
>> + long ret = pnv_tce_build(tbl, index, npages, uaddr, direction,
>> + attrs);
>> +
>> + if (!ret && (tbl->it_type & TCE_PCI_SWINV_CREATE))
>> + pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false);
>> +
>> + return ret;
>> +}
>> +
>> +static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
>> + long npages)
>> +{
>> + pnv_tce_free(tbl, index, npages);
>> +
>> + if (tbl->it_type & TCE_PCI_SWINV_FREE)
>> + pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false);
>> +}
>> +
>> static struct iommu_table_ops pnv_ioda1_iommu_ops = {
>> - .set = pnv_tce_build,
>> - .clear = pnv_tce_free,
>> + .set = pnv_ioda1_tce_build,
>> + .clear = pnv_ioda1_tce_free,
>> .get = pnv_tce_get,
>> };
>>
>> -static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
>> - struct iommu_table *tbl,
>> - __be64 *startp, __be64 *endp, bool rm)
>> +static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
>> + unsigned long index, unsigned long npages, bool rm)
>> {
>> + struct pnv_ioda_pe *pe = container_of(tbl->it_table_group,
>> + struct pnv_ioda_pe, table_group);
>> unsigned long start, end, inc;
>> __be64 __iomem *invalidate = rm ?
>> (__be64 __iomem *)pe->tce_inval_reg_phys :
>> @@ -1734,10 +1760,8 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
>> end = start;
>>
>> /* Figure out the start, end and step */
>> - inc = tbl->it_offset + (((u64)startp - tbl->it_base) / sizeof(u64));
>> - start |= (inc << shift);
>> - inc = tbl->it_offset + (((u64)endp - tbl->it_base) / sizeof(u64));
>> - end |= (inc << shift);
>> + start |= (index << shift);
>> + end |= ((index + npages - 1) << shift);
>> inc = (0x1ull << shift);
>> mb();
>>
>> @@ -1750,22 +1774,32 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
>> }
>> }
>>
>> -void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
>> - __be64 *startp, __be64 *endp, bool rm)
>> +static int pnv_ioda2_tce_build(struct iommu_table *tbl, long index,
>> + long npages, unsigned long uaddr,
>> + enum dma_data_direction direction,
>> + struct dma_attrs *attrs)
>> {
>> - struct pnv_ioda_pe *pe = container_of(tbl->it_table_group,
>> - struct pnv_ioda_pe, table_group);
>> - struct pnv_phb *phb = pe->phb;
>> -
>> - if (phb->type == PNV_PHB_IODA1)
>> - pnv_pci_ioda1_tce_invalidate(pe, tbl, startp, endp, rm);
>> - else
>> - pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
>> + long ret = pnv_tce_build(tbl, index, npages, uaddr, direction,
>> + attrs);
>> +
>> + if (!ret && (tbl->it_type & TCE_PCI_SWINV_CREATE))
>> + pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
>> +
>> + return ret;
>> +}
>> +
>> +static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
>> + long npages)
>> +{
>> + pnv_tce_free(tbl, index, npages);
>> +
>> + if (tbl->it_type & TCE_PCI_SWINV_FREE)
>> + pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
>> }
>>
>> static struct iommu_table_ops pnv_ioda2_iommu_ops = {
>> - .set = pnv_tce_build,
>> - .clear = pnv_tce_free,
>> + .set = pnv_ioda2_tce_build,
>> + .clear = pnv_ioda2_tce_free,
>> .get = pnv_tce_get,
>> };
>>
>> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
>> index 4c3bbb1..84b4ea4 100644
>> --- a/arch/powerpc/platforms/powernv/pci.c
>> +++ b/arch/powerpc/platforms/powernv/pci.c
>> @@ -577,37 +577,28 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
>> struct dma_attrs *attrs)
>> {
>> u64 proto_tce = iommu_direction_to_tce_perm(direction);
>> - __be64 *tcep, *tces;
>> + __be64 *tcep;
>> u64 rpn;
>>
>> - tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
>> + tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
>> rpn = __pa(uaddr) >> tbl->it_page_shift;
>>
>> while (npages--)
>> *(tcep++) = cpu_to_be64(proto_tce |
>> (rpn++ << tbl->it_page_shift));
>>
>> - /* Some implementations won't cache invalid TCEs and thus may not
>> - * need that flush. We'll probably turn it_type into a bit mask
>> - * of flags if that becomes the case
>> - */
>> - if (tbl->it_type & TCE_PCI_SWINV_CREATE)
>> - pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, false);
>>
>> return 0;
>> }
>>
>> void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
>> {
>> - __be64 *tcep, *tces;
>> + __be64 *tcep;
>>
>> - tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
>> + tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
>>
>> while (npages--)
>> *(tcep++) = cpu_to_be64(0);
>> -
>> - if (tbl->it_type & TCE_PCI_SWINV_FREE)
>> - pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, false);
>> }
>>
>> unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
>


--
Alexey

2015-04-30 04:38:03

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 09/32] vfio: powerpc/spapr: Rework groups attaching

On Thu, Apr 30, 2015 at 12:29:30PM +1000, Alexey Kardashevskiy wrote:
> On 04/29/2015 12:16 PM, David Gibson wrote:
> >On Sat, Apr 25, 2015 at 10:14:33PM +1000, Alexey Kardashevskiy wrote:
> >>This is to make extended ownership and multiple groups support patches
> >>simpler for review.
> >>
> >>This should cause no behavioural change.
> >
> >Um.. this doesn't appear to be true. Previously removing a group from
> >an enabled container would fail with EBUSY, now it forces a disable.
>
>
> This is the original tce_iommu_detach_group() where I cannot find EBUSY you
> are referring to; it did and does enforce disable. What am I missing
> here?

Sorry, my mistake. I misread the patch.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-30 04:38:43

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 13/32] vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control

On Wed, Apr 29, 2015 at 07:19:51PM +1000, Alexey Kardashevskiy wrote:
> On 04/29/2015 01:02 PM, David Gibson wrote:
> >On Sat, Apr 25, 2015 at 10:14:37PM +1000, Alexey Kardashevskiy wrote:
> >>This adds tce_iommu_take_ownership() and tce_iommu_release_ownership
> >>which call in a loop iommu_take_ownership()/iommu_release_ownership()
> >>for every table on the group. As there is just one now, no change in
> >>behaviour is expected.
> >>
> >>At the moment the iommu_table struct has a set_bypass() which enables/
> >>disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
> >>which calls this callback when external IOMMU users such as VFIO are
> >>about to take control over a PHB.
> >>
> >>The set_bypass() callback is not really an iommu_table function but
> >>IOMMU/PE function. This introduces a iommu_table_group_ops struct and
> >>adds take_ownership()/release_ownership() callbacks to it which are
> >>called when an external user takes/releases control over the IOMMU.
> >>
> >>This replaces set_bypass() with ownership callbacks as it is not
> >>necessarily just bypass enabling, it can be something else/more
> >>so let's give it a more generic name.
> >>
> >>The callbacks are implemented for IODA2 only. Other platforms (P5IOC2,
> >>IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
> >>The following patches will replace iommu_take_ownership/
> >>iommu_release_ownership calls in IODA2 with full IOMMU table release/
> >>create.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>[aw: for the vfio related changes]
> >>Acked-by: Alex Williamson <[email protected]>
> >>---
> >>Changes:
> >>v9:
> >>* squashed "vfio: powerpc/spapr: powerpc/iommu: Rework IOMMU ownership control"
> >>and "vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework IOMMU ownership control"
> >>into a single patch
> >>* moved helpers with a loop through tables in a group
> >>to vfio_iommu_spapr_tce.c to keep the platform code free of IOMMU table
> >>groups as much as possible
> >>* added missing tce_iommu_clear() to tce_iommu_release_ownership()
> >>* replaced the set_ownership(enable) callback with take_ownership() and
> >>release_ownership()
> >>---
> >> arch/powerpc/include/asm/iommu.h | 13 +++++-
> >> arch/powerpc/kernel/iommu.c | 11 ------
> >> arch/powerpc/platforms/powernv/pci-ioda.c | 40 +++++++++++++++----
> >> drivers/vfio/vfio_iommu_spapr_tce.c | 66 +++++++++++++++++++++++++++----
> >> 4 files changed, 103 insertions(+), 27 deletions(-)
> >>
> >>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>index fa37519..e63419e 100644
> >>--- a/arch/powerpc/include/asm/iommu.h
> >>+++ b/arch/powerpc/include/asm/iommu.h
> >>@@ -93,7 +93,6 @@ struct iommu_table {
> >> unsigned long it_page_shift;/* table iommu page size */
> >> struct iommu_table_group *it_table_group;
> >> struct iommu_table_ops *it_ops;
> >>- void (*set_bypass)(struct iommu_table *tbl, bool enable);
> >> };
> >>
> >> /* Pure 2^n version of get_order */
> >>@@ -128,11 +127,23 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
> >>
> >> #define IOMMU_TABLE_GROUP_MAX_TABLES 1
> >>
> >>+struct iommu_table_group;
> >>+
> >>+struct iommu_table_group_ops {
> >>+ /*
> >>+ * Switches ownership from the kernel itself to an external
> >>+ * user. While onwership is taken, the kernel cannot use IOMMU itself.
> >
> >Typo in "onwership". I'd also like to see this be even more explicit
> >that "take" is the "core kernel -> vfio/whatever" transition and
> >release is the reverse.
>
>
> Will this work?
>
> /*
> * Switches ownership from the kernel itself to an external
> * user.
> * The ownership is taken when VFIO starts using the IOMMU group
> * and released when the platform code gets the control over the group back.
> * While ownership is taken, the platform code cannot use IOMMU itself.
> */

Hrm, verbose and still doesn't emphasise the point that always
confuses me enough. I'd prefer:

/* Switch ownership from platform code to external user (e.g. VFIO) */

above "take" then

/* Switch ownership from external user (e.g. VFIO) back to core */

above "release"..

>
>
> >>+ */
> >>+ void (*take_ownership)(struct iommu_table_group *table_group);
> >>+ void (*release_ownership)(struct iommu_table_group *table_group);
> >>+};
> >>+
> >> struct iommu_table_group {
> >> #ifdef CONFIG_IOMMU_API
> >> struct iommu_group *group;
> >> #endif
> >> struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
> >>+ struct iommu_table_group_ops *ops;
> >> };
> >>
> >> #ifdef CONFIG_IOMMU_API
> >>diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> >>index 005146b..2856d27 100644
> >>--- a/arch/powerpc/kernel/iommu.c
> >>+++ b/arch/powerpc/kernel/iommu.c
> >>@@ -1057,13 +1057,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
> >>
> >> memset(tbl->it_map, 0xff, sz);
> >>
> >>- /*
> >>- * Disable iommu bypass, otherwise the user can DMA to all of
> >>- * our physical memory via the bypass window instead of just
> >>- * the pages that has been explicitly mapped into the iommu
> >>- */
> >>- if (tbl->set_bypass)
> >>- tbl->set_bypass(tbl, false);
> >>
> >> return 0;
> >> }
> >>@@ -1078,10 +1071,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
> >> /* Restore bit#0 set by iommu_init_table() */
> >> if (tbl->it_offset == 0)
> >> set_bit(0, tbl->it_map);
> >>-
> >>- /* The kernel owns the device now, we can restore the iommu bypass */
> >>- if (tbl->set_bypass)
> >>- tbl->set_bypass(tbl, true);
> >> }
> >> EXPORT_SYMBOL_GPL(iommu_release_ownership);
> >>
> >>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>index 88472cb..718d5cc 100644
> >>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>@@ -1870,10 +1870,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> >> __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
> >> }
> >>
> >>-static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
> >>+static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
> >> {
> >>- struct pnv_ioda_pe *pe = container_of(tbl->it_table_group,
> >>- struct pnv_ioda_pe, table_group);
> >> uint16_t window_id = (pe->pe_number << 1 ) + 1;
> >> int64_t rc;
> >>
> >>@@ -1901,7 +1899,8 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
> >> * host side.
> >> */
> >> if (pe->pdev)
> >>- set_iommu_table_base(&pe->pdev->dev, tbl);
> >>+ set_iommu_table_base(&pe->pdev->dev,
> >>+ &pe->table_group.tables[0]);
> >> else
> >> pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
> >> }
> >>@@ -1917,13 +1916,35 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
> >> /* TVE #1 is selected by PCI address bit 59 */
> >> pe->tce_bypass_base = 1ull << 59;
> >>
> >>- /* Install set_bypass callback for VFIO */
> >>- pe->table_group.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;
> >>-
> >> /* Enable bypass by default */
> >>- pnv_pci_ioda2_set_bypass(&pe->table_group.tables[0], true);
> >>+ pnv_pci_ioda2_set_bypass(pe, true);
> >> }
> >>
> >>+#ifdef CONFIG_IOMMU_API
> >>+static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
> >>+{
> >>+ struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> >>+ table_group);
> >>+
> >>+ iommu_take_ownership(&table_group->tables[0]);
> >>+ pnv_pci_ioda2_set_bypass(pe, false);
> >>+}
> >>+
> >>+static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
> >>+{
> >>+ struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> >>+ table_group);
> >>+
> >>+ iommu_release_ownership(&table_group->tables[0]);
> >>+ pnv_pci_ioda2_set_bypass(pe, true);
> >>+}
> >>+
> >>+static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
> >>+ .take_ownership = pnv_ioda2_take_ownership,
> >>+ .release_ownership = pnv_ioda2_release_ownership,
> >>+};
> >>+#endif
> >>+
> >> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >> struct pnv_ioda_pe *pe)
> >> {
> >>@@ -1991,6 +2012,9 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >> }
> >> tbl->it_ops = &pnv_ioda2_iommu_ops;
> >> iommu_init_table(tbl, phb->hose->node);
> >>+#ifdef CONFIG_IOMMU_API
> >>+ pe->table_group.ops = &pnv_pci_ioda2_ops;
> >>+#endif
> >>
> >> if (pe->flags & PNV_IODA_PE_DEV) {
> >> iommu_register_group(&pe->table_group, phb->hose->global_number,
> >>diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>index 17e884a..dacc738 100644
> >>--- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >>+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>@@ -483,6 +483,43 @@ static long tce_iommu_ioctl(void *iommu_data,
> >> return -ENOTTY;
> >> }
> >>
> >>+static void tce_iommu_release_ownership(struct tce_container *container,
> >>+ struct iommu_table_group *table_group)
> >>+{
> >>+ int i;
> >>+
> >>+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> >>+ struct iommu_table *tbl = &table_group->tables[i];
> >>+
> >>+ tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
> >>+ if (tbl->it_map)
> >>+ iommu_release_ownership(tbl);
> >>+ }
> >>+}
> >>+
> >>+static int tce_iommu_take_ownership(struct iommu_table_group *table_group)
> >>+{
> >>+ int i, j, rc = 0;
> >>+
> >>+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> >>+ struct iommu_table *tbl = &table_group->tables[i];
> >>+
> >>+ if (!tbl->it_map)
> >>+ continue;
> >>+
> >>+ rc = iommu_take_ownership(tbl);
> >>+ if (rc) {
> >>+ for (j = 0; j < i; ++j)
> >>+ iommu_release_ownership(
> >>+ &table_group->tables[j]);
> >>+
> >>+ return rc;
> >>+ }
> >>+ }
> >>+
> >>+ return 0;
> >>+}
> >>+
> >> static int tce_iommu_attach_group(void *iommu_data,
> >> struct iommu_group *iommu_group)
> >> {
> >>@@ -515,9 +552,23 @@ static int tce_iommu_attach_group(void *iommu_data,
> >> goto unlock_exit;
> >> }
> >>
> >>- ret = iommu_take_ownership(&table_group->tables[0]);
> >>- if (!ret)
> >>- container->grp = iommu_group;
> >>+ if (!table_group->ops || !table_group->ops->take_ownership ||
> >>+ !table_group->ops->release_ownership) {
> >>+ ret = tce_iommu_take_ownership(table_group);
> >
> >Haven't looked at the rest of the series. I'm hoping that you're
> >eventually planning to replace this fallback with setting the
> >take_ownership call for p5ioc etc. to point to
> >tce_iommu_take_ownership.
>
>
> Why? I do not really want p5ioc2 or ioda1 to have
> take_ownership/release_ownership callbacks defined, as they would only do this
> default stuff, which is never going to change since this hardware is quite
> old and extremely rare, so there is no real customer for it. Should I still
> convert these to callbacks?

Leave it for now - having this fallback makes more sense in light of
the changes later in the series that I hadn't read when I made the
comment.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-30 04:39:46

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 15/32] powerpc/powernv/ioda/ioda2: Rework TCE invalidation in tce_build()/tce_free()

On Thu, Apr 30, 2015 at 12:58:12PM +1000, Alexey Kardashevskiy wrote:
> On 04/29/2015 01:18 PM, David Gibson wrote:
> >On Sat, Apr 25, 2015 at 10:14:39PM +1000, Alexey Kardashevskiy wrote:
> >>The pnv_pci_ioda_tce_invalidate() helper invalidates TCE cache. It is
> >>supposed to be called on IODA1/2 and not called on p5ioc2. It receives
> >>start and end host addresses of TCE table.
> >>
> >>IODA2 actually needs PCI addresses to invalidate the cache. Those
> >>can be calculated from host addresses but since we are going
> >>to implement multi-level TCE tables, calculating PCI address from
> >>a host address might get either tricky or ugly as TCE table remains flat
> >>on PCI bus but not in RAM.
> >>
> >>This moves pnv_pci_ioda_tce_invalidate() from generic pnv_tce_build/
> >>pnt_tce_free and defines IODA1/2-specific callbacks which call generic
> >>ones and do PHB-model-specific TCE cache invalidation. P5IOC2 keeps
> >>using generic callbacks as before.
> >>
> >>This changes pnv_pci_ioda2_tce_invalidate() to receive TCE index and
> >>number of pages which are PCI addresses shifted by IOMMU page shift.
> >>
> >>No change in behaviour is expected.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>---
> >>Changes:
> >>v9:
> >>* removed confusing comment from commit log about unintentional calling of
> >>pnv_pci_ioda_tce_invalidate()
> >>* moved mechanical changes away to "powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table"
> >>* fixed bug with broken invalidation in pnv_pci_ioda2_tce_invalidate -
> >>@index includes @tbl->it_offset but old code added it anyway which later broke
> >>DDW
> >>---
> >> arch/powerpc/platforms/powernv/pci-ioda.c | 86 +++++++++++++++++++++----------
> >> arch/powerpc/platforms/powernv/pci.c | 17 ++----
> >> 2 files changed, 64 insertions(+), 39 deletions(-)
> >>
> >>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>index 718d5cc..f070c44 100644
> >>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>@@ -1665,18 +1665,20 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
> >> }
> >> }
> >>
> >>-static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
> >>- struct iommu_table *tbl,
> >>- __be64 *startp, __be64 *endp, bool rm)
> >>+static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
> >>+ unsigned long index, unsigned long npages, bool rm)
> >> {
> >>+ struct pnv_ioda_pe *pe = container_of(tbl->it_table_group,
> >>+ struct pnv_ioda_pe, table_group);
> >> __be64 __iomem *invalidate = rm ?
> >> (__be64 __iomem *)pe->tce_inval_reg_phys :
> >> (__be64 __iomem *)tbl->it_index;
> >> unsigned long start, end, inc;
> >> const unsigned shift = tbl->it_page_shift;
> >>
> >>- start = __pa(startp);
> >>- end = __pa(endp);
> >>+ start = __pa((__be64 *)tbl->it_base + index - tbl->it_offset);
> >>+ end = __pa((__be64 *)tbl->it_base + index - tbl->it_offset +
> >>+ npages - 1);
> >
> >This doesn't look right. The arguments to __pa don't appear to be
> >addresses (since index and if_offset are in units of (TCE) pages, not
> >bytes).
>
>
> tbl->it_base is an address and it is casted to __be64* which means:
>
> (char*)tbl->it_base + (index - tbl->it_offset)*sizeof(__be64).
>
> Which seems to be correct (I just removed extra braces compared to the old
> code), no?

Ah, yes, I'm just forgetting my C pointer arithmetic rules.
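
For anyone else tripping over the same thing, a stand-alone user-space sketch
of the equivalence, with made-up values and uint64_t standing in for __be64:

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint64_t table[16];
	unsigned long it_base = (unsigned long)table;	/* like tbl->it_base */
	unsigned long it_offset = 2, index = 5;		/* hypothetical values */

	uint64_t *a = ((uint64_t *)it_base) + index - it_offset;
	uint64_t *b = (uint64_t *)((char *)it_base +
			(index - it_offset) * sizeof(uint64_t));

	assert(a == b);		/* both point at the same TCE entry */
	return 0;
}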

Reviewed-by: David Gibson <[email protected]>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-30 04:38:05

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 16/32] powerpc/powernv/ioda: Move TCE kill register address to PE

On Wed, Apr 29, 2015 at 07:00:30PM +1000, Alexey Kardashevskiy wrote:
> On 04/29/2015 01:25 PM, David Gibson wrote:
> >On Sat, Apr 25, 2015 at 10:14:40PM +1000, Alexey Kardashevskiy wrote:
> >>At the moment the DMA setup code looks for the "ibm,opal-tce-kill" property
> >>which contains the TCE kill register address. Writes to this register
> >>invalidate TCE cache on IODA/IODA2 hub.
> >>
> >>This moves the register address from iommu_table to pnv_ioda_pe as
> >>later there will be 2 tables per PE and it will be used for both tables.
> >>
> >>This moves the property reading/remapping code to a helper to reduce
> >>code duplication.
> >>
> >>This adds a new pnv_pci_ioda2_tvt_invalidate() helper which invalidates
> >>the entire table. It should be called after every call to
> >>opal_pci_map_pe_dma_window(). It was not required before because
> >>there is just a single TCE table and 64bit DMA is handled via bypass
> >>window (which has no table so no cache is used) but this is going
> >>to change with Dynamic DMA windows (DDW).
> >>
> >>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>---
> >>Changes:
> >>v9:
> >>* new in the series
> >>---
> >> arch/powerpc/platforms/powernv/pci-ioda.c | 69 +++++++++++++++++++------------
> >> arch/powerpc/platforms/powernv/pci.h | 1 +
> >> 2 files changed, 44 insertions(+), 26 deletions(-)
> >>
> >>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>index f070c44..b22b3ca 100644
> >>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>@@ -1672,7 +1672,7 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
> >> struct pnv_ioda_pe, table_group);
> >> __be64 __iomem *invalidate = rm ?
> >> (__be64 __iomem *)pe->tce_inval_reg_phys :
> >>- (__be64 __iomem *)tbl->it_index;
> >>+ pe->tce_inval_reg;
> >> unsigned long start, end, inc;
> >> const unsigned shift = tbl->it_page_shift;
> >>
> >>@@ -1743,6 +1743,18 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
> >> .get = pnv_tce_get,
> >> };
> >>
> >>+static inline void pnv_pci_ioda2_tvt_invalidate(struct pnv_ioda_pe *pe)
> >>+{
> >>+ /* 01xb - invalidate TCEs that match the specified PE# */
> >>+ unsigned long addr = (0x4ull << 60) | (pe->pe_number & 0xFF);
> >
> >This doesn't really look like an address, but rather the data you're
> >writing to the register.
>
>
> This thing is made of "invalidate operation" (0x4 here), "invalidate
> address" (pci address but it is zero here as we reset everything, most bits
> are here) and "invalidate PE number". So what should I call it? :)

Ah, I see. An address from the hardware point of view, but not so
much from the kernel point of view. Probably just call it 'val' or
'data'.
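
Whatever it ends up being called, the composition is simple enough to sketch
in user space; the field layout follows the quoted patch and is purely
illustrative:

#include <stdint.h>
#include <stdio.h>

/* "invalidate everything that matches the given PE#" */
static uint64_t tce_kill_pe_val(unsigned int pe_number)
{
	uint64_t op   = 0x4ull << 60;		/* 01xb: invalidate by PE# */
	uint64_t addr = 0;			/* no PCI address: whole PE */
	uint64_t pe   = pe_number & 0xFF;	/* invalidate PE number */

	return op | addr | pe;
}

int main(void)
{
	printf("val for PE 3: 0x%016llx\n",
			(unsigned long long)tce_kill_pe_val(3));
	return 0;
}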

>
>
>
> >>+ if (!pe->tce_inval_reg)
> >>+ return;
> >>+
> >>+ mb(); /* Ensure above stores are visible */
> >>+ __raw_writeq(cpu_to_be64(addr), pe->tce_inval_reg);
> >>+}
> >>+
> >> static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
> >> unsigned long index, unsigned long npages, bool rm)
> >> {
> >>@@ -1751,7 +1763,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
> >> unsigned long start, end, inc;
> >> __be64 __iomem *invalidate = rm ?
> >> (__be64 __iomem *)pe->tce_inval_reg_phys :
> >>- (__be64 __iomem *)tbl->it_index;
> >>+ pe->tce_inval_reg;
> >> const unsigned shift = tbl->it_page_shift;
> >>
> >> /* We'll invalidate DMA address in PE scope */
> >>@@ -1803,13 +1815,31 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> >> .get = pnv_tce_get,
> >> };
> >>
> >>+static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
> >>+ struct pnv_ioda_pe *pe)
> >>+{
> >>+ const __be64 *swinvp;
> >>+
> >>+ /* OPAL variant of PHB3 invalidated TCEs */
> >>+ swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
> >>+ if (!swinvp)
> >>+ return;
> >>+
> >>+ /* We need a couple more fields -- an address and a data
> >>+ * to or. Since the bus is only printed out on table free
> >>+ * errors, and on the first pass the data will be a relative
> >>+ * bus number, print that out instead.
> >>+ */
> >
> >The comment above appears to have nothing to do with the surrounding code.
>
> I'll just remove it.

Ok, good.

>
>
> >
> >>+ pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
> >>+ pe->tce_inval_reg = ioremap(pe->tce_inval_reg_phys, 8);
> >>+}
> >>+
> >> static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> >> struct pnv_ioda_pe *pe, unsigned int base,
> >> unsigned int segs)
> >> {
> >>
> >> struct page *tce_mem = NULL;
> >>- const __be64 *swinvp;
> >> struct iommu_table *tbl;
> >> unsigned int i;
> >> int64_t rc;
> >>@@ -1823,6 +1853,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> >> if (WARN_ON(pe->tce32_seg >= 0))
> >> return;
> >>
> >>+ pnv_pci_ioda_setup_opal_tce_kill(phb, pe);
> >>+
> >> /* Grab a 32-bit TCE table */
> >> pe->tce32_seg = base;
> >> pe_info(pe, " Setting up 32-bit TCE table at %08x..%08x\n",
> >>@@ -1865,20 +1897,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> >> base << 28, IOMMU_PAGE_SHIFT_4K);
> >>
> >> /* OPAL variant of P7IOC SW invalidated TCEs */
> >>- swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
> >>- if (swinvp) {
> >>- /* We need a couple more fields -- an address and a data
> >>- * to or. Since the bus is only printed out on table free
> >>- * errors, and on the first pass the data will be a relative
> >>- * bus number, print that out instead.
> >>- */
> >
> >.. although I guess it didn't make any more sense in its original context.
> >
> >>- pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
> >>- tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys,
> >>- 8);
> >>+ if (pe->tce_inval_reg)
> >> tbl->it_type |= (TCE_PCI_SWINV_CREATE |
> >> TCE_PCI_SWINV_FREE |
> >> TCE_PCI_SWINV_PAIR);
> >>- }
> >>+
> >> tbl->it_ops = &pnv_ioda1_iommu_ops;
> >> iommu_init_table(tbl, phb->hose->node);
> >>
> >>@@ -1984,7 +2007,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >> {
> >> struct page *tce_mem = NULL;
> >> void *addr;
> >>- const __be64 *swinvp;
> >> struct iommu_table *tbl;
> >> unsigned int tce_table_size, end;
> >> int64_t rc;
> >>@@ -1993,6 +2015,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >> if (WARN_ON(pe->tce32_seg >= 0))
> >> return;
> >>
> >>+ pnv_pci_ioda_setup_opal_tce_kill(phb, pe);
> >>+
> >> /* The PE will reserve all possible 32-bits space */
> >> pe->tce32_seg = 0;
> >> end = (1 << ilog2(phb->ioda.m32_pci_base));
> >>@@ -2023,6 +2047,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >> goto fail;
> >> }
> >>
> >>+ pnv_pci_ioda2_tvt_invalidate(pe);
> >>+
> >
> >This looks to be a change in behaviour - if it's replacing a previous
> >invalidation, I'm not seeing where.
>
>
> It is a new thing and the patch adds it. And it does not say anywhere that
> this patch does not change behavior.

Ah, ok, I think I see.

Seems I was even more tired than I realised yesterday and making a
bunch of mistakes while reviewing.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-30 04:38:45

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 18/32] powerpc/iommu/powernv: Release replaced TCE

On Wed, Apr 29, 2015 at 07:51:21PM +1000, Alexey Kardashevskiy wrote:
> On 04/29/2015 02:18 PM, David Gibson wrote:
> >On Sat, Apr 25, 2015 at 10:14:42PM +1000, Alexey Kardashevskiy wrote:
> >>At the moment writing new TCE value to the IOMMU table fails with EBUSY
> >>if there is a valid entry already. However PAPR specification allows
> >>the guest to write new TCE value without clearing it first.
> >>
> >>Another problem this patch is addressing is the use of pool locks for
> >>external IOMMU users such as VFIO. The pool locks are to protect
> >>DMA page allocator rather than entries and since the host kernel does
> >>not control what pages are in use, there is no point in pool locks and
> >>exchange()+put_page(oldtce) is sufficient to avoid possible races.
> >>
> >>This adds an exchange() callback to iommu_table_ops which does the same
> >>thing as set() plus it returns replaced TCE and DMA direction so
> >>the caller can release the pages afterwards. The exchange() receives
> >>a physical address unlike set() which receives linear mapping address;
> >>and returns a physical address as the clear() does.
> >>
> >>This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
> >>for a platform to have exchange() implemented in order to support VFIO.
> >>
> >>This replaces iommu_tce_build() and iommu_clear_tce() with
> >>a single iommu_tce_xchg().
> >>
> >>This makes sure that TCE permission bits are not set in TCE passed to
> >>IOMMU API as those are to be calculated by platform code from DMA direction.
> >>
> >>This moves SetPageDirty() to the IOMMU code to make it work for both
> >>VFIO ioctl interface and in-kernel TCE acceleration (when it becomes
> >>available later).
> >>
> >>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>[aw: for the vfio related changes]
> >>Acked-by: Alex Williamson <[email protected]>
> >
> >This looks mostly good, but there are couple of details that need fixing.
> >
>
>
> [...]
>
> >>diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> >>index ba75aa5..e8802ac 100644
> >>--- a/arch/powerpc/platforms/powernv/pci.c
> >>+++ b/arch/powerpc/platforms/powernv/pci.c
> >>@@ -598,6 +598,23 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> >> return 0;
> >> }
> >>
> >>+#ifdef CONFIG_IOMMU_API
> >>+int pnv_tce_xchg(struct iommu_table *tbl, long index,
> >>+ unsigned long *tce, enum dma_data_direction *direction)
> >>+{
> >>+ u64 proto_tce = iommu_direction_to_tce_perm(*direction);
> >>+ unsigned long newtce = *tce | proto_tce;
> >>+ unsigned long idx = index - tbl->it_offset;
> >
> >Should this have a BUG_ON or WARN_ON if the supplied tce has bits set
> >below the page mask?
>
>
> Why? The caller checks these bits, do we really need to duplicate it
> here?

Because this is the crunch point where bad bits will actually cause
strange stuff to happen.

As much as anything the point of a BUG_ON would be to document that
this function expects the parameter to be aligned.
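
Something as small as the following would both document and catch it; this is
a user-space stand-in, the real thing would be a WARN_ON/BUG_ON inside
pnv_tce_xchg(), and the names and page shift here are made up:

#include <stdio.h>

#define IT_PAGE_SHIFT 12	/* 4K IOMMU pages, hypothetical */

static void check_tce_aligned(unsigned long tce)
{
	unsigned long page_mask = (1UL << IT_PAGE_SHIFT) - 1;

	if (tce & page_mask)	/* WARN_ON_ONCE(...) in the kernel */
		fprintf(stderr, "unaligned TCE 0x%lx\n", tce);
}

int main(void)
{
	check_tce_aligned(0x10000);	/* aligned: silent */
	check_tce_aligned(0x10234);	/* would warn */
	return 0;
}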

>
>
> >>+ *tce = xchg(pnv_tce(tbl, idx), cpu_to_be64(newtce));
> >>+ *tce = be64_to_cpu(*tce);
> >>+ *direction = iommu_tce_direction(*tce);
> >>+ *tce &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >>+
> >>+ return 0;
> >>+}
> >>+#endif
>
>
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-30 04:39:48

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 20/32] powerpc/powernv/ioda2: Introduce pnv_pci_create_table/pnv_pci_free_table

On Wed, Apr 29, 2015 at 07:12:37PM +1000, Alexey Kardashevskiy wrote:
> On 04/29/2015 02:39 PM, David Gibson wrote:
> >On Sat, Apr 25, 2015 at 10:14:44PM +1000, Alexey Kardashevskiy wrote:
> >>This is a part of moving TCE table allocation into an iommu_ops
> >>callback to support multiple IOMMU groups per one VFIO container.
> >>
> >>This moves a table creation window to the file with common powernv-pci
> >>helpers as it does not do anything IODA2-specific.
> >>
> >>This adds pnv_pci_free_table() helper to release the actual TCE table.
> >>
> >>This enforces window size to be a power of two.
> >>
> >>This should cause no behavioural change.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>Reviewed-by: David Gibson <[email protected]>
> >>---
> >>Changes:
> >>v9:
> >>* moved helpers to the common powernv pci.c file from pci-ioda.c
> >>* moved bits from pnv_pci_create_table() to pnv_alloc_tce_table_pages()
> >>---
> >> arch/powerpc/platforms/powernv/pci-ioda.c | 36 ++++++------------
> >> arch/powerpc/platforms/powernv/pci.c | 61 +++++++++++++++++++++++++++++++
> >> arch/powerpc/platforms/powernv/pci.h | 4 ++
> >> 3 files changed, 76 insertions(+), 25 deletions(-)
> >>
> >>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>index a80be34..b9b3773 100644
> >>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>@@ -1307,8 +1307,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
> >> if (rc)
> >> pe_warn(pe, "OPAL error %ld release DMA window\n", rc);
> >>
> >>- iommu_reset_table(tbl, of_node_full_name(dev->dev.of_node));
> >>- free_pages(addr, get_order(TCE32_TABLE_SIZE));
> >>+ pnv_pci_free_table(tbl);
> >> }
> >>
> >> static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs)
> >>@@ -2039,10 +2038,7 @@ static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
> >> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >> struct pnv_ioda_pe *pe)
> >> {
> >>- struct page *tce_mem = NULL;
> >>- void *addr;
> >> struct iommu_table *tbl = &pe->table_group.tables[0];
> >>- unsigned int tce_table_size, end;
> >> int64_t rc;
> >>
> >> /* We shouldn't already have a 32-bit DMA associated */
> >>@@ -2053,29 +2049,20 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >>
> >> /* The PE will reserve all possible 32-bits space */
> >> pe->tce32_seg = 0;
> >>- end = (1 << ilog2(phb->ioda.m32_pci_base));
> >>- tce_table_size = (end / 0x1000) * 8;
> >> pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
> >>- end);
> >>+ phb->ioda.m32_pci_base);
> >>
> >>- /* Allocate TCE table */
> >>- tce_mem = alloc_pages_node(phb->hose->node, GFP_KERNEL,
> >>- get_order(tce_table_size));
> >>- if (!tce_mem) {
> >>- pe_err(pe, "Failed to allocate a 32-bit TCE memory\n");
> >>- goto fail;
> >>+ rc = pnv_pci_create_table(&pe->table_group, pe->phb->hose->node,
> >>+ 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base, tbl);
> >>+ if (rc) {
> >>+ pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc);
> >>+ return;
> >> }
> >>- addr = page_address(tce_mem);
> >>- memset(addr, 0, tce_table_size);
> >>-
> >>- /* Setup iommu */
> >>- tbl->it_table_group = &pe->table_group;
> >>-
> >>- /* Setup linux iommu table */
> >>- pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
> >>- IOMMU_PAGE_SHIFT_4K);
> >>
> >> tbl->it_ops = &pnv_ioda2_iommu_ops;
> >>+
> >>+ /* Setup iommu */
> >>+ tbl->it_table_group = &pe->table_group;
> >> iommu_init_table(tbl, phb->hose->node);
> >> #ifdef CONFIG_IOMMU_API
> >> pe->table_group.ops = &pnv_pci_ioda2_ops;
> >>@@ -2121,8 +2108,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >> fail:
> >> if (pe->tce32_seg >= 0)
> >> pe->tce32_seg = -1;
> >>- if (tce_mem)
> >>- __free_pages(tce_mem, get_order(tce_table_size));
> >>+ pnv_pci_free_table(tbl);
> >> }
> >>
> >> static void pnv_ioda_setup_dma(struct pnv_phb *phb)
> >>diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> >>index e8802ac..6bcfad5 100644
> >>--- a/arch/powerpc/platforms/powernv/pci.c
> >>+++ b/arch/powerpc/platforms/powernv/pci.c
> >>@@ -20,7 +20,9 @@
> >> #include <linux/io.h>
> >> #include <linux/msi.h>
> >> #include <linux/iommu.h>
> >>+#include <linux/memblock.h>
> >>
> >>+#include <asm/mmzone.h>
> >> #include <asm/sections.h>
> >> #include <asm/io.h>
> >> #include <asm/prom.h>
> >>@@ -645,6 +647,65 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
> >> tbl->it_type = TCE_PCI;
> >> }
> >>
> >>+static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift,
> >>+ unsigned long *tce_table_allocated)
> >
> >I'm a bit confused by the tce_table_allocated parameter. What's the
> >circumstance where more memory is requested than required, and why
> >does it matter to the caller?
>
> It does not make much sense here but it does for "powerpc/powernv: Implement
> multilevel TCE tables" - I was trying to avoid changing the same lines many
> times.
>
> The idea is if multilevel table is requested, I do not really want to
> allocate the whole tree. For example, if the userspace asked for 64K table
> and 5 levels, the result will be a list of just 5 pages - last one will be
> the actual table and upper levels will have a single valud TCE entry
> pointing to next level.
>
> But I change the prototype there anyway so I'll just move this
> tce_table_allocated thing there.

Yeah, I think that's better. It is more churn, but I think the
clearer reviewability is worth it.
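
A stand-alone sketch of the lazy layout described above: only one page per
level is allocated up front, each upper level holding a single entry that
points at the next one. Names and sizes are illustrative only:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define ENTRIES_PER_PAGE 512	/* 4K page / 8-byte TCE, hypothetical */

static uint64_t *alloc_level_chain(int levels)
{
	uint64_t *top = NULL, *prev = NULL;

	for (int i = 0; i < levels; i++) {
		uint64_t *page = calloc(ENTRIES_PER_PAGE, sizeof(uint64_t));

		if (!page)
			return NULL;	/* error unwinding elided */
		if (!top)
			top = page;
		if (prev)
			prev[0] = (uint64_t)(uintptr_t)page; /* single valid entry */
		prev = page;
	}
	return top;	/* the last page in the chain is the leaf table */
}

int main(void)
{
	printf("root of a 5-level chain: %p\n", (void *)alloc_level_chain(5));
	return 0;
}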

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-30 04:38:41

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 21/32] powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window

On Wed, Apr 29, 2015 at 07:26:28PM +1000, Alexey Kardashevskiy wrote:
> On 04/29/2015 02:45 PM, David Gibson wrote:
> >On Sat, Apr 25, 2015 at 10:14:45PM +1000, Alexey Kardashevskiy wrote:
> >>This is a part of moving DMA window programming to an iommu_ops
> >>callback. pnv_pci_ioda2_set_window() takes an iommu_table_group as
> >>a first parameter (not pnv_ioda_pe) as it is going to be used as
> >>a callback for VFIO DDW code.
> >>
> >>This adds pnv_pci_ioda2_tvt_invalidate() to invalidate TVT as it is
> >>a good thing to do.
> >
> >What's the TVT and why is invalidating it a good thing?
>
>
> "TCE Validation Table". Yeah, I need to rephrase it. Will do.
>
>
> >Also, it looks like it didn't add it, just move it.
>
> Agrh. Lost it in rebases. Will fix.
>
>
> >>It does not have immediate effect now as the table
> >>is never recreated after reboot but it will in the following patches.
> >>
> >>This should cause no behavioural change.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>Reviewed-by: David Gibson <[email protected]>
> >
> >Really? I don't remember this one.
>
>
> Message-ID: <[email protected]>
> :)
>
> But I believe it did not have TVT stuff then so I should have removed your
> RB from here.

Yeah, that's probably why I didn't recognize it.

>
> >
> >>---
> >>Changes:
> >>v9:
> >>* initialize pe->table_group.tables[0] at the very end when
> >>tbl is fully initialized
> >>* moved pnv_pci_ioda2_tvt_invalidate() from earlier patch
> >>---
> >> arch/powerpc/platforms/powernv/pci-ioda.c | 67 +++++++++++++++++++++++--------
> >> 1 file changed, 51 insertions(+), 16 deletions(-)
> >>
> >>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>index b9b3773..59baa15 100644
> >>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>@@ -1960,6 +1960,52 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> >> __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
> >> }
> >>
> >>+static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> >>+ struct iommu_table *tbl)
> >>+{
> >>+ struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> >>+ table_group);
> >>+ struct pnv_phb *phb = pe->phb;
> >>+ int64_t rc;
> >>+ const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
> >>+ const __u64 win_size = tbl->it_size << tbl->it_page_shift;
> >>+
> >>+ pe_info(pe, "Setting up window at %llx..%llx "
> >>+ "pgsize=0x%x tablesize=0x%lx\n",
> >>+ start_addr, start_addr + win_size - 1,
> >>+ 1UL << tbl->it_page_shift, tbl->it_size << 3);
> >>+
> >>+ tbl->it_table_group = &pe->table_group;
> >>+
> >>+ /*
> >>+ * Map TCE table through TVT. The TVE index is the PE number
> >>+ * shifted by 1 bit for 32-bits DMA space.
> >>+ */
> >>+ rc = opal_pci_map_pe_dma_window(phb->opal_id,
> >>+ pe->pe_number,
> >>+ pe->pe_number << 1,
> >>+ 1,
> >>+ __pa(tbl->it_base),
> >>+ tbl->it_size << 3,
> >>+ 1ULL << tbl->it_page_shift);
> >>+ if (rc) {
> >>+ pe_err(pe, "Failed to configure TCE table, err %ld\n", rc);
> >>+ goto fail;
> >>+ }
> >>+
> >>+ pnv_pci_ioda2_tvt_invalidate(pe);
> >>+
> >>+ /* Store fully initialized *tbl (may be external) in PE */
> >>+ pe->table_group.tables[0] = *tbl;
> >
> >Hrm, a non-atomic copy of a whole structure into the array. Is that
> >really what you want?
>
>
> set_window is called from VFIO (protected by mutex there) and the platform
> code which I believe is not racy (or hotplug takes care of it anyway). Or I
> am missing something else?

Sorry, I wasn't clear. It's not that I actually think the copy is
going to race with anything now.

It's more that copying whole structures about is a rather odd way of
doing things, and makes it much less obvious how object lifetimes
interact.

From what I've seen of the rest of the series it seems like the
following scheme would make more sense:

* struct iommu_table has identical lifetime to the actual tables
allocated under it.
* So, the "create" function both allocates the header structure,
all the actual TCE tables under it, and fills in the header
with the details of same (size, levelsize, levels etc.)
* table_group would have an array of pointers to iommu_table
structs, rather than embedding an array of iommu_table structs
* This pointers would be optionally populated
* set_window function would populate the table_group array with
a previously "create"ed iommu_table
* unset window would clear the pointer, and unref the iommu_table
* "free" and "reset" for a single table would be rolled back into a
single function

Unless there's some reason I've missed that you want to embed the
whole array of iommu_table structs.
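
In data-structure terms the scheme above would look roughly like this; field
and constant names are only illustrative:

#define IOMMU_TABLE_GROUP_MAX_TABLES 2

struct iommu_table;		/* allocated by create_table(), refcounted,
				 * freed when the last user drops it */
struct iommu_table_group_ops;

struct iommu_table_group {
	/* Optionally populated by set_window(), cleared by unset_window() */
	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
	struct iommu_table_group_ops *ops;
};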

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-30 04:38:07

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 23/32] powerpc/powernv/ioda: Define and implement DMA table/window management callbacks

On Wed, Apr 29, 2015 at 07:44:20PM +1000, Alexey Kardashevskiy wrote:
> On 04/29/2015 03:30 PM, David Gibson wrote:
> >On Sat, Apr 25, 2015 at 10:14:47PM +1000, Alexey Kardashevskiy wrote:
> >>This extends iommu_table_group_ops by a set of callbacks to support
> >>dynamic DMA windows management.
> >>
> >>create_table() creates a TCE table with specific parameters.
> >>it receives iommu_table_group to know nodeid in order to allocate
> >>TCE table memory closer to the PHB. The exact format of allocated
> >>multi-level table might be also specific to the PHB model (not
> >>the case now though).
> >>This callback calculates the DMA window offset on a PCI bus from @num
> >>and stores it in a just created table.
> >>
> >>set_window() sets the window at specified TVT index + @num on PHB.
> >>
> >>unset_window() unsets the window from specified TVT.
> >>
> >>This adds a free() callback to iommu_table_ops to free the memory
> >>(potentially a tree of tables) allocated for the TCE table.
> >
> >Doesn't the free callback belong with the previous patch introducing
> >multi-level tables?
>
>
>
> If I did that, you would say "why is it here if nothing calls it" on
> "multilevel" patch and "I see the allocation but I do not see memory
> release" ;)

Yeah, fair enough ;)

> I need some rule of thumb here. I think it is a bit cleaner if the same
> patch adds a callback for memory allocation and its counterpart, no?

On further consideration, yes, I think you're right.

> >>create_table() and free() are supposed to be called once per
> >>VFIO container and set_window()/unset_window() are supposed to be
> >>called for every group in a container.
> >>
> >>This adds IOMMU capabilities to iommu_table_group such as default
> >>32bit window parameters and others.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>---
> >> arch/powerpc/include/asm/iommu.h | 19 ++++++++
> >> arch/powerpc/platforms/powernv/pci-ioda.c | 75 ++++++++++++++++++++++++++---
> >> arch/powerpc/platforms/powernv/pci-p5ioc2.c | 12 +++--
> >> 3 files changed, 96 insertions(+), 10 deletions(-)
> >>
> >>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>index 0f50ee2..7694546 100644
> >>--- a/arch/powerpc/include/asm/iommu.h
> >>+++ b/arch/powerpc/include/asm/iommu.h
> >>@@ -70,6 +70,7 @@ struct iommu_table_ops {
> >> /* get() returns a physical address */
> >> unsigned long (*get)(struct iommu_table *tbl, long index);
> >> void (*flush)(struct iommu_table *tbl);
> >>+ void (*free)(struct iommu_table *tbl);
> >> };
> >>
> >> /* These are used by VIO */
> >>@@ -148,6 +149,17 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
> >> struct iommu_table_group;
> >>
> >> struct iommu_table_group_ops {
> >>+ long (*create_table)(struct iommu_table_group *table_group,
> >>+ int num,
> >>+ __u32 page_shift,
> >>+ __u64 window_size,
> >>+ __u32 levels,
> >>+ struct iommu_table *tbl);
> >>+ long (*set_window)(struct iommu_table_group *table_group,
> >>+ int num,
> >>+ struct iommu_table *tblnew);
> >>+ long (*unset_window)(struct iommu_table_group *table_group,
> >>+ int num);
> >> /*
> >> * Switches ownership from the kernel itself to an external
> >> * user. While onwership is taken, the kernel cannot use IOMMU itself.
> >>@@ -160,6 +172,13 @@ struct iommu_table_group {
> >> #ifdef CONFIG_IOMMU_API
> >> struct iommu_group *group;
> >> #endif
> >>+ /* Some key properties of IOMMU */
> >>+ __u32 tce32_start;
> >>+ __u32 tce32_size;
> >>+ __u64 pgsizes; /* Bitmap of supported page sizes */
> >>+ __u32 max_dynamic_windows_supported;
> >>+ __u32 max_levels;
> >
> >With this information, table_group seems even more like a bad name.
> >"iommu_state" maybe?
>
>
> Please, no. We will never come to agreement then :( And "iommu_state" is too
> general anyway, it won't pass.
>
>
> >> struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
> >> struct iommu_table_group_ops *ops;
> >> };
> >>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>index cc1d09c..4828837 100644
> >>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>@@ -24,6 +24,7 @@
> >> #include <linux/msi.h>
> >> #include <linux/memblock.h>
> >> #include <linux/iommu.h>
> >>+#include <linux/sizes.h>
> >>
> >> #include <asm/sections.h>
> >> #include <asm/io.h>
> >>@@ -1846,6 +1847,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> >> #endif
> >> .clear = pnv_ioda2_tce_free,
> >> .get = pnv_tce_get,
> >>+ .free = pnv_pci_free_table,
> >> };
> >>
> >> static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
> >>@@ -1936,6 +1938,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> >> TCE_PCI_SWINV_PAIR);
> >>
> >> tbl->it_ops = &pnv_ioda1_iommu_ops;
> >>+ pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift;
> >>+ pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift;
> >> iommu_init_table(tbl, phb->hose->node);
> >>
> >> if (pe->flags & PNV_IODA_PE_DEV) {
> >>@@ -1961,7 +1965,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> >> }
> >>
> >> static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> >>- struct iommu_table *tbl)
> >>+ int num, struct iommu_table *tbl)
> >> {
> >> struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> >> table_group);
> >>@@ -1972,9 +1976,10 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> >> const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
> >> const __u64 win_size = tbl->it_size << tbl->it_page_shift;
> >>
> >>- pe_info(pe, "Setting up window at %llx..%llx "
> >>+ pe_info(pe, "Setting up window#%d at %llx..%llx "
> >> "pgsize=0x%x tablesize=0x%lx "
> >> "levels=%d levelsize=%x\n",
> >>+ num,
> >> start_addr, start_addr + win_size - 1,
> >> 1UL << tbl->it_page_shift, tbl->it_size << 3,
> >> tbl->it_indirect_levels + 1, tbl->it_level_size << 3);
> >>@@ -1987,7 +1992,7 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> >> */
> >> rc = opal_pci_map_pe_dma_window(phb->opal_id,
> >> pe->pe_number,
> >>- pe->pe_number << 1,
> >>+ (pe->pe_number << 1) + num,
> >
> >Heh, yes, well, that makes it rather clear that only 2 tables are possible.
> >
> >> tbl->it_indirect_levels + 1,
> >> __pa(tbl->it_base),
> >> size << 3,
> >>@@ -2000,7 +2005,7 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> >> pnv_pci_ioda2_tvt_invalidate(pe);
> >>
> >> /* Store fully initialized *tbl (may be external) in PE */
> >>- pe->table_group.tables[0] = *tbl;
> >>+ pe->table_group.tables[num] = *tbl;
> >
> >I'm a bit confused by this whole set_window thing. Is the idea that
> >with multiple groups in a container you have multiple table_group s
> >each with different copies of the iommu_table structures, but pointing
> >to the same actual TCE entries (it_base)?
>
> Yes.
>
> >It seems to me not terribly
> >obvious when you "create" a table and when you "set" a window.
>
>
> A table is not attached anywhere until its address is programmed (in
> set_window()) to the hardware, it is just a table in memory. For
> POWER8/IODA2, I create a table before I attach any group to a container,
> then I program this table to every attached container, right now it is done
> in container's attach_group(). So later we can hotplug any host PCI device
> to a container - it will program same TCE table to every new group in the
> container.

So you "create" once, then "set" it to one or more table_groups? It
seems odd that "create" is a table_group callback in that case.

> >It's also kind of hard to assess whether the relative lifetimes are
> >correct of the table_group, struct iommu_table and the actual TCE tables.
>
> That is true. Do not know how to improve this though.

So I think the scheme I suggested in reply to an earlier patch helps
this. With that the lifetime of the struct iommu_table represents the
lifetime of the TCE table in the hardware sense as well, which I think
makes things clearer.

>
>
> >Would it make more sense for table_group to become the
> >non-vfio-specific counterpart to the vfio container?
> >i.e. representing one set of DMA mappings, which one or more PEs could
> >be bound to.
>
>
> table_group is embedded into PE and table/table_group callbacks access PE
> when invalidating TCE table. So I will need something to access PE. Or just
> have an array of 2 iommu_table.
>
>
>
> >
> >> return 0;
> >> fail:
> >>@@ -2061,6 +2066,53 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
> >> }
> >>
> >> #ifdef CONFIG_IOMMU_API
> >>+static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
> >>+ int num, __u32 page_shift, __u64 window_size, __u32 levels,
> >>+ struct iommu_table *tbl)
> >>+{
> >>+ struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> >>+ table_group);
> >>+ int nid = pe->phb->hose->node;
> >>+ __u64 bus_offset = num ? pe->tce_bypass_base : 0;
> >>+ long ret;
> >>+
> >>+ ret = pnv_pci_create_table(table_group, nid, bus_offset, page_shift,
> >>+ window_size, levels, tbl);
> >>+ if (ret)
> >>+ return ret;
> >>+
> >>+ tbl->it_ops = &pnv_ioda2_iommu_ops;
> >>+ if (pe->tce_inval_reg)
> >>+ tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
> >>+
> >>+ return 0;
> >>+}
> >>+
> >>+static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
> >>+ int num)
> >>+{
> >>+ struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> >>+ table_group);
> >>+ struct pnv_phb *phb = pe->phb;
> >>+ struct iommu_table *tbl = &pe->table_group.tables[num];
> >>+ long ret;
> >>+
> >>+ pe_info(pe, "Removing DMA window #%d\n", num);
> >>+
> >>+ ret = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
> >>+ (pe->pe_number << 1) + num,
> >>+ 0/* levels */, 0/* table address */,
> >>+ 0/* table size */, 0/* page size */);
> >>+ if (ret)
> >>+ pe_warn(pe, "Unmapping failed, ret = %ld\n", ret);
> >>+ else
> >>+ pnv_pci_ioda2_tvt_invalidate(pe);
> >>+
> >>+ memset(tbl, 0, sizeof(*tbl));
> >>+
> >>+ return ret;
> >>+}
> >>+
> >> static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
> >> {
> >> struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> >>@@ -2080,6 +2132,9 @@ static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
> >> }
> >>
> >> static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
> >>+ .create_table = pnv_pci_ioda2_create_table,
> >>+ .set_window = pnv_pci_ioda2_set_window,
> >>+ .unset_window = pnv_pci_ioda2_unset_window,
> >> .take_ownership = pnv_ioda2_take_ownership,
> >> .release_ownership = pnv_ioda2_release_ownership,
> >> };
> >>@@ -2102,8 +2157,16 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >> pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
> >> phb->ioda.m32_pci_base);
> >>
> >>+ pe->table_group.tce32_start = 0;
> >>+ pe->table_group.tce32_size = phb->ioda.m32_pci_base;
> >>+ pe->table_group.max_dynamic_windows_supported =
> >>+ IOMMU_TABLE_GROUP_MAX_TABLES;
> >>+ pe->table_group.max_levels = POWERNV_IOMMU_MAX_LEVELS;
> >>+ pe->table_group.pgsizes = SZ_4K | SZ_64K | SZ_16M;
> >>+
> >> rc = pnv_pci_create_table(&pe->table_group, pe->phb->hose->node,
> >>- 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base,
> >>+ pe->table_group.tce32_start, IOMMU_PAGE_SHIFT_4K,
> >>+ pe->table_group.tce32_size,
> >> POWERNV_IOMMU_DEFAULT_LEVELS, tbl);
> >> if (rc) {
> >> pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc);
> >>@@ -2119,7 +2182,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >> pe->table_group.ops = &pnv_pci_ioda2_ops;
> >> #endif
> >>
> >>- rc = pnv_pci_ioda2_set_window(&pe->table_group, tbl);
> >>+ rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
> >> if (rc) {
> >> pe_err(pe, "Failed to configure 32-bit TCE table,"
> >> " err %ld\n", rc);
> >>diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> >>index 7a6fd92..d9de4c7 100644
> >>--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> >>+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> >>@@ -116,6 +116,8 @@ static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
> >> u64 phb_id;
> >> int64_t rc;
> >> static int primary = 1;
> >>+ struct iommu_table_group *table_group;
> >>+ struct iommu_table *tbl;
> >>
> >> pr_info(" Initializing p5ioc2 PHB %s\n", np->full_name);
> >>
> >>@@ -181,14 +183,16 @@ static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
> >> pnv_pci_init_p5ioc2_msis(phb);
> >>
> >> /* Setup iommu */
> >>- phb->p5ioc2.table_group.tables[0].it_table_group =
> >>- &phb->p5ioc2.table_group;
> >>+ table_group = &phb->p5ioc2.table_group;
> >>+ tbl = &phb->p5ioc2.table_group.tables[0];
> >>+ tbl->it_table_group = table_group;
> >>
> >> /* Setup TCEs */
> >> phb->dma_dev_setup = pnv_pci_p5ioc2_dma_dev_setup;
> >>- pnv_pci_setup_iommu_table(&phb->p5ioc2.table_group.tables[0],
> >>- tce_mem, tce_size, 0,
> >>+ pnv_pci_setup_iommu_table(tbl, tce_mem, tce_size, 0,
> >> IOMMU_PAGE_SHIFT_4K);
> >>+ table_group->tce32_start = tbl->it_offset << tbl->it_page_shift;
> >>+ table_group->tce32_size = tbl->it_size << tbl->it_page_shift;
> >
> >Doesn't pgsizes need to be set here (although it will only include 4K,
> >I'm assuming).
>
>
> No, pgsizes are not returned to the userspace for p5ioc2/ioda1 as they do
> not support DDW. No pgsize => no DDW.

Ah, ok.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-30 07:11:09

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 28/32] powerpc/mmu: Add userspace-to-physical addresses translation cache

On Sat, Apr 25, 2015 at 10:14:52PM +1000, Alexey Kardashevskiy wrote:
> We are adding support for DMA memory pre-registration to be used in
> conjunction with VFIO. The idea is that the userspace which is going to
> run a guest may want to pre-register a user space memory region so
> it all gets pinned once and never goes away. Having this done,
> a hypervisor will not have to pin/unpin pages on every DMA map/unmap
> request. This is going to help with multiple pinning of the same memory
> and in-kernel acceleration of DMA requests.
>
> This adds a list of memory regions to mm_context_t. Each region consists
> of a header and a list of physical addresses. This adds API to:
> 1. register/unregister memory regions;
> 2. do final cleanup (which puts all pre-registered pages);
> 3. do userspace to physical address translation;
> 4. manage a mapped pages counter; when it is zero, it is safe to
> unregister the region.
>
> Multiple registration of the same region is allowed, kref is used to
> track the number of registrations.
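
The bookkeeping described above boils down to a per-mm list of regions, each
with its userspace address, size in pages, a pinned-pages array and a use
count; roughly like this, with types and names illustrative rather than the
exact API:

#include <stdbool.h>

struct mem_region {
	struct mem_region *next;
	unsigned long ua;		/* userspace address of the region */
	unsigned long entries;		/* number of system pages */
	unsigned long *hpas;		/* host physical address per page */
	unsigned int refs;		/* kref-style registration count */
};

/* true if [ua, ua + (pages << page_shift)) overlaps a registered region */
static bool region_overlaps(struct mem_region *list, unsigned long ua,
		unsigned long pages, unsigned int page_shift)
{
	struct mem_region *m;

	for (m = list; m; m = m->next)
		if (m->ua < ua + (pages << page_shift) &&
		    ua < m->ua + (m->entries << page_shift))
			return true;
	return false;
}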

[snip]
> +long mm_iommu_alloc(unsigned long ua, unsigned long entries,
> + struct mm_iommu_table_group_mem_t **pmem)
> +{
> + struct mm_iommu_table_group_mem_t *mem;
> + long i, j;
> + struct page *page = NULL;
> +
> + list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
> + next) {
> + if ((mem->ua == ua) && (mem->entries == entries))
> + return -EBUSY;
> +
> + /* Overlap? */
> + if ((mem->ua < (ua + (entries << PAGE_SHIFT))) &&
> + (ua < (mem->ua + (mem->entries << PAGE_SHIFT))))
> + return -EINVAL;
> + }
> +
> + mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> + if (!mem)
> + return -ENOMEM;
> +
> + mem->hpas = vzalloc(entries * sizeof(mem->hpas[0]));
> + if (!mem->hpas) {
> + kfree(mem);
> + return -ENOMEM;
> + }

So, I've thought more about this and I'm really confused as to what
this is supposed to be accomplishing.

I see that you need to keep track of what regions are registered, so
you don't double lock or unlock, but I don't see what the point of
> actually storing the translations in hpas is.

I had assumed it was so that you could later on get to the
translations in real mode when you do in-kernel acceleration. But
that doesn't make sense, because the array is vmalloc()ed, so can't be
accessed in real mode anyway.

I can't think of a circumstance in which you can use hpas where you
couldn't just walk the page tables anyway.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-04-30 07:11:16

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2

On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:
> The existing implementation accounts the whole DMA window in
> the locked_vm counter. This is going to be worse with multiple
> containers and huge DMA windows. Also, real-time accounting would require
> additional tracking of accounted pages due to the page size difference -
> IOMMU uses 4K pages and system uses 4K or 64K pages.
>
> Another issue is that actual pages pinning/unpinning happens on every
> DMA map/unmap request. This does not affect the performance much now as
> we spend way too much time on switching context between
> guest/userspace/host but this will start to matter when we add in-kernel
> DMA map/unmap acceleration.
>
> This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
> New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
> 2 new ioctls to register/unregister DMA memory -
> VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
> which receive user space address and size of a memory region which
> needs to be pinned/unpinned and counted in locked_vm.
> The new IOMMU splits physical page pinning and TCE table updates into 2 different
> operations. It requires 1) guest pages to be registered first, 2) subsequent
> map/unmap requests to work only with pre-registered memory.
> For the default single window case this means that the entire guest
> (instead of 2GB) needs to be pinned before using VFIO.
> When a huge DMA window is added, no additional pinning will be
> required, otherwise it would be guest RAM + 2GB.
>
> The new memory registration ioctls are not supported by
> VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
> will require memory to be preregistered in order to work.
>
> The accounting is done per the user process.
>
> This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
> can do with v1 or v2 IOMMUs.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> [aw: for the vfio related changes]
> Acked-by: Alex Williamson <[email protected]>
> ---
> Changes:
> v9:
> * s/tce_get_hva_cached/tce_iommu_use_page_v2/
>
> v7:
> * now memory is registered per mm (i.e. process)
> * moved memory registration code to powerpc/mmu
> * merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
> * limited new ioctls to v2 IOMMU
> * updated doc
> * unsupported ioctls return -ENOTTY instead of -EPERM
>
> v6:
> * tce_get_hva_cached() returns hva via a pointer
>
> v4:
> * updated docs
> * s/kzmalloc/vzalloc/
> * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
> replaced offset with index
> * renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
> and removed duplicating vfio_iommu_spapr_register_memory
> ---
> Documentation/vfio.txt | 23 ++++
> drivers/vfio/vfio_iommu_spapr_tce.c | 230 +++++++++++++++++++++++++++++++++++-
> include/uapi/linux/vfio.h | 27 +++++
> 3 files changed, 274 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> index 96978ec..94328c8 100644
> --- a/Documentation/vfio.txt
> +++ b/Documentation/vfio.txt
> @@ -427,6 +427,29 @@ The code flow from the example above should be slightly changed:
>
> ....
>
> +5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
> +VFIO_IOMMU_DISABLE and implements 2 new ioctls:
> +VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
> +(which are unsupported in v1 IOMMU).

A summary of the semantic differences between v1 and v2 would be nice.
At this point it's not really clear to me if there's a case for
creating v2, or if this could just be done by adding (optional)
functionality to v1.

> +PPC64 paravirtualized guests generate a lot of map/unmap requests,
> +and the handling of those includes pinning/unpinning pages and updating
> +mm::locked_vm counter to make sure we do not exceed the rlimit.
> +The v2 IOMMU splits accounting and pinning into separate operations:
> +
> +- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
> +receive a user space address and size of the block to be pinned.
> +Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
> +be called with the exact address and size used for registering
> +the memory block. The userspace is not expected to call these often.
> +The ranges are stored in a linked list in a VFIO container.
> +
> +- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
> +IOMMU table and do not do pinning; instead these check that the userspace
> +address is from pre-registered range.
> +
> +This separation helps in optimizing DMA for guests.
> +
> -------------------------------------------------------------------------------
>
> [1] VFIO was originally an acronym for "Virtual Function I/O" in its
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 892a584..4cfc2c1 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c

So, from things you said at other points, I thought the idea was that
this registration stuff could also be used on non-Power IOMMUs. Did I
misunderstand, or is that a possibility for the future?

> @@ -21,6 +21,7 @@
> #include <linux/vfio.h>
> #include <asm/iommu.h>
> #include <asm/tce.h>
> +#include <asm/mmu_context.h>
>
> #define DRIVER_VERSION "0.1"
> #define DRIVER_AUTHOR "[email protected]"
> @@ -91,8 +92,58 @@ struct tce_container {
> struct iommu_group *grp;
> bool enabled;
> unsigned long locked_pages;
> + bool v2;
> };
>
> +static long tce_unregister_pages(struct tce_container *container,
> + __u64 vaddr, __u64 size)
> +{
> + long ret;
> + struct mm_iommu_table_group_mem_t *mem;
> +
> + if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
> + return -EINVAL;
> +
> + mem = mm_iommu_get(vaddr, size >> PAGE_SHIFT);
> + if (!mem)
> + return -EINVAL;
> +
> + ret = mm_iommu_put(mem); /* undo kref_get() from mm_iommu_get() */
> + if (!ret)
> + ret = mm_iommu_put(mem);
> +
> + return ret;
> +}
> +
> +static long tce_register_pages(struct tce_container *container,
> + __u64 vaddr, __u64 size)
> +{
> + long ret = 0;
> + struct mm_iommu_table_group_mem_t *mem;
> + unsigned long entries = size >> PAGE_SHIFT;
> +
> + if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
> + ((vaddr + size) < vaddr))
> + return -EINVAL;
> +
> + mem = mm_iommu_get(vaddr, entries);
> + if (!mem) {
> + ret = try_increment_locked_vm(entries);
> + if (ret)
> + return ret;
> +
> + ret = mm_iommu_alloc(vaddr, entries, &mem);
> + if (ret) {
> + decrement_locked_vm(entries);
> + return ret;
> + }
> + }
> +
> + container->enabled = true;
> +
> + return 0;
> +}

So requiring that registered regions get unregistered with exactly the
same addr/length is reasonable. I'm a bit less convinced that
disallowing overlaps is a good idea. What if two libraries in the
same process are trying to use VFIO - they may not know if the regions
they try to register are overlapping.
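
For reference, here is a minimal userspace sketch of the registration flow under
discussion. It only uses the uapi additions quoted later in this patch (struct
vfio_iommu_spapr_register_memory and the two new ioctls); container setup, the
VFIO_SPAPR_TCE_v2_IOMMU selection and all error handling are omitted, and the
helper names are illustrative:

#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Pin and account a block once so later MAP_DMA/UNMAP_DMA calls stay cheap */
static int spapr_register_ram(int container_fd, void *addr, unsigned long size)
{
	struct vfio_iommu_spapr_register_memory reg = {
		.argsz = sizeof(reg),
		.flags = 0,				/* no flags are defined yet */
		.vaddr = (__u64)(unsigned long)addr,	/* must be page aligned */
		.size = size,				/* must be page aligned */
	};

	return ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
}

/* Must use exactly the vaddr/size that were passed at registration time */
static int spapr_unregister_ram(int container_fd, void *addr, unsigned long size)
{
	struct vfio_iommu_spapr_register_memory reg = {
		.argsz = sizeof(reg),
		.vaddr = (__u64)(unsigned long)addr,
		.size = size,
	};

	return ioctl(container_fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
}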

> static bool tce_page_is_contained(struct page *page, unsigned page_shift)
> {
> /*
> @@ -205,7 +256,7 @@ static void *tce_iommu_open(unsigned long arg)
> {
> struct tce_container *container;
>
> - if (arg != VFIO_SPAPR_TCE_IOMMU) {
> + if ((arg != VFIO_SPAPR_TCE_IOMMU) && (arg != VFIO_SPAPR_TCE_v2_IOMMU)) {
> pr_err("tce_vfio: Wrong IOMMU type\n");
> return ERR_PTR(-EINVAL);
> }
> @@ -215,6 +266,7 @@ static void *tce_iommu_open(unsigned long arg)
> return ERR_PTR(-ENOMEM);
>
> mutex_init(&container->lock);
> + container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
>
> return container;
> }
> @@ -243,6 +295,47 @@ static void tce_iommu_unuse_page(struct tce_container *container,
> put_page(page);
> }
>
> +static int tce_iommu_use_page_v2(unsigned long tce, unsigned long size,
> + unsigned long *phpa, struct mm_iommu_table_group_mem_t **pmem)
> +{
> + long ret = 0;
> + struct mm_iommu_table_group_mem_t *mem;
> +
> + mem = mm_iommu_lookup(tce, size);
> + if (!mem)
> + return -EINVAL;
> +
> + ret = mm_iommu_ua_to_hpa(mem, tce, phpa);
> + if (ret)
> + return -EINVAL;
> +
> + *pmem = mem;
> +
> + return 0;
> +}
> +
> +static void tce_iommu_unuse_page_v2(struct iommu_table *tbl,
> + unsigned long entry)
> +{
> + struct mm_iommu_table_group_mem_t *mem = NULL;
> + int ret;
> + unsigned long hpa = 0;
> + unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> + if (!pua || !current || !current->mm)
> + return;
> +
> + ret = tce_iommu_use_page_v2(*pua, IOMMU_PAGE_SIZE(tbl),
> + &hpa, &mem);
> + if (ret)
> + pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
> + __func__, *pua, entry, ret);
> + if (mem)
> + mm_iommu_mapped_update(mem, false);
> +
> + *pua = 0;
> +}
> +
> static int tce_iommu_clear(struct tce_container *container,
> struct iommu_table *tbl,
> unsigned long entry, unsigned long pages)
> @@ -261,6 +354,11 @@ static int tce_iommu_clear(struct tce_container *container,
> if (direction == DMA_NONE)
> continue;
>
> + if (container->v2) {
> + tce_iommu_unuse_page_v2(tbl, entry);
> + continue;
> + }
> +
> tce_iommu_unuse_page(container, oldtce);
> }
>
> @@ -327,6 +425,62 @@ static long tce_iommu_build(struct tce_container *container,
> return ret;
> }
>
> +static long tce_iommu_build_v2(struct tce_container *container,
> + struct iommu_table *tbl,
> + unsigned long entry, unsigned long tce, unsigned long pages,
> + enum dma_data_direction direction)
> +{
> + long i, ret = 0;
> + struct page *page;
> + unsigned long hpa;
> + enum dma_data_direction dirtmp;
> +
> + for (i = 0; i < pages; ++i) {
> + struct mm_iommu_table_group_mem_t *mem = NULL;
> + unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl,
> + entry + i);
> +
> + ret = tce_iommu_use_page_v2(tce, IOMMU_PAGE_SIZE(tbl),
> + &hpa, &mem);
> + if (ret)
> + break;
> +
> + page = pfn_to_page(hpa >> PAGE_SHIFT);
> + if (!tce_page_is_contained(page, tbl->it_page_shift)) {
> + ret = -EPERM;
> + break;
> + }
> +
> + /* Preserve offset within IOMMU page */
> + hpa |= tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
> + dirtmp = direction;
> +
> + ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp);
> + if (ret) {
> + /* dirtmp cannot be DMA_NONE here */
> + tce_iommu_unuse_page_v2(tbl, entry + i);
> + pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
> + __func__, entry << tbl->it_page_shift,
> + tce, ret);
> + break;
> + }
> +
> + mm_iommu_mapped_update(mem, true);
> +
> + if (dirtmp != DMA_NONE)
> + tce_iommu_unuse_page_v2(tbl, entry + i);
> +
> + *pua = tce;
> +
> + tce += IOMMU_PAGE_SIZE(tbl);
> + }
> +
> + if (ret)
> + tce_iommu_clear(container, tbl, entry, i);
> +
> + return ret;
> +}
> +
> static long tce_iommu_ioctl(void *iommu_data,
> unsigned int cmd, unsigned long arg)
> {
> @@ -338,6 +492,7 @@ static long tce_iommu_ioctl(void *iommu_data,
> case VFIO_CHECK_EXTENSION:
> switch (arg) {
> case VFIO_SPAPR_TCE_IOMMU:
> + case VFIO_SPAPR_TCE_v2_IOMMU:
> ret = 1;
> break;
> default:
> @@ -425,11 +580,18 @@ static long tce_iommu_ioctl(void *iommu_data,
> if (ret)
> return ret;
>
> - ret = tce_iommu_build(container, tbl,
> - param.iova >> tbl->it_page_shift,
> - param.vaddr,
> - param.size >> tbl->it_page_shift,
> - direction);
> + if (container->v2)
> + ret = tce_iommu_build_v2(container, tbl,
> + param.iova >> tbl->it_page_shift,
> + param.vaddr,
> + param.size >> tbl->it_page_shift,
> + direction);
> + else
> + ret = tce_iommu_build(container, tbl,
> + param.iova >> tbl->it_page_shift,
> + param.vaddr,
> + param.size >> tbl->it_page_shift,
> + direction);
>
> iommu_flush_tce(tbl);
>
> @@ -474,7 +636,60 @@ static long tce_iommu_ioctl(void *iommu_data,
>
> return ret;
> }
> + case VFIO_IOMMU_SPAPR_REGISTER_MEMORY: {
> + struct vfio_iommu_spapr_register_memory param;
> +
> + if (!container->v2)
> + break;
> +
> + minsz = offsetofend(struct vfio_iommu_spapr_register_memory,
> + size);
> +
> + if (copy_from_user(&param, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (param.argsz < minsz)
> + return -EINVAL;
> +
> + /* No flag is supported now */
> + if (param.flags)
> + return -EINVAL;
> +
> + mutex_lock(&container->lock);
> + ret = tce_register_pages(container, param.vaddr, param.size);
> + mutex_unlock(&container->lock);

AFAICT, this is the only call to tce_register_pages(), so why not put
the mutex into the function.

> +
> + return ret;
> + }
> + case VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY: {
> + struct vfio_iommu_spapr_register_memory param;
> +
> + if (!container->v2)
> + break;
> +
> + minsz = offsetofend(struct vfio_iommu_spapr_register_memory,
> + size);
> +
> + if (copy_from_user(&param, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (param.argsz < minsz)
> + return -EINVAL;
> +
> + /* No flag is supported now */
> + if (param.flags)
> + return -EINVAL;
> +
> + mutex_lock(&container->lock);
> + tce_unregister_pages(container, param.vaddr, param.size);
> + mutex_unlock(&container->lock);
> +
> + return 0;
> + }
> case VFIO_IOMMU_ENABLE:
> + if (container->v2)
> + break;
> +
> mutex_lock(&container->lock);
> ret = tce_iommu_enable(container);
> mutex_unlock(&container->lock);
> @@ -482,6 +697,9 @@ static long tce_iommu_ioctl(void *iommu_data,
>
>
> case VFIO_IOMMU_DISABLE:
> + if (container->v2)
> + break;
> +
> mutex_lock(&container->lock);
> tce_iommu_disable(container);
> mutex_unlock(&container->lock);
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index b57b750..8fdcfb9 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -36,6 +36,8 @@
> /* Two-stage IOMMU */
> #define VFIO_TYPE1_NESTING_IOMMU 6 /* Implies v2 */
>
> +#define VFIO_SPAPR_TCE_v2_IOMMU 7
> +
> /*
> * The IOCTL interface is designed for extensibility by embedding the
> * structure length (argsz) and flags into structures passed between
> @@ -495,6 +497,31 @@ struct vfio_eeh_pe_op {
>
> #define VFIO_EEH_PE_OP _IO(VFIO_TYPE, VFIO_BASE + 21)
>
> +/**
> + * VFIO_IOMMU_SPAPR_REGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 17, struct vfio_iommu_spapr_register_memory)
> + *
> + * Registers user space memory where DMA is allowed. It pins
> + * user pages and does the locked memory accounting so
> + * subsequent VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA calls
> + * get faster.
> + */
> +struct vfio_iommu_spapr_register_memory {
> + __u32 argsz;
> + __u32 flags;
> + __u64 vaddr; /* Process virtual address */
> + __u64 size; /* Size of mapping (bytes) */
> +};
> +#define VFIO_IOMMU_SPAPR_REGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 17)
> +
> +/**
> + * VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 18, struct vfio_iommu_spapr_register_memory)
> + *
> + * Unregisters user space memory registered with
> + * VFIO_IOMMU_SPAPR_REGISTER_MEMORY.
> + * Uses vfio_iommu_spapr_register_memory for parameters.
> + */
> +#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 18)
> +
> /* ***************************************************************** */
>
> #endif /* _UAPIVFIO_H */

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


2015-04-30 07:11:12

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 30/32] vfio: powerpc/spapr: Use 32bit DMA window properties from table_group

On Sat, Apr 25, 2015 at 10:14:54PM +1000, Alexey Kardashevskiy wrote:
> A table group might not have a table but it always has the default 32bit
> window parameters so use these.
>
> No change in behavior is expected.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>

It would be easier to review if you took this and the parts of the
earlier patch which add the tce32_* fields to table_group and rolled
them up into a patch of their own.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


2015-04-30 07:45:26

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible

On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote:
> At the moment only one group per container is supported.
> POWER8 CPUs have a more flexible design and allow having 2 TCE tables per
> IOMMU group so we can relax this limitation and support multiple groups
> per container.

It's not obvious why allowing multiple TCE tables per PE has any
bearing on allowing multiple groups per container.

> This adds TCE table descriptors to a container and uses iommu_table_group_ops
> to create/set DMA windows on IOMMU groups so the same TCE tables will be
> shared between several IOMMU groups.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> [aw: for the vfio related changes]
> Acked-by: Alex Williamson <[email protected]>
> ---
> Changes:
> v7:
> * updated doc
> ---
> Documentation/vfio.txt | 8 +-
> drivers/vfio/vfio_iommu_spapr_tce.c | 268 ++++++++++++++++++++++++++----------
> 2 files changed, 199 insertions(+), 77 deletions(-)
>
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> index 94328c8..7dcf2b5 100644
> --- a/Documentation/vfio.txt
> +++ b/Documentation/vfio.txt
> @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
>
> This implementation has some specifics:
>
> -1) Only one IOMMU group per container is supported as an IOMMU group
> -represents the minimal entity which isolation can be guaranteed for and
> -groups are allocated statically, one per a Partitionable Endpoint (PE)
> +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
> +container is supported as an IOMMU table is allocated at boot time,
> +one table per IOMMU group, which is a Partitionable Endpoint (PE)
> (PE is often a PCI domain but not always).

I thought the more fundamental problem was that different PEs tended
to use disjoint bus address ranges, so even by duplicating put_tce
across PEs you couldn't have a common address space.

> +Newer systems (POWER8 with IODA2) have an improved hardware design which allows
> +this limitation to be removed and multiple IOMMU groups per VFIO container.
>
> 2) The hardware supports so called DMA windows - the PCI address range
> within which DMA transfer is allowed, any attempt to access address space
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index a7d6729..970e3a2 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -82,6 +82,11 @@ static void decrement_locked_vm(long npages)
> * into DMA'ble space using the IOMMU
> */
>
> +struct tce_iommu_group {
> + struct list_head next;
> + struct iommu_group *grp;
> +};
> +
> /*
> * The container descriptor supports only a single group per container.
> * Required by the API as the container is not supplied with the IOMMU group
> @@ -89,10 +94,11 @@ static void decrement_locked_vm(long npages)
> */
> struct tce_container {
> struct mutex lock;
> - struct iommu_group *grp;
> bool enabled;
> unsigned long locked_pages;
> bool v2;
> + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];

Hrm, so here we have more copies of the full iommu_table structures,
which again muddies the lifetime. The table_group pointer is
presumably meaningless in these copies, which seems dangerously
confusing.

> + struct list_head group_list;
> };
>
> static long tce_unregister_pages(struct tce_container *container,
> @@ -154,20 +160,20 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift)
> return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift;
> }
>
> +static inline bool tce_groups_attached(struct tce_container *container)
> +{
> + return !list_empty(&container->group_list);
> +}
> +
> static struct iommu_table *spapr_tce_find_table(
> struct tce_container *container,
> phys_addr_t ioba)
> {
> long i;
> struct iommu_table *ret = NULL;
> - struct iommu_table_group *table_group;
> -
> - table_group = iommu_group_get_iommudata(container->grp);
> - if (!table_group)
> - return NULL;
>
> for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> - struct iommu_table *tbl = &table_group->tables[i];
> + struct iommu_table *tbl = &container->tables[i];
> unsigned long entry = ioba >> tbl->it_page_shift;
> unsigned long start = tbl->it_offset;
> unsigned long end = start + tbl->it_size;
> @@ -186,9 +192,7 @@ static int tce_iommu_enable(struct tce_container *container)
> int ret = 0;
> unsigned long locked;
> struct iommu_table_group *table_group;
> -
> - if (!container->grp)
> - return -ENXIO;
> + struct tce_iommu_group *tcegrp;
>
> if (!current->mm)
> return -ESRCH; /* process exited */
> @@ -225,7 +229,12 @@ static int tce_iommu_enable(struct tce_container *container)
> * as there is no way to know how much we should increment
> * the locked_vm counter.
> */
> - table_group = iommu_group_get_iommudata(container->grp);
> + if (!tce_groups_attached(container))
> + return -ENODEV;
> +
> + tcegrp = list_first_entry(&container->group_list,
> + struct tce_iommu_group, next);
> + table_group = iommu_group_get_iommudata(tcegrp->grp);
> if (!table_group)
> return -ENODEV;
>
> @@ -257,6 +266,48 @@ static void tce_iommu_disable(struct tce_container *container)
> decrement_locked_vm(container->locked_pages);
> }
>
> +static long tce_iommu_create_table(struct iommu_table_group *table_group,
> + int num,
> + __u32 page_shift,
> + __u64 window_size,
> + __u32 levels,
> + struct iommu_table *tbl)

With multiple groups (and therefore PEs) per container, this seems
wrong. There's only one table_group per PE, so what's special about
the PE whose table group is passed in here?

> +{
> + long ret, table_size;
> +
> + table_size = table_group->ops->get_table_size(page_shift, window_size,
> + levels);
> + if (!table_size)
> + return -EINVAL;
> +
> + ret = try_increment_locked_vm(table_size >> PAGE_SHIFT);
> + if (ret)
> + return ret;
> +
> + ret = table_group->ops->create_table(table_group, num,
> + page_shift, window_size, levels, tbl);
> +
> + WARN_ON(!ret && !tbl->it_ops->free);
> + WARN_ON(!ret && (tbl->it_allocated_size != table_size));
> +
> + if (ret)
> + decrement_locked_vm(table_size >> PAGE_SHIFT);
> +
> + return ret;
> +}
> +
> +static void tce_iommu_free_table(struct iommu_table *tbl)
> +{
> + unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
> +
> + if (!tbl->it_size)
> + return;
> +
> + tbl->it_ops->free(tbl);

So, this is exactly the case where the lifetimes are badly confusing.
How can you be confident here that another copy of the iommu_table
struct isn't referencing the same TCE tables?

> + decrement_locked_vm(pages);
> + memset(tbl, 0, sizeof(*tbl));
> +}
> +
> static void *tce_iommu_open(unsigned long arg)
> {
> struct tce_container *container;
> @@ -271,19 +322,41 @@ static void *tce_iommu_open(unsigned long arg)
> return ERR_PTR(-ENOMEM);
>
> mutex_init(&container->lock);
> + INIT_LIST_HEAD_RCU(&container->group_list);

I see no other mentions of rcu related to this list, which doesn't
seem right.

> container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
>
> return container;
> }
>
> +static int tce_iommu_clear(struct tce_container *container,
> + struct iommu_table *tbl,
> + unsigned long entry, unsigned long pages);
> +
> static void tce_iommu_release(void *iommu_data)
> {
> struct tce_container *container = iommu_data;
> + struct iommu_table_group *table_group;
> + struct tce_iommu_group *tcegrp;
> + long i;
>
> - WARN_ON(container->grp);
> + while (tce_groups_attached(container)) {
> + tcegrp = list_first_entry(&container->group_list,
> + struct tce_iommu_group, next);
> + table_group = iommu_group_get_iommudata(tcegrp->grp);
> + tce_iommu_detach_group(iommu_data, tcegrp->grp);
> + }
>
> - if (container->grp)
> - tce_iommu_detach_group(iommu_data, container->grp);
> + /*
> + * If VFIO created a table, it was not disposed
> + * by tce_iommu_detach_group() so do it now.
> + */
> + for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> + struct iommu_table *tbl = &container->tables[i];
> +
> + tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
> + tce_iommu_free_table(tbl);
> + }
>
> tce_iommu_disable(container);
> mutex_destroy(&container->lock);
> @@ -509,12 +582,15 @@ static long tce_iommu_ioctl(void *iommu_data,
>
> case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
> struct vfio_iommu_spapr_tce_info info;
> + struct tce_iommu_group *tcegrp;
> struct iommu_table_group *table_group;
>
> - if (WARN_ON(!container->grp))
> + if (!tce_groups_attached(container))
> return -ENXIO;
>
> - table_group = iommu_group_get_iommudata(container->grp);
> + tcegrp = list_first_entry(&container->group_list,
> + struct tce_iommu_group, next);
> + table_group = iommu_group_get_iommudata(tcegrp->grp);
>
> if (!table_group)
> return -ENXIO;
> @@ -707,12 +783,20 @@ static long tce_iommu_ioctl(void *iommu_data,
> tce_iommu_disable(container);
> mutex_unlock(&container->lock);
> return 0;
> - case VFIO_EEH_PE_OP:
> - if (!container->grp)
> - return -ENODEV;
>
> - return vfio_spapr_iommu_eeh_ioctl(container->grp,
> - cmd, arg);
> + case VFIO_EEH_PE_OP: {
> + struct tce_iommu_group *tcegrp;
> +
> + ret = 0;
> + list_for_each_entry(tcegrp, &container->group_list, next) {
> + ret = vfio_spapr_iommu_eeh_ioctl(tcegrp->grp,
> + cmd, arg);
> + if (ret)
> + return ret;

Hrm. It occurs to me that EEH may need a way of referencing
individual groups. Even if multiple PEs are referencing the same TCE
tables, presumably EEH will isolate them individually.

> + }
> + return ret;
> + }
> +
> }
>
> return -ENOTTY;
> @@ -724,11 +808,14 @@ static void tce_iommu_release_ownership(struct tce_container *container,
> int i;
>
> for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> - struct iommu_table *tbl = &table_group->tables[i];
> + struct iommu_table *tbl = &container->tables[i];
>
> tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
> if (tbl->it_map)
> iommu_release_ownership(tbl);
> +
> + /* Reset the container's copy of the table descriptor */
> + memset(tbl, 0, sizeof(*tbl));
> }
> }
>
> @@ -758,38 +845,56 @@ static int tce_iommu_take_ownership(struct iommu_table_group *table_group)
> static int tce_iommu_attach_group(void *iommu_data,
> struct iommu_group *iommu_group)
> {
> - int ret;
> + int ret, i;
> struct tce_container *container = iommu_data;
> struct iommu_table_group *table_group;
> + struct tce_iommu_group *tcegrp = NULL;
> + bool first_group = !tce_groups_attached(container);
>
> mutex_lock(&container->lock);
>
> /* pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
> iommu_group_id(iommu_group), iommu_group); */
> - if (container->grp) {
> - pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
> - iommu_group_id(container->grp),
> - iommu_group_id(iommu_group));
> - ret = -EBUSY;
> - goto unlock_exit;
> - }
> -
> - if (container->enabled) {
> - pr_err("tce_vfio: attaching group #%u to enabled container\n",
> - iommu_group_id(iommu_group));
> - ret = -EBUSY;
> - goto unlock_exit;
> - }
> -
> table_group = iommu_group_get_iommudata(iommu_group);
> - if (!table_group) {
> - ret = -ENXIO;
> +
> + if (!first_group && (!table_group->ops ||
> + !table_group->ops->take_ownership ||
> + !table_group->ops->release_ownership)) {
> + ret = -EBUSY;
> + goto unlock_exit;
> + }
> +
> + /* Check if new group has the same iommu_ops (i.e. compatible) */
> + list_for_each_entry(tcegrp, &container->group_list, next) {
> + struct iommu_table_group *table_group_tmp;
> +
> + if (tcegrp->grp == iommu_group) {
> + pr_warn("tce_vfio: Group %d is already attached\n",
> + iommu_group_id(iommu_group));
> + ret = -EBUSY;
> + goto unlock_exit;
> + }
> + table_group_tmp = iommu_group_get_iommudata(tcegrp->grp);
> + if (table_group_tmp->ops != table_group->ops) {
> + pr_warn("tce_vfio: Group %d is incompatible with group %d\n",
> + iommu_group_id(iommu_group),
> + iommu_group_id(tcegrp->grp));
> + ret = -EPERM;
> + goto unlock_exit;
> + }
> + }
> +
> + tcegrp = kzalloc(sizeof(*tcegrp), GFP_KERNEL);
> + if (!tcegrp) {
> + ret = -ENOMEM;
> goto unlock_exit;
> }
>
> if (!table_group->ops || !table_group->ops->take_ownership ||
> !table_group->ops->release_ownership) {
> ret = tce_iommu_take_ownership(table_group);
> + if (!ret)
> + container->tables[0] = table_group->tables[0];
> } else if (!table_group->ops->create_table ||
> !table_group->ops->set_window) {
> WARN_ON_ONCE(1);
> @@ -801,23 +906,46 @@ static int tce_iommu_attach_group(void *iommu_data,
> * the pages that has been explicitly mapped into the iommu
> */
> table_group->ops->take_ownership(table_group);
> - ret = table_group->ops->create_table(table_group,
> - 0, /* window number */
> - IOMMU_PAGE_SHIFT_4K,
> - table_group->tce32_size,
> - 1, /* default levels */
> - &table_group->tables[0]);
> - if (!ret)
> - ret = table_group->ops->set_window(table_group, 0,
> - &table_group->tables[0]);
> + /*
> + * If it the first group attached, check if there is
> + * a default DMA window and create one if none as
> + * the userspace expects it to exist.
> + */
> + if (first_group && !container->tables[0].it_size) {
> + ret = tce_iommu_create_table(table_group,
> + 0, /* window number */
> + IOMMU_PAGE_SHIFT_4K,
> + table_group->tce32_size,
> + 1, /* default levels */
> + &container->tables[0]);
> + if (ret)
> + goto unlock_exit;
> + }
> +
> + /* Set all windows to the new group */
> + for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> + struct iommu_table *tbl = &container->tables[i];
> +
> + if (!tbl->it_size)
> + continue;
> +
> + /* Set the default window to a new group */
> + ret = table_group->ops->set_window(table_group, i, tbl);
> + if (ret)
> + break;
> + }
> }
>
> if (ret)
> goto unlock_exit;
>
> - container->grp = iommu_group;
> + tcegrp->grp = iommu_group;
> + list_add(&tcegrp->next, &container->group_list);
>
> unlock_exit:
> + if (ret && tcegrp)
> + kfree(tcegrp);
> +
> mutex_unlock(&container->lock);
>
> return ret;
> @@ -828,25 +956,27 @@ static void tce_iommu_detach_group(void *iommu_data,
> {
> struct tce_container *container = iommu_data;
> struct iommu_table_group *table_group;
> + struct tce_iommu_group *tcegrp;
> long i;
> + bool found = false;
>
> mutex_lock(&container->lock);
> - if (iommu_group != container->grp) {
> - pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
> - iommu_group_id(iommu_group),
> - iommu_group_id(container->grp));
> +
> + list_for_each_entry(tcegrp, &container->group_list, next) {
> + if (tcegrp->grp == iommu_group) {
> + found = true;
> + break;
> + }
> + }
> +
> + if (!found) {
> + pr_warn("tce_vfio: detaching unattached group #%u\n",
> + iommu_group_id(iommu_group));
> goto unlock_exit;
> }
>
> - if (container->enabled) {
> - pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
> - iommu_group_id(container->grp));
> - tce_iommu_disable(container);
> - }
> -
> - /* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
> - iommu_group_id(iommu_group), iommu_group); */
> - container->grp = NULL;
> + list_del(&tcegrp->next);
> + kfree(tcegrp);
>
> table_group = iommu_group_get_iommudata(iommu_group);
> BUG_ON(!table_group);
> @@ -857,18 +987,8 @@ static void tce_iommu_detach_group(void *iommu_data,
> else if (!table_group->ops->unset_window)
> WARN_ON_ONCE(1);
> else {
> - for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> - struct iommu_table tbl = table_group->tables[i];
> -
> - if (!tbl.it_size)
> - continue;
> -
> + for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i)
> table_group->ops->unset_window(table_group, i);
> - tce_iommu_clear(container, &tbl,
> - tbl.it_offset, tbl.it_size);
> - if (tbl.it_ops->free)
> - tbl.it_ops->free(&tbl);
> - }
>
> table_group->ops->release_ownership(table_group);
> }

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


2015-04-30 08:25:37

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH kernel v9 28/32] powerpc/mmu: Add userspace-to-physical addresses translation cache

On Thu, Apr 30, 2015 at 04:34:55PM +1000, David Gibson wrote:
> On Sat, Apr 25, 2015 at 10:14:52PM +1000, Alexey Kardashevskiy wrote:
> > We are adding support for DMA memory pre-registration to be used in
> > conjunction with VFIO. The idea is that the userspace which is going to
> > run a guest may want to pre-register a user space memory region so
> > it all gets pinned once and never goes away. Having this done,
> > a hypervisor will not have to pin/unpin pages on every DMA map/unmap
> > request. This is going to help with multiple pinning of the same memory
> > and in-kernel acceleration of DMA requests.
> >
> > This adds a list of memory regions to mm_context_t. Each region consists
> > of a header and a list of physical addresses. This adds API to:
> > 1. register/unregister memory regions;
> > 2. do final cleanup (which puts all pre-registered pages);
> > 3. do userspace to physical address translation;
> > 4. manage a mapped pages counter; when it is zero, it is safe to
> > unregister the region.
> >
> > Multiple registration of the same region is allowed, kref is used to
> > track the number of registrations.
>
> [snip]
> > +long mm_iommu_alloc(unsigned long ua, unsigned long entries,
> > + struct mm_iommu_table_group_mem_t **pmem)
> > +{
> > + struct mm_iommu_table_group_mem_t *mem;
> > + long i, j;
> > + struct page *page = NULL;
> > +
> > + list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
> > + next) {
> > + if ((mem->ua == ua) && (mem->entries == entries))
> > + return -EBUSY;
> > +
> > + /* Overlap? */
> > + if ((mem->ua < (ua + (entries << PAGE_SHIFT))) &&
> > + (ua < (mem->ua + (mem->entries << PAGE_SHIFT))))
> > + return -EINVAL;
> > + }
> > +
> > + mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> > + if (!mem)
> > + return -ENOMEM;
> > +
> > + mem->hpas = vzalloc(entries * sizeof(mem->hpas[0]));
> > + if (!mem->hpas) {
> > + kfree(mem);
> > + return -ENOMEM;
> > + }
>
> So, I've thought more about this and I'm really confused as to what
> this is supposed to be accomplishing.
>
> I see that you need to keep track of what regions are registered, so
> you don't double lock or unlock, but I don't see what the point of
> actually storing the translations in hpas is.
>
> I had assumed it was so that you could later on get to the
> translations in real mode when you do in-kernel acceleration. But
> that doesn't make sense, because the array is vmalloc()ed, so can't be
> accessed in real mode anyway.

We can access vmalloc'd arrays in real mode using real_vmalloc_addr().

> I can't think of a circumstance in which you can use hpas where you
> couldn't just walk the page tables anyway.

The problem with walking the page tables is that there is no guarantee
that the page you find that way is the page that was returned by the
gup_fast() we did earlier. Storing the hpas means that we know for
sure that the page we're doing DMA to is one that we have an elevated
page count on.

Also, there are various points where a Linux PTE is made temporarily
invalid for a short time. If we happened to do a H_PUT_TCE on one cpu
while another cpu was doing that, we'd get a spurious failure returned
by the H_PUT_TCE.

Paul.
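
For illustration, a rough sketch (not code from this series) of how the
vzalloc()ed hpas[] array could be consulted from real mode; real_vmalloc_addr()
is the existing KVM HV helper Paul refers to, and the ua/entries/hpas fields
follow the mm_iommu_table_group_mem_t layout from patch 28:

static long rm_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
		unsigned long ua, unsigned long *hpa)
{
	unsigned long entry = (ua - mem->ua) >> PAGE_SHIFT;
	unsigned long *va;

	if (entry >= mem->entries)
		return -EFAULT;

	/* hpas[] lives in vmalloc space, translate before touching it */
	va = real_vmalloc_addr(&mem->hpas[entry]);
	if (!va)
		return -EFAULT;

	/* The stored page is the one gup_fast() took a reference on */
	*hpa = *va | (ua & ~PAGE_MASK);
	return 0;
}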

2015-04-30 09:33:38

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible

On 04/30/2015 05:22 PM, David Gibson wrote:
> On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote:
>> At the moment only one group per container is supported.
>> POWER8 CPUs have a more flexible design and allow having 2 TCE tables per
>> IOMMU group so we can relax this limitation and support multiple groups
>> per container.
>
> It's not obvious why allowing multiple TCE tables per PE has any
> bearing on allowing multiple groups per container.


This patchset is a global TCE tables rework (patches 1..30, roughly) with 2
outcomes:
1. reusing the same IOMMU table for multiple groups - patch 31;
2. allowing dynamic create/remove of IOMMU tables - patch 32.

I can remove this one from the patchset and post it separately later but
since 1..30 aim to support both 1) and 2), I think I had better keep them all
together (it might explain some of the changes I do in 1..30).



>> This adds TCE table descriptors to a container and uses iommu_table_group_ops
>> to create/set DMA windows on IOMMU groups so the same TCE tables will be
>> shared between several IOMMU groups.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> [aw: for the vfio related changes]
>> Acked-by: Alex Williamson <[email protected]>
>> ---
>> Changes:
>> v7:
>> * updated doc
>> ---
>> Documentation/vfio.txt | 8 +-
>> drivers/vfio/vfio_iommu_spapr_tce.c | 268 ++++++++++++++++++++++++++----------
>> 2 files changed, 199 insertions(+), 77 deletions(-)
>>
>> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
>> index 94328c8..7dcf2b5 100644
>> --- a/Documentation/vfio.txt
>> +++ b/Documentation/vfio.txt
>> @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
>>
>> This implementation has some specifics:
>>
>> -1) Only one IOMMU group per container is supported as an IOMMU group
>> -represents the minimal entity which isolation can be guaranteed for and
>> -groups are allocated statically, one per a Partitionable Endpoint (PE)
>> +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
>> +container is supported as an IOMMU table is allocated at boot time,
>> +one table per IOMMU group, which is a Partitionable Endpoint (PE)
>> (PE is often a PCI domain but not always).
>
> I thought the more fundamental problem was that different PEs tended
> to use disjoint bus address ranges, so even by duplicating put_tce
> across PEs you couldn't have a common address space.


Sorry, I am not following you here.

By duplicating put_tce, I can have multiple IOMMU groups on the same
virtual PHB in QEMU, "[PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple
groups per container" does this, and the address ranges will be the same.

What I cannot do on p5ioc2 is program the same table into multiple
physical PHBs (or I could, but it is very different from IODA2, pretty
ugly, and might not always be possible because I would have to allocate
these pages from some common pool and face problems like fragmentation).



>> +Newer systems (POWER8 with IODA2) have an improved hardware design which allows
>> +this limitation to be removed and multiple IOMMU groups per VFIO container.
>>
>> 2) The hardware supports so called DMA windows - the PCI address range
>> within which DMA transfer is allowed, any attempt to access address space
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index a7d6729..970e3a2 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -82,6 +82,11 @@ static void decrement_locked_vm(long npages)
>> * into DMA'ble space using the IOMMU
>> */
>>
>> +struct tce_iommu_group {
>> + struct list_head next;
>> + struct iommu_group *grp;
>> +};
>> +
>> /*
>> * The container descriptor supports only a single group per container.
>> * Required by the API as the container is not supplied with the IOMMU group
>> @@ -89,10 +94,11 @@ static void decrement_locked_vm(long npages)
>> */
>> struct tce_container {
>> struct mutex lock;
>> - struct iommu_group *grp;
>> bool enabled;
>> unsigned long locked_pages;
>> bool v2;
>> + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>
> Hrm, so here we have more copies of the full iommu_table structures,
> which again muddies the lifetime. The table_group pointer is
> presumably meaningless in these copies, which seems dangerously
> confusing.


Ouch. This is bad. No, table_group is not meaningless here as it is used to
get to the PE number to invalidate the TCE cache. I just realized that although
I need to update just a single table, I still have to invalidate the TCE cache
for every attached group/PE, so I need a list of iommu_table_group's here,
not a single pointer...
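
Purely as an illustration of that direction (none of this is in the series; the
per-group invalidate hook and the helper name are assumptions), the container
could keep walking its group_list after a single table update:

static void tce_invalidate_all_groups(struct tce_container *container,
		struct iommu_table *tbl)
{
	struct tce_iommu_group *tcegrp;

	/* One table update, but every attached PE caches TCEs separately */
	list_for_each_entry(tcegrp, &container->group_list, next) {
		struct iommu_table_group *table_group =
				iommu_group_get_iommudata(tcegrp->grp);

		if (table_group && table_group->ops &&
				table_group->ops->invalidate)
			table_group->ops->invalidate(table_group, tbl);
	}
}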



>> + struct list_head group_list;
>> };
>>
>> static long tce_unregister_pages(struct tce_container *container,
>> @@ -154,20 +160,20 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift)
>> return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift;
>> }
>>
>> +static inline bool tce_groups_attached(struct tce_container *container)
>> +{
>> + return !list_empty(&container->group_list);
>> +}
>> +
>> static struct iommu_table *spapr_tce_find_table(
>> struct tce_container *container,
>> phys_addr_t ioba)
>> {
>> long i;
>> struct iommu_table *ret = NULL;
>> - struct iommu_table_group *table_group;
>> -
>> - table_group = iommu_group_get_iommudata(container->grp);
>> - if (!table_group)
>> - return NULL;
>>
>> for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>> - struct iommu_table *tbl = &table_group->tables[i];
>> + struct iommu_table *tbl = &container->tables[i];
>> unsigned long entry = ioba >> tbl->it_page_shift;
>> unsigned long start = tbl->it_offset;
>> unsigned long end = start + tbl->it_size;
>> @@ -186,9 +192,7 @@ static int tce_iommu_enable(struct tce_container *container)
>> int ret = 0;
>> unsigned long locked;
>> struct iommu_table_group *table_group;
>> -
>> - if (!container->grp)
>> - return -ENXIO;
>> + struct tce_iommu_group *tcegrp;
>>
>> if (!current->mm)
>> return -ESRCH; /* process exited */
>> @@ -225,7 +229,12 @@ static int tce_iommu_enable(struct tce_container *container)
>> * as there is no way to know how much we should increment
>> * the locked_vm counter.
>> */
>> - table_group = iommu_group_get_iommudata(container->grp);
>> + if (!tce_groups_attached(container))
>> + return -ENODEV;
>> +
>> + tcegrp = list_first_entry(&container->group_list,
>> + struct tce_iommu_group, next);
>> + table_group = iommu_group_get_iommudata(tcegrp->grp);
>> if (!table_group)
>> return -ENODEV;
>>
>> @@ -257,6 +266,48 @@ static void tce_iommu_disable(struct tce_container *container)
>> decrement_locked_vm(container->locked_pages);
>> }
>>
>> +static long tce_iommu_create_table(struct iommu_table_group *table_group,
>> + int num,
>> + __u32 page_shift,
>> + __u64 window_size,
>> + __u32 levels,
>> + struct iommu_table *tbl)
>
> With multiple groups (and therefore PEs) per container, this seems
> wrong. There's only one table_group per PE, so what's special about
> the PE whose table group is passed in here?


The created table is allocated on the same node as the table_group
(pe->phb->hose->node). This does not make much sense if we put multiple
groups into the same container, but we will recommend that people avoid
putting groups from different NUMA nodes into the same container.

Also, the allocated table gets its bus offset initialized in create_table()
(which is IODA2-specific knowledge). It is there to emphasize the fact that
we do not get to choose where to map the window on the bus: it is hardcoded,
and it is easier to deal with tables whose offset is set once. I could add
a bus_offset parameter to set_window(), but it would just be converted back to
the window number.



>> +{
>> + long ret, table_size;
>> +
>> + table_size = table_group->ops->get_table_size(page_shift, window_size,
>> + levels);
>> + if (!table_size)
>> + return -EINVAL;
>> +
>> + ret = try_increment_locked_vm(table_size >> PAGE_SHIFT);
>> + if (ret)
>> + return ret;
>> +
>> + ret = table_group->ops->create_table(table_group, num,
>> + page_shift, window_size, levels, tbl);
>> +
>> + WARN_ON(!ret && !tbl->it_ops->free);
>> + WARN_ON(!ret && (tbl->it_allocated_size != table_size));
>> +
>> + if (ret)
>> + decrement_locked_vm(table_size >> PAGE_SHIFT);
>> +
>> + return ret;
>> +}
>> +
>> +static void tce_iommu_free_table(struct iommu_table *tbl)
>> +{
>> + unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
>> +
>> + if (!tbl->it_size)
>> + return;
>> +
>> + tbl->it_ops->free(tbl);
>
> So, this is exactly the case where the lifetimes are badly confusing.
> How can you be confident here that another copy of the iommu_table
> struct isn't referencing the same TCE tables?


Window create/remove is handled by a single driver file. It is not like
there are many of these tables. But yes, valid point.



>> + decrement_locked_vm(pages);
>> + memset(tbl, 0, sizeof(*tbl));
>> +}
>> +
>> static void *tce_iommu_open(unsigned long arg)
>> {
>> struct tce_container *container;
>> @@ -271,19 +322,41 @@ static void *tce_iommu_open(unsigned long arg)
>> return ERR_PTR(-ENOMEM);
>>
>> mutex_init(&container->lock);
>> + INIT_LIST_HEAD_RCU(&container->group_list);
>
> I see no other mentions of rcu related to this list, which doesn't
> seem right.
>
>> container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
>>
>> return container;
>> }
>>
>> +static int tce_iommu_clear(struct tce_container *container,
>> + struct iommu_table *tbl,
>> + unsigned long entry, unsigned long pages);
>> +
>> static void tce_iommu_release(void *iommu_data)
>> {
>> struct tce_container *container = iommu_data;
>> + struct iommu_table_group *table_group;
>> + struct tce_iommu_group *tcegrp;
>> + long i;
>>
>> - WARN_ON(container->grp);
>> + while (tce_groups_attached(container)) {
>> + tcegrp = list_first_entry(&container->group_list,
>> + struct tce_iommu_group, next);
>> + table_group = iommu_group_get_iommudata(tcegrp->grp);
>> + tce_iommu_detach_group(iommu_data, tcegrp->grp);
>> + }
>>
>> - if (container->grp)
>> - tce_iommu_detach_group(iommu_data, container->grp);
>> + /*
>> + * If VFIO created a table, it was not disposed
>> + * by tce_iommu_detach_group() so do it now.
>> + */
>> + for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>> + struct iommu_table *tbl = &container->tables[i];
>> +
>> + tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
>> + tce_iommu_free_table(tbl);
>> + }
>>
>> tce_iommu_disable(container);
>> mutex_destroy(&container->lock);
>> @@ -509,12 +582,15 @@ static long tce_iommu_ioctl(void *iommu_data,
>>
>> case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
>> struct vfio_iommu_spapr_tce_info info;
>> + struct tce_iommu_group *tcegrp;
>> struct iommu_table_group *table_group;
>>
>> - if (WARN_ON(!container->grp))
>> + if (!tce_groups_attached(container))
>> return -ENXIO;
>>
>> - table_group = iommu_group_get_iommudata(container->grp);
>> + tcegrp = list_first_entry(&container->group_list,
>> + struct tce_iommu_group, next);
>> + table_group = iommu_group_get_iommudata(tcegrp->grp);
>>
>> if (!table_group)
>> return -ENXIO;
>> @@ -707,12 +783,20 @@ static long tce_iommu_ioctl(void *iommu_data,
>> tce_iommu_disable(container);
>> mutex_unlock(&container->lock);
>> return 0;
>> - case VFIO_EEH_PE_OP:
>> - if (!container->grp)
>> - return -ENODEV;
>>
>> - return vfio_spapr_iommu_eeh_ioctl(container->grp,
>> - cmd, arg);
>> + case VFIO_EEH_PE_OP: {
>> + struct tce_iommu_group *tcegrp;
>> +
>> + ret = 0;
>> + list_for_each_entry(tcegrp, &container->group_list, next) {
>> + ret = vfio_spapr_iommu_eeh_ioctl(tcegrp->grp,
>> + cmd, arg);
>> + if (ret)
>> + return ret;
>
> Hrm. It occurs to me that EEH may need a way of referencing
> individual groups. Even if multiple PEs are referencing the same TCE
> tables, presumably EEH will isolate them individually.


Well. I asked our EEH guy Gavin, he did not object to this change but I'll
double check :)



--
Alexey

2015-04-30 09:56:31

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 23/32] powerpc/powernv/ioda: Define and implement DMA table/window management callbacks

On 04/30/2015 02:37 PM, David Gibson wrote:
> On Wed, Apr 29, 2015 at 07:44:20PM +1000, Alexey Kardashevskiy wrote:
>> On 04/29/2015 03:30 PM, David Gibson wrote:
>>> On Sat, Apr 25, 2015 at 10:14:47PM +1000, Alexey Kardashevskiy wrote:
>>>> This extends iommu_table_group_ops by a set of callbacks to support
>>>> dynamic DMA windows management.
>>>>
>>>> create_table() creates a TCE table with specific parameters.
>>>> it receives iommu_table_group to know nodeid in order to allocate
>>>> TCE table memory closer to the PHB. The exact format of allocated
>>>> multi-level table might be also specific to the PHB model (not
>>>> the case now though).
>>>> This callback calculates the DMA window offset on a PCI bus from @num
>>>> and stores it in a just created table.
>>>>
>>>> set_window() sets the window at specified TVT index + @num on PHB.
>>>>
>>>> unset_window() unsets the window from specified TVT.
>>>>
>>>> This adds a free() callback to iommu_table_ops to free the memory
>>>> (potentially a tree of tables) allocated for the TCE table.
>>>
>>> Doesn't the free callback belong with the previous patch introducing
>>> multi-level tables?
>>
>>
>>
>> If I did that, you would say "why is it here if nothing calls it" on the
>> "multilevel" patch and "I see the allocation but I do not see the memory
>> release" ;)
>
> Yeah, fair enough ;)
>
>> I need some rule of thumb here. I think it is a bit cleaner if the same
>> patch adds a callback for memory allocation and its counterpart, no?
>
> On further consideration, yes, I think you're right.
>
>>>> create_table() and free() are supposed to be called once per
>>>> VFIO container and set_window()/unset_window() are supposed to be
>>>> called for every group in a container.
>>>>
>>>> This adds IOMMU capabilities to iommu_table_group such as default
>>>> 32bit window parameters and others.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>> ---
>>>> arch/powerpc/include/asm/iommu.h | 19 ++++++++
>>>> arch/powerpc/platforms/powernv/pci-ioda.c | 75 ++++++++++++++++++++++++++---
>>>> arch/powerpc/platforms/powernv/pci-p5ioc2.c | 12 +++--
>>>> 3 files changed, 96 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>>>> index 0f50ee2..7694546 100644
>>>> --- a/arch/powerpc/include/asm/iommu.h
>>>> +++ b/arch/powerpc/include/asm/iommu.h
>>>> @@ -70,6 +70,7 @@ struct iommu_table_ops {
>>>> /* get() returns a physical address */
>>>> unsigned long (*get)(struct iommu_table *tbl, long index);
>>>> void (*flush)(struct iommu_table *tbl);
>>>> + void (*free)(struct iommu_table *tbl);
>>>> };
>>>>
>>>> /* These are used by VIO */
>>>> @@ -148,6 +149,17 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>>>> struct iommu_table_group;
>>>>
>>>> struct iommu_table_group_ops {
>>>> + long (*create_table)(struct iommu_table_group *table_group,
>>>> + int num,
>>>> + __u32 page_shift,
>>>> + __u64 window_size,
>>>> + __u32 levels,
>>>> + struct iommu_table *tbl);
>>>> + long (*set_window)(struct iommu_table_group *table_group,
>>>> + int num,
>>>> + struct iommu_table *tblnew);
>>>> + long (*unset_window)(struct iommu_table_group *table_group,
>>>> + int num);
>>>> /*
>>>> * Switches ownership from the kernel itself to an external
>>>> * user. While ownership is taken, the kernel cannot use IOMMU itself.
>>>> @@ -160,6 +172,13 @@ struct iommu_table_group {
>>>> #ifdef CONFIG_IOMMU_API
>>>> struct iommu_group *group;
>>>> #endif
>>>> + /* Some key properties of IOMMU */
>>>> + __u32 tce32_start;
>>>> + __u32 tce32_size;
>>>> + __u64 pgsizes; /* Bitmap of supported page sizes */
>>>> + __u32 max_dynamic_windows_supported;
>>>> + __u32 max_levels;
>>>
>>> With this information, table_group seems even more like a bad name.
>>> "iommu_state" maybe?
>>
>>
>> Please, no. We will never come to agreement then :( And "iommu_state" is too
>> general anyway, it won't pass.
>>
>>
>>>> struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>>>> struct iommu_table_group_ops *ops;
>>>> };
>>>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>> index cc1d09c..4828837 100644
>>>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>> @@ -24,6 +24,7 @@
>>>> #include <linux/msi.h>
>>>> #include <linux/memblock.h>
>>>> #include <linux/iommu.h>
>>>> +#include <linux/sizes.h>
>>>>
>>>> #include <asm/sections.h>
>>>> #include <asm/io.h>
>>>> @@ -1846,6 +1847,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
>>>> #endif
>>>> .clear = pnv_ioda2_tce_free,
>>>> .get = pnv_tce_get,
>>>> + .free = pnv_pci_free_table,
>>>> };
>>>>
>>>> static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
>>>> @@ -1936,6 +1938,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>>>> TCE_PCI_SWINV_PAIR);
>>>>
>>>> tbl->it_ops = &pnv_ioda1_iommu_ops;
>>>> + pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift;
>>>> + pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift;
>>>> iommu_init_table(tbl, phb->hose->node);
>>>>
>>>> if (pe->flags & PNV_IODA_PE_DEV) {
>>>> @@ -1961,7 +1965,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>>>> }
>>>>
>>>> static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
>>>> - struct iommu_table *tbl)
>>>> + int num, struct iommu_table *tbl)
>>>> {
>>>> struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
>>>> table_group);
>>>> @@ -1972,9 +1976,10 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
>>>> const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
>>>> const __u64 win_size = tbl->it_size << tbl->it_page_shift;
>>>>
>>>> - pe_info(pe, "Setting up window at %llx..%llx "
>>>> + pe_info(pe, "Setting up window#%d at %llx..%llx "
>>>> "pgsize=0x%x tablesize=0x%lx "
>>>> "levels=%d levelsize=%x\n",
>>>> + num,
>>>> start_addr, start_addr + win_size - 1,
>>>> 1UL << tbl->it_page_shift, tbl->it_size << 3,
>>>> tbl->it_indirect_levels + 1, tbl->it_level_size << 3);
>>>> @@ -1987,7 +1992,7 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
>>>> */
>>>> rc = opal_pci_map_pe_dma_window(phb->opal_id,
>>>> pe->pe_number,
>>>> - pe->pe_number << 1,
>>>> + (pe->pe_number << 1) + num,
>>>
>>> Heh, yes, well, that makes it rather clear that only 2 tables are possible.
>>>
>>>> tbl->it_indirect_levels + 1,
>>>> __pa(tbl->it_base),
>>>> size << 3,
>>>> @@ -2000,7 +2005,7 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
>>>> pnv_pci_ioda2_tvt_invalidate(pe);
>>>>
>>>> /* Store fully initialized *tbl (may be external) in PE */
>>>> - pe->table_group.tables[0] = *tbl;
>>>> + pe->table_group.tables[num] = *tbl;
>>>
>>> I'm a bit confused by this whole set_window thing. Is the idea that
>>> with multiple groups in a container you have multiple table_groups
>>> each with different copies of the iommu_table structures, but pointing
>>> to the same actual TCE entries (it_base)?
>>
>> Yes.
>>
>>> It seems to me not terribly
>>> obvious when you "create" a table and when you "set" a window.
>>
>>
>> A table is not attached anywhere until its address is programmed (in
>> set_window()) to the hardware, it is just a table in memory. For
>> POWER8/IODA2, I create a table before I attach any group to a container,
>> then I program this table to every attached container, right now it is done
>> in container's attach_group(). So later we can hotplug any host PCI device
>> to a container - it will program same TCE table to every new group in the
>> container.
>
> So you "create" once, then "set" it to one or more table_groups? It
> seems odd that "create" is a table_group callback in that case.


Where else could it be? ppc_md? We are getting rid of these. Some global
function? We do not want VFIO to know about this. I have run out of ideas here.




--
Alexey
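
To make the flow concrete, here is a short sketch of the "create once, then set
on every group" sequence being discussed, built from the create_table()/
set_window() callbacks quoted above; the helper name and the container fields
mirror what tce_iommu_attach_group() does in patch 31 and are illustrative only:

static long container_add_group(struct tce_container *container,
		struct iommu_table_group *table_group)
{
	long ret = 0;
	int i;

	/* The default window is created once, for the first attached group */
	if (!container->tables[0].it_size) {
		ret = table_group->ops->create_table(table_group,
				0,			/* window number */
				IOMMU_PAGE_SHIFT_4K,
				table_group->tce32_size,
				1,			/* levels */
				&container->tables[0]);
		if (ret)
			return ret;
	}

	/* Every attached group gets the same table(s) programmed into its TVT */
	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
		if (!container->tables[i].it_size)
			continue;

		ret = table_group->ops->set_window(table_group, i,
				&container->tables[i]);
		if (ret)
			break;
	}

	return ret;
}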

2015-05-01 01:46:45

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible

On Thu, 2015-04-30 at 19:33 +1000, Alexey Kardashevskiy wrote:
> On 04/30/2015 05:22 PM, David Gibson wrote:
> > On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote:
> >> At the moment only one group per container is supported.
> >> POWER8 CPUs have a more flexible design and allow having 2 TCE tables per
> >> IOMMU group so we can relax this limitation and support multiple groups
> >> per container.
> >
> > It's not obvious why allowing multiple TCE tables per PE has any
> > bearing on allowing multiple groups per container.
>
>
> This patchset is a global TCE tables rework (patches 1..30, roughly) with 2
> outcomes:
> 1. reusing the same IOMMU table for multiple groups - patch 31;
> 2. allowing dynamic create/remove of IOMMU tables - patch 32.
>
> I can remove this one from the patchset and post it separately later but
> since 1..30 aim to support both 1) and 2), I'd think I better keep them all
> together (might explain some of changes I do in 1..30).

I think you are talking past each other :-)

But yes, having 2 tables per group is orthogonal to the ability of
having multiple groups per container.

The latter is made possible on P8 in large part because each PE has its
own DMA address space (unlike P5IOC2 or P7IOC where a single address
space is segmented).

Also, on P8 you can actually make the TVT entries point to the same
table in memory, thus removing the need to duplicate the actual
tables (though you still have to duplicate the invalidations). I would
however recommend only sharing the table that way within a chip/node.
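
Roughly, in terms of the structures from Alexey's patch (a purely
illustrative sketch, not the actual implementation; it just reuses the
tce_container/tce_iommu_group list and the set_window() callback the
patch adds):

    /* Hypothetical helper: point every attached PE's TVE at one table */
    static long share_table_across_pes(struct tce_container *container,
            int num, struct iommu_table *tbl)
    {
        struct tce_iommu_group *tcegrp;
        long ret;

        list_for_each_entry(tcegrp, &container->group_list, next) {
            struct iommu_table_group *table_group =
                iommu_group_get_iommudata(tcegrp->grp);

            /* The same it_base is programmed into each PE's TVE... */
            ret = table_group->ops->set_window(table_group, num, tbl);
            if (ret)
                return ret;
            /* ...but the TCE kill/invalidate still has to be issued
             * once per PE (done inside set_window() in the patch). */
        }
        return 0;
    }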

.../..

> >>
> >> -1) Only one IOMMU group per container is supported as an IOMMU group
> >> -represents the minimal entity which isolation can be guaranteed for and
> >> -groups are allocated statically, one per a Partitionable Endpoint (PE)
> >> +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
> >> +container is supported as an IOMMU table is allocated at the boot time,
> >> +one table per a IOMMU group which is a Partitionable Endpoint (PE)
> >> (PE is often a PCI domain but not always).

> > I thought the more fundamental problem was that different PEs tended
> > to use disjoint bus address ranges, so even by duplicating put_tce
> > across PEs you couldn't have a common address space.

Yes. This is the problem with P7IOC and earlier. It *could* be doable on
P7IOC by making them the same PE but let's not go there.

> Sorry, I am not following you here.
>
> By duplicating put_tce, I can have multiple IOMMU groups on the same
> virtual PHB in QEMU, "[PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple
> groups per container" does this, the address ranges will be the same.

But that is only possible on P8 because only there do we have separate
address spaces between PEs.

> What I cannot do on p5ioc2 is programming the same table to multiple
> physical PHBs (or I could but it is very different than IODA2 and pretty
> ugly and might not always be possible because I would have to allocate
> these pages from some common pool and face problems like fragmentation).

And P7IOC has a similar issue. The DMA address top bits index the
window on P7IOC within a shared address space. It's possible to
configure a TVT to cover multiple devices but with very serious
limitations.

> >> +Newer systems (POWER8 with IODA2) have improved hardware design which allows
> >> +to remove this limitation and have multiple IOMMU groups per a VFIO container.
> >>
> >> 2) The hardware supports so called DMA windows - the PCI address range
> >> within which DMA transfer is allowed, any attempt to access address space
> >> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >> index a7d6729..970e3a2 100644
> >> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >> @@ -82,6 +82,11 @@ static void decrement_locked_vm(long npages)
> >> * into DMA'ble space using the IOMMU
> >> */
> >>
> >> +struct tce_iommu_group {
> >> + struct list_head next;
> >> + struct iommu_group *grp;
> >> +};
> >> +
> >> /*
> >> * The container descriptor supports only a single group per container.
> >> * Required by the API as the container is not supplied with the IOMMU group
> >> @@ -89,10 +94,11 @@ static void decrement_locked_vm(long npages)
> >> */
> >> struct tce_container {
> >> struct mutex lock;
> >> - struct iommu_group *grp;
> >> bool enabled;
> >> unsigned long locked_pages;
> >> bool v2;
> >> + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
> >
> > Hrm, so here we have more copies of the full iommu_table structures,
> > which again muddies the lifetime. The table_group pointer is
> > presumably meaningless in these copies, which seems dangerously
> > confusing.
>
>
> Ouch. This is bad. No, table_group is not pointless here as it is used to
> get to the PE number to invalidate TCE cache. I just realized although I
> need to update just a single table, I still have to invalidate TCE cache
> for every attached group/PE so I need a list of iommu_table_group's here,
> not a single pointer...
>
>
>
> >> + struct list_head group_list;
> >> };
> >>
> >> static long tce_unregister_pages(struct tce_container *container,
> >> @@ -154,20 +160,20 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift)
> >> return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift;
> >> }
> >>
> >> +static inline bool tce_groups_attached(struct tce_container *container)
> >> +{
> >> + return !list_empty(&container->group_list);
> >> +}
> >> +
> >> static struct iommu_table *spapr_tce_find_table(
> >> struct tce_container *container,
> >> phys_addr_t ioba)
> >> {
> >> long i;
> >> struct iommu_table *ret = NULL;
> >> - struct iommu_table_group *table_group;
> >> -
> >> - table_group = iommu_group_get_iommudata(container->grp);
> >> - if (!table_group)
> >> - return NULL;
> >>
> >> for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> >> - struct iommu_table *tbl = &table_group->tables[i];
> >> + struct iommu_table *tbl = &container->tables[i];
> >> unsigned long entry = ioba >> tbl->it_page_shift;
> >> unsigned long start = tbl->it_offset;
> >> unsigned long end = start + tbl->it_size;
> >> @@ -186,9 +192,7 @@ static int tce_iommu_enable(struct tce_container *container)
> >> int ret = 0;
> >> unsigned long locked;
> >> struct iommu_table_group *table_group;
> >> -
> >> - if (!container->grp)
> >> - return -ENXIO;
> >> + struct tce_iommu_group *tcegrp;
> >>
> >> if (!current->mm)
> >> return -ESRCH; /* process exited */
> >> @@ -225,7 +229,12 @@ static int tce_iommu_enable(struct tce_container *container)
> >> * as there is no way to know how much we should increment
> >> * the locked_vm counter.
> >> */
> >> - table_group = iommu_group_get_iommudata(container->grp);
> >> + if (!tce_groups_attached(container))
> >> + return -ENODEV;
> >> +
> >> + tcegrp = list_first_entry(&container->group_list,
> >> + struct tce_iommu_group, next);
> >> + table_group = iommu_group_get_iommudata(tcegrp->grp);
> >> if (!table_group)
> >> return -ENODEV;
> >>
> >> @@ -257,6 +266,48 @@ static void tce_iommu_disable(struct tce_container *container)
> >> decrement_locked_vm(container->locked_pages);
> >> }
> >>
> >> +static long tce_iommu_create_table(struct iommu_table_group *table_group,
> >> + int num,
> >> + __u32 page_shift,
> >> + __u64 window_size,
> >> + __u32 levels,
> >> + struct iommu_table *tbl)
> >
> > With multiple groups (and therefore PEs) per container, this seems
> > wrong. There's only one table_group per PE, so what's special about
> > PE whose table group is passed in here.
>
>
> The created table is allocated at the same node as table_group
> (pe->phb->hose->node). This does not make much sense if we put multiple
> groups to the same container but we will recommend people to avoid putting
> groups from different NUMA nodes to the same container.
>
> Also, the allocated table gets bus offset initialized in create_table()
> (which is IODA2-specific knowledge). It is there to emphasize the fact that
> we do not get to choose where to map the window on a bus, it is hardcoded
> and easier to deal with the tables which have offset set once - I could add
> a bus_offset parameter to set_window() but it would be converted back to
> the window number.
>
>
>
> >> +{
> >> + long ret, table_size;
> >> +
> >> + table_size = table_group->ops->get_table_size(page_shift, window_size,
> >> + levels);
> >> + if (!table_size)
> >> + return -EINVAL;
> >> +
> >> + ret = try_increment_locked_vm(table_size >> PAGE_SHIFT);
> >> + if (ret)
> >> + return ret;
> >> +
> >> + ret = table_group->ops->create_table(table_group, num,
> >> + page_shift, window_size, levels, tbl);
> >> +
> >> + WARN_ON(!ret && !tbl->it_ops->free);
> >> + WARN_ON(!ret && (tbl->it_allocated_size != table_size));
> >> +
> >> + if (ret)
> >> + decrement_locked_vm(table_size >> PAGE_SHIFT);
> >> +
> >> + return ret;
> >> +}
> >> +
> >> +static void tce_iommu_free_table(struct iommu_table *tbl)
> >> +{
> >> + unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
> >> +
> >> + if (!tbl->it_size)
> >> + return;
> >> +
> >> + tbl->it_ops->free(tbl);
> >
> > So, this is exactly the case where the lifetimes are badly confusing.
> > How can you be confident here that another copy of the iommu_table
> > struct isn't referencing the same TCE tables?
>
>
> Create/remove window is handled by a single file driver. It is not like
> there are many of these tables. But yes, valid point.
>
>
>
> >> + decrement_locked_vm(pages);
> >> + memset(tbl, 0, sizeof(*tbl));
> >> +}
> >> +
> >> static void *tce_iommu_open(unsigned long arg)
> >> {
> >> struct tce_container *container;
> >> @@ -271,19 +322,41 @@ static void *tce_iommu_open(unsigned long arg)
> >> return ERR_PTR(-ENOMEM);
> >>
> >> mutex_init(&container->lock);
> >> + INIT_LIST_HEAD_RCU(&container->group_list);
> >
> > I see no other mentions of rcu related to this list, which doesn't
> > seem right.
> >
> >> container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
> >>
> >> return container;
> >> }
> >>
> >> +static int tce_iommu_clear(struct tce_container *container,
> >> + struct iommu_table *tbl,
> >> + unsigned long entry, unsigned long pages);
> >> +
> >> static void tce_iommu_release(void *iommu_data)
> >> {
> >> struct tce_container *container = iommu_data;
> >> + struct iommu_table_group *table_group;
> >> + struct tce_iommu_group *tcegrp;
> >> + long i;
> >>
> >> - WARN_ON(container->grp);
> >> + while (tce_groups_attached(container)) {
> >> + tcegrp = list_first_entry(&container->group_list,
> >> + struct tce_iommu_group, next);
> >> + table_group = iommu_group_get_iommudata(tcegrp->grp);
> >> + tce_iommu_detach_group(iommu_data, tcegrp->grp);
> >> + }
> >>
> >> - if (container->grp)
> >> - tce_iommu_detach_group(iommu_data, container->grp);
> >> + /*
> >> + * If VFIO created a table, it was not disposed
> >> + * by tce_iommu_detach_group() so do it now.
> >> + */
> >> + for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> >> + struct iommu_table *tbl = &container->tables[i];
> >> +
> >> + tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
> >> + tce_iommu_free_table(tbl);
> >> + }
> >>
> >> tce_iommu_disable(container);
> >> mutex_destroy(&container->lock);
> >> @@ -509,12 +582,15 @@ static long tce_iommu_ioctl(void *iommu_data,
> >>
> >> case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
> >> struct vfio_iommu_spapr_tce_info info;
> >> + struct tce_iommu_group *tcegrp;
> >> struct iommu_table_group *table_group;
> >>
> >> - if (WARN_ON(!container->grp))
> >> + if (!tce_groups_attached(container))
> >> return -ENXIO;
> >>
> >> - table_group = iommu_group_get_iommudata(container->grp);
> >> + tcegrp = list_first_entry(&container->group_list,
> >> + struct tce_iommu_group, next);
> >> + table_group = iommu_group_get_iommudata(tcegrp->grp);
> >>
> >> if (!table_group)
> >> return -ENXIO;
> >> @@ -707,12 +783,20 @@ static long tce_iommu_ioctl(void *iommu_data,
> >> tce_iommu_disable(container);
> >> mutex_unlock(&container->lock);
> >> return 0;
> >> - case VFIO_EEH_PE_OP:
> >> - if (!container->grp)
> >> - return -ENODEV;
> >>
> >> - return vfio_spapr_iommu_eeh_ioctl(container->grp,
> >> - cmd, arg);
> >> + case VFIO_EEH_PE_OP: {
> >> + struct tce_iommu_group *tcegrp;
> >> +
> >> + ret = 0;
> >> + list_for_each_entry(tcegrp, &container->group_list, next) {
> >> + ret = vfio_spapr_iommu_eeh_ioctl(tcegrp->grp,
> >> + cmd, arg);
> >> + if (ret)
> >> + return ret;
> >
> > Hrm. It occurs to me that EEH may need a way of referencing
> > individual groups. Even if multiple PEs are referencing the same TCE
> > tables, presumably EEH will isolate them individually.
>
>
> Well. I asked our EEH guy Gavin, he did not object to this change but I'll
> double check :)
>
>
>

2015-05-01 03:56:00

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 23/32] powerpc/powernv/ioda: Define and implement DMA table/window management callbacks

On Thu, Apr 30, 2015 at 07:56:17PM +1000, Alexey Kardashevskiy wrote:
> On 04/30/2015 02:37 PM, David Gibson wrote:
> >On Wed, Apr 29, 2015 at 07:44:20PM +1000, Alexey Kardashevskiy wrote:
> >>On 04/29/2015 03:30 PM, David Gibson wrote:
> >>>On Sat, Apr 25, 2015 at 10:14:47PM +1000, Alexey Kardashevskiy wrote:
> >>>>This extends iommu_table_group_ops by a set of callbacks to support
> >>>>dynamic DMA windows management.
> >>>>
> >>>>create_table() creates a TCE table with specific parameters.
> >>>>It receives iommu_table_group to know the node id in order to allocate
> >>>>TCE table memory closer to the PHB. The exact format of allocated
> >>>>multi-level table might also be specific to the PHB model (not
> >>>>the case now though).
> >>>>This callback calculates the DMA window offset on a PCI bus from @num
> >>>>and stores it in a just created table.
> >>>>
> >>>>set_window() sets the window at specified TVT index + @num on PHB.
> >>>>
> >>>>unset_window() unsets the window from specified TVT.
> >>>>
> >>>>This adds a free() callback to iommu_table_ops to free the memory
> >>>>(potentially a tree of tables) allocated for the TCE table.
> >>>
> >>>Doesn't the free callback belong with the previous patch introducing
> >>>multi-level tables?
> >>
> >>
> >>
> >>If I did that, you would say "why is it here if nothing calls it" on
> >>"multilevel" patch and "I see the allocation but I do not see memory
> >>release" ;)
> >
> >Yeah, fair enough ;)
> >
> >>I need some rule of thumb here. I think it is a bit cleaner if the same
> >>patch adds a callback for memory allocation and its counterpart, no?
> >
> >On further consideration, yes, I think you're right.
> >
> >>>>create_table() and free() are supposed to be called once per
> >>>>VFIO container and set_window()/unset_window() are supposed to be
> >>>>called for every group in a container.
> >>>>
> >>>>This adds IOMMU capabilities to iommu_table_group such as default
> >>>>32bit window parameters and others.
> >>>>
> >>>>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>>>---
> >>>> arch/powerpc/include/asm/iommu.h | 19 ++++++++
> >>>> arch/powerpc/platforms/powernv/pci-ioda.c | 75 ++++++++++++++++++++++++++---
> >>>> arch/powerpc/platforms/powernv/pci-p5ioc2.c | 12 +++--
> >>>> 3 files changed, 96 insertions(+), 10 deletions(-)
> >>>>
> >>>>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>>>index 0f50ee2..7694546 100644
> >>>>--- a/arch/powerpc/include/asm/iommu.h
> >>>>+++ b/arch/powerpc/include/asm/iommu.h
> >>>>@@ -70,6 +70,7 @@ struct iommu_table_ops {
> >>>> /* get() returns a physical address */
> >>>> unsigned long (*get)(struct iommu_table *tbl, long index);
> >>>> void (*flush)(struct iommu_table *tbl);
> >>>>+ void (*free)(struct iommu_table *tbl);
> >>>> };
> >>>>
> >>>> /* These are used by VIO */
> >>>>@@ -148,6 +149,17 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
> >>>> struct iommu_table_group;
> >>>>
> >>>> struct iommu_table_group_ops {
> >>>>+ long (*create_table)(struct iommu_table_group *table_group,
> >>>>+ int num,
> >>>>+ __u32 page_shift,
> >>>>+ __u64 window_size,
> >>>>+ __u32 levels,
> >>>>+ struct iommu_table *tbl);
> >>>>+ long (*set_window)(struct iommu_table_group *table_group,
> >>>>+ int num,
> >>>>+ struct iommu_table *tblnew);
> >>>>+ long (*unset_window)(struct iommu_table_group *table_group,
> >>>>+ int num);
> >>>> /*
> >>>> * Switches ownership from the kernel itself to an external
> >>>> * user. While onwership is taken, the kernel cannot use IOMMU itself.
> >>>>@@ -160,6 +172,13 @@ struct iommu_table_group {
> >>>> #ifdef CONFIG_IOMMU_API
> >>>> struct iommu_group *group;
> >>>> #endif
> >>>>+ /* Some key properties of IOMMU */
> >>>>+ __u32 tce32_start;
> >>>>+ __u32 tce32_size;
> >>>>+ __u64 pgsizes; /* Bitmap of supported page sizes */
> >>>>+ __u32 max_dynamic_windows_supported;
> >>>>+ __u32 max_levels;
> >>>
> >>>With this information, table_group seems even more like a bad name.
> >>>"iommu_state" maybe?
> >>
> >>
> >>Please, no. We will never come to agreement then :( And "iommu_state" is too
> >>general anyway, it won't pass.
> >>
> >>
> >>>> struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
> >>>> struct iommu_table_group_ops *ops;
> >>>> };
> >>>>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>>>index cc1d09c..4828837 100644
> >>>>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>>>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>>>@@ -24,6 +24,7 @@
> >>>> #include <linux/msi.h>
> >>>> #include <linux/memblock.h>
> >>>> #include <linux/iommu.h>
> >>>>+#include <linux/sizes.h>
> >>>>
> >>>> #include <asm/sections.h>
> >>>> #include <asm/io.h>
> >>>>@@ -1846,6 +1847,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> >>>> #endif
> >>>> .clear = pnv_ioda2_tce_free,
> >>>> .get = pnv_tce_get,
> >>>>+ .free = pnv_pci_free_table,
> >>>> };
> >>>>
> >>>> static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
> >>>>@@ -1936,6 +1938,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> >>>> TCE_PCI_SWINV_PAIR);
> >>>>
> >>>> tbl->it_ops = &pnv_ioda1_iommu_ops;
> >>>>+ pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift;
> >>>>+ pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift;
> >>>> iommu_init_table(tbl, phb->hose->node);
> >>>>
> >>>> if (pe->flags & PNV_IODA_PE_DEV) {
> >>>>@@ -1961,7 +1965,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> >>>> }
> >>>>
> >>>> static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> >>>>- struct iommu_table *tbl)
> >>>>+ int num, struct iommu_table *tbl)
> >>>> {
> >>>> struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> >>>> table_group);
> >>>>@@ -1972,9 +1976,10 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> >>>> const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
> >>>> const __u64 win_size = tbl->it_size << tbl->it_page_shift;
> >>>>
> >>>>- pe_info(pe, "Setting up window at %llx..%llx "
> >>>>+ pe_info(pe, "Setting up window#%d at %llx..%llx "
> >>>> "pgsize=0x%x tablesize=0x%lx "
> >>>> "levels=%d levelsize=%x\n",
> >>>>+ num,
> >>>> start_addr, start_addr + win_size - 1,
> >>>> 1UL << tbl->it_page_shift, tbl->it_size << 3,
> >>>> tbl->it_indirect_levels + 1, tbl->it_level_size << 3);
> >>>>@@ -1987,7 +1992,7 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> >>>> */
> >>>> rc = opal_pci_map_pe_dma_window(phb->opal_id,
> >>>> pe->pe_number,
> >>>>- pe->pe_number << 1,
> >>>>+ (pe->pe_number << 1) + num,
> >>>
> >>>Heh, yes, well, that makes it rather clear that only 2 tables are possible.
> >>>
> >>>> tbl->it_indirect_levels + 1,
> >>>> __pa(tbl->it_base),
> >>>> size << 3,
> >>>>@@ -2000,7 +2005,7 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> >>>> pnv_pci_ioda2_tvt_invalidate(pe);
> >>>>
> >>>> /* Store fully initialized *tbl (may be external) in PE */
> >>>>- pe->table_group.tables[0] = *tbl;
> >>>>+ pe->table_group.tables[num] = *tbl;
> >>>
> >>>I'm a bit confused by this whole set_window thing. Is the idea that
> >>>with multiple groups in a container you have multiple table_groups,
> >>>each with different copies of the iommu_table structures, but pointing
> >>>to the same actual TCE entries (it_base)?
> >>
> >>Yes.
> >>
> >>>It seems to me not terribly
> >>>obvious when you "create" a table and when you "set" a window.
> >>
> >>
> >>A table is not attached anywhere until its address is programmed (in
> >>set_window()) to the hardware, it is just a table in memory. For
> >>POWER8/IODA2, I create a table before I attach any group to a container,
> >>then I program this table to every attached container, right now it is done
> >>in container's attach_group(). So later we can hotplug any host PCI device
> >>to a container - it will program same TCE table to every new group in the
> >>container.
> >
> >So you "create" once, then "set" it to one or more table_groups? It
> >seems odd that "create" is a table_group callback in that case.
>
>
> Where else could it be? ppc_md? We are getting rid of these. Some global
> function? We do not want VFIO to know about this. I've run out of ideas here.

Yeah, I guess it has to be in table_group, despite the oddness. I
guess the point is that it's the first group that determines the type
of IOMMU you're using for this container. IIRC you already check on
set_window that any additional groups have a compatible
(i.e. identical) IOMMU.
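
Something along these lines is what I have in mind for that attach-time
check (a hypothetical helper, only comparing the properties the patch
already exposes in iommu_table_group):

    /* Hypothetical: later groups must match the first attached group */
    static bool table_group_is_compatible(struct iommu_table_group *first,
            struct iommu_table_group *new)
    {
        return (first->ops == new->ops) &&
            (first->tce32_start == new->tce32_start) &&
            (first->tce32_size == new->tce32_size) &&
            (first->pgsizes == new->pgsizes) &&
            (first->max_dynamic_windows_supported ==
                new->max_dynamic_windows_supported) &&
            (first->max_levels == new->max_levels);
    }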

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-05-01 03:56:03

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 28/32] powerpc/mmu: Add userspace-to-physical addresses translation cache

On Thu, Apr 30, 2015 at 06:25:25PM +1000, Paul Mackerras wrote:
> On Thu, Apr 30, 2015 at 04:34:55PM +1000, David Gibson wrote:
> > On Sat, Apr 25, 2015 at 10:14:52PM +1000, Alexey Kardashevskiy wrote:
> > > We are adding support for DMA memory pre-registration to be used in
> > > conjunction with VFIO. The idea is that the userspace which is going to
> > > run a guest may want to pre-register a user space memory region so
> > > it all gets pinned once and never goes away. Having this done,
> > > a hypervisor will not have to pin/unpin pages on every DMA map/unmap
> > > request. This is going to help with multiple pinning of the same memory
> > > and in-kernel acceleration of DMA requests.
> > >
> > > This adds a list of memory regions to mm_context_t. Each region consists
> > > of a header and a list of physical addresses. This adds API to:
> > > 1. register/unregister memory regions;
> > > 2. do final cleanup (which puts all pre-registered pages);
> > > 3. do userspace to physical address translation;
> > > 4. manage a mapped pages counter; when it is zero, it is safe to
> > > unregister the region.
> > >
> > > Multiple registration of the same region is allowed, kref is used to
> > > track the number of registrations.
> >
> > [snip]
> > > +long mm_iommu_alloc(unsigned long ua, unsigned long entries,
> > > + struct mm_iommu_table_group_mem_t **pmem)
> > > +{
> > > + struct mm_iommu_table_group_mem_t *mem;
> > > + long i, j;
> > > + struct page *page = NULL;
> > > +
> > > + list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
> > > + next) {
> > > + if ((mem->ua == ua) && (mem->entries == entries))
> > > + return -EBUSY;
> > > +
> > > + /* Overlap? */
> > > + if ((mem->ua < (ua + (entries << PAGE_SHIFT))) &&
> > > + (ua < (mem->ua + (mem->entries << PAGE_SHIFT))))
> > > + return -EINVAL;
> > > + }
> > > +
> > > + mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> > > + if (!mem)
> > > + return -ENOMEM;
> > > +
> > > + mem->hpas = vzalloc(entries * sizeof(mem->hpas[0]));
> > > + if (!mem->hpas) {
> > > + kfree(mem);
> > > + return -ENOMEM;
> > > + }
> >
> > So, I've thought more about this and I'm really confused as to what
> > this is supposed to be accomplishing.
> >
> > I see that you need to keep track of what regions are registered, so
> > you don't double lock or unlock, but I don't see what the point of
> > actualy storing the translations in hpas is.
> >
> > I had assumed it was so that you could later on get to the
> > translations in real mode when you do in-kernel acceleration. But
> > that doesn't make sense, because the array is vmalloc()ed, so can't be
> > accessed in real mode anyway.
>
> We can access vmalloc'd arrays in real mode using real_vmalloc_addr().

Ah, ok.

> > I can't think of a circumstance in which you can use hpas where you
> > couldn't just walk the page tables anyway.
>
> The problem with walking the page tables is that there is no guarantee
> that the page you find that way is the page that was returned by the
> gup_fast() we did earlier. Storing the hpas means that we know for
> sure that the page we're doing DMA to is one that we have an elevated
> page count on.
>
> Also, there are various points where a Linux PTE is made temporarily
> invalid for a short time. If we happened to do a H_PUT_TCE on one cpu
> while another cpu was doing that, we'd get a spurious failure returned
> by the H_PUT_TCE.

I think we want this explanation in the commit message. And/or in a
comment somewhere, I'm not sure.
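
Something like the following sketch (illustrative names only, using the
real_vmalloc_addr() helper Paul mentions) is presumably what the
real-mode H_PUT_TCE path would do with the stored hpas:

    /* Sketch: real-mode ua->hpa lookup over a preregistered region */
    static long mem_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
            unsigned long ua, unsigned long *hpa)
    {
        unsigned long entry, *pa;

        if (ua < mem->ua ||
            ua >= mem->ua + (mem->entries << PAGE_SHIFT))
            return -EFAULT;

        entry = (ua - mem->ua) >> PAGE_SHIFT;

        /* hpas[] is vmalloc'd, translate the pointer for real mode */
        pa = (unsigned long *) real_vmalloc_addr(&mem->hpas[entry]);
        if (!pa)
            return -EFAULT;

        /* The stored page came from gup_fast(), so it is pinned */
        *hpa = *pa | (ua & ~PAGE_MASK);
        return 0;
    }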

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-05-01 04:01:29

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table

On 04/29/2015 04:31 PM, David Gibson wrote:
> On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote:
>> In order to support memory pre-registration, we need a way to track
>> the use of every registered memory region and only allow unregistration
>> if a region is not in use anymore. So we need a way to tell which
>> region the just cleared TCE came from.
>>
>> This adds a userspace view of the TCE table into iommu_table struct.
>> It contains userspace address, one per TCE entry. The table is only
>> allocated when the ownership over an IOMMU group is taken which means
>> it is only used from outside of the powernv code (such as VFIO).
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>> Changes:
>> v9:
>> * fixed code flow in error cases added in v8
>>
>> v8:
>> * added ENOMEM on failed vzalloc()
>> ---
>> arch/powerpc/include/asm/iommu.h | 6 ++++++
>> arch/powerpc/kernel/iommu.c | 18 ++++++++++++++++++
>> arch/powerpc/platforms/powernv/pci-ioda.c | 22 ++++++++++++++++++++--
>> 3 files changed, 44 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>> index 7694546..1472de3 100644
>> --- a/arch/powerpc/include/asm/iommu.h
>> +++ b/arch/powerpc/include/asm/iommu.h
>> @@ -111,9 +111,15 @@ struct iommu_table {
>> unsigned long *it_map; /* A simple allocation bitmap for now */
>> unsigned long it_page_shift;/* table iommu page size */
>> struct iommu_table_group *it_table_group;
>> + unsigned long *it_userspace; /* userspace view of the table */
>
> A single unsigned long doesn't seem like enough.

Why single? This is an array.

> How do you know
> which process's address space this address refers to?

It is the current task. Multiple userspaces cannot use the same container/tables.



>> struct iommu_table_ops *it_ops;
>> };
>>
>> +#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
>> + ((tbl)->it_userspace ? \
>> + &((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \
>> + NULL)
>> +
>> /* Pure 2^n version of get_order */
>> static inline __attribute_const__
>> int get_iommu_order(unsigned long size, struct iommu_table *tbl)
>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>> index 2eaba0c..74a3f52 100644
>> --- a/arch/powerpc/kernel/iommu.c
>> +++ b/arch/powerpc/kernel/iommu.c
>> @@ -38,6 +38,7 @@
>> #include <linux/pci.h>
>> #include <linux/iommu.h>
>> #include <linux/sched.h>
>> +#include <linux/vmalloc.h>
>> #include <asm/io.h>
>> #include <asm/prom.h>
>> #include <asm/iommu.h>
>> @@ -739,6 +740,8 @@ void iommu_reset_table(struct iommu_table *tbl, const char *node_name)
>> free_pages((unsigned long) tbl->it_map, order);
>> }
>>
>> + WARN_ON(tbl->it_userspace);
>> +
>> memset(tbl, 0, sizeof(*tbl));
>> }
>>
>> @@ -1016,6 +1019,7 @@ int iommu_take_ownership(struct iommu_table *tbl)
>> {
>> unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
>> int ret = 0;
>> + unsigned long *uas;
>>
>> /*
>> * VFIO does not control TCE entries allocation and the guest
>> @@ -1027,6 +1031,10 @@ int iommu_take_ownership(struct iommu_table *tbl)
>> if (!tbl->it_ops->exchange)
>> return -EINVAL;
>>
>> + uas = vzalloc(sizeof(*uas) * tbl->it_size);
>> + if (!uas)
>> + return -ENOMEM;
>> +
>> spin_lock_irqsave(&tbl->large_pool.lock, flags);
>> for (i = 0; i < tbl->nr_pools; i++)
>> spin_lock(&tbl->pools[i].lock);
>> @@ -1044,6 +1052,13 @@ int iommu_take_ownership(struct iommu_table *tbl)
>> memset(tbl->it_map, 0xff, sz);
>> }
>>
>> + if (ret) {
>> + vfree(uas);
>> + } else {
>> + BUG_ON(tbl->it_userspace);
>> + tbl->it_userspace = uas;
>> + }
>> +
>> for (i = 0; i < tbl->nr_pools; i++)
>> spin_unlock(&tbl->pools[i].lock);
>> spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
>> @@ -1056,6 +1071,9 @@ void iommu_release_ownership(struct iommu_table *tbl)
>> {
>> unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
>>
>> + vfree(tbl->it_userspace);
>> + tbl->it_userspace = NULL;
>> +
>> spin_lock_irqsave(&tbl->large_pool.lock, flags);
>> for (i = 0; i < tbl->nr_pools; i++)
>> spin_lock(&tbl->pools[i].lock);
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 45bc131..e0be556 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -25,6 +25,7 @@
>> #include <linux/memblock.h>
>> #include <linux/iommu.h>
>> #include <linux/sizes.h>
>> +#include <linux/vmalloc.h>
>>
>> #include <asm/sections.h>
>> #include <asm/io.h>
>> @@ -1827,6 +1828,14 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
>> pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
>> }
>>
>> +void pnv_pci_ioda2_free_table(struct iommu_table *tbl)
>> +{
>> + vfree(tbl->it_userspace);
>> + tbl->it_userspace = NULL;
>> +
>> + pnv_pci_free_table(tbl);
>> +}
>> +
>> static struct iommu_table_ops pnv_ioda2_iommu_ops = {
>> .set = pnv_ioda2_tce_build,
>> #ifdef CONFIG_IOMMU_API
>> @@ -1834,7 +1843,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
>> #endif
>> .clear = pnv_ioda2_tce_free,
>> .get = pnv_tce_get,
>> - .free = pnv_pci_free_table,
>> + .free = pnv_pci_ioda2_free_table,
>> };
>>
>> static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
>> @@ -2062,12 +2071,21 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
>> int nid = pe->phb->hose->node;
>> __u64 bus_offset = num ? pe->tce_bypass_base : 0;
>> long ret;
>> + unsigned long *uas, uas_cb = sizeof(*uas) * (window_size >> page_shift);
>> +
>> + uas = vzalloc(uas_cb);
>> + if (!uas)
>> + return -ENOMEM;
>
> I don't see why this is allocated both here as well as in
> take_ownership.

Where else? The only alternative is vfio_iommu_spapr_tce but I really do
not want to touch iommu_table fields there.


> Isn't this function used for core-kernel users of the
> iommu as well, in which case it shouldn't need the it_userspace.


No. This is an iommu_table_group_ops callback which calls what the platform
code calls (pnv_pci_create_table()) plus allocates this it_userspace thing.
The callback is only called from VFIO.


>
>> ret = pnv_pci_create_table(table_group, nid, bus_offset, page_shift,
>> window_size, levels, tbl);
>> - if (ret)
>> + if (ret) {
>> + vfree(uas);
>> return ret;
>> + }
>>
>> + BUG_ON(tbl->it_userspace);
>> + tbl->it_userspace = uas;
>> tbl->it_ops = &pnv_ioda2_iommu_ops;
>> if (pe->tce_inval_reg)
>> tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
>


--
Alexey

2015-05-01 04:11:12

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

On 04/29/2015 04:40 PM, David Gibson wrote:
> On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote:
>> This adds a way for the IOMMU user to know how much a new table will
>> use so it can be accounted in the locked_vm limit before allocation
>> happens.
>>
>> This stores the allocated table size in pnv_pci_create_table()
>> so the locked_vm counter can be updated correctly when a table is
>> being disposed.
>>
>> This defines an iommu_table_group_ops callback to let VFIO know
>> how much memory will be locked if a table is created.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>> Changes:
>> v9:
>> * reimplemented the whole patch
>> ---
>> arch/powerpc/include/asm/iommu.h | 5 +++++
>> arch/powerpc/platforms/powernv/pci-ioda.c | 14 ++++++++++++
>> arch/powerpc/platforms/powernv/pci.c | 36 +++++++++++++++++++++++++++++++
>> arch/powerpc/platforms/powernv/pci.h | 2 ++
>> 4 files changed, 57 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>> index 1472de3..9844c106 100644
>> --- a/arch/powerpc/include/asm/iommu.h
>> +++ b/arch/powerpc/include/asm/iommu.h
>> @@ -99,6 +99,7 @@ struct iommu_table {
>> unsigned long it_size; /* Size of iommu table in entries */
>> unsigned long it_indirect_levels;
>> unsigned long it_level_size;
>> + unsigned long it_allocated_size;
>> unsigned long it_offset; /* Offset into global table */
>> unsigned long it_base; /* mapped address of tce table */
>> unsigned long it_index; /* which iommu table this is */
>> @@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>> struct iommu_table_group;
>>
>> struct iommu_table_group_ops {
>> + unsigned long (*get_table_size)(
>> + __u32 page_shift,
>> + __u64 window_size,
>> + __u32 levels);
>> long (*create_table)(struct iommu_table_group *table_group,
>> int num,
>> __u32 page_shift,
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index e0be556..7f548b4 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
>> }
>>
>> #ifdef CONFIG_IOMMU_API
>> +static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
>> + __u64 window_size, __u32 levels)
>> +{
>> + unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
>> +
>> + if (!ret)
>> + return ret;
>> +
>> + /* Add size of it_userspace */
>> + return ret + (window_size >> page_shift) * sizeof(unsigned long);
>
> This doesn't make much sense. The userspace view can't possibly be a
> property of the specific low-level IOMMU model.


This it_userspace thing is all about memory preregistration.

I need some way to track how many actual mappings the
mm_iommu_table_group_mem_t has in order to decide whether to allow
unregistering or not.

When I clear a TCE, I can read the old value, which is a host physical
address; I cannot use that to find the preregistered region and adjust the
mappings counter. I can only use userspace addresses for this (not even
guest physical addresses, as this is VFIO and there is probably no KVM).

So I have to keep userspace addresses somewhere, one per IOMMU page, and
the iommu_table seems a natural place for this.
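
Roughly this is what the clear path needs the saved address for (an
illustrative sketch only; mm_iommu_lookup()/mm_iommu_mapped_dec() stand
for the lookup and counter helpers from the preregistration patch, the
exact names may differ):

    static void tce_iommu_unuse_page_v2(struct iommu_table *tbl,
            unsigned long entry)
    {
        unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
        struct mm_iommu_table_group_mem_t *mem;

        if (!pua || !*pua)
            return;

        /* The saved userspace address finds the preregistered region... */
        mem = mm_iommu_lookup(*pua, 1ULL << tbl->it_page_shift);
        if (mem)
            /* ...so its "mapped" counter can be dropped */
            mm_iommu_mapped_dec(mem);

        *pua = 0;
    }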





>
>> +}
>> +
>> static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
>> int num, __u32 page_shift, __u64 window_size, __u32 levels,
>> struct iommu_table *tbl)
>> @@ -2086,6 +2098,7 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
>>
>> BUG_ON(tbl->it_userspace);
>> tbl->it_userspace = uas;
>> + tbl->it_allocated_size += uas_cb;
>> tbl->it_ops = &pnv_ioda2_iommu_ops;
>> if (pe->tce_inval_reg)
>> tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
>> @@ -2160,6 +2173,7 @@ static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
>> }
>>
>> static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
>> + .get_table_size = pnv_pci_ioda2_get_table_size,
>> .create_table = pnv_pci_ioda2_create_table,
>> .set_window = pnv_pci_ioda2_set_window,
>> .unset_window = pnv_pci_ioda2_unset_window,
>> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
>> index fc129c4..1b5b48a 100644
>> --- a/arch/powerpc/platforms/powernv/pci.c
>> +++ b/arch/powerpc/platforms/powernv/pci.c
>> @@ -662,6 +662,38 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
>> tbl->it_type = TCE_PCI;
>> }
>>
>> +unsigned long pnv_get_table_size(__u32 page_shift,
>> + __u64 window_size, __u32 levels)
>> +{
>> + unsigned long bytes = 0;
>> + const unsigned window_shift = ilog2(window_size);
>> + unsigned entries_shift = window_shift - page_shift;
>> + unsigned table_shift = entries_shift + 3;
>> + unsigned long tce_table_size = max(0x1000UL, 1UL << table_shift);
>> + unsigned long direct_table_size;
>> +
>> + if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS) ||
>> + (window_size > memory_hotplug_max()) ||
>> + !is_power_of_2(window_size))
>> + return 0;
>> +
>> + /* Calculate a direct table size from window_size and levels */
>> + entries_shift = ROUND_UP(entries_shift, levels) / levels;
>> + table_shift = entries_shift + 3;
>> + table_shift = max_t(unsigned, table_shift, PAGE_SHIFT);
>> + direct_table_size = 1UL << table_shift;
>> +
>> + for ( ; levels; --levels) {
>> + bytes += ROUND_UP(tce_table_size, direct_table_size);
>> +
>> + tce_table_size /= direct_table_size;
>> + tce_table_size <<= 3;
>> + tce_table_size = ROUND_UP(tce_table_size, direct_table_size);
>> + }
>> +
>> + return bytes;
>> +}
>> +
>> static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift,
>> unsigned levels, unsigned long limit,
>> unsigned long *tce_table_allocated)
>> @@ -741,6 +773,10 @@ long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
>> return -ENOMEM;
>> }
>>
>> + tbl->it_allocated_size = pnv_get_table_size(page_shift, window_size,
>> + levels);
>> + WARN_ON(!tbl->it_allocated_size);
>> +
>> /* Setup linux iommu table */
>> pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, bus_offset,
>> page_shift);
>> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>> index 3d1ff584..ce4bc3c 100644
>> --- a/arch/powerpc/platforms/powernv/pci.h
>> +++ b/arch/powerpc/platforms/powernv/pci.h
>> @@ -224,6 +224,8 @@ extern long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
>> __u64 bus_offset, __u32 page_shift, __u64 window_size,
>> __u32 levels, struct iommu_table *tbl);
>> extern void pnv_pci_free_table(struct iommu_table *tbl);
>> +extern unsigned long pnv_get_table_size(__u32 page_shift,
>> + __u64 window_size, __u32 levels);
>> extern void pnv_pci_init_p5ioc2_hub(struct device_node *np);
>> extern void pnv_pci_init_ioda_hub(struct device_node *np);
>> extern void pnv_pci_init_ioda2_phb(struct device_node *np);
>


--
Alexey

2015-05-01 04:54:03

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table

On Fri, May 01, 2015 at 02:01:17PM +1000, Alexey Kardashevskiy wrote:
> On 04/29/2015 04:31 PM, David Gibson wrote:
> >On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote:
> >>In order to support memory pre-registration, we need a way to track
> >>the use of every registered memory region and only allow unregistration
> >>if a region is not in use anymore. So we need a way to tell which
> >>region the just cleared TCE came from.
> >>
> >>This adds a userspace view of the TCE table into iommu_table struct.
> >>It contains userspace address, one per TCE entry. The table is only
> >>allocated when the ownership over an IOMMU group is taken which means
> >>it is only used from outside of the powernv code (such as VFIO).
> >>
> >>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>---
> >>Changes:
> >>v9:
> >>* fixed code flow in error cases added in v8
> >>
> >>v8:
> >>* added ENOMEM on failed vzalloc()
> >>---
> >> arch/powerpc/include/asm/iommu.h | 6 ++++++
> >> arch/powerpc/kernel/iommu.c | 18 ++++++++++++++++++
> >> arch/powerpc/platforms/powernv/pci-ioda.c | 22 ++++++++++++++++++++--
> >> 3 files changed, 44 insertions(+), 2 deletions(-)
> >>
> >>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>index 7694546..1472de3 100644
> >>--- a/arch/powerpc/include/asm/iommu.h
> >>+++ b/arch/powerpc/include/asm/iommu.h
> >>@@ -111,9 +111,15 @@ struct iommu_table {
> >> unsigned long *it_map; /* A simple allocation bitmap for now */
> >> unsigned long it_page_shift;/* table iommu page size */
> >> struct iommu_table_group *it_table_group;
> >>+ unsigned long *it_userspace; /* userspace view of the table */
> >
> >A single unsigned long doesn't seem like enough.
>
> Why single? This is an array.

As in single per page.

> > How do you know
> >which process's address space this address refers to?
>
> It is the current task. Multiple userspaces cannot use the same container/tables.

Where is that enforced?

More to the point, that's a VFIO constraint, but it's here affecting
the design of a structure owned by the platform code.

[snip]
> >> static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
> >>@@ -2062,12 +2071,21 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
> >> int nid = pe->phb->hose->node;
> >> __u64 bus_offset = num ? pe->tce_bypass_base : 0;
> >> long ret;
> >>+ unsigned long *uas, uas_cb = sizeof(*uas) * (window_size >> page_shift);
> >>+
> >>+ uas = vzalloc(uas_cb);
> >>+ if (!uas)
> >>+ return -ENOMEM;
> >
> >I don't see why this is allocated both here as well as in
> >take_ownership.
>
> Where else? The only alternative is vfio_iommu_spapr_tce but I really do not
> want to touch iommu_table fields there.

Well, to put it another way, why isn't take_ownership calling create
itself (or at least a common helper)?

Clearly the it_userspace table needs to have a lifetime which matches
the TCE table itself, so there should be a single function that marks
the beginning of that joint lifetime.
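
i.e. something like this pair of helpers (hypothetical names; they assume
the table geometry, it_size, is already set up), called from
create_table()/take_ownership() and free()/release_ownership()
respectively, instead of open-coding the vzalloc()/vfree() in two places:

    /* Hypothetical common helpers tying it_userspace to the table */
    static long pnv_pci_ioda2_alloc_userspace(struct iommu_table *tbl)
    {
        unsigned long *uas = vzalloc(sizeof(*uas) * tbl->it_size);

        if (!uas)
            return -ENOMEM;

        BUG_ON(tbl->it_userspace);
        tbl->it_userspace = uas;

        return 0;
    }

    static void pnv_pci_ioda2_free_userspace(struct iommu_table *tbl)
    {
        vfree(tbl->it_userspace);
        tbl->it_userspace = NULL;
    }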

> >Isn't this function used for core-kernel users of the
> >iommu as well, in which case it shouldn't need the it_userspace.
>
>
> No. This is an iommu_table_group_ops callback which calls what the platform
> code calls (pnv_pci_create_table()) plus allocates this it_userspace thing.
> The callback is only called from VFIO.

Ok.

As touched on above, it seems more like this should be owned by VFIO
code than the platform code.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-05-01 04:54:07

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible

On Thu, Apr 30, 2015 at 07:33:09PM +1000, Alexey Kardashevskiy wrote:
> On 04/30/2015 05:22 PM, David Gibson wrote:
> >On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote:
> >>At the moment only one group per container is supported.
> >>POWER8 CPUs have a more flexible design and allow having 2 TCE tables per
> >>IOMMU group so we can relax this limitation and support multiple groups
> >>per container.
> >
> >It's not obvious why allowing multiple TCE tables per PE has any
> >bearing on allowing multiple groups per container.
>
>
> This patchset is a global TCE tables rework (patches 1..30, roughly) with 2
> outcomes:
> 1. reusing the same IOMMU table for multiple groups - patch 31;
> 2. allowing dynamic create/remove of IOMMU tables - patch 32.
>
> I can remove this one from the patchset and post it separately later but
> since 1..30 aim to support both 1) and 2), I'd think I better keep them all
> together (might explain some of changes I do in 1..30).

The combined patchset is fine. My comment is because your commit
message says that multiple groups are possible *because* 2 TCE tables
per group are allowed, and it's not at all clear why one follows from
the other.

> >>This adds TCE table descriptors to a container and uses iommu_table_group_ops
> >>to create/set DMA windows on IOMMU groups so the same TCE tables will be
> >>shared between several IOMMU groups.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>[aw: for the vfio related changes]
> >>Acked-by: Alex Williamson <[email protected]>
> >>---
> >>Changes:
> >>v7:
> >>* updated doc
> >>---
> >> Documentation/vfio.txt | 8 +-
> >> drivers/vfio/vfio_iommu_spapr_tce.c | 268 ++++++++++++++++++++++++++----------
> >> 2 files changed, 199 insertions(+), 77 deletions(-)
> >>
> >>diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> >>index 94328c8..7dcf2b5 100644
> >>--- a/Documentation/vfio.txt
> >>+++ b/Documentation/vfio.txt
> >>@@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
> >>
> >> This implementation has some specifics:
> >>
> >>-1) Only one IOMMU group per container is supported as an IOMMU group
> >>-represents the minimal entity which isolation can be guaranteed for and
> >>-groups are allocated statically, one per a Partitionable Endpoint (PE)
> >>+1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
> >>+container is supported as an IOMMU table is allocated at the boot time,
> >>+one table per a IOMMU group which is a Partitionable Endpoint (PE)
> >> (PE is often a PCI domain but not always).
> >
> >I thought the more fundamental problem was that different PEs tended
> >to use disjoint bus address ranges, so even by duplicating put_tce
> >across PEs you couldn't have a common address space.
>
>
> Sorry, I am not following you here.
>
> By duplicating put_tce, I can have multiple IOMMU groups on the same virtual
> PHB in QEMU, "[PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple groups
> per container" does this, the address ranges will be the same.

Oh, ok. For some reason I thought that (at least on the older
machines) the different PEs used different and not easily changeable
DMA windows in bus address space.

> What I cannot do on p5ioc2 is programming the same table to multiple
> physical PHBs (or I could but it is very different than IODA2 and pretty
> ugly and might not always be possible because I would have to allocate these
> pages from some common pool and face problems like fragmentation).

So allowing multiple groups per container should be possible (at the
kernel rather than qemu level) by writing the same value to multiple
TCE tables. I guess it's not worth doing for just the almost-obsolete
IOMMUs though.

>
>
>
> >>+Newer systems (POWER8 with IODA2) have improved hardware design which allows
> >>+to remove this limitation and have multiple IOMMU groups per a VFIO container.
> >>
> >> 2) The hardware supports so called DMA windows - the PCI address range
> >> within which DMA transfer is allowed, any attempt to access address space
> >>diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>index a7d6729..970e3a2 100644
> >>--- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >>+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>@@ -82,6 +82,11 @@ static void decrement_locked_vm(long npages)
> >> * into DMA'ble space using the IOMMU
> >> */
> >>
> >>+struct tce_iommu_group {
> >>+ struct list_head next;
> >>+ struct iommu_group *grp;
> >>+};
> >>+
> >> /*
> >> * The container descriptor supports only a single group per container.
> >> * Required by the API as the container is not supplied with the IOMMU group
> >>@@ -89,10 +94,11 @@ static void decrement_locked_vm(long npages)
> >> */
> >> struct tce_container {
> >> struct mutex lock;
> >>- struct iommu_group *grp;
> >> bool enabled;
> >> unsigned long locked_pages;
> >> bool v2;
> >>+ struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
> >
> >Hrm, so here we have more copies of the full iommu_table structures,
> >which again muddies the lifetime. The table_group pointer is
> >presumably meaningless in these copies, which seems dangerously
> >confusing.
>
>
> Ouch. This is bad. No, table_group is not pointless here as it is used to
> get to the PE number to invalidate TCE cache. I just realized although I
> need to update just a single table, I still have to invalidate TCE cache for
> every attached group/PE so I need a list of iommu_table_group's here, not a
> single pointer...

Right.

> >>+ struct list_head group_list;
> >> };
> >>
> >> static long tce_unregister_pages(struct tce_container *container,
> >>@@ -154,20 +160,20 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift)
> >> return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift;
> >> }
> >>
> >>+static inline bool tce_groups_attached(struct tce_container *container)
> >>+{
> >>+ return !list_empty(&container->group_list);
> >>+}
> >>+
> >> static struct iommu_table *spapr_tce_find_table(
> >> struct tce_container *container,
> >> phys_addr_t ioba)
> >> {
> >> long i;
> >> struct iommu_table *ret = NULL;
> >>- struct iommu_table_group *table_group;
> >>-
> >>- table_group = iommu_group_get_iommudata(container->grp);
> >>- if (!table_group)
> >>- return NULL;
> >>
> >> for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> >>- struct iommu_table *tbl = &table_group->tables[i];
> >>+ struct iommu_table *tbl = &container->tables[i];
> >> unsigned long entry = ioba >> tbl->it_page_shift;
> >> unsigned long start = tbl->it_offset;
> >> unsigned long end = start + tbl->it_size;
> >>@@ -186,9 +192,7 @@ static int tce_iommu_enable(struct tce_container *container)
> >> int ret = 0;
> >> unsigned long locked;
> >> struct iommu_table_group *table_group;
> >>-
> >>- if (!container->grp)
> >>- return -ENXIO;
> >>+ struct tce_iommu_group *tcegrp;
> >>
> >> if (!current->mm)
> >> return -ESRCH; /* process exited */
> >>@@ -225,7 +229,12 @@ static int tce_iommu_enable(struct tce_container *container)
> >> * as there is no way to know how much we should increment
> >> * the locked_vm counter.
> >> */
> >>- table_group = iommu_group_get_iommudata(container->grp);
> >>+ if (!tce_groups_attached(container))
> >>+ return -ENODEV;
> >>+
> >>+ tcegrp = list_first_entry(&container->group_list,
> >>+ struct tce_iommu_group, next);
> >>+ table_group = iommu_group_get_iommudata(tcegrp->grp);
> >> if (!table_group)
> >> return -ENODEV;
> >>
> >>@@ -257,6 +266,48 @@ static void tce_iommu_disable(struct tce_container *container)
> >> decrement_locked_vm(container->locked_pages);
> >> }
> >>
> >>+static long tce_iommu_create_table(struct iommu_table_group *table_group,
> >>+ int num,
> >>+ __u32 page_shift,
> >>+ __u64 window_size,
> >>+ __u32 levels,
> >>+ struct iommu_table *tbl)
> >
> >With multiple groups (and therefore PEs) per container, this seems
> >wrong. There's only one table_group per PE, so what's special about
> >PE whose table group is passed in here.
>
>
> The created table is allocated at the same node as table_group
> (pe->phb->hose->node). This does not make much sense if we put multiple
> groups to the same container but we will recommend people to avoid putting
> groups from different NUMA nodes to the same container.

Ok. I guess the point is that the first PE attached has a special
bearing on how the table is configured, and that's what the
table_group here represents.

> Also, the allocated table gets bus offset initialized in create_table()
> (which is IODA2-specific knowledge). It is there to emphasize the fact that
> we do not get to choose where to map the window on a bus, it is hardcoded
> and easier to deal with the tables which have offset set once - I could add
> a bus_offset parameter to set_window() but it would be converted back to the
> window number.
>
>
>
> >>+{
> >>+ long ret, table_size;
> >>+
> >>+ table_size = table_group->ops->get_table_size(page_shift, window_size,
> >>+ levels);
> >>+ if (!table_size)
> >>+ return -EINVAL;
> >>+
> >>+ ret = try_increment_locked_vm(table_size >> PAGE_SHIFT);
> >>+ if (ret)
> >>+ return ret;
> >>+
> >>+ ret = table_group->ops->create_table(table_group, num,
> >>+ page_shift, window_size, levels, tbl);
> >>+
> >>+ WARN_ON(!ret && !tbl->it_ops->free);
> >>+ WARN_ON(!ret && (tbl->it_allocated_size != table_size));
> >>+
> >>+ if (ret)
> >>+ decrement_locked_vm(table_size >> PAGE_SHIFT);
> >>+
> >>+ return ret;
> >>+}
> >>+
> >>+static void tce_iommu_free_table(struct iommu_table *tbl)
> >>+{
> >>+ unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
> >>+
> >>+ if (!tbl->it_size)
> >>+ return;
> >>+
> >>+ tbl->it_ops->free(tbl);
> >
> >So, this is exactly the case where the lifetimes are badly confusing.
> >How can you be confident here that another copy of the iommu_table
> >struct isn't referencing the same TCE tables?
>
>
> Create/remove window is handled by a single file driver. It is not like there
> are many of these tables. But yes, valid point.
>
>
>
> >>+ decrement_locked_vm(pages);
> >>+ memset(tbl, 0, sizeof(*tbl));
> >>+}
> >>+
> >> static void *tce_iommu_open(unsigned long arg)
> >> {
> >> struct tce_container *container;
> >>@@ -271,19 +322,41 @@ static void *tce_iommu_open(unsigned long arg)
> >> return ERR_PTR(-ENOMEM);
> >>
> >> mutex_init(&container->lock);
> >>+ INIT_LIST_HEAD_RCU(&container->group_list);
> >
> >I see no other mentions of rcu related to this list, which doesn't
> >seem right.
> >
> >> container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
> >>
> >> return container;
> >> }
> >>
> >>+static int tce_iommu_clear(struct tce_container *container,
> >>+ struct iommu_table *tbl,
> >>+ unsigned long entry, unsigned long pages);
> >>+
> >> static void tce_iommu_release(void *iommu_data)
> >> {
> >> struct tce_container *container = iommu_data;
> >>+ struct iommu_table_group *table_group;
> >>+ struct tce_iommu_group *tcegrp;
> >>+ long i;
> >>
> >>- WARN_ON(container->grp);
> >>+ while (tce_groups_attached(container)) {
> >>+ tcegrp = list_first_entry(&container->group_list,
> >>+ struct tce_iommu_group, next);
> >>+ table_group = iommu_group_get_iommudata(tcegrp->grp);
> >>+ tce_iommu_detach_group(iommu_data, tcegrp->grp);
> >>+ }
> >>
> >>- if (container->grp)
> >>- tce_iommu_detach_group(iommu_data, container->grp);
> >>+ /*
> >>+ * If VFIO created a table, it was not disposed
> >>+ * by tce_iommu_detach_group() so do it now.
> >>+ */
> >>+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> >>+ struct iommu_table *tbl = &container->tables[i];
> >>+
> >>+ tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
> >>+ tce_iommu_free_table(tbl);
> >>+ }
> >>
> >> tce_iommu_disable(container);
> >> mutex_destroy(&container->lock);
> >>@@ -509,12 +582,15 @@ static long tce_iommu_ioctl(void *iommu_data,
> >>
> >> case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
> >> struct vfio_iommu_spapr_tce_info info;
> >>+ struct tce_iommu_group *tcegrp;
> >> struct iommu_table_group *table_group;
> >>
> >>- if (WARN_ON(!container->grp))
> >>+ if (!tce_groups_attached(container))
> >> return -ENXIO;
> >>
> >>- table_group = iommu_group_get_iommudata(container->grp);
> >>+ tcegrp = list_first_entry(&container->group_list,
> >>+ struct tce_iommu_group, next);
> >>+ table_group = iommu_group_get_iommudata(tcegrp->grp);
> >>
> >> if (!table_group)
> >> return -ENXIO;
> >>@@ -707,12 +783,20 @@ static long tce_iommu_ioctl(void *iommu_data,
> >> tce_iommu_disable(container);
> >> mutex_unlock(&container->lock);
> >> return 0;
> >>- case VFIO_EEH_PE_OP:
> >>- if (!container->grp)
> >>- return -ENODEV;
> >>
> >>- return vfio_spapr_iommu_eeh_ioctl(container->grp,
> >>- cmd, arg);
> >>+ case VFIO_EEH_PE_OP: {
> >>+ struct tce_iommu_group *tcegrp;
> >>+
> >>+ ret = 0;
> >>+ list_for_each_entry(tcegrp, &container->group_list, next) {
> >>+ ret = vfio_spapr_iommu_eeh_ioctl(tcegrp->grp,
> >>+ cmd, arg);
> >>+ if (ret)
> >>+ return ret;
> >
> >Hrm. It occurs to me that EEH may need a way of referencing
> >individual groups. Even if multiple PEs are referencing the same TCE
> >tables, presumably EEH will isolate them individually.
>
>
> Well. I asked our EEH guy Gavin, he did not object to this change but I'll
> double check :)
>
>
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-05-01 04:35:34

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2

On 04/30/2015 04:55 PM, David Gibson wrote:
> On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:
>> The existing implementation accounts the whole DMA window in
>> the locked_vm counter. This is going to be worse with multiple
>> containers and huge DMA windows. Also, real-time accounting would require
>> additional tracking of accounted pages due to the page size difference -
>> IOMMU uses 4K pages and system uses 4K or 64K pages.
>>
>> Another issue is that actual pages pinning/unpinning happens on every
>> DMA map/unmap request. This does not affect the performance much now as
>> we spend way too much time now on switching context between
>> guest/userspace/host but this will start to matter when we add in-kernel
>> DMA map/unmap acceleration.
>>
>> This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
>> New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
>> 2 new ioctls to register/unregister DMA memory -
>> VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
>> which receive user space address and size of a memory region which
>> needs to be pinned/unpinned and counted in locked_vm.
>> New IOMMU splits physical pages pinning and TCE table update into 2 different
>> operations. It requires 1) guest pages to be registered first 2) consequent
>> map/unmap requests to work only with pre-registered memory.
>> For the default single window case this means that the entire guest
>> (instead of 2GB) needs to be pinned before using VFIO.
>> When a huge DMA window is added, no additional pinning will be
>> required, otherwise it would be guest RAM + 2GB.
>>
>> The new memory registration ioctls are not supported by
>> VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
>> will require memory to be preregistered in order to work.
>>
>> The accounting is done per the user process.
>>
>> This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
>> can do with v1 or v2 IOMMUs.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> [aw: for the vfio related changes]
>> Acked-by: Alex Williamson <[email protected]>
>> ---
>> Changes:
>> v9:
>> * s/tce_get_hva_cached/tce_iommu_use_page_v2/
>>
>> v7:
>> * now memory is registered per mm (i.e. process)
>> * moved memory registration code to powerpc/mmu
>> * merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
>> * limited new ioctls to v2 IOMMU
>> * updated doc
>> * unsupported ioclts return -ENOTTY instead of -EPERM
>>
>> v6:
>> * tce_get_hva_cached() returns hva via a pointer
>>
>> v4:
>> * updated docs
>> * s/kzmalloc/vzalloc/
>> * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
>> replaced offset with index
>> * renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
>> and removed duplicating vfio_iommu_spapr_register_memory
>> ---
>> Documentation/vfio.txt | 23 ++++
>> drivers/vfio/vfio_iommu_spapr_tce.c | 230 +++++++++++++++++++++++++++++++++++-
>> include/uapi/linux/vfio.h | 27 +++++
>> 3 files changed, 274 insertions(+), 6 deletions(-)
>>
>> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
>> index 96978ec..94328c8 100644
>> --- a/Documentation/vfio.txt
>> +++ b/Documentation/vfio.txt
>> @@ -427,6 +427,29 @@ The code flow from the example above should be slightly changed:
>>
>> ....
>>
>> +5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
>> +VFIO_IOMMU_DISABLE and implements 2 new ioctls:
>> +VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
>> +(which are unsupported in v1 IOMMU).
>
> A summary of the semantic differeces between v1 and v2 would be nice.
> At this point it's not really clear to me if there's a case for
> creating v2, or if this could just be done by adding (optional)
> functionality to v1.

v1: memory preregistration is not supported; explicit enable/disable
ioctls are required.

v2: memory preregistration is required; explicit enable/disable are
prohibited (as they are not needed).

Mixing these in one IOMMU type caused a lot of problems, like: should I
increment locked_vm by the 32bit window size on enable() or not; what do
I do about page pinning on map/unmap (check whether the page is from
registered memory and skip pinning?).

Having 2 IOMMU models makes everything a lot simpler.
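
For illustration only, a rough userspace sketch of the two flows
(container_fd, buf, buf_size, map/unmap are hypothetical and error
handling is omitted; only the ioctl and struct names are the real ones):

#include <sys/ioctl.h>
#include <linux/vfio.h>

/* v1: no preregistration; explicit enable/disable, pinning on every map */
ioctl(container_fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
ioctl(container_fd, VFIO_IOMMU_ENABLE);		/* accounts the whole window */
ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);	/* pins pages here */
/* ... */
ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
ioctl(container_fd, VFIO_IOMMU_DISABLE);

/* v2: preregistration required; enable/disable are not used */
struct vfio_iommu_spapr_register_memory reg = {
	.argsz = sizeof(reg),
	.flags = 0,
	.vaddr = (__u64)(unsigned long)buf,	/* page aligned */
	.size = buf_size,			/* page aligned */
};
ioctl(container_fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_v2_IOMMU);
ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg); /* pin + locked_vm once */
ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);	/* TCE update only, no pinning */
/* ... */
ioctl(container_fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);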


>> +PPC64 paravirtualized guests generate a lot of map/unmap requests,
>> +and the handling of those includes pinning/unpinning pages and updating
>> +mm::locked_vm counter to make sure we do not exceed the rlimit.
>> +The v2 IOMMU splits accounting and pinning into separate operations:
>> +
>> +- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
>> +receive a user space address and size of the block to be pinned.
>> +Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
>> +be called with the exact address and size used for registering
>> +the memory block. The userspace is not expected to call these often.
>> +The ranges are stored in a linked list in a VFIO container.
>> +
>> +- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
>> +IOMMU table and do not do pinning; instead these check that the userspace
>> +address is from pre-registered range.
>> +
>> +This separation helps in optimizing DMA for guests.
>> +
>> -------------------------------------------------------------------------------
>>
>> [1] VFIO was originally an acronym for "Virtual Function I/O" in its
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index 892a584..4cfc2c1 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>
> So, from things you said at other points, I thought the idea was that
> this registration stuff could also be used on non-Power IOMMUs. Did I
> misunderstand, or is that a possibility for the future?


I never said a thing about non-PPC :) I seriously doubt any other arch has
this hypervisor interface with H_PUT_TCE (maybe s390? :) ); for the others
there is no profit from memory preregistration as they (at least x86) map
the entire guest before it starts, which essentially is the same
preregistration.


btw later we may want to implement a simple IOMMU v3 which will do pinning
+ locked_vm accounting at map time as x86 does, for http://dpdk.org/ -
these things do not really have to bother with preregistration (even if it
is just a single additional ioctl).



>> @@ -21,6 +21,7 @@
>> #include <linux/vfio.h>
>> #include <asm/iommu.h>
>> #include <asm/tce.h>
>> +#include <asm/mmu_context.h>
>>
>> #define DRIVER_VERSION "0.1"
>> #define DRIVER_AUTHOR "[email protected]"
>> @@ -91,8 +92,58 @@ struct tce_container {
>> struct iommu_group *grp;
>> bool enabled;
>> unsigned long locked_pages;
>> + bool v2;
>> };
>>
>> +static long tce_unregister_pages(struct tce_container *container,
>> + __u64 vaddr, __u64 size)
>> +{
>> + long ret;
>> + struct mm_iommu_table_group_mem_t *mem;
>> +
>> + if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
>> + return -EINVAL;
>> +
>> + mem = mm_iommu_get(vaddr, size >> PAGE_SHIFT);
>> + if (!mem)
>> + return -EINVAL;
>> +
>> + ret = mm_iommu_put(mem); /* undo kref_get() from mm_iommu_get() */
>> + if (!ret)
>> + ret = mm_iommu_put(mem);
>> +
>> + return ret;
>> +}
>> +
>> +static long tce_register_pages(struct tce_container *container,
>> + __u64 vaddr, __u64 size)
>> +{
>> + long ret = 0;
>> + struct mm_iommu_table_group_mem_t *mem;
>> + unsigned long entries = size >> PAGE_SHIFT;
>> +
>> + if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
>> + ((vaddr + size) < vaddr))
>> + return -EINVAL;
>> +
>> + mem = mm_iommu_get(vaddr, entries);
>> + if (!mem) {
>> + ret = try_increment_locked_vm(entries);
>> + if (ret)
>> + return ret;
>> +
>> + ret = mm_iommu_alloc(vaddr, entries, &mem);
>> + if (ret) {
>> + decrement_locked_vm(entries);
>> + return ret;
>> + }
>> + }
>> +
>> + container->enabled = true;
>> +
>> + return 0;
>> +}
>
> So requiring that registered regions get unregistered with exactly the
> same addr/length is reasonable. I'm a bit less convinced that
> disallowing overlaps is a good idea. What if two libraries in the
> same process are trying to use VFIO - they may not know if the regions
> they try to register are overlapping.


Sorry, I do not understand. A library allocates RAM and is expected to
register it via the additional ioctl, that's it. Another library allocates
another chunk of memory; the chunks won't overlap and neither will the
registered areas.


>> static bool tce_page_is_contained(struct page *page, unsigned page_shift)
>> {
>> /*
>> @@ -205,7 +256,7 @@ static void *tce_iommu_open(unsigned long arg)
>> {
>> struct tce_container *container;
>>
>> - if (arg != VFIO_SPAPR_TCE_IOMMU) {
>> + if ((arg != VFIO_SPAPR_TCE_IOMMU) && (arg != VFIO_SPAPR_TCE_v2_IOMMU)) {
>> pr_err("tce_vfio: Wrong IOMMU type\n");
>> return ERR_PTR(-EINVAL);
>> }
>> @@ -215,6 +266,7 @@ static void *tce_iommu_open(unsigned long arg)
>> return ERR_PTR(-ENOMEM);
>>
>> mutex_init(&container->lock);
>> + container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
>>
>> return container;
>> }
>> @@ -243,6 +295,47 @@ static void tce_iommu_unuse_page(struct tce_container *container,
>> put_page(page);
>> }
>>
>> +static int tce_iommu_use_page_v2(unsigned long tce, unsigned long size,
>> + unsigned long *phpa, struct mm_iommu_table_group_mem_t **pmem)


You suggested s/tce_get_hpa/tce_iommu_use_page/ but in this particular
patch it is confusing as tce_iommu_unuse_page_v2() calls it to find the
corresponding mm_iommu_table_group_mem_t by the userspace address of a
page that is no longer being used.

tce_iommu_use_page() (without v2) does actually use the page, but this one
I'll rename back to tce_iommu_ua_to_hpa_v2(), is that ok?


>> +{
>> + long ret = 0;
>> + struct mm_iommu_table_group_mem_t *mem;
>> +
>> + mem = mm_iommu_lookup(tce, size);
>> + if (!mem)
>> + return -EINVAL;
>> +
>> + ret = mm_iommu_ua_to_hpa(mem, tce, phpa);
>> + if (ret)
>> + return -EINVAL;
>> +
>> + *pmem = mem;
>> +
>> + return 0;
>> +}
>> +
>> +static void tce_iommu_unuse_page_v2(struct iommu_table *tbl,
>> + unsigned long entry)
>> +{
>> + struct mm_iommu_table_group_mem_t *mem = NULL;
>> + int ret;
>> + unsigned long hpa = 0;
>> + unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>> +
>> + if (!pua || !current || !current->mm)
>> + return;
>> +
>> + ret = tce_iommu_use_page_v2(*pua, IOMMU_PAGE_SIZE(tbl),
>> + &hpa, &mem);
>> + if (ret)
>> + pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
>> + __func__, *pua, entry, ret);
>> + if (mem)
>> + mm_iommu_mapped_update(mem, false);
>> +
>> + *pua = 0;
>> +}
>> +
>> static int tce_iommu_clear(struct tce_container *container,
>> struct iommu_table *tbl,
>> unsigned long entry, unsigned long pages)
>> @@ -261,6 +354,11 @@ static int tce_iommu_clear(struct tce_container *container,
>> if (direction == DMA_NONE)
>> continue;
>>
>> + if (container->v2) {
>> + tce_iommu_unuse_page_v2(tbl, entry);
>> + continue;
>> + }
>> +
>> tce_iommu_unuse_page(container, oldtce);
>> }
>>
>> @@ -327,6 +425,62 @@ static long tce_iommu_build(struct tce_container *container,
>> return ret;
>> }
>>
>> +static long tce_iommu_build_v2(struct tce_container *container,
>> + struct iommu_table *tbl,
>> + unsigned long entry, unsigned long tce, unsigned long pages,
>> + enum dma_data_direction direction)
>> +{
>> + long i, ret = 0;
>> + struct page *page;
>> + unsigned long hpa;
>> + enum dma_data_direction dirtmp;
>> +
>> + for (i = 0; i < pages; ++i) {
>> + struct mm_iommu_table_group_mem_t *mem = NULL;
>> + unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl,
>> + entry + i);
>> +
>> + ret = tce_iommu_use_page_v2(tce, IOMMU_PAGE_SIZE(tbl),
>> + &hpa, &mem);
>> + if (ret)
>> + break;
>> +
>> + page = pfn_to_page(hpa >> PAGE_SHIFT);
>> + if (!tce_page_is_contained(page, tbl->it_page_shift)) {
>> + ret = -EPERM;
>> + break;
>> + }
>> +
>> + /* Preserve offset within IOMMU page */
>> + hpa |= tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
>> + dirtmp = direction;
>> +
>> + ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp);
>> + if (ret) {
>> + /* dirtmp cannot be DMA_NONE here */
>> + tce_iommu_unuse_page_v2(tbl, entry + i);
>> + pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
>> + __func__, entry << tbl->it_page_shift,
>> + tce, ret);
>> + break;
>> + }
>> +
>> + mm_iommu_mapped_update(mem, true);
>> +
>> + if (dirtmp != DMA_NONE)
>> + tce_iommu_unuse_page_v2(tbl, entry + i);
>> +
>> + *pua = tce;
>> +
>> + tce += IOMMU_PAGE_SIZE(tbl);
>> + }
>> +
>> + if (ret)
>> + tce_iommu_clear(container, tbl, entry, i);
>> +
>> + return ret;
>> +}
>> +
>> static long tce_iommu_ioctl(void *iommu_data,
>> unsigned int cmd, unsigned long arg)
>> {
>> @@ -338,6 +492,7 @@ static long tce_iommu_ioctl(void *iommu_data,
>> case VFIO_CHECK_EXTENSION:
>> switch (arg) {
>> case VFIO_SPAPR_TCE_IOMMU:
>> + case VFIO_SPAPR_TCE_v2_IOMMU:
>> ret = 1;
>> break;
>> default:
>> @@ -425,11 +580,18 @@ static long tce_iommu_ioctl(void *iommu_data,
>> if (ret)
>> return ret;
>>
>> - ret = tce_iommu_build(container, tbl,
>> - param.iova >> tbl->it_page_shift,
>> - param.vaddr,
>> - param.size >> tbl->it_page_shift,
>> - direction);
>> + if (container->v2)
>> + ret = tce_iommu_build_v2(container, tbl,
>> + param.iova >> tbl->it_page_shift,
>> + param.vaddr,
>> + param.size >> tbl->it_page_shift,
>> + direction);
>> + else
>> + ret = tce_iommu_build(container, tbl,
>> + param.iova >> tbl->it_page_shift,
>> + param.vaddr,
>> + param.size >> tbl->it_page_shift,
>> + direction);
>>
>> iommu_flush_tce(tbl);
>>
>> @@ -474,7 +636,60 @@ static long tce_iommu_ioctl(void *iommu_data,
>>
>> return ret;
>> }
>> + case VFIO_IOMMU_SPAPR_REGISTER_MEMORY: {
>> + struct vfio_iommu_spapr_register_memory param;
>> +
>> + if (!container->v2)
>> + break;
>> +
>> + minsz = offsetofend(struct vfio_iommu_spapr_register_memory,
>> + size);
>> +
>> + if (copy_from_user(&param, (void __user *)arg, minsz))
>> + return -EFAULT;
>> +
>> + if (param.argsz < minsz)
>> + return -EINVAL;
>> +
>> + /* No flag is supported now */
>> + if (param.flags)
>> + return -EINVAL;
>> +
>> + mutex_lock(&container->lock);
>> + ret = tce_register_pages(container, param.vaddr, param.size);
>> + mutex_unlock(&container->lock);
>
> AFAICT, this is the only call to tce_register_pages(), so why not put
> the mutex into the function.

1) I can use "return" in tce_register_pages() instead of "goto
unlock_exit". Convenient.

2) I keep mutex_lock()/mutex_unlock() in the immediate
vfio_iommu_driver_ops callbacks (i.e. tce_iommu_ioctl,
tce_iommu_attach_group, tce_iommu_detach_group) and do not spread them all
over the file, which I find easier to track, no?


>> +
>> + return ret;
>> + }
>> + case VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY: {
>> + struct vfio_iommu_spapr_register_memory param;
>> +
>> + if (!container->v2)
>> + break;
>> +
>> + minsz = offsetofend(struct vfio_iommu_spapr_register_memory,
>> + size);
>> +
>> + if (copy_from_user(&param, (void __user *)arg, minsz))
>> + return -EFAULT;
>> +
>> + if (param.argsz < minsz)
>> + return -EINVAL;
>> +
>> + /* No flag is supported now */
>> + if (param.flags)
>> + return -EINVAL;
>> +
>> + mutex_lock(&container->lock);
>> + tce_unregister_pages(container, param.vaddr, param.size);
>> + mutex_unlock(&container->lock);
>> +
>> + return 0;
>> + }
>> case VFIO_IOMMU_ENABLE:
>> + if (container->v2)
>> + break;
>> +
>> mutex_lock(&container->lock);
>> ret = tce_iommu_enable(container);
>> mutex_unlock(&container->lock);
>> @@ -482,6 +697,9 @@ static long tce_iommu_ioctl(void *iommu_data,
>>
>>
>> case VFIO_IOMMU_DISABLE:
>> + if (container->v2)
>> + break;
>> +
>> mutex_lock(&container->lock);
>> tce_iommu_disable(container);
>> mutex_unlock(&container->lock);
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index b57b750..8fdcfb9 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -36,6 +36,8 @@
>> /* Two-stage IOMMU */
>> #define VFIO_TYPE1_NESTING_IOMMU 6 /* Implies v2 */
>>
>> +#define VFIO_SPAPR_TCE_v2_IOMMU 7
>> +
>> /*
>> * The IOCTL interface is designed for extensibility by embedding the
>> * structure length (argsz) and flags into structures passed between
>> @@ -495,6 +497,31 @@ struct vfio_eeh_pe_op {
>>
>> #define VFIO_EEH_PE_OP _IO(VFIO_TYPE, VFIO_BASE + 21)
>>
>> +/**
>> + * VFIO_IOMMU_SPAPR_REGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 17, struct vfio_iommu_spapr_register_memory)
>> + *
>> + * Registers user space memory where DMA is allowed. It pins
>> + * user pages and does the locked memory accounting so
>> + * subsequent VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA calls
>> + * get faster.
>> + */
>> +struct vfio_iommu_spapr_register_memory {
>> + __u32 argsz;
>> + __u32 flags;
>> + __u64 vaddr; /* Process virtual address */
>> + __u64 size; /* Size of mapping (bytes) */
>> +};
>> +#define VFIO_IOMMU_SPAPR_REGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 17)
>> +
>> +/**
>> + * VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 18, struct vfio_iommu_spapr_register_memory)
>> + *
>> + * Unregisters user space memory registered with
>> + * VFIO_IOMMU_SPAPR_REGISTER_MEMORY.
>> + * Uses vfio_iommu_spapr_register_memory for parameters.
>> + */
>> +#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 18)
>> +
>> /* ***************************************************************** */
>>
>> #endif /* _UAPIVFIO_H */
>


--
Alexey

2015-05-01 04:54:00

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible

On Fri, May 01, 2015 at 10:46:08AM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2015-04-30 at 19:33 +1000, Alexey Kardashevskiy wrote:
> > On 04/30/2015 05:22 PM, David Gibson wrote:
> > > On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote:
> > >> At the moment only one group per container is supported.
> > >> POWER8 CPUs have more flexible design and allows naving 2 TCE tables per
> > >> IOMMU group so we can relax this limitation and support multiple groups
> > >> per container.
> > >
> > > It's not obvious why allowing multiple TCE tables per PE has any
> > > pearing on allowing multiple groups per container.
> >
> >
> > This patchset is a global TCE tables rework (patches 1..30, roughly) with 2
> > outcomes:
> > 1. reusing the same IOMMU table for multiple groups - patch 31;
> > 2. allowing dynamic create/remove of IOMMU tables - patch 32.
> >
> > I can remove this one from the patchset and post it separately later but
> > since 1..30 aim to support both 1) and 2), I'd think I better keep them all
> > together (might explain some of changes I do in 1..30).
>
> I think you are talking past each other :-)
>
> But yes, having 2 tables per group is orthogonal to the ability of
> having multiple groups per container.
>
> The latter is made possible on P8 in large part because each PE has its
> own DMA address space (unlike P5IOC2 or P7IOC where a single address
> space is segmented).
>
> Also, on P8 you can actually make the TVT entries point to the same
> table in memory, thus removing the need to duplicate the actual
> tables (though you still have to duplicate the invalidations). I would
> however recommend only sharing the table that way within a chip/node.
>
> .../..
>
> > >>
> > >> -1) Only one IOMMU group per container is supported as an IOMMU group
> > >> -represents the minimal entity which isolation can be guaranteed for and
> > >> -groups are allocated statically, one per a Partitionable Endpoint (PE)
> > >> +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
> > >> +container is supported as an IOMMU table is allocated at the boot time,
> > >> +one table per a IOMMU group which is a Partitionable Endpoint (PE)
> > >> (PE is often a PCI domain but not always).
>
> > > I thought the more fundamental problem was that different PEs tended
> > > to use disjoint bus address ranges, so even by duplicating put_tce
> > > across PEs you couldn't have a common address space.
>
> Yes. This is the problem with P7IOC and earlier. It *could* be doable on
> P7IOC by making them the same PE but let's not go there.
>
> > Sorry, I am not following you here.
> >
> > By duplicating put_tce, I can have multiple IOMMU groups on the same
> > virtual PHB in QEMU, "[PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple
> > groups per container" does this, the address ranges will the same.
>
> But that is only possible on P8 because only there do we have separate
> address spaces between PEs.
>
> > What I cannot do on p5ioc2 is programming the same table to multiple
> > physical PHBs (or I could but it is very different than IODA2 and pretty
> > ugly and might not always be possible because I would have to allocate
> > these pages from some common pool and face problems like fragmentation).
>
> And P7IOC has a similar issue. The DMA address top bits indexes the
> window on P7IOC within a shared address space. It's possible to
> configure a TVT to cover multiple devices but with very serious
> limitations.

Ok. To check my understanding, does this sound reasonable (rough sketch
below):

* The table_group more-or-less represents a PE, but in a way you can
reference without first knowing the specific IOMMU hardware type.

* When attaching multiple groups to the same container, the first PE
(i.e. table_group) attached is used as a representative so that
subsequent groups can be checked for compatibility with the first
PE and therefore all PEs currently included in the container

- This is why the table_group appears in some places where it
doesn't seem sensible from a pure object ownership point of
view
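
Something like this, in pseudo-C (invented helper name
table_groups_compatible(), locking and the actual ownership transfer
omitted), just to check I have the shape right:

static int tce_iommu_attach_group(void *iommu_data, struct iommu_group *grp)
{
	struct tce_container *container = iommu_data;
	struct iommu_table_group *table_group = iommu_group_get_iommudata(grp);
	struct tce_iommu_group *first, *tcegrp;

	if (tce_groups_attached(container)) {
		/* the first attached group acts as the representative PE */
		first = list_first_entry(&container->group_list,
				struct tce_iommu_group, next);
		if (!table_groups_compatible(
				iommu_group_get_iommudata(first->grp),
				table_group))
			return -EBUSY;
	}

	tcegrp = kzalloc(sizeof(*tcegrp), GFP_KERNEL);
	if (!tcegrp)
		return -ENOMEM;

	tcegrp->grp = grp;
	list_add(&tcegrp->next, &container->group_list);

	return 0;
}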

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-05-01 05:24:11

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

On Fri, May 01, 2015 at 02:10:58PM +1000, Alexey Kardashevskiy wrote:
> On 04/29/2015 04:40 PM, David Gibson wrote:
> >On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote:
> >>This adds a way for the IOMMU user to know how much a new table will
> >>use so it can be accounted in the locked_vm limit before allocation
> >>happens.
> >>
> >>This stores the allocated table size in pnv_pci_create_table()
> >>so the locked_vm counter can be updated correctly when a table is
> >>being disposed.
> >>
> >>This defines an iommu_table_group_ops callback to let VFIO know
> >>how much memory will be locked if a table is created.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>---
> >>Changes:
> >>v9:
> >>* reimplemented the whole patch
> >>---
> >> arch/powerpc/include/asm/iommu.h | 5 +++++
> >> arch/powerpc/platforms/powernv/pci-ioda.c | 14 ++++++++++++
> >> arch/powerpc/platforms/powernv/pci.c | 36 +++++++++++++++++++++++++++++++
> >> arch/powerpc/platforms/powernv/pci.h | 2 ++
> >> 4 files changed, 57 insertions(+)
> >>
> >>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>index 1472de3..9844c106 100644
> >>--- a/arch/powerpc/include/asm/iommu.h
> >>+++ b/arch/powerpc/include/asm/iommu.h
> >>@@ -99,6 +99,7 @@ struct iommu_table {
> >> unsigned long it_size; /* Size of iommu table in entries */
> >> unsigned long it_indirect_levels;
> >> unsigned long it_level_size;
> >>+ unsigned long it_allocated_size;
> >> unsigned long it_offset; /* Offset into global table */
> >> unsigned long it_base; /* mapped address of tce table */
> >> unsigned long it_index; /* which iommu table this is */
> >>@@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
> >> struct iommu_table_group;
> >>
> >> struct iommu_table_group_ops {
> >>+ unsigned long (*get_table_size)(
> >>+ __u32 page_shift,
> >>+ __u64 window_size,
> >>+ __u32 levels);
> >> long (*create_table)(struct iommu_table_group *table_group,
> >> int num,
> >> __u32 page_shift,
> >>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>index e0be556..7f548b4 100644
> >>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>@@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
> >> }
> >>
> >> #ifdef CONFIG_IOMMU_API
> >>+static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
> >>+ __u64 window_size, __u32 levels)
> >>+{
> >>+ unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
> >>+
> >>+ if (!ret)
> >>+ return ret;
> >>+
> >>+ /* Add size of it_userspace */
> >>+ return ret + (window_size >> page_shift) * sizeof(unsigned long);
> >
> >This doesn't make much sense. The userspace view can't possibly be a
> >property of the specific low-level IOMMU model.
>
>
> This it_userspace thing is all about memory preregistration.
>
> I need some way to track how many actual mappings the
> mm_iommu_table_group_mem_t has in order to decide whether to allow
> unregistering or not.
>
> When I clear TCE, I can read the old value which is host physical address
> which I cannot use to find the preregistered region and adjust the mappings
> counter; I can only use userspace addresses for this (not even guest
> physical addresses as it is VFIO and probably no KVM).
>
> So I have to keep userspace addresses somewhere, one per IOMMU page, and the
> iommu_table seems a natural place for this.

Well... sort of. But as noted elsewhere this pulls VFIO-specific
constraints into a platform code structure. And whether you get this
table depends on the platform IOMMU type rather than on what VFIO
wants to do with it, which doesn't make sense.

What might make more sense is an opaque pointer in iommu_table for use
by the table "owner" (in the take_ownership sense). The pointer would
be stored in iommu_table, but VFIO is responsible for populating and
managing its contents.

Or you could just put the userspace mappings in the container.
Although you might want a different data structure in that case.

The other thing to bear in mind is that registered regions are likely
to be large contiguous blocks in user addresses, though obviously not
contiguous in physical addresses. So you might be able to compactify this
information by storing it as a list of variable-length blocks in
userspace address space, rather than a per-page address.
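
Something along these lines, maybe (totally untested, all names invented):

/* one entry per registered block, contiguous in userspace addresses */
struct tce_ua_range {
	struct list_head list;
	unsigned long entry;	/* first TCE entry covered */
	unsigned long npages;	/* number of IOMMU pages in the block */
	unsigned long ua;	/* userspace address backing the first entry */
};

/* find the userspace address backing a given TCE entry */
static unsigned long tce_entry_to_ua(struct list_head *ranges,
		unsigned long entry, unsigned int shift)
{
	struct tce_ua_range *r;

	list_for_each_entry(r, ranges, list)
		if (entry >= r->entry && entry < r->entry + r->npages)
			return r->ua + ((entry - r->entry) << shift);

	return 0;
}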



But... isn't there a bigger problem here? As Paulus was pointing out,
there's nothing guaranteeing the page tables continue to contain the
same page as was there at gup() time.

What's going to happen if you REGISTER a memory region, then mremap()
over it? Then attempt to PUT_TCE a page in the region? Or what if you
mremap() it to someplace else then try to PUT_TCE a page there? Or
REGISTER it again in its new location?

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-05-01 05:24:24

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2

On Fri, May 01, 2015 at 02:35:23PM +1000, Alexey Kardashevskiy wrote:
> On 04/30/2015 04:55 PM, David Gibson wrote:
> >On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:
> >>The existing implementation accounts the whole DMA window in
> >>the locked_vm counter. This is going to be worse with multiple
> >>containers and huge DMA windows. Also, real-time accounting would requite
> >>additional tracking of accounted pages due to the page size difference -
> >>IOMMU uses 4K pages and system uses 4K or 64K pages.
> >>
> >>Another issue is that actual pages pinning/unpinning happens on every
> >>DMA map/unmap request. This does not affect the performance much now as
> >>we spend way too much time now on switching context between
> >>guest/userspace/host but this will start to matter when we add in-kernel
> >>DMA map/unmap acceleration.
> >>
> >>This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
> >>New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
> >>2 new ioctls to register/unregister DMA memory -
> >>VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
> >>which receive user space address and size of a memory region which
> >>needs to be pinned/unpinned and counted in locked_vm.
> >>New IOMMU splits physical pages pinning and TCE table update into 2 different
> >>operations. It requires 1) guest pages to be registered first 2) consequent
> >>map/unmap requests to work only with pre-registered memory.
> >>For the default single window case this means that the entire guest
> >>(instead of 2GB) needs to be pinned before using VFIO.
> >>When a huge DMA window is added, no additional pinning will be
> >>required, otherwise it would be guest RAM + 2GB.
> >>
> >>The new memory registration ioctls are not supported by
> >>VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
> >>will require memory to be preregistered in order to work.
> >>
> >>The accounting is done per the user process.
> >>
> >>This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
> >>can do with v1 or v2 IOMMUs.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>[aw: for the vfio related changes]
> >>Acked-by: Alex Williamson <[email protected]>
> >>---
> >>Changes:
> >>v9:
> >>* s/tce_get_hva_cached/tce_iommu_use_page_v2/
> >>
> >>v7:
> >>* now memory is registered per mm (i.e. process)
> >>* moved memory registration code to powerpc/mmu
> >>* merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
> >>* limited new ioctls to v2 IOMMU
> >>* updated doc
> >>* unsupported ioclts return -ENOTTY instead of -EPERM
> >>
> >>v6:
> >>* tce_get_hva_cached() returns hva via a pointer
> >>
> >>v4:
> >>* updated docs
> >>* s/kzmalloc/vzalloc/
> >>* in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
> >>replaced offset with index
> >>* renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
> >>and removed duplicating vfio_iommu_spapr_register_memory
> >>---
> >> Documentation/vfio.txt | 23 ++++
> >> drivers/vfio/vfio_iommu_spapr_tce.c | 230 +++++++++++++++++++++++++++++++++++-
> >> include/uapi/linux/vfio.h | 27 +++++
> >> 3 files changed, 274 insertions(+), 6 deletions(-)
> >>
> >>diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> >>index 96978ec..94328c8 100644
> >>--- a/Documentation/vfio.txt
> >>+++ b/Documentation/vfio.txt
> >>@@ -427,6 +427,29 @@ The code flow from the example above should be slightly changed:
> >>
> >> ....
> >>
> >>+5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
> >>+VFIO_IOMMU_DISABLE and implements 2 new ioctls:
> >>+VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
> >>+(which are unsupported in v1 IOMMU).
> >
> >A summary of the semantic differeces between v1 and v2 would be nice.
> >At this point it's not really clear to me if there's a case for
> >creating v2, or if this could just be done by adding (optional)
> >functionality to v1.
>
> v1: memory preregistration is not supported; explicit enable/disable ioctls
> are required
>
> v2: memory preregistration is required; explicit enable/disable are
> prohibited (as they are not needed).
>
> Mixing these in one IOMMU type caused a lot of problems like should I
> increment locked_vm by the 32bit window size on enable() or not; what do I
> do about pages pinning when map/map (check if it is from registered memory
> and do not pin?).
>
> Having 2 IOMMU models makes everything a lot simpler.

Ok. Would it simplify it further if you made v2 only usable on IODA2
hardware?

> >>+PPC64 paravirtualized guests generate a lot of map/unmap requests,
> >>+and the handling of those includes pinning/unpinning pages and updating
> >>+mm::locked_vm counter to make sure we do not exceed the rlimit.
> >>+The v2 IOMMU splits accounting and pinning into separate operations:
> >>+
> >>+- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
> >>+receive a user space address and size of the block to be pinned.
> >>+Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
> >>+be called with the exact address and size used for registering
> >>+the memory block. The userspace is not expected to call these often.
> >>+The ranges are stored in a linked list in a VFIO container.
> >>+
> >>+- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
> >>+IOMMU table and do not do pinning; instead these check that the userspace
> >>+address is from pre-registered range.
> >>+
> >>+This separation helps in optimizing DMA for guests.
> >>+
> >> -------------------------------------------------------------------------------
> >>
> >> [1] VFIO was originally an acronym for "Virtual Function I/O" in its
> >>diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>index 892a584..4cfc2c1 100644
> >>--- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >>+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >
> >So, from things you said at other points, I thought the idea was that
> >this registration stuff could also be used on non-Power IOMMUs. Did I
> >misunderstand, or is that a possibility for the future?
>
>
> I never said a thing about non-PPC :) I seriously doubt any other arch has
> this hypervisor interface with H_PUT_TCE (may be s390? :) ); for others
> there is no profit from memory preregistration as they (at least x86) do map
> the entire guest before it starts which essentially is that preregistration.
>
>
> btw later we may want to implement simple IOMMU v3 which will do pinning +
> locked_vm when mapping as x86 does, for http://dpdk.org/ - these things do
> not really have to bother with preregistration (even if it just a single
> additional ioctl).
>
>
>
> >>@@ -21,6 +21,7 @@
> >> #include <linux/vfio.h>
> >> #include <asm/iommu.h>
> >> #include <asm/tce.h>
> >>+#include <asm/mmu_context.h>
> >>
> >> #define DRIVER_VERSION "0.1"
> >> #define DRIVER_AUTHOR "[email protected]"
> >>@@ -91,8 +92,58 @@ struct tce_container {
> >> struct iommu_group *grp;
> >> bool enabled;
> >> unsigned long locked_pages;
> >>+ bool v2;
> >> };
> >>
> >>+static long tce_unregister_pages(struct tce_container *container,
> >>+ __u64 vaddr, __u64 size)
> >>+{
> >>+ long ret;
> >>+ struct mm_iommu_table_group_mem_t *mem;
> >>+
> >>+ if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
> >>+ return -EINVAL;
> >>+
> >>+ mem = mm_iommu_get(vaddr, size >> PAGE_SHIFT);
> >>+ if (!mem)
> >>+ return -EINVAL;
> >>+
> >>+ ret = mm_iommu_put(mem); /* undo kref_get() from mm_iommu_get() */
> >>+ if (!ret)
> >>+ ret = mm_iommu_put(mem);
> >>+
> >>+ return ret;
> >>+}
> >>+
> >>+static long tce_register_pages(struct tce_container *container,
> >>+ __u64 vaddr, __u64 size)
> >>+{
> >>+ long ret = 0;
> >>+ struct mm_iommu_table_group_mem_t *mem;
> >>+ unsigned long entries = size >> PAGE_SHIFT;
> >>+
> >>+ if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
> >>+ ((vaddr + size) < vaddr))
> >>+ return -EINVAL;
> >>+
> >>+ mem = mm_iommu_get(vaddr, entries);
> >>+ if (!mem) {
> >>+ ret = try_increment_locked_vm(entries);
> >>+ if (ret)
> >>+ return ret;
> >>+
> >>+ ret = mm_iommu_alloc(vaddr, entries, &mem);
> >>+ if (ret) {
> >>+ decrement_locked_vm(entries);
> >>+ return ret;
> >>+ }
> >>+ }
> >>+
> >>+ container->enabled = true;
> >>+
> >>+ return 0;
> >>+}
> >
> >So requiring that registered regions get unregistered with exactly the
> >same addr/length is reasonable. I'm a bit less convinced that
> >disallowing overlaps is a good idea. What if two libraries in the
> >same process are trying to use VFIO - they may not know if the regions
> >they try to register are overlapping.
>
>
> Sorry, I do not understand. A library allocates RAM. A library is expected
> to do register it via additional ioctl, that's it. Another library allocates
> another chunk of memory and it won't overlap and the registered areas won't
> either.

So the case I'm thinking of is where a library does VFIO using a buffer
passed into it from the program at large. Another library does the
same.

The main program, unaware of the VFIO shenanigans, passes different
parts of the same page to the 2 libraries.

This is somewhat similar to the case of the horribly, horribly broken
semantics of POSIX file range locks (they're both hard to implement and
dangerous in the multi-library case similar to the above).

>
>
> >> static bool tce_page_is_contained(struct page *page, unsigned page_shift)
> >> {
> >> /*
> >>@@ -205,7 +256,7 @@ static void *tce_iommu_open(unsigned long arg)
> >> {
> >> struct tce_container *container;
> >>
> >>- if (arg != VFIO_SPAPR_TCE_IOMMU) {
> >>+ if ((arg != VFIO_SPAPR_TCE_IOMMU) && (arg != VFIO_SPAPR_TCE_v2_IOMMU)) {
> >> pr_err("tce_vfio: Wrong IOMMU type\n");
> >> return ERR_PTR(-EINVAL);
> >> }
> >>@@ -215,6 +266,7 @@ static void *tce_iommu_open(unsigned long arg)
> >> return ERR_PTR(-ENOMEM);
> >>
> >> mutex_init(&container->lock);
> >>+ container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
> >>
> >> return container;
> >> }
> >>@@ -243,6 +295,47 @@ static void tce_iommu_unuse_page(struct tce_container *container,
> >> put_page(page);
> >> }
> >>
> >>+static int tce_iommu_use_page_v2(unsigned long tce, unsigned long size,
> >>+ unsigned long *phpa, struct mm_iommu_table_group_mem_t **pmem)
>
>
> You suggested s/tce_get_hpa/tce_iommu_use_page/ but in this particular patch
> it is confusing as tce_iommu_unuse_page_v2() calls it to find corresponding
> mm_iommu_table_group_mem_t by the userspace address address of a page being
> stopped used.
>
> tce_iommu_use_page (without v2) does use the page but this one I'll rename
> back to tce_iommu_ua_to_hpa_v2(), is that ok?

Sorry, I couldn't follow this comment.

>
>
> >>+{
> >>+ long ret = 0;
> >>+ struct mm_iommu_table_group_mem_t *mem;
> >>+
> >>+ mem = mm_iommu_lookup(tce, size);
> >>+ if (!mem)
> >>+ return -EINVAL;
> >>+
> >>+ ret = mm_iommu_ua_to_hpa(mem, tce, phpa);
> >>+ if (ret)
> >>+ return -EINVAL;
> >>+
> >>+ *pmem = mem;
> >>+
> >>+ return 0;
> >>+}
> >>+
> >>+static void tce_iommu_unuse_page_v2(struct iommu_table *tbl,
> >>+ unsigned long entry)
> >>+{
> >>+ struct mm_iommu_table_group_mem_t *mem = NULL;
> >>+ int ret;
> >>+ unsigned long hpa = 0;
> >>+ unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >>+
> >>+ if (!pua || !current || !current->mm)
> >>+ return;
> >>+
> >>+ ret = tce_iommu_use_page_v2(*pua, IOMMU_PAGE_SIZE(tbl),
> >>+ &hpa, &mem);
> >>+ if (ret)
> >>+ pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
> >>+ __func__, *pua, entry, ret);
> >>+ if (mem)
> >>+ mm_iommu_mapped_update(mem, false);
> >>+
> >>+ *pua = 0;
> >>+}
> >>+
> >> static int tce_iommu_clear(struct tce_container *container,
> >> struct iommu_table *tbl,
> >> unsigned long entry, unsigned long pages)
> >>@@ -261,6 +354,11 @@ static int tce_iommu_clear(struct tce_container *container,
> >> if (direction == DMA_NONE)
> >> continue;
> >>
> >>+ if (container->v2) {
> >>+ tce_iommu_unuse_page_v2(tbl, entry);
> >>+ continue;
> >>+ }
> >>+
> >> tce_iommu_unuse_page(container, oldtce);
> >> }
> >>
> >>@@ -327,6 +425,62 @@ static long tce_iommu_build(struct tce_container *container,
> >> return ret;
> >> }
> >>
> >>+static long tce_iommu_build_v2(struct tce_container *container,
> >>+ struct iommu_table *tbl,
> >>+ unsigned long entry, unsigned long tce, unsigned long pages,
> >>+ enum dma_data_direction direction)
> >>+{
> >>+ long i, ret = 0;
> >>+ struct page *page;
> >>+ unsigned long hpa;
> >>+ enum dma_data_direction dirtmp;
> >>+
> >>+ for (i = 0; i < pages; ++i) {
> >>+ struct mm_iommu_table_group_mem_t *mem = NULL;
> >>+ unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl,
> >>+ entry + i);
> >>+
> >>+ ret = tce_iommu_use_page_v2(tce, IOMMU_PAGE_SIZE(tbl),
> >>+ &hpa, &mem);
> >>+ if (ret)
> >>+ break;
> >>+
> >>+ page = pfn_to_page(hpa >> PAGE_SHIFT);
> >>+ if (!tce_page_is_contained(page, tbl->it_page_shift)) {
> >>+ ret = -EPERM;
> >>+ break;
> >>+ }
> >>+
> >>+ /* Preserve offset within IOMMU page */
> >>+ hpa |= tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
> >>+ dirtmp = direction;
> >>+
> >>+ ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp);
> >>+ if (ret) {
> >>+ /* dirtmp cannot be DMA_NONE here */
> >>+ tce_iommu_unuse_page_v2(tbl, entry + i);
> >>+ pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
> >>+ __func__, entry << tbl->it_page_shift,
> >>+ tce, ret);
> >>+ break;
> >>+ }
> >>+
> >>+ mm_iommu_mapped_update(mem, true);
> >>+
> >>+ if (dirtmp != DMA_NONE)
> >>+ tce_iommu_unuse_page_v2(tbl, entry + i);
> >>+
> >>+ *pua = tce;
> >>+
> >>+ tce += IOMMU_PAGE_SIZE(tbl);
> >>+ }
> >>+
> >>+ if (ret)
> >>+ tce_iommu_clear(container, tbl, entry, i);
> >>+
> >>+ return ret;
> >>+}
> >>+
> >> static long tce_iommu_ioctl(void *iommu_data,
> >> unsigned int cmd, unsigned long arg)
> >> {
> >>@@ -338,6 +492,7 @@ static long tce_iommu_ioctl(void *iommu_data,
> >> case VFIO_CHECK_EXTENSION:
> >> switch (arg) {
> >> case VFIO_SPAPR_TCE_IOMMU:
> >>+ case VFIO_SPAPR_TCE_v2_IOMMU:
> >> ret = 1;
> >> break;
> >> default:
> >>@@ -425,11 +580,18 @@ static long tce_iommu_ioctl(void *iommu_data,
> >> if (ret)
> >> return ret;
> >>
> >>- ret = tce_iommu_build(container, tbl,
> >>- param.iova >> tbl->it_page_shift,
> >>- param.vaddr,
> >>- param.size >> tbl->it_page_shift,
> >>- direction);
> >>+ if (container->v2)
> >>+ ret = tce_iommu_build_v2(container, tbl,
> >>+ param.iova >> tbl->it_page_shift,
> >>+ param.vaddr,
> >>+ param.size >> tbl->it_page_shift,
> >>+ direction);
> >>+ else
> >>+ ret = tce_iommu_build(container, tbl,
> >>+ param.iova >> tbl->it_page_shift,
> >>+ param.vaddr,
> >>+ param.size >> tbl->it_page_shift,
> >>+ direction);
> >>
> >> iommu_flush_tce(tbl);
> >>
> >>@@ -474,7 +636,60 @@ static long tce_iommu_ioctl(void *iommu_data,
> >>
> >> return ret;
> >> }
> >>+ case VFIO_IOMMU_SPAPR_REGISTER_MEMORY: {
> >>+ struct vfio_iommu_spapr_register_memory param;
> >>+
> >>+ if (!container->v2)
> >>+ break;
> >>+
> >>+ minsz = offsetofend(struct vfio_iommu_spapr_register_memory,
> >>+ size);
> >>+
> >>+ if (copy_from_user(&param, (void __user *)arg, minsz))
> >>+ return -EFAULT;
> >>+
> >>+ if (param.argsz < minsz)
> >>+ return -EINVAL;
> >>+
> >>+ /* No flag is supported now */
> >>+ if (param.flags)
> >>+ return -EINVAL;
> >>+
> >>+ mutex_lock(&container->lock);
> >>+ ret = tce_register_pages(container, param.vaddr, param.size);
> >>+ mutex_unlock(&container->lock);
> >
> >AFAICT, this is the only call to tce_register_pages(), so why not put
> >the mutex into the function.
>
> 1) I can use "return" in tce_register_pages() instead of "goto unlock_exit".
> Convinient.
>
> 2) I keep mutex_lock()/mutex_unlock() in immediate vfio_iommu_driver_ops
> callbacks (i.e. tce_iommu_ioctl, tce_iommu_attach_group,
> tce_iommu_detach_group) and do not spread them all over the file which I
> find easier to track, no?

Yeah, fair enough.

> >>+
> >>+ return ret;
> >>+ }
> >>+ case VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY: {
> >>+ struct vfio_iommu_spapr_register_memory param;
> >>+
> >>+ if (!container->v2)
> >>+ break;
> >>+
> >>+ minsz = offsetofend(struct vfio_iommu_spapr_register_memory,
> >>+ size);
> >>+
> >>+ if (copy_from_user(&param, (void __user *)arg, minsz))
> >>+ return -EFAULT;
> >>+
> >>+ if (param.argsz < minsz)
> >>+ return -EINVAL;
> >>+
> >>+ /* No flag is supported now */
> >>+ if (param.flags)
> >>+ return -EINVAL;
> >>+
> >>+ mutex_lock(&container->lock);
> >>+ tce_unregister_pages(container, param.vaddr, param.size);
> >>+ mutex_unlock(&container->lock);
> >>+
> >>+ return 0;
> >>+ }
> >> case VFIO_IOMMU_ENABLE:
> >>+ if (container->v2)
> >>+ break;
> >>+
> >> mutex_lock(&container->lock);
> >> ret = tce_iommu_enable(container);
> >> mutex_unlock(&container->lock);
> >>@@ -482,6 +697,9 @@ static long tce_iommu_ioctl(void *iommu_data,
> >>
> >>
> >> case VFIO_IOMMU_DISABLE:
> >>+ if (container->v2)
> >>+ break;
> >>+
> >> mutex_lock(&container->lock);
> >> tce_iommu_disable(container);
> >> mutex_unlock(&container->lock);
> >>diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>index b57b750..8fdcfb9 100644
> >>--- a/include/uapi/linux/vfio.h
> >>+++ b/include/uapi/linux/vfio.h
> >>@@ -36,6 +36,8 @@
> >> /* Two-stage IOMMU */
> >> #define VFIO_TYPE1_NESTING_IOMMU 6 /* Implies v2 */
> >>
> >>+#define VFIO_SPAPR_TCE_v2_IOMMU 7
> >>+
> >> /*
> >> * The IOCTL interface is designed for extensibility by embedding the
> >> * structure length (argsz) and flags into structures passed between
> >>@@ -495,6 +497,31 @@ struct vfio_eeh_pe_op {
> >>
> >> #define VFIO_EEH_PE_OP _IO(VFIO_TYPE, VFIO_BASE + 21)
> >>
> >>+/**
> >>+ * VFIO_IOMMU_SPAPR_REGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 17, struct vfio_iommu_spapr_register_memory)
> >>+ *
> >>+ * Registers user space memory where DMA is allowed. It pins
> >>+ * user pages and does the locked memory accounting so
> >>+ * subsequent VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA calls
> >>+ * get faster.
> >>+ */
> >>+struct vfio_iommu_spapr_register_memory {
> >>+ __u32 argsz;
> >>+ __u32 flags;
> >>+ __u64 vaddr; /* Process virtual address */
> >>+ __u64 size; /* Size of mapping (bytes) */
> >>+};
> >>+#define VFIO_IOMMU_SPAPR_REGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 17)
> >>+
> >>+/**
> >>+ * VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 18, struct vfio_iommu_spapr_register_memory)
> >>+ *
> >>+ * Unregisters user space memory registered with
> >>+ * VFIO_IOMMU_SPAPR_REGISTER_MEMORY.
> >>+ * Uses vfio_iommu_spapr_register_memory for parameters.
> >>+ */
> >>+#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 18)
> >>+
> >> /* ***************************************************************** */
> >>
> >> #endif /* _UAPIVFIO_H */
> >
>
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-05-01 06:05:34

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible

On 05/01/2015 02:33 PM, David Gibson wrote:
> On Thu, Apr 30, 2015 at 07:33:09PM +1000, Alexey Kardashevskiy wrote:
>> On 04/30/2015 05:22 PM, David Gibson wrote:
>>> On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote:
>>>> At the moment only one group per container is supported.
>>>> POWER8 CPUs have more flexible design and allows naving 2 TCE tables per
>>>> IOMMU group so we can relax this limitation and support multiple groups
>>>> per container.
>>>
>>> It's not obvious why allowing multiple TCE tables per PE has any
>>> pearing on allowing multiple groups per container.
>>
>>
>> This patchset is a global TCE tables rework (patches 1..30, roughly) with 2
>> outcomes:
>> 1. reusing the same IOMMU table for multiple groups - patch 31;
>> 2. allowing dynamic create/remove of IOMMU tables - patch 32.
>>
>> I can remove this one from the patchset and post it separately later but
>> since 1..30 aim to support both 1) and 2), I'd think I better keep them all
>> together (might explain some of changes I do in 1..30).
>
> The combined patchset is fine. My comment is because your commit
> message says that multiple groups are possible *because* 2 TCE tables
> per group are allowed, and it's not at all clear why one follows from
> the other.


Ah. That's wrong indeed, I'll fix it.


>>>> This adds TCE table descriptors to a container and uses iommu_table_group_ops
>>>> to create/set DMA windows on IOMMU groups so the same TCE tables will be
>>>> shared between several IOMMU groups.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>> [aw: for the vfio related changes]
>>>> Acked-by: Alex Williamson <[email protected]>
>>>> ---
>>>> Changes:
>>>> v7:
>>>> * updated doc
>>>> ---
>>>> Documentation/vfio.txt | 8 +-
>>>> drivers/vfio/vfio_iommu_spapr_tce.c | 268 ++++++++++++++++++++++++++----------
>>>> 2 files changed, 199 insertions(+), 77 deletions(-)
>>>>
>>>> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
>>>> index 94328c8..7dcf2b5 100644
>>>> --- a/Documentation/vfio.txt
>>>> +++ b/Documentation/vfio.txt
>>>> @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
>>>>
>>>> This implementation has some specifics:
>>>>
>>>> -1) Only one IOMMU group per container is supported as an IOMMU group
>>>> -represents the minimal entity which isolation can be guaranteed for and
>>>> -groups are allocated statically, one per a Partitionable Endpoint (PE)
>>>> +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
>>>> +container is supported as an IOMMU table is allocated at the boot time,
>>>> +one table per a IOMMU group which is a Partitionable Endpoint (PE)
>>>> (PE is often a PCI domain but not always).
>>>
>>> I thought the more fundamental problem was that different PEs tended
>>> to use disjoint bus address ranges, so even by duplicating put_tce
>>> across PEs you couldn't have a common address space.
>>
>>
>> Sorry, I am not following you here.
>>
>> By duplicating put_tce, I can have multiple IOMMU groups on the same virtual
>> PHB in QEMU, "[PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple groups
>> per container" does this, the address ranges will the same.
>
> Oh, ok. For some reason I thought that (at least on the older
> machines) the different PEs used different and not easily changeable
> DMA windows in bus addresses space.


They do use different tables (which VFIO does not get to remove/create and
uses these old helpers - iommu_take/release_ownership), correct. But all
these windows are mapped at zero on a PE's PCI bus and nothing prevents me
from updating all these tables with the same TCE values when handling
H_PUT_TCE. Yes, it is slow, but it works (a bit more detail below).



>> What I cannot do on p5ioc2 is programming the same table to multiple
>> physical PHBs (or I could but it is very different than IODA2 and pretty
>> ugly and might not always be possible because I would have to allocate these
>> pages from some common pool and face problems like fragmentation).
>
> So allowing multiple groups per container should be possible (at the
> kernel rather than qemu level) by writing the same value to multiple
> TCE tables. I guess its not worth doing for just the almost-obsolete
> IOMMUs though.


It is done at the QEMU level though. As it works now, QEMU opens a group
and walks through all existing containers trying to attach the new group
to one of them. If that succeeds (always on x86; on POWER8 after this
patch), the TCE table is shared. If it fails, QEMU creates another
container, attaches it to the same VFIO/PHB address space and attaches the
group there.

Then the only thing left is repeating the ioctl() in vfio_container_ioctl()
for every container in the VFIO address space; this is what that QEMU patch
does (the first version of that patch called ioctl() only for the first
container in the address space).

From the kernel perspective there are 2 isolated containers; I'd like to
keep it this way.
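
Roughly, on the QEMU side (pseudocode, invented helper names, error
handling omitted):

/* attach a new group, reusing a container (and its TCE table) if we can */
group_fd = vfio_get_group_fd(groupid);
attached = false;
QLIST_FOREACH(container, &space->containers, next) {
	if (ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd) == 0) {
		attached = true;	/* the TCE table is shared now */
		break;
	}
}
if (!attached) {
	/* open a new /dev/vfio/vfio, do VFIO_SET_IOMMU, etc. */
	container = vfio_new_container(space);
	ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
}

/* vfio_container_ioctl() then simply replays the request on every container */
QLIST_FOREACH(container, &space->containers, next)
	ret = ioctl(container->fd, req, param);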

btw thanks for the detailed review :)

--
Alexey

2015-05-01 06:27:59

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2

On 05/01/2015 03:23 PM, David Gibson wrote:
> On Fri, May 01, 2015 at 02:35:23PM +1000, Alexey Kardashevskiy wrote:
>> On 04/30/2015 04:55 PM, David Gibson wrote:
>>> On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:
>>>> The existing implementation accounts the whole DMA window in
>>>> the locked_vm counter. This is going to be worse with multiple
>>>> containers and huge DMA windows. Also, real-time accounting would requite
>>>> additional tracking of accounted pages due to the page size difference -
>>>> IOMMU uses 4K pages and system uses 4K or 64K pages.
>>>>
>>>> Another issue is that actual pages pinning/unpinning happens on every
>>>> DMA map/unmap request. This does not affect the performance much now as
>>>> we spend way too much time now on switching context between
>>>> guest/userspace/host but this will start to matter when we add in-kernel
>>>> DMA map/unmap acceleration.
>>>>
>>>> This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
>>>> New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
>>>> 2 new ioctls to register/unregister DMA memory -
>>>> VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
>>>> which receive user space address and size of a memory region which
>>>> needs to be pinned/unpinned and counted in locked_vm.
>>>> New IOMMU splits physical pages pinning and TCE table update into 2 different
>>>> operations. It requires 1) guest pages to be registered first 2) consequent
>>>> map/unmap requests to work only with pre-registered memory.
>>>> For the default single window case this means that the entire guest
>>>> (instead of 2GB) needs to be pinned before using VFIO.
>>>> When a huge DMA window is added, no additional pinning will be
>>>> required, otherwise it would be guest RAM + 2GB.
>>>>
>>>> The new memory registration ioctls are not supported by
>>>> VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
>>>> will require memory to be preregistered in order to work.
>>>>
>>>> The accounting is done per the user process.
>>>>
>>>> This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
>>>> can do with v1 or v2 IOMMUs.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>> [aw: for the vfio related changes]
>>>> Acked-by: Alex Williamson <[email protected]>
>>>> ---
>>>> Changes:
>>>> v9:
>>>> * s/tce_get_hva_cached/tce_iommu_use_page_v2/
>>>>
>>>> v7:
>>>> * now memory is registered per mm (i.e. process)
>>>> * moved memory registration code to powerpc/mmu
>>>> * merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
>>>> * limited new ioctls to v2 IOMMU
>>>> * updated doc
>>>> * unsupported ioclts return -ENOTTY instead of -EPERM
>>>>
>>>> v6:
>>>> * tce_get_hva_cached() returns hva via a pointer
>>>>
>>>> v4:
>>>> * updated docs
>>>> * s/kzmalloc/vzalloc/
>>>> * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
>>>> replaced offset with index
>>>> * renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
>>>> and removed duplicating vfio_iommu_spapr_register_memory
>>>> ---
>>>> Documentation/vfio.txt | 23 ++++
>>>> drivers/vfio/vfio_iommu_spapr_tce.c | 230 +++++++++++++++++++++++++++++++++++-
>>>> include/uapi/linux/vfio.h | 27 +++++
>>>> 3 files changed, 274 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
>>>> index 96978ec..94328c8 100644
>>>> --- a/Documentation/vfio.txt
>>>> +++ b/Documentation/vfio.txt
>>>> @@ -427,6 +427,29 @@ The code flow from the example above should be slightly changed:
>>>>
>>>> ....
>>>>
>>>> +5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
>>>> +VFIO_IOMMU_DISABLE and implements 2 new ioctls:
>>>> +VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
>>>> +(which are unsupported in v1 IOMMU).
>>>
>>> A summary of the semantic differeces between v1 and v2 would be nice.
>>> At this point it's not really clear to me if there's a case for
>>> creating v2, or if this could just be done by adding (optional)
>>> functionality to v1.
>>
>> v1: memory preregistration is not supported; explicit enable/disable ioctls
>> are required
>>
>> v2: memory preregistration is required; explicit enable/disable are
>> prohibited (as they are not needed).
>>
>> Mixing these in one IOMMU type caused a lot of problems like should I
>> increment locked_vm by the 32bit window size on enable() or not; what do I
>> do about pages pinning when map/map (check if it is from registered memory
>> and do not pin?).
>>
>> Having 2 IOMMU models makes everything a lot simpler.
>
> Ok. Would it simplify it further if you made v2 only usable on IODA2
> hardware?


Very little. V2 addresses the memory pinning issue which is handled the
same way on IODA2 and older hardware, including KVM acceleration. Whether
to enable DDW or not - that is handled just fine via extra properties in
the GET_INFO ioctl().

IODA2 and the others are different in handling multiple groups per
container but this does not require changes to the userspace API.

And remember, the only machine I can use 100% of the time is
POWER7/P5IOC2, so it is really useful if at least some bits of the patchset
can be tested there; if it were a bit less different from IODA2, I would
have implemented DDW there too :)


>>>> +PPC64 paravirtualized guests generate a lot of map/unmap requests,
>>>> +and the handling of those includes pinning/unpinning pages and updating
>>>> +mm::locked_vm counter to make sure we do not exceed the rlimit.
>>>> +The v2 IOMMU splits accounting and pinning into separate operations:
>>>> +
>>>> +- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
>>>> +receive a user space address and size of the block to be pinned.
>>>> +Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
>>>> +be called with the exact address and size used for registering
>>>> +the memory block. The userspace is not expected to call these often.
>>>> +The ranges are stored in a linked list in a VFIO container.
>>>> +
>>>> +- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
>>>> +IOMMU table and do not do pinning; instead these check that the userspace
>>>> +address is from pre-registered range.
>>>> +
>>>> +This separation helps in optimizing DMA for guests.
>>>> +
>>>> -------------------------------------------------------------------------------
>>>>
>>>> [1] VFIO was originally an acronym for "Virtual Function I/O" in its
>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>> index 892a584..4cfc2c1 100644
>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>
>>> So, from things you said at other points, I thought the idea was that
>>> this registration stuff could also be used on non-Power IOMMUs. Did I
>>> misunderstand, or is that a possibility for the future?
>>
>>
>> I never said a thing about non-PPC :) I seriously doubt any other arch has
>> this hypervisor interface with H_PUT_TCE (maybe s390? :) ); for others
>> there is no profit from memory preregistration as they (at least x86) map
>> the entire guest before it starts, which is essentially that preregistration.
>>
>>
>> btw later we may want to implement a simple IOMMU v3 which will do pinning +
>> locked_vm accounting at map time as x86 does, for http://dpdk.org/ - these
>> things do not really have to bother with preregistration (even if it is just
>> a single additional ioctl).
>>
>>
>>
>>>> @@ -21,6 +21,7 @@
>>>> #include <linux/vfio.h>
>>>> #include <asm/iommu.h>
>>>> #include <asm/tce.h>
>>>> +#include <asm/mmu_context.h>
>>>>
>>>> #define DRIVER_VERSION "0.1"
>>>> #define DRIVER_AUTHOR "[email protected]"
>>>> @@ -91,8 +92,58 @@ struct tce_container {
>>>> struct iommu_group *grp;
>>>> bool enabled;
>>>> unsigned long locked_pages;
>>>> + bool v2;
>>>> };
>>>>
>>>> +static long tce_unregister_pages(struct tce_container *container,
>>>> + __u64 vaddr, __u64 size)
>>>> +{
>>>> + long ret;
>>>> + struct mm_iommu_table_group_mem_t *mem;
>>>> +
>>>> + if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
>>>> + return -EINVAL;
>>>> +
>>>> + mem = mm_iommu_get(vaddr, size >> PAGE_SHIFT);
>>>> + if (!mem)
>>>> + return -EINVAL;
>>>> +
>>>> + ret = mm_iommu_put(mem); /* undo kref_get() from mm_iommu_get() */
>>>> + if (!ret)
>>>> + ret = mm_iommu_put(mem);
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> +static long tce_register_pages(struct tce_container *container,
>>>> + __u64 vaddr, __u64 size)
>>>> +{
>>>> + long ret = 0;
>>>> + struct mm_iommu_table_group_mem_t *mem;
>>>> + unsigned long entries = size >> PAGE_SHIFT;
>>>> +
>>>> + if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
>>>> + ((vaddr + size) < vaddr))
>>>> + return -EINVAL;
>>>> +
>>>> + mem = mm_iommu_get(vaddr, entries);
>>>> + if (!mem) {
>>>> + ret = try_increment_locked_vm(entries);
>>>> + if (ret)
>>>> + return ret;
>>>> +
>>>> + ret = mm_iommu_alloc(vaddr, entries, &mem);
>>>> + if (ret) {
>>>> + decrement_locked_vm(entries);
>>>> + return ret;
>>>> + }
>>>> + }
>>>> +
>>>> + container->enabled = true;
>>>> +
>>>> + return 0;
>>>> +}
>>>
>>> So requiring that registered regions get unregistered with exactly the
>>> same addr/length is reasonable. I'm a bit less convinced that
>>> disallowing overlaps is a good idea. What if two libraries in the
>>> same process are trying to use VFIO - they may not know if the regions
>>> they try to register are overlapping.
>>
>>
>> Sorry, I do not understand. A library allocates RAM. A library is expected
>> to register it via the additional ioctl, that's it. Another library allocates
>> another chunk of memory which won't overlap, so the registered areas won't
>> either.
>
> So the case I'm thinking is where the library does VFIO using a buffer
> passed into it from the program at large. Another library does the
> same.
>
> The main program, unaware of the VFIO shenanigans passes different
> parts of the same page to the 2 libraries.
>
> This is somewhat similar to the case of the horribly, horribly broken
> semantics of POSIX file range locks (it's both hard to implement and
> dangerous in the multi-library case similar to above).


Ok. I'll implement an x86-alike V3 SPAPR TCE IOMMU for these people, later :)

V2 addresses issues caused by H_PUT_TCE + DDW RTAS interfaces.



>>>> static bool tce_page_is_contained(struct page *page, unsigned page_shift)
>>>> {
>>>> /*
>>>> @@ -205,7 +256,7 @@ static void *tce_iommu_open(unsigned long arg)
>>>> {
>>>> struct tce_container *container;
>>>>
>>>> - if (arg != VFIO_SPAPR_TCE_IOMMU) {
>>>> + if ((arg != VFIO_SPAPR_TCE_IOMMU) && (arg != VFIO_SPAPR_TCE_v2_IOMMU)) {
>>>> pr_err("tce_vfio: Wrong IOMMU type\n");
>>>> return ERR_PTR(-EINVAL);
>>>> }
>>>> @@ -215,6 +266,7 @@ static void *tce_iommu_open(unsigned long arg)
>>>> return ERR_PTR(-ENOMEM);
>>>>
>>>> mutex_init(&container->lock);
>>>> + container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
>>>>
>>>> return container;
>>>> }
>>>> @@ -243,6 +295,47 @@ static void tce_iommu_unuse_page(struct tce_container *container,
>>>> put_page(page);
>>>> }
>>>>
>>>> +static int tce_iommu_use_page_v2(unsigned long tce, unsigned long size,
>>>> + unsigned long *phpa, struct mm_iommu_table_group_mem_t **pmem)
>>
>>
>> You suggested s/tce_get_hpa/tce_iommu_use_page/ but in this particular patch
>> it is confusing as tce_iommu_unuse_page_v2() calls it to find the corresponding
>> mm_iommu_table_group_mem_t by the userspace address of a page that is no
>> longer being used.
>>
>> tce_iommu_use_page (without v2) does use the page but this one I'll rename
>> back to tce_iommu_ua_to_hpa_v2(), is that ok?
>
> Sorry, I couldn't follow this comment.


For V1 IOMMU, I used to have:
tce_get_hpa() - this converted UA to linear address and did gup();
tce_iommu_unuse_page() - this did put_page().

You suggested (*) to rename the first one to tce_use_page() which makes sense.

V2 introduces its own versions of use/unuse but these use preregistered
memory and do not do gup()/put_page(). I named them:
tce_get_hpa_cached()
tce_iommu_unuse_page_v2()

Then, applying your comment (*) to the V2 IOMMU, I renamed
tce_get_hpa_cached() to tce_iommu_use_page_v2(), and I do not like the
result now (in the chunk below). I'll rename it to
tce_iommu_ua_to_hpa_v2() - will that be ok?



>
>>
>>
>>>> +{
>>>> + long ret = 0;
>>>> + struct mm_iommu_table_group_mem_t *mem;
>>>> +
>>>> + mem = mm_iommu_lookup(tce, size);
>>>> + if (!mem)
>>>> + return -EINVAL;
>>>> +
>>>> + ret = mm_iommu_ua_to_hpa(mem, tce, phpa);
>>>> + if (ret)
>>>> + return -EINVAL;
>>>> +
>>>> + *pmem = mem;
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static void tce_iommu_unuse_page_v2(struct iommu_table *tbl,
>>>> + unsigned long entry)
>>>> +{
>>>> + struct mm_iommu_table_group_mem_t *mem = NULL;
>>>> + int ret;
>>>> + unsigned long hpa = 0;
>>>> + unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
>>>> +
>>>> + if (!pua || !current || !current->mm)
>>>> + return;
>>>> +
>>>> + ret = tce_iommu_use_page_v2(*pua, IOMMU_PAGE_SIZE(tbl),
>>>> + &hpa, &mem);
>>>> + if (ret)
>>>> + pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
>>>> + __func__, *pua, entry, ret);
>>>> + if (mem)
>>>> + mm_iommu_mapped_update(mem, false);
>>>> +
>>>> + *pua = 0;
>>>> +}
>>>> +


--
Alexey

2015-05-01 06:53:17

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

On 05/01/2015 03:12 PM, David Gibson wrote:
> On Fri, May 01, 2015 at 02:10:58PM +1000, Alexey Kardashevskiy wrote:
>> On 04/29/2015 04:40 PM, David Gibson wrote:
>>> On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote:
>>>> This adds a way for the IOMMU user to know how much a new table will
>>>> use so it can be accounted in the locked_vm limit before allocation
>>>> happens.
>>>>
>>>> This stores the allocated table size in pnv_pci_create_table()
>>>> so the locked_vm counter can be updated correctly when a table is
>>>> being disposed.
>>>>
>>>> This defines an iommu_table_group_ops callback to let VFIO know
>>>> how much memory will be locked if a table is created.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>> ---
>>>> Changes:
>>>> v9:
>>>> * reimplemented the whole patch
>>>> ---
>>>> arch/powerpc/include/asm/iommu.h | 5 +++++
>>>> arch/powerpc/platforms/powernv/pci-ioda.c | 14 ++++++++++++
>>>> arch/powerpc/platforms/powernv/pci.c | 36 +++++++++++++++++++++++++++++++
>>>> arch/powerpc/platforms/powernv/pci.h | 2 ++
>>>> 4 files changed, 57 insertions(+)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>>>> index 1472de3..9844c106 100644
>>>> --- a/arch/powerpc/include/asm/iommu.h
>>>> +++ b/arch/powerpc/include/asm/iommu.h
>>>> @@ -99,6 +99,7 @@ struct iommu_table {
>>>> unsigned long it_size; /* Size of iommu table in entries */
>>>> unsigned long it_indirect_levels;
>>>> unsigned long it_level_size;
>>>> + unsigned long it_allocated_size;
>>>> unsigned long it_offset; /* Offset into global table */
>>>> unsigned long it_base; /* mapped address of tce table */
>>>> unsigned long it_index; /* which iommu table this is */
>>>> @@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>>>> struct iommu_table_group;
>>>>
>>>> struct iommu_table_group_ops {
>>>> + unsigned long (*get_table_size)(
>>>> + __u32 page_shift,
>>>> + __u64 window_size,
>>>> + __u32 levels);
>>>> long (*create_table)(struct iommu_table_group *table_group,
>>>> int num,
>>>> __u32 page_shift,
>>>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>> index e0be556..7f548b4 100644
>>>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>> @@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
>>>> }
>>>>
>>>> #ifdef CONFIG_IOMMU_API
>>>> +static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
>>>> + __u64 window_size, __u32 levels)
>>>> +{
>>>> + unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
>>>> +
>>>> + if (!ret)
>>>> + return ret;
>>>> +
>>>> + /* Add size of it_userspace */
>>>> + return ret + (window_size >> page_shift) * sizeof(unsigned long);
>>>
>>> This doesn't make much sense. The userspace view can't possibly be a
>>> property of the specific low-level IOMMU model.
>>
>>
>> This it_userspace thing is all about memory preregistration.
>>
>> I need some way to track how many actual mappings the
>> mm_iommu_table_group_mem_t has in order to decide whether to allow
>> unregistering or not.
>>
>> When I clear TCE, I can read the old value which is host physical address
>> which I cannot use to find the preregistered region and adjust the mappings
>> counter; I can only use userspace addresses for this (not even guest
>> physical addresses as it is VFIO and probably no KVM).
>>
>> So I have to keep userspace addresses somewhere, one per IOMMU page, and the
>> iommu_table seems a natural place for this.
>
> Well.. sort of. But as noted elsewhere this pulls VFIO specific
> constraints into a platform code structure. And whether you get this
> table depends on the platform IOMMU type rather than on what VFIO
> wants to do with it, which doesn't make sense.
>
> What might make more sense is an opaque pointer io iommu_table for use
> by the table "owner" (in the take_ownership sense). The pointer would
> be stored in iommu_table, but VFIO is responsible for populating and
> managing its contents.
>
> Or you could just put the userspace mappings in the container.
> Although you might want a different data structure in that case.

Nope. I need this table for in-kernel acceleration to update the mappings
counter per mm_iommu_table_group_mem_t. In KVM's real-mode handlers, I only
have IOMMU tables, not containers or groups. QEMU creates a guest view of
the table (KVM_CREATE_SPAPR_TCE) specifying a LIOBN, and then attaches TCE
tables to it via a set of ioctls (one per IOMMU group) to the VFIO KVM device.

So even if I call it it_opaque (instead of it_userspace), I will still need a
common place (visible to both VFIO and PowerKVM) to put this:
#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry)

So far this place was arch/powerpc/include/asm/iommu.h and the iommu_table
struct.
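For illustration, one possible shape of that accessor (the exact definition
in the patch may differ; the NULL check here is just an assumption of the
sketch):

/* Hypothetical shape of the common accessor: one userspace address per TCE
 * entry, reachable from VFIO and from KVM's real mode handlers with nothing
 * but the iommu_table pointer. */
#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
	((tbl)->it_userspace ? &((tbl)->it_userspace[(entry)]) : NULL)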


> The other thing to bear in mind is that registered regions are likely
> to be large contiguous blocks in user addresses, though obviously not
> contiguous in physical addr. So you might be able to compaticfy this
> information by storing it as a list of variable length blocks in
> userspace address space, rather than a per-page address..

It is 8 bytes per system page - 8/65536 = 0.00012 (or about 26MB for a 200GB
guest) - very little overhead.
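Spelling the arithmetic out (assuming 64K system pages): 200GB / 64K =
3,276,800 IOMMU entries, and at 8 bytes per entry that is 26,214,400 bytes,
i.e. ~26MB, or 8/65536 ≈ 0.012% of the guest RAM.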


> But.. isn't there a bigger problem here. As Paulus was pointing out,
> there's nothing guaranteeing the page tables continue to contain the
> same page as was there at gup() time.

This can happen if the userspace remaps memory which it registered/mapped
for DMA via VFIO, no? If so, then the userspace simply should not do this -
it is DMA memory, it cannot be moved like that. What am I missing here?


> What's going to happen if you REGISTER a memory region, then mremap()
> over it?

The registered pages will remain pinned and PUT_TCE will use that region
for translation (and this will fail as the userspace addresses changed).

I do not see how it is different from the situation when the userspace
mapped a page and mremap()ed it while it is DMA-mapped.

> Then attempt to PUT_TCE a page in the region? Or what if you
> mremap() it to someplace else then try to PUT_TCE a page there?

This will fail - a new userspace address has to be preregistered.

> Or REGISTER it again in its new location?

It will be pinned twice + some memory overhead to store the same host
physical address(es) twice.



--
Alexey

2015-05-01 07:12:56

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table

On 05/01/2015 02:23 PM, David Gibson wrote:
> On Fri, May 01, 2015 at 02:01:17PM +1000, Alexey Kardashevskiy wrote:
>> On 04/29/2015 04:31 PM, David Gibson wrote:
>>> On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote:
>>>> In order to support memory pre-registration, we need a way to track
>>>> the use of every registered memory region and only allow unregistration
>>>> if a region is not in use anymore. So we need a way to tell from what
>>>> region the just cleared TCE was from.
>>>>
>>>> This adds a userspace view of the TCE table into iommu_table struct.
>>>> It contains userspace address, one per TCE entry. The table is only
>>>> allocated when the ownership over an IOMMU group is taken which means
>>>> it is only used from outside of the powernv code (such as VFIO).
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>> ---
>>>> Changes:
>>>> v9:
>>>> * fixed code flow in error cases added in v8
>>>>
>>>> v8:
>>>> * added ENOMEM on failed vzalloc()
>>>> ---
>>>> arch/powerpc/include/asm/iommu.h | 6 ++++++
>>>> arch/powerpc/kernel/iommu.c | 18 ++++++++++++++++++
>>>> arch/powerpc/platforms/powernv/pci-ioda.c | 22 ++++++++++++++++++++--
>>>> 3 files changed, 44 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>>>> index 7694546..1472de3 100644
>>>> --- a/arch/powerpc/include/asm/iommu.h
>>>> +++ b/arch/powerpc/include/asm/iommu.h
>>>> @@ -111,9 +111,15 @@ struct iommu_table {
>>>> unsigned long *it_map; /* A simple allocation bitmap for now */
>>>> unsigned long it_page_shift;/* table iommu page size */
>>>> struct iommu_table_group *it_table_group;
>>>> + unsigned long *it_userspace; /* userspace view of the table */
>>>
>>> A single unsigned long doesn't seem like enough.
>>
>> Why single? This is an array.
>
> As in single per page.


Sorry, I am not following you here.
It is per IOMMU page. MAP/UNMAP work with IOMMU pages which are fully
backed by either a system page or a huge page.


>
>>> How do you know
>>> which process's address space this address refers to?
>>
>> It is a current task. Multiple userspaces cannot use the same container/tables.
>
> Where is that enforced?


It is accessed from VFIO DMA map/unmap, which are ioctl()s on a container's
fd, which is per process. Same for KVM - when IOMMU groups are registered in
KVM, the fds of the opened IOMMU groups are passed there. Or I did not
understand the question...


> More to the point, that's a VFIO constraint, but it's here affecting
> the design of a structure owned by the platform code.

Right. But keeping in mind KVM, I cannot think of any better design here.


> [snip]
>>>> static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
>>>> @@ -2062,12 +2071,21 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
>>>> int nid = pe->phb->hose->node;
>>>> __u64 bus_offset = num ? pe->tce_bypass_base : 0;
>>>> long ret;
>>>> + unsigned long *uas, uas_cb = sizeof(*uas) * (window_size >> page_shift);
>>>> +
>>>> + uas = vzalloc(uas_cb);
>>>> + if (!uas)
>>>> + return -ENOMEM;
>>>
>>> I don't see why this is allocated both here as well as in
>>> take_ownership.
>>
>> Where else? The only alternative is vfio_iommu_spapr_tce but I really do not
>> want to touch iommu_table fields there.
>
> Well to put it another way, why isn't take_ownership calling create
> itself (or at least a common helper).

I am trying to keep the DDW stuff away from the platform-oriented
arch/powerpc/kernel/iommu.c, whose main purpose is to implement
iommu_alloc() & co. - it already has enough in it.

I'd rather move the it_userspace allocation completely to vfio_iommu_spapr_tce
(should have done that earlier, actually) - would this be ok?
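Roughly what I have in mind (a sketch only - the helper names are made up
here, and the exact hook in the VFIO driver is still to be decided):

/* Rough sketch: the VFIO SPAPR TCE driver owns the userspace view */
static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl)
{
	unsigned long cb = tbl->it_size * sizeof(*tbl->it_userspace);

	tbl->it_userspace = vzalloc(cb);

	return tbl->it_userspace ? 0 : -ENOMEM;
}

static void tce_iommu_userspace_view_free(struct iommu_table *tbl)
{
	vfree(tbl->it_userspace);
	tbl->it_userspace = NULL;
}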


> Clearly the it_userspace table needs to have lifetime which matches
> the TCE table itself, so there should be a single function that marks
> the beginning of that joint lifetime.


No. it_userspace lives only as long as the platform code does not control
the table. For IODA2 that equals the lifetime of the table; for
IODA1/P5IOC2 it does not.



>>> Isn't this function used for core-kernel users of the
>>> iommu as well, in which case it shouldn't need the it_userspace.
>>
>>
>> No. This is an iommu_table_group_ops callback which calls what the platform
>> code calls (pnv_pci_create_table()) plus allocates this it_userspace thing.
>> The callback is only called from VFIO.
>
> Ok.
>
> As touched on above it seems more like this should be owned by VFIO
> code than the platform code.

Agree now :) I'll move the allocation to VFIO. Thanks!


--
Alexey

2015-05-01 09:49:06

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 22/32] powerpc/powernv: Implement multilevel TCE tables

On 04/29/2015 03:04 PM, David Gibson wrote:
> On Sat, Apr 25, 2015 at 10:14:46PM +1000, Alexey Kardashevskiy wrote:
>> TCE tables might get too big in case of 4K IOMMU pages and DDW enabled
>> on huge guests (hundreds of GB of RAM) so the kernel might be unable to
>> allocate contiguous chunk of physical memory to store the TCE table.
>>
>> To address this, POWER8 CPU (actually, IODA2) supports multi-level TCE tables,
>> up to 5 levels which splits the table into a tree of smaller subtables.
>>
>> This adds multi-level TCE tables support to pnv_pci_create_table()
>> and pnv_pci_free_table() helpers.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>> Changes:
>> v9:
>> * moved from ioda2 to common powernv pci code
>> * fixed cleanup if allocation fails in a middle
>> * removed check for the size - all boundary checks happen in the calling code
>> anyway
>> ---
>> arch/powerpc/include/asm/iommu.h | 2 +
>> arch/powerpc/platforms/powernv/pci-ioda.c | 15 +++--
>> arch/powerpc/platforms/powernv/pci.c | 94 +++++++++++++++++++++++++++++--
>> arch/powerpc/platforms/powernv/pci.h | 4 +-
>> 4 files changed, 104 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>> index 7e7ca0a..0f50ee2 100644
>> --- a/arch/powerpc/include/asm/iommu.h
>> +++ b/arch/powerpc/include/asm/iommu.h
>> @@ -96,6 +96,8 @@ struct iommu_pool {
>> struct iommu_table {
>> unsigned long it_busno; /* Bus number this table belongs to */
>> unsigned long it_size; /* Size of iommu table in entries */
>> + unsigned long it_indirect_levels;
>> + unsigned long it_level_size;
>> unsigned long it_offset; /* Offset into global table */
>> unsigned long it_base; /* mapped address of tce table */
>> unsigned long it_index; /* which iommu table this is */
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 59baa15..cc1d09c 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -1967,13 +1967,17 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
>> table_group);
>> struct pnv_phb *phb = pe->phb;
>> int64_t rc;
>> + const unsigned long size = tbl->it_indirect_levels ?
>> + tbl->it_level_size : tbl->it_size;
>> const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
>> const __u64 win_size = tbl->it_size << tbl->it_page_shift;
>>
>> pe_info(pe, "Setting up window at %llx..%llx "
>> - "pgsize=0x%x tablesize=0x%lx\n",
>> + "pgsize=0x%x tablesize=0x%lx "
>> + "levels=%d levelsize=%x\n",
>> start_addr, start_addr + win_size - 1,
>> - 1UL << tbl->it_page_shift, tbl->it_size << 3);
>> + 1UL << tbl->it_page_shift, tbl->it_size << 3,
>> + tbl->it_indirect_levels + 1, tbl->it_level_size << 3);
>>
>> tbl->it_table_group = &pe->table_group;
>>
>> @@ -1984,9 +1988,9 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
>> rc = opal_pci_map_pe_dma_window(phb->opal_id,
>> pe->pe_number,
>> pe->pe_number << 1,
>> - 1,
>> + tbl->it_indirect_levels + 1,
>> __pa(tbl->it_base),
>> - tbl->it_size << 3,
>> + size << 3,
>> 1ULL << tbl->it_page_shift);
>> if (rc) {
>> pe_err(pe, "Failed to configure TCE table, err %ld\n", rc);
>> @@ -2099,7 +2103,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> phb->ioda.m32_pci_base);
>>
>> rc = pnv_pci_create_table(&pe->table_group, pe->phb->hose->node,
>> - 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base, tbl);
>> + 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base,
>> + POWERNV_IOMMU_DEFAULT_LEVELS, tbl);
>> if (rc) {
>> pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc);
>> return;
>> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
>> index 6bcfad5..fc129c4 100644
>> --- a/arch/powerpc/platforms/powernv/pci.c
>> +++ b/arch/powerpc/platforms/powernv/pci.c
>> @@ -46,6 +46,8 @@
>> #define cfg_dbg(fmt...) do { } while(0)
>> //#define cfg_dbg(fmt...) printk(fmt)
>>
>> +#define ROUND_UP(x, n) (((x) + (n) - 1ULL) & ~((n) - 1ULL))
>
> Use the existing ALIGN_UP macro instead of creating a new one.


Ok. I knew it existed, it is just _ALIGN_UP (with an underscore) and
PPC-only - this is why I did not find it :)


>> #ifdef CONFIG_PCI_MSI
>> static int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
>> {
>> @@ -577,6 +579,19 @@ struct pci_ops pnv_pci_ops = {
>> static __be64 *pnv_tce(struct iommu_table *tbl, long idx)
>> {
>> __be64 *tmp = ((__be64 *)tbl->it_base);
>> + int level = tbl->it_indirect_levels;
>> + const long shift = ilog2(tbl->it_level_size);
>> + unsigned long mask = (tbl->it_level_size - 1) << (level * shift);
>> +
>> + while (level) {
>> + int n = (idx & mask) >> (level * shift);
>> + unsigned long tce = be64_to_cpu(tmp[n]);
>> +
>> + tmp = __va(tce & ~(TCE_PCI_READ | TCE_PCI_WRITE));
>> + idx &= ~mask;
>> + mask >>= shift;
>> + --level;
>> + }
>>
>> return tmp + idx;
>> }
>> @@ -648,12 +663,18 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
>> }
>>
>> static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift,
>> + unsigned levels, unsigned long limit,
>> unsigned long *tce_table_allocated)
>> {
>> struct page *tce_mem = NULL;
>> - __be64 *addr;
>> + __be64 *addr, *tmp;
>> unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT;
>> unsigned long local_allocated = 1UL << (order + PAGE_SHIFT);
>> + unsigned entries = 1UL << (shift - 3);
>> + long i;
>> +
>> + if (limit == *tce_table_allocated)
>> + return NULL;
>
> If this is for what I think, it seems a bit unsafe. Shouldn't it be
>> =, otherwise it could fail to trip if the limit isn't exactly a
>> multiple of the bottom level allocation unit.

Good point, will fix.


>> tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
>> if (!tce_mem) {
>> @@ -662,14 +683,33 @@ static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift,
>> }
>> addr = page_address(tce_mem);
>> memset(addr, 0, local_allocated);
>> - *tce_table_allocated = local_allocated;
>> +
>> + --levels;
>> + if (!levels) {
>> + /* Update tce_table_allocated with bottom level table size only */
>> + *tce_table_allocated += local_allocated;
>> + return addr;
>> + }
>> +
>> + for (i = 0; i < entries; ++i) {
>> + tmp = pnv_alloc_tce_table_pages(nid, shift, levels, limit,
>> + tce_table_allocated);
>
> Urgh.. it's a limited depth so it *might* be ok, but recursion is
> generally avoided in the kernel, becuase of the very limited stack
> size.


It is 5 levels max, with about 7 64bit values per frame, so there should be
room for it on the stack. Avoiding recursion here - I can do that, but it is
going to look ugly :-/


>> + if (!tmp)
>> + break;
>> +
>> + addr[i] = cpu_to_be64(__pa(tmp) |
>> + TCE_PCI_READ | TCE_PCI_WRITE);
>> + }
>
> It also seems like it would make sense for this function ti set
> it_indirect_levels ant it_level_size, rather than leaving it to the
> caller.


Mmm. Are you sure? It calls itself recursively, so it does not seem like
the right place for setting up it_indirect_levels and it_level_size.


>> return addr;
>> }
>>
>> +static void pnv_free_tce_table_pages(unsigned long addr, unsigned long size,
>> + unsigned level);
>> +
>> long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
>> __u64 bus_offset, __u32 page_shift, __u64 window_size,
>> - struct iommu_table *tbl)
>> + __u32 levels, struct iommu_table *tbl)
>> {
>> void *addr;
>> unsigned long tce_table_allocated = 0;
>> @@ -678,16 +718,34 @@ long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
>> unsigned table_shift = entries_shift + 3;
>> const unsigned long tce_table_size = max(0x1000UL, 1UL << table_shift);
>>
>> + if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS))
>> + return -EINVAL;
>> +
>> if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size))
>> return -EINVAL;
>>
>> + /* Adjust direct table size from window_size and levels */
>> + entries_shift = ROUND_UP(entries_shift, levels) / levels;
>
> ROUND_UP() only works if the second parameter is a power of 2. Is
> that always true for levels?
>
> For division rounding up, the usual idiom is just ((a + (b - 1)) / b)


Yes, I think this is what I actually wanted.
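So the reworked computation would look roughly like this (sketch only,
reusing the variable names from the patch):

/* Split the total index bits evenly across levels, rounding up,
 * e.g. a 30-bit index over 4 levels becomes 8 bits per level. */
unsigned entries_shift = window_shift - page_shift;
unsigned level_shift;

entries_shift = DIV_ROUND_UP(entries_shift, levels);	/* (a + b - 1) / b */
level_shift = entries_shift + 3;			/* 8 bytes per TCE */
level_shift = max_t(unsigned, level_shift, PAGE_SHIFT);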


>> + table_shift = entries_shift + 3;
>> + table_shift = max_t(unsigned, table_shift, PAGE_SHIFT);
>
> Does the PAGE_SHIFT rounding make sense any more? I would have
> thought you'd round the level size up to page size, rather than the
> whole thing.


At this point in the code @table_shift is level_shift but it is not that
obvious :) I'll rework it. Thanks.


>> /* Allocate TCE table */
>> addr = pnv_alloc_tce_table_pages(nid, table_shift,
>> - &tce_table_allocated);
>> + levels, tce_table_size, &tce_table_allocated);
>> + if (!addr)
>> + return -ENOMEM;
>> +
>> + if (tce_table_size != tce_table_allocated) {
>> + pnv_free_tce_table_pages((unsigned long) addr,
>> + tbl->it_level_size, tbl->it_indirect_levels);
>> + return -ENOMEM;
>> + }
>>
>> /* Setup linux iommu table */
>> pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, bus_offset,
>> page_shift);
>> + tbl->it_level_size = 1ULL << (table_shift - 3);
>> + tbl->it_indirect_levels = levels - 1;
>>
>> pr_info("Created TCE table: window size = %08llx, "
>> "tablesize = %lx (%lx), start @%08llx\n",
>> @@ -697,12 +755,38 @@ long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
>> return 0;
>> }
>>
>> +static void pnv_free_tce_table_pages(unsigned long addr, unsigned long size,
>> + unsigned level)
>> +{
>> + addr &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +
>> + if (level) {
>> + long i;
>> + u64 *tmp = (u64 *) addr;
>> +
>> + for (i = 0; i < size; ++i) {
>> + unsigned long hpa = be64_to_cpu(tmp[i]);
>> +
>> + if (!(hpa & (TCE_PCI_READ | TCE_PCI_WRITE)))
>> + continue;
>> +
>> + pnv_free_tce_table_pages((unsigned long) __va(hpa),
>> + size, level - 1);
>> + }
>> + }
>> +
>> + free_pages(addr, get_order(size << 3));
>> +}
>> +
>> void pnv_pci_free_table(struct iommu_table *tbl)
>> {
>> + const unsigned long size = tbl->it_indirect_levels ?
>> + tbl->it_level_size : tbl->it_size;
>> +
>> if (!tbl->it_size)
>> return;
>>
>> - free_pages(tbl->it_base, get_order(tbl->it_size << 3));
>> + pnv_free_tce_table_pages(tbl->it_base, size, tbl->it_indirect_levels);
>> iommu_reset_table(tbl, "pnv");
>> }
>>
>> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>> index e6cbbec..3d1ff584 100644
>> --- a/arch/powerpc/platforms/powernv/pci.h
>> +++ b/arch/powerpc/platforms/powernv/pci.h
>> @@ -218,9 +218,11 @@ int pnv_pci_cfg_write(struct pci_dn *pdn,
>> extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
>> void *tce_mem, u64 tce_size,
>> u64 dma_offset, unsigned page_shift);
>> +#define POWERNV_IOMMU_DEFAULT_LEVELS 1
>> +#define POWERNV_IOMMU_MAX_LEVELS 5
>> extern long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
>> __u64 bus_offset, __u32 page_shift, __u64 window_size,
>> - struct iommu_table *tbl);
>> + __u32 levels, struct iommu_table *tbl);
>> extern void pnv_pci_free_table(struct iommu_table *tbl);
>> extern void pnv_pci_init_p5ioc2_hub(struct device_node *np);
>> extern void pnv_pci_init_ioda_hub(struct device_node *np);
>


--
Alexey

2015-05-01 10:13:15

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 20/32] powerpc/powernv/ioda2: Introduce pnv_pci_create_table/pnv_pci_free_table

On 04/29/2015 02:39 PM, David Gibson wrote:
> On Sat, Apr 25, 2015 at 10:14:44PM +1000, Alexey Kardashevskiy wrote:
>> This is a part of moving TCE table allocation into an iommu_ops
>> callback to support multiple IOMMU groups per one VFIO container.
>>
>> This moves a table creation window to the file with common powernv-pci
>> helpers as it does not do anything IODA2-specific.
>>
>> This adds pnv_pci_free_table() helper to release the actual TCE table.
>>
>> This enforces window size to be a power of two.
>>
>> This should cause no behavioural change.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> Reviewed-by: David Gibson <[email protected]>
>> ---
>> Changes:
>> v9:
>> * moved helpers to the common powernv pci.c file from pci-ioda.c
>> * moved bits from pnv_pci_create_table() to pnv_alloc_tce_table_pages()
>> ---
>> arch/powerpc/platforms/powernv/pci-ioda.c | 36 ++++++------------
>> arch/powerpc/platforms/powernv/pci.c | 61 +++++++++++++++++++++++++++++++
>> arch/powerpc/platforms/powernv/pci.h | 4 ++
>> 3 files changed, 76 insertions(+), 25 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index a80be34..b9b3773 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -1307,8 +1307,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
>> if (rc)
>> pe_warn(pe, "OPAL error %ld release DMA window\n", rc);
>>
>> - iommu_reset_table(tbl, of_node_full_name(dev->dev.of_node));
>> - free_pages(addr, get_order(TCE32_TABLE_SIZE));
>> + pnv_pci_free_table(tbl);
>> }
>>
>> static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>> @@ -2039,10 +2038,7 @@ static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
>> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> struct pnv_ioda_pe *pe)
>> {
>> - struct page *tce_mem = NULL;
>> - void *addr;
>> struct iommu_table *tbl = &pe->table_group.tables[0];
>> - unsigned int tce_table_size, end;
>> int64_t rc;
>>
>> /* We shouldn't already have a 32-bit DMA associated */
>> @@ -2053,29 +2049,20 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>>
>> /* The PE will reserve all possible 32-bits space */
>> pe->tce32_seg = 0;
>> - end = (1 << ilog2(phb->ioda.m32_pci_base));
>> - tce_table_size = (end / 0x1000) * 8;
>> pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
>> - end);
>> + phb->ioda.m32_pci_base);
>>
>> - /* Allocate TCE table */
>> - tce_mem = alloc_pages_node(phb->hose->node, GFP_KERNEL,
>> - get_order(tce_table_size));
>> - if (!tce_mem) {
>> - pe_err(pe, "Failed to allocate a 32-bit TCE memory\n");
>> - goto fail;
>> + rc = pnv_pci_create_table(&pe->table_group, pe->phb->hose->node,
>> + 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base, tbl);
>> + if (rc) {
>> + pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc);
>> + return;
>> }
>> - addr = page_address(tce_mem);
>> - memset(addr, 0, tce_table_size);
>> -
>> - /* Setup iommu */
>> - tbl->it_table_group = &pe->table_group;
>> -
>> - /* Setup linux iommu table */
>> - pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
>> - IOMMU_PAGE_SHIFT_4K);
>>
>> tbl->it_ops = &pnv_ioda2_iommu_ops;
>> +
>> + /* Setup iommu */
>> + tbl->it_table_group = &pe->table_group;
>> iommu_init_table(tbl, phb->hose->node);
>> #ifdef CONFIG_IOMMU_API
>> pe->table_group.ops = &pnv_pci_ioda2_ops;
>> @@ -2121,8 +2108,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> fail:
>> if (pe->tce32_seg >= 0)
>> pe->tce32_seg = -1;
>> - if (tce_mem)
>> - __free_pages(tce_mem, get_order(tce_table_size));
>> + pnv_pci_free_table(tbl);
>> }
>>
>> static void pnv_ioda_setup_dma(struct pnv_phb *phb)
>> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
>> index e8802ac..6bcfad5 100644
>> --- a/arch/powerpc/platforms/powernv/pci.c
>> +++ b/arch/powerpc/platforms/powernv/pci.c
>> @@ -20,7 +20,9 @@
>> #include <linux/io.h>
>> #include <linux/msi.h>
>> #include <linux/iommu.h>
>> +#include <linux/memblock.h>
>>
>> +#include <asm/mmzone.h>
>> #include <asm/sections.h>
>> #include <asm/io.h>
>> #include <asm/prom.h>
>> @@ -645,6 +647,65 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
>> tbl->it_type = TCE_PCI;
>> }
>>
>> +static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift,
>> + unsigned long *tce_table_allocated)
>
> I'm a bit confused by the tce_table_allocated parameter. What's the
> circumstance where more memory is requested than required, and why
> does it matter to the caller?
>
>> +{
>> + struct page *tce_mem = NULL;
>> + __be64 *addr;
>> + unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT;
>> + unsigned long local_allocated = 1UL << (order + PAGE_SHIFT);
>> +
>> + tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
>> + if (!tce_mem) {
>> + pr_err("Failed to allocate a TCE memory, order=%d\n", order);
>> + return NULL;
>> + }
>> + addr = page_address(tce_mem);
>> + memset(addr, 0, local_allocated);
>> + *tce_table_allocated = local_allocated;
>> +
>> + return addr;
>> +}
>> +
>> +long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
>> + __u64 bus_offset, __u32 page_shift, __u64 window_size,
>> + struct iommu_table *tbl)
>
> The table_group parameter is redundant, isn't it? It must be equal to
> tbl->table_group, yes?
>
> Or would it make more sense for this function to set
> tbl->table_group?

I removed table_group from here.


> And for that matter wouldn't it make more sense for
> this to set it_size as well?


Missed this comment. It does set it_size by calling
pnv_pci_setup_iommu_table().



>> +{
>> + void *addr;
>> + unsigned long tce_table_allocated = 0;
>> + const unsigned window_shift = ilog2(window_size);
>> + unsigned entries_shift = window_shift - page_shift;
>> + unsigned table_shift = entries_shift + 3;
>> + const unsigned long tce_table_size = max(0x1000UL, 1UL << table_shift);
>
> So, here you round up to 4k, the in the alloc function you round up to
> PAGE_SIZE (which may or may not be the same). It's not clear to me why
> there are two rounds of rounding up.


@tce_table_size will later be programmed into IODA2 via OPAL, and OPAL will
reject a window if it is less than 4K. I'll rework the whole thing to just
align it to PAGE_SIZE as that is what it really is anyway.




--
Alexey

2015-05-01 11:26:59

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 28/32] powerpc/mmu: Add userspace-to-physical addresses translation cache

On 04/29/2015 05:01 PM, David Gibson wrote:
> On Sat, Apr 25, 2015 at 10:14:52PM +1000, Alexey Kardashevskiy wrote:
>> We are adding support for DMA memory pre-registration to be used in
>> conjunction with VFIO. The idea is that the userspace which is going to
>> run a guest may want to pre-register a user space memory region so
>> it all gets pinned once and never goes away. Having this done,
>> a hypervisor will not have to pin/unpin pages on every DMA map/unmap
>> request. This is going to help with multiple pinning of the same memory
>> and in-kernel acceleration of DMA requests.
>>
>> This adds a list of memory regions to mm_context_t. Each region consists
>> of a header and a list of physical addresses. This adds API to:
>> 1. register/unregister memory regions;
>> 2. do final cleanup (which puts all pre-registered pages);
>> 3. do userspace to physical address translation;
>> 4. manage a mapped pages counter; when it is zero, it is safe to
>> unregister the region.
>>
>> Multiple registration of the same region is allowed, kref is used to
>> track the number of registrations.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>> Changes:
>> v8:
>> * s/mm_iommu_table_group_mem_t/struct mm_iommu_table_group_mem_t/
>> * fixed error fallback look (s/[i]/[j]/)
>> ---
>> arch/powerpc/include/asm/mmu-hash64.h | 3 +
>> arch/powerpc/include/asm/mmu_context.h | 17 +++
>> arch/powerpc/mm/Makefile | 1 +
>> arch/powerpc/mm/mmu_context_hash64.c | 6 +
>> arch/powerpc/mm/mmu_context_hash64_iommu.c | 215 +++++++++++++++++++++++++++++
>> 5 files changed, 242 insertions(+)
>> create mode 100644 arch/powerpc/mm/mmu_context_hash64_iommu.c
>>
>> diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
>> index 1da6a81..a82f534 100644
>> --- a/arch/powerpc/include/asm/mmu-hash64.h
>> +++ b/arch/powerpc/include/asm/mmu-hash64.h
>> @@ -536,6 +536,9 @@ typedef struct {
>> /* for 4K PTE fragment support */
>> void *pte_frag;
>> #endif
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> + struct list_head iommu_group_mem_list;
>> +#endif
>
> Urgh. I know I'm not one to talk, having done the hugepage crap in
> there, but man mm_context_t has grown to a bloated mess from orginally
> being just intended as a context ID integer :/.


Where else would I put it then?... The other way to go would be some global
map of pid <-> iommu_group_mem_list, which would need to be available from
both VFIO and KVM.


>> } mm_context_t;
>>
>>
>> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
>> index 73382eb..d6116ca 100644
>> --- a/arch/powerpc/include/asm/mmu_context.h
>> +++ b/arch/powerpc/include/asm/mmu_context.h
>> @@ -16,6 +16,23 @@
>> */
>> extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
>> extern void destroy_context(struct mm_struct *mm);
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> +struct mm_iommu_table_group_mem_t;
>> +
>> +extern bool mm_iommu_preregistered(void);
>> +extern long mm_iommu_alloc(unsigned long ua, unsigned long entries,
>> + struct mm_iommu_table_group_mem_t **pmem);
>> +extern struct mm_iommu_table_group_mem_t *mm_iommu_get(unsigned long ua,
>> + unsigned long entries);
>> +extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
>> +extern void mm_iommu_cleanup(mm_context_t *ctx);
>> +extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
>> + unsigned long size);
>> +extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>> + unsigned long ua, unsigned long *hpa);
>> +extern long mm_iommu_mapped_update(struct mm_iommu_table_group_mem_t *mem,
>> + bool inc);
>> +#endif
>>
>> extern void switch_mmu_context(struct mm_struct *prev, struct mm_struct *next);
>> extern void switch_slb(struct task_struct *tsk, struct mm_struct *mm);
>> diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
>> index 9c8770b..e216704 100644
>> --- a/arch/powerpc/mm/Makefile
>> +++ b/arch/powerpc/mm/Makefile
>> @@ -36,3 +36,4 @@ obj-$(CONFIG_PPC_SUBPAGE_PROT) += subpage-prot.o
>> obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
>> obj-$(CONFIG_HIGHMEM) += highmem.o
>> obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o
>> +obj-$(CONFIG_SPAPR_TCE_IOMMU) += mmu_context_hash64_iommu.o
>> diff --git a/arch/powerpc/mm/mmu_context_hash64.c b/arch/powerpc/mm/mmu_context_hash64.c
>> index 178876ae..eb3080c 100644
>> --- a/arch/powerpc/mm/mmu_context_hash64.c
>> +++ b/arch/powerpc/mm/mmu_context_hash64.c
>> @@ -89,6 +89,9 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
>> #ifdef CONFIG_PPC_64K_PAGES
>> mm->context.pte_frag = NULL;
>> #endif
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> + INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
>> +#endif
>> return 0;
>> }
>>
>> @@ -132,6 +135,9 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
>>
>> void destroy_context(struct mm_struct *mm)
>> {
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>> + mm_iommu_cleanup(&mm->context);
>> +#endif
>>
>> #ifdef CONFIG_PPC_ICSWX
>> drop_cop(mm->context.acop, mm);
>> diff --git a/arch/powerpc/mm/mmu_context_hash64_iommu.c b/arch/powerpc/mm/mmu_context_hash64_iommu.c
>> new file mode 100644
>> index 0000000..af7668c
>> --- /dev/null
>> +++ b/arch/powerpc/mm/mmu_context_hash64_iommu.c
>> @@ -0,0 +1,215 @@
>> +/*
>> + * IOMMU helpers in MMU context.
>> + *
>> + * Copyright (C) 2015 IBM Corp. <[email protected]>
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU General Public License
>> + * as published by the Free Software Foundation; either version
>> + * 2 of the License, or (at your option) any later version.
>> + *
>> + */
>> +
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/rculist.h>
>> +#include <linux/vmalloc.h>
>> +#include <linux/kref.h>
>> +#include <asm/mmu_context.h>
>> +
>> +struct mm_iommu_table_group_mem_t {
>> + struct list_head next;
>> + struct rcu_head rcu;
>> + struct kref kref; /* one reference per VFIO container */
>> + atomic_t mapped; /* number of currently mapped pages */
>> + u64 ua; /* userspace address */
>> + u64 entries; /* number of entries in hpas[] */
>
> Maybe 'npages', since this is used to determine the range of user
> addresses covered, not just the number of entries in hpas.


Hm. Ok :)


>> + u64 *hpas; /* vmalloc'ed */
>> +};
>> +
>> +bool mm_iommu_preregistered(void)
>> +{
>> + if (!current || !current->mm)
>> + return false;
>> +
>> + return !list_empty(&current->mm->context.iommu_group_mem_list);
>> +}
>> +EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
>> +
>> +long mm_iommu_alloc(unsigned long ua, unsigned long entries,
>> + struct mm_iommu_table_group_mem_t **pmem)
>> +{
>> + struct mm_iommu_table_group_mem_t *mem;
>> + long i, j;
>> + struct page *page = NULL;
>> +
>> + list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
>> + next) {
>> + if ((mem->ua == ua) && (mem->entries == entries))
>> + return -EBUSY;
>> +
>> + /* Overlap? */
>> + if ((mem->ua < (ua + (entries << PAGE_SHIFT))) &&
>> + (ua < (mem->ua + (mem->entries << PAGE_SHIFT))))
>> + return -EINVAL;
>> + }
>> +
>> + mem = kzalloc(sizeof(*mem), GFP_KERNEL);
>> + if (!mem)
>> + return -ENOMEM;
>> +
>> + mem->hpas = vzalloc(entries * sizeof(mem->hpas[0]));
>> + if (!mem->hpas) {
>> + kfree(mem);
>> + return -ENOMEM;
>> + }
>> +
>> + for (i = 0; i < entries; ++i) {
>> + if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
>> + 1/* pages */, 1/* iswrite */, &page)) {
>
> Do you really need to call gup() in a loop? It can do more than one
> page at a time..


Ufff. gup() returns the number of pages pinned or -errno if none. So if the
return value is positive but less than the requested number of pages, it is
still an error. Functions like this make me nervous :(


> That might work better if you kept a list of struct page *s instead of
> hpas.

I only need struct page* when releasing the registered area. In other cases I
just need a fast conversion from a userspace address to a host physical
address, including in real mode. Ideally I would use page_address(), which
works in real mode in my case but in general does not have to. Using
addresses rather than page structs makes it more explicit - I need an
address, I store an address, simple.

I can change to page structs if you think it makes more sense - should I?
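For reference, pinning the whole range with one gup call and treating a
short count as failure would look roughly like this (a sketch only, assuming
a temporary struct page* array is acceptable and ignoring the int-vs-long
width of the page count):

static long mm_iommu_pin(struct mm_iommu_table_group_mem_t *mem,
		unsigned long ua, unsigned long entries)
{
	struct page **pages;
	long i, pinned;

	pages = vzalloc(entries * sizeof(pages[0]));
	if (!pages)
		return -ENOMEM;

	pinned = get_user_pages_fast(ua, entries, 1 /* write */, pages);
	if (pinned != entries) {
		/* Partial success is still a failure: undo what was pinned */
		for (i = 0; i < pinned; ++i)
			put_page(pages[i]);
		vfree(pages);
		return -EFAULT;
	}

	for (i = 0; i < entries; ++i)
		mem->hpas[i] = page_to_pfn(pages[i]) << PAGE_SHIFT;

	vfree(pages);

	return 0;
}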




>> + for (j = 0; j < i; ++j)
>> + put_page(pfn_to_page(
>> + mem->hpas[j] >> PAGE_SHIFT));
>> + vfree(mem->hpas);
>> + kfree(mem);
>> + return -EFAULT;
>> + }
>> +
>> + mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
>> + }
>> +
>> + kref_init(&mem->kref);
>> + atomic_set(&mem->mapped, 0);
>> + mem->ua = ua;
>> + mem->entries = entries;
>> + *pmem = mem;
>> +
>> + list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list);
>> +
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(mm_iommu_alloc);
>> +
>> +static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
>> +{
>> + long i;
>> + struct page *page = NULL;
>> +
>> + for (i = 0; i < mem->entries; ++i) {
>> + if (!mem->hpas[i])
>> + continue;
>> +
>> + page = pfn_to_page(mem->hpas[i] >> PAGE_SHIFT);
>> + if (!page)
>> + continue;
>> +
>> + put_page(page);
>> + mem->hpas[i] = 0;
>> + }
>> +}
>> +
>> +static void mm_iommu_free(struct rcu_head *head)
>> +{
>> + struct mm_iommu_table_group_mem_t *mem = container_of(head,
>> + struct mm_iommu_table_group_mem_t, rcu);
>> +
>> + mm_iommu_unpin(mem);
>> + vfree(mem->hpas);
>> + kfree(mem);
>> +}
>> +
>> +static void mm_iommu_release(struct kref *kref)
>> +{
>> + struct mm_iommu_table_group_mem_t *mem = container_of(kref,
>> + struct mm_iommu_table_group_mem_t, kref);
>> +
>> + list_del_rcu(&mem->next);
>> + call_rcu(&mem->rcu, mm_iommu_free);
>> +}
>> +
>> +struct mm_iommu_table_group_mem_t *mm_iommu_get(unsigned long ua,
>> + unsigned long entries)
>> +{
>> + struct mm_iommu_table_group_mem_t *mem;
>> +
>> + list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
>> + next) {
>> + if ((mem->ua == ua) && (mem->entries == entries)) {
>> + kref_get(&mem->kref);
>> + return mem;
>> + }
>> + }
>> +
>> + return NULL;
>> +}
>> +EXPORT_SYMBOL_GPL(mm_iommu_get);
>> +
>> +long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
>> +{
>> + if (atomic_read(&mem->mapped))
>> + return -EBUSY;
>
> What prevents a race between the atomic_read() above and the release below?

Ouch. Nothing. And I cannot think of any nice fast solution here...
I could remove @mapped altogether and do kref_get/put(&mem->kref) instead; a
container would hold one reference too. And add a flag to
mm_iommu_table_group_mem_t to record whether mm_iommu_release has been called -
this way I will know that was the very last reference, otherwise I'll
return -EBUSY.

Or change mm_iommu_lookup() to do kref_get() and require every caller of it
to also call mm_iommu_put(), and only call mm_iommu_mapped_update() while the
reference is elevated. And change mm_iommu_put() to return a special code
if that was the very last put() (this would be checked by the
VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY handler only, others would not care).

Any ideas?

I am pretty sure there is something very cool (like RCU) which allows
avoiding locks in this situation, I am just too ignorant and do not know it :)
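Thinking aloud, one lock-free option might be to make the "no active
mappings" state a value that unregistration claims atomically - a rough
sketch only, assuming @mapped were initialised to 1 at registration time
(1 == "registered, no active mappings"):

long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
{
	/* Succeeds only if no mapping has elevated @mapped meanwhile */
	if (atomic_cmpxchg(&mem->mapped, 1, 0) != 1)
		return -EBUSY;

	kref_put(&mem->kref, mm_iommu_release);

	return 0;
}

/* The map side then refuses to use a region that is being torn down */
static bool mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
{
	return atomic_inc_not_zero(&mem->mapped);
}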


>> + kref_put(&mem->kref, mm_iommu_release);
>> +
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(mm_iommu_put);
>> +
>> +struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
>> + unsigned long size)
>> +{
>> + struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
>> +
>> + list_for_each_entry_rcu(mem,
>> + &current->mm->context.iommu_group_mem_list,
>> + next) {
>> + if ((mem->ua <= ua) &&
>> + (ua + size <= mem->ua +
>> + (mem->entries << PAGE_SHIFT))) {
>> + ret = mem;
>> + break;
>> + }
>> + }
>> +
>> + return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(mm_iommu_lookup);
>> +
>> +long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>> + unsigned long ua, unsigned long *hpa)
>
> Return type should be int, it's just an error code.


Is it some generic rule that errors must always be "int"? I was just told
that gcc on PPC64 will generate an extra instruction to truncate a 64bit
long to a 32bit int, so I am just trying to use "long" everywhere. Very
simple, but still an optimization :)


>> +{
>> + const long entry = (ua - mem->ua) >> PAGE_SHIFT;
>> + u64 *va = &mem->hpas[entry];
>> +
>> + if (entry >= mem->entries)
>> + return -EFAULT;
>> +
>> + *hpa = *va | (ua & ~PAGE_MASK);
>> +
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
>> +
>> +long mm_iommu_mapped_update(struct mm_iommu_table_group_mem_t *mem, bool inc)
>> +{
>> + long ret = 0;
>> +
>> + if (inc)
>> + atomic_inc(&mem->mapped);
>> + else
>> + ret = atomic_dec_if_positive(&mem->mapped);
>> +
>> + return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(mm_iommu_mapped_update);
>
> I think this would be clearer as separate inc and dec functions.

Okay.


>> +
>> +void mm_iommu_cleanup(mm_context_t *ctx)
>> +{
>> + while (!list_empty(&ctx->iommu_group_mem_list)) {
>> + struct mm_iommu_table_group_mem_t *mem;
>> +
>> + mem = list_first_entry(&ctx->iommu_group_mem_list,
>> + struct mm_iommu_table_group_mem_t, next);
>> + mm_iommu_release(&mem->kref);
>> + }
>> +}
>


--
Alexey

2015-05-05 13:09:15

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible

On Fri, May 01, 2015 at 04:05:24PM +1000, Alexey Kardashevskiy wrote:
> On 05/01/2015 02:33 PM, David Gibson wrote:
> >On Thu, Apr 30, 2015 at 07:33:09PM +1000, Alexey Kardashevskiy wrote:
> >>On 04/30/2015 05:22 PM, David Gibson wrote:
> >>>On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote:
> >>>>At the moment only one group per container is supported.
> >>>>POWER8 CPUs have more flexible design and allows naving 2 TCE tables per
> >>>>IOMMU group so we can relax this limitation and support multiple groups
> >>>>per container.
> >>>
> >>>It's not obvious why allowing multiple TCE tables per PE has any
> >>>pearing on allowing multiple groups per container.
> >>
> >>
> >>This patchset is a global TCE tables rework (patches 1..30, roughly) with 2
> >>outcomes:
> >>1. reusing the same IOMMU table for multiple groups - patch 31;
> >>2. allowing dynamic create/remove of IOMMU tables - patch 32.
> >>
> >>I can remove this one from the patchset and post it separately later but
> >>since 1..30 aim to support both 1) and 2), I'd think I better keep them all
> >>together (might explain some of changes I do in 1..30).
> >
> >The combined patchset is fine. My comment is because your commit
> >message says that multiple groups are possible *because* 2 TCE tables
> >per group are allowed, and it's not at all clear why one follows from
> >the other.
>
>
> Ah. That's wrong indeed, I'll fix it.
>
>
> >>>>This adds TCE table descriptors to a container and uses iommu_table_group_ops
> >>>>to create/set DMA windows on IOMMU groups so the same TCE tables will be
> >>>>shared between several IOMMU groups.
> >>>>
> >>>>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>>>[aw: for the vfio related changes]
> >>>>Acked-by: Alex Williamson <[email protected]>
> >>>>---
> >>>>Changes:
> >>>>v7:
> >>>>* updated doc
> >>>>---
> >>>> Documentation/vfio.txt | 8 +-
> >>>> drivers/vfio/vfio_iommu_spapr_tce.c | 268 ++++++++++++++++++++++++++----------
> >>>> 2 files changed, 199 insertions(+), 77 deletions(-)
> >>>>
> >>>>diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> >>>>index 94328c8..7dcf2b5 100644
> >>>>--- a/Documentation/vfio.txt
> >>>>+++ b/Documentation/vfio.txt
> >>>>@@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
> >>>>
> >>>> This implementation has some specifics:
> >>>>
> >>>>-1) Only one IOMMU group per container is supported as an IOMMU group
> >>>>-represents the minimal entity which isolation can be guaranteed for and
> >>>>-groups are allocated statically, one per a Partitionable Endpoint (PE)
> >>>>+1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
> >>>>+container is supported as an IOMMU table is allocated at the boot time,
> >>>>+one table per a IOMMU group which is a Partitionable Endpoint (PE)
> >>>> (PE is often a PCI domain but not always).
> >>>
> >>>I thought the more fundamental problem was that different PEs tended
> >>>to use disjoint bus address ranges, so even by duplicating put_tce
> >>>across PEs you couldn't have a common address space.
> >>
> >>
> >>Sorry, I am not following you here.
> >>
> >>By duplicating put_tce, I can have multiple IOMMU groups on the same virtual
> >>PHB in QEMU, "[PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple groups
> >>per container" does this, the address ranges will the same.
> >
> >Oh, ok. For some reason I thought that (at least on the older
> >machines) the different PEs used different and not easily changeable
> >DMA windows in bus addresses space.
>
>
> They do use different tables (which VFIO does not get to remove/create and
> uses these old helpers - iommu_take/release_ownership), correct. But all
> these windows are mapped at zero on a PE's PCI bus and nothing prevents me
> from updating all these tables with the same TCE values when handling
> H_PUT_TCE. Yes it is slow but it works (bit more details below).

Um.. I'm pretty sure that contradicts what Ben was saying on the
thread.

> >>What I cannot do on p5ioc2 is programming the same table to multiple
> >>physical PHBs (or I could but it is very different than IODA2 and pretty
> >>ugly and might not always be possible because I would have to allocate these
> >>pages from some common pool and face problems like fragmentation).
> >
> >So allowing multiple groups per container should be possible (at the
> >kernel rather than qemu level) by writing the same value to multiple
> >TCE tables. I guess its not worth doing for just the almost-obsolete
> >IOMMUs though.
>
>
> It is done at QEMU level though. As it works now, QEMU opens a group, walks
> through all existing containers and tries attaching a new group there. If it
> succeeded (x86 always; POWER8 after this patch), a TCE table is shared. If
> it failed, QEMU creates another container, attaches it to the same VFIO/PHB
> address space and attaches a group there.
>
> Then the only thing left is repeating ioctl() in vfio_container_ioctl() for
> every container in the VFIO address space; this is what that QEMU patch does
> (the first version of that patch called ioctl() only for the first container
> in the address space).
>
> From the kernel prospective there are 2 isolated containers; I'd like to
> keep it this way.
>
> btw thanks for the detailed review :)
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2015-05-05 13:09:24

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2

On Fri, May 01, 2015 at 04:27:47PM +1000, Alexey Kardashevskiy wrote:
> On 05/01/2015 03:23 PM, David Gibson wrote:
> >On Fri, May 01, 2015 at 02:35:23PM +1000, Alexey Kardashevskiy wrote:
> >>On 04/30/2015 04:55 PM, David Gibson wrote:
> >>>On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:
> >>>>The existing implementation accounts the whole DMA window in
> >>>>the locked_vm counter. This is going to be worse with multiple
> >>>>containers and huge DMA windows. Also, real-time accounting would requite
> >>>>additional tracking of accounted pages due to the page size difference -
> >>>>IOMMU uses 4K pages and system uses 4K or 64K pages.
> >>>>
> >>>>Another issue is that actual pages pinning/unpinning happens on every
> >>>>DMA map/unmap request. This does not affect the performance much now as
> >>>>we spend way too much time on switching context between
> >>>>guest/userspace/host but this will start to matter when we add in-kernel
> >>>>DMA map/unmap acceleration.
> >>>>
> >>>>This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
> >>>>New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
> >>>>2 new ioctls to register/unregister DMA memory -
> >>>>VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
> >>>>which receive user space address and size of a memory region which
> >>>>needs to be pinned/unpinned and counted in locked_vm.
> >>>>New IOMMU splits physical pages pinning and TCE table update into 2 different
> >>>>operations. It requires 1) guest pages to be registered first 2) consequent
> >>>>map/unmap requests to work only with pre-registered memory.
> >>>>For the default single window case this means that the entire guest
> >>>>(instead of 2GB) needs to be pinned before using VFIO.
> >>>>When a huge DMA window is added, no additional pinning will be
> >>>>required, otherwise it would be guest RAM + 2GB.
> >>>>
> >>>>The new memory registration ioctls are not supported by
> >>>>VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
> >>>>will require memory to be preregistered in order to work.
> >>>>
> >>>>The accounting is done per the user process.
> >>>>
> >>>>This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
> >>>>can do with v1 or v2 IOMMUs.
> >>>>
> >>>>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>>>[aw: for the vfio related changes]
> >>>>Acked-by: Alex Williamson <[email protected]>
> >>>>---
> >>>>Changes:
> >>>>v9:
> >>>>* s/tce_get_hva_cached/tce_iommu_use_page_v2/
> >>>>
> >>>>v7:
> >>>>* now memory is registered per mm (i.e. process)
> >>>>* moved memory registration code to powerpc/mmu
> >>>>* merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
> >>>>* limited new ioctls to v2 IOMMU
> >>>>* updated doc
> >>>>* unsupported ioctls return -ENOTTY instead of -EPERM
> >>>>
> >>>>v6:
> >>>>* tce_get_hva_cached() returns hva via a pointer
> >>>>
> >>>>v4:
> >>>>* updated docs
> >>>>* s/kzmalloc/vzalloc/
> >>>>* in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
> >>>>replaced offset with index
> >>>>* renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
> >>>>and removed duplicating vfio_iommu_spapr_register_memory
> >>>>---
> >>>> Documentation/vfio.txt | 23 ++++
> >>>> drivers/vfio/vfio_iommu_spapr_tce.c | 230 +++++++++++++++++++++++++++++++++++-
> >>>> include/uapi/linux/vfio.h | 27 +++++
> >>>> 3 files changed, 274 insertions(+), 6 deletions(-)
> >>>>
> >>>>diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> >>>>index 96978ec..94328c8 100644
> >>>>--- a/Documentation/vfio.txt
> >>>>+++ b/Documentation/vfio.txt
> >>>>@@ -427,6 +427,29 @@ The code flow from the example above should be slightly changed:
> >>>>
> >>>> ....
> >>>>
> >>>>+5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
> >>>>+VFIO_IOMMU_DISABLE and implements 2 new ioctls:
> >>>>+VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
> >>>>+(which are unsupported in v1 IOMMU).
> >>>
> >>>A summary of the semantic differences between v1 and v2 would be nice.
> >>>At this point it's not really clear to me if there's a case for
> >>>creating v2, or if this could just be done by adding (optional)
> >>>functionality to v1.
> >>
> >>v1: memory preregistration is not supported; explicit enable/disable ioctls
> >>are required
> >>
> >>v2: memory preregistration is required; explicit enable/disable are
> >>prohibited (as they are not needed).
> >>
> >>Mixing these in one IOMMU type caused a lot of problems like should I
> >>increment locked_vm by the 32bit window size on enable() or not; what do I
> >>do about page pinning on map/unmap (check if it is from registered memory
> >>and do not pin?).
> >>
> >>Having 2 IOMMU models makes everything a lot simpler.
> >
> >Ok. Would it simplify it further if you made v2 only usable on IODA2
> >hardware?
>
>
> Very little. V2 addresses the memory pinning issue, which is handled the same
> way on IODA2 and older hardware, including KVM acceleration. Whether to enable
> DDW or not is handled just fine via extra properties in the GET_INFO
> ioctl().
>
> IODA2 and others are different in handling multiple groups per container but
> this does not require changes to userspace API.
>
> And remember, the only machine I can use 100% of time is POWER7/P5IOC2 so it
> is really useful if at least some bits of the patchset can be tested there;
> if it was a bit less different from IODA2, I would have even implemented DDW
> there too :)

Hm, ok.

> >>>>+PPC64 paravirtualized guests generate a lot of map/unmap requests,
> >>>>+and the handling of those includes pinning/unpinning pages and updating
> >>>>+mm::locked_vm counter to make sure we do not exceed the rlimit.
> >>>>+The v2 IOMMU splits accounting and pinning into separate operations:
> >>>>+
> >>>>+- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
> >>>>+receive a user space address and size of the block to be pinned.
> >>>>+Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
> >>>>+be called with the exact address and size used for registering
> >>>>+the memory block. The userspace is not expected to call these often.
> >>>>+The ranges are stored in a linked list in a VFIO container.
> >>>>+
> >>>>+- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
> >>>>+IOMMU table and do not do pinning; instead these check that the userspace
> >>>>+address is from pre-registered range.
> >>>>+
> >>>>+This separation helps in optimizing DMA for guests.
> >>>>+
> >>>> -------------------------------------------------------------------------------
> >>>>
> >>>> [1] VFIO was originally an acronym for "Virtual Function I/O" in its
> >>>>diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>index 892a584..4cfc2c1 100644
> >>>>--- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>
> >>>So, from things you said at other points, I thought the idea was that
> >>>this registration stuff could also be used on non-Power IOMMUs. Did I
> >>>misunderstand, or is that a possibility for the future?
> >>
> >>
> >>I never said a thing about non-PPC :) I seriously doubt any other arch has
> >>this hypervisor interface with H_PUT_TCE (maybe s390? :) ); for others
> >>there is no profit from memory preregistration as they (at least x86) map
> >>the entire guest before it starts, which is essentially the same as preregistration.
> >>
> >>
> >>btw later we may want to implement simple IOMMU v3 which will do pinning +
> >>locked_vm when mapping as x86 does, for http://dpdk.org/ - these things do
> >>not really have to bother with preregistration (even if it is just a single
> >>additional ioctl).
> >>
> >>
> >>
> >>>>@@ -21,6 +21,7 @@
> >>>> #include <linux/vfio.h>
> >>>> #include <asm/iommu.h>
> >>>> #include <asm/tce.h>
> >>>>+#include <asm/mmu_context.h>
> >>>>
> >>>> #define DRIVER_VERSION "0.1"
> >>>> #define DRIVER_AUTHOR "[email protected]"
> >>>>@@ -91,8 +92,58 @@ struct tce_container {
> >>>> struct iommu_group *grp;
> >>>> bool enabled;
> >>>> unsigned long locked_pages;
> >>>>+ bool v2;
> >>>> };
> >>>>
> >>>>+static long tce_unregister_pages(struct tce_container *container,
> >>>>+ __u64 vaddr, __u64 size)
> >>>>+{
> >>>>+ long ret;
> >>>>+ struct mm_iommu_table_group_mem_t *mem;
> >>>>+
> >>>>+ if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
> >>>>+ return -EINVAL;
> >>>>+
> >>>>+ mem = mm_iommu_get(vaddr, size >> PAGE_SHIFT);
> >>>>+ if (!mem)
> >>>>+ return -EINVAL;
> >>>>+
> >>>>+ ret = mm_iommu_put(mem); /* undo kref_get() from mm_iommu_get() */
> >>>>+ if (!ret)
> >>>>+ ret = mm_iommu_put(mem);
> >>>>+
> >>>>+ return ret;
> >>>>+}
> >>>>+
> >>>>+static long tce_register_pages(struct tce_container *container,
> >>>>+ __u64 vaddr, __u64 size)
> >>>>+{
> >>>>+ long ret = 0;
> >>>>+ struct mm_iommu_table_group_mem_t *mem;
> >>>>+ unsigned long entries = size >> PAGE_SHIFT;
> >>>>+
> >>>>+ if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
> >>>>+ ((vaddr + size) < vaddr))
> >>>>+ return -EINVAL;
> >>>>+
> >>>>+ mem = mm_iommu_get(vaddr, entries);
> >>>>+ if (!mem) {
> >>>>+ ret = try_increment_locked_vm(entries);
> >>>>+ if (ret)
> >>>>+ return ret;
> >>>>+
> >>>>+ ret = mm_iommu_alloc(vaddr, entries, &mem);
> >>>>+ if (ret) {
> >>>>+ decrement_locked_vm(entries);
> >>>>+ return ret;
> >>>>+ }
> >>>>+ }
> >>>>+
> >>>>+ container->enabled = true;
> >>>>+
> >>>>+ return 0;
> >>>>+}
> >>>
> >>>So requiring that registered regions get unregistered with exactly the
> >>>same addr/length is reasonable. I'm a bit less convinced that
> >>>disallowing overlaps is a good idea. What if two libraries in the
> >>>same process are trying to use VFIO - they may not know if the regions
> >>>they try to register are overlapping.
> >>
> >>
> >>Sorry, I do not understand. A library allocates RAM. A library is expected
> >>to register it via an additional ioctl, that's it. Another library allocates
> >>another chunk of memory and it won't overlap and the registered areas won't
> >>either.
> >
> >So the case I'm thinking is where the library does VFIO using a buffer
> >passed into it from the program at large. Another library does the
> >same.
> >
> >The main program, unaware of the VFIO shenanigans, passes different
> >parts of the same page to the 2 libraries.
> >
> >This is somewhat similar to the case of the horribly, horribly broken
> >semantics of POSIX file range locks (it's both hard to implement and
> >dangerous in the multi-library case similar to above).
>
>
> Ok. I'll implement x86-alike V3 SPAPR TCE IOMMU for these people, later :)
>
> V2 addresses issues caused by H_PUT_TCE + DDW RTAS interfaces.
>
>
>
> >>>> static bool tce_page_is_contained(struct page *page, unsigned page_shift)
> >>>> {
> >>>> /*
> >>>>@@ -205,7 +256,7 @@ static void *tce_iommu_open(unsigned long arg)
> >>>> {
> >>>> struct tce_container *container;
> >>>>
> >>>>- if (arg != VFIO_SPAPR_TCE_IOMMU) {
> >>>>+ if ((arg != VFIO_SPAPR_TCE_IOMMU) && (arg != VFIO_SPAPR_TCE_v2_IOMMU)) {
> >>>> pr_err("tce_vfio: Wrong IOMMU type\n");
> >>>> return ERR_PTR(-EINVAL);
> >>>> }
> >>>>@@ -215,6 +266,7 @@ static void *tce_iommu_open(unsigned long arg)
> >>>> return ERR_PTR(-ENOMEM);
> >>>>
> >>>> mutex_init(&container->lock);
> >>>>+ container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
> >>>>
> >>>> return container;
> >>>> }
> >>>>@@ -243,6 +295,47 @@ static void tce_iommu_unuse_page(struct tce_container *container,
> >>>> put_page(page);
> >>>> }
> >>>>
> >>>>+static int tce_iommu_use_page_v2(unsigned long tce, unsigned long size,
> >>>>+ unsigned long *phpa, struct mm_iommu_table_group_mem_t **pmem)
> >>
> >>
> >>You suggested s/tce_get_hpa/tce_iommu_use_page/ but in this particular patch
> >>it is confusing as tce_iommu_unuse_page_v2() calls it to find the corresponding
> >>mm_iommu_table_group_mem_t by the userspace address of a page that is
> >>no longer being used.
> >>
> >>tce_iommu_use_page (without v2) does use the page but this one I'll rename
> >>back to tce_iommu_ua_to_hpa_v2(), is that ok?
> >
> >Sorry, I couldn't follow this comment.
>
>
> For V1 IOMMU, I used to have:
> tce_get_hpa() - this converted UA to linear address and did gup();
> tce_iommu_unuse_page() - this did put_page().
>
> You suggested (*) to rename the first one to tce_use_page() which makes sense.
>
> V2 introduces its own versions of use/unuse but these use preregistered
> memory and do not do gup()/put_page(). I named them:
> tce_get_hpa_cached()
> tce_iommu_unuse_page_v2()
>
> then, replaying your comment (*) on V2 IOMMU, I renamed tce_get_hpa_cached()
> to tce_iommu_use_page_v2(). And I do not like the result now (in the chunk
> below). I'll rename it to tce_iommu_ua_to_hpa_v2(), will it be ok?

Uh, I guess so. To me "use_page" suggests incrementing the reference
or locking it or something along those lines, so I think that name
should follow the "gup".

>
>
>
> >
> >>
> >>
> >>>>+{
> >>>>+ long ret = 0;
> >>>>+ struct mm_iommu_table_group_mem_t *mem;
> >>>>+
> >>>>+ mem = mm_iommu_lookup(tce, size);
> >>>>+ if (!mem)
> >>>>+ return -EINVAL;
> >>>>+
> >>>>+ ret = mm_iommu_ua_to_hpa(mem, tce, phpa);
> >>>>+ if (ret)
> >>>>+ return -EINVAL;
> >>>>+
> >>>>+ *pmem = mem;
> >>>>+
> >>>>+ return 0;
> >>>>+}
> >>>>+
> >>>>+static void tce_iommu_unuse_page_v2(struct iommu_table *tbl,
> >>>>+ unsigned long entry)
> >>>>+{
> >>>>+ struct mm_iommu_table_group_mem_t *mem = NULL;
> >>>>+ int ret;
> >>>>+ unsigned long hpa = 0;
> >>>>+ unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> >>>>+
> >>>>+ if (!pua || !current || !current->mm)
> >>>>+ return;
> >>>>+
> >>>>+ ret = tce_iommu_use_page_v2(*pua, IOMMU_PAGE_SIZE(tbl),
> >>>>+ &hpa, &mem);
> >>>>+ if (ret)
> >>>>+ pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
> >>>>+ __func__, *pua, entry, ret);
> >>>>+ if (mem)
> >>>>+ mm_iommu_mapped_update(mem, false);
> >>>>+
> >>>>+ *pua = 0;
> >>>>+}
> >>>>+
>
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

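To make the v1/v2 split discussed in this message concrete, here is a minimal userspace sketch of the v2 flow: preregister a memory block once (this is where pinning and locked_vm accounting happen), then MAP_DMA only writes TCEs. It assumes the uAPI as proposed in this series (VFIO_SPAPR_TCE_v2_IOMMU selected via VFIO_SET_IOMMU, VFIO_IOMMU_SPAPR_REGISTER_MEMORY) and a container fd that already has a group attached; sketch_preregister_and_map() is a hypothetical helper and error handling is omitted.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int sketch_preregister_and_map(int container_fd, void *buf,
				      uint64_t size, uint64_t iova)
{
	struct vfio_iommu_spapr_register_memory reg = {
		.argsz = sizeof(reg),
		.vaddr = (uintptr_t)buf,	/* must be page aligned */
		.size  = size,
	};
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uintptr_t)buf,
		.iova  = iova,
		.size  = size,
	};

	/* v2: no VFIO_IOMMU_ENABLE; pages are pinned and locked_vm is
	 * charged here, once per registered block. */
	if (ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg) != 0)
		return -1;

	/* Map/unmap no longer pin anything; the kernel only checks that
	 * vaddr falls inside a preregistered block and updates the TCEs. */
	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}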


2015-05-05 13:09:30

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

On Fri, May 01, 2015 at 04:53:08PM +1000, Alexey Kardashevskiy wrote:
> On 05/01/2015 03:12 PM, David Gibson wrote:
> >On Fri, May 01, 2015 at 02:10:58PM +1000, Alexey Kardashevskiy wrote:
> >>On 04/29/2015 04:40 PM, David Gibson wrote:
> >>>On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote:
> >>>>This adds a way for the IOMMU user to know how much a new table will
> >>>>use so it can be accounted in the locked_vm limit before allocation
> >>>>happens.
> >>>>
> >>>>This stores the allocated table size in pnv_pci_create_table()
> >>>>so the locked_vm counter can be updated correctly when a table is
> >>>>being disposed.
> >>>>
> >>>>This defines an iommu_table_group_ops callback to let VFIO know
> >>>>how much memory will be locked if a table is created.
> >>>>
> >>>>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>>>---
> >>>>Changes:
> >>>>v9:
> >>>>* reimplemented the whole patch
> >>>>---
> >>>> arch/powerpc/include/asm/iommu.h | 5 +++++
> >>>> arch/powerpc/platforms/powernv/pci-ioda.c | 14 ++++++++++++
> >>>> arch/powerpc/platforms/powernv/pci.c | 36 +++++++++++++++++++++++++++++++
> >>>> arch/powerpc/platforms/powernv/pci.h | 2 ++
> >>>> 4 files changed, 57 insertions(+)
> >>>>
> >>>>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>>>index 1472de3..9844c106 100644
> >>>>--- a/arch/powerpc/include/asm/iommu.h
> >>>>+++ b/arch/powerpc/include/asm/iommu.h
> >>>>@@ -99,6 +99,7 @@ struct iommu_table {
> >>>> unsigned long it_size; /* Size of iommu table in entries */
> >>>> unsigned long it_indirect_levels;
> >>>> unsigned long it_level_size;
> >>>>+ unsigned long it_allocated_size;
> >>>> unsigned long it_offset; /* Offset into global table */
> >>>> unsigned long it_base; /* mapped address of tce table */
> >>>> unsigned long it_index; /* which iommu table this is */
> >>>>@@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
> >>>> struct iommu_table_group;
> >>>>
> >>>> struct iommu_table_group_ops {
> >>>>+ unsigned long (*get_table_size)(
> >>>>+ __u32 page_shift,
> >>>>+ __u64 window_size,
> >>>>+ __u32 levels);
> >>>> long (*create_table)(struct iommu_table_group *table_group,
> >>>> int num,
> >>>> __u32 page_shift,
> >>>>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>>>index e0be556..7f548b4 100644
> >>>>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>>>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>>>@@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
> >>>> }
> >>>>
> >>>> #ifdef CONFIG_IOMMU_API
> >>>>+static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
> >>>>+ __u64 window_size, __u32 levels)
> >>>>+{
> >>>>+ unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
> >>>>+
> >>>>+ if (!ret)
> >>>>+ return ret;
> >>>>+
> >>>>+ /* Add size of it_userspace */
> >>>>+ return ret + (window_size >> page_shift) * sizeof(unsigned long);
> >>>
> >>>This doesn't make much sense. The userspace view can't possibly be a
> >>>property of the specific low-level IOMMU model.
> >>
> >>
> >>This it_userspace thing is all about memory preregistration.
> >>
> >>I need some way to track how many actual mappings the
> >>mm_iommu_table_group_mem_t has in order to decide whether to allow
> >>unregistering or not.
> >>
> >>When I clear a TCE, I can read the old value, which is a host physical address
> >>that I cannot use to find the preregistered region and adjust the mappings
> >>counter; I can only use userspace addresses for this (not even guest
> >>physical addresses as it is VFIO and probably no KVM).
> >>
> >>So I have to keep userspace addresses somewhere, one per IOMMU page, and the
> >>iommu_table seems a natural place for this.
> >
> >Well.. sort of. But as noted elsewhere this pulls VFIO specific
> >constraints into a platform code structure. And whether you get this
> >table depends on the platform IOMMU type rather than on what VFIO
> >wants to do with it, which doesn't make sense.
> >
> >What might make more sense is an opaque pointer in iommu_table for use
> >by the table "owner" (in the take_ownership sense). The pointer would
> >be stored in iommu_table, but VFIO is responsible for populating and
> >managing its contents.
> >
> >Or you could just put the userspace mappings in the container.
> >Although you might want a different data structure in that case.
>
> Nope. I need this table in in-kernel acceleration to update the mappings
> counter per mm_iommu_table_group_mem_t. In KVM's real mode handlers, I only
> have IOMMU tables, not containers or groups. QEMU creates a guest view of
> the table (KVM_CREATE_SPAPR_TCE) specifying a LIOBN, and then attaches TCE
> tables to it via set of ioctls (one per IOMMU group) to VFIO KVM device.
>
> So if I call it it_opaque (instead of it_userspace), I will still need a
> common place (visible to VFIO and PowerKVM) to put this:
> #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry)

I think it should be in a VFIO header. If I'm understanding right
this part of the PowerKVM code is explicitly VFIO aware - that's kind
of the point.

> So far this place was arch/powerpc/include/asm/iommu.h and the iommu_table
> struct.
>
>
> >The other thing to bear in mind is that registered regions are likely
> >to be large contiguous blocks in user addresses, though obviously not
> >contiguous in physical addr. So you might be able to compactify this
> >information by storing it as a list of variable length blocks in
> >userspace address space, rather than a per-page address.
>
> It is 8 bytes per system page - 8/65536 = 0.00012 (or 26MB for 200GB guest)
> - very little overhead.
>
>
> >But.. isn't there a bigger problem here? As Paulus was pointing out,
> >there's nothing guaranteeing the page tables continue to contain the
> >same page as was there at gup() time.
>
> This can happen if the userspace remaps memory which it registered/mapped
> for DMA via VFIO, no? If so, then the userspace just should not do this, it
> is DMA, it cannot be moved like this. What am I missing here?
>
>
> >What's going to happen if you REGISTER a memory region, then mremap()
> >over it?
>
> The registered pages will remain pinned and PUT_TCE will use that region for
> translation (and this will fail as the userspace addresses changed).
>
> I do not see how it is different from the situation when the userspace
> mapped a page and mremap()ed it while it is DMA-mapped.

True, it's basically the same. Hrm, so what guarantees that a dma_map,
mremap(), dma_unmap sequence will unreference the correct pages?

> >Then attempt to PUT_TCE a page in the region? Or what if you
> >mremap() it to someplace else then try to PUT_TCE a page there?
>
> This will fail - a new userspace address has to be preregistered.
>
> >Or REGISTER it again in its new location?
>
> It will be pinned twice + some memory overhead to store the same host
> physical address(es) twice.
>
>
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

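As a back-of-the-envelope companion to the numbers in this message (8 bytes per TCE for the table, plus 8 bytes per IOMMU page for the userspace view), here is a sketch of how a caller could turn the window geometry into a locked_vm delta. It assumes a single-level table and kernel context (ALIGN, PAGE_SIZE); the real pnv_get_table_size() also accounts for indirect levels, so this is illustrative only and sketch_window_locked_pages() is a hypothetical name.

static unsigned long sketch_window_locked_pages(unsigned long window_size,
						unsigned int page_shift)
{
	unsigned long entries = window_size >> page_shift;
	unsigned long tce_bytes = entries * 8;			   /* hardware TCEs */
	unsigned long uas_bytes = entries * sizeof(unsigned long); /* it_userspace */

	/*
	 * Example: a 200GB window with 64K IOMMU pages is ~3.3M entries,
	 * i.e. ~25MB for the userspace view alone - the "26MB for a 200GB
	 * guest" figure quoted above.
	 */
	return (ALIGN(tce_bytes, PAGE_SIZE) + ALIGN(uas_bytes, PAGE_SIZE))
			>> PAGE_SHIFT;
}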


2015-05-05 13:09:53

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table

On Fri, May 01, 2015 at 05:12:45PM +1000, Alexey Kardashevskiy wrote:
> On 05/01/2015 02:23 PM, David Gibson wrote:
> >On Fri, May 01, 2015 at 02:01:17PM +1000, Alexey Kardashevskiy wrote:
> >>On 04/29/2015 04:31 PM, David Gibson wrote:
> >>>On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote:
> >>>>In order to support memory pre-registration, we need a way to track
> >>>>the use of every registered memory region and only allow unregistration
> >>>>if a region is not in use anymore. So we need a way to tell what
> >>>>region the just cleared TCE came from.
> >>>>
> >>>>This adds a userspace view of the TCE table into iommu_table struct.
> >>>>It contains userspace address, one per TCE entry. The table is only
> >>>>allocated when the ownership over an IOMMU group is taken which means
> >>>>it is only used from outside of the powernv code (such as VFIO).
> >>>>
> >>>>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>>>---
> >>>>Changes:
> >>>>v9:
> >>>>* fixed code flow in error cases added in v8
> >>>>
> >>>>v8:
> >>>>* added ENOMEM on failed vzalloc()
> >>>>---
> >>>> arch/powerpc/include/asm/iommu.h | 6 ++++++
> >>>> arch/powerpc/kernel/iommu.c | 18 ++++++++++++++++++
> >>>> arch/powerpc/platforms/powernv/pci-ioda.c | 22 ++++++++++++++++++++--
> >>>> 3 files changed, 44 insertions(+), 2 deletions(-)
> >>>>
> >>>>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>>>index 7694546..1472de3 100644
> >>>>--- a/arch/powerpc/include/asm/iommu.h
> >>>>+++ b/arch/powerpc/include/asm/iommu.h
> >>>>@@ -111,9 +111,15 @@ struct iommu_table {
> >>>> unsigned long *it_map; /* A simple allocation bitmap for now */
> >>>> unsigned long it_page_shift;/* table iommu page size */
> >>>> struct iommu_table_group *it_table_group;
> >>>>+ unsigned long *it_userspace; /* userspace view of the table */
> >>>
> >>>A single unsigned long doesn't seem like enough.
> >>
> >>Why single? This is an array.
> >
> >As in single per page.
>
>
> Sorry, I am not following you here.
> It is per IOMMU page. MAP/UNMAP work with IOMMU pages which are fully backed
> by either a system page or a huge page.
>
>
> >
> >>>How do you know
> >>>which process's address space this address refers to?
> >>
> >>It is the current task. Multiple userspaces cannot use the same container/tables.
> >
> >Where is that enforced?
>
>
> It is accessed from VFIO DMA map/unmap which are ioctls() on a container's
> fd, which is per process.

Usually, but what enforces that? If you open a container fd, then
fork(), and attempt to map from both parent and child, what happens?

> Same for KVM - when it registers IOMMU groups in
> KVM, fd's of opened IOMMU groups are passed there. Or I did not understand
> the question...
>
>
> >More to the point, that's a VFIO constraint, but it's here affecting
> >the design of a structure owned by the platform code.
>
> Right. But keeping in mind KVM, I cannot think of any better design here.
>
>
> >[snip]
> >>>> static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
> >>>>@@ -2062,12 +2071,21 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
> >>>> int nid = pe->phb->hose->node;
> >>>> __u64 bus_offset = num ? pe->tce_bypass_base : 0;
> >>>> long ret;
> >>>>+ unsigned long *uas, uas_cb = sizeof(*uas) * (window_size >> page_shift);
> >>>>+
> >>>>+ uas = vzalloc(uas_cb);
> >>>>+ if (!uas)
> >>>>+ return -ENOMEM;
> >>>
> >>>I don't see why this is allocated both here as well as in
> >>>take_ownership.
> >>
> >>Where else? The only alternative is vfio_iommu_spapr_tce but I really do not
> >>want to touch iommu_table fields there.
> >
> >Well, to put it another way, why isn't take_ownership calling create
> >itself (or at least a common helper)?
>
> I am trying to keep DDW stuff away from platform-oriented
> arch/powerpc/kernel/iommu.c whose main purpose is to implement
> iommu_alloc()&co. It already has
>
> I'd rather move it_userspace allocation completely to vfio_iommu_spapr_tce
> (should have done earlier, actually), would this be ok?

Yeah, that makes more sense to me.

> >Clearly the it_userspace table needs to have lifetime which matches
> >the TCE table itself, so there should be a single function that marks
> >the beginning of that joint lifetime.
>
>
> No. it_userspace lives as long as the platform code does not control the
> table. For IODA2 it is equal for the lifetime of the table, for IODA1/P5IOC2
> it is not.

Right, I was imprecise. I was thinking of the ownership change as an
end/beginning of lifetime even for IODA1, because the table has to be
fully cleared at that point, even though it's not actually
reallocated.

> >>>Isn't this function used for core-kernel users of the
> >>>iommu as well, in which case it shouldn't need the it_userspace.
> >>
> >>
> >>No. This is an iommu_table_group_ops callback which calls what the platform
> >>code calls (pnv_pci_create_table()) plus allocates this it_userspace thing.
> >>The callback is only called from VFIO.
> >
> >Ok.
> >
> >As touched on above it seems more like this should be owned by VFIO
> >code than the platform code.
>
> Agree now :) I'll move the allocation to VFIO. Thanks!
>
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

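For orientation, here is one possible shape of the per-entry userspace view being debated in this message; SKETCH_USERSPACE_ENTRY and sketch_record_ua are hypothetical stand-ins for the series' IOMMU_TABLE_USERSPACE_ENTRY and its callers, and where the array actually lives (iommu_table vs the VFIO container) is exactly the open question in the thread.

/* One unsigned long (userspace address) per IOMMU page, vzalloc'ed when
 * ownership of the group is taken, NULL while the platform owns the table. */
#define SKETCH_USERSPACE_ENTRY(tbl, entry) \
	((tbl)->it_userspace ? &(tbl)->it_userspace[(entry)] : NULL)

static void sketch_record_ua(struct iommu_table *tbl, unsigned long entry,
			     unsigned long ua)
{
	unsigned long *pua = SKETCH_USERSPACE_ENTRY(tbl, entry);

	/* Remember the userspace address so that clearing the TCE later can
	 * find the preregistered region and drop its mapped count. */
	if (pua)
		*pua = ua;
}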


2015-05-05 13:09:42

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 22/32] powerpc/powernv: Implement multilevel TCE tables

On Fri, May 01, 2015 at 07:48:49PM +1000, Alexey Kardashevskiy wrote:
> On 04/29/2015 03:04 PM, David Gibson wrote:
> >On Sat, Apr 25, 2015 at 10:14:46PM +1000, Alexey Kardashevskiy wrote:
> >>TCE tables might get too big in case of 4K IOMMU pages and DDW enabled
> >>on huge guests (hundreds of GB of RAM) so the kernel might be unable to
> >>allocate a contiguous chunk of physical memory to store the TCE table.
> >>
> >>To address this, POWER8 CPU (actually, IODA2) supports multi-level TCE tables,
> >>up to 5 levels which splits the table into a tree of smaller subtables.
> >>
> >>This adds multi-level TCE tables support to pnv_pci_create_table()
> >>and pnv_pci_free_table() helpers.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>---
> >>Changes:
> >>v9:
> >>* moved from ioda2 to common powernv pci code
> >>* fixed cleanup if allocation fails in a middle
> >>* removed check for the size - all boundary checks happen in the calling code
> >>anyway
> >>---
> >> arch/powerpc/include/asm/iommu.h | 2 +
> >> arch/powerpc/platforms/powernv/pci-ioda.c | 15 +++--
> >> arch/powerpc/platforms/powernv/pci.c | 94 +++++++++++++++++++++++++++++--
> >> arch/powerpc/platforms/powernv/pci.h | 4 +-
> >> 4 files changed, 104 insertions(+), 11 deletions(-)
> >>
> >>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>index 7e7ca0a..0f50ee2 100644
> >>--- a/arch/powerpc/include/asm/iommu.h
> >>+++ b/arch/powerpc/include/asm/iommu.h
> >>@@ -96,6 +96,8 @@ struct iommu_pool {
> >> struct iommu_table {
> >> unsigned long it_busno; /* Bus number this table belongs to */
> >> unsigned long it_size; /* Size of iommu table in entries */
> >>+ unsigned long it_indirect_levels;
> >>+ unsigned long it_level_size;
> >> unsigned long it_offset; /* Offset into global table */
> >> unsigned long it_base; /* mapped address of tce table */
> >> unsigned long it_index; /* which iommu table this is */
> >>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>index 59baa15..cc1d09c 100644
> >>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>@@ -1967,13 +1967,17 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> >> table_group);
> >> struct pnv_phb *phb = pe->phb;
> >> int64_t rc;
> >>+ const unsigned long size = tbl->it_indirect_levels ?
> >>+ tbl->it_level_size : tbl->it_size;
> >> const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
> >> const __u64 win_size = tbl->it_size << tbl->it_page_shift;
> >>
> >> pe_info(pe, "Setting up window at %llx..%llx "
> >>- "pgsize=0x%x tablesize=0x%lx\n",
> >>+ "pgsize=0x%x tablesize=0x%lx "
> >>+ "levels=%d levelsize=%x\n",
> >> start_addr, start_addr + win_size - 1,
> >>- 1UL << tbl->it_page_shift, tbl->it_size << 3);
> >>+ 1UL << tbl->it_page_shift, tbl->it_size << 3,
> >>+ tbl->it_indirect_levels + 1, tbl->it_level_size << 3);
> >>
> >> tbl->it_table_group = &pe->table_group;
> >>
> >>@@ -1984,9 +1988,9 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> >> rc = opal_pci_map_pe_dma_window(phb->opal_id,
> >> pe->pe_number,
> >> pe->pe_number << 1,
> >>- 1,
> >>+ tbl->it_indirect_levels + 1,
> >> __pa(tbl->it_base),
> >>- tbl->it_size << 3,
> >>+ size << 3,
> >> 1ULL << tbl->it_page_shift);
> >> if (rc) {
> >> pe_err(pe, "Failed to configure TCE table, err %ld\n", rc);
> >>@@ -2099,7 +2103,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >> phb->ioda.m32_pci_base);
> >>
> >> rc = pnv_pci_create_table(&pe->table_group, pe->phb->hose->node,
> >>- 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base, tbl);
> >>+ 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base,
> >>+ POWERNV_IOMMU_DEFAULT_LEVELS, tbl);
> >> if (rc) {
> >> pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc);
> >> return;
> >>diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> >>index 6bcfad5..fc129c4 100644
> >>--- a/arch/powerpc/platforms/powernv/pci.c
> >>+++ b/arch/powerpc/platforms/powernv/pci.c
> >>@@ -46,6 +46,8 @@
> >> #define cfg_dbg(fmt...) do { } while(0)
> >> //#define cfg_dbg(fmt...) printk(fmt)
> >>
> >>+#define ROUND_UP(x, n) (((x) + (n) - 1ULL) & ~((n) - 1ULL))
> >
> >Use the existing ALIGN_UP macro instead of creating a new one.
>
> Ok. I knew it existed, it is just _ALIGN_UP (with an underscore) and
> PPC-only - this is why I did not find it :)

I'm pretty sure there's a generic one too. I think it's just plain
"ALIGN".

> >> #ifdef CONFIG_PCI_MSI
> >> static int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
> >> {
> >>@@ -577,6 +579,19 @@ struct pci_ops pnv_pci_ops = {
> >> static __be64 *pnv_tce(struct iommu_table *tbl, long idx)
> >> {
> >> __be64 *tmp = ((__be64 *)tbl->it_base);
> >>+ int level = tbl->it_indirect_levels;
> >>+ const long shift = ilog2(tbl->it_level_size);
> >>+ unsigned long mask = (tbl->it_level_size - 1) << (level * shift);
> >>+
> >>+ while (level) {
> >>+ int n = (idx & mask) >> (level * shift);
> >>+ unsigned long tce = be64_to_cpu(tmp[n]);
> >>+
> >>+ tmp = __va(tce & ~(TCE_PCI_READ | TCE_PCI_WRITE));
> >>+ idx &= ~mask;
> >>+ mask >>= shift;
> >>+ --level;
> >>+ }
> >>
> >> return tmp + idx;
> >> }
> >>@@ -648,12 +663,18 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
> >> }
> >>
> >> static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift,
> >>+ unsigned levels, unsigned long limit,
> >> unsigned long *tce_table_allocated)
> >> {
> >> struct page *tce_mem = NULL;
> >>- __be64 *addr;
> >>+ __be64 *addr, *tmp;
> >> unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT;
> >> unsigned long local_allocated = 1UL << (order + PAGE_SHIFT);
> >>+ unsigned entries = 1UL << (shift - 3);
> >>+ long i;
> >>+
> >>+ if (limit == *tce_table_allocated)
> >>+ return NULL;
> >
> >If this is for what I think, it seems a bit unsafe. Shouldn't it be
> >>=, otherwise it could fail to trip if the limit isn't exactly a
> >>multiple of the bottom level allocation unit.
>
> Good point, will fix.
>
>
> >> tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
> >> if (!tce_mem) {
> >>@@ -662,14 +683,33 @@ static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift,
> >> }
> >> addr = page_address(tce_mem);
> >> memset(addr, 0, local_allocated);
> >>- *tce_table_allocated = local_allocated;
> >>+
> >>+ --levels;
> >>+ if (!levels) {
> >>+ /* Update tce_table_allocated with bottom level table size only */
> >>+ *tce_table_allocated += local_allocated;
> >>+ return addr;
> >>+ }
> >>+
> >>+ for (i = 0; i < entries; ++i) {
> >>+ tmp = pnv_alloc_tce_table_pages(nid, shift, levels, limit,
> >>+ tce_table_allocated);
> >
> >Urgh.. it's a limited depth so it *might* be ok, but recursion is
> >generally avoided in the kernel, because of the very limited stack
> >size.
>
>
> It is 5 levels max, 7 64bit values, so there should be room for it. Avoiding
> recursion here - I can do that but it is going to look ugly :-/

Yeah, I guess. Probably worth a comment noting why the recursion depth
is limited though.

>
>
> >>+ if (!tmp)
> >>+ break;
> >>+
> >>+ addr[i] = cpu_to_be64(__pa(tmp) |
> >>+ TCE_PCI_READ | TCE_PCI_WRITE);
> >>+ }
> >
> >It also seems like it would make sense for this function to set
> >it_indirect_levels and it_level_size, rather than leaving it to the
> >caller.
>
>
> Mmm. Sure? It calls itself in recursion, does not seem like it is the right
> place for setting up it_indirect_levels and it_level_size.

Yeah, ok, I hadn't properly thought through the recursion.

> >> return addr;
> >> }
> >>
> >>+static void pnv_free_tce_table_pages(unsigned long addr, unsigned long size,
> >>+ unsigned level);
> >>+
> >> long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
> >> __u64 bus_offset, __u32 page_shift, __u64 window_size,
> >>- struct iommu_table *tbl)
> >>+ __u32 levels, struct iommu_table *tbl)
> >> {
> >> void *addr;
> >> unsigned long tce_table_allocated = 0;
> >>@@ -678,16 +718,34 @@ long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
> >> unsigned table_shift = entries_shift + 3;
> >> const unsigned long tce_table_size = max(0x1000UL, 1UL << table_shift);
> >>
> >>+ if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS))
> >>+ return -EINVAL;
> >>+
> >> if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size))
> >> return -EINVAL;
> >>
> >>+ /* Adjust direct table size from window_size and levels */
> >>+ entries_shift = ROUND_UP(entries_shift, levels) / levels;
> >
> >ROUND_UP() only works if the second parameter is a power of 2. Is
> >that always true for levels?
> >
> >For division rounding up, the usual idiom is just ((a + (b - 1)) / b)
>
>
> Yes, I think this is what I actually wanted.
>
>
> >>+ table_shift = entries_shift + 3;
> >>+ table_shift = max_t(unsigned, table_shift, PAGE_SHIFT);
> >
> >Does the PAGE_SHIFT rounding make sense any more? I would have
> >thought you'd round the level size up to page size, rather than the
> >whole thing.
>
>
> At this point in the code @table_shift is level_shift but it is not that
> obvious :) I'll rework it. Thanks.
>
>
> >> /* Allocate TCE table */
> >> addr = pnv_alloc_tce_table_pages(nid, table_shift,
> >>- &tce_table_allocated);
> >>+ levels, tce_table_size, &tce_table_allocated);
> >>+ if (!addr)
> >>+ return -ENOMEM;
> >>+
> >>+ if (tce_table_size != tce_table_allocated) {
> >>+ pnv_free_tce_table_pages((unsigned long) addr,
> >>+ tbl->it_level_size, tbl->it_indirect_levels);
> >>+ return -ENOMEM;
> >>+ }
> >>
> >> /* Setup linux iommu table */
> >> pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, bus_offset,
> >> page_shift);
> >>+ tbl->it_level_size = 1ULL << (table_shift - 3);
> >>+ tbl->it_indirect_levels = levels - 1;
> >>
> >> pr_info("Created TCE table: window size = %08llx, "
> >> "tablesize = %lx (%lx), start @%08llx\n",
> >>@@ -697,12 +755,38 @@ long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
> >> return 0;
> >> }
> >>
> >>+static void pnv_free_tce_table_pages(unsigned long addr, unsigned long size,
> >>+ unsigned level)
> >>+{
> >>+ addr &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >>+
> >>+ if (level) {
> >>+ long i;
> >>+ u64 *tmp = (u64 *) addr;
> >>+
> >>+ for (i = 0; i < size; ++i) {
> >>+ unsigned long hpa = be64_to_cpu(tmp[i]);
> >>+
> >>+ if (!(hpa & (TCE_PCI_READ | TCE_PCI_WRITE)))
> >>+ continue;
> >>+
> >>+ pnv_free_tce_table_pages((unsigned long) __va(hpa),
> >>+ size, level - 1);
> >>+ }
> >>+ }
> >>+
> >>+ free_pages(addr, get_order(size << 3));
> >>+}
> >>+
> >> void pnv_pci_free_table(struct iommu_table *tbl)
> >> {
> >>+ const unsigned long size = tbl->it_indirect_levels ?
> >>+ tbl->it_level_size : tbl->it_size;
> >>+
> >> if (!tbl->it_size)
> >> return;
> >>
> >>- free_pages(tbl->it_base, get_order(tbl->it_size << 3));
> >>+ pnv_free_tce_table_pages(tbl->it_base, size, tbl->it_indirect_levels);
> >> iommu_reset_table(tbl, "pnv");
> >> }
> >>
> >>diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> >>index e6cbbec..3d1ff584 100644
> >>--- a/arch/powerpc/platforms/powernv/pci.h
> >>+++ b/arch/powerpc/platforms/powernv/pci.h
> >>@@ -218,9 +218,11 @@ int pnv_pci_cfg_write(struct pci_dn *pdn,
> >> extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
> >> void *tce_mem, u64 tce_size,
> >> u64 dma_offset, unsigned page_shift);
> >>+#define POWERNV_IOMMU_DEFAULT_LEVELS 1
> >>+#define POWERNV_IOMMU_MAX_LEVELS 5
> >> extern long pnv_pci_create_table(struct iommu_table_group *table_group, int nid,
> >> __u64 bus_offset, __u32 page_shift, __u64 window_size,
> >>- struct iommu_table *tbl);
> >>+ __u32 levels, struct iommu_table *tbl);
> >> extern void pnv_pci_free_table(struct iommu_table *tbl);
> >> extern void pnv_pci_init_p5ioc2_hub(struct device_node *np);
> >> extern void pnv_pci_init_ioda_hub(struct device_node *np);
> >
>
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

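To spell out the rounding point raised in this message: the mask-based ROUND_UP only works when the divisor is a power of two, so splitting entries_shift across an arbitrary number of levels wants plain round-up division instead. A small sketch follows; the macro and function names are illustrative only (the kernel's own DIV_ROUND_UP does the same thing).

#define SKETCH_ROUND_UP_POW2(x, n)	(((x) + (n) - 1UL) & ~((n) - 1UL))	/* n must be 2^k */
#define SKETCH_DIV_ROUND_UP(a, b)	(((a) + (b) - 1) / (b))

static unsigned sketch_level_shift(unsigned entries_shift, unsigned levels)
{
	/* e.g. entries_shift = 20, levels = 3 -> 7 bits per level, since
	 * 3 * 7 = 21 >= 20; the power-of-two macro would be wrong for levels = 3. */
	return SKETCH_DIV_ROUND_UP(entries_shift, levels);
}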


2015-05-05 13:10:49

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH kernel v9 28/32] powerpc/mmu: Add userspace-to-physical addresses translation cache

On Fri, May 01, 2015 at 09:26:48PM +1000, Alexey Kardashevskiy wrote:
> On 04/29/2015 05:01 PM, David Gibson wrote:
> >On Sat, Apr 25, 2015 at 10:14:52PM +1000, Alexey Kardashevskiy wrote:
> >>We are adding support for DMA memory pre-registration to be used in
> >>conjunction with VFIO. The idea is that the userspace which is going to
> >>run a guest may want to pre-register a user space memory region so
> >>it all gets pinned once and never goes away. Having this done,
> >>a hypervisor will not have to pin/unpin pages on every DMA map/unmap
> >>request. This is going to help with multiple pinning of the same memory
> >>and in-kernel acceleration of DMA requests.
> >>
> >>This adds a list of memory regions to mm_context_t. Each region consists
> >>of a header and a list of physical addresses. This adds API to:
> >>1. register/unregister memory regions;
> >>2. do final cleanup (which puts all pre-registered pages);
> >>3. do userspace to physical address translation;
> >>4. manage a mapped pages counter; when it is zero, it is safe to
> >>unregister the region.
> >>
> >>Multiple registration of the same region is allowed, kref is used to
> >>track the number of registrations.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>---
> >>Changes:
> >>v8:
> >>* s/mm_iommu_table_group_mem_t/struct mm_iommu_table_group_mem_t/
> >>* fixed error fallback look (s/[i]/[j]/)
> >>---
> >> arch/powerpc/include/asm/mmu-hash64.h | 3 +
> >> arch/powerpc/include/asm/mmu_context.h | 17 +++
> >> arch/powerpc/mm/Makefile | 1 +
> >> arch/powerpc/mm/mmu_context_hash64.c | 6 +
> >> arch/powerpc/mm/mmu_context_hash64_iommu.c | 215 +++++++++++++++++++++++++++++
> >> 5 files changed, 242 insertions(+)
> >> create mode 100644 arch/powerpc/mm/mmu_context_hash64_iommu.c
> >>
> >>diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
> >>index 1da6a81..a82f534 100644
> >>--- a/arch/powerpc/include/asm/mmu-hash64.h
> >>+++ b/arch/powerpc/include/asm/mmu-hash64.h
> >>@@ -536,6 +536,9 @@ typedef struct {
> >> /* for 4K PTE fragment support */
> >> void *pte_frag;
> >> #endif
> >>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>+ struct list_head iommu_group_mem_list;
> >>+#endif
> >
> >Urgh. I know I'm not one to talk, having done the hugepage crap in
> >there, but man mm_context_t has grown to a bloated mess from originally
> >being just intended as a context ID integer :/.
>
>
> Where else to put it then?... The other way to go would be some global map
> of pid<->iommu_group_mem_list which needs to be available from both VFIO and
> KVM.

I'd suggest putting it as a new field in mm_struct, guarded by a
CONFIG_VFIO_PREREGISTER (or something) which you can make sure is
selected by CONFIG_SPAPR_TCE_IOMMU.

>
>
> >> } mm_context_t;
> >>
> >>
> >>diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> >>index 73382eb..d6116ca 100644
> >>--- a/arch/powerpc/include/asm/mmu_context.h
> >>+++ b/arch/powerpc/include/asm/mmu_context.h
> >>@@ -16,6 +16,23 @@
> >> */
> >> extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
> >> extern void destroy_context(struct mm_struct *mm);
> >>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>+struct mm_iommu_table_group_mem_t;
> >>+
> >>+extern bool mm_iommu_preregistered(void);
> >>+extern long mm_iommu_alloc(unsigned long ua, unsigned long entries,
> >>+ struct mm_iommu_table_group_mem_t **pmem);
> >>+extern struct mm_iommu_table_group_mem_t *mm_iommu_get(unsigned long ua,
> >>+ unsigned long entries);
> >>+extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
> >>+extern void mm_iommu_cleanup(mm_context_t *ctx);
> >>+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
> >>+ unsigned long size);
> >>+extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
> >>+ unsigned long ua, unsigned long *hpa);
> >>+extern long mm_iommu_mapped_update(struct mm_iommu_table_group_mem_t *mem,
> >>+ bool inc);
> >>+#endif
> >>
> >> extern void switch_mmu_context(struct mm_struct *prev, struct mm_struct *next);
> >> extern void switch_slb(struct task_struct *tsk, struct mm_struct *mm);
> >>diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
> >>index 9c8770b..e216704 100644
> >>--- a/arch/powerpc/mm/Makefile
> >>+++ b/arch/powerpc/mm/Makefile
> >>@@ -36,3 +36,4 @@ obj-$(CONFIG_PPC_SUBPAGE_PROT) += subpage-prot.o
> >> obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
> >> obj-$(CONFIG_HIGHMEM) += highmem.o
> >> obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o
> >>+obj-$(CONFIG_SPAPR_TCE_IOMMU) += mmu_context_hash64_iommu.o
> >>diff --git a/arch/powerpc/mm/mmu_context_hash64.c b/arch/powerpc/mm/mmu_context_hash64.c
> >>index 178876ae..eb3080c 100644
> >>--- a/arch/powerpc/mm/mmu_context_hash64.c
> >>+++ b/arch/powerpc/mm/mmu_context_hash64.c
> >>@@ -89,6 +89,9 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
> >> #ifdef CONFIG_PPC_64K_PAGES
> >> mm->context.pte_frag = NULL;
> >> #endif
> >>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>+ INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
> >>+#endif
> >> return 0;
> >> }
> >>
> >>@@ -132,6 +135,9 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
> >>
> >> void destroy_context(struct mm_struct *mm)
> >> {
> >>+#ifdef CONFIG_SPAPR_TCE_IOMMU
> >>+ mm_iommu_cleanup(&mm->context);
> >>+#endif
> >>
> >> #ifdef CONFIG_PPC_ICSWX
> >> drop_cop(mm->context.acop, mm);
> >>diff --git a/arch/powerpc/mm/mmu_context_hash64_iommu.c b/arch/powerpc/mm/mmu_context_hash64_iommu.c
> >>new file mode 100644
> >>index 0000000..af7668c
> >>--- /dev/null
> >>+++ b/arch/powerpc/mm/mmu_context_hash64_iommu.c
> >>@@ -0,0 +1,215 @@
> >>+/*
> >>+ * IOMMU helpers in MMU context.
> >>+ *
> >>+ * Copyright (C) 2015 IBM Corp. <[email protected]>
> >>+ *
> >>+ * This program is free software; you can redistribute it and/or
> >>+ * modify it under the terms of the GNU General Public License
> >>+ * as published by the Free Software Foundation; either version
> >>+ * 2 of the License, or (at your option) any later version.
> >>+ *
> >>+ */
> >>+
> >>+#include <linux/sched.h>
> >>+#include <linux/slab.h>
> >>+#include <linux/rculist.h>
> >>+#include <linux/vmalloc.h>
> >>+#include <linux/kref.h>
> >>+#include <asm/mmu_context.h>
> >>+
> >>+struct mm_iommu_table_group_mem_t {
> >>+ struct list_head next;
> >>+ struct rcu_head rcu;
> >>+ struct kref kref; /* one reference per VFIO container */
> >>+ atomic_t mapped; /* number of currently mapped pages */
> >>+ u64 ua; /* userspace address */
> >>+ u64 entries; /* number of entries in hpas[] */
> >
> >Maybe 'npages', since this is used to determine the range of user
> >addresses covered, not just the number of entries in hpas.
>
>
> Hm. Ok :)
>
>
> >>+ u64 *hpas; /* vmalloc'ed */
> >>+};
> >>+
> >>+bool mm_iommu_preregistered(void)
> >>+{
> >>+ if (!current || !current->mm)
> >>+ return false;
> >>+
> >>+ return !list_empty(&current->mm->context.iommu_group_mem_list);
> >>+}
> >>+EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
> >>+
> >>+long mm_iommu_alloc(unsigned long ua, unsigned long entries,
> >>+ struct mm_iommu_table_group_mem_t **pmem)
> >>+{
> >>+ struct mm_iommu_table_group_mem_t *mem;
> >>+ long i, j;
> >>+ struct page *page = NULL;
> >>+
> >>+ list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
> >>+ next) {
> >>+ if ((mem->ua == ua) && (mem->entries == entries))
> >>+ return -EBUSY;
> >>+
> >>+ /* Overlap? */
> >>+ if ((mem->ua < (ua + (entries << PAGE_SHIFT))) &&
> >>+ (ua < (mem->ua + (mem->entries << PAGE_SHIFT))))
> >>+ return -EINVAL;
> >>+ }
> >>+
> >>+ mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> >>+ if (!mem)
> >>+ return -ENOMEM;
> >>+
> >>+ mem->hpas = vzalloc(entries * sizeof(mem->hpas[0]));
> >>+ if (!mem->hpas) {
> >>+ kfree(mem);
> >>+ return -ENOMEM;
> >>+ }
> >>+
> >>+ for (i = 0; i < entries; ++i) {
> >>+ if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
> >>+ 1/* pages */, 1/* iswrite */, &page)) {
> >
> >Do you really need to call gup() in a loop? It can do more than one
> >page at a time..
>
>
> Ufff. gup() returns the number of pages pinned or -errno if none. So if the
> return value is positive but less than the requested number of pages, it is
> still an error. Functions like this make me nervous :(
>
>
> >That might work better if you kept a list of struct page *s instead of
> >hpas.
>
> I only need struct page* when releasing the registered area. In other cases I
> just need fast conversion from an userspace address to a host physical
> address, including real mode. Ideally I would have to use page_address()
> which will work in real mode in my case but in general it does not have to.
> Using addresses rather than page structs makes it more explicit - I need an
> address, I store an address, simple.

Ok, you convinced me. And if you have to translate them each from
struct page to hpa at this point, then the gup() in a loop does make
as much sense as anything, so ok.

> I can change to page structs if you think it makes more sense, should I?
>
>
>
>
> >>+ for (j = 0; j < i; ++j)
> >>+ put_page(pfn_to_page(
> >>+ mem->hpas[j] >> PAGE_SHIFT));
> >>+ vfree(mem->hpas);
> >>+ kfree(mem);
> >>+ return -EFAULT;
> >>+ }
> >>+
> >>+ mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
> >>+ }
> >>+
> >>+ kref_init(&mem->kref);
> >>+ atomic_set(&mem->mapped, 0);
> >>+ mem->ua = ua;
> >>+ mem->entries = entries;
> >>+ *pmem = mem;
> >>+
> >>+ list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list);
> >>+
> >>+ return 0;
> >>+}
> >>+EXPORT_SYMBOL_GPL(mm_iommu_alloc);
> >>+
> >>+static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
> >>+{
> >>+ long i;
> >>+ struct page *page = NULL;
> >>+
> >>+ for (i = 0; i < mem->entries; ++i) {
> >>+ if (!mem->hpas[i])
> >>+ continue;
> >>+
> >>+ page = pfn_to_page(mem->hpas[i] >> PAGE_SHIFT);
> >>+ if (!page)
> >>+ continue;
> >>+
> >>+ put_page(page);
> >>+ mem->hpas[i] = 0;
> >>+ }
> >>+}
> >>+
> >>+static void mm_iommu_free(struct rcu_head *head)
> >>+{
> >>+ struct mm_iommu_table_group_mem_t *mem = container_of(head,
> >>+ struct mm_iommu_table_group_mem_t, rcu);
> >>+
> >>+ mm_iommu_unpin(mem);
> >>+ vfree(mem->hpas);
> >>+ kfree(mem);
> >>+}
> >>+
> >>+static void mm_iommu_release(struct kref *kref)
> >>+{
> >>+ struct mm_iommu_table_group_mem_t *mem = container_of(kref,
> >>+ struct mm_iommu_table_group_mem_t, kref);
> >>+
> >>+ list_del_rcu(&mem->next);
> >>+ call_rcu(&mem->rcu, mm_iommu_free);
> >>+}
> >>+
> >>+struct mm_iommu_table_group_mem_t *mm_iommu_get(unsigned long ua,
> >>+ unsigned long entries)
> >>+{
> >>+ struct mm_iommu_table_group_mem_t *mem;
> >>+
> >>+ list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
> >>+ next) {
> >>+ if ((mem->ua == ua) && (mem->entries == entries)) {
> >>+ kref_get(&mem->kref);
> >>+ return mem;
> >>+ }
> >>+ }
> >>+
> >>+ return NULL;
> >>+}
> >>+EXPORT_SYMBOL_GPL(mm_iommu_get);
> >>+
> >>+long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
> >>+{
> >>+ if (atomic_read(&mem->mapped))
> >>+ return -EBUSY;
> >
> >What prevents a race between the atomic_read() above and the release below?
>
> Ouch. Nothing. And I cannot think of any nice fast solution here...
> I can remove @mapped altogether and do kref_get/put(&mem->kref) instead; a
> container will hold one reference too. And add a flag to
> mm_iommu_table_group_mem_t to know if mm_iommu_release has been called -
> this way I will know that was the very last reference, otherwise I'll return
> -EBUSY.
>
> Or change mm_iommu_lookup() to do kref_get() and require every caller of it
> also call mm_iommu_put() and only call mm_iommu_mapped_update() when the
> reference is elevated. And change mm_iommu_put() to return a special code if
> that was the very last put() (will be checked by
> VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY handler only, others would not care).
>
> Any ideas?
>
> I am pretty sure there is something very cool (like RCU) which allows
> avoiding locks in this situation, I am just too ignorant and do not know it
> :)

I can't quickly see an answer either, sorry.


> >>+ kref_put(&mem->kref, mm_iommu_release);
> >>+
> >>+ return 0;
> >>+}
> >>+EXPORT_SYMBOL_GPL(mm_iommu_put);
> >>+
> >>+struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
> >>+ unsigned long size)
> >>+{
> >>+ struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
> >>+
> >>+ list_for_each_entry_rcu(mem,
> >>+ &current->mm->context.iommu_group_mem_list,
> >>+ next) {
> >>+ if ((mem->ua <= ua) &&
> >>+ (ua + size <= mem->ua +
> >>+ (mem->entries << PAGE_SHIFT))) {
> >>+ ret = mem;
> >>+ break;
> >>+ }
> >>+ }
> >>+
> >>+ return ret;
> >>+}
> >>+EXPORT_SYMBOL_GPL(mm_iommu_lookup);
> >>+
> >>+long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
> >>+ unsigned long ua, unsigned long *hpa)
> >
> >Return type should be int, it's just an error code.
>
>
> Is it some generic rule that errors must always be "int"? I was just told
> that gcc on PPC64 will generate an extra instruction to cut 64bit long to
> 32bit int so I am just trying to use "long" everywhere. Very simple but
> still an optimization :)

Ok, I guess leave it. Probably makes little difference either way.

>
>
> >>+{
> >>+ const long entry = (ua - mem->ua) >> PAGE_SHIFT;
> >>+ u64 *va = &mem->hpas[entry];
> >>+
> >>+ if (entry >= mem->entries)
> >>+ return -EFAULT;
> >>+
> >>+ *hpa = *va | (ua & ~PAGE_MASK);
> >>+
> >>+ return 0;
> >>+}
> >>+EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
> >>+
> >>+long mm_iommu_mapped_update(struct mm_iommu_table_group_mem_t *mem, bool inc)
> >>+{
> >>+ long ret = 0;
> >>+
> >>+ if (inc)
> >>+ atomic_inc(&mem->mapped);
> >>+ else
> >>+ ret = atomic_dec_if_positive(&mem->mapped);
> >>+
> >>+ return ret;
> >>+}
> >>+EXPORT_SYMBOL_GPL(mm_iommu_mapped_update);
> >
> >I think this would be clearer as separate inc and dec functions.
>
> Okay.
>
>
> >>+
> >>+void mm_iommu_cleanup(mm_context_t *ctx)
> >>+{
> >>+ while (!list_empty(&ctx->iommu_group_mem_list)) {
> >>+ struct mm_iommu_table_group_mem_t *mem;
> >>+
> >>+ mem = list_first_entry(&ctx->iommu_group_mem_list,
> >>+ struct mm_iommu_table_group_mem_t, next);
> >>+ mm_iommu_release(&mem->kref);
> >>+ }
> >>+}
> >
>
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

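Following the suggestion in this message to split mm_iommu_mapped_update() into separate increment and decrement helpers, here is a sketch of what that could look like; the sketch_* names are hypothetical, it reuses the mm_iommu_table_group_mem_t from the quoted patch, and it does not attempt to solve the unregister race also discussed above.

static void sketch_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
{
	atomic_inc(&mem->mapped);
}

static long sketch_mapped_dec(struct mm_iommu_table_group_mem_t *mem)
{
	/* Only decrements if the result stays non-negative; a negative
	 * return means the counter was already zero and nothing changed. */
	return atomic_dec_if_positive(&mem->mapped);
}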


2015-05-11 02:11:25

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table

On 05/05/2015 10:02 PM, David Gibson wrote:
> On Fri, May 01, 2015 at 05:12:45PM +1000, Alexey Kardashevskiy wrote:
>> On 05/01/2015 02:23 PM, David Gibson wrote:
>>> On Fri, May 01, 2015 at 02:01:17PM +1000, Alexey Kardashevskiy wrote:
>>>> On 04/29/2015 04:31 PM, David Gibson wrote:
>>>>> On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote:
>>>>>> In order to support memory pre-registration, we need a way to track
>>>>>> the use of every registered memory region and only allow unregistration
>>>>>> if a region is not in use anymore. So we need a way to tell from what
>>>>>> region the just cleared TCE was from.
>>>>>>
>>>>>> This adds a userspace view of the TCE table into iommu_table struct.
>>>>>> It contains userspace address, one per TCE entry. The table is only
>>>>>> allocated when the ownership over an IOMMU group is taken which means
>>>>>> it is only used from outside of the powernv code (such as VFIO).
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>>>> ---
>>>>>> Changes:
>>>>>> v9:
>>>>>> * fixed code flow in error cases added in v8
>>>>>>
>>>>>> v8:
>>>>>> * added ENOMEM on failed vzalloc()
>>>>>> ---
>>>>>> arch/powerpc/include/asm/iommu.h | 6 ++++++
>>>>>> arch/powerpc/kernel/iommu.c | 18 ++++++++++++++++++
>>>>>> arch/powerpc/platforms/powernv/pci-ioda.c | 22 ++++++++++++++++++++--
>>>>>> 3 files changed, 44 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>>>>>> index 7694546..1472de3 100644
>>>>>> --- a/arch/powerpc/include/asm/iommu.h
>>>>>> +++ b/arch/powerpc/include/asm/iommu.h
>>>>>> @@ -111,9 +111,15 @@ struct iommu_table {
>>>>>> unsigned long *it_map; /* A simple allocation bitmap for now */
>>>>>> unsigned long it_page_shift;/* table iommu page size */
>>>>>> struct iommu_table_group *it_table_group;
>>>>>> + unsigned long *it_userspace; /* userspace view of the table */
>>>>>
>>>>> A single unsigned long doesn't seem like enough.
>>>>
>>>> Why single? This is an array.
>>>
>>> As in single per page.
>>
>>
>> Sorry, I am not following you here.
>> It is per IOMMU page. MAP/UNMAP work with IOMMU pages which are fully backed
>> with either system page or a huge page.
>>
>>
>>>
>>>>> How do you know
>>>>> which process's address space this address refers to?
>>>>
>>>> It is a current task. Multiple userspaces cannot use the same container/tables.
>>>
>>> Where is that enforced?
>>
>>
>> It is accessed from VFIO DMA map/unmap which are ioctls() to a container's
>> fd which is per a process.
>
> Usually, but what enforces that. If you open a container fd, then
> fork(), and attempt to map from both parent and child, what happens?


vfio_group_fops::open() checks if the group is already opened, and I want
to believe open() is called from fork() for the new fd so no mapping can happen
later.


--
Alexey

2015-05-11 02:24:48

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

On 05/05/2015 09:58 PM, David Gibson wrote:
> On Fri, May 01, 2015 at 04:53:08PM +1000, Alexey Kardashevskiy wrote:
>> On 05/01/2015 03:12 PM, David Gibson wrote:
>>> On Fri, May 01, 2015 at 02:10:58PM +1000, Alexey Kardashevskiy wrote:
>>>> On 04/29/2015 04:40 PM, David Gibson wrote:
>>>>> On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote:
>>>>>> This adds a way for the IOMMU user to know how much memory a new table
>>>>>> will use so it can be accounted in the locked_vm limit before allocation
>>>>>> happens.
>>>>>>
>>>>>> This stores the allocated table size in pnv_pci_create_table()
>>>>>> so the locked_vm counter can be updated correctly when a table is
>>>>>> being disposed.
>>>>>>
>>>>>> This defines an iommu_table_group_ops callback to let VFIO know
>>>>>> how much memory will be locked if a table is created.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>>>> ---
>>>>>> Changes:
>>>>>> v9:
>>>>>> * reimplemented the whole patch
>>>>>> ---
>>>>>> arch/powerpc/include/asm/iommu.h | 5 +++++
>>>>>> arch/powerpc/platforms/powernv/pci-ioda.c | 14 ++++++++++++
>>>>>> arch/powerpc/platforms/powernv/pci.c | 36 +++++++++++++++++++++++++++++++
>>>>>> arch/powerpc/platforms/powernv/pci.h | 2 ++
>>>>>> 4 files changed, 57 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>>>>>> index 1472de3..9844c106 100644
>>>>>> --- a/arch/powerpc/include/asm/iommu.h
>>>>>> +++ b/arch/powerpc/include/asm/iommu.h
>>>>>> @@ -99,6 +99,7 @@ struct iommu_table {
>>>>>> unsigned long it_size; /* Size of iommu table in entries */
>>>>>> unsigned long it_indirect_levels;
>>>>>> unsigned long it_level_size;
>>>>>> + unsigned long it_allocated_size;
>>>>>> unsigned long it_offset; /* Offset into global table */
>>>>>> unsigned long it_base; /* mapped address of tce table */
>>>>>> unsigned long it_index; /* which iommu table this is */
>>>>>> @@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>>>>>> struct iommu_table_group;
>>>>>>
>>>>>> struct iommu_table_group_ops {
>>>>>> + unsigned long (*get_table_size)(
>>>>>> + __u32 page_shift,
>>>>>> + __u64 window_size,
>>>>>> + __u32 levels);
>>>>>> long (*create_table)(struct iommu_table_group *table_group,
>>>>>> int num,
>>>>>> __u32 page_shift,
>>>>>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>>> index e0be556..7f548b4 100644
>>>>>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>>> @@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
>>>>>> }
>>>>>>
>>>>>> #ifdef CONFIG_IOMMU_API
>>>>>> +static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
>>>>>> + __u64 window_size, __u32 levels)
>>>>>> +{
>>>>>> + unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
>>>>>> +
>>>>>> + if (!ret)
>>>>>> + return ret;
>>>>>> +
>>>>>> + /* Add size of it_userspace */
>>>>>> + return ret + (window_size >> page_shift) * sizeof(unsigned long);
>>>>>
>>>>> This doesn't make much sense. The userspace view can't possibly be a
>>>>> property of the specific low-level IOMMU model.
>>>>
>>>>
>>>> This it_userspace thing is all about memory preregistration.
>>>>
>>>> I need some way to track how many actual mappings the
>>>> mm_iommu_table_group_mem_t has in order to decide whether to allow
>>>> unregistering or not.
>>>>
>>>> When I clear TCE, I can read the old value which is host physical address
>>>> which I cannot use to find the preregistered region and adjust the mappings
>>>> counter; I can only use userspace addresses for this (not even guest
>>>> physical addresses as it is VFIO and probably no KVM).
>>>>
>>>> So I have to keep userspace addresses somewhere, one per IOMMU page, and the
>>>> iommu_table seems a natural place for this.
>>>
>>> Well.. sort of. But as noted elsewhere this pulls VFIO specific
>>> constraints into a platform code structure. And whether you get this
>>> table depends on the platform IOMMU type rather than on what VFIO
>>> wants to do with it, which doesn't make sense.
>>>
>>> What might make more sense is an opaque pointer in iommu_table for use
>>> by the table "owner" (in the take_ownership sense). The pointer would
>>> be stored in iommu_table, but VFIO is responsible for populating and
>>> managing its contents.
>>>
>>> Or you could just put the userspace mappings in the container.
>>> Although you might want a different data structure in that case.
>>
>> Nope. I need this table in in-kernel acceleration to update the mappings
>> counter per mm_iommu_table_group_mem_t. In KVM's real mode handlers, I only
>> have IOMMU tables, not containers or groups. QEMU creates a guest view of
>> the table (KVM_CREATE_SPAPR_TCE) specifying a LIOBN, and then attaches TCE
>> tables to it via set of ioctls (one per IOMMU group) to VFIO KVM device.
>>
>> So if I call it it_opaque (instead of it_userspace), I will still need a
>> common place (visible to VFIO and PowerKVM) for this to put:
>> #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry)
>
> I think it should be in a VFIO header. If I'm understanding right
> this part of the PowerKVM code is explicitly VFIO aware - that's kind
> of the point.

Well, there are two points against it:

1. arch/powerpc/kvm/book3s_64_vio_hv.c and arch/powerpc/kvm/book3s_64_vio.c
do not include any vfio headers now (all the required liobn<->iommu
hooking bits are in virt/kvm/vfio.c), and I kind of like it that way.

2. It seems like a good idea to me to keep an accessor close to what it
provides access to. Since I cannot move it_userspace somewhere else, I
should not move IOMMU_TABLE_USERSPACE_ENTRY() either; a sketch of what such
an accessor could look like follows below.
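
For illustration only, a minimal sketch of what such an accessor could look
like, assuming it_userspace is an array indexed by the TCE entry number
relative to it_offset; the exact macro body in the patchset may differ:

/* Hypothetical accessor; the real definition would live next to
 * struct iommu_table in arch/powerpc/include/asm/iommu.h */
#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
	((tbl)->it_userspace ? \
		&((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : NULL)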



>> So far this place was arch/powerpc/include/asm/iommu.h and the iommu_table
>> struct.
>>
>>
>>> The other thing to bear in mind is that registered regions are likely
>>> to be large contiguous blocks in user addresses, though obviously not
>>> contiguous in physical addr. So you might be able to compactify this
>>> information by storing it as a list of variable length blocks in
>>> userspace address space, rather than a per-page address..
>>
>> It is 8 bytes per system page - 8/65536 = 0.00012 (or 26MB for 200GB guest)
>> - very little overhead.
>>
>>
>>> But.. isn't there a bigger problem here. As Paulus was pointing out,
>>> there's nothing guaranteeing the page tables continue to contain the
>>> same page as was there at gup() time.
>>
>> This can happen if the userspace remaps memory which it registered/mapped
>> for DMA via VFIO, no? If so, then the userspace just should not do this, it
>> is DMA, it cannot be moved like this. What am I missing here?
>>
>>
>>> What's going to happen if you REGISTER a memory region, then mremap()
>>> over it?
>>
>> The registered pages will remain pinned and PUT_TCE will use that region for
>> translation (and this will fail as the userspace addresses changed).
>>
>> I do not see how it is different from the situation when the userspace
>> mapped a page and mremap()ed it while it is DMA-mapped.
>
> True, it's basically the same. Hrm, so what guarantees a dma_map,
> mremap() dma_unmap will unreference the correct pages.

The original page will remain pinned, the wrong one will be unpinned, access
to it will produce an EEH, and the process exit will do the cleanup. What is
the problem here? The inability for userspace to dma_map() and then remap()?
I do not think x86 (or anything else) can or should cope with this well; it
is DMA, you just cannot do certain things when you work with DMA...
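
For illustration, a hedged userspace sketch of the map/remap sequence under
discussion. The container and group setup is elided, error checking is
dropped, VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA and the
vfio_iommu_type1_dma_map/unmap structures are the generic VFIO uAPI, and how
the sPAPR backend reacts to the remap is exactly the open question here, not
something this snippet settles:

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* container_fd is an already configured VFIO container */
static void map_then_remap(int container_fd)
{
	size_t sz = 1UL << 24;
	void *buf = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

	struct vfio_iommu_type1_dma_map map;
	memset(&map, 0, sizeof(map));
	map.argsz = sizeof(map);
	map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	map.vaddr = (uintptr_t)buf;
	map.iova = 0;
	map.size = sz;
	ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map); /* pages pinned here */

	/* Replace the mapping in place: the DMA mapping still refers to the
	 * pages that were pinned at MAP_DMA time, not whatever backs buf now */
	mmap(buf, sz, PROT_READ | PROT_WRITE,
	     MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);

	struct vfio_iommu_type1_dma_unmap unmap;
	memset(&unmap, 0, sizeof(unmap));
	unmap.argsz = sizeof(unmap);
	unmap.iova = 0;
	unmap.size = sz;
	ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}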



>>> Then attempt to PUT_TCE a page in the region? Or what if you
>>> mremap() it to someplace else then try to PUT_TCE a page there?
>>
>> This will fail - a new userspace address has to be preregistered.
>>
>>> Or REGISTER it again in its new location?
>>
>> It will be pinned twice + some memory overhead to store the same host
>> physical address(es) twice.
>>
>>
>>
>


--
Alexey

2015-05-11 02:26:27

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible

On 05/05/2015 09:50 PM, David Gibson wrote:
> On Fri, May 01, 2015 at 04:05:24PM +1000, Alexey Kardashevskiy wrote:
>> On 05/01/2015 02:33 PM, David Gibson wrote:
>>> On Thu, Apr 30, 2015 at 07:33:09PM +1000, Alexey Kardashevskiy wrote:
>>>> On 04/30/2015 05:22 PM, David Gibson wrote:
>>>>> On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote:
>>>>>> At the moment only one group per container is supported.
>>>>>> POWER8 CPUs have a more flexible design and allow having 2 TCE tables per
>>>>>> IOMMU group so we can relax this limitation and support multiple groups
>>>>>> per container.
>>>>>
>>>>> It's not obvious why allowing multiple TCE tables per PE has any
>>>>> bearing on allowing multiple groups per container.
>>>>
>>>>
>>>> This patchset is a global TCE tables rework (patches 1..30, roughly) with 2
>>>> outcomes:
>>>> 1. reusing the same IOMMU table for multiple groups - patch 31;
>>>> 2. allowing dynamic create/remove of IOMMU tables - patch 32.
>>>>
>>>> I can remove this one from the patchset and post it separately later but
>>>> since 1..30 aim to support both 1) and 2), I'd think I better keep them all
>>>> together (might explain some of changes I do in 1..30).
>>>
>>> The combined patchset is fine. My comment is because your commit
>>> message says that multiple groups are possible *because* 2 TCE tables
>>> per group are allowed, and it's not at all clear why one follows from
>>> the other.
>>
>>
>> Ah. That's wrong indeed, I'll fix it.
>>
>>
>>>>>> This adds TCE table descriptors to a container and uses iommu_table_group_ops
>>>>>> to create/set DMA windows on IOMMU groups so the same TCE tables will be
>>>>>> shared between several IOMMU groups.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>>>> [aw: for the vfio related changes]
>>>>>> Acked-by: Alex Williamson <[email protected]>
>>>>>> ---
>>>>>> Changes:
>>>>>> v7:
>>>>>> * updated doc
>>>>>> ---
>>>>>> Documentation/vfio.txt | 8 +-
>>>>>> drivers/vfio/vfio_iommu_spapr_tce.c | 268 ++++++++++++++++++++++++++----------
>>>>>> 2 files changed, 199 insertions(+), 77 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
>>>>>> index 94328c8..7dcf2b5 100644
>>>>>> --- a/Documentation/vfio.txt
>>>>>> +++ b/Documentation/vfio.txt
>>>>>> @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
>>>>>>
>>>>>> This implementation has some specifics:
>>>>>>
>>>>>> -1) Only one IOMMU group per container is supported as an IOMMU group
>>>>>> -represents the minimal entity which isolation can be guaranteed for and
>>>>>> -groups are allocated statically, one per a Partitionable Endpoint (PE)
>>>>>> +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
>>>>>> +container is supported as an IOMMU table is allocated at the boot time,
>>>>>> +one table per a IOMMU group which is a Partitionable Endpoint (PE)
>>>>>> (PE is often a PCI domain but not always).
>>>>>
>>>>> I thought the more fundamental problem was that different PEs tended
>>>>> to use disjoint bus address ranges, so even by duplicating put_tce
>>>>> across PEs you couldn't have a common address space.
>>>>
>>>>
>>>> Sorry, I am not following you here.
>>>>
>>>> By duplicating put_tce, I can have multiple IOMMU groups on the same virtual
>>>> PHB in QEMU, "[PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple groups
>>>> per container" does this, the address ranges will be the same.
>>>
>>> Oh, ok. For some reason I thought that (at least on the older
>>> machines) the different PEs used different and not easily changeable
>>> DMA windows in bus addresses space.
>>
>>
>> They do use different tables (which VFIO does not get to remove/create and
>> uses these old helpers - iommu_take/release_ownership), correct. But all
>> these windows are mapped at zero on a PE's PCI bus and nothing prevents me
>> from updating all these tables with the same TCE values when handling
>> H_PUT_TCE. Yes it is slow but it works (bit more details below).
>
> Um.. I'm pretty sure that contradicts what Ben was saying on the
> thread.


True, it does contradict; I do not know why he said what he said :)



--
Alexey

2015-05-11 04:52:59

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table

On 05/11/2015 12:11 PM, Alexey Kardashevskiy wrote:
> On 05/05/2015 10:02 PM, David Gibson wrote:
>> On Fri, May 01, 2015 at 05:12:45PM +1000, Alexey Kardashevskiy wrote:
>>> On 05/01/2015 02:23 PM, David Gibson wrote:
>>>> On Fri, May 01, 2015 at 02:01:17PM +1000, Alexey Kardashevskiy wrote:
>>>>> On 04/29/2015 04:31 PM, David Gibson wrote:
>>>>>> On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote:
>>>>>>> In order to support memory pre-registration, we need a way to track
>>>>>>> the use of every registered memory region and only allow unregistration
>>>>>>> if a region is not in use anymore. So we need a way to tell which
>>>>>>> region the just cleared TCE came from.
>>>>>>>
>>>>>>> This adds a userspace view of the TCE table into iommu_table struct.
>>>>>>> It contains userspace address, one per TCE entry. The table is only
>>>>>>> allocated when the ownership over an IOMMU group is taken which means
>>>>>>> it is only used from outside of the powernv code (such as VFIO).
>>>>>>>
>>>>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>>>>> ---
>>>>>>> Changes:
>>>>>>> v9:
>>>>>>> * fixed code flow in error cases added in v8
>>>>>>>
>>>>>>> v8:
>>>>>>> * added ENOMEM on failed vzalloc()
>>>>>>> ---
>>>>>>> arch/powerpc/include/asm/iommu.h | 6 ++++++
>>>>>>> arch/powerpc/kernel/iommu.c | 18 ++++++++++++++++++
>>>>>>> arch/powerpc/platforms/powernv/pci-ioda.c | 22 ++++++++++++++++++++--
>>>>>>> 3 files changed, 44 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/powerpc/include/asm/iommu.h
>>>>>>> b/arch/powerpc/include/asm/iommu.h
>>>>>>> index 7694546..1472de3 100644
>>>>>>> --- a/arch/powerpc/include/asm/iommu.h
>>>>>>> +++ b/arch/powerpc/include/asm/iommu.h
>>>>>>> @@ -111,9 +111,15 @@ struct iommu_table {
>>>>>>> unsigned long *it_map; /* A simple allocation bitmap for
>>>>>>> now */
>>>>>>> unsigned long it_page_shift;/* table iommu page size */
>>>>>>> struct iommu_table_group *it_table_group;
>>>>>>> + unsigned long *it_userspace; /* userspace view of the table */
>>>>>>
>>>>>> A single unsigned long doesn't seem like enough.
>>>>>
>>>>> Why single? This is an array.
>>>>
>>>> As in single per page.
>>>
>>>
>>> Sorry, I am not following you here.
>>> It is per IOMMU page. MAP/UNMAP work with IOMMU pages which are fully
>>> backed
>>> with either system page or a huge page.
>>>
>>>
>>>>
>>>>>> How do you know
>>>>>> which process's address space this address refers to?
>>>>>
>>>>> It is a current task. Multiple userspaces cannot use the same
>>>>> container/tables.
>>>>
>>>> Where is that enforced?
>>>
>>>
>>> It is accessed from VFIO DMA map/unmap which are ioctls() to a container's
>>> fd which is per a process.
>>
>> Usually, but what enforces that. If you open a container fd, then
>> fork(), and attempt to map from both parent and child, what happens?
>
>
> vfio_group_fops::open() checks if the group is already opened, and I want
> to believe open() is called from fork() for new fd so no mapping can happen
> later.

I am wrong here. Nothing prevents multiple userspaces from using the same
container. It still does not seem really dangerous as, in order to use VFIO,
someone with root privilege has to set the right permissions on /dev/vfio*
first anyway, and that person knows what QEMU does and does not do :)

I could add a pid into iommu_table, next to it_userspace, and fail when
another pid tries to change the it_userspace table (a sketch follows below).
I am not sure I want to do this check in real mode though (performance). Or
make sure somehow that fork() closes the container and group fd's (but how?).
In the worst case, the wrong userspace page will be put and there will be
random backtraces in the host kernel. What would you do?
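
Purely as a sketch of that idea, with hypothetical field and helper names
that are not part of the patchset, the check could look roughly like this
(relies on <linux/sched.h> for current/task_pid_nr()):

/* Hypothetical: record the pid that attached the userspace view */
struct iommu_table {
	/* ... existing fields ... */
	unsigned long *it_userspace;	/* userspace view of the table */
	pid_t it_userspace_pid;		/* hypothetical: owner of it_userspace */
};

/* Called on the DMA map/unmap path before touching it_userspace */
static int iommu_check_userspace_owner(struct iommu_table *tbl)
{
	if (tbl->it_userspace_pid != task_pid_nr(current))
		return -EPERM;	/* another process is using this container */
	return 0;
}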


--
Alexey