This enables the sPAPR-defined feature called Dynamic DMA windows (DDW).
Each Partitionable Endpoint (IOMMU group) has an address range on a PCI bus
where devices are allowed to do DMA. These ranges are called DMA windows.
By default, there is a single DMA window, 1GB or 2GB in size, mapped at
address zero on the PCI bus.
High-speed devices may suffer from the limited size of this window.
Recent host kernels use a TCE bypass window on POWER8 CPUs which maps
a PCI bus address range directly (at an offset of 1<<59) to host memory.
For guests, PAPR defines a DDW RTAS API which allows pseries guests
to query the hypervisor about DDW support and capabilities (page size mask
for now). A pseries guest may request additional DMA windows (beyond
the default one) using this RTAS API.
Existing pseries Linux guests request an additional window as big as
the guest RAM and map the entire window, which effectively creates
a direct mapping of guest memory to the PCI bus.
The multiple DMA windows feature is supported by POWER7/POWER8 CPUs; however
this patchset only adds support for POWER8, as TCE tables are implemented
quite differently on POWER7 and POWER7 is not the highest priority.
This patchset reworks PPC64 IOMMU code and adds necessary structures
to support big windows.
Once a Linux guest discovers the presence of DDW, it:
1. queries the hypervisor about the number of available windows and page size masks;
2. creates a window with the biggest possible page size (today 4K/64K/16M);
3. maps the entire guest RAM via H_PUT_TCE* hypercalls;
4. switches dma_ops to direct_dma_ops on the selected PE.
Once this is done, H_PUT_TCE is not called anymore for 64bit devices and
the guest does not waste time on DMA map/unmap operations.
Note that 32bit devices won't use DDW and will keep using the default
DMA window so KVM optimizations will be required (to be posted later).
This is pushed to [email protected]:aik/linux.git
+ 09bb8ea...d9b711d vfio-for-github -> vfio-for-github (forced update)
Please comment. Thank you!
Changes:
v8:
* fixed a bug in error fallback in "powerpc/mmu: Add userspace-to-physical
addresses translation cache"
* fixed subject in "vfio: powerpc/spapr: Check that IOMMU page is fully
contained by system page"
* moved v2 documentation to the correct patch
* added checks for failed vzalloc() in "powerpc/iommu: Add userspace view
of TCE table"
v7:
* moved memory preregistration to the current process's MMU context
* added code preventing unregistration if some pages are still mapped;
  for this, a userspace view of the table is stored in iommu_table
* added locked_vm counting for DDW tables (including userspace view of those)
v6:
* fixed a bunch of errors in "vfio: powerpc/spapr: Support Dynamic DMA windows"
* moved static IOMMU properties from iommu_table_group to iommu_table_group_ops
v5:
* added SPAPR_TCE_IOMMU_v2 to tell the userspace that there is a memory
pre-registration feature
* added backward compatibility
* renamed few things (mostly powerpc_iommu -> iommu_table_group)
v4:
* moved patches around to have VFIO and PPC patches separated as much as
possible
* now works with the existing upstream QEMU
v3:
* redesigned the whole thing
* multiple IOMMU groups per PHB -> one PHB is needed for VFIO in the guest ->
no problems with locked_vm counting; also we save memory on actual tables
* guest RAM preregistration is required for DDW
* PEs (IOMMU groups) are passed to VFIO with no DMA windows at all so
we do not bother with iommu_table::it_map anymore
* added multilevel TCE tables support to support really huge guests
v2:
* added missing __pa() in "powerpc/powernv: Release replaced TCE"
* reposted to make some noise
Alexey Kardashevskiy (31):
vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU
driver
vfio: powerpc/spapr: Do cleanup when releasing the group
vfio: powerpc/spapr: Check that IOMMU page is fully contained by
system page
vfio: powerpc/spapr: Use it_page_size
vfio: powerpc/spapr: Move locked_vm accounting to helpers
vfio: powerpc/spapr: Disable DMA mappings on disabled container
vfio: powerpc/spapr: Moving pinning/unpinning to helpers
vfio: powerpc/spapr: Rework groups attaching
powerpc/powernv: Do not set "read" flag if direction==DMA_NONE
powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table
powerpc/iommu: Introduce iommu_table_alloc() helper
powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group
vfio: powerpc/spapr: powerpc/iommu: Rework IOMMU ownership control
vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework IOMMU ownership
control
powerpc/iommu: Fix IOMMU ownership control functions
powerpc/powernv/ioda/ioda2: Rework tce_build()/tce_free()
powerpc/iommu/powernv: Release replaced TCE
powerpc/powernv/ioda2: Rework iommu_table creation
powerpc/powernv/ioda2: Introduce
pnv_pci_ioda2_create_table/pnc_pci_free_table
powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window
powerpc/iommu: Split iommu_free_table into 2 helpers
powerpc/powernv: Implement multilevel TCE tables
powerpc/powernv: Change prototypes to receive iommu
powerpc/powernv/ioda: Define and implement DMA table/window management
callbacks
vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework ownership
powerpc/iommu: Add userspace view of TCE table
powerpc/iommu/ioda2: Add get_table_size() to calculate the size of
future table
powerpc/mmu: Add userspace-to-physical addresses translation cache
vfio: powerpc/spapr: Register memory and define IOMMU v2
vfio: powerpc/spapr: Support multiple groups in one container if
possible
vfio: powerpc/spapr: Support Dynamic DMA windows
Documentation/vfio.txt | 50 +-
arch/powerpc/include/asm/iommu.h | 111 ++-
arch/powerpc/include/asm/machdep.h | 25 -
arch/powerpc/include/asm/mmu-hash64.h | 3 +
arch/powerpc/include/asm/mmu_context.h | 17 +
arch/powerpc/kernel/iommu.c | 336 +++++----
arch/powerpc/kernel/vio.c | 5 +
arch/powerpc/mm/Makefile | 1 +
arch/powerpc/mm/mmu_context_hash64.c | 6 +
arch/powerpc/mm/mmu_context_hash64_iommu.c | 215 ++++++
arch/powerpc/platforms/cell/iommu.c | 8 +-
arch/powerpc/platforms/pasemi/iommu.c | 7 +-
arch/powerpc/platforms/powernv/pci-ioda.c | 589 ++++++++++++---
arch/powerpc/platforms/powernv/pci-p5ioc2.c | 33 +-
arch/powerpc/platforms/powernv/pci.c | 116 ++-
arch/powerpc/platforms/powernv/pci.h | 12 +-
arch/powerpc/platforms/pseries/iommu.c | 55 +-
arch/powerpc/sysdev/dart_iommu.c | 12 +-
drivers/vfio/vfio_iommu_spapr_tce.c | 1021 ++++++++++++++++++++++++---
include/uapi/linux/vfio.h | 88 ++-
20 files changed, 2218 insertions(+), 492 deletions(-)
create mode 100644 arch/powerpc/mm/mmu_context_hash64_iommu.c
--
2.0.0
This moves the page pinning code (get_user_pages_fast()/put_page()) out of
the platform IOMMU code and into the VFIO IOMMU driver where it belongs,
as the platform code does not deal with page pinning.
This makes iommu_take_ownership()/iommu_release_ownership() deal with
the IOMMU table bitmap only.
This removes page unpinning from iommu_take_ownership() as the actual
TCE table might contain garbage and doing put_page() on it is undefined
behaviour.
Besides the last part, the rest of the patch is mechanical.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v4:
* s/iommu_tce_build(tbl, entry + 1/iommu_tce_build(tbl, entry + i/
---
arch/powerpc/include/asm/iommu.h | 4 --
arch/powerpc/kernel/iommu.c | 55 --------------------------
drivers/vfio/vfio_iommu_spapr_tce.c | 78 ++++++++++++++++++++++++++++++-------
3 files changed, 65 insertions(+), 72 deletions(-)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index f1ea597..ed69b7d 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -197,10 +197,6 @@ extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
unsigned long hwaddr, enum dma_data_direction direction);
extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
unsigned long entry);
-extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
- unsigned long entry, unsigned long pages);
-extern int iommu_put_tce_user_mode(struct iommu_table *tbl,
- unsigned long entry, unsigned long tce);
extern void iommu_flush_tce(struct iommu_table *tbl);
extern int iommu_take_ownership(struct iommu_table *tbl);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b054f33..1b4a178 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -991,30 +991,6 @@ unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
}
EXPORT_SYMBOL_GPL(iommu_clear_tce);
-int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
- unsigned long entry, unsigned long pages)
-{
- unsigned long oldtce;
- struct page *page;
-
- for ( ; pages; --pages, ++entry) {
- oldtce = iommu_clear_tce(tbl, entry);
- if (!oldtce)
- continue;
-
- page = pfn_to_page(oldtce >> PAGE_SHIFT);
- WARN_ON(!page);
- if (page) {
- if (oldtce & TCE_PCI_WRITE)
- SetPageDirty(page);
- put_page(page);
- }
- }
-
- return 0;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages);
-
/*
* hwaddr is a kernel virtual address here (0xc... bazillion),
* tce_build converts it to a physical address.
@@ -1044,35 +1020,6 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
}
EXPORT_SYMBOL_GPL(iommu_tce_build);
-int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
- unsigned long tce)
-{
- int ret;
- struct page *page = NULL;
- unsigned long hwaddr, offset = tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
- enum dma_data_direction direction = iommu_tce_direction(tce);
-
- ret = get_user_pages_fast(tce & PAGE_MASK, 1,
- direction != DMA_TO_DEVICE, &page);
- if (unlikely(ret != 1)) {
- /* pr_err("iommu_tce: get_user_pages_fast failed tce=%lx ioba=%lx ret=%d\n",
- tce, entry << tbl->it_page_shift, ret); */
- return -EFAULT;
- }
- hwaddr = (unsigned long) page_address(page) + offset;
-
- ret = iommu_tce_build(tbl, entry, hwaddr, direction);
- if (ret)
- put_page(page);
-
- if (ret < 0)
- pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
- __func__, entry << tbl->it_page_shift, tce, ret);
-
- return ret;
-}
-EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
-
int iommu_take_ownership(struct iommu_table *tbl)
{
unsigned long sz = (tbl->it_size + 7) >> 3;
@@ -1086,7 +1033,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
}
memset(tbl->it_map, 0xff, sz);
- iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
/*
* Disable iommu bypass, otherwise the user can DMA to all of
@@ -1104,7 +1050,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
{
unsigned long sz = (tbl->it_size + 7) >> 3;
- iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
memset(tbl->it_map, 0, sz);
/* Restore bit#0 set by iommu_init_table() */
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 730b4ef..cefaf05 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -147,6 +147,66 @@ static void tce_iommu_release(void *iommu_data)
kfree(container);
}
+static int tce_iommu_clear(struct tce_container *container,
+ struct iommu_table *tbl,
+ unsigned long entry, unsigned long pages)
+{
+ unsigned long oldtce;
+ struct page *page;
+
+ for ( ; pages; --pages, ++entry) {
+ oldtce = iommu_clear_tce(tbl, entry);
+ if (!oldtce)
+ continue;
+
+ page = pfn_to_page(oldtce >> PAGE_SHIFT);
+ WARN_ON(!page);
+ if (page) {
+ if (oldtce & TCE_PCI_WRITE)
+ SetPageDirty(page);
+ put_page(page);
+ }
+ }
+
+ return 0;
+}
+
+static long tce_iommu_build(struct tce_container *container,
+ struct iommu_table *tbl,
+ unsigned long entry, unsigned long tce, unsigned long pages)
+{
+ long i, ret = 0;
+ struct page *page = NULL;
+ unsigned long hva;
+ enum dma_data_direction direction = iommu_tce_direction(tce);
+
+ for (i = 0; i < pages; ++i) {
+ ret = get_user_pages_fast(tce & PAGE_MASK, 1,
+ direction != DMA_TO_DEVICE, &page);
+ if (unlikely(ret != 1)) {
+ ret = -EFAULT;
+ break;
+ }
+ hva = (unsigned long) page_address(page) +
+ (tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
+
+ ret = iommu_tce_build(tbl, entry + i, hva, direction);
+ if (ret) {
+ put_page(page);
+ pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
+ __func__, entry << tbl->it_page_shift,
+ tce, ret);
+ break;
+ }
+ tce += IOMMU_PAGE_SIZE_4K;
+ }
+
+ if (ret)
+ tce_iommu_clear(container, tbl, entry, i);
+
+ return ret;
+}
+
static long tce_iommu_ioctl(void *iommu_data,
unsigned int cmd, unsigned long arg)
{
@@ -195,7 +255,7 @@ static long tce_iommu_ioctl(void *iommu_data,
case VFIO_IOMMU_MAP_DMA: {
struct vfio_iommu_type1_dma_map param;
struct iommu_table *tbl = container->tbl;
- unsigned long tce, i;
+ unsigned long tce;
if (!tbl)
return -ENXIO;
@@ -229,17 +289,9 @@ static long tce_iommu_ioctl(void *iommu_data,
if (ret)
return ret;
- for (i = 0; i < (param.size >> IOMMU_PAGE_SHIFT_4K); ++i) {
- ret = iommu_put_tce_user_mode(tbl,
- (param.iova >> IOMMU_PAGE_SHIFT_4K) + i,
- tce);
- if (ret)
- break;
- tce += IOMMU_PAGE_SIZE_4K;
- }
- if (ret)
- iommu_clear_tces_and_put_pages(tbl,
- param.iova >> IOMMU_PAGE_SHIFT_4K, i);
+ ret = tce_iommu_build(container, tbl,
+ param.iova >> IOMMU_PAGE_SHIFT_4K,
+ tce, param.size >> IOMMU_PAGE_SHIFT_4K);
iommu_flush_tce(tbl);
@@ -273,7 +325,7 @@ static long tce_iommu_ioctl(void *iommu_data,
if (ret)
return ret;
- ret = iommu_clear_tces_and_put_pages(tbl,
+ ret = tce_iommu_clear(container, tbl,
param.iova >> IOMMU_PAGE_SHIFT_4K,
param.size >> IOMMU_PAGE_SHIFT_4K);
iommu_flush_tce(tbl);
--
2.0.0
This clears the TCE table when a container is being closed, as it is
good practice to leave the table clean before passing the ownership
back to the host kernel.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
drivers/vfio/vfio_iommu_spapr_tce.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index cefaf05..e9b4d7d 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -132,16 +132,24 @@ static void *tce_iommu_open(unsigned long arg)
return container;
}
+static int tce_iommu_clear(struct tce_container *container,
+ struct iommu_table *tbl,
+ unsigned long entry, unsigned long pages);
+
static void tce_iommu_release(void *iommu_data)
{
struct tce_container *container = iommu_data;
+ struct iommu_table *tbl = container->tbl;
- WARN_ON(container->tbl && !container->tbl->it_group);
+ WARN_ON(tbl && !tbl->it_group);
tce_iommu_disable(container);
- if (container->tbl && container->tbl->it_group)
- tce_iommu_detach_group(iommu_data, container->tbl->it_group);
+ if (tbl) {
+ tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
+ if (tbl->it_group)
+ tce_iommu_detach_group(iommu_data, tbl->it_group);
+ }
mutex_destroy(&container->lock);
kfree(container);
--
2.0.0
This checks that the TCE table page size is not bigger than the size of
the page we have just pinned and whose physical address we are about to put
into the table. Otherwise the hardware gets unwanted access to the physical
memory between the end of the actual page and the end of the aligned-up
TCE page.
Since compound_order() and compound_head() work correctly on non-huge
pages, there is no need for an additional check whether the page is huge.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v8: changed subject
v6:
* the helper is simplified to one line
v4:
* s/tce_check_page_size/tce_page_is_contained/
---
drivers/vfio/vfio_iommu_spapr_tce.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index e9b4d7d..f835e63 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -47,6 +47,16 @@ struct tce_container {
bool enabled;
};
+static bool tce_page_is_contained(struct page *page, unsigned page_shift)
+{
+ /*
+ * Check that the TCE table granularity is not bigger than the size of
+ * a page we just found. Otherwise the hardware can get access to
+ * a bigger memory chunk that it should.
+ */
+ return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift;
+}
+
static int tce_iommu_enable(struct tce_container *container)
{
int ret = 0;
@@ -195,6 +205,12 @@ static long tce_iommu_build(struct tce_container *container,
ret = -EFAULT;
break;
}
+
+ if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+ ret = -EPERM;
+ break;
+ }
+
hva = (unsigned long) page_address(page) +
(tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
--
2.0.0
This makes use of it_page_size from the iommu_table struct
as the page size can differ.
This also replaces the missing IOMMU_PAGE_SHIFT macro in commented-out
debug code, as the recently introduced IOMMU_PAGE_XXX macros do not include
IOMMU_PAGE_SHIFT.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
---
drivers/vfio/vfio_iommu_spapr_tce.c | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index f835e63..8bbee22 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -91,7 +91,7 @@ static int tce_iommu_enable(struct tce_container *container)
* enforcing the limit based on the max that the guest can map.
*/
down_write(¤t->mm->mmap_sem);
- npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+ npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
locked = current->mm->locked_vm + npages;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
@@ -120,7 +120,7 @@ static void tce_iommu_disable(struct tce_container *container)
down_write(¤t->mm->mmap_sem);
current->mm->locked_vm -= (container->tbl->it_size <<
- IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+ container->tbl->it_page_shift) >> PAGE_SHIFT;
up_write(¤t->mm->mmap_sem);
}
@@ -222,7 +222,7 @@ static long tce_iommu_build(struct tce_container *container,
tce, ret);
break;
}
- tce += IOMMU_PAGE_SIZE_4K;
+ tce += IOMMU_PAGE_SIZE(tbl);
}
if (ret)
@@ -267,8 +267,8 @@ static long tce_iommu_ioctl(void *iommu_data,
if (info.argsz < minsz)
return -EINVAL;
- info.dma32_window_start = tbl->it_offset << IOMMU_PAGE_SHIFT_4K;
- info.dma32_window_size = tbl->it_size << IOMMU_PAGE_SHIFT_4K;
+ info.dma32_window_start = tbl->it_offset << tbl->it_page_shift;
+ info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
info.flags = 0;
if (copy_to_user((void __user *)arg, &info, minsz))
@@ -298,8 +298,8 @@ static long tce_iommu_ioctl(void *iommu_data,
VFIO_DMA_MAP_FLAG_WRITE))
return -EINVAL;
- if ((param.size & ~IOMMU_PAGE_MASK_4K) ||
- (param.vaddr & ~IOMMU_PAGE_MASK_4K))
+ if ((param.size & ~IOMMU_PAGE_MASK(tbl)) ||
+ (param.vaddr & ~IOMMU_PAGE_MASK(tbl)))
return -EINVAL;
/* iova is checked by the IOMMU API */
@@ -314,8 +314,8 @@ static long tce_iommu_ioctl(void *iommu_data,
return ret;
ret = tce_iommu_build(container, tbl,
- param.iova >> IOMMU_PAGE_SHIFT_4K,
- tce, param.size >> IOMMU_PAGE_SHIFT_4K);
+ param.iova >> tbl->it_page_shift,
+ tce, param.size >> tbl->it_page_shift);
iommu_flush_tce(tbl);
@@ -341,17 +341,17 @@ static long tce_iommu_ioctl(void *iommu_data,
if (param.flags)
return -EINVAL;
- if (param.size & ~IOMMU_PAGE_MASK_4K)
+ if (param.size & ~IOMMU_PAGE_MASK(tbl))
return -EINVAL;
ret = iommu_tce_clear_param_check(tbl, param.iova, 0,
- param.size >> IOMMU_PAGE_SHIFT_4K);
+ param.size >> tbl->it_page_shift);
if (ret)
return ret;
ret = tce_iommu_clear(container, tbl,
- param.iova >> IOMMU_PAGE_SHIFT_4K,
- param.size >> IOMMU_PAGE_SHIFT_4K);
+ param.iova >> tbl->it_page_shift,
+ param.size >> tbl->it_page_shift);
iommu_flush_tce(tbl);
return ret;
--
2.0.0
This moves locked pages accounting to helpers.
Later they will be reused for Dynamic DMA windows (DDW).
This reworks debug messages to show the current value and the limit.
This stores the number of locked pages in the container so that the
iommu table pointer won't be needed when unlocking. This has no effect
now but it will once multiple tables per container are supported, as we
will then allow attaching/detaching groups on the fly and may end up with
a container that has no group attached but still has the counter incremented.
While we are here, update the comment explaining why RLIMIT_MEMLOCK
might need to be bigger than the guest RAM. This also prints the pid
of the current process in pr_warn/pr_debug.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v4:
* new helpers do nothing if @npages == 0
* tce_iommu_disable() now can decrement the counter if the group was
detached (not possible now but will be in the future)
---
drivers/vfio/vfio_iommu_spapr_tce.c | 82 ++++++++++++++++++++++++++++---------
1 file changed, 63 insertions(+), 19 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 8bbee22..9448e39 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -29,6 +29,51 @@
static void tce_iommu_detach_group(void *iommu_data,
struct iommu_group *iommu_group);
+static long try_increment_locked_vm(long npages)
+{
+ long ret = 0, locked, lock_limit;
+
+ if (!current || !current->mm)
+ return -ESRCH; /* process exited */
+
+ if (!npages)
+ return 0;
+
+ down_write(¤t->mm->mmap_sem);
+ locked = current->mm->locked_vm + npages;
+ lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+ if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+ ret = -ENOMEM;
+ else
+ current->mm->locked_vm += npages;
+
+ pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
+ npages << PAGE_SHIFT,
+ current->mm->locked_vm << PAGE_SHIFT,
+ rlimit(RLIMIT_MEMLOCK),
+ ret ? " - exceeded" : "");
+
+ up_write(¤t->mm->mmap_sem);
+
+ return ret;
+}
+
+static void decrement_locked_vm(long npages)
+{
+ if (!current || !current->mm || !npages)
+ return; /* process exited */
+
+ down_write(¤t->mm->mmap_sem);
+ if (npages > current->mm->locked_vm)
+ npages = current->mm->locked_vm;
+ current->mm->locked_vm -= npages;
+ pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
+ npages << PAGE_SHIFT,
+ current->mm->locked_vm << PAGE_SHIFT,
+ rlimit(RLIMIT_MEMLOCK));
+ up_write(¤t->mm->mmap_sem);
+}
+
/*
* VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
*
@@ -45,6 +90,7 @@ struct tce_container {
struct mutex lock;
struct iommu_table *tbl;
bool enabled;
+ unsigned long locked_pages;
};
static bool tce_page_is_contained(struct page *page, unsigned page_shift)
@@ -60,7 +106,7 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift)
static int tce_iommu_enable(struct tce_container *container)
{
int ret = 0;
- unsigned long locked, lock_limit, npages;
+ unsigned long locked;
struct iommu_table *tbl = container->tbl;
if (!container->tbl)
@@ -89,21 +135,22 @@ static int tce_iommu_enable(struct tce_container *container)
* Also we don't have a nice way to fail on H_PUT_TCE due to ulimits,
* that would effectively kill the guest at random points, much better
* enforcing the limit based on the max that the guest can map.
+ *
+ * Unfortunately at the moment it counts whole tables, no matter how
+ * much memory the guest has. I.e. for 4GB guest and 4 IOMMU groups
+ * each with 2GB DMA window, 8GB will be counted here. The reason for
+ * this is that we cannot tell here the amount of RAM used by the guest
+ * as this information is only available from KVM and VFIO is
+ * KVM agnostic.
*/
- down_write(¤t->mm->mmap_sem);
- npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
- locked = current->mm->locked_vm + npages;
- lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
- if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
- pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
- rlimit(RLIMIT_MEMLOCK));
- ret = -ENOMEM;
- } else {
+ locked = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
+ ret = try_increment_locked_vm(locked);
+ if (ret)
+ return ret;
- current->mm->locked_vm += npages;
- container->enabled = true;
- }
- up_write(¤t->mm->mmap_sem);
+ container->locked_pages = locked;
+
+ container->enabled = true;
return ret;
}
@@ -115,13 +162,10 @@ static void tce_iommu_disable(struct tce_container *container)
container->enabled = false;
- if (!container->tbl || !current->mm)
+ if (!current->mm)
return;
- down_write(¤t->mm->mmap_sem);
- current->mm->locked_vm -= (container->tbl->it_size <<
- container->tbl->it_page_shift) >> PAGE_SHIFT;
- up_write(¤t->mm->mmap_sem);
+ decrement_locked_vm(container->locked_pages);
}
static void *tce_iommu_open(unsigned long arg)
--
2.0.0
At the moment DMA map/unmap requests are handled irrespective of
the container's state. This allows userspace to pin memory which
it might not be allowed to pin.
This adds checks to MAP/UNMAP that the container is enabled, otherwise
-EPERM is returned.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
drivers/vfio/vfio_iommu_spapr_tce.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 9448e39..c137bb3 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -325,6 +325,9 @@ static long tce_iommu_ioctl(void *iommu_data,
struct iommu_table *tbl = container->tbl;
unsigned long tce;
+ if (!container->enabled)
+ return -EPERM;
+
if (!tbl)
return -ENXIO;
@@ -369,6 +372,9 @@ static long tce_iommu_ioctl(void *iommu_data,
struct vfio_iommu_type1_dma_unmap param;
struct iommu_table *tbl = container->tbl;
+ if (!container->enabled)
+ return -EPERM;
+
if (WARN_ON(!tbl))
return -ENXIO;
--
2.0.0
This is a pretty mechanical patch to make the next patches simpler.
The new tce_iommu_unuse_page() helper does put_page() now but it might
skip that once the memory registering patch is applied.
While we are here, this removes unnecessary checks of the value returned
by pfn_to_page() as it cannot possibly return NULL.
This moves tce_iommu_disable() later to let tce_iommu_clear() know whether
the container has been enabled, because if it has not been, then
put_page() must not be called on TCEs from the TCE table. This situation
is not yet possible but it will be after the KVM acceleration patchset is
applied.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v6:
* tce_get_hva() returns hva via a pointer
---
drivers/vfio/vfio_iommu_spapr_tce.c | 68 +++++++++++++++++++++++++++----------
1 file changed, 50 insertions(+), 18 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index c137bb3..ec5ee83 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -196,7 +196,6 @@ static void tce_iommu_release(void *iommu_data)
struct iommu_table *tbl = container->tbl;
WARN_ON(tbl && !tbl->it_group);
- tce_iommu_disable(container);
if (tbl) {
tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
@@ -204,63 +203,96 @@ static void tce_iommu_release(void *iommu_data)
if (tbl->it_group)
tce_iommu_detach_group(iommu_data, tbl->it_group);
}
+
+ tce_iommu_disable(container);
+
mutex_destroy(&container->lock);
kfree(container);
}
+static void tce_iommu_unuse_page(struct tce_container *container,
+ unsigned long oldtce)
+{
+ struct page *page;
+
+ if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
+ return;
+
+ /*
+ * VFIO cannot map/unmap when a container is not enabled so
+ * we would not need this check but KVM could map/unmap and if
+ * this happened, we must not put pages as KVM does not get them as
+	 * it expects memory pre-registration to do this part.
+ */
+ if (!container->enabled)
+ return;
+
+ page = pfn_to_page(__pa(oldtce) >> PAGE_SHIFT);
+
+ if (oldtce & TCE_PCI_WRITE)
+ SetPageDirty(page);
+
+ put_page(page);
+}
+
static int tce_iommu_clear(struct tce_container *container,
struct iommu_table *tbl,
unsigned long entry, unsigned long pages)
{
unsigned long oldtce;
- struct page *page;
for ( ; pages; --pages, ++entry) {
oldtce = iommu_clear_tce(tbl, entry);
if (!oldtce)
continue;
- page = pfn_to_page(oldtce >> PAGE_SHIFT);
- WARN_ON(!page);
- if (page) {
- if (oldtce & TCE_PCI_WRITE)
- SetPageDirty(page);
- put_page(page);
- }
+ tce_iommu_unuse_page(container, (unsigned long) __va(oldtce));
}
return 0;
}
+static int tce_get_hva(unsigned long tce, unsigned long *hva)
+{
+ struct page *page = NULL;
+ enum dma_data_direction direction = iommu_tce_direction(tce);
+
+ if (get_user_pages_fast(tce & PAGE_MASK, 1,
+ direction != DMA_TO_DEVICE, &page) != 1)
+ return -EFAULT;
+
+ *hva = (unsigned long) page_address(page);
+
+ return 0;
+}
+
static long tce_iommu_build(struct tce_container *container,
struct iommu_table *tbl,
unsigned long entry, unsigned long tce, unsigned long pages)
{
long i, ret = 0;
- struct page *page = NULL;
+ struct page *page;
unsigned long hva;
enum dma_data_direction direction = iommu_tce_direction(tce);
for (i = 0; i < pages; ++i) {
- ret = get_user_pages_fast(tce & PAGE_MASK, 1,
- direction != DMA_TO_DEVICE, &page);
- if (unlikely(ret != 1)) {
- ret = -EFAULT;
+ ret = tce_get_hva(tce, &hva);
+ if (ret)
break;
- }
+ page = pfn_to_page(__pa(hva) >> PAGE_SHIFT);
if (!tce_page_is_contained(page, tbl->it_page_shift)) {
ret = -EPERM;
break;
}
- hva = (unsigned long) page_address(page) +
- (tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
+ /* Preserve offset within IOMMU page */
+ hva |= tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
ret = iommu_tce_build(tbl, entry + i, hva, direction);
if (ret) {
- put_page(page);
+ tce_iommu_unuse_page(container, hva);
pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
__func__, entry << tbl->it_page_shift,
tce, ret);
--
2.0.0
This is to make extended ownership and multiple groups support patches
simpler for review.
This is a mechanical patch.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
drivers/vfio/vfio_iommu_spapr_tce.c | 38 ++++++++++++++++++++++---------------
1 file changed, 23 insertions(+), 15 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index ec5ee83..244c958 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -478,16 +478,21 @@ static int tce_iommu_attach_group(void *iommu_data,
iommu_group_id(container->tbl->it_group),
iommu_group_id(iommu_group));
ret = -EBUSY;
- } else if (container->enabled) {
+ goto unlock_exit;
+ }
+
+ if (container->enabled) {
pr_err("tce_vfio: attaching group #%u to enabled container\n",
iommu_group_id(iommu_group));
ret = -EBUSY;
- } else {
- ret = iommu_take_ownership(tbl);
- if (!ret)
- container->tbl = tbl;
+ goto unlock_exit;
}
+ ret = iommu_take_ownership(tbl);
+ if (!ret)
+ container->tbl = tbl;
+
+unlock_exit:
mutex_unlock(&container->lock);
return ret;
@@ -505,18 +510,21 @@ static void tce_iommu_detach_group(void *iommu_data,
pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
iommu_group_id(iommu_group),
iommu_group_id(tbl->it_group));
- } else {
- if (container->enabled) {
- pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
- iommu_group_id(tbl->it_group));
- tce_iommu_disable(container);
- }
+ goto unlock_exit;
+ }
- /* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
- iommu_group_id(iommu_group), iommu_group); */
- container->tbl = NULL;
- iommu_release_ownership(tbl);
+ if (container->enabled) {
+ pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
+ iommu_group_id(tbl->it_group));
+ tce_iommu_disable(container);
}
+
+ /* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
+ iommu_group_id(iommu_group), iommu_group); */
+ container->tbl = NULL;
+ iommu_release_ownership(tbl);
+
+unlock_exit:
mutex_unlock(&container->lock);
}
--
2.0.0
Normally a bitmap from the iommu_table is used to track which TCE entries
are in use. Since we are going to use iommu_table without its locks and
do xchg() instead, it becomes essential not to set bits which are not
implied by the direction flag.
This adds iommu_direction_to_tce_perm() (its counterpart is there already)
and uses it for powernv's pnv_tce_build().
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/iommu.h | 1 +
arch/powerpc/kernel/iommu.c | 15 +++++++++++++++
arch/powerpc/platforms/powernv/pci.c | 7 +------
3 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index ed69b7d..2af2d70 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -203,6 +203,7 @@ extern int iommu_take_ownership(struct iommu_table *tbl);
extern void iommu_release_ownership(struct iommu_table *tbl);
extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
+extern unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir);
#endif /* __KERNEL__ */
#endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 1b4a178..029b1ea 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -871,6 +871,21 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
}
}
+unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir)
+{
+ switch (dir) {
+ case DMA_BIDIRECTIONAL:
+ return TCE_PCI_READ | TCE_PCI_WRITE;
+ case DMA_FROM_DEVICE:
+ return TCE_PCI_WRITE;
+ case DMA_TO_DEVICE:
+ return TCE_PCI_READ;
+ default:
+ return 0;
+ }
+}
+EXPORT_SYMBOL_GPL(iommu_direction_to_tce_perm);
+
#ifdef CONFIG_IOMMU_API
/*
* SPAPR TCE API
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 54323d6..609f5b1 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -593,15 +593,10 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
unsigned long uaddr, enum dma_data_direction direction,
struct dma_attrs *attrs, bool rm)
{
- u64 proto_tce;
+ u64 proto_tce = iommu_direction_to_tce_perm(direction);
__be64 *tcep, *tces;
u64 rpn;
- proto_tce = TCE_PCI_READ; // Read allowed
-
- if (direction != DMA_TO_DEVICE)
- proto_tce |= TCE_PCI_WRITE;
-
tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
rpn = __pa(uaddr) >> tbl->it_page_shift;
--
2.0.0
This adds an iommu_table_ops struct and puts a pointer to it into
the iommu_table struct. This moves the tce_build/tce_free/tce_get/tce_flush
callbacks from ppc_md to the new struct where they really belong.
This adds the requirement for @it_ops to be initialized before calling
iommu_init_table() to make sure that we do not leave any IOMMU table
with iommu_table_ops uninitialized. This is not a parameter of
iommu_init_table() though, as there will be cases when iommu_init_table()
will not be called on TCE tables, for example VFIO.
This does s/tce_build/set/, s/tce_free/clear/ and removes the redundant
"tce_" prefixes.
This removes tce_xxx_rm handlers from ppc_md but does not add
them to iommu_table_ops as this will be done later if we decide to
support TCE hypercalls in real mode.
For pSeries, this always uses tce_buildmulti_pSeriesLP/
tce_freemulti_pSeriesLP. This changes the multi callbacks to fall back to
tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not
present. The reason for this is that we still have to support the
"multitce=off" boot parameter in disable_multitce() and we do not want to
walk through all IOMMU tables in the system replacing the "multi"
callbacks with single-TCE ones.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/iommu.h | 17 +++++++++++
arch/powerpc/include/asm/machdep.h | 25 ----------------
arch/powerpc/kernel/iommu.c | 46 +++++++++++++++--------------
arch/powerpc/kernel/vio.c | 5 ++++
arch/powerpc/platforms/cell/iommu.c | 8 +++--
arch/powerpc/platforms/pasemi/iommu.c | 7 +++--
arch/powerpc/platforms/powernv/pci-ioda.c | 2 ++
arch/powerpc/platforms/powernv/pci-p5ioc2.c | 1 +
arch/powerpc/platforms/powernv/pci.c | 23 ++++-----------
arch/powerpc/platforms/powernv/pci.h | 1 +
arch/powerpc/platforms/pseries/iommu.c | 34 +++++++++++----------
arch/powerpc/sysdev/dart_iommu.c | 12 ++++----
12 files changed, 93 insertions(+), 88 deletions(-)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 2af2d70..d909e2a 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -43,6 +43,22 @@
extern int iommu_is_off;
extern int iommu_force_on;
+struct iommu_table_ops {
+ int (*set)(struct iommu_table *tbl,
+ long index, long npages,
+ unsigned long uaddr,
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs);
+ void (*clear)(struct iommu_table *tbl,
+ long index, long npages);
+ unsigned long (*get)(struct iommu_table *tbl, long index);
+ void (*flush)(struct iommu_table *tbl);
+};
+
+/* These are used by VIO */
+extern struct iommu_table_ops iommu_table_lpar_multi_ops;
+extern struct iommu_table_ops iommu_table_pseries_ops;
+
/*
* IOMAP_MAX_ORDER defines the largest contiguous block
* of dma space we can get. IOMAP_MAX_ORDER = 13
@@ -77,6 +93,7 @@ struct iommu_table {
#ifdef CONFIG_IOMMU_API
struct iommu_group *it_group;
#endif
+ struct iommu_table_ops *it_ops;
void (*set_bypass)(struct iommu_table *tbl, bool enable);
};
diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index c8175a3..2abe744 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -65,31 +65,6 @@ struct machdep_calls {
* destroyed as well */
void (*hpte_clear_all)(void);
- int (*tce_build)(struct iommu_table *tbl,
- long index,
- long npages,
- unsigned long uaddr,
- enum dma_data_direction direction,
- struct dma_attrs *attrs);
- void (*tce_free)(struct iommu_table *tbl,
- long index,
- long npages);
- unsigned long (*tce_get)(struct iommu_table *tbl,
- long index);
- void (*tce_flush)(struct iommu_table *tbl);
-
- /* _rm versions are for real mode use only */
- int (*tce_build_rm)(struct iommu_table *tbl,
- long index,
- long npages,
- unsigned long uaddr,
- enum dma_data_direction direction,
- struct dma_attrs *attrs);
- void (*tce_free_rm)(struct iommu_table *tbl,
- long index,
- long npages);
- void (*tce_flush_rm)(struct iommu_table *tbl);
-
void __iomem * (*ioremap)(phys_addr_t addr, unsigned long size,
unsigned long flags, void *caller);
void (*iounmap)(volatile void __iomem *token);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 029b1ea..eceb214 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -322,11 +322,11 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
ret = entry << tbl->it_page_shift; /* Set the return dma address */
/* Put the TCEs in the HW table */
- build_fail = ppc_md.tce_build(tbl, entry, npages,
+ build_fail = tbl->it_ops->set(tbl, entry, npages,
(unsigned long)page &
IOMMU_PAGE_MASK(tbl), direction, attrs);
- /* ppc_md.tce_build() only returns non-zero for transient errors.
+ /* tbl->it_ops->set() only returns non-zero for transient errors.
* Clean up the table bitmap in this case and return
* DMA_ERROR_CODE. For all other errors the functionality is
* not altered.
@@ -337,8 +337,8 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
}
/* Flush/invalidate TLB caches if necessary */
- if (ppc_md.tce_flush)
- ppc_md.tce_flush(tbl);
+ if (tbl->it_ops->flush)
+ tbl->it_ops->flush(tbl);
/* Make sure updates are seen by hardware */
mb();
@@ -408,7 +408,7 @@ static void __iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
if (!iommu_free_check(tbl, dma_addr, npages))
return;
- ppc_md.tce_free(tbl, entry, npages);
+ tbl->it_ops->clear(tbl, entry, npages);
spin_lock_irqsave(&(pool->lock), flags);
bitmap_clear(tbl->it_map, free_entry, npages);
@@ -424,8 +424,8 @@ static void iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
* not do an mb() here on purpose, it is not needed on any of
* the current platforms.
*/
- if (ppc_md.tce_flush)
- ppc_md.tce_flush(tbl);
+ if (tbl->it_ops->flush)
+ tbl->it_ops->flush(tbl);
}
int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
@@ -495,7 +495,7 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
npages, entry, dma_addr);
/* Insert into HW table */
- build_fail = ppc_md.tce_build(tbl, entry, npages,
+ build_fail = tbl->it_ops->set(tbl, entry, npages,
vaddr & IOMMU_PAGE_MASK(tbl),
direction, attrs);
if(unlikely(build_fail))
@@ -534,8 +534,8 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
}
/* Flush/invalidate TLB caches if necessary */
- if (ppc_md.tce_flush)
- ppc_md.tce_flush(tbl);
+ if (tbl->it_ops->flush)
+ tbl->it_ops->flush(tbl);
DBG("mapped %d elements:\n", outcount);
@@ -600,8 +600,8 @@ void ppc_iommu_unmap_sg(struct iommu_table *tbl, struct scatterlist *sglist,
* do not do an mb() here, the affected platforms do not need it
* when freeing.
*/
- if (ppc_md.tce_flush)
- ppc_md.tce_flush(tbl);
+ if (tbl->it_ops->flush)
+ tbl->it_ops->flush(tbl);
}
static void iommu_table_clear(struct iommu_table *tbl)
@@ -613,17 +613,17 @@ static void iommu_table_clear(struct iommu_table *tbl)
*/
if (!is_kdump_kernel() || is_fadump_active()) {
/* Clear the table in case firmware left allocations in it */
- ppc_md.tce_free(tbl, tbl->it_offset, tbl->it_size);
+ tbl->it_ops->clear(tbl, tbl->it_offset, tbl->it_size);
return;
}
#ifdef CONFIG_CRASH_DUMP
- if (ppc_md.tce_get) {
+ if (tbl->it_ops->get) {
unsigned long index, tceval, tcecount = 0;
/* Reserve the existing mappings left by the first kernel. */
for (index = 0; index < tbl->it_size; index++) {
- tceval = ppc_md.tce_get(tbl, index + tbl->it_offset);
+ tceval = tbl->it_ops->get(tbl, index + tbl->it_offset);
/*
* Freed TCE entry contains 0x7fffffffffffffff on JS20
*/
@@ -657,6 +657,8 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
unsigned int i;
struct iommu_pool *p;
+ BUG_ON(!tbl->it_ops);
+
/* number of bytes needed for the bitmap */
sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
@@ -934,8 +936,8 @@ EXPORT_SYMBOL_GPL(iommu_tce_direction);
void iommu_flush_tce(struct iommu_table *tbl)
{
/* Flush/invalidate TLB caches if necessary */
- if (ppc_md.tce_flush)
- ppc_md.tce_flush(tbl);
+ if (tbl->it_ops->flush)
+ tbl->it_ops->flush(tbl);
/* Make sure updates are seen by hardware */
mb();
@@ -946,7 +948,7 @@ int iommu_tce_clear_param_check(struct iommu_table *tbl,
unsigned long ioba, unsigned long tce_value,
unsigned long npages)
{
- /* ppc_md.tce_free() does not support any value but 0 */
+ /* tbl->it_ops->clear() does not support any value but 0 */
if (tce_value)
return -EINVAL;
@@ -994,9 +996,9 @@ unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
spin_lock(&(pool->lock));
- oldtce = ppc_md.tce_get(tbl, entry);
+ oldtce = tbl->it_ops->get(tbl, entry);
if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
- ppc_md.tce_free(tbl, entry, 1);
+ tbl->it_ops->clear(tbl, entry, 1);
else
oldtce = 0;
@@ -1019,10 +1021,10 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
spin_lock(&(pool->lock));
- oldtce = ppc_md.tce_get(tbl, entry);
+ oldtce = tbl->it_ops->get(tbl, entry);
/* Add new entry if it is not busy */
if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
- ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, direction, NULL);
+ ret = tbl->it_ops->set(tbl, entry, 1, hwaddr, direction, NULL);
spin_unlock(&(pool->lock));
diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index 5bfdab9..b41426c 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -1196,6 +1196,11 @@ static struct iommu_table *vio_build_iommu_table(struct vio_dev *dev)
tbl->it_type = TCE_VB;
tbl->it_blocksize = 16;
+ if (firmware_has_feature(FW_FEATURE_LPAR))
+ tbl->it_ops = &iommu_table_lpar_multi_ops;
+ else
+ tbl->it_ops = &iommu_table_pseries_ops;
+
return iommu_init_table(tbl, -1);
}
diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
index c7c8720..72763a8 100644
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -465,6 +465,11 @@ static inline u32 cell_iommu_get_ioid(struct device_node *np)
return *ioid;
}
+static struct iommu_table_ops cell_iommu_ops = {
+ .set = tce_build_cell,
+ .clear = tce_free_cell
+};
+
static struct iommu_window * __init
cell_iommu_setup_window(struct cbe_iommu *iommu, struct device_node *np,
unsigned long offset, unsigned long size,
@@ -491,6 +496,7 @@ cell_iommu_setup_window(struct cbe_iommu *iommu, struct device_node *np,
window->table.it_offset =
(offset >> window->table.it_page_shift) + pte_offset;
window->table.it_size = size >> window->table.it_page_shift;
+ window->table.it_ops = &cell_iommu_ops;
iommu_init_table(&window->table, iommu->nid);
@@ -1200,8 +1206,6 @@ static int __init cell_iommu_init(void)
/* Setup various ppc_md. callbacks */
ppc_md.pci_dma_dev_setup = cell_pci_dma_dev_setup;
ppc_md.dma_get_required_mask = cell_dma_get_required_mask;
- ppc_md.tce_build = tce_build_cell;
- ppc_md.tce_free = tce_free_cell;
if (!iommu_fixed_disabled && cell_iommu_fixed_mapping_init() == 0)
goto bail;
diff --git a/arch/powerpc/platforms/pasemi/iommu.c b/arch/powerpc/platforms/pasemi/iommu.c
index 2e576f2..b7245b2 100644
--- a/arch/powerpc/platforms/pasemi/iommu.c
+++ b/arch/powerpc/platforms/pasemi/iommu.c
@@ -132,6 +132,10 @@ static void iobmap_free(struct iommu_table *tbl, long index,
}
}
+static struct iommu_table_ops iommu_table_iobmap_ops = {
+ .set = iobmap_build,
+ .clear = iobmap_free
+};
static void iommu_table_iobmap_setup(void)
{
@@ -151,6 +155,7 @@ static void iommu_table_iobmap_setup(void)
* Should probably be 8 (64 bytes)
*/
iommu_table_iobmap.it_blocksize = 4;
+ iommu_table_iobmap.it_ops = &iommu_table_iobmap_ops;
iommu_init_table(&iommu_table_iobmap, 0);
pr_debug(" <- %s\n", __func__);
}
@@ -250,8 +255,6 @@ void __init iommu_init_early_pasemi(void)
ppc_md.pci_dma_dev_setup = pci_dma_dev_setup_pasemi;
ppc_md.pci_dma_bus_setup = pci_dma_bus_setup_pasemi;
- ppc_md.tce_build = iobmap_build;
- ppc_md.tce_free = iobmap_free;
set_pci_dma_ops(&dma_iommu_ops);
}
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 6c9ff2b..85e64a5 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1231,6 +1231,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
TCE_PCI_SWINV_FREE |
TCE_PCI_SWINV_PAIR);
}
+ tbl->it_ops = &pnv_iommu_ops;
iommu_init_table(tbl, phb->hose->node);
iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
@@ -1364,6 +1365,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
8);
tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
}
+ tbl->it_ops = &pnv_iommu_ops;
iommu_init_table(tbl, phb->hose->node);
iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index 6ef6d4d..0256fcc 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -87,6 +87,7 @@ static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
struct pci_dev *pdev)
{
if (phb->p5ioc2.iommu_table.it_map == NULL) {
+ phb->p5ioc2.iommu_table.it_ops = &pnv_iommu_ops;
iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
iommu_register_group(&phb->p5ioc2.iommu_table,
pci_domain_nr(phb->hose->bus), phb->opal_id);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 609f5b1..c619ec6 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -647,18 +647,11 @@ static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
return ((u64 *)tbl->it_base)[index - tbl->it_offset];
}
-static int pnv_tce_build_rm(struct iommu_table *tbl, long index, long npages,
- unsigned long uaddr,
- enum dma_data_direction direction,
- struct dma_attrs *attrs)
-{
- return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, true);
-}
-
-static void pnv_tce_free_rm(struct iommu_table *tbl, long index, long npages)
-{
- pnv_tce_free(tbl, index, npages, true);
-}
+struct iommu_table_ops pnv_iommu_ops = {
+ .set = pnv_tce_build_vm,
+ .clear = pnv_tce_free_vm,
+ .get = pnv_tce_get,
+};
void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
void *tce_mem, u64 tce_size,
@@ -692,6 +685,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
return NULL;
pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
+ tbl->it_ops = &pnv_iommu_ops;
iommu_init_table(tbl, hose->node);
iommu_register_group(tbl, pci_domain_nr(hose->bus), 0);
@@ -817,11 +811,6 @@ void __init pnv_pci_init(void)
/* Configure IOMMU DMA hooks */
ppc_md.pci_dma_dev_setup = pnv_pci_dma_dev_setup;
- ppc_md.tce_build = pnv_tce_build_vm;
- ppc_md.tce_free = pnv_tce_free_vm;
- ppc_md.tce_build_rm = pnv_tce_build_rm;
- ppc_md.tce_free_rm = pnv_tce_free_rm;
- ppc_md.tce_get = pnv_tce_get;
set_pci_dma_ops(&dma_iommu_ops);
/* Configure MSIs */
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 6c02ff8..f726700 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -216,6 +216,7 @@ extern struct pci_ops pnv_pci_ops;
#ifdef CONFIG_EEH
extern struct pnv_eeh_ops ioda_eeh_ops;
#endif
+extern struct iommu_table_ops pnv_iommu_ops;
void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
unsigned char *log_buff);
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 7803a19..48d1fde 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -192,7 +192,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
int ret = 0;
unsigned long flags;
- if (npages == 1) {
+ if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
direction, attrs);
}
@@ -284,6 +284,9 @@ static void tce_freemulti_pSeriesLP(struct iommu_table *tbl, long tcenum, long n
{
u64 rc;
+ if (!firmware_has_feature(FW_FEATURE_MULTITCE))
+ return tce_free_pSeriesLP(tbl, tcenum, npages);
+
rc = plpar_tce_stuff((u64)tbl->it_index, (u64)tcenum << 12, 0, npages);
if (rc && printk_ratelimit()) {
@@ -459,7 +462,6 @@ static int tce_setrange_multi_pSeriesLP_walk(unsigned long start_pfn,
return tce_setrange_multi_pSeriesLP(start_pfn, num_pfn, arg);
}
-
#ifdef CONFIG_PCI
static void iommu_table_setparms(struct pci_controller *phb,
struct device_node *dn,
@@ -545,6 +547,12 @@ static void iommu_table_setparms_lpar(struct pci_controller *phb,
tbl->it_size = size >> tbl->it_page_shift;
}
+struct iommu_table_ops iommu_table_pseries_ops = {
+ .set = tce_build_pSeries,
+ .clear = tce_free_pSeries,
+ .get = tce_get_pseries
+};
+
static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
{
struct device_node *dn;
@@ -613,6 +621,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
pci->phb->node);
iommu_table_setparms(pci->phb, dn, tbl);
+ tbl->it_ops = &iommu_table_pseries_ops;
pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
iommu_register_group(tbl, pci_domain_nr(bus), 0);
@@ -624,6 +633,11 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
pr_debug("ISA/IDE, window size is 0x%llx\n", pci->phb->dma_window_size);
}
+struct iommu_table_ops iommu_table_lpar_multi_ops = {
+ .set = tce_buildmulti_pSeriesLP,
+ .clear = tce_freemulti_pSeriesLP,
+ .get = tce_get_pSeriesLP
+};
static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
{
@@ -658,6 +672,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
ppci->phb->node);
iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
+ tbl->it_ops = &iommu_table_lpar_multi_ops;
ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
iommu_register_group(tbl, pci_domain_nr(bus), 0);
pr_debug(" created table: %p\n", ppci->iommu_table);
@@ -685,6 +700,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
phb->node);
iommu_table_setparms(phb, dn, tbl);
+ tbl->it_ops = &iommu_table_pseries_ops;
PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
set_iommu_table_base_and_group(&dev->dev,
@@ -1107,6 +1123,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
pci->phb->node);
iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
+ tbl->it_ops = &iommu_table_lpar_multi_ops;
pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
iommu_register_group(tbl, pci_domain_nr(pci->phb->bus), 0);
pr_debug(" created table: %p\n", pci->iommu_table);
@@ -1299,22 +1316,11 @@ void iommu_init_early_pSeries(void)
return;
if (firmware_has_feature(FW_FEATURE_LPAR)) {
- if (firmware_has_feature(FW_FEATURE_MULTITCE)) {
- ppc_md.tce_build = tce_buildmulti_pSeriesLP;
- ppc_md.tce_free = tce_freemulti_pSeriesLP;
- } else {
- ppc_md.tce_build = tce_build_pSeriesLP;
- ppc_md.tce_free = tce_free_pSeriesLP;
- }
- ppc_md.tce_get = tce_get_pSeriesLP;
ppc_md.pci_dma_bus_setup = pci_dma_bus_setup_pSeriesLP;
ppc_md.pci_dma_dev_setup = pci_dma_dev_setup_pSeriesLP;
ppc_md.dma_set_mask = dma_set_mask_pSeriesLP;
ppc_md.dma_get_required_mask = dma_get_required_mask_pSeriesLP;
} else {
- ppc_md.tce_build = tce_build_pSeries;
- ppc_md.tce_free = tce_free_pSeries;
- ppc_md.tce_get = tce_get_pseries;
ppc_md.pci_dma_bus_setup = pci_dma_bus_setup_pSeries;
ppc_md.pci_dma_dev_setup = pci_dma_dev_setup_pSeries;
}
@@ -1332,8 +1338,6 @@ static int __init disable_multitce(char *str)
firmware_has_feature(FW_FEATURE_LPAR) &&
firmware_has_feature(FW_FEATURE_MULTITCE)) {
printk(KERN_INFO "Disabling MULTITCE firmware feature\n");
- ppc_md.tce_build = tce_build_pSeriesLP;
- ppc_md.tce_free = tce_free_pSeriesLP;
powerpc_firmware_features &= ~FW_FEATURE_MULTITCE;
}
return 1;
diff --git a/arch/powerpc/sysdev/dart_iommu.c b/arch/powerpc/sysdev/dart_iommu.c
index 9e5353f..ab361a3 100644
--- a/arch/powerpc/sysdev/dart_iommu.c
+++ b/arch/powerpc/sysdev/dart_iommu.c
@@ -286,6 +286,12 @@ static int __init dart_init(struct device_node *dart_node)
return 0;
}
+static struct iommu_table_ops iommu_dart_ops = {
+ .set = dart_build,
+ .clear = dart_free,
+ .flush = dart_flush,
+};
+
static void iommu_table_dart_setup(void)
{
iommu_table_dart.it_busno = 0;
@@ -298,6 +304,7 @@ static void iommu_table_dart_setup(void)
iommu_table_dart.it_base = (unsigned long)dart_vbase;
iommu_table_dart.it_index = 0;
iommu_table_dart.it_blocksize = 1;
+ iommu_table_dart.it_ops = &iommu_dart_ops;
iommu_init_table(&iommu_table_dart, -1);
/* Reserve the last page of the DART to avoid possible prefetch
@@ -386,11 +393,6 @@ void __init iommu_init_early_dart(void)
if (dart_init(dn) != 0)
goto bail;
- /* Setup low level TCE operations for the core IOMMU code */
- ppc_md.tce_build = dart_build;
- ppc_md.tce_free = dart_free;
- ppc_md.tce_flush = dart_flush;
-
/* Setup bypass if supported */
if (dart_is_u4)
ppc_md.dma_set_mask = dart_dma_set_mask;
--
2.0.0
This replaces multiple calls of kzalloc_node() with a new
iommu_table_alloc() helper. Right now the helper simply calls
kzalloc_node(); later it will be modified to allocate an
iommu_table_group struct which embeds the iommu_table(s), with a single
iommu_table in it to begin with.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/iommu.h | 1 +
arch/powerpc/kernel/iommu.c | 9 +++++++++
arch/powerpc/platforms/powernv/pci.c | 2 +-
arch/powerpc/platforms/pseries/iommu.c | 12 ++++--------
4 files changed, 15 insertions(+), 9 deletions(-)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index d909e2a..eb75726 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -117,6 +117,7 @@ static inline void *get_iommu_table_base(struct device *dev)
return dev->archdata.dma_data.iommu_table_base;
}
+extern struct iommu_table *iommu_table_alloc(int node);
/* Frees table for an individual device node */
extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index eceb214..b39d00a 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -710,6 +710,15 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
return tbl;
}
+struct iommu_table *iommu_table_alloc(int node)
+{
+ struct iommu_table *tbl;
+
+ tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
+
+ return tbl;
+}
+
void iommu_free_table(struct iommu_table *tbl, const char *node_name)
{
unsigned long bitmap_sz;
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index c619ec6..1c31ac8 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -680,7 +680,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
hose->dn->full_name);
return NULL;
}
- tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, hose->node);
+ tbl = iommu_table_alloc(hose->node);
if (WARN_ON(!tbl))
return NULL;
pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 48d1fde..41a8b14 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -617,8 +617,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
pci->phb->dma_window_size = 0x8000000ul;
pci->phb->dma_window_base_cur = 0x8000000ul;
- tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
- pci->phb->node);
+ tbl = iommu_table_alloc(pci->phb->node);
iommu_table_setparms(pci->phb, dn, tbl);
tbl->it_ops = &iommu_table_pseries_ops;
@@ -669,8 +668,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
pdn->full_name, ppci->iommu_table);
if (!ppci->iommu_table) {
- tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
- ppci->phb->node);
+ tbl = iommu_table_alloc(ppci->phb->node);
iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
tbl->it_ops = &iommu_table_lpar_multi_ops;
ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
@@ -697,8 +695,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
struct pci_controller *phb = PCI_DN(dn)->phb;
pr_debug(" --> first child, no bridge. Allocating iommu table.\n");
- tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
- phb->node);
+ tbl = iommu_table_alloc(phb->node);
iommu_table_setparms(phb, dn, tbl);
tbl->it_ops = &iommu_table_pseries_ops;
PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
@@ -1120,8 +1117,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
pci = PCI_DN(pdn);
if (!pci->iommu_table) {
- tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
- pci->phb->node);
+ tbl = iommu_table_alloc(pci->phb->node);
iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
tbl->it_ops = &iommu_table_lpar_multi_ops;
pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
--
2.0.0
Modern IBM POWERPC systems support multiple (currently two) TCE tables
per IOMMU group (a.k.a. PE). This adds an iommu_table_group container
for TCE tables. Right now just one table is supported.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/iommu.h | 18 +++--
arch/powerpc/kernel/iommu.c | 34 ++++----
arch/powerpc/platforms/powernv/pci-ioda.c | 38 +++++----
arch/powerpc/platforms/powernv/pci-p5ioc2.c | 17 ++--
arch/powerpc/platforms/powernv/pci.c | 2 +-
arch/powerpc/platforms/powernv/pci.h | 4 +-
arch/powerpc/platforms/pseries/iommu.c | 9 ++-
drivers/vfio/vfio_iommu_spapr_tce.c | 120 ++++++++++++++++++++--------
8 files changed, 160 insertions(+), 82 deletions(-)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index eb75726..667aa1a 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -90,9 +90,7 @@ struct iommu_table {
struct iommu_pool pools[IOMMU_NR_POOLS];
unsigned long *it_map; /* A simple allocation bitmap for now */
unsigned long it_page_shift;/* table iommu page size */
-#ifdef CONFIG_IOMMU_API
- struct iommu_group *it_group;
-#endif
+ struct iommu_table_group *it_group;
struct iommu_table_ops *it_ops;
void (*set_bypass)(struct iommu_table *tbl, bool enable);
};
@@ -126,14 +124,24 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
*/
extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
int nid);
+
+#define IOMMU_TABLE_GROUP_MAX_TABLES 1
+
+struct iommu_table_group {
#ifdef CONFIG_IOMMU_API
-extern void iommu_register_group(struct iommu_table *tbl,
+ struct iommu_group *group;
+#endif
+ struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
+};
+
+#ifdef CONFIG_IOMMU_API
+extern void iommu_register_group(struct iommu_table_group *table_group,
int pci_domain_number, unsigned long pe_num);
extern int iommu_add_device(struct device *dev);
extern void iommu_del_device(struct device *dev);
extern int __init tce_iommu_bus_notifier_init(void);
#else
-static inline void iommu_register_group(struct iommu_table *tbl,
+static inline void iommu_register_group(struct iommu_table_group *table_group,
int pci_domain_number,
unsigned long pe_num)
{
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b39d00a..fd49c8e 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -712,17 +712,20 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
struct iommu_table *iommu_table_alloc(int node)
{
- struct iommu_table *tbl;
+ struct iommu_table_group *table_group;
- tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
+ table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL,
+ node);
+ table_group->tables[0].it_group = table_group;
- return tbl;
+ return &table_group->tables[0];
}
void iommu_free_table(struct iommu_table *tbl, const char *node_name)
{
unsigned long bitmap_sz;
unsigned int order;
+ struct iommu_table_group *table_group = tbl->it_group;
if (!tbl || !tbl->it_map) {
printk(KERN_ERR "%s: expected TCE map for %s\n", __func__,
@@ -738,9 +741,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
clear_bit(0, tbl->it_map);
#ifdef CONFIG_IOMMU_API
- if (tbl->it_group) {
- iommu_group_put(tbl->it_group);
- BUG_ON(tbl->it_group);
+ if (table_group->group) {
+ iommu_group_put(table_group->group);
+ BUG_ON(table_group->group);
}
#endif
@@ -756,7 +759,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
free_pages((unsigned long) tbl->it_map, order);
/* free table */
- kfree(tbl);
+ kfree(table_group);
}
/* Creates TCEs for a user provided buffer. The user buffer must be
@@ -903,11 +906,12 @@ EXPORT_SYMBOL_GPL(iommu_direction_to_tce_perm);
*/
static void group_release(void *iommu_data)
{
- struct iommu_table *tbl = iommu_data;
- tbl->it_group = NULL;
+ struct iommu_table_group *table_group = iommu_data;
+
+ table_group->group = NULL;
}
-void iommu_register_group(struct iommu_table *tbl,
+void iommu_register_group(struct iommu_table_group *table_group,
int pci_domain_number, unsigned long pe_num)
{
struct iommu_group *grp;
@@ -919,8 +923,8 @@ void iommu_register_group(struct iommu_table *tbl,
PTR_ERR(grp));
return;
}
- tbl->it_group = grp;
- iommu_group_set_iommudata(grp, tbl, group_release);
+ table_group->group = grp;
+ iommu_group_set_iommudata(grp, table_group, group_release);
name = kasprintf(GFP_KERNEL, "domain%d-pe%lx",
pci_domain_number, pe_num);
if (!name)
@@ -1108,7 +1112,7 @@ int iommu_add_device(struct device *dev)
}
tbl = get_iommu_table_base(dev);
- if (!tbl || !tbl->it_group) {
+ if (!tbl || !tbl->it_group || !tbl->it_group->group) {
pr_debug("%s: Skipping device %s with no tbl\n",
__func__, dev_name(dev));
return 0;
@@ -1116,7 +1120,7 @@ int iommu_add_device(struct device *dev)
pr_debug("%s: Adding %s to iommu group %d\n",
__func__, dev_name(dev),
- iommu_group_id(tbl->it_group));
+ iommu_group_id(tbl->it_group->group));
if (PAGE_SIZE < IOMMU_PAGE_SIZE(tbl)) {
pr_err("%s: Invalid IOMMU page size %lx (%lx) on %s\n",
@@ -1125,7 +1129,7 @@ int iommu_add_device(struct device *dev)
return -EINVAL;
}
- return iommu_group_add_device(tbl->it_group, dev);
+ return iommu_group_add_device(tbl->it_group->group, dev);
}
EXPORT_SYMBOL_GPL(iommu_add_device);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 85e64a5..a964c50 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -23,6 +23,7 @@
#include <linux/io.h>
#include <linux/msi.h>
#include <linux/memblock.h>
+#include <linux/iommu.h>
#include <asm/sections.h>
#include <asm/io.h>
@@ -989,7 +990,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
pe = &phb->ioda.pe_array[pdn->pe_number];
WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
- set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
+ set_iommu_table_base_and_group(&pdev->dev, &pe->table_group.tables[0]);
}
static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
@@ -1016,7 +1017,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
} else {
dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
set_dma_ops(&pdev->dev, &dma_iommu_ops);
- set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+ set_iommu_table_base(&pdev->dev, &pe->table_group.tables[0]);
}
*pdev->dev.dma_mask = dma_mask;
return 0;
@@ -1053,9 +1054,10 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
list_for_each_entry(dev, &bus->devices, bus_list) {
if (add_to_iommu_group)
set_iommu_table_base_and_group(&dev->dev,
- &pe->tce32_table);
+ &pe->table_group.tables[0]);
else
- set_iommu_table_base(&dev->dev, &pe->tce32_table);
+ set_iommu_table_base(&dev->dev,
+ &pe->table_group.tables[0]);
if (dev->subordinate)
pnv_ioda_setup_bus_dma(pe, dev->subordinate,
@@ -1145,8 +1147,8 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
__be64 *startp, __be64 *endp, bool rm)
{
- struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
- tce32_table);
+ struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe,
+ table_group);
struct pnv_phb *phb = pe->phb;
if (phb->type == PNV_PHB_IODA1)
@@ -1211,8 +1213,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
}
}
+ /* Setup iommu */
+ pe->table_group.tables[0].it_group = &pe->table_group;
+
/* Setup linux iommu table */
- tbl = &pe->tce32_table;
+ tbl = &pe->table_group.tables[0];
pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
base << 28, IOMMU_PAGE_SHIFT_4K);
@@ -1233,7 +1238,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
}
tbl->it_ops = &pnv_iommu_ops;
iommu_init_table(tbl, phb->hose->node);
- iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
+ iommu_register_group(&pe->table_group, phb->hose->global_number,
+ pe->pe_number);
if (pe->pdev)
set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
@@ -1251,8 +1257,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
{
- struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
- tce32_table);
+ struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe,
+ table_group);
uint16_t window_id = (pe->pe_number << 1 ) + 1;
int64_t rc;
@@ -1297,10 +1303,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
pe->tce_bypass_base = 1ull << 59;
/* Install set_bypass callback for VFIO */
- pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
+ pe->table_group.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;
/* Enable bypass by default */
- pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
+ pnv_pci_ioda2_set_bypass(&pe->table_group.tables[0], true);
}
static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
@@ -1347,8 +1353,11 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
goto fail;
}
+ /* Setup iommu */
+ pe->table_group.tables[0].it_group = &pe->table_group;
+
/* Setup linux iommu table */
- tbl = &pe->tce32_table;
+ tbl = &pe->table_group.tables[0];
pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
IOMMU_PAGE_SHIFT_4K);
@@ -1367,7 +1376,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
}
tbl->it_ops = &pnv_iommu_ops;
iommu_init_table(tbl, phb->hose->node);
- iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
+ iommu_register_group(&pe->table_group, phb->hose->global_number,
+ pe->pe_number);
if (pe->pdev)
set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index 0256fcc..ff68cac 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -86,14 +86,16 @@ static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { }
static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
struct pci_dev *pdev)
{
- if (phb->p5ioc2.iommu_table.it_map == NULL) {
- phb->p5ioc2.iommu_table.it_ops = &pnv_iommu_ops;
- iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
- iommu_register_group(&phb->p5ioc2.iommu_table,
+ if (phb->p5ioc2.table_group.tables[0].it_map == NULL) {
+ phb->p5ioc2.table_group.tables[0].it_ops = &pnv_iommu_ops;
+ iommu_init_table(&phb->p5ioc2.table_group.tables[0],
+ phb->hose->node);
+ iommu_register_group(&phb->p5ioc2.table_group,
pci_domain_nr(phb->hose->bus), phb->opal_id);
}
- set_iommu_table_base_and_group(&pdev->dev, &phb->p5ioc2.iommu_table);
+ set_iommu_table_base_and_group(&pdev->dev,
+ &phb->p5ioc2.table_group.tables[0]);
}
static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
@@ -167,9 +169,12 @@ static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
/* Setup MSI support */
pnv_pci_init_p5ioc2_msis(phb);
+ /* Setup iommu */
+ phb->p5ioc2.table_group.tables[0].it_group = &phb->p5ioc2.table_group;
+
/* Setup TCEs */
phb->dma_dev_setup = pnv_pci_p5ioc2_dma_dev_setup;
- pnv_pci_setup_iommu_table(&phb->p5ioc2.iommu_table,
+ pnv_pci_setup_iommu_table(&phb->p5ioc2.table_group.tables[0],
tce_mem, tce_size, 0,
IOMMU_PAGE_SHIFT_4K);
}
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 1c31ac8..3050cc8 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -687,7 +687,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
tbl->it_ops = &pnv_iommu_ops;
iommu_init_table(tbl, hose->node);
- iommu_register_group(tbl, pci_domain_nr(hose->bus), 0);
+ iommu_register_group(tbl->it_group, pci_domain_nr(hose->bus), 0);
/* Deal with SW invalidated TCEs when needed (BML way) */
swinvp = of_get_property(hose->dn, "linux,tce-sw-invalidate-info",
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index f726700..762d906 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -53,7 +53,7 @@ struct pnv_ioda_pe {
/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
int tce32_seg;
int tce32_segcount;
- struct iommu_table tce32_table;
+ struct iommu_table_group table_group;
phys_addr_t tce_inval_reg_phys;
/* 64-bit TCE bypass region */
@@ -138,7 +138,7 @@ struct pnv_phb {
union {
struct {
- struct iommu_table iommu_table;
+ struct iommu_table_group table_group;
} p5ioc2;
struct {
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 41a8b14..75ea581 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -622,7 +622,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
iommu_table_setparms(pci->phb, dn, tbl);
tbl->it_ops = &iommu_table_pseries_ops;
pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
- iommu_register_group(tbl, pci_domain_nr(bus), 0);
+ iommu_register_group(tbl->it_group, pci_domain_nr(bus), 0);
/* Divide the rest (1.75GB) among the children */
pci->phb->dma_window_size = 0x80000000ul;
@@ -672,7 +672,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
tbl->it_ops = &iommu_table_lpar_multi_ops;
ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
- iommu_register_group(tbl, pci_domain_nr(bus), 0);
+ iommu_register_group(tbl->it_group, pci_domain_nr(bus), 0);
pr_debug(" created table: %p\n", ppci->iommu_table);
}
}
@@ -699,7 +699,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
iommu_table_setparms(phb, dn, tbl);
tbl->it_ops = &iommu_table_pseries_ops;
PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
- iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
+ iommu_register_group(tbl->it_group, pci_domain_nr(phb->bus), 0);
set_iommu_table_base_and_group(&dev->dev,
PCI_DN(dn)->iommu_table);
return;
@@ -1121,7 +1121,8 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
tbl->it_ops = &iommu_table_lpar_multi_ops;
pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
- iommu_register_group(tbl, pci_domain_nr(pci->phb->bus), 0);
+ iommu_register_group(tbl->it_group,
+ pci_domain_nr(pci->phb->bus), 0);
pr_debug(" created table: %p\n", pci->iommu_table);
} else {
pr_debug(" found DMA window, table: %p\n", pci->iommu_table);
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 244c958..d61aad2 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -88,7 +88,7 @@ static void decrement_locked_vm(long npages)
*/
struct tce_container {
struct mutex lock;
- struct iommu_table *tbl;
+ struct iommu_group *grp;
bool enabled;
unsigned long locked_pages;
};
@@ -103,13 +103,41 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift)
return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift;
}
+static struct iommu_table *spapr_tce_find_table(
+ struct tce_container *container,
+ phys_addr_t ioba)
+{
+ long i;
+ struct iommu_table *ret = NULL;
+ struct iommu_table_group *table_group;
+
+ table_group = iommu_group_get_iommudata(container->grp);
+ if (!table_group)
+ return NULL;
+
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+ struct iommu_table *tbl = &table_group->tables[i];
+ unsigned long entry = ioba >> tbl->it_page_shift;
+ unsigned long start = tbl->it_offset;
+ unsigned long end = start + tbl->it_size;
+
+ if ((start <= entry) && (entry < end)) {
+ ret = tbl;
+ break;
+ }
+ }
+
+ return ret;
+}
+
static int tce_iommu_enable(struct tce_container *container)
{
int ret = 0;
unsigned long locked;
- struct iommu_table *tbl = container->tbl;
+ struct iommu_table *tbl;
+ struct iommu_table_group *table_group;
- if (!container->tbl)
+ if (!container->grp)
return -ENXIO;
if (!current->mm)
@@ -143,6 +171,11 @@ static int tce_iommu_enable(struct tce_container *container)
* as this information is only available from KVM and VFIO is
* KVM agnostic.
*/
+ table_group = iommu_group_get_iommudata(container->grp);
+ if (!table_group)
+ return -ENODEV;
+
+ tbl = &table_group->tables[0];
locked = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
ret = try_increment_locked_vm(locked);
if (ret)
@@ -193,15 +226,17 @@ static int tce_iommu_clear(struct tce_container *container,
static void tce_iommu_release(void *iommu_data)
{
struct tce_container *container = iommu_data;
- struct iommu_table *tbl = container->tbl;
+ struct iommu_table *tbl;
+ struct iommu_table_group *table_group;
- WARN_ON(tbl && !tbl->it_group);
+ WARN_ON(container->grp);
- if (tbl) {
+ if (container->grp) {
+ table_group = iommu_group_get_iommudata(container->grp);
+ tbl = &table_group->tables[0];
tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
- if (tbl->it_group)
- tce_iommu_detach_group(iommu_data, tbl->it_group);
+ tce_iommu_detach_group(iommu_data, container->grp);
}
tce_iommu_disable(container);
@@ -329,9 +364,16 @@ static long tce_iommu_ioctl(void *iommu_data,
case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
struct vfio_iommu_spapr_tce_info info;
- struct iommu_table *tbl = container->tbl;
+ struct iommu_table *tbl;
+ struct iommu_table_group *table_group;
- if (WARN_ON(!tbl))
+ if (WARN_ON(!container->grp))
+ return -ENXIO;
+
+ table_group = iommu_group_get_iommudata(container->grp);
+
+ tbl = &table_group->tables[0];
+ if (WARN_ON_ONCE(!tbl))
return -ENXIO;
minsz = offsetofend(struct vfio_iommu_spapr_tce_info,
@@ -354,17 +396,12 @@ static long tce_iommu_ioctl(void *iommu_data,
}
case VFIO_IOMMU_MAP_DMA: {
struct vfio_iommu_type1_dma_map param;
- struct iommu_table *tbl = container->tbl;
+ struct iommu_table *tbl;
unsigned long tce;
if (!container->enabled)
return -EPERM;
- if (!tbl)
- return -ENXIO;
-
- BUG_ON(!tbl->it_group);
-
minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
if (copy_from_user(¶m, (void __user *)arg, minsz))
@@ -377,6 +414,10 @@ static long tce_iommu_ioctl(void *iommu_data,
VFIO_DMA_MAP_FLAG_WRITE))
return -EINVAL;
+ tbl = spapr_tce_find_table(container, param.iova);
+ if (!tbl)
+ return -ENXIO;
+
if ((param.size & ~IOMMU_PAGE_MASK(tbl)) ||
(param.vaddr & ~IOMMU_PAGE_MASK(tbl)))
return -EINVAL;
@@ -402,14 +443,11 @@ static long tce_iommu_ioctl(void *iommu_data,
}
case VFIO_IOMMU_UNMAP_DMA: {
struct vfio_iommu_type1_dma_unmap param;
- struct iommu_table *tbl = container->tbl;
+ struct iommu_table *tbl;
if (!container->enabled)
return -EPERM;
- if (WARN_ON(!tbl))
- return -ENXIO;
-
minsz = offsetofend(struct vfio_iommu_type1_dma_unmap,
size);
@@ -423,6 +461,10 @@ static long tce_iommu_ioctl(void *iommu_data,
if (param.flags)
return -EINVAL;
+ tbl = spapr_tce_find_table(container, param.iova);
+ if (!tbl)
+ return -ENXIO;
+
if (param.size & ~IOMMU_PAGE_MASK(tbl))
return -EINVAL;
@@ -451,10 +493,10 @@ static long tce_iommu_ioctl(void *iommu_data,
mutex_unlock(&container->lock);
return 0;
case VFIO_EEH_PE_OP:
- if (!container->tbl || !container->tbl->it_group)
+ if (!container->grp)
return -ENODEV;
- return vfio_spapr_iommu_eeh_ioctl(container->tbl->it_group,
+ return vfio_spapr_iommu_eeh_ioctl(container->grp,
cmd, arg);
}
@@ -466,16 +508,15 @@ static int tce_iommu_attach_group(void *iommu_data,
{
int ret;
struct tce_container *container = iommu_data;
- struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+ struct iommu_table_group *table_group;
- BUG_ON(!tbl);
mutex_lock(&container->lock);
/* pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
iommu_group_id(iommu_group), iommu_group); */
- if (container->tbl) {
+ if (container->grp) {
pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
- iommu_group_id(container->tbl->it_group),
+ iommu_group_id(container->grp),
iommu_group_id(iommu_group));
ret = -EBUSY;
goto unlock_exit;
@@ -488,9 +529,15 @@ static int tce_iommu_attach_group(void *iommu_data,
goto unlock_exit;
}
- ret = iommu_take_ownership(tbl);
+ table_group = iommu_group_get_iommudata(iommu_group);
+ if (!table_group) {
+ ret = -ENXIO;
+ goto unlock_exit;
+ }
+
+ ret = iommu_take_ownership(&table_group->tables[0]);
if (!ret)
- container->tbl = tbl;
+ container->grp = iommu_group;
unlock_exit:
mutex_unlock(&container->lock);
@@ -502,27 +549,30 @@ static void tce_iommu_detach_group(void *iommu_data,
struct iommu_group *iommu_group)
{
struct tce_container *container = iommu_data;
- struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+ struct iommu_table_group *table_group;
- BUG_ON(!tbl);
mutex_lock(&container->lock);
- if (tbl != container->tbl) {
+ if (iommu_group != container->grp) {
pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
iommu_group_id(iommu_group),
- iommu_group_id(tbl->it_group));
+ iommu_group_id(container->grp));
goto unlock_exit;
}
if (container->enabled) {
pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
- iommu_group_id(tbl->it_group));
+ iommu_group_id(container->grp));
tce_iommu_disable(container);
}
/* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
iommu_group_id(iommu_group), iommu_group); */
- container->tbl = NULL;
- iommu_release_ownership(tbl);
+ container->grp = NULL;
+
+ table_group = iommu_group_get_iommudata(iommu_group);
+ BUG_ON(!table_group);
+
+ iommu_release_ownership(&table_group->tables[0]);
unlock_exit:
mutex_unlock(&container->lock);
--
2.0.0
This replaces the iommu_take_ownership()/iommu_release_ownership() calls
with callback invocations; it is now up to the platform code to call
iommu_take_ownership()/iommu_release_ownership() if needed.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/iommu.h | 4 +--
arch/powerpc/kernel/iommu.c | 50 ++++++++++++++++++++++++++++---------
drivers/vfio/vfio_iommu_spapr_tce.c | 4 +--
3 files changed, 42 insertions(+), 16 deletions(-)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 667aa1a..b9e50d3 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -225,8 +225,8 @@ extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
unsigned long entry);
extern void iommu_flush_tce(struct iommu_table *tbl);
-extern int iommu_take_ownership(struct iommu_table *tbl);
-extern void iommu_release_ownership(struct iommu_table *tbl);
+extern int iommu_take_ownership(struct iommu_table_group *table_group);
+extern void iommu_release_ownership(struct iommu_table_group *table_group);
extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
extern unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index fd49c8e..7d6089b 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1050,7 +1050,7 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
}
EXPORT_SYMBOL_GPL(iommu_tce_build);
-int iommu_take_ownership(struct iommu_table *tbl)
+static int iommu_table_take_ownership(struct iommu_table *tbl)
{
unsigned long sz = (tbl->it_size + 7) >> 3;
@@ -1064,19 +1064,36 @@ int iommu_take_ownership(struct iommu_table *tbl)
memset(tbl->it_map, 0xff, sz);
- /*
- * Disable iommu bypass, otherwise the user can DMA to all of
- * our physical memory via the bypass window instead of just
- * the pages that has been explicitly mapped into the iommu
- */
- if (tbl->set_bypass)
- tbl->set_bypass(tbl, false);
+ return 0;
+}
+
+static void iommu_table_release_ownership(struct iommu_table *tbl);
+
+int iommu_take_ownership(struct iommu_table_group *table_group)
+{
+ int i, j, rc = 0;
+
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+ struct iommu_table *tbl = &table_group->tables[i];
+
+ if (!tbl->it_map)
+ continue;
+
+ rc = iommu_table_take_ownership(tbl);
+ if (rc) {
+ for (j = 0; j < i; ++j)
+ iommu_table_release_ownership(
+ &table_group->tables[j]);
+
+ return rc;
+ }
+ }
return 0;
}
EXPORT_SYMBOL_GPL(iommu_take_ownership);
-void iommu_release_ownership(struct iommu_table *tbl)
+static void iommu_table_release_ownership(struct iommu_table *tbl)
{
unsigned long sz = (tbl->it_size + 7) >> 3;
@@ -1086,9 +1103,18 @@ void iommu_release_ownership(struct iommu_table *tbl)
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);
- /* The kernel owns the device now, we can restore the iommu bypass */
- if (tbl->set_bypass)
- tbl->set_bypass(tbl, true);
+}
+
+void iommu_release_ownership(struct iommu_table_group *table_group)
+{
+ int i;
+
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+ struct iommu_table *tbl = &table_group->tables[i];
+
+ if (tbl->it_map)
+ iommu_table_release_ownership(tbl);
+ }
}
EXPORT_SYMBOL_GPL(iommu_release_ownership);
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index d61aad2..9f38351 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -535,7 +535,7 @@ static int tce_iommu_attach_group(void *iommu_data,
goto unlock_exit;
}
- ret = iommu_take_ownership(&table_group->tables[0]);
+ ret = iommu_take_ownership(table_group);
if (!ret)
container->grp = iommu_group;
@@ -572,7 +572,7 @@ static void tce_iommu_detach_group(void *iommu_data,
table_group = iommu_group_get_iommudata(iommu_group);
BUG_ON(!table_group);
- iommu_release_ownership(&table_group->tables[0]);
+ iommu_release_ownership(table_group);
unlock_exit:
mutex_unlock(&container->lock);
--
2.0.0
At the moment the iommu_table struct has a set_bypass() callback which
enables/disables DMA bypass on the IODA2 PHB. This is exposed to the
POWERPC IOMMU code which calls this callback when external IOMMU users
such as VFIO are about to take control over a PHB.
The set_bypass() callback is not really an iommu_table function but an
IOMMU/PE function. This introduces an iommu_table_group_ops struct and
adds a set_ownership() callback to it which is called when an external
user takes control over the IOMMU.
This renames set_bypass() to set_ownership() as it does not necessarily
just enable bypassing and may do something else or more, so the more
generic name fits better. The bool parameter is inverted.
The callback is implemented for IODA2 only. Other platforms (P5IOC2,
IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/iommu.h | 14 +++++++++++++-
arch/powerpc/platforms/powernv/pci-ioda.c | 30 ++++++++++++++++++++++--------
drivers/vfio/vfio_iommu_spapr_tce.c | 25 +++++++++++++++++++++----
3 files changed, 56 insertions(+), 13 deletions(-)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index b9e50d3..d1f8c6c 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -92,7 +92,6 @@ struct iommu_table {
unsigned long it_page_shift;/* table iommu page size */
struct iommu_table_group *it_group;
struct iommu_table_ops *it_ops;
- void (*set_bypass)(struct iommu_table *tbl, bool enable);
};
/* Pure 2^n version of get_order */
@@ -127,11 +126,24 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
#define IOMMU_TABLE_GROUP_MAX_TABLES 1
+struct iommu_table_group;
+
+struct iommu_table_group_ops {
+ /*
+ * Switches ownership from the kernel itself to an external
+ * user. While ownership is enabled, the kernel cannot use the IOMMU
+ * for itself.
+ */
+ void (*set_ownership)(struct iommu_table_group *table_group,
+ bool enable);
+};
+
struct iommu_table_group {
#ifdef CONFIG_IOMMU_API
struct iommu_group *group;
#endif
struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
+ struct iommu_table_group_ops *ops;
};
#ifdef CONFIG_IOMMU_API
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index a964c50..9687731 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1255,10 +1255,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
}
-static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
+static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
{
- struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe,
- table_group);
uint16_t window_id = (pe->pe_number << 1 ) + 1;
int64_t rc;
@@ -1286,7 +1284,8 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
* host side.
*/
if (pe->pdev)
- set_iommu_table_base(&pe->pdev->dev, tbl);
+ set_iommu_table_base(&pe->pdev->dev,
+ &pe->table_group.tables[0]);
else
pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
}
@@ -1302,13 +1301,27 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
/* TVE #1 is selected by PCI address bit 59 */
pe->tce_bypass_base = 1ull << 59;
- /* Install set_bypass callback for VFIO */
- pe->table_group.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;
-
/* Enable bypass by default */
- pnv_pci_ioda2_set_bypass(&pe->table_group.tables[0], true);
+ pnv_pci_ioda2_set_bypass(pe, true);
}
+static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
+ bool enable)
+{
+ struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
+ table_group);
+ if (enable)
+ iommu_take_ownership(table_group);
+ else
+ iommu_release_ownership(table_group);
+
+ pnv_pci_ioda2_set_bypass(pe, !enable);
+}
+
+static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
+ .set_ownership = pnv_ioda2_set_ownership,
+};
+
static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
struct pnv_ioda_pe *pe)
{
@@ -1376,6 +1389,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
}
tbl->it_ops = &pnv_iommu_ops;
iommu_init_table(tbl, phb->hose->node);
+ pe->table_group.ops = &pnv_pci_ioda2_ops;
iommu_register_group(&pe->table_group, phb->hose->global_number,
pe->pe_number);
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 9f38351..d5d8c50 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -535,9 +535,22 @@ static int tce_iommu_attach_group(void *iommu_data,
goto unlock_exit;
}
- ret = iommu_take_ownership(table_group);
- if (!ret)
- container->grp = iommu_group;
+ if (!table_group->ops || !table_group->ops->set_ownership) {
+ ret = iommu_take_ownership(table_group);
+ } else {
+ /*
+ * Disable iommu bypass, otherwise the user can DMA to all of
+ * our physical memory via the bypass window instead of just
+ * the pages that have been explicitly mapped into the iommu
+ */
+ table_group->ops->set_ownership(table_group, true);
+ ret = 0;
+ }
+
+ if (ret)
+ goto unlock_exit;
+
+ container->grp = iommu_group;
unlock_exit:
mutex_unlock(&container->lock);
@@ -572,7 +585,11 @@ static void tce_iommu_detach_group(void *iommu_data,
table_group = iommu_group_get_iommudata(iommu_group);
BUG_ON(!table_group);
- iommu_release_ownership(table_group);
+ /* Kernel owns the device now, we can restore bypass */
+ if (!table_group->ops || !table_group->ops->set_ownership)
+ iommu_release_ownership(table_group);
+ else
+ table_group->ops->set_ownership(table_group, false);
unlock_exit:
mutex_unlock(&container->lock);
--
2.0.0
This adds missing locks in iommu_take_ownership()/
iommu_release_ownership().
This marks all pages busy in iommu_table::it_map in order to catch
errors if there is an attempt to use this table while ownership over it
is taken.
TCE content is only cleared if there is no page marked busy in it_map.
Clearing must be done outside of the table locks because iommu_clear_tce(),
called from iommu_clear_tces_and_put_pages(), takes those locks itself.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v5:
* do not store bit#0 value, it has to be set for zero-based table
anyway
* removed test_and_clear_bit
---
arch/powerpc/kernel/iommu.c | 26 ++++++++++++++++++++++----
1 file changed, 22 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 7d6089b..068fe4ff 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1052,17 +1052,28 @@ EXPORT_SYMBOL_GPL(iommu_tce_build);
static int iommu_table_take_ownership(struct iommu_table *tbl)
{
- unsigned long sz = (tbl->it_size + 7) >> 3;
+ unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+ int ret = 0;
+
+ spin_lock_irqsave(&tbl->large_pool.lock, flags);
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_lock(&tbl->pools[i].lock);
if (tbl->it_offset == 0)
clear_bit(0, tbl->it_map);
if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
pr_err("iommu_tce: it_map is not empty");
- return -EBUSY;
+ ret = -EBUSY;
+ if (tbl->it_offset == 0)
+ set_bit(0, tbl->it_map);
+ } else {
+ memset(tbl->it_map, 0xff, sz);
}
- memset(tbl->it_map, 0xff, sz);
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_unlock(&tbl->pools[i].lock);
+ spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
-	return 0;
+	return ret;
}
@@ -1095,7 +1106,11 @@ EXPORT_SYMBOL_GPL(iommu_take_ownership);
static void iommu_table_release_ownership(struct iommu_table *tbl)
{
- unsigned long sz = (tbl->it_size + 7) >> 3;
+ unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+
+ spin_lock_irqsave(&tbl->large_pool.lock, flags);
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_lock(&tbl->pools[i].lock);
memset(tbl->it_map, 0, sz);
@@ -1103,6 +1118,9 @@ static void iommu_table_release_ownership(struct iommu_table *tbl)
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_unlock(&tbl->pools[i].lock);
+ spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
}
void iommu_release_ownership(struct iommu_table_group *table_group)
--
2.0.0
The pnv_pci_ioda_tce_invalidate() helper invalidates the TCE cache. It is
supposed to be called on IODA1/2 and not on p5ioc2, and it receives
start and end host addresses of the TCE table. This approach makes it possible
for pnv_pci_ioda_tce_invalidate() to be called unintentionally on p5ioc2.
Another issue is that IODA2 needs PCI addresses to invalidate the cache.
Those can be calculated from host addresses, but since we are going
to implement multi-level TCE tables, calculating a PCI address from
a host address may become tricky or ugly as the TCE table remains flat
on the PCI bus but not in RAM.
This defines separate iommu_table_ops callbacks for p5ioc2 and IODA1/2
PHBs. They all call the common pnv_tce_build/pnv_tce_free/pnv_tce_get helpers
but call a PHB-specific TCE invalidation helper (when needed).
This changes pnv_pci_ioda2_tce_invalidate() to receive a TCE index and
a number of pages, which are PCI addresses shifted by the IOMMU page shift.
number of pages which are PCI addresses shifted by IOMMU page shift.
The patch is pretty mechanical and behaviour is not expected to change.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/platforms/powernv/pci-ioda.c | 92 ++++++++++++++++++++++-------
arch/powerpc/platforms/powernv/pci-p5ioc2.c | 9 ++-
arch/powerpc/platforms/powernv/pci.c | 76 +++++++++---------------
arch/powerpc/platforms/powernv/pci.h | 7 ++-
4 files changed, 111 insertions(+), 73 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 9687731..fd993bc 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1065,18 +1065,20 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
}
}
-static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
- struct iommu_table *tbl,
- __be64 *startp, __be64 *endp, bool rm)
+static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
+ unsigned long index, unsigned long npages, bool rm)
{
+ struct pnv_ioda_pe *pe = container_of(tbl->it_group,
+ struct pnv_ioda_pe, table_group);
__be64 __iomem *invalidate = rm ?
(__be64 __iomem *)pe->tce_inval_reg_phys :
(__be64 __iomem *)tbl->it_index;
unsigned long start, end, inc;
const unsigned shift = tbl->it_page_shift;
- start = __pa(startp);
- end = __pa(endp);
+ start = __pa((__be64 *)tbl->it_base + index - tbl->it_offset);
+ end = __pa((__be64 *)tbl->it_base + index - tbl->it_offset +
+ npages - 1);
/* BML uses this case for p6/p7/galaxy2: Shift addr and put in node */
if (tbl->it_busno) {
@@ -1112,10 +1114,40 @@ static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
*/
}
-static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
- struct iommu_table *tbl,
- __be64 *startp, __be64 *endp, bool rm)
+static int pnv_ioda1_tce_build_vm(struct iommu_table *tbl, long index,
+ long npages, unsigned long uaddr,
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs)
{
+ long ret = pnv_tce_build(tbl, index, npages, uaddr, direction,
+ attrs);
+
+ if (!ret && (tbl->it_type & TCE_PCI_SWINV_CREATE))
+ pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false);
+
+ return ret;
+}
+
+static void pnv_ioda1_tce_free_vm(struct iommu_table *tbl, long index,
+ long npages)
+{
+ pnv_tce_free(tbl, index, npages);
+
+ if (tbl->it_type & TCE_PCI_SWINV_FREE)
+ pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false);
+}
+
+struct iommu_table_ops pnv_ioda1_iommu_ops = {
+ .set = pnv_ioda1_tce_build_vm,
+ .clear = pnv_ioda1_tce_free_vm,
+ .get = pnv_tce_get,
+};
+
+static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
+ unsigned long index, unsigned long npages, bool rm)
+{
+ struct pnv_ioda_pe *pe = container_of(tbl->it_group,
+ struct pnv_ioda_pe, table_group);
unsigned long start, end, inc;
__be64 __iomem *invalidate = rm ?
(__be64 __iomem *)pe->tce_inval_reg_phys :
@@ -1128,9 +1160,9 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
end = start;
/* Figure out the start, end and step */
- inc = tbl->it_offset + (((u64)startp - tbl->it_base) / sizeof(u64));
+ inc = tbl->it_offset + index;
start |= (inc << shift);
- inc = tbl->it_offset + (((u64)endp - tbl->it_base) / sizeof(u64));
+ inc = tbl->it_offset + (index + npages - 1);
end |= (inc << shift);
inc = (0x1ull << shift);
mb();
@@ -1144,19 +1176,35 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
}
}
-void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
- __be64 *startp, __be64 *endp, bool rm)
+static int pnv_ioda2_tce_build_vm(struct iommu_table *tbl, long index,
+ long npages, unsigned long uaddr,
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs)
{
- struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe,
- table_group);
- struct pnv_phb *phb = pe->phb;
-
- if (phb->type == PNV_PHB_IODA1)
- pnv_pci_ioda1_tce_invalidate(pe, tbl, startp, endp, rm);
- else
- pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
+ long ret = pnv_tce_build(tbl, index, npages, uaddr, direction,
+ attrs);
+
+ if (!ret && (tbl->it_type & TCE_PCI_SWINV_CREATE))
+ pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
+
+ return ret;
}
+static void pnv_ioda2_tce_free_vm(struct iommu_table *tbl, long index,
+ long npages)
+{
+ pnv_tce_free(tbl, index, npages);
+
+ if (tbl->it_type & TCE_PCI_SWINV_FREE)
+ pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
+}
+
+static struct iommu_table_ops pnv_ioda2_iommu_ops = {
+ .set = pnv_ioda2_tce_build_vm,
+ .clear = pnv_ioda2_tce_free_vm,
+ .get = pnv_tce_get,
+};
+
static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
struct pnv_ioda_pe *pe, unsigned int base,
unsigned int segs)
@@ -1236,7 +1284,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
TCE_PCI_SWINV_FREE |
TCE_PCI_SWINV_PAIR);
}
- tbl->it_ops = &pnv_iommu_ops;
+ tbl->it_ops = &pnv_ioda1_iommu_ops;
iommu_init_table(tbl, phb->hose->node);
iommu_register_group(&pe->table_group, phb->hose->global_number,
pe->pe_number);
@@ -1387,7 +1435,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
8);
tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
}
- tbl->it_ops = &pnv_iommu_ops;
+ tbl->it_ops = &pnv_ioda2_iommu_ops;
iommu_init_table(tbl, phb->hose->node);
pe->table_group.ops = &pnv_pci_ioda2_ops;
iommu_register_group(&pe->table_group, phb->hose->global_number,
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index ff68cac..6906a9c 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -83,11 +83,18 @@ static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb)
static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { }
#endif /* CONFIG_PCI_MSI */
+static struct iommu_table_ops pnv_p5ioc2_iommu_ops = {
+ .set = pnv_tce_build,
+ .clear = pnv_tce_free,
+ .get = pnv_tce_get,
+};
+
static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
struct pci_dev *pdev)
{
if (phb->p5ioc2.table_group.tables[0].it_map == NULL) {
- phb->p5ioc2.table_group.tables[0].it_ops = &pnv_iommu_ops;
+ phb->p5ioc2.table_group.tables[0].it_ops =
+ &pnv_p5ioc2_iommu_ops;
iommu_init_table(&phb->p5ioc2.table_group.tables[0],
phb->hose->node);
iommu_register_group(&phb->p5ioc2.table_group,
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 3050cc8..a8c05de 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -589,70 +589,48 @@ struct pci_ops pnv_pci_ops = {
.write = pnv_pci_write_config,
};
-static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
- unsigned long uaddr, enum dma_data_direction direction,
- struct dma_attrs *attrs, bool rm)
+static __be64 *pnv_tce(struct iommu_table *tbl, long index)
+{
+ __be64 *tmp = ((__be64 *)tbl->it_base);
+
+ return tmp + index;
+}
+
+int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
+ unsigned long uaddr, enum dma_data_direction direction,
+ struct dma_attrs *attrs)
{
u64 proto_tce = iommu_direction_to_tce_perm(direction);
- __be64 *tcep, *tces;
- u64 rpn;
+ u64 rpn = __pa(uaddr) >> tbl->it_page_shift;
+ long i;
- tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
- rpn = __pa(uaddr) >> tbl->it_page_shift;
+ for (i = 0; i < npages; i++) {
+ unsigned long newtce = proto_tce |
+ ((rpn + i) << tbl->it_page_shift);
+ unsigned long idx = index - tbl->it_offset + i;
- while (npages--)
- *(tcep++) = cpu_to_be64(proto_tce |
- (rpn++ << tbl->it_page_shift));
-
- /* Some implementations won't cache invalid TCEs and thus may not
- * need that flush. We'll probably turn it_type into a bit mask
- * of flags if that becomes the case
- */
- if (tbl->it_type & TCE_PCI_SWINV_CREATE)
- pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
+ *(pnv_tce(tbl, idx)) = cpu_to_be64(newtce);
+ }
return 0;
}
-static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long npages,
- unsigned long uaddr,
- enum dma_data_direction direction,
- struct dma_attrs *attrs)
+void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
{
- return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs,
- false);
-}
-
-static void pnv_tce_free(struct iommu_table *tbl, long index, long npages,
- bool rm)
-{
- __be64 *tcep, *tces;
-
- tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
+ long i;
- while (npages--)
- *(tcep++) = cpu_to_be64(0);
+ for (i = 0; i < npages; i++) {
+ unsigned long idx = index - tbl->it_offset + i;
- if (tbl->it_type & TCE_PCI_SWINV_FREE)
- pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
+ *(pnv_tce(tbl, idx)) = cpu_to_be64(0);
+ }
}
-static void pnv_tce_free_vm(struct iommu_table *tbl, long index, long npages)
+unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
{
- pnv_tce_free(tbl, index, npages, false);
+ return *(pnv_tce(tbl, index - tbl->it_offset));
}
-static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
-{
- return ((u64 *)tbl->it_base)[index - tbl->it_offset];
-}
-
-struct iommu_table_ops pnv_iommu_ops = {
- .set = pnv_tce_build_vm,
- .clear = pnv_tce_free_vm,
- .get = pnv_tce_get,
-};
-
void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
void *tce_mem, u64 tce_size,
u64 dma_offset, unsigned page_shift)
@@ -685,7 +663,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
return NULL;
pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
- tbl->it_ops = &pnv_iommu_ops;
+ tbl->it_ops = &pnv_ioda1_iommu_ops;
iommu_init_table(tbl, hose->node);
iommu_register_group(tbl->it_group, pci_domain_nr(hose->bus), 0);
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 762d906..0d4df32 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -216,7 +216,12 @@ extern struct pci_ops pnv_pci_ops;
#ifdef CONFIG_EEH
extern struct pnv_eeh_ops ioda_eeh_ops;
#endif
-extern struct iommu_table_ops pnv_iommu_ops;
+extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
+ unsigned long uaddr, enum dma_data_direction direction,
+ struct dma_attrs *attrs);
+extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
+extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
+extern struct iommu_table_ops pnv_ioda1_iommu_ops;
void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
unsigned char *log_buff);
--
2.0.0
At the moment, writing a new TCE value to the IOMMU table fails with EBUSY
if a valid entry is already present. However, the PAPR specification allows
the guest to write a new TCE value without clearing the existing one first.
Another problem this patch addresses is the use of pool locks for
external IOMMU users such as VFIO. The pool locks protect the
DMA page allocator rather than the entries themselves, and since the host
kernel does not control which pages are in use, pool locks serve no purpose
here; exchange()+put_page(oldtce) is sufficient to avoid possible races.
This adds an exchange() callback to iommu_table_ops which does the same
thing as set() but also returns the replaced TCE and DMA direction so
the caller can release the pages afterwards.
The returned old TCE value is a virtual address, as is the new TCE value;
this is different from tce_clear(), which returns a physical address.
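The exchange() contract described above can be sketched as follows; the names, flag values, and direction enum here are simplified stand-ins, not the kernel implementation:

```c
#include <assert.h>

/* Simplified mirror of enum dma_data_direction. */
enum dir { DIR_NONE, DIR_TO_DEVICE, DIR_FROM_DEVICE, DIR_BIDIRECTIONAL };

#define TCE_READ  0x1UL
#define TCE_WRITE 0x2UL

/* Hypothetical sketch of exchange(): install a new TCE with permission
 * bits derived from the requested direction, and hand back the replaced
 * value and its direction so the caller can release the old page. */
static int tce_exchange(unsigned long *entry, unsigned long *tce,
			enum dir *direction)
{
	unsigned long perm = 0, old;

	if (*direction == DIR_TO_DEVICE || *direction == DIR_BIDIRECTIONAL)
		perm |= TCE_READ;   /* device reads from memory */
	if (*direction == DIR_FROM_DEVICE || *direction == DIR_BIDIRECTIONAL)
		perm |= TCE_WRITE;  /* device writes to memory */

	old = *entry;
	*entry = (*tce & ~(TCE_READ | TCE_WRITE)) | perm;

	/* Derive the old direction from the replaced permission bits. */
	if ((old & (TCE_READ | TCE_WRITE)) == (TCE_READ | TCE_WRITE))
		*direction = DIR_BIDIRECTIONAL;
	else if (old & TCE_READ)
		*direction = DIR_TO_DEVICE;
	else if (old & TCE_WRITE)
		*direction = DIR_FROM_DEVICE;
	else
		*direction = DIR_NONE;

	*tce = old & ~(TCE_READ | TCE_WRITE);
	return 0;
}
```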
This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
for a platform to have exchange() implemented in order to support VFIO.
This replaces iommu_tce_build() and iommu_clear_tce() with
a single iommu_tce_xchg().
This makes sure that TCE permission bits are not set in the TCE passed to
the IOMMU API, as those are to be calculated by platform code from the DMA
direction.
This moves SetPageDirty() to the IOMMU code so that it works for both the
VFIO ioctl interface and the in-kernel TCE acceleration (when the latter
becomes available later).
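Since permission bits are dropped from the TCE, the VFIO map flags are translated into a DMA direction instead; a sketch of that mapping, with local stand-in flag values rather than the kernel's:

```c
#include <assert.h>

#define MAP_FLAG_READ  (1u << 0)
#define MAP_FLAG_WRITE (1u << 1)

/* Simplified mirror of enum dma_data_direction. */
enum dir { DIR_NONE, DIR_TO_DEVICE, DIR_FROM_DEVICE, DIR_BIDIRECTIONAL };

/* READ+WRITE -> bidirectional, READ-only -> to-device,
 * WRITE-only -> from-device, neither -> error. */
static int flags_to_direction(unsigned int flags, enum dir *direction)
{
	if (flags & MAP_FLAG_READ) {
		if (flags & MAP_FLAG_WRITE)
			*direction = DIR_BIDIRECTIONAL;
		else
			*direction = DIR_TO_DEVICE;
	} else if (flags & MAP_FLAG_WRITE) {
		*direction = DIR_FROM_DEVICE;
	} else {
		return -1; /* -EINVAL in the kernel */
	}
	return 0;
}
```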
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/iommu.h | 17 ++++++--
arch/powerpc/kernel/iommu.c | 53 +++++++++---------------
arch/powerpc/platforms/powernv/pci-ioda.c | 38 ++++++++++++++++++
arch/powerpc/platforms/powernv/pci-p5ioc2.c | 3 ++
arch/powerpc/platforms/powernv/pci.c | 17 ++++++++
arch/powerpc/platforms/powernv/pci.h | 2 +
drivers/vfio/vfio_iommu_spapr_tce.c | 62 ++++++++++++++++++-----------
7 files changed, 130 insertions(+), 62 deletions(-)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index d1f8c6c..bde7ee7 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -44,11 +44,22 @@ extern int iommu_is_off;
extern int iommu_force_on;
struct iommu_table_ops {
+ /* When called with direction==DMA_NONE, it is equal to clear() */
int (*set)(struct iommu_table *tbl,
long index, long npages,
unsigned long uaddr,
enum dma_data_direction direction,
struct dma_attrs *attrs);
+#ifdef CONFIG_IOMMU_API
+ /*
+ * Exchanges existing TCE with new TCE plus direction bits;
+ * returns old TCE and DMA direction mask
+ */
+ int (*exchange)(struct iommu_table *tbl,
+ long index,
+ unsigned long *tce,
+ enum dma_data_direction *direction);
+#endif
void (*clear)(struct iommu_table *tbl,
long index, long npages);
unsigned long (*get)(struct iommu_table *tbl, long index);
@@ -152,6 +163,8 @@ extern void iommu_register_group(struct iommu_table_group *table_group,
extern int iommu_add_device(struct device *dev);
extern void iommu_del_device(struct device *dev);
extern int __init tce_iommu_bus_notifier_init(void);
+extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
+ unsigned long *tce, enum dma_data_direction *direction);
#else
static inline void iommu_register_group(struct iommu_table_group *table_group,
int pci_domain_number,
@@ -231,10 +244,6 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
unsigned long npages);
extern int iommu_tce_put_param_check(struct iommu_table *tbl,
unsigned long ioba, unsigned long tce);
-extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
- unsigned long hwaddr, enum dma_data_direction direction);
-extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
- unsigned long entry);
extern void iommu_flush_tce(struct iommu_table *tbl);
extern int iommu_take_ownership(struct iommu_table_group *table_group);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 068fe4ff..501e8ee 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -982,9 +982,6 @@ EXPORT_SYMBOL_GPL(iommu_tce_clear_param_check);
int iommu_tce_put_param_check(struct iommu_table *tbl,
unsigned long ioba, unsigned long tce)
{
- if (!(tce & (TCE_PCI_WRITE | TCE_PCI_READ)))
- return -EINVAL;
-
if (tce & ~(IOMMU_PAGE_MASK(tbl) | TCE_PCI_WRITE | TCE_PCI_READ))
return -EINVAL;
@@ -1002,44 +999,20 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
}
EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
-unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
-{
- unsigned long oldtce;
- struct iommu_pool *pool = get_pool(tbl, entry);
-
- spin_lock(&(pool->lock));
-
- oldtce = tbl->it_ops->get(tbl, entry);
- if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
- tbl->it_ops->clear(tbl, entry, 1);
- else
- oldtce = 0;
-
- spin_unlock(&(pool->lock));
-
- return oldtce;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tce);
-
/*
* hwaddr is a kernel virtual address here (0xc... bazillion),
* tce_build converts it to a physical address.
*/
-int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
- unsigned long hwaddr, enum dma_data_direction direction)
+long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
+ unsigned long *tce, enum dma_data_direction *direction)
{
- int ret = -EBUSY;
- unsigned long oldtce;
- struct iommu_pool *pool = get_pool(tbl, entry);
+ long ret;
- spin_lock(&(pool->lock));
+ ret = tbl->it_ops->exchange(tbl, entry, tce, direction);
- oldtce = tbl->it_ops->get(tbl, entry);
- /* Add new entry if it is not busy */
- if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
- ret = tbl->it_ops->set(tbl, entry, 1, hwaddr, direction, NULL);
-
- spin_unlock(&(pool->lock));
+ if (!ret && ((*direction == DMA_FROM_DEVICE) ||
+ (*direction == DMA_BIDIRECTIONAL)))
+ SetPageDirty(pfn_to_page(__pa(*tce) >> PAGE_SHIFT));
/* if (unlikely(ret))
pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
@@ -1048,13 +1021,23 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
return ret;
}
-EXPORT_SYMBOL_GPL(iommu_tce_build);
+EXPORT_SYMBOL_GPL(iommu_tce_xchg);
static int iommu_table_take_ownership(struct iommu_table *tbl)
{
unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
int ret = 0;
+ /*
+ * VFIO does not control TCE entries allocation and the guest
+ * can write new TCEs on top of existing ones so iommu_tce_build()
+ * must be able to release old pages. This functionality
+ * requires exchange() callback defined so if it is not
+ * implemented, we disallow taking ownership over the table.
+ */
+ if (!tbl->it_ops->exchange)
+ return -EINVAL;
+
spin_lock_irqsave(&tbl->large_pool.lock, flags);
for (i = 0; i < tbl->nr_pools; i++)
spin_lock(&tbl->pools[i].lock);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index fd993bc..4d80502 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1128,6 +1128,20 @@ static int pnv_ioda1_tce_build_vm(struct iommu_table *tbl, long index,
return ret;
}
+#ifdef CONFIG_IOMMU_API
+static int pnv_ioda1_tce_xchg_vm(struct iommu_table *tbl, long index,
+ unsigned long *tce, enum dma_data_direction *direction)
+{
+ long ret = pnv_tce_xchg(tbl, index, tce, direction);
+
+ if (!ret && (tbl->it_type &
+ (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
+ pnv_pci_ioda1_tce_invalidate(tbl, index, 1, false);
+
+ return ret;
+}
+#endif
+
static void pnv_ioda1_tce_free_vm(struct iommu_table *tbl, long index,
long npages)
{
@@ -1139,6 +1153,9 @@ static void pnv_ioda1_tce_free_vm(struct iommu_table *tbl, long index,
struct iommu_table_ops pnv_ioda1_iommu_ops = {
.set = pnv_ioda1_tce_build_vm,
+#ifdef CONFIG_IOMMU_API
+ .exchange = pnv_ioda1_tce_xchg_vm,
+#endif
.clear = pnv_ioda1_tce_free_vm,
.get = pnv_tce_get,
};
@@ -1190,6 +1207,20 @@ static int pnv_ioda2_tce_build_vm(struct iommu_table *tbl, long index,
return ret;
}
+#ifdef CONFIG_IOMMU_API
+static int pnv_ioda2_tce_xchg_vm(struct iommu_table *tbl, long index,
+ unsigned long *tce, enum dma_data_direction *direction)
+{
+ long ret = pnv_tce_xchg(tbl, index, tce, direction);
+
+ if (!ret && (tbl->it_type &
+ (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
+ pnv_pci_ioda2_tce_invalidate(tbl, index, 1, false);
+
+ return ret;
+}
+#endif
+
static void pnv_ioda2_tce_free_vm(struct iommu_table *tbl, long index,
long npages)
{
@@ -1201,6 +1232,9 @@ static void pnv_ioda2_tce_free_vm(struct iommu_table *tbl, long index,
static struct iommu_table_ops pnv_ioda2_iommu_ops = {
.set = pnv_ioda2_tce_build_vm,
+#ifdef CONFIG_IOMMU_API
+ .exchange = pnv_ioda2_tce_xchg_vm,
+#endif
.clear = pnv_ioda2_tce_free_vm,
.get = pnv_tce_get,
};
@@ -1353,6 +1387,7 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
pnv_pci_ioda2_set_bypass(pe, true);
}
+#ifdef CONFIG_IOMMU_API
static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
bool enable)
{
@@ -1369,6 +1404,7 @@ static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
.set_ownership = pnv_ioda2_set_ownership,
};
+#endif
static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
struct pnv_ioda_pe *pe)
@@ -1437,7 +1473,9 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
}
tbl->it_ops = &pnv_ioda2_iommu_ops;
iommu_init_table(tbl, phb->hose->node);
+#ifdef CONFIG_IOMMU_API
pe->table_group.ops = &pnv_pci_ioda2_ops;
+#endif
iommu_register_group(&pe->table_group, phb->hose->global_number,
pe->pe_number);
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index 6906a9c..d2d9092 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -85,6 +85,9 @@ static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { }
static struct iommu_table_ops pnv_p5ioc2_iommu_ops = {
.set = pnv_tce_build,
+#ifdef CONFIG_IOMMU_API
+ .exchange = pnv_tce_xchg,
+#endif
.clear = pnv_tce_free,
.get = pnv_tce_get,
};
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index a8c05de..a9797dd 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -615,6 +615,23 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
return 0;
}
+#ifdef CONFIG_IOMMU_API
+int pnv_tce_xchg(struct iommu_table *tbl, long index,
+ unsigned long *tce, enum dma_data_direction *direction)
+{
+ u64 proto_tce = iommu_direction_to_tce_perm(*direction);
+ unsigned long newtce = __pa(*tce) | proto_tce;
+ unsigned long idx = index - tbl->it_offset;
+
+ *tce = xchg(pnv_tce(tbl, idx), cpu_to_be64(newtce));
+ *tce = (unsigned long) __va(be64_to_cpu(*tce));
+ *direction = iommu_tce_direction(*tce);
+ *tce &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+ return 0;
+}
+#endif
+
void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
{
long i;
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 0d4df32..4d1a78c 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -220,6 +220,8 @@ extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
unsigned long uaddr, enum dma_data_direction direction,
struct dma_attrs *attrs);
extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
+extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
+ unsigned long *tce, enum dma_data_direction *direction);
extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
extern struct iommu_table_ops pnv_ioda1_iommu_ops;
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index d5d8c50..7c3c215 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -251,9 +251,6 @@ static void tce_iommu_unuse_page(struct tce_container *container,
{
struct page *page;
- if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
- return;
-
/*
* VFIO cannot map/unmap when a container is not enabled so
* we would not need this check but KVM could map/unmap and if
@@ -264,10 +261,6 @@ static void tce_iommu_unuse_page(struct tce_container *container,
return;
page = pfn_to_page(__pa(oldtce) >> PAGE_SHIFT);
-
- if (oldtce & TCE_PCI_WRITE)
- SetPageDirty(page);
-
put_page(page);
}
@@ -275,14 +268,21 @@ static int tce_iommu_clear(struct tce_container *container,
struct iommu_table *tbl,
unsigned long entry, unsigned long pages)
{
- unsigned long oldtce;
+ long ret;
+ enum dma_data_direction direction;
+ unsigned long tce;
for ( ; pages; --pages, ++entry) {
- oldtce = iommu_clear_tce(tbl, entry);
- if (!oldtce)
+ direction = DMA_NONE;
+ tce = (unsigned long) __va(0);
+ ret = iommu_tce_xchg(tbl, entry, &tce, &direction);
+ if (ret)
continue;
- tce_iommu_unuse_page(container, (unsigned long) __va(oldtce));
+ if (direction == DMA_NONE)
+ continue;
+
+ tce_iommu_unuse_page(container, tce);
}
return 0;
@@ -304,12 +304,13 @@ static int tce_get_hva(unsigned long tce, unsigned long *hva)
static long tce_iommu_build(struct tce_container *container,
struct iommu_table *tbl,
- unsigned long entry, unsigned long tce, unsigned long pages)
+ unsigned long entry, unsigned long tce, unsigned long pages,
+ enum dma_data_direction direction)
{
long i, ret = 0;
struct page *page;
unsigned long hva;
- enum dma_data_direction direction = iommu_tce_direction(tce);
+ enum dma_data_direction dirtmp;
for (i = 0; i < pages; ++i) {
ret = tce_get_hva(tce, &hva);
@@ -324,15 +325,21 @@ static long tce_iommu_build(struct tce_container *container,
/* Preserve offset within IOMMU page */
hva |= tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
+ dirtmp = direction;
- ret = iommu_tce_build(tbl, entry + i, hva, direction);
+ ret = iommu_tce_xchg(tbl, entry + i, &hva, &dirtmp);
if (ret) {
+ /* dirtmp cannot be DMA_NONE here */
tce_iommu_unuse_page(container, hva);
pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
__func__, entry << tbl->it_page_shift,
tce, ret);
break;
}
+
+ if (dirtmp != DMA_NONE)
+ tce_iommu_unuse_page(container, hva);
+
tce += IOMMU_PAGE_SIZE(tbl);
}
@@ -397,7 +404,7 @@ static long tce_iommu_ioctl(void *iommu_data,
case VFIO_IOMMU_MAP_DMA: {
struct vfio_iommu_type1_dma_map param;
struct iommu_table *tbl;
- unsigned long tce;
+ enum dma_data_direction direction;
if (!container->enabled)
return -EPERM;
@@ -418,24 +425,33 @@ static long tce_iommu_ioctl(void *iommu_data,
if (!tbl)
return -ENXIO;
- if ((param.size & ~IOMMU_PAGE_MASK(tbl)) ||
- (param.vaddr & ~IOMMU_PAGE_MASK(tbl)))
+ if (param.size & ~IOMMU_PAGE_MASK(tbl))
+ return -EINVAL;
+
+ if (param.vaddr & (TCE_PCI_READ | TCE_PCI_WRITE))
return -EINVAL;
/* iova is checked by the IOMMU API */
- tce = param.vaddr;
if (param.flags & VFIO_DMA_MAP_FLAG_READ)
- tce |= TCE_PCI_READ;
- if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
- tce |= TCE_PCI_WRITE;
+ if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
+ direction = DMA_BIDIRECTIONAL;
+ else
+ direction = DMA_TO_DEVICE;
+ else
+ if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
+ direction = DMA_FROM_DEVICE;
+ else
+ return -EINVAL;
- ret = iommu_tce_put_param_check(tbl, param.iova, tce);
+ ret = iommu_tce_put_param_check(tbl, param.iova, param.vaddr);
if (ret)
return ret;
ret = tce_iommu_build(container, tbl,
param.iova >> tbl->it_page_shift,
- tce, param.size >> tbl->it_page_shift);
+ param.vaddr,
+ param.size >> tbl->it_page_shift,
+ direction);
iommu_flush_tce(tbl);
--
2.0.0
This moves iommu_table creation to the beginning of
pnv_pci_ioda2_setup_dma_pe(). This is a mechanical patch.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/platforms/powernv/pci-ioda.c | 34 ++++++++++++++++---------------
1 file changed, 18 insertions(+), 16 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 4d80502..a1e0df9 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1437,27 +1437,33 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
addr = page_address(tce_mem);
memset(addr, 0, tce_table_size);
+ /* Setup iommu */
+ pe->table_group.tables[0].it_group = &pe->table_group;
+
+ /* Setup linux iommu table */
+ tbl = &pe->table_group.tables[0];
+ pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
+ IOMMU_PAGE_SHIFT_4K);
+
+ tbl->it_ops = &pnv_ioda2_iommu_ops;
+ iommu_init_table(tbl, phb->hose->node);
+#ifdef CONFIG_IOMMU_API
+ pe->table_group.ops = &pnv_pci_ioda2_ops;
+#endif
+
/*
* Map TCE table through TVT. The TVE index is the PE number
* shifted by 1 bit for 32-bits DMA space.
*/
rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
- pe->pe_number << 1, 1, __pa(addr),
- tce_table_size, 0x1000);
+ pe->pe_number << 1, 1, __pa(tbl->it_base),
+ tbl->it_size << 3, 1ULL << tbl->it_page_shift);
if (rc) {
pe_err(pe, "Failed to configure 32-bit TCE table,"
" err %ld\n", rc);
goto fail;
}
- /* Setup iommu */
- pe->table_group.tables[0].it_group = &pe->table_group;
-
- /* Setup linux iommu table */
- tbl = &pe->table_group.tables[0];
- pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
- IOMMU_PAGE_SHIFT_4K);
-
/* OPAL variant of PHB3 invalidated TCEs */
swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
if (swinvp) {
@@ -1471,16 +1477,12 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
8);
tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
}
- tbl->it_ops = &pnv_ioda2_iommu_ops;
- iommu_init_table(tbl, phb->hose->node);
-#ifdef CONFIG_IOMMU_API
- pe->table_group.ops = &pnv_pci_ioda2_ops;
-#endif
iommu_register_group(&pe->table_group, phb->hose->global_number,
pe->pe_number);
if (pe->pdev)
- set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
+ set_iommu_table_base_and_group(&pe->pdev->dev,
+ &pe->table_group.tables[0]);
else
pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
--
2.0.0
This is a part of moving TCE table allocation into an iommu_ops
callback to support multiple IOMMU groups per VFIO container.
This enforces the window size to be a power of two.
This is a pretty mechanical patch.
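The window-size validation and TCE table sizing in pnv_pci_ioda2_create_table() can be sketched as below (the memory_hotplug_max() check is omitted; names are simplified):

```c
#include <assert.h>

static int is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* One 8-byte TCE per IOMMU page of the window, with a 4K floor on the
 * table size; rejects windows that are not a power of two. */
static long tce_table_bytes(unsigned long window_size, unsigned page_shift)
{
	unsigned long sz;

	if (!is_power_of_2(window_size))
		return -1; /* -EINVAL in the kernel */

	sz = (window_size >> page_shift) * 8;
	return sz < 0x1000 ? 0x1000 : (long)sz;
}
```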
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/platforms/powernv/pci-ioda.c | 85 +++++++++++++++++++++++--------
1 file changed, 63 insertions(+), 22 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index a1e0df9..908863a 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -24,7 +24,9 @@
#include <linux/msi.h>
#include <linux/memblock.h>
#include <linux/iommu.h>
+#include <linux/mmzone.h>
+#include <asm/mmzone.h>
#include <asm/sections.h>
#include <asm/io.h>
#include <asm/prom.h>
@@ -1337,6 +1339,58 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
}
+static long pnv_pci_ioda2_create_table(struct pnv_ioda_pe *pe,
+ __u32 page_shift, __u64 window_size,
+ struct iommu_table *tbl)
+{
+ int nid = pe->phb->hose->node;
+ struct page *tce_mem = NULL;
+ void *addr;
+ unsigned long tce_table_size;
+ int64_t rc;
+ unsigned order;
+
+ if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size))
+ return -EINVAL;
+
+ tce_table_size = (window_size >> page_shift) * 8;
+ tce_table_size = max(0x1000UL, tce_table_size);
+
+ /* Allocate TCE table */
+ order = get_order(tce_table_size);
+
+ tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
+ if (!tce_mem) {
+ pr_err("Failed to allocate a TCE memory, order=%d\n", order);
+ rc = -ENOMEM;
+ goto fail;
+ }
+ addr = page_address(tce_mem);
+ memset(addr, 0, tce_table_size);
+
+ /* Setup linux iommu table */
+ pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
+ page_shift);
+
+ tbl->it_ops = &pnv_ioda2_iommu_ops;
+
+ return 0;
+fail:
+ if (tce_mem)
+ __free_pages(tce_mem, get_order(tce_table_size));
+
+ return rc;
+}
+
+static void pnv_pci_free_table(struct iommu_table *tbl)
+{
+ if (!tbl->it_size)
+ return;
+
+ free_pages(tbl->it_base, get_order(tbl->it_size << 3));
+ memset(tbl, 0, sizeof(struct iommu_table));
+}
+
static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
{
uint16_t window_id = (pe->pe_number << 1 ) + 1;
@@ -1409,11 +1463,9 @@ static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
struct pnv_ioda_pe *pe)
{
- struct page *tce_mem = NULL;
- void *addr;
const __be64 *swinvp;
- struct iommu_table *tbl;
- unsigned int tce_table_size, end;
+ unsigned int end;
+ struct iommu_table *tbl = &pe->table_group.tables[0];
int64_t rc;
/* We shouldn't already have a 32-bit DMA associated */
@@ -1422,30 +1474,20 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
/* The PE will reserve all possible 32-bits space */
pe->tce32_seg = 0;
+
end = (1 << ilog2(phb->ioda.m32_pci_base));
- tce_table_size = (end / 0x1000) * 8;
pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
end);
- /* Allocate TCE table */
- tce_mem = alloc_pages_node(phb->hose->node, GFP_KERNEL,
- get_order(tce_table_size));
- if (!tce_mem) {
- pe_err(pe, "Failed to allocate a 32-bit TCE memory\n");
- goto fail;
+ rc = pnv_pci_ioda2_create_table(pe, IOMMU_PAGE_SHIFT_4K,
+ phb->ioda.m32_pci_base, tbl);
+ if (rc) {
+ pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc);
+ return;
}
- addr = page_address(tce_mem);
- memset(addr, 0, tce_table_size);
/* Setup iommu */
pe->table_group.tables[0].it_group = &pe->table_group;
-
- /* Setup linux iommu table */
- tbl = &pe->table_group.tables[0];
- pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
- IOMMU_PAGE_SHIFT_4K);
-
- tbl->it_ops = &pnv_ioda2_iommu_ops;
iommu_init_table(tbl, phb->hose->node);
#ifdef CONFIG_IOMMU_API
pe->table_group.ops = &pnv_pci_ioda2_ops;
@@ -1494,8 +1536,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
fail:
if (pe->tce32_seg >= 0)
pe->tce32_seg = -1;
- if (tce_mem)
- __free_pages(tce_mem, get_order(tce_table_size));
+ pnv_pci_free_table(tbl);
}
static void pnv_ioda_setup_dma(struct pnv_phb *phb)
--
2.0.0
This is a part of moving DMA window programming to an iommu_ops
callback.
This is a mechanical patch.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/platforms/powernv/pci-ioda.c | 85 ++++++++++++++++++++-----------
1 file changed, 56 insertions(+), 29 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 908863a..64b7cfe 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1391,6 +1391,57 @@ static void pnv_pci_free_table(struct iommu_table *tbl)
memset(tbl, 0, sizeof(struct iommu_table));
}
+static long pnv_pci_ioda2_set_window(struct pnv_ioda_pe *pe,
+ struct iommu_table *tbl)
+{
+ struct pnv_phb *phb = pe->phb;
+ const __be64 *swinvp;
+ int64_t rc;
+ const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
+ const __u64 win_size = tbl->it_size << tbl->it_page_shift;
+
+ pe_info(pe, "Setting up window at %llx..%llx pagesize=0x%x tablesize=0x%lx\n",
+ start_addr, start_addr + win_size - 1,
+ 1UL << tbl->it_page_shift, tbl->it_size << 3);
+
+ pe->table_group.tables[0] = *tbl;
+ tbl = &pe->table_group.tables[0];
+ tbl->it_group = &pe->table_group;
+
+ /*
+ * Map TCE table through TVT. The TVE index is the PE number
+ * shifted by 1 bit for 32-bits DMA space.
+ */
+ rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
+ pe->pe_number << 1, 1, __pa(tbl->it_base),
+ tbl->it_size << 3, 1ULL << tbl->it_page_shift);
+ if (rc) {
+ pe_err(pe, "Failed to configure TCE table, err %ld\n", rc);
+ goto fail;
+ }
+
+ /* OPAL variant of PHB3 invalidated TCEs */
+ swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
+ if (swinvp) {
+ /* We need a couple more fields -- an address and a data
+ * to or. Since the bus is only printed out on table free
+ * errors, and on the first pass the data will be a relative
+ * bus number, print that out instead.
+ */
+ pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
+ tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys,
+ 8);
+ tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
+ }
+
+ return 0;
+fail:
+ if (pe->tce32_seg >= 0)
+ pe->tce32_seg = -1;
+
+ return rc;
+}
+
static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
{
uint16_t window_id = (pe->pe_number << 1 ) + 1;
@@ -1463,7 +1514,6 @@ static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
struct pnv_ioda_pe *pe)
{
- const __be64 *swinvp;
unsigned int end;
struct iommu_table *tbl = &pe->table_group.tables[0];
int64_t rc;
@@ -1493,31 +1543,14 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
pe->table_group.ops = &pnv_pci_ioda2_ops;
#endif
- /*
- * Map TCE table through TVT. The TVE index is the PE number
- * shifted by 1 bit for 32-bits DMA space.
- */
- rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
- pe->pe_number << 1, 1, __pa(tbl->it_base),
- tbl->it_size << 3, 1ULL << tbl->it_page_shift);
+ rc = pnv_pci_ioda2_set_window(pe, tbl);
if (rc) {
pe_err(pe, "Failed to configure 32-bit TCE table,"
" err %ld\n", rc);
- goto fail;
- }
-
- /* OPAL variant of PHB3 invalidated TCEs */
- swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
- if (swinvp) {
- /* We need a couple more fields -- an address and a data
- * to or. Since the bus is only printed out on table free
- * errors, and on the first pass the data will be a relative
- * bus number, print that out instead.
- */
- pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
- tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys,
- 8);
- tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
+ pnv_pci_free_table(tbl);
+ if (pe->tce32_seg >= 0)
+ pe->tce32_seg = -1;
+ return;
}
iommu_register_group(&pe->table_group, phb->hose->global_number,
pe->pe_number);
@@ -1531,12 +1564,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
/* Also create a bypass window */
if (!pnv_iommu_bypass_disabled)
pnv_pci_ioda2_setup_bypass_pe(phb, pe);
-
- return;
-fail:
- if (pe->tce32_seg >= 0)
- pe->tce32_seg = -1;
- pnv_pci_free_table(tbl);
}
static void pnv_ioda_setup_dma(struct pnv_phb *phb)
--
2.0.0
The iommu_free_table helper releases the memory it is using (the TCE table and
@it_map) and releases the iommu_table struct as well. We might not want
the very last step as we store iommu_table in parent structures.
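A toy illustration of the reset/free split this patch introduces, in plain C with made-up types (not the kernel's): reset releases what the table owns and zeroes the struct so it can stay embedded in a parent structure; free additionally releases the struct itself.

```c
#include <stdlib.h>
#include <string.h>

/* Toy model: "reset" tears down what the table owns and zeroes the
 * struct; "free" additionally releases the struct. Embedded tables
 * (stored inside a parent struct) must only ever be reset. */
struct toy_iommu_table {
	unsigned long it_size;
	unsigned long *it_map;		/* allocation bitmap */
};

static void toy_reset_table(struct toy_iommu_table *tbl)
{
	if (!tbl)
		return;
	free(tbl->it_map);		/* release what the table owns... */
	memset(tbl, 0, sizeof(*tbl));	/* ...but keep the struct itself */
}

static void toy_free_table(struct toy_iommu_table *tbl)
{
	if (!tbl)
		return;
	toy_reset_table(tbl);
	free(tbl);			/* only valid for standalone tables */
}

struct toy_parent {			/* e.g. an iommu_table_group */
	struct toy_iommu_table tables[2];
};
```

Calling toy_free_table() on &parent->tables[0] would be a bug, which is exactly why the kernel patch splits the helper.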
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/iommu.h | 1 +
arch/powerpc/kernel/iommu.c | 57 ++++++++++++++++++++++++----------------
2 files changed, 35 insertions(+), 23 deletions(-)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index bde7ee7..8ed4648 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -127,6 +127,7 @@ static inline void *get_iommu_table_base(struct device *dev)
extern struct iommu_table *iommu_table_alloc(int node);
/* Frees table for an individual device node */
+extern void iommu_reset_table(struct iommu_table *tbl, const char *node_name);
extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
/* Initializes an iommu_table based in values set in the passed-in
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 501e8ee..0bcd988 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -721,24 +721,46 @@ struct iommu_table *iommu_table_alloc(int node)
return &table_group->tables[0];
}
+void iommu_reset_table(struct iommu_table *tbl, const char *node_name)
+{
+ if (!tbl)
+ return;
+
+ if (tbl->it_map) {
+ unsigned long bitmap_sz;
+ unsigned int order;
+
+ /*
+ * In case we have reserved the first bit, we should not emit
+ * the warning below.
+ */
+ if (tbl->it_offset == 0)
+ clear_bit(0, tbl->it_map);
+
+ /* verify that table contains no entries */
+ if (!bitmap_empty(tbl->it_map, tbl->it_size))
+ pr_warn("%s: Unexpected TCEs for %s\n", __func__,
+ node_name);
+
+ /* calculate bitmap size in bytes */
+ bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
+
+ /* free bitmap */
+ order = get_order(bitmap_sz);
+ free_pages((unsigned long) tbl->it_map, order);
+ }
+
+ memset(tbl, 0, sizeof(*tbl));
+}
+
void iommu_free_table(struct iommu_table *tbl, const char *node_name)
{
- unsigned long bitmap_sz;
- unsigned int order;
struct iommu_table_group *table_group = tbl->it_group;
- if (!tbl || !tbl->it_map) {
- printk(KERN_ERR "%s: expected TCE map for %s\n", __func__,
- node_name);
+ if (!tbl)
return;
- }
- /*
- * In case we have reserved the first bit, we should not emit
- * the warning below.
- */
- if (tbl->it_offset == 0)
- clear_bit(0, tbl->it_map);
+ iommu_reset_table(tbl, node_name);
#ifdef CONFIG_IOMMU_API
if (table_group->group) {
@@ -747,17 +769,6 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
}
#endif
- /* verify that table contains no entries */
- if (!bitmap_empty(tbl->it_map, tbl->it_size))
- pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
-
- /* calculate bitmap size in bytes */
- bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
-
- /* free bitmap */
- order = get_order(bitmap_sz);
- free_pages((unsigned long) tbl->it_map, order);
-
/* free table */
kfree(table_group);
}
--
2.0.0
TCE tables might get too big in case of 4K IOMMU pages and DDW enabled
on huge guests (hundreds of GB of RAM), so the kernel might be unable to
allocate a contiguous chunk of physical memory to store the TCE table.
To address this, the POWER8 CPU (actually, IODA2) supports multi-level TCE
tables, up to 5 levels, which split the table into a tree of smaller subtables.
This adds multi-level TCE table support to the pnv_pci_ioda2_create_table()
and pnv_pci_free_table() callbacks.
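To make the sizing concrete, here is a standalone sketch (plain C, outside the kernel) of the arithmetic behind the allocation: the total TCE bytes a window needs, and the per-level chunk shift that splits the table across the requested number of levels. The ROUND_UP macro and the 4K floor follow the patch; the helper names are illustrative, not the kernel's.

```c
#include <stdint.h>

/* As in the patch; exact only when n is a power of two. */
#define ROUND_UP(x, n) (((x) + (n) - 1u) & ~((n) - 1u))

static unsigned ilog2_u64(uint64_t v)
{
	unsigned r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Total bytes of TCEs for a window: one 8-byte entry per IOMMU page,
 * with a 4K minimum. A 256GB window with 4K pages needs 512MB of TCEs,
 * which is why a single contiguous allocation becomes impractical. */
static uint64_t tce_table_bytes(uint64_t window_size, unsigned page_shift)
{
	uint64_t sz = (window_size >> page_shift) << 3;

	return sz < 0x1000 ? 0x1000 : sz;
}

/* Per-level chunk shift: split log2(entries) evenly across @levels,
 * add 3 because each entry is 8 bytes, never go below a 4K chunk. */
static unsigned level_chunk_shift(uint64_t window_size, unsigned page_shift,
				  unsigned levels)
{
	unsigned shift;

	shift = ROUND_UP(ilog2_u64(window_size) - page_shift, levels) / levels;
	shift += 3;
	return shift < 12 ? 12 : shift;	/* IOMMU_PAGE_SHIFT_4K */
}
```

With two levels, the same 256GB/4K window is served by 64KB chunks instead of one 512MB block.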
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/iommu.h | 2 +
arch/powerpc/platforms/powernv/pci-ioda.c | 128 ++++++++++++++++++++++++------
arch/powerpc/platforms/powernv/pci.c | 19 +++++
3 files changed, 123 insertions(+), 26 deletions(-)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 8ed4648..1e0d907 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -90,6 +90,8 @@ struct iommu_pool {
struct iommu_table {
unsigned long it_busno; /* Bus number this table belongs to */
unsigned long it_size; /* Size of iommu table in entries */
+ unsigned long it_indirect_levels;
+ unsigned long it_level_size;
unsigned long it_offset; /* Offset into global table */
unsigned long it_base; /* mapped address of tce table */
unsigned long it_index; /* which iommu table this is */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 64b7cfe..c212e51 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -47,6 +47,10 @@
#include "powernv.h"
#include "pci.h"
+#define POWERNV_IOMMU_DEFAULT_LEVELS 1
+#define POWERNV_IOMMU_MAX_LEVELS 5
+#define ROUND_UP(x, n) (((x) + (n) - 1u) & ~((n) - 1u))
+
static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
const char *fmt, ...)
{
@@ -1339,16 +1343,82 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
}
+static void pnv_free_tce_table(unsigned long addr, unsigned long size,
+ unsigned level)
+{
+ addr &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+ if (level) {
+ long i;
+ u64 *tmp = (u64 *) addr;
+
+ for (i = 0; i < size; ++i) {
+ unsigned long hpa = be64_to_cpu(tmp[i]);
+
+ if (!(hpa & (TCE_PCI_READ | TCE_PCI_WRITE)))
+ continue;
+
+ pnv_free_tce_table((unsigned long) __va(hpa),
+ size, level - 1);
+ }
+ }
+
+ free_pages(addr, get_order(size << 3));
+}
+
+static __be64 *pnv_alloc_tce_table(int nid,
+ unsigned shift, unsigned levels, unsigned long *left)
+{
+ struct page *tce_mem = NULL;
+ __be64 *addr, *tmp;
+ unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT;
+ unsigned long chunk = 1UL << shift, i;
+
+ tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
+ if (!tce_mem) {
+ pr_err("Failed to allocate a TCE memory\n");
+ return NULL;
+ }
+
+ if (!*left)
+ return NULL;
+
+ addr = page_address(tce_mem);
+ memset(addr, 0, chunk);
+
+ --levels;
+ if (!levels) {
+ /* This is last level, actual TCEs */
+ *left -= min(*left, chunk);
+ return addr;
+ }
+
+ for (i = 0; i < (chunk >> 3); ++i) {
+ /* We allocated required TCEs, mark the rest "page fault" */
+ if (!*left) {
+ addr[i] = cpu_to_be64(0);
+ continue;
+ }
+
+ tmp = pnv_alloc_tce_table(nid, shift, levels, left);
+ addr[i] = cpu_to_be64(__pa(tmp) |
+ TCE_PCI_READ | TCE_PCI_WRITE);
+ }
+
+ return addr;
+}
+
static long pnv_pci_ioda2_create_table(struct pnv_ioda_pe *pe,
- __u32 page_shift, __u64 window_size,
+ __u32 page_shift, __u64 window_size, __u32 levels,
struct iommu_table *tbl)
{
int nid = pe->phb->hose->node;
- struct page *tce_mem = NULL;
void *addr;
- unsigned long tce_table_size;
- int64_t rc;
- unsigned order;
+ unsigned long tce_table_size, left;
+ unsigned shift;
+
+ if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS))
+ return -EINVAL;
if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size))
return -EINVAL;
@@ -1357,16 +1427,19 @@ static long pnv_pci_ioda2_create_table(struct pnv_ioda_pe *pe,
tce_table_size = max(0x1000UL, tce_table_size);
/* Allocate TCE table */
- order = get_order(tce_table_size);
+ shift = ROUND_UP(ilog2(window_size) - page_shift, levels) / levels;
+ shift += 3;
+ shift = max_t(unsigned, shift, IOMMU_PAGE_SHIFT_4K);
+ pr_info("Creating TCE table %08llx, %d levels, TCE table size = %lx\n",
+ window_size, levels, 1UL << shift);
- tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
- if (!tce_mem) {
- pr_err("Failed to allocate a TCE memory, order=%d\n", order);
- rc = -ENOMEM;
- goto fail;
- }
- addr = page_address(tce_mem);
- memset(addr, 0, tce_table_size);
+ tbl->it_level_size = 1ULL << (shift - 3);
+ left = tce_table_size;
+ addr = pnv_alloc_tce_table(nid, shift, levels, &left);
+ if (!addr)
+ return -ENOMEM;
+
+ tbl->it_indirect_levels = levels - 1;
/* Setup linux iommu table */
pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
@@ -1375,20 +1448,18 @@ static long pnv_pci_ioda2_create_table(struct pnv_ioda_pe *pe,
tbl->it_ops = &pnv_ioda2_iommu_ops;
return 0;
-fail:
- if (tce_mem)
- __free_pages(tce_mem, get_order(tce_table_size));
-
- return rc;
}
static void pnv_pci_free_table(struct iommu_table *tbl)
{
+ const unsigned long size = tbl->it_indirect_levels ?
+ tbl->it_level_size : tbl->it_size;
+
if (!tbl->it_size)
return;
- free_pages(tbl->it_base, get_order(tbl->it_size << 3));
- memset(tbl, 0, sizeof(struct iommu_table));
+ pnv_free_tce_table(tbl->it_base, size, tbl->it_indirect_levels);
+ iommu_reset_table(tbl, "ioda2");
}
static long pnv_pci_ioda2_set_window(struct pnv_ioda_pe *pe,
@@ -1397,12 +1468,15 @@ static long pnv_pci_ioda2_set_window(struct pnv_ioda_pe *pe,
struct pnv_phb *phb = pe->phb;
const __be64 *swinvp;
int64_t rc;
+ const unsigned long size = tbl->it_indirect_levels ?
+ tbl->it_level_size : tbl->it_size;
const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
const __u64 win_size = tbl->it_size << tbl->it_page_shift;
- pe_info(pe, "Setting up window at %llx..%llx pagesize=0x%x tablesize=0x%lx\n",
+ pe_info(pe, "Setting up window at %llx..%llx pagesize=0x%x tablesize=0x%lx levels=%d levelsize=%x\n",
start_addr, start_addr + win_size - 1,
- 1UL << tbl->it_page_shift, tbl->it_size << 3);
+ 1UL << tbl->it_page_shift, tbl->it_size,
+ tbl->it_indirect_levels + 1, tbl->it_level_size);
pe->table_group.tables[0] = *tbl;
tbl = &pe->table_group.tables[0];
@@ -1413,8 +1487,9 @@ static long pnv_pci_ioda2_set_window(struct pnv_ioda_pe *pe,
* shifted by 1 bit for 32-bits DMA space.
*/
rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
- pe->pe_number << 1, 1, __pa(tbl->it_base),
- tbl->it_size << 3, 1ULL << tbl->it_page_shift);
+ pe->pe_number << 1, tbl->it_indirect_levels + 1,
+ __pa(tbl->it_base),
+ size << 3, 1ULL << tbl->it_page_shift);
if (rc) {
pe_err(pe, "Failed to configure TCE table, err %ld\n", rc);
goto fail;
@@ -1530,7 +1605,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
end);
rc = pnv_pci_ioda2_create_table(pe, IOMMU_PAGE_SHIFT_4K,
- phb->ioda.m32_pci_base, tbl);
+ phb->ioda.m32_pci_base,
+ POWERNV_IOMMU_DEFAULT_LEVELS, tbl);
if (rc) {
pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc);
return;
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index a9797dd..e734e37 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -592,6 +592,25 @@ struct pci_ops pnv_pci_ops = {
static __be64 *pnv_tce(struct iommu_table *tbl, long index)
{
__be64 *tmp = ((__be64 *)tbl->it_base);
+ int level = tbl->it_indirect_levels;
+ const long shift = ilog2(tbl->it_level_size);
+ unsigned long mask = (tbl->it_level_size - 1) << (level * shift);
+
+ if (index >= tbl->it_size)
+ return NULL;
+
+ while (level) {
+ int n = (index & mask) >> (level * shift);
+ unsigned long tce = be64_to_cpu(tmp[n]);
+
+ if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE)))
+ return NULL;
+
+ tmp = __va(tce & ~(TCE_PCI_READ | TCE_PCI_WRITE));
+ index &= ~mask;
+ mask >>= shift;
+ --level;
+ }
return tmp + index;
}
--
2.0.0
This changes a few functions to receive an iommu_table_group pointer
rather than a PE as they are going to be part of the upcoming
iommu_table_group_ops callback set.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/platforms/powernv/pci-ioda.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index c212e51..99d1a92 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1408,10 +1408,12 @@ static __be64 *pnv_alloc_tce_table(int nid,
return addr;
}
-static long pnv_pci_ioda2_create_table(struct pnv_ioda_pe *pe,
+static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
__u32 page_shift, __u64 window_size, __u32 levels,
struct iommu_table *tbl)
{
+ struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
+ table_group);
int nid = pe->phb->hose->node;
void *addr;
unsigned long tce_table_size, left;
@@ -1462,9 +1464,11 @@ static void pnv_pci_free_table(struct iommu_table *tbl)
iommu_reset_table(tbl, "ioda2");
}
-static long pnv_pci_ioda2_set_window(struct pnv_ioda_pe *pe,
+static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
struct iommu_table *tbl)
{
+ struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
+ table_group);
struct pnv_phb *phb = pe->phb;
const __be64 *swinvp;
int64_t rc;
@@ -1599,12 +1603,11 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
/* The PE will reserve all possible 32-bits space */
pe->tce32_seg = 0;
-
end = (1 << ilog2(phb->ioda.m32_pci_base));
pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
end);
- rc = pnv_pci_ioda2_create_table(pe, IOMMU_PAGE_SHIFT_4K,
+ rc = pnv_pci_ioda2_create_table(&pe->table_group, IOMMU_PAGE_SHIFT_4K,
phb->ioda.m32_pci_base,
POWERNV_IOMMU_DEFAULT_LEVELS, tbl);
if (rc) {
@@ -1619,7 +1622,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
pe->table_group.ops = &pnv_pci_ioda2_ops;
#endif
- rc = pnv_pci_ioda2_set_window(pe, tbl);
+ rc = pnv_pci_ioda2_set_window(&pe->table_group, tbl);
if (rc) {
pe_err(pe, "Failed to configure 32-bit TCE table,"
" err %ld\n", rc);
--
2.0.0
This extends iommu_table_group_ops by a set of callbacks to support
dynamic DMA window management.
query() returns IOMMU capabilities such as the default DMA window address
and the supported number of DMA windows and TCE table levels.
create_table() creates a TCE table with specific parameters.
It receives an iommu_table_group to know the node ID in order to allocate
the TCE table memory closer to the PHB. The exact format of the allocated
multi-level table might also be specific to the PHB model (not
the case now though).
This callback calculates the DMA window offset on a PCI bus from @num
and stores it in the just created table.
set_window() sets the window at the specified TVT index + @num on the PHB.
unset_window() unsets the window from the specified TVT.
This adds a free() callback to iommu_table_ops to free the memory
(potentially a tree of tables) allocated for the TCE table.
create_table() and free() are supposed to be called once per
VFIO container, and set_window()/unset_window() are supposed to be
called for every group in a container.
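As a rough sketch of that calling contract (create_table()/free() once per container, set_window()/unset_window() once per group), here is a toy model in plain C. The types, signatures, and counting backend are made up for illustration; this is not the kernel API, only the ordering it implies.

```c
#include <stddef.h>

/* Toy table and callback set; layout is illustrative only. */
struct toy_table { unsigned long size; int programmed_in; };

struct toy_group_ops {
	long (*create_table)(int num, unsigned page_shift,
			     unsigned long long window_size, unsigned levels,
			     struct toy_table *tbl);
	long (*set_window)(int num, struct toy_table *tbl);
	long (*unset_window)(int num);
	void (*free_table)(struct toy_table *tbl);
};

static int windows_programmed;	/* fake hardware state */

static long toy_create(int num, unsigned page_shift,
		       unsigned long long window_size, unsigned levels,
		       struct toy_table *tbl)
{
	(void)num; (void)levels;
	tbl->size = window_size >> page_shift;	/* entries */
	tbl->programmed_in = 0;
	return 0;
}

static long toy_set(int num, struct toy_table *tbl)
{
	(void)num;
	tbl->programmed_in++;
	windows_programmed++;
	return 0;
}

static long toy_unset(int num)
{
	(void)num;
	windows_programmed--;
	return 0;
}

static void toy_free(struct toy_table *tbl) { tbl->size = 0; }

static const struct toy_group_ops ops = {
	toy_create, toy_set, toy_unset, toy_free
};

/* One table per container; every group in the container gets the
 * same table programmed into its own window slot. */
static long container_attach_two_groups(struct toy_table *tbl)
{
	long ret = ops.create_table(0, 12, 1ULL << 30, 1, tbl);

	if (ret)
		return ret;
	ret = ops.set_window(0, tbl);		/* group A */
	if (!ret)
		ret = ops.set_window(0, tbl);	/* group B shares the table */
	return ret;
}
```

Detach mirrors this: unset_window() for each group, then free() once.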
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/iommu.h | 21 +++++++
arch/powerpc/platforms/powernv/pci-ioda.c | 87 ++++++++++++++++++++++++-----
arch/powerpc/platforms/powernv/pci-p5ioc2.c | 11 +++-
3 files changed, 102 insertions(+), 17 deletions(-)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 1e0d907..2c08c91 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -64,6 +64,8 @@ struct iommu_table_ops {
long index, long npages);
unsigned long (*get)(struct iommu_table *tbl, long index);
void (*flush)(struct iommu_table *tbl);
+
+ void (*free)(struct iommu_table *tbl);
};
/* These are used by VIO */
@@ -150,12 +152,31 @@ struct iommu_table_group_ops {
*/
void (*set_ownership)(struct iommu_table_group *table_group,
bool enable);
+
+ long (*create_table)(struct iommu_table_group *table_group,
+ int num,
+ __u32 page_shift,
+ __u64 window_size,
+ __u32 levels,
+ struct iommu_table *tbl);
+ long (*set_window)(struct iommu_table_group *table_group,
+ int num,
+ struct iommu_table *tblnew);
+ long (*unset_window)(struct iommu_table_group *table_group,
+ int num);
};
struct iommu_table_group {
#ifdef CONFIG_IOMMU_API
struct iommu_group *group;
#endif
+ /* Some key properties of IOMMU */
+ __u32 tce32_start;
+ __u32 tce32_size;
+ __u64 pgsizes; /* Bitmap of supported page sizes */
+ __u32 max_dynamic_windows_supported;
+ __u32 max_levels;
+
struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
struct iommu_table_group_ops *ops;
};
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 99d1a92..ab0cfb7 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -25,6 +25,7 @@
#include <linux/memblock.h>
#include <linux/iommu.h>
#include <linux/mmzone.h>
+#include <linux/sizes.h>
#include <asm/mmzone.h>
#include <asm/sections.h>
@@ -1236,6 +1237,8 @@ static void pnv_ioda2_tce_free_vm(struct iommu_table *tbl, long index,
pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
}
+static void pnv_pci_free_table(struct iommu_table *tbl);
+
static struct iommu_table_ops pnv_ioda2_iommu_ops = {
.set = pnv_ioda2_tce_build_vm,
#ifdef CONFIG_IOMMU_API
@@ -1243,6 +1246,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
#endif
.clear = pnv_ioda2_tce_free_vm,
.get = pnv_tce_get,
+ .free = pnv_pci_free_table,
};
static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
@@ -1325,6 +1329,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
TCE_PCI_SWINV_PAIR);
}
tbl->it_ops = &pnv_ioda1_iommu_ops;
+ pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift;
+ pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift;
iommu_init_table(tbl, phb->hose->node);
iommu_register_group(&pe->table_group, phb->hose->global_number,
pe->pe_number);
@@ -1409,7 +1415,7 @@ static __be64 *pnv_alloc_tce_table(int nid,
}
static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
- __u32 page_shift, __u64 window_size, __u32 levels,
+ int num, __u32 page_shift, __u64 window_size, __u32 levels,
struct iommu_table *tbl)
{
struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
@@ -1422,6 +1428,9 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS))
return -EINVAL;
+ if (!(table_group->pgsizes & (1ULL << page_shift)))
+ return -EINVAL;
+
if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size))
return -EINVAL;
@@ -1432,8 +1441,8 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
shift = ROUND_UP(ilog2(window_size) - page_shift, levels) / levels;
shift += 3;
shift = max_t(unsigned, shift, IOMMU_PAGE_SHIFT_4K);
- pr_info("Creating TCE table %08llx, %d levels, TCE table size = %lx\n",
- window_size, levels, 1UL << shift);
+ pr_info("Creating TCE table #%d %08llx, %d levels, TCE table size = %lx\n",
+ num, window_size, levels, 1UL << shift);
tbl->it_level_size = 1ULL << (shift - 3);
left = tce_table_size;
@@ -1444,8 +1453,8 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
tbl->it_indirect_levels = levels - 1;
/* Setup linux iommu table */
- pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
- page_shift);
+ pnv_pci_setup_iommu_table(tbl, addr, tce_table_size,
+ num ? pe->tce_bypass_base : 0, page_shift);
tbl->it_ops = &pnv_ioda2_iommu_ops;
@@ -1464,8 +1473,21 @@ static void pnv_pci_free_table(struct iommu_table *tbl)
iommu_reset_table(tbl, "ioda2");
}
+static inline void pnv_pci_ioda2_tvt_invalidate(unsigned int pe_number,
+ unsigned long it_index)
+{
+ __be64 __iomem *invalidate = (__be64 __iomem *)it_index;
+ /* 01xb - invalidate TCEs that match the specified PE# */
+ unsigned long addr = (0x4ull << 60) | (pe_number & 0xFF);
+
+ if (!it_index)
+ return;
+
+ __raw_writeq(cpu_to_be64(addr), invalidate);
+}
+
static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
- struct iommu_table *tbl)
+ int num, struct iommu_table *tbl)
{
struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
table_group);
@@ -1477,13 +1499,13 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
const __u64 win_size = tbl->it_size << tbl->it_page_shift;
- pe_info(pe, "Setting up window at %llx..%llx pagesize=0x%x tablesize=0x%lx levels=%d levelsize=%x\n",
- start_addr, start_addr + win_size - 1,
+ pe_info(pe, "Setting up window #%d (%p) at %llx..%llx pagesize=0x%x tablesize=0x%lx levels=%d levelsize=%x\n",
+ num, tbl, start_addr, start_addr + win_size - 1,
1UL << tbl->it_page_shift, tbl->it_size,
tbl->it_indirect_levels + 1, tbl->it_level_size);
- pe->table_group.tables[0] = *tbl;
- tbl = &pe->table_group.tables[0];
+ pe->table_group.tables[num] = *tbl;
+ tbl = &pe->table_group.tables[num];
tbl->it_group = &pe->table_group;
/*
@@ -1491,7 +1513,8 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
* shifted by 1 bit for 32-bits DMA space.
*/
rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
- pe->pe_number << 1, tbl->it_indirect_levels + 1,
+ (pe->pe_number << 1) + num,
+ tbl->it_indirect_levels + 1,
__pa(tbl->it_base),
size << 3, 1ULL << tbl->it_page_shift);
if (rc) {
@@ -1513,6 +1536,8 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
}
+ pnv_pci_ioda2_tvt_invalidate(pe->pe_number, tbl->it_index);
+
return 0;
fail:
if (pe->tce32_seg >= 0)
@@ -1572,6 +1597,30 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
}
#ifdef CONFIG_IOMMU_API
+static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
+ int num)
+{
+ struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
+ table_group);
+ struct pnv_phb *phb = pe->phb;
+ struct iommu_table *tbl = &pe->table_group.tables[num];
+ long ret;
+
+ pe_info(pe, "Removing DMA window #%d\n", num);
+
+ ret = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
+ (pe->pe_number << 1) + num,
+ 0/* levels */, 0/* table address */,
+ 0/* table size */, 0/* page size */);
+ if (ret)
+ pe_warn(pe, "Unmapping failed, ret = %ld\n", ret);
+
+ pnv_pci_ioda2_tvt_invalidate(pe->pe_number, tbl->it_index);
+ memset(tbl, 0, sizeof(*tbl));
+
+ return ret;
+}
+
static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
bool enable)
{
@@ -1587,6 +1636,9 @@ static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
.set_ownership = pnv_ioda2_set_ownership,
+ .create_table = pnv_pci_ioda2_create_table,
+ .set_window = pnv_pci_ioda2_set_window,
+ .unset_window = pnv_pci_ioda2_unset_window,
};
#endif
@@ -1607,8 +1659,15 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
end);
- rc = pnv_pci_ioda2_create_table(&pe->table_group, IOMMU_PAGE_SHIFT_4K,
- phb->ioda.m32_pci_base,
+ pe->table_group.tce32_start = 0;
+ pe->table_group.tce32_size = phb->ioda.m32_pci_base;
+ pe->table_group.max_dynamic_windows_supported =
+ IOMMU_TABLE_GROUP_MAX_TABLES;
+ pe->table_group.max_levels = POWERNV_IOMMU_MAX_LEVELS;
+ pe->table_group.pgsizes = SZ_4K | SZ_64K | SZ_16M;
+
+ rc = pnv_pci_ioda2_create_table(&pe->table_group, 0,
+ IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base,
POWERNV_IOMMU_DEFAULT_LEVELS, tbl);
if (rc) {
pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc);
@@ -1622,7 +1681,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
pe->table_group.ops = &pnv_pci_ioda2_ops;
#endif
- rc = pnv_pci_ioda2_set_window(&pe->table_group, tbl);
+ rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
if (rc) {
pe_err(pe, "Failed to configure 32-bit TCE table,"
" err %ld\n", rc);
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index d2d9092..99cb858 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -116,6 +116,8 @@ static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
u64 phb_id;
int64_t rc;
static int primary = 1;
+ struct iommu_table_group *table_group;
+ struct iommu_table *tbl;
pr_info(" Initializing p5ioc2 PHB %s\n", np->full_name);
@@ -180,13 +182,16 @@ static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
pnv_pci_init_p5ioc2_msis(phb);
/* Setup iommu */
- phb->p5ioc2.table_group.tables[0].it_group = &phb->p5ioc2.table_group;
+ table_group = &phb->p5ioc2.table_group;
+ tbl = &phb->p5ioc2.table_group.tables[0];
+ tbl->it_group = table_group;
/* Setup TCEs */
phb->dma_dev_setup = pnv_pci_p5ioc2_dma_dev_setup;
- pnv_pci_setup_iommu_table(&phb->p5ioc2.table_group.tables[0],
- tce_mem, tce_size, 0,
+ pnv_pci_setup_iommu_table(tbl, tce_mem, tce_size, 0,
IOMMU_PAGE_SHIFT_4K);
+ table_group->tce32_start = tbl->it_offset << tbl->it_page_shift;
+ table_group->tce32_size = tbl->it_size << tbl->it_page_shift;
}
void __init pnv_pci_init_p5ioc2_hub(struct device_node *np)
--
2.0.0
Before this, the IOMMU user (VFIO) would take control over the IOMMU table
belonging to a specific IOMMU group. This approach did not allow sharing
tables between IOMMU groups attached to the same container.
This introduces a new IOMMU ownership flavour in which the user does not
just control the existing IOMMU table but can remove/create tables on demand.
If an IOMMU implements a set_ownership() callback, this lets the user have
full control over the IOMMU group. When the ownership is taken,
the platform code removes all the windows so the caller must create them.
Before returning the ownership back to the platform code, VFIO
unprograms and removes all the tables it created.
The old-style ownership is still supported, allowing VFIO to run on older
P5IOC2 and IODA IO controllers.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v6:
* fixed commit log that VFIO removes tables before passing ownership
back to the platform code, not userspace
---
arch/powerpc/platforms/powernv/pci-ioda.c | 30 +++++++++++++++---
drivers/vfio/vfio_iommu_spapr_tce.c | 51 ++++++++++++++++++++++++-------
2 files changed, 66 insertions(+), 15 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index ab0cfb7..751aeab 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1626,11 +1626,33 @@ static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
{
struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
table_group);
- if (enable)
- iommu_take_ownership(table_group);
- else
- iommu_release_ownership(table_group);
+ if (enable) {
+ pnv_pci_ioda2_unset_window(&pe->table_group, 0);
+ pnv_pci_free_table(&pe->table_group.tables[0]);
+ } else {
+ struct iommu_table *tbl = &pe->table_group.tables[0];
+ int64_t rc;
+ rc = pnv_pci_ioda2_create_table(&pe->table_group, 0,
+ IOMMU_PAGE_SHIFT_4K,
+ pe->phb->ioda.m32_pci_base,
+ POWERNV_IOMMU_DEFAULT_LEVELS, tbl);
+ if (rc) {
+ pe_err(pe, "Failed to create 32-bit TCE table, err %ld",
+ rc);
+ return;
+ }
+
+ iommu_init_table(tbl, pe->phb->hose->node);
+
+ rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
+ if (rc) {
+ pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
+ rc);
+ pnv_pci_free_table(tbl);
+ return;
+ }
+ }
pnv_pci_ioda2_set_bypass(pe, !enable);
}
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 7c3c215..9aeaed6 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -226,18 +226,11 @@ static int tce_iommu_clear(struct tce_container *container,
static void tce_iommu_release(void *iommu_data)
{
struct tce_container *container = iommu_data;
- struct iommu_table *tbl;
- struct iommu_table_group *table_group;
WARN_ON(container->grp);
- if (container->grp) {
- table_group = iommu_group_get_iommudata(container->grp);
- tbl = &table_group->tables[0];
- tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
-
+ if (container->grp)
tce_iommu_detach_group(iommu_data, container->grp);
- }
tce_iommu_disable(container);
@@ -553,14 +546,24 @@ static int tce_iommu_attach_group(void *iommu_data,
if (!table_group->ops || !table_group->ops->set_ownership) {
ret = iommu_take_ownership(table_group);
+ } else if (!table_group->ops->create_table ||
+ !table_group->ops->set_window) {
+ WARN_ON_ONCE(1);
+ ret = -EFAULT;
} else {
/*
* Disable iommu bypass, otherwise the user can DMA to all of
* our physical memory via the bypass window instead of just
* the pages that has been explicitly mapped into the iommu
*/
+ struct iommu_table tbltmp = { 0 }, *tbl = &tbltmp;
+
table_group->ops->set_ownership(table_group, true);
- ret = 0;
+ ret = table_group->ops->create_table(table_group, 0,
+ IOMMU_PAGE_SHIFT_4K,
+ table_group->tce32_size, 1, tbl);
+ if (!ret)
+ ret = table_group->ops->set_window(table_group, 0, tbl);
}
if (ret)
@@ -579,6 +582,7 @@ static void tce_iommu_detach_group(void *iommu_data,
{
struct tce_container *container = iommu_data;
struct iommu_table_group *table_group;
+ long i;
mutex_lock(&container->lock);
if (iommu_group != container->grp) {
@@ -602,10 +606,35 @@ static void tce_iommu_detach_group(void *iommu_data,
BUG_ON(!table_group);
/* Kernel owns the device now, we can restore bypass */
- if (!table_group->ops || !table_group->ops->set_ownership)
+ if (!table_group->ops || !table_group->ops->set_ownership) {
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+ struct iommu_table *tbl = &table_group->tables[i];
+
+ if (!tbl->it_size)
+ continue;
+
+ if (!tbl->it_ops)
+ goto unlock_exit;
+ tce_iommu_clear(container, tbl,
+ tbl->it_offset, tbl->it_size);
+ }
iommu_release_ownership(table_group);
- else
+ } else if (!table_group->ops->unset_window) {
+ WARN_ON_ONCE(1);
+ } else {
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+ struct iommu_table *tbl = &table_group->tables[i];
+
+ table_group->ops->unset_window(table_group, i);
+ tce_iommu_clear(container, tbl,
+ tbl->it_offset, tbl->it_size);
+
+ if (tbl->it_ops->free)
+ tbl->it_ops->free(tbl);
+ }
+
table_group->ops->set_ownership(table_group, false);
+ }
unlock_exit:
mutex_unlock(&container->lock);
--
2.0.0
In order to support memory pre-registration, we need a way to track
the use of every registered memory region and only allow unregistration
if a region is not in use anymore. So we need a way to tell which
region a just-cleared TCE came from.
This adds a userspace view of the TCE table into iommu_table struct.
It contains userspace address, one per TCE entry. The table is only
allocated when the ownership over an IOMMU group is taken which means
it is only used from outside of the powernv code (such as VFIO).
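The indexing behind the IOMMU_TABLE_USERSPACE_ENTRY macro added below can be sketched in plain C. The struct here is a reduced stand-in for iommu_table (only the fields the macro touches), not the kernel definition:

```c
#include <assert.h>
#include <stdlib.h>

/* Reduced stand-in for iommu_table: only the fields the lookup uses. */
struct iommu_table_sketch {
	unsigned long it_offset;     /* offset into the global table */
	unsigned long it_size;       /* size in entries */
	unsigned long *it_userspace; /* one userspace address per TCE */
};

/* Mirrors IOMMU_TABLE_USERSPACE_ENTRY: NULL when the table is not
 * owned by VFIO (it_userspace was never allocated), otherwise a
 * pointer into the array, rebased by it_offset. */
static unsigned long *userspace_entry(struct iommu_table_sketch *tbl,
				      unsigned long entry)
{
	if (!tbl->it_userspace)
		return NULL;
	return &tbl->it_userspace[entry - tbl->it_offset];
}
```

The rebasing by it_offset matters because TCE entry numbers are global while the array is sized to this table only.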
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v8:
* added ENOMEM on failed vzalloc()
---
arch/powerpc/include/asm/iommu.h | 6 ++++++
arch/powerpc/kernel/iommu.c | 9 +++++++++
arch/powerpc/platforms/powernv/pci-ioda.c | 25 ++++++++++++++++++++++++-
3 files changed, 39 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 2c08c91..a768a4d 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -106,9 +106,15 @@ struct iommu_table {
unsigned long *it_map; /* A simple allocation bitmap for now */
unsigned long it_page_shift;/* table iommu page size */
struct iommu_table_group *it_group;
+ unsigned long *it_userspace; /* userspace view of the table */
struct iommu_table_ops *it_ops;
};
+#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
+ ((tbl)->it_userspace ? \
+ &((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \
+ NULL)
+
/* Pure 2^n version of get_order */
static inline __attribute_const__
int get_iommu_order(unsigned long size, struct iommu_table *tbl)
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 0bcd988..833b396 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -38,6 +38,7 @@
#include <linux/pci.h>
#include <linux/iommu.h>
#include <linux/sched.h>
+#include <linux/vmalloc.h>
#include <asm/io.h>
#include <asm/prom.h>
#include <asm/iommu.h>
@@ -1069,6 +1070,11 @@ static int iommu_table_take_ownership(struct iommu_table *tbl)
spin_unlock(&tbl->pools[i].lock);
spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
+ BUG_ON(tbl->it_userspace);
+ tbl->it_userspace = vzalloc(sizeof(*tbl->it_userspace) * tbl->it_size);
+ if (!tbl->it_userspace)
+ return -ENOMEM;
+
return 0;
}
@@ -1102,6 +1108,9 @@ static void iommu_table_release_ownership(struct iommu_table *tbl)
{
unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+ vfree(tbl->it_userspace);
+ tbl->it_userspace = NULL;
+
spin_lock_irqsave(&tbl->large_pool.lock, flags);
for (i = 0; i < tbl->nr_pools; i++)
spin_lock(&tbl->pools[i].lock);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 751aeab..3ac523d 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -26,6 +26,7 @@
#include <linux/iommu.h>
#include <linux/mmzone.h>
#include <linux/sizes.h>
+#include <linux/vmalloc.h>
#include <asm/mmzone.h>
#include <asm/sections.h>
@@ -1469,6 +1470,9 @@ static void pnv_pci_free_table(struct iommu_table *tbl)
if (!tbl->it_size)
return;
+ vfree(tbl->it_userspace);
+ tbl->it_userspace = NULL;
+
pnv_free_tce_table(tbl->it_base, size, tbl->it_indirect_levels);
iommu_reset_table(tbl, "ioda2");
}
@@ -1656,9 +1660,28 @@ static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
pnv_pci_ioda2_set_bypass(pe, !enable);
}
+static long pnv_pci_ioda2_create_table_with_uas(
+ struct iommu_table_group *table_group,
+ int num, __u32 page_shift, __u64 window_size, __u32 levels,
+ struct iommu_table *tbl)
+{
+ long ret = pnv_pci_ioda2_create_table(table_group, num,
+ page_shift, window_size, levels, tbl);
+
+ if (ret)
+ return ret;
+
+ BUG_ON(tbl->it_userspace);
+ tbl->it_userspace = vzalloc(sizeof(*tbl->it_userspace) * tbl->it_size);
+ if (!tbl->it_userspace)
+ return -ENOMEM;
+
+ return 0;
+}
+
static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
.set_ownership = pnv_ioda2_set_ownership,
- .create_table = pnv_pci_ioda2_create_table,
+ .create_table = pnv_pci_ioda2_create_table_with_uas,
.set_window = pnv_pci_ioda2_set_window,
.unset_window = pnv_pci_ioda2_unset_window,
};
--
2.0.0
This adds a way for the IOMMU user to know how much memory a new table
will use so it can be accounted in the locked_vm limit before the
allocation happens.
This stores the allocated table size in pnv_pci_ioda2_create_table()
so the locked_vm counter can be updated correctly when a table is
being disposed.
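The check-before-allocate pattern this enables can be sketched as follows. This is a simplified model of the intended flow, not the kernel code; the 4K page size and the helper name are assumptions:

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch: charge the table size against a locked-memory budget first,
 * and only allocate if the budget allows it. */
static long try_account_then_alloc(unsigned long *locked, unsigned long limit,
				   unsigned long bytes, void **out)
{
	unsigned long pages = (bytes + 4095) >> 12; /* assume 4K pages */

	if (*locked + pages > limit)
		return -1;	/* would exceed the limit: fail early,
				 * before touching any memory */
	*locked += pages;
	*out = calloc(1, bytes);
	if (!*out) {
		*locked -= pages; /* undo accounting on failure */
		return -2;
	}
	return 0;
}
```

The point of get_table_size is exactly the first branch: knowing `pages` before create_table runs, so a failed accounting check never leaves a half-built table behind.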
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/iommu.h | 5 +++
arch/powerpc/platforms/powernv/pci-ioda.c | 54 +++++++++++++++++++++++++++++++
2 files changed, 59 insertions(+)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index a768a4d..9027b9e 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -94,6 +94,7 @@ struct iommu_table {
unsigned long it_size; /* Size of iommu table in entries */
unsigned long it_indirect_levels;
unsigned long it_level_size;
+ unsigned long it_allocated_size;
unsigned long it_offset; /* Offset into global table */
unsigned long it_base; /* mapped address of tce table */
unsigned long it_index; /* which iommu table this is */
@@ -159,6 +160,10 @@ struct iommu_table_group_ops {
void (*set_ownership)(struct iommu_table_group *table_group,
bool enable);
+ unsigned long (*get_table_size)(
+ __u32 page_shift,
+ __u64 window_size,
+ __u32 levels);
long (*create_table)(struct iommu_table_group *table_group,
int num,
__u32 page_shift,
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 3ac523d..1d2b1e4 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1373,6 +1373,57 @@ static void pnv_free_tce_table(unsigned long addr, unsigned long size,
free_pages(addr, get_order(size << 3));
}
+static unsigned long pnv_get_tce_table_size(unsigned shift, unsigned levels,
+ unsigned long *left)
+{
+ unsigned long ret, chunk = 1UL << shift, i;
+
+ ret = chunk;
+
+ if (!*left)
+ return 0;
+
+ --levels;
+ if (!levels) {
+ /* This is last level, actual TCEs */
+ *left -= min(*left, chunk);
+ return chunk;
+ }
+
+ for (i = 0; i < (chunk >> 3); ++i) {
+ ret += pnv_get_tce_table_size(shift, levels, left);
+ if (!*left)
+ break;
+ }
+
+ return ret;
+}
+
+static unsigned long pnv_ioda2_get_table_size(__u32 page_shift,
+ __u64 window_size, __u32 levels)
+{
+ unsigned long tce_table_size, shift, ret;
+
+ if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS))
+ return -EINVAL;
+
+ if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size))
+ return -EINVAL;
+
+ tce_table_size = (window_size >> page_shift) * 8;
+ tce_table_size = max(0x1000UL, tce_table_size);
+
+ /* Allocate TCE table */
+ shift = ROUND_UP(ilog2(window_size) - page_shift, levels) / levels;
+ shift += 3;
+ shift = max_t(unsigned, shift, IOMMU_PAGE_SHIFT_4K);
+
+ ret = tce_table_size; /* tbl->it_userspace */
+ ret += pnv_get_tce_table_size(shift, levels, &tce_table_size);
+
+ return ret;
+}
+
static __be64 *pnv_alloc_tce_table(int nid,
unsigned shift, unsigned levels, unsigned long *left)
{
@@ -1452,6 +1503,8 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
return -ENOMEM;
tbl->it_indirect_levels = levels - 1;
+ tbl->it_allocated_size = pnv_ioda2_get_table_size(page_shift,
+ window_size, levels);
/* Setup linux iommu table */
pnv_pci_setup_iommu_table(tbl, addr, tce_table_size,
@@ -1681,6 +1734,7 @@ static long pnv_pci_ioda2_create_table_with_uas(
static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
.set_ownership = pnv_ioda2_set_ownership,
+ .get_table_size = pnv_ioda2_get_table_size,
.create_table = pnv_pci_ioda2_create_table_with_uas,
.set_window = pnv_pci_ioda2_set_window,
.unset_window = pnv_pci_ioda2_unset_window,
--
2.0.0
We are adding support for DMA memory pre-registration to be used in
conjunction with VFIO. The idea is that userspace which is going to
run a guest may want to pre-register a userspace memory region so
that it is pinned once and stays pinned. Having this done,
a hypervisor will not have to pin/unpin pages on every DMA map/unmap
request. This is going to help with multiple pinning of the same memory
and in-kernel acceleration of DMA requests.
This adds a list of memory regions to mm_context_t. Each region consists
of a header and a list of physical addresses. This adds API to:
1. register/unregister memory regions;
2. do final cleanup (which puts all pre-registered pages);
3. do userspace to physical address translation;
4. manage a mapped pages counter; when it is zero, it is safe to
unregister the region.
Multiple registration of the same region is allowed, kref is used to
track the number of registrations.
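The overlap test used when registering a new region (see mm_iommu_alloc below) is the standard interval check: two ranges overlap iff each starts below the other's end. Extracted as a sketch, with a fixed 4K page shift as in the patch:

```c
#include <assert.h>

#define PAGE_SHIFT_SK 12UL	/* PAGE_SHIFT, assumed 4K here */

/* Two regions [ua, ua + entries << PAGE_SHIFT) overlap iff each
 * starts strictly below the other's end; touching ends do not count. */
static int regions_overlap(unsigned long ua1, unsigned long n1,
			   unsigned long ua2, unsigned long n2)
{
	return (ua1 < ua2 + (n2 << PAGE_SHIFT_SK)) &&
	       (ua2 < ua1 + (n1 << PAGE_SHIFT_SK));
}
```

An exact duplicate (same ua, same entries) is handled separately in the patch and returns -EBUSY rather than -EINVAL, since re-registration of the identical range is a distinct, kref-counted case.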
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v8:
* s/mm_iommu_table_group_mem_t/struct mm_iommu_table_group_mem_t/
* fixed error fallback look (s/[i]/[j]/)
---
arch/powerpc/include/asm/mmu-hash64.h | 3 +
arch/powerpc/include/asm/mmu_context.h | 17 +++
arch/powerpc/mm/Makefile | 1 +
arch/powerpc/mm/mmu_context_hash64.c | 6 +
arch/powerpc/mm/mmu_context_hash64_iommu.c | 215 +++++++++++++++++++++++++++++
5 files changed, 242 insertions(+)
create mode 100644 arch/powerpc/mm/mmu_context_hash64_iommu.c
diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
index 4f13c3e..83214c4 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -535,6 +535,9 @@ typedef struct {
/* for 4K PTE fragment support */
void *pte_frag;
#endif
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+ struct list_head iommu_group_mem_list;
+#endif
} mm_context_t;
diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 73382eb..d6116ca 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -16,6 +16,23 @@
*/
extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
extern void destroy_context(struct mm_struct *mm);
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+struct mm_iommu_table_group_mem_t;
+
+extern bool mm_iommu_preregistered(void);
+extern long mm_iommu_alloc(unsigned long ua, unsigned long entries,
+ struct mm_iommu_table_group_mem_t **pmem);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_get(unsigned long ua,
+ unsigned long entries);
+extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
+extern void mm_iommu_cleanup(mm_context_t *ctx);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
+ unsigned long size);
+extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
+ unsigned long ua, unsigned long *hpa);
+extern long mm_iommu_mapped_update(struct mm_iommu_table_group_mem_t *mem,
+ bool inc);
+#endif
extern void switch_mmu_context(struct mm_struct *prev, struct mm_struct *next);
extern void switch_slb(struct task_struct *tsk, struct mm_struct *mm);
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 438dcd3..49fbfc7 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -35,3 +35,4 @@ obj-$(CONFIG_PPC_SUBPAGE_PROT) += subpage-prot.o
obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
obj-$(CONFIG_HIGHMEM) += highmem.o
obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o
+obj-$(CONFIG_SPAPR_TCE_IOMMU) += mmu_context_hash64_iommu.o
diff --git a/arch/powerpc/mm/mmu_context_hash64.c b/arch/powerpc/mm/mmu_context_hash64.c
index 178876ae..eb3080c 100644
--- a/arch/powerpc/mm/mmu_context_hash64.c
+++ b/arch/powerpc/mm/mmu_context_hash64.c
@@ -89,6 +89,9 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
#ifdef CONFIG_PPC_64K_PAGES
mm->context.pte_frag = NULL;
#endif
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+ INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
+#endif
return 0;
}
@@ -132,6 +135,9 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
void destroy_context(struct mm_struct *mm)
{
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+ mm_iommu_cleanup(&mm->context);
+#endif
#ifdef CONFIG_PPC_ICSWX
drop_cop(mm->context.acop, mm);
diff --git a/arch/powerpc/mm/mmu_context_hash64_iommu.c b/arch/powerpc/mm/mmu_context_hash64_iommu.c
new file mode 100644
index 0000000..af7668c
--- /dev/null
+++ b/arch/powerpc/mm/mmu_context_hash64_iommu.c
@@ -0,0 +1,215 @@
+/*
+ * IOMMU helpers in MMU context.
+ *
+ * Copyright (C) 2015 IBM Corp. <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/rculist.h>
+#include <linux/vmalloc.h>
+#include <linux/kref.h>
+#include <asm/mmu_context.h>
+
+struct mm_iommu_table_group_mem_t {
+ struct list_head next;
+ struct rcu_head rcu;
+ struct kref kref; /* one reference per VFIO container */
+ atomic_t mapped; /* number of currently mapped pages */
+ u64 ua; /* userspace address */
+ u64 entries; /* number of entries in hpas[] */
+ u64 *hpas; /* vmalloc'ed */
+};
+
+bool mm_iommu_preregistered(void)
+{
+ if (!current || !current->mm)
+ return false;
+
+ return !list_empty(&current->mm->context.iommu_group_mem_list);
+}
+EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
+
+long mm_iommu_alloc(unsigned long ua, unsigned long entries,
+ struct mm_iommu_table_group_mem_t **pmem)
+{
+ struct mm_iommu_table_group_mem_t *mem;
+ long i, j;
+ struct page *page = NULL;
+
+ list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
+ next) {
+ if ((mem->ua == ua) && (mem->entries == entries))
+ return -EBUSY;
+
+ /* Overlap? */
+ if ((mem->ua < (ua + (entries << PAGE_SHIFT))) &&
+ (ua < (mem->ua + (mem->entries << PAGE_SHIFT))))
+ return -EINVAL;
+ }
+
+ mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+ if (!mem)
+ return -ENOMEM;
+
+ mem->hpas = vzalloc(entries * sizeof(mem->hpas[0]));
+ if (!mem->hpas) {
+ kfree(mem);
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < entries; ++i) {
+ if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
+ 1/* pages */, 1/* iswrite */, &page)) {
+ for (j = 0; j < i; ++j)
+ put_page(pfn_to_page(
+ mem->hpas[j] >> PAGE_SHIFT));
+ vfree(mem->hpas);
+ kfree(mem);
+ return -EFAULT;
+ }
+
+ mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
+ }
+
+ kref_init(&mem->kref);
+ atomic_set(&mem->mapped, 0);
+ mem->ua = ua;
+ mem->entries = entries;
+ *pmem = mem;
+
+ list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_alloc);
+
+static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
+{
+ long i;
+ struct page *page = NULL;
+
+ for (i = 0; i < mem->entries; ++i) {
+ if (!mem->hpas[i])
+ continue;
+
+ page = pfn_to_page(mem->hpas[i] >> PAGE_SHIFT);
+ if (!page)
+ continue;
+
+ put_page(page);
+ mem->hpas[i] = 0;
+ }
+}
+
+static void mm_iommu_free(struct rcu_head *head)
+{
+ struct mm_iommu_table_group_mem_t *mem = container_of(head,
+ struct mm_iommu_table_group_mem_t, rcu);
+
+ mm_iommu_unpin(mem);
+ vfree(mem->hpas);
+ kfree(mem);
+}
+
+static void mm_iommu_release(struct kref *kref)
+{
+ struct mm_iommu_table_group_mem_t *mem = container_of(kref,
+ struct mm_iommu_table_group_mem_t, kref);
+
+ list_del_rcu(&mem->next);
+ call_rcu(&mem->rcu, mm_iommu_free);
+}
+
+struct mm_iommu_table_group_mem_t *mm_iommu_get(unsigned long ua,
+ unsigned long entries)
+{
+ struct mm_iommu_table_group_mem_t *mem;
+
+ list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
+ next) {
+ if ((mem->ua == ua) && (mem->entries == entries)) {
+ kref_get(&mem->kref);
+ return mem;
+ }
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_get);
+
+long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
+{
+ if (atomic_read(&mem->mapped))
+ return -EBUSY;
+
+ kref_put(&mem->kref, mm_iommu_release);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_put);
+
+struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
+ unsigned long size)
+{
+ struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
+
+ list_for_each_entry_rcu(mem,
+ &current->mm->context.iommu_group_mem_list,
+ next) {
+ if ((mem->ua <= ua) &&
+ (ua + size <= mem->ua +
+ (mem->entries << PAGE_SHIFT))) {
+ ret = mem;
+ break;
+ }
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_lookup);
+
+long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
+ unsigned long ua, unsigned long *hpa)
+{
+ const long entry = (ua - mem->ua) >> PAGE_SHIFT;
+ u64 *va = &mem->hpas[entry];
+
+ if (entry >= mem->entries)
+ return -EFAULT;
+
+ *hpa = *va | (ua & ~PAGE_MASK);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
+
+long mm_iommu_mapped_update(struct mm_iommu_table_group_mem_t *mem, bool inc)
+{
+ long ret = 0;
+
+ if (inc)
+ atomic_inc(&mem->mapped);
+ else
+ ret = atomic_dec_if_positive(&mem->mapped);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(mm_iommu_mapped_update);
+
+void mm_iommu_cleanup(mm_context_t *ctx)
+{
+ while (!list_empty(&ctx->iommu_group_mem_list)) {
+ struct mm_iommu_table_group_mem_t *mem;
+
+ mem = list_first_entry(&ctx->iommu_group_mem_list,
+ struct mm_iommu_table_group_mem_t, next);
+ mm_iommu_release(&mem->kref);
+ }
+}
--
2.0.0
The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to be worse with multiple
containers and huge DMA windows. Also, real-time accounting would require
additional tracking of accounted pages due to the page size difference:
the IOMMU uses 4K pages while the system uses 4K or 64K pages.
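To put the page size difference in numbers (a hypothetical 1GB window, not a figure taken from the patch):

```c
#include <assert.h>

/* Same byte range, counted in different page units: locked_vm is kept
 * in system pages while TCEs are pinned per IOMMU page, so the two
 * counts diverge by a factor of (system page / IOMMU page). */
static unsigned long pages_for(unsigned long bytes, unsigned shift)
{
	return bytes >> shift;
}
```

A 1GB window is 262144 entries at a 4K IOMMU page size but only 16384 system pages on a 64K kernel, which is why accounting per mapped page would need its own tracking on top of the pinning itself.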
Another issue is that actual page pinning/unpinning happens on every
DMA map/unmap request. This does not affect performance much now, as
we spend far more time switching context between
guest/userspace/host, but it will start to matter when we add in-kernel
DMA map/unmap acceleration.
This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
2 new ioctls to register/unregister DMA memory -
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
which receive user space address and size of a memory region which
needs to be pinned/unpinned and counted in locked_vm.
New IOMMU splits physical page pinning and TCE table updates into 2 different
operations. It requires 1) guest pages to be registered first 2) subsequent
map/unmap requests to work only with pre-registered memory.
For the default single window case this means that the entire guest
RAM (instead of 2GB) needs to be pinned before using VFIO.
When a huge DMA window is added, no additional pinning will be
required, otherwise it would be guest RAM + 2GB.
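A userspace-side sketch of the v2 flow: fill the registration descriptor once, then issue ordinary map/unmap calls into the registered range. The struct mirrors vfio_iommu_spapr_register_memory from the uapi change below; the container fd and addresses are placeholders and no ioctl is actually issued here:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Mirrors struct vfio_iommu_spapr_register_memory from the patch. */
struct spapr_register_memory {
	uint32_t argsz;
	uint32_t flags;
	uint64_t vaddr;	/* process virtual address */
	uint64_t size;	/* size of the region in bytes */
};

static struct spapr_register_memory prepare_register(uint64_t vaddr,
						     uint64_t size)
{
	struct spapr_register_memory reg = {
		.argsz = sizeof(reg),
		.flags = 0,	/* no flags are defined yet */
		.vaddr = vaddr,
		.size  = size,
	};
	/* Real code would follow with:
	 *   ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
	 * and then any number of VFIO_IOMMU_MAP_DMA calls whose vaddr
	 * falls inside the registered range. */
	return reg;
}
```

Unregistration reuses the same struct and, as the patch notes, must pass the exact vaddr/size pair used at registration time.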
The new memory registration ioctls are not supported by
VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
will require memory to be preregistered in order to work.
The accounting is done per user process.
This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
can do with v1 or v2 IOMMUs.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v7:
* now memory is registered per mm (i.e. process)
* moved memory registration code to powerpc/mmu
* merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
* limited new ioctls to v2 IOMMU
* updated doc
* unsupported ioctls return -ENOTTY instead of -EPERM
v6:
* tce_get_hva_cached() returns hva via a pointer
v4:
* updated docs
* s/kzmalloc/vzalloc/
* in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
replaced offset with index
* renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
and removed duplicating vfio_iommu_spapr_register_memory
---
Documentation/vfio.txt | 23 ++++
drivers/vfio/vfio_iommu_spapr_tce.c | 232 +++++++++++++++++++++++++++++++++++-
include/uapi/linux/vfio.h | 27 +++++
3 files changed, 276 insertions(+), 6 deletions(-)
diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 96978ec..94328c8 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -427,6 +427,29 @@ The code flow from the example above should be slightly changed:
....
+5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
+VFIO_IOMMU_DISABLE and implements 2 new ioctls:
+VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
+(which are unsupported in v1 IOMMU).
+
+PPC64 paravirtualized guests generate a lot of map/unmap requests,
+and the handling of those includes pinning/unpinning pages and updating
+mm::locked_vm counter to make sure we do not exceed the rlimit.
+The v2 IOMMU splits accounting and pinning into separate operations:
+
+- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
+receive a user space address and size of the block to be pinned.
+Bisecting is not supported and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY is expected to
+be called with the exact address and size used for registering
+the memory block. The userspace is not expected to call these often.
+The ranges are stored in a linked list in a VFIO container.
+
+- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
+IOMMU table and do not do pinning; instead these check that the userspace
+address is from a pre-registered range.
+
+This separation helps in optimizing DMA for guests.
+
-------------------------------------------------------------------------------
[1] VFIO was originally an acronym for "Virtual Function I/O" in its
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 9aeaed6..3eded0d 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -21,6 +21,7 @@
#include <linux/vfio.h>
#include <asm/iommu.h>
#include <asm/tce.h>
+#include <asm/mmu_context.h>
#define DRIVER_VERSION "0.1"
#define DRIVER_AUTHOR "[email protected]"
@@ -91,8 +92,58 @@ struct tce_container {
struct iommu_group *grp;
bool enabled;
unsigned long locked_pages;
+ bool v2;
};
+static long tce_unregister_pages(struct tce_container *container,
+ __u64 vaddr, __u64 size)
+{
+ long ret;
+ struct mm_iommu_table_group_mem_t *mem;
+
+ if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
+ return -EINVAL;
+
+ mem = mm_iommu_get(vaddr, size >> PAGE_SHIFT);
+ if (!mem)
+ return -EINVAL;
+
+ ret = mm_iommu_put(mem); /* undo kref_get() from mm_iommu_get() */
+ if (!ret)
+ ret = mm_iommu_put(mem);
+
+ return ret;
+}
+
+static long tce_register_pages(struct tce_container *container,
+ __u64 vaddr, __u64 size)
+{
+ long ret = 0;
+ struct mm_iommu_table_group_mem_t *mem;
+ unsigned long entries = size >> PAGE_SHIFT;
+
+ if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
+ ((vaddr + size) < vaddr))
+ return -EINVAL;
+
+ mem = mm_iommu_get(vaddr, entries);
+ if (!mem) {
+ ret = try_increment_locked_vm(entries);
+ if (ret)
+ return ret;
+
+ ret = mm_iommu_alloc(vaddr, entries, &mem);
+ if (ret) {
+ decrement_locked_vm(entries);
+ return ret;
+ }
+ }
+
+ container->enabled = true;
+
+ return 0;
+}
+
static bool tce_page_is_contained(struct page *page, unsigned page_shift)
{
/*
@@ -205,7 +256,7 @@ static void *tce_iommu_open(unsigned long arg)
{
struct tce_container *container;
- if (arg != VFIO_SPAPR_TCE_IOMMU) {
+ if ((arg != VFIO_SPAPR_TCE_IOMMU) && (arg != VFIO_SPAPR_TCE_v2_IOMMU)) {
pr_err("tce_vfio: Wrong IOMMU type\n");
return ERR_PTR(-EINVAL);
}
@@ -215,6 +266,7 @@ static void *tce_iommu_open(unsigned long arg)
return ERR_PTR(-ENOMEM);
mutex_init(&container->lock);
+ container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
return container;
}
@@ -257,6 +309,49 @@ static void tce_iommu_unuse_page(struct tce_container *container,
put_page(page);
}
+static int tce_get_hva_cached(unsigned long tce, unsigned long size,
+ unsigned long *hva, struct mm_iommu_table_group_mem_t **pmem)
+{
+ long ret = 0;
+ unsigned long hpa;
+ struct mm_iommu_table_group_mem_t *mem;
+
+ mem = mm_iommu_lookup(tce, size);
+ if (!mem)
+ return -EINVAL;
+
+ ret = mm_iommu_ua_to_hpa(mem, tce, &hpa);
+ if (ret)
+ return -EINVAL;
+
+ *hva = (unsigned long) __va(hpa);
+ *pmem = mem;
+
+ return 0;
+}
+
+static void tce_iommu_unuse_page_v2(struct iommu_table *tbl,
+ unsigned long entry)
+{
+ struct mm_iommu_table_group_mem_t *mem = NULL;
+ int ret;
+ unsigned long hva = 0;
+ unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+
+ if (!pua || !current || !current->mm)
+ return;
+
+ ret = tce_get_hva_cached(*pua, IOMMU_PAGE_SIZE(tbl),
+ &hva, &mem);
+ if (ret)
+ pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
+ __func__, *pua, entry, ret);
+ if (mem)
+ mm_iommu_mapped_update(mem, false);
+
+ *pua = 0;
+}
+
static int tce_iommu_clear(struct tce_container *container,
struct iommu_table *tbl,
unsigned long entry, unsigned long pages)
@@ -275,6 +370,11 @@ static int tce_iommu_clear(struct tce_container *container,
if (direction == DMA_NONE)
continue;
+ if (container->v2) {
+ tce_iommu_unuse_page_v2(tbl, entry);
+ continue;
+ }
+
tce_iommu_unuse_page(container, tce);
}
@@ -342,6 +442,62 @@ static long tce_iommu_build(struct tce_container *container,
return ret;
}
+static long tce_iommu_build_v2(struct tce_container *container,
+ struct iommu_table *tbl,
+ unsigned long entry, unsigned long tce, unsigned long pages,
+ enum dma_data_direction direction)
+{
+ long i, ret = 0;
+ struct page *page;
+ unsigned long hva;
+ enum dma_data_direction dirtmp;
+
+ for (i = 0; i < pages; ++i) {
+ struct mm_iommu_table_group_mem_t *mem = NULL;
+ unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl,
+ entry + i);
+
+ ret = tce_get_hva_cached(tce, IOMMU_PAGE_SIZE(tbl),
+ &hva, &mem);
+ if (ret)
+ break;
+
+ page = pfn_to_page(__pa(hva) >> PAGE_SHIFT);
+ if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+ ret = -EPERM;
+ break;
+ }
+
+ /* Preserve offset within IOMMU page */
+ hva |= tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
+ dirtmp = direction;
+
+ ret = iommu_tce_xchg(tbl, entry + i, &hva, &dirtmp);
+ if (ret) {
+ /* dirtmp cannot be DMA_NONE here */
+ tce_iommu_unuse_page_v2(tbl, entry + i);
+ pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
+ __func__, entry << tbl->it_page_shift,
+ tce, ret);
+ break;
+ }
+
+ mm_iommu_mapped_update(mem, true);
+
+ if (dirtmp != DMA_NONE)
+ tce_iommu_unuse_page_v2(tbl, entry + i);
+
+ *pua = tce;
+
+ tce += IOMMU_PAGE_SIZE(tbl);
+ }
+
+ if (ret)
+ tce_iommu_clear(container, tbl, entry, i);
+
+ return ret;
+}
+
static long tce_iommu_ioctl(void *iommu_data,
unsigned int cmd, unsigned long arg)
{
@@ -353,6 +509,7 @@ static long tce_iommu_ioctl(void *iommu_data,
case VFIO_CHECK_EXTENSION:
switch (arg) {
case VFIO_SPAPR_TCE_IOMMU:
+ case VFIO_SPAPR_TCE_v2_IOMMU:
ret = 1;
break;
default:
@@ -440,11 +597,18 @@ static long tce_iommu_ioctl(void *iommu_data,
if (ret)
return ret;
- ret = tce_iommu_build(container, tbl,
- param.iova >> tbl->it_page_shift,
- param.vaddr,
- param.size >> tbl->it_page_shift,
- direction);
+ if (container->v2)
+ ret = tce_iommu_build_v2(container, tbl,
+ param.iova >> tbl->it_page_shift,
+ param.vaddr,
+ param.size >> tbl->it_page_shift,
+ direction);
+ else
+ ret = tce_iommu_build(container, tbl,
+ param.iova >> tbl->it_page_shift,
+ param.vaddr,
+ param.size >> tbl->it_page_shift,
+ direction);
iommu_flush_tce(tbl);
@@ -489,7 +653,60 @@ static long tce_iommu_ioctl(void *iommu_data,
return ret;
}
+ case VFIO_IOMMU_SPAPR_REGISTER_MEMORY: {
+ struct vfio_iommu_spapr_register_memory param;
+
+ if (!container->v2)
+ break;
+
+ minsz = offsetofend(struct vfio_iommu_spapr_register_memory,
+ size);
+
+ if (copy_from_user(&param, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (param.argsz < minsz)
+ return -EINVAL;
+
+ /* No flag is supported now */
+ if (param.flags)
+ return -EINVAL;
+
+ mutex_lock(&container->lock);
+ ret = tce_register_pages(container, param.vaddr, param.size);
+ mutex_unlock(&container->lock);
+
+ return ret;
+ }
+ case VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY: {
+ struct vfio_iommu_spapr_register_memory param;
+
+ if (!container->v2)
+ break;
+
+ minsz = offsetofend(struct vfio_iommu_spapr_register_memory,
+ size);
+
+ if (copy_from_user(&param, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (param.argsz < minsz)
+ return -EINVAL;
+
+ /* No flag is supported now */
+ if (param.flags)
+ return -EINVAL;
+
+ mutex_lock(&container->lock);
+ tce_unregister_pages(container, param.vaddr, param.size);
+ mutex_unlock(&container->lock);
+
+ return 0;
+ }
case VFIO_IOMMU_ENABLE:
+ if (container->v2)
+ break;
+
mutex_lock(&container->lock);
ret = tce_iommu_enable(container);
mutex_unlock(&container->lock);
@@ -497,6 +714,9 @@ static long tce_iommu_ioctl(void *iommu_data,
case VFIO_IOMMU_DISABLE:
+ if (container->v2)
+ break;
+
mutex_lock(&container->lock);
tce_iommu_disable(container);
mutex_unlock(&container->lock);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 82889c3..fbc5286 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -36,6 +36,8 @@
/* Two-stage IOMMU */
#define VFIO_TYPE1_NESTING_IOMMU 6 /* Implies v2 */
+#define VFIO_SPAPR_TCE_v2_IOMMU 7
+
/*
* The IOCTL interface is designed for extensibility by embedding the
* structure length (argsz) and flags into structures passed between
@@ -493,6 +495,31 @@ struct vfio_eeh_pe_op {
#define VFIO_EEH_PE_OP _IO(VFIO_TYPE, VFIO_BASE + 21)
+/**
+ * VFIO_IOMMU_SPAPR_REGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 17, struct vfio_iommu_spapr_register_memory)
+ *
+ * Registers user space memory where DMA is allowed. It pins
+ * user pages and does the locked memory accounting so
+ * subsequent VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA calls
+ * get faster.
+ */
+struct vfio_iommu_spapr_register_memory {
+ __u32 argsz;
+ __u32 flags;
+ __u64 vaddr; /* Process virtual address */
+ __u64 size; /* Size of mapping (bytes) */
+};
+#define VFIO_IOMMU_SPAPR_REGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 17)
+
+/**
+ * VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 18, struct vfio_iommu_spapr_register_memory)
+ *
+ * Unregisters user space memory registered with
+ * VFIO_IOMMU_SPAPR_REGISTER_MEMORY.
+ * Uses vfio_iommu_spapr_register_memory for parameters.
+ */
+#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 18)
+
/* ***************************************************************** */
#endif /* _UAPIVFIO_H */
--
2.0.0
At the moment only one group per container is supported.
POWER8 CPUs have a more flexible design and allow having 2 TCE tables per
IOMMU group, so we can relax this limitation and support multiple groups
per container.
This adds TCE table descriptors to a container and uses iommu_table_group_ops
to create/set DMA windows on IOMMU groups so the same TCE tables will be
shared between several IOMMU groups.
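The resulting ownership model can be sketched as follows: the tables move from the group into the container, and attaching a group only links it into a list. This is a simplified model with placeholder types, not the kernel structures:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TABLES 2	/* IOMMU_TABLE_GROUP_MAX_TABLES in the patch */

struct tce_group_sk {
	struct tce_group_sk *next;
	int grp_id;
};

/* Tables now live in the container, shared by every attached group. */
struct tce_container_sk {
	void *tables[MAX_TABLES];
	struct tce_group_sk *group_list;
};

static void attach_group(struct tce_container_sk *c, struct tce_group_sk *g)
{
	g->next = c->group_list;
	c->group_list = g;	/* tables untouched: already shared */
}

static int groups_attached(struct tce_container_sk *c)
{
	return c->group_list != NULL;
}
```

This mirrors the tce_groups_attached() helper in the diff below: once any group is attached the container's tables are live, and a second group attaching simply has the same windows programmed into it.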
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v7:
* updated doc
---
Documentation/vfio.txt | 8 +-
drivers/vfio/vfio_iommu_spapr_tce.c | 289 ++++++++++++++++++++++++++----------
2 files changed, 214 insertions(+), 83 deletions(-)
diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 94328c8..7dcf2b5 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
This implementation has some specifics:
-1) Only one IOMMU group per container is supported as an IOMMU group
-represents the minimal entity which isolation can be guaranteed for and
-groups are allocated statically, one per a Partitionable Endpoint (PE)
+1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
+container is supported as an IOMMU table is allocated at boot time,
+one table per IOMMU group, where a group is a Partitionable Endpoint (PE)
(PE is often a PCI domain but not always).
+Newer systems (POWER8 with IODA2) have an improved hardware design which
+removes this limitation and allows multiple IOMMU groups per VFIO container.
2) The hardware supports so called DMA windows - the PCI address range
within which DMA transfer is allowed, any attempt to access address space
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 3eded0d..a9520a8 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -89,10 +89,16 @@ static void decrement_locked_vm(long npages)
*/
struct tce_container {
struct mutex lock;
- struct iommu_group *grp;
bool enabled;
unsigned long locked_pages;
bool v2;
+ struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
+ struct list_head group_list;
+};
+
+struct tce_iommu_group {
+ struct list_head next;
+ struct iommu_group *grp;
};
static long tce_unregister_pages(struct tce_container *container,
@@ -154,20 +160,20 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift)
return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift;
}
+static inline bool tce_groups_attached(struct tce_container *container)
+{
+ return !list_empty(&container->group_list);
+}
+
static struct iommu_table *spapr_tce_find_table(
struct tce_container *container,
phys_addr_t ioba)
{
long i;
struct iommu_table *ret = NULL;
- struct iommu_table_group *table_group;
-
- table_group = iommu_group_get_iommudata(container->grp);
- if (!table_group)
- return NULL;
for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
- struct iommu_table *tbl = &table_group->tables[i];
+ struct iommu_table *tbl = &container->tables[i];
unsigned long entry = ioba >> tbl->it_page_shift;
unsigned long start = tbl->it_offset;
unsigned long end = start + tbl->it_size;
@@ -185,11 +191,8 @@ static int tce_iommu_enable(struct tce_container *container)
{
int ret = 0;
unsigned long locked;
- struct iommu_table *tbl;
struct iommu_table_group *table_group;
-
- if (!container->grp)
- return -ENXIO;
+ struct tce_iommu_group *tcegrp;
if (!current->mm)
return -ESRCH; /* process exited */
@@ -222,12 +225,24 @@ static int tce_iommu_enable(struct tce_container *container)
* as this information is only available from KVM and VFIO is
* KVM agnostic.
*/
- table_group = iommu_group_get_iommudata(container->grp);
+ if (!tce_groups_attached(container))
+ return -ENODEV;
+
+ tcegrp = list_first_entry(&container->group_list,
+ struct tce_iommu_group, next);
+ table_group = iommu_group_get_iommudata(tcegrp->grp);
if (!table_group)
return -ENODEV;
- tbl = &table_group->tables[0];
- locked = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
+ /*
+ * We do not allow enabling a container whose group has no
+ * default 32bit DMA window as then there is no way to know
+ * how much we should increment the locked_vm counter.
+ */
+ if (!table_group->tce32_size)
+ return -EPERM;
+
+ locked = table_group->tce32_size >> PAGE_SHIFT;
ret = try_increment_locked_vm(locked);
if (ret)
return ret;
@@ -266,6 +281,8 @@ static void *tce_iommu_open(unsigned long arg)
return ERR_PTR(-ENOMEM);
mutex_init(&container->lock);
+ INIT_LIST_HEAD_RCU(&container->group_list);
+
container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
return container;
@@ -274,15 +291,35 @@ static void *tce_iommu_open(unsigned long arg)
static int tce_iommu_clear(struct tce_container *container,
struct iommu_table *tbl,
unsigned long entry, unsigned long pages);
+static void tce_iommu_free_table(struct iommu_table *tbl);
static void tce_iommu_release(void *iommu_data)
{
struct tce_container *container = iommu_data;
+ struct iommu_table_group *table_group;
+ int i;
+ struct tce_iommu_group *tcegrp;
- WARN_ON(container->grp);
+ while (tce_groups_attached(container)) {
+ tcegrp = list_first_entry(&container->group_list,
+ struct tce_iommu_group, next);
+ table_group = iommu_group_get_iommudata(tcegrp->grp);
+ tce_iommu_detach_group(iommu_data, tcegrp->grp);
+ }
- if (container->grp)
- tce_iommu_detach_group(iommu_data, container->grp);
+ /* Free tables */
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+ struct iommu_table *tbl = &container->tables[i];
+
+ if (!tbl->it_size)
+ continue;
+
+ tce_iommu_clear(container, tbl,
+ tbl->it_offset, tbl->it_size);
+
+ if (tbl->it_ops && tbl->it_ops->free)
+ tce_iommu_free_table(tbl);
+ }
tce_iommu_disable(container);
@@ -498,6 +535,44 @@ static long tce_iommu_build_v2(struct tce_container *container,
return ret;
}
+static long tce_iommu_create_table(struct iommu_table_group *table_group,
+ int num,
+ __u32 page_shift,
+ __u64 window_size,
+ __u32 levels,
+ struct iommu_table *tbl)
+{
+ long ret;
+ unsigned long table_size;
+
+ table_size = table_group->ops->get_table_size(page_shift, window_size,
+ levels) >> PAGE_SHIFT;
+
+ ret = try_increment_locked_vm(table_size);
+ if (ret)
+ return ret;
+
+ ret = table_group->ops->create_table(table_group, num,
+ page_shift, window_size, levels, tbl);
+
+ if (ret)
+ decrement_locked_vm(table_size);
+
+ return ret;
+}
+
+static void tce_iommu_free_table(struct iommu_table *tbl)
+{
+ unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
+
+ if (!tbl->it_allocated_size)
+ return;
+
+ tbl->it_ops->free(tbl);
+ decrement_locked_vm(pages);
+ memset(tbl, 0, sizeof(*tbl));
+}
+
static long tce_iommu_ioctl(void *iommu_data,
unsigned int cmd, unsigned long arg)
{
@@ -521,16 +596,17 @@ static long tce_iommu_ioctl(void *iommu_data,
case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
struct vfio_iommu_spapr_tce_info info;
- struct iommu_table *tbl;
+ struct tce_iommu_group *tcegrp;
struct iommu_table_group *table_group;
- if (WARN_ON(!container->grp))
+ if (!tce_groups_attached(container))
return -ENXIO;
- table_group = iommu_group_get_iommudata(container->grp);
+ tcegrp = list_first_entry(&container->group_list,
+ struct tce_iommu_group, next);
+ table_group = iommu_group_get_iommudata(tcegrp->grp);
- tbl = &table_group->tables[0];
- if (WARN_ON_ONCE(!tbl))
+ if (!table_group)
return -ENXIO;
minsz = offsetofend(struct vfio_iommu_spapr_tce_info,
@@ -542,9 +618,9 @@ static long tce_iommu_ioctl(void *iommu_data,
if (info.argsz < minsz)
return -EINVAL;
- info.dma32_window_start = tbl->it_offset << tbl->it_page_shift;
- info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
info.flags = 0;
+ info.dma32_window_start = table_group->tce32_start;
+ info.dma32_window_size = table_group->tce32_size;
if (copy_to_user((void __user *)arg, &info, minsz))
return -EFAULT;
@@ -721,12 +797,20 @@ static long tce_iommu_ioctl(void *iommu_data,
tce_iommu_disable(container);
mutex_unlock(&container->lock);
return 0;
- case VFIO_EEH_PE_OP:
- if (!container->grp)
- return -ENODEV;
- return vfio_spapr_iommu_eeh_ioctl(container->grp,
- cmd, arg);
+ case VFIO_EEH_PE_OP: {
+ struct tce_iommu_group *tcegrp;
+
+ ret = 0;
+ list_for_each_entry(tcegrp, &container->group_list, next) {
+ ret = vfio_spapr_iommu_eeh_ioctl(tcegrp->grp,
+ cmd, arg);
+ if (ret)
+ return ret;
+ }
+ return ret;
+ }
+
}
return -ENOTTY;
@@ -735,63 +819,111 @@ static long tce_iommu_ioctl(void *iommu_data,
static int tce_iommu_attach_group(void *iommu_data,
struct iommu_group *iommu_group)
{
- int ret;
+ int ret, i;
struct tce_container *container = iommu_data;
struct iommu_table_group *table_group;
+ struct tce_iommu_group *tcegrp = NULL;
+ bool first_group = !tce_groups_attached(container);
mutex_lock(&container->lock);
/* pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
iommu_group_id(iommu_group), iommu_group); */
- if (container->grp) {
- pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
- iommu_group_id(container->grp),
- iommu_group_id(iommu_group));
- ret = -EBUSY;
- goto unlock_exit;
- }
-
- if (container->enabled) {
- pr_err("tce_vfio: attaching group #%u to enabled container\n",
- iommu_group_id(iommu_group));
- ret = -EBUSY;
- goto unlock_exit;
- }
-
table_group = iommu_group_get_iommudata(iommu_group);
- if (!table_group) {
- ret = -ENXIO;
+
+ if (!first_group && (!table_group->ops ||
+ !table_group->ops->set_ownership)) {
+ ret = -EBUSY;
+ goto unlock_exit;
+ }
+
+ /* Check if new group has the same iommu_ops (i.e. compatible) */
+ list_for_each_entry(tcegrp, &container->group_list, next) {
+ struct iommu_table_group *table_group_tmp;
+
+ if (tcegrp->grp == iommu_group) {
+ pr_warn("tce_vfio: Group %d is already attached\n",
+ iommu_group_id(iommu_group));
+ ret = -EBUSY;
+ goto unlock_exit;
+ }
+ table_group_tmp = iommu_group_get_iommudata(tcegrp->grp);
+ if (table_group_tmp->ops != table_group->ops) {
+ pr_warn("tce_vfio: Group %d is incompatible with group %d\n",
+ iommu_group_id(iommu_group),
+ iommu_group_id(tcegrp->grp));
+ ret = -EPERM;
+ goto unlock_exit;
+ }
+ }
+
+ tcegrp = kzalloc(sizeof(*tcegrp), GFP_KERNEL);
+ if (!tcegrp) {
+ ret = -ENOMEM;
goto unlock_exit;
}
if (!table_group->ops || !table_group->ops->set_ownership) {
ret = iommu_take_ownership(table_group);
+ if (!ret)
+ container->tables[0] = table_group->tables[0];
} else if (!table_group->ops->create_table ||
!table_group->ops->set_window) {
WARN_ON_ONCE(1);
ret = -EFAULT;
} else {
+ table_group->ops->set_ownership(table_group, true);
/*
- * Disable iommu bypass, otherwise the user can DMA to all of
- * our physical memory via the bypass window instead of just
- * the pages that has been explicitly mapped into the iommu
+ * If this is the first group attached, check if any window has
+ * been created already and create one if none.
*/
- struct iommu_table tbltmp = { 0 }, *tbl = &tbltmp;
-
- table_group->ops->set_ownership(table_group, true);
- ret = table_group->ops->create_table(table_group, 0,
- IOMMU_PAGE_SHIFT_4K,
- table_group->tce32_size, 1, tbl);
- if (!ret)
- ret = table_group->ops->set_window(table_group, 0, tbl);
+ if (first_group) {
+ bool found = false;
+
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+ if (!container->tables[i].it_size)
+ continue;
+
+ found = true;
+ break;
+ }
+ if (!found) {
+ struct iommu_table *tbl = &container->tables[0];
+
+ ret = tce_iommu_create_table(
+ table_group, 0,
+ IOMMU_PAGE_SHIFT_4K,
+ table_group->tce32_size, 1,
+ tbl);
+ if (ret)
+ goto unlock_exit;
+ }
+ }
+
+ /* Set all windows to the new group */
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+ struct iommu_table *tbl = &container->tables[i];
+
+ if (!tbl->it_size)
+ continue;
+
+ /* Program this window into the new group */
+ ret = table_group->ops->set_window(table_group, i, tbl);
+ if (ret)
+ break;
+ }
}
if (ret)
goto unlock_exit;
- container->grp = iommu_group;
+ tcegrp->grp = iommu_group;
+ list_add(&tcegrp->next, &container->group_list);
unlock_exit:
+ if (ret && tcegrp)
+ kfree(tcegrp);
+
mutex_unlock(&container->lock);
return ret;
@@ -802,25 +934,27 @@ static void tce_iommu_detach_group(void *iommu_data,
{
struct tce_container *container = iommu_data;
struct iommu_table_group *table_group;
+ struct tce_iommu_group *tcegrp, *tcetmp;
long i;
+ bool found = false;
mutex_lock(&container->lock);
- if (iommu_group != container->grp) {
- pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
- iommu_group_id(iommu_group),
- iommu_group_id(container->grp));
+
+ list_for_each_entry_safe(tcegrp, tcetmp, &container->group_list, next) {
+ if (tcegrp->grp != iommu_group)
+ continue;
+ found = true;
+ break;
+ }
+
+ if (!found) {
+ pr_warn("tce_vfio: detaching unattached group #%u\n",
+ iommu_group_id(iommu_group));
goto unlock_exit;
}
- if (container->enabled) {
- pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
- iommu_group_id(container->grp));
- tce_iommu_disable(container);
- }
-
- /* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
- iommu_group_id(iommu_group), iommu_group); */
- container->grp = NULL;
+ list_del(&tcegrp->next);
+ kfree(tcegrp);
table_group = iommu_group_get_iommudata(iommu_group);
BUG_ON(!table_group);
@@ -828,7 +962,7 @@ static void tce_iommu_detach_group(void *iommu_data,
/* Kernel owns the device now, we can restore bypass */
if (!table_group->ops || !table_group->ops->set_ownership) {
for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
- struct iommu_table *tbl = &table_group->tables[i];
+ struct iommu_table *tbl = &container->tables[i];
if (!tbl->it_size)
continue;
@@ -837,20 +971,15 @@ static void tce_iommu_detach_group(void *iommu_data,
goto unlock_exit;
tce_iommu_clear(container, tbl,
tbl->it_offset, tbl->it_size);
+
+ memset(tbl, 0, sizeof(*tbl));
}
iommu_release_ownership(table_group);
} else if (!table_group->ops->unset_window) {
WARN_ON_ONCE(1);
} else {
for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
- struct iommu_table *tbl = &table_group->tables[i];
-
table_group->ops->unset_window(table_group, i);
- tce_iommu_clear(container, tbl,
- tbl->it_offset, tbl->it_size);
-
- if (tbl->it_ops->free)
- tbl->it_ops->free(tbl);
}
table_group->ops->set_ownership(table_group, false);
--
2.0.0
This adds ioctls to create and remove DMA windows.
sPAPR defines a Dynamic DMA windows capability which allows
para-virtualized guests to create additional DMA windows on a PCI bus.
Existing Linux kernels use this new window to map the entire guest
memory and switch to direct DMA operations, saving time on map/unmap
requests which would otherwise happen in big numbers.
This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and
VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows.
Up to 2 windows are currently supported by the hardware and by this driver.
This changes the VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional
information such as the number of supported windows and the maximum number
of TCE table levels.
DDW is added as a capability, not as a SPAPR TCE IOMMU v2 unique feature,
as we still want to support v2 on platforms which cannot do DDW, for
the sake of TCE acceleration in KVM (coming soon).
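The levels parameter exists because a single-level table for a big window needs a large physically contiguous allocation. Assuming 8 bytes per TCE entry (the usual size on POWER; the helper below is an illustration, not driver code), the cost is easy to estimate:

```c
#include <stdint.h>

/* Bytes needed for a single-level TCE table covering window_size
 * bytes with pages of (1 << page_shift) bytes, assuming each TCE
 * entry is 8 bytes. */
static uint64_t toy_tce_table_bytes(uint64_t window_size, uint32_t page_shift)
{
	uint64_t entries = window_size >> page_shift;

	return entries * 8;
}
```

For example, a 64GB window with 64K pages needs 1M entries, i.e. an 8MB contiguous allocation at a single level; splitting the table into several levels keeps each chunk small enough for the kernel allocator.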
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v7:
* s/VFIO_IOMMU_INFO_DDW/VFIO_IOMMU_SPAPR_INFO_DDW/
* fixed typos in and updated vfio.txt
* fixed VFIO_IOMMU_SPAPR_TCE_GET_INFO handler
* moved ddw properties to vfio_iommu_spapr_tce_ddw_info
v6:
* added explicit VFIO_IOMMU_INFO_DDW flag to vfio_iommu_spapr_tce_info,
it used to be page mask flags from platform code
* added explicit pgsizes field
* added cleanup if tce_iommu_create_window() failed in a middle
* added checks for callbacks in tce_iommu_create_window and remove those
from tce_iommu_remove_window when it is too late to test anyway
* spapr_tce_find_free_table returns sensible error code now
* updated description of VFIO_IOMMU_SPAPR_TCE_CREATE/
VFIO_IOMMU_SPAPR_TCE_REMOVE
v4:
* moved code to tce_iommu_create_window()/tce_iommu_remove_window()
helpers
* added docs
---
Documentation/vfio.txt | 19 ++++
arch/powerpc/include/asm/iommu.h | 2 +-
drivers/vfio/vfio_iommu_spapr_tce.c | 196 +++++++++++++++++++++++++++++++++++-
include/uapi/linux/vfio.h | 61 ++++++++++-
4 files changed, 273 insertions(+), 5 deletions(-)
diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 7dcf2b5..8b1ec51 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -452,6 +452,25 @@ address is from pre-registered range.
This separation helps in optimizing DMA for guests.
+6) The sPAPR specification allows guests to have additional DMA windows on
+a PCI bus with a variable page size. Two ioctls have been added to support
+this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE.
+The platform has to support the functionality or an error will be returned
+to the userspace. The existing hardware supports up to 2 DMA windows: one
+is 2GB long, uses 4K pages and is called the "default 32bit window"; the
+other can be as big as the entire RAM, may use a different page size and is
+optional - guests create it at run time if the guest driver supports 64bit DMA.
+
+VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and
+a number of TCE table levels (useful when a TCE table is going to be so big
+that the kernel may not be able to allocate enough physically contiguous memory).
+It creates a new window in the available slot and returns the bus address where
+the new window starts. Due to a hardware limitation, the user space cannot choose
+the location of DMA windows.
+
+VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window
+and removes it.
+
-------------------------------------------------------------------------------
[1] VFIO was originally an acronym for "Virtual Function I/O" in its
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 9027b9e..1db774c0 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -147,7 +147,7 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
int nid);
-#define IOMMU_TABLE_GROUP_MAX_TABLES 1
+#define IOMMU_TABLE_GROUP_MAX_TABLES 2
struct iommu_table_group;
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index a9520a8..ac667a5 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -535,6 +535,20 @@ static long tce_iommu_build_v2(struct tce_container *container,
return ret;
}
+static int spapr_tce_find_free_table(struct tce_container *container)
+{
+ int i;
+
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+ struct iommu_table *tbl = &container->tables[i];
+
+ if (!tbl->it_size)
+ return i;
+ }
+
+ return -ENOSPC;
+}
+
static long tce_iommu_create_table(struct iommu_table_group *table_group,
int num,
__u32 page_shift,
@@ -573,11 +587,114 @@ static void tce_iommu_free_table(struct iommu_table *tbl)
memset(tbl, 0, sizeof(*tbl));
}
+static long tce_iommu_create_window(struct tce_container *container,
+ __u32 page_shift, __u64 window_size, __u32 levels,
+ __u64 *start_addr)
+{
+ struct tce_iommu_group *tcegrp;
+ struct iommu_table_group *table_group;
+ struct iommu_table *tbl;
+ long ret, num;
+
+ num = spapr_tce_find_free_table(container);
+ if (num < 0)
+ return num;
+
+ tbl = &container->tables[num];
+
+ /* Get the first group for ops::create_table */
+ tcegrp = list_first_entry(&container->group_list,
+ struct tce_iommu_group, next);
+ table_group = iommu_group_get_iommudata(tcegrp->grp);
+ if (!table_group)
+ return -EFAULT;
+
+ if (!(table_group->pgsizes & (1ULL << page_shift)))
+ return -EINVAL;
+
+ if (!table_group->ops->set_window || !table_group->ops->unset_window ||
+ !table_group->ops->get_table_size ||
+ !table_group->ops->create_table)
+ return -EPERM;
+
+ /* Create TCE table */
+ ret = tce_iommu_create_table(table_group, num,
+ page_shift, window_size, levels, tbl);
+ if (ret)
+ return ret;
+
+ BUG_ON(!tbl->it_ops->free);
+
+ /*
+ * Program the table to every group.
+ * Groups have been tested for compatibility at the attach time.
+ */
+ list_for_each_entry(tcegrp, &container->group_list, next) {
+ table_group = iommu_group_get_iommudata(tcegrp->grp);
+
+ ret = table_group->ops->set_window(table_group, num, tbl);
+ if (ret)
+ goto unset_exit;
+ }
+
+ /* Return start address assigned by platform in create_table() */
+ *start_addr = tbl->it_offset << tbl->it_page_shift;
+
+ return 0;
+
+unset_exit:
+ list_for_each_entry(tcegrp, &container->group_list, next) {
+ table_group = iommu_group_get_iommudata(tcegrp->grp);
+ table_group->ops->unset_window(table_group, num);
+ }
+ tce_iommu_free_table(tbl);
+
+ return ret;
+}
+
+static long tce_iommu_remove_window(struct tce_container *container,
+ __u64 start_addr)
+{
+ struct iommu_table_group *table_group = NULL;
+ struct iommu_table *tbl;
+ struct tce_iommu_group *tcegrp;
+ int num;
+
+ tbl = spapr_tce_find_table(container, start_addr);
+ if (!tbl)
+ return -EINVAL;
+
+ /* Detach groups from IOMMUs */
+ num = tbl - container->tables;
+ list_for_each_entry(tcegrp, &container->group_list, next) {
+ table_group = iommu_group_get_iommudata(tcegrp->grp);
+
+ /*
+ * SPAPR TCE IOMMU exposes the default DMA window to
+ * the guest via dma32_window_start/size of
+ * VFIO_IOMMU_SPAPR_TCE_GET_INFO. Some platforms allow
+ * the userspace to remove this window, some do not so
+ * here we check for the platform capability.
+ */
+ if (!table_group->ops || !table_group->ops->unset_window)
+ return -EPERM;
+
+ if (container->tables[num].it_size)
+ table_group->ops->unset_window(table_group, num);
+ }
+
+ /* Free table */
+ tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
+ tce_iommu_free_table(tbl);
+
+ return 0;
+}
+
static long tce_iommu_ioctl(void *iommu_data,
unsigned int cmd, unsigned long arg)
{
struct tce_container *container = iommu_data;
- unsigned long minsz;
+ unsigned long minsz, ddwsz;
long ret;
switch (cmd) {
@@ -621,6 +738,20 @@ static long tce_iommu_ioctl(void *iommu_data,
info.flags = 0;
info.dma32_window_start = table_group->tce32_start;
info.dma32_window_size = table_group->tce32_size;
+ memset(&info.ddw, 0, sizeof(info.ddw));
+
+ if (table_group->max_dynamic_windows_supported) {
+ info.flags |= VFIO_IOMMU_SPAPR_INFO_DDW;
+ info.ddw.pgsizes = table_group->pgsizes;
+ info.ddw.max_dynamic_windows_supported =
+ table_group->max_dynamic_windows_supported;
+ info.ddw.levels = table_group->max_levels;
+ }
+
+ ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info, ddw);
+
+ if (info.argsz >= ddwsz)
+ minsz = ddwsz;
if (copy_to_user((void __user *)arg, &info, minsz))
return -EFAULT;
@@ -811,6 +942,69 @@ static long tce_iommu_ioctl(void *iommu_data,
return ret;
}
+ case VFIO_IOMMU_SPAPR_TCE_CREATE: {
+ struct vfio_iommu_spapr_tce_create create;
+
+ if (!container->v2)
+ break;
+
+ if (!tce_groups_attached(container))
+ return -ENXIO;
+
+ minsz = offsetofend(struct vfio_iommu_spapr_tce_create,
+ start_addr);
+
+ if (copy_from_user(&create, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (create.argsz < minsz)
+ return -EINVAL;
+
+ if (create.flags)
+ return -EINVAL;
+
+ mutex_lock(&container->lock);
+
+ ret = tce_iommu_create_window(container, create.page_shift,
+ create.window_size, create.levels,
+ &create.start_addr);
+
+ mutex_unlock(&container->lock);
+
+ if (!ret && copy_to_user((void __user *)arg, &create, minsz))
+ ret = -EFAULT;
+
+ return ret;
+ }
+ case VFIO_IOMMU_SPAPR_TCE_REMOVE: {
+ struct vfio_iommu_spapr_tce_remove remove;
+
+ if (!container->v2)
+ break;
+
+ if (!tce_groups_attached(container))
+ return -ENXIO;
+
+ minsz = offsetofend(struct vfio_iommu_spapr_tce_remove,
+ start_addr);
+
+ if (copy_from_user(&remove, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (remove.argsz < minsz)
+ return -EINVAL;
+
+ if (remove.flags)
+ return -EINVAL;
+
+ mutex_lock(&container->lock);
+
+ ret = tce_iommu_remove_window(container, remove.start_addr);
+
+ mutex_unlock(&container->lock);
+
+ return ret;
+ }
}
return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index fbc5286..7b0ff00 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -443,6 +443,23 @@ struct vfio_iommu_type1_dma_unmap {
/* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
/*
+ * The SPAPR TCE DDW info struct provides the information about
+ * the details of Dynamic DMA window capability.
+ *
+ * @pgsizes contains a page size bitmask, 4K/64K/16M are supported.
+ * @max_dynamic_windows_supported tells the maximum number of windows
+ * which the platform can create.
+ * @levels tells the maximum number of levels in multi-level IOMMU tables;
+ * this allows splitting a table into smaller chunks which reduces
+ * the amount of physically contiguous memory required for the table.
+ */
+struct vfio_iommu_spapr_tce_ddw_info {
+ __u64 pgsizes; /* Bitmap of supported page sizes */
+ __u32 max_dynamic_windows_supported;
+ __u32 levels;
+};
+
+/*
* The SPAPR TCE info struct provides the information about the PCI bus
* address ranges available for DMA, these values are programmed into
* the hardware so the guest has to know that information.
@@ -452,14 +469,17 @@ struct vfio_iommu_type1_dma_unmap {
* addresses too so the window works as a filter rather than an offset
* for IOVA addresses.
*
- * A flag will need to be added if other page sizes are supported,
- * so as defined here, it is always 4k.
+ * Flags supported:
+ * - VFIO_IOMMU_SPAPR_INFO_DDW: informs the userspace that dynamic DMA windows
+ * (DDW) support is present. @ddw is only supported when DDW is present.
*/
struct vfio_iommu_spapr_tce_info {
__u32 argsz;
- __u32 flags; /* reserved for future use */
+ __u32 flags;
+#define VFIO_IOMMU_SPAPR_INFO_DDW (1 << 0) /* DDW supported */
__u32 dma32_window_start; /* 32 bit window start (bytes) */
__u32 dma32_window_size; /* 32 bit window size (bytes) */
+ struct vfio_iommu_spapr_tce_ddw_info ddw;
};
#define VFIO_IOMMU_SPAPR_TCE_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
@@ -520,6 +540,41 @@ struct vfio_iommu_spapr_register_memory {
*/
#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 18)
+/**
+ * VFIO_IOMMU_SPAPR_TCE_CREATE - _IOWR(VFIO_TYPE, VFIO_BASE + 19, struct vfio_iommu_spapr_tce_create)
+ *
+ * Creates an additional TCE table and programs it (sets a new DMA window)
+ * to every IOMMU group in the container. It receives page shift, window
+ * size and number of levels in the TCE table being created.
+ *
+ * On success, it returns the PCI bus offset at which the new DMA window starts.
+ */
+struct vfio_iommu_spapr_tce_create {
+ __u32 argsz;
+ __u32 flags;
+ /* in */
+ __u32 page_shift;
+ __u64 window_size;
+ __u32 levels;
+ /* out */
+ __u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_CREATE _IO(VFIO_TYPE, VFIO_BASE + 19)
+
+/**
+ * VFIO_IOMMU_SPAPR_TCE_REMOVE - _IOW(VFIO_TYPE, VFIO_BASE + 20, struct vfio_iommu_spapr_tce_remove)
+ *
+ * Unprograms a TCE table from all groups in the container and destroys it.
+ * It receives a PCI bus offset as a window id.
+ */
+struct vfio_iommu_spapr_tce_remove {
+ __u32 argsz;
+ __u32 flags;
+ /* in */
+ __u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_REMOVE _IO(VFIO_TYPE, VFIO_BASE + 20)
+
/* ***************************************************************** */
#endif /* _UAPIVFIO_H */
--
2.0.0
On Fri, 2015-04-10 at 16:30 +1000, Alexey Kardashevskiy wrote:
> This adds missing locks in iommu_take_ownership()/
> iommu_release_ownership().
>
> This marks all pages busy in iommu_table::it_map in order to catch
> errors if there is an attempt to use this table while ownership over it
> is taken.
>
> This only clears TCE content if there is no page marked busy in it_map.
> Clearing must be done outside of the table locks as iommu_clear_tce()
> called from iommu_clear_tces_and_put_pages() does this.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> Changes:
> v5:
> * do not store bit#0 value, it has to be set for zero-based table
> anyway
> * removed test_and_clear_bit
> ---
> arch/powerpc/kernel/iommu.c | 26 ++++++++++++++++++++++----
> 1 file changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 7d6089b..068fe4ff 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -1052,17 +1052,28 @@ EXPORT_SYMBOL_GPL(iommu_tce_build);
>
> static int iommu_table_take_ownership(struct iommu_table *tbl)
> {
> - unsigned long sz = (tbl->it_size + 7) >> 3;
> + unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
> + int ret = 0;
> +
> + spin_lock_irqsave(&tbl->large_pool.lock, flags);
> + for (i = 0; i < tbl->nr_pools; i++)
> + spin_lock(&tbl->pools[i].lock);
>
> if (tbl->it_offset == 0)
> clear_bit(0, tbl->it_map);
>
> if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
> pr_err("iommu_tce: it_map is not empty");
> - return -EBUSY;
> + ret = -EBUSY;
This error is never returned.
> + if (tbl->it_offset == 0)
> + set_bit(0, tbl->it_map);
> + } else {
> + memset(tbl->it_map, 0xff, sz);
> }
>
> - memset(tbl->it_map, 0xff, sz);
> + for (i = 0; i < tbl->nr_pools; i++)
> + spin_unlock(&tbl->pools[i].lock);
> + spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
>
> return 0;
> }
> @@ -1095,7 +1106,11 @@ EXPORT_SYMBOL_GPL(iommu_take_ownership);
>
> static void iommu_table_release_ownership(struct iommu_table *tbl)
> {
> - unsigned long sz = (tbl->it_size + 7) >> 3;
> + unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
> +
> + spin_lock_irqsave(&tbl->large_pool.lock, flags);
> + for (i = 0; i < tbl->nr_pools; i++)
> + spin_lock(&tbl->pools[i].lock);
>
> memset(tbl->it_map, 0, sz);
>
> @@ -1103,6 +1118,9 @@ static void iommu_table_release_ownership(struct iommu_table *tbl)
> if (tbl->it_offset == 0)
> set_bit(0, tbl->it_map);
>
> + for (i = 0; i < tbl->nr_pools; i++)
> + spin_unlock(&tbl->pools[i].lock);
> + spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
> }
>
> extern void iommu_release_ownership(struct iommu_table_group *table_group)
On Fri, 2015-04-10 at 16:31 +1000, Alexey Kardashevskiy wrote:
> In order to support memory pre-registration, we need a way to track
> the use of every registered memory region and only allow unregistration
> if a region is not in use anymore. So we need a way to tell from what
> region the just cleared TCE was from.
>
> This adds a userspace view of the TCE table into iommu_table struct.
> It contains userspace address, one per TCE entry. The table is only
> allocated when the ownership over an IOMMU group is taken which means
> it is only used from outside of the powernv code (such as VFIO).
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> Changes:
> v8:
> * added ENOMEM on failed vzalloc()
> ---
> arch/powerpc/include/asm/iommu.h | 6 ++++++
> arch/powerpc/kernel/iommu.c | 9 +++++++++
> arch/powerpc/platforms/powernv/pci-ioda.c | 25 ++++++++++++++++++++++++-
> 3 files changed, 39 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 2c08c91..a768a4d 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -106,9 +106,15 @@ struct iommu_table {
> unsigned long *it_map; /* A simple allocation bitmap for now */
> unsigned long it_page_shift;/* table iommu page size */
> struct iommu_table_group *it_group;
> + unsigned long *it_userspace; /* userspace view of the table */
> struct iommu_table_ops *it_ops;
> };
>
> +#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
> + ((tbl)->it_userspace ? \
> + &((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \
> + NULL)
> +
> /* Pure 2^n version of get_order */
> static inline __attribute_const__
> int get_iommu_order(unsigned long size, struct iommu_table *tbl)
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 0bcd988..833b396 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -38,6 +38,7 @@
> #include <linux/pci.h>
> #include <linux/iommu.h>
> #include <linux/sched.h>
> +#include <linux/vmalloc.h>
> #include <asm/io.h>
> #include <asm/prom.h>
> #include <asm/iommu.h>
> @@ -1069,6 +1070,11 @@ static int iommu_table_take_ownership(struct iommu_table *tbl)
> spin_unlock(&tbl->pools[i].lock);
> spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
>
> + BUG_ON(tbl->it_userspace);
> + tbl->it_userspace = vzalloc(sizeof(*tbl->it_userspace) * tbl->it_size);
> + if (!tbl->it_userspace)
> + return -ENOMEM;
> +
It would really make more sense from an error path perspective in this
function if the vzalloc were done first. Doing it at the end, you need
to consider whether anything previous needs to be un-done. Also note
that this -ENOMEM return clobbers the -EBUSY if you fix 15/31 to return
"ret".
> return 0;
> }
>
> @@ -1102,6 +1108,9 @@ static void iommu_table_release_ownership(struct iommu_table *tbl)
> {
> unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
>
> + vfree(tbl->it_userspace);
> + tbl->it_userspace = NULL;
> +
> spin_lock_irqsave(&tbl->large_pool.lock, flags);
> for (i = 0; i < tbl->nr_pools; i++)
> spin_lock(&tbl->pools[i].lock);
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 751aeab..3ac523d 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -26,6 +26,7 @@
> #include <linux/iommu.h>
> #include <linux/mmzone.h>
> #include <linux/sizes.h>
> +#include <linux/vmalloc.h>
>
> #include <asm/mmzone.h>
> #include <asm/sections.h>
> @@ -1469,6 +1470,9 @@ static void pnv_pci_free_table(struct iommu_table *tbl)
> if (!tbl->it_size)
> return;
>
> + vfree(tbl->it_userspace);
> + tbl->it_userspace = NULL;
> +
> pnv_free_tce_table(tbl->it_base, size, tbl->it_indirect_levels);
> iommu_reset_table(tbl, "ioda2");
> }
> @@ -1656,9 +1660,28 @@ static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
> pnv_pci_ioda2_set_bypass(pe, !enable);
> }
>
> +static long pnv_pci_ioda2_create_table_with_uas(
> + struct iommu_table_group *table_group,
> + int num, __u32 page_shift, __u64 window_size, __u32 levels,
> + struct iommu_table *tbl)
> +{
> + long ret = pnv_pci_ioda2_create_table(table_group, num,
> + page_shift, window_size, levels, tbl);
> +
> + if (ret)
> + return ret;
> +
> + BUG_ON(tbl->it_userspace);
> + tbl->it_userspace = vzalloc(sizeof(*tbl->it_userspace) * tbl->it_size);
> + if (!tbl->it_userspace)
> + return -ENOMEM;
So all of the work done in pnv_pci_ioda2_create_table() can just be
ignored -- we undo nothing and return -ENOMEM? Again, doing the
allocation first would make a lot more sense than slapping on an -ENOMEM
and calling the error handling "good".
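Alternatively, if the allocation has to come second, the error path should unwind the table it just created. A minimal sketch (hypothetical `toy_*` names; `calloc()`/`free()` stand in for the kernel's allocators and for `pnv_pci_ioda2_create_table()`/`pnv_pci_free_table()`):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Reduced model of the table being created and freed. */
struct toy_table {
    unsigned long it_size;
    void *it_base;
    unsigned long *it_userspace;
};

static long toy_create_table(struct toy_table *tbl, unsigned long entries)
{
    tbl->it_base = calloc(entries, 8);   /* 8 bytes per TCE */
    if (!tbl->it_base)
        return -ENOMEM;
    tbl->it_size = entries;
    return 0;
}

static void toy_free_table(struct toy_table *tbl)
{
    free(tbl->it_base);
    tbl->it_base = NULL;
    tbl->it_size = 0;
}

/* If the userspace-view allocation fails, undo the table creation
 * instead of leaking it. */
static long toy_create_table_with_uas(struct toy_table *tbl,
                                      unsigned long entries)
{
    long ret = toy_create_table(tbl, entries);

    if (ret)
        return ret;

    tbl->it_userspace = calloc(tbl->it_size, sizeof(*tbl->it_userspace));
    if (!tbl->it_userspace) {
        toy_free_table(tbl);             /* unwind: don't leak the TCE table */
        return -ENOMEM;
    }
    return 0;
}
```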
> +
> + return 0;
> +}
> +
> static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
> .set_ownership = pnv_ioda2_set_ownership,
> - .create_table = pnv_pci_ioda2_create_table,
> + .create_table = pnv_pci_ioda2_create_table_with_uas,
> .set_window = pnv_pci_ioda2_set_window,
> .unset_window = pnv_pci_ioda2_unset_window,
> };
On Fri, 2015-04-10 at 16:30 +1000, Alexey Kardashevskiy wrote:
> This enables sPAPR defined feature called Dynamic DMA windows (DDW).
>
> Each Partitionable Endpoint (IOMMU group) has an address range on a PCI bus
> where devices are allowed to do DMA. These ranges are called DMA windows.
> By default, there is a single DMA window, 1 or 2GB big, mapped at zero
> on a PCI bus.
>
> High-speed devices may suffer from the limited size of the window.
> The recent host kernels use a TCE bypass window on POWER8 CPU which implements
> direct PCI bus address range mapping (with offset of 1<<59) to the host memory.
>
> For guests, PAPR defines a DDW RTAS API which allows pseries guests
> to query the hypervisor about DDW support and capabilities (page size mask
> for now). A pseries guest may request additional DMA windows (besides
> the default one) using this RTAS API.
> The existing pseries Linux guests request an additional window as big as
> the guest RAM and map the entire guest window which effectively creates
> direct mapping of the guest memory to a PCI bus.
>
> The multiple DMA windows feature is supported by POWER7/POWER8 CPUs; however
> this patchset only adds support for POWER8 as TCE tables are implemented
> in POWER7 in a quite different way and POWER7 is not the highest priority.
>
> This patchset reworks PPC64 IOMMU code and adds necessary structures
> to support big windows.
>
> Once a Linux guest discovers the presence of DDW, it does:
> 1. query the hypervisor about the number of available windows and page size masks;
> 2. create a window with the biggest possible page size (today 4K/64K/16M);
> 3. map the entire guest RAM via H_PUT_TCE* hypercalls;
> 4. switch dma_ops to direct_dma_ops on the selected PE.
>
> Once this is done, H_PUT_TCE is not called anymore for 64bit devices and
> the guest does not waste time on DMA map/unmap operations.
>
> Note that 32bit devices won't use DDW and will keep using the default
> DMA window so KVM optimizations will be required (to be posted later).
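The sizing consequence of the scheme above can be sketched with some back-of-the-envelope arithmetic (illustrative only, not code from the patchset): a window covering all of guest RAM needs one 8-byte TCE per IOMMU page, so the bigger page sizes from step 2 shrink the table dramatically.

```c
#include <assert.h>

/* One TCE per IOMMU page in the window. */
static unsigned long long tce_entries(unsigned long long window_bytes,
                                      unsigned int page_shift)
{
    return window_bytes >> page_shift;
}

/* Each TCE is 8 bytes; the userspace view added by this series costs
 * a further sizeof(unsigned long) per entry on top of this. */
static unsigned long long tce_table_bytes(unsigned long long window_bytes,
                                          unsigned int page_shift)
{
    return tce_entries(window_bytes, page_shift) * 8;
}
```

For a 64GB guest: 4K pages need a 128MB table, 64K pages an 8MB table, and 16M pages only 32KB.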
>
> This is pushed to [email protected]:aik/linux.git
> + 09bb8ea...d9b711d vfio-for-github -> vfio-for-github (forced update)
>
>
> Please comment. Thank you!
>
>
> Changes:
> v8:
> * fixed a bug in error fallback in "powerpc/mmu: Add userspace-to-physical
> addresses translation cache"
> * fixed subject in "vfio: powerpc/spapr: Check that IOMMU page is fully
> contained by system page"
> * moved v2 documentation to the correct patch
> * added checks for failed vzalloc() in "powerpc/iommu: Add userspace view
> of TCE table"
>
> v7:
> * moved memory preregistration to the current process's MMU context
> * added code preventing unregistration if some pages are still mapped;
> for this, a userspace view of the table is stored in iommu_table
> * added locked_vm counting for DDW tables (including userspace view of those)
>
> v6:
> * fixed a bunch of errors in "vfio: powerpc/spapr: Support Dynamic DMA windows"
> * moved static IOMMU properties from iommu_table_group to iommu_table_group_ops
>
> v5:
> * added SPAPR_TCE_IOMMU_v2 to tell the userspace that there is a memory
> pre-registration feature
> * added backward compatibility
> * renamed few things (mostly powerpc_iommu -> iommu_table_group)
>
> v4:
> * moved patches around to have VFIO and PPC patches separated as much as
> possible
> * now works with the existing upstream QEMU
>
> v3:
> * redesigned the whole thing
> * multiple IOMMU groups per PHB -> one PHB is needed for VFIO in the guest ->
> no problems with locked_vm counting; also we save memory on actual tables
> * guest RAM preregistration is required for DDW
> * PEs (IOMMU groups) are passed to VFIO with no DMA windows at all so
> we do not bother with iommu_table::it_map anymore
> * added multilevel TCE tables support to support really huge guests
>
> v2:
> * added missing __pa() in "powerpc/powernv: Release replaced TCE"
> * reposted to make some noise
>
>
>
>
> Alexey Kardashevskiy (31):
> vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU
> driver
> vfio: powerpc/spapr: Do cleanup when releasing the group
> vfio: powerpc/spapr: Check that IOMMU page is fully contained by
> system page
> vfio: powerpc/spapr: Use it_page_size
> vfio: powerpc/spapr: Move locked_vm accounting to helpers
> vfio: powerpc/spapr: Disable DMA mappings on disabled container
> vfio: powerpc/spapr: Moving pinning/unpinning to helpers
> vfio: powerpc/spapr: Rework groups attaching
> powerpc/powernv: Do not set "read" flag if direction==DMA_NONE
> powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table
> powerpc/iommu: Introduce iommu_table_alloc() helper
> powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group
> vfio: powerpc/spapr: powerpc/iommu: Rework IOMMU ownership control
> vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework IOMMU ownership
> control
> powerpc/iommu: Fix IOMMU ownership control functions
> powerpc/powernv/ioda/ioda2: Rework tce_build()/tce_free()
> powerpc/iommu/powernv: Release replaced TCE
> powerpc/powernv/ioda2: Rework iommu_table creation
> powerpc/powernv/ioda2: Introduce
> pnv_pci_ioda2_create_table/pnc_pci_free_table
> powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window
> powerpc/iommu: Split iommu_free_table into 2 helpers
> powerpc/powernv: Implement multilevel TCE tables
> powerpc/powernv: Change prototypes to receive iommu
> powerpc/powernv/ioda: Define and implement DMA table/window management
> callbacks
> vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework ownership
> powerpc/iommu: Add userspace view of TCE table
> powerpc/iommu/ioda2: Add get_table_size() to calculate the size of
> future table
> powerpc/mmu: Add userspace-to-physical addresses translation cache
> vfio: powerpc/spapr: Register memory and define IOMMU v2
> vfio: powerpc/spapr: Support multiple groups in one container if
> possible
> vfio: powerpc/spapr: Support Dynamic DMA windows
>
> Documentation/vfio.txt | 50 +-
> arch/powerpc/include/asm/iommu.h | 111 ++-
> arch/powerpc/include/asm/machdep.h | 25 -
> arch/powerpc/include/asm/mmu-hash64.h | 3 +
> arch/powerpc/include/asm/mmu_context.h | 17 +
> arch/powerpc/kernel/iommu.c | 336 +++++----
> arch/powerpc/kernel/vio.c | 5 +
> arch/powerpc/mm/Makefile | 1 +
> arch/powerpc/mm/mmu_context_hash64.c | 6 +
> arch/powerpc/mm/mmu_context_hash64_iommu.c | 215 ++++++
> arch/powerpc/platforms/cell/iommu.c | 8 +-
> arch/powerpc/platforms/pasemi/iommu.c | 7 +-
> arch/powerpc/platforms/powernv/pci-ioda.c | 589 ++++++++++++---
> arch/powerpc/platforms/powernv/pci-p5ioc2.c | 33 +-
> arch/powerpc/platforms/powernv/pci.c | 116 ++-
> arch/powerpc/platforms/powernv/pci.h | 12 +-
> arch/powerpc/platforms/pseries/iommu.c | 55 +-
> arch/powerpc/sysdev/dart_iommu.c | 12 +-
> drivers/vfio/vfio_iommu_spapr_tce.c | 1021 ++++++++++++++++++++++++---
> include/uapi/linux/vfio.h | 88 ++-
> 20 files changed, 2218 insertions(+), 492 deletions(-)
> create mode 100644 arch/powerpc/mm/mmu_context_hash64_iommu.c
There are still some issues that need to be addressed in arch code, I've
noted them in comments for patches 15 & 26. I think I've run out of
issues for the vfio changes, so for the vfio related changes in patches
1-8,12-14,17,25,29-31:
Acked-by: Alex Williamson <[email protected]>
On Fri, Apr 10, 2015 at 04:30:43PM +1000, Alexey Kardashevskiy wrote:
> This moves page pinning (get_user_pages_fast()/put_page()) code out of
> the platform IOMMU code and puts it to VFIO IOMMU driver where it belongs
> to as the platform code does not deal with page pinning.
>
> This makes iommu_take_ownership()/iommu_release_ownership() deal with
> the IOMMU table bitmap only.
>
> This removes page unpinning from iommu_take_ownership() as the actual
> TCE table might contain garbage and doing put_page() on it is undefined
> behaviour.
>
> Besides the last part, the rest of the patch is mechanical.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Fri, Apr 10, 2015 at 04:30:44PM +1000, Alexey Kardashevskiy wrote:
> This clears the TCE table when a container is being closed as this is
> a good thing to leave the table clean before passing the ownership
> back to the host kernel.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Fri, Apr 10, 2015 at 04:30:45PM +1000, Alexey Kardashevskiy wrote:
> This checks that the TCE table page size is not bigger than the size of
> a page we just pinned and whose physical address we are going to put into the table.
>
> Otherwise the hardware gets unwanted access to physical memory between
> the end of the actual page and the end of the aligned up TCE page.
>
> Since compound_order() and compound_head() work correctly on non-huge
> pages, there is no need for additional check whether the page is huge.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
Only thing I'm not sure about is...
> + if (!tce_page_is_contained(page, tbl->it_page_shift)) {
> + ret = -EPERM;
> + break;
.. whether EPERM is the right error code.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
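The containment test under discussion reduces to a shift comparison: an IOMMU page of 2^it_page_shift bytes fits inside a (possibly compound) system page of 2^sys_page_shift bytes iff its shift is no larger. A toy version (the real tce_page_is_contained() derives sys_page_shift from PAGE_SHIFT plus compound_order() of the pinned page; `toy_page_is_contained` is an illustrative name):

```c
#include <assert.h>

/* 1 if an IOMMU page of 2^it_page_shift bytes is fully contained in a
 * system page of 2^sys_page_shift bytes, 0 otherwise. */
static int toy_page_is_contained(unsigned int sys_page_shift,
                                 unsigned int it_page_shift)
{
    return it_page_shift <= sys_page_shift;
}
```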
On Fri, Apr 10, 2015 at 04:30:47PM +1000, Alexey Kardashevskiy wrote:
> This moves locked pages accounting to helpers.
> Later they will be reused for Dynamic DMA windows (DDW).
>
> This reworks debug messages to show the current value and the limit.
>
> This stores the locked pages number in the container so when unlocking
> the iommu table pointer won't be needed. This does not have an effect
> now but it will with the multiple tables per container as then we will
> allow attaching/detaching groups on fly and we may end up having
> a container with no group attached but with the counter incremented.
>
> While we are here, update the comment explaining why RLIMIT_MEMLOCK
> might be required to be bigger than the guest RAM. This also prints
> pid of the current process in pr_warn/pr_debug.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Fri, Apr 10, 2015 at 04:30:48PM +1000, Alexey Kardashevskiy wrote:
> At the moment DMA map/unmap requests are handled irrespective of
> the container's state. This allows the user space to pin memory which
> it might not be allowed to pin.
>
> This adds checks to MAP/UNMAP that the container is enabled, otherwise
> -EPERM is returned.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Fri, Apr 10, 2015 at 04:30:49PM +1000, Alexey Kardashevskiy wrote:
> This is a pretty mechanical patch to make next patches simpler.
>
> New tce_iommu_unuse_page() helper does put_page() now but it might skip
> that after the memory registering patch applied.
>
> As we are here, this removes unnecessary checks for a value returned
> by pfn_to_page() as it cannot possibly return NULL.
>
> This moves tce_iommu_disable() later to let tce_iommu_clear() know if
> the container has been enabled because if it has not been, then
> put_page() must not be called on TCEs from the TCE table. This situation
> is not yet possible, but it will be after the KVM acceleration patchset is
> applied.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> Changes:
> v6:
> * tce_get_hva() returns hva via a pointer
> ---
> drivers/vfio/vfio_iommu_spapr_tce.c | 68 +++++++++++++++++++++++++++----------
> 1 file changed, 50 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index c137bb3..ec5ee83 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -196,7 +196,6 @@ static void tce_iommu_release(void *iommu_data)
> struct iommu_table *tbl = container->tbl;
>
> WARN_ON(tbl && !tbl->it_group);
> - tce_iommu_disable(container);
>
> if (tbl) {
> tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
> @@ -204,63 +203,96 @@ static void tce_iommu_release(void *iommu_data)
> if (tbl->it_group)
> tce_iommu_detach_group(iommu_data, tbl->it_group);
> }
> +
> + tce_iommu_disable(container);
> +
> mutex_destroy(&container->lock);
>
> kfree(container);
> }
>
> +static void tce_iommu_unuse_page(struct tce_container *container,
> + unsigned long oldtce)
> +{
> + struct page *page;
> +
> + if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
> + return;
> +
> + /*
> + * VFIO cannot map/unmap when a container is not enabled so
> + * we would not need this check but KVM could map/unmap and if
> + * this happened, we must not put pages as KVM does not get them as
> + * it expects memory pre-registation to do this part.
> + */
> + if (!container->enabled)
> + return;
This worries me a bit. How can whether the container is enabled now
safely tell you whether get_page() was called at some earlier point in time?
> +
> + page = pfn_to_page(__pa(oldtce) >> PAGE_SHIFT);
> +
> + if (oldtce & TCE_PCI_WRITE)
> + SetPageDirty(page);
> +
> + put_page(page);
> +}
> +
> static int tce_iommu_clear(struct tce_container *container,
> struct iommu_table *tbl,
> unsigned long entry, unsigned long pages)
> {
> unsigned long oldtce;
> - struct page *page;
>
> for ( ; pages; --pages, ++entry) {
> oldtce = iommu_clear_tce(tbl, entry);
> if (!oldtce)
> continue;
>
> - page = pfn_to_page(oldtce >> PAGE_SHIFT);
> - WARN_ON(!page);
> - if (page) {
> - if (oldtce & TCE_PCI_WRITE)
> - SetPageDirty(page);
> - put_page(page);
> - }
> + tce_iommu_unuse_page(container, (unsigned long) __va(oldtce));
> }
>
> return 0;
> }
>
> +static int tce_get_hva(unsigned long tce, unsigned long *hva)
> +{
> + struct page *page = NULL;
> + enum dma_data_direction direction = iommu_tce_direction(tce);
> +
> + if (get_user_pages_fast(tce & PAGE_MASK, 1,
> + direction != DMA_TO_DEVICE, &page) != 1)
> + return -EFAULT;
> +
> + *hva = (unsigned long) page_address(page);
> +
> + return 0;
> +}
I'd prefer to see this called tce_iommu_use_page() for symmetry.
> +
> static long tce_iommu_build(struct tce_container *container,
> struct iommu_table *tbl,
> unsigned long entry, unsigned long tce, unsigned long pages)
> {
> long i, ret = 0;
> - struct page *page = NULL;
> + struct page *page;
> unsigned long hva;
> enum dma_data_direction direction = iommu_tce_direction(tce);
>
> for (i = 0; i < pages; ++i) {
> - ret = get_user_pages_fast(tce & PAGE_MASK, 1,
> - direction != DMA_TO_DEVICE, &page);
> - if (unlikely(ret != 1)) {
> - ret = -EFAULT;
> + ret = tce_get_hva(tce, &hva);
> + if (ret)
> break;
> - }
>
> + page = pfn_to_page(__pa(hva) >> PAGE_SHIFT);
> if (!tce_page_is_contained(page, tbl->it_page_shift)) {
> ret = -EPERM;
> break;
> }
>
> - hva = (unsigned long) page_address(page) +
> - (tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
> + /* Preserve offset within IOMMU page */
> + hva |= tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
>
> ret = iommu_tce_build(tbl, entry + i, hva, direction);
> if (ret) {
> - put_page(page);
> + tce_iommu_unuse_page(container, hva);
> pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
> __func__, entry << tbl->it_page_shift,
> tce, ret);
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
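The "Preserve offset within IOMMU page" masking in the hunk above is worth unpacking. Under the assumption (common on ppc64) of 64K system pages with 4K IOMMU pages, page_address() gives a 64K-aligned host address, and the TCE identifies which 4K IOMMU page inside that 64K system page is meant; the mask expression copies exactly those middle bits across. A toy model (illustrative names and fixed 64K system pages):

```c
#include <assert.h>

#define SYS_PAGE_SHIFT  16                              /* 64K system page */
#define SYS_PAGE_MASK   (~((1UL << SYS_PAGE_SHIFT) - 1))

/* page_addr: SYS-page-aligned host address of the pinned page.
 * Keeps the TCE bits lying between it_page_shift and SYS_PAGE_SHIFT,
 * i.e. the IOMMU-page offset inside the system page. */
static unsigned long toy_hva(unsigned long page_addr,
                             unsigned long tce,
                             unsigned int it_page_shift)
{
    unsigned long iommu_page_mask = ~((1UL << it_page_shift) - 1);

    return page_addr | (tce & iommu_page_mask & ~SYS_PAGE_MASK);
}
```

When the IOMMU page size equals the system page size the middle bit range is empty and the expression contributes nothing, which is why the masking is harmless in that configuration.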
On Fri, Apr 10, 2015 at 04:30:50PM +1000, Alexey Kardashevskiy wrote:
> This is to make extended ownership and multiple groups support patches
> simpler for review.
>
> This is a mechanical patch.
I think you're pushing the meaning of that term. Moving whole slabs
of code by copy/paste I'd call mechanical. Reworking logic in this
way, not so much. Say "This should cause no behavioural change" if
that's what you mean.
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
Though it's not instantly obvious, it does look as though it makes no
behavioural change, so:
Reviewed-by: David Gibson <[email protected]>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Fri, Apr 10, 2015 at 04:30:51PM +1000, Alexey Kardashevskiy wrote:
> Normally a bitmap from the iommu_table is used to track what TCE entry
> is in use. Since we are going to use iommu_table without its locks and
> do xchg() instead, it becomes essential not to put bits which are not
> implied in the direction flag.
It's not clear to me from this why lack of locking implies the need to
not put extra bits in.
> This adds iommu_direction_to_tce_perm() (its counterpart is there already)
> and uses it for powernv's pnv_tce_build().
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
I have no objection to the patch though, so
Reviewed-by: David Gibson <[email protected]>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Fri, Apr 10, 2015 at 04:30:52PM +1000, Alexey Kardashevskiy wrote:
> This adds an iommu_table_ops struct and puts a pointer to it into
> the iommu_table struct. This moves the tce_build/tce_free/tce_get/tce_flush
> callbacks from ppc_md to the new struct, where they really belong.
>
> This adds the requirement for @it_ops to be initialized before calling
> iommu_init_table() to make sure that we do not leave any IOMMU table
> with iommu_table_ops uninitialized. This is not a parameter of
> iommu_init_table() though as there will be cases when iommu_init_table()
> will not be called on TCE tables, for example - VFIO.
That seems a little bit clunky to me, but it's not a big enough
objection to delay the patch over, so
Reviewed-by: David Gibson <[email protected]>
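The shape of the refactoring being reviewed here, reduced to a userspace toy (hypothetical `toy_*` names): callbacks hang off each table instead of a single global ops struct, so tables of different types can coexist, and init refuses a table whose ops were never set (mirroring the BUG_ON(!tbl->it_ops) added to iommu_init_table()).

```c
#include <assert.h>
#include <stddef.h>

struct toy_table;

/* Per-table callbacks, replacing a single global dispatch table. */
struct toy_table_ops {
    int  (*set)(struct toy_table *tbl, long index, long npages);
    void (*clear)(struct toy_table *tbl, long index, long npages);
};

struct toy_table {
    long entries_set;
    struct toy_table_ops *it_ops;  /* must be set before init, as noted */
};

static int toy_set(struct toy_table *tbl, long index, long npages)
{
    (void)index;
    tbl->entries_set += npages;
    return 0;
}

static void toy_clear(struct toy_table *tbl, long index, long npages)
{
    (void)index;
    tbl->entries_set -= npages;
}

static struct toy_table_ops toy_ops = { .set = toy_set, .clear = toy_clear };

/* Fails (rather than crashing later) if it_ops was never initialized. */
static int toy_init_table(struct toy_table *tbl)
{
    return tbl->it_ops ? 0 : -1;
}
```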
>
> This does s/tce_build/set/, s/tce_free/clear/ and removes the redundant
> "tce_" prefixes.
>
> This removes tce_xxx_rm handlers from ppc_md but does not add
> them to iommu_table_ops as this will be done later if we decide to
> support TCE hypercalls in real mode.
>
> For pSeries, this always uses tce_buildmulti_pSeriesLP/
> tce_freemulti_pSeriesLP. This changes the multi callbacks to fall back to
> tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not
> present. The reason for this is we still have to support "multitce=off"
> boot parameter in disable_multitce() and we do not want to walk through
> all IOMMU tables in the system and replace "multi" callbacks with single
> ones.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> arch/powerpc/include/asm/iommu.h | 17 +++++++++++
> arch/powerpc/include/asm/machdep.h | 25 ----------------
> arch/powerpc/kernel/iommu.c | 46 +++++++++++++++--------------
> arch/powerpc/kernel/vio.c | 5 ++++
> arch/powerpc/platforms/cell/iommu.c | 8 +++--
> arch/powerpc/platforms/pasemi/iommu.c | 7 +++--
> arch/powerpc/platforms/powernv/pci-ioda.c | 2 ++
> arch/powerpc/platforms/powernv/pci-p5ioc2.c | 1 +
> arch/powerpc/platforms/powernv/pci.c | 23 ++++-----------
> arch/powerpc/platforms/powernv/pci.h | 1 +
> arch/powerpc/platforms/pseries/iommu.c | 34 +++++++++++----------
> arch/powerpc/sysdev/dart_iommu.c | 12 ++++----
> 12 files changed, 93 insertions(+), 88 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 2af2d70..d909e2a 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -43,6 +43,22 @@
> extern int iommu_is_off;
> extern int iommu_force_on;
>
> +struct iommu_table_ops {
> + int (*set)(struct iommu_table *tbl,
> + long index, long npages,
> + unsigned long uaddr,
> + enum dma_data_direction direction,
> + struct dma_attrs *attrs);
> + void (*clear)(struct iommu_table *tbl,
> + long index, long npages);
> + unsigned long (*get)(struct iommu_table *tbl, long index);
> + void (*flush)(struct iommu_table *tbl);
> +};
> +
> +/* These are used by VIO */
> +extern struct iommu_table_ops iommu_table_lpar_multi_ops;
> +extern struct iommu_table_ops iommu_table_pseries_ops;
> +
> /*
> * IOMAP_MAX_ORDER defines the largest contiguous block
> * of dma space we can get. IOMAP_MAX_ORDER = 13
> @@ -77,6 +93,7 @@ struct iommu_table {
> #ifdef CONFIG_IOMMU_API
> struct iommu_group *it_group;
> #endif
> + struct iommu_table_ops *it_ops;
> void (*set_bypass)(struct iommu_table *tbl, bool enable);
> };
>
> diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
> index c8175a3..2abe744 100644
> --- a/arch/powerpc/include/asm/machdep.h
> +++ b/arch/powerpc/include/asm/machdep.h
> @@ -65,31 +65,6 @@ struct machdep_calls {
> * destroyed as well */
> void (*hpte_clear_all)(void);
>
> - int (*tce_build)(struct iommu_table *tbl,
> - long index,
> - long npages,
> - unsigned long uaddr,
> - enum dma_data_direction direction,
> - struct dma_attrs *attrs);
> - void (*tce_free)(struct iommu_table *tbl,
> - long index,
> - long npages);
> - unsigned long (*tce_get)(struct iommu_table *tbl,
> - long index);
> - void (*tce_flush)(struct iommu_table *tbl);
> -
> - /* _rm versions are for real mode use only */
> - int (*tce_build_rm)(struct iommu_table *tbl,
> - long index,
> - long npages,
> - unsigned long uaddr,
> - enum dma_data_direction direction,
> - struct dma_attrs *attrs);
> - void (*tce_free_rm)(struct iommu_table *tbl,
> - long index,
> - long npages);
> - void (*tce_flush_rm)(struct iommu_table *tbl);
> -
> void __iomem * (*ioremap)(phys_addr_t addr, unsigned long size,
> unsigned long flags, void *caller);
> void (*iounmap)(volatile void __iomem *token);
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 029b1ea..eceb214 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -322,11 +322,11 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
> ret = entry << tbl->it_page_shift; /* Set the return dma address */
>
> /* Put the TCEs in the HW table */
> - build_fail = ppc_md.tce_build(tbl, entry, npages,
> + build_fail = tbl->it_ops->set(tbl, entry, npages,
> (unsigned long)page &
> IOMMU_PAGE_MASK(tbl), direction, attrs);
>
> - /* ppc_md.tce_build() only returns non-zero for transient errors.
> + /* tbl->it_ops->set() only returns non-zero for transient errors.
> * Clean up the table bitmap in this case and return
> * DMA_ERROR_CODE. For all other errors the functionality is
> * not altered.
> @@ -337,8 +337,8 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
> }
>
> /* Flush/invalidate TLB caches if necessary */
> - if (ppc_md.tce_flush)
> - ppc_md.tce_flush(tbl);
> + if (tbl->it_ops->flush)
> + tbl->it_ops->flush(tbl);
>
> /* Make sure updates are seen by hardware */
> mb();
> @@ -408,7 +408,7 @@ static void __iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
> if (!iommu_free_check(tbl, dma_addr, npages))
> return;
>
> - ppc_md.tce_free(tbl, entry, npages);
> + tbl->it_ops->clear(tbl, entry, npages);
>
> spin_lock_irqsave(&(pool->lock), flags);
> bitmap_clear(tbl->it_map, free_entry, npages);
> @@ -424,8 +424,8 @@ static void iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
> * not do an mb() here on purpose, it is not needed on any of
> * the current platforms.
> */
> - if (ppc_md.tce_flush)
> - ppc_md.tce_flush(tbl);
> + if (tbl->it_ops->flush)
> + tbl->it_ops->flush(tbl);
> }
>
> int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
> @@ -495,7 +495,7 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
> npages, entry, dma_addr);
>
> /* Insert into HW table */
> - build_fail = ppc_md.tce_build(tbl, entry, npages,
> + build_fail = tbl->it_ops->set(tbl, entry, npages,
> vaddr & IOMMU_PAGE_MASK(tbl),
> direction, attrs);
> if(unlikely(build_fail))
> @@ -534,8 +534,8 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
> }
>
> /* Flush/invalidate TLB caches if necessary */
> - if (ppc_md.tce_flush)
> - ppc_md.tce_flush(tbl);
> + if (tbl->it_ops->flush)
> + tbl->it_ops->flush(tbl);
>
> DBG("mapped %d elements:\n", outcount);
>
> @@ -600,8 +600,8 @@ void ppc_iommu_unmap_sg(struct iommu_table *tbl, struct scatterlist *sglist,
> * do not do an mb() here, the affected platforms do not need it
> * when freeing.
> */
> - if (ppc_md.tce_flush)
> - ppc_md.tce_flush(tbl);
> + if (tbl->it_ops->flush)
> + tbl->it_ops->flush(tbl);
> }
>
> static void iommu_table_clear(struct iommu_table *tbl)
> @@ -613,17 +613,17 @@ static void iommu_table_clear(struct iommu_table *tbl)
> */
> if (!is_kdump_kernel() || is_fadump_active()) {
> /* Clear the table in case firmware left allocations in it */
> - ppc_md.tce_free(tbl, tbl->it_offset, tbl->it_size);
> + tbl->it_ops->clear(tbl, tbl->it_offset, tbl->it_size);
> return;
> }
>
> #ifdef CONFIG_CRASH_DUMP
> - if (ppc_md.tce_get) {
> + if (tbl->it_ops->get) {
> unsigned long index, tceval, tcecount = 0;
>
> /* Reserve the existing mappings left by the first kernel. */
> for (index = 0; index < tbl->it_size; index++) {
> - tceval = ppc_md.tce_get(tbl, index + tbl->it_offset);
> + tceval = tbl->it_ops->get(tbl, index + tbl->it_offset);
> /*
> * Freed TCE entry contains 0x7fffffffffffffff on JS20
> */
> @@ -657,6 +657,8 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
> unsigned int i;
> struct iommu_pool *p;
>
> + BUG_ON(!tbl->it_ops);
> +
> /* number of bytes needed for the bitmap */
> sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
>
> @@ -934,8 +936,8 @@ EXPORT_SYMBOL_GPL(iommu_tce_direction);
> void iommu_flush_tce(struct iommu_table *tbl)
> {
> /* Flush/invalidate TLB caches if necessary */
> - if (ppc_md.tce_flush)
> - ppc_md.tce_flush(tbl);
> + if (tbl->it_ops->flush)
> + tbl->it_ops->flush(tbl);
>
> /* Make sure updates are seen by hardware */
> mb();
> @@ -946,7 +948,7 @@ int iommu_tce_clear_param_check(struct iommu_table *tbl,
> unsigned long ioba, unsigned long tce_value,
> unsigned long npages)
> {
> - /* ppc_md.tce_free() does not support any value but 0 */
> + /* tbl->it_ops->clear() does not support any value but 0 */
> if (tce_value)
> return -EINVAL;
>
> @@ -994,9 +996,9 @@ unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
>
> spin_lock(&(pool->lock));
>
> - oldtce = ppc_md.tce_get(tbl, entry);
> + oldtce = tbl->it_ops->get(tbl, entry);
> if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
> - ppc_md.tce_free(tbl, entry, 1);
> + tbl->it_ops->clear(tbl, entry, 1);
> else
> oldtce = 0;
>
> @@ -1019,10 +1021,10 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
>
> spin_lock(&(pool->lock));
>
> - oldtce = ppc_md.tce_get(tbl, entry);
> + oldtce = tbl->it_ops->get(tbl, entry);
> /* Add new entry if it is not busy */
> if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
> - ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, direction, NULL);
> + ret = tbl->it_ops->set(tbl, entry, 1, hwaddr, direction, NULL);
>
> spin_unlock(&(pool->lock));
>
> diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
> index 5bfdab9..b41426c 100644
> --- a/arch/powerpc/kernel/vio.c
> +++ b/arch/powerpc/kernel/vio.c
> @@ -1196,6 +1196,11 @@ static struct iommu_table *vio_build_iommu_table(struct vio_dev *dev)
> tbl->it_type = TCE_VB;
> tbl->it_blocksize = 16;
>
> + if (firmware_has_feature(FW_FEATURE_LPAR))
> + tbl->it_ops = &iommu_table_lpar_multi_ops;
> + else
> + tbl->it_ops = &iommu_table_pseries_ops;
> +
> return iommu_init_table(tbl, -1);
> }
>
> diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
> index c7c8720..72763a8 100644
> --- a/arch/powerpc/platforms/cell/iommu.c
> +++ b/arch/powerpc/platforms/cell/iommu.c
> @@ -465,6 +465,11 @@ static inline u32 cell_iommu_get_ioid(struct device_node *np)
> return *ioid;
> }
>
> +static struct iommu_table_ops cell_iommu_ops = {
> + .set = tce_build_cell,
> + .clear = tce_free_cell
> +};
> +
> static struct iommu_window * __init
> cell_iommu_setup_window(struct cbe_iommu *iommu, struct device_node *np,
> unsigned long offset, unsigned long size,
> @@ -491,6 +496,7 @@ cell_iommu_setup_window(struct cbe_iommu *iommu, struct device_node *np,
> window->table.it_offset =
> (offset >> window->table.it_page_shift) + pte_offset;
> window->table.it_size = size >> window->table.it_page_shift;
> + window->table.it_ops = &cell_iommu_ops;
>
> iommu_init_table(&window->table, iommu->nid);
>
> @@ -1200,8 +1206,6 @@ static int __init cell_iommu_init(void)
> /* Setup various ppc_md. callbacks */
> ppc_md.pci_dma_dev_setup = cell_pci_dma_dev_setup;
> ppc_md.dma_get_required_mask = cell_dma_get_required_mask;
> - ppc_md.tce_build = tce_build_cell;
> - ppc_md.tce_free = tce_free_cell;
>
> if (!iommu_fixed_disabled && cell_iommu_fixed_mapping_init() == 0)
> goto bail;
> diff --git a/arch/powerpc/platforms/pasemi/iommu.c b/arch/powerpc/platforms/pasemi/iommu.c
> index 2e576f2..b7245b2 100644
> --- a/arch/powerpc/platforms/pasemi/iommu.c
> +++ b/arch/powerpc/platforms/pasemi/iommu.c
> @@ -132,6 +132,10 @@ static void iobmap_free(struct iommu_table *tbl, long index,
> }
> }
>
> +static struct iommu_table_ops iommu_table_iobmap_ops = {
> + .set = iobmap_build,
> + .clear = iobmap_free
> +};
>
> static void iommu_table_iobmap_setup(void)
> {
> @@ -151,6 +155,7 @@ static void iommu_table_iobmap_setup(void)
> * Should probably be 8 (64 bytes)
> */
> iommu_table_iobmap.it_blocksize = 4;
> + iommu_table_iobmap.it_ops = &iommu_table_iobmap_ops;
> iommu_init_table(&iommu_table_iobmap, 0);
> pr_debug(" <- %s\n", __func__);
> }
> @@ -250,8 +255,6 @@ void __init iommu_init_early_pasemi(void)
>
> ppc_md.pci_dma_dev_setup = pci_dma_dev_setup_pasemi;
> ppc_md.pci_dma_bus_setup = pci_dma_bus_setup_pasemi;
> - ppc_md.tce_build = iobmap_build;
> - ppc_md.tce_free = iobmap_free;
> set_pci_dma_ops(&dma_iommu_ops);
> }
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 6c9ff2b..85e64a5 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1231,6 +1231,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> TCE_PCI_SWINV_FREE |
> TCE_PCI_SWINV_PAIR);
> }
> + tbl->it_ops = &pnv_iommu_ops;
> iommu_init_table(tbl, phb->hose->node);
> iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
>
> @@ -1364,6 +1365,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> 8);
> tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
> }
> + tbl->it_ops = &pnv_iommu_ops;
> iommu_init_table(tbl, phb->hose->node);
> iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
>
> diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> index 6ef6d4d..0256fcc 100644
> --- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> +++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> @@ -87,6 +87,7 @@ static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
> struct pci_dev *pdev)
> {
> if (phb->p5ioc2.iommu_table.it_map == NULL) {
> + phb->p5ioc2.iommu_table.it_ops = &pnv_iommu_ops;
> iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
> iommu_register_group(&phb->p5ioc2.iommu_table,
> pci_domain_nr(phb->hose->bus), phb->opal_id);
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index 609f5b1..c619ec6 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -647,18 +647,11 @@ static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
> return ((u64 *)tbl->it_base)[index - tbl->it_offset];
> }
>
> -static int pnv_tce_build_rm(struct iommu_table *tbl, long index, long npages,
> - unsigned long uaddr,
> - enum dma_data_direction direction,
> - struct dma_attrs *attrs)
> -{
> - return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, true);
> -}
> -
> -static void pnv_tce_free_rm(struct iommu_table *tbl, long index, long npages)
> -{
> - pnv_tce_free(tbl, index, npages, true);
> -}
> +struct iommu_table_ops pnv_iommu_ops = {
> + .set = pnv_tce_build_vm,
> + .clear = pnv_tce_free_vm,
> + .get = pnv_tce_get,
> +};
>
> void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
> void *tce_mem, u64 tce_size,
> @@ -692,6 +685,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
> return NULL;
> pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
> be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
> + tbl->it_ops = &pnv_iommu_ops;
> iommu_init_table(tbl, hose->node);
> iommu_register_group(tbl, pci_domain_nr(hose->bus), 0);
>
> @@ -817,11 +811,6 @@ void __init pnv_pci_init(void)
>
> /* Configure IOMMU DMA hooks */
> ppc_md.pci_dma_dev_setup = pnv_pci_dma_dev_setup;
> - ppc_md.tce_build = pnv_tce_build_vm;
> - ppc_md.tce_free = pnv_tce_free_vm;
> - ppc_md.tce_build_rm = pnv_tce_build_rm;
> - ppc_md.tce_free_rm = pnv_tce_free_rm;
> - ppc_md.tce_get = pnv_tce_get;
> set_pci_dma_ops(&dma_iommu_ops);
>
> /* Configure MSIs */
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 6c02ff8..f726700 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -216,6 +216,7 @@ extern struct pci_ops pnv_pci_ops;
> #ifdef CONFIG_EEH
> extern struct pnv_eeh_ops ioda_eeh_ops;
> #endif
> +extern struct iommu_table_ops pnv_iommu_ops;
>
> void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
> unsigned char *log_buff);
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 7803a19..48d1fde 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -192,7 +192,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
> int ret = 0;
> unsigned long flags;
>
> - if (npages == 1) {
> + if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
> return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
> direction, attrs);
> }
> @@ -284,6 +284,9 @@ static void tce_freemulti_pSeriesLP(struct iommu_table *tbl, long tcenum, long n
> {
> u64 rc;
>
> + if (!firmware_has_feature(FW_FEATURE_MULTITCE))
> + return tce_free_pSeriesLP(tbl, tcenum, npages);
> +
> rc = plpar_tce_stuff((u64)tbl->it_index, (u64)tcenum << 12, 0, npages);
>
> if (rc && printk_ratelimit()) {
> @@ -459,7 +462,6 @@ static int tce_setrange_multi_pSeriesLP_walk(unsigned long start_pfn,
> return tce_setrange_multi_pSeriesLP(start_pfn, num_pfn, arg);
> }
>
> -
> #ifdef CONFIG_PCI
> static void iommu_table_setparms(struct pci_controller *phb,
> struct device_node *dn,
> @@ -545,6 +547,12 @@ static void iommu_table_setparms_lpar(struct pci_controller *phb,
> tbl->it_size = size >> tbl->it_page_shift;
> }
>
> +struct iommu_table_ops iommu_table_pseries_ops = {
> + .set = tce_build_pSeries,
> + .clear = tce_free_pSeries,
> + .get = tce_get_pseries
> +};
> +
> static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
> {
> struct device_node *dn;
> @@ -613,6 +621,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
> pci->phb->node);
>
> iommu_table_setparms(pci->phb, dn, tbl);
> + tbl->it_ops = &iommu_table_pseries_ops;
> pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
> iommu_register_group(tbl, pci_domain_nr(bus), 0);
>
> @@ -624,6 +633,11 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
> pr_debug("ISA/IDE, window size is 0x%llx\n", pci->phb->dma_window_size);
> }
>
> +struct iommu_table_ops iommu_table_lpar_multi_ops = {
> + .set = tce_buildmulti_pSeriesLP,
> + .clear = tce_freemulti_pSeriesLP,
> + .get = tce_get_pSeriesLP
> +};
>
> static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
> {
> @@ -658,6 +672,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
> tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
> ppci->phb->node);
> iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
> + tbl->it_ops = &iommu_table_lpar_multi_ops;
> ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
> iommu_register_group(tbl, pci_domain_nr(bus), 0);
> pr_debug(" created table: %p\n", ppci->iommu_table);
> @@ -685,6 +700,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
> tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
> phb->node);
> iommu_table_setparms(phb, dn, tbl);
> + tbl->it_ops = &iommu_table_pseries_ops;
> PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
> iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
> set_iommu_table_base_and_group(&dev->dev,
> @@ -1107,6 +1123,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
> tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
> pci->phb->node);
> iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
> + tbl->it_ops = &iommu_table_lpar_multi_ops;
> pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
> iommu_register_group(tbl, pci_domain_nr(pci->phb->bus), 0);
> pr_debug(" created table: %p\n", pci->iommu_table);
> @@ -1299,22 +1316,11 @@ void iommu_init_early_pSeries(void)
> return;
>
> if (firmware_has_feature(FW_FEATURE_LPAR)) {
> - if (firmware_has_feature(FW_FEATURE_MULTITCE)) {
> - ppc_md.tce_build = tce_buildmulti_pSeriesLP;
> - ppc_md.tce_free = tce_freemulti_pSeriesLP;
> - } else {
> - ppc_md.tce_build = tce_build_pSeriesLP;
> - ppc_md.tce_free = tce_free_pSeriesLP;
> - }
> - ppc_md.tce_get = tce_get_pSeriesLP;
> ppc_md.pci_dma_bus_setup = pci_dma_bus_setup_pSeriesLP;
> ppc_md.pci_dma_dev_setup = pci_dma_dev_setup_pSeriesLP;
> ppc_md.dma_set_mask = dma_set_mask_pSeriesLP;
> ppc_md.dma_get_required_mask = dma_get_required_mask_pSeriesLP;
> } else {
> - ppc_md.tce_build = tce_build_pSeries;
> - ppc_md.tce_free = tce_free_pSeries;
> - ppc_md.tce_get = tce_get_pseries;
> ppc_md.pci_dma_bus_setup = pci_dma_bus_setup_pSeries;
> ppc_md.pci_dma_dev_setup = pci_dma_dev_setup_pSeries;
> }
> @@ -1332,8 +1338,6 @@ static int __init disable_multitce(char *str)
> firmware_has_feature(FW_FEATURE_LPAR) &&
> firmware_has_feature(FW_FEATURE_MULTITCE)) {
> printk(KERN_INFO "Disabling MULTITCE firmware feature\n");
> - ppc_md.tce_build = tce_build_pSeriesLP;
> - ppc_md.tce_free = tce_free_pSeriesLP;
> powerpc_firmware_features &= ~FW_FEATURE_MULTITCE;
> }
> return 1;
> diff --git a/arch/powerpc/sysdev/dart_iommu.c b/arch/powerpc/sysdev/dart_iommu.c
> index 9e5353f..ab361a3 100644
> --- a/arch/powerpc/sysdev/dart_iommu.c
> +++ b/arch/powerpc/sysdev/dart_iommu.c
> @@ -286,6 +286,12 @@ static int __init dart_init(struct device_node *dart_node)
> return 0;
> }
>
> +static struct iommu_table_ops iommu_dart_ops = {
> + .set = dart_build,
> + .clear = dart_free,
> + .flush = dart_flush,
> +};
> +
> static void iommu_table_dart_setup(void)
> {
> iommu_table_dart.it_busno = 0;
> @@ -298,6 +304,7 @@ static void iommu_table_dart_setup(void)
> iommu_table_dart.it_base = (unsigned long)dart_vbase;
> iommu_table_dart.it_index = 0;
> iommu_table_dart.it_blocksize = 1;
> + iommu_table_dart.it_ops = &iommu_dart_ops;
> iommu_init_table(&iommu_table_dart, -1);
>
> /* Reserve the last page of the DART to avoid possible prefetch
> @@ -386,11 +393,6 @@ void __init iommu_init_early_dart(void)
> if (dart_init(dn) != 0)
> goto bail;
>
> - /* Setup low level TCE operations for the core IOMMU code */
> - ppc_md.tce_build = dart_build;
> - ppc_md.tce_free = dart_free;
> - ppc_md.tce_flush = dart_flush;
> -
> /* Setup bypass if supported */
> if (dart_is_u4)
> ppc_md.dma_set_mask = dart_dma_set_mask;
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On 04/15/2015 05:10 PM, David Gibson wrote:
> On Fri, Apr 10, 2015 at 04:30:49PM +1000, Alexey Kardashevskiy wrote:
>> This is a pretty mechanical patch to make next patches simpler.
>>
>> New tce_iommu_unuse_page() helper does put_page() now but it might skip
>> that after the memory registering patch applied.
>>
>> As we are here, this removes unnecessary checks for a value returned
>> by pfn_to_page() as it cannot possibly return NULL.
>>
>> This moves tce_iommu_disable() later to let tce_iommu_clear() know if
>> the container has been enabled because if it has not been, then
>> put_page() must not be called on TCEs from the TCE table. This situation
>> is not yet possible but it will be once the KVM acceleration patchset
>> is applied.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>> Changes:
>> v6:
>> * tce_get_hva() returns hva via a pointer
>> ---
>> drivers/vfio/vfio_iommu_spapr_tce.c | 68 +++++++++++++++++++++++++++----------
>> 1 file changed, 50 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index c137bb3..ec5ee83 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -196,7 +196,6 @@ static void tce_iommu_release(void *iommu_data)
>> struct iommu_table *tbl = container->tbl;
>>
>> WARN_ON(tbl && !tbl->it_group);
>> - tce_iommu_disable(container);
>>
>> if (tbl) {
>> tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
>> @@ -204,63 +203,96 @@ static void tce_iommu_release(void *iommu_data)
>> if (tbl->it_group)
>> tce_iommu_detach_group(iommu_data, tbl->it_group);
>> }
>> +
>> + tce_iommu_disable(container);
>> +
>> mutex_destroy(&container->lock);
>>
>> kfree(container);
>> }
>>
>> +static void tce_iommu_unuse_page(struct tce_container *container,
>> + unsigned long oldtce)
>> +{
>> + struct page *page;
>> +
>> + if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
>> + return;
>> +
>> + /*
>> + * VFIO cannot map/unmap when a container is not enabled so
>> + * we would not need this check but KVM could map/unmap and if
>> + * this happened, we must not put pages as KVM does not get them as
>> + * it expects memory pre-registration to do this part.
>> + */
>> + if (!container->enabled)
>> + return;
>
> This worries me a bit. How can whether the container is enabled now
> safely tell you whether get_page() was called at some earlier point in time?
This is a leftover; I'll remove it, since after the "iommu v2" patch there
will be tce_iommu_unuse_page_v2().
>> +
>> + page = pfn_to_page(__pa(oldtce) >> PAGE_SHIFT);
>> +
>> + if (oldtce & TCE_PCI_WRITE)
>> + SetPageDirty(page);
>> +
>> + put_page(page);
>> +}
>> +
>> static int tce_iommu_clear(struct tce_container *container,
>> struct iommu_table *tbl,
>> unsigned long entry, unsigned long pages)
>> {
>> unsigned long oldtce;
>> - struct page *page;
>>
>> for ( ; pages; --pages, ++entry) {
>> oldtce = iommu_clear_tce(tbl, entry);
>> if (!oldtce)
>> continue;
>>
>> - page = pfn_to_page(oldtce >> PAGE_SHIFT);
>> - WARN_ON(!page);
>> - if (page) {
>> - if (oldtce & TCE_PCI_WRITE)
>> - SetPageDirty(page);
>> - put_page(page);
>> - }
>> + tce_iommu_unuse_page(container, (unsigned long) __va(oldtce));
>> }
>>
>> return 0;
>> }
>>
>> +static int tce_get_hva(unsigned long tce, unsigned long *hva)
>> +{
>> + struct page *page = NULL;
>> + enum dma_data_direction direction = iommu_tce_direction(tce);
>> +
>> + if (get_user_pages_fast(tce & PAGE_MASK, 1,
>> + direction != DMA_TO_DEVICE, &page) != 1)
>> + return -EFAULT;
>> +
>> + *hva = (unsigned long) page_address(page);
>> +
>> + return 0;
>> +}
>
> I'd prefer to see this called tce_iommu_use_page() for symmetry.
If I rename this one, then what would I call tce_get_hva_cached() from
"vfio: powerpc/spapr: Register memory and define IOMMU v2"?
>
>> +
>> static long tce_iommu_build(struct tce_container *container,
>> struct iommu_table *tbl,
>> unsigned long entry, unsigned long tce, unsigned long pages)
>> {
>> long i, ret = 0;
>> - struct page *page = NULL;
>> + struct page *page;
>> unsigned long hva;
>> enum dma_data_direction direction = iommu_tce_direction(tce);
>>
>> for (i = 0; i < pages; ++i) {
>> - ret = get_user_pages_fast(tce & PAGE_MASK, 1,
>> - direction != DMA_TO_DEVICE, &page);
>> - if (unlikely(ret != 1)) {
>> - ret = -EFAULT;
>> + ret = tce_get_hva(tce, &hva);
>> + if (ret)
>> break;
>> - }
>>
>> + page = pfn_to_page(__pa(hva) >> PAGE_SHIFT);
>> if (!tce_page_is_contained(page, tbl->it_page_shift)) {
>> ret = -EPERM;
>> break;
>> }
>>
>> - hva = (unsigned long) page_address(page) +
>> - (tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
>> + /* Preserve offset within IOMMU page */
>> + hva |= tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
>>
>> ret = iommu_tce_build(tbl, entry + i, hva, direction);
>> if (ret) {
>> - put_page(page);
>> + tce_iommu_unuse_page(container, hva);
>> pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
>> __func__, entry << tbl->it_page_shift,
>> tce, ret);
>
--
Alexey
On Fri, Apr 10, 2015 at 04:30:53PM +1000, Alexey Kardashevskiy wrote:
> This replaces multiple calls of kzalloc_node() with a new
> iommu_table_alloc() helper. Right now it simply calls kzalloc_node(),
> but later it will be modified to allocate an iommu_table_group struct
> which embeds the iommu table(s), starting with a single iommu_table.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
I'd prefer to see this folded into the next patch. This one's very
simple and I have trouble making sense of it without the next.
> ---
> arch/powerpc/include/asm/iommu.h | 1 +
> arch/powerpc/kernel/iommu.c | 9 +++++++++
> arch/powerpc/platforms/powernv/pci.c | 2 +-
> arch/powerpc/platforms/pseries/iommu.c | 12 ++++--------
> 4 files changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index d909e2a..eb75726 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -117,6 +117,7 @@ static inline void *get_iommu_table_base(struct device *dev)
> return dev->archdata.dma_data.iommu_table_base;
> }
>
> +extern struct iommu_table *iommu_table_alloc(int node);
> /* Frees table for an individual device node */
> extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
>
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index eceb214..b39d00a 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -710,6 +710,15 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
> return tbl;
> }
>
> +struct iommu_table *iommu_table_alloc(int node)
> +{
> + struct iommu_table *tbl;
> +
> + tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
> +
> + return tbl;
> +}
> +
> void iommu_free_table(struct iommu_table *tbl, const char *node_name)
> {
> unsigned long bitmap_sz;
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index c619ec6..1c31ac8 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -680,7 +680,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
> hose->dn->full_name);
> return NULL;
> }
> - tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, hose->node);
> + tbl = iommu_table_alloc(hose->node);
> if (WARN_ON(!tbl))
> return NULL;
> pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 48d1fde..41a8b14 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -617,8 +617,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
> pci->phb->dma_window_size = 0x8000000ul;
> pci->phb->dma_window_base_cur = 0x8000000ul;
>
> - tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
> - pci->phb->node);
> + tbl = iommu_table_alloc(pci->phb->node);
>
> iommu_table_setparms(pci->phb, dn, tbl);
> tbl->it_ops = &iommu_table_pseries_ops;
> @@ -669,8 +668,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
> pdn->full_name, ppci->iommu_table);
>
> if (!ppci->iommu_table) {
> - tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
> - ppci->phb->node);
> + tbl = iommu_table_alloc(ppci->phb->node);
> iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
> tbl->it_ops = &iommu_table_lpar_multi_ops;
> ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
> @@ -697,8 +695,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
> struct pci_controller *phb = PCI_DN(dn)->phb;
>
> pr_debug(" --> first child, no bridge. Allocating iommu table.\n");
> - tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
> - phb->node);
> + tbl = iommu_table_alloc(phb->node);
> iommu_table_setparms(phb, dn, tbl);
> tbl->it_ops = &iommu_table_pseries_ops;
> PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
> @@ -1120,8 +1117,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
>
> pci = PCI_DN(pdn);
> if (!pci->iommu_table) {
> - tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
> - pci->phb->node);
> + tbl = iommu_table_alloc(pci->phb->node);
> iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
> tbl->it_ops = &iommu_table_lpar_multi_ops;
> pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
On Fri, Apr 10, 2015 at 04:30:54PM +1000, Alexey Kardashevskiy wrote:
> Modern IBM POWERPC systems support multiple (currently two) TCE tables
> per IOMMU group (a.k.a. PE). This adds an iommu_table_group container
> for TCE tables. Right now just one table is supported.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> arch/powerpc/include/asm/iommu.h | 18 +++--
> arch/powerpc/kernel/iommu.c | 34 ++++----
> arch/powerpc/platforms/powernv/pci-ioda.c | 38 +++++----
> arch/powerpc/platforms/powernv/pci-p5ioc2.c | 17 ++--
> arch/powerpc/platforms/powernv/pci.c | 2 +-
> arch/powerpc/platforms/powernv/pci.h | 4 +-
> arch/powerpc/platforms/pseries/iommu.c | 9 ++-
> drivers/vfio/vfio_iommu_spapr_tce.c | 120 ++++++++++++++++++++--------
> 8 files changed, 160 insertions(+), 82 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index eb75726..667aa1a 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -90,9 +90,7 @@ struct iommu_table {
> struct iommu_pool pools[IOMMU_NR_POOLS];
> unsigned long *it_map; /* A simple allocation bitmap for now */
> unsigned long it_page_shift;/* table iommu page size */
> -#ifdef CONFIG_IOMMU_API
> - struct iommu_group *it_group;
> -#endif
> + struct iommu_table_group *it_group;
> struct iommu_table_ops *it_ops;
> void (*set_bypass)(struct iommu_table *tbl, bool enable);
> };
> @@ -126,14 +124,24 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
> */
> extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
> int nid);
> +
> +#define IOMMU_TABLE_GROUP_MAX_TABLES 1
> +
> +struct iommu_table_group {
> #ifdef CONFIG_IOMMU_API
> -extern void iommu_register_group(struct iommu_table *tbl,
> + struct iommu_group *group;
> +#endif
> + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
There's nothing to indicate which of the tables are in use at the
current time. I mean, it doesn't matter now because there's only one,
but the patch doesn't make a whole lot of sense without that.
> +};
> +
> +#ifdef CONFIG_IOMMU_API
> +extern void iommu_register_group(struct iommu_table_group *table_group,
> int pci_domain_number, unsigned long pe_num);
> extern int iommu_add_device(struct device *dev);
> extern void iommu_del_device(struct device *dev);
> extern int __init tce_iommu_bus_notifier_init(void);
> #else
> -static inline void iommu_register_group(struct iommu_table *tbl,
> +static inline void iommu_register_group(struct iommu_table_group *table_group,
> int pci_domain_number,
> unsigned long pe_num)
> {
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index b39d00a..fd49c8e 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -712,17 +712,20 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
>
> struct iommu_table *iommu_table_alloc(int node)
> {
> - struct iommu_table *tbl;
> + struct iommu_table_group *table_group;
>
> - tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
> + table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL,
> + node);
> + table_group->tables[0].it_group = table_group;
>
> - return tbl;
> + return &table_group->tables[0];
> }
>
> void iommu_free_table(struct iommu_table *tbl, const char *node_name)
Surely the free function should take a table group rather than a table
as argument.
> {
> unsigned long bitmap_sz;
> unsigned int order;
> + struct iommu_table_group *table_group = tbl->it_group;
>
> if (!tbl || !tbl->it_map) {
> printk(KERN_ERR "%s: expected TCE map for %s\n", __func__,
> @@ -738,9 +741,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
> clear_bit(0, tbl->it_map);
>
> #ifdef CONFIG_IOMMU_API
> - if (tbl->it_group) {
> - iommu_group_put(tbl->it_group);
> - BUG_ON(tbl->it_group);
> + if (table_group->group) {
> + iommu_group_put(table_group->group);
> + BUG_ON(table_group->group);
> }
> #endif
>
> @@ -756,7 +759,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
> free_pages((unsigned long) tbl->it_map, order);
>
> /* free table */
> - kfree(tbl);
> + kfree(table_group);
> }
>
> /* Creates TCEs for a user provided buffer. The user buffer must be
> @@ -903,11 +906,12 @@ EXPORT_SYMBOL_GPL(iommu_direction_to_tce_perm);
> */
> static void group_release(void *iommu_data)
> {
> - struct iommu_table *tbl = iommu_data;
> - tbl->it_group = NULL;
> + struct iommu_table_group *table_group = iommu_data;
> +
> + table_group->group = NULL;
> }
>
> -void iommu_register_group(struct iommu_table *tbl,
> +void iommu_register_group(struct iommu_table_group *table_group,
> int pci_domain_number, unsigned long pe_num)
> {
> struct iommu_group *grp;
> @@ -919,8 +923,8 @@ void iommu_register_group(struct iommu_table *tbl,
> PTR_ERR(grp));
> return;
> }
> - tbl->it_group = grp;
> - iommu_group_set_iommudata(grp, tbl, group_release);
> + table_group->group = grp;
> + iommu_group_set_iommudata(grp, table_group, group_release);
> name = kasprintf(GFP_KERNEL, "domain%d-pe%lx",
> pci_domain_number, pe_num);
> if (!name)
> @@ -1108,7 +1112,7 @@ int iommu_add_device(struct device *dev)
> }
>
> tbl = get_iommu_table_base(dev);
> - if (!tbl || !tbl->it_group) {
> + if (!tbl || !tbl->it_group || !tbl->it_group->group) {
> pr_debug("%s: Skipping device %s with no tbl\n",
> __func__, dev_name(dev));
> return 0;
> @@ -1116,7 +1120,7 @@ int iommu_add_device(struct device *dev)
>
> pr_debug("%s: Adding %s to iommu group %d\n",
> __func__, dev_name(dev),
> - iommu_group_id(tbl->it_group));
> + iommu_group_id(tbl->it_group->group));
>
> if (PAGE_SIZE < IOMMU_PAGE_SIZE(tbl)) {
> pr_err("%s: Invalid IOMMU page size %lx (%lx) on %s\n",
> @@ -1125,7 +1129,7 @@ int iommu_add_device(struct device *dev)
> return -EINVAL;
> }
>
> - return iommu_group_add_device(tbl->it_group, dev);
> + return iommu_group_add_device(tbl->it_group->group, dev);
> }
> EXPORT_SYMBOL_GPL(iommu_add_device);
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 85e64a5..a964c50 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -23,6 +23,7 @@
> #include <linux/io.h>
> #include <linux/msi.h>
> #include <linux/memblock.h>
> +#include <linux/iommu.h>
>
> #include <asm/sections.h>
> #include <asm/io.h>
> @@ -989,7 +990,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
>
> pe = &phb->ioda.pe_array[pdn->pe_number];
> WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
> - set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
> + set_iommu_table_base_and_group(&pdev->dev, &pe->table_group.tables[0]);
> }
>
> static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
> @@ -1016,7 +1017,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
> } else {
> dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
> set_dma_ops(&pdev->dev, &dma_iommu_ops);
> - set_iommu_table_base(&pdev->dev, &pe->tce32_table);
> + set_iommu_table_base(&pdev->dev, &pe->table_group.tables[0]);
> }
> *pdev->dev.dma_mask = dma_mask;
> return 0;
> @@ -1053,9 +1054,10 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
> list_for_each_entry(dev, &bus->devices, bus_list) {
> if (add_to_iommu_group)
> set_iommu_table_base_and_group(&dev->dev,
> - &pe->tce32_table);
> + &pe->table_group.tables[0]);
> else
> - set_iommu_table_base(&dev->dev, &pe->tce32_table);
> + set_iommu_table_base(&dev->dev,
> + &pe->table_group.tables[0]);
>
> if (dev->subordinate)
> pnv_ioda_setup_bus_dma(pe, dev->subordinate,
> @@ -1145,8 +1147,8 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
> void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
> __be64 *startp, __be64 *endp, bool rm)
> {
> - struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
> - tce32_table);
> + struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe,
> + table_group);
> struct pnv_phb *phb = pe->phb;
>
> if (phb->type == PNV_PHB_IODA1)
> @@ -1211,8 +1213,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> }
> }
>
> + /* Setup iommu */
> + pe->table_group.tables[0].it_group = &pe->table_group;
> +
> /* Setup linux iommu table */
> - tbl = &pe->tce32_table;
> + tbl = &pe->table_group.tables[0];
> pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
> base << 28, IOMMU_PAGE_SHIFT_4K);
>
> @@ -1233,7 +1238,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> }
> tbl->it_ops = &pnv_iommu_ops;
> iommu_init_table(tbl, phb->hose->node);
> - iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
> + iommu_register_group(&pe->table_group, phb->hose->global_number,
> + pe->pe_number);
>
> if (pe->pdev)
> set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
> @@ -1251,8 +1257,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>
> static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
> {
> - struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
> - tce32_table);
> + struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe,
> + table_group);
> uint16_t window_id = (pe->pe_number << 1 ) + 1;
> int64_t rc;
>
> @@ -1297,10 +1303,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
> pe->tce_bypass_base = 1ull << 59;
>
> /* Install set_bypass callback for VFIO */
> - pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
> + pe->table_group.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;
>
> /* Enable bypass by default */
> - pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
> + pnv_pci_ioda2_set_bypass(&pe->table_group.tables[0], true);
> }
>
> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> @@ -1347,8 +1353,11 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> goto fail;
> }
>
> + /* Setup iommu */
> + pe->table_group.tables[0].it_group = &pe->table_group;
> +
> /* Setup linux iommu table */
> - tbl = &pe->tce32_table;
> + tbl = &pe->table_group.tables[0];
> pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
> IOMMU_PAGE_SHIFT_4K);
>
> @@ -1367,7 +1376,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> }
> tbl->it_ops = &pnv_iommu_ops;
> iommu_init_table(tbl, phb->hose->node);
> - iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
> + iommu_register_group(&pe->table_group, phb->hose->global_number,
> + pe->pe_number);
>
> if (pe->pdev)
> set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
> diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> index 0256fcc..ff68cac 100644
> --- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> +++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> @@ -86,14 +86,16 @@ static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { }
> static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
> struct pci_dev *pdev)
> {
> - if (phb->p5ioc2.iommu_table.it_map == NULL) {
> - phb->p5ioc2.iommu_table.it_ops = &pnv_iommu_ops;
> - iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
> - iommu_register_group(&phb->p5ioc2.iommu_table,
> + if (phb->p5ioc2.table_group.tables[0].it_map == NULL) {
> + phb->p5ioc2.table_group.tables[0].it_ops = &pnv_iommu_ops;
> + iommu_init_table(&phb->p5ioc2.table_group.tables[0],
> + phb->hose->node);
> + iommu_register_group(&phb->p5ioc2.table_group,
> pci_domain_nr(phb->hose->bus), phb->opal_id);
> }
>
> - set_iommu_table_base_and_group(&pdev->dev, &phb->p5ioc2.iommu_table);
> + set_iommu_table_base_and_group(&pdev->dev,
> + &phb->p5ioc2.table_group.tables[0]);
> }
>
> static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
> @@ -167,9 +169,12 @@ static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
> /* Setup MSI support */
> pnv_pci_init_p5ioc2_msis(phb);
>
> + /* Setup iommu */
> + phb->p5ioc2.table_group.tables[0].it_group = &phb->p5ioc2.table_group;
> +
> /* Setup TCEs */
> phb->dma_dev_setup = pnv_pci_p5ioc2_dma_dev_setup;
> - pnv_pci_setup_iommu_table(&phb->p5ioc2.iommu_table,
> + pnv_pci_setup_iommu_table(&phb->p5ioc2.table_group.tables[0],
> tce_mem, tce_size, 0,
> IOMMU_PAGE_SHIFT_4K);
> }
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index 1c31ac8..3050cc8 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -687,7 +687,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
> be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
> tbl->it_ops = &pnv_iommu_ops;
> iommu_init_table(tbl, hose->node);
> - iommu_register_group(tbl, pci_domain_nr(hose->bus), 0);
> + iommu_register_group(tbl->it_group, pci_domain_nr(hose->bus), 0);
>
> /* Deal with SW invalidated TCEs when needed (BML way) */
> swinvp = of_get_property(hose->dn, "linux,tce-sw-invalidate-info",
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index f726700..762d906 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -53,7 +53,7 @@ struct pnv_ioda_pe {
> /* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
> int tce32_seg;
> int tce32_segcount;
> - struct iommu_table tce32_table;
> + struct iommu_table_group table_group;
> phys_addr_t tce_inval_reg_phys;
>
> /* 64-bit TCE bypass region */
> @@ -138,7 +138,7 @@ struct pnv_phb {
>
> union {
> struct {
> - struct iommu_table iommu_table;
> + struct iommu_table_group table_group;
> } p5ioc2;
>
> struct {
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 41a8b14..75ea581 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -622,7 +622,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
> iommu_table_setparms(pci->phb, dn, tbl);
> tbl->it_ops = &iommu_table_pseries_ops;
> pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
> - iommu_register_group(tbl, pci_domain_nr(bus), 0);
> + iommu_register_group(tbl->it_group, pci_domain_nr(bus), 0);
>
> /* Divide the rest (1.75GB) among the children */
> pci->phb->dma_window_size = 0x80000000ul;
> @@ -672,7 +672,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
> iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
> tbl->it_ops = &iommu_table_lpar_multi_ops;
> ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
> - iommu_register_group(tbl, pci_domain_nr(bus), 0);
> + iommu_register_group(tbl->it_group, pci_domain_nr(bus), 0);
> pr_debug(" created table: %p\n", ppci->iommu_table);
> }
> }
> @@ -699,7 +699,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
> iommu_table_setparms(phb, dn, tbl);
> tbl->it_ops = &iommu_table_pseries_ops;
> PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
> - iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
> + iommu_register_group(tbl->it_group, pci_domain_nr(phb->bus), 0);
> set_iommu_table_base_and_group(&dev->dev,
> PCI_DN(dn)->iommu_table);
> return;
> @@ -1121,7 +1121,8 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
> iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
> tbl->it_ops = &iommu_table_lpar_multi_ops;
> pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
> - iommu_register_group(tbl, pci_domain_nr(pci->phb->bus), 0);
> + iommu_register_group(tbl->it_group,
> + pci_domain_nr(pci->phb->bus), 0);
> pr_debug(" created table: %p\n", pci->iommu_table);
> } else {
> pr_debug(" found DMA window, table: %p\n", pci->iommu_table);
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 244c958..d61aad2 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -88,7 +88,7 @@ static void decrement_locked_vm(long npages)
> */
> struct tce_container {
> struct mutex lock;
> - struct iommu_table *tbl;
> + struct iommu_group *grp;
> bool enabled;
> unsigned long locked_pages;
> };
> @@ -103,13 +103,41 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift)
> return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift;
> }
>
> +static struct iommu_table *spapr_tce_find_table(
> + struct tce_container *container,
> + phys_addr_t ioba)
> +{
> + long i;
> + struct iommu_table *ret = NULL;
> + struct iommu_table_group *table_group;
> +
> + table_group = iommu_group_get_iommudata(container->grp);
> + if (!table_group)
> + return NULL;
> +
> + for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> + struct iommu_table *tbl = &table_group->tables[i];
> + unsigned long entry = ioba >> tbl->it_page_shift;
> + unsigned long start = tbl->it_offset;
> + unsigned long end = start + tbl->it_size;
> +
> + if ((start <= entry) && (entry < end)) {
> + ret = tbl;
> + break;
> + }
> + }
> +
> + return ret;
> +}
> +
> static int tce_iommu_enable(struct tce_container *container)
> {
> int ret = 0;
> unsigned long locked;
> - struct iommu_table *tbl = container->tbl;
> + struct iommu_table *tbl;
> + struct iommu_table_group *table_group;
>
> - if (!container->tbl)
> + if (!container->grp)
> return -ENXIO;
>
> if (!current->mm)
> @@ -143,6 +171,11 @@ static int tce_iommu_enable(struct tce_container *container)
> * as this information is only available from KVM and VFIO is
> * KVM agnostic.
> */
> + table_group = iommu_group_get_iommudata(container->grp);
> + if (!table_group)
> + return -ENODEV;
> +
> + tbl = &table_group->tables[0];
> locked = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
> ret = try_increment_locked_vm(locked);
> if (ret)
> @@ -193,15 +226,17 @@ static int tce_iommu_clear(struct tce_container *container,
> static void tce_iommu_release(void *iommu_data)
> {
> struct tce_container *container = iommu_data;
> - struct iommu_table *tbl = container->tbl;
> + struct iommu_table *tbl;
> + struct iommu_table_group *table_group;
>
> - WARN_ON(tbl && !tbl->it_group);
> + WARN_ON(container->grp);
>
> - if (tbl) {
> + if (container->grp) {
> + table_group = iommu_group_get_iommudata(container->grp);
> + tbl = &table_group->tables[0];
> tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
>
> - if (tbl->it_group)
> - tce_iommu_detach_group(iommu_data, tbl->it_group);
> + tce_iommu_detach_group(iommu_data, container->grp);
> }
>
> tce_iommu_disable(container);
> @@ -329,9 +364,16 @@ static long tce_iommu_ioctl(void *iommu_data,
>
> case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
> struct vfio_iommu_spapr_tce_info info;
> - struct iommu_table *tbl = container->tbl;
> + struct iommu_table *tbl;
> + struct iommu_table_group *table_group;
>
> - if (WARN_ON(!tbl))
> + if (WARN_ON(!container->grp))
> + return -ENXIO;
> +
> + table_group = iommu_group_get_iommudata(container->grp);
> +
> + tbl = &table_group->tables[0];
> + if (WARN_ON_ONCE(!tbl))
> return -ENXIO;
>
> minsz = offsetofend(struct vfio_iommu_spapr_tce_info,
> @@ -354,17 +396,12 @@ static long tce_iommu_ioctl(void *iommu_data,
> }
> case VFIO_IOMMU_MAP_DMA: {
> struct vfio_iommu_type1_dma_map param;
> - struct iommu_table *tbl = container->tbl;
> + struct iommu_table *tbl;
> unsigned long tce;
>
> if (!container->enabled)
> return -EPERM;
>
> - if (!tbl)
> - return -ENXIO;
> -
> - BUG_ON(!tbl->it_group);
> -
> minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
>
> if (copy_from_user(¶m, (void __user *)arg, minsz))
> @@ -377,6 +414,10 @@ static long tce_iommu_ioctl(void *iommu_data,
> VFIO_DMA_MAP_FLAG_WRITE))
> return -EINVAL;
>
> + tbl = spapr_tce_find_table(container, param.iova);
> + if (!tbl)
> + return -ENXIO;
> +
> if ((param.size & ~IOMMU_PAGE_MASK(tbl)) ||
> (param.vaddr & ~IOMMU_PAGE_MASK(tbl)))
> return -EINVAL;
> @@ -402,14 +443,11 @@ static long tce_iommu_ioctl(void *iommu_data,
> }
> case VFIO_IOMMU_UNMAP_DMA: {
> struct vfio_iommu_type1_dma_unmap param;
> - struct iommu_table *tbl = container->tbl;
> + struct iommu_table *tbl;
>
> if (!container->enabled)
> return -EPERM;
>
> - if (WARN_ON(!tbl))
> - return -ENXIO;
> -
> minsz = offsetofend(struct vfio_iommu_type1_dma_unmap,
> size);
>
> @@ -423,6 +461,10 @@ static long tce_iommu_ioctl(void *iommu_data,
> if (param.flags)
> return -EINVAL;
>
> + tbl = spapr_tce_find_table(container, param.iova);
> + if (!tbl)
> + return -ENXIO;
> +
> if (param.size & ~IOMMU_PAGE_MASK(tbl))
> return -EINVAL;
>
> @@ -451,10 +493,10 @@ static long tce_iommu_ioctl(void *iommu_data,
> mutex_unlock(&container->lock);
> return 0;
> case VFIO_EEH_PE_OP:
> - if (!container->tbl || !container->tbl->it_group)
> + if (!container->grp)
> return -ENODEV;
>
> - return vfio_spapr_iommu_eeh_ioctl(container->tbl->it_group,
> + return vfio_spapr_iommu_eeh_ioctl(container->grp,
> cmd, arg);
> }
>
> @@ -466,16 +508,15 @@ static int tce_iommu_attach_group(void *iommu_data,
> {
> int ret;
> struct tce_container *container = iommu_data;
> - struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
> + struct iommu_table_group *table_group;
>
> - BUG_ON(!tbl);
> mutex_lock(&container->lock);
>
> /* pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
> iommu_group_id(iommu_group), iommu_group); */
> - if (container->tbl) {
> + if (container->grp) {
> pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
> - iommu_group_id(container->tbl->it_group),
> + iommu_group_id(container->grp),
> iommu_group_id(iommu_group));
> ret = -EBUSY;
> goto unlock_exit;
> @@ -488,9 +529,15 @@ static int tce_iommu_attach_group(void *iommu_data,
> goto unlock_exit;
> }
>
> - ret = iommu_take_ownership(tbl);
> + table_group = iommu_group_get_iommudata(iommu_group);
> + if (!table_group) {
> + ret = -ENXIO;
> + goto unlock_exit;
> + }
> +
> + ret = iommu_take_ownership(&table_group->tables[0]);
> if (!ret)
> - container->tbl = tbl;
> + container->grp = iommu_group;
>
> unlock_exit:
> mutex_unlock(&container->lock);
> @@ -502,27 +549,30 @@ static void tce_iommu_detach_group(void *iommu_data,
> struct iommu_group *iommu_group)
> {
> struct tce_container *container = iommu_data;
> - struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
> + struct iommu_table_group *table_group;
>
> - BUG_ON(!tbl);
> mutex_lock(&container->lock);
> - if (tbl != container->tbl) {
> + if (iommu_group != container->grp) {
> pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
> iommu_group_id(iommu_group),
> - iommu_group_id(tbl->it_group));
> + iommu_group_id(container->grp));
> goto unlock_exit;
> }
>
> if (container->enabled) {
> pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
> - iommu_group_id(tbl->it_group));
> + iommu_group_id(container->grp));
> tce_iommu_disable(container);
> }
>
> /* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
> iommu_group_id(iommu_group), iommu_group); */
> - container->tbl = NULL;
> - iommu_release_ownership(tbl);
> + container->grp = NULL;
> +
> + table_group = iommu_group_get_iommudata(iommu_group);
> + BUG_ON(!table_group);
> +
> + iommu_release_ownership(&table_group->tables[0]);
>
> unlock_exit:
> mutex_unlock(&container->lock);
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
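The spapr_tce_find_table() helper added in the patch above picks the DMA window whose entry range covers a given bus address, converting the address to an entry number with each window's own page shift. A self-contained userspace sketch of that lookup (struct and field names here are illustrative stand-ins, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WINDOWS 2

/* Illustrative stand-in for struct iommu_table: offset and size are
 * counted in IOMMU pages of this window's own page size. */
struct window {
    unsigned int page_shift;  /* 12 for 4K pages, 16 for 64K, ... */
    unsigned long offset;     /* first entry number covered */
    unsigned long size;       /* number of entries */
};

/* Same shape as the lookup in spapr_tce_find_table(): shift the bus
 * address by the window's page shift to get an entry number, then
 * test it against [offset, offset + size). Returns the index of the
 * matching window, or -1 if none covers ioba. */
static int find_window(const struct window *w, int n, uint64_t ioba)
{
    for (int i = 0; i < n; i++) {
        unsigned long entry = ioba >> w[i].page_shift;

        if (entry >= w[i].offset && entry < w[i].offset + w[i].size)
            return i;
    }
    return -1;
}

/* Example layout (sizes made up for the demo): a 4K-page window at
 * bus address 0 and a 64K-page window starting at 64GB. */
static const struct window demo[MAX_WINDOWS] = {
    { 12, 0,         1UL << 20 }, /* entries 0..2^20: ioba 0 .. 4GB */
    { 16, 1UL << 20, 1UL << 20 }  /* ioba 64GB .. 128GB */
};
```

Because each window carries its own page shift, the same ioba is shifted differently per window, which is why the kernel code recomputes `entry` inside the loop.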
On Fri, Apr 10, 2015 at 04:30:55PM +1000, Alexey Kardashevskiy wrote:
> This replaces iommu_take_ownership()/iommu_release_ownership() calls
> with the callback calls and it is up to the platform code to call
> iommu_take_ownership()/iommu_release_ownership() if needed.
I think this commit message is out of date - I don't see any callbacks
here.
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> arch/powerpc/include/asm/iommu.h | 4 +--
> arch/powerpc/kernel/iommu.c | 50 ++++++++++++++++++++++++++++---------
> drivers/vfio/vfio_iommu_spapr_tce.c | 4 +--
> 3 files changed, 42 insertions(+), 16 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 667aa1a..b9e50d3 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -225,8 +225,8 @@ extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
> unsigned long entry);
>
> extern void iommu_flush_tce(struct iommu_table *tbl);
> -extern int iommu_take_ownership(struct iommu_table *tbl);
> -extern void iommu_release_ownership(struct iommu_table *tbl);
> +extern int iommu_take_ownership(struct iommu_table_group *table_group);
> +extern void iommu_release_ownership(struct iommu_table_group *table_group);
>
> extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
> extern unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir);
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index fd49c8e..7d6089b 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -1050,7 +1050,7 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
> }
> EXPORT_SYMBOL_GPL(iommu_tce_build);
>
> -int iommu_take_ownership(struct iommu_table *tbl)
> +static int iommu_table_take_ownership(struct iommu_table *tbl)
> {
> unsigned long sz = (tbl->it_size + 7) >> 3;
>
> @@ -1064,19 +1064,36 @@ int iommu_take_ownership(struct iommu_table *tbl)
>
> memset(tbl->it_map, 0xff, sz);
>
> - /*
> - * Disable iommu bypass, otherwise the user can DMA to all of
> - * our physical memory via the bypass window instead of just
> - * the pages that has been explicitly mapped into the iommu
> - */
> - if (tbl->set_bypass)
> - tbl->set_bypass(tbl, false);
The code to disable bypass is removed, and doesn't seem to be replaced
with anything. That doesn't look safe.
> + return 0;
> +}
> +
> +static void iommu_table_release_ownership(struct iommu_table *tbl);
> +
> +int iommu_take_ownership(struct iommu_table_group *table_group)
> +{
> + int i, j, rc = 0;
> +
> + for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> + struct iommu_table *tbl = &table_group->tables[i];
> +
> + if (!tbl->it_map)
> + continue;
> +
> + rc = iommu_table_take_ownership(tbl);
> + if (rc) {
> + for (j = 0; j < i; ++j)
> + iommu_table_release_ownership(
> + &table_group->tables[j]);
> +
> + return rc;
> + }
> + }
>
> return 0;
> }
> EXPORT_SYMBOL_GPL(iommu_take_ownership);
>
> -void iommu_release_ownership(struct iommu_table *tbl)
> +static void iommu_table_release_ownership(struct iommu_table *tbl)
> {
> unsigned long sz = (tbl->it_size + 7) >> 3;
>
> @@ -1086,9 +1103,18 @@ void iommu_release_ownership(struct iommu_table *tbl)
> if (tbl->it_offset == 0)
> set_bit(0, tbl->it_map);
>
> - /* The kernel owns the device now, we can restore the iommu bypass */
> - if (tbl->set_bypass)
> - tbl->set_bypass(tbl, true);
> +}
> +
> +extern void iommu_release_ownership(struct iommu_table_group *table_group)
> +{
> + int i;
> +
> + for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> + struct iommu_table *tbl = &table_group->tables[i];
> +
> + if (tbl->it_map)
> + iommu_table_release_ownership(tbl);
> + }
> }
> EXPORT_SYMBOL_GPL(iommu_release_ownership);
>
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index d61aad2..9f38351 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -535,7 +535,7 @@ static int tce_iommu_attach_group(void *iommu_data,
> goto unlock_exit;
> }
>
> - ret = iommu_take_ownership(&table_group->tables[0]);
> + ret = iommu_take_ownership(table_group);
> if (!ret)
> container->grp = iommu_group;
>
> @@ -572,7 +572,7 @@ static void tce_iommu_detach_group(void *iommu_data,
> table_group = iommu_group_get_iommudata(iommu_group);
> BUG_ON(!table_group);
>
> - iommu_release_ownership(&table_group->tables[0]);
> + iommu_release_ownership(table_group);
>
> unlock_exit:
> mutex_unlock(&container->lock);
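The group-level iommu_take_ownership() in the patch above takes each table in turn and, if table i fails, releases tables 0..i-1 so a failed attempt leaves the group untouched. A minimal userspace model of that rollback pattern (names and types are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

#define GROUP_MAX_TABLES 3

/* Illustrative model of one table: "busy" stands in for a non-empty
 * it_map, which is what makes take-ownership fail in the patch. */
struct table {
    bool present; /* it_map != NULL in the real code */
    bool busy;    /* already carries live mappings */
    bool owned;   /* currently owned by the external user */
};

static int table_take(struct table *t)
{
    if (t->busy)
        return -1; /* stands in for -EBUSY */
    t->owned = true;
    return 0;
}

static void table_release(struct table *t)
{
    t->owned = false;
}

/* Same shape as the group-level loop: if table i cannot be taken,
 * release tables 0..i-1 so a failed attempt has no lasting effect. */
static int group_take(struct table *tbl, int n)
{
    for (int i = 0; i < n; i++) {
        if (!tbl[i].present)
            continue;
        if (table_take(&tbl[i]) != 0) {
            for (int j = 0; j < i; j++)
                if (tbl[j].present)
                    table_release(&tbl[j]);
            return -1;
        }
    }
    return 0;
}

/* Returns 1 iff a failure mid-group rolls every earlier table back. */
static int demo_rollback(void)
{
    struct table g[GROUP_MAX_TABLES] = {
        { true, false, false },
        { true, false, false },
        { true, true,  false }, /* busy: taking this one fails */
    };

    return group_take(g, GROUP_MAX_TABLES) != 0 &&
           !g[0].owned && !g[1].owned && !g[2].owned;
}
```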
On Fri, Apr 10, 2015 at 04:30:56PM +1000, Alexey Kardashevskiy wrote:
> At the moment the iommu_table struct has a set_bypass() which enables/
> disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
> which calls this callback when external IOMMU users such as VFIO are
> about to get over a PHB.
>
> The set_bypass() callback is not really an iommu_table function but
> IOMMU/PE function. This introduces an iommu_table_group_ops struct and
> adds a set_ownership() callback to it which is called when an external
> user takes control over the IOMMU.
Do you really need separate ops structures at both the single table
and table group level? The different tables in a group will all
belong to the same basic iommu, won't they?
> This renames set_bypass() to set_ownership() as it is not necessarily
> just enabling bypassing, it can be something else/more so let's give it
> a more generic name. The bool parameter is inverted.
>
> The callback is implemented for IODA2 only. Other platforms (P5IOC2,
> IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> arch/powerpc/include/asm/iommu.h | 14 +++++++++++++-
> arch/powerpc/platforms/powernv/pci-ioda.c | 30 ++++++++++++++++++++++--------
> drivers/vfio/vfio_iommu_spapr_tce.c | 25 +++++++++++++++++++++----
> 3 files changed, 56 insertions(+), 13 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index b9e50d3..d1f8c6c 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -92,7 +92,6 @@ struct iommu_table {
> unsigned long it_page_shift;/* table iommu page size */
> struct iommu_table_group *it_group;
> struct iommu_table_ops *it_ops;
> - void (*set_bypass)(struct iommu_table *tbl, bool enable);
> };
>
> /* Pure 2^n version of get_order */
> @@ -127,11 +126,24 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>
> #define IOMMU_TABLE_GROUP_MAX_TABLES 1
>
> +struct iommu_table_group;
> +
> +struct iommu_table_group_ops {
> + /*
> + * Switches ownership from the kernel itself to an external
> + * user. While ownership is enabled, the kernel cannot use IOMMU
> + * for itself.
> + */
> + void (*set_ownership)(struct iommu_table_group *table_group,
> + bool enable);
The meaning of "enable" in a function called "set_ownership" is
entirely obscure.
> +};
> +
> struct iommu_table_group {
> #ifdef CONFIG_IOMMU_API
> struct iommu_group *group;
> #endif
> struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
> + struct iommu_table_group_ops *ops;
> };
>
> #ifdef CONFIG_IOMMU_API
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index a964c50..9687731 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1255,10 +1255,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
> }
>
> -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
> +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
> {
> - struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe,
> - table_group);
> uint16_t window_id = (pe->pe_number << 1 ) + 1;
> int64_t rc;
>
> @@ -1286,7 +1284,8 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
> * host side.
> */
> if (pe->pdev)
> - set_iommu_table_base(&pe->pdev->dev, tbl);
> + set_iommu_table_base(&pe->pdev->dev,
> + &pe->table_group.tables[0]);
> else
> pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
> }
> @@ -1302,13 +1301,27 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
> /* TVE #1 is selected by PCI address bit 59 */
> pe->tce_bypass_base = 1ull << 59;
>
> - /* Install set_bypass callback for VFIO */
> - pe->table_group.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;
> -
> /* Enable bypass by default */
> - pnv_pci_ioda2_set_bypass(&pe->table_group.tables[0], true);
> + pnv_pci_ioda2_set_bypass(pe, true);
> }
>
> +static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
> + bool enable)
> +{
> + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> + table_group);
> + if (enable)
> + iommu_take_ownership(table_group);
> + else
> + iommu_release_ownership(table_group);
> +
> + pnv_pci_ioda2_set_bypass(pe, !enable);
> +}
> +
> +static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
> + .set_ownership = pnv_ioda2_set_ownership,
> +};
> +
> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> struct pnv_ioda_pe *pe)
> {
> @@ -1376,6 +1389,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> }
> tbl->it_ops = &pnv_iommu_ops;
> iommu_init_table(tbl, phb->hose->node);
> + pe->table_group.ops = &pnv_pci_ioda2_ops;
> iommu_register_group(&pe->table_group, phb->hose->global_number,
> pe->pe_number);
>
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 9f38351..d5d8c50 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -535,9 +535,22 @@ static int tce_iommu_attach_group(void *iommu_data,
> goto unlock_exit;
> }
>
> - ret = iommu_take_ownership(table_group);
> - if (!ret)
> - container->grp = iommu_group;
> + if (!table_group->ops || !table_group->ops->set_ownership) {
> + ret = iommu_take_ownership(table_group);
> + } else {
> + /*
> + * Disable iommu bypass, otherwise the user can DMA to all of
> + * our physical memory via the bypass window instead of just
> + * the pages that have been explicitly mapped into the iommu
> + */
> + table_group->ops->set_ownership(table_group, true);
And here to disable bypass you call it with enable=true, so it doesn't
even have the same meaning as it used to.
Plus, you should fold the logic to call the callback if necessary into
iommu_take_ownership().
> + ret = 0;
> + }
> +
> + if (ret)
> + goto unlock_exit;
> +
> + container->grp = iommu_group;
>
> unlock_exit:
> mutex_unlock(&container->lock);
> @@ -572,7 +585,11 @@ static void tce_iommu_detach_group(void *iommu_data,
> table_group = iommu_group_get_iommudata(iommu_group);
> BUG_ON(!table_group);
>
> - iommu_release_ownership(table_group);
> + /* Kernel owns the device now, we can restore bypass */
> + if (!table_group->ops || !table_group->ops->set_ownership)
> + iommu_release_ownership(table_group);
> + else
> + table_group->ops->set_ownership(table_group, false);
Likewise fold this if into iommu_release_ownership().
> unlock_exit:
> mutex_unlock(&container->lock);
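The reviewer's suggestion above is to fold the "callback vs. default path" choice into iommu_take_ownership()/iommu_release_ownership() itself, leaving VFIO with a single entry point per direction. A hypothetical userspace sketch of that shape (all names invented for illustration; this is not the actual kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct table_group;

/* Invented ops structure mirroring the shape under review: true
 * means "an external user owns the group" (so bypass goes off). */
struct table_group_ops {
    void (*set_ownership)(struct table_group *grp, bool external);
};

struct table_group {
    const struct table_group_ops *ops; /* NULL on P5IOC2/IODA1 */
    bool external_owner;
    bool bypass_enabled;
};

/* Default paths used when a PHB provides no callback. */
static void default_take(struct table_group *grp)
{
    grp->external_owner = true;
}

static void default_release(struct table_group *grp)
{
    grp->external_owner = false;
}

/* Single entry points: callers never test for ops themselves. */
static int take_ownership(struct table_group *grp)
{
    if (grp->ops && grp->ops->set_ownership)
        grp->ops->set_ownership(grp, true);
    else
        default_take(grp);
    return 0;
}

static void release_ownership(struct table_group *grp)
{
    if (grp->ops && grp->ops->set_ownership)
        grp->ops->set_ownership(grp, false);
    else
        default_release(grp);
}

/* IODA2-style callback: ownership changes also flip bypass. */
static void ioda2_set_ownership(struct table_group *grp, bool external)
{
    grp->external_owner = external;
    grp->bypass_enabled = !external;
}

static const struct table_group_ops ioda2_ops = {
    .set_ownership = ioda2_set_ownership,
};

/* Returns 1 iff taking ownership disables bypass and releasing
 * ownership restores it. */
static int demo_ioda2(void)
{
    struct table_group g = { &ioda2_ops, false, true };
    int ok;

    take_ownership(&g);
    ok = g.external_owner && !g.bypass_enabled;
    release_ownership(&g);
    return ok && !g.external_owner && g.bypass_enabled;
}
```

With this shape the `enable` ambiguity the reviewer points out is also contained in one place rather than duplicated at every call site.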
On Fri, Apr 10, 2015 at 04:30:57PM +1000, Alexey Kardashevskiy wrote:
> This adds missing locks in iommu_take_ownership()/
> iommu_release_ownership().
>
> This marks all pages busy in iommu_table::it_map in order to catch
> errors if there is an attempt to use this table while ownership over it
> is taken.
>
> This only clears TCE content if there is no page marked busy in it_map.
> Clearing must be done outside of the table locks as iommu_clear_tce()
> called from iommu_clear_tces_and_put_pages() does this.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> Changes:
> v5:
> * do not store bit#0 value, it has to be set for zero-based table
> anyway
> * removed test_and_clear_bit
> ---
> arch/powerpc/kernel/iommu.c | 26 ++++++++++++++++++++++----
> 1 file changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 7d6089b..068fe4ff 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -1052,17 +1052,28 @@ EXPORT_SYMBOL_GPL(iommu_tce_build);
>
> static int iommu_table_take_ownership(struct iommu_table *tbl)
> {
> - unsigned long sz = (tbl->it_size + 7) >> 3;
> + unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
> + int ret = 0;
> +
> + spin_lock_irqsave(&tbl->large_pool.lock, flags);
> + for (i = 0; i < tbl->nr_pools; i++)
> + spin_lock(&tbl->pools[i].lock);
>
> if (tbl->it_offset == 0)
> clear_bit(0, tbl->it_map);
>
> if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
> pr_err("iommu_tce: it_map is not empty");
> - return -EBUSY;
> + ret = -EBUSY;
> + if (tbl->it_offset == 0)
> + set_bit(0, tbl->it_map);
This really needs a comment. Why on earth are you changing the it_map
on a failure case?
> + } else {
> + memset(tbl->it_map, 0xff, sz);
> }
>
> - memset(tbl->it_map, 0xff, sz);
> + for (i = 0; i < tbl->nr_pools; i++)
> + spin_unlock(&tbl->pools[i].lock);
> + spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
>
> return 0;
> }
> @@ -1095,7 +1106,11 @@ EXPORT_SYMBOL_GPL(iommu_take_ownership);
>
> static void iommu_table_release_ownership(struct iommu_table *tbl)
> {
> - unsigned long sz = (tbl->it_size + 7) >> 3;
> + unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
> +
> + spin_lock_irqsave(&tbl->large_pool.lock, flags);
> + for (i = 0; i < tbl->nr_pools; i++)
> + spin_lock(&tbl->pools[i].lock);
>
> memset(tbl->it_map, 0, sz);
>
> @@ -1103,6 +1118,9 @@ static void iommu_table_release_ownership(struct iommu_table *tbl)
> if (tbl->it_offset == 0)
> set_bit(0, tbl->it_map);
>
> + for (i = 0; i < tbl->nr_pools; i++)
> + spin_unlock(&tbl->pools[i].lock);
> + spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
> }
>
> extern void iommu_release_ownership(struct iommu_table_group *table_group)
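The it_map manipulation under discussion can be modeled in userspace: bit 0 is kept reserved for zero-based tables, so it is cleared before the "is anything mapped?" check and, plausibly, must be restored when the check fails so the table stays usable, which would answer the question about touching the map on the error path. A sketch under those assumptions (toy types; the sketch also returns the error code the patch appears to intend):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAP_BYTES 8 /* a 64-entry toy table */

/* Toy stand-in for iommu_table: offset == 0 means a zero-based
 * table, for which bit 0 of it_map is kept reserved. */
struct table_map {
    unsigned long offset;
    uint8_t map[MAP_BYTES];
};

static int map_empty(const uint8_t *map)
{
    for (int i = 0; i < MAP_BYTES; i++)
        if (map[i])
            return 0;
    return 1;
}

/* Same shape as iommu_table_take_ownership(): drop the reserved
 * bit, test for live mappings, then either mark every entry busy or
 * put the reserved bit back so a failed attempt changes nothing. */
static int take_ownership_map(struct table_map *t)
{
    if (t->offset == 0)
        t->map[0] &= (uint8_t)~1u; /* clear_bit(0, it_map) */

    if (!map_empty(t->map)) {
        if (t->offset == 0)
            t->map[0] |= 1u; /* restore: table stays usable */
        return -1;           /* stands in for -EBUSY */
    }

    memset(t->map, 0xff, MAP_BYTES); /* every entry now "busy" */
    return 0;
}

/* Returns 1 iff failure leaves the map exactly as it was. */
static int demo_busy(void)
{
    struct table_map t = { 0, { 0 } };

    t.map[0] = 1;    /* reserved bit */
    t.map[3] = 0x10; /* one live mapping elsewhere */
    return take_ownership_map(&t) != 0 &&
           (t.map[0] & 1) && t.map[3] == 0x10;
}

/* Returns 1 iff an otherwise-empty map is taken and fully marked. */
static int demo_free(void)
{
    struct table_map t = { 0, { 0 } };

    t.map[0] = 1; /* reserved bit only */
    return take_ownership_map(&t) == 0 && t.map[0] == 0xff;
}
```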
On Fri, Apr 10, 2015 at 04:30:58PM +1000, Alexey Kardashevskiy wrote:
> The pnv_pci_ioda_tce_invalidate() helper invalidates the TCE cache. It is
> supposed to be called on IODA1/2 and not called on p5ioc2. It receives
> start and end host addresses of TCE table. This approach makes it possible
> to get pnv_pci_ioda_tce_invalidate() unintentionally called on
> p5ioc2.
It's not clear what passing start and end addresses has to do with
unintentionally calling the wrong invalidate function.
> Another issue is that IODA2 needs PCI addresses to invalidate the cache
> and those can be calculated from host addresses but since we are going
> to implement multi-level TCE tables, calculating PCI address from
> a host address might get either tricky or ugly as TCE table remains flat
> on PCI bus but not in RAM.
>
> This defines separate iommu_table_ops callbacks for p5ioc2 and IODA1/2
> PHBs. They all call common pnv_tce_build/pnv_tce_free/pnv_tce_get helpers
> but call PHB specific TCE invalidation helper (when needed).
>
> This changes pnv_pci_ioda2_tce_invalidate() to receive a TCE index and
> number of pages which are PCI addresses shifted by IOMMU page shift.
>
> The patch is pretty mechanical and behaviour is not expected to change.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> arch/powerpc/platforms/powernv/pci-ioda.c | 92 ++++++++++++++++++++++-------
> arch/powerpc/platforms/powernv/pci-p5ioc2.c | 9 ++-
> arch/powerpc/platforms/powernv/pci.c | 76 +++++++++---------------
> arch/powerpc/platforms/powernv/pci.h | 7 ++-
> 4 files changed, 111 insertions(+), 73 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 9687731..fd993bc 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1065,18 +1065,20 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
> }
> }
>
> -static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
> - struct iommu_table *tbl,
> - __be64 *startp, __be64 *endp, bool rm)
> +static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
> + unsigned long index, unsigned long npages, bool rm)
> {
> + struct pnv_ioda_pe *pe = container_of(tbl->it_group,
> + struct pnv_ioda_pe, table_group);
> __be64 __iomem *invalidate = rm ?
> (__be64 __iomem *)pe->tce_inval_reg_phys :
> (__be64 __iomem *)tbl->it_index;
> unsigned long start, end, inc;
> const unsigned shift = tbl->it_page_shift;
>
> - start = __pa(startp);
> - end = __pa(endp);
> + start = __pa((__be64 *)tbl->it_base + index - tbl->it_offset);
> + end = __pa((__be64 *)tbl->it_base + index - tbl->it_offset +
> + npages - 1);
>
> /* BML uses this case for p6/p7/galaxy2: Shift addr and put in node */
> if (tbl->it_busno) {
> @@ -1112,10 +1114,40 @@ static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
> */
> }
>
> -static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
> - struct iommu_table *tbl,
> - __be64 *startp, __be64 *endp, bool rm)
> +static int pnv_ioda1_tce_build_vm(struct iommu_table *tbl, long index,
What does the "_vm" stand for?
> + long npages, unsigned long uaddr,
> + enum dma_data_direction direction,
> + struct dma_attrs *attrs)
> {
> + long ret = pnv_tce_build(tbl, index, npages, uaddr, direction,
> + attrs);
> +
> + if (!ret && (tbl->it_type & TCE_PCI_SWINV_CREATE))
> + pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false);
> +
> + return ret;
> +}
> +
> +static void pnv_ioda1_tce_free_vm(struct iommu_table *tbl, long index,
> + long npages)
> +{
> + pnv_tce_free(tbl, index, npages);
> +
> + if (tbl->it_type & TCE_PCI_SWINV_FREE)
> + pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false);
> +}
> +
> +struct iommu_table_ops pnv_ioda1_iommu_ops = {
> + .set = pnv_ioda1_tce_build_vm,
> + .clear = pnv_ioda1_tce_free_vm,
> + .get = pnv_tce_get,
> +};
> +
> +static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
> + unsigned long index, unsigned long npages, bool rm)
> +{
> + struct pnv_ioda_pe *pe = container_of(tbl->it_group,
> + struct pnv_ioda_pe, table_group);
> unsigned long start, end, inc;
> __be64 __iomem *invalidate = rm ?
> (__be64 __iomem *)pe->tce_inval_reg_phys :
> @@ -1128,9 +1160,9 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
> end = start;
>
> /* Figure out the start, end and step */
> - inc = tbl->it_offset + (((u64)startp - tbl->it_base) / sizeof(u64));
> + inc = tbl->it_offset + index / sizeof(u64);
> start |= (inc << shift);
> - inc = tbl->it_offset + (((u64)endp - tbl->it_base) / sizeof(u64));
> + inc = tbl->it_offset + (index + npages - 1) / sizeof(u64);
> end |= (inc << shift);
> inc = (0x1ull << shift);
> mb();
> @@ -1144,19 +1176,35 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
> }
> }
>
> -void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
> - __be64 *startp, __be64 *endp, bool rm)
> +static int pnv_ioda2_tce_build_vm(struct iommu_table *tbl, long index,
> + long npages, unsigned long uaddr,
> + enum dma_data_direction direction,
> + struct dma_attrs *attrs)
> {
> - struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe,
> - table_group);
> - struct pnv_phb *phb = pe->phb;
> -
> - if (phb->type == PNV_PHB_IODA1)
> - pnv_pci_ioda1_tce_invalidate(pe, tbl, startp, endp, rm);
> - else
> - pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
> + long ret = pnv_tce_build(tbl, index, npages, uaddr, direction,
> + attrs);
> +
> + if (!ret && (tbl->it_type & TCE_PCI_SWINV_CREATE))
> + pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
> +
> + return ret;
> }
>
> +static void pnv_ioda2_tce_free_vm(struct iommu_table *tbl, long index,
> + long npages)
> +{
> + pnv_tce_free(tbl, index, npages);
> +
> + if (tbl->it_type & TCE_PCI_SWINV_FREE)
> + pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
> +}
> +
> +static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> + .set = pnv_ioda2_tce_build_vm,
> + .clear = pnv_ioda2_tce_free_vm,
> + .get = pnv_tce_get,
> +};
> +
> static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> struct pnv_ioda_pe *pe, unsigned int base,
> unsigned int segs)
> @@ -1236,7 +1284,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> TCE_PCI_SWINV_FREE |
> TCE_PCI_SWINV_PAIR);
> }
> - tbl->it_ops = &pnv_iommu_ops;
> + tbl->it_ops = &pnv_ioda1_iommu_ops;
> iommu_init_table(tbl, phb->hose->node);
> iommu_register_group(&pe->table_group, phb->hose->global_number,
> pe->pe_number);
> @@ -1387,7 +1435,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> 8);
> tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
> }
> - tbl->it_ops = &pnv_iommu_ops;
> + tbl->it_ops = &pnv_ioda2_iommu_ops;
> iommu_init_table(tbl, phb->hose->node);
> pe->table_group.ops = &pnv_pci_ioda2_ops;
> iommu_register_group(&pe->table_group, phb->hose->global_number,
> diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> index ff68cac..6906a9c 100644
> --- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> +++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> @@ -83,11 +83,18 @@ static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb)
> static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { }
> #endif /* CONFIG_PCI_MSI */
>
> +static struct iommu_table_ops pnv_p5ioc2_iommu_ops = {
> + .set = pnv_tce_build,
> + .clear = pnv_tce_free,
> + .get = pnv_tce_get,
> +};
> +
> static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
> struct pci_dev *pdev)
> {
> if (phb->p5ioc2.table_group.tables[0].it_map == NULL) {
> - phb->p5ioc2.table_group.tables[0].it_ops = &pnv_iommu_ops;
> + phb->p5ioc2.table_group.tables[0].it_ops =
> + &pnv_p5ioc2_iommu_ops;
> iommu_init_table(&phb->p5ioc2.table_group.tables[0],
> phb->hose->node);
> iommu_register_group(&phb->p5ioc2.table_group,
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index 3050cc8..a8c05de 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -589,70 +589,48 @@ struct pci_ops pnv_pci_ops = {
> .write = pnv_pci_write_config,
> };
>
> -static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> - unsigned long uaddr, enum dma_data_direction direction,
> - struct dma_attrs *attrs, bool rm)
> +static __be64 *pnv_tce(struct iommu_table *tbl, long index)
> +{
> + __be64 *tmp = ((__be64 *)tbl->it_base);
> +
> + return tmp + index;
> +}
> +
> +int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> + unsigned long uaddr, enum dma_data_direction direction,
> + struct dma_attrs *attrs)
> {
> u64 proto_tce = iommu_direction_to_tce_perm(direction);
> - __be64 *tcep, *tces;
> - u64 rpn;
> + u64 rpn = __pa(uaddr) >> tbl->it_page_shift;
> + long i;
>
> - tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
> - rpn = __pa(uaddr) >> tbl->it_page_shift;
> + for (i = 0; i < npages; i++) {
> + unsigned long newtce = proto_tce |
> + ((rpn + i) << tbl->it_page_shift);
> + unsigned long idx = index - tbl->it_offset + i;
>
> - while (npages--)
> - *(tcep++) = cpu_to_be64(proto_tce |
> - (rpn++ << tbl->it_page_shift));
> -
> - /* Some implementations won't cache invalid TCEs and thus may not
> - * need that flush. We'll probably turn it_type into a bit mask
> - * of flags if that becomes the case
> - */
> - if (tbl->it_type & TCE_PCI_SWINV_CREATE)
> - pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
> + *(pnv_tce(tbl, idx)) = cpu_to_be64(newtce);
> + }
>
> return 0;
> }
>
> -static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long npages,
> - unsigned long uaddr,
> - enum dma_data_direction direction,
> - struct dma_attrs *attrs)
> +void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
> {
> - return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs,
> - false);
> -}
> -
> -static void pnv_tce_free(struct iommu_table *tbl, long index, long npages,
> - bool rm)
> -{
> - __be64 *tcep, *tces;
> -
> - tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
> + long i;
>
> - while (npages--)
> - *(tcep++) = cpu_to_be64(0);
> + for (i = 0; i < npages; i++) {
> + unsigned long idx = index - tbl->it_offset + i;
>
> - if (tbl->it_type & TCE_PCI_SWINV_FREE)
> - pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
> + *(pnv_tce(tbl, idx)) = cpu_to_be64(0);
> + }
> }
>
> -static void pnv_tce_free_vm(struct iommu_table *tbl, long index, long npages)
> +unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
> {
> - pnv_tce_free(tbl, index, npages, false);
> + return *(pnv_tce(tbl, index));
> }
>
> -static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
> -{
> - return ((u64 *)tbl->it_base)[index - tbl->it_offset];
> -}
> -
> -struct iommu_table_ops pnv_iommu_ops = {
> - .set = pnv_tce_build_vm,
> - .clear = pnv_tce_free_vm,
> - .get = pnv_tce_get,
> -};
> -
> void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
> void *tce_mem, u64 tce_size,
> u64 dma_offset, unsigned page_shift)
> @@ -685,7 +663,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
> return NULL;
> pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
> be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
> - tbl->it_ops = &pnv_iommu_ops;
> + tbl->it_ops = &pnv_ioda1_iommu_ops;
> iommu_init_table(tbl, hose->node);
> iommu_register_group(tbl->it_group, pci_domain_nr(hose->bus), 0);
>
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 762d906..0d4df32 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -216,7 +216,12 @@ extern struct pci_ops pnv_pci_ops;
> #ifdef CONFIG_EEH
> extern struct pnv_eeh_ops ioda_eeh_ops;
> #endif
> -extern struct iommu_table_ops pnv_iommu_ops;
> +extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> + unsigned long uaddr, enum dma_data_direction direction,
> + struct dma_attrs *attrs);
> +extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
> +extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
> +extern struct iommu_table_ops pnv_ioda1_iommu_ops;
>
> void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
> unsigned char *log_buff);
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Fri, Apr 10, 2015 at 04:30:59PM +1000, Alexey Kardashevskiy wrote:
> At the moment, writing a new TCE value to the IOMMU table fails with EBUSY
> if there is a valid entry already. However, the PAPR specification allows
> the guest to write a new TCE value without clearing the old one first.
>
> Another problem this patch is addressing is the use of pool locks for
> external IOMMU users such as VFIO. The pool locks are to protect
> DMA page allocator rather than entries and since the host kernel does
> not control what pages are in use, there is no point in pool locks and
> exchange()+put_page(oldtce) is sufficient to avoid possible races.
>
> This adds an exchange() callback to iommu_table_ops which does the same
> thing as set() plus it returns replaced TCE and DMA direction so
> the caller can release the pages afterwards.
>
> The returned old TCE value is a virtual address, as is the new TCE value.
> This is different from tce_clear(), which returns a physical address.
>
> This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
> for a platform to have exchange() implemented in order to support VFIO.
>
> This replaces iommu_tce_build() and iommu_clear_tce() with
> a single iommu_tce_xchg().
>
> This makes sure that TCE permission bits are not set in TCE passed to
> IOMMU API as those are to be calculated by platform code from DMA direction.
>
> This moves SetPageDirty() to the IOMMU code to make it work for both
> the VFIO ioctl interface and in-kernel TCE acceleration (when the latter
> becomes available later).
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> arch/powerpc/include/asm/iommu.h | 17 ++++++--
> arch/powerpc/kernel/iommu.c | 53 +++++++++---------------
> arch/powerpc/platforms/powernv/pci-ioda.c | 38 ++++++++++++++++++
> arch/powerpc/platforms/powernv/pci-p5ioc2.c | 3 ++
> arch/powerpc/platforms/powernv/pci.c | 17 ++++++++
> arch/powerpc/platforms/powernv/pci.h | 2 +
> drivers/vfio/vfio_iommu_spapr_tce.c | 62 ++++++++++++++++++-----------
> 7 files changed, 130 insertions(+), 62 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index d1f8c6c..bde7ee7 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -44,11 +44,22 @@ extern int iommu_is_off;
> extern int iommu_force_on;
>
> struct iommu_table_ops {
> + /* When called with direction==DMA_NONE, it is equal to clear() */
> int (*set)(struct iommu_table *tbl,
> long index, long npages,
> unsigned long uaddr,
> enum dma_data_direction direction,
> struct dma_attrs *attrs);
> +#ifdef CONFIG_IOMMU_API
> + /*
> + * Exchanges existing TCE with new TCE plus direction bits;
> + * returns old TCE and DMA direction mask
> + */
> + int (*exchange)(struct iommu_table *tbl,
> + long index,
> + unsigned long *tce,
> + enum dma_data_direction *direction);
> +#endif
> void (*clear)(struct iommu_table *tbl,
> long index, long npages);
> unsigned long (*get)(struct iommu_table *tbl, long index);
> @@ -152,6 +163,8 @@ extern void iommu_register_group(struct iommu_table_group *table_group,
> extern int iommu_add_device(struct device *dev);
> extern void iommu_del_device(struct device *dev);
> extern int __init tce_iommu_bus_notifier_init(void);
> +extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
> + unsigned long *tce, enum dma_data_direction *direction);
> #else
> static inline void iommu_register_group(struct iommu_table_group *table_group,
> int pci_domain_number,
> @@ -231,10 +244,6 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
> unsigned long npages);
> extern int iommu_tce_put_param_check(struct iommu_table *tbl,
> unsigned long ioba, unsigned long tce);
> -extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
> - unsigned long hwaddr, enum dma_data_direction direction);
> -extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
> - unsigned long entry);
>
> extern void iommu_flush_tce(struct iommu_table *tbl);
> extern int iommu_take_ownership(struct iommu_table_group *table_group);
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 068fe4ff..501e8ee 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -982,9 +982,6 @@ EXPORT_SYMBOL_GPL(iommu_tce_clear_param_check);
> int iommu_tce_put_param_check(struct iommu_table *tbl,
> unsigned long ioba, unsigned long tce)
> {
> - if (!(tce & (TCE_PCI_WRITE | TCE_PCI_READ)))
> - return -EINVAL;
> -
> if (tce & ~(IOMMU_PAGE_MASK(tbl) | TCE_PCI_WRITE | TCE_PCI_READ))
> return -EINVAL;
>
> @@ -1002,44 +999,20 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
> }
> EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
>
> -unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
> -{
> - unsigned long oldtce;
> - struct iommu_pool *pool = get_pool(tbl, entry);
> -
> - spin_lock(&(pool->lock));
> -
> - oldtce = tbl->it_ops->get(tbl, entry);
> - if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
> - tbl->it_ops->clear(tbl, entry, 1);
> - else
> - oldtce = 0;
> -
> - spin_unlock(&(pool->lock));
> -
> - return oldtce;
> -}
> -EXPORT_SYMBOL_GPL(iommu_clear_tce);
> -
> /*
> * hwaddr is a kernel virtual address here (0xc... bazillion),
> * tce_build converts it to a physical address.
> */
> -int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
> - unsigned long hwaddr, enum dma_data_direction direction)
> +long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
> + unsigned long *tce, enum dma_data_direction *direction)
> {
> - int ret = -EBUSY;
> - unsigned long oldtce;
> - struct iommu_pool *pool = get_pool(tbl, entry);
> + long ret;
>
> - spin_lock(&(pool->lock));
> + ret = tbl->it_ops->exchange(tbl, entry, tce, direction);
>
> - oldtce = tbl->it_ops->get(tbl, entry);
> - /* Add new entry if it is not busy */
> - if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
> - ret = tbl->it_ops->set(tbl, entry, 1, hwaddr, direction, NULL);
> -
> - spin_unlock(&(pool->lock));
> + if (!ret && ((*direction == DMA_FROM_DEVICE) ||
> + (*direction == DMA_BIDIRECTIONAL)))
> + SetPageDirty(pfn_to_page(__pa(*tce) >> PAGE_SHIFT));
>
> /* if (unlikely(ret))
> pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
> @@ -1048,13 +1021,23 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
>
> return ret;
> }
> -EXPORT_SYMBOL_GPL(iommu_tce_build);
> +EXPORT_SYMBOL_GPL(iommu_tce_xchg);
>
> static int iommu_table_take_ownership(struct iommu_table *tbl)
> {
> unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
> int ret = 0;
>
> + /*
> + * VFIO does not control TCE entries allocation and the guest
> + * can write new TCEs on top of existing ones so iommu_tce_build()
> + * must be able to release old pages. This functionality
> + * requires exchange() callback defined so if it is not
> + * implemented, we disallow taking ownership over the table.
> + */
> + if (!tbl->it_ops->exchange)
> + return -EINVAL;
> +
> spin_lock_irqsave(&tbl->large_pool.lock, flags);
> for (i = 0; i < tbl->nr_pools; i++)
> spin_lock(&tbl->pools[i].lock);
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index fd993bc..4d80502 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1128,6 +1128,20 @@ static int pnv_ioda1_tce_build_vm(struct iommu_table *tbl, long index,
> return ret;
> }
>
> +#ifdef CONFIG_IOMMU_API
> +static int pnv_ioda1_tce_xchg_vm(struct iommu_table *tbl, long index,
> + unsigned long *tce, enum dma_data_direction *direction)
> +{
> + long ret = pnv_tce_xchg(tbl, index, tce, direction);
> +
> + if (!ret && (tbl->it_type &
> + (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
> + pnv_pci_ioda1_tce_invalidate(tbl, index, 1, false);
> +
> + return ret;
> +}
> +#endif
> +
> static void pnv_ioda1_tce_free_vm(struct iommu_table *tbl, long index,
> long npages)
> {
> @@ -1139,6 +1153,9 @@ static void pnv_ioda1_tce_free_vm(struct iommu_table *tbl, long index,
>
> struct iommu_table_ops pnv_ioda1_iommu_ops = {
> .set = pnv_ioda1_tce_build_vm,
> +#ifdef CONFIG_IOMMU_API
> + .exchange = pnv_ioda1_tce_xchg_vm,
> +#endif
> .clear = pnv_ioda1_tce_free_vm,
> .get = pnv_tce_get,
> };
> @@ -1190,6 +1207,20 @@ static int pnv_ioda2_tce_build_vm(struct iommu_table *tbl, long index,
> return ret;
> }
>
> +#ifdef CONFIG_IOMMU_API
> +static int pnv_ioda2_tce_xchg_vm(struct iommu_table *tbl, long index,
> + unsigned long *tce, enum dma_data_direction *direction)
> +{
> + long ret = pnv_tce_xchg(tbl, index, tce, direction);
> +
> + if (!ret && (tbl->it_type &
> + (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
> + pnv_pci_ioda2_tce_invalidate(tbl, index, 1, false);
> +
> + return ret;
> +}
> +#endif
> +
> static void pnv_ioda2_tce_free_vm(struct iommu_table *tbl, long index,
> long npages)
> {
> @@ -1201,6 +1232,9 @@ static void pnv_ioda2_tce_free_vm(struct iommu_table *tbl, long index,
>
> static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> .set = pnv_ioda2_tce_build_vm,
> +#ifdef CONFIG_IOMMU_API
> + .exchange = pnv_ioda2_tce_xchg_vm,
> +#endif
> .clear = pnv_ioda2_tce_free_vm,
> .get = pnv_tce_get,
> };
> @@ -1353,6 +1387,7 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
> pnv_pci_ioda2_set_bypass(pe, true);
> }
>
> +#ifdef CONFIG_IOMMU_API
> static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
> bool enable)
> {
> @@ -1369,6 +1404,7 @@ static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
> static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
> .set_ownership = pnv_ioda2_set_ownership,
> };
> +#endif
>
> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> struct pnv_ioda_pe *pe)
> @@ -1437,7 +1473,9 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> }
> tbl->it_ops = &pnv_ioda2_iommu_ops;
> iommu_init_table(tbl, phb->hose->node);
> +#ifdef CONFIG_IOMMU_API
> pe->table_group.ops = &pnv_pci_ioda2_ops;
> +#endif
> iommu_register_group(&pe->table_group, phb->hose->global_number,
> pe->pe_number);
>
> diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> index 6906a9c..d2d9092 100644
> --- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> +++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> @@ -85,6 +85,9 @@ static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { }
>
> static struct iommu_table_ops pnv_p5ioc2_iommu_ops = {
> .set = pnv_tce_build,
> +#ifdef CONFIG_IOMMU_API
> + .exchange = pnv_tce_xchg,
> +#endif
> .clear = pnv_tce_free,
> .get = pnv_tce_get,
> };
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index a8c05de..a9797dd 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -615,6 +615,23 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> return 0;
> }
>
> +#ifdef CONFIG_IOMMU_API
> +int pnv_tce_xchg(struct iommu_table *tbl, long index,
> + unsigned long *tce, enum dma_data_direction *direction)
> +{
> + u64 proto_tce = iommu_direction_to_tce_perm(*direction);
> + unsigned long newtce = __pa(*tce) | proto_tce;
> + unsigned long idx = index - tbl->it_offset;
> +
> + *tce = xchg(pnv_tce(tbl, idx), cpu_to_be64(newtce));
> + *tce = (unsigned long) __va(be64_to_cpu(*tce));
> + *direction = iommu_tce_direction(*tce);
> + *tce &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> + return 0;
> +}
> +#endif
> +
> void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
> {
> long i;
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 0d4df32..4d1a78c 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -220,6 +220,8 @@ extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> unsigned long uaddr, enum dma_data_direction direction,
> struct dma_attrs *attrs);
> extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
> +extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
> + unsigned long *tce, enum dma_data_direction *direction);
> extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
> extern struct iommu_table_ops pnv_ioda1_iommu_ops;
>
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index d5d8c50..7c3c215 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -251,9 +251,6 @@ static void tce_iommu_unuse_page(struct tce_container *container,
> {
> struct page *page;
>
> - if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
> - return;
> -
> /*
> * VFIO cannot map/unmap when a container is not enabled so
> * we would not need this check but KVM could map/unmap and if
> @@ -264,10 +261,6 @@ static void tce_iommu_unuse_page(struct tce_container *container,
> return;
>
> page = pfn_to_page(__pa(oldtce) >> PAGE_SHIFT);
> -
> - if (oldtce & TCE_PCI_WRITE)
> - SetPageDirty(page);
> -
Seems to me that unuse_page() should get a direction parameter,
instead of moving the PageDirty (and DMA_NONE test) to all the
callers.
> put_page(page);
> }
>
> @@ -275,14 +268,21 @@ static int tce_iommu_clear(struct tce_container *container,
> struct iommu_table *tbl,
> unsigned long entry, unsigned long pages)
> {
> - unsigned long oldtce;
> + long ret;
> + enum dma_data_direction direction;
> + unsigned long tce;
>
> for ( ; pages; --pages, ++entry) {
> - oldtce = iommu_clear_tce(tbl, entry);
> - if (!oldtce)
> + direction = DMA_NONE;
> + tce = (unsigned long) __va(0);
> + ret = iommu_tce_xchg(tbl, entry, &tce, &direction);
> + if (ret)
> continue;
>
> - tce_iommu_unuse_page(container, (unsigned long) __va(oldtce));
> + if (direction == DMA_NONE)
> + continue;
> +
> + tce_iommu_unuse_page(container, tce);
> }
>
> return 0;
> @@ -304,12 +304,13 @@ static int tce_get_hva(unsigned long tce, unsigned long *hva)
>
> static long tce_iommu_build(struct tce_container *container,
> struct iommu_table *tbl,
> - unsigned long entry, unsigned long tce, unsigned long pages)
> + unsigned long entry, unsigned long tce, unsigned long pages,
> + enum dma_data_direction direction)
> {
> long i, ret = 0;
> struct page *page;
> unsigned long hva;
> - enum dma_data_direction direction = iommu_tce_direction(tce);
> + enum dma_data_direction dirtmp;
>
> for (i = 0; i < pages; ++i) {
> ret = tce_get_hva(tce, &hva);
> @@ -324,15 +325,21 @@ static long tce_iommu_build(struct tce_container *container,
>
> /* Preserve offset within IOMMU page */
> hva |= tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
> + dirtmp = direction;
>
> - ret = iommu_tce_build(tbl, entry + i, hva, direction);
> + ret = iommu_tce_xchg(tbl, entry + i, &hva, &dirtmp);
> if (ret) {
> + /* dirtmp cannot be DMA_NONE here */
> tce_iommu_unuse_page(container, hva);
> pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
> __func__, entry << tbl->it_page_shift,
> tce, ret);
> break;
> }
> +
> + if (dirtmp != DMA_NONE)
> + tce_iommu_unuse_page(container, hva);
> +
> tce += IOMMU_PAGE_SIZE(tbl);
> }
>
> @@ -397,7 +404,7 @@ static long tce_iommu_ioctl(void *iommu_data,
> case VFIO_IOMMU_MAP_DMA: {
> struct vfio_iommu_type1_dma_map param;
> struct iommu_table *tbl;
> - unsigned long tce;
> + enum dma_data_direction direction;
>
> if (!container->enabled)
> return -EPERM;
> @@ -418,24 +425,33 @@ static long tce_iommu_ioctl(void *iommu_data,
> if (!tbl)
> return -ENXIO;
>
> - if ((param.size & ~IOMMU_PAGE_MASK(tbl)) ||
> - (param.vaddr & ~IOMMU_PAGE_MASK(tbl)))
> + if (param.size & ~IOMMU_PAGE_MASK(tbl))
> + return -EINVAL;
> +
> + if (param.vaddr & (TCE_PCI_READ | TCE_PCI_WRITE))
> return -EINVAL;
>
> /* iova is checked by the IOMMU API */
> - tce = param.vaddr;
> if (param.flags & VFIO_DMA_MAP_FLAG_READ)
> - tce |= TCE_PCI_READ;
> - if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
> - tce |= TCE_PCI_WRITE;
> + if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
> + direction = DMA_BIDIRECTIONAL;
> + else
> + direction = DMA_TO_DEVICE;
> + else
> + if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
> + direction = DMA_FROM_DEVICE;
> + else
> + return -EINVAL;
>
> - ret = iommu_tce_put_param_check(tbl, param.iova, tce);
> + ret = iommu_tce_put_param_check(tbl, param.iova, param.vaddr);
> if (ret)
> return ret;
>
> ret = tce_iommu_build(container, tbl,
> param.iova >> tbl->it_page_shift,
> - tce, param.size >> tbl->it_page_shift);
> + param.vaddr,
> + param.size >> tbl->it_page_shift,
> + direction);
>
> iommu_flush_tce(tbl);
>
On Fri, Apr 10, 2015 at 04:31:00PM +1000, Alexey Kardashevskiy wrote:
> This moves iommu_table creation to the beginning. This is a mechanical
> patch.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> arch/powerpc/platforms/powernv/pci-ioda.c | 34 ++++++++++++++++---------------
> 1 file changed, 18 insertions(+), 16 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 4d80502..a1e0df9 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1437,27 +1437,33 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> addr = page_address(tce_mem);
> memset(addr, 0, tce_table_size);
>
> + /* Setup iommu */
> + pe->table_group.tables[0].it_group = &pe->table_group;
> +
> + /* Setup linux iommu table */
> + tbl = &pe->table_group.tables[0];
> + pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
> + IOMMU_PAGE_SHIFT_4K);
> +
> + tbl->it_ops = &pnv_ioda2_iommu_ops;
> + iommu_init_table(tbl, phb->hose->node);
> +#ifdef CONFIG_IOMMU_API
> + pe->table_group.ops = &pnv_pci_ioda2_ops;
> +#endif
> +
> /*
> * Map TCE table through TVT. The TVE index is the PE number
> * shifted by 1 bit for 32-bits DMA space.
> */
> rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
> - pe->pe_number << 1, 1, __pa(addr),
> - tce_table_size, 0x1000);
> + pe->pe_number << 1, 1, __pa(tbl->it_base),
> + tbl->it_size << 3, 1ULL << tbl->it_page_shift);
This looks like a real change, not just mechanical code movement.
> if (rc) {
> pe_err(pe, "Failed to configure 32-bit TCE table,"
> " err %ld\n", rc);
> goto fail;
> }
>
> - /* Setup iommu */
> - pe->table_group.tables[0].it_group = &pe->table_group;
> -
> - /* Setup linux iommu table */
> - tbl = &pe->table_group.tables[0];
> - pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
> - IOMMU_PAGE_SHIFT_4K);
> -
> /* OPAL variant of PHB3 invalidated TCEs */
> swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
> if (swinvp) {
> @@ -1471,16 +1477,12 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> 8);
> tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
> }
> - tbl->it_ops = &pnv_ioda2_iommu_ops;
> - iommu_init_table(tbl, phb->hose->node);
> -#ifdef CONFIG_IOMMU_API
> - pe->table_group.ops = &pnv_pci_ioda2_ops;
> -#endif
> iommu_register_group(&pe->table_group, phb->hose->global_number,
> pe->pe_number);
>
> if (pe->pdev)
> - set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
> + set_iommu_table_base_and_group(&pe->pdev->dev,
> + &pe->table_group.tables[0]);
And it's not obvious why this change happens either.
> else
> pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
>
On Fri, Apr 10, 2015 at 04:31:01PM +1000, Alexey Kardashevskiy wrote:
> This is a part of moving TCE table allocation into an iommu_ops
> callback to support multiple IOMMU groups per one VFIO container.
>
> This enforces the window size to be a power of two.
>
> This is a pretty mechanical patch.
???
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
But apart from that dubious comment in the commit message,
Reviewed-by: David Gibson <[email protected]>
On Fri, Apr 10, 2015 at 04:31:02PM +1000, Alexey Kardashevskiy wrote:
> This is a part of moving DMA window programming to an iommu_ops
> callback.
>
> This is a mechanical patch.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
On Fri, Apr 10, 2015 at 04:31:03PM +1000, Alexey Kardashevskiy wrote:
> The iommu_free_table helper releases the memory it is using (the TCE table
> and @it_map) and releases the iommu_table struct as well. We might not want
> that very last step, as we store iommu_table in parent structures.
Yeah, as I commented on the earlier patch, freeing the surrounding
group from a function taking just the individual table is wrong.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> arch/powerpc/include/asm/iommu.h | 1 +
> arch/powerpc/kernel/iommu.c | 57 ++++++++++++++++++++++++----------------
> 2 files changed, 35 insertions(+), 23 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index bde7ee7..8ed4648 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -127,6 +127,7 @@ static inline void *get_iommu_table_base(struct device *dev)
>
> extern struct iommu_table *iommu_table_alloc(int node);
> /* Frees table for an individual device node */
> +extern void iommu_reset_table(struct iommu_table *tbl, const char *node_name);
> extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
>
> /* Initializes an iommu_table based in values set in the passed-in
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 501e8ee..0bcd988 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -721,24 +721,46 @@ struct iommu_table *iommu_table_alloc(int node)
> return &table_group->tables[0];
> }
>
> +void iommu_reset_table(struct iommu_table *tbl, const char *node_name)
> +{
> + if (!tbl)
> + return;
> +
> + if (tbl->it_map) {
> + unsigned long bitmap_sz;
> + unsigned int order;
> +
> + /*
> + * In case we have reserved the first bit, we should not emit
> + * the warning below.
> + */
> + if (tbl->it_offset == 0)
> + clear_bit(0, tbl->it_map);
> +
> + /* verify that table contains no entries */
> + if (!bitmap_empty(tbl->it_map, tbl->it_size))
> + pr_warn("%s: Unexpected TCEs for %s\n", __func__,
> + node_name);
> +
> + /* calculate bitmap size in bytes */
> + bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
> +
> + /* free bitmap */
> + order = get_order(bitmap_sz);
> + free_pages((unsigned long) tbl->it_map, order);
> + }
> +
> + memset(tbl, 0, sizeof(*tbl));
> +}
> +
> void iommu_free_table(struct iommu_table *tbl, const char *node_name)
> {
> - unsigned long bitmap_sz;
> - unsigned int order;
> struct iommu_table_group *table_group = tbl->it_group;
>
> - if (!tbl || !tbl->it_map) {
> - printk(KERN_ERR "%s: expected TCE map for %s\n", __func__,
> - node_name);
> + if (!tbl)
> return;
> - }
>
> - /*
> - * In case we have reserved the first bit, we should not emit
> - * the warning below.
> - */
> - if (tbl->it_offset == 0)
> - clear_bit(0, tbl->it_map);
> + iommu_reset_table(tbl, node_name);
>
> #ifdef CONFIG_IOMMU_API
> if (table_group->group) {
> @@ -747,17 +769,6 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
> }
> #endif
>
> - /* verify that table contains no entries */
> - if (!bitmap_empty(tbl->it_map, tbl->it_size))
> - pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
> -
> - /* calculate bitmap size in bytes */
> - bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
> -
> - /* free bitmap */
> - order = get_order(bitmap_sz);
> - free_pages((unsigned long) tbl->it_map, order);
> -
> /* free table */
> kfree(table_group);
> }
On 04/16/2015 03:55 PM, David Gibson wrote:
> On Fri, Apr 10, 2015 at 04:30:54PM +1000, Alexey Kardashevskiy wrote:
>> Modern IBM POWERPC systems support multiple (currently two) TCE tables
>> per IOMMU group (a.k.a. PE). This adds a iommu_table_group container
>> for TCE tables. Right now just one table is supported.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>> arch/powerpc/include/asm/iommu.h | 18 +++--
>> arch/powerpc/kernel/iommu.c | 34 ++++----
>> arch/powerpc/platforms/powernv/pci-ioda.c | 38 +++++----
>> arch/powerpc/platforms/powernv/pci-p5ioc2.c | 17 ++--
>> arch/powerpc/platforms/powernv/pci.c | 2 +-
>> arch/powerpc/platforms/powernv/pci.h | 4 +-
>> arch/powerpc/platforms/pseries/iommu.c | 9 ++-
>> drivers/vfio/vfio_iommu_spapr_tce.c | 120 ++++++++++++++++++++--------
>> 8 files changed, 160 insertions(+), 82 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>> index eb75726..667aa1a 100644
>> --- a/arch/powerpc/include/asm/iommu.h
>> +++ b/arch/powerpc/include/asm/iommu.h
>> @@ -90,9 +90,7 @@ struct iommu_table {
>> struct iommu_pool pools[IOMMU_NR_POOLS];
>> unsigned long *it_map; /* A simple allocation bitmap for now */
>> unsigned long it_page_shift;/* table iommu page size */
>> -#ifdef CONFIG_IOMMU_API
>> - struct iommu_group *it_group;
>> -#endif
>> + struct iommu_table_group *it_group;
>> struct iommu_table_ops *it_ops;
>> void (*set_bypass)(struct iommu_table *tbl, bool enable);
>> };
>> @@ -126,14 +124,24 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
>> */
>> extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>> int nid);
>> +
>> +#define IOMMU_TABLE_GROUP_MAX_TABLES 1
>> +
>> +struct iommu_table_group {
>> #ifdef CONFIG_IOMMU_API
>> -extern void iommu_register_group(struct iommu_table *tbl,
>> + struct iommu_group *group;
>> +#endif
>> + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>
> There's nothing to indicate which of the tables are in use at the
> current time. I mean, it doesn't matter now because there's only one,
> but the patch doesn't make a whole lot of sense without that.
>
>> +};
>> +
>> +#ifdef CONFIG_IOMMU_API
>> +extern void iommu_register_group(struct iommu_table_group *table_group,
>> int pci_domain_number, unsigned long pe_num);
>> extern int iommu_add_device(struct device *dev);
>> extern void iommu_del_device(struct device *dev);
>> extern int __init tce_iommu_bus_notifier_init(void);
>> #else
>> -static inline void iommu_register_group(struct iommu_table *tbl,
>> +static inline void iommu_register_group(struct iommu_table_group *table_group,
>> int pci_domain_number,
>> unsigned long pe_num)
>> {
>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>> index b39d00a..fd49c8e 100644
>> --- a/arch/powerpc/kernel/iommu.c
>> +++ b/arch/powerpc/kernel/iommu.c
>> @@ -712,17 +712,20 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
>>
>> struct iommu_table *iommu_table_alloc(int node)
>> {
>> - struct iommu_table *tbl;
>> + struct iommu_table_group *table_group;
>>
>> - tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
>> + table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL,
>> + node);
>> + table_group->tables[0].it_group = table_group;
>>
>> - return tbl;
>> + return &table_group->tables[0];
>> }
>>
>> void iommu_free_table(struct iommu_table *tbl, const char *node_name)
>
> Surely the free function should take a table group rather than a table
> as argument.
No, it should not. Table lifetimes are not the same, even within the same group.
>
>> {
>> unsigned long bitmap_sz;
>> unsigned int order;
>> + struct iommu_table_group *table_group = tbl->it_group;
>>
>> if (!tbl || !tbl->it_map) {
>> printk(KERN_ERR "%s: expected TCE map for %s\n", __func__,
>> @@ -738,9 +741,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
>> clear_bit(0, tbl->it_map);
>>
>> #ifdef CONFIG_IOMMU_API
>> - if (tbl->it_group) {
>> - iommu_group_put(tbl->it_group);
>> - BUG_ON(tbl->it_group);
>> + if (table_group->group) {
>> + iommu_group_put(table_group->group);
>> + BUG_ON(table_group->group);
>> }
>> #endif
>>
>> @@ -756,7 +759,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
>> free_pages((unsigned long) tbl->it_map, order);
>>
>> /* free table */
>> - kfree(tbl);
>> + kfree(table_group);
>> }
>>
>> /* Creates TCEs for a user provided buffer. The user buffer must be
>> @@ -903,11 +906,12 @@ EXPORT_SYMBOL_GPL(iommu_direction_to_tce_perm);
>> */
>> static void group_release(void *iommu_data)
>> {
>> - struct iommu_table *tbl = iommu_data;
>> - tbl->it_group = NULL;
>> + struct iommu_table_group *table_group = iommu_data;
>> +
>> + table_group->group = NULL;
>> }
>>
>> -void iommu_register_group(struct iommu_table *tbl,
>> +void iommu_register_group(struct iommu_table_group *table_group,
>> int pci_domain_number, unsigned long pe_num)
>> {
>> struct iommu_group *grp;
>> @@ -919,8 +923,8 @@ void iommu_register_group(struct iommu_table *tbl,
>> PTR_ERR(grp));
>> return;
>> }
>> - tbl->it_group = grp;
>> - iommu_group_set_iommudata(grp, tbl, group_release);
>> + table_group->group = grp;
>> + iommu_group_set_iommudata(grp, table_group, group_release);
>> name = kasprintf(GFP_KERNEL, "domain%d-pe%lx",
>> pci_domain_number, pe_num);
>> if (!name)
>> @@ -1108,7 +1112,7 @@ int iommu_add_device(struct device *dev)
>> }
>>
>> tbl = get_iommu_table_base(dev);
>> - if (!tbl || !tbl->it_group) {
>> + if (!tbl || !tbl->it_group || !tbl->it_group->group) {
>> pr_debug("%s: Skipping device %s with no tbl\n",
>> __func__, dev_name(dev));
>> return 0;
>> @@ -1116,7 +1120,7 @@ int iommu_add_device(struct device *dev)
>>
>> pr_debug("%s: Adding %s to iommu group %d\n",
>> __func__, dev_name(dev),
>> - iommu_group_id(tbl->it_group));
>> + iommu_group_id(tbl->it_group->group));
>>
>> if (PAGE_SIZE < IOMMU_PAGE_SIZE(tbl)) {
>> pr_err("%s: Invalid IOMMU page size %lx (%lx) on %s\n",
>> @@ -1125,7 +1129,7 @@ int iommu_add_device(struct device *dev)
>> return -EINVAL;
>> }
>>
>> - return iommu_group_add_device(tbl->it_group, dev);
>> + return iommu_group_add_device(tbl->it_group->group, dev);
>> }
>> EXPORT_SYMBOL_GPL(iommu_add_device);
>>
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 85e64a5..a964c50 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -23,6 +23,7 @@
>> #include <linux/io.h>
>> #include <linux/msi.h>
>> #include <linux/memblock.h>
>> +#include <linux/iommu.h>
>>
>> #include <asm/sections.h>
>> #include <asm/io.h>
>> @@ -989,7 +990,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
>>
>> pe = &phb->ioda.pe_array[pdn->pe_number];
>> WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
>> - set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
>> + set_iommu_table_base_and_group(&pdev->dev, &pe->table_group.tables[0]);
>> }
>>
>> static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
>> @@ -1016,7 +1017,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
>> } else {
>> dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
>> set_dma_ops(&pdev->dev, &dma_iommu_ops);
>> - set_iommu_table_base(&pdev->dev, &pe->tce32_table);
>> + set_iommu_table_base(&pdev->dev, &pe->table_group.tables[0]);
>> }
>> *pdev->dev.dma_mask = dma_mask;
>> return 0;
>> @@ -1053,9 +1054,10 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
>> list_for_each_entry(dev, &bus->devices, bus_list) {
>> if (add_to_iommu_group)
>> set_iommu_table_base_and_group(&dev->dev,
>> - &pe->tce32_table);
>> + &pe->table_group.tables[0]);
>> else
>> - set_iommu_table_base(&dev->dev, &pe->tce32_table);
>> + set_iommu_table_base(&dev->dev,
>> + &pe->table_group.tables[0]);
>>
>> if (dev->subordinate)
>> pnv_ioda_setup_bus_dma(pe, dev->subordinate,
>> @@ -1145,8 +1147,8 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
>> void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
>> __be64 *startp, __be64 *endp, bool rm)
>> {
>> - struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
>> - tce32_table);
>> + struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe,
>> + table_group);
>> struct pnv_phb *phb = pe->phb;
>>
>> if (phb->type == PNV_PHB_IODA1)
>> @@ -1211,8 +1213,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>> }
>> }
>>
>> + /* Setup iommu */
>> + pe->table_group.tables[0].it_group = &pe->table_group;
>> +
>> /* Setup linux iommu table */
>> - tbl = &pe->tce32_table;
>> + tbl = &pe->table_group.tables[0];
>> pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
>> base << 28, IOMMU_PAGE_SHIFT_4K);
>>
>> @@ -1233,7 +1238,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>> }
>> tbl->it_ops = &pnv_iommu_ops;
>> iommu_init_table(tbl, phb->hose->node);
>> - iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
>> + iommu_register_group(&pe->table_group, phb->hose->global_number,
>> + pe->pe_number);
>>
>> if (pe->pdev)
>> set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
>> @@ -1251,8 +1257,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>>
>> static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
>> {
>> - struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
>> - tce32_table);
>> + struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe,
>> + table_group);
>> uint16_t window_id = (pe->pe_number << 1 ) + 1;
>> int64_t rc;
>>
>> @@ -1297,10 +1303,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
>> pe->tce_bypass_base = 1ull << 59;
>>
>> /* Install set_bypass callback for VFIO */
>> - pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
>> + pe->table_group.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;
>>
>> /* Enable bypass by default */
>> - pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
>> + pnv_pci_ioda2_set_bypass(&pe->table_group.tables[0], true);
>> }
>>
>> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> @@ -1347,8 +1353,11 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> goto fail;
>> }
>>
>> + /* Setup iommu */
>> + pe->table_group.tables[0].it_group = &pe->table_group;
>> +
>> /* Setup linux iommu table */
>> - tbl = &pe->tce32_table;
>> + tbl = &pe->table_group.tables[0];
>> pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
>> IOMMU_PAGE_SHIFT_4K);
>>
>> @@ -1367,7 +1376,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> }
>> tbl->it_ops = &pnv_iommu_ops;
>> iommu_init_table(tbl, phb->hose->node);
>> - iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
>> + iommu_register_group(&pe->table_group, phb->hose->global_number,
>> + pe->pe_number);
>>
>> if (pe->pdev)
>> set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
>> diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
>> index 0256fcc..ff68cac 100644
>> --- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
>> +++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
>> @@ -86,14 +86,16 @@ static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { }
>> static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
>> struct pci_dev *pdev)
>> {
>> - if (phb->p5ioc2.iommu_table.it_map == NULL) {
>> - phb->p5ioc2.iommu_table.it_ops = &pnv_iommu_ops;
>> - iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
>> - iommu_register_group(&phb->p5ioc2.iommu_table,
>> + if (phb->p5ioc2.table_group.tables[0].it_map == NULL) {
>> + phb->p5ioc2.table_group.tables[0].it_ops = &pnv_iommu_ops;
>> + iommu_init_table(&phb->p5ioc2.table_group.tables[0],
>> + phb->hose->node);
>> + iommu_register_group(&phb->p5ioc2.table_group,
>> pci_domain_nr(phb->hose->bus), phb->opal_id);
>> }
>>
>> - set_iommu_table_base_and_group(&pdev->dev, &phb->p5ioc2.iommu_table);
>> + set_iommu_table_base_and_group(&pdev->dev,
>> + &phb->p5ioc2.table_group.tables[0]);
>> }
>>
>> static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
>> @@ -167,9 +169,12 @@ static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
>> /* Setup MSI support */
>> pnv_pci_init_p5ioc2_msis(phb);
>>
>> + /* Setup iommu */
>> + phb->p5ioc2.table_group.tables[0].it_group = &phb->p5ioc2.table_group;
>> +
>> /* Setup TCEs */
>> phb->dma_dev_setup = pnv_pci_p5ioc2_dma_dev_setup;
>> - pnv_pci_setup_iommu_table(&phb->p5ioc2.iommu_table,
>> + pnv_pci_setup_iommu_table(&phb->p5ioc2.table_group.tables[0],
>> tce_mem, tce_size, 0,
>> IOMMU_PAGE_SHIFT_4K);
>> }
>> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
>> index 1c31ac8..3050cc8 100644
>> --- a/arch/powerpc/platforms/powernv/pci.c
>> +++ b/arch/powerpc/platforms/powernv/pci.c
>> @@ -687,7 +687,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
>> be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
>> tbl->it_ops = &pnv_iommu_ops;
>> iommu_init_table(tbl, hose->node);
>> - iommu_register_group(tbl, pci_domain_nr(hose->bus), 0);
>> + iommu_register_group(tbl->it_group, pci_domain_nr(hose->bus), 0);
>>
>> /* Deal with SW invalidated TCEs when needed (BML way) */
>> swinvp = of_get_property(hose->dn, "linux,tce-sw-invalidate-info",
>> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>> index f726700..762d906 100644
>> --- a/arch/powerpc/platforms/powernv/pci.h
>> +++ b/arch/powerpc/platforms/powernv/pci.h
>> @@ -53,7 +53,7 @@ struct pnv_ioda_pe {
>> /* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
>> int tce32_seg;
>> int tce32_segcount;
>> - struct iommu_table tce32_table;
>> + struct iommu_table_group table_group;
>> phys_addr_t tce_inval_reg_phys;
>>
>> /* 64-bit TCE bypass region */
>> @@ -138,7 +138,7 @@ struct pnv_phb {
>>
>> union {
>> struct {
>> - struct iommu_table iommu_table;
>> + struct iommu_table_group table_group;
>> } p5ioc2;
>>
>> struct {
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>> index 41a8b14..75ea581 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -622,7 +622,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
>> iommu_table_setparms(pci->phb, dn, tbl);
>> tbl->it_ops = &iommu_table_pseries_ops;
>> pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
>> - iommu_register_group(tbl, pci_domain_nr(bus), 0);
>> + iommu_register_group(tbl->it_group, pci_domain_nr(bus), 0);
>>
>> /* Divide the rest (1.75GB) among the children */
>> pci->phb->dma_window_size = 0x80000000ul;
>> @@ -672,7 +672,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
>> iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
>> tbl->it_ops = &iommu_table_lpar_multi_ops;
>> ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
>> - iommu_register_group(tbl, pci_domain_nr(bus), 0);
>> + iommu_register_group(tbl->it_group, pci_domain_nr(bus), 0);
>> pr_debug(" created table: %p\n", ppci->iommu_table);
>> }
>> }
>> @@ -699,7 +699,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
>> iommu_table_setparms(phb, dn, tbl);
>> tbl->it_ops = &iommu_table_pseries_ops;
>> PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
>> - iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
>> + iommu_register_group(tbl->it_group, pci_domain_nr(phb->bus), 0);
>> set_iommu_table_base_and_group(&dev->dev,
>> PCI_DN(dn)->iommu_table);
>> return;
>> @@ -1121,7 +1121,8 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
>> iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
>> tbl->it_ops = &iommu_table_lpar_multi_ops;
>> pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
>> - iommu_register_group(tbl, pci_domain_nr(pci->phb->bus), 0);
>> + iommu_register_group(tbl->it_group,
>> + pci_domain_nr(pci->phb->bus), 0);
>> pr_debug(" created table: %p\n", pci->iommu_table);
>> } else {
>> pr_debug(" found DMA window, table: %p\n", pci->iommu_table);
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index 244c958..d61aad2 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -88,7 +88,7 @@ static void decrement_locked_vm(long npages)
>> */
>> struct tce_container {
>> struct mutex lock;
>> - struct iommu_table *tbl;
>> + struct iommu_group *grp;
>> bool enabled;
>> unsigned long locked_pages;
>> };
>> @@ -103,13 +103,41 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift)
>> return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift;
>> }
>>
>> +static struct iommu_table *spapr_tce_find_table(
>> + struct tce_container *container,
>> + phys_addr_t ioba)
>> +{
>> + long i;
>> + struct iommu_table *ret = NULL;
>> + struct iommu_table_group *table_group;
>> +
>> + table_group = iommu_group_get_iommudata(container->grp);
>> + if (!table_group)
>> + return NULL;
>> +
>> + for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>> + struct iommu_table *tbl = &table_group->tables[i];
>> + unsigned long entry = ioba >> tbl->it_page_shift;
>> + unsigned long start = tbl->it_offset;
>> + unsigned long end = start + tbl->it_size;
>> +
>> + if ((start <= entry) && (entry < end)) {
>> + ret = tbl;
>> + break;
>> + }
>> + }
>> +
>> + return ret;
>> +}
>> +
>> static int tce_iommu_enable(struct tce_container *container)
>> {
>> int ret = 0;
>> unsigned long locked;
>> - struct iommu_table *tbl = container->tbl;
>> + struct iommu_table *tbl;
>> + struct iommu_table_group *table_group;
>>
>> - if (!container->tbl)
>> + if (!container->grp)
>> return -ENXIO;
>>
>> if (!current->mm)
>> @@ -143,6 +171,11 @@ static int tce_iommu_enable(struct tce_container *container)
>> * as this information is only available from KVM and VFIO is
>> * KVM agnostic.
>> */
>> + table_group = iommu_group_get_iommudata(container->grp);
>> + if (!table_group)
>> + return -ENODEV;
>> +
>> + tbl = &table_group->tables[0];
>> locked = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
>> ret = try_increment_locked_vm(locked);
>> if (ret)
>> @@ -193,15 +226,17 @@ static int tce_iommu_clear(struct tce_container *container,
>> static void tce_iommu_release(void *iommu_data)
>> {
>> struct tce_container *container = iommu_data;
>> - struct iommu_table *tbl = container->tbl;
>> + struct iommu_table *tbl;
>> + struct iommu_table_group *table_group;
>>
>> - WARN_ON(tbl && !tbl->it_group);
>> + WARN_ON(container->grp);
>>
>> - if (tbl) {
>> + if (container->grp) {
>> + table_group = iommu_group_get_iommudata(container->grp);
>> + tbl = &table_group->tables[0];
>> tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
>>
>> - if (tbl->it_group)
>> - tce_iommu_detach_group(iommu_data, tbl->it_group);
>> + tce_iommu_detach_group(iommu_data, container->grp);
>> }
>>
>> tce_iommu_disable(container);
>> @@ -329,9 +364,16 @@ static long tce_iommu_ioctl(void *iommu_data,
>>
>> case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
>> struct vfio_iommu_spapr_tce_info info;
>> - struct iommu_table *tbl = container->tbl;
>> + struct iommu_table *tbl;
>> + struct iommu_table_group *table_group;
>>
>> - if (WARN_ON(!tbl))
>> + if (WARN_ON(!container->grp))
>> + return -ENXIO;
>> +
>> + table_group = iommu_group_get_iommudata(container->grp);
>> +
>> + tbl = &table_group->tables[0];
>> + if (WARN_ON_ONCE(!tbl))
>> return -ENXIO;
>>
>> minsz = offsetofend(struct vfio_iommu_spapr_tce_info,
>> @@ -354,17 +396,12 @@ static long tce_iommu_ioctl(void *iommu_data,
>> }
>> case VFIO_IOMMU_MAP_DMA: {
>> struct vfio_iommu_type1_dma_map param;
>> - struct iommu_table *tbl = container->tbl;
>> + struct iommu_table *tbl;
>> unsigned long tce;
>>
>> if (!container->enabled)
>> return -EPERM;
>>
>> - if (!tbl)
>> - return -ENXIO;
>> -
>> - BUG_ON(!tbl->it_group);
>> -
>> minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
>>
>> if (copy_from_user(¶m, (void __user *)arg, minsz))
>> @@ -377,6 +414,10 @@ static long tce_iommu_ioctl(void *iommu_data,
>> VFIO_DMA_MAP_FLAG_WRITE))
>> return -EINVAL;
>>
>> + tbl = spapr_tce_find_table(container, param.iova);
>> + if (!tbl)
>> + return -ENXIO;
>> +
>> if ((param.size & ~IOMMU_PAGE_MASK(tbl)) ||
>> (param.vaddr & ~IOMMU_PAGE_MASK(tbl)))
>> return -EINVAL;
>> @@ -402,14 +443,11 @@ static long tce_iommu_ioctl(void *iommu_data,
>> }
>> case VFIO_IOMMU_UNMAP_DMA: {
>> struct vfio_iommu_type1_dma_unmap param;
>> - struct iommu_table *tbl = container->tbl;
>> + struct iommu_table *tbl;
>>
>> if (!container->enabled)
>> return -EPERM;
>>
>> - if (WARN_ON(!tbl))
>> - return -ENXIO;
>> -
>> minsz = offsetofend(struct vfio_iommu_type1_dma_unmap,
>> size);
>>
>> @@ -423,6 +461,10 @@ static long tce_iommu_ioctl(void *iommu_data,
>> if (param.flags)
>> return -EINVAL;
>>
>> + tbl = spapr_tce_find_table(container, param.iova);
>> + if (!tbl)
>> + return -ENXIO;
>> +
>> if (param.size & ~IOMMU_PAGE_MASK(tbl))
>> return -EINVAL;
>>
>> @@ -451,10 +493,10 @@ static long tce_iommu_ioctl(void *iommu_data,
>> mutex_unlock(&container->lock);
>> return 0;
>> case VFIO_EEH_PE_OP:
>> - if (!container->tbl || !container->tbl->it_group)
>> + if (!container->grp)
>> return -ENODEV;
>>
>> - return vfio_spapr_iommu_eeh_ioctl(container->tbl->it_group,
>> + return vfio_spapr_iommu_eeh_ioctl(container->grp,
>> cmd, arg);
>> }
>>
>> @@ -466,16 +508,15 @@ static int tce_iommu_attach_group(void *iommu_data,
>> {
>> int ret;
>> struct tce_container *container = iommu_data;
>> - struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
>> + struct iommu_table_group *table_group;
>>
>> - BUG_ON(!tbl);
>> mutex_lock(&container->lock);
>>
>> /* pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
>> iommu_group_id(iommu_group), iommu_group); */
>> - if (container->tbl) {
>> + if (container->grp) {
>> pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
>> - iommu_group_id(container->tbl->it_group),
>> + iommu_group_id(container->grp),
>> iommu_group_id(iommu_group));
>> ret = -EBUSY;
>> goto unlock_exit;
>> @@ -488,9 +529,15 @@ static int tce_iommu_attach_group(void *iommu_data,
>> goto unlock_exit;
>> }
>>
>> - ret = iommu_take_ownership(tbl);
>> + table_group = iommu_group_get_iommudata(iommu_group);
>> + if (!table_group) {
>> + ret = -ENXIO;
>> + goto unlock_exit;
>> + }
>> +
>> + ret = iommu_take_ownership(&table_group->tables[0]);
>> if (!ret)
>> - container->tbl = tbl;
>> + container->grp = iommu_group;
>>
>> unlock_exit:
>> mutex_unlock(&container->lock);
>> @@ -502,27 +549,30 @@ static void tce_iommu_detach_group(void *iommu_data,
>> struct iommu_group *iommu_group)
>> {
>> struct tce_container *container = iommu_data;
>> - struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
>> + struct iommu_table_group *table_group;
>>
>> - BUG_ON(!tbl);
>> mutex_lock(&container->lock);
>> - if (tbl != container->tbl) {
>> + if (iommu_group != container->grp) {
>> pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
>> iommu_group_id(iommu_group),
>> - iommu_group_id(tbl->it_group));
>> + iommu_group_id(container->grp));
>> goto unlock_exit;
>> }
>>
>> if (container->enabled) {
>> pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
>> - iommu_group_id(tbl->it_group));
>> + iommu_group_id(container->grp));
>> tce_iommu_disable(container);
>> }
>>
>> /* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
>> iommu_group_id(iommu_group), iommu_group); */
>> - container->tbl = NULL;
>> - iommu_release_ownership(tbl);
>> + container->grp = NULL;
>> +
>> + table_group = iommu_group_get_iommudata(iommu_group);
>> + BUG_ON(!table_group);
>> +
>> + iommu_release_ownership(&table_group->tables[0]);
>>
>> unlock_exit:
>> mutex_unlock(&container->lock);
>
--
Alexey
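As an aside, the window lookup that spapr_tce_find_table() introduces in the hunk above reduces to a few lines of arithmetic. Here is a standalone userspace sketch with simplified stand-in types (the names `tbl`, `find_table` and the example window geometries are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

struct tbl {
    unsigned long it_page_shift;  /* IOMMU page size, as a shift */
    unsigned long it_offset;      /* first entry number of the DMA window */
    unsigned long it_size;        /* number of entries in the window */
};

#define MAX_TABLES 2

/* Return the table whose DMA window covers @ioba, or NULL if none does. */
static struct tbl *find_table(struct tbl tables[], unsigned long ioba)
{
    for (int i = 0; i < MAX_TABLES; i++) {
        struct tbl *t = &tables[i];
        unsigned long entry = ioba >> t->it_page_shift;
        unsigned long start = t->it_offset;
        unsigned long end = start + t->it_size;

        if (start <= entry && entry < end)
            return t;
    }
    return NULL;
}
```

Each table converts the bus address to an entry number with its own page shift, so windows with different page sizes (e.g. a 4K default window and a 64K DDW window) can coexist in one group.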
On 04/16/2015 04:46 PM, David Gibson wrote:
> On Fri, Apr 10, 2015 at 04:31:03PM +1000, Alexey Kardashevskiy wrote:
>> The iommu_free_table helper releases memory it is using (the TCE table and
>> @it_map) and release the iommu_table struct as well. We might not want
>> the very last step as we store iommu_table in parent structures.
>
> Yeah, as I commented on the earlier patch, freeing the surrounding
> group from a function taking just the individual table is wrong.
This is for iommu tables created by the old code, which stores these iommu_table
struct pointers in device nodes. I believe there is a plan to get rid of the
iommu tables there, and once that is done, this workaround will go away.
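To make the reset/free split concrete, here is a hedged userspace sketch of the shape the patch gives these helpers (simplified types; the bitmap-size calculation and the "Unexpected TCEs" warning of the real iommu_reset_table() are omitted):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct group;

struct table {
    unsigned long it_size;
    unsigned long *it_map;      /* allocation bitmap */
    struct group *it_group;     /* back-pointer to the container */
};

struct group {
    struct table tables[1];
};

/* Release a table's internals and blank it, leaving it reusable;
 * the containing group is untouched. */
static void reset_table(struct table *tbl)
{
    if (!tbl)
        return;
    free(tbl->it_map);
    memset(tbl, 0, sizeof(*tbl));
}

/* Full teardown: reset the table, then release the container too.
 * The group pointer must be saved before the memset wipes it. */
static void free_table(struct table *tbl)
{
    struct group *grp;

    if (!tbl)
        return;
    grp = tbl->it_group;
    reset_table(tbl);
    free(grp);
}
```

Callers that keep the iommu_table embedded in a longer-lived parent structure call only the reset step; the full free is reserved for tables whose container was allocated by iommu_table_alloc().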
>
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>> arch/powerpc/include/asm/iommu.h | 1 +
>> arch/powerpc/kernel/iommu.c | 57 ++++++++++++++++++++++++----------------
>> 2 files changed, 35 insertions(+), 23 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>> index bde7ee7..8ed4648 100644
>> --- a/arch/powerpc/include/asm/iommu.h
>> +++ b/arch/powerpc/include/asm/iommu.h
>> @@ -127,6 +127,7 @@ static inline void *get_iommu_table_base(struct device *dev)
>>
>> extern struct iommu_table *iommu_table_alloc(int node);
>> /* Frees table for an individual device node */
>> +extern void iommu_reset_table(struct iommu_table *tbl, const char *node_name);
>> extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
>>
>> /* Initializes an iommu_table based in values set in the passed-in
>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>> index 501e8ee..0bcd988 100644
>> --- a/arch/powerpc/kernel/iommu.c
>> +++ b/arch/powerpc/kernel/iommu.c
>> @@ -721,24 +721,46 @@ struct iommu_table *iommu_table_alloc(int node)
>> return &table_group->tables[0];
>> }
>>
>> +void iommu_reset_table(struct iommu_table *tbl, const char *node_name)
>> +{
>> + if (!tbl)
>> + return;
>> +
>> + if (tbl->it_map) {
>> + unsigned long bitmap_sz;
>> + unsigned int order;
>> +
>> + /*
>> + * In case we have reserved the first bit, we should not emit
>> + * the warning below.
>> + */
>> + if (tbl->it_offset == 0)
>> + clear_bit(0, tbl->it_map);
>> +
>> + /* verify that table contains no entries */
>> + if (!bitmap_empty(tbl->it_map, tbl->it_size))
>> + pr_warn("%s: Unexpected TCEs for %s\n", __func__,
>> + node_name);
>> +
>> + /* calculate bitmap size in bytes */
>> + bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
>> +
>> + /* free bitmap */
>> + order = get_order(bitmap_sz);
>> + free_pages((unsigned long) tbl->it_map, order);
>> + }
>> +
>> + memset(tbl, 0, sizeof(*tbl));
>> +}
>> +
>> void iommu_free_table(struct iommu_table *tbl, const char *node_name)
>> {
>> - unsigned long bitmap_sz;
>> - unsigned int order;
>> struct iommu_table_group *table_group = tbl->it_group;
>>
>> - if (!tbl || !tbl->it_map) {
>> - printk(KERN_ERR "%s: expected TCE map for %s\n", __func__,
>> - node_name);
>> + if (!tbl)
>> return;
>> - }
>>
>> - /*
>> - * In case we have reserved the first bit, we should not emit
>> - * the warning below.
>> - */
>> - if (tbl->it_offset == 0)
>> - clear_bit(0, tbl->it_map);
>> + iommu_reset_table(tbl, node_name);
>>
>> #ifdef CONFIG_IOMMU_API
>> if (table_group->group) {
>> @@ -747,17 +769,6 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
>> }
>> #endif
>>
>> - /* verify that table contains no entries */
>> - if (!bitmap_empty(tbl->it_map, tbl->it_size))
>> - pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
>> -
>> - /* calculate bitmap size in bytes */
>> - bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
>> -
>> - /* free bitmap */
>> - order = get_order(bitmap_sz);
>> - free_pages((unsigned long) tbl->it_map, order);
>> -
>> /* free table */
>> kfree(table_group);
>> }
>
--
Alexey
On 04/16/2015 03:55 PM, David Gibson wrote:
> On Fri, Apr 10, 2015 at 04:30:54PM +1000, Alexey Kardashevskiy wrote:
>> Modern IBM POWERPC systems support multiple (currently two) TCE tables
>> per IOMMU group (a.k.a. PE). This adds an iommu_table_group container
>> for TCE tables. Right now just one table is supported.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>> arch/powerpc/include/asm/iommu.h | 18 +++--
>> arch/powerpc/kernel/iommu.c | 34 ++++----
>> arch/powerpc/platforms/powernv/pci-ioda.c | 38 +++++----
>> arch/powerpc/platforms/powernv/pci-p5ioc2.c | 17 ++--
>> arch/powerpc/platforms/powernv/pci.c | 2 +-
>> arch/powerpc/platforms/powernv/pci.h | 4 +-
>> arch/powerpc/platforms/pseries/iommu.c | 9 ++-
>> drivers/vfio/vfio_iommu_spapr_tce.c | 120 ++++++++++++++++++++--------
>> 8 files changed, 160 insertions(+), 82 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>> index eb75726..667aa1a 100644
>> --- a/arch/powerpc/include/asm/iommu.h
>> +++ b/arch/powerpc/include/asm/iommu.h
>> @@ -90,9 +90,7 @@ struct iommu_table {
>> struct iommu_pool pools[IOMMU_NR_POOLS];
>> unsigned long *it_map; /* A simple allocation bitmap for now */
>> unsigned long it_page_shift;/* table iommu page size */
>> -#ifdef CONFIG_IOMMU_API
>> - struct iommu_group *it_group;
>> -#endif
>> + struct iommu_table_group *it_group;
>> struct iommu_table_ops *it_ops;
>> void (*set_bypass)(struct iommu_table *tbl, bool enable);
>> };
>> @@ -126,14 +124,24 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
>> */
>> extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>> int nid);
>> +
>> +#define IOMMU_TABLE_GROUP_MAX_TABLES 1
>> +
>> +struct iommu_table_group {
>> #ifdef CONFIG_IOMMU_API
>> -extern void iommu_register_group(struct iommu_table *tbl,
>> + struct iommu_group *group;
>> +#endif
>> + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>
> There's nothing to indicate which of the tables are in use at the
> current time. I mean, it doesn't matter now because there's only one,
> but the patch doesn't make a whole lot of sense without that.
Later in the patchset, the code will look at @it_size to know if the table
is in use.
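A minimal sketch of that convention, with stand-in types: @it_size doubles as the in-use flag, since an unused table slot has a zero-sized window.

```c
#include <assert.h>

#define MAX_TABLES 2

struct tbl { unsigned long it_size; };          /* 0 => slot not in use */
struct grp { struct tbl tables[MAX_TABLES]; };

static int table_in_use(const struct tbl *t)
{
    return t->it_size != 0;
}

/* Count how many of the group's table slots hold a live window. */
static int live_tables(const struct grp *g)
{
    int n = 0;

    for (int i = 0; i < MAX_TABLES; i++)
        if (table_in_use(&g->tables[i]))
            n++;
    return n;
}
```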
>
>> +};
>> +
>> +#ifdef CONFIG_IOMMU_API
>> +extern void iommu_register_group(struct iommu_table_group *table_group,
>> int pci_domain_number, unsigned long pe_num);
>> extern int iommu_add_device(struct device *dev);
>> extern void iommu_del_device(struct device *dev);
>> extern int __init tce_iommu_bus_notifier_init(void);
>> #else
>> -static inline void iommu_register_group(struct iommu_table *tbl,
>> +static inline void iommu_register_group(struct iommu_table_group *table_group,
>> int pci_domain_number,
>> unsigned long pe_num)
>> {
>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>> index b39d00a..fd49c8e 100644
>> --- a/arch/powerpc/kernel/iommu.c
>> +++ b/arch/powerpc/kernel/iommu.c
>> @@ -712,17 +712,20 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
>>
>> struct iommu_table *iommu_table_alloc(int node)
>> {
>> - struct iommu_table *tbl;
>> + struct iommu_table_group *table_group;
>>
>> - tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
>> + table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL,
>> + node);
>> + table_group->tables[0].it_group = table_group;
>>
>> - return tbl;
>> + return &table_group->tables[0];
>> }
>>
>> void iommu_free_table(struct iommu_table *tbl, const char *node_name)
>
> Surely the free function should take a table group rather than a table
> as argument.
Please ignore my other response to your reply; I reworked the whole thing
to store iommu_table_group in the pci device node. Thanks.
--
Alexey
On 04/16/2015 04:07 PM, David Gibson wrote:
> On Fri, Apr 10, 2015 at 04:30:56PM +1000, Alexey Kardashevskiy wrote:
>> At the moment the iommu_table struct has a set_bypass() which enables/
>> disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
>> which calls this callback when external IOMMU users such as VFIO are
>> about to get over a PHB.
>>
>> The set_bypass() callback is not really an iommu_table function but
>> IOMMU/PE function. This introduces a iommu_table_group_ops struct and
>> adds a set_ownership() callback to it which is called when an external
>> user takes control over the IOMMU.
>
> Do you really need separate ops structures at both the single table
> and table group level? The different tables in a group will all
> belong to the same basic iommu won't they?
IOMMU tables exist on their own in VIO. Also, the platform code uses just a table
(or runs in bypass mode) and does not care about table groups. It seemed
cleaner to me to keep them separate. Should I still merge them?
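One way to picture the separation argued for here, as a hedged sketch with made-up names: per-table ops carry the TCE-level operations that single-table users such as VIO need, while group-level ops carry PE-wide operations like ownership transfer.

```c
#include <assert.h>
#include <stddef.h>

struct tbl;
struct tbl_group;

struct tbl_ops {                    /* per-table: TCE-level operations */
    int (*set)(struct tbl *t, long entry, unsigned long tce);
};

struct tbl_group_ops {              /* per-PE: ownership transfer etc. */
    void (*set_ownership)(struct tbl_group *g, int external);
};

struct tbl {
    struct tbl_ops *it_ops;
    struct tbl_group *it_group;     /* NULL for standalone (VIO-style) tables */
};

struct tbl_group {
    struct tbl_group_ops *ops;
    struct tbl tables[1];
    int owned_externally;
};

static void demo_set_ownership(struct tbl_group *g, int external)
{
    g->owned_externally = external; /* real code would also flip DMA bypass */
}

static struct tbl_group_ops demo_group_ops = {
    .set_ownership = demo_set_ownership,
};
```

A standalone table simply carries a NULL it_group and never sees the group-level ops, which is why merging the two structs would force group concepts onto users that have none.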
>
>> This renames set_bypass() to set_ownership() as it is not necessarily
>> just enabling bypassing, it can be something else/more so let's give it
>> more generic name. The bool parameter is inverted.
>>
>> The callback is implemented for IODA2 only. Other platforms (P5IOC2,
>> IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>> arch/powerpc/include/asm/iommu.h | 14 +++++++++++++-
>> arch/powerpc/platforms/powernv/pci-ioda.c | 30 ++++++++++++++++++++++--------
>> drivers/vfio/vfio_iommu_spapr_tce.c | 25 +++++++++++++++++++++----
>> 3 files changed, 56 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>> index b9e50d3..d1f8c6c 100644
>> --- a/arch/powerpc/include/asm/iommu.h
>> +++ b/arch/powerpc/include/asm/iommu.h
>> @@ -92,7 +92,6 @@ struct iommu_table {
>> unsigned long it_page_shift;/* table iommu page size */
>> struct iommu_table_group *it_group;
>> struct iommu_table_ops *it_ops;
>> - void (*set_bypass)(struct iommu_table *tbl, bool enable);
>> };
>>
>> /* Pure 2^n version of get_order */
>> @@ -127,11 +126,24 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>>
>> #define IOMMU_TABLE_GROUP_MAX_TABLES 1
>>
>> +struct iommu_table_group;
>> +
>> +struct iommu_table_group_ops {
>> + /*
>> + * Switches ownership from the kernel itself to an external
>> + * user. While ownership is enabled, the kernel cannot use IOMMU
>> + * for itself.
>> + */
>> + void (*set_ownership)(struct iommu_table_group *table_group,
>> + bool enable);
>
> The meaning of "enable" in a function called "set_ownership" is
> entirely obscure.
Suggest something better please :) I have nothing better...
>
>> +};
>> +
>> struct iommu_table_group {
>> #ifdef CONFIG_IOMMU_API
>> struct iommu_group *group;
>> #endif
>> struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>> + struct iommu_table_group_ops *ops;
>> };
>>
>> #ifdef CONFIG_IOMMU_API
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index a964c50..9687731 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -1255,10 +1255,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>> __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
>> }
>>
>> -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
>> +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
>> {
>> - struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe,
>> - table_group);
>> uint16_t window_id = (pe->pe_number << 1 ) + 1;
>> int64_t rc;
>>
>> @@ -1286,7 +1284,8 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
>> * host side.
>> */
>> if (pe->pdev)
>> - set_iommu_table_base(&pe->pdev->dev, tbl);
>> + set_iommu_table_base(&pe->pdev->dev,
>> + &pe->table_group.tables[0]);
>> else
>> pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
>> }
>> @@ -1302,13 +1301,27 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
>> /* TVE #1 is selected by PCI address bit 59 */
>> pe->tce_bypass_base = 1ull << 59;
>>
>> - /* Install set_bypass callback for VFIO */
>> - pe->table_group.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;
>> -
>> /* Enable bypass by default */
>> - pnv_pci_ioda2_set_bypass(&pe->table_group.tables[0], true);
>> + pnv_pci_ioda2_set_bypass(pe, true);
>> }
>>
>> +static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
>> + bool enable)
>> +{
>> + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
>> + table_group);
>> + if (enable)
>> + iommu_take_ownership(table_group);
>> + else
>> + iommu_release_ownership(table_group);
>> +
>> + pnv_pci_ioda2_set_bypass(pe, !enable);
>> +}
>> +
>> +static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
>> + .set_ownership = pnv_ioda2_set_ownership,
>> +};
>> +
>> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> struct pnv_ioda_pe *pe)
>> {
>> @@ -1376,6 +1389,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> }
>> tbl->it_ops = &pnv_iommu_ops;
>> iommu_init_table(tbl, phb->hose->node);
>> + pe->table_group.ops = &pnv_pci_ioda2_ops;
>> iommu_register_group(&pe->table_group, phb->hose->global_number,
>> pe->pe_number);
>>
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index 9f38351..d5d8c50 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -535,9 +535,22 @@ static int tce_iommu_attach_group(void *iommu_data,
>> goto unlock_exit;
>> }
>>
>> - ret = iommu_take_ownership(table_group);
>> - if (!ret)
>> - container->grp = iommu_group;
>> + if (!table_group->ops || !table_group->ops->set_ownership) {
>> + ret = iommu_take_ownership(table_group);
>> + } else {
>> + /*
>> + * Disable iommu bypass, otherwise the user can DMA to all of
>> + * our physical memory via the bypass window instead of just
>> + * the pages that has been explicitly mapped into the iommu
>> + */
>> + table_group->ops->set_ownership(table_group, true);
>
> And here to disable bypass you call it with enable=true, so it doesn't
> even have the same meaning as it used to.
I do not disable bypass per se (even if that is what set_ownership(true) does)
as it is IODA business and VFIO has no idea about it. I do take control
over the group. I am not following you here - what used to have the same
meaning?
>
> Plus, you should fold the logic to call the callback if necessary into
> iommu_take_ownership().
I really want to keep VFIO stuff out of arch/powerpc/kernel/iommu.c as much
as possible as it is for platform DMA/IOMMU, not VFIO (which got SPAPR
driver for that). ops->set_ownership() is one of these things.
iommu_take_ownership()/iommu_release_ownership() are helpers for old-style
commercially-unsupported P5IOC2/IODA1, and this is kind of a hack while
ops->set_ownership() is an interface for VFIO to do the dynamic windows thing.
If it makes sense, I could fold the previous patch into this one and move
iommu_take_ownership()/iommu_release_ownership() to vfio_iommu_spapr_tce.c,
should I? Or leave things as they are now.
>> + ret = 0;
>> + }
>> +
>> + if (ret)
>> + goto unlock_exit;
>> +
>> + container->grp = iommu_group;
>>
>> unlock_exit:
>> mutex_unlock(&container->lock);
>> @@ -572,7 +585,11 @@ static void tce_iommu_detach_group(void *iommu_data,
>> table_group = iommu_group_get_iommudata(iommu_group);
>> BUG_ON(!table_group);
>>
>> - iommu_release_ownership(table_group);
>> + /* Kernel owns the device now, we can restore bypass */
>> + if (!table_group->ops || !table_group->ops->set_ownership)
>> + iommu_release_ownership(table_group);
>> + else
>> + table_group->ops->set_ownership(table_group, false);
>
> Likewise fold this if into iommu_release_ownership().
>
>> unlock_exit:
>> mutex_unlock(&container->lock);
>
--
Alexey
On 04/16/2015 04:10 PM, David Gibson wrote:
> On Fri, Apr 10, 2015 at 04:30:57PM +1000, Alexey Kardashevskiy wrote:
>> This adds missing locks in iommu_take_ownership()/
>> iommu_release_ownership().
>>
>> This marks all pages busy in iommu_table::it_map in order to catch
>> errors if there is an attempt to use this table while ownership over it
>> is taken.
>>
>> This only clears TCE content if there is no page marked busy in it_map.
>> Clearing must be done outside of the table locks as iommu_clear_tce()
>> called from iommu_clear_tces_and_put_pages() does this.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>> Changes:
>> v5:
>> * do not store bit#0 value, it has to be set for zero-based table
>> anyway
>> * removed test_and_clear_bit
>> ---
>> arch/powerpc/kernel/iommu.c | 26 ++++++++++++++++++++++----
>> 1 file changed, 22 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>> index 7d6089b..068fe4ff 100644
>> --- a/arch/powerpc/kernel/iommu.c
>> +++ b/arch/powerpc/kernel/iommu.c
>> @@ -1052,17 +1052,28 @@ EXPORT_SYMBOL_GPL(iommu_tce_build);
>>
>> static int iommu_table_take_ownership(struct iommu_table *tbl)
>> {
>> - unsigned long sz = (tbl->it_size + 7) >> 3;
>> + unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
>> + int ret = 0;
>> +
>> + spin_lock_irqsave(&tbl->large_pool.lock, flags);
>> + for (i = 0; i < tbl->nr_pools; i++)
>> + spin_lock(&tbl->pools[i].lock);
>>
>> if (tbl->it_offset == 0)
>> clear_bit(0, tbl->it_map);
>>
>> if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
>> pr_err("iommu_tce: it_map is not empty");
>> - return -EBUSY;
>> + ret = -EBUSY;
>> + if (tbl->it_offset == 0)
>> + set_bit(0, tbl->it_map);
>
> This really needs a comment. Why on earth are you changing the it_map
> on a failure case?
Does this explain?
/*
* The platform code reserves zero address in iommu_init_table().
* As we cleared busy bit for page @0 before using bitmap_empty(),
* we are restoring it now.
*/
>
>> + } else {
>> + memset(tbl->it_map, 0xff, sz);
>> }
>>
>> - memset(tbl->it_map, 0xff, sz);
>> + for (i = 0; i < tbl->nr_pools; i++)
>> + spin_unlock(&tbl->pools[i].lock);
>> + spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
>>
>> return 0;
>> }
>> @@ -1095,7 +1106,11 @@ EXPORT_SYMBOL_GPL(iommu_take_ownership);
>>
>> static void iommu_table_release_ownership(struct iommu_table *tbl)
>> {
>> - unsigned long sz = (tbl->it_size + 7) >> 3;
>> + unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
>> +
>> + spin_lock_irqsave(&tbl->large_pool.lock, flags);
>> + for (i = 0; i < tbl->nr_pools; i++)
>> + spin_lock(&tbl->pools[i].lock);
>>
>> memset(tbl->it_map, 0, sz);
>>
>> @@ -1103,6 +1118,9 @@ static void iommu_table_release_ownership(struct iommu_table *tbl)
>> if (tbl->it_offset == 0)
>> set_bit(0, tbl->it_map);
>>
>> + for (i = 0; i < tbl->nr_pools; i++)
>> + spin_unlock(&tbl->pools[i].lock);
>> + spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
>> }
>>
>> extern void iommu_release_ownership(struct iommu_table_group *table_group)
>
--
Alexey
On 04/16/2015 04:26 PM, David Gibson wrote:
> On Fri, Apr 10, 2015 at 04:30:59PM +1000, Alexey Kardashevskiy wrote:
>> At the moment writing new TCE value to the IOMMU table fails with EBUSY
>> if there is a valid entry already. However PAPR specification allows
>> the guest to write new TCE value without clearing it first.
>>
>> Another problem this patch is addressing is the use of pool locks for
>> external IOMMU users such as VFIO. The pool locks are to protect
>> DMA page allocator rather than entries and since the host kernel does
>> not control what pages are in use, there is no point in pool locks and
>> exchange()+put_page(oldtce) is sufficient to avoid possible races.
>>
>> This adds an exchange() callback to iommu_table_ops which does the same
>> thing as set() plus it returns replaced TCE and DMA direction so
>> the caller can release the pages afterwards.
>>
>> The returned old TCE value is a virtual address, like the new TCE value.
>> This is different from tce_clear() which returns a physical address.
>>
>> This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
>> for a platform to have exchange() implemented in order to support VFIO.
>>
>> This replaces iommu_tce_build() and iommu_clear_tce() with
>> a single iommu_tce_xchg().
>>
>> This makes sure that TCE permission bits are not set in TCE passed to
>> IOMMU API as those are to be calculated by platform code from DMA direction.
>>
>> This moves SetPageDirty() to the IOMMU code to make it work for both
>> VFIO ioctl interface and in-kernel TCE acceleration (when it becomes
>> available later).
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>> arch/powerpc/include/asm/iommu.h | 17 ++++++--
>> arch/powerpc/kernel/iommu.c | 53 +++++++++---------------
>> arch/powerpc/platforms/powernv/pci-ioda.c | 38 ++++++++++++++++++
>> arch/powerpc/platforms/powernv/pci-p5ioc2.c | 3 ++
>> arch/powerpc/platforms/powernv/pci.c | 17 ++++++++
>> arch/powerpc/platforms/powernv/pci.h | 2 +
>> drivers/vfio/vfio_iommu_spapr_tce.c | 62 ++++++++++++++++++-----------
>> 7 files changed, 130 insertions(+), 62 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>> index d1f8c6c..bde7ee7 100644
>> --- a/arch/powerpc/include/asm/iommu.h
>> +++ b/arch/powerpc/include/asm/iommu.h
>> @@ -44,11 +44,22 @@ extern int iommu_is_off;
>> extern int iommu_force_on;
>>
>> struct iommu_table_ops {
>> + /* When called with direction==DMA_NONE, it is equal to clear() */
>> int (*set)(struct iommu_table *tbl,
>> long index, long npages,
>> unsigned long uaddr,
>> enum dma_data_direction direction,
>> struct dma_attrs *attrs);
>> +#ifdef CONFIG_IOMMU_API
>> + /*
>> + * Exchanges existing TCE with new TCE plus direction bits;
>> + * returns old TCE and DMA direction mask
>> + */
>> + int (*exchange)(struct iommu_table *tbl,
>> + long index,
>> + unsigned long *tce,
>> + enum dma_data_direction *direction);
>> +#endif
>> void (*clear)(struct iommu_table *tbl,
>> long index, long npages);
>> unsigned long (*get)(struct iommu_table *tbl, long index);
>> @@ -152,6 +163,8 @@ extern void iommu_register_group(struct iommu_table_group *table_group,
>> extern int iommu_add_device(struct device *dev);
>> extern void iommu_del_device(struct device *dev);
>> extern int __init tce_iommu_bus_notifier_init(void);
>> +extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
>> + unsigned long *tce, enum dma_data_direction *direction);
>> #else
>> static inline void iommu_register_group(struct iommu_table_group *table_group,
>> int pci_domain_number,
>> @@ -231,10 +244,6 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
>> unsigned long npages);
>> extern int iommu_tce_put_param_check(struct iommu_table *tbl,
>> unsigned long ioba, unsigned long tce);
>> -extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
>> - unsigned long hwaddr, enum dma_data_direction direction);
>> -extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
>> - unsigned long entry);
>>
>> extern void iommu_flush_tce(struct iommu_table *tbl);
>> extern int iommu_take_ownership(struct iommu_table_group *table_group);
>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>> index 068fe4ff..501e8ee 100644
>> --- a/arch/powerpc/kernel/iommu.c
>> +++ b/arch/powerpc/kernel/iommu.c
>> @@ -982,9 +982,6 @@ EXPORT_SYMBOL_GPL(iommu_tce_clear_param_check);
>> int iommu_tce_put_param_check(struct iommu_table *tbl,
>> unsigned long ioba, unsigned long tce)
>> {
>> - if (!(tce & (TCE_PCI_WRITE | TCE_PCI_READ)))
>> - return -EINVAL;
>> -
>> if (tce & ~(IOMMU_PAGE_MASK(tbl) | TCE_PCI_WRITE | TCE_PCI_READ))
>> return -EINVAL;
>>
>> @@ -1002,44 +999,20 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
>> }
>> EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
>>
>> -unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
>> -{
>> - unsigned long oldtce;
>> - struct iommu_pool *pool = get_pool(tbl, entry);
>> -
>> - spin_lock(&(pool->lock));
>> -
>> - oldtce = tbl->it_ops->get(tbl, entry);
>> - if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
>> - tbl->it_ops->clear(tbl, entry, 1);
>> - else
>> - oldtce = 0;
>> -
>> - spin_unlock(&(pool->lock));
>> -
>> - return oldtce;
>> -}
>> -EXPORT_SYMBOL_GPL(iommu_clear_tce);
>> -
>> /*
>> * hwaddr is a kernel virtual address here (0xc... bazillion),
>> * tce_build converts it to a physical address.
>> */
>> -int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
>> - unsigned long hwaddr, enum dma_data_direction direction)
>> +long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
>> + unsigned long *tce, enum dma_data_direction *direction)
>> {
>> - int ret = -EBUSY;
>> - unsigned long oldtce;
>> - struct iommu_pool *pool = get_pool(tbl, entry);
>> + long ret;
>>
>> - spin_lock(&(pool->lock));
>> + ret = tbl->it_ops->exchange(tbl, entry, tce, direction);
>>
>> - oldtce = tbl->it_ops->get(tbl, entry);
>> - /* Add new entry if it is not busy */
>> - if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
>> - ret = tbl->it_ops->set(tbl, entry, 1, hwaddr, direction, NULL);
>> -
>> - spin_unlock(&(pool->lock));
>> + if (!ret && ((*direction == DMA_FROM_DEVICE) ||
>> + (*direction == DMA_BIDIRECTIONAL)))
>> + SetPageDirty(pfn_to_page(__pa(*tce) >> PAGE_SHIFT));
>>
>> /* if (unlikely(ret))
>> pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
>> @@ -1048,13 +1021,23 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
>>
>> return ret;
>> }
>> -EXPORT_SYMBOL_GPL(iommu_tce_build);
>> +EXPORT_SYMBOL_GPL(iommu_tce_xchg);
>>
>> static int iommu_table_take_ownership(struct iommu_table *tbl)
>> {
>> unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
>> int ret = 0;
>>
>> + /*
>> + * VFIO does not control TCE entries allocation and the guest
>> + * can write new TCEs on top of existing ones so iommu_tce_build()
>> + * must be able to release old pages. This functionality
>> + * requires exchange() callback defined so if it is not
>> + * implemented, we disallow taking ownership over the table.
>> + */
>> + if (!tbl->it_ops->exchange)
>> + return -EINVAL;
>> +
>> spin_lock_irqsave(&tbl->large_pool.lock, flags);
>> for (i = 0; i < tbl->nr_pools; i++)
>> spin_lock(&tbl->pools[i].lock);
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index fd993bc..4d80502 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -1128,6 +1128,20 @@ static int pnv_ioda1_tce_build_vm(struct iommu_table *tbl, long index,
>> return ret;
>> }
>>
>> +#ifdef CONFIG_IOMMU_API
>> +static int pnv_ioda1_tce_xchg_vm(struct iommu_table *tbl, long index,
>> + unsigned long *tce, enum dma_data_direction *direction)
>> +{
>> + long ret = pnv_tce_xchg(tbl, index, tce, direction);
>> +
>> + if (!ret && (tbl->it_type &
>> + (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
>> + pnv_pci_ioda1_tce_invalidate(tbl, index, 1, false);
>> +
>> + return ret;
>> +}
>> +#endif
>> +
>> static void pnv_ioda1_tce_free_vm(struct iommu_table *tbl, long index,
>> long npages)
>> {
>> @@ -1139,6 +1153,9 @@ static void pnv_ioda1_tce_free_vm(struct iommu_table *tbl, long index,
>>
>> struct iommu_table_ops pnv_ioda1_iommu_ops = {
>> .set = pnv_ioda1_tce_build_vm,
>> +#ifdef CONFIG_IOMMU_API
>> + .exchange = pnv_ioda1_tce_xchg_vm,
>> +#endif
>> .clear = pnv_ioda1_tce_free_vm,
>> .get = pnv_tce_get,
>> };
>> @@ -1190,6 +1207,20 @@ static int pnv_ioda2_tce_build_vm(struct iommu_table *tbl, long index,
>> return ret;
>> }
>>
>> +#ifdef CONFIG_IOMMU_API
>> +static int pnv_ioda2_tce_xchg_vm(struct iommu_table *tbl, long index,
>> + unsigned long *tce, enum dma_data_direction *direction)
>> +{
>> + long ret = pnv_tce_xchg(tbl, index, tce, direction);
>> +
>> + if (!ret && (tbl->it_type &
>> + (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
>> + pnv_pci_ioda2_tce_invalidate(tbl, index, 1, false);
>> +
>> + return ret;
>> +}
>> +#endif
>> +
>> static void pnv_ioda2_tce_free_vm(struct iommu_table *tbl, long index,
>> long npages)
>> {
>> @@ -1201,6 +1232,9 @@ static void pnv_ioda2_tce_free_vm(struct iommu_table *tbl, long index,
>>
>> static struct iommu_table_ops pnv_ioda2_iommu_ops = {
>> .set = pnv_ioda2_tce_build_vm,
>> +#ifdef CONFIG_IOMMU_API
>> + .exchange = pnv_ioda2_tce_xchg_vm,
>> +#endif
>> .clear = pnv_ioda2_tce_free_vm,
>> .get = pnv_tce_get,
>> };
>> @@ -1353,6 +1387,7 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
>> pnv_pci_ioda2_set_bypass(pe, true);
>> }
>>
>> +#ifdef CONFIG_IOMMU_API
>> static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
>> bool enable)
>> {
>> @@ -1369,6 +1404,7 @@ static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
>> static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
>> .set_ownership = pnv_ioda2_set_ownership,
>> };
>> +#endif
>>
>> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> struct pnv_ioda_pe *pe)
>> @@ -1437,7 +1473,9 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> }
>> tbl->it_ops = &pnv_ioda2_iommu_ops;
>> iommu_init_table(tbl, phb->hose->node);
>> +#ifdef CONFIG_IOMMU_API
>> pe->table_group.ops = &pnv_pci_ioda2_ops;
>> +#endif
>> iommu_register_group(&pe->table_group, phb->hose->global_number,
>> pe->pe_number);
>>
>> diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
>> index 6906a9c..d2d9092 100644
>> --- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
>> +++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
>> @@ -85,6 +85,9 @@ static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { }
>>
>> static struct iommu_table_ops pnv_p5ioc2_iommu_ops = {
>> .set = pnv_tce_build,
>> +#ifdef CONFIG_IOMMU_API
>> + .exchange = pnv_tce_xchg,
>> +#endif
>> .clear = pnv_tce_free,
>> .get = pnv_tce_get,
>> };
>> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
>> index a8c05de..a9797dd 100644
>> --- a/arch/powerpc/platforms/powernv/pci.c
>> +++ b/arch/powerpc/platforms/powernv/pci.c
>> @@ -615,6 +615,23 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
>> return 0;
>> }
>>
>> +#ifdef CONFIG_IOMMU_API
>> +int pnv_tce_xchg(struct iommu_table *tbl, long index,
>> + unsigned long *tce, enum dma_data_direction *direction)
>> +{
>> + u64 proto_tce = iommu_direction_to_tce_perm(*direction);
>> + unsigned long newtce = __pa(*tce) | proto_tce;
>> + unsigned long idx = index - tbl->it_offset;
>> +
>> + *tce = xchg(pnv_tce(tbl, idx), cpu_to_be64(newtce));
>> + *tce = (unsigned long) __va(be64_to_cpu(*tce));
>> + *direction = iommu_tce_direction(*tce);
>> + *tce &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +
>> + return 0;
>> +}
>> +#endif
>> +
>> void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
>> {
>> long i;
>> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>> index 0d4df32..4d1a78c 100644
>> --- a/arch/powerpc/platforms/powernv/pci.h
>> +++ b/arch/powerpc/platforms/powernv/pci.h
>> @@ -220,6 +220,8 @@ extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
>> unsigned long uaddr, enum dma_data_direction direction,
>> struct dma_attrs *attrs);
>> extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
>> +extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
>> + unsigned long *tce, enum dma_data_direction *direction);
>> extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
>> extern struct iommu_table_ops pnv_ioda1_iommu_ops;
>>
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index d5d8c50..7c3c215 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -251,9 +251,6 @@ static void tce_iommu_unuse_page(struct tce_container *container,
>> {
>> struct page *page;
>>
>> - if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
>> - return;
>> -
>> /*
>> * VFIO cannot map/unmap when a container is not enabled so
>> * we would not need this check but KVM could map/unmap and if
>> @@ -264,10 +261,6 @@ static void tce_iommu_unuse_page(struct tce_container *container,
>> return;
>>
>> page = pfn_to_page(__pa(oldtce) >> PAGE_SHIFT);
>> -
>> - if (oldtce & TCE_PCI_WRITE)
>> - SetPageDirty(page);
>> -
>
> Seems to me that unuse_page() should get a direction parameter,
> instead of moving the PageDirty (and DMA_NONE test) to all the
> callers.
Sorry, I am not following you here. There is just a single gateway for VFIO
to the platform code, which is iommu_tce_xchg(), and that is where the
SetPageDirty() check went. What are "all the callers"?
--
Alexey
On Fri, Apr 17, 2015 at 01:48:13AM +1000, Alexey Kardashevskiy wrote:
> On 04/16/2015 03:55 PM, David Gibson wrote:
> >On Fri, Apr 10, 2015 at 04:30:54PM +1000, Alexey Kardashevskiy wrote:
> >>Modern IBM POWERPC systems support multiple (currently two) TCE tables
> >>per IOMMU group (a.k.a. PE). This adds a iommu_table_group container
> >>for TCE tables. Right now just one table is supported.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>---
> >> arch/powerpc/include/asm/iommu.h | 18 +++--
> >> arch/powerpc/kernel/iommu.c | 34 ++++----
> >> arch/powerpc/platforms/powernv/pci-ioda.c | 38 +++++----
> >> arch/powerpc/platforms/powernv/pci-p5ioc2.c | 17 ++--
> >> arch/powerpc/platforms/powernv/pci.c | 2 +-
> >> arch/powerpc/platforms/powernv/pci.h | 4 +-
> >> arch/powerpc/platforms/pseries/iommu.c | 9 ++-
> >> drivers/vfio/vfio_iommu_spapr_tce.c | 120 ++++++++++++++++++++--------
> >> 8 files changed, 160 insertions(+), 82 deletions(-)
> >>
> >>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>index eb75726..667aa1a 100644
> >>--- a/arch/powerpc/include/asm/iommu.h
> >>+++ b/arch/powerpc/include/asm/iommu.h
> >>@@ -90,9 +90,7 @@ struct iommu_table {
> >> struct iommu_pool pools[IOMMU_NR_POOLS];
> >> unsigned long *it_map; /* A simple allocation bitmap for now */
> >> unsigned long it_page_shift;/* table iommu page size */
> >>-#ifdef CONFIG_IOMMU_API
> >>- struct iommu_group *it_group;
> >>-#endif
> >>+ struct iommu_table_group *it_group;
> >> struct iommu_table_ops *it_ops;
> >> void (*set_bypass)(struct iommu_table *tbl, bool enable);
> >> };
> >>@@ -126,14 +124,24 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
> >> */
> >> extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
> >> int nid);
> >>+
> >>+#define IOMMU_TABLE_GROUP_MAX_TABLES 1
> >>+
> >>+struct iommu_table_group {
> >> #ifdef CONFIG_IOMMU_API
> >>-extern void iommu_register_group(struct iommu_table *tbl,
> >>+ struct iommu_group *group;
> >>+#endif
> >>+ struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
> >
> >There's nothing to indicate which of the tables are in use at the
> >current time. I mean, it doesn't matter now because there's only one,
> >but the patch doesn't make a whole lot of sense without that.
> >
> >>+};
> >>+
> >>+#ifdef CONFIG_IOMMU_API
> >>+extern void iommu_register_group(struct iommu_table_group *table_group,
> >> int pci_domain_number, unsigned long pe_num);
> >> extern int iommu_add_device(struct device *dev);
> >> extern void iommu_del_device(struct device *dev);
> >> extern int __init tce_iommu_bus_notifier_init(void);
> >> #else
> >>-static inline void iommu_register_group(struct iommu_table *tbl,
> >>+static inline void iommu_register_group(struct iommu_table_group *table_group,
> >> int pci_domain_number,
> >> unsigned long pe_num)
> >> {
> >>diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> >>index b39d00a..fd49c8e 100644
> >>--- a/arch/powerpc/kernel/iommu.c
> >>+++ b/arch/powerpc/kernel/iommu.c
> >>@@ -712,17 +712,20 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
> >>
> >> struct iommu_table *iommu_table_alloc(int node)
> >> {
> >>- struct iommu_table *tbl;
> >>+ struct iommu_table_group *table_group;
> >>
> >>- tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
> >>+ table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL,
> >>+ node);
> >>+ table_group->tables[0].it_group = table_group;
> >>
> >>- return tbl;
> >>+ return &table_group->tables[0];
> >> }
> >>
> >> void iommu_free_table(struct iommu_table *tbl, const char *node_name)
> >
> >Surely the free function should take a table group rather than a table
> >as argument.
>
>
> No, it should not. Table lifetimes are not the same even within the
> same group.
If that's so, then this function shouldn't free the group...
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Fri, Apr 17, 2015 at 07:46:23PM +1000, Alexey Kardashevskiy wrote:
> On 04/16/2015 03:55 PM, David Gibson wrote:
> >On Fri, Apr 10, 2015 at 04:30:54PM +1000, Alexey Kardashevskiy wrote:
> >>Modern IBM POWERPC systems support multiple (currently two) TCE tables
> >>per IOMMU group (a.k.a. PE). This adds a iommu_table_group container
> >>for TCE tables. Right now just one table is supported.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>---
> >> arch/powerpc/include/asm/iommu.h | 18 +++--
> >> arch/powerpc/kernel/iommu.c | 34 ++++----
> >> arch/powerpc/platforms/powernv/pci-ioda.c | 38 +++++----
> >> arch/powerpc/platforms/powernv/pci-p5ioc2.c | 17 ++--
> >> arch/powerpc/platforms/powernv/pci.c | 2 +-
> >> arch/powerpc/platforms/powernv/pci.h | 4 +-
> >> arch/powerpc/platforms/pseries/iommu.c | 9 ++-
> >> drivers/vfio/vfio_iommu_spapr_tce.c | 120 ++++++++++++++++++++--------
> >> 8 files changed, 160 insertions(+), 82 deletions(-)
> >>
> >>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>index eb75726..667aa1a 100644
> >>--- a/arch/powerpc/include/asm/iommu.h
> >>+++ b/arch/powerpc/include/asm/iommu.h
> >>@@ -90,9 +90,7 @@ struct iommu_table {
> >> struct iommu_pool pools[IOMMU_NR_POOLS];
> >> unsigned long *it_map; /* A simple allocation bitmap for now */
> >> unsigned long it_page_shift;/* table iommu page size */
> >>-#ifdef CONFIG_IOMMU_API
> >>- struct iommu_group *it_group;
> >>-#endif
> >>+ struct iommu_table_group *it_group;
> >> struct iommu_table_ops *it_ops;
> >> void (*set_bypass)(struct iommu_table *tbl, bool enable);
> >> };
> >>@@ -126,14 +124,24 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
> >> */
> >> extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
> >> int nid);
> >>+
> >>+#define IOMMU_TABLE_GROUP_MAX_TABLES 1
> >>+
> >>+struct iommu_table_group {
> >> #ifdef CONFIG_IOMMU_API
> >>-extern void iommu_register_group(struct iommu_table *tbl,
> >>+ struct iommu_group *group;
> >>+#endif
> >>+ struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
> >
> >There's nothing to indicate which of the tables are in use at the
> >current time. I mean, it doesn't matter now because there's only one,
> >but the patch doesn't make a whole lot of sense without that.
>
>
> Later in the patchset, the code will look at @it_size to know if the table
> is in use.
Ok, that makes sense. Might be worth a comment here.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Fri, Apr 17, 2015 at 08:09:29PM +1000, Alexey Kardashevskiy wrote:
> On 04/16/2015 04:07 PM, David Gibson wrote:
> >On Fri, Apr 10, 2015 at 04:30:56PM +1000, Alexey Kardashevskiy wrote:
> >>At the moment the iommu_table struct has a set_bypass() which enables/
> >>disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
> >>which calls this callback when external IOMMU users such as VFIO are
> >>about to take over a PHB.
> >>
> >>The set_bypass() callback is not really an iommu_table function but
> >>IOMMU/PE function. This introduces a iommu_table_group_ops struct and
> >>adds a set_ownership() callback to it which is called when an external
> >>user takes control over the IOMMU.
> >
> >Do you really need separate ops structures at both the single table
> >and table group level? The different tables in a group will all
> >belong to the same basic iommu won't they?
>
>
> IOMMU tables exist alone in VIO. Also, the platform code uses just a table
> (or it is in bypass mode) and does not care about table groups. It looked
> more clean for myself to keep them separated. Should I still merge
> those?
Ok, that sounds like a reasonable argument for keeping them separate,
at least for now.
> >>This renames set_bypass() to set_ownership() as it is not necessarily
> >>just enabling bypassing, it can be something else/more, so let's give it
> >>a more generic name. The bool parameter is inverted.
> >>
> >>The callback is implemented for IODA2 only. Other platforms (P5IOC2,
> >>IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>---
> >> arch/powerpc/include/asm/iommu.h | 14 +++++++++++++-
> >> arch/powerpc/platforms/powernv/pci-ioda.c | 30 ++++++++++++++++++++++--------
> >> drivers/vfio/vfio_iommu_spapr_tce.c | 25 +++++++++++++++++++++----
> >> 3 files changed, 56 insertions(+), 13 deletions(-)
> >>
> >>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>index b9e50d3..d1f8c6c 100644
> >>--- a/arch/powerpc/include/asm/iommu.h
> >>+++ b/arch/powerpc/include/asm/iommu.h
> >>@@ -92,7 +92,6 @@ struct iommu_table {
> >> unsigned long it_page_shift;/* table iommu page size */
> >> struct iommu_table_group *it_group;
> >> struct iommu_table_ops *it_ops;
> >>- void (*set_bypass)(struct iommu_table *tbl, bool enable);
> >> };
> >>
> >> /* Pure 2^n version of get_order */
> >>@@ -127,11 +126,24 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
> >>
> >> #define IOMMU_TABLE_GROUP_MAX_TABLES 1
> >>
> >>+struct iommu_table_group;
> >>+
> >>+struct iommu_table_group_ops {
> >>+ /*
> >>+ * Switches ownership from the kernel itself to an external
> >>+ * user. While ownership is enabled, the kernel cannot use IOMMU
> >>+ * for itself.
> >>+ */
> >>+ void (*set_ownership)(struct iommu_table_group *table_group,
> >>+ bool enable);
> >
> >The meaning of "enable" in a function called "set_ownership" is
> >entirely obscure.
>
> Suggest something better please :) I have nothing better...
Well, given it's "set_ownership" you could have "owner" - that would
want to be an enum with OWNER_KERNEL and OWNER_VFIO or something
rather than a bool.
Or you could leave it a bool but call it "allow_bypass".
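David's suggestion above can be sketched in userspace C. This is only an illustration of the proposed interface shape, not code from the patchset; the names `IOMMU_OWNER_KERNEL`/`IOMMU_OWNER_VFIO` and the struct layout are assumptions:

```c
#include <stdbool.h>

/* Hypothetical sketch: replace the ambiguous bool with an explicit owner
 * enum so call sites read unambiguously.  Illustrative names only. */
enum iommu_owner {
	IOMMU_OWNER_KERNEL,	/* platform DMA API owns the tables */
	IOMMU_OWNER_VFIO,	/* an external user (VFIO) owns them */
};

struct iommu_table_group_sketch {
	enum iommu_owner owner;
	bool bypass_enabled;
};

/* Handing the group to VFIO disables bypass; handing it back to the
 * kernel re-enables it, matching the inverted-bool behaviour debated
 * in this thread. */
static void set_ownership_sketch(struct iommu_table_group_sketch *grp,
				 enum iommu_owner owner)
{
	grp->owner = owner;
	grp->bypass_enabled = (owner == IOMMU_OWNER_KERNEL);
}
```

With an enum, a call like `set_ownership(grp, IOMMU_OWNER_VFIO)` carries its meaning at the call site, which is the readability problem being raised.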
>
>
> >
> >>+};
> >>+
> >> struct iommu_table_group {
> >> #ifdef CONFIG_IOMMU_API
> >> struct iommu_group *group;
> >> #endif
> >> struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
> >>+ struct iommu_table_group_ops *ops;
> >> };
> >>
> >> #ifdef CONFIG_IOMMU_API
> >>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>index a964c50..9687731 100644
> >>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>@@ -1255,10 +1255,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> >> __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
> >> }
> >>
> >>-static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
> >>+static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
> >> {
> >>- struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe,
> >>- table_group);
> >> uint16_t window_id = (pe->pe_number << 1 ) + 1;
> >> int64_t rc;
> >>
> >>@@ -1286,7 +1284,8 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
> >> * host side.
> >> */
> >> if (pe->pdev)
> >>- set_iommu_table_base(&pe->pdev->dev, tbl);
> >>+ set_iommu_table_base(&pe->pdev->dev,
> >>+ &pe->table_group.tables[0]);
> >> else
> >> pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
> >> }
> >>@@ -1302,13 +1301,27 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
> >> /* TVE #1 is selected by PCI address bit 59 */
> >> pe->tce_bypass_base = 1ull << 59;
> >>
> >>- /* Install set_bypass callback for VFIO */
> >>- pe->table_group.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;
> >>-
> >> /* Enable bypass by default */
> >>- pnv_pci_ioda2_set_bypass(&pe->table_group.tables[0], true);
> >>+ pnv_pci_ioda2_set_bypass(pe, true);
> >> }
> >>
> >>+static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
> >>+ bool enable)
> >>+{
> >>+ struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> >>+ table_group);
> >>+ if (enable)
> >>+ iommu_take_ownership(table_group);
> >>+ else
> >>+ iommu_release_ownership(table_group);
> >>+
> >>+ pnv_pci_ioda2_set_bypass(pe, !enable);
> >>+}
> >>+
> >>+static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
> >>+ .set_ownership = pnv_ioda2_set_ownership,
> >>+};
> >>+
> >> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >> struct pnv_ioda_pe *pe)
> >> {
> >>@@ -1376,6 +1389,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >> }
> >> tbl->it_ops = &pnv_iommu_ops;
> >> iommu_init_table(tbl, phb->hose->node);
> >>+ pe->table_group.ops = &pnv_pci_ioda2_ops;
> >> iommu_register_group(&pe->table_group, phb->hose->global_number,
> >> pe->pe_number);
> >>
> >>diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>index 9f38351..d5d8c50 100644
> >>--- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >>+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>@@ -535,9 +535,22 @@ static int tce_iommu_attach_group(void *iommu_data,
> >> goto unlock_exit;
> >> }
> >>
> >>- ret = iommu_take_ownership(table_group);
> >>- if (!ret)
> >>- container->grp = iommu_group;
> >>+ if (!table_group->ops || !table_group->ops->set_ownership) {
> >>+ ret = iommu_take_ownership(table_group);
> >>+ } else {
> >>+ /*
> >>+ * Disable iommu bypass, otherwise the user can DMA to all of
> >>+ * our physical memory via the bypass window instead of just
> >>+ * the pages that have been explicitly mapped into the iommu
> >>+ */
> >>+ table_group->ops->set_ownership(table_group, true);
> >
> >And here to disable bypass you call it with enable=true, so it doesn't
> >even have the same meaning as it used to.
>
>
> I do not disable bypass per se (even if it what set_ownership(true) does) as
> it is IODA business and VFIO has no idea about it. I do take control over
> the group. I am not following you here - what used to have the same
> meaning?
Well, in set_bypass, the enable parameter was whether bypass was
enabled. Here you're setting enable to true, when you want to
*disable* bypass (in the existing case). If the "enable" parameter
isn't about enabling bypass, its meaning is even more confusing than
I thought.
> >Plus, you should fold the logic to call the callback if necessary into
> >iommu_take_ownership().
>
>
> I really want to keep VFIO stuff out of arch/powerpc/kernel/iommu.c as much
> as possible as it is for platform DMA/IOMMU, not VFIO (which got SPAPR
> driver for that). ops->set_ownership() is one of these things.
What's VFIO specific about this fragment - it's just if you have the
callback, call it, otherwise fall back to the default.
> iommu_take_ownership()/iommu_release_ownership() are helpers for old-style
> commercially-unsupported P5IOC2/IODA1, and this is kind of a hack while
> ops->set_ownership() is an interface for VFIO to do dynamic windows thing.
Can you put their logic into a set_ownership callback for IODA1 then?
> If it makes sense, I could fold the previous patch into this one and move
> iommu_take_ownership()/iommu_release_ownership() to vfio_iommu_spapr_tce.c,
> should I? Or leave things as they are now.
That sounds like it might make sense.
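The dispatch pattern being debated (call the group-level callback when the platform provides one, otherwise fall back to the legacy per-table helper) can be modeled in a few lines of standalone C. All names here are illustrative stand-ins for the kernel structures, under the assumption that only IODA2 populates `ops`:

```c
#include <stddef.h>

/* Minimal model of the attach-time dispatch in tce_iommu_attach_group(). */
struct group_ops { void (*set_ownership)(void *grp, int enable); };
struct group {
	struct group_ops *ops;	/* NULL on P5IOC2/IODA1 in this sketch */
	int legacy_owned;
	int vfio_owned;
};

/* Legacy path: what iommu_take_ownership() stands for here. */
static void legacy_take_ownership(struct group *grp) { grp->legacy_owned = 1; }

/* IODA2 path: ownership callback also implies disabling bypass. */
static void ioda2_set_ownership(void *g, int enable)
{
	((struct group *)g)->vfio_owned = enable;
}

static void attach_group(struct group *grp)
{
	if (!grp->ops || !grp->ops->set_ownership)
		legacy_take_ownership(grp);	/* old-style platforms */
	else
		grp->ops->set_ownership(grp, 1);/* IODA2 dynamic-window path */
}
```

David's point is that this if/else contains nothing VFIO-specific, so it could equally live inside `iommu_take_ownership()` itself, or the legacy logic could become the IODA1 `set_ownership` callback.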
>
>
> >>+ ret = 0;
> >>+ }
> >>+
> >>+ if (ret)
> >>+ goto unlock_exit;
> >>+
> >>+ container->grp = iommu_group;
> >>
> >> unlock_exit:
> >> mutex_unlock(&container->lock);
> >>@@ -572,7 +585,11 @@ static void tce_iommu_detach_group(void *iommu_data,
> >> table_group = iommu_group_get_iommudata(iommu_group);
> >> BUG_ON(!table_group);
> >>
> >>- iommu_release_ownership(table_group);
> >>+ /* Kernel owns the device now, we can restore bypass */
> >>+ if (!table_group->ops || !table_group->ops->set_ownership)
> >>+ iommu_release_ownership(table_group);
> >>+ else
> >>+ table_group->ops->set_ownership(table_group, false);
> >
> >Likewise fold this if into iommu_release_ownership().
> >
> >> unlock_exit:
> >> mutex_unlock(&container->lock);
> >
>
>
On Fri, Apr 17, 2015 at 08:16:13PM +1000, Alexey Kardashevskiy wrote:
> On 04/16/2015 04:10 PM, David Gibson wrote:
> >On Fri, Apr 10, 2015 at 04:30:57PM +1000, Alexey Kardashevskiy wrote:
> >>This adds missing locks in iommu_take_ownership()/
> >>iommu_release_ownership().
> >>
> >>This marks all pages busy in iommu_table::it_map in order to catch
> >>errors if there is an attempt to use this table while ownership over it
> >>is taken.
> >>
> >>This only clears TCE content if there is no page marked busy in it_map.
> >>Clearing must be done outside of the table locks as iommu_clear_tce()
> >>called from iommu_clear_tces_and_put_pages() does this.
> >>
> >>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>---
> >>Changes:
> >>v5:
> >>* do not store bit#0 value, it has to be set for zero-based table
> >>anyway
> >>* removed test_and_clear_bit
> >>---
> >> arch/powerpc/kernel/iommu.c | 26 ++++++++++++++++++++++----
> >> 1 file changed, 22 insertions(+), 4 deletions(-)
> >>
> >>diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> >>index 7d6089b..068fe4ff 100644
> >>--- a/arch/powerpc/kernel/iommu.c
> >>+++ b/arch/powerpc/kernel/iommu.c
> >>@@ -1052,17 +1052,28 @@ EXPORT_SYMBOL_GPL(iommu_tce_build);
> >>
> >> static int iommu_table_take_ownership(struct iommu_table *tbl)
> >> {
> >>- unsigned long sz = (tbl->it_size + 7) >> 3;
> >>+ unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
> >>+ int ret = 0;
> >>+
> >>+ spin_lock_irqsave(&tbl->large_pool.lock, flags);
> >>+ for (i = 0; i < tbl->nr_pools; i++)
> >>+ spin_lock(&tbl->pools[i].lock);
> >>
> >> if (tbl->it_offset == 0)
> >> clear_bit(0, tbl->it_map);
> >>
> >> if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
> >> pr_err("iommu_tce: it_map is not empty");
> >>- return -EBUSY;
> >>+ ret = -EBUSY;
> >>+ if (tbl->it_offset == 0)
> >>+ set_bit(0, tbl->it_map);
> >
> >This really needs a comment. Why on earth are you changing the it_map
> >on a failure case?
>
>
> Does this explain?
>
> /*
> * The platform code reserves zero address in iommu_init_table().
> * As we cleared busy bit for page @0 before using bitmap_empty(),
> * we are restoring it now.
> */
Only partly. What's it reserved for, and why do you know it was
always set on entry?
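The bit#0 dance David is asking about can be shown with a userspace model: entry 0 of a zero-based table is permanently reserved by `iommu_init_table()` (so a cleared TCE of 0 means "unmapped"), hence its busy bit is always set on entry. A sketch, with a single-word bitmap standing in for `it_map` and illustrative names:

```c
/* Model of iommu_table_take_ownership()'s it_map handling: drop the
 * reserved bit#0 before the emptiness check, restore it on failure,
 * and mark everything busy on success so stray kernel use is caught. */
static int take_ownership_sketch(unsigned long *it_map, int zero_based)
{
	if (zero_based)
		*it_map &= ~1UL;		/* clear reserved bit#0 */

	if (*it_map != 0) {			/* bitmap_empty() failed */
		if (zero_based)
			*it_map |= 1UL;		/* restore the reservation */
		return -1;			/* -EBUSY: mappings in flight */
	}

	*it_map = ~0UL;				/* mark all entries busy */
	return 0;
}
```

This makes the failure path explicit: the restore is not a functional change but an undo of the temporary clear, which is exactly the part the quoted comment left unexplained.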
On Fri, Apr 17, 2015 at 08:37:54PM +1000, Alexey Kardashevskiy wrote:
> On 04/16/2015 04:26 PM, David Gibson wrote:
> >On Fri, Apr 10, 2015 at 04:30:59PM +1000, Alexey Kardashevskiy wrote:
> >>At the moment writing new TCE value to the IOMMU table fails with EBUSY
> >>if there is a valid entry already. However PAPR specification allows
> >>the guest to write new TCE value without clearing it first.
> >>
> >>Another problem this patch is addressing is the use of pool locks for
> >>external IOMMU users such as VFIO. The pool locks are to protect
> >>DMA page allocator rather than entries and since the host kernel does
> >>not control what pages are in use, there is no point in pool locks and
> >>exchange()+put_page(oldtce) is sufficient to avoid possible races.
> >>
> >>This adds an exchange() callback to iommu_table_ops which does the same
> >>thing as set() plus it returns replaced TCE and DMA direction so
> >>the caller can release the pages afterwards.
> >>
> >>The returned old TCE value is a virtual address as the new TCE value.
> >>This is different from tce_clear() which returns a physical address.
> >>
> >>This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
> >>for a platform to have exchange() implemented in order to support VFIO.
> >>
> >>This replaces iommu_tce_build() and iommu_clear_tce() with
> >>a single iommu_tce_xchg().
> >>
> >>This makes sure that TCE permission bits are not set in TCE passed to
> >>IOMMU API as those are to be calculated by platform code from DMA direction.
> >>
> >>This moves SetPageDirty() to the IOMMU code to make it work for both
> >>VFIO ioctl interface and in-kernel TCE acceleration (when it becomes
> >>available later).
> >>
> >>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>---
> >> arch/powerpc/include/asm/iommu.h | 17 ++++++--
> >> arch/powerpc/kernel/iommu.c | 53 +++++++++---------------
> >> arch/powerpc/platforms/powernv/pci-ioda.c | 38 ++++++++++++++++++
> >> arch/powerpc/platforms/powernv/pci-p5ioc2.c | 3 ++
> >> arch/powerpc/platforms/powernv/pci.c | 17 ++++++++
> >> arch/powerpc/platforms/powernv/pci.h | 2 +
> >> drivers/vfio/vfio_iommu_spapr_tce.c | 62 ++++++++++++++++++-----------
> >> 7 files changed, 130 insertions(+), 62 deletions(-)
> >>
> >>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>index d1f8c6c..bde7ee7 100644
> >>--- a/arch/powerpc/include/asm/iommu.h
> >>+++ b/arch/powerpc/include/asm/iommu.h
> >>@@ -44,11 +44,22 @@ extern int iommu_is_off;
> >> extern int iommu_force_on;
> >>
> >> struct iommu_table_ops {
> >>+ /* When called with direction==DMA_NONE, it is equal to clear() */
> >> int (*set)(struct iommu_table *tbl,
> >> long index, long npages,
> >> unsigned long uaddr,
> >> enum dma_data_direction direction,
> >> struct dma_attrs *attrs);
> >>+#ifdef CONFIG_IOMMU_API
> >>+ /*
> >>+ * Exchanges existing TCE with new TCE plus direction bits;
> >>+ * returns old TCE and DMA direction mask
> >>+ */
> >>+ int (*exchange)(struct iommu_table *tbl,
> >>+ long index,
> >>+ unsigned long *tce,
> >>+ enum dma_data_direction *direction);
> >>+#endif
> >> void (*clear)(struct iommu_table *tbl,
> >> long index, long npages);
> >> unsigned long (*get)(struct iommu_table *tbl, long index);
> >>@@ -152,6 +163,8 @@ extern void iommu_register_group(struct iommu_table_group *table_group,
> >> extern int iommu_add_device(struct device *dev);
> >> extern void iommu_del_device(struct device *dev);
> >> extern int __init tce_iommu_bus_notifier_init(void);
> >>+extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
> >>+ unsigned long *tce, enum dma_data_direction *direction);
> >> #else
> >> static inline void iommu_register_group(struct iommu_table_group *table_group,
> >> int pci_domain_number,
> >>@@ -231,10 +244,6 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
> >> unsigned long npages);
> >> extern int iommu_tce_put_param_check(struct iommu_table *tbl,
> >> unsigned long ioba, unsigned long tce);
> >>-extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
> >>- unsigned long hwaddr, enum dma_data_direction direction);
> >>-extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
> >>- unsigned long entry);
> >>
> >> extern void iommu_flush_tce(struct iommu_table *tbl);
> >> extern int iommu_take_ownership(struct iommu_table_group *table_group);
> >>diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> >>index 068fe4ff..501e8ee 100644
> >>--- a/arch/powerpc/kernel/iommu.c
> >>+++ b/arch/powerpc/kernel/iommu.c
> >>@@ -982,9 +982,6 @@ EXPORT_SYMBOL_GPL(iommu_tce_clear_param_check);
> >> int iommu_tce_put_param_check(struct iommu_table *tbl,
> >> unsigned long ioba, unsigned long tce)
> >> {
> >>- if (!(tce & (TCE_PCI_WRITE | TCE_PCI_READ)))
> >>- return -EINVAL;
> >>-
> >> if (tce & ~(IOMMU_PAGE_MASK(tbl) | TCE_PCI_WRITE | TCE_PCI_READ))
> >> return -EINVAL;
> >>
> >>@@ -1002,44 +999,20 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
> >> }
> >> EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
> >>
> >>-unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
> >>-{
> >>- unsigned long oldtce;
> >>- struct iommu_pool *pool = get_pool(tbl, entry);
> >>-
> >>- spin_lock(&(pool->lock));
> >>-
> >>- oldtce = tbl->it_ops->get(tbl, entry);
> >>- if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
> >>- tbl->it_ops->clear(tbl, entry, 1);
> >>- else
> >>- oldtce = 0;
> >>-
> >>- spin_unlock(&(pool->lock));
> >>-
> >>- return oldtce;
> >>-}
> >>-EXPORT_SYMBOL_GPL(iommu_clear_tce);
> >>-
> >> /*
> >> * hwaddr is a kernel virtual address here (0xc... bazillion),
> >> * tce_build converts it to a physical address.
> >> */
> >>-int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
> >>- unsigned long hwaddr, enum dma_data_direction direction)
> >>+long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
> >>+ unsigned long *tce, enum dma_data_direction *direction)
> >> {
> >>- int ret = -EBUSY;
> >>- unsigned long oldtce;
> >>- struct iommu_pool *pool = get_pool(tbl, entry);
> >>+ long ret;
> >>
> >>- spin_lock(&(pool->lock));
> >>+ ret = tbl->it_ops->exchange(tbl, entry, tce, direction);
> >>
> >>- oldtce = tbl->it_ops->get(tbl, entry);
> >>- /* Add new entry if it is not busy */
> >>- if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
> >>- ret = tbl->it_ops->set(tbl, entry, 1, hwaddr, direction, NULL);
> >>-
> >>- spin_unlock(&(pool->lock));
> >>+ if (!ret && ((*direction == DMA_FROM_DEVICE) ||
> >>+ (*direction == DMA_BIDIRECTIONAL)))
> >>+ SetPageDirty(pfn_to_page(__pa(*tce) >> PAGE_SHIFT));
> >>
> >> /* if (unlikely(ret))
> >> pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
> >>@@ -1048,13 +1021,23 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
> >>
> >> return ret;
> >> }
> >>-EXPORT_SYMBOL_GPL(iommu_tce_build);
> >>+EXPORT_SYMBOL_GPL(iommu_tce_xchg);
> >>
> >> static int iommu_table_take_ownership(struct iommu_table *tbl)
> >> {
> >> unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
> >> int ret = 0;
> >>
> >>+ /*
> >>+ * VFIO does not control TCE entries allocation and the guest
> >>+ * can write new TCEs on top of existing ones so iommu_tce_build()
> >>+ * must be able to release old pages. This functionality
> >>+ * requires exchange() callback defined so if it is not
> >>+ * implemented, we disallow taking ownership over the table.
> >>+ */
> >>+ if (!tbl->it_ops->exchange)
> >>+ return -EINVAL;
> >>+
> >> spin_lock_irqsave(&tbl->large_pool.lock, flags);
> >> for (i = 0; i < tbl->nr_pools; i++)
> >> spin_lock(&tbl->pools[i].lock);
> >>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>index fd993bc..4d80502 100644
> >>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>@@ -1128,6 +1128,20 @@ static int pnv_ioda1_tce_build_vm(struct iommu_table *tbl, long index,
> >> return ret;
> >> }
> >>
> >>+#ifdef CONFIG_IOMMU_API
> >>+static int pnv_ioda1_tce_xchg_vm(struct iommu_table *tbl, long index,
> >>+ unsigned long *tce, enum dma_data_direction *direction)
> >>+{
> >>+ long ret = pnv_tce_xchg(tbl, index, tce, direction);
> >>+
> >>+ if (!ret && (tbl->it_type &
> >>+ (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
> >>+ pnv_pci_ioda1_tce_invalidate(tbl, index, 1, false);
> >>+
> >>+ return ret;
> >>+}
> >>+#endif
> >>+
> >> static void pnv_ioda1_tce_free_vm(struct iommu_table *tbl, long index,
> >> long npages)
> >> {
> >>@@ -1139,6 +1153,9 @@ static void pnv_ioda1_tce_free_vm(struct iommu_table *tbl, long index,
> >>
> >> struct iommu_table_ops pnv_ioda1_iommu_ops = {
> >> .set = pnv_ioda1_tce_build_vm,
> >>+#ifdef CONFIG_IOMMU_API
> >>+ .exchange = pnv_ioda1_tce_xchg_vm,
> >>+#endif
> >> .clear = pnv_ioda1_tce_free_vm,
> >> .get = pnv_tce_get,
> >> };
> >>@@ -1190,6 +1207,20 @@ static int pnv_ioda2_tce_build_vm(struct iommu_table *tbl, long index,
> >> return ret;
> >> }
> >>
> >>+#ifdef CONFIG_IOMMU_API
> >>+static int pnv_ioda2_tce_xchg_vm(struct iommu_table *tbl, long index,
> >>+ unsigned long *tce, enum dma_data_direction *direction)
> >>+{
> >>+ long ret = pnv_tce_xchg(tbl, index, tce, direction);
> >>+
> >>+ if (!ret && (tbl->it_type &
> >>+ (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
> >>+ pnv_pci_ioda2_tce_invalidate(tbl, index, 1, false);
> >>+
> >>+ return ret;
> >>+}
> >>+#endif
> >>+
> >> static void pnv_ioda2_tce_free_vm(struct iommu_table *tbl, long index,
> >> long npages)
> >> {
> >>@@ -1201,6 +1232,9 @@ static void pnv_ioda2_tce_free_vm(struct iommu_table *tbl, long index,
> >>
> >> static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> >> .set = pnv_ioda2_tce_build_vm,
> >>+#ifdef CONFIG_IOMMU_API
> >>+ .exchange = pnv_ioda2_tce_xchg_vm,
> >>+#endif
> >> .clear = pnv_ioda2_tce_free_vm,
> >> .get = pnv_tce_get,
> >> };
> >>@@ -1353,6 +1387,7 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
> >> pnv_pci_ioda2_set_bypass(pe, true);
> >> }
> >>
> >>+#ifdef CONFIG_IOMMU_API
> >> static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
> >> bool enable)
> >> {
> >>@@ -1369,6 +1404,7 @@ static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
> >> static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
> >> .set_ownership = pnv_ioda2_set_ownership,
> >> };
> >>+#endif
> >>
> >> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >> struct pnv_ioda_pe *pe)
> >>@@ -1437,7 +1473,9 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >> }
> >> tbl->it_ops = &pnv_ioda2_iommu_ops;
> >> iommu_init_table(tbl, phb->hose->node);
> >>+#ifdef CONFIG_IOMMU_API
> >> pe->table_group.ops = &pnv_pci_ioda2_ops;
> >>+#endif
> >> iommu_register_group(&pe->table_group, phb->hose->global_number,
> >> pe->pe_number);
> >>
> >>diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> >>index 6906a9c..d2d9092 100644
> >>--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> >>+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> >>@@ -85,6 +85,9 @@ static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { }
> >>
> >> static struct iommu_table_ops pnv_p5ioc2_iommu_ops = {
> >> .set = pnv_tce_build,
> >>+#ifdef CONFIG_IOMMU_API
> >>+ .exchange = pnv_tce_xchg,
> >>+#endif
> >> .clear = pnv_tce_free,
> >> .get = pnv_tce_get,
> >> };
> >>diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> >>index a8c05de..a9797dd 100644
> >>--- a/arch/powerpc/platforms/powernv/pci.c
> >>+++ b/arch/powerpc/platforms/powernv/pci.c
> >>@@ -615,6 +615,23 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> >> return 0;
> >> }
> >>
> >>+#ifdef CONFIG_IOMMU_API
> >>+int pnv_tce_xchg(struct iommu_table *tbl, long index,
> >>+ unsigned long *tce, enum dma_data_direction *direction)
> >>+{
> >>+ u64 proto_tce = iommu_direction_to_tce_perm(*direction);
> >>+ unsigned long newtce = __pa(*tce) | proto_tce;
> >>+ unsigned long idx = index - tbl->it_offset;
> >>+
> >>+ *tce = xchg(pnv_tce(tbl, idx), cpu_to_be64(newtce));
> >>+ *tce = (unsigned long) __va(be64_to_cpu(*tce));
> >>+ *direction = iommu_tce_direction(*tce);
> >>+ *tce &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
> >>+
> >>+ return 0;
> >>+}
> >>+#endif
> >>+
> >> void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
> >> {
> >> long i;
> >>diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> >>index 0d4df32..4d1a78c 100644
> >>--- a/arch/powerpc/platforms/powernv/pci.h
> >>+++ b/arch/powerpc/platforms/powernv/pci.h
> >>@@ -220,6 +220,8 @@ extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> >> unsigned long uaddr, enum dma_data_direction direction,
> >> struct dma_attrs *attrs);
> >> extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
> >>+extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
> >>+ unsigned long *tce, enum dma_data_direction *direction);
> >> extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
> >> extern struct iommu_table_ops pnv_ioda1_iommu_ops;
> >>
> >>diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>index d5d8c50..7c3c215 100644
> >>--- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >>+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>@@ -251,9 +251,6 @@ static void tce_iommu_unuse_page(struct tce_container *container,
> >> {
> >> struct page *page;
> >>
> >>- if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
> >>- return;
> >>-
> >> /*
> >> * VFIO cannot map/unmap when a container is not enabled so
> >> * we would not need this check but KVM could map/unmap and if
> >>@@ -264,10 +261,6 @@ static void tce_iommu_unuse_page(struct tce_container *container,
> >> return;
> >>
> >> page = pfn_to_page(__pa(oldtce) >> PAGE_SHIFT);
> >>-
> >>- if (oldtce & TCE_PCI_WRITE)
> >>- SetPageDirty(page);
> >>-
> >
> >Seems to me that unuse_page() should get a direction parameter,
> >instead of moving the PageDirty (and DMA_NONE test) to all the
> >callers.
>
>
> Sorry, I am not following you here. There is just a single gateway for VFIO
> to the platform code which is iommu_tce_xchg() and this is where the
> SetPageDirty() check went. What are "all the callers"?
Oh, ok, I think I see. I just saw the dirty check being removed here,
and it's not obvious looking at this patch alone that iommu_tce_xchg()
is called at all the callsites of unuse_page.
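The exchange() semantics that make this centralization possible can be sketched with C11 atomics. This is a userspace model of the behaviour discussed for `pnv_tce_xchg()`, not the real TCE format; the bit layout and names are assumptions:

```c
#include <stdatomic.h>

/* Illustrative permission bits standing in for TCE_PCI_READ/WRITE. */
#define TCE_READ  0x1UL
#define TCE_WRITE 0x2UL

/* The new TCE (address plus permission bits derived from the DMA
 * direction) is atomically swapped into the table slot; the caller gets
 * back the old address and old permissions so it can dirty and release
 * the old page in one place, with no pool lock needed. */
static unsigned long tce_xchg_sketch(_Atomic unsigned long *slot,
				     unsigned long new_addr,
				     unsigned long new_perm,
				     unsigned long *old_perm)
{
	unsigned long old = atomic_exchange(slot, new_addr | new_perm);

	*old_perm = old & (TCE_READ | TCE_WRITE);
	return old & ~(TCE_READ | TCE_WRITE);	/* old address, perms stripped */
}
```

Because the swap is atomic and returns the previous value, a guest overwriting a valid entry no longer races with the host, which is why the pool locks and the EBUSY-on-valid-entry check can go away.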
On Fri, Apr 17, 2015 at 02:29:23AM +1000, Alexey Kardashevskiy wrote:
> On 04/16/2015 04:46 PM, David Gibson wrote:
> >On Fri, Apr 10, 2015 at 04:31:03PM +1000, Alexey Kardashevskiy wrote:
> >>The iommu_free_table helper release memory it is using (the TCE table and
> >>@it_map) and release the iommu_table struct as well. We might not want
> >>the very last step as we store iommu_table in parent structures.
> >
> >Yeah, as I commented on the earlier patch, freeing the surrounding
> >group from a function taking just the individual table is wrong.
>
>
> This is iommu tables created by the old code which stores these iommu_table
> struct pointers in device nodes. I believe there is a plan to get rid of
> iommu tables there and when this is done, this workaround will be
> gone.
Um.. what? The connection between where the pointers are stored and
this obvious error in object lifetime handling is not clear at all.
On 04/20/2015 12:46 PM, David Gibson wrote:
> On Fri, Apr 17, 2015 at 08:16:13PM +1000, Alexey Kardashevskiy wrote:
>> On 04/16/2015 04:10 PM, David Gibson wrote:
>>> On Fri, Apr 10, 2015 at 04:30:57PM +1000, Alexey Kardashevskiy wrote:
>>>> This adds missing locks in iommu_take_ownership()/
>>>> iommu_release_ownership().
>>>>
>>>> This marks all pages busy in iommu_table::it_map in order to catch
>>>> errors if there is an attempt to use this table while ownership over it
>>>> is taken.
>>>>
>>>> This only clears TCE content if there is no page marked busy in it_map.
>>>> Clearing must be done outside of the table locks as iommu_clear_tce()
>>>> called from iommu_clear_tces_and_put_pages() does this.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>> ---
>>>> Changes:
>>>> v5:
>>>> * do not store bit#0 value, it has to be set for zero-based table
>>>> anyway
>>>> * removed test_and_clear_bit
>>>> ---
>>>> arch/powerpc/kernel/iommu.c | 26 ++++++++++++++++++++++----
>>>> 1 file changed, 22 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>>>> index 7d6089b..068fe4ff 100644
>>>> --- a/arch/powerpc/kernel/iommu.c
>>>> +++ b/arch/powerpc/kernel/iommu.c
>>>> @@ -1052,17 +1052,28 @@ EXPORT_SYMBOL_GPL(iommu_tce_build);
>>>>
>>>> static int iommu_table_take_ownership(struct iommu_table *tbl)
>>>> {
>>>> - unsigned long sz = (tbl->it_size + 7) >> 3;
>>>> + unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
>>>> + int ret = 0;
>>>> +
>>>> + spin_lock_irqsave(&tbl->large_pool.lock, flags);
>>>> + for (i = 0; i < tbl->nr_pools; i++)
>>>> + spin_lock(&tbl->pools[i].lock);
>>>>
>>>> if (tbl->it_offset == 0)
>>>> clear_bit(0, tbl->it_map);
>>>>
>>>> if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
>>>> pr_err("iommu_tce: it_map is not empty");
>>>> - return -EBUSY;
>>>> + ret = -EBUSY;
>>>> + if (tbl->it_offset == 0)
>>>> + set_bit(0, tbl->it_map);
>>>
>>> This really needs a comment. Why on earth are you changing the it_map
>>> on a failure case?
>>
>>
>> Does this explain?
>>
>> /*
>> * The platform code reserves zero address in iommu_init_table().
>> * As we cleared busy bit for page @0 before using bitmap_empty(),
>> * we are restoring it now.
>> */
>
> Only partly. What's it reserved for, and why do you know it was
> always set on entry?
Because it is only handled in this file and I can see it in the code. Or I
did not understand the question here...
--
Alexey
On 04/20/2015 12:44 PM, David Gibson wrote:
> On Fri, Apr 17, 2015 at 08:09:29PM +1000, Alexey Kardashevskiy wrote:
>> On 04/16/2015 04:07 PM, David Gibson wrote:
>>> On Fri, Apr 10, 2015 at 04:30:56PM +1000, Alexey Kardashevskiy wrote:
>>>> At the moment the iommu_table struct has a set_bypass() which enables/
>>>> disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
>>>> which calls this callback when external IOMMU users such as VFIO are
>>>> about to take over a PHB.
>>>>
>>>> The set_bypass() callback is not really an iommu_table function but
>>>> IOMMU/PE function. This introduces an iommu_table_group_ops struct and
>>>> adds a set_ownership() callback to it which is called when an external
>>>> user takes control over the IOMMU.
>>>
>>> Do you really need separate ops structures at both the single table
>>> and table group level? The different tables in a group will all
>>> belong to the same basic iommu won't they?
>>
>>
>> IOMMU tables exist alone in VIO. Also, the platform code uses just a table
>> (or it is in bypass mode) and does not care about table groups. It looked
>> cleaner to me to keep them separated. Should I still merge
>> those?
>
> Ok, that sounds like a reasonable argument for keeping them separate,
> at least for now.
>
>>>> This renames set_bypass() to set_ownership() as it is not necessarily
>>>> just enabling bypassing, it can be something else or more, so let's give it
>>>> a more generic name. The bool parameter is inverted.
>>>>
>>>> The callback is implemented for IODA2 only. Other platforms (P5IOC2,
>>>> IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>> ---
>>>> arch/powerpc/include/asm/iommu.h | 14 +++++++++++++-
>>>> arch/powerpc/platforms/powernv/pci-ioda.c | 30 ++++++++++++++++++++++--------
>>>> drivers/vfio/vfio_iommu_spapr_tce.c | 25 +++++++++++++++++++++----
>>>> 3 files changed, 56 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>>>> index b9e50d3..d1f8c6c 100644
>>>> --- a/arch/powerpc/include/asm/iommu.h
>>>> +++ b/arch/powerpc/include/asm/iommu.h
>>>> @@ -92,7 +92,6 @@ struct iommu_table {
>>>> unsigned long it_page_shift;/* table iommu page size */
>>>> struct iommu_table_group *it_group;
>>>> struct iommu_table_ops *it_ops;
>>>> - void (*set_bypass)(struct iommu_table *tbl, bool enable);
>>>> };
>>>>
>>>> /* Pure 2^n version of get_order */
>>>> @@ -127,11 +126,24 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>>>>
>>>> #define IOMMU_TABLE_GROUP_MAX_TABLES 1
>>>>
>>>> +struct iommu_table_group;
>>>> +
>>>> +struct iommu_table_group_ops {
>>>> + /*
>>>> + * Switches ownership from the kernel itself to an external
>>>> + * user. While ownership is enabled, the kernel cannot use IOMMU
>>>> + * for itself.
>>>> + */
>>>> + void (*set_ownership)(struct iommu_table_group *table_group,
>>>> + bool enable);
>>>
>>> The meaning of "enable" in a function called "set_ownership" is
>>> entirely obscure.
>>
>> Suggest something better please :) I have nothing better...
>
> Well, given it's "set_ownership" you could have "owner" - that would
> want to be an enum with OWNER_KERNEL and OWNER_VFIO or something
> rather than a bool.
It is iommu_take_ownership() in upstream and it is assumed that the owner
is anything but the platform code (for now and probably forever - VFIO). I
am not changing this now, just using the same naming approach when adding a
callback with a similar name.
> Or you could leave it a bool but call it "allow_bypass".
Commented below.
>
>>
>>
>>>
>>>> +};
>>>> +
>>>> struct iommu_table_group {
>>>> #ifdef CONFIG_IOMMU_API
>>>> struct iommu_group *group;
>>>> #endif
>>>> struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>>>> + struct iommu_table_group_ops *ops;
>>>> };
>>>>
>>>> #ifdef CONFIG_IOMMU_API
>>>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>> index a964c50..9687731 100644
>>>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>> @@ -1255,10 +1255,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>>>> __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
>>>> }
>>>>
>>>> -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
>>>> +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
>>>> {
>>>> - struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe,
>>>> - table_group);
>>>> uint16_t window_id = (pe->pe_number << 1 ) + 1;
>>>> int64_t rc;
>>>>
>>>> @@ -1286,7 +1284,8 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
>>>> * host side.
>>>> */
>>>> if (pe->pdev)
>>>> - set_iommu_table_base(&pe->pdev->dev, tbl);
>>>> + set_iommu_table_base(&pe->pdev->dev,
>>>> + &pe->table_group.tables[0]);
>>>> else
>>>> pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
>>>> }
>>>> @@ -1302,13 +1301,27 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
>>>> /* TVE #1 is selected by PCI address bit 59 */
>>>> pe->tce_bypass_base = 1ull << 59;
>>>>
>>>> - /* Install set_bypass callback for VFIO */
>>>> - pe->table_group.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;
>>>> -
>>>> /* Enable bypass by default */
>>>> - pnv_pci_ioda2_set_bypass(&pe->table_group.tables[0], true);
>>>> + pnv_pci_ioda2_set_bypass(pe, true);
>>>> }
>>>>
>>>> +static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
>>>> + bool enable)
>>>> +{
>>>> + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
>>>> + table_group);
>>>> + if (enable)
>>>> + iommu_take_ownership(table_group);
>>>> + else
>>>> + iommu_release_ownership(table_group);
>>>> +
>>>> + pnv_pci_ioda2_set_bypass(pe, !enable);
>>>> +}
>>>> +
>>>> +static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
>>>> + .set_ownership = pnv_ioda2_set_ownership,
>>>> +};
>>>> +
>>>> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>>>> struct pnv_ioda_pe *pe)
>>>> {
>>>> @@ -1376,6 +1389,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>>>> }
>>>> tbl->it_ops = &pnv_iommu_ops;
>>>> iommu_init_table(tbl, phb->hose->node);
>>>> + pe->table_group.ops = &pnv_pci_ioda2_ops;
>>>> iommu_register_group(&pe->table_group, phb->hose->global_number,
>>>> pe->pe_number);
>>>>
>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>> index 9f38351..d5d8c50 100644
>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>> @@ -535,9 +535,22 @@ static int tce_iommu_attach_group(void *iommu_data,
>>>> goto unlock_exit;
>>>> }
>>>>
>>>> - ret = iommu_take_ownership(table_group);
>>>> - if (!ret)
>>>> - container->grp = iommu_group;
>>>> + if (!table_group->ops || !table_group->ops->set_ownership) {
>>>> + ret = iommu_take_ownership(table_group);
>>>> + } else {
>>>> + /*
>>>> + * Disable iommu bypass, otherwise the user can DMA to all of
>>>> + * our physical memory via the bypass window instead of just
>>>> + * the pages that have been explicitly mapped into the iommu
>>>> + */
>>>> + table_group->ops->set_ownership(table_group, true);
>>>
>>> And here to disable bypass you call it with enable=true, so it doesn't
>>> even have the same meaning as it used to.
>>
>>
>> I do not disable bypass per se (even if that is what set_ownership(true) does) as
>> it is IODA business and VFIO has no idea about it. I do take control over
>> the group. I am not following you here - what used to have the same
>> meaning?
>
> Well, in set_bypass, the enable parameter was whether bypass was
> enabled. Here you're setting enable to true, when you want to
> *disable* bypass (in the existing case). If the "enable" parameter
> isn't about enabling bypass, its meaning is even more confusing than
> I thought.
Its meaning is "take ownership over the group". In this patch
set_ownership(true) means set_bypass(false).
But later (in 25/31) set_ownership(true) becomes unset(windows0) +
free(table0) + set_bypass(false) = clear DMA setup for the group (i.e.
invalidate both TVTs) so it is not just about bypass (which is TVT#1 but
not TVT#0) anymore.
>>> Plus, you should fold the logic to call the callback if necessary into
>>> iommu_take_ownership().
>>
>>
>> I really want to keep VFIO stuff out of arch/powerpc/kernel/iommu.c as much
>> as possible as it is for platform DMA/IOMMU, not VFIO (which got SPAPR
>> driver for that). ops->set_ownership() is one of these things.
>
> What's VFIO specific about this fragment - it's just if you have the
> callback, call it, otherwise fall back to the default.
>
>> iommu_take_ownership()/iommu_release_ownership() are helpers for old-style
>> commercially-unsupported P5IOC2/IODA1, and this is kind of a hack while
>> ops->set_ownership() is an interface for VFIO to do dynamic windows thing.
>
> Can you put their logic into a set_ownership callback for IODA1 then?
And P5IOC2, and pseries. We know these callbacks will call the same
iommu_take_ownership() and iommu_release_ownership() and this is not going
to change. Too invasive for such a hack imho.
>> If it makes sense, I could fold the previous patch into this one and move
>> iommu_take_ownership()/iommu_release_ownership() to vfio_iommu_spapr_tce.c,
>> should I? Or leave things as they are now.
>
> That sounds like it might make sense.
This is what will go to v9, looks cleaner. Thanks.
>>
>>
>>>> + ret = 0;
>>>> + }
>>>> +
>>>> + if (ret)
>>>> + goto unlock_exit;
>>>> +
>>>> + container->grp = iommu_group;
>>>>
>>>> unlock_exit:
>>>> mutex_unlock(&container->lock);
>>>> @@ -572,7 +585,11 @@ static void tce_iommu_detach_group(void *iommu_data,
>>>> table_group = iommu_group_get_iommudata(iommu_group);
>>>> BUG_ON(!table_group);
>>>>
>>>> - iommu_release_ownership(table_group);
>>>> + /* Kernel owns the device now, we can restore bypass */
>>>> + if (!table_group->ops || !table_group->ops->set_ownership)
>>>> + iommu_release_ownership(table_group);
>>>> + else
>>>> + table_group->ops->set_ownership(table_group, false);
>>>
>>> Likewise fold this if into iommu_release_ownership().
>>>
>>>> unlock_exit:
>>>> mutex_unlock(&container->lock);
>>>
>>
>>
>
--
Alexey
On Mon, Apr 20, 2015 at 04:34:24PM +1000, Alexey Kardashevskiy wrote:
> On 04/20/2015 12:46 PM, David Gibson wrote:
> >On Fri, Apr 17, 2015 at 08:16:13PM +1000, Alexey Kardashevskiy wrote:
> >>On 04/16/2015 04:10 PM, David Gibson wrote:
> >>>On Fri, Apr 10, 2015 at 04:30:57PM +1000, Alexey Kardashevskiy wrote:
> >>>>This adds missing locks in iommu_take_ownership()/
> >>>>iommu_release_ownership().
> >>>>
> >>>>This marks all pages busy in iommu_table::it_map in order to catch
> >>>>errors if there is an attempt to use this table while ownership over it
> >>>>is taken.
> >>>>
> >>>>This only clears TCE content if there is no page marked busy in it_map.
> >>>>Clearing must be done outside of the table locks as iommu_clear_tce()
> >>>>called from iommu_clear_tces_and_put_pages() does this.
> >>>>
> >>>>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>>>---
> >>>>Changes:
> >>>>v5:
> >>>>* do not store bit#0 value, it has to be set for zero-based table
> >>>>anyway
> >>>>* removed test_and_clear_bit
> >>>>---
> >>>> arch/powerpc/kernel/iommu.c | 26 ++++++++++++++++++++++----
> >>>> 1 file changed, 22 insertions(+), 4 deletions(-)
> >>>>
> >>>>diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> >>>>index 7d6089b..068fe4ff 100644
> >>>>--- a/arch/powerpc/kernel/iommu.c
> >>>>+++ b/arch/powerpc/kernel/iommu.c
> >>>>@@ -1052,17 +1052,28 @@ EXPORT_SYMBOL_GPL(iommu_tce_build);
> >>>>
> >>>> static int iommu_table_take_ownership(struct iommu_table *tbl)
> >>>> {
> >>>>- unsigned long sz = (tbl->it_size + 7) >> 3;
> >>>>+ unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
> >>>>+ int ret = 0;
> >>>>+
> >>>>+ spin_lock_irqsave(&tbl->large_pool.lock, flags);
> >>>>+ for (i = 0; i < tbl->nr_pools; i++)
> >>>>+ spin_lock(&tbl->pools[i].lock);
> >>>>
> >>>> if (tbl->it_offset == 0)
> >>>> clear_bit(0, tbl->it_map);
> >>>>
> >>>> if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
> >>>> pr_err("iommu_tce: it_map is not empty");
> >>>>- return -EBUSY;
> >>>>+ ret = -EBUSY;
> >>>>+ if (tbl->it_offset == 0)
> >>>>+ set_bit(0, tbl->it_map);
> >>>
> >>>This really needs a comment. Why on earth are you changing the it_map
> >>>on a failure case?
> >>
> >>
> >>Does this explain?
> >>
> >>/*
> >> * The platform code reserves zero address in iommu_init_table().
> >> * As we cleared busy bit for page @0 before using bitmap_empty(),
> >> * we are restoring it now.
> >> */
> >
> >Only partly. What's it reserved for, and why do you know it was
> >always set on entry?
>
>
> Because it is only handled in this file and I can see it in the code. Or I
> did not understand the question here...
Sure, you can see it in the code, but you've been looking at this code
for months and years. For anyone looking at this function who's not
familiar with the rest of the IOMMU code, a pointer to what's going on
here would be really helpful.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Mon, Apr 20, 2015 at 04:55:32PM +1000, Alexey Kardashevskiy wrote:
> On 04/20/2015 12:44 PM, David Gibson wrote:
> >On Fri, Apr 17, 2015 at 08:09:29PM +1000, Alexey Kardashevskiy wrote:
> >>On 04/16/2015 04:07 PM, David Gibson wrote:
> >>>On Fri, Apr 10, 2015 at 04:30:56PM +1000, Alexey Kardashevskiy wrote:
> >>>>At the moment the iommu_table struct has a set_bypass() which enables/
> >>>>disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
> >>>>which calls this callback when external IOMMU users such as VFIO are
> >>>>about to get over a PHB.
> >>>>
> >>>>The set_bypass() callback is not really an iommu_table function but
> >>>>IOMMU/PE function. This introduces an iommu_table_group_ops struct and
> >>>>adds a set_ownership() callback to it which is called when an external
> >>>>user takes control over the IOMMU.
> >>>
> >>>Do you really need separate ops structures at both the single table
> >>>and table group level? The different tables in a group will all
> >>>belong to the same basic iommu won't they?
> >>
> >>
> >>IOMMU tables exist alone in VIO. Also, the platform code uses just a table
> >>(or it is in bypass mode) and does not care about table groups. It looked
> >>cleaner to me to keep them separated. Should I still merge
> >>those?
> >
> >Ok, that sounds like a reasonable argument for keeping them separate,
> >at least for now.
> >
> >>>>This renames set_bypass() to set_ownership() as it is not necessarily
> >>>>just enabling bypassing, it can be something else/more so let's give it
> >>>>a more generic name. The bool parameter is inverted.
> >>>>
> >>>>The callback is implemented for IODA2 only. Other platforms (P5IOC2,
> >>>>IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
> >>>>
> >>>>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>>>---
> >>>> arch/powerpc/include/asm/iommu.h | 14 +++++++++++++-
> >>>> arch/powerpc/platforms/powernv/pci-ioda.c | 30 ++++++++++++++++++++++--------
> >>>> drivers/vfio/vfio_iommu_spapr_tce.c | 25 +++++++++++++++++++++----
> >>>> 3 files changed, 56 insertions(+), 13 deletions(-)
> >>>>
> >>>>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>>>index b9e50d3..d1f8c6c 100644
> >>>>--- a/arch/powerpc/include/asm/iommu.h
> >>>>+++ b/arch/powerpc/include/asm/iommu.h
> >>>>@@ -92,7 +92,6 @@ struct iommu_table {
> >>>> unsigned long it_page_shift;/* table iommu page size */
> >>>> struct iommu_table_group *it_group;
> >>>> struct iommu_table_ops *it_ops;
> >>>>- void (*set_bypass)(struct iommu_table *tbl, bool enable);
> >>>> };
> >>>>
> >>>> /* Pure 2^n version of get_order */
> >>>>@@ -127,11 +126,24 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
> >>>>
> >>>> #define IOMMU_TABLE_GROUP_MAX_TABLES 1
> >>>>
> >>>>+struct iommu_table_group;
> >>>>+
> >>>>+struct iommu_table_group_ops {
> >>>>+ /*
> >>>>+ * Switches ownership from the kernel itself to an external
> >>>>+ * user. While ownership is enabled, the kernel cannot use IOMMU
> >>>>+ * for itself.
> >>>>+ */
> >>>>+ void (*set_ownership)(struct iommu_table_group *table_group,
> >>>>+ bool enable);
> >>>
> >>>The meaning of "enable" in a function called "set_ownership" is
> >>>entirely obscure.
> >>
> >>Suggest something better please :) I have nothing better...
> >
> >Well, given it's "set_ownership" you could have "owner" - that would
> >want to be an enum with OWNER_KERNEL and OWNER_VFIO or something
> >rather than a bool.
>
>
> It is iommu_take_ownership() in upstream and it is assumed that the owner is
> anything but the platform code (for now and probably for ever - VFIO). I am
> not changing this now, just using same naming approach when adding a
> callback with a similar name.
So "enabled" actually means that non-kernel ownership is enabled. That
is totally non-obvious.
> >Or you could leave it a bool but call it "allow_bypass".
>
> Commented below.
>
> >>>>+};
> >>>>+
> >>>> struct iommu_table_group {
> >>>> #ifdef CONFIG_IOMMU_API
> >>>> struct iommu_group *group;
> >>>> #endif
> >>>> struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
> >>>>+ struct iommu_table_group_ops *ops;
> >>>> };
> >>>>
> >>>> #ifdef CONFIG_IOMMU_API
> >>>>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>>>index a964c50..9687731 100644
> >>>>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>>>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>>>@@ -1255,10 +1255,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> >>>> __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
> >>>> }
> >>>>
> >>>>-static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
> >>>>+static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
> >>>> {
> >>>>- struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe,
> >>>>- table_group);
> >>>> uint16_t window_id = (pe->pe_number << 1 ) + 1;
> >>>> int64_t rc;
> >>>>
> >>>>@@ -1286,7 +1284,8 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
> >>>> * host side.
> >>>> */
> >>>> if (pe->pdev)
> >>>>- set_iommu_table_base(&pe->pdev->dev, tbl);
> >>>>+ set_iommu_table_base(&pe->pdev->dev,
> >>>>+ &pe->table_group.tables[0]);
> >>>> else
> >>>> pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
> >>>> }
> >>>>@@ -1302,13 +1301,27 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
> >>>> /* TVE #1 is selected by PCI address bit 59 */
> >>>> pe->tce_bypass_base = 1ull << 59;
> >>>>
> >>>>- /* Install set_bypass callback for VFIO */
> >>>>- pe->table_group.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;
> >>>>-
> >>>> /* Enable bypass by default */
> >>>>- pnv_pci_ioda2_set_bypass(&pe->table_group.tables[0], true);
> >>>>+ pnv_pci_ioda2_set_bypass(pe, true);
> >>>> }
> >>>>
> >>>>+static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
> >>>>+ bool enable)
> >>>>+{
> >>>>+ struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> >>>>+ table_group);
> >>>>+ if (enable)
> >>>>+ iommu_take_ownership(table_group);
> >>>>+ else
> >>>>+ iommu_release_ownership(table_group);
> >>>>+
> >>>>+ pnv_pci_ioda2_set_bypass(pe, !enable);
> >>>>+}
> >>>>+
> >>>>+static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
> >>>>+ .set_ownership = pnv_ioda2_set_ownership,
> >>>>+};
> >>>>+
> >>>> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >>>> struct pnv_ioda_pe *pe)
> >>>> {
> >>>>@@ -1376,6 +1389,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >>>> }
> >>>> tbl->it_ops = &pnv_iommu_ops;
> >>>> iommu_init_table(tbl, phb->hose->node);
> >>>>+ pe->table_group.ops = &pnv_pci_ioda2_ops;
> >>>> iommu_register_group(&pe->table_group, phb->hose->global_number,
> >>>> pe->pe_number);
> >>>>
> >>>>diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>index 9f38351..d5d8c50 100644
> >>>>--- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>@@ -535,9 +535,22 @@ static int tce_iommu_attach_group(void *iommu_data,
> >>>> goto unlock_exit;
> >>>> }
> >>>>
> >>>>- ret = iommu_take_ownership(table_group);
> >>>>- if (!ret)
> >>>>- container->grp = iommu_group;
> >>>>+ if (!table_group->ops || !table_group->ops->set_ownership) {
> >>>>+ ret = iommu_take_ownership(table_group);
> >>>>+ } else {
> >>>>+ /*
> >>>>+ * Disable iommu bypass, otherwise the user can DMA to all of
> >>>>+ * our physical memory via the bypass window instead of just
> >>>>+ * the pages that have been explicitly mapped into the iommu
> >>>>+ */
> >>>>+ table_group->ops->set_ownership(table_group, true);
> >>>
> >>>And here to disable bypass you call it with enable=true, so it doesn't
> >>>even have the same meaning as it used to.
> >>
> >>
> >>I do not disable bypass per se (even if that is what set_ownership(true) does) as
> >>it is IODA business and VFIO has no idea about it. I do take control over
> >>the group. I am not following you here - what used to have the same
> >>meaning?
> >
> >Well, in set_bypass, the enable parameter was whether bypass was
> >enabled. Here you're setting enable to true, when you want to
> >*disable* bypass (in the existing case). If the "enable" parameter
> >isn't about enabling bypass, its meaning is even more confusing than
> >I thought.
>
>
> Its meaning is "take ownership over the group". In this patch
> set_ownership(true) means set_bypass(false).
Ok. So "take_ownership" isn't quite as clear as I'd like, but it's
not too bad because it's implied that it's the caller that's taking
the ownership. *set* ownership makes no sense without saying who the
new owner is. "enable" has no clear meaning in that context.
Calling it "kernel_owned" or "non_kernel_owned" would be ok if a bit
clunky.
> But later (in 25/31) set_ownership(true) becomes unset(windows0) +
> free(table0) + set_bypass(false) = clear DMA setup for the group (i.e.
> invalidate both TVTs) so it is not just about bypass (which is TVT#1 but not
> TVT#0) anymore.
Right, I have no problem with a combined function for the operation
here. It's purely a naming thing "set_ownership" and "enable" are
just not concepts that fit together sensibly.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On 04/21/2015 07:43 PM, David Gibson wrote:
> On Mon, Apr 20, 2015 at 04:55:32PM +1000, Alexey Kardashevskiy wrote:
>> On 04/20/2015 12:44 PM, David Gibson wrote:
>>> On Fri, Apr 17, 2015 at 08:09:29PM +1000, Alexey Kardashevskiy wrote:
>>>> On 04/16/2015 04:07 PM, David Gibson wrote:
>>>>> On Fri, Apr 10, 2015 at 04:30:56PM +1000, Alexey Kardashevskiy wrote:
>>>>>> At the moment the iommu_table struct has a set_bypass() which enables/
>>>>>> disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
>>>>>> which calls this callback when external IOMMU users such as VFIO are
>>>>>> about to get over a PHB.
>>>>>>
>>>>>> The set_bypass() callback is not really an iommu_table function but
>>>>>> IOMMU/PE function. This introduces an iommu_table_group_ops struct and
>>>>>> adds a set_ownership() callback to it which is called when an external
>>>>>> user takes control over the IOMMU.
>>>>>
>>>>> Do you really need separate ops structures at both the single table
>>>>> and table group level? The different tables in a group will all
>>>>> belong to the same basic iommu won't they?
>>>>
>>>>
>>>> IOMMU tables exist alone in VIO. Also, the platform code uses just a table
>>>> (or it is in bypass mode) and does not care about table groups. It looked
>>>> cleaner to me to keep them separated. Should I still merge
>>>> those?
>>>
>>> Ok, that sounds like a reasonable argument for keeping them separate,
>>> at least for now.
>>>
>>>>>> This renames set_bypass() to set_ownership() as it is not necessarily
>>>>>> just enabling bypassing, it can be something else/more so let's give it
>>>>>> a more generic name. The bool parameter is inverted.
>>>>>>
>>>>>> The callback is implemented for IODA2 only. Other platforms (P5IOC2,
>>>>>> IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>>>> ---
>>>>>> arch/powerpc/include/asm/iommu.h | 14 +++++++++++++-
>>>>>> arch/powerpc/platforms/powernv/pci-ioda.c | 30 ++++++++++++++++++++++--------
>>>>>> drivers/vfio/vfio_iommu_spapr_tce.c | 25 +++++++++++++++++++++----
>>>>>> 3 files changed, 56 insertions(+), 13 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>>>>>> index b9e50d3..d1f8c6c 100644
>>>>>> --- a/arch/powerpc/include/asm/iommu.h
>>>>>> +++ b/arch/powerpc/include/asm/iommu.h
>>>>>> @@ -92,7 +92,6 @@ struct iommu_table {
>>>>>> unsigned long it_page_shift;/* table iommu page size */
>>>>>> struct iommu_table_group *it_group;
>>>>>> struct iommu_table_ops *it_ops;
>>>>>> - void (*set_bypass)(struct iommu_table *tbl, bool enable);
>>>>>> };
>>>>>>
>>>>>> /* Pure 2^n version of get_order */
>>>>>> @@ -127,11 +126,24 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>>>>>>
>>>>>> #define IOMMU_TABLE_GROUP_MAX_TABLES 1
>>>>>>
>>>>>> +struct iommu_table_group;
>>>>>> +
>>>>>> +struct iommu_table_group_ops {
>>>>>> + /*
>>>>>> + * Switches ownership from the kernel itself to an external
>>>>>> + * user. While ownership is enabled, the kernel cannot use IOMMU
>>>>>> + * for itself.
>>>>>> + */
>>>>>> + void (*set_ownership)(struct iommu_table_group *table_group,
>>>>>> + bool enable);
>>>>>
>>>>> The meaning of "enable" in a function called "set_ownership" is
>>>>> entirely obscure.
>>>>
>>>> Suggest something better please :) I have nothing better...
>>>
>>> Well, given it's "set_ownership" you could have "owner" - that would
>>> want to be an enum with OWNER_KERNEL and OWNER_VFIO or something
>>> rather than a bool.
>>
>>
>> It is iommu_take_ownership() in upstream and it is assumed that the owner is
>> anything but the platform code (for now and probably for ever - VFIO). I am
>> not changing this now, just using same naming approach when adding a
>> callback with a similar name.
>
> So "enabled" actually means that non-kernel ownership is enabled. That
> is totally non-obvious.
>
>>> Or you could leave it a bool but call it "allow_bypass".
>>
>> Commented below.
>>
>>>>>> +};
>>>>>> +
>>>>>> struct iommu_table_group {
>>>>>> #ifdef CONFIG_IOMMU_API
>>>>>> struct iommu_group *group;
>>>>>> #endif
>>>>>> struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>>>>>> + struct iommu_table_group_ops *ops;
>>>>>> };
>>>>>>
>>>>>> #ifdef CONFIG_IOMMU_API
>>>>>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>>> index a964c50..9687731 100644
>>>>>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>>> @@ -1255,10 +1255,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>>>>>> __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
>>>>>> }
>>>>>>
>>>>>> -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
>>>>>> +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
>>>>>> {
>>>>>> - struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe,
>>>>>> - table_group);
>>>>>> uint16_t window_id = (pe->pe_number << 1 ) + 1;
>>>>>> int64_t rc;
>>>>>>
>>>>>> @@ -1286,7 +1284,8 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
>>>>>> * host side.
>>>>>> */
>>>>>> if (pe->pdev)
>>>>>> - set_iommu_table_base(&pe->pdev->dev, tbl);
>>>>>> + set_iommu_table_base(&pe->pdev->dev,
>>>>>> + &pe->table_group.tables[0]);
>>>>>> else
>>>>>> pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
>>>>>> }
>>>>>> @@ -1302,13 +1301,27 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
>>>>>> /* TVE #1 is selected by PCI address bit 59 */
>>>>>> pe->tce_bypass_base = 1ull << 59;
>>>>>>
>>>>>> - /* Install set_bypass callback for VFIO */
>>>>>> - pe->table_group.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;
>>>>>> -
>>>>>> /* Enable bypass by default */
>>>>>> - pnv_pci_ioda2_set_bypass(&pe->table_group.tables[0], true);
>>>>>> + pnv_pci_ioda2_set_bypass(pe, true);
>>>>>> }
>>>>>>
>>>>>> +static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
>>>>>> + bool enable)
>>>>>> +{
>>>>>> + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
>>>>>> + table_group);
>>>>>> + if (enable)
>>>>>> + iommu_take_ownership(table_group);
>>>>>> + else
>>>>>> + iommu_release_ownership(table_group);
>>>>>> +
>>>>>> + pnv_pci_ioda2_set_bypass(pe, !enable);
>>>>>> +}
>>>>>> +
>>>>>> +static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
>>>>>> + .set_ownership = pnv_ioda2_set_ownership,
>>>>>> +};
>>>>>> +
>>>>>> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>>>>>> struct pnv_ioda_pe *pe)
>>>>>> {
>>>>>> @@ -1376,6 +1389,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>>>>>> }
>>>>>> tbl->it_ops = &pnv_iommu_ops;
>>>>>> iommu_init_table(tbl, phb->hose->node);
>>>>>> + pe->table_group.ops = &pnv_pci_ioda2_ops;
>>>>>> iommu_register_group(&pe->table_group, phb->hose->global_number,
>>>>>> pe->pe_number);
>>>>>>
>>>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>> index 9f38351..d5d8c50 100644
>>>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>> @@ -535,9 +535,22 @@ static int tce_iommu_attach_group(void *iommu_data,
>>>>>> goto unlock_exit;
>>>>>> }
>>>>>>
>>>>>> - ret = iommu_take_ownership(table_group);
>>>>>> - if (!ret)
>>>>>> - container->grp = iommu_group;
>>>>>> + if (!table_group->ops || !table_group->ops->set_ownership) {
>>>>>> + ret = iommu_take_ownership(table_group);
>>>>>> + } else {
>>>>>> + /*
>>>>>> + * Disable iommu bypass, otherwise the user can DMA to all of
>>>>>> + * our physical memory via the bypass window instead of just
>>>>>> + * the pages that have been explicitly mapped into the iommu
>>>>>> + */
>>>>>> + table_group->ops->set_ownership(table_group, true);
>>>>>
>>>>> And here to disable bypass you call it with enable=true, so it doesn't
>>>>> even have the same meaning as it used to.
>>>>
>>>>
>>>> I do not disable bypass per se (even if that is what set_ownership(true) does) as
>>>> it is IODA business and VFIO has no idea about it. I do take control over
>>>> the group. I am not following you here - what used to have the same
>>>> meaning?
>>>
>>> Well, in set_bypass, the enable parameter was whether bypass was
>>> enabled. Here you're setting enable to true, when you want to
>>> *disable* bypass (in the existing case). If the "enable" parameter
>>> isn't about enabling bypass, its meaning is even more confusing than
>>> I thought.
>>
>>
>> Its meaning is "take ownership over the group". In this patch
>> set_ownership(true) means set_bypass(false).
>
> Ok. So "take_ownership" isn't quite as clear as I'd like, but it's
> not too bad because it's implied that it's the caller that's taking
> the ownership. *set* ownership makes no sense without saying who the
> new owner is. "enable" has no clear meaning in that context.
>
> Calling it "kernel_owned" or "non_kernel_owned" would be ok if a bit
> clunky.
Strictly speaking, VFIO and platform code are both kernel.
So which one to choose?
+struct iommu_table_group_ops {
+ void (*take_ownership)(struct iommu_table_group *table_group);
+ void (*release_ownership)(struct iommu_table_group *table_group);
+};
OR
+enum { IOMMU_TABLE_GROUP_OWNER_KERNEL, IOMMU_TABLE_GROUP_OWNER_VFIO };
+struct iommu_table_group_ops {
+ void (*set_ownership)(struct iommu_table_group *table_group,
+ long owner);
+};
I have bad taste for names like this, need a hint here, please :)
>
>> But later (in 25/31) set_ownership(true) becomes unset(windows0) +
>> free(table0) + set_bypass(false) = clear DMA setup for the group (i.e.
>> invalidate both TVTs) so it is not just about bypass (which is TVT#1 but not
>> TVT#0) anymore.
>
> Right, I have no problem with a combined function for the operation
> here. It's purely a naming thing "set_ownership" and "enable" are
> just not concepts that fit together sensibly.
--
Alexey
On Tue, Apr 21, 2015 at 09:47:54PM +1000, Alexey Kardashevskiy wrote:
> On 04/21/2015 07:43 PM, David Gibson wrote:
> >On Mon, Apr 20, 2015 at 04:55:32PM +1000, Alexey Kardashevskiy wrote:
> >>On 04/20/2015 12:44 PM, David Gibson wrote:
> >>>On Fri, Apr 17, 2015 at 08:09:29PM +1000, Alexey Kardashevskiy wrote:
> >>>>On 04/16/2015 04:07 PM, David Gibson wrote:
> >>>>>On Fri, Apr 10, 2015 at 04:30:56PM +1000, Alexey Kardashevskiy wrote:
> >>>>>>At the moment the iommu_table struct has a set_bypass() which enables/
> >>>>>>disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
> >>>>>>which calls this callback when external IOMMU users such as VFIO are
> >>>>>>about to get over a PHB.
> >>>>>>
> >>>>>>The set_bypass() callback is not really an iommu_table function but
> >>>>>>IOMMU/PE function. This introduces an iommu_table_group_ops struct and
> >>>>>>adds a set_ownership() callback to it which is called when an external
> >>>>>>user takes control over the IOMMU.
> >>>>>
> >>>>>Do you really need separate ops structures at both the single table
> >>>>>and table group level? The different tables in a group will all
> >>>>>belong to the same basic iommu won't they?
> >>>>
> >>>>
> >>>>IOMMU tables exist alone in VIO. Also, the platform code uses just a table
> >>>>(or it is in bypass mode) and does not care about table groups. It looked
> >>>>cleaner to me to keep them separated. Should I still merge
> >>>>those?
> >>>
> >>>Ok, that sounds like a reasonable argument for keeping them separate,
> >>>at least for now.
> >>>
> >>>>>>This renames set_bypass() to set_ownership() as it is not necessarily
> >>>>>>just enabling bypassing, it can be something else/more so let's give it
> >>>>>>a more generic name. The bool parameter is inverted.
> >>>>>>
> >>>>>>The callback is implemented for IODA2 only. Other platforms (P5IOC2,
> >>>>>>IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
> >>>>>>
> >>>>>>Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>>>>>---
> >>>>>> arch/powerpc/include/asm/iommu.h | 14 +++++++++++++-
> >>>>>> arch/powerpc/platforms/powernv/pci-ioda.c | 30 ++++++++++++++++++++++--------
> >>>>>> drivers/vfio/vfio_iommu_spapr_tce.c | 25 +++++++++++++++++++++----
> >>>>>> 3 files changed, 56 insertions(+), 13 deletions(-)
> >>>>>>
> >>>>>>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>>>>>index b9e50d3..d1f8c6c 100644
> >>>>>>--- a/arch/powerpc/include/asm/iommu.h
> >>>>>>+++ b/arch/powerpc/include/asm/iommu.h
> >>>>>>@@ -92,7 +92,6 @@ struct iommu_table {
> >>>>>> unsigned long it_page_shift;/* table iommu page size */
> >>>>>> struct iommu_table_group *it_group;
> >>>>>> struct iommu_table_ops *it_ops;
> >>>>>>- void (*set_bypass)(struct iommu_table *tbl, bool enable);
> >>>>>> };
> >>>>>>
> >>>>>> /* Pure 2^n version of get_order */
> >>>>>>@@ -127,11 +126,24 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
> >>>>>>
> >>>>>> #define IOMMU_TABLE_GROUP_MAX_TABLES 1
> >>>>>>
> >>>>>>+struct iommu_table_group;
> >>>>>>+
> >>>>>>+struct iommu_table_group_ops {
> >>>>>>+ /*
> >>>>>>+ * Switches ownership from the kernel itself to an external
> >>>>>>+ * user. While ownership is enabled, the kernel cannot use the
> >>>>>>+ * IOMMU for itself.
> >>>>>>+ */
> >>>>>>+ void (*set_ownership)(struct iommu_table_group *table_group,
> >>>>>>+ bool enable);
> >>>>>
> >>>>>The meaning of "enable" in a function called "set_ownership" is
> >>>>>entirely obscure.
> >>>>
> >>>>Suggest something better please :) I have nothing better...
> >>>
> >>>Well, given it's "set_ownership" you could have "owner" - that would
> >>>want to be an enum with OWNER_KERNEL and OWNER_VFIO or something
> >>>rather than a bool.
> >>
> >>
> >>It is iommu_take_ownership() in upstream and it is assumed that the owner is
> >>anything but the platform code (for now and probably forever - VFIO). I am
> >>not changing this now, just using the same naming approach when adding a
> >>callback with a similar name.
> >
> >So "enable" actually means that non-kernel ownership is enabled. That
> >is totally non-obvious.
> >
> >>>Or you could leave it a bool but call it "allow_bypass".
> >>
> >>Commented below.
> >>
> >>>>>>+};
> >>>>>>+
> >>>>>> struct iommu_table_group {
> >>>>>> #ifdef CONFIG_IOMMU_API
> >>>>>> struct iommu_group *group;
> >>>>>> #endif
> >>>>>> struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
> >>>>>>+ struct iommu_table_group_ops *ops;
> >>>>>> };
> >>>>>>
> >>>>>> #ifdef CONFIG_IOMMU_API
> >>>>>>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>>>>>index a964c50..9687731 100644
> >>>>>>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>>>>>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>>>>>@@ -1255,10 +1255,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> >>>>>> __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
> >>>>>> }
> >>>>>>
> >>>>>>-static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
> >>>>>>+static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
> >>>>>> {
> >>>>>>- struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe,
> >>>>>>- table_group);
> >>>>>> uint16_t window_id = (pe->pe_number << 1 ) + 1;
> >>>>>> int64_t rc;
> >>>>>>
> >>>>>>@@ -1286,7 +1284,8 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
> >>>>>> * host side.
> >>>>>> */
> >>>>>> if (pe->pdev)
> >>>>>>- set_iommu_table_base(&pe->pdev->dev, tbl);
> >>>>>>+ set_iommu_table_base(&pe->pdev->dev,
> >>>>>>+ &pe->table_group.tables[0]);
> >>>>>> else
> >>>>>> pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
> >>>>>> }
> >>>>>>@@ -1302,13 +1301,27 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
> >>>>>> /* TVE #1 is selected by PCI address bit 59 */
> >>>>>> pe->tce_bypass_base = 1ull << 59;
> >>>>>>
> >>>>>>- /* Install set_bypass callback for VFIO */
> >>>>>>- pe->table_group.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;
> >>>>>>-
> >>>>>> /* Enable bypass by default */
> >>>>>>- pnv_pci_ioda2_set_bypass(&pe->table_group.tables[0], true);
> >>>>>>+ pnv_pci_ioda2_set_bypass(pe, true);
> >>>>>> }
> >>>>>>
> >>>>>>+static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group,
> >>>>>>+ bool enable)
> >>>>>>+{
> >>>>>>+ struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> >>>>>>+ table_group);
> >>>>>>+ if (enable)
> >>>>>>+ iommu_take_ownership(table_group);
> >>>>>>+ else
> >>>>>>+ iommu_release_ownership(table_group);
> >>>>>>+
> >>>>>>+ pnv_pci_ioda2_set_bypass(pe, !enable);
> >>>>>>+}
> >>>>>>+
> >>>>>>+static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
> >>>>>>+ .set_ownership = pnv_ioda2_set_ownership,
> >>>>>>+};
> >>>>>>+
> >>>>>> static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >>>>>> struct pnv_ioda_pe *pe)
> >>>>>> {
> >>>>>>@@ -1376,6 +1389,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >>>>>> }
> >>>>>> tbl->it_ops = &pnv_iommu_ops;
> >>>>>> iommu_init_table(tbl, phb->hose->node);
> >>>>>>+ pe->table_group.ops = &pnv_pci_ioda2_ops;
> >>>>>> iommu_register_group(&pe->table_group, phb->hose->global_number,
> >>>>>> pe->pe_number);
> >>>>>>
> >>>>>>diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>index 9f38351..d5d8c50 100644
> >>>>>>--- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >>>>>>@@ -535,9 +535,22 @@ static int tce_iommu_attach_group(void *iommu_data,
> >>>>>> goto unlock_exit;
> >>>>>> }
> >>>>>>
> >>>>>>- ret = iommu_take_ownership(table_group);
> >>>>>>- if (!ret)
> >>>>>>- container->grp = iommu_group;
> >>>>>>+ if (!table_group->ops || !table_group->ops->set_ownership) {
> >>>>>>+ ret = iommu_take_ownership(table_group);
> >>>>>>+ } else {
> >>>>>>+ /*
> >>>>>>+ * Disable iommu bypass, otherwise the user can DMA to all of
> >>>>>>+ * our physical memory via the bypass window instead of just
> >>>>>>+ * the pages that have been explicitly mapped into the iommu.
> >>>>>>+ */
> >>>>>>+ table_group->ops->set_ownership(table_group, true);
> >>>>>
> >>>>>And here to disable bypass you call it with enable=true, so it doesn't
> >>>>>even have the same meaning as it used to.
> >>>>
> >>>>
> >>>>I do not disable bypass per se (even if that is what set_ownership(true)
> >>>>does) as it is IODA business and VFIO has no idea about it. I do take
> >>>>control over the group. I am not following you here - what used to have
> >>>>the same meaning?
> >>>
> >>>Well, in set_bypass, the enable parameter was whether bypass was
> >>>enabled. Here you're setting enable to true, when you want to
> >>>*disable* bypass (in the existing case). If the "enable" parameter
> >>>isn't about enabling bypass, its meaning is even more confusing than
> >>>I thought.
> >>
> >>
> >>Its meaning is "take ownership over the group". In this patch
> >>set_ownership(true) means set_bypass(false).
> >
> >Ok. So "take_ownership" isn't quite as clear as I'd like, but it's
> >not too bad because it's implied that it's the caller that's taking
> >the ownership. *set* ownership makes no sense without saying who the
> >new owner is. "enable" has no clear meaning in that context.
> >
> >Calling it "kernel_owned" or "non_kernel_owned" would be ok if a bit
> >clunky.
>
>
> > Strictly speaking, VFIO and the platform code are both kernel code.
Well, true, but VFIO is generally holding the device on behalf of a
userspace process or guest.
> So which one to choose?
>
> +struct iommu_table_group_ops {
> + void (*take_ownership)(struct iommu_table_group *table_group);
> + void (*release_ownership)(struct iommu_table_group *table_group);
> +};
>
>
> OR
>
> +enum { IOMMU_TABLE_GROUP_OWNER_KERNEL, IOMMU_TABLE_GROUP_OWNER_VFIO };
> +struct iommu_table_group_ops {
> + void (*set_ownership)(struct iommu_table_group *table_group,
> + long owner);
> +};
>
>
> > I have bad taste for names like these and need a hint here, please :)
I think I'd be ok with either.
I think I'd vote for the first option, for consistency with the
existing function names. If that requires a bunch of code duplication
in the implementations between take and release, I'd probably change
my mind though.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson