2013-08-28 08:38:09

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH v9 00/13] KVM: PPC: IOMMU in-kernel handling of VFIO

This accelerates VFIO DMA operations on POWER by moving them
into kernel.

This depends on VFIO external API patch which is on its way to upstream.

Changes:
v9:
* replaced the "link logical bus number to IOMMU group" ioctl to KVM
with a KVM device doing the same thing, i.e. the actual changes are in
these 3 patches:
KVM: PPC: reserve a capability and KVM device type for realmode VFIO
KVM: PPC: remove warning from kvmppc_core_destroy_vm
KVM: PPC: Add support for IOMMU in-kernel handling

* moved some VFIO external API bits to a separate patch to reduce the size
of the "KVM: PPC: Add support for IOMMU in-kernel handling" patch

* fixed code style problems reported by checkpatch.pl.

v8:
* fixed comments about capabilities numbers

v7:
* rebased on v3.11-rc3.
* VFIO external user API will go through VFIO tree so it is
excluded from this series.
* As nobody ever reacted on "hashtable: add hash_for_each_possible_rcu_notrace()",
Ben suggested to push it via his tree so I included it to the series.
* realmode_(get|put)_page is reworked.

More details in the individual patch comments.

Alexey Kardashevskiy (13):
KVM: PPC: POWERNV: move iommu_add_device earlier
hashtable: add hash_for_each_possible_rcu_notrace()
KVM: PPC: reserve a capability number for multitce support
KVM: PPC: reserve a capability and KVM device type for realmode VFIO
powerpc: Prepare to support kernel handling of IOMMU map/unmap
powerpc: add real mode support for dma operations on powernv
KVM: PPC: enable IOMMU_API for KVM_BOOK3S_64 permanently
KVM: PPC: Add support for multiple-TCE hcalls
powerpc/iommu: rework to support realmode
KVM: PPC: remove warning from kvmppc_core_destroy_vm
KVM: PPC: add trampolines for VFIO external API
KVM: PPC: Add support for IOMMU in-kernel handling
KVM: PPC: Add hugepage support for IOMMU in-kernel handling

Documentation/virtual/kvm/api.txt | 26 +
.../virtual/kvm/devices/spapr_tce_iommu.txt | 37 ++
arch/powerpc/include/asm/iommu.h | 18 +-
arch/powerpc/include/asm/kvm_host.h | 38 ++
arch/powerpc/include/asm/kvm_ppc.h | 16 +-
arch/powerpc/include/asm/machdep.h | 12 +
arch/powerpc/include/asm/pgtable-ppc64.h | 2 +
arch/powerpc/include/uapi/asm/kvm.h | 8 +
arch/powerpc/kernel/iommu.c | 243 +++++----
arch/powerpc/kvm/Kconfig | 1 +
arch/powerpc/kvm/book3s_64_vio.c | 597 ++++++++++++++++++++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 408 +++++++++++++-
arch/powerpc/kvm/book3s_hv.c | 42 +-
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 8 +-
arch/powerpc/kvm/book3s_pr_papr.c | 35 ++
arch/powerpc/kvm/powerpc.c | 4 +
arch/powerpc/mm/init_64.c | 50 +-
arch/powerpc/platforms/powernv/pci-ioda.c | 57 +-
arch/powerpc/platforms/powernv/pci-p5ioc2.c | 2 +-
arch/powerpc/platforms/powernv/pci.c | 75 ++-
arch/powerpc/platforms/powernv/pci.h | 3 +-
arch/powerpc/platforms/pseries/iommu.c | 8 +-
include/linux/hashtable.h | 15 +
include/linux/kvm_host.h | 1 +
include/linux/mm.h | 14 +
include/linux/page-flags.h | 4 +-
include/uapi/linux/kvm.h | 3 +
virt/kvm/kvm_main.c | 5 +
28 files changed, 1564 insertions(+), 168 deletions(-)
create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt

--
1.8.4.rc4


2013-08-28 08:38:18

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH v9 02/13] hashtable: add hash_for_each_possible_rcu_notrace()

This adds hash_for_each_possible_rcu_notrace() which is basically
a notrace clone of hash_for_each_possible_rcu() which cannot be
used in real mode due to its tracing/debugging capability.

Signed-off-by: Alexey Kardashevskiy <[email protected]>

---
Changes:
v8:
* fixed warnings from check_patch.pl
---
include/linux/hashtable.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/include/linux/hashtable.h b/include/linux/hashtable.h
index a9df51f..519b6e2 100644
--- a/include/linux/hashtable.h
+++ b/include/linux/hashtable.h
@@ -174,6 +174,21 @@ static inline void hash_del_rcu(struct hlist_node *node)
member)

/**
+ * hash_for_each_possible_rcu_notrace - iterate over all possible objects hashing
+ * to the same bucket in an rcu enabled hashtable in a rcu enabled hashtable
+ * @name: hashtable to iterate
+ * @obj: the type * to use as a loop cursor for each entry
+ * @member: the name of the hlist_node within the struct
+ * @key: the key of the objects to iterate over
+ *
+ * This is the same as hash_for_each_possible_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define hash_for_each_possible_rcu_notrace(name, obj, member, key) \
+ hlist_for_each_entry_rcu_notrace(obj, \
+ &name[hash_min(key, HASH_BITS(name))], member)
+
+/**
* hash_for_each_possible_safe - iterate over all possible objects hashing to the
* same bucket safe against removals
* @name: hashtable to iterate
--
1.8.4.rc4

2013-08-28 08:38:27

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH v9 04/13] KVM: PPC: reserve a capability and KVM device type for realmode VFIO

This reserves a capability number for upcoming support
of VFIO-IOMMU DMA operations in real mode.

This reserves a number for a new "SPAPR TCE IOMMU" KVM device
which is going to manage lifetime of SPAPR TCE IOMMU object.

This defines an attribute of the "SPAPR TCE IOMMU" KVM device
which is going to be used for initialization.

Signed-off-by: Alexey Kardashevskiy <[email protected]>

---
Changes:
v9:
* KVM ioctl is replaced with "SPAPR TCE IOMMU" KVM device type with
KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE attribute

2013/08/15:
* fixed mistype in comments
* fixed commit message which says what uses ioctls 0xad and 0xae

2013/07/16:
* changed the number

2013/07/11:
* changed order in a file, added comment about a gap in ioctl number
---
arch/powerpc/include/uapi/asm/kvm.h | 8 ++++++++
include/uapi/linux/kvm.h | 2 ++
2 files changed, 10 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 0fb1a6e..c1ae1e5 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -511,4 +511,12 @@ struct kvm_get_htab_header {
#define KVM_XICS_MASKED (1ULL << 41)
#define KVM_XICS_PENDING (1ULL << 42)

+/* SPAPR TCE IOMMU device specification */
+struct kvm_create_spapr_tce_iommu_linkage {
+ __u64 liobn;
+ __u32 fd;
+ __u32 flags;
+};
+#define KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE 0
+
#endif /* __LINUX_KVM_POWERPC_H */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 99c2533..9d20630 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
#define KVM_CAP_IRQ_XICS 92
#define KVM_CAP_ARM_EL1_32BIT 93
#define KVM_CAP_SPAPR_MULTITCE 94
+#define KVM_CAP_SPAPR_TCE_IOMMU 95

#ifdef KVM_CAP_IRQ_ROUTING

@@ -843,6 +844,7 @@ struct kvm_device_attr {
#define KVM_DEV_TYPE_FSL_MPIC_20 1
#define KVM_DEV_TYPE_FSL_MPIC_42 2
#define KVM_DEV_TYPE_XICS 3
+#define KVM_DEV_TYPE_SPAPR_TCE_IOMMU 4

/*
* ioctls for VM fds
--
1.8.4.rc4

2013-08-28 08:38:32

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH v9 05/13] powerpc: Prepare to support kernel handling of IOMMU map/unmap

The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a "fast" hypercall directly in "real
mode" (processor still in the guest context but MMU off).

To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.

This adds an API function realmode_pfn_to_page() to get page struct when
MMU is off.

This adds to MM a new function put_page_unless_one() which drops a page
if counter is bigger than 1. It is going to be used when MMU is off
(for example, real mode on PPC64) and we want to make sure that page
release will not happen in real mode as it may crash the kernel in
a horrible way.

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.

Cc: [email protected]
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Andrew Morton <[email protected]>
Reviewed-by: Paul Mackerras <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>

---

Changes:
2013/07/25 (v7):
* removed realmode_put_page and added put_page_unless_one() instead.
The name has been chosen to conform the already existing
get_page_unless_zero().
* removed realmode_get_page. Instead, get_page_unless_zero() should be used

2013/07/10:
* adjusted comment (removed sentence about virtual mode)
* get_page_unless_zero replaced with atomic_inc_not_zero to minimize
effect of a possible get_page_unless_zero() rework (if it ever happens).

2013/06/27:
* realmode_get_page() fixed to use get_page_unless_zero(). If failed,
the call will be passed from real to virtual mode and safely handled.
* added comment to PageCompound() in include/linux/page-flags.h.

2013/05/20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()
---
arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++
arch/powerpc/mm/init_64.c | 50 +++++++++++++++++++++++++++++++-
include/linux/mm.h | 14 +++++++++
include/linux/page-flags.h | 4 ++-
4 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 46db094..4a191c4 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -394,6 +394,8 @@ static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array,
hpte_slot_array[index] = hidx << 4 | 0x1 << 3;
}

+struct page *realmode_pfn_to_page(unsigned long pfn);
+
static inline char *get_hpte_slot_array(pmd_t *pmdp)
{
/*
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index d0cd9e4..8cf345a 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -300,5 +300,53 @@ void vmemmap_free(unsigned long start, unsigned long end)
{
}

-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fallback to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * realmode_pfn_to_page functions can fail due to:
+ * 1) As real sparsemem blocks do not lay in RAM continously (they
+ * are in virtual address space which is not available in the real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct vmemmap_backing *vmem_back;
+ struct page *page;
+ unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+ unsigned long pg_va = (unsigned long) pfn_to_page(pfn);

+ for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+ if (pg_va < vmem_back->virt_addr)
+ continue;
+
+ /* Check that page struct is not split between real pages */
+ if ((pg_va + sizeof(struct page)) >
+ (vmem_back->virt_addr + page_size))
+ return NULL;
+
+ page = (struct page *) (vmem_back->phys + pg_va -
+ vmem_back->virt_addr);
+ return page;
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+ return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f022460..dcc99b5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -290,12 +290,26 @@ static inline int put_page_testzero(struct page *page)
/*
* Try to grab a ref unless the page has a refcount of zero, return false if
* that is the case.
+ * This can be called when MMU is off so it must not access
+ * any of the virtual mappings.
*/
static inline int get_page_unless_zero(struct page *page)
{
return atomic_inc_not_zero(&page->_count);
}

+/*
+ * Try to drop a ref unless the page has a refcount of one, return false if
+ * that is the case.
+ * This is to make sure that the refcount won't become zero after this drop.
+ * This can be called when MMU is off so it must not access
+ * any of the virtual mappings.
+ */
+static inline int put_page_unless_one(struct page *page)
+{
+ return atomic_add_unless(&page->_count, -1, 1);
+}
+
extern int page_is_ram(unsigned long pfn);

/* Support for virtually mapped pages */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..98ada58 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -329,7 +329,9 @@ static inline void set_page_writeback(struct page *page)
* System with lots of page flags available. This allows separate
* flags for PageHead() and PageTail() checks of compound pages so that bit
* tests can be used in performance sensitive paths. PageCompound is
- * generally not used in hot code paths.
+ * generally not used in hot code paths except arch/powerpc/mm/init_64.c
+ * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages
+ * and avoid handling those in real mode.
*/
__PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
__PAGEFLAG(Tail, tail)
--
1.8.4.rc4

2013-08-28 08:38:39

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH v9 06/13] powerpc: add real mode support for dma operations on powernv

The existing TCE machine calls (tce_build and tce_free) only support
virtual mode as they call __raw_writeq for TCE invalidation what
fails in real mode.

This introduces tce_build_rm and tce_free_rm real mode versions
which do mostly the same but use "Store Doubleword Caching Inhibited
Indexed" instruction for TCE invalidation.

This new feature is going to be utilized by real mode support of VFIO.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v8:
* fixed check_patch.pl warnings

2013/11/07:
* added comment why stdcix cannot be used in virtual mode

2013/08/07:
* tested on p7ioc and fixed a bug with realmode addresses
---
arch/powerpc/include/asm/machdep.h | 12 ++++++++
arch/powerpc/platforms/powernv/pci-ioda.c | 49 +++++++++++++++++++++++--------
arch/powerpc/platforms/powernv/pci.c | 42 ++++++++++++++++++++++----
arch/powerpc/platforms/powernv/pci.h | 3 +-
4 files changed, 87 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 8b48090..07dd3b1 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -78,6 +78,18 @@ struct machdep_calls {
long index);
void (*tce_flush)(struct iommu_table *tbl);

+ /* _rm versions are for real mode use only */
+ int (*tce_build_rm)(struct iommu_table *tbl,
+ long index,
+ long npages,
+ unsigned long uaddr,
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs);
+ void (*tce_free_rm)(struct iommu_table *tbl,
+ long index,
+ long npages);
+ void (*tce_flush_rm)(struct iommu_table *tbl);
+
void __iomem * (*ioremap)(phys_addr_t addr, unsigned long size,
unsigned long flags, void *caller);
void (*iounmap)(volatile void __iomem *token);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 756bb58..8cba234 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -70,6 +70,16 @@ define_pe_printk_level(pe_err, KERN_ERR);
define_pe_printk_level(pe_warn, KERN_WARNING);
define_pe_printk_level(pe_info, KERN_INFO);

+/*
+ * stdcix is only supposed to be used in hypervisor real mode as per
+ * the architecture spec
+ */
+static inline void __raw_rm_writeq(u64 val, volatile void __iomem *paddr)
+{
+ __asm__ __volatile__("stdcix %0,0,%1"
+ : : "r" (val), "r" (paddr) : "memory");
+}
+
static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
{
unsigned long pe;
@@ -454,10 +464,13 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
}
}

-static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
- u64 *startp, u64 *endp)
+static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
+ struct iommu_table *tbl,
+ u64 *startp, u64 *endp, bool rm)
{
- u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
+ u64 __iomem *invalidate = rm ?
+ (u64 __iomem *)pe->tce_inval_reg_phys :
+ (u64 __iomem *)tbl->it_index;
unsigned long start, end, inc;

start = __pa(startp);
@@ -484,7 +497,10 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,

mb(); /* Ensure above stores are visible */
while (start <= end) {
- __raw_writeq(start, invalidate);
+ if (rm)
+ __raw_rm_writeq(start, invalidate);
+ else
+ __raw_writeq(start, invalidate);
start += inc;
}

@@ -496,10 +512,12 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,

static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
struct iommu_table *tbl,
- u64 *startp, u64 *endp)
+ u64 *startp, u64 *endp, bool rm)
{
unsigned long start, end, inc;
- u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
+ u64 __iomem *invalidate = rm ?
+ (u64 __iomem *)pe->tce_inval_reg_phys :
+ (u64 __iomem *)tbl->it_index;

/* We'll invalidate DMA address in PE scope */
start = 0x2ul << 60;
@@ -515,22 +533,25 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
mb();

while (start <= end) {
- __raw_writeq(start, invalidate);
+ if (rm)
+ __raw_rm_writeq(start, invalidate);
+ else
+ __raw_writeq(start, invalidate);
start += inc;
}
}

void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
- u64 *startp, u64 *endp)
+ u64 *startp, u64 *endp, bool rm)
{
struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
tce32_table);
struct pnv_phb *phb = pe->phb;

if (phb->type == PNV_PHB_IODA1)
- pnv_pci_ioda1_tce_invalidate(tbl, startp, endp);
+ pnv_pci_ioda1_tce_invalidate(pe, tbl, startp, endp, rm);
else
- pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp);
+ pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
}

static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
@@ -603,7 +624,9 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
* bus number, print that out instead.
*/
tbl->it_busno = 0;
- tbl->it_index = (unsigned long)ioremap(be64_to_cpup(swinvp), 8);
+ pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
+ tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys,
+ 8);
tbl->it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE |
TCE_PCI_SWINV_PAIR;
}
@@ -681,7 +704,9 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
* bus number, print that out instead.
*/
tbl->it_busno = 0;
- tbl->it_index = (unsigned long)ioremap(be64_to_cpup(swinvp), 8);
+ pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
+ tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys,
+ 8);
tbl->it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE;
}
iommu_init_table(tbl, phb->hose->node);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index c005011..8623529 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -401,7 +401,7 @@ struct pci_ops pnv_pci_ops = {

static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
unsigned long uaddr, enum dma_data_direction direction,
- struct dma_attrs *attrs)
+ struct dma_attrs *attrs, bool rm)
{
u64 proto_tce;
u64 *tcep, *tces;
@@ -423,12 +423,22 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
* of flags if that becomes the case
*/
if (tbl->it_type & TCE_PCI_SWINV_CREATE)
- pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1);
+ pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);

return 0;
}

-static void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
+static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long npages,
+ unsigned long uaddr,
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs)
+{
+ return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs,
+ false);
+}
+
+static void pnv_tce_free(struct iommu_table *tbl, long index, long npages,
+ bool rm)
{
u64 *tcep, *tces;

@@ -438,7 +448,12 @@ static void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
*(tcep++) = 0;

if (tbl->it_type & TCE_PCI_SWINV_FREE)
- pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1);
+ pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
+}
+
+static void pnv_tce_free_vm(struct iommu_table *tbl, long index, long npages)
+{
+ pnv_tce_free(tbl, index, npages, false);
}

static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
@@ -446,6 +461,19 @@ static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
return ((u64 *)tbl->it_base)[index - tbl->it_offset];
}

+static int pnv_tce_build_rm(struct iommu_table *tbl, long index, long npages,
+ unsigned long uaddr,
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs)
+{
+ return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, true);
+}
+
+static void pnv_tce_free_rm(struct iommu_table *tbl, long index, long npages)
+{
+ pnv_tce_free(tbl, index, npages, true);
+}
+
void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
void *tce_mem, u64 tce_size,
u64 dma_offset)
@@ -610,8 +638,10 @@ void __init pnv_pci_init(void)

/* Configure IOMMU DMA hooks */
ppc_md.pci_dma_dev_setup = pnv_pci_dma_dev_setup;
- ppc_md.tce_build = pnv_tce_build;
- ppc_md.tce_free = pnv_tce_free;
+ ppc_md.tce_build = pnv_tce_build_vm;
+ ppc_md.tce_free = pnv_tce_free_vm;
+ ppc_md.tce_build_rm = pnv_tce_build_rm;
+ ppc_md.tce_free_rm = pnv_tce_free_rm;
ppc_md.tce_get = pnv_tce_get;
ppc_md.pci_probe_mode = pnv_pci_probe_mode;
set_pci_dma_ops(&dma_iommu_ops);
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index d633c64..170dd98 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -52,6 +52,7 @@ struct pnv_ioda_pe {
int tce32_seg;
int tce32_segcount;
struct iommu_table tce32_table;
+ phys_addr_t tce_inval_reg_phys;

/* XXX TODO: Add support for additional 64-bit iommus */

@@ -193,6 +194,6 @@ extern void pnv_pci_init_p5ioc2_hub(struct device_node *np);
extern void pnv_pci_init_ioda_hub(struct device_node *np);
extern void pnv_pci_init_ioda2_phb(struct device_node *np);
extern void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
- u64 *startp, u64 *endp);
+ u64 *startp, u64 *endp, bool rm);

#endif /* __POWERNV_PCI_H */
--
1.8.4.rc4

2013-08-28 08:38:54

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH v9 08/13] KVM: PPC: Add support for multiple-TCE hcalls

This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for user space emulated devices such as IBMVIO
devices or emulated PCI. These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.

This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user and verified) before writing the whole list into
the TCE table. This cache will be utilized more in the upcoming
VFIO/IOMMU support to continue TCE list processing in the virtual
mode in the case if the real mode handler failed for some reason.

This adds a function to convert a guest physical address to a host
virtual address in order to parse a TCE list from H_PUT_TCE_INDIRECT.

This also implements the KVM_CAP_PPC_MULTITCE capability. When present,
the hypercalls mentioned above may or may not be processed successfully
in the kernel based fast path. If they can not be handled by the kernel,
they will get passed on to user space. So user space still has to have
an implementation for these despite the in kernel acceleration.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>

---
Changelog:
v8:
* fixed warnings from check_patch.pl

2013/08/01 (v7):
* realmode_get_page/realmode_put_page use was replaced with
get_page_unless_zero/put_page_unless_one

2013/07/11:
* addressed many, many comments from maintainers

2013/07/06:
* fixed number of wrong get_page()/put_page() calls

2013/06/27:
* fixed clear of BUSY bit in kvmppc_lookup_pte()
* H_PUT_TCE_INDIRECT does realmode_get_page() now
* KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
* updated doc

2013/06/05:
* fixed mistype about IBMVIO in the commit message
* updated doc and moved it to another section
* changed capability number

2013/05/21:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed, instead we do not even start
writing to TCE table if we cannot get TCEs from the user and they are
invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with failthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
---
Documentation/virtual/kvm/api.txt | 26 +++
arch/powerpc/include/asm/kvm_host.h | 9 ++
arch/powerpc/include/asm/kvm_ppc.h | 16 +-
arch/powerpc/kvm/book3s_64_vio.c | 132 +++++++++++++++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 270 ++++++++++++++++++++++++++++----
arch/powerpc/kvm/book3s_hv.c | 41 ++++-
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 8 +-
arch/powerpc/kvm/book3s_pr_papr.c | 35 +++++
arch/powerpc/kvm/powerpc.c | 3 +
9 files changed, 506 insertions(+), 34 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index ef925ea..1c8942a 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2382,6 +2382,32 @@ calls by the guest for that service will be passed to userspace to be
handled.


+4.86 KVM_CAP_PPC_MULTITCE
+
+Capability: KVM_CAP_PPC_MULTITCE
+Architectures: ppc
+Type: vm
+
+This capability means the kernel is capable of handling hypercalls
+H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
+space. This significantly accelerates DMA operations for PPC KVM guests.
+User space should expect that its handlers for these hypercalls
+are not going to be called if user space previously registered LIOBN
+in KVM (via KVM_CREATE_SPAPR_TCE or similar calls).
+
+In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
+user space might have to advertise it for the guest. For example,
+IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is
+present in the "ibm,hypertas-functions" device-tree property.
+
+The hypercalls mentioned above may or may not be processed successfully
+in the kernel based fast path. If they can not be handled by the kernel,
+they will get passed on to user space. So user space still has to have
+an implementation for these despite the in kernel acceleration.
+
+This capability is always enabled.
+
+
5. The kvm_run structure
------------------------

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index af326cd..a23f132 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -30,6 +30,7 @@
#include <linux/kvm_para.h>
#include <linux/list.h>
#include <linux/atomic.h>
+#include <linux/tracepoint.h>
#include <asm/kvm_asm.h>
#include <asm/processor.h>
#include <asm/page.h>
@@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
spinlock_t tbacct_lock;
u64 busy_stolen;
u64 busy_preempt;
+
+ unsigned long *tce_tmp_hpas; /* TCE cache for TCE_PUT_INDIRECT */
+ enum {
+ TCERM_NONE,
+ TCERM_GETPAGE,
+ TCERM_PUTTCE,
+ TCERM_PUTLIST,
+ } tce_rm_fail; /* failed stage of request processing */
#endif
};

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index a5287fe..0ce4691 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);

extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
-extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
- unsigned long ioba, unsigned long tce);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
+ struct kvm_vcpu *vcpu, unsigned long liobn);
+extern long kvmppc_tce_validate(unsigned long tce);
+extern void kvmppc_tce_put(struct kvmppc_spapr_tce_table *tt,
+ unsigned long ioba, unsigned long tce);
+extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce);
+extern long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_list, unsigned long npages);
+extern long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages);
extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
struct kvm_allocate_rma *rma);
extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index b2d3f3b..cae1099 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -14,6 +14,7 @@
*
* Copyright 2010 Paul Mackerras, IBM Corp. <[email protected]>
* Copyright 2011 David Gibson, IBM Corporation <[email protected]>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <[email protected]>
*/

#include <linux/types.h>
@@ -36,8 +37,10 @@
#include <asm/ppc-opcode.h>
#include <asm/kvm_host.h>
#include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>

-#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR ((void *)~(unsigned long)0x0)

static long kvmppc_stt_npages(unsigned long window_size)
{
@@ -148,3 +151,130 @@ fail:
}
return ret;
}
+
+/* Converts guest physical address to host virtual address */
+static void __user *kvmppc_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
+ unsigned long gpa, struct page **pg)
+{
+ unsigned long hva, gfn = gpa >> PAGE_SHIFT;
+ struct kvm_memory_slot *memslot;
+ const int is_write = 0;
+
+ memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+ if (!memslot)
+ return ERROR_ADDR;
+
+ hva = __gfn_to_hva_memslot(memslot, gfn) | (gpa & ~PAGE_MASK);
+
+ if (get_user_pages_fast(hva & PAGE_MASK, 1, is_write, pg) != 1)
+ return ERROR_ADDR;
+
+ return (void *) hva;
+}
+
+long kvmppc_h_put_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce)
+{
+ long ret;
+ struct kvmppc_spapr_tce_table *tt;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ if (!tt)
+ return H_TOO_HARD;
+
+ if (ioba >= tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_tce_validate(tce);
+ if (ret)
+ return ret;
+
+ kvmppc_tce_put(tt, ioba, tce);
+
+ return H_SUCCESS;
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_list, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret = H_SUCCESS;
+ unsigned long __user *tces;
+ struct page *pg = NULL;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ if (!tt)
+ return H_TOO_HARD;
+
+ /*
+ * The spec says that the maximum size of the list is 512 TCEs
+ * so the whole table addressed resides in 4K page
+ */
+ if (npages > 512)
+ return H_PARAMETER;
+
+ if (tce_list & ~IOMMU_PAGE_MASK)
+ return H_PARAMETER;
+
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ tces = kvmppc_gpa_to_hva_and_get(vcpu, tce_list, &pg);
+ if (tces == ERROR_ADDR)
+ return H_TOO_HARD;
+
+ if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
+ goto put_list_page_exit;
+
+ for (i = 0; i < npages; ++i) {
+ if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
+ ret = H_PARAMETER;
+ goto put_list_page_exit;
+ }
+
+ ret = kvmppc_tce_validate(vcpu->arch.tce_tmp_hpas[i]);
+ if (ret)
+ goto put_list_page_exit;
+ }
+
+ for (i = 0; i < npages; ++i)
+ kvmppc_tce_put(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+ vcpu->arch.tce_tmp_hpas[i]);
+put_list_page_exit:
+ if (pg) {
+ put_page(pg);
+ if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
+ vcpu->arch.tce_rm_fail = TCERM_NONE;
+ /* Finish pending put_page() from realmode */
+ put_page(pg);
+ }
+ }
+
+ return ret;
+}
+
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ if (!tt)
+ return H_TOO_HARD;
+
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_tce_validate(tce_value);
+ if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+ return H_PARAMETER;
+
+ for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+ kvmppc_tce_put(tt, ioba, tce_value);
+
+ return H_SUCCESS;
+}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 30c2f3b..2b2ce0a 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -14,6 +14,7 @@
*
* Copyright 2010 Paul Mackerras, IBM Corp. <[email protected]>
* Copyright 2011 David Gibson, IBM Corporation <[email protected]>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <[email protected]>
*/

#include <linux/types.h>
@@ -35,42 +36,253 @@
#include <asm/ppc-opcode.h>
#include <asm/kvm_host.h>
#include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>

#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR (~(unsigned long)0x0)

-/* WARNING: This will be called in real-mode on HV KVM and virtual
+/* Finds a TCE table descriptor by LIOBN.
+ *
+ * WARNING: This will be called in real or virtual mode on HV KVM and virtual
* mode on PR KVM
*/
-long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
+struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
+ unsigned long liobn)
+{
+ struct kvmppc_spapr_tce_table *tt;
+
+ list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
+ if (tt->liobn == liobn)
+ return tt;
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
+
+/*
+ * Validates TCE address.
+ * At the moment only flags are validated.
+ * As the host kernel does not access those addresses (just puts them
+ * to the table and user space is supposed to process them), we can skip
+ * checking other things (such as TCE is a guest RAM address or the page
+ * was actually allocated).
+ *
+ * WARNING: This will be called in real-mode on HV KVM and virtual
+ * mode on PR KVM
+ */
+long kvmppc_tce_validate(unsigned long tce)
+{
+ if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
+ return H_PARAMETER;
+
+ return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_tce_validate);
+
+/* Note on the use of page_address() in real mode,
+ *
+ * It is safe to use page_address() in real mode on ppc64 because
+ * page_address() is always defined as lowmem_page_address()
+ * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
+ * operation and does not access page struct.
+ *
+ * Theoretically page_address() could be defined different
+ * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
+ * should be enabled.
+ * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
+ * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
+ * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
+ * is not expected to be enabled on ppc32, page_address()
+ * is safe for ppc32 as well.
+ *
+ * WARNING: This will be called in real-mode on HV KVM and virtual
+ * mode on PR KVM
+ */
+static u64 *kvmppc_page_address(struct page *page)
+{
+#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
+#error TODO: fix to avoid page_address() here
+#endif
+ return (u64 *) page_address(page);
+}
+
+/*
+ * Handles TCE requests for emulated devices.
+ * Puts guest TCE values to the table and expects user space to convert them.
+ * Called in both real and virtual modes.
+ * Cannot fail so kvmppc_tce_validate must be called before it.
+ *
+ * WARNING: This will be called in real-mode on HV KVM and virtual
+ * mode on PR KVM
+ */
+void kvmppc_tce_put(struct kvmppc_spapr_tce_table *tt,
+ unsigned long ioba, unsigned long tce)
+{
+ unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+ struct page *page;
+ u64 *tbl;
+
+ page = tt->pages[idx / TCES_PER_PAGE];
+ tbl = kvmppc_page_address(page);
+
+ tbl[idx % TCES_PER_PAGE] = tce;
+}
+EXPORT_SYMBOL_GPL(kvmppc_tce_put);
+
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+/*
+ * Converts guest physical address to host physical address.
+ * Tries to increase page counter via get_page_unless_zero() and
+ * returns ERROR_ADDR if failed.
+ */
+static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
+ unsigned long gpa, struct page **pg)
+{
+ struct kvm_memory_slot *memslot;
+ pte_t *ptep, pte;
+ unsigned long hva, hpa = ERROR_ADDR;
+ unsigned long gfn = gpa >> PAGE_SHIFT;
+ unsigned shift = 0;
+
+ memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+ if (!memslot)
+ return ERROR_ADDR;
+
+ hva = __gfn_to_hva_memslot(memslot, gfn);
+
+ ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+ if (!ptep || !pte_present(*ptep))
+ return ERROR_ADDR;
+ pte = *ptep;
+
+ if (!shift)
+ shift = PAGE_SHIFT;
+
+ /* Avoid handling anything potentially complicated in realmode */
+ if (shift > PAGE_SHIFT)
+ return ERROR_ADDR;
+
+ if (((gpa & TCE_PCI_WRITE) || pte_write(pte)) && !pte_dirty(pte))
+ return ERROR_ADDR;
+
+ if (!pte_young(pte))
+ return ERROR_ADDR;
+
+ /* Increase page counter */
+ *pg = realmode_pfn_to_page(pte_pfn(pte));
+ if (!*pg || PageCompound(*pg) || !get_page_unless_zero(*pg))
+ return ERROR_ADDR;
+
+ hpa = (pte_pfn(pte) << PAGE_SHIFT) + (gpa & ((1 << shift) - 1));
+
+ /*
+ * Page has gone since we got pte, safer to put
+ * the request to virt mode
+ */
+ if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+ hpa = ERROR_ADDR;
+ /* Try drop the page, if failed, let virtmode do that */
+ if (put_page_unless_one(*pg))
+ *pg = NULL;
+ }
+
+ return hpa;
+}
+
+long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce)
{
- struct kvm *kvm = vcpu->kvm;
- struct kvmppc_spapr_tce_table *stt;
-
- /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
- /* liobn, ioba, tce); */
-
- list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
- if (stt->liobn == liobn) {
- unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
- struct page *page;
- u64 *tbl;
-
- /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p window_size=0x%x\n", */
- /* liobn, stt, stt->window_size); */
- if (ioba >= stt->window_size)
- return H_PARAMETER;
-
- page = stt->pages[idx / TCES_PER_PAGE];
- tbl = (u64 *)page_address(page);
-
- /* FIXME: Need to validate the TCE itself */
- /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
- tbl[idx % TCES_PER_PAGE] = tce;
- return H_SUCCESS;
- }
+ long ret;
+ struct kvmppc_spapr_tce_table *tt;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ if (!tt)
+ return H_TOO_HARD;
+
+ if (ioba >= tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_tce_validate(tce);
+ if (!ret)
+ kvmppc_tce_put(tt, ioba, tce);
+
+ return ret;
+}
+
+long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_list, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret = H_SUCCESS;
+ unsigned long tces;
+ struct page *pg = NULL;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ if (!tt)
+ return H_TOO_HARD;
+
+ /*
+ * The spec says that the maximum size of the list is 512 TCEs
+ * so the whole table addressed resides in 4K page
+ */
+ if (npages > 512)
+ return H_PARAMETER;
+
+ if (tce_list & ~IOMMU_PAGE_MASK)
+ return H_PARAMETER;
+
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ tces = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce_list, &pg);
+ if (tces == ERROR_ADDR) {
+ ret = H_TOO_HARD;
+ goto put_unlock_exit;
}

- /* Didn't find the liobn, punt it to userspace */
- return H_TOO_HARD;
+ for (i = 0; i < npages; ++i) {
+ ret = kvmppc_tce_validate(((unsigned long *)tces)[i]);
+ if (ret)
+ goto put_unlock_exit;
+ }
+
+ for (i = 0; i < npages; ++i)
+ kvmppc_tce_put(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+ ((unsigned long *)tces)[i]);
+
+put_unlock_exit:
+ if (!ret && pg && !put_page_unless_one(pg)) {
+ vcpu->arch.tce_rm_fail = TCERM_PUTLIST;
+ ret = H_TOO_HARD;
+ }
+
+ return ret;
+}
+
+long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ if (!tt)
+ return H_TOO_HARD;
+
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_tce_validate(tce_value);
+ if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+ return H_PARAMETER;
+
+ for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+ kvmppc_tce_put(tt, ioba, tce_value);
+
+ return H_SUCCESS;
}
+#endif /* CONFIG_KVM_BOOK3S_64_HV */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 7629cd3..9e823ad 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -567,7 +567,31 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
if (kvmppc_xics_enabled(vcpu)) {
ret = kvmppc_xics_hcall(vcpu, req);
break;
- } /* fallthrough */
+ }
+ return RESUME_HOST;
+ case H_PUT_TCE:
+ ret = kvmppc_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+ kvmppc_get_gpr(vcpu, 5),
+ kvmppc_get_gpr(vcpu, 6));
+ if (ret == H_TOO_HARD)
+ return RESUME_HOST;
+ break;
+ case H_PUT_TCE_INDIRECT:
+ ret = kvmppc_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
+ kvmppc_get_gpr(vcpu, 5),
+ kvmppc_get_gpr(vcpu, 6),
+ kvmppc_get_gpr(vcpu, 7));
+ if (ret == H_TOO_HARD)
+ return RESUME_HOST;
+ break;
+ case H_STUFF_TCE:
+ ret = kvmppc_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+ kvmppc_get_gpr(vcpu, 5),
+ kvmppc_get_gpr(vcpu, 6),
+ kvmppc_get_gpr(vcpu, 7));
+ if (ret == H_TOO_HARD)
+ return RESUME_HOST;
+ break;
default:
return RESUME_HOST;
}
@@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
vcpu->arch.cpu_type = KVM_CPU_3S_64;
kvmppc_sanity_check(vcpu);

+ /*
+ * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
+ * half executed, we first read TCEs from the user, check them and
+ * return error if something went wrong and only then put TCEs into
+ * the TCE table.
+ *
+ * tce_tmp_hpas is a cache for TCEs to avoid stack allocation or
+ * kmalloc as the whole TCE list can take up to 512 items 8 bytes
+ * each (4096 bytes).
+ */
+ vcpu->arch.tce_tmp_hpas = kmalloc(4096, GFP_KERNEL);
+ if (!vcpu->arch.tce_tmp_hpas)
+ goto free_vcpu;
+
return vcpu;

free_vcpu:
@@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
spin_unlock(&vcpu->arch.vpa_update_lock);
+ kfree(vcpu->arch.tce_tmp_hpas);
kvm_vcpu_uninit(vcpu);
kmem_cache_free(kvm_vcpu_cache, vcpu);
}
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b02f91e..15942bc 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1416,7 +1416,7 @@ hcall_real_table:
.long 0 /* 0x14 - H_CLEAR_REF */
.long .kvmppc_h_protect - hcall_real_table
.long 0 /* 0x1c - H_GET_TCE */
- .long .kvmppc_h_put_tce - hcall_real_table
+ .long .kvmppc_rm_h_put_tce - hcall_real_table
.long 0 /* 0x24 - H_SET_SPRG0 */
.long .kvmppc_h_set_dabr - hcall_real_table
.long 0 /* 0x2c */
@@ -1490,6 +1490,12 @@ hcall_real_table:
.long 0 /* 0x11c */
.long 0 /* 0x120 */
.long .kvmppc_h_bulk_remove - hcall_real_table
+ .long 0 /* 0x128 */
+ .long 0 /* 0x12c */
+ .long 0 /* 0x130 */
+ .long 0 /* 0x134 */
+ .long .kvmppc_rm_h_stuff_tce - hcall_real_table
+ .long .kvmppc_rm_h_put_tce_indirect - hcall_real_table
hcall_real_table_end:

ignore_hdec:
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index da0e0bc..6bd0d4a 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -227,6 +227,37 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
return EMULATE_DONE;
}

+static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
+{
+ unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+ unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+ unsigned long tce = kvmppc_get_gpr(vcpu, 6);
+ unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+ long rc;
+
+ rc = kvmppc_h_put_tce_indirect(vcpu, liobn, ioba,
+ tce, npages);
+ if (rc == H_TOO_HARD)
+ return EMULATE_FAIL;
+ kvmppc_set_gpr(vcpu, 3, rc);
+ return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
+{
+ unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+ unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+ unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
+ unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+ long rc;
+
+ rc = kvmppc_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
+ if (rc == H_TOO_HARD)
+ return EMULATE_FAIL;
+ kvmppc_set_gpr(vcpu, 3, rc);
+ return EMULATE_DONE;
+}
+
static int kvmppc_h_pr_xics_hcall(struct kvm_vcpu *vcpu, u32 cmd)
{
long rc = kvmppc_xics_hcall(vcpu, cmd);
@@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
return kvmppc_h_pr_bulk_remove(vcpu);
case H_PUT_TCE:
return kvmppc_h_pr_put_tce(vcpu);
+ case H_PUT_TCE_INDIRECT:
+ return kvmppc_h_pr_put_tce_indirect(vcpu);
+ case H_STUFF_TCE:
+ return kvmppc_h_pr_stuff_tce(vcpu);
case H_CEDE:
vcpu->arch.shared->msr |= MSR_EE;
kvm_vcpu_block(vcpu);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 6316ee3..ccb578b 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -394,6 +394,9 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_PPC_GET_SMMU_INFO:
r = 1;
break;
+ case KVM_CAP_SPAPR_MULTITCE:
+ r = 1;
+ break;
#endif
default:
r = 0;
--
1.8.4.rc4

2013-08-28 08:38:52

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH v9 09/13] powerpc/iommu: rework to support realmode

The TCE tables handling may differ for real and virtual modes so
additional ppc_md.tce_build_rm/ppc_md.tce_free_rm/ppc_md.tce_flush_rm
handlers were introduced earlier.

So this adds the following:
1. support for the new ppc_md calls;
2. ability to iommu_tce_build to process mupltiple entries per
call;
3. arch_spin_lock to protect TCE table from races in both real and virtual
modes;
4. proper TCE table protection from races with the existing IOMMU code
in iommu_take_ownership/iommu_release_ownership;
5. hwaddr variable renamed to hpa as it better describes what it
actually represents;
6. iommu_tce_direction is static now as it is not called from anywhere else.

This will be used by upcoming real mode support of VFIO on POWER.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v8:
* fixed warnings from check_patch.pl
---
arch/powerpc/include/asm/iommu.h | 9 +-
arch/powerpc/kernel/iommu.c | 198 ++++++++++++++++++++++++++-------------
2 files changed, 136 insertions(+), 71 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 19ad77f..71ee525 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -78,6 +78,7 @@ struct iommu_table {
unsigned long *it_map; /* A simple allocation bitmap for now */
#ifdef CONFIG_IOMMU_API
struct iommu_group *it_group;
+ arch_spinlock_t it_rm_lock;
#endif
};

@@ -161,9 +162,9 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
extern int iommu_tce_put_param_check(struct iommu_table *tbl,
unsigned long ioba, unsigned long tce);
extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
- unsigned long hwaddr, enum dma_data_direction direction);
-extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
- unsigned long entry);
+ unsigned long *hpas, unsigned long npages, bool rm);
+extern int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
+ unsigned long npages, bool rm);
extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
unsigned long entry, unsigned long pages);
extern int iommu_put_tce_user_mode(struct iommu_table *tbl,
@@ -173,7 +174,5 @@ extern void iommu_flush_tce(struct iommu_table *tbl);
extern int iommu_take_ownership(struct iommu_table *tbl);
extern void iommu_release_ownership(struct iommu_table *tbl);

-extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
-
#endif /* __KERNEL__ */
#endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 15f8ca8..ff0cd90 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -903,7 +903,7 @@ void iommu_register_group(struct iommu_table *tbl,
kfree(name);
}

-enum dma_data_direction iommu_tce_direction(unsigned long tce)
+static enum dma_data_direction iommu_tce_direction(unsigned long tce)
{
if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE))
return DMA_BIDIRECTIONAL;
@@ -914,7 +914,6 @@ enum dma_data_direction iommu_tce_direction(unsigned long tce)
else
return DMA_NONE;
}
-EXPORT_SYMBOL_GPL(iommu_tce_direction);

void iommu_flush_tce(struct iommu_table *tbl)
{
@@ -972,73 +971,117 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
}
EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);

-unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
-{
- unsigned long oldtce;
- struct iommu_pool *pool = get_pool(tbl, entry);
-
- spin_lock(&(pool->lock));
-
- oldtce = ppc_md.tce_get(tbl, entry);
- if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
- ppc_md.tce_free(tbl, entry, 1);
- else
- oldtce = 0;
-
- spin_unlock(&(pool->lock));
-
- return oldtce;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tce);
-
int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
unsigned long entry, unsigned long pages)
{
- unsigned long oldtce;
- struct page *page;
-
- for ( ; pages; --pages, ++entry) {
- oldtce = iommu_clear_tce(tbl, entry);
- if (!oldtce)
- continue;
-
- page = pfn_to_page(oldtce >> PAGE_SHIFT);
- WARN_ON(!page);
- if (page) {
- if (oldtce & TCE_PCI_WRITE)
- SetPageDirty(page);
- put_page(page);
- }
- }
-
- return 0;
+ return iommu_free_tces(tbl, entry, pages, false);
}
EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages);

-/*
- * hwaddr is a kernel virtual address here (0xc... bazillion),
- * tce_build converts it to a physical address.
- */
+int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
+ unsigned long npages, bool rm)
+{
+ int i, ret = 0, to_free = 0;
+
+ if (rm && !ppc_md.tce_free_rm)
+ return -EAGAIN;
+
+ arch_spin_lock(&tbl->it_rm_lock);
+
+ for (i = 0; i < npages; ++i) {
+ unsigned long oldtce = ppc_md.tce_get(tbl, entry + i);
+ if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
+ continue;
+
+ if (rm) {
+ struct page *pg = realmode_pfn_to_page(
+ oldtce >> PAGE_SHIFT);
+ if (!pg) {
+ ret = -EAGAIN;
+ } else if (PageCompound(pg)) {
+ ret = -EAGAIN;
+ } else {
+ if (oldtce & TCE_PCI_WRITE)
+ SetPageDirty(pg);
+ if (!put_page_unless_one(pg))
+ ret = -EAGAIN;
+ }
+ } else {
+ struct page *pg = pfn_to_page(oldtce >> PAGE_SHIFT);
+ if (!pg) {
+ ret = -EAGAIN;
+ } else {
+ if (oldtce & TCE_PCI_WRITE)
+ SetPageDirty(pg);
+ put_page(pg);
+ }
+ }
+ if (ret)
+ break;
+ to_free = i + 1;
+ }
+
+ if (to_free) {
+ if (rm)
+ ppc_md.tce_free_rm(tbl, entry, to_free);
+ else
+ ppc_md.tce_free(tbl, entry, to_free);
+
+ if (rm && ppc_md.tce_flush_rm)
+ ppc_md.tce_flush_rm(tbl);
+ else if (!rm && ppc_md.tce_flush)
+ ppc_md.tce_flush(tbl);
+ }
+ arch_spin_unlock(&tbl->it_rm_lock);
+
+ /* Make sure updates are seen by hardware */
+ mb();
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_free_tces);
+
int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
- unsigned long hwaddr, enum dma_data_direction direction)
+ unsigned long *hpas, unsigned long npages, bool rm)
{
- int ret = -EBUSY;
- unsigned long oldtce;
- struct iommu_pool *pool = get_pool(tbl, entry);
+ int i, ret = 0;

- spin_lock(&(pool->lock));
+ if (rm && !ppc_md.tce_build_rm)
+ return -EAGAIN;

- oldtce = ppc_md.tce_get(tbl, entry);
- /* Add new entry if it is not busy */
- if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
- ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, direction, NULL);
+ arch_spin_lock(&tbl->it_rm_lock);

- spin_unlock(&(pool->lock));
+ for (i = 0; i < npages; ++i) {
+ if (ppc_md.tce_get(tbl, entry + i) &
+ (TCE_PCI_WRITE | TCE_PCI_READ)) {
+ arch_spin_unlock(&tbl->it_rm_lock);
+ return -EBUSY;
+ }
+ }

- /* if (unlikely(ret))
- pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
- __func__, hwaddr, entry << IOMMU_PAGE_SHIFT,
- hwaddr, ret); */
+ for (i = 0; i < npages; ++i) {
+ unsigned long hva = (unsigned long) __va(hpas[i]);
+ enum dma_data_direction dir = iommu_tce_direction(hva);
+
+ if (rm)
+ ret = ppc_md.tce_build_rm(tbl, entry + i, 1,
+ hva, dir, NULL);
+ else
+ ret = ppc_md.tce_build(tbl, entry + i, 1,
+ hva, dir, NULL);
+ if (ret)
+ break;
+ }
+
+ if (rm && ppc_md.tce_flush_rm)
+ ppc_md.tce_flush_rm(tbl);
+ else if (!rm && ppc_md.tce_flush)
+ ppc_md.tce_flush(tbl);
+
+ arch_spin_unlock(&tbl->it_rm_lock);
+
+ /* Make sure updates are seen by hardware */
+ mb();

return ret;
}
@@ -1049,7 +1092,7 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
{
int ret;
struct page *page = NULL;
- unsigned long hwaddr, offset = tce & IOMMU_PAGE_MASK & ~PAGE_MASK;
+ unsigned long hpa, offset = tce & IOMMU_PAGE_MASK & ~PAGE_MASK;
enum dma_data_direction direction = iommu_tce_direction(tce);

ret = get_user_pages_fast(tce & PAGE_MASK, 1,
@@ -1059,9 +1102,9 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
tce, entry << IOMMU_PAGE_SHIFT, ret); */
return -EFAULT;
}
- hwaddr = (unsigned long) page_address(page) + offset;
+ hpa = __pa((unsigned long) page_address(page)) + offset;

- ret = iommu_tce_build(tbl, entry, hwaddr, direction);
+ ret = iommu_tce_build(tbl, entry, &hpa, 1, false);
if (ret)
put_page(page);

@@ -1075,18 +1118,32 @@ EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);

int iommu_take_ownership(struct iommu_table *tbl)
{
- unsigned long sz = (tbl->it_size + 7) >> 3;
+ unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+ int ret = 0;
+
+ spin_lock_irqsave(&tbl->large_pool.lock, flags);
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_lock(&tbl->pools[i].lock);

if (tbl->it_offset == 0)
clear_bit(0, tbl->it_map);

if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
pr_err("iommu_tce: it_map is not empty");
- return -EBUSY;
+ ret = -EBUSY;
+ if (tbl->it_offset == 0)
+ clear_bit(1, tbl->it_map);
+
+ } else {
+ memset(tbl->it_map, 0xff, sz);
}

- memset(tbl->it_map, 0xff, sz);
- iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_unlock(&tbl->pools[i].lock);
+ spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
+
+ if (!ret)
+ iommu_free_tces(tbl, tbl->it_offset, tbl->it_size, false);

return 0;
}
@@ -1094,14 +1151,23 @@ EXPORT_SYMBOL_GPL(iommu_take_ownership);

void iommu_release_ownership(struct iommu_table *tbl)
{
- unsigned long sz = (tbl->it_size + 7) >> 3;
+ unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+
+ iommu_free_tces(tbl, tbl->it_offset, tbl->it_size, false);
+
+ spin_lock_irqsave(&tbl->large_pool.lock, flags);
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_lock(&tbl->pools[i].lock);

- iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
memset(tbl->it_map, 0, sz);

/* Restore bit#0 set by iommu_init_table() */
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);
+
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_unlock(&tbl->pools[i].lock);
+ spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
}
EXPORT_SYMBOL_GPL(iommu_release_ownership);

--
1.8.4.rc4

2013-08-28 08:41:32

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH v9 07/13] KVM: PPC: enable IOMMU_API for KVM_BOOK3S_64 permanently

It does not make much sense to have KVM in book3s-64bit and
not to have IOMMU bits for PCI pass through support as it costs little
and allows VFIO to function on book3s-kvm.

Having IOMMU_API always enabled makes it unnecessary to have a lot of
"#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those
ifdef's we could have only user space emulated devices accelerated
(but not VFIO) which do not seem to be very useful.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/kvm/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index c55c538..3b2b761 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -59,6 +59,7 @@ config KVM_BOOK3S_64
depends on PPC_BOOK3S_64
select KVM_BOOK3S_64_HANDLER
select KVM
+ select SPAPR_TCE_IOMMU
---help---
Support running unmodified book3s_64 and book3s_32 guest kernels
in virtual machines on book3s_64 host processors.
--
1.8.4.rc4

2013-08-28 08:42:50

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH v9 03/13] KVM: PPC: reserve a capability number for multitce support

This is to reserve a capablity number for upcoming support
of H_PUT_TCE_INDIRECT and H_STUFF_TCE pseries hypercalls
which support mulptiple DMA map/unmap operations per one call.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
2013/07/16:
* changed the number
---
include/uapi/linux/kvm.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index acccd08..99c2533 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
#define KVM_CAP_PPC_RTAS 91
#define KVM_CAP_IRQ_XICS 92
#define KVM_CAP_ARM_EL1_32BIT 93
+#define KVM_CAP_SPAPR_MULTITCE 94

#ifdef KVM_CAP_IRQ_ROUTING

--
1.8.4.rc4

2013-08-28 08:43:59

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH v9 01/13] KVM: PPC: POWERNV: move iommu_add_device earlier

The current implementation of IOMMU on sPAPR does not use iommu_ops
and therefore does not call IOMMU API's bus_set_iommu() which
1) sets iommu_ops for a bus
2) registers a bus notifier
Instead, PCI devices are added to IOMMU groups from
subsys_initcall_sync(tce_iommu_init) which does basically the same
thing without using iommu_ops callbacks.

However Freescale PAMU driver (https://lkml.org/lkml/2013/7/1/158)
implements iommu_ops and when tce_iommu_init is called, every PCI device
is already added to some group so there is a conflict.

This patch does 2 things:
1. removes the loop in which PCI devices were added to groups and
adds explicit iommu_add_device() calls to add devices as soon as they get
the iommu_table pointer assigned to them.
2. moves a bus notifier to powernv code in order to avoid conflict with
the notifier from Freescale driver.

iommu_add_device() and iommu_del_device() are public now.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
v8:
* added the check for iommu_group!=NULL before removing device from a group
as suggested by Wei Yang <[email protected]>

v2:
* added a helper - set_iommu_table_base_and_group - which does
set_iommu_table_base() and iommu_add_device()
---
arch/powerpc/include/asm/iommu.h | 9 +++++++
arch/powerpc/kernel/iommu.c | 41 +++--------------------------
arch/powerpc/platforms/powernv/pci-ioda.c | 8 +++---
arch/powerpc/platforms/powernv/pci-p5ioc2.c | 2 +-
arch/powerpc/platforms/powernv/pci.c | 33 ++++++++++++++++++++++-
arch/powerpc/platforms/pseries/iommu.c | 8 +++---
6 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index c34656a..19ad77f 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -103,6 +103,15 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
int nid);
extern void iommu_register_group(struct iommu_table *tbl,
int pci_domain_number, unsigned long pe_num);
+extern int iommu_add_device(struct device *dev);
+extern void iommu_del_device(struct device *dev);
+
+static inline void set_iommu_table_base_and_group(struct device *dev,
+ void *base)
+{
+ set_iommu_table_base(dev, base);
+ iommu_add_device(dev);
+}

extern int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
struct scatterlist *sglist, int nelems,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b20ff17..15f8ca8 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1105,7 +1105,7 @@ void iommu_release_ownership(struct iommu_table *tbl)
}
EXPORT_SYMBOL_GPL(iommu_release_ownership);

-static int iommu_add_device(struct device *dev)
+int iommu_add_device(struct device *dev)
{
struct iommu_table *tbl;
int ret = 0;
@@ -1134,46 +1134,13 @@ static int iommu_add_device(struct device *dev)

return ret;
}
+EXPORT_SYMBOL_GPL(iommu_add_device);

-static void iommu_del_device(struct device *dev)
+void iommu_del_device(struct device *dev)
{
iommu_group_remove_device(dev);
}
-
-static int iommu_bus_notifier(struct notifier_block *nb,
- unsigned long action, void *data)
-{
- struct device *dev = data;
-
- switch (action) {
- case BUS_NOTIFY_ADD_DEVICE:
- return iommu_add_device(dev);
- case BUS_NOTIFY_DEL_DEVICE:
- iommu_del_device(dev);
- return 0;
- default:
- return 0;
- }
-}
-
-static struct notifier_block tce_iommu_bus_nb = {
- .notifier_call = iommu_bus_notifier,
-};
-
-static int __init tce_iommu_init(void)
-{
- struct pci_dev *pdev = NULL;
-
- BUILD_BUG_ON(PAGE_SIZE < IOMMU_PAGE_SIZE);
-
- for_each_pci_dev(pdev)
- iommu_add_device(&pdev->dev);
-
- bus_register_notifier(&pci_bus_type, &tce_iommu_bus_nb);
- return 0;
-}
-
-subsys_initcall_sync(tce_iommu_init);
+EXPORT_SYMBOL_GPL(iommu_del_device);

#else

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index d8140b1..756bb58 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -440,7 +440,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
return;

pe = &phb->ioda.pe_array[pdn->pe_number];
- set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+ set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
}

static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
@@ -448,7 +448,7 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
struct pci_dev *dev;

list_for_each_entry(dev, &bus->devices, bus_list) {
- set_iommu_table_base(&dev->dev, &pe->tce32_table);
+ set_iommu_table_base_and_group(&dev->dev, &pe->tce32_table);
if (dev->subordinate)
pnv_ioda_setup_bus_dma(pe, dev->subordinate);
}
@@ -611,7 +611,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
iommu_register_group(tbl, pci_domain_nr(pe->pbus), pe->pe_number);

if (pe->pdev)
- set_iommu_table_base(&pe->pdev->dev, tbl);
+ set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
else
pnv_ioda_setup_bus_dma(pe, pe->pbus);

@@ -687,7 +687,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
iommu_init_table(tbl, phb->hose->node);

if (pe->pdev)
- set_iommu_table_base(&pe->pdev->dev, tbl);
+ set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
else
pnv_ioda_setup_bus_dma(pe, pe->pbus);

diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index b68db63..ede341b 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -92,7 +92,7 @@ static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
pci_domain_nr(phb->hose->bus), phb->opal_id);
}

- set_iommu_table_base(&pdev->dev, &phb->p5ioc2.iommu_table);
+ set_iommu_table_base_and_group(&pdev->dev, &phb->p5ioc2.iommu_table);
}

static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index a28d3b5..c005011 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -504,7 +504,7 @@ static void pnv_pci_dma_fallback_setup(struct pci_controller *hose,
pdn->iommu_table = pnv_pci_setup_bml_iommu(hose);
if (!pdn->iommu_table)
return;
- set_iommu_table_base(&pdev->dev, pdn->iommu_table);
+ set_iommu_table_base_and_group(&pdev->dev, pdn->iommu_table);
}

static void pnv_pci_dma_dev_setup(struct pci_dev *pdev)
@@ -623,3 +623,34 @@ void __init pnv_pci_init(void)
ppc_md.teardown_msi_irqs = pnv_teardown_msi_irqs;
#endif
}
+
+static int tce_iommu_bus_notifier(struct notifier_block *nb,
+ unsigned long action, void *data)
+{
+ struct device *dev = data;
+
+ switch (action) {
+ case BUS_NOTIFY_ADD_DEVICE:
+ return iommu_add_device(dev);
+ case BUS_NOTIFY_DEL_DEVICE:
+ if (dev->iommu_group)
+ iommu_del_device(dev);
+ return 0;
+ default:
+ return 0;
+ }
+}
+
+static struct notifier_block tce_iommu_bus_nb = {
+ .notifier_call = tce_iommu_bus_notifier,
+};
+
+static int __init tce_iommu_bus_notifier_init(void)
+{
+ BUILD_BUG_ON(PAGE_SIZE < IOMMU_PAGE_SIZE);
+
+ bus_register_notifier(&pci_bus_type, &tce_iommu_bus_nb);
+ return 0;
+}
+
+subsys_initcall_sync(tce_iommu_bus_notifier_init);
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 23fc1dc..884ae71 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -687,7 +687,8 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
iommu_table_setparms(phb, dn, tbl);
PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
- set_iommu_table_base(&dev->dev, PCI_DN(dn)->iommu_table);
+ set_iommu_table_base_and_group(&dev->dev,
+ PCI_DN(dn)->iommu_table);
return;
}

@@ -699,7 +700,8 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
dn = dn->parent;

if (dn && PCI_DN(dn))
- set_iommu_table_base(&dev->dev, PCI_DN(dn)->iommu_table);
+ set_iommu_table_base_and_group(&dev->dev,
+ PCI_DN(dn)->iommu_table);
else
printk(KERN_WARNING "iommu: Device %s has no iommu table\n",
pci_name(dev));
@@ -1193,7 +1195,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
pr_debug(" found DMA window, table: %p\n", pci->iommu_table);
}

- set_iommu_table_base(&dev->dev, pci->iommu_table);
+ set_iommu_table_base_and_group(&dev->dev, pci->iommu_table);
}

static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
--
1.8.4.rc4

2013-08-28 08:50:14

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH v9 10/13] KVM: PPC: remove warning from kvmppc_core_destroy_vm

Book3S KVM implements in-kernel TCE tables via kvmppc_spapr_tce_table
structs list (created per KVM). Entries in the list are per LIOBN
(logical bus number) and have a TCE table so DMA hypercalls (such as
H_PUT_TCE) can convert LIOBN to a TCE table.

The entry in the list is created via KVM_CREATE_SPAPR_TCE ioctl which
returns an anonymous fd. This fd is used to map the TCE table to the user
space and it also defines the lifetime of the TCE table in the kernel.
Every list item also hold the link to KVM so when KVM is about to be
destroyed, all kvmppc_spapr_tce_table objects are expected to be
released and removed from the global list. And this is what the warning
verifies.

The upcoming support of VFIO IOMMU will extend kvmppc_spapr_tce_table use.
Unlike emulated devices, it will create kvmppc_spapr_tce_table structs
via new KVM device API which opens an anonymous fd
(as KVM_CREATE_SPAPR_TCE) but the release callback does not call
KVM Device's destroy callback immediately. Instead, KVM devices destruction
is postponed this till KVM destruction but this happens after arch-specific
KVM destroy function so the warning does a false alarm.

This removes the warning as it never happens in real life and there is no
any obvious place to put it.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/kvm/book3s_hv.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 9e823ad..5f15ff7 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1974,7 +1974,6 @@ void kvmppc_core_destroy_vm(struct kvm *kvm)
kvmppc_rtas_tokens_free(kvm);

kvmppc_free_hpt(kvm);
- WARN_ON(!list_empty(&kvm->arch.spapr_tce_tables));
}

/* These are stubs for now */
--
1.8.4.rc4

2013-08-28 08:50:44

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH v9 11/13] KVM: PPC: add trampolines for VFIO external API

KVM is going to use VFIO's external API. However KVM can operate even VFIO
is not compiled or loaded so KVM is linked to VFIO dynamically.

This adds proxy functions for VFIO external API.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/kvm/book3s_64_vio.c | 49 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 49 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index cae1099..047b94c 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,8 @@
#include <linux/hugetlb.h>
#include <linux/list.h>
#include <linux/anon_inodes.h>
+#include <linux/module.h>
+#include <linux/vfio.h>

#include <asm/tlbflush.h>
#include <asm/kvm_ppc.h>
@@ -42,6 +44,53 @@

#define ERROR_ADDR ((void *)~(unsigned long)0x0)

+/*
+ * Dynamically linked version of the external user VFIO API.
+ *
+ * As a IOMMU group access control is implemented by VFIO,
+ * there is some API to vefiry that specific process can own
+ * a group. As KVM may run when VFIO is not loaded, KVM is not
+ * linked statically to VFIO, instead wrappers are used.
+ */
+struct vfio_group *kvmppc_vfio_group_get_external_user(struct file *filep)
+{
+ struct vfio_group *ret;
+ struct vfio_group * (*proc)(struct file *) =
+ symbol_get(vfio_group_get_external_user);
+ if (!proc)
+ return NULL;
+
+ ret = proc(filep);
+ symbol_put(vfio_group_get_external_user);
+
+ return ret;
+}
+
+void kvmppc_vfio_group_put_external_user(struct vfio_group *group)
+{
+ void (*proc)(struct vfio_group *) =
+ symbol_get(vfio_group_put_external_user);
+ if (!proc)
+ return;
+
+ proc(group);
+ symbol_put(vfio_group_put_external_user);
+}
+
+int kvmppc_vfio_external_user_iommu_id(struct vfio_group *group)
+{
+ int ret;
+ int (*proc)(struct vfio_group *) =
+ symbol_get(vfio_external_user_iommu_id);
+ if (!proc)
+ return -EINVAL;
+
+ ret = proc(group);
+ symbol_put(vfio_external_user_iommu_id);
+
+ return ret;
+}
+
static long kvmppc_stt_npages(unsigned long window_size)
{
return ALIGN((window_size >> SPAPR_TCE_SHIFT)
--
1.8.4.rc4

2013-08-28 08:50:57

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted an IOMMU TCE table without passing
them to user space which saves time on switching to user space and back.

Both real and virtual modes are supported. The kernel tries to
handle a TCE request in the real mode, if fails it passes the request
to the virtual mode to complete the operation. If it a virtual mode
handler fails, the request is passed to user space.

The first user of this is VFIO on POWER. Trampolines to the VFIO external
user API functions are required for this patch.

This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling
of map/unmap requests. The device supports a single attribute which is
a struct with LIOBN and IOMMU fd. When the attribute is set, the device
establishes the connection between KVM and VFIO.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>

---

Changes:
v9:
* KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
KVM device
* release_spapr_tce_table() is not shared between different TCE types
* reduced the patch size by moving VFIO external API
trampolines to separate patche
* moved documentation from Documentation/virtual/kvm/api.txt to
Documentation/virtual/kvm/devices/spapr_tce_iommu.txt

v8:
* fixed warnings from check_patch.pl

2013/07/11:
* removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
for KVM_BOOK3S_64
* kvmppc_gpa_to_hva_and_get also returns host phys address. Not much sense
for this here but the next patch for hugepages support will use it more.

2013/07/06:
* added realmode arch_spin_lock to protect TCE table from races
in real and virtual modes
* POWERPC IOMMU API is changed to support real mode
* iommu_take_ownership and iommu_release_ownership are protected by
iommu_table's locks
* VFIO external user API use rewritten
* multiple small fixes

2013/06/27:
* tce_list page is referenced now in order to protect it from accident
invalidation during H_PUT_TCE_INDIRECT execution
* added use of the external user VFIO API

2013/06/05:
* changed capability number
* changed ioctl number
* update the doc article number

2013/05/20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
translated TCEs, tries realmode_get_page() on those and if it fails, it
passes control over the virtual mode handler which tries to finish
the request handling
* kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
on a page
* The only reason to pass the request to user mode now is when the user mode
did not register TCE table in the kernel, in all other cases the virtual mode
handler is expected to do the job
---
.../virtual/kvm/devices/spapr_tce_iommu.txt | 37 +++
arch/powerpc/include/asm/kvm_host.h | 4 +
arch/powerpc/kvm/book3s_64_vio.c | 310 ++++++++++++++++++++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 122 ++++++++
arch/powerpc/kvm/powerpc.c | 1 +
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 5 +
7 files changed, 477 insertions(+), 3 deletions(-)
create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt

diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
new file mode 100644
index 0000000..4bc8fc3
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
@@ -0,0 +1,37 @@
+SPAPR TCE IOMMU device
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+
+Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU
+
+Groups:
+ KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE
+ Attributes: single attribute with pair { LIOBN, IOMMU fd}
+
+This is completely made up device which provides API to link
+logical bus number (LIOBN) and IOMMU group. The user space has
+to create a new SPAPR TCE IOMMU device per a logical bus.
+
+LIOBN is a PCI bus identifier from PPC64-server (sPAPR) DMA hypercalls
+(H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE).
+IOMMU group is a minimal isolated device set which can be passed to
+the user space via VFIO.
+
+Right after creation the device is in uninitlized state and requires
+a KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE attribute to be set.
+The attribute contains liobn, IOMMU fd and flags:
+
+struct kvm_create_spapr_tce_iommu_linkage {
+ __u64 liobn;
+ __u32 fd;
+ __u32 flags;
+};
+
+The user space creates the SPAPR TCE IOMMU device, obtains
+an IOMMU fd via VFIO ABI and sets the attribute to the SPAPR TCE IOMMU
+device. At the moment of setting the attribute, the SPAPR TCE IOMMU
+device links LIOBN to IOMMU group and makes necessary steps
+to make sure that VFIO group will not disappear before KVM destroys.
+
+The kernel advertises this feature via KVM_CAP_SPAPR_TCE_IOMMU capability.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index a23f132..c1a039d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -181,6 +181,9 @@ struct kvmppc_spapr_tce_table {
struct kvm *kvm;
u64 liobn;
u32 window_size;
+ struct iommu_group *grp; /* used for IOMMU groups */
+ struct kvm_create_spapr_tce_iommu_linkage link;
+ struct vfio_group *vfio_grp; /* used for IOMMU groups */
struct page *pages[0];
};

@@ -612,6 +615,7 @@ struct kvm_vcpu_arch {
u64 busy_preempt;

unsigned long *tce_tmp_hpas; /* TCE cache for TCE_PUT_INDIRECT */
+ unsigned long tce_tmp_num; /* Number of handled TCEs in cache */
enum {
TCERM_NONE,
TCERM_GETPAGE,
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 047b94c..95f9e1a 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -29,6 +29,8 @@
#include <linux/anon_inodes.h>
#include <linux/module.h>
#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/file.h>

#include <asm/tlbflush.h>
#include <asm/kvm_ppc.h>
@@ -201,9 +203,164 @@ fail:
return ret;
}

-/* Converts guest physical address to host virtual address */
+static int kvmppc_spapr_tce_iommu_create(struct kvm_device *dev, u32 type)
+{
+ return 0;
+}
+
+static long kvmppc_spapr_tce_iommu_link(struct kvm_device *dev,
+ struct kvm_create_spapr_tce_iommu_linkage *args)
+{
+ struct kvm *kvm = dev->kvm;
+ struct kvmppc_spapr_tce_table *tt = NULL;
+ struct iommu_group *grp;
+ struct iommu_table *tbl;
+ struct file *vfio_filp;
+ struct vfio_group *vfio_grp;
+ int ret = -ENXIO, iommu_id;
+
+ /* Check this LIOBN hasn't been previously registered */
+ list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) {
+ if (tt->liobn == args->liobn)
+ return -EBUSY;
+ }
+
+ vfio_filp = fget(args->fd);
+ if (!vfio_filp)
+ return -ENXIO;
+
+ /*
+ * Lock the group. Fails if group is not viable or
+ * does not have IOMMU set
+ */
+ vfio_grp = kvmppc_vfio_group_get_external_user(vfio_filp);
+ if (IS_ERR_VALUE((unsigned long)vfio_grp))
+ goto fput_exit;
+
+ /* Get IOMMU ID, find iommu_group and iommu_table*/
+ iommu_id = kvmppc_vfio_external_user_iommu_id(vfio_grp);
+ if (iommu_id < 0)
+ goto grpput_fput_exit;
+
+ grp = iommu_group_get_by_id(iommu_id);
+ if (!grp)
+ goto grpput_fput_exit;
+
+ tbl = iommu_group_get_iommudata(grp);
+ if (!tbl)
+ goto grpput_fput_exit;
+
+ /* Create a TCE table descriptor and add into the descriptor list */
+ tt = kzalloc(sizeof(*tt), GFP_KERNEL);
+ if (!tt)
+ goto grpput_fput_exit;
+
+ tt->liobn = args->liobn;
+ tt->grp = grp;
+ tt->window_size = tbl->it_size << IOMMU_PAGE_SHIFT;
+ tt->vfio_grp = vfio_grp;
+
+ /* Add the TCE table descriptor to the descriptor list */
+ mutex_lock(&kvm->lock);
+ list_add(&tt->list, &kvm->arch.spapr_tce_tables);
+ mutex_unlock(&kvm->lock);
+
+ dev->private = tt;
+ tt->link = *args;
+
+ ret = 0;
+
+ goto fput_exit;
+
+grpput_fput_exit:
+ kvmppc_vfio_group_put_external_user(vfio_grp);
+fput_exit:
+ fput(vfio_filp);
+
+ return ret;
+}
+
+static int kvmppc_spapr_tce_iommu_set_attr(struct kvm_device *dev,
+ struct kvm_device_attr *attr)
+{
+ struct kvmppc_spapr_tce_table *tt = dev->private;
+ struct kvm_create_spapr_tce_iommu_linkage args;
+ void __user *argp = (void __user *) attr->addr;
+
+ switch (attr->group) {
+ case KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE:
+ if (tt)
+ return -EBUSY;
+
+ if (copy_from_user(&args, argp, sizeof(args)))
+ return -EFAULT;
+
+ return kvmppc_spapr_tce_iommu_link(dev, &args);
+ }
+ return -ENXIO;
+}
+
+static int kvmppc_spapr_tce_iommu_get_attr(struct kvm_device *dev,
+ struct kvm_device_attr *attr)
+{
+ struct kvmppc_spapr_tce_table *tt = dev->private;
+ void __user *argp = (void __user *) attr->addr;
+
+ switch (attr->group) {
+ case KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE:
+ if (!tt)
+ return -EFAULT;
+ if (copy_to_user(&tt->link, argp, sizeof(tt->link)))
+ return -EFAULT;
+ return 0;
+ }
+ return -ENXIO;
+}
+
+static int kvmppc_spapr_tce_iommu_has_attr(struct kvm_device *dev,
+ struct kvm_device_attr *attr)
+{
+ switch (attr->group) {
+ case KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE:
+ return 0;
+ }
+ return -ENXIO;
+}
+
+static void kvmppc_spapr_tce_iommu_destroy(struct kvm_device *dev)
+{
+ struct kvmppc_spapr_tce_table *tt = dev->private;
+ struct kvm *kvm = dev->kvm;
+
+ if (tt) {
+ mutex_lock(&kvm->lock);
+ list_del(&tt->list);
+
+ if (tt->vfio_grp)
+ kvmppc_vfio_group_put_external_user(tt->vfio_grp);
+ iommu_group_put(tt->grp);
+
+ kfree(tt);
+ mutex_unlock(&kvm->lock);
+ }
+ kfree(dev);
+}
+
+struct kvm_device_ops kvmppc_spapr_tce_iommu_ops = {
+ .name = "kvm-spapr-tce-iommu",
+ .create = kvmppc_spapr_tce_iommu_create,
+ .set_attr = kvmppc_spapr_tce_iommu_set_attr,
+ .get_attr = kvmppc_spapr_tce_iommu_get_attr,
+ .has_attr = kvmppc_spapr_tce_iommu_has_attr,
+ .destroy = kvmppc_spapr_tce_iommu_destroy,
+};
+
+/*
+ * Converts guest physical address to host virtual address.
+ * Also returns host physical address which is to put to TCE table.
+ */
static void __user *kvmppc_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
- unsigned long gpa, struct page **pg)
+ unsigned long gpa, struct page **pg, unsigned long *phpa)
{
unsigned long hva, gfn = gpa >> PAGE_SHIFT;
struct kvm_memory_slot *memslot;
@@ -218,9 +375,140 @@ static void __user *kvmppc_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
if (get_user_pages_fast(hva & PAGE_MASK, 1, is_write, pg) != 1)
return ERROR_ADDR;

+ if (phpa)
+ *phpa = __pa((unsigned long) page_address(*pg)) |
+ (hva & ~PAGE_MASK);
+
return (void *) hva;
}

+long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce)
+{
+ struct page *pg = NULL;
+ unsigned long hpa;
+ void __user *hva;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* Clear TCE */
+ if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE))) {
+ if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+ return H_PARAMETER;
+
+ if (iommu_free_tces(tbl, ioba >> IOMMU_PAGE_SHIFT,
+ 1, false))
+ return H_HARDWARE;
+
+ return H_SUCCESS;
+ }
+
+ /* Put TCE */
+ if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
+ /* Retry iommu_tce_build if it failed in real mode */
+ vcpu->arch.tce_rm_fail = TCERM_NONE;
+ hpa = vcpu->arch.tce_tmp_hpas[0];
+ } else {
+ if (iommu_tce_put_param_check(tbl, ioba, tce))
+ return H_PARAMETER;
+
+ hva = kvmppc_gpa_to_hva_and_get(vcpu, tce, &pg, &hpa);
+ if (hva == ERROR_ADDR)
+ return H_HARDWARE;
+ }
+
+ if (!iommu_tce_build(tbl, ioba >> IOMMU_PAGE_SHIFT, &hpa, 1, false))
+ return H_SUCCESS;
+
+ pg = pfn_to_page(hpa >> PAGE_SHIFT);
+ if (pg)
+ put_page(pg);
+
+ return H_HARDWARE;
+}
+
+static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt, unsigned long ioba,
+ unsigned long __user *tces, unsigned long npages)
+{
+ long i = 0, start = 0;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ switch (vcpu->arch.tce_rm_fail) {
+ case TCERM_NONE:
+ break;
+ case TCERM_GETPAGE:
+ start = vcpu->arch.tce_tmp_num;
+ break;
+ case TCERM_PUTTCE:
+ goto put_tces;
+ case TCERM_PUTLIST:
+ default:
+ WARN_ON(1);
+ return H_HARDWARE;
+ }
+
+ for (i = start; i < npages; ++i) {
+ struct page *pg = NULL;
+ unsigned long gpa;
+ void __user *hva;
+
+ if (get_user(gpa, tces + i))
+ return H_HARDWARE;
+
+ if (iommu_tce_put_param_check(tbl, ioba +
+ (i << IOMMU_PAGE_SHIFT), gpa))
+ return H_PARAMETER;
+
+ hva = kvmppc_gpa_to_hva_and_get(vcpu, gpa, &pg,
+ &vcpu->arch.tce_tmp_hpas[i]);
+ if (hva == ERROR_ADDR)
+ goto putpages_flush_exit;
+ }
+
+put_tces:
+ if (!iommu_tce_build(tbl, ioba >> IOMMU_PAGE_SHIFT,
+ vcpu->arch.tce_tmp_hpas, npages, false))
+ return H_SUCCESS;
+
+putpages_flush_exit:
+ for (--i; i >= 0; --i) {
+ struct page *pg;
+ pg = pfn_to_page(vcpu->arch.tce_tmp_hpas[i] >> PAGE_SHIFT);
+ if (pg)
+ put_page(pg);
+ }
+
+ return H_HARDWARE;
+}
+
+long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+ unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+ return H_PARAMETER;
+
+ if (iommu_free_tces(tbl, entry, npages, false))
+ return H_HARDWARE;
+
+ return H_SUCCESS;
+}
+
long kvmppc_h_put_tce(struct kvm_vcpu *vcpu,
unsigned long liobn, unsigned long ioba,
unsigned long tce)
@@ -232,6 +520,10 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu,
if (!tt)
return H_TOO_HARD;

+ if (tt->grp)
+ return kvmppc_h_put_tce_iommu(vcpu, tt, liobn, ioba, tce);
+
+ /* Emulated IO */
if (ioba >= tt->window_size)
return H_PARAMETER;

@@ -270,13 +562,20 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;

- tces = kvmppc_gpa_to_hva_and_get(vcpu, tce_list, &pg);
+ tces = kvmppc_gpa_to_hva_and_get(vcpu, tce_list, &pg, NULL);
if (tces == ERROR_ADDR)
return H_TOO_HARD;

if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
goto put_list_page_exit;

+ if (tt->grp) {
+ ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
+ tt, ioba, tces, npages);
+ goto put_list_page_exit;
+ }
+
+ /* Emulated IO */
for (i = 0; i < npages; ++i) {
if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
ret = H_PARAMETER;
@@ -315,6 +614,11 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
if (!tt)
return H_TOO_HARD;

+ if (tt->grp)
+ return kvmppc_h_stuff_tce_iommu(vcpu, tt, liobn, ioba,
+ tce_value, npages);
+
+ /* Emulated IO */
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 2b2ce0a..c647990 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -26,6 +26,7 @@
#include <linux/slab.h>
#include <linux/hugetlb.h>
#include <linux/list.h>
+#include <linux/iommu.h>

#include <asm/tlbflush.h>
#include <asm/kvm_ppc.h>
@@ -191,6 +192,111 @@ static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
return hpa;
}

+static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt, unsigned long liobn,
+ unsigned long ioba, unsigned long tce)
+{
+ int ret = 0;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+ unsigned long hpa;
+ struct page *pg = NULL;
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* Clear TCE */
+ if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE))) {
+ if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+ return H_PARAMETER;
+
+ if (iommu_free_tces(tbl, ioba >> IOMMU_PAGE_SHIFT, 1, true))
+ return H_TOO_HARD;
+
+ return H_SUCCESS;
+ }
+
+ /* Put TCE */
+ if (iommu_tce_put_param_check(tbl, ioba, tce))
+ return H_PARAMETER;
+
+ hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce, &pg);
+ if (hpa != ERROR_ADDR) {
+ ret = iommu_tce_build(tbl, ioba >> IOMMU_PAGE_SHIFT,
+ &hpa, 1, true);
+ }
+
+ if (((hpa == ERROR_ADDR) && pg) || ret) {
+ vcpu->arch.tce_tmp_hpas[0] = hpa;
+ vcpu->arch.tce_tmp_num = 0;
+ vcpu->arch.tce_rm_fail = TCERM_PUTTCE;
+ return H_TOO_HARD;
+ }
+
+ return H_SUCCESS;
+}
+
+static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt, unsigned long ioba,
+ unsigned long *tces, unsigned long npages)
+{
+ int i, ret;
+ unsigned long hpa;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+ struct page *pg = NULL;
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* Check all TCEs */
+ for (i = 0; i < npages; ++i) {
+ if (iommu_tce_put_param_check(tbl, ioba +
+ (i << IOMMU_PAGE_SHIFT), tces[i]))
+ return H_PARAMETER;
+ }
+
+ /* Translate TCEs and go get_page() */
+ for (i = 0; i < npages; ++i) {
+ hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tces[i], &pg);
+ if (hpa == ERROR_ADDR) {
+ vcpu->arch.tce_tmp_num = i;
+ vcpu->arch.tce_rm_fail = TCERM_GETPAGE;
+ return H_TOO_HARD;
+ }
+ vcpu->arch.tce_tmp_hpas[i] = hpa;
+ }
+
+ /* Put TCEs to the table */
+ ret = iommu_tce_build(tbl, (ioba >> IOMMU_PAGE_SHIFT),
+ vcpu->arch.tce_tmp_hpas, npages, true);
+ if (ret == -EAGAIN) {
+ vcpu->arch.tce_rm_fail = TCERM_PUTTCE;
+ return H_TOO_HARD;
+ } else if (ret) {
+ return H_HARDWARE;
+ }
+
+ return H_SUCCESS;
+}
+
+static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+ return H_PARAMETER;
+
+ if (iommu_free_tces(tbl, ioba >> IOMMU_PAGE_SHIFT, npages, true))
+ return H_TOO_HARD;
+
+ return H_SUCCESS;
+}
+
long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce)
{
@@ -201,6 +307,10 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
if (!tt)
return H_TOO_HARD;

+ if (tt->grp)
+ return kvmppc_rm_h_put_tce_iommu(vcpu, tt, liobn, ioba, tce);
+
+ /* Emulated IO */
if (ioba >= tt->window_size)
return H_PARAMETER;

@@ -243,6 +353,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
goto put_unlock_exit;
}

+ if (tt->grp) {
+ ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
+ tt, ioba, (unsigned long *)tces, npages);
+ goto put_unlock_exit;
+ }
+
+ /* Emulated IO */
for (i = 0; i < npages; ++i) {
ret = kvmppc_tce_validate(((unsigned long *)tces)[i]);
if (ret)
@@ -273,6 +390,11 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
if (!tt)
return H_TOO_HARD;

+ if (tt->grp)
+ return kvmppc_rm_h_stuff_tce_iommu(vcpu, tt, liobn, ioba,
+ tce_value, npages);
+
+ /* Emulated IO */
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index ccb578b..ea9af59 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -395,6 +395,7 @@ int kvm_dev_ioctl_check_extension(long ext)
r = 1;
break;
case KVM_CAP_SPAPR_MULTITCE:
+ case KVM_CAP_SPAPR_TCE_IOMMU:
r = 1;
break;
#endif
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7a8c1b3..016df11 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1053,6 +1053,7 @@ struct kvm_device *kvm_device_from_filp(struct file *filp);

extern struct kvm_device_ops kvm_mpic_ops;
extern struct kvm_device_ops kvm_xics_ops;
+extern struct kvm_device_ops kvmppc_spapr_tce_iommu_ops;

#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1580dd4..34c3c22 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2282,6 +2282,11 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
ops = &kvm_xics_ops;
break;
#endif
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+ case KVM_DEV_TYPE_SPAPR_TCE_IOMMU:
+ ops = &kvmppc_spapr_tce_iommu_ops;
+ break;
+#endif
default:
return -ENODEV;
}
--
1.8.4.rc4

2013-08-28 08:51:43

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH v9 13/13] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

This adds special support for huge pages (16MB) in real mode.

The reference counting cannot be easily done for such pages in real
mode (when MMU is off) so we added a hash table of huge pages.
It is populated in virtual mode and get_page is called just once
per a huge page. Real mode handlers check if the requested page is
in the hash table, then no reference counting is done, otherwise
an exit to virtual mode happens. The hash table is released at KVM
exit.

At the moment the fastest card available for tests uses up to 9 huge
pages so walking through this hash table does not cost much.
However this can change and we may want to optimize this.

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>

---

Changes:
2013/07/12:
* removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
for KVM_BOOK3S_64

2013/06/27:
* list of huge pages replaces with hashtable for better performance
* spinlock removed from real mode and only protects insertion of new
huge [ages descriptors into the hashtable

2013/06/05:
* fixed compile error when CONFIG_IOMMU_API=n

2013/05/20:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints warning if it is called twice for the same
huge page as the real mode handler is expected to fail just once - when a huge
page is not in the list yet.
* the huge page is refcounted twice - when added to the hugepage list and
when used in the virtual mode hcall handler (can be optimized but it will
make the patch less nice).
---
arch/powerpc/include/asm/kvm_host.h | 25 ++++++++
arch/powerpc/kernel/iommu.c | 6 +-
arch/powerpc/kvm/book3s_64_vio.c | 122 ++++++++++++++++++++++++++++++++++--
arch/powerpc/kvm/book3s_64_vio_hv.c | 32 +++++++++-
4 files changed, 176 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index c1a039d..b970d26 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -31,6 +31,7 @@
#include <linux/list.h>
#include <linux/atomic.h>
#include <linux/tracepoint.h>
+#include <linux/hashtable.h>
#include <asm/kvm_asm.h>
#include <asm/processor.h>
#include <asm/page.h>
@@ -184,9 +185,33 @@ struct kvmppc_spapr_tce_table {
struct iommu_group *grp; /* used for IOMMU groups */
struct kvm_create_spapr_tce_iommu_linkage link;
struct vfio_group *vfio_grp; /* used for IOMMU groups */
+ DECLARE_HASHTABLE(hash_tab, ilog2(64)); /* used for IOMMU groups */
+ spinlock_t hugepages_write_lock; /* used for IOMMU groups */
struct page *pages[0];
};

+/*
+ * The KVM guest can be backed with 16MB pages.
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa) hash_32(gpa >> 24, 32)
+
+struct kvmppc_spapr_iommu_hugepage {
+ struct hlist_node hash_node;
+ unsigned long gpa; /* Guest physical address */
+ unsigned long hpa; /* Host physical address */
+ struct page *page; /* page struct of the very first subpage */
+ unsigned long size; /* Huge page size (always 16MB at the moment) */
+};
+
struct kvmppc_linear_info {
void *base_virt;
unsigned long base_pfn;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ff0cd90..d0593c9 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
if (!pg) {
ret = -EAGAIN;
} else if (PageCompound(pg)) {
- ret = -EAGAIN;
+ /* Hugepages will be released at KVM exit */
+ ret = 0;
} else {
if (oldtce & TCE_PCI_WRITE)
SetPageDirty(pg);
@@ -1010,6 +1011,9 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
struct page *pg = pfn_to_page(oldtce >> PAGE_SHIFT);
if (!pg) {
ret = -EAGAIN;
+ } else if (PageCompound(pg)) {
+ /* Hugepages will be released at KVM exit */
+ ret = 0;
} else {
if (oldtce & TCE_PCI_WRITE)
SetPageDirty(pg);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 95f9e1a..1851778 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -93,6 +93,102 @@ int kvmppc_vfio_external_user_iommu_id(struct vfio_group *group)
return ret;
}

+/*
+ * API to support huge pages in real mode
+ */
+static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
+{
+ spin_lock_init(&tt->hugepages_write_lock);
+ hash_init(tt->hash_tab);
+}
+
+static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
+{
+ int bkt;
+ struct kvmppc_spapr_iommu_hugepage *hp;
+ struct hlist_node *tmp;
+
+ spin_lock(&tt->hugepages_write_lock);
+ hash_for_each_safe(tt->hash_tab, bkt, tmp, hp, hash_node) {
+ pr_debug("Release HP liobn=%llx #%u gpa=%lx hpa=%lx size=%ld\n",
+ tt->liobn, bkt, hp->gpa, hp->hpa, hp->size);
+ hlist_del_rcu(&hp->hash_node);
+
+ put_page(hp->page);
+ kfree(hp);
+ }
+ spin_unlock(&tt->hugepages_write_lock);
+}
+
+/* Returns true if a page with GPA is already in the hash table */
+static bool kvmppc_iommu_hugepage_lookup_gpa(struct kvmppc_spapr_tce_table *tt,
+ unsigned long gpa)
+{
+ struct kvmppc_spapr_iommu_hugepage *hp;
+ const unsigned key = KVMPPC_SPAPR_HUGEPAGE_HASH(gpa);
+
+ hash_for_each_possible_rcu(tt->hash_tab, hp, hash_node, key) {
+ if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
+ continue;
+
+ return true;
+ }
+
+ return false;
+}
+
+/* Returns true if a page with GPA has been added to the hash table */
+static bool kvmppc_iommu_hugepage_add(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
+ unsigned long hva, unsigned long gpa)
+{
+ struct kvmppc_spapr_iommu_hugepage *hp;
+ const unsigned key = KVMPPC_SPAPR_HUGEPAGE_HASH(gpa);
+ pte_t *ptep;
+ unsigned int shift = 0;
+ static const int is_write = 1;
+
+ ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+ WARN_ON(!ptep);
+
+ if (!ptep || (shift <= PAGE_SHIFT))
+ return false;
+
+ hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+ if (!hp)
+ return false;
+
+ hp->gpa = gpa & ~((1 << shift) - 1);
+ hp->hpa = (pte_pfn(*ptep) << PAGE_SHIFT);
+ hp->size = 1 << shift;
+
+ if (get_user_pages_fast(hva & ~(hp->size - 1), 1,
+ is_write, &hp->page) != 1) {
+ kfree(hp);
+ return false;
+ }
+ hash_add_rcu(tt->hash_tab, &hp->hash_node, key);
+
+ return true;
+}
+
+/** Returns true if a page with GPA is in the hash table or
+ * has just been added.
+ */
+static bool kvmppc_iommu_hugepage_try_add(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
+ unsigned long hva, unsigned long gpa)
+{
+ bool ret;
+
+ spin_lock(&tt->hugepages_write_lock);
+ ret = kvmppc_iommu_hugepage_lookup_gpa(tt, gpa) ||
+ kvmppc_iommu_hugepage_add(vcpu, tt, hva, gpa);
+ spin_unlock(&tt->hugepages_write_lock);
+
+ return ret;
+}
+
static long kvmppc_stt_npages(unsigned long window_size)
{
return ALIGN((window_size >> SPAPR_TCE_SHIFT)
@@ -106,6 +202,7 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)

mutex_lock(&kvm->lock);
list_del(&stt->list);
+ kvmppc_iommu_hugepages_cleanup(stt);
for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
__free_page(stt->pages[i]);
kfree(stt);
@@ -185,6 +282,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
kvm_get_kvm(kvm);

mutex_lock(&kvm->lock);
+ kvmppc_iommu_hugepages_init(stt);
list_add(&stt->list, &kvm->arch.spapr_tce_tables);

mutex_unlock(&kvm->lock);
@@ -262,6 +360,7 @@ static long kvmppc_spapr_tce_iommu_link(struct kvm_device *dev,

/* Add the TCE table descriptor to the descriptor list */
mutex_lock(&kvm->lock);
+ kvmppc_iommu_hugepages_init(tt);
list_add(&tt->list, &kvm->arch.spapr_tce_tables);
mutex_unlock(&kvm->lock);

@@ -336,6 +435,7 @@ static void kvmppc_spapr_tce_iommu_destroy(struct kvm_device *dev)
mutex_lock(&kvm->lock);
list_del(&tt->list);

+ kvmppc_iommu_hugepages_cleanup(tt);
if (tt->vfio_grp)
kvmppc_vfio_group_put_external_user(tt->vfio_grp);
iommu_group_put(tt->grp);
@@ -360,6 +460,7 @@ struct kvm_device_ops kvmppc_spapr_tce_iommu_ops = {
* Also returns host physical address which is to put to TCE table.
*/
static void __user *kvmppc_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
unsigned long gpa, struct page **pg, unsigned long *phpa)
{
unsigned long hva, gfn = gpa >> PAGE_SHIFT;
@@ -379,6 +480,17 @@ static void __user *kvmppc_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
*phpa = __pa((unsigned long) page_address(*pg)) |
(hva & ~PAGE_MASK);

+ if (PageCompound(*pg)) {
+ /** Check if this GPA is taken care of by the hash table.
+ * If this is the case, do not show the caller page struct
+ * address as huge pages will be released at KVM exit.
+ */
+ if (kvmppc_iommu_hugepage_try_add(vcpu, tt, hva, gpa)) {
+ put_page(*pg);
+ *pg = NULL;
+ }
+ }
+
return (void *) hva;
}

@@ -416,7 +528,7 @@ long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
if (iommu_tce_put_param_check(tbl, ioba, tce))
return H_PARAMETER;

- hva = kvmppc_gpa_to_hva_and_get(vcpu, tce, &pg, &hpa);
+ hva = kvmppc_gpa_to_hva_and_get(vcpu, tt, tce, &pg, &hpa);
if (hva == ERROR_ADDR)
return H_HARDWARE;
}
@@ -425,7 +537,7 @@ long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
return H_SUCCESS;

pg = pfn_to_page(hpa >> PAGE_SHIFT);
- if (pg)
+ if (pg && !PageCompound(pg))
put_page(pg);

return H_HARDWARE;
@@ -467,7 +579,7 @@ static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
(i << IOMMU_PAGE_SHIFT), gpa))
return H_PARAMETER;

- hva = kvmppc_gpa_to_hva_and_get(vcpu, gpa, &pg,
+ hva = kvmppc_gpa_to_hva_and_get(vcpu, tt, gpa, &pg,
&vcpu->arch.tce_tmp_hpas[i]);
if (hva == ERROR_ADDR)
goto putpages_flush_exit;
@@ -482,7 +594,7 @@ putpages_flush_exit:
for (--i; i >= 0; --i) {
struct page *pg;
pg = pfn_to_page(vcpu->arch.tce_tmp_hpas[i] >> PAGE_SHIFT);
- if (pg)
+ if (pg && !PageCompound(pg))
put_page(pg);
}

@@ -562,7 +674,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;

- tces = kvmppc_gpa_to_hva_and_get(vcpu, tce_list, &pg, NULL);
+ tces = kvmppc_gpa_to_hva_and_get(vcpu, tt, tce_list, &pg, NULL);
if (tces == ERROR_ADDR)
return H_TOO_HARD;

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index c647990..9488149 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -133,12 +133,30 @@ void kvmppc_tce_put(struct kvmppc_spapr_tce_table *tt,
EXPORT_SYMBOL_GPL(kvmppc_tce_put);

#ifdef CONFIG_KVM_BOOK3S_64_HV
+
+static unsigned long kvmppc_rm_hugepage_gpa_to_hpa(
+ struct kvmppc_spapr_tce_table *tt,
+ unsigned long gpa)
+{
+ struct kvmppc_spapr_iommu_hugepage *hp;
+ const unsigned key = KVMPPC_SPAPR_HUGEPAGE_HASH(gpa);
+
+ hash_for_each_possible_rcu_notrace(tt->hash_tab, hp, hash_node, key) {
+ if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
+ continue;
+ return hp->hpa + (gpa & (hp->size - 1));
+ }
+
+ return ERROR_ADDR;
+}
+
/*
* Converts guest physical address to host physical address.
* Tries to increase page counter via get_page_unless_zero() and
* returns ERROR_ADDR if failed.
*/
static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
unsigned long gpa, struct page **pg)
{
struct kvm_memory_slot *memslot;
@@ -147,6 +165,14 @@ static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
unsigned long gfn = gpa >> PAGE_SHIFT;
unsigned shift = 0;

+ /* Check if it is a hugepage */
+ hpa = kvmppc_rm_hugepage_gpa_to_hpa(tt, gpa);
+ if (hpa != ERROR_ADDR) {
+ *pg = NULL; /* Tell the caller not to put page */
+ return hpa;
+ }
+
+ /* System page size case */
memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
if (!memslot)
return ERROR_ADDR;
@@ -219,7 +245,7 @@ static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
if (iommu_tce_put_param_check(tbl, ioba, tce))
return H_PARAMETER;

- hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce, &pg);
+ hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tt, tce, &pg);
if (hpa != ERROR_ADDR) {
ret = iommu_tce_build(tbl, ioba >> IOMMU_PAGE_SHIFT,
&hpa, 1, true);
@@ -256,7 +282,7 @@ static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,

/* Translate TCEs and go get_page() */
for (i = 0; i < npages; ++i) {
- hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tces[i], &pg);
+ hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tt, tces[i], &pg);
if (hpa == ERROR_ADDR) {
vcpu->arch.tce_tmp_num = i;
vcpu->arch.tce_rm_fail = TCERM_GETPAGE;
@@ -347,7 +373,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;

- tces = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce_list, &pg);
+ tces = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tt, tce_list, &pg);
if (tces == ERROR_ADDR) {
ret = H_TOO_HARD;
goto put_unlock_exit;
--
1.8.4.rc4

2013-08-30 10:27:07

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v9 00/13] KVM: PPC: IOMMU in-kernel handling of VFIO

On 08/28/2013 06:37 PM, Alexey Kardashevskiy wrote:
> This accelerates VFIO DMA operations on POWER by moving them
> into kernel.
>
> This depends on VFIO external API patch which is on its way to upstream.
>
> Changes:
> v9:
> * replaced the "link logical bus number to IOMMU group" ioctl to KVM
> with a KVM device doing the same thing, i.e. the actual changes are in
> these 3 patches:
> KVM: PPC: reserve a capability and KVM device type for realmode VFIO
> KVM: PPC: remove warning from kvmppc_core_destroy_vm
> KVM: PPC: Add support for IOMMU in-kernel handling
>
> * moved some VFIO external API bits to a separate patch to reduce the size
> of the "KVM: PPC: Add support for IOMMU in-kernel handling" patch
>
> * fixed code style problems reported by checkpatch.pl.
>
> v8:
> * fixed comments about capabilities numbers
>
> v7:
> * rebased on v3.11-rc3.
> * VFIO external user API will go through VFIO tree so it is
> excluded from this series.
> * As nobody ever reacted on "hashtable: add hash_for_each_possible_rcu_notrace()",
> Ben suggested to push it via his tree so I included it to the series.
> * realmode_(get|put)_page is reworked.
>
> More details in the individual patch comments.
>
> Alexey Kardashevskiy (13):
> KVM: PPC: POWERNV: move iommu_add_device earlier
> hashtable: add hash_for_each_possible_rcu_notrace()
> KVM: PPC: reserve a capability number for multitce support
> KVM: PPC: reserve a capability and KVM device type for realmode VFIO


Hi Gleb!

Could you please review and pick (if they are ok) the two "capability"
patches from above?

It would be cool if you also looked at "KVM: PPC: Add support for IOMMU
in-kernel handling", the part about KVM device for SPAPR TCE IOMMU table.

Thanks!



> powerpc: Prepare to support kernel handling of IOMMU map/unmap
> powerpc: add real mode support for dma operations on powernv
> KVM: PPC: enable IOMMU_API for KVM_BOOK3S_64 permanently
> KVM: PPC: Add support for multiple-TCE hcalls
> powerpc/iommu: rework to support realmode
> KVM: PPC: remove warning from kvmppc_core_destroy_vm
> KVM: PPC: add trampolines for VFIO external API
> KVM: PPC: Add support for IOMMU in-kernel handling
> KVM: PPC: Add hugepage support for IOMMU in-kernel handling
>
> Documentation/virtual/kvm/api.txt | 26 +
> .../virtual/kvm/devices/spapr_tce_iommu.txt | 37 ++
> arch/powerpc/include/asm/iommu.h | 18 +-
> arch/powerpc/include/asm/kvm_host.h | 38 ++
> arch/powerpc/include/asm/kvm_ppc.h | 16 +-
> arch/powerpc/include/asm/machdep.h | 12 +
> arch/powerpc/include/asm/pgtable-ppc64.h | 2 +
> arch/powerpc/include/uapi/asm/kvm.h | 8 +
> arch/powerpc/kernel/iommu.c | 243 +++++----
> arch/powerpc/kvm/Kconfig | 1 +
> arch/powerpc/kvm/book3s_64_vio.c | 597 ++++++++++++++++++++-
> arch/powerpc/kvm/book3s_64_vio_hv.c | 408 +++++++++++++-
> arch/powerpc/kvm/book3s_hv.c | 42 +-
> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 8 +-
> arch/powerpc/kvm/book3s_pr_papr.c | 35 ++
> arch/powerpc/kvm/powerpc.c | 4 +
> arch/powerpc/mm/init_64.c | 50 +-
> arch/powerpc/platforms/powernv/pci-ioda.c | 57 +-
> arch/powerpc/platforms/powernv/pci-p5ioc2.c | 2 +-
> arch/powerpc/platforms/powernv/pci.c | 75 ++-
> arch/powerpc/platforms/powernv/pci.h | 3 +-
> arch/powerpc/platforms/pseries/iommu.c | 8 +-
> include/linux/hashtable.h | 15 +
> include/linux/kvm_host.h | 1 +
> include/linux/mm.h | 14 +
> include/linux/page-flags.h | 4 +-
> include/uapi/linux/kvm.h | 3 +
> virt/kvm/kvm_main.c | 5 +
> 28 files changed, 1564 insertions(+), 168 deletions(-)
> create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>


--
Alexey

2013-08-30 17:58:51

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v9 00/13] KVM: PPC: IOMMU in-kernel handling of VFIO

On Fri, Aug 30, 2013 at 08:26:54PM +1000, Alexey Kardashevskiy wrote:
> On 08/28/2013 06:37 PM, Alexey Kardashevskiy wrote:
> > This accelerates VFIO DMA operations on POWER by moving them
> > into kernel.
> >
> > This depends on VFIO external API patch which is on its way to upstream.
> >
> > Changes:
> > v9:
> > * replaced the "link logical bus number to IOMMU group" ioctl to KVM
> > with a KVM device doing the same thing, i.e. the actual changes are in
> > these 3 patches:
> > KVM: PPC: reserve a capability and KVM device type for realmode VFIO
> > KVM: PPC: remove warning from kvmppc_core_destroy_vm
> > KVM: PPC: Add support for IOMMU in-kernel handling
> >
> > * moved some VFIO external API bits to a separate patch to reduce the size
> > of the "KVM: PPC: Add support for IOMMU in-kernel handling" patch
> >
> > * fixed code style problems reported by checkpatch.pl.
> >
> > v8:
> > * fixed comments about capabilities numbers
> >
> > v7:
> > * rebased on v3.11-rc3.
> > * VFIO external user API will go through VFIO tree so it is
> > excluded from this series.
> > * As nobody ever reacted on "hashtable: add hash_for_each_possible_rcu_notrace()",
> > Ben suggested to push it via his tree so I included it to the series.
> > * realmode_(get|put)_page is reworked.
> >
> > More details in the individual patch comments.
> >
> > Alexey Kardashevskiy (13):
> > KVM: PPC: POWERNV: move iommu_add_device earlier
> > hashtable: add hash_for_each_possible_rcu_notrace()
> > KVM: PPC: reserve a capability number for multitce support
> > KVM: PPC: reserve a capability and KVM device type for realmode VFIO
>
>
> Hi Gleb!
>
> Could you please review and pick (if they are ok) the two "capability"
> patches from above?
>
> It would be cool if you also looked at "KVM: PPC: Add support for IOMMU
> in-kernel handling", the part about KVM device for SPAPR TCE IOMMU table.
>
> Thanks!
Will do it next week.

>
>
>
> > powerpc: Prepare to support kernel handling of IOMMU map/unmap
> > powerpc: add real mode support for dma operations on powernv
> > KVM: PPC: enable IOMMU_API for KVM_BOOK3S_64 permanently
> > KVM: PPC: Add support for multiple-TCE hcalls
> > powerpc/iommu: rework to support realmode
> > KVM: PPC: remove warning from kvmppc_core_destroy_vm
> > KVM: PPC: add trampolines for VFIO external API
> > KVM: PPC: Add support for IOMMU in-kernel handling
> > KVM: PPC: Add hugepage support for IOMMU in-kernel handling
> >
> > Documentation/virtual/kvm/api.txt | 26 +
> > .../virtual/kvm/devices/spapr_tce_iommu.txt | 37 ++
> > arch/powerpc/include/asm/iommu.h | 18 +-
> > arch/powerpc/include/asm/kvm_host.h | 38 ++
> > arch/powerpc/include/asm/kvm_ppc.h | 16 +-
> > arch/powerpc/include/asm/machdep.h | 12 +
> > arch/powerpc/include/asm/pgtable-ppc64.h | 2 +
> > arch/powerpc/include/uapi/asm/kvm.h | 8 +
> > arch/powerpc/kernel/iommu.c | 243 +++++----
> > arch/powerpc/kvm/Kconfig | 1 +
> > arch/powerpc/kvm/book3s_64_vio.c | 597 ++++++++++++++++++++-
> > arch/powerpc/kvm/book3s_64_vio_hv.c | 408 +++++++++++++-
> > arch/powerpc/kvm/book3s_hv.c | 42 +-
> > arch/powerpc/kvm/book3s_hv_rmhandlers.S | 8 +-
> > arch/powerpc/kvm/book3s_pr_papr.c | 35 ++
> > arch/powerpc/kvm/powerpc.c | 4 +
> > arch/powerpc/mm/init_64.c | 50 +-
> > arch/powerpc/platforms/powernv/pci-ioda.c | 57 +-
> > arch/powerpc/platforms/powernv/pci-p5ioc2.c | 2 +-
> > arch/powerpc/platforms/powernv/pci.c | 75 ++-
> > arch/powerpc/platforms/powernv/pci.h | 3 +-
> > arch/powerpc/platforms/pseries/iommu.c | 8 +-
> > include/linux/hashtable.h | 15 +
> > include/linux/kvm_host.h | 1 +
> > include/linux/mm.h | 14 +
> > include/linux/page-flags.h | 4 +-
> > include/uapi/linux/kvm.h | 3 +
> > virt/kvm/kvm_main.c | 5 +
> > 28 files changed, 1564 insertions(+), 168 deletions(-)
> > create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> >
>
>
> --
> Alexey

--
Gleb.

2013-09-01 11:27:58

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v9 04/13] KVM: PPC: reserve a capability and KVM device type for realmode VFIO

On Wed, Aug 28, 2013 at 06:37:41PM +1000, Alexey Kardashevskiy wrote:
> This reserves a capability number for upcoming support
> of VFIO-IOMMU DMA operations in real mode.
>
> This reserves a number for a new "SPAPR TCE IOMMU" KVM device
> which is going to manage lifetime of SPAPR TCE IOMMU object.
>
> This defines an attribute of the "SPAPR TCE IOMMU" KVM device
> which is going to be used for initialization.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>
> ---
> Changes:
> v9:
> * KVM ioctl is replaced with "SPAPR TCE IOMMU" KVM device type with
> KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE attribute
>
> 2013/08/15:
> * fixed mistype in comments
> * fixed commit message which says what uses ioctls 0xad and 0xae
>
> 2013/07/16:
> * changed the number
>
> 2013/07/11:
> * changed order in a file, added comment about a gap in ioctl number
> ---
> arch/powerpc/include/uapi/asm/kvm.h | 8 ++++++++
> include/uapi/linux/kvm.h | 2 ++
> 2 files changed, 10 insertions(+)
>
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index 0fb1a6e..c1ae1e5 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -511,4 +511,12 @@ struct kvm_get_htab_header {
> #define KVM_XICS_MASKED (1ULL << 41)
> #define KVM_XICS_PENDING (1ULL << 42)
>
> +/* SPAPR TCE IOMMU device specification */
> +struct kvm_create_spapr_tce_iommu_linkage {
> + __u64 liobn;
> + __u32 fd;
> + __u32 flags;
> +};
> +#define KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE 0
> +
> #endif /* __LINUX_KVM_POWERPC_H */
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 99c2533..9d20630 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
> #define KVM_CAP_IRQ_XICS 92
> #define KVM_CAP_ARM_EL1_32BIT 93
> #define KVM_CAP_SPAPR_MULTITCE 94
> +#define KVM_CAP_SPAPR_TCE_IOMMU 95
>
You do not need capability to check for a device support. Device API
supports checking for that with KVM_CREATE_DEVICE_TEST flag to
KVM_CREATE_DEVICE ioctl.

> #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -843,6 +844,7 @@ struct kvm_device_attr {
> #define KVM_DEV_TYPE_FSL_MPIC_20 1
> #define KVM_DEV_TYPE_FSL_MPIC_42 2
> #define KVM_DEV_TYPE_XICS 3
> +#define KVM_DEV_TYPE_SPAPR_TCE_IOMMU 4
>
> /*
> * ioctls for VM fds
> --
> 1.8.4.rc4

--
Gleb.

2013-09-01 11:39:34

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v9 04/13] KVM: PPC: reserve a capability and KVM device type for realmode VFIO

On 09/01/2013 09:27 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 06:37:41PM +1000, Alexey Kardashevskiy wrote:
>> This reserves a capability number for upcoming support
>> of VFIO-IOMMU DMA operations in real mode.
>>
>> This reserves a number for a new "SPAPR TCE IOMMU" KVM device
>> which is going to manage lifetime of SPAPR TCE IOMMU object.
>>
>> This defines an attribute of the "SPAPR TCE IOMMU" KVM device
>> which is going to be used for initialization.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>
>> ---
>> Changes:
>> v9:
>> * KVM ioctl is replaced with "SPAPR TCE IOMMU" KVM device type with
>> KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE attribute
>>
>> 2013/08/15:
>> * fixed mistype in comments
>> * fixed commit message which says what uses ioctls 0xad and 0xae
>>
>> 2013/07/16:
>> * changed the number
>>
>> 2013/07/11:
>> * changed order in a file, added comment about a gap in ioctl number
>> ---
>> arch/powerpc/include/uapi/asm/kvm.h | 8 ++++++++
>> include/uapi/linux/kvm.h | 2 ++
>> 2 files changed, 10 insertions(+)
>>
>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>> index 0fb1a6e..c1ae1e5 100644
>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>> @@ -511,4 +511,12 @@ struct kvm_get_htab_header {
>> #define KVM_XICS_MASKED (1ULL << 41)
>> #define KVM_XICS_PENDING (1ULL << 42)
>>
>> +/* SPAPR TCE IOMMU device specification */
>> +struct kvm_create_spapr_tce_iommu_linkage {
>> + __u64 liobn;
>> + __u32 fd;
>> + __u32 flags;
>> +};
>> +#define KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE 0
>> +
>> #endif /* __LINUX_KVM_POWERPC_H */
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 99c2533..9d20630 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
>> #define KVM_CAP_IRQ_XICS 92
>> #define KVM_CAP_ARM_EL1_32BIT 93
>> #define KVM_CAP_SPAPR_MULTITCE 94
>> +#define KVM_CAP_SPAPR_TCE_IOMMU 95
>>
> You do not need capability to check for a device support. Device API
> supports checking for that with KVM_CREATE_DEVICE_TEST flag to
> KVM_CREATE_DEVICE ioctl.

Hm. I copied my device from KVM_DEV_TYPE_XICS and there is a capability for
it - KVM_CAP_IRQ_XICS. Do We not need both capabilities? Or XICS is special
in some way but SPAPR TCE IOMMU is not? I am confused, sorry.


>
>> #ifdef KVM_CAP_IRQ_ROUTING
>>
>> @@ -843,6 +844,7 @@ struct kvm_device_attr {
>> #define KVM_DEV_TYPE_FSL_MPIC_20 1
>> #define KVM_DEV_TYPE_FSL_MPIC_42 2
>> #define KVM_DEV_TYPE_XICS 3
>> +#define KVM_DEV_TYPE_SPAPR_TCE_IOMMU 4
>>
>> /*
>> * ioctls for VM fds
>> --
>> 1.8.4.rc4
>
> --
> Gleb.
>


--
Alexey

2013-09-01 12:04:16

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v9 04/13] KVM: PPC: reserve a capability and KVM device type for realmode VFIO

On Sun, Sep 01, 2013 at 09:39:23PM +1000, Alexey Kardashevskiy wrote:
> On 09/01/2013 09:27 PM, Gleb Natapov wrote:
> > On Wed, Aug 28, 2013 at 06:37:41PM +1000, Alexey Kardashevskiy wrote:
> >> This reserves a capability number for upcoming support
> >> of VFIO-IOMMU DMA operations in real mode.
> >>
> >> This reserves a number for a new "SPAPR TCE IOMMU" KVM device
> >> which is going to manage lifetime of SPAPR TCE IOMMU object.
> >>
> >> This defines an attribute of the "SPAPR TCE IOMMU" KVM device
> >> which is going to be used for initialization.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>
> >> ---
> >> Changes:
> >> v9:
> >> * KVM ioctl is replaced with "SPAPR TCE IOMMU" KVM device type with
> >> KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE attribute
> >>
> >> 2013/08/15:
> >> * fixed mistype in comments
> >> * fixed commit message which says what uses ioctls 0xad and 0xae
> >>
> >> 2013/07/16:
> >> * changed the number
> >>
> >> 2013/07/11:
> >> * changed order in a file, added comment about a gap in ioctl number
> >> ---
> >> arch/powerpc/include/uapi/asm/kvm.h | 8 ++++++++
> >> include/uapi/linux/kvm.h | 2 ++
> >> 2 files changed, 10 insertions(+)
> >>
> >> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> >> index 0fb1a6e..c1ae1e5 100644
> >> --- a/arch/powerpc/include/uapi/asm/kvm.h
> >> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> >> @@ -511,4 +511,12 @@ struct kvm_get_htab_header {
> >> #define KVM_XICS_MASKED (1ULL << 41)
> >> #define KVM_XICS_PENDING (1ULL << 42)
> >>
> >> +/* SPAPR TCE IOMMU device specification */
> >> +struct kvm_create_spapr_tce_iommu_linkage {
> >> + __u64 liobn;
> >> + __u32 fd;
> >> + __u32 flags;
> >> +};
> >> +#define KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE 0
> >> +
> >> #endif /* __LINUX_KVM_POWERPC_H */
> >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> index 99c2533..9d20630 100644
> >> --- a/include/uapi/linux/kvm.h
> >> +++ b/include/uapi/linux/kvm.h
> >> @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
> >> #define KVM_CAP_IRQ_XICS 92
> >> #define KVM_CAP_ARM_EL1_32BIT 93
> >> #define KVM_CAP_SPAPR_MULTITCE 94
> >> +#define KVM_CAP_SPAPR_TCE_IOMMU 95
> >>
> > You do not need capability to check for a device support. Device API
> > supports checking for that with KVM_CREATE_DEVICE_TEST flag to
> > KVM_CREATE_DEVICE ioctl.
>
> Hm. I copied my device from KVM_DEV_TYPE_XICS and there is a capability for
> it - KVM_CAP_IRQ_XICS. Do We not need both capabilities? Or XICS is special
> in some way but SPAPR TCE IOMMU is not? I am confused, sorry.
>
>
Looking at it KVM_CAP_IRQ_XICS/KVM_CAP_IRQ_MPIC are not used to detect
device existence, but to link a device to vcpu. KVM_CAP_IRQ_MPIC was
introduced separately from MPIC device code.

--
Gleb.

2013-09-01 12:06:25

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

On Wed, Aug 28, 2013 at 06:50:41PM +1000, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted an IOMMU TCE table without passing
> them to user space which saves time on switching to user space and back.
>
> Both real and virtual modes are supported. The kernel tries to
> handle a TCE request in the real mode, if fails it passes the request
> to the virtual mode to complete the operation. If it a virtual mode
> handler fails, the request is passed to user space.
>
> The first user of this is VFIO on POWER. Trampolines to the VFIO external
> user API functions are required for this patch.
>
> This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
> number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling
> of map/unmap requests. The device supports a single attribute which is
> a struct with LIOBN and IOMMU fd. When the attribute is set, the device
> establishes the connection between KVM and VFIO.
>
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>
> Signed-off-by: Paul Mackerras <[email protected]>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>
> ---
>
> Changes:
> v9:
> * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
> KVM device
> * release_spapr_tce_table() is not shared between different TCE types
> * reduced the patch size by moving VFIO external API
> trampolines to separate patche
> * moved documentation from Documentation/virtual/kvm/api.txt to
> Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>
> v8:
> * fixed warnings from check_patch.pl
>
> 2013/07/11:
> * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
> for KVM_BOOK3S_64
> * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much sense
> for this here but the next patch for hugepages support will use it more.
>
> 2013/07/06:
> * added realmode arch_spin_lock to protect TCE table from races
> in real and virtual modes
> * POWERPC IOMMU API is changed to support real mode
> * iommu_take_ownership and iommu_release_ownership are protected by
> iommu_table's locks
> * VFIO external user API use rewritten
> * multiple small fixes
>
> 2013/06/27:
> * tce_list page is referenced now in order to protect it from accident
> invalidation during H_PUT_TCE_INDIRECT execution
> * added use of the external user VFIO API
>
> 2013/06/05:
> * changed capability number
> * changed ioctl number
> * update the doc article number
>
> 2013/05/20:
> * removed get_user() from real mode handlers
> * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
> translated TCEs, tries realmode_get_page() on those and if it fails, it
> passes control over the virtual mode handler which tries to finish
> the request handling
> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
> on a page
> * The only reason to pass the request to user mode now is when the user mode
> did not register TCE table in the kernel, in all other cases the virtual mode
> handler is expected to do the job
> ---
> .../virtual/kvm/devices/spapr_tce_iommu.txt | 37 +++
> arch/powerpc/include/asm/kvm_host.h | 4 +
> arch/powerpc/kvm/book3s_64_vio.c | 310 ++++++++++++++++++++-
> arch/powerpc/kvm/book3s_64_vio_hv.c | 122 ++++++++
> arch/powerpc/kvm/powerpc.c | 1 +
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 5 +
> 7 files changed, 477 insertions(+), 3 deletions(-)
> create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>
> diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> new file mode 100644
> index 0000000..4bc8fc3
> --- /dev/null
> +++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> @@ -0,0 +1,37 @@
> +SPAPR TCE IOMMU device
> +
> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
> +Architectures: powerpc
> +
> +Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU
> +
> +Groups:
> + KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE
> + Attributes: single attribute with pair { LIOBN, IOMMU fd}
> +
> +This is completely made up device which provides API to link
> +logical bus number (LIOBN) and IOMMU group. The user space has
> +to create a new SPAPR TCE IOMMU device per a logical bus.
> +
Why not have one device that can handle multimple links?

> +LIOBN is a PCI bus identifier from PPC64-server (sPAPR) DMA hypercalls
> +(H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE).
> +IOMMU group is a minimal isolated device set which can be passed to
> +the user space via VFIO.
> +
> +Right after creation the device is in uninitlized state and requires
> +a KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE attribute to be set.
> +The attribute contains liobn, IOMMU fd and flags:
> +
> +struct kvm_create_spapr_tce_iommu_linkage {
> + __u64 liobn;
> + __u32 fd;
> + __u32 flags;
> +};
> +
> +The user space creates the SPAPR TCE IOMMU device, obtains
> +an IOMMU fd via VFIO ABI and sets the attribute to the SPAPR TCE IOMMU
> +device. At the moment of setting the attribute, the SPAPR TCE IOMMU
> +device links LIOBN to IOMMU group and makes necessary steps
> +to make sure that VFIO group will not disappear before KVM destroys.
> +
> +The kernel advertises this feature via KVM_CAP_SPAPR_TCE_IOMMU capability.
[skip]

> +
> +static int kvmppc_spapr_tce_iommu_get_attr(struct kvm_device *dev,
> + struct kvm_device_attr *attr)
> +{
> + struct kvmppc_spapr_tce_table *tt = dev->private;
> + void __user *argp = (void __user *) attr->addr;
> +
> + switch (attr->group) {
> + case KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE:
> + if (!tt)
> + return -EFAULT;
Does not look like correct error code to return here. EINVAL may be?

> + if (copy_to_user(&tt->link, argp, sizeof(tt->link)))
> + return -EFAULT;
> + return 0;
> + }
> + return -ENXIO;
> +}
> +

--
Gleb.

2013-09-02 03:14:39

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

On 09/01/2013 10:06 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 06:50:41PM +1000, Alexey Kardashevskiy wrote:
>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>> and H_STUFF_TCE requests targeted an IOMMU TCE table without passing
>> them to user space which saves time on switching to user space and back.
>>
>> Both real and virtual modes are supported. The kernel tries to
>> handle a TCE request in the real mode, if fails it passes the request
>> to the virtual mode to complete the operation. If it a virtual mode
>> handler fails, the request is passed to user space.
>>
>> The first user of this is VFIO on POWER. Trampolines to the VFIO external
>> user API functions are required for this patch.
>>
>> This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
>> number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling
>> of map/unmap requests. The device supports a single attribute which is
>> a struct with LIOBN and IOMMU fd. When the attribute is set, the device
>> establishes the connection between KVM and VFIO.
>>
>> Tests show that this patch increases transmission speed from 220MB/s
>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>
>> Signed-off-by: Paul Mackerras <[email protected]>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>
>> ---
>>
>> Changes:
>> v9:
>> * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
>> KVM device
>> * release_spapr_tce_table() is not shared between different TCE types
>> * reduced the patch size by moving VFIO external API
>> trampolines to separate patche
>> * moved documentation from Documentation/virtual/kvm/api.txt to
>> Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>
>> v8:
>> * fixed warnings from check_patch.pl
>>
>> 2013/07/11:
>> * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
>> for KVM_BOOK3S_64
>> * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much sense
>> for this here but the next patch for hugepages support will use it more.
>>
>> 2013/07/06:
>> * added realmode arch_spin_lock to protect TCE table from races
>> in real and virtual modes
>> * POWERPC IOMMU API is changed to support real mode
>> * iommu_take_ownership and iommu_release_ownership are protected by
>> iommu_table's locks
>> * VFIO external user API use rewritten
>> * multiple small fixes
>>
>> 2013/06/27:
>> * tce_list page is referenced now in order to protect it from accident
>> invalidation during H_PUT_TCE_INDIRECT execution
>> * added use of the external user VFIO API
>>
>> 2013/06/05:
>> * changed capability number
>> * changed ioctl number
>> * update the doc article number
>>
>> 2013/05/20:
>> * removed get_user() from real mode handlers
>> * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
>> translated TCEs, tries realmode_get_page() on those and if it fails, it
>> passes control over the virtual mode handler which tries to finish
>> the request handling
>> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
>> on a page
>> * The only reason to pass the request to user mode now is when the user mode
>> did not register TCE table in the kernel, in all other cases the virtual mode
>> handler is expected to do the job
>> ---
>> .../virtual/kvm/devices/spapr_tce_iommu.txt | 37 +++
>> arch/powerpc/include/asm/kvm_host.h | 4 +
>> arch/powerpc/kvm/book3s_64_vio.c | 310 ++++++++++++++++++++-
>> arch/powerpc/kvm/book3s_64_vio_hv.c | 122 ++++++++
>> arch/powerpc/kvm/powerpc.c | 1 +
>> include/linux/kvm_host.h | 1 +
>> virt/kvm/kvm_main.c | 5 +
>> 7 files changed, 477 insertions(+), 3 deletions(-)
>> create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>
>> diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>> new file mode 100644
>> index 0000000..4bc8fc3
>> --- /dev/null
>> +++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>> @@ -0,0 +1,37 @@
>> +SPAPR TCE IOMMU device
>> +
>> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
>> +Architectures: powerpc
>> +
>> +Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU
>> +
>> +Groups:
>> + KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE
>> + Attributes: single attribute with pair { LIOBN, IOMMU fd}
>> +
>> +This is completely made up device which provides API to link
>> +logical bus number (LIOBN) and IOMMU group. The user space has
>> +to create a new SPAPR TCE IOMMU device per a logical bus.
>> +
> Why not have one device that can handle multimple links?


I can do that. If I make it so, it won't even look as a device at all, just
some weird interface to KVM but ok. What bothers me is it is just a
question what I will have to do next. Because I can easily predict a
suggestion to move kvmppc_spapr_tce_table's (a links list) from
kvm->arch.spapr_tce_tables to that device but I cannot do that for obvious
compatibility reasons caused by the fact that the list is already used for
emulated devices (for the starter - they need mmap()).

Or supporting all IOMMU links (and leaving emulated stuff as is) in on
"device" is the last thing I have to do and then you'll ack the patch?



>> +LIOBN is a PCI bus identifier from PPC64-server (sPAPR) DMA hypercalls
>> +(H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE).
>> +IOMMU group is a minimal isolated device set which can be passed to
>> +the user space via VFIO.
>> +
>> +Right after creation the device is in uninitlized state and requires
>> +a KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE attribute to be set.
>> +The attribute contains liobn, IOMMU fd and flags:
>> +
>> +struct kvm_create_spapr_tce_iommu_linkage {
>> + __u64 liobn;
>> + __u32 fd;
>> + __u32 flags;
>> +};
>> +
>> +The user space creates the SPAPR TCE IOMMU device, obtains
>> +an IOMMU fd via VFIO ABI and sets the attribute to the SPAPR TCE IOMMU
>> +device. At the moment of setting the attribute, the SPAPR TCE IOMMU
>> +device links LIOBN to IOMMU group and makes necessary steps
>> +to make sure that VFIO group will not disappear before KVM destroys.
>> +
>> +The kernel advertises this feature via KVM_CAP_SPAPR_TCE_IOMMU capability.
> [skip]

Yes, I read the other comment. So roughly speaking I'll replace the
KVM_CAP_SPAPR_TCE_IOMMU check with the KVM_CAP_DEVICE_CTRL capability check
+ try to KVM_CREATE_DEVICE with the KVM_CREATE_DEVICE_TEST flag set, and we
are fine.


>> +
>> +static int kvmppc_spapr_tce_iommu_get_attr(struct kvm_device *dev,
>> + struct kvm_device_attr *attr)
>> +{
>> + struct kvmppc_spapr_tce_table *tt = dev->private;
>> + void __user *argp = (void __user *) attr->addr;
>> +
>> + switch (attr->group) {
>> + case KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE:
>> + if (!tt)
>> + return -EFAULT;
> Does not look like correct error code to return here. EINVAL may be?

Yep, I'll fix this. Thanks.


>> + if (copy_to_user(&tt->link, argp, sizeof(tt->link)))
>> + return -EFAULT;
>> + return 0;
>> + }
>> + return -ENXIO;
>> +}
>> +


--
Alexey

2013-09-03 10:53:38

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

On Mon, Sep 02, 2013 at 01:14:29PM +1000, Alexey Kardashevskiy wrote:
> On 09/01/2013 10:06 PM, Gleb Natapov wrote:
> > On Wed, Aug 28, 2013 at 06:50:41PM +1000, Alexey Kardashevskiy wrote:
> >> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >> and H_STUFF_TCE requests targeted an IOMMU TCE table without passing
> >> them to user space which saves time on switching to user space and back.
> >>
> >> Both real and virtual modes are supported. The kernel tries to
> >> handle a TCE request in the real mode, if fails it passes the request
> >> to the virtual mode to complete the operation. If it a virtual mode
> >> handler fails, the request is passed to user space.
> >>
> >> The first user of this is VFIO on POWER. Trampolines to the VFIO external
> >> user API functions are required for this patch.
> >>
> >> This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
> >> number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling
> >> of map/unmap requests. The device supports a single attribute which is
> >> a struct with LIOBN and IOMMU fd. When the attribute is set, the device
> >> establishes the connection between KVM and VFIO.
> >>
> >> Tests show that this patch increases transmission speed from 220MB/s
> >> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> >>
> >> Signed-off-by: Paul Mackerras <[email protected]>
> >> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>
> >> ---
> >>
> >> Changes:
> >> v9:
> >> * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
> >> KVM device
> >> * release_spapr_tce_table() is not shared between different TCE types
> >> * reduced the patch size by moving VFIO external API
> >> trampolines to separate patche
> >> * moved documentation from Documentation/virtual/kvm/api.txt to
> >> Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> >>
> >> v8:
> >> * fixed warnings from check_patch.pl
> >>
> >> 2013/07/11:
> >> * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
> >> for KVM_BOOK3S_64
> >> * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much sense
> >> for this here but the next patch for hugepages support will use it more.
> >>
> >> 2013/07/06:
> >> * added realmode arch_spin_lock to protect TCE table from races
> >> in real and virtual modes
> >> * POWERPC IOMMU API is changed to support real mode
> >> * iommu_take_ownership and iommu_release_ownership are protected by
> >> iommu_table's locks
> >> * VFIO external user API use rewritten
> >> * multiple small fixes
> >>
> >> 2013/06/27:
> >> * tce_list page is referenced now in order to protect it from accident
> >> invalidation during H_PUT_TCE_INDIRECT execution
> >> * added use of the external user VFIO API
> >>
> >> 2013/06/05:
> >> * changed capability number
> >> * changed ioctl number
> >> * update the doc article number
> >>
> >> 2013/05/20:
> >> * removed get_user() from real mode handlers
> >> * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
> >> translated TCEs, tries realmode_get_page() on those and if it fails, it
> >> passes control over the virtual mode handler which tries to finish
> >> the request handling
> >> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
> >> on a page
> >> * The only reason to pass the request to user mode now is when the user mode
> >> did not register TCE table in the kernel, in all other cases the virtual mode
> >> handler is expected to do the job
> >> ---
> >> .../virtual/kvm/devices/spapr_tce_iommu.txt | 37 +++
> >> arch/powerpc/include/asm/kvm_host.h | 4 +
> >> arch/powerpc/kvm/book3s_64_vio.c | 310 ++++++++++++++++++++-
> >> arch/powerpc/kvm/book3s_64_vio_hv.c | 122 ++++++++
> >> arch/powerpc/kvm/powerpc.c | 1 +
> >> include/linux/kvm_host.h | 1 +
> >> virt/kvm/kvm_main.c | 5 +
> >> 7 files changed, 477 insertions(+), 3 deletions(-)
> >> create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> >>
> >> diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> >> new file mode 100644
> >> index 0000000..4bc8fc3
> >> --- /dev/null
> >> +++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> >> @@ -0,0 +1,37 @@
> >> +SPAPR TCE IOMMU device
> >> +
> >> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
> >> +Architectures: powerpc
> >> +
> >> +Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU
> >> +
> >> +Groups:
> >> + KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE
> >> + Attributes: single attribute with pair { LIOBN, IOMMU fd}
> >> +
> >> +This is completely made up device which provides API to link
> >> +logical bus number (LIOBN) and IOMMU group. The user space has
> >> +to create a new SPAPR TCE IOMMU device per a logical bus.
> >> +
> > Why not have one device that can handle multimple links?
>
>
> I can do that. If I make it so, it won't even look as a device at all, just
> some weird interface to KVM but ok. What bothers me is it is just a
May be I do not understand usage pattern here. Why do you feel that device
that can handle multiple links is worse than device per link? How many logical
buses is there usually? How often they created/destroyed? I am not insisting
on the change, just trying to understand why you do not like it.

> question what I will have to do next. Because I can easily predict a
> suggestion to move kvmppc_spapr_tce_table's (a links list) from
> kvm->arch.spapr_tce_tables to that device but I cannot do that for obvious
> compatibility reasons caused by the fact that the list is already used for
> emulated devices (for the starter - they need mmap()).
>
> Or supporting all IOMMU links (and leaving emulated stuff as is) in on
> "device" is the last thing I have to do and then you'll ack the patch?
>
I am concerned more about API here. Internal implementation details I
leave to powerpc experts :)

>
>
> >> +LIOBN is a PCI bus identifier from PPC64-server (sPAPR) DMA hypercalls
> >> +(H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE).
> >> +IOMMU group is a minimal isolated device set which can be passed to
> >> +the user space via VFIO.
> >> +
> >> +Right after creation the device is in uninitlized state and requires
> >> +a KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE attribute to be set.
> >> +The attribute contains liobn, IOMMU fd and flags:
> >> +
> >> +struct kvm_create_spapr_tce_iommu_linkage {
> >> + __u64 liobn;
> >> + __u32 fd;
> >> + __u32 flags;
> >> +};
> >> +
> >> +The user space creates the SPAPR TCE IOMMU device, obtains
> >> +an IOMMU fd via VFIO ABI and sets the attribute to the SPAPR TCE IOMMU
> >> +device. At the moment of setting the attribute, the SPAPR TCE IOMMU
> >> +device links LIOBN to IOMMU group and makes necessary steps
> >> +to make sure that VFIO group will not disappear before KVM destroys.
> >> +
> >> +The kernel advertises this feature via KVM_CAP_SPAPR_TCE_IOMMU capability.
> > [skip]
>
> Yes, I read the other comment. So roughly speaking I'll replace the
> KVM_CAP_SPAPR_TCE_IOMMU check with the KVM_CAP_DEVICE_CTRL capability check
> + try to KVM_CREATE_DEVICE with the KVM_CREATE_DEVICE_TEST flag set, and we
> are fine.
Yes, but KVM_CREATE_DEVICE_TEST does not create device, only checks if
device type is supported.

--
Gleb.

2013-09-03 16:01:49

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

On 09/03/2013 08:53 PM, Gleb Natapov wrote:
> On Mon, Sep 02, 2013 at 01:14:29PM +1000, Alexey Kardashevskiy wrote:
>> On 09/01/2013 10:06 PM, Gleb Natapov wrote:
>>> On Wed, Aug 28, 2013 at 06:50:41PM +1000, Alexey Kardashevskiy wrote:
>>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>>>> and H_STUFF_TCE requests targeted an IOMMU TCE table without passing
>>>> them to user space which saves time on switching to user space and back.
>>>>
>>>> Both real and virtual modes are supported. The kernel tries to
>>>> handle a TCE request in the real mode, if fails it passes the request
>>>> to the virtual mode to complete the operation. If it a virtual mode
>>>> handler fails, the request is passed to user space.
>>>>
>>>> The first user of this is VFIO on POWER. Trampolines to the VFIO external
>>>> user API functions are required for this patch.
>>>>
>>>> This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
>>>> number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling
>>>> of map/unmap requests. The device supports a single attribute which is
>>>> a struct with LIOBN and IOMMU fd. When the attribute is set, the device
>>>> establishes the connection between KVM and VFIO.
>>>>
>>>> Tests show that this patch increases transmission speed from 220MB/s
>>>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>>>
>>>> Signed-off-by: Paul Mackerras <[email protected]>
>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>>
>>>> ---
>>>>
>>>> Changes:
>>>> v9:
>>>> * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
>>>> KVM device
>>>> * release_spapr_tce_table() is not shared between different TCE types
>>>> * reduced the patch size by moving VFIO external API
>>>> trampolines to separate patche
>>>> * moved documentation from Documentation/virtual/kvm/api.txt to
>>>> Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>>
>>>> v8:
>>>> * fixed warnings from check_patch.pl
>>>>
>>>> 2013/07/11:
>>>> * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
>>>> for KVM_BOOK3S_64
>>>> * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much sense
>>>> for this here but the next patch for hugepages support will use it more.
>>>>
>>>> 2013/07/06:
>>>> * added realmode arch_spin_lock to protect TCE table from races
>>>> in real and virtual modes
>>>> * POWERPC IOMMU API is changed to support real mode
>>>> * iommu_take_ownership and iommu_release_ownership are protected by
>>>> iommu_table's locks
>>>> * VFIO external user API use rewritten
>>>> * multiple small fixes
>>>>
>>>> 2013/06/27:
>>>> * tce_list page is referenced now in order to protect it from accident
>>>> invalidation during H_PUT_TCE_INDIRECT execution
>>>> * added use of the external user VFIO API
>>>>
>>>> 2013/06/05:
>>>> * changed capability number
>>>> * changed ioctl number
>>>> * update the doc article number
>>>>
>>>> 2013/05/20:
>>>> * removed get_user() from real mode handlers
>>>> * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
>>>> translated TCEs, tries realmode_get_page() on those and if it fails, it
>>>> passes control over the virtual mode handler which tries to finish
>>>> the request handling
>>>> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
>>>> on a page
>>>> * The only reason to pass the request to user mode now is when the user mode
>>>> did not register TCE table in the kernel, in all other cases the virtual mode
>>>> handler is expected to do the job
>>>> ---
>>>> .../virtual/kvm/devices/spapr_tce_iommu.txt | 37 +++
>>>> arch/powerpc/include/asm/kvm_host.h | 4 +
>>>> arch/powerpc/kvm/book3s_64_vio.c | 310 ++++++++++++++++++++-
>>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 122 ++++++++
>>>> arch/powerpc/kvm/powerpc.c | 1 +
>>>> include/linux/kvm_host.h | 1 +
>>>> virt/kvm/kvm_main.c | 5 +
>>>> 7 files changed, 477 insertions(+), 3 deletions(-)
>>>> create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>>
>>>> diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>> new file mode 100644
>>>> index 0000000..4bc8fc3
>>>> --- /dev/null
>>>> +++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>> @@ -0,0 +1,37 @@
>>>> +SPAPR TCE IOMMU device
>>>> +
>>>> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
>>>> +Architectures: powerpc
>>>> +
>>>> +Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU
>>>> +
>>>> +Groups:
>>>> + KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE
>>>> + Attributes: single attribute with pair { LIOBN, IOMMU fd}
>>>> +
>>>> +This is completely made up device which provides API to link
>>>> +logical bus number (LIOBN) and IOMMU group. The user space has
>>>> +to create a new SPAPR TCE IOMMU device per a logical bus.
>>>> +
>>> Why not have one device that can handle multimple links?
>>
>>
>> I can do that. If I make it so, it won't even look as a device at all, just
>> some weird interface to KVM but ok. What bothers me is it is just a
> May be I do not understand usage pattern here. Why do you feel that device
> that can handle multiple links is worse than device per link? How many logical
> buses is there usually? How often they created/destroyed? I am not insisting
> on the change, just trying to understand why you do not like it.


Is it usually one PCI host bus adapter per IOMMU group which is usually
one PCI card or 2-3 cards if it is a legacy PCI-X, and they are created
when QEMU-KVM starts. Not many. And they live till KVM ends.

My point is why would I want to put all links to one device? It all is just
a matter of taste and nothing more. Or I am missing something but I do not
see what. If it is all about making thing to be kosher/halal/orthodox, then
I have more stuff to do, like reworking the emulated TCEs. But if is it for
(I do not know, just guessing) performance or something like that - then
I'll fix it, I just need to know what I am fixing.



>> question what I will have to do next. Because I can easily predict a
>> suggestion to move kvmppc_spapr_tce_table's (a links list) from
>> kvm->arch.spapr_tce_tables to that device but I cannot do that for obvious
>> compatibility reasons caused by the fact that the list is already used for
>> emulated devices (for the starter - they need mmap()).
>>
>> Or supporting all IOMMU links (and leaving emulated stuff as is) in on
>> "device" is the last thing I have to do and then you'll ack the patch?
>>
> I am concerned more about API here. Internal implementation details I
> leave to powerpc experts :)


The Expert (Ben) wants capabilities number and API to get fixed in KVM tree :)


>
>>
>>
>>>> +LIOBN is a PCI bus identifier from PPC64-server (sPAPR) DMA hypercalls
>>>> +(H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE).
>>>> +IOMMU group is a minimal isolated device set which can be passed to
>>>> +the user space via VFIO.
>>>> +
>>>> +Right after creation the device is in uninitlized state and requires
>>>> +a KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE attribute to be set.
>>>> +The attribute contains liobn, IOMMU fd and flags:
>>>> +
>>>> +struct kvm_create_spapr_tce_iommu_linkage {
>>>> + __u64 liobn;
>>>> + __u32 fd;
>>>> + __u32 flags;
>>>> +};
>>>> +
>>>> +The user space creates the SPAPR TCE IOMMU device, obtains
>>>> +an IOMMU fd via VFIO ABI and sets the attribute to the SPAPR TCE IOMMU
>>>> +device. At the moment of setting the attribute, the SPAPR TCE IOMMU
>>>> +device links LIOBN to IOMMU group and makes necessary steps
>>>> +to make sure that VFIO group will not disappear before KVM destroys.
>>>> +
>>>> +The kernel advertises this feature via KVM_CAP_SPAPR_TCE_IOMMU capability.
>>> [skip]
>>
>> Yes, I read the other comment. So roughly speaking I'll replace the
>> KVM_CAP_SPAPR_TCE_IOMMU check with the KVM_CAP_DEVICE_CTRL capability check
>> + try to KVM_CREATE_DEVICE with the KVM_CREATE_DEVICE_TEST flag set, and we
>> are fine.
> Yes, but KVM_CREATE_DEVICE_TEST does not create device, only checks if
> device type is supported.

Sure.



--
Alexey

2013-09-05 04:05:43

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

On Tue, 2013-09-03 at 13:53 +0300, Gleb Natapov wrote:
> > Or supporting all IOMMU links (and leaving emulated stuff as is) in on
> > "device" is the last thing I have to do and then you'll ack the patch?
> >
> I am concerned more about API here. Internal implementation details I
> leave to powerpc experts :)

So Gleb, I want to step in for a bit here.

While I understand that the new KVM device API is all nice and shiny and that this
whole thing should probably have been KVM devices in the first place (had they
existed or had we been told back then), the point is, the API for handling
HW IOMMUs that Alexey is trying to add is an extension of an existing mechanism
used for emulated IOMMUs.

The internal data structure is shared, and fundamentally, by forcing him to
use that new KVM device for the "new stuff", we create a oddball API with
an ioctl for one type of iommu and a KVM device for the other, which makes
the implementation a complete mess in the kernel (and you should care :-)

So for something completely new, I would tend to agree with you. However, I
still think that for this specific case, we should just plonk-in the original
ioctl proposed by Alexey and be done with it.

Cheers,
Ben.

2013-09-05 18:10:32

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

On Wed, Sep 04, 2013 at 02:01:28AM +1000, Alexey Kardashevskiy wrote:
> On 09/03/2013 08:53 PM, Gleb Natapov wrote:
> > On Mon, Sep 02, 2013 at 01:14:29PM +1000, Alexey Kardashevskiy wrote:
> >> On 09/01/2013 10:06 PM, Gleb Natapov wrote:
> >>> On Wed, Aug 28, 2013 at 06:50:41PM +1000, Alexey Kardashevskiy wrote:
> >>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >>>> and H_STUFF_TCE requests targeted an IOMMU TCE table without passing
> >>>> them to user space which saves time on switching to user space and back.
> >>>>
> >>>> Both real and virtual modes are supported. The kernel tries to
> >>>> handle a TCE request in the real mode, if fails it passes the request
> >>>> to the virtual mode to complete the operation. If it a virtual mode
> >>>> handler fails, the request is passed to user space.
> >>>>
> >>>> The first user of this is VFIO on POWER. Trampolines to the VFIO external
> >>>> user API functions are required for this patch.
> >>>>
> >>>> This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
> >>>> number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling
> >>>> of map/unmap requests. The device supports a single attribute which is
> >>>> a struct with LIOBN and IOMMU fd. When the attribute is set, the device
> >>>> establishes the connection between KVM and VFIO.
> >>>>
> >>>> Tests show that this patch increases transmission speed from 220MB/s
> >>>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> >>>>
> >>>> Signed-off-by: Paul Mackerras <[email protected]>
> >>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>>>
> >>>> ---
> >>>>
> >>>> Changes:
> >>>> v9:
> >>>> * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
> >>>> KVM device
> >>>> * release_spapr_tce_table() is not shared between different TCE types
> >>>> * reduced the patch size by moving VFIO external API
> >>>> trampolines to separate patche
> >>>> * moved documentation from Documentation/virtual/kvm/api.txt to
> >>>> Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> >>>>
> >>>> v8:
> >>>> * fixed warnings from check_patch.pl
> >>>>
> >>>> 2013/07/11:
> >>>> * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
> >>>> for KVM_BOOK3S_64
> >>>> * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much sense
> >>>> for this here but the next patch for hugepages support will use it more.
> >>>>
> >>>> 2013/07/06:
> >>>> * added realmode arch_spin_lock to protect TCE table from races
> >>>> in real and virtual modes
> >>>> * POWERPC IOMMU API is changed to support real mode
> >>>> * iommu_take_ownership and iommu_release_ownership are protected by
> >>>> iommu_table's locks
> >>>> * VFIO external user API use rewritten
> >>>> * multiple small fixes
> >>>>
> >>>> 2013/06/27:
> >>>> * tce_list page is referenced now in order to protect it from accident
> >>>> invalidation during H_PUT_TCE_INDIRECT execution
> >>>> * added use of the external user VFIO API
> >>>>
> >>>> 2013/06/05:
> >>>> * changed capability number
> >>>> * changed ioctl number
> >>>> * update the doc article number
> >>>>
> >>>> 2013/05/20:
> >>>> * removed get_user() from real mode handlers
> >>>> * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
> >>>> translated TCEs, tries realmode_get_page() on those and if it fails, it
> >>>> passes control over the virtual mode handler which tries to finish
> >>>> the request handling
> >>>> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
> >>>> on a page
> >>>> * The only reason to pass the request to user mode now is when the user mode
> >>>> did not register TCE table in the kernel, in all other cases the virtual mode
> >>>> handler is expected to do the job
> >>>> ---
> >>>> .../virtual/kvm/devices/spapr_tce_iommu.txt | 37 +++
> >>>> arch/powerpc/include/asm/kvm_host.h | 4 +
> >>>> arch/powerpc/kvm/book3s_64_vio.c | 310 ++++++++++++++++++++-
> >>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 122 ++++++++
> >>>> arch/powerpc/kvm/powerpc.c | 1 +
> >>>> include/linux/kvm_host.h | 1 +
> >>>> virt/kvm/kvm_main.c | 5 +
> >>>> 7 files changed, 477 insertions(+), 3 deletions(-)
> >>>> create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> >>>>
> >>>> diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> >>>> new file mode 100644
> >>>> index 0000000..4bc8fc3
> >>>> --- /dev/null
> >>>> +++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> >>>> @@ -0,0 +1,37 @@
> >>>> +SPAPR TCE IOMMU device
> >>>> +
> >>>> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
> >>>> +Architectures: powerpc
> >>>> +
> >>>> +Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU
> >>>> +
> >>>> +Groups:
> >>>> + KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE
> >>>> + Attributes: single attribute with pair { LIOBN, IOMMU fd}
> >>>> +
> >>>> +This is completely made up device which provides API to link
> >>>> +logical bus number (LIOBN) and IOMMU group. The user space has
> >>>> +to create a new SPAPR TCE IOMMU device per a logical bus.
> >>>> +
> >>> Why not have one device that can handle multimple links?
> >>
> >>
> >> I can do that. If I make it so, it won't even look as a device at all, just
> >> some weird interface to KVM but ok. What bothers me is it is just a
> > May be I do not understand usage pattern here. Why do you feel that device
> > that can handle multiple links is worse than device per link? How many logical
> > buses is there usually? How often they created/destroyed? I am not insisting
> > on the change, just trying to understand why you do not like it.
>
>
> Is it usually one PCI host bus adapter per IOMMU group which is usually
> one PCI card or 2-3 cards if it is a legacy PCI-X, and they are created
> when QEMU-KVM starts. Not many. And they live till KVM ends.
>
> My point is why would I want to put all links to one device? It all is just
> a matter of taste and nothing more. Or I am missing something but I do not
> see what. If it is all about making thing to be kosher/halal/orthodox, then
> I have more stuff to do, like reworking the emulated TCEs. But if is it for
> (I do not know, just guessing) performance or something like that - then
> I'll fix it, I just need to know what I am fixing.
>
Each device creates an fd, if you can have a lot of them eventually this
will be a bottleneck. You are saying this is not the case, so lets go
with proposed interface.

--
Gleb.

2013-09-05 23:38:31

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

On 09/06/2013 04:10 AM, Gleb Natapov wrote:
> On Wed, Sep 04, 2013 at 02:01:28AM +1000, Alexey Kardashevskiy wrote:
>> On 09/03/2013 08:53 PM, Gleb Natapov wrote:
>>> On Mon, Sep 02, 2013 at 01:14:29PM +1000, Alexey Kardashevskiy wrote:
>>>> On 09/01/2013 10:06 PM, Gleb Natapov wrote:
>>>>> On Wed, Aug 28, 2013 at 06:50:41PM +1000, Alexey Kardashevskiy wrote:
>>>>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>>>>>> and H_STUFF_TCE requests targeted an IOMMU TCE table without passing
>>>>>> them to user space which saves time on switching to user space and back.
>>>>>>
>>>>>> Both real and virtual modes are supported. The kernel tries to
>>>>>> handle a TCE request in the real mode, if fails it passes the request
>>>>>> to the virtual mode to complete the operation. If it a virtual mode
>>>>>> handler fails, the request is passed to user space.
>>>>>>
>>>>>> The first user of this is VFIO on POWER. Trampolines to the VFIO external
>>>>>> user API functions are required for this patch.
>>>>>>
>>>>>> This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
>>>>>> number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling
>>>>>> of map/unmap requests. The device supports a single attribute which is
>>>>>> a struct with LIOBN and IOMMU fd. When the attribute is set, the device
>>>>>> establishes the connection between KVM and VFIO.
>>>>>>
>>>>>> Tests show that this patch increases transmission speed from 220MB/s
>>>>>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>>>>>
>>>>>> Signed-off-by: Paul Mackerras <[email protected]>
>>>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> Changes:
>>>>>> v9:
>>>>>> * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
>>>>>> KVM device
>>>>>> * release_spapr_tce_table() is not shared between different TCE types
>>>>>> * reduced the patch size by moving VFIO external API
>>>>>> trampolines to separate patche
>>>>>> * moved documentation from Documentation/virtual/kvm/api.txt to
>>>>>> Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>>>>
>>>>>> v8:
>>>>>> * fixed warnings from check_patch.pl
>>>>>>
>>>>>> 2013/07/11:
>>>>>> * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
>>>>>> for KVM_BOOK3S_64
>>>>>> * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much sense
>>>>>> for this here but the next patch for hugepages support will use it more.
>>>>>>
>>>>>> 2013/07/06:
>>>>>> * added realmode arch_spin_lock to protect TCE table from races
>>>>>> in real and virtual modes
>>>>>> * POWERPC IOMMU API is changed to support real mode
>>>>>> * iommu_take_ownership and iommu_release_ownership are protected by
>>>>>> iommu_table's locks
>>>>>> * VFIO external user API use rewritten
>>>>>> * multiple small fixes
>>>>>>
>>>>>> 2013/06/27:
>>>>>> * tce_list page is referenced now in order to protect it from accident
>>>>>> invalidation during H_PUT_TCE_INDIRECT execution
>>>>>> * added use of the external user VFIO API
>>>>>>
>>>>>> 2013/06/05:
>>>>>> * changed capability number
>>>>>> * changed ioctl number
>>>>>> * update the doc article number
>>>>>>
>>>>>> 2013/05/20:
>>>>>> * removed get_user() from real mode handlers
>>>>>> * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
>>>>>> translated TCEs, tries realmode_get_page() on those and if it fails, it
>>>>>> passes control over the virtual mode handler which tries to finish
>>>>>> the request handling
>>>>>> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
>>>>>> on a page
>>>>>> * The only reason to pass the request to user mode now is when the user mode
>>>>>> did not register TCE table in the kernel, in all other cases the virtual mode
>>>>>> handler is expected to do the job
>>>>>> ---
>>>>>> .../virtual/kvm/devices/spapr_tce_iommu.txt | 37 +++
>>>>>> arch/powerpc/include/asm/kvm_host.h | 4 +
>>>>>> arch/powerpc/kvm/book3s_64_vio.c | 310 ++++++++++++++++++++-
>>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 122 ++++++++
>>>>>> arch/powerpc/kvm/powerpc.c | 1 +
>>>>>> include/linux/kvm_host.h | 1 +
>>>>>> virt/kvm/kvm_main.c | 5 +
>>>>>> 7 files changed, 477 insertions(+), 3 deletions(-)
>>>>>> create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>>>>
>>>>>> diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>>>> new file mode 100644
>>>>>> index 0000000..4bc8fc3
>>>>>> --- /dev/null
>>>>>> +++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>>>> @@ -0,0 +1,37 @@
>>>>>> +SPAPR TCE IOMMU device
>>>>>> +
>>>>>> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
>>>>>> +Architectures: powerpc
>>>>>> +
>>>>>> +Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU
>>>>>> +
>>>>>> +Groups:
>>>>>> + KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE
>>>>>> + Attributes: single attribute with pair { LIOBN, IOMMU fd}
>>>>>> +
>>>>>> +This is completely made up device which provides API to link
>>>>>> +logical bus number (LIOBN) and IOMMU group. The user space has
>>>>>> +to create a new SPAPR TCE IOMMU device per a logical bus.
>>>>>> +
>>>>> Why not have one device that can handle multimple links?
>>>>
>>>>
>>>> I can do that. If I make it so, it won't even look as a device at all, just
>>>> some weird interface to KVM but ok. What bothers me is it is just a
>>> May be I do not understand usage pattern here. Why do you feel that device
>>> that can handle multiple links is worse than device per link? How many logical
>>> buses is there usually? How often they created/destroyed? I am not insisting
>>> on the change, just trying to understand why you do not like it.
>>
>>
>> Is it usually one PCI host bus adapter per IOMMU group which is usually
>> one PCI card or 2-3 cards if it is a legacy PCI-X, and they are created
>> when QEMU-KVM starts. Not many. And they live till KVM ends.
>>
>> My point is why would I want to put all links to one device? It all is just
>> a matter of taste and nothing more. Or I am missing something but I do not
>> see what. If it is all about making thing to be kosher/halal/orthodox, then
>> I have more stuff to do, like reworking the emulated TCEs. But if is it for
>> (I do not know, just guessing) performance or something like that - then
>> I'll fix it, I just need to know what I am fixing.
>>
> Each device creates an fd, if you can have a lot of them eventually this
> will be a bottleneck. You are saying this is not the case, so lets go
> with proposed interface.


Did you decide not to answer the email which Ben sent yesterday or you just
did not see it? Just checking :)



--
Alexey

2013-09-06 06:02:17

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

On Fri, Sep 06, 2013 at 09:38:21AM +1000, Alexey Kardashevskiy wrote:
> On 09/06/2013 04:10 AM, Gleb Natapov wrote:
> > On Wed, Sep 04, 2013 at 02:01:28AM +1000, Alexey Kardashevskiy wrote:
> >> On 09/03/2013 08:53 PM, Gleb Natapov wrote:
> >>> On Mon, Sep 02, 2013 at 01:14:29PM +1000, Alexey Kardashevskiy wrote:
> >>>> On 09/01/2013 10:06 PM, Gleb Natapov wrote:
> >>>>> On Wed, Aug 28, 2013 at 06:50:41PM +1000, Alexey Kardashevskiy wrote:
> >>>>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> >>>>>> and H_STUFF_TCE requests targeted an IOMMU TCE table without passing
> >>>>>> them to user space which saves time on switching to user space and back.
> >>>>>>
> >>>>>> Both real and virtual modes are supported. The kernel tries to
> >>>>>> handle a TCE request in the real mode, if fails it passes the request
> >>>>>> to the virtual mode to complete the operation. If it a virtual mode
> >>>>>> handler fails, the request is passed to user space.
> >>>>>>
> >>>>>> The first user of this is VFIO on POWER. Trampolines to the VFIO external
> >>>>>> user API functions are required for this patch.
> >>>>>>
> >>>>>> This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
> >>>>>> number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling
> >>>>>> of map/unmap requests. The device supports a single attribute which is
> >>>>>> a struct with LIOBN and IOMMU fd. When the attribute is set, the device
> >>>>>> establishes the connection between KVM and VFIO.
> >>>>>>
> >>>>>> Tests show that this patch increases transmission speed from 220MB/s
> >>>>>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> >>>>>>
> >>>>>> Signed-off-by: Paul Mackerras <[email protected]>
> >>>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>>>>>
> >>>>>> ---
> >>>>>>
> >>>>>> Changes:
> >>>>>> v9:
> >>>>>> * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
> >>>>>> KVM device
> >>>>>> * release_spapr_tce_table() is not shared between different TCE types
> >>>>>> * reduced the patch size by moving VFIO external API
> >>>>>> trampolines to separate patche
> >>>>>> * moved documentation from Documentation/virtual/kvm/api.txt to
> >>>>>> Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> >>>>>>
> >>>>>> v8:
> >>>>>> * fixed warnings from check_patch.pl
> >>>>>>
> >>>>>> 2013/07/11:
> >>>>>> * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
> >>>>>> for KVM_BOOK3S_64
> >>>>>> * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much sense
> >>>>>> for this here but the next patch for hugepages support will use it more.
> >>>>>>
> >>>>>> 2013/07/06:
> >>>>>> * added realmode arch_spin_lock to protect TCE table from races
> >>>>>> in real and virtual modes
> >>>>>> * POWERPC IOMMU API is changed to support real mode
> >>>>>> * iommu_take_ownership and iommu_release_ownership are protected by
> >>>>>> iommu_table's locks
> >>>>>> * VFIO external user API use rewritten
> >>>>>> * multiple small fixes
> >>>>>>
> >>>>>> 2013/06/27:
> >>>>>> * tce_list page is referenced now in order to protect it from accident
> >>>>>> invalidation during H_PUT_TCE_INDIRECT execution
> >>>>>> * added use of the external user VFIO API
> >>>>>>
> >>>>>> 2013/06/05:
> >>>>>> * changed capability number
> >>>>>> * changed ioctl number
> >>>>>> * update the doc article number
> >>>>>>
> >>>>>> 2013/05/20:
> >>>>>> * removed get_user() from real mode handlers
> >>>>>> * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
> >>>>>> translated TCEs, tries realmode_get_page() on those and if it fails, it
> >>>>>> passes control over the virtual mode handler which tries to finish
> >>>>>> the request handling
> >>>>>> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
> >>>>>> on a page
> >>>>>> * The only reason to pass the request to user mode now is when the user mode
> >>>>>> did not register TCE table in the kernel, in all other cases the virtual mode
> >>>>>> handler is expected to do the job
> >>>>>> ---
> >>>>>> .../virtual/kvm/devices/spapr_tce_iommu.txt | 37 +++
> >>>>>> arch/powerpc/include/asm/kvm_host.h | 4 +
> >>>>>> arch/powerpc/kvm/book3s_64_vio.c | 310 ++++++++++++++++++++-
> >>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 122 ++++++++
> >>>>>> arch/powerpc/kvm/powerpc.c | 1 +
> >>>>>> include/linux/kvm_host.h | 1 +
> >>>>>> virt/kvm/kvm_main.c | 5 +
> >>>>>> 7 files changed, 477 insertions(+), 3 deletions(-)
> >>>>>> create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> >>>>>>
> >>>>>> diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> >>>>>> new file mode 100644
> >>>>>> index 0000000..4bc8fc3
> >>>>>> --- /dev/null
> >>>>>> +++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> >>>>>> @@ -0,0 +1,37 @@
> >>>>>> +SPAPR TCE IOMMU device
> >>>>>> +
> >>>>>> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
> >>>>>> +Architectures: powerpc
> >>>>>> +
> >>>>>> +Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU
> >>>>>> +
> >>>>>> +Groups:
> >>>>>> + KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE
> >>>>>> + Attributes: single attribute with pair { LIOBN, IOMMU fd}
> >>>>>> +
> >>>>>> +This is completely made up device which provides API to link
> >>>>>> +logical bus number (LIOBN) and IOMMU group. The user space has
> >>>>>> +to create a new SPAPR TCE IOMMU device per a logical bus.
> >>>>>> +
> >>>>> Why not have one device that can handle multimple links?
> >>>>
> >>>>
> >>>> I can do that. If I make it so, it won't even look as a device at all, just
> >>>> some weird interface to KVM but ok. What bothers me is it is just a
> >>> May be I do not understand usage pattern here. Why do you feel that device
> >>> that can handle multiple links is worse than device per link? How many logical
> >>> buses is there usually? How often they created/destroyed? I am not insisting
> >>> on the change, just trying to understand why you do not like it.
> >>
> >>
> >> Is it usually one PCI host bus adapter per IOMMU group which is usually
> >> one PCI card or 2-3 cards if it is a legacy PCI-X, and they are created
> >> when QEMU-KVM starts. Not many. And they live till KVM ends.
> >>
> >> My point is why would I want to put all links to one device? It all is just
> >> a matter of taste and nothing more. Or I am missing something but I do not
> >> see what. If it is all about making thing to be kosher/halal/orthodox, then
> >> I have more stuff to do, like reworking the emulated TCEs. But if is it for
> >> (I do not know, just guessing) performance or something like that - then
> >> I'll fix it, I just need to know what I am fixing.
> >>
> > Each device creates an fd, if you can have a lot of them eventually this
> > will be a bottleneck. You are saying this is not the case, so lets go
> > with proposed interface.
>
>
> Did you decide not to answer the email which Ben sent yesterday or you just
> did not see it? Just checking :)
>
Haven't seen it. Which one?

--
Gleb.

2013-09-06 06:06:53

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

On 09/06/2013 04:01 PM, Gleb Natapov wrote:
> On Fri, Sep 06, 2013 at 09:38:21AM +1000, Alexey Kardashevskiy wrote:
>> On 09/06/2013 04:10 AM, Gleb Natapov wrote:
>>> On Wed, Sep 04, 2013 at 02:01:28AM +1000, Alexey Kardashevskiy wrote:
>>>> On 09/03/2013 08:53 PM, Gleb Natapov wrote:
>>>>> On Mon, Sep 02, 2013 at 01:14:29PM +1000, Alexey Kardashevskiy wrote:
>>>>>> On 09/01/2013 10:06 PM, Gleb Natapov wrote:
>>>>>>> On Wed, Aug 28, 2013 at 06:50:41PM +1000, Alexey Kardashevskiy wrote:
>>>>>>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>>>>>>>> and H_STUFF_TCE requests targeted an IOMMU TCE table without passing
>>>>>>>> them to user space which saves time on switching to user space and back.
>>>>>>>>
>>>>>>>> Both real and virtual modes are supported. The kernel tries to
>>>>>>>> handle a TCE request in the real mode, if fails it passes the request
>>>>>>>> to the virtual mode to complete the operation. If it a virtual mode
>>>>>>>> handler fails, the request is passed to user space.
>>>>>>>>
>>>>>>>> The first user of this is VFIO on POWER. Trampolines to the VFIO external
>>>>>>>> user API functions are required for this patch.
>>>>>>>>
>>>>>>>> This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
>>>>>>>> number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling
>>>>>>>> of map/unmap requests. The device supports a single attribute which is
>>>>>>>> a struct with LIOBN and IOMMU fd. When the attribute is set, the device
>>>>>>>> establishes the connection between KVM and VFIO.
>>>>>>>>
>>>>>>>> Tests show that this patch increases transmission speed from 220MB/s
>>>>>>>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>>>>>>>
>>>>>>>> Signed-off-by: Paul Mackerras <[email protected]>
>>>>>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> Changes:
>>>>>>>> v9:
>>>>>>>> * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
>>>>>>>> KVM device
>>>>>>>> * release_spapr_tce_table() is not shared between different TCE types
>>>>>>>> * reduced the patch size by moving VFIO external API
>>>>>>>> trampolines to separate patche
>>>>>>>> * moved documentation from Documentation/virtual/kvm/api.txt to
>>>>>>>> Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>>>>>>
>>>>>>>> v8:
>>>>>>>> * fixed warnings from check_patch.pl
>>>>>>>>
>>>>>>>> 2013/07/11:
>>>>>>>> * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
>>>>>>>> for KVM_BOOK3S_64
>>>>>>>> * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much sense
>>>>>>>> for this here but the next patch for hugepages support will use it more.
>>>>>>>>
>>>>>>>> 2013/07/06:
>>>>>>>> * added realmode arch_spin_lock to protect TCE table from races
>>>>>>>> in real and virtual modes
>>>>>>>> * POWERPC IOMMU API is changed to support real mode
>>>>>>>> * iommu_take_ownership and iommu_release_ownership are protected by
>>>>>>>> iommu_table's locks
>>>>>>>> * VFIO external user API use rewritten
>>>>>>>> * multiple small fixes
>>>>>>>>
>>>>>>>> 2013/06/27:
>>>>>>>> * tce_list page is referenced now in order to protect it from accident
>>>>>>>> invalidation during H_PUT_TCE_INDIRECT execution
>>>>>>>> * added use of the external user VFIO API
>>>>>>>>
>>>>>>>> 2013/06/05:
>>>>>>>> * changed capability number
>>>>>>>> * changed ioctl number
>>>>>>>> * update the doc article number
>>>>>>>>
>>>>>>>> 2013/05/20:
>>>>>>>> * removed get_user() from real mode handlers
>>>>>>>> * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
>>>>>>>> translated TCEs, tries realmode_get_page() on those and if it fails, it
>>>>>>>> passes control over the virtual mode handler which tries to finish
>>>>>>>> the request handling
>>>>>>>> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
>>>>>>>> on a page
>>>>>>>> * The only reason to pass the request to user mode now is when the user mode
>>>>>>>> did not register TCE table in the kernel, in all other cases the virtual mode
>>>>>>>> handler is expected to do the job
>>>>>>>> ---
>>>>>>>> .../virtual/kvm/devices/spapr_tce_iommu.txt | 37 +++
>>>>>>>> arch/powerpc/include/asm/kvm_host.h | 4 +
>>>>>>>> arch/powerpc/kvm/book3s_64_vio.c | 310 ++++++++++++++++++++-
>>>>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 122 ++++++++
>>>>>>>> arch/powerpc/kvm/powerpc.c | 1 +
>>>>>>>> include/linux/kvm_host.h | 1 +
>>>>>>>> virt/kvm/kvm_main.c | 5 +
>>>>>>>> 7 files changed, 477 insertions(+), 3 deletions(-)
>>>>>>>> create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>>>>>>
>>>>>>>> diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>>>>>> new file mode 100644
>>>>>>>> index 0000000..4bc8fc3
>>>>>>>> --- /dev/null
>>>>>>>> +++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>>>>>> @@ -0,0 +1,37 @@
>>>>>>>> +SPAPR TCE IOMMU device
>>>>>>>> +
>>>>>>>> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
>>>>>>>> +Architectures: powerpc
>>>>>>>> +
>>>>>>>> +Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU
>>>>>>>> +
>>>>>>>> +Groups:
>>>>>>>> + KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE
>>>>>>>> + Attributes: single attribute with pair { LIOBN, IOMMU fd}
>>>>>>>> +
>>>>>>>> +This is completely made up device which provides API to link
>>>>>>>> +logical bus number (LIOBN) and IOMMU group. The user space has
>>>>>>>> +to create a new SPAPR TCE IOMMU device per a logical bus.
>>>>>>>> +
>>>>>>> Why not have one device that can handle multimple links?
>>>>>>
>>>>>>
>>>>>> I can do that. If I make it so, it won't even look as a device at all, just
>>>>>> some weird interface to KVM but ok. What bothers me is it is just a
>>>>> May be I do not understand usage pattern here. Why do you feel that device
>>>>> that can handle multiple links is worse than device per link? How many logical
>>>>> buses is there usually? How often they created/destroyed? I am not insisting
>>>>> on the change, just trying to understand why you do not like it.
>>>>
>>>>
>>>> Is it usually one PCI host bus adapter per IOMMU group which is usually
>>>> one PCI card or 2-3 cards if it is a legacy PCI-X, and they are created
>>>> when QEMU-KVM starts. Not many. And they live till KVM ends.
>>>>
>>>> My point is why would I want to put all links to one device? It all is just
>>>> a matter of taste and nothing more. Or I am missing something but I do not
>>>> see what. If it is all about making thing to be kosher/halal/orthodox, then
>>>> I have more stuff to do, like reworking the emulated TCEs. But if is it for
>>>> (I do not know, just guessing) performance or something like that - then
>>>> I'll fix it, I just need to know what I am fixing.
>>>>
>>> Each device creates an fd, if you can have a lot of them eventually this
>>> will be a bottleneck. You are saying this is not the case, so lets go
>>> with proposed interface.
>>
>>
>> Did you decide not to answer the email which Ben sent yesterday or you just
>> did not see it? Just checking :)
>>
> Haven't seen it. Which one?


Subject: Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel
handling
Date: Thu, 05 Sep 2013 14:05:09 +1000
From: Benjamin Herrenschmidt <[email protected]>
To: Gleb Natapov <[email protected]>
CC: Alexey Kardashevskiy <[email protected]>, [email protected],
David Gibson <[email protected]>, Paul Mackerras
<[email protected]>, Paolo Bonzini <[email protected]>, Alexander
Graf <[email protected]>, [email protected],
[email protected], [email protected],
[email protected]


--
Alexey

2013-09-06 06:57:32

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

On Thu, Sep 05, 2013 at 02:05:09PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2013-09-03 at 13:53 +0300, Gleb Natapov wrote:
> > > Or supporting all IOMMU links (and leaving emulated stuff as is) in on
> > > "device" is the last thing I have to do and then you'll ack the patch?
> > >
> > I am concerned more about API here. Internal implementation details I
> > leave to powerpc experts :)
>
> So Gleb, I want to step in for a bit here.
>
> While I understand that the new KVM device API is all nice and shiny and that this
> whole thing should probably have been KVM devices in the first place (had they
> existed or had we been told back then), the point is, the API for handling
> HW IOMMUs that Alexey is trying to add is an extension of an existing mechanism
> used for emulated IOMMUs.
>
> The internal data structure is shared, and fundamentally, by forcing him to
> use that new KVM device for the "new stuff", we create a oddball API with
> an ioctl for one type of iommu and a KVM device for the other, which makes
> the implementation a complete mess in the kernel (and you should care :-)
>
Is it unfixable mess? Even if Alexey will do what you suggested earlier?

- Convert *both* existing TCE objects to the new
KVM_CREATE_DEVICE, and have some backward compat code for the old one.

The point is implementation usually can be changed, but for API it is
much harder to do so.

> So for something completely new, I would tend to agree with you. However, I
> still think that for this specific case, we should just plonk-in the original
> ioctl proposed by Alexey and be done with it.
>
Do you think this is the last extension to IOMMU code, or we will see
more and will use same justification to continue adding ioctls?

--
Gleb.

2013-09-06 07:05:10

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

On 09/06/2013 04:57 PM, Gleb Natapov wrote:
> On Thu, Sep 05, 2013 at 02:05:09PM +1000, Benjamin Herrenschmidt wrote:
>> On Tue, 2013-09-03 at 13:53 +0300, Gleb Natapov wrote:
>>>> Or supporting all IOMMU links (and leaving emulated stuff as is) in on
>>>> "device" is the last thing I have to do and then you'll ack the patch?
>>>>
>>> I am concerned more about API here. Internal implementation details I
>>> leave to powerpc experts :)
>>
>> So Gleb, I want to step in for a bit here.
>>
>> While I understand that the new KVM device API is all nice and shiny and that this
>> whole thing should probably have been KVM devices in the first place (had they
>> existed or had we been told back then), the point is, the API for handling
>> HW IOMMUs that Alexey is trying to add is an extension of an existing mechanism
>> used for emulated IOMMUs.
>>
>> The internal data structure is shared, and fundamentally, by forcing him to
>> use that new KVM device for the "new stuff", we create a oddball API with
>> an ioctl for one type of iommu and a KVM device for the other, which makes
>> the implementation a complete mess in the kernel (and you should care :-)
>>
> Is it unfixable mess? Even if Alexey will do what you suggested earlier?
>
> - Convert *both* existing TCE objects to the new
> KVM_CREATE_DEVICE, and have some backward compat code for the old one.



If we convert *old*, then for compatibility we will need one KVM device
(more precisely, one fd) per an TCE object (not one for all TCE objects) as
the guest (upstream QEMU) will mmap the table via mmap() and it won't use
any offset when mapping this fd.

The current KVM device implementation does not even support mmap().

So I would go with the KVM device patch I posted and really want to avoid
putting all TCEs into one device.



> The point is implementation usually can be changed, but for API it is
> much harder to do so.
>
>> So for something completely new, I would tend to agree with you. However, I
>> still think that for this specific case, we should just plonk-in the original
>> ioctl proposed by Alexey and be done with it.
>>
> Do you think this is the last extension to IOMMU code, or we will see
> more and will use same justification to continue adding ioctls?




--
Alexey

2013-09-06 10:40:49

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH v10 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted an IOMMU TCE table without passing
them to user space which saves time on switching to user space and back.

Both real and virtual modes are supported. The kernel tries to
handle a TCE request in the real mode, if fails it passes the request
to the virtual mode to complete the operation. If it a virtual mode
handler fails, the request is passed to user space.

The first user of this is VFIO on POWER. Trampolines to the VFIO external
user API functions are required for this patch.

This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling
of map/unmap requests. The device supports a single attribute which is
a struct with LIOBN and IOMMU fd. When the attribute is set, the device
establishes the connection between KVM and VFIO.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>

---

Changes:
v10:
* all IOMMU TCE links are handled by one KVM device now
* KVM device has its own list of TCE descriptors
* the search-by-liobn function was extended to search through
emulated and IOMMU lists

v9:
* KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
KVM device
* release_spapr_tce_table() is not shared between different TCE types
* reduced the patch size by moving KVM device bits and VFIO external API
trampolines to separate patches
* moved documentation from Documentation/virtual/kvm/api.txt to
Documentation/virtual/kvm/devices/spapr_tce_iommu.txt

v8:
* fixed warnings from check_patch.pl

2013/07/11:
* removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
for KVM_BOOK3S_64
* kvmppc_gpa_to_hva_and_get also returns host phys address. Not much sense
for this here but the next patch for hugepages support will use it more.

2013/07/06:
* added realmode arch_spin_lock to protect TCE table from races
in real and virtual modes
* POWERPC IOMMU API is changed to support real mode
* iommu_take_ownership and iommu_release_ownership are protected by
iommu_table's locks
* VFIO external user API use rewritten
* multiple small fixes

2013/06/27:
* tce_list page is referenced now in order to protect it from accident
invalidation during H_PUT_TCE_INDIRECT execution
* added use of the external user VFIO API

2013/06/05:
* changed capability number
* changed ioctl number
* update the doc article number

2013/05/20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
translated TCEs, tries realmode_get_page() on those and if it fails, it
passes control over the virtual mode handler which tries to finish
the request handling
* kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
on a page
* The only reason to pass the request to user mode now is when the user mode
did not register TCE table in the kernel, in all other cases the virtual mode
handler is expected to do the job
---
.../virtual/kvm/devices/spapr_tce_iommu.txt | 40 +++
arch/powerpc/include/asm/kvm_host.h | 8 +
arch/powerpc/include/uapi/asm/kvm.h | 5 -
arch/powerpc/kvm/book3s_64_vio.c | 327 ++++++++++++++++++++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 142 +++++++++
arch/powerpc/kvm/powerpc.c | 1 +
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 5 +
8 files changed, 517 insertions(+), 12 deletions(-)
create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt

diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
new file mode 100644
index 0000000..b911945
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
@@ -0,0 +1,40 @@
+SPAPR TCE IOMMU device
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+
+Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU
+
+Groups:
+ KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE
+ Attributes: one VFIO IOMMU fd per LIOBN, indexed by LIOBN
+
+This is completely made up device which provides API to link
+logical bus number (LIOBN) and IOMMU group. The user space has
+to create a new SPAPR TCE IOMMU device once per KVM session
+and use "set_attr" to add or remove a logical bus.
+
+LIOBN is a PCI bus identifier from PPC64-server (sPAPR) DMA hypercalls
+(H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE).
+IOMMU group is a minimal isolated device set which can be passed to
+the user space via VFIO.
+
+The userspace adds the new LIOBN-IOMMU link by calling KVM_SET_DEVICE_ATTR
+with the attribute initialized as shown below:
+struct kvm_device_attr attr = {
+ .flags = 0,
+ .group = KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE,
+ .attr = liobn,
+ .addr = (uint64_t)(uintptr_t)&group_fd,
+};
+
+To remove the link, the userspace calls KVM_SET_DEVICE_ATTR with
+the group_fd equal to zero.
+
+As the device opens VFIO group fds and holds the file pointer,
+it does not need to keep an fd internally and therefore KVM_GET_DEVICE_ATTR
+is not supported.
+
+When KVM exits, all links are destroyed automatically.
+
+The kernel advertises this feature via KVM_CAP_SPAPR_TCE_IOMMU capability.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index a23f132..a2a481e 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -181,9 +181,15 @@ struct kvmppc_spapr_tce_table {
struct kvm *kvm;
u64 liobn;
u32 window_size;
+ struct iommu_group *grp; /* used for IOMMU groups */
+ struct vfio_group *vfio_grp; /* used for IOMMU groups */
struct page *pages[0];
};

+struct kvmppc_spapr_tce_iommu_device {
+ struct list_head tables;
+};
+
struct kvmppc_linear_info {
void *base_virt;
unsigned long base_pfn;
@@ -264,6 +270,7 @@ struct kvm_arch {
#endif /* CONFIG_KVM_BOOK3S_64_HV */
#ifdef CONFIG_PPC_BOOK3S_64
struct list_head spapr_tce_tables;
+ struct kvmppc_spapr_tce_iommu_device *tcedev;
struct list_head rtas_tokens;
#endif
#ifdef CONFIG_KVM_MPIC
@@ -612,6 +619,7 @@ struct kvm_vcpu_arch {
u64 busy_preempt;

unsigned long *tce_tmp_hpas; /* TCE cache for TCE_PUT_INDIRECT */
+ unsigned long tce_tmp_num; /* Number of handled TCEs in cache */
enum {
TCERM_NONE,
TCERM_GETPAGE,
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index c1ae1e5..a9d3d73 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -512,11 +512,6 @@ struct kvm_get_htab_header {
#define KVM_XICS_PENDING (1ULL << 42)

/* SPAPR TCE IOMMU device specification */
-struct kvm_create_spapr_tce_iommu_linkage {
- __u64 liobn;
- __u32 fd;
- __u32 flags;
-};
#define KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE 0

#endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 2880d2b..0978794 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -29,6 +29,8 @@
#include <linux/anon_inodes.h>
#include <linux/module.h>
#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/file.h>

#include <asm/tlbflush.h>
#include <asm/kvm_ppc.h>
@@ -158,10 +160,8 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
int i;

/* Check this LIOBN hasn't been previously allocated */
- list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
- if (stt->liobn == args->liobn)
- return -EBUSY;
- }
+ if (kvmppc_find_tce_table(kvm, args->liobn))
+ return -EBUSY;

npages = kvmppc_stt_npages(args->window_size);

@@ -201,9 +201,175 @@ fail:
return ret;
}

-/* Converts guest physical address to host virtual address */
+static void kvmppc_spapr_tce_iommu_table_destroy(
+ struct kvm_device *dev,
+ struct kvmppc_spapr_tce_table *tt)
+{
+ struct kvm *kvm = dev->kvm;
+
+ mutex_lock(&kvm->lock);
+ list_del(&tt->list);
+
+ if (tt->vfio_grp)
+ kvmppc_vfio_group_put_external_user(tt->vfio_grp);
+ iommu_group_put(tt->grp);
+
+ kfree(tt);
+ mutex_unlock(&kvm->lock);
+}
+
+static int kvmppc_spapr_tce_iommu_create(struct kvm_device *dev, u32 type)
+{
+ struct kvmppc_spapr_tce_iommu_device *tcedev;
+ int ret = 0;
+
+ tcedev = kzalloc(sizeof(*tcedev), GFP_KERNEL);
+ if (!tcedev)
+ return -ENOMEM;
+ dev->private = tcedev;
+
+ INIT_LIST_HEAD(&tcedev->tables);
+
+ /* Already there ? */
+ mutex_lock(&dev->kvm->lock);
+ if (dev->kvm->arch.tcedev)
+ ret = -EEXIST;
+ else
+ dev->kvm->arch.tcedev = tcedev;
+ mutex_unlock(&dev->kvm->lock);
+
+ if (ret)
+ return ret;
+
+ return 0;
+}
+
+static long kvmppc_spapr_tce_iommu_link(struct kvm_device *dev,
+ u64 liobn, u32 group_fd)
+{
+ struct kvmppc_spapr_tce_iommu_device *tcedev = dev->private;
+ struct kvmppc_spapr_tce_table *tt;
+ struct iommu_group *grp;
+ struct iommu_table *tbl;
+ struct file *vfio_filp;
+ struct vfio_group *vfio_grp;
+ int ret = -ENXIO, iommu_id;
+
+ /* Check this LIOBN hasn't been previously registered */
+ tt = kvmppc_find_tce_table(dev->kvm, liobn);
+ if (tt) {
+ if (group_fd)
+ return -EBUSY;
+
+ /* Release and exit */
+ kvmppc_spapr_tce_iommu_table_destroy(dev, tt);
+ return 0;
+ }
+
+ vfio_filp = fget(group_fd);
+ if (!vfio_filp)
+ return -ENXIO;
+
+ /*
+ * Lock the group. Fails if group is not viable or
+ * does not have IOMMU set
+ */
+ vfio_grp = kvmppc_vfio_group_get_external_user(vfio_filp);
+ if (IS_ERR_VALUE((unsigned long)vfio_grp))
+ goto fput_exit;
+
+ /* Get IOMMU ID, find iommu_group and iommu_table*/
+ iommu_id = kvmppc_vfio_external_user_iommu_id(vfio_grp);
+ if (iommu_id < 0)
+ goto grpput_fput_exit;
+
+ grp = iommu_group_get_by_id(iommu_id);
+ if (!grp)
+ goto grpput_fput_exit;
+
+ tbl = iommu_group_get_iommudata(grp);
+ if (!tbl)
+ goto grpput_fput_exit;
+
+ /* Create a TCE table descriptor and add into the descriptor list */
+ tt = kzalloc(sizeof(*tt), GFP_KERNEL);
+ if (!tt)
+ goto grpput_fput_exit;
+
+ tt->liobn = liobn;
+ tt->grp = grp;
+ tt->window_size = tbl->it_size << IOMMU_PAGE_SHIFT;
+ tt->vfio_grp = vfio_grp;
+
+ /* Add the TCE table descriptor to the descriptor list */
+ mutex_lock(&dev->kvm->lock);
+ list_add(&tt->list, &tcedev->tables);
+ mutex_unlock(&dev->kvm->lock);
+
+ ret = 0;
+
+ goto fput_exit;
+
+grpput_fput_exit:
+ kvmppc_vfio_group_put_external_user(vfio_grp);
+fput_exit:
+ fput(vfio_filp);
+
+ return ret;
+}
+
+static int kvmppc_spapr_tce_iommu_set_attr(struct kvm_device *dev,
+ struct kvm_device_attr *attr)
+{
+ u32 group_fd;
+ u32 __user *argp = (u32 __user *) attr->addr;
+
+ switch (attr->group) {
+ case KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE:
+ if (get_user(group_fd, argp))
+ return -EFAULT;
+
+ return kvmppc_spapr_tce_iommu_link(dev, attr->attr, group_fd);
+ }
+ return -ENXIO;
+}
+
+static int kvmppc_spapr_tce_iommu_has_attr(struct kvm_device *dev,
+ struct kvm_device_attr *attr)
+{
+ switch (attr->group) {
+ case KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE:
+ return 0;
+ }
+ return -ENXIO;
+}
+
+static void kvmppc_spapr_tce_iommu_destroy(struct kvm_device *dev)
+{
+ struct kvmppc_spapr_tce_iommu_device *tcedev = dev->private;
+ struct kvmppc_spapr_tce_table *tt, *tmp;
+
+ list_for_each_entry_safe(tt, tmp, &tcedev->tables, list) {
+ kvmppc_spapr_tce_iommu_table_destroy(dev, tt);
+ }
+ kfree(tcedev);
+ kfree(dev);
+}
+
+struct kvm_device_ops kvmppc_spapr_tce_iommu_ops = {
+ .name = "kvm-spapr-tce-iommu",
+ .create = kvmppc_spapr_tce_iommu_create,
+ .set_attr = kvmppc_spapr_tce_iommu_set_attr,
+ .has_attr = kvmppc_spapr_tce_iommu_has_attr,
+ .destroy = kvmppc_spapr_tce_iommu_destroy,
+};
+
+/*
+ * Converts guest physical address to host virtual address.
+ * Also returns host physical address which is to put to TCE table.
+ */
static void __user *kvmppc_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
- unsigned long gpa, struct page **pg)
+ unsigned long gpa, struct page **pg, unsigned long *phpa)
{
unsigned long hva, gfn = gpa >> PAGE_SHIFT;
struct kvm_memory_slot *memslot;
@@ -218,9 +384,140 @@ static void __user *kvmppc_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
if (get_user_pages_fast(hva & PAGE_MASK, 1, is_write, pg) != 1)
return ERROR_ADDR;

+ if (phpa)
+ *phpa = __pa((unsigned long) page_address(*pg)) |
+ (hva & ~PAGE_MASK);
+
return (void *) hva;
}

+long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce)
+{
+ struct page *pg = NULL;
+ unsigned long hpa;
+ void __user *hva;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* Clear TCE */
+ if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE))) {
+ if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+ return H_PARAMETER;
+
+ if (iommu_free_tces(tbl, ioba >> IOMMU_PAGE_SHIFT,
+ 1, false))
+ return H_HARDWARE;
+
+ return H_SUCCESS;
+ }
+
+ /* Put TCE */
+ if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
+ /* Retry iommu_tce_build if it failed in real mode */
+ vcpu->arch.tce_rm_fail = TCERM_NONE;
+ hpa = vcpu->arch.tce_tmp_hpas[0];
+ } else {
+ if (iommu_tce_put_param_check(tbl, ioba, tce))
+ return H_PARAMETER;
+
+ hva = kvmppc_gpa_to_hva_and_get(vcpu, tce, &pg, &hpa);
+ if (hva == ERROR_ADDR)
+ return H_HARDWARE;
+ }
+
+ if (!iommu_tce_build(tbl, ioba >> IOMMU_PAGE_SHIFT, &hpa, 1, false))
+ return H_SUCCESS;
+
+ pg = pfn_to_page(hpa >> PAGE_SHIFT);
+ if (pg)
+ put_page(pg);
+
+ return H_HARDWARE;
+}
+
+static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt, unsigned long ioba,
+ unsigned long __user *tces, unsigned long npages)
+{
+ long i = 0, start = 0;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ switch (vcpu->arch.tce_rm_fail) {
+ case TCERM_NONE:
+ break;
+ case TCERM_GETPAGE:
+ start = vcpu->arch.tce_tmp_num;
+ break;
+ case TCERM_PUTTCE:
+ goto put_tces;
+ case TCERM_PUTLIST:
+ default:
+ WARN_ON(1);
+ return H_HARDWARE;
+ }
+
+ for (i = start; i < npages; ++i) {
+ struct page *pg = NULL;
+ unsigned long gpa;
+ void __user *hva;
+
+ if (get_user(gpa, tces + i))
+ return H_HARDWARE;
+
+ if (iommu_tce_put_param_check(tbl, ioba +
+ (i << IOMMU_PAGE_SHIFT), gpa))
+ return H_PARAMETER;
+
+ hva = kvmppc_gpa_to_hva_and_get(vcpu, gpa, &pg,
+ &vcpu->arch.tce_tmp_hpas[i]);
+ if (hva == ERROR_ADDR)
+ goto putpages_flush_exit;
+ }
+
+put_tces:
+ if (!iommu_tce_build(tbl, ioba >> IOMMU_PAGE_SHIFT,
+ vcpu->arch.tce_tmp_hpas, npages, false))
+ return H_SUCCESS;
+
+putpages_flush_exit:
+ for (--i; i >= 0; --i) {
+ struct page *pg;
+ pg = pfn_to_page(vcpu->arch.tce_tmp_hpas[i] >> PAGE_SHIFT);
+ if (pg)
+ put_page(pg);
+ }
+
+ return H_HARDWARE;
+}
+
+long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+ unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+ return H_PARAMETER;
+
+ if (iommu_free_tces(tbl, entry, npages, false))
+ return H_HARDWARE;
+
+ return H_SUCCESS;
+}
+
long kvmppc_h_put_tce(struct kvm_vcpu *vcpu,
unsigned long liobn, unsigned long ioba,
unsigned long tce)
@@ -232,6 +529,10 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu,
if (!tt)
return H_TOO_HARD;

+ if (tt->grp)
+ return kvmppc_h_put_tce_iommu(vcpu, tt, liobn, ioba, tce);
+
+ /* Emulated IO */
if (ioba >= tt->window_size)
return H_PARAMETER;

@@ -270,13 +571,20 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;

- tces = kvmppc_gpa_to_hva_and_get(vcpu, tce_list, &pg);
+ tces = kvmppc_gpa_to_hva_and_get(vcpu, tce_list, &pg, NULL);
if (tces == ERROR_ADDR)
return H_TOO_HARD;

if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
goto put_list_page_exit;

+ if (tt->grp) {
+ ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
+ tt, ioba, tces, npages);
+ goto put_list_page_exit;
+ }
+
+ /* Emulated IO */
for (i = 0; i < npages; ++i) {
if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
ret = H_PARAMETER;
@@ -315,6 +623,11 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
if (!tt)
return H_TOO_HARD;

+ if (tt->grp)
+ return kvmppc_h_stuff_tce_iommu(vcpu, tt, liobn, ioba,
+ tce_value, npages);
+
+ /* Emulated IO */
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index f00fcd2..fc872f2 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -26,6 +26,7 @@
#include <linux/slab.h>
#include <linux/hugetlb.h>
#include <linux/list.h>
+#include <linux/iommu.h>

#include <asm/tlbflush.h>
#include <asm/kvm_ppc.h>
@@ -47,11 +48,31 @@
* WARNING: This will be called in real or virtual mode on HV KVM and virtual
* mode on PR KVM
*/
+static struct kvmppc_spapr_tce_table *kvmppc_find_iommu_tce_table(
+ struct kvm *kvm, unsigned long liobn)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ struct kvmppc_spapr_tce_iommu_device *tcedev = kvm->arch.tcedev;
+
+ if (tcedev) {
+ list_for_each_entry(tt, &tcedev->tables, list) {
+ if (tt->liobn == liobn)
+ return tt;
+ }
+ }
+
+ return NULL;
+}
+
struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm *kvm,
unsigned long liobn)
{
struct kvmppc_spapr_tce_table *tt;

+ tt = kvmppc_find_iommu_tce_table(kvm, liobn);
+ if (tt)
+ return tt;
+
list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) {
if (tt->liobn == liobn)
return tt;
@@ -191,6 +212,111 @@ static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
return hpa;
}

+static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt, unsigned long liobn,
+ unsigned long ioba, unsigned long tce)
+{
+ int ret = 0;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+ unsigned long hpa;
+ struct page *pg = NULL;
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* Clear TCE */
+ if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE))) {
+ if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+ return H_PARAMETER;
+
+ if (iommu_free_tces(tbl, ioba >> IOMMU_PAGE_SHIFT, 1, true))
+ return H_TOO_HARD;
+
+ return H_SUCCESS;
+ }
+
+ /* Put TCE */
+ if (iommu_tce_put_param_check(tbl, ioba, tce))
+ return H_PARAMETER;
+
+ hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce, &pg);
+ if (hpa != ERROR_ADDR) {
+ ret = iommu_tce_build(tbl, ioba >> IOMMU_PAGE_SHIFT,
+ &hpa, 1, true);
+ }
+
+ if (((hpa == ERROR_ADDR) && pg) || ret) {
+ vcpu->arch.tce_tmp_hpas[0] = hpa;
+ vcpu->arch.tce_tmp_num = 0;
+ vcpu->arch.tce_rm_fail = TCERM_PUTTCE;
+ return H_TOO_HARD;
+ }
+
+ return H_SUCCESS;
+}
+
+static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt, unsigned long ioba,
+ unsigned long *tces, unsigned long npages)
+{
+ int i, ret;
+ unsigned long hpa;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+ struct page *pg = NULL;
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* Check all TCEs */
+ for (i = 0; i < npages; ++i) {
+ if (iommu_tce_put_param_check(tbl, ioba +
+ (i << IOMMU_PAGE_SHIFT), tces[i]))
+ return H_PARAMETER;
+ }
+
+ /* Translate TCEs and go get_page() */
+ for (i = 0; i < npages; ++i) {
+ hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tces[i], &pg);
+ if (hpa == ERROR_ADDR) {
+ vcpu->arch.tce_tmp_num = i;
+ vcpu->arch.tce_rm_fail = TCERM_GETPAGE;
+ return H_TOO_HARD;
+ }
+ vcpu->arch.tce_tmp_hpas[i] = hpa;
+ }
+
+ /* Put TCEs to the table */
+ ret = iommu_tce_build(tbl, (ioba >> IOMMU_PAGE_SHIFT),
+ vcpu->arch.tce_tmp_hpas, npages, true);
+ if (ret == -EAGAIN) {
+ vcpu->arch.tce_rm_fail = TCERM_PUTTCE;
+ return H_TOO_HARD;
+ } else if (ret) {
+ return H_HARDWARE;
+ }
+
+ return H_SUCCESS;
+}
+
+static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+ return H_PARAMETER;
+
+ if (iommu_free_tces(tbl, ioba >> IOMMU_PAGE_SHIFT, npages, true))
+ return H_TOO_HARD;
+
+ return H_SUCCESS;
+}
+
long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce)
{
@@ -201,6 +327,10 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
if (!tt)
return H_TOO_HARD;

+ if (tt->grp)
+ return kvmppc_rm_h_put_tce_iommu(vcpu, tt, liobn, ioba, tce);
+
+ /* Emulated IO */
if (ioba >= tt->window_size)
return H_PARAMETER;

@@ -243,6 +373,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
goto put_unlock_exit;
}

+ if (tt->grp) {
+ ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
+ tt, ioba, (unsigned long *)tces, npages);
+ goto put_unlock_exit;
+ }
+
+ /* Emulated IO */
for (i = 0; i < npages; ++i) {
ret = kvmppc_tce_validate(((unsigned long *)tces)[i]);
if (ret)
@@ -273,6 +410,11 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
if (!tt)
return H_TOO_HARD;

+ if (tt->grp)
+ return kvmppc_rm_h_stuff_tce_iommu(vcpu, tt, liobn, ioba,
+ tce_value, npages);
+
+ /* Emulated IO */
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index ccb578b..ea9af59 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -395,6 +395,7 @@ int kvm_dev_ioctl_check_extension(long ext)
r = 1;
break;
case KVM_CAP_SPAPR_MULTITCE:
+ case KVM_CAP_SPAPR_TCE_IOMMU:
r = 1;
break;
#endif
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7a8c1b3..016df11 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1053,6 +1053,7 @@ struct kvm_device *kvm_device_from_filp(struct file *filp);

extern struct kvm_device_ops kvm_mpic_ops;
extern struct kvm_device_ops kvm_xics_ops;
+extern struct kvm_device_ops kvmppc_spapr_tce_iommu_ops;

#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1580dd4..34c3c22 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2282,6 +2282,11 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
ops = &kvm_xics_ops;
break;
#endif
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+ case KVM_DEV_TYPE_SPAPR_TCE_IOMMU:
+ ops = &kvmppc_spapr_tce_iommu_ops;
+ break;
+#endif
default:
return -ENODEV;
}
--
1.8.4.rc4

2013-09-06 10:45:38

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

On 09/06/2013 04:57 PM, Gleb Natapov wrote:
> On Thu, Sep 05, 2013 at 02:05:09PM +1000, Benjamin Herrenschmidt wrote:
>> On Tue, 2013-09-03 at 13:53 +0300, Gleb Natapov wrote:
>>>> Or supporting all IOMMU links (and leaving emulated stuff as is) in on
>>>> "device" is the last thing I have to do and then you'll ack the patch?
>>>>
>>> I am concerned more about API here. Internal implementation details I
>>> leave to powerpc experts :)
>>
>> So Gleb, I want to step in for a bit here.
>>
>> While I understand that the new KVM device API is all nice and shiny and that this
>> whole thing should probably have been KVM devices in the first place (had they
>> existed or had we been told back then), the point is, the API for handling
>> HW IOMMUs that Alexey is trying to add is an extension of an existing mechanism
>> used for emulated IOMMUs.
>>
>> The internal data structure is shared, and fundamentally, by forcing him to
>> use that new KVM device for the "new stuff", we create a oddball API with
>> an ioctl for one type of iommu and a KVM device for the other, which makes
>> the implementation a complete mess in the kernel (and you should care :-)
>>
> Is it unfixable mess? Even if Alexey will do what you suggested earlier?
>
> - Convert *both* existing TCE objects to the new
> KVM_CREATE_DEVICE, and have some backward compat code for the old one.
>
> The point is implementation usually can be changed, but for API it is
> much harder to do so.
>
>> So for something completely new, I would tend to agree with you. However, I
>> still think that for this specific case, we should just plonk-in the original
>> ioctl proposed by Alexey and be done with it.
>>
> Do you think this is the last extension to IOMMU code, or we will see
> more and will use same justification to continue adding ioctls?


Ok. I give up :) I implemented KVM device the way you suggested. Could you
please have a look? It is "[PATCH v10 12/13] KVM: PPC: Add support for
IOMMU in-kernel handling", attached to this thread. Thanks!



--
Alexey

2013-09-08 14:12:44

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH v10 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

On Fri, Sep 06, 2013 at 08:40:26PM +1000, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted an IOMMU TCE table without passing
> them to user space which saves time on switching to user space and back.
>
> Both real and virtual modes are supported. The kernel tries to
> handle a TCE request in the real mode, if fails it passes the request
> to the virtual mode to complete the operation. If it a virtual mode
> handler fails, the request is passed to user space.
>
> The first user of this is VFIO on POWER. Trampolines to the VFIO external
> user API functions are required for this patch.
>
> This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
> number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling
> of map/unmap requests. The device supports a single attribute which is
> a struct with LIOBN and IOMMU fd. When the attribute is set, the device
> establishes the connection between KVM and VFIO.
>
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>
> Signed-off-by: Paul Mackerras <[email protected]>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>
> ---
>
> Changes:
> v10:
> * all IOMMU TCE links are handled by one KVM device now
> * KVM device has its own list of TCE descriptors
> * the search-by-liobn function was extended to search through
> emulated and IOMMU lists
>
> v9:
> * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
> KVM device
> * release_spapr_tce_table() is not shared between different TCE types
> * reduced the patch size by moving KVM device bits and VFIO external API
> trampolines to separate patches
> * moved documentation from Documentation/virtual/kvm/api.txt to
> Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>
> v8:
> * fixed warnings from check_patch.pl
>
> 2013/07/11:
> * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
> for KVM_BOOK3S_64
> * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much sense
> for this here but the next patch for hugepages support will use it more.
>
> 2013/07/06:
> * added realmode arch_spin_lock to protect TCE table from races
> in real and virtual modes
> * POWERPC IOMMU API is changed to support real mode
> * iommu_take_ownership and iommu_release_ownership are protected by
> iommu_table's locks
> * VFIO external user API use rewritten
> * multiple small fixes
>
> 2013/06/27:
> * tce_list page is referenced now in order to protect it from accident
> invalidation during H_PUT_TCE_INDIRECT execution
> * added use of the external user VFIO API
>
> 2013/06/05:
> * changed capability number
> * changed ioctl number
> * update the doc article number
>
> 2013/05/20:
> * removed get_user() from real mode handlers
> * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
> translated TCEs, tries realmode_get_page() on those and if it fails, it
> passes control over the virtual mode handler which tries to finish
> the request handling
> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
> on a page
> * The only reason to pass the request to user mode now is when the user mode
> did not register TCE table in the kernel, in all other cases the virtual mode
> handler is expected to do the job
> ---
> .../virtual/kvm/devices/spapr_tce_iommu.txt | 40 +++
> arch/powerpc/include/asm/kvm_host.h | 8 +
> arch/powerpc/include/uapi/asm/kvm.h | 5 -
> arch/powerpc/kvm/book3s_64_vio.c | 327 ++++++++++++++++++++-
> arch/powerpc/kvm/book3s_64_vio_hv.c | 142 +++++++++
> arch/powerpc/kvm/powerpc.c | 1 +
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 5 +
> 8 files changed, 517 insertions(+), 12 deletions(-)
> create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>
> diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> new file mode 100644
> index 0000000..b911945
> --- /dev/null
> +++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> @@ -0,0 +1,40 @@
> +SPAPR TCE IOMMU device
> +
> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
> +Architectures: powerpc
> +
> +Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU
> +
> +Groups:
> + KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE
> + Attributes: one VFIO IOMMU fd per LIOBN, indexed by LIOBN
> +
> +This is completely made up device which provides API to link
> +logical bus number (LIOBN) and IOMMU group. The user space has
> +to create a new SPAPR TCE IOMMU device once per KVM session
> +and use "set_attr" to add or remove a logical bus.
> +
> +LIOBN is a PCI bus identifier from PPC64-server (sPAPR) DMA hypercalls
> +(H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE).
> +IOMMU group is a minimal isolated device set which can be passed to
> +the user space via VFIO.
> +
> +The userspace adds the new LIOBN-IOMMU link by calling KVM_SET_DEVICE_ATTR
> +with the attribute initialized as shown below:
> +struct kvm_device_attr attr = {
> + .flags = 0,
> + .group = KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE,
> + .attr = liobn,
> + .addr = (uint64_t)(uintptr_t)&group_fd,
> +};
> +
> +To remove the link, the userspace calls KVM_SET_DEVICE_ATTR with
> +the group_fd equal to zero.
> +
Zero is a valid fd descriptor. Lest make it -1.

> +As the device opens VFIO group fds and holds the file pointer,
> +it does not need to keep an fd internally and therefore KVM_GET_DEVICE_ATTR
> +is not supported.
> +
> +When KVM exits, all links are destroyed automatically.
> +
> +The kernel advertises this feature via KVM_CAP_SPAPR_TCE_IOMMU capability.
Why KVM_CAP_SPAPR_TCE_IOMMU is needed? Supported devices are
discoverable without capabilities.

> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index a23f132..a2a481e 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -181,9 +181,15 @@ struct kvmppc_spapr_tce_table {
> struct kvm *kvm;
> u64 liobn;
> u32 window_size;
> + struct iommu_group *grp; /* used for IOMMU groups */
> + struct vfio_group *vfio_grp; /* used for IOMMU groups */
> struct page *pages[0];
> };
>
> +struct kvmppc_spapr_tce_iommu_device {
> + struct list_head tables;
> +};
> +
> struct kvmppc_linear_info {
> void *base_virt;
> unsigned long base_pfn;
> @@ -264,6 +270,7 @@ struct kvm_arch {
> #endif /* CONFIG_KVM_BOOK3S_64_HV */
> #ifdef CONFIG_PPC_BOOK3S_64
> struct list_head spapr_tce_tables;
> + struct kvmppc_spapr_tce_iommu_device *tcedev;
> struct list_head rtas_tokens;
> #endif
> #ifdef CONFIG_KVM_MPIC
> @@ -612,6 +619,7 @@ struct kvm_vcpu_arch {
> u64 busy_preempt;
>
> unsigned long *tce_tmp_hpas; /* TCE cache for TCE_PUT_INDIRECT */
> + unsigned long tce_tmp_num; /* Number of handled TCEs in cache */
> enum {
> TCERM_NONE,
> TCERM_GETPAGE,
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index c1ae1e5..a9d3d73 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -512,11 +512,6 @@ struct kvm_get_htab_header {
> #define KVM_XICS_PENDING (1ULL << 42)
>
> /* SPAPR TCE IOMMU device specification */
> -struct kvm_create_spapr_tce_iommu_linkage {
> - __u64 liobn;
> - __u32 fd;
> - __u32 flags;
> -};
> #define KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE 0
>
> #endif /* __LINUX_KVM_POWERPC_H */
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 2880d2b..0978794 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -29,6 +29,8 @@
> #include <linux/anon_inodes.h>
> #include <linux/module.h>
> #include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/file.h>
>
> #include <asm/tlbflush.h>
> #include <asm/kvm_ppc.h>
> @@ -158,10 +160,8 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> int i;
>
> /* Check this LIOBN hasn't been previously allocated */
> - list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
> - if (stt->liobn == args->liobn)
> - return -EBUSY;
> - }
> + if (kvmppc_find_tce_table(kvm, args->liobn))
> + return -EBUSY;
>
> npages = kvmppc_stt_npages(args->window_size);
>
> @@ -201,9 +201,175 @@ fail:
> return ret;
> }
>
> -/* Converts guest physical address to host virtual address */
> +static void kvmppc_spapr_tce_iommu_table_destroy(
> + struct kvm_device *dev,
> + struct kvmppc_spapr_tce_table *tt)
> +{
> + struct kvm *kvm = dev->kvm;
> +
> + mutex_lock(&kvm->lock);
> + list_del(&tt->list);
> +
> + if (tt->vfio_grp)
> + kvmppc_vfio_group_put_external_user(tt->vfio_grp);
> + iommu_group_put(tt->grp);
> +
> + kfree(tt);
> + mutex_unlock(&kvm->lock);
> +}
> +
> +static int kvmppc_spapr_tce_iommu_create(struct kvm_device *dev, u32 type)
> +{
> + struct kvmppc_spapr_tce_iommu_device *tcedev;
> + int ret = 0;
> +
> + tcedev = kzalloc(sizeof(*tcedev), GFP_KERNEL);
> + if (!tcedev)
> + return -ENOMEM;
> + dev->private = tcedev;
> +
> + INIT_LIST_HEAD(&tcedev->tables);
> +
> + /* Already there ? */
> + mutex_lock(&dev->kvm->lock);
> + if (dev->kvm->arch.tcedev)
> + ret = -EEXIST;
> + else
> + dev->kvm->arch.tcedev = tcedev;
> + mutex_unlock(&dev->kvm->lock);
> +
> + if (ret)
Need to free tcedev here.

> + return ret;
> +
> + return 0;
> +}
> +

--
Gleb.

2013-10-30 05:33:54

by Bharat Bhushan

[permalink] [raw]
Subject: RE: [PATCH v9 01/13] KVM: PPC: POWERNV: move iommu_add_device earlier

Hi Alex,

Looks like this patch is not picked by anyone, Are you going to pick this patch?
My vfio/iommu patches have dependency on this patch (this is already tested by me).

Thanks
-Bharat

> -----Original Message-----
> From: Linuxppc-dev [mailto:linuxppc-dev-
> [email protected]] On Behalf Of Alexey
> Kardashevskiy
> Sent: Wednesday, August 28, 2013 2:08 PM
> To: [email protected]
> Cc: [email protected]; Gleb Natapov; Alexey Kardashevskiy; Alexander Graf;
> [email protected]; [email protected]; [email protected]; Paul
> Mackerras; Paolo Bonzini; David Gibson
> Subject: [PATCH v9 01/13] KVM: PPC: POWERNV: move iommu_add_device earlier
>
> The current implementation of IOMMU on sPAPR does not use iommu_ops and
> therefore does not call IOMMU API's bus_set_iommu() which
> 1) sets iommu_ops for a bus
> 2) registers a bus notifier
> Instead, PCI devices are added to IOMMU groups from
> subsys_initcall_sync(tce_iommu_init) which does basically the same thing without
> using iommu_ops callbacks.
>
> However Freescale PAMU driver (https://lkml.org/lkml/2013/7/1/158)
> implements iommu_ops and when tce_iommu_init is called, every PCI device is
> already added to some group so there is a conflict.
>
> This patch does 2 things:
> 1. removes the loop in which PCI devices were added to groups and adds explicit
> iommu_add_device() calls to add devices as soon as they get the iommu_table
> pointer assigned to them.
> 2. moves a bus notifier to powernv code in order to avoid conflict with the
> notifier from Freescale driver.
>
> iommu_add_device() and iommu_del_device() are public now.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> Changes:
> v8:
> * added the check for iommu_group!=NULL before removing device from a group as
> suggested by Wei Yang <[email protected]>
>
> v2:
> * added a helper - set_iommu_table_base_and_group - which does
> set_iommu_table_base() and iommu_add_device()
> ---
> arch/powerpc/include/asm/iommu.h | 9 +++++++
> arch/powerpc/kernel/iommu.c | 41 +++--------------------------
> arch/powerpc/platforms/powernv/pci-ioda.c | 8 +++---
> arch/powerpc/platforms/powernv/pci-p5ioc2.c | 2 +-
> arch/powerpc/platforms/powernv/pci.c | 33 ++++++++++++++++++++++-
> arch/powerpc/platforms/pseries/iommu.c | 8 +++---
> 6 files changed, 55 insertions(+), 46 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index c34656a..19ad77f 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -103,6 +103,15 @@ extern struct iommu_table *iommu_init_table(struct
> iommu_table * tbl,
> int nid);
> extern void iommu_register_group(struct iommu_table *tbl,
> int pci_domain_number, unsigned long pe_num);
> +extern int iommu_add_device(struct device *dev); extern void
> +iommu_del_device(struct device *dev);
> +
> +static inline void set_iommu_table_base_and_group(struct device *dev,
> + void *base)
> +{
> + set_iommu_table_base(dev, base);
> + iommu_add_device(dev);
> +}
>
> extern int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
> struct scatterlist *sglist, int nelems, diff --git
> a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index
> b20ff17..15f8ca8 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -1105,7 +1105,7 @@ void iommu_release_ownership(struct iommu_table *tbl) }
> EXPORT_SYMBOL_GPL(iommu_release_ownership);
>
> -static int iommu_add_device(struct device *dev)
> +int iommu_add_device(struct device *dev)
> {
> struct iommu_table *tbl;
> int ret = 0;
> @@ -1134,46 +1134,13 @@ static int iommu_add_device(struct device *dev)
>
> return ret;
> }
> +EXPORT_SYMBOL_GPL(iommu_add_device);
>
> -static void iommu_del_device(struct device *dev)
> +void iommu_del_device(struct device *dev)
> {
> iommu_group_remove_device(dev);
> }
> -
> -static int iommu_bus_notifier(struct notifier_block *nb,
> - unsigned long action, void *data)
> -{
> - struct device *dev = data;
> -
> - switch (action) {
> - case BUS_NOTIFY_ADD_DEVICE:
> - return iommu_add_device(dev);
> - case BUS_NOTIFY_DEL_DEVICE:
> - iommu_del_device(dev);
> - return 0;
> - default:
> - return 0;
> - }
> -}
> -
> -static struct notifier_block tce_iommu_bus_nb = {
> - .notifier_call = iommu_bus_notifier,
> -};
> -
> -static int __init tce_iommu_init(void)
> -{
> - struct pci_dev *pdev = NULL;
> -
> - BUILD_BUG_ON(PAGE_SIZE < IOMMU_PAGE_SIZE);
> -
> - for_each_pci_dev(pdev)
> - iommu_add_device(&pdev->dev);
> -
> - bus_register_notifier(&pci_bus_type, &tce_iommu_bus_nb);
> - return 0;
> -}
> -
> -subsys_initcall_sync(tce_iommu_init);
> +EXPORT_SYMBOL_GPL(iommu_del_device);
>
> #else
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index d8140b1..756bb58 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -440,7 +440,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb,
> struct pci_dev *pdev
> return;
>
> pe = &phb->ioda.pe_array[pdn->pe_number];
> - set_iommu_table_base(&pdev->dev, &pe->tce32_table);
> + set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
> }
>
> static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
> @@ -448,7 +448,7 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
> struct pci_bus *bus)
> struct pci_dev *dev;
>
> list_for_each_entry(dev, &bus->devices, bus_list) {
> - set_iommu_table_base(&dev->dev, &pe->tce32_table);
> + set_iommu_table_base_and_group(&dev->dev, &pe->tce32_table);
> if (dev->subordinate)
> pnv_ioda_setup_bus_dma(pe, dev->subordinate);
> }
> @@ -611,7 +611,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> iommu_register_group(tbl, pci_domain_nr(pe->pbus), pe->pe_number);
>
> if (pe->pdev)
> - set_iommu_table_base(&pe->pdev->dev, tbl);
> + set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
> else
> pnv_ioda_setup_bus_dma(pe, pe->pbus);
>
> @@ -687,7 +687,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> iommu_init_table(tbl, phb->hose->node);
>
> if (pe->pdev)
> - set_iommu_table_base(&pe->pdev->dev, tbl);
> + set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
> else
> pnv_ioda_setup_bus_dma(pe, pe->pbus);
>
> diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> index b68db63..ede341b 100644
> --- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> +++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> @@ -92,7 +92,7 @@ static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
> pci_domain_nr(phb->hose->bus), phb->opal_id);
> }
>
> - set_iommu_table_base(&pdev->dev, &phb->p5ioc2.iommu_table);
> + set_iommu_table_base_and_group(&pdev->dev, &phb->p5ioc2.iommu_table);
> }
>
> static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
> diff --git a/arch/powerpc/platforms/powernv/pci.c
> b/arch/powerpc/platforms/powernv/pci.c
> index a28d3b5..c005011 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -504,7 +504,7 @@ static void pnv_pci_dma_fallback_setup(struct pci_controller
> *hose,
> pdn->iommu_table = pnv_pci_setup_bml_iommu(hose);
> if (!pdn->iommu_table)
> return;
> - set_iommu_table_base(&pdev->dev, pdn->iommu_table);
> + set_iommu_table_base_and_group(&pdev->dev, pdn->iommu_table);
> }
>
> static void pnv_pci_dma_dev_setup(struct pci_dev *pdev) @@ -623,3 +623,34 @@
> void __init pnv_pci_init(void)
> ppc_md.teardown_msi_irqs = pnv_teardown_msi_irqs; #endif }
> +
> +static int tce_iommu_bus_notifier(struct notifier_block *nb,
> + unsigned long action, void *data)
> +{
> + struct device *dev = data;
> +
> + switch (action) {
> + case BUS_NOTIFY_ADD_DEVICE:
> + return iommu_add_device(dev);
> + case BUS_NOTIFY_DEL_DEVICE:
> + if (dev->iommu_group)
> + iommu_del_device(dev);
> + return 0;
> + default:
> + return 0;
> + }
> +}
> +
> +static struct notifier_block tce_iommu_bus_nb = {
> + .notifier_call = tce_iommu_bus_notifier, };
> +
> +static int __init tce_iommu_bus_notifier_init(void) {
> + BUILD_BUG_ON(PAGE_SIZE < IOMMU_PAGE_SIZE);
> +
> + bus_register_notifier(&pci_bus_type, &tce_iommu_bus_nb);
> + return 0;
> +}
> +
> +subsys_initcall_sync(tce_iommu_bus_notifier_init);
> diff --git a/arch/powerpc/platforms/pseries/iommu.c
> b/arch/powerpc/platforms/pseries/iommu.c
> index 23fc1dc..884ae71 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -687,7 +687,8 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
> iommu_table_setparms(phb, dn, tbl);
> PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
> iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
> - set_iommu_table_base(&dev->dev, PCI_DN(dn)->iommu_table);
> + set_iommu_table_base_and_group(&dev->dev,
> + PCI_DN(dn)->iommu_table);
> return;
> }
>
> @@ -699,7 +700,8 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
> dn = dn->parent;
>
> if (dn && PCI_DN(dn))
> - set_iommu_table_base(&dev->dev, PCI_DN(dn)->iommu_table);
> + set_iommu_table_base_and_group(&dev->dev,
> + PCI_DN(dn)->iommu_table);
> else
> printk(KERN_WARNING "iommu: Device %s has no iommu table\n",
> pci_name(dev));
> @@ -1193,7 +1195,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev
> *dev)
> pr_debug(" found DMA window, table: %p\n", pci->iommu_table);
> }
>
> - set_iommu_table_base(&dev->dev, pci->iommu_table);
> + set_iommu_table_base_and_group(&dev->dev, pci->iommu_table);
> }
>
> static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
> --
> 1.8.4.rc4
>
> _______________________________________________
> Linuxppc-dev mailing list
> [email protected]
> https://lists.ozlabs.org/listinfo/linuxppc-dev