The changes are:
1. rebased on v3.11-rc1 so the capability numbers changed again
2. addressed multiple comments from maintainers
3. "KVM: PPC: Add support for IOMMU in-kernel handling" is split into
2 patches, the new one is "powerpc/iommu: rework to support realmode".
4. IOMMU_API is now always enabled for KVM_BOOK3S_64.
More details are in the individual patch comments.
Depends on "hashtable: add hash_for_each_possible_rcu_notrace()",
posted a while ago.
Alexey Kardashevskiy (10):
KVM: PPC: reserve a capability number for multitce support
KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO
vfio: add external user support
powerpc: Prepare to support kernel handling of IOMMU map/unmap
powerpc: add real mode support for dma operations on powernv
KVM: PPC: enable IOMMU_API for KVM_BOOK3S_64 permanently
KVM: PPC: Add support for multiple-TCE hcalls
powerpc/iommu: rework to support realmode
KVM: PPC: Add support for IOMMU in-kernel handling
KVM: PPC: Add hugepage support for IOMMU in-kernel handling
Documentation/virtual/kvm/api.txt | 52 +++
arch/powerpc/include/asm/iommu.h | 9 +-
arch/powerpc/include/asm/kvm_host.h | 37 +++
arch/powerpc/include/asm/kvm_ppc.h | 18 +-
arch/powerpc/include/asm/machdep.h | 12 +
arch/powerpc/include/asm/pgtable-ppc64.h | 4 +
arch/powerpc/include/uapi/asm/kvm.h | 7 +
arch/powerpc/kernel/iommu.c | 201 +++++++----
arch/powerpc/kvm/Kconfig | 1 +
arch/powerpc/kvm/book3s_64_vio.c | 533 +++++++++++++++++++++++++++++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 404 ++++++++++++++++++++--
arch/powerpc/kvm/book3s_hv.c | 41 ++-
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 8 +-
arch/powerpc/kvm/book3s_pr_papr.c | 35 ++
arch/powerpc/kvm/powerpc.c | 15 +
arch/powerpc/mm/init_64.c | 76 ++++-
arch/powerpc/platforms/powernv/pci-ioda.c | 47 ++-
arch/powerpc/platforms/powernv/pci.c | 38 ++-
arch/powerpc/platforms/powernv/pci.h | 3 +-
drivers/vfio/vfio.c | 62 ++++
include/linux/page-flags.h | 4 +-
include/linux/vfio.h | 7 +
include/uapi/linux/kvm.h | 5 +
23 files changed, 1496 insertions(+), 123 deletions(-)
--
1.8.3.2
This is to reserve a capability number for the upcoming support
of the H_PUT_TCE_INDIRECT and H_STUFF_TCE pseries hypercalls
which support multiple DMA map/unmap operations per call.
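For context, a minimal user space sketch of probing the new capability once
it is wired up (standard KVM_CHECK_EXTENSION usage, not part of this patch;
kvm_fd is assumed to be an open /dev/kvm descriptor):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Returns non-zero if the kernel advertises multi-TCE support. */
static int has_multitce(int kvm_fd)
{
        return ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_MULTITCE);
}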
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
2013/07/16:
* changed the number
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
include/uapi/linux/kvm.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index acccd08..99c2533 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
#define KVM_CAP_PPC_RTAS 91
#define KVM_CAP_IRQ_XICS 92
#define KVM_CAP_ARM_EL1_32BIT 93
+#define KVM_CAP_SPAPR_MULTITCE 94
#ifdef KVM_CAP_IRQ_ROUTING
--
1.8.3.2
This is to reserve a capability and an ioctl number for the upcoming
support of VFIO-IOMMU DMA operations in real mode.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
2013/07/16:
* changed the number
2013/07/11:
* changed order in a file, added comment about a gap in ioctl number
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
include/uapi/linux/kvm.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 99c2533..53c3f1f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
#define KVM_CAP_IRQ_XICS 92
#define KVM_CAP_ARM_EL1_32BIT 93
#define KVM_CAP_SPAPR_MULTITCE 94
+#define KVM_CAP_SPAPR_TCE_IOMMU 95
#ifdef KVM_CAP_IRQ_ROUTING
@@ -933,6 +934,9 @@ struct kvm_s390_ucas_mapping {
#define KVM_ARM_SET_DEVICE_ADDR _IOW(KVMIO, 0xab, struct kvm_arm_device_addr)
/* Available with KVM_CAP_PPC_RTAS */
#define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO, 0xac, struct kvm_rtas_token_args)
+/* 0xad and 0xae are already taken */
+/* Available with KVM_CAP_SPAPR_TCE_IOMMU */
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xaf, struct kvm_create_spapr_tce_iommu)
/* ioctl for vm fd */
#define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device)
--
1.8.3.2
The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU sends requests to map/unmap pages.
However this approach is really slow, so we want to move it to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a "fast" hypercall directly in "real
mode" (processor still in the guest context but MMU off).
To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.
This adds an API to increment/decrement the page reference count as
the get_user_pages API used for user mode mapping does not work
in real mode.
CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.
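For context, a minimal sketch of how a real-mode caller is expected to use
this API (the helper name rm_try_pin_pfn is made up for illustration; the
error handling pattern mirrors the KVM patches later in this series):

#include <linux/mm.h>           /* struct page */
#include <asm/pgtable.h>        /* realmode_pfn_to_page() and friends */

/* Pin a pfn from real mode; on any failure the caller is expected to
 * bail out and retry the whole operation in virtual mode. */
static int rm_try_pin_pfn(unsigned long pfn, struct page **pg)
{
        *pg = realmode_pfn_to_page(pfn);
        if (!*pg)
                return -EAGAIN; /* page struct not reachable in real mode */

        if (realmode_get_page(*pg)) {
                *pg = NULL;
                return -EAGAIN; /* huge page or zero reference count */
        }

        /* ... use the page; drop the reference with realmode_put_page() */
        return 0;
}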
Cc: [email protected]
Reviewed-by: Paul Mackerras <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
2013/07/10:
* adjusted comment (removed sentence about virtual mode)
* get_page_unless_zero replaced with atomic_inc_not_zero to minimize
effect of a possible get_page_unless_zero() rework (if it ever happens).
2013/06/27:
* realmode_get_page() fixed to use get_page_unless_zero(). If failed,
the call will be passed from real to virtual mode and safely handled.
* added comment to PageCompound() in include/linux/page-flags.h.
2013/05/20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/pgtable-ppc64.h | 4 ++
arch/powerpc/mm/init_64.c | 76 +++++++++++++++++++++++++++++++-
include/linux/page-flags.h | 4 +-
3 files changed, 82 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 46db094..aa7b169 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -394,6 +394,10 @@ static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array,
hpte_slot_array[index] = hidx << 4 | 0x1 << 3;
}
+struct page *realmode_pfn_to_page(unsigned long pfn);
+int realmode_get_page(struct page *page);
+int realmode_put_page(struct page *page);
+
static inline char *get_hpte_slot_array(pmd_t *pmdp)
{
/*
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index d0cd9e4..dcbb806 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -300,5 +300,79 @@ void vmemmap_free(unsigned long start, unsigned long end)
{
}
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fall back to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * Any of realmode_XXXX functions can fail due to:
+ * 1) As real sparsemem blocks do not lie in RAM contiguously (they
+ * are in virtual address space which is not available in real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct vmemmap_backing *vmem_back;
+ struct page *page;
+ unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+ unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+ for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+ if (pg_va < vmem_back->virt_addr)
+ continue;
+
+ /* Check that page struct is not split between real pages */
+ if ((pg_va + sizeof(struct page)) >
+ (vmem_back->virt_addr + page_size))
+ return NULL;
+
+ page = (struct page *) (vmem_back->phys + pg_va -
+ vmem_back->virt_addr);
+ return page;
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+ return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+int realmode_get_page(struct page *page)
+{
+ if (PageCompound(page))
+ return -EAGAIN;
+
+ if (!atomic_inc_not_zero(&page->_count))
+ return -EAGAIN;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_get_page);
+
+int realmode_put_page(struct page *page)
+{
+ if (PageCompound(page))
+ return -EAGAIN;
+
+ if (!atomic_add_unless(&page->_count, -1, 1))
+ return -EAGAIN;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_put_page);
+#endif
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..98ada58 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -329,7 +329,9 @@ static inline void set_page_writeback(struct page *page)
* System with lots of page flags available. This allows separate
* flags for PageHead() and PageTail() checks of compound pages so that bit
* tests can be used in performance sensitive paths. PageCompound is
- * generally not used in hot code paths.
+ * generally not used in hot code paths except arch/powerpc/mm/init_64.c
+ * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages
+ * and avoid handling those in real mode.
*/
__PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
__PAGEFLAG(Tail, tail)
--
1.8.3.2
It does not make much sense to have KVM on book3s 64bit and
not to have the IOMMU bits for PCI pass through support as it costs little
and allows VFIO to function on book3s KVM.
Having IOMMU_API always enabled makes it unnecessary to wrap a lot of
the TCE handling code in #ifdef IOMMU_API; without it, those files would
only accelerate user space emulated devices, which do not seem to be used
very often.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/kvm/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index c55c538..3b2b761 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -59,6 +59,7 @@ config KVM_BOOK3S_64
depends on PPC_BOOK3S_64
select KVM_BOOK3S_64_HANDLER
select KVM
+ select SPAPR_TCE_IOMMU
---help---
Support running unmodified book3s_64 and book3s_32 guest kernels
in virtual machines on book3s_64 host processors.
--
1.8.3.2
The existing TCE machine calls (tce_build and tce_free) only support
virtual mode as they call __raw_writeq for TCE invalidation, which
fails in real mode.
This introduces tce_build_rm and tce_free_rm real mode versions
which do mostly the same but use "Store Doubleword Caching Inhibited
Indexed" instruction for TCE invalidation.
This new feature is going to be utilized by real mode support of VFIO.
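For context, a hedged sketch of how a mode-aware caller is expected to pick
between the normal and the new _rm hooks (the actual dispatch lands in the
later iommu rework patch of this series; the function name here is
illustrative only):

#include <linux/dma-mapping.h>  /* enum dma_data_direction */
#include <asm/machdep.h>        /* ppc_md, tce_build/tce_build_rm hooks */
#include <asm/iommu.h>          /* struct iommu_table */

/* Program one TCE, choosing the real-mode hook when running with the
 * MMU off; punt (-EAGAIN) if the platform has no _rm variant. */
static int tce_build_one(struct iommu_table *tbl, long entry,
                         unsigned long hva,
                         enum dma_data_direction dir, bool rm)
{
        if (rm) {
                if (!ppc_md.tce_build_rm)
                        return -EAGAIN; /* retry in virtual mode */
                return ppc_md.tce_build_rm(tbl, entry, 1, hva, dir, NULL);
        }
        return ppc_md.tce_build(tbl, entry, 1, hva, dir, NULL);
}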
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
2013/11/07:
* added comment why stdcix cannot be used in virtual mode
2013/08/07:
* tested on p7ioc and fixed a bug with realmode addresses
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/machdep.h | 12 ++++++++
arch/powerpc/platforms/powernv/pci-ioda.c | 47 +++++++++++++++++++++++--------
arch/powerpc/platforms/powernv/pci.c | 38 +++++++++++++++++++++----
arch/powerpc/platforms/powernv/pci.h | 3 +-
4 files changed, 81 insertions(+), 19 deletions(-)
diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 8b48090..07dd3b1 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -78,6 +78,18 @@ struct machdep_calls {
long index);
void (*tce_flush)(struct iommu_table *tbl);
+ /* _rm versions are for real mode use only */
+ int (*tce_build_rm)(struct iommu_table *tbl,
+ long index,
+ long npages,
+ unsigned long uaddr,
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs);
+ void (*tce_free_rm)(struct iommu_table *tbl,
+ long index,
+ long npages);
+ void (*tce_flush_rm)(struct iommu_table *tbl);
+
void __iomem * (*ioremap)(phys_addr_t addr, unsigned long size,
unsigned long flags, void *caller);
void (*iounmap)(volatile void __iomem *token);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 49b57b9..c358f82 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -70,6 +70,16 @@ define_pe_printk_level(pe_err, KERN_ERR);
define_pe_printk_level(pe_warn, KERN_WARNING);
define_pe_printk_level(pe_info, KERN_INFO);
+/*
+ * stdcix is only supposed to be used in hypervisor real mode as per
+ * the architecture spec
+ */
+static inline void __raw_rm_writeq(u64 val, volatile void __iomem *paddr)
+{
+ __asm__ __volatile__("stdcix %0,0,%1"
+ : : "r" (val), "r" (paddr) : "memory");
+}
+
static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
{
unsigned long pe;
@@ -454,10 +464,13 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
}
}
-static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
- u64 *startp, u64 *endp)
+static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
+ struct iommu_table *tbl,
+ u64 *startp, u64 *endp, bool rm)
{
- u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
+ u64 __iomem *invalidate = rm?
+ (u64 __iomem *)pe->tce_inval_reg_phys:
+ (u64 __iomem *)tbl->it_index;
unsigned long start, end, inc;
start = __pa(startp);
@@ -484,7 +497,10 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
mb(); /* Ensure above stores are visible */
while (start <= end) {
- __raw_writeq(start, invalidate);
+ if (rm)
+ __raw_rm_writeq(start, invalidate);
+ else
+ __raw_writeq(start, invalidate);
start += inc;
}
@@ -496,10 +512,12 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
struct iommu_table *tbl,
- u64 *startp, u64 *endp)
+ u64 *startp, u64 *endp, bool rm)
{
unsigned long start, end, inc;
- u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
+ u64 __iomem *invalidate = rm?
+ (u64 __iomem *)pe->tce_inval_reg_phys:
+ (u64 __iomem *)tbl->it_index;
/* We'll invalidate DMA address in PE scope */
start = 0x2ul << 60;
@@ -515,22 +533,25 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
mb();
while (start <= end) {
- __raw_writeq(start, invalidate);
+ if (rm)
+ __raw_rm_writeq(start, invalidate);
+ else
+ __raw_writeq(start, invalidate);
start += inc;
}
}
void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
- u64 *startp, u64 *endp)
+ u64 *startp, u64 *endp, bool rm)
{
struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
tce32_table);
struct pnv_phb *phb = pe->phb;
if (phb->type == PNV_PHB_IODA1)
- pnv_pci_ioda1_tce_invalidate(tbl, startp, endp);
+ pnv_pci_ioda1_tce_invalidate(pe, tbl, startp, endp, rm);
else
- pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp);
+ pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
}
static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
@@ -603,7 +624,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
* bus number, print that out instead.
*/
tbl->it_busno = 0;
- tbl->it_index = (unsigned long)ioremap(be64_to_cpup(swinvp), 8);
+ pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
+ tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys, 8);
tbl->it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE |
TCE_PCI_SWINV_PAIR;
}
@@ -681,7 +703,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
* bus number, print that out instead.
*/
tbl->it_busno = 0;
- tbl->it_index = (unsigned long)ioremap(be64_to_cpup(swinvp), 8);
+ pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
+ tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys, 8);
tbl->it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE;
}
iommu_init_table(tbl, phb->hose->node);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index a28d3b5..2b58549 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -401,7 +401,7 @@ struct pci_ops pnv_pci_ops = {
static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
unsigned long uaddr, enum dma_data_direction direction,
- struct dma_attrs *attrs)
+ struct dma_attrs *attrs, bool rm)
{
u64 proto_tce;
u64 *tcep, *tces;
@@ -423,12 +423,19 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
* of flags if that becomes the case
*/
if (tbl->it_type & TCE_PCI_SWINV_CREATE)
- pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1);
+ pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
return 0;
}
-static void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
+static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long npages,
+ unsigned long uaddr, enum dma_data_direction direction,
+ struct dma_attrs *attrs)
+{
+ return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, false);
+}
+
+static void pnv_tce_free(struct iommu_table *tbl, long index, long npages, bool rm)
{
u64 *tcep, *tces;
@@ -438,7 +445,12 @@ static void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
*(tcep++) = 0;
if (tbl->it_type & TCE_PCI_SWINV_FREE)
- pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1);
+ pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
+}
+
+static void pnv_tce_free_vm(struct iommu_table *tbl, long index, long npages)
+{
+ pnv_tce_free(tbl, index, npages, false);
}
static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
@@ -446,6 +458,18 @@ static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
return ((u64 *)tbl->it_base)[index - tbl->it_offset];
}
+static int pnv_tce_build_rm(struct iommu_table *tbl, long index, long npages,
+ unsigned long uaddr, enum dma_data_direction direction,
+ struct dma_attrs *attrs)
+{
+ return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, true);
+}
+
+static void pnv_tce_free_rm(struct iommu_table *tbl, long index, long npages)
+{
+ pnv_tce_free(tbl, index, npages, true);
+}
+
void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
void *tce_mem, u64 tce_size,
u64 dma_offset)
@@ -610,8 +634,10 @@ void __init pnv_pci_init(void)
/* Configure IOMMU DMA hooks */
ppc_md.pci_dma_dev_setup = pnv_pci_dma_dev_setup;
- ppc_md.tce_build = pnv_tce_build;
- ppc_md.tce_free = pnv_tce_free;
+ ppc_md.tce_build = pnv_tce_build_vm;
+ ppc_md.tce_free = pnv_tce_free_vm;
+ ppc_md.tce_build_rm = pnv_tce_build_rm;
+ ppc_md.tce_free_rm = pnv_tce_free_rm;
ppc_md.tce_get = pnv_tce_get;
ppc_md.pci_probe_mode = pnv_pci_probe_mode;
set_pci_dma_ops(&dma_iommu_ops);
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index d633c64..170dd98 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -52,6 +52,7 @@ struct pnv_ioda_pe {
int tce32_seg;
int tce32_segcount;
struct iommu_table tce32_table;
+ phys_addr_t tce_inval_reg_phys;
/* XXX TODO: Add support for additional 64-bit iommus */
@@ -193,6 +194,6 @@ extern void pnv_pci_init_p5ioc2_hub(struct device_node *np);
extern void pnv_pci_init_ioda_hub(struct device_node *np);
extern void pnv_pci_init_ioda2_phb(struct device_node *np);
extern void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
- u64 *startp, u64 *endp);
+ u64 *startp, u64 *endp, bool rm);
#endif /* __POWERNV_PCI_H */
--
1.8.3.2
This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for user space emulated devices such as IBMVIO
devices or emulated PCI. These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.
This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user and verified) before writing the whole list into
the TCE table. This cache will be utilized more by the upcoming
VFIO/IOMMU support to continue TCE list processing in virtual
mode if the real mode handler fails for some reason.
This adds a function to convert a guest physical address to a host
virtual address in order to parse a TCE list from H_PUT_TCE_INDIRECT.
This also implements the KVM_CAP_PPC_MULTITCE capability. When present,
the hypercalls mentioned above may or may not be processed successfully
in the kernel based fast path. If they can not be handled by the kernel,
they will get passed on to user space. So user space still has to have
an implementation for these despite the in kernel acceleration.
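For context, a hedged guest-side sketch of what these hypercalls look like
from a pSeries guest (this is not part of the patch; plpar_hcall_norets() is
the existing pSeries hypercall wrapper, and the list layout follows the
512-entry/4K-page limit mentioned above):

#include <linux/types.h>
#include <asm/hvcall.h>         /* H_PUT_TCE_INDIRECT, H_STUFF_TCE */
#include <asm/page.h>           /* __pa() */

/* Map 'npages' IOMMU pages with a single hypercall: 'tce_list' is a
 * page-aligned array of up to 512 ready-made TCE values. */
static long guest_put_tce_indirect(unsigned long liobn, unsigned long ioba,
                                   u64 *tce_list, unsigned long npages)
{
        return plpar_hcall_norets(H_PUT_TCE_INDIRECT, liobn, ioba,
                                  __pa(tce_list), npages);
}

/* Clear 'npages' entries starting at 'ioba' in one call. */
static long guest_stuff_tce(unsigned long liobn, unsigned long ioba,
                            unsigned long npages)
{
        return plpar_hcall_norets(H_STUFF_TCE, liobn, ioba, 0, npages);
}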
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changelog:
2013/07/11:
* addressed many, many comments from maintainers
2013/07/06:
* fixed number of wrong get_page()/put_page() calls
2013/06/27:
* fixed clear of BUSY bit in kvmppc_lookup_pte()
* H_PUT_TCE_INDIRECT does realmode_get_page() now
* KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
* updated doc
2013/06/05:
* fixed a typo about IBMVIO in the commit message
* updated doc and moved it to another section
* changed capability number
2013/05/21:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed, instead we do not even start
writing to TCE table if we cannot get TCEs from the user and they are
invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed a bug with the fallthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Documentation/virtual/kvm/api.txt | 26 ++++
arch/powerpc/include/asm/kvm_host.h | 9 ++
arch/powerpc/include/asm/kvm_ppc.h | 16 +-
arch/powerpc/kvm/book3s_64_vio.c | 132 +++++++++++++++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 264 ++++++++++++++++++++++++++++----
arch/powerpc/kvm/book3s_hv.c | 41 ++++-
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 8 +-
arch/powerpc/kvm/book3s_pr_papr.c | 35 +++++
arch/powerpc/kvm/powerpc.c | 3 +
9 files changed, 500 insertions(+), 34 deletions(-)
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index ef925ea..1c8942a 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2382,6 +2382,32 @@ calls by the guest for that service will be passed to userspace to be
handled.
+4.86 KVM_CAP_PPC_MULTITCE
+
+Capability: KVM_CAP_PPC_MULTITCE
+Architectures: ppc
+Type: vm
+
+This capability means the kernel is capable of handling hypercalls
+H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
+space. This significantly accelerates DMA operations for PPC KVM guests.
+User space should expect that its handlers for these hypercalls
+are not going to be called if user space previously registered LIOBN
+in KVM (via KVM_CREATE_SPAPR_TCE or similar calls).
+
+In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
+user space might have to advertise it for the guest. For example,
+IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is
+present in the "ibm,hypertas-functions" device-tree property.
+
+The hypercalls mentioned above may or may not be processed successfully
+in the kernel based fast path. If they can not be handled by the kernel,
+they will get passed on to user space. So user space still has to have
+an implementation for these despite the in kernel acceleration.
+
+This capability is always enabled.
+
+
5. The kvm_run structure
------------------------
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index af326cd..b8fe3de 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -30,6 +30,7 @@
#include <linux/kvm_para.h>
#include <linux/list.h>
#include <linux/atomic.h>
+#include <linux/tracepoint.h>
#include <asm/kvm_asm.h>
#include <asm/processor.h>
#include <asm/page.h>
@@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
spinlock_t tbacct_lock;
u64 busy_stolen;
u64 busy_preempt;
+
+ unsigned long *tce_tmp_hpas; /* TCE cache for TCE_PUT_INDIRECT hcall */
+ enum {
+ TCERM_NONE,
+ TCERM_GETPAGE,
+ TCERM_PUTTCE,
+ TCERM_PUTLIST,
+ } tce_rm_fail; /* failed stage of request processing */
#endif
};
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index a5287fe..0ce4691 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
-extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
- unsigned long ioba, unsigned long tce);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
+ struct kvm_vcpu *vcpu, unsigned long liobn);
+extern long kvmppc_tce_validate(unsigned long tce);
+extern void kvmppc_tce_put(struct kvmppc_spapr_tce_table *tt,
+ unsigned long ioba, unsigned long tce);
+extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce);
+extern long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_list, unsigned long npages);
+extern long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages);
extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
struct kvm_allocate_rma *rma);
extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index b2d3f3b..0131bf9 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -14,6 +14,7 @@
*
* Copyright 2010 Paul Mackerras, IBM Corp. <[email protected]>
* Copyright 2011 David Gibson, IBM Corporation <[email protected]>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <[email protected]>
*/
#include <linux/types.h>
@@ -36,8 +37,10 @@
#include <asm/ppc-opcode.h>
#include <asm/kvm_host.h>
#include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
-#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR ((void *)~(unsigned long)0x0)
static long kvmppc_stt_npages(unsigned long window_size)
{
@@ -148,3 +151,130 @@ fail:
}
return ret;
}
+
+/* Converts guest physical address to host virtual address */
+static void __user *kvmppc_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
+ unsigned long gpa, struct page **pg)
+{
+ unsigned long hva, gfn = gpa >> PAGE_SHIFT;
+ struct kvm_memory_slot *memslot;
+ static const int is_write = 0;
+
+ memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+ if (!memslot)
+ return ERROR_ADDR;
+
+ hva = __gfn_to_hva_memslot(memslot, gfn) | (gpa & ~PAGE_MASK);
+
+ if (get_user_pages_fast(hva & PAGE_MASK, 1, is_write, pg) != 1)
+ return ERROR_ADDR;
+
+ return (void *) hva;
+}
+
+long kvmppc_h_put_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce)
+{
+ long ret;
+ struct kvmppc_spapr_tce_table *tt;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ if (!tt)
+ return H_TOO_HARD;
+
+ if (ioba >= tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_tce_validate(tce);
+ if (ret)
+ return ret;
+
+ kvmppc_tce_put(tt, ioba, tce);
+
+ return H_SUCCESS;
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_list, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret = H_SUCCESS;
+ unsigned long __user *tces;
+ struct page *pg = NULL;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ if (!tt)
+ return H_TOO_HARD;
+
+ /*
+ * The spec says that the maximum size of the list is 512 TCEs
+ * so the whole list fits within a single 4K page
+ */
+ if (npages > 512)
+ return H_PARAMETER;
+
+ if (tce_list & ~IOMMU_PAGE_MASK)
+ return H_PARAMETER;
+
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ tces = kvmppc_gpa_to_hva_and_get(vcpu, tce_list, &pg);
+ if (tces == ERROR_ADDR)
+ return H_TOO_HARD;
+
+ if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
+ goto put_list_page_exit;
+
+ for (i = 0; i < npages; ++i) {
+ if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
+ ret = H_PARAMETER;
+ goto put_list_page_exit;
+ }
+
+ ret = kvmppc_tce_validate(vcpu->arch.tce_tmp_hpas[i]);
+ if (ret)
+ goto put_list_page_exit;
+ }
+
+ for (i = 0; i < npages; ++i)
+ kvmppc_tce_put(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+ vcpu->arch.tce_tmp_hpas[i]);
+put_list_page_exit:
+ if (pg)
+ put_page(pg);
+
+ if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
+ vcpu->arch.tce_rm_fail = TCERM_NONE;
+ if (pg && !PageCompound(pg))
+ put_page(pg); /* finish pending realmode_put_page() */
+ }
+
+ return ret;
+}
+
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ if (!tt)
+ return H_TOO_HARD;
+
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_tce_validate(tce_value);
+ if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+ return H_PARAMETER;
+
+ for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+ kvmppc_tce_put(tt, ioba, tce_value);
+
+ return H_SUCCESS;
+}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 30c2f3b..9b0372f 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -14,6 +14,7 @@
*
* Copyright 2010 Paul Mackerras, IBM Corp. <[email protected]>
* Copyright 2011 David Gibson, IBM Corporation <[email protected]>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <[email protected]>
*/
#include <linux/types.h>
@@ -35,42 +36,247 @@
#include <asm/ppc-opcode.h>
#include <asm/kvm_host.h>
#include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR (~(unsigned long)0x0)
-/* WARNING: This will be called in real-mode on HV KVM and virtual
+/* Finds a TCE table descriptor by LIOBN.
+ *
+ * WARNING: This will be called in real or virtual mode on HV KVM and virtual
* mode on PR KVM
*/
-long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
+struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
+ unsigned long liobn)
+{
+ struct kvmppc_spapr_tce_table *tt;
+
+ list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
+ if (tt->liobn == liobn)
+ return tt;
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
+
+/*
+ * Validates TCE address.
+ * At the moment only flags are validated.
+ * As the host kernel does not access those addresses (just puts them
+ * to the table and user space is supposed to process them), we can skip
+ * checking other things (such as TCE is a guest RAM address or the page
+ * was actually allocated).
+ *
+ * WARNING: This will be called in real-mode on HV KVM and virtual
+ * mode on PR KVM
+ */
+long kvmppc_tce_validate(unsigned long tce)
+{
+ if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
+ return H_PARAMETER;
+
+ return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_tce_validate);
+
+/* Note on the use of page_address() in real mode,
+ *
+ * It is safe to use page_address() in real mode on ppc64 because
+ * page_address() is always defined as lowmem_page_address()
+ * which returns __va(PFN_PHYS(page_to_pfn(page))), which is an arithmetic
+ * operation and does not access page struct.
+ *
+ * Theoretically page_address() could be defined differently,
+ * but only if either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
+ * is enabled.
+ * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
+ * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
+ * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
+ * is not expected to be enabled on ppc32, page_address()
+ * is safe for ppc32 as well.
+ *
+ * WARNING: This will be called in real-mode on HV KVM and virtual
+ * mode on PR KVM
+ */
+extern u64 *kvmppc_page_address(struct page *page)
+{
+#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
+#error TODO: fix to avoid page_address() here
+#endif
+ return (u64 *)page_address(page);
+}
+
+/*
+ * Handles TCE requests for emulated devices.
+ * Puts guest TCE values to the table and expects user space to convert them.
+ * Called in both real and virtual modes.
+ * Cannot fail so kvmppc_tce_validate must be called before it.
+ *
+ * WARNING: This will be called in real-mode on HV KVM and virtual
+ * mode on PR KVM
+ */
+void kvmppc_tce_put(struct kvmppc_spapr_tce_table *tt,
+ unsigned long ioba, unsigned long tce)
+{
+ unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+ struct page *page;
+ u64 *tbl;
+
+ page = tt->pages[idx / TCES_PER_PAGE];
+ tbl = kvmppc_page_address(page);
+
+ tbl[idx % TCES_PER_PAGE] = tce;
+}
+EXPORT_SYMBOL_GPL(kvmppc_tce_put);
+
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+/*
+ * Converts guest physical address to host physical address.
+ * Tries to increase page counter via realmode_get_page() and
+ * returns ERROR_ADDR if failed.
+ */
+static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
+ unsigned long gpa, struct page **pg)
+{
+ struct kvm_memory_slot *memslot;
+ pte_t *ptep, pte;
+ unsigned long hva, hpa = ERROR_ADDR;
+ unsigned long gfn = gpa >> PAGE_SHIFT;
+ unsigned shift = 0;
+
+ memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+ if (!memslot)
+ return ERROR_ADDR;
+
+ hva = __gfn_to_hva_memslot(memslot, gfn);
+
+ ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+ if (!ptep || !pte_present(*ptep))
+ return ERROR_ADDR;
+ pte = *ptep;
+
+ if (!shift)
+ shift = PAGE_SHIFT;
+
+ /* Avoid handling anything potentially complicated in realmode */
+ if (shift > PAGE_SHIFT)
+ return ERROR_ADDR;
+
+ if (((gpa & TCE_PCI_WRITE) || pte_write(pte)) && !pte_dirty(pte))
+ return ERROR_ADDR;
+
+ if (!pte_young(pte))
+ return ERROR_ADDR;
+
+ /* Increase page counter */
+ *pg = realmode_pfn_to_page(pte_pfn(pte));
+ if (!*pg || realmode_get_page(*pg))
+ return ERROR_ADDR;
+
+ hpa = (pte_pfn(pte) << PAGE_SHIFT) + (gpa & ((1 << shift) - 1));
+
+ /* Page has gone since we got pte, safer to put the request to virt mode */
+ if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+ hpa = ERROR_ADDR;
+ realmode_put_page(*pg);
+ *pg = NULL;
+ }
+
+ return hpa;
+}
+
+long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce)
{
- struct kvm *kvm = vcpu->kvm;
- struct kvmppc_spapr_tce_table *stt;
-
- /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
- /* liobn, ioba, tce); */
-
- list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
- if (stt->liobn == liobn) {
- unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
- struct page *page;
- u64 *tbl;
-
- /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p window_size=0x%x\n", */
- /* liobn, stt, stt->window_size); */
- if (ioba >= stt->window_size)
- return H_PARAMETER;
-
- page = stt->pages[idx / TCES_PER_PAGE];
- tbl = (u64 *)page_address(page);
-
- /* FIXME: Need to validate the TCE itself */
- /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
- tbl[idx % TCES_PER_PAGE] = tce;
- return H_SUCCESS;
- }
+ long ret;
+ struct kvmppc_spapr_tce_table *tt;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ if (!tt)
+ return H_TOO_HARD;
+
+ if (ioba >= tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_tce_validate(tce);
+ if (!ret)
+ kvmppc_tce_put(tt, ioba, tce);
+
+ return ret;
+}
+
+long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_list, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret = H_SUCCESS;
+ unsigned long tces;
+ struct page *pg = NULL;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ if (!tt)
+ return H_TOO_HARD;
+
+ /*
+ * The spec says that the maximum size of the list is 512 TCEs
+ * so the whole list fits within a single 4K page
+ */
+ if (npages > 512)
+ return H_PARAMETER;
+
+ if (tce_list & ~IOMMU_PAGE_MASK)
+ return H_PARAMETER;
+
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ tces = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce_list, &pg);
+ if (tces == ERROR_ADDR)
+ return H_TOO_HARD;
+
+ for (i = 0; i < npages; ++i) {
+ ret = kvmppc_tce_validate(((unsigned long *)tces)[i]);
+ if (ret)
+ goto put_unlock_exit;
+ }
+
+ for (i = 0; i < npages; ++i)
+ kvmppc_tce_put(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+ ((unsigned long *)tces)[i]);
+
+put_unlock_exit:
+ if (!ret && pg && !PageCompound(pg) && realmode_put_page(pg)) {
+ vcpu->arch.tce_rm_fail = TCERM_PUTLIST;
+ ret = H_TOO_HARD;
}
- /* Didn't find the liobn, punt it to userspace */
- return H_TOO_HARD;
+ return ret;
+}
+
+long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ if (!tt)
+ return H_TOO_HARD;
+
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_tce_validate(tce_value);
+ if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+ return H_PARAMETER;
+
+ for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+ kvmppc_tce_put(tt, ioba, tce_value);
+
+ return H_SUCCESS;
}
+#endif /* CONFIG_KVM_BOOK3S_64_HV */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2efa9dd..8da85e6 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -567,7 +567,31 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
if (kvmppc_xics_enabled(vcpu)) {
ret = kvmppc_xics_hcall(vcpu, req);
break;
- } /* fallthrough */
+ }
+ return RESUME_HOST;
+ case H_PUT_TCE:
+ ret = kvmppc_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+ kvmppc_get_gpr(vcpu, 5),
+ kvmppc_get_gpr(vcpu, 6));
+ if (ret == H_TOO_HARD)
+ return RESUME_HOST;
+ break;
+ case H_PUT_TCE_INDIRECT:
+ ret = kvmppc_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
+ kvmppc_get_gpr(vcpu, 5),
+ kvmppc_get_gpr(vcpu, 6),
+ kvmppc_get_gpr(vcpu, 7));
+ if (ret == H_TOO_HARD)
+ return RESUME_HOST;
+ break;
+ case H_STUFF_TCE:
+ ret = kvmppc_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+ kvmppc_get_gpr(vcpu, 5),
+ kvmppc_get_gpr(vcpu, 6),
+ kvmppc_get_gpr(vcpu, 7));
+ if (ret == H_TOO_HARD)
+ return RESUME_HOST;
+ break;
default:
return RESUME_HOST;
}
@@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
vcpu->arch.cpu_type = KVM_CPU_3S_64;
kvmppc_sanity_check(vcpu);
+ /*
+ * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
+ * half executed, we first read TCEs from the user, check them and
+ * return error if something went wrong and only then put TCEs into
+ * the TCE table.
+ *
+ * tce_tmp_hpas is a cache for TCEs to avoid stack allocation or
+ * kmalloc as the whole TCE list can take up to 512 items 8 bytes
+ * each (4096 bytes).
+ */
+ vcpu->arch.tce_tmp_hpas = kmalloc(4096, GFP_KERNEL);
+ if (!vcpu->arch.tce_tmp_hpas)
+ goto free_vcpu;
+
return vcpu;
free_vcpu:
@@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
spin_unlock(&vcpu->arch.vpa_update_lock);
+ kfree(vcpu->arch.tce_tmp_hpas);
kvm_vcpu_uninit(vcpu);
kmem_cache_free(kvm_vcpu_cache, vcpu);
}
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b02f91e..15942bc 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1416,7 +1416,7 @@ hcall_real_table:
.long 0 /* 0x14 - H_CLEAR_REF */
.long .kvmppc_h_protect - hcall_real_table
.long 0 /* 0x1c - H_GET_TCE */
- .long .kvmppc_h_put_tce - hcall_real_table
+ .long .kvmppc_rm_h_put_tce - hcall_real_table
.long 0 /* 0x24 - H_SET_SPRG0 */
.long .kvmppc_h_set_dabr - hcall_real_table
.long 0 /* 0x2c */
@@ -1490,6 +1490,12 @@ hcall_real_table:
.long 0 /* 0x11c */
.long 0 /* 0x120 */
.long .kvmppc_h_bulk_remove - hcall_real_table
+ .long 0 /* 0x128 */
+ .long 0 /* 0x12c */
+ .long 0 /* 0x130 */
+ .long 0 /* 0x134 */
+ .long .kvmppc_rm_h_stuff_tce - hcall_real_table
+ .long .kvmppc_rm_h_put_tce_indirect - hcall_real_table
hcall_real_table_end:
ignore_hdec:
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index da0e0bc..6bd0d4a 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -227,6 +227,37 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
return EMULATE_DONE;
}
+static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
+{
+ unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+ unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+ unsigned long tce = kvmppc_get_gpr(vcpu, 6);
+ unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+ long rc;
+
+ rc = kvmppc_h_put_tce_indirect(vcpu, liobn, ioba,
+ tce, npages);
+ if (rc == H_TOO_HARD)
+ return EMULATE_FAIL;
+ kvmppc_set_gpr(vcpu, 3, rc);
+ return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
+{
+ unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+ unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+ unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
+ unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+ long rc;
+
+ rc = kvmppc_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
+ if (rc == H_TOO_HARD)
+ return EMULATE_FAIL;
+ kvmppc_set_gpr(vcpu, 3, rc);
+ return EMULATE_DONE;
+}
+
static int kvmppc_h_pr_xics_hcall(struct kvm_vcpu *vcpu, u32 cmd)
{
long rc = kvmppc_xics_hcall(vcpu, cmd);
@@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
return kvmppc_h_pr_bulk_remove(vcpu);
case H_PUT_TCE:
return kvmppc_h_pr_put_tce(vcpu);
+ case H_PUT_TCE_INDIRECT:
+ return kvmppc_h_pr_put_tce_indirect(vcpu);
+ case H_STUFF_TCE:
+ return kvmppc_h_pr_stuff_tce(vcpu);
case H_CEDE:
vcpu->arch.shared->msr |= MSR_EE;
kvm_vcpu_block(vcpu);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 6316ee3..ccb578b 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -394,6 +394,9 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_PPC_GET_SMMU_INFO:
r = 1;
break;
+ case KVM_CAP_SPAPR_MULTITCE:
+ r = 1;
+ break;
#endif
default:
r = 0;
--
1.8.3.2
The TCE tables handling may differ for real and virtual modes so
additional ppc_md.tce_build_rm/ppc_md.tce_free_rm/ppc_md.tce_flush_rm
handlers were introduced earlier.
So this adds the following:
1. support for the new ppc_md calls;
2. the ability for iommu_tce_build to process multiple entries per
call;
3. arch_spin_lock to protect TCE table from races in both real and virtual
modes;
4. proper TCE table protection from races with the existing IOMMU code
in iommu_take_ownership/iommu_release_ownership;
5. hwaddr variable renamed to hpa as it better describes what it
actually represents;
6. iommu_tce_direction is static now as it is not called from anywhere else.
This will be used by upcoming real mode support of VFIO on POWER.
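For context, a hedged sketch of a caller of the reworked API (the helper name
is illustrative; the real callers are the VFIO driver and the KVM patches
that follow):

#include <asm/iommu.h>          /* iommu_tce_build(), iommu_free_tces() */

/* Program a batch of host physical addresses into a TCE table.
 * A real-mode caller treats a failure as "retry in virtual mode". */
static int tce_map_batch(struct iommu_table *tbl, unsigned long entry,
                         unsigned long *hpas, unsigned long npages, bool rm)
{
        int ret = iommu_tce_build(tbl, entry, hpas, npages, rm);

        if (ret)
                /* roll back whatever part of the batch was programmed */
                iommu_free_tces(tbl, entry, npages, rm);

        return ret;
}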
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/iommu.h | 9 +-
arch/powerpc/kernel/iommu.c | 197 ++++++++++++++++++++++++++-------------
2 files changed, 135 insertions(+), 71 deletions(-)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index c34656a..b01bde1 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -78,6 +78,7 @@ struct iommu_table {
unsigned long *it_map; /* A simple allocation bitmap for now */
#ifdef CONFIG_IOMMU_API
struct iommu_group *it_group;
+ arch_spinlock_t it_rm_lock;
#endif
};
@@ -152,9 +153,9 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
extern int iommu_tce_put_param_check(struct iommu_table *tbl,
unsigned long ioba, unsigned long tce);
extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
- unsigned long hwaddr, enum dma_data_direction direction);
-extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
- unsigned long entry);
+ unsigned long *hpas, unsigned long npages, bool rm);
+extern int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
+ unsigned long npages, bool rm);
extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
unsigned long entry, unsigned long pages);
extern int iommu_put_tce_user_mode(struct iommu_table *tbl,
@@ -164,7 +165,5 @@ extern void iommu_flush_tce(struct iommu_table *tbl);
extern int iommu_take_ownership(struct iommu_table *tbl);
extern void iommu_release_ownership(struct iommu_table *tbl);
-extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
-
#endif /* __KERNEL__ */
#endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b20ff17..0f56cac 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -903,7 +903,7 @@ void iommu_register_group(struct iommu_table *tbl,
kfree(name);
}
-enum dma_data_direction iommu_tce_direction(unsigned long tce)
+static enum dma_data_direction iommu_tce_direction(unsigned long tce)
{
if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE))
return DMA_BIDIRECTIONAL;
@@ -914,7 +914,6 @@ enum dma_data_direction iommu_tce_direction(unsigned long tce)
else
return DMA_NONE;
}
-EXPORT_SYMBOL_GPL(iommu_tce_direction);
void iommu_flush_tce(struct iommu_table *tbl)
{
@@ -972,73 +971,116 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
}
EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
-unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
-{
- unsigned long oldtce;
- struct iommu_pool *pool = get_pool(tbl, entry);
-
- spin_lock(&(pool->lock));
-
- oldtce = ppc_md.tce_get(tbl, entry);
- if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
- ppc_md.tce_free(tbl, entry, 1);
- else
- oldtce = 0;
-
- spin_unlock(&(pool->lock));
-
- return oldtce;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tce);
-
int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
unsigned long entry, unsigned long pages)
{
- unsigned long oldtce;
- struct page *page;
-
- for ( ; pages; --pages, ++entry) {
- oldtce = iommu_clear_tce(tbl, entry);
- if (!oldtce)
- continue;
-
- page = pfn_to_page(oldtce >> PAGE_SHIFT);
- WARN_ON(!page);
- if (page) {
- if (oldtce & TCE_PCI_WRITE)
- SetPageDirty(page);
- put_page(page);
- }
- }
-
- return 0;
+ return iommu_free_tces(tbl, entry, pages, false);
}
EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages);
-/*
- * hwaddr is a kernel virtual address here (0xc... bazillion),
- * tce_build converts it to a physical address.
- */
+int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
+ unsigned long npages, bool rm)
+{
+ int i, ret = 0, to_free = 0;
+
+ if (rm && !ppc_md.tce_free_rm)
+ return -EAGAIN;
+
+ arch_spin_lock(&tbl->it_rm_lock);
+
+ for (i = 0; i < npages; ++i) {
+ unsigned long oldtce = ppc_md.tce_get(tbl, entry + i);
+ if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
+ continue;
+
+ if (rm) {
+ struct page *pg = realmode_pfn_to_page(
+ oldtce >> PAGE_SHIFT);
+ if (!pg) {
+ ret = -EAGAIN;
+ } else if (PageCompound(pg)) {
+ ret = -EAGAIN;
+ } else {
+ if (oldtce & TCE_PCI_WRITE)
+ SetPageDirty(pg);
+ ret = realmode_put_page(pg);
+ }
+ } else {
+ struct page *pg = pfn_to_page(oldtce >> PAGE_SHIFT);
+ if (!pg) {
+ ret = -EAGAIN;
+ } else {
+ if (oldtce & TCE_PCI_WRITE)
+ SetPageDirty(pg);
+ put_page(pg);
+ }
+ }
+ if (ret)
+ break;
+ to_free = i + 1;
+ }
+
+ if (to_free) {
+ if (rm)
+ ppc_md.tce_free_rm(tbl, entry, to_free);
+ else
+ ppc_md.tce_free(tbl, entry, to_free);
+
+ if (rm && ppc_md.tce_flush_rm)
+ ppc_md.tce_flush_rm(tbl);
+ else if (!rm && ppc_md.tce_flush)
+ ppc_md.tce_flush(tbl);
+ }
+ arch_spin_unlock(&tbl->it_rm_lock);
+
+ /* Make sure updates are seen by hardware */
+ mb();
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_free_tces);
+
int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
- unsigned long hwaddr, enum dma_data_direction direction)
+ unsigned long *hpas, unsigned long npages, bool rm)
{
- int ret = -EBUSY;
- unsigned long oldtce;
- struct iommu_pool *pool = get_pool(tbl, entry);
+ int i, ret = 0;
- spin_lock(&(pool->lock));
+ if (rm && !ppc_md.tce_build_rm)
+ return -EAGAIN;
- oldtce = ppc_md.tce_get(tbl, entry);
- /* Add new entry if it is not busy */
- if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
- ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, direction, NULL);
+ arch_spin_lock(&tbl->it_rm_lock);
- spin_unlock(&(pool->lock));
+ for (i = 0; i < npages; ++i) {
+ if (ppc_md.tce_get(tbl, entry + i) &
+ (TCE_PCI_WRITE | TCE_PCI_READ)) {
+ arch_spin_unlock(&tbl->it_rm_lock);
+ return -EBUSY;
+ }
+ }
- /* if (unlikely(ret))
- pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
- __func__, hwaddr, entry << IOMMU_PAGE_SHIFT,
- hwaddr, ret); */
+ for (i = 0; i < npages; ++i) {
+ unsigned long volatile hva = (unsigned long) __va(hpas[i]);
+ enum dma_data_direction dir = iommu_tce_direction(hva);
+
+ if (rm)
+ ret = ppc_md.tce_build_rm(tbl, entry + i, 1,
+ hva, dir, NULL);
+ else
+ ret = ppc_md.tce_build(tbl, entry + i, 1,
+ hva, dir, NULL);
+ if (ret)
+ break;
+ }
+
+ if (rm && ppc_md.tce_flush_rm)
+ ppc_md.tce_flush_rm(tbl);
+ else if (!rm && ppc_md.tce_flush)
+ ppc_md.tce_flush(tbl);
+
+ arch_spin_unlock(&tbl->it_rm_lock);
+
+ /* Make sure updates are seen by hardware */
+ mb();
return ret;
}
@@ -1049,7 +1091,7 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
{
int ret;
struct page *page = NULL;
- unsigned long hwaddr, offset = tce & IOMMU_PAGE_MASK & ~PAGE_MASK;
+ unsigned long hpa, offset = tce & IOMMU_PAGE_MASK & ~PAGE_MASK;
enum dma_data_direction direction = iommu_tce_direction(tce);
ret = get_user_pages_fast(tce & PAGE_MASK, 1,
@@ -1059,9 +1101,9 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
tce, entry << IOMMU_PAGE_SHIFT, ret); */
return -EFAULT;
}
- hwaddr = (unsigned long) page_address(page) + offset;
+ hpa = __pa((unsigned long) page_address(page)) + offset;
- ret = iommu_tce_build(tbl, entry, hwaddr, direction);
+ ret = iommu_tce_build(tbl, entry, &hpa, 1, false);
if (ret)
put_page(page);
@@ -1075,18 +1117,32 @@ EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
int iommu_take_ownership(struct iommu_table *tbl)
{
- unsigned long sz = (tbl->it_size + 7) >> 3;
+ unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+ int ret = 0;
+
+ spin_lock_irqsave(&tbl->large_pool.lock, flags);
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_lock(&tbl->pools[i].lock);
if (tbl->it_offset == 0)
clear_bit(0, tbl->it_map);
if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
pr_err("iommu_tce: it_map is not empty");
- return -EBUSY;
+ ret = -EBUSY;
+ if (tbl->it_offset == 0)
+ clear_bit(1, tbl->it_map);
+
+ } else {
+ memset(tbl->it_map, 0xff, sz);
}
- memset(tbl->it_map, 0xff, sz);
- iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_unlock(&tbl->pools[i].lock);
+ spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
+
+ if (!ret)
+ iommu_free_tces(tbl, tbl->it_offset, tbl->it_size, false);
return 0;
}
@@ -1094,14 +1150,23 @@ EXPORT_SYMBOL_GPL(iommu_take_ownership);
void iommu_release_ownership(struct iommu_table *tbl)
{
- unsigned long sz = (tbl->it_size + 7) >> 3;
+ unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+
+ iommu_free_tces(tbl, tbl->it_offset, tbl->it_size, false);
+
+ spin_lock_irqsave(&tbl->large_pool.lock, flags);
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_lock(&tbl->pools[i].lock);
- iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
memset(tbl->it_map, 0, sz);
/* Restore bit#0 set by iommu_init_table() */
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);
+
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_unlock(&tbl->pools[i].lock);
+ spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
}
EXPORT_SYMBOL_GPL(iommu_release_ownership);
--
1.8.3.2
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted at an IOMMU TCE table without passing
them to user space, which saves time on switching to user space and back.
Both real and virtual modes are supported. The kernel tries to
handle a TCE request in real mode first; if that fails, it passes the
request to the virtual mode handler to complete the operation. If the
virtual mode handler also fails, the request is passed to user space.
The first user of this is VFIO on POWER. The external user API in VFIO
is required for this patch. The patch adds a new KVM_CREATE_SPAPR_TCE_IOMMU
ioctl (guarded by the KVM_CAP_SPAPR_TCE_IOMMU capability) to associate
a virtual PCI bus number (LIOBN) with a VFIO IOMMU
group fd and enable in-kernel handling of map/unmap requests.
Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
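For context, a minimal user space sketch of the new ioctl (illustrative only;
the struct layout is the one added to the uapi header below, and vm_fd and
group_fd are assumed to be already-open KVM VM and VFIO group descriptors):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Link a LIOBN to a VFIO group so KVM handles H_PUT_TCE & friends. */
static int enable_inkernel_tce(int vm_fd, uint64_t liobn, int group_fd)
{
        struct kvm_create_spapr_tce_iommu args = {
                .liobn = liobn,
                .fd    = group_fd,
                .flags = 0,     /* no flags defined yet */
        };

        return ioctl(vm_fd, KVM_CREATE_SPAPR_TCE_IOMMU, &args);
}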
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
2013/07/11:
* removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
for KVM_BOOK3S_64
* kvmppc_gpa_to_hva_and_get also returns the host physical address. There is
not much use for it here but the next patch for hugepage support will use it more.
2013/07/06:
* added realmode arch_spin_lock to protect TCE table from races
in real and virtual modes
* POWERPC IOMMU API is changed to support real mode
* iommu_take_ownership and iommu_release_ownership are protected by
iommu_table's locks
* VFIO external user API use rewritten
* multiple small fixes
2013/06/27:
* tce_list page is now referenced in order to protect it from accidental
invalidation during H_PUT_TCE_INDIRECT execution
* added use of the external user VFIO API
2013/06/05:
* changed capability number
* changed ioctl number
* update the doc article number
2013/05/20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
translated TCEs, tries realmode_get_page() on those and if it fails, it
passes control over the virtual mode handler which tries to finish
the request handling
* kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
on a page
* The only reason to pass the request to user mode now is when the user mode
did not register TCE table in the kernel, in all other cases the virtual mode
handler is expected to do the job
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Documentation/virtual/kvm/api.txt | 26 ++++
arch/powerpc/include/asm/kvm_host.h | 3 +
arch/powerpc/include/asm/kvm_ppc.h | 2 +
arch/powerpc/include/uapi/asm/kvm.h | 7 +
arch/powerpc/kvm/book3s_64_vio.c | 296 +++++++++++++++++++++++++++++++++++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 124 +++++++++++++++
arch/powerpc/kvm/powerpc.c | 12 ++
7 files changed, 465 insertions(+), 5 deletions(-)
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 1c8942a..6ae65bd 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2408,6 +2408,32 @@ an implementation for these despite the in kernel acceleration.
This capability is always enabled.
+4.87 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+struct kvm_create_spapr_tce_iommu {
+ __u64 liobn;
+ __u32 fd;
+ __u32 flags;
+};
+
+This creates a link between an IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know what IOMMU
+group (i.e. TCE table) to use for the LIOBN number passed with
+H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
+
+User space passes a VFIO group fd. Using the external user VFIO API,
+KVM tries to get the IOMMU id from the passed fd. If that succeeds,
+acceleration turns on; if it fails, map/unmap requests are passed to user space.
+
+No flag is supported at the moment.
+
+
5. The kvm_run structure
------------------------
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index b8fe3de..4eeaf7d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -181,6 +181,8 @@ struct kvmppc_spapr_tce_table {
struct kvm *kvm;
u64 liobn;
u32 window_size;
+ struct iommu_group *grp; /* used for IOMMU groups */
+ struct vfio_group *vfio_grp; /* used for IOMMU groups */
struct page *pages[0];
};
@@ -612,6 +614,7 @@ struct kvm_vcpu_arch {
u64 busy_preempt;
unsigned long *tce_tmp_hpas; /* TCE cache for TCE_PUT_INDIRECT hcall */
+ unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */
enum {
TCERM_NONE,
TCERM_GETPAGE,
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 0ce4691..297cab5 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,6 +133,8 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
+extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+ struct kvm_create_spapr_tce_iommu *args);
extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
struct kvm_vcpu *vcpu, unsigned long liobn);
extern long kvmppc_tce_validate(unsigned long tce);
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 0fb1a6e..3da4aa3 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -319,6 +319,13 @@ struct kvm_create_spapr_tce {
__u32 window_size;
};
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+ __u64 liobn;
+ __u32 fd;
+ __u32 flags;
+};
+
/* for KVM_ALLOCATE_RMA */
struct kvm_allocate_rma {
__u64 rma_size;
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 0131bf9..125fc23 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,10 @@
#include <linux/hugetlb.h>
#include <linux/list.h>
#include <linux/anon_inodes.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/file.h>
+#include <linux/vfio.h>
#include <asm/tlbflush.h>
#include <asm/kvm_ppc.h>
@@ -42,6 +46,53 @@
#define ERROR_ADDR ((void *)~(unsigned long)0x0)
+/*
+ * Dynamically linked version of the external user VFIO API.
+ *
+ * As IOMMU group access control is implemented by VFIO, there is
+ * an API to verify that a specific process can own a group. As KVM
+ * may run when VFIO is not loaded, KVM is not linked statically to
+ * VFIO; instead, wrappers are used.
+ */
+struct vfio_group *kvmppc_vfio_group_get_external_user(struct file *filep)
+{
+ struct vfio_group *ret;
+ struct vfio_group * (*proc)(struct file *) =
+ symbol_get(vfio_group_get_external_user);
+ if (!proc)
+ return NULL;
+
+ ret = proc(filep);
+ symbol_put(vfio_group_get_external_user);
+
+ return ret;
+}
+
+void kvmppc_vfio_group_put_external_user(struct vfio_group *group)
+{
+ void (*proc)(struct vfio_group *) =
+ symbol_get(vfio_group_put_external_user);
+ if (!proc)
+ return;
+
+ proc(group);
+ symbol_put(vfio_group_put_external_user);
+}
+
+int kvmppc_vfio_external_user_iommu_id(struct vfio_group *group)
+{
+ int ret;
+ int (*proc)(struct vfio_group *) =
+ symbol_get(vfio_external_user_iommu_id);
+ if (!proc)
+ return -EINVAL;
+
+ ret = proc(group);
+ symbol_put(vfio_external_user_iommu_id);
+
+ return ret;
+}
+
static long kvmppc_stt_npages(unsigned long window_size)
{
return ALIGN((window_size >> SPAPR_TCE_SHIFT)
@@ -55,8 +106,15 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
mutex_lock(&kvm->lock);
list_del(&stt->list);
- for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
- __free_page(stt->pages[i]);
+
+ if (stt->grp) {
+ if (stt->vfio_grp)
+ kvmppc_vfio_group_put_external_user(stt->vfio_grp);
+ iommu_group_put(stt->grp);
+ } else
+ for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+ __free_page(stt->pages[i]);
+
kfree(stt);
mutex_unlock(&kvm->lock);
@@ -152,9 +210,90 @@ fail:
return ret;
}
-/* Converts guest physical address to host virtual address */
+static const struct file_operations kvm_spapr_tce_iommu_fops = {
+ .release = kvm_spapr_tce_release,
+};
+
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+ struct kvm_create_spapr_tce_iommu *args)
+{
+ struct kvmppc_spapr_tce_table *tt = NULL;
+ struct iommu_group *grp;
+ struct iommu_table *tbl;
+ struct file *vfio_filp;
+ struct vfio_group *vfio_grp;
+ int ret = 0, iommu_id;
+
+ /* Check this LIOBN hasn't been previously registered */
+ list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) {
+ if (tt->liobn == args->liobn)
+ return -EBUSY;
+ }
+
+ vfio_filp = fget(args->fd);
+ if (!vfio_filp)
+ return -ENXIO;
+
+ /* Lock the group. Fails if group is not viable or does not have IOMMU set */
+ vfio_grp = kvmppc_vfio_group_get_external_user(vfio_filp);
+ if (IS_ERR_VALUE((unsigned long)vfio_grp))
+ goto fput_exit;
+
+ /* Get IOMMU ID, find iommu_group and iommu_table */
+ iommu_id = kvmppc_vfio_external_user_iommu_id(vfio_grp);
+ if (iommu_id < 0)
+ goto grpput_fput_exit;
+
+ ret = -ENXIO;
+ grp = iommu_group_get_by_id(iommu_id);
+ if (!grp)
+ goto grpput_fput_exit;
+
+ tbl = iommu_group_get_iommudata(grp);
+ if (!tbl)
+ goto grpput_fput_exit;
+
+ /* Create a TCE table descriptor and add into the descriptor list */
+ tt = kzalloc(sizeof(*tt), GFP_KERNEL);
+ if (!tt)
+ goto grpput_fput_exit;
+
+ tt->liobn = args->liobn;
+ kvm_get_kvm(kvm);
+ tt->kvm = kvm;
+ tt->grp = grp;
+ tt->window_size = tbl->it_size << IOMMU_PAGE_SHIFT;
+ tt->vfio_grp = vfio_grp;
+
+ /* Create an inode to provide automatic cleanup upon exit */
+ ret = anon_inode_getfd("kvm-spapr-tce-iommu",
+ &kvm_spapr_tce_iommu_fops, tt, O_RDWR);
+ if (ret < 0)
+ goto free_grpput_fput_exit;
+
+ /* Add the TCE table descriptor to the descriptor list */
+ mutex_lock(&kvm->lock);
+ list_add(&tt->list, &kvm->arch.spapr_tce_tables);
+ mutex_unlock(&kvm->lock);
+
+ goto fput_exit;
+
+free_grpput_fput_exit:
+ kfree(tt);
+grpput_fput_exit:
+ kvmppc_vfio_group_put_external_user(vfio_grp);
+fput_exit:
+ fput(vfio_filp);
+
+ return ret;
+}
+
+/*
+ * Converts guest physical address to host virtual address.
+ * Also returns host physical address which is to put to TCE table.
+ */
static void __user *kvmppc_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
- unsigned long gpa, struct page **pg)
+ unsigned long gpa, struct page **pg, unsigned long *phpa)
{
unsigned long hva, gfn = gpa >> PAGE_SHIFT;
struct kvm_memory_slot *memslot;
@@ -169,9 +308,140 @@ static void __user *kvmppc_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
if (get_user_pages_fast(hva & PAGE_MASK, 1, is_write, pg) != 1)
return ERROR_ADDR;
+ if (phpa)
+ *phpa = __pa((unsigned long) page_address(*pg)) |
+ (hva & ~PAGE_MASK);
+
return (void *) hva;
}
+long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce)
+{
+ struct page *pg = NULL;
+ unsigned long hpa;
+ void __user *hva;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* Clear TCE */
+ if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE))) {
+ if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+ return H_PARAMETER;
+
+ if (iommu_free_tces(tbl, ioba >> IOMMU_PAGE_SHIFT,
+ 1, false))
+ return H_HARDWARE;
+
+ return H_SUCCESS;
+ }
+
+ /* Put TCE */
+ if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
+ /* Retry iommu_tce_build if it failed in real mode */
+ vcpu->arch.tce_rm_fail = TCERM_NONE;
+ hpa = vcpu->arch.tce_tmp_hpas[0];
+ } else {
+ if (iommu_tce_put_param_check(tbl, ioba, tce))
+ return H_PARAMETER;
+
+ hva = kvmppc_gpa_to_hva_and_get(vcpu, tce, &pg, &hpa);
+ if (hva == ERROR_ADDR)
+ return H_HARDWARE;
+ }
+
+ if (!iommu_tce_build(tbl, ioba >> IOMMU_PAGE_SHIFT, &hpa, 1, false))
+ return H_SUCCESS;
+
+ pg = pfn_to_page(hpa >> PAGE_SHIFT);
+ if (pg)
+ put_page(pg);
+
+ return H_HARDWARE;
+}
+
+static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt, unsigned long ioba,
+ unsigned long __user *tces, unsigned long npages)
+{
+ long i = 0, start = 0;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ switch (vcpu->arch.tce_rm_fail) {
+ case TCERM_NONE:
+ break;
+ case TCERM_GETPAGE:
+ start = vcpu->arch.tce_tmp_num;
+ break;
+ case TCERM_PUTTCE:
+ goto put_tces;
+ case TCERM_PUTLIST:
+ default:
+ WARN_ON(1);
+ return H_HARDWARE;
+ }
+
+ for (i = start; i < npages; ++i) {
+ struct page *pg = NULL;
+ unsigned long gpa;
+ void __user *hva;
+
+ if (get_user(gpa, tces + i))
+ return H_HARDWARE;
+
+ if (iommu_tce_put_param_check(tbl, ioba +
+ (i << IOMMU_PAGE_SHIFT), gpa))
+ return H_PARAMETER;
+
+ hva = kvmppc_gpa_to_hva_and_get(vcpu, gpa, &pg,
+ &vcpu->arch.tce_tmp_hpas[i]);
+ if (hva == ERROR_ADDR)
+ goto putpages_flush_exit;
+ }
+
+put_tces:
+ if (!iommu_tce_build(tbl, ioba >> IOMMU_PAGE_SHIFT,
+ vcpu->arch.tce_tmp_hpas, npages, false))
+ return H_SUCCESS;
+
+putpages_flush_exit:
+ for ( --i; i >= 0; --i) {
+ struct page *pg;
+ pg = pfn_to_page(vcpu->arch.tce_tmp_hpas[i] >> PAGE_SHIFT);
+ if (pg)
+ put_page(pg);
+ }
+
+ return H_HARDWARE;
+}
+
+long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+ unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+ return H_PARAMETER;
+
+ if (iommu_free_tces(tbl, entry, npages, false))
+ return H_HARDWARE;
+
+ return H_SUCCESS;
+}
+
long kvmppc_h_put_tce(struct kvm_vcpu *vcpu,
unsigned long liobn, unsigned long ioba,
unsigned long tce)
@@ -183,6 +453,10 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu,
if (!tt)
return H_TOO_HARD;
+ if (tt->grp)
+ return kvmppc_h_put_tce_iommu(vcpu, tt, liobn, ioba, tce);
+
+ /* Emulated IO */
if (ioba >= tt->window_size)
return H_PARAMETER;
@@ -221,13 +495,20 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;
- tces = kvmppc_gpa_to_hva_and_get(vcpu, tce_list, &pg);
+ tces = kvmppc_gpa_to_hva_and_get(vcpu, tce_list, &pg, NULL);
if (tces == ERROR_ADDR)
return H_TOO_HARD;
if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
goto put_list_page_exit;
+ if (tt->grp) {
+ ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
+ tt, ioba, tces, npages);
+ goto put_list_page_exit;
+ }
+
+ /* Emulated IO */
for (i = 0; i < npages; ++i) {
if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
ret = H_PARAMETER;
@@ -266,6 +547,11 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
if (!tt)
return H_TOO_HARD;
+ if (tt->grp)
+ return kvmppc_h_stuff_tce_iommu(vcpu, tt, liobn, ioba,
+ tce_value, npages);
+
+ /* Emulated IO */
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 9b0372f..d898c14 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -26,6 +26,7 @@
#include <linux/slab.h>
#include <linux/hugetlb.h>
#include <linux/list.h>
+#include <linux/iommu.h>
#include <asm/tlbflush.h>
#include <asm/kvm_ppc.h>
@@ -187,6 +188,113 @@ static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
return hpa;
}
+static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt, unsigned long liobn,
+ unsigned long ioba, unsigned long tce)
+{
+ int ret;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+ unsigned long hpa;
+ struct page *pg = NULL;
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* Clear TCE */
+ if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE))) {
+ if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+ return H_PARAMETER;
+
+ if (iommu_free_tces(tbl, ioba >> IOMMU_PAGE_SHIFT, 1, true))
+ return H_TOO_HARD;
+
+ return H_SUCCESS;
+ }
+
+ /* Put TCE */
+ if (iommu_tce_put_param_check(tbl, ioba, tce))
+ return H_PARAMETER;
+
+ hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce, &pg);
+ if (hpa == ERROR_ADDR)
+ return H_TOO_HARD;
+
+ ret = iommu_tce_build(tbl, ioba >> IOMMU_PAGE_SHIFT, &hpa, 1, true);
+ if (unlikely(ret)) {
+ if (ret == -EBUSY)
+ return H_PARAMETER;
+
+ vcpu->arch.tce_tmp_hpas[0] = hpa;
+ vcpu->arch.tce_tmp_num = 0;
+ vcpu->arch.tce_rm_fail = TCERM_PUTTCE;
+ return H_TOO_HARD;
+ }
+
+ return H_SUCCESS;
+}
+
+static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt, unsigned long ioba,
+ unsigned long *tces, unsigned long npages)
+{
+ int i, ret;
+ unsigned long hpa;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+ struct page *pg = NULL;
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* Check all TCEs */
+ for (i = 0; i < npages; ++i) {
+ if (iommu_tce_put_param_check(tbl, ioba +
+ (i << IOMMU_PAGE_SHIFT), tces[i]))
+ return H_PARAMETER;
+ }
+
+ /* Translate TCEs and go get_page() */
+ for (i = 0; i < npages; ++i) {
+ hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tces[i], &pg);
+ if (hpa == ERROR_ADDR) {
+ vcpu->arch.tce_tmp_num = i;
+ vcpu->arch.tce_rm_fail = TCERM_GETPAGE;
+ return H_TOO_HARD;
+ }
+ vcpu->arch.tce_tmp_hpas[i] = hpa;
+ }
+
+ /* Put TCEs to the table */
+ ret = iommu_tce_build(tbl, (ioba >> IOMMU_PAGE_SHIFT),
+ vcpu->arch.tce_tmp_hpas, npages, true);
+ if (ret == -EAGAIN) {
+ vcpu->arch.tce_rm_fail = TCERM_PUTTCE;
+ return H_TOO_HARD;
+ } else if (ret) {
+ return H_HARDWARE;
+ }
+
+ return H_SUCCESS;
+}
+
+static long kvmppc_rm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+ return H_PARAMETER;
+
+ if (iommu_free_tces(tbl, ioba >> IOMMU_PAGE_SHIFT, npages, true))
+ return H_TOO_HARD;
+
+ return H_SUCCESS;
+}
+
long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce)
{
@@ -197,6 +305,10 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
if (!tt)
return H_TOO_HARD;
+ if (tt->grp)
+ return kvmppc_rm_h_put_tce_iommu(vcpu, tt, liobn, ioba, tce);
+
+ /* Emulated IO */
if (ioba >= tt->window_size)
return H_PARAMETER;
@@ -237,6 +349,13 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if (tces == ERROR_ADDR)
return H_TOO_HARD;
+ if (tt->grp) {
+ ret = kvmppc_rm_h_put_tce_indirect_iommu(vcpu,
+ tt, ioba, (unsigned long *)tces, npages);
+ goto put_unlock_exit;
+ }
+
+ /* Emulated IO */
for (i = 0; i < npages; ++i) {
ret = kvmppc_tce_validate(((unsigned long *)tces)[i]);
if (ret)
@@ -267,6 +386,11 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
if (!tt)
return H_TOO_HARD;
+ if (tt->grp)
+ return kvmppc_rm_h_stuff_tce_iommu(vcpu, tt, liobn, ioba,
+ tce_value, npages);
+
+ /* Emulated IO */
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index ccb578b..2909cfa 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -395,6 +395,7 @@ int kvm_dev_ioctl_check_extension(long ext)
r = 1;
break;
case KVM_CAP_SPAPR_MULTITCE:
+ case KVM_CAP_SPAPR_TCE_IOMMU:
r = 1;
break;
#endif
@@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
goto out;
}
+ case KVM_CREATE_SPAPR_TCE_IOMMU: {
+ struct kvm_create_spapr_tce_iommu create_tce_iommu;
+ struct kvm *kvm = filp->private_data;
+
+ r = -EFAULT;
+ if (copy_from_user(&create_tce_iommu, argp,
+ sizeof(create_tce_iommu)))
+ goto out;
+ r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
+ goto out;
+ }
#endif /* CONFIG_PPC_BOOK3S_64 */
#ifdef CONFIG_KVM_BOOK3S_64_HV
--
1.8.3.2
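For reference, a rough user-space sketch of the KVM_CREATE_SPAPR_TCE_IOMMU
call documented in api.txt above; vm_fd, group_fd and liobn are assumed to
already exist, fall_back_to_userspace_tce_handling() is a placeholder, and
error handling is omitted:

/* needs <linux/kvm.h> and <sys/ioctl.h> */
struct kvm_create_spapr_tce_iommu args = {
	.liobn = liobn,		/* LIOBN the guest will use in H_PUT_TCE */
	.fd    = group_fd,	/* VFIO group file descriptor */
	.flags = 0,		/* no flags are defined yet */
};

if (ioctl(vm_fd, KVM_CREATE_SPAPR_TCE_IOMMU, &args) < 0)
	/* acceleration is off, map/unmap requests stay in user space */
	fall_back_to_userspace_tce_handling();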
This adds special support for huge pages (16MB) in real mode.
Reference counting cannot easily be done for such pages in real
mode (when the MMU is off), so we add a hash table of huge pages.
It is populated in virtual mode and get_page is called just once
per huge page. Real mode handlers check whether the requested page
is in the hash table; if it is, no reference counting is done,
otherwise an exit to virtual mode happens. The hash table is
released at KVM exit.
At the moment the fastest card available for tests uses up to 9 huge
pages, so walking this hash table does not cost much. However this
can change and we may want to optimize it.
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
2013/07/12:
* removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
for KVM_BOOK3S_64
2013/06/27:
* list of huge pages replaced with a hashtable for better performance
* spinlock removed from real mode and now only protects insertion of new
huge page descriptors into the hashtable
2013/06/05:
* fixed compile error when CONFIG_IOMMU_API=n
2013/05/20:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints a warning if it is called twice for the same
huge page, as the real mode handler is expected to fail just once - when a huge
page is not in the list yet.
* the huge page is refcounted twice - when added to the hugepage list and
when used in the virtual mode hcall handler (can be optimized but it will
make the patch less nice).
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/kvm_host.h | 25 ++++++++
arch/powerpc/kernel/iommu.c | 6 +-
arch/powerpc/kvm/book3s_64_vio.c | 123 ++++++++++++++++++++++++++++++++++--
arch/powerpc/kvm/book3s_64_vio_hv.c | 32 +++++++++-
4 files changed, 176 insertions(+), 10 deletions(-)
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 4eeaf7d..c57b25a 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -31,6 +31,7 @@
#include <linux/list.h>
#include <linux/atomic.h>
#include <linux/tracepoint.h>
+#include <linux/hashtable.h>
#include <asm/kvm_asm.h>
#include <asm/processor.h>
#include <asm/page.h>
@@ -183,9 +184,33 @@ struct kvmppc_spapr_tce_table {
u32 window_size;
struct iommu_group *grp; /* used for IOMMU groups */
struct vfio_group *vfio_grp; /* used for IOMMU groups */
+ DECLARE_HASHTABLE(hash_tab, ilog2(64)); /* used for IOMMU groups */
+ spinlock_t hugepages_write_lock; /* used for IOMMU groups */
struct page *pages[0];
};
+/*
+ * The KVM guest can be backed with 16MB pages.
+ * In this case, we cannot do page counting from real mode
+ * as compound pages are used - they are linked in a list
+ * with pointers which are virtual addresses, inaccessible
+ * in real mode.
+ *
+ * The code below keeps a list of 16MB pages and uses the page
+ * struct in real mode if the page is already locked in RAM and
+ * inserted into the list, or switches to virtual mode where it
+ * can be handled in the usual manner.
+ */
+#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa) hash_32(gpa >> 24, 32)
+
+struct kvmppc_spapr_iommu_hugepage {
+ struct hlist_node hash_node;
+ unsigned long gpa; /* Guest physical address */
+ unsigned long hpa; /* Host physical address */
+ struct page *page; /* page struct of the very first subpage */
+ unsigned long size; /* Huge page size (always 16MB at the moment) */
+};
+
struct kvmppc_linear_info {
void *base_virt;
unsigned long base_pfn;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 0f56cac..bb54cc2 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
if (!pg) {
ret = -EAGAIN;
} else if (PageCompound(pg)) {
- ret = -EAGAIN;
+ /* Hugepages will be released at KVM exit */
+ ret = 0;
} else {
if (oldtce & TCE_PCI_WRITE)
SetPageDirty(pg);
@@ -1009,6 +1010,9 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
struct page *pg = pfn_to_page(oldtce >> PAGE_SHIFT);
if (!pg) {
ret = -EAGAIN;
+ } else if (PageCompound(pg)) {
+ /* Hugepages will be released at KVM exit */
+ ret = 0;
} else {
if (oldtce & TCE_PCI_WRITE)
SetPageDirty(pg);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 125fc23..a74a319 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -93,6 +93,102 @@ int kvmppc_vfio_external_user_iommu_id(struct vfio_group *group)
return ret;
}
+/*
+ * API to support huge pages in real mode
+ */
+static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
+{
+ spin_lock_init(&tt->hugepages_write_lock);
+ hash_init(tt->hash_tab);
+}
+
+static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
+{
+ int bkt;
+ struct kvmppc_spapr_iommu_hugepage *hp;
+ struct hlist_node *tmp;
+
+ spin_lock(&tt->hugepages_write_lock);
+ hash_for_each_safe(tt->hash_tab, bkt, tmp, hp, hash_node) {
+ pr_debug("Release HP liobn=%llx #%u gpa=%lx hpa=%lx size=%ld\n",
+ tt->liobn, bkt, hp->gpa, hp->hpa, hp->size);
+ hlist_del_rcu(&hp->hash_node);
+
+ put_page(hp->page);
+ kfree(hp);
+ }
+ spin_unlock(&tt->hugepages_write_lock);
+}
+
+/* Returns true if a page with GPA is already in the hash table */
+static bool kvmppc_iommu_hugepage_lookup_gpa(struct kvmppc_spapr_tce_table *tt,
+ unsigned long gpa)
+{
+ struct kvmppc_spapr_iommu_hugepage *hp;
+ const unsigned key = KVMPPC_SPAPR_HUGEPAGE_HASH(gpa);
+
+ hash_for_each_possible_rcu(tt->hash_tab, hp, hash_node, key) {
+ if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
+ continue;
+
+ return true;
+ }
+
+ return false;
+}
+
+/* Returns true if a page with GPA has been added to the hash table */
+static bool kvmppc_iommu_hugepage_add(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
+ unsigned long hva, unsigned long gpa)
+{
+ struct kvmppc_spapr_iommu_hugepage *hp;
+ const unsigned key = KVMPPC_SPAPR_HUGEPAGE_HASH(gpa);
+ pte_t *ptep;
+ unsigned int shift = 0;
+ static const int is_write = 1;
+
+ ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+ WARN_ON(!ptep);
+
+ if (!ptep || (shift <= PAGE_SHIFT))
+ return false;
+
+ hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+ if (!hp)
+ return false;
+
+ hp->gpa = gpa & ~((1 << shift) - 1);
+ hp->hpa = (pte_pfn(*ptep) << PAGE_SHIFT);
+ hp->size = 1 << shift;
+
+ if (get_user_pages_fast(hva & ~(hp->size - 1), 1,
+ is_write, &hp->page) != 1) {
+ kfree(hp);
+ return false;
+ }
+ hash_add_rcu(tt->hash_tab, &hp->hash_node, key);
+
+ return true;
+}
+
+/** Returns true if a page with GPA is in the hash table or
+ * has just been added.
+ */
+static bool kvmppc_iommu_hugepage_try_add(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
+ unsigned long hva, unsigned long gpa)
+{
+ bool ret;
+
+ spin_lock(&tt->hugepages_write_lock);
+ ret = kvmppc_iommu_hugepage_lookup_gpa(tt, gpa) ||
+ kvmppc_iommu_hugepage_add(vcpu, tt, hva, gpa);
+ spin_unlock(&tt->hugepages_write_lock);
+
+ return ret;
+}
+
static long kvmppc_stt_npages(unsigned long window_size)
{
return ALIGN((window_size >> SPAPR_TCE_SHIFT)
@@ -106,6 +202,7 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
mutex_lock(&kvm->lock);
list_del(&stt->list);
+ kvmppc_iommu_hugepages_cleanup(stt);
if (stt->grp) {
if (stt->vfio_grp)
@@ -192,6 +289,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
kvm_get_kvm(kvm);
mutex_lock(&kvm->lock);
+ kvmppc_iommu_hugepages_init(stt);
list_add(&stt->list, &kvm->arch.spapr_tce_tables);
mutex_unlock(&kvm->lock);
@@ -273,6 +371,7 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
/* Add the TCE table descriptor to the descriptor list */
mutex_lock(&kvm->lock);
+ kvmppc_iommu_hugepages_init(tt);
list_add(&tt->list, &kvm->arch.spapr_tce_tables);
mutex_unlock(&kvm->lock);
@@ -293,6 +392,7 @@ fput_exit:
* Also returns host physical address which is to put to TCE table.
*/
static void __user *kvmppc_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
unsigned long gpa, struct page **pg, unsigned long *phpa)
{
unsigned long hva, gfn = gpa >> PAGE_SHIFT;
@@ -312,6 +412,17 @@ static void __user *kvmppc_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
*phpa = __pa((unsigned long) page_address(*pg)) |
(hva & ~PAGE_MASK);
+ if (PageCompound(*pg)) {
+ /** Check if this GPA is taken care of by the hash table.
+ * If this is the case, do not show the caller page struct
+ * address as huge pages will be released at KVM exit.
+ */
+ if (kvmppc_iommu_hugepage_try_add(vcpu, tt, hva, gpa)) {
+ put_page(*pg);
+ *pg = NULL;
+ }
+ }
+
return (void *) hva;
}
@@ -349,7 +460,7 @@ long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
if (iommu_tce_put_param_check(tbl, ioba, tce))
return H_PARAMETER;
- hva = kvmppc_gpa_to_hva_and_get(vcpu, tce, &pg, &hpa);
+ hva = kvmppc_gpa_to_hva_and_get(vcpu, tt, tce, &pg, &hpa);
if (hva == ERROR_ADDR)
return H_HARDWARE;
}
@@ -358,7 +469,7 @@ long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
return H_SUCCESS;
pg = pfn_to_page(hpa >> PAGE_SHIFT);
- if (pg)
+ if (pg && !PageCompound(pg))
put_page(pg);
return H_HARDWARE;
@@ -400,7 +511,7 @@ static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
(i << IOMMU_PAGE_SHIFT), gpa))
return H_PARAMETER;
- hva = kvmppc_gpa_to_hva_and_get(vcpu, gpa, &pg,
+ hva = kvmppc_gpa_to_hva_and_get(vcpu, tt, gpa, &pg,
&vcpu->arch.tce_tmp_hpas[i]);
if (hva == ERROR_ADDR)
goto putpages_flush_exit;
@@ -415,7 +526,7 @@ putpages_flush_exit:
for ( --i; i >= 0; --i) {
struct page *pg;
pg = pfn_to_page(vcpu->arch.tce_tmp_hpas[i] >> PAGE_SHIFT);
- if (pg)
+ if (pg && !PageCompound(pg))
put_page(pg);
}
@@ -495,7 +606,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;
- tces = kvmppc_gpa_to_hva_and_get(vcpu, tce_list, &pg, NULL);
+ tces = kvmppc_gpa_to_hva_and_get(vcpu, tt, tce_list, &pg, NULL);
if (tces == ERROR_ADDR)
return H_TOO_HARD;
@@ -524,7 +635,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
kvmppc_tce_put(tt, ioba + (i << IOMMU_PAGE_SHIFT),
vcpu->arch.tce_tmp_hpas[i]);
put_list_page_exit:
- if (pg)
+ if (pg && !PageCompound(pg))
put_page(pg);
if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index d898c14..81ea15c 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -133,12 +133,30 @@ void kvmppc_tce_put(struct kvmppc_spapr_tce_table *tt,
EXPORT_SYMBOL_GPL(kvmppc_tce_put);
#ifdef CONFIG_KVM_BOOK3S_64_HV
+
+static unsigned long kvmppc_rm_hugepage_gpa_to_hpa(
+ struct kvmppc_spapr_tce_table *tt,
+ unsigned long gpa)
+{
+ struct kvmppc_spapr_iommu_hugepage *hp;
+ const unsigned key = KVMPPC_SPAPR_HUGEPAGE_HASH(gpa);
+
+ hash_for_each_possible_rcu_notrace(tt->hash_tab, hp, hash_node, key) {
+ if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
+ continue;
+ return hp->hpa + (gpa & (hp->size - 1));
+ }
+
+ return ERROR_ADDR;
+}
+
/*
* Converts guest physical address to host physical address.
* Tries to increase page counter via realmode_get_page() and
* returns ERROR_ADDR if failed.
*/
static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
unsigned long gpa, struct page **pg)
{
struct kvm_memory_slot *memslot;
@@ -147,6 +165,14 @@ static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
unsigned long gfn = gpa >> PAGE_SHIFT;
unsigned shift = 0;
+ /* Check if it is a hugepage */
+ hpa = kvmppc_rm_hugepage_gpa_to_hpa(tt, gpa);
+ if (hpa != ERROR_ADDR) {
+ *pg = NULL; /* Tell the caller not to put page */
+ return hpa;
+ }
+
+ /* System page size case */
memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
if (!memslot)
return ERROR_ADDR;
@@ -215,7 +241,7 @@ static long kvmppc_rm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
if (iommu_tce_put_param_check(tbl, ioba, tce))
return H_PARAMETER;
- hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce, &pg);
+ hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tt, tce, &pg);
if (hpa == ERROR_ADDR)
return H_TOO_HARD;
@@ -254,7 +280,7 @@ static long kvmppc_rm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
/* Translate TCEs and go get_page() */
for (i = 0; i < npages; ++i) {
- hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tces[i], &pg);
+ hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tt, tces[i], &pg);
if (hpa == ERROR_ADDR) {
vcpu->arch.tce_tmp_num = i;
vcpu->arch.tce_rm_fail = TCERM_GETPAGE;
@@ -345,7 +371,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;
- tces = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce_list, &pg);
+ tces = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tt, tce_list, &pg);
if (tces == ERROR_ADDR)
return H_TOO_HARD;
--
1.8.3.2
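As a concrete illustration of the key and offset arithmetic used by the
real mode hugepage lookup above (the numbers are made up, a 16MB page size
is assumed as stated in the commit message):

/* A guest TCE pointing into a 16MB huge page registered in the hash table */
unsigned long gpa  = 0x23456789UL;
unsigned long base = gpa & ~((1UL << 24) - 1);		/* hp->gpa  == 0x23000000 */
unsigned int  key  = KVMPPC_SPAPR_HUGEPAGE_HASH(gpa);	/* hash_32(0x23, 32) */
/* with hp->hpa == 0x80000000 the real mode handler returns */
unsigned long hpa  = 0x80000000UL + (gpa & ((1UL << 24) - 1));	/* 0x80456789 */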
VFIO is designed to be used via ioctls on file descriptors
returned by VFIO.
However in some situations support for an external user is required.
The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
use the existing VFIO groups for exclusive access in real/virtual mode
on the host to avoid passing map/unmap requests to user space, which
would make things pretty slow.
The protocol includes:
1. do normal VFIO init operation:
- opening a new container;
- attaching group(s) to it;
- setting an IOMMU driver for a container.
When IOMMU is set for a container, all groups in it are
considered ready to use by an external user.
2. User space passes a group fd to an external user.
The external user calls vfio_group_get_external_user()
to verify that:
- the group is initialized;
- IOMMU is set for it.
If both checks passed, vfio_group_get_external_user()
increments the container user counter to prevent
the VFIO group from disposal before KVM exits.
3. The external user calls vfio_external_user_iommu_id()
to know an IOMMU ID. PPC64 KVM uses it to link logical bus
number (LIOBN) with IOMMU ID.
4. When the external KVM finishes, it calls
vfio_group_put_external_user() to release the VFIO group.
This call decrements the container user counter.
Everything gets released.
The "vfio: Limit group opens" patch is also required for the consistency.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
2013/07/11:
* added vfio_group_get()/vfio_group_put()
* protocol description changed
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
drivers/vfio/vfio.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/vfio.h | 7 ++++++
2 files changed, 69 insertions(+)
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index c488da5..58b034b 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1370,6 +1370,68 @@ static const struct file_operations vfio_device_fops = {
};
/**
+ * External user API, exported by symbols to be linked dynamically.
+ *
+ * The protocol includes:
+ * 1. do normal VFIO init operation:
+ * - opening a new container;
+ * - attaching group(s) to it;
+ * - setting an IOMMU driver for a container.
+ * When IOMMU is set for a container, all groups in it are
+ * considered ready to use by an external user.
+ *
+ * 2. User space passes a group fd to an external user.
+ * The external user calls vfio_group_get_external_user()
+ * to verify that:
+ * - the group is initialized;
+ * - IOMMU is set for it.
+ * If both checks passed, vfio_group_get_external_user()
+ * increments the container user counter to prevent
+ * the VFIO group from disposal before KVM exits.
+ *
+ * 3. The external user calls vfio_external_user_iommu_id()
+ * to know an IOMMU ID.
+ *
+ * 4. When the external KVM finishes, it calls
+ * vfio_group_put_external_user() to release the VFIO group.
+ * This call decrements the container user counter.
+ */
+struct vfio_group *vfio_group_get_external_user(struct file *filep)
+{
+ struct vfio_group *group = filep->private_data;
+
+ if (filep->f_op != &vfio_group_fops)
+ return ERR_PTR(-EINVAL);
+
+ if (!atomic_inc_not_zero(&group->container_users))
+ return ERR_PTR(-EINVAL);
+
+ if (!group->container->iommu_driver ||
+ !vfio_group_viable(group)) {
+ atomic_dec(&group->container_users);
+ return ERR_PTR(-EINVAL);
+ }
+
+ vfio_group_get(group);
+
+ return group;
+}
+EXPORT_SYMBOL_GPL(vfio_group_get_external_user);
+
+void vfio_group_put_external_user(struct vfio_group *group)
+{
+ vfio_group_put(group);
+ vfio_group_try_dissolve_container(group);
+}
+EXPORT_SYMBOL_GPL(vfio_group_put_external_user);
+
+int vfio_external_user_iommu_id(struct vfio_group *group)
+{
+ return iommu_group_id(group->iommu_group);
+}
+EXPORT_SYMBOL_GPL(vfio_external_user_iommu_id);
+
+/**
* Module/class support
*/
static char *vfio_devnode(struct device *dev, umode_t *mode)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index ac8d488..24579a0 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -90,4 +90,11 @@ extern void vfio_unregister_iommu_driver(
TYPE tmp; \
offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); }) \
+/*
+ * External user API
+ */
+extern struct vfio_group *vfio_group_get_external_user(struct file *filep);
+extern void vfio_group_put_external_user(struct vfio_group *group);
+extern int vfio_external_user_iommu_id(struct vfio_group *group);
+
#endif /* VFIO_H */
--
1.8.3.2
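A rough user-space sketch of protocol step 1 and of handing the group fd to
the external user; the VFIO ioctls are the existing ones, the group path and
fds are only examples, and error handling is omitted:

/* needs <linux/vfio.h>, <fcntl.h> and <sys/ioctl.h> */
int container = open("/dev/vfio/vfio", O_RDWR);
int group     = open("/dev/vfio/26", O_RDWR);	/* example IOMMU group */

/* Step 1: attach the group and set an IOMMU driver for the container */
ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);

/*
 * Step 2: the group fd is passed to the external user; for PPC64 KVM this
 * is the KVM_CREATE_SPAPR_TCE_IOMMU ioctl from this series. Steps 2-4 are
 * then performed by KVM via the three symbols exported above.
 */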
On 07/16/2013 10:53 AM, Alexey Kardashevskiy wrote:
> The changes are:
> 1. rebased on v3.11-rc1 so the capability numbers changed again
> 2. fixed multiple comments from maintainers
> 3. "KVM: PPC: Add support for IOMMU in-kernel handling" is split into
> 2 patches, the new one is "powerpc/iommu: rework to support realmode".
> 4. IOMMU_API is now always enabled for KVM_BOOK3S_64.
>
> MOre details in the individual patch comments.
>
> Depends on "hashtable: add hash_for_each_possible_rcu_notrace()",
> posted a while ago.
>
>
> Alexey Kardashevskiy (10):
> KVM: PPC: reserve a capability number for multitce support
> KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO
Alex, could you please pull these 2 patches or tell me what is wrong with
them? Having them in the kernel sooner would let me ask for a headers
update for QEMU, and then I would try pushing multiple TCE and VFIO support
in QEMU. Thanks.
--
Alexey
Ping, anyone, please?
Ben needs an ack from any of the MM people before proceeding with this patch. Thanks!
On 07/16/2013 10:53 AM, Alexey Kardashevskiy wrote:
> The current VFIO-on-POWER implementation supports only user mode
> driven mapping, i.e. QEMU is sending requests to map/unmap pages.
> However this approach is really slow, so we want to move that to KVM.
> Since H_PUT_TCE can be extremely performance sensitive (especially with
> network adapters where each packet needs to be mapped/unmapped) we chose
> to implement that as a "fast" hypercall directly in "real
> mode" (processor still in the guest context but MMU off).
>
> To be able to do that, we need to provide some facilities to
> access the struct page count within that real mode environment as things
> like the sparsemem vmemmap mappings aren't accessible.
>
> This adds an API to increment/decrement page counter as
> get_user_pages API used for user mode mapping does not work
> in the real mode.
>
> CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.
>
> Cc: [email protected]
> Reviewed-by: Paul Mackerras <[email protected]>
> Signed-off-by: Paul Mackerras <[email protected]>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>
> ---
>
> Changes:
> 2013/07/10:
> * adjusted comment (removed sentence about virtual mode)
> * get_page_unless_zero replaced with atomic_inc_not_zero to minimize
> effect of a possible get_page_unless_zero() rework (if it ever happens).
>
> 2013/06/27:
> * realmode_get_page() fixed to use get_page_unless_zero(). If failed,
> the call will be passed from real to virtual mode and safely handled.
> * added comment to PageCompound() in include/linux/page-flags.h.
>
> 2013/05/20:
> * PageTail() is replaced by PageCompound() in order to have the same checks
> for whether the page is huge in realmode_get_page() and realmode_put_page()
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> arch/powerpc/include/asm/pgtable-ppc64.h | 4 ++
> arch/powerpc/mm/init_64.c | 76 +++++++++++++++++++++++++++++++-
> include/linux/page-flags.h | 4 +-
> 3 files changed, 82 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
> index 46db094..aa7b169 100644
> --- a/arch/powerpc/include/asm/pgtable-ppc64.h
> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
> @@ -394,6 +394,10 @@ static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array,
> hpte_slot_array[index] = hidx << 4 | 0x1 << 3;
> }
>
> +struct page *realmode_pfn_to_page(unsigned long pfn);
> +int realmode_get_page(struct page *page);
> +int realmode_put_page(struct page *page);
> +
> static inline char *get_hpte_slot_array(pmd_t *pmdp)
> {
> /*
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index d0cd9e4..dcbb806 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -300,5 +300,79 @@ void vmemmap_free(unsigned long start, unsigned long end)
> {
> }
>
> -#endif /* CONFIG_SPARSEMEM_VMEMMAP */
> +/*
> + * We do not have access to the sparsemem vmemmap, so we fallback to
> + * walking the list of sparsemem blocks which we already maintain for
> + * the sake of crashdump. In the long run, we might want to maintain
> + * a tree if performance of that linear walk becomes a problem.
> + *
> + * Any of realmode_XXXX functions can fail due to:
> + * 1) As real sparsemem blocks do not lay in RAM continously (they
> + * are in virtual address space which is not available in the real mode),
> + * the requested page struct can be split between blocks so get_page/put_page
> + * may fail.
> + * 2) When huge pages are used, the get_page/put_page API will fail
> + * in real mode as the linked addresses in the page struct are virtual
> + * too.
> + */
> +struct page *realmode_pfn_to_page(unsigned long pfn)
> +{
> + struct vmemmap_backing *vmem_back;
> + struct page *page;
> + unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
> + unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
>
> + for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
> + if (pg_va < vmem_back->virt_addr)
> + continue;
> +
> + /* Check that page struct is not split between real pages */
> + if ((pg_va + sizeof(struct page)) >
> + (vmem_back->virt_addr + page_size))
> + return NULL;
> +
> + page = (struct page *) (vmem_back->phys + pg_va -
> + vmem_back->virt_addr);
> + return page;
> + }
> +
> + return NULL;
> +}
> +EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
> +
> +#elif defined(CONFIG_FLATMEM)
> +
> +struct page *realmode_pfn_to_page(unsigned long pfn)
> +{
> + struct page *page = pfn_to_page(pfn);
> + return page;
> +}
> +EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
> +
> +#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
> +
> +#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
> +int realmode_get_page(struct page *page)
> +{
> + if (PageCompound(page))
> + return -EAGAIN;
> +
> + if (!atomic_inc_not_zero(&page->_count))
> + return -EAGAIN;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(realmode_get_page);
> +
> +int realmode_put_page(struct page *page)
> +{
> + if (PageCompound(page))
> + return -EAGAIN;
> +
> + if (!atomic_add_unless(&page->_count, -1, 1))
> + return -EAGAIN;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(realmode_put_page);
> +#endif
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 6d53675..98ada58 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -329,7 +329,9 @@ static inline void set_page_writeback(struct page *page)
> * System with lots of page flags available. This allows separate
> * flags for PageHead() and PageTail() checks of compound pages so that bit
> * tests can be used in performance sensitive paths. PageCompound is
> - * generally not used in hot code paths.
> + * generally not used in hot code paths except arch/powerpc/mm/init_64.c
> + * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages
> + * and avoid handling those in real mode.
> */
> __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
> __PAGEFLAG(Tail, tail)
>
--
Alexey
On Tue, 2013-07-16 at 10:53 +1000, Alexey Kardashevskiy wrote:
> VFIO is designed to be used via ioctls on file descriptors
> returned by VFIO.
>
> However in some situations support for an external user is required.
> The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
> use the existing VFIO groups for exclusive access in real/virtual mode
> on a host to avoid passing map/unmap requests to the user space which
> would made things pretty slow.
>
> The protocol includes:
>
> 1. do normal VFIO init operation:
> - opening a new container;
> - attaching group(s) to it;
> - setting an IOMMU driver for a container.
> When IOMMU is set for a container, all groups in it are
> considered ready to use by an external user.
>
> 2. User space passes a group fd to an external user.
> The external user calls vfio_group_get_external_user()
> to verify that:
> - the group is initialized;
> - IOMMU is set for it.
> If both checks passed, vfio_group_get_external_user()
> increments the container user counter to prevent
> the VFIO group from disposal before KVM exits.
>
> 3. The external user calls vfio_external_user_iommu_id()
> to know an IOMMU ID. PPC64 KVM uses it to link logical bus
> number (LIOBN) with IOMMU ID.
>
> 4. When the external KVM finishes, it calls
> vfio_group_put_external_user() to release the VFIO group.
> This call decrements the container user counter.
> Everything gets released.
>
> The "vfio: Limit group opens" patch is also required for the consistency.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
This looks fine to me. Is the plan to add this through the ppc tree
again? Thanks,
Alex
> ---
> Changes:
> 2013/07/11:
> * added vfio_group_get()/vfio_group_put()
> * protocol description changed
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> drivers/vfio/vfio.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/vfio.h | 7 ++++++
> 2 files changed, 69 insertions(+)
>
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index c488da5..58b034b 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1370,6 +1370,68 @@ static const struct file_operations vfio_device_fops = {
> };
>
> /**
> + * External user API, exported by symbols to be linked dynamically.
> + *
> + * The protocol includes:
> + * 1. do normal VFIO init operation:
> + * - opening a new container;
> + * - attaching group(s) to it;
> + * - setting an IOMMU driver for a container.
> + * When IOMMU is set for a container, all groups in it are
> + * considered ready to use by an external user.
> + *
> + * 2. User space passes a group fd to an external user.
> + * The external user calls vfio_group_get_external_user()
> + * to verify that:
> + * - the group is initialized;
> + * - IOMMU is set for it.
> + * If both checks passed, vfio_group_get_external_user()
> + * increments the container user counter to prevent
> + * the VFIO group from disposal before KVM exits.
> + *
> + * 3. The external user calls vfio_external_user_iommu_id()
> + * to know an IOMMU ID.
> + *
> + * 4. When the external KVM finishes, it calls
> + * vfio_group_put_external_user() to release the VFIO group.
> + * This call decrements the container user counter.
> + */
> +struct vfio_group *vfio_group_get_external_user(struct file *filep)
> +{
> + struct vfio_group *group = filep->private_data;
> +
> + if (filep->f_op != &vfio_group_fops)
> + return ERR_PTR(-EINVAL);
> +
> + if (!atomic_inc_not_zero(&group->container_users))
> + return ERR_PTR(-EINVAL);
> +
> + if (!group->container->iommu_driver ||
> + !vfio_group_viable(group)) {
> + atomic_dec(&group->container_users);
> + return ERR_PTR(-EINVAL);
> + }
> +
> + vfio_group_get(group);
> +
> + return group;
> +}
> +EXPORT_SYMBOL_GPL(vfio_group_get_external_user);
> +
> +void vfio_group_put_external_user(struct vfio_group *group)
> +{
> + vfio_group_put(group);
> + vfio_group_try_dissolve_container(group);
> +}
> +EXPORT_SYMBOL_GPL(vfio_group_put_external_user);
> +
> +int vfio_external_user_iommu_id(struct vfio_group *group)
> +{
> + return iommu_group_id(group->iommu_group);
> +}
> +EXPORT_SYMBOL_GPL(vfio_external_user_iommu_id);
> +
> +/**
> * Module/class support
> */
> static char *vfio_devnode(struct device *dev, umode_t *mode)
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index ac8d488..24579a0 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -90,4 +90,11 @@ extern void vfio_unregister_iommu_driver(
> TYPE tmp; \
> offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); }) \
>
> +/*
> + * External user API
> + */
> +extern struct vfio_group *vfio_group_get_external_user(struct file *filep);
> +extern void vfio_group_put_external_user(struct vfio_group *group);
> +extern int vfio_external_user_iommu_id(struct vfio_group *group);
> +
> #endif /* VFIO_H */
On 07/23/2013 12:23 PM, Alex Williamson wrote:
> On Tue, 2013-07-16 at 10:53 +1000, Alexey Kardashevskiy wrote:
>> VFIO is designed to be used via ioctls on file descriptors
>> returned by VFIO.
>>
>> However in some situations support for an external user is required.
>> The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
>> use the existing VFIO groups for exclusive access in real/virtual mode
>> on a host to avoid passing map/unmap requests to the user space which
>> would made things pretty slow.
>>
>> The protocol includes:
>>
>> 1. do normal VFIO init operation:
>> - opening a new container;
>> - attaching group(s) to it;
>> - setting an IOMMU driver for a container.
>> When IOMMU is set for a container, all groups in it are
>> considered ready to use by an external user.
>>
>> 2. User space passes a group fd to an external user.
>> The external user calls vfio_group_get_external_user()
>> to verify that:
>> - the group is initialized;
>> - IOMMU is set for it.
>> If both checks passed, vfio_group_get_external_user()
>> increments the container user counter to prevent
>> the VFIO group from disposal before KVM exits.
>>
>> 3. The external user calls vfio_external_user_iommu_id()
>> to know an IOMMU ID. PPC64 KVM uses it to link logical bus
>> number (LIOBN) with IOMMU ID.
>>
>> 4. When the external KVM finishes, it calls
>> vfio_group_put_external_user() to release the VFIO group.
>> This call decrements the container user counter.
>> Everything gets released.
>>
>> The "vfio: Limit group opens" patch is also required for the consistency.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>
> This looks fine to me. Is the plan to add this through the ppc tree
> again? Thanks,
Nope, better to add this through your tree. And faster for sure :) Thanks!
--
Alexey
On Tue, 23 Jul 2013 12:22:59 +1000 Alexey Kardashevskiy <[email protected]> wrote:
> Ping, anyone, please?
ew, you top-posted.
> Ben needs ack from any of MM people before proceeding with this patch. Thanks!
For what? The three lines of comment in page-flags.h? ack :)
Manipulating page->_count directly is considered poor form. Don't
blame us if we break your code ;)
Actually, the manipulation in realmode_get_page() duplicates the
existing get_page_unless_zero() and the one in realmode_put_page()
could perhaps be placed in mm.h with a suitable name and some
documentation. That would improve your form and might protect the code
from getting broken later on.
On Wed, 2013-07-24 at 15:43 -0700, Andrew Morton wrote:
> For what? The three lines of comment in page-flags.h? ack :)
>
> Manipulating page->_count directly is considered poor form. Don't
> blame us if we break your code ;)
>
> Actually, the manipulation in realmode_get_page() duplicates the
> existing get_page_unless_zero() and the one in realmode_put_page()
> could perhaps be placed in mm.h with a suitable name and some
> documentation. That would improve your form and might protect the code
> from getting broken later on.
Yes, this stuff makes me really nervous :-) If it didn't provide an order
of magnitude performance improvement in KVM I would avoid it but heh...
Alexey, I like having that stuff in generic code.
However the meaning of the words "real mode" can be ambiguous across
architectures, so it might be best to name it "mmu_off_put_page" to
make things a bit clearer, along with a comment explaining that this is
called in a context where none of the virtual mappings are accessible
(vmalloc, vmemmap, IOs, ...), and that in the case of sparsemem vmemmap
the caller must have taken care of getting the physical address of the
struct page and of ensuring it isn't split across two vmemmap blocks.
Cheers,
Ben.
The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a "fast" hypercall directly in "real
mode" (processor still in the guest context but MMU off).
To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.
This adds an API to get the page struct when the MMU is off.
It also adds to MM a new function put_page_unless_one() which drops a page
reference only if the counter is greater than 1. It is going to be used when
the MMU is off (real mode on PPC64 is the first user) and we want to make
sure that the final page release will not happen in real mode as it may
crash the kernel in a horrible way.
CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.
Cc: [email protected]
Reviewed-by: Paul Mackerras <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
2013/07/25:
* removed realmode_put_page and added put_page_unless_one() instead.
The name has been chosen to conform to the already existing get_page_unless_zero().
* removed realmode_get_page. Instead, get_page_unless_zero() will be used
* realmode_pfn_to_page fixed to return NULL for compound pages
2013/07/10:
* adjusted comment (removed sentence about virtual mode)
* get_page_unless_zero replaced with atomic_inc_not_zero to minimize
effect of a possible get_page_unless_zero() rework (if it ever happens).
2013/06/27:
* realmode_get_page() fixed to use get_page_unless_zero(). If failed,
the call will be passed from real to virtual mode and safely handled.
* added comment to PageCompound() in include/linux/page-flags.h.
2013/05/20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++
arch/powerpc/mm/init_64.c | 54 +++++++++++++++++++++++++++++++-
include/linux/mm.h | 14 +++++++++
include/linux/page-flags.h | 4 ++-
4 files changed, 72 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 46db094..4a191c4 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -394,6 +394,8 @@ static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array,
hpte_slot_array[index] = hidx << 4 | 0x1 << 3;
}
+struct page *realmode_pfn_to_page(unsigned long pfn);
+
static inline char *get_hpte_slot_array(pmd_t *pmdp)
{
/*
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index d0cd9e4..d2bb97c 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -300,5 +300,57 @@ void vmemmap_free(unsigned long start, unsigned long end)
{
}
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fallback to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * realmode_pfn_to_page functions can fail due to:
+ * 1) As the real sparsemem blocks are not contiguous in RAM (they are
+ * only contiguous in virtual address space, which is not available in
+ * real mode), the requested page struct can be split between blocks,
+ * so get_page/put_page may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct vmemmap_backing *vmem_back;
+ struct page *page;
+ unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+ unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+ for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+ if (pg_va < vmem_back->virt_addr)
+ continue;
+
+ /* Check that page struct is not split between real pages */
+ if ((pg_va + sizeof(struct page)) >
+ (vmem_back->virt_addr + page_size))
+ return NULL;
+
+ page = (struct page *) (vmem_back->phys + pg_va -
+ vmem_back->virt_addr);
+
+ if (PageCompound(page))
+ return NULL;
+
+ return page;
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+ return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f022460..ef07aea 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -290,12 +290,26 @@ static inline int put_page_testzero(struct page *page)
/*
* Try to grab a ref unless the page has a refcount of zero, return false if
* that is the case.
+ * This can be called when the MMU is off so it must not access
+ * any of the virtual mappings.
*/
static inline int get_page_unless_zero(struct page *page)
{
return atomic_inc_not_zero(&page->_count);
}
+/*
+ * Try to drop a ref unless the page has a refcount of one, return false if
+ * that is the case.
+ * This is to make sure that the refcount won't become zero after this drop.
+ * This can be called when the MMU is off so it must not access
+ * any of the virtual mappings.
+ */
+static inline int put_page_unless_one(struct page *page)
+{
+ return atomic_add_unless(&page->_count, -1, 1);
+}
+
extern int page_is_ram(unsigned long pfn);
/* Support for virtually mapped pages */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..98ada58 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -329,7 +329,9 @@ static inline void set_page_writeback(struct page *page)
* System with lots of page flags available. This allows separate
* flags for PageHead() and PageTail() checks of compound pages so that bit
* tests can be used in performance sensitive paths. PageCompound is
- * generally not used in hot code paths.
+ * generally not used in hot code paths except arch/powerpc/mm/init_64.c
+ * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages
+ * and avoid handling those in real mode.
*/
__PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
__PAGEFLAG(Tail, tail)
--
1.8.3.2
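A minimal sketch of how a real mode caller is expected to combine these
helpers; the two functions below are illustrative only, while
realmode_pfn_to_page(), get_page_unless_zero() and put_page_unless_one()
come from this patch and existing mm code:

/* Called with the MMU off: no vmalloc/vmemmap virtual addresses may be touched */
static int example_rm_get_page(unsigned long pfn)
{
	struct page *page = realmode_pfn_to_page(pfn);

	/* NULL covers split page structs and compound (huge) pages */
	if (!page || !get_page_unless_zero(page))
		return -EAGAIN;		/* retry in virtual mode */
	return 0;
}

static int example_rm_put_page(unsigned long pfn)
{
	struct page *page = realmode_pfn_to_page(pfn);

	/* never drop the last reference while the MMU is off */
	if (!page || !put_page_unless_one(page))
		return -EAGAIN;		/* let virtual mode do the final put */
	return 0;
}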
On 07/25/2013 08:26 PM, Alexey Kardashevskiy wrote:
> The current VFIO-on-POWER implementation supports only user mode
> driven mapping, i.e. QEMU is sending requests to map/unmap pages.
> However this approach is really slow, so we want to move that to KVM.
> Since H_PUT_TCE can be extremely performance sensitive (especially with
> network adapters where each packet needs to be mapped/unmapped) we chose
> to implement that as a "fast" hypercall directly in "real
> mode" (processor still in the guest context but MMU off).
>
> To be able to do that, we need to provide some facilities to
> access the struct page count within that real mode environment as things
> like the sparsemem vmemmap mappings aren't accessible.
>
> This adds an API to get page struct when MMU is off.
>
> This adds to MM a new function put_page_unless_one() which drops a page
> if counter is bigger than 1. It is going to be used when MMU is off
> (real mode on PPC64 is the first user) and we want to make sure that page
> release will not happen in real mode as it may crash the kernel in
> a horrible way.
Yes, my english needs to be polished, I even know where :)
--
Alexey
On Tue, 2013-07-23 at 19:07 +1000, Alexey Kardashevskiy wrote:
> On 07/23/2013 12:23 PM, Alex Williamson wrote:
> > On Tue, 2013-07-16 at 10:53 +1000, Alexey Kardashevskiy wrote:
> >> VFIO is designed to be used via ioctls on file descriptors
> >> returned by VFIO.
> >>
> >> However in some situations support for an external user is required.
> >> The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
> >> use the existing VFIO groups for exclusive access in real/virtual mode
> >> on a host to avoid passing map/unmap requests to the user space which
> >> would made things pretty slow.
> >>
> >> The protocol includes:
> >>
> >> 1. do normal VFIO init operation:
> >> - opening a new container;
> >> - attaching group(s) to it;
> >> - setting an IOMMU driver for a container.
> >> When IOMMU is set for a container, all groups in it are
> >> considered ready to use by an external user.
> >>
> >> 2. User space passes a group fd to an external user.
> >> The external user calls vfio_group_get_external_user()
> >> to verify that:
> >> - the group is initialized;
> >> - IOMMU is set for it.
> >> If both checks passed, vfio_group_get_external_user()
> >> increments the container user counter to prevent
> >> the VFIO group from disposal before KVM exits.
> >>
> >> 3. The external user calls vfio_external_user_iommu_id()
> >> to know an IOMMU ID. PPC64 KVM uses it to link logical bus
> >> number (LIOBN) with IOMMU ID.
> >>
> >> 4. When the external KVM finishes, it calls
> >> vfio_group_put_external_user() to release the VFIO group.
> >> This call decrements the container user counter.
> >> Everything gets released.
> >>
> >> The "vfio: Limit group opens" patch is also required for the consistency.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >
> > This looks fine to me. Is the plan to add this through the ppc tree
> > again? Thanks,
>
>
> Nope, better to add this through your tree. And faster for sure :) Thanks!
Applied to my next branch for v3.12. Thanks,
Alex