The changes are:
1. rebased on v3.10-rc7
2. removed spinlocks from real mode
3. added security checks between KVM and VFIO
More details are in the individual patch comments.
Alexey Kardashevskiy (8):
KVM: PPC: reserve a capability number for multitce support
KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO
vfio: add external user support
hashtable: add hash_for_each_possible_rcu_notrace()
powerpc: Prepare to support kernel handling of IOMMU map/unmap
KVM: PPC: Add support for multiple-TCE hcalls
KVM: PPC: Add support for IOMMU in-kernel handling
KVM: PPC: Add hugepage support for IOMMU in-kernel handling
Documentation/virtual/kvm/api.txt | 51 +++
arch/powerpc/include/asm/kvm_host.h | 31 ++
arch/powerpc/include/asm/kvm_ppc.h | 18 +-
arch/powerpc/include/asm/pgtable-ppc64.h | 4 +
arch/powerpc/include/uapi/asm/kvm.h | 8 +
arch/powerpc/kvm/book3s_64_vio.c | 506 +++++++++++++++++++++++++++++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 439 ++++++++++++++++++++++++--
arch/powerpc/kvm/book3s_hv.c | 41 ++-
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
arch/powerpc/kvm/book3s_pr_papr.c | 37 ++-
arch/powerpc/kvm/powerpc.c | 15 +
arch/powerpc/mm/init_64.c | 78 ++++-
drivers/vfio/vfio.c | 53 ++++
include/linux/hashtable.h | 15 +
include/linux/page-flags.h | 4 +-
include/uapi/linux/kvm.h | 3 +
16 files changed, 1279 insertions(+), 30 deletions(-)
--
1.7.10.4
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
include/uapi/linux/kvm.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d88c8ee..970b1f5 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info {
#define KVM_CAP_IRQ_MPIC 90
#define KVM_CAP_PPC_RTAS 91
#define KVM_CAP_IRQ_XICS 92
+#define KVM_CAP_SPAPR_MULTITCE 93
#ifdef KVM_CAP_IRQ_ROUTING
--
1.7.10.4
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
include/uapi/linux/kvm.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 970b1f5..0865c01 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
#define KVM_CAP_PPC_RTAS 91
#define KVM_CAP_IRQ_XICS 92
#define KVM_CAP_SPAPR_MULTITCE 93
+#define KVM_CAP_SPAPR_TCE_IOMMU 94
#ifdef KVM_CAP_IRQ_ROUTING
@@ -923,6 +924,7 @@ struct kvm_s390_ucas_mapping {
/* Available with KVM_CAP_PPC_ALLOC_HTAB */
#define KVM_PPC_ALLOCATE_HTAB _IOWR(KVMIO, 0xa7, __u32)
#define KVM_CREATE_SPAPR_TCE _IOW(KVMIO, 0xa8, struct kvm_create_spapr_tce)
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xaf, struct kvm_create_spapr_tce_iommu)
/* Available with KVM_CAP_RMA */
#define KVM_ALLOCATE_RMA _IOR(KVMIO, 0xa9, struct kvm_allocate_rma)
/* Available with KVM_CAP_PPC_HTAB_FD */
--
1.7.10.4
This adds hash_for_each_possible_rcu_notrace() which is basically
a notrace clone of hash_for_each_possible_rcu(), which cannot be
used in real mode due to its tracing/debugging capability.
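For illustration, a minimal (hypothetical) lookup that could run with tracing
disabled, e.g. from a real mode handler; the structure and table names below
are made up, and the iteration pattern is identical to
hash_for_each_possible_rcu():

    struct entry {
            unsigned long key;
            struct hlist_node node;
    };

    static DEFINE_HASHTABLE(table, 6);

    /* Caller is expected to be in an RCU read-side critical section */
    static struct entry *lookup_notrace(unsigned long key)
    {
            struct entry *e;

            hash_for_each_possible_rcu_notrace(table, e, node, key)
                    if (e->key == key)
                            return e;

            return NULL;
    }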
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
include/linux/hashtable.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/include/linux/hashtable.h b/include/linux/hashtable.h
index a9df51f..af8b169 100644
--- a/include/linux/hashtable.h
+++ b/include/linux/hashtable.h
@@ -174,6 +174,21 @@ static inline void hash_del_rcu(struct hlist_node *node)
member)
/**
+ * hash_for_each_possible_rcu_notrace - iterate over all possible objects hashing
+ * to the same bucket in an rcu enabled hashtable
+ * @name: hashtable to iterate
+ * @obj: the type * to use as a loop cursor for each entry
+ * @member: the name of the hlist_node within the struct
+ * @key: the key of the objects to iterate over
+ *
+ * This is the same as hash_for_each_possible_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define hash_for_each_possible_rcu_notrace(name, obj, member, key) \
+ hlist_for_each_entry_rcu_notrace(obj, &name[hash_min(key, HASH_BITS(name))],\
+ member)
+
+/**
* hash_for_each_possible_safe - iterate over all possible objects hashing to the
* same bucket safe against removals
* @name: hashtable to iterate
--
1.7.10.4
The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However, this approach is really slow, so we want to move that into KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a "fast" hypercall directly in "real
mode" (processor still in the guest context but MMU off).
To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.
This adds an API to increment/decrement the page counter as the
get_user_pages API used for user mode mapping does not work
in real mode.
CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.
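A sketch of the intended calling pattern from a real mode hypercall handler
(hypothetical caller, error handling abbreviated):

    struct page *pg = realmode_pfn_to_page(pfn);

    /* Any failure means "retry the whole operation in virtual mode" */
    if (!pg || realmode_get_page(pg))
            return H_TOO_HARD;

    /* ... access the page ... */

    if (realmode_put_page(pg))
            return H_TOO_HARD;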
Reviewed-by: Paul Mackerras <[email protected]>
Cc: David Gibson <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
2013/06/27:
* realmode_get_page() fixed to use get_page_unless_zero(). If it fails,
the call will be passed from real to virtual mode and safely handled.
* added comment to PageCompound() in include/linux/page-flags.h.
2013/05/20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()
---
arch/powerpc/include/asm/pgtable-ppc64.h | 4 ++
arch/powerpc/mm/init_64.c | 78 +++++++++++++++++++++++++++++-
include/linux/page-flags.h | 4 +-
3 files changed, 84 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index e3d55f6f..7b46e5f 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
}
#endif /* !CONFIG_HUGETLB_PAGE */
+struct page *realmode_pfn_to_page(unsigned long pfn);
+int realmode_get_page(struct page *page);
+int realmode_put_page(struct page *page);
+
#endif /* __ASSEMBLY__ */
#endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a90b9c4..7031be3 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -297,5 +297,81 @@ void vmemmap_free(unsigned long start, unsigned long end)
{
}
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fall back to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * Any of realmode_XXXX functions can fail due to:
+ * 1) As real sparsemem blocks do not lie in RAM contiguously (they
+ * are in virtual address space which is not available in real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ * When 1) or 2) takes place, the API returns an error code to cause
+ * an exit to kernel virtual mode where the operation will be completed.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct vmemmap_backing *vmem_back;
+ struct page *page;
+ unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+ unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+ for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+ if (pg_va < vmem_back->virt_addr)
+ continue;
+ /* Check that page struct is not split between real pages */
+ if ((pg_va + sizeof(struct page)) >
+ (vmem_back->virt_addr + page_size))
+ return NULL;
+
+ page = (struct page *) (vmem_back->phys + pg_va -
+ vmem_back->virt_addr);
+ return page;
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+ return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+int realmode_get_page(struct page *page)
+{
+ if (PageCompound(page))
+ return -EAGAIN;
+
+ if (!get_page_unless_zero(page))
+ return -EAGAIN;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_get_page);
+
+int realmode_put_page(struct page *page)
+{
+ if (PageCompound(page))
+ return -EAGAIN;
+
+ if (!atomic_add_unless(&page->_count, -1, 1))
+ return -EAGAIN;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_put_page);
+#endif
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..98ada58 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -329,7 +329,9 @@ static inline void set_page_writeback(struct page *page)
* System with lots of page flags available. This allows separate
* flags for PageHead() and PageTail() checks of compound pages so that bit
* tests can be used in performance sensitive paths. PageCompound is
- * generally not used in hot code paths.
+ * generally not used in hot code paths except arch/powerpc/mm/init_64.c
+ * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages
+ * and avoid handling those in real mode.
*/
__PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
__PAGEFLAG(Tail, tail)
--
1.7.10.4
This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
devices or emulated PCI. These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.
This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user and verified) before writing the whole list into
the TCE table. This cache will be utilized more in the upcoming
VFIO/IOMMU support to continue TCE list processing in the virtual
mode in case the real mode handler failed for some reason.
This adds a guest physical to host real address converter
and calls the existing H_PUT_TCE handler. The converting function
is going to be fully utilized by upcoming VFIO supporting patches.
This also implements the KVM_CAP_PPC_MULTITCE capability,
so in order to support the functionality of this patch, QEMU
needs to query for this capability and set the "hcall-multi-tce"
hypertas property only if the capability is present, otherwise
there will be serious performance degradation.
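A hedged sketch of the expected userspace check (the helper name is made up;
KVM_CHECK_EXTENSION is the existing KVM ioctl and KVM_CAP_SPAPR_MULTITCE is
the capability number reserved earlier in this series):

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Returns non-zero if the in-kernel multi-TCE handlers are available */
    static int multitce_supported(int kvm_fd)
    {
            return ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_MULTITCE) > 0;
    }

Only if this returns non-zero should "hcall-multi-tce" be added to the
"ibm,hypertas-functions" property.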
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changelog:
2013/06/27:
* fixed clear of BUSY bit in kvmppc_lookup_pte()
* H_PUT_TCE_INDIRECT does realmode_get_page() now
* KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
* updated doc
2013/06/05:
* fixed typo about IBMVIO in the commit message
* updated doc and moved it to another section
* changed capability number
2013/05/21:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed, instead we do not even start
writing to TCE table if we cannot get TCEs from the user and they are
invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with failthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
---
Documentation/virtual/kvm/api.txt | 25 +++
arch/powerpc/include/asm/kvm_host.h | 2 +
arch/powerpc/include/asm/kvm_ppc.h | 16 +-
arch/powerpc/kvm/book3s_64_vio.c | 123 ++++++++++++++
arch/powerpc/kvm/book3s_64_vio_hv.c | 270 +++++++++++++++++++++++++++----
arch/powerpc/kvm/book3s_hv.c | 41 ++++-
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++-
arch/powerpc/kvm/powerpc.c | 3 +
9 files changed, 490 insertions(+), 33 deletions(-)
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 6365fef..762c703 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed to userspace to be
handled.
+4.86 KVM_CAP_PPC_MULTITCE
+
+Capability: KVM_CAP_PPC_MULTITCE
+Architectures: ppc
+Type: vm
+
+This capability means the kernel is capable of handling hypercalls
+H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
+space. This significantly accelerates DMA operations for PPC KVM guests.
+The user space should expect that its handlers for these hypercalls
+are not going to be called.
+
+In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
+the user space might have to advertise it for the guest. For example,
+an IBM pSeries guest starts using them if "hcall-multi-tce" is present in
+the "ibm,hypertas-functions" device-tree property.
+
+Without this capability, only H_PUT_TCE is handled by the kernel and
+therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
+unless the capability is present, as passing these hypercalls to
+userspace slows operations down considerably.
+
+Unlike other capabilities of this section, this one is always enabled.
+
+
5. The kvm_run structure
------------------------
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index af326cd..3bf407b 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
spinlock_t tbacct_lock;
u64 busy_stolen;
u64 busy_preempt;
+
+ unsigned long *tce_tmp; /* TCE cache for TCE_PUT_INDIRECT hcall */
#endif
};
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index a5287fe..e852921b 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
-extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
- unsigned long ioba, unsigned long tce);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
+ struct kvm_vcpu *vcpu, unsigned long liobn);
+extern long kvmppc_emulated_validate_tce(unsigned long tce);
+extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+ unsigned long ioba, unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_list, unsigned long npages);
+extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages);
extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
struct kvm_allocate_rma *rma);
extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index b2d3f3b..45ee05a 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -14,6 +14,7 @@
*
* Copyright 2010 Paul Mackerras, IBM Corp. <[email protected]>
* Copyright 2011 David Gibson, IBM Corporation <[email protected]>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <[email protected]>
*/
#include <linux/types.h>
@@ -36,8 +37,11 @@
#include <asm/ppc-opcode.h>
#include <asm/kvm_host.h>
#include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR ((void *)~(unsigned long)0x0)
static long kvmppc_stt_npages(unsigned long window_size)
{
@@ -148,3 +152,122 @@ fail:
}
return ret;
}
+
+/* Converts guest physical address to host virtual address */
+static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
+ unsigned long gpa)
+{
+ unsigned long hva, gfn = gpa >> PAGE_SHIFT;
+ struct kvm_memory_slot *memslot;
+
+ memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+ if (!memslot)
+ return ERROR_ADDR;
+
+ hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+ return (void *) hva;
+}
+
+long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce)
+{
+ long ret;
+ struct kvmppc_spapr_tce_table *tt;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, put it to userspace */
+ if (!tt)
+ return H_TOO_HARD;
+
+ if (ioba >= tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_emulated_validate_tce(tce);
+ if (ret)
+ return ret;
+
+ kvmppc_emulated_put_tce(tt, ioba, tce);
+
+ return H_SUCCESS;
+}
+
+long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_list, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret = H_SUCCESS;
+ unsigned long __user *tces;
+ struct page *pg;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, put it to userspace */
+ if (!tt)
+ return H_TOO_HARD;
+
+ /*
+ * The spec says that the maximum size of the list is 512 TCEs, so
+ * the whole list fits within a single 4K page
+ */
+ if (npages > 512)
+ return H_PARAMETER;
+
+ if (tce_list & ~IOMMU_PAGE_MASK)
+ return H_PARAMETER;
+
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
+ if (tces == ERROR_ADDR)
+ return H_TOO_HARD;
+
+ if (get_user_pages_fast((unsigned long) tces & PAGE_MASK,
+ 1, 0, &pg) != 1)
+ return H_TOO_HARD;
+
+ for (i = 0; i < npages; ++i) {
+ if (get_user(vcpu->arch.tce_tmp[i], tces + i)) {
+ ret = H_PARAMETER;
+ goto put_list_page_exit;
+ }
+
+ ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp[i]);
+ if (ret)
+ goto put_list_page_exit;
+ }
+
+ for (i = 0; i < npages; ++i)
+ kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+ vcpu->arch.tce_tmp[i]);
+put_list_page_exit:
+ put_page(pg);
+
+ return ret;
+}
+
+long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, put it to userspace */
+ if (!tt)
+ return H_TOO_HARD;
+
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_emulated_validate_tce(tce_value);
+ if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+ return H_PARAMETER;
+
+ for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+ kvmppc_emulated_put_tce(tt, ioba, tce_value);
+
+ return H_SUCCESS;
+}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 30c2f3b..108cc08 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -14,6 +14,7 @@
*
* Copyright 2010 Paul Mackerras, IBM Corp. <[email protected]>
* Copyright 2011 David Gibson, IBM Corporation <[email protected]>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <[email protected]>
*/
#include <linux/types.h>
@@ -35,42 +36,253 @@
#include <asm/ppc-opcode.h>
#include <asm/kvm_host.h>
#include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR (~(unsigned long)0x0)
-/* WARNING: This will be called in real-mode on HV KVM and virtual
- * mode on PR KVM
+/* Finds a TCE table descriptor by LIOBN */
+struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
+ unsigned long liobn)
+{
+ struct kvmppc_spapr_tce_table *tt;
+
+ list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
+ if (tt->liobn == liobn)
+ return tt;
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
+
+/*
+ * Validates TCE address.
+ * At the moment only flags are validated as other checks would significantly
+ * slow down or even make it impossible to handle TCE requests in real mode.
+ */
+long kvmppc_emulated_validate_tce(unsigned long tce)
+{
+ if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
+ return H_PARAMETER;
+
+ return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
+
+/*
+ * Handles TCE requests for QEMU emulated devices.
+ * Puts guest TCE values into the table and expects QEMU to convert them
+ * later in a QEMU device implementation.
+ * Called in both real and virtual modes.
+ * Cannot fail, so kvmppc_emulated_validate_tce() must be called beforehand.
+ */
+void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+ unsigned long ioba, unsigned long tce)
+{
+ unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+ struct page *page;
+ u64 *tbl;
+
+ /*
+ * Note on the use of page_address() in real mode,
+ *
+ * It is safe to use page_address() in real mode on ppc64 because
+ * page_address() is always defined as lowmem_page_address()
+ * which returns __va(PFN_PHYS(page_to_pfn(page))), which is an arithmetic
+ * operation and does not access the page struct.
+ *
+ * Theoretically page_address() could be defined differently
+ * but only if either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
+ * is enabled.
+ * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
+ * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
+ * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
+ * is not expected to be enabled on ppc32, page_address()
+ * is safe for ppc32 as well.
+ */
+#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
+#error TODO: fix to avoid page_address() here
+#endif
+ page = tt->pages[idx / TCES_PER_PAGE];
+ tbl = (u64 *)page_address(page);
+
+ /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
+ tbl[idx % TCES_PER_PAGE] = tce;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
+
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+/*
+ * Converts guest physical address to host physical address.
+ * Tries to increase page counter via realmode_get_page() and
+ * returns ERROR_ADDR if failed.
*/
+static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
+ unsigned long gpa)
+{
+ struct kvm_memory_slot *memslot;
+ pte_t *ptep, pte;
+ unsigned long hva, hpa = ERROR_ADDR;
+ unsigned long gfn = gpa >> PAGE_SHIFT;
+ unsigned shift = 0;
+ struct page *pg;
+
+ memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+ if (!memslot)
+ return ERROR_ADDR;
+
+ hva = __gfn_to_hva_memslot(memslot, gfn);
+
+ ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+ if (!ptep || !pte_present(*ptep))
+ return ERROR_ADDR;
+ pte = *ptep;
+
+ if (((gpa & TCE_PCI_WRITE) || pte_write(pte)) && !pte_dirty(pte))
+ return ERROR_ADDR;
+
+ if (!pte_young(pte))
+ return ERROR_ADDR;
+
+ if (!shift)
+ shift = PAGE_SHIFT;
+
+ /* Put huge pages handling to the virtual mode */
+ if (shift > PAGE_SHIFT)
+ return ERROR_ADDR;
+
+ pg = realmode_pfn_to_page(pte_pfn(pte));
+ if (!pg || realmode_get_page(pg))
+ return ERROR_ADDR;
+
+ /* pte_pfn(pte) returns address aligned to pg_size */
+ hpa = (pte_pfn(pte) << PAGE_SHIFT) +
+ (gpa & ((1 << shift) - 1));
+
+ if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+ hpa = ERROR_ADDR;
+ realmode_put_page(pg);
+ }
+
+ return hpa;
+}
+
+static long kvmppc_realmode_put_hpa(unsigned long hpa, bool dirty)
+{
+ struct page *pg;
+
+ pg = realmode_pfn_to_page(hpa >> PAGE_SHIFT);
+ if (!pg)
+ return H_TOO_HARD;
+
+ if (dirty)
+ SetPageDirty(pg);
+
+ if (PageCompound(pg))
+ return H_SUCCESS;
+
+ if (realmode_put_page(pg))
+ return H_TOO_HARD;
+
+ return H_SUCCESS;
+}
+
long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce)
{
- struct kvm *kvm = vcpu->kvm;
- struct kvmppc_spapr_tce_table *stt;
-
- /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
- /* liobn, ioba, tce); */
-
- list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
- if (stt->liobn == liobn) {
- unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
- struct page *page;
- u64 *tbl;
-
- /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p window_size=0x%x\n", */
- /* liobn, stt, stt->window_size); */
- if (ioba >= stt->window_size)
- return H_PARAMETER;
-
- page = stt->pages[idx / TCES_PER_PAGE];
- tbl = (u64 *)page_address(page);
-
- /* FIXME: Need to validate the TCE itself */
- /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
- tbl[idx % TCES_PER_PAGE] = tce;
- return H_SUCCESS;
- }
+ long ret;
+ struct kvmppc_spapr_tce_table *tt;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, put it to virtual space */
+ if (!tt)
+ return H_TOO_HARD;
+
+ if (ioba >= tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_emulated_validate_tce(tce);
+ if (ret)
+ return ret;
+
+ kvmppc_emulated_put_tce(tt, ioba, tce);
+
+ return H_SUCCESS;
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_list, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret = H_SUCCESS;
+ unsigned long *tces;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, put it to virtual space */
+ if (!tt)
+ return H_TOO_HARD;
+
+ /*
+ * The spec says that the maximum size of the list is 512 TCEs, so
+ * the whole list fits within a single 4K page
+ */
+ if (npages > 512)
+ return H_PARAMETER;
+
+ if (tce_list & ~IOMMU_PAGE_MASK)
+ return H_PARAMETER;
+
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
+ if ((unsigned long)tces == ERROR_ADDR)
+ return H_TOO_HARD;
+
+ for (i = 0; i < npages; ++i) {
+ ret = kvmppc_emulated_validate_tce(tces[i]);
+ if (ret)
+ goto put_list_page_exit;
+ }
+
+ for (i = 0; i < npages; ++i)
+ kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+ tces[i]);
+
+put_list_page_exit:
+ if (kvmppc_realmode_put_hpa((unsigned long)tces, false)) {
+ vcpu->arch.tce_reason = H_TOO_HARD;
+ ret = H_TOO_HARD;
}
- /* Didn't find the liobn, punt it to userspace */
- return H_TOO_HARD;
+ return ret;
+}
+
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, put it to virtual space */
+ if (!tt)
+ return H_TOO_HARD;
+
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_emulated_validate_tce(tce_value);
+ if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+ return H_PARAMETER;
+
+ for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+ kvmppc_emulated_put_tce(tt, ioba, tce_value);
+
+ return H_SUCCESS;
}
+#endif /* CONFIG_KVM_BOOK3S_64_HV */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 550f592..b18b076 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -567,7 +567,31 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
if (kvmppc_xics_enabled(vcpu)) {
ret = kvmppc_xics_hcall(vcpu, req);
break;
- } /* fallthrough */
+ }
+ return RESUME_HOST;
+ case H_PUT_TCE:
+ ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+ kvmppc_get_gpr(vcpu, 5),
+ kvmppc_get_gpr(vcpu, 6));
+ if (ret == H_TOO_HARD)
+ return RESUME_HOST;
+ break;
+ case H_PUT_TCE_INDIRECT:
+ ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
+ kvmppc_get_gpr(vcpu, 5),
+ kvmppc_get_gpr(vcpu, 6),
+ kvmppc_get_gpr(vcpu, 7));
+ if (ret == H_TOO_HARD)
+ return RESUME_HOST;
+ break;
+ case H_STUFF_TCE:
+ ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+ kvmppc_get_gpr(vcpu, 5),
+ kvmppc_get_gpr(vcpu, 6),
+ kvmppc_get_gpr(vcpu, 7));
+ if (ret == H_TOO_HARD)
+ return RESUME_HOST;
+ break;
default:
return RESUME_HOST;
}
@@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
vcpu->arch.cpu_type = KVM_CPU_3S_64;
kvmppc_sanity_check(vcpu);
+ /*
+ * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
+ * half executed, we first read TCEs from the user, check them and
+ * return an error if something went wrong, and only then put TCEs into
+ * the TCE table.
+ *
+ * tce_tmp is a cache for TCEs to avoid stack allocation or
+ * kmalloc as the whole TCE list can take up to 512 entries of 8 bytes
+ * each (4096 bytes).
+ */
+ vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
+ if (!vcpu->arch.tce_tmp)
+ goto free_vcpu;
+
return vcpu;
free_vcpu:
@@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
spin_unlock(&vcpu->arch.vpa_update_lock);
+ kfree(vcpu->arch.tce_tmp);
kvm_vcpu_uninit(vcpu);
kmem_cache_free(kvm_vcpu_cache, vcpu);
}
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b02f91e..d35554e 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1490,6 +1490,12 @@ hcall_real_table:
.long 0 /* 0x11c */
.long 0 /* 0x120 */
.long .kvmppc_h_bulk_remove - hcall_real_table
+ .long 0 /* 0x128 */
+ .long 0 /* 0x12c */
+ .long 0 /* 0x130 */
+ .long 0 /* 0x134 */
+ .long .kvmppc_h_stuff_tce - hcall_real_table
+ .long .kvmppc_h_put_tce_indirect - hcall_real_table
hcall_real_table_end:
ignore_hdec:
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index da0e0bc..91d4b45 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
unsigned long tce = kvmppc_get_gpr(vcpu, 6);
long rc;
- rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
+ rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
+ if (rc == H_TOO_HARD)
+ return EMULATE_FAIL;
+ kvmppc_set_gpr(vcpu, 3, rc);
+ return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
+{
+ unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+ unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+ unsigned long tce = kvmppc_get_gpr(vcpu, 6);
+ unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+ long rc;
+
+ rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
+ tce, npages);
+ if (rc == H_TOO_HARD)
+ return EMULATE_FAIL;
+ kvmppc_set_gpr(vcpu, 3, rc);
+ return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
+{
+ unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+ unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+ unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
+ unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+ long rc;
+
+ rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
if (rc == H_TOO_HARD)
return EMULATE_FAIL;
kvmppc_set_gpr(vcpu, 3, rc);
@@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
return kvmppc_h_pr_bulk_remove(vcpu);
case H_PUT_TCE:
return kvmppc_h_pr_put_tce(vcpu);
+ case H_PUT_TCE_INDIRECT:
+ return kvmppc_h_pr_put_tce_indirect(vcpu);
+ case H_STUFF_TCE:
+ return kvmppc_h_pr_stuff_tce(vcpu);
case H_CEDE:
vcpu->arch.shared->msr |= MSR_EE;
kvm_vcpu_block(vcpu);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 6316ee3..ccb578b 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -394,6 +394,9 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_PPC_GET_SMMU_INFO:
r = 1;
break;
+ case KVM_CAP_SPAPR_MULTITCE:
+ r = 1;
+ break;
#endif
default:
r = 0;
--
1.7.10.4
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests without passing them to QEMU, which saves time
on switching to QEMU and back.
Both real and virtual modes are supported. The kernel first tries to
handle a TCE request in real mode; if that fails, it passes the request
to the virtual mode handler to complete the operation. If the virtual
mode handler also fails, the request is passed to user mode.
This adds a new KVM_CREATE_SPAPR_TCE_IOMMU ioctl (advertised via the
KVM_CAP_SPAPR_TCE_IOMMU capability) to associate
a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
in-kernel handling of IOMMU map/unmap. The external user API support
in VFIO is required.
Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).
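A hedged userspace sketch of the new ioctl (vm_fd, group_fd, liobn and
iommu_id are assumed to be already known to the caller, e.g. QEMU; the helper
name is made up and error handling is abbreviated):

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int enable_inkernel_tce(int vm_fd, int group_fd,
                                   __u64 liobn, __u32 iommu_id)
    {
            struct kvm_create_spapr_tce_iommu args = {
                    .liobn = liobn,
                    .fd = group_fd,       /* VFIO group file descriptor */
                    .iommu_id = iommu_id, /* must match the group's IOMMU ID */
                    .flags = 0,           /* no flags defined yet */
            };

            /* Returns a new file descriptor for the TCE table on success */
            return ioctl(vm_fd, KVM_CREATE_SPAPR_TCE_IOMMU, &args);
    }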
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
2013/06/27:
* tce_list page is referenced now in order to protect it from accidental
invalidation during H_PUT_TCE_INDIRECT execution
* added use of the external user VFIO API
2013/06/05:
* changed capability number
* changed ioctl number
* update the doc article number
2013/05/20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. Now the real mode handler puts
translated TCEs there, tries realmode_get_page() on them and, if that fails,
passes control to the virtual mode handler which tries to finish
handling the request
* kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
on a page
* The only reason to pass the request to user mode now is when the user mode
did not register a TCE table in the kernel; in all other cases the virtual mode
handler is expected to do the job
---
Documentation/virtual/kvm/api.txt | 26 ++++
arch/powerpc/include/asm/kvm_host.h | 4 +
arch/powerpc/include/asm/kvm_ppc.h | 2 +
arch/powerpc/include/uapi/asm/kvm.h | 8 +
arch/powerpc/kvm/book3s_64_vio.c | 294 ++++++++++++++++++++++++++++++++++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 165 ++++++++++++++++++++
arch/powerpc/kvm/powerpc.c | 12 ++
7 files changed, 509 insertions(+), 2 deletions(-)
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 762c703..01b0dc2 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2387,6 +2387,32 @@ slows operations a lot.
Unlike other capabilities of this section, this one is always enabled.
+4.87 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+struct kvm_create_spapr_tce_iommu {
+ __u64 liobn;
+ __u32 iommu_id;
+ __u32 flags;
+};
+
+This creates a link between an IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know which IOMMU
+group (i.e. TCE table) to use for the LIOBN passed with the
+H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls.
+
+In response to a TCE hypercall, the kernel looks for a TCE table descriptor
+in the list and handles the hypercall in real or virtual modes if
+the descriptor is found. Otherwise the hypercall is passed to the user mode.
+
+No flags are supported at the moment.
+
+
5. The kvm_run structure
------------------------
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 3bf407b..716ab18 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -180,6 +180,8 @@ struct kvmppc_spapr_tce_table {
struct kvm *kvm;
u64 liobn;
u32 window_size;
+ struct iommu_group *grp; /* used for IOMMU groups */
+ struct file *vfio_filp; /* used for IOMMU groups */
struct page *pages[0];
};
@@ -611,6 +613,8 @@ struct kvm_vcpu_arch {
u64 busy_preempt;
unsigned long *tce_tmp; /* TCE cache for TCE_PUT_INDIRECT hcall */
+ unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */
+ unsigned long tce_reason; /* The reason of switching to the virtmode */
#endif
};
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index e852921b..934e01d 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,6 +133,8 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
+extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+ struct kvm_create_spapr_tce_iommu *args);
extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
struct kvm_vcpu *vcpu, unsigned long liobn);
extern long kvmppc_emulated_validate_tce(unsigned long tce);
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 0fb1a6e..7d0fc02 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -319,6 +319,14 @@ struct kvm_create_spapr_tce {
__u32 window_size;
};
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+ __u64 liobn;
+ __u32 fd;
+ __u32 iommu_id;
+ __u32 flags;
+};
+
/* for KVM_ALLOCATE_RMA */
struct kvm_allocate_rma {
__u64 rma_size;
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 45ee05a..a5d0195 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,10 @@
#include <linux/hugetlb.h>
#include <linux/list.h>
#include <linux/anon_inodes.h>
+#include <linux/pci.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/file.h>
#include <asm/tlbflush.h>
#include <asm/kvm_ppc.h>
@@ -49,6 +53,47 @@ static long kvmppc_stt_npages(unsigned long window_size)
* sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
}
+extern int vfio_group_add_external_user(struct file *filep);
+static int kvmppc_vfio_group_add_external_user(struct file *filep)
+{
+ int ret;
+ int (*proc)(struct file *) = symbol_get(vfio_group_add_external_user);
+
+ if (!proc)
+ return -EINVAL;
+
+ ret = proc(filep);
+ symbol_put(vfio_group_add_external_user);
+
+ return ret;
+}
+
+extern void vfio_group_del_external_user(struct file *filep);
+static void kvmppc_vfio_group_del_external_user(struct file *filep)
+{
+ void (*proc)(struct file *) = symbol_get(vfio_group_del_external_user);
+
+ if (!proc)
+ return;
+
+ proc(filep);
+ symbol_put(vfio_group_del_external_user);
+}
+
+extern int vfio_group_iommu_id_from_file(struct file *filep);
+static int kvmppc_vfio_group_iommu_id_from_file(struct file *filep)
+{
+ int ret;
+ int (*proc)(struct file *) = symbol_get(vfio_group_iommu_id_from_file);
+
+ if (!proc)
+ return -EINVAL;
+
+ ret = proc(filep);
+ symbol_put(vfio_group_iommu_id_from_file);
+
+ return ret;
+}
static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
{
struct kvm *kvm = stt->kvm;
@@ -56,8 +101,17 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
mutex_lock(&kvm->lock);
list_del(&stt->list);
- for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
- __free_page(stt->pages[i]);
+#ifdef CONFIG_IOMMU_API
+ if (stt->grp) {
+ if (stt->vfio_filp) {
+ kvmppc_vfio_group_del_external_user(stt->vfio_filp);
+ fput(stt->vfio_filp);
+ }
+ iommu_group_put(stt->grp);
+ } else
+#endif
+ for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+ __free_page(stt->pages[i]);
kfree(stt);
mutex_unlock(&kvm->lock);
@@ -153,6 +207,99 @@ fail:
return ret;
}
+#ifdef CONFIG_IOMMU_API
+static const struct file_operations kvm_spapr_tce_iommu_fops = {
+ .release = kvm_spapr_tce_release,
+};
+
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+ struct kvm_create_spapr_tce_iommu *args)
+{
+ struct kvmppc_spapr_tce_table *tt = NULL;
+ struct iommu_group *grp;
+ struct iommu_table *tbl;
+ struct file *vfio_filp;
+ int ret;
+
+ /* Check this LIOBN hasn't been previously allocated */
+ list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) {
+ if (tt->liobn == args->liobn)
+ return -EBUSY;
+ }
+
+ vfio_filp = fget(args->fd);
+ if (!vfio_filp)
+ return -ENXIO;
+
+ /* Lock the group */
+ ret = kvmppc_vfio_group_add_external_user(vfio_filp);
+ if (ret)
+ goto fput_exit;
+
+ /* Get IOMMU ID. Fails if group is not attached to IOMMU */
+ ret = kvmppc_vfio_group_iommu_id_from_file(vfio_filp);
+ if (ret < 0)
+ goto del_fput_exit;
+
+ if (ret != args->iommu_id) {
+ ret = -EINVAL;
+ goto del_fput_exit;
+ }
+
+ ret = -ENXIO;
+ /* Find an IOMMU table for the given ID */
+ grp = iommu_group_get_by_id(args->iommu_id);
+ if (!grp)
+ goto del_fput_exit;
+
+ tbl = iommu_group_get_iommudata(grp);
+ if (!tbl)
+ goto del_fput_exit;
+
+ tt = kzalloc(sizeof(*tt), GFP_KERNEL);
+ if (!tt)
+ goto del_fput_exit;
+
+ tt->liobn = args->liobn;
+ tt->kvm = kvm;
+ tt->grp = grp;
+ tt->window_size = tbl->it_size << IOMMU_PAGE_SHIFT;
+ tt->vfio_filp = vfio_filp;
+
+ pr_debug("LIOBN=%llX fd=%d hooked to IOMMU %d, flags=%u\n",
+ args->liobn, args->fd, args->iommu_id, args->flags);
+
+ ret = anon_inode_getfd("kvm-spapr-tce-iommu",
+ &kvm_spapr_tce_iommu_fops, tt, O_RDWR);
+ if (ret < 0)
+ goto free_del_fput_exit;
+
+ kvm_get_kvm(kvm);
+
+ mutex_lock(&kvm->lock);
+ list_add(&tt->list, &kvm->arch.spapr_tce_tables);
+
+ mutex_unlock(&kvm->lock);
+
+ return ret;
+
+free_del_fput_exit:
+ kfree(tt);
+del_fput_exit:
+ kvmppc_vfio_group_del_external_user(vfio_filp);
+fput_exit:
+ fput(vfio_filp);
+
+ return ret;
+}
+#else
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+ struct kvm_create_spapr_tce_iommu *args)
+{
+ return -ENOSYS;
+}
+#endif /* CONFIG_IOMMU_API */
+
/* Converts guest physical address to host virtual address */
static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
unsigned long gpa)
@@ -180,6 +327,47 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
if (!tt)
return H_TOO_HARD;
+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ /* Return error if the group is being destroyed */
+ if (!tbl)
+ return H_RESCINDED;
+
+ if (vcpu->arch.tce_reason == H_HARDWARE) {
+ iommu_clear_tces_and_put_pages(tbl, entry, 1);
+ return H_HARDWARE;
+
+ } else if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE))) {
+ if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+ return H_PARAMETER;
+
+ ret = iommu_clear_tces_and_put_pages(tbl, entry, 1);
+ } else {
+ void *hva;
+
+ if (iommu_tce_put_param_check(tbl, ioba, tce))
+ return H_PARAMETER;
+
+ hva = kvmppc_virtmode_gpa_to_hva(vcpu, tce);
+ if (hva == ERROR_ADDR)
+ return H_HARDWARE;
+
+ ret = iommu_put_tce_user_mode(tbl,
+ ioba >> IOMMU_PAGE_SHIFT,
+ (unsigned long) hva);
+ }
+ iommu_flush_tce(tbl);
+
+ if (ret)
+ return H_HARDWARE;
+
+ return H_SUCCESS;
+ }
+#endif
+ /* Emulated IO */
if (ioba >= tt->window_size)
return H_PARAMETER;
@@ -192,6 +380,72 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
return H_SUCCESS;
}
+static long kvmppc_virtmode_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt, unsigned long ioba,
+ unsigned long *tces, unsigned long npages)
+{
+ int i;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ /* Return error if the group is being destroyed */
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* Something bad happened, do cleanup and exit */
+ if (vcpu->arch.tce_reason == H_HARDWARE) {
+ i = vcpu->arch.tce_tmp_num;
+ goto fail_clear_tce;
+ } else if (vcpu->arch.tce_reason != H_TOO_HARD) {
+ /*
+ * We get here in PR KVM mode or
+ * when realmode_get_page() failed on tce_list.
+ */
+ for (i = 0; i < npages; ++i) {
+ if (get_user(vcpu->arch.tce_tmp[i], tces + i))
+ return H_HARDWARE;
+
+ if (iommu_tce_put_param_check(tbl, ioba +
+ (i << IOMMU_PAGE_SHIFT),
+ vcpu->arch.tce_tmp[i]))
+ return H_PARAMETER;
+ }
+ } /* else: The real mode handler checked TCEs already */
+
+ /* Translate TCEs */
+ for (i = vcpu->arch.tce_tmp_num; i < npages; ++i) {
+ void *hva = kvmppc_virtmode_gpa_to_hva(vcpu,
+ vcpu->arch.tce_tmp[i]);
+
+ if (hva == ERROR_ADDR)
+ goto fail_clear_tce;
+
+ vcpu->arch.tce_tmp[i] = (unsigned long) hva;
+ }
+
+ /* Do get_page and put TCEs for all pages */
+ for (i = 0; i < npages; ++i) {
+ if (iommu_put_tce_user_mode(tbl,
+ (ioba >> IOMMU_PAGE_SHIFT) + i,
+ vcpu->arch.tce_tmp[i])) {
+ i = npages;
+ goto fail_clear_tce;
+ }
+ }
+
+ iommu_flush_tce(tbl);
+
+ return H_SUCCESS;
+
+fail_clear_tce:
+ /* Cannot complete the translation, clean up and exit */
+ iommu_clear_tces_and_put_pages(tbl, ioba >> IOMMU_PAGE_SHIFT, i);
+
+ iommu_flush_tce(tbl);
+
+ return H_HARDWARE;
+
+}
+
long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
unsigned long liobn, unsigned long ioba,
unsigned long tce_list, unsigned long npages)
@@ -227,6 +481,14 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
1, 0, &pg) != 1)
return H_TOO_HARD;
+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ ret = kvmppc_virtmode_h_put_tce_indirect_iommu(vcpu,
+ tt, ioba, tces, npages);
+ goto put_list_page_exit;
+ }
+#endif
+ /* Emulated IO */
for (i = 0; i < npages; ++i) {
if (get_user(vcpu->arch.tce_tmp[i], tces + i)) {
ret = H_PARAMETER;
@@ -259,6 +521,34 @@ long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
if (!tt)
return H_TOO_HARD;
+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+ unsigned long tmp, entry = ioba >> IOMMU_PAGE_SHIFT;
+
+ vcpu->arch.tce_tmp_num = 0;
+
+ /* Return error if the group is being destroyed */
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* PR KVM? */
+ if (!vcpu->arch.tce_tmp_num &&
+ (vcpu->arch.tce_reason != H_TOO_HARD) &&
+ iommu_tce_clear_param_check(tbl, ioba,
+ tce_value, npages))
+ return H_PARAMETER;
+
+ /* Do actual cleanup */
+ tmp = vcpu->arch.tce_tmp_num;
+ if (iommu_clear_tces_and_put_pages(tbl, entry + tmp,
+ npages - tmp))
+ return H_PARAMETER;
+
+ return H_SUCCESS;
+ }
+#endif
+ /* Emulated IO */
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 108cc08..497daaf 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -26,6 +26,7 @@
#include <linux/slab.h>
#include <linux/hugetlb.h>
#include <linux/list.h>
+#include <linux/iommu.h>
#include <asm/tlbflush.h>
#include <asm/kvm_ppc.h>
@@ -189,6 +190,41 @@ static long kvmppc_realmode_put_hpa(unsigned long hpa, bool dirty)
return H_SUCCESS;
}
+#ifdef CONFIG_IOMMU_API
+static long kvmppc_realmode_clear_tce(struct kvm_vcpu *vcpu,
+ struct iommu_table *tbl, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ long ret = 0, i;
+ unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+ if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+ return H_PARAMETER;
+
+ for (i = 0; i < npages; ++i) {
+ unsigned long oldtce;
+
+ oldtce = iommu_clear_tce(tbl, entry + i);
+ if (!oldtce)
+ continue;
+
+ ret = kvmppc_realmode_put_hpa(oldtce, oldtce & TCE_PCI_WRITE);
+ if (ret)
+ break;
+ }
+
+ if (ret == H_TOO_HARD) {
+ vcpu->arch.tce_tmp_num = i;
+ vcpu->arch.tce_reason = H_TOO_HARD;
+ }
+ /* if (ret < 0)
+ pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%ld\n",
+ __func__, ioba, tce_value, ret); */
+
+ return ret;
+}
+#endif /* CONFIG_IOMMU_API */
+
long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce)
{
@@ -200,6 +236,53 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
if (!tt)
return H_TOO_HARD;
+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ /* Return error if the group is being destroyed */
+ if (!tbl)
+ return H_RESCINDED;
+
+ vcpu->arch.tce_reason = 0;
+
+ if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+ unsigned long hpa, hva;
+
+ if (iommu_tce_put_param_check(tbl, ioba, tce))
+ return H_PARAMETER;
+
+ hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce);
+ if (hpa == ERROR_ADDR) {
+ vcpu->arch.tce_reason = H_TOO_HARD;
+ return H_TOO_HARD;
+ }
+
+ hva = (unsigned long) __va(hpa);
+ ret = iommu_tce_build(tbl,
+ ioba >> IOMMU_PAGE_SHIFT,
+ hva, iommu_tce_direction(hva));
+ if (unlikely(ret)) {
+ ret = kvmppc_realmode_put_hpa(hpa,
+ tce & TCE_PCI_WRITE);
+ if (ret) {
+ vcpu->arch.tce_reason = H_HARDWARE;
+ return H_TOO_HARD;
+ }
+ return H_HARDWARE;
+ }
+ } else {
+ ret = kvmppc_realmode_clear_tce(vcpu, tbl, ioba, 0, 1);
+ if (ret)
+ return ret;
+ }
+
+ iommu_flush_tce(tbl);
+
+ return H_SUCCESS;
+ }
+#endif
+ /* Emulated IO */
if (ioba >= tt->window_size)
return H_PARAMETER;
@@ -212,6 +295,56 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
return H_SUCCESS;
}
+static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt, unsigned long ioba,
+ unsigned long *tces, unsigned long npages)
+{
+ int i;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ /* Return error if the group is being destroyed */
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* Check all TCEs */
+ for (i = 0; i < npages; ++i) {
+ if (iommu_tce_put_param_check(tbl, ioba +
+ (i << IOMMU_PAGE_SHIFT), tces[i]))
+ return H_PARAMETER;
+
+ vcpu->arch.tce_tmp[i] = tces[i];
+ }
+
+ /* Translate TCEs and go get_page */
+ for (i = 0; i < npages; ++i) {
+ unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu,
+ vcpu->arch.tce_tmp[i]);
+ if (hpa == ERROR_ADDR) {
+ vcpu->arch.tce_tmp_num = i;
+ vcpu->arch.tce_reason = H_TOO_HARD;
+ return H_TOO_HARD;
+ }
+ vcpu->arch.tce_tmp[i] = hpa;
+ }
+
+ /* Put TCEs to the table */
+ for (i = 0; i < npages; ++i) {
+ unsigned long hva = (unsigned long)
+ __va(vcpu->arch.tce_tmp[i]);
+ if (iommu_tce_build(tbl,
+ (ioba >> IOMMU_PAGE_SHIFT) + i,
+ hva, iommu_tce_direction(hva))) {
+ /* All wrong, go virtmode and do cleanup */
+ vcpu->arch.tce_reason = H_HARDWARE;
+ return H_TOO_HARD;
+ }
+ }
+
+ iommu_flush_tce(tbl);
+
+ return H_SUCCESS;
+}
+
long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
unsigned long liobn, unsigned long ioba,
unsigned long tce_list, unsigned long npages)
@@ -238,10 +371,21 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;
+ vcpu->arch.tce_tmp_num = 0;
+ vcpu->arch.tce_reason = 0;
+
tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
if ((unsigned long)tces == ERROR_ADDR)
return H_TOO_HARD;
+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
+ tt, ioba, tces, npages);
+ goto put_list_page_exit;
+ }
+#endif
+ /* Emulated IO */
for (i = 0; i < npages; ++i) {
ret = kvmppc_emulated_validate_tce(tces[i]);
if (ret)
@@ -273,6 +417,27 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
if (!tt)
return H_TOO_HARD;
+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ /* Return error if the group is being destroyed */
+ if (!tbl)
+ return H_RESCINDED;
+
+ vcpu->arch.tce_reason = 0;
+
+ ret = kvmppc_realmode_clear_tce(vcpu, tbl, ioba,
+ tce_value, npages);
+ if (ret)
+ return ret;
+
+ iommu_flush_tce(tbl);
+
+ return H_SUCCESS;
+ }
+#endif
+ /* Emulated IO */
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index ccb578b..2909cfa 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -395,6 +395,7 @@ int kvm_dev_ioctl_check_extension(long ext)
r = 1;
break;
case KVM_CAP_SPAPR_MULTITCE:
+ case KVM_CAP_SPAPR_TCE_IOMMU:
r = 1;
break;
#endif
@@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
goto out;
}
+ case KVM_CREATE_SPAPR_TCE_IOMMU: {
+ struct kvm_create_spapr_tce_iommu create_tce_iommu;
+ struct kvm *kvm = filp->private_data;
+
+ r = -EFAULT;
+ if (copy_from_user(&create_tce_iommu, argp,
+ sizeof(create_tce_iommu)))
+ goto out;
+ r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
+ goto out;
+ }
#endif /* CONFIG_PPC_BOOK3S_64 */
#ifdef CONFIG_KVM_BOOK3S_64_HV
--
1.7.10.4
This adds special support for huge pages (16MB). The reference
counting cannot be easily done for such pages in real mode (when
MMU is off) so we added a list of huge pages. It is populated in
virtual mode and get_page is called just once per huge page.
Real mode handlers check whether the requested page is huge and already in
the list; if so, no reference counting is done, otherwise an exit to
virtual mode happens. The list is released at KVM exit. At the moment the fastest
card available for tests uses up to 9 huge pages so walking through this
list is not very expensive. However this can change and we may want
to optimize this.
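To illustrate, a simplified sketch of the real mode lookup this enables
(field and macro names match the patch; the helper shown here is hypothetical
and not literally part of the patch):

    /* Find an already referenced huge page by guest physical address
     * without touching any page counters (safe in real mode).
     */
    static unsigned long hugepage_gpa_to_hpa(struct kvmppc_spapr_tce_table *tt,
                                             unsigned long gpa)
    {
            struct kvmppc_iommu_hugepage *hp;
            unsigned key = KVMPPC_HUGEPAGE_HASH(gpa);

            hash_for_each_possible_rcu_notrace(tt->hash_tab, hp, hash_node, key) {
                    if ((gpa >= hp->gpa) && (gpa < hp->gpa + hp->size))
                            return hp->hpa + (gpa & (hp->size - 1));
            }

            return ERROR_ADDR;      /* not cached yet, exit to virtual mode */
    }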
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
2013/06/27:
* list of huge pages replaced with a hashtable for better performance
* spinlock removed from real mode and now only protects insertion of new
huge page descriptors into the hashtable
2013/06/05:
* fixed compile error when CONFIG_IOMMU_API=n
2013/05/20:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints a warning if it is called twice for the same
huge page as the real mode handler is expected to fail just once - when a huge
page is not in the list yet.
* the huge page is refcounted twice - when added to the hugepage list and
when used in the virtual mode hcall handler (can be optimized but it will
make the patch less nice).
---
arch/powerpc/include/asm/kvm_host.h | 25 +++++++++
arch/powerpc/kvm/book3s_64_vio.c | 95 +++++++++++++++++++++++++++++++++--
arch/powerpc/kvm/book3s_64_vio_hv.c | 24 +++++++--
3 files changed, 138 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 716ab18..0ad6189 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -30,6 +30,7 @@
#include <linux/kvm_para.h>
#include <linux/list.h>
#include <linux/atomic.h>
+#include <linux/hashtable.h>
#include <asm/kvm_asm.h>
#include <asm/processor.h>
#include <asm/page.h>
@@ -182,9 +183,33 @@ struct kvmppc_spapr_tce_table {
u32 window_size;
struct iommu_group *grp; /* used for IOMMU groups */
struct file *vfio_filp; /* used for IOMMU groups */
+ DECLARE_HASHTABLE(hash_tab, ilog2(64)); /* used for IOMMU groups */
+ spinlock_t hugepages_write_lock; /* used for IOMMU groups */
struct page *pages[0];
};
+/*
+ * The KVM guest can be backed with 16MB pages.
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a hashtable of 16MB pages and uses the page
+ * struct in real mode if the page is already locked in RAM and inserted
+ * into the hashtable, or switches to virtual mode where it can be
+ * handled in the usual manner.
+ */
+#define KVMPPC_HUGEPAGE_HASH(gpa) hash_32(gpa >> 24, 32)
+
+struct kvmppc_iommu_hugepage {
+ struct hlist_node hash_node;
+ unsigned long gpa; /* Guest physical address */
+ unsigned long hpa; /* Host physical address */
+ struct page *page; /* page struct of the very first subpage */
+ unsigned long size; /* Huge page size (always 16MB at the moment) */
+};
+
struct kvmppc_linear_info {
void *base_virt;
unsigned long base_pfn;
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index a5d0195..6cedfe9 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -47,6 +47,78 @@
#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
#define ERROR_ADDR ((void *)~(unsigned long)0x0)
+#ifdef CONFIG_IOMMU_API
+/* Adds a new huge page descriptor to the hashtable */
+static long kvmppc_iommu_hugepage_try_add(
+ struct kvmppc_spapr_tce_table *tt,
+ pte_t pte, unsigned long hva, unsigned long gpa,
+ unsigned long pg_size)
+{
+ long ret = 0;
+ struct kvmppc_iommu_hugepage *hp;
+ struct page *pg;
+ unsigned key = KVMPPC_HUGEPAGE_HASH(gpa);
+
+ spin_lock(&tt->hugepages_write_lock);
+ hash_for_each_possible_rcu(tt->hash_tab, hp, hash_node, key) {
+ if (KVMPPC_HUGEPAGE_HASH(hp->gpa) != key)
+ continue;
+ if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
+ continue;
+ goto unlock_exit;
+ }
+
+ hva = hva & ~(pg_size - 1);
+ ret = get_user_pages_fast(hva, 1, true/*write*/, &pg);
+ if ((ret != 1) || !pg) {
+ ret = -EFAULT;
+ goto unlock_exit;
+ }
+ ret = 0;
+
+ hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+ if (!hp) {
+ ret = -ENOMEM;
+ goto unlock_exit;
+ }
+
+ hp->page = pg;
+ hp->gpa = gpa & ~(pg_size - 1);
+ hp->hpa = (pte_pfn(pte) << PAGE_SHIFT);
+ hp->size = pg_size;
+
+ hash_add_rcu(tt->hash_tab, &hp->hash_node, key);
+
+unlock_exit:
+ spin_unlock(&tt->hugepages_write_lock);
+
+ return ret;
+}
+
+static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
+{
+ spin_lock_init(&tt->hugepages_write_lock);
+ hash_init(tt->hash_tab);
+}
+
+static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
+{
+ int bkt;
+ struct kvmppc_iommu_hugepage *hp;
+ struct hlist_node *tmp;
+
+ spin_lock(&tt->hugepages_write_lock);
+ hash_for_each_safe(tt->hash_tab, bkt, tmp, hp, hash_node) {
+ hlist_del_rcu(&hp->hash_node);
+
+ put_page(hp->page); /* one for iommu_put_tce_user_mode */
+ put_page(hp->page); /* one for kvmppc_iommu_hugepage_try_add */
+ kfree(hp);
+ }
+ spin_unlock(&tt->hugepages_write_lock);
+}
+#endif /* CONFIG_IOMMU_API */
+
static long kvmppc_stt_npages(unsigned long window_size)
{
return ALIGN((window_size >> SPAPR_TCE_SHIFT)
@@ -108,6 +180,7 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
fput(stt->vfio_filp);
}
iommu_group_put(stt->grp);
+ kvmppc_iommu_hugepages_cleanup(stt);
} else
#endif
for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
@@ -277,6 +350,7 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
kvm_get_kvm(kvm);
mutex_lock(&kvm->lock);
+ kvmppc_iommu_hugepages_init(tt);
list_add(&tt->list, &kvm->arch.spapr_tce_tables);
mutex_unlock(&kvm->lock);
@@ -302,16 +376,31 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
/* Converts guest physical address to host virtual address */
static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
unsigned long gpa)
{
unsigned long hva, gfn = gpa >> PAGE_SHIFT;
struct kvm_memory_slot *memslot;
+ pte_t *ptep;
+ unsigned int shift = 0;
memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
if (!memslot)
return ERROR_ADDR;
hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+
+ ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+ WARN_ON(!ptep);
+ if (!ptep)
+ return ERROR_ADDR;
+#ifdef CONFIG_IOMMU_API
+ if (tt && (shift > PAGE_SHIFT)) {
+ if (kvmppc_iommu_hugepage_try_add(tt, *ptep,
+ hva, gpa, 1 << shift))
+ return ERROR_ADDR;
+ }
+#endif
return (void *) hva;
}
@@ -351,7 +440,7 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
if (iommu_tce_put_param_check(tbl, ioba, tce))
return H_PARAMETER;
- hva = kvmppc_virtmode_gpa_to_hva(vcpu, tce);
+ hva = kvmppc_virtmode_gpa_to_hva(vcpu, tt, tce);
if (hva == ERROR_ADDR)
return H_HARDWARE;
@@ -414,7 +503,7 @@ static long kvmppc_virtmode_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
/* Translate TCEs */
for (i = vcpu->arch.tce_tmp_num; i < npages; ++i) {
void *hva = kvmppc_virtmode_gpa_to_hva(vcpu,
- vcpu->arch.tce_tmp[i]);
+ tt, vcpu->arch.tce_tmp[i]);
if (hva == ERROR_ADDR)
goto fail_clear_tce;
@@ -473,7 +562,7 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;
- tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
+ tces = kvmppc_virtmode_gpa_to_hva(vcpu, NULL, tce_list);
if (tces == ERROR_ADDR)
return H_TOO_HARD;
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 497daaf..c1db019 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -121,6 +121,7 @@ EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
* returns ERROR_ADDR if failed.
*/
static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
unsigned long gpa)
{
struct kvm_memory_slot *memslot;
@@ -129,6 +130,23 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
unsigned long gfn = gpa >> PAGE_SHIFT;
unsigned shift = 0;
struct page *pg;
+ struct kvmppc_iommu_hugepage *hp;
+
+ /* Try to find an already used hugepage */
+ if (tt) {
+ unsigned key = KVMPPC_HUGEPAGE_HASH(gpa);
+
+ hash_for_each_possible_rcu_notrace(tt->hash_tab, hp,
+ hash_node, key) {
+ if (KVMPPC_HUGEPAGE_HASH(hp->gpa) != key)
+ continue;
+
+ if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
+ continue;
+
+ return hp->hpa + (gpa & (hp->size - 1));
+ }
+ }
memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
if (!memslot)
@@ -252,7 +270,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
if (iommu_tce_put_param_check(tbl, ioba, tce))
return H_PARAMETER;
- hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce);
+ hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt, tce);
if (hpa == ERROR_ADDR) {
vcpu->arch.tce_reason = H_TOO_HARD;
return H_TOO_HARD;
@@ -318,7 +336,7 @@ static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
/* Translate TCEs and go get_page */
for (i = 0; i < npages; ++i) {
unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu,
- vcpu->arch.tce_tmp[i]);
+ tt, vcpu->arch.tce_tmp[i]);
if (hpa == ERROR_ADDR) {
vcpu->arch.tce_tmp_num = i;
vcpu->arch.tce_reason = H_TOO_HARD;
@@ -374,7 +392,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
vcpu->arch.tce_tmp_num = 0;
vcpu->arch.tce_reason = 0;
- tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
+ tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tt, tce_list);
if ((unsigned long)tces == ERROR_ADDR)
return H_TOO_HARD;
--
1.7.10.4
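To isolate the locking pattern used by the hugepage cache above - insertion under a spinlock with hash_add_rcu() in virtual mode, lock-free lookup with hash_for_each_possible_rcu_notrace() in real mode - here is a minimal self-contained sketch. The hp_* names are invented for illustration only, and the entry lifetime (entries freed only when the whole table is torn down) is assumed to match the cleanup path above.

/* Illustrative sketch only, not part of the patch: the writer/reader
 * pattern of the hugepage cache, stripped of the KVM context. */
#include <linux/hashtable.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

static DEFINE_HASHTABLE(hp_hash, ilog2(64));
static DEFINE_SPINLOCK(hp_lock);

struct hp_entry {
	struct hlist_node hash_node;
	unsigned long gpa;	/* guest physical address, hugepage aligned */
	unsigned long hpa;	/* host physical address */
	unsigned long size;	/* hugepage size, e.g. 16MB */
};

/* Writer side (virtual mode): may sleep, so it can allocate */
static int hp_insert(unsigned long gpa, unsigned long hpa, unsigned long size)
{
	struct hp_entry *hp = kzalloc(sizeof(*hp), GFP_KERNEL);

	if (!hp)
		return -ENOMEM;

	hp->gpa = gpa & ~(size - 1);
	hp->hpa = hpa;
	hp->size = size;

	spin_lock(&hp_lock);
	hash_add_rcu(hp_hash, &hp->hash_node, hash_32(gpa >> 24, 32));
	spin_unlock(&hp_lock);

	return 0;
}

/* Reader side (real mode): no locks, no tracing, just an RCU list walk;
 * entries are never freed while the guest is running */
static unsigned long hp_lookup(unsigned long gpa)
{
	struct hp_entry *hp;
	unsigned key = hash_32(gpa >> 24, 32);

	hash_for_each_possible_rcu_notrace(hp_hash, hp, hash_node, key) {
		if ((gpa >= hp->gpa) && (gpa < hp->gpa + hp->size))
			return hp->hpa + (gpa & (hp->size - 1));
	}

	return -1UL;	/* not cached; the caller falls back to virtual mode */
}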
VFIO is designed to be used via ioctls on file descriptors
returned by VFIO.
However in some situations support for an external user is required.
The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
use the existing VFIO groups for exclusive access in real/virtual mode
in the host kernel to avoid passing map/unmap requests to the user
space, which would make things pretty slow.
The proposed protocol includes:
1. do normal VFIO init stuff such as opening a new container, attaching
group(s) to it, setting an IOMMU driver for a container. When IOMMU is
set for a container, all groups in it are considered ready to use by
an external user.
2. pass a fd of the group we want to accelerate to KVM. KVM calls
vfio_group_iommu_id_from_file() to verify if the group is initialized
and IOMMU is set for it. The current TCE IOMMU driver marks the whole
IOMMU table as busy when IOMMU is set for a container, which prevents
other DMA users from allocating from it, so it is safe to pass the group
to user space.
3. KVM increases the container users counter via
vfio_group_add_external_user(). This prevents the VFIO group from
being disposed prior to exiting KVM.
4. When KVM is finished and doing cleanup, it releases the group file
and decrements the container users counter. Everything gets released.
5. KVM also keeps the group file because otherwise its fd might have been
closed by the time KVM finishes, and the vfio_group_del_external_user()
call would no longer be possible.
The "vfio: Limit group opens" patch is also required for the consistency.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
drivers/vfio/vfio.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 53 insertions(+)
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index c488da5..54192b2 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1370,6 +1370,59 @@ static const struct file_operations vfio_device_fops = {
};
/**
+ * External user API, exported by symbols to be linked dynamically.
+ */
+
+/* Allows an external user (for example, KVM) to lock an IOMMU group */
+static int vfio_group_add_external_user(struct file *filep)
+{
+ struct vfio_group *group = filep->private_data;
+
+ if (filep->f_op != &vfio_group_fops)
+ return -EINVAL;
+
+ if (!atomic_inc_not_zero(&group->container_users))
+ return -EINVAL;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_group_add_external_user);
+
+/* Allows an external user (for example, KVM) to unlock an IOMMU group */
+static void vfio_group_del_external_user(struct file *filep)
+{
+ struct vfio_group *group = filep->private_data;
+
+ BUG_ON(filep->f_op != &vfio_group_fops);
+
+ vfio_group_try_dissolve_container(group);
+}
+EXPORT_SYMBOL_GPL(vfio_group_del_external_user);
+
+/*
+ * Checks if a group for the specified file can be used by
+ * an external user and returns the IOMMU ID if external use is possible.
+ */
+static int vfio_group_iommu_id_from_file(struct file *filep)
+{
+ int ret;
+ struct vfio_group *group = filep->private_data;
+
+ if (WARN_ON(filep->f_op != &vfio_group_fops))
+ return -EINVAL;
+
+ if (0 == atomic_read(&group->container_users) ||
+ !group->container->iommu_driver ||
+ !vfio_group_viable(group))
+ return -EINVAL;
+
+ ret = iommu_group_id(group->iommu_group);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_group_iommu_id_from_file);
+
+/**
* Module/class support
*/
static char *vfio_devnode(struct device *dev, umode_t *mode)
--
1.7.10.4
Hi Alexey,
On Thu, 27 Jun 2013 15:02:31 +1000 Alexey Kardashevskiy <[email protected]> wrote:
>
> index c488da5..54192b2 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1370,6 +1370,59 @@ static const struct file_operations vfio_device_fops = {
> };
>
> /**
> + * External user API, exported by symbols to be linked dynamically.
> + */
> +
> +/* Allows an external user (for example, KVM) to lock an IOMMU group */
> +static int vfio_group_add_external_user(struct file *filep)
> +{
> + struct vfio_group *group = filep->private_data;
> +
> + if (filep->f_op != &vfio_group_fops)
> + return -EINVAL;
> +
> + if (!atomic_inc_not_zero(&group->container_users))
> + return -EINVAL;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(vfio_group_add_external_user);
You cannot EXPORT a static symbol ... The same through the rest of the
file.
--
Cheers,
Stephen Rothwell [email protected]
Hi Alexey,
On Thu, 27 Jun 2013 15:02:31 +1000 Alexey Kardashevskiy <[email protected]> wrote:
>
> +/* Allows an external user (for example, KVM) to unlock an IOMMU group */
> +static void vfio_group_del_external_user(struct file *filep)
> +{
> + struct vfio_group *group = filep->private_data;
> +
> + BUG_ON(filep->f_op != &vfio_group_fops);
We usually reserve BUG_ON for situations where there is no way to
continue running or continuing will corrupt the running kernel. Maybe
WARN_ON() and return?
--
Cheers,
Stephen Rothwell [email protected]
VFIO is designed to be used via ioctls on file descriptors
returned by VFIO.
However in some situations support for an external user is required.
The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
use the existing VFIO groups for exclusive access in real/virtual mode
in the host kernel to avoid passing map/unmap requests to the user
space, which would make things pretty slow.
The proposed protocol includes:
1. do normal VFIO init stuff such as opening a new container, attaching
group(s) to it, setting an IOMMU driver for a container. When IOMMU is
set for a container, all groups in it are considered ready to use by
an external user.
2. pass a fd of the group we want to accelerate to KVM. KVM calls
vfio_group_iommu_id_from_file() to verify if the group is initialized
and IOMMU is set for it. The current TCE IOMMU driver marks the whole
IOMMU table as busy when IOMMU is set for a container, which prevents
other DMA users from allocating from it, so it is safe to pass the group
to user space.
3. KVM increases the container users counter via
vfio_group_add_external_user(). This prevents the VFIO group from
being disposed prior to exiting KVM.
4. When KVM is finished and doing cleanup, it releases the group file
and decrements the container users counter. Everything gets released.
5. KVM also keeps the group file because otherwise its fd might have been
closed by the time KVM finishes, and the vfio_group_del_external_user()
call would no longer be possible.
The "vfio: Limit group opens" patch is also required for the consistency.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
v1->v2: added definitions to vfio.h :)
Should not compile but compiled. Hm.
---
drivers/vfio/vfio.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/vfio.h | 7 +++++++
2 files changed, 61 insertions(+)
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index c488da5..40875d2 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1370,6 +1370,60 @@ static const struct file_operations vfio_device_fops = {
};
/**
+ * External user API, exported by symbols to be linked dynamically.
+ */
+
+/* Allows an external user (for example, KVM) to lock an IOMMU group */
+int vfio_group_add_external_user(struct file *filep)
+{
+ struct vfio_group *group = filep->private_data;
+
+ if (filep->f_op != &vfio_group_fops)
+ return -EINVAL;
+
+ if (!atomic_inc_not_zero(&group->container_users))
+ return -EINVAL;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_group_add_external_user);
+
+/* Allows an external user (for example, KVM) to unlock an IOMMU group */
+void vfio_group_del_external_user(struct file *filep)
+{
+ struct vfio_group *group = filep->private_data;
+
+ if (WARN_ON(filep->f_op != &vfio_group_fops))
+ return;
+
+ vfio_group_try_dissolve_container(group);
+}
+EXPORT_SYMBOL_GPL(vfio_group_del_external_user);
+
+/*
+ * Checks if a group for the specified file can be used by
+ * an external user and returns the IOMMU ID if external use is possible.
+ */
+int vfio_group_iommu_id_from_file(struct file *filep)
+{
+ int ret;
+ struct vfio_group *group = filep->private_data;
+
+ if (WARN_ON(filep->f_op != &vfio_group_fops))
+ return -EINVAL;
+
+ if (0 == atomic_read(&group->container_users) ||
+ !group->container->iommu_driver ||
+ !vfio_group_viable(group))
+ return -EINVAL;
+
+ ret = iommu_group_id(group->iommu_group);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_group_iommu_id_from_file);
+
+/**
* Module/class support
*/
static char *vfio_devnode(struct device *dev, umode_t *mode)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index ac8d488..7ee6575 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -90,4 +90,11 @@ extern void vfio_unregister_iommu_driver(
TYPE tmp; \
offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); }) \
+/*
+ * External user API
+ */
+int vfio_group_add_external_user(struct file *filep);
+void vfio_group_del_external_user(struct file *filep);
+int vfio_group_iommu_id_from_file(struct file *filep);
+
#endif /* VFIO_H */
--
1.7.10.4
Hi Alexey,
Thanks for the changes.
On Thu, 27 Jun 2013 17:14:20 +1000 Alexey Kardashevskiy <[email protected]> wrote:
>
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index ac8d488..7ee6575 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -90,4 +90,11 @@ extern void vfio_unregister_iommu_driver(
> TYPE tmp; \
> offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); }) \
>
> +/*
> + * External user API
> + */
> +int vfio_group_add_external_user(struct file *filep);
> +void vfio_group_del_external_user(struct file *filep);
> +int vfio_group_iommu_id_from_file(struct file *filep);
Just being picky, but all the other function declarations in that file
state "extern" explicitly.
--
Cheers,
Stephen Rothwell [email protected]
On Thu, 2013-06-27 at 16:59 +1000, Stephen Rothwell wrote:
> > +/* Allows an external user (for example, KVM) to unlock an IOMMU
> group */
> > +static void vfio_group_del_external_user(struct file *filep)
> > +{
> > + struct vfio_group *group = filep->private_data;
> > +
> > + BUG_ON(filep->f_op != &vfio_group_fops);
>
> We usually reserve BUG_ON for situations where there is no way to
> continue running or continuing will corrupt the running kernel. Maybe
> WARN_ON() and return?
Not even that. This is a user space provided "fd", we shouldn't oops the
kernel because we passed a wrong argument, just return -EINVAL or
something like that (add a return code).
Ben.
On 06/27/2013 07:42 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2013-06-27 at 16:59 +1000, Stephen Rothwell wrote:
>>> +/* Allows an external user (for example, KVM) to unlock an IOMMU
>> group */
>>> +static void vfio_group_del_external_user(struct file *filep)
>>> +{
>>> + struct vfio_group *group = filep->private_data;
>>> +
>>> + BUG_ON(filep->f_op != &vfio_group_fops);
>>
>> We usually reserve BUG_ON for situations where there is no way to
>> continue running or continuing will corrupt the running kernel. Maybe
>> WARN_ON() and return?
>
> Not even that. This is a user space provided "fd", we shouldn't oops the
> kernel because we passed a wrong argument, just return -EINVAL or
> something like that (add a return code).
I'll change to WARN_ON but...
This is going to be called on KVM exit on a file pointer previously
verified for correctness. If it is a wrong file*, then something went
terribly wrong.
--
Alexey
On Thu, 2013-06-27 at 17:14 +1000, Alexey Kardashevskiy wrote:
> VFIO is designed to be used via ioctls on file descriptors
> returned by VFIO.
>
> However in some situations support for an external user is required.
> The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
> use the existing VFIO groups for exclusive access in real/virtual mode
> in the host kernel to avoid passing map/unmap requests to the user
> space which would made things pretty slow.
>
> The proposed protocol includes:
>
> 1. do normal VFIO init stuff such as opening a new container, attaching
> group(s) to it, setting an IOMMU driver for a container. When IOMMU is
> set for a container, all groups in it are considered ready to use by
> an external user.
>
> 2. pass a fd of the group we want to accelerate to KVM. KVM calls
> vfio_group_iommu_id_from_file() to verify if the group is initialized
> and IOMMU is set for it. The current TCE IOMMU driver marks the whole
> IOMMU table as busy when IOMMU is set for a container what this prevents
> other DMA users from allocating from it so it is safe to pass the group
> to the user space.
>
> 3. KVM increases the container users counter via
> vfio_group_add_external_user(). This prevents the VFIO group from
> being disposed prior to exiting KVM.
>
> 4. When KVM is finished and doing cleanup, it releases the group file
> and decrements the container users counter. Everything gets released.
>
> 5. KVM also keeps the group file as otherwise its fd might have been
> closed at the moment of KVM finish so vfio_group_del_external_user()
> call will not be possible.
This is the wrong order in my mind. An external user has no business
checking or maintaining any state of a group until it calls
add_external_user(). Only after that call is successful can the user
assume the filep to group relationship is static and get the iommu_id.
Any use of the "external user" API should start with "add" and end with
"del".
> The "vfio: Limit group opens" patch is also required for the consistency.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
>
> v1->v2: added definitions to vfio.h :)
> Should not compile but compiled. Hm.
>
> ---
> drivers/vfio/vfio.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/vfio.h | 7 +++++++
> 2 files changed, 61 insertions(+)
>
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index c488da5..40875d2 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1370,6 +1370,60 @@ static const struct file_operations vfio_device_fops = {
> };
>
> /**
> + * External user API, exported by symbols to be linked dynamically.
> + */
> +
> +/* Allows an external user (for example, KVM) to lock an IOMMU group */
> +int vfio_group_add_external_user(struct file *filep)
> +{
> + struct vfio_group *group = filep->private_data;
> +
> + if (filep->f_op != &vfio_group_fops)
> + return -EINVAL;
> +
> + if (!atomic_inc_not_zero(&group->container_users))
> + return -EINVAL;
This is the place where I was suggesting we need tests to match
get_device_fd. It's not clear what the external user is holding if the
group has no iommu or is not viable here.
if (!group->container->iommu_driver || !vfio_group_viable(group)) {
vfio_group_try_dissolve_container(group);
return -EINVAL;
}
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(vfio_group_add_external_user);
> +
> +/* Allows an external user (for example, KVM) to unlock an IOMMU group */
> +void vfio_group_del_external_user(struct file *filep)
> +{
> + struct vfio_group *group = filep->private_data;
> +
> + if (WARN_ON(filep->f_op != &vfio_group_fops))
> + return;
How about we make this return int so we can return 0/-EINVAL and the
caller can decide the severity of the response?
> +
> + vfio_group_try_dissolve_container(group);
> +}
> +EXPORT_SYMBOL_GPL(vfio_group_del_external_user);
> +
> +/*
> + * Checks if a group for the specified file can be used by
> + * an external user and returns the IOMMU ID if external use is possible.
> + */
> +int vfio_group_iommu_id_from_file(struct file *filep)
Let's name this in a way that makes it clear that it's part of the
external_user API. vfio_group_external_user_iommu_id?
> +{
> + int ret;
> + struct vfio_group *group = filep->private_data;
> +
> + if (WARN_ON(filep->f_op != &vfio_group_fops))
> + return -EINVAL;
This one probably doesn't deserve a WARN_ON either, let the caller
blowup if it wants.
> +
> + if (0 == atomic_read(&group->container_users) ||
> + !group->container->iommu_driver ||
> + !vfio_group_viable(group))
> + return -EINVAL;
The above test just becomes a weak test that the caller is correctly
using the API since we should be enforcing these tests when the external
user is added. It doesn't hurt to leave them, but it's not very
convincing that the caller is the one holding anything.
> + ret = iommu_group_id(group->iommu_group);
The 'ret' variable isn't needed.
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_group_iommu_id_from_file);
> +
> +/**
> * Module/class support
> */
> static char *vfio_devnode(struct device *dev, umode_t *mode)
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index ac8d488..7ee6575 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -90,4 +90,11 @@ extern void vfio_unregister_iommu_driver(
> TYPE tmp; \
> offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); }) \
>
> +/*
> + * External user API
> + */
> +int vfio_group_add_external_user(struct file *filep);
> +void vfio_group_del_external_user(struct file *filep);
> +int vfio_group_iommu_id_from_file(struct file *filep);
> +
As Stephen noted, let's use extern here. Also, please add some
additional text about the general purpose of this API and how to use it.
A generalized version of your commit log after fixes would be great.
Thanks,
Alex
On 06/27/2013 12:02:36 AM, Alexey Kardashevskiy wrote:
> +/*
> + * The KVM guest can be backed with 16MB pages.
> + * In this case, we cannot do page counting from the real mode
> + * as the compound pages are used - they are linked in a list
> + * with pointers as virtual addresses which are inaccessible
> + * in real mode.
> + *
> + * The code below keeps a 16MB pages list and uses page struct
> + * in real mode if it is already locked in RAM and inserted into
> + * the list or switches to the virtual mode where it can be
> + * handled in a usual manner.
> + */
> +#define KVMPPC_HUGEPAGE_HASH(gpa) hash_32(gpa >> 24, 32)
> +
> +struct kvmppc_iommu_hugepage {
> + struct hlist_node hash_node;
> + unsigned long gpa; /* Guest physical address */
> + unsigned long hpa; /* Host physical address */
> + struct page *page; /* page struct of the very first
> subpage */
> + unsigned long size; /* Huge page size (always 16MB at the
> moment) */
> +};
Shouldn't this be namespaced to something like "book3s" or "spapr"?
-Scott
On 06/28/2013 01:44 AM, Alex Williamson wrote:
> On Thu, 2013-06-27 at 17:14 +1000, Alexey Kardashevskiy wrote:
>> VFIO is designed to be used via ioctls on file descriptors
>> returned by VFIO.
>>
>> However in some situations support for an external user is required.
>> The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
>> use the existing VFIO groups for exclusive access in real/virtual mode
>> in the host kernel to avoid passing map/unmap requests to the user
>> space which would made things pretty slow.
>>
>> The proposed protocol includes:
>>
>> 1. do normal VFIO init stuff such as opening a new container, attaching
>> group(s) to it, setting an IOMMU driver for a container. When IOMMU is
>> set for a container, all groups in it are considered ready to use by
>> an external user.
>>
>> 2. pass a fd of the group we want to accelerate to KVM. KVM calls
>> vfio_group_iommu_id_from_file() to verify if the group is initialized
>> and IOMMU is set for it. The current TCE IOMMU driver marks the whole
>> IOMMU table as busy when IOMMU is set for a container what this prevents
>> other DMA users from allocating from it so it is safe to pass the group
>> to the user space.
>>
>> 3. KVM increases the container users counter via
>> vfio_group_add_external_user(). This prevents the VFIO group from
>> being disposed prior to exiting KVM.
>>
>> 4. When KVM is finished and doing cleanup, it releases the group file
>> and decrements the container users counter. Everything gets released.
>>
>> 5. KVM also keeps the group file as otherwise its fd might have been
>> closed at the moment of KVM finish so vfio_group_del_external_user()
>> call will not be possible.
>
> This is the wrong order in my mind. An external user has no business
> checking or maintaining any state of a group until it calls
> add_external_user(). Only after that call is successful can the user
> assume the filep to group relationship is static and get the iommu_id.
> Any use of the "external user" API should start with "add" and end with
> "del".
Yes, this is what I actually do, just wrong commit message, will fix.
>
>> The "vfio: Limit group opens" patch is also required for the consistency.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>>
>> v1->v2: added definitions to vfio.h :)
>> Should not compile but compiled. Hm.
>>
>> ---
>> drivers/vfio/vfio.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++
>> include/linux/vfio.h | 7 +++++++
>> 2 files changed, 61 insertions(+)
>>
>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>> index c488da5..40875d2 100644
>> --- a/drivers/vfio/vfio.c
>> +++ b/drivers/vfio/vfio.c
>> @@ -1370,6 +1370,60 @@ static const struct file_operations vfio_device_fops = {
>> };
>>
>> /**
>> + * External user API, exported by symbols to be linked dynamically.
>> + */
>> +
>> +/* Allows an external user (for example, KVM) to lock an IOMMU group */
>> +int vfio_group_add_external_user(struct file *filep)
>> +{
>> + struct vfio_group *group = filep->private_data;
>> +
>> + if (filep->f_op != &vfio_group_fops)
>> + return -EINVAL;
>> +
>> + if (!atomic_inc_not_zero(&group->container_users))
>> + return -EINVAL;
>
> This is the place where I was suggesting we need tests to match
> get_device_fd. It's not clear what the external user is holding if the
> group has no iommu or is not viable here.
In my mind this test must include test for iommu id so I would merge it
with vfio_group_iommu_id_from_file(). Till I check iommu id, I still cannot
use this group so where to put check for iommu/viable does not really
matter (for me).
>
>
> if (!group->container->iommu_driver || !vfio_group_viable(group)) {
> vfio_group_try_dissolve_container(group);
> return -EINVAL;
> }
>
>> +
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_group_add_external_user);
>> +
>> +/* Allows an external user (for example, KVM) to unlock an IOMMU group */
>> +void vfio_group_del_external_user(struct file *filep)
>> +{
>> + struct vfio_group *group = filep->private_data;
>> +
>> + if (WARN_ON(filep->f_op != &vfio_group_fops))
>> + return;
>
> How about we make this return int so we can return 0/-EINVAL and the
> caller can decide the severity of the response?
And what can the caller possibly do on !0?
--
Alexey
On Fri, 2013-06-28 at 08:57 +1000, Alexey Kardashevskiy wrote:
> On 06/28/2013 01:44 AM, Alex Williamson wrote:
> > On Thu, 2013-06-27 at 17:14 +1000, Alexey Kardashevskiy wrote:
> >> VFIO is designed to be used via ioctls on file descriptors
> >> returned by VFIO.
> >>
> >> However in some situations support for an external user is required.
> >> The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
> >> use the existing VFIO groups for exclusive access in real/virtual mode
> >> in the host kernel to avoid passing map/unmap requests to the user
> >> space which would made things pretty slow.
> >>
> >> The proposed protocol includes:
> >>
> >> 1. do normal VFIO init stuff such as opening a new container, attaching
> >> group(s) to it, setting an IOMMU driver for a container. When IOMMU is
> >> set for a container, all groups in it are considered ready to use by
> >> an external user.
> >>
> >> 2. pass a fd of the group we want to accelerate to KVM. KVM calls
> >> vfio_group_iommu_id_from_file() to verify if the group is initialized
> >> and IOMMU is set for it. The current TCE IOMMU driver marks the whole
> >> IOMMU table as busy when IOMMU is set for a container what this prevents
> >> other DMA users from allocating from it so it is safe to pass the group
> >> to the user space.
> >>
> >> 3. KVM increases the container users counter via
> >> vfio_group_add_external_user(). This prevents the VFIO group from
> >> being disposed prior to exiting KVM.
> >>
> >> 4. When KVM is finished and doing cleanup, it releases the group file
> >> and decrements the container users counter. Everything gets released.
> >>
> >> 5. KVM also keeps the group file as otherwise its fd might have been
> >> closed at the moment of KVM finish so vfio_group_del_external_user()
> >> call will not be possible.
> >
> > This is the wrong order in my mind. An external user has no business
> > checking or maintaining any state of a group until it calls
> > add_external_user(). Only after that call is successful can the user
> > assume the filep to group relationship is static and get the iommu_id.
> > Any use of the "external user" API should start with "add" and end with
> > "del".
>
> Yes, this is what I actually do, just wrong commit message, will fix.
>
> >
> >> The "vfio: Limit group opens" patch is also required for the consistency.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >> ---
> >>
> >> v1->v2: added definitions to vfio.h :)
> >> Should not compile but compiled. Hm.
> >>
> >> ---
> >> drivers/vfio/vfio.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >> include/linux/vfio.h | 7 +++++++
> >> 2 files changed, 61 insertions(+)
> >>
> >> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >> index c488da5..40875d2 100644
> >> --- a/drivers/vfio/vfio.c
> >> +++ b/drivers/vfio/vfio.c
> >> @@ -1370,6 +1370,60 @@ static const struct file_operations vfio_device_fops = {
> >> };
> >>
> >> /**
> >> + * External user API, exported by symbols to be linked dynamically.
> >> + */
> >> +
> >> +/* Allows an external user (for example, KVM) to lock an IOMMU group */
> >> +int vfio_group_add_external_user(struct file *filep)
> >> +{
> >> + struct vfio_group *group = filep->private_data;
> >> +
> >> + if (filep->f_op != &vfio_group_fops)
> >> + return -EINVAL;
> >> +
> >> + if (!atomic_inc_not_zero(&group->container_users))
> >> + return -EINVAL;
> >
> > This is the place where I was suggesting we need tests to match
> > get_device_fd. It's not clear what the external user is holding if the
> > group has no iommu or is not viable here.
>
>
> In my mind this test must include test for iommu id so I would merge it
> with vfio_group_iommu_id_from_file().
I'm not sure what that means.
> Till I check iommu id, I still cannot
> use this group so where to put check for iommu/viable does not really
> matter (for me).
The difference is that getting the group id may just be the first of
several external user API interfaces. The idea of external user
interface is that from add->del the group is maintained in the same
state as if a device was opened. If we disassemble that so that add
sets up some stuff and getting the group id does a little more, what
happens if we start adding more external user API callbacks? A user of
the interface shouldn't need to know the internals to know which
interface allows what aspect of use. Besides, I don't want to have to
worry about managing another state slightly different from that used by
the device fd.
> >
> >
> > if (!group->container->iommu_driver || !vfio_group_viable(group)) {
> > vfio_group_try_dissolve_container(group);
> > return -EINVAL;
> > }
> >
> >> +
> >> + return 0;
> >> +}
> >> +EXPORT_SYMBOL_GPL(vfio_group_add_external_user);
> >> +
> >> +/* Allows an external user (for example, KVM) to unlock an IOMMU group */
> >> +void vfio_group_del_external_user(struct file *filep)
> >> +{
> >> + struct vfio_group *group = filep->private_data;
> >> +
> >> + if (WARN_ON(filep->f_op != &vfio_group_fops))
> >> + return;
> >
> > How about we make this return int so we can return 0/-EINVAL and the
> > caller can decide the severity of the response?
>
> And what can the caller possibly do on !0?
What if the caller is just passing a filep from userspace, should they
be allowed to fill the logs by hitting this WARN_ON? I don't know where
it comes from here and whether the caller can return an error to
userspace. If this is the same filep that the caller used on add, they
they can legitimately WARN_ON, but we can't tell if that's the case
here. Thanks,
Alex
On 06/28/2013 10:41 AM, Alex Williamson wrote:
> On Fri, 2013-06-28 at 08:57 +1000, Alexey Kardashevskiy wrote:
>> On 06/28/2013 01:44 AM, Alex Williamson wrote:
>>> On Thu, 2013-06-27 at 17:14 +1000, Alexey Kardashevskiy wrote:
>>>> VFIO is designed to be used via ioctls on file descriptors
>>>> returned by VFIO.
>>>>
>>>> However in some situations support for an external user is required.
>>>> The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
>>>> use the existing VFIO groups for exclusive access in real/virtual mode
>>>> in the host kernel to avoid passing map/unmap requests to the user
>>>> space which would made things pretty slow.
>>>>
>>>> The proposed protocol includes:
>>>>
>>>> 1. do normal VFIO init stuff such as opening a new container, attaching
>>>> group(s) to it, setting an IOMMU driver for a container. When IOMMU is
>>>> set for a container, all groups in it are considered ready to use by
>>>> an external user.
>>>>
>>>> 2. pass a fd of the group we want to accelerate to KVM. KVM calls
>>>> vfio_group_iommu_id_from_file() to verify if the group is initialized
>>>> and IOMMU is set for it. The current TCE IOMMU driver marks the whole
>>>> IOMMU table as busy when IOMMU is set for a container what this prevents
>>>> other DMA users from allocating from it so it is safe to pass the group
>>>> to the user space.
>>>>
>>>> 3. KVM increases the container users counter via
>>>> vfio_group_add_external_user(). This prevents the VFIO group from
>>>> being disposed prior to exiting KVM.
>>>>
>>>> 4. When KVM is finished and doing cleanup, it releases the group file
>>>> and decrements the container users counter. Everything gets released.
>>>>
>>>> 5. KVM also keeps the group file as otherwise its fd might have been
>>>> closed at the moment of KVM finish so vfio_group_del_external_user()
>>>> call will not be possible.
>>>
>>> This is the wrong order in my mind. An external user has no business
>>> checking or maintaining any state of a group until it calls
>>> add_external_user(). Only after that call is successful can the user
>>> assume the filep to group relationship is static and get the iommu_id.
>>> Any use of the "external user" API should start with "add" and end with
>>> "del".
>>
>> Yes, this is what I actually do, just wrong commit message, will fix.
>>
>>>
>>>> The "vfio: Limit group opens" patch is also required for the consistency.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>> ---
>>>>
>>>> v1->v2: added definitions to vfio.h :)
>>>> Should not compile but compiled. Hm.
>>>>
>>>> ---
>>>> drivers/vfio/vfio.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> include/linux/vfio.h | 7 +++++++
>>>> 2 files changed, 61 insertions(+)
>>>>
>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>> index c488da5..40875d2 100644
>>>> --- a/drivers/vfio/vfio.c
>>>> +++ b/drivers/vfio/vfio.c
>>>> @@ -1370,6 +1370,60 @@ static const struct file_operations vfio_device_fops = {
>>>> };
>>>>
>>>> /**
>>>> + * External user API, exported by symbols to be linked dynamically.
>>>> + */
>>>> +
>>>> +/* Allows an external user (for example, KVM) to lock an IOMMU group */
>>>> +int vfio_group_add_external_user(struct file *filep)
>>>> +{
>>>> + struct vfio_group *group = filep->private_data;
>>>> +
>>>> + if (filep->f_op != &vfio_group_fops)
>>>> + return -EINVAL;
>>>> +
>>>> + if (!atomic_inc_not_zero(&group->container_users))
>>>> + return -EINVAL;
>>>
>>> This is the place where I was suggesting we need tests to match
>>> get_device_fd. It's not clear what the external user is holding if the
>>> group has no iommu or is not viable here.
>>
>>
>> In my mind this test must include test for iommu id so I would merge it
>> with vfio_group_iommu_id_from_file().
>
> I'm not sure what that means.
Sorry. Still a mess in my head :( I'll try to explain.
vfio_group_add_external_user() should tell if the group is viable and has
iommu (does not the latter include check for viable?).
vfio_group_iommu_id_from_file() tells the group id which has to be compared
by KVM with what KVM got from the userspace and KVM should reject if the
group id is wrong.
So there are 3 checks. KVM can continue if all three passed.
>> Till I check iommu id, I still cannot
>> use this group so where to put check for iommu/viable does not really
>> matter (for me).
>
> The difference is that getting the group id may just be the first of
> several external user API interfaces. The idea of external user
> interface is that from add->del the group is maintained in the same
> state as if a device was opened.
Good point.
> If we disassemble that so that add
> sets up some stuff and getting the group id does a little more, what
> happens if we start adding more external user API callbacks? A user of
> the interface shouldn't need to know the internals to know which
> interface allows what aspect of use. Besides, I don't want to have to
> worry about managing another state slightly different from that used by
> the device fd.
>>>
>>>
>>> if (!group->container->iommu_driver || !vfio_group_viable(group)) {
>>> vfio_group_try_dissolve_container(group);
>>> return -EINVAL;
>>> }
>>>
>>>> +
>>>> + return 0;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(vfio_group_add_external_user);
>>>> +
>>>> +/* Allows an external user (for example, KVM) to unlock an IOMMU group */
>>>> +void vfio_group_del_external_user(struct file *filep)
>>>> +{
>>>> + struct vfio_group *group = filep->private_data;
>>>> +
>>>> + if (WARN_ON(filep->f_op != &vfio_group_fops))
>>>> + return;
>>>
>>> How about we make this return int so we can return 0/-EINVAL and the
>>> caller can decide the severity of the response?
>>
>> And what can the caller possibly do on !0?
>
> What if the caller is just passing a filep from userspace, should they
> be allowed to fill the logs by hitting this WARN_ON? I don't know where
> it comes from here and whether the caller can return an error to
> userspace. If this is the same filep that the caller used on add, they
> they can legitimately WARN_ON, but we can't tell if that's the case
> here. Thanks,
Well, we say that holding file* is a part of API. Why would anyone call
vfio_group_del_external_user() on something but the file* it got when
opened a group fd?
--
Alexey
On Fri, 2013-06-28 at 11:38 +1000, Alexey Kardashevskiy wrote:
> On 06/28/2013 10:41 AM, Alex Williamson wrote:
> > On Fri, 2013-06-28 at 08:57 +1000, Alexey Kardashevskiy wrote:
> >> On 06/28/2013 01:44 AM, Alex Williamson wrote:
> >>> On Thu, 2013-06-27 at 17:14 +1000, Alexey Kardashevskiy wrote:
> >>>> VFIO is designed to be used via ioctls on file descriptors
> >>>> returned by VFIO.
> >>>>
> >>>> However in some situations support for an external user is required.
> >>>> The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
> >>>> use the existing VFIO groups for exclusive access in real/virtual mode
> >>>> in the host kernel to avoid passing map/unmap requests to the user
> >>>> space which would made things pretty slow.
> >>>>
> >>>> The proposed protocol includes:
> >>>>
> >>>> 1. do normal VFIO init stuff such as opening a new container, attaching
> >>>> group(s) to it, setting an IOMMU driver for a container. When IOMMU is
> >>>> set for a container, all groups in it are considered ready to use by
> >>>> an external user.
> >>>>
> >>>> 2. pass a fd of the group we want to accelerate to KVM. KVM calls
> >>>> vfio_group_iommu_id_from_file() to verify if the group is initialized
> >>>> and IOMMU is set for it. The current TCE IOMMU driver marks the whole
> >>>> IOMMU table as busy when IOMMU is set for a container what this prevents
> >>>> other DMA users from allocating from it so it is safe to pass the group
> >>>> to the user space.
> >>>>
> >>>> 3. KVM increases the container users counter via
> >>>> vfio_group_add_external_user(). This prevents the VFIO group from
> >>>> being disposed prior to exiting KVM.
> >>>>
> >>>> 4. When KVM is finished and doing cleanup, it releases the group file
> >>>> and decrements the container users counter. Everything gets released.
> >>>>
> >>>> 5. KVM also keeps the group file as otherwise its fd might have been
> >>>> closed at the moment of KVM finish so vfio_group_del_external_user()
> >>>> call will not be possible.
> >>>
> >>> This is the wrong order in my mind. An external user has no business
> >>> checking or maintaining any state of a group until it calls
> >>> add_external_user(). Only after that call is successful can the user
> >>> assume the filep to group relationship is static and get the iommu_id.
> >>> Any use of the "external user" API should start with "add" and end with
> >>> "del".
> >>
> >> Yes, this is what I actually do, just wrong commit message, will fix.
> >>
> >>>
> >>>> The "vfio: Limit group opens" patch is also required for the consistency.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >>>> ---
> >>>>
> >>>> v1->v2: added definitions to vfio.h :)
> >>>> Should not compile but compiled. Hm.
> >>>>
> >>>> ---
> >>>> drivers/vfio/vfio.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> include/linux/vfio.h | 7 +++++++
> >>>> 2 files changed, 61 insertions(+)
> >>>>
> >>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >>>> index c488da5..40875d2 100644
> >>>> --- a/drivers/vfio/vfio.c
> >>>> +++ b/drivers/vfio/vfio.c
> >>>> @@ -1370,6 +1370,60 @@ static const struct file_operations vfio_device_fops = {
> >>>> };
> >>>>
> >>>> /**
> >>>> + * External user API, exported by symbols to be linked dynamically.
> >>>> + */
> >>>> +
> >>>> +/* Allows an external user (for example, KVM) to lock an IOMMU group */
> >>>> +int vfio_group_add_external_user(struct file *filep)
> >>>> +{
> >>>> + struct vfio_group *group = filep->private_data;
> >>>> +
> >>>> + if (filep->f_op != &vfio_group_fops)
> >>>> + return -EINVAL;
> >>>> +
> >>>> + if (!atomic_inc_not_zero(&group->container_users))
> >>>> + return -EINVAL;
> >>>
> >>> This is the place where I was suggesting we need tests to match
> >>> get_device_fd. It's not clear what the external user is holding if the
> >>> group has no iommu or is not viable here.
> >>
> >>
> >> In my mind this test must include test for iommu id so I would merge it
> >> with vfio_group_iommu_id_from_file().
> >
> > I'm not sure what that means.
>
> Sorry. Still a mess in my head :( I'll to explain.
>
> vfio_group_add_external_user() should tell if the group is viable and has
> iommu
Agreed
> (does not the latter include check for viable?).
Mostly paranoia
> vfio_group_iommu_id_from_file() tells the group id which has to be compared
> by KVM with what KVM got from the userspace and KVM should reject if the
> group id is wrong.
>
> So there are 3 checks. KVM can continue if all three passed.
That's KVM's business, but what does it prove for userspace to give KVM
both a vfio group file descriptor and a group id? It seems redundant
since the group id from vfio needs to take precedence. More paranoia?
> >> Till I check iommu id, I still cannot
> >> use this group so where to put check for iommu/viable does not really
> >> matter (for me).
> >
> > The difference is that getting the group id may just be the first of
> > several external user API interfaces. The idea of external user
> > interface is that from add->del the group is maintained in the same
> > state as if a device was opened.
>
> Good point.
>
> > If we disassemble that so that add
> > sets up some stuff and getting the group id does a little more, what
> > happens if we start adding more external user API callbacks? A user of
> > the interface shouldn't need to know the internals to know which
> > interface allows what aspect of use. Besides, I don't want to have to
> > worry about managing another state slightly different from that used by
> > the device fd.
>
>
>
> >>>
> >>>
> >>> if (!group->container->iommu_driver || !vfio_group_viable(group)) {
> >>> vfio_group_try_dissolve_container(group);
> >>> return -EINVAL;
> >>> }
> >>>
> >>>> +
> >>>> + return 0;
> >>>> +}
> >>>> +EXPORT_SYMBOL_GPL(vfio_group_add_external_user);
> >>>> +
> >>>> +/* Allows an external user (for example, KVM) to unlock an IOMMU group */
> >>>> +void vfio_group_del_external_user(struct file *filep)
> >>>> +{
> >>>> + struct vfio_group *group = filep->private_data;
> >>>> +
> >>>> + if (WARN_ON(filep->f_op != &vfio_group_fops))
> >>>> + return;
> >>>
> >>> How about we make this return int so we can return 0/-EINVAL and the
> >>> caller can decide the severity of the response?
> >>
> >> And what can the caller possibly do on !0?
> >
> > What if the caller is just passing a filep from userspace, should they
> > be allowed to fill the logs by hitting this WARN_ON? I don't know where
> > it comes from here and whether the caller can return an error to
> > userspace. If this is the same filep that the caller used on add, they
> > they can legitimately WARN_ON, but we can't tell if that's the case
> > here. Thanks,
>
> Well, we say that holding file* is a part of API.
You're right, we should call vfio_group_get/put explicitly from add/del.
An open device increments the group reference count as well, so again
it's just making it look more like an open device. This may favor a
get/put interface like below.
> Why would anyone call
> vfio_group_del_external_user() on something but the file* it got when
> opened a group fd?
Is "Why" irrelevant? WARN makes more sense to me if the release is done
from an object we provide. If both the add and del are just
dereferencing a field of another object, we don't know where the object
comes from for release and we don't know how serious it is. So another
way we could do the interface would be:
struct vfio_group *vfio_group_get_external_user(struct file *filep)
void vfio_group_put_external_user(struct vfio_group *group)
int vfio_external_user_iommu_id(struct vfio_group *group)
group would of course be opaque externally.
Thanks,
Alex
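For comparison with the file-pointer based sketch earlier in this thread, a caller against this opaque-group variant might look roughly like the sketch below. It is illustrative only: the kvm_vfio_* names are invented, and both the ERR_PTR-style return from vfio_group_get_external_user() and the group holding its own file reference are assumptions, since neither is spelled out above.

/* Illustrative only: caller side of the suggested opaque-group API. */
#include <linux/err.h>
#include <linux/errno.h>
#include <linux/file.h>
#include <linux/vfio.h>

/* Hypothetical KVM-side state; not part of the proposal above */
static struct vfio_group *kvm_vfio_group;

static int kvm_vfio_attach_group(int groupfd, int expected_iommu_id)
{
	struct file *filep = fget(groupfd);
	struct vfio_group *group;
	int iommu_id;

	if (!filep)
		return -EBADF;

	group = vfio_group_get_external_user(filep);
	/* Assumption: the group holds its own references from here on,
	 * so the file can be released immediately. */
	fput(filep);
	if (IS_ERR(group))
		return PTR_ERR(group);

	iommu_id = vfio_external_user_iommu_id(group);
	if (iommu_id < 0 || iommu_id != expected_iommu_id) {
		vfio_group_put_external_user(group);
		return -EINVAL;
	}

	kvm_vfio_group = group;

	return 0;
}

static void kvm_vfio_detach_group(void)
{
	vfio_group_put_external_user(kvm_vfio_group);
	kvm_vfio_group = NULL;
}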
On 06/28/2013 12:37 PM, Alex Williamson wrote:
> On Fri, 2013-06-28 at 11:38 +1000, Alexey Kardashevskiy wrote:
>> On 06/28/2013 10:41 AM, Alex Williamson wrote:
>>> On Fri, 2013-06-28 at 08:57 +1000, Alexey Kardashevskiy wrote:
>>>> On 06/28/2013 01:44 AM, Alex Williamson wrote:
>>>>> On Thu, 2013-06-27 at 17:14 +1000, Alexey Kardashevskiy wrote:
>>>>>> VFIO is designed to be used via ioctls on file descriptors
>>>>>> returned by VFIO.
>>>>>>
>>>>>> However in some situations support for an external user is required.
>>>>>> The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
>>>>>> use the existing VFIO groups for exclusive access in real/virtual mode
>>>>>> in the host kernel to avoid passing map/unmap requests to the user
>>>>>> space which would made things pretty slow.
>>>>>>
>>>>>> The proposed protocol includes:
>>>>>>
>>>>>> 1. do normal VFIO init stuff such as opening a new container, attaching
>>>>>> group(s) to it, setting an IOMMU driver for a container. When IOMMU is
>>>>>> set for a container, all groups in it are considered ready to use by
>>>>>> an external user.
>>>>>>
>>>>>> 2. pass a fd of the group we want to accelerate to KVM. KVM calls
>>>>>> vfio_group_iommu_id_from_file() to verify if the group is initialized
>>>>>> and IOMMU is set for it. The current TCE IOMMU driver marks the whole
>>>>>> IOMMU table as busy when IOMMU is set for a container what this prevents
>>>>>> other DMA users from allocating from it so it is safe to pass the group
>>>>>> to the user space.
>>>>>>
>>>>>> 3. KVM increases the container users counter via
>>>>>> vfio_group_add_external_user(). This prevents the VFIO group from
>>>>>> being disposed prior to exiting KVM.
>>>>>>
>>>>>> 4. When KVM is finished and doing cleanup, it releases the group file
>>>>>> and decrements the container users counter. Everything gets released.
>>>>>>
>>>>>> 5. KVM also keeps the group file as otherwise its fd might have been
>>>>>> closed at the moment of KVM finish so vfio_group_del_external_user()
>>>>>> call will not be possible.
>>>>>
>>>>> This is the wrong order in my mind. An external user has no business
>>>>> checking or maintaining any state of a group until it calls
>>>>> add_external_user(). Only after that call is successful can the user
>>>>> assume the filep to group relationship is static and get the iommu_id.
>>>>> Any use of the "external user" API should start with "add" and end with
>>>>> "del".
>>>>
>>>> Yes, this is what I actually do, just wrong commit message, will fix.
>>>>
>>>>>
>>>>>> The "vfio: Limit group opens" patch is also required for the consistency.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>>>>>> ---
>>>>>>
>>>>>> v1->v2: added definitions to vfio.h :)
>>>>>> Should not compile but compiled. Hm.
>>>>>>
>>>>>> ---
>>>>>> drivers/vfio/vfio.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> include/linux/vfio.h | 7 +++++++
>>>>>> 2 files changed, 61 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>>>> index c488da5..40875d2 100644
>>>>>> --- a/drivers/vfio/vfio.c
>>>>>> +++ b/drivers/vfio/vfio.c
>>>>>> @@ -1370,6 +1370,60 @@ static const struct file_operations vfio_device_fops = {
>>>>>> };
>>>>>>
>>>>>> /**
>>>>>> + * External user API, exported by symbols to be linked dynamically.
>>>>>> + */
>>>>>> +
>>>>>> +/* Allows an external user (for example, KVM) to lock an IOMMU group */
>>>>>> +int vfio_group_add_external_user(struct file *filep)
>>>>>> +{
>>>>>> + struct vfio_group *group = filep->private_data;
>>>>>> +
>>>>>> + if (filep->f_op != &vfio_group_fops)
>>>>>> + return -EINVAL;
>>>>>> +
>>>>>> + if (!atomic_inc_not_zero(&group->container_users))
>>>>>> + return -EINVAL;
>>>>>
>>>>> This is the place where I was suggesting we need tests to match
>>>>> get_device_fd. It's not clear what the external user is holding if the
>>>>> group has no iommu or is not viable here.
>>>>
>>>>
>>>> In my mind this test must include test for iommu id so I would merge it
>>>> with vfio_group_iommu_id_from_file().
>>>
>>> I'm not sure what that means.
>>
>> Sorry. Still a mess in my head :( I'll to explain.
>>
>> vfio_group_add_external_user() should tell if the group is viable and has
>> iommu
>
> Agreed
>
>> (does not the latter include check for viable?).
>
> Mostly paranoia
>
>> vfio_group_iommu_id_from_file() tells the group id which has to be compared
>> by KVM with what KVM got from the userspace and KVM should reject if the
>> group id is wrong.
>>
>> So there are 3 checks. KVM can continue if all three passed.
>
> That's KVM's business, but what does it prove for userspace to give KVM
> both a vfio group file descriptor and a group id? It seems redundant
> since the group id from vfio needs to take precedence. More paranoia?
Yep, that's her :) Without this check, the user space is allowed to mix up
PHB ID (liobn) and IOMMU group. It has the right to do so and it should not
break anything but nice to check, no?
>>>> Till I check iommu id, I still cannot
>>>> use this group so where to put check for iommu/viable does not really
>>>> matter (for me).
>>>
>>> The difference is that getting the group id may just be the first of
>>> several external user API interfaces. The idea of external user
>>> interface is that from add->del the group is maintained in the same
>>> state as if a device was opened.
>>
>> Good point.
>>
>>> If we disassemble that so that add
>>> sets up some stuff and getting the group id does a little more, what
>>> happens if we start adding more external user API callbacks? A user of
>>> the interface shouldn't need to know the internals to know which
>>> interface allows what aspect of use. Besides, I don't want to have to
>>> worry about managing another state slightly different from that used by
>>> the device fd.
>>
>>
>>
>>>>>
>>>>>
>>>>> if (!group->container->iommu_driver || !vfio_group_viable(group)) {
>>>>> vfio_group_try_dissolve_container(group);
>>>>> return -EINVAL;
>>>>> }
>>>>>
>>>>>> +
>>>>>> + return 0;
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(vfio_group_add_external_user);
>>>>>> +
>>>>>> +/* Allows an external user (for example, KVM) to unlock an IOMMU group */
>>>>>> +void vfio_group_del_external_user(struct file *filep)
>>>>>> +{
>>>>>> + struct vfio_group *group = filep->private_data;
>>>>>> +
>>>>>> + if (WARN_ON(filep->f_op != &vfio_group_fops))
>>>>>> + return;
>>>>>
>>>>> How about we make this return int so we can return 0/-EINVAL and the
>>>>> caller can decide the severity of the response?
>>>>
>>>> And what can the caller possibly do on !0?
>>>
>>> What if the caller is just passing a filep from userspace, should they
>>> be allowed to fill the logs by hitting this WARN_ON? I don't know where
>>> it comes from here and whether the caller can return an error to
>>> userspace. If this is the same filep that the caller used on add, they
>>> they can legitimately WARN_ON, but we can't tell if that's the case
>>> here. Thanks,
>>
>> Well, we say that holding file* is a part of API.
>
> You're right, we should call vfio_group_get/put explicitly from add/del.
> An open device increments the group reference count as well, so again
> it's just making it look more like an open device. This may favor a
> get/put interface like below.
>
>> Why would anyone call
>> vfio_group_del_external_user() on something but the file* it got when
>> opened a group fd?
>
> Is "Why" irrelevant? WARN makes more sense to me if the release is done
> from an object we provide. If both the add and del are just
> dereferencing a field of another object, we don't know where the object
> comes from for release and we don't know how serious it is. So another
> way we could do the interface would be:
>
> struct vfio_group *vfio_group_get_external_user(struct file *filep)
> void vfio_group_put_external_user(struct vfio_group *group)
> int vfio_external_user_iommu_id(struct vfio_group *group)
>
> group would of course be opaque externally.
Makes sense and lets us avoid keeping file*. Ok, I'll try that then.
--
Alexey
On 06/27/2013 07:02 AM, Alexey Kardashevskiy wrote:
> Signed-off-by: Alexey Kardashevskiy<[email protected]>
> ---
> include/uapi/linux/kvm.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 970b1f5..0865c01 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
> #define KVM_CAP_PPC_RTAS 91
> #define KVM_CAP_IRQ_XICS 92
> #define KVM_CAP_SPAPR_MULTITCE 93
> +#define KVM_CAP_SPAPR_TCE_IOMMU 94
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -923,6 +924,7 @@ struct kvm_s390_ucas_mapping {
> /* Available with KVM_CAP_PPC_ALLOC_HTAB */
> #define KVM_PPC_ALLOCATE_HTAB _IOWR(KVMIO, 0xa7, __u32)
> #define KVM_CREATE_SPAPR_TCE _IOW(KVMIO, 0xa8, struct kvm_create_spapr_tce)
> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xaf, struct kvm_create_spapr_tce_iommu)
Please order them by number.
Alex
> /* Available with KVM_CAP_RMA */
> #define KVM_ALLOCATE_RMA _IOR(KVMIO, 0xa9, struct kvm_allocate_rma)
> /* Available with KVM_CAP_PPC_HTAB_FD */
On 07/10/2013 01:35 AM, Alexander Graf wrote:
> On 06/27/2013 07:02 AM, Alexey Kardashevskiy wrote:
>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>> ---
>> include/uapi/linux/kvm.h | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 970b1f5..0865c01 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
>> #define KVM_CAP_PPC_RTAS 91
>> #define KVM_CAP_IRQ_XICS 92
>> #define KVM_CAP_SPAPR_MULTITCE 93
>> +#define KVM_CAP_SPAPR_TCE_IOMMU 94
>>
>> #ifdef KVM_CAP_IRQ_ROUTING
>>
>> @@ -923,6 +924,7 @@ struct kvm_s390_ucas_mapping {
>> /* Available with KVM_CAP_PPC_ALLOC_HTAB */
>> #define KVM_PPC_ALLOCATE_HTAB _IOWR(KVMIO, 0xa7, __u32)
>> #define KVM_CREATE_SPAPR_TCE _IOW(KVMIO, 0xa8, struct
>> kvm_create_spapr_tce)
>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xaf, struct
>> kvm_create_spapr_tce_iommu)
>
> Please order them by number.
Oh. Again :( We have had this discussion with Scott Wood here already.
Where _exactly_ do you want me to put it? Many sections, not really
ordered. Thank you.
--
Alexey
On 10.07.2013, at 01:35, Alexey Kardashevskiy wrote:
> On 07/10/2013 01:35 AM, Alexander Graf wrote:
>> On 06/27/2013 07:02 AM, Alexey Kardashevskiy wrote:
>>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>> ---
>>> include/uapi/linux/kvm.h | 2 ++
>>> 1 file changed, 2 insertions(+)
>>>
>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>> index 970b1f5..0865c01 100644
>>> --- a/include/uapi/linux/kvm.h
>>> +++ b/include/uapi/linux/kvm.h
>>> @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
>>> #define KVM_CAP_PPC_RTAS 91
>>> #define KVM_CAP_IRQ_XICS 92
>>> #define KVM_CAP_SPAPR_MULTITCE 93
>>> +#define KVM_CAP_SPAPR_TCE_IOMMU 94
>>>
>>> #ifdef KVM_CAP_IRQ_ROUTING
>>>
>>> @@ -923,6 +924,7 @@ struct kvm_s390_ucas_mapping {
>>> /* Available with KVM_CAP_PPC_ALLOC_HTAB */
>>> #define KVM_PPC_ALLOCATE_HTAB _IOWR(KVMIO, 0xa7, __u32)
>>> #define KVM_CREATE_SPAPR_TCE _IOW(KVMIO, 0xa8, struct
>>> kvm_create_spapr_tce)
>>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xaf, struct
>>> kvm_create_spapr_tce_iommu)
>>
>> Please order them by number.
>
> Oh. Again :( We have had this discussion with Scott Wood here already.
> Where _exactly_ do you want me to put it?
8 lines further down. With a comment saying when it's available. Also why is it af, not ad?
> Many sections, not really ordered. Thank you.
They should all be ordered inside of their own categories.
Alex
On 07/10/2013 08:27 PM, Alexander Graf wrote:
>
> On 10.07.2013, at 01:35, Alexey Kardashevskiy wrote:
>
>> On 07/10/2013 01:35 AM, Alexander Graf wrote:
>>> On 06/27/2013 07:02 AM, Alexey Kardashevskiy wrote:
>>>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>>> ---
>>>> include/uapi/linux/kvm.h | 2 ++
>>>> 1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>> index 970b1f5..0865c01 100644
>>>> --- a/include/uapi/linux/kvm.h
>>>> +++ b/include/uapi/linux/kvm.h
>>>> @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
>>>> #define KVM_CAP_PPC_RTAS 91
>>>> #define KVM_CAP_IRQ_XICS 92
>>>> #define KVM_CAP_SPAPR_MULTITCE 93
>>>> +#define KVM_CAP_SPAPR_TCE_IOMMU 94
>>>>
>>>> #ifdef KVM_CAP_IRQ_ROUTING
>>>>
>>>> @@ -923,6 +924,7 @@ struct kvm_s390_ucas_mapping {
>>>> /* Available with KVM_CAP_PPC_ALLOC_HTAB */
>>>> #define KVM_PPC_ALLOCATE_HTAB _IOWR(KVMIO, 0xa7, __u32)
>>>> #define KVM_CREATE_SPAPR_TCE _IOW(KVMIO, 0xa8, struct
>>>> kvm_create_spapr_tce)
>>>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xaf, struct
>>>> kvm_create_spapr_tce_iommu)
>>>
>>> Please order them by number.
>>
>> Oh. Again :( We have had this discussion with Scott Wood here already.
>> Where _exactly_ do you want me to put it?
>
> 8 lines further down. With a comment saying when it's available. Also why is it af, not ad?
0xad and 0xae are taken.
Where should I have commented this? In the commit message? Or in the patch
itself?
>> Many sections, not really ordered. Thank you.
>
> They should all be ordered inside of their own categories.
--
Alexey
On 10.07.2013, at 16:17, Alexey Kardashevskiy wrote:
> On 07/10/2013 08:27 PM, Alexander Graf wrote:
>>
>> On 10.07.2013, at 01:35, Alexey Kardashevskiy wrote:
>>
>>> On 07/10/2013 01:35 AM, Alexander Graf wrote:
>>>> On 06/27/2013 07:02 AM, Alexey Kardashevskiy wrote:
>>>>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>>>> ---
>>>>> include/uapi/linux/kvm.h | 2 ++
>>>>> 1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>>> index 970b1f5..0865c01 100644
>>>>> --- a/include/uapi/linux/kvm.h
>>>>> +++ b/include/uapi/linux/kvm.h
>>>>> @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
>>>>> #define KVM_CAP_PPC_RTAS 91
>>>>> #define KVM_CAP_IRQ_XICS 92
>>>>> #define KVM_CAP_SPAPR_MULTITCE 93
>>>>> +#define KVM_CAP_SPAPR_TCE_IOMMU 94
>>>>>
>>>>> #ifdef KVM_CAP_IRQ_ROUTING
>>>>>
>>>>> @@ -923,6 +924,7 @@ struct kvm_s390_ucas_mapping {
>>>>> /* Available with KVM_CAP_PPC_ALLOC_HTAB */
>>>>> #define KVM_PPC_ALLOCATE_HTAB _IOWR(KVMIO, 0xa7, __u32)
>>>>> #define KVM_CREATE_SPAPR_TCE _IOW(KVMIO, 0xa8, struct
>>>>> kvm_create_spapr_tce)
>>>>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xaf, struct
>>>>> kvm_create_spapr_tce_iommu)
>>>>
>>>> Please order them by number.
>>>
>>> Oh. Again :( We have had this discussion with Scott Wood here already.
>>> Where _exactly_ do you want me to put it?
>>
>> 8 lines further down. With a comment saying when it's available. Also why is it af, not ad?
>
>
> 0xad and 0xae are taken.
> Where should I have commented this? In the commit message? Or in the patch
> itself?
Yeah, with a comment right in between 0xad and your 0xaf entry explaining the gap.
Alex
>
>
>>> Many sections, not really ordered. Thank you.
>>
>> They should all be ordered inside of their own categories.
>
>
> --
> Alexey
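
Putting Alex's two requests together (order by number, plus a comment both
for availability and for the 0xad/0xae gap), the respun hunk would look
roughly like the sketch below. This is not the actual resubmission; the
elided comment stands in for whatever the tree already defines for
0xa9 through 0xae:

/* Available with KVM_CAP_PPC_ALLOC_HTAB */
#define KVM_PPC_ALLOCATE_HTAB	  _IOWR(KVMIO, 0xa7, __u32)
#define KVM_CREATE_SPAPR_TCE	  _IOW(KVMIO, 0xa8, struct kvm_create_spapr_tce)
/* ... the existing ioctls for 0xa9 .. 0xae stay where they are ... */
/* 0xad and 0xae are already taken, hence the jump to 0xaf */
/* Available with KVM_CAP_SPAPR_TCE_IOMMU */
#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xaf, struct kvm_create_spapr_tce_iommu)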