This series accelerates IOMMU operations in real and virtual
mode in the host kernel for KVM guests.
The first patch, which adds multiple-TCE support, is useful for emulated
devices as is. The other patches are designed for VFIO, although this series
does not contain any VFIO related code: the connection between
VFIO and the new handlers is to be made in QEMU
via an ioctl on the KVM fd.
The series was made and tested against v3.10-rc1.
Alexey Kardashevskiy (4):
KVM: PPC: Add support for multiple-TCE hcalls
powerpc: Prepare to support kernel handling of IOMMU map/unmap
KVM: PPC: Add support for IOMMU in-kernel handling
KVM: PPC: Add hugepage support for IOMMU in-kernel handling
Documentation/virtual/kvm/api.txt | 42 +++
arch/powerpc/include/asm/kvm_host.h | 7 +
arch/powerpc/include/asm/kvm_ppc.h | 40 ++-
arch/powerpc/include/asm/pgtable-ppc64.h | 4 +
arch/powerpc/include/uapi/asm/kvm.h | 7 +
arch/powerpc/kvm/book3s_64_vio.c | 398 ++++++++++++++++++++++++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 471 ++++++++++++++++++++++++++++--
arch/powerpc/kvm/book3s_hv.c | 39 +++
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
arch/powerpc/kvm/book3s_pr_papr.c | 37 ++-
arch/powerpc/kvm/powerpc.c | 15 +
arch/powerpc/mm/init_64.c | 77 ++++-
include/uapi/linux/kvm.h | 5 +
13 files changed, 1120 insertions(+), 28 deletions(-)
--
1.7.10.4
This adds real and virtual mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
devices or emulated PCI. These calls allow adding multiple entries
(up to 512) to the TCE table in one call, which saves time on
transitions to/from real mode.
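For reference, the guest side call sequence is roughly what the pSeries
multi-TCE path already does (an illustrative fragment only, not part of
this patch; liobn, ioba, proto_tce, rpn and npages are assumed to be set
up by the caller as usual):
	u64 *tcep = tce_page;	/* 4K page-aligned buffer for the TCE list */
	long l, rc;
	/* Build up to 512 TCEs, then issue a single hypercall */
	for (l = 0; l < npages; l++)
		tcep[l] = proto_tce | ((rpn + l) << TCE_RPN_SHIFT);
	rc = plpar_hcall_norets(H_PUT_TCE_INDIRECT, liobn, ioba,
				__pa(tcep), npages);
One hypercall thus replaces up to 512 separate H_PUT_TCE transitions.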
This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user space and verified) before writing the whole list into
the TCE table. This cache will be utilized more in the upcoming
VFIO/IOMMU support, to continue TCE list processing in virtual
mode in case the real mode handler fails for some reason.
This adds a guest-physical to host-real address converter
and calls the existing H_PUT_TCE handler. The conversion function
is going to be fully utilized by the upcoming VFIO support patches.
This also implements the KVM_CAP_SPAPR_MULTITCE capability. In order
to use the functionality of this patch, QEMU needs to query for this
capability and set the "hcall-multi-tce" hypertas property only if the
capability is present; otherwise there will be serious performance
degradation.
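On the QEMU side this is expected to look roughly like the following
(a hedged sketch, not the actual QEMU patch; add_hypertas() stands in
for however QEMU appends to the ibm,hypertas-functions property):
	/* Advertise "hcall-multi-tce" only when the kernel can back it,
	 * otherwise the guest would use the slow one-TCE-per-call path */
	if (kvm_enabled() &&
	    kvm_check_extension(kvm_state, KVM_CAP_SPAPR_MULTITCE))
		add_hypertas(hypertas, "hcall-multi-tce");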
Cc: David Gibson <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
---
Changelog:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed, instead we do not even start
writing to TCE table if we cannot get TCEs from the user and they are
invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with fallthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
---
Documentation/virtual/kvm/api.txt | 14 ++
arch/powerpc/include/asm/kvm_host.h | 2 +
arch/powerpc/include/asm/kvm_ppc.h | 16 +-
arch/powerpc/kvm/book3s_64_vio.c | 118 ++++++++++++++
arch/powerpc/kvm/book3s_64_vio_hv.c | 266 +++++++++++++++++++++++++++----
arch/powerpc/kvm/book3s_hv.c | 39 +++++
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++-
arch/powerpc/kvm/powerpc.c | 3 +
include/uapi/linux/kvm.h | 1 +
10 files changed, 470 insertions(+), 32 deletions(-)
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 5f91eda..3c7c7ea 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2780,3 +2780,17 @@ Parameters: args[0] is the XICS device fd
args[1] is the XICS CPU number (server ID) for this vcpu
This capability connects the vcpu to an in-kernel XICS device.
+
+6.8 KVM_CAP_SPAPR_MULTITCE
+
+Architectures: ppc
+Parameters: none
+Returns: 0 on success; -1 on error
+
+This capability enables the guest to put/remove multiple TCE entries
+per hypercall which significantly accelerates DMA operations for PPC KVM
+guests.
+
+When this capability is enabled, H_PUT_TCE_INDIRECT and H_STUFF_TCE are
+expected to occur rather than H_PUT_TCE which supports only one TCE entry
+per call.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index af326cd..85d8f26 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
spinlock_t tbacct_lock;
u64 busy_stolen;
u64 busy_preempt;
+
+ unsigned long *tce_tmp; /* TCE cache for H_PUT_TCE_INDIRECT hcall */
#endif
};
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index a5287fe..e852921b 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
-extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
- unsigned long ioba, unsigned long tce);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
+ struct kvm_vcpu *vcpu, unsigned long liobn);
+extern long kvmppc_emulated_validate_tce(unsigned long tce);
+extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+ unsigned long ioba, unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_list, unsigned long npages);
+extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages);
extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
struct kvm_allocate_rma *rma);
extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index b2d3f3b..06b7b20 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -14,6 +14,7 @@
*
* Copyright 2010 Paul Mackerras, IBM Corp. <[email protected]>
* Copyright 2011 David Gibson, IBM Corporation <[email protected]>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <[email protected]>
*/
#include <linux/types.h>
@@ -36,8 +37,11 @@
#include <asm/ppc-opcode.h>
#include <asm/kvm_host.h>
#include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR ((void *)~(unsigned long)0x0)
static long kvmppc_stt_npages(unsigned long window_size)
{
@@ -148,3 +152,117 @@ fail:
}
return ret;
}
+
+/* Converts guest physical address into host virtual */
+static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
+ unsigned long gpa)
+{
+ unsigned long hva, gfn = gpa >> PAGE_SHIFT;
+ struct kvm_memory_slot *memslot;
+
+ memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+ if (!memslot)
+ return ERROR_ADDR;
+
+ hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+ return (void *) hva;
+}
+
+long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce)
+{
+ long ret;
+ struct kvmppc_spapr_tce_table *tt;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, punt it to userspace */
+ if (!tt)
+ return H_TOO_HARD;
+
+ /* Emulated IO */
+ if (ioba >= tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_emulated_validate_tce(tce);
+ if (ret)
+ return ret;
+
+ kvmppc_emulated_put_tce(tt, ioba, tce);
+
+ return H_SUCCESS;
+}
+
+long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_list, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret;
+ unsigned long __user *tces;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, punt it to userspace */
+ if (!tt)
+ return H_TOO_HARD;
+
+ /*
+ * The spec says that the maximum size of the list is 512 TCEs,
+ * so the whole table fits within a single 4K page
+ */
+ if (npages > 512)
+ return H_PARAMETER;
+
+ if (tce_list & ~IOMMU_PAGE_MASK)
+ return H_PARAMETER;
+
+ tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
+ if (tces == ERROR_ADDR)
+ return H_TOO_HARD;
+
+ /* Emulated IO */
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ for (i = 0; i < npages; ++i) {
+ if (get_user(vcpu->arch.tce_tmp[i], tces + i))
+ return H_PARAMETER;
+
+ ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp[i]);
+ if (ret)
+ return ret;
+ }
+
+ for (i = 0; i < npages; ++i)
+ kvmppc_emulated_put_tce(tt,
+ ioba + (i << IOMMU_PAGE_SHIFT),
+ vcpu->arch.tce_tmp[i]);
+
+ return H_SUCCESS;
+}
+
+long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, punt it to userspace */
+ if (!tt)
+ return H_TOO_HARD;
+
+ /* Emulated IO */
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_emulated_validate_tce(tce_value);
+ if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+ return H_PARAMETER;
+
+ for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+ kvmppc_emulated_put_tce(tt, ioba, tce_value);
+
+ return H_SUCCESS;
+}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 30c2f3b..c68d538 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -14,6 +14,7 @@
*
* Copyright 2010 Paul Mackerras, IBM Corp. <[email protected]>
* Copyright 2011 David Gibson, IBM Corporation <[email protected]>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <[email protected]>
*/
#include <linux/types.h>
@@ -35,42 +36,249 @@
#include <asm/ppc-opcode.h>
#include <asm/kvm_host.h>
#include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR (~(unsigned long)0x0)
-/* WARNING: This will be called in real-mode on HV KVM and virtual
- * mode on PR KVM
+/* Finds a TCE table descriptor by LIOBN */
+struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
+ unsigned long liobn)
+{
+ struct kvmppc_spapr_tce_table *tt;
+
+ list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
+ if (tt->liobn == liobn)
+ return tt;
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
+
+/*
+ * Validate TCE address.
+ * At the moment only the flags are validated,
+ * as other checks would significantly slow down
+ * or even make it impossible to handle TCE requests
+ * in real mode.
+ */
+long kvmppc_emulated_validate_tce(unsigned long tce)
+{
+ if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
+ return H_PARAMETER;
+
+ return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
+
+/*
+ * kvmppc_emulated_put_tce() handles TCE requests for devices emulated
+ * by QEMU. It puts guest TCE values into the table and expects
+ * QEMU to convert them later in the device implementation.
+ * Works in both real and virtual modes.
+ * It cannot fail, so kvmppc_emulated_validate_tce() must be called before it.
*/
+void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+ unsigned long ioba, unsigned long tce)
+{
+ unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+ struct page *page;
+ u64 *tbl;
+
+ /*
+ * Note on the use of page_address() in real mode:
+ *
+ * It is safe to use page_address() in real mode on ppc64 because
+ * page_address() is always defined as lowmem_page_address(),
+ * which returns __va(PFN_PHYS(page_to_pfn(page))); this is a purely
+ * arithmetic operation and does not access the page struct.
+ *
+ * Theoretically page_address() could be defined differently,
+ * but then either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
+ * would have to be enabled.
+ * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
+ * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
+ * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
+ * is not expected to be enabled on ppc32, page_address()
+ * is safe for ppc32 as well.
+ */
+#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
+#error TODO: fix to avoid page_address() here
+#endif
+ page = tt->pages[idx / TCES_PER_PAGE];
+ tbl = (u64 *)page_address(page);
+
+ /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
+ tbl[idx % TCES_PER_PAGE] = tce;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
+
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+
+static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
+ unsigned long *pte_sizep)
+{
+ pte_t *ptep;
+ unsigned int shift = 0;
+ pte_t pte, tmp, ret;
+
+ ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift);
+ if (!ptep)
+ return __pte(0);
+ if (shift)
+ *pte_sizep = 1ul << shift;
+ else
+ *pte_sizep = PAGE_SIZE;
+
+ if (!pte_present(*ptep))
+ return __pte(0);
+
+ /* wait until _PAGE_BUSY is clear then set it atomically */
+ __asm__ __volatile__ (
+ "1: ldarx %0,0,%3\n"
+ " andi. %1,%0,%4\n"
+ " bne- 1b\n"
+ " ori %1,%0,%4\n"
+ " stdcx. %1,0,%3\n"
+ " bne- 1b"
+ : "=&r" (pte), "=&r" (tmp), "=m" (*ptep)
+ : "r" (ptep), "i" (_PAGE_BUSY)
+ : "cc");
+
+ ret = pte;
+
+ return ret;
+}
+
+/*
+ * Converts guest physical address into host physical address.
+ * Also returns pte and page size if the page is present in page table.
+ */
+static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
+ unsigned long gpa)
+{
+ struct kvm_memory_slot *memslot;
+ pte_t pte;
+ unsigned long hva, hpa, pg_size = 0, offset;
+ unsigned long gfn = gpa >> PAGE_SHIFT;
+ bool writing = gpa & TCE_PCI_WRITE;
+
+ /* Find a KVM memslot */
+ memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+ if (!memslot)
+ return ERROR_ADDR;
+
+ /* Convert guest physical address to host virtual */
+ hva = __gfn_to_hva_memslot(memslot, gfn);
+
+ /* Find a PTE and determine the size */
+ pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva,
+ writing, &pg_size);
+ if (!pte)
+ return ERROR_ADDR;
+
+ /* Calculate host phys address keeping flags and offset in the page */
+ offset = gpa & (pg_size - 1);
+
+ /* pte_pfn(pte) should return an address aligned to pg_size */
+ hpa = (pte_pfn(pte) << PAGE_SHIFT) + offset;
+
+ return hpa;
+}
+
long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce)
{
- struct kvm *kvm = vcpu->kvm;
- struct kvmppc_spapr_tce_table *stt;
-
- /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
- /* liobn, ioba, tce); */
-
- list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
- if (stt->liobn == liobn) {
- unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
- struct page *page;
- u64 *tbl;
-
- /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p window_size=0x%x\n", */
- /* liobn, stt, stt->window_size); */
- if (ioba >= stt->window_size)
- return H_PARAMETER;
-
- page = stt->pages[idx / TCES_PER_PAGE];
- tbl = (u64 *)page_address(page);
-
- /* FIXME: Need to validate the TCE itself */
- /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
- tbl[idx % TCES_PER_PAGE] = tce;
- return H_SUCCESS;
- }
+ long ret;
+ struct kvmppc_spapr_tce_table *tt;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, punt it to virtual mode */
+ if (!tt)
+ return H_TOO_HARD;
+
+ /* Emulated IO */
+ if (ioba >= tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_emulated_validate_tce(tce);
+ if (ret)
+ return ret;
+
+ kvmppc_emulated_put_tce(tt, ioba, tce);
+
+ return H_SUCCESS;
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_list, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret;
+ unsigned long *tces;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, punt it to virtual mode */
+ if (!tt)
+ return H_TOO_HARD;
+
+ /*
+ * The spec says that the maximum size of the list is 512 TCEs,
+ * so the whole table fits within a single 4K page
+ */
+ if (npages > 512)
+ return H_PARAMETER;
+
+ if (tce_list & ~IOMMU_PAGE_MASK)
+ return H_PARAMETER;
+
+ tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
+ if ((unsigned long)tces == ERROR_ADDR)
+ return H_TOO_HARD;
+
+ /* Emulated IO */
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ for (i = 0; i < npages; ++i) {
+ ret = kvmppc_emulated_validate_tce(tces[i]);
+ if (ret)
+ return ret;
}
- /* Didn't find the liobn, punt it to userspace */
- return H_TOO_HARD;
+ for (i = 0; i < npages; ++i)
+ kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+ tces[i]);
+
+ return H_SUCCESS;
+}
+
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, punt it to virtual mode */
+ if (!tt)
+ return H_TOO_HARD;
+
+ /* Emulated IO */
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_emulated_validate_tce(tce_value);
+ if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+ return H_PARAMETER;
+
+ for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+ kvmppc_emulated_put_tce(tt, ioba, tce_value);
+
+ return H_SUCCESS;
}
+#endif /* CONFIG_KVM_BOOK3S_64_HV */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 9de24f8..9d70fd8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -566,6 +566,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
ret = kvmppc_xics_hcall(vcpu, req);
break;
} /* fallthrough */
+ return RESUME_HOST;
+ case H_PUT_TCE:
+ ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+ kvmppc_get_gpr(vcpu, 5),
+ kvmppc_get_gpr(vcpu, 6));
+ if (ret == H_TOO_HARD)
+ return RESUME_HOST;
+ break;
+ case H_PUT_TCE_INDIRECT:
+ ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
+ kvmppc_get_gpr(vcpu, 5),
+ kvmppc_get_gpr(vcpu, 6),
+ kvmppc_get_gpr(vcpu, 7));
+ if (ret == H_TOO_HARD)
+ return RESUME_HOST;
+ break;
+ case H_STUFF_TCE:
+ ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+ kvmppc_get_gpr(vcpu, 5),
+ kvmppc_get_gpr(vcpu, 6),
+ kvmppc_get_gpr(vcpu, 7));
+ if (ret == H_TOO_HARD)
+ return RESUME_HOST;
+ break;
default:
return RESUME_HOST;
}
@@ -956,6 +980,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
vcpu->arch.cpu_type = KVM_CPU_3S_64;
kvmppc_sanity_check(vcpu);
+ /*
+ * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
+ * half executed, we first read the TCEs from the user, check them, and
+ * return an error if something went wrong; only then are the TCEs put
+ * into the TCE table.
+ *
+ * tce_tmp is a cache for TCEs to avoid stack allocation or
+ * kmalloc, as the whole TCE list can take up to 512 entries of 8 bytes
+ * each (4096 bytes).
+ */
+ vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
+ if (!vcpu->arch.tce_tmp)
+ goto free_vcpu;
+
return vcpu;
free_vcpu:
@@ -978,6 +1016,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
spin_unlock(&vcpu->arch.vpa_update_lock);
+ kfree(vcpu->arch.tce_tmp);
kvm_vcpu_uninit(vcpu);
kmem_cache_free(kvm_vcpu_cache, vcpu);
}
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b02f91e..d35554e 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1490,6 +1490,12 @@ hcall_real_table:
.long 0 /* 0x11c */
.long 0 /* 0x120 */
.long .kvmppc_h_bulk_remove - hcall_real_table
+ .long 0 /* 0x128 */
+ .long 0 /* 0x12c */
+ .long 0 /* 0x130 */
+ .long 0 /* 0x134 */
+ .long .kvmppc_h_stuff_tce - hcall_real_table
+ .long .kvmppc_h_put_tce_indirect - hcall_real_table
hcall_real_table_end:
ignore_hdec:
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index b24309c..3750e3c 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
unsigned long tce = kvmppc_get_gpr(vcpu, 6);
long rc;
- rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
+ rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
+ if (rc == H_TOO_HARD)
+ return EMULATE_FAIL;
+ kvmppc_set_gpr(vcpu, 3, rc);
+ return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
+{
+ unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+ unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+ unsigned long tce = kvmppc_get_gpr(vcpu, 6);
+ unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+ long rc;
+
+ rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
+ tce, npages);
+ if (rc == H_TOO_HARD)
+ return EMULATE_FAIL;
+ kvmppc_set_gpr(vcpu, 3, rc);
+ return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
+{
+ unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+ unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+ unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
+ unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+ long rc;
+
+ rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
if (rc == H_TOO_HARD)
return EMULATE_FAIL;
kvmppc_set_gpr(vcpu, 3, rc);
@@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
return kvmppc_h_pr_bulk_remove(vcpu);
case H_PUT_TCE:
return kvmppc_h_pr_put_tce(vcpu);
+ case H_PUT_TCE_INDIRECT:
+ return kvmppc_h_pr_put_tce_indirect(vcpu);
+ case H_STUFF_TCE:
+ return kvmppc_h_pr_stuff_tce(vcpu);
case H_CEDE:
vcpu->arch.shared->msr |= MSR_EE;
kvm_vcpu_block(vcpu);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 6316ee3..8465c2a 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
r = 1;
break;
#endif
+ case KVM_CAP_SPAPR_MULTITCE:
+ r = 1;
+ break;
default:
r = 0;
break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a5c86fc..5a2afda 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info {
#define KVM_CAP_IRQ_MPIC 90
#define KVM_CAP_PPC_RTAS 91
#define KVM_CAP_IRQ_XICS 92
+#define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
#ifdef KVM_CAP_IRQ_ROUTING
--
1.7.10.4
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests without passing them to QEMU, which should
save time on switching to QEMU and back.
Both real and virtual modes are supported: whenever the real mode handler
fails to handle a TCE request, it passes the request to the virtual mode.
If the virtual mode handlers fail as well, the request is passed
to user mode, for example, to QEMU.
This adds a new KVM_CREATE_SPAPR_TCE_IOMMU ioctl (advertised via the
KVM_CAP_SPAPR_TCE_IOMMU capability) to associate a virtual PCI bus ID
(LIOBN) with an IOMMU group, which enables in-kernel handling of IOMMU
map/unmap requests.
Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).
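The expected userspace usage is roughly as follows (a sketch; the LIOBN
and IOMMU group number are made-up example values):
	struct kvm_create_spapr_tce_iommu args = {
		.liobn = 0x80000001,	/* bus ID the guest passes in hcalls */
		.iommu_id = 3,		/* IOMMU group number, e.g. from sysfs */
		.flags = 0,		/* no flags are defined yet */
	};
	ret = ioctl(vmfd, KVM_CREATE_SPAPR_TCE_IOMMU, &args);
	/* on error, userspace keeps handling TCE calls for this LIOBN */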
Cc: David Gibson <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
---
Changes:
2013-05-20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. Now the real mode handler puts
translated TCEs there, tries realmode_get_page() on them, and if that fails,
passes control over to the virtual mode handler, which tries to finish
the request handling
* kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
on a page
* The only reason to pass the request to user mode now is that user mode
did not register a TCE table in the kernel; in all other cases the virtual
mode handler is expected to do the job
---
Documentation/virtual/kvm/api.txt | 28 +++++
arch/powerpc/include/asm/kvm_host.h | 3 +
arch/powerpc/include/asm/kvm_ppc.h | 2 +
arch/powerpc/include/uapi/asm/kvm.h | 7 ++
arch/powerpc/kvm/book3s_64_vio.c | 198 ++++++++++++++++++++++++++++++++++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 193 +++++++++++++++++++++++++++++++++-
arch/powerpc/kvm/powerpc.c | 12 +++
include/uapi/linux/kvm.h | 4 +
8 files changed, 441 insertions(+), 6 deletions(-)
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 3c7c7ea..3c8e9fe 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2362,6 +2362,34 @@ calls by the guest for that service will be passed to userspace to be
handled.
+4.79 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+This creates a link between an IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know which IOMMU
+group (i.e. TCE table) to use for the LIOBN passed with the
+H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls.
+
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+ __u64 liobn;
+ __u32 iommu_id;
+ __u32 flags;
+};
+
+No flags are supported at the moment.
+
+When the guest issues a TCE call on a LIOBN for which a TCE table has been
+registered, the kernel will handle it in real mode, updating the hardware
+TCE table. TCE calls for other LIOBNs will cause a VM exit and must
+be handled by userspace.
+
+
5. The kvm_run structure
------------------------
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 85d8f26..ac0e2fe 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
struct kvm *kvm;
u64 liobn;
u32 window_size;
+ struct iommu_group *grp; /* used for IOMMU groups */
struct page *pages[0];
};
@@ -611,6 +612,8 @@ struct kvm_vcpu_arch {
u64 busy_preempt;
unsigned long *tce_tmp; /* TCE cache for H_PUT_TCE_INDIRECT hcall */
+ unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */
+ unsigned long tce_reason; /* The reason for switching to virtual mode */
#endif
};
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index e852921b..934e01d 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,6 +133,8 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
+extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+ struct kvm_create_spapr_tce_iommu *args);
extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
struct kvm_vcpu *vcpu, unsigned long liobn);
extern long kvmppc_emulated_validate_tce(unsigned long tce);
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 0fb1a6e..cf82af4 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -319,6 +319,13 @@ struct kvm_create_spapr_tce {
__u32 window_size;
};
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+ __u64 liobn;
+ __u32 iommu_id;
+ __u32 flags;
+};
+
/* for KVM_ALLOCATE_RMA */
struct kvm_allocate_rma {
__u64 rma_size;
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 06b7b20..ffb4698 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,8 @@
#include <linux/hugetlb.h>
#include <linux/list.h>
#include <linux/anon_inodes.h>
+#include <linux/pci.h>
+#include <linux/iommu.h>
#include <asm/tlbflush.h>
#include <asm/kvm_ppc.h>
@@ -56,8 +58,13 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
mutex_lock(&kvm->lock);
list_del(&stt->list);
- for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
- __free_page(stt->pages[i]);
+#ifdef CONFIG_IOMMU_API
+ if (stt->grp) {
+ iommu_group_put(stt->grp);
+ } else
+#endif
+ for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+ __free_page(stt->pages[i]);
kfree(stt);
mutex_unlock(&kvm->lock);
@@ -153,6 +160,62 @@ fail:
return ret;
}
+#ifdef CONFIG_IOMMU_API
+static const struct file_operations kvm_spapr_tce_iommu_fops = {
+ .release = kvm_spapr_tce_release,
+};
+
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+ struct kvm_create_spapr_tce_iommu *args)
+{
+ struct kvmppc_spapr_tce_table *tt = NULL;
+ struct iommu_group *grp;
+ struct iommu_table *tbl;
+
+ /* Find an IOMMU table for the given ID */
+ grp = iommu_group_get_by_id(args->iommu_id);
+ if (!grp)
+ return -ENXIO;
+
+ tbl = iommu_group_get_iommudata(grp);
+ if (!tbl)
+ return -ENXIO;
+
+ /* Check this LIOBN hasn't been previously allocated */
+ list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) {
+ if (tt->liobn == args->liobn)
+ return -EBUSY;
+ }
+
+ tt = kzalloc(sizeof(*tt), GFP_KERNEL);
+ if (!tt)
+ return -ENOMEM;
+
+ tt->liobn = args->liobn;
+ tt->kvm = kvm;
+ tt->grp = grp;
+
+ kvm_get_kvm(kvm);
+
+ mutex_lock(&kvm->lock);
+ list_add(&tt->list, &kvm->arch.spapr_tce_tables);
+
+ mutex_unlock(&kvm->lock);
+
+ pr_debug("LIOBN=%llX hooked to IOMMU %d, flags=%u\n",
+ args->liobn, args->iommu_id, args->flags);
+
+ return anon_inode_getfd("kvm-spapr-tce-iommu",
+ &kvm_spapr_tce_iommu_fops, tt, O_RDWR);
+}
+#else
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+ struct kvm_create_spapr_tce_iommu *args)
+{
+ return -ENOSYS;
+}
+#endif /* CONFIG_IOMMU_API */
+
/* Converts guest physical address into host virtual */
static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
unsigned long gpa)
@@ -180,6 +243,46 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
if (!tt)
return H_TOO_HARD;
+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ /* Return error if the group is being destroyed */
+ if (!tbl)
+ return H_RESCINDED;
+
+ if (vcpu->arch.tce_reason == H_HARDWARE) {
+ iommu_clear_tces_and_put_pages(tbl, entry, 1);
+ return H_HARDWARE;
+
+ } else if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE))) {
+ if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+ return H_PARAMETER;
+
+ ret = iommu_clear_tces_and_put_pages(tbl, entry, 1);
+ } else {
+ void *hva;
+
+ if (iommu_tce_put_param_check(tbl, ioba, tce))
+ return H_PARAMETER;
+
+ hva = kvmppc_virtmode_gpa_to_hva(vcpu, tce);
+ if (hva == ERROR_ADDR)
+ return H_HARDWARE;
+
+ ret = iommu_put_tce_user_mode(tbl,
+ ioba >> IOMMU_PAGE_SHIFT,
+ (unsigned long) hva);
+ }
+ iommu_flush_tce(tbl);
+
+ if (ret)
+ return H_HARDWARE;
+
+ return H_SUCCESS;
+ }
+#endif
/* Emulated IO */
if (ioba >= tt->window_size)
return H_PARAMETER;
@@ -220,6 +323,70 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if (tces == ERROR_ADDR)
return H_TOO_HARD;
+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ /* Return error if the group is being destroyed */
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* Something bad happened, do cleanup and exit */
+ if (vcpu->arch.tce_reason == H_HARDWARE) {
+ i = vcpu->arch.tce_tmp_num;
+ goto fail_clear_tce;
+ } else if (vcpu->arch.tce_reason != H_TOO_HARD) {
+ /*
+ * We get here only in PR KVM mode, otherwise
+ * the real mode handler would have checked TCEs
+ * already and failed on guest TCE translation.
+ */
+ for (i = 0; i < npages; ++i) {
+ if (get_user(vcpu->arch.tce_tmp[i], tces + i))
+ return H_HARDWARE;
+
+ if (iommu_tce_put_param_check(tbl, ioba +
+ (i << IOMMU_PAGE_SHIFT),
+ vcpu->arch.tce_tmp[i]))
+ return H_PARAMETER;
+ }
+ } /* else: The real mode handler checked TCEs already */
+
+ /* Translate TCEs */
+ for (i = vcpu->arch.tce_tmp_num; i < npages; ++i) {
+ void *hva = kvmppc_virtmode_gpa_to_hva(vcpu,
+ vcpu->arch.tce_tmp[i]);
+
+ if (hva == ERROR_ADDR)
+ goto fail_clear_tce;
+
+ vcpu->arch.tce_tmp[i] = (unsigned long) hva;
+ }
+
+ /* Do get_page and put TCEs for all pages */
+ for (i = 0; i < npages; ++i) {
+ if (iommu_put_tce_user_mode(tbl,
+ (ioba >> IOMMU_PAGE_SHIFT) + i,
+ vcpu->arch.tce_tmp[i])) {
+ i = npages;
+ goto fail_clear_tce;
+ }
+ }
+
+ iommu_flush_tce(tbl);
+
+ return H_SUCCESS;
+
+fail_clear_tce:
+ /* Cannot complete the translation, clean up and exit */
+ iommu_clear_tces_and_put_pages(tbl,
+ ioba >> IOMMU_PAGE_SHIFT, i);
+
+ iommu_flush_tce(tbl);
+
+ return H_HARDWARE;
+ }
+#endif
/* Emulated IO */
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;
@@ -253,6 +420,33 @@ long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
if (!tt)
return H_TOO_HARD;
+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+ unsigned long tmp, entry = ioba >> IOMMU_PAGE_SHIFT;
+
+ vcpu->arch.tce_tmp_num = 0;
+
+ /* Return error if the group is being destroyed */
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* PR KVM? */
+ if (!vcpu->arch.tce_tmp_num &&
+ (vcpu->arch.tce_reason != H_TOO_HARD) &&
+ iommu_tce_clear_param_check(tbl, ioba,
+ tce_value, npages))
+ return H_PARAMETER;
+
+ /* Do actual cleanup */
+ tmp = vcpu->arch.tce_tmp_num;
+ if (iommu_clear_tces_and_put_pages(tbl, entry + tmp,
+ npages - tmp))
+ return H_PARAMETER;
+
+ return H_SUCCESS;
+ }
+#endif
/* Emulated IO */
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index c68d538..dc4ae32 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -26,6 +26,7 @@
#include <linux/slab.h>
#include <linux/hugetlb.h>
#include <linux/list.h>
+#include <linux/iommu.h>
#include <asm/tlbflush.h>
#include <asm/kvm_ppc.h>
@@ -118,7 +119,7 @@ EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
#ifdef CONFIG_KVM_BOOK3S_64_HV
static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
- unsigned long *pte_sizep)
+ unsigned long *pte_sizep, bool do_get_page)
{
pte_t *ptep;
unsigned int shift = 0;
@@ -135,6 +136,14 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
if (!pte_present(*ptep))
return __pte(0);
+ /*
+ * Defer huge page handling to the virtual mode.
+ * The only exception is TCE list pages, which we
+ * do need to call get_page() for.
+ */
+ if ((*pte_sizep > PAGE_SIZE) && do_get_page)
+ return __pte(0);
+
/* wait until _PAGE_BUSY is clear then set it atomically */
__asm__ __volatile__ (
"1: ldarx %0,0,%3\n"
@@ -148,6 +157,18 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
: "cc");
ret = pte;
+ if (do_get_page && pte_present(pte) && (!writing || pte_write(pte))) {
+ struct page *pg = NULL;
+ pg = realmode_pfn_to_page(pte_pfn(pte));
+ if (realmode_get_page(pg)) {
+ ret = __pte(0);
+ } else {
+ pte = pte_mkyoung(pte);
+ if (writing)
+ pte = pte_mkdirty(pte);
+ }
+ }
+ *ptep = pte; /* clears _PAGE_BUSY */
return ret;
}
@@ -157,7 +178,7 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
* Also returns pte and page size if the page is present in page table.
*/
static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
- unsigned long gpa)
+ unsigned long gpa, bool do_get_page)
{
struct kvm_memory_slot *memslot;
pte_t pte;
@@ -175,7 +196,7 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
/* Find a PTE and determine the size */
pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva,
- writing, &pg_size);
+ writing, &pg_size, do_get_page);
if (!pte)
return ERROR_ADDR;
@@ -188,6 +209,52 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
return hpa;
}
+#ifdef CONFIG_IOMMU_API
+static long kvmppc_clear_tce_real_mode(struct kvm_vcpu *vcpu,
+ struct iommu_table *tbl, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ long ret = 0, i;
+ unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+ if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+ return H_PARAMETER;
+
+ for (i = 0; i < npages; ++i) {
+ struct page *page;
+ unsigned long oldtce;
+
+ oldtce = iommu_clear_tce(tbl, entry + i);
+ if (!oldtce)
+ continue;
+
+ page = realmode_pfn_to_page(oldtce >> PAGE_SHIFT);
+ if (!page) {
+ ret = H_TOO_HARD;
+ break;
+ }
+
+ if (oldtce & TCE_PCI_WRITE)
+ SetPageDirty(page);
+
+ if (realmode_put_page(page)) {
+ ret = H_TOO_HARD;
+ break;
+ }
+ }
+
+ if (ret == H_TOO_HARD) {
+ vcpu->arch.tce_tmp_num = i;
+ vcpu->arch.tce_reason = H_TOO_HARD;
+ }
+ /* if (ret < 0)
+ pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
+ __func__, ioba, tce_value, ret); */
+
+ return ret;
+}
+#endif /* CONFIG_IOMMU_API */
+
long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce)
{
@@ -199,6 +266,52 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
if (!tt)
return H_TOO_HARD;
+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ /* Return error if the group is being destroyed */
+ if (!tbl)
+ return H_RESCINDED;
+
+ vcpu->arch.tce_reason = 0;
+
+ if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+ unsigned long hpa, hva;
+
+ if (iommu_tce_put_param_check(tbl, ioba, tce))
+ return H_PARAMETER;
+
+ hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce, true);
+ if (hpa == ERROR_ADDR) {
+ vcpu->arch.tce_reason = H_TOO_HARD;
+ return H_TOO_HARD;
+ }
+
+ hva = (unsigned long) __va(hpa);
+ ret = iommu_tce_build(tbl,
+ ioba >> IOMMU_PAGE_SHIFT,
+ hva, iommu_tce_direction(hva));
+ if (unlikely(ret)) {
+ struct page *pg = realmode_pfn_to_page(hpa);
+ BUG_ON(!pg);
+ if (realmode_put_page(pg)) {
+ vcpu->arch.tce_reason = H_HARDWARE;
+ return H_TOO_HARD;
+ }
+ return H_HARDWARE;
+ }
+ } else {
+ ret = kvmppc_clear_tce_real_mode(vcpu, tbl, ioba, 0, 1);
+ if (ret)
+ return ret;
+ }
+
+ iommu_flush_tce(tbl);
+
+ return H_SUCCESS;
+ }
+#endif
/* Emulated IO */
if (ioba >= tt->window_size)
return H_PARAMETER;
@@ -235,10 +348,62 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if (tce_list & ~IOMMU_PAGE_MASK)
return H_PARAMETER;
- tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
+ vcpu->arch.tce_tmp_num = 0;
+ vcpu->arch.tce_reason = 0;
+
+ tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu,
+ tce_list, false);
if ((unsigned long)tces == ERROR_ADDR)
return H_TOO_HARD;
+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ /* Return error if the group is being destroyed */
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* Check all TCEs */
+ for (i = 0; i < npages; ++i) {
+ if (iommu_tce_put_param_check(tbl, ioba +
+ (i << IOMMU_PAGE_SHIFT), tces[i]))
+ return H_PARAMETER;
+ vcpu->arch.tce_tmp[i] = tces[i];
+ }
+
+ /* Translate TCEs and go get_page */
+ for (i = 0; i < npages; ++i) {
+ unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu,
+ vcpu->arch.tce_tmp[i], true);
+ if (hpa == ERROR_ADDR) {
+ vcpu->arch.tce_tmp_num = i;
+ vcpu->arch.tce_reason = H_TOO_HARD;
+ return H_TOO_HARD;
+ }
+ vcpu->arch.tce_tmp[i] = hpa;
+ }
+
+ /* Put TCEs to the table */
+ for (i = 0; i < npages; ++i) {
+ unsigned long hva = (unsigned long)
+ __va(vcpu->arch.tce_tmp[i]);
+
+ ret = iommu_tce_build(tbl,
+ (ioba >> IOMMU_PAGE_SHIFT) + i,
+ hva, iommu_tce_direction(hva));
+ if (ret) {
+ /* All wrong, go virtmode and do cleanup */
+ vcpu->arch.tce_reason = H_HARDWARE;
+ return H_TOO_HARD;
+ }
+ }
+
+ iommu_flush_tce(tbl);
+
+ return H_SUCCESS;
+ }
+#endif
/* Emulated IO */
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;
@@ -268,6 +433,26 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
if (!tt)
return H_TOO_HARD;
+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ /* Return error if the group is being destroyed */
+ if (!tbl)
+ return H_RESCINDED;
+
+ vcpu->arch.tce_reason = 0;
+
+ ret = kvmppc_clear_tce_real_mode(vcpu, tbl, ioba,
+ tce_value, npages);
+ if (ret)
+ return ret;
+
+ iommu_flush_tce(tbl);
+
+ return H_SUCCESS;
+ }
+#endif
/* Emulated IO */
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 8465c2a..da6bf61 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
break;
#endif
case KVM_CAP_SPAPR_MULTITCE:
+ case KVM_CAP_SPAPR_TCE_IOMMU:
r = 1;
break;
default:
@@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
goto out;
}
+ case KVM_CREATE_SPAPR_TCE_IOMMU: {
+ struct kvm_create_spapr_tce_iommu create_tce_iommu;
+ struct kvm *kvm = filp->private_data;
+
+ r = -EFAULT;
+ if (copy_from_user(&create_tce_iommu, argp,
+ sizeof(create_tce_iommu)))
+ goto out;
+ r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
+ goto out;
+ }
#endif /* CONFIG_PPC_BOOK3S_64 */
#ifdef CONFIG_KVM_BOOK3S_64_HV
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5a2afda..450c82a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
#define KVM_CAP_PPC_RTAS 91
#define KVM_CAP_IRQ_XICS 92
#define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
+#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)
#ifdef KVM_CAP_IRQ_ROUTING
@@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
#define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct kvm_device_attr)
#define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct kvm_device_attr)
+/* ioctl for SPAPR TCE IOMMU */
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xe4, struct kvm_create_spapr_tce_iommu)
+
/*
* ioctls for vcpu fds
*/
--
1.7.10.4
This adds special support for huge pages (16MB). Reference
counting cannot easily be done for such pages in real mode (when the
MMU is off), so we add a list of huge pages. It is populated in
virtual mode, and get_page() is called just once per huge page.
Real mode handlers check whether the requested page is huge and in the list;
if so, no reference counting is done, otherwise an exit to virtual mode
happens. The list is released at KVM exit. At the moment the fastest
card available for tests uses up to 9 huge pages, so walking through this
list is not very expensive. However, this can change and we may want
to optimize it.
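Condensed, the real mode lookup below boils down to this (pseudocode;
find_hugepage_by_gpa() is shorthand for the list walk done under
hugepages_lock):
	hp = find_hugepage_by_gpa(tt, gpa);
	if (hp)	/* already get_page()'d once in virtual mode */
		return (pte_pfn(hp->pte) << PAGE_SHIFT) +
			(gpa & (hp->size - 1));
	return ERROR_ADDR;	/* exit to virtual mode, which does
				 * get_user_pages_fast() and adds the page */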
Cc: David Gibson <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
---
Changes:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints a warning if it is called twice for the same
huge page, as the real mode handler is expected to fail just once: when a huge
page is not in the list yet.
* the huge page is refcounted twice: when added to the hugepage list and
when used in the virtual mode hcall handler (this can be optimized but it
would make the patch less nice).
---
arch/powerpc/include/asm/kvm_host.h | 2 +
arch/powerpc/include/asm/kvm_ppc.h | 22 +++++++++
arch/powerpc/kvm/book3s_64_vio.c | 88 +++++++++++++++++++++++++++++++++--
arch/powerpc/kvm/book3s_64_vio_hv.c | 40 ++++++++++++++--
4 files changed, 146 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index ac0e2fe..4fc0865 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -181,6 +181,8 @@ struct kvmppc_spapr_tce_table {
u64 liobn;
u32 window_size;
struct iommu_group *grp; /* used for IOMMU groups */
+ struct list_head hugepages; /* used for IOMMU groups */
+ spinlock_t hugepages_lock; /* used for IOMMU groups */
struct page *pages[0];
};
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 934e01d..9054df0 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -149,6 +149,28 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
unsigned long liobn, unsigned long ioba,
unsigned long tce_value, unsigned long npages);
+
+/*
+ * The KVM guest can be backed with 16MB pages (qemu switch
+ * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).
+ * In this case we cannot do page counting from real mode,
+ * as compound pages are used: they are linked in a list
+ * with pointers that are virtual addresses, which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a list of 16MB pages and uses the page struct
+ * in real mode if the page is already locked in RAM and inserted into
+ * the list, or switches to the virtual mode where the page can be
+ * handled in the usual manner.
+ */
+struct kvmppc_iommu_hugepage {
+ struct list_head list;
+ pte_t pte; /* Huge page PTE */
+ unsigned long gpa; /* Guest physical address */
+ struct page *page; /* page struct of the very first subpage */
+ unsigned long size; /* Huge page size (always 16MB at the moment) */
+};
+
extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
struct kvm_allocate_rma *rma);
extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index ffb4698..c34d63a 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -45,6 +45,71 @@
#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
#define ERROR_ADDR ((void *)~(unsigned long)0x0)
+#ifdef CONFIG_IOMMU_API
+/* Adds a new huge page descriptor to the list */
+static long kvmppc_iommu_hugepage_try_add(
+ struct kvmppc_spapr_tce_table *tt,
+ pte_t pte, unsigned long hva, unsigned long gpa,
+ unsigned long pg_size)
+{
+ long ret = 0;
+ struct kvmppc_iommu_hugepage *hp;
+ struct page *p;
+
+ spin_lock(&tt->hugepages_lock);
+ list_for_each_entry(hp, &tt->hugepages, list) {
+ if (hp->pte == pte)
+ goto unlock_exit;
+ }
+
+ hva = hva & ~(pg_size - 1);
+ ret = get_user_pages_fast(hva, 1, true/*write*/, &p);
+ if ((ret != 1) || !p) {
+ ret = -EFAULT;
+ goto unlock_exit;
+ }
+ ret = 0;
+
+ hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+ if (!hp) {
+ ret = -ENOMEM;
+ goto unlock_exit;
+ }
+
+ hp->page = p;
+ hp->pte = pte;
+ hp->gpa = gpa & ~(pg_size - 1);
+ hp->size = pg_size;
+
+ list_add(&hp->list, &tt->hugepages);
+
+unlock_exit:
+ spin_unlock(&tt->hugepages_lock);
+
+ return ret;
+}
+
+static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
+{
+ INIT_LIST_HEAD(&tt->hugepages);
+ spin_lock_init(&tt->hugepages_lock);
+}
+
+static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
+{
+ struct kvmppc_iommu_hugepage *hp, *tmp;
+
+ spin_lock(&tt->hugepages_lock);
+ list_for_each_entry_safe(hp, tmp, &tt->hugepages, list) {
+ list_del(&hp->list);
+ put_page(hp->page); /* one for iommu_put_tce_user_mode */
+ put_page(hp->page); /* one for kvmppc_iommu_hugepage_try_add */
+ kfree(hp);
+ }
+ spin_unlock(&tt->hugepages_lock);
+}
+#endif /* CONFIG_IOMMU_API */
+
static long kvmppc_stt_npages(unsigned long window_size)
{
return ALIGN((window_size >> SPAPR_TCE_SHIFT)
@@ -61,6 +126,7 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
#ifdef CONFIG_IOMMU_API
if (stt->grp) {
iommu_group_put(stt->grp);
+ kvmppc_iommu_hugepages_cleanup(stt);
} else
#endif
for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
@@ -198,6 +264,7 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
kvm_get_kvm(kvm);
mutex_lock(&kvm->lock);
+ kvmppc_iommu_hugepages_init(tt);
list_add(&tt->list, &kvm->arch.spapr_tce_tables);
mutex_unlock(&kvm->lock);
@@ -218,16 +285,31 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
/* Converts guest physical address into host virtual */
static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
unsigned long gpa)
{
unsigned long hva, gfn = gpa >> PAGE_SHIFT;
struct kvm_memory_slot *memslot;
+ pte_t *ptep;
+ unsigned int shift = 0;
memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
if (!memslot)
return ERROR_ADDR;
hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+
+ ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+ WARN_ON(!ptep);
+ if (!ptep)
+ return ERROR_ADDR;
+
+ if (tt && (shift > PAGE_SHIFT)) {
+ if (kvmppc_iommu_hugepage_try_add(tt, *ptep,
+ hva, gpa, 1 << shift))
+ return ERROR_ADDR;
+ }
+
return (void *) hva;
}
@@ -267,7 +349,7 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
if (iommu_tce_put_param_check(tbl, ioba, tce))
return H_PARAMETER;
- hva = kvmppc_virtmode_gpa_to_hva(vcpu, tce);
+ hva = kvmppc_virtmode_gpa_to_hva(vcpu, tt, tce);
if (hva == ERROR_ADDR)
return H_HARDWARE;
@@ -319,7 +401,7 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if (tce_list & ~IOMMU_PAGE_MASK)
return H_PARAMETER;
- tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
+ tces = kvmppc_virtmode_gpa_to_hva(vcpu, NULL, tce_list);
if (tces == ERROR_ADDR)
return H_TOO_HARD;
@@ -354,7 +436,7 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
/* Translate TCEs */
for (i = vcpu->arch.tce_tmp_num; i < npages; ++i) {
- void *hva = kvmppc_virtmode_gpa_to_hva(vcpu,
+ void *hva = kvmppc_virtmode_gpa_to_hva(vcpu, tt,
vcpu->arch.tce_tmp[i]);
if (hva == ERROR_ADDR)
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index dc4ae32..6245365 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -178,6 +178,7 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
* Also returns pte and page size if the page is present in page table.
*/
static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
unsigned long gpa, bool do_get_page)
{
struct kvm_memory_slot *memslot;
@@ -185,7 +186,31 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
unsigned long hva, hpa, pg_size = 0, offset;
unsigned long gfn = gpa >> PAGE_SHIFT;
bool writing = gpa & TCE_PCI_WRITE;
+ struct kvmppc_iommu_hugepage *hp;
+ /*
+ * Try to find an already used hugepage.
+ * If it is not there, kvmppc_lookup_pte() will return zero
+ * as it won't do get_page() on a huge page in real mode
+ * and therefore the request will be passed to the virtual mode.
+ */
+ if (tt) {
+ spin_lock(&tt->hugepages_lock);
+ list_for_each_entry(hp, &tt->hugepages, list) {
+ if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
+ continue;
+
+ /* Calculate host phys address keeping flags and offset in the page */
+ offset = gpa & (hp->size - 1);
+
+ /* pte_pfn(pte) should return an address aligned to pg_size */
+ hpa = (pte_pfn(hp->pte) << PAGE_SHIFT) + offset;
+ spin_unlock(&tt->hugepages_lock);
+
+ return hpa;
+ }
+ spin_unlock(&tt->hugepages_lock);
+ }
/* Find a KVM memslot */
memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
if (!memslot)
@@ -237,6 +262,10 @@ static long kvmppc_clear_tce_real_mode(struct kvm_vcpu *vcpu,
if (oldtce & TCE_PCI_WRITE)
SetPageDirty(page);
+ /* Do not put a huge page and continue without error */
+ if (PageCompound(page))
+ continue;
+
if (realmode_put_page(page)) {
ret = H_TOO_HARD;
break;
@@ -282,7 +311,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
if (iommu_tce_put_param_check(tbl, ioba, tce))
return H_PARAMETER;
- hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce, true);
+ hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt, tce, true);
if (hpa == ERROR_ADDR) {
vcpu->arch.tce_reason = H_TOO_HARD;
return H_TOO_HARD;
@@ -295,6 +324,11 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
if (unlikely(ret)) {
struct page *pg = realmode_pfn_to_page(hpa);
BUG_ON(!pg);
+
+ /* Do not put a huge page and return an error */
+ if (!PageCompound(pg))
+ return H_HARDWARE;
+
if (realmode_put_page(pg)) {
vcpu->arch.tce_reason = H_HARDWARE;
return H_TOO_HARD;
@@ -351,7 +385,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
vcpu->arch.tce_tmp_num = 0;
vcpu->arch.tce_reason = 0;
- tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu,
+ tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, NULL,
tce_list, false);
if ((unsigned long)tces == ERROR_ADDR)
return H_TOO_HARD;
@@ -374,7 +408,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
/* Translate TCEs and go get_page */
for (i = 0; i < npages; ++i) {
- unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu,
+ unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt,
vcpu->arch.tce_tmp[i], true);
if (hpa == ERROR_ADDR) {
vcpu->arch.tce_tmp_num = i;
--
1.7.10.4
The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU sends requests to map/unmap pages.
However, this approach is really slow, so we want to move that work to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped), we chose
to implement it as a "fast" hypercall directly in "real
mode" (the processor is still in the guest context but the MMU is off).
To be able to do that, we need to provide some facilities for
accessing the struct page reference count within that real mode environment,
as things like the sparsemem vmemmap mappings aren't accessible there.
This adds an API to increment/decrement the page reference counter, as
the get_user_pages() API used for user mode mapping does not work
in real mode.
CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.
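A condensed sketch of how the KVM real mode handlers from the previous
patches are expected to use this API (error paths shortened):
	struct page *pg = realmode_pfn_to_page(hpa >> PAGE_SHIFT);
	if (!pg || realmode_get_page(pg))
		return H_TOO_HARD;	/* redo the request in virtual mode */
	/* ... update the TCE table entry ... */
	if (realmode_put_page(pg))
		return H_TOO_HARD;	/* e.g. a compound (huge) page */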
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: Paul Mackerras <[email protected]>
Cc: David Gibson <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
---
Changes:
2013-05-20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()
---
arch/powerpc/include/asm/pgtable-ppc64.h | 4 ++
arch/powerpc/mm/init_64.c | 77 +++++++++++++++++++++++++++++-
2 files changed, 80 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index e3d55f6f..7b46e5f 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
}
#endif /* !CONFIG_HUGETLB_PAGE */
+struct page *realmode_pfn_to_page(unsigned long pfn);
+int realmode_get_page(struct page *page);
+int realmode_put_page(struct page *page);
+
#endif /* __ASSEMBLY__ */
#endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index c2787bf..ba6cf9b 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -296,5 +296,80 @@ void vmemmap_free(unsigned long start, unsigned long end)
{
}
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fall back to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if the performance of that linear walk becomes a problem.
+ *
+ * Any of the realmode_XXXX functions can fail for two reasons:
+ * 1) As real sparsemem blocks do not lie in RAM contiguously (they
+ * are in virtual address space, which is not available in real mode),
+ * the requested page struct can be split between blocks, so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ * When 1) or 2) takes place, the API returns an error code to cause
+ * an exit to kernel virtual mode where the operation will be completed.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct vmemmap_backing *vmem_back;
+ struct page *page;
+ unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+ unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+ for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+ if (pg_va < vmem_back->virt_addr)
+ continue;
+ /* Check that page struct is not split between real pages */
+ if ((pg_va + sizeof(struct page)) >
+ (vmem_back->virt_addr + page_size))
+ return NULL;
+
+ page = (struct page *) (vmem_back->phys + pg_va -
+ vmem_back->virt_addr);
+ return page;
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+ return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+int realmode_get_page(struct page *page)
+{
+ if (PageCompound(page))
+ return -EAGAIN;
+
+ get_page(page);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_get_page);
+
+int realmode_put_page(struct page *page)
+{
+ if (PageCompound(page))
+ return -EAGAIN;
+
+ if (!atomic_add_unless(&page->_count, -1, 1))
+ return -EAGAIN;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_put_page);
+#endif
--
1.7.10.4
On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 8465c2a..da6bf61 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> break;
> #endif
> case KVM_CAP_SPAPR_MULTITCE:
> + case KVM_CAP_SPAPR_TCE_IOMMU:
> r = 1;
> break;
> default:
Don't advertise SPAPR capabilities if it's not book3s -- and probably
there's some additional limitation that would be appropriate.
> @@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
> r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
> goto out;
> }
> + case KVM_CREATE_SPAPR_TCE_IOMMU: {
> + struct kvm_create_spapr_tce_iommu create_tce_iommu;
> + struct kvm *kvm = filp->private_data;
> +
> + r = -EFAULT;
> + if (copy_from_user(&create_tce_iommu, argp,
> + sizeof(create_tce_iommu)))
> + goto out;
> + r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
> &create_tce_iommu);
> + goto out;
> + }
> #endif /* CONFIG_PPC_BOOK3S_64 */
>
> #ifdef CONFIG_KVM_BOOK3S_64_HV
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 5a2afda..450c82a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
> #define KVM_CAP_PPC_RTAS 91
> #define KVM_CAP_IRQ_XICS 92
> #define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
> +#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)
Hmm...
> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
> #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct
> kvm_device_attr)
> #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct
> kvm_device_attr)
>
> +/* ioctl for SPAPR TCE IOMMU */
> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xe4, struct
> kvm_create_spapr_tce_iommu)
Shouldn't this go under the vm ioctl section?
-Scott
On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
> On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
> >diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >index 8465c2a..da6bf61 100644
> >--- a/arch/powerpc/kvm/powerpc.c
> >+++ b/arch/powerpc/kvm/powerpc.c
> >@@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > break;
> > #endif
> > case KVM_CAP_SPAPR_MULTITCE:
> >+ case KVM_CAP_SPAPR_TCE_IOMMU:
> > r = 1;
> > break;
> > default:
>
> Don't advertise SPAPR capabilities if it's not book3s -- and
> probably there's some additional limitation that would be
> appropriate.
So, in the case of MULTITCE, that's not quite right. PR KVM can
emulate a PAPR system on a BookE machine, and there's no reason not to
allow TCE acceleration as well. We can't make it dependent on PAPR
mode being selected, because that's enabled per-vcpu, whereas these
capabilities are queried on the VM before the vcpus are created.
CAP_SPAPR_TCE_IOMMU should be dependent on the presence of suitable
host side hardware (i.e. a PAPR style IOMMU), though.
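A hypothetical sketch of that dependency in
kvm_dev_ioctl_check_extension() (not code from this series;
iommu_present() is the generic IOMMU presence check, and using it for
this purpose is my assumption):

	case KVM_CAP_SPAPR_TCE_IOMMU:
		/* hypothetical: advertise only with a host IOMMU present */
		r = iommu_present(&pci_bus_type) ? 1 : 0;
		break;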
>
> >@@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
> > r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
> > goto out;
> > }
> >+ case KVM_CREATE_SPAPR_TCE_IOMMU: {
> >+ struct kvm_create_spapr_tce_iommu create_tce_iommu;
> >+ struct kvm *kvm = filp->private_data;
> >+
> >+ r = -EFAULT;
> >+ if (copy_from_user(&create_tce_iommu, argp,
> >+ sizeof(create_tce_iommu)))
> >+ goto out;
> >+ r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
> >&create_tce_iommu);
> >+ goto out;
> >+ }
> > #endif /* CONFIG_PPC_BOOK3S_64 */
> >
> > #ifdef CONFIG_KVM_BOOK3S_64_HV
> >diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >index 5a2afda..450c82a 100644
> >--- a/include/uapi/linux/kvm.h
> >+++ b/include/uapi/linux/kvm.h
> >@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
> > #define KVM_CAP_PPC_RTAS 91
> > #define KVM_CAP_IRQ_XICS 92
> > #define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
> >+#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)
>
> Hmm...
Ah, yeah, that needs to be fixed. Those were interim numbers so that
we didn't have to keep changing our internal trees as new upstream
ioctls got added to the list. We need to get a proper number for the
merge, though.
> >@@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
> > #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct
> >kvm_device_attr)
> > #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct
> >kvm_device_attr)
> >
> >+/* ioctl for SPAPR TCE IOMMU */
> >+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xe4, struct
> >kvm_create_spapr_tce_iommu)
>
> Shouldn't this go under the vm ioctl section?
>
> -Scott
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On 05/25/2013 12:45 PM, David Gibson wrote:
> On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
>> On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>> index 8465c2a..da6bf61 100644
>>> --- a/arch/powerpc/kvm/powerpc.c
>>> +++ b/arch/powerpc/kvm/powerpc.c
>>> @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>>> break;
>>> #endif
>>> case KVM_CAP_SPAPR_MULTITCE:
>>> + case KVM_CAP_SPAPR_TCE_IOMMU:
>>> r = 1;
>>> break;
>>> default:
>>
>> Don't advertise SPAPR capabilities if it's not book3s -- and
>> probably there's some additional limitation that would be
>> appropriate.
>
> So, in the case of MULTITCE, that's not quite right. PR KVM can
> emulate a PAPR system on a BookE machine, and there's no reason not to
> allow TCE acceleration as well. We can't make it dependent on PAPR
> mode being selected, because that's enabled per-vcpu, whereas these
> capabilities are queried on the VM before the vcpus are created.
>
> CAP_SPAPR_TCE_IOMMU should be dependent on the presence of suitable
> host side hardware (i.e. a PAPR style IOMMU), though.
The capability says that the ioctl is supported. If there is no IOMMU group
registered, then it will fail with a reasonable error and nobody gets hurt.
What is the problem?
>>
>>> @@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>> r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
>>> goto out;
>>> }
>>> + case KVM_CREATE_SPAPR_TCE_IOMMU: {
>>> + struct kvm_create_spapr_tce_iommu create_tce_iommu;
>>> + struct kvm *kvm = filp->private_data;
>>> +
>>> + r = -EFAULT;
>>> + if (copy_from_user(&create_tce_iommu, argp,
>>> + sizeof(create_tce_iommu)))
>>> + goto out;
>>> + r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
>>> &create_tce_iommu);
>>> + goto out;
>>> + }
>>> #endif /* CONFIG_PPC_BOOK3S_64 */
>>>
>>> #ifdef CONFIG_KVM_BOOK3S_64_HV
>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>> index 5a2afda..450c82a 100644
>>> --- a/include/uapi/linux/kvm.h
>>> +++ b/include/uapi/linux/kvm.h
>>> @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
>>> #define KVM_CAP_PPC_RTAS 91
>>> #define KVM_CAP_IRQ_XICS 92
>>> #define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
>>> +#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)
>>
>> Hmm...
>
> Ah, yeah, that needs to be fixed. Those were interim numbers so that
> we didn't have to keep changing our internal trees as new upstream
> ioctls got added to the list. We need to get a proper number for the
> merge, though.
>
>>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
>>> #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct
>>> kvm_device_attr)
>>> #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct
>>> kvm_device_attr)
>>>
>>> +/* ioctl for SPAPR TCE IOMMU */
>>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xe4, struct
>>> kvm_create_spapr_tce_iommu)
>>
>> Shouldn't this go under the vm ioctl section?
The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated devices) is
in this section so I decided to keep them together. Wrong?
--
Alexey
Il 21/05/2013 05:06, Alexey Kardashevskiy ha scritto:
> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
> devices or emulated PCI.
Do you mean vio? virtio (without getting into whether that's good or
bad) will bypass the iommu.
Paolo
> These calls allow adding multiple entries
> (up to 512) into the TCE table in one call which saves time on
> transition to/from real mode.
>
> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
> (copied from user and verified) before writing the whole list into
> the TCE table. This cache will be utilized more in the upcoming
> VFIO/IOMMU support to continue TCE list processing in the virtual
> mode in case the real mode handler fails for some reason.
>
> This adds a guest physical to host real address converter
> and calls the existing H_PUT_TCE handler. The converting function
> is going to be fully utilized by upcoming VFIO supporting patches.
>
> This also implements the KVM_CAP_PPC_MULTITCE capability,
> so in order to support the functionality of this patch, QEMU
> needs to query for this capability and set the "hcall-multi-tce"
> hypertas property only if the capability is present, otherwise
> there will be serious performance degradation.
>
> Cc: David Gibson <[email protected]>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> Signed-off-by: Paul Mackerras <[email protected]>
>
> ---
> Changelog:
> * added kvm_vcpu_arch::tce_tmp
> * removed cleanup if put_indirect failed, instead we do not even start
> writing to the TCE table if we cannot get TCEs from the user or they are
> invalid
> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
> and kvmppc_emulated_validate_tce (for the previous item)
> * fixed bug with fallthrough for H_IPI
> * removed all get_user() from real mode handlers
> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
> ---
> Documentation/virtual/kvm/api.txt | 14 ++
> arch/powerpc/include/asm/kvm_host.h | 2 +
> arch/powerpc/include/asm/kvm_ppc.h | 16 +-
> arch/powerpc/kvm/book3s_64_vio.c | 118 ++++++++++++++
> arch/powerpc/kvm/book3s_64_vio_hv.c | 266 +++++++++++++++++++++++++++----
> arch/powerpc/kvm/book3s_hv.c | 39 +++++
> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++-
> arch/powerpc/kvm/powerpc.c | 3 +
> include/uapi/linux/kvm.h | 1 +
> 10 files changed, 470 insertions(+), 32 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 5f91eda..3c7c7ea 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2780,3 +2780,17 @@ Parameters: args[0] is the XICS device fd
> args[1] is the XICS CPU number (server ID) for this vcpu
>
> This capability connects the vcpu to an in-kernel XICS device.
> +
> +6.8 KVM_CAP_PPC_MULTITCE
> +
> +Architectures: ppc
> +Parameters: none
> +Returns: 0 on success; -1 on error
> +
> +This capability enables the guest to put/remove multiple TCE entries
> +per hypercall, which significantly accelerates DMA operations for PPC KVM
> +guests.
> +
> +When this capability is enabled, H_PUT_TCE_INDIRECT and H_STUFF_TCE are
> +expected to be used rather than H_PUT_TCE, which supports only one TCE entry
> +per call.
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index af326cd..85d8f26 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
> spinlock_t tbacct_lock;
> u64 busy_stolen;
> u64 busy_preempt;
> +
> + unsigned long *tce_tmp; /* TCE cache for H_PUT_TCE_INDIRECT hcall */
> #endif
> };
>
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index a5287fe..e852921b 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>
> extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> struct kvm_create_spapr_tce *args);
> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> - unsigned long ioba, unsigned long tce);
> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
> + struct kvm_vcpu *vcpu, unsigned long liobn);
> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
> + unsigned long ioba, unsigned long tce);
> +extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce);
> +extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce_list, unsigned long npages);
> +extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce_value, unsigned long npages);
> extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
> struct kvm_allocate_rma *rma);
> extern struct kvmppc_linear_info *kvm_alloc_rma(void);
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index b2d3f3b..06b7b20 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -14,6 +14,7 @@
> *
> * Copyright 2010 Paul Mackerras, IBM Corp. <[email protected]>
> * Copyright 2011 David Gibson, IBM Corporation <[email protected]>
> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <[email protected]>
> */
>
> #include <linux/types.h>
> @@ -36,8 +37,11 @@
> #include <asm/ppc-opcode.h>
> #include <asm/kvm_host.h>
> #include <asm/udbg.h>
> +#include <asm/iommu.h>
> +#include <asm/tce.h>
>
> #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
> +#define ERROR_ADDR ((void *)~(unsigned long)0x0)
>
> static long kvmppc_stt_npages(unsigned long window_size)
> {
> @@ -148,3 +152,117 @@ fail:
> }
> return ret;
> }
> +
> +/* Converts guest physical address into host virtual */
> +static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
> + unsigned long gpa)
> +{
> + unsigned long hva, gfn = gpa >> PAGE_SHIFT;
> + struct kvm_memory_slot *memslot;
> +
> + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> + if (!memslot)
> + return ERROR_ADDR;
> +
> + hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
> + return (void *) hva;
> +}
> +
> +long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce)
> +{
> + long ret;
> + struct kvmppc_spapr_tce_table *tt;
> +
> + tt = kvmppc_find_tce_table(vcpu, liobn);
> + /* Didn't find the liobn, punt it to userspace */
> + if (!tt)
> + return H_TOO_HARD;
> +
> + /* Emulated IO */
> + if (ioba >= tt->window_size)
> + return H_PARAMETER;
> +
> + ret = kvmppc_emulated_validate_tce(tce);
> + if (ret)
> + return ret;
> +
> + kvmppc_emulated_put_tce(tt, ioba, tce);
> +
> + return H_SUCCESS;
> +}
> +
> +long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce_list, unsigned long npages)
> +{
> + struct kvmppc_spapr_tce_table *tt;
> + long i, ret;
> + unsigned long __user *tces;
> +
> + tt = kvmppc_find_tce_table(vcpu, liobn);
> + /* Didn't find the liobn, punt it to userspace */
> + if (!tt)
> + return H_TOO_HARD;
> +
> + /*
> + * The spec says that the maximum size of the list is 512 TCEs,
> + * so the whole list fits within a single 4K page
> + */
> + if (npages > 512)
> + return H_PARAMETER;
> +
> + if (tce_list & ~IOMMU_PAGE_MASK)
> + return H_PARAMETER;
> +
> + tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
> + if (tces == ERROR_ADDR)
> + return H_TOO_HARD;
> +
> + /* Emulated IO */
> + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> + return H_PARAMETER;
> +
> + for (i = 0; i < npages; ++i) {
> + if (get_user(vcpu->arch.tce_tmp[i], tces + i))
> + return H_PARAMETER;
> +
> + ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp[i]);
> + if (ret)
> + return ret;
> + }
> +
> + for (i = 0; i < npages; ++i)
> + kvmppc_emulated_put_tce(tt,
> + ioba + (i << IOMMU_PAGE_SHIFT),
> + vcpu->arch.tce_tmp[i]);
> +
> + return H_SUCCESS;
> +}
> +
> +long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce_value, unsigned long npages)
> +{
> + struct kvmppc_spapr_tce_table *tt;
> + long i, ret;
> +
> + tt = kvmppc_find_tce_table(vcpu, liobn);
> + /* Didn't find the liobn, punt it to userspace */
> + if (!tt)
> + return H_TOO_HARD;
> +
> + /* Emulated IO */
> + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> + return H_PARAMETER;
> +
> + ret = kvmppc_emulated_validate_tce(tce_value);
> + if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
> + return H_PARAMETER;
> +
> + for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
> + kvmppc_emulated_put_tce(tt, ioba, tce_value);
> +
> + return H_SUCCESS;
> +}
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 30c2f3b..c68d538 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -14,6 +14,7 @@
> *
> * Copyright 2010 Paul Mackerras, IBM Corp. <[email protected]>
> * Copyright 2011 David Gibson, IBM Corporation <[email protected]>
> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <[email protected]>
> */
>
> #include <linux/types.h>
> @@ -35,42 +36,249 @@
> #include <asm/ppc-opcode.h>
> #include <asm/kvm_host.h>
> #include <asm/udbg.h>
> +#include <asm/iommu.h>
> +#include <asm/tce.h>
>
> #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
> +#define ERROR_ADDR (~(unsigned long)0x0)
>
> -/* WARNING: This will be called in real-mode on HV KVM and virtual
> - * mode on PR KVM
> +/* Finds a TCE table descriptor by LIOBN */
> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
> + unsigned long liobn)
> +{
> + struct kvmppc_spapr_tce_table *tt;
> +
> + list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
> + if (tt->liobn == liobn)
> + return tt;
> + }
> +
> + return NULL;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
> +
> +/*
> + * Validate TCE address.
> + * At the moment only the flags are validated,
> + * as other checks would significantly slow down,
> + * or even make it impossible to handle,
> + * TCE requests in real mode.
> + */
> +long kvmppc_emulated_validate_tce(unsigned long tce)
> +{
> + if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
> + return H_PARAMETER;
> +
> + return H_SUCCESS;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
> +
> +/*
> + * kvmppc_emulated_put_tce() handles TCE requests for devices emulated
> + * by QEMU. It puts guest TCE values into the table and expects
> + * QEMU to convert them later in the QEMU device implementation.
> + * Works in both real and virtual modes.
> + * It cannot fail, so kvmppc_emulated_validate_tce() must be called before it.
> */
> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
> + unsigned long ioba, unsigned long tce)
> +{
> + unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> + struct page *page;
> + u64 *tbl;
> +
> + /*
> + * Note on the use of page_address() in real mode,
> + *
> + * It is safe to use page_address() in real mode on ppc64 because
> + * page_address() is always defined as lowmem_page_address()
> + * which returns __va(PFN_PHYS(page_to_pfn(page))), which is an arithmetic
> + * operation and does not access the page struct.
> + *
> + * Theoretically page_address() could be defined differently,
> + * but only if either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
> + * is enabled.
> + * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
> + * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
> + * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
> + * is not expected to be enabled on ppc32, page_address()
> + * is safe for ppc32 as well.
> + */
> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
> +#error TODO: fix to avoid page_address() here
> +#endif
> + page = tt->pages[idx / TCES_PER_PAGE];
> + tbl = (u64 *)page_address(page);
> +
> + /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
> + tbl[idx % TCES_PER_PAGE] = tce;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
> +
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +
> +static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
> + unsigned long *pte_sizep)
> +{
> + pte_t *ptep;
> + unsigned int shift = 0;
> + pte_t pte, tmp, ret;
> +
> + ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift);
> + if (!ptep)
> + return __pte(0);
> + if (shift)
> + *pte_sizep = 1ul << shift;
> + else
> + *pte_sizep = PAGE_SIZE;
> +
> + if (!pte_present(*ptep))
> + return __pte(0);
> +
> + /* wait until _PAGE_BUSY is clear then set it atomically */
> + __asm__ __volatile__ (
> + "1: ldarx %0,0,%3\n"
> + " andi. %1,%0,%4\n"
> + " bne- 1b\n"
> + " ori %1,%0,%4\n"
> + " stdcx. %1,0,%3\n"
> + " bne- 1b"
> + : "=&r" (pte), "=&r" (tmp), "=m" (*ptep)
> + : "r" (ptep), "i" (_PAGE_BUSY)
> + : "cc");
> +
> + ret = pte;
> +
> + return ret;
> +}
> +
> +/*
> + * Converts guest physical address into host physical address.
> + * Also returns pte and page size if the page is present in page table.
> + */
> +static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
> + unsigned long gpa)
> +{
> + struct kvm_memory_slot *memslot;
> + pte_t pte;
> + unsigned long hva, hpa, pg_size = 0, offset;
> + unsigned long gfn = gpa >> PAGE_SHIFT;
> + bool writing = gpa & TCE_PCI_WRITE;
> +
> + /* Find a KVM memslot */
> + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> + if (!memslot)
> + return ERROR_ADDR;
> +
> + /* Convert guest physical address to host virtual */
> + hva = __gfn_to_hva_memslot(memslot, gfn);
> +
> + /* Find a PTE and determine the size */
> + pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva,
> + writing, &pg_size);
> + if (!pte)
> + return ERROR_ADDR;
> +
> + /* Calculate host phys address keeping flags and offset in the page */
> + offset = gpa & (pg_size - 1);
> +
> + /* pte_pfn(pte) should return an address aligned to pg_size */
> + hpa = (pte_pfn(pte) << PAGE_SHIFT) + offset;
> +
> + return hpa;
> +}
> +
> long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> unsigned long ioba, unsigned long tce)
> {
> - struct kvm *kvm = vcpu->kvm;
> - struct kvmppc_spapr_tce_table *stt;
> -
> - /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> - /* liobn, ioba, tce); */
> -
> - list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
> - if (stt->liobn == liobn) {
> - unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> - struct page *page;
> - u64 *tbl;
> -
> - /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p window_size=0x%x\n", */
> - /* liobn, stt, stt->window_size); */
> - if (ioba >= stt->window_size)
> - return H_PARAMETER;
> -
> - page = stt->pages[idx / TCES_PER_PAGE];
> - tbl = (u64 *)page_address(page);
> -
> - /* FIXME: Need to validate the TCE itself */
> - /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
> - tbl[idx % TCES_PER_PAGE] = tce;
> - return H_SUCCESS;
> - }
> + long ret;
> + struct kvmppc_spapr_tce_table *tt;
> +
> + tt = kvmppc_find_tce_table(vcpu, liobn);
> + /* Didn't find the liobn, punt it to virtual mode */
> + if (!tt)
> + return H_TOO_HARD;
> +
> + /* Emulated IO */
> + if (ioba >= tt->window_size)
> + return H_PARAMETER;
> +
> + ret = kvmppc_emulated_validate_tce(tce);
> + if (ret)
> + return ret;
> +
> + kvmppc_emulated_put_tce(tt, ioba, tce);
> +
> + return H_SUCCESS;
> +}
> +
> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce_list, unsigned long npages)
> +{
> + struct kvmppc_spapr_tce_table *tt;
> + long i, ret;
> + unsigned long *tces;
> +
> + tt = kvmppc_find_tce_table(vcpu, liobn);
> + /* Didn't find the liobn, punt it to virtual mode */
> + if (!tt)
> + return H_TOO_HARD;
> +
> + /*
> + * The spec says that the maximum size of the list is 512 TCEs,
> + * so the whole list fits within a single 4K page
> + */
> + if (npages > 512)
> + return H_PARAMETER;
> +
> + if (tce_list & ~IOMMU_PAGE_MASK)
> + return H_PARAMETER;
> +
> + tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
> + if ((unsigned long)tces == ERROR_ADDR)
> + return H_TOO_HARD;
> +
> + /* Emulated IO */
> + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> + return H_PARAMETER;
> +
> + for (i = 0; i < npages; ++i) {
> + ret = kvmppc_emulated_validate_tce(tces[i]);
> + if (ret)
> + return ret;
> }
>
> - /* Didn't find the liobn, punt it to userspace */
> - return H_TOO_HARD;
> + for (i = 0; i < npages; ++i)
> + kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
> + tces[i]);
> +
> + return H_SUCCESS;
> +}
> +
> +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce_value, unsigned long npages)
> +{
> + struct kvmppc_spapr_tce_table *tt;
> + long i, ret;
> +
> + tt = kvmppc_find_tce_table(vcpu, liobn);
> + /* Didn't find the liobn, punt it to virtual mode */
> + if (!tt)
> + return H_TOO_HARD;
> +
> + /* Emulated IO */
> + if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> + return H_PARAMETER;
> +
> + ret = kvmppc_emulated_validate_tce(tce_value);
> + if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
> + return H_PARAMETER;
> +
> + for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
> + kvmppc_emulated_put_tce(tt, ioba, tce_value);
> +
> + return H_SUCCESS;
> }
> +#endif /* CONFIG_KVM_BOOK3S_64_HV */
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 9de24f8..9d70fd8 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -566,6 +566,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
> ret = kvmppc_xics_hcall(vcpu, req);
> break;
> } /* fallthrough */
> + return RESUME_HOST;
> + case H_PUT_TCE:
> + ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> + kvmppc_get_gpr(vcpu, 5),
> + kvmppc_get_gpr(vcpu, 6));
> + if (ret == H_TOO_HARD)
> + return RESUME_HOST;
> + break;
> + case H_PUT_TCE_INDIRECT:
> + ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
> + kvmppc_get_gpr(vcpu, 5),
> + kvmppc_get_gpr(vcpu, 6),
> + kvmppc_get_gpr(vcpu, 7));
> + if (ret == H_TOO_HARD)
> + return RESUME_HOST;
> + break;
> + case H_STUFF_TCE:
> + ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> + kvmppc_get_gpr(vcpu, 5),
> + kvmppc_get_gpr(vcpu, 6),
> + kvmppc_get_gpr(vcpu, 7));
> + if (ret == H_TOO_HARD)
> + return RESUME_HOST;
> + break;
> default:
> return RESUME_HOST;
> }
> @@ -956,6 +980,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
> vcpu->arch.cpu_type = KVM_CPU_3S_64;
> kvmppc_sanity_check(vcpu);
>
> + /*
> + * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
> + * half executed, we first read TCEs from the user, check them and
> + * return an error if something went wrong, and only then put TCEs into
> + * the TCE table.
> + *
> + * tce_tmp is a cache for TCEs to avoid stack allocation or
> + * kmalloc, as the whole TCE list can be up to 512 items of 8 bytes
> + * each (4096 bytes).
> + */
> + vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
> + if (!vcpu->arch.tce_tmp)
> + goto free_vcpu;
> +
> return vcpu;
>
> free_vcpu:
> @@ -978,6 +1016,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
> unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
> unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
> spin_unlock(&vcpu->arch.vpa_update_lock);
> + kfree(vcpu->arch.tce_tmp);
> kvm_vcpu_uninit(vcpu);
> kmem_cache_free(kvm_vcpu_cache, vcpu);
> }
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b02f91e..d35554e 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1490,6 +1490,12 @@ hcall_real_table:
> .long 0 /* 0x11c */
> .long 0 /* 0x120 */
> .long .kvmppc_h_bulk_remove - hcall_real_table
> + .long 0 /* 0x128 */
> + .long 0 /* 0x12c */
> + .long 0 /* 0x130 */
> + .long 0 /* 0x134 */
> + .long .kvmppc_h_stuff_tce - hcall_real_table
> + .long .kvmppc_h_put_tce_indirect - hcall_real_table
> hcall_real_table_end:
>
> ignore_hdec:
> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
> index b24309c..3750e3c 100644
> --- a/arch/powerpc/kvm/book3s_pr_papr.c
> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
> unsigned long tce = kvmppc_get_gpr(vcpu, 6);
> long rc;
>
> - rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
> + rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
> + if (rc == H_TOO_HARD)
> + return EMULATE_FAIL;
> + kvmppc_set_gpr(vcpu, 3, rc);
> + return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
> +{
> + unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> + unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> + unsigned long tce = kvmppc_get_gpr(vcpu, 6);
> + unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> + long rc;
> +
> + rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
> + tce, npages);
> + if (rc == H_TOO_HARD)
> + return EMULATE_FAIL;
> + kvmppc_set_gpr(vcpu, 3, rc);
> + return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
> +{
> + unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> + unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> + unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
> + unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> + long rc;
> +
> + rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
> if (rc == H_TOO_HARD)
> return EMULATE_FAIL;
> kvmppc_set_gpr(vcpu, 3, rc);
> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
> return kvmppc_h_pr_bulk_remove(vcpu);
> case H_PUT_TCE:
> return kvmppc_h_pr_put_tce(vcpu);
> + case H_PUT_TCE_INDIRECT:
> + return kvmppc_h_pr_put_tce_indirect(vcpu);
> + case H_STUFF_TCE:
> + return kvmppc_h_pr_stuff_tce(vcpu);
> case H_CEDE:
> vcpu->arch.shared->msr |= MSR_EE;
> kvm_vcpu_block(vcpu);
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 6316ee3..8465c2a 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
> r = 1;
> break;
> #endif
> + case KVM_CAP_SPAPR_MULTITCE:
> + r = 1;
> + break;
> default:
> r = 0;
> break;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a5c86fc..5a2afda 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info {
> #define KVM_CAP_IRQ_MPIC 90
> #define KVM_CAP_PPC_RTAS 91
> #define KVM_CAP_IRQ_XICS 92
> +#define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
>
Il 25/05/2013 04:45, David Gibson ha scritto:
>> >+ case KVM_CREATE_SPAPR_TCE_IOMMU: {
>> >+ struct kvm_create_spapr_tce_iommu create_tce_iommu;
>> >+ struct kvm *kvm = filp->private_data;
>> >+
>> >+ r = -EFAULT;
>> >+ if (copy_from_user(&create_tce_iommu, argp,
>> >+ sizeof(create_tce_iommu)))
>> >+ goto out;
>> >+ r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
>> >&create_tce_iommu);
>> >+ goto out;
>> >+ }
Would it make sense to make this the only interface for creating TCEs?
That is, pass both a window_size and an IOMMU group id (or e.g. -1 for
no hardware IOMMU usage), and have a single ioctl for both cases?
There's some duplicated code between kvm_vm_ioctl_create_spapr_tce and
kvm_vm_ioctl_create_spapr_tce_iommu.
KVM_CREATE_SPAPR_TCE could stay for backwards-compatibility, or you
could just use a new capability and drop the old ioctl. I'm not sure
whether you're already considering the ABI to be stable for kvmppc.
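As a rough sketch of that unified interface (hypothetical; the struct
and field names are invented here, with the existing struct
kvm_create_spapr_tce as the starting point):

	/* hypothetical unified argument structure */
	struct kvm_create_spapr_tce_unified {
		__u64 liobn;
		__u32 window_size;
		__s32 iommu_group;	/* -1: no host IOMMU, emulated only */
	};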
Paolo
On 05/27/2013 08:23 PM, Paolo Bonzini wrote:
> Il 25/05/2013 04:45, David Gibson ha scritto:
>>>> + case KVM_CREATE_SPAPR_TCE_IOMMU: {
>>>> + struct kvm_create_spapr_tce_iommu create_tce_iommu;
>>>> + struct kvm *kvm = filp->private_data;
>>>> +
>>>> + r = -EFAULT;
>>>> + if (copy_from_user(&create_tce_iommu, argp,
>>>> + sizeof(create_tce_iommu)))
>>>> + goto out;
>>>> + r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
>>>> &create_tce_iommu);
>>>> + goto out;
>>>> + }
>
> Would it make sense to make this the only interface for creating TCEs?
> That is, pass both a window_size and an IOMMU group id (or e.g. -1 for
> no hardware IOMMU usage), and have a single ioctl for both cases?
> There's some duplicated code between kvm_vm_ioctl_create_spapr_tce and
> kvm_vm_ioctl_create_spapr_tce_iommu.
Just a few bits. Is there really much sense in making one function out of
those two? I tried, and it looked a bit messy.
> KVM_CREATE_SPAPR_TCE could stay for backwards-compatibility, or you
> could just use a new capability and drop the old ioctl.
The old capability+ioctl have existed for quite a while, and a few QEMU
versions supporting them have been released, so we do not want to just drop
them. What then is the benefit of a new interface supporting both types?
> I'm not sure
> whether you're already considering the ABI to be stable for kvmppc.
Is any bit of KVM using it? I cannot see it in Documentation/ABI.
--
Alexey
On 05/27/2013 08:08 PM, Paolo Bonzini wrote:
> Il 21/05/2013 05:06, Alexey Kardashevskiy ha scritto:
>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>> H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
>> devices or emulated PCI.
>
> Do you mean vio? virtio (without getting into whether that's good or
> bad) will bypass the iommu.
Yes, my mistake, it should be VIO. Thanks!
--
Alexey
Il 27/05/2013 16:26, Alexey Kardashevskiy ha scritto:
> On 05/27/2013 08:23 PM, Paolo Bonzini wrote:
>> Il 25/05/2013 04:45, David Gibson ha scritto:
>>>>> + case KVM_CREATE_SPAPR_TCE_IOMMU: {
>>>>> + struct kvm_create_spapr_tce_iommu create_tce_iommu;
>>>>> + struct kvm *kvm = filp->private_data;
>>>>> +
>>>>> + r = -EFAULT;
>>>>> + if (copy_from_user(&create_tce_iommu, argp,
>>>>> + sizeof(create_tce_iommu)))
>>>>> + goto out;
>>>>> + r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm,
>>>>> &create_tce_iommu);
>>>>> + goto out;
>>>>> + }
>>
>> Would it make sense to make this the only interface for creating TCEs?
>> That is, pass both a window_size and an IOMMU group id (or e.g. -1 for
>> no hardware IOMMU usage), and have a single ioctl for both cases?
>> There's some duplicated code between kvm_vm_ioctl_create_spapr_tce and
>> kvm_vm_ioctl_create_spapr_tce_iommu.
>
> Just few bits. Is there really much sense in making one function from those
> two? I tried, looked a bit messy.
Cannot really tell without the userspace bits. But ioctl proliferation
is what the device and one_reg APIs were supposed to avoid...
>> KVM_CREATE_SPAPR_TCE could stay for backwards-compatibility, or you
>> could just use a new capability and drop the old ioctl.
>
> The old capability+ioctl already exist for quite a while and few QEMU
> versions supporting it were released so we do not want just drop it. So
> then what is the benefit of having a new interface with support of both types?
>
>> I'm not sure
>> whether you're already considering the ABI to be stable for kvmppc.
>
> Is any bit of KVM using it? Cannot see from Documentation/ABI.
I mean the userspace ABI (ioctls).
Paolo
On 05/24/2013 09:45:24 PM, David Gibson wrote:
> On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
> > On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
> > >diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> > >index 8465c2a..da6bf61 100644
> > >--- a/arch/powerpc/kvm/powerpc.c
> > >+++ b/arch/powerpc/kvm/powerpc.c
> > >@@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > break;
> > > #endif
> > > case KVM_CAP_SPAPR_MULTITCE:
> > >+ case KVM_CAP_SPAPR_TCE_IOMMU:
> > > r = 1;
> > > break;
> > > default:
> >
> > Don't advertise SPAPR capabilities if it's not book3s -- and
> > probably there's some additional limitation that would be
> > appropriate.
>
> So, in the case of MULTITCE, that's not quite right. PR KVM can
> emulate a PAPR system on a BookE machine, and there's no reason not to
> allow TCE acceleration as well.
That might (or might not; consider things like Altivec versus SPE
opcode conflict, whether unimplemented SPRs trap, behavior of
unprivileged SPRs/instructions, etc) be true in theory, but it's not
currently a supported configuration. BookE KVM does not support
emulating a different CPU than the host. In the unlikely case that
ever changes to the point of allowing PAPR guests on a BookE host, then
we can revisit how to properly determine whether the capability is
supported, but for now the capability will never be valid in the
CONFIG_BOOKE case (though I'd rather see it depend on an appropriate
book3s symbol than depend on !BOOKE).
Or we could just leave it as is, and let it indicate whether the host
kernel supports the feature in general, with the user needing to
understand when it's applicable... I'm a bit confused by the
documentation, however -- the MULTITCE capability was documented in the
"capabilities that can be enabled" section, but I don't see where it
can be enabled.
-Scott
On 05/26/2013 09:44:24 PM, Alexey Kardashevskiy wrote:
> On 05/25/2013 12:45 PM, David Gibson wrote:
> > On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
> >> On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
> >>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >>> index 8465c2a..da6bf61 100644
> >>> --- a/arch/powerpc/kvm/powerpc.c
> >>> +++ b/arch/powerpc/kvm/powerpc.c
> >>> @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> >>> break;
> >>> #endif
> >>> case KVM_CAP_SPAPR_MULTITCE:
> >>> + case KVM_CAP_SPAPR_TCE_IOMMU:
> >>> r = 1;
> >>> break;
> >>> default:
> >>
> >> Don't advertise SPAPR capabilities if it's not book3s -- and
> >> probably there's some additional limitation that would be
> >> appropriate.
> >
> > So, in the case of MULTITCE, that's not quite right. PR KVM can
> > emulate a PAPR system on a BookE machine, and there's no reason not
> to
> > allow TCE acceleration as well. We can't make it dependent on PAPR
> > mode being selected, because that's enabled per-vcpu, whereas these
> > capabilities are queried on the VM before the vcpus are created.
> >
> > CAP_SPAPR_TCE_IOMMU should be dependent on the presence of suitable
> > host side hardware (i.e. a PAPR style IOMMU), though.
>
>
> The capability says that the ioctl is supported. If there is no IOMMU
> group
> registered, than it will fail with a reasonable error and nobody gets
> hurt.
> What is the problem?
You could say that about a lot of the capabilities that just advertise
the existence of new ioctls. :-)
Sometimes it's nice to know in advance whether it's supported, before
actually requesting that something happen.
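For instance, userspace can probe up front; a hypothetical sketch
(KVM_CHECK_EXTENSION is the standard probing ioctl, kvm_fd is assumed
to be an open /dev/kvm file descriptor):

	int r = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_TCE_IOMMU);

	if (r <= 0)
		use_emulated_tce_only();	/* hypothetical fallback */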
> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
> >>> #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct
> >>> kvm_device_attr)
> >>> #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct
> >>> kvm_device_attr)
> >>>
> >>> +/* ioctl for SPAPR TCE IOMMU */
> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xe4, struct
> >>> kvm_create_spapr_tce_iommu)
> >>
> >> Shouldn't this go under the vm ioctl section?
>
>
> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated
> devices) is
> in this section so I decided to keep them together. Wrong?
You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
KVM_CREATE_SPAPR_TCE_IOMMU?
-Scott
On 05/29/2013 03:45 AM, Scott Wood wrote:
> On 05/26/2013 09:44:24 PM, Alexey Kardashevskiy wrote:
>> On 05/25/2013 12:45 PM, David Gibson wrote:
>> > On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
>> >> On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
>> >>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> >>> index 8465c2a..da6bf61 100644
>> >>> --- a/arch/powerpc/kvm/powerpc.c
>> >>> +++ b/arch/powerpc/kvm/powerpc.c
>> >>> @@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>> >>> break;
>> >>> #endif
>> >>> case KVM_CAP_SPAPR_MULTITCE:
>> >>> + case KVM_CAP_SPAPR_TCE_IOMMU:
>> >>> r = 1;
>> >>> break;
>> >>> default:
>> >>
>> >> Don't advertise SPAPR capabilities if it's not book3s -- and
>> >> probably there's some additional limitation that would be
>> >> appropriate.
>> >
>> > So, in the case of MULTITCE, that's not quite right. PR KVM can
>> > emulate a PAPR system on a BookE machine, and there's no reason not to
>> > allow TCE acceleration as well. We can't make it dependent on PAPR
>> > mode being selected, because that's enabled per-vcpu, whereas these
>> > capabilities are queried on the VM before the vcpus are created.
>> >
>> > CAP_SPAPR_TCE_IOMMU should be dependent on the presence of suitable
>> > host side hardware (i.e. a PAPR style IOMMU), though.
>>
>>
>> The capability says that the ioctl is supported. If there is no IOMMU group
>> registered, than it will fail with a reasonable error and nobody gets hurt.
>> What is the problem?
>
> You could say that about a lot of the capabilities that just advertise the
> existence of new ioctls. :-)
>
> Sometimes it's nice to know in advance whether it's supported, before
> actually requesting that something happen.
Yes, that would be nice. There is just no quick way to know whether this
particular system supports IOMMU groups. I could add another helper to the
generic IOMMU code which would return the number of registered IOMMU groups,
but that seems a bit too much :)
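For the record, such a helper could be little more than a counter
bumped by the group allocator; a hypothetical sketch (the counter and
helper names are invented, this is not part of any posted series):

	/* drivers/iommu/iommu.c, hypothetical */
	static atomic_t iommu_groups_registered = ATOMIC_INIT(0);

	int iommu_group_registered_count(void)
	{
		return atomic_read(&iommu_groups_registered);
	}

	/*
	 * iommu_group_alloc() would atomic_inc() the counter and
	 * iommu_group_release() would atomic_dec() it.
	 */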
>> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
>> >>> #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct
>> >>> kvm_device_attr)
>> >>> #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct
>> >>> kvm_device_attr)
>> >>>
>> >>> +/* ioctl for SPAPR TCE IOMMU */
>> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xe4, struct
>> >>> kvm_create_spapr_tce_iommu)
>> >>
>> >> Shouldn't this go under the vm ioctl section?
>>
>>
>> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated devices) is
>> in this section so I decided to keep them together. Wrong?
>
> You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
> KVM_CREATE_SPAPR_TCE_IOMMU?
Yes.
--
Alexey
On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
> >> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
> >> >>> #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct
> >> >>> kvm_device_attr)
> >> >>> #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct
> >> >>> kvm_device_attr)
> >> >>>
> >> >>> +/* ioctl for SPAPR TCE IOMMU */
> >> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xe4, struct
> >> >>> kvm_create_spapr_tce_iommu)
> >> >>
> >> >> Shouldn't this go under the vm ioctl section?
> >>
> >>
> >> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated
> devices) is
> >> in this section so I decided to keep them together. Wrong?
> >
> > You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
> > KVM_CREATE_SPAPR_TCE_IOMMU?
>
> Yes.
Sigh. That's the same thing repeated. There's only one IOCTL.
Nothing is being "kept together".
-Scott
On 05/29/2013 09:35 AM, Scott Wood wrote:
> On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
>> >> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
>> >> >>> #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct
>> >> >>> kvm_device_attr)
>> >> >>> #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct
>> >> >>> kvm_device_attr)
>> >> >>>
>> >> >>> +/* ioctl for SPAPR TCE IOMMU */
>> >> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xe4, struct
>> >> >>> kvm_create_spapr_tce_iommu)
>> >> >>
>> >> >> Shouldn't this go under the vm ioctl section?
>> >>
>> >>
>> >> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated
>> devices) is
>> >> in this section so I decided to keep them together. Wrong?
>> >
>> > You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
>> > KVM_CREATE_SPAPR_TCE_IOMMU?
>>
>> Yes.
>
> Sigh. That's the same thing repeated. There's only one IOCTL. Nothing is
> being "kept together".
Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE.
--
Alexey
On 05/29/2013 02:32 AM, Scott Wood wrote:
> On 05/24/2013 09:45:24 PM, David Gibson wrote:
>> On Wed, May 22, 2013 at 04:06:57PM -0500, Scott Wood wrote:
>> > On 05/20/2013 10:06:46 PM, Alexey Kardashevskiy wrote:
>> > >diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> > >index 8465c2a..da6bf61 100644
>> > >--- a/arch/powerpc/kvm/powerpc.c
>> > >+++ b/arch/powerpc/kvm/powerpc.c
>> > >@@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>> > > break;
>> > > #endif
>> > > case KVM_CAP_SPAPR_MULTITCE:
>> > >+ case KVM_CAP_SPAPR_TCE_IOMMU:
>> > > r = 1;
>> > > break;
>> > > default:
>> >
>> > Don't advertise SPAPR capabilities if it's not book3s -- and
>> > probably there's some additional limitation that would be
>> > appropriate.
>>
>> So, in the case of MULTITCE, that's not quite right. PR KVM can
>> emulate a PAPR system on a BookE machine, and there's no reason not to
>> allow TCE acceleration as well.
>
> That might (or might not; consider things like Altivec versus SPE opcode
> conflict, whether unimplemented SPRs trap, behavior of unprivileged
> SPRs/instructions, etc) be true in theory, but it's not currently a
> supported configuration. BookE KVM does not support emulating a different
> CPU than the host. In the unlikely case that ever changes to the point of
> allowing PAPR guests on a BookE host, then we can revisit how to properly
> determine whether the capability is supported, but for now the capability
> will never be valid in the CONFIG_BOOKE case (though I'd rather see it
> depend on an appropriate book3s symbol than depend on !BOOKE).
>
> Or we could just leave it as is, and let it indicate whether the host
> kernel supports the feature in general, with the user needing to understand
> when it's applicable... I'm a bit confused by the documentation, however
> -- the MULTITCE capability was documented in the "capabilities that can be
> enabled" section, but I don't see where it can be enabled.
True, it cannot be enabled (though it could be a long time ago); it is
either supported or not. I'll fix the documentation. Thanks!
--
Alexey
On 05/28/2013 07:12:32 PM, Alexey Kardashevskiy wrote:
> On 05/29/2013 09:35 AM, Scott Wood wrote:
> > On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
> >> >> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
> >> >> >>> #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct
> >> >> >>> kvm_device_attr)
> >> >> >>> #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct
> >> >> >>> kvm_device_attr)
> >> >> >>>
> >> >> >>> +/* ioctl for SPAPR TCE IOMMU */
> >> >> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xe4,
> struct
> >> >> >>> kvm_create_spapr_tce_iommu)
> >> >> >>
> >> >> >> Shouldn't this go under the vm ioctl section?
> >> >>
> >> >>
> >> >> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated
> >> devices) is
> >> >> in this section so I decided to keep them together. Wrong?
> >> >
> >> > You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
> >> > KVM_CREATE_SPAPR_TCE_IOMMU?
> >>
> >> Yes.
> >
> > Sigh. That's the same thing repeated. There's only one IOCTL.
> Nothing is
> > being "kept together".
>
> Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE.
But you didn't put it in the same section as KVM_CREATE_SPAPR_TCE.
0xe0 begins a different section.
-Scott
On 05/30/2013 06:05 AM, Scott Wood wrote:
> On 05/28/2013 07:12:32 PM, Alexey Kardashevskiy wrote:
>> On 05/29/2013 09:35 AM, Scott Wood wrote:
>> > On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
>> >> >> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
>> >> >> >>> #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct
>> >> >> >>> kvm_device_attr)
>> >> >> >>> #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct
>> >> >> >>> kvm_device_attr)
>> >> >> >>>
>> >> >> >>> +/* ioctl for SPAPR TCE IOMMU */
>> >> >> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xe4, struct
>> >> >> >>> kvm_create_spapr_tce_iommu)
>> >> >> >>
>> >> >> >> Shouldn't this go under the vm ioctl section?
>> >> >>
>> >> >>
>> >> >> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated
>> >> devices) is
>> >> >> in this section so I decided to keep them together. Wrong?
>> >> >
>> >> > You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
>> >> > KVM_CREATE_SPAPR_TCE_IOMMU?
>> >>
>> >> Yes.
>> >
>> > Sigh. That's the same thing repeated. There's only one IOCTL.
>> Nothing is
>> > being "kept together".
>>
>> Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE.
>
> But you didn't put it in the same section as KVM_CREATE_SPAPR_TCE. 0xe0
> begins a different section.
It is not really obvious that there are sections, as no comment defines
them :) But yes, it makes sense to move it up a bit and change the code to 0xad.
--
Alexey
On 05/29/2013 06:10:33 PM, Alexey Kardashevskiy wrote:
> On 05/30/2013 06:05 AM, Scott Wood wrote:
> > On 05/28/2013 07:12:32 PM, Alexey Kardashevskiy wrote:
> >> On 05/29/2013 09:35 AM, Scott Wood wrote:
> >> > On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
> >> >> >> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
> >> >> >> >>> #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2,
> struct
> >> >> >> >>> kvm_device_attr)
> >> >> >> >>> #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3,
> struct
> >> >> >> >>> kvm_device_attr)
> >> >> >> >>>
> >> >> >> >>> +/* ioctl for SPAPR TCE IOMMU */
> >> >> >> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xe4,
> struct
> >> >> >> >>> kvm_create_spapr_tce_iommu)
> >> >> >> >>
> >> >> >> >> Shouldn't this go under the vm ioctl section?
> >> >> >>
> >> >> >>
> >> >> >> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for
> emulated
> >> >> devices) is
> >> >> >> in this section so I decided to keep them together. Wrong?
> >> >> >
> >> >> > You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
> >> >> > KVM_CREATE_SPAPR_TCE_IOMMU?
> >> >>
> >> >> Yes.
> >> >
> >> > Sigh. That's the same thing repeated. There's only one IOCTL.
> >> Nothing is
> >> > being "kept together".
> >>
> >> Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE.
> >
> > But you didn't put it in the same section as KVM_CREATE_SPAPR_TCE.
> 0xe0
> > begins a different section.
>
> It is not really obvious that there are sections as no comment defines
> those :)
There is a comment /* ioctls for fds returned by KVM_CREATE_DEVICE */
Putting KVM_CREATE_DEVICE in there was mainly to avoid dealing with the
ioctl number conflict mess in the vm-ioctl section, but at least that
one is related to the device control API. :-)
> But yes, makes sense to move it up a bit and change the code to 0xad.
0xad is KVM_KVMCLOCK_CTRL
-Scott
On 05/30/2013 09:14 AM, Scott Wood wrote:
> On 05/29/2013 06:10:33 PM, Alexey Kardashevskiy wrote:
>> On 05/30/2013 06:05 AM, Scott Wood wrote:
>> > On 05/28/2013 07:12:32 PM, Alexey Kardashevskiy wrote:
>> >> On 05/29/2013 09:35 AM, Scott Wood wrote:
>> >> > On 05/28/2013 06:30:40 PM, Alexey Kardashevskiy wrote:
>> >> >> >> >>> @@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
>> >> >> >> >>> #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct
>> >> >> >> >>> kvm_device_attr)
>> >> >> >> >>> #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct
>> >> >> >> >>> kvm_device_attr)
>> >> >> >> >>>
>> >> >> >> >>> +/* ioctl for SPAPR TCE IOMMU */
>> >> >> >> >>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xe4, struct
>> >> >> >> >>> kvm_create_spapr_tce_iommu)
>> >> >> >> >>
>> >> >> >> >> Shouldn't this go under the vm ioctl section?
>> >> >> >>
>> >> >> >>
>> >> >> >> The KVM_CREATE_SPAPR_TCE_IOMMU ioctl (the version for emulated
>> >> >> devices) is
>> >> >> >> in this section so I decided to keep them together. Wrong?
>> >> >> >
>> >> >> > You decided to keep KVM_CREATE_SPAPR_TCE_IOMMU together with
>> >> >> > KVM_CREATE_SPAPR_TCE_IOMMU?
>> >> >>
>> >> >> Yes.
>> >> >
>> >> > Sigh. That's the same thing repeated. There's only one IOCTL.
>> >> Nothing is
>> >> > being "kept together".
>> >>
>> >> Sorry, I meant this ioctl - KVM_CREATE_SPAPR_TCE.
>> >
>> > But you didn't put it in the same section as KVM_CREATE_SPAPR_TCE. 0xe0
>> > begins a different section.
>>
>> It is not really obvious that there are sections as no comment defines
>> those :)
>
> There is a comment /* ioctls for fds returned by KVM_CREATE_DEVICE */
>
> Putting KVM_CREATE_DEVICE in there was mainly to avoid dealing with the
> ioctl number conflict mess in the vm-ioctl section, but at least that one
> is related to the device control API. :-)
>
>> But yes, makes sense to move it up a bit and change the code to 0xad.
>
> 0xad is KVM_KVMCLOCK_CTRL
That's it. I am _completely_ confused now. There is no system whatsoever :(
What rule should I use in order to choose the number for my new ioctl? :)
--
Alexey
On 05/29/2013 06:29:13 PM, Alexey Kardashevskiy wrote:
> On 05/30/2013 09:14 AM, Scott Wood wrote:
> > On 05/29/2013 06:10:33 PM, Alexey Kardashevskiy wrote:
> >> On 05/30/2013 06:05 AM, Scott Wood wrote:
> >> > But you didn't put it in the same section as
> KVM_CREATE_SPAPR_TCE. 0xe0
> >> > begins a different section.
> >>
> >> It is not really obvious that there are sections as no comment
> defines
> >> those :)
> >
> > There is a comment /* ioctls for fds returned by KVM_CREATE_DEVICE
> */
> >
> > Putting KVM_CREATE_DEVICE in there was mainly to avoid dealing with
> the
> > ioctl number conflict mess in the vm-ioctl section, but at least
> that one
> > is related to the device control API. :-)
> >
> >> But yes, makes sense to move it up a bit and change the code to
> 0xad.
> >
> > 0xad is KVM_KVMCLOCK_CTRL
>
> That's it. I am _completely_ confused now. No system whatsoever :(
> What rule should I use in order to choose the number for my new
> ioctl? :)
Yeah, it's a mess. 0xaf seems to be free. :-)
-Scott