2013-07-06 15:07:22

by Alexey Kardashevskiy

Subject: [PATCH 0/8 v5] KVM: PPC: IOMMU in-kernel handling

The changes are:
1. rebased on v3.10
2. added arch_spin_locks to protect TCE table in real mode
3. reworked VFIO external API
4. added missing bits for real mode handling of TCE requests on p7ioc

More details are in the individual patch comments.

Depends on "hashtable: add hash_for_each_possible_rcu_notrace()",
posted earlier today.

Alexey Kardashevskiy (8):
KVM: PPC: reserve a capability number for multitce support
KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO
vfio: add external user support
powerpc: Prepare to support kernel handling of IOMMU map/unmap
powerpc: add real mode support for dma operations on powernv
KVM: PPC: Add support for multiple-TCE hcalls
KVM: PPC: Add support for IOMMU in-kernel handling
KVM: PPC: Add hugepage support for IOMMU in-kernel handling

Documentation/virtual/kvm/api.txt | 51 +++
arch/powerpc/include/asm/iommu.h | 9 +-
arch/powerpc/include/asm/kvm_host.h | 37 ++
arch/powerpc/include/asm/kvm_ppc.h | 18 +-
arch/powerpc/include/asm/machdep.h | 12 +
arch/powerpc/include/asm/pgtable-ppc64.h | 4 +
arch/powerpc/include/uapi/asm/kvm.h | 7 +
arch/powerpc/kernel/iommu.c | 200 +++++++----
arch/powerpc/kvm/book3s_64_vio.c | 541 +++++++++++++++++++++++++++++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 404 ++++++++++++++++++++--
arch/powerpc/kvm/book3s_hv.c | 41 ++-
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
arch/powerpc/kvm/book3s_pr_papr.c | 37 +-
arch/powerpc/kvm/powerpc.c | 15 +
arch/powerpc/mm/init_64.c | 78 ++++-
arch/powerpc/platforms/powernv/pci-ioda.c | 26 +-
arch/powerpc/platforms/powernv/pci.c | 38 ++-
arch/powerpc/platforms/powernv/pci.h | 2 +-
drivers/vfio/vfio.c | 35 ++
include/linux/page-flags.h | 4 +-
include/linux/vfio.h | 7 +
include/uapi/linux/kvm.h | 3 +
22 files changed, 1453 insertions(+), 122 deletions(-)

--
1.8.3.2


2013-07-06 15:07:36

by Alexey Kardashevskiy

Subject: [PATCH 2/8] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO

This is to reserve a capability number and an ioctl number for the upcoming
support of VFIO-IOMMU DMA operations in real mode.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
include/uapi/linux/kvm.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 970b1f5..0865c01 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
#define KVM_CAP_PPC_RTAS 91
#define KVM_CAP_IRQ_XICS 92
#define KVM_CAP_SPAPR_MULTITCE 93
+#define KVM_CAP_SPAPR_TCE_IOMMU 94

#ifdef KVM_CAP_IRQ_ROUTING

@@ -923,6 +924,7 @@ struct kvm_s390_ucas_mapping {
/* Available with KVM_CAP_PPC_ALLOC_HTAB */
#define KVM_PPC_ALLOCATE_HTAB _IOWR(KVMIO, 0xa7, __u32)
#define KVM_CREATE_SPAPR_TCE _IOW(KVMIO, 0xa8, struct kvm_create_spapr_tce)
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xaf, struct kvm_create_spapr_tce_iommu)
/* Available with KVM_CAP_RMA */
#define KVM_ALLOCATE_RMA _IOR(KVMIO, 0xa9, struct kvm_allocate_rma)
/* Available with KVM_CAP_PPC_HTAB_FD */
--
1.8.3.2

2013-07-06 15:07:40

by Alexey Kardashevskiy

Subject: [PATCH 3/8] vfio: add external user support

VFIO is designed to be used via ioctls on file descriptors
returned by VFIO.

However, in some situations support for an external user is required.
The first user is KVM on PPC64 (SPAPR TCE protocol), which is going to
use the existing VFIO groups for exclusive access in real/virtual mode
on the host in order to avoid passing map/unmap requests to user space,
which would make things pretty slow.

The proposed protocol includes:

1. do normal VFIO init stuff such as opening a new container, attaching
group(s) to it, setting an IOMMU driver for a container. When IOMMU is
set for a container, all groups in it are considered ready to use by
an external user.

2. pass a fd of the group we want to accelerate to KVM. KVM calls
vfio_group_get_external_user() to verify that the group is initialized
and an IOMMU is set for it, and to increment the container user counter
to prevent the VFIO group from being disposed of prior to KVM exit.
The current TCE IOMMU driver marks the whole IOMMU table as busy when
an IOMMU is set for a container, which prevents other DMA users from
allocating from it, so it is safe to grant user space access to it.

3. KVM calls vfio_external_user_iommu_id() to obtain an IOMMU ID which
KVM uses to get an iommu_group struct for later use.

4. When KVM is finished, it calls vfio_group_put_external_user() to
release the VFIO group by decrementing the container user counter.
Everything gets released.

The "vfio: Limit group opens" patch is also required for the consistency.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index c488da5..57aa191 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1370,6 +1370,62 @@ static const struct file_operations vfio_device_fops = {
};

/**
+ * External user API, exported by symbols to be linked dynamically.
+ *
+ * The protocol includes:
+ * 1. do normal VFIO init operation:
+ * - opening a new container;
+ * - attaching group(s) to it;
+ * - setting an IOMMU driver for a container.
+ * When IOMMU is set for a container, all groups in it are
+ * considered ready to use by an external user.
+ *
+ * 2. The user space passed a group fd which we want to accelerate in
+ * KVM. KVM uses vfio_group_get_external_user() to verify that:
+ * - the group is initialized;
+ * - IOMMU is set for it.
+ * Then vfio_group_get_external_user() increments the container user
+ * counter to prevent the VFIO group from disposal prior to KVM exit.
+ *
+ * 3. KVM calls vfio_external_user_iommu_id() to know an IOMMU ID which
+ * KVM uses to get an iommu_group struct for later use.
+ *
+ * 4. When KVM is finished, it calls vfio_group_put_external_user() to
+ * release the VFIO group by decrementing the container user counter.
+ */
+struct vfio_group *vfio_group_get_external_user(struct file *filep)
+{
+ struct vfio_group *group = filep->private_data;
+
+ if (filep->f_op != &vfio_group_fops)
+ return NULL;
+
+ if (!atomic_inc_not_zero(&group->container_users))
+ return NULL;
+
+ if (!group->container->iommu_driver ||
+ !vfio_group_viable(group)) {
+ atomic_dec(&group->container_users);
+ return NULL;
+ }
+
+ return group;
+}
+EXPORT_SYMBOL_GPL(vfio_group_get_external_user);
+
+void vfio_group_put_external_user(struct vfio_group *group)
+{
+ vfio_group_try_dissolve_container(group);
+}
+EXPORT_SYMBOL_GPL(vfio_group_put_external_user);
+
+int vfio_external_user_iommu_id(struct vfio_group *group)
+{
+ return iommu_group_id(group->iommu_group);
+}
+EXPORT_SYMBOL_GPL(vfio_external_user_iommu_id);
+
+/**
* Module/class support
*/
static char *vfio_devnode(struct device *dev, umode_t *mode)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index ac8d488..24579a0 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -90,4 +90,11 @@ extern void vfio_unregister_iommu_driver(
TYPE tmp; \
offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); }) \

+/*
+ * External user API
+ */
+extern struct vfio_group *vfio_group_get_external_user(struct file *filep);
+extern void vfio_group_put_external_user(struct vfio_group *group);
+extern int vfio_external_user_iommu_id(struct vfio_group *group);
+
#endif /* VFIO_H */
--
1.8.3.2

2013-07-06 15:07:57

by Alexey Kardashevskiy

Subject: [PATCH 5/8] powerpc: add real mode support for dma operations on powernv

The existing TCE machine calls (tce_build and tce_free) only support
virtual mode as they call __raw_writeq for TCE invalidation, which
fails in real mode.

This introduces tce_build_rm and tce_free_rm real mode versions
which do mostly the same thing but use the "Store Doubleword Caching
Inhibited Indexed" (stdcix) instruction for TCE invalidation.

This new feature is going to be utilized by real mode support of VFIO.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/machdep.h | 12 ++++++++++
arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++++++++++++++------
arch/powerpc/platforms/powernv/pci.c | 38 ++++++++++++++++++++++++++-----
arch/powerpc/platforms/powernv/pci.h | 2 +-
4 files changed, 64 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 92386fc..0c19eef 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -75,6 +75,18 @@ struct machdep_calls {
long index);
void (*tce_flush)(struct iommu_table *tbl);

+ /* _rm versions are for real mode use only */
+ int (*tce_build_rm)(struct iommu_table *tbl,
+ long index,
+ long npages,
+ unsigned long uaddr,
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs);
+ void (*tce_free_rm)(struct iommu_table *tbl,
+ long index,
+ long npages);
+ void (*tce_flush_rm)(struct iommu_table *tbl);
+
void __iomem * (*ioremap)(phys_addr_t addr, unsigned long size,
unsigned long flags, void *caller);
void (*iounmap)(volatile void __iomem *token);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 2931d97..2797dec 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -68,6 +68,12 @@ define_pe_printk_level(pe_err, KERN_ERR);
define_pe_printk_level(pe_warn, KERN_WARNING);
define_pe_printk_level(pe_info, KERN_INFO);

+static inline void rm_writed(unsigned long paddr, u64 val)
+{
+ __asm__ __volatile__("sync; stdcix %0,0,%1"
+ : : "r" (val), "r" (paddr) : "memory");
+}
+
static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
{
unsigned long pe;
@@ -442,7 +448,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
}

static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
- u64 *startp, u64 *endp)
+ u64 *startp, u64 *endp, bool rm)
{
u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
unsigned long start, end, inc;
@@ -471,7 +477,10 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,

mb(); /* Ensure above stores are visible */
while (start <= end) {
- __raw_writeq(start, invalidate);
+ if (rm)
+ rm_writed((unsigned long) invalidate, start);
+ else
+ __raw_writeq(start, invalidate);
start += inc;
}

@@ -483,7 +492,7 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,

static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
struct iommu_table *tbl,
- u64 *startp, u64 *endp)
+ u64 *startp, u64 *endp, bool rm)
{
unsigned long start, end, inc;
u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
@@ -502,22 +511,25 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
mb();

while (start <= end) {
- __raw_writeq(start, invalidate);
+ if (rm)
+ rm_writed((unsigned long) invalidate, start);
+ else
+ __raw_writeq(start, invalidate);
start += inc;
}
}

void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
- u64 *startp, u64 *endp)
+ u64 *startp, u64 *endp, bool rm)
{
struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
tce32_table);
struct pnv_phb *phb = pe->phb;

if (phb->type == PNV_PHB_IODA1)
- pnv_pci_ioda1_tce_invalidate(tbl, startp, endp);
+ pnv_pci_ioda1_tce_invalidate(tbl, startp, endp, rm);
else
- pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp);
+ pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
}

static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index e16b729..280f614 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -336,7 +336,7 @@ struct pci_ops pnv_pci_ops = {

static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
unsigned long uaddr, enum dma_data_direction direction,
- struct dma_attrs *attrs)
+ struct dma_attrs *attrs, bool rm)
{
u64 proto_tce;
u64 *tcep, *tces;
@@ -358,12 +358,19 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
* of flags if that becomes the case
*/
if (tbl->it_type & TCE_PCI_SWINV_CREATE)
- pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1);
+ pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);

return 0;
}

-static void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
+static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long npages,
+ unsigned long uaddr, enum dma_data_direction direction,
+ struct dma_attrs *attrs)
+{
+ return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, false);
+}
+
+static void pnv_tce_free(struct iommu_table *tbl, long index, long npages, bool rm)
{
u64 *tcep, *tces;

@@ -373,7 +380,12 @@ static void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
*(tcep++) = 0;

if (tbl->it_type & TCE_PCI_SWINV_FREE)
- pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1);
+ pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
+}
+
+static void pnv_tce_free_vm(struct iommu_table *tbl, long index, long npages)
+{
+ pnv_tce_free(tbl, index, npages, false);
}

static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
@@ -381,6 +393,18 @@ static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
return ((u64 *)tbl->it_base)[index - tbl->it_offset];
}

+static int pnv_tce_build_rm(struct iommu_table *tbl, long index, long npages,
+ unsigned long uaddr, enum dma_data_direction direction,
+ struct dma_attrs *attrs)
+{
+ return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, true);
+}
+
+static void pnv_tce_free_rm(struct iommu_table *tbl, long index, long npages)
+{
+ pnv_tce_free(tbl, index, npages, true);
+}
+
void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
void *tce_mem, u64 tce_size,
u64 dma_offset)
@@ -545,8 +569,10 @@ void __init pnv_pci_init(void)

/* Configure IOMMU DMA hooks */
ppc_md.pci_dma_dev_setup = pnv_pci_dma_dev_setup;
- ppc_md.tce_build = pnv_tce_build;
- ppc_md.tce_free = pnv_tce_free;
+ ppc_md.tce_build = pnv_tce_build_vm;
+ ppc_md.tce_free = pnv_tce_free_vm;
+ ppc_md.tce_build_rm = pnv_tce_build_rm;
+ ppc_md.tce_free_rm = pnv_tce_free_rm;
ppc_md.tce_get = pnv_tce_get;
ppc_md.pci_probe_mode = pnv_pci_probe_mode;
set_pci_dma_ops(&dma_iommu_ops);
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 25d76c4..6799374 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -158,6 +158,6 @@ extern void pnv_pci_init_p5ioc2_hub(struct device_node *np);
extern void pnv_pci_init_ioda_hub(struct device_node *np);
extern void pnv_pci_init_ioda2_phb(struct device_node *np);
extern void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
- u64 *startp, u64 *endp);
+ u64 *startp, u64 *endp, bool rm);

#endif /* __POWERNV_PCI_H */
--
1.8.3.2

2013-07-06 15:08:09

by Alexey Kardashevskiy

Subject: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
devices or emulated PCI. These calls allow adding multiple entries
(up to 512) into the TCE table in one call, which saves time on
transitions to/from real mode.

This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user and verified) before writing the whole list into
the TCE table. This cache will be utilized more in the upcoming
VFIO/IOMMU support to continue TCE list processing in virtual
mode in case the real mode handler failed for some reason.

This adds a guest physical to host real address converter
and calls the existing H_PUT_TCE handler. The converter
is going to be fully utilized by the upcoming VFIO support patches.

This also implements the KVM_CAP_PPC_MULTITCE capability.
In order to support the functionality of this patch, QEMU
needs to query for this capability and set the "hcall-multi-tce"
hypertas property only if the capability is present; otherwise
there will be serious performance degradation.
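
A minimal sketch of such a check from user space (QEMU's actual code will
differ; vm_fd handling is omitted and have_multitce() is a made-up name):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Returns non-zero if the kernel handles H_PUT_TCE_INDIRECT/H_STUFF_TCE */
static int have_multitce(int vm_fd)
{
        return ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_MULTITCE) > 0;
}

Only when this returns non-zero should "hcall-multi-tce" be added to the
"ibm,hypertas-functions" property advertised to the guest.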

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>

---
Changelog:
2013/07/06:
* fixed number of wrong get_page()/put_page() calls

2013/06/27:
* fixed clear of BUSY bit in kvmppc_lookup_pte()
* H_PUT_TCE_INDIRECT does realmode_get_page() now
* KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
* updated doc

2013/06/05:
* fixed a typo about IBMVIO in the commit message
* updated doc and moved it to another section
* changed capability number

2013/05/21:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed; instead we do not even start
writing to the TCE table if we cannot get the TCEs from the user or they are
invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with fallthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Documentation/virtual/kvm/api.txt | 25 +++
arch/powerpc/include/asm/kvm_host.h | 9 ++
arch/powerpc/include/asm/kvm_ppc.h | 16 +-
arch/powerpc/kvm/book3s_64_vio.c | 154 ++++++++++++++++++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 260 ++++++++++++++++++++++++++++----
arch/powerpc/kvm/book3s_hv.c | 41 ++++-
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++-
arch/powerpc/kvm/powerpc.c | 3 +
9 files changed, 517 insertions(+), 34 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 6365fef..762c703 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed to userspace to be
handled.


+4.86 KVM_CAP_PPC_MULTITCE
+
+Capability: KVM_CAP_PPC_MULTITCE
+Architectures: ppc
+Type: vm
+
+This capability means the kernel is capable of handling hypercalls
+H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
+space. This significantly accelerates DMA operations for PPC KVM guests.
+The user space should expect that its handlers for these hypercalls
+are not going to be called.
+
+In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
+the user space might have to advertise it for the guest. For example,
+IBM pSeries guest starts using them if "hcall-multi-tce" is present in
+the "ibm,hypertas-functions" device-tree property.
+
+Without this capability, only H_PUT_TCE is handled by the kernel and
+therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
+unless the capability is present as passing hypercalls to the userspace
+slows operations a lot.
+
+Unlike other capabilities of this section, this one is always enabled.
+
+
5. The kvm_run structure
------------------------

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index af326cd..20d04bd 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
struct kvm *kvm;
u64 liobn;
u32 window_size;
+ struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
struct page *pages[0];
};

@@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
spinlock_t tbacct_lock;
u64 busy_stolen;
u64 busy_preempt;
+
+ unsigned long *tce_tmp_hpas; /* TCE cache for TCE_PUT_INDIRECT hcall */
+ enum {
+ TCERM_NONE,
+ TCERM_GETPAGE,
+ TCERM_PUTTCE,
+ TCERM_PUTLIST,
+ } tce_rm_fail; /* failed stage of request processing */
#endif
};

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index a5287fe..fa722a0 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);

extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
-extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
- unsigned long ioba, unsigned long tce);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
+ struct kvm_vcpu *vcpu, unsigned long liobn);
+extern long kvmppc_emulated_validate_tce(unsigned long tce);
+extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+ unsigned long ioba, unsigned long tce);
+extern long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce);
+extern long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_list, unsigned long npages);
+extern long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages);
extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
struct kvm_allocate_rma *rma);
extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index b2d3f3b..99bf4e5 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -14,6 +14,7 @@
*
* Copyright 2010 Paul Mackerras, IBM Corp. <[email protected]>
* Copyright 2011 David Gibson, IBM Corporation <[email protected]>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <[email protected]>
*/

#include <linux/types.h>
@@ -36,8 +37,10 @@
#include <asm/ppc-opcode.h>
#include <asm/kvm_host.h>
#include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>

-#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR ((void *)~(unsigned long)0x0)

static long kvmppc_stt_npages(unsigned long window_size)
{
@@ -50,6 +53,20 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
struct kvm *kvm = stt->kvm;
int i;

+#define __SV(x) stt->stat.x
+#define __SVD(x) (__SV(rm.x)?(__SV(rm.x)-__SV(vm.x)):0)
+ pr_debug("%s stat for liobn=%llx\n"
+ "--------------- realmode ----- virtmode ---\n"
+ "put_tce %10ld %10ld\n"
+ "put_tce_indir %10ld %10ld\n"
+ "stuff_tce %10ld %10ld\n",
+ __func__, stt->liobn,
+ __SVD(put), __SV(vm.put),
+ __SVD(indir), __SV(vm.indir),
+ __SVD(stuff), __SV(vm.stuff));
+#undef __SVD
+#undef __SV
+
mutex_lock(&kvm->lock);
list_del(&stt->list);
for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
@@ -148,3 +165,138 @@ fail:
}
return ret;
}
+
+/* Converts guest physical address to host virtual address */
+static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
+ unsigned long gpa, struct page **pg)
+{
+ unsigned long hva, gfn = gpa >> PAGE_SHIFT;
+ struct kvm_memory_slot *memslot;
+
+ memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+ if (!memslot)
+ return ERROR_ADDR;
+
+ hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+
+ if (get_user_pages_fast(hva & PAGE_MASK, 1, 0, pg) != 1)
+ return ERROR_ADDR;
+
+ return (void *) hva;
+}
+
+long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce)
+{
+ long ret;
+ struct kvmppc_spapr_tce_table *tt;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, put it to userspace */
+ if (!tt)
+ return H_TOO_HARD;
+
+ ++tt->stat.vm.put;
+
+ if (ioba >= tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_emulated_validate_tce(tce);
+ if (ret)
+ return ret;
+
+ kvmppc_emulated_put_tce(tt, ioba, tce);
+
+ return H_SUCCESS;
+}
+
+long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_list, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret = H_SUCCESS;
+ unsigned long __user *tces;
+ struct page *pg = NULL;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, put it to userspace */
+ if (!tt)
+ return H_TOO_HARD;
+
+ ++tt->stat.vm.indir;
+
+ /*
+ * The spec says that the maximum size of the list is 512 TCEs,
+ * so the whole table addressed resides in a 4K page
+ */
+ if (npages > 512)
+ return H_PARAMETER;
+
+ if (tce_list & ~IOMMU_PAGE_MASK)
+ return H_PARAMETER;
+
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list, &pg);
+ if (tces == ERROR_ADDR)
+ return H_TOO_HARD;
+
+ if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
+ goto put_list_page_exit;
+
+ for (i = 0; i < npages; ++i) {
+ if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
+ ret = H_PARAMETER;
+ goto put_list_page_exit;
+ }
+
+ ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp_hpas[i]);
+ if (ret)
+ goto put_list_page_exit;
+ }
+
+ for (i = 0; i < npages; ++i)
+ kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+ vcpu->arch.tce_tmp_hpas[i]);
+put_list_page_exit:
+ if (pg)
+ put_page(pg);
+
+ if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
+ vcpu->arch.tce_rm_fail = TCERM_NONE;
+ if (pg && !PageCompound(pg))
+ put_page(pg); /* finish pending realmode_put_page() */
+ }
+
+ return ret;
+}
+
+long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, put it to userspace */
+ if (!tt)
+ return H_TOO_HARD;
+
+ ++tt->stat.vm.stuff;
+
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_emulated_validate_tce(tce_value);
+ if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+ return H_PARAMETER;
+
+ for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+ kvmppc_emulated_put_tce(tt, ioba, tce_value);
+
+ return H_SUCCESS;
+}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 30c2f3b..cd3e6f9 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -14,6 +14,7 @@
*
* Copyright 2010 Paul Mackerras, IBM Corp. <[email protected]>
* Copyright 2011 David Gibson, IBM Corporation <[email protected]>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <[email protected]>
*/

#include <linux/types.h>
@@ -35,42 +36,243 @@
#include <asm/ppc-opcode.h>
#include <asm/kvm_host.h>
#include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>

#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR (~(unsigned long)0x0)

-/* WARNING: This will be called in real-mode on HV KVM and virtual
- * mode on PR KVM
+/*
+ * Finds a TCE table descriptor by LIOBN
*/
+struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
+ unsigned long liobn)
+{
+ struct kvmppc_spapr_tce_table *tt;
+
+ list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
+ if (tt->liobn == liobn)
+ return tt;
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
+
+#ifdef DEBUG
+/*
+ * Lets user mode disable realmode handlers by putting big number
+ * in the bottom value of LIOBN
+ */
+#define kvmppc_find_tce_table(a, b) \
+ ((((b)&0xffff)>10000)?NULL:kvmppc_find_tce_table((a), (b)))
+#endif
+
+/*
+ * Validates TCE address.
+ * At the moment only flags are validated as other checks will significantly slow
+ * down or can make it even impossible to handle TCE requests in real mode.
+ */
+long kvmppc_emulated_validate_tce(unsigned long tce)
+{
+ if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
+ return H_PARAMETER;
+
+ return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
+
+/*
+ * Handles TCE requests for QEMU emulated devices.
+ * Puts guest TCE values to the table and expects QEMU to convert them
+ * later in a QEMU device implementation.
+ * Called in both real and virtual modes.
+ * Cannot fail so kvmppc_emulated_validate_tce must be called before it.
+ */
+void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+ unsigned long ioba, unsigned long tce)
+{
+ unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+ struct page *page;
+ u64 *tbl;
+
+ /*
+ * Note on the use of page_address() in real mode,
+ *
+ * It is safe to use page_address() in real mode on ppc64 because
+ * page_address() is always defined as lowmem_page_address()
+ * which returns __va(PFN_PHYS(page_to_pfn(page))), which is an arithmetic
+ * operation and does not access the page struct.
+ *
+ * Theoretically page_address() could be defined different
+ * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
+ * should be enabled.
+ * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
+ * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
+ * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
+ * is not expected to be enabled on ppc32, page_address()
+ * is safe for ppc32 as well.
+ */
+#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
+#error TODO: fix to avoid page_address() here
+#endif
+ page = tt->pages[idx / TCES_PER_PAGE];
+ tbl = (u64 *)page_address(page);
+
+ /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
+ tbl[idx % TCES_PER_PAGE] = tce;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
+
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+/*
+ * Converts guest physical address to host physical address.
+ * Tries to increase page counter via realmode_get_page() and
+ * returns ERROR_ADDR if failed.
+ */
+static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
+ unsigned long gpa, struct page **pg)
+{
+ struct kvm_memory_slot *memslot;
+ pte_t *ptep, pte;
+ unsigned long hva, hpa = ERROR_ADDR;
+ unsigned long gfn = gpa >> PAGE_SHIFT;
+ unsigned shift = 0;
+
+ memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+ if (!memslot)
+ return ERROR_ADDR;
+
+ hva = __gfn_to_hva_memslot(memslot, gfn);
+
+ ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+ if (!ptep || !pte_present(*ptep))
+ return ERROR_ADDR;
+ pte = *ptep;
+
+ if (((gpa & TCE_PCI_WRITE) || pte_write(pte)) && !pte_dirty(pte))
+ return ERROR_ADDR;
+
+ if (!pte_young(pte))
+ return ERROR_ADDR;
+
+ if (!shift)
+ shift = PAGE_SHIFT;
+
+ /* Put huge pages handling to the virtual mode */
+ if (shift > PAGE_SHIFT)
+ return ERROR_ADDR;
+
+ *pg = realmode_pfn_to_page(pte_pfn(pte));
+ if (!*pg || realmode_get_page(*pg))
+ return ERROR_ADDR;
+
+ /* pte_pfn(pte) returns address aligned to pg_size */
+ hpa = (pte_pfn(pte) << PAGE_SHIFT) + (gpa & ((1 << shift) - 1));
+
+ if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+ hpa = ERROR_ADDR;
+ realmode_put_page(*pg);
+ *pg = NULL;
+ }
+
+ return hpa;
+}
+
long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce)
{
- struct kvm *kvm = vcpu->kvm;
- struct kvmppc_spapr_tce_table *stt;
-
- /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
- /* liobn, ioba, tce); */
-
- list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
- if (stt->liobn == liobn) {
- unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
- struct page *page;
- u64 *tbl;
-
- /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p window_size=0x%x\n", */
- /* liobn, stt, stt->window_size); */
- if (ioba >= stt->window_size)
- return H_PARAMETER;
-
- page = stt->pages[idx / TCES_PER_PAGE];
- tbl = (u64 *)page_address(page);
-
- /* FIXME: Need to validate the TCE itself */
- /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
- tbl[idx % TCES_PER_PAGE] = tce;
- return H_SUCCESS;
- }
+ long ret;
+ struct kvmppc_spapr_tce_table *tt = kvmppc_find_tce_table(vcpu, liobn);
+
+ if (!tt)
+ return H_TOO_HARD;
+
+ ++tt->stat.rm.put;
+
+ if (ioba >= tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_emulated_validate_tce(tce);
+ if (!ret)
+ kvmppc_emulated_put_tce(tt, ioba, tce);
+
+ return ret;
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_list, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt = kvmppc_find_tce_table(vcpu, liobn);
+ long i, ret = H_SUCCESS;
+ unsigned long tces;
+ struct page *pg = NULL;
+
+ if (!tt)
+ return H_TOO_HARD;
+
+ ++tt->stat.rm.indir;
+
+ /*
+ * The spec says that the maximum size of the list is 512 TCEs,
+ * so the whole table addressed resides in a 4K page
+ */
+ if (npages > 512)
+ return H_PARAMETER;
+
+ if (tce_list & ~IOMMU_PAGE_MASK)
+ return H_PARAMETER;
+
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ tces = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce_list, &pg);
+ if (tces == ERROR_ADDR)
+ return H_TOO_HARD;
+
+ for (i = 0; i < npages; ++i) {
+ ret = kvmppc_emulated_validate_tce(((unsigned long *)tces)[i]);
+ if (ret)
+ goto put_unlock_exit;
+ }
+
+ for (i = 0; i < npages; ++i)
+ kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+ ((unsigned long *)tces)[i]);
+
+put_unlock_exit:
+ if (!ret && pg && !PageCompound(pg) && realmode_put_page(pg)) {
+ vcpu->arch.tce_rm_fail = TCERM_PUTLIST;
+ ret = H_TOO_HARD;
}

- /* Didn't find the liobn, punt it to userspace */
- return H_TOO_HARD;
+ return ret;
+}
+
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ if (!tt)
+ return H_TOO_HARD;
+
+ ++tt->stat.rm.stuff;
+
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_emulated_validate_tce(tce_value);
+ if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+ return H_PARAMETER;
+
+ for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+ kvmppc_emulated_put_tce(tt, ioba, tce_value);
+
+ return H_SUCCESS;
}
+#endif /* CONFIG_KVM_BOOK3S_64_HV */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 550f592..ac41d01 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -567,7 +567,31 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
if (kvmppc_xics_enabled(vcpu)) {
ret = kvmppc_xics_hcall(vcpu, req);
break;
- } /* fallthrough */
+ }
+ return RESUME_HOST;
+ case H_PUT_TCE:
+ ret = kvmppc_vm_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+ kvmppc_get_gpr(vcpu, 5),
+ kvmppc_get_gpr(vcpu, 6));
+ if (ret == H_TOO_HARD)
+ return RESUME_HOST;
+ break;
+ case H_PUT_TCE_INDIRECT:
+ ret = kvmppc_vm_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
+ kvmppc_get_gpr(vcpu, 5),
+ kvmppc_get_gpr(vcpu, 6),
+ kvmppc_get_gpr(vcpu, 7));
+ if (ret == H_TOO_HARD)
+ return RESUME_HOST;
+ break;
+ case H_STUFF_TCE:
+ ret = kvmppc_vm_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+ kvmppc_get_gpr(vcpu, 5),
+ kvmppc_get_gpr(vcpu, 6),
+ kvmppc_get_gpr(vcpu, 7));
+ if (ret == H_TOO_HARD)
+ return RESUME_HOST;
+ break;
default:
return RESUME_HOST;
}
@@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
vcpu->arch.cpu_type = KVM_CPU_3S_64;
kvmppc_sanity_check(vcpu);

+ /*
+ * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
+ * half executed, we first read TCEs from the user, check them and
+ * return error if something went wrong and only then put TCEs into
+ * the TCE table.
+ *
+ * tce_tmp_hpas is a cache for TCEs to avoid stack allocation or
+ * kmalloc as the whole TCE list can take up to 512 items 8 bytes
+ * each (4096 bytes).
+ */
+ vcpu->arch.tce_tmp_hpas = kmalloc(4096, GFP_KERNEL);
+ if (!vcpu->arch.tce_tmp_hpas)
+ goto free_vcpu;
+
return vcpu;

free_vcpu:
@@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
spin_unlock(&vcpu->arch.vpa_update_lock);
+ kfree(vcpu->arch.tce_tmp_hpas);
kvm_vcpu_uninit(vcpu);
kmem_cache_free(kvm_vcpu_cache, vcpu);
}
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b02f91e..d35554e 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1490,6 +1490,12 @@ hcall_real_table:
.long 0 /* 0x11c */
.long 0 /* 0x120 */
.long .kvmppc_h_bulk_remove - hcall_real_table
+ .long 0 /* 0x128 */
+ .long 0 /* 0x12c */
+ .long 0 /* 0x130 */
+ .long 0 /* 0x134 */
+ .long .kvmppc_h_stuff_tce - hcall_real_table
+ .long .kvmppc_h_put_tce_indirect - hcall_real_table
hcall_real_table_end:

ignore_hdec:
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index da0e0bc..edfea88 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
unsigned long tce = kvmppc_get_gpr(vcpu, 6);
long rc;

- rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
+ rc = kvmppc_vm_h_put_tce(vcpu, liobn, ioba, tce);
+ if (rc == H_TOO_HARD)
+ return EMULATE_FAIL;
+ kvmppc_set_gpr(vcpu, 3, rc);
+ return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
+{
+ unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+ unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+ unsigned long tce = kvmppc_get_gpr(vcpu, 6);
+ unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+ long rc;
+
+ rc = kvmppc_vm_h_put_tce_indirect(vcpu, liobn, ioba,
+ tce, npages);
+ if (rc == H_TOO_HARD)
+ return EMULATE_FAIL;
+ kvmppc_set_gpr(vcpu, 3, rc);
+ return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
+{
+ unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+ unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+ unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
+ unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+ long rc;
+
+ rc = kvmppc_vm_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
if (rc == H_TOO_HARD)
return EMULATE_FAIL;
kvmppc_set_gpr(vcpu, 3, rc);
@@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
return kvmppc_h_pr_bulk_remove(vcpu);
case H_PUT_TCE:
return kvmppc_h_pr_put_tce(vcpu);
+ case H_PUT_TCE_INDIRECT:
+ return kvmppc_h_pr_put_tce_indirect(vcpu);
+ case H_STUFF_TCE:
+ return kvmppc_h_pr_stuff_tce(vcpu);
case H_CEDE:
vcpu->arch.shared->msr |= MSR_EE;
kvm_vcpu_block(vcpu);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 6316ee3..ccb578b 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -394,6 +394,9 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_PPC_GET_SMMU_INFO:
r = 1;
break;
+ case KVM_CAP_SPAPR_MULTITCE:
+ r = 1;
+ break;
#endif
default:
r = 0;
--
1.8.3.2

2013-07-06 15:08:18

by Alexey Kardashevskiy

Subject: [PATCH 7/8] KVM: PPC: Add support for IOMMU in-kernel handling

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests without passing them to QEMU, which saves time
on switching to QEMU and back.

Both real and virtual modes are supported. The kernel first tries to
handle a TCE request in real mode; if that fails, it passes the request
to the virtual mode handler to complete the operation. If the virtual
mode handler fails as well, the request is passed to user mode.

This adds a new KVM_CREATE_SPAPR_TCE_IOMMU ioctl (advertised via the
KVM_CAP_SPAPR_TCE_IOMMU capability) to associate a virtual PCI bus ID
(LIOBN) with an IOMMU group, which enables in-kernel handling of IOMMU
map/unmap. The external user API support in VFIO is required.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb Ethernet card).
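
A rough sketch of the user space side of the new ioctl, using the
kvm_create_spapr_tce_iommu layout added to the uapi header below
(link_liobn_to_group() is a made-up name and error handling is omitted):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int link_liobn_to_group(int vm_fd, uint64_t liobn, int vfio_group_fd)
{
        struct kvm_create_spapr_tce_iommu args = {
                .liobn = liobn,        /* LIOBN the guest passes to H_PUT_TCE & co. */
                .fd = vfio_group_fd,   /* fd of the VFIO group backing this LIOBN */
                .flags = 0,            /* no flags are defined yet */
        };

        /* On success the kernel returns an anon fd for the TCE table descriptor */
        return ioctl(vm_fd, KVM_CREATE_SPAPR_TCE_IOMMU, &args);
}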

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>

---

Changes:
2013/07/06:
* added realmode arch_spin_lock to protect TCE table from races
in real and virtual modes
* POWERPC IOMMU API is changed to support real mode
* iommu_take_ownership and iommu_release_ownership are protected by
iommu_table's locks
* VFIO external user API use rewritten
* multiple small fixes

2013/06/27:
* tce_list page is referenced now in order to protect it from accidental
invalidation during H_PUT_TCE_INDIRECT execution
* added use of the external user VFIO API

2013/06/05:
* changed capability number
* changed ioctl number
* update the doc article number

2013/05/20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. Now the real mode handler puts
translated TCEs there, tries realmode_get_page() on them and, if that fails,
passes control over to the virtual mode handler which tries to finish
the request handling
* kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
on a page
* The only reason to pass the request to user mode now is when user mode
did not register a TCE table in the kernel; in all other cases the virtual mode
handler is expected to do the job

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Documentation/virtual/kvm/api.txt | 26 ++++
arch/powerpc/include/asm/iommu.h | 9 +-
arch/powerpc/include/asm/kvm_host.h | 3 +
arch/powerpc/include/asm/kvm_ppc.h | 2 +
arch/powerpc/include/uapi/asm/kvm.h | 7 +
arch/powerpc/kernel/iommu.c | 196 +++++++++++++++--------
arch/powerpc/kvm/book3s_64_vio.c | 299 +++++++++++++++++++++++++++++++++++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 129 ++++++++++++++++
arch/powerpc/kvm/powerpc.c | 12 ++
9 files changed, 609 insertions(+), 74 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 762c703..01b0dc2 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2387,6 +2387,32 @@ slows operations a lot.
Unlike other capabilities of this section, this one is always enabled.


+4.87 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+struct kvm_create_spapr_tce_iommu {
+ __u64 liobn;
+ __u32 iommu_id;
+ __u32 flags;
+};
+
+This creates a link between IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know what IOMMU
+group (i.e. TCE table) to use for the LIOBN number passed with
+H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
+
+In response to a TCE hypercall, the kernel looks for a TCE table descriptor
+in the list and handles the hypercall in real or virtual modes if
+the descriptor is found. Otherwise the hypercall is passed to the user mode.
+
+No flag is supported at the moment.
+
+
5. The kvm_run structure
------------------------

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 98d1422..0845505 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -78,6 +78,7 @@ struct iommu_table {
unsigned long *it_map; /* A simple allocation bitmap for now */
#ifdef CONFIG_IOMMU_API
struct iommu_group *it_group;
+ arch_spinlock_t it_rm_lock;
#endif
};

@@ -159,9 +160,9 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
extern int iommu_tce_put_param_check(struct iommu_table *tbl,
unsigned long ioba, unsigned long tce);
extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
- unsigned long hwaddr, enum dma_data_direction direction);
-extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
- unsigned long entry);
+ unsigned long *hpas, unsigned long npages, bool rm);
+extern int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
+ unsigned long npages, bool rm);
extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
unsigned long entry, unsigned long pages);
extern int iommu_put_tce_user_mode(struct iommu_table *tbl,
@@ -171,7 +172,5 @@ extern void iommu_flush_tce(struct iommu_table *tbl);
extern int iommu_take_ownership(struct iommu_table *tbl);
extern void iommu_release_ownership(struct iommu_table *tbl);

-extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
-
#endif /* __KERNEL__ */
#endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 20d04bd..53e61b2 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -180,6 +180,8 @@ struct kvmppc_spapr_tce_table {
struct kvm *kvm;
u64 liobn;
u32 window_size;
+ struct iommu_group *grp; /* used for IOMMU groups */
+ struct vfio_group *vfio_grp; /* used for IOMMU groups */
struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
struct page *pages[0];
};
@@ -612,6 +614,7 @@ struct kvm_vcpu_arch {
u64 busy_preempt;

unsigned long *tce_tmp_hpas; /* TCE cache for TCE_PUT_INDIRECT hcall */
+ unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */
enum {
TCERM_NONE,
TCERM_GETPAGE,
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index fa722a0..1476538 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,6 +133,8 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);

extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
+extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+ struct kvm_create_spapr_tce_iommu *args);
extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
struct kvm_vcpu *vcpu, unsigned long liobn);
extern long kvmppc_emulated_validate_tce(unsigned long tce);
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 0fb1a6e..3da4aa3 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -319,6 +319,13 @@ struct kvm_create_spapr_tce {
__u32 window_size;
};

+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+ __u64 liobn;
+ __u32 fd;
+ __u32 flags;
+};
+
/* for KVM_ALLOCATE_RMA */
struct kvm_allocate_rma {
__u64 rma_size;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b20ff17..51678ec 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -903,7 +903,7 @@ void iommu_register_group(struct iommu_table *tbl,
kfree(name);
}

-enum dma_data_direction iommu_tce_direction(unsigned long tce)
+static enum dma_data_direction iommu_tce_direction(unsigned long tce)
{
if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE))
return DMA_BIDIRECTIONAL;
@@ -914,7 +914,6 @@ enum dma_data_direction iommu_tce_direction(unsigned long tce)
else
return DMA_NONE;
}
-EXPORT_SYMBOL_GPL(iommu_tce_direction);

void iommu_flush_tce(struct iommu_table *tbl)
{
@@ -972,73 +971,117 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
}
EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);

-unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
-{
- unsigned long oldtce;
- struct iommu_pool *pool = get_pool(tbl, entry);
-
- spin_lock(&(pool->lock));
-
- oldtce = ppc_md.tce_get(tbl, entry);
- if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
- ppc_md.tce_free(tbl, entry, 1);
- else
- oldtce = 0;
-
- spin_unlock(&(pool->lock));
-
- return oldtce;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tce);
-
int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
unsigned long entry, unsigned long pages)
{
- unsigned long oldtce;
- struct page *page;
-
- for ( ; pages; --pages, ++entry) {
- oldtce = iommu_clear_tce(tbl, entry);
- if (!oldtce)
- continue;
-
- page = pfn_to_page(oldtce >> PAGE_SHIFT);
- WARN_ON(!page);
- if (page) {
- if (oldtce & TCE_PCI_WRITE)
- SetPageDirty(page);
- put_page(page);
- }
- }
-
- return 0;
+ return iommu_free_tces(tbl, entry, pages, false);
}
EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages);

-/*
- * hwaddr is a kernel virtual address here (0xc... bazillion),
- * tce_build converts it to a physical address.
- */
+int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
+ unsigned long npages, bool rm)
+{
+ int i, ret = 0, clear_num = 0;
+
+ if (rm && !ppc_md.tce_free_rm)
+ return -EAGAIN;
+
+ arch_spin_lock(&tbl->it_rm_lock);
+
+ for (i = 0; i < npages; ++i) {
+ unsigned long oldtce = ppc_md.tce_get(tbl, entry + i);
+ if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
+ continue;
+
+ if (rm) {
+ struct page *pg = realmode_pfn_to_page(
+ oldtce >> PAGE_SHIFT);
+ if (!pg) {
+ ret = -EAGAIN;
+ } else if (PageCompound(pg)) {
+ ret = -EAGAIN;
+ } else {
+ if (oldtce & TCE_PCI_WRITE)
+ SetPageDirty(pg);
+ ret = realmode_put_page(pg);
+ }
+ } else {
+ struct page *pg = pfn_to_page(oldtce >> PAGE_SHIFT);
+ if (!pg) {
+ ret = -EAGAIN;
+ } else {
+ if (oldtce & TCE_PCI_WRITE)
+ SetPageDirty(pg);
+ put_page(pg);
+ }
+ }
+ if (ret)
+ break;
+ clear_num = i + 1;
+ }
+
+ if (clear_num) {
+ if (rm)
+ ppc_md.tce_free_rm(tbl, entry, clear_num);
+ else
+ ppc_md.tce_free(tbl, entry, clear_num);
+
+
+ if (rm && ppc_md.tce_flush_rm)
+ ppc_md.tce_flush_rm(tbl);
+ else if (!rm && ppc_md.tce_flush)
+ ppc_md.tce_flush(tbl);
+ }
+ arch_spin_unlock(&tbl->it_rm_lock);
+
+ /* Make sure updates are seen by hardware */
+ mb();
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_free_tces);
+
int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
- unsigned long hwaddr, enum dma_data_direction direction)
+ unsigned long *hpas, unsigned long npages, bool rm)
{
- int ret = -EBUSY;
- unsigned long oldtce;
- struct iommu_pool *pool = get_pool(tbl, entry);
+ int i, ret = 0;

- spin_lock(&(pool->lock));
+ if (rm && !ppc_md.tce_build_rm)
+ return -EAGAIN;

- oldtce = ppc_md.tce_get(tbl, entry);
- /* Add new entry if it is not busy */
- if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
- ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, direction, NULL);
+ arch_spin_lock(&tbl->it_rm_lock);

- spin_unlock(&(pool->lock));
+ for (i = 0; i < npages; ++i) {
+ if (ppc_md.tce_get(tbl, entry + i) &
+ (TCE_PCI_WRITE | TCE_PCI_READ)) {
+ arch_spin_unlock(&tbl->it_rm_lock);
+ return -EBUSY;
+ }
+ }

- /* if (unlikely(ret))
- pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
- __func__, hwaddr, entry << IOMMU_PAGE_SHIFT,
- hwaddr, ret); */
+ for (i = 0; i < npages; ++i) {
+ unsigned long volatile hva = (unsigned long) __va(hpas[i]);
+ enum dma_data_direction dir = iommu_tce_direction(hva);
+
+ if (rm)
+ ret = ppc_md.tce_build_rm(tbl, entry + i, 1,
+ hva, dir, NULL);
+ else
+ ret = ppc_md.tce_build(tbl, entry + i, 1,
+ hva, dir, NULL);
+ if (ret)
+ break;
+ }
+
+ if (rm && ppc_md.tce_flush_rm)
+ ppc_md.tce_flush_rm(tbl);
+ else if (!rm && ppc_md.tce_flush)
+ ppc_md.tce_flush(tbl);
+
+ arch_spin_unlock(&tbl->it_rm_lock);
+
+ /* Make sure updates are seen by hardware */
+ mb();

return ret;
}
@@ -1059,9 +1102,9 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
tce, entry << IOMMU_PAGE_SHIFT, ret); */
return -EFAULT;
}
- hwaddr = (unsigned long) page_address(page) + offset;
+ hwaddr = __pa((unsigned long) page_address(page)) + offset;

- ret = iommu_tce_build(tbl, entry, hwaddr, direction);
+ ret = iommu_tce_build(tbl, entry, &hwaddr, 1, false);
if (ret)
put_page(page);

@@ -1075,18 +1118,32 @@ EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);

int iommu_take_ownership(struct iommu_table *tbl)
{
- unsigned long sz = (tbl->it_size + 7) >> 3;
+ unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+ int ret = 0;
+
+ spin_lock_irqsave(&tbl->large_pool.lock, flags);
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_lock(&tbl->pools[i].lock);

if (tbl->it_offset == 0)
clear_bit(0, tbl->it_map);

if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
pr_err("iommu_tce: it_map is not empty");
- return -EBUSY;
+ ret = -EBUSY;
+ if (tbl->it_offset == 0)
+ clear_bit(1, tbl->it_map);
+
+ } else {
+ memset(tbl->it_map, 0xff, sz);
}

- memset(tbl->it_map, 0xff, sz);
- iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_unlock(&tbl->pools[i].lock);
+ spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
+
+ if (!ret)
+ iommu_free_tces(tbl, tbl->it_offset, tbl->it_size, false);

return 0;
}
@@ -1094,14 +1151,23 @@ EXPORT_SYMBOL_GPL(iommu_take_ownership);

void iommu_release_ownership(struct iommu_table *tbl)
{
- unsigned long sz = (tbl->it_size + 7) >> 3;
+ unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+
+ iommu_free_tces(tbl, tbl->it_offset, tbl->it_size, false);
+
+ spin_lock_irqsave(&tbl->large_pool.lock, flags);
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_lock(&tbl->pools[i].lock);

- iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
memset(tbl->it_map, 0, sz);

/* Restore bit#0 set by iommu_init_table() */
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);
+
+ for (i = 0; i < tbl->nr_pools; i++)
+ spin_unlock(&tbl->pools[i].lock);
+ spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
}
EXPORT_SYMBOL_GPL(iommu_release_ownership);

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 99bf4e5..2b51f4a 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,10 @@
#include <linux/hugetlb.h>
#include <linux/list.h>
#include <linux/anon_inodes.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/file.h>
+#include <linux/vfio.h>

#include <asm/tlbflush.h>
#include <asm/kvm_ppc.h>
@@ -48,6 +52,45 @@ static long kvmppc_stt_npages(unsigned long window_size)
* sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
}

+struct vfio_group *kvmppc_vfio_group_get_external_user(struct file *filep)
+{
+ struct vfio_group *ret;
+ struct vfio_group * (*proc)(struct file *) =
+ symbol_get(vfio_group_get_external_user);
+ if (!proc)
+ return NULL;
+
+ ret = proc(filep);
+ symbol_put(vfio_group_get_external_user);
+
+ return ret;
+}
+
+void kvmppc_vfio_group_put_external_user(struct vfio_group *group)
+{
+ void (*proc)(struct vfio_group *) =
+ symbol_get(vfio_group_put_external_user);
+ if (!proc)
+ return;
+
+ proc(group);
+ symbol_put(vfio_group_put_external_user);
+}
+
+int kvmppc_vfio_external_user_iommu_id(struct vfio_group *group)
+{
+ int ret;
+ int (*proc)(struct vfio_group *) =
+ symbol_get(vfio_external_user_iommu_id);
+ if (!proc)
+ return -EINVAL;
+
+ ret = proc(group);
+ symbol_put(vfio_external_user_iommu_id);
+
+ return ret;
+}
+
static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
{
struct kvm *kvm = stt->kvm;
@@ -69,8 +112,17 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)

mutex_lock(&kvm->lock);
list_del(&stt->list);
- for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
- __free_page(stt->pages[i]);
+
+#ifdef CONFIG_IOMMU_API
+ if (stt->grp) {
+ if (stt->vfio_grp)
+ kvmppc_vfio_group_put_external_user(stt->vfio_grp);
+ iommu_group_put(stt->grp);
+ } else
+#endif
+ for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+ __free_page(stt->pages[i]);
+
kfree(stt);
mutex_unlock(&kvm->lock);

@@ -166,9 +218,96 @@ fail:
return ret;
}

+#ifdef CONFIG_IOMMU_API
+static const struct file_operations kvm_spapr_tce_iommu_fops = {
+ .release = kvm_spapr_tce_release,
+};
+
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+ struct kvm_create_spapr_tce_iommu *args)
+{
+ struct kvmppc_spapr_tce_table *tt = NULL;
+ struct iommu_group *grp;
+ struct iommu_table *tbl;
+ struct file *vfio_filp;
+ struct vfio_group *vfio_grp;
+ int ret = 0, iommu_id;
+
+ /* Check this LIOBN hasn't been previously allocated */
+ list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) {
+ if (tt->liobn == args->liobn)
+ return -EBUSY;
+ }
+
+ vfio_filp = fget(args->fd);
+ if (!vfio_filp)
+ return -ENXIO;
+
+ /* Lock the group */
+ vfio_grp = kvmppc_vfio_group_get_external_user(vfio_filp);
+ if (!vfio_grp)
+ goto fput_exit;
+
+ /* Get IOMMU ID. Fails if group is not attached to IOMMU */
+ iommu_id = kvmppc_vfio_external_user_iommu_id(vfio_grp);
+ if (iommu_id < 0)
+ goto grpput_fput_exit;
+
+ ret = -ENXIO;
+ /* Find an IOMMU table for the given ID */
+ grp = iommu_group_get_by_id(iommu_id);
+ if (!grp)
+ goto grpput_fput_exit;
+
+ tbl = iommu_group_get_iommudata(grp);
+ if (!tbl)
+ goto grpput_fput_exit;
+
+ tt = kzalloc(sizeof(*tt), GFP_KERNEL);
+ if (!tt)
+ goto grpput_fput_exit;
+
+ tt->liobn = args->liobn;
+ tt->kvm = kvm;
+ tt->grp = grp;
+ tt->window_size = tbl->it_size << IOMMU_PAGE_SHIFT;
+ tt->vfio_grp = vfio_grp;
+
+ pr_debug("LIOBN=%llX fd=%d hooked to IOMMU %d, flags=%u\n",
+ args->liobn, args->fd, iommu_id, args->flags);
+
+ ret = anon_inode_getfd("kvm-spapr-tce-iommu",
+ &kvm_spapr_tce_iommu_fops, tt, O_RDWR);
+ if (ret < 0)
+ goto free_grpput_fput_exit;
+
+ kvm_get_kvm(kvm);
+ mutex_lock(&kvm->lock);
+ list_add(&tt->list, &kvm->arch.spapr_tce_tables);
+ mutex_unlock(&kvm->lock);
+
+ goto fput_exit;
+
+free_grpput_fput_exit:
+ kfree(tt);
+grpput_fput_exit:
+ kvmppc_vfio_group_put_external_user(vfio_grp);
+fput_exit:
+ fput(vfio_filp);
+
+ return ret;
+}
+#else
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+ struct kvm_create_spapr_tce_iommu *args)
+{
+ return -ENOSYS;
+}
+#endif /* CONFIG_IOMMU_API */
+
/* Converts guest physical address to host virtual address */
static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
- unsigned long gpa, struct page **pg)
+ unsigned long gpa, struct page **pg, unsigned long *hpa)
{
unsigned long hva, gfn = gpa >> PAGE_SHIFT;
struct kvm_memory_slot *memslot;
@@ -182,9 +321,142 @@ static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
if (get_user_pages_fast(hva & PAGE_MASK, 1, 0, pg) != 1)
return ERROR_ADDR;

+ if (hpa)
+ *hpa = __pa((unsigned long) page_address(*pg)) +
+ (hva & ~PAGE_MASK);
+
return (void *) hva;
}

+#ifdef CONFIG_IOMMU_API
+long kvmppc_vm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce)
+{
+ struct page *pg = NULL;
+ unsigned long hpa;
+ void __user *hva;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* Clear TCE */
+ if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE))) {
+ if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+ return H_PARAMETER;
+
+ if (iommu_free_tces(tbl, ioba >> IOMMU_PAGE_SHIFT,
+ 1, false))
+ return H_HARDWARE;
+
+ return H_SUCCESS;
+ }
+
+ /* Put TCE */
+ if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
+ /* Try put_tce if failed in real mode */
+ vcpu->arch.tce_rm_fail = TCERM_NONE;
+ hpa = vcpu->arch.tce_tmp_hpas[0];
+ } else {
+ if (iommu_tce_put_param_check(tbl, ioba, tce))
+ return H_PARAMETER;
+
+ hva = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce, &pg, &hpa);
+ if (hva == ERROR_ADDR)
+ return H_HARDWARE;
+ }
+
+ if (!iommu_tce_build(tbl, ioba >> IOMMU_PAGE_SHIFT, &hpa, 1, false))
+ return H_SUCCESS;
+
+ pg = pfn_to_page(hpa >> PAGE_SHIFT);
+ if (pg)
+ put_page(pg);
+
+ return H_HARDWARE;
+}
+
+static long kvmppc_vm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt, unsigned long ioba,
+ unsigned long __user *tces, unsigned long npages)
+{
+ long i = 0, start = 0;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ switch (vcpu->arch.tce_rm_fail) {
+ case TCERM_NONE:
+ break;
+ case TCERM_GETPAGE:
+ start = vcpu->arch.tce_tmp_num;
+ break;
+ case TCERM_PUTTCE:
+ goto put_tces;
+ case TCERM_PUTLIST:
+ default:
+ WARN_ON(1);
+ return H_HARDWARE;
+ }
+
+ for (i = start; i < npages; ++i) {
+ struct page *pg = NULL;
+ unsigned long gpa;
+ void __user *hva;
+
+ if (get_user(gpa, tces + i))
+ return H_HARDWARE;
+
+ if (iommu_tce_put_param_check(tbl, ioba +
+ (i << IOMMU_PAGE_SHIFT), gpa))
+ return H_PARAMETER;
+
+ hva = kvmppc_vm_gpa_to_hva_and_get(vcpu, gpa, &pg,
+ &vcpu->arch.tce_tmp_hpas[i]);
+ if (hva == ERROR_ADDR)
+ goto putpages_flush_exit;
+ }
+
+put_tces:
+ if (!iommu_tce_build(tbl, ioba >> IOMMU_PAGE_SHIFT,
+ vcpu->arch.tce_tmp_hpas, npages, false))
+ return H_SUCCESS;
+
+putpages_flush_exit:
+ for ( --i; i >= 0; --i) {
+ struct page *pg;
+ pg = pfn_to_page(vcpu->arch.tce_tmp_hpas[i] >> PAGE_SHIFT);
+ if (pg)
+ put_page(pg);
+ }
+
+ return H_HARDWARE;
+}
+
+long kvmppc_vm_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+ unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+ return H_PARAMETER;
+
+ if (iommu_free_tces(tbl, entry, npages, false))
+ return H_HARDWARE;
+
+ return H_SUCCESS;
+}
+#endif /* CONFIG_IOMMU_API */
+
long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
unsigned long liobn, unsigned long ioba,
unsigned long tce)
@@ -199,6 +471,11 @@ long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,

++tt->stat.vm.put;

+#ifdef CONFIG_IOMMU_API
+ if (tt->grp)
+ return kvmppc_vm_h_put_tce_iommu(vcpu, tt, liobn, ioba, tce);
+#endif
+ /* Emulated IO */
if (ioba >= tt->window_size)
return H_PARAMETER;

@@ -240,13 +517,21 @@ long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;

- tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list, &pg);
+ tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list, &pg, NULL);
if (tces == ERROR_ADDR)
return H_TOO_HARD;

if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
goto put_list_page_exit;

+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ ret = kvmppc_vm_h_put_tce_indirect_iommu(vcpu,
+ tt, ioba, tces, npages);
+ goto put_list_page_exit;
+ }
+#endif
+ /* Emulated IO */
for (i = 0; i < npages; ++i) {
if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
ret = H_PARAMETER;
@@ -288,6 +573,12 @@ long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,

++tt->stat.vm.stuff;

+#ifdef CONFIG_IOMMU_API
+ if (tt->grp)
+ return kvmppc_vm_h_stuff_tce_iommu(vcpu, tt, liobn, ioba,
+ tce_value, npages);
+#endif
+ /* Emulated IO */
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;

diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index cd3e6f9..f8103c6 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -26,6 +26,7 @@
#include <linux/slab.h>
#include <linux/hugetlb.h>
#include <linux/list.h>
+#include <linux/iommu.h>

#include <asm/tlbflush.h>
#include <asm/kvm_ppc.h>
@@ -179,6 +180,115 @@ static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
return hpa;
}

+#ifdef CONFIG_IOMMU_API
+static long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt, unsigned long liobn,
+ unsigned long ioba, unsigned long tce)
+{
+ int ret;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+ unsigned long hpa;
+ struct page *pg = NULL;
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* Clear TCE */
+ if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE))) {
+ if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+ return H_PARAMETER;
+
+ if (iommu_free_tces(tbl, ioba >> IOMMU_PAGE_SHIFT, 1, true))
+ return H_TOO_HARD;
+
+ return H_SUCCESS;
+ }
+
+ /* Put TCE */
+ if (iommu_tce_put_param_check(tbl, ioba, tce))
+ return H_PARAMETER;
+
+ hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce, &pg);
+ if (hpa == ERROR_ADDR)
+ return H_TOO_HARD;
+
+ ret = iommu_tce_build(tbl, ioba >> IOMMU_PAGE_SHIFT, &hpa, 1, true);
+ if (unlikely(ret)) {
+ if (ret == -EBUSY)
+ return H_PARAMETER;
+
+ vcpu->arch.tce_tmp_hpas[0] = hpa;
+ vcpu->arch.tce_tmp_num = 0;
+ vcpu->arch.tce_rm_fail = TCERM_PUTTCE;
+ return H_TOO_HARD;
+ }
+
+ return H_SUCCESS;
+}
+
+static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt, unsigned long ioba,
+ unsigned long *tces, unsigned long npages)
+{
+ int i, ret;
+ unsigned long hpa;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+ struct page *pg = NULL;
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* Check all TCEs */
+ for (i = 0; i < npages; ++i) {
+ if (iommu_tce_put_param_check(tbl, ioba +
+ (i << IOMMU_PAGE_SHIFT), tces[i]))
+ return H_PARAMETER;
+ }
+
+ /* Translate TCEs and go get_page() */
+ for (i = 0; i < npages; ++i) {
+ hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tces[i], &pg);
+ if (hpa == ERROR_ADDR) {
+ vcpu->arch.tce_tmp_num = i;
+ vcpu->arch.tce_rm_fail = TCERM_GETPAGE;
+ return H_TOO_HARD;
+ }
+ vcpu->arch.tce_tmp_hpas[i] = hpa;
+ }
+
+ /* Put TCEs to the table */
+ ret = iommu_tce_build(tbl, (ioba >> IOMMU_PAGE_SHIFT),
+ vcpu->arch.tce_tmp_hpas, npages, true);
+ if (ret == -EAGAIN) {
+ vcpu->arch.tce_rm_fail = TCERM_PUTTCE;
+ return H_TOO_HARD;
+ } else if (ret) {
+ return H_HARDWARE;
+ }
+
+ return H_SUCCESS;
+}
+
+static long kvmppc_h_stuff_tce_iommu(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ if (!tbl)
+ return H_RESCINDED;
+
+ if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+ return H_PARAMETER;
+
+ if (iommu_free_tces(tbl, ioba >> IOMMU_PAGE_SHIFT, npages, true))
+ return H_TOO_HARD;
+
+ return H_SUCCESS;
+}
+#endif /* CONFIG_IOMMU_API */
+
long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce)
{
@@ -190,6 +300,11 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,

++tt->stat.rm.put;

+#ifdef CONFIG_IOMMU_API
+ if (tt->grp)
+ return kvmppc_h_put_tce_iommu(vcpu, tt, liobn, ioba, tce);
+#endif
+ /* Emulated IO */
if (ioba >= tt->window_size)
return H_PARAMETER;

@@ -231,6 +346,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if (tces == ERROR_ADDR)
return H_TOO_HARD;

+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ ret = kvmppc_h_put_tce_indirect_iommu(vcpu,
+ tt, ioba, (unsigned long *)tces, npages);
+ goto put_unlock_exit;
+ }
+#endif
+ /* Emulated IO */
for (i = 0; i < npages; ++i) {
ret = kvmppc_emulated_validate_tce(((unsigned long *)tces)[i]);
if (ret)
@@ -263,6 +386,12 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,

++tt->stat.rm.stuff;

+#ifdef CONFIG_IOMMU_API
+ if (tt->grp)
+ return kvmppc_h_stuff_tce_iommu(vcpu, tt, liobn, ioba,
+ tce_value, npages);
+#endif
+ /* Emulated IO */
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index ccb578b..2909cfa 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -395,6 +395,7 @@ int kvm_dev_ioctl_check_extension(long ext)
r = 1;
break;
case KVM_CAP_SPAPR_MULTITCE:
+ case KVM_CAP_SPAPR_TCE_IOMMU:
r = 1;
break;
#endif
@@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
goto out;
}
+ case KVM_CREATE_SPAPR_TCE_IOMMU: {
+ struct kvm_create_spapr_tce_iommu create_tce_iommu;
+ struct kvm *kvm = filp->private_data;
+
+ r = -EFAULT;
+ if (copy_from_user(&create_tce_iommu, argp,
+ sizeof(create_tce_iommu)))
+ goto out;
+ r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
+ goto out;
+ }
#endif /* CONFIG_PPC_BOOK3S_64 */

#ifdef CONFIG_KVM_BOOK3S_64_HV
--
1.8.3.2
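
The handler above returns a new file descriptor for the TCE table via anon_inode_getfd(), so user space ends up with one fd per LIOBN. A rough sketch of the caller, assuming the uapi structure carries only the liobn, fd and flags fields referenced in the handler (the exact layout is defined in the uapi header elsewhere in this series, and kvm_hook_liobn_to_vfio_group is a hypothetical helper name):

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Hypothetical user space helper, for illustration only */
	static int kvm_hook_liobn_to_vfio_group(int vm_fd, uint64_t liobn, int vfio_group_fd)
	{
		struct kvm_create_spapr_tce_iommu args = {
			.liobn	= liobn,		/* LIOBN of the guest TCE table */
			.fd	= vfio_group_fd,	/* VFIO group fd, IOMMU already set on its container */
			.flags	= 0,
		};

		/* On success a new fd for the in-kernel TCE table is returned */
		return ioctl(vm_fd, KVM_CREATE_SPAPR_TCE_IOMMU, &args);
	}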

2013-07-06 15:14:44

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH 1/8] KVM: PPC: reserve a capability number for multitce support

This is to reserve a capability number for upcoming support
of the H_PUT_TCE_INDIRECT and H_STUFF_TCE pseries hypercalls
which support multiple DMA map/unmap operations per call.
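
A user space VMM would typically probe this capability with KVM_CHECK_EXTENSION before relying on the multi-TCE hcalls; a minimal sketch (the actual QEMU wiring is not part of this patch, and kvm_has_multitce is a hypothetical helper name):

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Hypothetical user space probe, for illustration only */
	static int kvm_has_multitce(int kvm_sys_fd)
	{
		/* Returns > 0 once the kernel handles H_PUT_TCE_INDIRECT/H_STUFF_TCE */
		return ioctl(kvm_sys_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_MULTITCE) > 0;
	}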

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
include/uapi/linux/kvm.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d88c8ee..970b1f5 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info {
#define KVM_CAP_IRQ_MPIC 90
#define KVM_CAP_PPC_RTAS 91
#define KVM_CAP_IRQ_XICS 92
+#define KVM_CAP_SPAPR_MULTITCE 93

#ifdef KVM_CAP_IRQ_ROUTING

--
1.8.3.2

2013-07-06 15:15:28

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH 4/8] powerpc: Prepare to support kernel handling of IOMMU map/unmap

The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a "fast" hypercall directly in "real
mode" (processor still in the guest context but MMU off).

To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.

This adds an API to increment/decrement the page counter, as the
get_user_pages API used for user mode mapping does not work
in real mode.

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.
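
For illustration only (the in-kernel users are added by the KVM patches later in this series), a real mode caller would use the API roughly as below; any -EAGAIN is the signal to bail out to virtual mode and take the usual get_user_pages() path there (realmode_ref_page is a hypothetical name):

	/* Sketch of a real mode caller, not part of this patch */
	static long realmode_ref_page(unsigned long pfn)
	{
		struct page *page = realmode_pfn_to_page(pfn);

		if (!page)
			return -EAGAIN;	/* page struct split between vmemmap blocks */

		/* Fails with -EAGAIN for compound (huge) pages or a zero refcount */
		return realmode_get_page(page);
	}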

Reviewed-by: Paul Mackerras <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>

---

Changes:
2013/06/27:
* realmode_get_page() fixed to use get_page_unless_zero(). If failed,
the call will be passed from real to virtual mode and safely handled.
* added comment to PageCompound() in include/linux/page-flags.h.

2013/05/20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/pgtable-ppc64.h | 4 ++
arch/powerpc/mm/init_64.c | 78 +++++++++++++++++++++++++++++++-
include/linux/page-flags.h | 4 +-
3 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index e3d55f6f..7b46e5f 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
}
#endif /* !CONFIG_HUGETLB_PAGE */

+struct page *realmode_pfn_to_page(unsigned long pfn);
+int realmode_get_page(struct page *page);
+int realmode_put_page(struct page *page);
+
#endif /* __ASSEMBLY__ */

#endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a90b9c4..7031be3 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -297,5 +297,81 @@ void vmemmap_free(unsigned long start, unsigned long end)
{
}

-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fallback to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * Any of realmode_XXXX functions can fail due to:
+ * 1) As real sparsemem blocks do not lie in RAM contiguously (they
+ * are in virtual address space which is not available in the real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ * When 1) or 2) takes place, the API returns an error code to cause
+ * an exit to kernel virtual mode where the operation will be completed.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct vmemmap_backing *vmem_back;
+ struct page *page;
+ unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+ unsigned long pg_va = (unsigned long) pfn_to_page(pfn);

+ for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+ if (pg_va < vmem_back->virt_addr)
+ continue;
+
+ /* Check that page struct is not split between real pages */
+ if ((pg_va + sizeof(struct page)) >
+ (vmem_back->virt_addr + page_size))
+ return NULL;
+
+ page = (struct page *) (vmem_back->phys + pg_va -
+ vmem_back->virt_addr);
+ return page;
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+ return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+int realmode_get_page(struct page *page)
+{
+ if (PageCompound(page))
+ return -EAGAIN;
+
+ if (!get_page_unless_zero(page))
+ return -EAGAIN;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_get_page);
+
+int realmode_put_page(struct page *page)
+{
+ if (PageCompound(page))
+ return -EAGAIN;
+
+ if (!atomic_add_unless(&page->_count, -1, 1))
+ return -EAGAIN;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_put_page);
+#endif
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..98ada58 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -329,7 +329,9 @@ static inline void set_page_writeback(struct page *page)
* System with lots of page flags available. This allows separate
* flags for PageHead() and PageTail() checks of compound pages so that bit
* tests can be used in performance sensitive paths. PageCompound is
- * generally not used in hot code paths.
+ * generally not used in hot code paths except arch/powerpc/mm/init_64.c
+ * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages
+ * and avoid handling those in real mode.
*/
__PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
__PAGEFLAG(Tail, tail)
--
1.8.3.2

2013-07-06 15:15:56

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

This adds special support for huge pages (16MB). The reference
counting cannot be easily done for such pages in real mode (when
MMU is off) so we added a list of huge pages. It is populated in
virtual mode and get_page is called just once per huge page.
Real mode handlers check whether the requested page is huge and already
in the list; if it is, no reference counting is done, otherwise an exit
to virtual mode happens. The list is released at KVM exit. At the moment the fastest
card available for tests uses up to 9 huge pages so walking through this
list is not very expensive. However this can change and we may want
to optimize this.
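
With the hashtable variant from the change log below, hashing on gpa >> 24 (the huge page size is 16MB, i.e. 2^24 bytes) puts every address within one huge page into the same bucket, so the real mode lookup boils down to a short bucket walk; a sketch of that lookup (rm_hugepage_gpa_to_hpa is a hypothetical name, the real code lives in kvmppc_rm_gpa_to_hpa_and_get()):

	/* Sketch only; mirrors kvmppc_rm_gpa_to_hpa_and_get() in book3s_64_vio_hv.c */
	static unsigned long rm_hugepage_gpa_to_hpa(struct kvmppc_spapr_tce_table *tt,
			unsigned long gpa)
	{
		unsigned key = KVMPPC_SPAPR_HUGEPAGE_HASH(gpa);	/* hash_32(gpa >> 24, 32) */
		struct kvmppc_spapr_iommu_hugepage *hp;

		hash_for_each_possible_rcu_notrace(tt->hash_tab, hp, hash_node, key) {
			if ((gpa >= hp->gpa) && (gpa < hp->gpa + hp->size))
				/* Referenced once at insertion time, no get_page() here */
				return hp->hpa + (gpa & (hp->size - 1));
		}

		return ERROR_ADDR;	/* not found: caller returns H_TOO_HARD, retried in virtual mode */
	}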

Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>

---

Changes:
2013/06/27:
* list of huge pages replaced with a hashtable for better performance
* spinlock removed from real mode and now only protects insertion of new
huge page descriptors into the hashtable

2013/06/05:
* fixed compile error when CONFIG_IOMMU_API=n

2013/05/20:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints a warning if it is called twice for the same
huge page as the real mode handler is expected to fail just once - when a huge
page is not in the list yet.
* the huge page is refcounted twice - when added to the hugepage list and
when used in the virtual mode hcall handler (can be optimized but it will
make the patch less nice).

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/kvm_host.h | 25 +++++++++
arch/powerpc/kernel/iommu.c | 6 ++-
arch/powerpc/kvm/book3s_64_vio.c | 104 +++++++++++++++++++++++++++++++++---
arch/powerpc/kvm/book3s_64_vio_hv.c | 21 ++++++--
4 files changed, 146 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 53e61b2..a7508cf 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -30,6 +30,7 @@
#include <linux/kvm_para.h>
#include <linux/list.h>
#include <linux/atomic.h>
+#include <linux/hashtable.h>
#include <asm/kvm_asm.h>
#include <asm/processor.h>
#include <asm/page.h>
@@ -182,10 +183,34 @@ struct kvmppc_spapr_tce_table {
u32 window_size;
struct iommu_group *grp; /* used for IOMMU groups */
struct vfio_group *vfio_grp; /* used for IOMMU groups */
+ DECLARE_HASHTABLE(hash_tab, ilog2(64)); /* used for IOMMU groups */
+ spinlock_t hugepages_write_lock; /* used for IOMMU groups */
struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
struct page *pages[0];
};

+/*
+ * The KVM guest can be backed with 16MB pages.
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa) hash_32(gpa >> 24, 32)
+
+struct kvmppc_spapr_iommu_hugepage {
+ struct hlist_node hash_node;
+ unsigned long gpa; /* Guest physical address */
+ unsigned long hpa; /* Host physical address */
+ struct page *page; /* page struct of the very first subpage */
+ unsigned long size; /* Huge page size (always 16MB at the moment) */
+};
+
struct kvmppc_linear_info {
void *base_virt;
unsigned long base_pfn;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 51678ec..e0b6eca 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
if (!pg) {
ret = -EAGAIN;
} else if (PageCompound(pg)) {
- ret = -EAGAIN;
+ /* Hugepages will be released at KVM exit */
+ ret = 0;
} else {
if (oldtce & TCE_PCI_WRITE)
SetPageDirty(pg);
@@ -1009,6 +1010,9 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
struct page *pg = pfn_to_page(oldtce >> PAGE_SHIFT);
if (!pg) {
ret = -EAGAIN;
+ } else if (PageCompound(pg)) {
+ /* Hugepages will be released at KVM exit */
+ ret = 0;
} else {
if (oldtce & TCE_PCI_WRITE)
SetPageDirty(pg);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 2b51f4a..c037219 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -46,6 +46,40 @@

#define ERROR_ADDR ((void *)~(unsigned long)0x0)

+#ifdef CONFIG_IOMMU_API
+static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
+{
+ spin_lock_init(&tt->hugepages_write_lock);
+ hash_init(tt->hash_tab);
+}
+
+static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
+{
+ int bkt;
+ struct kvmppc_spapr_iommu_hugepage *hp;
+ struct hlist_node *tmp;
+
+ spin_lock(&tt->hugepages_write_lock);
+ hash_for_each_safe(tt->hash_tab, bkt, tmp, hp, hash_node) {
+ pr_debug("Release HP liobn=%llx #%u gpa=%lx hpa=%lx size=%ld\n",
+ tt->liobn, bkt, hp->gpa, hp->hpa, hp->size);
+ hlist_del_rcu(&hp->hash_node);
+
+ put_page(hp->page);
+ kfree(hp);
+ }
+ spin_unlock(&tt->hugepages_write_lock);
+}
+#else
+static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
+{
+}
+
+static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
+{
+}
+#endif /* CONFIG_IOMMU_API */
+
static long kvmppc_stt_npages(unsigned long window_size)
{
return ALIGN((window_size >> SPAPR_TCE_SHIFT)
@@ -112,6 +146,7 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)

mutex_lock(&kvm->lock);
list_del(&stt->list);
+ kvmppc_iommu_hugepages_cleanup(stt);

#ifdef CONFIG_IOMMU_API
if (stt->grp) {
@@ -200,6 +235,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
kvm_get_kvm(kvm);

mutex_lock(&kvm->lock);
+ kvmppc_iommu_hugepages_init(stt);
list_add(&stt->list, &kvm->arch.spapr_tce_tables);

mutex_unlock(&kvm->lock);
@@ -283,6 +319,7 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,

kvm_get_kvm(kvm);
mutex_lock(&kvm->lock);
+ kvmppc_iommu_hugepages_init(tt);
list_add(&tt->list, &kvm->arch.spapr_tce_tables);
mutex_unlock(&kvm->lock);

@@ -307,10 +344,17 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,

/* Converts guest physical address to host virtual address */
static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
unsigned long gpa, struct page **pg, unsigned long *hpa)
{
unsigned long hva, gfn = gpa >> PAGE_SHIFT;
struct kvm_memory_slot *memslot;
+#ifdef CONFIG_IOMMU_API
+ struct kvmppc_spapr_iommu_hugepage *hp;
+ unsigned key = KVMPPC_SPAPR_HUGEPAGE_HASH(gpa);
+ pte_t *ptep;
+ unsigned int shift = 0;
+#endif

memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
if (!memslot)
@@ -325,6 +369,54 @@ static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
*hpa = __pa((unsigned long) page_address(*pg)) +
(hva & ~PAGE_MASK);

+#ifdef CONFIG_IOMMU_API
+ if (!PageCompound(*pg))
+ return (void *) hva;
+
+ spin_lock(&tt->hugepages_write_lock);
+ hash_for_each_possible_rcu(tt->hash_tab, hp, hash_node, key) {
+ if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
+ continue;
+ if (hpa)
+ *hpa = __pa((unsigned long) page_address(hp->page)) +
+ (hva & (hp->size - 1));
+ goto unlock_exit;
+ }
+
+ ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+ WARN_ON(!ptep);
+
+ if (!ptep || (shift <= PAGE_SHIFT)) {
+ hva = (unsigned long) ERROR_ADDR;
+ goto unlock_exit;
+ }
+
+ hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+ if (!hp) {
+ hva = (unsigned long) ERROR_ADDR;
+ goto unlock_exit;
+ }
+
+ hp->gpa = gpa & ~((1 << shift) - 1);
+ hp->hpa = (pte_pfn(*ptep) << PAGE_SHIFT);
+ hp->size = 1 << shift;
+
+ if (get_user_pages_fast(hva & ~(hp->size - 1), 1, 1, &hp->page) != 1) {
+ hva = (unsigned long) ERROR_ADDR;
+ kfree(hp);
+ goto unlock_exit;
+ }
+ hash_add_rcu(tt->hash_tab, &hp->hash_node, key);
+
+ if (hpa)
+ *hpa = __pa((unsigned long) page_address(hp->page)) +
+ (hva & (hp->size - 1));
+unlock_exit:
+ spin_unlock(&tt->hugepages_write_lock);
+
+ put_page(*pg);
+ *pg = NULL;
+#endif /* CONFIG_IOMMU_API */
return (void *) hva;
}

@@ -363,7 +455,7 @@ long kvmppc_vm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
if (iommu_tce_put_param_check(tbl, ioba, tce))
return H_PARAMETER;

- hva = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce, &pg, &hpa);
+ hva = kvmppc_vm_gpa_to_hva_and_get(vcpu, tt, tce, &pg, &hpa);
if (hva == ERROR_ADDR)
return H_HARDWARE;
}
@@ -372,7 +464,7 @@ long kvmppc_vm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
return H_SUCCESS;

pg = pfn_to_page(hpa >> PAGE_SHIFT);
- if (pg)
+ if (pg && !PageCompound(pg))
put_page(pg);

return H_HARDWARE;
@@ -414,7 +506,7 @@ static long kvmppc_vm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
(i << IOMMU_PAGE_SHIFT), gpa))
return H_PARAMETER;

- hva = kvmppc_vm_gpa_to_hva_and_get(vcpu, gpa, &pg,
+ hva = kvmppc_vm_gpa_to_hva_and_get(vcpu, tt, gpa, &pg,
&vcpu->arch.tce_tmp_hpas[i]);
if (hva == ERROR_ADDR)
goto putpages_flush_exit;
@@ -429,7 +521,7 @@ putpages_flush_exit:
for ( --i; i >= 0; --i) {
struct page *pg;
pg = pfn_to_page(vcpu->arch.tce_tmp_hpas[i] >> PAGE_SHIFT);
- if (pg)
+ if (pg && !PageCompound(pg))
put_page(pg);
}

@@ -517,7 +609,7 @@ long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;

- tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list, &pg, NULL);
+ tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tt, tce_list, &pg, NULL);
if (tces == ERROR_ADDR)
return H_TOO_HARD;

@@ -547,7 +639,7 @@ long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
vcpu->arch.tce_tmp_hpas[i]);
put_list_page_exit:
- if (pg)
+ if (pg && !PageCompound(pg))
put_page(pg);

if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index f8103c6..8c6449f 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -132,6 +132,7 @@ EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
* returns ERROR_ADDR if failed.
*/
static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
unsigned long gpa, struct page **pg)
{
struct kvm_memory_slot *memslot;
@@ -139,6 +140,20 @@ static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
unsigned long hva, hpa = ERROR_ADDR;
unsigned long gfn = gpa >> PAGE_SHIFT;
unsigned shift = 0;
+ struct kvmppc_spapr_iommu_hugepage *hp;
+
+ /* Try to find an already used hugepage */
+ unsigned key = KVMPPC_SPAPR_HUGEPAGE_HASH(gpa);
+
+ hash_for_each_possible_rcu_notrace(tt->hash_tab, hp,
+ hash_node, key) {
+ if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
+ continue;
+
+ *pg = NULL; /* Tell the caller not to put page */
+
+ return hp->hpa + (gpa & (hp->size - 1));
+ }

memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
if (!memslot)
@@ -208,7 +223,7 @@ static long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
if (iommu_tce_put_param_check(tbl, ioba, tce))
return H_PARAMETER;

- hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce, &pg);
+ hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tt, tce, &pg);
if (hpa == ERROR_ADDR)
return H_TOO_HARD;

@@ -247,7 +262,7 @@ static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,

/* Translate TCEs and go get_page() */
for (i = 0; i < npages; ++i) {
- hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tces[i], &pg);
+ hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tt, tces[i], &pg);
if (hpa == ERROR_ADDR) {
vcpu->arch.tce_tmp_num = i;
vcpu->arch.tce_rm_fail = TCERM_GETPAGE;
@@ -342,7 +357,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;

- tces = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce_list, &pg);
+ tces = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tt, tce_list, &pg);
if (tces == ERROR_ADDR)
return H_TOO_HARD;

--
1.8.3.2

2013-07-08 01:34:16

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 4/8] powerpc: Prepare to support kernel handling of IOMMU map/unmap

On Sun, 2013-07-07 at 01:07 +1000, Alexey Kardashevskiy wrote:
> The current VFIO-on-POWER implementation supports only user mode
> driven mapping, i.e. QEMU is sending requests to map/unmap pages.
> However this approach is really slow, so we want to move that to KVM.
> Since H_PUT_TCE can be extremely performance sensitive (especially with
> network adapters where each packet needs to be mapped/unmapped) we chose
> to implement that as a "fast" hypercall directly in "real
> mode" (processor still in the guest context but MMU off).
>
> To be able to do that, we need to provide some facilities to
> access the struct page count within that real mode environment as things
> like the sparsemem vmemmap mappings aren't accessible.
>
> This adds an API to increment/decrement page counter as
> get_user_pages API used for user mode mapping does not work
> in the real mode.
>
> CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.

This patch will need an ack from "mm" people to make sure they are ok
with our approach and ack the change to the generic header.

(Added linux-mm).

Cheers,
Ben.

> Reviewed-by: Paul Mackerras <[email protected]>
> Signed-off-by: Paul Mackerras <[email protected]>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>
> ---
>
> Changes:
> 2013/06/27:
> * realmode_get_page() fixed to use get_page_unless_zero(). If failed,
> the call will be passed from real to virtual mode and safely handled.
> * added comment to PageCompound() in include/linux/page-flags.h.
>
> 2013/05/20:
> * PageTail() is replaced by PageCompound() in order to have the same checks
> for whether the page is huge in realmode_get_page() and realmode_put_page()
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> arch/powerpc/include/asm/pgtable-ppc64.h | 4 ++
> arch/powerpc/mm/init_64.c | 78 +++++++++++++++++++++++++++++++-
> include/linux/page-flags.h | 4 +-
> 3 files changed, 84 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
> index e3d55f6f..7b46e5f 100644
> --- a/arch/powerpc/include/asm/pgtable-ppc64.h
> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
> @@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
> }
> #endif /* !CONFIG_HUGETLB_PAGE */
>
> +struct page *realmode_pfn_to_page(unsigned long pfn);
> +int realmode_get_page(struct page *page);
> +int realmode_put_page(struct page *page);
> +
> #endif /* __ASSEMBLY__ */
>
> #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index a90b9c4..7031be3 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -297,5 +297,81 @@ void vmemmap_free(unsigned long start, unsigned long end)
> {
> }
>
> -#endif /* CONFIG_SPARSEMEM_VMEMMAP */
> +/*
> + * We do not have access to the sparsemem vmemmap, so we fallback to
> + * walking the list of sparsemem blocks which we already maintain for
> + * the sake of crashdump. In the long run, we might want to maintain
> + * a tree if performance of that linear walk becomes a problem.
> + *
> + * Any of realmode_XXXX functions can fail due to:
> + * 1) As real sparsemem blocks do not lay in RAM continously (they
> + * are in virtual address space which is not available in the real mode),
> + * the requested page struct can be split between blocks so get_page/put_page
> + * may fail.
> + * 2) When huge pages are used, the get_page/put_page API will fail
> + * in real mode as the linked addresses in the page struct are virtual
> + * too.
> + * When 1) or 2) takes place, the API returns an error code to cause
> + * an exit to kernel virtual mode where the operation will be completed.
> + */
> +struct page *realmode_pfn_to_page(unsigned long pfn)
> +{
> + struct vmemmap_backing *vmem_back;
> + struct page *page;
> + unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
> + unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
>
> + for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
> + if (pg_va < vmem_back->virt_addr)
> + continue;
> +
> + /* Check that page struct is not split between real pages */
> + if ((pg_va + sizeof(struct page)) >
> + (vmem_back->virt_addr + page_size))
> + return NULL;
> +
> + page = (struct page *) (vmem_back->phys + pg_va -
> + vmem_back->virt_addr);
> + return page;
> + }
> +
> + return NULL;
> +}
> +EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
> +
> +#elif defined(CONFIG_FLATMEM)
> +
> +struct page *realmode_pfn_to_page(unsigned long pfn)
> +{
> + struct page *page = pfn_to_page(pfn);
> + return page;
> +}
> +EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
> +
> +#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
> +
> +#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
> +int realmode_get_page(struct page *page)
> +{
> + if (PageCompound(page))
> + return -EAGAIN;
> +
> + if (!get_page_unless_zero(page))
> + return -EAGAIN;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(realmode_get_page);
> +
> +int realmode_put_page(struct page *page)
> +{
> + if (PageCompound(page))
> + return -EAGAIN;
> +
> + if (!atomic_add_unless(&page->_count, -1, 1))
> + return -EAGAIN;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(realmode_put_page);
> +#endif
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 6d53675..98ada58 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -329,7 +329,9 @@ static inline void set_page_writeback(struct page *page)
> * System with lots of page flags available. This allows separate
> * flags for PageHead() and PageTail() checks of compound pages so that bit
> * tests can be used in performance sensitive paths. PageCompound is
> - * generally not used in hot code paths.
> + * generally not used in hot code paths except arch/powerpc/mm/init_64.c
> + * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages
> + * and avoid handling those in real mode.
> */
> __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
> __PAGEFLAG(Tail, tail)

2013-07-08 04:44:47

by Alexey Kardashevskiy

[permalink] [raw]
Subject: [PATCH v2] powerpc: add real mode support for dma operations on powernv

The existing TCE machine calls (tce_build and tce_free) only support
virtual mode as they call __raw_writeq for TCE invalidation, which
fails in real mode.

This introduces tce_build_rm and tce_free_rm real mode versions
which do mostly the same but use the "Store Doubleword Caching Inhibited
Indexed" (stdcix) instruction for TCE invalidation, as the invalidate
register has to be written through its physical address while the MMU is off.

This new feature is going to be utilized by real mode support of VFIO.

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
Changes:
2013/08/07:
* tested on p7ioc and fixed a bug with realmode addresses

Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
arch/powerpc/include/asm/machdep.h | 12 +++++++++
arch/powerpc/platforms/powernv/pci-ioda.c | 43 ++++++++++++++++++++++---------
arch/powerpc/platforms/powernv/pci.c | 38 ++++++++++++++++++++++-----
arch/powerpc/platforms/powernv/pci.h | 3 ++-
4 files changed, 77 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 92386fc..0c19eef 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -75,6 +75,18 @@ struct machdep_calls {
long index);
void (*tce_flush)(struct iommu_table *tbl);

+ /* _rm versions are for real mode use only */
+ int (*tce_build_rm)(struct iommu_table *tbl,
+ long index,
+ long npages,
+ unsigned long uaddr,
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs);
+ void (*tce_free_rm)(struct iommu_table *tbl,
+ long index,
+ long npages);
+ void (*tce_flush_rm)(struct iommu_table *tbl);
+
void __iomem * (*ioremap)(phys_addr_t addr, unsigned long size,
unsigned long flags, void *caller);
void (*iounmap)(volatile void __iomem *token);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index d200594..8a70003 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -68,6 +68,12 @@ define_pe_printk_level(pe_err, KERN_ERR);
define_pe_printk_level(pe_warn, KERN_WARNING);
define_pe_printk_level(pe_info, KERN_INFO);

+static inline void __raw_rm_writeq(u64 val, volatile void __iomem *paddr)
+{
+ __asm__ __volatile__("stdcix %0,0,%1"
+ : : "r" (val), "r" (paddr) : "memory");
+}
+
static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
{
unsigned long pe;
@@ -452,10 +458,13 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
}
}

-static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
- u64 *startp, u64 *endp)
+static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
+ struct iommu_table *tbl,
+ u64 *startp, u64 *endp, bool rm)
{
- u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
+ u64 __iomem *invalidate = rm?
+ (u64 __iomem *)pe->it_index_rm:
+ (u64 __iomem *)tbl->it_index;
unsigned long start, end, inc;

start = __pa(startp);
@@ -482,7 +491,10 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,

mb(); /* Ensure above stores are visible */
while (start <= end) {
- __raw_writeq(start, invalidate);
+ if (rm)
+ __raw_rm_writeq(start, invalidate);
+ else
+ __raw_writeq(start, invalidate);
start += inc;
}

@@ -494,10 +506,12 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,

static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
struct iommu_table *tbl,
- u64 *startp, u64 *endp)
+ u64 *startp, u64 *endp, bool rm)
{
unsigned long start, end, inc;
- u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
+ u64 __iomem *invalidate = rm?
+ (u64 __iomem *)pe->it_index_rm:
+ (u64 __iomem *)tbl->it_index;

/* We'll invalidate DMA address in PE scope */
start = 0x2ul << 60;
@@ -513,22 +527,25 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
mb();

while (start <= end) {
- __raw_writeq(start, invalidate);
+ if (rm)
+ __raw_rm_writeq(start, invalidate);
+ else
+ __raw_writeq(start, invalidate);
start += inc;
}
}

void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
- u64 *startp, u64 *endp)
+ u64 *startp, u64 *endp, bool rm)
{
struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
tce32_table);
struct pnv_phb *phb = pe->phb;

if (phb->type == PNV_PHB_IODA1)
- pnv_pci_ioda1_tce_invalidate(tbl, startp, endp);
+ pnv_pci_ioda1_tce_invalidate(pe, tbl, startp, endp, rm);
else
- pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp);
+ pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
}

static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
@@ -601,7 +618,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
* bus number, print that out instead.
*/
tbl->it_busno = 0;
- tbl->it_index = (unsigned long)ioremap(be64_to_cpup(swinvp), 8);
+ pe->it_index_rm = be64_to_cpup(swinvp);
+ tbl->it_index = (unsigned long)ioremap(pe->it_index_rm, 8);
tbl->it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE |
TCE_PCI_SWINV_PAIR;
}
@@ -679,7 +697,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
* bus number, print that out instead.
*/
tbl->it_busno = 0;
- tbl->it_index = (unsigned long)ioremap(be64_to_cpup(swinvp), 8);
+ pe->it_index_rm = be64_to_cpup(swinvp);
+ tbl->it_index = (unsigned long)ioremap(pe->it_index_rm, 8);
tbl->it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE;
}
iommu_init_table(tbl, phb->hose->node);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index e16b729..280f614 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -336,7 +336,7 @@ struct pci_ops pnv_pci_ops = {

static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
unsigned long uaddr, enum dma_data_direction direction,
- struct dma_attrs *attrs)
+ struct dma_attrs *attrs, bool rm)
{
u64 proto_tce;
u64 *tcep, *tces;
@@ -358,12 +358,19 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
* of flags if that becomes the case
*/
if (tbl->it_type & TCE_PCI_SWINV_CREATE)
- pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1);
+ pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);

return 0;
}

-static void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
+static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long npages,
+ unsigned long uaddr, enum dma_data_direction direction,
+ struct dma_attrs *attrs)
+{
+ return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, false);
+}
+
+static void pnv_tce_free(struct iommu_table *tbl, long index, long npages, bool rm)
{
u64 *tcep, *tces;

@@ -373,7 +380,12 @@ static void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
*(tcep++) = 0;

if (tbl->it_type & TCE_PCI_SWINV_FREE)
- pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1);
+ pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
+}
+
+static void pnv_tce_free_vm(struct iommu_table *tbl, long index, long npages)
+{
+ pnv_tce_free(tbl, index, npages, false);
}

static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
@@ -381,6 +393,18 @@ static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
return ((u64 *)tbl->it_base)[index - tbl->it_offset];
}

+static int pnv_tce_build_rm(struct iommu_table *tbl, long index, long npages,
+ unsigned long uaddr, enum dma_data_direction direction,
+ struct dma_attrs *attrs)
+{
+ return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, true);
+}
+
+static void pnv_tce_free_rm(struct iommu_table *tbl, long index, long npages)
+{
+ pnv_tce_free(tbl, index, npages, true);
+}
+
void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
void *tce_mem, u64 tce_size,
u64 dma_offset)
@@ -545,8 +569,10 @@ void __init pnv_pci_init(void)

/* Configure IOMMU DMA hooks */
ppc_md.pci_dma_dev_setup = pnv_pci_dma_dev_setup;
- ppc_md.tce_build = pnv_tce_build;
- ppc_md.tce_free = pnv_tce_free;
+ ppc_md.tce_build = pnv_tce_build_vm;
+ ppc_md.tce_free = pnv_tce_free_vm;
+ ppc_md.tce_build_rm = pnv_tce_build_rm;
+ ppc_md.tce_free_rm = pnv_tce_free_rm;
ppc_md.tce_get = pnv_tce_get;
ppc_md.pci_probe_mode = pnv_pci_probe_mode;
set_pci_dma_ops(&dma_iommu_ops);
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 25d76c4..7ea82c1 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -52,6 +52,7 @@ struct pnv_ioda_pe {
int tce32_seg;
int tce32_segcount;
struct iommu_table tce32_table;
+ phys_addr_t it_index_rm;

/* XXX TODO: Add support for additional 64-bit iommus */

@@ -158,6 +159,6 @@ extern void pnv_pci_init_p5ioc2_hub(struct device_node *np);
extern void pnv_pci_init_ioda_hub(struct device_node *np);
extern void pnv_pci_init_ioda2_phb(struct device_node *np);
extern void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
- u64 *startp, u64 *endp);
+ u64 *startp, u64 *endp, bool rm);

#endif /* __POWERNV_PCI_H */
--
1.8.3.2

2013-07-08 07:21:12

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: add real mode support for dma operations on powernv

On Mon, 2013-07-08 at 14:44 +1000, Alexey Kardashevskiy wrote:

> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 25d76c4..7ea82c1 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -52,6 +52,7 @@ struct pnv_ioda_pe {
> int tce32_seg;
> int tce32_segcount;
> struct iommu_table tce32_table;
> + phys_addr_t it_index_rm;

Please ....

The fact that we hijack the it_index field of the iommu table
for the virtual address is bad enough, but really don't need
to perpetuate this :-)

Call the field something decent such as "tce_inval_reg_phys"

Cheers,
Ben.

2013-07-08 07:31:58

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: add real mode support for dma operations on powernv

On 07/08/2013 05:20 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2013-07-08 at 14:44 +1000, Alexey Kardashevskiy wrote:
>
>> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>> index 25d76c4..7ea82c1 100644
>> --- a/arch/powerpc/platforms/powernv/pci.h
>> +++ b/arch/powerpc/platforms/powernv/pci.h
>> @@ -52,6 +52,7 @@ struct pnv_ioda_pe {
>> int tce32_seg;
>> int tce32_segcount;
>> struct iommu_table tce32_table;
>> + phys_addr_t it_index_rm;
>
> Please ....
>
> The fact that we hijack the it_index field of the iommu table
> for the virtual address is bad enough, but really don't need
> to perpetuate this :-)
>
> Call the field something decent such as "tce_inval_reg_phys"


Yes we can. I just find it veeeeeeery attractive when I can grep "\<it_"
and get all users of iommu_table.

btw is phys_addr_t correct here?


--
Alexey

2013-07-08 07:40:33

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v2] powerpc: add real mode support for dma operations on powernv

On Mon, 2013-07-08 at 17:31 +1000, Alexey Kardashevskiy wrote:

> btw is phys_addr_t correct here?

Yes.

Cheers,
Ben.

2013-07-08 21:53:06

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH 3/8] vfio: add external user support

On Sun, 2013-07-07 at 01:07 +1000, Alexey Kardashevskiy wrote:
> VFIO is designed to be used via ioctls on file descriptors
> returned by VFIO.
>
> However in some situations support for an external user is required.
> The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
> use the existing VFIO groups for exclusive access in real/virtual mode
> on a host to avoid passing map/unmap requests to the user space which
> would made things pretty slow.
>
> The proposed protocol includes:
>
> 1. do normal VFIO init stuff such as opening a new container, attaching
> group(s) to it, setting an IOMMU driver for a container. When IOMMU is
> set for a container, all groups in it are considered ready to use by
> an external user.
>
> 2. pass a fd of the group we want to accelerate to KVM. KVM calls
> vfio_group_get_external_user() to verify if the group is initialized,
> IOMMU is set for it and increment the container user counter to prevent
> the VFIO group from disposal prior to KVM exit.
> The current TCE IOMMU driver marks the whole IOMMU table as busy when
> IOMMU is set for a container what prevents other DMA users from
> allocating from it so it is safe to grant user space access to it.
>
> 3. KVM calls vfio_external_user_iommu_id() to obtian an IOMMU ID which
> KVM uses to get an iommu_group struct for later use.
>
> 4. When KVM is finished, it calls vfio_group_put_external_user() to
> release the VFIO group by decrementing the container user counter.
> Everything gets released.
>
> The "vfio: Limit group opens" patch is also required for the consistency.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index c488da5..57aa191 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1370,6 +1370,62 @@ static const struct file_operations vfio_device_fops = {
> };
>
> /**
> + * External user API, exported by symbols to be linked dynamically.
> + *
> + * The protocol includes:
> + * 1. do normal VFIO init operation:
> + * - opening a new container;
> + * - attaching group(s) to it;
> + * - setting an IOMMU driver for a container.
> + * When IOMMU is set for a container, all groups in it are
> + * considered ready to use by an external user.
> + *
> + * 2. The user space passed a group fd which we want to accelerate in
> + * KVM. KVM uses vfio_group_get_external_user() to verify that:
> + * - the group is initialized;
> + * - IOMMU is set for it.
> + * Then vfio_group_get_external_user() increments the container user
> + * counter to prevent the VFIO group from disposal prior to KVM exit.
> + *
> + * 3. KVM calls vfio_external_user_iommu_id() to know an IOMMU ID which
> + * KVM uses to get an iommu_group struct for later use.
> + *
> + * 4. When KVM is finished, it calls vfio_group_put_external_user() to
> + * release the VFIO group by decrementing the container user counter.

nit, the interface is for any external user, not just kvm.

> + */
> +struct vfio_group *vfio_group_get_external_user(struct file *filep)
> +{
> + struct vfio_group *group = filep->private_data;
> +
> + if (filep->f_op != &vfio_group_fops)
> + return NULL;

ERR_PTR(-EINVAL)

There also needs to be a vfio_group_get(group) here and put in error
cases.

> +
> + if (!atomic_inc_not_zero(&group->container_users))
> + return NULL;

ERR_PTR(-EINVAL)

> +
> + if (!group->container->iommu_driver ||
> + !vfio_group_viable(group)) {
> + atomic_dec(&group->container_users);
> + return NULL;

ERR_PTR(-EINVAL)

> + }
> +
> + return group;
> +}
> +EXPORT_SYMBOL_GPL(vfio_group_get_external_user);
> +
> +void vfio_group_put_external_user(struct vfio_group *group)
> +{
> + vfio_group_try_dissolve_container(group);

And a vfio_group_put(group) here

> +}
> +EXPORT_SYMBOL_GPL(vfio_group_put_external_user);
> +
> +int vfio_external_user_iommu_id(struct vfio_group *group)
> +{
> + return iommu_group_id(group->iommu_group);
> +}
> +EXPORT_SYMBOL_GPL(vfio_external_user_iommu_id);
> +
> +/**
> * Module/class support
> */
> static char *vfio_devnode(struct device *dev, umode_t *mode)
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index ac8d488..24579a0 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -90,4 +90,11 @@ extern void vfio_unregister_iommu_driver(
> TYPE tmp; \
> offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); }) \
>
> +/*
> + * External user API
> + */
> +extern struct vfio_group *vfio_group_get_external_user(struct file *filep);
> +extern void vfio_group_put_external_user(struct vfio_group *group);
> +extern int vfio_external_user_iommu_id(struct vfio_group *group);
> +
> #endif /* VFIO_H */
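
Folding the comments above together, the two entry points would end up looking roughly like this; only a sketch, and it assumes vfio_group_get()/vfio_group_put() are the existing internal reference helpers in drivers/vfio/vfio.c:

	struct vfio_group *vfio_group_get_external_user(struct file *filep)
	{
		struct vfio_group *group = filep->private_data;

		if (filep->f_op != &vfio_group_fops)
			return ERR_PTR(-EINVAL);

		vfio_group_get(group);	/* hold a group reference for the external user */

		if (!atomic_inc_not_zero(&group->container_users)) {
			vfio_group_put(group);
			return ERR_PTR(-EINVAL);
		}

		if (!group->container->iommu_driver || !vfio_group_viable(group)) {
			atomic_dec(&group->container_users);
			vfio_group_put(group);
			return ERR_PTR(-EINVAL);
		}

		return group;
	}

	void vfio_group_put_external_user(struct vfio_group *group)
	{
		vfio_group_try_dissolve_container(group);
		vfio_group_put(group);	/* drop the reference taken at get time */
	}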


2013-07-09 05:40:21

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH 3/8] vfio: add external user support

On 07/09/2013 07:52 AM, Alex Williamson wrote:
> On Sun, 2013-07-07 at 01:07 +1000, Alexey Kardashevskiy wrote:
>> VFIO is designed to be used via ioctls on file descriptors
>> returned by VFIO.
>>
>> However in some situations support for an external user is required.
>> The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
>> use the existing VFIO groups for exclusive access in real/virtual mode
>> on a host to avoid passing map/unmap requests to the user space which
>> would made things pretty slow.
>>
>> The proposed protocol includes:
>>
>> 1. do normal VFIO init stuff such as opening a new container, attaching
>> group(s) to it, setting an IOMMU driver for a container. When IOMMU is
>> set for a container, all groups in it are considered ready to use by
>> an external user.
>>
>> 2. pass a fd of the group we want to accelerate to KVM. KVM calls
>> vfio_group_get_external_user() to verify if the group is initialized,
>> IOMMU is set for it and increment the container user counter to prevent
>> the VFIO group from disposal prior to KVM exit.
>> The current TCE IOMMU driver marks the whole IOMMU table as busy when
>> IOMMU is set for a container what prevents other DMA users from
>> allocating from it so it is safe to grant user space access to it.
>>
>> 3. KVM calls vfio_external_user_iommu_id() to obtian an IOMMU ID which
>> KVM uses to get an iommu_group struct for later use.
>>
>> 4. When KVM is finished, it calls vfio_group_put_external_user() to
>> release the VFIO group by decrementing the container user counter.
>> Everything gets released.
>>
>> The "vfio: Limit group opens" patch is also required for the consistency.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>> index c488da5..57aa191 100644
>> --- a/drivers/vfio/vfio.c
>> +++ b/drivers/vfio/vfio.c
>> @@ -1370,6 +1370,62 @@ static const struct file_operations vfio_device_fops = {
>> };
>>
>> /**
>> + * External user API, exported by symbols to be linked dynamically.
>> + *
>> + * The protocol includes:
>> + * 1. do normal VFIO init operation:
>> + * - opening a new container;
>> + * - attaching group(s) to it;
>> + * - setting an IOMMU driver for a container.
>> + * When IOMMU is set for a container, all groups in it are
>> + * considered ready to use by an external user.
>> + *
>> + * 2. The user space passed a group fd which we want to accelerate in
>> + * KVM. KVM uses vfio_group_get_external_user() to verify that:
>> + * - the group is initialized;
>> + * - IOMMU is set for it.
>> + * Then vfio_group_get_external_user() increments the container user
>> + * counter to prevent the VFIO group from disposal prior to KVM exit.
>> + *
>> + * 3. KVM calls vfio_external_user_iommu_id() to know an IOMMU ID which
>> + * KVM uses to get an iommu_group struct for later use.
>> + *
>> + * 4. When KVM is finished, it calls vfio_group_put_external_user() to
>> + * release the VFIO group by decrementing the container user counter.
>
> nit, the interface is for any external user, not just kvm.

s/KVM/An external user/ ?
Or add "the description below uses KVM just as an example of an external user"?


>> + */
>> +struct vfio_group *vfio_group_get_external_user(struct file *filep)
>> +{
>> + struct vfio_group *group = filep->private_data;
>> +
>> + if (filep->f_op != &vfio_group_fops)
>> + return NULL;
>
> ERR_PTR(-EINVAL)
>
> There also needs to be a vfio_group_get(group) here and put in error
> cases.


Is that because I do not hold a reference to the file anymore?


>> +
>> + if (!atomic_inc_not_zero(&group->container_users))
>> + return NULL;
>
> ERR_PTR(-EINVAL)
>
>> +
>> + if (!group->container->iommu_driver ||
>> + !vfio_group_viable(group)) {
>> + atomic_dec(&group->container_users);
>> + return NULL;
>
> ERR_PTR(-EINVAL)
>
>> + }
>> +
>> + return group;
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_group_get_external_user);
>> +
>> +void vfio_group_put_external_user(struct vfio_group *group)
>> +{
>> + vfio_group_try_dissolve_container(group);
>
> And a vfio_group_put(group) here
>
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_group_put_external_user);
>> +
>> +int vfio_external_user_iommu_id(struct vfio_group *group)
>> +{
>> + return iommu_group_id(group->iommu_group);
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_external_user_iommu_id);
>> +
>> +/**
>> * Module/class support
>> */
>> static char *vfio_devnode(struct device *dev, umode_t *mode)
>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>> index ac8d488..24579a0 100644
>> --- a/include/linux/vfio.h
>> +++ b/include/linux/vfio.h
>> @@ -90,4 +90,11 @@ extern void vfio_unregister_iommu_driver(
>> TYPE tmp; \
>> offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); }) \
>>
>> +/*
>> + * External user API
>> + */
>> +extern struct vfio_group *vfio_group_get_external_user(struct file *filep);
>> +extern void vfio_group_put_external_user(struct vfio_group *group);
>> +extern int vfio_external_user_iommu_id(struct vfio_group *group);
>> +
>> #endif /* VFIO_H */
>
>
>


--
Alexey

2013-07-09 14:08:18

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH 3/8] vfio: add external user support

On Tue, 2013-07-09 at 15:40 +1000, Alexey Kardashevskiy wrote:
> On 07/09/2013 07:52 AM, Alex Williamson wrote:
> > On Sun, 2013-07-07 at 01:07 +1000, Alexey Kardashevskiy wrote:
> >> VFIO is designed to be used via ioctls on file descriptors
> >> returned by VFIO.
> >>
> >> However in some situations support for an external user is required.
> >> The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
> >> use the existing VFIO groups for exclusive access in real/virtual mode
> >> on a host to avoid passing map/unmap requests to the user space which
> >> would made things pretty slow.
> >>
> >> The proposed protocol includes:
> >>
> >> 1. do normal VFIO init stuff such as opening a new container, attaching
> >> group(s) to it, setting an IOMMU driver for a container. When IOMMU is
> >> set for a container, all groups in it are considered ready to use by
> >> an external user.
> >>
> >> 2. pass a fd of the group we want to accelerate to KVM. KVM calls
> >> vfio_group_get_external_user() to verify if the group is initialized,
> >> IOMMU is set for it and increment the container user counter to prevent
> >> the VFIO group from disposal prior to KVM exit.
> >> The current TCE IOMMU driver marks the whole IOMMU table as busy when
> >> IOMMU is set for a container which prevents other DMA users from
> >> allocating from it so it is safe to grant user space access to it.
> >>
> >> 3. KVM calls vfio_external_user_iommu_id() to obtain an IOMMU ID which
> >> KVM uses to get an iommu_group struct for later use.
> >>
> >> 4. When KVM is finished, it calls vfio_group_put_external_user() to
> >> release the VFIO group by decrementing the container user counter.
> >> Everything gets released.
> >>
> >> The "vfio: Limit group opens" patch is also required for the consistency.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> >> ---
> >> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >> index c488da5..57aa191 100644
> >> --- a/drivers/vfio/vfio.c
> >> +++ b/drivers/vfio/vfio.c
> >> @@ -1370,6 +1370,62 @@ static const struct file_operations vfio_device_fops = {
> >> };
> >>
> >> /**
> >> + * External user API, exported by symbols to be linked dynamically.
> >> + *
> >> + * The protocol includes:
> >> + * 1. do normal VFIO init operation:
> >> + * - opening a new container;
> >> + * - attaching group(s) to it;
> >> + * - setting an IOMMU driver for a container.
> >> + * When IOMMU is set for a container, all groups in it are
> >> + * considered ready to use by an external user.
> >> + *
> >> + * 2. The user space passed a group fd which we want to accelerate in
> >> + * KVM. KVM uses vfio_group_get_external_user() to verify that:
> >> + * - the group is initialized;
> >> + * - IOMMU is set for it.
> >> + * Then vfio_group_get_external_user() increments the container user
> >> + * counter to prevent the VFIO group from disposal prior to KVM exit.
> >> + *
> >> + * 3. KVM calls vfio_external_user_iommu_id() to know an IOMMU ID which
> >> + * KVM uses to get an iommu_group struct for later use.
> >> + *
> >> + * 4. When KVM is finished, it calls vfio_group_put_external_user() to
> >> + * release the VFIO group by decrementing the container user counter.
> >
> > nit, the interface is for any external user, not just kvm.
>
> s/KVM/An external user/ ?
> Or add "the description below uses KVM just as an example of an external user"?

Give a generic API description, KVM is just an example.

> >> + */
> >> +struct vfio_group *vfio_group_get_external_user(struct file *filep)
> >> +{
> >> + struct vfio_group *group = filep->private_data;
> >> +
> >> + if (filep->f_op != &vfio_group_fops)
> >> + return NULL;
> >
> > ERR_PTR(-EINVAL)
> >
> > There also needs to be a vfio_group_get(group) here and put in error
> > cases.
>
>
> Is that because I do not hold a reference to the file anymore?

We were debating whether it was needed even with the file reference
because we weren't sure that we wanted to trust the user to hold the
reference. Since we're now passing an object, we absolutely must
increase the reference count on the object for this user. Thanks,

Alex
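
Put together with the ERR_PTR() comments above, the two entry points would
end up looking roughly like this (sketch only; it assumes the existing
vfio_group_get()/vfio_group_put() helpers in drivers/vfio/vfio.c and
ERR_PTR() from <linux/err.h>):

struct vfio_group *vfio_group_get_external_user(struct file *filep)
{
	struct vfio_group *group = filep->private_data;

	if (filep->f_op != &vfio_group_fops)
		return ERR_PTR(-EINVAL);

	/* Hold a reference on the group object itself, not just the file */
	vfio_group_get(group);

	if (!atomic_inc_not_zero(&group->container_users)) {
		vfio_group_put(group);
		return ERR_PTR(-EINVAL);
	}

	if (!group->container->iommu_driver || !vfio_group_viable(group)) {
		atomic_dec(&group->container_users);
		vfio_group_put(group);
		return ERR_PTR(-EINVAL);
	}

	return group;
}

void vfio_group_put_external_user(struct vfio_group *group)
{
	vfio_group_try_dissolve_container(group);
	vfio_group_put(group);
}

Callers would then check IS_ERR() on the returned pointer instead of NULL.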

> >> +
> >> + if (!atomic_inc_not_zero(&group->container_users))
> >> + return NULL;
> >
> > ERR_PTR(-EINVAL)
> >
> >> +
> >> + if (!group->container->iommu_driver ||
> >> + !vfio_group_viable(group)) {
> >> + atomic_dec(&group->container_users);
> >> + return NULL;
> >
> > ERR_PTR(-EINVAL)
> >
> >> + }
> >> +
> >> + return group;
> >> +}
> >> +EXPORT_SYMBOL_GPL(vfio_group_get_external_user);
> >> +
> >> +void vfio_group_put_external_user(struct vfio_group *group)
> >> +{
> >> + vfio_group_try_dissolve_container(group);
> >
> > And a vfio_group_put(group) here
> >
> >> +}
> >> +EXPORT_SYMBOL_GPL(vfio_group_put_external_user);
> >> +
> >> +int vfio_external_user_iommu_id(struct vfio_group *group)
> >> +{
> >> + return iommu_group_id(group->iommu_group);
> >> +}
> >> +EXPORT_SYMBOL_GPL(vfio_external_user_iommu_id);
> >> +
> >> +/**
> >> * Module/class support
> >> */
> >> static char *vfio_devnode(struct device *dev, umode_t *mode)
> >> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >> index ac8d488..24579a0 100644
> >> --- a/include/linux/vfio.h
> >> +++ b/include/linux/vfio.h
> >> @@ -90,4 +90,11 @@ extern void vfio_unregister_iommu_driver(
> >> TYPE tmp; \
> >> offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); }) \
> >>
> >> +/*
> >> + * External user API
> >> + */
> >> +extern struct vfio_group *vfio_group_get_external_user(struct file *filep);
> >> +extern void vfio_group_put_external_user(struct vfio_group *group);
> >> +extern int vfio_external_user_iommu_id(struct vfio_group *group);
> >> +
> >> #endif /* VFIO_H */
> >
> >
> >
>
>


2013-07-09 15:54:52

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 4/8] powerpc: Prepare to support kernel handling of IOMMU map/unmap

On 07/08/2013 03:33 AM, Benjamin Herrenschmidt wrote:
> On Sun, 2013-07-07 at 01:07 +1000, Alexey Kardashevskiy wrote:
>> The current VFIO-on-POWER implementation supports only user mode
>> driven mapping, i.e. QEMU is sending requests to map/unmap pages.
>> However this approach is really slow, so we want to move that to KVM.
>> Since H_PUT_TCE can be extremely performance sensitive (especially with
>> network adapters where each packet needs to be mapped/unmapped) we chose
>> to implement that as a "fast" hypercall directly in "real
>> mode" (processor still in the guest context but MMU off).
>>
>> To be able to do that, we need to provide some facilities to
>> access the struct page count within that real mode environment as things
>> like the sparsemem vmemmap mappings aren't accessible.
>>
>> This adds an API to increment/decrement page counter as
>> get_user_pages API used for user mode mapping does not work
>> in the real mode.
>>
>> CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.
> This patch will need an ack from "mm" people to make sure they are ok
> with our approach and ack the change to the generic header.
>
> (Added linux-mm).
>
> Cheers,
> Ben.
>
>> Reviewed-by: Paul Mackerras<[email protected]>
>> Signed-off-by: Paul Mackerras<[email protected]>
>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>
>> ---
>>
>> Changes:
>> 2013/06/27:
>> * realmode_get_page() fixed to use get_page_unless_zero(). If failed,
>> the call will be passed from real to virtual mode and safely handled.
>> * added comment to PageCompound() in include/linux/page-flags.h.
>>
>> 2013/05/20:
>> * PageTail() is replaced by PageCompound() in order to have the same checks
>> for whether the page is huge in realmode_get_page() and realmode_put_page()
>>
>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>> ---
>> arch/powerpc/include/asm/pgtable-ppc64.h | 4 ++
>> arch/powerpc/mm/init_64.c | 78 +++++++++++++++++++++++++++++++-
>> include/linux/page-flags.h | 4 +-
>> 3 files changed, 84 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
>> index e3d55f6f..7b46e5f 100644
>> --- a/arch/powerpc/include/asm/pgtable-ppc64.h
>> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
>> @@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
>> }
>> #endif /* !CONFIG_HUGETLB_PAGE */
>>
>> +struct page *realmode_pfn_to_page(unsigned long pfn);
>> +int realmode_get_page(struct page *page);
>> +int realmode_put_page(struct page *page);
>> +
>> #endif /* __ASSEMBLY__ */
>>
>> #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
>> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
>> index a90b9c4..7031be3 100644
>> --- a/arch/powerpc/mm/init_64.c
>> +++ b/arch/powerpc/mm/init_64.c
>> @@ -297,5 +297,81 @@ void vmemmap_free(unsigned long start, unsigned long end)
>> {
>> }
>>
>> -#endif /* CONFIG_SPARSEMEM_VMEMMAP */
>> +/*
>> + * We do not have access to the sparsemem vmemmap, so we fallback to
>> + * walking the list of sparsemem blocks which we already maintain for
>> + * the sake of crashdump. In the long run, we might want to maintain
>> + * a tree if performance of that linear walk becomes a problem.
>> + *
>> + * Any of realmode_XXXX functions can fail due to:
>> + * 1) As real sparsemem blocks do not lay in RAM continously (they
>> + * are in virtual address space which is not available in the real mode),
>> + * the requested page struct can be split between blocks so get_page/put_page
>> + * may fail.
>> + * 2) When huge pages are used, the get_page/put_page API will fail
>> + * in real mode as the linked addresses in the page struct are virtual
>> + * too.
>> + * When 1) or 2) takes place, the API returns an error code to cause
>> + * an exit to kernel virtual mode where the operation will be completed.

I don't see where these functions enter kernel virtual mode. I think
it's best to just remove the last sentence. It doesn't belong here.


Alex

>> + */
>> +struct page *realmode_pfn_to_page(unsigned long pfn)
>> +{
>> + struct vmemmap_backing *vmem_back;
>> + struct page *page;
>> + unsigned long page_size = 1<< mmu_psize_defs[mmu_vmemmap_psize].shift;
>> + unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
>>
>> + for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
>> + if (pg_va< vmem_back->virt_addr)
>> + continue;
>> +
>> + /* Check that page struct is not split between real pages */
>> + if ((pg_va + sizeof(struct page))>
>> + (vmem_back->virt_addr + page_size))
>> + return NULL;
>> +
>> + page = (struct page *) (vmem_back->phys + pg_va -
>> + vmem_back->virt_addr);
>> + return page;
>> + }
>> +
>> + return NULL;
>> +}
>> +EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
>> +
>> +#elif defined(CONFIG_FLATMEM)
>> +
>> +struct page *realmode_pfn_to_page(unsigned long pfn)
>> +{
>> + struct page *page = pfn_to_page(pfn);
>> + return page;
>> +}
>> +EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
>> +
>> +#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
>> +
>> +#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
>> +int realmode_get_page(struct page *page)
>> +{
>> + if (PageCompound(page))
>> + return -EAGAIN;
>> +
>> + if (!get_page_unless_zero(page))
>> + return -EAGAIN;
>> +
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(realmode_get_page);
>> +
>> +int realmode_put_page(struct page *page)
>> +{
>> + if (PageCompound(page))
>> + return -EAGAIN;
>> +
>> + if (!atomic_add_unless(&page->_count, -1, 1))
>> + return -EAGAIN;
>> +
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(realmode_put_page);
>> +#endif
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index 6d53675..98ada58 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -329,7 +329,9 @@ static inline void set_page_writeback(struct page *page)
>> * System with lots of page flags available. This allows separate
>> * flags for PageHead() and PageTail() checks of compound pages so that bit
>> * tests can be used in performance sensitive paths. PageCompound is
>> - * generally not used in hot code paths.
>> + * generally not used in hot code paths except arch/powerpc/mm/init_64.c
>> + * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages
>> + * and avoid handling those in real mode.
>> */
>> __PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
>> __PAGEFLAG(Tail, tail)
>

2013-07-09 16:02:44

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 5/8] powerpc: add real mode support for dma operations on powernv

On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
> The existing TCE machine calls (tce_build and tce_free) only support
> virtual mode as they call __raw_writeq for TCE invalidation, which
> fails in real mode.
>
> This introduces tce_build_rm and tce_free_rm real mode versions
> which do mostly the same but use "Store Doubleword Caching Inhibited
> Indexed" instruction for TCE invalidation.

So would always using stdcix have any bad side effects?


Alex

>
> This new feature is going to be utilized by real mode support of VFIO.
>
> Signed-off-by: Alexey Kardashevskiy<[email protected]>
> ---
> arch/powerpc/include/asm/machdep.h | 12 ++++++++++
> arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++++++++++++++------
> arch/powerpc/platforms/powernv/pci.c | 38 ++++++++++++++++++++++++++-----
> arch/powerpc/platforms/powernv/pci.h | 2 +-
> 4 files changed, 64 insertions(+), 14 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
> index 92386fc..0c19eef 100644
> --- a/arch/powerpc/include/asm/machdep.h
> +++ b/arch/powerpc/include/asm/machdep.h
> @@ -75,6 +75,18 @@ struct machdep_calls {
> long index);
> void (*tce_flush)(struct iommu_table *tbl);
>
> + /* _rm versions are for real mode use only */
> + int (*tce_build_rm)(struct iommu_table *tbl,
> + long index,
> + long npages,
> + unsigned long uaddr,
> + enum dma_data_direction direction,
> + struct dma_attrs *attrs);
> + void (*tce_free_rm)(struct iommu_table *tbl,
> + long index,
> + long npages);
> + void (*tce_flush_rm)(struct iommu_table *tbl);
> +
> void __iomem * (*ioremap)(phys_addr_t addr, unsigned long size,
> unsigned long flags, void *caller);
> void (*iounmap)(volatile void __iomem *token);
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 2931d97..2797dec 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -68,6 +68,12 @@ define_pe_printk_level(pe_err, KERN_ERR);
> define_pe_printk_level(pe_warn, KERN_WARNING);
> define_pe_printk_level(pe_info, KERN_INFO);
>
> +static inline void rm_writed(unsigned long paddr, u64 val)
> +{
> + __asm__ __volatile__("sync; stdcix %0,0,%1"
> + : : "r" (val), "r" (paddr) : "memory");
> +}
> +
> static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
> {
> unsigned long pe;
> @@ -442,7 +448,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
> }
>
> static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
> - u64 *startp, u64 *endp)
> + u64 *startp, u64 *endp, bool rm)
> {
> u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
> unsigned long start, end, inc;
> @@ -471,7 +477,10 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
>
> mb(); /* Ensure above stores are visible */
> while (start<= end) {
> - __raw_writeq(start, invalidate);
> + if (rm)
> + rm_writed((unsigned long) invalidate, start);
> + else
> + __raw_writeq(start, invalidate);
> start += inc;
> }
>
> @@ -483,7 +492,7 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
>
> static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
> struct iommu_table *tbl,
> - u64 *startp, u64 *endp)
> + u64 *startp, u64 *endp, bool rm)
> {
> unsigned long start, end, inc;
> u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
> @@ -502,22 +511,25 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
> mb();
>
> while (start<= end) {
> - __raw_writeq(start, invalidate);
> + if (rm)
> + rm_writed((unsigned long) invalidate, start);
> + else
> + __raw_writeq(start, invalidate);
> start += inc;
> }
> }
>
> void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
> - u64 *startp, u64 *endp)
> + u64 *startp, u64 *endp, bool rm)
> {
> struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
> tce32_table);
> struct pnv_phb *phb = pe->phb;
>
> if (phb->type == PNV_PHB_IODA1)
> - pnv_pci_ioda1_tce_invalidate(tbl, startp, endp);
> + pnv_pci_ioda1_tce_invalidate(tbl, startp, endp, rm);
> else
> - pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp);
> + pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
> }
>
> static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index e16b729..280f614 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -336,7 +336,7 @@ struct pci_ops pnv_pci_ops = {
>
> static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> unsigned long uaddr, enum dma_data_direction direction,
> - struct dma_attrs *attrs)
> + struct dma_attrs *attrs, bool rm)
> {
> u64 proto_tce;
> u64 *tcep, *tces;
> @@ -358,12 +358,19 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> * of flags if that becomes the case
> */
> if (tbl->it_type& TCE_PCI_SWINV_CREATE)
> - pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1);
> + pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
>
> return 0;
> }
>
> -static void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
> +static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long npages,
> + unsigned long uaddr, enum dma_data_direction direction,
> + struct dma_attrs *attrs)
> +{
> + return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, false);
> +}
> +
> +static void pnv_tce_free(struct iommu_table *tbl, long index, long npages, bool rm)
> {
> u64 *tcep, *tces;
>
> @@ -373,7 +380,12 @@ static void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
> *(tcep++) = 0;
>
> if (tbl->it_type& TCE_PCI_SWINV_FREE)
> - pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1);
> + pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
> +}
> +
> +static void pnv_tce_free_vm(struct iommu_table *tbl, long index, long npages)
> +{
> + pnv_tce_free(tbl, index, npages, false);
> }
>
> static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
> @@ -381,6 +393,18 @@ static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
> return ((u64 *)tbl->it_base)[index - tbl->it_offset];
> }
>
> +static int pnv_tce_build_rm(struct iommu_table *tbl, long index, long npages,
> + unsigned long uaddr, enum dma_data_direction direction,
> + struct dma_attrs *attrs)
> +{
> + return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, true);
> +}
> +
> +static void pnv_tce_free_rm(struct iommu_table *tbl, long index, long npages)
> +{
> + pnv_tce_free(tbl, index, npages, true);
> +}
> +
> void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
> void *tce_mem, u64 tce_size,
> u64 dma_offset)
> @@ -545,8 +569,10 @@ void __init pnv_pci_init(void)
>
> /* Configure IOMMU DMA hooks */
> ppc_md.pci_dma_dev_setup = pnv_pci_dma_dev_setup;
> - ppc_md.tce_build = pnv_tce_build;
> - ppc_md.tce_free = pnv_tce_free;
> + ppc_md.tce_build = pnv_tce_build_vm;
> + ppc_md.tce_free = pnv_tce_free_vm;
> + ppc_md.tce_build_rm = pnv_tce_build_rm;
> + ppc_md.tce_free_rm = pnv_tce_free_rm;
> ppc_md.tce_get = pnv_tce_get;
> ppc_md.pci_probe_mode = pnv_pci_probe_mode;
> set_pci_dma_ops(&dma_iommu_ops);
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 25d76c4..6799374 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -158,6 +158,6 @@ extern void pnv_pci_init_p5ioc2_hub(struct device_node *np);
> extern void pnv_pci_init_ioda_hub(struct device_node *np);
> extern void pnv_pci_init_ioda2_phb(struct device_node *np);
> extern void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
> - u64 *startp, u64 *endp);
> + u64 *startp, u64 *endp, bool rm);
>
> #endif /* __POWERNV_PCI_H */

2013-07-09 17:02:43

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
> devices or emulated PCI. These calls allow adding multiple entries
> (up to 512) into the TCE table in one call which saves time on
> transition to/from real mode.

We don't mention QEMU explicitly in KVM code usually.

> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
> (copied from user and verified) before writing the whole list into
> the TCE table. This cache will be utilized more in the upcoming
> VFIO/IOMMU support to continue TCE list processing in the virtual
> mode in the case if the real mode handler failed for some reason.
>
> This adds a guest physical to host real address converter
> and calls the existing H_PUT_TCE handler. The converting function
> is going to be fully utilized by upcoming VFIO supporting patches.
>
> This also implements the KVM_CAP_PPC_MULTITCE capability,
> so in order to support the functionality of this patch, QEMU
> needs to query for this capability and set the "hcall-multi-tce"
> hypertas property only if the capability is present, otherwise
> there will be serious performance degradation.

Same as above. But really you're only giving recommendations here.
What's the point? Please describe what the benefit of this patch is, not
what some other random subsystem might do with the benefits it brings.

>
> Signed-off-by: Paul Mackerras<[email protected]>
> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>
> ---
> Changelog:
> 2013/07/06:
> * fixed number of wrong get_page()/put_page() calls
>
> 2013/06/27:
> * fixed clear of BUSY bit in kvmppc_lookup_pte()
> * H_PUT_TCE_INDIRECT does realmode_get_page() now
> * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
> * updated doc
>
> 2013/06/05:
> * fixed mistype about IBMVIO in the commit message
> * updated doc and moved it to another section
> * changed capability number
>
> 2013/05/21:
> * added kvm_vcpu_arch::tce_tmp
> * removed cleanup if put_indirect failed, instead we do not even start
> writing to TCE table if we cannot get TCEs from the user and they are
> invalid
> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
> and kvmppc_emulated_validate_tce (for the previous item)
> * fixed bug with failthrough for H_IPI
> * removed all get_user() from real mode handlers
> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>
> Signed-off-by: Alexey Kardashevskiy<[email protected]>
> ---
> Documentation/virtual/kvm/api.txt | 25 +++
> arch/powerpc/include/asm/kvm_host.h | 9 ++
> arch/powerpc/include/asm/kvm_ppc.h | 16 +-
> arch/powerpc/kvm/book3s_64_vio.c | 154 ++++++++++++++++++-
> arch/powerpc/kvm/book3s_64_vio_hv.c | 260 ++++++++++++++++++++++++++++----
> arch/powerpc/kvm/book3s_hv.c | 41 ++++-
> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++-
> arch/powerpc/kvm/powerpc.c | 3 +
> 9 files changed, 517 insertions(+), 34 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 6365fef..762c703 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed to userspace to be
> handled.
>
>
> +4.86 KVM_CAP_PPC_MULTITCE
> +
> +Capability: KVM_CAP_PPC_MULTITCE
> +Architectures: ppc
> +Type: vm
> +
> +This capability means the kernel is capable of handling hypercalls
> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
> +space. This significanly accelerates DMA operations for PPC KVM guests.

significanly? Please run this through a spell checker.

> +The user space should expect that its handlers for these hypercalls

s/The//

> +are not going to be called.

Is user space guaranteed they will not be called? Or can it still happen?

> +
> +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
> +the user space might have to advertise it for the guest. For example,
> +IBM pSeries guest starts using them if "hcall-multi-tce" is present in
> +the "ibm,hypertas-functions" device-tree property.

This paragraph describes sPAPR. That's fine, but please document it as
such. Also please check your grammar.

> +
> +Without this capability, only H_PUT_TCE is handled by the kernel and
> +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
> +unless the capability is present as passing hypercalls to the userspace
> +slows operations a lot.
> +
> +Unlike other capabilities of this section, this one is always enabled.

Why? Wouldn't that confuse older user space?

> +
> +
> 5. The kvm_run structure
> ------------------------
>
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index af326cd..20d04bd 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
> struct kvm *kvm;
> u64 liobn;
> u32 window_size;
> + struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;

You don't need this.

> struct page *pages[0];
> };
>
> @@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
> spinlock_t tbacct_lock;
> u64 busy_stolen;
> u64 busy_preempt;
> +
> + unsigned long *tce_tmp_hpas; /* TCE cache for TCE_PUT_INDIRECT hcall */
> + enum {
> + TCERM_NONE,
> + TCERM_GETPAGE,
> + TCERM_PUTTCE,
> + TCERM_PUTLIST,
> + } tce_rm_fail; /* failed stage of request processing */
> #endif
> };
>
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index a5287fe..fa722a0 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
>
> extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> struct kvm_create_spapr_tce *args);
> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> - unsigned long ioba, unsigned long tce);
> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
> + struct kvm_vcpu *vcpu, unsigned long liobn);
> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
> + unsigned long ioba, unsigned long tce);
> +extern long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce);
> +extern long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce_list, unsigned long npages);
> +extern long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce_value, unsigned long npages);
> extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
> struct kvm_allocate_rma *rma);
> extern struct kvmppc_linear_info *kvm_alloc_rma(void);
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index b2d3f3b..99bf4e5 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -14,6 +14,7 @@
> *
> * Copyright 2010 Paul Mackerras, IBM Corp.<[email protected]>
> * Copyright 2011 David Gibson, IBM Corporation<[email protected]>
> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<[email protected]>
> */
>
> #include<linux/types.h>
> @@ -36,8 +37,10 @@
> #include<asm/ppc-opcode.h>
> #include<asm/kvm_host.h>
> #include<asm/udbg.h>
> +#include<asm/iommu.h>
> +#include<asm/tce.h>
>
> -#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
> +#define ERROR_ADDR ((void *)~(unsigned long)0x0)
>
> static long kvmppc_stt_npages(unsigned long window_size)
> {
> @@ -50,6 +53,20 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
> struct kvm *kvm = stt->kvm;
> int i;
>
> +#define __SV(x) stt->stat.x
> +#define __SVD(x) (__SV(rm.x)?(__SV(rm.x)-__SV(vm.x)):0)
> + pr_debug("%s stat for liobn=%llx\n"
> + "--------------- realmode ----- virtmode ---\n"
> + "put_tce %10ld %10ld\n"
> + "put_tce_indir %10ld %10ld\n"
> + "stuff_tce %10ld %10ld\n",
> + __func__, stt->liobn,
> + __SVD(put), __SV(vm.put),
> + __SVD(indir), __SV(vm.indir),
> + __SVD(stuff), __SV(vm.stuff));
> +#undef __SVD
> +#undef __SV

All of these stat points should just be trace points. You can do the
statistic gathering from user space then.
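
For example, one event per hcall would do (sketch only, the event and field
names are made up; a real header also needs the usual TRACE_SYSTEM/include
guard boilerplate, and whether the real-mode paths can safely hit a
tracepoint would still need checking):

#include <linux/tracepoint.h>

TRACE_EVENT(kvm_spapr_put_tce,
	TP_PROTO(u64 liobn, unsigned long ioba, unsigned long tce, bool rm),
	TP_ARGS(liobn, ioba, tce, rm),
	TP_STRUCT__entry(
		__field(u64, liobn)
		__field(unsigned long, ioba)
		__field(unsigned long, tce)
		__field(bool, rm)
	),
	TP_fast_assign(
		__entry->liobn = liobn;
		__entry->ioba = ioba;
		__entry->tce = tce;
		__entry->rm = rm;
	),
	TP_printk("liobn=%llx ioba=%lx tce=%lx (%s)",
		__entry->liobn, __entry->ioba, __entry->tce,
		__entry->rm ? "rm" : "vm")
);

The handlers would then just call trace_kvm_spapr_put_tce(tt->liobn, ioba,
tce, ...) and perf/ftrace can count and split the events from user space.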

> +
> mutex_lock(&kvm->lock);
> list_del(&stt->list);
> for (i = 0; i< kvmppc_stt_npages(stt->window_size); i++)
> @@ -148,3 +165,138 @@ fail:
> }
> return ret;
> }
> +
> +/* Converts guest physical address to host virtual address */
> +static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,

Please don't distinguish _vm versions. They're the normal case. _rm ones
are the special ones.

> + unsigned long gpa, struct page **pg)
> +{
> + unsigned long hva, gfn = gpa>> PAGE_SHIFT;
> + struct kvm_memory_slot *memslot;
> +
> + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> + if (!memslot)
> + return ERROR_ADDR;
> +
> + hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa& ~PAGE_MASK);

s/+/|/

> +
> + if (get_user_pages_fast(hva& PAGE_MASK, 1, 0, pg) != 1)
> + return ERROR_ADDR;
> +
> + return (void *) hva;
> +}
> +
> +long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce)
> +{
> + long ret;
> + struct kvmppc_spapr_tce_table *tt;
> +
> + tt = kvmppc_find_tce_table(vcpu, liobn);
> + /* Didn't find the liobn, put it to userspace */

Unclear comment.

> + if (!tt)
> + return H_TOO_HARD;
> +
> + ++tt->stat.vm.put;
> +
> + if (ioba>= tt->window_size)
> + return H_PARAMETER;
> +
> + ret = kvmppc_emulated_validate_tce(tce);
> + if (ret)
> + return ret;
> +
> + kvmppc_emulated_put_tce(tt, ioba, tce);
> +
> + return H_SUCCESS;
> +}
> +
> +long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce_list, unsigned long npages)
> +{
> + struct kvmppc_spapr_tce_table *tt;
> + long i, ret = H_SUCCESS;
> + unsigned long __user *tces;
> + struct page *pg = NULL;
> +
> + tt = kvmppc_find_tce_table(vcpu, liobn);
> + /* Didn't find the liobn, put it to userspace */
> + if (!tt)
> + return H_TOO_HARD;
> +
> + ++tt->stat.vm.indir;
> +
> + /*
> + * The spec says that the maximum size of the list is 512 TCEs so
> + * so the whole table addressed resides in 4K page

so so?

> + */
> + if (npages> 512)
> + return H_PARAMETER;
> +
> + if (tce_list& ~IOMMU_PAGE_MASK)
> + return H_PARAMETER;
> +
> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size)
> + return H_PARAMETER;
> +
> + tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list,&pg);
> + if (tces == ERROR_ADDR)
> + return H_TOO_HARD;
> +
> + if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
> + goto put_list_page_exit;
> +
> + for (i = 0; i< npages; ++i) {
> + if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
> + ret = H_PARAMETER;
> + goto put_list_page_exit;
> + }
> +
> + ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp_hpas[i]);
> + if (ret)
> + goto put_list_page_exit;
> + }
> +
> + for (i = 0; i< npages; ++i)
> + kvmppc_emulated_put_tce(tt, ioba + (i<< IOMMU_PAGE_SHIFT),
> + vcpu->arch.tce_tmp_hpas[i]);
> +put_list_page_exit:
> + if (pg)
> + put_page(pg);
> +
> + if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
> + vcpu->arch.tce_rm_fail = TCERM_NONE;
> + if (pg&& !PageCompound(pg))
> + put_page(pg); /* finish pending realmode_put_page() */
> + }
> +
> + return ret;
> +}
> +
> +long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce_value, unsigned long npages)
> +{
> + struct kvmppc_spapr_tce_table *tt;
> + long i, ret;
> +
> + tt = kvmppc_find_tce_table(vcpu, liobn);
> + /* Didn't find the liobn, put it to userspace */
> + if (!tt)
> + return H_TOO_HARD;
> +
> + ++tt->stat.vm.stuff;
> +
> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size)
> + return H_PARAMETER;
> +
> + ret = kvmppc_emulated_validate_tce(tce_value);
> + if (ret || (tce_value& (TCE_PCI_WRITE | TCE_PCI_READ)))
> + return H_PARAMETER;
> +
> + for (i = 0; i< npages; ++i, ioba += IOMMU_PAGE_SIZE)
> + kvmppc_emulated_put_tce(tt, ioba, tce_value);
> +
> + return H_SUCCESS;
> +}
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 30c2f3b..cd3e6f9 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -14,6 +14,7 @@
> *
> * Copyright 2010 Paul Mackerras, IBM Corp.<[email protected]>
> * Copyright 2011 David Gibson, IBM Corporation<[email protected]>
> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<[email protected]>
> */
>
> #include<linux/types.h>
> @@ -35,42 +36,243 @@
> #include<asm/ppc-opcode.h>
> #include<asm/kvm_host.h>
> #include<asm/udbg.h>
> +#include<asm/iommu.h>
> +#include<asm/tce.h>
>
> #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
> +#define ERROR_ADDR (~(unsigned long)0x0)
>
> -/* WARNING: This will be called in real-mode on HV KVM and virtual
> - * mode on PR KVM

What's wrong with the warning?

> +/*
> + * Finds a TCE table descriptor by LIOBN
> */
> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
> + unsigned long liobn)
> +{
> + struct kvmppc_spapr_tce_table *tt;
> +
> + list_for_each_entry(tt,&vcpu->kvm->arch.spapr_tce_tables, list) {
> + if (tt->liobn == liobn)
> + return tt;
> + }
> +
> + return NULL;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
> +
> +#ifdef DEBUG
> +/*
> + * Lets user mode disable realmode handlers by putting big number
> + * in the bottom value of LIOBN

What? Seriously? Just don't enable the CAP.

> + */
> +#define kvmppc_find_tce_table(a, b) \
> + ((((b)&0xffff)>10000)?NULL:kvmppc_find_tce_table((a), (b)))
> +#endif
> +
> +/*
> + * Validates TCE address.
> + * At the moment only flags are validated as other checks will significantly slow
> + * down or can make it even impossible to handle TCE requests in real mode.

What?

> + */
> +long kvmppc_emulated_validate_tce(unsigned long tce)

I don't like the naming scheme. Please turn this around and make it
kvmppc_tce_validate().

> +{
> + if (tce& ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
> + return H_PARAMETER;
> +
> + return H_SUCCESS;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
> +
> +/*
> + * Handles TCE requests for QEMU emulated devices.

We still don't mention QEMU in KVM code. And does it really matter
whether they're emulated by QEMU? Devices could also be emulated by KVM.

> + * Puts guest TCE values to the table and expects QEMU to convert them
> + * later in a QEMU device implementation.
> + * Called in both real and virtual modes.
> + * Cannot fail so kvmppc_emulated_validate_tce must be called before it.
> + */
> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,

kvmppc_tce_put()

> + unsigned long ioba, unsigned long tce)
> +{
> + unsigned long idx = ioba>> SPAPR_TCE_SHIFT;
> + struct page *page;
> + u64 *tbl;
> +
> + /*
> + * Note on the use of page_address() in real mode,
> + *
> + * It is safe to use page_address() in real mode on ppc64 because
> + * page_address() is always defined as lowmem_page_address()
> + * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
> + * operation and does not access page struct.
> + *
> + * Theoretically page_address() could be defined different
> + * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
> + * should be enabled.
> + * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
> + * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
> + * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
> + * is not expected to be enabled on ppc32, page_address()
> + * is safe for ppc32 as well.
> + */
> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
> +#error TODO: fix to avoid page_address() here
> +#endif

Can you extract the text above, the check and the page_address call into
a simple wrapper function?
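
E.g. something along these lines (sketch only, the name is just a
suggestion), with the long comment above moved next to it:

static u64 *kvmppc_page_address(struct page *page)
{
#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
#error TODO: fix to avoid page_address() here
#endif
	return (u64 *) page_address(page);
}

so the body of the put_tce helper shrinks to:

	page = tt->pages[idx / TCES_PER_PAGE];
	tbl = kvmppc_page_address(page);
	tbl[idx % TCES_PER_PAGE] = tce;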

> + page = tt->pages[idx / TCES_PER_PAGE];
> + tbl = (u64 *)page_address(page);
> +
> + /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */

This is not an RFC, is it?

> + tbl[idx % TCES_PER_PAGE] = tce;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
> +
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +/*
> + * Converts guest physical address to host physical address.
> + * Tries to increase page counter via realmode_get_page() and
> + * returns ERROR_ADDR if failed.
> + */
> +static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
> + unsigned long gpa, struct page **pg)
> +{
> + struct kvm_memory_slot *memslot;
> + pte_t *ptep, pte;
> + unsigned long hva, hpa = ERROR_ADDR;
> + unsigned long gfn = gpa>> PAGE_SHIFT;
> + unsigned shift = 0;
> +
> + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> + if (!memslot)
> + return ERROR_ADDR;
> +
> + hva = __gfn_to_hva_memslot(memslot, gfn);
> +
> + ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva,&shift);
> + if (!ptep || !pte_present(*ptep))
> + return ERROR_ADDR;
> + pte = *ptep;
> +
> + if (((gpa& TCE_PCI_WRITE) || pte_write(pte))&& !pte_dirty(pte))
> + return ERROR_ADDR;
> +
> + if (!pte_young(pte))
> + return ERROR_ADDR;
> +
> + if (!shift)
> + shift = PAGE_SHIFT;
> +
> + /* Put huge pages handling to the virtual mode */
> + if (shift> PAGE_SHIFT)
> + return ERROR_ADDR;
> +
> + *pg = realmode_pfn_to_page(pte_pfn(pte));
> + if (!*pg || realmode_get_page(*pg))
> + return ERROR_ADDR;
> +
> + /* pte_pfn(pte) returns address aligned to pg_size */
> + hpa = (pte_pfn(pte)<< PAGE_SHIFT) + (gpa& ((1<< shift) - 1));
> +
> + if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> + hpa = ERROR_ADDR;
> + realmode_put_page(*pg);
> + *pg = NULL;
> + }
> +
> + return hpa;
> +}
> +
> long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> unsigned long ioba, unsigned long tce)
> {
> - struct kvm *kvm = vcpu->kvm;
> - struct kvmppc_spapr_tce_table *stt;
> -
> - /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
> - /* liobn, ioba, tce); */
> -
> - list_for_each_entry(stt,&kvm->arch.spapr_tce_tables, list) {
> - if (stt->liobn == liobn) {
> - unsigned long idx = ioba>> SPAPR_TCE_SHIFT;
> - struct page *page;
> - u64 *tbl;
> -
> - /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p window_size=0x%x\n", */
> - /* liobn, stt, stt->window_size); */
> - if (ioba>= stt->window_size)
> - return H_PARAMETER;
> -
> - page = stt->pages[idx / TCES_PER_PAGE];
> - tbl = (u64 *)page_address(page);
> -
> - /* FIXME: Need to validate the TCE itself */
> - /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
> - tbl[idx % TCES_PER_PAGE] = tce;
> - return H_SUCCESS;
> - }
> + long ret;
> + struct kvmppc_spapr_tce_table *tt = kvmppc_find_tce_table(vcpu, liobn);
> +
> + if (!tt)
> + return H_TOO_HARD;
> +
> + ++tt->stat.rm.put;
> +
> + if (ioba>= tt->window_size)
> + return H_PARAMETER;
> +
> + ret = kvmppc_emulated_validate_tce(tce);
> + if (!ret)
> + kvmppc_emulated_put_tce(tt, ioba, tce);
> +
> + return ret;
> +}
> +
> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,

So the _vm version is the normal one and this is the _rm version? If so,
please mark it as such. Is there any way to generate both from the same
source? The way it's now there is a lot of duplicate code.
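
For the H_STUFF_TCE pair the two bodies are already identical, so that one
at least could collapse into a single helper along these lines (sketch only,
name invented; the stat counters are dropped here on the assumption they
become trace points):

static long kvmppc_h_stuff_tce_common(struct kvm_vcpu *vcpu,
		unsigned long liobn, unsigned long ioba,
		unsigned long tce_value, unsigned long npages)
{
	struct kvmppc_spapr_tce_table *tt;
	long i, ret;

	tt = kvmppc_find_tce_table(vcpu, liobn);
	if (!tt)
		return H_TOO_HARD;

	if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
		return H_PARAMETER;

	ret = kvmppc_emulated_validate_tce(tce_value);
	if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
		return H_PARAMETER;

	for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
		kvmppc_emulated_put_tce(tt, ioba, tce_value);

	return H_SUCCESS;
}

For H_PUT_TCE_INDIRECT the real difference is only the gpa lookup, which
could be hidden behind a helper taking a realmode flag, the same way patch 5
does for pnv_tce_build()/pnv_tce_free().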


Alex

> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce_list, unsigned long npages)
> +{
> + struct kvmppc_spapr_tce_table *tt = kvmppc_find_tce_table(vcpu, liobn);
> + long i, ret = H_SUCCESS;
> + unsigned long tces;
> + struct page *pg = NULL;
> +
> + if (!tt)
> + return H_TOO_HARD;
> +
> + ++tt->stat.rm.indir;
> +
> + /*
> + * The spec says that the maximum size of the list is 512 TCEs so
> + * so the whole table addressed resides in 4K page
> + */
> + if (npages> 512)
> + return H_PARAMETER;
> +
> + if (tce_list& ~IOMMU_PAGE_MASK)
> + return H_PARAMETER;
> +
> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size)
> + return H_PARAMETER;
> +
> + tces = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce_list,&pg);
> + if (tces == ERROR_ADDR)
> + return H_TOO_HARD;
> +
> + for (i = 0; i< npages; ++i) {
> + ret = kvmppc_emulated_validate_tce(((unsigned long *)tces)[i]);
> + if (ret)
> + goto put_unlock_exit;
> + }
> +
> + for (i = 0; i< npages; ++i)
> + kvmppc_emulated_put_tce(tt, ioba + (i<< IOMMU_PAGE_SHIFT),
> + ((unsigned long *)tces)[i]);
> +
> +put_unlock_exit:
> + if (!ret&& pg&& !PageCompound(pg)&& realmode_put_page(pg)) {
> + vcpu->arch.tce_rm_fail = TCERM_PUTLIST;
> + ret = H_TOO_HARD;
> }
>
> - /* Didn't find the liobn, punt it to userspace */
> - return H_TOO_HARD;
> + return ret;
> +}
> +
> +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
> + unsigned long liobn, unsigned long ioba,
> + unsigned long tce_value, unsigned long npages)
> +{
> + struct kvmppc_spapr_tce_table *tt;
> + long i, ret;
> +
> + tt = kvmppc_find_tce_table(vcpu, liobn);
> + if (!tt)
> + return H_TOO_HARD;
> +
> + ++tt->stat.rm.stuff;
> +
> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size)
> + return H_PARAMETER;
> +
> + ret = kvmppc_emulated_validate_tce(tce_value);
> + if (ret || (tce_value& (TCE_PCI_WRITE | TCE_PCI_READ)))
> + return H_PARAMETER;
> +
> + for (i = 0; i< npages; ++i, ioba += IOMMU_PAGE_SIZE)
> + kvmppc_emulated_put_tce(tt, ioba, tce_value);
> +
> + return H_SUCCESS;
> }
> +#endif /* CONFIG_KVM_BOOK3S_64_HV */
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 550f592..ac41d01 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -567,7 +567,31 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
> if (kvmppc_xics_enabled(vcpu)) {
> ret = kvmppc_xics_hcall(vcpu, req);
> break;
> - } /* fallthrough */
> + }
> + return RESUME_HOST;
> + case H_PUT_TCE:
> + ret = kvmppc_vm_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> + kvmppc_get_gpr(vcpu, 5),
> + kvmppc_get_gpr(vcpu, 6));
> + if (ret == H_TOO_HARD)
> + return RESUME_HOST;
> + break;
> + case H_PUT_TCE_INDIRECT:
> + ret = kvmppc_vm_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
> + kvmppc_get_gpr(vcpu, 5),
> + kvmppc_get_gpr(vcpu, 6),
> + kvmppc_get_gpr(vcpu, 7));
> + if (ret == H_TOO_HARD)
> + return RESUME_HOST;
> + break;
> + case H_STUFF_TCE:
> + ret = kvmppc_vm_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
> + kvmppc_get_gpr(vcpu, 5),
> + kvmppc_get_gpr(vcpu, 6),
> + kvmppc_get_gpr(vcpu, 7));
> + if (ret == H_TOO_HARD)
> + return RESUME_HOST;
> + break;
> default:
> return RESUME_HOST;
> }
> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
> vcpu->arch.cpu_type = KVM_CPU_3S_64;
> kvmppc_sanity_check(vcpu);
>
> + /*
> + * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
> + * half executed, we first read TCEs from the user, check them and
> + * return error if something went wrong and only then put TCEs into
> + * the TCE table.
> + *
> + * tce_tmp_hpas is a cache for TCEs to avoid stack allocation or
> + * kmalloc as the whole TCE list can take up to 512 items 8 bytes
> + * each (4096 bytes).
> + */
> + vcpu->arch.tce_tmp_hpas = kmalloc(4096, GFP_KERNEL);
> + if (!vcpu->arch.tce_tmp_hpas)
> + goto free_vcpu;
> +
> return vcpu;
>
> free_vcpu:
> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
> unpin_vpa(vcpu->kvm,&vcpu->arch.slb_shadow);
> unpin_vpa(vcpu->kvm,&vcpu->arch.vpa);
> spin_unlock(&vcpu->arch.vpa_update_lock);
> + kfree(vcpu->arch.tce_tmp_hpas);
> kvm_vcpu_uninit(vcpu);
> kmem_cache_free(kvm_vcpu_cache, vcpu);
> }
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b02f91e..d35554e 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1490,6 +1490,12 @@ hcall_real_table:
> .long 0 /* 0x11c */
> .long 0 /* 0x120 */
> .long .kvmppc_h_bulk_remove - hcall_real_table
> + .long 0 /* 0x128 */
> + .long 0 /* 0x12c */
> + .long 0 /* 0x130 */
> + .long 0 /* 0x134 */
> + .long .kvmppc_h_stuff_tce - hcall_real_table
> + .long .kvmppc_h_put_tce_indirect - hcall_real_table
> hcall_real_table_end:
>
> ignore_hdec:
> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
> index da0e0bc..edfea88 100644
> --- a/arch/powerpc/kvm/book3s_pr_papr.c
> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
> unsigned long tce = kvmppc_get_gpr(vcpu, 6);
> long rc;
>
> - rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
> + rc = kvmppc_vm_h_put_tce(vcpu, liobn, ioba, tce);
> + if (rc == H_TOO_HARD)
> + return EMULATE_FAIL;
> + kvmppc_set_gpr(vcpu, 3, rc);
> + return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
> +{
> + unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> + unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> + unsigned long tce = kvmppc_get_gpr(vcpu, 6);
> + unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> + long rc;
> +
> + rc = kvmppc_vm_h_put_tce_indirect(vcpu, liobn, ioba,
> + tce, npages);
> + if (rc == H_TOO_HARD)
> + return EMULATE_FAIL;
> + kvmppc_set_gpr(vcpu, 3, rc);
> + return EMULATE_DONE;
> +}
> +
> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
> +{
> + unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
> + unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
> + unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
> + unsigned long npages = kvmppc_get_gpr(vcpu, 7);
> + long rc;
> +
> + rc = kvmppc_vm_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
> if (rc == H_TOO_HARD)
> return EMULATE_FAIL;
> kvmppc_set_gpr(vcpu, 3, rc);
> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
> return kvmppc_h_pr_bulk_remove(vcpu);
> case H_PUT_TCE:
> return kvmppc_h_pr_put_tce(vcpu);
> + case H_PUT_TCE_INDIRECT:
> + return kvmppc_h_pr_put_tce_indirect(vcpu);
> + case H_STUFF_TCE:
> + return kvmppc_h_pr_stuff_tce(vcpu);
> case H_CEDE:
> vcpu->arch.shared->msr |= MSR_EE;
> kvm_vcpu_block(vcpu);
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 6316ee3..ccb578b 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -394,6 +394,9 @@ int kvm_dev_ioctl_check_extension(long ext)
> case KVM_CAP_PPC_GET_SMMU_INFO:
> r = 1;
> break;
> + case KVM_CAP_SPAPR_MULTITCE:
> + r = 1;
> + break;
> #endif
> default:
> r = 0;

2013-07-09 17:06:09

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 7/8] KVM: PPC: Add support for IOMMU in-kernel handling

On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests without passing them to QEMU, which saves time
> on switching to QEMU and back.
>
> Both real and virtual modes are supported. First the kernel tries to
> handle a TCE request in the real mode, if failed it passes it to
> the virtual mode to complete the operation. If a virtual mode
> handler fails, a request is passed to the user mode.
>
> This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to associate
> a virtual PCI bus ID (LIOBN) with an IOMMU group which enables
> in-kernel handling of IOMMU map/unmap. The external user API support
> in VFIO is required.
>
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card).
>
> Signed-off-by: Paul Mackerras<[email protected]>
> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>
> ---
>
> Changes:
> 2013/07/06:
> * added realmode arch_spin_lock to protect TCE table from races
> in real and virtual modes
> * POWERPC IOMMU API is changed to support real mode
> * iommu_take_ownership and iommu_release_ownership are protected by
> iommu_table's locks
> * VFIO external user API use rewritten
> * multiple small fixes
>
> 2013/06/27:
> * tce_list page is referenced now in order to protect it from accident
> invalidation during H_PUT_TCE_INDIRECT execution
> * added use of the external user VFIO API
>
> 2013/06/05:
> * changed capability number
> * changed ioctl number
> * update the doc article number
>
> 2013/05/20:
> * removed get_user() from real mode handlers
> * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
> translated TCEs, tries realmode_get_page() on those and if it fails, it
> passes control over the virtual mode handler which tries to finish
> the request handling
> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
> on a page
> * The only reason to pass the request to user mode now is when the user mode
> did not register TCE table in the kernel, in all other cases the virtual mode
> handler is expected to do the job
>
> Signed-off-by: Alexey Kardashevskiy<[email protected]>
> ---
> Documentation/virtual/kvm/api.txt | 26 ++++
> arch/powerpc/include/asm/iommu.h | 9 +-
> arch/powerpc/include/asm/kvm_host.h | 3 +
> arch/powerpc/include/asm/kvm_ppc.h | 2 +
> arch/powerpc/include/uapi/asm/kvm.h | 7 +
> arch/powerpc/kernel/iommu.c | 196 +++++++++++++++--------
> arch/powerpc/kvm/book3s_64_vio.c | 299 +++++++++++++++++++++++++++++++++++-
> arch/powerpc/kvm/book3s_64_vio_hv.c | 129 ++++++++++++++++
> arch/powerpc/kvm/powerpc.c | 12 ++
> 9 files changed, 609 insertions(+), 74 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 762c703..01b0dc2 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2387,6 +2387,32 @@ slows operations a lot.
> Unlike other capabilities of this section, this one is always enabled.
>
>
> +4.87 KVM_CREATE_SPAPR_TCE_IOMMU
> +
> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
> +Architectures: powerpc
> +Type: vm ioctl
> +Parameters: struct kvm_create_spapr_tce_iommu (in)
> +Returns: 0 on success, -1 on error
> +
> +struct kvm_create_spapr_tce_iommu {
> + __u64 liobn;
> + __u32 iommu_id;
> + __u32 flags;

Your documentation is out of sync.

Please also split this patch up. It's too long for review.


Alex

2013-07-09 17:32:27

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
> This adds special support for huge pages (16MB). The reference
> counting cannot be easily done for such pages in real mode (when
> MMU is off) so we added a list of huge pages. It is populated in
> virtual mode and get_page is called just once per huge page.
> Real mode handlers check if the requested page is huge and in the list,
> then no reference counting is done, otherwise an exit to virtual mode
> happens. The list is released at KVM exit. At the moment the fastest
> card available for tests uses up to 9 huge pages so walking through this
> list is not very expensive. However this can change and we may want
> to optimize this.
>
> Signed-off-by: Paul Mackerras<[email protected]>
> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>
> ---
>
> Changes:
> 2013/06/27:
> * list of huge pages replaces with hashtable for better performance

So the only thing your patch description really talks about is not true
anymore?

> * spinlock removed from real mode and only protects insertion of new
> huge page descriptors into the hashtable
>
> 2013/06/05:
> * fixed compile error when CONFIG_IOMMU_API=n
>
> 2013/05/20:
> * the real mode handler now searches for a huge page by gpa (used to be pte)
> * the virtual mode handler prints warning if it is called twice for the same
> huge page as the real mode handler is expected to fail just once - when a huge
> page is not in the list yet.
> * the huge page is refcounted twice - when added to the hugepage list and
> when used in the virtual mode hcall handler (can be optimized but it will
> make the patch less nice).
>
> Signed-off-by: Alexey Kardashevskiy<[email protected]>
> ---
> arch/powerpc/include/asm/kvm_host.h | 25 +++++++++
> arch/powerpc/kernel/iommu.c | 6 ++-
> arch/powerpc/kvm/book3s_64_vio.c | 104 +++++++++++++++++++++++++++++++++---
> arch/powerpc/kvm/book3s_64_vio_hv.c | 21 ++++++--
> 4 files changed, 146 insertions(+), 10 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 53e61b2..a7508cf 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -30,6 +30,7 @@
> #include<linux/kvm_para.h>
> #include<linux/list.h>
> #include<linux/atomic.h>
> +#include<linux/hashtable.h>
> #include<asm/kvm_asm.h>
> #include<asm/processor.h>
> #include<asm/page.h>
> @@ -182,10 +183,34 @@ struct kvmppc_spapr_tce_table {
> u32 window_size;
> struct iommu_group *grp; /* used for IOMMU groups */
> struct vfio_group *vfio_grp; /* used for IOMMU groups */
> + DECLARE_HASHTABLE(hash_tab, ilog2(64)); /* used for IOMMU groups */
> + spinlock_t hugepages_write_lock; /* used for IOMMU groups */
> struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
> struct page *pages[0];
> };
>
> +/*
> + * The KVM guest can be backed with 16MB pages.
> + * In this case, we cannot do page counting from the real mode
> + * as the compound pages are used - they are linked in a list
> + * with pointers as virtual addresses which are inaccessible
> + * in real mode.
> + *
> + * The code below keeps a 16MB pages list and uses page struct
> + * in real mode if it is already locked in RAM and inserted into
> + * the list or switches to the virtual mode where it can be
> + * handled in a usual manner.
> + */
> +#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa) hash_32(gpa>> 24, 32)
> +
> +struct kvmppc_spapr_iommu_hugepage {
> + struct hlist_node hash_node;
> + unsigned long gpa; /* Guest physical address */
> + unsigned long hpa; /* Host physical address */
> + struct page *page; /* page struct of the very first subpage */
> + unsigned long size; /* Huge page size (always 16MB at the moment) */
> +};
> +
> struct kvmppc_linear_info {
> void *base_virt;
> unsigned long base_pfn;
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 51678ec..e0b6eca 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
> if (!pg) {
> ret = -EAGAIN;
> } else if (PageCompound(pg)) {
> - ret = -EAGAIN;
> + /* Hugepages will be released at KVM exit */
> + ret = 0;
> } else {
> if (oldtce& TCE_PCI_WRITE)
> SetPageDirty(pg);
> @@ -1009,6 +1010,9 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
> struct page *pg = pfn_to_page(oldtce>> PAGE_SHIFT);
> if (!pg) {
> ret = -EAGAIN;
> + } else if (PageCompound(pg)) {
> + /* Hugepages will be released at KVM exit */
> + ret = 0;
> } else {
> if (oldtce& TCE_PCI_WRITE)
> SetPageDirty(pg);
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 2b51f4a..c037219 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -46,6 +46,40 @@
>
> #define ERROR_ADDR ((void *)~(unsigned long)0x0)
>
> +#ifdef CONFIG_IOMMU_API

Can't you just make CONFIG_IOMMU_API mandatory in Kconfig?

> +static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
> +{
> + spin_lock_init(&tt->hugepages_write_lock);
> + hash_init(tt->hash_tab);
> +}
> +
> +static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
> +{
> + int bkt;
> + struct kvmppc_spapr_iommu_hugepage *hp;
> + struct hlist_node *tmp;
> +
> + spin_lock(&tt->hugepages_write_lock);
> + hash_for_each_safe(tt->hash_tab, bkt, tmp, hp, hash_node) {
> + pr_debug("Release HP liobn=%llx #%u gpa=%lx hpa=%lx size=%ld\n",
> + tt->liobn, bkt, hp->gpa, hp->hpa, hp->size);

trace point

> + hlist_del_rcu(&hp->hash_node);
> +
> + put_page(hp->page);

Don't you have to mark them dirty?

> + kfree(hp);
> + }
> + spin_unlock(&tt->hugepages_write_lock);
> +}
> +#else
> +static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
> +{
> +}
> +
> +static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
> +{
> +}
> +#endif /* CONFIG_IOMMU_API */
> +
> static long kvmppc_stt_npages(unsigned long window_size)
> {
> return ALIGN((window_size>> SPAPR_TCE_SHIFT)
> @@ -112,6 +146,7 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
>
> mutex_lock(&kvm->lock);
> list_del(&stt->list);
> + kvmppc_iommu_hugepages_cleanup(stt);
>
> #ifdef CONFIG_IOMMU_API
> if (stt->grp) {
> @@ -200,6 +235,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> kvm_get_kvm(kvm);
>
> mutex_lock(&kvm->lock);
> + kvmppc_iommu_hugepages_init(stt);
> list_add(&stt->list,&kvm->arch.spapr_tce_tables);
>
> mutex_unlock(&kvm->lock);
> @@ -283,6 +319,7 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
>
> kvm_get_kvm(kvm);
> mutex_lock(&kvm->lock);
> + kvmppc_iommu_hugepages_init(tt);
> list_add(&tt->list,&kvm->arch.spapr_tce_tables);
> mutex_unlock(&kvm->lock);
>
> @@ -307,10 +344,17 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
>
> /* Converts guest physical address to host virtual address */
> static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
> + struct kvmppc_spapr_tce_table *tt,
> unsigned long gpa, struct page **pg, unsigned long *hpa)
> {
> unsigned long hva, gfn = gpa>> PAGE_SHIFT;
> struct kvm_memory_slot *memslot;
> +#ifdef CONFIG_IOMMU_API
> + struct kvmppc_spapr_iommu_hugepage *hp;
> + unsigned key = KVMPPC_SPAPR_HUGEPAGE_HASH(gpa);
> + pte_t *ptep;
> + unsigned int shift = 0;
> +#endif
>
> memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> if (!memslot)
> @@ -325,6 +369,54 @@ static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
> *hpa = __pa((unsigned long) page_address(*pg)) +
> (hva & ~PAGE_MASK);
>
> +#ifdef CONFIG_IOMMU_API

This function is becoming incredibly large. Please split it up. Also
please document the code.


Alex

> + if (!PageCompound(*pg))
> + return (void *) hva;
> +
> + spin_lock(&tt->hugepages_write_lock);
> + hash_for_each_possible_rcu(tt->hash_tab, hp, hash_node, key) {
> + if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
> + continue;
> + if (hpa)
> + *hpa = __pa((unsigned long) page_address(hp->page)) +
> + (hva & (hp->size - 1));
> + goto unlock_exit;
> + }
> +
> + ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
> + WARN_ON(!ptep);
> +
> + if (!ptep || (shift <= PAGE_SHIFT)) {
> + hva = (unsigned long) ERROR_ADDR;
> + goto unlock_exit;
> + }
> +
> + hp = kzalloc(sizeof(*hp), GFP_KERNEL);
> + if (!hp) {
> + hva = (unsigned long) ERROR_ADDR;
> + goto unlock_exit;
> + }
> +
> + hp->gpa = gpa & ~((1 << shift) - 1);
> + hp->hpa = (pte_pfn(*ptep) << PAGE_SHIFT);
> + hp->size = 1 << shift;
> +
> + if (get_user_pages_fast(hva & ~(hp->size - 1), 1, 1, &hp->page) != 1) {
> + hva = (unsigned long) ERROR_ADDR;
> + kfree(hp);
> + goto unlock_exit;
> + }
> + hash_add_rcu(tt->hash_tab, &hp->hash_node, key);
> +
> + if (hpa)
> + *hpa = __pa((unsigned long) page_address(hp->page)) +
> + (hva & (hp->size - 1));
> +unlock_exit:
> + spin_unlock(&tt->hugepages_write_lock);
> +
> + put_page(*pg);
> + *pg = NULL;
> +#endif /* CONFIG_IOMMU_API */
> return (void *) hva;
> }
>
> @@ -363,7 +455,7 @@ long kvmppc_vm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
> if (iommu_tce_put_param_check(tbl, ioba, tce))
> return H_PARAMETER;
>
> - hva = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce, &pg, &hpa);
> + hva = kvmppc_vm_gpa_to_hva_and_get(vcpu, tt, tce, &pg, &hpa);
> if (hva == ERROR_ADDR)
> return H_HARDWARE;
> }
> @@ -372,7 +464,7 @@ long kvmppc_vm_h_put_tce_iommu(struct kvm_vcpu *vcpu,
> return H_SUCCESS;
>
> pg = pfn_to_page(hpa >> PAGE_SHIFT);
> - if (pg)
> + if (pg && !PageCompound(pg))
> put_page(pg);
>
> return H_HARDWARE;
> @@ -414,7 +506,7 @@ static long kvmppc_vm_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
> (i << IOMMU_PAGE_SHIFT), gpa))
> return H_PARAMETER;
>
> - hva = kvmppc_vm_gpa_to_hva_and_get(vcpu, gpa, &pg,
> + hva = kvmppc_vm_gpa_to_hva_and_get(vcpu, tt, gpa, &pg,
> &vcpu->arch.tce_tmp_hpas[i]);
> if (hva == ERROR_ADDR)
> goto putpages_flush_exit;
> @@ -429,7 +521,7 @@ putpages_flush_exit:
> for ( --i; i >= 0; --i) {
> struct page *pg;
> pg = pfn_to_page(vcpu->arch.tce_tmp_hpas[i] >> PAGE_SHIFT);
> - if (pg)
> + if (pg && !PageCompound(pg))
> put_page(pg);
> }
>
> @@ -517,7 +609,7 @@ long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> return H_PARAMETER;
>
> - tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list, &pg, NULL);
> + tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tt, tce_list, &pg, NULL);
> if (tces == ERROR_ADDR)
> return H_TOO_HARD;
>
> @@ -547,7 +639,7 @@ long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
> vcpu->arch.tce_tmp_hpas[i]);
> put_list_page_exit:
> - if (pg)
> + if (pg && !PageCompound(pg))
> put_page(pg);
>
> if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index f8103c6..8c6449f 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -132,6 +132,7 @@ EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
> * returns ERROR_ADDR if failed.
> */
> static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
> + struct kvmppc_spapr_tce_table *tt,
> unsigned long gpa, struct page **pg)
> {
> struct kvm_memory_slot *memslot;
> @@ -139,6 +140,20 @@ static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
> unsigned long hva, hpa = ERROR_ADDR;
> unsigned long gfn = gpa >> PAGE_SHIFT;
> unsigned shift = 0;
> + struct kvmppc_spapr_iommu_hugepage *hp;
> +
> + /* Try to find an already used hugepage */
> + unsigned key = KVMPPC_SPAPR_HUGEPAGE_HASH(gpa);
> +
> + hash_for_each_possible_rcu_notrace(tt->hash_tab, hp,
> + hash_node, key) {
> + if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
> + continue;
> +
> + *pg = NULL; /* Tell the caller not to put page */
> +
> + return hp->hpa + (gpa & (hp->size - 1));
> + }
>
> memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
> if (!memslot)
> @@ -208,7 +223,7 @@ static long kvmppc_h_put_tce_iommu(struct kvm_vcpu *vcpu,
> if (iommu_tce_put_param_check(tbl, ioba, tce))
> return H_PARAMETER;
>
> - hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce, &pg);
> + hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tt, tce, &pg);
> if (hpa == ERROR_ADDR)
> return H_TOO_HARD;
>
> @@ -247,7 +262,7 @@ static long kvmppc_h_put_tce_indirect_iommu(struct kvm_vcpu *vcpu,
>
> /* Translate TCEs and go get_page() */
> for (i = 0; i < npages; ++i) {
> - hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tces[i], &pg);
> + hpa = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tt, tces[i], &pg);
> if (hpa == ERROR_ADDR) {
> vcpu->arch.tce_tmp_num = i;
> vcpu->arch.tce_rm_fail = TCERM_GETPAGE;
> @@ -342,7 +357,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
> return H_PARAMETER;
>
> - tces = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tce_list, &pg);
> + tces = kvmppc_rm_gpa_to_hpa_and_get(vcpu, tt, tce_list, &pg);
> if (tces == ERROR_ADDR)
> return H_TOO_HARD;
>

2013-07-09 23:29:39

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

On 07/10/2013 03:32 AM, Alexander Graf wrote:
> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>> This adds special support for huge pages (16MB). The reference
>> counting cannot be easily done for such pages in real mode (when
>> MMU is off) so we added a list of huge pages. It is populated in
>> virtual mode and get_page is called just once per a huge page.
>> Real mode handlers check if the requested page is huge and in the list,
>> then no reference counting is done, otherwise an exit to virtual mode
>> happens. The list is released at KVM exit. At the moment the fastest
>> card available for tests uses up to 9 huge pages so walking through this
>> list is not very expensive. However this can change and we may want
>> to optimize this.
>>
>> Signed-off-by: Paul Mackerras<[email protected]>
>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>
>> ---
>>
>> Changes:
>> 2013/06/27:
>> * list of huge pages replaces with hashtable for better performance
>
> So the only thing your patch description really talks about is not true
> anymore?
>
>> * spinlock removed from real mode and only protects insertion of new
>> huge pages descriptors into the hashtable
>>
>> 2013/06/05:
>> * fixed compile error when CONFIG_IOMMU_API=n
>>
>> 2013/05/20:
>> * the real mode handler now searches for a huge page by gpa (used to be pte)
>> * the virtual mode handler prints warning if it is called twice for the same
>> huge page as the real mode handler is expected to fail just once - when a
>> huge
>> page is not in the list yet.
>> * the huge page is refcounted twice - when added to the hugepage list and
>> when used in the virtual mode hcall handler (can be optimized but it will
>> make the patch less nice).
>>
>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>> ---
>> arch/powerpc/include/asm/kvm_host.h | 25 +++++++++
>> arch/powerpc/kernel/iommu.c | 6 ++-
>> arch/powerpc/kvm/book3s_64_vio.c | 104
>> +++++++++++++++++++++++++++++++++---
>> arch/powerpc/kvm/book3s_64_vio_hv.c | 21 ++++++--
>> 4 files changed, 146 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>> b/arch/powerpc/include/asm/kvm_host.h
>> index 53e61b2..a7508cf 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -30,6 +30,7 @@
>> #include<linux/kvm_para.h>
>> #include<linux/list.h>
>> #include<linux/atomic.h>
>> +#include<linux/hashtable.h>
>> #include<asm/kvm_asm.h>
>> #include<asm/processor.h>
>> #include<asm/page.h>
>> @@ -182,10 +183,34 @@ struct kvmppc_spapr_tce_table {
>> u32 window_size;
>> struct iommu_group *grp; /* used for IOMMU groups */
>> struct vfio_group *vfio_grp; /* used for IOMMU groups */
>> + DECLARE_HASHTABLE(hash_tab, ilog2(64)); /* used for IOMMU groups */
>> + spinlock_t hugepages_write_lock; /* used for IOMMU groups */
>> struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
>> struct page *pages[0];
>> };
>>
>> +/*
>> + * The KVM guest can be backed with 16MB pages.
>> + * In this case, we cannot do page counting from the real mode
>> + * as the compound pages are used - they are linked in a list
>> + * with pointers as virtual addresses which are inaccessible
>> + * in real mode.
>> + *
>> + * The code below keeps a 16MB pages list and uses page struct
>> + * in real mode if it is already locked in RAM and inserted into
>> + * the list or switches to the virtual mode where it can be
>> + * handled in a usual manner.
>> + */
>> +#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa) hash_32(gpa>> 24, 32)
>> +
>> +struct kvmppc_spapr_iommu_hugepage {
>> + struct hlist_node hash_node;
>> + unsigned long gpa; /* Guest physical address */
>> + unsigned long hpa; /* Host physical address */
>> + struct page *page; /* page struct of the very first subpage */
>> + unsigned long size; /* Huge page size (always 16MB at the moment) */
>> +};
>> +
>> struct kvmppc_linear_info {
>> void *base_virt;
>> unsigned long base_pfn;
>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>> index 51678ec..e0b6eca 100644
>> --- a/arch/powerpc/kernel/iommu.c
>> +++ b/arch/powerpc/kernel/iommu.c
>> @@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned
>> long entry,
>> if (!pg) {
>> ret = -EAGAIN;
>> } else if (PageCompound(pg)) {
>> - ret = -EAGAIN;
>> + /* Hugepages will be released at KVM exit */
>> + ret = 0;
>> } else {
>> if (oldtce& TCE_PCI_WRITE)
>> SetPageDirty(pg);
>> @@ -1009,6 +1010,9 @@ int iommu_free_tces(struct iommu_table *tbl,
>> unsigned long entry,
>> struct page *pg = pfn_to_page(oldtce>> PAGE_SHIFT);
>> if (!pg) {
>> ret = -EAGAIN;
>> + } else if (PageCompound(pg)) {
>> + /* Hugepages will be released at KVM exit */
>> + ret = 0;
>> } else {
>> if (oldtce& TCE_PCI_WRITE)
>> SetPageDirty(pg);
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c
>> b/arch/powerpc/kvm/book3s_64_vio.c
>> index 2b51f4a..c037219 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -46,6 +46,40 @@
>>
>> #define ERROR_ADDR ((void *)~(unsigned long)0x0)
>>
>> +#ifdef CONFIG_IOMMU_API
>
> Can't you just make CONFIG_IOMMU_API mandatory in Kconfig?


Sure I can. I can do anything. Why should I? Do I have to do that to get
this accepted? I do not understand this comment. It has already been
discussed how to enable this option.


>> +static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
>> +{
>> + spin_lock_init(&tt->hugepages_write_lock);
>> + hash_init(tt->hash_tab);
>> +}
>> +
>> +static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table
>> *tt)
>> +{
>> + int bkt;
>> + struct kvmppc_spapr_iommu_hugepage *hp;
>> + struct hlist_node *tmp;
>> +
>> + spin_lock(&tt->hugepages_write_lock);
>> + hash_for_each_safe(tt->hash_tab, bkt, tmp, hp, hash_node) {
>> + pr_debug("Release HP liobn=%llx #%u gpa=%lx hpa=%lx size=%ld\n",
>> + tt->liobn, bkt, hp->gpa, hp->hpa, hp->size);
>
> trace point
>
>> + hlist_del_rcu(&hp->hash_node);
>> +
>> + put_page(hp->page);
>
> Don't you have to mark them dirty?


get_user_pages_fast() is called with writing==1. Doesn't that already do the same?
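
If explicit dirtying does turn out to be needed, a minimal sketch of the
cleanup loop could look like the following. This is illustrative only and
not part of the patch; gup(write=1) only faults the mapping in writable,
it does not dirty the page when the reference is dropped later:

	spin_lock(&tt->hugepages_write_lock);
	hash_for_each_safe(tt->hash_tab, bkt, tmp, hp, hash_node) {
		hlist_del_rcu(&hp->hash_node);
		/* assume the device may have written via a TCE_PCI_WRITE TCE */
		set_page_dirty_lock(hp->page);
		put_page(hp->page);
		kfree(hp);
	}
	spin_unlock(&tt->hugepages_write_lock);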

>
>> + kfree(hp);
>> + }
>> + spin_unlock(&tt->hugepages_write_lock);
>> +}
>> +#else
>> +static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
>> +{
>> +}
>> +
>> +static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table
>> *tt)
>> +{
>> +}
>> +#endif /* CONFIG_IOMMU_API */
>> +
>> static long kvmppc_stt_npages(unsigned long window_size)
>> {
>> return ALIGN((window_size>> SPAPR_TCE_SHIFT)
>> @@ -112,6 +146,7 @@ static void release_spapr_tce_table(struct
>> kvmppc_spapr_tce_table *stt)
>>
>> mutex_lock(&kvm->lock);
>> list_del(&stt->list);
>> + kvmppc_iommu_hugepages_cleanup(stt);
>>
>> #ifdef CONFIG_IOMMU_API
>> if (stt->grp) {
>> @@ -200,6 +235,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>> kvm_get_kvm(kvm);
>>
>> mutex_lock(&kvm->lock);
>> + kvmppc_iommu_hugepages_init(stt);
>> list_add(&stt->list,&kvm->arch.spapr_tce_tables);
>>
>> mutex_unlock(&kvm->lock);
>> @@ -283,6 +319,7 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm
>> *kvm,
>>
>> kvm_get_kvm(kvm);
>> mutex_lock(&kvm->lock);
>> + kvmppc_iommu_hugepages_init(tt);
>> list_add(&tt->list,&kvm->arch.spapr_tce_tables);
>> mutex_unlock(&kvm->lock);
>>
>> @@ -307,10 +344,17 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm
>> *kvm,
>>
>> /* Converts guest physical address to host virtual address */
>> static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
>> + struct kvmppc_spapr_tce_table *tt,
>> unsigned long gpa, struct page **pg, unsigned long *hpa)
>> {
>> unsigned long hva, gfn = gpa>> PAGE_SHIFT;
>> struct kvm_memory_slot *memslot;
>> +#ifdef CONFIG_IOMMU_API
>> + struct kvmppc_spapr_iommu_hugepage *hp;
>> + unsigned key = KVMPPC_SPAPR_HUGEPAGE_HASH(gpa);
>> + pte_t *ptep;
>> + unsigned int shift = 0;
>> +#endif
>>
>> memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>> if (!memslot)
>> @@ -325,6 +369,54 @@ static void __user
>> *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
>> *hpa = __pa((unsigned long) page_address(*pg)) +
>> (hva& ~PAGE_MASK);
>>
>> +#ifdef CONFIG_IOMMU_API
>
> This function is becoming incredibly large. Please split it up. Also please
> document the code.


Less than 100 lines is incredibly large? There are _many_ functions bigger
than that. I do not really see the point in making a separate function
which is going to be called only once.





--
Alexey

2013-07-10 03:25:31

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH 5/8] powerpc: add real mode support for dma operations on powernv

On 07/10/2013 02:02 AM, Alexander Graf wrote:
> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>> The existing TCE machine calls (tce_build and tce_free) only support
>> virtual mode as they call __raw_writeq for TCE invalidation what
>> fails in real mode.
>>
>> This introduces tce_build_rm and tce_free_rm real mode versions
>> which do mostly the same but use "Store Doubleword Caching Inhibited
>> Indexed" instruction for TCE invalidation.
>
> So would always using stdcix have any bad side effects?


The Power ISA says about stdcix: "They must be executed only when MSR[DR]=0".
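
So the store has to be guarded by the mode the caller is in. A minimal
sketch of the idea (rm_writed() is the helper this patch adds; the wrapper
name below is made up for illustration):

	static inline void tce_invalidate_write(u64 __iomem *invalidate,
						unsigned long val, bool rm)
	{
		if (rm)
			/* hypervisor real mode (MSR[DR]=0): cache-inhibited store */
			rm_writed((unsigned long) invalidate, val);
		else
			/* virtual mode: normal store via the __iomem mapping */
			__raw_writeq(val, invalidate);
	}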



>
>
> Alex
>
>>
>> This new feature is going to be utilized by real mode support of VFIO.
>>
>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>> ---
>> arch/powerpc/include/asm/machdep.h | 12 ++++++++++
>> arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++++++++++++++------
>> arch/powerpc/platforms/powernv/pci.c | 38
>> ++++++++++++++++++++++++++-----
>> arch/powerpc/platforms/powernv/pci.h | 2 +-
>> 4 files changed, 64 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/machdep.h
>> b/arch/powerpc/include/asm/machdep.h
>> index 92386fc..0c19eef 100644
>> --- a/arch/powerpc/include/asm/machdep.h
>> +++ b/arch/powerpc/include/asm/machdep.h
>> @@ -75,6 +75,18 @@ struct machdep_calls {
>> long index);
>> void (*tce_flush)(struct iommu_table *tbl);
>>
>> + /* _rm versions are for real mode use only */
>> + int (*tce_build_rm)(struct iommu_table *tbl,
>> + long index,
>> + long npages,
>> + unsigned long uaddr,
>> + enum dma_data_direction direction,
>> + struct dma_attrs *attrs);
>> + void (*tce_free_rm)(struct iommu_table *tbl,
>> + long index,
>> + long npages);
>> + void (*tce_flush_rm)(struct iommu_table *tbl);
>> +
>> void __iomem * (*ioremap)(phys_addr_t addr, unsigned long size,
>> unsigned long flags, void *caller);
>> void (*iounmap)(volatile void __iomem *token);
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c
>> b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 2931d97..2797dec 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -68,6 +68,12 @@ define_pe_printk_level(pe_err, KERN_ERR);
>> define_pe_printk_level(pe_warn, KERN_WARNING);
>> define_pe_printk_level(pe_info, KERN_INFO);
>>
>> +static inline void rm_writed(unsigned long paddr, u64 val)
>> +{
>> + __asm__ __volatile__("sync; stdcix %0,0,%1"
>> + : : "r" (val), "r" (paddr) : "memory");
>> +}
>> +
>> static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
>> {
>> unsigned long pe;
>> @@ -442,7 +448,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb
>> *phb, struct pci_dev *pdev
>> }
>>
>> static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
>> - u64 *startp, u64 *endp)
>> + u64 *startp, u64 *endp, bool rm)
>> {
>> u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
>> unsigned long start, end, inc;
>> @@ -471,7 +477,10 @@ static void pnv_pci_ioda1_tce_invalidate(struct
>> iommu_table *tbl,
>>
>> mb(); /* Ensure above stores are visible */
>> while (start<= end) {
>> - __raw_writeq(start, invalidate);
>> + if (rm)
>> + rm_writed((unsigned long) invalidate, start);
>> + else
>> + __raw_writeq(start, invalidate);
>> start += inc;
>> }
>>
>> @@ -483,7 +492,7 @@ static void pnv_pci_ioda1_tce_invalidate(struct
>> iommu_table *tbl,
>>
>> static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
>> struct iommu_table *tbl,
>> - u64 *startp, u64 *endp)
>> + u64 *startp, u64 *endp, bool rm)
>> {
>> unsigned long start, end, inc;
>> u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
>> @@ -502,22 +511,25 @@ static void pnv_pci_ioda2_tce_invalidate(struct
>> pnv_ioda_pe *pe,
>> mb();
>>
>> while (start<= end) {
>> - __raw_writeq(start, invalidate);
>> + if (rm)
>> + rm_writed((unsigned long) invalidate, start);
>> + else
>> + __raw_writeq(start, invalidate);
>> start += inc;
>> }
>> }
>>
>> void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
>> - u64 *startp, u64 *endp)
>> + u64 *startp, u64 *endp, bool rm)
>> {
>> struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
>> tce32_table);
>> struct pnv_phb *phb = pe->phb;
>>
>> if (phb->type == PNV_PHB_IODA1)
>> - pnv_pci_ioda1_tce_invalidate(tbl, startp, endp);
>> + pnv_pci_ioda1_tce_invalidate(tbl, startp, endp, rm);
>> else
>> - pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp);
>> + pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
>> }
>>
>> static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>> diff --git a/arch/powerpc/platforms/powernv/pci.c
>> b/arch/powerpc/platforms/powernv/pci.c
>> index e16b729..280f614 100644
>> --- a/arch/powerpc/platforms/powernv/pci.c
>> +++ b/arch/powerpc/platforms/powernv/pci.c
>> @@ -336,7 +336,7 @@ struct pci_ops pnv_pci_ops = {
>>
>> static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
>> unsigned long uaddr, enum dma_data_direction direction,
>> - struct dma_attrs *attrs)
>> + struct dma_attrs *attrs, bool rm)
>> {
>> u64 proto_tce;
>> u64 *tcep, *tces;
>> @@ -358,12 +358,19 @@ static int pnv_tce_build(struct iommu_table *tbl,
>> long index, long npages,
>> * of flags if that becomes the case
>> */
>> if (tbl->it_type& TCE_PCI_SWINV_CREATE)
>> - pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1);
>> + pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
>>
>> return 0;
>> }
>>
>> -static void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
>> +static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long
>> npages,
>> + unsigned long uaddr, enum dma_data_direction direction,
>> + struct dma_attrs *attrs)
>> +{
>> + return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs,
>> false);
>> +}
>> +
>> +static void pnv_tce_free(struct iommu_table *tbl, long index, long
>> npages, bool rm)
>> {
>> u64 *tcep, *tces;
>>
>> @@ -373,7 +380,12 @@ static void pnv_tce_free(struct iommu_table *tbl,
>> long index, long npages)
>> *(tcep++) = 0;
>>
>> if (tbl->it_type& TCE_PCI_SWINV_FREE)
>> - pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1);
>> + pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
>> +}
>> +
>> +static void pnv_tce_free_vm(struct iommu_table *tbl, long index, long
>> npages)
>> +{
>> + pnv_tce_free(tbl, index, npages, false);
>> }
>>
>> static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
>> @@ -381,6 +393,18 @@ static unsigned long pnv_tce_get(struct iommu_table
>> *tbl, long index)
>> return ((u64 *)tbl->it_base)[index - tbl->it_offset];
>> }
>>
>> +static int pnv_tce_build_rm(struct iommu_table *tbl, long index, long
>> npages,
>> + unsigned long uaddr, enum dma_data_direction direction,
>> + struct dma_attrs *attrs)
>> +{
>> + return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs,
>> true);
>> +}
>> +
>> +static void pnv_tce_free_rm(struct iommu_table *tbl, long index, long
>> npages)
>> +{
>> + pnv_tce_free(tbl, index, npages, true);
>> +}
>> +
>> void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
>> void *tce_mem, u64 tce_size,
>> u64 dma_offset)
>> @@ -545,8 +569,10 @@ void __init pnv_pci_init(void)
>>
>> /* Configure IOMMU DMA hooks */
>> ppc_md.pci_dma_dev_setup = pnv_pci_dma_dev_setup;
>> - ppc_md.tce_build = pnv_tce_build;
>> - ppc_md.tce_free = pnv_tce_free;
>> + ppc_md.tce_build = pnv_tce_build_vm;
>> + ppc_md.tce_free = pnv_tce_free_vm;
>> + ppc_md.tce_build_rm = pnv_tce_build_rm;
>> + ppc_md.tce_free_rm = pnv_tce_free_rm;
>> ppc_md.tce_get = pnv_tce_get;
>> ppc_md.pci_probe_mode = pnv_pci_probe_mode;
>> set_pci_dma_ops(&dma_iommu_ops);
>> diff --git a/arch/powerpc/platforms/powernv/pci.h
>> b/arch/powerpc/platforms/powernv/pci.h
>> index 25d76c4..6799374 100644
>> --- a/arch/powerpc/platforms/powernv/pci.h
>> +++ b/arch/powerpc/platforms/powernv/pci.h
>> @@ -158,6 +158,6 @@ extern void pnv_pci_init_p5ioc2_hub(struct
>> device_node *np);
>> extern void pnv_pci_init_ioda_hub(struct device_node *np);
>> extern void pnv_pci_init_ioda2_phb(struct device_node *np);
>> extern void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
>> - u64 *startp, u64 *endp);
>> + u64 *startp, u64 *endp, bool rm);
>>
>> #endif /* __POWERNV_PCI_H */
>


--
Alexey

2013-07-10 03:37:32

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 5/8] powerpc: add real mode support for dma operations on powernv

On Tue, 2013-07-09 at 18:02 +0200, Alexander Graf wrote:
> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
> > The existing TCE machine calls (tce_build and tce_free) only support
> > virtual mode as they call __raw_writeq for TCE invalidation what
> > fails in real mode.
> >
> > This introduces tce_build_rm and tce_free_rm real mode versions
> > which do mostly the same but use "Store Doubleword Caching Inhibited
> > Indexed" instruction for TCE invalidation.
>
> So would always using stdcix have any bad side effects?

Yes. Those instructions are only supposed to be used in hypervisor real
mode as per the architecture spec.

Cheers,
Ben.

>
> Alex
>
> >
> > This new feature is going to be utilized by real mode support of VFIO.
> >
> > Signed-off-by: Alexey Kardashevskiy<[email protected]>
> > ---
> > arch/powerpc/include/asm/machdep.h | 12 ++++++++++
> > arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++++++++++++++------
> > arch/powerpc/platforms/powernv/pci.c | 38 ++++++++++++++++++++++++++-----
> > arch/powerpc/platforms/powernv/pci.h | 2 +-
> > 4 files changed, 64 insertions(+), 14 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
> > index 92386fc..0c19eef 100644
> > --- a/arch/powerpc/include/asm/machdep.h
> > +++ b/arch/powerpc/include/asm/machdep.h
> > @@ -75,6 +75,18 @@ struct machdep_calls {
> > long index);
> > void (*tce_flush)(struct iommu_table *tbl);
> >
> > + /* _rm versions are for real mode use only */
> > + int (*tce_build_rm)(struct iommu_table *tbl,
> > + long index,
> > + long npages,
> > + unsigned long uaddr,
> > + enum dma_data_direction direction,
> > + struct dma_attrs *attrs);
> > + void (*tce_free_rm)(struct iommu_table *tbl,
> > + long index,
> > + long npages);
> > + void (*tce_flush_rm)(struct iommu_table *tbl);
> > +
> > void __iomem * (*ioremap)(phys_addr_t addr, unsigned long size,
> > unsigned long flags, void *caller);
> > void (*iounmap)(volatile void __iomem *token);
> > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> > index 2931d97..2797dec 100644
> > --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> > @@ -68,6 +68,12 @@ define_pe_printk_level(pe_err, KERN_ERR);
> > define_pe_printk_level(pe_warn, KERN_WARNING);
> > define_pe_printk_level(pe_info, KERN_INFO);
> >
> > +static inline void rm_writed(unsigned long paddr, u64 val)
> > +{
> > + __asm__ __volatile__("sync; stdcix %0,0,%1"
> > + : : "r" (val), "r" (paddr) : "memory");
> > +}
> > +
> > static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
> > {
> > unsigned long pe;
> > @@ -442,7 +448,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
> > }
> >
> > static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
> > - u64 *startp, u64 *endp)
> > + u64 *startp, u64 *endp, bool rm)
> > {
> > u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
> > unsigned long start, end, inc;
> > @@ -471,7 +477,10 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
> >
> > mb(); /* Ensure above stores are visible */
> > while (start<= end) {
> > - __raw_writeq(start, invalidate);
> > + if (rm)
> > + rm_writed((unsigned long) invalidate, start);
> > + else
> > + __raw_writeq(start, invalidate);
> > start += inc;
> > }
> >
> > @@ -483,7 +492,7 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
> >
> > static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
> > struct iommu_table *tbl,
> > - u64 *startp, u64 *endp)
> > + u64 *startp, u64 *endp, bool rm)
> > {
> > unsigned long start, end, inc;
> > u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
> > @@ -502,22 +511,25 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
> > mb();
> >
> > while (start<= end) {
> > - __raw_writeq(start, invalidate);
> > + if (rm)
> > + rm_writed((unsigned long) invalidate, start);
> > + else
> > + __raw_writeq(start, invalidate);
> > start += inc;
> > }
> > }
> >
> > void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
> > - u64 *startp, u64 *endp)
> > + u64 *startp, u64 *endp, bool rm)
> > {
> > struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
> > tce32_table);
> > struct pnv_phb *phb = pe->phb;
> >
> > if (phb->type == PNV_PHB_IODA1)
> > - pnv_pci_ioda1_tce_invalidate(tbl, startp, endp);
> > + pnv_pci_ioda1_tce_invalidate(tbl, startp, endp, rm);
> > else
> > - pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp);
> > + pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
> > }
> >
> > static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> > diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> > index e16b729..280f614 100644
> > --- a/arch/powerpc/platforms/powernv/pci.c
> > +++ b/arch/powerpc/platforms/powernv/pci.c
> > @@ -336,7 +336,7 @@ struct pci_ops pnv_pci_ops = {
> >
> > static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> > unsigned long uaddr, enum dma_data_direction direction,
> > - struct dma_attrs *attrs)
> > + struct dma_attrs *attrs, bool rm)
> > {
> > u64 proto_tce;
> > u64 *tcep, *tces;
> > @@ -358,12 +358,19 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> > * of flags if that becomes the case
> > */
> > if (tbl->it_type& TCE_PCI_SWINV_CREATE)
> > - pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1);
> > + pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
> >
> > return 0;
> > }
> >
> > -static void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
> > +static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long npages,
> > + unsigned long uaddr, enum dma_data_direction direction,
> > + struct dma_attrs *attrs)
> > +{
> > + return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, false);
> > +}
> > +
> > +static void pnv_tce_free(struct iommu_table *tbl, long index, long npages, bool rm)
> > {
> > u64 *tcep, *tces;
> >
> > @@ -373,7 +380,12 @@ static void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
> > *(tcep++) = 0;
> >
> > if (tbl->it_type& TCE_PCI_SWINV_FREE)
> > - pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1);
> > + pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
> > +}
> > +
> > +static void pnv_tce_free_vm(struct iommu_table *tbl, long index, long npages)
> > +{
> > + pnv_tce_free(tbl, index, npages, false);
> > }
> >
> > static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
> > @@ -381,6 +393,18 @@ static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
> > return ((u64 *)tbl->it_base)[index - tbl->it_offset];
> > }
> >
> > +static int pnv_tce_build_rm(struct iommu_table *tbl, long index, long npages,
> > + unsigned long uaddr, enum dma_data_direction direction,
> > + struct dma_attrs *attrs)
> > +{
> > + return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, true);
> > +}
> > +
> > +static void pnv_tce_free_rm(struct iommu_table *tbl, long index, long npages)
> > +{
> > + pnv_tce_free(tbl, index, npages, true);
> > +}
> > +
> > void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
> > void *tce_mem, u64 tce_size,
> > u64 dma_offset)
> > @@ -545,8 +569,10 @@ void __init pnv_pci_init(void)
> >
> > /* Configure IOMMU DMA hooks */
> > ppc_md.pci_dma_dev_setup = pnv_pci_dma_dev_setup;
> > - ppc_md.tce_build = pnv_tce_build;
> > - ppc_md.tce_free = pnv_tce_free;
> > + ppc_md.tce_build = pnv_tce_build_vm;
> > + ppc_md.tce_free = pnv_tce_free_vm;
> > + ppc_md.tce_build_rm = pnv_tce_build_rm;
> > + ppc_md.tce_free_rm = pnv_tce_free_rm;
> > ppc_md.tce_get = pnv_tce_get;
> > ppc_md.pci_probe_mode = pnv_pci_probe_mode;
> > set_pci_dma_ops(&dma_iommu_ops);
> > diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> > index 25d76c4..6799374 100644
> > --- a/arch/powerpc/platforms/powernv/pci.h
> > +++ b/arch/powerpc/platforms/powernv/pci.h
> > @@ -158,6 +158,6 @@ extern void pnv_pci_init_p5ioc2_hub(struct device_node *np);
> > extern void pnv_pci_init_ioda_hub(struct device_node *np);
> > extern void pnv_pci_init_ioda2_phb(struct device_node *np);
> > extern void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
> > - u64 *startp, u64 *endp);
> > + u64 *startp, u64 *endp, bool rm);
> >
> > #endif /* __POWERNV_PCI_H */

2013-07-10 05:00:54

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

On 07/10/2013 03:02 AM, Alexander Graf wrote:
> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>> devices or emulated PCI. These calls allow adding multiple entries
>> (up to 512) into the TCE table in one call which saves time on
>> transition to/from real mode.
>
> We don't mention QEMU explicitly in KVM code usually.
>
>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>> (copied from user and verified) before writing the whole list into
>> the TCE table. This cache will be utilized more in the upcoming
>> VFIO/IOMMU support to continue TCE list processing in the virtual
>> mode in the case if the real mode handler failed for some reason.
>>
>> This adds a guest physical to host real address converter
>> and calls the existing H_PUT_TCE handler. The converting function
>> is going to be fully utilized by upcoming VFIO supporting patches.
>>
>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>> so in order to support the functionality of this patch, QEMU
>> needs to query for this capability and set the "hcall-multi-tce"
>> hypertas property only if the capability is present, otherwise
>> there will be serious performance degradation.
>
> Same as above. But really you're only giving recommendations here. What's
> the point? Please describe what the benefit of this patch is, not what some
> other random subsystem might do with the benefits it brings.
>
>>
>> Signed-off-by: Paul Mackerras<[email protected]>
>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>
>> ---
>> Changelog:
>> 2013/07/06:
>> * fixed number of wrong get_page()/put_page() calls
>>
>> 2013/06/27:
>> * fixed clear of BUSY bit in kvmppc_lookup_pte()
>> * H_PUT_TCE_INDIRECT does realmode_get_page() now
>> * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
>> * updated doc
>>
>> 2013/06/05:
>> * fixed mistype about IBMVIO in the commit message
>> * updated doc and moved it to another section
>> * changed capability number
>>
>> 2013/05/21:
>> * added kvm_vcpu_arch::tce_tmp
>> * removed cleanup if put_indirect failed, instead we do not even start
>> writing to TCE table if we cannot get TCEs from the user and they are
>> invalid
>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>> and kvmppc_emulated_validate_tce (for the previous item)
>> * fixed bug with failthrough for H_IPI
>> * removed all get_user() from real mode handlers
>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>
>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>> ---
>> Documentation/virtual/kvm/api.txt | 25 +++
>> arch/powerpc/include/asm/kvm_host.h | 9 ++
>> arch/powerpc/include/asm/kvm_ppc.h | 16 +-
>> arch/powerpc/kvm/book3s_64_vio.c | 154 ++++++++++++++++++-
>> arch/powerpc/kvm/book3s_64_vio_hv.c | 260
>> ++++++++++++++++++++++++++++----
>> arch/powerpc/kvm/book3s_hv.c | 41 ++++-
>> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
>> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++-
>> arch/powerpc/kvm/powerpc.c | 3 +
>> 9 files changed, 517 insertions(+), 34 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/api.txt
>> b/Documentation/virtual/kvm/api.txt
>> index 6365fef..762c703 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed
>> to userspace to be
>> handled.
>>
>>
>> +4.86 KVM_CAP_PPC_MULTITCE
>> +
>> +Capability: KVM_CAP_PPC_MULTITCE
>> +Architectures: ppc
>> +Type: vm
>> +
>> +This capability means the kernel is capable of handling hypercalls
>> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
>> +space. This significanly accelerates DMA operations for PPC KVM guests.
>
> significanly? Please run this through a spell checker.
>
>> +The user space should expect that its handlers for these hypercalls
>
> s/The//
>
>> +are not going to be called.
>
> Is user space guaranteed they will not be called? Or can it still happen?

... if user space previously registered LIOBN in KVM (via
KVM_CREATE_SPAPR_TCE or similar calls).

ok?

There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet
and may never get there.


>> +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
>> +the user space might have to advertise it for the guest. For example,
>> +IBM pSeries guest starts using them if "hcall-multi-tce" is present in
>> +the "ibm,hypertas-functions" device-tree property.
>
> This paragraph describes sPAPR. That's fine, but please document it as
> such. Also please check your grammar.

>> +
>> +Without this capability, only H_PUT_TCE is handled by the kernel and
>> +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
>> +unless the capability is present as passing hypercalls to the userspace
>> +slows operations a lot.
>> +
>> +Unlike other capabilities of this section, this one is always enabled.
>
> Why? Wouldn't that confuse older user space?


How? Old user space won't check for this capability and won't tell the
guest to use it (via "hcall-multi-tce"). Old H_PUT_TCE is still there.

If the guest always uses H_PUT_TCE_INDIRECT/H_STUFF_TCE no matter what,
then it is its problem - it won't work now anyway as neither QEMU nor host
kernel supports these calls.
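
To make the expected user space flow explicit, here is a rough sketch
(kvm_fd/vm_fd, the liobn/window values and advertise_hcall_multi_tce()
are placeholders for illustration, not code from any existing VMM):

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* adds "hcall-multi-tce" to ibm,hypertas-functions in the guest device tree */
	extern void advertise_hcall_multi_tce(void);

	static int setup_guest_tce_table(int kvm_fd, int vm_fd,
					 uint64_t liobn, uint32_t window_size)
	{
		struct kvm_create_spapr_tce args = {
			.liobn = liobn,
			.window_size = window_size,
		};

		/* advertise the multi-TCE hcalls only if the kernel handles them */
		if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_MULTITCE) > 0)
			advertise_hcall_multi_tce();

		/* register the LIOBN; the returned fd exposes the TCE table */
		return ioctl(vm_fd, KVM_CREATE_SPAPR_TCE, &args);
	}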



>> +
>> +
>> 5. The kvm_run structure
>> ------------------------
>>
>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>> b/arch/powerpc/include/asm/kvm_host.h
>> index af326cd..20d04bd 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
>> struct kvm *kvm;
>> u64 liobn;
>> u32 window_size;
>> + struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
>
> You don't need this.
>
>> struct page *pages[0];
>> };
>>
>> @@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
>> spinlock_t tbacct_lock;
>> u64 busy_stolen;
>> u64 busy_preempt;
>> +
>> + unsigned long *tce_tmp_hpas; /* TCE cache for TCE_PUT_INDIRECT
>> hcall */
>> + enum {
>> + TCERM_NONE,
>> + TCERM_GETPAGE,
>> + TCERM_PUTTCE,
>> + TCERM_PUTLIST,
>> + } tce_rm_fail; /* failed stage of request processing */
>> #endif
>> };
>>
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h
>> b/arch/powerpc/include/asm/kvm_ppc.h
>> index a5287fe..fa722a0 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu
>> *vcpu);
>>
>> extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>> struct kvm_create_spapr_tce *args);
>> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>> - unsigned long ioba, unsigned long tce);
>> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>> + struct kvm_vcpu *vcpu, unsigned long liobn);
>> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
>> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>> + unsigned long ioba, unsigned long tce);
>> +extern long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>> + unsigned long liobn, unsigned long ioba,
>> + unsigned long tce);
>> +extern long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>> + unsigned long liobn, unsigned long ioba,
>> + unsigned long tce_list, unsigned long npages);
>> +extern long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>> + unsigned long liobn, unsigned long ioba,
>> + unsigned long tce_value, unsigned long npages);
>> extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
>> struct kvm_allocate_rma *rma);
>> extern struct kvmppc_linear_info *kvm_alloc_rma(void);
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c
>> b/arch/powerpc/kvm/book3s_64_vio.c
>> index b2d3f3b..99bf4e5 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -14,6 +14,7 @@
>> *
>> * Copyright 2010 Paul Mackerras, IBM Corp.<[email protected]>
>> * Copyright 2011 David Gibson, IBM Corporation<[email protected]>
>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<[email protected]>
>> */
>>
>> #include<linux/types.h>
>> @@ -36,8 +37,10 @@
>> #include<asm/ppc-opcode.h>
>> #include<asm/kvm_host.h>
>> #include<asm/udbg.h>
>> +#include<asm/iommu.h>
>> +#include<asm/tce.h>
>>
>> -#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
>> +#define ERROR_ADDR ((void *)~(unsigned long)0x0)
>>
>> static long kvmppc_stt_npages(unsigned long window_size)
>> {
>> @@ -50,6 +53,20 @@ static void release_spapr_tce_table(struct
>> kvmppc_spapr_tce_table *stt)
>> struct kvm *kvm = stt->kvm;
>> int i;
>>
>> +#define __SV(x) stt->stat.x
>> +#define __SVD(x) (__SV(rm.x)?(__SV(rm.x)-__SV(vm.x)):0)
>> + pr_debug("%s stat for liobn=%llx\n"
>> + "--------------- realmode ----- virtmode ---\n"
>> + "put_tce %10ld %10ld\n"
>> + "put_tce_indir %10ld %10ld\n"
>> + "stuff_tce %10ld %10ld\n",
>> + __func__, stt->liobn,
>> + __SVD(put), __SV(vm.put),
>> + __SVD(indir), __SV(vm.indir),
>> + __SVD(stuff), __SV(vm.stuff));
>> +#undef __SVD
>> +#undef __SV
>
> All of these stat points should just be trace points. You can do the
> statistic gathering from user space then.
>
>> +
>> mutex_lock(&kvm->lock);
>> list_del(&stt->list);
>> for (i = 0; i< kvmppc_stt_npages(stt->window_size); i++)
>> @@ -148,3 +165,138 @@ fail:
>> }
>> return ret;
>> }
>> +
>> +/* Converts guest physical address to host virtual address */
>> +static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
>
> Please don't distinguish _vm versions. They're the normal case. _rm ones
> are the special ones.
>
>> + unsigned long gpa, struct page **pg)
>> +{
>> + unsigned long hva, gfn = gpa>> PAGE_SHIFT;
>> + struct kvm_memory_slot *memslot;
>> +
>> + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>> + if (!memslot)
>> + return ERROR_ADDR;
>> +
>> + hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa& ~PAGE_MASK);
>
> s/+/|/
>
>> +
>> + if (get_user_pages_fast(hva& PAGE_MASK, 1, 0, pg) != 1)
>> + return ERROR_ADDR;
>> +
>> + return (void *) hva;
>> +}
>> +
>> +long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>> + unsigned long liobn, unsigned long ioba,
>> + unsigned long tce)
>> +{
>> + long ret;
>> + struct kvmppc_spapr_tce_table *tt;
>> +
>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>> + /* Didn't find the liobn, put it to userspace */
>
> Unclear comment.


What detail is missing?


>> + if (!tt)
>> + return H_TOO_HARD;
>> +
>> + ++tt->stat.vm.put;
>> +
>> + if (ioba>= tt->window_size)
>> + return H_PARAMETER;
>> +
>> + ret = kvmppc_emulated_validate_tce(tce);
>> + if (ret)
>> + return ret;
>> +
>> + kvmppc_emulated_put_tce(tt, ioba, tce);
>> +
>> + return H_SUCCESS;
>> +}
>> +
>> +long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>> + unsigned long liobn, unsigned long ioba,
>> + unsigned long tce_list, unsigned long npages)
>> +{
>> + struct kvmppc_spapr_tce_table *tt;
>> + long i, ret = H_SUCCESS;
>> + unsigned long __user *tces;
>> + struct page *pg = NULL;
>> +
>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>> + /* Didn't find the liobn, put it to userspace */
>> + if (!tt)
>> + return H_TOO_HARD;
>> +
>> + ++tt->stat.vm.indir;
>> +
>> + /*
>> + * The spec says that the maximum size of the list is 512 TCEs so
>> + * so the whole table addressed resides in 4K page
>
> so so?
>
>> + */
>> + if (npages> 512)
>> + return H_PARAMETER;
>> +
>> + if (tce_list& ~IOMMU_PAGE_MASK)
>> + return H_PARAMETER;
>> +
>> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size)
>> + return H_PARAMETER;
>> +
>> + tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list,&pg);
>> + if (tces == ERROR_ADDR)
>> + return H_TOO_HARD;
>> +
>> + if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
>> + goto put_list_page_exit;
>> +
>> + for (i = 0; i< npages; ++i) {
>> + if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
>> + ret = H_PARAMETER;
>> + goto put_list_page_exit;
>> + }
>> +
>> + ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp_hpas[i]);
>> + if (ret)
>> + goto put_list_page_exit;
>> + }
>> +
>> + for (i = 0; i< npages; ++i)
>> + kvmppc_emulated_put_tce(tt, ioba + (i<< IOMMU_PAGE_SHIFT),
>> + vcpu->arch.tce_tmp_hpas[i]);
>> +put_list_page_exit:
>> + if (pg)
>> + put_page(pg);
>> +
>> + if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
>> + vcpu->arch.tce_rm_fail = TCERM_NONE;
>> + if (pg&& !PageCompound(pg))
>> + put_page(pg); /* finish pending realmode_put_page() */
>> + }
>> +
>> + return ret;
>> +}
>> +
>> +long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>> + unsigned long liobn, unsigned long ioba,
>> + unsigned long tce_value, unsigned long npages)
>> +{
>> + struct kvmppc_spapr_tce_table *tt;
>> + long i, ret;
>> +
>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>> + /* Didn't find the liobn, put it to userspace */
>> + if (!tt)
>> + return H_TOO_HARD;
>> +
>> + ++tt->stat.vm.stuff;
>> +
>> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size)
>> + return H_PARAMETER;
>> +
>> + ret = kvmppc_emulated_validate_tce(tce_value);
>> + if (ret || (tce_value& (TCE_PCI_WRITE | TCE_PCI_READ)))
>> + return H_PARAMETER;
>> +
>> + for (i = 0; i< npages; ++i, ioba += IOMMU_PAGE_SIZE)
>> + kvmppc_emulated_put_tce(tt, ioba, tce_value);
>> +
>> + return H_SUCCESS;
>> +}
>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c
>> b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> index 30c2f3b..cd3e6f9 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>> @@ -14,6 +14,7 @@
>> *
>> * Copyright 2010 Paul Mackerras, IBM Corp.<[email protected]>
>> * Copyright 2011 David Gibson, IBM Corporation<[email protected]>
>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<[email protected]>
>> */
>>
>> #include<linux/types.h>
>> @@ -35,42 +36,243 @@
>> #include<asm/ppc-opcode.h>
>> #include<asm/kvm_host.h>
>> #include<asm/udbg.h>
>> +#include<asm/iommu.h>
>> +#include<asm/tce.h>
>>
>> #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
>> +#define ERROR_ADDR (~(unsigned long)0x0)
>>
>> -/* WARNING: This will be called in real-mode on HV KVM and virtual
>> - * mode on PR KVM
>
> What's wrong with the warning?


It belongs to kvmppc_h_put_tce() which is not called in virtual mode anymore.

It is technically correct for kvmppc_find_tce_table() though. Should I put
this comment before every function which may be called from real and
virtual modes?



>> +/*
>> + * Finds a TCE table descriptor by LIOBN
>> */
>> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
>> + unsigned long liobn)
>> +{
>> + struct kvmppc_spapr_tce_table *tt;
>> +
>> + list_for_each_entry(tt,&vcpu->kvm->arch.spapr_tce_tables, list) {
>> + if (tt->liobn == liobn)
>> + return tt;
>> + }
>> +
>> + return NULL;
>> +}
>> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
>> +
>> +#ifdef DEBUG
>> +/*
>> + * Lets user mode disable realmode handlers by putting big number
>> + * in the bottom value of LIOBN
>
> What? Seriously? Just don't enable the CAP.


It is under DEBUG. It really, really helps to be able to disable real mode
handlers without reboot. Ok, no debug code, I'll remove.
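
If being able to switch the real mode handlers off at runtime is still
wanted without DEBUG, one option would be a parameter checked at the top
of the real mode handlers, e.g. (sketch only, parameter name invented):

	static bool multitce_disable_realmode;
	module_param(multitce_disable_realmode, bool, 0644);
	MODULE_PARM_DESC(multitce_disable_realmode,
			 "Punt H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE to virtual mode");

	/* ... and in the real mode handlers: */
	if (multitce_disable_realmode)
		return H_TOO_HARD;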


>> + */
>> +#define kvmppc_find_tce_table(a, b) \
>> + ((((b)&0xffff)>10000)?NULL:kvmppc_find_tce_table((a), (b)))
>> +#endif
>> +
>> +/*
>> + * Validates TCE address.
>> + * At the moment only flags are validated as other checks will
>> significantly slow
>> + * down or can make it even impossible to handle TCE requests in real mode.
>
> What?


What is missing here (besides good English)?


>> + */
>> +long kvmppc_emulated_validate_tce(unsigned long tce)
>
> I don't like the naming scheme. Please turn this around and make it
> kvmppc_tce_validate().


Oh. "Like"... Ok.


>> +{
>> + if (tce& ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
>> + return H_PARAMETER;
>> +
>> + return H_SUCCESS;
>> +}
>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
>> +
>> +/*
>> + * Handles TCE requests for QEMU emulated devices.
>
> We still don't mention QEMU in KVM code. And does it really matter whether
> they're emulated by QEMU? Devices could also be emulated by KVM.
>
>> + * Puts guest TCE values to the table and expects QEMU to convert them
>> + * later in a QEMU device implementation.
>> + * Called in both real and virtual modes.
>> + * Cannot fail so kvmppc_emulated_validate_tce must be called before it.
>> + */
>> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>
> kvmppc_tce_put()
>
>> + unsigned long ioba, unsigned long tce)
>> +{
>> + unsigned long idx = ioba>> SPAPR_TCE_SHIFT;
>> + struct page *page;
>> + u64 *tbl;
>> +
>> + /*
>> + * Note on the use of page_address() in real mode,
>> + *
>> + * It is safe to use page_address() in real mode on ppc64 because
>> + * page_address() is always defined as lowmem_page_address()
>> + * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
>> + * operation and does not access page struct.
>> + *
>> + * Theoretically page_address() could be defined different
>> + * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
>> + * should be enabled.
>> + * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
>> + * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
>> + * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
>> + * is not expected to be enabled on ppc32, page_address()
>> + * is safe for ppc32 as well.
>> + */
>> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
>> +#error TODO: fix to avoid page_address() here
>> +#endif
>
> Can you extract the text above, the check and the page_address call into a
> simple wrapper function?


Is this function also too big? Sorry, I do not understand the comment.
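
If the request is only to pull the long comment and the build-time check
out of the hcall path, a minimal sketch could be (helper name invented):

	#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
	#error page_address() would dereference struct page here; not real mode safe
	#endif

	/* safe in real mode: lowmem_page_address() is pure arithmetic */
	static u64 *kvmppc_page_address(struct page *page)
	{
		return (u64 *) page_address(page);
	}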


>> + page = tt->pages[idx / TCES_PER_PAGE];
>> + tbl = (u64 *)page_address(page);
>> +
>> + /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>
> This is not an RFC, is it?


Any debug code is prohibited? Ok, I'll remove.


>> + tbl[idx % TCES_PER_PAGE] = tce;
>> +}
>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
>> +
>> +#ifdef CONFIG_KVM_BOOK3S_64_HV
>> +/*
>> + * Converts guest physical address to host physical address.
>> + * Tries to increase page counter via realmode_get_page() and
>> + * returns ERROR_ADDR if failed.
>> + */
>> +static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
>> + unsigned long gpa, struct page **pg)
>> +{
>> + struct kvm_memory_slot *memslot;
>> + pte_t *ptep, pte;
>> + unsigned long hva, hpa = ERROR_ADDR;
>> + unsigned long gfn = gpa>> PAGE_SHIFT;
>> + unsigned shift = 0;
>> +
>> + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>> + if (!memslot)
>> + return ERROR_ADDR;
>> +
>> + hva = __gfn_to_hva_memslot(memslot, gfn);
>> +
>> + ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva,&shift);
>> + if (!ptep || !pte_present(*ptep))
>> + return ERROR_ADDR;
>> + pte = *ptep;
>> +
>> + if (((gpa& TCE_PCI_WRITE) || pte_write(pte))&& !pte_dirty(pte))
>> + return ERROR_ADDR;
>> +
>> + if (!pte_young(pte))
>> + return ERROR_ADDR;
>> +
>> + if (!shift)
>> + shift = PAGE_SHIFT;
>> +
>> + /* Put huge pages handling to the virtual mode */
>> + if (shift> PAGE_SHIFT)
>> + return ERROR_ADDR;
>> +
>> + *pg = realmode_pfn_to_page(pte_pfn(pte));
>> + if (!*pg || realmode_get_page(*pg))
>> + return ERROR_ADDR;
>> +
>> + /* pte_pfn(pte) returns address aligned to pg_size */
>> + hpa = (pte_pfn(pte)<< PAGE_SHIFT) + (gpa& ((1<< shift) - 1));
>> +
>> + if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>> + hpa = ERROR_ADDR;
>> + realmode_put_page(*pg);
>> + *pg = NULL;
>> + }
>> +
>> + return hpa;
>> +}
>> +
>> long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>> unsigned long ioba, unsigned long tce)
>> {
>> - struct kvm *kvm = vcpu->kvm;
>> - struct kvmppc_spapr_tce_table *stt;
>> -
>> - /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>> - /* liobn, ioba, tce); */
>> -
>> - list_for_each_entry(stt,&kvm->arch.spapr_tce_tables, list) {
>> - if (stt->liobn == liobn) {
>> - unsigned long idx = ioba>> SPAPR_TCE_SHIFT;
>> - struct page *page;
>> - u64 *tbl;
>> -
>> - /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p
>> window_size=0x%x\n", */
>> - /* liobn, stt, stt->window_size); */
>> - if (ioba>= stt->window_size)
>> - return H_PARAMETER;
>> -
>> - page = stt->pages[idx / TCES_PER_PAGE];
>> - tbl = (u64 *)page_address(page);
>> -
>> - /* FIXME: Need to validate the TCE itself */
>> - /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>> - tbl[idx % TCES_PER_PAGE] = tce;
>> - return H_SUCCESS;
>> - }
>> + long ret;
>> + struct kvmppc_spapr_tce_table *tt = kvmppc_find_tce_table(vcpu, liobn);
>> +
>> + if (!tt)
>> + return H_TOO_HARD;
>> +
>> + ++tt->stat.rm.put;
>> +
>> + if (ioba>= tt->window_size)
>> + return H_PARAMETER;
>> +
>> + ret = kvmppc_emulated_validate_tce(tce);
>> + if (!ret)
>> + kvmppc_emulated_put_tce(tt, ioba, tce);
>> +
>> + return ret;
>> +}
>> +
>> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>
> So the _vm version is the normal one and this is the _rm version? If so,
> please mark it as such. Is there any way to generate both from the same
> source? The way it's now there is a lot of duplicate code.


I tried, looked very ugly. If you insist, I will do so.


--
Alexey

2013-07-10 10:05:31

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls


On 10.07.2013, at 07:00, Alexey Kardashevskiy wrote:

> On 07/10/2013 03:02 AM, Alexander Graf wrote:
>> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>> devices or emulated PCI. These calls allow adding multiple entries
>>> (up to 512) into the TCE table in one call which saves time on
>>> transition to/from real mode.
>>
>> We don't mention QEMU explicitly in KVM code usually.
>>
>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>> (copied from user and verified) before writing the whole list into
>>> the TCE table. This cache will be utilized more in the upcoming
>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>> mode in the case if the real mode handler failed for some reason.
>>>
>>> This adds a guest physical to host real address converter
>>> and calls the existing H_PUT_TCE handler. The converting function
>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>>
>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>> so in order to support the functionality of this patch, QEMU
>>> needs to query for this capability and set the "hcall-multi-tce"
>>> hypertas property only if the capability is present, otherwise
>>> there will be serious performance degradation.
>>
>> Same as above. But really you're only giving recommendations here. What's
>> the point? Please describe what the benefit of this patch is, not what some
>> other random subsystem might do with the benefits it brings.
>>
>>>
>>> Signed-off-by: Paul Mackerras<[email protected]>
>>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>>
>>> ---
>>> Changelog:
>>> 2013/07/06:
>>> * fixed number of wrong get_page()/put_page() calls
>>>
>>> 2013/06/27:
>>> * fixed clear of BUSY bit in kvmppc_lookup_pte()
>>> * H_PUT_TCE_INDIRECT does realmode_get_page() now
>>> * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
>>> * updated doc
>>>
>>> 2013/06/05:
>>> * fixed mistype about IBMVIO in the commit message
>>> * updated doc and moved it to another section
>>> * changed capability number
>>>
>>> 2013/05/21:
>>> * added kvm_vcpu_arch::tce_tmp
>>> * removed cleanup if put_indirect failed, instead we do not even start
>>> writing to TCE table if we cannot get TCEs from the user and they are
>>> invalid
>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>> and kvmppc_emulated_validate_tce (for the previous item)
>>> * fixed bug with failthrough for H_IPI
>>> * removed all get_user() from real mode handlers
>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>>
>>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>> ---
>>> Documentation/virtual/kvm/api.txt | 25 +++
>>> arch/powerpc/include/asm/kvm_host.h | 9 ++
>>> arch/powerpc/include/asm/kvm_ppc.h | 16 +-
>>> arch/powerpc/kvm/book3s_64_vio.c | 154 ++++++++++++++++++-
>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 260
>>> ++++++++++++++++++++++++++++----
>>> arch/powerpc/kvm/book3s_hv.c | 41 ++++-
>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
>>> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++-
>>> arch/powerpc/kvm/powerpc.c | 3 +
>>> 9 files changed, 517 insertions(+), 34 deletions(-)
>>>
>>> diff --git a/Documentation/virtual/kvm/api.txt
>>> b/Documentation/virtual/kvm/api.txt
>>> index 6365fef..762c703 100644
>>> --- a/Documentation/virtual/kvm/api.txt
>>> +++ b/Documentation/virtual/kvm/api.txt
>>> @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed
>>> to userspace to be
>>> handled.
>>>
>>>
>>> +4.86 KVM_CAP_PPC_MULTITCE
>>> +
>>> +Capability: KVM_CAP_PPC_MULTITCE
>>> +Architectures: ppc
>>> +Type: vm
>>> +
>>> +This capability means the kernel is capable of handling hypercalls
>>> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
>>> +space. This significanly accelerates DMA operations for PPC KVM guests.
>>
>> significanly? Please run this through a spell checker.
>>
>>> +The user space should expect that its handlers for these hypercalls
>>
>> s/The//
>>
>>> +are not going to be called.
>>
>> Is user space guaranteed they will not be called? Or can it still happen?
>
> ... if user space previously registered LIOBN in KVM (via
> KVM_CREATE_SPAPR_TCE or similar calls).
>
> ok?

How about this?

The hypercalls mentioned above may or may not be processed successfully in the kernel-based fast path. If they cannot be handled by the kernel, they will get passed on to user space. So user space still has to have an implementation for these despite the in-kernel acceleration.

---

The target audience for this documentation is user space KVM API users: someone developing kvm tool, for example. They want to know what implications specific CAPs have.
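For illustration, a minimal sketch of that check from the user space side. advertise_hcall_multi_tce() is a made-up placeholder for the client's device tree generation code, and the sketch assumes the KVM_CAP_SPAPR_MULTITCE constant from this series is visible in <linux/kvm.h>:

#include <sys/ioctl.h>
#include <linux/kvm.h>

extern void advertise_hcall_multi_tce(void);	/* hypothetical placeholder */

static void spapr_setup_multitce(int kvm_fd)
{
	/* > 0 means the kernel handles H_PUT_TCE_INDIRECT/H_STUFF_TCE itself */
	if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_MULTITCE) > 0)
		advertise_hcall_multi_tce();
	/*
	 * Otherwise leave "hcall-multi-tce" out of "ibm,hypertas-functions"
	 * so the guest keeps using plain H_PUT_TCE.
	 */
}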

>
> There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet
> and may never get there.
>
>
>>> +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
>>> +the user space might have to advertise it for the guest. For example,
>>> +IBM pSeries guest starts using them if "hcall-multi-tce" is present in
>>> +the "ibm,hypertas-functions" device-tree property.
>>
>> This paragraph describes sPAPR. That's fine, but please document it as
>> such. Also please check your grammar.
>
>>> +
>>> +Without this capability, only H_PUT_TCE is handled by the kernel and
>>> +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
>>> +unless the capability is present as passing hypercalls to the userspace
>>> +slows operations a lot.
>>> +
>>> +Unlike other capabilities of this section, this one is always enabled.
>>
>> Why? Wouldn't that confuse older user space?
>
>
> How? Old user space won't check for this capability and won't tell the
> guest to use it (via "hcall-multi-tce"). Old H_PUT_TCE is still there.
>
> If the guest always uses H_PUT_TCE_INDIRECT/H_STUFF_TCE no matter what,
> then it is its problem - it won't work now anyway as neither QEMU nor host
> kernel supports these calls.

Always assume that you are a kernel developer without knowledge of any user space code using your interfaces. So there is the theoretical possibility that there is a user space client out there that implements H_PUT_TCE_INDIRECT and advertises hcall-multi-tce to the guest. Would that client break? If so, we should definitely have the CAP disabled by default.

But really, it's also as much about consistency as anything else. If we leave everything as is and always extend functionality by enabling new CAPs, we're pretty much guaranteed that we don't break anything by accident. It also makes debugging easier because you can for example disable this particular feature to see whether something has bad side effects.
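The usual mechanism for that is the KVM_ENABLE_CAP ioctl (per-vcpu at this point). A sketch of how user space would opt in, assuming (and this is only an assumption, not something this series does) that KVM_CAP_SPAPR_MULTITCE were wired through it:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int enable_multitce(int vcpu_fd)
{
	struct kvm_enable_cap cap;

	memset(&cap, 0, sizeof(cap));
	cap.cap = KVM_CAP_SPAPR_MULTITCE;	/* assumption: enabled per vcpu */

	return ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap);
}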

>
>
>
>>> +
>>> +
>>> 5. The kvm_run structure
>>> ------------------------
>>>
>>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>>> b/arch/powerpc/include/asm/kvm_host.h
>>> index af326cd..20d04bd 100644
>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>> @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
>>> struct kvm *kvm;
>>> u64 liobn;
>>> u32 window_size;
>>> + struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
>>
>> You don't need this.
>>
>>> struct page *pages[0];
>>> };
>>>
>>> @@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
>>> spinlock_t tbacct_lock;
>>> u64 busy_stolen;
>>> u64 busy_preempt;
>>> +
>>> + unsigned long *tce_tmp_hpas; /* TCE cache for TCE_PUT_INDIRECT
>>> hcall */
>>> + enum {
>>> + TCERM_NONE,
>>> + TCERM_GETPAGE,
>>> + TCERM_PUTTCE,
>>> + TCERM_PUTLIST,
>>> + } tce_rm_fail; /* failed stage of request processing */
>>> #endif
>>> };
>>>
>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h
>>> b/arch/powerpc/include/asm/kvm_ppc.h
>>> index a5287fe..fa722a0 100644
>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu
>>> *vcpu);
>>>
>>> extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>> struct kvm_create_spapr_tce *args);
>>> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>> - unsigned long ioba, unsigned long tce);
>>> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>>> + struct kvm_vcpu *vcpu, unsigned long liobn);
>>> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
>>> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>> + unsigned long ioba, unsigned long tce);
>>> +extern long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>> + unsigned long liobn, unsigned long ioba,
>>> + unsigned long tce);
>>> +extern long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>> + unsigned long liobn, unsigned long ioba,
>>> + unsigned long tce_list, unsigned long npages);
>>> +extern long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>> + unsigned long liobn, unsigned long ioba,
>>> + unsigned long tce_value, unsigned long npages);
>>> extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
>>> struct kvm_allocate_rma *rma);
>>> extern struct kvmppc_linear_info *kvm_alloc_rma(void);
>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c
>>> b/arch/powerpc/kvm/book3s_64_vio.c
>>> index b2d3f3b..99bf4e5 100644
>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>> @@ -14,6 +14,7 @@
>>> *
>>> * Copyright 2010 Paul Mackerras, IBM Corp.<[email protected]>
>>> * Copyright 2011 David Gibson, IBM Corporation<[email protected]>
>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<[email protected]>
>>> */
>>>
>>> #include<linux/types.h>
>>> @@ -36,8 +37,10 @@
>>> #include<asm/ppc-opcode.h>
>>> #include<asm/kvm_host.h>
>>> #include<asm/udbg.h>
>>> +#include<asm/iommu.h>
>>> +#include<asm/tce.h>
>>>
>>> -#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
>>> +#define ERROR_ADDR ((void *)~(unsigned long)0x0)
>>>
>>> static long kvmppc_stt_npages(unsigned long window_size)
>>> {
>>> @@ -50,6 +53,20 @@ static void release_spapr_tce_table(struct
>>> kvmppc_spapr_tce_table *stt)
>>> struct kvm *kvm = stt->kvm;
>>> int i;
>>>
>>> +#define __SV(x) stt->stat.x
>>> +#define __SVD(x) (__SV(rm.x)?(__SV(rm.x)-__SV(vm.x)):0)
>>> + pr_debug("%s stat for liobn=%llx\n"
>>> + "--------------- realmode ----- virtmode ---\n"
>>> + "put_tce %10ld %10ld\n"
>>> + "put_tce_indir %10ld %10ld\n"
>>> + "stuff_tce %10ld %10ld\n",
>>> + __func__, stt->liobn,
>>> + __SVD(put), __SV(vm.put),
>>> + __SVD(indir), __SV(vm.indir),
>>> + __SVD(stuff), __SV(vm.stuff));
>>> +#undef __SVD
>>> +#undef __SV
>>
>> All of these stat points should just be trace points. You can do the
>> statistic gathering from user space then.
>>
>>> +
>>> mutex_lock(&kvm->lock);
>>> list_del(&stt->list);
>>> for (i = 0; i< kvmppc_stt_npages(stt->window_size); i++)
>>> @@ -148,3 +165,138 @@ fail:
>>> }
>>> return ret;
>>> }
>>> +
>>> +/* Converts guest physical address to host virtual address */
>>> +static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
>>
>> Please don't distinguish _vm versions. They're the normal case. _rm ones
>> are the special ones.
>>
>>> + unsigned long gpa, struct page **pg)
>>> +{
>>> + unsigned long hva, gfn = gpa>> PAGE_SHIFT;
>>> + struct kvm_memory_slot *memslot;
>>> +
>>> + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>> + if (!memslot)
>>> + return ERROR_ADDR;
>>> +
>>> + hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa& ~PAGE_MASK);
>>
>> s/+/|/
>>
>>> +
>>> + if (get_user_pages_fast(hva& PAGE_MASK, 1, 0, pg) != 1)
>>> + return ERROR_ADDR;
>>> +
>>> + return (void *) hva;
>>> +}
>>> +
>>> +long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>> + unsigned long liobn, unsigned long ioba,
>>> + unsigned long tce)
>>> +{
>>> + long ret;
>>> + struct kvmppc_spapr_tce_table *tt;
>>> +
>>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>>> + /* Didn't find the liobn, put it to userspace */
>>
>> Unclear comment.
>
>
> What detail is missing?

Grammar wise "it" in the second half of the sentence refers to liobn. So you "put" the "liobn to userspace". That sentence doesn't make any sense.

What you really want to say is:

/* Couldn't find the liobn. Something went wrong. Let user space handle the hypercall. That has better ways of dealing with errors. */

>
>
>>> + if (!tt)
>>> + return H_TOO_HARD;
>>> +
>>> + ++tt->stat.vm.put;
>>> +
>>> + if (ioba>= tt->window_size)
>>> + return H_PARAMETER;
>>> +
>>> + ret = kvmppc_emulated_validate_tce(tce);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + kvmppc_emulated_put_tce(tt, ioba, tce);
>>> +
>>> + return H_SUCCESS;
>>> +}
>>> +
>>> +long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>> + unsigned long liobn, unsigned long ioba,
>>> + unsigned long tce_list, unsigned long npages)
>>> +{
>>> + struct kvmppc_spapr_tce_table *tt;
>>> + long i, ret = H_SUCCESS;
>>> + unsigned long __user *tces;
>>> + struct page *pg = NULL;
>>> +
>>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>>> + /* Didn't find the liobn, put it to userspace */
>>> + if (!tt)
>>> + return H_TOO_HARD;
>>> +
>>> + ++tt->stat.vm.indir;
>>> +
>>> + /*
>>> + * The spec says that the maximum size of the list is 512 TCEs so
>>> + * so the whole table addressed resides in 4K page
>>
>> so so?
>>
>>> + */
>>> + if (npages> 512)
>>> + return H_PARAMETER;
>>> +
>>> + if (tce_list& ~IOMMU_PAGE_MASK)
>>> + return H_PARAMETER;
>>> +
>>> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size)
>>> + return H_PARAMETER;
>>> +
>>> + tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list,&pg);
>>> + if (tces == ERROR_ADDR)
>>> + return H_TOO_HARD;
>>> +
>>> + if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
>>> + goto put_list_page_exit;
>>> +
>>> + for (i = 0; i< npages; ++i) {
>>> + if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
>>> + ret = H_PARAMETER;
>>> + goto put_list_page_exit;
>>> + }
>>> +
>>> + ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp_hpas[i]);
>>> + if (ret)
>>> + goto put_list_page_exit;
>>> + }
>>> +
>>> + for (i = 0; i< npages; ++i)
>>> + kvmppc_emulated_put_tce(tt, ioba + (i<< IOMMU_PAGE_SHIFT),
>>> + vcpu->arch.tce_tmp_hpas[i]);
>>> +put_list_page_exit:
>>> + if (pg)
>>> + put_page(pg);
>>> +
>>> + if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
>>> + vcpu->arch.tce_rm_fail = TCERM_NONE;
>>> + if (pg&& !PageCompound(pg))
>>> + put_page(pg); /* finish pending realmode_put_page() */
>>> + }
>>> +
>>> + return ret;
>>> +}
>>> +
>>> +long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>> + unsigned long liobn, unsigned long ioba,
>>> + unsigned long tce_value, unsigned long npages)
>>> +{
>>> + struct kvmppc_spapr_tce_table *tt;
>>> + long i, ret;
>>> +
>>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>>> + /* Didn't find the liobn, put it to userspace */
>>> + if (!tt)
>>> + return H_TOO_HARD;
>>> +
>>> + ++tt->stat.vm.stuff;
>>> +
>>> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size)
>>> + return H_PARAMETER;
>>> +
>>> + ret = kvmppc_emulated_validate_tce(tce_value);
>>> + if (ret || (tce_value& (TCE_PCI_WRITE | TCE_PCI_READ)))
>>> + return H_PARAMETER;
>>> +
>>> + for (i = 0; i< npages; ++i, ioba += IOMMU_PAGE_SIZE)
>>> + kvmppc_emulated_put_tce(tt, ioba, tce_value);
>>> +
>>> + return H_SUCCESS;
>>> +}
>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>> b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>> index 30c2f3b..cd3e6f9 100644
>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>> @@ -14,6 +14,7 @@
>>> *
>>> * Copyright 2010 Paul Mackerras, IBM Corp.<[email protected]>
>>> * Copyright 2011 David Gibson, IBM Corporation<[email protected]>
>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<[email protected]>
>>> */
>>>
>>> #include<linux/types.h>
>>> @@ -35,42 +36,243 @@
>>> #include<asm/ppc-opcode.h>
>>> #include<asm/kvm_host.h>
>>> #include<asm/udbg.h>
>>> +#include<asm/iommu.h>
>>> +#include<asm/tce.h>
>>>
>>> #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
>>> +#define ERROR_ADDR (~(unsigned long)0x0)
>>>
>>> -/* WARNING: This will be called in real-mode on HV KVM and virtual
>>> - * mode on PR KVM
>>
>> What's wrong with the warning?
>
>
> It belongs to kvmppc_h_put_tce() which is not called in virtual mode anymore.

I thought the comment applied to the whole file before? Hrm. Maybe I misread it then.

> It is technically correct for kvmppc_find_tce_table() though. Should I put
> this comment before every function which may be called from real and
> virtual modes?

Yes, please. Otherwise someone might stick an access to a non-linear address in there by accident.

>
>
>
>>> +/*
>>> + * Finds a TCE table descriptor by LIOBN
>>> */
>>> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
>>> + unsigned long liobn)
>>> +{
>>> + struct kvmppc_spapr_tce_table *tt;
>>> +
>>> + list_for_each_entry(tt,&vcpu->kvm->arch.spapr_tce_tables, list) {
>>> + if (tt->liobn == liobn)
>>> + return tt;
>>> + }
>>> +
>>> + return NULL;
>>> +}
>>> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
>>> +
>>> +#ifdef DEBUG
>>> +/*
>>> + * Lets user mode disable realmode handlers by putting big number
>>> + * in the bottom value of LIOBN
>>
>> What? Seriously? Just don't enable the CAP.
>
>
> It is under DEBUG. It really, really helps to be able to disable real mode
> handlers without reboot. Ok, no debug code, I'll remove.

Debug code is good, but #ifdefs are bad. For you, an #ifdef reads like "code that doesn't do any harm when disabled". For me, #ifdefs read "code that definitely breaks because nobody turns the #define on".

So please, avoid #ifdef'ed code whenever possible. Switching the CAP on and off is a much better debug approach in this case.
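Just to make the alternative concrete, a sketch of the runtime variant. The multitce_enabled flag is hypothetical, something the capability machinery would set; the handlers themselves stay #ifdef-free:

static inline long kvmppc_rm_tce_enabled(struct kvm_vcpu *vcpu)
{
	/* Hypothetical per-VM flag, flipped when the CAP is (un)set */
	if (!vcpu->kvm->arch.multitce_enabled)
		return H_TOO_HARD;	/* punt to virtual mode / user space */

	return H_SUCCESS;
}

Each real mode handler would then start with a kvmppc_rm_tce_enabled() check, and disabling the fast path for debugging becomes a runtime switch rather than a rebuild.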

>
>
>>> + */
>>> +#define kvmppc_find_tce_table(a, b) \
>>> + ((((b)&0xffff)>10000)?NULL:kvmppc_find_tce_table((a), (b)))
>>> +#endif
>>> +
>>> +/*
>>> + * Validates TCE address.
>>> + * At the moment only flags are validated as other checks will
>>> significantly slow
>>> + * down or can make it even impossible to handle TCE requests in real mode.
>>
>> What?
>
>
> What is missing here (besides good english)?

What badness could slip through by not validating everything?

>
>
>>> + */
>>> +long kvmppc_emulated_validate_tce(unsigned long tce)
>>
>> I don't like the naming scheme. Please turn this around and make it
>> kvmppc_tce_validate().
>
>
> Oh. "Like"... Ok.

Yes. Like.

>
>
>>> +{
>>> + if (tce& ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
>>> + return H_PARAMETER;
>>> +
>>> + return H_SUCCESS;
>>> +}
>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
>>> +
>>> +/*
>>> + * Handles TCE requests for QEMU emulated devices.
>>
>> We still don't mention QEMU in KVM code. And does it really matter whether
>> they're emulated by QEMU? Devices could also be emulated by KVM.
>>
>>> + * Puts guest TCE values to the table and expects QEMU to convert them
>>> + * later in a QEMU device implementation.
>>> + * Called in both real and virtual modes.
>>> + * Cannot fail so kvmppc_emulated_validate_tce must be called before it.
>>> + */
>>> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>
>> kvmppc_tce_put()
>>
>>> + unsigned long ioba, unsigned long tce)
>>> +{
>>> + unsigned long idx = ioba>> SPAPR_TCE_SHIFT;
>>> + struct page *page;
>>> + u64 *tbl;
>>> +
>>> + /*
>>> + * Note on the use of page_address() in real mode,
>>> + *
>>> + * It is safe to use page_address() in real mode on ppc64 because
>>> + * page_address() is always defined as lowmem_page_address()
>>> + * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
>>> + * operation and does not access page struct.
>>> + *
>>> + * Theoretically page_address() could be defined different
>>> + * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
>>> + * should be enabled.
>>> + * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
>>> + * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
>>> + * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
>>> + * is not expected to be enabled on ppc32, page_address()
>>> + * is safe for ppc32 as well.
>>> + */
>>> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
>>> +#error TODO: fix to avoid page_address() here
>>> +#endif
>>
>> Can you extract the text above, the check and the page_address call into a
>> simple wrapper function?
>
>
> Is this function also too big? Sorry, I do not understand the comment.

All of the comment and #if here only deal with the fact that you have a real mode hack to call page_address() that happens to work under specific circumstances.

There's nothing kvmppc_tce_put() specific about this. The page_address() code happens to get called here, sure. But if I read the kvmppc_tce_put() function I don't care about these details - I want to understand the code flow that ends up writing the TCE.
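For example, a sketch of such a wrapper (the kvmppc_page_address() name is made up here; the justification, the build-time guard and the TCE write itself are all from the patch, and kvmppc_tce_put() is the name suggested above):

/* Real mode safe only because page_address() is lowmem_page_address(),
 * i.e. pure pfn arithmetic that never touches the page struct. */
static u64 *kvmppc_page_address(struct page *page)
{
#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
#error TODO: fix to avoid page_address() here
#endif
	return (u64 *) page_address(page);
}

void kvmppc_tce_put(struct kvmppc_spapr_tce_table *tt,
		    unsigned long ioba, unsigned long tce)
{
	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
	u64 *tbl = kvmppc_page_address(tt->pages[idx / TCES_PER_PAGE]);

	tbl[idx % TCES_PER_PAGE] = tce;
}

That way the real mode justification lives next to the hack it justifies, and kvmppc_tce_put() only shows the flow that writes the TCE.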

>
>
>>> + page = tt->pages[idx / TCES_PER_PAGE];
>>> + tbl = (u64 *)page_address(page);
>>> +
>>> + /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>>
>> This is not an RFC, is it?
>
>
> Any debug code is prohibited? Ok, I'll remove.

Debug code that requires code changes is prohibited, yes. Debug code that is runtime switchable (pr_debug, trace points, etc) is allowed.

>
>
>>> + tbl[idx % TCES_PER_PAGE] = tce;
>>> +}
>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
>>> +
>>> +#ifdef CONFIG_KVM_BOOK3S_64_HV
>>> +/*
>>> + * Converts guest physical address to host physical address.
>>> + * Tries to increase page counter via realmode_get_page() and
>>> + * returns ERROR_ADDR if failed.
>>> + */
>>> +static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
>>> + unsigned long gpa, struct page **pg)
>>> +{
>>> + struct kvm_memory_slot *memslot;
>>> + pte_t *ptep, pte;
>>> + unsigned long hva, hpa = ERROR_ADDR;
>>> + unsigned long gfn = gpa>> PAGE_SHIFT;
>>> + unsigned shift = 0;
>>> +
>>> + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>> + if (!memslot)
>>> + return ERROR_ADDR;
>>> +
>>> + hva = __gfn_to_hva_memslot(memslot, gfn);
>>> +
>>> + ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva,&shift);
>>> + if (!ptep || !pte_present(*ptep))
>>> + return ERROR_ADDR;
>>> + pte = *ptep;
>>> +
>>> + if (((gpa& TCE_PCI_WRITE) || pte_write(pte))&& !pte_dirty(pte))
>>> + return ERROR_ADDR;
>>> +
>>> + if (!pte_young(pte))
>>> + return ERROR_ADDR;
>>> +
>>> + if (!shift)
>>> + shift = PAGE_SHIFT;
>>> +
>>> + /* Put huge pages handling to the virtual mode */
>>> + if (shift> PAGE_SHIFT)
>>> + return ERROR_ADDR;
>>> +
>>> + *pg = realmode_pfn_to_page(pte_pfn(pte));
>>> + if (!*pg || realmode_get_page(*pg))
>>> + return ERROR_ADDR;
>>> +
>>> + /* pte_pfn(pte) returns address aligned to pg_size */
>>> + hpa = (pte_pfn(pte)<< PAGE_SHIFT) + (gpa& ((1<< shift) - 1));
>>> +
>>> + if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>> + hpa = ERROR_ADDR;
>>> + realmode_put_page(*pg);
>>> + *pg = NULL;
>>> + }
>>> +
>>> + return hpa;
>>> +}
>>> +
>>> long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>> unsigned long ioba, unsigned long tce)
>>> {
>>> - struct kvm *kvm = vcpu->kvm;
>>> - struct kvmppc_spapr_tce_table *stt;
>>> -
>>> - /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>> - /* liobn, ioba, tce); */
>>> -
>>> - list_for_each_entry(stt,&kvm->arch.spapr_tce_tables, list) {
>>> - if (stt->liobn == liobn) {
>>> - unsigned long idx = ioba>> SPAPR_TCE_SHIFT;
>>> - struct page *page;
>>> - u64 *tbl;
>>> -
>>> - /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p
>>> window_size=0x%x\n", */
>>> - /* liobn, stt, stt->window_size); */
>>> - if (ioba>= stt->window_size)
>>> - return H_PARAMETER;
>>> -
>>> - page = stt->pages[idx / TCES_PER_PAGE];
>>> - tbl = (u64 *)page_address(page);
>>> -
>>> - /* FIXME: Need to validate the TCE itself */
>>> - /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>>> - tbl[idx % TCES_PER_PAGE] = tce;
>>> - return H_SUCCESS;
>>> - }
>>> + long ret;
>>> + struct kvmppc_spapr_tce_table *tt = kvmppc_find_tce_table(vcpu, liobn);
>>> +
>>> + if (!tt)
>>> + return H_TOO_HARD;
>>> +
>>> + ++tt->stat.rm.put;
>>> +
>>> + if (ioba>= tt->window_size)
>>> + return H_PARAMETER;
>>> +
>>> + ret = kvmppc_emulated_validate_tce(tce);
>>> + if (!ret)
>>> + kvmppc_emulated_put_tce(tt, ioba, tce);
>>> +
>>> + return ret;
>>> +}
>>> +
>>> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>
>> So the _vm version is the normal one and this is the _rm version? If so,
>> please mark it as such. Is there any way to generate both from the same
>> source? The way it's now there is a lot of duplicate code.
>
>
> I tried, looked very ugly. If you insist, I will do so.

If it looks ugly, better don't. I just want to make sure you explored the option. But please keep the naming scheme consistent.


Alex

2013-07-10 10:33:42

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling


On 10.07.2013, at 01:29, Alexey Kardashevskiy wrote:

> On 07/10/2013 03:32 AM, Alexander Graf wrote:
>> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>>> This adds special support for huge pages (16MB). The reference
>>> counting cannot be easily done for such pages in real mode (when
>>> MMU is off) so we added a list of huge pages. It is populated in
>>> virtual mode and get_page is called just once per a huge page.
>>> Real mode handlers check if the requested page is huge and in the list,
>>> then no reference counting is done, otherwise an exit to virtual mode
>>> happens. The list is released at KVM exit. At the moment the fastest
>>> card available for tests uses up to 9 huge pages so walking through this
>>> list is not very expensive. However this can change and we may want
>>> to optimize this.
>>>
>>> Signed-off-by: Paul Mackerras<[email protected]>
>>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>>
>>> ---
>>>
>>> Changes:
>>> 2013/06/27:
>>> * list of huge pages replaces with hashtable for better performance
>>
>> So the only thing your patch description really talks about is not true
>> anymore?
>>
>>> * spinlock removed from real mode and only protects insertion of new
>>> huge page descriptors into the hashtable
>>>
>>> 2013/06/05:
>>> * fixed compile error when CONFIG_IOMMU_API=n
>>>
>>> 2013/05/20:
>>> * the real mode handler now searches for a huge page by gpa (used to be pte)
>>> * the virtual mode handler prints warning if it is called twice for the same
>>> huge page as the real mode handler is expected to fail just once - when a
>>> huge
>>> page is not in the list yet.
>>> * the huge page is refcounted twice - when added to the hugepage list and
>>> when used in the virtual mode hcall handler (can be optimized but it will
>>> make the patch less nice).
>>>
>>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>> ---
>>> arch/powerpc/include/asm/kvm_host.h | 25 +++++++++
>>> arch/powerpc/kernel/iommu.c | 6 ++-
>>> arch/powerpc/kvm/book3s_64_vio.c | 104
>>> +++++++++++++++++++++++++++++++++---
>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 21 ++++++--
>>> 4 files changed, 146 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>>> b/arch/powerpc/include/asm/kvm_host.h
>>> index 53e61b2..a7508cf 100644
>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>> @@ -30,6 +30,7 @@
>>> #include<linux/kvm_para.h>
>>> #include<linux/list.h>
>>> #include<linux/atomic.h>
>>> +#include<linux/hashtable.h>
>>> #include<asm/kvm_asm.h>
>>> #include<asm/processor.h>
>>> #include<asm/page.h>
>>> @@ -182,10 +183,34 @@ struct kvmppc_spapr_tce_table {
>>> u32 window_size;
>>> struct iommu_group *grp; /* used for IOMMU groups */
>>> struct vfio_group *vfio_grp; /* used for IOMMU groups */
>>> + DECLARE_HASHTABLE(hash_tab, ilog2(64)); /* used for IOMMU groups */
>>> + spinlock_t hugepages_write_lock; /* used for IOMMU groups */
>>> struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
>>> struct page *pages[0];
>>> };
>>>
>>> +/*
>>> + * The KVM guest can be backed with 16MB pages.
>>> + * In this case, we cannot do page counting from the real mode
>>> + * as the compound pages are used - they are linked in a list
>>> + * with pointers as virtual addresses which are inaccessible
>>> + * in real mode.
>>> + *
>>> + * The code below keeps a 16MB pages list and uses page struct
>>> + * in real mode if it is already locked in RAM and inserted into
>>> + * the list or switches to the virtual mode where it can be
>>> + * handled in a usual manner.
>>> + */
>>> +#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa) hash_32(gpa>> 24, 32)
>>> +
>>> +struct kvmppc_spapr_iommu_hugepage {
>>> + struct hlist_node hash_node;
>>> + unsigned long gpa; /* Guest physical address */
>>> + unsigned long hpa; /* Host physical address */
>>> + struct page *page; /* page struct of the very first subpage */
>>> + unsigned long size; /* Huge page size (always 16MB at the moment) */
>>> +};
>>> +
>>> struct kvmppc_linear_info {
>>> void *base_virt;
>>> unsigned long base_pfn;
>>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>>> index 51678ec..e0b6eca 100644
>>> --- a/arch/powerpc/kernel/iommu.c
>>> +++ b/arch/powerpc/kernel/iommu.c
>>> @@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned
>>> long entry,
>>> if (!pg) {
>>> ret = -EAGAIN;
>>> } else if (PageCompound(pg)) {
>>> - ret = -EAGAIN;
>>> + /* Hugepages will be released at KVM exit */
>>> + ret = 0;
>>> } else {
>>> if (oldtce& TCE_PCI_WRITE)
>>> SetPageDirty(pg);
>>> @@ -1009,6 +1010,9 @@ int iommu_free_tces(struct iommu_table *tbl,
>>> unsigned long entry,
>>> struct page *pg = pfn_to_page(oldtce>> PAGE_SHIFT);
>>> if (!pg) {
>>> ret = -EAGAIN;
>>> + } else if (PageCompound(pg)) {
>>> + /* Hugepages will be released at KVM exit */
>>> + ret = 0;
>>> } else {
>>> if (oldtce& TCE_PCI_WRITE)
>>> SetPageDirty(pg);
>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c
>>> b/arch/powerpc/kvm/book3s_64_vio.c
>>> index 2b51f4a..c037219 100644
>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>> @@ -46,6 +46,40 @@
>>>
>>> #define ERROR_ADDR ((void *)~(unsigned long)0x0)
>>>
>>> +#ifdef CONFIG_IOMMU_API
>>
>> Can't you just make CONFIG_IOMMU_API mandatory in Kconfig?
>
>
> Sure I can. I can do anything. Why should I?

To get rid of #ifdef's. They make code hard to maintain.

> Do I have to do that to get
> this accepted? I do not understand this comment. It has already been
> discussed how to enable this option.
>
>
>>> +static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
>>> +{
>>> + spin_lock_init(&tt->hugepages_write_lock);
>>> + hash_init(tt->hash_tab);
>>> +}
>>> +
>>> +static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table
>>> *tt)
>>> +{
>>> + int bkt;
>>> + struct kvmppc_spapr_iommu_hugepage *hp;
>>> + struct hlist_node *tmp;
>>> +
>>> + spin_lock(&tt->hugepages_write_lock);
>>> + hash_for_each_safe(tt->hash_tab, bkt, tmp, hp, hash_node) {
>>> + pr_debug("Release HP liobn=%llx #%u gpa=%lx hpa=%lx size=%ld\n",
>>> + tt->liobn, bkt, hp->gpa, hp->hpa, hp->size);
>>
>> trace point
>>
>>> + hlist_del_rcu(&hp->hash_node);
>>> +
>>> + put_page(hp->page);
>>
>> Don't you have to mark them dirty?
>
>
> get_user_pages_fast is called with writing==1. Doesn't it do the same?

It's not exactly obvious that you're calling it with writing == 1 :). Can you create a new local variable "is_write" in the calling function, set that to 1 before the call to get_user_pages_fast and pass it in instead of the 1? The compiler should easily optimize all of that away, but it makes the code by far easier to read.
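Something like this is all that is meant; kvmppc_get_hugepage() is a made-up name standing in for the hugepage patch's pinning code, the only point is the named flag:

static int kvmppc_get_hugepage(unsigned long hva, struct page **pg)
{
	int is_write = 1;	/* the 16MB page is mapped for DMA writes */

	if (get_user_pages_fast(hva & PAGE_MASK, 1, is_write, pg) != 1)
		return -EFAULT;

	return 0;
}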

>
>>
>>> + kfree(hp);
>>> + }
>>> + spin_unlock(&tt->hugepages_write_lock);
>>> +}
>>> +#else
>>> +static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
>>> +{
>>> +}
>>> +
>>> +static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table
>>> *tt)
>>> +{
>>> +}
>>> +#endif /* CONFIG_IOMMU_API */
>>> +
>>> static long kvmppc_stt_npages(unsigned long window_size)
>>> {
>>> return ALIGN((window_size>> SPAPR_TCE_SHIFT)
>>> @@ -112,6 +146,7 @@ static void release_spapr_tce_table(struct
>>> kvmppc_spapr_tce_table *stt)
>>>
>>> mutex_lock(&kvm->lock);
>>> list_del(&stt->list);
>>> + kvmppc_iommu_hugepages_cleanup(stt);
>>>
>>> #ifdef CONFIG_IOMMU_API
>>> if (stt->grp) {
>>> @@ -200,6 +235,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>> kvm_get_kvm(kvm);
>>>
>>> mutex_lock(&kvm->lock);
>>> + kvmppc_iommu_hugepages_init(stt);
>>> list_add(&stt->list,&kvm->arch.spapr_tce_tables);
>>>
>>> mutex_unlock(&kvm->lock);
>>> @@ -283,6 +319,7 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm
>>> *kvm,
>>>
>>> kvm_get_kvm(kvm);
>>> mutex_lock(&kvm->lock);
>>> + kvmppc_iommu_hugepages_init(tt);
>>> list_add(&tt->list,&kvm->arch.spapr_tce_tables);
>>> mutex_unlock(&kvm->lock);
>>>
>>> @@ -307,10 +344,17 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm
>>> *kvm,
>>>
>>> /* Converts guest physical address to host virtual address */
>>> static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
>>> + struct kvmppc_spapr_tce_table *tt,
>>> unsigned long gpa, struct page **pg, unsigned long *hpa)
>>> {
>>> unsigned long hva, gfn = gpa>> PAGE_SHIFT;
>>> struct kvm_memory_slot *memslot;
>>> +#ifdef CONFIG_IOMMU_API
>>> + struct kvmppc_spapr_iommu_hugepage *hp;
>>> + unsigned key = KVMPPC_SPAPR_HUGEPAGE_HASH(gpa);
>>> + pte_t *ptep;
>>> + unsigned int shift = 0;
>>> +#endif
>>>
>>> memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>> if (!memslot)
>>> @@ -325,6 +369,54 @@ static void __user
>>> *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
>>> *hpa = __pa((unsigned long) page_address(*pg)) +
>>> (hva& ~PAGE_MASK);
>>>
>>> +#ifdef CONFIG_IOMMU_API
>>
>> This function is becoming incredibly large. Please split it up. Also please
>> document the code.
>
>
> Less than 100 lines is incredibly large? There are _many_ functions bigger
> than that. I do not really see the point in making a separate function
> which is going to be called only once.

Anything above 20 lines is too big usually, with very few exceptions. As a mnemonic, you can always imagine Linus sitting there with an 80x25 UNIX terminal, reading your code ;).


Alex

2013-07-10 10:39:39

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

On Wed, 2013-07-10 at 12:33 +0200, Alexander Graf wrote:
>
> It's not exactly obvious that you're calling it with writing == 1 :).
> Can you create a new local variable "is_write" in the calling
> function, set that to 1 before the call to get_user_pages_fast and
> pass it in instead of the 1? The compiler should easily optimize all
> of that away, but it makes the code by far easier to read.

Ugh ?

Nobody else does that .... (look at futex :-)

Ben.

2013-07-10 10:40:35

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling


On 10.07.2013, at 12:39, Benjamin Herrenschmidt wrote:

> On Wed, 2013-07-10 at 12:33 +0200, Alexander Graf wrote:
>>
>> It's not exactly obvious that you're calling it with writing == 1 :).
>> Can you create a new local variable "is_write" in the calling
>> function, set that to 1 before the call to get_user_pages_fast and
>> pass it in instead of the 1? The compiler should easily optimize all
>> of that away, but it makes the code by far easier to read.
>
> Ugh ?
>
> Nobody else does that .... (look at futex :-)

Yeah, that's fortunately code that I don't have to read :).


Alex

2013-07-10 10:42:10

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling


On 10.07.2013, at 12:40, Alexander Graf wrote:

>
> On 10.07.2013, at 12:39, Benjamin Herrenschmidt wrote:
>
>> On Wed, 2013-07-10 at 12:33 +0200, Alexander Graf wrote:
>>>
>>> It's not exactly obvious that you're calling it with writing == 1 :).
>>> Can you create a new local variable "is_write" in the calling
>>> function, set that to 1 before the call to get_user_pages_fast and
>>> pass it in instead of the 1? The compiler should easily optimize all
>>> of that away, but it makes the code by far easier to read.
>>
>> Ugh ?
>>
>> Nobody else does that .... (look at futex :-)
>
> Yeah, that's fortunately code that I don't have to read :).

The "proper" alternative would be to pass an enum for read/write into the function rather than an int. But that'd be a pretty controversial, big change that I'd rather not put on Alexey. With a local variable we're nicely self-contained readable ;)


Alex

2013-07-11 05:12:57

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

On 07/10/2013 08:05 PM, Alexander Graf wrote:
>
> On 10.07.2013, at 07:00, Alexey Kardashevskiy wrote:
>
>> On 07/10/2013 03:02 AM, Alexander Graf wrote:
>>> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>>> devices or emulated PCI. These calls allow adding multiple entries
>>>> (up to 512) into the TCE table in one call which saves time on
>>>> transition to/from real mode.
>>>
>>> We don't mention QEMU explicitly in KVM code usually.
>>>
>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>> (copied from user and verified) before writing the whole list into
>>>> the TCE table. This cache will be utilized more in the upcoming
>>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>>> mode in the case if the real mode handler failed for some reason.
>>>>
>>>> This adds a guest physical to host real address converter
>>>> and calls the existing H_PUT_TCE handler. The converting function
>>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>>>
>>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>>> so in order to support the functionality of this patch, QEMU
>>>> needs to query for this capability and set the "hcall-multi-tce"
>>>> hypertas property only if the capability is present, otherwise
>>>> there will be serious performance degradation.
>>>
>>> Same as above. But really you're only giving recommendations here. What's
>>> the point? Please describe what the benefit of this patch is, not what some
>>> other random subsystem might do with the benefits it brings.
>>>
>>>>
>>>> Signed-off-by: Paul Mackerras<[email protected]>
>>>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>>>
>>>> ---
>>>> Changelog:
>>>> 2013/07/06:
>>>> * fixed number of wrong get_page()/put_page() calls
>>>>
>>>> 2013/06/27:
>>>> * fixed clear of BUSY bit in kvmppc_lookup_pte()
>>>> * H_PUT_TCE_INDIRECT does realmode_get_page() now
>>>> * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
>>>> * updated doc
>>>>
>>>> 2013/06/05:
>>>> * fixed mistype about IBMVIO in the commit message
>>>> * updated doc and moved it to another section
>>>> * changed capability number
>>>>
>>>> 2013/05/21:
>>>> * added kvm_vcpu_arch::tce_tmp
>>>> * removed cleanup if put_indirect failed, instead we do not even start
>>>> writing to TCE table if we cannot get TCEs from the user and they are
>>>> invalid
>>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>>> and kvmppc_emulated_validate_tce (for the previous item)
>>>> * fixed bug with failthrough for H_IPI
>>>> * removed all get_user() from real mode handlers
>>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>>> ---
>>>> Documentation/virtual/kvm/api.txt | 25 +++
>>>> arch/powerpc/include/asm/kvm_host.h | 9 ++
>>>> arch/powerpc/include/asm/kvm_ppc.h | 16 +-
>>>> arch/powerpc/kvm/book3s_64_vio.c | 154 ++++++++++++++++++-
>>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 260
>>>> ++++++++++++++++++++++++++++----
>>>> arch/powerpc/kvm/book3s_hv.c | 41 ++++-
>>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
>>>> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++-
>>>> arch/powerpc/kvm/powerpc.c | 3 +
>>>> 9 files changed, 517 insertions(+), 34 deletions(-)
>>>>
>>>> diff --git a/Documentation/virtual/kvm/api.txt
>>>> b/Documentation/virtual/kvm/api.txt
>>>> index 6365fef..762c703 100644
>>>> --- a/Documentation/virtual/kvm/api.txt
>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>> @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed
>>>> to userspace to be
>>>> handled.
>>>>
>>>>
>>>> +4.86 KVM_CAP_PPC_MULTITCE
>>>> +
>>>> +Capability: KVM_CAP_PPC_MULTITCE
>>>> +Architectures: ppc
>>>> +Type: vm
>>>> +
>>>> +This capability means the kernel is capable of handling hypercalls
>>>> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
>>>> +space. This significanly accelerates DMA operations for PPC KVM guests.
>>>
>>> significanly? Please run this through a spell checker.
>>>
>>>> +The user space should expect that its handlers for these hypercalls
>>>
>>> s/The//
>>>
>>>> +are not going to be called.
>>>
>>> Is user space guaranteed they will not be called? Or can it still happen?
>>
>> ... if user space previously registered LIOBN in KVM (via
>> KVM_CREATE_SPAPR_TCE or similar calls).
>>
>> ok?
>
> How about this?
>
> The hypercalls mentioned above may or may not be processed successfully in the kernel-based fast path. If they cannot be handled by the kernel, they will get passed on to user space. So user space still has to have an implementation for these despite the in-kernel acceleration.
>
> ---
>
> The target audience for this documentation is user space KVM API users: someone developing kvm tool, for example. They want to know what implications specific CAPs have.
>
>>
>> There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet
>> and may never get there.
>>
>>
>>>> +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
>>>> +the user space might have to advertise it for the guest. For example,
>>>> +IBM pSeries guest starts using them if "hcall-multi-tce" is present in
>>>> +the "ibm,hypertas-functions" device-tree property.
>>>
>>> This paragraph describes sPAPR. That's fine, but please document it as
>>> such. Also please check your grammar.
>>
>>>> +
>>>> +Without this capability, only H_PUT_TCE is handled by the kernel and
>>>> +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
>>>> +unless the capability is present as passing hypercalls to the userspace
>>>> +slows operations a lot.
>>>> +
>>>> +Unlike other capabilities of this section, this one is always enabled.
>>>
>>> Why? Wouldn't that confuse older user space?
>>
>>
>> How? Old user space won't check for this capability and won't tell the
>> guest to use it (via "hcall-multi-tce"). Old H_PUT_TCE is still there.
>>
>> If the guest always uses H_PUT_TCE_INDIRECT/H_STUFF_TCE no matter what,
>> then it is its problem - it won't work now anyway as neither QEMU nor host
>> kernel supports these calls.


> Always assume that you are a kernel developer without knowledge
> of any user space code using your interfaces. So there is the theoretical
> possibility that there is a user space client out there that implements
> H_PUT_TCE_INDIRECT and advertises hcall-multi-tce to the guest.
> Would that client break? If so, we should definitely have
> the CAP disabled by default.


No, it won't break. Why would it break? I really do not get it. This user
space client has to do an extra step to get this acceleration by calling
ioctl(KVM_CREATE_SPAPR_TCE) anyway. Previously that ioctl only had an effect
on H_PUT_TCE; now it affects all three hcalls.
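For reference, a sketch of that extra step as seen from user space; the liobn and window_size values here are placeholders:

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int create_kernel_tce_table(int vm_fd, unsigned long liobn,
				   unsigned int window_size)
{
	struct kvm_create_spapr_tce args = {
		.liobn = liobn,
		.window_size = window_size,
	};

	/* Returns an fd on success; user space can mmap() it to read
	 * the TCE table that KVM maintains for this LIOBN. */
	return ioctl(vm_fd, KVM_CREATE_SPAPR_TCE, &args);
}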


> But really, it's also as much about consistency as anything else.
> If we leave everything as is and always extend functionality
> by enabling new CAPs, we're pretty much guaranteed that we
> don't break anything by accident. It also makes debugging easier
> because you can for example disable this particular feature
> to see whether something has bad side effects.


So I must add one more ioctl to enable in-kernel MULTITCE handling. Is that
what you are saying?

I can see KVM_CHECK_EXTENSION but I do not see KVM_ENABLE_EXTENSION or
anything like that.



>>>> +
>>>> +
>>>> 5. The kvm_run structure
>>>> ------------------------
>>>>
>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>>>> b/arch/powerpc/include/asm/kvm_host.h
>>>> index af326cd..20d04bd 100644
>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>> @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
>>>> struct kvm *kvm;
>>>> u64 liobn;
>>>> u32 window_size;
>>>> + struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
>>>
>>> You don't need this.
>>>
>>>> struct page *pages[0];
>>>> };
>>>>
>>>> @@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
>>>> spinlock_t tbacct_lock;
>>>> u64 busy_stolen;
>>>> u64 busy_preempt;
>>>> +
>>>> + unsigned long *tce_tmp_hpas; /* TCE cache for TCE_PUT_INDIRECT
>>>> hcall */
>>>> + enum {
>>>> + TCERM_NONE,
>>>> + TCERM_GETPAGE,
>>>> + TCERM_PUTTCE,
>>>> + TCERM_PUTLIST,
>>>> + } tce_rm_fail; /* failed stage of request processing */
>>>> #endif
>>>> };
>>>>
>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h
>>>> b/arch/powerpc/include/asm/kvm_ppc.h
>>>> index a5287fe..fa722a0 100644
>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>>> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu
>>>> *vcpu);
>>>>
>>>> extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>> struct kvm_create_spapr_tce *args);
>>>> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>> - unsigned long ioba, unsigned long tce);
>>>> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>>>> + struct kvm_vcpu *vcpu, unsigned long liobn);
>>>> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
>>>> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>>> + unsigned long ioba, unsigned long tce);
>>>> +extern long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>>> + unsigned long liobn, unsigned long ioba,
>>>> + unsigned long tce);
>>>> +extern long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>> + unsigned long liobn, unsigned long ioba,
>>>> + unsigned long tce_list, unsigned long npages);
>>>> +extern long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>> + unsigned long liobn, unsigned long ioba,
>>>> + unsigned long tce_value, unsigned long npages);
>>>> extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
>>>> struct kvm_allocate_rma *rma);
>>>> extern struct kvmppc_linear_info *kvm_alloc_rma(void);
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c
>>>> b/arch/powerpc/kvm/book3s_64_vio.c
>>>> index b2d3f3b..99bf4e5 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>> @@ -14,6 +14,7 @@
>>>> *
>>>> * Copyright 2010 Paul Mackerras, IBM Corp.<[email protected]>
>>>> * Copyright 2011 David Gibson, IBM Corporation<[email protected]>
>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<[email protected]>
>>>> */
>>>>
>>>> #include<linux/types.h>
>>>> @@ -36,8 +37,10 @@
>>>> #include<asm/ppc-opcode.h>
>>>> #include<asm/kvm_host.h>
>>>> #include<asm/udbg.h>
>>>> +#include<asm/iommu.h>
>>>> +#include<asm/tce.h>
>>>>
>>>> -#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
>>>> +#define ERROR_ADDR ((void *)~(unsigned long)0x0)
>>>>
>>>> static long kvmppc_stt_npages(unsigned long window_size)
>>>> {
>>>> @@ -50,6 +53,20 @@ static void release_spapr_tce_table(struct
>>>> kvmppc_spapr_tce_table *stt)
>>>> struct kvm *kvm = stt->kvm;
>>>> int i;
>>>>
>>>> +#define __SV(x) stt->stat.x
>>>> +#define __SVD(x) (__SV(rm.x)?(__SV(rm.x)-__SV(vm.x)):0)
>>>> + pr_debug("%s stat for liobn=%llx\n"
>>>> + "--------------- realmode ----- virtmode ---\n"
>>>> + "put_tce %10ld %10ld\n"
>>>> + "put_tce_indir %10ld %10ld\n"
>>>> + "stuff_tce %10ld %10ld\n",
>>>> + __func__, stt->liobn,
>>>> + __SVD(put), __SV(vm.put),
>>>> + __SVD(indir), __SV(vm.indir),
>>>> + __SVD(stuff), __SV(vm.stuff));
>>>> +#undef __SVD
>>>> +#undef __SV
>>>
>>> All of these stat points should just be trace points. You can do the
>>> statistic gathering from user space then.
>>>
>>>> +
>>>> mutex_lock(&kvm->lock);
>>>> list_del(&stt->list);
>>>> for (i = 0; i< kvmppc_stt_npages(stt->window_size); i++)
>>>> @@ -148,3 +165,138 @@ fail:
>>>> }
>>>> return ret;
>>>> }
>>>> +
>>>> +/* Converts guest physical address to host virtual address */
>>>> +static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
>>>
>>> Please don't distinguish _vm versions. They're the normal case. _rm ones
>>> are the special ones.
>>>
>>>> + unsigned long gpa, struct page **pg)
>>>> +{
>>>> + unsigned long hva, gfn = gpa>> PAGE_SHIFT;
>>>> + struct kvm_memory_slot *memslot;
>>>> +
>>>> + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>>> + if (!memslot)
>>>> + return ERROR_ADDR;
>>>> +
>>>> + hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa& ~PAGE_MASK);
>>>
>>> s/+/|/
>>>
>>>> +
>>>> + if (get_user_pages_fast(hva& PAGE_MASK, 1, 0, pg) != 1)
>>>> + return ERROR_ADDR;
>>>> +
>>>> + return (void *) hva;
>>>> +}
>>>> +
>>>> +long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>>> + unsigned long liobn, unsigned long ioba,
>>>> + unsigned long tce)
>>>> +{
>>>> + long ret;
>>>> + struct kvmppc_spapr_tce_table *tt;
>>>> +
>>>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>>>> + /* Didn't find the liobn, put it to userspace */
>>>
>>> Unclear comment.
>>
>>
>> What detail is missing?
>

> Grammar wise "it" in the second half of the sentence refers to liobn.
> So you "put" the "liobn to userspace". That sentence doesn't
> make any sense.


Removed it. H_TOO_HARD itself says enough already.


> What you really want to say is:
>
> /* Couldn't find the liobn. Something went wrong. Let user space handle the hypercall. That has better ways of dealing with errors. */
>
>>
>>
>>>> + if (!tt)
>>>> + return H_TOO_HARD;
>>>> +
>>>> + ++tt->stat.vm.put;
>>>> +
>>>> + if (ioba>= tt->window_size)
>>>> + return H_PARAMETER;
>>>> +
>>>> + ret = kvmppc_emulated_validate_tce(tce);
>>>> + if (ret)
>>>> + return ret;
>>>> +
>>>> + kvmppc_emulated_put_tce(tt, ioba, tce);
>>>> +
>>>> + return H_SUCCESS;
>>>> +}
>>>> +
>>>> +long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>> + unsigned long liobn, unsigned long ioba,
>>>> + unsigned long tce_list, unsigned long npages)
>>>> +{
>>>> + struct kvmppc_spapr_tce_table *tt;
>>>> + long i, ret = H_SUCCESS;
>>>> + unsigned long __user *tces;
>>>> + struct page *pg = NULL;
>>>> +
>>>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>>>> + /* Didn't find the liobn, put it to userspace */
>>>> + if (!tt)
>>>> + return H_TOO_HARD;
>>>> +
>>>> + ++tt->stat.vm.indir;
>>>> +
>>>> + /*
>>>> + * The spec says that the maximum size of the list is 512 TCEs so
>>>> + * so the whole table addressed resides in 4K page
>>>
>>> so so?
>>>
>>>> + */
>>>> + if (npages> 512)
>>>> + return H_PARAMETER;
>>>> +
>>>> + if (tce_list& ~IOMMU_PAGE_MASK)
>>>> + return H_PARAMETER;
>>>> +
>>>> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size)
>>>> + return H_PARAMETER;
>>>> +
>>>> + tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list,&pg);
>>>> + if (tces == ERROR_ADDR)
>>>> + return H_TOO_HARD;
>>>> +
>>>> + if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
>>>> + goto put_list_page_exit;
>>>> +
>>>> + for (i = 0; i< npages; ++i) {
>>>> + if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
>>>> + ret = H_PARAMETER;
>>>> + goto put_list_page_exit;
>>>> + }
>>>> +
>>>> + ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp_hpas[i]);
>>>> + if (ret)
>>>> + goto put_list_page_exit;
>>>> + }
>>>> +
>>>> + for (i = 0; i< npages; ++i)
>>>> + kvmppc_emulated_put_tce(tt, ioba + (i<< IOMMU_PAGE_SHIFT),
>>>> + vcpu->arch.tce_tmp_hpas[i]);
>>>> +put_list_page_exit:
>>>> + if (pg)
>>>> + put_page(pg);
>>>> +
>>>> + if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
>>>> + vcpu->arch.tce_rm_fail = TCERM_NONE;
>>>> + if (pg&& !PageCompound(pg))
>>>> + put_page(pg); /* finish pending realmode_put_page() */
>>>> + }
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> +long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>> + unsigned long liobn, unsigned long ioba,
>>>> + unsigned long tce_value, unsigned long npages)
>>>> +{
>>>> + struct kvmppc_spapr_tce_table *tt;
>>>> + long i, ret;
>>>> +
>>>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>>>> + /* Didn't find the liobn, put it to userspace */
>>>> + if (!tt)
>>>> + return H_TOO_HARD;
>>>> +
>>>> + ++tt->stat.vm.stuff;
>>>> +
>>>> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size)
>>>> + return H_PARAMETER;
>>>> +
>>>> + ret = kvmppc_emulated_validate_tce(tce_value);
>>>> + if (ret || (tce_value& (TCE_PCI_WRITE | TCE_PCI_READ)))
>>>> + return H_PARAMETER;
>>>> +
>>>> + for (i = 0; i< npages; ++i, ioba += IOMMU_PAGE_SIZE)
>>>> + kvmppc_emulated_put_tce(tt, ioba, tce_value);
>>>> +
>>>> + return H_SUCCESS;
>>>> +}
>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> index 30c2f3b..cd3e6f9 100644
>>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>> @@ -14,6 +14,7 @@
>>>> *
>>>> * Copyright 2010 Paul Mackerras, IBM Corp.<[email protected]>
>>>> * Copyright 2011 David Gibson, IBM Corporation<[email protected]>
>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<[email protected]>
>>>> */
>>>>
>>>> #include<linux/types.h>
>>>> @@ -35,42 +36,243 @@
>>>> #include<asm/ppc-opcode.h>
>>>> #include<asm/kvm_host.h>
>>>> #include<asm/udbg.h>
>>>> +#include<asm/iommu.h>
>>>> +#include<asm/tce.h>
>>>>
>>>> #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
>>>> +#define ERROR_ADDR (~(unsigned long)0x0)
>>>>
>>>> -/* WARNING: This will be called in real-mode on HV KVM and virtual
>>>> - * mode on PR KVM
>>>
>>> What's wrong with the warning?
>>
>>
>> It belongs to kvmppc_h_put_tce() which is not called in virtual mode anymore.
>
> I thought the comment applied to the whole file before? Hrm. Maybe I misread it then.
>
>> It is technically correct for kvmppc_find_tce_table() though. Should I put
>> this comment before every function which may be called from real and
>> virtual modes?
>
> Yes, please. Otherwise someone might stick an access to a non-linear address
> in there by accident.
>
>>
>>
>>
>>>> +/*
>>>> + * Finds a TCE table descriptor by LIOBN
>>>> */
>>>> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
>>>> + unsigned long liobn)
>>>> +{
>>>> + struct kvmppc_spapr_tce_table *tt;
>>>> +
>>>> + list_for_each_entry(tt,&vcpu->kvm->arch.spapr_tce_tables, list) {
>>>> + if (tt->liobn == liobn)
>>>> + return tt;
>>>> + }
>>>> +
>>>> + return NULL;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
>>>> +
>>>> +#ifdef DEBUG
>>>> +/*
>>>> + * Lets user mode disable realmode handlers by putting big number
>>>> + * in the bottom value of LIOBN
>>>
>>> What? Seriously? Just don't enable the CAP.
>>
>>
>> It is under DEBUG. It really, really helps to be able to disable real mode
>> handlers without reboot. Ok, no debug code, I'll remove.
>
> Debug code is good, but #ifdefs are bad. For you, an #ifdef reads like
> "code that doesn't do any hard when disabled". For me, #ifdefs read
> "code that definitely breaks because nobody turns the #define on".
>
> So please, avoid #ifdef'ed code whenever possible. Switching the CAP on and
> off is a much better debug approach in this case.
>
>>
>>
>>>> + */
>>>> +#define kvmppc_find_tce_table(a, b) \
>>>> + ((((b)&0xffff)>10000)?NULL:kvmppc_find_tce_table((a), (b)))
>>>> +#endif
>>>> +
>>>> +/*
>>>> + * Validates TCE address.
>>>> + * At the moment only flags are validated as other checks will
>>>> significantly slow
>>>> + * down or can make it even impossible to handle TCE requests in real mode.
>>>
>>> What?
>>
>>
>> What is missing here (besides good english)?
>
> What badness could slip through by not validating everything?


I cannot think of any good check which could be done in real mode and not
be "more than 2 calls deep" (c) Ben. Check that the page is allocated at
all? How? Don't know.



>>>> + */
>>>> +long kvmppc_emulated_validate_tce(unsigned long tce)
>>>
>>> I don't like the naming scheme. Please turn this around and make it
>>> kvmppc_tce_validate().
>>
>>
>> Oh. "Like"... Ok.
>
> Yes. Like.
>
>>
>>
>>>> +{
>>>> + if (tce& ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
>>>> + return H_PARAMETER;
>>>> +
>>>> + return H_SUCCESS;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
>>>> +
>>>> +/*
>>>> + * Handles TCE requests for QEMU emulated devices.
>>>
>>> We still don't mention QEMU in KVM code. And does it really matter whether
>>> they're emulated by QEMU? Devices could also be emulated by KVM.
>>>
>>>> + * Puts guest TCE values to the table and expects QEMU to convert them
>>>> + * later in a QEMU device implementation.
>>>> + * Called in both real and virtual modes.
>>>> + * Cannot fail so kvmppc_emulated_validate_tce must be called before it.
>>>> + */
>>>> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>>
>>> kvmppc_tce_put()
>>>
>>>> + unsigned long ioba, unsigned long tce)
>>>> +{
>>>> + unsigned long idx = ioba>> SPAPR_TCE_SHIFT;
>>>> + struct page *page;
>>>> + u64 *tbl;
>>>> +
>>>> + /*
>>>> + * Note on the use of page_address() in real mode,
>>>> + *
>>>> + * It is safe to use page_address() in real mode on ppc64 because
>>>> + * page_address() is always defined as lowmem_page_address()
>>>> + * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
>>>> + * operation and does not access page struct.
>>>> + *
>>>> + * Theoretically page_address() could be defined different
>>>> + * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
>>>> + * should be enabled.
>>>> + * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
>>>> + * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
>>>> + * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
>>>> + * is not expected to be enabled on ppc32, page_address()
>>>> + * is safe for ppc32 as well.
>>>> + */
>>>> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
>>>> +#error TODO: fix to avoid page_address() here
>>>> +#endif
>>>
>>> Can you extract the text above, the check and the page_address call into a
>>> simple wrapper function?
>>
>>
>> Is this function also too big? Sorry, I do not understand the comment.
>
> All of the comment and #if here only deal with the fact that you
> have a real mode hack to call page_address() that happens
> to work under specific circumstances.
>
> There's nothing kvmppc_tce_put() specific about this.
> The page_address() code happens to get called here, sure.
> But if I read the kvmppc_tce_put() function I don't care about
> these details - I want to understand the code flow that ends
> up writing the TCE.
>
>>>> + page = tt->pages[idx / TCES_PER_PAGE];
>>>> + tbl = (u64 *)page_address(page);
>>>> +
>>>> + /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>>>
>>> This is not an RFC, is it?
>>
>>
>> Any debug code is prohibited? Ok, I'll remove.
>
> Debug code that requires code changes is prohibited, yes.
> Debug code that is runtime switchable (pr_debug, trace points, etc)
> is allowed.


Is there any easy way to enable just this specific udbg_printf (not all of
them at once)? Trace points do not work in real mode as we figured out.


>>>> + tbl[idx % TCES_PER_PAGE] = tce;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
>>>> +
>>>> +#ifdef CONFIG_KVM_BOOK3S_64_HV
>>>> +/*
>>>> + * Converts guest physical address to host physical address.
>>>> + * Tries to increase page counter via realmode_get_page() and
>>>> + * returns ERROR_ADDR if failed.
>>>> + */
>>>> +static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
>>>> + unsigned long gpa, struct page **pg)
>>>> +{
>>>> + struct kvm_memory_slot *memslot;
>>>> + pte_t *ptep, pte;
>>>> + unsigned long hva, hpa = ERROR_ADDR;
>>>> + unsigned long gfn = gpa>> PAGE_SHIFT;
>>>> + unsigned shift = 0;
>>>> +
>>>> + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>>> + if (!memslot)
>>>> + return ERROR_ADDR;
>>>> +
>>>> + hva = __gfn_to_hva_memslot(memslot, gfn);
>>>> +
>>>> + ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva,&shift);
>>>> + if (!ptep || !pte_present(*ptep))
>>>> + return ERROR_ADDR;
>>>> + pte = *ptep;
>>>> +
>>>> + if (((gpa& TCE_PCI_WRITE) || pte_write(pte))&& !pte_dirty(pte))
>>>> + return ERROR_ADDR;
>>>> +
>>>> + if (!pte_young(pte))
>>>> + return ERROR_ADDR;
>>>> +
>>>> + if (!shift)
>>>> + shift = PAGE_SHIFT;
>>>> +
>>>> + /* Put huge pages handling to the virtual mode */
>>>> + if (shift> PAGE_SHIFT)
>>>> + return ERROR_ADDR;
>>>> +
>>>> + *pg = realmode_pfn_to_page(pte_pfn(pte));
>>>> + if (!*pg || realmode_get_page(*pg))
>>>> + return ERROR_ADDR;
>>>> +
>>>> + /* pte_pfn(pte) returns address aligned to pg_size */
>>>> + hpa = (pte_pfn(pte)<< PAGE_SHIFT) + (gpa& ((1<< shift) - 1));
>>>> +
>>>> + if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>> + hpa = ERROR_ADDR;
>>>> + realmode_put_page(*pg);
>>>> + *pg = NULL;
>>>> + }
>>>> +
>>>> + return hpa;
>>>> +}
>>>> +
>>>> long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>> unsigned long ioba, unsigned long tce)
>>>> {
>>>> - struct kvm *kvm = vcpu->kvm;
>>>> - struct kvmppc_spapr_tce_table *stt;
>>>> -
>>>> - /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>>> - /* liobn, ioba, tce); */
>>>> -
>>>> - list_for_each_entry(stt,&kvm->arch.spapr_tce_tables, list) {
>>>> - if (stt->liobn == liobn) {
>>>> - unsigned long idx = ioba>> SPAPR_TCE_SHIFT;
>>>> - struct page *page;
>>>> - u64 *tbl;
>>>> -
>>>> - /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p
>>>> window_size=0x%x\n", */
>>>> - /* liobn, stt, stt->window_size); */
>>>> - if (ioba>= stt->window_size)
>>>> - return H_PARAMETER;
>>>> -
>>>> - page = stt->pages[idx / TCES_PER_PAGE];
>>>> - tbl = (u64 *)page_address(page);
>>>> -
>>>> - /* FIXME: Need to validate the TCE itself */
>>>> - /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>>>> - tbl[idx % TCES_PER_PAGE] = tce;
>>>> - return H_SUCCESS;
>>>> - }
>>>> + long ret;
>>>> + struct kvmppc_spapr_tce_table *tt = kvmppc_find_tce_table(vcpu, liobn);
>>>> +
>>>> + if (!tt)
>>>> + return H_TOO_HARD;
>>>> +
>>>> + ++tt->stat.rm.put;
>>>> +
>>>> + if (ioba>= tt->window_size)
>>>> + return H_PARAMETER;
>>>> +
>>>> + ret = kvmppc_emulated_validate_tce(tce);
>>>> + if (!ret)
>>>> + kvmppc_emulated_put_tce(tt, ioba, tce);
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>
>>> So the _vm version is the normal one and this is the _rm version? If so,
>>> please mark it as such. Is there any way to generate both from the same
>>> source? The way it's now there is a lot of duplicate code.
>>
>>
>> I tried, looked very ugly. If you insist, I will do so.
>

> If it looks ugly better don't. I just want to make sure you explored the option.
> But please keep the naming scheme consistent.


Removed _vm everywhere and put _rm in the realmode handlers. I was just
confused by the _vm in kvm_vm_ioctl_create_spapr_tce() in the first place.


--
Alexey

2013-07-11 08:57:38

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

On 07/10/2013 03:32 AM, Alexander Graf wrote:
> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>> This adds special support for huge pages (16MB). The reference
>> counting cannot be easily done for such pages in real mode (when
>> MMU is off) so we added a list of huge pages. It is populated in
>> virtual mode and get_page is called just once per a huge page.
>> Real mode handlers check if the requested page is huge and in the list,
>> then no reference counting is done, otherwise an exit to virtual mode
>> happens. The list is released at KVM exit. At the moment the fastest
>> card available for tests uses up to 9 huge pages so walking through this
>> list is not very expensive. However this can change and we may want
>> to optimize this.
>>
>> Signed-off-by: Paul Mackerras<[email protected]>
>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>
>> ---
>>
>> Changes:
>> 2013/06/27:
>> * list of huge pages replaces with hashtable for better performance
>
> So the only thing your patch description really talks about is not true
> anymore?
>
>> * spinlock removed from real mode and only protects insertion of new
>> huge [ages descriptors into the hashtable
>>
>> 2013/06/05:
>> * fixed compile error when CONFIG_IOMMU_API=n
>>
>> 2013/05/20:
>> * the real mode handler now searches for a huge page by gpa (used to be pte)
>> * the virtual mode handler prints warning if it is called twice for the same
>> huge page as the real mode handler is expected to fail just once - when a
>> huge
>> page is not in the list yet.
>> * the huge page is refcounted twice - when added to the hugepage list and
>> when used in the virtual mode hcall handler (can be optimized but it will
>> make the patch less nice).
>>
>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>> ---
>> arch/powerpc/include/asm/kvm_host.h | 25 +++++++++
>> arch/powerpc/kernel/iommu.c | 6 ++-
>> arch/powerpc/kvm/book3s_64_vio.c | 104
>> +++++++++++++++++++++++++++++++++---
>> arch/powerpc/kvm/book3s_64_vio_hv.c | 21 ++++++--
>> 4 files changed, 146 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>> b/arch/powerpc/include/asm/kvm_host.h
>> index 53e61b2..a7508cf 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -30,6 +30,7 @@
>> #include<linux/kvm_para.h>
>> #include<linux/list.h>
>> #include<linux/atomic.h>
>> +#include<linux/hashtable.h>
>> #include<asm/kvm_asm.h>
>> #include<asm/processor.h>
>> #include<asm/page.h>
>> @@ -182,10 +183,34 @@ struct kvmppc_spapr_tce_table {
>> u32 window_size;
>> struct iommu_group *grp; /* used for IOMMU groups */
>> struct vfio_group *vfio_grp; /* used for IOMMU groups */
>> + DECLARE_HASHTABLE(hash_tab, ilog2(64)); /* used for IOMMU groups */
>> + spinlock_t hugepages_write_lock; /* used for IOMMU groups */
>> struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
>> struct page *pages[0];
>> };
>>
>> +/*
>> + * The KVM guest can be backed with 16MB pages.
>> + * In this case, we cannot do page counting from the real mode
>> + * as the compound pages are used - they are linked in a list
>> + * with pointers as virtual addresses which are inaccessible
>> + * in real mode.
>> + *
>> + * The code below keeps a 16MB pages list and uses page struct
>> + * in real mode if it is already locked in RAM and inserted into
>> + * the list or switches to the virtual mode where it can be
>> + * handled in a usual manner.
>> + */
>> +#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa) hash_32(gpa>> 24, 32)
>> +
>> +struct kvmppc_spapr_iommu_hugepage {
>> + struct hlist_node hash_node;
>> + unsigned long gpa; /* Guest physical address */
>> + unsigned long hpa; /* Host physical address */
>> + struct page *page; /* page struct of the very first subpage */
>> + unsigned long size; /* Huge page size (always 16MB at the moment) */
>> +};
>> +
>> struct kvmppc_linear_info {
>> void *base_virt;
>> unsigned long base_pfn;
>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>> index 51678ec..e0b6eca 100644
>> --- a/arch/powerpc/kernel/iommu.c
>> +++ b/arch/powerpc/kernel/iommu.c
>> @@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned
>> long entry,
>> if (!pg) {
>> ret = -EAGAIN;
>> } else if (PageCompound(pg)) {
>> - ret = -EAGAIN;
>> + /* Hugepages will be released at KVM exit */
>> + ret = 0;
>> } else {
>> if (oldtce& TCE_PCI_WRITE)
>> SetPageDirty(pg);
>> @@ -1009,6 +1010,9 @@ int iommu_free_tces(struct iommu_table *tbl,
>> unsigned long entry,
>> struct page *pg = pfn_to_page(oldtce>> PAGE_SHIFT);
>> if (!pg) {
>> ret = -EAGAIN;
>> + } else if (PageCompound(pg)) {
>> + /* Hugepages will be released at KVM exit */
>> + ret = 0;
>> } else {
>> if (oldtce& TCE_PCI_WRITE)
>> SetPageDirty(pg);
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c
>> b/arch/powerpc/kvm/book3s_64_vio.c
>> index 2b51f4a..c037219 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -46,6 +46,40 @@
>>
>> #define ERROR_ADDR ((void *)~(unsigned long)0x0)
>>
>> +#ifdef CONFIG_IOMMU_API
>
> Can't you just make CONFIG_IOMMU_API mandatory in Kconfig?


Where exactly (it is rather SPAPR_TCE_IOMMU but does not really matter)?
Select it on KVM_BOOK3S_64? CONFIG_KVM_BOOK3S_64_HV?
CONFIG_KVM_BOOK3S_64_PR? PPC_BOOK3S_64?

I am trying to imagine a configuration where we really do not want
IOMMU_API. Ben mentioned PPC32 and embedded PPC64 and that's it so any of
BOOK3S (KVM_BOOK3S_64 is the best) should be fine, no?



--
Alexey

2013-07-11 09:52:49

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling


On 11.07.2013, at 10:57, Alexey Kardashevskiy wrote:

> On 07/10/2013 03:32 AM, Alexander Graf wrote:
>> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>>> This adds special support for huge pages (16MB). The reference
>>> counting cannot be easily done for such pages in real mode (when
>>> MMU is off) so we added a list of huge pages. It is populated in
>>> virtual mode and get_page is called just once per a huge page.
>>> Real mode handlers check if the requested page is huge and in the list,
>>> then no reference counting is done, otherwise an exit to virtual mode
>>> happens. The list is released at KVM exit. At the moment the fastest
>>> card available for tests uses up to 9 huge pages so walking through this
>>> list is not very expensive. However this can change and we may want
>>> to optimize this.
>>>
>>> Signed-off-by: Paul Mackerras<[email protected]>
>>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>>
>>> ---
>>>
>>> Changes:
>>> 2013/06/27:
>>> * list of huge pages replaces with hashtable for better performance
>>
>> So the only thing your patch description really talks about is not true
>> anymore?
>>
>>> * spinlock removed from real mode and only protects insertion of new
>>> huge [ages descriptors into the hashtable
>>>
>>> 2013/06/05:
>>> * fixed compile error when CONFIG_IOMMU_API=n
>>>
>>> 2013/05/20:
>>> * the real mode handler now searches for a huge page by gpa (used to be pte)
>>> * the virtual mode handler prints warning if it is called twice for the same
>>> huge page as the real mode handler is expected to fail just once - when a
>>> huge
>>> page is not in the list yet.
>>> * the huge page is refcounted twice - when added to the hugepage list and
>>> when used in the virtual mode hcall handler (can be optimized but it will
>>> make the patch less nice).
>>>
>>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>> ---
>>> arch/powerpc/include/asm/kvm_host.h | 25 +++++++++
>>> arch/powerpc/kernel/iommu.c | 6 ++-
>>> arch/powerpc/kvm/book3s_64_vio.c | 104
>>> +++++++++++++++++++++++++++++++++---
>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 21 ++++++--
>>> 4 files changed, 146 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>>> b/arch/powerpc/include/asm/kvm_host.h
>>> index 53e61b2..a7508cf 100644
>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>> @@ -30,6 +30,7 @@
>>> #include<linux/kvm_para.h>
>>> #include<linux/list.h>
>>> #include<linux/atomic.h>
>>> +#include<linux/hashtable.h>
>>> #include<asm/kvm_asm.h>
>>> #include<asm/processor.h>
>>> #include<asm/page.h>
>>> @@ -182,10 +183,34 @@ struct kvmppc_spapr_tce_table {
>>> u32 window_size;
>>> struct iommu_group *grp; /* used for IOMMU groups */
>>> struct vfio_group *vfio_grp; /* used for IOMMU groups */
>>> + DECLARE_HASHTABLE(hash_tab, ilog2(64)); /* used for IOMMU groups */
>>> + spinlock_t hugepages_write_lock; /* used for IOMMU groups */
>>> struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
>>> struct page *pages[0];
>>> };
>>>
>>> +/*
>>> + * The KVM guest can be backed with 16MB pages.
>>> + * In this case, we cannot do page counting from the real mode
>>> + * as the compound pages are used - they are linked in a list
>>> + * with pointers as virtual addresses which are inaccessible
>>> + * in real mode.
>>> + *
>>> + * The code below keeps a 16MB pages list and uses page struct
>>> + * in real mode if it is already locked in RAM and inserted into
>>> + * the list or switches to the virtual mode where it can be
>>> + * handled in a usual manner.
>>> + */
>>> +#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa) hash_32(gpa>> 24, 32)
>>> +
>>> +struct kvmppc_spapr_iommu_hugepage {
>>> + struct hlist_node hash_node;
>>> + unsigned long gpa; /* Guest physical address */
>>> + unsigned long hpa; /* Host physical address */
>>> + struct page *page; /* page struct of the very first subpage */
>>> + unsigned long size; /* Huge page size (always 16MB at the moment) */
>>> +};
>>> +
>>> struct kvmppc_linear_info {
>>> void *base_virt;
>>> unsigned long base_pfn;
>>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>>> index 51678ec..e0b6eca 100644
>>> --- a/arch/powerpc/kernel/iommu.c
>>> +++ b/arch/powerpc/kernel/iommu.c
>>> @@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned
>>> long entry,
>>> if (!pg) {
>>> ret = -EAGAIN;
>>> } else if (PageCompound(pg)) {
>>> - ret = -EAGAIN;
>>> + /* Hugepages will be released at KVM exit */
>>> + ret = 0;
>>> } else {
>>> if (oldtce& TCE_PCI_WRITE)
>>> SetPageDirty(pg);
>>> @@ -1009,6 +1010,9 @@ int iommu_free_tces(struct iommu_table *tbl,
>>> unsigned long entry,
>>> struct page *pg = pfn_to_page(oldtce>> PAGE_SHIFT);
>>> if (!pg) {
>>> ret = -EAGAIN;
>>> + } else if (PageCompound(pg)) {
>>> + /* Hugepages will be released at KVM exit */
>>> + ret = 0;
>>> } else {
>>> if (oldtce& TCE_PCI_WRITE)
>>> SetPageDirty(pg);
>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c
>>> b/arch/powerpc/kvm/book3s_64_vio.c
>>> index 2b51f4a..c037219 100644
>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>> @@ -46,6 +46,40 @@
>>>
>>> #define ERROR_ADDR ((void *)~(unsigned long)0x0)
>>>
>>> +#ifdef CONFIG_IOMMU_API
>>
>> Can't you just make CONFIG_IOMMU_API mandatory in Kconfig?
>
>
> Where exactly (it is rather SPAPR_TCE_IOMMU but does not really matter)?
> Select it on KVM_BOOK3S_64? CONFIG_KVM_BOOK3S_64_HV?
> CONFIG_KVM_BOOK3S_64_PR? PPC_BOOK3S_64?

I'd say the most logical choice would be to check the Makefile and see when it gets compiled. For those cases we want it enabled.

> I am trying to imagine a configuration where we really do not want
> IOMMU_API. Ben mentioned PPC32 and embedded PPC64 and that's it so any of
> BOOK3S (KVM_BOOK3S_64 is the best) should be fine, no?

book3s_32 doesn't want this, but any book3s_64 implementation could potentially use it, yes. That's pretty much what the Makefile tells you too :).
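
In Kconfig terms that would be something like the following (a sketch only;
the exact symbol and dependency list would need to be checked against the
Makefile, and against whether SPAPR_TCE_IOMMU or IOMMU_API is the thing
being selected):

config KVM_BOOK3S_64
	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
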


Alex

2013-07-11 10:11:20

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls


On 11.07.2013, at 07:12, Alexey Kardashevskiy wrote:

> On 07/10/2013 08:05 PM, Alexander Graf wrote:
>>
>> On 10.07.2013, at 07:00, Alexey Kardashevskiy wrote:
>>
>>> On 07/10/2013 03:02 AM, Alexander Graf wrote:
>>>> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>>>> devices or emulated PCI. These calls allow adding multiple entries
>>>>> (up to 512) into the TCE table in one call which saves time on
>>>>> transition to/from real mode.
>>>>
>>>> We don't mention QEMU explicitly in KVM code usually.
>>>>
>>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>>> (copied from user and verified) before writing the whole list into
>>>>> the TCE table. This cache will be utilized more in the upcoming
>>>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>>>> mode in the case if the real mode handler failed for some reason.
>>>>>
>>>>> This adds a guest physical to host real address converter
>>>>> and calls the existing H_PUT_TCE handler. The converting function
>>>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>>>>
>>>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>>>> so in order to support the functionality of this patch, QEMU
>>>>> needs to query for this capability and set the "hcall-multi-tce"
>>>>> hypertas property only if the capability is present, otherwise
>>>>> there will be serious performance degradation.
>>>>
>>>> Same as above. But really you're only giving recommendations here. What's
>>>> the point? Please describe what the benefit of this patch is, not what some
>>>> other random subsystem might do with the benefits it brings.
>>>>
>>>>>
>>>>> Signed-off-by: Paul Mackerras<[email protected]>
>>>>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>>>>
>>>>> ---
>>>>> Changelog:
>>>>> 2013/07/06:
>>>>> * fixed number of wrong get_page()/put_page() calls
>>>>>
>>>>> 2013/06/27:
>>>>> * fixed clear of BUSY bit in kvmppc_lookup_pte()
>>>>> * H_PUT_TCE_INDIRECT does realmode_get_page() now
>>>>> * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
>>>>> * updated doc
>>>>>
>>>>> 2013/06/05:
>>>>> * fixed mistype about IBMVIO in the commit message
>>>>> * updated doc and moved it to another section
>>>>> * changed capability number
>>>>>
>>>>> 2013/05/21:
>>>>> * added kvm_vcpu_arch::tce_tmp
>>>>> * removed cleanup if put_indirect failed, instead we do not even start
>>>>> writing to TCE table if we cannot get TCEs from the user and they are
>>>>> invalid
>>>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>>>> and kvmppc_emulated_validate_tce (for the previous item)
>>>>> * fixed bug with failthrough for H_IPI
>>>>> * removed all get_user() from real mode handlers
>>>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>>>>
>>>>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>>>> ---
>>>>> Documentation/virtual/kvm/api.txt | 25 +++
>>>>> arch/powerpc/include/asm/kvm_host.h | 9 ++
>>>>> arch/powerpc/include/asm/kvm_ppc.h | 16 +-
>>>>> arch/powerpc/kvm/book3s_64_vio.c | 154 ++++++++++++++++++-
>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 260
>>>>> ++++++++++++++++++++++++++++----
>>>>> arch/powerpc/kvm/book3s_hv.c | 41 ++++-
>>>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
>>>>> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++-
>>>>> arch/powerpc/kvm/powerpc.c | 3 +
>>>>> 9 files changed, 517 insertions(+), 34 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/virtual/kvm/api.txt
>>>>> b/Documentation/virtual/kvm/api.txt
>>>>> index 6365fef..762c703 100644
>>>>> --- a/Documentation/virtual/kvm/api.txt
>>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>>> @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed
>>>>> to userspace to be
>>>>> handled.
>>>>>
>>>>>
>>>>> +4.86 KVM_CAP_PPC_MULTITCE
>>>>> +
>>>>> +Capability: KVM_CAP_PPC_MULTITCE
>>>>> +Architectures: ppc
>>>>> +Type: vm
>>>>> +
>>>>> +This capability means the kernel is capable of handling hypercalls
>>>>> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
>>>>> +space. This significanly accelerates DMA operations for PPC KVM guests.
>>>>
>>>> significanly? Please run this through a spell checker.
>>>>
>>>>> +The user space should expect that its handlers for these hypercalls
>>>>
>>>> s/The//
>>>>
>>>>> +are not going to be called.
>>>>
>>>> Is user space guaranteed they will not be called? Or can it still happen?
>>>
>>> ... if user space previously registered LIOBN in KVM (via
>>> KVM_CREATE_SPAPR_TCE or similar calls).
>>>
>>> ok?
>>
>> How about this?
>>
>> The hypercalls mentioned above may or may not be processed successfully in the kernel based fast path. If they can not be handled by the kernel, they will get passed on to user space. So user space still has to have an implementation for these despite the in kernel acceleration.
>>
>> ---
>>
>> The target audience for this documentation is user space KVM API users. Someone developing kvm tool for example. They want to know implications specific CAPs have.
>>
>>>
>>> There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet
>>> and may never get there.
>>>
>>>
>>>>> +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
>>>>> +the user space might have to advertise it for the guest. For example,
>>>>> +IBM pSeries guest starts using them if "hcall-multi-tce" is present in
>>>>> +the "ibm,hypertas-functions" device-tree property.
>>>>
>>>> This paragraph describes sPAPR. That's fine, but please document it as
>>>> such. Also please check your grammar.
>>>
>>>>> +
>>>>> +Without this capability, only H_PUT_TCE is handled by the kernel and
>>>>> +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
>>>>> +unless the capability is present as passing hypercalls to the userspace
>>>>> +slows operations a lot.
>>>>> +
>>>>> +Unlike other capabilities of this section, this one is always enabled.
>>>>
>>>> Why? Wouldn't that confuse older user space?
>>>
>>>
>>> How? Old user space won't check for this capability and won't tell the
>>> guest to use it (via "hcall-multi-tce"). Old H_PUT_TCE is still there.
>>>
>>> If the guest always uses H_PUT_TCE_INDIRECT/H_STUFF_TCE no matter what,
>>> then it is its problem - it won't work now anyway as neither QEMU nor host
>>> kernel supports these calls.
>
>
>> Always assume that you are a kernel developer without knowledge
>> of any user space code using your interfaces. So there is the theoretical
>> possibility that there is a user space client out there that implements
>> H_PUT_TCE_INDIRECT and advertises hcall-multi-tce to the guest.
>> Would that client break? If so, we should definitely have
>> the CAP disabled by default.
>
>
> No, it won't break. Why would it break? I really do not get it. This user
> space client has to do an extra step to get this acceleration by calling
> ioctl(KVM_CREATE_SPAPR_TCE) anyway. Previously that ioctl only had effect
> on H_PUT_TCE, now on all three hcalls.

Hrm. It's a change of behavior, it probably wouldn't break, yes.

>
>
>> But really, it's also as much about consistency as anything else.
>> If we leave everything as is and always extend functionality
>> by enabling new CAPs, we're pretty much guaranteed that we
>> don't break anything by accident. It also makes debugging easier
>> because you can for example disable this particular feature
>> to see whether something has bad side effects.
>
>
> So I must add one more ioctl to enable MULTITCE in kernel handling. Is it
> what you are saying?
>
> I can see KVM_CHECK_EXTENSION but I do not see KVM_ENABLE_EXTENSION or
> anything like that.

KVM_ENABLE_CAP. It's how we enable sPAPR capabilities too.
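
For reference, from the user space side that is a single ioctl on the vcpu
fd (sketch only; the fd plumbing and error handling are up to the caller,
and whether MULTITCE ends up being enabled this way is exactly what is
being discussed here):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* vcpu_fd obtained earlier via KVM_CREATE_VCPU */
static int enable_multitce(int vcpu_fd)
{
	struct kvm_enable_cap cap = { .cap = KVM_CAP_SPAPR_MULTITCE };

	return ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap);
}
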

>
>
>
>>>>> +
>>>>> +
>>>>> 5. The kvm_run structure
>>>>> ------------------------
>>>>>
>>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>>>>> b/arch/powerpc/include/asm/kvm_host.h
>>>>> index af326cd..20d04bd 100644
>>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>>> @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
>>>>> struct kvm *kvm;
>>>>> u64 liobn;
>>>>> u32 window_size;
>>>>> + struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
>>>>
>>>> You don't need this.
>>>>
>>>>> struct page *pages[0];
>>>>> };
>>>>>
>>>>> @@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
>>>>> spinlock_t tbacct_lock;
>>>>> u64 busy_stolen;
>>>>> u64 busy_preempt;
>>>>> +
>>>>> + unsigned long *tce_tmp_hpas; /* TCE cache for TCE_PUT_INDIRECT
>>>>> hcall */
>>>>> + enum {
>>>>> + TCERM_NONE,
>>>>> + TCERM_GETPAGE,
>>>>> + TCERM_PUTTCE,
>>>>> + TCERM_PUTLIST,
>>>>> + } tce_rm_fail; /* failed stage of request processing */
>>>>> #endif
>>>>> };
>>>>>
>>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h
>>>>> b/arch/powerpc/include/asm/kvm_ppc.h
>>>>> index a5287fe..fa722a0 100644
>>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>>>> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu
>>>>> *vcpu);
>>>>>
>>>>> extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>> struct kvm_create_spapr_tce *args);
>>>>> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>> - unsigned long ioba, unsigned long tce);
>>>>> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>>>>> + struct kvm_vcpu *vcpu, unsigned long liobn);
>>>>> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
>>>>> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>>>> + unsigned long ioba, unsigned long tce);
>>>>> +extern long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>>>> + unsigned long liobn, unsigned long ioba,
>>>>> + unsigned long tce);
>>>>> +extern long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>> + unsigned long liobn, unsigned long ioba,
>>>>> + unsigned long tce_list, unsigned long npages);
>>>>> +extern long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>> + unsigned long liobn, unsigned long ioba,
>>>>> + unsigned long tce_value, unsigned long npages);
>>>>> extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
>>>>> struct kvm_allocate_rma *rma);
>>>>> extern struct kvmppc_linear_info *kvm_alloc_rma(void);
>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c
>>>>> b/arch/powerpc/kvm/book3s_64_vio.c
>>>>> index b2d3f3b..99bf4e5 100644
>>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>>> @@ -14,6 +14,7 @@
>>>>> *
>>>>> * Copyright 2010 Paul Mackerras, IBM Corp.<[email protected]>
>>>>> * Copyright 2011 David Gibson, IBM Corporation<[email protected]>
>>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<[email protected]>
>>>>> */
>>>>>
>>>>> #include<linux/types.h>
>>>>> @@ -36,8 +37,10 @@
>>>>> #include<asm/ppc-opcode.h>
>>>>> #include<asm/kvm_host.h>
>>>>> #include<asm/udbg.h>
>>>>> +#include<asm/iommu.h>
>>>>> +#include<asm/tce.h>
>>>>>
>>>>> -#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
>>>>> +#define ERROR_ADDR ((void *)~(unsigned long)0x0)
>>>>>
>>>>> static long kvmppc_stt_npages(unsigned long window_size)
>>>>> {
>>>>> @@ -50,6 +53,20 @@ static void release_spapr_tce_table(struct
>>>>> kvmppc_spapr_tce_table *stt)
>>>>> struct kvm *kvm = stt->kvm;
>>>>> int i;
>>>>>
>>>>> +#define __SV(x) stt->stat.x
>>>>> +#define __SVD(x) (__SV(rm.x)?(__SV(rm.x)-__SV(vm.x)):0)
>>>>> + pr_debug("%s stat for liobn=%llx\n"
>>>>> + "--------------- realmode ----- virtmode ---\n"
>>>>> + "put_tce %10ld %10ld\n"
>>>>> + "put_tce_indir %10ld %10ld\n"
>>>>> + "stuff_tce %10ld %10ld\n",
>>>>> + __func__, stt->liobn,
>>>>> + __SVD(put), __SV(vm.put),
>>>>> + __SVD(indir), __SV(vm.indir),
>>>>> + __SVD(stuff), __SV(vm.stuff));
>>>>> +#undef __SVD
>>>>> +#undef __SV
>>>>
>>>> All of these stat points should just be trace points. You can do the
>>>> statistic gathering from user space then.
>>>>
>>>>> +
>>>>> mutex_lock(&kvm->lock);
>>>>> list_del(&stt->list);
>>>>> for (i = 0; i< kvmppc_stt_npages(stt->window_size); i++)
>>>>> @@ -148,3 +165,138 @@ fail:
>>>>> }
>>>>> return ret;
>>>>> }
>>>>> +
>>>>> +/* Converts guest physical address to host virtual address */
>>>>> +static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
>>>>
>>>> Please don't distinguish _vm versions. They're the normal case. _rm ones
>>>> are the special ones.
>>>>
>>>>> + unsigned long gpa, struct page **pg)
>>>>> +{
>>>>> + unsigned long hva, gfn = gpa>> PAGE_SHIFT;
>>>>> + struct kvm_memory_slot *memslot;
>>>>> +
>>>>> + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>>>> + if (!memslot)
>>>>> + return ERROR_ADDR;
>>>>> +
>>>>> + hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa& ~PAGE_MASK);
>>>>
>>>> s/+/|/
>>>>
>>>>> +
>>>>> + if (get_user_pages_fast(hva& PAGE_MASK, 1, 0, pg) != 1)
>>>>> + return ERROR_ADDR;
>>>>> +
>>>>> + return (void *) hva;
>>>>> +}
>>>>> +
>>>>> +long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>>>> + unsigned long liobn, unsigned long ioba,
>>>>> + unsigned long tce)
>>>>> +{
>>>>> + long ret;
>>>>> + struct kvmppc_spapr_tce_table *tt;
>>>>> +
>>>>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>> + /* Didn't find the liobn, put it to userspace */
>>>>
>>>> Unclear comment.
>>>
>>>
>>> What detail is missing?
>>
>
>> Grammar wise "it" in the second half of the sentence refers to liobn.
>> So you "put" the "liobn to userspace". That sentence doesn't
>> make any sense.
>
>
> Removed it. H_TOO_HARD itself says enough already.
>
>
>> What you really want to say is:
>>
>> /* Couldn't find the liobn. Something went wrong. Let user space handle the hypercall. That has better ways of dealing with errors. */
>>
>>>
>>>
>>>>> + if (!tt)
>>>>> + return H_TOO_HARD;
>>>>> +
>>>>> + ++tt->stat.vm.put;
>>>>> +
>>>>> + if (ioba>= tt->window_size)
>>>>> + return H_PARAMETER;
>>>>> +
>>>>> + ret = kvmppc_emulated_validate_tce(tce);
>>>>> + if (ret)
>>>>> + return ret;
>>>>> +
>>>>> + kvmppc_emulated_put_tce(tt, ioba, tce);
>>>>> +
>>>>> + return H_SUCCESS;
>>>>> +}
>>>>> +
>>>>> +long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>> + unsigned long liobn, unsigned long ioba,
>>>>> + unsigned long tce_list, unsigned long npages)
>>>>> +{
>>>>> + struct kvmppc_spapr_tce_table *tt;
>>>>> + long i, ret = H_SUCCESS;
>>>>> + unsigned long __user *tces;
>>>>> + struct page *pg = NULL;
>>>>> +
>>>>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>> + /* Didn't find the liobn, put it to userspace */
>>>>> + if (!tt)
>>>>> + return H_TOO_HARD;
>>>>> +
>>>>> + ++tt->stat.vm.indir;
>>>>> +
>>>>> + /*
>>>>> + * The spec says that the maximum size of the list is 512 TCEs so
>>>>> + * so the whole table addressed resides in 4K page
>>>>
>>>> so so?
>>>>
>>>>> + */
>>>>> + if (npages> 512)
>>>>> + return H_PARAMETER;
>>>>> +
>>>>> + if (tce_list& ~IOMMU_PAGE_MASK)
>>>>> + return H_PARAMETER;
>>>>> +
>>>>> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size)
>>>>> + return H_PARAMETER;
>>>>> +
>>>>> + tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list,&pg);
>>>>> + if (tces == ERROR_ADDR)
>>>>> + return H_TOO_HARD;
>>>>> +
>>>>> + if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
>>>>> + goto put_list_page_exit;
>>>>> +
>>>>> + for (i = 0; i< npages; ++i) {
>>>>> + if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
>>>>> + ret = H_PARAMETER;
>>>>> + goto put_list_page_exit;
>>>>> + }
>>>>> +
>>>>> + ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp_hpas[i]);
>>>>> + if (ret)
>>>>> + goto put_list_page_exit;
>>>>> + }
>>>>> +
>>>>> + for (i = 0; i< npages; ++i)
>>>>> + kvmppc_emulated_put_tce(tt, ioba + (i<< IOMMU_PAGE_SHIFT),
>>>>> + vcpu->arch.tce_tmp_hpas[i]);
>>>>> +put_list_page_exit:
>>>>> + if (pg)
>>>>> + put_page(pg);
>>>>> +
>>>>> + if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
>>>>> + vcpu->arch.tce_rm_fail = TCERM_NONE;
>>>>> + if (pg&& !PageCompound(pg))
>>>>> + put_page(pg); /* finish pending realmode_put_page() */
>>>>> + }
>>>>> +
>>>>> + return ret;
>>>>> +}
>>>>> +
>>>>> +long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>> + unsigned long liobn, unsigned long ioba,
>>>>> + unsigned long tce_value, unsigned long npages)
>>>>> +{
>>>>> + struct kvmppc_spapr_tce_table *tt;
>>>>> + long i, ret;
>>>>> +
>>>>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>> + /* Didn't find the liobn, put it to userspace */
>>>>> + if (!tt)
>>>>> + return H_TOO_HARD;
>>>>> +
>>>>> + ++tt->stat.vm.stuff;
>>>>> +
>>>>> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size)
>>>>> + return H_PARAMETER;
>>>>> +
>>>>> + ret = kvmppc_emulated_validate_tce(tce_value);
>>>>> + if (ret || (tce_value& (TCE_PCI_WRITE | TCE_PCI_READ)))
>>>>> + return H_PARAMETER;
>>>>> +
>>>>> + for (i = 0; i< npages; ++i, ioba += IOMMU_PAGE_SIZE)
>>>>> + kvmppc_emulated_put_tce(tt, ioba, tce_value);
>>>>> +
>>>>> + return H_SUCCESS;
>>>>> +}
>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>> b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>> index 30c2f3b..cd3e6f9 100644
>>>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>> @@ -14,6 +14,7 @@
>>>>> *
>>>>> * Copyright 2010 Paul Mackerras, IBM Corp.<[email protected]>
>>>>> * Copyright 2011 David Gibson, IBM Corporation<[email protected]>
>>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<[email protected]>
>>>>> */
>>>>>
>>>>> #include<linux/types.h>
>>>>> @@ -35,42 +36,243 @@
>>>>> #include<asm/ppc-opcode.h>
>>>>> #include<asm/kvm_host.h>
>>>>> #include<asm/udbg.h>
>>>>> +#include<asm/iommu.h>
>>>>> +#include<asm/tce.h>
>>>>>
>>>>> #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
>>>>> +#define ERROR_ADDR (~(unsigned long)0x0)
>>>>>
>>>>> -/* WARNING: This will be called in real-mode on HV KVM and virtual
>>>>> - * mode on PR KVM
>>>>
>>>> What's wrong with the warning?
>>>
>>>
>>> It belongs to kvmppc_h_put_tce() which is not called in virtual mode anymore.
>>
>> I thought the comment applied to the whole file before? Hrm. Maybe I misread it then.
>>
>>> It is technically correct for kvmppc_find_tce_table() though. Should I put
>>> this comment before every function which may be called from real and
>>> virtual modes?
>>
>> Yes, please. Otherwise someone might stick an access to a non-linear address
>> in there by accident.
>>
>>>
>>>
>>>
>>>>> +/*
>>>>> + * Finds a TCE table descriptor by LIOBN
>>>>> */
>>>>> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
>>>>> + unsigned long liobn)
>>>>> +{
>>>>> + struct kvmppc_spapr_tce_table *tt;
>>>>> +
>>>>> + list_for_each_entry(tt,&vcpu->kvm->arch.spapr_tce_tables, list) {
>>>>> + if (tt->liobn == liobn)
>>>>> + return tt;
>>>>> + }
>>>>> +
>>>>> + return NULL;
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
>>>>> +
>>>>> +#ifdef DEBUG
>>>>> +/*
>>>>> + * Lets user mode disable realmode handlers by putting big number
>>>>> + * in the bottom value of LIOBN
>>>>
>>>> What? Seriously? Just don't enable the CAP.
>>>
>>>
>>> It is under DEBUG. It really, really helps to be able to disable real mode
>>> handlers without reboot. Ok, no debug code, I'll remove.
>>
>> Debug code is good, but #ifdefs are bad. For you, an #ifdef reads like
>> "code that doesn't do any hard when disabled". For me, #ifdefs read
>> "code that definitely breaks because nobody turns the #define on".
>>
>> So please, avoid #ifdef'ed code whenever possible. Switching the CAP on and
>> off is a much better debug approach in this case.
>>
>>>
>>>
>>>>> + */
>>>>> +#define kvmppc_find_tce_table(a, b) \
>>>>> + ((((b)&0xffff)>10000)?NULL:kvmppc_find_tce_table((a), (b)))
>>>>> +#endif
>>>>> +
>>>>> +/*
>>>>> + * Validates TCE address.
>>>>> + * At the moment only flags are validated as other checks will
>>>>> significantly slow
>>>>> + * down or can make it even impossible to handle TCE requests in real mode.
>>>>
>>>> What?
>>>
>>>
>>> What is missing here (besides good english)?
>>
>> What badness could slip through by not validating everything?
>
>
> I cannot think of any good check which could be done in real mode and not
> be "more than 2 calls deep" (c) Ben. Check that the page is allocated at
> all? How? Don't know.

If you say that our validation doesn't validate everything, that makes me really wary. Could the guest use it to maliciously inject anything? Could a missing check make our code go berserk?

What checks exactly would you do in addition when this was virtual mode?

>
>
>
>>>>> + */
>>>>> +long kvmppc_emulated_validate_tce(unsigned long tce)
>>>>
>>>> I don't like the naming scheme. Please turn this around and make it
>>>> kvmppc_tce_validate().
>>>
>>>
>>> Oh. "Like"... Ok.
>>
>> Yes. Like.
>>
>>>
>>>
>>>>> +{
>>>>> + if (tce& ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
>>>>> + return H_PARAMETER;
>>>>> +
>>>>> + return H_SUCCESS;
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
>>>>> +
>>>>> +/*
>>>>> + * Handles TCE requests for QEMU emulated devices.
>>>>
>>>> We still don't mention QEMU in KVM code. And does it really matter whether
>>>> they're emulated by QEMU? Devices could also be emulated by KVM.
>>>>
>>>>> + * Puts guest TCE values to the table and expects QEMU to convert them
>>>>> + * later in a QEMU device implementation.
>>>>> + * Called in both real and virtual modes.
>>>>> + * Cannot fail so kvmppc_emulated_validate_tce must be called before it.
>>>>> + */
>>>>> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>>>
>>>> kvmppc_tce_put()
>>>>
>>>>> + unsigned long ioba, unsigned long tce)
>>>>> +{
>>>>> + unsigned long idx = ioba>> SPAPR_TCE_SHIFT;
>>>>> + struct page *page;
>>>>> + u64 *tbl;
>>>>> +
>>>>> + /*
>>>>> + * Note on the use of page_address() in real mode,
>>>>> + *
>>>>> + * It is safe to use page_address() in real mode on ppc64 because
>>>>> + * page_address() is always defined as lowmem_page_address()
>>>>> + * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
>>>>> + * operation and does not access page struct.
>>>>> + *
>>>>> + * Theoretically page_address() could be defined different
>>>>> + * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
>>>>> + * should be enabled.
>>>>> + * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
>>>>> + * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
>>>>> + * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
>>>>> + * is not expected to be enabled on ppc32, page_address()
>>>>> + * is safe for ppc32 as well.
>>>>> + */
>>>>> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
>>>>> +#error TODO: fix to avoid page_address() here
>>>>> +#endif
>>>>
>>>> Can you extract the text above, the check and the page_address call into a
>>>> simple wrapper function?
>>>
>>>
>>> Is this function also too big? Sorry, I do not understand the comment.
>>
>> All of the comment and #if here only deal with the fact that you
>> have a real mode hack to call page_address() that happens
>> to work under specific circumstances.
>>
>> There's nothing kvmppc_tce_put() specific about this.
>> The page_address() code happens to get called here, sure.
>> But if I read the kvmppc_tce_put() function I don't care about
>> these details - I want to understand the code flow that ends
>> up writing the TCE.
>>
>>>>> + page = tt->pages[idx / TCES_PER_PAGE];
>>>>> + tbl = (u64 *)page_address(page);
>>>>> +
>>>>> + /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>>>>
>>>> This is not an RFC, is it?
>>>
>>>
>>> Any debug code is prohibited? Ok, I'll remove.
>>
>> Debug code that requires code changes is prohibited, yes.
>> Debug code that is runtime switchable (pr_debug, trace points, etc)
>> are allowed.
>
>
> Is there any easy way to enable just this specific udbg_printf (not all of
> them at once)? Trace points do not work in real mode as we figured out.

You can enable pr_debug by file IIRC.
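
That is, the commented-out udbg_printf() could become (a sketch, assuming
CONFIG_DYNAMIC_DEBUG is set and leaving aside whether printk is actually
safe to call from the real-mode path at all):

	pr_debug("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]);

which can then be switched on per source file at run time by writing
"file book3s_64_vio_hv.c +p" to /sys/kernel/debug/dynamic_debug/control.
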

>
>
>>>>> + tbl[idx % TCES_PER_PAGE] = tce;
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
>>>>> +
>>>>> +#ifdef CONFIG_KVM_BOOK3S_64_HV
>>>>> +/*
>>>>> + * Converts guest physical address to host physical address.
>>>>> + * Tries to increase page counter via realmode_get_page() and
>>>>> + * returns ERROR_ADDR if failed.
>>>>> + */
>>>>> +static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
>>>>> + unsigned long gpa, struct page **pg)
>>>>> +{
>>>>> + struct kvm_memory_slot *memslot;
>>>>> + pte_t *ptep, pte;
>>>>> + unsigned long hva, hpa = ERROR_ADDR;
>>>>> + unsigned long gfn = gpa>> PAGE_SHIFT;
>>>>> + unsigned shift = 0;
>>>>> +
>>>>> + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>>>> + if (!memslot)
>>>>> + return ERROR_ADDR;
>>>>> +
>>>>> + hva = __gfn_to_hva_memslot(memslot, gfn);
>>>>> +
>>>>> + ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva,&shift);
>>>>> + if (!ptep || !pte_present(*ptep))
>>>>> + return ERROR_ADDR;
>>>>> + pte = *ptep;
>>>>> +
>>>>> + if (((gpa& TCE_PCI_WRITE) || pte_write(pte))&& !pte_dirty(pte))
>>>>> + return ERROR_ADDR;
>>>>> +
>>>>> + if (!pte_young(pte))
>>>>> + return ERROR_ADDR;
>>>>> +
>>>>> + if (!shift)
>>>>> + shift = PAGE_SHIFT;
>>>>> +
>>>>> + /* Put huge pages handling to the virtual mode */
>>>>> + if (shift> PAGE_SHIFT)
>>>>> + return ERROR_ADDR;
>>>>> +
>>>>> + *pg = realmode_pfn_to_page(pte_pfn(pte));
>>>>> + if (!*pg || realmode_get_page(*pg))
>>>>> + return ERROR_ADDR;
>>>>> +
>>>>> + /* pte_pfn(pte) returns address aligned to pg_size */
>>>>> + hpa = (pte_pfn(pte)<< PAGE_SHIFT) + (gpa& ((1<< shift) - 1));
>>>>> +
>>>>> + if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>>> + hpa = ERROR_ADDR;
>>>>> + realmode_put_page(*pg);
>>>>> + *pg = NULL;
>>>>> + }
>>>>> +
>>>>> + return hpa;
>>>>> +}
>>>>> +
>>>>> long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>> unsigned long ioba, unsigned long tce)
>>>>> {
>>>>> - struct kvm *kvm = vcpu->kvm;
>>>>> - struct kvmppc_spapr_tce_table *stt;
>>>>> -
>>>>> - /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>>>> - /* liobn, ioba, tce); */
>>>>> -
>>>>> - list_for_each_entry(stt,&kvm->arch.spapr_tce_tables, list) {
>>>>> - if (stt->liobn == liobn) {
>>>>> - unsigned long idx = ioba>> SPAPR_TCE_SHIFT;
>>>>> - struct page *page;
>>>>> - u64 *tbl;
>>>>> -
>>>>> - /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p
>>>>> window_size=0x%x\n", */
>>>>> - /* liobn, stt, stt->window_size); */
>>>>> - if (ioba>= stt->window_size)
>>>>> - return H_PARAMETER;
>>>>> -
>>>>> - page = stt->pages[idx / TCES_PER_PAGE];
>>>>> - tbl = (u64 *)page_address(page);
>>>>> -
>>>>> - /* FIXME: Need to validate the TCE itself */
>>>>> - /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>>>>> - tbl[idx % TCES_PER_PAGE] = tce;
>>>>> - return H_SUCCESS;
>>>>> - }
>>>>> + long ret;
>>>>> + struct kvmppc_spapr_tce_table *tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>> +
>>>>> + if (!tt)
>>>>> + return H_TOO_HARD;
>>>>> +
>>>>> + ++tt->stat.rm.put;
>>>>> +
>>>>> + if (ioba>= tt->window_size)
>>>>> + return H_PARAMETER;
>>>>> +
>>>>> + ret = kvmppc_emulated_validate_tce(tce);
>>>>> + if (!ret)
>>>>> + kvmppc_emulated_put_tce(tt, ioba, tce);
>>>>> +
>>>>> + return ret;
>>>>> +}
>>>>> +
>>>>> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>
>>>> So the _vm version is the normal one and this is the _rm version? If so,
>>>> please mark it as such. Is there any way to generate both from the same
>>>> source? The way it's now there is a lot of duplicate code.
>>>
>>>
>>> I tried, looked very ugly. If you insist, I will do so.
>>
>
>> If it looks ugly better don't. I just want to make sure you explored the option.
>> But please keep the naming scheme consistent.
>
>
> Removed _vm everywhere and put _rm in the realmode handlers. I was just
> confused by the _vm in kvm_vm_ioctl_create_spapr_tce() in the first place.

That vm refers to the virtual machine. It's on VM scope, not VCPU scope.


Alex

>
>
> --
> Alexey

2013-07-11 10:54:44

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

On 07/11/2013 08:11 PM, Alexander Graf wrote:
>
> On 11.07.2013, at 07:12, Alexey Kardashevskiy wrote:
>
>> On 07/10/2013 08:05 PM, Alexander Graf wrote:
>>>
>>> On 10.07.2013, at 07:00, Alexey Kardashevskiy wrote:
>>>
>>>> On 07/10/2013 03:02 AM, Alexander Graf wrote:
>>>>> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>>>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>>>>> devices or emulated PCI. These calls allow adding multiple entries
>>>>>> (up to 512) into the TCE table in one call which saves time on
>>>>>> transition to/from real mode.
>>>>>
>>>>> We don't mention QEMU explicitly in KVM code usually.
>>>>>
>>>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>>>> (copied from user and verified) before writing the whole list into
>>>>>> the TCE table. This cache will be utilized more in the upcoming
>>>>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>>>>> mode in the case if the real mode handler failed for some reason.
>>>>>>
>>>>>> This adds a guest physical to host real address converter
>>>>>> and calls the existing H_PUT_TCE handler. The converting function
>>>>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>>>>>
>>>>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>>>>> so in order to support the functionality of this patch, QEMU
>>>>>> needs to query for this capability and set the "hcall-multi-tce"
>>>>>> hypertas property only if the capability is present, otherwise
>>>>>> there will be serious performance degradation.
>>>>>
>>>>> Same as above. But really you're only giving recommendations here. What's
>>>>> the point? Please describe what the benefit of this patch is, not what some
>>>>> other random subsystem might do with the benefits it brings.
>>>>>
>>>>>>
>>>>>> Signed-off-by: Paul Mackerras<[email protected]>
>>>>>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>>>>>
>>>>>> ---
>>>>>> Changelog:
>>>>>> 2013/07/06:
>>>>>> * fixed number of wrong get_page()/put_page() calls
>>>>>>
>>>>>> 2013/06/27:
>>>>>> * fixed clear of BUSY bit in kvmppc_lookup_pte()
>>>>>> * H_PUT_TCE_INDIRECT does realmode_get_page() now
>>>>>> * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
>>>>>> * updated doc
>>>>>>
>>>>>> 2013/06/05:
>>>>>> * fixed mistype about IBMVIO in the commit message
>>>>>> * updated doc and moved it to another section
>>>>>> * changed capability number
>>>>>>
>>>>>> 2013/05/21:
>>>>>> * added kvm_vcpu_arch::tce_tmp
>>>>>> * removed cleanup if put_indirect failed, instead we do not even start
>>>>>> writing to TCE table if we cannot get TCEs from the user and they are
>>>>>> invalid
>>>>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>>>>> and kvmppc_emulated_validate_tce (for the previous item)
>>>>>> * fixed bug with failthrough for H_IPI
>>>>>> * removed all get_user() from real mode handlers
>>>>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>>>>> ---
>>>>>> Documentation/virtual/kvm/api.txt | 25 +++
>>>>>> arch/powerpc/include/asm/kvm_host.h | 9 ++
>>>>>> arch/powerpc/include/asm/kvm_ppc.h | 16 +-
>>>>>> arch/powerpc/kvm/book3s_64_vio.c | 154 ++++++++++++++++++-
>>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 260
>>>>>> ++++++++++++++++++++++++++++----
>>>>>> arch/powerpc/kvm/book3s_hv.c | 41 ++++-
>>>>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
>>>>>> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++-
>>>>>> arch/powerpc/kvm/powerpc.c | 3 +
>>>>>> 9 files changed, 517 insertions(+), 34 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/virtual/kvm/api.txt
>>>>>> b/Documentation/virtual/kvm/api.txt
>>>>>> index 6365fef..762c703 100644
>>>>>> --- a/Documentation/virtual/kvm/api.txt
>>>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>>>> @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed
>>>>>> to userspace to be
>>>>>> handled.
>>>>>>
>>>>>>
>>>>>> +4.86 KVM_CAP_PPC_MULTITCE
>>>>>> +
>>>>>> +Capability: KVM_CAP_PPC_MULTITCE
>>>>>> +Architectures: ppc
>>>>>> +Type: vm
>>>>>> +
>>>>>> +This capability means the kernel is capable of handling hypercalls
>>>>>> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
>>>>>> +space. This significanly accelerates DMA operations for PPC KVM guests.
>>>>>
>>>>> significanly? Please run this through a spell checker.
>>>>>
>>>>>> +The user space should expect that its handlers for these hypercalls
>>>>>
>>>>> s/The//
>>>>>
>>>>>> +are not going to be called.
>>>>>
>>>>> Is user space guaranteed they will not be called? Or can it still happen?
>>>>
>>>> ... if user space previously registered LIOBN in KVM (via
>>>> KVM_CREATE_SPAPR_TCE or similar calls).
>>>>
>>>> ok?
>>>
>>> How about this?
>>>
>>> The hypercalls mentioned above may or may not be processed successfully in the kernel based fast path. If they can not be handled by the kernel, they will get passed on to user space. So user space still has to have an implementation for these despite the in kernel acceleration.
>>>
>>> ---
>>>
>>> The target audience for this documentation is user space KVM API users. Someone developing kvm tool for example. They want to know implications specific CAPs have.
>>>
>>>>
>>>> There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet
>>>> and may never get there.
>>>>
>>>>
>>>>>> +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
>>>>>> +the user space might have to advertise it for the guest. For example,
>>>>>> +IBM pSeries guest starts using them if "hcall-multi-tce" is present in
>>>>>> +the "ibm,hypertas-functions" device-tree property.
>>>>>
>>>>> This paragraph describes sPAPR. That's fine, but please document it as
>>>>> such. Also please check your grammar.
>>>>
>>>>>> +
>>>>>> +Without this capability, only H_PUT_TCE is handled by the kernel and
>>>>>> +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
>>>>>> +unless the capability is present as passing hypercalls to the userspace
>>>>>> +slows operations a lot.
>>>>>> +
>>>>>> +Unlike other capabilities of this section, this one is always enabled.
>>>>>
>>>>> Why? Wouldn't that confuse older user space?
>>>>
>>>>
>>>> How? Old user space won't check for this capability and won't tell the
>>>> guest to use it (via "hcall-multi-tce"). Old H_PUT_TCE is still there.
>>>>
>>>> If the guest always uses H_PUT_TCE_INDIRECT/H_STUFF_TCE no matter what,
>>>> then it is its problem - it won't work now anyway as neither QEMU nor host
>>>> kernel supports these calls.
>>
>>
>>> Always assume that you are a kernel developer without knowledge
>>> of any user space code using your interfaces. So there is the theoretical
>>> possibility that there is a user space client out there that implements
>>> H_PUT_TCE_INDIRECT and advertises hcall-multi-tce to the guest.
>>> Would that client break? If so, we should definitely have
>>> the CAP disabled by default.
>>
>>
>> No, it won't break. Why would it break? I really do not get it. This user
>> space client has to do an extra step to get this acceleration by calling
>> ioctl(KVM_CREATE_SPAPR_TCE) anyway. Previously that ioctl only had effect
>> on H_PUT_TCE, now on all three hcalls.
>
> Hrm. It's a change of behavior, it probably wouldn't break, yes.


Aaand?


>>> But really, it's also as much about consistency as anything else.
>>> If we leave everything as is and always extend functionality
>>> by enabling new CAPs, we're pretty much guaranteed that we
>>> don't break anything by accident. It also makes debugging easier
>>> because you can for example disable this particular feature
>>> to see whether something has bad side effects.
>>
>>
>> So I must add one more ioctl to enable MULTITCE in kernel handling. Is it
>> what you are saying?
>>
>> I can see KVM_CHECK_EXTENSION but I do not see KVM_ENABLE_EXTENSION or
>> anything like that.
>
> KVM_ENABLE_CAP. It's how we enable sPAPR capabilities too.


Yeah, Paul already explained. It is platform specific, but ok.
And it does not have "EXTENSION" in the name for some reason, but ok too.

KVM_ENABLE_CAP is a vcpu ioctl. So kvm_arch_vcpu_ioctl() enables a VCPU's
capabilities, while KVM_CAP_SPAPR_MULTITCE is a KVM (or more precisely
SPAPR-TCE/LIOBN, but I really do not want it to be that specific) capability.

Sure I can add to kvm_arch_vcpu_ioctl():

	case KVM_CAP_SPAPR_MULTITCE:
		r = 0;
		vcpu->kvm->arch.spapr_multitce_enabled = cap->args[0];
		break;

But I suspect you and Ben will call it ugly. So do I have to implement
KVM_ENABLE_CAP in kvm_arch_vm_ioctl() and change api.txt to say it is no
longer just a vcpu ioctl? Or add a brand new ioctl for this?
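
To make that concrete, the VM-scope variant would be something along these
lines (a rough sketch only, not code from this series;
spapr_multitce_enabled is the made-up field from the snippet above):

static int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
				   struct kvm_enable_cap *cap)
{
	if (cap->flags)
		return -EINVAL;

	switch (cap->cap) {
	case KVM_CAP_SPAPR_MULTITCE:
		/* args[0] != 0 enables the in-kernel multi-TCE handlers */
		kvm->arch.spapr_multitce_enabled = !!cap->args[0];
		return 0;
	default:
		return -EINVAL;
	}
}

called from kvm_arch_vm_ioctl() when it sees KVM_ENABLE_CAP on the VM fd.
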




>>>>>> +
>>>>>> +
>>>>>> 5. The kvm_run structure
>>>>>> ------------------------
>>>>>>
>>>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>>>>>> b/arch/powerpc/include/asm/kvm_host.h
>>>>>> index af326cd..20d04bd 100644
>>>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>>>> @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
>>>>>> struct kvm *kvm;
>>>>>> u64 liobn;
>>>>>> u32 window_size;
>>>>>> + struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
>>>>>
>>>>> You don't need this.
>>>>>
>>>>>> struct page *pages[0];
>>>>>> };
>>>>>>
>>>>>> @@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
>>>>>> spinlock_t tbacct_lock;
>>>>>> u64 busy_stolen;
>>>>>> u64 busy_preempt;
>>>>>> +
>>>>>> + unsigned long *tce_tmp_hpas; /* TCE cache for TCE_PUT_INDIRECT
>>>>>> hcall */
>>>>>> + enum {
>>>>>> + TCERM_NONE,
>>>>>> + TCERM_GETPAGE,
>>>>>> + TCERM_PUTTCE,
>>>>>> + TCERM_PUTLIST,
>>>>>> + } tce_rm_fail; /* failed stage of request processing */
>>>>>> #endif
>>>>>> };
>>>>>>
>>>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h
>>>>>> b/arch/powerpc/include/asm/kvm_ppc.h
>>>>>> index a5287fe..fa722a0 100644
>>>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>>>>> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu
>>>>>> *vcpu);
>>>>>>
>>>>>> extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>>> struct kvm_create_spapr_tce *args);
>>>>>> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>>> - unsigned long ioba, unsigned long tce);
>>>>>> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>>>>>> + struct kvm_vcpu *vcpu, unsigned long liobn);
>>>>>> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
>>>>>> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>>>>> + unsigned long ioba, unsigned long tce);
>>>>>> +extern long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>>>>> + unsigned long liobn, unsigned long ioba,
>>>>>> + unsigned long tce);
>>>>>> +extern long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>> + unsigned long liobn, unsigned long ioba,
>>>>>> + unsigned long tce_list, unsigned long npages);
>>>>>> +extern long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>>> + unsigned long liobn, unsigned long ioba,
>>>>>> + unsigned long tce_value, unsigned long npages);
>>>>>> extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
>>>>>> struct kvm_allocate_rma *rma);
>>>>>> extern struct kvmppc_linear_info *kvm_alloc_rma(void);
>>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c
>>>>>> b/arch/powerpc/kvm/book3s_64_vio.c
>>>>>> index b2d3f3b..99bf4e5 100644
>>>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>>>> @@ -14,6 +14,7 @@
>>>>>> *
>>>>>> * Copyright 2010 Paul Mackerras, IBM Corp.<[email protected]>
>>>>>> * Copyright 2011 David Gibson, IBM Corporation<[email protected]>
>>>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<[email protected]>
>>>>>> */
>>>>>>
>>>>>> #include<linux/types.h>
>>>>>> @@ -36,8 +37,10 @@
>>>>>> #include<asm/ppc-opcode.h>
>>>>>> #include<asm/kvm_host.h>
>>>>>> #include<asm/udbg.h>
>>>>>> +#include<asm/iommu.h>
>>>>>> +#include<asm/tce.h>
>>>>>>
>>>>>> -#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
>>>>>> +#define ERROR_ADDR ((void *)~(unsigned long)0x0)
>>>>>>
>>>>>> static long kvmppc_stt_npages(unsigned long window_size)
>>>>>> {
>>>>>> @@ -50,6 +53,20 @@ static void release_spapr_tce_table(struct
>>>>>> kvmppc_spapr_tce_table *stt)
>>>>>> struct kvm *kvm = stt->kvm;
>>>>>> int i;
>>>>>>
>>>>>> +#define __SV(x) stt->stat.x
>>>>>> +#define __SVD(x) (__SV(rm.x)?(__SV(rm.x)-__SV(vm.x)):0)
>>>>>> + pr_debug("%s stat for liobn=%llx\n"
>>>>>> + "--------------- realmode ----- virtmode ---\n"
>>>>>> + "put_tce %10ld %10ld\n"
>>>>>> + "put_tce_indir %10ld %10ld\n"
>>>>>> + "stuff_tce %10ld %10ld\n",
>>>>>> + __func__, stt->liobn,
>>>>>> + __SVD(put), __SV(vm.put),
>>>>>> + __SVD(indir), __SV(vm.indir),
>>>>>> + __SVD(stuff), __SV(vm.stuff));
>>>>>> +#undef __SVD
>>>>>> +#undef __SV
>>>>>
>>>>> All of these stat points should just be trace points. You can do the
>>>>> statistic gathering from user space then.
>>>>>
>>>>>> +
>>>>>> mutex_lock(&kvm->lock);
>>>>>> list_del(&stt->list);
>>>>>> for (i = 0; i< kvmppc_stt_npages(stt->window_size); i++)
>>>>>> @@ -148,3 +165,138 @@ fail:
>>>>>> }
>>>>>> return ret;
>>>>>> }
>>>>>> +
>>>>>> +/* Converts guest physical address to host virtual address */
>>>>>> +static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
>>>>>
>>>>> Please don't distinguish _vm versions. They're the normal case. _rm ones
>>>>> are the special ones.
>>>>>
>>>>>> + unsigned long gpa, struct page **pg)
>>>>>> +{
>>>>>> + unsigned long hva, gfn = gpa>> PAGE_SHIFT;
>>>>>> + struct kvm_memory_slot *memslot;
>>>>>> +
>>>>>> + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>>>>> + if (!memslot)
>>>>>> + return ERROR_ADDR;
>>>>>> +
>>>>>> + hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa& ~PAGE_MASK);
>>>>>
>>>>> s/+/|/
>>>>>
>>>>>> +
>>>>>> + if (get_user_pages_fast(hva& PAGE_MASK, 1, 0, pg) != 1)
>>>>>> + return ERROR_ADDR;
>>>>>> +
>>>>>> + return (void *) hva;
>>>>>> +}
>>>>>> +
>>>>>> +long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>>>>> + unsigned long liobn, unsigned long ioba,
>>>>>> + unsigned long tce)
>>>>>> +{
>>>>>> + long ret;
>>>>>> + struct kvmppc_spapr_tce_table *tt;
>>>>>> +
>>>>>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>>> + /* Didn't find the liobn, put it to userspace */
>>>>>
>>>>> Unclear comment.
>>>>
>>>>
>>>> What detail is missing?
>>>
>>
>>> Grammar wise "it" in the second half of the sentence refers to liobn.
>>> So you "put" the "liobn to userspace". That sentence doesn't
>>> make any sense.
>>
>>
>> Removed it. H_TOO_HARD itself says enough already.
>>
>>
>>> What you really want to say is:
>>>
>>> /* Couldn't find the liobn. Something went wrong. Let user space handle the hypercall. That has better ways of dealing with errors. */
>>>
>>>>
>>>>
>>>>>> + if (!tt)
>>>>>> + return H_TOO_HARD;
>>>>>> +
>>>>>> + ++tt->stat.vm.put;
>>>>>> +
>>>>>> + if (ioba>= tt->window_size)
>>>>>> + return H_PARAMETER;
>>>>>> +
>>>>>> + ret = kvmppc_emulated_validate_tce(tce);
>>>>>> + if (ret)
>>>>>> + return ret;
>>>>>> +
>>>>>> + kvmppc_emulated_put_tce(tt, ioba, tce);
>>>>>> +
>>>>>> + return H_SUCCESS;
>>>>>> +}
>>>>>> +
>>>>>> +long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>> + unsigned long liobn, unsigned long ioba,
>>>>>> + unsigned long tce_list, unsigned long npages)
>>>>>> +{
>>>>>> + struct kvmppc_spapr_tce_table *tt;
>>>>>> + long i, ret = H_SUCCESS;
>>>>>> + unsigned long __user *tces;
>>>>>> + struct page *pg = NULL;
>>>>>> +
>>>>>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>>> + /* Didn't find the liobn, put it to userspace */
>>>>>> + if (!tt)
>>>>>> + return H_TOO_HARD;
>>>>>> +
>>>>>> + ++tt->stat.vm.indir;
>>>>>> +
>>>>>> + /*
>>>>>> + * The spec says that the maximum size of the list is 512 TCEs so
>>>>>> + * so the whole table addressed resides in 4K page
>>>>>
>>>>> so so?
>>>>>
>>>>>> + */
>>>>>> + if (npages> 512)
>>>>>> + return H_PARAMETER;
>>>>>> +
>>>>>> + if (tce_list& ~IOMMU_PAGE_MASK)
>>>>>> + return H_PARAMETER;
>>>>>> +
>>>>>> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size)
>>>>>> + return H_PARAMETER;
>>>>>> +
>>>>>> + tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list,&pg);
>>>>>> + if (tces == ERROR_ADDR)
>>>>>> + return H_TOO_HARD;
>>>>>> +
>>>>>> + if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
>>>>>> + goto put_list_page_exit;
>>>>>> +
>>>>>> + for (i = 0; i< npages; ++i) {
>>>>>> + if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
>>>>>> + ret = H_PARAMETER;
>>>>>> + goto put_list_page_exit;
>>>>>> + }
>>>>>> +
>>>>>> + ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp_hpas[i]);
>>>>>> + if (ret)
>>>>>> + goto put_list_page_exit;
>>>>>> + }
>>>>>> +
>>>>>> + for (i = 0; i< npages; ++i)
>>>>>> + kvmppc_emulated_put_tce(tt, ioba + (i<< IOMMU_PAGE_SHIFT),
>>>>>> + vcpu->arch.tce_tmp_hpas[i]);
>>>>>> +put_list_page_exit:
>>>>>> + if (pg)
>>>>>> + put_page(pg);
>>>>>> +
>>>>>> + if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
>>>>>> + vcpu->arch.tce_rm_fail = TCERM_NONE;
>>>>>> + if (pg&& !PageCompound(pg))
>>>>>> + put_page(pg); /* finish pending realmode_put_page() */
>>>>>> + }
>>>>>> +
>>>>>> + return ret;
>>>>>> +}
>>>>>> +
>>>>>> +long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>>> + unsigned long liobn, unsigned long ioba,
>>>>>> + unsigned long tce_value, unsigned long npages)
>>>>>> +{
>>>>>> + struct kvmppc_spapr_tce_table *tt;
>>>>>> + long i, ret;
>>>>>> +
>>>>>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>>> + /* Didn't find the liobn, put it to userspace */
>>>>>> + if (!tt)
>>>>>> + return H_TOO_HARD;
>>>>>> +
>>>>>> + ++tt->stat.vm.stuff;
>>>>>> +
>>>>>> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size)
>>>>>> + return H_PARAMETER;
>>>>>> +
>>>>>> + ret = kvmppc_emulated_validate_tce(tce_value);
>>>>>> + if (ret || (tce_value& (TCE_PCI_WRITE | TCE_PCI_READ)))
>>>>>> + return H_PARAMETER;
>>>>>> +
>>>>>> + for (i = 0; i< npages; ++i, ioba += IOMMU_PAGE_SIZE)
>>>>>> + kvmppc_emulated_put_tce(tt, ioba, tce_value);
>>>>>> +
>>>>>> + return H_SUCCESS;
>>>>>> +}
>>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>> b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>> index 30c2f3b..cd3e6f9 100644
>>>>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>> @@ -14,6 +14,7 @@
>>>>>> *
>>>>>> * Copyright 2010 Paul Mackerras, IBM Corp.<[email protected]>
>>>>>> * Copyright 2011 David Gibson, IBM Corporation<[email protected]>
>>>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<[email protected]>
>>>>>> */
>>>>>>
>>>>>> #include<linux/types.h>
>>>>>> @@ -35,42 +36,243 @@
>>>>>> #include<asm/ppc-opcode.h>
>>>>>> #include<asm/kvm_host.h>
>>>>>> #include<asm/udbg.h>
>>>>>> +#include<asm/iommu.h>
>>>>>> +#include<asm/tce.h>
>>>>>>
>>>>>> #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
>>>>>> +#define ERROR_ADDR (~(unsigned long)0x0)
>>>>>>
>>>>>> -/* WARNING: This will be called in real-mode on HV KVM and virtual
>>>>>> - * mode on PR KVM
>>>>>
>>>>> What's wrong with the warning?
>>>>
>>>>
>>>> It belongs to kvmppc_h_put_tce() which is not called in virtual mode anymore.
>>>
>>> I thought the comment applied to the whole file before? Hrm. Maybe I misread it then.
>>>
>>>> It is technically correct for kvmppc_find_tce_table() though. Should I put
>>>> this comment before every function which may be called from real and
>>>> virtual modes?
>>>
>>> Yes, please. Otherwise someone might stick an access to a non-linear address
>>> in there by accident.
>>>
>>>>
>>>>
>>>>
>>>>>> +/*
>>>>>> + * Finds a TCE table descriptor by LIOBN
>>>>>> */
>>>>>> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
>>>>>> + unsigned long liobn)
>>>>>> +{
>>>>>> + struct kvmppc_spapr_tce_table *tt;
>>>>>> +
>>>>>> + list_for_each_entry(tt,&vcpu->kvm->arch.spapr_tce_tables, list) {
>>>>>> + if (tt->liobn == liobn)
>>>>>> + return tt;
>>>>>> + }
>>>>>> +
>>>>>> + return NULL;
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
>>>>>> +
>>>>>> +#ifdef DEBUG
>>>>>> +/*
>>>>>> + * Lets user mode disable realmode handlers by putting big number
>>>>>> + * in the bottom value of LIOBN
>>>>>
>>>>> What? Seriously? Just don't enable the CAP.
>>>>
>>>>
>>>> It is under DEBUG. It really, really helps to be able to disable real mode
>>>> handlers without reboot. Ok, no debug code, I'll remove.
>>>
>>> Debug code is good, but #ifdefs are bad. For you, an #ifdef reads like
>>> "code that doesn't do any hard when disabled". For me, #ifdefs read
>>> "code that definitely breaks because nobody turns the #define on".
>>>
>>> So please, avoid #ifdef'ed code whenever possible. Switching the CAP on and
>>> off is a much better debug approach in this case.
>>>
>>>>
>>>>
>>>>>> + */
>>>>>> +#define kvmppc_find_tce_table(a, b) \
>>>>>> + ((((b)&0xffff)>10000)?NULL:kvmppc_find_tce_table((a), (b)))
>>>>>> +#endif
>>>>>> +
>>>>>> +/*
>>>>>> + * Validates TCE address.
>>>>>> + * At the moment only flags are validated as other checks will
>>>>>> significantly slow
>>>>>> + * down or can make it even impossible to handle TCE requests in real mode.
>>>>>
>>>>> What?
>>>>
>>>>
>>>> What is missing here (besides good english)?
>>>
>>> What badness could slip through by not validating everything?
>>
>>
>> I cannot think of any good check which could be done in real mode and not
>> be "more than 2 calls deep" (c) Ben. Check that the page is allocated at
>> all? How? Don't know.
>

> If you say that our validation doesn't validate everything, that makes
> me really weary.


It checks that TCE does not have any bit set in bits 2..12. If they are
set, something went very wrong. Better than nothing.


> Could the guest use it to maliciously inject anything?
> Could a missing check make our code go berserk?


No. KVM does not do anything with those addresses, just puts them to the
table and lets QEMU or a guest deal with it.


> What checks exactly would you do in addition when this was virtual mode?


Check that TCE is within RAM boundaries. Or check that the page was
allocated. find_linux_pte_or_hugepte? It can fail in real mode but in
virtual mode I can call get_user_fast_page and confirm that the address is
ok. Not sure, did not think much about it. Compare page flags with TCE
flags if both or neither have "write" set, this kind of stuff.

I am not really sure we need any of those checks for emulated TCE at all.

Remove the comment then?


>>>>>> + */
>>>>>> +long kvmppc_emulated_validate_tce(unsigned long tce)
>>>>>
>>>>> I don't like the naming scheme. Please turn this around and make it
>>>>> kvmppc_tce_validate().
>>>>
>>>>
>>>> Oh. "Like"... Ok.
>>>
>>> Yes. Like.
>>>
>>>>
>>>>
>>>>>> +{
>>>>>> + if (tce& ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
>>>>>> + return H_PARAMETER;
>>>>>> +
>>>>>> + return H_SUCCESS;
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
>>>>>> +
>>>>>> +/*
>>>>>> + * Handles TCE requests for QEMU emulated devices.
>>>>>
>>>>> We still don't mention QEMU in KVM code. And does it really matter whether
>>>>> they're emulated by QEMU? Devices could also be emulated by KVM.
>>>>>
>>>>>> + * Puts guest TCE values to the table and expects QEMU to convert them
>>>>>> + * later in a QEMU device implementation.
>>>>>> + * Called in both real and virtual modes.
>>>>>> + * Cannot fail so kvmppc_emulated_validate_tce must be called before it.
>>>>>> + */
>>>>>> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>>>>
>>>>> kvmppc_tce_put()
>>>>>
>>>>>> + unsigned long ioba, unsigned long tce)
>>>>>> +{
>>>>>> + unsigned long idx = ioba>> SPAPR_TCE_SHIFT;
>>>>>> + struct page *page;
>>>>>> + u64 *tbl;
>>>>>> +
>>>>>> + /*
>>>>>> + * Note on the use of page_address() in real mode,
>>>>>> + *
>>>>>> + * It is safe to use page_address() in real mode on ppc64 because
>>>>>> + * page_address() is always defined as lowmem_page_address()
>>>>>> + * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
>>>>>> + * operation and does not access page struct.
>>>>>> + *
>>>>>> + * Theoretically page_address() could be defined different
>>>>>> + * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
>>>>>> + * should be enabled.
>>>>>> + * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
>>>>>> + * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
>>>>>> + * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
>>>>>> + * is not expected to be enabled on ppc32, page_address()
>>>>>> + * is safe for ppc32 as well.
>>>>>> + */
>>>>>> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
>>>>>> +#error TODO: fix to avoid page_address() here
>>>>>> +#endif
>>>>>
>>>>> Can you extract the text above, the check and the page_address call into a
>>>>> simple wrapper function?
>>>>
>>>>
>>>> Is this function also too big? Sorry, I do not understand the comment.
>>>
>>> All of the comment and #if here only deal with the fact that you
>>> have a real mode hack to call page_address() that happens
>>> to work under specific circumstances.
>>>
>>> There's nothing kvmppc_tce_put() specific about this.
>>> The page_address() code happens to get called here, sure.
>>> But if I read the kvmppc_tce_put() function I don't care about
>>> these details - I want to understand the code flow that ends
>>> up writing the TCE.
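
Concretely, the wrapper being asked for could look like this (the name
kvmppc_page_address is only an example), taking the long comment and the
build-time check with it:

static u64 *kvmppc_page_address(struct page *page)
{
#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
#error TODO: fix to avoid page_address() here
#endif
	return (u64 *) page_address(page);
}

so that the TCE write itself reduces to:

	tbl = kvmppc_page_address(tt->pages[idx / TCES_PER_PAGE]);
	tbl[idx % TCES_PER_PAGE] = tce;
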
>>>
>>>>>> + page = tt->pages[idx / TCES_PER_PAGE];
>>>>>> + tbl = (u64 *)page_address(page);
>>>>>> +
>>>>>> + /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>>>>>
>>>>> This is not an RFC, is it?
>>>>
>>>>
>>>> Any debug code is prohibited? Ok, I'll remove.
>>>
>>> Debug code that requires code changes is prohibited, yes.
>>> Debug code that is runtime switchable (pr_debug, trace points, etc)
>>> are allowed.
>>
>>
>> Is there any easy way to enable just this specific udbg_printf (not all of
>> them at once)? Trace points do not work in real mode as we figured out.
>
> You can enable pr_debug by file IIRC.


On already running kernel? :-/ Wow. How?



>>>>>> + tbl[idx % TCES_PER_PAGE] = tce;
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
>>>>>> +
>>>>>> +#ifdef CONFIG_KVM_BOOK3S_64_HV
>>>>>> +/*
>>>>>> + * Converts guest physical address to host physical address.
>>>>>> + * Tries to increase page counter via realmode_get_page() and
>>>>>> + * returns ERROR_ADDR if failed.
>>>>>> + */
>>>>>> +static unsigned long kvmppc_rm_gpa_to_hpa_and_get(struct kvm_vcpu *vcpu,
>>>>>> + unsigned long gpa, struct page **pg)
>>>>>> +{
>>>>>> + struct kvm_memory_slot *memslot;
>>>>>> + pte_t *ptep, pte;
>>>>>> + unsigned long hva, hpa = ERROR_ADDR;
>>>>>> + unsigned long gfn = gpa>> PAGE_SHIFT;
>>>>>> + unsigned shift = 0;
>>>>>> +
>>>>>> + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>>>>> + if (!memslot)
>>>>>> + return ERROR_ADDR;
>>>>>> +
>>>>>> + hva = __gfn_to_hva_memslot(memslot, gfn);
>>>>>> +
>>>>>> + ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva,&shift);
>>>>>> + if (!ptep || !pte_present(*ptep))
>>>>>> + return ERROR_ADDR;
>>>>>> + pte = *ptep;
>>>>>> +
>>>>>> + if (((gpa& TCE_PCI_WRITE) || pte_write(pte))&& !pte_dirty(pte))
>>>>>> + return ERROR_ADDR;
>>>>>> +
>>>>>> + if (!pte_young(pte))
>>>>>> + return ERROR_ADDR;
>>>>>> +
>>>>>> + if (!shift)
>>>>>> + shift = PAGE_SHIFT;
>>>>>> +
>>>>>> + /* Put huge pages handling to the virtual mode */
>>>>>> + if (shift> PAGE_SHIFT)
>>>>>> + return ERROR_ADDR;
>>>>>> +
>>>>>> + *pg = realmode_pfn_to_page(pte_pfn(pte));
>>>>>> + if (!*pg || realmode_get_page(*pg))
>>>>>> + return ERROR_ADDR;
>>>>>> +
>>>>>> + /* pte_pfn(pte) returns address aligned to pg_size */
>>>>>> + hpa = (pte_pfn(pte)<< PAGE_SHIFT) + (gpa& ((1<< shift) - 1));
>>>>>> +
>>>>>> + if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>>>> + hpa = ERROR_ADDR;
>>>>>> + realmode_put_page(*pg);
>>>>>> + *pg = NULL;
>>>>>> + }
>>>>>> +
>>>>>> + return hpa;
>>>>>> +}
>>>>>> +
>>>>>> long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>>> unsigned long ioba, unsigned long tce)
>>>>>> {
>>>>>> - struct kvm *kvm = vcpu->kvm;
>>>>>> - struct kvmppc_spapr_tce_table *stt;
>>>>>> -
>>>>>> - /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>>>>>> - /* liobn, ioba, tce); */
>>>>>> -
>>>>>> - list_for_each_entry(stt,&kvm->arch.spapr_tce_tables, list) {
>>>>>> - if (stt->liobn == liobn) {
>>>>>> - unsigned long idx = ioba>> SPAPR_TCE_SHIFT;
>>>>>> - struct page *page;
>>>>>> - u64 *tbl;
>>>>>> -
>>>>>> - /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p
>>>>>> window_size=0x%x\n", */
>>>>>> - /* liobn, stt, stt->window_size); */
>>>>>> - if (ioba>= stt->window_size)
>>>>>> - return H_PARAMETER;
>>>>>> -
>>>>>> - page = stt->pages[idx / TCES_PER_PAGE];
>>>>>> - tbl = (u64 *)page_address(page);
>>>>>> -
>>>>>> - /* FIXME: Need to validate the TCE itself */
>>>>>> - /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>>>>>> - tbl[idx % TCES_PER_PAGE] = tce;
>>>>>> - return H_SUCCESS;
>>>>>> - }
>>>>>> + long ret;
>>>>>> + struct kvmppc_spapr_tce_table *tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>>> +
>>>>>> + if (!tt)
>>>>>> + return H_TOO_HARD;
>>>>>> +
>>>>>> + ++tt->stat.rm.put;
>>>>>> +
>>>>>> + if (ioba>= tt->window_size)
>>>>>> + return H_PARAMETER;
>>>>>> +
>>>>>> + ret = kvmppc_emulated_validate_tce(tce);
>>>>>> + if (!ret)
>>>>>> + kvmppc_emulated_put_tce(tt, ioba, tce);
>>>>>> +
>>>>>> + return ret;
>>>>>> +}
>>>>>> +
>>>>>> +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>
>>>>> So the _vm version is the normal one and this is the _rm version? If so,
>>>>> please mark it as such. Is there any way to generate both from the same
>>>>> source? The way it's now there is a lot of duplicate code.
>>>>
>>>>
>>>> I tried, looked very ugly. If you insist, I will do so.
>>>
>>
>>> If it looks ugly better don't. I just want to make sure you explored the option.
>>> But please keep the naming scheme consistent.
>>
>>
>> Removed _vm everywhere and put _rm in realmode handlers. I just was
>> confused by _vm in kvm_vm_ioctl_create_spapr_tce() at the first place.
>
> That vm refers to the virtual machine. It's on VM scope, not VCPU scope.


I do not mind, just saying where it came from :)


--
Alexey

2013-07-11 11:15:28

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls


On 11.07.2013, at 12:54, Alexey Kardashevskiy wrote:

> On 07/11/2013 08:11 PM, Alexander Graf wrote:
>>
>> On 11.07.2013, at 07:12, Alexey Kardashevskiy wrote:
>>
>>> On 07/10/2013 08:05 PM, Alexander Graf wrote:
>>>>
>>>> On 10.07.2013, at 07:00, Alexey Kardashevskiy wrote:
>>>>
>>>>> On 07/10/2013 03:02 AM, Alexander Graf wrote:
>>>>>> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>>>>>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>>>>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>>>>>> devices or emulated PCI. These calls allow adding multiple entries
>>>>>>> (up to 512) into the TCE table in one call which saves time on
>>>>>>> transition to/from real mode.
>>>>>>
>>>>>> We don't mention QEMU explicitly in KVM code usually.
>>>>>>
>>>>>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>>>>>> (copied from user and verified) before writing the whole list into
>>>>>>> the TCE table. This cache will be utilized more in the upcoming
>>>>>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>>>>>> mode in the case if the real mode handler failed for some reason.
>>>>>>>
>>>>>>> This adds a guest physical to host real address converter
>>>>>>> and calls the existing H_PUT_TCE handler. The converting function
>>>>>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>>>>>>
>>>>>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>>>>>> so in order to support the functionality of this patch, QEMU
>>>>>>> needs to query for this capability and set the "hcall-multi-tce"
>>>>>>> hypertas property only if the capability is present, otherwise
>>>>>>> there will be serious performance degradation.
>>>>>>
>>>>>> Same as above. But really you're only giving recommendations here. What's
>>>>>> the point? Please describe what the benefit of this patch is, not what some
>>>>>> other random subsystem might do with the benefits it brings.
>>>>>>
>>>>>>>
>>>>>>> Signed-off-by: Paul Mackerras<[email protected]>
>>>>>>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>>>>>>
>>>>>>> ---
>>>>>>> Changelog:
>>>>>>> 2013/07/06:
>>>>>>> * fixed number of wrong get_page()/put_page() calls
>>>>>>>
>>>>>>> 2013/06/27:
>>>>>>> * fixed clear of BUSY bit in kvmppc_lookup_pte()
>>>>>>> * H_PUT_TCE_INDIRECT does realmode_get_page() now
>>>>>>> * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
>>>>>>> * updated doc
>>>>>>>
>>>>>>> 2013/06/05:
>>>>>>> * fixed mistype about IBMVIO in the commit message
>>>>>>> * updated doc and moved it to another section
>>>>>>> * changed capability number
>>>>>>>
>>>>>>> 2013/05/21:
>>>>>>> * added kvm_vcpu_arch::tce_tmp
>>>>>>> * removed cleanup if put_indirect failed, instead we do not even start
>>>>>>> writing to TCE table if we cannot get TCEs from the user and they are
>>>>>>> invalid
>>>>>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>>>>>> and kvmppc_emulated_validate_tce (for the previous item)
>>>>>>> * fixed bug with failthrough for H_IPI
>>>>>>> * removed all get_user() from real mode handlers
>>>>>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>>>>>>
>>>>>>> Signed-off-by: Alexey Kardashevskiy<[email protected]>
>>>>>>> ---
>>>>>>> Documentation/virtual/kvm/api.txt | 25 +++
>>>>>>> arch/powerpc/include/asm/kvm_host.h | 9 ++
>>>>>>> arch/powerpc/include/asm/kvm_ppc.h | 16 +-
>>>>>>> arch/powerpc/kvm/book3s_64_vio.c | 154 ++++++++++++++++++-
>>>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 260
>>>>>>> ++++++++++++++++++++++++++++----
>>>>>>> arch/powerpc/kvm/book3s_hv.c | 41 ++++-
>>>>>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
>>>>>>> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++-
>>>>>>> arch/powerpc/kvm/powerpc.c | 3 +
>>>>>>> 9 files changed, 517 insertions(+), 34 deletions(-)
>>>>>>>
>>>>>>> diff --git a/Documentation/virtual/kvm/api.txt
>>>>>>> b/Documentation/virtual/kvm/api.txt
>>>>>>> index 6365fef..762c703 100644
>>>>>>> --- a/Documentation/virtual/kvm/api.txt
>>>>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>>>>> @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed
>>>>>>> to userspace to be
>>>>>>> handled.
>>>>>>>
>>>>>>>
>>>>>>> +4.86 KVM_CAP_PPC_MULTITCE
>>>>>>> +
>>>>>>> +Capability: KVM_CAP_PPC_MULTITCE
>>>>>>> +Architectures: ppc
>>>>>>> +Type: vm
>>>>>>> +
>>>>>>> +This capability means the kernel is capable of handling hypercalls
>>>>>>> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
>>>>>>> +space. This significanly accelerates DMA operations for PPC KVM guests.
>>>>>>
>>>>>> significanly? Please run this through a spell checker.
>>>>>>
>>>>>>> +The user space should expect that its handlers for these hypercalls
>>>>>>
>>>>>> s/The//
>>>>>>
>>>>>>> +are not going to be called.
>>>>>>
>>>>>> Is user space guaranteed they will not be called? Or can it still happen?
>>>>>
>>>>> ... if user space previously registered LIOBN in KVM (via
>>>>> KVM_CREATE_SPAPR_TCE or similar calls).
>>>>>
>>>>> ok?
>>>>
>>>> How about this?
>>>>
>>>> The hypercalls mentioned above may or may not be processed successfully in the kernel based fast path. If they can not be handled by the kernel, they will get passed on to user space. So user space still has to have an implementation for these despite the in kernel acceleration.
>>>>
>>>> ---
>>>>
>>>> The target audience for this documentation is user space KVM API users. Someone developing kvm tool for example. They want to know implications specific CAPs have.
>>>>
>>>>>
>>>>> There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet
>>>>> and may never get there.
>>>>>
>>>>>
>>>>>>> +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
>>>>>>> +the user space might have to advertise it for the guest. For example,
>>>>>>> +IBM pSeries guest starts using them if "hcall-multi-tce" is present in
>>>>>>> +the "ibm,hypertas-functions" device-tree property.
>>>>>>
>>>>>> This paragraph describes sPAPR. That's fine, but please document it as
>>>>>> such. Also please check your grammar.
>>>>>
>>>>>>> +
>>>>>>> +Without this capability, only H_PUT_TCE is handled by the kernel and
>>>>>>> +therefore the use of H_PUT_TCE_INDIRECT and H_STUFF_TCE is not recommended
>>>>>>> +unless the capability is present as passing hypercalls to the userspace
>>>>>>> +slows operations a lot.
>>>>>>> +
>>>>>>> +Unlike other capabilities of this section, this one is always enabled.
>>>>>>
>>>>>> Why? Wouldn't that confuse older user space?
>>>>>
>>>>>
>>>>> How? Old user space won't check for this capability and won't tell the
>>>>> guest to use it (via "hcall-multi-tce"). Old H_PUT_TCE is still there.
>>>>>
>>>>> If the guest always uses H_PUT_TCE_INDIRECT/H_STUFF_TCE no matter what,
>>>>> then it is its problem - it won't work now anyway as neither QEMU nor host
>>>>> kernel supports these calls.
>>>
>>>
>>>> Always assume that you are a kernel developer without knowledge
>>>> of any user space code using your interfaces. So there is the theoretical
>>>> possibility that there is a user space client out there that implements
>>>> H_PUT_TCE_INDIRECT and advertises hcall-multi-tce to the guest.
>>>> Would that client break? If so, we should definitely have
>>>> the CAP disabled by default.
>>>
>>>
>>> No, it won't break. Why would it break? I really do not get it. This user
>>> space client has to do an extra step to get this acceleration by calling
>>> ioctl(KVM_CREATE_SPAPR_TCE) anyway. Previously that ioctl only had effect
>>> on H_PUT_TCE, now on all three hcalls.
>>
>> Hrm. It's a change of behavior, it probably wouldn't break, yes.
>
>
> Aaand?


And that's bad. Jeez, seriously. Don't argue this case. We enable new features individually unless we're 100% sure we can keep everything working. In this case an ENABLE_CAP doesn't hurt at all, because user space still needs to handle the hypercalls if it wants them anyways. But you get debugging for free for example.
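
On the user-space side that costs very little. A rough sketch (not taken from
any real user-space implementation, error handling omitted) of probing and
enabling the capability before advertising "hcall-multi-tce", while still
keeping the user-space H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers as a fallback:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* kvm_fd is the /dev/kvm fd, vcpu_fd a vcpu fd; names are illustrative */
static int try_enable_multitce(int kvm_fd, int vcpu_fd)
{
	struct kvm_enable_cap cap = { .cap = KVM_CAP_SPAPR_MULTITCE };

	if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_MULTITCE) <= 0)
		return -1;	/* old kernel: don't advertise hcall-multi-tce */

	/* if the cap ends up per vcpu, this is done for every vcpu */
	return ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap);
}
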

>
>
>>>> But really, it's also as much about consistency as anything else.
>>>> If we leave everything as is and always extend functionality
>>>> by enabling new CAPs, we're pretty much guaranteed that we
>>>> don't break anything by accident. It also makes debugging easier
>>>> because you can for example disable this particular feature
>>>> to see whether something has bad side effects.
>>>
>>>
>>> So I must add one more ioctl to enable MULTITCE in kernel handling. Is it
>>> what you are saying?
>>>
>>> I can see KVM_CHECK_EXTENSION but I do not see KVM_ENABLE_EXTENSION or
>>> anything like that.
>>
>> KVM_ENABLE_CAP. It's how we enable sPAPR capabilities too.
>
>
> Yeah, Paul already explained. It is platform specific but ok.
> And does not have "EXTENSION" in the name for some reason but ok too.
>
> KVM_ENABLE_CAP is vcpu ioctl. So kvm_arch_vcpu_ioctl() enables VCPU's
> capabilities while KVM_CAP_SPAPR_MULTITCE is KVM (or more precisely
> SPAPR-TCE/LIOBN but I really do not want it to be that specific) capability.
>
> Sure I can add to kvm_arch_vcpu_ioctl():
>
> case KVM_CAP_SPAPR_MULTITCE:
> r = 0;
> vcpu->kvm->arch.spapr_multitce_enabled = cap->args[0];
> break;
>
> But I suspect you and Ben will call it ugly. So do I have to implement
> KVM_ENABLE_CAP in kvm_arch_vm_ioctl and change the api.txt that it is not
> just about vcpu ioctl anymore? Or my brand new ioctl for this?


There are 2 ways of dealing with this:

1) Call the ENABLE_CAP on every vcpu. That way one CPU may handle this hypercall in the kernel while another one may not. The same as we handle PAPR today.

2) Create a new ENABLE_CAP for the vm.

I think in this case option 1 is fine - it's how we handle everything else already.
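
For option 1 the kernel side is a few lines in kvm_vcpu_ioctl_enable_cap() in
arch/powerpc/kvm/powerpc.c, modelled on the existing KVM_CAP_PPC_PAPR case;
the flag name below is invented for this sketch and is not in the posted
patches:

	case KVM_CAP_SPAPR_MULTITCE:
		r = 0;
		vcpu->arch.multitce_enabled = true;	/* invented field */
		break;

The in-kernel H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers would then return
H_TOO_HARD when the flag is not set, so the hypercalls keep going to user
space exactly as before.
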

>
>
>
>
>>>>>>> +
>>>>>>> +
>>>>>>> 5. The kvm_run structure
>>>>>>> ------------------------
>>>>>>>
>>>>>>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>>>>>>> b/arch/powerpc/include/asm/kvm_host.h
>>>>>>> index af326cd..20d04bd 100644
>>>>>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>>>>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>>>>>> @@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
>>>>>>> struct kvm *kvm;
>>>>>>> u64 liobn;
>>>>>>> u32 window_size;
>>>>>>> + struct { struct { unsigned long put, indir, stuff; } rm, vm; } stat;
>>>>>>
>>>>>> You don't need this.
>>>>>>
>>>>>>> struct page *pages[0];
>>>>>>> };
>>>>>>>
>>>>>>> @@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
>>>>>>> spinlock_t tbacct_lock;
>>>>>>> u64 busy_stolen;
>>>>>>> u64 busy_preempt;
>>>>>>> +
>>>>>>> + unsigned long *tce_tmp_hpas; /* TCE cache for TCE_PUT_INDIRECT
>>>>>>> hcall */
>>>>>>> + enum {
>>>>>>> + TCERM_NONE,
>>>>>>> + TCERM_GETPAGE,
>>>>>>> + TCERM_PUTTCE,
>>>>>>> + TCERM_PUTLIST,
>>>>>>> + } tce_rm_fail; /* failed stage of request processing */
>>>>>>> #endif
>>>>>>> };
>>>>>>>
>>>>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h
>>>>>>> b/arch/powerpc/include/asm/kvm_ppc.h
>>>>>>> index a5287fe..fa722a0 100644
>>>>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>>>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>>>>>> @@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu
>>>>>>> *vcpu);
>>>>>>>
>>>>>>> extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>>>>>> struct kvm_create_spapr_tce *args);
>>>>>>> -extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>>>>>>> - unsigned long ioba, unsigned long tce);
>>>>>>> +extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
>>>>>>> + struct kvm_vcpu *vcpu, unsigned long liobn);
>>>>>>> +extern long kvmppc_emulated_validate_tce(unsigned long tce);
>>>>>>> +extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>>>>>> + unsigned long ioba, unsigned long tce);
>>>>>>> +extern long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>>>>>> + unsigned long liobn, unsigned long ioba,
>>>>>>> + unsigned long tce);
>>>>>>> +extern long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>>> + unsigned long liobn, unsigned long ioba,
>>>>>>> + unsigned long tce_list, unsigned long npages);
>>>>>>> +extern long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>>>> + unsigned long liobn, unsigned long ioba,
>>>>>>> + unsigned long tce_value, unsigned long npages);
>>>>>>> extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
>>>>>>> struct kvm_allocate_rma *rma);
>>>>>>> extern struct kvmppc_linear_info *kvm_alloc_rma(void);
>>>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c
>>>>>>> b/arch/powerpc/kvm/book3s_64_vio.c
>>>>>>> index b2d3f3b..99bf4e5 100644
>>>>>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>>>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>>>>>>> @@ -14,6 +14,7 @@
>>>>>>> *
>>>>>>> * Copyright 2010 Paul Mackerras, IBM Corp.<[email protected]>
>>>>>>> * Copyright 2011 David Gibson, IBM Corporation<[email protected]>
>>>>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<[email protected]>
>>>>>>> */
>>>>>>>
>>>>>>> #include<linux/types.h>
>>>>>>> @@ -36,8 +37,10 @@
>>>>>>> #include<asm/ppc-opcode.h>
>>>>>>> #include<asm/kvm_host.h>
>>>>>>> #include<asm/udbg.h>
>>>>>>> +#include<asm/iommu.h>
>>>>>>> +#include<asm/tce.h>
>>>>>>>
>>>>>>> -#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
>>>>>>> +#define ERROR_ADDR ((void *)~(unsigned long)0x0)
>>>>>>>
>>>>>>> static long kvmppc_stt_npages(unsigned long window_size)
>>>>>>> {
>>>>>>> @@ -50,6 +53,20 @@ static void release_spapr_tce_table(struct
>>>>>>> kvmppc_spapr_tce_table *stt)
>>>>>>> struct kvm *kvm = stt->kvm;
>>>>>>> int i;
>>>>>>>
>>>>>>> +#define __SV(x) stt->stat.x
>>>>>>> +#define __SVD(x) (__SV(rm.x)?(__SV(rm.x)-__SV(vm.x)):0)
>>>>>>> + pr_debug("%s stat for liobn=%llx\n"
>>>>>>> + "--------------- realmode ----- virtmode ---\n"
>>>>>>> + "put_tce %10ld %10ld\n"
>>>>>>> + "put_tce_indir %10ld %10ld\n"
>>>>>>> + "stuff_tce %10ld %10ld\n",
>>>>>>> + __func__, stt->liobn,
>>>>>>> + __SVD(put), __SV(vm.put),
>>>>>>> + __SVD(indir), __SV(vm.indir),
>>>>>>> + __SVD(stuff), __SV(vm.stuff));
>>>>>>> +#undef __SVD
>>>>>>> +#undef __SV
>>>>>>
>>>>>> All of these stat points should just be trace points. You can do the
>>>>>> statistic gathering from user space then.
>>>>>>
>>>>>>> +
>>>>>>> mutex_lock(&kvm->lock);
>>>>>>> list_del(&stt->list);
>>>>>>> for (i = 0; i< kvmppc_stt_npages(stt->window_size); i++)
>>>>>>> @@ -148,3 +165,138 @@ fail:
>>>>>>> }
>>>>>>> return ret;
>>>>>>> }
>>>>>>> +
>>>>>>> +/* Converts guest physical address to host virtual address */
>>>>>>> +static void __user *kvmppc_vm_gpa_to_hva_and_get(struct kvm_vcpu *vcpu,
>>>>>>
>>>>>> Please don't distinguish _vm versions. They're the normal case. _rm ones
>>>>>> are the special ones.
>>>>>>
>>>>>>> + unsigned long gpa, struct page **pg)
>>>>>>> +{
>>>>>>> + unsigned long hva, gfn = gpa>> PAGE_SHIFT;
>>>>>>> + struct kvm_memory_slot *memslot;
>>>>>>> +
>>>>>>> + memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
>>>>>>> + if (!memslot)
>>>>>>> + return ERROR_ADDR;
>>>>>>> +
>>>>>>> + hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa& ~PAGE_MASK);
>>>>>>
>>>>>> s/+/|/
>>>>>>
>>>>>>> +
>>>>>>> + if (get_user_pages_fast(hva& PAGE_MASK, 1, 0, pg) != 1)
>>>>>>> + return ERROR_ADDR;
>>>>>>> +
>>>>>>> + return (void *) hva;
>>>>>>> +}
>>>>>>> +
>>>>>>> +long kvmppc_vm_h_put_tce(struct kvm_vcpu *vcpu,
>>>>>>> + unsigned long liobn, unsigned long ioba,
>>>>>>> + unsigned long tce)
>>>>>>> +{
>>>>>>> + long ret;
>>>>>>> + struct kvmppc_spapr_tce_table *tt;
>>>>>>> +
>>>>>>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>>>> + /* Didn't find the liobn, put it to userspace */
>>>>>>
>>>>>> Unclear comment.
>>>>>
>>>>>
>>>>> What detail is missing?
>>>>
>>>
>>>> Grammar wise "it" in the second half of the sentence refers to liobn.
>>>> So you "put" the "liobn to userspace". That sentence doesn't
>>>> make any sense.
>>>
>>>
>>> Removed it. H_TOO_HARD itself says enough already.
>>>
>>>
>>>> What you really want to say is:
>>>>
>>>> /* Couldn't find the liobn. Something went wrong. Let user space handle the hypercall. That has better ways of dealing with errors. */
>>>>
>>>>>
>>>>>
>>>>>>> + if (!tt)
>>>>>>> + return H_TOO_HARD;
>>>>>>> +
>>>>>>> + ++tt->stat.vm.put;
>>>>>>> +
>>>>>>> + if (ioba>= tt->window_size)
>>>>>>> + return H_PARAMETER;
>>>>>>> +
>>>>>>> + ret = kvmppc_emulated_validate_tce(tce);
>>>>>>> + if (ret)
>>>>>>> + return ret;
>>>>>>> +
>>>>>>> + kvmppc_emulated_put_tce(tt, ioba, tce);
>>>>>>> +
>>>>>>> + return H_SUCCESS;
>>>>>>> +}
>>>>>>> +
>>>>>>> +long kvmppc_vm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>>>>>>> + unsigned long liobn, unsigned long ioba,
>>>>>>> + unsigned long tce_list, unsigned long npages)
>>>>>>> +{
>>>>>>> + struct kvmppc_spapr_tce_table *tt;
>>>>>>> + long i, ret = H_SUCCESS;
>>>>>>> + unsigned long __user *tces;
>>>>>>> + struct page *pg = NULL;
>>>>>>> +
>>>>>>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>>>> + /* Didn't find the liobn, put it to userspace */
>>>>>>> + if (!tt)
>>>>>>> + return H_TOO_HARD;
>>>>>>> +
>>>>>>> + ++tt->stat.vm.indir;
>>>>>>> +
>>>>>>> + /*
>>>>>>> + * The spec says that the maximum size of the list is 512 TCEs so
>>>>>>> + * so the whole table addressed resides in 4K page
>>>>>>
>>>>>> so so?
>>>>>>
>>>>>>> + */
>>>>>>> + if (npages> 512)
>>>>>>> + return H_PARAMETER;
>>>>>>> +
>>>>>>> + if (tce_list& ~IOMMU_PAGE_MASK)
>>>>>>> + return H_PARAMETER;
>>>>>>> +
>>>>>>> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size)
>>>>>>> + return H_PARAMETER;
>>>>>>> +
>>>>>>> + tces = kvmppc_vm_gpa_to_hva_and_get(vcpu, tce_list,&pg);
>>>>>>> + if (tces == ERROR_ADDR)
>>>>>>> + return H_TOO_HARD;
>>>>>>> +
>>>>>>> + if (vcpu->arch.tce_rm_fail == TCERM_PUTLIST)
>>>>>>> + goto put_list_page_exit;
>>>>>>> +
>>>>>>> + for (i = 0; i< npages; ++i) {
>>>>>>> + if (get_user(vcpu->arch.tce_tmp_hpas[i], tces + i)) {
>>>>>>> + ret = H_PARAMETER;
>>>>>>> + goto put_list_page_exit;
>>>>>>> + }
>>>>>>> +
>>>>>>> + ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp_hpas[i]);
>>>>>>> + if (ret)
>>>>>>> + goto put_list_page_exit;
>>>>>>> + }
>>>>>>> +
>>>>>>> + for (i = 0; i< npages; ++i)
>>>>>>> + kvmppc_emulated_put_tce(tt, ioba + (i<< IOMMU_PAGE_SHIFT),
>>>>>>> + vcpu->arch.tce_tmp_hpas[i]);
>>>>>>> +put_list_page_exit:
>>>>>>> + if (pg)
>>>>>>> + put_page(pg);
>>>>>>> +
>>>>>>> + if (vcpu->arch.tce_rm_fail != TCERM_NONE) {
>>>>>>> + vcpu->arch.tce_rm_fail = TCERM_NONE;
>>>>>>> + if (pg&& !PageCompound(pg))
>>>>>>> + put_page(pg); /* finish pending realmode_put_page() */
>>>>>>> + }
>>>>>>> +
>>>>>>> + return ret;
>>>>>>> +}
>>>>>>> +
>>>>>>> +long kvmppc_vm_h_stuff_tce(struct kvm_vcpu *vcpu,
>>>>>>> + unsigned long liobn, unsigned long ioba,
>>>>>>> + unsigned long tce_value, unsigned long npages)
>>>>>>> +{
>>>>>>> + struct kvmppc_spapr_tce_table *tt;
>>>>>>> + long i, ret;
>>>>>>> +
>>>>>>> + tt = kvmppc_find_tce_table(vcpu, liobn);
>>>>>>> + /* Didn't find the liobn, put it to userspace */
>>>>>>> + if (!tt)
>>>>>>> + return H_TOO_HARD;
>>>>>>> +
>>>>>>> + ++tt->stat.vm.stuff;
>>>>>>> +
>>>>>>> + if ((ioba + (npages<< IOMMU_PAGE_SHIFT))> tt->window_size)
>>>>>>> + return H_PARAMETER;
>>>>>>> +
>>>>>>> + ret = kvmppc_emulated_validate_tce(tce_value);
>>>>>>> + if (ret || (tce_value& (TCE_PCI_WRITE | TCE_PCI_READ)))
>>>>>>> + return H_PARAMETER;
>>>>>>> +
>>>>>>> + for (i = 0; i< npages; ++i, ioba += IOMMU_PAGE_SIZE)
>>>>>>> + kvmppc_emulated_put_tce(tt, ioba, tce_value);
>>>>>>> +
>>>>>>> + return H_SUCCESS;
>>>>>>> +}
>>>>>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>>> b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>>> index 30c2f3b..cd3e6f9 100644
>>>>>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
>>>>>>> @@ -14,6 +14,7 @@
>>>>>>> *
>>>>>>> * Copyright 2010 Paul Mackerras, IBM Corp.<[email protected]>
>>>>>>> * Copyright 2011 David Gibson, IBM Corporation<[email protected]>
>>>>>>> + * Copyright 2013 Alexey Kardashevskiy, IBM Corporation<[email protected]>
>>>>>>> */
>>>>>>>
>>>>>>> #include<linux/types.h>
>>>>>>> @@ -35,42 +36,243 @@
>>>>>>> #include<asm/ppc-opcode.h>
>>>>>>> #include<asm/kvm_host.h>
>>>>>>> #include<asm/udbg.h>
>>>>>>> +#include<asm/iommu.h>
>>>>>>> +#include<asm/tce.h>
>>>>>>>
>>>>>>> #define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
>>>>>>> +#define ERROR_ADDR (~(unsigned long)0x0)
>>>>>>>
>>>>>>> -/* WARNING: This will be called in real-mode on HV KVM and virtual
>>>>>>> - * mode on PR KVM
>>>>>>
>>>>>> What's wrong with the warning?
>>>>>
>>>>>
>>>>> It belongs to kvmppc_h_put_tce() which is not called in virtual mode anymore.
>>>>
>>>> I thought the comment applied to the whole file before? Hrm. Maybe I misread it then.
>>>>
>>>>> It is technically correct for kvmppc_find_tce_table() though. Should I put
>>>>> this comment before every function which may be called from real and
>>>>> virtual modes?
>>>>
>>>> Yes, please. Otherwise someone might stick an access to a non-linear address
>>>> in there by accident.
>>>>
>>>>>
>>>>>
>>>>>
>>>>>>> +/*
>>>>>>> + * Finds a TCE table descriptor by LIOBN
>>>>>>> */
>>>>>>> +struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
>>>>>>> + unsigned long liobn)
>>>>>>> +{
>>>>>>> + struct kvmppc_spapr_tce_table *tt;
>>>>>>> +
>>>>>>> + list_for_each_entry(tt,&vcpu->kvm->arch.spapr_tce_tables, list) {
>>>>>>> + if (tt->liobn == liobn)
>>>>>>> + return tt;
>>>>>>> + }
>>>>>>> +
>>>>>>> + return NULL;
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
>>>>>>> +
>>>>>>> +#ifdef DEBUG
>>>>>>> +/*
>>>>>>> + * Lets user mode disable realmode handlers by putting big number
>>>>>>> + * in the bottom value of LIOBN
>>>>>>
>>>>>> What? Seriously? Just don't enable the CAP.
>>>>>
>>>>>
>>>>> It is under DEBUG. It really, really helps to be able to disable real mode
>>>>> handlers without reboot. Ok, no debug code, I'll remove.
>>>>
>>>> Debug code is good, but #ifdefs are bad. For you, an #ifdef reads like
>>>> "code that doesn't do any hard when disabled". For me, #ifdefs read
>>>> "code that definitely breaks because nobody turns the #define on".
>>>>
>>>> So please, avoid #ifdef'ed code whenever possible. Switching the CAP on and
>>>> off is a much better debug approach in this case.
>>>>
>>>>>
>>>>>
>>>>>>> + */
>>>>>>> +#define kvmppc_find_tce_table(a, b) \
>>>>>>> + ((((b)&0xffff)>10000)?NULL:kvmppc_find_tce_table((a), (b)))
>>>>>>> +#endif
>>>>>>> +
>>>>>>> +/*
>>>>>>> + * Validates TCE address.
>>>>>>> + * At the moment only flags are validated as other checks will
>>>>>>> significantly slow
>>>>>>> + * down or can make it even impossible to handle TCE requests in real mode.
>>>>>>
>>>>>> What?
>>>>>
>>>>>
>>>>> What is missing here (besides good english)?
>>>>
>>>> What badness could slip through by not validating everything?
>>>
>>>
>>> I cannot think of any good check which could be done in real mode and not
>>> be "more than 2 calls deep" (c) Ben. Check that the page is allocated at
>>> all? How? Don't know.
>>
>
>> If you say that our validation doesn't validate everything, that makes
>> me really weary.
>
>
> It checks that TCE does not have any bit set in bits 2..12. If they are
> set, something went very wrong. Better than nothing.
>
>
>> Could the guest use it to maliciously inject anything?
>> Could a missing check make our code go berserk?
>
>
> No. KVM does not do anything with those addresses, just puts them to the
> table and lets QEMU or a guest deal with it.
>
>
>> What checks exactly would you do in addition when this was virtual mode?
>
>
> Check that TCE is within RAM boundaries. Or check that the page was
> allocated. find_linux_pte_or_hugepte? It can fail in real mode but in
> virtual mode I can call get_user_fast_page and confirm that the address is
> ok. Not sure, did not think much about it. Compare page flags with TCE
> flags if both or neither have "write" set, this kind of stuff.
>
> I am not really sure we need any of those checks for emulated TCE at all.
>
> Remove the comment then?

No, extend it. Explain what we could check and that we rely on surroundings to ensure everything's fine.

>
>
>>>>>>> + */
>>>>>>> +long kvmppc_emulated_validate_tce(unsigned long tce)
>>>>>>
>>>>>> I don't like the naming scheme. Please turn this around and make it
>>>>>> kvmppc_tce_validate().
>>>>>
>>>>>
>>>>> Oh. "Like"... Ok.
>>>>
>>>> Yes. Like.
>>>>
>>>>>
>>>>>
>>>>>>> +{
>>>>>>> + if (tce& ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
>>>>>>> + return H_PARAMETER;
>>>>>>> +
>>>>>>> + return H_SUCCESS;
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
>>>>>>> +
>>>>>>> +/*
>>>>>>> + * Handles TCE requests for QEMU emulated devices.
>>>>>>
>>>>>> We still don't mention QEMU in KVM code. And does it really matter whether
>>>>>> they're emulated by QEMU? Devices could also be emulated by KVM.
>>>>>>
>>>>>>> + * Puts guest TCE values to the table and expects QEMU to convert them
>>>>>>> + * later in a QEMU device implementation.
>>>>>>> + * Called in both real and virtual modes.
>>>>>>> + * Cannot fail so kvmppc_emulated_validate_tce must be called before it.
>>>>>>> + */
>>>>>>> +void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
>>>>>>
>>>>>> kvmppc_tce_put()
>>>>>>
>>>>>>> + unsigned long ioba, unsigned long tce)
>>>>>>> +{
>>>>>>> + unsigned long idx = ioba>> SPAPR_TCE_SHIFT;
>>>>>>> + struct page *page;
>>>>>>> + u64 *tbl;
>>>>>>> +
>>>>>>> + /*
>>>>>>> + * Note on the use of page_address() in real mode,
>>>>>>> + *
>>>>>>> + * It is safe to use page_address() in real mode on ppc64 because
>>>>>>> + * page_address() is always defined as lowmem_page_address()
>>>>>>> + * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
>>>>>>> + * operation and does not access page struct.
>>>>>>> + *
>>>>>>> + * Theoretically page_address() could be defined different
>>>>>>> + * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
>>>>>>> + * should be enabled.
>>>>>>> + * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
>>>>>>> + * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
>>>>>>> + * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
>>>>>>> + * is not expected to be enabled on ppc32, page_address()
>>>>>>> + * is safe for ppc32 as well.
>>>>>>> + */
>>>>>>> +#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
>>>>>>> +#error TODO: fix to avoid page_address() here
>>>>>>> +#endif
>>>>>>
>>>>>> Can you extract the text above, the check and the page_address call into a
>>>>>> simple wrapper function?
>>>>>
>>>>>
>>>>> Is this function also too big? Sorry, I do not understand the comment.
>>>>
>>>> All of the comment and #if here only deal with the fact that you
>>>> have a real mode hack to call page_address() that happens
>>>> to work under specific circumstances.
>>>>
>>>> There's nothing kvmppc_tce_put() specific about this.
>>>> The page_address() code happens to get called here, sure.
>>>> But if I read the kvmppc_tce_put() function I don't care about
>>>> these details - I want to understand the code flow that ends
>>>> up writing the TCE.
>>>>
>>>>>>> + page = tt->pages[idx / TCES_PER_PAGE];
>>>>>>> + tbl = (u64 *)page_address(page);
>>>>>>> +
>>>>>>> + /* udbg_printf("tce @ %p\n",&tbl[idx % TCES_PER_PAGE]); */
>>>>>>
>>>>>> This is not an RFC, is it?
>>>>>
>>>>>
>>>>> Any debug code is prohibited? Ok, I'll remove.
>>>>
>>>> Debug code that requires code changes is prohibited, yes.
>>>> Debug code that is runtime switchable (pr_debug, trace points, etc)
>>>> are allowed.
>>>
>>>
>>> Is there any easy way to enable just this specific udbg_printf (not all of
>>> them at once)? Trace points do not work in real mode as we figured out.
>>
>> You can enable pr_debug by file IIRC.
>
>
> On already running kernel? :-/ Wow. How?

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/dynamic-debug-howto.txt?id=HEAD
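
Applied to the call site above, the commented-out udbg_printf could become a
pr_debug, which with CONFIG_DYNAMIC_DEBUG is switchable per file (or per line)
on a running kernel - with the caveat from earlier in the thread that this
only helps the virtual-mode handlers, not real mode. A sketch:

	pr_debug("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]);

	/*
	 * enabled at run time per file (see the howto above), e.g.:
	 *   echo 'file book3s_64_vio_hv.c +p' \
	 *       > /sys/kernel/debug/dynamic_debug/control
	 */
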


Alex

2013-07-11 12:37:35

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

On Thu, 2013-07-11 at 11:52 +0200, Alexander Graf wrote:
> > Where exactly (it is rather SPAPR_TCE_IOMMU but does not really
> matter)?
> > Select it on KVM_BOOK3S_64? CONFIG_KVM_BOOK3S_64_HV?
> > CONFIG_KVM_BOOK3S_64_PR? PPC_BOOK3S_64?
>
> I'd say the most logical choice would be to check the Makefile and see
> when it gets compiled. For those cases we want it enabled.

What *what* gets compiled ? You know our Makefile, it's crap :-)

We enable built-in things when CONFIG_KVM=m (which means you cannot take
a kernel build with CONFIG_KVM not set, enable CONFIG_KVM=m, and just
build the module, it won't work).

We could use KVM_BOOK3S_64 maybe ?

> > I am trying to imagine a configuration where we really do not want
> > IOMMU_API. Ben mentioned PPC32 and embedded PPC64 and that's it so
> any of
> > BOOK3S (KVM_BOOK3S_64 is the best) should be fine, no?
>
> book3s_32 doesn't want this, but any book3s_64 implementation could
> potentially use it, yes. That's pretty much what the Makefile tells
> you too :).

Not really, no. But that would do. You could have given a more useful
answer in the first place, though, rather than stringing him along.


Cheers,
Ben.

2013-07-11 12:39:56

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

On Thu, 2013-07-11 at 13:15 +0200, Alexander Graf wrote:
> And that's bad. Jeez, seriously. Don't argue this case. We enable new
> features individually unless we're 100% sure we can keep everything
> working. In this case an ENABLE_CAP doesn't hurt at all, because user
> space still needs to handle the hypercalls if it wants them anyways.
> But you get debugging for free for example.

An ENABLE_CAP is utterly pointless. More bloat. But you seem to like
it :-)

Ben.

2013-07-11 12:40:42

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

On Thu, 2013-07-11 at 13:15 +0200, Alexander Graf wrote:
> There are 2 ways of dealing with this:
>
> 1) Call the ENABLE_CAP on every vcpu. That way one CPU may handle
> this hypercall in the kernel while another one may not. The same as we
> handle PAPR today.
>
> 2) Create a new ENABLE_CAP for the vm.
>
> I think in this case option 1 is fine - it's how we handle everything
> else already.

So, you are now asking him to choose between a gross horror and adding a
new piece of infrastructure for something that is entirely pointless to
begin with ?

Come on, give him a break. That stuff is fine as it is.

Ben.

2013-07-11 12:50:12

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling


On 11.07.2013, at 14:37, Benjamin Herrenschmidt wrote:

> On Thu, 2013-07-11 at 11:52 +0200, Alexander Graf wrote:
>>> Where exactly (it is rather SPAPR_TCE_IOMMU but does not really
>> matter)?
>>> Select it on KVM_BOOK3S_64? CONFIG_KVM_BOOK3S_64_HV?
>>> CONFIG_KVM_BOOK3S_64_PR? PPC_BOOK3S_64?
>>
>> I'd say the most logical choice would be to check the Makefile and see
>> when it gets compiled. For those cases we want it enabled.
>
> What *what* gets compiled ? You know our Makefile, it's crap :-)
>
> We enable built-in things when CONFIG_KVM=m (which means you cannot take
> a kernel build with CONFIG_KVM not set, enable CONFIG_KVM=m, and just
> build the module, it won't work).
>
> We could use KVM_BOOK3S_64 maybe ?

If either a =m or a =y option selects a =y option, it gets selected regardless, no? So it shouldn't really matter where we attach it FWIW.

>
>>> I am trying to imagine a configuration where we really do not want
>>> IOMMU_API. Ben mentioned PPC32 and embedded PPC64 and that's it so
>> any of
>>> BOOK3S (KVM_BOOK3S_64 is the best) should be fine, no?
>>
>> book3s_32 doesn't want this, but any book3s_64 implementation could
>> potentially use it, yes. That's pretty much what the Makefile tells
>> you too :).
>
> Not really, no. But that would do. You could have given a more useful
> answer in the first place, though, rather than stringing him along.

Sorry, I figured it was obvious.


Alex

2013-07-11 12:51:26

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls


On 11.07.2013, at 14:39, Benjamin Herrenschmidt wrote:

> On Thu, 2013-07-11 at 13:15 +0200, Alexander Graf wrote:
>> And that's bad. Jeez, seriously. Don't argue this case. We enable new
>> features individually unless we're 100% sure we can keep everything
>> working. In this case an ENABLE_CAP doesn't hurt at all, because user
>> space still needs to handle the hypercalls if it wants them anyways.
>> But you get debugging for free for example.
>
> An ENABLE_CAP is utterly pointless. More bloat. But you seem to like
> it :-)

I don't like bloat usually. But Alexey even had an #ifdef DEBUG in there to selectively disable in-kernel handling of multi-TCE. Not calling ENABLE_CAP would give him exactly that without ugly #ifdefs in the kernel.


Alex

>
> Ben.
>
>

2013-07-11 12:56:27

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

On 07/11/2013 10:51 PM, Alexander Graf wrote:
>
> On 11.07.2013, at 14:39, Benjamin Herrenschmidt wrote:
>
>> On Thu, 2013-07-11 at 13:15 +0200, Alexander Graf wrote:
>>> And that's bad. Jeez, seriously. Don't argue this case. We enable new
>>> features individually unless we're 100% sure we can keep everything
>>> working. In this case an ENABLE_CAP doesn't hurt at all, because user
>>> space still needs to handle the hypercalls if it wants them anyways.
>>> But you get debugging for free for example.
>>
>> An ENABLE_CAP is utterly pointless. More bloat. But you seem to like
>> it :-)
>

> I don't like bloat usually. But Alexey even had an #ifdef DEBUG in there
> to selectively disable in-kernel handling of multi-TCE. Not calling
> ENABLE_CAP would give him exactly that without ugly #ifdefs in the
> kernel.


No, it would not give me anything. My ugly debug was to disable realmode
only and still leave virtual mode on, not to disable both real and virtual
modes. It is a lot easier to disable in-kernel handling in QEMU.



--
Alexey

2013-07-11 12:56:58

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

On Thu, 2013-07-11 at 14:50 +0200, Alexander Graf wrote:
> > Not really, no. But that would do. You could have given a more useful
> > answer in the first place, though, rather than stringing him along.
>
> Sorry, I figured it was obvious.

It wasn't no, because of the mess with modules and the nasty Makefile we
have in there. Even I had to scratch my head for a bit :-)

Ben.

2013-07-11 12:58:23

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

On Thu, 2013-07-11 at 14:51 +0200, Alexander Graf wrote:
> I don't like bloat usually. But Alexey even had an #ifdef DEBUG in
> there to selectively disable in-kernel handling of multi-TCE. Not
> calling ENABLE_CAP would give him exactly that without ugly #ifdefs in
> the kernel.

I don't see much point in disabling it... but ok, if that's a valuable
feature, then shoot some VM level ENABLE_CAP (please don't iterate all
VCPUs, that's gross).

Ben.

2013-07-11 13:01:15

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

On Thu, 2013-07-11 at 15:12 +1000, Alexey Kardashevskiy wrote:
> >> Any debug code is prohibited? Ok, I'll remove.
> >
> > Debug code that requires code changes is prohibited, yes.
> > Debug code that is runtime switchable (pr_debug, trace points, etc)
> > are allowed.

Bollox.

$ grep DBG\( arch/powerpc/ -r | wc -l
418

Also pr_devel is not runtime switchable in normal kernels either and
still an "official" kernel interface.

> Is there any easy way to enable just this specific udbg_printf (not all of
> them at once)? Trace points do not work in real mode as we figured out.

The cleaner way to do it is to use some kind of local macro that you
enable/disable by changing a #define at the top of the function, possibly
several.
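
A minimal sketch of that kind of local macro, along the lines of the pattern
already common in arch/powerpc (the DEBUG_TCE/DBG names are only an example):

/* flip this one line to get the debug output in this file */
/* #define DEBUG_TCE */

#ifdef DEBUG_TCE
#define DBG(fmt...)	udbg_printf(fmt)
#else
#define DBG(fmt...)	do { } while (0)
#endif

so the call sites read:

	DBG("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]);
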

Cheers,
Ben.

2013-07-11 13:11:26

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls


On 11.07.2013, at 14:33, Benjamin Herrenschmidt wrote:

> On Thu, 2013-07-11 at 15:12 +1000, Alexey Kardashevskiy wrote:
>>>> Any debug code is prohibited? Ok, I'll remove.
>>>
>>> Debug code that requires code changes is prohibited, yes.
>>> Debug code that is runtime switchable (pr_debug, trace points, etc)
>>> are allowed.
>
> Bollox.
>
> $ grep DBG\( arch/powerpc/ -r | wc -l
> 418
>
> Also pr_devel is not runtime switchable in normal kernels either and
> still an "official" kernel interface.
>
>> Is there any easy way to enable just this specific udbg_printf (not all of
>> them at once)? Trace points do not work in real mode as we figured out.
>
> The cleaner way to do it is to use some kind of local macro that you
> enable/disable by changing a #define at the top of the function, possibly
> several.

If you do that in a way that doesn't bitrot, that's acceptable, yes. Something that leaves cpp compile checks working while optimizing out the debug code.
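
One way to get that property is to keep the arguments visible to the compiler
in the "off" case, e.g. (sketch, reusing the DEBUG_TCE/DBG names from the
sketch above):

#ifdef DEBUG_TCE
#define DBG(fmt, ...)	udbg_printf(fmt, ##__VA_ARGS__)
#else
/* no_printk() type-checks the arguments but is optimized away */
#define DBG(fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
#endif
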


Alex

2013-07-11 13:13:14

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

On 07/11/2013 10:58 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2013-07-11 at 14:51 +0200, Alexander Graf wrote:
>> I don't like bloat usually. But Alexey even had an #ifdef DEBUG in
>> there to selectively disable in-kernel handling of multi-TCE. Not
>> calling ENABLE_CAP would give him exactly that without ugly #ifdefs in
>> the kernel.
>
> I don't see much point in disabling it... but ok, if that's a valuable
> feature, then shoot some VM level ENABLE_CAP (please don't iterate all
> VCPUs, that's gross).

No use for me whatsoever as I only want to disable real mode handlers and
keep virtual mode handlers enabled (sometimes, for debug only), and this
capability is not about that - I can easily just not enable it in QEMU with
exactly the same effect.

So please, fellas, decide whether I should iterate vcpu's or add ENABLE_CAP
per KVM. Thanks.


--
Alexey

2013-07-11 13:21:53

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls


On 11.07.2013, at 15:13, Alexey Kardashevskiy wrote:

> On 07/11/2013 10:58 PM, Benjamin Herrenschmidt wrote:
>> On Thu, 2013-07-11 at 14:51 +0200, Alexander Graf wrote:
>>> I don't like bloat usually. But Alexey even had an #ifdef DEBUG in
>>> there to selectively disable in-kernel handling of multi-TCE. Not
>>> calling ENABLE_CAP would give him exactly that without ugly #ifdefs in
>>> the kernel.
>>
>> I don't see much point in disabling it... but ok, if that's a valuable
>> feature, then shoot some VM level ENABLE_CAP (please don't iterate all
>> VCPUs, that's gross).
>
> No use for me whatsoever as I only want to disable real mode handlers and
> keep virtual mode handlers enabled (sometimes, for debug only), and this
> capability is not about that - I can easily just not enable it in QEMU with
> exactly the same effect.
>
> So please, fellas, decide whether I should iterate vcpu's or add ENABLE_CAP
> per KVM. Thanks.

Thinking hard about this it might actually be ok to not have an ENABLE_CAP for this, if kernel code always works properly, because from the guest's point of view nothing changes - it either gets handled by kernel or by user space. And user space either handles it or doesn't, so it's ok.

Just leave it out this time. But be very wary of adding new features without an ENABLE_CAP switch. They might be guest-visible changes.


Alex

2013-07-11 13:41:18

by chandrashekar shastri

[permalink] [raw]
Subject: Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

Hi All,

I compiled the latest kernel 3.10.0+ pulled from git on top of
3.10.0-rc5+ by enabling the new virtualization features. The compilation
was successful, but when I rebooted the machine it failed to boot with the
error "systemd[1]: Failed to mount /dev: no such device".

Is it a problem with the KVM module?

Thanks,
Shastri

On 07/11/2013 06:26 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2013-07-11 at 14:50 +0200, Alexander Graf wrote:
>>> Not really no. But that would do. You could have give a more useful
>>> answer in the first place though rather than stringing him along.
>> Sorry, I figured it was obvious.
> It wasn't no, because of the mess with modules and the nasty Makefile we
> have in there. Even I had to scratch my head for a bit :-)
>
> Ben.
>
>
>

2013-07-11 13:44:11

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling


On 11.07.2013, at 15:41, chandrashekar shastri wrote:

> Hi All,
>
> I compiled the latest kernel 3.10.0+ pulled from git on top of 3.10.0-rc5+ by enabling the new virtualization features. The compilation was successful, but when I rebooted the machine it failed to boot with the error "systemd[1]: Failed to mount /dev: no such device".
>
> Is it a problem with the KVM module?

Very unlikely. You're probably missing generic config options in your .config file. But this is very off topic for a) this thread and b) these mailing lists.


Alex

2013-07-11 13:45:19

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls

On Thu, 2013-07-11 at 12:11 +0200, Alexander Graf wrote:
> > So I must add one more ioctl to enable MULTITCE in kernel handling. Is it
> > what you are saying?
> >
> > I can see KVM_CHECK_EXTENSION but I do not see KVM_ENABLE_EXTENSION or
> > anything like that.
>
> KVM_ENABLE_CAP. It's how we enable sPAPR capabilities too.

But in that case I don't see the point.

Cheers,
Ben.

2013-07-11 13:46:44

by Alexey Kardashevskiy

[permalink] [raw]
Subject: Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

On 07/11/2013 11:41 PM, chandrashekar shastri wrote:
> Hi All,
>
> I compiled the latest kernel 3.10.0+ pulled from git on top of
> 3.10.0-rc5+ by enabling the new virtualization features. The compilation
> was successful, but when I rebooted the machine it failed to boot with the
> error "systemd[1]: Failed to mount /dev: no such device".
>
> Is it a problem with the KVM module?


Wrong thread actually; it would be better if you started a new one.

And you may want to try this - http://patchwork.ozlabs.org/patch/256027/


--
Alexey