2018-06-29 11:36:25

by Suzuki K Poulose

Subject: [PATCH v3 00/20] arm64: Dynamic & 52bit IPA support


The physical address space size for a VM (IPA size) on arm/arm64 is
fixed at a static limit of 40bits. This series adds support for using
an IPA size specific to each VM, up to the limit supported by the host
(based on the host kernel configuration and CPU support). The default
and minimum size remains fixed at 40bits. We also add support for
handling the 52bit IPA addresses introduced by the Arm v8.2 extensions.

As mentioned above, the IPA size supported by a host could be smaller
than the system's PARange indicated by the CPUs (e.g, due to a kernel
limit on the PA size). So we expose the limit via a new system ioctl
request - KVM_ARM_GET_MAX_VM_PHYS_SHIFT - on arm/arm64. This can then
be passed on to the KVM_CREATE_VM ioctl, encoded in the "type" field.
Bits [7:0] of the type are reserved for the IPA size. This approach
allows simpler management of the stage2 page table and guest memory
slots.
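
As a rough sketch of how a VMM would use this (error handling elided;
KVM_VM_TYPE_ARM_PHYS_SHIFT() is the encoding helper added later in
this series):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	int create_vm_with_large_ipa(void)
	{
		int kvm_fd = open("/dev/kvm", O_RDWR);
		/* Query the maximum IPA shift supported by this host */
		int max_shift = ioctl(kvm_fd, KVM_ARM_GET_MAX_VM_PHYS_SHIFT, 0);
		/* Request a 44bit IPA space, falling back to the host limit */
		int ipa_shift = (max_shift >= 44) ? 44 : max_shift;

		return ioctl(kvm_fd, KVM_CREATE_VM,
			     KVM_VM_TYPE_ARM_PHYS_SHIFT(ipa_shift));
	}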

The arm64 page table level helpers are defined based on the page
table levels used by the host VA. So, the accessors may not work if
the guest uses more levels in stage2 than the stage1 of the host.
The previous versions (v1 & v2) of this series refactored the stage1
page table accessors so that the low-level accessors could be reused
for an independent stage2 table. However, due to the level folding in
the generic code, the types are redefined as well. i.e, if the PUD is
folded, pud_t is defined as :

typedef struct { pgd_t pgd; } pud_t;

and similarly for pmd_t. So, without stage1-independent page table
entry types for stage2, we could be dealing with a different type for
level 0-2 entries. This is practically fine on arm/arm64, as the
entries have a similar format and size and we always use the
appropriate accessors to get the raw value (i.e, pud_val/pmd_val
etc.), but it is not ideal for an upstream solution. So, this version
caps the stage2 page table levels at that of the stage1. This has the
following impact on the IPA support for various pagesize/host-va
combinations :


x----------------------------------------------------x
| host\ipa   | 40bit | 42bit | 44bit | 48bit | 52bit |
------------------------------------------------------
| 39bit-4K   |   y   |   y   |   n   |   n   |  n/a  |
------------------------------------------------------
| 48bit-4K   |   y   |   y   |   y   |   y   |  n/a  |
------------------------------------------------------
| 36bit-16K  |   y   |   n   |   n   |   n   |  n/a  |
------------------------------------------------------
| 47bit-16K  |   y   |   y   |   y   |   y   |  n/a  |
------------------------------------------------------
| 48bit-16K  |   y   |   y   |   y   |   y   |  n/a  |
------------------------------------------------------
| 42bit-64K  |   y   |   y   |   y   |   n   |   n   |
------------------------------------------------------
| 48bit-64K  |   y   |   y   |   y   |   y   |   y   |
x----------------------------------------------------x

Or the following list shows what cannot be supported :

39bit-4K host for IPA > 43bit (up to 48bit)
36bit-16K host for IPA > 40bit (up to 48bit)
42bit-64K host for IPA > 46bit (up to 52bit)

which is not too bad. We can pursue the independent stage2
page table support and lift the restriction once we get there.
Given there is a proposal for a new generic page table walker [0],
it would make sense to keep our efforts in sync with it, to avoid
diverging from a common API.

52bit support is added for the VGIC (including the ITS emulation) and
for the handling of the PAR and HPFAR registers.

The series applies on 4.18-rc2. A tree is available here:

git://linux-arm.org/linux-skp.git ipa52/v3

Tested with :
- A modified kvmtool (patches included in the series for
  reference / testing), which can only be used :
  * with virtio-pci up to 44bit PA (due to the 4K page size assumed
    by the virtio-pci legacy implementation in kvmtool)
  * up to 48bit PA with virtio-mmio, due to the 32bit PFN limitation.
- A hacked Qemu (boot loader support for highmem, phys-shift support)
  * with virtio-pci, GIC-v3 ITS & MSI, up to 52bit on the Foundation
    model. Also see [1] for the Qemu support.

[0] https://lkml.org/lkml/2018/4/24/777
[1] https://lists.gnu.org/archive/html/qemu-devel/2018-06/msg05759.html

Changes since V2:
- Drop "refactoring of host page table helpers" and restrict the IPA size
to make sure stage2 doesn't use more page table levels than that of the host.
- Load VTCR for TLB operations on behalf of the VM (Pointed-by: James Morse)
- Split a couple of patches to make them easier to review.
- Fall back to normal (non-concatenated) entry level page table support if
possible.
- Bump the IOCTL number

Changes since V1:
- Change the userspace API for configuring VM to encode the IPA
size in the VM type. (suggested by Christoffer)
- Expose the IPA limit on the host via ioctl on /dev/kvm
- Handle 52bit addresses in PAR & HPFAR
- Drop patch changing the life time of stage2 PGD
- Rename macros for 48-to-52 bit conversion for GIC ITS BASER.
(suggested by Christoffer)
- Split virtio PFN check patches and address comments.

Kristina Martsenko (1):
vgic: Add support for 52bit guest physical address

Suzuki K Poulose (19):
virtio: mmio-v1: Validate queue PFN
virtio: pci-legacy: Validate queue pfn
arm64: Add a helper for PARange to physical shift conversion
kvm: arm64: Clean up VTCR_EL2 initialisation
kvm: arm/arm64: Fix stage2_flush_memslot for 4 level page table
kvm: arm/arm64: Remove spurious WARN_ON
kvm: arm/arm64: Prepare for VM specific stage2 translations
kvm: arm/arm64: Abstract stage2 pgd table allocation
kvm: arm64: Make stage2 page table layout dynamic
kvm: arm64: Dynamic configuration of VTTBR mask
kvm: arm64: Helper for computing VTCR_EL2.SL0
kvm: arm64: Add helper for loading the stage2 setting for a VM
kvm: arm64: Configure VTCR per VM
kvm: arm/arm64: Expose supported physical address limit for VM
kvm: arm/arm64: Allow tuning the physical address size for VM
kvm: arm64: Switch to per VM IPA limit
kvm: arm64: Add support for handling 52bit IPA
kvm: arm64: Allow IPA size supported by the system
kvm: arm64: Fall back to normal stage2 entry level

Documentation/virtual/kvm/api.txt | 15 ++
arch/arm/include/asm/kvm_arm.h | 3 +-
arch/arm/include/asm/kvm_mmu.h | 28 +++-
arch/arm/include/asm/stage2_pgtable.h | 42 ++---
arch/arm64/include/asm/cpufeature.h | 13 ++
arch/arm64/include/asm/kvm_arm.h | 137 ++++++++++++++---
arch/arm64/include/asm/kvm_asm.h | 2 +-
arch/arm64/include/asm/kvm_host.h | 19 ++-
arch/arm64/include/asm/kvm_hyp.h | 16 ++
arch/arm64/include/asm/kvm_mmu.h | 92 ++++++++++-
arch/arm64/include/asm/stage2_pgtable-nopmd.h | 42 -----
arch/arm64/include/asm/stage2_pgtable-nopud.h | 39 -----
arch/arm64/include/asm/stage2_pgtable.h | 213 +++++++++++++++++++-------
arch/arm64/kvm/guest.c | 42 +++++
arch/arm64/kvm/hyp/s2-setup.c | 37 +----
arch/arm64/kvm/hyp/switch.c | 4 +-
arch/arm64/kvm/hyp/tlb.c | 4 +-
drivers/virtio/virtio_mmio.c | 18 ++-
drivers/virtio/virtio_pci_legacy.c | 12 +-
include/linux/irqchip/arm-gic-v3.h | 5 +
include/uapi/linux/kvm.h | 16 ++
virt/kvm/arm/arm.c | 32 +++-
virt/kvm/arm/mmu.c | 124 ++++++++-------
virt/kvm/arm/vgic/vgic-its.c | 36 ++---
virt/kvm/arm/vgic/vgic-kvm-device.c | 2 +-
virt/kvm/arm/vgic/vgic-mmio-v3.c | 2 -
26 files changed, 663 insertions(+), 332 deletions(-)
delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopmd.h
delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopud.h


kvmtool patches :

Suzuki K Poulose (4):
kvmtool: Allow backends to run checks on the KVM device fd
kvmtool: arm64: Add support for guest physical address size
kvmtool: arm64: Switch memory layout
kvmtool: arm: Add support for creating VM with PA size

arm/aarch32/include/kvm/kvm-arch.h | 6 ++++--
arm/aarch64/include/kvm/kvm-arch.h | 15 ++++++++++++---
arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
arm/include/arm-common/kvm-arch.h | 17 +++++++++++------
arm/include/arm-common/kvm-config-arch.h | 1 +
arm/kvm.c | 24 +++++++++++++++++++++++-
include/kvm/kvm.h | 4 ++++
kvm.c | 2 ++
8 files changed, 61 insertions(+), 13 deletions(-)

--
2.7.4



2018-06-29 11:25:20

by Suzuki K Poulose

Subject: [kvmtool test PATCH 22/24] kvmtool: arm64: Add support for guest physical address size

Add an option to specify the physical address size used by this
VM.
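
For example (hypothetical invocation), to run a guest with a 44bit
physical address space:

	lkvm run -k Image -m 2048 --phys-shift 44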

Signed-off-by: Suzuki K Poulose <[email protected]>
---
arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
arm/include/arm-common/kvm-config-arch.h | 1 +
2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
index 04be43d..dabd22c 100644
--- a/arm/aarch64/include/kvm/kvm-config-arch.h
+++ b/arm/aarch64/include/kvm/kvm-config-arch.h
@@ -8,7 +8,10 @@
"Create PMUv3 device"), \
OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed, \
"Specify random seed for Kernel Address Space " \
- "Layout Randomization (KASLR)"),
+ "Layout Randomization (KASLR)"), \
+ OPT_INTEGER('\0', "phys-shift", &(cfg)->phys_shift, \
+ "Specify maximum physical address size (not " \
+ "the amount of memory)"),

#include "arm-common/kvm-config-arch.h"

diff --git a/arm/include/arm-common/kvm-config-arch.h b/arm/include/arm-common/kvm-config-arch.h
index 6a196f1..e0b531e 100644
--- a/arm/include/arm-common/kvm-config-arch.h
+++ b/arm/include/arm-common/kvm-config-arch.h
@@ -11,6 +11,7 @@ struct kvm_config_arch {
bool has_pmuv3;
u64 kaslr_seed;
enum irqchip_type irqchip;
+ int phys_shift;
};

int irqchip_parser(const struct option *opt, const char *arg, int unset);
--
2.7.4


2018-06-29 11:25:40

by Suzuki K Poulose

Subject: [kvmtool test PATCH 21/24] kvmtool: Allow backends to run checks on the KVM device fd

Allow architectures to perform initialisation based on the
KVM device fd ioctls, even before the VM is created.
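
As a sketch of how an architecture might override the hook (the
actual arm implementation comes later in this series; the ioctl shown
is the one added on the kernel side):

	/* In the arch's kvm-arch.h, replacing the default no-op */
	static inline void kvm__arch_init_hyp(struct kvm *kvm)
	{
		/* Query the host IPA limit before KVM_CREATE_VM runs */
		int max_shift = ioctl(kvm->sys_fd, KVM_ARM_GET_MAX_VM_PHYS_SHIFT, 0);

		if (max_shift < 0)
			die("KVM_ARM_GET_MAX_VM_PHYS_SHIFT");
	}
	#define kvm__arch_init_hyp kvm__arch_init_hyp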

Signed-off-by: Suzuki K Poulose <[email protected]>
---
include/kvm/kvm.h | 4 ++++
kvm.c | 2 ++
2 files changed, 6 insertions(+)

diff --git a/include/kvm/kvm.h b/include/kvm/kvm.h
index 90463b8..a036dd2 100644
--- a/include/kvm/kvm.h
+++ b/include/kvm/kvm.h
@@ -103,6 +103,10 @@ int kvm__get_sock_by_instance(const char *name);
int kvm__enumerate_instances(int (*callback)(const char *name, int pid));
void kvm__remove_socket(const char *name);

+#ifndef kvm__arch_init_hyp
+static inline void kvm__arch_init_hyp(struct kvm *kvm) {}
+#endif
+
void kvm__arch_set_cmdline(char *cmdline, bool video);
void kvm__arch_init(struct kvm *kvm, const char *hugetlbfs_path, u64 ram_size);
void kvm__arch_delete_ram(struct kvm *kvm);
diff --git a/kvm.c b/kvm.c
index f8f2fdc..b992e74 100644
--- a/kvm.c
+++ b/kvm.c
@@ -304,6 +304,8 @@ int kvm__init(struct kvm *kvm)
goto err_sys_fd;
}

+ kvm__arch_init_hyp(kvm);
+
kvm->vm_fd = ioctl(kvm->sys_fd, KVM_CREATE_VM, KVM_VM_TYPE);
if (kvm->vm_fd < 0) {
pr_err("KVM_CREATE_VM ioctl");
--
2.7.4


2018-06-29 11:25:47

by Suzuki K Poulose

Subject: [PATCH v3 02/20] virtio: pci-legacy: Validate queue pfn

Legacy PCI over virtio uses a 32bit PFN for the queue. If the
queue pfn is too large to fit in 32bits, which we could hit on
arm64 systems with 52bit physical addresses (even with 64K page
size), we silently lose the upper bits and no longer have a
proper link to the other side of the queue.

Add a check to validate the PFN, rather than silently breaking
the devices.
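
To put numbers on it: with the 4K queue alignment
(VIRTIO_PCI_QUEUE_ADDR_SHIFT = 12), a queue allocated at, say,
physical address 1ULL << 48 yields a PFN of 1ULL << 36, which needs
37 bits; the iowrite32() to VIRTIO_PCI_QUEUE_PFN would silently
truncate it to its lower 32 bits.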

Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Cc: Peter Maydell <[email protected]>
Cc: Jean-Philippe Brucker <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
Changes since v2:
- Change errno to -E2BIG
---
drivers/virtio/virtio_pci_legacy.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c
index 2780886..c0d6987a 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -122,6 +122,7 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
struct virtqueue *vq;
u16 num;
int err;
+ u64 q_pfn;

/* Select the queue we're interested in */
iowrite16(index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
@@ -141,9 +142,15 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
if (!vq)
return ERR_PTR(-ENOMEM);

+ q_pfn = virtqueue_get_desc_addr(vq) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT;
+ if (q_pfn >> 32) {
+ dev_err(&vp_dev->pci_dev->dev, "virtio-pci queue PFN too large\n");
+ err = -E2BIG;
+ goto out_del_vq;
+ }
+
/* activate the queue */
- iowrite32(virtqueue_get_desc_addr(vq) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
- vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
+ iowrite32(q_pfn, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);

vq->priv = (void __force *)vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY;

@@ -160,6 +167,7 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,

out_deactivate:
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
+out_del_vq:
vring_del_virtqueue(vq);
return ERR_PTR(err);
}
--
2.7.4


2018-06-29 11:27:23

by Suzuki K Poulose

Subject: [PATCH v3 12/20] kvm: arm64: Add helper for loading the stage2 setting for a VM

We load the stage2 context of a guest for different operations,
including running the guest and tlb maintenance on behalf of the
guest. As of now, only the vttbr is private to the guest, but this
is about to change with the per-VM IPA. Add a helper to load the
stage2 configuration for a VM, which can then do the right thing
with the coming changes.

Cc: Christoffer Dall <[email protected]>
Cc: Marc Zyngier <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
Changes since v2:
- New patch
---
arch/arm64/include/asm/kvm_hyp.h | 6 ++++++
arch/arm64/kvm/hyp/switch.c | 2 +-
arch/arm64/kvm/hyp/tlb.c | 4 ++--
3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index 384c343..82f9994 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -155,5 +155,11 @@ void deactivate_traps_vhe_put(void);
u64 __guest_enter(struct kvm_vcpu *vcpu, struct kvm_cpu_context *host_ctxt);
void __noreturn __hyp_do_panic(unsigned long, ...);

+/* Must be called from hyp code running at EL2 */
+static __always_inline void __hyp_text __load_guest_stage2(struct kvm *kvm)
+{
+ write_sysreg(kvm->arch.vttbr, vttbr_el2);
+}
+
#endif /* __ARM64_KVM_HYP_H__ */

diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
index d496ef5..355fb25 100644
--- a/arch/arm64/kvm/hyp/switch.c
+++ b/arch/arm64/kvm/hyp/switch.c
@@ -195,7 +195,7 @@ void deactivate_traps_vhe_put(void)

static void __hyp_text __activate_vm(struct kvm *kvm)
{
- write_sysreg(kvm->arch.vttbr, vttbr_el2);
+ __load_guest_stage2(kvm);
}

static void __hyp_text __deactivate_vm(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/kvm/hyp/tlb.c b/arch/arm64/kvm/hyp/tlb.c
index 131c777..4dbd9c6 100644
--- a/arch/arm64/kvm/hyp/tlb.c
+++ b/arch/arm64/kvm/hyp/tlb.c
@@ -30,7 +30,7 @@ static void __hyp_text __tlb_switch_to_guest_vhe(struct kvm *kvm)
* bits. Changing E2H is impossible (goodbye TTBR1_EL2), so
* let's flip TGE before executing the TLB operation.
*/
- write_sysreg(kvm->arch.vttbr, vttbr_el2);
+ __load_guest_stage2(kvm);
val = read_sysreg(hcr_el2);
val &= ~HCR_TGE;
write_sysreg(val, hcr_el2);
@@ -39,7 +39,7 @@ static void __hyp_text __tlb_switch_to_guest_vhe(struct kvm *kvm)

static void __hyp_text __tlb_switch_to_guest_nvhe(struct kvm *kvm)
{
- write_sysreg(kvm->arch.vttbr, vttbr_el2);
+ __load_guest_stage2(kvm);
isb();
}

--
2.7.4


2018-06-29 11:27:27

by Suzuki K Poulose

Subject: [PATCH v3 17/20] vgic: Add support for 52bit guest physical address

From: Kristina Martsenko <[email protected]>

Add support for handling 52bit guest physical addresses in the
VGIC layer. So far we have limited the guest physical address
to 48bits, by explicitly masking the upper bits. This patch
removes the restriction. We do not have to check if the host
supports 52bit, as the gpa is always validated during an access
(e.g, kvm_{read/write}_guest, kvm_is_visible_gfn()).
The ITS table save-restore is also unaffected by the enhancement,
as the DTE entries already store bits[51:8] of the ITT_addr
(with a 256byte alignment).
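
As a worked example of the 48-to-52 bit swizzle handled by the new
macros below: GITS_BASER keeps PA bits [47:16] in place and stores
PA bits [51:48] in register bits [15:12]. So for a table at physical
address 0x9_8765_4321_0000, GITS_BASER_PHYS_52_to_48() yields the
register value 0x8765_4321_9000, and GITS_BASER_ADDR_48_to_52()
recovers the original address from it.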

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Kristina Martsenko <[email protected]>
[ Macro clean ups, fix PROPBASER and PENDBASER accesses ]
Signed-off-by: Suzuki K Poulose <[email protected]>
---
include/linux/irqchip/arm-gic-v3.h | 5 +++++
virt/kvm/arm/vgic/vgic-its.c | 36 ++++++++++--------------------------
virt/kvm/arm/vgic/vgic-mmio-v3.c | 2 --
3 files changed, 15 insertions(+), 28 deletions(-)

diff --git a/include/linux/irqchip/arm-gic-v3.h b/include/linux/irqchip/arm-gic-v3.h
index cbb872c..bc4b95b 100644
--- a/include/linux/irqchip/arm-gic-v3.h
+++ b/include/linux/irqchip/arm-gic-v3.h
@@ -346,6 +346,8 @@
#define GITS_CBASER_RaWaWt GIC_BASER_CACHEABILITY(GITS_CBASER, INNER, RaWaWt)
#define GITS_CBASER_RaWaWb GIC_BASER_CACHEABILITY(GITS_CBASER, INNER, RaWaWb)

+#define GITS_CBASER_ADDRESS(cbaser) ((cbaser) & GENMASK_ULL(52, 12))
+
#define GITS_BASER_NR_REGS 8

#define GITS_BASER_VALID (1ULL << 63)
@@ -377,6 +379,9 @@
#define GITS_BASER_ENTRY_SIZE_MASK GENMASK_ULL(52, 48)
#define GITS_BASER_PHYS_52_to_48(phys) \
(((phys) & GENMASK_ULL(47, 16)) | (((phys) >> 48) & 0xf) << 12)
+#define GITS_BASER_ADDR_48_to_52(baser) \
+ (((baser) & GENMASK_ULL(47, 16)) | (((baser) >> 12) & 0xf) << 48)
+
#define GITS_BASER_SHAREABILITY_SHIFT (10)
#define GITS_BASER_InnerShareable \
GIC_BASER_SHAREABILITY(GITS_BASER, InnerShareable)
diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c
index 4ed79c9..c6eb390 100644
--- a/virt/kvm/arm/vgic/vgic-its.c
+++ b/virt/kvm/arm/vgic/vgic-its.c
@@ -234,13 +234,6 @@ static struct its_ite *find_ite(struct vgic_its *its, u32 device_id,
list_for_each_entry(dev, &(its)->device_list, dev_list) \
list_for_each_entry(ite, &(dev)->itt_head, ite_list)

-/*
- * We only implement 48 bits of PA at the moment, although the ITS
- * supports more. Let's be restrictive here.
- */
-#define BASER_ADDRESS(x) ((x) & GENMASK_ULL(47, 16))
-#define CBASER_ADDRESS(x) ((x) & GENMASK_ULL(47, 12))
-
#define GIC_LPI_OFFSET 8192

#define VITS_TYPER_IDBITS 16
@@ -752,6 +745,7 @@ static bool vgic_its_check_id(struct vgic_its *its, u64 baser, u32 id,
{
int l1_tbl_size = GITS_BASER_NR_PAGES(baser) * SZ_64K;
u64 indirect_ptr, type = GITS_BASER_TYPE(baser);
+ phys_addr_t base = GITS_BASER_ADDR_48_to_52(baser);
int esz = GITS_BASER_ENTRY_SIZE(baser);
int index;
gfn_t gfn;
@@ -776,7 +770,7 @@ static bool vgic_its_check_id(struct vgic_its *its, u64 baser, u32 id,
if (id >= (l1_tbl_size / esz))
return false;

- addr = BASER_ADDRESS(baser) + id * esz;
+ addr = base + id * esz;
gfn = addr >> PAGE_SHIFT;

if (eaddr)
@@ -791,7 +785,7 @@ static bool vgic_its_check_id(struct vgic_its *its, u64 baser, u32 id,

/* Each 1st level entry is represented by a 64-bit value. */
if (kvm_read_guest_lock(its->dev->kvm,
- BASER_ADDRESS(baser) + index * sizeof(indirect_ptr),
+ base + index * sizeof(indirect_ptr),
&indirect_ptr, sizeof(indirect_ptr)))
return false;

@@ -801,11 +795,7 @@ static bool vgic_its_check_id(struct vgic_its *its, u64 baser, u32 id,
if (!(indirect_ptr & BIT_ULL(63)))
return false;

- /*
- * Mask the guest physical address and calculate the frame number.
- * Any address beyond our supported 48 bits of PA will be caught
- * by the actual check in the final step.
- */
+ /* Mask the guest physical address and calculate the frame number. */
indirect_ptr &= GENMASK_ULL(51, 16);

/* Find the address of the actual entry */
@@ -1297,9 +1287,6 @@ static u64 vgic_sanitise_its_baser(u64 reg)
GITS_BASER_OUTER_CACHEABILITY_SHIFT,
vgic_sanitise_outer_cacheability);

- /* Bits 15:12 contain bits 51:48 of the PA, which we don't support. */
- reg &= ~GENMASK_ULL(15, 12);
-
/* We support only one (ITS) page size: 64K */
reg = (reg & ~GITS_BASER_PAGE_SIZE_MASK) | GITS_BASER_PAGE_SIZE_64K;

@@ -1318,11 +1305,8 @@ static u64 vgic_sanitise_its_cbaser(u64 reg)
GITS_CBASER_OUTER_CACHEABILITY_SHIFT,
vgic_sanitise_outer_cacheability);

- /*
- * Sanitise the physical address to be 64k aligned.
- * Also limit the physical addresses to 48 bits.
- */
- reg &= ~(GENMASK_ULL(51, 48) | GENMASK_ULL(15, 12));
+ /* Sanitise the physical address to be 64k aligned. */
+ reg &= ~GENMASK_ULL(15, 12);

return reg;
}
@@ -1368,7 +1352,7 @@ static void vgic_its_process_commands(struct kvm *kvm, struct vgic_its *its)
if (!its->enabled)
return;

- cbaser = CBASER_ADDRESS(its->cbaser);
+ cbaser = GITS_CBASER_ADDRESS(its->cbaser);

while (its->cwriter != its->creadr) {
int ret = kvm_read_guest_lock(kvm, cbaser + its->creadr,
@@ -2226,7 +2210,7 @@ static int vgic_its_restore_device_tables(struct vgic_its *its)
if (!(baser & GITS_BASER_VALID))
return 0;

- l1_gpa = BASER_ADDRESS(baser);
+ l1_gpa = GITS_BASER_ADDR_48_to_52(baser);

if (baser & GITS_BASER_INDIRECT) {
l1_esz = GITS_LVL1_ENTRY_SIZE;
@@ -2298,7 +2282,7 @@ static int vgic_its_save_collection_table(struct vgic_its *its)
{
const struct vgic_its_abi *abi = vgic_its_get_abi(its);
u64 baser = its->baser_coll_table;
- gpa_t gpa = BASER_ADDRESS(baser);
+ gpa_t gpa = GITS_BASER_ADDR_48_to_52(baser);
struct its_collection *collection;
u64 val;
size_t max_size, filled = 0;
@@ -2347,7 +2331,7 @@ static int vgic_its_restore_collection_table(struct vgic_its *its)
if (!(baser & GITS_BASER_VALID))
return 0;

- gpa = BASER_ADDRESS(baser);
+ gpa = GITS_BASER_ADDR_48_to_52(baser);

max_size = GITS_BASER_NR_PAGES(baser) * SZ_64K;

diff --git a/virt/kvm/arm/vgic/vgic-mmio-v3.c b/virt/kvm/arm/vgic/vgic-mmio-v3.c
index 2877840..64647be 100644
--- a/virt/kvm/arm/vgic/vgic-mmio-v3.c
+++ b/virt/kvm/arm/vgic/vgic-mmio-v3.c
@@ -338,7 +338,6 @@ static u64 vgic_sanitise_pendbaser(u64 reg)
vgic_sanitise_outer_cacheability);

reg &= ~PENDBASER_RES0_MASK;
- reg &= ~GENMASK_ULL(51, 48);

return reg;
}
@@ -356,7 +355,6 @@ static u64 vgic_sanitise_propbaser(u64 reg)
vgic_sanitise_outer_cacheability);

reg &= ~PROPBASER_RES0_MASK;
- reg &= ~GENMASK_ULL(51, 48);
return reg;
}

--
2.7.4


2018-06-29 11:27:52

by Suzuki K Poulose

Subject: [PATCH v3 13/20] kvm: arm64: Configure VTCR per VM

We set VTCR_EL2 very early during the stage2 init and never
touch it again. This was fine as we had a fixed IPA size. This
patch changes the behavior to set the VTCR for a given VM,
depending on its stage2 table. The common configuration for
VTCR is still performed during the early init, as we have to
retain the hardware access flag update bits (VTCR_EL2_HA)
per CPU (as they are only set for the CPUs which are capable).
The bits defining the number of levels in the page table (SL0)
and the size of the input address to the translation (T0SZ)
are programmed for each VM upon entry to the guest.
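
As a worked example (assuming the SL0 helper added earlier in this
series computes SL0 as VTCR_EL2_TGRAN_SL0_BASE - (4 - levels)): a
40bit IPA VM with 4K pages uses 3 levels of stage2 table, so on
guest entry we program T0SZ = 64 - 40 = 24 and SL0 = 2 - (4 - 3) = 1,
i.e, exactly the static VTCR_EL2_T0SZ_40B/VTCR_EL2_SL0_LVL1 values
that this patch removes.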

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
Changes since V2:
- Load VTCR for TLB operations
---
arch/arm64/include/asm/kvm_arm.h | 19 +++++++++----------
arch/arm64/include/asm/kvm_asm.h | 2 +-
arch/arm64/include/asm/kvm_host.h | 9 ++++++---
arch/arm64/include/asm/kvm_hyp.h | 11 +++++++++++
arch/arm64/kvm/hyp/s2-setup.c | 17 +----------------
5 files changed, 28 insertions(+), 30 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 11a7db0..b02c316 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -120,9 +120,7 @@
#define VTCR_EL2_IRGN0_WBWA TCR_IRGN0_WBWA
#define VTCR_EL2_SL0_SHIFT 6
#define VTCR_EL2_SL0_MASK (3 << VTCR_EL2_SL0_SHIFT)
-#define VTCR_EL2_SL0_LVL1 (1 << VTCR_EL2_SL0_SHIFT)
#define VTCR_EL2_T0SZ_MASK 0x3f
-#define VTCR_EL2_T0SZ_40B 24
#define VTCR_EL2_VS_SHIFT 19
#define VTCR_EL2_VS_8BIT (0 << VTCR_EL2_VS_SHIFT)
#define VTCR_EL2_VS_16BIT (1 << VTCR_EL2_VS_SHIFT)
@@ -137,43 +135,44 @@
* VTCR_EL2.PS is extracted from ID_AA64MMFR0_EL1.PARange at boot time
* (see hyp-init.S).
*
+ * VTCR_EL2.SL0 and T0SZ are configured per VM at runtime before switching to
+ * the VM.
+ *
* Note that when using 4K pages, we concatenate two first level page tables
* together. With 16K pages, we concatenate 16 first level page tables.
*
*/

-#define VTCR_EL2_T0SZ_IPA VTCR_EL2_T0SZ_40B
#define VTCR_EL2_COMMON_BITS (VTCR_EL2_SH0_INNER | VTCR_EL2_ORGN0_WBWA | \
VTCR_EL2_IRGN0_WBWA | VTCR_EL2_RES1)
+#define VTCR_EL2_PRIVATE_MASK (VTCR_EL2_SL0_MASK | VTCR_EL2_T0SZ_MASK)

#ifdef CONFIG_ARM64_64K_PAGES
/*
* Stage2 translation configuration:
* 64kB pages (TG0 = 1)
- * 2 level page tables (SL = 1)
*/
-#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
+#define VTCR_EL2_TGRAN VTCR_EL2_TG0_64K
#define VTCR_EL2_TGRAN_SL0_BASE 3UL

#elif defined(CONFIG_ARM64_16K_PAGES)
/*
* Stage2 translation configuration:
* 16kB pages (TG0 = 2)
- * 2 level page tables (SL = 1)
*/
-#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_16K | VTCR_EL2_SL0_LVL1)
+#define VTCR_EL2_TGRAN VTCR_EL2_TG0_16K
#define VTCR_EL2_TGRAN_SL0_BASE 3UL
#else /* 4K */
/*
* Stage2 translation configuration:
* 4kB pages (TG0 = 0)
- * 3 level page tables (SL = 1)
*/
-#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_4K | VTCR_EL2_SL0_LVL1)
+#define VTCR_EL2_TGRAN VTCR_EL2_TG0_4K
#define VTCR_EL2_TGRAN_SL0_BASE 2UL
#endif

-#define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN_FLAGS)
+#define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN)
+
/*
* VTCR_EL2:SL0 indicates the entry level for Stage2 translation.
* Interestingly, it depends on the page size.
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 102b5a5..91372eb 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -72,7 +72,7 @@ extern void __vgic_v3_init_lrs(void);

extern u32 __kvm_get_mdcr_el2(void);

-extern u32 __init_stage2_translation(void);
+extern void __init_stage2_translation(void);

/* Home-grown __this_cpu_{ptr,read} variants that always work at HYP */
#define __hyp_this_cpu_ptr(sym) \
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index fe8777b..328f472 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -442,10 +442,13 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,

static inline void __cpu_init_stage2(void)
{
- u32 parange = kvm_call_hyp(__init_stage2_translation);
+ u32 ps;

- WARN_ONCE(parange < 40,
- "PARange is %d bits, unsupported configuration!", parange);
+ kvm_call_hyp(__init_stage2_translation);
+ /* Sanity check for minimum IPA size support */
+ ps = id_aa64mmfr0_parange_to_phys_shift(read_sysreg(id_aa64mmfr0_el1) & 0x7);
+ WARN_ONCE(ps < 40,
+ "PARange is %d bits, unsupported configuration!", ps);
}

/* Guest/host FPSIMD coordination helpers */
diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index 82f9994..3e8052d1 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -20,6 +20,7 @@

#include <linux/compiler.h>
#include <linux/kvm_host.h>
+#include <asm/kvm_mmu.h>
#include <asm/sysreg.h>

#define __hyp_text __section(.hyp.text) notrace
@@ -158,6 +159,16 @@ void __noreturn __hyp_do_panic(unsigned long, ...);
/* Must be called from hyp code running at EL2 */
static __always_inline void __hyp_text __load_guest_stage2(struct kvm *kvm)
{
+ /*
+ * Configure the VTCR translation control bits
+ * for this VM.
+ */
+ u64 vtcr = read_sysreg(vtcr_el2);
+
+ vtcr &= ~VTCR_EL2_PRIVATE_MASK;
+ vtcr |= VTCR_EL2_SL0(kvm_stage2_levels(kvm)) |
+ VTCR_EL2_T0SZ(kvm_phys_shift(kvm));
+ write_sysreg(vtcr, vtcr_el2);
write_sysreg(kvm->arch.vttbr, vttbr_el2);
}

diff --git a/arch/arm64/kvm/hyp/s2-setup.c b/arch/arm64/kvm/hyp/s2-setup.c
index 81094f1..6567315 100644
--- a/arch/arm64/kvm/hyp/s2-setup.c
+++ b/arch/arm64/kvm/hyp/s2-setup.c
@@ -19,13 +19,11 @@
#include <asm/kvm_arm.h>
#include <asm/kvm_asm.h>
#include <asm/kvm_hyp.h>
-#include <asm/cpufeature.h>

-u32 __hyp_text __init_stage2_translation(void)
+void __hyp_text __init_stage2_translation(void)
{
u64 val = VTCR_EL2_FLAGS;
u64 parange;
- u32 phys_shift;
u64 tmp;

/*
@@ -38,17 +36,6 @@ u32 __hyp_text __init_stage2_translation(void)
parange = ID_AA64MMFR0_PARANGE_MAX;
val |= parange << VTCR_EL2_PS_SHIFT;

- /* Compute the actual PARange... */
- phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange);
-
- /*
- * ... and clamp it to 40 bits, unless we have some braindead
- * HW that implements less than that. In all cases, we'll
- * return that value for the rest of the kernel to decide what
- * to do.
- */
- val |= VTCR_EL2_T0SZ(phys_shift > 40 ? 40 : phys_shift);
-
/*
* Check the availability of Hardware Access Flag / Dirty Bit
* Management in ID_AA64MMFR1_EL1 and enable the feature in VTCR_EL2.
@@ -67,6 +54,4 @@ u32 __hyp_text __init_stage2_translation(void)
VTCR_EL2_VS_8BIT;

write_sysreg(val, vtcr_el2);
-
- return phys_shift;
}
--
2.7.4


2018-06-29 11:28:28

by Suzuki K Poulose

Subject: [PATCH v3 10/20] kvm: arm64: Dynamic configuration of VTTBR mask

On arm64, VTTBR_EL2:BADDR holds the base address for the stage2
translation table. The Arm ARM mandates that the bits BADDR[x-1:0]
should be 0, where 'x' is defined for a given IPA size and the
number of levels for a translation granule size, using some magic
constants. This patch is a reverse engineered implementation to
calculate 'x' at runtime for a given IPA and number of page table
levels. See the patch for more details.
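
As a sanity check of the runtime formula against the old magic
numbers: for the current fixed configuration (40bit IPA, 4K pages,
3 levels of stage2), x = 40 - 3 * (12 - 3) = 13, which matches the
old static value Magic_N - T0SZ = 37 - 24 = 13.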

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
Changes since V2:
- Part 1 of spilt from VTCR & VTTBR dynamic configuration
---
arch/arm64/include/asm/kvm_arm.h | 60 +++++++++++++++++++++++++++++++++++++---
arch/arm64/include/asm/kvm_mmu.h | 25 ++++++++++++++++-
2 files changed, 80 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 3dffd38..c557f45 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -140,8 +140,6 @@
* Note that when using 4K pages, we concatenate two first level page tables
* together. With 16K pages, we concatenate 16 first level page tables.
*
- * The magic numbers used for VTTBR_X in this patch can be found in Tables
- * D4-23 and D4-25 in ARM DDI 0487A.b.
*/

#define VTCR_EL2_T0SZ_IPA VTCR_EL2_T0SZ_40B
@@ -175,9 +173,63 @@
#endif

#define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN_FLAGS)
-#define VTTBR_X (VTTBR_X_TGRAN_MAGIC - VTCR_EL2_T0SZ_IPA)
+/*
+ * ARM VMSAv8-64 defines an algorithm for finding the translation table
+ * descriptors in section D4.2.8 in ARM DDI 0487B.b.
+ *
+ * The algorithm defines the expectations on the BaseAddress (for the page
+ * table) bits resolved at each level based on the page size, entry level
+ * and T0SZ. The variable "x" in the algorithm also affects the VTTBR:BADDR
+ * for stage2 page table.
+ *
+ * The value of "x" is calculated as :
+ * x = Magic_N - T0SZ
+ *
+ * where Magic_N is an integer depending on the page size and the entry
+ * level of the page table as below:
+ *
+ * --------------------------------------------
+ * | Entry level | 4K 16K 64K |
+ * --------------------------------------------
+ * | Level: 0 (4 levels) | 28 | - | - |
+ * --------------------------------------------
+ * | Level: 1 (3 levels) | 37 | 31 | 25 |
+ * --------------------------------------------
+ * | Level: 2 (2 levels) | 46 | 42 | 38 |
+ * --------------------------------------------
+ * | Level: 3 (1 level) | - | 53 | 51 |
+ * --------------------------------------------
+ *
+ * We have a magic formula for the Magic_N below.
+ *
+ * Magic_N(PAGE_SIZE, Entry_Level) = 64 - ((PAGE_SHIFT - 3) * Number of levels)
+ *
+ * where number of levels = (4 - Entry_Level).
+ *
+ * So, given that T0SZ = (64 - PA_SHIFT), we can compute 'x' as follows:
+ *
+ * x = (64 - ((PAGE_SHIFT - 3) * Number_of_levels)) - (64 - PA_SHIFT)
+ * = PA_SHIFT - ((PAGE_SHIFT - 3) * Number of levels)
+ *
+ * Here is one way to explain the Magic Formula:
+ *
+ * x = log2(Size_of_Entry_Level_Table)
+ *
+ * Since, we can resolve (PAGE_SHIFT - 3) bits at each level, and another
+ * PAGE_SHIFT bits in the PTE, we have :
+ *
+ * Bits_Entry_level = PA_SHIFT - ((PAGE_SHIFT - 3) * (n - 1) + PAGE_SHIFT)
+ * = PA_SHIFT - (PAGE_SHIFT - 3) * n - 3
+ * where n = number of levels, and since each pointer is 8bytes, we have:
+ *
+ * x = Bits_Entry_Level + 3
+ * = PA_SHIFT - (PAGE_SHIFT - 3) * n
+ *
+ * The only constraint here is that, we have to find the number of page table
+ * levels for a given IPA size (which we do, see stage2_pt_levels())
+ */
+#define ARM64_VTTBR_X(ipa, levels) ((ipa) - ((levels) * (PAGE_SHIFT - 3)))

-#define VTTBR_BADDR_MASK (((UL(1) << (PHYS_MASK_SHIFT - VTTBR_X)) - 1) << VTTBR_X)
#define VTTBR_VMID_SHIFT (UL(48))
#define VTTBR_VMID_MASK(size) (_AT(u64, (1 << size) - 1) << VTTBR_VMID_SHIFT)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index a351722..813a72a 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -146,7 +146,6 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
#define kvm_phys_shift(kvm) KVM_PHYS_SHIFT
#define kvm_phys_size(kvm) (_AC(1, ULL) << kvm_phys_shift(kvm))
#define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - _AC(1, ULL))
-#define kvm_vttbr_baddr_mask(kvm) VTTBR_BADDR_MASK

static inline bool kvm_page_empty(void *ptr)
{
@@ -503,6 +502,30 @@ static inline int hyp_map_aux_data(void)

#define kvm_phys_to_vttbr(addr) phys_to_ttbr(addr)

+/*
+ * Get the magic number 'x' for VTTBR:BADDR of this KVM instance.
+ * With v8.2 LVA extensions, 'x' should be a minimum of 6 with
+ * 52bit IPS.
+ */
+static inline int arm64_vttbr_x(u32 ipa_shift, u32 levels)
+{
+ int x = ARM64_VTTBR_X(ipa_shift, levels);
+
+ return (IS_ENABLED(CONFIG_ARM64_PA_BITS_52) && x < 6) ? 6 : x;
+}
+
+static inline u64 vttbr_baddr_mask(u32 ipa_shift, u32 levels)
+{
+ unsigned int x = arm64_vttbr_x(ipa_shift, levels);
+
+ return GENMASK_ULL(PHYS_MASK_SHIFT - 1, x);
+}
+
+static inline u64 kvm_vttbr_baddr_mask(struct kvm *kvm)
+{
+ return vttbr_baddr_mask(kvm_phys_shift(kvm), kvm_stage2_levels(kvm));
+}
+
static inline void *stage2_alloc_pgd(struct kvm *kvm)
{
return alloc_pages_exact(stage2_pgd_size(kvm),
--
2.7.4


2018-06-29 11:28:56

by Suzuki K Poulose

Subject: [PATCH v3 08/20] kvm: arm/arm64: Abstract stage2 pgd table allocation

Abstract the allocation of stage2 entry level tables for a
given VM, so that later we can choose to fall back to the
normal page table levels (i.e, avoid entry level table
concatenation) on arm64.

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
Changes since V2:
- New patch
---
arch/arm/include/asm/kvm_mmu.h | 6 ++++++
arch/arm64/include/asm/kvm_mmu.h | 6 ++++++
virt/kvm/arm/mmu.c | 2 +-
3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index f36eb20..b2da5a4 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -372,6 +372,12 @@ static inline int hyp_map_aux_data(void)
return 0;
}

+static inline void *stage2_alloc_pgd(struct kvm *kvm)
+{
+ return alloc_pages_exact(stage2_pgd_size(kvm),
+ GFP_KERNEL | __GFP_ZERO);
+}
+
#define kvm_phys_to_vttbr(addr) (addr)

#endif /* !__ASSEMBLY__ */
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 5da8f52..dbaf513 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -501,5 +501,11 @@ static inline int hyp_map_aux_data(void)

#define kvm_phys_to_vttbr(addr) phys_to_ttbr(addr)

+static inline void *stage2_alloc_pgd(struct kvm *kvm)
+{
+ return alloc_pages_exact(stage2_pgd_size(kvm),
+ GFP_KERNEL | __GFP_ZERO);
+}
+
#endif /* __ASSEMBLY__ */
#endif /* __ARM64_KVM_MMU_H__ */
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 82dd571..a339e00 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -868,7 +868,7 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
}

/* Allocate the HW PGD, making sure that each page gets its own refcount */
- pgd = alloc_pages_exact(stage2_pgd_size(kvm), GFP_KERNEL | __GFP_ZERO);
+ pgd = stage2_alloc_pgd(kvm);
if (!pgd)
return -ENOMEM;

--
2.7.4


2018-06-29 11:32:11

by Suzuki K Poulose

Subject: [PATCH v3 09/20] kvm: arm64: Make stage2 page table layout dynamic

So far we have had static stage2 page table handling code, based on a
fixed IPA of 40bits. As we prepare for a configurable IPA size per
VM, make our stage2 page table code dynamic, to do the right thing
for a given VM. We ensure the existing condition is always true even
when we lift the limit on the IPA. i.e,

page table levels in stage1 >= page table levels in stage2

Support for the IPA size configuration needs other changes in the way
we configure the EL2 registers (VTTBR and VTCR), so the IPA is still
fixed to 40bits. The patch also moves kvm_page_empty() in
asm/kvm_mmu.h to the top, before the include of asm/stage2_pgtable.h,
to avoid a forward declaration.
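
For example (deriving one entry of the cover letter's table): with 4K
pages, a 44bit IPA needs stage2_pt_levels(44) =
ARM64_HW_PGTABLE_LEVELS(40) = 4 levels of stage2 table, so a 39bit-VA
host (3 levels of stage1) cannot support it, while a 48bit-VA host
(4 levels) can.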

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
Changes since V2
- Restrict the stage2 page table to allow reusing the host page table
helpers for now, until we get stage1 independent page table helpers.
---
arch/arm64/include/asm/kvm_mmu.h | 14 +-
arch/arm64/include/asm/stage2_pgtable-nopmd.h | 42 ------
arch/arm64/include/asm/stage2_pgtable-nopud.h | 39 -----
arch/arm64/include/asm/stage2_pgtable.h | 207 +++++++++++++++++++-------
4 files changed, 159 insertions(+), 143 deletions(-)
delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopmd.h
delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopud.h

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index dbaf513..a351722 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -21,6 +21,7 @@
#include <asm/page.h>
#include <asm/memory.h>
#include <asm/cpufeature.h>
+#include <asm/kvm_arm.h>

/*
* As ARMv8.0 only has the TTBR0_EL2 register, we cannot express
@@ -147,6 +148,13 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
#define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - _AC(1, ULL))
#define kvm_vttbr_baddr_mask(kvm) VTTBR_BADDR_MASK

+static inline bool kvm_page_empty(void *ptr)
+{
+ struct page *ptr_page = virt_to_page(ptr);
+
+ return page_count(ptr_page) == 1;
+}
+
#include <asm/stage2_pgtable.h>

int create_hyp_mappings(void *from, void *to, pgprot_t prot);
@@ -237,12 +245,6 @@ static inline bool kvm_s2pmd_exec(pmd_t *pmdp)
return !(READ_ONCE(pmd_val(*pmdp)) & PMD_S2_XN);
}

-static inline bool kvm_page_empty(void *ptr)
-{
- struct page *ptr_page = virt_to_page(ptr);
- return page_count(ptr_page) == 1;
-}
-
#define hyp_pte_table_empty(ptep) kvm_page_empty(ptep)

#ifdef __PAGETABLE_PMD_FOLDED
diff --git a/arch/arm64/include/asm/stage2_pgtable-nopmd.h b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
deleted file mode 100644
index 0280ded..0000000
--- a/arch/arm64/include/asm/stage2_pgtable-nopmd.h
+++ /dev/null
@@ -1,42 +0,0 @@
-/*
- * Copyright (C) 2016 - ARM Ltd
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program. If not, see <http://www.gnu.org/licenses/>.
- */
-
-#ifndef __ARM64_S2_PGTABLE_NOPMD_H_
-#define __ARM64_S2_PGTABLE_NOPMD_H_
-
-#include <asm/stage2_pgtable-nopud.h>
-
-#define __S2_PGTABLE_PMD_FOLDED
-
-#define S2_PMD_SHIFT S2_PUD_SHIFT
-#define S2_PTRS_PER_PMD 1
-#define S2_PMD_SIZE (1UL << S2_PMD_SHIFT)
-#define S2_PMD_MASK (~(S2_PMD_SIZE-1))
-
-#define stage2_pud_none(kvm, pud) (0)
-#define stage2_pud_present(kvm, pud) (1)
-#define stage2_pud_clear(kvm, pud) do { } while (0)
-#define stage2_pud_populate(kvm, pud, pmd) do { } while (0)
-#define stage2_pmd_offset(kvm, pud, address) ((pmd_t *)(pud))
-
-#define stage2_pmd_free(kvm, pmd) do { } while (0)
-
-#define stage2_pmd_addr_end(kvm, addr, end) (end)
-
-#define stage2_pud_huge(kvm, pud) (0)
-#define stage2_pmd_table_empty(kvm, pmdp) (0)
-
-#endif
diff --git a/arch/arm64/include/asm/stage2_pgtable-nopud.h b/arch/arm64/include/asm/stage2_pgtable-nopud.h
deleted file mode 100644
index cd6304e..0000000
--- a/arch/arm64/include/asm/stage2_pgtable-nopud.h
+++ /dev/null
@@ -1,39 +0,0 @@
-/*
- * Copyright (C) 2016 - ARM Ltd
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program. If not, see <http://www.gnu.org/licenses/>.
- */
-
-#ifndef __ARM64_S2_PGTABLE_NOPUD_H_
-#define __ARM64_S2_PGTABLE_NOPUD_H_
-
-#define __S2_PGTABLE_PUD_FOLDED
-
-#define S2_PUD_SHIFT S2_PGDIR_SHIFT
-#define S2_PTRS_PER_PUD 1
-#define S2_PUD_SIZE (_AC(1, UL) << S2_PUD_SHIFT)
-#define S2_PUD_MASK (~(S2_PUD_SIZE-1))
-
-#define stage2_pgd_none(kvm, pgd) (0)
-#define stage2_pgd_present(kvm, pgd) (1)
-#define stage2_pgd_clear(kvm, pgd) do { } while (0)
-#define stage2_pgd_populate(kvm, pgd, pud) do { } while (0)
-
-#define stage2_pud_offset(kvm, pgd, address) ((pud_t *)(pgd))
-
-#define stage2_pud_free(kvm, x) do { } while (0)
-
-#define stage2_pud_addr_end(kvm, addr, end) (end)
-#define stage2_pud_table_empty(kvm, pmdp) (0)
-
-#endif
diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
index 057a405..ffc37cc 100644
--- a/arch/arm64/include/asm/stage2_pgtable.h
+++ b/arch/arm64/include/asm/stage2_pgtable.h
@@ -19,8 +19,12 @@
#ifndef __ARM64_S2_PGTABLE_H_
#define __ARM64_S2_PGTABLE_H_

+#include <linux/hugetlb.h>
#include <asm/pgtable.h>

+/* The PGDIR shift for a given page table with "n" levels. */
+#define pt_levels_pgdir_shift(n) ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - (n))
+
/*
* The hardware supports concatenation of up to 16 tables at stage2 entry level
* and we use the feature whenever possible.
@@ -29,118 +33,209 @@
* On arm64, the smallest PAGE_SIZE supported is 4k, which means
* (PAGE_SHIFT - 3) > 4 holds for all page sizes.
* This implies, the total number of page table levels at stage2 expected
- * by the hardware is actually the number of levels required for (KVM_PHYS_SHIFT - 4)
+ * by the hardware is actually the number of levels required for (IPA_SHIFT - 4)
* in normal translations(e.g, stage1), since we cannot have another level in
- * the range (KVM_PHYS_SHIFT, KVM_PHYS_SHIFT - 4).
+ * the range (IPA_SHIFT, IPA_SHIFT - 4).
*/
-#define STAGE2_PGTABLE_LEVELS ARM64_HW_PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4)
+#define stage2_pt_levels(ipa_shift) ARM64_HW_PGTABLE_LEVELS((ipa_shift) - 4)

/*
- * With all the supported VA_BITs and 40bit guest IPA, the following condition
- * is always true:
+ * With all the supported VA_BITs and guest IPA, the following condition
+ * must be always true:
*
- * STAGE2_PGTABLE_LEVELS <= CONFIG_PGTABLE_LEVELS
+ * stage2_pt_levels <= CONFIG_PGTABLE_LEVELS
*
* We base our stage-2 page table walker helpers on this assumption and
* fall back to using the host version of the helper wherever possible.
* i.e, if a particular level is not folded (e.g, PUD) at stage2, we fall back
* to using the host version, since it is guaranteed it is not folded at host.
*
- * If the condition breaks in the future, we can rearrange the host level
- * definitions and reuse them for stage2. Till then...
+ * If the condition breaks in the future, we need completely independent
+ * page table helpers. Till then...
*/
-#if STAGE2_PGTABLE_LEVELS > CONFIG_PGTABLE_LEVELS
+
+#if stage2_pt_levels(KVM_PHYS_SHIFT) > CONFIG_PGTABLE_LEVELS
#error "Unsupported combination of guest IPA and host VA_BITS."
#endif

-/* S2_PGDIR_SHIFT is the size mapped by top-level stage2 entry */
-#define S2_PGDIR_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - STAGE2_PGTABLE_LEVELS)
-#define S2_PGDIR_SIZE (_AC(1, UL) << S2_PGDIR_SHIFT)
-#define S2_PGDIR_MASK (~(S2_PGDIR_SIZE - 1))
-
/*
* The number of PTRS across all concatenated stage2 tables given by the
* number of bits resolved at the initial level.
*/
-#define PTRS_PER_S2_PGD (1 << (KVM_PHYS_SHIFT - S2_PGDIR_SHIFT))
+#define __s2_pgd_ptrs(pa, lvls) (1 << ((pa) - pt_levels_pgdir_shift((lvls))))
+#define __s2_pgd_size(pa, lvls) (__s2_pgd_ptrs((pa), (lvls)) * sizeof(pgd_t))
+
+#define kvm_stage2_levels(kvm) stage2_pt_levels(kvm_phys_shift(kvm))
+#define stage2_pgdir_shift(kvm) \
+ pt_levels_pgdir_shift(kvm_stage2_levels(kvm))
+#define stage2_pgdir_size(kvm) (_AC(1, UL) << stage2_pgdir_shift((kvm)))
+#define stage2_pgdir_mask(kvm) (~(stage2_pgdir_size((kvm)) - 1))
+#define stage2_pgd_ptrs(kvm) \
+ __s2_pgd_ptrs(kvm_phys_shift(kvm), kvm_stage2_levels(kvm))
+
+#define stage2_pgd_size(kvm) __s2_pgd_size(kvm_phys_shift(kvm), kvm_stage2_levels(kvm))

/*
* kvm_mmmu_cache_min_pages is the number of stage2 page table translation
* levels in addition to the PGD.
*/
-#define kvm_mmu_cache_min_pages(kvm) (STAGE2_PGTABLE_LEVELS - 1)
+#define kvm_mmu_cache_min_pages(kvm) (kvm_stage2_levels(kvm) - 1)


-#if STAGE2_PGTABLE_LEVELS > 3
+/* PUD/PMD definitions if present */
+#define __S2_PUD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(1)
+#define __S2_PUD_SIZE (_AC(1, UL) << __S2_PUD_SHIFT)
+#define __S2_PUD_MASK (~(__S2_PUD_SIZE - 1))

-#define S2_PUD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(1)
-#define S2_PUD_SIZE (_AC(1, UL) << S2_PUD_SHIFT)
-#define S2_PUD_MASK (~(S2_PUD_SIZE - 1))
+#define __S2_PMD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(2)
+#define __S2_PMD_SIZE (_AC(1, UL) << __S2_PMD_SHIFT)
+#define __S2_PMD_MASK (~(__S2_PMD_SIZE - 1))

-#define stage2_pgd_none(kvm, pgd) pgd_none(pgd)
-#define stage2_pgd_clear(kvm, pgd) pgd_clear(pgd)
-#define stage2_pgd_present(kvm, pgd) pgd_present(pgd)
-#define stage2_pgd_populate(kvm, pgd, pud) pgd_populate(NULL, pgd, pud)
-#define stage2_pud_offset(kvm, pgd, address) pud_offset(pgd, address)
-#define stage2_pud_free(kvm, pud) pud_free(NULL, pud)
+#define __s2_pud_index(addr) \
+ (((addr) >> __S2_PUD_SHIFT) & (PTRS_PER_PTE - 1))
+#define __s2_pmd_index(addr) \
+ (((addr) >> __S2_PMD_SHIFT) & (PTRS_PER_PTE - 1))

-#define stage2_pud_table_empty(kvm, pudp) kvm_page_empty(pudp)
+#define __kvm_has_stage2_levels(kvm, min_levels) \
+ ((CONFIG_PGTABLE_LEVELS >= min_levels) && (kvm_stage2_levels(kvm) >= min_levels))
+
+#define kvm_stage2_has_pgd(kvm) __kvm_has_stage2_levels(kvm, 4)
+#define kvm_stage2_has_pud(kvm) __kvm_has_stage2_levels(kvm, 3)
+
+static inline int stage2_pgd_none(struct kvm *kvm, pgd_t pgd)
+{
+ return kvm_stage2_has_pgd(kvm) ? pgd_none(pgd) : 0;
+}
+
+static inline void stage2_pgd_clear(struct kvm *kvm, pgd_t *pgdp)
+{
+ if (kvm_stage2_has_pgd(kvm))
+ pgd_clear(pgdp);
+}
+
+static inline int stage2_pgd_present(struct kvm *kvm, pgd_t pgd)
+{
+ return kvm_stage2_has_pgd(kvm) ? pgd_present(pgd) : 1;
+}
+
+static inline void stage2_pgd_populate(struct kvm *kvm, pgd_t *pgdp, pud_t *pud)
+{
+ if (kvm_stage2_has_pgd(kvm))
+ pgd_populate(NULL, pgdp, pud);
+ else
+ BUG();
+}
+
+static inline pud_t *stage2_pud_offset(struct kvm *kvm,
+ pgd_t *pgd, unsigned long address)
+{
+ if (kvm_stage2_has_pgd(kvm)) {
+ phys_addr_t pud_phys = pgd_page_paddr(*pgd);
+
+ pud_phys += __s2_pud_index(address) * sizeof(pud_t);
+ return __va(pud_phys);
+ }
+ return (pud_t *)pgd;
+}
+
+static inline void stage2_pud_free(struct kvm *kvm, pud_t *pud)
+{
+ if (kvm_stage2_has_pgd(kvm))
+ pud_free(NULL, pud);
+}
+
+static inline int stage2_pud_table_empty(struct kvm *kvm, pud_t *pudp)
+{
+ return kvm_stage2_has_pgd(kvm) && kvm_page_empty(pudp);
+}

static inline phys_addr_t
stage2_pud_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
{
- phys_addr_t boundary = (addr + S2_PUD_SIZE) & S2_PUD_MASK;
+ if (kvm_stage2_has_pgd(kvm)) {
+ phys_addr_t boundary = (addr + __S2_PUD_SIZE) & __S2_PUD_MASK;

- return (boundary - 1 < end - 1) ? boundary : end;
+ return (boundary - 1 < end - 1) ? boundary : end;
+ }
+ return end;
}

-#endif /* STAGE2_PGTABLE_LEVELS > 3 */
+static inline int stage2_pud_none(struct kvm *kvm, pud_t pud)
+{
+ return kvm_stage2_has_pud(kvm) ? pud_none(pud) : 0;
+}

+static inline void stage2_pud_clear(struct kvm *kvm, pud_t *pudp)
+{
+ if (kvm_stage2_has_pud(kvm))
+ pud_clear(pudp);
+}

-#if STAGE2_PGTABLE_LEVELS > 2
+static inline int stage2_pud_present(struct kvm *kvm, pud_t pud)
+{
+ return kvm_stage2_has_pud(kvm) ? pud_present(pud) : 1;
+}

-#define S2_PMD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(2)
-#define S2_PMD_SIZE (_AC(1, UL) << S2_PMD_SHIFT)
-#define S2_PMD_MASK (~(S2_PMD_SIZE - 1))
+static inline void stage2_pud_populate(struct kvm *kvm, pud_t *pudp, pmd_t *pmd)
+{
+ if (kvm_stage2_has_pud(kvm))
+ pud_populate(NULL, pudp, pmd);
+ else
+ BUG();
+}

-#define stage2_pud_none(kvm, pud) pud_none(pud)
-#define stage2_pud_clear(kvm, pud) pud_clear(pud)
-#define stage2_pud_present(kvm, pud) pud_present(pud)
-#define stage2_pud_populate(kvm, pud, pmd) pud_populate(NULL, pud, pmd)
-#define stage2_pmd_offset(kvm, pud, address) pmd_offset(pud, address)
-#define stage2_pmd_free(kvm, pmd) pmd_free(NULL, pmd)
+static inline pmd_t *stage2_pmd_offset(struct kvm *kvm,
+ pud_t *pud, unsigned long address)
+{
+ if (kvm_stage2_has_pud(kvm)) {
+ phys_addr_t pmd_phys = pud_page_paddr(*pud);

-#define stage2_pud_huge(kvm, pud) pud_huge(pud)
-#define stage2_pmd_table_empty(kvm, pmdp) kvm_page_empty(pmdp)
+ pmd_phys += __s2_pmd_index(address) * sizeof(pmd_t);
+ return __va(pmd_phys);
+ }
+ return (pmd_t *)pud;
+}
+
+static inline void stage2_pmd_free(struct kvm *kvm, pmd_t *pmd)
+{
+ if (kvm_stage2_has_pud(kvm))
+ pmd_free(NULL, pmd);
+}
+
+static inline int stage2_pmd_table_empty(struct kvm *kvm, pmd_t *pmdp)
+{
+ return kvm_stage2_has_pud(kvm) && kvm_page_empty(pmdp);
+}

static inline phys_addr_t
stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
{
- phys_addr_t boundary = (addr + S2_PMD_SIZE) & S2_PMD_MASK;
+ if (kvm_stage2_has_pud(kvm)) {
+ phys_addr_t boundary = (addr + __S2_PMD_SIZE) & __S2_PMD_MASK;

- return (boundary - 1 < end - 1) ? boundary : end;
+ return (boundary - 1 < end - 1) ? boundary : end;
+ }
+ return end;
}

-#endif /* STAGE2_PGTABLE_LEVELS > 2 */
+static inline int stage2_pud_huge(struct kvm *kvm, pud_t pud)
+{
+ return kvm_stage2_has_pud(kvm) ? pud_huge(pud) : 0;
+}

#define stage2_pte_table_empty(kvm, ptep) kvm_page_empty(ptep)

-#if STAGE2_PGTABLE_LEVELS == 2
-#include <asm/stage2_pgtable-nopmd.h>
-#elif STAGE2_PGTABLE_LEVELS == 3
-#include <asm/stage2_pgtable-nopud.h>
-#endif
-
-#define stage2_pgd_size(kvm) (PTRS_PER_S2_PGD * sizeof(pgd_t))
-
-#define stage2_pgd_index(kvm, addr) \
- (((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
+static inline unsigned long stage2_pgd_index(struct kvm *kvm, phys_addr_t addr)
+{
+ return (addr >> stage2_pgdir_shift(kvm)) & (stage2_pgd_ptrs(kvm) - 1);
+}

static inline phys_addr_t
stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
{
- phys_addr_t boundary = (addr + S2_PGDIR_SIZE) & S2_PGDIR_MASK;
+ phys_addr_t boundary;

+ boundary = (addr + stage2_pgdir_size(kvm)) & stage2_pgdir_mask(kvm);
return (boundary - 1 < end - 1) ? boundary : end;
}

--
2.7.4


2018-06-29 11:34:36

by Suzuki K Poulose

Subject: [PATCH v3 05/20] kvm: arm/arm64: Fix stage2_flush_memslot for 4 level page table

So far we have only supported a 3 level page table with a fixed IPA
of 40bits. Fix stage2_flush_memslot() to accommodate 4 level tables.

Cc: Marc Zyngier <[email protected]>
Acked-by: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
virt/kvm/arm/mmu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 1d90d79..061e6b3 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -379,7 +379,8 @@ static void stage2_flush_memslot(struct kvm *kvm,
pgd = kvm->arch.pgd + stage2_pgd_index(addr);
do {
next = stage2_pgd_addr_end(addr, end);
- stage2_flush_puds(kvm, pgd, addr, next);
+ if (!stage2_pgd_none(*pgd))
+ stage2_flush_puds(kvm, pgd, addr, next);
} while (pgd++, addr = next, addr != end);
}

--
2.7.4


2018-06-29 13:44:18

by Suzuki K Poulose

Subject: [PATCH v3 15/20] kvm: arm/arm64: Allow tuning the physical address size for VM

Allow specifying the physical address size for a new VM via
the kvm_type argument of the KVM_CREATE_VM ioctl. This allows
us to finalise the stage2 page table format as early as possible
and hence perform the right checks on the memory slots without
complication. The size is encoded as Log2(PA_Size) in bits[7:0]
of the type field and can encode more information in the future
if required. The IPA size is still capped at 40bits.
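
For example, a VMM requesting a 48bit IPA space would call (sketch):

	vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_VM_TYPE_ARM_PHYS_SHIFT(48));

while a type of 0 keeps the current 40bit default.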

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Cc: Peter Maydell <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm/include/asm/kvm_mmu.h | 2 ++
arch/arm64/include/asm/kvm_arm.h | 10 +++-------
arch/arm64/include/asm/kvm_mmu.h | 2 ++
include/uapi/linux/kvm.h | 10 ++++++++++
virt/kvm/arm/arm.c | 24 ++++++++++++++++++++++--
5 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index d86f8dd..bcc3dd9 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -385,6 +385,8 @@ static inline u32 kvm_get_ipa_limit(void)
return KVM_PHYS_SHIFT;
}

+static inline void kvm_config_stage2(struct kvm *kvm, u32 ipa_shift) {}
+
#endif /* !__ASSEMBLY__ */

#endif /* __ARM_KVM_MMU_H__ */
diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index b02c316..2e90942 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -128,19 +128,15 @@
#define VTCR_EL2_T0SZ(x) TCR_T0SZ(x)

/*
- * We configure the Stage-2 page tables to always restrict the IPA space to be
- * 40 bits wide (T0SZ = 24). Systems with a PARange smaller than 40 bits are
- * not known to exist and will break with this configuration.
+ * We configure the Stage-2 page tables based on the requested size of
+ * IPA for each VM. The default size is set to 40bits and is not allowed
+ * go below that limit (for backward compatibility).
*
* VTCR_EL2.PS is extracted from ID_AA64MMFR0_EL1.PARange at boot time
* (see hyp-init.S).
*
* VTCR_EL2.SL0 and T0SZ are configured per VM at runtime before switching to
* the VM.
- *
- * Note that when using 4K pages, we concatenate two first level page tables
- * together. With 16K pages, we concatenate 16 first level page tables.
- *
*/

#define VTCR_EL2_COMMON_BITS (VTCR_EL2_SH0_INNER | VTCR_EL2_ORGN0_WBWA | \
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index b4564d8..f3fb05a3 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -537,5 +537,7 @@ static inline u32 kvm_get_ipa_limit(void)
return KVM_PHYS_SHIFT;
}

+static inline void kvm_config_stage2(struct kvm *kvm, u32 ipa_shift) {}
+
#endif /* __ASSEMBLY__ */
#endif /* __ARM64_KVM_MMU_H__ */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4df9bb6..fa4cab0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -751,6 +751,16 @@ struct kvm_ppc_resize_hpt {
#define KVM_S390_SIE_PAGE_OFFSET 1

/*
+ * On arm/arm64, machine type can be used to request the physical
+ * address size for the VM. Bits [7-0] have been reserved for the
+ * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
+ * value 0 implies the default IPA size, which is 40bits.
+ */
+#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK 0xff
+#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x) \
+ ((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)
+
+/*
* ioctls for /dev/kvm fds:
*/
#define KVM_GET_API_VERSION _IO(KVMIO, 0x00)
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 0d99e67..1085761 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -112,6 +112,25 @@ void kvm_arch_check_processor_compat(void *rtn)
}


+static int kvm_arch_config_vm(struct kvm *kvm, unsigned long type)
+{
+ u32 ipa_shift = KVM_VM_TYPE_ARM_PHYS_SHIFT(type);
+
+ /*
+ * Make sure the size, if specified, is within the range of the
+ * default size and the supported maximum limit.
+ */
+ if (ipa_shift) {
+ if (ipa_shift < KVM_PHYS_SHIFT || ipa_shift > kvm_ipa_limit)
+ return -EINVAL;
+ } else {
+ ipa_shift = KVM_PHYS_SHIFT;
+ }
+
+ kvm_config_stage2(kvm, ipa_shift);
+ return 0;
+}
+
/**
* kvm_arch_init_vm - initializes a VM data structure
* @kvm: pointer to the KVM struct
@@ -120,8 +139,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
{
int ret, cpu;

- if (type)
- return -EINVAL;
+ ret = kvm_arch_config_vm(kvm, type);
+ if (ret)
+ return ret;

kvm->arch.last_vcpu_ran = alloc_percpu(typeof(*kvm->arch.last_vcpu_ran));
if (!kvm->arch.last_vcpu_ran)
--
2.7.4


2018-06-29 13:47:11

by Suzuki K Poulose

[permalink] [raw]
Subject: [kvmtool test PATCH 23/24] kvmtool: arm64: Switch memory layout

If the guest wants to use a larger physical address space, place
the RAM in the upper half of the address space. Otherwise, keep
the default layout.
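
As a worked example of the layout macros introduced below, with
phys_shift = 48:

	ARM64_MEMORY_AREA(48)	/* == 1UL << 47: RAM base at 128TB */
	ARM64_MAX_MEMORY(48)	/* == (1ULL << 48) - (1UL << 47) == 128TB */

i.e., the RAM occupies the upper half of the IPA space, while the
lower half keeps the existing I/O layout.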

Signed-off-by: Suzuki K Poulose <[email protected]>
---
arm/aarch32/include/kvm/kvm-arch.h | 6 ++++--
arm/aarch64/include/kvm/kvm-arch.h | 15 ++++++++++++---
arm/include/arm-common/kvm-arch.h | 11 ++++++-----
arm/kvm.c | 2 +-
4 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/arm/aarch32/include/kvm/kvm-arch.h b/arm/aarch32/include/kvm/kvm-arch.h
index cd31e72..bcd382b 100644
--- a/arm/aarch32/include/kvm/kvm-arch.h
+++ b/arm/aarch32/include/kvm/kvm-arch.h
@@ -3,8 +3,10 @@

#define ARM_KERN_OFFSET(...) 0x8000

-#define ARM_MAX_MEMORY(...) ARM_LOMAP_MAX_MEMORY
-
#include "arm-common/kvm-arch.h"

+#define ARM_MAX_MEMORY(...) ARM32_MAX_MEMORY
+#define ARM_MEMORY_AREA(...) ARM32_MEMORY_AREA
+
+
#endif /* KVM__KVM_ARCH_H */
diff --git a/arm/aarch64/include/kvm/kvm-arch.h b/arm/aarch64/include/kvm/kvm-arch.h
index 9de623a..bad35b9 100644
--- a/arm/aarch64/include/kvm/kvm-arch.h
+++ b/arm/aarch64/include/kvm/kvm-arch.h
@@ -1,14 +1,23 @@
#ifndef KVM__KVM_ARCH_H
#define KVM__KVM_ARCH_H

+#include "arm-common/kvm-arch.h"
+
+#define ARM64_MEMORY_AREA(phys_shift) (1UL << (phys_shift - 1))
+#define ARM64_MAX_MEMORY(phys_shift) \
+ ((1ULL << (phys_shift)) - ARM64_MEMORY_AREA(phys_shift))
+
+#define ARM_MEMORY_AREA(kvm) ((kvm)->cfg.arch.aarch32_guest ? \
+ ARM32_MEMORY_AREA : \
+ ARM64_MEMORY_AREA(kvm->cfg.arch.phys_shift))
+
#define ARM_KERN_OFFSET(kvm) ((kvm)->cfg.arch.aarch32_guest ? \
0x8000 : \
0x80000)

#define ARM_MAX_MEMORY(kvm) ((kvm)->cfg.arch.aarch32_guest ? \
- ARM_LOMAP_MAX_MEMORY : \
- ARM_HIMAP_MAX_MEMORY)
+ ARM32_MAX_MEMORY : \
+ ARM64_MAX_MEMORY(kvm->cfg.arch.phys_shift))

-#include "arm-common/kvm-arch.h"

#endif /* KVM__KVM_ARCH_H */
diff --git a/arm/include/arm-common/kvm-arch.h b/arm/include/arm-common/kvm-arch.h
index b9d486d..b29b4b1 100644
--- a/arm/include/arm-common/kvm-arch.h
+++ b/arm/include/arm-common/kvm-arch.h
@@ -6,14 +6,15 @@
#include <linux/types.h>

#include "arm-common/gic.h"
-
#define ARM_IOPORT_AREA _AC(0x0000000000000000, UL)
#define ARM_MMIO_AREA _AC(0x0000000000010000, UL)
#define ARM_AXI_AREA _AC(0x0000000040000000, UL)
-#define ARM_MEMORY_AREA _AC(0x0000000080000000, UL)

-#define ARM_LOMAP_MAX_MEMORY ((1ULL << 32) - ARM_MEMORY_AREA)
-#define ARM_HIMAP_MAX_MEMORY ((1ULL << 40) - ARM_MEMORY_AREA)
+#define ARM32_MEMORY_AREA _AC(0x0000000080000000, UL)
+#define ARM32_MAX_MEMORY ((1ULL << 32) - ARM32_MEMORY_AREA)
+
+#define ARM_IOMEM_AREA_END ARM32_MEMORY_AREA
+

#define ARM_GIC_DIST_BASE (ARM_AXI_AREA - ARM_GIC_DIST_SIZE)
#define ARM_GIC_CPUI_BASE (ARM_GIC_DIST_BASE - ARM_GIC_CPUI_SIZE)
@@ -24,7 +25,7 @@
#define ARM_IOPORT_SIZE (ARM_MMIO_AREA - ARM_IOPORT_AREA)
#define ARM_VIRTIO_MMIO_SIZE (ARM_AXI_AREA - (ARM_MMIO_AREA + ARM_GIC_SIZE))
#define ARM_PCI_CFG_SIZE (1ULL << 24)
-#define ARM_PCI_MMIO_SIZE (ARM_MEMORY_AREA - \
+#define ARM_PCI_MMIO_SIZE (ARM_IOMEM_AREA_END - \
(ARM_AXI_AREA + ARM_PCI_CFG_SIZE))

#define KVM_IOPORT_AREA ARM_IOPORT_AREA
diff --git a/arm/kvm.c b/arm/kvm.c
index 2ab436e..5701d41 100644
--- a/arm/kvm.c
+++ b/arm/kvm.c
@@ -30,7 +30,7 @@ void kvm__init_ram(struct kvm *kvm)
u64 phys_start, phys_size;
void *host_mem;

- phys_start = ARM_MEMORY_AREA;
+ phys_start = ARM_MEMORY_AREA(kvm);
phys_size = kvm->ram_size;
host_mem = kvm->ram_start;

--
2.7.4


2018-06-29 13:47:45

by Suzuki K Poulose

[permalink] [raw]
Subject: [kvmtool test PATCH 24/24] kvmtool: arm: Add support for creating VM with PA size

Specify the physical address size for the VM, encoded in the VM type.
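
To illustrate with made up values (however the PA size option is
spelled on the command line): a request for phys_shift = 44 on a host
reporting a 48bit limit ends up as:

	kvm->cfg.arch.phys_shift = 44;	/* accepted, since 44 <= 48 */
	kvm_arm_type = 44;	/* later passed as the KVM_CREATE_VM type */

while the default 40bit case leaves kvm_arm_type at 0, preserving
backward compatibility with older kernels.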

Signed-off-by: Suzuki K Poulose <[email protected]>
---
arm/include/arm-common/kvm-arch.h | 6 +++++-
arm/kvm.c | 22 ++++++++++++++++++++++
2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/arm/include/arm-common/kvm-arch.h b/arm/include/arm-common/kvm-arch.h
index b29b4b1..d77f3ac 100644
--- a/arm/include/arm-common/kvm-arch.h
+++ b/arm/include/arm-common/kvm-arch.h
@@ -44,7 +44,11 @@

#define KVM_IRQ_OFFSET GIC_SPI_IRQ_BASE

-#define KVM_VM_TYPE 0
+extern unsigned long kvm_arm_type;
+extern void kvm__arch_init_hyp(struct kvm *kvm);
+
+#define KVM_VM_TYPE kvm_arm_type
+#define kvm__arch_init_hyp kvm__arch_init_hyp

#define VIRTIO_DEFAULT_TRANS(kvm) \
((kvm)->cfg.arch.virtio_trans_pci ? VIRTIO_PCI : VIRTIO_MMIO)
diff --git a/arm/kvm.c b/arm/kvm.c
index 5701d41..b1969be 100644
--- a/arm/kvm.c
+++ b/arm/kvm.c
@@ -11,6 +11,8 @@
#include <linux/kvm.h>
#include <linux/sizes.h>

+unsigned long kvm_arm_type;
+
struct kvm_ext kvm_req_ext[] = {
{ DEFINE_KVM_EXT(KVM_CAP_IRQCHIP) },
{ DEFINE_KVM_EXT(KVM_CAP_ONE_REG) },
@@ -18,6 +20,26 @@ struct kvm_ext kvm_req_ext[] = {
{ 0, 0 },
};

+#ifndef KVM_ARM_GET_MAX_VM_PHYS_SHIFT
+#define KVM_ARM_GET_MAX_VM_PHYS_SHIFT _IO(KVMIO, 0x0b)
+#endif
+
+void kvm__arch_init_hyp(struct kvm *kvm)
+{
+ int max_ipa;
+
+ max_ipa = ioctl(kvm->sys_fd, KVM_ARM_GET_MAX_VM_PHYS_SHIFT);
+ if (max_ipa < 0)
+ max_ipa = 40;
+ if (!kvm->cfg.arch.phys_shift)
+ kvm->cfg.arch.phys_shift = 40;
+ if (kvm->cfg.arch.phys_shift > max_ipa)
+ die("Requested PA size (%u) is not supported by the host (%ubits)\n",
+ kvm->cfg.arch.phys_shift, max_ipa);
+ if (kvm->cfg.arch.phys_shift != 40)
+ kvm_arm_type = kvm->cfg.arch.phys_shift;
+}
+
bool kvm__arch_cpu_supports_vm(void)
{
/* The KVM capability check is enough. */
--
2.7.4


2018-06-29 13:50:09

by Suzuki K Poulose

[permalink] [raw]
Subject: [PATCH v3 16/20] kvm: arm64: Switch to per VM IPA limit

Now that we can manage the stage2 page table per VM, switch the
configuration details to per VM instance. We keep track of the
IPA bits, number of page table levels and the VTCR bits (which
depend on the IPA and the number of levels). While at it, remove
the unused pgd_lock field from kvm_arch for arm64.
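
As a worked example of what gets cached (assuming 4K pages and the
default 40bit IPA, using the helpers referenced below):

	kvm->arch.phys_shift   = 40;
	kvm->arch.s2_levels    = 3;	/* 2 concatenated level-1 tables */
	kvm->arch.vtcr_private = VTCR_EL2_SL0(3)	/* SL0 == 1 */
				 | TCR_T0SZ(40);	/* T0SZ == 24 */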

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 14 ++++++++++++--
arch/arm64/include/asm/kvm_hyp.h | 3 +--
arch/arm64/include/asm/kvm_mmu.h | 20 ++++++++++++++++++--
arch/arm64/include/asm/stage2_pgtable.h | 1 -
virt/kvm/arm/mmu.c | 4 ++++
5 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 328f472..9a15860 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -61,13 +61,23 @@ struct kvm_arch {
u64 vmid_gen;
u32 vmid;

- /* 1-level 2nd stage table and lock */
- spinlock_t pgd_lock;
+ /* stage-2 page table */
pgd_t *pgd;

/* VTTBR value associated with above pgd and vmid */
u64 vttbr;

+ /* Private bits of VTCR_EL2 for this VM */
+ u64 vtcr_private;
+ /* Size of the PA space for this guest */
+ u8 phys_shift;
+ /*
+ * Number of levels in page table. We could always calculate
+ * it from phys_shift above. We cache it for faster switches
+ * in stage2 page table helpers.
+ */
+ u8 s2_levels;
+
/* The last vcpu id that ran on each physical CPU */
int __percpu *last_vcpu_ran;

diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index 3e8052d1..699f678 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -166,8 +166,7 @@ static __always_inline void __hyp_text __load_guest_stage2(struct kvm *kvm)
u64 vtcr = read_sysreg(vtcr_el2);

vtcr &= ~VTCR_EL2_PRIVATE_MASK;
- vtcr |= VTCR_EL2_SL0(kvm_stage2_levels(kvm)) |
- VTCR_EL2_T0SZ(kvm_phys_shift(kvm));
+ vtcr |= kvm->arch.vtcr_private;
write_sysreg(vtcr, vtcr_el2);
write_sysreg(kvm->arch.vttbr, vttbr_el2);
}
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index f3fb05a3..a291cdc 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -143,9 +143,10 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
*/
#define KVM_PHYS_SHIFT (40)

-#define kvm_phys_shift(kvm) KVM_PHYS_SHIFT
+#define kvm_phys_shift(kvm) (kvm->arch.phys_shift)
#define kvm_phys_size(kvm) (_AC(1, ULL) << kvm_phys_shift(kvm))
#define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - _AC(1, ULL))
+#define kvm_stage2_levels(kvm) (kvm->arch.s2_levels)

static inline bool kvm_page_empty(void *ptr)
{
@@ -528,6 +529,18 @@ static inline u64 kvm_vttbr_baddr_mask(struct kvm *kvm)

static inline void *stage2_alloc_pgd(struct kvm *kvm)
{
+ u32 ipa, lvls;
+
+ /*
+ * Stage2 page table can support concatenation of (up to 16) tables
+ * at the entry level, thereby reducing the number of levels.
+ */
+ ipa = kvm_phys_shift(kvm);
+ lvls = stage2_pt_levels(ipa);
+
+ kvm->arch.s2_levels = lvls;
+ kvm->arch.vtcr_private = VTCR_EL2_SL0(lvls) | TCR_T0SZ(ipa);
+
return alloc_pages_exact(stage2_pgd_size(kvm),
GFP_KERNEL | __GFP_ZERO);
}
@@ -537,7 +550,10 @@ static inline u32 kvm_get_ipa_limit(void)
return KVM_PHYS_SHIFT;
}

-static inline void kvm_config_stage2(struct kvm *kvm, u32 ipa_shift) {}
+static inline void kvm_config_stage2(struct kvm *kvm, u32 ipa_shift)
+{
+ kvm->arch.phys_shift = ipa_shift;
+}

#endif /* __ASSEMBLY__ */
#endif /* __ARM64_KVM_MMU_H__ */
diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
index ffc37cc..91d7936 100644
--- a/arch/arm64/include/asm/stage2_pgtable.h
+++ b/arch/arm64/include/asm/stage2_pgtable.h
@@ -65,7 +65,6 @@
#define __s2_pgd_ptrs(pa, lvls) (1 << ((pa) - pt_levels_pgdir_shift((lvls))))
#define __s2_pgd_size(pa, lvls) (__s2_pgd_ptrs((pa), (lvls)) * sizeof(pgd_t))

-#define kvm_stage2_levels(kvm) stage2_pt_levels(kvm_phys_shift(kvm))
#define stage2_pgdir_shift(kvm) \
pt_levels_pgdir_shift(kvm_stage2_levels(kvm))
#define stage2_pgdir_size(kvm) (_AC(1, UL) << stage2_pgdir_shift((kvm)))
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index a339e00..d7822e1 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -867,6 +867,10 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
return -EINVAL;
}

+ /* Make sure we have the stage2 configured for this VM */
+ if (WARN_ON(!kvm_phys_shift(kvm)))
+ return -EINVAL;
+
/* Allocate the HW PGD, making sure that each page gets its own refcount */
pgd = stage2_alloc_pgd(kvm);
if (!pgd)
--
2.7.4


2018-06-29 14:52:11

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v3 04/20] kvm: arm64: Clean up VTCR_EL2 initialisation

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> Use the new helper for converting the parange to the physical shift.
> Also, add the missing definitions for the VTCR_EL2 register fields
> and use them instead of hard coding numbers.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> Changes since V2
> - Part 2 of the split from original patch.
> - Also add missing VTCR field helpers and use them.
> ---
> arch/arm64/include/asm/kvm_arm.h | 3 +++
> arch/arm64/kvm/hyp/s2-setup.c | 30 ++++++------------------------
> 2 files changed, 9 insertions(+), 24 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index 6dd285e..3dffd38 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -106,6 +106,7 @@
> #define VTCR_EL2_RES1 (1 << 31)
> #define VTCR_EL2_HD (1 << 22)
> #define VTCR_EL2_HA (1 << 21)
> +#define VTCR_EL2_PS_SHIFT TCR_EL2_PS_SHIFT
> #define VTCR_EL2_PS_MASK TCR_EL2_PS_MASK
> #define VTCR_EL2_TG0_MASK TCR_TG0_MASK
> #define VTCR_EL2_TG0_4K TCR_TG0_4K
> @@ -126,6 +127,8 @@
> #define VTCR_EL2_VS_8BIT (0 << VTCR_EL2_VS_SHIFT)
> #define VTCR_EL2_VS_16BIT (1 << VTCR_EL2_VS_SHIFT)
>
> +#define VTCR_EL2_T0SZ(x) TCR_T0SZ(x)
> +
> /*
> * We configure the Stage-2 page tables to always restrict the IPA space to be
> * 40 bits wide (T0SZ = 24). Systems with a PARange smaller than 40 bits are
> diff --git a/arch/arm64/kvm/hyp/s2-setup.c b/arch/arm64/kvm/hyp/s2-setup.c
> index 603e1ee..81094f1 100644
> --- a/arch/arm64/kvm/hyp/s2-setup.c
> +++ b/arch/arm64/kvm/hyp/s2-setup.c
> @@ -19,11 +19,13 @@
> #include <asm/kvm_arm.h>
> #include <asm/kvm_asm.h>
> #include <asm/kvm_hyp.h>
> +#include <asm/cpufeature.h>
>
> u32 __hyp_text __init_stage2_translation(void)
> {
> u64 val = VTCR_EL2_FLAGS;
> u64 parange;
> + u32 phys_shift;
> u64 tmp;

Not related to this patch but the comment reporting that bit 19 of
VTCR_EL2 is RES0 is not fully valid anymore as it now corresponds to
VMID size in ARM ARM >= 8.1.
>
> /*
> @@ -34,30 +36,10 @@ u32 __hyp_text __init_stage2_translation(void)
> parange = read_sysreg(id_aa64mmfr0_el1) & 7;
> if (parange > ID_AA64MMFR0_PARANGE_MAX)
> parange = ID_AA64MMFR0_PARANGE_MAX;
> - val |= parange << 16;
> + val |= parange << VTCR_EL2_PS_SHIFT;
>
> /* Compute the actual PARange... */
> - switch (parange) {
> - case 0:
> - parange = 32;
> - break;
> - case 1:
> - parange = 36;
> - break;
> - case 2:
> - parange = 40;
> - break;
> - case 3:
> - parange = 42;
> - break;
> - case 4:
> - parange = 44;
> - break;
> - case 5:
> - default:
> - parange = 48;
> - break;
> - }
> + phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange);
>
> /*
> * ... and clamp it to 40 bits, unless we have some braindead
> @@ -65,7 +47,7 @@ u32 __hyp_text __init_stage2_translation(void)
> * return that value for the rest of the kernel to decide what
> * to do.
> */
> - val |= 64 - (parange > 40 ? 40 : parange);
> + val |= VTCR_EL2_T0SZ(phys_shift > 40 ? 40 : phys_shift);
>
> /*
> * Check the availability of Hardware Access Flag / Dirty Bit
> @@ -86,5 +68,5 @@ u32 __hyp_text __init_stage2_translation(void)
>
> write_sysreg(val, vtcr_el2);
>
> - return parange;
> + return phys_shift;
Reviewed-by: Eric Auger <[email protected]>

Thanks

Eric

> }
>

2018-06-29 14:52:42

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v3 06/20] kvm: arm/arm64: Remove spurious WARN_ON

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> On a 4-level page table, a pgd entry can be empty, unlike in a 3-level
> page table. Remove the spurious WARN_ON() in stage2_get_pud().
>
> Cc: Marc Zyngier <[email protected]>
> Acked-by: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> virt/kvm/arm/mmu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 061e6b3..308171c 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -976,7 +976,7 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
> pud_t *pud;
>
> pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> - if (WARN_ON(stage2_pgd_none(*pgd))) {
> + if (stage2_pgd_none(*pgd)) {
> if (!cache)
> return NULL;
> pud = mmu_memory_cache_alloc(cache);
>

Reviewed-by: Eric Auger <[email protected]>

Thanks

Eric

2018-06-29 14:52:57

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v3 03/20] arm64: Add a helper for PARange to physical shift conversion

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> On arm64, ID_AA64MMFR0_EL1.PARange encodes the maximum Physical
> Address range supported by the CPU. Add a helper to decode this
> to actual physical shift. If we hit an unallocated value, return
> the maximum range supported by the kernel.
> This is will be used by the KVM to set the VTCR_EL2.T0SZ, as it
s/is// and s/the KVM/KVM
> is about to move its place. Having this helper keeps the code
> movement cleaner.
>
> Cc: Catalin Marinas <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Cc: James Morse <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> Changes since V2:
> - Split the patch
> - Limit the physical shift only for values unrecognized.
> ---
> arch/arm64/include/asm/cpufeature.h | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> index 1717ba1..855cf0e 100644
> --- a/arch/arm64/include/asm/cpufeature.h
> +++ b/arch/arm64/include/asm/cpufeature.h
> @@ -530,6 +530,19 @@ void arm64_set_ssbd_mitigation(bool state);
> static inline void arm64_set_ssbd_mitigation(bool state) {}
> #endif
>
> +static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange)
> +{
> + switch (parange) {
> + case 0: return 32;
> + case 1: return 36;
> + case 2: return 40;
> + case 3: return 42;
> + case 4: return 44;
> + case 5: return 48;
> + case 6: return 52;
> + default: return CONFIG_ARM64_PA_BITS;
> + }
> +}
> #endif /* __ASSEMBLY__ */
>
> #endif
>

Reviewed-by: Eric Auger <[email protected]>

Thanks

Eric


2018-06-29 15:04:29

by Suzuki K Poulose

[permalink] [raw]
Subject: [PATCH v3 04/20] kvm: arm64: Clean up VTCR_EL2 initialisation

Use the new helper for converting the parange to the physical shift.
Also, add the missing definitions for the VTCR_EL2 register fields
and use them instead of hard coding numbers.

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
Changes since V2
- Part 2 of the split from original patch.
- Also add missing VTCR field helpers and use them.
---
arch/arm64/include/asm/kvm_arm.h | 3 +++
arch/arm64/kvm/hyp/s2-setup.c | 30 ++++++------------------------
2 files changed, 9 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 6dd285e..3dffd38 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -106,6 +106,7 @@
#define VTCR_EL2_RES1 (1 << 31)
#define VTCR_EL2_HD (1 << 22)
#define VTCR_EL2_HA (1 << 21)
+#define VTCR_EL2_PS_SHIFT TCR_EL2_PS_SHIFT
#define VTCR_EL2_PS_MASK TCR_EL2_PS_MASK
#define VTCR_EL2_TG0_MASK TCR_TG0_MASK
#define VTCR_EL2_TG0_4K TCR_TG0_4K
@@ -126,6 +127,8 @@
#define VTCR_EL2_VS_8BIT (0 << VTCR_EL2_VS_SHIFT)
#define VTCR_EL2_VS_16BIT (1 << VTCR_EL2_VS_SHIFT)

+#define VTCR_EL2_T0SZ(x) TCR_T0SZ(x)
+
/*
* We configure the Stage-2 page tables to always restrict the IPA space to be
* 40 bits wide (T0SZ = 24). Systems with a PARange smaller than 40 bits are
diff --git a/arch/arm64/kvm/hyp/s2-setup.c b/arch/arm64/kvm/hyp/s2-setup.c
index 603e1ee..81094f1 100644
--- a/arch/arm64/kvm/hyp/s2-setup.c
+++ b/arch/arm64/kvm/hyp/s2-setup.c
@@ -19,11 +19,13 @@
#include <asm/kvm_arm.h>
#include <asm/kvm_asm.h>
#include <asm/kvm_hyp.h>
+#include <asm/cpufeature.h>

u32 __hyp_text __init_stage2_translation(void)
{
u64 val = VTCR_EL2_FLAGS;
u64 parange;
+ u32 phys_shift;
u64 tmp;

/*
@@ -34,30 +36,10 @@ u32 __hyp_text __init_stage2_translation(void)
parange = read_sysreg(id_aa64mmfr0_el1) & 7;
if (parange > ID_AA64MMFR0_PARANGE_MAX)
parange = ID_AA64MMFR0_PARANGE_MAX;
- val |= parange << 16;
+ val |= parange << VTCR_EL2_PS_SHIFT;

/* Compute the actual PARange... */
- switch (parange) {
- case 0:
- parange = 32;
- break;
- case 1:
- parange = 36;
- break;
- case 2:
- parange = 40;
- break;
- case 3:
- parange = 42;
- break;
- case 4:
- parange = 44;
- break;
- case 5:
- default:
- parange = 48;
- break;
- }
+ phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange);

/*
* ... and clamp it to 40 bits, unless we have some braindead
@@ -65,7 +47,7 @@ u32 __hyp_text __init_stage2_translation(void)
* return that value for the rest of the kernel to decide what
* to do.
*/
- val |= 64 - (parange > 40 ? 40 : parange);
+ val |= VTCR_EL2_T0SZ(phys_shift > 40 ? 40 : phys_shift);

/*
* Check the availability of Hardware Access Flag / Dirty Bit
@@ -86,5 +68,5 @@ u32 __hyp_text __init_stage2_translation(void)

write_sysreg(val, vtcr_el2);

- return parange;
+ return phys_shift;
}
--
2.7.4


2018-06-29 15:04:51

by Suzuki K Poulose

[permalink] [raw]
Subject: [PATCH v3 01/20] virtio: mmio-v1: Validate queue PFN

virtio-mmio with virtio-v1 uses a 32bit PFN for the queue.
If the queue pfn is too large to fit in 32bits, which
we could hit on arm64 systems with 52bit physical addresses
(even with a 64K page size), we simply end up without a proper
link to the other side of the queue.

Add a check to validate the PFN, rather than silently breaking
the devices.
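
A quick worked example (assuming 64K pages, i.e., PAGE_SHIFT == 16):

	u64 addr  = 1ULL << 51;	/* a ring placed in a 52bit PA range */
	u64 q_pfn = addr >> 16;	/* == 1ULL << 35, needs 36 bits */

Since q_pfn >> 32 is non-zero, the 32bit QUEUE_PFN register cannot
hold it, and the device would silently be pointed at the wrong page.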

Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Cc: Peter Maydell <[email protected]>
Cc: Jean-Philippe Brucker <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
Changes since v2:
- Change errno to -E2BIG
---
drivers/virtio/virtio_mmio.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 67763d3..82cedc8 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -397,9 +397,21 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
/* Activate the queue */
writel(virtqueue_get_vring_size(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_NUM);
if (vm_dev->version == 1) {
+ u64 q_pfn = virtqueue_get_desc_addr(vq) >> PAGE_SHIFT;
+
+ /*
+ * virtio-mmio v1 uses a 32bit QUEUE PFN. If we have something
+ * that doesn't fit in 32bit, fail the setup rather than
+ * pretending to be successful.
+ */
+ if (q_pfn >> 32) {
+ dev_err(&vdev->dev, "virtio-mmio: queue address too large\n");
+ err = -E2BIG;
+ goto error_bad_pfn;
+ }
+
writel(PAGE_SIZE, vm_dev->base + VIRTIO_MMIO_QUEUE_ALIGN);
- writel(virtqueue_get_desc_addr(vq) >> PAGE_SHIFT,
- vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
+ writel(q_pfn, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
} else {
u64 addr;

@@ -430,6 +442,8 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,

return vq;

+error_bad_pfn:
+ vring_del_virtqueue(vq);
error_new_virtqueue:
if (vm_dev->version == 1) {
writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
--
2.7.4


2018-06-29 15:05:57

by Suzuki K Poulose

[permalink] [raw]
Subject: [PATCH v3 06/20] kvm: arm/arm64: Remove spurious WARN_ON

On a 4-level page table, a pgd entry can be empty, unlike in a 3-level
page table. Remove the spurious WARN_ON() in stage2_get_pud().

Cc: Marc Zyngier <[email protected]>
Acked-by: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
virt/kvm/arm/mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 061e6b3..308171c 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -976,7 +976,7 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
pud_t *pud;

pgd = kvm->arch.pgd + stage2_pgd_index(addr);
- if (WARN_ON(stage2_pgd_none(*pgd))) {
+ if (stage2_pgd_none(*pgd)) {
if (!cache)
return NULL;
pud = mmu_memory_cache_alloc(cache);
--
2.7.4


2018-06-29 15:06:26

by Suzuki K Poulose

[permalink] [raw]
Subject: [PATCH v3 14/20] kvm: arm/arm64: Expose supported physical address limit for VM

Expose the maximum physical address size supported by the host
for a VM. This could be later used by the userspace to choose the
appropriate size for a given VM. The limit is determined as the
minimum of the actual CPU limit, the kernel limit (i.e., either 48 or
52) and the stage2 page table support limit (which is 40bits at the
moment). For backward compatibility, we support a minimum of 40bits.
The limit will be lifted as we add stage2 support for the host
kernel PA limit.

This value may be different from what is exposed to the VM via
CPU ID registers. The limit only applies to the stage2 page table.
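
For example, userspace could probe the limit as below (a sketch;
assumes sys_fd is an open /dev/kvm file descriptor, and that older
kernels fail the unknown ioctl with -EINVAL):

	int max_ipa_shift = ioctl(sys_fd, KVM_ARM_GET_MAX_VM_PHYS_SHIFT);

	if (max_ipa_shift < 0)
		max_ipa_shift = 40;	/* older kernel: default IPA size */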

Cc: Christoffer Dall <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Peter Maydell <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
Changes since V2:
- Bump the ioctl number
---
Documentation/virtual/kvm/api.txt | 15 +++++++++++++++
arch/arm/include/asm/kvm_mmu.h | 5 +++++
arch/arm64/include/asm/kvm_mmu.h | 5 +++++
include/uapi/linux/kvm.h | 6 ++++++
virt/kvm/arm/arm.c | 6 ++++++
5 files changed, 37 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index d10944e..662374b 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3561,6 +3561,21 @@ Returns: 0 on success,
-ENOENT on deassign if the conn_id isn't registered
-EEXIST on assign if the conn_id is already registered

+4.113 KVM_ARM_GET_MAX_VM_PHYS_SHIFT
+Capability: basic
+Architectures: arm, arm64
+Type: system ioctl
+Parameters: none
+Returns: log2(Maximum Guest physical address space size) supported by the
+hypervisor.
+
+This ioctl can be used to identify the maximum guest physical address
+space size supported by the hypervisor. The returned value indicates the
+maximum size of the address that can be resolved by the stage2
+translation table on arm/arm64. On arm64, the value is decided based
+on the host kernel configuration and the system wide safe value of
+ID_AA64MMFR0_EL1:PARange. This may not match the value exposed to the
+VM in CPU ID registers.

5. The kvm_run structure
------------------------
diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index b2da5a4..d86f8dd 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -380,6 +380,11 @@ static inline void *stage2_alloc_pgd(struct kvm *kvm)

#define kvm_phys_to_vttbr(addr) (addr)

+static inline u32 kvm_get_ipa_limit(void)
+{
+ return KVM_PHYS_SHIFT;
+}
+
#endif /* !__ASSEMBLY__ */

#endif /* __ARM_KVM_MMU_H__ */
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 813a72a..b4564d8 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -532,5 +532,10 @@ static inline void *stage2_alloc_pgd(struct kvm *kvm)
GFP_KERNEL | __GFP_ZERO);
}

+static inline u32 kvm_get_ipa_limit(void)
+{
+ return KVM_PHYS_SHIFT;
+}
+
#endif /* __ASSEMBLY__ */
#endif /* __ARM64_KVM_MMU_H__ */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b6270a3..4df9bb6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -775,6 +775,12 @@ struct kvm_ppc_resize_hpt {
#define KVM_GET_MSR_FEATURE_INDEX_LIST _IOWR(KVMIO, 0x0a, struct kvm_msr_list)

/*
+ * Get the maximum physical address size supported by the host.
+ * Returns log2(Max-Physical-Address-Size)
+ */
+#define KVM_ARM_GET_MAX_VM_PHYS_SHIFT _IO(KVMIO, 0x0b)
+
+/*
* Extension capability list.
*/
#define KVM_CAP_IRQCHIP 0
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index d2637bb..0d99e67 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -66,6 +66,7 @@ static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1);
static u32 kvm_next_vmid;
static unsigned int kvm_vmid_bits __read_mostly;
static DEFINE_RWLOCK(kvm_vmid_lock);
+static u32 kvm_ipa_limit;

static bool vgic_present;

@@ -248,6 +249,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
long kvm_arch_dev_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
+ if (ioctl == KVM_ARM_GET_MAX_VM_PHYS_SHIFT)
+ return kvm_ipa_limit;
+
return -EINVAL;
}

@@ -1361,6 +1365,8 @@ static int init_common_resources(void)
kvm_vmid_bits = kvm_get_vmid_bits();
kvm_info("%d-bit VMID\n", kvm_vmid_bits);

+ kvm_ipa_limit = kvm_get_ipa_limit();
+
return 0;
}

--
2.7.4


2018-06-29 15:06:28

by Suzuki K Poulose

[permalink] [raw]
Subject: [PATCH v3 18/20] kvm: arm64: Add support for handling 52bit IPA

Add support for handling the 52bit IPA. 52bit IPA
support needs changes to the following:

1) Page-table entries - We use kernel page table helpers for setting
up the stage2. Hence we don't need explicit changes here.

2) VTTBR:BADDR - This is already supported with :
commit 529c4b05a3cb2f324aa ("arm64: handle 52-bit addresses in TTBR")

3) VGIC support for 52bit: Supported with a patch in this series.

That leaves us with the handling for PAR and HPFAR. This patch adds
support for handling the 52bit addresses in PAR and HPFAR,
which are used while handling the permission faults in stage1.
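
A worked example of the conversion (assuming PHYS_MASK_SHIFT == 52,
with the PAR_TO_HPFAR() helper added below):

	u64 par   = 0x000ffffffffff000ULL;	/* PA[51:12] all ones */
	u64 hpfar = PAR_TO_HPFAR(par);		/* == 0x00000ffffffffff0 */

i.e., PA bits [51:12] of PAR land in HPFAR bits [43:4], a plain right
shift by 8.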

Cc: Marc Zyngier <[email protected]>
Cc: Kristina Martsenko <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm64/include/asm/kvm_arm.h | 7 +++++++
arch/arm64/kvm/hyp/switch.c | 2 +-
2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 2e90942..cb6a2ee 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -301,6 +301,13 @@

/* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
#define HPFAR_MASK (~UL(0xf))
+/*
+ * We have
+ * PAR [PA_Shift - 1 : 12] = PA [PA_Shift - 1 : 12]
+ * HPFAR [PA_Shift - 9 : 4] = FIPA [PA_Shift - 1 : 12]
+ */
+#define PAR_TO_HPFAR(par) \
+ (((par) & GENMASK_ULL(PHYS_MASK_SHIFT - 1, 12)) >> 8)

#define kvm_arm_exception_type \
{0, "IRQ" }, \
diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
index 355fb25..fb66320 100644
--- a/arch/arm64/kvm/hyp/switch.c
+++ b/arch/arm64/kvm/hyp/switch.c
@@ -260,7 +260,7 @@ static bool __hyp_text __translate_far_to_hpfar(u64 far, u64 *hpfar)
return false; /* Translation failed, back to guest */

/* Convert PAR to HPFAR format */
- *hpfar = ((tmp >> 12) & ((1UL << 36) - 1)) << 4;
+ *hpfar = PAR_TO_HPFAR(tmp);
return true;
}

--
2.7.4


2018-06-29 15:06:37

by Suzuki K Poulose

[permalink] [raw]
Subject: [PATCH v3 20/20] kvm: arm64: Fall back to normal stage2 entry level

We use concatenated entry level page tables (up to 16 tables) for
stage2. If we don't have sufficient contiguous pages (e.g., 16 * 64K),
fall back to the normal page table format by going one level
deeper, if permitted.
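
As a worked example (assuming 64K pages and a 46bit IPA, with the
helpers defined earlier in the series):

	stage2_pt_levels(46)		/* == 2, via 16 concatenated tables */
	__s2_pgd_size(46, 2)		/* == 16 * 64K == 1M, contiguous */
	ARM64_HW_PGTABLE_LEVELS(46)	/* == 3, single 64K entry table */

If the contiguous 1M allocation fails, we retry with the 3 level
layout, whose entry level table fits in a single 64K page.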

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
New in v3
---
arch/arm64/include/asm/kvm_arm.h | 7 +++++++
arch/arm64/include/asm/kvm_mmu.h | 18 +----------------
arch/arm64/kvm/guest.c | 42 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 50 insertions(+), 17 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index cb6a2ee..42eb528 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -137,6 +137,8 @@
*
* VTCR_EL2.SL0 and T0SZ are configured per VM at runtime before switching to
* the VM.
+ *
+ * With 16k/64k, the maximum number of levels supported at Stage2 is 3.
*/

#define VTCR_EL2_COMMON_BITS (VTCR_EL2_SH0_INNER | VTCR_EL2_ORGN0_WBWA | \
@@ -150,6 +152,7 @@
*/
#define VTCR_EL2_TGRAN VTCR_EL2_TG0_64K
#define VTCR_EL2_TGRAN_SL0_BASE 3UL
+#define ARM64_TGRAN_STAGE2_MAX_LEVELS 3

#elif defined(CONFIG_ARM64_16K_PAGES)
/*
@@ -158,6 +161,8 @@
*/
#define VTCR_EL2_TGRAN VTCR_EL2_TG0_16K
#define VTCR_EL2_TGRAN_SL0_BASE 3UL
+#define ARM64_TGRAN_STAGE2_MAX_LEVELS 3
+
#else /* 4K */
/*
* Stage2 translation configuration:
@@ -165,6 +170,8 @@
*/
#define VTCR_EL2_TGRAN VTCR_EL2_TG0_4K
#define VTCR_EL2_TGRAN_SL0_BASE 2UL
+#define ARM64_TGRAN_STAGE2_MAX_LEVELS 4
+
#endif

#define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN)
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index d38f395..50f632e 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -527,23 +527,7 @@ static inline u64 kvm_vttbr_baddr_mask(struct kvm *kvm)
return vttbr_baddr_mask(kvm_phys_shift(kvm), kvm_stage2_levels(kvm));
}

-static inline void *stage2_alloc_pgd(struct kvm *kvm)
-{
- u32 ipa, lvls;
-
- /*
- * Stage2 page table can support concatenation of (up to 16) tables
- * at the entry level, thereby reducing the number of levels.
- */
- ipa = kvm_phys_shift(kvm);
- lvls = stage2_pt_levels(ipa);
-
- kvm->arch.s2_levels = lvls;
- kvm->arch.vtcr_private = VTCR_EL2_SL0(lvls) | TCR_T0SZ(ipa);
-
- return alloc_pages_exact(stage2_pgd_size(kvm),
- GFP_KERNEL | __GFP_ZERO);
-}
+extern void *stage2_alloc_pgd(struct kvm *kvm);

static inline u32 kvm_get_ipa_limit(void)
{
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 56a0260..5a3a687 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -31,6 +31,8 @@
#include <asm/kvm.h>
#include <asm/kvm_emulate.h>
#include <asm/kvm_coproc.h>
+#include <asm/kvm_mmu.h>
+#include <asm/pgtable-hwdef.h>

#include "trace.h"

@@ -458,3 +460,43 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,

return ret;
}
+
+void *stage2_alloc_pgd(struct kvm *kvm)
+{
+ u32 ipa, s2_lvls, lvls;
+ u64 pgd_size;
+ void *pgd;
+
+ /*
+ * Stage2 page table can support concatenation of (up to 16) tables
+ * at the entry level, thereby reducing the number of levels. We try
+ * to use concatenation wherever possible. If that fails, fall back
+ * to the normal number of levels where possible.
+ */
+ ipa = kvm_phys_shift(kvm);
+ lvls = s2_lvls = stage2_pt_levels(ipa);
+
+retry:
+ pgd_size = __s2_pgd_size(ipa, lvls);
+ pgd = alloc_pages_exact(pgd_size, GFP_KERNEL | __GFP_ZERO);
+
+ /* Check if the PGD meets the alignment requirements */
+ if (pgd && (virt_to_phys(pgd) & ~vttbr_baddr_mask(ipa, lvls))) {
+ free_pages_exact(pgd, pgd_size);
+ pgd = NULL;
+ }
+
+ if (pgd) {
+ kvm->arch.s2_levels = lvls;
+ kvm->arch.vtcr_private = VTCR_EL2_SL0(lvls) | TCR_T0SZ(ipa);
+ } else {
+ /* Check if we can use an entry level without concatenation */
+ lvls = ARM64_HW_PGTABLE_LEVELS(ipa);
+ if ((lvls > s2_lvls) &&
+ (lvls <= CONFIG_PGTABLE_LEVELS) &&
+ (lvls <= ARM64_TGRAN_STAGE2_MAX_LEVELS))
+ goto retry;
+ }
+
+ return pgd;
+}
--
2.7.4


2018-06-29 15:09:12

by Suzuki K Poulose

[permalink] [raw]
Subject: [PATCH v3 19/20] kvm: arm64: Allow IPA size supported by the system

So far we have restricted the IPA size of the VM to the default
value (40bits). Now that we can manage the IPA size per VM and
support dynamic stage2 page tables, allow VMs to have a larger IPA.
This is done by setting the IPA limit to the maximum supported by
the hardware and the kernel. This patch also moves the check for
the default IPA size support to kvm_get_ipa_limit().

Since the stage2 page table code is dependent on the stage1
page table, we always ensure that :

Number of Levels at Stage1 >= Number of Levels at Stage2

So we limit the IPA to make sure that the above condition
is satisfied. This will affect the following combinations
of VA_BITS and IPA for different page sizes.

39bit VA, 4K - IPA > 43 (up to 48)
36bit VA, 16K - IPA > 40 (up to 48)
42bit VA, 64K - IPA > 46 (up to 52)

Supporting the above combinations needs independent stage2 page
table manipulation code, which would require substantial changes.
We could pursue that solution independently and switch the page
table code once we have it ready.
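
As a worked example of the clamping (assuming a 39bit VA host with
4K pages, where PGDIR_SHIFT == 30):

	va_max = PGDIR_SHIFT + PAGE_SHIFT - 3;	/* 30 + 12 - 3 == 39 */
	va_max += 4;		/* 16 concatenated entry tables: 43 */

so even a CPU with a 48bit PARange is clamped to a 43bit IPA limit on
such a host, matching the first entry in the list above.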

Cc: Catalin Marinas <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
Changes since V2:
- Restrict the IPA size to limit the number of page table
levels in stage2 to that of stage1 or less.
---
arch/arm64/include/asm/kvm_host.h | 6 ------
arch/arm64/include/asm/kvm_mmu.h | 37 ++++++++++++++++++++++++++++++++++++-
2 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 9a15860..e858e49 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -452,13 +452,7 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,

static inline void __cpu_init_stage2(void)
{
- u32 ps;
-
kvm_call_hyp(__init_stage2_translation);
- /* Sanity check for minimum IPA size support */
- ps = id_aa64mmfr0_parange_to_phys_shift(read_sysreg(id_aa64mmfr0_el1) & 0x7);
- WARN_ONCE(ps < 40,
- "PARange is %d bits, unsupported configuration!", ps);
}

/* Guest/host FPSIMD coordination helpers */
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index a291cdc..d38f395 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -547,7 +547,42 @@ static inline void *stage2_alloc_pgd(struct kvm *kvm)

static inline u32 kvm_get_ipa_limit(void)
{
- return KVM_PHYS_SHIFT;
+ unsigned int ipa_max, va_max, parange;
+
+ parange = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1) & 0x7;
+ ipa_max = id_aa64mmfr0_parange_to_phys_shift(parange);
+
+ /* Raise the limit to the default size for backward compatibility */
+ if (ipa_max < KVM_PHYS_SHIFT) {
+ WARN_ONCE(1,
+ "PARange is %d bits, unsupported configuration!",
+ ipa_max);
+ ipa_max = KVM_PHYS_SHIFT;
+ }
+
+ /* Clamp it to the PA size supported by the kernel */
+ ipa_max = (ipa_max > PHYS_MASK_SHIFT) ? PHYS_MASK_SHIFT : ipa_max;
+ /*
+ * Since our stage2 table is dependent on the stage1 page table code,
+ * we must always honor the following condition:
+ *
+ * Number of levels in Stage1 >= Number of levels in Stage2.
+ *
+ * So clamp the ipa limit further down to limit the number of levels.
+ * Since we can concatenate up to 16 tables at the entry level, we could
+ * go up to 4bits above the maximum VA addressable with the current
+ * number of levels.
+ */
+ va_max = PGDIR_SHIFT + PAGE_SHIFT - 3;
+ va_max += 4;
+
+ if (va_max < ipa_max) {
+ kvm_info("Limiting the IPA size to %dbits due to host VA bits limitation\n",
+ va_max);
+ ipa_max = va_max;
+ }
+
+ return ipa_max;
}

static inline void kvm_config_stage2(struct kvm *kvm, u32 ipa_shift)
--
2.7.4


2018-06-29 15:09:34

by Suzuki K Poulose

[permalink] [raw]
Subject: [PATCH v3 11/20] kvm: arm64: Helper for computing VTCR_EL2.SL0

VTCR_EL2 holds the following key stage2 translation table
parameters:
SL0 - Entry level in the page table lookup.
T0SZ - Denotes the size of the memory addressed by the table.

We have been using fixed values for the SL0 depending on the
page size as we have a fixed IPA size. But since we are about
to make it dynamic, we need to calculate the SL0 at runtime
per VM. This patch adds a helper to compute the value of SL0 for
a given IPA.
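
For instance, with 4K pages (where SL0_BASE == 2, per the table added
in the patch below):

	VTCR_EL2_SL0(4)	/* 4 levels, entry level 0: (2 - (4 - 4)) == 2 */
	VTCR_EL2_SL0(3)	/* 3 levels, entry level 1: (2 - (4 - 3)) == 1 */

both shifted into place by VTCR_EL2_SL0_SHIFT.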

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
Changes since v2:
- Part 2 of split from VTCR & VTTBR dynamic configuration
---
arch/arm64/include/asm/kvm_arm.h | 35 ++++++++++++++++++++++++++++++++---
1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index c557f45..11a7db0 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -153,7 +153,8 @@
* 2 level page tables (SL = 1)
*/
#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
-#define VTTBR_X_TGRAN_MAGIC 38
+#define VTCR_EL2_TGRAN_SL0_BASE 3UL
+
#elif defined(CONFIG_ARM64_16K_PAGES)
/*
* Stage2 translation configuration:
@@ -161,7 +162,7 @@
* 2 level page tables (SL = 1)
*/
#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_16K | VTCR_EL2_SL0_LVL1)
-#define VTTBR_X_TGRAN_MAGIC 42
+#define VTCR_EL2_TGRAN_SL0_BASE 3UL
#else /* 4K */
/*
* Stage2 translation configuration:
@@ -169,11 +170,39 @@
* 3 level page tables (SL = 1)
*/
#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_4K | VTCR_EL2_SL0_LVL1)
-#define VTTBR_X_TGRAN_MAGIC 37
+#define VTCR_EL2_TGRAN_SL0_BASE 2UL
#endif

#define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN_FLAGS)
/*
+ * VTCR_EL2:SL0 indicates the entry level for Stage2 translation.
+ * Interestingly, it depends on the page size.
+ * See D.10.2.110, VTCR_EL2, in ARM DDI 0487B.b
+ *
+ * -----------------------------------------
+ * | Entry level | 4K | 16K/64K |
+ * ------------------------------------------
+ * | Level: 0 | 2 | - |
+ * ------------------------------------------
+ * | Level: 1 | 1 | 2 |
+ * ------------------------------------------
+ * | Level: 2 | 0 | 1 |
+ * ------------------------------------------
+ * | Level: 3 | - | 0 |
+ * ------------------------------------------
+ *
+ * That table roughly translates to:
+ *
+ * SL0(PAGE_SIZE, Entry_level) = SL0_BASE(PAGE_SIZE) - Entry_Level
+ *
+ * where SL0_BASE(4K) = 2, SL0_BASE(16K) = 3 and SL0_BASE(64K) = 3,
+ * provided we take care of ruling out the unsupported cases and
+ * Entry_Level = 4 - Number_of_levels.
+ *
+ */
+#define VTCR_EL2_SL0(levels) \
+ ((VTCR_EL2_TGRAN_SL0_BASE - (4 - (levels))) << VTCR_EL2_SL0_SHIFT)
+/*
* ARM VMSAv8-64 defines an algorithm for finding the translation table
* descriptors in section D4.2.8 in ARM DDI 0487B.b.
*
--
2.7.4


2018-06-29 15:10:15

by Suzuki K Poulose

[permalink] [raw]
Subject: [PATCH v3 07/20] kvm: arm/arm64: Prepare for VM specific stage2 translations

Right now the stage2 page table for a VM is hard coded, assuming
an IPA of 40bits. As we are about to add support for per VM IPA,
prepare the stage2 page table helpers to accept the kvm instance
to make the right decision for the VM. No functional changes.
Adds stage2_pgd_size(kvm) to replace S2_PGD_SIZE, and moves
some of the definitions dependent on the kvm instance to asm/kvm_mmu.h
for arm32. In the process, drop the _AC() specifiers from the constants.

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
Changes since V2:
- Update commit description about the movement to asm/kvm_mmu.h
for arm32
- Drop _AC() specifiers
---
arch/arm/include/asm/kvm_arm.h | 3 +-
arch/arm/include/asm/kvm_mmu.h | 15 +++-
arch/arm/include/asm/stage2_pgtable.h | 42 ++++-----
arch/arm64/include/asm/kvm_mmu.h | 7 +-
arch/arm64/include/asm/stage2_pgtable-nopmd.h | 18 ++--
arch/arm64/include/asm/stage2_pgtable-nopud.h | 16 ++--
arch/arm64/include/asm/stage2_pgtable.h | 49 ++++++-----
virt/kvm/arm/arm.c | 2 +-
virt/kvm/arm/mmu.c | 119 +++++++++++++-------------
virt/kvm/arm/vgic/vgic-kvm-device.c | 2 +-
10 files changed, 148 insertions(+), 125 deletions(-)

diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
index 3ab8b37..c3f1f9b 100644
--- a/arch/arm/include/asm/kvm_arm.h
+++ b/arch/arm/include/asm/kvm_arm.h
@@ -133,8 +133,7 @@
* space.
*/
#define KVM_PHYS_SHIFT (40)
-#define KVM_PHYS_SIZE (_AC(1, ULL) << KVM_PHYS_SHIFT)
-#define KVM_PHYS_MASK (KVM_PHYS_SIZE - _AC(1, ULL))
+
#define PTRS_PER_S2_PGD (_AC(1, ULL) << (KVM_PHYS_SHIFT - 30))

/* Virtualization Translation Control Register (VTCR) bits */
diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 8553d68..f36eb20 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -36,15 +36,19 @@
})

/*
- * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation levels.
+ * kvm_mmu_cache_min_pages() is the number of stage2 page
+ * table translation levels, excluding the top level, for
+ * the given VM. Since we have a 3 level page-table, this
+ * is fixed.
*/
-#define KVM_MMU_CACHE_MIN_PAGES 2
+#define kvm_mmu_cache_min_pages(kvm) 2

#ifndef __ASSEMBLY__

#include <linux/highmem.h>
#include <asm/cacheflush.h>
#include <asm/cputype.h>
+#include <asm/kvm_arm.h>
#include <asm/kvm_hyp.h>
#include <asm/pgalloc.h>
#include <asm/stage2_pgtable.h>
@@ -52,6 +56,13 @@
/* Ensure compatibility with arm64 */
#define VA_BITS 32

+#define kvm_phys_shift(kvm) KVM_PHYS_SHIFT
+#define kvm_phys_size(kvm) (1ULL << kvm_phys_shift(kvm))
+#define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - 1ULL)
+#define kvm_vttbr_baddr_mask(kvm) VTTBR_BADDR_MASK
+
+#define stage2_pgd_size(kvm) (PTRS_PER_S2_PGD * sizeof(pgd_t))
+
int create_hyp_mappings(void *from, void *to, pgprot_t prot);
int create_hyp_io_mappings(phys_addr_t phys_addr, size_t size,
void __iomem **kaddr,
diff --git a/arch/arm/include/asm/stage2_pgtable.h b/arch/arm/include/asm/stage2_pgtable.h
index 460d616..e22ae94 100644
--- a/arch/arm/include/asm/stage2_pgtable.h
+++ b/arch/arm/include/asm/stage2_pgtable.h
@@ -19,43 +19,45 @@
#ifndef __ARM_S2_PGTABLE_H_
#define __ARM_S2_PGTABLE_H_

-#define stage2_pgd_none(pgd) pgd_none(pgd)
-#define stage2_pgd_clear(pgd) pgd_clear(pgd)
-#define stage2_pgd_present(pgd) pgd_present(pgd)
-#define stage2_pgd_populate(pgd, pud) pgd_populate(NULL, pgd, pud)
-#define stage2_pud_offset(pgd, address) pud_offset(pgd, address)
-#define stage2_pud_free(pud) pud_free(NULL, pud)
+#define stage2_pgd_none(kvm, pgd) pgd_none(pgd)
+#define stage2_pgd_clear(kvm, pgd) pgd_clear(pgd)
+#define stage2_pgd_present(kvm, pgd) pgd_present(pgd)
+#define stage2_pgd_populate(kvm, pgd, pud) pgd_populate(NULL, pgd, pud)
+#define stage2_pud_offset(kvm, pgd, address) pud_offset(pgd, address)
+#define stage2_pud_free(kvm, pud) pud_free(NULL, pud)

-#define stage2_pud_none(pud) pud_none(pud)
-#define stage2_pud_clear(pud) pud_clear(pud)
-#define stage2_pud_present(pud) pud_present(pud)
-#define stage2_pud_populate(pud, pmd) pud_populate(NULL, pud, pmd)
-#define stage2_pmd_offset(pud, address) pmd_offset(pud, address)
-#define stage2_pmd_free(pmd) pmd_free(NULL, pmd)
+#define stage2_pud_none(kvm, pud) pud_none(pud)
+#define stage2_pud_clear(kvm, pud) pud_clear(pud)
+#define stage2_pud_present(kvm, pud) pud_present(pud)
+#define stage2_pud_populate(kvm, pud, pmd) pud_populate(NULL, pud, pmd)
+#define stage2_pmd_offset(kvm, pud, address) pmd_offset(pud, address)
+#define stage2_pmd_free(kvm, pmd) pmd_free(NULL, pmd)

-#define stage2_pud_huge(pud) pud_huge(pud)
+#define stage2_pud_huge(kvm, pud) pud_huge(pud)

/* Open coded p*d_addr_end that can deal with 64bit addresses */
-static inline phys_addr_t stage2_pgd_addr_end(phys_addr_t addr, phys_addr_t end)
+static inline phys_addr_t
+stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
{
phys_addr_t boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;

return (boundary - 1 < end - 1) ? boundary : end;
}

-#define stage2_pud_addr_end(addr, end) (end)
+#define stage2_pud_addr_end(kvm, addr, end) (end)

-static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
+static inline phys_addr_t
+stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
{
phys_addr_t boundary = (addr + PMD_SIZE) & PMD_MASK;

return (boundary - 1 < end - 1) ? boundary : end;
}

-#define stage2_pgd_index(addr) pgd_index(addr)
+#define stage2_pgd_index(kvm, addr) pgd_index(addr)

-#define stage2_pte_table_empty(ptep) kvm_page_empty(ptep)
-#define stage2_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
-#define stage2_pud_table_empty(pudp) false
+#define stage2_pte_table_empty(kvm, ptep) kvm_page_empty(ptep)
+#define stage2_pmd_table_empty(kvm, pmdp) kvm_page_empty(pmdp)
+#define stage2_pud_table_empty(kvm, pudp) false

#endif /* __ARM_S2_PGTABLE_H_ */
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index fb9a712..5da8f52 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -141,8 +141,11 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
* We currently only support a 40bit IPA.
*/
#define KVM_PHYS_SHIFT (40)
-#define KVM_PHYS_SIZE (1UL << KVM_PHYS_SHIFT)
-#define KVM_PHYS_MASK (KVM_PHYS_SIZE - 1UL)
+
+#define kvm_phys_shift(kvm) KVM_PHYS_SHIFT
+#define kvm_phys_size(kvm) (_AC(1, ULL) << kvm_phys_shift(kvm))
+#define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - _AC(1, ULL))
+#define kvm_vttbr_baddr_mask(kvm) VTTBR_BADDR_MASK

#include <asm/stage2_pgtable.h>

diff --git a/arch/arm64/include/asm/stage2_pgtable-nopmd.h b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
index 2656a0f..0280ded 100644
--- a/arch/arm64/include/asm/stage2_pgtable-nopmd.h
+++ b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
@@ -26,17 +26,17 @@
#define S2_PMD_SIZE (1UL << S2_PMD_SHIFT)
#define S2_PMD_MASK (~(S2_PMD_SIZE-1))

-#define stage2_pud_none(pud) (0)
-#define stage2_pud_present(pud) (1)
-#define stage2_pud_clear(pud) do { } while (0)
-#define stage2_pud_populate(pud, pmd) do { } while (0)
-#define stage2_pmd_offset(pud, address) ((pmd_t *)(pud))
+#define stage2_pud_none(kvm, pud) (0)
+#define stage2_pud_present(kvm, pud) (1)
+#define stage2_pud_clear(kvm, pud) do { } while (0)
+#define stage2_pud_populate(kvm, pud, pmd) do { } while (0)
+#define stage2_pmd_offset(kvm, pud, address) ((pmd_t *)(pud))

-#define stage2_pmd_free(pmd) do { } while (0)
+#define stage2_pmd_free(kvm, pmd) do { } while (0)

-#define stage2_pmd_addr_end(addr, end) (end)
+#define stage2_pmd_addr_end(kvm, addr, end) (end)

-#define stage2_pud_huge(pud) (0)
-#define stage2_pmd_table_empty(pmdp) (0)
+#define stage2_pud_huge(kvm, pud) (0)
+#define stage2_pmd_table_empty(kvm, pmdp) (0)

#endif
diff --git a/arch/arm64/include/asm/stage2_pgtable-nopud.h b/arch/arm64/include/asm/stage2_pgtable-nopud.h
index 5ee87b5..cd6304e 100644
--- a/arch/arm64/include/asm/stage2_pgtable-nopud.h
+++ b/arch/arm64/include/asm/stage2_pgtable-nopud.h
@@ -24,16 +24,16 @@
#define S2_PUD_SIZE (_AC(1, UL) << S2_PUD_SHIFT)
#define S2_PUD_MASK (~(S2_PUD_SIZE-1))

-#define stage2_pgd_none(pgd) (0)
-#define stage2_pgd_present(pgd) (1)
-#define stage2_pgd_clear(pgd) do { } while (0)
-#define stage2_pgd_populate(pgd, pud) do { } while (0)
+#define stage2_pgd_none(kvm, pgd) (0)
+#define stage2_pgd_present(kvm, pgd) (1)
+#define stage2_pgd_clear(kvm, pgd) do { } while (0)
+#define stage2_pgd_populate(kvm, pgd, pud) do { } while (0)

-#define stage2_pud_offset(pgd, address) ((pud_t *)(pgd))
+#define stage2_pud_offset(kvm, pgd, address) ((pud_t *)(pgd))

-#define stage2_pud_free(x) do { } while (0)
+#define stage2_pud_free(kvm, x) do { } while (0)

-#define stage2_pud_addr_end(addr, end) (end)
-#define stage2_pud_table_empty(pmdp) (0)
+#define stage2_pud_addr_end(kvm, addr, end) (end)
+#define stage2_pud_table_empty(kvm, pmdp) (0)

#endif
diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
index 8b68099..057a405 100644
--- a/arch/arm64/include/asm/stage2_pgtable.h
+++ b/arch/arm64/include/asm/stage2_pgtable.h
@@ -65,10 +65,10 @@
#define PTRS_PER_S2_PGD (1 << (KVM_PHYS_SHIFT - S2_PGDIR_SHIFT))

/*
- * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation
+ * kvm_mmu_cache_min_pages() is the number of stage2 page table translation
* levels in addition to the PGD.
*/
-#define KVM_MMU_CACHE_MIN_PAGES (STAGE2_PGTABLE_LEVELS - 1)
+#define kvm_mmu_cache_min_pages(kvm) (STAGE2_PGTABLE_LEVELS - 1)


#if STAGE2_PGTABLE_LEVELS > 3
@@ -77,16 +77,17 @@
#define S2_PUD_SIZE (_AC(1, UL) << S2_PUD_SHIFT)
#define S2_PUD_MASK (~(S2_PUD_SIZE - 1))

-#define stage2_pgd_none(pgd) pgd_none(pgd)
-#define stage2_pgd_clear(pgd) pgd_clear(pgd)
-#define stage2_pgd_present(pgd) pgd_present(pgd)
-#define stage2_pgd_populate(pgd, pud) pgd_populate(NULL, pgd, pud)
-#define stage2_pud_offset(pgd, address) pud_offset(pgd, address)
-#define stage2_pud_free(pud) pud_free(NULL, pud)
+#define stage2_pgd_none(kvm, pgd) pgd_none(pgd)
+#define stage2_pgd_clear(kvm, pgd) pgd_clear(pgd)
+#define stage2_pgd_present(kvm, pgd) pgd_present(pgd)
+#define stage2_pgd_populate(kvm, pgd, pud) pgd_populate(NULL, pgd, pud)
+#define stage2_pud_offset(kvm, pgd, address) pud_offset(pgd, address)
+#define stage2_pud_free(kvm, pud) pud_free(NULL, pud)

-#define stage2_pud_table_empty(pudp) kvm_page_empty(pudp)
+#define stage2_pud_table_empty(kvm, pudp) kvm_page_empty(pudp)

-static inline phys_addr_t stage2_pud_addr_end(phys_addr_t addr, phys_addr_t end)
+static inline phys_addr_t
+stage2_pud_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
{
phys_addr_t boundary = (addr + S2_PUD_SIZE) & S2_PUD_MASK;

@@ -102,17 +103,18 @@ static inline phys_addr_t stage2_pud_addr_end(phys_addr_t addr, phys_addr_t end)
#define S2_PMD_SIZE (_AC(1, UL) << S2_PMD_SHIFT)
#define S2_PMD_MASK (~(S2_PMD_SIZE - 1))

-#define stage2_pud_none(pud) pud_none(pud)
-#define stage2_pud_clear(pud) pud_clear(pud)
-#define stage2_pud_present(pud) pud_present(pud)
-#define stage2_pud_populate(pud, pmd) pud_populate(NULL, pud, pmd)
-#define stage2_pmd_offset(pud, address) pmd_offset(pud, address)
-#define stage2_pmd_free(pmd) pmd_free(NULL, pmd)
+#define stage2_pud_none(kvm, pud) pud_none(pud)
+#define stage2_pud_clear(kvm, pud) pud_clear(pud)
+#define stage2_pud_present(kvm, pud) pud_present(pud)
+#define stage2_pud_populate(kvm, pud, pmd) pud_populate(NULL, pud, pmd)
+#define stage2_pmd_offset(kvm, pud, address) pmd_offset(pud, address)
+#define stage2_pmd_free(kvm, pmd) pmd_free(NULL, pmd)

-#define stage2_pud_huge(pud) pud_huge(pud)
-#define stage2_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
+#define stage2_pud_huge(kvm, pud) pud_huge(pud)
+#define stage2_pmd_table_empty(kvm, pmdp) kvm_page_empty(pmdp)

-static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
+static inline phys_addr_t
+stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
{
phys_addr_t boundary = (addr + S2_PMD_SIZE) & S2_PMD_MASK;

@@ -121,7 +123,7 @@ static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)

#endif /* STAGE2_PGTABLE_LEVELS > 2 */

-#define stage2_pte_table_empty(ptep) kvm_page_empty(ptep)
+#define stage2_pte_table_empty(kvm, ptep) kvm_page_empty(ptep)

#if STAGE2_PGTABLE_LEVELS == 2
#include <asm/stage2_pgtable-nopmd.h>
@@ -129,10 +131,13 @@ static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
#include <asm/stage2_pgtable-nopud.h>
#endif

+#define stage2_pgd_size(kvm) (PTRS_PER_S2_PGD * sizeof(pgd_t))

-#define stage2_pgd_index(addr) (((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
+#define stage2_pgd_index(kvm, addr) \
+ (((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))

-static inline phys_addr_t stage2_pgd_addr_end(phys_addr_t addr, phys_addr_t end)
+static inline phys_addr_t
+stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
{
phys_addr_t boundary = (addr + S2_PGDIR_SIZE) & S2_PGDIR_MASK;

diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 04e554c..d2637bb 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -538,7 +538,7 @@ static void update_vttbr(struct kvm *kvm)

/* update vttbr to be used with the new vmid */
pgd_phys = virt_to_phys(kvm->arch.pgd);
- BUG_ON(pgd_phys & ~VTTBR_BADDR_MASK);
+ BUG_ON(pgd_phys & ~kvm_vttbr_baddr_mask(kvm));
vmid = ((u64)(kvm->arch.vmid) << VTTBR_VMID_SHIFT) & VTTBR_VMID_MASK(kvm_vmid_bits);
kvm->arch.vttbr = kvm_phys_to_vttbr(pgd_phys) | vmid;

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 308171c..82dd571 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -45,7 +45,6 @@ static phys_addr_t hyp_idmap_vector;

static unsigned long io_map_base;

-#define S2_PGD_SIZE (PTRS_PER_S2_PGD * sizeof(pgd_t))
#define hyp_pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t))

#define KVM_S2PTE_FLAG_IS_IOMAP (1UL << 0)
@@ -150,20 +149,20 @@ static void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)

static void clear_stage2_pgd_entry(struct kvm *kvm, pgd_t *pgd, phys_addr_t addr)
{
- pud_t *pud_table __maybe_unused = stage2_pud_offset(pgd, 0UL);
- stage2_pgd_clear(pgd);
+ pud_t *pud_table __maybe_unused = stage2_pud_offset(kvm, pgd, 0UL);
+ stage2_pgd_clear(kvm, pgd);
kvm_tlb_flush_vmid_ipa(kvm, addr);
- stage2_pud_free(pud_table);
+ stage2_pud_free(kvm, pud_table);
put_page(virt_to_page(pgd));
}

static void clear_stage2_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
{
- pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(pud, 0);
- VM_BUG_ON(stage2_pud_huge(*pud));
- stage2_pud_clear(pud);
+ pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(kvm, pud, 0);
+ VM_BUG_ON(stage2_pud_huge(kvm, *pud));
+ stage2_pud_clear(kvm, pud);
kvm_tlb_flush_vmid_ipa(kvm, addr);
- stage2_pmd_free(pmd_table);
+ stage2_pmd_free(kvm, pmd_table);
put_page(virt_to_page(pud));
}

@@ -219,7 +218,7 @@ static void unmap_stage2_ptes(struct kvm *kvm, pmd_t *pmd,
}
} while (pte++, addr += PAGE_SIZE, addr != end);

- if (stage2_pte_table_empty(start_pte))
+ if (stage2_pte_table_empty(kvm, start_pte))
clear_stage2_pmd_entry(kvm, pmd, start_addr);
}

@@ -229,9 +228,9 @@ static void unmap_stage2_pmds(struct kvm *kvm, pud_t *pud,
phys_addr_t next, start_addr = addr;
pmd_t *pmd, *start_pmd;

- start_pmd = pmd = stage2_pmd_offset(pud, addr);
+ start_pmd = pmd = stage2_pmd_offset(kvm, pud, addr);
do {
- next = stage2_pmd_addr_end(addr, end);
+ next = stage2_pmd_addr_end(kvm, addr, end);
if (!pmd_none(*pmd)) {
if (pmd_thp_or_huge(*pmd)) {
pmd_t old_pmd = *pmd;
@@ -248,7 +247,7 @@ static void unmap_stage2_pmds(struct kvm *kvm, pud_t *pud,
}
} while (pmd++, addr = next, addr != end);

- if (stage2_pmd_table_empty(start_pmd))
+ if (stage2_pmd_table_empty(kvm, start_pmd))
clear_stage2_pud_entry(kvm, pud, start_addr);
}

@@ -258,14 +257,14 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd,
phys_addr_t next, start_addr = addr;
pud_t *pud, *start_pud;

- start_pud = pud = stage2_pud_offset(pgd, addr);
+ start_pud = pud = stage2_pud_offset(kvm, pgd, addr);
do {
- next = stage2_pud_addr_end(addr, end);
- if (!stage2_pud_none(*pud)) {
- if (stage2_pud_huge(*pud)) {
+ next = stage2_pud_addr_end(kvm, addr, end);
+ if (!stage2_pud_none(kvm, *pud)) {
+ if (stage2_pud_huge(kvm, *pud)) {
pud_t old_pud = *pud;

- stage2_pud_clear(pud);
+ stage2_pud_clear(kvm, pud);
kvm_tlb_flush_vmid_ipa(kvm, addr);
kvm_flush_dcache_pud(old_pud);
put_page(virt_to_page(pud));
@@ -275,7 +274,7 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd,
}
} while (pud++, addr = next, addr != end);

- if (stage2_pud_table_empty(start_pud))
+ if (stage2_pud_table_empty(kvm, start_pud))
clear_stage2_pgd_entry(kvm, pgd, start_addr);
}

@@ -299,7 +298,7 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
assert_spin_locked(&kvm->mmu_lock);
WARN_ON(size & ~PAGE_MASK);

- pgd = kvm->arch.pgd + stage2_pgd_index(addr);
+ pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
do {
/*
* Make sure the page table is still active, as another thread
@@ -308,8 +307,8 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
*/
if (!READ_ONCE(kvm->arch.pgd))
break;
- next = stage2_pgd_addr_end(addr, end);
- if (!stage2_pgd_none(*pgd))
+ next = stage2_pgd_addr_end(kvm, addr, end);
+ if (!stage2_pgd_none(kvm, *pgd))
unmap_stage2_puds(kvm, pgd, addr, next);
/*
* If the range is too large, release the kvm->mmu_lock
@@ -338,9 +337,9 @@ static void stage2_flush_pmds(struct kvm *kvm, pud_t *pud,
pmd_t *pmd;
phys_addr_t next;

- pmd = stage2_pmd_offset(pud, addr);
+ pmd = stage2_pmd_offset(kvm, pud, addr);
do {
- next = stage2_pmd_addr_end(addr, end);
+ next = stage2_pmd_addr_end(kvm, addr, end);
if (!pmd_none(*pmd)) {
if (pmd_thp_or_huge(*pmd))
kvm_flush_dcache_pmd(*pmd);
@@ -356,11 +355,11 @@ static void stage2_flush_puds(struct kvm *kvm, pgd_t *pgd,
pud_t *pud;
phys_addr_t next;

- pud = stage2_pud_offset(pgd, addr);
+ pud = stage2_pud_offset(kvm, pgd, addr);
do {
- next = stage2_pud_addr_end(addr, end);
- if (!stage2_pud_none(*pud)) {
- if (stage2_pud_huge(*pud))
+ next = stage2_pud_addr_end(kvm, addr, end);
+ if (!stage2_pud_none(kvm, *pud)) {
+ if (stage2_pud_huge(kvm, *pud))
kvm_flush_dcache_pud(*pud);
else
stage2_flush_pmds(kvm, pud, addr, next);
@@ -376,10 +375,10 @@ static void stage2_flush_memslot(struct kvm *kvm,
phys_addr_t next;
pgd_t *pgd;

- pgd = kvm->arch.pgd + stage2_pgd_index(addr);
+ pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
do {
- next = stage2_pgd_addr_end(addr, end);
- if (!stage2_pgd_none(*pgd))
+ next = stage2_pgd_addr_end(kvm, addr, end);
+ if (!stage2_pgd_none(kvm, *pgd))
stage2_flush_puds(kvm, pgd, addr, next);
} while (pgd++, addr = next, addr != end);
}
@@ -869,7 +868,7 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
}

/* Allocate the HW PGD, making sure that each page gets its own refcount */
- pgd = alloc_pages_exact(S2_PGD_SIZE, GFP_KERNEL | __GFP_ZERO);
+ pgd = alloc_pages_exact(stage2_pgd_size(kvm), GFP_KERNEL | __GFP_ZERO);
if (!pgd)
return -ENOMEM;

@@ -958,7 +957,7 @@ void kvm_free_stage2_pgd(struct kvm *kvm)

spin_lock(&kvm->mmu_lock);
if (kvm->arch.pgd) {
- unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
+ unmap_stage2_range(kvm, 0, kvm_phys_size(kvm));
pgd = READ_ONCE(kvm->arch.pgd);
kvm->arch.pgd = NULL;
}
@@ -966,7 +965,7 @@ void kvm_free_stage2_pgd(struct kvm *kvm)

/* Free the HW pgd, one page at a time */
if (pgd)
- free_pages_exact(pgd, S2_PGD_SIZE);
+ free_pages_exact(pgd, stage2_pgd_size(kvm));
}

static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
@@ -975,16 +974,16 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
pgd_t *pgd;
pud_t *pud;

- pgd = kvm->arch.pgd + stage2_pgd_index(addr);
- if (stage2_pgd_none(*pgd)) {
+ pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
+ if (stage2_pgd_none(kvm, *pgd)) {
if (!cache)
return NULL;
pud = mmu_memory_cache_alloc(cache);
- stage2_pgd_populate(pgd, pud);
+ stage2_pgd_populate(kvm, pgd, pud);
get_page(virt_to_page(pgd));
}

- return stage2_pud_offset(pgd, addr);
+ return stage2_pud_offset(kvm, pgd, addr);
}

static pmd_t *stage2_get_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
@@ -997,15 +996,15 @@ static pmd_t *stage2_get_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
if (!pud)
return NULL;

- if (stage2_pud_none(*pud)) {
+ if (stage2_pud_none(kvm, *pud)) {
if (!cache)
return NULL;
pmd = mmu_memory_cache_alloc(cache);
- stage2_pud_populate(pud, pmd);
+ stage2_pud_populate(kvm, pud, pmd);
get_page(virt_to_page(pud));
}

- return stage2_pmd_offset(pud, addr);
+ return stage2_pmd_offset(kvm, pud, addr);
}

static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
@@ -1159,8 +1158,9 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
if (writable)
pte = kvm_s2pte_mkwrite(pte);

- ret = mmu_topup_memory_cache(&cache, KVM_MMU_CACHE_MIN_PAGES,
- KVM_NR_MEM_OBJS);
+ ret = mmu_topup_memory_cache(&cache,
+ kvm_mmu_cache_min_pages(kvm),
+ KVM_NR_MEM_OBJS);
if (ret)
goto out;
spin_lock(&kvm->mmu_lock);
@@ -1248,19 +1248,21 @@ static void stage2_wp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)

/**
* stage2_wp_pmds - write protect PUD range
+ * @kvm: kvm instance for the VM
* @pud: pointer to pud entry
* @addr: range start address
* @end: range end address
*/
-static void stage2_wp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
+static void stage2_wp_pmds(struct kvm *kvm, pud_t *pud,
+ phys_addr_t addr, phys_addr_t end)
{
pmd_t *pmd;
phys_addr_t next;

- pmd = stage2_pmd_offset(pud, addr);
+ pmd = stage2_pmd_offset(kvm, pud, addr);

do {
- next = stage2_pmd_addr_end(addr, end);
+ next = stage2_pmd_addr_end(kvm, addr, end);
if (!pmd_none(*pmd)) {
if (pmd_thp_or_huge(*pmd)) {
if (!kvm_s2pmd_readonly(pmd))
@@ -1280,18 +1282,19 @@ static void stage2_wp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
*
* Process PUD entries, for a huge PUD we cause a panic.
*/
-static void stage2_wp_puds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
+static void stage2_wp_puds(struct kvm *kvm, pgd_t *pgd,
+ phys_addr_t addr, phys_addr_t end)
{
pud_t *pud;
phys_addr_t next;

- pud = stage2_pud_offset(pgd, addr);
+ pud = stage2_pud_offset(kvm, pgd, addr);
do {
- next = stage2_pud_addr_end(addr, end);
- if (!stage2_pud_none(*pud)) {
+ next = stage2_pud_addr_end(kvm, addr, end);
+ if (!stage2_pud_none(kvm, *pud)) {
/* TODO:PUD not supported, revisit later if supported */
- BUG_ON(stage2_pud_huge(*pud));
- stage2_wp_pmds(pud, addr, next);
+ BUG_ON(stage2_pud_huge(kvm, *pud));
+ stage2_wp_pmds(kvm, pud, addr, next);
}
} while (pud++, addr = next, addr != end);
}
@@ -1307,7 +1310,7 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
pgd_t *pgd;
phys_addr_t next;

- pgd = kvm->arch.pgd + stage2_pgd_index(addr);
+ pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
do {
/*
* Release kvm_mmu_lock periodically if the memory region is
@@ -1321,9 +1324,9 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
cond_resched_lock(&kvm->mmu_lock);
if (!READ_ONCE(kvm->arch.pgd))
break;
- next = stage2_pgd_addr_end(addr, end);
- if (stage2_pgd_present(*pgd))
- stage2_wp_puds(pgd, addr, next);
+ next = stage2_pgd_addr_end(kvm, addr, end);
+ if (stage2_pgd_present(kvm, *pgd))
+ stage2_wp_puds(kvm, pgd, addr, next);
} while (pgd++, addr = next, addr != end);
}

@@ -1472,7 +1475,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
up_read(&current->mm->mmap_sem);

/* We need minimum second+third level pages */
- ret = mmu_topup_memory_cache(memcache, KVM_MMU_CACHE_MIN_PAGES,
+ ret = mmu_topup_memory_cache(memcache, kvm_mmu_cache_min_pages(kvm),
KVM_NR_MEM_OBJS);
if (ret)
return ret;
@@ -1715,7 +1718,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
}

/* Userspace should not be able to register out-of-bounds IPAs */
- VM_BUG_ON(fault_ipa >= KVM_PHYS_SIZE);
+ VM_BUG_ON(fault_ipa >= kvm_phys_size(vcpu->kvm));

if (fault_status == FSC_ACCESS) {
handle_access_fault(vcpu, fault_ipa);
@@ -2019,7 +2022,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
* space addressable by the KVM guest IPA space.
*/
if (memslot->base_gfn + memslot->npages >=
- (KVM_PHYS_SIZE >> PAGE_SHIFT))
+ (kvm_phys_size(kvm) >> PAGE_SHIFT))
return -EFAULT;

down_read(&current->mm->mmap_sem);
diff --git a/virt/kvm/arm/vgic/vgic-kvm-device.c b/virt/kvm/arm/vgic/vgic-kvm-device.c
index 6ada243..114dce9 100644
--- a/virt/kvm/arm/vgic/vgic-kvm-device.c
+++ b/virt/kvm/arm/vgic/vgic-kvm-device.c
@@ -25,7 +25,7 @@
int vgic_check_ioaddr(struct kvm *kvm, phys_addr_t *ioaddr,
phys_addr_t addr, phys_addr_t alignment)
{
- if (addr & ~KVM_PHYS_MASK)
+ if (addr & ~kvm_phys_mask(kvm))
return -E2BIG;

if (!IS_ALIGNED(addr, alignment))
--
2.7.4
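For illustration: once a later patch in this series records the IPA
limit per VM, these wrappers are expected to key off the kvm instance
rather than the fixed KVM_PHYS_SHIFT. A minimal sketch, assuming a
hypothetical kvm->arch.phys_shift field (the real field is introduced
later in the series):

        /* sketch only: per-VM IPA limit instead of a compile-time one */
        #define kvm_phys_shift(kvm)   ((kvm)->arch.phys_shift)
        #define kvm_phys_size(kvm)    (1ULL << kvm_phys_shift(kvm))
        #define kvm_phys_mask(kvm)    (kvm_phys_size(kvm) - 1ULL)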


2018-06-29 15:10:28

by Suzuki K Poulose

[permalink] [raw]
Subject: [PATCH v3 03/20] arm64: Add a helper for PARange to physical shift conversion

On arm64, ID_AA64MMFR0_EL1.PARange encodes the maximum Physical
Address range supported by the CPU. Add a helper to decode this
to the actual physical shift. If we hit an unallocated value,
return the maximum range supported by the kernel.
This will be used by KVM to set VTCR_EL2.T0SZ, as that code is
about to be moved. Having this helper keeps the code movement
cleaner.

Cc: Catalin Marinas <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: James Morse <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
Changes since V2:
- Split the patch
- Limit the physical shift only for unrecognized values.
---
arch/arm64/include/asm/cpufeature.h | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 1717ba1..855cf0e 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -530,6 +530,19 @@ void arm64_set_ssbd_mitigation(bool state);
static inline void arm64_set_ssbd_mitigation(bool state) {}
#endif

+static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange)
+{
+ switch (parange) {
+ case 0: return 32;
+ case 1: return 36;
+ case 2: return 40;
+ case 3: return 42;
+ case 4: return 44;
+ case 5: return 48;
+ case 6: return 52;
+ default: return CONFIG_ARM64_PA_BITS;
+ }
+}
#endif /* __ASSEMBLY__ */

#endif
--
2.7.4
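As a usage sketch (not part of this patch), the helper is meant to be
fed the PARange field of the sanitised ID_AA64MMFR0_EL1 value:

        u64 mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
        int parange = cpuid_feature_extract_unsigned_field(mmfr0,
                                        ID_AA64MMFR0_PARANGE_SHIFT);
        u32 pa_shift = id_aa64mmfr0_parange_to_phys_shift(parange);
        /* pa_shift then feeds the VTCR_EL2.T0SZ computation */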


2018-06-29 17:45:49

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v3 02/20] virtio: pci-legacy: Validate queue pfn

On Fri, Jun 29, 2018 at 12:15:22PM +0100, Suzuki K Poulose wrote:
> Virtio over legacy PCI uses a 32bit PFN for the queue. If the
> queue pfn is too large to fit in 32bits, which we could hit on
> arm64 systems with 52bit physical addresses (even with 64K page
> size), we simply lose the link to the other side of the queue.
>
> Add a check to validate the PFN, rather than silently breaking
> the devices.
>
> Cc: "Michael S. Tsirkin" <[email protected]>
> Cc: Jason Wang <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Cc: Peter Maydell <[email protected]>
> Cc: Jean-Philippe Brucker <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> Changes since v2:
> - Change errno to -E2BIG
> ---
> drivers/virtio/virtio_pci_legacy.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c
> index 2780886..c0d6987a 100644
> --- a/drivers/virtio/virtio_pci_legacy.c
> +++ b/drivers/virtio/virtio_pci_legacy.c
> @@ -122,6 +122,7 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
> struct virtqueue *vq;
> u16 num;
> int err;
> + u64 q_pfn;
>
> /* Select the queue we're interested in */
> iowrite16(index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
> @@ -141,9 +142,15 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
> if (!vq)
> return ERR_PTR(-ENOMEM);
>
> + q_pfn = virtqueue_get_desc_addr(vq) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT;
> + if (q_pfn >> 32) {
> + dev_err(&vp_dev->pci_dev->dev, "virtio-pci queue PFN too large\n");
> + err = -E2BIG;

Same comment here. Let's make it clear it's a host problem, not a
guest problem.
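E.g. something along these lines (a sketch only; the exact wording is
up to the author):

        dev_err(&vp_dev->pci_dev->dev,
                "hypervisor bug: legacy virtio-pci must not be used with more than 0x%llx Gigabytes of memory\n",
                0x1ULL << (32 - 30) << PAGE_SHIFT);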

> + goto out_del_vq;
> + }
> +
> /* activate the queue */
> - iowrite32(virtqueue_get_desc_addr(vq) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
> - vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
> + iowrite32(q_pfn, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
>
> vq->priv = (void __force *)vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY;
>
> @@ -160,6 +167,7 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
>
> out_deactivate:
> iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
> +out_del_vq:
> vring_del_virtqueue(vq);
> return ERR_PTR(err);
> }
> --
> 2.7.4

2018-06-29 18:12:28

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v3 05/20] kvm: arm/arm64: Fix stage2_flush_memslot for 4 level page table

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> So far we have only supported a 3 level page table with a fixed IPA
> of 40bits. Fix stage2_flush_memslot() to accommodate 4 level tables.
In 06/20 you add the justification for this change, I think. Worth
putting it in here as well?

>
> Cc: Marc Zyngier <[email protected]>
> Acked-by: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> virt/kvm/arm/mmu.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 1d90d79..061e6b3 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -379,7 +379,8 @@ static void stage2_flush_memslot(struct kvm *kvm,
> pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> do {
> next = stage2_pgd_addr_end(addr, end);
> - stage2_flush_puds(kvm, pgd, addr, next);
> + if (!stage2_pgd_none(*pgd))
> + stage2_flush_puds(kvm, pgd, addr, next);
> } while (pgd++, addr = next, addr != end);
> }
>
>

Besides
Reviewed-by: Eric Auger <[email protected]>

Thanks

Eric



2018-06-29 20:02:23

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v3 01/20] virtio: mmio-v1: Validate queue PFN

On Fri, Jun 29, 2018 at 12:15:21PM +0100, Suzuki K Poulose wrote:
> virtio-mmio with virtio-v1 uses a 32bit PFN for the queue.
> If the queue pfn is too large to fit in 32bits, which
> we could hit on arm64 systems with 52bit physical addresses
> (even with 64K page size), we simply lose the link
> to the other side of the queue.
>
> Add a check to validate the PFN, rather than silently breaking
> the devices.
>
> Cc: "Michael S. Tsirkin" <[email protected]>
> Cc: Jason Wang <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Cc: Peter Maydell <[email protected]>
> Cc: Jean-Philippe Brucker <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> Changes since v2:
> - Change errno to -E2BIG
> ---
> drivers/virtio/virtio_mmio.c | 18 ++++++++++++++++--
> 1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
> index 67763d3..82cedc8 100644
> --- a/drivers/virtio/virtio_mmio.c
> +++ b/drivers/virtio/virtio_mmio.c
> @@ -397,9 +397,21 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
> /* Activate the queue */
> writel(virtqueue_get_vring_size(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_NUM);
> if (vm_dev->version == 1) {
> + u64 q_pfn = virtqueue_get_desc_addr(vq) >> PAGE_SHIFT;
> +
> + /*
> + * virtio-mmio v1 uses a 32bit QUEUE PFN. If we have something
> + * that doesn't fit in 32bit, fail the setup rather than
> + * pretending to be successful.
> + */
> + if (q_pfn >> 32) {
> + dev_err(&vdev->dev, "virtio-mmio: queue address too large\n");

How about:
"hypervisor bug: legacy virtio-mmio must not be used with more than 0x%llx Gigabytes of memory",
0x1ULL << (32 - 30) << PAGE_SHIFT
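That is, roughly (a sketch of the suggested change, not the final
wording):

        dev_err(&vdev->dev,
                "hypervisor bug: legacy virtio-mmio must not be used with more than 0x%llx Gigabytes of memory\n",
                0x1ULL << (32 - 30) << PAGE_SHIFT);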

> + err = -E2BIG;
> + goto error_bad_pfn;
> + }
> +
> writel(PAGE_SIZE, vm_dev->base + VIRTIO_MMIO_QUEUE_ALIGN);
> - writel(virtqueue_get_desc_addr(vq) >> PAGE_SHIFT,
> - vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
> + writel(q_pfn, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
> } else {
> u64 addr;
>
> @@ -430,6 +442,8 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
>
> return vq;
>
> +error_bad_pfn:
> + vring_del_virtqueue(vq);
> error_new_virtqueue:
> if (vm_dev->version == 1) {
> writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
> --
> 2.7.4

2018-07-02 10:03:39

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v3 06/20] kvm: arm/arm64: Remove spurious WARN_ON

On 29/06/18 12:15, Suzuki K Poulose wrote:
> On a 4-level page table, a pgd entry can be empty, unlike on a
> 3-level page table. Remove the spurious WARN_ON() in stage2_get_pud().
>
> Cc: Marc Zyngier <[email protected]>
> Acked-by: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> virt/kvm/arm/mmu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 061e6b3..308171c 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -976,7 +976,7 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
> pud_t *pud;
>
> pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> - if (WARN_ON(stage2_pgd_none(*pgd))) {
> + if (stage2_pgd_none(*pgd)) {
> if (!cache)
> return NULL;
> pud = mmu_memory_cache_alloc(cache);
>

Acked-by: Marc Zyngier <[email protected]>

M.
--
Jazz is not dead. It just smells funny...

2018-07-02 10:14:50

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v3 07/20] kvm: arm/arm64: Prepare for VM specific stage2 translations

On 29/06/18 12:15, Suzuki K Poulose wrote:
> Right now the stage2 page table for a VM is hard coded, assuming
> an IPA of 40bits. As we are about to add support for per VM IPA,
> prepare the stage2 page table helpers to accept the kvm instance
> to make the right decision for the VM. No functional changes.
> Adds stage2_pgd_size(kvm) to replace S2_PGD_SIZE. Also, moves
> some of the definitions dependent on kvm instance to asm/kvm_mmu.h
> for arm32. In that process drop the _AC() specifier constants
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> Changes since V2:
> - Update commit description about the movement to asm/kvm_mmu.h
> for arm32
> - Drop _AC() specifiers
> ---
> arch/arm/include/asm/kvm_arm.h | 3 +-
> arch/arm/include/asm/kvm_mmu.h | 15 +++-
> arch/arm/include/asm/stage2_pgtable.h | 42 ++++-----
> arch/arm64/include/asm/kvm_mmu.h | 7 +-
> arch/arm64/include/asm/stage2_pgtable-nopmd.h | 18 ++--
> arch/arm64/include/asm/stage2_pgtable-nopud.h | 16 ++--
> arch/arm64/include/asm/stage2_pgtable.h | 49 ++++++-----
> virt/kvm/arm/arm.c | 2 +-
> virt/kvm/arm/mmu.c | 119 +++++++++++++-------------
> virt/kvm/arm/vgic/vgic-kvm-device.c | 2 +-
> 10 files changed, 148 insertions(+), 125 deletions(-)
>
> diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
> index 3ab8b37..c3f1f9b 100644
> --- a/arch/arm/include/asm/kvm_arm.h
> +++ b/arch/arm/include/asm/kvm_arm.h
> @@ -133,8 +133,7 @@
> * space.
> */
> #define KVM_PHYS_SHIFT (40)
> -#define KVM_PHYS_SIZE (_AC(1, ULL) << KVM_PHYS_SHIFT)
> -#define KVM_PHYS_MASK (KVM_PHYS_SIZE - _AC(1, ULL))
> +
> #define PTRS_PER_S2_PGD (_AC(1, ULL) << (KVM_PHYS_SHIFT - 30))
>
> /* Virtualization Translation Control Register (VTCR) bits */
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 8553d68..f36eb20 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -36,15 +36,19 @@
> })
>
> /*
> - * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation levels.
> + * kvm_mmu_cache_min_pages() is the number of stage2 page
> + * table translation levels, excluding the top level, for
> + * the given VM. Since we have a 3 level page-table, this
> + * is fixed.

I find this comment quite confusing: number of levels, but excluding the
top one? The original one was just as bad, to be honest.

Can't we just say: "kvm_mmu_cache_min_pages() is the number of pages
required to install a stage-2 translation"?
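Rendered as the comment block, that would read:

        /*
         * kvm_mmu_cache_min_pages() is the number of pages required
         * to install a stage-2 translation.
         */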

> */
> -#define KVM_MMU_CACHE_MIN_PAGES 2
> +#define kvm_mmu_cache_min_pages(kvm) 2
>
> #ifndef __ASSEMBLY__
>
> #include <linux/highmem.h>
> #include <asm/cacheflush.h>
> #include <asm/cputype.h>
> +#include <asm/kvm_arm.h>
> #include <asm/kvm_hyp.h>
> #include <asm/pgalloc.h>
> #include <asm/stage2_pgtable.h>
> @@ -52,6 +56,13 @@
> /* Ensure compatibility with arm64 */
> #define VA_BITS 32
>
> +#define kvm_phys_shift(kvm) KVM_PHYS_SHIFT
> +#define kvm_phys_size(kvm) (1ULL << kvm_phys_shift(kvm))
> +#define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - 1ULL)
> +#define kvm_vttbr_baddr_mask(kvm) VTTBR_BADDR_MASK
> +
> +#define stage2_pgd_size(kvm) (PTRS_PER_S2_PGD * sizeof(pgd_t))
> +
> int create_hyp_mappings(void *from, void *to, pgprot_t prot);
> int create_hyp_io_mappings(phys_addr_t phys_addr, size_t size,
> void __iomem **kaddr,
> diff --git a/arch/arm/include/asm/stage2_pgtable.h b/arch/arm/include/asm/stage2_pgtable.h
> index 460d616..e22ae94 100644
> --- a/arch/arm/include/asm/stage2_pgtable.h
> +++ b/arch/arm/include/asm/stage2_pgtable.h
> @@ -19,43 +19,45 @@
> #ifndef __ARM_S2_PGTABLE_H_
> #define __ARM_S2_PGTABLE_H_
>
> -#define stage2_pgd_none(pgd) pgd_none(pgd)
> -#define stage2_pgd_clear(pgd) pgd_clear(pgd)
> -#define stage2_pgd_present(pgd) pgd_present(pgd)
> -#define stage2_pgd_populate(pgd, pud) pgd_populate(NULL, pgd, pud)
> -#define stage2_pud_offset(pgd, address) pud_offset(pgd, address)
> -#define stage2_pud_free(pud) pud_free(NULL, pud)
> +#define stage2_pgd_none(kvm, pgd) pgd_none(pgd)
> +#define stage2_pgd_clear(kvm, pgd) pgd_clear(pgd)
> +#define stage2_pgd_present(kvm, pgd) pgd_present(pgd)
> +#define stage2_pgd_populate(kvm, pgd, pud) pgd_populate(NULL, pgd, pud)
> +#define stage2_pud_offset(kvm, pgd, address) pud_offset(pgd, address)
> +#define stage2_pud_free(kvm, pud) pud_free(NULL, pud)
>
> -#define stage2_pud_none(pud) pud_none(pud)
> -#define stage2_pud_clear(pud) pud_clear(pud)
> -#define stage2_pud_present(pud) pud_present(pud)
> -#define stage2_pud_populate(pud, pmd) pud_populate(NULL, pud, pmd)
> -#define stage2_pmd_offset(pud, address) pmd_offset(pud, address)
> -#define stage2_pmd_free(pmd) pmd_free(NULL, pmd)
> +#define stage2_pud_none(kvm, pud) pud_none(pud)
> +#define stage2_pud_clear(kvm, pud) pud_clear(pud)
> +#define stage2_pud_present(kvm, pud) pud_present(pud)
> +#define stage2_pud_populate(kvm, pud, pmd) pud_populate(NULL, pud, pmd)
> +#define stage2_pmd_offset(kvm, pud, address) pmd_offset(pud, address)
> +#define stage2_pmd_free(kvm, pmd) pmd_free(NULL, pmd)
>
> -#define stage2_pud_huge(pud) pud_huge(pud)
> +#define stage2_pud_huge(kvm, pud) pud_huge(pud)
>
> /* Open coded p*d_addr_end that can deal with 64bit addresses */
> -static inline phys_addr_t stage2_pgd_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> {
> phys_addr_t boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
>
> return (boundary - 1 < end - 1) ? boundary : end;
> }
>
> -#define stage2_pud_addr_end(addr, end) (end)
> +#define stage2_pud_addr_end(kvm, addr, end) (end)
>
> -static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> {
> phys_addr_t boundary = (addr + PMD_SIZE) & PMD_MASK;
>
> return (boundary - 1 < end - 1) ? boundary : end;
> }
>
> -#define stage2_pgd_index(addr) pgd_index(addr)
> +#define stage2_pgd_index(kvm, addr) pgd_index(addr)
>
> -#define stage2_pte_table_empty(ptep) kvm_page_empty(ptep)
> -#define stage2_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
> -#define stage2_pud_table_empty(pudp) false
> +#define stage2_pte_table_empty(kvm, ptep) kvm_page_empty(ptep)
> +#define stage2_pmd_table_empty(kvm, pmdp) kvm_page_empty(pmdp)
> +#define stage2_pud_table_empty(kvm, pudp) false
>
> #endif /* __ARM_S2_PGTABLE_H_ */
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index fb9a712..5da8f52 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -141,8 +141,11 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
> * We currently only support a 40bit IPA.
> */
> #define KVM_PHYS_SHIFT (40)
> -#define KVM_PHYS_SIZE (1UL << KVM_PHYS_SHIFT)
> -#define KVM_PHYS_MASK (KVM_PHYS_SIZE - 1UL)
> +
> +#define kvm_phys_shift(kvm) KVM_PHYS_SHIFT
> +#define kvm_phys_size(kvm) (_AC(1, ULL) << kvm_phys_shift(kvm))
> +#define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - _AC(1, ULL))
> +#define kvm_vttbr_baddr_mask(kvm) VTTBR_BADDR_MASK
>
> #include <asm/stage2_pgtable.h>
>
> diff --git a/arch/arm64/include/asm/stage2_pgtable-nopmd.h b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
> index 2656a0f..0280ded 100644
> --- a/arch/arm64/include/asm/stage2_pgtable-nopmd.h
> +++ b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
> @@ -26,17 +26,17 @@
> #define S2_PMD_SIZE (1UL << S2_PMD_SHIFT)
> #define S2_PMD_MASK (~(S2_PMD_SIZE-1))
>
> -#define stage2_pud_none(pud) (0)
> -#define stage2_pud_present(pud) (1)
> -#define stage2_pud_clear(pud) do { } while (0)
> -#define stage2_pud_populate(pud, pmd) do { } while (0)
> -#define stage2_pmd_offset(pud, address) ((pmd_t *)(pud))
> +#define stage2_pud_none(kvm, pud) (0)
> +#define stage2_pud_present(kvm, pud) (1)
> +#define stage2_pud_clear(kvm, pud) do { } while (0)
> +#define stage2_pud_populate(kvm, pud, pmd) do { } while (0)
> +#define stage2_pmd_offset(kvm, pud, address) ((pmd_t *)(pud))
>
> -#define stage2_pmd_free(pmd) do { } while (0)
> +#define stage2_pmd_free(kvm, pmd) do { } while (0)
>
> -#define stage2_pmd_addr_end(addr, end) (end)
> +#define stage2_pmd_addr_end(kvm, addr, end) (end)
>
> -#define stage2_pud_huge(pud) (0)
> -#define stage2_pmd_table_empty(pmdp) (0)
> +#define stage2_pud_huge(kvm, pud) (0)
> +#define stage2_pmd_table_empty(kvm, pmdp) (0)
>
> #endif
> diff --git a/arch/arm64/include/asm/stage2_pgtable-nopud.h b/arch/arm64/include/asm/stage2_pgtable-nopud.h
> index 5ee87b5..cd6304e 100644
> --- a/arch/arm64/include/asm/stage2_pgtable-nopud.h
> +++ b/arch/arm64/include/asm/stage2_pgtable-nopud.h
> @@ -24,16 +24,16 @@
> #define S2_PUD_SIZE (_AC(1, UL) << S2_PUD_SHIFT)
> #define S2_PUD_MASK (~(S2_PUD_SIZE-1))
>
> -#define stage2_pgd_none(pgd) (0)
> -#define stage2_pgd_present(pgd) (1)
> -#define stage2_pgd_clear(pgd) do { } while (0)
> -#define stage2_pgd_populate(pgd, pud) do { } while (0)
> +#define stage2_pgd_none(kvm, pgd) (0)
> +#define stage2_pgd_present(kvm, pgd) (1)
> +#define stage2_pgd_clear(kvm, pgd) do { } while (0)
> +#define stage2_pgd_populate(kvm, pgd, pud) do { } while (0)
>
> -#define stage2_pud_offset(pgd, address) ((pud_t *)(pgd))
> +#define stage2_pud_offset(kvm, pgd, address) ((pud_t *)(pgd))
>
> -#define stage2_pud_free(x) do { } while (0)
> +#define stage2_pud_free(kvm, x) do { } while (0)
>
> -#define stage2_pud_addr_end(addr, end) (end)
> -#define stage2_pud_table_empty(pmdp) (0)
> +#define stage2_pud_addr_end(kvm, addr, end) (end)
> +#define stage2_pud_table_empty(kvm, pmdp) (0)
>
> #endif
> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
> index 8b68099..057a405 100644
> --- a/arch/arm64/include/asm/stage2_pgtable.h
> +++ b/arch/arm64/include/asm/stage2_pgtable.h
> @@ -65,10 +65,10 @@
> #define PTRS_PER_S2_PGD (1 << (KVM_PHYS_SHIFT - S2_PGDIR_SHIFT))
>
> /*
> - * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation
> + * kvm_mmu_cache_min_pages is the number of stage2 page table translation
> * levels in addition to the PGD.
> */
> -#define KVM_MMU_CACHE_MIN_PAGES (STAGE2_PGTABLE_LEVELS - 1)
> +#define kvm_mmu_cache_min_pages(kvm) (STAGE2_PGTABLE_LEVELS - 1)

Same comment as for the 32bit case.

>
>
> #if STAGE2_PGTABLE_LEVELS > 3
> @@ -77,16 +77,17 @@
> #define S2_PUD_SIZE (_AC(1, UL) << S2_PUD_SHIFT)
> #define S2_PUD_MASK (~(S2_PUD_SIZE - 1))
>
> -#define stage2_pgd_none(pgd) pgd_none(pgd)
> -#define stage2_pgd_clear(pgd) pgd_clear(pgd)
> -#define stage2_pgd_present(pgd) pgd_present(pgd)
> -#define stage2_pgd_populate(pgd, pud) pgd_populate(NULL, pgd, pud)
> -#define stage2_pud_offset(pgd, address) pud_offset(pgd, address)
> -#define stage2_pud_free(pud) pud_free(NULL, pud)
> +#define stage2_pgd_none(kvm, pgd) pgd_none(pgd)
> +#define stage2_pgd_clear(kvm, pgd) pgd_clear(pgd)
> +#define stage2_pgd_present(kvm, pgd) pgd_present(pgd)
> +#define stage2_pgd_populate(kvm, pgd, pud) pgd_populate(NULL, pgd, pud)
> +#define stage2_pud_offset(kvm, pgd, address) pud_offset(pgd, address)
> +#define stage2_pud_free(kvm, pud) pud_free(NULL, pud)
>
> -#define stage2_pud_table_empty(pudp) kvm_page_empty(pudp)
> +#define stage2_pud_table_empty(kvm, pudp) kvm_page_empty(pudp)
>
> -static inline phys_addr_t stage2_pud_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pud_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> {
> phys_addr_t boundary = (addr + S2_PUD_SIZE) & S2_PUD_MASK;
>
> @@ -102,17 +103,18 @@ static inline phys_addr_t stage2_pud_addr_end(phys_addr_t addr, phys_addr_t end)
> #define S2_PMD_SIZE (_AC(1, UL) << S2_PMD_SHIFT)
> #define S2_PMD_MASK (~(S2_PMD_SIZE - 1))
>
> -#define stage2_pud_none(pud) pud_none(pud)
> -#define stage2_pud_clear(pud) pud_clear(pud)
> -#define stage2_pud_present(pud) pud_present(pud)
> -#define stage2_pud_populate(pud, pmd) pud_populate(NULL, pud, pmd)
> -#define stage2_pmd_offset(pud, address) pmd_offset(pud, address)
> -#define stage2_pmd_free(pmd) pmd_free(NULL, pmd)
> +#define stage2_pud_none(kvm, pud) pud_none(pud)
> +#define stage2_pud_clear(kvm, pud) pud_clear(pud)
> +#define stage2_pud_present(kvm, pud) pud_present(pud)
> +#define stage2_pud_populate(kvm, pud, pmd) pud_populate(NULL, pud, pmd)
> +#define stage2_pmd_offset(kvm, pud, address) pmd_offset(pud, address)
> +#define stage2_pmd_free(kvm, pmd) pmd_free(NULL, pmd)
>
> -#define stage2_pud_huge(pud) pud_huge(pud)
> -#define stage2_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
> +#define stage2_pud_huge(kvm, pud) pud_huge(pud)
> +#define stage2_pmd_table_empty(kvm, pmdp) kvm_page_empty(pmdp)
>
> -static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> {
> phys_addr_t boundary = (addr + S2_PMD_SIZE) & S2_PMD_MASK;
>
> @@ -121,7 +123,7 @@ static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
>
> #endif /* STAGE2_PGTABLE_LEVELS > 2 */
>
> -#define stage2_pte_table_empty(ptep) kvm_page_empty(ptep)
> +#define stage2_pte_table_empty(kvm, ptep) kvm_page_empty(ptep)
>
> #if STAGE2_PGTABLE_LEVELS == 2
> #include <asm/stage2_pgtable-nopmd.h>
> @@ -129,10 +131,13 @@ static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
> #include <asm/stage2_pgtable-nopud.h>
> #endif
>
> +#define stage2_pgd_size(kvm) (PTRS_PER_S2_PGD * sizeof(pgd_t))
>
> -#define stage2_pgd_index(addr) (((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
> +#define stage2_pgd_index(kvm, addr) \
> + (((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
>
> -static inline phys_addr_t stage2_pgd_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> {
> phys_addr_t boundary = (addr + S2_PGDIR_SIZE) & S2_PGDIR_MASK;
>
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index 04e554c..d2637bb 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -538,7 +538,7 @@ static void update_vttbr(struct kvm *kvm)
>
> /* update vttbr to be used with the new vmid */
> pgd_phys = virt_to_phys(kvm->arch.pgd);
> - BUG_ON(pgd_phys & ~VTTBR_BADDR_MASK);
> + BUG_ON(pgd_phys & ~kvm_vttbr_baddr_mask(kvm));
> vmid = ((u64)(kvm->arch.vmid) << VTTBR_VMID_SHIFT) & VTTBR_VMID_MASK(kvm_vmid_bits);
> kvm->arch.vttbr = kvm_phys_to_vttbr(pgd_phys) | vmid;
>
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 308171c..82dd571 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -45,7 +45,6 @@ static phys_addr_t hyp_idmap_vector;
>
> static unsigned long io_map_base;
>
> -#define S2_PGD_SIZE (PTRS_PER_S2_PGD * sizeof(pgd_t))
> #define hyp_pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t))
>
> #define KVM_S2PTE_FLAG_IS_IOMAP (1UL << 0)
> @@ -150,20 +149,20 @@ static void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
>
> static void clear_stage2_pgd_entry(struct kvm *kvm, pgd_t *pgd, phys_addr_t addr)
> {
> - pud_t *pud_table __maybe_unused = stage2_pud_offset(pgd, 0UL);
> - stage2_pgd_clear(pgd);
> + pud_t *pud_table __maybe_unused = stage2_pud_offset(kvm, pgd, 0UL);
> + stage2_pgd_clear(kvm, pgd);
> kvm_tlb_flush_vmid_ipa(kvm, addr);
> - stage2_pud_free(pud_table);
> + stage2_pud_free(kvm, pud_table);
> put_page(virt_to_page(pgd));
> }
>
> static void clear_stage2_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
> {
> - pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(pud, 0);
> - VM_BUG_ON(stage2_pud_huge(*pud));
> - stage2_pud_clear(pud);
> + pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(kvm, pud, 0);
> + VM_BUG_ON(stage2_pud_huge(kvm, *pud));
> + stage2_pud_clear(kvm, pud);
> kvm_tlb_flush_vmid_ipa(kvm, addr);
> - stage2_pmd_free(pmd_table);
> + stage2_pmd_free(kvm, pmd_table);
> put_page(virt_to_page(pud));
> }
>
> @@ -219,7 +218,7 @@ static void unmap_stage2_ptes(struct kvm *kvm, pmd_t *pmd,
> }
> } while (pte++, addr += PAGE_SIZE, addr != end);
>
> - if (stage2_pte_table_empty(start_pte))
> + if (stage2_pte_table_empty(kvm, start_pte))
> clear_stage2_pmd_entry(kvm, pmd, start_addr);
> }
>
> @@ -229,9 +228,9 @@ static void unmap_stage2_pmds(struct kvm *kvm, pud_t *pud,
> phys_addr_t next, start_addr = addr;
> pmd_t *pmd, *start_pmd;
>
> - start_pmd = pmd = stage2_pmd_offset(pud, addr);
> + start_pmd = pmd = stage2_pmd_offset(kvm, pud, addr);
> do {
> - next = stage2_pmd_addr_end(addr, end);
> + next = stage2_pmd_addr_end(kvm, addr, end);
> if (!pmd_none(*pmd)) {
> if (pmd_thp_or_huge(*pmd)) {
> pmd_t old_pmd = *pmd;
> @@ -248,7 +247,7 @@ static void unmap_stage2_pmds(struct kvm *kvm, pud_t *pud,
> }
> } while (pmd++, addr = next, addr != end);
>
> - if (stage2_pmd_table_empty(start_pmd))
> + if (stage2_pmd_table_empty(kvm, start_pmd))
> clear_stage2_pud_entry(kvm, pud, start_addr);
> }
>
> @@ -258,14 +257,14 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd,
> phys_addr_t next, start_addr = addr;
> pud_t *pud, *start_pud;
>
> - start_pud = pud = stage2_pud_offset(pgd, addr);
> + start_pud = pud = stage2_pud_offset(kvm, pgd, addr);
> do {
> - next = stage2_pud_addr_end(addr, end);
> - if (!stage2_pud_none(*pud)) {
> - if (stage2_pud_huge(*pud)) {
> + next = stage2_pud_addr_end(kvm, addr, end);
> + if (!stage2_pud_none(kvm, *pud)) {
> + if (stage2_pud_huge(kvm, *pud)) {
> pud_t old_pud = *pud;
>
> - stage2_pud_clear(pud);
> + stage2_pud_clear(kvm, pud);
> kvm_tlb_flush_vmid_ipa(kvm, addr);
> kvm_flush_dcache_pud(old_pud);
> put_page(virt_to_page(pud));
> @@ -275,7 +274,7 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd,
> }
> } while (pud++, addr = next, addr != end);
>
> - if (stage2_pud_table_empty(start_pud))
> + if (stage2_pud_table_empty(kvm, start_pud))
> clear_stage2_pgd_entry(kvm, pgd, start_addr);
> }
>
> @@ -299,7 +298,7 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
> assert_spin_locked(&kvm->mmu_lock);
> WARN_ON(size & ~PAGE_MASK);
>
> - pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> + pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
> do {
> /*
> * Make sure the page table is still active, as another thread
> @@ -308,8 +307,8 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
> */
> if (!READ_ONCE(kvm->arch.pgd))
> break;
> - next = stage2_pgd_addr_end(addr, end);
> - if (!stage2_pgd_none(*pgd))
> + next = stage2_pgd_addr_end(kvm, addr, end);
> + if (!stage2_pgd_none(kvm, *pgd))
> unmap_stage2_puds(kvm, pgd, addr, next);
> /*
> * If the range is too large, release the kvm->mmu_lock
> @@ -338,9 +337,9 @@ static void stage2_flush_pmds(struct kvm *kvm, pud_t *pud,
> pmd_t *pmd;
> phys_addr_t next;
>
> - pmd = stage2_pmd_offset(pud, addr);
> + pmd = stage2_pmd_offset(kvm, pud, addr);
> do {
> - next = stage2_pmd_addr_end(addr, end);
> + next = stage2_pmd_addr_end(kvm, addr, end);
> if (!pmd_none(*pmd)) {
> if (pmd_thp_or_huge(*pmd))
> kvm_flush_dcache_pmd(*pmd);
> @@ -356,11 +355,11 @@ static void stage2_flush_puds(struct kvm *kvm, pgd_t *pgd,
> pud_t *pud;
> phys_addr_t next;
>
> - pud = stage2_pud_offset(pgd, addr);
> + pud = stage2_pud_offset(kvm, pgd, addr);
> do {
> - next = stage2_pud_addr_end(addr, end);
> - if (!stage2_pud_none(*pud)) {
> - if (stage2_pud_huge(*pud))
> + next = stage2_pud_addr_end(kvm, addr, end);
> + if (!stage2_pud_none(kvm, *pud)) {
> + if (stage2_pud_huge(kvm, *pud))
> kvm_flush_dcache_pud(*pud);
> else
> stage2_flush_pmds(kvm, pud, addr, next);
> @@ -376,10 +375,10 @@ static void stage2_flush_memslot(struct kvm *kvm,
> phys_addr_t next;
> pgd_t *pgd;
>
> - pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> + pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
> do {
> - next = stage2_pgd_addr_end(addr, end);
> - if (!stage2_pgd_none(*pgd))
> + next = stage2_pgd_addr_end(kvm, addr, end);
> + if (!stage2_pgd_none(kvm, *pgd))
> stage2_flush_puds(kvm, pgd, addr, next);
> } while (pgd++, addr = next, addr != end);
> }
> @@ -869,7 +868,7 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
> }
>
> /* Allocate the HW PGD, making sure that each page gets its own refcount */
> - pgd = alloc_pages_exact(S2_PGD_SIZE, GFP_KERNEL | __GFP_ZERO);
> + pgd = alloc_pages_exact(stage2_pgd_size(kvm), GFP_KERNEL | __GFP_ZERO);
> if (!pgd)
> return -ENOMEM;
>
> @@ -958,7 +957,7 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
>
> spin_lock(&kvm->mmu_lock);
> if (kvm->arch.pgd) {
> - unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
> + unmap_stage2_range(kvm, 0, kvm_phys_size(kvm));
> pgd = READ_ONCE(kvm->arch.pgd);
> kvm->arch.pgd = NULL;
> }
> @@ -966,7 +965,7 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
>
> /* Free the HW pgd, one page at a time */
> if (pgd)
> - free_pages_exact(pgd, S2_PGD_SIZE);
> + free_pages_exact(pgd, stage2_pgd_size(kvm));
> }
>
> static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
> @@ -975,16 +974,16 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
> pgd_t *pgd;
> pud_t *pud;
>
> - pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> - if (stage2_pgd_none(*pgd)) {
> + pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
> + if (stage2_pgd_none(kvm, *pgd)) {
> if (!cache)
> return NULL;
> pud = mmu_memory_cache_alloc(cache);
> - stage2_pgd_populate(pgd, pud);
> + stage2_pgd_populate(kvm, pgd, pud);
> get_page(virt_to_page(pgd));
> }
>
> - return stage2_pud_offset(pgd, addr);
> + return stage2_pud_offset(kvm, pgd, addr);
> }
>
> static pmd_t *stage2_get_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
> @@ -997,15 +996,15 @@ static pmd_t *stage2_get_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
> if (!pud)
> return NULL;
>
> - if (stage2_pud_none(*pud)) {
> + if (stage2_pud_none(kvm, *pud)) {
> if (!cache)
> return NULL;
> pmd = mmu_memory_cache_alloc(cache);
> - stage2_pud_populate(pud, pmd);
> + stage2_pud_populate(kvm, pud, pmd);
> get_page(virt_to_page(pud));
> }
>
> - return stage2_pmd_offset(pud, addr);
> + return stage2_pmd_offset(kvm, pud, addr);
> }
>
> static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
> @@ -1159,8 +1158,9 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> if (writable)
> pte = kvm_s2pte_mkwrite(pte);
>
> - ret = mmu_topup_memory_cache(&cache, KVM_MMU_CACHE_MIN_PAGES,
> - KVM_NR_MEM_OBJS);
> + ret = mmu_topup_memory_cache(&cache,
> + kvm_mmu_cache_min_pages(kvm),
> + KVM_NR_MEM_OBJS);
> if (ret)
> goto out;
> spin_lock(&kvm->mmu_lock);
> @@ -1248,19 +1248,21 @@ static void stage2_wp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)
>
> /**
> * stage2_wp_pmds - write protect PUD range
> + * @kvm: kvm instance for the VM
> * @pud: pointer to pud entry
> * @addr: range start address
> * @end: range end address
> */
> -static void stage2_wp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
> +static void stage2_wp_pmds(struct kvm *kvm, pud_t *pud,
> + phys_addr_t addr, phys_addr_t end)
> {
> pmd_t *pmd;
> phys_addr_t next;
>
> - pmd = stage2_pmd_offset(pud, addr);
> + pmd = stage2_pmd_offset(kvm, pud, addr);
>
> do {
> - next = stage2_pmd_addr_end(addr, end);
> + next = stage2_pmd_addr_end(kvm, addr, end);
> if (!pmd_none(*pmd)) {
> if (pmd_thp_or_huge(*pmd)) {
> if (!kvm_s2pmd_readonly(pmd))
> @@ -1280,18 +1282,19 @@ static void stage2_wp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
> *
> * Process PUD entries, for a huge PUD we cause a panic.
> */
> -static void stage2_wp_puds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
> +static void stage2_wp_puds(struct kvm *kvm, pgd_t *pgd,
> + phys_addr_t addr, phys_addr_t end)
> {
> pud_t *pud;
> phys_addr_t next;
>
> - pud = stage2_pud_offset(pgd, addr);
> + pud = stage2_pud_offset(kvm, pgd, addr);
> do {
> - next = stage2_pud_addr_end(addr, end);
> - if (!stage2_pud_none(*pud)) {
> + next = stage2_pud_addr_end(kvm, addr, end);
> + if (!stage2_pud_none(kvm, *pud)) {
> /* TODO:PUD not supported, revisit later if supported */
> - BUG_ON(stage2_pud_huge(*pud));
> - stage2_wp_pmds(pud, addr, next);
> + BUG_ON(stage2_pud_huge(kvm, *pud));
> + stage2_wp_pmds(kvm, pud, addr, next);
> }
> } while (pud++, addr = next, addr != end);
> }
> @@ -1307,7 +1310,7 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> pgd_t *pgd;
> phys_addr_t next;
>
> - pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> + pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
> do {
> /*
> * Release kvm_mmu_lock periodically if the memory region is
> @@ -1321,9 +1324,9 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> cond_resched_lock(&kvm->mmu_lock);
> if (!READ_ONCE(kvm->arch.pgd))
> break;
> - next = stage2_pgd_addr_end(addr, end);
> - if (stage2_pgd_present(*pgd))
> - stage2_wp_puds(pgd, addr, next);
> + next = stage2_pgd_addr_end(kvm, addr, end);
> + if (stage2_pgd_present(kvm, *pgd))
> + stage2_wp_puds(kvm, pgd, addr, next);
> } while (pgd++, addr = next, addr != end);
> }
>
> @@ -1472,7 +1475,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> up_read(&current->mm->mmap_sem);
>
> /* We need minimum second+third level pages */
> - ret = mmu_topup_memory_cache(memcache, KVM_MMU_CACHE_MIN_PAGES,
> + ret = mmu_topup_memory_cache(memcache, kvm_mmu_cache_min_pages(kvm),
> KVM_NR_MEM_OBJS);
> if (ret)
> return ret;
> @@ -1715,7 +1718,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
> }
>
> /* Userspace should not be able to register out-of-bounds IPAs */
> - VM_BUG_ON(fault_ipa >= KVM_PHYS_SIZE);
> + VM_BUG_ON(fault_ipa >= kvm_phys_size(vcpu->kvm));
>
> if (fault_status == FSC_ACCESS) {
> handle_access_fault(vcpu, fault_ipa);
> @@ -2019,7 +2022,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> * space addressable by the KVM guest IPA space.
> */
> if (memslot->base_gfn + memslot->npages >=
> - (KVM_PHYS_SIZE >> PAGE_SHIFT))
> + (kvm_phys_size(kvm) >> PAGE_SHIFT))
> return -EFAULT;
>
> down_read(&current->mm->mmap_sem);
> diff --git a/virt/kvm/arm/vgic/vgic-kvm-device.c b/virt/kvm/arm/vgic/vgic-kvm-device.c
> index 6ada243..114dce9 100644
> --- a/virt/kvm/arm/vgic/vgic-kvm-device.c
> +++ b/virt/kvm/arm/vgic/vgic-kvm-device.c
> @@ -25,7 +25,7 @@
> int vgic_check_ioaddr(struct kvm *kvm, phys_addr_t *ioaddr,
> phys_addr_t addr, phys_addr_t alignment)
> {
> - if (addr & ~KVM_PHYS_MASK)
> + if (addr & ~kvm_phys_mask(kvm))
> return -E2BIG;
>
> if (!IS_ALIGNED(addr, alignment))
>

Otherwise:

Acked-by: Marc Zyngier <[email protected]>

M.
--
Jazz is not dead. It just smells funny...

2018-07-02 10:27:09

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH v3 07/20] kvm: arm/arm64: Prepare for VM specific stage2 translations

On 02/07/18 11:12, Marc Zyngier wrote:
> On 29/06/18 12:15, Suzuki K Poulose wrote:
>> Right now the stage2 page table for a VM is hard coded, assuming
>> an IPA of 40bits. As we are about to add support for per VM IPA,
>> prepare the stage2 page table helpers to accept the kvm instance
>> to make the right decision for the VM. No functional changes.
>> Adds stage2_pgd_size(kvm) to replace S2_PGD_SIZE. Also, moves
>> some of the definitions dependent on kvm instance to asm/kvm_mmu.h
>> for arm32. In that process drop the _AC() specifier constants
>>
>> Cc: Marc Zyngier <[email protected]>
>> Cc: Christoffer Dall <[email protected]>
>> Signed-off-by: Suzuki K Poulose <[email protected]>
>> ---
>> Changes since V2:
>> - Update commit description about the movement to asm/kvm_mmu.h
>> for arm32
>> - Drop _AC() specifiers
>> ---
>> arch/arm/include/asm/kvm_arm.h | 3 +-
>> arch/arm/include/asm/kvm_mmu.h | 15 +++-
>> arch/arm/include/asm/stage2_pgtable.h | 42 ++++-----
>> arch/arm64/include/asm/kvm_mmu.h | 7 +-
>> arch/arm64/include/asm/stage2_pgtable-nopmd.h | 18 ++--
>> arch/arm64/include/asm/stage2_pgtable-nopud.h | 16 ++--
>> arch/arm64/include/asm/stage2_pgtable.h | 49 ++++++-----
>> virt/kvm/arm/arm.c | 2 +-
>> virt/kvm/arm/mmu.c | 119 +++++++++++++-------------
>> virt/kvm/arm/vgic/vgic-kvm-device.c | 2 +-
>> 10 files changed, 148 insertions(+), 125 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
>> index 3ab8b37..c3f1f9b 100644
>> --- a/arch/arm/include/asm/kvm_arm.h
>> +++ b/arch/arm/include/asm/kvm_arm.h
>> @@ -133,8 +133,7 @@
>> * space.
>> */
>> #define KVM_PHYS_SHIFT (40)
>> -#define KVM_PHYS_SIZE (_AC(1, ULL) << KVM_PHYS_SHIFT)
>> -#define KVM_PHYS_MASK (KVM_PHYS_SIZE - _AC(1, ULL))
>> +
>> #define PTRS_PER_S2_PGD (_AC(1, ULL) << (KVM_PHYS_SHIFT - 30))
>>
>> /* Virtualization Translation Control Register (VTCR) bits */
>> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
>> index 8553d68..f36eb20 100644
>> --- a/arch/arm/include/asm/kvm_mmu.h
>> +++ b/arch/arm/include/asm/kvm_mmu.h
>> @@ -36,15 +36,19 @@
>> })
>>
>> /*
>> - * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation levels.
>> + * kvm_mmu_cache_min_pages() is the number of stage2 page
>> + * table translation levels, excluding the top level, for
>> + * the given VM. Since we have a 3 level page-table, this
>> + * is fixed.
>
> I find this comment quite confusing: number of levels, but excluding the
> top one? The original one was just as bad, to be honest.
>
> Can't we just say: "kvm_mmu_cache_min_pages() is the number of pages
> required to install a stage-2 translation"?

Yes, that is much better. Will change it.

>> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
>> index 8b68099..057a405 100644
>> --- a/arch/arm64/include/asm/stage2_pgtable.h
>> +++ b/arch/arm64/include/asm/stage2_pgtable.h
>> @@ -65,10 +65,10 @@
>> #define PTRS_PER_S2_PGD (1 << (KVM_PHYS_SHIFT - S2_PGDIR_SHIFT))
>>
>> /*
>> - * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation
>> + * kvm_mmmu_cache_min_pages is the number of stage2 page table translation
>> * levels in addition to the PGD.
>> */
>> -#define KVM_MMU_CACHE_MIN_PAGES (STAGE2_PGTABLE_LEVELS - 1)
>> +#define kvm_mmu_cache_min_pages(kvm) (STAGE2_PGTABLE_LEVELS - 1)
>
> Same comment as for the 32bit case.
>
>>

>>
>
> Otherwise:
>
> Acked-by: Marc Zyngier <[email protected]>

Thanks
Suzuki

2018-07-02 11:13:32

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v3 05/20] kvm: arm/arm64: Fix stage2_flush_memslot for 4 level page table

On 29/06/18 12:15, Suzuki K Poulose wrote:
> So far we have only supported a 3 level page table with a fixed IPA
> of 40bits. Fix stage2_flush_memslot() to accommodate 4 level tables.
>
> Cc: Marc Zyngier <[email protected]>
> Acked-by: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> virt/kvm/arm/mmu.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 1d90d79..061e6b3 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -379,7 +379,8 @@ static void stage2_flush_memslot(struct kvm *kvm,
> pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> do {
> next = stage2_pgd_addr_end(addr, end);
> - stage2_flush_puds(kvm, pgd, addr, next);
> + if (!stage2_pgd_none(*pgd))
> + stage2_flush_puds(kvm, pgd, addr, next);
> } while (pgd++, addr = next, addr != end);
> }
>
>

Reviewed-by: Marc Zyngier <[email protected]>

M.
--
Jazz is not dead. It just smells funny...

2018-07-02 11:34:57

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v3 07/20] kvm: arm/arm64: Prepare for VM specific stage2 translations

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> Right now the stage2 page table for a VM is hard coded, assuming
> an IPA of 40bits. As we are about to add support for per VM IPA,
> prepare the stage2 page table helpers to accept the kvm instance
> to make the right decision for the VM. No functional changes.
> Adds stage2_pgd_size(kvm) to replace S2_PGD_SIZE. Also, moves
> some of the definitions dependent on kvm instance to asm/kvm_mmu.h
> for arm32. In that process drop the _AC() specifier constants
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> Changes since V2:
> - Update commit description about the movement to asm/kvm_mmu.h
> for arm32
> - Drop _AC() specifiers
> ---
> arch/arm/include/asm/kvm_arm.h | 3 +-
> arch/arm/include/asm/kvm_mmu.h | 15 +++-
> arch/arm/include/asm/stage2_pgtable.h | 42 ++++-----
> arch/arm64/include/asm/kvm_mmu.h | 7 +-
> arch/arm64/include/asm/stage2_pgtable-nopmd.h | 18 ++--
> arch/arm64/include/asm/stage2_pgtable-nopud.h | 16 ++--
> arch/arm64/include/asm/stage2_pgtable.h | 49 ++++++-----
> virt/kvm/arm/arm.c | 2 +-
> virt/kvm/arm/mmu.c | 119 +++++++++++++-------------
> virt/kvm/arm/vgic/vgic-kvm-device.c | 2 +-
> 10 files changed, 148 insertions(+), 125 deletions(-)
>
> diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
> index 3ab8b37..c3f1f9b 100644
> --- a/arch/arm/include/asm/kvm_arm.h
> +++ b/arch/arm/include/asm/kvm_arm.h
> @@ -133,8 +133,7 @@
> * space.
> */
> #define KVM_PHYS_SHIFT (40)
> -#define KVM_PHYS_SIZE (_AC(1, ULL) << KVM_PHYS_SHIFT)
> -#define KVM_PHYS_MASK (KVM_PHYS_SIZE - _AC(1, ULL))
> +
> #define PTRS_PER_S2_PGD (_AC(1, ULL) << (KVM_PHYS_SHIFT - 30))
>
> /* Virtualization Translation Control Register (VTCR) bits */
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 8553d68..f36eb20 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -36,15 +36,19 @@
> })
>
> /*
> - * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation levels.
> + * kvm_mmu_cache_min_pages() is the number of stage2 page
> + * table translation levels, excluding the top level, for
> + * the given VM. Since we have a 3 level page-table, this
> + * is fixed.
> */
> -#define KVM_MMU_CACHE_MIN_PAGES 2
> +#define kvm_mmu_cache_min_pages(kvm) 2
nit: In addition to Marc's comment, I can see it defined in
stage2_pgtable.h on the arm64 side. Can't we align the two?
>
> #ifndef __ASSEMBLY__
>
> #include <linux/highmem.h>
> #include <asm/cacheflush.h>
> #include <asm/cputype.h>
> +#include <asm/kvm_arm.h>
> #include <asm/kvm_hyp.h>
> #include <asm/pgalloc.h>
> #include <asm/stage2_pgtable.h>
> @@ -52,6 +56,13 @@
> /* Ensure compatibility with arm64 */
> #define VA_BITS 32
>
> +#define kvm_phys_shift(kvm) KVM_PHYS_SHIFT
> +#define kvm_phys_size(kvm) (1ULL << kvm_phys_shift(kvm))
> +#define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - 1ULL)
> +#define kvm_vttbr_baddr_mask(kvm) VTTBR_BADDR_MASK
> +
> +#define stage2_pgd_size(kvm) (PTRS_PER_S2_PGD * sizeof(pgd_t))
> +
> int create_hyp_mappings(void *from, void *to, pgprot_t prot);
> int create_hyp_io_mappings(phys_addr_t phys_addr, size_t size,
> void __iomem **kaddr,
> diff --git a/arch/arm/include/asm/stage2_pgtable.h b/arch/arm/include/asm/stage2_pgtable.h
> index 460d616..e22ae94 100644
> --- a/arch/arm/include/asm/stage2_pgtable.h
> +++ b/arch/arm/include/asm/stage2_pgtable.h
> @@ -19,43 +19,45 @@
> #ifndef __ARM_S2_PGTABLE_H_
> #define __ARM_S2_PGTABLE_H_
>
> -#define stage2_pgd_none(pgd) pgd_none(pgd)
> -#define stage2_pgd_clear(pgd) pgd_clear(pgd)
> -#define stage2_pgd_present(pgd) pgd_present(pgd)
> -#define stage2_pgd_populate(pgd, pud) pgd_populate(NULL, pgd, pud)
> -#define stage2_pud_offset(pgd, address) pud_offset(pgd, address)
> -#define stage2_pud_free(pud) pud_free(NULL, pud)
> +#define stage2_pgd_none(kvm, pgd) pgd_none(pgd)
> +#define stage2_pgd_clear(kvm, pgd) pgd_clear(pgd)
> +#define stage2_pgd_present(kvm, pgd) pgd_present(pgd)
> +#define stage2_pgd_populate(kvm, pgd, pud) pgd_populate(NULL, pgd, pud)
> +#define stage2_pud_offset(kvm, pgd, address) pud_offset(pgd, address)
> +#define stage2_pud_free(kvm, pud) pud_free(NULL, pud)
>
> -#define stage2_pud_none(pud) pud_none(pud)
> -#define stage2_pud_clear(pud) pud_clear(pud)
> -#define stage2_pud_present(pud) pud_present(pud)
> -#define stage2_pud_populate(pud, pmd) pud_populate(NULL, pud, pmd)
> -#define stage2_pmd_offset(pud, address) pmd_offset(pud, address)
> -#define stage2_pmd_free(pmd) pmd_free(NULL, pmd)
> +#define stage2_pud_none(kvm, pud) pud_none(pud)
> +#define stage2_pud_clear(kvm, pud) pud_clear(pud)
> +#define stage2_pud_present(kvm, pud) pud_present(pud)
> +#define stage2_pud_populate(kvm, pud, pmd) pud_populate(NULL, pud, pmd)
> +#define stage2_pmd_offset(kvm, pud, address) pmd_offset(pud, address)
> +#define stage2_pmd_free(kvm, pmd) pmd_free(NULL, pmd)
>
> -#define stage2_pud_huge(pud) pud_huge(pud)
> +#define stage2_pud_huge(kvm, pud) pud_huge(pud)
>
> /* Open coded p*d_addr_end that can deal with 64bit addresses */
> -static inline phys_addr_t stage2_pgd_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> {
> phys_addr_t boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
>
> return (boundary - 1 < end - 1) ? boundary : end;
> }
>
> -#define stage2_pud_addr_end(addr, end) (end)
> +#define stage2_pud_addr_end(kvm, addr, end) (end)
>
> -static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> {
> phys_addr_t boundary = (addr + PMD_SIZE) & PMD_MASK;
>
> return (boundary - 1 < end - 1) ? boundary : end;
> }
>
> -#define stage2_pgd_index(addr) pgd_index(addr)
> +#define stage2_pgd_index(kvm, addr) pgd_index(addr)
>
> -#define stage2_pte_table_empty(ptep) kvm_page_empty(ptep)
> -#define stage2_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
> -#define stage2_pud_table_empty(pudp) false
> +#define stage2_pte_table_empty(kvm, ptep) kvm_page_empty(ptep)
> +#define stage2_pmd_table_empty(kvm, pmdp) kvm_page_empty(pmdp)
> +#define stage2_pud_table_empty(kvm, pudp) false
>
> #endif /* __ARM_S2_PGTABLE_H_ */
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index fb9a712..5da8f52 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -141,8 +141,11 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
> * We currently only support a 40bit IPA.
> */
> #define KVM_PHYS_SHIFT (40)
> -#define KVM_PHYS_SIZE (1UL << KVM_PHYS_SHIFT)
> -#define KVM_PHYS_MASK (KVM_PHYS_SIZE - 1UL)
> +
> +#define kvm_phys_shift(kvm) KVM_PHYS_SHIFT
> +#define kvm_phys_size(kvm) (_AC(1, ULL) << kvm_phys_shift(kvm))
Can't you get rid of _AC() also in arm64 case?

> +#define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - _AC(1, ULL))
> +#define kvm_vttbr_baddr_mask(kvm) VTTBR_BADDR_MASK
>
> #include <asm/stage2_pgtable.h>
>
> diff --git a/arch/arm64/include/asm/stage2_pgtable-nopmd.h b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
> index 2656a0f..0280ded 100644
> --- a/arch/arm64/include/asm/stage2_pgtable-nopmd.h
> +++ b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
> @@ -26,17 +26,17 @@
> #define S2_PMD_SIZE (1UL << S2_PMD_SHIFT)
> #define S2_PMD_MASK (~(S2_PMD_SIZE-1))
>
> -#define stage2_pud_none(pud) (0)
> -#define stage2_pud_present(pud) (1)
> -#define stage2_pud_clear(pud) do { } while (0)
> -#define stage2_pud_populate(pud, pmd) do { } while (0)
> -#define stage2_pmd_offset(pud, address) ((pmd_t *)(pud))
> +#define stage2_pud_none(kvm, pud) (0)
> +#define stage2_pud_present(kvm, pud) (1)
> +#define stage2_pud_clear(kvm, pud) do { } while (0)
> +#define stage2_pud_populate(kvm, pud, pmd) do { } while (0)
> +#define stage2_pmd_offset(kvm, pud, address) ((pmd_t *)(pud))
>
> -#define stage2_pmd_free(pmd) do { } while (0)
> +#define stage2_pmd_free(kvm, pmd) do { } while (0)
>
> -#define stage2_pmd_addr_end(addr, end) (end)
> +#define stage2_pmd_addr_end(kvm, addr, end) (end)
>
> -#define stage2_pud_huge(pud) (0)
> -#define stage2_pmd_table_empty(pmdp) (0)
> +#define stage2_pud_huge(kvm, pud) (0)
> +#define stage2_pmd_table_empty(kvm, pmdp) (0)
>
> #endif
> diff --git a/arch/arm64/include/asm/stage2_pgtable-nopud.h b/arch/arm64/include/asm/stage2_pgtable-nopud.h
> index 5ee87b5..cd6304e 100644
> --- a/arch/arm64/include/asm/stage2_pgtable-nopud.h
> +++ b/arch/arm64/include/asm/stage2_pgtable-nopud.h
> @@ -24,16 +24,16 @@
> #define S2_PUD_SIZE (_AC(1, UL) << S2_PUD_SHIFT)
> #define S2_PUD_MASK (~(S2_PUD_SIZE-1))
>
> -#define stage2_pgd_none(pgd) (0)
> -#define stage2_pgd_present(pgd) (1)
> -#define stage2_pgd_clear(pgd) do { } while (0)
> -#define stage2_pgd_populate(pgd, pud) do { } while (0)
> +#define stage2_pgd_none(kvm, pgd) (0)
> +#define stage2_pgd_present(kvm, pgd) (1)
> +#define stage2_pgd_clear(kvm, pgd) do { } while (0)
> +#define stage2_pgd_populate(kvm, pgd, pud) do { } while (0)
>
> -#define stage2_pud_offset(pgd, address) ((pud_t *)(pgd))
> +#define stage2_pud_offset(kvm, pgd, address) ((pud_t *)(pgd))
>
> -#define stage2_pud_free(x) do { } while (0)
> +#define stage2_pud_free(kvm, x) do { } while (0)
>
> -#define stage2_pud_addr_end(addr, end) (end)
> -#define stage2_pud_table_empty(pmdp) (0)
> +#define stage2_pud_addr_end(kvm, addr, end) (end)
> +#define stage2_pud_table_empty(kvm, pmdp) (0)
>
> #endif
> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
> index 8b68099..057a405 100644
> --- a/arch/arm64/include/asm/stage2_pgtable.h
> +++ b/arch/arm64/include/asm/stage2_pgtable.h
> @@ -65,10 +65,10 @@
> #define PTRS_PER_S2_PGD (1 << (KVM_PHYS_SHIFT - S2_PGDIR_SHIFT))
>
> /*
> - * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation
> + * kvm_mmu_cache_min_pages is the number of stage2 page table translation
> * levels in addition to the PGD.
> */
> -#define KVM_MMU_CACHE_MIN_PAGES (STAGE2_PGTABLE_LEVELS - 1)
> +#define kvm_mmu_cache_min_pages(kvm) (STAGE2_PGTABLE_LEVELS - 1)
>
>
> #if STAGE2_PGTABLE_LEVELS > 3
> @@ -77,16 +77,17 @@
> #define S2_PUD_SIZE (_AC(1, UL) << S2_PUD_SHIFT)
> #define S2_PUD_MASK (~(S2_PUD_SIZE - 1))
>
> -#define stage2_pgd_none(pgd) pgd_none(pgd)
> -#define stage2_pgd_clear(pgd) pgd_clear(pgd)
> -#define stage2_pgd_present(pgd) pgd_present(pgd)
> -#define stage2_pgd_populate(pgd, pud) pgd_populate(NULL, pgd, pud)
> -#define stage2_pud_offset(pgd, address) pud_offset(pgd, address)
> -#define stage2_pud_free(pud) pud_free(NULL, pud)
> +#define stage2_pgd_none(kvm, pgd) pgd_none(pgd)
> +#define stage2_pgd_clear(kvm, pgd) pgd_clear(pgd)
> +#define stage2_pgd_present(kvm, pgd) pgd_present(pgd)
> +#define stage2_pgd_populate(kvm, pgd, pud) pgd_populate(NULL, pgd, pud)
> +#define stage2_pud_offset(kvm, pgd, address) pud_offset(pgd, address)
> +#define stage2_pud_free(kvm, pud) pud_free(NULL, pud)
>
> -#define stage2_pud_table_empty(pudp) kvm_page_empty(pudp)
> +#define stage2_pud_table_empty(kvm, pudp) kvm_page_empty(pudp)
>
> -static inline phys_addr_t stage2_pud_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pud_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> {
> phys_addr_t boundary = (addr + S2_PUD_SIZE) & S2_PUD_MASK;
>
> @@ -102,17 +103,18 @@ static inline phys_addr_t stage2_pud_addr_end(phys_addr_t addr, phys_addr_t end)
> #define S2_PMD_SIZE (_AC(1, UL) << S2_PMD_SHIFT)
> #define S2_PMD_MASK (~(S2_PMD_SIZE - 1))
>
> -#define stage2_pud_none(pud) pud_none(pud)
> -#define stage2_pud_clear(pud) pud_clear(pud)
> -#define stage2_pud_present(pud) pud_present(pud)
> -#define stage2_pud_populate(pud, pmd) pud_populate(NULL, pud, pmd)
> -#define stage2_pmd_offset(pud, address) pmd_offset(pud, address)
> -#define stage2_pmd_free(pmd) pmd_free(NULL, pmd)
> +#define stage2_pud_none(kvm, pud) pud_none(pud)
> +#define stage2_pud_clear(kvm, pud) pud_clear(pud)
> +#define stage2_pud_present(kvm, pud) pud_present(pud)
> +#define stage2_pud_populate(kvm, pud, pmd) pud_populate(NULL, pud, pmd)
> +#define stage2_pmd_offset(kvm, pud, address) pmd_offset(pud, address)
> +#define stage2_pmd_free(kvm, pmd) pmd_free(NULL, pmd)
>
> -#define stage2_pud_huge(pud) pud_huge(pud)
> -#define stage2_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
> +#define stage2_pud_huge(kvm, pud) pud_huge(pud)
> +#define stage2_pmd_table_empty(kvm, pmdp) kvm_page_empty(pmdp)
>
> -static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> {
> phys_addr_t boundary = (addr + S2_PMD_SIZE) & S2_PMD_MASK;
>
> @@ -121,7 +123,7 @@ static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
>
> #endif /* STAGE2_PGTABLE_LEVELS > 2 */
>
> -#define stage2_pte_table_empty(ptep) kvm_page_empty(ptep)
> +#define stage2_pte_table_empty(kvm, ptep) kvm_page_empty(ptep)
>
> #if STAGE2_PGTABLE_LEVELS == 2
> #include <asm/stage2_pgtable-nopmd.h>
> @@ -129,10 +131,13 @@ static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
> #include <asm/stage2_pgtable-nopud.h>
> #endif
>
> +#define stage2_pgd_size(kvm) (PTRS_PER_S2_PGD * sizeof(pgd_t))
>
> -#define stage2_pgd_index(addr) (((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
> +#define stage2_pgd_index(kvm, addr) \
> + (((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
>
> -static inline phys_addr_t stage2_pgd_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> {
> phys_addr_t boundary = (addr + S2_PGDIR_SIZE) & S2_PGDIR_MASK;
>
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index 04e554c..d2637bb 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -538,7 +538,7 @@ static void update_vttbr(struct kvm *kvm)
>
> /* update vttbr to be used with the new vmid */
> pgd_phys = virt_to_phys(kvm->arch.pgd);
> - BUG_ON(pgd_phys & ~VTTBR_BADDR_MASK);
> + BUG_ON(pgd_phys & ~kvm_vttbr_baddr_mask(kvm));
> vmid = ((u64)(kvm->arch.vmid) << VTTBR_VMID_SHIFT) & VTTBR_VMID_MASK(kvm_vmid_bits);
> kvm->arch.vttbr = kvm_phys_to_vttbr(pgd_phys) | vmid;
>
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 308171c..82dd571 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -45,7 +45,6 @@ static phys_addr_t hyp_idmap_vector;
>
> static unsigned long io_map_base;
>
> -#define S2_PGD_SIZE (PTRS_PER_S2_PGD * sizeof(pgd_t))
> #define hyp_pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t))
>
> #define KVM_S2PTE_FLAG_IS_IOMAP (1UL << 0)
> @@ -150,20 +149,20 @@ static void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
>
> static void clear_stage2_pgd_entry(struct kvm *kvm, pgd_t *pgd, phys_addr_t addr)
> {
> - pud_t *pud_table __maybe_unused = stage2_pud_offset(pgd, 0UL);
> - stage2_pgd_clear(pgd);
> + pud_t *pud_table __maybe_unused = stage2_pud_offset(kvm, pgd, 0UL);
> + stage2_pgd_clear(kvm, pgd);
> kvm_tlb_flush_vmid_ipa(kvm, addr);
> - stage2_pud_free(pud_table);
> + stage2_pud_free(kvm, pud_table);
> put_page(virt_to_page(pgd));
> }
>
> static void clear_stage2_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
> {
> - pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(pud, 0);
> - VM_BUG_ON(stage2_pud_huge(*pud));
> - stage2_pud_clear(pud);
> + pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(kvm, pud, 0);
> + VM_BUG_ON(stage2_pud_huge(kvm, *pud));
> + stage2_pud_clear(kvm, pud);
> kvm_tlb_flush_vmid_ipa(kvm, addr);
> - stage2_pmd_free(pmd_table);
> + stage2_pmd_free(kvm, pmd_table);
> put_page(virt_to_page(pud));
> }
>
> @@ -219,7 +218,7 @@ static void unmap_stage2_ptes(struct kvm *kvm, pmd_t *pmd,
> }
> } while (pte++, addr += PAGE_SIZE, addr != end);
>
> - if (stage2_pte_table_empty(start_pte))
> + if (stage2_pte_table_empty(kvm, start_pte))
> clear_stage2_pmd_entry(kvm, pmd, start_addr);
> }
>
> @@ -229,9 +228,9 @@ static void unmap_stage2_pmds(struct kvm *kvm, pud_t *pud,
> phys_addr_t next, start_addr = addr;
> pmd_t *pmd, *start_pmd;
>
> - start_pmd = pmd = stage2_pmd_offset(pud, addr);
> + start_pmd = pmd = stage2_pmd_offset(kvm, pud, addr);
> do {
> - next = stage2_pmd_addr_end(addr, end);
> + next = stage2_pmd_addr_end(kvm, addr, end);
> if (!pmd_none(*pmd)) {
> if (pmd_thp_or_huge(*pmd)) {
> pmd_t old_pmd = *pmd;
> @@ -248,7 +247,7 @@ static void unmap_stage2_pmds(struct kvm *kvm, pud_t *pud,
> }
> } while (pmd++, addr = next, addr != end);
>
> - if (stage2_pmd_table_empty(start_pmd))
> + if (stage2_pmd_table_empty(kvm, start_pmd))
> clear_stage2_pud_entry(kvm, pud, start_addr);
> }
>
> @@ -258,14 +257,14 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd,
> phys_addr_t next, start_addr = addr;
> pud_t *pud, *start_pud;
>
> - start_pud = pud = stage2_pud_offset(pgd, addr);
> + start_pud = pud = stage2_pud_offset(kvm, pgd, addr);
> do {
> - next = stage2_pud_addr_end(addr, end);
> - if (!stage2_pud_none(*pud)) {
> - if (stage2_pud_huge(*pud)) {
> + next = stage2_pud_addr_end(kvm, addr, end);
> + if (!stage2_pud_none(kvm, *pud)) {
> + if (stage2_pud_huge(kvm, *pud)) {
> pud_t old_pud = *pud;
>
> - stage2_pud_clear(pud);
> + stage2_pud_clear(kvm, pud);
> kvm_tlb_flush_vmid_ipa(kvm, addr);
> kvm_flush_dcache_pud(old_pud);
> put_page(virt_to_page(pud));
> @@ -275,7 +274,7 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd,
> }
> } while (pud++, addr = next, addr != end);
>
> - if (stage2_pud_table_empty(start_pud))
> + if (stage2_pud_table_empty(kvm, start_pud))
> clear_stage2_pgd_entry(kvm, pgd, start_addr);
> }
>
> @@ -299,7 +298,7 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
> assert_spin_locked(&kvm->mmu_lock);
> WARN_ON(size & ~PAGE_MASK);
>
> - pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> + pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
> do {
> /*
> * Make sure the page table is still active, as another thread
> @@ -308,8 +307,8 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
> */
> if (!READ_ONCE(kvm->arch.pgd))
> break;
> - next = stage2_pgd_addr_end(addr, end);
> - if (!stage2_pgd_none(*pgd))
> + next = stage2_pgd_addr_end(kvm, addr, end);
> + if (!stage2_pgd_none(kvm, *pgd))
> unmap_stage2_puds(kvm, pgd, addr, next);
> /*
> * If the range is too large, release the kvm->mmu_lock
> @@ -338,9 +337,9 @@ static void stage2_flush_pmds(struct kvm *kvm, pud_t *pud,
> pmd_t *pmd;
> phys_addr_t next;
>
> - pmd = stage2_pmd_offset(pud, addr);
> + pmd = stage2_pmd_offset(kvm, pud, addr);
> do {
> - next = stage2_pmd_addr_end(addr, end);
> + next = stage2_pmd_addr_end(kvm, addr, end);
> if (!pmd_none(*pmd)) {
> if (pmd_thp_or_huge(*pmd))
> kvm_flush_dcache_pmd(*pmd);
> @@ -356,11 +355,11 @@ static void stage2_flush_puds(struct kvm *kvm, pgd_t *pgd,
> pud_t *pud;
> phys_addr_t next;
>
> - pud = stage2_pud_offset(pgd, addr);
> + pud = stage2_pud_offset(kvm, pgd, addr);
> do {
> - next = stage2_pud_addr_end(addr, end);
> - if (!stage2_pud_none(*pud)) {
> - if (stage2_pud_huge(*pud))
> + next = stage2_pud_addr_end(kvm, addr, end);
> + if (!stage2_pud_none(kvm, *pud)) {
> + if (stage2_pud_huge(kvm, *pud))
> kvm_flush_dcache_pud(*pud);
> else
> stage2_flush_pmds(kvm, pud, addr, next);
> @@ -376,10 +375,10 @@ static void stage2_flush_memslot(struct kvm *kvm,
> phys_addr_t next;
> pgd_t *pgd;
>
> - pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> + pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
> do {
> - next = stage2_pgd_addr_end(addr, end);
> - if (!stage2_pgd_none(*pgd))
> + next = stage2_pgd_addr_end(kvm, addr, end);
> + if (!stage2_pgd_none(kvm, *pgd))
> stage2_flush_puds(kvm, pgd, addr, next);
> } while (pgd++, addr = next, addr != end);
> }
> @@ -869,7 +868,7 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
> }
>
> /* Allocate the HW PGD, making sure that each page gets its own refcount */
> - pgd = alloc_pages_exact(S2_PGD_SIZE, GFP_KERNEL | __GFP_ZERO);
> + pgd = alloc_pages_exact(stage2_pgd_size(kvm), GFP_KERNEL | __GFP_ZERO);
> if (!pgd)
> return -ENOMEM;
>
> @@ -958,7 +957,7 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
>
> spin_lock(&kvm->mmu_lock);
> if (kvm->arch.pgd) {
> - unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
> + unmap_stage2_range(kvm, 0, kvm_phys_size(kvm));
> pgd = READ_ONCE(kvm->arch.pgd);
> kvm->arch.pgd = NULL;
> }
> @@ -966,7 +965,7 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
>
> /* Free the HW pgd, one page at a time */
> if (pgd)
> - free_pages_exact(pgd, S2_PGD_SIZE);
> + free_pages_exact(pgd, stage2_pgd_size(kvm));
> }
>
> static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
> @@ -975,16 +974,16 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
> pgd_t *pgd;
> pud_t *pud;
>
> - pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> - if (stage2_pgd_none(*pgd)) {
> + pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
> + if (stage2_pgd_none(kvm, *pgd)) {
> if (!cache)
> return NULL;
> pud = mmu_memory_cache_alloc(cache);
> - stage2_pgd_populate(pgd, pud);
> + stage2_pgd_populate(kvm, pgd, pud);
> get_page(virt_to_page(pgd));
> }
>
> - return stage2_pud_offset(pgd, addr);
> + return stage2_pud_offset(kvm, pgd, addr);
> }
>
> static pmd_t *stage2_get_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
> @@ -997,15 +996,15 @@ static pmd_t *stage2_get_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
> if (!pud)
> return NULL;
>
> - if (stage2_pud_none(*pud)) {
> + if (stage2_pud_none(kvm, *pud)) {
> if (!cache)
> return NULL;
> pmd = mmu_memory_cache_alloc(cache);
> - stage2_pud_populate(pud, pmd);
> + stage2_pud_populate(kvm, pud, pmd);
> get_page(virt_to_page(pud));
> }
>
> - return stage2_pmd_offset(pud, addr);
> + return stage2_pmd_offset(kvm, pud, addr);
> }
>
> static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
> @@ -1159,8 +1158,9 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> if (writable)
> pte = kvm_s2pte_mkwrite(pte);
>
> - ret = mmu_topup_memory_cache(&cache, KVM_MMU_CACHE_MIN_PAGES,
> - KVM_NR_MEM_OBJS);
> + ret = mmu_topup_memory_cache(&cache,
> + kvm_mmu_cache_min_pages(kvm),
> + KVM_NR_MEM_OBJS);
> if (ret)
> goto out;
> spin_lock(&kvm->mmu_lock);
> @@ -1248,19 +1248,21 @@ static void stage2_wp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)
>
> /**
> * stage2_wp_pmds - write protect PUD range
> + * @kvm: kvm instance for the VM
> * @pud: pointer to pud entry
> * @addr: range start address
> * @end: range end address
> */
> -static void stage2_wp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
> +static void stage2_wp_pmds(struct kvm *kvm, pud_t *pud,
> + phys_addr_t addr, phys_addr_t end)
> {
> pmd_t *pmd;
> phys_addr_t next;
>
> - pmd = stage2_pmd_offset(pud, addr);
> + pmd = stage2_pmd_offset(kvm, pud, addr);
>
> do {
> - next = stage2_pmd_addr_end(addr, end);
> + next = stage2_pmd_addr_end(kvm, addr, end);
> if (!pmd_none(*pmd)) {
> if (pmd_thp_or_huge(*pmd)) {
> if (!kvm_s2pmd_readonly(pmd))
> @@ -1280,18 +1282,19 @@ static void stage2_wp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
> *
> * Process PUD entries, for a huge PUD we cause a panic.
> */
> -static void stage2_wp_puds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
> +static void stage2_wp_puds(struct kvm *kvm, pgd_t *pgd,
> + phys_addr_t addr, phys_addr_t end)
> {
> pud_t *pud;
> phys_addr_t next;
>
> - pud = stage2_pud_offset(pgd, addr);
> + pud = stage2_pud_offset(kvm, pgd, addr);
> do {
> - next = stage2_pud_addr_end(addr, end);
> - if (!stage2_pud_none(*pud)) {
> + next = stage2_pud_addr_end(kvm, addr, end);
> + if (!stage2_pud_none(kvm, *pud)) {
> /* TODO:PUD not supported, revisit later if supported */
> - BUG_ON(stage2_pud_huge(*pud));
> - stage2_wp_pmds(pud, addr, next);
> + BUG_ON(stage2_pud_huge(kvm, *pud));
> + stage2_wp_pmds(kvm, pud, addr, next);
> }
> } while (pud++, addr = next, addr != end);
> }
> @@ -1307,7 +1310,7 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> pgd_t *pgd;
> phys_addr_t next;
>
> - pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> + pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
> do {
> /*
> * Release kvm_mmu_lock periodically if the memory region is
> @@ -1321,9 +1324,9 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> cond_resched_lock(&kvm->mmu_lock);
> if (!READ_ONCE(kvm->arch.pgd))
> break;
> - next = stage2_pgd_addr_end(addr, end);
> - if (stage2_pgd_present(*pgd))
> - stage2_wp_puds(pgd, addr, next);
> + next = stage2_pgd_addr_end(kvm, addr, end);
> + if (stage2_pgd_present(kvm, *pgd))
> + stage2_wp_puds(kvm, pgd, addr, next);
> } while (pgd++, addr = next, addr != end);
> }
>
> @@ -1472,7 +1475,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> up_read(&current->mm->mmap_sem);
>
> /* We need minimum second+third level pages */
> - ret = mmu_topup_memory_cache(memcache, KVM_MMU_CACHE_MIN_PAGES,
> + ret = mmu_topup_memory_cache(memcache, kvm_mmu_cache_min_pages(kvm),
> KVM_NR_MEM_OBJS);
> if (ret)
> return ret;
> @@ -1715,7 +1718,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
> }
>
> /* Userspace should not be able to register out-of-bounds IPAs */
> - VM_BUG_ON(fault_ipa >= KVM_PHYS_SIZE);
> + VM_BUG_ON(fault_ipa >= kvm_phys_size(vcpu->kvm));
>
> if (fault_status == FSC_ACCESS) {
> handle_access_fault(vcpu, fault_ipa);
> @@ -2019,7 +2022,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> * space addressable by the KVM guest IPA space.
> */
> if (memslot->base_gfn + memslot->npages >=
> - (KVM_PHYS_SIZE >> PAGE_SHIFT))
> + (kvm_phys_size(kvm) >> PAGE_SHIFT))
> return -EFAULT;
>
> down_read(&current->mm->mmap_sem);
> diff --git a/virt/kvm/arm/vgic/vgic-kvm-device.c b/virt/kvm/arm/vgic/vgic-kvm-device.c
> index 6ada243..114dce9 100644
> --- a/virt/kvm/arm/vgic/vgic-kvm-device.c
> +++ b/virt/kvm/arm/vgic/vgic-kvm-device.c
> @@ -25,7 +25,7 @@
> int vgic_check_ioaddr(struct kvm *kvm, phys_addr_t *ioaddr,
> phys_addr_t addr, phys_addr_t alignment)
> {
> - if (addr & ~KVM_PHYS_MASK)
> + if (addr & ~kvm_phys_mask(kvm))
> return -E2BIG;
>
> if (!IS_ALIGNED(addr, alignment))
>
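
To summarise the new convention in one place (a sketch distilled from
the unmap_stage2_range() hunk above; stage2_walk_sketch is a made-up
name, not part of the patch):

	static void stage2_walk_sketch(struct kvm *kvm, phys_addr_t addr, u64 size)
	{
		pgd_t *pgd;
		phys_addr_t end = addr + size;

		/* every stage2_* helper now takes the kvm instance as well */
		pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
		do {
			phys_addr_t next = stage2_pgd_addr_end(kvm, addr, end);

			if (!stage2_pgd_none(kvm, *pgd))
				unmap_stage2_puds(kvm, pgd, addr, next);
		} while (pgd++, addr = next, addr != end);
	}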

Thanks

Eric

2018-07-02 12:18:56

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v3 13/20] kvm: arm64: Configure VTCR per VM

On 29/06/18 12:15, Suzuki K Poulose wrote:
> We set VTCR_EL2 very early during the stage2 init and don't
> touch it ever. This is fine as we had a fixed IPA size. This
> patch changes the behavior to set the VTCR for a given VM,
> depending on its stage2 table. The common configuration for
> VTCR is still performed during the early init as we have to
> retain the hardware access flag update bits (VTCR_EL2_HA)
> per CPU (as they are only set for the CPUs which are capabile).

capable

> The bits defining the number of levels in the page table (SL0)
> and the size of the input address to the translation (T0SZ)
> are programmed for each VM upon entry to the guest.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> Change since V2:
> - Load VTCR for TLB operations
> ---
> arch/arm64/include/asm/kvm_arm.h | 19 +++++++++----------
> arch/arm64/include/asm/kvm_asm.h | 2 +-
> arch/arm64/include/asm/kvm_host.h | 9 ++++++---
> arch/arm64/include/asm/kvm_hyp.h | 11 +++++++++++
> arch/arm64/kvm/hyp/s2-setup.c | 17 +----------------
> 5 files changed, 28 insertions(+), 30 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index 11a7db0..b02c316 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -120,9 +120,7 @@
> #define VTCR_EL2_IRGN0_WBWA TCR_IRGN0_WBWA
> #define VTCR_EL2_SL0_SHIFT 6
> #define VTCR_EL2_SL0_MASK (3 << VTCR_EL2_SL0_SHIFT)
> -#define VTCR_EL2_SL0_LVL1 (1 << VTCR_EL2_SL0_SHIFT)
> #define VTCR_EL2_T0SZ_MASK 0x3f
> -#define VTCR_EL2_T0SZ_40B 24
> #define VTCR_EL2_VS_SHIFT 19
> #define VTCR_EL2_VS_8BIT (0 << VTCR_EL2_VS_SHIFT)
> #define VTCR_EL2_VS_16BIT (1 << VTCR_EL2_VS_SHIFT)
> @@ -137,43 +135,44 @@
> * VTCR_EL2.PS is extracted from ID_AA64MMFR0_EL1.PARange at boot time
> * (see hyp-init.S).
> *
> + * VTCR_EL2.SL0 and T0SZ are configured per VM at runtime before switching to
> + * the VM.
> + *
> * Note that when using 4K pages, we concatenate two first level page tables
> * together. With 16K pages, we concatenate 16 first level page tables.
> *
> */
>
> -#define VTCR_EL2_T0SZ_IPA VTCR_EL2_T0SZ_40B
> #define VTCR_EL2_COMMON_BITS (VTCR_EL2_SH0_INNER | VTCR_EL2_ORGN0_WBWA | \
> VTCR_EL2_IRGN0_WBWA | VTCR_EL2_RES1)
> +#define VTCR_EL2_PRIVATE_MASK (VTCR_EL2_SL0_MASK | VTCR_EL2_T0SZ_MASK)

What does "private" mean here? It really is the IPA configuration, so
I'd rather have a naming that reflects that.

> #ifdef CONFIG_ARM64_64K_PAGES
> /*
> * Stage2 translation configuration:
> * 64kB pages (TG0 = 1)
> - * 2 level page tables (SL = 1)
> */
> -#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
> +#define VTCR_EL2_TGRAN VTCR_EL2_TG0_64K
> #define VTCR_EL2_TGRAN_SL0_BASE 3UL
>
> #elif defined(CONFIG_ARM64_16K_PAGES)
> /*
> * Stage2 translation configuration:
> * 16kB pages (TG0 = 2)
> - * 2 level page tables (SL = 1)
> */
> -#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_16K | VTCR_EL2_SL0_LVL1)
> +#define VTCR_EL2_TGRAN VTCR_EL2_TG0_16K
> #define VTCR_EL2_TGRAN_SL0_BASE 3UL
> #else /* 4K */
> /*
> * Stage2 translation configuration:
> * 4kB pages (TG0 = 0)
> - * 3 level page tables (SL = 1)
> */
> -#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_4K | VTCR_EL2_SL0_LVL1)
> +#define VTCR_EL2_TGRAN VTCR_EL2_TG0_4K
> #define VTCR_EL2_TGRAN_SL0_BASE 2UL
> #endif
>
> -#define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN_FLAGS)
> +#define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN)
> +
> /*
> * VTCR_EL2:SL0 indicates the entry level for Stage2 translation.
> * Interestingly, it depends on the page size.
> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> index 102b5a5..91372eb 100644
> --- a/arch/arm64/include/asm/kvm_asm.h
> +++ b/arch/arm64/include/asm/kvm_asm.h
> @@ -72,7 +72,7 @@ extern void __vgic_v3_init_lrs(void);
>
> extern u32 __kvm_get_mdcr_el2(void);
>
> -extern u32 __init_stage2_translation(void);
> +extern void __init_stage2_translation(void);
>
> /* Home-grown __this_cpu_{ptr,read} variants that always work at HYP */
> #define __hyp_this_cpu_ptr(sym) \
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index fe8777b..328f472 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -442,10 +442,13 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>
> static inline void __cpu_init_stage2(void)
> {
> - u32 parange = kvm_call_hyp(__init_stage2_translation);
> + u32 ps;
>
> - WARN_ONCE(parange < 40,
> - "PARange is %d bits, unsupported configuration!", parange);
> + kvm_call_hyp(__init_stage2_translation);
> + /* Sanity check for minimum IPA size support */
> + ps = id_aa64mmfr0_parange_to_phys_shift(read_sysreg(id_aa64mmfr0_el1) & 0x7);
> + WARN_ONCE(ps < 40,
> + "PARange is %d bits, unsupported configuration!", ps);
> }
>
> /* Guest/host FPSIMD coordination helpers */
> diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
> index 82f9994..3e8052d1 100644
> --- a/arch/arm64/include/asm/kvm_hyp.h
> +++ b/arch/arm64/include/asm/kvm_hyp.h
> @@ -20,6 +20,7 @@
>
> #include <linux/compiler.h>
> #include <linux/kvm_host.h>
> +#include <asm/kvm_mmu.h>
> #include <asm/sysreg.h>
>
> #define __hyp_text __section(.hyp.text) notrace
> @@ -158,6 +159,16 @@ void __noreturn __hyp_do_panic(unsigned long, ...);
> /* Must be called from hyp code running at EL2 */
> static __always_inline void __hyp_text __load_guest_stage2(struct kvm *kvm)
> {
> + /*
> + * Configure the VTCR translation control bits
> + * for this VM.
> + */
> + u64 vtcr = read_sysreg(vtcr_el2);
> +
> + vtcr &= ~VTCR_EL2_PRIVATE_MASK;
> + vtcr |= VTCR_EL2_SL0(kvm_stage2_levels(kvm)) |
> + VTCR_EL2_T0SZ(kvm_phys_shift(kvm));
> + write_sysreg(vtcr, vtcr_el2);

Can't we generate the whole vtcr value in one go, without reading it
back? Especially given that on patch 16 you're actually switching to a
per-VM variable, it would make a lot of sense to start with that here.
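
Something like this, perhaps (a sketch, assuming the per-VM copy from
patch 16 is named kvm->arch.vtcr; the per-CPU bits that
__init_stage2_translation sets below, i.e. PARange, VS and HA, would
still need to be folded in when that field is initialised):

	/* At VM init time, compute this VM's VTCR once ... */
	kvm->arch.vtcr = VTCR_EL2_FLAGS |
			 VTCR_EL2_SL0(kvm_stage2_levels(kvm)) |
			 VTCR_EL2_T0SZ(kvm_phys_shift(kvm));

	/* ... so the world switch becomes a plain write, with no read-back */
	static __always_inline void __hyp_text __load_guest_stage2(struct kvm *kvm)
	{
		write_sysreg(kvm->arch.vtcr, vtcr_el2);
		write_sysreg(kvm->arch.vttbr, vttbr_el2);
	}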

> write_sysreg(kvm->arch.vttbr, vttbr_el2);
> }
>
> diff --git a/arch/arm64/kvm/hyp/s2-setup.c b/arch/arm64/kvm/hyp/s2-setup.c
> index 81094f1..6567315 100644
> --- a/arch/arm64/kvm/hyp/s2-setup.c
> +++ b/arch/arm64/kvm/hyp/s2-setup.c
> @@ -19,13 +19,11 @@
> #include <asm/kvm_arm.h>
> #include <asm/kvm_asm.h>
> #include <asm/kvm_hyp.h>
> -#include <asm/cpufeature.h>
>
> -u32 __hyp_text __init_stage2_translation(void)
> +void __hyp_text __init_stage2_translation(void)
> {
> u64 val = VTCR_EL2_FLAGS;
> u64 parange;
> - u32 phys_shift;
> u64 tmp;
>
> /*
> @@ -38,17 +36,6 @@ u32 __hyp_text __init_stage2_translation(void)
> parange = ID_AA64MMFR0_PARANGE_MAX;
> val |= parange << VTCR_EL2_PS_SHIFT;
>
> - /* Compute the actual PARange... */
> - phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange);
> -
> - /*
> - * ... and clamp it to 40 bits, unless we have some braindead
> - * HW that implements less than that. In all cases, we'll
> - * return that value for the rest of the kernel to decide what
> - * to do.
> - */
> - val |= VTCR_EL2_T0SZ(phys_shift > 40 ? 40 : phys_shift);
> -
> /*
> * Check the availability of Hardware Access Flag / Dirty Bit
> * Management in ID_AA64MMFR1_EL1 and enable the feature in VTCR_EL2.
> @@ -67,6 +54,4 @@ u32 __hyp_text __init_stage2_translation(void)
> VTCR_EL2_VS_8BIT;
>
> write_sysreg(val, vtcr_el2);

And then most of the code here could run on a per-VM basis.

> -
> - return phys_shift;
> }
>

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2018-07-02 12:41:38

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH v3 09/20] kvm: arm64: Make stage2 page table layout dynamic

On 29/06/18 12:15, Suzuki K Poulose wrote:
> So far we had a static stage2 page table handling code, based on a
> fixed IPA of 40bits. As we prepare for a configurable IPA size per
> VM, make our stage2 page table code dynamic, to do the right thing
> for a given VM. We ensure the existing condition is always true even
> when we lift the limit on the IPA. i.e,
>
> page table levels in stage1 >= page table levels in stage2
>
> Support for the IPA size configuration needs other changes in the way
> we configure the EL2 registers (VTTBR and VTCR). So, the IPA is still
> fixed to 40bits. The patch also moves the kvm_page_empty() in asm/kvm_mmu.h
> to the top, before including the asm/stage2_pgtable.h to avoid a forward
> declaration.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> Changes since V2
> - Restrict the stage2 page table to allow reusing the host page table
> helpers for now, until we get stage1 independent page table helpers.

...

> -#define stage2_pgd_none(kvm, pgd) pgd_none(pgd)
> -#define stage2_pgd_clear(kvm, pgd) pgd_clear(pgd)
> -#define stage2_pgd_present(kvm, pgd) pgd_present(pgd)
> -#define stage2_pgd_populate(kvm, pgd, pud) pgd_populate(NULL, pgd, pud)
> -#define stage2_pud_offset(kvm, pgd, address) pud_offset(pgd, address)
> -#define stage2_pud_free(kvm, pud) pud_free(NULL, pud)
> +#define __s2_pud_index(addr) \
> + (((addr) >> __S2_PUD_SHIFT) & (PTRS_PER_PTE - 1))
> +#define __s2_pmd_index(addr) \
> + (((addr) >> __S2_PMD_SHIFT) & (PTRS_PER_PTE - 1))
>
> -#define stage2_pud_table_empty(kvm, pudp) kvm_page_empty(pudp)
> +#define __kvm_has_stage2_levels(kvm, min_levels) \
> + ((CONFIG_PGTABLE_LEVELS >= min_levels) && (kvm_stage2_levels(kvm) >= min_levels))

On another look, I have renamed the helpers as follows :

kvm_stage2_has_pud(kvm) => kvm_stage2_has_pmd(kvm)
kvm_stage2_has_pgd(kvm) => kvm_stage2_has_pud(kvm)

below and everywhere.

> +
> +#define kvm_stage2_has_pgd(kvm) __kvm_has_stage2_levels(kvm, 4)
> +#define kvm_stage2_has_pud(kvm) __kvm_has_stage2_levels(kvm, 3)
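
i.e., with the renaming applied, the pair would read (a sketch):

	/* 4 levels => a real PUD level, 3 levels => a real PMD level */
	#define kvm_stage2_has_pud(kvm)	__kvm_has_stage2_levels(kvm, 4)
	#define kvm_stage2_has_pmd(kvm)	__kvm_has_stage2_levels(kvm, 3)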


Suzuki

2018-07-02 12:42:16

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH v3 07/20] kvm: arm/arm64: Prepare for VM specific stage2 translations


Hi Eric,

On 02/07/18 11:51, Auger Eric wrote:
> Hi Suzuki,
>
> On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
>> Right now the stage2 page table for a VM is hard coded, assuming
>> an IPA of 40bits. As we are about to add support for per VM IPA,
>> prepare the stage2 page table helpers to accept the kvm instance
>> to make the right decision for the VM. No functional changes.
>> Adds stage2_pgd_size(kvm) to replace S2_PGD_SIZE. Also, moves
>> some of the definitions dependent on kvm instance to asm/kvm_mmu.h
>> for arm32. In that process drop the _AC() specifier constants
>>
>> Cc: Marc Zyngier <[email protected]>
>> Cc: Christoffer Dall <[email protected]>
>> Signed-off-by: Suzuki K Poulose <[email protected]>
>> ---
>> Changes since V2:
>> - Update commit description abuot the movement to asm/kvm_mmu.h
>> for arm32
>> - Drop _AC() specifiers


>> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
>> index 8553d68..f36eb20 100644
>> --- a/arch/arm/include/asm/kvm_mmu.h
>> +++ b/arch/arm/include/asm/kvm_mmu.h
>> @@ -36,15 +36,19 @@
>> })
>>
>> /*
>> - * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation levels.
>> + * kvm_mmu_cache_min_pages() is the number of stage2 page
>> + * table translation levels, excluding the top level, for
>> + * the given VM. Since we have a 3 level page-table, this
>> + * is fixed.
>> */
>> -#define KVM_MMU_CACHE_MIN_PAGES 2
>> +#define kvm_mmu_cache_min_pages(kvm) 2
> nit: In addition to Marc's comment, I can see it defined in
> stage2_pgtable.h on arm64 side. Can't we align?

Sure, will do that.

>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>> index fb9a712..5da8f52 100644
>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> @@ -141,8 +141,11 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
>> * We currently only support a 40bit IPA.
>> */
>> #define KVM_PHYS_SHIFT (40)
>> -#define KVM_PHYS_SIZE (1UL << KVM_PHYS_SHIFT)
>> -#define KVM_PHYS_MASK (KVM_PHYS_SIZE - 1UL)
>> +
>> +#define kvm_phys_shift(kvm) KVM_PHYS_SHIFT
>> +#define kvm_phys_size(kvm) (_AC(1, ULL) << kvm_phys_shift(kvm))
> Can't you get rid of _AC() also in arm64 case?
>

>> +#define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - _AC(1, ULL))

Yes, I missed that one. I will do it. Thanks for spotting.
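
Presumably the arm64 definitions would then read (a sketch; plain ULL
suffixes replace _AC(), which is only needed in headers shared with
assembly):

	#define kvm_phys_shift(kvm)	KVM_PHYS_SHIFT
	#define kvm_phys_size(kvm)	(1ULL << kvm_phys_shift(kvm))
	#define kvm_phys_mask(kvm)	(kvm_phys_size(kvm) - 1ULL)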

Cheers
Suzuki


2018-07-02 12:45:34

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v3 09/20] kvm: arm64: Make stage2 page table layout dynamic

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> So far we had a static stage2 page table handling code, based on a
> fixed IPA of 40bits. As we prepare for a configurable IPA size per
> VM, make our stage2 page table code dynamic, to do the right thing
> for a given VM. We ensure the existing condition is always true even
> when we lift the limit on the IPA. i.e,
>
> page table levels in stage1 >= page table levels in stage2
>
> Support for the IPA size configuration needs other changes in the way
> we configure the EL2 registers (VTTBR and VTCR). So, the IPA is still
> fixed to 40bits. The patch also moves the kvm_page_empty() in asm/kvm_mmu.h
> to the top, before including the asm/stage2_pgtable.h to avoid a forward
> declaration.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> Changes since V2
> - Restrict the stage2 page table to allow reusing the host page table
> helpers for now, until we get stage1 independent page table helpers.
I would move this up in the commit msg to motivate the fact that we
enforce the above condition.
> ---
> arch/arm64/include/asm/kvm_mmu.h | 14 +-
> arch/arm64/include/asm/stage2_pgtable-nopmd.h | 42 ------
> arch/arm64/include/asm/stage2_pgtable-nopud.h | 39 -----
> arch/arm64/include/asm/stage2_pgtable.h | 207 +++++++++++++++++++-------
> 4 files changed, 159 insertions(+), 143 deletions(-)
> delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopmd.h
> delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopud.h

With my very limited knowledge of S2 page table walkers, I fail to
understand why we can now get rid of stage2_pgtable-nopmd.h and
stage2_pgtable-nopud.h and the associated FOLDED config. Please could you
explain it in the commit message?
>
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index dbaf513..a351722 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -21,6 +21,7 @@
> #include <asm/page.h>
> #include <asm/memory.h>
> #include <asm/cpufeature.h>
> +#include <asm/kvm_arm.h>
>
> /*
> * As ARMv8.0 only has the TTBR0_EL2 register, we cannot express
> @@ -147,6 +148,13 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
> #define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - _AC(1, ULL))
> #define kvm_vttbr_baddr_mask(kvm) VTTBR_BADDR_MASK
>
> +static inline bool kvm_page_empty(void *ptr)
> +{
> + struct page *ptr_page = virt_to_page(ptr);
> +
> + return page_count(ptr_page) == 1;
> +}
> +
> #include <asm/stage2_pgtable.h>
>
> int create_hyp_mappings(void *from, void *to, pgprot_t prot);
> @@ -237,12 +245,6 @@ static inline bool kvm_s2pmd_exec(pmd_t *pmdp)
> return !(READ_ONCE(pmd_val(*pmdp)) & PMD_S2_XN);
> }
>
> -static inline bool kvm_page_empty(void *ptr)
> -{
> - struct page *ptr_page = virt_to_page(ptr);
> - return page_count(ptr_page) == 1;
> -}
> -
> #define hyp_pte_table_empty(ptep) kvm_page_empty(ptep)
>
> #ifdef __PAGETABLE_PMD_FOLDED
> diff --git a/arch/arm64/include/asm/stage2_pgtable-nopmd.h b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
> deleted file mode 100644
> index 0280ded..0000000
> --- a/arch/arm64/include/asm/stage2_pgtable-nopmd.h
> +++ /dev/null
> @@ -1,42 +0,0 @@
> -/*
> - * Copyright (C) 2016 - ARM Ltd
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 as
> - * published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope that it will be useful,
> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> - * GNU General Public License for more details.
> - *
> - * You should have received a copy of the GNU General Public License
> - * along with this program. If not, see <http://www.gnu.org/licenses/>.
> - */
> -
> -#ifndef __ARM64_S2_PGTABLE_NOPMD_H_
> -#define __ARM64_S2_PGTABLE_NOPMD_H_
> -
> -#include <asm/stage2_pgtable-nopud.h>
> -
> -#define __S2_PGTABLE_PMD_FOLDED
> -
> -#define S2_PMD_SHIFT S2_PUD_SHIFT
> -#define S2_PTRS_PER_PMD 1
> -#define S2_PMD_SIZE (1UL << S2_PMD_SHIFT)
> -#define S2_PMD_MASK (~(S2_PMD_SIZE-1))
> -
> -#define stage2_pud_none(kvm, pud) (0)
> -#define stage2_pud_present(kvm, pud) (1)
> -#define stage2_pud_clear(kvm, pud) do { } while (0)
> -#define stage2_pud_populate(kvm, pud, pmd) do { } while (0)
> -#define stage2_pmd_offset(kvm, pud, address) ((pmd_t *)(pud))
> -
> -#define stage2_pmd_free(kvm, pmd) do { } while (0)
> -
> -#define stage2_pmd_addr_end(kvm, addr, end) (end)
> -
> -#define stage2_pud_huge(kvm, pud) (0)
> -#define stage2_pmd_table_empty(kvm, pmdp) (0)
> -
> -#endif
> diff --git a/arch/arm64/include/asm/stage2_pgtable-nopud.h b/arch/arm64/include/asm/stage2_pgtable-nopud.h
> deleted file mode 100644
> index cd6304e..0000000
> --- a/arch/arm64/include/asm/stage2_pgtable-nopud.h
> +++ /dev/null
> @@ -1,39 +0,0 @@
> -/*
> - * Copyright (C) 2016 - ARM Ltd
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 as
> - * published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope that it will be useful,
> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> - * GNU General Public License for more details.
> - *
> - * You should have received a copy of the GNU General Public License
> - * along with this program. If not, see <http://www.gnu.org/licenses/>.
> - */
> -
> -#ifndef __ARM64_S2_PGTABLE_NOPUD_H_
> -#define __ARM64_S2_PGTABLE_NOPUD_H_
> -
> -#define __S2_PGTABLE_PUD_FOLDED
> -
> -#define S2_PUD_SHIFT S2_PGDIR_SHIFT
> -#define S2_PTRS_PER_PUD 1
> -#define S2_PUD_SIZE (_AC(1, UL) << S2_PUD_SHIFT)
> -#define S2_PUD_MASK (~(S2_PUD_SIZE-1))
> -
> -#define stage2_pgd_none(kvm, pgd) (0)
> -#define stage2_pgd_present(kvm, pgd) (1)
> -#define stage2_pgd_clear(kvm, pgd) do { } while (0)
> -#define stage2_pgd_populate(kvm, pgd, pud) do { } while (0)
> -
> -#define stage2_pud_offset(kvm, pgd, address) ((pud_t *)(pgd))
> -
> -#define stage2_pud_free(kvm, x) do { } while (0)
> -
> -#define stage2_pud_addr_end(kvm, addr, end) (end)
> -#define stage2_pud_table_empty(kvm, pmdp) (0)
> -
> -#endif
> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
> index 057a405..ffc37cc 100644
> --- a/arch/arm64/include/asm/stage2_pgtable.h
> +++ b/arch/arm64/include/asm/stage2_pgtable.h
> @@ -19,8 +19,12 @@
> #ifndef __ARM64_S2_PGTABLE_H_
> #define __ARM64_S2_PGTABLE_H_
>
> +#include <linux/hugetlb.h>
> #include <asm/pgtable.h>
>
> +/* The PGDIR shift for a given page table with "n" levels. */
> +#define pt_levels_pgdir_shift(n) ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - (n))
> +
> /*
> * The hardware supports concatenation of up to 16 tables at stage2 entry level
> * and we use the feature whenever possible.
> @@ -29,118 +33,209 @@
> * On arm64, the smallest PAGE_SIZE supported is 4k, which means
> * (PAGE_SHIFT - 3) > 4 holds for all page sizes.
Trying to understand that comment. Why do we compare to 4?
> * This implies, the total number of page table levels at stage2 expected
> - * by the hardware is actually the number of levels required for (KVM_PHYS_SHIFT - 4)
> + * by the hardware is actually the number of levels required for (IPA_SHIFT - 4)
although understandable, is IPA_SHIFT defined somewhere?
> * in normal translations(e.g, stage1), since we cannot have another level in
> - * the range (KVM_PHYS_SHIFT, KVM_PHYS_SHIFT - 4).
> + * the range (IPA_SHIFT, IPA_SHIFT - 4).
I fail to understand the above comment. Could you give a pointer to the
spec?
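
A worked example may help (assuming 4K pages, i.e. PAGE_SHIFT = 12, and
the current 40bit IPA):

	/*
	 * Each level resolves PAGE_SHIFT - 3 bits (9 for 4K pages), while
	 * entry level concatenation covers at most 16 tables, i.e. 4 bits.
	 * As PAGE_SHIFT - 3 > 4 for every page size, those 4 bits can never
	 * amount to a whole extra level, so levels(ipa - 4) is exact:
	 *
	 *   levels(40)     = DIV_ROUND_UP(40 - 12, 9) = 4 (no concatenation)
	 *   levels(40 - 4) = DIV_ROUND_UP(36 - 12, 9) = 3 (16 concatenated
	 *                                                  entry level tables)
	 */
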
> */
> -#define STAGE2_PGTABLE_LEVELS ARM64_HW_PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4)
> +#define stage2_pt_levels(ipa_shift) ARM64_HW_PGTABLE_LEVELS((ipa_shift) - 4)
>
> /*
> - * With all the supported VA_BITs and 40bit guest IPA, the following condition
> - * is always true:
> + * With all the supported VA_BITs and guest IPA, the following condition
> + * must be always true:
> *
> - * STAGE2_PGTABLE_LEVELS <= CONFIG_PGTABLE_LEVELS
> + * stage2_pt_levels <= CONFIG_PGTABLE_LEVELS
> *
> * We base our stage-2 page table walker helpers on this assumption and
> * fall back to using the host version of the helper wherever possible.
> * i.e, if a particular level is not folded (e.g, PUD) at stage2, we fall back
> * to using the host version, since it is guaranteed it is not folded at host.
> *
> - * If the condition breaks in the future, we can rearrange the host level
> - * definitions and reuse them for stage2. Till then...
> + * If the condition breaks in the future, we need completely independent
> + * page table helpers. Till then...
> */
> -#if STAGE2_PGTABLE_LEVELS > CONFIG_PGTABLE_LEVELS
> +
> +#if stage2_pt_levels(KVM_PHYS_SHIFT) > CONFIG_PGTABLE_LEVELS
> #error "Unsupported combination of guest IPA and host VA_BITS."
> #endif
>
> -/* S2_PGDIR_SHIFT is the size mapped by top-level stage2 entry */
> -#define S2_PGDIR_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - STAGE2_PGTABLE_LEVELS)
> -#define S2_PGDIR_SIZE (_AC(1, UL) << S2_PGDIR_SHIFT)
> -#define S2_PGDIR_MASK (~(S2_PGDIR_SIZE - 1))
> -
> /*
> * The number of PTRS across all concatenated stage2 tables given by the
> * number of bits resolved at the initial level.
> */
> -#define PTRS_PER_S2_PGD (1 << (KVM_PHYS_SHIFT - S2_PGDIR_SHIFT))
> +#define __s2_pgd_ptrs(pa, lvls) (1 << ((pa) - pt_levels_pgdir_shift((lvls))))
> +#define __s2_pgd_size(pa, lvls) (__s2_pgd_ptrs((pa), (lvls)) * sizeof(pgd_t))
> +
> +#define kvm_stage2_levels(kvm) stage2_pt_levels(kvm_phys_shift(kvm))
> +#define stage2_pgdir_shift(kvm) \
> + pt_levels_pgdir_shift(kvm_stage2_levels(kvm))
> +#define stage2_pgdir_size(kvm) (_AC(1, UL) << stage2_pgdir_shift((kvm)))
> +#define stage2_pgdir_mask(kvm) (~(stage2_pgdir_size((kvm)) - 1))
> +#define stage2_pgd_ptrs(kvm) \
> + __s2_pgd_ptrs(kvm_phys_shift(kvm), kvm_stage2_levels(kvm))
> +
> +#define stage2_pgd_size(kvm) __s2_pgd_size(kvm_phys_shift(kvm), kvm_stage2_levels(kvm))
>
> /*
> * kvm_mmu_cache_min_pages is the number of stage2 page table translation
> * levels in addition to the PGD.
> */
> -#define kvm_mmu_cache_min_pages(kvm) (STAGE2_PGTABLE_LEVELS - 1)
> +#define kvm_mmu_cache_min_pages(kvm) (kvm_stage2_levels(kvm) - 1)
>
>
> -#if STAGE2_PGTABLE_LEVELS > 3
> +/* PUD/PMD definitions if present */
> +#define __S2_PUD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(1)
> +#define __S2_PUD_SIZE (_AC(1, UL) << __S2_PUD_SHIFT)
> +#define __S2_PUD_MASK (~(__S2_PUD_SIZE - 1))
>
> -#define S2_PUD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(1)
> -#define S2_PUD_SIZE (_AC(1, UL) << S2_PUD_SHIFT)
> -#define S2_PUD_MASK (~(S2_PUD_SIZE - 1))
> +#define __S2_PMD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(2)
> +#define __S2_PMD_SIZE (_AC(1, UL) << __S2_PMD_SHIFT)
> +#define __S2_PMD_MASK (~(__S2_PMD_SIZE - 1))
Is this renaming mandatory?
>
> -#define stage2_pgd_none(kvm, pgd) pgd_none(pgd)
> -#define stage2_pgd_clear(kvm, pgd) pgd_clear(pgd)
> -#define stage2_pgd_present(kvm, pgd) pgd_present(pgd)
> -#define stage2_pgd_populate(kvm, pgd, pud) pgd_populate(NULL, pgd, pud)
> -#define stage2_pud_offset(kvm, pgd, address) pud_offset(pgd, address)
> -#define stage2_pud_free(kvm, pud) pud_free(NULL, pud)
> +#define __s2_pud_index(addr) \
> + (((addr) >> __S2_PUD_SHIFT) & (PTRS_PER_PTE - 1))
> +#define __s2_pmd_index(addr) \
> + (((addr) >> __S2_PMD_SHIFT) & (PTRS_PER_PTE - 1))
>
> -#define stage2_pud_table_empty(kvm, pudp) kvm_page_empty(pudp)
> +#define __kvm_has_stage2_levels(kvm, min_levels) \
> + ((CONFIG_PGTABLE_LEVELS >= min_levels) && (kvm_stage2_levels(kvm) >= min_levels))
kvm_stage2_levels <= CONFIG_PGTABLE_LEVELS so you should just need to
check kvm_stage2_levels?
> +
> +#define kvm_stage2_has_pgd(kvm) __kvm_has_stage2_levels(kvm, 4)
> +#define kvm_stage2_has_pud(kvm) __kvm_has_stage2_levels(kvm, 3)
> +
> +static inline int stage2_pgd_none(struct kvm *kvm, pgd_t pgd)
> +{
> + return kvm_stage2_has_pgd(kvm) ? pgd_none(pgd) : 0;
> +}
> +
> +static inline void stage2_pgd_clear(struct kvm *kvm, pgd_t *pgdp)
> +{
> + if (kvm_stage2_has_pgd(kvm))
> + pgd_clear(pgdp);
> +}
> +
> +static inline int stage2_pgd_present(struct kvm *kvm, pgd_t pgd)
> +{
> + return kvm_stage2_has_pgd(kvm) ? pgd_present(pgd) : 1;
> +}
> +
> +static inline void stage2_pgd_populate(struct kvm *kvm, pgd_t *pgdp, pud_t *pud)
> +{
> + if (kvm_stage2_has_pgd(kvm))
> + pgd_populate(NULL, pgdp, pud);
> + else
> + BUG();
> +}
> +
> +static inline pud_t *stage2_pud_offset(struct kvm *kvm,
> + pgd_t *pgd, unsigned long address)
> +{
> + if (kvm_stage2_has_pgd(kvm)) {
> + phys_addr_t pud_phys = pgd_page_paddr(*pgd);
> +
> + pud_phys += __s2_pud_index(address) * sizeof(pud_t);
> + return __va(pud_phys);
> + }
> + return (pud_t *)pgd;
> +}
> +
> +static inline void stage2_pud_free(struct kvm *kvm, pud_t *pud)
> +{
> + if (kvm_stage2_has_pgd(kvm))
> + pud_free(NULL, pud);
> +}
> +
> +static inline int stage2_pud_table_empty(struct kvm *kvm, pud_t *pudp)
> +{
> + return kvm_stage2_has_pgd(kvm) && kvm_page_empty(pudp);
> +}
>
> static inline phys_addr_t
> stage2_pud_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> {
> - phys_addr_t boundary = (addr + S2_PUD_SIZE) & S2_PUD_MASK;
> + if (kvm_stage2_has_pgd(kvm)) {
> + phys_addr_t boundary = (addr + __S2_PUD_SIZE) & __S2_PUD_MASK;
>
> - return (boundary - 1 < end - 1) ? boundary : end;
> + return (boundary - 1 < end - 1) ? boundary : end;
> + }
> + return end;
> }
>
> -#endif /* STAGE2_PGTABLE_LEVELS > 3 */
> +static inline int stage2_pud_none(struct kvm *kvm, pud_t pud)
> +{
> + return kvm_stage2_has_pud(kvm) ? pud_none(pud) : 0;
> +}
>
> +static inline void stage2_pud_clear(struct kvm *kvm, pud_t *pudp)
> +{
> + if (kvm_stage2_has_pud(kvm))
> + pud_clear(pudp);
> +}
>
> -#if STAGE2_PGTABLE_LEVELS > 2
> +static inline int stage2_pud_present(struct kvm *kvm, pud_t pud)
> +{
> + return kvm_stage2_has_pud(kvm) ? pud_present(pud) : 1;
> +}
>
> -#define S2_PMD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(2)
> -#define S2_PMD_SIZE (_AC(1, UL) << S2_PMD_SHIFT)
> -#define S2_PMD_MASK (~(S2_PMD_SIZE - 1))
> +static inline void stage2_pud_populate(struct kvm *kvm, pud_t *pudp, pmd_t *pmd)
> +{
> + if (kvm_stage2_has_pud(kvm))
> + pud_populate(NULL, pudp, pmd);
> + else
> + BUG();
> +}
>
> -#define stage2_pud_none(kvm, pud) pud_none(pud)
> -#define stage2_pud_clear(kvm, pud) pud_clear(pud)
> -#define stage2_pud_present(kvm, pud) pud_present(pud)
> -#define stage2_pud_populate(kvm, pud, pmd) pud_populate(NULL, pud, pmd)
> -#define stage2_pmd_offset(kvm, pud, address) pmd_offset(pud, address)
> -#define stage2_pmd_free(kvm, pmd) pmd_free(NULL, pmd)
> +static inline pmd_t *stage2_pmd_offset(struct kvm *kvm,
> + pud_t *pud, unsigned long address)
> +{
> + if (kvm_stage2_has_pud(kvm)) {
> + phys_addr_t pmd_phys = pud_page_paddr(*pud);
>
> -#define stage2_pud_huge(kvm, pud) pud_huge(pud)
> -#define stage2_pmd_table_empty(kvm, pmdp) kvm_page_empty(pmdp)
> + pmd_phys += __s2_pmd_index(address) * sizeof(pmd_t);
> + return __va(pmd_phys);
> + }
> + return (pmd_t *)pud;
> +}
> +
> +static inline void stage2_pmd_free(struct kvm *kvm, pmd_t *pmd)
> +{
> + if (kvm_stage2_has_pud(kvm))
> + pmd_free(NULL, pmd);
> +}
> +
> +static inline int stage2_pmd_table_empty(struct kvm *kvm, pmd_t *pmdp)
> +{
> + return kvm_stage2_has_pud(kvm) && kvm_page_empty(pmdp);
> +}
>
> static inline phys_addr_t
> stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> {
> - phys_addr_t boundary = (addr + S2_PMD_SIZE) & S2_PMD_MASK;
> + if (kvm_stage2_has_pud(kvm)) {
> + phys_addr_t boundary = (addr + __S2_PMD_SIZE) & __S2_PMD_MASK;
>
> - return (boundary - 1 < end - 1) ? boundary : end;
> + return (boundary - 1 < end - 1) ? boundary : end;
> + }
> + return end;
> }
>
> -#endif /* STAGE2_PGTABLE_LEVELS > 2 */
> +static inline int stage2_pud_huge(struct kvm *kvm, pud_t pud)
> +{
> + return kvm_stage2_has_pud(kvm) ? pud_huge(pud) : 0;
> +}
>
> #define stage2_pte_table_empty(kvm, ptep) kvm_page_empty(ptep)
>
> -#if STAGE2_PGTABLE_LEVELS == 2
> -#include <asm/stage2_pgtable-nopmd.h>
> -#elif STAGE2_PGTABLE_LEVELS == 3
> -#include <asm/stage2_pgtable-nopud.h>
> -#endif
> -
> -#define stage2_pgd_size(kvm) (PTRS_PER_S2_PGD * sizeof(pgd_t))
> -
> -#define stage2_pgd_index(kvm, addr) \
> - (((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
> +static inline unsigned long stage2_pgd_index(struct kvm *kvm, phys_addr_t addr)
> +{
> + return (addr >> stage2_pgdir_shift(kvm)) & (stage2_pgd_ptrs(kvm) - 1);
> +}
>
> static inline phys_addr_t
> stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> {
> - phys_addr_t boundary = (addr + S2_PGDIR_SIZE) & S2_PGDIR_MASK;
> + phys_addr_t boundary;
>
> + boundary = (addr + stage2_pgdir_size(kvm)) & stage2_pgdir_mask(kvm);
> return (boundary - 1 < end - 1) ? boundary : end;
> }
>
>

Globally this patch is pretty hard to review. I don't know if it is
possible to split it into two: 1) addition of some helper macros; 2) removal
of nopud and nopmd and implementation of the corresponding macros?

Thanks

Eric

2018-07-02 13:15:39

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v3 15/20] kvm: arm/arm64: Allow tuning the physical address size for VM

On 29/06/18 12:15, Suzuki K Poulose wrote:
> Allow specifying the physical address size for a new VM via
> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
> us to finalise the stage2 page table format as early as possible
> and hence perform the right checks on the memory slots without
> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
> of the type field and can encode more information in the future if
> required. The IPA size is still capped at 40bits.

Can't we relax this? There is no technical reason (AFAICS) not to allow
going down to 36bit IPA if the user has requested it.

If we run on a 36bit IPA system, the default would fail. But if the user
specified "please give me a 36bit IPA VM", we could satisfy that
requirement and allow them to run their stupidly small guest!
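
For illustration, a hypothetical userspace snippet for the proposed
interface (the 36 here follows the example above; the encoding is the
one described in the commit message, log2(PA size) in bits [7:0] of the
KVM_CREATE_VM type):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Request a 36bit IPA VM from /dev/kvm */
	static int create_36bit_ipa_vm(void)
	{
		int kvm_fd = open("/dev/kvm", O_RDWR);

		if (kvm_fd < 0)
			return -1;
		return ioctl(kvm_fd, KVM_CREATE_VM, 36UL);
	}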

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2018-07-02 13:30:41

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH v3 09/20] kvm: arm64: Make stage2 page table layout dynamic

Hi Eric,


On 02/07/18 13:14, Auger Eric wrote:
> Hi Suzuki,
>
> On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
>> So far we had a static stage2 page table handling code, based on a
>> fixed IPA of 40bits. As we prepare for a configurable IPA size per
>> VM, make our stage2 page table code dynamic, to do the right thing
>> for a given VM. We ensure the existing condition is always true even
>> when we lift the limit on the IPA. i.e,
>>
>> page table levels in stage1 >= page table levels in stage2
>>
>> Support for the IPA size configuration needs other changes in the way
>> we configure the EL2 registers (VTTBR and VTCR). So, the IPA is still
>> fixed to 40bits. The patch also moves the kvm_page_empty() in asm/kvm_mmu.h
>> to the top, before including the asm/stage2_pgtable.h to avoid a forward
>> declaration.
>>
>> Cc: Marc Zyngier <[email protected]>
>> Cc: Christoffer Dall <[email protected]>
>> Signed-off-by: Suzuki K Poulose <[email protected]>
>> ---
>> Changes since V2
>> - Restrict the stage2 page table to allow reusing the host page table
>> helpers for now, until we get stage1 independent page table helpers.
> I would move this up in the commit msg to motivate the fact that we
> enforce the above condition.

This is mentioned in the commit message for the patch which lifts the limitation
on the IPA. This patch only deals with the dynamic page table level handling,
with the restriction on the levels. Nevertheless, I could add it to the
description.

>> ---
>> arch/arm64/include/asm/kvm_mmu.h | 14 +-
>> arch/arm64/include/asm/stage2_pgtable-nopmd.h | 42 ------
>> arch/arm64/include/asm/stage2_pgtable-nopud.h | 39 -----
>> arch/arm64/include/asm/stage2_pgtable.h | 207 +++++++++++++++++++-------
>> 4 files changed, 159 insertions(+), 143 deletions(-)
>> delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopmd.h
>> delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopud.h
>
> with my very limited knowledge of S2 page table walkers I fail to
> understand why we now can get rid of stage2_pgtable-nopmd.h and
> stage2_pgtable-nopud.h and associated FOLDED config. Please could you
> explain it in the commit message?

As mentioned above, we had static page table helpers, decided at compile
time (just like stage1). These files hold the definitions for the cases
where the PUD/PMD is folded, and are included based on the stage1 VA
configuration. But since we now make this check per VM, we make the
decision by checking kvm_stage2_levels() instead of hard coding it.

Does that help? A short version of that is already there. Maybe I could
elaborate on it a bit.

>> -
>> -#define stage2_pgd_index(kvm, addr) \
>> - (((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
>> +static inline unsigned long stage2_pgd_index(struct kvm *kvm, phys_addr_t addr)
>> +{
>> + return (addr >> stage2_pgdir_shift(kvm)) & (stage2_pgd_ptrs(kvm) - 1);
>> +}
>>
>> static inline phys_addr_t
>> stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>> {
>> - phys_addr_t boundary = (addr + S2_PGDIR_SIZE) & S2_PGDIR_MASK;
>> + phys_addr_t boundary;
>>
>> + boundary = (addr + stage2_pgdir_size(kvm)) & stage2_pgdir_mask(kvm);
>> return (boundary - 1 < end - 1) ? boundary : end;
>> }
>>
>>
>
> Globally this patch is pretty hard to review. I don't know if it is
> possible to split into 2. 1) Addition of some helper macros. 2) removal
> of nopud and nopmd and implementation of the corresponding macros?

I acknowledge that. The patch redefines the "existing" macros to make the
decision at runtime based on the VM's setting. I will see if there is a
better way to do it.

Cheers
Suzuki

2018-07-02 13:33:15

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH v3 15/20] kvm: arm/arm64: Allow tuning the physical address size for VM

On 02/07/18 14:13, Marc Zyngier wrote:
> On 29/06/18 12:15, Suzuki K Poulose wrote:
>> Allow specifying the physical address size for a new VM via
>> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
>> us to finalise the stage2 page table format as early as possible
>> and hence perform the right checks on the memory slots without
>> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
>> of the type field and can encode more information in the future if
>> required. The IPA size is still capped at 40bits.
>
> Can't we relax this? There is no technical reason (AFAICS) not to allow
> going down to 36bit IPA if the user has requested it.

Sure, we can.

>
> If we run on a 36bit IPA system, the default would fail. But if the user
> specified "please give me a 36bit IPA VM", we could satisfy that
> requirement and allow them to run their stupidly small guest!

Absolutely. I will fix this in the next version.

Cheers
Suzuki

2018-07-02 13:33:59

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v3 16/20] kvm: arm64: Switch to per VM IPA limit

On 29/06/18 12:15, Suzuki K Poulose wrote:
> Now that we can manage the stage2 page table per VM, switch the
> configuration details to per VM instance. We keep track of the
> IPA bits, number of page table levels and the VTCR bits (which
> depends on the IPA and the number of levels). While at it, remove
> unused pgd_lock field from kvm_arch for arm64.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> arch/arm64/include/asm/kvm_host.h | 14 ++++++++++++--
> arch/arm64/include/asm/kvm_hyp.h | 3 +--
> arch/arm64/include/asm/kvm_mmu.h | 20 ++++++++++++++++++--
> arch/arm64/include/asm/stage2_pgtable.h | 1 -
> virt/kvm/arm/mmu.c | 4 ++++
> 5 files changed, 35 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 328f472..9a15860 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -61,13 +61,23 @@ struct kvm_arch {
> u64 vmid_gen;
> u32 vmid;
>
> - /* 1-level 2nd stage table and lock */
> - spinlock_t pgd_lock;
> + /* stage-2 page table */
> pgd_t *pgd;
>
> /* VTTBR value associated with above pgd and vmid */
> u64 vttbr;
>
> + /* Private bits of VTCR_EL2 for this VM */
> + u64 vtcr_private;

As I said in another email, this should become a full VTCR_EL2 copy.

> + /* Size of the PA space for this guest */
> + u8 phys_shift;
> + /*
> + * Number of levels in page table. We could always calculate
> + * it from phys_shift above. We cache it for faster switches
> + * in stage2 page table helpers.
> + */
> + u8 s2_levels;

And these two fields feel like they should be derived from the VTCR
itself, instead of being there on their own. Any chance you could look
into this?

> +
> /* The last vcpu id that ran on each physical CPU */
> int __percpu *last_vcpu_ran;
>
> diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
> index 3e8052d1..699f678 100644
> --- a/arch/arm64/include/asm/kvm_hyp.h
> +++ b/arch/arm64/include/asm/kvm_hyp.h
> @@ -166,8 +166,7 @@ static __always_inline void __hyp_text __load_guest_stage2(struct kvm *kvm)
> u64 vtcr = read_sysreg(vtcr_el2);
>
> vtcr &= ~VTCR_EL2_PRIVATE_MASK;
> - vtcr |= VTCR_EL2_SL0(kvm_stage2_levels(kvm)) |
> - VTCR_EL2_T0SZ(kvm_phys_shift(kvm));
> + vtcr |= kvm->arch.vtcr_private;
> write_sysreg(vtcr, vtcr_el2);
> write_sysreg(kvm->arch.vttbr, vttbr_el2);
> }
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index f3fb05a3..a291cdc 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -143,9 +143,10 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
> */
> #define KVM_PHYS_SHIFT (40)
>
> -#define kvm_phys_shift(kvm) KVM_PHYS_SHIFT
> +#define kvm_phys_shift(kvm) (kvm->arch.phys_shift)
> #define kvm_phys_size(kvm) (_AC(1, ULL) << kvm_phys_shift(kvm))
> #define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - _AC(1, ULL))
> +#define kvm_stage2_levels(kvm) (kvm->arch.s2_levels)
>
> static inline bool kvm_page_empty(void *ptr)
> {
> @@ -528,6 +529,18 @@ static inline u64 kvm_vttbr_baddr_mask(struct kvm *kvm)
>
> static inline void *stage2_alloc_pgd(struct kvm *kvm)
> {
> + u32 ipa, lvls;
> +
> + /*
> + * Stage2 page table can support concatenation of (upto 16) tables
> + * at the entry level, thereby reducing the number of levels.
> + */
> + ipa = kvm_phys_shift(kvm);
> + lvls = stage2_pt_levels(ipa);
> +
> + kvm->arch.s2_levels = lvls;
> + kvm->arch.vtcr_private = VTCR_EL2_SL0(lvls) | TCR_T0SZ(ipa);
> +
> return alloc_pages_exact(stage2_pgd_size(kvm),
> GFP_KERNEL | __GFP_ZERO);
> }
> @@ -537,7 +550,10 @@ static inline u32 kvm_get_ipa_limit(void)
> return KVM_PHYS_SHIFT;
> }
>
> -static inline void kvm_config_stage2(struct kvm *kvm, u32 ipa_shift) {}
> +static inline void kvm_config_stage2(struct kvm *kvm, u32 ipa_shift)
> +{
> + kvm->arch.phys_shift = ipa_shift;
> +}
>
> #endif /* __ASSEMBLY__ */
> #endif /* __ARM64_KVM_MMU_H__ */
> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
> index ffc37cc..91d7936 100644
> --- a/arch/arm64/include/asm/stage2_pgtable.h
> +++ b/arch/arm64/include/asm/stage2_pgtable.h
> @@ -65,7 +65,6 @@
> #define __s2_pgd_ptrs(pa, lvls) (1 << ((pa) - pt_levels_pgdir_shift((lvls))))
> #define __s2_pgd_size(pa, lvls) (__s2_pgd_ptrs((pa), (lvls)) * sizeof(pgd_t))
>
> -#define kvm_stage2_levels(kvm) stage2_pt_levels(kvm_phys_shift(kvm))
> #define stage2_pgdir_shift(kvm) \
> pt_levels_pgdir_shift(kvm_stage2_levels(kvm))
> #define stage2_pgdir_size(kvm) (_AC(1, UL) << stage2_pgdir_shift((kvm)))
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index a339e00..d7822e1 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -867,6 +867,10 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
> return -EINVAL;
> }
>
> + /* Make sure we have the stage2 configured for this VM */
> + if (WARN_ON(!kvm_phys_shift(kvm)))

Can this be triggered from userspace?

> + return -EINVAL;
> +
> /* Allocate the HW PGD, making sure that each page gets its own refcount */
> pgd = stage2_alloc_pgd(kvm);
> if (!pgd)
>

Thanks,

M.
--
Jazz is not dead. It just smells funny...
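
The derivation Marc asks for is mechanical. A sketch of reading the two
fields back out of a VTCR value, assuming the layout quoted above (SL0 at
bits [7:6], T0SZ at bits [5:0]) and a 4K granule (SL0 base of 2); the
constants are illustrative, not the kernel's macros:

#include <stdint.h>

static unsigned int vtcr_to_phys_shift(uint64_t vtcr)
{
	return 64 - (vtcr & 0x3f);	/* T0SZ = 64 - IPA size */
}

static unsigned int vtcr_to_s2_levels(uint64_t vtcr)
{
	unsigned int sl0 = (vtcr >> 6) & 0x3;

	return 4 - (2 - sl0);		/* 2 == SL0 base for a 4K granule */
}

With something along these lines, phys_shift and s2_levels would not need
to be stored separately from the VTCR copy.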

2018-07-02 13:45:17

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v3 18/20] kvm: arm64: Add support for handling 52bit IPA

On 29/06/18 12:15, Suzuki K Poulose wrote:
> Add support for handling the 52bit IPA. 52bit IPA
> support needs changes to the following :
>
> 1) Page-table entries - We use kernel page table helpers for setting
> up the stage2. Hence we don't explicit changes here
>
> 2) VTTBR:BADDR - This is already supported with :
> commit 529c4b05a3cb2f324aa ("arm64: handle 52-bit addresses in TTBR")
>
> 3) VGIC support for 52bit: Supported with a patch in this series.
>
> That leaves us with the handling for PAR and HPAR. This patch adds

HPFAR?

> support for handling the 52bit addresses in PAR and HPFAR,
> which are used while handling the permission faults in stage1.

Overall, this is a pretty confusing commit message. Can you just call it:

KVM/arm64: Add 52bit support for PAR to HPFAR conversion

and just describe that it now uses PHYS_MASK_SHIFT instead of a
hardcoded constant?

>
> Cc: Marc Zyngier <[email protected]>
> Cc: Kristina Martsenko <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> arch/arm64/include/asm/kvm_arm.h | 7 +++++++
> arch/arm64/kvm/hyp/switch.c | 2 +-
> 2 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index 2e90942..cb6a2ee 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -301,6 +301,13 @@
>
> /* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
> #define HPFAR_MASK (~UL(0xf))
> +/*
> + * We have
> + * PAR [PA_Shift - 1 : 12] = PA [PA_Shift - 1 : 12]
> + * HPFAR [PA_Shift - 9 : 4] = FIPA [PA_Shift - 1 : 12]
> + */
> +#define PAR_TO_HPFAR(par) \
> + (((par) & GENMASK_ULL(PHYS_MASK_SHIFT - 1, 12)) >> 8)
>
> #define kvm_arm_exception_type \
> {0, "IRQ" }, \
> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> index 355fb25..fb66320 100644
> --- a/arch/arm64/kvm/hyp/switch.c
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -260,7 +260,7 @@ static bool __hyp_text __translate_far_to_hpfar(u64 far, u64 *hpfar)
> return false; /* Translation failed, back to guest */
>
> /* Convert PAR to HPFAR format */
> - *hpfar = ((tmp >> 12) & ((1UL << 36) - 1)) << 4;
> + *hpfar = PAR_TO_HPFAR(tmp);
> return true;
> }
>
>

Otherwise:

Acked-by: Marc Zyngier <[email protected]>

M.
--
Jazz is not dead. It just smells funny...
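
The masked shift above is easy to sanity-check outside the kernel. A small
sketch with GENMASK_ULL open-coded and PHYS_MASK_SHIFT assumed to be 52;
the assert confirms the new formula agrees with the old 48bit one for any
PA that fits in 48 bits:

#include <stdint.h>
#include <assert.h>

#define GENMASK_ULL(h, l) \
	((~0ULL >> (63 - (h))) & (~0ULL << (l)))

static uint64_t par_to_hpfar(uint64_t par)
{
	return (par & GENMASK_ULL(51, 12)) >> 8;	/* 51 = PHYS_MASK_SHIFT - 1 */
}

int main(void)
{
	uint64_t par = 0x0000123456789000ULL;	/* fits in 48 bits */

	assert((((par >> 12) & ((1ULL << 36) - 1)) << 4) == par_to_hpfar(par));
	return 0;
}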

2018-07-02 13:52:01

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v3 19/20] kvm: arm64: Allow IPA size supported by the system

On 29/06/18 12:15, Suzuki K Poulose wrote:
> So far we have restricted the IPA size of the VM to the default
> value (40bits). Now that we can manage the IPA size per VM and
> support dynamic stage2 page tables, allow VMs to have larger IPA.
> This is done by setting the IPA limit to the one supported by
> the hardware and kernel. This patch also moves the check for
> the default IPA size support to kvm_get_ipa_limit().
>
> Since the stage2 page table code is dependent on the stage1
> page table, we always ensure that :
>
> Number of Levels at Stage1 >= Number of Levels at Stage2
>
> So we limit the IPA to make sure that the above condition
> is satisfied. This will affect the following combinations
> of VA_BITS and IPA for different page sizes.
>
> 39bit VA, 4K - IPA > 43 (Upto 48)
> 36bit VA, 16K - IPA > 40 (Upto 48)
> 42bit VA, 64K - IPA > 46 (Upto 52)

I'm not sure I get it. Are these the IPA sizes that we forbid based on
the host VA size and page size configuration? If so, can you rewrite
this as:

host configuration | unsupported IPA range
39bit VA, 4k | [44, 48]
36bit VA, 16K | [41, 48]
42bit VA, 64k | [47, 52]

and say that all the other combinations are supported?

>
> Supporting the above combinations need independent stage2
> page table manipulation code, which would need substantial
> changes. We could purse the solution independently and
> switch the page table code once we have it ready.
>
> Cc: Catalin Marinas <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> Changes since V2:
> - Restrict the IPA size to limit the number of page table
> levels in stage2 to that of stage1 or less.
> ---
> arch/arm64/include/asm/kvm_host.h | 6 ------
> arch/arm64/include/asm/kvm_mmu.h | 37 ++++++++++++++++++++++++++++++++++++-
> 2 files changed, 36 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 9a15860..e858e49 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -452,13 +452,7 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>
> static inline void __cpu_init_stage2(void)
> {
> - u32 ps;
> -
> kvm_call_hyp(__init_stage2_translation);
> - /* Sanity check for minimum IPA size support */
> - ps = id_aa64mmfr0_parange_to_phys_shift(read_sysreg(id_aa64mmfr0_el1) & 0x7);
> - WARN_ONCE(ps < 40,
> - "PARange is %d bits, unsupported configuration!", ps);
> }
>
> /* Guest/host FPSIMD coordination helpers */
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index a291cdc..d38f395 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -547,7 +547,42 @@ static inline void *stage2_alloc_pgd(struct kvm *kvm)
>
> static inline u32 kvm_get_ipa_limit(void)
> {
> - return KVM_PHYS_SHIFT;
> + unsigned int ipa_max, va_max, parange;
> +
> + parange = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1) & 0x7;
> + ipa_max = id_aa64mmfr0_parange_to_phys_shift(parange);
> +
> + /* Raise the limit to the default size for backward compatibility */
> + if (ipa_max < KVM_PHYS_SHIFT) {
> + WARN_ONCE(1,
> + "PARange is %d bits, unsupported configuration!",
> + ipa_max);
> + ipa_max = KVM_PHYS_SHIFT;
> + }
> +
> + /* Clamp it to the PA size supported by the kernel */
> + ipa_max = (ipa_max > PHYS_MASK_SHIFT) ? PHYS_MASK_SHIFT : ipa_max;
> + /*
> + * Since our stage2 table is dependent on the stage1 page table code,
> + * we must always honor the following condition:
> + *
> + * Number of levels in Stage1 >= Number of levels in Stage2.
> + *
> + * So clamp the ipa limit further down to limit the number of levels.
> + * Since we can concatenate upto 16 tables at entry level, we could
> + * go upto 4bits above the maximum VA addressable with the current
> + * number of levels.
> + */
> + va_max = PGDIR_SHIFT + PAGE_SHIFT - 3;
> + va_max += 4;
> +
> + if (va_max < ipa_max) {
> + kvm_info("Limiting IPA limit to %dbytes due to host VA bits limitation\n",
> + va_max);
> + ipa_max = va_max;
> + }
> +
> + return ipa_max;
> }
>
> static inline void kvm_config_stage2(struct kvm *kvm, u32 ipa_shift)
>

Otherwise looks good.

M.
--
Jazz is not dead. It just smells funny...
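
The clamping being reviewed reduces to a small pure function. A sketch with
the kernel constants passed in as parameters (KVM_PHYS_SHIFT is 40; va_max
is PGDIR_SHIFT + PAGE_SHIFT - 3, plus 4 bits for the up-to-16-way entry
level concatenation):

static unsigned int ipa_limit(unsigned int parange_bits,
			      unsigned int phys_mask_shift,
			      unsigned int pgdir_shift,
			      unsigned int page_shift)
{
	unsigned int ipa_max = parange_bits;
	unsigned int va_max = pgdir_shift + page_shift - 3 + 4;

	if (ipa_max < 40)
		ipa_max = 40;			/* backward compatible default */
	if (ipa_max > phys_mask_shift)
		ipa_max = phys_mask_shift;	/* kernel PA size limit */

	return ipa_max < va_max ? ipa_max : va_max;
}

Plugging in a 39bit-VA/4K host (PGDIR_SHIFT = 30, PAGE_SHIFT = 12) gives 43,
matching the unsupported [44, 48] range in the table above.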

2018-07-02 13:54:23

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH v3 16/20] kvm: arm64: Switch to per VM IPA limit


Hi Marc,

On 02/07/18 14:32, Marc Zyngier wrote:
> On 29/06/18 12:15, Suzuki K Poulose wrote:
>> Now that we can manage the stage2 page table per VM, switch the
>> configuration details to per VM instance. We keep track of the
>> IPA bits, number of page table levels and the VTCR bits (which
>> depends on the IPA and the number of levels). While at it, remove
>> unused pgd_lock field from kvm_arch for arm64.
>>
>> Cc: Marc Zyngier <[email protected]>
>> Cc: Christoffer Dall <[email protected]>
>> Signed-off-by: Suzuki K Poulose <[email protected]>


>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 328f472..9a15860 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -61,13 +61,23 @@ struct kvm_arch {
>> u64 vmid_gen;
>> u32 vmid;
>>
>> - /* 1-level 2nd stage table and lock */
>> - spinlock_t pgd_lock;
>> + /* stage-2 page table */
>> pgd_t *pgd;
>>
>> /* VTTBR value associated with above pgd and vmid */
>> u64 vttbr;
>>
>> + /* Private bits of VTCR_EL2 for this VM */
>> + u64 vtcr_private;
>
> As I said in another email, this should become a full VTCR_EL2 copy.
>

OK

>> + /* Size of the PA space for this guest */
>> + u8 phys_shift;
>> + /*
>> + * Number of levels in page table. We could always calculate
>> + * it from phys_shift above. We cache it for faster switches
>> + * in stage2 page table helpers.
>> + */
>> + u8 s2_levels;
>
> And these two fields feel like they should be derived from the VTCR
> itself, instead of being there on their own. Any chance you could look
> into this?

Yes, the VTCR is computed from the above two values and we could compute
them back from the VTCR. I will give it a try.

>> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
>> index ffc37cc..91d7936 100644
>> --- a/arch/arm64/include/asm/stage2_pgtable.h
>> +++ b/arch/arm64/include/asm/stage2_pgtable.h
>> @@ -65,7 +65,6 @@
>> #define __s2_pgd_ptrs(pa, lvls) (1 << ((pa) - pt_levels_pgdir_shift((lvls))))
>> #define __s2_pgd_size(pa, lvls) (__s2_pgd_ptrs((pa), (lvls)) * sizeof(pgd_t))
>>
>> -#define kvm_stage2_levels(kvm) stage2_pt_levels(kvm_phys_shift(kvm))
>> #define stage2_pgdir_shift(kvm) \
>> pt_levels_pgdir_shift(kvm_stage2_levels(kvm))
>> #define stage2_pgdir_size(kvm) (_AC(1, UL) << stage2_pgdir_shift((kvm)))
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index a339e00..d7822e1 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -867,6 +867,10 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
>> return -EINVAL;
>> }
>>
>> + /* Make sure we have the stage2 configured for this VM */
>> + if (WARN_ON(!kvm_phys_shift(kvm)))
>
> Can this be triggered from userspace?

No, as we initialise the phys shift before we get here. If the type is left
blank (i.e., 0), we default to 40bits, so there should always be something
there. The check is to make sure we have indeed passed the configuration step.

>> + return -EINVAL;
>> +
>> /* Allocate the HW PGD, making sure that each page gets its own refcount */
>> pgd = stage2_alloc_pgd(kvm);
>> if (!pgd)
>>
>

Cheers
Suzuki

2018-07-02 13:55:35

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH v3 19/20] kvm: arm64: Allow IPA size supported by the system

On 02/07/18 14:50, Marc Zyngier wrote:
> On 29/06/18 12:15, Suzuki K Poulose wrote:
>> So far we have restricted the IPA size of the VM to the default
>> value (40bits). Now that we can manage the IPA size per VM and
>> support dynamic stage2 page tables, allow VMs to have larger IPA.
>> This is done by setting the IPA limit to the one supported by
>> the hardware and kernel. This patch also moves the check for
>> the default IPA size support to kvm_get_ipa_limit().
>>
>> Since the stage2 page table code is dependent on the stage1
>> page table, we always ensure that :
>>
>> Number of Levels at Stage1 >= Number of Levels at Stage2
>>
>> So we limit the IPA to make sure that the above condition
>> is satisfied. This will affect the following combinations
>> of VA_BITS and IPA for different page sizes.
>>
>> 39bit VA, 4K - IPA > 43 (Upto 48)
>> 36bit VA, 16K - IPA > 40 (Upto 48)
>> 42bit VA, 64K - IPA > 46 (Upto 52)
>
> I'm not sure I get it. Are these the IPA sizes that we forbid based on
> the host VA size and page size configuration?

Yes, that's right.

> If so, can you rewrite
> this as:
>
> host configuration | unsupported IPA range
> 39bit VA, 4k | [44, 48]
> 36bit VA, 16K | [41, 48]
> 42bit VA, 64k | [47, 52]
>
> and say that all the other combinations are supported?

Sure, that looks much better. Thanks

Suzuki

2018-07-02 14:44:15

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v3 10/20] kvm: arm64: Dynamic configuration of VTTBR mask

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> On arm64 VTTBR_EL2:BADDR holds the base address for the stage2
> translation table. The Arm ARM mandates that the bits BADDR[x-1:0]
> should be 0, where 'x' is defined for a given IPA Size and the
> number of levels for a translation granule size. It is defined
> using some magical constants. This patch is a reverse engineered
> implementation to calculate the 'x' at runtime for a given ipa and
> number of page table levels. See patch for more details.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> Changes since V2:
> - Part 1 of spilt from VTCR & VTTBR dynamic configuration
> ---
> arch/arm64/include/asm/kvm_arm.h | 60 +++++++++++++++++++++++++++++++++++++---
> arch/arm64/include/asm/kvm_mmu.h | 25 ++++++++++++++++-
> 2 files changed, 80 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index 3dffd38..c557f45 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -140,8 +140,6 @@
> * Note that when using 4K pages, we concatenate two first level page tables
> * together. With 16K pages, we concatenate 16 first level page tables.
> *
> - * The magic numbers used for VTTBR_X in this patch can be found in Tables
> - * D4-23 and D4-25 in ARM DDI 0487A.b.
Isn't it a pretty old reference? Could you refer to C.a?

> */
>
> #define VTCR_EL2_T0SZ_IPA VTCR_EL2_T0SZ_40B
> @@ -175,9 +173,63 @@
> #endif
>
> #define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN_FLAGS)
> -#define VTTBR_X (VTTBR_X_TGRAN_MAGIC - VTCR_EL2_T0SZ_IPA)
> +/*
> + * ARM VMSAv8-64 defines an algorithm for finding the translation table
> + * descriptors in section D4.2.8 in ARM DDI 0487B.b.
another one ;-)
> + *
> + * The algorithm defines the expectations on the BaseAddress (for the page
> + * table) bits resolved at each level based on the page size, entry level
> + * and T0SZ. The variable "x" in the algorithm also affects the VTTBR:BADDR
> + * for stage2 page table.
> + *
> + * The value of "x" is calculated as :
> + * x = Magic_N - T0SZ
> + *
> + * where Magic_N is an integer depending on the page size and the entry
> + * level of the page table as below:
> + *
> + * --------------------------------------------
> + * | Entry level | 4K 16K 64K |
> + * --------------------------------------------
> + * | Level: 0 (4 levels) | 28 | - | - |
> + * --------------------------------------------
> + * | Level: 1 (3 levels) | 37 | 31 | 25 |
> + * --------------------------------------------
> + * | Level: 2 (2 levels) | 46 | 42 | 38 |
> + * --------------------------------------------
> + * | Level: 3 (1 level) | - | 53 | 51 |
> + * --------------------------------------------
I understand entry level = Lookup level in the table.
But you may want to compute x for BaseAddress matching lookup level 2
with number of levels = 4.
So shouldn't you s/Number of levels/4 - entry_level?
For BADDR we want the BaseAddr of the initial lookup level, so
effectively the entry level we are interested in is 4 - number of levels
and we don't care about the d) condition. At least this is my understanding ;-)
If correct you may slightly reword the explanation?
> + *
> + * We have a magic formula for the Magic_N below.
> + *
> + * Magic_N(PAGE_SIZE, Entry_Level) = 64 - ((PAGE_SHIFT - 3) * Number of levels)
> + *
> + * where number of levels = (4 - Entry_Level).
> + *
> + * So, given that T0SZ = (64 - PA_SHIFT), we can compute 'x' as follows:
Isn't it IPA_SHIFT instead?
> + *
> + * x = (64 - ((PAGE_SHIFT - 3) * Number_of_levels)) - (64 - PA_SHIFT)
> + * = PA_SHIFT - ((PAGE_SHIFT - 3) * Number of levels)
> + *
> + * Here is one way to explain the Magic Formula:
> + *
> + * x = log2(Size_of_Entry_Level_Table)
> + *
> + * Since, we can resolve (PAGE_SHIFT - 3) bits at each level, and another
> + * PAGE_SHIFT bits in the PTE, we have :
> + *
> + * Bits_Entry_level = PA_SHIFT - ((PAGE_SHIFT - 3) * (n - 1) + PAGE_SHIFT)
> + * = PA_SHIFT - (PAGE_SHIFT - 3) * n - 3
> + * where n = number of levels, and since each pointer is 8bytes, we have:
> + *
> + * x = Bits_Entry_Level + 3
> + * = PA_SHIFT - (PAGE_SHIFT - 3) * n
> + *
> + * The only constraint here is that, we have to find the number of page table
> + * levels for a given IPA size (which we do, see stage2_pt_levels())
> + */
> +#define ARM64_VTTBR_X(ipa, levels) ((ipa) - ((levels) * (PAGE_SHIFT - 3)))
>
> -#define VTTBR_BADDR_MASK (((UL(1) << (PHYS_MASK_SHIFT - VTTBR_X)) - 1) << VTTBR_X)
> #define VTTBR_VMID_SHIFT (UL(48))
> #define VTTBR_VMID_MASK(size) (_AT(u64, (1 << size) - 1) << VTTBR_VMID_SHIFT)
>
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index a351722..813a72a 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -146,7 +146,6 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
> #define kvm_phys_shift(kvm) KVM_PHYS_SHIFT
> #define kvm_phys_size(kvm) (_AC(1, ULL) << kvm_phys_shift(kvm))
> #define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - _AC(1, ULL))
> -#define kvm_vttbr_baddr_mask(kvm) VTTBR_BADDR_MASK
>
> static inline bool kvm_page_empty(void *ptr)
> {
> @@ -503,6 +502,30 @@ static inline int hyp_map_aux_data(void)
>
> #define kvm_phys_to_vttbr(addr) phys_to_ttbr(addr)
>
> +/*
> + * Get the magic number 'x' for VTTBR:BADDR of this KVM instance.
> + * With v8.2 LVA extensions, 'x' should be a minimum of 6 with
> + * 52bit IPS.
Link to the spec?
> + */
> +static inline int arm64_vttbr_x(u32 ipa_shift, u32 levels)
> +{
> + int x = ARM64_VTTBR_X(ipa_shift, levels);
> +
> + return (IS_ENABLED(CONFIG_ARM64_PA_BITS_52) && x < 6) ? 6 : x;
> +}
> +
> +static inline u64 vttbr_baddr_mask(u32 ipa_shift, u32 levels)
> +{
> + unsigned int x = arm64_vttbr_x(ipa_shift, levels);
> +
> + return GENMASK_ULL(PHYS_MASK_SHIFT - 1, x);
> +}
> +
> +static inline u64 kvm_vttbr_baddr_mask(struct kvm *kvm)
> +{
> + return vttbr_baddr_mask(kvm_phys_shift(kvm), kvm_stage2_levels(kvm));
> +}
> +
> static inline void *stage2_alloc_pgd(struct kvm *kvm)
> {
> return alloc_pages_exact(stage2_pgd_size(kvm),
>

Thanks

Eric
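
The "reverse engineered" formula can be cross-checked against the quoted
Magic_N table. A sketch, with PAGE_SHIFT taken as 12/14/16 for 4K/16K/64K:

#include <assert.h>

static int magic_n(int page_shift, int levels)
{
	return 64 - (page_shift - 3) * levels;
}

int main(void)
{
	assert(magic_n(12, 4) == 28);	/* 4K,  entry level 0 */
	assert(magic_n(12, 3) == 37);	/* 4K,  entry level 1 */
	assert(magic_n(14, 3) == 31);	/* 16K, entry level 1 */
	assert(magic_n(16, 2) == 38);	/* 64K, entry level 2 */
	assert(magic_n(14, 1) == 53);	/* 16K, entry level 3 */
	return 0;
}

x then falls out as Magic_N - T0SZ = IPA - levels * (PAGE_SHIFT - 3), which
is exactly what ARM64_VTTBR_X computes.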

2018-07-02 14:48:26

by Eric Auger

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH v3 09/20] kvm: arm64: Make stage2 page table layout dynamic

Hi Suzuki,

On 07/02/2018 03:24 PM, Suzuki K Poulose wrote:
> Hi Eric,
>
>
> On 02/07/18 13:14, Auger Eric wrote:
>> Hi Suzuki,
>>
>> On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
>>> So far we had a static stage2 page table handling code, based on a
>>> fixed IPA of 40bits. As we prepare for a configurable IPA size per
>>> VM, make our stage2 page table code dynamic, to do the right thing
>>> for a given VM. We ensure the existing condition is always true even
>>> when we lift the limit on the IPA. i.e,
>>>
>>> page table levels in stage1 >= page table levels in stage2
>>>
>>> Support for the IPA size configuration needs other changes in the way
>>> we configure the EL2 registers (VTTBR and VTCR). So, the IPA is still
>>> fixed to 40bits. The patch also moves the kvm_page_empty() in
>>> asm/kvm_mmu.h to the top, before including the asm/stage2_pgtable.h
>>> to avoid a forward declaration.
>>>
>>> Cc: Marc Zyngier <[email protected]>
>>> Cc: Christoffer Dall <[email protected]>
>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>> ---
>>> Changes since V2
>>> - Restrict the stage2 page table to allow reusing the host page table
>>> helpers for now, until we get stage1 independent page table helpers.
>> I would move this up in the commit msg to motivate the fact we enforce
>> the able condition.
>
> This is mentioned in the commit message for the patch which lifts the
> limitation on the IPA. This patch only deals with the dynamic page table
> level handling, with the restriction on the levels. Nevertheless, I could
> add it to the description.
>
>>> ---
>>> arch/arm64/include/asm/kvm_mmu.h | 14 +-
>>> arch/arm64/include/asm/stage2_pgtable-nopmd.h | 42 ------
>>> arch/arm64/include/asm/stage2_pgtable-nopud.h | 39 -----
>>> arch/arm64/include/asm/stage2_pgtable.h | 207 +++++++++++++++++++-------
>>> 4 files changed, 159 insertions(+), 143 deletions(-)
>>> delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopmd.h
>>> delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopud.h
>>
>> with my very limited knowledge of S2 page table walkers I fail to
>> understand why we now can get rid of stage2_pgtable-nopmd.h and
>> stage2_pgtable-nopud.h and associated FOLDED config. Please could you
>> explain it in the commit message?
>
> As mentioned above, we have static page table helpers, which are decided
> at compile time (just like stage1). So these files hold the definitions
> for the cases where the PUD/PMD is folded, and are included for a given
> stage1 VA configuration. But since we are now doing this check per VM,
> we make the decision by checking kvm_stage2_levels(), instead of
> hard-coding it.
>
> Does that help? A short version of that is already there. Maybe I could
> elaborate on that a bit.

Not totally, to be honest. But that's not your fault. I need to spend
more time studying the code to get what the FOLDED case does ;-)

Thanks

Eric
>
>>> -
>>> -#define stage2_pgd_index(kvm, addr) \
>>> - (((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
>>> +static inline unsigned long stage2_pgd_index(struct kvm *kvm, phys_addr_t addr)
>>> +{
>>> + return (addr >> stage2_pgdir_shift(kvm)) & (stage2_pgd_ptrs(kvm) - 1);
>>> +}
>>> static inline phys_addr_t
>>> stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>>> {
>>> - phys_addr_t boundary = (addr + S2_PGDIR_SIZE) & S2_PGDIR_MASK;
>>> + phys_addr_t boundary;
>>> + boundary = (addr + stage2_pgdir_size(kvm)) & stage2_pgdir_mask(kvm);
>>> return (boundary - 1 < end - 1) ? boundary : end;
>>> }
>>>
>>
>> Globally this patch is pretty hard to review. I don't know if it is
>> possible to split into 2. 1) Addition of some helper macros. 2) removal
>> of nopud and nopmd and implementation of the corresponding macros?
>
> I acknowledge that. The patch redefines the "existing" macros to make the
> decision at runtime based on the VM's setting. I will see if there is a
> better way to do it.
>
> Cheers
> Suzuki
>

2018-07-02 15:02:16

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v3 11/20] kvm: arm64: Helper for computing VTCR_EL2.SL0

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> VTCR_EL2 holds the following key stage2 translation table
> parameters:
> SL0 - Entry level in the page table lookup.
> T0SZ - Denotes the size of the memory addressed by the table.
>
> We have been using fixed values for the SL0 depending on the
> page size as we have a fixed IPA size. But since we are about
> to make it dynamic, we need to calculate the SL0 at runtime
> per VM. This patch adds a helper to comput the value of SL0 for
compute
> a given IPA.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> Changes since v2:
> - Part 2 of split from VTCR & VTTBR dynamic configuration
> ---
> arch/arm64/include/asm/kvm_arm.h | 35 ++++++++++++++++++++++++++++++++---
> 1 file changed, 32 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index c557f45..11a7db0 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -153,7 +153,8 @@
> * 2 level page tables (SL = 1)
> */
> #define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
> -#define VTTBR_X_TGRAN_MAGIC 38
> +#define VTCR_EL2_TGRAN_SL0_BASE 3UL
> +
> #elif defined(CONFIG_ARM64_16K_PAGES)
> /*
> * Stage2 translation configuration:
> @@ -161,7 +162,7 @@
> * 2 level page tables (SL = 1)
> */
> #define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_16K | VTCR_EL2_SL0_LVL1)
> -#define VTTBR_X_TGRAN_MAGIC 42
> +#define VTCR_EL2_TGRAN_SL0_BASE 3UL
> #else /* 4K */
> /*
> * Stage2 translation configuration:
> @@ -169,11 +170,39 @@
> * 3 level page tables (SL = 1)
> */
> #define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_4K | VTCR_EL2_SL0_LVL1)
> -#define VTTBR_X_TGRAN_MAGIC 37
> +#define VTCR_EL2_TGRAN_SL0_BASE 2UL
> #endif
>
> #define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN_FLAGS)
> /*
> + * VTCR_EL2:SL0 indicates the entry level for Stage2 translation.
> + * Interestingly, it depends on the page size.
> + * See D.10.2.110, VTCR_EL2, in ARM DDI 0487B.b
update ref to the last one?
> + *
> + * -----------------------------------------
> + * | Entry level | 4K | 16K/64K |
> + * ------------------------------------------
> + * | Level: 0 | 2 | - |
> + * ------------------------------------------
> + * | Level: 1 | 1 | 2 |
> + * ------------------------------------------
> + * | Level: 2 | 0 | 1 |
> + * ------------------------------------------
> + * | Level: 3 | - | 0 |
> + * ------------------------------------------
> + *
> + * That table roughly translates to :
> + *
> + * SL0(PAGE_SIZE, Entry_level) = SL0_BASE(PAGE_SIZE) - Entry_Level
> + *
> + * Where SL0_BASE(4K) = 2 and SL0_BASE(16K) = 3, SL0_BASE(64K) = 3, provided
> + * we take care of ruling out the unsupported cases and
> + * Entry_Level = 4 - Number_of_levels.
> + *
> + */
> +#define VTCR_EL2_SL0(levels) \
> + ((VTCR_EL2_TGRAN_SL0_BASE - (4 - (levels))) << VTCR_EL2_SL0_SHIFT)
> +/*
> * ARM VMSAv8-64 defines an algorithm for finding the translation table
> * descriptors in section D4.2.8 in ARM DDI 0487B.b.
> *
>
Reviewed-by: Eric Auger <[email protected]>

Thanks

Eric
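
The helper under review boils down to a single subtraction. A sketch, with
the SL0 base passed in (2 for 4K, 3 for 16K/64K, as in the patch):

#define SL0_SHIFT	6	/* VTCR_EL2_SL0_SHIFT */

static unsigned long vtcr_sl0(unsigned long sl0_base, unsigned int levels)
{
	return (sl0_base - (4 - levels)) << SL0_SHIFT;
}

/* e.g. 4K with 3 levels: SL0 = 2 - 1 = 1, i.e. the walk starts at level 1 */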

2018-07-02 15:04:04

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v3 08/20] kvm: arm/arm64: Abstract stage2 pgd table allocation

Hi Suzuki

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> Abstract the allocation of stage2 entry level tables for
> given VM, so that later we can choose to fall back to the
> normal page table levels (i.e, avoid entry level table
> concatenation) on arm64.

The justification is not crystal clear to me, but it does no harm, I think.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> Changes since V2:
> - New patch
> ---
> arch/arm/include/asm/kvm_mmu.h | 6 ++++++
> arch/arm64/include/asm/kvm_mmu.h | 6 ++++++
> virt/kvm/arm/mmu.c | 2 +-
> 3 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index f36eb20..b2da5a4 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -372,6 +372,12 @@ static inline int hyp_map_aux_data(void)
> return 0;
> }
>
> +static inline void *stage2_alloc_pgd(struct kvm *kvm)
> +{
> + return alloc_pages_exact(stage2_pgd_size(kvm),
> + GFP_KERNEL | __GFP_ZERO);
> +}
> +
> #define kvm_phys_to_vttbr(addr) (addr)
>
> #endif /* !__ASSEMBLY__ */
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 5da8f52..dbaf513 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -501,5 +501,11 @@ static inline int hyp_map_aux_data(void)
>
> #define kvm_phys_to_vttbr(addr) phys_to_ttbr(addr)
>
> +static inline void *stage2_alloc_pgd(struct kvm *kvm)
> +{
> + return alloc_pages_exact(stage2_pgd_size(kvm),
> + GFP_KERNEL | __GFP_ZERO);
> +}
> +
> #endif /* __ASSEMBLY__ */
> #endif /* __ARM64_KVM_MMU_H__ */
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 82dd571..a339e00 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -868,7 +868,7 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
> }
>
> /* Allocate the HW PGD, making sure that each page gets its own refcount */
> - pgd = alloc_pages_exact(stage2_pgd_size(kvm), GFP_KERNEL | __GFP_ZERO);
> + pgd = stage2_alloc_pgd(kvm);
> if (!pgd)
> return -ENOMEM;
>
>
Reviewed-by: Eric Auger <[email protected]>

Thanks

Eric
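
For context, the size being allocated follows from the __s2_pgd_ptrs() and
__s2_pgd_size() definitions quoted elsewhere in the thread. A sketch, taking
pgd_t to be 8 bytes:

#include <stddef.h>
#include <stdint.h>

static size_t s2_pgd_size(unsigned int ipa, unsigned int pgdir_shift)
{
	return ((size_t)1 << (ipa - pgdir_shift)) * sizeof(uint64_t);
}

/*
 * e.g. a 40bit IPA with 4K pages and 3 levels (pgdir_shift = 30):
 * 1024 entries * 8 bytes = 8K, i.e. two concatenated level-1 tables.
 */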

2018-07-02 19:14:59

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v3 12/20] kvm: arm64: Add helper for loading the stage2 setting for a VM

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> We load the stage2 context of a guest for different operations,
> including running the guest and tlb maintenance on behalf of the
> guest. As of now only the vttbr is private to the guest, but this
> is about to change with IPA per VM. Add a helper to load the stage2
> configuration for a VM, which could do the right thing with the
> future changes.
>
> Cc: Christoffer Dall <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
Reviewed-by: Eric Auger <[email protected]>

Thanks

Eric
> ---
> Changes since v2:
> - New patch
> ---
> arch/arm64/include/asm/kvm_hyp.h | 6 ++++++
> arch/arm64/kvm/hyp/switch.c | 2 +-
> arch/arm64/kvm/hyp/tlb.c | 4 ++--
> 3 files changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
> index 384c343..82f9994 100644
> --- a/arch/arm64/include/asm/kvm_hyp.h
> +++ b/arch/arm64/include/asm/kvm_hyp.h
> @@ -155,5 +155,11 @@ void deactivate_traps_vhe_put(void);
> u64 __guest_enter(struct kvm_vcpu *vcpu, struct kvm_cpu_context *host_ctxt);
> void __noreturn __hyp_do_panic(unsigned long, ...);
>
> +/* Must be called from hyp code running at EL2 */
> +static __always_inline void __hyp_text __load_guest_stage2(struct kvm *kvm)
> +{
> + write_sysreg(kvm->arch.vttbr, vttbr_el2);
> +}
> +
> #endif /* __ARM64_KVM_HYP_H__ */
>
> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> index d496ef5..355fb25 100644
> --- a/arch/arm64/kvm/hyp/switch.c
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -195,7 +195,7 @@ void deactivate_traps_vhe_put(void)
>
> static void __hyp_text __activate_vm(struct kvm *kvm)
> {
> - write_sysreg(kvm->arch.vttbr, vttbr_el2);
> + __load_guest_stage2(kvm);
> }
>
> static void __hyp_text __deactivate_vm(struct kvm_vcpu *vcpu)
> diff --git a/arch/arm64/kvm/hyp/tlb.c b/arch/arm64/kvm/hyp/tlb.c
> index 131c777..4dbd9c6 100644
> --- a/arch/arm64/kvm/hyp/tlb.c
> +++ b/arch/arm64/kvm/hyp/tlb.c
> @@ -30,7 +30,7 @@ static void __hyp_text __tlb_switch_to_guest_vhe(struct kvm *kvm)
> * bits. Changing E2H is impossible (goodbye TTBR1_EL2), so
> * let's flip TGE before executing the TLB operation.
> */
> - write_sysreg(kvm->arch.vttbr, vttbr_el2);
> + __load_guest_stage2(kvm);
> val = read_sysreg(hcr_el2);
> val &= ~HCR_TGE;
> write_sysreg(val, hcr_el2);
> @@ -39,7 +39,7 @@ static void __hyp_text __tlb_switch_to_guest_vhe(struct kvm *kvm)
>
> static void __hyp_text __tlb_switch_to_guest_nvhe(struct kvm *kvm)
> {
> - write_sysreg(kvm->arch.vttbr, vttbr_el2);
> + __load_guest_stage2(kvm);
> isb();
> }
>
>

2018-07-03 08:04:43

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH v3 01/20] virtio: mmio-v1: Validate queue PFN

Hi Michael,

On 06/29/2018 06:42 PM, Michael S. Tsirkin wrote:
> On Fri, Jun 29, 2018 at 12:15:21PM +0100, Suzuki K Poulose wrote:
>> virtio-mmio with virtio-v1 uses a 32bit PFN for the queue.
>> If the queue pfn is too large to fit in 32bits, which
>> we could hit on arm64 systems with 52bit physical addresses
>> (even with 64K page size), we simply miss out a proper link
>> to the other side of the queue.
>>
>> Add a check to validate the PFN, rather than silently breaking
>> the devices.
>>
>> Cc: "Michael S. Tsirkin" <[email protected]>
>> Cc: Jason Wang <[email protected]>
>> Cc: Marc Zyngier <[email protected]>
>> Cc: Christoffer Dall <[email protected]>
>> Cc: Peter Maydel <[email protected]>
>> Cc: Jean-Philippe Brucker <[email protected]>
>> Signed-off-by: Suzuki K Poulose <[email protected]>
>> ---
>> Changes since v2:
>> - Change errno to -E2BIG
>> ---
>> drivers/virtio/virtio_mmio.c | 18 ++++++++++++++++--
>> 1 file changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
>> index 67763d3..82cedc8 100644
>> --- a/drivers/virtio/virtio_mmio.c
>> +++ b/drivers/virtio/virtio_mmio.c
>> @@ -397,9 +397,21 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
>> /* Activate the queue */
>> writel(virtqueue_get_vring_size(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_NUM);
>> if (vm_dev->version == 1) {
>> + u64 q_pfn = virtqueue_get_desc_addr(vq) >> PAGE_SHIFT;
>> +
>> + /*
>> + * virtio-mmio v1 uses a 32bit QUEUE PFN. If we have something
>> + * that doesn't fit in 32bit, fail the setup rather than
>> + * pretending to be successful.
>> + */
>> + if (q_pfn >> 32) {
>> + dev_err(&vdev->dev, "virtio-mmio: queue address too large\n");
>
> How about:
> "hypervisor bug: legacy virtio-mmio must not be used with more than 0x%llx Gigabytes of memory",
> 0x1ULL << (32 - 30) << PAGE_SHIFT

nit: Do we need to change "hypervisor" => "platform"? Virtio is used by
other tools (e.g., emulators) and not just virtual machines.

Suzuki

2018-07-03 10:49:50

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH v3 13/20] kvm: arm64: Configure VTCR per VM

On 02/07/18 13:16, Marc Zyngier wrote:
> On 29/06/18 12:15, Suzuki K Poulose wrote:
>> We set VTCR_EL2 very early during the stage2 init and don't
>> touch it ever. This is fine as we had a fixed IPA size. This
>> patch changes the behavior to set the VTCR for a given VM,
>> depending on its stage2 table. The common configuration for
>> VTCR is still performed during the early init as we have to
>> retain the hardware access flag update bits (VTCR_EL2_HA)
>> per CPU (as they are only set for the CPUs which are capabile).
>
> capable
>
>> The bits defining the number of levels in the page table (SL0)
>> and the size of the input address to the translation (T0SZ)
>> are programmed for each VM upon entry to the guest.
>>
>> Cc: Marc Zyngier <[email protected]>
>> Cc: Christoffer Dall <[email protected]>
>> Signed-off-by: Suzuki K Poulose <[email protected]>
>> ---
>> Change since V2:
>> - Load VTCR for TLB operations
>> ---
>> arch/arm64/include/asm/kvm_arm.h | 19 +++++++++----------
>> arch/arm64/include/asm/kvm_asm.h | 2 +-
>> arch/arm64/include/asm/kvm_host.h | 9 ++++++---
>> arch/arm64/include/asm/kvm_hyp.h | 11 +++++++++++
>> arch/arm64/kvm/hyp/s2-setup.c | 17 +----------------
>> 5 files changed, 28 insertions(+), 30 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
>> index 11a7db0..b02c316 100644
>> --- a/arch/arm64/include/asm/kvm_arm.h
>> +++ b/arch/arm64/include/asm/kvm_arm.h
>> @@ -120,9 +120,7 @@
>> #define VTCR_EL2_IRGN0_WBWA TCR_IRGN0_WBWA
>> #define VTCR_EL2_SL0_SHIFT 6
>> #define VTCR_EL2_SL0_MASK (3 << VTCR_EL2_SL0_SHIFT)
>> -#define VTCR_EL2_SL0_LVL1 (1 << VTCR_EL2_SL0_SHIFT)
>> #define VTCR_EL2_T0SZ_MASK 0x3f
>> -#define VTCR_EL2_T0SZ_40B 24
>> #define VTCR_EL2_VS_SHIFT 19
>> #define VTCR_EL2_VS_8BIT (0 << VTCR_EL2_VS_SHIFT)
>> #define VTCR_EL2_VS_16BIT (1 << VTCR_EL2_VS_SHIFT)
>> @@ -137,43 +135,44 @@
>> * VTCR_EL2.PS is extracted from ID_AA64MMFR0_EL1.PARange at boot time
>> * (see hyp-init.S).
>> *
>> + * VTCR_EL2.SL0 and T0SZ are configured per VM at runtime before switching to
>> + * the VM.
>> + *
>> * Note that when using 4K pages, we concatenate two first level page tables
>> * together. With 16K pages, we concatenate 16 first level page tables.
>> *
>> */
>>
>> -#define VTCR_EL2_T0SZ_IPA VTCR_EL2_T0SZ_40B
>> #define VTCR_EL2_COMMON_BITS (VTCR_EL2_SH0_INNER | VTCR_EL2_ORGN0_WBWA | \
>> VTCR_EL2_IRGN0_WBWA | VTCR_EL2_RES1)
>> +#define VTCR_EL2_PRIVATE_MASK (VTCR_EL2_SL0_MASK | VTCR_EL2_T0SZ_MASK)
>
> What does "private" mean here? It really is the IPA configuration, so
> I'd rather have a naming that reflects that.
>
>> #ifdef CONFIG_ARM64_64K_PAGES
>> /*
>> * Stage2 translation configuration:
>> * 64kB pages (TG0 = 1)
>> - * 2 level page tables (SL = 1)
>> */
>> -#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
>> +#define VTCR_EL2_TGRAN VTCR_EL2_TG0_64K
>> #define VTCR_EL2_TGRAN_SL0_BASE 3UL
>>
>> #elif defined(CONFIG_ARM64_16K_PAGES)
>> /*
>> * Stage2 translation configuration:
>> * 16kB pages (TG0 = 2)
>> - * 2 level page tables (SL = 1)
>> */
>> -#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_16K | VTCR_EL2_SL0_LVL1)
>> +#define VTCR_EL2_TGRAN VTCR_EL2_TG0_16K
>> #define VTCR_EL2_TGRAN_SL0_BASE 3UL
>> #else /* 4K */
>> /*
>> * Stage2 translation configuration:
>> * 4kB pages (TG0 = 0)
>> - * 3 level page tables (SL = 1)
>> */
>> -#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_4K | VTCR_EL2_SL0_LVL1)
>> +#define VTCR_EL2_TGRAN VTCR_EL2_TG0_4K
>> #define VTCR_EL2_TGRAN_SL0_BASE 2UL
>> #endif
>>
>> -#define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN_FLAGS)
>> +#define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN)
>> +
>> /*
>> * VTCR_EL2:SL0 indicates the entry level for Stage2 translation.
>> * Interestingly, it depends on the page size.
>> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
>> index 102b5a5..91372eb 100644
>> --- a/arch/arm64/include/asm/kvm_asm.h
>> +++ b/arch/arm64/include/asm/kvm_asm.h
>> @@ -72,7 +72,7 @@ extern void __vgic_v3_init_lrs(void);
>>
>> extern u32 __kvm_get_mdcr_el2(void);
>>
>> -extern u32 __init_stage2_translation(void);
>> +extern void __init_stage2_translation(void);
>>
>> /* Home-grown __this_cpu_{ptr,read} variants that always work at HYP */
>> #define __hyp_this_cpu_ptr(sym) \
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index fe8777b..328f472 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -442,10 +442,13 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>>
>> static inline void __cpu_init_stage2(void)
>> {
>> - u32 parange = kvm_call_hyp(__init_stage2_translation);
>> + u32 ps;
>>
>> - WARN_ONCE(parange < 40,
>> - "PARange is %d bits, unsupported configuration!", parange);
>> + kvm_call_hyp(__init_stage2_translation);
>> + /* Sanity check for minimum IPA size support */
>> + ps = id_aa64mmfr0_parange_to_phys_shift(read_sysreg(id_aa64mmfr0_el1) & 0x7);
>> + WARN_ONCE(ps < 40,
>> + "PARange is %d bits, unsupported configuration!", ps);
>> }
>>
>> /* Guest/host FPSIMD coordination helpers */
>> diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
>> index 82f9994..3e8052d1 100644
>> --- a/arch/arm64/include/asm/kvm_hyp.h
>> +++ b/arch/arm64/include/asm/kvm_hyp.h
>> @@ -20,6 +20,7 @@
>>
>> #include <linux/compiler.h>
>> #include <linux/kvm_host.h>
>> +#include <asm/kvm_mmu.h>
>> #include <asm/sysreg.h>
>>
>> #define __hyp_text __section(.hyp.text) notrace
>> @@ -158,6 +159,16 @@ void __noreturn __hyp_do_panic(unsigned long, ...);
>> /* Must be called from hyp code running at EL2 */

Marc,

>> static __always_inline void __hyp_text __load_guest_stage2(struct kvm *kvm)
>> {
>> + /*
>> + * Configure the VTCR translation control bits
>> + * for this VM.
>> + */
>> + u64 vtcr = read_sysreg(vtcr_el2);
>> +
>> + vtcr &= ~VTCR_EL2_PRIVATE_MASK;
>> + vtcr |= VTCR_EL2_SL0(kvm_stage2_levels(kvm)) |
>> + VTCR_EL2_T0SZ(kvm_phys_shift(kvm));
>> + write_sysreg(vtcr, vtcr_el2);
>
> Can't we generate the whole vtcr value in one go, without reading it
> back? Specially given that on patch 16, you're actually switching to a
> per-VM variable, and it would make a lot of sense to start with that here.

...

>> -u32 __hyp_text __init_stage2_translation(void)
>> +void __hyp_text __init_stage2_translation(void)
..

>
> And then most of the code here could run on a per-VM basis.

There is one problem with generating the entire vtcr for a VM.
On a system with mismatched CPU features, we need to have either:

- Per CPU VTCR fixed bits
OR
- Track system wide safe VTCR bits. (Not ideal with dirty bit and access
flag updates, if and when we support them.)

So far the only fields of interest are HA & HD, which may be turned on
for CPUs that can support the feature. The rest can be filled in from the
sanitised EL1 system registers and the IPA limit, and the remaining bits
would need to be filled in as RES0. This could potentially have some
issues on newer versions of the architecture running on older kernels.

What do you think?

Suzuki

2018-07-03 11:00:29

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v3 13/20] kvm: arm64: Configure VTCR per VM

On 03/07/18 11:48, Suzuki K Poulose wrote:
> On 02/07/18 13:16, Marc Zyngier wrote:
>> On 29/06/18 12:15, Suzuki K Poulose wrote:
>>> We set VTCR_EL2 very early during the stage2 init and don't
>>> touch it ever. This is fine as we had a fixed IPA size. This
>>> patch changes the behavior to set the VTCR for a given VM,
>>> depending on its stage2 table. The common configuration for
>>> VTCR is still performed during the early init as we have to
>>> retain the hardware access flag update bits (VTCR_EL2_HA)
>>> per CPU (as they are only set for the CPUs which are capabile).
>>
>> capable
>>
>>> The bits defining the number of levels in the page table (SL0)
>>> and the size of the input address to the translation (T0SZ)
>>> are programmed for each VM upon entry to the guest.
>>>
>>> Cc: Marc Zyngier <[email protected]>
>>> Cc: Christoffer Dall <[email protected]>
>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>> ---
>>> Change since V2:
>>> - Load VTCR for TLB operations
>>> ---
>>> arch/arm64/include/asm/kvm_arm.h | 19 +++++++++----------
>>> arch/arm64/include/asm/kvm_asm.h | 2 +-
>>> arch/arm64/include/asm/kvm_host.h | 9 ++++++---
>>> arch/arm64/include/asm/kvm_hyp.h | 11 +++++++++++
>>> arch/arm64/kvm/hyp/s2-setup.c | 17 +----------------
>>> 5 files changed, 28 insertions(+), 30 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
>>> index 11a7db0..b02c316 100644
>>> --- a/arch/arm64/include/asm/kvm_arm.h
>>> +++ b/arch/arm64/include/asm/kvm_arm.h
>>> @@ -120,9 +120,7 @@
>>> #define VTCR_EL2_IRGN0_WBWA TCR_IRGN0_WBWA
>>> #define VTCR_EL2_SL0_SHIFT 6
>>> #define VTCR_EL2_SL0_MASK (3 << VTCR_EL2_SL0_SHIFT)
>>> -#define VTCR_EL2_SL0_LVL1 (1 << VTCR_EL2_SL0_SHIFT)
>>> #define VTCR_EL2_T0SZ_MASK 0x3f
>>> -#define VTCR_EL2_T0SZ_40B 24
>>> #define VTCR_EL2_VS_SHIFT 19
>>> #define VTCR_EL2_VS_8BIT (0 << VTCR_EL2_VS_SHIFT)
>>> #define VTCR_EL2_VS_16BIT (1 << VTCR_EL2_VS_SHIFT)
>>> @@ -137,43 +135,44 @@
>>> * VTCR_EL2.PS is extracted from ID_AA64MMFR0_EL1.PARange at boot time
>>> * (see hyp-init.S).
>>> *
>>> + * VTCR_EL2.SL0 and T0SZ are configured per VM at runtime before switching to
>>> + * the VM.
>>> + *
>>> * Note that when using 4K pages, we concatenate two first level page tables
>>> * together. With 16K pages, we concatenate 16 first level page tables.
>>> *
>>> */
>>>
>>> -#define VTCR_EL2_T0SZ_IPA VTCR_EL2_T0SZ_40B
>>> #define VTCR_EL2_COMMON_BITS (VTCR_EL2_SH0_INNER | VTCR_EL2_ORGN0_WBWA | \
>>> VTCR_EL2_IRGN0_WBWA | VTCR_EL2_RES1)
>>> +#define VTCR_EL2_PRIVATE_MASK (VTCR_EL2_SL0_MASK | VTCR_EL2_T0SZ_MASK)
>>
>> What does "private" mean here? It really is the IPA configuration, so
>> I'd rather have a naming that reflects that.
>>
>>> #ifdef CONFIG_ARM64_64K_PAGES
>>> /*
>>> * Stage2 translation configuration:
>>> * 64kB pages (TG0 = 1)
>>> - * 2 level page tables (SL = 1)
>>> */
>>> -#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
>>> +#define VTCR_EL2_TGRAN VTCR_EL2_TG0_64K
>>> #define VTCR_EL2_TGRAN_SL0_BASE 3UL
>>>
>>> #elif defined(CONFIG_ARM64_16K_PAGES)
>>> /*
>>> * Stage2 translation configuration:
>>> * 16kB pages (TG0 = 2)
>>> - * 2 level page tables (SL = 1)
>>> */
>>> -#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_16K | VTCR_EL2_SL0_LVL1)
>>> +#define VTCR_EL2_TGRAN VTCR_EL2_TG0_16K
>>> #define VTCR_EL2_TGRAN_SL0_BASE 3UL
>>> #else /* 4K */
>>> /*
>>> * Stage2 translation configuration:
>>> * 4kB pages (TG0 = 0)
>>> - * 3 level page tables (SL = 1)
>>> */
>>> -#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_4K | VTCR_EL2_SL0_LVL1)
>>> +#define VTCR_EL2_TGRAN VTCR_EL2_TG0_4K
>>> #define VTCR_EL2_TGRAN_SL0_BASE 2UL
>>> #endif
>>>
>>> -#define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN_FLAGS)
>>> +#define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN)
>>> +
>>> /*
>>> * VTCR_EL2:SL0 indicates the entry level for Stage2 translation.
>>> * Interestingly, it depends on the page size.
>>> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
>>> index 102b5a5..91372eb 100644
>>> --- a/arch/arm64/include/asm/kvm_asm.h
>>> +++ b/arch/arm64/include/asm/kvm_asm.h
>>> @@ -72,7 +72,7 @@ extern void __vgic_v3_init_lrs(void);
>>>
>>> extern u32 __kvm_get_mdcr_el2(void);
>>>
>>> -extern u32 __init_stage2_translation(void);
>>> +extern void __init_stage2_translation(void);
>>>
>>> /* Home-grown __this_cpu_{ptr,read} variants that always work at HYP */
>>> #define __hyp_this_cpu_ptr(sym) \
>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>> index fe8777b..328f472 100644
>>> --- a/arch/arm64/include/asm/kvm_host.h
>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>> @@ -442,10 +442,13 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>>>
>>> static inline void __cpu_init_stage2(void)
>>> {
>>> - u32 parange = kvm_call_hyp(__init_stage2_translation);
>>> + u32 ps;
>>>
>>> - WARN_ONCE(parange < 40,
>>> - "PARange is %d bits, unsupported configuration!", parange);
>>> + kvm_call_hyp(__init_stage2_translation);
>>> + /* Sanity check for minimum IPA size support */
>>> + ps = id_aa64mmfr0_parange_to_phys_shift(read_sysreg(id_aa64mmfr0_el1) & 0x7);
>>> + WARN_ONCE(ps < 40,
>>> + "PARange is %d bits, unsupported configuration!", ps);
>>> }
>>>
>>> /* Guest/host FPSIMD coordination helpers */
>>> diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
>>> index 82f9994..3e8052d1 100644
>>> --- a/arch/arm64/include/asm/kvm_hyp.h
>>> +++ b/arch/arm64/include/asm/kvm_hyp.h
>>> @@ -20,6 +20,7 @@
>>>
>>> #include <linux/compiler.h>
>>> #include <linux/kvm_host.h>
>>> +#include <asm/kvm_mmu.h>
>>> #include <asm/sysreg.h>
>>>
>>> #define __hyp_text __section(.hyp.text) notrace
>>> @@ -158,6 +159,16 @@ void __noreturn __hyp_do_panic(unsigned long, ...);
>>> /* Must be called from hyp code running at EL2 */
>
> Marc,
>
>>> static __always_inline void __hyp_text __load_guest_stage2(struct kvm *kvm)
>>> {
>>> + /*
>>> + * Configure the VTCR translation control bits
>>> + * for this VM.
>>> + */
>>> + u64 vtcr = read_sysreg(vtcr_el2);
>>> +
>>> + vtcr &= ~VTCR_EL2_PRIVATE_MASK;
>>> + vtcr |= VTCR_EL2_SL0(kvm_stage2_levels(kvm)) |
>>> + VTCR_EL2_T0SZ(kvm_phys_shift(kvm));
>>> + write_sysreg(vtcr, vtcr_el2);
>>
>> Can't we generate the whole vtcr value in one go, without reading it
>> back? Specially given that on patch 16, you're actually switching to a
>> per-VM variable, and it would make a lot of sense to start with that here.
>
> ...
>
>>> -u32 __hyp_text __init_stage2_translation(void)
>>> +void __hyp_text __init_stage2_translation(void)
> ..
>
>>
>> And then most of the code here could run on a per-VM basis.
>
> There is one problem with generating the entire vtcr for a VM.
> On a system with mismatched CPU features, we need to have either:
>
> - Per CPU VTCR fixed bits
> OR
> - Track system wide safe VTCR bits. (Not ideal with dirty bit and access
> flag updates, if and when we support them.)
>
> So far the only fields of interest are HA & HD, which may be turned on
> for CPUs that can support the feature. The rest can be filled in from the
> sanitised EL1 system registers and the IPA limit, and the remaining bits
> would need to be filled in as RES0. This could potentially have some
> issues on newer versions of the architecture running on older kernels.

For HA and HD, we can perfectly set them even if only one CPU in the
system has them. We already do this for other system registers, on the
grounds that if the CPU doesn't honour the RES0 behaviour, then it is
terminally broken.

Thanks,

M.
--
Jazz is not dead. It just smells funny...
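
Following Marc's suggestion, the per-VM value could be assembled in one go
instead of read-modify-written. A sketch, where vtcr_common stands for the
boot-time common bits (cacheability, TG0, PS, and HA/HD where supported)
and the SL0/T0SZ encodings are as in the patches above:

#include <stdint.h>

static uint64_t make_vtcr(uint64_t vtcr_common, unsigned int ipa,
			  unsigned int levels, unsigned long sl0_base)
{
	uint64_t vtcr = vtcr_common;

	vtcr |= (sl0_base - (4 - levels)) << 6;	/* VTCR_EL2_SL0(levels) */
	vtcr |= 64 - ipa;			/* VTCR_EL2_T0SZ(ipa) */
	return vtcr;
}

The result could then be cached per VM and written straight to vtcr_el2 on
guest entry.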

2018-07-03 11:56:16

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH v3 10/20] kvm: arm64: Dynamic configuration of VTTBR mask

Hi Eric,

On 02/07/18 15:41, Auger Eric wrote:
> Hi Suzuki,
>
> On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
>> On arm64 VTTBR_EL2:BADDR holds the base address for the stage2
>> translation table. The Arm ARM mandates that the bits BADDR[x-1:0]
>> should be 0, where 'x' is defined for a given IPA Size and the
>> number of levels for a translation granule size. It is defined
>> using some magical constants. This patch is a reverse engineered
>> implementation to calculate the 'x' at runtime for a given ipa and
>> number of page table levels. See patch for more details.
>>
>> Cc: Marc Zyngier <[email protected]>
>> Cc: Christoffer Dall <[email protected]>
>> Signed-off-by: Suzuki K Poulose <[email protected]>
>> ---
>> Changes since V2:
>> - Part 1 of spilt from VTCR & VTTBR dynamic configuration
>> ---
>> arch/arm64/include/asm/kvm_arm.h | 60 +++++++++++++++++++++++++++++++++++++---
>> arch/arm64/include/asm/kvm_mmu.h | 25 ++++++++++++++++-
>> 2 files changed, 80 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
>> index 3dffd38..c557f45 100644
>> --- a/arch/arm64/include/asm/kvm_arm.h
>> +++ b/arch/arm64/include/asm/kvm_arm.h
>> @@ -140,8 +140,6 @@
>> * Note that when using 4K pages, we concatenate two first level page tables
>> * together. With 16K pages, we concatenate 16 first level page tables.
>> *
>> - * The magic numbers used for VTTBR_X in this patch can be found in Tables
>> - * D4-23 and D4-25 in ARM DDI 0487A.b.
> Isn't it a pretty old reference? Could you refer to C.a?

Sure, I will update the references everywhere.

>> + *
>> + * The algorithm defines the expectations on the BaseAddress (for the page
>> + * table) bits resolved at each level based on the page size, entry level
>> + * and T0SZ. The variable "x" in the algorithm also affects the VTTBR:BADDR
>> + * for stage2 page table.
>> + *
>> + * The value of "x" is calculated as :
>> + * x = Magic_N - T0SZ
>> + *
>> + * where Magic_N is an integer depending on the page size and the entry
>> + * level of the page table as below:
>> + *
>> + * --------------------------------------------
>> + * | Entry level | 4K 16K 64K |
>> + * --------------------------------------------
>> + * | Level: 0 (4 levels) | 28 | - | - |
>> + * --------------------------------------------
>> + * | Level: 1 (3 levels) | 37 | 31 | 25 |
>> + * --------------------------------------------
>> + * | Level: 2 (2 levels) | 46 | 42 | 38 |
>> + * --------------------------------------------
>> + * | Level: 3 (1 level) | - | 53 | 51 |
>> + * --------------------------------------------
> I understand entry level = Lookup level in the table.

Entry level => The level at which we start the page table walk for
a given address (This is in line with the ARM ARM). So,

Entry_level = (4 - Number_of_Page_table_levels)

> But you may want to compute x for BaseAddress matching lookup level 2
> with number of levels = 4.

No, the BaseAddress is only calculated for the "Entry_level". So the
above case doesn't exist at all.

> So shouldn't you s/Number of levels/4 - entry_level?

Ok, I now understand what you are referring to [0]
> For BADDR we want the BaseAddr of the initial lookup level, so
> effectively the entry level we are interested in is 4 - number of levels
> and we don't care about the d) condition. At least this is my understanding ;-)
> If correct you may slightly reword the explanation?


>> + *
>> + * We have a magic formula for the Magic_N below.
>> + *
>> + * Magic_N(PAGE_SIZE, Entry_Level) = 64 - ((PAGE_SHIFT - 3) * Number of levels)

[0] ^^^



>> + *
>> + * where number of levels = (4 - Entry_Level).

^^^ Doesn't this help make it clear? Using the expansion makes it a bit
less readable below.

>>
>> +/*
>> + * Get the magic number 'x' for VTTBR:BADDR of this KVM instance.
>> + * With v8.2 LVA extensions, 'x' should be a minimum of 6 with
>> + * 52bit IPS.
> Link to the spec?

Sure, will add it.

Thanks for your patience with the review :-)

Cheers
Suzuki
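
On the 52bit point, a sketch of the floor applied by arm64_vttbr_x(); the
BADDR[5:2] detail is taken from the 52bit TTBR handling commit referenced
earlier in the series:

static int vttbr_x(int ipa, int levels, int page_shift, int lpa_52bit)
{
	int x = ipa - levels * (page_shift - 3);

	/* with 52bit PAs, BADDR[5:2] carries PA[51:48], so x >= 6 */
	return (lpa_52bit && x < 6) ? 6 : x;
}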

2018-07-04 05:38:42

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v3 01/20] virtio: mmio-v1: Validate queue PFN

On Tue, Jul 03, 2018 at 09:04:01AM +0100, Suzuki K Poulose wrote:
> Hi Michael,
>
> On 06/29/2018 06:42 PM, Michael S. Tsirkin wrote:
> > On Fri, Jun 29, 2018 at 12:15:21PM +0100, Suzuki K Poulose wrote:
> > > virtio-mmio with virtio-v1 uses a 32bit PFN for the queue.
> > > If the queue pfn is too large to fit in 32bits, which
> > > we could hit on arm64 systems with 52bit physical addresses
> > > (even with 64K page size), we simply miss out a proper link
> > > to the other side of the queue.
> > >
> > > Add a check to validate the PFN, rather than silently breaking
> > > the devices.
> > >
> > > Cc: "Michael S. Tsirkin" <[email protected]>
> > > Cc: Jason Wang <[email protected]>
> > > Cc: Marc Zyngier <[email protected]>
> > > Cc: Christoffer Dall <[email protected]>
> > > Cc: Peter Maydel <[email protected]>
> > > Cc: Jean-Philippe Brucker <[email protected]>
> > > Signed-off-by: Suzuki K Poulose <[email protected]>
> > > ---
> > > Changes since v2:
> > > - Change errno to -E2BIG
> > > ---
> > > drivers/virtio/virtio_mmio.c | 18 ++++++++++++++++--
> > > 1 file changed, 16 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
> > > index 67763d3..82cedc8 100644
> > > --- a/drivers/virtio/virtio_mmio.c
> > > +++ b/drivers/virtio/virtio_mmio.c
> > > @@ -397,9 +397,21 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
> > > /* Activate the queue */
> > > writel(virtqueue_get_vring_size(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_NUM);
> > > if (vm_dev->version == 1) {
> > > + u64 q_pfn = virtqueue_get_desc_addr(vq) >> PAGE_SHIFT;
> > > +
> > > + /*
> > > + * virtio-mmio v1 uses a 32bit QUEUE PFN. If we have something
> > > + * that doesn't fit in 32bit, fail the setup rather than
> > > + * pretending to be successful.
> > > + */
> > > + if (q_pfn >> 32) {
> > > + dev_err(&vdev->dev, "virtio-mmio: queue address too large\n");
> >
> > How about:
> > "hypervisor bug: legacy virtio-mmio must not be used with more than 0x%llx Gigabytes of memory",
> > 0x1ULL << (32 - 30) << PAGE_SHIFT
>
> nit: Do we need to change "hypervisor" => "platform"? Virtio is used by other
> tools (e.g., emulators) and not just virtual machines.
>
> Suzuki

OK.
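
For the record, a worked example of what the check catches: with 64K
pages (PAGE_SHIFT = 16), a vring allocated at physical address 1 << 48
has q_pfn = 1 << 32, so the 32bit writel() of the PFN would silently
truncate it to 0; with this patch the setup fails with -E2BIG instead.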


2018-07-04 08:13:49

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v3 17/20] vgic: Add support for 52bit guest physical address

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> From: Kristina Martsenko <[email protected]>
>
> Add support for handling 52bit guest physical address to the
> VGIC layer. So far we have limited the guest physical address
> to 48bits, by explicitly masking the upper bits. This patch
> removes the restriction. We do not have to check if the host
> supports 52bit as the gpa is always validated during an access.
> (e.g, kvm_{read/write}_guest, kvm_is_visible_gfn()).
> Also, the ITS table save-restore is not affected by
> the enhancement. The DTE entries already store the bits[51:8]
> of the ITT_addr (with a 256-byte alignment).
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Kristina Martsenko <[email protected]>
> [ Macro clean ups, fix PROPBASER and PENDBASER accesses ]
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> include/linux/irqchip/arm-gic-v3.h | 5 +++++
> virt/kvm/arm/vgic/vgic-its.c | 36 ++++++++++--------------------------
> virt/kvm/arm/vgic/vgic-mmio-v3.c | 2 --
> 3 files changed, 15 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/irqchip/arm-gic-v3.h b/include/linux/irqchip/arm-gic-v3.h
> index cbb872c..bc4b95b 100644
> --- a/include/linux/irqchip/arm-gic-v3.h
> +++ b/include/linux/irqchip/arm-gic-v3.h
> @@ -346,6 +346,8 @@
> #define GITS_CBASER_RaWaWt GIC_BASER_CACHEABILITY(GITS_CBASER, INNER, RaWaWt)
> #define GITS_CBASER_RaWaWb GIC_BASER_CACHEABILITY(GITS_CBASER, INNER, RaWaWb)
>
> +#define GITS_CBASER_ADDRESS(cbaser) ((cbaser) & GENMASK_ULL(52, 12))
> +
> #define GITS_BASER_NR_REGS 8
>
> #define GITS_BASER_VALID (1ULL << 63)
> @@ -377,6 +379,9 @@
> #define GITS_BASER_ENTRY_SIZE_MASK GENMASK_ULL(52, 48)
> #define GITS_BASER_PHYS_52_to_48(phys) \
> (((phys) & GENMASK_ULL(47, 16)) | (((phys) >> 48) & 0xf) << 12)
> +#define GITS_BASER_ADDR_48_to_52(baser) \
> + (((baser) & GENMASK_ULL(47, 16)) | (((baser) >> 12) & 0xf) << 48)
This only works if page_size = 64kB, which is the case in the vITS, but as
it is in an irqchip header, it may be worth a comment?
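(As an illustration of the swizzle, with a made-up value: a 52bit,
64K-aligned address round-trips through the two macros:

  u64 phys  = 0x000f876543210000ULL;          /* bits [51:48] = 0xf */
  u64 baser = GITS_BASER_PHYS_52_to_48(phys); /* [51:48] -> [15:12] */
  /* GITS_BASER_ADDR_48_to_52(baser) == phys again */

but with a 4K or 16K ITS page size, bits [15:12] of the register would
be address bits themselves, hence the 64K-only caveat above.)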
> +
> #define GITS_BASER_SHAREABILITY_SHIFT (10)
> #define GITS_BASER_InnerShareable \
> GIC_BASER_SHAREABILITY(GITS_BASER, InnerShareable)
> diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c
> index 4ed79c9..c6eb390 100644
> --- a/virt/kvm/arm/vgic/vgic-its.c
> +++ b/virt/kvm/arm/vgic/vgic-its.c
> @@ -234,13 +234,6 @@ static struct its_ite *find_ite(struct vgic_its *its, u32 device_id,
> list_for_each_entry(dev, &(its)->device_list, dev_list) \
> list_for_each_entry(ite, &(dev)->itt_head, ite_list)
>
> -/*
> - * We only implement 48 bits of PA at the moment, although the ITS
> - * supports more. Let's be restrictive here.
> - */
> -#define BASER_ADDRESS(x) ((x) & GENMASK_ULL(47, 16))
> -#define CBASER_ADDRESS(x) ((x) & GENMASK_ULL(47, 12))
> -
> #define GIC_LPI_OFFSET 8192
>
> #define VITS_TYPER_IDBITS 16
> @@ -752,6 +745,7 @@ static bool vgic_its_check_id(struct vgic_its *its, u64 baser, u32 id,
> {
> int l1_tbl_size = GITS_BASER_NR_PAGES(baser) * SZ_64K;
> u64 indirect_ptr, type = GITS_BASER_TYPE(baser);
> + phys_addr_t base = GITS_BASER_ADDR_48_to_52(baser);
> int esz = GITS_BASER_ENTRY_SIZE(baser);
> int index;
> gfn_t gfn;
> @@ -776,7 +770,7 @@ static bool vgic_its_check_id(struct vgic_its *its, u64 baser, u32 id,
> if (id >= (l1_tbl_size / esz))
> return false;
>
> - addr = BASER_ADDRESS(baser) + id * esz;
> + addr = base + id * esz;
> gfn = addr >> PAGE_SHIFT;
>
> if (eaddr)
> @@ -791,7 +785,7 @@ static bool vgic_its_check_id(struct vgic_its *its, u64 baser, u32 id,
>
> /* Each 1st level entry is represented by a 64-bit value. */
> if (kvm_read_guest_lock(its->dev->kvm,
> - BASER_ADDRESS(baser) + index * sizeof(indirect_ptr),
> + base + index * sizeof(indirect_ptr),
> &indirect_ptr, sizeof(indirect_ptr)))
> return false;
>
> @@ -801,11 +795,7 @@ static bool vgic_its_check_id(struct vgic_its *its, u64 baser, u32 id,
> if (!(indirect_ptr & BIT_ULL(63)))
> return false;
>
> - /*
> - * Mask the guest physical address and calculate the frame number.
> - * Any address beyond our supported 48 bits of PA will be caught
> - * by the actual check in the final step.
> - */
> + /* Mask the guest physical address and calculate the frame number. */
> indirect_ptr &= GENMASK_ULL(51, 16);
>
> /* Find the address of the actual entry */
> @@ -1297,9 +1287,6 @@ static u64 vgic_sanitise_its_baser(u64 reg)
> GITS_BASER_OUTER_CACHEABILITY_SHIFT,
> vgic_sanitise_outer_cacheability);
>
> - /* Bits 15:12 contain bits 51:48 of the PA, which we don't support. */
> - reg &= ~GENMASK_ULL(15, 12);
> -
> /* We support only one (ITS) page size: 64K */
> reg = (reg & ~GITS_BASER_PAGE_SIZE_MASK) | GITS_BASER_PAGE_SIZE_64K;
>
> @@ -1318,11 +1305,8 @@ static u64 vgic_sanitise_its_cbaser(u64 reg)
> GITS_CBASER_OUTER_CACHEABILITY_SHIFT,
> vgic_sanitise_outer_cacheability);
>
> - /*
> - * Sanitise the physical address to be 64k aligned.
> - * Also limit the physical addresses to 48 bits.
> - */
> - reg &= ~(GENMASK_ULL(51, 48) | GENMASK_ULL(15, 12));
> + /* Sanitise the physical address to be 64k aligned. */
> + reg &= ~GENMASK_ULL(15, 12);
>
> return reg;
> }
> @@ -1368,7 +1352,7 @@ static void vgic_its_process_commands(struct kvm *kvm, struct vgic_its *its)
> if (!its->enabled)
> return;
>
> - cbaser = CBASER_ADDRESS(its->cbaser);
> + cbaser = GITS_CBASER_ADDRESS(its->cbaser);
>
> while (its->cwriter != its->creadr) {
> int ret = kvm_read_guest_lock(kvm, cbaser + its->creadr,
> @@ -2226,7 +2210,7 @@ static int vgic_its_restore_device_tables(struct vgic_its *its)
> if (!(baser & GITS_BASER_VALID))
> return 0;
>
> - l1_gpa = BASER_ADDRESS(baser);
> + l1_gpa = GITS_BASER_ADDR_48_to_52(baser);
>
> if (baser & GITS_BASER_INDIRECT) {
> l1_esz = GITS_LVL1_ENTRY_SIZE;
> @@ -2298,7 +2282,7 @@ static int vgic_its_save_collection_table(struct vgic_its *its)
> {
> const struct vgic_its_abi *abi = vgic_its_get_abi(its);
> u64 baser = its->baser_coll_table;
> - gpa_t gpa = BASER_ADDRESS(baser);
> + gpa_t gpa = GITS_BASER_ADDR_48_to_52(baser);
> struct its_collection *collection;
> u64 val;
> size_t max_size, filled = 0;
> @@ -2347,7 +2331,7 @@ static int vgic_its_restore_collection_table(struct vgic_its *its)
> if (!(baser & GITS_BASER_VALID))
> return 0;
>
> - gpa = BASER_ADDRESS(baser);
> + gpa = GITS_BASER_ADDR_48_to_52(baser);
>
> max_size = GITS_BASER_NR_PAGES(baser) * SZ_64K;
>
> diff --git a/virt/kvm/arm/vgic/vgic-mmio-v3.c b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> index 2877840..64647be 100644
> --- a/virt/kvm/arm/vgic/vgic-mmio-v3.c
> +++ b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> @@ -338,7 +338,6 @@ static u64 vgic_sanitise_pendbaser(u64 reg)
> vgic_sanitise_outer_cacheability);
>
> reg &= ~PENDBASER_RES0_MASK;
> - reg &= ~GENMASK_ULL(51, 48);
>
> return reg;
> }
> @@ -356,7 +355,6 @@ static u64 vgic_sanitise_propbaser(u64 reg)
> vgic_sanitise_outer_cacheability);
>
> reg &= ~PROPBASER_RES0_MASK;
> - reg &= ~GENMASK_ULL(51, 48);
> return reg;
> }
>
>
Besides it looks good to me.

Reviewed-by: Eric Auger <[email protected]>

Thanks

Eric



2018-07-04 08:25:50

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v3 10/20] kvm: arm64: Dynamic configuration of VTTBR mask

Hi Suzuki,

On 07/03/2018 01:54 PM, Suzuki K Poulose wrote:
> Hi Eric,
>
> On 02/07/18 15:41, Auger Eric wrote:
>> Hi Suzuki,
>>
>> On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
>>> On arm64 VTTBR_EL2:BADDR holds the base address for the stage2
>>> translation table. The Arm ARM mandates that the bits BADDR[x-1:0]
>>> should be 0, where 'x' is defined for a given IPA Size and the
>>> number of levels for a translation granule size. It is defined
>>> using some magical constants. This patch is a reverse engineered
>>> implementation to calculate the 'x' at runtime for a given ipa and
>>> number of page table levels. See patch for more details.
>>>
>>> Cc: Marc Zyngier <[email protected]>
>>> Cc: Christoffer Dall <[email protected]>
>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>> ---
>>> Changes since V2:
>>> - Part 1 of spilt from VTCR & VTTBR dynamic configuration
>>> ---
>>> arch/arm64/include/asm/kvm_arm.h | 60
>>> +++++++++++++++++++++++++++++++++++++---
>>> arch/arm64/include/asm/kvm_mmu.h | 25 ++++++++++++++++-
>>> 2 files changed, 80 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_arm.h
>>> b/arch/arm64/include/asm/kvm_arm.h
>>> index 3dffd38..c557f45 100644
>>> --- a/arch/arm64/include/asm/kvm_arm.h
>>> +++ b/arch/arm64/include/asm/kvm_arm.h
>>> @@ -140,8 +140,6 @@
>>> * Note that when using 4K pages, we concatenate two first level
>>> page tables
>>> * together. With 16K pages, we concatenate 16 first level page
>>> tables.
>>> *
>>> - * The magic numbers used for VTTBR_X in this patch can be found in
>>> Tables
>>> - * D4-23 and D4-25 in ARM DDI 0487A.b.
>> Isn't it a pretty old reference? Could you refer to C.a?
>
> Sure, I will update the references everywhere.
>
>>> + *
>>> + * The algorithm defines the expectations on the BaseAddress (for
>>> the page
>>> + * table) bits resolved at each level based on the page size, entry
>>> level
>>> + * and T0SZ. The variable "x" in the algorithm also affects the
>>> VTTBR:BADDR
>>> + * for stage2 page table.
>>> + *
>>> + * The value of "x" is calculated as :
>>> + * x = Magic_N - T0SZ
>>> + *
>>> + * where Magic_N is an integer depending on the page size and the entry
>>> + * level of the page table as below:
>>> + *
>>> + * --------------------------------------------
>>> + * | Entry level | 4K 16K 64K |
>>> + * --------------------------------------------
>>> + * | Level: 0 (4 levels) | 28 | - | - |
>>> + * --------------------------------------------
>>> + * | Level: 1 (3 levels) | 37 | 31 | 25 |
>>> + * --------------------------------------------
>>> + * | Level: 2 (2 levels) | 46 | 42 | 38 |
>>> + * --------------------------------------------
>>> + * | Level: 3 (1 level) | - | 53 | 51 |
>>> + * --------------------------------------------
>> I understand entry level = Lookup level in the table.
>
> Entry level => The level at which we start the page table walk for
> a given address (This is in line with the ARM ARM). So,
>
> Entry_level = (4 - Number_of_Page_table_levels)
>
>> But you may want to compute x for BaseAddress matching lookup level 2
>> with number of levels = 4.
>
> No, the BaseAddress is only calculated for the "Entry_level". So the
> above case doesn't exist at all.
>
>> So shouldn't you s/Number of levels/4 - entry_level?
>
> Ok, I now understand what you are referring to [0]
>> for BADDR we want the BaseAddr of the initial lookup level so
>> effectively the entry level we are interested in is 4 - number of levels
>> and we don't care about the d) condition. At least this is my understanding ;-)
>> If correct you may slightly reword the explanation?
>
>
>>> + *
>>> + * We have a magic formula for the Magic_N below.
>>> + *
>>> + * Magic_N(PAGE_SIZE, Entry_Level) = 64 - ((PAGE_SHIFT - 3) *
>>> Number of levels)
>
> [0] ^^^
>
>
>
>>> + *
>>> + * where number of levels = (4 - Entry_Level).
>
> ^^^ Doesn't this help make it clear? Using the expansion makes it a bit
> more
> unreadable below.

I just wanted to mention the tables you refer (D4-23 and D4-25) give
Magic_N for a larger scope as they deal with any lookup level while we
only care about the entry level for BADDR. So I was a little bit
confused when reading the explanation but that's not a big deal.

>
>>> +/*
>>> + * Get the magic number 'x' for VTTBR:BADDR of this KVM instance.
>>> + * With v8.2 LVA extensions, 'x' should be a minimum of 6 with
>>> + * 52bit IPS.
>> Link to the spec?
>
> Sure, will add it.
>
> Thanks for the patience to review :-)
you're welcome ;-)

Eric
>
> Cheers
> Suzuki

2018-07-04 08:30:24

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH v3 10/20] kvm: arm64: Dynamic configuration of VTTBR mask

On 07/04/2018 09:24 AM, Auger Eric wrote:
>>>> + *
>>>> + * We have a magic formula for the Magic_N below.
>>>> + *
>>>> + * Magic_N(PAGE_SIZE, Entry_Level) = 64 - ((PAGE_SHIFT - 3) *
>>>> Number of levels)
>>
>> [0] ^^^
>>
>>
>>
>>>> + *
>>>> + * where number of levels = (4 - Entry_Level).
>>
>> ^^^ Doesn't this help make it clear? Using the expansion makes it a bit
>> more
>> unreadable below.
>
> I just wanted to mention the tables you refer (D4-23 and D4-25) give
> Magic_N for a larger scope as they deal with any lookup level while we
> only care about the entry level for BADDR. So I was a little bit
> confused when reading the explanation but that's not a big deal.

Ah, ok. I will try to clarify it.

Cheers
Suzuki

2018-07-04 14:11:28

by Will Deacon

[permalink] [raw]
Subject: Re: [kvmtool test PATCH 22/24] kvmtool: arm64: Add support for guest physical address size

On Fri, Jun 29, 2018 at 12:15:42PM +0100, Suzuki K Poulose wrote:
> Add an option to specify the physical address size used by this
> VM.
>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
> arm/include/arm-common/kvm-config-arch.h | 1 +
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
> index 04be43d..dabd22c 100644
> --- a/arm/aarch64/include/kvm/kvm-config-arch.h
> +++ b/arm/aarch64/include/kvm/kvm-config-arch.h
> @@ -8,7 +8,10 @@
> "Create PMUv3 device"), \
> OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed, \
> "Specify random seed for Kernel Address Space " \
> - "Layout Randomization (KASLR)"),
> + "Layout Randomization (KASLR)"), \
> + OPT_INTEGER('\0', "phys-shift", &(cfg)->phys_shift, \
> + "Specify maximum physical address size (not " \
> + "the amount of memory)"),

Given that this is a shift value, I think the help message could be more
informative. Something like:

"Specify maximum number of bits in a guest physical address"

I think I'd actually leave out any mention of memory, because this does
actually have an effect on the amount of addressable memory in a way that I
don't think we want to describe in half of a usage message line :)

Will

2018-07-04 14:24:42

by Will Deacon

[permalink] [raw]
Subject: Re: [kvmtool test PATCH 24/24] kvmtool: arm: Add support for creating VM with PA size

On Fri, Jun 29, 2018 at 12:15:44PM +0100, Suzuki K Poulose wrote:
> diff --git a/arm/kvm.c b/arm/kvm.c
> index 5701d41..b1969be 100644
> --- a/arm/kvm.c
> +++ b/arm/kvm.c
> @@ -11,6 +11,8 @@
> #include <linux/kvm.h>
> #include <linux/sizes.h>
>
> +unsigned long kvm_arm_type;
> +
> struct kvm_ext kvm_req_ext[] = {
> { DEFINE_KVM_EXT(KVM_CAP_IRQCHIP) },
> { DEFINE_KVM_EXT(KVM_CAP_ONE_REG) },
> @@ -18,6 +20,26 @@ struct kvm_ext kvm_req_ext[] = {
> { 0, 0 },
> };
>
> +#ifndef KVM_ARM_GET_MAX_VM_PHYS_SHIFT
> +#define KVM_ARM_GET_MAX_VM_PHYS_SHIFT _IO(KVMIO, 0x0b)
> +#endif
> +
> +void kvm__arch_init_hyp(struct kvm *kvm)
> +{
> + int max_ipa;
> +
> + max_ipa = ioctl(kvm->sys_fd, KVM_ARM_GET_MAX_VM_PHYS_SHIFT);
> + if (max_ipa < 0)
> + max_ipa = 40;
> + if (!kvm->cfg.arch.phys_shift)
> + kvm->cfg.arch.phys_shift = 40;
> + if (kvm->cfg.arch.phys_shift > max_ipa)
> + die("Requested PA size (%u) is not supported by the host (%ubits)\n",
> + kvm->cfg.arch.phys_shift, max_ipa);
> + if (kvm->cfg.arch.phys_shift != 40)
> + kvm_arm_type = kvm->cfg.arch.phys_shift;
> +}

Seems a bit weird that the "machine type identifier" to KVM_CREATE_VM is
dedicated entirely to holding the physical address shift verbatim. Is this
really the ABI?

Also, couldn't KVM figure it out automatically if you add memslots at high
addresses, making this a niche tunable outside of testing?

Will

2018-07-04 14:44:24

by Marc Zyngier

[permalink] [raw]
Subject: Re: [kvmtool test PATCH 24/24] kvmtool: arm: Add support for creating VM with PA size

On Wed, 04 Jul 2018 15:22:42 +0100,
Will Deacon <[email protected]> wrote:
>
> On Fri, Jun 29, 2018 at 12:15:44PM +0100, Suzuki K Poulose wrote:
> > diff --git a/arm/kvm.c b/arm/kvm.c
> > index 5701d41..b1969be 100644
> > --- a/arm/kvm.c
> > +++ b/arm/kvm.c
> > @@ -11,6 +11,8 @@
> > #include <linux/kvm.h>
> > #include <linux/sizes.h>
> >
> > +unsigned long kvm_arm_type;
> > +
> > struct kvm_ext kvm_req_ext[] = {
> > { DEFINE_KVM_EXT(KVM_CAP_IRQCHIP) },
> > { DEFINE_KVM_EXT(KVM_CAP_ONE_REG) },
> > @@ -18,6 +20,26 @@ struct kvm_ext kvm_req_ext[] = {
> > { 0, 0 },
> > };
> >
> > +#ifndef KVM_ARM_GET_MAX_VM_PHYS_SHIFT
> > +#define KVM_ARM_GET_MAX_VM_PHYS_SHIFT _IO(KVMIO, 0x0b)
> > +#endif
> > +
> > +void kvm__arch_init_hyp(struct kvm *kvm)
> > +{
> > + int max_ipa;
> > +
> > + max_ipa = ioctl(kvm->sys_fd, KVM_ARM_GET_MAX_VM_PHYS_SHIFT);
> > + if (max_ipa < 0)
> > + max_ipa = 40;
> > + if (!kvm->cfg.arch.phys_shift)
> > + kvm->cfg.arch.phys_shift = 40;
> > + if (kvm->cfg.arch.phys_shift > max_ipa)
> > + die("Requested PA size (%u) is not supported by the host (%ubits)\n",
> > + kvm->cfg.arch.phys_shift, max_ipa);
> > + if (kvm->cfg.arch.phys_shift != 40)
> > + kvm_arm_type = kvm->cfg.arch.phys_shift;
> > +}
>
> Seems a bit weird that the "machine type identifier" to KVM_CREATE_VM is
> dedicated entirely to holding the physical address shift verbatim. Is this
> really the ABI?
>
> Also, couldn't KVM figure it out automatically if you add memslots at high
> addresses, making this a niche tunable outside of testing?

Not really. Let's say I want my IPA space split in two: memory covers
the low 47 bit, and I want MMIO spanning the top 47 bit. With your
scheme, you'd end-up with a 47bit IPA space, while you really want 48
bits (MMIO space implemented by userspace isn't registered to the
kernel).

M.

--
Jazz is not dead, it just smells funny.

2018-07-04 15:01:11

by Julien Grall

[permalink] [raw]
Subject: Re: [kvmtool test PATCH 22/24] kvmtool: arm64: Add support for guest physical address size

Hi,

On 04/07/18 15:09, Will Deacon wrote:
> On Fri, Jun 29, 2018 at 12:15:42PM +0100, Suzuki K Poulose wrote:
>> Add an option to specify the physical address size used by this
>> VM.
>>
>> Signed-off-by: Suzuki K Poulose <[email protected]>
>> ---
>> arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
>> arm/include/arm-common/kvm-config-arch.h | 1 +
>> 2 files changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
>> index 04be43d..dabd22c 100644
>> --- a/arm/aarch64/include/kvm/kvm-config-arch.h
>> +++ b/arm/aarch64/include/kvm/kvm-config-arch.h
>> @@ -8,7 +8,10 @@
>> "Create PMUv3 device"), \
>> OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed, \
>> "Specify random seed for Kernel Address Space " \
>> - "Layout Randomization (KASLR)"),
>> + "Layout Randomization (KASLR)"), \
>> + OPT_INTEGER('\0', "phys-shift", &(cfg)->phys_shift, \
>> + "Specify maximum physical address size (not " \
>> + "the amount of memory)"),
>
> Given that this is a shift value, I think the help message could be more
> informative. Something like:
>
> "Specify maximum number of bits in a guest physical address"
>
> I think I'd actually leave out any mention of memory, because this does
> actually have an effect on the amount of addressable memory in a way that I
> don't think we want to describe in half of a usage message line :)
Is there any particular reasons to expose this option to the user?

I have recently sent a series to allow the user to specify the position
of the RAM [1]. With that series in mind, I think the user would not
really need to specify the maximum physical shift. Instead we could
automatically find it.

Cheers,

[1]
http://archive.armlinux.org.uk/lurker/message/20180510.140428.1c295b5b.en.html

>
> Will
>

--
Julien Grall

2018-07-04 15:51:19

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v3 15/20] kvm: arm/arm64: Allow tuning the physical address size for VM

Hi Suzuki,

On Fri, Jun 29, 2018 at 12:15:35PM +0100, Suzuki K Poulose wrote:
> Allow specifying the physical address size for a new VM via
> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
> us to finalise the stage2 page table format as early as possible
> and hence perform the right checks on the memory slots without
> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
> of the type field and can encode more information in the future if
> required. The IPA size is still capped at 40bits.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Cc: Peter Maydell <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Radim Krčmář <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> arch/arm/include/asm/kvm_mmu.h | 2 ++
> arch/arm64/include/asm/kvm_arm.h | 10 +++-------
> arch/arm64/include/asm/kvm_mmu.h | 2 ++
> include/uapi/linux/kvm.h | 10 ++++++++++
> virt/kvm/arm/arm.c | 24 ++++++++++++++++++++++--
> 5 files changed, 39 insertions(+), 9 deletions(-)

[...]

> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 4df9bb6..fa4cab0 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -751,6 +751,16 @@ struct kvm_ppc_resize_hpt {
> #define KVM_S390_SIE_PAGE_OFFSET 1
>
> /*
> + * On arm/arm64, machine type can be used to request the physical
> + * address size for the VM. Bits [7-0] have been reserved for the
> + * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
> + * value 0 implies the default IPA size, which is 40bits.
> + */
> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK 0xff
> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x) \
> + ((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)

This seems like you're allocating quite a lot of bits in a non-extensible
interface to a fairly esoteric parameter. Would it be better to add another
ioctl, or condense the number of sizes you support instead?

Will

2018-07-04 15:52:12

by Will Deacon

[permalink] [raw]
Subject: Re: [kvmtool test PATCH 24/24] kvmtool: arm: Add support for creating VM with PA size

On Wed, Jul 04, 2018 at 03:41:18PM +0100, Marc Zyngier wrote:
> On Wed, 04 Jul 2018 15:22:42 +0100,
> Will Deacon <[email protected]> wrote:
> >
> > On Fri, Jun 29, 2018 at 12:15:44PM +0100, Suzuki K Poulose wrote:
> > > diff --git a/arm/kvm.c b/arm/kvm.c
> > > index 5701d41..b1969be 100644
> > > --- a/arm/kvm.c
> > > +++ b/arm/kvm.c
> > > @@ -11,6 +11,8 @@
> > > #include <linux/kvm.h>
> > > #include <linux/sizes.h>
> > >
> > > +unsigned long kvm_arm_type;
> > > +
> > > struct kvm_ext kvm_req_ext[] = {
> > > { DEFINE_KVM_EXT(KVM_CAP_IRQCHIP) },
> > > { DEFINE_KVM_EXT(KVM_CAP_ONE_REG) },
> > > @@ -18,6 +20,26 @@ struct kvm_ext kvm_req_ext[] = {
> > > { 0, 0 },
> > > };
> > >
> > > +#ifndef KVM_ARM_GET_MAX_VM_PHYS_SHIFT
> > > +#define KVM_ARM_GET_MAX_VM_PHYS_SHIFT _IO(KVMIO, 0x0b)
> > > +#endif
> > > +
> > > +void kvm__arch_init_hyp(struct kvm *kvm)
> > > +{
> > > + int max_ipa;
> > > +
> > > + max_ipa = ioctl(kvm->sys_fd, KVM_ARM_GET_MAX_VM_PHYS_SHIFT);
> > > + if (max_ipa < 0)
> > > + max_ipa = 40;
> > > + if (!kvm->cfg.arch.phys_shift)
> > > + kvm->cfg.arch.phys_shift = 40;
> > > + if (kvm->cfg.arch.phys_shift > max_ipa)
> > > + die("Requested PA size (%u) is not supported by the host (%ubits)\n",
> > > + kvm->cfg.arch.phys_shift, max_ipa);
> > > + if (kvm->cfg.arch.phys_shift != 40)
> > > + kvm_arm_type = kvm->cfg.arch.phys_shift;
> > > +}
> >
> > Seems a bit weird that the "machine type identifier" to KVM_CREATE_VM is
> > dedicated entirely to holding the physical address shift verbatim. Is this
> > really the ABI?
> >
> > Also, couldn't KVM figure it out automatically if you add memslots at high
> > addresses, making this a niche tunable outside of testing?
>
> Not really. Let's say I want my IPA space split in two: memory covers
> the low 47 bit, and I want MMIO spanning the top 47 bit. With your
> scheme, you'd end-up with a 47bit IPA space, while you really want 48
> bits (MMIO space implemented by userspace isn't registered to the
> kernel).

That still sounds quite niche for a VM. Does QEMU do that? In any case,
having KVM automatically increase the IPA bits to cover the memslots it
knows about would make sense to me, and also be sufficient for kvmtool
without us having to add an extra command-line argument.

The MMIO case might be better dealt with by having a way to register MMIO
regions rather than having the PA bits exposed directly.

Will

2018-07-04 15:53:55

by Will Deacon

[permalink] [raw]
Subject: Re: [kvmtool test PATCH 22/24] kvmtool: arm64: Add support for guest physical address size

On Wed, Jul 04, 2018 at 04:00:11PM +0100, Julien Grall wrote:
> On 04/07/18 15:09, Will Deacon wrote:
> >On Fri, Jun 29, 2018 at 12:15:42PM +0100, Suzuki K Poulose wrote:
> >>Add an option to specify the physical address size used by this
> >>VM.
> >>
> >>Signed-off-by: Suzuki K Poulose <[email protected]>
> >>---
> >> arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
> >> arm/include/arm-common/kvm-config-arch.h | 1 +
> >> 2 files changed, 5 insertions(+), 1 deletion(-)
> >>
> >>diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
> >>index 04be43d..dabd22c 100644
> >>--- a/arm/aarch64/include/kvm/kvm-config-arch.h
> >>+++ b/arm/aarch64/include/kvm/kvm-config-arch.h
> >>@@ -8,7 +8,10 @@
> >> "Create PMUv3 device"), \
> >> OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed, \
> >> "Specify random seed for Kernel Address Space " \
> >>- "Layout Randomization (KASLR)"),
> >>+ "Layout Randomization (KASLR)"), \
> >>+ OPT_INTEGER('\0', "phys-shift", &(cfg)->phys_shift, \
> >>+ "Specify maximum physical address size (not " \
> >>+ "the amount of memory)"),
> >
> >Given that this is a shift value, I think the help message could be more
> >informative. Something like:
> >
> > "Specify maximum number of bits in a guest physical address"
> >
> >I think I'd actually leave out any mention of memory, because this does
> >actually have an effect on the amount of addressable memory in a way that I
> >don't think we want to describe in half of a usage message line :)
> Is there any particular reasons to expose this option to the user?
>
> I have recently sent a series to allow the user to specify the position
> of the RAM [1]. With that series in mind, I think the user would not really
> need to specify the maximum physical shift. Instead we could automatically
> find it.

Marc makes a good point that it doesn't help for MMIO regions, so I'm trying
to understand whether we can do something differently there and avoid
sacrificing the type parameter.

Will

2018-07-04 15:59:03

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [kvmtool test PATCH 24/24] kvmtool: arm: Add support for creating VM with PA size

Hi Will,

On 07/04/2018 03:22 PM, Will Deacon wrote:
> On Fri, Jun 29, 2018 at 12:15:44PM +0100, Suzuki K Poulose wrote:
>> diff --git a/arm/kvm.c b/arm/kvm.c
>> index 5701d41..b1969be 100644
>> --- a/arm/kvm.c
>> +++ b/arm/kvm.c
>> @@ -11,6 +11,8 @@
>> #include <linux/kvm.h>
>> #include <linux/sizes.h>
>>
>> +unsigned long kvm_arm_type;
>> +
>> struct kvm_ext kvm_req_ext[] = {
>> { DEFINE_KVM_EXT(KVM_CAP_IRQCHIP) },
>> { DEFINE_KVM_EXT(KVM_CAP_ONE_REG) },
>> @@ -18,6 +20,26 @@ struct kvm_ext kvm_req_ext[] = {
>> { 0, 0 },
>> };
>>
>> +#ifndef KVM_ARM_GET_MAX_VM_PHYS_SHIFT
>> +#define KVM_ARM_GET_MAX_VM_PHYS_SHIFT _IO(KVMIO, 0x0b)
>> +#endif
>> +
>> +void kvm__arch_init_hyp(struct kvm *kvm)
>> +{
>> + int max_ipa;
>> +
>> + max_ipa = ioctl(kvm->sys_fd, KVM_ARM_GET_MAX_VM_PHYS_SHIFT);
>> + if (max_ipa < 0)
>> + max_ipa = 40;
>> + if (!kvm->cfg.arch.phys_shift)
>> + kvm->cfg.arch.phys_shift = 40;
>> + if (kvm->cfg.arch.phys_shift > max_ipa)
>> + die("Requested PA size (%u) is not supported by the host (%ubits)\n",
>> + kvm->cfg.arch.phys_shift, max_ipa);
>> + if (kvm->cfg.arch.phys_shift != 40)
>> + kvm_arm_type = kvm->cfg.arch.phys_shift;
>> +}
>
> Seems a bit weird that the "machine type identifier" to KVM_CREATE_VM is
> dedicated entirely to holding the physical address shift verbatim. Is this
> really the ABI?

Bits [7:0] of the machine type have been reserved for the IPA shift.
This version is missing the updates to the ABI documentation; I have
them for the next version.

>
> Also, couldn't KVM figure it out automatically if you add memslots at high
> addresses, making this a niche tunable outside of testing?

The stage2 pgd size is really dependent on the max IPA. Also, unlike
stage1 (where the maximum size will be 1 page), the size can go up to 16
pages (with a different number of levels due to concatenation), so we need
to finalise this at least before the first memory gets mapped (RAM or
Device). That implies we cannot wait until all the memory slots are
created.
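
As a rough sketch of that dependency (illustrative only, not the patch
code), the stage2 PGD size for a given configuration would be:

  static unsigned long stage2_pgd_size(int ipa_shift, int page_shift,
                                       int levels)
  {
          /* bits resolved below the entry level */
          int pgd_shift = page_shift + (page_shift - 3) * (levels - 1);

          return (1UL << (ipa_shift - pgd_shift)) * sizeof(u64);
  }

For example, 4K pages with a 40bit IPA and 3 levels give 8K (2
concatenated tables), while a 43bit IPA gives 64K (16 concatenated
tables).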

The first version of the series added a separate ioctl for specifying
the limit, which had its own complexities. So, this ABI was suggested
to keep things simpler.
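
For illustration, from the VMM side the encoding would look something
like this (a sketch, assuming the v3 ABI where type bits[7:0] carry
Log2(IPA) and 0 keeps the default 40bits):

  int sys_fd = open("/dev/kvm", O_RDWR);
  /* request a 48bit IPA space; passing 0 keeps the 40bit default */
  int vm_fd = ioctl(sys_fd, KVM_CREATE_VM,
                    KVM_VM_TYPE_ARM_PHYS_SHIFT(48));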


Suzuki

2018-07-04 22:03:46

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH v3 15/20] kvm: arm/arm64: Allow tuning the physical address size for VM

On 07/04/2018 04:51 PM, Will Deacon wrote:
> Hi Suzuki,
>
> On Fri, Jun 29, 2018 at 12:15:35PM +0100, Suzuki K Poulose wrote:
>> Allow specifying the physical address size for a new VM via
>> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
>> us to finalise the stage2 page table format as early as possible
>> and hence perform the right checks on the memory slots without
>> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
>> of the type field and can encode more information in the future if
>> required. The IPA size is still capped at 40bits.
>>
>> Cc: Marc Zyngier <[email protected]>
>> Cc: Christoffer Dall <[email protected]>
>> Cc: Peter Maydell <[email protected]>
>> Cc: Paolo Bonzini <[email protected]>
>> Cc: Radim Krčmář <[email protected]>
>> Signed-off-by: Suzuki K Poulose <[email protected]>
>> ---
>> arch/arm/include/asm/kvm_mmu.h | 2 ++
>> arch/arm64/include/asm/kvm_arm.h | 10 +++-------
>> arch/arm64/include/asm/kvm_mmu.h | 2 ++
>> include/uapi/linux/kvm.h | 10 ++++++++++
>> virt/kvm/arm/arm.c | 24 ++++++++++++++++++++++--
>> 5 files changed, 39 insertions(+), 9 deletions(-)
>
> [...]
>
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 4df9bb6..fa4cab0 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -751,6 +751,16 @@ struct kvm_ppc_resize_hpt {
>> #define KVM_S390_SIE_PAGE_OFFSET 1
>>
>> /*
>> + * On arm/arm64, machine type can be used to request the physical
>> + * address size for the VM. Bits [7-0] have been reserved for the
>> + * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
>> + * value 0 implies the default IPA size, which is 40bits.
>> + */
>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK 0xff
>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x) \
>> + ((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)
>
> This seems like you're allocating quite a lot of bits in a non-extensible
> interface to a fairly esoteric parameter. Would it be better to add another
> ioctl, or condense the number of sizes you support instead?

As I explained in the other thread, we need the size as soon as the VM
is created. The major challenge is keeping backward compatibility by
mapping 0 to 40bits. I will give it a thought.

Suzuki

2018-07-05 07:52:35

by Peter Maydell

[permalink] [raw]
Subject: Re: [kvmtool test PATCH 24/24] kvmtool: arm: Add support for creating VM with PA size

On 4 July 2018 at 16:51, Will Deacon <[email protected]> wrote:
> On Wed, Jul 04, 2018 at 03:41:18PM +0100, Marc Zyngier wrote:
>> Not really. Let's say I want my IPA space split in two: memory covers
>> the low 47 bit, and I want MMIO spanning the top 47 bit. With your
>> scheme, you'd end-up with a 47bit IPA space, while you really want 48
>> bits (MMIO space implemented by userspace isn't registered to the
>> kernel).
>
> That still sounds quite niche for a VM. Does QEMU do that?

Not at 47 bits, but we have RAM up to the 256GB mark, and
MMIO above that (including a large PCI window), so the general
arrangement of having the top end of the IPA space not
necessarily be things we've told the kernel about definitely
exists.

thanks
-- PMM

2018-07-05 08:00:37

by Eric Auger

[permalink] [raw]
Subject: Re: [kvmtool test PATCH 24/24] kvmtool: arm: Add support for creating VM with PA size

Hi,

On 07/05/2018 09:51 AM, Peter Maydell wrote:
> On 4 July 2018 at 16:51, Will Deacon <[email protected]> wrote:
>> On Wed, Jul 04, 2018 at 03:41:18PM +0100, Marc Zyngier wrote:
>>> Not really. Let's say I want my IPA space split in two: memory covers
>>> the low 47 bit, and I want MMIO spanning the top 47 bit. With your
>>> scheme, you'd end-up with a 47bit IPA space, while you really want 48
>>> bits (MMIO space implemented by userspace isn't registered to the
>>> kernel).
>>
>> That still sounds quite niche for a VM. Does QEMU do that?
>
> Not at 47 bits, but we have RAM up to the 256GB mark, and
> MMIO above that (including a large PCI window), so the general
> arrangement of having the top end of the IPA space not
> necessarily be things we've told the kernel about definitely
> exists.

Is this document (2012) still a reference document?
http://infocenter.arm.com/help/topic/com.arm.doc.den0001c/DEN0001C_principles_of_arm_memory_maps.pdf
(especially Fig 5?)

Peter, comments in QEMU hw/arm/virt.c suggest the next RAM chunk should be
added at 2TB. This doc suggests putting it at 8TB. I understand the PA
memory map is only a suggestion, but shouldn't we align?

Thanks

Eric


>
> thanks
> -- PMM

2018-07-05 12:49:15

by Julien Grall

[permalink] [raw]
Subject: Re: [kvmtool test PATCH 22/24] kvmtool: arm64: Add support for guest physical address size

Hi Will,

On 04/07/18 16:52, Will Deacon wrote:
> On Wed, Jul 04, 2018 at 04:00:11PM +0100, Julien Grall wrote:
>> On 04/07/18 15:09, Will Deacon wrote:
>>> On Fri, Jun 29, 2018 at 12:15:42PM +0100, Suzuki K Poulose wrote:
>>>> Add an option to specify the physical address size used by this
>>>> VM.
>>>>
>>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>>> ---
>>>> arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
>>>> arm/include/arm-common/kvm-config-arch.h | 1 +
>>>> 2 files changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>> index 04be43d..dabd22c 100644
>>>> --- a/arm/aarch64/include/kvm/kvm-config-arch.h
>>>> +++ b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>> @@ -8,7 +8,10 @@
>>>> "Create PMUv3 device"), \
>>>> OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed, \
>>>> "Specify random seed for Kernel Address Space " \
>>>> - "Layout Randomization (KASLR)"),
>>>> + "Layout Randomization (KASLR)"), \
>>>> + OPT_INTEGER('\0', "phys-shift", &(cfg)->phys_shift, \
>>>> + "Specify maximum physical address size (not " \
>>>> + "the amount of memory)"),
>>>
>>> Given that this is a shift value, I think the help message could be more
>>> informative. Something like:
>>>
>>> "Specify maximum number of bits in a guest physical address"
>>>
>>> I think I'd actually leave out any mention of memory, because this does
>>> actually have an effect on the amount of addressable memory in a way that I
>>> don't think we want to describe in half of a usage message line :)
>> Is there any particular reasons to expose this option to the user?
>>
>> I have recently sent a series to allow the user to specify the position
>> of the RAM [1]. With that series in mind, I think the user would not really
>> need to specify the maximum physical shift. Instead we could automatically
>> find it.
>
> Marc makes a good point that it doesn't help for MMIO regions, so I'm trying
> to understand whether we can do something differently there and avoid
> sacrificing the type parameter.

I am not sure I understand this. kvmtool knows the memory layout
(including MMIO) of the guest, so couldn't it guess the maximum
physical shift from that?

Cheers,

--
Julien Grall

2018-07-05 13:22:23

by Marc Zyngier

[permalink] [raw]
Subject: Re: [kvmtool test PATCH 22/24] kvmtool: arm64: Add support for guest physical address size

On 05/07/18 13:47, Julien Grall wrote:
> Hi Will,
>
> On 04/07/18 16:52, Will Deacon wrote:
>> On Wed, Jul 04, 2018 at 04:00:11PM +0100, Julien Grall wrote:
>>> On 04/07/18 15:09, Will Deacon wrote:
>>>> On Fri, Jun 29, 2018 at 12:15:42PM +0100, Suzuki K Poulose wrote:
>>>>> Add an option to specify the physical address size used by this
>>>>> VM.
>>>>>
>>>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>>>> ---
>>>>> arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
>>>>> arm/include/arm-common/kvm-config-arch.h | 1 +
>>>>> 2 files changed, 5 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>> index 04be43d..dabd22c 100644
>>>>> --- a/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>> +++ b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>> @@ -8,7 +8,10 @@
>>>>> "Create PMUv3 device"), \
>>>>> OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed, \
>>>>> "Specify random seed for Kernel Address Space " \
>>>>> - "Layout Randomization (KASLR)"),
>>>>> + "Layout Randomization (KASLR)"), \
>>>>> + OPT_INTEGER('\0', "phys-shift", &(cfg)->phys_shift, \
>>>>> + "Specify maximum physical address size (not " \
>>>>> + "the amount of memory)"),
>>>>
>>>> Given that this is a shift value, I think the help message could be more
>>>> informative. Something like:
>>>>
>>>> "Specify maximum number of bits in a guest physical address"
>>>>
>>>> I think I'd actually leave out any mention of memory, because this does
>>>> actually have an effect on the amount of addressable memory in a way that I
>>>> don't think we want to describe in half of a usage message line :)
>>> Is there any particular reasons to expose this option to the user?
>>>
>>> I have recently sent a series to allow the user to specify the position
>>> of the RAM [1]. With that series in mind, I think the user would not really
>>> need to specify the maximum physical shift. Instead we could automatically
>>> find it.
>>
>> Marc makes a good point that it doesn't help for MMIO regions, so I'm trying
>> to understand whether we can do something differently there and avoid
>> sacrificing the type parameter.
>
> I am not sure to understand this. kvmtools knows the memory layout
> (including MMIOs) of the guest, so couldn't it guess the maximum
> physical shift for that?

That's exactly what Will was trying to avoid, by having KVM compute
the size of the IPA space based on the registered memslots. We've now
established that it doesn't work, so what we need to define is:

- whether we need another ioctl(), or do we carry on piggy-backing on
the CPU type,
- assuming the latter, whether we can reduce the number of bits used in
the ioctl parameter by subtly encoding the IPA size.

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2018-07-05 13:47:52

by Eric Auger

[permalink] [raw]
Subject: Re: [kvmtool test PATCH 22/24] kvmtool: arm64: Add support for guest physical address size

Hi Marc,

On 07/05/2018 03:20 PM, Marc Zyngier wrote:
> On 05/07/18 13:47, Julien Grall wrote:
>> Hi Will,
>>
>> On 04/07/18 16:52, Will Deacon wrote:
>>> On Wed, Jul 04, 2018 at 04:00:11PM +0100, Julien Grall wrote:
>>>> On 04/07/18 15:09, Will Deacon wrote:
>>>>> On Fri, Jun 29, 2018 at 12:15:42PM +0100, Suzuki K Poulose wrote:
>>>>>> Add an option to specify the physical address size used by this
>>>>>> VM.
>>>>>>
>>>>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>>>>> ---
>>>>>> arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
>>>>>> arm/include/arm-common/kvm-config-arch.h | 1 +
>>>>>> 2 files changed, 5 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>> index 04be43d..dabd22c 100644
>>>>>> --- a/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>> +++ b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>> @@ -8,7 +8,10 @@
>>>>>> "Create PMUv3 device"), \
>>>>>> OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed, \
>>>>>> "Specify random seed for Kernel Address Space " \
>>>>>> - "Layout Randomization (KASLR)"),
>>>>>> + "Layout Randomization (KASLR)"), \
>>>>>> + OPT_INTEGER('\0', "phys-shift", &(cfg)->phys_shift, \
>>>>>> + "Specify maximum physical address size (not " \
>>>>>> + "the amount of memory)"),
>>>>>
>>>>> Given that this is a shift value, I think the help message could be more
>>>>> informative. Something like:
>>>>>
>>>>> "Specify maximum number of bits in a guest physical address"
>>>>>
>>>>> I think I'd actually leave out any mention of memory, because this does
>>>>> actually have an effect on the amount of addressable memory in a way that I
>>>>> don't think we want to describe in half of a usage message line :)
>>>> Is there any particular reasons to expose this option to the user?
>>>>
>>>> I have recently sent a series to allow the user to specify the position
>>>> of the RAM [1]. With that series in mind, I think the user would not really
>>>> need to specify the maximum physical shift. Instead we could automatically
>>>> find it.
>>>
>>> Marc makes a good point that it doesn't help for MMIO regions, so I'm trying
>>> to understand whether we can do something differently there and avoid
>>> sacrificing the type parameter.
>>
>> I am not sure to understand this. kvmtools knows the memory layout
>> (including MMIOs) of the guest, so couldn't it guess the maximum
>> physical shift for that?
>
> That's exactly what Will was trying to avoid, by having KVM to compute
> the size of the IPA space based on the registered memslots. We've now
> established that it doesn't work, so what we need to define is:
>
> - whether we need another ioctl(), or do we carry on piggy-backing on
> the CPU type,
kvm type I guess
> - assuming the latter, whether we can reduce the number of bits used in
> the ioctl parameter by subtly encoding the IPA size.
Getting benefit from your Freudian slip, how should guest CPU PARange
and maximum number of bits in a guest physical address relate?

My understanding is that they are not correlated, and our guest
PARange is fixed at the moment. But shouldn't they be?

On Intel there is
qemu-system-x86_64 -M pc,accel=kvm -cpu SandyBridge,phys-bits=36
or
qemu-system-x86_64 -M pc,accel=kvm -cpu SandyBridge,host-phys-bits=true

where phys-bits, as far as I understand, has a similar semantics as the
PARange.

Thanks

Eric
>
> Thanks,
>
> M.
>

2018-07-05 14:14:35

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [kvmtool test PATCH 22/24] kvmtool: arm64: Add support for guest physical address size

On 05/07/18 14:46, Auger Eric wrote:
> Hi Marc,
>
> On 07/05/2018 03:20 PM, Marc Zyngier wrote:
>> On 05/07/18 13:47, Julien Grall wrote:
>>> Hi Will,
>>>
>>> On 04/07/18 16:52, Will Deacon wrote:
>>>> On Wed, Jul 04, 2018 at 04:00:11PM +0100, Julien Grall wrote:
>>>>> On 04/07/18 15:09, Will Deacon wrote:
>>>>>> On Fri, Jun 29, 2018 at 12:15:42PM +0100, Suzuki K Poulose wrote:
>>>>>>> Add an option to specify the physical address size used by this
>>>>>>> VM.
>>>>>>>
>>>>>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>>>>>> ---
>>>>>>> arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
>>>>>>> arm/include/arm-common/kvm-config-arch.h | 1 +
>>>>>>> 2 files changed, 5 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>>> index 04be43d..dabd22c 100644
>>>>>>> --- a/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>>> +++ b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>>> @@ -8,7 +8,10 @@
>>>>>>> "Create PMUv3 device"), \
>>>>>>> OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed, \
>>>>>>> "Specify random seed for Kernel Address Space " \
>>>>>>> - "Layout Randomization (KASLR)"),
>>>>>>> + "Layout Randomization (KASLR)"), \
>>>>>>> + OPT_INTEGER('\0', "phys-shift", &(cfg)->phys_shift, \
>>>>>>> + "Specify maximum physical address size (not " \
>>>>>>> + "the amount of memory)"),
>>>>>>
>>>>>> Given that this is a shift value, I think the help message could be more
>>>>>> informative. Something like:
>>>>>>
>>>>>> "Specify maximum number of bits in a guest physical address"
>>>>>>
>>>>>> I think I'd actually leave out any mention of memory, because this does
>>>>>> actually have an effect on the amount of addressable memory in a way that I
>>>>>> don't think we want to describe in half of a usage message line :)
>>>>> Is there any particular reasons to expose this option to the user?
>>>>>
>>>>> I have recently sent a series to allow the user to specify the position
>>>>> of the RAM [1]. With that series in mind, I think the user would not really
>>>>> need to specify the maximum physical shift. Instead we could automatically
>>>>> find it.
>>>>
>>>> Marc makes a good point that it doesn't help for MMIO regions, so I'm trying
>>>> to understand whether we can do something differently there and avoid
>>>> sacrificing the type parameter.
>>>
>>> I am not sure to understand this. kvmtools knows the memory layout
>>> (including MMIOs) of the guest, so couldn't it guess the maximum
>>> physical shift for that?
>>
>> That's exactly what Will was trying to avoid, by having KVM to compute
>> the size of the IPA space based on the registered memslots. We've now
>> established that it doesn't work, so what we need to define is:
>>
>> - whether we need another ioctl(), or do we carry on piggy-backing on
>> the CPU type,
> kvm type I guess

machine type is more appropriate, going by the existing users.

>> - assuming the latter, whether we can reduce the number of bits used in
>> the ioctl parameter by subtly encoding the IPA size.
> Getting benefit from your Freudian slip, how should guest CPU PARange
> and maximum number of bits in a guest physical address relate?
>
> My understanding is they are not correlated at the moment and our guest
> PARange is fixed at the moment. But shouldn't they?
>
> On Intel there is
> qemu-system-x86_64 -M pc,accel=kvm -cpu SandyBridge,phys-bits=36
> or
> qemu-system-x86_64 -M pc,accel=kvm -cpu SandyBridge,host-phys-bits=true
>
> where phys-bits, as far as I understand, has a similar semantics as the
> PARange.


AFAICT, PARange tells you the maximum (Intermediate) Physical Address that
can be handled by the CPU, while your IPA limit tells you where the guest
RAM is placed. So they need not be the same. E.g., on Juno, the A57s have a
PARange of 42 if I am not wrong (but definitely > 40), while the A53s have
it at 40 and the system RAM is at 40bits.

So, if we were to use only the A57s on Juno, we could run a KVM instance with
a 42bit IPA or anything lower. So, PARange can be inferred as the maximum
limit of the CPU's capability, while the IPA is where the RAM is placed for a
given system.
One could keep them in sync for a VM by emulation, but then nobody
uses the PARange except KVM. The other problem with capping the VM's PARange
to the IPA is that it restricts the IPA size of a nested VM. So, I don't
think this is really beneficial.

Cheers
Suzuki


>
> Thanks
>
> Eric
>>
>> Thanks,
>>
>> M.
>>


2018-07-05 14:17:02

by Marc Zyngier

[permalink] [raw]
Subject: Re: [kvmtool test PATCH 22/24] kvmtool: arm64: Add support for guest physical address size

Hi Eric,

On 05/07/18 14:46, Auger Eric wrote:
> Hi Marc,
>
> On 07/05/2018 03:20 PM, Marc Zyngier wrote:
>> On 05/07/18 13:47, Julien Grall wrote:
>>> Hi Will,
>>>
>>> On 04/07/18 16:52, Will Deacon wrote:
>>>> On Wed, Jul 04, 2018 at 04:00:11PM +0100, Julien Grall wrote:
>>>>> On 04/07/18 15:09, Will Deacon wrote:
>>>>>> On Fri, Jun 29, 2018 at 12:15:42PM +0100, Suzuki K Poulose wrote:
>>>>>>> Add an option to specify the physical address size used by this
>>>>>>> VM.
>>>>>>>
>>>>>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>>>>>> ---
>>>>>>> arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
>>>>>>> arm/include/arm-common/kvm-config-arch.h | 1 +
>>>>>>> 2 files changed, 5 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>>> index 04be43d..dabd22c 100644
>>>>>>> --- a/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>>> +++ b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>>> @@ -8,7 +8,10 @@
>>>>>>> "Create PMUv3 device"), \
>>>>>>> OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed, \
>>>>>>> "Specify random seed for Kernel Address Space " \
>>>>>>> - "Layout Randomization (KASLR)"),
>>>>>>> + "Layout Randomization (KASLR)"), \
>>>>>>> + OPT_INTEGER('\0', "phys-shift", &(cfg)->phys_shift, \
>>>>>>> + "Specify maximum physical address size (not " \
>>>>>>> + "the amount of memory)"),
>>>>>>
>>>>>> Given that this is a shift value, I think the help message could be more
>>>>>> informative. Something like:
>>>>>>
>>>>>> "Specify maximum number of bits in a guest physical address"
>>>>>>
>>>>>> I think I'd actually leave out any mention of memory, because this does
>>>>>> actually have an effect on the amount of addressable memory in a way that I
>>>>>> don't think we want to describe in half of a usage message line :)
>>>>> Is there any particular reasons to expose this option to the user?
>>>>>
>>>>> I have recently sent a series to allow the user to specify the position
>>>>> of the RAM [1]. With that series in mind, I think the user would not really
>>>>> need to specify the maximum physical shift. Instead we could automatically
>>>>> find it.
>>>>
>>>> Marc makes a good point that it doesn't help for MMIO regions, so I'm trying
>>>> to understand whether we can do something differently there and avoid
>>>> sacrificing the type parameter.
>>>
>>> I am not sure to understand this. kvmtools knows the memory layout
>>> (including MMIOs) of the guest, so couldn't it guess the maximum
>>> physical shift for that?
>>
>> That's exactly what Will was trying to avoid, by having KVM to compute
>> the size of the IPA space based on the registered memslots. We've now
>> established that it doesn't work, so what we need to define is:
>>
>> - whether we need another ioctl(), or do we carry on piggy-backing on
>> the CPU type,
> kvm type I guess

I really meant target here. Whatever you pass as a "-cpu" on your QEMU
command line.

>> - assuming the latter, whether we can reduce the number of bits used in
>> the ioctl parameter by subtly encoding the IPA size.
> Getting benefit from your Freudian slip, how should guest CPU PARange
> and maximum number of bits in a guest physical address relate?

Freudian? I'm not on the sofa yet... ;-)

> My understanding is they are not correlated at the moment and our guest
> PARange is fixed at the moment. But shouldn't they?
>
> On Intel there is
> qemu-system-x86_64 -M pc,accel=kvm -cpu SandyBridge,phys-bits=36
> or
> qemu-system-x86_64 -M pc,accel=kvm -cpu SandyBridge,host-phys-bits=true
>
>> where phys-bits, as far as I understand, has a similar semantics as the
> PARange.

I think there is value in having it global, just like on x86. We don't
really support heterogeneous guests anyway.

Independently, we should also repaint/sanitise PARange so that the guest
observes the same thing, no matter what CPU it runs on (an A53/A57
system could be confusing in that respect).
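
Something like the below, as a sketch only (the helper name is made up;
PARange lives in ID_AA64MMFR0_EL1 bits [3:0]):

  static u64 sanitise_mmfr0_parange(u64 mmfr0, u64 parange)
  {
          /* expose a single, system-wide PARange to the guest */
          return (mmfr0 & ~GENMASK_ULL(3, 0)) | (parange & 0xf);
  }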

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2018-07-05 14:38:45

by Eric Auger

[permalink] [raw]
Subject: Re: [kvmtool test PATCH 22/24] kvmtool: arm64: Add support for guest physical address size

Hi Suzuki, Marc,

On 07/05/2018 04:15 PM, Marc Zyngier wrote:
> Hi Eric,
>
> On 05/07/18 14:46, Auger Eric wrote:
>> Hi Marc,
>>
>> On 07/05/2018 03:20 PM, Marc Zyngier wrote:
>>> On 05/07/18 13:47, Julien Grall wrote:
>>>> Hi Will,
>>>>
>>>> On 04/07/18 16:52, Will Deacon wrote:
>>>>> On Wed, Jul 04, 2018 at 04:00:11PM +0100, Julien Grall wrote:
>>>>>> On 04/07/18 15:09, Will Deacon wrote:
>>>>>>> On Fri, Jun 29, 2018 at 12:15:42PM +0100, Suzuki K Poulose wrote:
>>>>>>>> Add an option to specify the physical address size used by this
>>>>>>>> VM.
>>>>>>>>
>>>>>>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>>>>>>> ---
>>>>>>>> arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
>>>>>>>> arm/include/arm-common/kvm-config-arch.h | 1 +
>>>>>>>> 2 files changed, 5 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>>>> index 04be43d..dabd22c 100644
>>>>>>>> --- a/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>>>> +++ b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>>>> @@ -8,7 +8,10 @@
>>>>>>>> "Create PMUv3 device"), \
>>>>>>>> OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed, \
>>>>>>>> "Specify random seed for Kernel Address Space " \
>>>>>>>> - "Layout Randomization (KASLR)"),
>>>>>>>> + "Layout Randomization (KASLR)"), \
>>>>>>>> + OPT_INTEGER('\0', "phys-shift", &(cfg)->phys_shift, \
>>>>>>>> + "Specify maximum physical address size (not " \
>>>>>>>> + "the amount of memory)"),
>>>>>>>
>>>>>>> Given that this is a shift value, I think the help message could be more
>>>>>>> informative. Something like:
>>>>>>>
>>>>>>> "Specify maximum number of bits in a guest physical address"
>>>>>>>
>>>>>>> I think I'd actually leave out any mention of memory, because this does
>>>>>>> actually have an effect on the amount of addressable memory in a way that I
>>>>>>> don't think we want to describe in half of a usage message line :)
>>>>>> Is there any particular reasons to expose this option to the user?
>>>>>>
>>>>>> I have recently sent a series to allow the user to specify the position
>>>>>> of the RAM [1]. With that series in mind, I think the user would not really
>>>>>> need to specify the maximum physical shift. Instead we could automatically
>>>>>> find it.
>>>>>
>>>>> Marc makes a good point that it doesn't help for MMIO regions, so I'm trying
>>>>> to understand whether we can do something differently there and avoid
>>>>> sacrificing the type parameter.
>>>>
>>>> I am not sure to understand this. kvmtools knows the memory layout
>>>> (including MMIOs) of the guest, so couldn't it guess the maximum
>>>> physical shift for that?
>>>
>>> That's exactly what Will was trying to avoid, by having KVM to compute
>>> the size of the IPA space based on the registered memslots. We've now
>>> established that it doesn't work, so what we need to define is:
>>>
>>> - whether we need another ioctl(), or do we carry on piggy-backing on
>>> the CPU type,
>> kvm type I guess
>
> I really meant target here. Whatever you pass as a "-cpu" on your QEMU
> command line.
Oh OK. It was not a slip then ;-)
>
>>> - assuming the latter, whether we can reduce the number of bits used in
>>> the ioctl parameter by subtly encoding the IPA size.
>> Getting benefit from your Freudian slip, how should guest CPU PARange
>> and maximum number of bits in a guest physical address relate?
>
> Freudian? I'm not on the sofa yet... ;-)
>
>> My understanding is they are not correlated at the moment and our guest
>> PARange is fixed at the moment. But shouldn't they?
>>
>> On Intel there is
>> qemu-system-x86_64 -M pc,accel=kvm -cpu SandyBridge,phys-bits=36
>> or
>> qemu-system-x86_64 -M pc,accel=kvm -cpu SandyBridge,host-phys-bits=true
>>
>> where phys-bits, as far as I understand, has similar semantics to
>> PARange.
>
> I think there is value in having it global, just like on x86. We don't
> really support heterogeneous guests anyway.

Assuming we would use such a ",phys-bits=n" cpu option, is my
understanding correct that it would set both
- guest CPU PARange and
- maximum number of bits in a guest physical address
to n?

Thanks

Eric
>
> Independently, we should also repaint/sanitize PARange so that the guest
> observes the same thing, no matter what CPU it runs on (an A53/A57
> system could be confusing in that respect).

>
> Thanks,
>
> M.
>

2018-07-06 13:52:05

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH v3 15/20] kvm: arm/arm64: Allow tuning the physical address size for VM

On 04/07/18 23:03, Suzuki K Poulose wrote:
> On 07/04/2018 04:51 PM, Will Deacon wrote:
>> Hi Suzuki,
>>
>> On Fri, Jun 29, 2018 at 12:15:35PM +0100, Suzuki K Poulose wrote:
>>> Allow specifying the physical address size for a new VM via
>>> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
>>> us to finalise the stage2 page table format as early as possible
>>> and hence perform the right checks on the memory slots without
>>> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
>>> of the type field and can encode more information in the future if
>>> required. The IPA size is still capped at 40bits.
>>>
>>> Cc: Marc Zyngier <[email protected]>
>>> Cc: Christoffer Dall <[email protected]>
>>> Cc: Peter Maydell <[email protected]>
>>> Cc: Paolo Bonzini <[email protected]>
>>> Cc: Radim Krčmář <[email protected]>
>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>> ---
>>>   arch/arm/include/asm/kvm_mmu.h   |  2 ++
>>>   arch/arm64/include/asm/kvm_arm.h | 10 +++-------
>>>   arch/arm64/include/asm/kvm_mmu.h |  2 ++
>>>   include/uapi/linux/kvm.h         | 10 ++++++++++
>>>   virt/kvm/arm/arm.c               | 24 ++++++++++++++++++++++--
>>>   5 files changed, 39 insertions(+), 9 deletions(-)
>>
>> [...]
>>
>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>> index 4df9bb6..fa4cab0 100644
>>> --- a/include/uapi/linux/kvm.h
>>> +++ b/include/uapi/linux/kvm.h
>>> @@ -751,6 +751,16 @@ struct kvm_ppc_resize_hpt {
>>>   #define KVM_S390_SIE_PAGE_OFFSET 1
>>>   /*
>>> + * On arm/arm64, machine type can be used to request the physical
>>> + * address size for the VM. Bits [7-0] have been reserved for the
>>> + * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
>>> + * value 0 implies the default IPA size, which is 40bits.
>>> + */
>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK    0xff
>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x)        \
>>> +    ((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)
>>
>> This seems like you're allocating quite a lot of bits in a non-extensible
>> interface to a fairly esoteric parameter. Would it be better to add another
>> ioctl, or condense the number of sizes you support instead?
>
> As I explained in the other thread, we need the size as soon as the VM
> is created. The major challenge is keeping the backward compatibility by
> mapping 0 to 40bits. I will give it a thought.

Here is one option. We could re-use the {V}TCR_ELx.{I}PS field format, which
occupies 3 bits and has the following definitions. (ID_AA64MMFR0_EL1:PARange
also has the field definitions, except that the field is 4bits wide, but
only 3bits are used)

000 32 bits, 4GB.
001 36 bits, 64GB.
010 40 bits, 1TB.
011 42 bits, 4TB.
100 44 bits, 16TB.
101 48 bits, 256TB.
110 52 bits, 4PB

But we need to map 0 => 40bits IPA to make our ABI backward compatible. So
we could use the additional one bit to indicate that IPA size is requested
in the 3 bits.

i.e,

machine_type:

Bit [2:0] - Requested IPA size. Values follow VTCR_EL2.PS format.

Bit [3] - 1 => IPA Size bits (Bits[2:0]) requested.
0 => Not requested

The only minor down side is restricting to the predefined values above,
which is not a real issue for a VM.
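
For illustration, here is roughly how that encoding could be decoded on
the kernel side. This is only a sketch: the function and table names are
made up, and only BIT()/ARRAY_SIZE() are real kernel helpers.

	/* VTCR_EL2.PS-style values 0b000..0b110; 0b111 is reserved */
	static const u8 ps_to_ipa_bits[] = { 32, 36, 40, 42, 44, 48, 52 };

	static int kvm_decode_ipa_size(unsigned long type)
	{
		unsigned long ps = type & 0x7;

		if (!(type & BIT(3)))
			return 40;	/* not requested: default IPA size */
		if (ps >= ARRAY_SIZE(ps_to_ipa_bits))
			return -EINVAL;	/* reserved encoding */
		return ps_to_ipa_bits[ps];
	}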

Thoughts ?

Suzuki

2018-07-06 15:10:41

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v3 15/20] kvm: arm/arm64: Allow tuning the physical address size for VM

On 06/07/18 14:49, Suzuki K Poulose wrote:
> On 04/07/18 23:03, Suzuki K Poulose wrote:
>> On 07/04/2018 04:51 PM, Will Deacon wrote:
>>> Hi Suzuki,
>>>
>>> On Fri, Jun 29, 2018 at 12:15:35PM +0100, Suzuki K Poulose wrote:
>>>> Allow specifying the physical address size for a new VM via
>>>> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
>>>> us to finalise the stage2 page table format as early as possible
>>>> and hence perform the right checks on the memory slots without
>>>> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
>>>> of the type field and can encode more information in the future if
>>>> required. The IPA size is still capped at 40bits.
>>>>
>>>> Cc: Marc Zyngier <[email protected]>
>>>> Cc: Christoffer Dall <[email protected]>
>>>> Cc: Peter Maydell <[email protected]>
>>>> Cc: Paolo Bonzini <[email protected]>
>>>> Cc: Radim Krčmář <[email protected]>
>>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>>> ---
>>>>   arch/arm/include/asm/kvm_mmu.h   |  2 ++
>>>>   arch/arm64/include/asm/kvm_arm.h | 10 +++-------
>>>>   arch/arm64/include/asm/kvm_mmu.h |  2 ++
>>>>   include/uapi/linux/kvm.h         | 10 ++++++++++
>>>>   virt/kvm/arm/arm.c               | 24 ++++++++++++++++++++++--
>>>>   5 files changed, 39 insertions(+), 9 deletions(-)
>>>
>>> [...]
>>>
>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>> index 4df9bb6..fa4cab0 100644
>>>> --- a/include/uapi/linux/kvm.h
>>>> +++ b/include/uapi/linux/kvm.h
>>>> @@ -751,6 +751,16 @@ struct kvm_ppc_resize_hpt {
>>>>   #define KVM_S390_SIE_PAGE_OFFSET 1
>>>>   /*
>>>> + * On arm/arm64, machine type can be used to request the physical
>>>> + * address size for the VM. Bits [7-0] have been reserved for the
>>>> + * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
>>>> + * value 0 implies the default IPA size, which is 40bits.
>>>> + */
>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK    0xff
>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x)        \
>>>> +    ((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)
>>>
>>> This seems like you're allocating quite a lot of bits in a non-extensible
>>> interface to a fairly esoteric parameter. Would it be better to add another
>>> ioctl, or condense the number of sizes you support instead?
>>
>> As I explained in the other thread, we need the size as soon as the VM
>> is created. The major challenge is keeping the backward compatibility by
>> mapping 0 to 40bits. I will give it a thought.
>
> Here is one option. We could re-use the {V}TCR_ELx.{I}PS field format, which
> occupies 3 bits and has the following definitions. (ID_AA64MMFR0_EL1:PARange
> also has the field definitions, except that the field is 4bits wide, but
> only 3bits are used)
>
> 000 32 bits, 4GB.
> 001 36 bits, 64GB.
> 010 40 bits, 1TB.
> 011 42 bits, 4TB.
> 100 44 bits, 16TB.
> 101 48 bits, 256TB.
> 110 52 bits, 4PB
>
> But we need to map 0 => 40bits IPA to make our ABI backward compatible. So
> we could use the additional one bit to indicate that IPA size is requested
> in the 3 bits.
>
> i.e,
>
> machine_type:
>
> Bit [2:0] - Requested IPA size. Values follow VTCR_EL2.PS format.
>
> Bit [3] - 1 => IPA Size bits (Bits[2:0]) requested.
> 0 => Not requested
>
> The only minor down side is restricting to the predefined values above,
> which is not a real issue for a VM.
>
> Thoughts ?

I'd be very wary of using that 4th bit to do something that is not in
the architecture. We have only a single value left to be used (0b111),
and then your scheme clashes with the architecture definition.

I'd rather encode things in a way that is independent from the
architecture, and be done with it. You can map 0 to 40bits, and we have
the ability to express all values the architecture has (just in a
different order).

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2018-07-06 16:39:40

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH v3 15/20] kvm: arm/arm64: Allow tuning the physical address size for VM

On 07/06/2018 04:09 PM, Marc Zyngier wrote:
> On 06/07/18 14:49, Suzuki K Poulose wrote:
>> On 04/07/18 23:03, Suzuki K Poulose wrote:
>>> On 07/04/2018 04:51 PM, Will Deacon wrote:
>>>> Hi Suzuki,
>>>>
>>>> On Fri, Jun 29, 2018 at 12:15:35PM +0100, Suzuki K Poulose wrote:
>>>>> Allow specifying the physical address size for a new VM via
>>>>> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
>>>>> us to finalise the stage2 page table format as early as possible
>>>>> and hence perform the right checks on the memory slots without
>>>>> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
>>>>> of the type field and can encode more information in the future if
>>>>> required. The IPA size is still capped at 40bits.
>>>>>
>>>>> Cc: Marc Zyngier <[email protected]>
>>>>> Cc: Christoffer Dall <[email protected]>
>>>>> Cc: Peter Maydell <[email protected]>
>>>>> Cc: Paolo Bonzini <[email protected]>
>>>>> Cc: Radim Krčmář <[email protected]>
>>>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>>>> ---
>>>>>   arch/arm/include/asm/kvm_mmu.h   |  2 ++
>>>>>   arch/arm64/include/asm/kvm_arm.h | 10 +++-------
>>>>>   arch/arm64/include/asm/kvm_mmu.h |  2 ++
>>>>>   include/uapi/linux/kvm.h         | 10 ++++++++++
>>>>>   virt/kvm/arm/arm.c               | 24 ++++++++++++++++++++++--
>>>>>   5 files changed, 39 insertions(+), 9 deletions(-)
>>>>
>>>> [...]
>>>>
>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>>> index 4df9bb6..fa4cab0 100644
>>>>> --- a/include/uapi/linux/kvm.h
>>>>> +++ b/include/uapi/linux/kvm.h
>>>>> @@ -751,6 +751,16 @@ struct kvm_ppc_resize_hpt {
>>>>>   #define KVM_S390_SIE_PAGE_OFFSET 1
>>>>>   /*
>>>>> + * On arm/arm64, machine type can be used to request the physical
>>>>> + * address size for the VM. Bits [7-0] have been reserved for the
>>>>> + * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
>>>>> + * value 0 implies the default IPA size, which is 40bits.
>>>>> + */
>>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK    0xff
>>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x)        \
>>>>> +    ((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)
>>>>
>>>> This seems like you're allocating quite a lot of bits in a non-extensible
>>>> interface to a fairly esoteric parameter. Would it be better to add another
>>>> ioctl, or condense the number of sizes you support instead?
>>>
>>> As I explained in the other thread, we need the size as soon as the VM
>>> is created. The major challenge is keeping the backward compatibility by
>>> mapping 0 to 40bits. I will give it a thought.
>>
>> Here is one option. We could re-use the {V}TCR_ELx.{I}PS field format, which
>> occupies 3 bits and has the following definitions. (ID_AA64MMFR0_EL1:PARange
>> also has the field definitions, except that the field is 4bits wide, but
>> only 3bits are used)
>>
>> 000 32 bits, 4GB.
>> 001 36 bits, 64GB.
>> 010 40 bits, 1TB.
>> 011 42 bits, 4TB.
>> 100 44 bits, 16TB.
>> 101 48 bits, 256TB.
>> 110 52 bits, 4PB
>>
>> But we need to map 0 => 40bits IPA to make our ABI backward compatible. So
>> we could use the additional one bit to indicate that IPA size is requested
>> in the 3 bits.
>>
>> i.e,
>>
>> machine_type:
>>
>> Bit [2:0] - Requested IPA size. Values follow VTCR_EL2.PS format.
>>
>> Bit [3] - 1 => IPA Size bits (Bits[2:0]) requested.
>> 0 => Not requested
>>
>> The only minor down side is restricting to the predefined values above,
>> which is not a real issue for a VM.
>>
>> Thoughts ?
>
> I'd be very wary of using that 4th bit to do something that is not in
> the architecture. We have only a single value left to be used (0b111),
> and then your scheme clashes with the architecture definition.

I agree. However, if we ever go beyond the 3bits in PARange, we have an
issue with the {V}TCR counterpart. But let's not take that chance.

>
> I'd rather encode things in a way that is independent from the
> architecture, and be done with it. You can map 0 to 40bits, and we have
> the ability to express all values the architecture has (just in a
> different order).

The other option I can think of is encoding a signed number which is the
difference of the IPA from 40. But that would need 5 bits if we were to
encode it as it is. And if we want to squeeze it in 4bit, we could store
half the difference (limiting the IPA limit to even numbers).

i.e, IPA = 40 + 2 * sign_extend(bits[3:0]);
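
As a sketch, the encode/decode for that scheme would be something like
the following (helper names invented for illustration):

	/* IPA = 40 + 2 * sign_extend(bits[3:0]); note that type == 0 still
	 * decodes to the default 40bit IPA, keeping the ABI compatible. */
	static inline unsigned long kvm_encode_ipa(int ipa_bits)
	{
		return ((ipa_bits - 40) / 2) & 0xf;	/* ipa_bits must be even */
	}

	static inline int kvm_decode_ipa(unsigned long type)
	{
		int delta = type & 0xf;

		if (delta & 0x8)	/* sign-extend the 4bit field */
			delta -= 16;
		return 40 + 2 * delta;
	}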


Suzuki

2018-07-09 11:24:28

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH v3 15/20] kvm: arm/arm64: Allow tuning the physical address size for VM

On Fri, Jul 06, 2018 at 05:39:00PM +0100, Suzuki K Poulose wrote:
> On 07/06/2018 04:09 PM, Marc Zyngier wrote:
> >On 06/07/18 14:49, Suzuki K Poulose wrote:
> >>On 04/07/18 23:03, Suzuki K Poulose wrote:
> >>>On 07/04/2018 04:51 PM, Will Deacon wrote:
> >>>>Hi Suzuki,
> >>>>
> >>>>On Fri, Jun 29, 2018 at 12:15:35PM +0100, Suzuki K Poulose wrote:
> >>>>>Allow specifying the physical address size for a new VM via
> >>>>>the kvm_type argument for KVM_CREATE_VM ioctl. This allows
> >>>>>us to finalise the stage2 page table format as early as possible
> >>>>>and hence perform the right checks on the memory slots without
> >>>>>complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
> >>>>>of the type field and can encode more information in the future if
> >>>>>required. The IPA size is still capped at 40bits.
> >>>>>
> >>>>>Cc: Marc Zyngier <[email protected]>
> >>>>>Cc: Christoffer Dall <[email protected]>
> >>>>>Cc: Peter Maydell <[email protected]>
> >>>>>Cc: Paolo Bonzini <[email protected]>
> >>>>>Cc: Radim Krčmář <[email protected]>
> >>>>>Signed-off-by: Suzuki K Poulose <[email protected]>
> >>>>>---
> >>>>>   arch/arm/include/asm/kvm_mmu.h   |  2 ++
> >>>>>   arch/arm64/include/asm/kvm_arm.h | 10 +++-------
> >>>>>   arch/arm64/include/asm/kvm_mmu.h |  2 ++
> >>>>>   include/uapi/linux/kvm.h         | 10 ++++++++++
> >>>>>   virt/kvm/arm/arm.c               | 24 ++++++++++++++++++++++--
> >>>>>   5 files changed, 39 insertions(+), 9 deletions(-)
> >>>>
> >>>>[...]
> >>>>
> >>>>>diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>>>index 4df9bb6..fa4cab0 100644
> >>>>>--- a/include/uapi/linux/kvm.h
> >>>>>+++ b/include/uapi/linux/kvm.h
> >>>>>@@ -751,6 +751,16 @@ struct kvm_ppc_resize_hpt {
> >>>>>   #define KVM_S390_SIE_PAGE_OFFSET 1
> >>>>>   /*
> >>>>>+ * On arm/arm64, machine type can be used to request the physical
> >>>>>+ * address size for the VM. Bits [7-0] have been reserved for the
> >>>>>+ * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
> >>>>>+ * value 0 implies the default IPA size, which is 40bits.
> >>>>>+ */
> >>>>>+#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK    0xff
> >>>>>+#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x)        \
> >>>>>+    ((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)
> >>>>
> >>>>This seems like you're allocating quite a lot of bits in a non-extensible
> >>>>interface to a fairly esoteric parameter. Would it be better to add another
> >>>>ioctl, or condense the number of sizes you support instead?
> >>>
> >>>As I explained in the other thread, we need the size as soon as the VM
> >>>is created. The major challenge is keeping the backward compatibility by
> >>>mapping 0 to 40bits. I will give it a thought.
> >>
> >>Here is one option. We could re-use the {V}TCR_ELx.{I}PS field format, which
> >>occupies 3 bits and has the following definitions. (ID_AA64MMFR0_EL1:PARange
> >>also has the field definitions, except that the field is 4bits wide, but
> >>only 3bits are used)
> >>
> >>000 32 bits, 4GB.
> >>001 36 bits, 64GB.
> >>010 40 bits, 1TB.
> >>011 42 bits, 4TB.
> >>100 44 bits, 16TB.
> >>101 48 bits, 256TB.
> >>110 52 bits, 4PB
> >>
> >>But we need to map 0 => 40bits IPA to make our ABI backward compatible. So
> >>we could use the additional one bit to indicate that IPA size is requested
> >>in the 3 bits.
> >>
> >>i.e,
> >>
> >>machine_type:
> >>
> >>Bit [2:0] - Requested IPA size. Values follow VTCR_EL2.PS format.
> >>
> >>Bit [3] - 1 => IPA Size bits (Bits[2:0]) requested.
> >> 0 => Not requested
> >>
> >>The only minor down side is restricting to the predefined values above,
> >>which is not a real issue for a VM.
> >>
> >>Thoughts ?
> >
> >I'd be very wary of using that 4th bit to do something that is not in
> >the architecture. We have only a single value left to be used (0b111),
> >and then your scheme clashes with the architecture definition.
>
> I agree. However, if we ever go beyond the 3bits in PARange, we have an
> issue with the {V}TCR counterpart. But let's not take that chance.
>
> >
> >I'd rather encode things in a way that is independent from the
> >architecture, and be done with it. You can map 0 to 40bits, and we have
> >the ability to express all values the architecture has (just in a
> >different order).
>
> The other option I can think of is encoding a signed number which is the
> difference of the IPA from 40. But that would need 5 bits if we were to
> encode it as it is. And if we want to squeeze it in 4bit, we could store
> half the difference (limiting the IPA limit to even numbers).
>
> i.e, IPA = 40 + 2 * sign_extend(bits[3:0]);

I came across similar issues when trying to work out how to enable
SVE for KVM. In the end I reduced this to a per-vcpu feature, but
it means that there is no global opt-in for the SVE-specific KVM
API extensions:

That's a bit gross, because SVE may require a change to the way
vcpus are initialised. The set of supported SVE vector lengths needs
to be set somehow before the vcpu is set running, but it's tricky to
do that without a new ioctl -- which would mean that if SVE is enabled
for a vcpu then the vcpu is not considered runnable until the new
magic ioctl is called.

Opting into that semantic change globally at VM creation time might
be preferable. On the SVE side, this is still very much subject to
review/change.


Here:

The KVM_CREATE_VM init argument seems undefined by the KVM core code and
is available for arches to abuse in creative ways. x86 and arm have
nothing here and reject non-zero values with -EINVAL; s390 treats it as
a bitmask, and defines a single feature-like bit here; powerpc treats it
as an enumeration of VM types.

If we want to be extensible, we could

a) Pass a pointer in type, and come up with some extensible VM parameter
struct for it to point to (which then wouldn't need a cryptic
compressed encoding), or

b) Introduce a new "KVM_CREATE_VM2" variant that either takes such
an argument, or mandates a parameter negotiation phase involving
additional ioctls before marking the VM as ready for vcpu and
device creation.

(a) feels like an easy backwards-compatible approach, but cannot be
readily adopted by other arches (maybe not an issue).

(b) might be considered overengineered, so it would need a bit of
thought.

Wedging arguments into a few bits in the type argument feels awkward,
and may be regretted later if we run out of bits, or something can't be
represented in the chosen encoding.

Cheers
---Dave

2018-07-09 12:32:40

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH v3 15/20] kvm: arm/arm64: Allow tuning the physical address size for VM

On 09/07/18 12:23, Dave Martin wrote:
> On Fri, Jul 06, 2018 at 05:39:00PM +0100, Suzuki K Poulose wrote:
>> On 07/06/2018 04:09 PM, Marc Zyngier wrote:
>>> On 06/07/18 14:49, Suzuki K Poulose wrote:
>>>> On 04/07/18 23:03, Suzuki K Poulose wrote:
>>>>> On 07/04/2018 04:51 PM, Will Deacon wrote:
>>>>>> Hi Suzuki,
>>>>>>
>>>>>> On Fri, Jun 29, 2018 at 12:15:35PM +0100, Suzuki K Poulose wrote:
>>>>>>> Allow specifying the physical address size for a new VM via
>>>>>>> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
>>>>>>> us to finalise the stage2 page table format as early as possible
>>>>>>> and hence perform the right checks on the memory slots without
>>>>>>> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
>>>>>>> of the type field and can encode more information in the future if
>>>>>>> required. The IPA size is still capped at 40bits.
>>>>>>>
>>>>>>> Cc: Marc Zyngier <[email protected]>
>>>>>>> Cc: Christoffer Dall <[email protected]>
>>>>>>> Cc: Peter Maydell <[email protected]>
>>>>>>> Cc: Paolo Bonzini <[email protected]>
>>>>>>> Cc: Radim Krčmář <[email protected]>
>>>>>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>>>>>> ---
>>>>>>>   arch/arm/include/asm/kvm_mmu.h   |  2 ++
>>>>>>>   arch/arm64/include/asm/kvm_arm.h | 10 +++-------
>>>>>>>   arch/arm64/include/asm/kvm_mmu.h |  2 ++
>>>>>>>   include/uapi/linux/kvm.h         | 10 ++++++++++
>>>>>>>   virt/kvm/arm/arm.c               | 24 ++++++++++++++++++++++--
>>>>>>>   5 files changed, 39 insertions(+), 9 deletions(-)
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>>>>> index 4df9bb6..fa4cab0 100644
>>>>>>> --- a/include/uapi/linux/kvm.h
>>>>>>> +++ b/include/uapi/linux/kvm.h
>>>>>>> @@ -751,6 +751,16 @@ struct kvm_ppc_resize_hpt {
>>>>>>>   #define KVM_S390_SIE_PAGE_OFFSET 1
>>>>>>>   /*
>>>>>>> + * On arm/arm64, machine type can be used to request the physical
>>>>>>> + * address size for the VM. Bits [7-0] have been reserved for the
>>>>>>> + * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
>>>>>>> + * value 0 implies the default IPA size, which is 40bits.
>>>>>>> + */
>>>>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK    0xff
>>>>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x)        \
>>>>>>> +    ((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)
>>>>>>
>>>>>> This seems like you're allocating quite a lot of bits in a non-extensible
>>>>>> interface to a fairly esoteric parameter. Would it be better to add another
>>>>>> ioctl, or condense the number of sizes you support instead?
>>>>>
>>>>> As I explained in the other thread, we need the size as soon as the VM
>>>>> is created. The major challenge is keeping the backward compatibility by
>>>>> mapping 0 to 40bits. I will give it a thought.
>>>>
>>>> Here is one option. We could re-use the {V}TCR_ELx.{I}PS field format, which
>>>> occupies 3 bits and has the following definitions. (ID_AA64MMFR0_EL1:PARange
>>>> also has the field definitions, except that the field is 4bits wide, but
>>>> only 3bits are used)
>>>>
>>>> 000 32 bits, 4GB.
>>>> 001 36 bits, 64GB.
>>>> 010 40 bits, 1TB.
>>>> 011 42 bits, 4TB.
>>>> 100 44 bits, 16TB.
>>>> 101 48 bits, 256TB.
>>>> 110 52 bits, 4PB
>>>>
>>>> But we need to map 0 => 40bits IPA to make our ABI backward compatible. So
>>>> we could use the additional one bit to indicate that IPA size is requested
>>>> in the 3 bits.
>>>>
>>>> i.e,
>>>>
>>>> machine_type:
>>>>
>>>> Bit [2:0] - Requested IPA size. Values follow VTCR_EL2.PS format.
>>>>
>>>> Bit [3] - 1 => IPA Size bits (Bits[2:0]) requested.
>>>> 0 => Not requested
>>>>
>>>> The only minor down side is restricting to the predefined values above,
>>>> which is not a real issue for a VM.
>>>>
>>>> Thoughts ?
>>>
>>> I'd be very wary of using that 4th bit to do something that is not in
>>> the architecture. We have only a single value left to be used (0b111),
>>> and then your scheme clashes with the architecture definition.
>>
>> I agree. However, if we ever go beyond the 3bits in PARange, we have an
>> issue with the {V}TCR counterpart. But let's not take that chance.
>>
>>>
>>> I'd rather encode things in a way that is independent from the
>>> architecture, and be done with it. You can map 0 to 40bits, and we have
>>> the ability to express all values the architecture has (just in a
>>> different order).
>>
>> The other option I can think of is encoding a signed number which is the
>> difference of the IPA from 40. But that would need 5 bits if we were to
>> encode it as it is. And if we want to squeeze it in 4bit, we could store
>> half the difference (limiting the IPA limit to even numbers).
>>
>> i.e, IPA = 40 + 2 * sign_extend(bits[3:0]);
>
> I came across similar issues when trying to work out how to enable
> SVE for KVM. In the end I reduced this to a per-vcpu feature, but
> it means that there is no global opt-in for the SVE-specific KVM
> API extensions:
>
> That's a bit gross, because SVE may require a change to the way
> vcpus are initialised. The set of supported SVE vector lengths needs
> to be set somehow before the vcpu is set running, but it's tricky to
> do that without a new ioctl -- which would mean that if SVE is enabled
> for a vcpu then the vcpu is not considered runnable until the new
> magic ioctl is called.
>
> Opting into that semantic change globally at VM creation time might
> be preferable. On the SVE side, this is still very much subject to
> review/change.
>
>
> Here:
>
> The KVM_CREATE_VM init argument seems undefined by the KVM core code and
> is available for arches to abuse in creative ways. x86 and arm have
> nothing here and reject non-zero values with -EINVAL; s390 treats it as
> a bitmask, and defines a single feature-like bit here; powerpc treats it
> as an enumeration of VM types.
>
> If we want to be extensible, we could
>
> a) Pass a pointer in type, and come up with some extensible VM parameter
> struct for it to point to (which then wouldn't need a cryptic
> compressed encoding), or
>
> b) Introduce a new "KVM_CREATE_VM2" variant that either takes such
> an argument, or mandates a parameter negotiation phase involving
> additional ioctls before marking the VM as ready for vcpu and
> device creation.
>
> (a) feels like an easy backwards-compatible approach, but cannot be
> readily adopted by other arches (maybe not an issue).
>
> (b) might be considered overengineered, so it would need a bit of
> thought.
>
> Wedging arguments into a few bits in the type argument feels awkward,
> and may be regretted later if we run out of bits, or something can't be
> represented in the chosen encoding.

I think that's a pretty convincing argument for a "better" CREATE_VM,
one that would have a clearly defined, structured (and potentially
extensible) argument.

I've quickly hacked the following:

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b6270a3b38e9..3e76214034c2 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -735,6 +735,20 @@ struct kvm_ppc_resize_hpt {
__u32 pad;
};

+struct kvm_create_vm2 {
+ __u64 version; /* Or maybe not */
+ union {
+ struct {
+#define KVM_ARM_SVE_CAPABLE (1 << 0)
+#define KVM_ARM_SELECT_IPA (1 << 1)
+ __u64 capabilities;
+ __u16 sve_vlen;
+ __u8 ipa_size;
+ } arm64;
+ __u64 dummy[15];
+ };
+};
+
#define KVMIO 0xAE

/* machine type bits, to be used as argument to KVM_CREATE_VM */

Other architectures could fill in their own bits if they need to.
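
As a purely illustrative usage sketch (the ioctl request number and the
exact field semantics are obviously not defined yet):

	struct kvm_create_vm2 vm = {
		.version = 0,
		.arm64 = {
			.capabilities = KVM_ARM_SELECT_IPA,
			.ipa_size = 48,		/* request a 48bit IPA space */
		},
	};

	int kvm_fd = open("/dev/kvm", O_RDWR);
	int vmfd = ioctl(kvm_fd, KVM_CREATE_VM2, &vm);	/* hypothetical request */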

Thoughts?

M.
--
Jazz is not dead. It just smells funny...

2018-07-09 13:39:49

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH v3 15/20] kvm: arm/arm64: Allow tuning the physical address size for VM

On Mon, Jul 09, 2018 at 01:29:42PM +0100, Marc Zyngier wrote:
> On 09/07/18 12:23, Dave Martin wrote:
> > On Fri, Jul 06, 2018 at 05:39:00PM +0100, Suzuki K Poulose wrote:
> >> On 07/06/2018 04:09 PM, Marc Zyngier wrote:
> >>> On 06/07/18 14:49, Suzuki K Poulose wrote:
> >>>> On 04/07/18 23:03, Suzuki K Poulose wrote:
> >>>>> On 07/04/2018 04:51 PM, Will Deacon wrote:
> >>>>>> Hi Suzuki,
> >>>>>>
> >>>>>> On Fri, Jun 29, 2018 at 12:15:35PM +0100, Suzuki K Poulose wrote:
> >>>>>>> Allow specifying the physical address size for a new VM via
> >>>>>>> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
> >>>>>>> us to finalise the stage2 page table format as early as possible
> >>>>>>> and hence perform the right checks on the memory slots without
> >>>>>>> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
> >>>>>>> of the type field and can encode more information in the future if
> >>>>>>> required. The IPA size is still capped at 40bits.
> >>>>>>>
> >>>>>>> Cc: Marc Zyngier <[email protected]>
> >>>>>>> Cc: Christoffer Dall <[email protected]>
> >>>>>>> Cc: Peter Maydell <[email protected]>
> >>>>>>> Cc: Paolo Bonzini <[email protected]>
> >>>>>>> Cc: Radim Krčmář <[email protected]>
> >>>>>>> Signed-off-by: Suzuki K Poulose <[email protected]>
> >>>>>>> ---
> >>>>>>>   arch/arm/include/asm/kvm_mmu.h   |  2 ++
> >>>>>>>   arch/arm64/include/asm/kvm_arm.h | 10 +++-------
> >>>>>>>   arch/arm64/include/asm/kvm_mmu.h |  2 ++
> >>>>>>>   include/uapi/linux/kvm.h         | 10 ++++++++++
> >>>>>>>   virt/kvm/arm/arm.c               | 24 ++++++++++++++++++++++--
> >>>>>>>   5 files changed, 39 insertions(+), 9 deletions(-)
> >>>>>>
> >>>>>> [...]
> >>>>>>
> >>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>>>>> index 4df9bb6..fa4cab0 100644
> >>>>>>> --- a/include/uapi/linux/kvm.h
> >>>>>>> +++ b/include/uapi/linux/kvm.h
> >>>>>>> @@ -751,6 +751,16 @@ struct kvm_ppc_resize_hpt {
> >>>>>>>   #define KVM_S390_SIE_PAGE_OFFSET 1
> >>>>>>>   /*
> >>>>>>> + * On arm/arm64, machine type can be used to request the physical
> >>>>>>> + * address size for the VM. Bits [7-0] have been reserved for the
> >>>>>>> + * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
> >>>>>>> + * value 0 implies the default IPA size, which is 40bits.
> >>>>>>> + */
> >>>>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK    0xff
> >>>>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x)        \
> >>>>>>> +    ((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)
> >>>>>>
> >>>>>> This seems like you're allocating quite a lot of bits in a non-extensible
> >>>>>> interface to a fairly esoteric parameter. Would it be better to add another
> >>>>>> ioctl, or condense the number of sizes you support instead?
> >>>>>
> >>>>> As I explained in the other thread, we need the size as soon as the VM
> >>>>> is created. The major challenge is keeping the backward compatibility by
> >>>>> mapping 0 to 40bits. I will give it a thought.
> >>>>
> >>>> Here is one option. We could re-use the {V}TCR_ELx.{I}PS field format, which
> >>>> occupies 3 bits and has the following definitions. (ID_AA64MMFR0_EL1:PARange
> >>>> also has the field definitions, except that the field is 4bits wide, but
> >>>> only 3bits are used)
> >>>>
> >>>> 000 32 bits, 4GB.
> >>>> 001 36 bits, 64GB.
> >>>> 010 40 bits, 1TB.
> >>>> 011 42 bits, 4TB.
> >>>> 100 44 bits, 16TB.
> >>>> 101 48 bits, 256TB.
> >>>> 110 52 bits, 4PB
> >>>>
> >>>> But we need to map 0 => 40bits IPA to make our ABI backward compatible. So
> >>>> we could use the additional one bit to indicate that IPA size is requested
> >>>> in the 3 bits.
> >>>>
> >>>> i.e,
> >>>>
> >>>> machine_type:
> >>>>
> >>>> Bit [2:0] - Requested IPA size. Values follow VTCR_EL2.PS format.
> >>>>
> >>>> Bit [3] - 1 => IPA Size bits (Bits[2:0]) requested.
> >>>> 0 => Not requested
> >>>>
> >>>> The only minor down side is restricting to the predefined values above,
> >>>> which is not a real issue for a VM.
> >>>>
> >>>> Thoughts ?
> >>>
> >>> I'd be very wary of using that 4th bit to do something that is not in
> >>> the architecture. We have only a single value left to be used (0b111),
> >>> and then your scheme clashes with the architecture definition.
> >>
> >> I agree. However, if we ever go beyond the 3bits in PARange, we have an
> >> issue with the {V}TCR counterpart. But let's not take that chance.
> >>
> >>>
> >>> I'd rather encode things in a way that is independent from the
> >>> architecture, and be done with it. You can map 0 to 40bits, and we have
> >>> the ability to express all values the architecture has (just in a
> >>> different order).
> >>
> >> The other option I can think of is encoding a signed number which is the
> >> difference of the IPA from 40. But that would need 5 bits if we were to
> >> encode it as it is. And if we want to squeeze it in 4bit, we could store
> >> half the difference (limiting the IPA limit to even numbers).
> >>
> >> i.e, IPA = 40 + 2 * sign_extend(bits[3:0]);
> >
> > I came across similar issues when trying to work out how to enable
> > SVE for KVM. In the end I reduced this to a per-vcpu feature, but
> > it means that there is no global opt-in for the SVE-specific KVM
> > API extensions:
> >
> > That's a bit gross, because SVE may require a change to the way
> > vcpus are initialised. The set of supported SVE vector lengths needs
> > to be set somehow before the vcpu is set running, but it's tricky to
> > do that without a new ioctl -- which would mean that if SVE is enabled
> > for a vcpu then the vcpu is not considered runnable until the new
> > magic ioctl is called.
> >
> > Opting into that semantic change globally at VM creation time might
> > be preferable. On the SVE side, this is still very much subject to
> > review/change.
> >
> >
> > Here:
> >
> > The KVM_CREATE_VM init argument seems undefined by the KVM core code and
> > is available for arches to abuse in creative ways. x86 and arm have
> > nothing here and reject non-zero values with -EINVAL; s390 treats it as
> > a bitmask, and defines a single feature-like bit here; powerpc treats it
> > as an enumeration of VM types.
> >
> > If we want to be extensible, we could
> >
> > a) Pass a pointer in type, and come up with some extensible VM parameter
> > struct for it to point to (which then wouldn't need a cryptic
> > compressed encoding), or
> >
> > b) Introduce a new "KVM_CREATE_VM2" variant that either takes such
> > an argument, or mandates a parameter negotiation phase involving
> > additional ioctls before marking the VM as ready for vcpu and
> > device creation.
> >
> > (a) feels like an easy backwards-compatible approach, but cannot be
> > readily adopted by other arches (maybe not an issue).
> >
> > (b) might be considered overengineered, so it would need a bit of
> > thought.
> >
> > Wedging arguments into a few bits in the type argument feels awkward,
> > and may be regretted later if we run out of bits, or something can't be
> > represented in the chosen encoding.
>
> I think that's a pretty convincing argument for a "better" CREATE_VM,
> one that would have a clearly defined, structured (and potentially
> extensible) argument.
>
> I've quickly hacked the following:
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index b6270a3b38e9..3e76214034c2 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -735,6 +735,20 @@ struct kvm_ppc_resize_hpt {
> __u32 pad;
> };
>
> +struct kvm_create_vm2 {
> + __u64 version; /* Or maybe not */
> + union {
> + struct {
> +#define KVM_ARM_SVE_CAPABLE (1 << 0)
> +#define KVM_ARM_SELECT_IPA (1 << 1)
> + __u64 capabilities;
> + __u16 sve_vlen;
> + __u8 ipa_size;
> + } arm64;
> + __u64 dummy[15];
> + };
> +};
> +
> #define KVMIO 0xAE
>
> /* machine type bits, to be used as argument to KVM_CREATE_VM */
>
> Other architectures could fill in their own bits if they need to.
>
> Thoughts?

This kind of thing should work, but it may still get messy when we
add additional fields.

If we want this to work cross-arch, would it make sense to go
for a more generic approach, say

struct kvm_create_vm_attr_any {
__u32 type;
};

#define KVM_CREATE_VM_ATTR_ARCH_CAPABILITIES 1
struct kvm_create_vm_attr_arch_capabilities {
__u32 type;
__u16 size; /* support future expansion of capabilities[] */
__u16 reserved;
__u64 capabilities[1];
};

#define KVM_CREATE_VM_ATTR_ARM64_PHYSADDR_SIZE 2
struct kvm_create_vm_attr_arm64_physaddr_size {
__u32 type;
__u32 physaddr_bits;
};

/* ... */

union kvm_create_vm_attr {
struct kvm_create_vm_attr_any any;
struct kvm_create_vm_attr_arch_capabilities arch_caps;
struct kvm_create_vm_attr_arm64_physaddr_size arm64_physaddr_size;
/* ... */
};

struct kvm_create_vm2 {
__u32 version; /* harmless, even if not useful */
__u16 nr_attrs; /* or could just terminate attrs with a
NULL entry */
union kvm_create_vm_attr __user *__user *attrs;
};


This is quite flexible, but obviously a bit heavy.

However, if we're adding a new interface due to lack of extensibility,
it may be worth going for something that's freely extensible.


Userspace might call this as

struct kvm_create_vm_attr_arch_capabilities vm_arch_caps = {
.type = KVM_CREATE_VM_ATTR_ARCH_CAPABILITIES,
.size = 64,
.capabilities[0] = KVM_CREATE_VM_ARM64_VCPU_NEEDS_SET_SVE_VLS,
};

struct kvm_create_vm_attr_arm64_physaddr_size vm_arm64_physaddr_size = {
.type = KVM_CREATE_VM_ATTR_ARM64_PHYSADDR_SIZE,
.physaddr_bits = 52,
};

union kvm_create_vm_attr *vmattrs[] = {
&vm_arch_caps,
&vm_arm64_physaddr_size,
NULL, /* maybe */
};

struct kvm_create_vm2 vm;

vm.version = 0;
vm.nr_attrs = 2; /* maybe */
vm.attrs = vmattrs;

ioctl(..., KVM_CREATE_VM2, &vm);

Cheers
---Dave

2018-07-10 17:01:47

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH v3 15/20] kvm: arm/arm64: Allow tuning the physical address size for VM

On 09/07/18 14:37, Dave Martin wrote:
> On Mon, Jul 09, 2018 at 01:29:42PM +0100, Marc Zyngier wrote:
>> On 09/07/18 12:23, Dave Martin wrote:
>>> On Fri, Jul 06, 2018 at 05:39:00PM +0100, Suzuki K Poulose wrote:
>>>> On 07/06/2018 04:09 PM, Marc Zyngier wrote:
>>>>> On 06/07/18 14:49, Suzuki K Poulose wrote:
>>>>>> On 04/07/18 23:03, Suzuki K Poulose wrote:
>>>>>>> On 07/04/2018 04:51 PM, Will Deacon wrote:
>>>>>>>> Hi Suzuki,
>>>>>>>>
>>>>>>>> On Fri, Jun 29, 2018 at 12:15:35PM +0100, Suzuki K Poulose wrote:
>>>>>>>>> Allow specifying the physical address size for a new VM via
>>>>>>>>> the kvm_type argument for KVM_CREATE_VM ioctl. This allows
>>>>>>>>> us to finalise the stage2 page table format as early as possible
>>>>>>>>> and hence perform the right checks on the memory slots without
>>>>>>>>> complication. The size is encoded as Log2(PA_Size) in the bits[7:0]
>>>>>>>>> of the type field and can encode more information in the future if
>>>>>>>>> required. The IPA size is still capped at 40bits.
>>>>>>>>>
>>>>>>>>> Cc: Marc Zyngier <[email protected]>
>>>>>>>>> Cc: Christoffer Dall <[email protected]>
>>>>>>>>> Cc: Peter Maydell <[email protected]>
>>>>>>>>> Cc: Paolo Bonzini <[email protected]>
>>>>>>>>> Cc: Radim Krčmář <[email protected]>
>>>>>>>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>>>>>>>> ---
>>>>>>>>>   arch/arm/include/asm/kvm_mmu.h   |  2 ++
>>>>>>>>>   arch/arm64/include/asm/kvm_arm.h | 10 +++-------
>>>>>>>>>   arch/arm64/include/asm/kvm_mmu.h |  2 ++
>>>>>>>>>   include/uapi/linux/kvm.h         | 10 ++++++++++
>>>>>>>>>   virt/kvm/arm/arm.c               | 24 ++++++++++++++++++++++--
>>>>>>>>>   5 files changed, 39 insertions(+), 9 deletions(-)
>>>>>>>>
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>>>>>>> index 4df9bb6..fa4cab0 100644
>>>>>>>>> --- a/include/uapi/linux/kvm.h
>>>>>>>>> +++ b/include/uapi/linux/kvm.h
>>>>>>>>> @@ -751,6 +751,16 @@ struct kvm_ppc_resize_hpt {
>>>>>>>>>   #define KVM_S390_SIE_PAGE_OFFSET 1
>>>>>>>>>   /*
>>>>>>>>> + * On arm/arm64, machine type can be used to request the physical
>>>>>>>>> + * address size for the VM. Bits [7-0] have been reserved for the
>>>>>>>>> + * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
>>>>>>>>> + * value 0 implies the default IPA size, which is 40bits.
>>>>>>>>> + */
>>>>>>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK    0xff
>>>>>>>>> +#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x)        \
>>>>>>>>> +    ((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)
>>>>>>>>
>>>>>>>> This seems like you're allocating quite a lot of bits in a non-extensible
>>>>>>>> interface to a fairly esoteric parameter. Would it be better to add another
>>>>>>>> ioctl, or condense the number of sizes you support instead?
>>>>>>>
>>>>>>> As I explained in the other thread, we need the size as soon as the VM
>>>>>>> is created. The major challenge is keeping the backward compatibility by
>>>>>>> mapping 0 to 40bits. I will give it a thought.
>>>>>>
>>>>>> Here is one option. We could re-use the {V}TCR_ELx.{I}PS field format, which
>>>>>> occupies 3 bits and has the following definitions. (ID_AA64MMFR0_EL1:PARange
>>>>>> also has the field definitions, except that the field is 4bits wide, but
>>>>>> only 3bits are used)
>>>>>>
>>>>>> 000 32 bits, 4GB.
>>>>>> 001 36 bits, 64GB.
>>>>>> 010 40 bits, 1TB.
>>>>>> 011 42 bits, 4TB.
>>>>>> 100 44 bits, 16TB.
>>>>>> 101 48 bits, 256TB.
>>>>>> 110 52 bits, 4PB
>>>>>>
>>>>>> But we need to map 0 => 40bits IPA to make our ABI backward compatible. So
>>>>>> we could use the additional one bit to indicate that IPA size is requested
>>>>>> in the 3 bits.
>>>>>>
>>>>>> i.e,
>>>>>>
>>>>>> machine_type:
>>>>>>
>>>>>> Bit [2:0] - Requested IPA size. Values follow VTCR_EL2.PS format.
>>>>>>
>>>>>> Bit [3] - 1 => IPA Size bits (Bits[2:0]) requested.
>>>>>> 0 => Not requested
>>>>>>
>>>>>> The only minor down side is restricting to the predefined values above,
>>>>>> which is not a real issue for a VM.
>>>>>>
>>>>>> Thoughts ?
>>>>>
>>>>> I'd be very wary of using that 4th bit to do something that is not in
>>>>> the architecture. We have only a single value left to be used (0b111),
>>>>> and then your scheme clashes with the architecture definition.
>>>>
>>>> I agree. However, if we ever go beyond the 3bits in PARange, we have an
>>>> issue with the {V}TCR counterpart. But let's not take that chance.
>>>>
>>>>>
>>>>> I'd rather encode things in a way that is independent from the
>>>>> architecture, and be done with it. You can map 0 to 40bits, and we have
>>>>> the ability to express all values the architecture has (just in a
>>>>> different order).
>>>>
>>>> The other option I can think of is encoding a signed number which is the
>>>> difference of the IPA from 40. But that would need 5 bits if we were to
>>>> encode it as it is. And if we want to squeeze it in 4bit, we could store
>>>> half the difference (limiting the IPA limit to even numbers).
>>>>
>>>> i.e, IPA = 40 + 2 * sign_extend(bits[3:0]);
>>>
>>> I came across similar issues when trying to work out how to enable
>>> SVE for KVM. In the end I reduced this to a per-vcpu feature, but
>>> it means that there is no global opt-in for the SVE-specific KVM
>>> API extensions:
>>>
>>> That's a bit gross, because SVE may require a change to the way
>>> vcpus are initialised. The set of supported SVE vector lengths needs
>>> to be set somehow before the vcpu is set running, but it's tricky to
>>> do that without a new ioctl -- which would mean that if SVE is enabled
>>> for a vcpu then the vcpu is not considered runnable until the new
>>> magic ioctl is called.
>>>
>>> Opting into that semantic change globally at VM creation time might
>>> be preferable. On the SVE side, this is still very much subject to
>>> review/change.
>>>
>>>
>>> Here:
>>>
>>> The KVM_CREATE_VM init argument seems undefined by the KVM core code and
>>> is available for arches to abuse in creative ways. x86 and arm have
>>> nothing here and reject non-zero values with -EINVAL; s390 treats it as
>>> a bitmask, and defines a single feature-like bit here; powerpc treats it
>>> as an enumeration of VM types.
>>>
>>> If we want to be extensible, we could
>>>
>>> a) Pass a pointer in type, and come up with some extensible VM parameter
>>> struct for it to point to (which then wouldn't need a cryptic
>>> compressed encoding), or
>>>
>>> b) Introduce a new "KVM_CREATE_VM2" variant that either takes such
>>> an argument, or mandates a parameter negotiation phase involving
>>> additional ioctls before marking the VM as ready for vcpu and
>>> device creation.
>>>
>>> (a) feels like an easy backwards-compatible approach, but cannot be
>>> readily adopted by other arches (maybe not an issue).
>>>
>>> (b) might be considered overengineered, so it would need a bit of
>>> thought.
>>>
>>> Wedging arguments into a few bits in the type argument feels awkward,
>>> and may be regretted later if we run out of bits, or something can't be
>>> represented in the chosen encoding.
>>
>> I think that's a pretty convincing argument for a "better" CREATE_VM,
>> one that would have a clearly defined, structured (and potentially
>> extensible) argument.
>>
>> I've quickly hacked the following:
>>
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index b6270a3b38e9..3e76214034c2 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -735,6 +735,20 @@ struct kvm_ppc_resize_hpt {
>> __u32 pad;
>> };
>>
>> +struct kvm_create_vm2 {
>> + __u64 version; /* Or maybe not */
>> + union {
>> + struct {
>> +#define KVM_ARM_SVE_CAPABLE (1 << 0)
>> +#define KVM_ARM_SELECT_IPA (1 << 1)
>> + __u64 capabilities;
>> + __u16 sve_vlen;
>> + __u8 ipa_size;
>> + } arm64;
>> + __u64 dummy[15];
>> + };
>> +};
>> +
>> #define KVMIO 0xAE
>>
>> /* machine type bits, to be used as argument to KVM_CREATE_VM */
>>
>> Other architectures could fill in their own bits if they need to.
>>
>> Thoughts?
>
> This kind of thing should work, but it may still get messy when we
> add additional fields.


Marc, Dave,

I like Dave's approach. Some comments below.

>
> If we want this to work cross-arch, would it make sense to go
> for a more generic approach, say
>
> struct kvm_create_vm_attr_any {
> __u32 type;
> };
>
> #define KVM_CREATE_VM_ATTR_ARCH_CAPABILITIES 1
> struct kvm_create_vm_attr_arch_capabilities {
> __u32 type;
> __u16 size; /* support future expansion of capabilities[] */
> __u16 reserved;
> __u64 capabilities[1];
> };

We also need to advertise which attributes are supported by the host,
so that the user can tune the available ones. That would make a bit mask
like the above trickier, unless we return the supported values back
in the argument ptr for the "probe" call. And this scheme in general
can be useful for passing back a non-boolean result specific to the
attribute, without having a per-attribute ioctl. (e.g, maximum limit
for IPA).
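
Something along these lines, where the kernel writes the supported value
back into the same attribute (KVM_PROBE_VM_ATTR is a hypothetical ioctl,
named here just to show the shape):

	struct kvm_create_vm_attr_arm64_physaddr_size attr = {
		.type = KVM_CREATE_VM_ATTR_ARM64_PHYSADDR_SIZE,
		.physaddr_bits = 0,	/* 0: query the maximum */
	};

	if (ioctl(kvm_fd, KVM_PROBE_VM_ATTR, &attr) == 0)
		printf("max IPA: %u bits\n", attr.physaddr_bits);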

>
> #define KVM_CREATE_VM_ATTR_ARM64_PHYSADDR_SIZE 2
> struct kvm_create_vm_attr_arm64_physaddr_size {
> __u32 type;
> __u32 physaddr_bits;
> };
>
> /* ... */
>
> union kvm_create_vm_attr {
> struct kvm_create_vm_attr_any any;
> struct kvm_create_vm_attr_arch_capabilities arch_caps;
> struct kvm_create_vm_attr_arm64_physaddr_size arm64_physaddr_size;
> /* ... */
> };

nit: Could we simply do s/kvm_create_vm_attr/kvm_vm_attr/ everywhere ?
While I agree that kvm_create_vm_attr makes it implicit that the attributes
are valid only for the "create" ioctl, the lack of an ioctl to set the VM attribute
should be sufficient to indicate the same.

>
> struct kvm_create_vm2 {
> __u32 version; /* harmless, even if not useful */
> __u16 nr_attrs; /* or could just terminate attrs with a
> NULL entry */
> union kvm_create_vm_attr __user *__user *attrs;
> };
>
>
> This is quite flexible, but obviously a bit heavy.
>
> However, if we're adding a new interface due to lack of extensibility,
> it may be worth going for something that's freely extensible.

True. I could hack something up along the lines above and send it here.

>
>
> Userspace might call this as
>
> struct kvm_create_vm_attr_arch_capabilities vm_arch_caps = {
> .type = KVM_CREATE_VM_ATTR_ARCH_CAPABILITIES,
> .size = 64,
> .capabilities[0] = KVM_CREATE_VM_ARM64_VCPU_NEEDS_SET_SVE_VLS,
> };
>
> struct kvm_create_vm_attr_arm64_physaddr_size vm_arm64_physaddr_size = {
> .type = KVM_CREATE_VM_ATTR_ARM64_PHYSADDR_SIZE,
> .physaddr_bits = 52,
> };
>
> union kvm_create_vm_attr *vmattrs[] = {
> &vm_arch_caps,
> &vm_arm64_physaddr_size,
> NULL, /* maybe */
> };
>
> struct kvm_create_vm2 vm;
>
> vm.version = 0;
> vm.nr_attrs = 2; /* maybe */
> vm.attrs = vmattrs;
>
> ioctl(..., KVM_CREATE_VM2, &vm);

Thanks
Suzuki

2018-07-10 17:04:38

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH v3 15/20] kvm: arm/arm64: Allow tuning the physical address size for VM

On Tue, Jul 10, 2018 at 05:38:39PM +0100, Suzuki K Poulose wrote:
> On 09/07/18 14:37, Dave Martin wrote:
> >On Mon, Jul 09, 2018 at 01:29:42PM +0100, Marc Zyngier wrote:
> >>On 09/07/18 12:23, Dave Martin wrote:

[...]

> >>>Wedging arguments into a few bits in the type argument feels awkward,
> >>>and may be regretted later if we run out of bits, or something can't be
> >>>represented in the chosen encoding.
> >>
> >>I think that's a pretty convincing argument for a "better" CREATE_VM,
> >>one that would have a clearly defined, structured (and potentially
> >>extensible) argument.
> >>
> >>I've quickly hacked the following:
> >>
> >>diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>index b6270a3b38e9..3e76214034c2 100644
> >>--- a/include/uapi/linux/kvm.h
> >>+++ b/include/uapi/linux/kvm.h
> >>@@ -735,6 +735,20 @@ struct kvm_ppc_resize_hpt {
> >> __u32 pad;
> >> };
> >>
> >>+struct kvm_create_vm2 {
> >>+ __u64 version; /* Or maybe not */
> >>+ union {
> >>+ struct {
> >>+#define KVM_ARM_SVE_CAPABLE (1 << 0)
> >>+#define KVM_ARM_SELECT_IPA (1 << 1)
> >>+ __u64 capabilities;
> >>+ __u16 sve_vlen;
> >>+ __u8 ipa_size;
> >>+ } arm64;
> >>+ __u64 dummy[15];
> >>+ };
> >>+};
> >>+
> >> #define KVMIO 0xAE
> >>
> >> /* machine type bits, to be used as argument to KVM_CREATE_VM */
> >>
> >>Other architectures could fill in their own bits if they need to.
> >>
> >>Thoughts?
> >
> >This kind of thing should work, but it may still get messy when we
> >add additional fields.
>
>
> Marc, Dave,
>
> I like Dave's approach. Some comments below.
>
> >
> >If we want this to work cross-arch, would it make sense to go
> >for a more generic approach, say
> >
> >struct kvm_create_vm_attr_any {
> > __u32 type;
> >};
> >
> >#define KVM_CREATE_VM_ATTR_ARCH_CAPABILITIES 1
> >struct kvm_create_vm_attr_arch_capabilities {
> > __u32 type;
> > __u16 size; /* support future expansion of capabilities[] */
> > __u16 reserved;
> > __u64 capabilities[1];
> >};
>
> We also need to advertise which attributes are supported by the host,
> so that the user can tune the available ones. That would make a bit mask
> like the above trickier, unless we return the supported values back
> in the argument ptr for the "probe" call. And this scheme in general
> can be useful for passing back a non-boolean result specific to the
> attribute, without having a per-attribute ioctl. (e.g, maximum limit
> for IPA).

Maybe, but this could quickly become bloated. (My approach already
feels a bit bloated...)

I'm not sure that arbitrarily complex negotiation will really be
needed, but userspace might want to change its mind if setting a
particular property fails.

An alternative might be to have a bunch of per-VM ioctls to configure
different things, like x86 has. There's at least precedent for that.
For arm, we currently only have a few. That allows for easy extension,
at the cost of adding ioctls.

There may be some ioctls we can reuse, like KVM_ENABLE_CAP for per-
vm capability flags.
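
For reference, the existing KVM_ENABLE_CAP argument already carries
per-capability arguments (as defined in uapi/linux/kvm.h):

	struct kvm_enable_cap {
		/* in */
		__u32 cap;
		__u32 flags;
		__u64 args[4];
		__u8  pad[64];
	};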


[...]

> >union kvm_create_vm_attr {
> > struct kvm_create_vm_attr_any any;
> > struct kvm_create_vm_attr_arch_capabilities arch_caps;
> > struct kvm_create_vm_attr_arm64_physaddr_size arm64_physaddr_size;
> > /* ... */
> >};
>
> nit: Could we simply do s/kvm_create_vm_attr/kvm_vm_attr/ everywhere ?
> While I agree that the kvm_create_vm_attr makes it implicit that the attributes
> are valid only "create" ioctl, the lack of an ioctl to set the VM attribute
> should be sufficient to indicate the same.

I just randomly came up with some names. The precise naming scheme
isn't that important, so long as it is unlikely to result in name
collisions and so long as it's reasonably clear (or compiler-checkable,
or preferably both) which things can be used where.

I wouldn't have a problem with something a bit terser.

>
> >
> >struct kvm_create_vm2 {
> > __u32 version; /* harmless, even if not useful */
> > __u16 nr_attrs; /* or could just terminate attrs with a
> > NULL entry */
> > union kvm_create_vm_attr __user *__user *attrs;
> >};
> >
> >
> >This is quite flexible, but obviously a bit heavy.
> >
> >However, if we're adding a new interface due to lack of extensibility,
> >it may be worth going for something that's freely extensible.
>
> True. I could hack something up along the lines above and send it here.

Sure, but best to keep it fairly rough for now.

Cheers
---Dave

2018-07-11 09:07:42

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH v3 15/20] kvm: arm/arm64: Allow tuning the physical address size for VM

On 10/07/18 18:03, Dave Martin wrote:
> On Tue, Jul 10, 2018 at 05:38:39PM +0100, Suzuki K Poulose wrote:
>> On 09/07/18 14:37, Dave Martin wrote:
>>> On Mon, Jul 09, 2018 at 01:29:42PM +0100, Marc Zyngier wrote:
>>>> On 09/07/18 12:23, Dave Martin wrote:
>
> [...]
>
>>>>> Wedging arguments into a few bits in the type argument feels awkward,
>>>>> and may be regretted later if we run out of bits, or something can't be
>>>>> represented in the chosen encoding.
>>>>
>>>> I think that's a pretty convincing argument for a "better" CREATE_VM,
>>>> one that would have a clearly defined, structured (and potentially
>>>> extensible) argument.
>>>>
>>>> I've quickly hacked the following:
>>>>
>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>> index b6270a3b38e9..3e76214034c2 100644
>>>> --- a/include/uapi/linux/kvm.h
>>>> +++ b/include/uapi/linux/kvm.h
>>>> @@ -735,6 +735,20 @@ struct kvm_ppc_resize_hpt {
>>>> __u32 pad;
>>>> };
>>>>
>>>> +struct kvm_create_vm2 {
>>>> + __u64 version; /* Or maybe not */
>>>> + union {
>>>> + struct {
>>>> +#define KVM_ARM_SVE_CAPABLE (1 << 0)
>>>> +#define KVM_ARM_SELECT_IPA (1 << 1)
>>>> + __u64 capabilities;
>>>> + __u16 sve_vlen;
>>>> + __u8 ipa_size;
>>>> + } arm64;
>>>> + __u64 dummy[15];
>>>> + };
>>>> +};
>>>> +
>>>> #define KVMIO 0xAE
>>>>
>>>> /* machine type bits, to be used as argument to KVM_CREATE_VM */
>>>>
>>>> Other architectures could fill in their own bits if they need to.
>>>>
>>>> Thoughts?
>>>
>>> This kind of thing should work, but it may still get messy when we
>>> add additional fields.
>>
>>
>> Marc, Dave,
>>
>> I like Dave's approach. Some comments below.
>>
>>>
>>> If we want this to work cross-arch, would it make sense to go
>>> for a more generic approach, say
>>>
>>> struct kvm_create_vm_attr_any {
>>> __u32 type;
>>> };
>>>
>>> #define KVM_CREATE_VM_ATTR_ARCH_CAPABILITIES 1
>>> struct kvm_create_vm_attr_arch_capabilities {
>>> __u32 type;
>>> __u16 size; /* support future expansion of capabilities[] */
>>> __u16 reserved;
>>> __u64 capabilities[1];
>>> };
>>
>> We also need to advertise which attributes are supported by the host,
>> so that the user can tune the available ones. That would make a bit mask
>> like the above trickier, unless we return the supported values back
>> in the argument ptr for the "probe" call. And this scheme in general
>> can be useful for passing back a non-boolean result specific to the
>> attribute, without having a per-attribute ioctl. (e.g, maximum limit
>> for IPA).
>
> Maybe, but this could quickly become bloated. (My approach already
> feels a bit bloated...)
>
> I'm not sure that arbitrarily complex negotiation will really be
> needed, but userspace might want to change its mind if setting a
> particular propertiy fails.
>
> An alternative might be to have a bunch of per-VM ioctls to configure
> different things, like x86 has. There's at least precedent for that.
> For arm, we currently only have a few. That allows for easy extension,
> at the cost of adding ioctls.

As you know, one of the major problems with per-VM ioctls is the ordering
of the different operations, and the tracking needed to make sure that
userspace follows the expected order. E.g, the first approach for the IPA
series was based on this, and it made things complex enough to drop it.

>
> There may be some ioctls we can reuse, like KVM_ENABLE_CAP for per-
> vm capability flags.

Maybe we could switch to KVM_VM_CAPS and pass a list of capabilities
to be enabled at creation time? kvm_enable_cap can pass in additional
arguments for each cap. That way we don't have to rely on a new set of
attributes, and probing becomes straightforward.
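
Roughly, the argument could reuse the existing kvm_enable_cap layout
(this is only a sketch of the shape, nothing more):

	struct kvm_create_vm_caps {
		__u32 nr_caps;
		__u32 reserved;
		struct kvm_enable_cap caps[];	/* caps to enable at creation */
	};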

Suzuki

2018-07-11 10:40:03

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH v3 15/20] kvm: arm/arm64: Allow tuning the physical address size for VM

On Wed, Jul 11, 2018 at 10:05:50AM +0100, Suzuki K Poulose wrote:
> On 10/07/18 18:03, Dave Martin wrote:
> >On Tue, Jul 10, 2018 at 05:38:39PM +0100, Suzuki K Poulose wrote:
> >>On 09/07/18 14:37, Dave Martin wrote:
> >>>On Mon, Jul 09, 2018 at 01:29:42PM +0100, Marc Zyngier wrote:
> >>>>On 09/07/18 12:23, Dave Martin wrote:
> >
> >[...]
> >
> >>>>>Wedging arguments into a few bits in the type argument feels awkward,
> >>>>>and may be regretted later if we run out of bits, or something can't be
> >>>>>represented in the chosen encoding.
> >>>>
> >>>>I think that's a pretty convincing argument for a "better" CREATE_VM,
> >>>>one that would have a clearly defined, structured (and potentially
> >>>>extensible) argument.
> >>>>
> >>>>I've quickly hacked the following:
> >>>>
> >>>>diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >>>>index b6270a3b38e9..3e76214034c2 100644
> >>>>--- a/include/uapi/linux/kvm.h
> >>>>+++ b/include/uapi/linux/kvm.h
> >>>>@@ -735,6 +735,20 @@ struct kvm_ppc_resize_hpt {
> >>>> __u32 pad;
> >>>> };
> >>>>
> >>>>+struct kvm_create_vm2 {
> >>>>+ __u64 version; /* Or maybe not */
> >>>>+ union {
> >>>>+ struct {
> >>>>+#define KVM_ARM_SVE_CAPABLE (1 << 0)
> >>>>+#define KVM_ARM_SELECT_IPA (1 << 1)
> >>>>+ __u64 capabilities;
> >>>>+ __u16 sve_vlen;
> >>>>+ __u8 ipa_size;
> >>>>+ } arm64;
> >>>>+ __u64 dummy[15];
> >>>>+ };
> >>>>+};
> >>>>+
> >>>> #define KVMIO 0xAE
> >>>>
> >>>> /* machine type bits, to be used as argument to KVM_CREATE_VM */
> >>>>
> >>>>Other architectures could fill in their own bits if they need to.
> >>>>
> >>>>Thoughts?
> >>>
> >>>This kind of thing should work, but it may still get messy when we
> >>>add additional fields.
> >>
> >>
> >>Marc, Dave,
> >>
> >>I like Dave's approach. Some comments below.
> >>
> >>>
> >>>If we want this to work cross-arch, would it make sense to go
> >>>for a more generic approach, say
> >>>
> >>>struct kvm_create_vm_attr_any {
> >>> __u32 type;
> >>>};
> >>>
> >>>#define KVM_CREATE_VM_ATTR_ARCH_CAPABILITIES 1
> >>>struct kvm_create_vm_attr_arch_capabilities {
> >>> __u32 type;
> >>> __u16 size; /* support future expansion of capabilities[] */
> >>> __u16 reserved;
> >>> __u64 capabilities[1];
> >>>};
> >>
> >>We also need to advertise which attributes are supported by the host,
> >>so that the user can tune the available ones. That would make a bitmask
> >>like the above trickier, unless we return the supported values back
> >>in the argument pointer for the "probe" call. This scheme is useful in
> >>general for passing back a non-boolean result specific to the attribute,
> >>without needing a per-attribute ioctl (e.g., the maximum limit for the
> >>IPA size).
> >
> >Maybe, but this could quickly become bloated. (My approach already
> >feels a bit bloated...)
> >
> >I'm not sure that arbitrarily complex negotiation will really be
> >needed, but userspace might want to change its mind if setting a
> >particular property fails.
> >
> >An alternative might be to have a bunch of per-VM ioctls to configure
> >different things, like x86 has. There's at least precedent for that.
> >For arm, we currently only have a few. That allows for easy extension,
> >at the cost of adding ioctls.
>
> As you know, one of the major problems with per-VM ioctls is the
> ordering of the different operations, and the tracking needed to make
> sure that userspace follows the expected order. E.g., the first
> approach for the IPA series was based on this, and it made things
> complex enough that we dropped it.

I'm aware of that, but if we are adding a new KVM_CREATE_VM, we could
perhaps give it different semantics: i.e., it returns a half-created VM
that only accepts configuration ioctls, plus a "finish creation" ioctl
that finalises everything before you're allowed to create devices,
vcpus etc.
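
Roughly, the flow might look like this (KVM_SET_VM_ATTR and
KVM_FINALIZE_VM are invented for the sake of illustration;
KVM_CREATE_VM2 is the proposal above):

	int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM2, &create);

	/* Only configuration ioctls are accepted at this point. */
	ioctl(vm_fd, KVM_SET_VM_ATTR, &ipa_attr);	/* hypothetical */
	ioctl(vm_fd, KVM_SET_VM_ATTR, &sve_attr);	/* hypothetical */

	/* Freeze the configuration; the usual object-creation ioctls
	 * only become available from here on. */
	ioctl(vm_fd, KVM_FINALIZE_VM, 0);		/* hypothetical */

	ioctl(vm_fd, KVM_CREATE_VCPU, 0);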

This is the sort of thing I was moving towards for SVE (but for
vcpus there).

I'm not saying we should drop the existing KVM_CREATE_VM2 ideas,
but that we should take a step back if it starts to accrue complexity.

> >
> >There may be some ioctls we can reuse, like KVM_ENABLE_CAP for
> >per-VM capability flags.
>
> May be we could switch to KVM_VM_CAPS and pass a list of capabilities
> to be enabled at creation time ? The kvm_enable_cap can pass in additional
> arguments for each cap. That way we don't have to rely on a new set of
> attributes and probing becomes straight forward.

That's a possibility. I guess we'd need to understand how exactly x86
uses this.
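
For reference, the existing interface takes a struct kvm_enable_cap
(that structure is current UAPI; KVM_CAP_FOO below is a placeholder
for whichever capability supports per-VM enabling on a given
architecture):

	struct kvm_enable_cap cap = {
		.cap  = KVM_CAP_FOO,	/* placeholder cap number */
		.args = { 0 },		/* up to four cap-specific args */
	};

	if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) < 0)
		/* cap unsupported here, or arguments rejected */;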

Cheers
---Dave