On arm64 we have a static limit of 40 bits of physical address space
for a VM with KVM. This series lifts that limitation and allows the
VM to configure its physical address space up to 52 bits on systems
where it is supported. We retain the default and minimum size of 40 bits
to avoid breaking backward compatibility.
The interface provided is an ioctl on the VM fd. The limit can only be
increased from what is already configured, to prevent breaking the
devices which may have already been configured with a particular
guest PA. The request can only be issued before something is actually
mapped into the stage2 table (e.g., a memory region or a device).
This also implies that we now have per-VM configuration of the stage2
control registers (VTCR_EL2 bits).
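As a rough illustration of the intended flow from userspace (a sketch only;
the capability and ioctl numbers below are the provisional values used by the
kvmtool patches included later in this series and may change):

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  #ifndef KVM_CAP_ARM_CONFIG_PHYS_SHIFT
  #define KVM_CAP_ARM_CONFIG_PHYS_SHIFT  151
  #endif
  #ifndef KVM_ARM_SET_PHYS_SIZE
  #define KVM_ARM_SET_PHYS_SIZE          _IOW(KVMIO, 0xb2, __u32)
  #endif

  /* Raise the IPA limit; must be issued before anything is mapped in stage2 */
  static int set_ipa_limit(int kvm_fd, int vm_fd, uint32_t phys_shift)
  {
          if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_ARM_CONFIG_PHYS_SHIFT) <= 0)
                  return -1;      /* not supported, stick with the 40bit default */
          return ioctl(vm_fd, KVM_ARM_SET_PHYS_SIZE, &phys_shift);
  }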
The arm64 page table level helpers are defined based on the page
table levels used by the host VA space. So the accessors may not work
if the guest uses more levels in stage2 than the host uses in stage1.
In order to provide an independent stage2 page table, we refactor the
arm64 page table helpers to give us raw accessors for each level,
which should only be used when that level is present. The VM-specific
stage2 page table helpers are then built on top of these raw accessors,
based on each VM's configuration.
The series also adds:
 - Support for handling 52bit IPA for the vgic ITS.
 - Cleanup in virtio to handle errors when the PFN used by the
   virtio transport doesn't fit in 32 bits.
Tested with:
 - Modified kvmtool (patches included in the series for reference/testing),
   which can only be used:
   * with virtio-pci up to 44bit PA (due to the 4K page size for the
     virtio-pci legacy transport implemented by kvmtool)
   * up to 48bit PA with virtio-mmio, due to the 32bit PFN limitation.
 - Hacked Qemu (boot loader support for highmem, phys-shift support)
   * with virtio-pci, GIC-v3 ITS & MSI, up to 52bit PA on the Foundation model.
The series applies on the arm64 for-next/core tree with the 52bit PA support
patches. The virtio_mmio cleanup fix [1] is needed on top of the arm64
tree to remove the warnings from virtio.
[1] https://marc.info/?l=linux-virtualization&m=151308636322117&w=2
Kristina Martsenko (1):
vgic: its: Add support for 52bit guest physical address
Suzuki K Poulose (15):
virtio: Validate queue pfn for 32bit transports
irqchip: gicv3-its: Add helpers for handling 52bit address
arm64: Make page table helpers reusable
arm64: Refactor pud_huge for reusability
arm64: Helper for parange to PASize
kvm: arm/arm64: Fix stage2_flush_memslot for 4 level page table
kvm: arm/arm64: Remove spurious WARN_ON
kvm: arm/arm64: Clean up stage2 pgd life time
kvm: arm/arm64: Delay stage2 page table allocation
kvm: arm/arm64: Prepare for VM specific stage2 translations
kvm: arm64: Make stage2 page table layout dynamic
kvm: arm64: Dynamic configuration of VTCR and VTTBR mask
kvm: arm64: Configure VTCR per VM
kvm: arm64: Switch to per VM IPA
kvm: arm64: Allow configuring physical address space size
Documentation/virtual/kvm/api.txt | 27 +++
arch/arm/include/asm/kvm_arm.h | 2 -
arch/arm/include/asm/kvm_host.h | 7 +
arch/arm/include/asm/kvm_mmu.h | 13 +-
arch/arm/include/asm/stage2_pgtable.h | 46 +++---
arch/arm64/include/asm/cpufeature.h | 16 ++
arch/arm64/include/asm/kvm_arm.h | 112 +++++++++++--
arch/arm64/include/asm/kvm_asm.h | 2 +-
arch/arm64/include/asm/kvm_host.h | 21 ++-
arch/arm64/include/asm/kvm_mmu.h | 83 ++++++++--
arch/arm64/include/asm/pgalloc.h | 32 +++-
arch/arm64/include/asm/pgtable.h | 61 ++++---
arch/arm64/include/asm/stage2_pgtable-nopmd.h | 42 -----
arch/arm64/include/asm/stage2_pgtable-nopud.h | 39 -----
arch/arm64/include/asm/stage2_pgtable.h | 211 ++++++++++++++++--------
arch/arm64/kvm/hyp/s2-setup.c | 34 +---
arch/arm64/kvm/hyp/switch.c | 8 +
arch/arm64/kvm/reset.c | 28 ++++
arch/arm64/mm/hugetlbpage.c | 2 +-
drivers/irqchip/irq-gic-v3-its.c | 2 +-
drivers/virtio/virtio_mmio.c | 19 ++-
drivers/virtio/virtio_pci_legacy.c | 11 +-
include/linux/irqchip/arm-gic-v3.h | 32 +++-
include/uapi/linux/kvm.h | 4 +
virt/kvm/arm/arm.c | 25 ++-
virt/kvm/arm/mmu.c | 228 +++++++++++++++-----------
virt/kvm/arm/vgic/vgic-its.c | 36 ++--
virt/kvm/arm/vgic/vgic-kvm-device.c | 2 +-
virt/kvm/arm/vgic/vgic-mmio-v3.c | 1 -
29 files changed, 738 insertions(+), 408 deletions(-)
delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopmd.h
delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopud.h
--
2.13.6
virtio-mmio with virtio-v1 and legacy virtio-pci use a 32bit PFN
for the queue. If the queue PFN is too large to fit in 32 bits, which
we could hit on arm64 systems with 52bit physical addresses (even with
a 64K page size), we silently lose the link to the other side of the
queue: for example, a vring placed at 2^51 with 64K pages has a PFN of
2^35, which cannot be represented in the 32bit register.
Add a check to validate the PFN, rather than silently breaking
the devices.
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
drivers/virtio/virtio_mmio.c | 19 ++++++++++++++++---
drivers/virtio/virtio_pci_legacy.c | 11 +++++++++--
2 files changed, 25 insertions(+), 5 deletions(-)
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index a9192fe4f345..47109baf37f7 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -358,6 +358,7 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
struct virtqueue *vq;
unsigned long flags;
unsigned int num;
+ u64 addr;
int err;
if (!name)
@@ -394,16 +395,26 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
goto error_new_virtqueue;
}
+ addr = virtqueue_get_desc_addr(vq);
+ /*
+ * virtio-mmio v1 uses a 32bit QUEUE PFN. If we have something that
+ * doesn't fit in 32bit, fail the setup rather than pretending to
+ * be successful.
+ */
+ if (vm_dev->version == 1 && (addr >> (PAGE_SHIFT + 32))) {
+ dev_err(&vdev->dev, "virtio-mmio: queue address too large\n");
+ err = -ENOMEM;
+ goto error_bad_pfn;
+ }
+
/* Activate the queue */
writel(virtqueue_get_vring_size(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_NUM);
if (vm_dev->version == 1) {
writel(PAGE_SIZE, vm_dev->base + VIRTIO_MMIO_QUEUE_ALIGN);
- writel(virtqueue_get_desc_addr(vq) >> PAGE_SHIFT,
+ writel(addr >> PAGE_SHIFT,
vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
} else {
- u64 addr;
- addr = virtqueue_get_desc_addr(vq);
writel((u32)addr, vm_dev->base + VIRTIO_MMIO_QUEUE_DESC_LOW);
writel((u32)(addr >> 32),
vm_dev->base + VIRTIO_MMIO_QUEUE_DESC_HIGH);
@@ -430,6 +441,8 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
return vq;
+error_bad_pfn:
+ vring_del_virtqueue(vq);
error_new_virtqueue:
if (vm_dev->version == 1) {
writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c
index 2780886e8ba3..099d2cfb47b3 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -122,6 +122,7 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
struct virtqueue *vq;
u16 num;
int err;
+ u64 q_pfn;
/* Select the queue we're interested in */
iowrite16(index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
@@ -141,9 +142,15 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
if (!vq)
return ERR_PTR(-ENOMEM);
+ q_pfn = virtqueue_get_desc_addr(vq) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT;
+ if (q_pfn >> 32) {
+ dev_err(&vp_dev->pci_dev->dev, "virtio-pci queue PFN too large\n");
+ err = -ENOMEM;
+ goto out_deactivate;
+ }
+
/* activate the queue */
- iowrite32(virtqueue_get_desc_addr(vq) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
- vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
+ iowrite32((u32)q_pfn, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
vq->priv = (void __force *)vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY;
--
2.13.6
This patch rearranges the page table level helpers so that they can
be reused for a page table with a different number of levels
(e.g., the stage2 page table for a VM) than the kernel page tables.
As such, there is no functional change with this patch.
The page table helpers are defined to do the right thing for the
fixed page table levels set for the kernel. This patch refactors
the code such that we have helpers for each level, which should be
used when the caller knows that the level exists in the page table
being dealt with. Since the kernel defines helpers
p.d_action and __p.d_action, for consistency, we name the raw
page table action helpers __raw_p.d_action.
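As an illustration of how these are intended to be consumed (a sketch only,
not part of this patch; kvm_stage2_has_pud() is a hypothetical stand-in for
the per-VM level check introduced later in the series):

  /* Illustrative only: a stage2 helper built on the raw accessor */
  static inline int stage2_pud_none(struct kvm *kvm, pud_t pud)
  {
          if (kvm_stage2_has_pud(kvm))    /* hypothetical per-VM level check */
                  return __raw_pud_none(pud);
          /* Level folded into the pgd for this VM: entries are never "none" */
          return 0;
  }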
Cc: Catalin Marinas <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Steve Capper <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm64/include/asm/pgalloc.h | 32 +++++++++++++++++-----
arch/arm64/include/asm/pgtable.h | 58 +++++++++++++++++++++++++---------------
2 files changed, 63 insertions(+), 27 deletions(-)
diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
index e9d9f1b006ef..e555b04045d0 100644
--- a/arch/arm64/include/asm/pgalloc.h
+++ b/arch/arm64/include/asm/pgalloc.h
@@ -29,6 +29,28 @@
#define PGALLOC_GFP (GFP_KERNEL | __GFP_ZERO)
#define PGD_SIZE (PTRS_PER_PGD * sizeof(pgd_t))
+static inline void __raw_pmd_free(pmd_t *pmd)
+{
+ BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
+ free_page((unsigned long)pmd);
+}
+
+static inline void __raw_pud_free(pud_t *pud)
+{
+ BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
+ free_page((unsigned long)pud);
+}
+
+static inline void __raw_pgd_populate(pgd_t *pgdp, phys_addr_t pud, pgdval_t prot)
+{
+ __raw_set_pgd(pgdp, __pgd(__phys_to_pgd_val(pud) | prot));
+}
+
+static inline void __raw_pud_populate(pud_t *pud, phys_addr_t pmd, pudval_t prot)
+{
+ __raw_set_pud(pud, __pud(__phys_to_pud_val(pmd) | prot));
+}
+
#if CONFIG_PGTABLE_LEVELS > 2
static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
@@ -38,13 +60,12 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
{
- BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
- free_page((unsigned long)pmd);
+ __raw_pmd_free(pmd);
}
static inline void __pud_populate(pud_t *pud, phys_addr_t pmd, pudval_t prot)
{
- set_pud(pud, __pud(__phys_to_pud_val(pmd) | prot));
+ __raw_pud_populate(pud, pmd, prot);
}
static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
@@ -67,13 +88,12 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
static inline void pud_free(struct mm_struct *mm, pud_t *pud)
{
- BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
- free_page((unsigned long)pud);
+ __raw_pud_free(pud);
}
static inline void __pgd_populate(pgd_t *pgdp, phys_addr_t pud, pgdval_t prot)
{
- set_pgd(pgdp, __pgd(__phys_to_pgd_val(pud) | prot));
+ __raw_pgd_populate(pgdp, pud, prot);
}
static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index bfa237e892f1..a5a5203b603d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -464,31 +464,40 @@ static inline phys_addr_t pmd_page_paddr(pmd_t pmd)
*/
#define mk_pte(page,prot) pfn_pte(page_to_pfn(page),prot)
-#if CONFIG_PGTABLE_LEVELS > 2
-
-#define pmd_ERROR(pmd) __pmd_error(__FILE__, __LINE__, pmd_val(pmd))
-#define pud_none(pud) (!pud_val(pud))
-#define pud_bad(pud) (!(pud_val(pud) & PUD_TABLE_BIT))
-#define pud_present(pud) pte_present(pud_pte(pud))
+#define __raw_pud_none(pud) (!pud_val((pud)))
+#define __raw_pud_bad(pud) (!(pud_val((pud)) & PUD_TABLE_BIT))
+#define __raw_pud_present(pud) pte_present(pud_pte((pud)))
-static inline void set_pud(pud_t *pudp, pud_t pud)
+static inline void __raw_set_pud(pud_t *pudp, pud_t pud)
{
*pudp = pud;
dsb(ishst);
isb();
}
-static inline void pud_clear(pud_t *pudp)
+static inline void __raw_pud_clear(pud_t *pudp)
{
- set_pud(pudp, __pud(0));
+ __raw_set_pud(pudp, __pud(0));
}
-static inline phys_addr_t pud_page_paddr(pud_t pud)
+static inline phys_addr_t __raw_pud_page_paddr(pud_t pud)
{
return __pud_to_phys(pud);
}
+#if CONFIG_PGTABLE_LEVELS > 2
+
+#define pmd_ERROR(pmd) __pmd_error(__FILE__, __LINE__, pmd_val(pmd))
+
+#define pud_none(pud) __raw_pud_none((pud))
+#define pud_bad(pud) __raw_pud_bad((pud))
+#define pud_present(pud) __raw_pud_present((pud))
+
+#define set_pud(pudp, pud) __raw_set_pud((pudp), (pud))
+#define pud_clear(pudp) __raw_pud_clear((pudp))
+#define pud_page_paddr(pud) __raw_pud_page_paddr((pud))
+
/* Find an entry in the second-level page table. */
#define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
@@ -517,30 +526,37 @@ static inline phys_addr_t pud_page_paddr(pud_t pud)
#endif /* CONFIG_PGTABLE_LEVELS > 2 */
-#if CONFIG_PGTABLE_LEVELS > 3
-
-#define pud_ERROR(pud) __pud_error(__FILE__, __LINE__, pud_val(pud))
+#define __raw_pgd_none(pgd) (!pgd_val((pgd)))
+#define __raw_pgd_bad(pgd) (!(pgd_val((pgd)) & 2))
+#define __raw_pgd_present(pgd) (pgd_val((pgd)))
-#define pgd_none(pgd) (!pgd_val(pgd))
-#define pgd_bad(pgd) (!(pgd_val(pgd) & 2))
-#define pgd_present(pgd) (pgd_val(pgd))
-
-static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
+static inline void __raw_set_pgd(pgd_t *pgdp, pgd_t pgd)
{
*pgdp = pgd;
dsb(ishst);
}
-static inline void pgd_clear(pgd_t *pgdp)
+static inline void __raw_pgd_clear(pgd_t *pgdp)
{
- set_pgd(pgdp, __pgd(0));
+ __raw_set_pgd(pgdp, __pgd(0));
}
-static inline phys_addr_t pgd_page_paddr(pgd_t pgd)
+static inline phys_addr_t __raw_pgd_page_paddr(pgd_t pgd)
{
return __pgd_to_phys(pgd);
}
+#if CONFIG_PGTABLE_LEVELS > 3
+
+#define pud_ERROR(pud) __pud_error(__FILE__, __LINE__, pud_val(pud))
+
+#define pgd_none(pgd) __raw_pgd_none((pgd))
+#define pgd_bad(pgd) __raw_pgd_bad((pgd))
+#define pgd_present(pgd) __raw_pgd_present((pgd))
+#define set_pgd(pgdp, pgd) __raw_set_pgd((pgdp), (pgd))
+#define pgd_clear(pgdp) __raw_pgd_clear((pgdp))
+#define pgd_page_paddr(pgd) __raw_pgd_page_paddr((pgd))
+
/* Find an entry in the frst-level page table. */
#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
--
2.13.6
Make pud_huge reusable for stage2 tables, independent
of the stage1 levels.
Cc: Steve Capper <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Catalin Marinas <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 5 +++++
arch/arm64/mm/hugetlbpage.c | 2 +-
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index a5a5203b603d..a1c6e93a1a11 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -469,6 +469,11 @@ static inline phys_addr_t pmd_page_paddr(pmd_t pmd)
#define __raw_pud_bad(pud) (!(pud_val((pud)) & PUD_TABLE_BIT))
#define __raw_pud_present(pud) pte_present(pud_pte((pud)))
+static inline int __raw_pud_huge(pud_t pud)
+{
+ return pud_val(pud) && !(pud_val(pud) & PUD_TABLE_BIT);
+}
+
static inline void __raw_set_pud(pud_t *pudp, pud_t pud)
{
*pudp = pud;
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 6cb0fa92a651..a6bd5cc3d88b 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -35,7 +35,7 @@ int pmd_huge(pmd_t pmd)
int pud_huge(pud_t pud)
{
#ifndef __PAGETABLE_PMD_FOLDED
- return pud_val(pud) && !(pud_val(pud) & PUD_TABLE_BIT);
+ return __raw_pud_huge(pud);
#else
return 0;
#endif
--
2.13.6
So far we have only supported a 3 level page table with a fixed IPA of 40 bits.
Fix stage2_flush_memslot() to accommodate 4 level tables.
Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
virt/kvm/arm/mmu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 761787befd3b..e6548c85c495 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -375,7 +375,8 @@ static void stage2_flush_memslot(struct kvm *kvm,
pgd = kvm->arch.pgd + stage2_pgd_index(addr);
do {
next = stage2_pgd_addr_end(addr, end);
- stage2_flush_puds(kvm, pgd, addr, next);
+ if (!stage2_pgd_none(*pgd))
+ stage2_flush_puds(kvm, pgd, addr, next);
} while (pgd++, addr = next, addr != end);
}
--
2.13.6
On arm/arm64 we pre-allocate the entry level page tables when
a VM is created, and they are freed when either all the mm users are
gone or the VM is about to be destroyed, i.e., kvm_free_stage2_pgd()
is triggered via kvm_arch_flush_shadow_all(), which can be invoked
from two different paths:
1) do_exit() -> ... -> mmu_notifier->release() -> ... -> kvm_arch_flush_shadow_all()
OR
2) kvm_destroy_vm() -> mmu_notifier_unregister() -> kvm_arch_flush_shadow_all()
This has created a lot of race conditions in the past, as some of
the VCPUs could still be active when we free the stage2 via path (1).
On a closer look, all we need to do in kvm_arch_flush_shadow_all() is
to ensure that the stage2 mappings are cleared. This doesn't mean we
have to free the stage2 entry level page tables yet; that can
be delayed until the VM is destroyed. This avoids use-after-free
issues: since we don't free the page tables, anyone who
tries to access them will find them in the appropriate
state (mapped vs unmapped), as the page table modifications are
serialised via kvm->mmu_lock. This will later be used to delay
the allocation of the stage2 entry level page tables until we really
need them.
Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
virt/kvm/arm/arm.c | 1 +
virt/kvm/arm/mmu.c | 56 ++++++++++++++++++++++++++++--------------------------
2 files changed, 30 insertions(+), 27 deletions(-)
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index c8d49879307f..19b720ddedce 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -189,6 +189,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
}
}
atomic_set(&kvm->online_vcpus, 0);
+ kvm_free_stage2_pgd(kvm);
}
int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 78253fe00fc4..c94c61ac38b9 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -298,11 +298,10 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
pgd = kvm->arch.pgd + stage2_pgd_index(addr);
do {
/*
- * Make sure the page table is still active, as another thread
- * could have possibly freed the page table, while we released
- * the lock.
+ * The page table shouldn't be free'd as we still hold a reference
+ * to the KVM.
*/
- if (!READ_ONCE(kvm->arch.pgd))
+ if (WARN_ON(!READ_ONCE(kvm->arch.pgd)))
break;
next = stage2_pgd_addr_end(addr, end);
if (!stage2_pgd_none(*pgd))
@@ -837,30 +836,33 @@ void stage2_unmap_vm(struct kvm *kvm)
up_read(&current->mm->mmap_sem);
srcu_read_unlock(&kvm->srcu, idx);
}
-
-/**
- * kvm_free_stage2_pgd - free all stage-2 tables
- * @kvm: The KVM struct pointer for the VM.
- *
- * Walks the level-1 page table pointed to by kvm->arch.pgd and frees all
- * underlying level-2 and level-3 tables before freeing the actual level-1 table
- * and setting the struct pointer to NULL.
+/*
+ * kvm_flush_stage2_all: Unmap the entire stage2 mappings including
+ * device and regular RAM backing memory.
*/
-void kvm_free_stage2_pgd(struct kvm *kvm)
+static void kvm_flush_stage2_all(struct kvm *kvm)
{
- void *pgd = NULL;
-
spin_lock(&kvm->mmu_lock);
- if (kvm->arch.pgd) {
+ if (kvm->arch.pgd)
unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
- pgd = READ_ONCE(kvm->arch.pgd);
- kvm->arch.pgd = NULL;
- }
spin_unlock(&kvm->mmu_lock);
+}
- /* Free the HW pgd, one page at a time */
- if (pgd)
- free_pages_exact(pgd, S2_PGD_SIZE);
+/**
+ * kvm_free_stage2_pgd - Free the entry level page tables in stage-2.
+ * This is called when all reference to the KVM has gone away and we
+ * really don't need any protection in resetting the PGD. This also
+ * means that nobody should be touching stage2 at this point, as we
+ * have unmapped the entire stage2 already and all dynamic entities,
+ * (VCPUs and devices) are no longer active.
+ *
+ * @kvm: The KVM struct pointer for the VM.
+ */
+void kvm_free_stage2_pgd(struct kvm *kvm)
+{
+ kvm_flush_stage2_all(kvm);
+ free_pages_exact(kvm->arch.pgd, S2_PGD_SIZE);
+ kvm->arch.pgd = NULL;
}
static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
@@ -1189,12 +1191,12 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
* large. Otherwise, we may see kernel panics with
* CONFIG_DETECT_HUNG_TASK, CONFIG_LOCKUP_DETECTOR,
* CONFIG_LOCKDEP. Additionally, holding the lock too long
- * will also starve other vCPUs. We have to also make sure
- * that the page tables are not freed while we released
- * the lock.
+ * will also starve other vCPUs.
+ * The page tables shouldn't be free'd while we released the
+ * lock, since we hold a reference to the KVM.
*/
cond_resched_lock(&kvm->mmu_lock);
- if (!READ_ONCE(kvm->arch.pgd))
+ if (WARN_ON(!READ_ONCE(kvm->arch.pgd)))
break;
next = stage2_pgd_addr_end(addr, end);
if (stage2_pgd_present(*pgd))
@@ -1950,7 +1952,7 @@ void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots)
void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
- kvm_free_stage2_pgd(kvm);
+ kvm_flush_stage2_all(kvm);
}
void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
--
2.13.6
We allocate the entry level page tables for stage2 when the
VM is created. This doesn't give us the flexibility to configure
the physical address space size for a VM. In order to allow
the VM to choose the required size, we delay the allocation of the
stage2 entry level tables until we really try to map something.
This could be either when the VM creates a memory region or when
it tries to map device memory. So we add hooks in these
two places to make sure the tables are allocated. We use
kvm->slots_lock to serialize the allocation, since the arch
specific callbacks we hook into are invoked with that mutex held.
Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
virt/kvm/arm/arm.c | 18 ++++++----------
virt/kvm/arm/mmu.c | 61 +++++++++++++++++++++++++++++++++++++++++++++---------
2 files changed, 57 insertions(+), 22 deletions(-)
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 19b720ddedce..d06f00566664 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -127,13 +127,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
for_each_possible_cpu(cpu)
*per_cpu_ptr(kvm->arch.last_vcpu_ran, cpu) = -1;
- ret = kvm_alloc_stage2_pgd(kvm);
- if (ret)
- goto out_fail_alloc;
-
ret = create_hyp_mappings(kvm, kvm + 1, PAGE_HYP);
- if (ret)
- goto out_free_stage2_pgd;
+ if (ret) {
+ free_percpu(kvm->arch.last_vcpu_ran);
+ kvm->arch.last_vcpu_ran = NULL;
+ return ret;
+ }
+
kvm_vgic_early_init(kvm);
@@ -145,12 +145,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
kvm_vgic_get_max_vcpus() : KVM_MAX_VCPUS;
return ret;
-out_free_stage2_pgd:
- kvm_free_stage2_pgd(kvm);
-out_fail_alloc:
- free_percpu(kvm->arch.last_vcpu_ran);
- kvm->arch.last_vcpu_ran = NULL;
- return ret;
}
bool kvm_arch_has_vcpu_debugfs(void)
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index c94c61ac38b9..257f2a8ccfc7 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1011,15 +1011,39 @@ static int stage2_pmdp_test_and_clear_young(pmd_t *pmd)
return stage2_ptep_test_and_clear_young((pte_t *)pmd);
}
-/**
- * kvm_phys_addr_ioremap - map a device range to guest IPA
- *
- * @kvm: The KVM pointer
- * @guest_ipa: The IPA at which to insert the mapping
- * @pa: The physical address of the device
- * @size: The size of the mapping
+/*
+ * Finalise the stage2 page table layout. Must be called with kvm->slots_lock
+ * held.
*/
-int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
+static int __kvm_init_stage2_table(struct kvm *kvm)
+{
+ /* Double check if somebody has already allocated it */
+ if (likely(kvm->arch.pgd))
+ return 0;
+ return kvm_alloc_stage2_pgd(kvm);
+}
+
+static int kvm_init_stage2_table(struct kvm *kvm)
+{
+ int rc;
+
+ /*
+ * Once allocated, the stage2 entry level tables are only
+ * freed when the KVM instance is destroyed. So, if we see
+ * something valid here, that guarantees that we have
+ * done the one time allocation and it is something valid
+ * and won't go away until the last reference to the KVM
+ * is gone.
+ */
+ if (likely(kvm->arch.pgd))
+ return 0;
+ mutex_lock(&kvm->slots_lock);
+ rc = __kvm_init_stage2_table(kvm);
+ mutex_unlock(&kvm->slots_lock);
+ return rc;
+}
+
+static int __kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
phys_addr_t pa, unsigned long size, bool writable)
{
phys_addr_t addr, end;
@@ -1055,6 +1079,23 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
return ret;
}
+/**
+ * kvm_phys_addr_ioremap - map a device range to guest IPA.
+ * Acquires kvm->slots_lock for making sure that the stage2 is initialized.
+ *
+ * @kvm: The KVM pointer
+ * @guest_ipa: The IPA at which to insert the mapping
+ * @pa: The physical address of the device
+ * @size: The size of the mapping
+ */
+int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
+ phys_addr_t pa, unsigned long size, bool writable)
+{
+ if (unlikely(kvm_init_stage2_table(kvm)))
+ return -ENOMEM;
+ return __kvm_phys_addr_ioremap(kvm, guest_ipa, pa, size, writable);
+}
+
static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
{
kvm_pfn_t pfn = *pfnp;
@@ -1912,7 +1953,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
goto out;
}
- ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
+ ret = __kvm_phys_addr_ioremap(kvm, gpa, pa,
vm_end - vm_start,
writable);
if (ret)
@@ -1943,7 +1984,7 @@ void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free,
int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
unsigned long npages)
{
- return 0;
+ return __kvm_init_stage2_table(kvm);
}
void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots)
--
2.13.6
VTCR_EL2 holds the following key stage2 translation table
parameters:
SL0  - Entry level in the page table lookup.
T0SZ - Denotes the size of the memory addressed by the table.
We have been using fixed values for SL0 depending on the
page size, as we had a fixed IPA size. But since we are about
to make it dynamic, we need to calculate SL0 at runtime,
per VM.
Also, VTTBR:BADDR holds the base address for the stage2
translation table, and the ARM ARM mandates that bits
BADDR[x-1:0] should be 0, where 'x' is defined by a
magic constant which depends on the page size, T0SZ
and the entry level of lookup (since the entry level page
tables can be concatenated at stage2, a given T0SZ could
possibly start at 2 different levels). We need a way to
calculate this magic value per VM, depending on the
IPA size. Luckily, there is a formula for finding
this "magic" number to compute 'x'. See the patch for more
details.
This patch adds helpers to figure out the VTCR SL0 and
the magic 'x' for a given stage2 configuration.
The other advantage of this change is that we can switch
the entry level for a given IPA size, depending on whether we
are able to get a contiguous block of memory for the entry
level page table. (e.g., with a 64KB page size and a 46bit IPA
starting at level 2, finding a 16 * 64KB contiguous block on a
loaded system could be tricky, so we could decide to
enter at level 1 instead, with a single page.)
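As a quick sanity check of the formulas (illustrative only, assuming 4K pages,
i.e. PAGE_SHIFT = 12, and the existing fixed configuration of a 40bit IPA with
3 levels, i.e. entry level 1):

  /*
   * SL0 = SL0_BASE(4K) - Entry_Level       = 2 - 1      = 1
   * x   = PA_SHIFT - (PAGE_SHIFT - 3) * n  = 40 - 9 * 3 = 13
   * which matches the old VTTBR_X_TGRAN_MAGIC - T0SZ = 37 - 24 = 13.
   */
  static inline void stage2_check_magic_numbers(void)
  {
          BUILD_BUG_ON(VTCR_EL2_SL0(3) != (1UL << VTCR_EL2_SL0_SHIFT));
          BUILD_BUG_ON(ARM64_VTTBR_X(40, 3) != 13);
  }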
Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm64/include/asm/kvm_arm.h | 96 +++++++++++++++++++++++++++++++++++++---
arch/arm64/include/asm/kvm_mmu.h | 20 ++++++++-
2 files changed, 110 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 715d395ef45b..eb90d349e55f 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -122,6 +122,8 @@
#define VTCR_EL2_VS_8BIT (0 << VTCR_EL2_VS_SHIFT)
#define VTCR_EL2_VS_16BIT (1 << VTCR_EL2_VS_SHIFT)
+#define VTCR_EL2_T0SZ(x) TCR_T0SZ((x))
+
/*
* We configure the Stage-2 page tables to always restrict the IPA space to be
* 40 bits wide (T0SZ = 24). Systems with a PARange smaller than 40 bits are
@@ -148,7 +150,8 @@
* 2 level page tables (SL = 1)
*/
#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
-#define VTTBR_X_TGRAN_MAGIC 38
+#define VTCR_EL2_TGRAN_SL0_BASE 3UL
+
#elif defined(CONFIG_ARM64_16K_PAGES)
/*
* Stage2 translation configuration:
@@ -156,7 +159,7 @@
* 2 level page tables (SL = 1)
*/
#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_16K | VTCR_EL2_SL0_LVL1)
-#define VTTBR_X_TGRAN_MAGIC 42
+#define VTCR_EL2_TGRAN_SL0_BASE 3UL
#else /* 4K */
/*
* Stage2 translation configuration:
@@ -164,13 +167,96 @@
* 3 level page tables (SL = 1)
*/
#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_4K | VTCR_EL2_SL0_LVL1)
-#define VTTBR_X_TGRAN_MAGIC 37
+#define VTCR_EL2_TGRAN_SL0_BASE 2UL
#endif
#define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN_FLAGS)
-#define VTTBR_X (VTTBR_X_TGRAN_MAGIC - VTCR_EL2_T0SZ_IPA)
+/*
+ * VTCR_EL2:SL0 indicates the entry level for Stage2 translation.
+ * Interestingly, it depends on the page size.
+ * See D.10.2.110, VTCR_EL2, in ARM DDI 0487B.b
+ *
+ * -----------------------------------------
+ * | Entry level | 4K | 16K/64K |
+ * ------------------------------------------
+ * | Level: 0 | 2 | - |
+ * ------------------------------------------
+ * | Level: 1 | 1 | 2 |
+ * ------------------------------------------
+ * | Level: 2 | 0 | 1 |
+ * ------------------------------------------
+ * | Level: 3 | - | 0 |
+ * ------------------------------------------
+ *
+ * That table roughly translates to :
+ *
+ * SL0(PAGE_SIZE, Entry_level) = SL0_BASE(PAGE_SIZE) - Entry_Level
+ *
+ * Where SL0_BASE(4K) = 2 and SL0_BASE(16K) = 3, SL0_BASE(64K) = 3, provided
+ * we take care of ruling out the unsupported cases and
+ * Entry_Level = 4 - Number_of_levels.
+ *
+ */
+#define VTCR_EL2_SL0(levels) \
+ ((VTCR_EL2_TGRAN_SL0_BASE - (4 - (levels))) << VTCR_EL2_SL0_SHIFT)
+/*
+ * ARM VMSAv8-64 defines an algorithm for finding the translation table
+ * descriptors in section D4.2.8 in ARM DDI 0487B.b.
+ *
+ * The algorithm defines the expectations on the BaseAddress (for the page
+ * table) bits resolved at each level based on the page size, entry level
+ * and T0SZ. The variable "x" in the algorithm also affects the VTTBR:BADDR
+ * for stage2 page table.
+ *
+ * The value of "x" is calculated as :
+ * x = Magic_N - T0SZ
+ *
+ * where Magic_N is an integer depending on the page size and the entry
+ * level of the page table as below:
+ *
+ * --------------------------------------------
+ * | Entry level | 4K 16K 64K |
+ * --------------------------------------------
+ * | Level: 0 (4 levels) | 28 | - | - |
+ * --------------------------------------------
+ * | Level: 1 (3 levels) | 37 | 31 | 25 |
+ * --------------------------------------------
+ * | Level: 2 (2 levels) | 46 | 42 | 38 |
+ * --------------------------------------------
+ * | Level: 3 (1 level) | - | 53 | 51 |
+ * --------------------------------------------
+ *
+ * We have a magic formula for the Magic_N below.
+ * Which can also be expressed as:
+ *
+ * Magic_N(PAGE_SIZE, Entry_Level) = 64 - ((PAGE_SHIFT - 3) * Number of levels)
+ *
+ * where number of levels = (4 - Entry_Level).
+ *
+ * So, given that T0SZ = (64 - PA_SHIFT), we can compute 'x' as follows:
+ *
+ * x = (64 - ((PAGE_SHIFT - 3) * Number_of_levels)) - (64 - PA_SHIFT)
+ * = PA_SHIFT - ((PAGE_SHIFT - 3) * Number of levels)
+ *
+ * Here is one way to explain the Magic Formula:
+ *
+ * x = log2(Size_of_Entry_Level_Table)
+ *
+ * Since, we can resolve (PAGE_SHIFT - 3) bits at each level, and another
+ * PAGE_SHIFT bits in the PTE, we have :
+ *
+ * Bits_Entry_level = PA_SHIFT - ((PAGE_SHIFT - 3) * (n - 1) + PAGE_SHIFT)
+ * = PA_SHIFT - (PAGE_SHIFT - 3) * n - 3
+ * where n = number of levels, and since each pointer is 8bytes, we have:
+ *
+ * x = Bits_Entry_Level + 3
+ * = PA_SHIFT - (PAGE_SHIFT - 3) * n
+ *
+ * The only constraint here is that, we have to find the number of page table
+ * levels for a given IPA size (which we do, see STAGE2_PGTABLE_LEVELS).
+ */
+#define ARM64_VTTBR_X(ipa, levels) ((ipa) - ((levels) * (PAGE_SHIFT - 3)))
-#define VTTBR_BADDR_MASK (((UL(1) << (PHYS_MASK_SHIFT - VTTBR_X)) - 1) << VTTBR_X)
#define VTTBR_VMID_SHIFT (UL(48))
#define VTTBR_VMID_MASK(size) (_AT(u64, (1 << size) - 1) << VTTBR_VMID_SHIFT)
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index df2ee97f4428..483185ed2ecd 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -139,7 +139,6 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
#define kvm_phys_shift(kvm) KVM_PHYS_SHIFT
#define kvm_phys_size(kvm) (_AC(1, ULL) << kvm_phys_shift(kvm))
#define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - _AC(1, ULL))
-#define kvm_vttbr_baddr_mask(kvm) VTTBR_BADDR_MASK
static inline bool kvm_page_empty(void *ptr)
{
@@ -328,5 +327,24 @@ static inline unsigned int kvm_get_vmid_bits(void)
#define kvm_phys_to_vttbr(addr) phys_to_ttbr(addr)
+/*
+ * Get the magic number 'x' for VTTBR:BADDR of this KVM instance.
+ * With v8.2 LVA extensions, 'x' (rather 'z') should be a minimum
+ * of 6 with 52bit IPS.
+ */
+static inline int kvm_vttbr_x(struct kvm *kvm)
+{
+ int x = ARM64_VTTBR_X(kvm_phys_shift(kvm), kvm_stage2_levels(kvm));
+
+ return (IS_ENABLED(CONFIG_ARM64_PA_BITS_52) && x < 6) ? 6 : x;
+}
+
+static inline u64 kvm_vttbr_baddr_mask(struct kvm *kvm)
+{
+ unsigned int x = kvm_vttbr_x(kvm);
+
+ return GENMASK_ULL(PHYS_MASK_SHIFT - 1, x);
+}
+
#endif /* __ASSEMBLY__ */
#endif /* __ARM64_KVM_MMU_H__ */
--
2.13.6
We set VTCR_EL2 very early during the stage2 init and never
touch it again. This was fine while we had a fixed IPA size. This
patch changes the behaviour to set the VTCR for a given VM,
depending on its stage2 table configuration. The common VTCR
configuration is still performed during the early init, but SL0
and T0SZ are programmed for each VM and are cleared once we
exit the VM.
Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm64/include/asm/kvm_arm.h | 16 ++++++----------
arch/arm64/include/asm/kvm_asm.h | 2 +-
arch/arm64/include/asm/kvm_host.h | 8 +++++---
arch/arm64/kvm/hyp/s2-setup.c | 16 +---------------
arch/arm64/kvm/hyp/switch.c | 9 +++++++++
5 files changed, 22 insertions(+), 29 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index eb90d349e55f..d5c40816f073 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -115,9 +115,7 @@
#define VTCR_EL2_IRGN0_WBWA TCR_IRGN0_WBWA
#define VTCR_EL2_SL0_SHIFT 6
#define VTCR_EL2_SL0_MASK (3 << VTCR_EL2_SL0_SHIFT)
-#define VTCR_EL2_SL0_LVL1 (1 << VTCR_EL2_SL0_SHIFT)
#define VTCR_EL2_T0SZ_MASK 0x3f
-#define VTCR_EL2_T0SZ_40B 24
#define VTCR_EL2_VS_SHIFT 19
#define VTCR_EL2_VS_8BIT (0 << VTCR_EL2_VS_SHIFT)
#define VTCR_EL2_VS_16BIT (1 << VTCR_EL2_VS_SHIFT)
@@ -139,38 +137,36 @@
* D4-23 and D4-25 in ARM DDI 0487A.b.
*/
-#define VTCR_EL2_T0SZ_IPA VTCR_EL2_T0SZ_40B
#define VTCR_EL2_COMMON_BITS (VTCR_EL2_SH0_INNER | VTCR_EL2_ORGN0_WBWA | \
VTCR_EL2_IRGN0_WBWA | VTCR_EL2_RES1)
+#define VTCR_EL2_PRIVATE_MASK (VTCR_EL2_SL0_MASK | VTCR_EL2_T0SZ_MASK)
#ifdef CONFIG_ARM64_64K_PAGES
/*
* Stage2 translation configuration:
* 64kB pages (TG0 = 1)
- * 2 level page tables (SL = 1)
*/
-#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
+#define VTCR_EL2_TGRAN VTCR_EL2_TG0_64K
#define VTCR_EL2_TGRAN_SL0_BASE 3UL
#elif defined(CONFIG_ARM64_16K_PAGES)
/*
* Stage2 translation configuration:
* 16kB pages (TG0 = 2)
- * 2 level page tables (SL = 1)
*/
-#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_16K | VTCR_EL2_SL0_LVL1)
+#define VTCR_EL2_TGRAN VTCR_EL2_TG0_16K
#define VTCR_EL2_TGRAN_SL0_BASE 3UL
#else /* 4K */
/*
* Stage2 translation configuration:
* 4kB pages (TG0 = 0)
- * 3 level page tables (SL = 1)
*/
-#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_4K | VTCR_EL2_SL0_LVL1)
+#define VTCR_EL2_TGRAN VTCR_EL2_TG0_4K
#define VTCR_EL2_TGRAN_SL0_BASE 2UL
#endif
-#define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN_FLAGS)
+#define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN)
+
/*
* VTCR_EL2:SL0 indicates the entry level for Stage2 translation.
* Interestingly, it depends on the page size.
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index ab4d0a926043..21cfd1fe692c 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -66,7 +66,7 @@ extern void __vgic_v3_init_lrs(void);
extern u32 __kvm_get_mdcr_el2(void);
-extern u32 __init_stage2_translation(void);
+extern void __init_stage2_translation(void);
#endif
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index ea6cb5b24258..9a9ddeb33c84 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -380,10 +380,12 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
static inline void __cpu_init_stage2(void)
{
- u32 parange = kvm_call_hyp(__init_stage2_translation);
+ u32 ps;
- WARN_ONCE(parange < 40,
- "PARange is %d bits, unsupported configuration!", parange);
+ kvm_call_hyp(__init_stage2_translation);
+ ps = id_aa64mmfr0_parange_to_phys_shift(read_sysreg(id_aa64mmfr0_el1));
+ WARN_ONCE(ps < 40,
+ "PARange is %d bits, unsupported configuration!", ps);
}
/*
diff --git a/arch/arm64/kvm/hyp/s2-setup.c b/arch/arm64/kvm/hyp/s2-setup.c
index b1129c83c531..5c26ad4b8ac9 100644
--- a/arch/arm64/kvm/hyp/s2-setup.c
+++ b/arch/arm64/kvm/hyp/s2-setup.c
@@ -19,13 +19,11 @@
#include <asm/kvm_arm.h>
#include <asm/kvm_asm.h>
#include <asm/kvm_hyp.h>
-#include <asm/cpufeature.h>
-u32 __hyp_text __init_stage2_translation(void)
+void __hyp_text __init_stage2_translation(void)
{
u64 val = VTCR_EL2_FLAGS;
u64 parange;
- u32 phys_shift;
u64 tmp;
/*
@@ -38,16 +36,6 @@ u32 __hyp_text __init_stage2_translation(void)
parange = ID_AA64MMFR0_PARANGE_MAX;
val |= parange << 16;
- /* Compute the actual PARange... */
- phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange);
-
- /*
- * ... and clamp it to 40 bits, unless we have some braindead
- * HW that implements less than that. In all cases, we'll
- * return that value for the rest of the kernel to decide what
- * to do.
- */
- val |= 64 - (phys_shift > 40 ? 40 : phys_shift);
/*
* Check the availability of Hardware Access Flag / Dirty Bit
@@ -67,6 +55,4 @@ u32 __hyp_text __init_stage2_translation(void)
VTCR_EL2_VS_8BIT;
write_sysreg(val, vtcr_el2);
-
- return phys_shift;
}
diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
index f7c651f3a8c0..523471f0af7b 100644
--- a/arch/arm64/kvm/hyp/switch.c
+++ b/arch/arm64/kvm/hyp/switch.c
@@ -157,11 +157,20 @@ static void __hyp_text __deactivate_traps(struct kvm_vcpu *vcpu)
static void __hyp_text __activate_vm(struct kvm_vcpu *vcpu)
{
struct kvm *kvm = kern_hyp_va(vcpu->kvm);
+ u64 vtcr = read_sysreg(vtcr_el2);
+
+ vtcr &= ~VTCR_EL2_PRIVATE_MASK;
+ vtcr |= VTCR_EL2_SL0(stage2_pt_levels(kvm)) |
+ VTCR_EL2_T0SZ(kvm_phys_shift(kvm));
+ write_sysreg(vtcr, vtcr_el2);
write_sysreg(kvm->arch.vttbr, vttbr_el2);
}
static void __hyp_text __deactivate_vm(struct kvm_vcpu *vcpu)
{
+ u64 vtcr = read_sysreg(vtcr_el2) & ~VTCR_EL2_PRIVATE_MASK;
+
+ write_sysreg(vtcr, vtcr_el2);
write_sysreg(0, vttbr_el2);
}
--
2.13.6
Now that we can manage the stage2 page table per VM, switch the
configuration details to the VM instance. We keep track of the
IPA bits, the number of page table levels and the VTCR bits (which
depend on the IPA and the number of levels).
Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm/include/asm/kvm_mmu.h | 1 +
arch/arm64/include/asm/kvm_host.h | 12 ++++++++++++
arch/arm64/include/asm/kvm_mmu.h | 22 ++++++++++++++++++++--
arch/arm64/include/asm/stage2_pgtable.h | 1 -
arch/arm64/kvm/hyp/switch.c | 3 +--
virt/kvm/arm/arm.c | 2 +-
6 files changed, 35 insertions(+), 6 deletions(-)
diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 440c80589453..dd592fe45660 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -48,6 +48,7 @@
#define kvm_vttbr_baddr_mask(kvm) VTTBR_BADDR_MASK
#define stage2_pgd_size(kvm) (PTRS_PER_S2_PGD * sizeof(pgd_t))
+#define kvm_init_stage2_config(kvm) do { } while (0)
int create_hyp_mappings(void *from, void *to, pgprot_t prot);
int create_hyp_io_mappings(void *from, void *to, phys_addr_t);
void free_hyp_pgds(void);
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 9a9ddeb33c84..1e66e5ab3dde 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -64,6 +64,18 @@ struct kvm_arch {
/* VTTBR value associated with above pgd and vmid */
u64 vttbr;
+ /* Private bits of VTCR_EL2 for this VM */
+ u64 vtcr_private;
+ /* Size of the PA size for this guest */
+ u8 phys_shift;
+ /*
+ * Number of levels in page table. We could always calculate
+ * it from phys_shift above. We cache it for faster switches
+ * in stage2 page table helpers.
+ */
+ u8 s2_levels;
+
+
/* The last vcpu id that ran on each physical CPU */
int __percpu *last_vcpu_ran;
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 483185ed2ecd..ab6a8b905065 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -134,11 +134,12 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
/*
* We currently only support a 40bit IPA.
*/
-#define KVM_PHYS_SHIFT (40)
+#define KVM_PHYS_SHIFT_DEFAULT (40)
-#define kvm_phys_shift(kvm) KVM_PHYS_SHIFT
+#define kvm_phys_shift(kvm) (kvm->arch.phys_shift)
#define kvm_phys_size(kvm) (_AC(1, ULL) << kvm_phys_shift(kvm))
#define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - _AC(1, ULL))
+#define kvm_stage2_levels(kvm) (kvm->arch.s2_levels)
static inline bool kvm_page_empty(void *ptr)
{
@@ -346,5 +347,22 @@ static inline u64 kvm_vttbr_baddr_mask(struct kvm *kvm)
return GENMASK_ULL(PHYS_MASK_SHIFT - 1, x);
}
+/*
+ * kvm_init_stage2_config: Initialise the VM specific stage2 page table
+ * details to default IPA size.
+ */
+static inline void kvm_init_stage2_config(struct kvm *kvm)
+{
+ /*
+ * The stage2 PGD is dependent on the settings we initialise here
+ * and should be allocated only after this step.
+ */
+ VM_BUG_ON(kvm->arch.pgd != NULL);
+ kvm->arch.phys_shift = KVM_PHYS_SHIFT_DEFAULT;
+ kvm->arch.s2_levels = stage2_pt_levels(kvm->arch.phys_shift);
+ kvm->arch.vtcr_private = VTCR_EL2_SL0(kvm->arch.s2_levels) |
+ TCR_T0SZ(kvm->arch.phys_shift);
+}
+
#endif /* __ASSEMBLY__ */
#endif /* __ARM64_KVM_MMU_H__ */
diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
index 33e8ebb25037..9b75b83da643 100644
--- a/arch/arm64/include/asm/stage2_pgtable.h
+++ b/arch/arm64/include/asm/stage2_pgtable.h
@@ -44,7 +44,6 @@
*/
#define __s2_pgd_ptrs(pa, lvls) (1 << ((pa) - pt_levels_pgdir_shift((lvls))))
-#define kvm_stage2_levels(kvm) stage2_pt_levels(kvm_phys_shift(kvm))
#define stage2_pgdir_shift(kvm) \
pt_levels_pgdir_shift(kvm_stage2_levels(kvm))
#define stage2_pgdir_size(kvm) (_AC(1, UL) << stage2_pgdir_shift((kvm)))
diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
index 523471f0af7b..d0725562ee3f 100644
--- a/arch/arm64/kvm/hyp/switch.c
+++ b/arch/arm64/kvm/hyp/switch.c
@@ -160,8 +160,7 @@ static void __hyp_text __activate_vm(struct kvm_vcpu *vcpu)
u64 vtcr = read_sysreg(vtcr_el2);
vtcr &= ~VTCR_EL2_PRIVATE_MASK;
- vtcr |= VTCR_EL2_SL0(stage2_pt_levels(kvm)) |
- VTCR_EL2_T0SZ(kvm_phys_shift(kvm));
+ vtcr |= kvm->arch.vtcr_private;
write_sysreg(vtcr, vtcr_el2);
write_sysreg(kvm->arch.vttbr, vttbr_el2);
}
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 8564ed907b18..e0bf8d19fcfe 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -143,7 +143,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
/* The maximum number of VCPUs is limited by the host's GIC model */
kvm->arch.max_vcpus = vgic_present ?
kvm_vgic_get_max_vcpus() : KVM_MAX_VCPUS;
-
+ kvm_init_stage2_config(kvm);
return ret;
}
--
2.13.6
The guest may fail to complete the virtqueue handshake for various
reasons. For both PCI and MMIO, if 0 is written as the queue PFN,
it implies the guest has given up; simply don't take any
action in that case.
Signed-off-by: Suzuki K Poulose <[email protected]>
---
virtio/mmio.c | 14 ++++++++------
virtio/pci.c | 10 ++++++----
2 files changed, 14 insertions(+), 10 deletions(-)
diff --git a/virtio/mmio.c b/virtio/mmio.c
index f0af4bd..ba02358 100644
--- a/virtio/mmio.c
+++ b/virtio/mmio.c
@@ -188,12 +188,14 @@ static void virtio_mmio_config_out(struct kvm_cpu *vcpu,
break;
case VIRTIO_MMIO_QUEUE_PFN:
val = ioport__read32(data);
- virtio_mmio_init_ioeventfd(vmmio->kvm, vdev, vmmio->hdr.queue_sel);
- vdev->ops->init_vq(vmmio->kvm, vmmio->dev,
- vmmio->hdr.queue_sel,
- vmmio->hdr.guest_page_size,
- vmmio->hdr.queue_align,
- val);
+ if (val) {
+ virtio_mmio_init_ioeventfd(vmmio->kvm, vdev, vmmio->hdr.queue_sel);
+ vdev->ops->init_vq(vmmio->kvm, vmmio->dev,
+ vmmio->hdr.queue_sel,
+ vmmio->hdr.guest_page_size,
+ vmmio->hdr.queue_align,
+ val);
+ }
break;
case VIRTIO_MMIO_QUEUE_NOTIFY:
val = ioport__read32(data);
diff --git a/virtio/pci.c b/virtio/pci.c
index 4ce1111..3c694c2 100644
--- a/virtio/pci.c
+++ b/virtio/pci.c
@@ -271,10 +271,12 @@ static bool virtio_pci__io_out(struct ioport *ioport, struct kvm_cpu *vcpu, u16
break;
case VIRTIO_PCI_QUEUE_PFN:
val = ioport__read32(data);
- virtio_pci__init_ioeventfd(kvm, vdev, vpci->queue_selector);
- vdev->ops->init_vq(kvm, vpci->dev, vpci->queue_selector,
- 1 << VIRTIO_PCI_QUEUE_ADDR_SHIFT,
- VIRTIO_PCI_VRING_ALIGN, val);
+ if (val) {
+ virtio_pci__init_ioeventfd(kvm, vdev, vpci->queue_selector);
+ vdev->ops->init_vq(kvm, vpci->dev, vpci->queue_selector,
+ 1 << VIRTIO_PCI_QUEUE_ADDR_SHIFT,
+ VIRTIO_PCI_VRING_ALIGN, val);
+ }
break;
case VIRTIO_PCI_QUEUE_SEL:
vpci->queue_selector = ioport__read16(data);
--
2.13.6
If the guest wants to use a larger physical address space, place
the RAM at the upper half of the address space. Otherwise, use the
default layout.
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arm/aarch32/include/kvm/kvm-arch.h | 1 +
arm/aarch64/include/kvm/kvm-arch.h | 15 ++++++++++++---
arm/include/arm-common/kvm-arch.h | 11 ++++++-----
arm/kvm.c | 2 +-
4 files changed, 20 insertions(+), 9 deletions(-)
diff --git a/arm/aarch32/include/kvm/kvm-arch.h b/arm/aarch32/include/kvm/kvm-arch.h
index cd31e72..2d62aab 100644
--- a/arm/aarch32/include/kvm/kvm-arch.h
+++ b/arm/aarch32/include/kvm/kvm-arch.h
@@ -4,6 +4,7 @@
#define ARM_KERN_OFFSET(...) 0x8000
#define ARM_MAX_MEMORY(...) ARM_LOMAP_MAX_MEMORY
+#define ARM_MEMORY_AREA(...) ARM32_MEMORY_AREA
#include "arm-common/kvm-arch.h"
diff --git a/arm/aarch64/include/kvm/kvm-arch.h b/arm/aarch64/include/kvm/kvm-arch.h
index 9de623a..bad35b9 100644
--- a/arm/aarch64/include/kvm/kvm-arch.h
+++ b/arm/aarch64/include/kvm/kvm-arch.h
@@ -1,14 +1,23 @@
#ifndef KVM__KVM_ARCH_H
#define KVM__KVM_ARCH_H
+#include "arm-common/kvm-arch.h"
+
+#define ARM64_MEMORY_AREA(phys_shift) (1UL << (phys_shift - 1))
+#define ARM64_MAX_MEMORY(phys_shift) \
+ ((1ULL << (phys_shift)) - ARM64_MEMORY_AREA(phys_shift))
+
+#define ARM_MEMORY_AREA(kvm) ((kvm)->cfg.arch.aarch32_guest ? \
+ ARM32_MEMORY_AREA : \
+ ARM64_MEMORY_AREA(kvm->cfg.arch.phys_shift))
+
#define ARM_KERN_OFFSET(kvm) ((kvm)->cfg.arch.aarch32_guest ? \
0x8000 : \
0x80000)
#define ARM_MAX_MEMORY(kvm) ((kvm)->cfg.arch.aarch32_guest ? \
- ARM_LOMAP_MAX_MEMORY : \
- ARM_HIMAP_MAX_MEMORY)
+ ARM32_MAX_MEMORY : \
+ ARM64_MAX_MEMORY(kvm->cfg.arch.phys_shift))
-#include "arm-common/kvm-arch.h"
#endif /* KVM__KVM_ARCH_H */
diff --git a/arm/include/arm-common/kvm-arch.h b/arm/include/arm-common/kvm-arch.h
index c83c45f..ca7ab0f 100644
--- a/arm/include/arm-common/kvm-arch.h
+++ b/arm/include/arm-common/kvm-arch.h
@@ -6,14 +6,15 @@
#include <linux/types.h>
#include "arm-common/gic.h"
-
#define ARM_IOPORT_AREA _AC(0x0000000000000000, UL)
#define ARM_MMIO_AREA _AC(0x0000000000010000, UL)
#define ARM_AXI_AREA _AC(0x0000000040000000, UL)
-#define ARM_MEMORY_AREA _AC(0x0000000080000000, UL)
-#define ARM_LOMAP_MAX_MEMORY ((1ULL << 32) - ARM_MEMORY_AREA)
-#define ARM_HIMAP_MAX_MEMORY ((1ULL << 40) - ARM_MEMORY_AREA)
+#define ARM32_MEMORY_AREA _AC(0x0000000080000000, UL)
+#define ARM32_MAX_MEMORY ((1ULL << 32) - ARM32_MEMORY_AREA)
+
+#define ARM_IOMEM_AREA_END ARM32_MEMORY_AREA
+
#define ARM_GIC_DIST_BASE (ARM_AXI_AREA - ARM_GIC_DIST_SIZE)
#define ARM_GIC_CPUI_BASE (ARM_GIC_DIST_BASE - ARM_GIC_CPUI_SIZE)
@@ -24,7 +25,7 @@
#define ARM_IOPORT_SIZE (ARM_MMIO_AREA - ARM_IOPORT_AREA)
#define ARM_VIRTIO_MMIO_SIZE (ARM_AXI_AREA - (ARM_MMIO_AREA + ARM_GIC_SIZE))
#define ARM_PCI_CFG_SIZE (1ULL << 24)
-#define ARM_PCI_MMIO_SIZE (ARM_MEMORY_AREA - \
+#define ARM_PCI_MMIO_SIZE (ARM_IOMEM_AREA_END - \
(ARM_AXI_AREA + ARM_PCI_CFG_SIZE))
#define KVM_IOPORT_AREA ARM_IOPORT_AREA
diff --git a/arm/kvm.c b/arm/kvm.c
index 7573391..0f155c6 100644
--- a/arm/kvm.c
+++ b/arm/kvm.c
@@ -38,7 +38,7 @@ void kvm__init_ram(struct kvm *kvm)
u64 phys_start, phys_size;
void *host_mem;
- phys_start = ARM_MEMORY_AREA;
+ phys_start = ARM_MEMORY_AREA(kvm);
phys_size = kvm->ram_size;
host_mem = kvm->ram_start;
--
2.13.6
Add an option to specify the physical address size used by this
VM.
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
arm/include/arm-common/kvm-config-arch.h | 1 +
arm/kvm.c | 30 ++++++++++++++++++++++++++++++
3 files changed, 35 insertions(+), 1 deletion(-)
diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
index 04be43d..c4bb207 100644
--- a/arm/aarch64/include/kvm/kvm-config-arch.h
+++ b/arm/aarch64/include/kvm/kvm-config-arch.h
@@ -8,7 +8,10 @@
"Create PMUv3 device"), \
OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed, \
"Specify random seed for Kernel Address Space " \
- "Layout Randomization (KASLR)"),
+ "Layout Randomization (KASLR)"), \
+ OPT_UINTEGER('\0', "phys-shift", &(cfg)->phys_shift, \
+ "Specify maximum physical address size (not " \
+ "the amount of memory)"),
#include "arm-common/kvm-config-arch.h"
diff --git a/arm/include/arm-common/kvm-config-arch.h b/arm/include/arm-common/kvm-config-arch.h
index 6a196f1..d841b0b 100644
--- a/arm/include/arm-common/kvm-config-arch.h
+++ b/arm/include/arm-common/kvm-config-arch.h
@@ -11,6 +11,7 @@ struct kvm_config_arch {
bool has_pmuv3;
u64 kaslr_seed;
enum irqchip_type irqchip;
+ unsigned int phys_shift;
};
int irqchip_parser(const struct option *opt, const char *arg, int unset);
diff --git a/arm/kvm.c b/arm/kvm.c
index 2ab436e..7573391 100644
--- a/arm/kvm.c
+++ b/arm/kvm.c
@@ -18,6 +18,14 @@ struct kvm_ext kvm_req_ext[] = {
{ 0, 0 },
};
+#ifndef KVM_CAP_ARM_CONFIG_PHYS_SHIFT
+#define KVM_CAP_ARM_CONFIG_PHYS_SHIFT 151
+#endif
+
+#ifndef KVM_ARM_SET_PHYS_SIZE
+#define KVM_ARM_SET_PHYS_SIZE _IOW(KVMIO, 0xb2, __u32)
+#endif
+
bool kvm__arch_cpu_supports_vm(void)
{
/* The KVM capability check is enough. */
@@ -57,8 +65,30 @@ void kvm__arch_set_cmdline(char *cmdline, bool video)
{
}
+static void kvm__init_phys_size(struct kvm *kvm)
+{
+ if (!kvm->cfg.arch.phys_shift)
+ goto default_phys_size;
+ if (kvm->cfg.arch.phys_shift > 48)
+ die("Physical memory size is limited to 48bits, %d\n",
+ kvm->cfg.arch.phys_shift);
+
+ if (!kvm__supports_extension(kvm, KVM_CAP_ARM_CONFIG_PHYS_SHIFT)) {
+ pr_warning("System doesn't support phys size configuration\n");
+ goto default_phys_size;
+ }
+ if (ioctl(kvm->vm_fd, KVM_ARM_SET_PHYS_SIZE, &kvm->cfg.arch.phys_shift))
+ die("Failed to set physical memory size to %dbits\n",
+ kvm->cfg.arch.phys_shift);
+ return;
+default_phys_size:
+ kvm->cfg.arch.phys_shift = 40;
+ return;
+}
+
void kvm__arch_init(struct kvm *kvm, const char *hugetlbfs_path, u64 ram_size)
{
+ kvm__init_phys_size(kvm);
/*
* Allocate guest memory. We must align our buffer to 64K to
* correlate with the maximum guest page size for virtio-mmio.
--
2.13.6
From: Kristina Martsenko <[email protected]>
We only support a 64K (ITS) page size for the VGIC, which makes it
easier to support 52bit guest PAs by simply removing the restriction
that limited the address bits to 48.
Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Kristina Martsenko <[email protected]>
[ Clean up macro usages, Add fixes for propbaser handling ]
Signed-off-by: Suzuki K Poulose <[email protected]>
---
virt/kvm/arm/vgic/vgic-its.c | 36 ++++++++++--------------------------
virt/kvm/arm/vgic/vgic-mmio-v3.c | 1 -
2 files changed, 10 insertions(+), 27 deletions(-)
diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c
index 8e633bd9cc1e..60ab293ec542 100644
--- a/virt/kvm/arm/vgic/vgic-its.c
+++ b/virt/kvm/arm/vgic/vgic-its.c
@@ -233,13 +233,6 @@ static struct its_ite *find_ite(struct vgic_its *its, u32 device_id,
list_for_each_entry(dev, &(its)->device_list, dev_list) \
list_for_each_entry(ite, &(dev)->itt_head, ite_list)
-/*
- * We only implement 48 bits of PA at the moment, although the ITS
- * supports more. Let's be restrictive here.
- */
-#define BASER_ADDRESS(x) ((x) & GENMASK_ULL(47, 16))
-#define CBASER_ADDRESS(x) ((x) & GENMASK_ULL(47, 12))
-
#define GIC_LPI_OFFSET 8192
#define VITS_TYPER_IDBITS 16
@@ -769,7 +762,7 @@ static bool vgic_its_check_id(struct vgic_its *its, u64 baser, u32 id,
if (id >= (l1_tbl_size / esz))
return false;
- addr = BASER_ADDRESS(baser) + id * esz;
+ addr = GITS_BASER_ADDR64K_TO_PHYS(baser) + id * esz;
gfn = addr >> PAGE_SHIFT;
if (eaddr)
@@ -784,7 +777,8 @@ static bool vgic_its_check_id(struct vgic_its *its, u64 baser, u32 id,
/* Each 1st level entry is represented by a 64-bit value. */
if (kvm_read_guest(its->dev->kvm,
- BASER_ADDRESS(baser) + index * sizeof(indirect_ptr),
+ GITS_BASER_ADDR64K_TO_PHYS(baser) +
+ index * sizeof(indirect_ptr),
&indirect_ptr, sizeof(indirect_ptr)))
return false;
@@ -794,11 +788,7 @@ static bool vgic_its_check_id(struct vgic_its *its, u64 baser, u32 id,
if (!(indirect_ptr & BIT_ULL(63)))
return false;
- /*
- * Mask the guest physical address and calculate the frame number.
- * Any address beyond our supported 48 bits of PA will be caught
- * by the actual check in the final step.
- */
+ /* Mask the guest physical address and calculate the frame number. */
indirect_ptr &= GENMASK_ULL(51, 16);
/* Find the address of the actual entry */
@@ -1292,9 +1282,6 @@ static u64 vgic_sanitise_its_baser(u64 reg)
GITS_BASER_OUTER_CACHEABILITY_SHIFT,
vgic_sanitise_outer_cacheability);
- /* Bits 15:12 contain bits 51:48 of the PA, which we don't support. */
- reg &= ~GENMASK_ULL(15, 12);
-
/* We support only one (ITS) page size: 64K */
reg = (reg & ~GITS_BASER_PAGE_SIZE_MASK) | GITS_BASER_PAGE_SIZE_64K;
@@ -1313,11 +1300,8 @@ static u64 vgic_sanitise_its_cbaser(u64 reg)
GITS_CBASER_OUTER_CACHEABILITY_SHIFT,
vgic_sanitise_outer_cacheability);
- /*
- * Sanitise the physical address to be 64k aligned.
- * Also limit the physical addresses to 48 bits.
- */
- reg &= ~(GENMASK_ULL(51, 48) | GENMASK_ULL(15, 12));
+ /* Sanitise the physical address to be 64k aligned. */
+ reg &= ~GENMASK_ULL(15, 12);
return reg;
}
@@ -1363,7 +1347,7 @@ static void vgic_its_process_commands(struct kvm *kvm, struct vgic_its *its)
if (!its->enabled)
return;
- cbaser = CBASER_ADDRESS(its->cbaser);
+ cbaser = GITS_CBASER_ADDRESS(its->cbaser);
while (its->cwriter != its->creadr) {
int ret = kvm_read_guest(kvm, cbaser + its->creadr,
@@ -2221,7 +2205,7 @@ static int vgic_its_restore_device_tables(struct vgic_its *its)
if (!(baser & GITS_BASER_VALID))
return 0;
- l1_gpa = BASER_ADDRESS(baser);
+ l1_gpa = GITS_BASER_ADDR64K_TO_PHYS(baser);
if (baser & GITS_BASER_INDIRECT) {
l1_esz = GITS_LVL1_ENTRY_SIZE;
@@ -2293,7 +2277,7 @@ static int vgic_its_save_collection_table(struct vgic_its *its)
{
const struct vgic_its_abi *abi = vgic_its_get_abi(its);
u64 baser = its->baser_coll_table;
- gpa_t gpa = BASER_ADDRESS(baser);
+ gpa_t gpa = GITS_BASER_ADDR64K_TO_PHYS(baser);
struct its_collection *collection;
u64 val;
size_t max_size, filled = 0;
@@ -2342,7 +2326,7 @@ static int vgic_its_restore_collection_table(struct vgic_its *its)
if (!(baser & GITS_BASER_VALID))
return 0;
- gpa = BASER_ADDRESS(baser);
+ gpa = GITS_BASER_ADDR64K_TO_PHYS(baser);
max_size = GITS_BASER_NR_PAGES(baser) * SZ_64K;
diff --git a/virt/kvm/arm/vgic/vgic-mmio-v3.c b/virt/kvm/arm/vgic/vgic-mmio-v3.c
index 671fe81f8e1d..90f36d9c946b 100644
--- a/virt/kvm/arm/vgic/vgic-mmio-v3.c
+++ b/virt/kvm/arm/vgic/vgic-mmio-v3.c
@@ -351,7 +351,6 @@ static u64 vgic_sanitise_propbaser(u64 reg)
vgic_sanitise_outer_cacheability);
reg &= ~PROPBASER_RES0_MASK;
- reg &= ~GENMASK_ULL(51, 48);
return reg;
}
--
2.13.6
Allow the guests to choose a larger physical address space size.
The default and minimum size is 40bits. A guest can change this
right after VM creation, but before the stage2 page tables are
allocated (i.e., before it registers a memory region or maps a
device address). The size is restricted to the maximum supported
by the host. Also, the guest can only increase the PA size from
the existing value, as reducing it could break devices which may
have already verified their physical address for validity and may
do a lazy mapping (e.g., the VGIC).
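For illustration, a VMM would drive the new interface roughly as in the
sketch below. This is not part of the patch; the fd names are assumptions
and error handling is minimal:

	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Hypothetical VMM snippet: raise the IPA limit to 48bits right after
	 * KVM_CREATE_VM and before any memslot or device address is set up. */
	static void configure_ipa_limit(int sys_fd, int vm_fd)
	{
		__u32 shift = 48;

		if (ioctl(sys_fd, KVM_CHECK_EXTENSION,
			  KVM_CAP_ARM_CONFIG_PHYS_SHIFT) <= 0)
			return;		/* stick with the 40bit default */

		if (ioctl(vm_fd, KVM_ARM_SET_PHYS_SHIFT, &shift))
			perror("KVM_ARM_SET_PHYS_SHIFT");

		if (!ioctl(vm_fd, KVM_ARM_GET_PHYS_SHIFT, &shift))
			printf("VM IPA limit: %u bits\n", shift);
	}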
Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Cc: Peter Maydell <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
Documentation/virtual/kvm/api.txt | 27 ++++++++++++++++++++++++++
arch/arm/include/asm/kvm_host.h | 7 +++++++
arch/arm64/include/asm/kvm_host.h | 1 +
arch/arm64/include/asm/kvm_mmu.h | 41 ++++++++++++++++++++++++++++++---------
arch/arm64/kvm/reset.c | 28 ++++++++++++++++++++++++++
include/uapi/linux/kvm.h | 4 ++++
virt/kvm/arm/arm.c | 2 +-
7 files changed, 100 insertions(+), 10 deletions(-)
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 57d3ee9e4bde..a203faf768c4 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3403,6 +3403,33 @@ invalid, if invalid pages are written to (e.g. after the end of memory)
or if no page table is present for the addresses (e.g. when using
hugepages).
+4.109 KVM_ARM_GET_PHYS_SHIFT
+
+Capability: KVM_CAP_ARM_CONFIG_PHYS_SHIFT
+Architectures: arm64
+Type: vm ioctl
+Parameters: __u32 (out)
+Returns: 0 on success, a negative value on error
+
+This ioctl is used to get the current maximum physical address size for
+the VM. The value is Log2(Maximum_Physical_Address). This is neither the
+amount of physical memory assigned to the VM nor the maximum physical address
+that a real CPU on the host can handle. Rather, this is the upper limit of the
+guest physical address that can be used by the VM.
+
+4.110 KVM_ARM_SET_PHYS_SHIFT
+
+Capability: KVM_CAP_ARM_CONFIG_PHYS_SHIFT
+Architectures: arm64
+Type: vm ioctl
+Parameters: __u32 (in)
+Returns: 0 on success, a negative value on error
+
+This ioctl is used to set the maximum physical address size for
+the VM. The value is Log2(Maximum_Physical_Address). The value can only
+be increased from the existing setting and cannot be changed once the
+stage-2 page tables have been allocated; such attempts return an error.
+
5. The kvm_run structure
------------------------
diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index a9f7d3f47134..fa8e68a4f692 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -268,6 +268,13 @@ static inline int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext)
return 0;
}
+static inline long kvm_arch_dev_vm_ioctl(struct kvm *kvm,
+ unsigned int ioctl,
+ unsigned long arg)
+{
+ return -EINVAL;
+}
+
int kvm_perf_init(void);
int kvm_perf_teardown(void);
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 1e66e5ab3dde..2895c2cda8fc 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -50,6 +50,7 @@
int __attribute_const__ kvm_target_cpu(void);
int kvm_reset_vcpu(struct kvm_vcpu *vcpu);
int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext);
+long kvm_arch_dev_vm_ioctl(struct kvm *kvm, unsigned int ioctl, unsigned long arg);
void __extended_idmap_trampoline(phys_addr_t boot_pgd, phys_addr_t idmap_start);
struct kvm_arch {
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index ab6a8b905065..ab7f50f20bcd 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -347,21 +347,44 @@ static inline u64 kvm_vttbr_baddr_mask(struct kvm *kvm)
return GENMASK_ULL(PHYS_MASK_SHIFT - 1, x);
}
+static inline int kvm_reconfig_stage2(struct kvm *kvm, u32 phys_shift)
+{
+ int rc = 0;
+ unsigned int pa_max, parange;
+
+ parange = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1) & 7;
+ pa_max = id_aa64mmfr0_parange_to_phys_shift(parange);
+ /* Raise it to 40bits for backward compatibility */
+ pa_max = (pa_max < 40) ? 40 : pa_max;
+ /* Make sure the size is supported/available */
+ if (phys_shift > PHYS_MASK_SHIFT || phys_shift > pa_max)
+ return -EINVAL;
+ /*
+ * The stage2 PGD is dependent on the settings we initialise here
+ * and should be allocated only after this step. We cannot allow
+ * down sizing the IPA size as there could be devices or memory
+ * regions, that depend on the previous size.
+ */
+ mutex_lock(&kvm->slots_lock);
+ if (kvm->arch.pgd || phys_shift < kvm->arch.phys_shift) {
+ rc = -EPERM;
+ } else if (phys_shift > kvm->arch.phys_shift) {
+ kvm->arch.phys_shift = phys_shift;
+ kvm->arch.s2_levels = stage2_pt_levels(kvm->arch.phys_shift);
+ kvm->arch.vtcr_private = VTCR_EL2_SL0(kvm->arch.s2_levels) |
+ TCR_T0SZ(kvm->arch.phys_shift);
+ }
+ mutex_unlock(&kvm->slots_lock);
+ return rc;
+}
+
/*
* kvm_init_stage2_config: Initialise the VM specific stage2 page table
* details to default IPA size.
*/
static inline void kvm_init_stage2_config(struct kvm *kvm)
{
- /*
- * The stage2 PGD is dependent on the settings we initialise here
- * and should be allocated only after this step.
- */
- VM_BUG_ON(kvm->arch.pgd != NULL);
- kvm->arch.phys_shift = KVM_PHYS_SHIFT_DEFAULT;
- kvm->arch.s2_levels = stage2_pt_levels(kvm->arch.phys_shift);
- kvm->arch.vtcr_private = VTCR_EL2_SL0(kvm->arch.s2_levels) |
- TCR_T0SZ(kvm->arch.phys_shift);
+ kvm_reconfig_stage2(kvm, KVM_PHYS_SHIFT_DEFAULT);
}
#endif /* __ASSEMBLY__ */
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 3256b9228e75..90ceca823aca 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -23,6 +23,7 @@
#include <linux/kvm_host.h>
#include <linux/kvm.h>
#include <linux/hw_breakpoint.h>
+#include <linux/uaccess.h>
#include <kvm/arm_arch_timer.h>
@@ -81,6 +82,9 @@ int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_VCPU_ATTRIBUTES:
r = 1;
break;
+ case KVM_CAP_ARM_CONFIG_PHYS_SHIFT:
+ r = 1;
+ break;
default:
r = 0;
}
@@ -88,6 +92,30 @@ int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext)
return r;
}
+long kvm_arch_dev_vm_ioctl(struct kvm *kvm,
+ unsigned int ioctl, unsigned long arg)
+{
+ void __user *argp = (void __user *)arg;
+ u32 phys_shift;
+ long r = -EFAULT;
+
+ switch (ioctl) {
+ case KVM_ARM_GET_PHYS_SHIFT:
+ phys_shift = kvm_phys_shift(kvm);
+ if (!put_user(phys_shift, (u32 __user *)argp))
+ r = 0;
+ break;
+ case KVM_ARM_SET_PHYS_SHIFT:
+ if (!get_user(phys_shift, (u32 __user*)argp))
+ r = kvm_reconfig_stage2(kvm, phys_shift);
+ break;
+ default:
+ r = -EINVAL;
+ }
+ return r;
+}
+
+
/**
* kvm_reset_vcpu - sets core registers and sys_regs to reset value
* @vcpu: The VCPU pointer
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 496e59a2738b..66bfbe19b434 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -932,6 +932,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_HYPERV_SYNIC2 148
#define KVM_CAP_HYPERV_VP_INDEX 149
#define KVM_CAP_S390_AIS_MIGRATION 150
+#define KVM_CAP_ARM_CONFIG_PHYS_SHIFT 151
#ifdef KVM_CAP_IRQ_ROUTING
@@ -1261,6 +1262,9 @@ struct kvm_s390_ucas_mapping {
#define KVM_PPC_CONFIGURE_V3_MMU _IOW(KVMIO, 0xaf, struct kvm_ppc_mmuv3_cfg)
/* Available with KVM_CAP_PPC_RADIX_MMU */
#define KVM_PPC_GET_RMMU_INFO _IOW(KVMIO, 0xb0, struct kvm_ppc_rmmu_info)
+/* Available with KVM_CAP_ARM_CONFIG_PHYS_SHIFT */
+#define KVM_ARM_GET_PHYS_SHIFT _IOR(KVMIO, 0xb1, __u32)
+#define KVM_ARM_SET_PHYS_SHIFT _IOW(KVMIO, 0xb2, __u32)
/* ioctl for vm fd */
#define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device)
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index e0bf8d19fcfe..05fc49304722 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -1136,7 +1136,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
return 0;
}
default:
- return -EINVAL;
+ return kvm_arch_dev_vm_ioctl(kvm, ioctl, arg);
}
}
--
2.13.6
Right now the stage2 page table handling for a VM is hard-coded,
assuming an IPA of 40bits. As we are about to add support for a
per-VM IPA size, prepare the stage2 page table helpers to accept
the kvm instance, so that they can make the right decision for the
given VM. No functional changes.
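The conversion is mechanical; after this patch a stage2 walker follows the
pattern sketched below (illustrative only, mirroring the mmu.c hunks further
down; it is not an additional change):

	static void stage2_walk_sketch(struct kvm *kvm, phys_addr_t addr,
				       phys_addr_t end)
	{
		/* Every stage2_* helper now takes the kvm instance. */
		pgd_t *pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
		phys_addr_t next;

		do {
			next = stage2_pgd_addr_end(kvm, addr, end);
			if (!stage2_pgd_none(kvm, *pgd)) {
				pud_t *pud __maybe_unused =
					stage2_pud_offset(kvm, pgd, addr);
				/* ... descend to the PMD/PTE levels ... */
			}
		} while (pgd++, addr = next, addr != end);
	}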
Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm/include/asm/kvm_arm.h | 2 -
arch/arm/include/asm/kvm_mmu.h | 11 ++-
arch/arm/include/asm/stage2_pgtable.h | 46 ++++++-----
arch/arm64/include/asm/kvm_mmu.h | 6 +-
arch/arm64/include/asm/pgtable.h | 2 +-
arch/arm64/include/asm/stage2_pgtable-nopmd.h | 18 ++--
arch/arm64/include/asm/stage2_pgtable-nopud.h | 16 ++--
arch/arm64/include/asm/stage2_pgtable.h | 49 ++++++-----
virt/kvm/arm/mmu.c | 114 +++++++++++++-------------
virt/kvm/arm/vgic/vgic-kvm-device.c | 2 +-
10 files changed, 140 insertions(+), 126 deletions(-)
diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
index 3ab8b3781bfe..4ebaf0c29723 100644
--- a/arch/arm/include/asm/kvm_arm.h
+++ b/arch/arm/include/asm/kvm_arm.h
@@ -133,8 +133,6 @@
* space.
*/
#define KVM_PHYS_SHIFT (40)
-#define KVM_PHYS_SIZE (_AC(1, ULL) << KVM_PHYS_SHIFT)
-#define KVM_PHYS_MASK (KVM_PHYS_SIZE - _AC(1, ULL))
#define PTRS_PER_S2_PGD (_AC(1, ULL) << (KVM_PHYS_SHIFT - 30))
/* Virtualization Translation Control Register (VTCR) bits */
diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 8c5643e2eea4..a3312f87a6e0 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -29,17 +29,24 @@
#define kern_hyp_va(kva) (kva)
/*
- * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation levels.
+ * kvm_mmu_cache_min_pages is the number of stage2 page table
+ * translation levels excluding the top level.
*/
-#define KVM_MMU_CACHE_MIN_PAGES 2
+#define kvm_mmu_cache_min_pages(kvm) 2
#ifndef __ASSEMBLY__
#include <linux/highmem.h>
#include <asm/cacheflush.h>
+#include <asm/kvm_arm.h>
#include <asm/pgalloc.h>
#include <asm/stage2_pgtable.h>
+#define kvm_phys_shift(kvm) KVM_PHYS_SHIFT
+#define kvm_phys_size(kvm) (_AC(1, ULL) << kvm_phys_shift(kvm))
+#define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - _AC(1, ULL))
+
+#define stage2_pgd_size(kvm) (PTRS_PER_S2_PGD * sizeof(pgd_t))
int create_hyp_mappings(void *from, void *to, pgprot_t prot);
int create_hyp_io_mappings(void *from, void *to, phys_addr_t);
void free_hyp_pgds(void);
diff --git a/arch/arm/include/asm/stage2_pgtable.h b/arch/arm/include/asm/stage2_pgtable.h
index 460d616bb2d6..e22ae94f0bf9 100644
--- a/arch/arm/include/asm/stage2_pgtable.h
+++ b/arch/arm/include/asm/stage2_pgtable.h
@@ -19,43 +19,45 @@
#ifndef __ARM_S2_PGTABLE_H_
#define __ARM_S2_PGTABLE_H_
-#define stage2_pgd_none(pgd) pgd_none(pgd)
-#define stage2_pgd_clear(pgd) pgd_clear(pgd)
-#define stage2_pgd_present(pgd) pgd_present(pgd)
-#define stage2_pgd_populate(pgd, pud) pgd_populate(NULL, pgd, pud)
-#define stage2_pud_offset(pgd, address) pud_offset(pgd, address)
-#define stage2_pud_free(pud) pud_free(NULL, pud)
-
-#define stage2_pud_none(pud) pud_none(pud)
-#define stage2_pud_clear(pud) pud_clear(pud)
-#define stage2_pud_present(pud) pud_present(pud)
-#define stage2_pud_populate(pud, pmd) pud_populate(NULL, pud, pmd)
-#define stage2_pmd_offset(pud, address) pmd_offset(pud, address)
-#define stage2_pmd_free(pmd) pmd_free(NULL, pmd)
-
-#define stage2_pud_huge(pud) pud_huge(pud)
+#define stage2_pgd_none(kvm, pgd) pgd_none(pgd)
+#define stage2_pgd_clear(kvm, pgd) pgd_clear(pgd)
+#define stage2_pgd_present(kvm, pgd) pgd_present(pgd)
+#define stage2_pgd_populate(kvm, pgd, pud) pgd_populate(NULL, pgd, pud)
+#define stage2_pud_offset(kvm, pgd, address) pud_offset(pgd, address)
+#define stage2_pud_free(kvm, pud) pud_free(NULL, pud)
+
+#define stage2_pud_none(kvm, pud) pud_none(pud)
+#define stage2_pud_clear(kvm, pud) pud_clear(pud)
+#define stage2_pud_present(kvm, pud) pud_present(pud)
+#define stage2_pud_populate(kvm, pud, pmd) pud_populate(NULL, pud, pmd)
+#define stage2_pmd_offset(kvm, pud, address) pmd_offset(pud, address)
+#define stage2_pmd_free(kvm, pmd) pmd_free(NULL, pmd)
+
+#define stage2_pud_huge(kvm, pud) pud_huge(pud)
/* Open coded p*d_addr_end that can deal with 64bit addresses */
-static inline phys_addr_t stage2_pgd_addr_end(phys_addr_t addr, phys_addr_t end)
+static inline phys_addr_t
+stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
{
phys_addr_t boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
return (boundary - 1 < end - 1) ? boundary : end;
}
-#define stage2_pud_addr_end(addr, end) (end)
+#define stage2_pud_addr_end(kvm, addr, end) (end)
-static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
+static inline phys_addr_t
+stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
{
phys_addr_t boundary = (addr + PMD_SIZE) & PMD_MASK;
return (boundary - 1 < end - 1) ? boundary : end;
}
-#define stage2_pgd_index(addr) pgd_index(addr)
+#define stage2_pgd_index(kvm, addr) pgd_index(addr)
-#define stage2_pte_table_empty(ptep) kvm_page_empty(ptep)
-#define stage2_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
-#define stage2_pud_table_empty(pudp) false
+#define stage2_pte_table_empty(kvm, ptep) kvm_page_empty(ptep)
+#define stage2_pmd_table_empty(kvm, pmdp) kvm_page_empty(pmdp)
+#define stage2_pud_table_empty(kvm, pudp) false
#endif /* __ARM_S2_PGTABLE_H_ */
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index b33bdb5eeb3d..de542aa72d80 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -134,8 +134,10 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
* We currently only support a 40bit IPA.
*/
#define KVM_PHYS_SHIFT (40)
-#define KVM_PHYS_SIZE (1UL << KVM_PHYS_SHIFT)
-#define KVM_PHYS_MASK (KVM_PHYS_SIZE - 1UL)
+
+#define kvm_phys_shift(kvm) KVM_PHYS_SHIFT
+#define kvm_phys_size(kvm) (_AC(1, ULL) << kvm_phys_shift(kvm))
+#define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - _AC(1, ULL))
#include <asm/stage2_pgtable.h>
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index a1c6e93a1a11..50d72be469ca 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -467,7 +467,7 @@ static inline phys_addr_t pmd_page_paddr(pmd_t pmd)
#define __raw_pud_none(pud) (!pud_val((pud)))
#define __raw_pud_bad(pud) (!(pud_val((pud)) & PUD_TABLE_BIT))
-#define __raw_pud_present(pud) pte_present(pud_pte((pud)))
+#define __raw_pud_present(pud) pte_present(pud_pte((pud)))
static inline int __raw_pud_huge(pud_t pud)
{
diff --git a/arch/arm64/include/asm/stage2_pgtable-nopmd.h b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
index 2656a0fd05a6..0280dedbf75f 100644
--- a/arch/arm64/include/asm/stage2_pgtable-nopmd.h
+++ b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
@@ -26,17 +26,17 @@
#define S2_PMD_SIZE (1UL << S2_PMD_SHIFT)
#define S2_PMD_MASK (~(S2_PMD_SIZE-1))
-#define stage2_pud_none(pud) (0)
-#define stage2_pud_present(pud) (1)
-#define stage2_pud_clear(pud) do { } while (0)
-#define stage2_pud_populate(pud, pmd) do { } while (0)
-#define stage2_pmd_offset(pud, address) ((pmd_t *)(pud))
+#define stage2_pud_none(kvm, pud) (0)
+#define stage2_pud_present(kvm, pud) (1)
+#define stage2_pud_clear(kvm, pud) do { } while (0)
+#define stage2_pud_populate(kvm, pud, pmd) do { } while (0)
+#define stage2_pmd_offset(kvm, pud, address) ((pmd_t *)(pud))
-#define stage2_pmd_free(pmd) do { } while (0)
+#define stage2_pmd_free(kvm, pmd) do { } while (0)
-#define stage2_pmd_addr_end(addr, end) (end)
+#define stage2_pmd_addr_end(kvm, addr, end) (end)
-#define stage2_pud_huge(pud) (0)
-#define stage2_pmd_table_empty(pmdp) (0)
+#define stage2_pud_huge(kvm, pud) (0)
+#define stage2_pmd_table_empty(kvm, pmdp) (0)
#endif
diff --git a/arch/arm64/include/asm/stage2_pgtable-nopud.h b/arch/arm64/include/asm/stage2_pgtable-nopud.h
index 5ee87b54ebf3..cd6304e203be 100644
--- a/arch/arm64/include/asm/stage2_pgtable-nopud.h
+++ b/arch/arm64/include/asm/stage2_pgtable-nopud.h
@@ -24,16 +24,16 @@
#define S2_PUD_SIZE (_AC(1, UL) << S2_PUD_SHIFT)
#define S2_PUD_MASK (~(S2_PUD_SIZE-1))
-#define stage2_pgd_none(pgd) (0)
-#define stage2_pgd_present(pgd) (1)
-#define stage2_pgd_clear(pgd) do { } while (0)
-#define stage2_pgd_populate(pgd, pud) do { } while (0)
+#define stage2_pgd_none(kvm, pgd) (0)
+#define stage2_pgd_present(kvm, pgd) (1)
+#define stage2_pgd_clear(kvm, pgd) do { } while (0)
+#define stage2_pgd_populate(kvm, pgd, pud) do { } while (0)
-#define stage2_pud_offset(pgd, address) ((pud_t *)(pgd))
+#define stage2_pud_offset(kvm, pgd, address) ((pud_t *)(pgd))
-#define stage2_pud_free(x) do { } while (0)
+#define stage2_pud_free(kvm, x) do { } while (0)
-#define stage2_pud_addr_end(addr, end) (end)
-#define stage2_pud_table_empty(pmdp) (0)
+#define stage2_pud_addr_end(kvm, addr, end) (end)
+#define stage2_pud_table_empty(kvm, pmdp) (0)
#endif
diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
index 8b68099348e5..057a405fa727 100644
--- a/arch/arm64/include/asm/stage2_pgtable.h
+++ b/arch/arm64/include/asm/stage2_pgtable.h
@@ -65,10 +65,10 @@
#define PTRS_PER_S2_PGD (1 << (KVM_PHYS_SHIFT - S2_PGDIR_SHIFT))
/*
- * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation
+ * kvm_mmu_cache_min_pages is the number of stage2 page table translation
* levels in addition to the PGD.
*/
-#define KVM_MMU_CACHE_MIN_PAGES (STAGE2_PGTABLE_LEVELS - 1)
+#define kvm_mmu_cache_min_pages(kvm) (STAGE2_PGTABLE_LEVELS - 1)
#if STAGE2_PGTABLE_LEVELS > 3
@@ -77,16 +77,17 @@
#define S2_PUD_SIZE (_AC(1, UL) << S2_PUD_SHIFT)
#define S2_PUD_MASK (~(S2_PUD_SIZE - 1))
-#define stage2_pgd_none(pgd) pgd_none(pgd)
-#define stage2_pgd_clear(pgd) pgd_clear(pgd)
-#define stage2_pgd_present(pgd) pgd_present(pgd)
-#define stage2_pgd_populate(pgd, pud) pgd_populate(NULL, pgd, pud)
-#define stage2_pud_offset(pgd, address) pud_offset(pgd, address)
-#define stage2_pud_free(pud) pud_free(NULL, pud)
+#define stage2_pgd_none(kvm, pgd) pgd_none(pgd)
+#define stage2_pgd_clear(kvm, pgd) pgd_clear(pgd)
+#define stage2_pgd_present(kvm, pgd) pgd_present(pgd)
+#define stage2_pgd_populate(kvm, pgd, pud) pgd_populate(NULL, pgd, pud)
+#define stage2_pud_offset(kvm, pgd, address) pud_offset(pgd, address)
+#define stage2_pud_free(kvm, pud) pud_free(NULL, pud)
-#define stage2_pud_table_empty(pudp) kvm_page_empty(pudp)
+#define stage2_pud_table_empty(kvm, pudp) kvm_page_empty(pudp)
-static inline phys_addr_t stage2_pud_addr_end(phys_addr_t addr, phys_addr_t end)
+static inline phys_addr_t
+stage2_pud_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
{
phys_addr_t boundary = (addr + S2_PUD_SIZE) & S2_PUD_MASK;
@@ -102,17 +103,18 @@ static inline phys_addr_t stage2_pud_addr_end(phys_addr_t addr, phys_addr_t end)
#define S2_PMD_SIZE (_AC(1, UL) << S2_PMD_SHIFT)
#define S2_PMD_MASK (~(S2_PMD_SIZE - 1))
-#define stage2_pud_none(pud) pud_none(pud)
-#define stage2_pud_clear(pud) pud_clear(pud)
-#define stage2_pud_present(pud) pud_present(pud)
-#define stage2_pud_populate(pud, pmd) pud_populate(NULL, pud, pmd)
-#define stage2_pmd_offset(pud, address) pmd_offset(pud, address)
-#define stage2_pmd_free(pmd) pmd_free(NULL, pmd)
+#define stage2_pud_none(kvm, pud) pud_none(pud)
+#define stage2_pud_clear(kvm, pud) pud_clear(pud)
+#define stage2_pud_present(kvm, pud) pud_present(pud)
+#define stage2_pud_populate(kvm, pud, pmd) pud_populate(NULL, pud, pmd)
+#define stage2_pmd_offset(kvm, pud, address) pmd_offset(pud, address)
+#define stage2_pmd_free(kvm, pmd) pmd_free(NULL, pmd)
-#define stage2_pud_huge(pud) pud_huge(pud)
-#define stage2_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
+#define stage2_pud_huge(kvm, pud) pud_huge(pud)
+#define stage2_pmd_table_empty(kvm, pmdp) kvm_page_empty(pmdp)
-static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
+static inline phys_addr_t
+stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
{
phys_addr_t boundary = (addr + S2_PMD_SIZE) & S2_PMD_MASK;
@@ -121,7 +123,7 @@ static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
#endif /* STAGE2_PGTABLE_LEVELS > 2 */
-#define stage2_pte_table_empty(ptep) kvm_page_empty(ptep)
+#define stage2_pte_table_empty(kvm, ptep) kvm_page_empty(ptep)
#if STAGE2_PGTABLE_LEVELS == 2
#include <asm/stage2_pgtable-nopmd.h>
@@ -129,10 +131,13 @@ static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
#include <asm/stage2_pgtable-nopud.h>
#endif
+#define stage2_pgd_size(kvm) (PTRS_PER_S2_PGD * sizeof(pgd_t))
-#define stage2_pgd_index(addr) (((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
+#define stage2_pgd_index(kvm, addr) \
+ (((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
-static inline phys_addr_t stage2_pgd_addr_end(phys_addr_t addr, phys_addr_t end)
+static inline phys_addr_t
+stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
{
phys_addr_t boundary = (addr + S2_PGDIR_SIZE) & S2_PGDIR_MASK;
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 257f2a8ccfc7..cd355aa70c61 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -43,7 +43,6 @@ static unsigned long hyp_idmap_start;
static unsigned long hyp_idmap_end;
static phys_addr_t hyp_idmap_vector;
-#define S2_PGD_SIZE (PTRS_PER_S2_PGD * sizeof(pgd_t))
#define hyp_pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t))
#define KVM_S2PTE_FLAG_IS_IOMAP (1UL << 0)
@@ -148,20 +147,20 @@ static void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
static void clear_stage2_pgd_entry(struct kvm *kvm, pgd_t *pgd, phys_addr_t addr)
{
- pud_t *pud_table __maybe_unused = stage2_pud_offset(pgd, 0UL);
- stage2_pgd_clear(pgd);
+ pud_t *pud_table __maybe_unused = stage2_pud_offset(kvm, pgd, 0UL);
+ stage2_pgd_clear(kvm, pgd);
kvm_tlb_flush_vmid_ipa(kvm, addr);
- stage2_pud_free(pud_table);
+ stage2_pud_free(kvm, pud_table);
put_page(virt_to_page(pgd));
}
static void clear_stage2_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
{
- pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(pud, 0);
- VM_BUG_ON(stage2_pud_huge(*pud));
- stage2_pud_clear(pud);
+ pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(kvm, pud, 0);
+ VM_BUG_ON(stage2_pud_huge(kvm, *pud));
+ stage2_pud_clear(kvm, pud);
kvm_tlb_flush_vmid_ipa(kvm, addr);
- stage2_pmd_free(pmd_table);
+ stage2_pmd_free(kvm, pmd_table);
put_page(virt_to_page(pud));
}
@@ -217,7 +216,7 @@ static void unmap_stage2_ptes(struct kvm *kvm, pmd_t *pmd,
}
} while (pte++, addr += PAGE_SIZE, addr != end);
- if (stage2_pte_table_empty(start_pte))
+ if (stage2_pte_table_empty(kvm, start_pte))
clear_stage2_pmd_entry(kvm, pmd, start_addr);
}
@@ -227,9 +226,9 @@ static void unmap_stage2_pmds(struct kvm *kvm, pud_t *pud,
phys_addr_t next, start_addr = addr;
pmd_t *pmd, *start_pmd;
- start_pmd = pmd = stage2_pmd_offset(pud, addr);
+ start_pmd = pmd = stage2_pmd_offset(kvm, pud, addr);
do {
- next = stage2_pmd_addr_end(addr, end);
+ next = stage2_pmd_addr_end(kvm, addr, end);
if (!pmd_none(*pmd)) {
if (pmd_thp_or_huge(*pmd)) {
pmd_t old_pmd = *pmd;
@@ -246,7 +245,7 @@ static void unmap_stage2_pmds(struct kvm *kvm, pud_t *pud,
}
} while (pmd++, addr = next, addr != end);
- if (stage2_pmd_table_empty(start_pmd))
+ if (stage2_pmd_table_empty(kvm, start_pmd))
clear_stage2_pud_entry(kvm, pud, start_addr);
}
@@ -256,14 +255,14 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd,
phys_addr_t next, start_addr = addr;
pud_t *pud, *start_pud;
- start_pud = pud = stage2_pud_offset(pgd, addr);
+ start_pud = pud = stage2_pud_offset(kvm, pgd, addr);
do {
- next = stage2_pud_addr_end(addr, end);
- if (!stage2_pud_none(*pud)) {
- if (stage2_pud_huge(*pud)) {
+ next = stage2_pud_addr_end(kvm, addr, end);
+ if (!stage2_pud_none(kvm, *pud)) {
+ if (stage2_pud_huge(kvm, *pud)) {
pud_t old_pud = *pud;
- stage2_pud_clear(pud);
+ stage2_pud_clear(kvm, pud);
kvm_tlb_flush_vmid_ipa(kvm, addr);
kvm_flush_dcache_pud(old_pud);
put_page(virt_to_page(pud));
@@ -273,7 +272,7 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd,
}
} while (pud++, addr = next, addr != end);
- if (stage2_pud_table_empty(start_pud))
+ if (stage2_pud_table_empty(kvm, start_pud))
clear_stage2_pgd_entry(kvm, pgd, start_addr);
}
@@ -295,7 +294,7 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
phys_addr_t next;
assert_spin_locked(&kvm->mmu_lock);
- pgd = kvm->arch.pgd + stage2_pgd_index(addr);
+ pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
do {
/*
* The page table shouldn't be free'd as we still hold a reference
@@ -303,8 +302,8 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
*/
if (WARN_ON(!READ_ONCE(kvm->arch.pgd)))
break;
- next = stage2_pgd_addr_end(addr, end);
- if (!stage2_pgd_none(*pgd))
+ next = stage2_pgd_addr_end(kvm, addr, end);
+ if (!stage2_pgd_none(kvm, *pgd))
unmap_stage2_puds(kvm, pgd, addr, next);
/*
* If the range is too large, release the kvm->mmu_lock
@@ -333,9 +332,9 @@ static void stage2_flush_pmds(struct kvm *kvm, pud_t *pud,
pmd_t *pmd;
phys_addr_t next;
- pmd = stage2_pmd_offset(pud, addr);
+ pmd = stage2_pmd_offset(kvm, pud, addr);
do {
- next = stage2_pmd_addr_end(addr, end);
+ next = stage2_pmd_addr_end(kvm, addr, end);
if (!pmd_none(*pmd)) {
if (pmd_thp_or_huge(*pmd))
kvm_flush_dcache_pmd(*pmd);
@@ -351,11 +350,11 @@ static void stage2_flush_puds(struct kvm *kvm, pgd_t *pgd,
pud_t *pud;
phys_addr_t next;
- pud = stage2_pud_offset(pgd, addr);
+ pud = stage2_pud_offset(kvm, pgd, addr);
do {
- next = stage2_pud_addr_end(addr, end);
- if (!stage2_pud_none(*pud)) {
- if (stage2_pud_huge(*pud))
+ next = stage2_pud_addr_end(kvm, addr, end);
+ if (!stage2_pud_none(kvm, *pud)) {
+ if (stage2_pud_huge(kvm, *pud))
kvm_flush_dcache_pud(*pud);
else
stage2_flush_pmds(kvm, pud, addr, next);
@@ -371,10 +370,10 @@ static void stage2_flush_memslot(struct kvm *kvm,
phys_addr_t next;
pgd_t *pgd;
- pgd = kvm->arch.pgd + stage2_pgd_index(addr);
+ pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
do {
- next = stage2_pgd_addr_end(addr, end);
- if (!stage2_pgd_none(*pgd))
+ next = stage2_pgd_addr_end(kvm, addr, end);
+ if (!stage2_pgd_none(kvm, *pgd))
stage2_flush_puds(kvm, pgd, addr, next);
} while (pgd++, addr = next, addr != end);
}
@@ -762,7 +761,7 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
}
/* Allocate the HW PGD, making sure that each page gets its own refcount */
- pgd = alloc_pages_exact(S2_PGD_SIZE, GFP_KERNEL | __GFP_ZERO);
+ pgd = alloc_pages_exact(stage2_pgd_size(kvm), GFP_KERNEL | __GFP_ZERO);
if (!pgd)
return -ENOMEM;
@@ -844,7 +843,7 @@ static void kvm_flush_stage2_all(struct kvm *kvm)
{
spin_lock(&kvm->mmu_lock);
if (kvm->arch.pgd)
- unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
+ unmap_stage2_range(kvm, 0, kvm_phys_size(kvm));
spin_unlock(&kvm->mmu_lock);
}
@@ -861,7 +860,7 @@ static void kvm_flush_stage2_all(struct kvm *kvm)
void kvm_free_stage2_pgd(struct kvm *kvm)
{
kvm_flush_stage2_all(kvm);
- free_pages_exact(kvm->arch.pgd, S2_PGD_SIZE);
+ free_pages_exact(kvm->arch.pgd, stage2_pgd_size(kvm));
kvm->arch.pgd = NULL;
}
@@ -871,16 +870,16 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
pgd_t *pgd;
pud_t *pud;
- pgd = kvm->arch.pgd + stage2_pgd_index(addr);
- if (stage2_pgd_none(*pgd)) {
+ pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
+ if (stage2_pgd_none(kvm, *pgd)) {
if (!cache)
return NULL;
pud = mmu_memory_cache_alloc(cache);
- stage2_pgd_populate(pgd, pud);
+ stage2_pgd_populate(kvm, pgd, pud);
get_page(virt_to_page(pgd));
}
- return stage2_pud_offset(pgd, addr);
+ return stage2_pud_offset(kvm, pgd, addr);
}
static pmd_t *stage2_get_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
@@ -893,15 +892,15 @@ static pmd_t *stage2_get_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
if (!pud)
return NULL;
- if (stage2_pud_none(*pud)) {
+ if (stage2_pud_none(kvm, *pud)) {
if (!cache)
return NULL;
pmd = mmu_memory_cache_alloc(cache);
- stage2_pud_populate(pud, pmd);
+ stage2_pud_populate(kvm, pud, pmd);
get_page(virt_to_page(pud));
}
- return stage2_pmd_offset(pud, addr);
+ return stage2_pmd_offset(kvm, pud, addr);
}
static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
@@ -1060,7 +1059,7 @@ static int __kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
if (writable)
pte = kvm_s2pte_mkwrite(pte);
- ret = mmu_topup_memory_cache(&cache, KVM_MMU_CACHE_MIN_PAGES,
+ ret = mmu_topup_memory_cache(&cache, kvm_mmu_cache_min_pages(kvm),
KVM_NR_MEM_OBJS);
if (ret)
goto out;
@@ -1166,19 +1165,20 @@ static void stage2_wp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)
/**
* stage2_wp_pmds - write protect PUD range
+ * @kvm: kvm instance for the VM
* @pud: pointer to pud entry
* @addr: range start address
* @end: range end address
*/
-static void stage2_wp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
+static void stage2_wp_pmds(struct kvm *kvm, pud_t *pud, phys_addr_t addr, phys_addr_t end)
{
pmd_t *pmd;
phys_addr_t next;
- pmd = stage2_pmd_offset(pud, addr);
+ pmd = stage2_pmd_offset(kvm, pud, addr);
do {
- next = stage2_pmd_addr_end(addr, end);
+ next = stage2_pmd_addr_end(kvm, addr, end);
if (!pmd_none(*pmd)) {
if (pmd_thp_or_huge(*pmd)) {
if (!kvm_s2pmd_readonly(pmd))
@@ -1198,18 +1198,18 @@ static void stage2_wp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
*
* Process PUD entries, for a huge PUD we cause a panic.
*/
-static void stage2_wp_puds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
+static void stage2_wp_puds(struct kvm *kvm, pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
{
pud_t *pud;
phys_addr_t next;
- pud = stage2_pud_offset(pgd, addr);
+ pud = stage2_pud_offset(kvm, pgd, addr);
do {
- next = stage2_pud_addr_end(addr, end);
- if (!stage2_pud_none(*pud)) {
+ next = stage2_pud_addr_end(kvm, addr, end);
+ if (!stage2_pud_none(kvm, *pud)) {
/* TODO:PUD not supported, revisit later if supported */
- BUG_ON(stage2_pud_huge(*pud));
- stage2_wp_pmds(pud, addr, next);
+ BUG_ON(stage2_pud_huge(kvm, *pud));
+ stage2_wp_pmds(kvm, pud, addr, next);
}
} while (pud++, addr = next, addr != end);
}
@@ -1225,7 +1225,7 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
pgd_t *pgd;
phys_addr_t next;
- pgd = kvm->arch.pgd + stage2_pgd_index(addr);
+ pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
do {
/*
* Release kvm_mmu_lock periodically if the memory region is
@@ -1239,9 +1239,9 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
cond_resched_lock(&kvm->mmu_lock);
if (WARN_ON(!READ_ONCE(kvm->arch.pgd)))
break;
- next = stage2_pgd_addr_end(addr, end);
- if (stage2_pgd_present(*pgd))
- stage2_wp_puds(pgd, addr, next);
+ next = stage2_pgd_addr_end(kvm, addr, end);
+ if (stage2_pgd_present(kvm, *pgd))
+ stage2_wp_puds(kvm, pgd, addr, next);
} while (pgd++, addr = next, addr != end);
}
@@ -1382,7 +1382,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
up_read(¤t->mm->mmap_sem);
/* We need minimum second+third level pages */
- ret = mmu_topup_memory_cache(memcache, KVM_MMU_CACHE_MIN_PAGES,
+ ret = mmu_topup_memory_cache(memcache, kvm_mmu_cache_min_pages(kvm),
KVM_NR_MEM_OBJS);
if (ret)
return ret;
@@ -1601,7 +1601,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
}
/* Userspace should not be able to register out-of-bounds IPAs */
- VM_BUG_ON(fault_ipa >= KVM_PHYS_SIZE);
+ VM_BUG_ON(fault_ipa >= kvm_phys_size(vcpu->kvm));
if (fault_status == FSC_ACCESS) {
handle_access_fault(vcpu, fault_ipa);
@@ -1901,7 +1901,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
* space addressable by the KVM guest IPA space.
*/
if (memslot->base_gfn + memslot->npages >=
- (KVM_PHYS_SIZE >> PAGE_SHIFT))
+ (kvm_phys_size(kvm) >> PAGE_SHIFT))
return -EFAULT;
down_read(¤t->mm->mmap_sem);
diff --git a/virt/kvm/arm/vgic/vgic-kvm-device.c b/virt/kvm/arm/vgic/vgic-kvm-device.c
index 10ae6f394b71..613ff4abcad5 100644
--- a/virt/kvm/arm/vgic/vgic-kvm-device.c
+++ b/virt/kvm/arm/vgic/vgic-kvm-device.c
@@ -25,7 +25,7 @@
int vgic_check_ioaddr(struct kvm *kvm, phys_addr_t *ioaddr,
phys_addr_t addr, phys_addr_t alignment)
{
- if (addr & ~KVM_PHYS_MASK)
+ if (addr & ~kvm_phys_mask(kvm))
return -E2BIG;
if (!IS_ALIGNED(addr, alignment))
--
2.13.6
So far we have had static stage2 page table handling code, based on a
fixed IPA of 40bits. As we prepare for a configurable IPA size per
VM, make our stage2 page table code dynamic, so that it does the
right thing for a given VM.
Support for the IPA size configuration needs further changes to the way
we configure the EL2 registers (VTTBR and VTCR). So, for now, the IPA
remains fixed at 40bits.
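For concreteness, the dynamic layout works out as follows for a few IPA
sizes (a worked example, assuming the standard arm64
ARM64_HW_PGTABLE_LEVELS()/ARM64_HW_PGTABLE_LEVEL_SHIFT() definitions and
the helpers introduced below):

	/*
	 * 4K host pages (PAGE_SHIFT = 12, 9 bits resolved per level):
	 *   IPA 40: stage2_pt_levels(40) = (36 - 4) / 9  = 3 levels
	 *           stage2_pgdir_shift = 30, stage2_pgd_ptrs = 1 << 10 = 1024
	 *   IPA 48: stage2_pt_levels(48) = (44 - 4) / 9  = 4 levels
	 *           stage2_pgdir_shift = 39, stage2_pgd_ptrs = 1 << 9  = 512
	 *
	 * 64K host pages (PAGE_SHIFT = 16, 13 bits resolved per level):
	 *   IPA 52: stage2_pt_levels(52) = (48 - 4) / 13 = 3 levels
	 *           stage2_pgdir_shift = 42, stage2_pgd_ptrs = 1 << 10 = 1024
	 */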
Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm/include/asm/kvm_mmu.h | 1 +
arch/arm64/include/asm/kvm_mmu.h | 16 +-
arch/arm64/include/asm/stage2_pgtable-nopmd.h | 42 ------
arch/arm64/include/asm/stage2_pgtable-nopud.h | 39 -----
arch/arm64/include/asm/stage2_pgtable.h | 203 +++++++++++++++++---------
virt/kvm/arm/arm.c | 2 +-
6 files changed, 147 insertions(+), 156 deletions(-)
delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopmd.h
delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopud.h
diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index a3312f87a6e0..440c80589453 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -45,6 +45,7 @@
#define kvm_phys_shift(kvm) KVM_PHYS_SHIFT
#define kvm_phys_size(kvm) (_AC(1, ULL) << kvm_phys_shift(kvm))
#define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - _AC(1, ULL))
+#define kvm_vttbr_baddr_mask(kvm) VTTBR_BADDR_MASK
#define stage2_pgd_size(kvm) (PTRS_PER_S2_PGD * sizeof(pgd_t))
int create_hyp_mappings(void *from, void *to, pgprot_t prot);
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index de542aa72d80..df2ee97f4428 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -18,9 +18,10 @@
#ifndef __ARM64_KVM_MMU_H__
#define __ARM64_KVM_MMU_H__
+#include <asm/cpufeature.h>
#include <asm/page.h>
#include <asm/memory.h>
-#include <asm/cpufeature.h>
+#include <asm/kvm_arm.h>
/*
* As ARMv8.0 only has the TTBR0_EL2 register, we cannot express
@@ -138,6 +139,13 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
#define kvm_phys_shift(kvm) KVM_PHYS_SHIFT
#define kvm_phys_size(kvm) (_AC(1, ULL) << kvm_phys_shift(kvm))
#define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - _AC(1, ULL))
+#define kvm_vttbr_baddr_mask(kvm) VTTBR_BADDR_MASK
+
+static inline bool kvm_page_empty(void *ptr)
+{
+ struct page *ptr_page = virt_to_page(ptr);
+ return page_count(ptr_page) == 1;
+}
#include <asm/stage2_pgtable.h>
@@ -203,12 +211,6 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
return kvm_s2pte_readonly((pte_t *)pmd);
}
-static inline bool kvm_page_empty(void *ptr)
-{
- struct page *ptr_page = virt_to_page(ptr);
- return page_count(ptr_page) == 1;
-}
-
#define hyp_pte_table_empty(ptep) kvm_page_empty(ptep)
#ifdef __PAGETABLE_PMD_FOLDED
diff --git a/arch/arm64/include/asm/stage2_pgtable-nopmd.h b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
deleted file mode 100644
index 0280dedbf75f..000000000000
--- a/arch/arm64/include/asm/stage2_pgtable-nopmd.h
+++ /dev/null
@@ -1,42 +0,0 @@
-/*
- * Copyright (C) 2016 - ARM Ltd
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program. If not, see <http://www.gnu.org/licenses/>.
- */
-
-#ifndef __ARM64_S2_PGTABLE_NOPMD_H_
-#define __ARM64_S2_PGTABLE_NOPMD_H_
-
-#include <asm/stage2_pgtable-nopud.h>
-
-#define __S2_PGTABLE_PMD_FOLDED
-
-#define S2_PMD_SHIFT S2_PUD_SHIFT
-#define S2_PTRS_PER_PMD 1
-#define S2_PMD_SIZE (1UL << S2_PMD_SHIFT)
-#define S2_PMD_MASK (~(S2_PMD_SIZE-1))
-
-#define stage2_pud_none(kvm, pud) (0)
-#define stage2_pud_present(kvm, pud) (1)
-#define stage2_pud_clear(kvm, pud) do { } while (0)
-#define stage2_pud_populate(kvm, pud, pmd) do { } while (0)
-#define stage2_pmd_offset(kvm, pud, address) ((pmd_t *)(pud))
-
-#define stage2_pmd_free(kvm, pmd) do { } while (0)
-
-#define stage2_pmd_addr_end(kvm, addr, end) (end)
-
-#define stage2_pud_huge(kvm, pud) (0)
-#define stage2_pmd_table_empty(kvm, pmdp) (0)
-
-#endif
diff --git a/arch/arm64/include/asm/stage2_pgtable-nopud.h b/arch/arm64/include/asm/stage2_pgtable-nopud.h
deleted file mode 100644
index cd6304e203be..000000000000
--- a/arch/arm64/include/asm/stage2_pgtable-nopud.h
+++ /dev/null
@@ -1,39 +0,0 @@
-/*
- * Copyright (C) 2016 - ARM Ltd
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program. If not, see <http://www.gnu.org/licenses/>.
- */
-
-#ifndef __ARM64_S2_PGTABLE_NOPUD_H_
-#define __ARM64_S2_PGTABLE_NOPUD_H_
-
-#define __S2_PGTABLE_PUD_FOLDED
-
-#define S2_PUD_SHIFT S2_PGDIR_SHIFT
-#define S2_PTRS_PER_PUD 1
-#define S2_PUD_SIZE (_AC(1, UL) << S2_PUD_SHIFT)
-#define S2_PUD_MASK (~(S2_PUD_SIZE-1))
-
-#define stage2_pgd_none(kvm, pgd) (0)
-#define stage2_pgd_present(kvm, pgd) (1)
-#define stage2_pgd_clear(kvm, pgd) do { } while (0)
-#define stage2_pgd_populate(kvm, pgd, pud) do { } while (0)
-
-#define stage2_pud_offset(kvm, pgd, address) ((pud_t *)(pgd))
-
-#define stage2_pud_free(kvm, x) do { } while (0)
-
-#define stage2_pud_addr_end(kvm, addr, end) (end)
-#define stage2_pud_table_empty(kvm, pmdp) (0)
-
-#endif
diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
index 057a405fa727..33e8ebb25037 100644
--- a/arch/arm64/include/asm/stage2_pgtable.h
+++ b/arch/arm64/include/asm/stage2_pgtable.h
@@ -21,6 +21,9 @@
#include <asm/pgtable.h>
+/* The PGDIR shift for a given page table with "n" levels. */
+#define pt_levels_pgdir_shift(n) ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - (n))
+
/*
* The hardware supports concatenation of up to 16 tables at stage2 entry level
* and we use the feature whenever possible.
@@ -29,118 +32,184 @@
* On arm64, the smallest PAGE_SIZE supported is 4k, which means
* (PAGE_SHIFT - 3) > 4 holds for all page sizes.
* This implies, the total number of page table levels at stage2 expected
- * by the hardware is actually the number of levels required for (KVM_PHYS_SHIFT - 4)
+ * by the hardware is actually the number of levels required for (IPA_SHIFT - 4)
* in normal translations(e.g, stage1), since we cannot have another level in
- * the range (KVM_PHYS_SHIFT, KVM_PHYS_SHIFT - 4).
- */
-#define STAGE2_PGTABLE_LEVELS ARM64_HW_PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4)
-
-/*
- * With all the supported VA_BITs and 40bit guest IPA, the following condition
- * is always true:
- *
- * STAGE2_PGTABLE_LEVELS <= CONFIG_PGTABLE_LEVELS
- *
- * We base our stage-2 page table walker helpers on this assumption and
- * fall back to using the host version of the helper wherever possible.
- * i.e, if a particular level is not folded (e.g, PUD) at stage2, we fall back
- * to using the host version, since it is guaranteed it is not folded at host.
- *
- * If the condition breaks in the future, we can rearrange the host level
- * definitions and reuse them for stage2. Till then...
+ * the range (IPA_SHIFT, IPA_SHIFT - 4).
*/
-#if STAGE2_PGTABLE_LEVELS > CONFIG_PGTABLE_LEVELS
-#error "Unsupported combination of guest IPA and host VA_BITS."
-#endif
-
-/* S2_PGDIR_SHIFT is the size mapped by top-level stage2 entry */
-#define S2_PGDIR_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - STAGE2_PGTABLE_LEVELS)
-#define S2_PGDIR_SIZE (_AC(1, UL) << S2_PGDIR_SHIFT)
-#define S2_PGDIR_MASK (~(S2_PGDIR_SIZE - 1))
+#define stage2_pt_levels(ipa_shift) ARM64_HW_PGTABLE_LEVELS((ipa_shift) - 4)
/*
* The number of PTRS across all concatenated stage2 tables given by the
* number of bits resolved at the initial level.
*/
-#define PTRS_PER_S2_PGD (1 << (KVM_PHYS_SHIFT - S2_PGDIR_SHIFT))
+#define __s2_pgd_ptrs(pa, lvls) (1 << ((pa) - pt_levels_pgdir_shift((lvls))))
+
+#define kvm_stage2_levels(kvm) stage2_pt_levels(kvm_phys_shift(kvm))
+#define stage2_pgdir_shift(kvm) \
+ pt_levels_pgdir_shift(kvm_stage2_levels(kvm))
+#define stage2_pgdir_size(kvm) (_AC(1, UL) << stage2_pgdir_shift((kvm)))
+#define stage2_pgdir_mask(kvm) (~(stage2_pgdir_size((kvm)) - 1))
+#define stage2_pgd_ptrs(kvm) \
+ __s2_pgd_ptrs(kvm_phys_shift(kvm), kvm_stage2_levels(kvm))
+
/*
* kvm_mmu_cache_min_pages is the number of stage2 page table translation
* levels in addition to the PGD.
*/
-#define kvm_mmu_cache_min_pages(kvm) (STAGE2_PGTABLE_LEVELS - 1)
+#define kvm_mmu_cache_min_pages(kvm) (kvm_stage2_levels(kvm) - 1)
+
+/* PUD/PMD definitions if present */
+#define __S2_PUD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(1)
+#define __S2_PUD_SIZE (_AC(1, UL) << __S2_PUD_SHIFT)
+#define __S2_PUD_MASK (~(__S2_PUD_SIZE - 1))
-#if STAGE2_PGTABLE_LEVELS > 3
+#define __S2_PMD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(2)
+#define __S2_PMD_SIZE (_AC(1, UL) << __S2_PMD_SHIFT)
+#define __S2_PMD_MASK (~(__S2_PMD_SIZE - 1))
-#define S2_PUD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(1)
-#define S2_PUD_SIZE (_AC(1, UL) << S2_PUD_SHIFT)
-#define S2_PUD_MASK (~(S2_PUD_SIZE - 1))
+#define __s2_pud_index(addr) \
+ (((addr) >> __S2_PUD_SHIFT) & (PTRS_PER_PTE - 1))
+#define __s2_pmd_index(addr) \
+ (((addr) >> __S2_PMD_SHIFT) & (PTRS_PER_PTE - 1))
+
+static inline int stage2_pgd_none(struct kvm *kvm, pgd_t pgd)
+{
+ return (kvm_stage2_levels(kvm) > 3) ? __raw_pgd_none(pgd) : 0;
+}
+
+static inline void stage2_pgd_clear(struct kvm *kvm, pgd_t *pgdp)
+{
+ if (kvm_stage2_levels(kvm) > 3)
+ __raw_pgd_clear(pgdp);
+}
+
+static inline int stage2_pgd_present(struct kvm *kvm, pgd_t pgd)
+{
+ return kvm_stage2_levels(kvm) > 3 ? __raw_pgd_present(pgd) : 1;
+}
+
+static inline void stage2_pgd_populate(struct kvm *kvm, pgd_t *pgdp, pud_t *pud)
+{
+ if (kvm_stage2_levels(kvm) > 3)
+ __raw_pgd_populate(pgdp, __pa(pud), PUD_TYPE_TABLE);
+ else
+ BUG();
+}
+
+static inline pud_t *stage2_pud_offset(struct kvm *kvm,
+ pgd_t *pgd, unsigned long address)
+{
+ if (kvm_stage2_levels(kvm) > 3) {
+ phys_addr_t pud_phys = __raw_pgd_page_paddr(*pgd);
+
+ pud_phys += __s2_pud_index(address) * sizeof(pud_t);
+ return __va(pud_phys);
+ }
+ return (pud_t *)pgd;
+}
-#define stage2_pgd_none(kvm, pgd) pgd_none(pgd)
-#define stage2_pgd_clear(kvm, pgd) pgd_clear(pgd)
-#define stage2_pgd_present(kvm, pgd) pgd_present(pgd)
-#define stage2_pgd_populate(kvm, pgd, pud) pgd_populate(NULL, pgd, pud)
-#define stage2_pud_offset(kvm, pgd, address) pud_offset(pgd, address)
-#define stage2_pud_free(kvm, pud) pud_free(NULL, pud)
+static inline void stage2_pud_free(struct kvm *kvm, pud_t *pud)
+{
+ if (kvm_stage2_levels(kvm) > 3)
+ __raw_pud_free(pud);
+}
-#define stage2_pud_table_empty(kvm, pudp) kvm_page_empty(pudp)
+static inline int stage2_pud_table_empty(struct kvm *kvm, pud_t *pudp)
+{
+ return kvm_stage2_levels(kvm) > 3 && kvm_page_empty(pudp);
+}
static inline phys_addr_t
stage2_pud_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
{
- phys_addr_t boundary = (addr + S2_PUD_SIZE) & S2_PUD_MASK;
+ if (kvm_stage2_levels(kvm) > 3) {
+ phys_addr_t boundary = (addr + __S2_PUD_SIZE) & __S2_PUD_MASK;
- return (boundary - 1 < end - 1) ? boundary : end;
+ return (boundary - 1 < end - 1) ? boundary : end;
+ }
+ return end;
}
-#endif /* STAGE2_PGTABLE_LEVELS > 3 */
+static inline int stage2_pud_none(struct kvm *kvm, pud_t pud)
+{
+ return kvm_stage2_levels(kvm) > 2 ? __raw_pud_none(pud) : 0;
+}
+static inline void stage2_pud_clear(struct kvm *kvm, pud_t *pudp)
+{
+ if (kvm_stage2_levels(kvm) > 2)
+ __raw_pud_clear(pudp);
+}
-#if STAGE2_PGTABLE_LEVELS > 2
+static inline int stage2_pud_present(struct kvm *kvm, pud_t pud)
+{
+ return kvm_stage2_levels(kvm) > 2 ? __raw_pud_present(pud) : 1;
+}
-#define S2_PMD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(2)
-#define S2_PMD_SIZE (_AC(1, UL) << S2_PMD_SHIFT)
-#define S2_PMD_MASK (~(S2_PMD_SIZE - 1))
+static inline void stage2_pud_populate(struct kvm *kvm, pud_t *pudp, pmd_t *pmd)
+{
+ if (kvm_stage2_levels(kvm) > 2)
+ __raw_pud_populate(pudp, __pa(pmd), PMD_TYPE_TABLE);
+ else
+ BUG();
+}
-#define stage2_pud_none(kvm, pud) pud_none(pud)
-#define stage2_pud_clear(kvm, pud) pud_clear(pud)
-#define stage2_pud_present(kvm, pud) pud_present(pud)
-#define stage2_pud_populate(kvm, pud, pmd) pud_populate(NULL, pud, pmd)
-#define stage2_pmd_offset(kvm, pud, address) pmd_offset(pud, address)
-#define stage2_pmd_free(kvm, pmd) pmd_free(NULL, pmd)
+static inline pmd_t *stage2_pmd_offset(struct kvm *kvm,
+ pud_t *pud, unsigned long address)
+{
+ if (kvm_stage2_levels(kvm) > 2) {
+ phys_addr_t pmd_phys = __raw_pud_page_paddr(*pud);
-#define stage2_pud_huge(kvm, pud) pud_huge(pud)
-#define stage2_pmd_table_empty(kvm, pmdp) kvm_page_empty(pmdp)
+ pmd_phys += __s2_pmd_index(address) * sizeof(pmd_t);
+ return __va(pmd_phys);
+ }
+ return (pmd_t *)pud;
+}
+
+static inline void stage2_pmd_free(struct kvm *kvm, pmd_t *pmd)
+{
+ if (kvm_stage2_levels(kvm) > 2)
+ __raw_pmd_free(pmd);
+}
+
+static inline int stage2_pmd_table_empty(struct kvm *kvm, pmd_t *pmdp)
+{
+ return kvm_stage2_levels(kvm) > 2 && kvm_page_empty(pmdp);
+}
static inline phys_addr_t
stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
{
- phys_addr_t boundary = (addr + S2_PMD_SIZE) & S2_PMD_MASK;
+ if (kvm_stage2_levels(kvm) > 2) {
+ phys_addr_t boundary = (addr + __S2_PMD_SIZE) & __S2_PMD_MASK;
- return (boundary - 1 < end - 1) ? boundary : end;
+ return (boundary - 1 < end - 1) ? boundary : end;
+ }
+ return end;
}
-#endif /* STAGE2_PGTABLE_LEVELS > 2 */
+static inline int stage2_pud_huge(struct kvm *kvm, pud_t pud)
+{
+ return kvm_stage2_levels(kvm) > 2 ? __raw_pud_huge(pud) : 0;
+}
#define stage2_pte_table_empty(kvm, ptep) kvm_page_empty(ptep)
-#if STAGE2_PGTABLE_LEVELS == 2
-#include <asm/stage2_pgtable-nopmd.h>
-#elif STAGE2_PGTABLE_LEVELS == 3
-#include <asm/stage2_pgtable-nopud.h>
-#endif
+#define stage2_pgd_size(kvm) (stage2_pgd_ptrs(kvm) * sizeof(pgd_t))
-#define stage2_pgd_size(kvm) (PTRS_PER_S2_PGD * sizeof(pgd_t))
-
-#define stage2_pgd_index(kvm, addr) \
- (((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
+static inline unsigned long stage2_pgd_index(struct kvm *kvm, phys_addr_t addr)
+{
+ return (addr >> stage2_pgdir_shift(kvm)) & (stage2_pgd_ptrs(kvm) - 1);
+}
static inline phys_addr_t
stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
{
- phys_addr_t boundary = (addr + S2_PGDIR_SIZE) & S2_PGDIR_MASK;
+ phys_addr_t boundary;
+ boundary = (addr + stage2_pgdir_size(kvm)) & stage2_pgdir_mask(kvm);
return (boundary - 1 < end - 1) ? boundary : end;
}
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index d06f00566664..8564ed907b18 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -502,7 +502,7 @@ static void update_vttbr(struct kvm *kvm)
/* update vttbr to be used with the new vmid */
pgd_phys = virt_to_phys(kvm->arch.pgd);
- BUG_ON(pgd_phys & ~VTTBR_BADDR_MASK);
+ BUG_ON(pgd_phys & ~kvm_vttbr_baddr_mask(kvm));
vmid = ((u64)(kvm->arch.vmid) << VTTBR_VMID_SHIFT) & VTTBR_VMID_MASK(kvm_vmid_bits);
kvm->arch.vttbr = kvm_phys_to_vttbr(pgd_phys) | vmid;
--
2.13.6
With a 4-level stage2 page table, a pgd entry can legitimately be empty,
unlike with a 3-level page table. Remove the spurious WARN_ON() in
stage2_get_pud().
Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
virt/kvm/arm/mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index e6548c85c495..78253fe00fc4 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -870,7 +870,7 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
pud_t *pud;
pgd = kvm->arch.pgd + stage2_pgd_index(addr);
- if (WARN_ON(stage2_pgd_none(*pgd))) {
+ if (stage2_pgd_none(*pgd)) {
if (!cache)
return NULL;
pud = mmu_memory_cache_alloc(cache);
--
2.13.6
Add a helper to convert ID_AA64MMFR0_EL1:PARange to the physical
address size shift. Limit the size to the maximum supported by the kernel.
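For illustration, a caller (as the later KVM patches in this series do)
would typically derive the host's limit as follows:

	u64 mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
	unsigned int parange = mmfr0 & 7;	/* ID_AA64MMFR0_EL1.PARange */
	u32 phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange);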
Cc: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Marc Zyngier <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm64/include/asm/cpufeature.h | 16 ++++++++++++++++
arch/arm64/kvm/hyp/s2-setup.c | 28 +++++-----------------------
2 files changed, 21 insertions(+), 23 deletions(-)
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index ac67cfc2585a..0564e14616eb 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -304,6 +304,22 @@ static inline u64 read_zcr_features(void)
return zcr;
}
+static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange)
+{
+ switch (parange) {
+ case 0: return 32;
+ case 1: return 36;
+ case 2: return 40;
+ case 3: return 42;
+ case 4: return 44;
+
+ default:
+ case 5: return 48;
+#ifdef CONFIG_ARM64_PA_BITS_52
+ case 6: return 52;
+#endif
+ }
+}
#endif /* __ASSEMBLY__ */
#endif
diff --git a/arch/arm64/kvm/hyp/s2-setup.c b/arch/arm64/kvm/hyp/s2-setup.c
index 603e1ee83e89..b1129c83c531 100644
--- a/arch/arm64/kvm/hyp/s2-setup.c
+++ b/arch/arm64/kvm/hyp/s2-setup.c
@@ -19,11 +19,13 @@
#include <asm/kvm_arm.h>
#include <asm/kvm_asm.h>
#include <asm/kvm_hyp.h>
+#include <asm/cpufeature.h>
u32 __hyp_text __init_stage2_translation(void)
{
u64 val = VTCR_EL2_FLAGS;
u64 parange;
+ u32 phys_shift;
u64 tmp;
/*
@@ -37,27 +39,7 @@ u32 __hyp_text __init_stage2_translation(void)
val |= parange << 16;
/* Compute the actual PARange... */
- switch (parange) {
- case 0:
- parange = 32;
- break;
- case 1:
- parange = 36;
- break;
- case 2:
- parange = 40;
- break;
- case 3:
- parange = 42;
- break;
- case 4:
- parange = 44;
- break;
- case 5:
- default:
- parange = 48;
- break;
- }
+ phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange);
/*
* ... and clamp it to 40 bits, unless we have some braindead
@@ -65,7 +47,7 @@ u32 __hyp_text __init_stage2_translation(void)
* return that value for the rest of the kernel to decide what
* to do.
*/
- val |= 64 - (parange > 40 ? 40 : parange);
+ val |= 64 - (phys_shift > 40 ? 40 : phys_shift);
/*
* Check the availability of Hardware Access Flag / Dirty Bit
@@ -86,5 +68,5 @@ u32 __hyp_text __init_stage2_translation(void)
write_sysreg(val, vtcr_el2);
- return parange;
+ return phys_shift;
}
--
2.13.6
Add helpers for encoding/decoding a 52bit address in the GICv3 ITS BASER
register. When the ITS uses 64K page size, the 52bits of physical address
are encoded in BASER[47:12] as follows:
Bits[47:16] of the register => bits[47:16] of the physical address
Bits[15:12] of the register => bits[51:48] of the physical address
Bits[15:0] of the physical address are 0.
Also add a mask for the CBASER address. This will be used for adding 52bit
support for the VGIC ITS. More importantly, the upper bits are ignored if
52bit support is not enabled.
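A worked example with the new macros (assuming CONFIG_ARM64_PA_BITS_52):

	/*
	 *   phys                           = 0x000f123456780000  (PA[51:48] = 0xf)
	 *   GITS_BASER_ADDR64K_FROM_PHYS() = 0x000012345678f000  (PA[51:48] -> reg[15:12])
	 *   GITS_BASER_ADDR64K_TO_PHYS()   = 0x000f123456780000  (round trip)
	 *
	 * Without CONFIG_ARM64_PA_BITS_52, GITS_PA_HI_MASK is 0, so reg[15:12]
	 * is simply ignored when decoding.
	 */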
Cc: Shanker Donthineni <[email protected]>
Cc: Marc Zyngier <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
drivers/irqchip/irq-gic-v3-its.c | 2 +-
include/linux/irqchip/arm-gic-v3.h | 32 ++++++++++++++++++++++++++++++--
2 files changed, 31 insertions(+), 3 deletions(-)
diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 4039e64cd342..e6aa84f806f7 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -1615,7 +1615,7 @@ static int its_setup_baser(struct its_node *its, struct its_baser *baser,
}
/* Convert 52bit PA to 48bit field */
- baser_phys = GITS_BASER_PHYS_52_to_48(baser_phys);
+ baser_phys = GITS_BASER_ADDR64K_FROM_PHYS(baser_phys);
}
retry_baser:
diff --git a/include/linux/irqchip/arm-gic-v3.h b/include/linux/irqchip/arm-gic-v3.h
index c00c4c33e432..b880b6682fa6 100644
--- a/include/linux/irqchip/arm-gic-v3.h
+++ b/include/linux/irqchip/arm-gic-v3.h
@@ -320,6 +320,15 @@
#define GITS_IIDR_REV(r) (((r) >> GITS_IIDR_REV_SHIFT) & 0xf)
#define GITS_IIDR_PRODUCTID_SHIFT 24
+#ifdef CONFIG_ARM64_PA_BITS_52
+#define GITS_PA_HI_MASK (0xfULL)
+#define GITS_PA_SHIFT 52
+#else
+/* Do not use the bits [51-48] if we don't support 52bit */
+#define GITS_PA_HI_MASK 0
+#define GITS_PA_SHIFT 48
+#endif
+
#define GITS_CBASER_VALID (1ULL << 63)
#define GITS_CBASER_SHAREABILITY_SHIFT (10)
#define GITS_CBASER_INNER_CACHEABILITY_SHIFT (59)
@@ -343,6 +352,7 @@
#define GITS_CBASER_WaWb GIC_BASER_CACHEABILITY(GITS_CBASER, INNER, WaWb)
#define GITS_CBASER_RaWaWt GIC_BASER_CACHEABILITY(GITS_CBASER, INNER, RaWaWt)
#define GITS_CBASER_RaWaWb GIC_BASER_CACHEABILITY(GITS_CBASER, INNER, RaWaWb)
+#define GITS_CBASER_ADDRESS(x) ((x) & GENMASK_ULL(GITS_PA_SHIFT, 12))
#define GITS_BASER_NR_REGS 8
@@ -373,8 +383,26 @@
#define GITS_BASER_ENTRY_SIZE_SHIFT (48)
#define GITS_BASER_ENTRY_SIZE(r) ((((r) >> GITS_BASER_ENTRY_SIZE_SHIFT) & 0x1f) + 1)
#define GITS_BASER_ENTRY_SIZE_MASK GENMASK_ULL(52, 48)
-#define GITS_BASER_PHYS_52_to_48(phys) \
- (((phys) & GENMASK_ULL(47, 16)) | (((phys) >> 48) & 0xf) << 12)
+
+/*
+ * With 64K page size, the physical address can be upto 52bit and
+ * uses the following encoding in the GITS_BASER[47:12]:
+ *
+ * Bits[47:16] of the register => bits[47:16] of the base physical address.
+ * Bits[15:12] of the register => bits[51:48] of the base physical address.
+ * bits[15:0] of the base physical address are 0.
+ * Clear the upper bits if the kernel doesn't support 52bits.
+ */
+#define GITS_BASER_ADDR64K_LO_MASK GENMASK_ULL(47, 16)
+#define GITS_BASER_ADDR64K_HI_SHIFT 12
+#define GITS_BASER_ADDR64K_HI_MOVE (48 - GITS_BASER_ADDR64K_HI_SHIFT)
+#define GITS_BASER_ADDR64K_HI_MASK (GITS_PA_HI_MASK << GITS_BASER_ADDR64K_HI_SHIFT)
+#define GITS_BASER_ADDR64K_TO_PHYS(x) \
+ (((x) & GITS_BASER_ADDR64K_LO_MASK) | \
+ (((x) & GITS_BASER_ADDR64K_HI_MASK) << GITS_BASER_ADDR64K_HI_MOVE))
+#define GITS_BASER_ADDR64K_FROM_PHYS(p) \
+ (((p) & GITS_BASER_ADDR64K_LO_MASK) | \
+ (((p) >> GITS_BASER_ADDR64K_HI_MOVE) & GITS_BASER_ADDR64K_HI_MASK))
#define GITS_BASER_SHAREABILITY_SHIFT (10)
#define GITS_BASER_InnerShareable \
GIC_BASER_SHAREABILITY(GITS_BASER, InnerShareable)
--
2.13.6
On Tue, Jan 09, 2018 at 07:03:56PM +0000, Suzuki K Poulose wrote:
> virtio-mmio using virtio-v1 and virtio legacy pci use a 32bit PFN
> for the queue. If the queue pfn is too large to fit in 32bits, which
> we could hit on arm64 systems with 52bit physical addresses (even with
> 64K page size), we simply miss out a proper link to the other side of
> the queue.
>
> Add a check to validate the PFN, rather than silently breaking
> the devices.
>
> Cc: "Michael S. Tsirkin" <[email protected]>
> Cc: Jason Wang <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
Could you guys please work on virtio 1 support
for virtio mmio in qemu though?
It's not a lot of code.
> ---
> drivers/virtio/virtio_mmio.c | 19 ++++++++++++++++---
> drivers/virtio/virtio_pci_legacy.c | 11 +++++++++--
> 2 files changed, 25 insertions(+), 5 deletions(-)
I'd rather see this as 2 patches.
> diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
> index a9192fe4f345..47109baf37f7 100644
> --- a/drivers/virtio/virtio_mmio.c
> +++ b/drivers/virtio/virtio_mmio.c
> @@ -358,6 +358,7 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
> struct virtqueue *vq;
> unsigned long flags;
> unsigned int num;
> + u64 addr;
> int err;
>
> if (!name)
> @@ -394,16 +395,26 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
> goto error_new_virtqueue;
> }
>
> + addr = virtqueue_get_desc_addr(vq);
> + /*
> + * virtio-mmio v1 uses a 32bit QUEUE PFN. If we have something that
> + * doesn't fit in 32bit, fail the setup rather than pretending to
> + * be successful.
> + */
> + if (vm_dev->version == 1 && (addr >> (PAGE_SHIFT + 32))) {
> + dev_err(&vdev->dev, "virtio-mmio: queue address too large\n");
> + err = -ENOMEM;
> + goto error_bad_pfn;
> + }
> +
Can you please move this below to where it's actually used?
> /* Activate the queue */
> writel(virtqueue_get_vring_size(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_NUM);
> if (vm_dev->version == 1) {
> writel(PAGE_SIZE, vm_dev->base + VIRTIO_MMIO_QUEUE_ALIGN);
> - writel(virtqueue_get_desc_addr(vq) >> PAGE_SHIFT,
> + writel(addr >> PAGE_SHIFT,
> vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
> } else {
> - u64 addr;
>
> - addr = virtqueue_get_desc_addr(vq);
> writel((u32)addr, vm_dev->base + VIRTIO_MMIO_QUEUE_DESC_LOW);
> writel((u32)(addr >> 32),
> vm_dev->base + VIRTIO_MMIO_QUEUE_DESC_HIGH);
> @@ -430,6 +441,8 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
>
> return vq;
>
> +error_bad_pfn:
> + vring_del_virtqueue(vq);
> error_new_virtqueue:
> if (vm_dev->version == 1) {
> writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
> diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c
> index 2780886e8ba3..099d2cfb47b3 100644
> --- a/drivers/virtio/virtio_pci_legacy.c
> +++ b/drivers/virtio/virtio_pci_legacy.c
> @@ -122,6 +122,7 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
> struct virtqueue *vq;
> u16 num;
> int err;
> + u64 q_pfn;
>
> /* Select the queue we're interested in */
> iowrite16(index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
> @@ -141,9 +142,15 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
> if (!vq)
> return ERR_PTR(-ENOMEM);
>
> + q_pfn = virtqueue_get_desc_addr(vq) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT;
> + if (q_pfn >> 32) {
> + dev_err(&vp_dev->pci_dev->dev, "virtio-pci queue PFN too large\n");
> + err = -ENOMEM;
> + goto out_deactivate;
You never set up the address, so it's cleaner to add another target
and not reset it.
> + }
> +
> /* activate the queue */
> - iowrite32(virtqueue_get_desc_addr(vq) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
> - vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
> + iowrite32((u32)q_pfn, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
>
> vq->priv = (void __force *)vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY;
>
> --
> 2.13.6
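As an illustration of the failure mode described in the commit message: with a 64K page kernel, a descriptor ring placed at or above 1 << 48 still yields a PFN that needs more than 32 bits, which is exactly what the new check rejects. A minimal sketch of the condition follows; PAGE_SHIFT and queue_pfn_fits() here are assumptions for the example, not kernel code.

/*
 * Illustrative sketch of the overflow guarded against in vm_setup_vq():
 * virtio-mmio v1 only has a 32bit QUEUE_PFN register to write into.
 */
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 16 /* assume a 64K page kernel */

static bool queue_pfn_fits(uint64_t desc_addr)
{
	/* same test as the patch: anything at or above bit (PAGE_SHIFT + 32) is lost */
	return (desc_addr >> (PAGE_SHIFT + 32)) == 0;
}

/*
 * Example: a ring at 1ULL << 49 gives PFN = 1 << 33, which cannot be
 * represented in the 32bit register, so queue_pfn_fits() returns false
 * and the setup should fail cleanly instead of silently truncating.
 */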
On 09/01/18 23:29, Michael S. Tsirkin wrote:
> On Tue, Jan 09, 2018 at 07:03:56PM +0000, Suzuki K Poulose wrote:
>> virtio-mmio using virtio-v1 and virtio legacy pci use a 32bit PFN
>> for the queue. If the queue pfn is too large to fit in 32bits, which
>> we could hit on arm64 systems with 52bit physical addresses (even with
>> 64K page size), we simply miss out a proper link to the other side of
>> the queue.
>>
>> Add a check to validate the PFN, rather than silently breaking
>> the devices.
>>
>> Cc: "Michael S. Tsirkin" <[email protected]>
>> Cc: Jason Wang <[email protected]>
>> Cc: Marc Zyngier <[email protected]>
>> Cc: Christoffer Dall <[email protected]>
>> Signed-off-by: Suzuki K Poulose <[email protected]>
>
> Could you guys please work on virtio 1 support in
> for virtio mmio in qemu though?
> It's not a lot of code.
Did you mean kvmtool? Qemu already supports virtio-1.
>
>> ---
>> drivers/virtio/virtio_mmio.c | 19 ++++++++++++++++---
>> drivers/virtio/virtio_pci_legacy.c | 11 +++++++++--
>> 2 files changed, 25 insertions(+), 5 deletions(-)
>
> I'd rather see this as 2 patches.
OK, I will split them.
>
>> diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
>> index a9192fe4f345..47109baf37f7 100644
>> --- a/drivers/virtio/virtio_mmio.c
>> +++ b/drivers/virtio/virtio_mmio.c
>> @@ -358,6 +358,7 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
>> struct virtqueue *vq;
>> unsigned long flags;
>> unsigned int num;
>> + u64 addr;
>> int err;
>>
>> if (!name)
>> @@ -394,16 +395,26 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
>> goto error_new_virtqueue;
>> }
>>
>> + addr = virtqueue_get_desc_addr(vq);
>> + /*
>> + * virtio-mmio v1 uses a 32bit QUEUE PFN. If we have something that
>> + * doesn't fit in 32bit, fail the setup rather than pretending to
>> + * be successful.
>> + */
>> + if (vm_dev->version == 1 && (addr >> (PAGE_SHIFT + 32))) {
>> + dev_err(&vdev->dev, "virtio-mmio: queue address too large\n");
>> + err = -ENOMEM;
>> + goto error_bad_pfn;
>> + }
>> +
>
> Can you please move this below to where it's actually used?
>
The reason for keeping it here was to skip selecting the queue number if we
have a bad PFN. Maybe it doesn't make much difference, as we write PFN = 0
further down anyway.
>> /* Activate the queue */
>> writel(virtqueue_get_vring_size(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_NUM);
>> if (vm_dev->version == 1) {
>> writel(PAGE_SIZE, vm_dev->base + VIRTIO_MMIO_QUEUE_ALIGN);
>> - writel(virtqueue_get_desc_addr(vq) >> PAGE_SHIFT,
>> + writel(addr >> PAGE_SHIFT,
>> vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
>> } else {
>> - u64 addr;
>>
>> - addr = virtqueue_get_desc_addr(vq);
>> writel((u32)addr, vm_dev->base + VIRTIO_MMIO_QUEUE_DESC_LOW);
>> writel((u32)(addr >> 32),
>> vm_dev->base + VIRTIO_MMIO_QUEUE_DESC_HIGH);
>> @@ -430,6 +441,8 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
>>
>> return vq;
>>
>> +error_bad_pfn:
>> + vring_del_virtqueue(vq);
>> error_new_virtqueue:
>> if (vm_dev->version == 1) {
>> writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
>> diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c
>> index 2780886e8ba3..099d2cfb47b3 100644
>> --- a/drivers/virtio/virtio_pci_legacy.c
>> +++ b/drivers/virtio/virtio_pci_legacy.c
>> @@ -122,6 +122,7 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
>> struct virtqueue *vq;
>> u16 num;
>> int err;
>> + u64 q_pfn;
>>
>> /* Select the queue we're interested in */
>> iowrite16(index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
>> @@ -141,9 +142,15 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
>> if (!vq)
>> return ERR_PTR(-ENOMEM);
>>
>> + q_pfn = virtqueue_get_desc_addr(vq) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT;
>> + if (q_pfn >> 32) {
>> + dev_err(&vp_dev->pci_dev->dev, "virtio-pci queue PFN too large\n");
>> + err = -ENOMEM;
>> + goto out_deactivate;
>
> You never set up the address, it's cleaner to add another target
> and not reset it.
That's right. However, the only thing we do is write PFN=0, which would be a good
thing to do to indicate the error to the host? I could skip it if you think it is
not needed.
Thanks
Suzuki
On Wed, Jan 10, 2018 at 10:54:09AM +0000, Suzuki K Poulose wrote:
> On 09/01/18 23:29, Michael S. Tsirkin wrote:
> > On Tue, Jan 09, 2018 at 07:03:56PM +0000, Suzuki K Poulose wrote:
> > > virtio-mmio using virtio-v1 and virtio legacy pci use a 32bit PFN
> > > for the queue. If the queue pfn is too large to fit in 32bits, which
> > > we could hit on arm64 systems with 52bit physical addresses (even with
> > > 64K page size), we simply miss out a proper link to the other side of
> > > the queue.
> > >
> > > Add a check to validate the PFN, rather than silently breaking
> > > the devices.
> > >
> > > Cc: "Michael S. Tsirkin" <[email protected]>
> > > Cc: Jason Wang <[email protected]>
> > > Cc: Marc Zyngier <[email protected]>
> > > Cc: Christoffer Dall <[email protected]>
> > > Signed-off-by: Suzuki K Poulose <[email protected]>
> >
> > Could you guys please work on virtio 1 support in
> > for virtio mmio in qemu though?
> > It's not a lot of code.
>
> Did you mean kvmtool ? Qemu already supports virto-1.
For virtio-mmio? I don't seem to see that code in
hw/virtio/virtio-mmio.c.
For example, I still see handling for VIRTIO_MMIO_QUEUE_PFN
there, and no handling for VIRTIO_MMIO_QUEUE_DESC_LOW
and such.
What am I missing?
> >
> > > ---
> > > drivers/virtio/virtio_mmio.c | 19 ++++++++++++++++---
> > > drivers/virtio/virtio_pci_legacy.c | 11 +++++++++--
> > > 2 files changed, 25 insertions(+), 5 deletions(-)
> >
> > I'd rather see this as 2 patches.
>
> OK, I will split them.
>
> >
> > > diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
> > > index a9192fe4f345..47109baf37f7 100644
> > > --- a/drivers/virtio/virtio_mmio.c
> > > +++ b/drivers/virtio/virtio_mmio.c
> > > @@ -358,6 +358,7 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
> > > struct virtqueue *vq;
> > > unsigned long flags;
> > > unsigned int num;
> > > + u64 addr;
> > > int err;
> > > if (!name)
> > > @@ -394,16 +395,26 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
> > > goto error_new_virtqueue;
> > > }
> > > + addr = virtqueue_get_desc_addr(vq);
> > > + /*
> > > + * virtio-mmio v1 uses a 32bit QUEUE PFN. If we have something that
> > > + * doesn't fit in 32bit, fail the setup rather than pretending to
> > > + * be successful.
> > > + */
> > > + if (vm_dev->version == 1 && (addr >> (PAGE_SHIFT + 32))) {
> > > + dev_err(&vdev->dev, "virtio-mmio: queue address too large\n");
> > > + err = -ENOMEM;
> > > + goto error_bad_pfn;
> > > + }
> > > +
> >
> > Can you please move this below to where it's actually used?
> >
>
> The reason for keeping it here was to skip selecting the Queue number if we
> have a bad PFN.
Why is selecting a problem if we don't program anything?
> May be it doesn't make much difference as we write PFN = 0 anyway
> down.
>
> > > /* Activate the queue */
> > > writel(virtqueue_get_vring_size(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_NUM);
> > > if (vm_dev->version == 1) {
> > > writel(PAGE_SIZE, vm_dev->base + VIRTIO_MMIO_QUEUE_ALIGN);
> > > - writel(virtqueue_get_desc_addr(vq) >> PAGE_SHIFT,
> > > + writel(addr >> PAGE_SHIFT,
> > > vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
> > > } else {
> > > - u64 addr;
> > > - addr = virtqueue_get_desc_addr(vq);
> > > writel((u32)addr, vm_dev->base + VIRTIO_MMIO_QUEUE_DESC_LOW);
> > > writel((u32)(addr >> 32),
> > > vm_dev->base + VIRTIO_MMIO_QUEUE_DESC_HIGH);
> > > @@ -430,6 +441,8 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
> > > return vq;
> > > +error_bad_pfn:
> > > + vring_del_virtqueue(vq);
> > > error_new_virtqueue:
> > > if (vm_dev->version == 1) {
> > > writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
> > > diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c
> > > index 2780886e8ba3..099d2cfb47b3 100644
> > > --- a/drivers/virtio/virtio_pci_legacy.c
> > > +++ b/drivers/virtio/virtio_pci_legacy.c
> > > @@ -122,6 +122,7 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
> > > struct virtqueue *vq;
> > > u16 num;
> > > int err;
> > > + u64 q_pfn;
> > > /* Select the queue we're interested in */
> > > iowrite16(index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
> > > @@ -141,9 +142,15 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
> > > if (!vq)
> > > return ERR_PTR(-ENOMEM);
> > > + q_pfn = virtqueue_get_desc_addr(vq) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT;
> > > + if (q_pfn >> 32) {
> > > + dev_err(&vp_dev->pci_dev->dev, "virtio-pci queue PFN too large\n");
> > > + err = -ENOMEM;
> > > + goto out_deactivate;
> >
> > You never set up the address, it's cleaner to add another target
> > and not reset it.
>
> Thats right. However, the only thing we do is writing PFN=0, which would be a good
> thing to do to indicate the error to the host ?
It isn't; a good way to indicate an error is to set a bad status,
which happens anyway, I think. Writing PFN 0 resets the device
instead.
> I could skip it if you think it is
> not needed.
>
>
> Thanks
> Suzuki
On 10/01/18 11:06, Michael S. Tsirkin wrote:
> On Wed, Jan 10, 2018 at 10:54:09AM +0000, Suzuki K Poulose wrote:
>> On 09/01/18 23:29, Michael S. Tsirkin wrote:
>>> On Tue, Jan 09, 2018 at 07:03:56PM +0000, Suzuki K Poulose wrote:
>>>> virtio-mmio using virtio-v1 and virtio legacy pci use a 32bit PFN
>>>> for the queue. If the queue pfn is too large to fit in 32bits, which
>>>> we could hit on arm64 systems with 52bit physical addresses (even with
>>>> 64K page size), we simply miss out a proper link to the other side of
>>>> the queue.
>>>>
>>>> Add a check to validate the PFN, rather than silently breaking
>>>> the devices.
>>>>
>>>> Cc: "Michael S. Tsirkin" <[email protected]>
>>>> Cc: Jason Wang <[email protected]>
>>>> Cc: Marc Zyngier <[email protected]>
>>>> Cc: Christoffer Dall <[email protected]>
>>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>>
>>> Could you guys please work on virtio 1 support in
>>> for virtio mmio in qemu though?
>>> It's not a lot of code.
>>
>> Did you mean kvmtool ? Qemu already supports virto-1.
>
> For virtio-mmio? I don't seem to see that code in
> hw/virtio/virtio-mmio.c
> For example I still see handling for VIRTIO_MMIO_QUEUE_PFN
> there, and no handling for VIRTIO_MMIO_QUEUE_DESC_LOW
> and such.
>
> What am I missing?
Nah, you're right. It's the PCI transport that uses QUEUE_DESC_LOW/HIGH.
Btw, I can't work on Qemu.
>
>>>
>>>> ---
>>>> drivers/virtio/virtio_mmio.c | 19 ++++++++++++++++---
>>>> drivers/virtio/virtio_pci_legacy.c | 11 +++++++++--
>>>> 2 files changed, 25 insertions(+), 5 deletions(-)
>>>
>>> I'd rather see this as 2 patches.
>>
>> OK, I will split them.
>>
>>>
>>>> diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
>>>> index a9192fe4f345..47109baf37f7 100644
>>>> --- a/drivers/virtio/virtio_mmio.c
>>>> +++ b/drivers/virtio/virtio_mmio.c
>>>> @@ -358,6 +358,7 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
>>>> struct virtqueue *vq;
>>>> unsigned long flags;
>>>> unsigned int num;
>>>> + u64 addr;
>>>> int err;
>>>> if (!name)
>>>> @@ -394,16 +395,26 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
>>>> goto error_new_virtqueue;
>>>> }
>>>> + addr = virtqueue_get_desc_addr(vq);
>>>> + /*
>>>> + * virtio-mmio v1 uses a 32bit QUEUE PFN. If we have something that
>>>> + * doesn't fit in 32bit, fail the setup rather than pretending to
>>>> + * be successful.
>>>> + */
>>>> + if (vm_dev->version == 1 && (addr >> (PAGE_SHIFT + 32))) {
>>>> + dev_err(&vdev->dev, "virtio-mmio: queue address too large\n");
>>>> + err = -ENOMEM;
>>>> + goto error_bad_pfn;
>>>> + }
>>>> +
>>>
>>> Can you please move this below to where it's actually used?
>>>
>>
>> The reason for keeping it here was to skip selecting the Queue number if we
>> have a bad PFN.
>
> Why is selecting a problem if we don't program anything?
>
I will be honest here, I don't know :-). The only reasoning was: why do something
that you are not going to use after all? I will move it down.
>> May be it doesn't make much difference as we write PFN = 0 anyway
>> down.
>>>> --- a/drivers/virtio/virtio_pci_legacy.c
>>>> +++ b/drivers/virtio/virtio_pci_legacy.c
>>>> @@ -122,6 +122,7 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
>>>> struct virtqueue *vq;
>>>> u16 num;
>>>> int err;
>>>> + u64 q_pfn;
>>>> /* Select the queue we're interested in */
>>>> iowrite16(index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
>>>> @@ -141,9 +142,15 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
>>>> if (!vq)
>>>> return ERR_PTR(-ENOMEM);
>>>> + q_pfn = virtqueue_get_desc_addr(vq) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT;
>>>> + if (q_pfn >> 32) {
>>>> + dev_err(&vp_dev->pci_dev->dev, "virtio-pci queue PFN too large\n");
>>>> + err = -ENOMEM;
>>>> + goto out_deactivate;
>>>
>>> You never set up the address, it's cleaner to add another target
>>> and not reset it.
>>
>> Thats right. However, the only thing we do is writing PFN=0, which would be a good
>> thing to do to indicate the error to the host ?
>
> It isn't, a good way to indicate error is to set a bad status
> which happens anyway I think. Writing PFN 0 resets the device
> instead.
OK, that's good to know. I will make the necessary changes. Thanks for the explanation.
Cheers
Suzuki
On 10 January 2018 at 11:06, Michael S. Tsirkin <[email protected]> wrote:
> For virtio-mmio? I don't seem to see that code in
> hw/virtio/virtio-mmio.c
> For example I still see handling for VIRTIO_MMIO_QUEUE_PFN
> there, and no handling for VIRTIO_MMIO_QUEUE_DESC_LOW
> and such.
Are there uses that make it worthwhile to get virtio-1
support added to virtio-mmio, rather than just getting
people to move over to virtio-pci instead ?
thanks
-- PMM
Hi Peter,
On 10/01/18 11:19, Peter Maydell wrote:
> On 10 January 2018 at 11:06, Michael S. Tsirkin <[email protected]> wrote:
>> For virtio-mmio? I don't seem to see that code in
>> hw/virtio/virtio-mmio.c
>> For example I still see handling for VIRTIO_MMIO_QUEUE_PFN
>> there, and no handling for VIRTIO_MMIO_QUEUE_DESC_LOW
>> and such.
>
> Are there uses that make it worthwhile to get virtio-1
> support added to virtio-mmio, rather than just getting
> people to move over to virtio-pci instead ?
virtio-iommu uses virtio-mmio transport. It makes little sense to have an
IOMMU presented as a PCI endpoint.
Thanks,
Jean
On Wed, Jan 10, 2018 at 11:19:34AM +0000, Peter Maydell wrote:
> On 10 January 2018 at 11:06, Michael S. Tsirkin <[email protected]> wrote:
> > For virtio-mmio? I don't seem to see that code in
> > hw/virtio/virtio-mmio.c
> > For example I still see handling for VIRTIO_MMIO_QUEUE_PFN
> > there, and no handling for VIRTIO_MMIO_QUEUE_DESC_LOW
> > and such.
>
> Are there uses that make it worthwhile to get virtio-1
> support added to virtio-mmio, rather than just getting
> people to move over to virtio-pci instead ?
>
> thanks
> -- PMM
I keep getting these patches (like the one that started this thread), so
I think yes. If nothing else, the guest support will bit-rot without
an open-source implementation.
--
MST
On 10 January 2018 at 11:25, Jean-Philippe Brucker
<[email protected]> wrote:
> Hi Peter,
>
> On 10/01/18 11:19, Peter Maydell wrote:
>> Are there uses that make it worthwhile to get virtio-1
>> support added to virtio-mmio, rather than just getting
>> people to move over to virtio-pci instead ?
>
> virtio-iommu uses virtio-mmio transport. It makes little sense to have an
> IOMMU presented as a PCI endpoint.
Having an entire transport just for the IOMMU doesn't make
a great deal of sense either though :-) If we didn't already
have virtio-mmio kicking around would we really have designed
it that way?
thanks
-- PMM
On 12/01/18 10:21, Peter Maydell wrote:
> On 10 January 2018 at 11:25, Jean-Philippe Brucker
> <[email protected]> wrote:
>> Hi Peter,
>>
>> On 10/01/18 11:19, Peter Maydell wrote:
>>> Are there uses that make it worthwhile to get virtio-1
>>> support added to virtio-mmio, rather than just getting
>>> people to move over to virtio-pci instead ?
>>
>> virtio-iommu uses virtio-mmio transport. It makes little sense to have an
>> IOMMU presented as a PCI endpoint.
>
> Having an entire transport just for the IOMMU doesn't make
> a great deal of sense either though :-) If we didn't already
> have virtio-mmio kicking around would we really have designed
> it that way?
Possibly. It certainly was on the table during early investigations. It
does beat the alternative, having to redesign firmware interfaces and
rewrite core driver code to cater for unrealistic device topologies.
Thanks,
Jean
Hi Suzuki,
On Tue, Jan 09, 2018 at 07:03:57PM +0000, Suzuki K Poulose wrote:
> Add helpers for encoding/decoding 52bit address in GICv3 ITS BASER
> register. When ITS uses 64K page size, the 52bits of physical address
> are encoded in BASER[47:12] as follows :
>
> Bits[47:16] of the register => bits[47:16] of the physical address
> Bits[15:12] of the register => bits[51:48] of the physical address
> bits[15:0] of the physical address are 0.
>
> Also adds a mask for CBASER address. This will be used for adding 52bit
> support for VGIC ITS. More importantly ignore the upper bits if 52bit
> support is not enabled.
>
> Cc: Shanker Donthineni <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> drivers/irqchip/irq-gic-v3-its.c | 2 +-
> include/linux/irqchip/arm-gic-v3.h | 32 ++++++++++++++++++++++++++++++--
> 2 files changed, 31 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> index 4039e64cd342..e6aa84f806f7 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -1615,7 +1615,7 @@ static int its_setup_baser(struct its_node *its, struct its_baser *baser,
> }
>
> /* Convert 52bit PA to 48bit field */
> - baser_phys = GITS_BASER_PHYS_52_to_48(baser_phys);
> + baser_phys = GITS_BASER_ADDR64K_FROM_PHYS(baser_phys);
> }
>
> retry_baser:
> diff --git a/include/linux/irqchip/arm-gic-v3.h b/include/linux/irqchip/arm-gic-v3.h
> index c00c4c33e432..b880b6682fa6 100644
> --- a/include/linux/irqchip/arm-gic-v3.h
> +++ b/include/linux/irqchip/arm-gic-v3.h
> @@ -320,6 +320,15 @@
> #define GITS_IIDR_REV(r) (((r) >> GITS_IIDR_REV_SHIFT) & 0xf)
> #define GITS_IIDR_PRODUCTID_SHIFT 24
>
> +#ifdef CONFIG_ARM64_PA_BITS_52
> +#define GITS_PA_HI_MASK (0xfULL)
> +#define GITS_PA_SHIFT 52
> +#else
> +/* Do not use the bits [51-48] if we don't support 52bit */
> +#define GITS_PA_HI_MASK 0
> +#define GITS_PA_SHIFT 48
> +#endif
> +
> #define GITS_CBASER_VALID (1ULL << 63)
> #define GITS_CBASER_SHAREABILITY_SHIFT (10)
> #define GITS_CBASER_INNER_CACHEABILITY_SHIFT (59)
> @@ -343,6 +352,7 @@
> #define GITS_CBASER_WaWb GIC_BASER_CACHEABILITY(GITS_CBASER, INNER, WaWb)
> #define GITS_CBASER_RaWaWt GIC_BASER_CACHEABILITY(GITS_CBASER, INNER, RaWaWt)
> #define GITS_CBASER_RaWaWb GIC_BASER_CACHEABILITY(GITS_CBASER, INNER, RaWaWb)
> +#define GITS_CBASER_ADDRESS(x) ((x) & GENMASK_ULL(GITS_PA_SHIFT, 12))
>
> #define GITS_BASER_NR_REGS 8
>
> @@ -373,8 +383,26 @@
> #define GITS_BASER_ENTRY_SIZE_SHIFT (48)
> #define GITS_BASER_ENTRY_SIZE(r) ((((r) >> GITS_BASER_ENTRY_SIZE_SHIFT) & 0x1f) + 1)
> #define GITS_BASER_ENTRY_SIZE_MASK GENMASK_ULL(52, 48)
> -#define GITS_BASER_PHYS_52_to_48(phys) \
> - (((phys) & GENMASK_ULL(47, 16)) | (((phys) >> 48) & 0xf) << 12)
> +
> +/*
> + * With 64K page size, the physical address can be upto 52bit and
> + * uses the following encoding in the GITS_BASER[47:12]:
> + *
> + * Bits[47:16] of the register => bits[47:16] of the base physical address.
> + * Bits[15:12] of the register => bits[51:48] of the base physical address.
> + * bits[15:0] of the base physical address are 0.
> + * Clear the upper bits if the kernel doesn't support 52bits.
> + */
> +#define GITS_BASER_ADDR64K_LO_MASK GENMASK_ULL(47, 16)
> +#define GITS_BASER_ADDR64K_HI_SHIFT 12
> +#define GITS_BASER_ADDR64K_HI_MOVE (48 - GITS_BASER_ADDR64K_HI_SHIFT)
> +#define GITS_BASER_ADDR64K_HI_MASK (GITS_PA_HI_MASK << GITS_BASER_ADDR64K_HI_SHIFT)
> +#define GITS_BASER_ADDR64K_TO_PHYS(x) \
> + (((x) & GITS_BASER_ADDR64K_LO_MASK) | \
> + (((x) & GITS_BASER_ADDR64K_HI_MASK) << GITS_BASER_ADDR64K_HI_MOVE))
> +#define GITS_BASER_ADDR64K_FROM_PHYS(p) \
> + (((p) & GITS_BASER_ADDR64K_LO_MASK) | \
> + (((p) >> GITS_BASER_ADDR64K_HI_MOVE) & GITS_BASER_ADDR64K_HI_MASK))
I don't understand why you need this masking logic embedded in these
macros. Isn't it strictly an error if anyone passes a physical address
with any of bits [51:48] set to the ITS on a system that doesn't support
52 bit PAs? Just silently masking off those bits could lead to some
interesting cases.
This is also notably more difficult to read than the existing macro.
If anything, I think it would be more useful to have
GITS_BASER_TO_PHYS(x) and GITS_PHYS_TO_BASER(x) which take
CONFIG_ARM64_64K_PAGES into account.
> #define GITS_BASER_SHAREABILITY_SHIFT (10)
> #define GITS_BASER_InnerShareable \
> GIC_BASER_SHAREABILITY(GITS_BASER, InnerShareable)
> --
> 2.13.6
>
Thanks,
-Christoffer
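For what it's worth, a rough sketch of the suggested pair, building on the ADDR64K macros from the patch under review; the names and the CONFIG_ARM64_64K_PAGES split are only an illustration of the idea, not a proposed patch.

#ifdef CONFIG_ARM64_64K_PAGES
/* 64K pages: fold phys bits [51:48] into BASER bits [15:12] */
#define GITS_PHYS_TO_BASER(p)	GITS_BASER_ADDR64K_FROM_PHYS(p)
#define GITS_BASER_TO_PHYS(x)	GITS_BASER_ADDR64K_TO_PHYS(x)
#else
/* 4K/16K pages: the address is used as-is; bits [51:48] are never valid */
#define GITS_PHYS_TO_BASER(p)	((p) & GENMASK_ULL(47, 12))
#define GITS_BASER_TO_PHYS(x)	((x) & GENMASK_ULL(47, 12))
#endif

A caller passing an address with bits [51:48] set on a non-52bit configuration would then be caught by an explicit check rather than silently masked.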
On Tue, Jan 09, 2018 at 07:04:09PM +0000, Suzuki K Poulose wrote:
> Now that we can manage the stage2 page table per VM, switch the
> configuration details to per VM instance. We keep track of the
> IPA bits, number of page table levels and the VTCR bits (which
> depends on the IPA and the number of levels).
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> arch/arm/include/asm/kvm_mmu.h | 1 +
> arch/arm64/include/asm/kvm_host.h | 12 ++++++++++++
> arch/arm64/include/asm/kvm_mmu.h | 22 ++++++++++++++++++++--
> arch/arm64/include/asm/stage2_pgtable.h | 1 -
> arch/arm64/kvm/hyp/switch.c | 3 +--
> virt/kvm/arm/arm.c | 2 +-
> 6 files changed, 35 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 440c80589453..dd592fe45660 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -48,6 +48,7 @@
> #define kvm_vttbr_baddr_mask(kvm) VTTBR_BADDR_MASK
>
> #define stage2_pgd_size(kvm) (PTRS_PER_S2_PGD * sizeof(pgd_t))
> +#define kvm_init_stage2_config(kvm) do { } while (0)
> int create_hyp_mappings(void *from, void *to, pgprot_t prot);
> int create_hyp_io_mappings(void *from, void *to, phys_addr_t);
> void free_hyp_pgds(void);
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 9a9ddeb33c84..1e66e5ab3dde 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -64,6 +64,18 @@ struct kvm_arch {
> /* VTTBR value associated with above pgd and vmid */
> u64 vttbr;
>
> + /* Private bits of VTCR_EL2 for this VM */
> + u64 vtcr_private;
As with my comments on the previous patch, why isn't this simply u64 vtcr;
Thanks,
-Christoffer
> + /* Size of the PA size for this guest */
> + u8 phys_shift;
> + /*
> + * Number of levels in page table. We could always calculate
> + * it from phys_shift above. We cache it for faster switches
> + * in stage2 page table helpers.
> + */
> + u8 s2_levels;
> +
> +
> /* The last vcpu id that ran on each physical CPU */
> int __percpu *last_vcpu_ran;
>
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 483185ed2ecd..ab6a8b905065 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -134,11 +134,12 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
> /*
> * We currently only support a 40bit IPA.
> */
> -#define KVM_PHYS_SHIFT (40)
> +#define KVM_PHYS_SHIFT_DEFAULT (40)
>
> -#define kvm_phys_shift(kvm) KVM_PHYS_SHIFT
> +#define kvm_phys_shift(kvm) (kvm->arch.phys_shift)
> #define kvm_phys_size(kvm) (_AC(1, ULL) << kvm_phys_shift(kvm))
> #define kvm_phys_mask(kvm) (kvm_phys_size(kvm) - _AC(1, ULL))
> +#define kvm_stage2_levels(kvm) (kvm->arch.s2_levels)
>
> static inline bool kvm_page_empty(void *ptr)
> {
> @@ -346,5 +347,22 @@ static inline u64 kvm_vttbr_baddr_mask(struct kvm *kvm)
> return GENMASK_ULL(PHYS_MASK_SHIFT - 1, x);
> }
>
> +/*
> + * kvm_init_stage2_config: Initialise the VM specific stage2 page table
> + * details to default IPA size.
> + */
> +static inline void kvm_init_stage2_config(struct kvm *kvm)
> +{
> + /*
> + * The stage2 PGD is dependent on the settings we initialise here
> + * and should be allocated only after this step.
> + */
> + VM_BUG_ON(kvm->arch.pgd != NULL);
> + kvm->arch.phys_shift = KVM_PHYS_SHIFT_DEFAULT;
> + kvm->arch.s2_levels = stage2_pt_levels(kvm->arch.phys_shift);
> + kvm->arch.vtcr_private = VTCR_EL2_SL0(kvm->arch.s2_levels) |
> + TCR_T0SZ(kvm->arch.phys_shift);
> +}
> +
> #endif /* __ASSEMBLY__ */
> #endif /* __ARM64_KVM_MMU_H__ */
> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
> index 33e8ebb25037..9b75b83da643 100644
> --- a/arch/arm64/include/asm/stage2_pgtable.h
> +++ b/arch/arm64/include/asm/stage2_pgtable.h
> @@ -44,7 +44,6 @@
> */
> #define __s2_pgd_ptrs(pa, lvls) (1 << ((pa) - pt_levels_pgdir_shift((lvls))))
>
> -#define kvm_stage2_levels(kvm) stage2_pt_levels(kvm_phys_shift(kvm))
> #define stage2_pgdir_shift(kvm) \
> pt_levels_pgdir_shift(kvm_stage2_levels(kvm))
> #define stage2_pgdir_size(kvm) (_AC(1, UL) << stage2_pgdir_shift((kvm)))
> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> index 523471f0af7b..d0725562ee3f 100644
> --- a/arch/arm64/kvm/hyp/switch.c
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -160,8 +160,7 @@ static void __hyp_text __activate_vm(struct kvm_vcpu *vcpu)
> u64 vtcr = read_sysreg(vtcr_el2);
>
> vtcr &= ~VTCR_EL2_PRIVATE_MASK;
> - vtcr |= VTCR_EL2_SL0(stage2_pt_levels(kvm)) |
> - VTCR_EL2_T0SZ(kvm_phys_shift(kvm));
> + vtcr |= kvm->arch.vtcr_private;
> write_sysreg(vtcr, vtcr_el2);
> write_sysreg(kvm->arch.vttbr, vttbr_el2);
> }
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index 8564ed907b18..e0bf8d19fcfe 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -143,7 +143,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> /* The maximum number of VCPUs is limited by the host's GIC model */
> kvm->arch.max_vcpus = vgic_present ?
> kvm_vgic_get_max_vcpus() : KVM_MAX_VCPUS;
> -
> + kvm_init_stage2_config(kvm);
> return ret;
> }
>
> --
> 2.13.6
>
On Tue, Jan 09, 2018 at 07:04:02PM +0000, Suzuki K Poulose wrote:
> On a 4-level page table pgd entry can be empty, unlike a 3-level
> page table. Remove the spurious WARN_ON() in stage_get_pud().
Acked-by: Christoffer Dall <[email protected]>
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> virt/kvm/arm/mmu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index e6548c85c495..78253fe00fc4 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -870,7 +870,7 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
> pud_t *pud;
>
> pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> - if (WARN_ON(stage2_pgd_none(*pgd))) {
> + if (stage2_pgd_none(*pgd)) {
> if (!cache)
> return NULL;
> pud = mmu_memory_cache_alloc(cache);
> --
> 2.13.6
>
On Tue, Jan 09, 2018 at 07:04:03PM +0000, Suzuki K Poulose wrote:
> On arm/arm64 we pre-allocate the entry level page tables when
> a VM is created and is free'd when either all the mm users are
> gone or the KVM is about to get destroyed. i.e, kvm_free_stage2_pgd
> is triggered via kvm_arch_flush_shadow_all() which can be invoked
> from two different paths :
>
> 1) do_exit()-> .-> mmu_notifier->release()-> ..-> kvm_arch_flush_shadow_all()
> OR
> 2) kvm_destroy_vm()-> mmu_notifier_unregister-> kvm_arch_flush_shadow_all()
>
> This has created lot of race conditions in the past as some of
> the VCPUs could be active when we free the stage2 via path (1).
How?? mmu_notifier->release() is called via __mmput()->exit_mmap(), which
is only called if mm_users == 0, which means there are no more threads
left other than the one currently doing exit().
>
> On a closer look, all we need to do with kvm_arch_flush_shadow_all() is,
> to ensure that the stage2 mappings are cleared. This doesn't mean we
> have to free up the stage2 entry level page tables yet, which could
> be delayed until the kvm is destroyed. This would avoid issues
> of use-after-free,
do we have any of those left?
> This will be later used for delaying
> the allocation of the stage2 entry level page tables until we really
> need to do something with it.
Fine, but you don't actually explain why this change of flow is
necessary for what you're trying to do later?
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> virt/kvm/arm/arm.c | 1 +
> virt/kvm/arm/mmu.c | 56 ++++++++++++++++++++++++++++--------------------------
> 2 files changed, 30 insertions(+), 27 deletions(-)
>
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index c8d49879307f..19b720ddedce 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -189,6 +189,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
> }
> }
> atomic_set(&kvm->online_vcpus, 0);
> + kvm_free_stage2_pgd(kvm);
> }
>
> int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 78253fe00fc4..c94c61ac38b9 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -298,11 +298,10 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
> pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> do {
> /*
> - * Make sure the page table is still active, as another thread
> - * could have possibly freed the page table, while we released
> - * the lock.
> + * The page table shouldn't be free'd as we still hold a reference
> + * to the KVM.
To avoid confusion about references to the kernel module KVM and a
specific KVM VM instance, please s/KVM/VM/.
> */
> - if (!READ_ONCE(kvm->arch.pgd))
> + if (WARN_ON(!READ_ONCE(kvm->arch.pgd)))
This reads a lot like a defensive implementation now, and I think for
this patch to make sense, we shouldn't try to handle a buggy super-racy
implementation gracefully, but rather have VM_BUG_ON() (if we care about
performance of the check) or simply BUG_ON().
The rationale being that if we've gotten this flow incorrect and freed
the pgd at the wrong time, we don't want to leave a ticking bomb to blow
up somewhere else randomly (which it will!), but instead crash and burn.
> break;
> next = stage2_pgd_addr_end(addr, end);
> if (!stage2_pgd_none(*pgd))
> @@ -837,30 +836,33 @@ void stage2_unmap_vm(struct kvm *kvm)
> up_read(¤t->mm->mmap_sem);
> srcu_read_unlock(&kvm->srcu, idx);
> }
> -
> -/**
> - * kvm_free_stage2_pgd - free all stage-2 tables
> - * @kvm: The KVM struct pointer for the VM.
> - *
> - * Walks the level-1 page table pointed to by kvm->arch.pgd and frees all
> - * underlying level-2 and level-3 tables before freeing the actual level-1 table
> - * and setting the struct pointer to NULL.
> +/*
> + * kvm_flush_stage2_all: Unmap the entire stage2 mappings including
> + * device and regular RAM backing memory.
> */
> -void kvm_free_stage2_pgd(struct kvm *kvm)
> +static void kvm_flush_stage2_all(struct kvm *kvm)
> {
> - void *pgd = NULL;
> -
> spin_lock(&kvm->mmu_lock);
> - if (kvm->arch.pgd) {
> + if (kvm->arch.pgd)
> unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
> - pgd = READ_ONCE(kvm->arch.pgd);
> - kvm->arch.pgd = NULL;
> - }
> spin_unlock(&kvm->mmu_lock);
> +}
>
> - /* Free the HW pgd, one page at a time */
> - if (pgd)
> - free_pages_exact(pgd, S2_PGD_SIZE);
> +/**
> + * kvm_free_stage2_pgd - Free the entry level page tables in stage-2.
nit: you should put the parameter description here and leave a blank
line before the lengthy discussion.
> + * This is called when all reference to the KVM has gone away and we
> + * really don't need any protection in resetting the PGD. This also
I don't think I understand the last part of this sentence.
This function is pretty self-explanatory really, and I think we can
either drop the documentation altogether or simply say that this
function clears all stage 2 page table entries to release the memory of
the lower-level page tables themselves and then frees the pgd in the
end. The VM is known to go away and no more VCPUs exist at this point.
> + * means that nobody should be touching stage2 at this point, as we
> + * have unmapped the entire stage2 already and all dynamic entities,
> + * (VCPUs and devices) are no longer active.
> + *
> + * @kvm: The KVM struct pointer for the VM.
> + */
> +void kvm_free_stage2_pgd(struct kvm *kvm)
> +{
> + kvm_flush_stage2_all(kvm);
> + free_pages_exact(kvm->arch.pgd, S2_PGD_SIZE);
> + kvm->arch.pgd = NULL;
> }
>
> static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
> @@ -1189,12 +1191,12 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> * large. Otherwise, we may see kernel panics with
> * CONFIG_DETECT_HUNG_TASK, CONFIG_LOCKUP_DETECTOR,
> * CONFIG_LOCKDEP. Additionally, holding the lock too long
> - * will also starve other vCPUs. We have to also make sure
> - * that the page tables are not freed while we released
> - * the lock.
> + * will also starve other vCPUs.
> + * The page tables shouldn't be free'd while we released the
s/shouldn't/can't/
> + * lock, since we hold a reference to the KVM.
s/KVM/VM/
> */
> cond_resched_lock(&kvm->mmu_lock);
> - if (!READ_ONCE(kvm->arch.pgd))
> + if (WARN_ON(!READ_ONCE(kvm->arch.pgd)))
> break;
> next = stage2_pgd_addr_end(addr, end);
> if (stage2_pgd_present(*pgd))
> @@ -1950,7 +1952,7 @@ void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots)
>
> void kvm_arch_flush_shadow_all(struct kvm *kvm)
> {
> - kvm_free_stage2_pgd(kvm);
> + kvm_flush_stage2_all(kvm);
> }
>
> void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
> --
> 2.13.6
>
Thanks,
-Christoffer
On Tue, Jan 09, 2018 at 07:04:04PM +0000, Suzuki K Poulose wrote:
> We allocate the entry level page tables for stage2 when the
> VM is created. This doesn't give us the flexibility of configuring
> the physical address space size for a VM. In order to allow
> the VM to choose the required size, we delay the allocation of
> stage2 entry level tables until we really try to map something.
>
> This could be either when the VM creates a memory range or when
> it tries to map a device memory. So we add in a hook to these
> two places to make sure the tables are allocated. We use
> kvm->slots_lock to serialize the allocation entry point, since
> we add hooks to the arch specific call back with the mutex held.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> virt/kvm/arm/arm.c | 18 ++++++----------
> virt/kvm/arm/mmu.c | 61 +++++++++++++++++++++++++++++++++++++++++++++---------
> 2 files changed, 57 insertions(+), 22 deletions(-)
>
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index 19b720ddedce..d06f00566664 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -127,13 +127,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> for_each_possible_cpu(cpu)
> *per_cpu_ptr(kvm->arch.last_vcpu_ran, cpu) = -1;
>
> - ret = kvm_alloc_stage2_pgd(kvm);
> - if (ret)
> - goto out_fail_alloc;
> -
> ret = create_hyp_mappings(kvm, kvm + 1, PAGE_HYP);
> - if (ret)
> - goto out_free_stage2_pgd;
> + if (ret) {
> + free_percpu(kvm->arch.last_vcpu_ran);
> + kvm->arch.last_vcpu_ran = NULL;
> + return ret;
> + }
> +
>
> kvm_vgic_early_init(kvm);
>
> @@ -145,12 +145,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> kvm_vgic_get_max_vcpus() : KVM_MAX_VCPUS;
>
> return ret;
> -out_free_stage2_pgd:
> - kvm_free_stage2_pgd(kvm);
> -out_fail_alloc:
> - free_percpu(kvm->arch.last_vcpu_ran);
> - kvm->arch.last_vcpu_ran = NULL;
> - return ret;
> }
>
> bool kvm_arch_has_vcpu_debugfs(void)
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index c94c61ac38b9..257f2a8ccfc7 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1011,15 +1011,39 @@ static int stage2_pmdp_test_and_clear_young(pmd_t *pmd)
> return stage2_ptep_test_and_clear_young((pte_t *)pmd);
> }
>
> -/**
> - * kvm_phys_addr_ioremap - map a device range to guest IPA
> - *
> - * @kvm: The KVM pointer
> - * @guest_ipa: The IPA at which to insert the mapping
> - * @pa: The physical address of the device
> - * @size: The size of the mapping
> +/*
> + * Finalise the stage2 page table layout. Must be called with kvm->slots_lock
> + * held.
> */
> -int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> +static int __kvm_init_stage2_table(struct kvm *kvm)
> +{
> + /* Double check if somebody has already allocated it */
dubious comment: Either leave it out or explain that we need to check
again with the mutex held.
> + if (likely(kvm->arch.pgd))
> + return 0;
> + return kvm_alloc_stage2_pgd(kvm);
> +}
> +
> +static int kvm_init_stage2_table(struct kvm *kvm)
> +{
> + int rc;
> +
> + /*
> + * Once allocated, the stage2 entry level tables are only
> + * freed when the KVM instance is destroyed. So, if we see
> + * something valid here, that guarantees that we have
> + * done the one time allocation and it is something valid
> + * and won't go away until the last reference to the KVM
> + * is gone.
> + */
Really not sure if this comment adds something beyond what's described
by the code already?
Thanks,
-Christoffer
> + if (likely(kvm->arch.pgd))
> + return 0;
> + mutex_lock(&kvm->slots_lock);
> + rc = __kvm_init_stage2_table(kvm);
> + mutex_unlock(&kvm->slots_lock);
> + return rc;
> +}
> +
> +static int __kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> phys_addr_t pa, unsigned long size, bool writable)
> {
> phys_addr_t addr, end;
> @@ -1055,6 +1079,23 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> return ret;
> }
>
> +/**
> + * kvm_phys_addr_ioremap - map a device range to guest IPA.
> + * Acquires kvm->slots_lock for making sure that the stage2 is initialized.
> + *
> + * @kvm: The KVM pointer
> + * @guest_ipa: The IPA at which to insert the mapping
> + * @pa: The physical address of the device
> + * @size: The size of the mapping
> + */
> +int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> + phys_addr_t pa, unsigned long size, bool writable)
> +{
> + if (unlikely(kvm_init_stage2_table(kvm)))
> + return -ENOMEM;
> + return __kvm_phys_addr_ioremap(kvm, guest_ipa, pa, size, writable);
> +}
> +
> static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
> {
> kvm_pfn_t pfn = *pfnp;
> @@ -1912,7 +1953,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> goto out;
> }
>
> - ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
> + ret = __kvm_phys_addr_ioremap(kvm, gpa, pa,
> vm_end - vm_start,
> writable);
> if (ret)
> @@ -1943,7 +1984,7 @@ void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free,
> int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
> unsigned long npages)
> {
> - return 0;
> + return __kvm_init_stage2_table(kvm);
> }
>
> void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots)
> --
> 2.13.6
>
On Tue, Jan 09, 2018 at 07:04:00PM +0000, Suzuki K Poulose wrote:
> Add a helper to convert ID_AA64MMFR0_EL1:PARange to they physical
*the*
> size shift. Limit the size to the maximum supported by the kernel.
Is this just a cleanup or are we actually going to need this feature in
the subsequent patches? That would be nice to motivate in the commit
letter.
>
> Cc: Mark Rutland <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> arch/arm64/include/asm/cpufeature.h | 16 ++++++++++++++++
> arch/arm64/kvm/hyp/s2-setup.c | 28 +++++-----------------------
> 2 files changed, 21 insertions(+), 23 deletions(-)
>
> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> index ac67cfc2585a..0564e14616eb 100644
> --- a/arch/arm64/include/asm/cpufeature.h
> +++ b/arch/arm64/include/asm/cpufeature.h
> @@ -304,6 +304,22 @@ static inline u64 read_zcr_features(void)
> return zcr;
> }
>
> +static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange)
> +{
> + switch (parange) {
> + case 0: return 32;
> + case 1: return 36;
> + case 2: return 40;
> + case 3: return 42;
> + case 4: return 44;
> +
> + default:
What is the case we want to cater for with making parange == 5 the
default for unrecognized values?
(I have a feeling that default label comes from making the compiler
happy about potentially uninitialized values once upon a time before a
lot of refactoring happened here.)
> + case 5: return 48;
> +#ifdef CONFIG_ARM64_PA_BITS_52
> + case 6: return 52;
> +#endif
> + }
> +}
> #endif /* __ASSEMBLY__ */
>
> #endif
> diff --git a/arch/arm64/kvm/hyp/s2-setup.c b/arch/arm64/kvm/hyp/s2-setup.c
> index 603e1ee83e89..b1129c83c531 100644
> --- a/arch/arm64/kvm/hyp/s2-setup.c
> +++ b/arch/arm64/kvm/hyp/s2-setup.c
> @@ -19,11 +19,13 @@
> #include <asm/kvm_arm.h>
> #include <asm/kvm_asm.h>
> #include <asm/kvm_hyp.h>
> +#include <asm/cpufeature.h>
>
> u32 __hyp_text __init_stage2_translation(void)
> {
> u64 val = VTCR_EL2_FLAGS;
> u64 parange;
> + u32 phys_shift;
> u64 tmp;
>
> /*
> @@ -37,27 +39,7 @@ u32 __hyp_text __init_stage2_translation(void)
> val |= parange << 16;
>
> /* Compute the actual PARange... */
> - switch (parange) {
> - case 0:
> - parange = 32;
> - break;
> - case 1:
> - parange = 36;
> - break;
> - case 2:
> - parange = 40;
> - break;
> - case 3:
> - parange = 42;
> - break;
> - case 4:
> - parange = 44;
> - break;
> - case 5:
> - default:
> - parange = 48;
> - break;
> - }
> + phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange);
>
> /*
> * ... and clamp it to 40 bits, unless we have some braindead
> @@ -65,7 +47,7 @@ u32 __hyp_text __init_stage2_translation(void)
> * return that value for the rest of the kernel to decide what
> * to do.
> */
> - val |= 64 - (parange > 40 ? 40 : parange);
> + val |= 64 - (phys_shift > 40 ? 40 : phys_shift);
>
> /*
> * Check the availability of Hardware Access Flag / Dirty Bit
> @@ -86,5 +68,5 @@ u32 __hyp_text __init_stage2_translation(void)
>
> write_sysreg(val, vtcr_el2);
>
> - return parange;
> + return phys_shift;
> }
> --
> 2.13.6
>
Could you fold this change into the commit as well:
diff --git a/arch/arm64/kvm/hyp/s2-setup.c b/arch/arm64/kvm/hyp/s2-setup.c
index 603e1ee83e89..eea2fbd68b8a 100644
--- a/arch/arm64/kvm/hyp/s2-setup.c
+++ b/arch/arm64/kvm/hyp/s2-setup.c
@@ -29,7 +29,8 @@ u32 __hyp_text __init_stage2_translation(void)
/*
* Read the PARange bits from ID_AA64MMFR0_EL1 and set the PS
* bits in VTCR_EL2. Amusingly, the PARange is 4 bits, while
- * PS is only 3. Fortunately, bit 19 is RES0 in VTCR_EL2...
+ * PS is only 3. Fortunately, only three bits are actually used to
+ * encode the supported PARange values.
*/
parange = read_sysreg(id_aa64mmfr0_el1) & 7;
if (parange > ID_AA64MMFR0_PARANGE_MAX)
Thanks,
-Christoffer
On Tue, Jan 09, 2018 at 07:04:01PM +0000, Suzuki K Poulose wrote:
> So far we have only supported 3 level page table with fixed IPA of 40bits.
> Fix stage2_flush_memslot() to accommodate for 4 level tables.
>
Acked-by: Christoffer Dall <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> virt/kvm/arm/mmu.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 761787befd3b..e6548c85c495 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -375,7 +375,8 @@ static void stage2_flush_memslot(struct kvm *kvm,
> pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> do {
> next = stage2_pgd_addr_end(addr, end);
> - stage2_flush_puds(kvm, pgd, addr, next);
> + if (!stage2_pgd_none(*pgd))
> + stage2_flush_puds(kvm, pgd, addr, next);
> } while (pgd++, addr = next, addr != end);
> }
>
> --
> 2.13.6
>
On 08/02/18 11:00, Christoffer Dall wrote:
> On Tue, Jan 09, 2018 at 07:04:00PM +0000, Suzuki K Poulose wrote:
>> Add a helper to convert ID_AA64MMFR0_EL1:PARange to they physical
> *the*
>> size shift. Limit the size to the maximum supported by the kernel.
>
> Is this just a cleanup or are we actually going to need this feature in
> the subsequent patches? That would be nice to motivate in the commit
> letter.
It is a cleanup, plus we are going to move the user of this code around from
one place to another. So this makes it a bit easier and cleaner.
>>
>> Cc: Mark Rutland <[email protected]>
>> Cc: Catalin Marinas <[email protected]>
>> Cc: Will Deacon <[email protected]>
>> Cc: Marc Zyngier <[email protected]>
>> Signed-off-by: Suzuki K Poulose <[email protected]>
>> ---
>> arch/arm64/include/asm/cpufeature.h | 16 ++++++++++++++++
>> arch/arm64/kvm/hyp/s2-setup.c | 28 +++++-----------------------
>> 2 files changed, 21 insertions(+), 23 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
>> index ac67cfc2585a..0564e14616eb 100644
>> --- a/arch/arm64/include/asm/cpufeature.h
>> +++ b/arch/arm64/include/asm/cpufeature.h
>> @@ -304,6 +304,22 @@ static inline u64 read_zcr_features(void)
>> return zcr;
>> }
>>
>> +static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange)
>> +{
>> + switch (parange) {
>> + case 0: return 32;
>> + case 1: return 36;
>> + case 2: return 40;
>> + case 3: return 42;
>> + case 4: return 44;
>> +
>> + default:
>
> What is the case we want to cater for with making parange == 5 the
> default for unrecognized values?
>
> (I have a feeling that default label comes from making the compiler
> happy about potentially uninitialized values once upon a time before a
> lot of refactoring happened here.)
That is there to make sure we return 48 iff 52bit support (or, for that matter,
any new limit added in the future) is not enabled.
>
>> + case 5: return 48;
>> +#ifdef CONFIG_ARM64_PA_BITS_52
>> + case 6: return 52;
>> +#endif
>> + }
>> +}
>> #endif /* __ASSEMBLY__ */
>>
>>
>
> Could you fold this change into the commit as well:
>
> diff --git a/arch/arm64/kvm/hyp/s2-setup.c b/arch/arm64/kvm/hyp/s2-setup.c
> index 603e1ee83e89..eea2fbd68b8a 100644
> --- a/arch/arm64/kvm/hyp/s2-setup.c
> +++ b/arch/arm64/kvm/hyp/s2-setup.c
> @@ -29,7 +29,8 @@ u32 __hyp_text __init_stage2_translation(void)
> /*
> * Read the PARange bits from ID_AA64MMFR0_EL1 and set the PS
> * bits in VTCR_EL2. Amusingly, the PARange is 4 bits, while
> - * PS is only 3. Fortunately, bit 19 is RES0 in VTCR_EL2...
> + * PS is only 3. Fortunately, only three bits are actually used to
> + * encode the supported PARange values.
> */
> parange = read_sysreg(id_aa64mmfr0_el1) & 7;
> if (parange > ID_AA64MMFR0_PARANGE_MAX)
Sure.
Thanks for the review.
Suzuki
On Tue, Jan 09, 2018 at 07:04:10PM +0000, Suzuki K Poulose wrote:
> Allow the guests to choose a larger physical address space size.
> The default and minimum size is 40bits. A guest can change this
> right after the VM creation, but before the stage2 entry page
> tables are allocated (i.e, before it registers a memory range
> or maps a device address). The size is restricted to the maximum
> supported by the host. Also, the guest can only increase the PA size,
> from the existing value, as reducing it could break the devices which
> may have verified their physical address for validity and may do a
> lazy mapping(e.g, VGIC).
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Cc: Peter Maydell <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> Documentation/virtual/kvm/api.txt | 27 ++++++++++++++++++++++++++
> arch/arm/include/asm/kvm_host.h | 7 +++++++
> arch/arm64/include/asm/kvm_host.h | 1 +
> arch/arm64/include/asm/kvm_mmu.h | 41 ++++++++++++++++++++++++++++++---------
> arch/arm64/kvm/reset.c | 28 ++++++++++++++++++++++++++
> include/uapi/linux/kvm.h | 4 ++++
> virt/kvm/arm/arm.c | 2 +-
> 7 files changed, 100 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 57d3ee9e4bde..a203faf768c4 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -3403,6 +3403,33 @@ invalid, if invalid pages are written to (e.g. after the end of memory)
> or if no page table is present for the addresses (e.g. when using
> hugepages).
>
> +4.109 KVM_ARM_GET_PHYS_SHIFT
> +
> +Capability: KVM_CAP_ARM_CONFIG_PHYS_SHIFT
> +Architectures: arm64
> +Type: vm ioctl
> +Parameters: __u32 (out)
> +Returns: 0 on success, a negative value on error
> +
> +This ioctl is used to get the current maximum physical address size for
> +the VM. The value is Log2(Maximum_Physical_Address). This is neither the
> + amount of physical memory assigned to the VM nor the maximum physical address
> +that a real CPU on the host can handle. Rather, this is the upper limit of the
> +guest physical address that can be used by the VM.
What is the point of this? Presumably if userspace has set the size, it
already knows the size?
> +
> +4.109 KVM_ARM_SET_PHYS_SHIFT
> +
> +Capability: KVM_CAP_ARM_CONFIG_PHYS_SHIFT
> +Architectures: arm64
> +Type: vm ioctl
> +Parameters: __u32 (in)
> +Returns: 0 on success, a negative value on error
> +
> +This ioctl is used to set the maximum physical address size for
> +the VM. The value is Log2(Maximum_Physical_Address). The value can only
> +be increased from the existing setting. The value cannot be changed
> +after the stage-2 page tables are allocated and will return an error.
> +
Is there a way for userspace to discover what the underlying hardware
can actually support, beyond trial-and-error on this ioctl?
> 5. The kvm_run structure
> ------------------------
>
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index a9f7d3f47134..fa8e68a4f692 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -268,6 +268,13 @@ static inline int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext)
> return 0;
> }
>
> +static inline long kvm_arch_dev_vm_ioctl(struct kvm *kvm,
> + unsigned int ioctl,
> + unsigned long arg)
> +{
> + return -EINVAL;
> +}
> +
> int kvm_perf_init(void);
> int kvm_perf_teardown(void);
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 1e66e5ab3dde..2895c2cda8fc 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -50,6 +50,7 @@
> int __attribute_const__ kvm_target_cpu(void);
> int kvm_reset_vcpu(struct kvm_vcpu *vcpu);
> int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext);
> +long kvm_arch_dev_vm_ioctl(struct kvm *kvm, unsigned int ioctl, unsigned long arg);
> void __extended_idmap_trampoline(phys_addr_t boot_pgd, phys_addr_t idmap_start);
>
> struct kvm_arch {
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index ab6a8b905065..ab7f50f20bcd 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -347,21 +347,44 @@ static inline u64 kvm_vttbr_baddr_mask(struct kvm *kvm)
> return GENMASK_ULL(PHYS_MASK_SHIFT - 1, x);
> }
>
> +static inline int kvm_reconfig_stage2(struct kvm *kvm, u32 phys_shift)
> +{
> + int rc = 0;
> + unsigned int pa_max, parange;
> +
> + parange = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1) & 7;
> + pa_max = id_aa64mmfr0_parange_to_phys_shift(parange);
> + /* Raise it to 40bits for backward compatibility */
> + pa_max = (pa_max < 40) ? 40 : pa_max;
> + /* Make sure the size is supported/available */
> + if (phys_shift > PHYS_MASK_SHIFT || phys_shift > pa_max)
> + return -EINVAL;
> + /*
> + * The stage2 PGD is dependent on the settings we initialise here
> + * and should be allocated only after this step. We cannot allow
> + * down sizing the IPA size as there could be devices or memory
> + * regions, that depend on the previous size.
> + */
> + mutex_lock(&kvm->slots_lock);
> + if (kvm->arch.pgd || phys_shift < kvm->arch.phys_shift) {
> + rc = -EPERM;
> + } else if (phys_shift > kvm->arch.phys_shift) {
> + kvm->arch.phys_shift = phys_shift;
> + kvm->arch.s2_levels = stage2_pt_levels(kvm->arch.phys_shift);
> + kvm->arch.vtcr_private = VTCR_EL2_SL0(kvm->arch.s2_levels) |
> + TCR_T0SZ(kvm->arch.phys_shift);
> + }
I think you can rework the above to make it more obvious what's going on
in this way:
rc = -EPERM;
if (kvm->arch.pgd || phys_shift < kvm->arch.phys_shift)
goto out_unlock;
rc = 0;
if (phys_shift == kvm->arch.phys_shift)
goto out_unlock;
kvm->arch.phys_shift = phys_shift;
kvm->arch.s2_levels = stage2_pt_levels(kvm->arch.phys_shift);
kvm->arch.vtcr_private = VTCR_EL2_SL0(kvm->arch.s2_levels) |
TCR_T0SZ(kvm->arch.phys_shift);
out_unlock:
> + mutex_unlock(&kvm->slots_lock);
> + return rc;
> +}
> +
> /*
> * kvm_init_stage2_config: Initialise the VM specific stage2 page table
> * details to default IPA size.
> */
> static inline void kvm_init_stage2_config(struct kvm *kvm)
> {
> - /*
> - * The stage2 PGD is dependent on the settings we initialise here
> - * and should be allocated only after this step.
> - */
> - VM_BUG_ON(kvm->arch.pgd != NULL);
> - kvm->arch.phys_shift = KVM_PHYS_SHIFT_DEFAULT;
> - kvm->arch.s2_levels = stage2_pt_levels(kvm->arch.phys_shift);
> - kvm->arch.vtcr_private = VTCR_EL2_SL0(kvm->arch.s2_levels) |
> - TCR_T0SZ(kvm->arch.phys_shift);
> + kvm_reconfig_stage2(kvm, KVM_PHYS_SHIFT_DEFAULT);
> }
>
> #endif /* __ASSEMBLY__ */
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index 3256b9228e75..90ceca823aca 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -23,6 +23,7 @@
> #include <linux/kvm_host.h>
> #include <linux/kvm.h>
> #include <linux/hw_breakpoint.h>
> +#include <linux/uaccess.h>
>
> #include <kvm/arm_arch_timer.h>
>
> @@ -81,6 +82,9 @@ int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_VCPU_ATTRIBUTES:
> r = 1;
> break;
> + case KVM_CAP_ARM_CONFIG_PHYS_SHIFT:
> + r = 1;
> + break;
> default:
> r = 0;
> }
> @@ -88,6 +92,30 @@ int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext)
> return r;
> }
>
> +long kvm_arch_dev_vm_ioctl(struct kvm *kvm,
> + unsigned int ioctl, unsigned long arg)
> +{
> + void __user *argp = (void __user *)arg;
> + u32 phys_shift;
> + long r = -EFAULT;
> +
> + switch (ioctl) {
> + case KVM_ARM_GET_PHYS_SHIFT:
> + phys_shift = kvm_phys_shift(kvm);
> + if (!put_user(phys_shift, (u32 __user *)argp))
> + r = 0;
> + break;
> + case KVM_ARM_SET_PHYS_SHIFT:
> + if (!get_user(phys_shift, (u32 __user*)argp))
> + r = kvm_reconfig_stage2(kvm, phys_shift);
> + break;
> + default:
> + r = -EINVAL;
> + }
> + return r;
> +}
> +
> +
> /**
> * kvm_reset_vcpu - sets core registers and sys_regs to reset value
> * @vcpu: The VCPU pointer
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 496e59a2738b..66bfbe19b434 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -932,6 +932,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_HYPERV_SYNIC2 148
> #define KVM_CAP_HYPERV_VP_INDEX 149
> #define KVM_CAP_S390_AIS_MIGRATION 150
> +#define KVM_CAP_ARM_CONFIG_PHYS_SHIFT 151
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -1261,6 +1262,9 @@ struct kvm_s390_ucas_mapping {
> #define KVM_PPC_CONFIGURE_V3_MMU _IOW(KVMIO, 0xaf, struct kvm_ppc_mmuv3_cfg)
> /* Available with KVM_CAP_PPC_RADIX_MMU */
> #define KVM_PPC_GET_RMMU_INFO _IOW(KVMIO, 0xb0, struct kvm_ppc_rmmu_info)
> +/* Available with KVM_CAP_ARM_CONFIG_PHYS_SHIFT */
> +#define KVM_ARM_GET_PHYS_SHIFT _IOR(KVMIO, 0xb1, __u32)
> +#define KVM_ARM_SET_PHYS_SHIFT _IOW(KVMIO, 0xb2, __u32)
>
> /* ioctl for vm fd */
> #define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device)
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index e0bf8d19fcfe..05fc49304722 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -1136,7 +1136,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
> return 0;
> }
> default:
> - return -EINVAL;
> + return kvm_arch_dev_vm_ioctl(kvm, ioctl, arg);
> }
> }
>
> --
> 2.13.6
>
Have you considered making this capability a generic capability and
encoding this in the 'type' argument to KVM_CREATE_VM? This would
significantly simplify all the above and would allow you to drop patch 8
and 9 I think.
Thanks,
-Christoffer
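
For reference, a rough sketch of what that encoding could look like; the KVM_VM_TYPE_ARM_PHYS_SHIFT* names below are purely illustrative and not an existing ABI:

	/*
	 * Hypothetical encoding: the low byte of the KVM_CREATE_VM 'type'
	 * argument carries the requested IPA shift, 0 meaning the 40bit default.
	 */
	#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK	0xffULL
	#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x)	((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)

	/* Userspace asks for a 48bit IPA space at VM creation time ... */
	int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_VM_TYPE_ARM_PHYS_SHIFT(48));

	/*
	 * ... and kvm_arch_init_vm() could size the stage2 layout from 'type'
	 * before anything is mapped, so the pgd allocation would not need to
	 * be delayed and the separate GET/SET ioctls would disappear.
	 */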
Hi Suzuki,
On Tue, Jan 09, 2018 at 07:03:55PM +0000, Suzuki K Poulose wrote:
> On arm64 we have a static limit of 40bits of physical address space
> for the VM with KVM. This series lifts the limitation and allows the
> VM to configure the physical address space upto 52bit on systems
> where it is supported. We retain the default and minimum size to 40bits
> to avoid breaking backward compatibility.
>
> The interface provided is an IOCTL on the VM fd. The guest can change
> only increase the limit from what is already configured, to prevent
> breaking the devices which may have already been configured with a
> particular guest PA. The guest can issue the request until something
> is actually mapped into the stage2 table (e.g, memory region or device).
> This also implies that we now have per VM configuration of stage2
> control registers (VTCR_EL2 bits).
>
> The arm64 page table level helpers are defined based on the page
> table levels used by the host VA. So, the accessors may not work
> if the guest uses more number of levels in stage2 than the stage1
> of the host. In order to provide an independent stage2 page table,
> we refactor the arm64 page table helpers to give us raw accessors
> for each level, which should only used when that level is present.
> And then, based on the VM, we make the decision of the stage2
> page table using the raw accessors.
>
This may come a bit out of left field, but have we considered decoupling
the KVM stage 2 page table manipulation functions even further from the
host page table helpers? I found some of the patches a bit hard to read
with all the wrappers and folding logic considered, so I'm wondering if
it's possible to write something more generic for stage 2 page table
manipulations which doesn't have to fit within a Linux page table
manipulation nomenclature?
Wasn't this also the decision taken for IOMMU page table allocation, and
why was that the right approach for the IOMMU but not for KVM stage 2
page tables? Is there room for reuse of the IOMMU page table allocation
logic in KVM as well?
This may have been discussed already, but I'd like to know the arguments
for doing things the way proposed in this series.
Thanks,
-Christoffer
>
> The series also adds :
> - Support for handling 52bit IPA for vgic ITS.
> - Cleanup in virtio to handle errors when the PFN used in
> the virtio transport doesn't fit in 32bit.
>
> Tested with
> - Modified kvmtool, which can only be used for (patches included in
> the series for reference / testing):
> * with virtio-pci upto 44bit PA (Due to 4K page size for virtio-pci
> legacy implemented by kvmtool)
> * Upto 48bit PA with virtio-mmio, due to 32bit PFN limitation.
> - Hacked Qemu (boot loader support for highmem, phys-shift support)
> * with virtio-pci GIC-v3 ITS & MSI upto 52bit on Foundation model.
>
> The series applies on arm64 for-next/core tree with 52bit PA support patches.
> One would need the fix for virtio_mmio cleanup [1] on top of the arm64
> tree to remove the warnings from virtio.
>
> [1] https://marc.info/?l=linux-virtualization&m=151308636322117&w=2
>
> Kristina Martsenko (1):
> vgic: its: Add support for 52bit guest physical address
>
> Suzuki K Poulose (15):
> virtio: Validate queue pfn for 32bit transports
> irqchip: gicv3-its: Add helpers for handling 52bit address
> arm64: Make page table helpers reusable
> arm64: Refactor pud_huge for reusability
> arm64: Helper for parange to PASize
> kvm: arm/arm64: Fix stage2_flush_memslot for 4 level page table
> kvm: arm/arm64: Remove spurious WARN_ON
> kvm: arm/arm64: Clean up stage2 pgd life time
> kvm: arm/arm64: Delay stage2 page table allocation
> kvm: arm/arm64: Prepare for VM specific stage2 translations
> kvm: arm64: Make stage2 page table layout dynamic
> kvm: arm64: Dynamic configuration of VTCR and VTTBR mask
> kvm: arm64: Configure VTCR per VM
> kvm: arm64: Switch to per VM IPA
> kvm: arm64: Allow configuring physical address space size
>
> Documentation/virtual/kvm/api.txt | 27 +++
> arch/arm/include/asm/kvm_arm.h | 2 -
> arch/arm/include/asm/kvm_host.h | 7 +
> arch/arm/include/asm/kvm_mmu.h | 13 +-
> arch/arm/include/asm/stage2_pgtable.h | 46 +++---
> arch/arm64/include/asm/cpufeature.h | 16 ++
> arch/arm64/include/asm/kvm_arm.h | 112 +++++++++++--
> arch/arm64/include/asm/kvm_asm.h | 2 +-
> arch/arm64/include/asm/kvm_host.h | 21 ++-
> arch/arm64/include/asm/kvm_mmu.h | 83 ++++++++--
> arch/arm64/include/asm/pgalloc.h | 32 +++-
> arch/arm64/include/asm/pgtable.h | 61 ++++---
> arch/arm64/include/asm/stage2_pgtable-nopmd.h | 42 -----
> arch/arm64/include/asm/stage2_pgtable-nopud.h | 39 -----
> arch/arm64/include/asm/stage2_pgtable.h | 211 ++++++++++++++++--------
> arch/arm64/kvm/hyp/s2-setup.c | 34 +---
> arch/arm64/kvm/hyp/switch.c | 8 +
> arch/arm64/kvm/reset.c | 28 ++++
> arch/arm64/mm/hugetlbpage.c | 2 +-
> drivers/irqchip/irq-gic-v3-its.c | 2 +-
> drivers/virtio/virtio_mmio.c | 19 ++-
> drivers/virtio/virtio_pci_legacy.c | 11 +-
> include/linux/irqchip/arm-gic-v3.h | 32 +++-
> include/uapi/linux/kvm.h | 4 +
> virt/kvm/arm/arm.c | 25 ++-
> virt/kvm/arm/mmu.c | 228 +++++++++++++++-----------
> virt/kvm/arm/vgic/vgic-its.c | 36 ++--
> virt/kvm/arm/vgic/vgic-kvm-device.c | 2 +-
> virt/kvm/arm/vgic/vgic-mmio-v3.c | 1 -
> 29 files changed, 738 insertions(+), 408 deletions(-)
> delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopmd.h
> delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopud.h
>
> --
> 2.13.6
>
On 07/02/18 15:10, Christoffer Dall wrote:
> Hi Suzuki,
>
> On Tue, Jan 09, 2018 at 07:03:57PM +0000, Suzuki K Poulose wrote:
>> Add helpers for encoding/decoding 52bit address in GICv3 ITS BASER
>> register. When ITS uses 64K page size, the 52bits of physical address
>> are encoded in BASER[47:12] as follows :
>>
>> Bits[47:16] of the register => bits[47:16] of the physical address
>> Bits[15:12] of the register => bits[51:48] of the physical address
>> bits[15:0] of the physical address are 0.
>>
>> Also adds a mask for CBASER address. This will be used for adding 52bit
>> support for VGIC ITS. More importantly ignore the upper bits if 52bit
>> support is not enabled.
>>
>> Cc: Shanker Donthineni <[email protected]>
>> Cc: Marc Zyngier <[email protected]>
>> Signed-off-by: Suzuki K Poulose <[email protected]>
>> ---
>> +
>> +/*
>> + * With 64K page size, the physical address can be upto 52bit and
>> + * uses the following encoding in the GITS_BASER[47:12]:
>> + *
>> + * Bits[47:16] of the register => bits[47:16] of the base physical address.
>> + * Bits[15:12] of the register => bits[51:48] of the base physical address.
>> + * bits[15:0] of the base physical address are 0.
>> + * Clear the upper bits if the kernel doesn't support 52bits.
>> + */
>> +#define GITS_BASER_ADDR64K_LO_MASK GENMASK_ULL(47, 16)
>> +#define GITS_BASER_ADDR64K_HI_SHIFT 12
>> +#define GITS_BASER_ADDR64K_HI_MOVE (48 - GITS_BASER_ADDR64K_HI_SHIFT)
>> +#define GITS_BASER_ADDR64K_HI_MASK (GITS_PA_HI_MASK << GITS_BASER_ADDR64K_HI_SHIFT)
>> +#define GITS_BASER_ADDR64K_TO_PHYS(x) \
>> + (((x) & GITS_BASER_ADDR64K_LO_MASK) | \
>> + (((x) & GITS_BASER_ADDR64K_HI_MASK) << GITS_BASER_ADDR64K_HI_MOVE))
>> +#define GITS_BASER_ADDR64K_FROM_PHYS(p) \
>> + (((p) & GITS_BASER_ADDR64K_LO_MASK) | \
>> + (((p) >> GITS_BASER_ADDR64K_HI_MOVE) & GITS_BASER_ADDR64K_HI_MASK))
>
> I don't understand why you need this masking logic embedded in these
> macros? Isn't it strictly an error if anyone passes a physical address
> with any of bits [51:48] set to the ITS on a system that doesn't support
> 52 bit PAs, and just silently masking off those bits could lead to some
> interesting cases.
What do you think is the best way to handle such cases ? May be I could add
some checks where we get those addresses and handle it before we use this
macro ?
>
> This is also notably more difficult to read than the existing macro.
>
> If anything, I think it would be more useful to have
> GITS_BASER_TO_PHYS(x) and GITS_PHYS_TO_BASER(x) which takes into account
> CONFIG_ARM64_64K_PAGES.
I thought the 64K_PAGES is not kernel page size, but the page-size configured
by the "requester" for ITS. So, it doesn't really mean CONFIG_ARM64_64K_PAGES.
But the other way around, we can't handle 52bit address unless CONFIG_ARM64_64K_PAGES
is selected. Also, if the guest uses a 4K page size and uses a 48 bit address,
we could potentially mask Bits[15:12] to 0, which is not nice.
So I still think we need to have a special macro for handling addresses with 64K
page size in ITS.
Thanks
Suzuki
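
For illustration, one place such a check could live is where the vgic-its code accepts a guest-programmed BASER/CBASER value, before the conversion macro is ever applied. The helper below is only a sketch (its name and placement are assumptions):

	static bool its_baser_addr_valid(u64 baser, phys_addr_t addr)
	{
		/* Addresses below 48bit are always representable */
		if (!(addr & GENMASK_ULL(51, 48)))
			return true;
		/* Bits [51:48] need 52bit PA support and 64K ITS pages */
		if (!IS_ENABLED(CONFIG_ARM64_PA_BITS_52))
			return false;
		return (baser & GITS_BASER_PAGE_SIZE_MASK) == GITS_BASER_PAGE_SIZE_64K;
	}

That would keep the macros as pure bit manipulation and report an error to the caller instead of silently masking bits off.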
On Thu, Feb 08, 2018 at 11:08:18AM +0000, Suzuki K Poulose wrote:
> On 08/02/18 11:00, Christoffer Dall wrote:
> >On Tue, Jan 09, 2018 at 07:04:00PM +0000, Suzuki K Poulose wrote:
> >>Add a helper to convert ID_AA64MMFR0_EL1:PARange to they physical
> > *the*
> >>size shift. Limit the size to the maximum supported by the kernel.
> >
> >Is this just a cleanup or are we actually going to need this feature in
> >the subsequent patches? That would be nice to motivate in the commit
> >letter.
>
> It is a cleanup, plus we are going to move the user of the code around from
> one place to the other. So this makes it a bit easier and cleaner.
>
On its own I'm not sure it really is a cleanup, so it's good to mention
that this is to make some operation easier later on in the commit
letter.
>
> >>
> >>Cc: Mark Rutland <[email protected]>
> >>Cc: Catalin Marinas <[email protected]>
> >>Cc: Will Deacon <[email protected]>
> >>Cc: Marc Zyngier <[email protected]>
> >>Signed-off-by: Suzuki K Poulose <[email protected]>
> >>---
> >> arch/arm64/include/asm/cpufeature.h | 16 ++++++++++++++++
> >> arch/arm64/kvm/hyp/s2-setup.c | 28 +++++-----------------------
> >> 2 files changed, 21 insertions(+), 23 deletions(-)
> >>
> >>diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> >>index ac67cfc2585a..0564e14616eb 100644
> >>--- a/arch/arm64/include/asm/cpufeature.h
> >>+++ b/arch/arm64/include/asm/cpufeature.h
> >>@@ -304,6 +304,22 @@ static inline u64 read_zcr_features(void)
> >> return zcr;
> >> }
> >>+static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange)
> >>+{
> >>+ switch (parange) {
> >>+ case 0: return 32;
> >>+ case 1: return 36;
> >>+ case 2: return 40;
> >>+ case 3: return 42;
> >>+ case 4: return 44;
> >>+
> >>+ default:
> >
> >What is the case we want to cater for with making parange == 5 the
> >default for unrecognized values?
> >
> >(I have a feeling that default label comes from making the compiler
> >happy about potentially uninitialized values once upon a time before a
> >lot of refactoring happened here.)
>
> That is there to make sure we return 48 iff 52bit support (for that matter,
> if there is a new limit in the future) is not enabled.
>
duh, yeah, it's obvious when I look at it again now.
> >
> >>+ case 5: return 48;
> >>+#ifdef CONFIG_ARM64_PA_BITS_52
> >>+ case 6: return 52;
> >>+#endif
> >>+ }
> >>+}
> >> #endif /* __ASSEMBLY__ */
>
Thanks,
-Christoffer
I can comment on one part here:
On Thu, Feb 08, 2018 at 12:18:44PM +0100, Christoffer Dall wrote:
> Wasn't this also the decision taken for IOMMU page table allocation, and
> why was that the right approach for the IOMMU but not for KVM stage 2
> page tables? Is there room for reuse of the IOMMU page table allocation
> logic in KVM as well?
There were a few reasons we did this for IOMMU page tables:
* Ability to use different page size/VA bits/levels from the CPU
* Ability to support different page table formats (e.g. short descriptor)
* Ability to determine page table attributes at runtime
* Requirement to map/unmap in atomic context
* Ability to cope with non-coherent page table walkers
* Ability to create both stage-1 and stage-2 mappings
* Easier to hook in our own TLB invalidation routines
* Support for lockless concurrent map/unmap within confines of the DMA API
usage
Will
On 08/02/18 11:20, Suzuki K Poulose wrote:
> On 07/02/18 15:10, Christoffer Dall wrote:
>> Hi Suzuki,
>>
>> On Tue, Jan 09, 2018 at 07:03:57PM +0000, Suzuki K Poulose wrote:
>>> Add helpers for encoding/decoding 52bit address in GICv3 ITS BASER
>>> register. When ITS uses 64K page size, the 52bits of physical address
>>> are encoded in BASER[47:12] as follows :
>>>
>>>  Bits[47:16] of the register => bits[47:16] of the physical address
>>>  Bits[15:12] of the register => bits[51:48] of the physical address
>>>  bits[15:0] of the physical address are 0.
>>>
>>> Also adds a mask for CBASER address. This will be used for adding 52bit
>>> support for VGIC ITS. More importantly ignore the upper bits if 52bit
>>> support is not enabled.
>>>
>>> Cc: Shanker Donthineni <[email protected]>
>>> Cc: Marc Zyngier <[email protected]>
>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>> ---
>
>
>>> +
>>> +/*
>>> + * With 64K page size, the physical address can be upto 52bit and
>>> + * uses the following encoding in the GITS_BASER[47:12]:
>>> + *
>>> + * Bits[47:16] of the register => bits[47:16] of the base physical address.
>>> + * Bits[15:12] of the register => bits[51:48] of the base physical address.
>>> + * bits[15:0] of the base physical address are 0.
>>> + * Clear the upper bits if the kernel doesn't support 52bits.
>>> + */
>>> +#define GITS_BASER_ADDR64K_LO_MASK	GENMASK_ULL(47, 16)
>>> +#define GITS_BASER_ADDR64K_HI_SHIFT	12
>>> +#define GITS_BASER_ADDR64K_HI_MOVE	(48 - GITS_BASER_ADDR64K_HI_SHIFT)
>>> +#define GITS_BASER_ADDR64K_HI_MASK	(GITS_PA_HI_MASK << GITS_BASER_ADDR64K_HI_SHIFT)
>>> +#define GITS_BASER_ADDR64K_TO_PHYS(x)					\
>>> +	(((x) & GITS_BASER_ADDR64K_LO_MASK) |				\
>>> +	 (((x) & GITS_BASER_ADDR64K_HI_MASK) << GITS_BASER_ADDR64K_HI_MOVE))
>>> +#define GITS_BASER_ADDR64K_FROM_PHYS(p)					\
>>> +	(((p) & GITS_BASER_ADDR64K_LO_MASK) |				\
>>> +	 (((p) >> GITS_BASER_ADDR64K_HI_MOVE) & GITS_BASER_ADDR64K_HI_MASK))
>>
>> I don't understand why you need this masking logic embedded in these
>> macros? Isn't it strictly an error if anyone passes a physical address
>> with any of bits [51:48] set to the ITS on a system that doesn't support
>> 52 bit PAs, and just silently masking off those bits could lead to some
>> interesting cases.
>
> What do you think is the best way to handle such cases ? May be I could add
> some checks where we get those addresses and handle it before we use this
> macro ?
>
>>
>> This is also notably more difficult to read than the existing macro.
>>
>> If anything, I think it would be more useful to have
>> GITS_BASER_TO_PHYS(x) and GITS_PHYS_TO_BASER(x) which takes into account
>> CONFIG_ARM64_64K_PAGES.
>
> I thought the 64K_PAGES is not kernel page size, but the page-size
> configured
> by the "requester" for ITS. So, it doesn't really mean
> CONFIG_ARM64_64K_PAGES.
> But the other way around, we can't handle 52bit address unless
> CONFIG_ARM64_64K_PAGES
> is selected. Also, if the guest uses a 4K page size and uses a 48 bit
> address,
> we could potentially mask Bits[15:12] to 0, which is not nice.
>
> So I still think we need to have a special macro for handling addresses
> with 64K
> page size in ITS.
If it's allowed to go wrong for invalid input, then you don't even need
to consider the page size at all, except if you care about
micro-optimising out a couple of instructions. For valid page-aligned
addresses, [51:48] and [15:12] can never *both* be nonzero, therefore
just this should be fine for all granules:
- (((phys) & GENMASK_ULL(47, 16)) | (((phys) >> 48) & 0xf) << 12)
+ (((phys) & GENMASK_ULL(47, 0)) | (((phys) >> 48) & 0xf) << 12)
Robin.
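
Following that observation, the encode side needs no granule-specific masking at all, while the decode side remains meaningful only for 64K ITS pages (with 4K/16K pages, register bits [15:12] are genuine address bits). A sketch of the pair, with illustrative names:

	/* phys -> GITS_BASER field: fine for any granule, per the above */
	#define GITS_BASER_PHYS_TO_REG(phys)	\
		(((phys) & GENMASK_ULL(47, 0)) | (((phys) >> 48) & 0xf) << 12)

	/* GITS_BASER field -> phys: only valid when the ITS page size is 64K */
	#define GITS_BASER_REG64K_TO_PHYS(reg)	\
		(((reg) & GENMASK_ULL(47, 16)) | (((reg) & GENMASK_ULL(15, 12)) << 36))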
On Thu, Feb 08, 2018 at 11:20:02AM +0000, Suzuki K Poulose wrote:
> On 07/02/18 15:10, Christoffer Dall wrote:
> >Hi Suzuki,
> >
> >On Tue, Jan 09, 2018 at 07:03:57PM +0000, Suzuki K Poulose wrote:
> >>Add helpers for encoding/decoding 52bit address in GICv3 ITS BASER
> >>register. When ITS uses 64K page size, the 52bits of physical address
> >>are encoded in BASER[47:12] as follows :
> >>
> >> Bits[47:16] of the register => bits[47:16] of the physical address
> >> Bits[15:12] of the register => bits[51:48] of the physical address
> >> bits[15:0] of the physical address are 0.
> >>
> >>Also adds a mask for CBASER address. This will be used for adding 52bit
> >>support for VGIC ITS. More importantly ignore the upper bits if 52bit
> >>support is not enabled.
> >>
> >>Cc: Shanker Donthineni <[email protected]>
> >>Cc: Marc Zyngier <[email protected]>
> >>Signed-off-by: Suzuki K Poulose <[email protected]>
> >>---
>
>
> >>+
> >>+/*
> >>+ * With 64K page size, the physical address can be upto 52bit and
> >>+ * uses the following encoding in the GITS_BASER[47:12]:
> >>+ *
> >>+ * Bits[47:16] of the register => bits[47:16] of the base physical address.
> >>+ * Bits[15:12] of the register => bits[51:48] of the base physical address.
> >>+ * bits[15:0] of the base physical address are 0.
> >>+ * Clear the upper bits if the kernel doesn't support 52bits.
> >>+ */
> >>+#define GITS_BASER_ADDR64K_LO_MASK GENMASK_ULL(47, 16)
> >>+#define GITS_BASER_ADDR64K_HI_SHIFT 12
> >>+#define GITS_BASER_ADDR64K_HI_MOVE (48 - GITS_BASER_ADDR64K_HI_SHIFT)
> >>+#define GITS_BASER_ADDR64K_HI_MASK (GITS_PA_HI_MASK << GITS_BASER_ADDR64K_HI_SHIFT)
> >>+#define GITS_BASER_ADDR64K_TO_PHYS(x) \
> >>+ (((x) & GITS_BASER_ADDR64K_LO_MASK) | \
> >>+ (((x) & GITS_BASER_ADDR64K_HI_MASK) << GITS_BASER_ADDR64K_HI_MOVE))
> >>+#define GITS_BASER_ADDR64K_FROM_PHYS(p) \
> >>+ (((p) & GITS_BASER_ADDR64K_LO_MASK) | \
> >>+ (((p) >> GITS_BASER_ADDR64K_HI_MOVE) & GITS_BASER_ADDR64K_HI_MASK))
> >
> >I don't understand why you need this masking logic embedded in these
> >macros? Isn't it strictly an error if anyone passes a physical address
> >with any of bits [51:48] set to the ITS on a system that doesn't support
> >52 bit PAs, and just silently masking off those bits could lead to some
> >interesting cases.
>
> What do you think is the best way to handle such cases ? May be I could add
> some checks where we get those addresses and handle it before we use this
> macro ?
>
I don't think the conversion macros should try to hide programming
errors. I think we should limit the functionality in the macros to be
simple bit masking and shifting.
Any validation and masking depending on 52 PA support in the kernel
should be done in the context of the functionality, just like the ITS
driver already does.
> >
> >This is also notably more difficult to read than the existing macro.
> >
> >If anything, I think it would be more useful to have
> >GITS_BASER_TO_PHYS(x) and GITS_PHYS_TO_BASER(x) which takes into account
> >CONFIG_ARM64_64K_PAGES.
>
> I thought the 64K_PAGES is not kernel page size, but the page-size configured
> by the "requester" for ITS. So, it doesn't really mean CONFIG_ARM64_64K_PAGES.
You're right, I skimmed this logic too quickly.
> But the other way around, we can't handle 52bit address unless CONFIG_ARM64_64K_PAGES
> is selected. Also, if the guest uses a 4K page size and uses a 48 bit address,
> we could potentially mask Bits[15:12] to 0, which is not nice.
>
> So I still think we need to have a special macro for handling addresses with 64K
> page size in ITS.
>
I think it's easier to have the current GITS_BASER_PHYS_52_to_48 and
have a corresponding GITS_BASER_PHYS_48_to_52, following Robin's
observation.
Any additional logic can be written directly in the C code to check
consistency etc.
Thanks,
-Christoffer
On 08/02/18 11:00, Christoffer Dall wrote:
> On Tue, Jan 09, 2018 at 07:04:03PM +0000, Suzuki K Poulose wrote:
>> On arm/arm64 we pre-allocate the entry level page tables when
>> a VM is created and is free'd when either all the mm users are
>> gone or the KVM is about to get destroyed. i.e, kvm_free_stage2_pgd
>> is triggered via kvm_arch_flush_shadow_all() which can be invoked
>> from two different paths :
>>
>> 1) do_exit()-> .-> mmu_notifier->release()-> ..-> kvm_arch_flush_shadow_all()
>> OR
>> 2) kvm_destroy_vm()-> mmu_notifier_unregister-> kvm_arch_flush_shadow_all()
>>
>> This has created lot of race conditions in the past as some of
>> the VCPUs could be active when we free the stage2 via path (1).
>
> How?? mmu_notifier->release() is called via __mput->exit_mmap(), which
> is only called if mm_users == 0, which means there are no more threads
> left than the one currently doing exit().
IIRC, if the process is sent a fatal signal, that could cause all the threads
to exit, leaving the "last" thread to do the clean up. The files could still
be open, implying that the KVM fds are still active, without a stage2, even
though we are not going to run anything. (The race was fixed by moving the
stage2 teardown to mmu_notifier->release()).
>
>>
>> On a closer look, all we need to do with kvm_arch_flush_shadow_all() is,
>> to ensure that the stage2 mappings are cleared. This doesn't mean we
>> have to free up the stage2 entry level page tables yet, which could
>> be delayed until the kvm is destroyed. This would avoid issues
>> of use-after-free,
>
> do we have any of those left?
None that we know of.
>
>> This will be later used for delaying
>> the allocation of the stage2 entry level page tables until we really
>> need to do something with it.
>
> Fine, but you don't actually explain why this change of flow is
> necessary for what you're trying to do later?
This patch is not mandatory for the series. But, since we are delaying
the "allocation" stage2 tables anyway later, I thought it would be
good to clean up the "free" path.
>> int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index 78253fe00fc4..c94c61ac38b9 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -298,11 +298,10 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>> pgd = kvm->arch.pgd + stage2_pgd_index(addr);
>> do {
>> /*
>> - * Make sure the page table is still active, as another thread
>> - * could have possibly freed the page table, while we released
>> - * the lock.
>> + * The page table shouldn't be free'd as we still hold a reference
>> + * to the KVM.
>
> To avoid confusion about references to the kernel module KVM and a
> specific KVM VM instance, please s/KVM/VM/.
ok.
>
>> */
>> - if (!READ_ONCE(kvm->arch.pgd))
>> + if (WARN_ON(!READ_ONCE(kvm->arch.pgd)))
>
> This reads a lot like a defensive implementation now, and I think for
> this patch to make sense, we shouldn't try to handle a buggy super-racy
> implementation gracefully, but rather have VM_BUG_ON() (if we care about
> performance of the check) or simply BUG_ON().
>
> The rationale being that if we've gotten this flow incorrect and freed
> the pgd at the wrong time, we don't want to leave a ticking bomb to blow
> up somewhere else randomly (which it will!), but instead crash and burn.
Ok.
>
>> break;
>> next = stage2_pgd_addr_end(addr, end);
>> if (!stage2_pgd_none(*pgd))
>> @@ -837,30 +836,33 @@ void stage2_unmap_vm(struct kvm *kvm)
>> up_read(&current->mm->mmap_sem);
>> srcu_read_unlock(&kvm->srcu, idx);
>> }
>> -
>> -/**
>> - * kvm_free_stage2_pgd - free all stage-2 tables
>> - * @kvm: The KVM struct pointer for the VM.
>> - *
>> - * Walks the level-1 page table pointed to by kvm->arch.pgd and frees all
>> - * underlying level-2 and level-3 tables before freeing the actual level-1 table
>> - * and setting the struct pointer to NULL.
>> +/*
>> + * kvm_flush_stage2_all: Unmap the entire stage2 mappings including
>> + * device and regular RAM backing memory.
>> */
>> -void kvm_free_stage2_pgd(struct kvm *kvm)
>> +static void kvm_flush_stage2_all(struct kvm *kvm)
>> {
>> - void *pgd = NULL;
>> -
>> spin_lock(&kvm->mmu_lock);
>> - if (kvm->arch.pgd) {
>> + if (kvm->arch.pgd)
>> unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
>> - pgd = READ_ONCE(kvm->arch.pgd);
>> - kvm->arch.pgd = NULL;
>> - }
>> spin_unlock(&kvm->mmu_lock);
>> +}
>>
>> - /* Free the HW pgd, one page at a time */
>> - if (pgd)
>> - free_pages_exact(pgd, S2_PGD_SIZE);
>> +/**
>> + * kvm_free_stage2_pgd - Free the entry level page tables in stage-2.
>
> nit: you should put the parameter description here and leave a blank
> line before the lengthy discussion.
>
>> + * This is called when all reference to the KVM has gone away and we
>> + * really don't need any protection in resetting the PGD. This also
>
> I don't think I understand the last part of this sentence.
i.e, we don't have to use the spin_lock to reset the PGD.
> This function is pretty self-explanatory really, and I think we can
> either drop the documentation all together or simply say that this
> function clears all stage 2 page table entries to release the memory of
> the lower-level page tables themselves and then frees the pgd in the
> end. The VM is known to go away and no more VCPUs exist at this point.
>
Ok.
>> + * means that nobody should be touching stage2 at this point, as we
>> + * have unmapped the entire stage2 already and all dynamic entities,
>> + * (VCPUs and devices) are no longer active.
>> + *
>> + * @kvm: The KVM struct pointer for the VM.
>> + */
>> +void kvm_free_stage2_pgd(struct kvm *kvm)
>> +{
>> + kvm_flush_stage2_all(kvm);
>> + free_pages_exact(kvm->arch.pgd, S2_PGD_SIZE);
>> + kvm->arch.pgd = NULL;
>> }
>>
>> static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
>> @@ -1189,12 +1191,12 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>> * large. Otherwise, we may see kernel panics with
>> * CONFIG_DETECT_HUNG_TASK, CONFIG_LOCKUP_DETECTOR,
>> * CONFIG_LOCKDEP. Additionally, holding the lock too long
>> - * will also starve other vCPUs. We have to also make sure
>> - * that the page tables are not freed while we released
>> - * the lock.
>> + * will also starve other vCPUs.
>> + * The page tables shouldn't be free'd while we released the
>
> s/shouldn't/can't/
>
ok
>> + * lock, since we hold a reference to the KVM.
>
> s/KVM/VM/
>
ok.
Thanks
Suzuki
On 08/02/18 11:01, Christoffer Dall wrote:
> On Tue, Jan 09, 2018 at 07:04:04PM +0000, Suzuki K Poulose wrote:
>> We allocate the entry level page tables for stage2 when the
>> VM is created. This doesn't give us the flexibility of configuring
>> the physical address space size for a VM. In order to allow
>> the VM to choose the required size, we delay the allocation of
>> stage2 entry level tables until we really try to map something.
>>
>> This could be either when the VM creates a memory range or when
>> it tries to map a device memory. So we add in a hook to these
>> two places to make sure the tables are allocated. We use
>> kvm->slots_lock to serialize the allocation entry point, since
>> we add hooks to the arch specific call back with the mutex held.
...
>>
>> -/**
>> - * kvm_phys_addr_ioremap - map a device range to guest IPA
>> - *
>> - * @kvm: The KVM pointer
>> - * @guest_ipa: The IPA at which to insert the mapping
>> - * @pa: The physical address of the device
>> - * @size: The size of the mapping
>> +/*
>> + * Finalise the stage2 page table layout. Must be called with kvm->slots_lock
>> + * held.
>> */
>> -int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>> +static int __kvm_init_stage2_table(struct kvm *kvm)
>> +{
>> + /* Double check if somebody has already allocated it */
>
> dubious comment: Either leave it out or explain that we need to check
> again with the mutex held.
>
>> + if (likely(kvm->arch.pgd))
>> + return 0;
>> + return kvm_alloc_stage2_pgd(kvm);
>> +}
>> +
>> +static int kvm_init_stage2_table(struct kvm *kvm)
>> +{
>> + int rc;
>> +
>> + /*
>> + * Once allocated, the stage2 entry level tables are only
>> + * freed when the KVM instance is destroyed. So, if we see
>> + * something valid here, that guarantees that we have
>> + * done the one time allocation and it is something valid
>> + * and won't go away until the last reference to the KVM
>> + * is gone.
>> + */
>
> Really not sure if this comment adds something beyond what's described
> by the code already?
Agreed. Will clean it up.
Thanks
Suzuki
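
A possible rewording of that comment, assuming __kvm_init_stage2_table() is the variant invoked with kvm->slots_lock held and the outer kvm_init_stage2_table() has already done an unlocked fast-path check:

	/*
	 * Called with kvm->slots_lock held. Check kvm->arch.pgd again here,
	 * since another ioctl may have raced with the unlocked check in the
	 * caller and already allocated the entry level tables.
	 */
	static int __kvm_init_stage2_table(struct kvm *kvm)
	{
		if (likely(kvm->arch.pgd))
			return 0;
		return kvm_alloc_stage2_pgd(kvm);
	}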
On 08/02/18 11:00, Christoffer Dall wrote:
> On Tue, Jan 09, 2018 at 07:04:09PM +0000, Suzuki K Poulose wrote:
>> Now that we can manage the stage2 page table per VM, switch the
>> configuration details to per VM instance. We keep track of the
>> IPA bits, number of page table levels and the VTCR bits (which
>> depends on the IPA and the number of levels).
>>
>> Cc: Marc Zyngier <[email protected]>
>> Cc: Christoffer Dall <[email protected]>
>> Signed-off-by: Suzuki K Poulose <[email protected]>
>> ---
>> arch/arm/include/asm/kvm_mmu.h | 1 +
>> arch/arm64/include/asm/kvm_host.h | 12 ++++++++++++
>> arch/arm64/include/asm/kvm_mmu.h | 22 ++++++++++++++++++++--
>> arch/arm64/include/asm/stage2_pgtable.h | 1 -
>> arch/arm64/kvm/hyp/switch.c | 3 +--
>> virt/kvm/arm/arm.c | 2 +-
>> 6 files changed, 35 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
>> index 440c80589453..dd592fe45660 100644
>> --- a/arch/arm/include/asm/kvm_mmu.h
>> +++ b/arch/arm/include/asm/kvm_mmu.h
>> @@ -48,6 +48,7 @@
>> #define kvm_vttbr_baddr_mask(kvm) VTTBR_BADDR_MASK
>>
>> #define stage2_pgd_size(kvm) (PTRS_PER_S2_PGD * sizeof(pgd_t))
>> +#define kvm_init_stage2_config(kvm) do { } while (0)
>> int create_hyp_mappings(void *from, void *to, pgprot_t prot);
>> int create_hyp_io_mappings(void *from, void *to, phys_addr_t);
>> void free_hyp_pgds(void);
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 9a9ddeb33c84..1e66e5ab3dde 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -64,6 +64,18 @@ struct kvm_arch {
>> /* VTTBR value associated with above pgd and vmid */
>> u64 vttbr;
>>
>> + /* Private bits of VTCR_EL2 for this VM */
>> + u64 vtcr_private;
>
> As to my comments in the previous patch, why isn't this simply u64 vtcr;
nit: I haven't received your response to the previous patch.
We could. I thought this gives a bit more clarity on what changes per-VM.
Thanks
Suzuki
On 08/02/18 11:14, Christoffer Dall wrote:
> On Tue, Jan 09, 2018 at 07:04:10PM +0000, Suzuki K Poulose wrote:
>> Allow the guests to choose a larger physical address space size.
>> The default and minimum size is 40bits. A guest can change this
>> right after the VM creation, but before the stage2 entry page
>> tables are allocated (i.e, before it registers a memory range
>> or maps a device address). The size is restricted to the maximum
>> supported by the host. Also, the guest can only increase the PA size,
>> from the existing value, as reducing it could break the devices which
>> may have verified their physical address for validity and may do a
>> lazy mapping(e.g, VGIC).
>>
>> Cc: Marc Zyngier <[email protected]>
>> Cc: Christoffer Dall <[email protected]>
>> Cc: Peter Maydell <[email protected]>
>> Signed-off-by: Suzuki K Poulose <[email protected]>
>> ---
>> Documentation/virtual/kvm/api.txt | 27 ++++++++++++++++++++++++++
>> arch/arm/include/asm/kvm_host.h | 7 +++++++
>> arch/arm64/include/asm/kvm_host.h | 1 +
>> arch/arm64/include/asm/kvm_mmu.h | 41 ++++++++++++++++++++++++++++++---------
>> arch/arm64/kvm/reset.c | 28 ++++++++++++++++++++++++++
>> include/uapi/linux/kvm.h | 4 ++++
>> virt/kvm/arm/arm.c | 2 +-
>> 7 files changed, 100 insertions(+), 10 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>> index 57d3ee9e4bde..a203faf768c4 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -3403,6 +3403,33 @@ invalid, if invalid pages are written to (e.g. after the end of memory)
>> or if no page table is present for the addresses (e.g. when using
>> hugepages).
>>
>> +4.109 KVM_ARM_GET_PHYS_SHIFT
>> +
>> +Capability: KVM_CAP_ARM_CONFIG_PHYS_SHIFT
>> +Architectures: arm64
>> +Type: vm ioctl
>> +Parameters: __u32 (out)
>> +Returns: 0 on success, a negative value on error
>> +
>> +This ioctl is used to get the current maximum physical address size for
>> +the VM. The value is Log2(Maximum_Physical_Address). This is neither the
>> + amount of physical memory assigned to the VM nor the maximum physical address
>> +that a real CPU on the host can handle. Rather, this is the upper limit of the
>> +guest physical address that can be used by the VM.
>
> What is the point of this? Presumably if userspace has set the size, it
> already knows the size?
This can help the userspace know what the "default" limit is. As such I am
not particular about keeping this around.
>
>> +
>> +4.109 KVM_ARM_SET_PHYS_SHIFT
>> +
>> +Capability: KVM_CAP_ARM_CONFIG_PHYS_SHIFT
>> +Architectures: arm64
>> +Type: vm ioctl
>> +Parameters: __u32 (in)
>> +Returns: 0 on success, a negative value on error
>> +
>> +This ioctl is used to set the maximum physical address size for
>> +the VM. The value is Log2(Maximum_Physical_Address). The value can only
>> +be increased from the existing setting. The value cannot be changed
>> +after the stage-2 page tables are allocated and will return an error.
>> +
>
> Is there a way for userspace to discover what the underlying hardware
> can actually support, beyond trial-and-error on this ioctl?
Unfortunately, there is none. We don't expose ID_AA64MMFR0 via mrs emulation.
>> +static inline int kvm_reconfig_stage2(struct kvm *kvm, u32 phys_shift)
>> +{
>> + int rc = 0;
>> + unsigned int pa_max, parange;
>> +
>> + parange = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1) & 7;
>> + pa_max = id_aa64mmfr0_parange_to_phys_shift(parange);
>> + /* Raise it to 40bits for backward compatibility */
>> + pa_max = (pa_max < 40) ? 40 : pa_max;
>> + /* Make sure the size is supported/available */
>> + if (phys_shift > PHYS_MASK_SHIFT || phys_shift > pa_max)
>> + return -EINVAL;
>> + /*
>> + * The stage2 PGD is dependent on the settings we initialise here
>> + * and should be allocated only after this step. We cannot allow
>> + * down sizing the IPA size as there could be devices or memory
>> + * regions, that depend on the previous size.
>> + */
>> + mutex_lock(&kvm->slots_lock);
>> + if (kvm->arch.pgd || phys_shift < kvm->arch.phys_shift) {
>> + rc = -EPERM;
>> + } else if (phys_shift > kvm->arch.phys_shift) {
>> + kvm->arch.phys_shift = phys_shift;
>> + kvm->arch.s2_levels = stage2_pt_levels(kvm->arch.phys_shift);
>> + kvm->arch.vtcr_private = VTCR_EL2_SL0(kvm->arch.s2_levels) |
>> + TCR_T0SZ(kvm->arch.phys_shift);
>> + }
>
> I think you can rework the above to make it more obvious what's going on
> in this way:
>
> rc = -EPERM;
> if (kvm->arch.pgd || phys_shift < kvm->arch.phys_shift)
> goto out_unlock;
>
> rc = 0;
> if (phys_shift == kvm->arch.phys_shift)
> goto out_unlock;
>
> kvm->arch.phys_shift = phys_shift;
> kvm->arch.s2_levels = stage2_pt_levels(kvm->arch.phys_shift);
> kvm->arch.vtcr_private = VTCR_EL2_SL0(kvm->arch.s2_levels) |
> TCR_T0SZ(kvm->arch.phys_shift);
>
> out_unlock:
>
Sure.
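
Folding that into the rest of the function from the patch, kvm_reconfig_stage2() could end up roughly as below (a sketch of the respin, not the final code):

	static inline int kvm_reconfig_stage2(struct kvm *kvm, u32 phys_shift)
	{
		int rc;
		unsigned int pa_max, parange;

		parange = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1) & 7;
		pa_max = id_aa64mmfr0_parange_to_phys_shift(parange);
		/* Raise it to 40bits for backward compatibility */
		pa_max = (pa_max < 40) ? 40 : pa_max;
		/* Make sure the size is supported/available */
		if (phys_shift > PHYS_MASK_SHIFT || phys_shift > pa_max)
			return -EINVAL;

		mutex_lock(&kvm->slots_lock);

		rc = -EPERM;
		if (kvm->arch.pgd || phys_shift < kvm->arch.phys_shift)
			goto out_unlock;

		rc = 0;
		if (phys_shift == kvm->arch.phys_shift)
			goto out_unlock;

		kvm->arch.phys_shift = phys_shift;
		kvm->arch.s2_levels = stage2_pt_levels(kvm->arch.phys_shift);
		kvm->arch.vtcr_private = VTCR_EL2_SL0(kvm->arch.s2_levels) |
					 TCR_T0SZ(kvm->arch.phys_shift);

	out_unlock:
		mutex_unlock(&kvm->slots_lock);
		return rc;
	}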
>> --- a/virt/kvm/arm/arm.c
>> +++ b/virt/kvm/arm/arm.c
>> @@ -1136,7 +1136,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
>> return 0;
>> }
>> default:
>> - return -EINVAL;
>> + return kvm_arch_dev_vm_ioctl(kvm, ioctl, arg);
>> }
>> }
>>
>> --
>> 2.13.6
>>
>
> Have you considered making this capability a generic capability and
> encoding this in the 'type' argument to KVM_CREATE_VM? This would
> significantly simplify all the above and would allow you to drop patch 8
> and 9 I think.
No. I will take a look. Btw, there were patches flying around to support
"userspace" requesting specific values for ID feature registers. But even that
doesn't help with the detection part. May be that is another way to configure
the size, but not sure about the current status of that work.
Cheers
Suzuki
On Tue, Jan 09, 2018 at 07:04:08PM +0000, Suzuki K Poulose wrote:
> We set VTCR_EL2 very early during the stage2 init and don't
> touch it ever. This is fine as we had a fixed IPA size. This
> patch changes the behavior to set the VTCR for a given VM,
> depending on its stage2 table. The common configuration for
> VTCR is still performed during the early init. But the SL0
> and T0SZ are programmed for each VM and is cleared once we
> exit the VM.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> arch/arm64/include/asm/kvm_arm.h | 16 ++++++----------
> arch/arm64/include/asm/kvm_asm.h | 2 +-
> arch/arm64/include/asm/kvm_host.h | 8 +++++---
> arch/arm64/kvm/hyp/s2-setup.c | 16 +---------------
> arch/arm64/kvm/hyp/switch.c | 9 +++++++++
> 5 files changed, 22 insertions(+), 29 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index eb90d349e55f..d5c40816f073 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -115,9 +115,7 @@
> #define VTCR_EL2_IRGN0_WBWA TCR_IRGN0_WBWA
> #define VTCR_EL2_SL0_SHIFT 6
> #define VTCR_EL2_SL0_MASK (3 << VTCR_EL2_SL0_SHIFT)
> -#define VTCR_EL2_SL0_LVL1 (1 << VTCR_EL2_SL0_SHIFT)
> #define VTCR_EL2_T0SZ_MASK 0x3f
> -#define VTCR_EL2_T0SZ_40B 24
> #define VTCR_EL2_VS_SHIFT 19
> #define VTCR_EL2_VS_8BIT (0 << VTCR_EL2_VS_SHIFT)
> #define VTCR_EL2_VS_16BIT (1 << VTCR_EL2_VS_SHIFT)
> @@ -139,38 +137,36 @@
> * D4-23 and D4-25 in ARM DDI 0487A.b.
> */
>
> -#define VTCR_EL2_T0SZ_IPA VTCR_EL2_T0SZ_40B
> #define VTCR_EL2_COMMON_BITS (VTCR_EL2_SH0_INNER | VTCR_EL2_ORGN0_WBWA | \
> VTCR_EL2_IRGN0_WBWA | VTCR_EL2_RES1)
> +#define VTCR_EL2_PRIVATE_MASK (VTCR_EL2_SL0_MASK | VTCR_EL2_T0SZ_MASK)
>
> #ifdef CONFIG_ARM64_64K_PAGES
> /*
> * Stage2 translation configuration:
> * 64kB pages (TG0 = 1)
> - * 2 level page tables (SL = 1)
> */
> -#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
> +#define VTCR_EL2_TGRAN VTCR_EL2_TG0_64K
> #define VTCR_EL2_TGRAN_SL0_BASE 3UL
>
> #elif defined(CONFIG_ARM64_16K_PAGES)
> /*
> * Stage2 translation configuration:
> * 16kB pages (TG0 = 2)
> - * 2 level page tables (SL = 1)
> */
> -#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_16K | VTCR_EL2_SL0_LVL1)
> +#define VTCR_EL2_TGRAN VTCR_EL2_TG0_16K
> #define VTCR_EL2_TGRAN_SL0_BASE 3UL
> #else /* 4K */
> /*
> * Stage2 translation configuration:
> * 4kB pages (TG0 = 0)
> - * 3 level page tables (SL = 1)
> */
> -#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_4K | VTCR_EL2_SL0_LVL1)
> +#define VTCR_EL2_TGRAN VTCR_EL2_TG0_4K
> #define VTCR_EL2_TGRAN_SL0_BASE 2UL
> #endif
>
> -#define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN_FLAGS)
> +#define VTCR_EL2_FLAGS (VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN)
> +
> /*
> * VTCR_EL2:SL0 indicates the entry level for Stage2 translation.
> * Interestingly, it depends on the page size.
> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> index ab4d0a926043..21cfd1fe692c 100644
> --- a/arch/arm64/include/asm/kvm_asm.h
> +++ b/arch/arm64/include/asm/kvm_asm.h
> @@ -66,7 +66,7 @@ extern void __vgic_v3_init_lrs(void);
>
> extern u32 __kvm_get_mdcr_el2(void);
>
> -extern u32 __init_stage2_translation(void);
> +extern void __init_stage2_translation(void);
>
> #endif
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index ea6cb5b24258..9a9ddeb33c84 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -380,10 +380,12 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>
> static inline void __cpu_init_stage2(void)
> {
> - u32 parange = kvm_call_hyp(__init_stage2_translation);
> + u32 ps;
>
> - WARN_ONCE(parange < 40,
> - "PARange is %d bits, unsupported configuration!", parange);
> + kvm_call_hyp(__init_stage2_translation);
> + ps = id_aa64mmfr0_parange_to_phys_shift(read_sysreg(id_aa64mmfr0_el1));
> + WARN_ONCE(ps < 40,
> + "PARange is %d bits, unsupported configuration!", ps);
> }
>
> /*
> diff --git a/arch/arm64/kvm/hyp/s2-setup.c b/arch/arm64/kvm/hyp/s2-setup.c
> index b1129c83c531..5c26ad4b8ac9 100644
> --- a/arch/arm64/kvm/hyp/s2-setup.c
> +++ b/arch/arm64/kvm/hyp/s2-setup.c
> @@ -19,13 +19,11 @@
> #include <asm/kvm_arm.h>
> #include <asm/kvm_asm.h>
> #include <asm/kvm_hyp.h>
> -#include <asm/cpufeature.h>
>
> -u32 __hyp_text __init_stage2_translation(void)
> +void __hyp_text __init_stage2_translation(void)
> {
> u64 val = VTCR_EL2_FLAGS;
> u64 parange;
> - u32 phys_shift;
> u64 tmp;
>
> /*
> @@ -38,16 +36,6 @@ u32 __hyp_text __init_stage2_translation(void)
> parange = ID_AA64MMFR0_PARANGE_MAX;
> val |= parange << 16;
>
> - /* Compute the actual PARange... */
> - phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange);
> -
> - /*
> - * ... and clamp it to 40 bits, unless we have some braindead
> - * HW that implements less than that. In all cases, we'll
> - * return that value for the rest of the kernel to decide what
> - * to do.
> - */
> - val |= 64 - (phys_shift > 40 ? 40 : phys_shift);
>
> /*
> * Check the availability of Hardware Access Flag / Dirty Bit
> @@ -67,6 +55,4 @@ u32 __hyp_text __init_stage2_translation(void)
> VTCR_EL2_VS_8BIT;
>
> write_sysreg(val, vtcr_el2);
> -
> - return phys_shift;
> }
> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> index f7c651f3a8c0..523471f0af7b 100644
> --- a/arch/arm64/kvm/hyp/switch.c
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -157,11 +157,20 @@ static void __hyp_text __deactivate_traps(struct kvm_vcpu *vcpu)
> static void __hyp_text __activate_vm(struct kvm_vcpu *vcpu)
> {
> struct kvm *kvm = kern_hyp_va(vcpu->kvm);
> + u64 vtcr = read_sysreg(vtcr_el2);
> +
> + vtcr &= ~VTCR_EL2_PRIVATE_MASK;
> + vtcr |= VTCR_EL2_SL0(stage2_pt_levels(kvm)) |
> + VTCR_EL2_T0SZ(kvm_phys_shift(kvm));
> + write_sysreg(vtcr, vtcr_el2);
If we're writing VTCR_EL2 on each entry, do we really need to read the
value back first and calculate things on every entry to the VM? It
seems to me we should be able to compute the vtcr_el2 and store it on
struct kvm, and simply restore that per-VM value upon entering the VM?
> write_sysreg(kvm->arch.vttbr, vttbr_el2);
> }
>
> static void __hyp_text __deactivate_vm(struct kvm_vcpu *vcpu)
> {
> + u64 vtcr = read_sysreg(vtcr_el2) & ~VTCR_EL2_PRIVATE_MASK;
> +
> + write_sysreg(vtcr, vtcr_el2);
Why do we need to care about restoring VTCR when returning to the host?
> write_sysreg(0, vttbr_el2);
> }
>
> --
> 2.13.6
>
Thanks,
-Christoffer
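
For reference, a sketch of the alternative described above, keeping one full per-VM copy of VTCR_EL2. The kvm->arch.vtcr field and the vtcr_common value are assumptions of the sketch, with the host-wide bits (PARange, VS, HA/HD) coming from whatever the early init currently probes:

	/* Computed once, when the IPA size is (re)configured for the VM */
	kvm->arch.vtcr = vtcr_common |
			 VTCR_EL2_SL0(kvm->arch.s2_levels) |
			 VTCR_EL2_T0SZ(kvm->arch.phys_shift);

	/*
	 * The world switch then becomes a plain restore, with nothing to
	 * read back or undo on exit from the guest.
	 */
	static void __hyp_text __activate_vm(struct kvm_vcpu *vcpu)
	{
		struct kvm *kvm = kern_hyp_va(vcpu->kvm);

		write_sysreg(kvm->arch.vtcr, vtcr_el2);
		write_sysreg(kvm->arch.vttbr, vttbr_el2);
	}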
On Thu, Feb 08, 2018 at 05:19:22PM +0000, Suzuki K Poulose wrote:
> On 08/02/18 11:00, Christoffer Dall wrote:
> >On Tue, Jan 09, 2018 at 07:04:03PM +0000, Suzuki K Poulose wrote:
> >>On arm/arm64 we pre-allocate the entry level page tables when
> >>a VM is created and is free'd when either all the mm users are
> >>gone or the KVM is about to get destroyed. i.e, kvm_free_stage2_pgd
> >>is triggered via kvm_arch_flush_shadow_all() which can be invoked
> >>from two different paths :
> >>
> >> 1) do_exit()-> .-> mmu_notifier->release()-> ..-> kvm_arch_flush_shadow_all()
> >> OR
> >> 2) kvm_destroy_vm()-> mmu_notifier_unregister-> kvm_arch_flush_shadow_all()
> >>
> >>This has created lot of race conditions in the past as some of
> >>the VCPUs could be active when we free the stage2 via path (1).
> >
> >How?? mmu_notifier->release() is called via __mput->exit_mmap(), which
> >is only called if mm_users == 0, which means there are no more threads
> >left than the one currently doing exit().
>
> IIRC, if the process is sent a fatal signal, that could cause all the threads
> to exit, leaving the "last" thread to do the clean up. The files could still
> be open, implying that the KVM fds are still active, without a stage2, even
> though we are not going to run anything. (The race was fixed by moving the
> stage2 teardown to mmu_notifier->release()).
>
>
Hmm, if the last thread is do_exit(), by definition there can't be any
other VCPU thread (because then it wouldn't be the last one) and
therefore only this last exiting thread can have the fd open, and since
it's in the middle of do_exit(), it will close the fds before anything
will have chance to run.
> >
> >>
> >>On a closer look, all we need to do with kvm_arch_flush_shadow_all() is,
> >>to ensure that the stage2 mappings are cleared. This doesn't mean we
> >>have to free up the stage2 entry level page tables yet, which could
> >>be delayed until the kvm is destroyed. This would avoid issues
> >>of use-after-free,
> >
> >do we have any of those left?
>
> None that we know of.
>
Then I think this commit text is misleading and pretty confusing. If
we have a correct implementation, but we want to clean something up,
then this commit message shouldn't talk about races or use-after-free,
it should just say that we rearrange code to change the flow, and
describe why/how.
> >
> >>This will be later used for delaying
> >>the allocation of the stage2 entry level page tables until we really
> >>need to do something with it.
> >
> >Fine, but you don't actually explain why this change of flow is
> >necessary for what you're trying to do later?
>
> This patch is not mandatory for the series. But, since we are delaying
> the "allocation" stage2 tables anyway later, I thought it would be
> good to clean up the "free" path.
>
Hmm, I'm not really sure it is a cleanup. In any case, the motivation
for this change should be clear. I do like the idea of getting rid of
the kvm->arch.pgd checks in the various stage2 manipulation functions.
> >> int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >>index 78253fe00fc4..c94c61ac38b9 100644
> >>--- a/virt/kvm/arm/mmu.c
> >>+++ b/virt/kvm/arm/mmu.c
> >>@@ -298,11 +298,10 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
> >> pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> >> do {
> >> /*
> >>- * Make sure the page table is still active, as another thread
> >>- * could have possibly freed the page table, while we released
> >>- * the lock.
> >>+ * The page table shouldn't be free'd as we still hold a reference
> >>+ * to the KVM.
> >
> >To avoid confusion about references to the kernel module KVM and a
> >specific KVM VM instance, please s/KVM/VM/.
>
> ok.
>
> >
> >> */
> >>- if (!READ_ONCE(kvm->arch.pgd))
> >>+ if (WARN_ON(!READ_ONCE(kvm->arch.pgd)))
> >
> >This reads a lot like a defensive implementation now, and I think for
> >this patch to make sense, we shouldn't try to handle a buggy super-racy
> >implementation gracefully, but rather have VM_BUG_ON() (if we care about
> >performance of the check) or simply BUG_ON().
> >
> >The rationale being that if we've gotten this flow incorrect and freed
> >the pgd at the wrong time, we don't want to leave a ticking bomb to blow
> >up somewhere else randomly (which it will!), but instead crash and burn.
>
> Ok.
>
> >
> >> break;
> >> next = stage2_pgd_addr_end(addr, end);
> >> if (!stage2_pgd_none(*pgd))
> >>@@ -837,30 +836,33 @@ void stage2_unmap_vm(struct kvm *kvm)
> >> up_read(&current->mm->mmap_sem);
> >> srcu_read_unlock(&kvm->srcu, idx);
> >> }
> >>-
> >>-/**
> >>- * kvm_free_stage2_pgd - free all stage-2 tables
> >>- * @kvm: The KVM struct pointer for the VM.
> >>- *
> >>- * Walks the level-1 page table pointed to by kvm->arch.pgd and frees all
> >>- * underlying level-2 and level-3 tables before freeing the actual level-1 table
> >>- * and setting the struct pointer to NULL.
> >>+/*
> >>+ * kvm_flush_stage2_all: Unmap the entire stage2 mappings including
> >>+ * device and regular RAM backing memory.
> >> */
> >>-void kvm_free_stage2_pgd(struct kvm *kvm)
> >>+static void kvm_flush_stage2_all(struct kvm *kvm)
> >> {
> >>- void *pgd = NULL;
> >>-
> >> spin_lock(&kvm->mmu_lock);
> >>- if (kvm->arch.pgd) {
> >>+ if (kvm->arch.pgd)
> >> unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
> >>- pgd = READ_ONCE(kvm->arch.pgd);
> >>- kvm->arch.pgd = NULL;
> >>- }
> >> spin_unlock(&kvm->mmu_lock);
> >>+}
> >>- /* Free the HW pgd, one page at a time */
> >>- if (pgd)
> >>- free_pages_exact(pgd, S2_PGD_SIZE);
> >>+/**
> >>+ * kvm_free_stage2_pgd - Free the entry level page tables in stage-2.
> >
> >nit: you should put the parameter description here and leave a blank
> >line before the lengthy discussion.
> >
> >>+ * This is called when all reference to the KVM has gone away and we
> >>+ * really don't need any protection in resetting the PGD. This also
> >
> >I don't think I understand the last part of this sentence.
>
> i.e, we don't have to use the spin_lock to reset the PGD.
>
Ah, then I think it should say
"We can set kvm->arch.pgd = NULL without holding a spinlock because this
function is only called when the exiting thread is the only process
holding a reference to the struct kvm."
> >This function is pretty self-explanatory really, and I think we can
> >either drop the documentation all together or simply say that this
> >function clears all stage 2 page table entries to release the memory of
> >the lower-level page tables themselves and then frees the pgd in the
> >end. The VM is known to go away and no more VCPUs exist at this point.
> >
>
> Ok.
>
> >>+ * means that nobody should be touching stage2 at this point, as we
> >>+ * have unmapped the entire stage2 already and all dynamic entities,
> >>+ * (VCPUs and devices) are no longer active.
> >>+ *
> >>+ * @kvm: The KVM struct pointer for the VM.
> >>+ */
> >>+void kvm_free_stage2_pgd(struct kvm *kvm)
> >>+{
> >>+ kvm_flush_stage2_all(kvm);
> >>+ free_pages_exact(kvm->arch.pgd, S2_PGD_SIZE);
> >>+ kvm->arch.pgd = NULL;
> >> }
> >> static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
> >>@@ -1189,12 +1191,12 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> >> * large. Otherwise, we may see kernel panics with
> >> * CONFIG_DETECT_HUNG_TASK, CONFIG_LOCKUP_DETECTOR,
> >> * CONFIG_LOCKDEP. Additionally, holding the lock too long
> >>- * will also starve other vCPUs. We have to also make sure
> >>- * that the page tables are not freed while we released
> >>- * the lock.
> >>+ * will also starve other vCPUs.
> >>+ * The page tables shouldn't be free'd while we released the
> >
> >s/shouldn't/can't/
> >
>
> ok
> >>+ * lock, since we hold a reference to the KVM.
> >
> >s/KVM/VM/
> >
>
> ok.
>
>
Thanks,
-Christoffer
On Thu, Feb 08, 2018 at 05:22:29PM +0000, Suzuki K Poulose wrote:
> On 08/02/18 11:00, Christoffer Dall wrote:
> >On Tue, Jan 09, 2018 at 07:04:09PM +0000, Suzuki K Poulose wrote:
> >>Now that we can manage the stage2 page table per VM, switch the
> >>configuration details to per VM instance. We keep track of the
> >>IPA bits, number of page table levels and the VTCR bits (which
> >>depends on the IPA and the number of levels).
> >>
> >>Cc: Marc Zyngier <[email protected]>
> >>Cc: Christoffer Dall <[email protected]>
> >>Signed-off-by: Suzuki K Poulose <[email protected]>
> >>---
> >> arch/arm/include/asm/kvm_mmu.h | 1 +
> >> arch/arm64/include/asm/kvm_host.h | 12 ++++++++++++
> >> arch/arm64/include/asm/kvm_mmu.h | 22 ++++++++++++++++++++--
> >> arch/arm64/include/asm/stage2_pgtable.h | 1 -
> >> arch/arm64/kvm/hyp/switch.c | 3 +--
> >> virt/kvm/arm/arm.c | 2 +-
> >> 6 files changed, 35 insertions(+), 6 deletions(-)
> >>
> >>diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> >>index 440c80589453..dd592fe45660 100644
> >>--- a/arch/arm/include/asm/kvm_mmu.h
> >>+++ b/arch/arm/include/asm/kvm_mmu.h
> >>@@ -48,6 +48,7 @@
> >> #define kvm_vttbr_baddr_mask(kvm) VTTBR_BADDR_MASK
> >> #define stage2_pgd_size(kvm) (PTRS_PER_S2_PGD * sizeof(pgd_t))
> >>+#define kvm_init_stage2_config(kvm) do { } while (0)
> >> int create_hyp_mappings(void *from, void *to, pgprot_t prot);
> >> int create_hyp_io_mappings(void *from, void *to, phys_addr_t);
> >> void free_hyp_pgds(void);
> >>diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> >>index 9a9ddeb33c84..1e66e5ab3dde 100644
> >>--- a/arch/arm64/include/asm/kvm_host.h
> >>+++ b/arch/arm64/include/asm/kvm_host.h
> >>@@ -64,6 +64,18 @@ struct kvm_arch {
> >> /* VTTBR value associated with above pgd and vmid */
> >> u64 vttbr;
> >>+ /* Private bits of VTCR_EL2 for this VM */
> >>+ u64 vtcr_private;
> >
> >As to my comments in the previous patch, why isn't this simply u64 vtcr;
>
> nit: I haven't received your response to the previous patch.
It got stuck in my drafts folder somehow, hopefully you received it now.
>
> We could. I thought this gives a bit more clarity on what changes per-VM.
>
Since there's a performance issue involved, I think it's cleaner to just
calculate the vtcr once per VM and store it.
Thanks,
-Christoffer
On Thu, Feb 08, 2018 at 05:53:17PM +0000, Suzuki K Poulose wrote:
> On 08/02/18 11:14, Christoffer Dall wrote:
> >On Tue, Jan 09, 2018 at 07:04:10PM +0000, Suzuki K Poulose wrote:
> >>Allow the guests to choose a larger physical address space size.
> >>The default and minimum size is 40bits. A guest can change this
> >>right after the VM creation, but before the stage2 entry page
> >>tables are allocated (i.e, before it registers a memory range
> >>or maps a device address). The size is restricted to the maximum
> >>supported by the host. Also, the guest can only increase the PA size,
> >>from the existing value, as reducing it could break the devices which
> >>may have verified their physical address for validity and may do a
> >>lazy mapping(e.g, VGIC).
> >>
> >>Cc: Marc Zyngier <[email protected]>
> >>Cc: Christoffer Dall <[email protected]>
> >>Cc: Peter Maydell <[email protected]>
> >>Signed-off-by: Suzuki K Poulose <[email protected]>
> >>---
> >> Documentation/virtual/kvm/api.txt | 27 ++++++++++++++++++++++++++
> >> arch/arm/include/asm/kvm_host.h | 7 +++++++
> >> arch/arm64/include/asm/kvm_host.h | 1 +
> >> arch/arm64/include/asm/kvm_mmu.h | 41 ++++++++++++++++++++++++++++++---------
> >> arch/arm64/kvm/reset.c | 28 ++++++++++++++++++++++++++
> >> include/uapi/linux/kvm.h | 4 ++++
> >> virt/kvm/arm/arm.c | 2 +-
> >> 7 files changed, 100 insertions(+), 10 deletions(-)
> >>
> >>diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> >>index 57d3ee9e4bde..a203faf768c4 100644
> >>--- a/Documentation/virtual/kvm/api.txt
> >>+++ b/Documentation/virtual/kvm/api.txt
> >>@@ -3403,6 +3403,33 @@ invalid, if invalid pages are written to (e.g. after the end of memory)
> >> or if no page table is present for the addresses (e.g. when using
> >> hugepages).
> >>+4.109 KVM_ARM_GET_PHYS_SHIFT
> >>+
> >>+Capability: KVM_CAP_ARM_CONFIG_PHYS_SHIFT
> >>+Architectures: arm64
> >>+Type: vm ioctl
> >>+Parameters: __u32 (out)
> >>+Returns: 0 on success, a negative value on error
> >>+
> >>+This ioctl is used to get the current maximum physical address size for
> >>+the VM. The value is Log2(Maximum_Physical_Address). This is neither the
> >>+ amount of physical memory assigned to the VM nor the maximum physical address
> >>+that a real CPU on the host can handle. Rather, this is the upper limit of the
> >>+guest physical address that can be used by the VM.
> >
> >What is the point of this? Presumably if userspace has set the size, it
> >already knows the size?
>
> This can help userspace know what the "default" limit is. As such I am
> not particular about keeping this around.
>
Userspace has to already know, since otherwise things don't work today,
so I think we can omit this.
> >
> >>+
> >>+4.109 KVM_ARM_SET_PHYS_SHIFT
> >>+
> >>+Capability: KVM_CAP_ARM_CONFIG_PHYS_SHIFT
> >>+Architectures: arm64
> >>+Type: vm ioctl
> >>+Parameters: __u32 (in)
> >>+Returns: 0 on success, a negative value on error
> >>+
> >>+This ioctl is used to set the maximum physical address size for
> >>+the VM. The value is Log2(Maximum_Physical_Address). The value can only
> >>+be increased from the existing setting. The value cannot be changed
> >>+after the stage-2 page tables are allocated and will return an error.
> >>+
> >
> >Is there a way for userspace to discover what the underlying hardware
> >can actually support, beyond trial-and-error on this ioctl?
>
> Unfortunately, there is none. We don't expose ID_AA64MMFR0 via mrs emulation.
>
We should probably think about that. Perhaps it could be tied to the
return value of KVM_CAP_ARM_CONFIG_PHYS_SHIFT ?
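If it were wired up that way, userspace discovery could look something like the
sketch below. This only illustrates the suggestion; having KVM_CHECK_EXTENSION
return the limit for KVM_CAP_ARM_CONFIG_PHYS_SHIFT is not part of the posted
series:

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Returns the maximum supported IPA shift, or <= 0 if unsupported. */
	static int max_ipa_shift(int kvm_fd)
	{
		return ioctl(kvm_fd, KVM_CHECK_EXTENSION,
			     KVM_CAP_ARM_CONFIG_PHYS_SHIFT);
	}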
> >>+static inline int kvm_reconfig_stage2(struct kvm *kvm, u32 phys_shift)
> >>+{
> >>+ int rc = 0;
> >>+ unsigned int pa_max, parange;
> >>+
> >>+ parange = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1) & 7;
> >>+ pa_max = id_aa64mmfr0_parange_to_phys_shift(parange);
> >>+ /* Raise it to 40bits for backward compatibility */
> >>+ pa_max = (pa_max < 40) ? 40 : pa_max;
> >>+ /* Make sure the size is supported/available */
> >>+ if (phys_shift > PHYS_MASK_SHIFT || phys_shift > pa_max)
> >>+ return -EINVAL;
> >>+ /*
> >>+ * The stage2 PGD is dependent on the settings we initialise here
> >>+ * and should be allocated only after this step. We cannot allow
> >>+ * down sizing the IPA size as there could be devices or memory
> >>+ * regions, that depend on the previous size.
> >>+ */
> >>+ mutex_lock(&kvm->slots_lock);
> >>+ if (kvm->arch.pgd || phys_shift < kvm->arch.phys_shift) {
> >>+ rc = -EPERM;
> >>+ } else if (phys_shift > kvm->arch.phys_shift) {
> >>+ kvm->arch.phys_shift = phys_shift;
> >>+ kvm->arch.s2_levels = stage2_pt_levels(kvm->arch.phys_shift);
> >>+ kvm->arch.vtcr_private = VTCR_EL2_SL0(kvm->arch.s2_levels) |
> >>+ TCR_T0SZ(kvm->arch.phys_shift);
> >>+ }
> >
> >I think you can rework the above to make it more obvious what's going on
> >in this way:
> >
> > rc = -EPERM;
> > if (kvm->arch.pgd || phys_shift < kvm->arch.phys_shift)
> > goto out_unlock;
> >
> > rc = 0;
> > if (phys_shift == kvm->arch.phys_shift)
> > goto out_unlock;
> >
> > kvm->arch.phys_shift = phys_shift;
> > kvm->arch.s2_levels = stage2_pt_levels(kvm->arch.phys_shift);
> > kvm->arch.vtcr_private = VTCR_EL2_SL0(kvm->arch.s2_levels) |
> > TCR_T0SZ(kvm->arch.phys_shift);
> >
> >out_unlock:
> >
>
> Sure.
>
>
>
> >>--- a/virt/kvm/arm/arm.c
> >>+++ b/virt/kvm/arm/arm.c
> >>@@ -1136,7 +1136,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
> >> return 0;
> >> }
> >> default:
> >>- return -EINVAL;
> >>+ return kvm_arch_dev_vm_ioctl(kvm, ioctl, arg);
> >> }
> >> }
> >>--
> >>2.13.6
> >>
> >
> >Have you considered making this capability a generic capability and
> >encoding this in the 'type' argument to KVM_CREATE_VM? This would
> >significantly simplify all the above and would allow you to drop patch 8
> >and 9 I think.
>
> No. I will take a look. Btw, there were patches flying around to support
> "userspace" requesting specific values for ID feature registers. But even that
> doesn't help with the detection part. Maybe that is another way to configure
> the size, but I'm not sure about the current status of that work.
>
It's a bit stranded. Drew was driving this work (on cc). But the ID
register exposed to the guest should represent the actual limits
of the VM, so I don't think we need userspace to configure this, but we
can implement this in KVM based on the PA range configured for the VM.
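Roughly, the idea would be something along the lines of the sketch below, where
the guest-visible PARange is derived from the per-VM limit rather than being
configured by userspace (phys_shift_to_parange() is a hypothetical helper for
the example; the series only adds the reverse conversion):

	/* Sketch: clamp the PARange the guest sees to the VM's IPA limit. */
	static u8 guest_parange(struct kvm *kvm, u8 host_parange)
	{
		u8 vm_parange = phys_shift_to_parange(kvm->arch.phys_shift);

		return min(vm_parange, host_parange);
	}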
Thanks,
-Christoffer
On Fri, Feb 09, 2018 at 09:16:06AM +0100, Christoffer Dall wrote:
> On Thu, Feb 08, 2018 at 05:53:17PM +0000, Suzuki K Poulose wrote:
> > On 08/02/18 11:14, Christoffer Dall wrote:
> > >On Tue, Jan 09, 2018 at 07:04:10PM +0000, Suzuki K Poulose wrote:
> > >>Allow the guests to choose a larger physical address space size.
> > >>The default and minimum size is 40bits. A guest can change this
> > >>right after the VM creation, but before the stage2 entry page
> > >>tables are allocated (i.e, before it registers a memory range
> > >>or maps a device address). The size is restricted to the maximum
> > >>supported by the host. Also, the guest can only increase the PA size,
> > >>from the existing value, as reducing it could break the devices which
> > >>may have verified their physical address for validity and may do a
> > >>lazy mapping(e.g, VGIC).
> > >>
> > >>Cc: Marc Zyngier <[email protected]>
> > >>Cc: Christoffer Dall <[email protected]>
> > >>Cc: Peter Maydell <[email protected]>
> > >>Signed-off-by: Suzuki K Poulose <[email protected]>
> > >>---
> > >> Documentation/virtual/kvm/api.txt | 27 ++++++++++++++++++++++++++
> > >> arch/arm/include/asm/kvm_host.h | 7 +++++++
> > >> arch/arm64/include/asm/kvm_host.h | 1 +
> > >> arch/arm64/include/asm/kvm_mmu.h | 41 ++++++++++++++++++++++++++++++---------
> > >> arch/arm64/kvm/reset.c | 28 ++++++++++++++++++++++++++
> > >> include/uapi/linux/kvm.h | 4 ++++
> > >> virt/kvm/arm/arm.c | 2 +-
> > >> 7 files changed, 100 insertions(+), 10 deletions(-)
> > >>
> > >>diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > >>index 57d3ee9e4bde..a203faf768c4 100644
> > >>--- a/Documentation/virtual/kvm/api.txt
> > >>+++ b/Documentation/virtual/kvm/api.txt
> > >>@@ -3403,6 +3403,33 @@ invalid, if invalid pages are written to (e.g. after the end of memory)
> > >> or if no page table is present for the addresses (e.g. when using
> > >> hugepages).
> > >>+4.109 KVM_ARM_GET_PHYS_SHIFT
> > >>+
> > >>+Capability: KVM_CAP_ARM_CONFIG_PHYS_SHIFT
> > >>+Architectures: arm64
> > >>+Type: vm ioctl
> > >>+Parameters: __u32 (out)
> > >>+Returns: 0 on success, a negative value on error
> > >>+
> > >>+This ioctl is used to get the current maximum physical address size for
> > >>+the VM. The value is Log2(Maximum_Physical_Address). This is neither the
> > >>+ amount of physical memory assigned to the VM nor the maximum physical address
> > >>+that a real CPU on the host can handle. Rather, this is the upper limit of the
> > >>+guest physical address that can be used by the VM.
> > >
> > >What is the point of this? Presumably if userspace has set the size, it
> > >already knows the size?
> >
> > This can help userspace know what the "default" limit is. As such I am
> > not particular about keeping this around.
> >
>
> Userspace has to already know, since otherwise things don't work today,
> so I think we can omit this.
>
> > >
> > >>+
> > >>+4.109 KVM_ARM_SET_PHYS_SHIFT
> > >>+
> > >>+Capability: KVM_CAP_ARM_CONFIG_PHYS_SHIFT
> > >>+Architectures: arm64
> > >>+Type: vm ioctl
> > >>+Parameters: __u32 (in)
> > >>+Returns: 0 on success, a negative value on error
> > >>+
> > >>+This ioctl is used to set the maximum physical address size for
> > >>+the VM. The value is Log2(Maximum_Physical_Address). The value can only
> > >>+be increased from the existing setting. The value cannot be changed
> > >>+after the stage-2 page tables are allocated and will return an error.
> > >>+
> > >
> > >Is there a way for userspace to discover what the underlying hardware
> > >can actually support, beyond trial-and-error on this ioctl?
> >
> > Unfortunately, there is none. We don't expose ID_AA64MMFR0 via mrs emulation.
> >
>
> We should probably think about that. Perhaps it could be tied to the
> return value of KVM_CAP_ARM_CONFIG_PHYS_SHIFT ?
FWIW, that sounds good to me.
>
> > >>+static inline int kvm_reconfig_stage2(struct kvm *kvm, u32 phys_shift)
> > >>+{
> > >>+ int rc = 0;
> > >>+ unsigned int pa_max, parange;
> > >>+
> > >>+ parange = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1) & 7;
> > >>+ pa_max = id_aa64mmfr0_parange_to_phys_shift(parange);
> > >>+ /* Raise it to 40bits for backward compatibility */
> > >>+ pa_max = (pa_max < 40) ? 40 : pa_max;
> > >>+ /* Make sure the size is supported/available */
> > >>+ if (phys_shift > PHYS_MASK_SHIFT || phys_shift > pa_max)
> > >>+ return -EINVAL;
> > >>+ /*
> > >>+ * The stage2 PGD is dependent on the settings we initialise here
> > >>+ * and should be allocated only after this step. We cannot allow
> > >>+ * down sizing the IPA size as there could be devices or memory
> > >>+ * regions, that depend on the previous size.
> > >>+ */
> > >>+ mutex_lock(&kvm->slots_lock);
> > >>+ if (kvm->arch.pgd || phys_shift < kvm->arch.phys_shift) {
> > >>+ rc = -EPERM;
> > >>+ } else if (phys_shift > kvm->arch.phys_shift) {
> > >>+ kvm->arch.phys_shift = phys_shift;
> > >>+ kvm->arch.s2_levels = stage2_pt_levels(kvm->arch.phys_shift);
> > >>+ kvm->arch.vtcr_private = VTCR_EL2_SL0(kvm->arch.s2_levels) |
> > >>+ TCR_T0SZ(kvm->arch.phys_shift);
> > >>+ }
> > >
> > >I think you can rework the above to make it more obvious what's going on
> > >in this way:
> > >
> > > rc = -EPERM;
> > > if (kvm->arch.pgd || phys_shift < kvm->arch.phys_shift)
> > > goto out_unlock;
> > >
> > > rc = 0;
> > > if (phys_shift == kvm->arch.phys_shift)
> > > goto out_unlock;
> > >
> > > kvm->arch.phys_shift = phys_shift;
> > > kvm->arch.s2_levels = stage2_pt_levels(kvm->arch.phys_shift);
> > > kvm->arch.vtcr_private = VTCR_EL2_SL0(kvm->arch.s2_levels) |
> > > TCR_T0SZ(kvm->arch.phys_shift);
> > >
> > >out_unlock:
> > >
> >
> > Sure.
> >
> >
> >
> > >>--- a/virt/kvm/arm/arm.c
> > >>+++ b/virt/kvm/arm/arm.c
> > >>@@ -1136,7 +1136,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
> > >> return 0;
> > >> }
> > >> default:
> > >>- return -EINVAL;
> > >>+ return kvm_arch_dev_vm_ioctl(kvm, ioctl, arg);
> > >> }
> > >> }
> > >>--
> > >>2.13.6
> > >>
> > >
> > >Have you considered making this capability a generic capability and
> > >encoding this in the 'type' argument to KVM_CREATE_VM? This would
> > >significantly simplify all the above and would allow you to drop patch 8
> > >and 9 I think.
> >
> > No. I will take a look. Btw, there were patches flying around to support
> > "userspace" requesting specific values for ID feature registers. But even that
> > doesn't help with the detection part. Maybe that is another way to configure
> > the size, but I'm not sure about the current status of that work.
> >
>
> It's a bit stranded. Drew was driving this work (on cc). But the ID
> register exposed to the guest should represent the actual limits
> of the VM, so I don't think we need userspace to configure this, but we
> can implement this in KVM based on the PA range configured for the VM.
>
I heard there were some patches being worked on by someone at Arm, which
haven't been posted yet (obviously), but maybe that was just a rumor?
I was about to revisit this topic myself, at least to some degree, to
attempt to address PMU hiding. We really need to put some thought into
how best to give userspace control of the VM's ID registers in general,
within the constraints of the host. Anyway, I guess that should be done
in a separate thread, so I won't start brainstorming here now.
Thanks,
drew
On 09/02/18 08:16, Christoffer Dall wrote:
> On Thu, Feb 08, 2018 at 05:53:17PM +0000, Suzuki K Poulose wrote:
>> On 08/02/18 11:14, Christoffer Dall wrote:
>>> On Tue, Jan 09, 2018 at 07:04:10PM +0000, Suzuki K Poulose wrote:
>>>> Allow the guests to choose a larger physical address space size.
>>>> The default and minimum size is 40bits. A guest can change this
>>>> right after the VM creation, but before the stage2 entry page
>>>> tables are allocated (i.e, before it registers a memory range
>>>> or maps a device address). The size is restricted to the maximum
>>>> supported by the host. Also, the guest can only increase the PA size,
>>>> from the existing value, as reducing it could break the devices which
>>>> may have verified their physical address for validity and may do a
>>>> lazy mapping(e.g, VGIC).
>>>>
>>>> Cc: Marc Zyngier <[email protected]>
>>>> Cc: Christoffer Dall <[email protected]>
>>>> Cc: Peter Maydell <[email protected]>
>>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>>> +
>>>> +4.109 KVM_ARM_SET_PHYS_SHIFT
>>>> +
>>>> +Capability: KVM_CAP_ARM_CONFIG_PHYS_SHIFT
>>>> +Architectures: arm64
>>>> +Type: vm ioctl
>>>> +Parameters: __u32 (in)
>>>> +Returns: 0 on success, a negative value on error
>>>> +
>>>> +This ioctl is used to set the maximum physical address size for
>>>> +the VM. The value is Log2(Maximum_Physical_Address). The value can only
>>>> +be increased from the existing setting. The value cannot be changed
>>>> +after the stage-2 page tables are allocated and will return an error.
>>>> +
>>>
>>> Is there a way for userspace to discover what the underlying hardware
>>> can actually support, beyond trial-and-error on this ioctl?
>>
>> Unfortunately, there is none. We don't expose ID_AA64MMFR0 via mrs emulation.
>>
>
> We should probably think about that. Perhaps it could be tied to the
> return value of KVM_CAP_ARM_CONFIG_PHYS_SHIFT ?
See below.
>>>
>>> Have you considered making this capability a generic capability and
>>> encoding this in the 'type' argument to KVM_CREATE_VM? This would
>>> significantly simplify all the above and would allow you to drop patch 8
>>> and 9 I think.
>> No. I will take a look.
We could add a KVM dev capability hooked to kvm_arch_dev_ioctl() for ARM
to report the maximum supported physical shift. The user could then request
the physical shift via the type argument to KVM_CREATE_VM (encoded, of course,
so as to allow future uses).
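For example, something along the lines of the sketch below; the macro names and
the 8-bit encoding are only illustrative assumptions, not a concrete proposal
from the series:

	/* uapi sketch: reserve the low byte of 'type' for the IPA shift. */
	#define KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK	0xffULL
	#define KVM_VM_TYPE_ARM_PHYS_SHIFT(x)	((x) & KVM_VM_TYPE_ARM_PHYS_SHIFT_MASK)

	/* Userspace: ask for a 48-bit IPA space at VM creation time. */
	int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_VM_TYPE_ARM_PHYS_SHIFT(48));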
Cheers
Suzuki
Hi Christoffer,
On 08/02/18 18:04, Christoffer Dall wrote:
> On Tue, Jan 09, 2018 at 07:04:08PM +0000, Suzuki K Poulose wrote:
>> We set VTCR_EL2 very early during the stage2 init and don't
>> touch it ever. This is fine as we had a fixed IPA size. This
>> patch changes the behavior to set the VTCR for a given VM,
>> depending on its stage2 table. The common configuration for
>> VTCR is still performed during the early init. But the SL0
>> and T0SZ are programmed for each VM and is cleared once we
>> exit the VM.
>>
>> Cc: Marc Zyngier <[email protected]>
>> Cc: Christoffer Dall <[email protected]>
>> Signed-off-by: Suzuki K Poulose <[email protected]>
>> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
>> index f7c651f3a8c0..523471f0af7b 100644
>> --- a/arch/arm64/kvm/hyp/switch.c
>> +++ b/arch/arm64/kvm/hyp/switch.c
>> @@ -157,11 +157,20 @@ static void __hyp_text __deactivate_traps(struct kvm_vcpu *vcpu)
>> static void __hyp_text __activate_vm(struct kvm_vcpu *vcpu)
>> {
>> struct kvm *kvm = kern_hyp_va(vcpu->kvm);
>> + u64 vtcr = read_sysreg(vtcr_el2);
>> +
>> + vtcr &= ~VTCR_EL2_PRIVATE_MASK;
>> + vtcr |= VTCR_EL2_SL0(stage2_pt_levels(kvm)) |
>> + VTCR_EL2_T0SZ(kvm_phys_shift(kvm));
>> + write_sysreg(vtcr, vtcr_el2);
>
> If we're writing VTCR_EL2 on each entry, do we really need to read the
> value back first and calculate things on every entry to the VM? It
> seems to me we should be able to compute the vtcr_el2 and store it on
> struct kvm, and simply restore that per-VM value upon entering the VM?
I took a look at this, and we do need the read-back to make sure we retain the
Hardware update of Access flags for stage2 (VTCR_EL2_HA) bit on the CPUs
that have it, so that it is safe to run a mix of CPUs with and without the feature.
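In other words, something like the sketch below, where only the per-VM fields
are replaced on entry and whatever the current CPU initialised (including the
HA bit, where present) is kept. vtcr_private is the per-VM value from the
earlier patch; the rest is purely illustrative:

	u64 vtcr = read_sysreg(vtcr_el2);	/* per-CPU value, HA may be set */

	vtcr &= ~VTCR_EL2_PRIVATE_MASK;		/* clear only the per-VM SL0/T0SZ bits */
	vtcr |= kvm->arch.vtcr_private;		/* this VM's SL0 + T0SZ */
	write_sysreg(vtcr, vtcr_el2);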
>> write_sysreg(kvm->arch.vttbr, vttbr_el2);
>> }
>>
>> static void __hyp_text __deactivate_vm(struct kvm_vcpu *vcpu)
>> {
>> + u64 vtcr = read_sysreg(vtcr_el2) & ~VTCR_EL2_PRIVATE_MASK;
>> +
>> + write_sysreg(vtcr, vtcr_el2);
>
> Why do we need to care about restoring VTCR when returning to the host?
Yes, this can be skipped.
Cheers
Suzuki