2016-04-04 16:26:48

by Suzuki K Poulose

Subject: [PATCH 00/17] kvm-arm: Add stage2 page table walker

This series adds stage2 page table helpers and makes the core kvm-arm
MMU code use them. At the moment we assume that the host/hyp and the
stage2 page tables have the same number of levels, and hence use the
host level accessors (except for some hooks, e.g kvm_p.d_addr_end) and
share the routines for unmapping page table ranges.

On arm32, the only change w.r.t the page tables is dealing
with > 32bit physical addresses.

However, on arm64 the hardware supports concatenation of up to 16 tables
at the entry level, which could affect:
1) the number of entries in the PGD table (up to 16 * PTRS_PER_PTE)
2) the number of page table levels (which may be reduced).

Also, depending on the VA_BITS for the host kernel, the number of page table
levels for the host and for stage2 (40bit IPA) could differ. At present, we
insert up to one fake software page table level (not known to the hardware,
and only used by the OS to walk the table) to bring the number of stage2
levels up to that of the host/hyp table. However, with 16K pages, a 48bit VA
and a 40bit IPA, we could end up with 2 fake levels, which complicates the code.
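
To make that concrete, a rough sketch of the level arithmetic (in the style
of the ARM64_HW_PGTABLE_LEVELS() calculation used in patch 15; illustrative
only, not new definitions):

  stage2 levels, 16K pages, 40bit IPA (36 bits after concatenation):
      (36 - 4) / (14 - 3) = 2
  host/hyp levels, 16K pages, 48bit VA:
      (48 - 4) / (14 - 3) = 4

i.e. two levels would have to be faked to bring stage2 up to the host geometry.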

This series introduces explicit stage2 page table helpers and also defines
a separate set of routines for unmapping hyp and stage2 tables.

On arm64, the stage2 page table helpers are defined based on the number of
levels required to map the IPA bits. See patch 15 for more details.

Tested on TC2 (arm32), Fast models (with VHE) and real hardware.

Changes since RFC:
* Rebased to rc2
* Use explicit routines for modifying the hyp/stage2 page tables
* Add pmd_thp_or_huge() for arm64 and use that for KVM
* Reuse TCR_EL definitions

Suzuki K Poulose (17):
arm64: Reuse TCR field definitions for EL1 and EL2
arm64: Cleanup VTCR_EL2 and VTTBR field values
kvm arm: Move fake PGD handling to arch specific files
arm64: Introduce pmd_thp_or_huge
kvm-arm: Replace kvm_pmd_huge with pmd_thp_or_huge
kvm-arm: Remove kvm_pud_huge()
kvm-arm: arm32: Introduce stage2 page table helpers
kvm-arm: arm: Introduce hyp page table empty checks
kvm-arm: arm64: Introduce stage2 page table helpers
kvm-arm: arm64: Introduce hyp page table empty checks
kvm-arm: Use explicit stage2 helper routines
kvm-arm: Add explicit hyp page table modifiers
kvm-arm: Add stage2 page table modifiers
kvm-arm: Cleanup kvm_* wrappers
kvm: arm64: Get rid of fake page table levels
kvm-arm: Cleanup stage2 pgd handling
arm64: kvm: Add support for 16K pages

arch/arm/include/asm/kvm_mmu.h | 35 +--
arch/arm/include/asm/stage2_pgtable.h | 59 +++++
arch/arm/kvm/arm.c | 2 +-
arch/arm/kvm/mmu.c | 354 ++++++++++++++-----------
arch/arm64/include/asm/kvm_arm.h | 82 +++---
arch/arm64/include/asm/kvm_mmu.h | 84 +-----
arch/arm64/include/asm/pgtable-hwdef.h | 80 ++++--
arch/arm64/include/asm/pgtable.h | 2 +
arch/arm64/include/asm/stage2_pgtable-nopmd.h | 42 +++
arch/arm64/include/asm/stage2_pgtable-nopud.h | 39 +++
arch/arm64/include/asm/stage2_pgtable.h | 133 ++++++++++
arch/arm64/kvm/Kconfig | 1 -
12 files changed, 599 insertions(+), 314 deletions(-)
create mode 100644 arch/arm/include/asm/stage2_pgtable.h
create mode 100644 arch/arm64/include/asm/stage2_pgtable-nopmd.h
create mode 100644 arch/arm64/include/asm/stage2_pgtable-nopud.h
create mode 100644 arch/arm64/include/asm/stage2_pgtable.h

--
1.7.9.5


2016-04-04 16:26:55

by Suzuki K Poulose

Subject: [PATCH 04/17] arm64: Introduce pmd_thp_or_huge

Add a helper to determine if a given pmd represents a huge page,
mapped either via hugetlb or THP, as we already have on arm. This
will be used by the KVM MMU code.
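
As an illustration, the kind of stage2 walker check this enables (a sketch
along the lines of stage2_flush_pmds() as updated later in this series, not
the exact final code):

	if (!pmd_none(*pmd)) {
		/* covers both hugetlbfs and transparent huge pages */
		if (pmd_thp_or_huge(*pmd))
			kvm_flush_dcache_pmd(*pmd);
		else
			stage2_flush_ptes(kvm, pmd, addr, next);
	}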

Suggested-by: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Steve Capper <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 989fef1..dda4aa9 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -290,6 +290,8 @@ static inline pgprot_t mk_sect_prot(pgprot_t prot)
#define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
#define pmd_mknotpresent(pmd) (__pmd(pmd_val(pmd) & ~PMD_TYPE_MASK))

+#define pmd_thp_or_huge(pmd) (pmd_huge(pmd) || pmd_trans_huge(pmd))
+
#define __HAVE_ARCH_PMD_WRITE
#define pmd_write(pmd) pte_write(pmd_pte(pmd))

--
1.7.9.5

2016-04-04 16:26:52

by Suzuki K Poulose

Subject: [PATCH 02/17] arm64: Cleanup VTCR_EL2 and VTTBR field values

Most of the VTCR_EL2 bits are shared across the different page sizes,
except for the TG0 value and the entry level (SL0) value. This patch
makes the definitions a bit cleaner to reflect this fact.

Also cleans up the VTTBR_X calculation. No functional changes.
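
The net result (as in the hunk below) is that only the granule specific part
varies, and the final values are simply composed as:

	#define VTCR_EL2_FLAGS	(VTCR_EL2_TGRAN_FLAGS | VTCR_EL2_COMMON_BITS)
	#define VTTBR_X		(VTTBR_X_TGRAN_MAGIC - VTCR_EL2_T0SZ_IPA)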

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm64/include/asm/kvm_arm.h | 23 +++++++++++------------
1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index c460cfe..d5d5fdf 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -144,32 +144,31 @@
* The magic numbers used for VTTBR_X in this patch can be found in Tables
* D4-23 and D4-25 in ARM DDI 0487A.b.
*/
+#define VTCR_EL2_T0SZ_IPA VTCR_EL2_T0SZ_40B
+#define VTCR_EL2_COMMON_BITS (VTCR_EL2_SH0_INNER | VTCR_EL2_ORGN0_WBWA | \
+ VTCR_EL2_IRGN0_WBWA | VTCR_EL2_SL0_LVL1 | \
+ VTCR_EL2_RES1 | VTCR_EL2_T0SZ_IPA)
#ifdef CONFIG_ARM64_64K_PAGES
/*
* Stage2 translation configuration:
- * 40bits input (T0SZ = 24)
* 64kB pages (TG0 = 1)
* 2 level page tables (SL = 1)
*/
-#define VTCR_EL2_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SH0_INNER | \
- VTCR_EL2_ORGN0_WBWA | VTCR_EL2_IRGN0_WBWA | \
- VTCR_EL2_SL0_LVL1 | VTCR_EL2_T0SZ_40B | \
- VTCR_EL2_RES1)
-#define VTTBR_X (38 - VTCR_EL2_T0SZ_40B)
+#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
+#define VTTBR_X_TGRAN_MAGIC 38
#else
/*
* Stage2 translation configuration:
- * 40bits input (T0SZ = 24)
* 4kB pages (TG0 = 0)
* 3 level page tables (SL = 1)
*/
-#define VTCR_EL2_FLAGS (VTCR_EL2_TG0_4K | VTCR_EL2_SH0_INNER | \
- VTCR_EL2_ORGN0_WBWA | VTCR_EL2_IRGN0_WBWA | \
- VTCR_EL2_SL0_LVL1 | VTCR_EL2_T0SZ_40B | \
- VTCR_EL2_RES1)
-#define VTTBR_X (37 - VTCR_EL2_T0SZ_40B)
+#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_4K | VTCR_EL2_SL0_LVL1)
+#define VTTBR_X_TGRAN_MAGIC 37
#endif

+#define VTCR_EL2_FLAGS (VTCR_EL2_TGRAN_FLAGS | VTCR_EL2_COMMON_BITS)
+#define VTTBR_X (VTTBR_X_TGRAN_MAGIC - VTCR_EL2_T0SZ_IPA)
+
#define VTTBR_BADDR_SHIFT (VTTBR_X - 1)
#define VTTBR_BADDR_MASK (((UL(1) << (PHYS_MASK_SHIFT - VTTBR_X)) - 1) << VTTBR_BADDR_SHIFT)
#define VTTBR_VMID_SHIFT (UL(48))
--
1.7.9.5

2016-04-04 16:27:00

by Suzuki K Poulose

Subject: [PATCH 07/17] kvm-arm: arm32: Introduce stage2 page table helpers

Define the page table helpers for walking the stage2 page table
on arm. Since both hyp and stage2 have the same number of levels
as the host, we reuse the host helpers.

The exceptions are the p.d_addr_end routines, which have to deal
with IPA > 32bit; hence we use open coded versions of their host
helpers that support 64bit addresses.

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm/include/asm/kvm_mmu.h | 1 +
arch/arm/include/asm/stage2_pgtable.h | 59 +++++++++++++++++++++++++++++++++
2 files changed, 60 insertions(+)
create mode 100644 arch/arm/include/asm/stage2_pgtable.h

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index c2b2b27..7d207b4 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -47,6 +47,7 @@
#include <linux/highmem.h>
#include <asm/cacheflush.h>
#include <asm/pgalloc.h>
+#include <asm/stage2_pgtable.h>

int create_hyp_mappings(void *from, void *to);
int create_hyp_io_mappings(void *from, void *to, phys_addr_t);
diff --git a/arch/arm/include/asm/stage2_pgtable.h b/arch/arm/include/asm/stage2_pgtable.h
new file mode 100644
index 0000000..7633b0a
--- /dev/null
+++ b/arch/arm/include/asm/stage2_pgtable.h
@@ -0,0 +1,59 @@
+/*
+ * Copyright (C) 2016 - ARM Ltd
+ *
+ * stage2 page table helpers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef __ARM_S2_PGTABLE_H_
+#define __ARM_S2_PGTABLE_H_
+
+#define stage2_pgd_none(pgd) pgd_none(pgd)
+#define stage2_pgd_clear(pgd) pgd_clear(pgd)
+#define stage2_pgd_present(pgd) pgd_present(pgd)
+#define stage2_pgd_populate(mm, pgd, pud) pgd_populate(mm, pgd, pud)
+#define stage2_pud_offset(pgd, address) pud_offset(pgd, address)
+#define stage2_pud_free(mm, pud) pud_free(mm, pud)
+
+#define stage2_pud_none(pud) pud_none(pud)
+#define stage2_pud_clear(pud) pud_clear(pud)
+#define stage2_pud_present(pud) pud_present(pud)
+#define stage2_pud_populate(mm, pud, pmd) pud_populate(mm, pud, pmd)
+#define stage2_pmd_offset(pud, address) pmd_offset(pud, address)
+#define stage2_pmd_free(mm, pmd) pmd_free(mm, pmd)
+
+#define stage2_pud_huge(pud) pud_huge(pud)
+
+/* Open coded p*d_addr_end that can deal with 64bit addresses */
+static inline phys_addr_t stage2_pgd_addr_end(phys_addr_t addr, phys_addr_t end)
+{
+ phys_addr_t boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
+ return (boundary - 1 < end - 1) ? boundary : end;
+}
+
+#define stage2_pud_addr_end(addr, end) (end)
+
+static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
+{
+ phys_addr_t boundary = (addr + PMD_SIZE) & PMD_MASK;
+ return (boundary - 1 < end - 1) ? boundary : end;
+}
+
+#define stage2_pgd_index(addr) pgd_index(addr)
+
+#define stage2_pte_table_empty(ptep) kvm_page_empty(ptep)
+#define stage2_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
+#define stage2_pud_table_empty(pudp) 0
+
+#endif /* __ARM_S2_PGTABLE_H_ */
--
1.7.9.5

2016-04-04 16:27:11

by Suzuki K Poulose

Subject: [PATCH 11/17] kvm-arm: Use explicit stage2 helper routines

We have stage2 page table helpers for both arm and arm64. Switch to
the stage2 helpers for routines that only deal with the stage2 page table.

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm/kvm/mmu.c | 48 ++++++++++++++++++++++++------------------------
1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index d0c0ee9..b46a337 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -319,9 +319,9 @@ static void stage2_flush_pmds(struct kvm *kvm, pud_t *pud,
pmd_t *pmd;
phys_addr_t next;

- pmd = pmd_offset(pud, addr);
+ pmd = stage2_pmd_offset(pud, addr);
do {
- next = kvm_pmd_addr_end(addr, end);
+ next = stage2_pmd_addr_end(addr, end);
if (!pmd_none(*pmd)) {
if (pmd_thp_or_huge(*pmd))
kvm_flush_dcache_pmd(*pmd);
@@ -337,11 +337,11 @@ static void stage2_flush_puds(struct kvm *kvm, pgd_t *pgd,
pud_t *pud;
phys_addr_t next;

- pud = pud_offset(pgd, addr);
+ pud = stage2_pud_offset(pgd, addr);
do {
- next = kvm_pud_addr_end(addr, end);
- if (!pud_none(*pud)) {
- if (pud_huge(*pud))
+ next = stage2_pud_addr_end(addr, end);
+ if (!stage2_pud_none(*pud)) {
+ if (stage2_pud_huge(*pud))
kvm_flush_dcache_pud(*pud);
else
stage2_flush_pmds(kvm, pud, addr, next);
@@ -357,9 +357,9 @@ static void stage2_flush_memslot(struct kvm *kvm,
phys_addr_t next;
pgd_t *pgd;

- pgd = kvm->arch.pgd + kvm_pgd_index(addr);
+ pgd = kvm->arch.pgd + stage2_pgd_index(addr);
do {
- next = kvm_pgd_addr_end(addr, end);
+ next = stage2_pgd_addr_end(addr, end);
stage2_flush_puds(kvm, pgd, addr, next);
} while (pgd++, addr = next, addr != end);
}
@@ -807,16 +807,16 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
pgd_t *pgd;
pud_t *pud;

- pgd = kvm->arch.pgd + kvm_pgd_index(addr);
- if (WARN_ON(pgd_none(*pgd))) {
+ pgd = kvm->arch.pgd + stage2_pgd_index(addr);
+ if (WARN_ON(stage2_pgd_none(*pgd))) {
if (!cache)
return NULL;
pud = mmu_memory_cache_alloc(cache);
- pgd_populate(NULL, pgd, pud);
+ stage2_pgd_populate(NULL, pgd, pud);
get_page(virt_to_page(pgd));
}

- return pud_offset(pgd, addr);
+ return stage2_pud_offset(pgd, addr);
}

static pmd_t *stage2_get_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
@@ -826,15 +826,15 @@ static pmd_t *stage2_get_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
pmd_t *pmd;

pud = stage2_get_pud(kvm, cache, addr);
- if (pud_none(*pud)) {
+ if (stage2_pud_none(*pud)) {
if (!cache)
return NULL;
pmd = mmu_memory_cache_alloc(cache);
- pud_populate(NULL, pud, pmd);
+ stage2_pud_populate(NULL, pud, pmd);
get_page(virt_to_page(pud));
}

- return pmd_offset(pud, addr);
+ return stage2_pmd_offset(pud, addr);
}

static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
@@ -1042,10 +1042,10 @@ static void stage2_wp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
pmd_t *pmd;
phys_addr_t next;

- pmd = pmd_offset(pud, addr);
+ pmd = stage2_pmd_offset(pud, addr);

do {
- next = kvm_pmd_addr_end(addr, end);
+ next = stage2_pmd_addr_end(addr, end);
if (!pmd_none(*pmd)) {
if (pmd_thp_or_huge(*pmd)) {
if (!kvm_s2pmd_readonly(pmd))
@@ -1070,12 +1070,12 @@ static void stage2_wp_puds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
pud_t *pud;
phys_addr_t next;

- pud = pud_offset(pgd, addr);
+ pud = stage2_pud_offset(pgd, addr);
do {
- next = kvm_pud_addr_end(addr, end);
- if (!pud_none(*pud)) {
+ next = stage2_pud_addr_end(addr, end);
+ if (!stage2_pud_none(*pud)) {
/* TODO:PUD not supported, revisit later if supported */
- BUG_ON(pud_huge(*pud));
+ BUG_ON(stage2_pud_huge(*pud));
stage2_wp_pmds(pud, addr, next);
}
} while (pud++, addr = next, addr != end);
@@ -1092,7 +1092,7 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
pgd_t *pgd;
phys_addr_t next;

- pgd = kvm->arch.pgd + kvm_pgd_index(addr);
+ pgd = kvm->arch.pgd + stage2_pgd_index(addr);
do {
/*
* Release kvm_mmu_lock periodically if the memory region is
@@ -1104,8 +1104,8 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
if (need_resched() || spin_needbreak(&kvm->mmu_lock))
cond_resched_lock(&kvm->mmu_lock);

- next = kvm_pgd_addr_end(addr, end);
- if (pgd_present(*pgd))
+ next = stage2_pgd_addr_end(addr, end);
+ if (stage2_pgd_present(*pgd))
stage2_wp_puds(pgd, addr, next);
} while (pgd++, addr = next, addr != end);
}
--
1.7.9.5

2016-04-04 16:27:15

by Suzuki K Poulose

Subject: [PATCH 15/17] kvm: arm64: Get rid of fake page table levels

On arm64, the hardware mandates concatenation of up to 16 tables
at the entry level for stage2 translations. This can result in fewer
translation levels than the normal (stage1) table. Also, since the
IPA (40bit) is smaller than some of the supported VA_BITS values
(e.g. 48bit), stage-1 and stage-2 tables can have a different number
of levels. To work around this, so far we have been using a fake
software page table level, not known to the hardware. But with 16K
translations, there could be up to 2 fake software levels, which
complicates the code. Hence, we want to get rid of the hack.

Now that we have explicit accessors for hyp vs stage2 page tables,
define the stage2 walker helpers according to the actual table
layout used by the hardware.

Once we know the number of translation levels used by the hardware,
defining the helpers is simply a matter of checking whether a
particular level is folded or not.

Some facts before we calculate the translation levels:

1) The smallest page size supported by arm64 is 4K.
2) The number of bits resolved at any intermediate page table level
   is (PAGE_SHIFT - 3).

Together, these imply that the minimum number of bits required for a
level change is 9.

Since we can concatenate up to 16 tables at the stage2 entry level, the
total number of page table levels used by the hardware for resolving N
bits is the same as that for (N - 4) bits (with concatenation), as there
cannot be an extra level in between N and (N - 4) given the rules above.

Hence, we have

STAGE2_PGTABLE_LEVELS = PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4)

With the current IPA limit (40bit), for all supported translations
and VA_BITS, we have the following condition (even for 36bit VA with
16K page size):

CONFIG_PGTABLE_LEVELS >= STAGE2_PGTABLE_LEVELS.

So, for example, if the PUD level is present in stage2, it is also present
in the hyp (host) table. Hence, we fall back to the host definition if we
find that a level is not folded; otherwise we redefine it accordingly.
A build time check is added to make sure the above condition holds.
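
To make the arithmetic concrete (a sketch assuming the
ARM64_HW_PGTABLE_LEVELS(bits) definition of ((bits) - 4) / (PAGE_SHIFT - 3)
and the 40bit IPA; illustrative only):

  STAGE2_PGTABLE_LEVELS = ARM64_HW_PGTABLE_LEVELS(40 - 4)
      4K  pages (PAGE_SHIFT = 12): (36 - 4) / 9  = 3 levels
      16K pages (PAGE_SHIFT = 14): (36 - 4) / 11 = 2 levels
      64K pages (PAGE_SHIFT = 16): (36 - 4) / 13 = 2 levels

In each case STAGE2_PGTABLE_LEVELS <= CONFIG_PGTABLE_LEVELS for the
supported VA_BITS, so the build time check holds.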

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm64/include/asm/kvm_mmu.h | 64 +-------------
arch/arm64/include/asm/stage2_pgtable-nopmd.h | 42 +++++++++
arch/arm64/include/asm/stage2_pgtable-nopud.h | 39 +++++++++
arch/arm64/include/asm/stage2_pgtable.h | 113 +++++++++++++++++--------
4 files changed, 163 insertions(+), 95 deletions(-)
create mode 100644 arch/arm64/include/asm/stage2_pgtable-nopmd.h
create mode 100644 arch/arm64/include/asm/stage2_pgtable-nopud.h

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index a3c0d05..e3fee0a 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -45,18 +45,6 @@
*/
#define TRAMPOLINE_VA (HYP_PAGE_OFFSET_MASK & PAGE_MASK)

-/*
- * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation
- * levels in addition to the PGD and potentially the PUD which are
- * pre-allocated (we pre-allocate the fake PGD and the PUD when the Stage-2
- * tables use one level of tables less than the kernel.
- */
-#ifdef CONFIG_ARM64_64K_PAGES
-#define KVM_MMU_CACHE_MIN_PAGES 1
-#else
-#define KVM_MMU_CACHE_MIN_PAGES 2
-#endif
-
#ifdef __ASSEMBLY__

#include <asm/alternative.h>
@@ -155,69 +143,21 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)

static inline void *kvm_get_hwpgd(struct kvm *kvm)
{
- pgd_t *pgd = kvm->arch.pgd;
- pud_t *pud;
-
- if (KVM_PREALLOC_LEVEL == 0)
- return pgd;
-
- pud = pud_offset(pgd, 0);
- if (KVM_PREALLOC_LEVEL == 1)
- return pud;
-
- BUG_ON(KVM_PREALLOC_LEVEL != 2);
- return pmd_offset(pud, 0);
+ return kvm->arch.pgd;
}

static inline unsigned int kvm_get_hwpgd_size(void)
{
- if (KVM_PREALLOC_LEVEL > 0)
- return PTRS_PER_S2_PGD * PAGE_SIZE;
return PTRS_PER_S2_PGD * sizeof(pgd_t);
}

-/*
- * Allocate fake pgd for the host kernel page table macros to work.
- * This is not used by the hardware and we have no alignment
- * requirement for this allocation.
- */
static inline pgd_t *kvm_setup_fake_pgd(pgd_t *hwpgd)
{
- int i;
- pgd_t *pgd;
-
- if (!KVM_PREALLOC_LEVEL)
- return hwpgd;
-
- /*
- * When KVM_PREALLOC_LEVEL==2, we allocate a single page for
- * the PMD and the kernel will use folded pud.
- * When KVM_PREALLOC_LEVEL==1, we allocate 2 consecutive PUD
- * pages.
- */
-
- pgd = kmalloc(PTRS_PER_S2_PGD * sizeof(pgd_t),
- GFP_KERNEL | __GFP_ZERO);
- if (!pgd)
- return ERR_PTR(-ENOMEM);
-
- /* Plug the HW PGD into the fake one. */
- for (i = 0; i < PTRS_PER_S2_PGD; i++) {
- if (KVM_PREALLOC_LEVEL == 1)
- pgd_populate(NULL, pgd + i,
- (pud_t *)hwpgd + i * PTRS_PER_PUD);
- else if (KVM_PREALLOC_LEVEL == 2)
- pud_populate(NULL, pud_offset(pgd, 0) + i,
- (pmd_t *)hwpgd + i * PTRS_PER_PMD);
- }
-
- return pgd;
+ return hwpgd;
}

static inline void kvm_free_fake_pgd(pgd_t *pgd)
{
- if (KVM_PREALLOC_LEVEL > 0)
- kfree(pgd);
}
static inline bool kvm_page_empty(void *ptr)
{
diff --git a/arch/arm64/include/asm/stage2_pgtable-nopmd.h b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
new file mode 100644
index 0000000..40dda8c
--- /dev/null
+++ b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
@@ -0,0 +1,42 @@
+/*
+ * Copyright (C) 2016 - ARM Ltd
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef __ARM64_S2_PGTABLE_NOPMD_H_
+#define __ARM64_S2_PGTABLE_NOPMD_H_
+
+#include <asm/stage2_pgtable-nopud.h>
+
+#define __S2_PGTABLE_PMD_FOLDED
+
+#define S2_PMD_SHIFT S2_PUD_SHIFT
+#define S2_PTRS_PER_PMD 1
+#define S2_PMD_SIZE (1UL << S2_PMD_SHIFT)
+#define S2_PMD_MASK (~(S2_PMD_SIZE-1))
+
+#define stage2_pud_none(pud) (0)
+#define stage2_pud_present(pud) (1)
+#define stage2_pud_clear(pud) do { } while (0)
+#define stage2_pud_populate(mm, pud, pmd) do { } while (0)
+#define stage2_pmd_offset(pud, address) ((pmd_t *)(pud))
+
+#define stage2_pmd_free(mm, pmd) do { } while (0)
+
+#define stage2_pmd_addr_end(addr, end) (end)
+
+#define stage2_pud_huge(pud) (0)
+#define stage2_pmd_table_empty(pmdp) (0)
+
+#endif
diff --git a/arch/arm64/include/asm/stage2_pgtable-nopud.h b/arch/arm64/include/asm/stage2_pgtable-nopud.h
new file mode 100644
index 0000000..b0eff9d
--- /dev/null
+++ b/arch/arm64/include/asm/stage2_pgtable-nopud.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (C) 2016 - ARM Ltd
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef __ARM64_S2_PGTABLE_NOPUD_H_
+#define __ARM64_S2_PGTABLE_NOPUD_H_
+
+#define __S2_PGTABLE_PUD_FOLDED
+
+#define S2_PUD_SHIFT S2_PGDIR_SHIFT
+#define S2_PTRS_PER_PUD 1
+#define S2_PUD_SIZE (_AC(1, UL) << S2_PUD_SHIFT)
+#define S2_PUD_MASK (~(S2_PUD_SIZE-1))
+
+#define stage2_pgd_none(pgd) (0)
+#define stage2_pgd_present(pgd) (1)
+#define stage2_pgd_clear(pgd) do { } while (0)
+#define stage2_pgd_populate(mm, pgd, pud) do { } while (0)
+
+#define stage2_pud_offset(pgd, address) ((pud_t *)(pgd))
+
+#define stage2_pud_free(mm, x) do { } while (0)
+
+#define stage2_pud_addr_end(addr, end) (end)
+#define stage2_pud_table_empty(pmdp) (0)
+
+#endif
diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
index 751227d..139b4db 100644
--- a/arch/arm64/include/asm/stage2_pgtable.h
+++ b/arch/arm64/include/asm/stage2_pgtable.h
@@ -22,32 +22,55 @@
#include <asm/pgtable.h>

/*
- * In the case where PGDIR_SHIFT is larger than KVM_PHYS_SHIFT, we can address
- * the entire IPA input range with a single pgd entry, and we would only need
- * one pgd entry. Note that in this case, the pgd is actually not used by
- * the MMU for Stage-2 translations, but is merely a fake pgd used as a data
- * structure for the kernel pgtable macros to work.
+ * The hardware mandates concatenation of upto 16 tables at stage2 entry level.
+ * Now, the minimum number of bits resolved at any level is (PAGE_SHIFT - 3),
+ * or in other words log2(PTRS_PER_PTE). On arm64, the smallest PAGE_SIZE
+ * supported is 4k, which means (PAGE_SHIFT - 3) > 4 holds for all page sizes.
+ * This implies, the total number of page table levels at stage2 expected
+ * by the hardware is actually the number of levels required for (KVM_PHYS_SHIFT - 4)
+ * in normal translations(e.g, stage-1), since we cannot have another level in
+ * the range (KVM_PHYS_SHIFT, KVM_PHYS_SHIFT - 4).
*/
-#if PGDIR_SHIFT > KVM_PHYS_SHIFT
-#define PTRS_PER_S2_PGD_SHIFT 0
-#else
-#define PTRS_PER_S2_PGD_SHIFT (KVM_PHYS_SHIFT - PGDIR_SHIFT)
-#endif
-#define PTRS_PER_S2_PGD (1 << PTRS_PER_S2_PGD_SHIFT)
+#define STAGE2_PGTABLE_LEVELS ARM64_HW_PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4)

/*
- * If we are concatenating first level stage-2 page tables, we would have less
- * than or equal to 16 pointers in the fake PGD, because that's what the
- * architecture allows. In this case, (4 - CONFIG_PGTABLE_LEVELS)
- * represents the first level for the host, and we add 1 to go to the next
- * level (which uses contatenation) for the stage-2 tables.
+ * At the moment, we do not support a combination of guest IPA and host VA_BITS
+ * where
+ * STAGE2_PGTABLE_LEVELS > CONFIG_PGTABLE_LEVELS
+ *
+ * We base our stage-2 page table walker helpers based on this assumption and
+ * fallback to using the host version of the helper wherever possible.
+ * i.e, if a particular level is not folded (e.g, PUD) at stage2, we fall back
+ * to using the host version, since it is guaranteed it is not folded at host.
+ *
+ * TODO: We could lift this limitation easily by rearranging the host level
+ * definitions to a more reusable version.
*/
-#if PTRS_PER_S2_PGD <= 16
-#define KVM_PREALLOC_LEVEL (4 - CONFIG_PGTABLE_LEVELS + 1)
-#else
-#define KVM_PREALLOC_LEVEL (0)
+#if STAGE2_PGTABLE_LEVELS > CONFIG_PGTABLE_LEVELS
+#error "Unsupported combination of guest IPA and host VA_BITS."
#endif

+
+#define S2_PGDIR_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - STAGE2_PGTABLE_LEVELS)
+#define S2_PGDIR_SIZE (_AC(1, UL) << S2_PGDIR_SHIFT)
+#define S2_PGDIR_MASK (~(S2_PGDIR_SIZE - 1))
+
+/* We can have concatenated tables at stage2 entry. */
+#define PTRS_PER_S2_PGD (1 << (KVM_PHYS_SHIFT - S2_PGDIR_SHIFT))
+
+/*
+ * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation
+ * levels in addition to the PGD.
+ */
+#define KVM_MMU_CACHE_MIN_PAGES (STAGE2_PGTABLE_LEVELS - 1)
+
+
+#if STAGE2_PGTABLE_LEVELS > 3
+
+#define S2_PUD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(1)
+#define S2_PUD_SIZE (_AC(1, UL) << S2_PUD_SHIFT)
+#define S2_PUD_MASK (~(S2_PUD_SIZE - 1))
+
#define stage2_pgd_none(pgd) pgd_none(pgd)
#define stage2_pgd_clear(pgd) pgd_clear(pgd)
#define stage2_pgd_present(pgd) pgd_present(pgd)
@@ -55,6 +78,23 @@
#define stage2_pud_offset(pgd, address) pud_offset(pgd, address)
#define stage2_pud_free(mm, pud) pud_free(mm, pud)

+#define stage2_pud_table_empty(pudp) kvm_page_empty(pudp)
+
+static inline phys_addr_t stage2_pud_addr_end(phys_addr_t addr, phys_addr_t end)
+{
+ phys_addr_t boundary = (addr + S2_PUD_SIZE) & S2_PUD_MASK;
+ return (boundary - 1 < end - 1) ? boundary : end;
+}
+
+#endif /* STAGE2_PGTABLE_LEVELS > 3 */
+
+
+#if STAGE2_PGTABLE_LEVELS > 2
+
+#define S2_PMD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(2)
+#define S2_PMD_SIZE (_AC(1, UL) << S2_PMD_SHIFT)
+#define S2_PMD_MASK (~(S2_PMD_SIZE - 1))
+
#define stage2_pud_none(pud) pud_none(pud)
#define stage2_pud_clear(pud) pud_clear(pud)
#define stage2_pud_present(pud) pud_present(pud)
@@ -63,24 +103,31 @@
#define stage2_pmd_free(mm, pmd) pmd_free(mm, pmd)

#define stage2_pud_huge(pud) pud_huge(pud)
+#define stage2_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
+
+static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
+{
+ phys_addr_t boundary = (addr + S2_PMD_SIZE) & S2_PMD_MASK;
+ return (boundary - 1 < end - 1) ? boundary : end;
+}

-#define stage2_pgd_addr_end(address, end) pgd_addr_end(address, end)
-#define stage2_pud_addr_end(address, end) pud_addr_end(address, end)
-#define stage2_pmd_addr_end(address, end) pmd_addr_end(address, end)
+#endif /* STAGE2_PGTABLE_LEVELS > 2 */

#define stage2_pte_table_empty(ptep) kvm_page_empty(ptep)
-#ifdef __PGTABLE_PMD_FOLDED
-#define stage2_pmd_table_empty(pmdp) (0)
-#else
-#define stage2_pmd_table_empty(pmdp) ((KVM_PREALLOC_LEVEL < 2) && kvm_page_empty(pmdp))
-#endif

-#ifdef __PGTABLE_PUD_FOLDED
-#define stage2_pud_table_empty(pudp) (0)
-#else
-#define stage2_pud_table_empty(pudp) ((KVM_PREALLOC_LEVEL < 1) && kvm_page_empty(pudp))
+#if STAGE2_PGTABLE_LEVELS == 2
+#include <asm/stage2_pgtable-nopmd.h>
+#elif STAGE2_PGTABLE_LEVELS == 3
+#include <asm/stage2_pgtable-nopud.h>
#endif

-#define stage2_pgd_index(addr) (((addr) >> PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
+
+#define stage2_pgd_index(addr) (((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
+
+static inline phys_addr_t stage2_pgd_addr_end(phys_addr_t addr, phys_addr_t end)
+{
+ phys_addr_t boundary = (addr + S2_PGDIR_SIZE) & S2_PGDIR_MASK;
+ return (boundary - 1 < end - 1) ? boundary : end;
+}

#endif /* __ARM64_S2_PGTABLE_H_ */
--
1.7.9.5

2016-04-04 16:27:20

by Suzuki K Poulose

Subject: [PATCH 17/17] arm64: kvm: Add support for 16K pages

Now that we can handle stage-2 page tables independently
of the host page table levels, wire up 16K page
support.
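
For reference, a rough sketch of what the 16K configuration works out to,
using the VTTBR_X scheme from patch 02 (T0SZ for a 40bit IPA is 24):

	VTCR_EL2_TGRAN_FLAGS = VTCR_EL2_TG0_16K | VTCR_EL2_SL0_LVL1
	VTTBR_X              = 42 - 24 = 18
	/* i.e. 16 concatenated 16K entry level tables, 256K aligned base */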

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm64/include/asm/kvm_arm.h | 11 ++++++++++-
arch/arm64/kvm/Kconfig | 1 -
2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index d5d5fdf..3d830a3 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -114,6 +114,7 @@
#define VTCR_EL2_PS_MASK TCR_EL2_PS_MASK
#define VTCR_EL2_TG0_MASK TCR_TG0_MASK
#define VTCR_EL2_TG0_4K TCR_TG0_4K
+#define VTCR_EL2_TG0_16K TCR_TG0_16K
#define VTCR_EL2_TG0_64K TCR_TG0_64K
#define VTCR_EL2_SH0_MASK TCR_SH0_MASK
#define VTCR_EL2_SH0_INNER TCR_SH0_INNER
@@ -139,7 +140,7 @@
* (see hyp-init.S).
*
* Note that when using 4K pages, we concatenate two first level page tables
- * together.
+ * together. With 16K pages, we concatenate 16 first level page tables.
*
* The magic numbers used for VTTBR_X in this patch can be found in Tables
* D4-23 and D4-25 in ARM DDI 0487A.b.
@@ -156,6 +157,14 @@
*/
#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
#define VTTBR_X_TGRAN_MAGIC 38
+#elif defined(CONFIG_ARM64_16K_PAGES)
+/*
+ * Stage2 translation configuration:
+ * 16kB pages (TG0 = 2)
+ * 2 level page tables (SL = 1)
+ */
+#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_16K | VTCR_EL2_SL0_LVL1)
+#define VTTBR_X_TGRAN_MAGIC 42
#else
/*
* Stage2 translation configuration:
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index de7450d..aa2e34e 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM_ARM_VGIC_V3
config KVM
bool "Kernel-based Virtual Machine (KVM) support"
depends on OF
- depends on !ARM64_16K_PAGES
select MMU_NOTIFIER
select PREEMPT_NOTIFIERS
select ANON_INODES
--
1.7.9.5

2016-04-04 16:27:42

by Suzuki K Poulose

Subject: [PATCH 16/17] kvm-arm: Cleanup stage2 pgd handling

Now that we don't have any fake page table levels for arm64,
clean up the common code and get rid of the dead code.

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm/include/asm/kvm_mmu.h | 19 -------------------
arch/arm/kvm/arm.c | 2 +-
arch/arm/kvm/mmu.c | 25 ++++++++-----------------
arch/arm64/include/asm/kvm_mmu.h | 18 ------------------
4 files changed, 9 insertions(+), 55 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 50aa901..f8b7920 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -146,25 +146,6 @@ static inline bool kvm_page_empty(void *ptr)
#define hyp_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
#define hyp_pud_table_empty(pudp) (0)

-static inline void *kvm_get_hwpgd(struct kvm *kvm)
-{
- return kvm->arch.pgd;
-}
-
-static inline unsigned int kvm_get_hwpgd_size(void)
-{
- return PTRS_PER_S2_PGD * sizeof(pgd_t);
-}
-
-static inline pgd_t *kvm_setup_fake_pgd(pgd_t *hwpgd)
-{
- return hwpgd;
-}
-
-static inline void kvm_free_fake_pgd(pgd_t *pgd)
-{
-}
-
struct kvm;

#define kvm_flush_dcache_to_poc(a,l) __cpuc_flush_dcache_area((a), (l))
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 6accd66..dfd4987 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -448,7 +448,7 @@ static void update_vttbr(struct kvm *kvm)
kvm_next_vmid &= (1 << kvm_vmid_bits) - 1;

/* update vttbr to be used with the new vmid */
- pgd_phys = virt_to_phys(kvm_get_hwpgd(kvm));
+ pgd_phys = virt_to_phys(kvm->arch.pgd);
BUG_ON(pgd_phys & ~VTTBR_BADDR_MASK);
vmid = ((u64)(kvm->arch.vmid) << VTTBR_VMID_SHIFT) & VTTBR_VMID_MASK(kvm_vmid_bits);
kvm->arch.vttbr = pgd_phys | vmid;
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index dda60fc..1cb7e60 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -755,6 +755,11 @@ int create_hyp_io_mappings(void *from, void *to, phys_addr_t phys_addr)
__phys_to_pfn(phys_addr), PAGE_HYP_DEVICE);
}

+static inline unsigned int kvm_get_hwpgd_size(void)
+{
+ return PTRS_PER_S2_PGD * sizeof(pgd_t);
+}
+
/* Free the HW pgd, one page at a time */
static void kvm_free_hwpgd(void *hwpgd)
{
@@ -783,29 +788,16 @@ static void *kvm_alloc_hwpgd(void)
int kvm_alloc_stage2_pgd(struct kvm *kvm)
{
pgd_t *pgd;
- void *hwpgd;

if (kvm->arch.pgd != NULL) {
kvm_err("kvm_arch already initialized?\n");
return -EINVAL;
}

- hwpgd = kvm_alloc_hwpgd();
- if (!hwpgd)
+ pgd = kvm_alloc_hwpgd();
+ if (!pgd)
return -ENOMEM;

- /*
- * When the kernel uses more levels of page tables than the
- * guest, we allocate a fake PGD and pre-populate it to point
- * to the next-level page table, which will be the real
- * initial page table pointed to by the VTTBR.
- */
- pgd = kvm_setup_fake_pgd(hwpgd);
- if (IS_ERR(pgd)) {
- kvm_free_hwpgd(hwpgd);
- return PTR_ERR(pgd);
- }
-
kvm_clean_pgd(pgd);
kvm->arch.pgd = pgd;
return 0;
@@ -893,8 +885,7 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
return;

unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
- kvm_free_hwpgd(kvm_get_hwpgd(kvm));
- kvm_free_fake_pgd(kvm->arch.pgd);
+ kvm_free_hwpgd(kvm->arch.pgd);
kvm->arch.pgd = NULL;
}

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index e3fee0a..249c4fc 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -141,24 +141,6 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
return (pmd_val(*pmd) & PMD_S2_RDWR) == PMD_S2_RDONLY;
}

-static inline void *kvm_get_hwpgd(struct kvm *kvm)
-{
- return kvm->arch.pgd;
-}
-
-static inline unsigned int kvm_get_hwpgd_size(void)
-{
- return PTRS_PER_S2_PGD * sizeof(pgd_t);
-}
-
-static inline pgd_t *kvm_setup_fake_pgd(pgd_t *hwpgd)
-{
- return hwpgd;
-}
-
-static inline void kvm_free_fake_pgd(pgd_t *pgd)
-{
-}
static inline bool kvm_page_empty(void *ptr)
{
struct page *ptr_page = virt_to_page(ptr);
--
1.7.9.5

2016-04-04 16:28:04

by Suzuki K Poulose

Subject: [PATCH 14/17] kvm-arm: Cleanup kvm_* wrappers

Now that we have switched to explicit page table routines,
get rid of the obsolete kvm_* wrappers.

Also, kvm_tlb_flush_vmid_ipa() is now called only on stage2
page tables, hence get rid of the redundant check.

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm/include/asm/kvm_mmu.h | 20 --------------------
arch/arm/kvm/mmu.c | 9 +--------
arch/arm64/include/asm/kvm_mmu.h | 24 ------------------------
3 files changed, 1 insertion(+), 52 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 901c1ea..50aa901 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -136,32 +136,12 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
return (pmd_val(*pmd) & L_PMD_S2_RDWR) == L_PMD_S2_RDONLY;
}

-
-/* Open coded p*d_addr_end that can deal with 64bit addresses */
-#define kvm_pgd_addr_end(addr, end) \
-({ u64 __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK; \
- (__boundary - 1 < (end) - 1)? __boundary: (end); \
-})
-
-#define kvm_pud_addr_end(addr,end) (end)
-
-#define kvm_pmd_addr_end(addr, end) \
-({ u64 __boundary = ((addr) + PMD_SIZE) & PMD_MASK; \
- (__boundary - 1 < (end) - 1)? __boundary: (end); \
-})
-
-#define kvm_pgd_index(addr) pgd_index(addr)
-
static inline bool kvm_page_empty(void *ptr)
{
struct page *ptr_page = virt_to_page(ptr);
return page_count(ptr_page) == 1;
}

-#define kvm_pte_table_empty(kvm, ptep) kvm_page_empty(ptep)
-#define kvm_pmd_table_empty(kvm, pmdp) kvm_page_empty(pmdp)
-#define kvm_pud_table_empty(kvm, pudp) (0)
-
#define hyp_pte_table_empty(ptep) kvm_page_empty(ptep)
#define hyp_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
#define hyp_pud_table_empty(pudp) (0)
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 0009a24..dda60fc 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -66,14 +66,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)

static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
{
- /*
- * This function also gets called when dealing with HYP page
- * tables. As HYP doesn't have an associated struct kvm (and
- * the HYP page tables are fairly static), we don't do
- * anything there.
- */
- if (kvm)
- kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
+ kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
}

/*
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index edf3c62..a3c0d05 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -153,13 +153,6 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
return (pmd_val(*pmd) & PMD_S2_RDWR) == PMD_S2_RDONLY;
}

-
-#define kvm_pgd_addr_end(addr, end) pgd_addr_end(addr, end)
-#define kvm_pud_addr_end(addr, end) pud_addr_end(addr, end)
-#define kvm_pmd_addr_end(addr, end) pmd_addr_end(addr, end)
-
-#define kvm_pgd_index(addr) (((addr) >> PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
-
static inline void *kvm_get_hwpgd(struct kvm *kvm)
{
pgd_t *pgd = kvm->arch.pgd;
@@ -232,23 +225,6 @@ static inline bool kvm_page_empty(void *ptr)
return page_count(ptr_page) == 1;
}

-#define kvm_pte_table_empty(kvm, ptep) kvm_page_empty(ptep)
-
-#ifdef __PAGETABLE_PMD_FOLDED
-#define kvm_pmd_table_empty(kvm, pmdp) (0)
-#else
-#define kvm_pmd_table_empty(kvm, pmdp) \
- (kvm_page_empty(pmdp) && (!(kvm) || KVM_PREALLOC_LEVEL < 2))
-#endif
-
-#ifdef __PAGETABLE_PUD_FOLDED
-#define kvm_pud_table_empty(kvm, pudp) (0)
-#else
-#define kvm_pud_table_empty(kvm, pudp) \
- (kvm_page_empty(pudp) && (!(kvm) || KVM_PREALLOC_LEVEL < 1))
-#endif
-
-
#define hyp_pte_table_empty(ptep) kvm_page_empty(ptep)

#ifdef __PAGETABLE_PMD_FOLDED
--
1.7.9.5

2016-04-04 16:28:38

by Suzuki K Poulose

Subject: [PATCH 13/17] kvm-arm: Add stage2 page table modifiers

Now that the hyp page table is handled by a different set of
routines, rename the original shared routines to stage2 handlers.
Also make explicit use of the stage2 page table helpers.

unmap_range has been merged into the existing unmap_stage2_range.

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm/kvm/mmu.c | 97 ++++++++++++++++++++++++----------------------------
1 file changed, 44 insertions(+), 53 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 2b491e5..0009a24 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -152,26 +152,26 @@ static void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
return p;
}

-static void clear_pgd_entry(struct kvm *kvm, pgd_t *pgd, phys_addr_t addr)
+static void clear_stage2_pgd_entry(struct kvm *kvm, pgd_t *pgd, phys_addr_t addr)
{
- pud_t *pud_table __maybe_unused = pud_offset(pgd, 0);
- pgd_clear(pgd);
+ pud_t *pud_table __maybe_unused = stage2_pud_offset(pgd, 0UL);
+ stage2_pgd_clear(pgd);
kvm_tlb_flush_vmid_ipa(kvm, addr);
- pud_free(NULL, pud_table);
+ stage2_pud_free(NULL, pud_table);
put_page(virt_to_page(pgd));
}

-static void clear_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
+static void clear_stage2_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
{
- pmd_t *pmd_table = pmd_offset(pud, 0);
- VM_BUG_ON(pud_huge(*pud));
- pud_clear(pud);
+ pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(pud, 0);
+ VM_BUG_ON(stage2_pud_huge(*pud));
+ stage2_pud_clear(pud);
kvm_tlb_flush_vmid_ipa(kvm, addr);
- pmd_free(NULL, pmd_table);
+ stage2_pmd_free(NULL, pmd_table);
put_page(virt_to_page(pud));
}

-static void clear_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr)
+static void clear_stage2_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr)
{
pte_t *pte_table = pte_offset_kernel(pmd, 0);
VM_BUG_ON(pmd_thp_or_huge(*pmd));
@@ -201,7 +201,7 @@ static void clear_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr)
* the corresponding TLBs, we call kvm_flush_dcache_p*() to make sure
* the IO subsystem will never hit in the cache.
*/
-static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
+static void unmap_stage2_ptes(struct kvm *kvm, pmd_t *pmd,
phys_addr_t addr, phys_addr_t end)
{
phys_addr_t start_addr = addr;
@@ -223,19 +223,19 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
}
} while (pte++, addr += PAGE_SIZE, addr != end);

- if (kvm_pte_table_empty(kvm, start_pte))
- clear_pmd_entry(kvm, pmd, start_addr);
+ if (stage2_pte_table_empty(start_pte))
+ clear_stage2_pmd_entry(kvm, pmd, start_addr);
}

-static void unmap_pmds(struct kvm *kvm, pud_t *pud,
+static void unmap_stage2_pmds(struct kvm *kvm, pud_t *pud,
phys_addr_t addr, phys_addr_t end)
{
phys_addr_t next, start_addr = addr;
pmd_t *pmd, *start_pmd;

- start_pmd = pmd = pmd_offset(pud, addr);
+ start_pmd = pmd = stage2_pmd_offset(pud, addr);
do {
- next = kvm_pmd_addr_end(addr, end);
+ next = stage2_pmd_addr_end(addr, end);
if (!pmd_none(*pmd)) {
if (pmd_thp_or_huge(*pmd)) {
pmd_t old_pmd = *pmd;
@@ -247,57 +247,64 @@ static void unmap_pmds(struct kvm *kvm, pud_t *pud,

put_page(virt_to_page(pmd));
} else {
- unmap_ptes(kvm, pmd, addr, next);
+ unmap_stage2_ptes(kvm, pmd, addr, next);
}
}
} while (pmd++, addr = next, addr != end);

- if (kvm_pmd_table_empty(kvm, start_pmd))
- clear_pud_entry(kvm, pud, start_addr);
+ if (stage2_pmd_table_empty(start_pmd))
+ clear_stage2_pud_entry(kvm, pud, start_addr);
}

-static void unmap_puds(struct kvm *kvm, pgd_t *pgd,
+static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd,
phys_addr_t addr, phys_addr_t end)
{
phys_addr_t next, start_addr = addr;
pud_t *pud, *start_pud;

- start_pud = pud = pud_offset(pgd, addr);
+ start_pud = pud = stage2_pud_offset(pgd, addr);
do {
- next = kvm_pud_addr_end(addr, end);
- if (!pud_none(*pud)) {
- if (pud_huge(*pud)) {
+ next = stage2_pud_addr_end(addr, end);
+ if (!stage2_pud_none(*pud)) {
+ if (stage2_pud_huge(*pud)) {
pud_t old_pud = *pud;

- pud_clear(pud);
+ stage2_pud_clear(pud);
kvm_tlb_flush_vmid_ipa(kvm, addr);
-
kvm_flush_dcache_pud(old_pud);
-
put_page(virt_to_page(pud));
} else {
- unmap_pmds(kvm, pud, addr, next);
+ unmap_stage2_pmds(kvm, pud, addr, next);
}
}
} while (pud++, addr = next, addr != end);

- if (kvm_pud_table_empty(kvm, start_pud))
- clear_pgd_entry(kvm, pgd, start_addr);
+ if (stage2_pud_table_empty(start_pud))
+ clear_stage2_pgd_entry(kvm, pgd, start_addr);
}

-
-static void unmap_range(struct kvm *kvm, pgd_t *pgdp,
- phys_addr_t start, u64 size)
+/**
+ * unmap_stage2_range -- Clear stage2 page table entries to unmap a range
+ * @kvm: The VM pointer
+ * @start: The intermediate physical base address of the range to unmap
+ * @size: The size of the area to unmap
+ *
+ * Clear a range of stage-2 mappings, lowering the various ref-counts. Must
+ * be called while holding mmu_lock (unless for freeing the stage2 pgd before
+ * destroying the VM), otherwise another faulting VCPU may come in and mess
+ * with things behind our backs.
+ */
+static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
{
pgd_t *pgd;
phys_addr_t addr = start, end = start + size;
phys_addr_t next;

- pgd = pgdp + kvm_pgd_index(addr);
+ pgd = kvm->arch.pgd + stage2_pgd_index(addr);
do {
- next = kvm_pgd_addr_end(addr, end);
- if (!pgd_none(*pgd))
- unmap_puds(kvm, pgd, addr, next);
+ next = stage2_pgd_addr_end(addr, end);
+ if (!stage2_pgd_none(*pgd))
+ unmap_stage2_puds(kvm, pgd, addr, next);
} while (pgd++, addr = next, addr != end);
}

@@ -811,22 +818,6 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
return 0;
}

-/**
- * unmap_stage2_range -- Clear stage2 page table entries to unmap a range
- * @kvm: The VM pointer
- * @start: The intermediate physical base address of the range to unmap
- * @size: The size of the area to unmap
- *
- * Clear a range of stage-2 mappings, lowering the various ref-counts. Must
- * be called while holding mmu_lock (unless for freeing the stage2 pgd before
- * destroying the VM), otherwise another faulting VCPU may come in and mess
- * with things behind our backs.
- */
-static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
-{
- unmap_range(kvm, kvm->arch.pgd, start, size);
-}
-
static void stage2_unmap_memslot(struct kvm *kvm,
struct kvm_memory_slot *memslot)
{
--
1.7.9.5

2016-04-04 16:27:08

by Suzuki K Poulose

Subject: [PATCH 10/17] kvm-arm: arm64: Introduce hyp page table empty checks

Introduce hyp_pxx_table_empty helpers for checking whether
a given table entry is empty. These will be used once we switch
to explicit routines for the hyp page table walk.

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm64/include/asm/kvm_mmu.h | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index e3f53e3..edf3c62 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -249,6 +249,20 @@ static inline bool kvm_page_empty(void *ptr)
#endif


+#define hyp_pte_table_empty(ptep) kvm_page_empty(ptep)
+
+#ifdef __PAGETABLE_PMD_FOLDED
+#define hyp_pmd_table_empty(pmdp) (0)
+#else
+#define hyp_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
+#endif
+
+#ifdef __PAGETABLE_PUD_FOLDED
+#define hyp_pud_table_empty(pudp) (0)
+#else
+#define hyp_pud_table_empty(pudp) kvm_page_empty(pudp)
+#endif
+
struct kvm;

#define kvm_flush_dcache_to_poc(a,l) __flush_dcache_area((a), (l))
--
1.7.9.5

2016-04-04 16:29:05

by Suzuki K Poulose

Subject: [PATCH 12/17] kvm-arm: Add explicit hyp page table modifiers

We have common routines to modify hyp and stage2 page tables
based on the 'kvm' parameter. For a smoother transition to
using separate routines for each, duplicate the routines
and modify the copies to work on the hyp table.

Mark the forked routines with _hyp_ and get rid of the kvm
parameter, which is no longer needed and is always NULL for hyp.
Also, get rid of the kvm_tlb_flush_vmid_ipa() calls from the hyp
versions, and use explicit host page table accessors
instead of the kvm_* page table helpers.

Suggested-by: Christoffer Dall <[email protected]>
Cc: Marc Zyngier <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm/kvm/mmu.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 118 insertions(+), 5 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index b46a337..2b491e5 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -388,6 +388,119 @@ static void stage2_flush_vm(struct kvm *kvm)
srcu_read_unlock(&kvm->srcu, idx);
}

+static void clear_hyp_pgd_entry(pgd_t *pgd)
+{
+ pud_t *pud_table __maybe_unused = pud_offset(pgd, 0UL);
+ pgd_clear(pgd);
+ pud_free(NULL, pud_table);
+ put_page(virt_to_page(pgd));
+}
+
+static void clear_hyp_pud_entry(pud_t *pud)
+{
+ pmd_t *pmd_table __maybe_unused = pmd_offset(pud, 0);
+ VM_BUG_ON(pud_huge(*pud));
+ pud_clear(pud);
+ pmd_free(NULL, pmd_table);
+ put_page(virt_to_page(pud));
+}
+
+static void clear_hyp_pmd_entry(pmd_t *pmd)
+{
+ pte_t *pte_table = pte_offset_kernel(pmd, 0);
+ VM_BUG_ON(pmd_thp_or_huge(*pmd));
+ pmd_clear(pmd);
+ pte_free_kernel(NULL, pte_table);
+ put_page(virt_to_page(pmd));
+}
+
+static void unmap_hyp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)
+{
+ pte_t *pte, *start_pte;
+
+ start_pte = pte = pte_offset_kernel(pmd, addr);
+ do {
+ if (!pte_none(*pte)) {
+ pte_t old_pte = *pte;
+
+ kvm_set_pte(pte, __pte(0));
+
+ /* XXX: Do we need to invalidate the cache for device mappings ? */
+ if (!kvm_is_device_pfn(pte_pfn(old_pte)))
+ kvm_flush_dcache_pte(old_pte);
+
+ put_page(virt_to_page(pte));
+ }
+ } while (pte++, addr += PAGE_SIZE, addr != end);
+
+ if (hyp_pte_table_empty(start_pte))
+ clear_hyp_pmd_entry(pmd);
+}
+
+static void unmap_hyp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
+{
+ phys_addr_t next;
+ pmd_t *pmd, *start_pmd;
+
+ start_pmd = pmd = pmd_offset(pud, addr);
+ do {
+ next = pmd_addr_end(addr, end);
+ if (!pmd_none(*pmd)) {
+ if (pmd_thp_or_huge(*pmd)) {
+ pmd_t old_pmd = *pmd;
+
+ pmd_clear(pmd);
+ kvm_flush_dcache_pmd(old_pmd);
+ put_page(virt_to_page(pmd));
+ } else {
+ unmap_hyp_ptes(pmd, addr, next);
+ }
+ }
+ } while (pmd++, addr = next, addr != end);
+
+ if (hyp_pmd_table_empty(start_pmd))
+ clear_hyp_pud_entry(pud);
+}
+
+static void unmap_hyp_puds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
+{
+ phys_addr_t next;
+ pud_t *pud, *start_pud;
+
+ start_pud = pud = pud_offset(pgd, addr);
+ do {
+ next = pud_addr_end(addr, end);
+ if (!pud_none(*pud)) {
+ if (pud_huge(*pud)) {
+ pud_t old_pud = *pud;
+
+ pud_clear(pud);
+ kvm_flush_dcache_pud(old_pud);
+ put_page(virt_to_page(pud));
+ } else {
+ unmap_hyp_pmds(pud, addr, next);
+ }
+ }
+ } while (pud++, addr = next, addr != end);
+
+ if (hyp_pud_table_empty(start_pud))
+ clear_hyp_pgd_entry(pgd);
+}
+
+static void unmap_hyp_range(pgd_t *pgdp, phys_addr_t start, u64 size)
+{
+ pgd_t *pgd;
+ phys_addr_t addr = start, end = start + size;
+ phys_addr_t next;
+
+ pgd = pgdp + pgd_index(addr);
+ do {
+ next = pgd_addr_end(addr, end);
+ if (!pgd_none(*pgd))
+ unmap_hyp_puds(pgd, addr, next);
+ } while (pgd++, addr = next, addr != end);
+}
+
/**
* free_boot_hyp_pgd - free HYP boot page tables
*
@@ -398,14 +511,14 @@ void free_boot_hyp_pgd(void)
mutex_lock(&kvm_hyp_pgd_mutex);

if (boot_hyp_pgd) {
- unmap_range(NULL, boot_hyp_pgd, hyp_idmap_start, PAGE_SIZE);
- unmap_range(NULL, boot_hyp_pgd, TRAMPOLINE_VA, PAGE_SIZE);
+ unmap_hyp_range(boot_hyp_pgd, hyp_idmap_start, PAGE_SIZE);
+ unmap_hyp_range(boot_hyp_pgd, TRAMPOLINE_VA, PAGE_SIZE);
free_pages((unsigned long)boot_hyp_pgd, hyp_pgd_order);
boot_hyp_pgd = NULL;
}

if (hyp_pgd)
- unmap_range(NULL, hyp_pgd, TRAMPOLINE_VA, PAGE_SIZE);
+ unmap_hyp_range(hyp_pgd, TRAMPOLINE_VA, PAGE_SIZE);

mutex_unlock(&kvm_hyp_pgd_mutex);
}
@@ -430,9 +543,9 @@ void free_hyp_pgds(void)

if (hyp_pgd) {
for (addr = PAGE_OFFSET; virt_addr_valid(addr); addr += PGDIR_SIZE)
- unmap_range(NULL, hyp_pgd, KERN_TO_HYP(addr), PGDIR_SIZE);
+ unmap_hyp_range(hyp_pgd, KERN_TO_HYP(addr), PGDIR_SIZE);
for (addr = VMALLOC_START; is_vmalloc_addr((void*)addr); addr += PGDIR_SIZE)
- unmap_range(NULL, hyp_pgd, KERN_TO_HYP(addr), PGDIR_SIZE);
+ unmap_hyp_range(hyp_pgd, KERN_TO_HYP(addr), PGDIR_SIZE);

free_pages((unsigned long)hyp_pgd, hyp_pgd_order);
hyp_pgd = NULL;
--
1.7.9.5

2016-04-04 16:29:30

by Suzuki K Poulose

Subject: [PATCH 09/17] kvm-arm: arm64: Introduce stage2 page table helpers

Introduce stage2 page table helpers for arm64. With the fake
page table level still in place, the stage2 table has the same
number of levels as the host (and hyp) tables, so they all
fall back to the host versions.

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm64/include/asm/kvm_mmu.h | 29 +----------
arch/arm64/include/asm/stage2_pgtable.h | 86 +++++++++++++++++++++++++++++++
2 files changed, 88 insertions(+), 27 deletions(-)
create mode 100644 arch/arm64/include/asm/stage2_pgtable.h

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 9a3409f..e3f53e3 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -91,6 +91,8 @@ alternative_endif
#define KVM_PHYS_SIZE (1UL << KVM_PHYS_SHIFT)
#define KVM_PHYS_MASK (KVM_PHYS_SIZE - 1UL)

+#include <asm/stage2_pgtable.h>
+
int create_hyp_mappings(void *from, void *to);
int create_hyp_io_mappings(void *from, void *to, phys_addr_t);
void free_boot_hyp_pgd(void);
@@ -156,35 +158,8 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
#define kvm_pud_addr_end(addr, end) pud_addr_end(addr, end)
#define kvm_pmd_addr_end(addr, end) pmd_addr_end(addr, end)

-/*
- * In the case where PGDIR_SHIFT is larger than KVM_PHYS_SHIFT, we can address
- * the entire IPA input range with a single pgd entry, and we would only need
- * one pgd entry. Note that in this case, the pgd is actually not used by
- * the MMU for Stage-2 translations, but is merely a fake pgd used as a data
- * structure for the kernel pgtable macros to work.
- */
-#if PGDIR_SHIFT > KVM_PHYS_SHIFT
-#define PTRS_PER_S2_PGD_SHIFT 0
-#else
-#define PTRS_PER_S2_PGD_SHIFT (KVM_PHYS_SHIFT - PGDIR_SHIFT)
-#endif
-#define PTRS_PER_S2_PGD (1 << PTRS_PER_S2_PGD_SHIFT)
-
#define kvm_pgd_index(addr) (((addr) >> PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))

-/*
- * If we are concatenating first level stage-2 page tables, we would have less
- * than or equal to 16 pointers in the fake PGD, because that's what the
- * architecture allows. In this case, (4 - CONFIG_PGTABLE_LEVELS)
- * represents the first level for the host, and we add 1 to go to the next
- * level (which uses contatenation) for the stage-2 tables.
- */
-#if PTRS_PER_S2_PGD <= 16
-#define KVM_PREALLOC_LEVEL (4 - CONFIG_PGTABLE_LEVELS + 1)
-#else
-#define KVM_PREALLOC_LEVEL (0)
-#endif
-
static inline void *kvm_get_hwpgd(struct kvm *kvm)
{
pgd_t *pgd = kvm->arch.pgd;
diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
new file mode 100644
index 0000000..751227d
--- /dev/null
+++ b/arch/arm64/include/asm/stage2_pgtable.h
@@ -0,0 +1,86 @@
+/*
+ * Copyright (C) 2016 - ARM Ltd
+ *
+ * stage2 page table helpers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef __ARM64_S2_PGTABLE_H_
+#define __ARM64_S2_PGTABLE_H_
+
+#include <asm/pgtable.h>
+
+/*
+ * In the case where PGDIR_SHIFT is larger than KVM_PHYS_SHIFT, we can address
+ * the entire IPA input range with a single pgd entry, and we would only need
+ * one pgd entry. Note that in this case, the pgd is actually not used by
+ * the MMU for Stage-2 translations, but is merely a fake pgd used as a data
+ * structure for the kernel pgtable macros to work.
+ */
+#if PGDIR_SHIFT > KVM_PHYS_SHIFT
+#define PTRS_PER_S2_PGD_SHIFT 0
+#else
+#define PTRS_PER_S2_PGD_SHIFT (KVM_PHYS_SHIFT - PGDIR_SHIFT)
+#endif
+#define PTRS_PER_S2_PGD (1 << PTRS_PER_S2_PGD_SHIFT)
+
+/*
+ * If we are concatenating first level stage-2 page tables, we would have less
+ * than or equal to 16 pointers in the fake PGD, because that's what the
+ * architecture allows. In this case, (4 - CONFIG_PGTABLE_LEVELS)
+ * represents the first level for the host, and we add 1 to go to the next
+ * level (which uses contatenation) for the stage-2 tables.
+ */
+#if PTRS_PER_S2_PGD <= 16
+#define KVM_PREALLOC_LEVEL (4 - CONFIG_PGTABLE_LEVELS + 1)
+#else
+#define KVM_PREALLOC_LEVEL (0)
+#endif
+
+#define stage2_pgd_none(pgd) pgd_none(pgd)
+#define stage2_pgd_clear(pgd) pgd_clear(pgd)
+#define stage2_pgd_present(pgd) pgd_present(pgd)
+#define stage2_pgd_populate(mm, pgd, pud) pgd_populate(mm, pgd, pud)
+#define stage2_pud_offset(pgd, address) pud_offset(pgd, address)
+#define stage2_pud_free(mm, pud) pud_free(mm, pud)
+
+#define stage2_pud_none(pud) pud_none(pud)
+#define stage2_pud_clear(pud) pud_clear(pud)
+#define stage2_pud_present(pud) pud_present(pud)
+#define stage2_pud_populate(mm, pud, pmd) pud_populate(mm, pud, pmd)
+#define stage2_pmd_offset(pud, address) pmd_offset(pud, address)
+#define stage2_pmd_free(mm, pmd) pmd_free(mm, pmd)
+
+#define stage2_pud_huge(pud) pud_huge(pud)
+
+#define stage2_pgd_addr_end(address, end) pgd_addr_end(address, end)
+#define stage2_pud_addr_end(address, end) pud_addr_end(address, end)
+#define stage2_pmd_addr_end(address, end) pmd_addr_end(address, end)
+
+#define stage2_pte_table_empty(ptep) kvm_page_empty(ptep)
+#ifdef __PGTABLE_PMD_FOLDED
+#define stage2_pmd_table_empty(pmdp) (0)
+#else
+#define stage2_pmd_table_empty(pmdp) ((KVM_PREALLOC_LEVEL < 2) && kvm_page_empty(pmdp))
+#endif
+
+#ifdef __PGTABLE_PUD_FOLDED
+#define stage2_pud_table_empty(pudp) (0)
+#else
+#define stage2_pud_table_empty(pudp) ((KVM_PREALLOC_LEVEL < 1) && kvm_page_empty(pudp))
+#endif
+
+#define stage2_pgd_index(addr) (((addr) >> PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
+
+#endif /* __ARM64_S2_PGTABLE_H_ */
--
1.7.9.5
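
The value of routing everything through these stage2_* names, even while
they are plain aliases of the host accessors, is that a later patch in the
series can redefine individual levels once the fake page table level goes
away, without touching any callers. As a sketch (the second definition is
the folded-PUD form that appears later in the series):

    /* with the fake level still in place, a straight alias of the host helper */
    #define stage2_pud_offset(pgd, address)     pud_offset(pgd, address)

    /* once the PUD level is folded away at stage-2, the same name simply
     * hands back the pgd entry itself */
    #define stage2_pud_offset(pgd, address)     ((pud_t *)(pgd))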

2016-04-04 16:30:24

by Suzuki K Poulose

[permalink] [raw]
Subject: [PATCH 06/17] kvm-arm: Remove kvm_pud_huge()

Get rid of kvm_pud_huge(), which simply falls back to pud_huge(). Use
pud_huge() directly instead.

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm/kvm/mmu.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 7837f0a..d0c0ee9 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -45,8 +45,6 @@ static phys_addr_t hyp_idmap_vector;

#define hyp_pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t))

-#define kvm_pud_huge(_x) pud_huge(_x)
-
#define KVM_S2PTE_FLAG_IS_IOMAP (1UL << 0)
#define KVM_S2_FLAG_LOGGING_ACTIVE (1UL << 1)

@@ -1077,7 +1075,7 @@ static void stage2_wp_puds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
next = kvm_pud_addr_end(addr, end);
if (!pud_none(*pud)) {
/* TODO:PUD not supported, revisit later if supported */
- BUG_ON(kvm_pud_huge(*pud));
+ BUG_ON(pud_huge(*pud));
stage2_wp_pmds(pud, addr, next);
}
} while (pud++, addr = next, addr != end);
--
1.7.9.5

2016-04-04 16:30:48

by Suzuki K Poulose

[permalink] [raw]
Subject: [PATCH 08/17] kvm-arm: arm: Introduce hyp page table empty checks

Introduce hyp_pxx_table_empty helpers for checking whether
a given table entry is empty. These will be used explicitly
once we switch to explicit routines for hyp page table walk.

Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm/include/asm/kvm_mmu.h | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 7d207b4..901c1ea 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -162,6 +162,10 @@ static inline bool kvm_page_empty(void *ptr)
#define kvm_pmd_table_empty(kvm, pmdp) kvm_page_empty(pmdp)
#define kvm_pud_table_empty(kvm, pudp) (0)

+#define hyp_pte_table_empty(ptep) kvm_page_empty(ptep)
+#define hyp_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
+#define hyp_pud_table_empty(pudp) (0)
+
static inline void *kvm_get_hwpgd(struct kvm *kvm)
{
return kvm->arch.pgd;
--
1.7.9.5
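
The emptiness check behind these wrappers is cheap: kvm_page_empty()
(visible as context in the hunk above) looks only at the refcount of the
page backing the table, since every entry installed in a table takes a
reference on that page. Roughly, as in the existing arm implementation:

    static inline bool kvm_page_empty(void *ptr)
    {
        struct page *ptr_page = virt_to_page(ptr);
        return page_count(ptr_page) == 1;
    }

The explicit hyp unmap path added later in the series then frees an
exhausted table with a pattern like:

    if (hyp_pte_table_empty(start_pte))
        clear_hyp_pmd_entry(pmd);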

2016-04-04 16:31:11

by Suzuki K Poulose

[permalink] [raw]
Subject: [PATCH 05/17] kvm-arm: Replace kvm_pmd_huge with pmd_thp_or_huge

Both arm and arm64 now provide a helper, pmd_thp_or_huge(),
to check if the given pmd represents a huge page. Use that
instead of our own custom check.

Suggested-by: Mark Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm/kvm/mmu.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 774d00b..7837f0a 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -45,7 +45,6 @@ static phys_addr_t hyp_idmap_vector;

#define hyp_pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t))

-#define kvm_pmd_huge(_x) (pmd_huge(_x) || pmd_trans_huge(_x))
#define kvm_pud_huge(_x) pud_huge(_x)

#define KVM_S2PTE_FLAG_IS_IOMAP (1UL << 0)
@@ -115,7 +114,7 @@ static bool kvm_is_device_pfn(unsigned long pfn)
*/
static void stage2_dissolve_pmd(struct kvm *kvm, phys_addr_t addr, pmd_t *pmd)
{
- if (!kvm_pmd_huge(*pmd))
+ if (!pmd_thp_or_huge(*pmd))
return;

pmd_clear(pmd);
@@ -177,7 +176,7 @@ static void clear_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
static void clear_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr)
{
pte_t *pte_table = pte_offset_kernel(pmd, 0);
- VM_BUG_ON(kvm_pmd_huge(*pmd));
+ VM_BUG_ON(pmd_thp_or_huge(*pmd));
pmd_clear(pmd);
kvm_tlb_flush_vmid_ipa(kvm, addr);
pte_free_kernel(NULL, pte_table);
@@ -240,7 +239,7 @@ static void unmap_pmds(struct kvm *kvm, pud_t *pud,
do {
next = kvm_pmd_addr_end(addr, end);
if (!pmd_none(*pmd)) {
- if (kvm_pmd_huge(*pmd)) {
+ if (pmd_thp_or_huge(*pmd)) {
pmd_t old_pmd = *pmd;

pmd_clear(pmd);
@@ -326,7 +325,7 @@ static void stage2_flush_pmds(struct kvm *kvm, pud_t *pud,
do {
next = kvm_pmd_addr_end(addr, end);
if (!pmd_none(*pmd)) {
- if (kvm_pmd_huge(*pmd))
+ if (pmd_thp_or_huge(*pmd))
kvm_flush_dcache_pmd(*pmd);
else
stage2_flush_ptes(kvm, pmd, addr, next);
@@ -1050,7 +1049,7 @@ static void stage2_wp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
do {
next = kvm_pmd_addr_end(addr, end);
if (!pmd_none(*pmd)) {
- if (kvm_pmd_huge(*pmd)) {
+ if (pmd_thp_or_huge(*pmd)) {
if (!kvm_s2pmd_readonly(pmd))
kvm_set_s2pmd_readonly(pmd);
} else {
@@ -1331,7 +1330,7 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
if (!pmd || pmd_none(*pmd)) /* Nothing there */
goto out;

- if (kvm_pmd_huge(*pmd)) { /* THP, HugeTLB */
+ if (pmd_thp_or_huge(*pmd)) { /* THP, HugeTLB */
*pmd = pmd_mkyoung(*pmd);
pfn = pmd_pfn(*pmd);
pfn_valid = true;
@@ -1555,7 +1554,7 @@ static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
if (!pmd || pmd_none(*pmd)) /* Nothing there */
return 0;

- if (kvm_pmd_huge(*pmd)) { /* THP, HugeTLB */
+ if (pmd_thp_or_huge(*pmd)) { /* THP, HugeTLB */
if (pmd_young(*pmd)) {
*pmd = pmd_mkold(*pmd);
return 1;
@@ -1585,7 +1584,7 @@ static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
if (!pmd || pmd_none(*pmd)) /* Nothing there */
return 0;

- if (kvm_pmd_huge(*pmd)) /* THP, HugeTLB */
+ if (pmd_thp_or_huge(*pmd)) /* THP, HugeTLB */
return pmd_young(*pmd);

pte = pte_offset_kernel(pmd, gpa);
--
1.7.9.5

2016-04-04 16:31:46

by Suzuki K Poulose

[permalink] [raw]
Subject: [PATCH 03/17] kvm arm: Move fake PGD handling to arch specific files

Rearrange the code for fake pgd handling, which is applicable
only for arm64. This will later be removed once we introduce
the stage2 page table walker macros.

Reviewed-by: Marc Zyngier <[email protected]>
Reviewed-by: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm/include/asm/kvm_mmu.h | 11 +++++++--
arch/arm/kvm/mmu.c | 47 ++++++--------------------------------
arch/arm64/include/asm/kvm_mmu.h | 43 ++++++++++++++++++++++++++++++++++
3 files changed, 59 insertions(+), 42 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index da44be9..c2b2b27 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -161,8 +161,6 @@ static inline bool kvm_page_empty(void *ptr)
#define kvm_pmd_table_empty(kvm, pmdp) kvm_page_empty(pmdp)
#define kvm_pud_table_empty(kvm, pudp) (0)

-#define KVM_PREALLOC_LEVEL 0
-
static inline void *kvm_get_hwpgd(struct kvm *kvm)
{
return kvm->arch.pgd;
@@ -173,6 +171,15 @@ static inline unsigned int kvm_get_hwpgd_size(void)
return PTRS_PER_S2_PGD * sizeof(pgd_t);
}

+static inline pgd_t *kvm_setup_fake_pgd(pgd_t *hwpgd)
+{
+ return hwpgd;
+}
+
+static inline void kvm_free_fake_pgd(pgd_t *pgd)
+{
+}
+
struct kvm;

#define kvm_flush_dcache_to_poc(a,l) __cpuc_flush_dcache_area((a), (l))
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 58dbd5c..774d00b 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -684,47 +684,16 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
if (!hwpgd)
return -ENOMEM;

- /* When the kernel uses more levels of page tables than the
+ /*
+ * When the kernel uses more levels of page tables than the
* guest, we allocate a fake PGD and pre-populate it to point
* to the next-level page table, which will be the real
* initial page table pointed to by the VTTBR.
- *
- * When KVM_PREALLOC_LEVEL==2, we allocate a single page for
- * the PMD and the kernel will use folded pud.
- * When KVM_PREALLOC_LEVEL==1, we allocate 2 consecutive PUD
- * pages.
*/
- if (KVM_PREALLOC_LEVEL > 0) {
- int i;
-
- /*
- * Allocate fake pgd for the page table manipulation macros to
- * work. This is not used by the hardware and we have no
- * alignment requirement for this allocation.
- */
- pgd = kmalloc(PTRS_PER_S2_PGD * sizeof(pgd_t),
- GFP_KERNEL | __GFP_ZERO);
-
- if (!pgd) {
- kvm_free_hwpgd(hwpgd);
- return -ENOMEM;
- }
-
- /* Plug the HW PGD into the fake one. */
- for (i = 0; i < PTRS_PER_S2_PGD; i++) {
- if (KVM_PREALLOC_LEVEL == 1)
- pgd_populate(NULL, pgd + i,
- (pud_t *)hwpgd + i * PTRS_PER_PUD);
- else if (KVM_PREALLOC_LEVEL == 2)
- pud_populate(NULL, pud_offset(pgd, 0) + i,
- (pmd_t *)hwpgd + i * PTRS_PER_PMD);
- }
- } else {
- /*
- * Allocate actual first-level Stage-2 page table used by the
- * hardware for Stage-2 page table walks.
- */
- pgd = (pgd_t *)hwpgd;
+ pgd = kvm_setup_fake_pgd(hwpgd);
+ if (IS_ERR(pgd)) {
+ kvm_free_hwpgd(hwpgd);
+ return PTR_ERR(pgd);
}

kvm_clean_pgd(pgd);
@@ -831,9 +800,7 @@ void kvm_free_stage2_pgd(struct kvm *kvm)

unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
kvm_free_hwpgd(kvm_get_hwpgd(kvm));
- if (KVM_PREALLOC_LEVEL > 0)
- kfree(kvm->arch.pgd);
-
+ kvm_free_fake_pgd(kvm->arch.pgd);
kvm->arch.pgd = NULL;
}

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 22732a5..9a3409f 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -208,6 +208,49 @@ static inline unsigned int kvm_get_hwpgd_size(void)
return PTRS_PER_S2_PGD * sizeof(pgd_t);
}

+/*
+ * Allocate fake pgd for the host kernel page table macros to work.
+ * This is not used by the hardware and we have no alignment
+ * requirement for this allocation.
+ */
+static inline pgd_t *kvm_setup_fake_pgd(pgd_t *hwpgd)
+{
+ int i;
+ pgd_t *pgd;
+
+ if (!KVM_PREALLOC_LEVEL)
+ return hwpgd;
+
+ /*
+ * When KVM_PREALLOC_LEVEL==2, we allocate a single page for
+ * the PMD and the kernel will use folded pud.
+ * When KVM_PREALLOC_LEVEL==1, we allocate 2 consecutive PUD
+ * pages.
+ */
+
+ pgd = kmalloc(PTRS_PER_S2_PGD * sizeof(pgd_t),
+ GFP_KERNEL | __GFP_ZERO);
+ if (!pgd)
+ return ERR_PTR(-ENOMEM);
+
+ /* Plug the HW PGD into the fake one. */
+ for (i = 0; i < PTRS_PER_S2_PGD; i++) {
+ if (KVM_PREALLOC_LEVEL == 1)
+ pgd_populate(NULL, pgd + i,
+ (pud_t *)hwpgd + i * PTRS_PER_PUD);
+ else if (KVM_PREALLOC_LEVEL == 2)
+ pud_populate(NULL, pud_offset(pgd, 0) + i,
+ (pmd_t *)hwpgd + i * PTRS_PER_PMD);
+ }
+
+ return pgd;
+}
+
+static inline void kvm_free_fake_pgd(pgd_t *pgd)
+{
+ if (KVM_PREALLOC_LEVEL > 0)
+ kfree(pgd);
+}
static inline bool kvm_page_empty(void *ptr)
{
struct page *ptr_page = virt_to_page(ptr);
--
1.7.9.5
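
To make the pre-population above concrete, this is how the existing
KVM_PREALLOC_LEVEL logic works out for two common arm64 configurations,
assuming a 40-bit IPA (KVM_PHYS_SHIFT = 40):

    4K pages, 48-bit VA host (CONFIG_PGTABLE_LEVELS = 4, PGDIR_SHIFT = 39):
        PTRS_PER_S2_PGD    = 1 << (40 - 39) = 2     (<= 16, concatenation applies)
        KVM_PREALLOC_LEVEL = 4 - 4 + 1      = 1
        -> a 2-entry fake PGD; the hwpgd is 2 concatenated PUD-level tables

    64K pages, 48-bit VA host (CONFIG_PGTABLE_LEVELS = 3, PGDIR_SHIFT = 42):
        PGDIR_SHIFT > 40, so PTRS_PER_S2_PGD = 1    (<= 16)
        KVM_PREALLOC_LEVEL = 4 - 3 + 1      = 2
        -> a single-entry fake PGD; the hwpgd is one PMD-level table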

2016-04-04 16:32:45

by Suzuki K Poulose

[permalink] [raw]
Subject: [PATCH 01/17] arm64: Reuse TCR field definitions for EL1 and EL2

TCR_EL1, TCR_EL2 and VTCR_EL2, all share some field positions
(TG0, ORGN0, IRGN0 and SH0) and their corresponding value definitions.

This patch makes the TCR_EL1 definitions reusable and uses them for TCR_EL2
and VTCR_EL2 fields.

This also fixes a bug where TG0 in {V}TCR_EL2 was treated as 1bit field.

Cc: Catalin Marinas <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm64/include/asm/kvm_arm.h | 48 ++++++++++---------
arch/arm64/include/asm/pgtable-hwdef.h | 80 +++++++++++++++++++++++++-------
2 files changed, 88 insertions(+), 40 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 4150fd8..c460cfe 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -96,32 +96,34 @@
SCTLR_EL2_SA | SCTLR_EL2_I)

/* TCR_EL2 Registers bits */
-#define TCR_EL2_RES1 ((1 << 31) | (1 << 23))
-#define TCR_EL2_TBI (1 << 20)
-#define TCR_EL2_PS (7 << 16)
-#define TCR_EL2_PS_40B (2 << 16)
-#define TCR_EL2_TG0 (1 << 14)
-#define TCR_EL2_SH0 (3 << 12)
-#define TCR_EL2_ORGN0 (3 << 10)
-#define TCR_EL2_IRGN0 (3 << 8)
-#define TCR_EL2_T0SZ 0x3f
-#define TCR_EL2_MASK (TCR_EL2_TG0 | TCR_EL2_SH0 | \
- TCR_EL2_ORGN0 | TCR_EL2_IRGN0 | TCR_EL2_T0SZ)
+#define TCR_EL2_RES1 ((1 << 31) | (1 << 23))
+#define TCR_EL2_TBI (1 << 20)
+#define TCR_EL2_PS_SHIFT 16
+#define TCR_EL2_PS_MASK (7 << TCR_EL2_PS_SHIFT)
+#define TCR_EL2_PS_40B (2 << TCR_EL2_PS_SHIFT)
+#define TCR_EL2_TG0_MASK TCR_TG0_MASK
+#define TCR_EL2_SH0_MASK TCR_SH0_MASK
+#define TCR_EL2_ORGN0_MASK TCR_ORGN0_MASK
+#define TCR_EL2_IRGN0_MASK TCR_IRGN0_MASK
+#define TCR_EL2_T0SZ_MASK 0x3f
+#define TCR_EL2_MASK (TCR_EL2_TG0_MASK | TCR_EL2_SH0_MASK | \
+ TCR_EL2_ORGN0_MASK | TCR_EL2_IRGN0_MASK | TCR_EL2_T0SZ_MASK)

/* VTCR_EL2 Registers bits */
#define VTCR_EL2_RES1 (1 << 31)
-#define VTCR_EL2_PS_MASK (7 << 16)
-#define VTCR_EL2_TG0_MASK (1 << 14)
-#define VTCR_EL2_TG0_4K (0 << 14)
-#define VTCR_EL2_TG0_64K (1 << 14)
-#define VTCR_EL2_SH0_MASK (3 << 12)
-#define VTCR_EL2_SH0_INNER (3 << 12)
-#define VTCR_EL2_ORGN0_MASK (3 << 10)
-#define VTCR_EL2_ORGN0_WBWA (1 << 10)
-#define VTCR_EL2_IRGN0_MASK (3 << 8)
-#define VTCR_EL2_IRGN0_WBWA (1 << 8)
-#define VTCR_EL2_SL0_MASK (3 << 6)
-#define VTCR_EL2_SL0_LVL1 (1 << 6)
+#define VTCR_EL2_PS_MASK TCR_EL2_PS_MASK
+#define VTCR_EL2_TG0_MASK TCR_TG0_MASK
+#define VTCR_EL2_TG0_4K TCR_TG0_4K
+#define VTCR_EL2_TG0_64K TCR_TG0_64K
+#define VTCR_EL2_SH0_MASK TCR_SH0_MASK
+#define VTCR_EL2_SH0_INNER TCR_SH0_INNER
+#define VTCR_EL2_ORGN0_MASK TCR_ORGN0_MASK
+#define VTCR_EL2_ORGN0_WBWA TCR_ORGN0_WBWA
+#define VTCR_EL2_IRGN0_MASK TCR_IRGN0_MASK
+#define VTCR_EL2_IRGN0_WBWA TCR_IRGN0_WBWA
+#define VTCR_EL2_SL0_SHIFT 6
+#define VTCR_EL2_SL0_MASK (3 << VTCR_EL2_SL0_SHIFT)
+#define VTCR_EL2_SL0_LVL1 (1 << VTCR_EL2_SL0_SHIFT)
#define VTCR_EL2_T0SZ_MASK 0x3f
#define VTCR_EL2_T0SZ_40B 24
#define VTCR_EL2_VS_SHIFT 19
diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
index 5c25b83..936f173 100644
--- a/arch/arm64/include/asm/pgtable-hwdef.h
+++ b/arch/arm64/include/asm/pgtable-hwdef.h
@@ -208,23 +208,69 @@
#define TCR_T1SZ(x) ((UL(64) - (x)) << TCR_T1SZ_OFFSET)
#define TCR_TxSZ(x) (TCR_T0SZ(x) | TCR_T1SZ(x))
#define TCR_TxSZ_WIDTH 6
-#define TCR_IRGN_NC ((UL(0) << 8) | (UL(0) << 24))
-#define TCR_IRGN_WBWA ((UL(1) << 8) | (UL(1) << 24))
-#define TCR_IRGN_WT ((UL(2) << 8) | (UL(2) << 24))
-#define TCR_IRGN_WBnWA ((UL(3) << 8) | (UL(3) << 24))
-#define TCR_IRGN_MASK ((UL(3) << 8) | (UL(3) << 24))
-#define TCR_ORGN_NC ((UL(0) << 10) | (UL(0) << 26))
-#define TCR_ORGN_WBWA ((UL(1) << 10) | (UL(1) << 26))
-#define TCR_ORGN_WT ((UL(2) << 10) | (UL(2) << 26))
-#define TCR_ORGN_WBnWA ((UL(3) << 10) | (UL(3) << 26))
-#define TCR_ORGN_MASK ((UL(3) << 10) | (UL(3) << 26))
-#define TCR_SHARED ((UL(3) << 12) | (UL(3) << 28))
-#define TCR_TG0_4K (UL(0) << 14)
-#define TCR_TG0_64K (UL(1) << 14)
-#define TCR_TG0_16K (UL(2) << 14)
-#define TCR_TG1_16K (UL(1) << 30)
-#define TCR_TG1_4K (UL(2) << 30)
-#define TCR_TG1_64K (UL(3) << 30)
+
+#define TCR_IRGN0_SHIFT 8
+#define TCR_IRGN0_MASK (UL(3) << TCR_IRGN0_SHIFT)
+#define TCR_IRGN0_NC (UL(0) << TCR_IRGN0_SHIFT)
+#define TCR_IRGN0_WBWA (UL(1) << TCR_IRGN0_SHIFT)
+#define TCR_IRGN0_WT (UL(2) << TCR_IRGN0_SHIFT)
+#define TCR_IRGN0_WBnWA (UL(3) << TCR_IRGN0_SHIFT)
+
+#define TCR_IRGN1_SHIFT 24
+#define TCR_IRGN1_MASK (UL(3) << TCR_IRGN1_SHIFT)
+#define TCR_IRGN1_NC (UL(0) << TCR_IRGN1_SHIFT)
+#define TCR_IRGN1_WBWA (UL(1) << TCR_IRGN1_SHIFT)
+#define TCR_IRGN1_WT (UL(2) << TCR_IRGN1_SHIFT)
+#define TCR_IRGN1_WBnWA (UL(3) << TCR_IRGN1_SHIFT)
+
+#define TCR_IRGN_NC (TCR_IRGN0_NC | TCR_IRGN1_NC)
+#define TCR_IRGN_WBWA (TCR_IRGN0_WBWA | TCR_IRGN1_WBWA)
+#define TCR_IRGN_WT (TCR_IRGN0_WT | TCR_IRGN1_WT)
+#define TCR_IRGN_WBnWA (TCR_IRGN0_WBnWA | TCR_IRGN1_WBnWA)
+#define TCR_IRGN_MASK (TCR_IRGN0_MASK | TCR_IRGN1_MASK)
+
+
+#define TCR_ORGN0_SHIFT 10
+#define TCR_ORGN0_MASK (UL(3) << TCR_ORGN0_SHIFT)
+#define TCR_ORGN0_NC (UL(0) << TCR_ORGN0_SHIFT)
+#define TCR_ORGN0_WBWA (UL(1) << TCR_ORGN0_SHIFT)
+#define TCR_ORGN0_WT (UL(2) << TCR_ORGN0_SHIFT)
+#define TCR_ORGN0_WBnWA (UL(3) << TCR_ORGN0_SHIFT)
+
+#define TCR_ORGN1_SHIFT 26
+#define TCR_ORGN1_MASK (UL(3) << TCR_ORGN1_SHIFT)
+#define TCR_ORGN1_NC (UL(0) << TCR_ORGN1_SHIFT)
+#define TCR_ORGN1_WBWA (UL(1) << TCR_ORGN1_SHIFT)
+#define TCR_ORGN1_WT (UL(2) << TCR_ORGN1_SHIFT)
+#define TCR_ORGN1_WBnWA (UL(3) << TCR_ORGN1_SHIFT)
+
+#define TCR_ORGN_NC (TCR_ORGN0_NC | TCR_ORGN1_NC)
+#define TCR_ORGN_WBWA (TCR_ORGN0_WBWA | TCR_ORGN1_WBWA)
+#define TCR_ORGN_WT (TCR_ORGN0_WT | TCR_ORGN1_WT)
+#define TCR_ORGN_WBnWA (TCR_ORGN0_WBnWA | TCR_ORGN1_WBnWA)
+#define TCR_ORGN_MASK (TCR_ORGN0_MASK | TCR_ORGN1_MASK)
+
+#define TCR_SH0_SHIFT 12
+#define TCR_SH0_MASK (UL(3) << TCR_SH0_SHIFT)
+#define TCR_SH0_INNER (UL(3) << TCR_SH0_SHIFT)
+
+#define TCR_SH1_SHIFT 28
+#define TCR_SH1_MASK (UL(3) << TCR_SH1_SHIFT)
+#define TCR_SH1_INNER (UL(3) << TCR_SH1_SHIFT)
+#define TCR_SHARED (TCR_SH0_INNER | TCR_SH1_INNER)
+
+#define TCR_TG0_SHIFT 14
+#define TCR_TG0_MASK (UL(3) << TCR_TG0_SHIFT)
+#define TCR_TG0_4K (UL(0) << TCR_TG0_SHIFT)
+#define TCR_TG0_64K (UL(1) << TCR_TG0_SHIFT)
+#define TCR_TG0_16K (UL(2) << TCR_TG0_SHIFT)
+
+#define TCR_TG1_SHIFT 30
+#define TCR_TG1_MASK (UL(3) << TCR_TG1_SHIFT)
+#define TCR_TG1_16K (UL(1) << TCR_TG1_SHIFT)
+#define TCR_TG1_4K (UL(2) << TCR_TG1_SHIFT)
+#define TCR_TG1_64K (UL(3) << TCR_TG1_SHIFT)
+
#define TCR_ASID16 (UL(1) << 36)
#define TCR_TBI0 (UL(1) << 37)
#define TCR_HA (UL(1) << 39)
--
1.7.9.5
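
The TG0 fix called out in the commit message is easy to see with the new
definitions: TG0 is a two-bit field and the 16K granule encoding uses the
upper bit, so the old one-bit mask could neither isolate nor clear it. For
example:

    /* old definition treated TG0 as a single bit */
    #define TCR_EL2_TG0         (1 << 14)
    /* new definition covers the whole field */
    #define TCR_EL2_TG0_MASK    TCR_TG0_MASK        /* (3 << 14) */

    /* 16K granule encodes TG0 = 2, i.e. bit 15 */
    (TCR_TG0_16K & ~(1UL << 14)) == (2UL << 14)     /* old mask leaves it set */
    (TCR_TG0_16K & ~TCR_TG0_MASK) == 0              /* new mask clears the field */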

2016-04-08 12:43:23

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 07/17] kvm-arm: arm32: Introduce stage2 page table helpers

On Mon, Apr 04, 2016 at 05:26:07PM +0100, Suzuki K Poulose wrote:
> Define the page table helpers for walking the stage2 pagetable
> for arm. Since both hyp and stage2 have the same number of levels,
> as that of the host we reuse the host helpers.
>
> The exceptions are the p.d_addr_end routines which have to deal
> with IPA > 32bit, hence we use the open coded version of their host helpers
> which supports 64bit.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> arch/arm/include/asm/kvm_mmu.h | 1 +
> arch/arm/include/asm/stage2_pgtable.h | 59 +++++++++++++++++++++++++++++++++
> 2 files changed, 60 insertions(+)
> create mode 100644 arch/arm/include/asm/stage2_pgtable.h
>
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index c2b2b27..7d207b4 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -47,6 +47,7 @@
> #include <linux/highmem.h>
> #include <asm/cacheflush.h>
> #include <asm/pgalloc.h>
> +#include <asm/stage2_pgtable.h>
>
> int create_hyp_mappings(void *from, void *to);
> int create_hyp_io_mappings(void *from, void *to, phys_addr_t);
> diff --git a/arch/arm/include/asm/stage2_pgtable.h b/arch/arm/include/asm/stage2_pgtable.h
> new file mode 100644
> index 0000000..7633b0a
> --- /dev/null
> +++ b/arch/arm/include/asm/stage2_pgtable.h
> @@ -0,0 +1,59 @@
> +/*
> + * Copyright (C) 2016 - ARM Ltd
> + *
> + * stage2 page table helpers
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef __ARM_S2_PGTABLE_H_
> +#define __ARM_S2_PGTABLE_H_
> +
> +#define stage2_pgd_none(pgd) pgd_none(pgd)
> +#define stage2_pgd_clear(pgd) pgd_clear(pgd)
> +#define stage2_pgd_present(pgd) pgd_present(pgd)
> +#define stage2_pgd_populate(mm, pgd, pud) pgd_populate(mm, pgd, pud)
> +#define stage2_pud_offset(pgd, address) pud_offset(pgd, address)
> +#define stage2_pud_free(mm, pud) pud_free(mm, pud)
> +
> +#define stage2_pud_none(pud) pud_none(pud)
> +#define stage2_pud_clear(pud) pud_clear(pud)
> +#define stage2_pud_present(pud) pud_present(pud)
> +#define stage2_pud_populate(mm, pud, pmd) pud_populate(mm, pud, pmd)
> +#define stage2_pmd_offset(pud, address) pmd_offset(pud, address)
> +#define stage2_pmd_free(mm, pmd) pmd_free(mm, pmd)
> +
> +#define stage2_pud_huge(pud) pud_huge(pud)

could we get rid of the mm parameter to all these stage2_ versions above
and simply implement them with the generic functions passing NULL in the
definitions instead?

> +
> +/* Open coded p*d_addr_end that can deal with 64bit addresses */
> +static inline phys_addr_t stage2_pgd_addr_end(phys_addr_t addr, phys_addr_t end)
> +{
> + phys_addr_t boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
> + return (boundary - 1 < end - 1) ? boundary : end;
> +}
> +
> +#define stage2_pud_addr_end(addr, end) (end)
> +
> +static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
> +{
> + phys_addr_t boundary = (addr + PMD_SIZE) & PMD_MASK;
> + return (boundary - 1 < end - 1) ? boundary : end;
> +}
> +
> +#define stage2_pgd_index(addr) pgd_index(addr)
> +
> +#define stage2_pte_table_empty(ptep) kvm_page_empty(ptep)
> +#define stage2_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
> +#define stage2_pud_table_empty(pudp) 0
> +
> +#endif /* __ARM_S2_PGTABLE_H_ */
> --
> 1.7.9.5
>

Otherwise:
Reviewed-by: Christoffer Dall <[email protected]>
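
For context on why those addr_end helpers are open coded: the generic
p*d_addr_end() macros compute the boundary in an unsigned long, which is
32 bits on arm, while an LPAE stage-2 IPA is a 64-bit phys_addr_t that can
sit above 4GB. A small standalone sketch of the truncation (the 1GB
first-level entry size is assumed here to match LPAE's PGDIR_SIZE):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t pgdir_size = 1ULL << 30;       /* illustrative: LPAE PGDIR_SIZE */
        uint64_t addr = 0x17FFFF000ULL;         /* an IPA above 4GB */
        uint64_t end  = 0x180001000ULL;

        /* 64-bit arithmetic, as in the open-coded stage2_pgd_addr_end() */
        uint64_t b64 = (addr + pgdir_size) & ~(pgdir_size - 1);

        /* the same computation squeezed into 32 bits, as unsigned long is on arm */
        uint32_t b32 = (uint32_t)(addr + pgdir_size) & ~(uint32_t)(pgdir_size - 1);

        printf("64-bit boundary: %#llx\n", (unsigned long long)b64); /* 0x180000000 */
        printf("32-bit boundary: %#x\n", (unsigned)b32);             /* 0x80000000  */
        return 0;
    }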

2016-04-08 12:43:30

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 01/17] arm64: Reuse TCR field definitions for EL1 and EL2

On Mon, Apr 04, 2016 at 05:26:01PM +0100, Suzuki K Poulose wrote:
> TCR_EL1, TCR_EL2 and VTCR_EL2, all share some field positions
> (TG0, ORGN0, IRGN0 and SH0) and their corresponding value definitions.
>
> This patch makes the TCR_EL1 definitions reusable and uses them for TCR_EL2
> and VTCR_EL2 fields.
>
> This also fixes a bug where TG0 in {V}TCR_EL2 was treated as 1bit field.
>
> Cc: Catalin Marinas <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>

Reviewed-by: Christoffer Dall <[email protected]>

2016-04-08 12:43:51

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 04/17] arm64: Introduce pmd_thp_or_huge

On Mon, Apr 04, 2016 at 05:26:04PM +0100, Suzuki K Poulose wrote:
> Add a helper to determine if a given pmd represents a huge page
> either by hugetlb or thp, as we have for arm. This will be used
> by KVM MMU code.
>
> Suggested-by: Mark Rutland <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: Steve Capper <[email protected]>
> Cc: Will Deacon <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>

Acked-by: Christoffer Dall <[email protected]>

> ---
> arch/arm64/include/asm/pgtable.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 989fef1..dda4aa9 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -290,6 +290,8 @@ static inline pgprot_t mk_sect_prot(pgprot_t prot)
> #define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
> #define pmd_mknotpresent(pmd) (__pmd(pmd_val(pmd) & ~PMD_TYPE_MASK))
>
> +#define pmd_thp_or_huge(pmd) (pmd_huge(pmd) || pmd_trans_huge(pmd))
> +
> #define __HAVE_ARCH_PMD_WRITE
> #define pmd_write(pmd) pte_write(pmd_pte(pmd))
>
> --
> 1.7.9.5
>

2016-04-08 12:43:36

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 02/17] arm64: Cleanup VTCR_EL2 and VTTBR field values

On Mon, Apr 04, 2016 at 05:26:02PM +0100, Suzuki K Poulose wrote:
> We share most of the bits for VTCR_EL2 for different page sizes,
> except for the TG0 value and the entry level value. This patch
> makes the definitions a bit more cleaner to reflect this fact.
>
> Also cleans up the VTTBR_X calculation. No functional changes.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> arch/arm64/include/asm/kvm_arm.h | 23 +++++++++++------------
> 1 file changed, 11 insertions(+), 12 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index c460cfe..d5d5fdf 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -144,32 +144,31 @@
> * The magic numbers used for VTTBR_X in this patch can be found in Tables
> * D4-23 and D4-25 in ARM DDI 0487A.b.
> */
> +#define VTCR_EL2_T0SZ_IPA VTCR_EL2_T0SZ_40B
> +#define VTCR_EL2_COMMON_BITS (VTCR_EL2_SH0_INNER | VTCR_EL2_ORGN0_WBWA | \
> + VTCR_EL2_IRGN0_WBWA | VTCR_EL2_SL0_LVL1 | \
> + VTCR_EL2_RES1 | VTCR_EL2_T0SZ_IPA)
> #ifdef CONFIG_ARM64_64K_PAGES
> /*
> * Stage2 translation configuration:
> - * 40bits input (T0SZ = 24)
> * 64kB pages (TG0 = 1)
> * 2 level page tables (SL = 1)
> */
> -#define VTCR_EL2_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SH0_INNER | \
> - VTCR_EL2_ORGN0_WBWA | VTCR_EL2_IRGN0_WBWA | \
> - VTCR_EL2_SL0_LVL1 | VTCR_EL2_T0SZ_40B | \
> - VTCR_EL2_RES1)
> -#define VTTBR_X (38 - VTCR_EL2_T0SZ_40B)
> +#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
> +#define VTTBR_X_TGRAN_MAGIC 38
> #else
> /*
> * Stage2 translation configuration:
> - * 40bits input (T0SZ = 24)
> * 4kB pages (TG0 = 0)
> * 3 level page tables (SL = 1)
> */
> -#define VTCR_EL2_FLAGS (VTCR_EL2_TG0_4K | VTCR_EL2_SH0_INNER | \
> - VTCR_EL2_ORGN0_WBWA | VTCR_EL2_IRGN0_WBWA | \
> - VTCR_EL2_SL0_LVL1 | VTCR_EL2_T0SZ_40B | \
> - VTCR_EL2_RES1)
> -#define VTTBR_X (37 - VTCR_EL2_T0SZ_40B)
> +#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_4K | VTCR_EL2_SL0_LVL1)
> +#define VTTBR_X_TGRAN_MAGIC 37
> #endif

why do we add VTCR_EL2_SL0_LVL1 in both the common bits and TGRAN_FLAGS
define?

Otherwise:

Reviewed-by: Christoffer Dall <[email protected]>

>
> +#define VTCR_EL2_FLAGS (VTCR_EL2_TGRAN_FLAGS | VTCR_EL2_COMMON_BITS)
> +#define VTTBR_X (VTTBR_X_TGRAN_MAGIC - VTCR_EL2_T0SZ_IPA)
> +
> #define VTTBR_BADDR_SHIFT (VTTBR_X - 1)
> #define VTTBR_BADDR_MASK (((UL(1) << (PHYS_MASK_SHIFT - VTTBR_X)) - 1) << VTTBR_BADDR_SHIFT)
> #define VTTBR_VMID_SHIFT (UL(48))
> --
> 1.7.9.5
>

2016-04-08 12:43:58

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 06/17] kvm-arm: Remove kvm_pud_huge()

On Mon, Apr 04, 2016 at 05:26:06PM +0100, Suzuki K Poulose wrote:
> Get rid of kvm_pud_huge() which falls back to pud_huge. Use
> pud_huge instead.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>

Acked-by: Christoffer Dall <[email protected]>

2016-04-08 12:43:50

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 05/17] kvm-arm: Replace kvm_pmd_huge with pmd_thp_or_huge

On Mon, Apr 04, 2016 at 05:26:05PM +0100, Suzuki K Poulose wrote:
> Both arm and arm64 now provides a helper, pmd_thp_or_huge()
> to check if the given pmd represents a huge page. Use that
> instead of our own custom check.
>
> Suggested-by: Mark Rutland <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>

Acked-by: Christoffer Dall <[email protected]>

2016-04-08 12:45:47

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH 02/17] arm64: Cleanup VTCR_EL2 and VTTBR field values

On 08/04/16 13:43, Christoffer Dall wrote:
>> +#define VTCR_EL2_T0SZ_IPA VTCR_EL2_T0SZ_40B
>> +#define VTCR_EL2_COMMON_BITS (VTCR_EL2_SH0_INNER | VTCR_EL2_ORGN0_WBWA | \
>> + VTCR_EL2_IRGN0_WBWA | VTCR_EL2_SL0_LVL1 | \
>> + VTCR_EL2_RES1 | VTCR_EL2_T0SZ_IPA)
>> #ifdef CONFIG_ARM64_64K_PAGES
>> /*
>> * Stage2 translation configuration:
>> - * 40bits input (T0SZ = 24)
>> * 64kB pages (TG0 = 1)
>> * 2 level page tables (SL = 1)
>> */
>> -#define VTCR_EL2_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SH0_INNER | \
>> - VTCR_EL2_ORGN0_WBWA | VTCR_EL2_IRGN0_WBWA | \
>> - VTCR_EL2_SL0_LVL1 | VTCR_EL2_T0SZ_40B | \
>> - VTCR_EL2_RES1)
>> -#define VTTBR_X (38 - VTCR_EL2_T0SZ_40B)
>> +#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
>> +#define VTTBR_X_TGRAN_MAGIC 38
>> #else
>> /*
>> * Stage2 translation configuration:
>> - * 40bits input (T0SZ = 24)
>> * 4kB pages (TG0 = 0)
>> * 3 level page tables (SL = 1)
>> */
>> -#define VTCR_EL2_FLAGS (VTCR_EL2_TG0_4K | VTCR_EL2_SH0_INNER | \
>> - VTCR_EL2_ORGN0_WBWA | VTCR_EL2_IRGN0_WBWA | \
>> - VTCR_EL2_SL0_LVL1 | VTCR_EL2_T0SZ_40B | \
>> - VTCR_EL2_RES1)
>> -#define VTTBR_X (37 - VTCR_EL2_T0SZ_40B)
>> +#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_4K | VTCR_EL2_SL0_LVL1)
>> +#define VTTBR_X_TGRAN_MAGIC 37
>> #endif
>
> why do we add VTCR_EL2_SL0_LVL1 in both the common bits and TGRAN_FLAGS
> define?

Oops! It should only be part of TGRAN_FLAGS. Thanks for spotting, will fix it.


> Otherwise:
>
> Reviewed-by: Christoffer Dall <[email protected]>


Thanks
Suzuki

2016-04-08 13:15:28

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 12/17] kvm-arm: Add explicit hyp page table modifiers

On Mon, Apr 04, 2016 at 05:26:12PM +0100, Suzuki K Poulose wrote:
> We have common routines to modify hyp and stage2 page tables
> based on the 'kvm' parameter. For a smoother transition to
> using separate routines for each, duplicate the routines
> and modify the copy to work on hyp.
>
> Marks the forked routines with _hyp_ and gets rid of the
> kvm parameter which is no longer needed and is NULL for hyp.
> Also, gets rid of the kvm_tlb_flush_vmid_ipa() calls
> from the hyp versions. Uses explicit host page table accessors
> instead of the kvm_* page table helpers.
>
> Suggested-by: Christoffer Dall <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> arch/arm/kvm/mmu.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 118 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index b46a337..2b491e5 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -388,6 +388,119 @@ static void stage2_flush_vm(struct kvm *kvm)
> srcu_read_unlock(&kvm->srcu, idx);
> }
>
> +static void clear_hyp_pgd_entry(pgd_t *pgd)
> +{
> + pud_t *pud_table __maybe_unused = pud_offset(pgd, 0UL);
> + pgd_clear(pgd);
> + pud_free(NULL, pud_table);
> + put_page(virt_to_page(pgd));
> +}
> +
> +static void clear_hyp_pud_entry(pud_t *pud)
> +{
> + pmd_t *pmd_table __maybe_unused = pmd_offset(pud, 0);
> + VM_BUG_ON(pud_huge(*pud));
> + pud_clear(pud);
> + pmd_free(NULL, pmd_table);
> + put_page(virt_to_page(pud));
> +}
> +
> +static void clear_hyp_pmd_entry(pmd_t *pmd)
> +{
> + pte_t *pte_table = pte_offset_kernel(pmd, 0);
> + VM_BUG_ON(pmd_thp_or_huge(*pmd));
> + pmd_clear(pmd);
> + pte_free_kernel(NULL, pte_table);
> + put_page(virt_to_page(pmd));
> +}
> +
> +static void unmap_hyp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)
> +{
> + pte_t *pte, *start_pte;
> +
> + start_pte = pte = pte_offset_kernel(pmd, addr);
> + do {
> + if (!pte_none(*pte)) {
> + pte_t old_pte = *pte;
> +
> + kvm_set_pte(pte, __pte(0));
> +
> + /* XXX: Do we need to invalidate the cache for device mappings ? */

no, we will not be swapping out any pages mapped in Hyp mode so you can
get rid of both of the following two lines.

> + if (!kvm_is_device_pfn(pte_pfn(old_pte)))
> + kvm_flush_dcache_pte(old_pte);
> +
> + put_page(virt_to_page(pte));
> + }
> + } while (pte++, addr += PAGE_SIZE, addr != end);
> +
> + if (hyp_pte_table_empty(start_pte))
> + clear_hyp_pmd_entry(pmd);
> +}
> +
> +static void unmap_hyp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
> +{
> + phys_addr_t next;
> + pmd_t *pmd, *start_pmd;
> +
> + start_pmd = pmd = pmd_offset(pud, addr);
> + do {
> + next = pmd_addr_end(addr, end);
> + if (!pmd_none(*pmd)) {
> + if (pmd_thp_or_huge(*pmd)) {

do we ever actually map anything with section mappings in the Hyp
mappings?

> + pmd_t old_pmd = *pmd;
> +
> + pmd_clear(pmd);
> + kvm_flush_dcache_pmd(old_pmd);
> + put_page(virt_to_page(pmd));
> + } else {
> + unmap_hyp_ptes(pmd, addr, next);
> + }
> + }
> + } while (pmd++, addr = next, addr != end);
> +
> + if (hyp_pmd_table_empty(start_pmd))
> + clear_hyp_pud_entry(pud);
> +}
> +
> +static void unmap_hyp_puds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
> +{
> + phys_addr_t next;
> + pud_t *pud, *start_pud;
> +
> + start_pud = pud = pud_offset(pgd, addr);
> + do {
> + next = pud_addr_end(addr, end);
> + if (!pud_none(*pud)) {
> + if (pud_huge(*pud)) {

do we ever actually map anything with huge pud
mappings for the Hyp space?

> + pud_t old_pud = *pud;
> +
> + pud_clear(pud);
> + kvm_flush_dcache_pud(old_pud);
> + put_page(virt_to_page(pud));
> + } else {
> + unmap_hyp_pmds(pud, addr, next);
> + }
> + }
> + } while (pud++, addr = next, addr != end);
> +
> + if (hyp_pud_table_empty(start_pud))
> + clear_hyp_pgd_entry(pgd);
> +}
> +
> +static void unmap_hyp_range(pgd_t *pgdp, phys_addr_t start, u64 size)
> +{
> + pgd_t *pgd;
> + phys_addr_t addr = start, end = start + size;
> + phys_addr_t next;
> +
> + pgd = pgdp + pgd_index(addr);
> + do {
> + next = pgd_addr_end(addr, end);
> + if (!pgd_none(*pgd))
> + unmap_hyp_puds(pgd, addr, next);
> + } while (pgd++, addr = next, addr != end);

shouldn't we flush the EL2 (hyp) TLB here, strictly speaking?

Or do we rely on all mappings ever created/torn down here to always have
the same VA/PA relationship? Since we didn't flush the EL2 TLB in the
existing code, that indeed does seem to be the case.

That, in turn, raises the question why we don't simply map all pages
that could be referenced by a kmalloc() in Hyp mode during the init
phase and be done with all this hyp mapping/unmapping stuff?

In any case, that behavior doesn't have to change now, but if we don't
add a TLB flush here, I'd like a comment to explain why that's not
needed.

> +}
> +
> /**
> * free_boot_hyp_pgd - free HYP boot page tables
> *
> @@ -398,14 +511,14 @@ void free_boot_hyp_pgd(void)
> mutex_lock(&kvm_hyp_pgd_mutex);
>
> if (boot_hyp_pgd) {
> - unmap_range(NULL, boot_hyp_pgd, hyp_idmap_start, PAGE_SIZE);
> - unmap_range(NULL, boot_hyp_pgd, TRAMPOLINE_VA, PAGE_SIZE);
> + unmap_hyp_range(boot_hyp_pgd, hyp_idmap_start, PAGE_SIZE);
> + unmap_hyp_range(boot_hyp_pgd, TRAMPOLINE_VA, PAGE_SIZE);
> free_pages((unsigned long)boot_hyp_pgd, hyp_pgd_order);
> boot_hyp_pgd = NULL;
> }
>
> if (hyp_pgd)
> - unmap_range(NULL, hyp_pgd, TRAMPOLINE_VA, PAGE_SIZE);
> + unmap_hyp_range(hyp_pgd, TRAMPOLINE_VA, PAGE_SIZE);
>
> mutex_unlock(&kvm_hyp_pgd_mutex);
> }
> @@ -430,9 +543,9 @@ void free_hyp_pgds(void)
>
> if (hyp_pgd) {
> for (addr = PAGE_OFFSET; virt_addr_valid(addr); addr += PGDIR_SIZE)
> - unmap_range(NULL, hyp_pgd, KERN_TO_HYP(addr), PGDIR_SIZE);
> + unmap_hyp_range(hyp_pgd, KERN_TO_HYP(addr), PGDIR_SIZE);
> for (addr = VMALLOC_START; is_vmalloc_addr((void*)addr); addr += PGDIR_SIZE)
> - unmap_range(NULL, hyp_pgd, KERN_TO_HYP(addr), PGDIR_SIZE);
> + unmap_hyp_range(hyp_pgd, KERN_TO_HYP(addr), PGDIR_SIZE);
>
> free_pages((unsigned long)hyp_pgd, hyp_pgd_order);
> hyp_pgd = NULL;
> --
> 1.7.9.5
>

Otherwise looks good!

I'm quite happy to see the hyp/stage2 stuff decoupled.

Thanks,
-Christoffer

2016-04-08 13:15:38

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 08/17] kvm-arm: arm: Introduce hyp page table empty checks

On Mon, Apr 04, 2016 at 05:26:08PM +0100, Suzuki K Poulose wrote:
> Introduce hyp_pxx_table_empty helpers for checking whether
> a given table entry is empty. This will be used explicitly
> once we switch to explicit routines for hyp page table walk.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>

Acked-by: Christoffer Dall <[email protected]>

2016-04-08 13:16:00

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 10/17] kvm-arm: arm64: Introduce hyp page table empty checks

On Mon, Apr 04, 2016 at 05:26:10PM +0100, Suzuki K Poulose wrote:
> Introduce hyp_pxx_table_empty helpers for checking whether
> a given table entry is empty. This will be used explicitly
> once we switch to explicit routines for hyp page table walk.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>

Acked-by: Christoffer Dall <[email protected]>

2016-04-08 13:15:44

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 09/17] kvm-arm: arm64: Introduce stage2 page table helpers

On Mon, Apr 04, 2016 at 05:26:09PM +0100, Suzuki K Poulose wrote:
> Introduce stage2 page table helpers for arm64. With the fake
> page table level still in place, the stage2 table has the same
> number of levels as that of the host (and hyp), so they all
> fallback to the host version.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>

Reviewed-by: Christoffer Dall <[email protected]>

2016-04-08 13:16:19

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 11/17] kvm-arm: Use explicit stage2 helper routines

On Mon, Apr 04, 2016 at 05:26:11PM +0100, Suzuki K Poulose wrote:
> We have stage2 page table helpers for both arm and arm64. Switch to
> the stage2 helpers for routines that only deal with stage2 page table.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---

Acked-by: Christoffer Dall <[email protected]>

2016-04-08 13:42:47

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 13/17] kvm-arm: Add stage2 page table modifiers

On Mon, Apr 04, 2016 at 05:26:13PM +0100, Suzuki K Poulose wrote:
> Now that the hyp page table is handled by a different set of
> routines, rename the original shared routines to stage2 handlers.
> Also make explicit use of the stage2 page table helpers.
>
> unmap_range has been merged to existing unmap_stage2_range.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> arch/arm/kvm/mmu.c | 97 ++++++++++++++++++++++++----------------------------
> 1 file changed, 44 insertions(+), 53 deletions(-)
>
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 2b491e5..0009a24 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -152,26 +152,26 @@ static void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
> return p;
> }
>
> -static void clear_pgd_entry(struct kvm *kvm, pgd_t *pgd, phys_addr_t addr)
> +static void clear_stage2_pgd_entry(struct kvm *kvm, pgd_t *pgd, phys_addr_t addr)
> {
> - pud_t *pud_table __maybe_unused = pud_offset(pgd, 0);
> - pgd_clear(pgd);
> + pud_t *pud_table __maybe_unused = stage2_pud_offset(pgd, 0UL);
> + stage2_pgd_clear(pgd);
> kvm_tlb_flush_vmid_ipa(kvm, addr);
> - pud_free(NULL, pud_table);
> + stage2_pud_free(NULL, pud_table);
> put_page(virt_to_page(pgd));
> }
>
> -static void clear_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
> +static void clear_stage2_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
> {
> - pmd_t *pmd_table = pmd_offset(pud, 0);
> - VM_BUG_ON(pud_huge(*pud));
> - pud_clear(pud);
> + pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(pud, 0);

The __maybe_unused annotations are slightly ugly, so it may be nicer to
implement stage2_pmd_free() and friends as static inlines when they're
defined to do nothing instead.


> + VM_BUG_ON(stage2_pud_huge(*pud));
> + stage2_pud_clear(pud);
> kvm_tlb_flush_vmid_ipa(kvm, addr);
> - pmd_free(NULL, pmd_table);
> + stage2_pmd_free(NULL, pmd_table);
> put_page(virt_to_page(pud));
> }
>
> -static void clear_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr)
> +static void clear_stage2_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr)
> {
> pte_t *pte_table = pte_offset_kernel(pmd, 0);
> VM_BUG_ON(pmd_thp_or_huge(*pmd));
> @@ -201,7 +201,7 @@ static void clear_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr)
> * the corresponding TLBs, we call kvm_flush_dcache_p*() to make sure
> * the IO subsystem will never hit in the cache.
> */
> -static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
> +static void unmap_stage2_ptes(struct kvm *kvm, pmd_t *pmd,
> phys_addr_t addr, phys_addr_t end)
> {
> phys_addr_t start_addr = addr;
> @@ -223,19 +223,19 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
> }
> } while (pte++, addr += PAGE_SIZE, addr != end);
>
> - if (kvm_pte_table_empty(kvm, start_pte))
> - clear_pmd_entry(kvm, pmd, start_addr);
> + if (stage2_pte_table_empty(start_pte))
> + clear_stage2_pmd_entry(kvm, pmd, start_addr);
> }
>
> -static void unmap_pmds(struct kvm *kvm, pud_t *pud,
> +static void unmap_stage2_pmds(struct kvm *kvm, pud_t *pud,
> phys_addr_t addr, phys_addr_t end)
> {
> phys_addr_t next, start_addr = addr;
> pmd_t *pmd, *start_pmd;
>
> - start_pmd = pmd = pmd_offset(pud, addr);
> + start_pmd = pmd = stage2_pmd_offset(pud, addr);
> do {
> - next = kvm_pmd_addr_end(addr, end);
> + next = stage2_pmd_addr_end(addr, end);
> if (!pmd_none(*pmd)) {
> if (pmd_thp_or_huge(*pmd)) {
> pmd_t old_pmd = *pmd;
> @@ -247,57 +247,64 @@ static void unmap_pmds(struct kvm *kvm, pud_t *pud,
>
> put_page(virt_to_page(pmd));
> } else {
> - unmap_ptes(kvm, pmd, addr, next);
> + unmap_stage2_ptes(kvm, pmd, addr, next);
> }
> }
> } while (pmd++, addr = next, addr != end);
>
> - if (kvm_pmd_table_empty(kvm, start_pmd))
> - clear_pud_entry(kvm, pud, start_addr);
> + if (stage2_pmd_table_empty(start_pmd))
> + clear_stage2_pud_entry(kvm, pud, start_addr);
> }
>
> -static void unmap_puds(struct kvm *kvm, pgd_t *pgd,
> +static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd,
> phys_addr_t addr, phys_addr_t end)
> {
> phys_addr_t next, start_addr = addr;
> pud_t *pud, *start_pud;
>
> - start_pud = pud = pud_offset(pgd, addr);
> + start_pud = pud = stage2_pud_offset(pgd, addr);
> do {
> - next = kvm_pud_addr_end(addr, end);
> - if (!pud_none(*pud)) {
> - if (pud_huge(*pud)) {
> + next = stage2_pud_addr_end(addr, end);
> + if (!stage2_pud_none(*pud)) {
> + if (stage2_pud_huge(*pud)) {
> pud_t old_pud = *pud;
>
> - pud_clear(pud);
> + stage2_pud_clear(pud);
> kvm_tlb_flush_vmid_ipa(kvm, addr);
> -
> kvm_flush_dcache_pud(old_pud);
> -
> put_page(virt_to_page(pud));
> } else {
> - unmap_pmds(kvm, pud, addr, next);
> + unmap_stage2_pmds(kvm, pud, addr, next);
> }
> }
> } while (pud++, addr = next, addr != end);
>
> - if (kvm_pud_table_empty(kvm, start_pud))
> - clear_pgd_entry(kvm, pgd, start_addr);
> + if (stage2_pud_table_empty(start_pud))
> + clear_stage2_pgd_entry(kvm, pgd, start_addr);
> }
>
> -
> -static void unmap_range(struct kvm *kvm, pgd_t *pgdp,
> - phys_addr_t start, u64 size)
> +/**
> + * unmap_stage2_range -- Clear stage2 page table entries to unmap a range
> + * @kvm: The VM pointer
> + * @start: The intermediate physical base address of the range to unmap
> + * @size: The size of the area to unmap
> + *
> + * Clear a range of stage-2 mappings, lowering the various ref-counts. Must
> + * be called while holding mmu_lock (unless for freeing the stage2 pgd before
> + * destroying the VM), otherwise another faulting VCPU may come in and mess
> + * with things behind our backs.
> + */
> +static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
> {
> pgd_t *pgd;
> phys_addr_t addr = start, end = start + size;
> phys_addr_t next;
>
> - pgd = pgdp + kvm_pgd_index(addr);
> + pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> do {
> - next = kvm_pgd_addr_end(addr, end);
> - if (!pgd_none(*pgd))
> - unmap_puds(kvm, pgd, addr, next);
> + next = stage2_pgd_addr_end(addr, end);
> + if (!stage2_pgd_none(*pgd))
> + unmap_stage2_puds(kvm, pgd, addr, next);
> } while (pgd++, addr = next, addr != end);
> }
>
> @@ -811,22 +818,6 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
> return 0;
> }
>
> -/**
> - * unmap_stage2_range -- Clear stage2 page table entries to unmap a range
> - * @kvm: The VM pointer
> - * @start: The intermediate physical base address of the range to unmap
> - * @size: The size of the area to unmap
> - *
> - * Clear a range of stage-2 mappings, lowering the various ref-counts. Must
> - * be called while holding mmu_lock (unless for freeing the stage2 pgd before
> - * destroying the VM), otherwise another faulting VCPU may come in and mess
> - * with things behind our backs.
> - */
> -static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
> -{
> - unmap_range(kvm, kvm->arch.pgd, start, size);
> -}
> -
> static void stage2_unmap_memslot(struct kvm *kvm,
> struct kvm_memory_slot *memslot)
> {
> --
> 1.7.9.5
>

Acked-by: Christoffer Dall <[email protected]>
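
The __maybe_unused annotations come from the no-op macro variants: when
stage2_pmd_free() expands to do { } while (0), its pmd_table argument is
never referenced and the compiler warns. An empty static inline, as
suggested above, consumes the argument and makes the annotation
unnecessary; a sketch:

    /* a folded level's free can be an empty static inline instead of a macro */
    static inline void stage2_pmd_free(struct mm_struct *mm, pmd_t *pmd)
    {
    }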

2016-04-08 14:39:59

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH 07/17] kvm-arm: arm32: Introduce stage2 page table helpers

On 08/04/16 13:43, Christoffer Dall wrote:
> On Mon, Apr 04, 2016 at 05:26:07PM +0100, Suzuki K Poulose wrote:
>> Define the page table helpers for walking the stage2 pagetable
>> for arm. Since both hyp and stage2 have the same number of levels,
>> as that of the host we reuse the host helpers.
>>
>> The exceptions are the p.d_addr_end routines which have to deal
>> with IPA > 32bit, hence we use the open coded version of their host helpers
>> which supports 64bit.
>>


>> +#ifndef __ARM_S2_PGTABLE_H_
>> +#define __ARM_S2_PGTABLE_H_
>> +
>> +#define stage2_pgd_none(pgd) pgd_none(pgd)
>> +#define stage2_pgd_clear(pgd) pgd_clear(pgd)
>> +#define stage2_pgd_present(pgd) pgd_present(pgd)
>> +#define stage2_pgd_populate(mm, pgd, pud) pgd_populate(mm, pgd, pud)
>> +#define stage2_pud_offset(pgd, address) pud_offset(pgd, address)
>> +#define stage2_pud_free(mm, pud) pud_free(mm, pud)
>> +
>> +#define stage2_pud_none(pud) pud_none(pud)
>> +#define stage2_pud_clear(pud) pud_clear(pud)
>> +#define stage2_pud_present(pud) pud_present(pud)
>> +#define stage2_pud_populate(mm, pud, pmd) pud_populate(mm, pud, pmd)
>> +#define stage2_pmd_offset(pud, address) pmd_offset(pud, address)
>> +#define stage2_pmd_free(mm, pmd) pmd_free(mm, pmd)
>> +
>> +#define stage2_pud_huge(pud) pud_huge(pud)
>
> could we get rid of the mm parameter to all these stage2_ versions above
> and simply implement them with the generic functions passing NULL in the
> definitions instead?

We could, I retained it just to match the corresponding host version. Will change
it in the next version.

>
> Otherwise:
> Reviewed-by: Christoffer Dall <[email protected]>


Thanks
Suzuki
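
The agreed change is mechanical; a sketch of what the next version's
definitions would look like with the mm parameter dropped and NULL folded
into the wrappers:

    /* instead of forwarding an always-NULL mm argument ... */
    #define stage2_pgd_populate(mm, pgd, pud)   pgd_populate(mm, pgd, pud)
    #define stage2_pud_free(mm, pud)            pud_free(mm, pud)

    /* ... the wrappers can hide it entirely */
    #define stage2_pgd_populate(pgd, pud)       pgd_populate(NULL, pgd, pud)
    #define stage2_pud_free(pud)                pud_free(NULL, pud)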

2016-04-08 15:05:15

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 15/17] kvm: arm64: Get rid of fake page table levels

On Mon, Apr 04, 2016 at 05:26:15PM +0100, Suzuki K Poulose wrote:
> On arm64, the hardware mandates concatenation of upto 16 tables,
> at entry level for stage2 translations. This could lead to reduced
> number of translation levels than the normal (stage1 table).
> Also, since the IPA(40bit) is smaller than some of the supported
> VA_BITS (e.g, 48bit), there could be different number of levels
> in stage-1 vs stage-2 tables. To work around this, so far we have been
> using a fake software page table level, not known to the hardware.
> But with 16K translations, there could be upto 2 fake software
> levels, which complicates the code. Hence, we want to get rid of the hack.
>
> Now that we have explicit accessors for hyp vs stage2 page tables,
> define the stage2 walker helpers accordingly based on the actual
> table used by the hardware.
>
> Once we know the number of translation levels used by the hardware,
> it is merely a job of defining the helpers based on whether a
> particular level is folded or not, looking at the number of levels.
>
> Some facts before we calculate the translation levels:
>
> 1) Smallest page size supported by arm64 is 4K.
> 2) The minimum number of bits resolved at any page table level
> is (PAGE_SHIFT - 3) at intermediate levels.
> Both of them implies, minimum number of bits required for a level
> change is 9.
>
> Since we can concatenate upto 16 tables at stage2 entry, the total
> number of page table levels used by the hardware for resolving N bits
> is same as that for (N - 4) bits (with concatenation), as there cannot
> be a level in between (N, N-4) as per the above rules.
>
> Hence, we have
>
> STAGE2_PGTABLE_LEVELS = PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4)
>
> With the current IPA limit (40bit), for all supported translations
> and VA_BITS, we have the following condition (even for 36bit VA with
> 16K page size):
>
> CONFIG_PGTABLE_LEVELS >= STAGE2_PGTABLE_LEVELS.
>
> So, for e.g, if PUD is present in stage2, it is present in the hyp(host).
> Hence, we fall back to the host definition if we find that a level is not
> folded. Otherwise we redefine it accordingly. A build time check is added
> to make sure the above condition holds.
>
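Plugging numbers into the relation quoted above (each table level resolves
(PAGE_SHIFT - 3) bits, and with up to 16 concatenated entry-level tables
the stage-2 walk only has to cover KVM_PHYS_SHIFT - 4 = 36 bits; a 48-bit
VA host is taken as the example):

    4K  pages (PAGE_SHIFT = 12, 9 bits per level):
        STAGE2_PGTABLE_LEVELS = ceil((36 - 12) / 9)  = 3
        host, 48-bit VA       = ceil((48 - 12) / 9)  = 4
    16K pages (PAGE_SHIFT = 14, 11 bits per level):
        STAGE2_PGTABLE_LEVELS = ceil((36 - 14) / 11) = 2
        host, 48-bit VA       = ceil((48 - 14) / 11) = 4
    64K pages (PAGE_SHIFT = 16, 13 bits per level):
        STAGE2_PGTABLE_LEVELS = ceil((36 - 16) / 13) = 2
        host, 48-bit VA       = ceil((48 - 16) / 13) = 3

    so CONFIG_PGTABLE_LEVELS >= STAGE2_PGTABLE_LEVELS in each case, and any
    level present at stage-2 is also present at the host.
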
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> arch/arm64/include/asm/kvm_mmu.h | 64 +-------------
> arch/arm64/include/asm/stage2_pgtable-nopmd.h | 42 +++++++++
> arch/arm64/include/asm/stage2_pgtable-nopud.h | 39 +++++++++
> arch/arm64/include/asm/stage2_pgtable.h | 113 +++++++++++++++++--------
> 4 files changed, 163 insertions(+), 95 deletions(-)
> create mode 100644 arch/arm64/include/asm/stage2_pgtable-nopmd.h
> create mode 100644 arch/arm64/include/asm/stage2_pgtable-nopud.h
>
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index a3c0d05..e3fee0a 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -45,18 +45,6 @@
> */
> #define TRAMPOLINE_VA (HYP_PAGE_OFFSET_MASK & PAGE_MASK)
>
> -/*
> - * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation
> - * levels in addition to the PGD and potentially the PUD which are
> - * pre-allocated (we pre-allocate the fake PGD and the PUD when the Stage-2
> - * tables use one level of tables less than the kernel.
> - */
> -#ifdef CONFIG_ARM64_64K_PAGES
> -#define KVM_MMU_CACHE_MIN_PAGES 1
> -#else
> -#define KVM_MMU_CACHE_MIN_PAGES 2
> -#endif
> -
> #ifdef __ASSEMBLY__
>
> #include <asm/alternative.h>
> @@ -155,69 +143,21 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
>
> static inline void *kvm_get_hwpgd(struct kvm *kvm)
> {
> - pgd_t *pgd = kvm->arch.pgd;
> - pud_t *pud;
> -
> - if (KVM_PREALLOC_LEVEL == 0)
> - return pgd;
> -
> - pud = pud_offset(pgd, 0);
> - if (KVM_PREALLOC_LEVEL == 1)
> - return pud;
> -
> - BUG_ON(KVM_PREALLOC_LEVEL != 2);
> - return pmd_offset(pud, 0);
> + return kvm->arch.pgd;
> }
>
> static inline unsigned int kvm_get_hwpgd_size(void)
> {
> - if (KVM_PREALLOC_LEVEL > 0)
> - return PTRS_PER_S2_PGD * PAGE_SIZE;
> return PTRS_PER_S2_PGD * sizeof(pgd_t);
> }
>
> -/*
> - * Allocate fake pgd for the host kernel page table macros to work.
> - * This is not used by the hardware and we have no alignment
> - * requirement for this allocation.
> - */
> static inline pgd_t *kvm_setup_fake_pgd(pgd_t *hwpgd)
> {
> - int i;
> - pgd_t *pgd;
> -
> - if (!KVM_PREALLOC_LEVEL)
> - return hwpgd;
> -
> - /*
> - * When KVM_PREALLOC_LEVEL==2, we allocate a single page for
> - * the PMD and the kernel will use folded pud.
> - * When KVM_PREALLOC_LEVEL==1, we allocate 2 consecutive PUD
> - * pages.
> - */
> -
> - pgd = kmalloc(PTRS_PER_S2_PGD * sizeof(pgd_t),
> - GFP_KERNEL | __GFP_ZERO);
> - if (!pgd)
> - return ERR_PTR(-ENOMEM);
> -
> - /* Plug the HW PGD into the fake one. */
> - for (i = 0; i < PTRS_PER_S2_PGD; i++) {
> - if (KVM_PREALLOC_LEVEL == 1)
> - pgd_populate(NULL, pgd + i,
> - (pud_t *)hwpgd + i * PTRS_PER_PUD);
> - else if (KVM_PREALLOC_LEVEL == 2)
> - pud_populate(NULL, pud_offset(pgd, 0) + i,
> - (pmd_t *)hwpgd + i * PTRS_PER_PMD);
> - }
> -
> - return pgd;
> + return hwpgd;
> }
>
> static inline void kvm_free_fake_pgd(pgd_t *pgd)
> {
> - if (KVM_PREALLOC_LEVEL > 0)
> - kfree(pgd);
> }
> static inline bool kvm_page_empty(void *ptr)
> {
> diff --git a/arch/arm64/include/asm/stage2_pgtable-nopmd.h b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
> new file mode 100644
> index 0000000..40dda8c
> --- /dev/null
> +++ b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
> @@ -0,0 +1,42 @@
> +/*
> + * Copyright (C) 2016 - ARM Ltd
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef __ARM64_S2_PGTABLE_NOPMD_H_
> +#define __ARM64_S2_PGTABLE_NOPMD_H_
> +
> +#include <asm/stage2_pgtable-nopud.h>
> +
> +#define __S2_PGTABLE_PMD_FOLDED
> +
> +#define S2_PMD_SHIFT S2_PUD_SHIFT
> +#define S2_PTRS_PER_PMD 1
> +#define S2_PMD_SIZE (1UL << S2_PMD_SHIFT)
> +#define S2_PMD_MASK (~(S2_PMD_SIZE-1))
> +
> +#define stage2_pud_none(pud) (0)
> +#define stage2_pud_present(pud) (1)
> +#define stage2_pud_clear(pud) do { } while (0)
> +#define stage2_pud_populate(mm, pud, pmd) do { } while (0)
> +#define stage2_pmd_offset(pud, address) ((pmd_t *)(pud))
> +
> +#define stage2_pmd_free(mm, pmd) do { } while (0)
> +
> +#define stage2_pmd_addr_end(addr, end) (end)
> +
> +#define stage2_pud_huge(pud) (0)
> +#define stage2_pmd_table_empty(pmdp) (0)
> +
> +#endif
> diff --git a/arch/arm64/include/asm/stage2_pgtable-nopud.h b/arch/arm64/include/asm/stage2_pgtable-nopud.h
> new file mode 100644
> index 0000000..b0eff9d
> --- /dev/null
> +++ b/arch/arm64/include/asm/stage2_pgtable-nopud.h
> @@ -0,0 +1,39 @@
> +/*
> + * Copyright (C) 2016 - ARM Ltd
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef __ARM64_S2_PGTABLE_NOPUD_H_
> +#define __ARM64_S2_PGTABLE_NOPUD_H_
> +
> +#define __S2_PGTABLE_PUD_FOLDED
> +
> +#define S2_PUD_SHIFT S2_PGDIR_SHIFT
> +#define S2_PTRS_PER_PUD 1
> +#define S2_PUD_SIZE (_AC(1, UL) << S2_PUD_SHIFT)
> +#define S2_PUD_MASK (~(S2_PUD_SIZE-1))
> +
> +#define stage2_pgd_none(pgd) (0)
> +#define stage2_pgd_present(pgd) (1)
> +#define stage2_pgd_clear(pgd) do { } while (0)
> +#define stage2_pgd_populate(mm, pgd, pud) do { } while (0)
> +
> +#define stage2_pud_offset(pgd, address) ((pud_t *)(pgd))
> +
> +#define stage2_pud_free(mm, x) do { } while (0)
> +
> +#define stage2_pud_addr_end(addr, end) (end)
> +#define stage2_pud_table_empty(pmdp) (0)
> +
> +#endif
> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
> index 751227d..139b4db 100644
> --- a/arch/arm64/include/asm/stage2_pgtable.h
> +++ b/arch/arm64/include/asm/stage2_pgtable.h
> @@ -22,32 +22,55 @@
> #include <asm/pgtable.h>
>
> /*
> - * In the case where PGDIR_SHIFT is larger than KVM_PHYS_SHIFT, we can address
> - * the entire IPA input range with a single pgd entry, and we would only need
> - * one pgd entry. Note that in this case, the pgd is actually not used by
> - * the MMU for Stage-2 translations, but is merely a fake pgd used as a data
> - * structure for the kernel pgtable macros to work.
> + * The hardware mandates concatenation of upto 16 tables at stage2 entry level.

s/upto/up to/

> + * Now, the minimum number of bits resolved at any level is (PAGE_SHIFT - 3),
> + * or in other words log2(PTRS_PER_PTE). On arm64, the smallest PAGE_SIZE

not sure the log2 comment helps here.

> + * supported is 4k, which means (PAGE_SHIFT - 3) > 4 holds for all page sizes.
> + * This implies, the total number of page table levels at stage2 expected
> + * by the hardware is actually the number of levels required for (KVM_PHYS_SHIFT - 4)
> + * in normal translations(e.g, stage-1), since we cannot have another level in
> + * the range (KVM_PHYS_SHIFT, KVM_PHYS_SHIFT - 4).

Is it not a design decision to always choose the maximum number of
concatenated initial-level stage2 tables (with the constraint that
there's a minimum number required)?

I agree with the design decision, if my math is correct that on 64K
systems you end up requiring a physically contiguous, 1MB-aligned 1MB
allocation for each VM? This seems reasonable enough if you configure
your kernel with 64K pages and expect to run VMs on top of that.

> */
> -#if PGDIR_SHIFT > KVM_PHYS_SHIFT
> -#define PTRS_PER_S2_PGD_SHIFT 0
> -#else
> -#define PTRS_PER_S2_PGD_SHIFT (KVM_PHYS_SHIFT - PGDIR_SHIFT)
> -#endif
> -#define PTRS_PER_S2_PGD (1 << PTRS_PER_S2_PGD_SHIFT)
> +#define STAGE2_PGTABLE_LEVELS ARM64_HW_PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4)
>
> /*
> - * If we are concatenating first level stage-2 page tables, we would have less
> - * than or equal to 16 pointers in the fake PGD, because that's what the
> - * architecture allows. In this case, (4 - CONFIG_PGTABLE_LEVELS)
> - * represents the first level for the host, and we add 1 to go to the next
> - * level (which uses contatenation) for the stage-2 tables.
> + * At the moment, we do not support a combination of guest IPA and host VA_BITS
> + * where
> + * STAGE2_PGTABLE_LEVELS > CONFIG_PGTABLE_LEVELS

can you change this comment to reverse the statement to avoid someone
seeing this as a constraint, when in fact it's a negative invariant?

So the case we don't support is a sufficiently larger IPA space compared
to the host VA space such that the above happens? (Since at the same
IPA space size as host VA space size, the stage-2 levels will always be
less than or equal to the host levels.)

I don't see how that would ever work with userspace either so I think
this is a safe assumption and not something that ever needs fixing. In
which case this should be reworded to just state the assumptions and why
this is a good assumption.

(If my assumptions are wrong here, then there are also weird cases where
the host does huge pages at the PMD level and we don't. Not sure I can
see the full ramifications of that.)

> + *
> + * We base our stage-2 page table walker helpers based on this assumption and
> + * fallback to using the host version of the helper wherever possible.
> + * i.e, if a particular level is not folded (e.g, PUD) at stage2, we fall back
> + * to using the host version, since it is guaranteed it is not folded at host.

I don't really understand why it's desirable to fall back to the host
version of the helpers; in fact I would probably prefer to just have it
disjoint, but maybe I'll see the reason when going through the patch
more. But I doubt the value of this particular piece of commentary...

> + *
> + * TODO: We could lift this limitation easily by rearranging the host level
> + * definitions to a more reusable version.
> */

So is this really a TODO: based on the above?

> -#if PTRS_PER_S2_PGD <= 16
> -#define KVM_PREALLOC_LEVEL (4 - CONFIG_PGTABLE_LEVELS + 1)
> -#else
> -#define KVM_PREALLOC_LEVEL (0)
> +#if STAGE2_PGTABLE_LEVELS > CONFIG_PGTABLE_LEVELS
> +#error "Unsupported combination of guest IPA and host VA_BITS."
> #endif
>
> +

Can we add a comment as to what this defines exactly? Something like:
/*
* PGDIR_SHIFT determines the size a top-level stage2 page table entry can map
*/
> +#define S2_PGDIR_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - STAGE2_PGTABLE_LEVELS)
> +#define S2_PGDIR_SIZE (_AC(1, UL) << S2_PGDIR_SHIFT)
> +#define S2_PGDIR_MASK (~(S2_PGDIR_SIZE - 1))
> +
> +/* We can have concatenated tables at stage2 entry. */

I'm not sure if the comment is helpful. How about:

/*
* The number of PTRS across all concatenated stage2 tables given by the
* number of bits resolved at the initial level.
*/

> +#define PTRS_PER_S2_PGD (1 << (KVM_PHYS_SHIFT - S2_PGDIR_SHIFT))
> +
> +/*
> + * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation
> + * levels in addition to the PGD.
> + */
> +#define KVM_MMU_CACHE_MIN_PAGES (STAGE2_PGTABLE_LEVELS - 1)
> +
> +
> +#if STAGE2_PGTABLE_LEVELS > 3
> +
> +#define S2_PUD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(1)
> +#define S2_PUD_SIZE (_AC(1, UL) << S2_PUD_SHIFT)
> +#define S2_PUD_MASK (~(S2_PUD_SIZE - 1))
> +
> #define stage2_pgd_none(pgd) pgd_none(pgd)
> #define stage2_pgd_clear(pgd) pgd_clear(pgd)
> #define stage2_pgd_present(pgd) pgd_present(pgd)
> @@ -55,6 +78,23 @@
> #define stage2_pud_offset(pgd, address) pud_offset(pgd, address)
> #define stage2_pud_free(mm, pud) pud_free(mm, pud)
>
> +#define stage2_pud_table_empty(pudp) kvm_page_empty(pudp)
> +
> +static inline phys_addr_t stage2_pud_addr_end(phys_addr_t addr, phys_addr_t end)
> +{
> + phys_addr_t boundary = (addr + S2_PUD_SIZE) & S2_PUD_MASK;
> + return (boundary - 1 < end - 1) ? boundary : end;
> +}
> +
> +#endif /* STAGE2_PGTABLE_LEVELS > 3 */
> +
> +
> +#if STAGE2_PGTABLE_LEVELS > 2
> +
> +#define S2_PMD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(2)
> +#define S2_PMD_SIZE (_AC(1, UL) << S2_PMD_SHIFT)
> +#define S2_PMD_MASK (~(S2_PMD_SIZE - 1))
> +
> #define stage2_pud_none(pud) pud_none(pud)
> #define stage2_pud_clear(pud) pud_clear(pud)
> #define stage2_pud_present(pud) pud_present(pud)
> @@ -63,24 +103,31 @@
> #define stage2_pmd_free(mm, pmd) pmd_free(mm, pmd)
>
> #define stage2_pud_huge(pud) pud_huge(pud)
> +#define stage2_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
> +
> +static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
> +{
> + phys_addr_t boundary = (addr + S2_PMD_SIZE) & S2_PMD_MASK;
> + return (boundary - 1 < end - 1) ? boundary : end;
> +}
>
> -#define stage2_pgd_addr_end(address, end) pgd_addr_end(address, end)
> -#define stage2_pud_addr_end(address, end) pud_addr_end(address, end)
> -#define stage2_pmd_addr_end(address, end) pmd_addr_end(address, end)
> +#endif /* STAGE2_PGTABLE_LEVELS > 2 */
>
> #define stage2_pte_table_empty(ptep) kvm_page_empty(ptep)
> -#ifdef __PGTABLE_PMD_FOLDED
> -#define stage2_pmd_table_empty(pmdp) (0)
> -#else
> -#define stage2_pmd_table_empty(pmdp) ((KVM_PREALLOC_LEVEL < 2) && kvm_page_empty(pmdp))
> -#endif
>
> -#ifdef __PGTABLE_PUD_FOLDED
> -#define stage2_pud_table_empty(pudp) (0)
> -#else
> -#define stage2_pud_table_empty(pudp) ((KVM_PREALLOC_LEVEL < 1) && kvm_page_empty(pudp))
> +#if STAGE2_PGTABLE_LEVELS == 2
> +#include <asm/stage2_pgtable-nopmd.h>
> +#elif STAGE2_PGTABLE_LEVELS == 3
> +#include <asm/stage2_pgtable-nopud.h>
> #endif
>
> -#define stage2_pgd_index(addr) (((addr) >> PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
> +
> +#define stage2_pgd_index(addr) (((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
> +
> +static inline phys_addr_t stage2_pgd_addr_end(phys_addr_t addr, phys_addr_t end)
> +{
> + phys_addr_t boundary = (addr + S2_PGDIR_SIZE) & S2_PGDIR_MASK;
> + return (boundary - 1 < end - 1) ? boundary : end;
> +}
>
> #endif /* __ARM64_S2_PGTABLE_H_ */
> --
> 1.7.9.5
>

So, I only commented on the comments here (to prove that I actually read
the patch ;) ), but the code looks really good!

Thanks,
-Christoffer

2016-04-08 15:05:22

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 14/17] kvm-arm: Cleanup kvm_* wrappers

On Mon, Apr 04, 2016 at 05:26:14PM +0100, Suzuki K Poulose wrote:
> Now that we have switched to explicit page table routines,
> get rid of the obsolete kvm_* wrappers.
>
> Also, kvm_tlb_flush_vmid_by_ipa is now called only on stage2
> page tables, hence get rid of the redundant check.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>

Acked-by: Christoffer Dall <[email protected]>

2016-04-08 15:08:50

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 16/17] kvm-arm: Cleanup stage2 pgd handling

On Mon, Apr 04, 2016 at 05:26:16PM +0100, Suzuki K Poulose wrote:
> Now that we don't have any fake page table levels for arm64,
> cleanup the common code to get rid of the dead code.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> arch/arm/include/asm/kvm_mmu.h | 19 -------------------
> arch/arm/kvm/arm.c | 2 +-
> arch/arm/kvm/mmu.c | 25 ++++++++-----------------
> arch/arm64/include/asm/kvm_mmu.h | 18 ------------------
> 4 files changed, 9 insertions(+), 55 deletions(-)
>
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 50aa901..f8b7920 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -146,25 +146,6 @@ static inline bool kvm_page_empty(void *ptr)
> #define hyp_pmd_table_empty(pmdp) kvm_page_empty(pmdp)
> #define hyp_pud_table_empty(pudp) (0)
>
> -static inline void *kvm_get_hwpgd(struct kvm *kvm)
> -{
> - return kvm->arch.pgd;
> -}
> -
> -static inline unsigned int kvm_get_hwpgd_size(void)
> -{
> - return PTRS_PER_S2_PGD * sizeof(pgd_t);
> -}
> -
> -static inline pgd_t *kvm_setup_fake_pgd(pgd_t *hwpgd)
> -{
> - return hwpgd;
> -}
> -
> -static inline void kvm_free_fake_pgd(pgd_t *pgd)
> -{
> -}
> -
> struct kvm;
>
> #define kvm_flush_dcache_to_poc(a,l) __cpuc_flush_dcache_area((a), (l))
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index 6accd66..dfd4987 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -448,7 +448,7 @@ static void update_vttbr(struct kvm *kvm)
> kvm_next_vmid &= (1 << kvm_vmid_bits) - 1;
>
> /* update vttbr to be used with the new vmid */
> - pgd_phys = virt_to_phys(kvm_get_hwpgd(kvm));
> + pgd_phys = virt_to_phys(kvm->arch.pgd);
> BUG_ON(pgd_phys & ~VTTBR_BADDR_MASK);
> vmid = ((u64)(kvm->arch.vmid) << VTTBR_VMID_SHIFT) & VTTBR_VMID_MASK(kvm_vmid_bits);
> kvm->arch.vttbr = pgd_phys | vmid;
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index dda60fc..1cb7e60 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -755,6 +755,11 @@ int create_hyp_io_mappings(void *from, void *to, phys_addr_t phys_addr)
> __phys_to_pfn(phys_addr), PAGE_HYP_DEVICE);
> }
>
> +static inline unsigned int kvm_get_hwpgd_size(void)
> +{
> + return PTRS_PER_S2_PGD * sizeof(pgd_t);
> +}
> +
> /* Free the HW pgd, one page at a time */
> static void kvm_free_hwpgd(void *hwpgd)
> {
> @@ -783,29 +788,16 @@ static void *kvm_alloc_hwpgd(void)
> int kvm_alloc_stage2_pgd(struct kvm *kvm)
> {
> pgd_t *pgd;
> - void *hwpgd;
>
> if (kvm->arch.pgd != NULL) {
> kvm_err("kvm_arch already initialized?\n");
> return -EINVAL;
> }
>
> - hwpgd = kvm_alloc_hwpgd();
> - if (!hwpgd)
> + pgd = kvm_alloc_hwpgd();

can you just inline kvm_alloc_hwpgd and kvm_get_hwpgd_size now?
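
i.e. something along these lines; just a rough sketch on my side, assuming
kvm_alloc_hwpgd() is still only the alloc_pages_exact() wrapper, not a
tested patch:

	int kvm_alloc_stage2_pgd(struct kvm *kvm)
	{
		pgd_t *pgd;

		if (kvm->arch.pgd != NULL) {
			kvm_err("kvm_arch already initialized?\n");
			return -EINVAL;
		}

		/* One physically contiguous, zeroed allocation for the stage2 PGD */
		pgd = alloc_pages_exact(PTRS_PER_S2_PGD * sizeof(pgd_t),
					GFP_KERNEL | __GFP_ZERO);
		if (!pgd)
			return -ENOMEM;

		kvm_clean_pgd(pgd);
		kvm->arch.pgd = pgd;
		return 0;
	}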

> + if (!pgd)
> return -ENOMEM;
>
> - /*
> - * When the kernel uses more levels of page tables than the
> - * guest, we allocate a fake PGD and pre-populate it to point
> - * to the next-level page table, which will be the real
> - * initial page table pointed to by the VTTBR.
> - */
> - pgd = kvm_setup_fake_pgd(hwpgd);
> - if (IS_ERR(pgd)) {
> - kvm_free_hwpgd(hwpgd);
> - return PTR_ERR(pgd);
> - }
> -
> kvm_clean_pgd(pgd);
> kvm->arch.pgd = pgd;
> return 0;
> @@ -893,8 +885,7 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
> return;
>
> unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
> - kvm_free_hwpgd(kvm_get_hwpgd(kvm));
> - kvm_free_fake_pgd(kvm->arch.pgd);
> + kvm_free_hwpgd(kvm->arch.pgd);
> kvm->arch.pgd = NULL;
> }
>
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index e3fee0a..249c4fc 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -141,24 +141,6 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
> return (pmd_val(*pmd) & PMD_S2_RDWR) == PMD_S2_RDONLY;
> }
>
> -static inline void *kvm_get_hwpgd(struct kvm *kvm)
> -{
> - return kvm->arch.pgd;
> -}
> -
> -static inline unsigned int kvm_get_hwpgd_size(void)
> -{
> - return PTRS_PER_S2_PGD * sizeof(pgd_t);
> -}
> -
> -static inline pgd_t *kvm_setup_fake_pgd(pgd_t *hwpgd)
> -{
> - return hwpgd;
> -}
> -
> -static inline void kvm_free_fake_pgd(pgd_t *pgd)
> -{
> -}
> static inline bool kvm_page_empty(void *ptr)
> {
> struct page *ptr_page = virt_to_page(ptr);
> --
> 1.7.9.5
>

otherwise:

Acked-by: Christoffer Dall <[email protected]>

2016-04-08 15:09:21

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH 12/17] kvm-arm: Add explicit hyp page table modifiers

On 08/04/16 14:15, Christoffer Dall wrote:
> On Mon, Apr 04, 2016 at 05:26:12PM +0100, Suzuki K Poulose wrote:
>> We have common routines to modify hyp and stage2 page tables
>> based on the 'kvm' parameter. For a smoother transition to
>> using separate routines for each, duplicate the routines
>> and modify the copy to work on hyp.
>>
>> Marks the forked routines with _hyp_ and gets rid of the
>> kvm parameter which is no longer needed and is NULL for hyp.
>> Also, gets rid of calls to kvm_tlb_flush_by_vmid_ipa() calls
>> from the hyp versions. Uses explicit host page table accessors
>> instead of the kvm_* page table helpers.
>>
>> Suggested-by: Christoffer Dall <[email protected]>
>> Cc: Marc Zyngier <[email protected]>
>> Signed-off-by: Suzuki K Poulose <[email protected]>
>> ---
>> arch/arm/kvm/mmu.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 118 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index b46a337..2b491e5 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -388,6 +388,119 @@ static void stage2_flush_vm(struct kvm *kvm)
>> srcu_read_unlock(&kvm->srcu, idx);
>> }
>>
>> +static void clear_hyp_pgd_entry(pgd_t *pgd)
>> +{
>> + pud_t *pud_table __maybe_unused = pud_offset(pgd, 0UL);
>> + pgd_clear(pgd);
>> + pud_free(NULL, pud_table);
>> + put_page(virt_to_page(pgd));
>> +}
>> +
>> +static void clear_hyp_pud_entry(pud_t *pud)
>> +{
>> + pmd_t *pmd_table __maybe_unused = pmd_offset(pud, 0);
>> + VM_BUG_ON(pud_huge(*pud));
>> + pud_clear(pud);
>> + pmd_free(NULL, pmd_table);
>> + put_page(virt_to_page(pud));
>> +}
>> +
>> +static void clear_hyp_pmd_entry(pmd_t *pmd)
>> +{
>> + pte_t *pte_table = pte_offset_kernel(pmd, 0);
>> + VM_BUG_ON(pmd_thp_or_huge(*pmd));
>> + pmd_clear(pmd);
>> + pte_free_kernel(NULL, pte_table);
>> + put_page(virt_to_page(pmd));
>> +}
>> +
>> +static void unmap_hyp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)
>> +{
>> + pte_t *pte, *start_pte;
>> +
>> + start_pte = pte = pte_offset_kernel(pmd, addr);
>> + do {
>> + if (!pte_none(*pte)) {
>> + pte_t old_pte = *pte;
>> +
>> + kvm_set_pte(pte, __pte(0));
>> +
>> + /* XXX: Do we need to invalidate the cache for device mappings ? */
>
> no, we will not be swapping out any pages mapped in Hyp mode so you can
> get rid of both of the following two lines.
>
>> + if (!kvm_is_device_pfn(pte_pfn(old_pte)))
>> + kvm_flush_dcache_pte(old_pte);
>> +
>> + put_page(virt_to_page(pte));
>> + }
>> + } while (pte++, addr += PAGE_SIZE, addr != end);
>> +
>> + if (hyp_pte_table_empty(start_pte))
>> + clear_hyp_pmd_entry(pmd);
>> +}
>> +
>> +static void unmap_hyp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
>> +{
>> + phys_addr_t next;
>> + pmd_t *pmd, *start_pmd;
>> +
>> + start_pmd = pmd = pmd_offset(pud, addr);
>> + do {
>> + next = pmd_addr_end(addr, end);
>> + if (!pmd_none(*pmd)) {
>> + if (pmd_thp_or_huge(*pmd)) {
>
> do we ever actually map anything with section mappings in the Hyp
> mappings?

No, this is purely a page mapping so far. On my system, the HYP text is
just over 4 pages big (4k pages), so the incentive is pretty low, unless
we can demonstrate some big gains due to the reduced TLB impact.

>> + pmd_t old_pmd = *pmd;
>> +
>> + pmd_clear(pmd);
>> + kvm_flush_dcache_pmd(old_pmd);
>> + put_page(virt_to_page(pmd));
>> + } else {
>> + unmap_hyp_ptes(pmd, addr, next);
>> + }
>> + }
>> + } while (pmd++, addr = next, addr != end);
>> +
>> + if (hyp_pmd_table_empty(start_pmd))
>> + clear_hyp_pud_entry(pud);
>> +}
>> +
>> +static void unmap_hyp_puds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
>> +{
>> + phys_addr_t next;
>> + pud_t *pud, *start_pud;
>> +
>> + start_pud = pud = pud_offset(pgd, addr);
>> + do {
>> + next = pud_addr_end(addr, end);
>> + if (!pud_none(*pud)) {
>> + if (pud_huge(*pud)) {
>
> do we ever actually map anything with huge pud
> mappings for the Hyp space?

Same thing. Looks like there is some potential simplification here.

>
>> + pud_t old_pud = *pud;
>> +
>> + pud_clear(pud);
>> + kvm_flush_dcache_pud(old_pud);
>> + put_page(virt_to_page(pud));
>> + } else {
>> + unmap_hyp_pmds(pud, addr, next);
>> + }
>> + }
>> + } while (pud++, addr = next, addr != end);
>> +
>> + if (hyp_pud_table_empty(start_pud))
>> + clear_hyp_pgd_entry(pgd);
>> +}
>> +
>> +static void unmap_hyp_range(pgd_t *pgdp, phys_addr_t start, u64 size)
>> +{
>> + pgd_t *pgd;
>> + phys_addr_t addr = start, end = start + size;
>> + phys_addr_t next;
>> +
>> + pgd = pgdp + pgd_index(addr);
>> + do {
>> + next = pgd_addr_end(addr, end);
>> + if (!pgd_none(*pgd))
>> + unmap_hyp_puds(pgd, addr, next);
>> + } while (pgd++, addr = next, addr != end);
>
> shouldn't we flush the EL2 (hyp) TLB here, strictly speaking?
>
> Or do we rely on all mappings ever created/torn down here to always have
> the same VA/PA relationship? Since we didn't flush the EL2 TLB in the
> existing code, that indeed does seem to be the case.

Actually, we never unmap anything from HYP. Once a structure (kvm, vcpu)
is mapped there, it stays forever, whatever happens to the VM (that's
because we'd otherwise have to refcount the number of objects in a page,
and I'm lazy...).

> That, in turn, raises the question why we don't simply map all pages
> that could be referenced by a kmalloc() in Hyp mode during the init
> phase and be done with all this hyp mapping/unmapping stuff?
>
> In any case, that behavior doesn't have to change now, but if we don't
> add a TLB flush here, I'd like a comment to explain why that's not
> needed.

Hope you have your answer above... ;-)

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2016-04-08 15:13:12

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 17/17] arm64: kvm: Add support for 16K pages

On Mon, Apr 04, 2016 at 05:26:17PM +0100, Suzuki K Poulose wrote:
> Now that we can handle stage-2 page tables independent
> of the host page table levels, wire up the 16K page
> support.
>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> arch/arm64/include/asm/kvm_arm.h | 11 ++++++++++-
> arch/arm64/kvm/Kconfig | 1 -
> 2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index d5d5fdf..3d830a3 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -114,6 +114,7 @@
> #define VTCR_EL2_PS_MASK TCR_EL2_PS_MASK
> #define VTCR_EL2_TG0_MASK TCR_TG0_MASK
> #define VTCR_EL2_TG0_4K TCR_TG0_4K
> +#define VTCR_EL2_TG0_16K TCR_TG0_16K
> #define VTCR_EL2_TG0_64K TCR_TG0_64K
> #define VTCR_EL2_SH0_MASK TCR_SH0_MASK
> #define VTCR_EL2_SH0_INNER TCR_SH0_INNER
> @@ -139,7 +140,7 @@
> * (see hyp-init.S).
> *
> * Note that when using 4K pages, we concatenate two first level page tables
> - * together.
> + * together. With 16K pages, we concatenate 16 first level page tables.
> *
> * The magic numbers used for VTTBR_X in this patch can be found in Tables
> * D4-23 and D4-25 in ARM DDI 0487A.b.
> @@ -156,6 +157,14 @@
> */
> #define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
> #define VTTBR_X_TGRAN_MAGIC 38
> +#elif defined(CONFIG_ARM64_16K_PAGES)
> +/*
> + * Stage2 translation configuration:
> + * 16kB pages (TG0 = 2)
> + * 2 level page tables (SL = 1)
> + */
> +#define VTCR_EL2_TGRAN_FLAGS (VTCR_EL2_TG0_16K | VTCR_EL2_SL0_LVL1)
> +#define VTTBR_X_TGRAN_MAGIC 42
> #else
> /*
> * Stage2 translation configuration:
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index de7450d..aa2e34e 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -22,7 +22,6 @@ config KVM_ARM_VGIC_V3
> config KVM
> bool "Kernel-based Virtual Machine (KVM) support"
> depends on OF
> - depends on !ARM64_16K_PAGES
> select MMU_NOTIFIER
> select PREEMPT_NOTIFIERS
> select ANON_INODES
> --
> 1.7.9.5
>

Looks so easy now :)
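
For anyone else wondering where the 42 comes from, a rough back-of-the-envelope
(my own reading of the walk parameters, as an illustration rather than the
authoritative derivation from the D4-25 table):

	16K granule, 40bit IPA, starting at level 2 (SL0 = 1):
	  level 3 resolves IPA[24:14]  -> 11 bits
	  level 2 resolves IPA[39:25]  -> 15 bits
	=> 2^15 entries * 8 bytes = 256KB of concatenated level-2 tables
	   (16 x 16KB), so the VTTBR base must be 256KB aligned:
	   VTTBR_X = 42 - T0SZ(40bit => 24) = 18, and 1 << 18 = 256KB.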

Reviewed-by: Christoffer Dall <[email protected]>

2016-04-08 15:14:51

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 00/17] kvm-arm: Add stage2 page table walker

Hi Suzuki,

On Mon, Apr 04, 2016 at 05:26:00PM +0100, Suzuki K Poulose wrote:
> This series adds support for stage2 page table helpers and makes
> the core kvm-arm MMU code make use of it. At the moment we assume
> that the host/hyp and the stage2 page tables have same number of
> levels and hence use the host level accessors (except for some
> hooks, e.g kvm_p.d_addr_end) and shares the routines for unmapping
> the page table ranges.
>
> On arm32, the only change w.r.t the page tables is dealing
> with > 32bit physical addresses.
>
> However on arm64, the hardware supports concatenation of tables (upto 16)
> at the entry level, which could affect :
> 1) number of entries in the PGD table (upto 16 * PTRS_PER_PTE)
> 2) number of page table levels (reduced number of page table levels).
>
> Also depending on the VA_BITS for the host kernel, the number of page table
> levels for both host and stage2(40bit IPA) could differ. At present, we insert
> (upto) one fake software page table(as the hardware is not aware of it and is
> only used by the OS to walk the table) level to bring the number of levels to
> that of the host/hyp table. However, with 16K + 48bit, and 40bit IPA, we could
> end up in 2 fake levels, which complicates the code.
>
> This series introduces explicit stage2 page table helpers and also defines
> separate set of routines for unmapping hyp and stage2 tables.
>
> On arm64 stage2 page table helpers are defined based on the number of levels
> required to map the IPA bits. See patch 15 for more details.
>
> Tested on TC2 (arm32), Fast models(with VHE) and real hardwares.
>

This looks pretty good overall. Could you rebase it on kvmarm/master
where Marc addressed the 36 bits PA size of the foundation model and
have a look at the interactions there?

If we can solve that bit and address the cosmetic issues I had in this
series, then I think we can queue this real soon.

Thanks a lot for going the extra mile on this with rewriting the whole
KVM page table handling!

Best,
-Christoffer

2016-04-08 15:16:43

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 12/17] kvm-arm: Add explicit hyp page table modifiers

On Fri, Apr 08, 2016 at 04:09:11PM +0100, Marc Zyngier wrote:
> On 08/04/16 14:15, Christoffer Dall wrote:
> > On Mon, Apr 04, 2016 at 05:26:12PM +0100, Suzuki K Poulose wrote:
> >> We have common routines to modify hyp and stage2 page tables
> >> based on the 'kvm' parameter. For a smoother transition to
> >> using separate routines for each, duplicate the routines
> >> and modify the copy to work on hyp.
> >>
> >> Marks the forked routines with _hyp_ and gets rid of the
> >> kvm parameter which is no longer needed and is NULL for hyp.
> >> Also, gets rid of calls to kvm_tlb_flush_by_vmid_ipa() calls
> >> from the hyp versions. Uses explicit host page table accessors
> >> instead of the kvm_* page table helpers.
> >>
> >> Suggested-by: Christoffer Dall <[email protected]>
> >> Cc: Marc Zyngier <[email protected]>
> >> Signed-off-by: Suzuki K Poulose <[email protected]>
> >> ---
> >> arch/arm/kvm/mmu.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++---
> >> 1 file changed, 118 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> >> index b46a337..2b491e5 100644
> >> --- a/arch/arm/kvm/mmu.c
> >> +++ b/arch/arm/kvm/mmu.c
> >> @@ -388,6 +388,119 @@ static void stage2_flush_vm(struct kvm *kvm)
> >> srcu_read_unlock(&kvm->srcu, idx);
> >> }
> >>
> >> +static void clear_hyp_pgd_entry(pgd_t *pgd)
> >> +{
> >> + pud_t *pud_table __maybe_unused = pud_offset(pgd, 0UL);
> >> + pgd_clear(pgd);
> >> + pud_free(NULL, pud_table);
> >> + put_page(virt_to_page(pgd));
> >> +}
> >> +
> >> +static void clear_hyp_pud_entry(pud_t *pud)
> >> +{
> >> + pmd_t *pmd_table __maybe_unused = pmd_offset(pud, 0);
> >> + VM_BUG_ON(pud_huge(*pud));
> >> + pud_clear(pud);
> >> + pmd_free(NULL, pmd_table);
> >> + put_page(virt_to_page(pud));
> >> +}
> >> +
> >> +static void clear_hyp_pmd_entry(pmd_t *pmd)
> >> +{
> >> + pte_t *pte_table = pte_offset_kernel(pmd, 0);
> >> + VM_BUG_ON(pmd_thp_or_huge(*pmd));
> >> + pmd_clear(pmd);
> >> + pte_free_kernel(NULL, pte_table);
> >> + put_page(virt_to_page(pmd));
> >> +}
> >> +
> >> +static void unmap_hyp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)
> >> +{
> >> + pte_t *pte, *start_pte;
> >> +
> >> + start_pte = pte = pte_offset_kernel(pmd, addr);
> >> + do {
> >> + if (!pte_none(*pte)) {
> >> + pte_t old_pte = *pte;
> >> +
> >> + kvm_set_pte(pte, __pte(0));
> >> +
> >> + /* XXX: Do we need to invalidate the cache for device mappings ? */
> >
> > no, we will not be swapping out any pages mapped in Hyp mode so you can
> > get rid of both of the following two lines.
> >
> >> + if (!kvm_is_device_pfn(pte_pfn(old_pte)))
> >> + kvm_flush_dcache_pte(old_pte);
> >> +
> >> + put_page(virt_to_page(pte));
> >> + }
> >> + } while (pte++, addr += PAGE_SIZE, addr != end);
> >> +
> >> + if (hyp_pte_table_empty(start_pte))
> >> + clear_hyp_pmd_entry(pmd);
> >> +}
> >> +
> >> +static void unmap_hyp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
> >> +{
> >> + phys_addr_t next;
> >> + pmd_t *pmd, *start_pmd;
> >> +
> >> + start_pmd = pmd = pmd_offset(pud, addr);
> >> + do {
> >> + next = pmd_addr_end(addr, end);
> >> + if (!pmd_none(*pmd)) {
> >> + if (pmd_thp_or_huge(*pmd)) {
> >
> > do we ever actually map anything with section mappings in the Hyp
> > mappings?
>
> No, this is purely a page mapping so far. On my system, the HYP text is
> just over 4 pages big (4k pages), so the incentive is pretty low, unless
> we can demonstrate some big gains due to the reduced TLB impact.
>
> >> + pmd_t old_pmd = *pmd;
> >> +
> >> + pmd_clear(pmd);
> >> + kvm_flush_dcache_pmd(old_pmd);
> >> + put_page(virt_to_page(pmd));
> >> + } else {
> >> + unmap_hyp_ptes(pmd, addr, next);
> >> + }
> >> + }
> >> + } while (pmd++, addr = next, addr != end);
> >> +
> >> + if (hyp_pmd_table_empty(start_pmd))
> >> + clear_hyp_pud_entry(pud);
> >> +}
> >> +
> >> +static void unmap_hyp_puds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
> >> +{
> >> + phys_addr_t next;
> >> + pud_t *pud, *start_pud;
> >> +
> >> + start_pud = pud = pud_offset(pgd, addr);
> >> + do {
> >> + next = pud_addr_end(addr, end);
> >> + if (!pud_none(*pud)) {
> >> + if (pud_huge(*pud)) {
> >
> > do we ever actually map anything with huge pud
> > mappings for the Hyp space?
>
> Same thing. Looks like there is some potential simplification here.
>
> >
> >> + pud_t old_pud = *pud;
> >> +
> >> + pud_clear(pud);
> >> + kvm_flush_dcache_pud(old_pud);
> >> + put_page(virt_to_page(pud));
> >> + } else {
> >> + unmap_hyp_pmds(pud, addr, next);
> >> + }
> >> + }
> >> + } while (pud++, addr = next, addr != end);
> >> +
> >> + if (hyp_pud_table_empty(start_pud))
> >> + clear_hyp_pgd_entry(pgd);
> >> +}
> >> +
> >> +static void unmap_hyp_range(pgd_t *pgdp, phys_addr_t start, u64 size)
> >> +{
> >> + pgd_t *pgd;
> >> + phys_addr_t addr = start, end = start + size;
> >> + phys_addr_t next;
> >> +
> >> + pgd = pgdp + pgd_index(addr);
> >> + do {
> >> + next = pgd_addr_end(addr, end);
> >> + if (!pgd_none(*pgd))
> >> + unmap_hyp_puds(pgd, addr, next);
> >> + } while (pgd++, addr = next, addr != end);
> >
> > shouldn't we flush the EL2 (hyp) TLB here, strictly speaking?
> >
> > Or do we rely on all mappings ever created/torn down here to always have
> > the same VA/PA relationship? Since we didn't flush the EL2 TLB in the
> > existing code, that indeed does seem to be the case.
>
> Actually, we never unmap anything from HYP. Once a structure (kvm, vcpu)
> is mapped there, it stays forever, whatever happens to the VM (that's
> because we'd otherwise have to refcount the number of objects in a page,
> and I'm lazy...).
>
> > That, in turn, raises the question why we don't simply map all pages
> > that could be referenced by a kmalloc() in Hyp mode during the init
> > phase and be done with all this hyp mapping/unmapping stuff?
> >
> > In any case, that behavior doesn't have to change now, but if we don't
> > add a TLB flush here, I'd like a comment to explain why that's not
> > needed.
>
> Hope you have your answer above... ;-)
>
Not quite: Could we just map the linearly mapped region in Hyp mode from
the beginning and be done with all this?

Otherwise yes, I have the answer, and we should add a comment too.

Thanks,
-Christoffer


2016-04-08 15:22:13

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH 12/17] kvm-arm: Add explicit hyp page table modifiers

On 08/04/16 16:16, Christoffer Dall wrote:
> On Fri, Apr 08, 2016 at 04:09:11PM +0100, Marc Zyngier wrote:
>> On 08/04/16 14:15, Christoffer Dall wrote:
>>> On Mon, Apr 04, 2016 at 05:26:12PM +0100, Suzuki K Poulose wrote:
>>>> We have common routines to modify hyp and stage2 page tables
>>>> based on the 'kvm' parameter. For a smoother transition to
>>>> using separate routines for each, duplicate the routines
>>>> and modify the copy to work on hyp.
>>>>
>>>> Marks the forked routines with _hyp_ and gets rid of the
>>>> kvm parameter which is no longer needed and is NULL for hyp.
>>>> Also, gets rid of calls to kvm_tlb_flush_by_vmid_ipa() calls
>>>> from the hyp versions. Uses explicit host page table accessors
>>>> instead of the kvm_* page table helpers.
>>>>
>>>> Suggested-by: Christoffer Dall <[email protected]>
>>>> Cc: Marc Zyngier <[email protected]>
>>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>>> ---
>>>> arch/arm/kvm/mmu.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++---
>>>> 1 file changed, 118 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>>> index b46a337..2b491e5 100644
>>>> --- a/arch/arm/kvm/mmu.c
>>>> +++ b/arch/arm/kvm/mmu.c
>>>> @@ -388,6 +388,119 @@ static void stage2_flush_vm(struct kvm *kvm)
>>>> srcu_read_unlock(&kvm->srcu, idx);
>>>> }
>>>>
>>>> +static void clear_hyp_pgd_entry(pgd_t *pgd)
>>>> +{
>>>> + pud_t *pud_table __maybe_unused = pud_offset(pgd, 0UL);
>>>> + pgd_clear(pgd);
>>>> + pud_free(NULL, pud_table);
>>>> + put_page(virt_to_page(pgd));
>>>> +}
>>>> +
>>>> +static void clear_hyp_pud_entry(pud_t *pud)
>>>> +{
>>>> + pmd_t *pmd_table __maybe_unused = pmd_offset(pud, 0);
>>>> + VM_BUG_ON(pud_huge(*pud));
>>>> + pud_clear(pud);
>>>> + pmd_free(NULL, pmd_table);
>>>> + put_page(virt_to_page(pud));
>>>> +}
>>>> +
>>>> +static void clear_hyp_pmd_entry(pmd_t *pmd)
>>>> +{
>>>> + pte_t *pte_table = pte_offset_kernel(pmd, 0);
>>>> + VM_BUG_ON(pmd_thp_or_huge(*pmd));
>>>> + pmd_clear(pmd);
>>>> + pte_free_kernel(NULL, pte_table);
>>>> + put_page(virt_to_page(pmd));
>>>> +}
>>>> +
>>>> +static void unmap_hyp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)
>>>> +{
>>>> + pte_t *pte, *start_pte;
>>>> +
>>>> + start_pte = pte = pte_offset_kernel(pmd, addr);
>>>> + do {
>>>> + if (!pte_none(*pte)) {
>>>> + pte_t old_pte = *pte;
>>>> +
>>>> + kvm_set_pte(pte, __pte(0));
>>>> +
>>>> + /* XXX: Do we need to invalidate the cache for device mappings ? */
>>>
>>> no, we will not be swapping out any pages mapped in Hyp mode so you can
>>> get rid of both of the following two lines.
>>>
>>>> + if (!kvm_is_device_pfn(pte_pfn(old_pte)))
>>>> + kvm_flush_dcache_pte(old_pte);
>>>> +
>>>> + put_page(virt_to_page(pte));
>>>> + }
>>>> + } while (pte++, addr += PAGE_SIZE, addr != end);
>>>> +
>>>> + if (hyp_pte_table_empty(start_pte))
>>>> + clear_hyp_pmd_entry(pmd);
>>>> +}
>>>> +
>>>> +static void unmap_hyp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
>>>> +{
>>>> + phys_addr_t next;
>>>> + pmd_t *pmd, *start_pmd;
>>>> +
>>>> + start_pmd = pmd = pmd_offset(pud, addr);
>>>> + do {
>>>> + next = pmd_addr_end(addr, end);
>>>> + if (!pmd_none(*pmd)) {
>>>> + if (pmd_thp_or_huge(*pmd)) {
>>>
>>> do we ever actually map anything with section mappings in the Hyp
>>> mappings?
>>
>> No, this is purely a page mapping so far. On my system, the HYP text is
>> just over 4 pages big (4k pages), so the incentive is pretty low, unless
>> we can demonstrate some big gains due to the reduced TLB impact.
>>
>>>> + pmd_t old_pmd = *pmd;
>>>> +
>>>> + pmd_clear(pmd);
>>>> + kvm_flush_dcache_pmd(old_pmd);
>>>> + put_page(virt_to_page(pmd));
>>>> + } else {
>>>> + unmap_hyp_ptes(pmd, addr, next);
>>>> + }
>>>> + }
>>>> + } while (pmd++, addr = next, addr != end);
>>>> +
>>>> + if (hyp_pmd_table_empty(start_pmd))
>>>> + clear_hyp_pud_entry(pud);
>>>> +}
>>>> +
>>>> +static void unmap_hyp_puds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
>>>> +{
>>>> + phys_addr_t next;
>>>> + pud_t *pud, *start_pud;
>>>> +
>>>> + start_pud = pud = pud_offset(pgd, addr);
>>>> + do {
>>>> + next = pud_addr_end(addr, end);
>>>> + if (!pud_none(*pud)) {
>>>> + if (pud_huge(*pud)) {
>>>
>>> do we ever actually map anything with huge pud
>>> mappings for the Hyp space?
>>
>> Same thing. Looks like there is some potential simplification here.
>>
>>>
>>>> + pud_t old_pud = *pud;
>>>> +
>>>> + pud_clear(pud);
>>>> + kvm_flush_dcache_pud(old_pud);
>>>> + put_page(virt_to_page(pud));
>>>> + } else {
>>>> + unmap_hyp_pmds(pud, addr, next);
>>>> + }
>>>> + }
>>>> + } while (pud++, addr = next, addr != end);
>>>> +
>>>> + if (hyp_pud_table_empty(start_pud))
>>>> + clear_hyp_pgd_entry(pgd);
>>>> +}
>>>> +
>>>> +static void unmap_hyp_range(pgd_t *pgdp, phys_addr_t start, u64 size)
>>>> +{
>>>> + pgd_t *pgd;
>>>> + phys_addr_t addr = start, end = start + size;
>>>> + phys_addr_t next;
>>>> +
>>>> + pgd = pgdp + pgd_index(addr);
>>>> + do {
>>>> + next = pgd_addr_end(addr, end);
>>>> + if (!pgd_none(*pgd))
>>>> + unmap_hyp_puds(pgd, addr, next);
>>>> + } while (pgd++, addr = next, addr != end);
>>>
>>> shouldn't we flush the EL2 (hyp) TLB here, strictly speaking?
>>>
>>> Or do we rely on all mappings ever created/torn down here to always have
>>> the same VA/PA relationship? Since we didn't flush the EL2 TLB in the
>>> existing code, that indeed does seem to be the case.
>>
>> Actually, we never unmap anything from HYP. Once a structure (kvm, vcpu)
>> is mapped there, it stays forever, whatever happens to the VM (that's
>> because we'd otherwise have to refcount the number of objects in a page,
>> and I'm lazy...).
>>
>>> That, in turn, raises the question why we don't simply map all pages
>>> that could be referenced by a kmalloc() in Hyp mode during the init
>>> phase and be done with all this hyp mapping/unmapping stuff?
>>>
>>> In any case, that behavior doesn't have to change now, but if we don't
>>> add a TLB flush here, I'd like a comment to explain why that's not
>>> needed.
>>
>> Hope you have your answer above... ;-)
>>
> Not quite: Could we just map the linearly mapped region in Hyp mode from
> the beginning and be done with all this?

We could. Not sure what the implications may be, apart from using more
memory for page tables. In that case, section mappings would be in order
though.

> Otherwise yes, I have the answer, and we should add a comment too.

Agreed.

M.
--
Jazz is not dead. It just smells funny...

2016-04-08 15:23:02

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH 12/17] kvm-arm: Add explicit hyp page table modifiers

On 08/04/16 16:09, Marc Zyngier wrote:
> On 08/04/16 14:15, Christoffer Dall wrote:
>> On Mon, Apr 04, 2016 at 05:26:12PM +0100, Suzuki K Poulose wrote:
>>> We have common routines to modify hyp and stage2 page tables
>>> based on the 'kvm' parameter. For a smoother transition to
>>> using separate routines for each, duplicate the routines
>>> and modify the copy to work on hyp.
>>>
>>> Marks the forked routines with _hyp_ and gets rid of the
>>> kvm parameter which is no longer needed and is NULL for hyp.
>>> Also, gets rid of calls to kvm_tlb_flush_by_vmid_ipa() calls
>>> from the hyp versions. Uses explicit host page table accessors
>>> instead of the kvm_* page table helpers.
>>>
>>> Suggested-by: Christoffer Dall <[email protected]>
>>> Cc: Marc Zyngier <[email protected]>
>>> Signed-off-by: Suzuki K Poulose <[email protected]>

>>> +static void unmap_hyp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)
>>> +{
>>> + pte_t *pte, *start_pte;
>>> +
>>> + start_pte = pte = pte_offset_kernel(pmd, addr);
>>> + do {
>>> + if (!pte_none(*pte)) {
>>> + pte_t old_pte = *pte;
>>> +
>>> + kvm_set_pte(pte, __pte(0));
>>> +
>>> + /* XXX: Do we need to invalidate the cache for device mappings ? */
>>
>> no, we will not be swapping out any pages mapped in Hyp mode so you can
>> get rid of both of the following two lines.

OK, will remove this hunk.

>>
>>> + if (!kvm_is_device_pfn(pte_pfn(old_pte)))
>>> + kvm_flush_dcache_pte(old_pte);
>>> +
>>> + put_page(virt_to_page(pte));
>>> + }
>>> + } while (pte++, addr += PAGE_SIZE, addr != end);
>>> +
>>> + if (hyp_pte_table_empty(start_pte))
>>> + clear_hyp_pmd_entry(pmd);
>>> +}
>>> +
>>> +static void unmap_hyp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
>>> +{
>>> + phys_addr_t next;
>>> + pmd_t *pmd, *start_pmd;
>>> +
>>> + start_pmd = pmd = pmd_offset(pud, addr);
>>> + do {
>>> + next = pmd_addr_end(addr, end);
>>> + if (!pmd_none(*pmd)) {
>>> + if (pmd_thp_or_huge(*pmd)) {
>>
>> do we ever actually map anything with section mappings in the Hyp
>> mappings?
>
> No, this is purely a page mapping so far. On my system, the HYP text is
> just over 4 pages big (4k pages), so the incentive is pretty low, unless
> we can demonstrate some big gains due to the reduced TLB impact.


>>> +static void unmap_hyp_puds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
>>> +{
>>> + phys_addr_t next;
>>> + pud_t *pud, *start_pud;
>>> +
>>> + start_pud = pud = pud_offset(pgd, addr);
>>> + do {
>>> + next = pud_addr_end(addr, end);
>>> + if (!pud_none(*pud)) {
>>> + if (pud_huge(*pud)) {
>>
>> do we ever actually map anything with huge pud
>> mappings for the Hyp space?
>
> Same thing. Looks like there is some potential simplification here.

Right, we don't map anything with section mapping. I can clean these up.

>>> +static void unmap_hyp_range(pgd_t *pgdp, phys_addr_t start, u64 size)
>>> +{
>>> + pgd_t *pgd;
>>> + phys_addr_t addr = start, end = start + size;
>>> + phys_addr_t next;
>>> +
>>> + pgd = pgdp + pgd_index(addr);
>>> + do {
>>> + next = pgd_addr_end(addr, end);
>>> + if (!pgd_none(*pgd))
>>> + unmap_hyp_puds(pgd, addr, next);
>>> + } while (pgd++, addr = next, addr != end);
>>
>> shouldn't we flush the EL2 (hyp) TLB here, strictly speaking?
>>
>> Or do we rely on all mappings ever created/torn down here to always have
>> the same VA/PA relationship? Since we didn't flush the EL2 TLB in the
>> existing code, that indeed does seem to be the case.
>
> Actually, we never unmap anything from HYP.

Except when we tear down the kvm, where we clean up the entire hyp table.

> Once a structure (kvm, vcpu)is mapped there, it stays forever, whatever happens
> to the VM (that's because we'd otherwise have to refcount the number of objects in a page,
> and I'm lazy...).

That's on my TODO list if there is sufficient interest in getting it done.

Thanks
Suzuki

2016-04-08 15:25:11

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 12/17] kvm-arm: Add explicit hyp page table modifiers

On Fri, Apr 8, 2016 at 5:22 PM, Suzuki K Poulose <[email protected]> wrote:
> On 08/04/16 16:09, Marc Zyngier wrote:
>>
>> On 08/04/16 14:15, Christoffer Dall wrote:
>>>
>>> On Mon, Apr 04, 2016 at 05:26:12PM +0100, Suzuki K Poulose wrote:
>>>>
>>>> We have common routines to modify hyp and stage2 page tables
>>>> based on the 'kvm' parameter. For a smoother transition to
>>>> using separate routines for each, duplicate the routines
>>>> and modify the copy to work on hyp.
>>>>
>>>> Marks the forked routines with _hyp_ and gets rid of the
>>>> kvm parameter which is no longer needed and is NULL for hyp.
>>>> Also, gets rid of calls to kvm_tlb_flush_by_vmid_ipa() calls
>>>> from the hyp versions. Uses explicit host page table accessors
>>>> instead of the kvm_* page table helpers.
>>>>
>>>> Suggested-by: Christoffer Dall <[email protected]>
>>>> Cc: Marc Zyngier <[email protected]>
>>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>
>
>>>> +static void unmap_hyp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t
>>>> end)
>>>> +{
>>>> + pte_t *pte, *start_pte;
>>>> +
>>>> + start_pte = pte = pte_offset_kernel(pmd, addr);
>>>> + do {
>>>> + if (!pte_none(*pte)) {
>>>> + pte_t old_pte = *pte;
>>>> +
>>>> + kvm_set_pte(pte, __pte(0));
>>>> +
>>>> + /* XXX: Do we need to invalidate the cache for
>>>> device mappings ? */
>>>
>>>
>>> no, we will not be swapping out any pages mapped in Hyp mode so you can
>>> get rid of both of the following two lines.
>
>
> OK, will remove this hunk.
>
>
>>>
>>>> + if (!kvm_is_device_pfn(pte_pfn(old_pte)))
>>>> + kvm_flush_dcache_pte(old_pte);
>>>> +
>>>> + put_page(virt_to_page(pte));
>>>> + }
>>>> + } while (pte++, addr += PAGE_SIZE, addr != end);
>>>> +
>>>> + if (hyp_pte_table_empty(start_pte))
>>>> + clear_hyp_pmd_entry(pmd);
>>>> +}
>>>> +
>>>> +static void unmap_hyp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t
>>>> end)
>>>> +{
>>>> + phys_addr_t next;
>>>> + pmd_t *pmd, *start_pmd;
>>>> +
>>>> + start_pmd = pmd = pmd_offset(pud, addr);
>>>> + do {
>>>> + next = pmd_addr_end(addr, end);
>>>> + if (!pmd_none(*pmd)) {
>>>> + if (pmd_thp_or_huge(*pmd)) {
>>>
>>>
>>> do we ever actually map anything with section mappings in the Hyp
>>> mappings?
>>
>>
>> No, this is purely a page mapping so far. On my system, the HYP text is
>> just over 4 pages big (4k pages), so the incentive is pretty low, unless
>> we can demonstrate some big gains due to the reduced TLB impact.
>
>
>
>>>> +static void unmap_hyp_puds(pgd_t *pgd, phys_addr_t addr, phys_addr_t
>>>> end)
>>>> +{
>>>> + phys_addr_t next;
>>>> + pud_t *pud, *start_pud;
>>>> +
>>>> + start_pud = pud = pud_offset(pgd, addr);
>>>> + do {
>>>> + next = pud_addr_end(addr, end);
>>>> + if (!pud_none(*pud)) {
>>>> + if (pud_huge(*pud)) {
>>>
>>>
>>> do we ever actually map anything with huge pud
>>> mappings for the Hyp space?
>>
>>
>> Same thing. Looks like there is some potential simplification here.
>
>
> Right, we don't map anything with section mapping. I can clean these up.
>
>>>> +static void unmap_hyp_range(pgd_t *pgdp, phys_addr_t start, u64 size)
>>>> +{
>>>> + pgd_t *pgd;
>>>> + phys_addr_t addr = start, end = start + size;
>>>> + phys_addr_t next;
>>>> +
>>>> + pgd = pgdp + pgd_index(addr);
>>>> + do {
>>>> + next = pgd_addr_end(addr, end);
>>>> + if (!pgd_none(*pgd))
>>>> + unmap_hyp_puds(pgd, addr, next);
>>>> + } while (pgd++, addr = next, addr != end);
>>>
>>>
>>> shouldn't we flush the EL2 (hyp) TLB here, strictly speaking?
>>>
>>> Or do we rely on all mappings ever created/torn down here to always have
>>> the same VA/PA relationship? Since we didn't flush the EL2 TLB in the
>>> existing code, that indeed does seem to be the case.
>>
>>
>> Actually, we never unmap anything from HYP.
>
>
> Except for the kvm tearing down where we clean up all the hyp table.
>
>> Once a structure (kvm, vcpu)is mapped there, it stays forever, whatever
>> happens
>> to the VM (that's because we'd otherwise have to refcount the number of
>> objects in a page,
>> and I'm lazy...).
>
>
> Thats one of my TODO list if there is sufficient interest in getting that
> done.
>
I think you can ignore it for now... I'm sure we have bigger fish to fry.

-Christoffer

2016-04-08 15:37:08

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH 13/17] kvm-arm: Add stage2 page table modifiers

On 08/04/16 14:42, Christoffer Dall wrote:
> On Mon, Apr 04, 2016 at 05:26:13PM +0100, Suzuki K Poulose wrote:

>>
>> -static void clear_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
>> +static void clear_stage2_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
>> {
>> - pmd_t *pmd_table = pmd_offset(pud, 0);
>> - VM_BUG_ON(pud_huge(*pud));
>> - pud_clear(pud);
>> + pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(pud, 0);
>
> The __maybe_unused are slightly ugly, so it may be nicer to create the
> stage2_pmd_free() as static inline's if they're defined to do nothing
> instead.
>

Sure, we could do that for stage2. However, we would need to fix the host helpers
as well to make the same change in the _hyp version (for 16K + 36bit VA).
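
For the stage2 side it would look something like the below (rough sketch
only), which is what would let us drop the __maybe_unused annotations:

	/* current style, mirroring include/asm-generic/pgtable-nopud.h */
	#define stage2_pmd_free(mm, pmd)	do { } while (0)

	/* alternative: the pointer argument counts as "used" */
	static inline void stage2_pmd_free(struct mm_struct *mm, pmd_t *pmd)
	{
	}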

>>
>
> Acked-by: Christoffer Dall <[email protected]>
>


Thanks
Suzuki

2016-04-08 17:02:52

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 13/17] kvm-arm: Add stage2 page table modifiers

On Fri, Apr 08, 2016 at 04:37:02PM +0100, Suzuki K Poulose wrote:
> On 08/04/16 14:42, Christoffer Dall wrote:
> >On Mon, Apr 04, 2016 at 05:26:13PM +0100, Suzuki K Poulose wrote:
>
> >>
> >>-static void clear_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
> >>+static void clear_stage2_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
> >> {
> >>- pmd_t *pmd_table = pmd_offset(pud, 0);
> >>- VM_BUG_ON(pud_huge(*pud));
> >>- pud_clear(pud);
> >>+ pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(pud, 0);
> >
> >The __maybe_unused are slightly ugly, so it may be nicer to create the
> >stage2_pmd_free() as static inline's if they're defined to do nothing
> >instead.
> >
>
> Sure, we could do that for stage2. However, we will need to fix the host helpers
> as well for making such a change in the _hyp version (for 16K + 36bit VA).
>

I thought the host helpers were already done like that, since we don't
need the __maybe_unused currently. If it involves changing core code
etc. then don't bother.

Thanks,
-Christoffer

2016-04-08 17:07:18

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH 13/17] kvm-arm: Add stage2 page table modifiers

On 08/04/16 18:03, Christoffer Dall wrote:
> On Fri, Apr 08, 2016 at 04:37:02PM +0100, Suzuki K Poulose wrote:
>> On 08/04/16 14:42, Christoffer Dall wrote:
>>> On Mon, Apr 04, 2016 at 05:26:13PM +0100, Suzuki K Poulose wrote:
>>
>>>>
>>>> -static void clear_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
>>>> +static void clear_stage2_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
>>>> {
>>>> - pmd_t *pmd_table = pmd_offset(pud, 0);
>>>> - VM_BUG_ON(pud_huge(*pud));
>>>> - pud_clear(pud);
>>>> + pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(pud, 0);
>>>
>>> The __maybe_unused are slightly ugly, so it may be nicer to create the
>>> stage2_pmd_free() as static inline's if they're defined to do nothing
>>> instead.
>>>
>>
>> Sure, we could do that for stage2. However, we will need to fix the host helpers
>> as well for making such a change in the _hyp version (for 16K + 36bit VA).
>>
>
> I thought the host helpers were already done like that, since we don't
> need the __maybe_unused currently. If it involves changing core code
> etc. then don't bother.

Unfortunately no, e.g,

include/asm-generic/pgtable-nopud.h defines:

#define pud_free(mm, x) do { } while (0)

Cheers
Suzuki


2016-04-08 17:25:00

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 13/17] kvm-arm: Add stage2 page table modifiers

On Fri, Apr 08, 2016 at 06:07:13PM +0100, Suzuki K Poulose wrote:
> On 08/04/16 18:03, Christoffer Dall wrote:
> >On Fri, Apr 08, 2016 at 04:37:02PM +0100, Suzuki K Poulose wrote:
> >>On 08/04/16 14:42, Christoffer Dall wrote:
> >>>On Mon, Apr 04, 2016 at 05:26:13PM +0100, Suzuki K Poulose wrote:
> >>
> >>>>
> >>>>-static void clear_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
> >>>>+static void clear_stage2_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
> >>>> {
> >>>>- pmd_t *pmd_table = pmd_offset(pud, 0);
> >>>>- VM_BUG_ON(pud_huge(*pud));
> >>>>- pud_clear(pud);
> >>>>+ pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(pud, 0);
> >>>
> >>>The __maybe_unused are slightly ugly, so it may be nicer to create the
> >>>stage2_pmd_free() as static inline's if they're defined to do nothing
> >>>instead.
> >>>
> >>
> >>Sure, we could do that for stage2. However, we will need to fix the host helpers
> >>as well for making such a change in the _hyp version (for 16K + 36bit VA).
> >>
> >
> >I thought the host helpers were already done like that, since we don't
> >need the __maybe_unused currently. If it involves changing core code
> >etc. then don't bother.
>
> Unfortunately no, e.g,
>
> include/asm-generic/pgtable-nopud.h defines:
>
> #define pud_free(mm, x) do { } while (0)
>

Leave it then :)

Thanks,
-Christoffer

2016-04-11 14:33:51

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH 15/17] kvm: arm64: Get rid of fake page table levels

On 08/04/16 16:05, Christoffer Dall wrote:
> On Mon, Apr 04, 2016 at 05:26:15PM +0100, Suzuki K Poulose wrote:

>> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
>> index 751227d..139b4db 100644
>> --- a/arch/arm64/include/asm/stage2_pgtable.h
>> +++ b/arch/arm64/include/asm/stage2_pgtable.h
>> @@ -22,32 +22,55 @@
>> #include <asm/pgtable.h>
>>
>> /*
>> - * In the case where PGDIR_SHIFT is larger than KVM_PHYS_SHIFT, we can address
>> - * the entire IPA input range with a single pgd entry, and we would only need
>> - * one pgd entry. Note that in this case, the pgd is actually not used by
>> - * the MMU for Stage-2 translations, but is merely a fake pgd used as a data
>> - * structure for the kernel pgtable macros to work.
>> + * The hardware mandates concatenation of upto 16 tables at stage2 entry level.
>
> s/upto/up to/
>
>> + * Now, the minimum number of bits resolved at any level is (PAGE_SHIFT - 3),
>> + * or in other words log2(PTRS_PER_PTE). On arm64, the smallest PAGE_SIZE
>
> not sure the log2 comment helps here.

OK, will address both the above comments.

>
>> + * supported is 4k, which means (PAGE_SHIFT - 3) > 4 holds for all page sizes.
>> + * This implies, the total number of page table levels at stage2 expected
>> + * by the hardware is actually the number of levels required for (KVM_PHYS_SHIFT - 4)
>> + * in normal translations(e.g, stage-1), since we cannot have another level in
>> + * the range (KVM_PHYS_SHIFT, KVM_PHYS_SHIFT - 4).
>
> Is it not a design decision to always choose the maximum number of
> concatinated initial-level stage2 tables (with the constraint that
> there's a minimum number required)?

You are right. It is a design decision.

>
> I agree with the design decision, if my math is correct that on 64K
> systems you end up requiring a 1MB physically contiguous 1MB aligned
> allocation for each VM? This seems reasonable enough if you configure
> your kernel with 64K pages and expect to run VMs on top of that.

Right, and it is "up to 1MB" and not always, depending on the IPA size.
And for 16K it would be up to 256K (e.g, with 40bit IPA).
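
For the record, working out the numbers for KVM_PHYS_SHIFT = 40 from the
definitions in the patch (my own arithmetic, just as an illustration):

	4K : STAGE2_PGTABLE_LEVELS = (36 - 4) / 9  = 3, S2_PGDIR_SHIFT = 30
	     => PTRS_PER_S2_PGD = 1 << 10 =  1024 entries =   8K (2 concatenated tables)
	16K: STAGE2_PGTABLE_LEVELS = (36 - 4) / 11 = 2, S2_PGDIR_SHIFT = 25
	     => PTRS_PER_S2_PGD = 1 << 15 = 32768 entries = 256K (16 concatenated tables)
	64K: STAGE2_PGTABLE_LEVELS = (36 - 4) / 13 = 2, S2_PGDIR_SHIFT = 29
	     => PTRS_PER_S2_PGD = 1 << 11 =  2048 entries =  16K (no concatenation needed)

	The 1MB/256K figures are the architectural maxima (16 concatenated
	granule-sized tables); the actual allocation depends on the IPA size.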

>
>> */
>> -#if PGDIR_SHIFT > KVM_PHYS_SHIFT
>> -#define PTRS_PER_S2_PGD_SHIFT 0
>> -#else
>> -#define PTRS_PER_S2_PGD_SHIFT (KVM_PHYS_SHIFT - PGDIR_SHIFT)
>> -#endif
>> -#define PTRS_PER_S2_PGD (1 << PTRS_PER_S2_PGD_SHIFT)
>> +#define STAGE2_PGTABLE_LEVELS ARM64_HW_PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4)
>>
>> /*
>> - * If we are concatenating first level stage-2 page tables, we would have less
>> - * than or equal to 16 pointers in the fake PGD, because that's what the
>> - * architecture allows. In this case, (4 - CONFIG_PGTABLE_LEVELS)
>> - * represents the first level for the host, and we add 1 to go to the next
>> - * level (which uses contatenation) for the stage-2 tables.
>> + * At the moment, we do not support a combination of guest IPA and host VA_BITS
>> + * where
>> + * STAGE2_PGTABLE_LEVELS > CONFIG_PGTABLE_LEVELS
>
> can you change this comment to reverse the statement to avoid someone
> seeing this as a constraint, when in fact it's a negative invariant?
>
> So the case we don't support is a sufficiently larger IPA space compared
> to the host VA space such that the above happens? (Since at the same
> IPA space size as host VA space size, the stage-2 levels will always be
> less than or equal to the host levels.)

Correct.

>
> I don't see how that would ever work with userspace either so I think
> this is a safe assumption and not something that ever needs fixing. In

For example, we can perfectly well run a guest with a 40bit IPA under a host with
16K + 36bit VA. The moment we go above 40bit IPA, we could trigger the conditions
above. I think it is perfectly fine for the guest to choose a higher IPA width and
place its memory well above, as long as qemu/lkvm doesn't exhaust its VA. I just
tried booting a VM with memory at 0x70_0000_0000 on a 16K + 36bit VA host and it
boots perfectly fine.
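
To make the 16K + 36bit VA case concrete (again my own arithmetic, purely
as an illustration):

	host : CONFIG_PGTABLE_LEVELS = ARM64_HW_PGTABLE_LEVELS(36)     = (36 - 4) / 11 = 2
	40bit: STAGE2_PGTABLE_LEVELS = ARM64_HW_PGTABLE_LEVELS(40 - 4) = (36 - 4) / 11 = 2  (fits)
	41bit: STAGE2_PGTABLE_LEVELS = ARM64_HW_PGTABLE_LEVELS(41 - 4) = (37 - 4) / 11 = 3  (> 2, hits the #error)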


> which case this should be reworded to just state the assumptions and why
> this is a good assumption.
>
> (If my assumptions are wrong here, then there are also weird cases where
> the host does huge pages at the PMD level and we don't. Not sure I can
> see the full ramifications of that.)

I am sorry, I didn't get your point about the PMD level.

>
>> + *
>> + * We base our stage-2 page table walker helpers based on this assumption and
>> + * fallback to using the host version of the helper wherever possible.
>> + * i.e, if a particular level is not folded (e.g, PUD) at stage2, we fall back
>> + * to using the host version, since it is guaranteed it is not folded at host.
>
> I don't really understand why it's desirable to fall back to the host
> version of the helpers; in fact I would probably prefer to just have it
> disjoint, but maybe I'll see the reason when going through the patch
> more. But I doubt the value of this particular piece of commentary...

OK

>
>> + *
>> + * TODO: We could lift this limitation easily by rearranging the host level
>> + * definitions to a more reusable version.
>> */
>
> So is this really a TODO: based on the above?
>
>> -#if PTRS_PER_S2_PGD <= 16
>> -#define KVM_PREALLOC_LEVEL (4 - CONFIG_PGTABLE_LEVELS + 1)
>> -#else
>> -#define KVM_PREALLOC_LEVEL (0)
>> +#if STAGE2_PGTABLE_LEVELS > CONFIG_PGTABLE_LEVELS
>> +#error "Unsupported combination of guest IPA and host VA_BITS."
>> #endif
>>
>> +


>
> Can we add a comment as to what this defines exactly? Something like:
> /*
> * PGDIR_SHIFT determines the size a top-level stage2 page table entry can map
> */

Done.

>> +#define S2_PGDIR_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - STAGE2_PGTABLE_LEVELS)
>> +#define S2_PGDIR_SIZE (_AC(1, UL) << S2_PGDIR_SHIFT)
>> +#define S2_PGDIR_MASK (~(S2_PGDIR_SIZE - 1))
>> +
>> +/* We can have concatenated tables at stage2 entry. */
>
> I'm not sure if the comment is helpful. How about:
>
> /*
> * The number of PTRS across all concatenated stage2 tables given by the
> * number of bits resolved at the initial level.
> */
>

OK

Thanks
Suzuki

2016-04-12 12:14:00

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 15/17] kvm: arm64: Get rid of fake page table levels

On Mon, Apr 11, 2016 at 03:33:45PM +0100, Suzuki K Poulose wrote:
> On 08/04/16 16:05, Christoffer Dall wrote:
> >On Mon, Apr 04, 2016 at 05:26:15PM +0100, Suzuki K Poulose wrote:
>
> >>diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
> >>index 751227d..139b4db 100644
> >>--- a/arch/arm64/include/asm/stage2_pgtable.h
> >>+++ b/arch/arm64/include/asm/stage2_pgtable.h
> >>@@ -22,32 +22,55 @@
> >> #include <asm/pgtable.h>
> >>
> >> /*
> >>- * In the case where PGDIR_SHIFT is larger than KVM_PHYS_SHIFT, we can address
> >>- * the entire IPA input range with a single pgd entry, and we would only need
> >>- * one pgd entry. Note that in this case, the pgd is actually not used by
> >>- * the MMU for Stage-2 translations, but is merely a fake pgd used as a data
> >>- * structure for the kernel pgtable macros to work.
> >>+ * The hardware mandates concatenation of upto 16 tables at stage2 entry level.
> >
> >s/upto/up to/
> >
> >>+ * Now, the minimum number of bits resolved at any level is (PAGE_SHIFT - 3),
> >>+ * or in other words log2(PTRS_PER_PTE). On arm64, the smallest PAGE_SIZE
> >
> >not sure the log2 comment helps here.
>
> OK, will address both the above comments.
>
> >
> >>+ * supported is 4k, which means (PAGE_SHIFT - 3) > 4 holds for all page sizes.
> >>+ * This implies, the total number of page table levels at stage2 expected
> >>+ * by the hardware is actually the number of levels required for (KVM_PHYS_SHIFT - 4)
> >>+ * in normal translations(e.g, stage-1), since we cannot have another level in
> >>+ * the range (KVM_PHYS_SHIFT, KVM_PHYS_SHIFT - 4).
> >
> >Is it not a design decision to always choose the maximum number of
> >concatinated initial-level stage2 tables (with the constraint that
> >there's a minimum number required)?
>
> You are right. It is a design decision.
>
> >
> >I agree with the design decision, if my math is correct that on 64K
> >systems you end up requiring a 1MB physically contiguous 1MB aligned
> >allocation for each VM? This seems reasonable enough if you configure
> >your kernel with 64K pages and expect to run VMs on top of that.
>
> Right, and it is "up to 1MB" and not always, depending on the IPA size.
> And for 16K it would be up to 256K (e.g, with 40bit IPA).
>
> >
> >> */
> >>-#if PGDIR_SHIFT > KVM_PHYS_SHIFT
> >>-#define PTRS_PER_S2_PGD_SHIFT 0
> >>-#else
> >>-#define PTRS_PER_S2_PGD_SHIFT (KVM_PHYS_SHIFT - PGDIR_SHIFT)
> >>-#endif
> >>-#define PTRS_PER_S2_PGD (1 << PTRS_PER_S2_PGD_SHIFT)
> >>+#define STAGE2_PGTABLE_LEVELS ARM64_HW_PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4)
> >>
> >> /*
> >>- * If we are concatenating first level stage-2 page tables, we would have less
> >>- * than or equal to 16 pointers in the fake PGD, because that's what the
> >>- * architecture allows. In this case, (4 - CONFIG_PGTABLE_LEVELS)
> >>- * represents the first level for the host, and we add 1 to go to the next
> >>- * level (which uses contatenation) for the stage-2 tables.

just noticed: s/contatenation/concatenation/

> >>+ * At the moment, we do not support a combination of guest IPA and host VA_BITS
> >>+ * where
> >>+ * STAGE2_PGTABLE_LEVELS > CONFIG_PGTABLE_LEVELS
> >
> >can you change this comment to reverse the statement to avoid someone
> >seeing this as a constraint, when in fact it's a negative invariant?
> >
> >So the case we don't support is a sufficiently larger IPA space compared
> >to the host VA space such that the above happens? (Since at the same
> >IPA space size as host VA space size, the stage-2 levels will always be
> >less than or equal to the host levels.)
>
> Correct.
>
> >
> >I don't see how that would ever work with userspace either so I think
> >this is a safe assumption and not something that ever needs fixing. In
>
> For e.g, we can perfectly run a guest with 40bit IPA under a host with 16K+36bit
> VA. The moment we go above 40bit IPA, we could trigger the conditions above.
> I think it is perfectly fine for the guest to choose higher IPA width, and place
> its memory well above as long as the qemu/lkvm doesn't exhaust its VA. I just
> tried booting a VM with memory at 0x70_0000_0000 on a 16K+36bitVA host and it
> boots perfectly fine.
>

Right, I was thinking about it as providing more than 36 bits of *memory*,
not address space, in this case, so you're right, it is at least a
theoretically possible case.

>
> >which case this should be reworded to just state the assumptions and why
> >this is a good assumption.
> >
> >(If my assumptions are wrong here, then there are also weird cases where
> >the host does huge pages at the PMD level and we don't. Not sure I can
> >see the full ramifications of that.)
>
> I am sorry, I didn't get your point about the PMD level.
>

Right, I expressed that terribly, and I may have gotten myself confused
when writing that.

My concern is this: if the number of levels differs between the host and
stage-2 tables, and the host uses huge pmd mappings (either THP or hugetlbfs),
then do we always do the right thing for stage-2 tables, even if we support
the case with more levels in stage-2 than on the host?

> >
> >>+ *
> >>+ * We base our stage-2 page table walker helpers based on this assumption and
> >>+ * fallback to using the host version of the helper wherever possible.
> >>+ * i.e, if a particular level is not folded (e.g, PUD) at stage2, we fall back
> >>+ * to using the host version, since it is guaranteed it is not folded at host.
> >
> >I don't really understand why it's desirable to fall back to the host
> >version of the helpers; in fact I would probably prefer to just have it
> >disjoint, but maybe I'll see the reason when going through the patch
> >more. But I doubt the value of this particular piece of commentary...
>
> OK
>
> >
> >>+ *
> >>+ * TODO: We could lift this limitation easily by rearranging the host level
> >>+ * definitions to a more reusable version.
> >> */
> >
> >So is this really a TODO: based on the above?
> >
> >>-#if PTRS_PER_S2_PGD <= 16
> >>-#define KVM_PREALLOC_LEVEL (4 - CONFIG_PGTABLE_LEVELS + 1)
> >>-#else
> >>-#define KVM_PREALLOC_LEVEL (0)
> >>+#if STAGE2_PGTABLE_LEVELS > CONFIG_PGTABLE_LEVELS
> >>+#error "Unsupported combination of guest IPA and host VA_BITS."
> >> #endif
> >>
> >>+
>
>
> >
> >Can we add a comment as to what this defines exactly? Something like:
> >/*
> > * PGDIR_SHIFT determines the size a top-level stage2 page table entry can map
> > */
>
> Done.
>
> >>+#define S2_PGDIR_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - STAGE2_PGTABLE_LEVELS)
> >>+#define S2_PGDIR_SIZE (_AC(1, UL) << S2_PGDIR_SHIFT)
> >>+#define S2_PGDIR_MASK (~(S2_PGDIR_SIZE - 1))
> >>+
> >>+/* We can have concatenated tables at stage2 entry. */
> >
> >I'm not sure if the comment is helpful. How about:
> >
> >/*
> > * The number of PTRS across all concatenated stage2 tables given by the
> > * number of bits resolved at the initial level.
> > */
> >
>
> OK
>

Thanks for trying to parse my cryptic and potentially nonsensical
questions.

-Christoffer

2016-04-12 13:03:30

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH 15/17] kvm: arm64: Get rid of fake page table levels

On 12/04/16 13:14, Christoffer Dall wrote:
> On Mon, Apr 11, 2016 at 03:33:45PM +0100, Suzuki K Poulose wrote:
>> On 08/04/16 16:05, Christoffer Dall wrote:
>>> On Mon, Apr 04, 2016 at 05:26:15PM +0100, Suzuki K Poulose wrote:
>>

>>>> -#if PGDIR_SHIFT > KVM_PHYS_SHIFT
>>>> -#define PTRS_PER_S2_PGD_SHIFT 0
>>>> -#else
>>>> -#define PTRS_PER_S2_PGD_SHIFT (KVM_PHYS_SHIFT - PGDIR_SHIFT)
>>>> -#endif
>>>> -#define PTRS_PER_S2_PGD (1 << PTRS_PER_S2_PGD_SHIFT)
>>>> +#define STAGE2_PGTABLE_LEVELS ARM64_HW_PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4)
>>>>
>>>> /*
>>>> - * If we are concatenating first level stage-2 page tables, we would have less
>>>> - * than or equal to 16 pointers in the fake PGD, because that's what the
>>>> - * architecture allows. In this case, (4 - CONFIG_PGTABLE_LEVELS)
>>>> - * represents the first level for the host, and we add 1 to go to the next
>>>> - * level (which uses contatenation) for the stage-2 tables.
>
> just noticed: s/contatenation/concatenation/

Thanks for catching that. Will fix it.

>>> which case this should be reworded to just state the assumptions and why
>>> this is a good assumption.
>>>
>>> (If my assumptions are wrong here, then there are also weird cases where
>>> the host does huge pages at the PMD level and we don't. Not sure I can
>>> see the full ramifications of that.)
>>
>> I am sorry, I didn't get your point about the PMD level.
>>
>
> Right, I expressed that terribly, and I may have gotten myself confused
> when writing that.
>
> My concern is this: if the number of levels between the host and stage-2
> are different, and the host uses huge pmd mappings (either THP or huge
> tlb fs), then do we always do the right thing for stage-2 tables, even
> if we support the case with more levels in Stage-2 than on the host?

Yes. We are safe with PMD (level 2). In the worst case we will have a 2-level
page table at host and, say, 4 at stage2. In either case, we have a PMD level
(which in the host is folded into the PGD) and both will use a section mapping at
level 2 for a huge pmd. Also, pmd_huge() doesn't care whether the PMD is folded
or not, so we are fine with that. I hope that answers your question.
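
To make the "folded PMD" point concrete, here is a minimal sketch of why such a
check is indifferent to folding. The constants and helper below are purely
illustrative (the real checks are pmd_huge()/pmd_thp_or_huge() in the arm64
code); the point is that an ARMv8 block (huge) descriptor has bit 1 clear while
a table descriptor has bit 1 set, so the test only looks at the descriptor
itself and never asks how many levels the walk has.

#include <stdint.h>
#include <stdio.h>

#define DESC_VALID	(1UL << 0)
#define DESC_TABLE_BIT	(1UL << 1)	/* set: table descriptor, clear: block */

/* Illustrative stand-in for a pmd_huge()-style check */
static int is_huge(uint64_t desc)
{
	return (desc & DESC_VALID) && !(desc & DESC_TABLE_BIT);
}

int main(void)
{
	uint64_t block = 0x40000000UL | DESC_VALID;			/* block (huge) mapping */
	uint64_t table = 0x80001000UL | DESC_VALID | DESC_TABLE_BIT;	/* next-level table */

	printf("block: huge=%d, table: huge=%d\n", is_huge(block), is_huge(table));
	return 0;
}

Whether the entry was read from a real level-2 table or from a folded level makes
no difference to a test of this shape.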

>
> Thanks for trying to parse my crytptic and potentially nonsensical
> questions.

No no, they make perfect sense. I had gone through these questions myself initially
when I wrote the series, so it's good to share them :-).

Btw, I have rebased my series to kvmarm and have addressed the comments. I will post
them after a round of testing.

Cheers
Suzuki

2016-04-12 13:11:40

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 15/17] kvm: arm64: Get rid of fake page table levels

On Tue, Apr 12, 2016 at 02:03:25PM +0100, Suzuki K Poulose wrote:
> On 12/04/16 13:14, Christoffer Dall wrote:
> >On Mon, Apr 11, 2016 at 03:33:45PM +0100, Suzuki K Poulose wrote:
> >>On 08/04/16 16:05, Christoffer Dall wrote:
> >>>On Mon, Apr 04, 2016 at 05:26:15PM +0100, Suzuki K Poulose wrote:
> >>
>
> >>>>-#if PGDIR_SHIFT > KVM_PHYS_SHIFT
> >>>>-#define PTRS_PER_S2_PGD_SHIFT 0
> >>>>-#else
> >>>>-#define PTRS_PER_S2_PGD_SHIFT (KVM_PHYS_SHIFT - PGDIR_SHIFT)
> >>>>-#endif
> >>>>-#define PTRS_PER_S2_PGD (1 << PTRS_PER_S2_PGD_SHIFT)
> >>>>+#define STAGE2_PGTABLE_LEVELS ARM64_HW_PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4)
> >>>>
> >>>> /*
> >>>>- * If we are concatenating first level stage-2 page tables, we would have less
> >>>>- * than or equal to 16 pointers in the fake PGD, because that's what the
> >>>>- * architecture allows. In this case, (4 - CONFIG_PGTABLE_LEVELS)
> >>>>- * represents the first level for the host, and we add 1 to go to the next
> >>>>- * level (which uses contatenation) for the stage-2 tables.
> >
> >just noticed: s/contatenation/concatenation/
>
> Thanks for catching that. Will fix it.
>
> >>>which case this should be reworded to just state the assumptions and why
> >>>this is a good assumption.
> >>>
> >>>(If my assumptions are wrong here, then there are also weird cases where
> >>>the host does huge pages at the PMD level and we don't. Not sure I can
> >>>see the full ramifications of that.)
> >>
> >>I am sorry, I didn't get your point about the PMD level.
> >>
> >
> >Right, I expressed that terribly, and I may have gotten myself confused
> >when writing that.
> >
> >My concern is this: if the number of levels between the host and stage-2
> >are different, and the host uses huge pmd mappings (either THP or huge
> >tlb fs), then do we always do the right thing for stage-2 tables, even
> >if we support the case with more levels in Stage-2 than on the host?
>
> Yes. We are safe with PMD (level 2). In the worst case we will have a 2level
> page table at host and say 4 at stage2. In either case, we have a PMD level
> (which in host is folded to PGD) and both will use section mapping at level 2
> for huge pmd. Also pmd_huge() doesn't care if the PMD is folded or not. So,
> we are fine with that. I hope that answers your question.
>

It does, thanks for enumerating the cases for me.

> >
> >Thanks for trying to parse my crytptic and potentially nonsensical
> >questions.
>
> No no, they make absolute sense. I had gone through these questions myself initially
> when I wrote the series, so its good to share them :-).
>
> Btw, I have rebased my series to kvmarm and have addressed the comments. I will post
> them after a round of testing.
>

Great, thanks!

-Christoffer

2016-04-13 17:50:05

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH 15/17] kvm: arm64: Get rid of fake page table levels

On 12/04/16 13:14, Christoffer Dall wrote:
> On Mon, Apr 11, 2016 at 03:33:45PM +0100, Suzuki K Poulose wrote:
>> On 08/04/16 16:05, Christoffer Dall wrote:
>>> On Mon, Apr 04, 2016 at 05:26:15PM +0100, Suzuki K Poulose wrote:
>>
>>>> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
>>>> index 751227d..139b4db 100644
>>>> --- a/arch/arm64/include/asm/stage2_pgtable.h
>>>> +++ b/arch/arm64/include/asm/stage2_pgtable.h
>>>> @@ -22,32 +22,55 @@
>>>> #include <asm/pgtable.h>
>>>>
>>>> /*
>>>> - * In the case where PGDIR_SHIFT is larger than KVM_PHYS_SHIFT, we can address
>>>> - * the entire IPA input range with a single pgd entry, and we would only need
>>>> - * one pgd entry. Note that in this case, the pgd is actually not used by
>>>> - * the MMU for Stage-2 translations, but is merely a fake pgd used as a data
>>>> - * structure for the kernel pgtable macros to work.
>>>> + * The hardware mandates concatenation of upto 16 tables at stage2 entry level.
>>>
>>> s/upto/up to/
>>>
>>>> + * Now, the minimum number of bits resolved at any level is (PAGE_SHIFT - 3),
>>>> + * or in other words log2(PTRS_PER_PTE). On arm64, the smallest PAGE_SIZE
>>>
>>> not sure the log2 comment helps here.
>>
>> OK, will address both the above comments.
>>
>>>
>>>> + * supported is 4k, which means (PAGE_SHIFT - 3) > 4 holds for all page sizes.
>>>> + * This implies, the total number of page table levels at stage2 expected
>>>> + * by the hardware is actually the number of levels required for (KVM_PHYS_SHIFT - 4)
>>>> + * in normal translations(e.g, stage-1), since we cannot have another level in
>>>> + * the range (KVM_PHYS_SHIFT, KVM_PHYS_SHIFT - 4).
>>>
>>> Is it not a design decision to always choose the maximum number of
>>> concatinated initial-level stage2 tables (with the constraint that
>>> there's a minimum number required)?
>>

I have changed the above comment to:

/*
* The hardware supports concatenation of up to 16 tables at stage2 entry level
* and we use the feature whenever possible.
*
* Now, the minimum number of bits resolved at any level is (PAGE_SHIFT - 3).
* On arm64, the smallest PAGE_SIZE supported is 4k, which means
* (PAGE_SHIFT - 3) > 4 holds for all page sizes.
* This implies, the total number of page table levels at stage2 expected
* by the hardware is actually the number of levels required for (KVM_PHYS_SHIFT - 4)
* in normal translations(e.g, stage1), since we cannot have another level in
* the range (KVM_PHYS_SHIFT, KVM_PHYS_SHIFT - 4).
*/
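
As a worked instance of the above (again only a sketch: it assumes
ARM64_HW_PGTABLE_LEVEL_SHIFT(n) expands to (PAGE_SHIFT - 3) * (4 - (n)) + 3 and
uses the fixed KVM_PHYS_SHIFT of 40), the 16K case comes out as follows, which
is where the "up to 256K" figure earlier in the thread comes from:

#include <stdio.h>

#define PAGE_SHIFT		14	/* 16K pages, for illustration */
#define KVM_PHYS_SHIFT		40	/* 40bit IPA */
#define HW_PGTABLE_LEVELS(va)		(((va) - 4) / (PAGE_SHIFT - 3))
#define HW_PGTABLE_LEVEL_SHIFT(n)	((PAGE_SHIFT - 3) * (4 - (n)) + 3)

int main(void)
{
	int levels = HW_PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4);	/* 2 */
	int pgdir_shift = HW_PGTABLE_LEVEL_SHIFT(4 - levels);	/* 25 */
	long entries = 1L << (KVM_PHYS_SHIFT - pgdir_shift);	/* 32768 */

	/* 32768 entries * 8 bytes = 256KB, i.e. 16 concatenated 16K tables */
	printf("levels=%d pgdir_shift=%d entries=%ld size=%ldKB\n",
	       levels, pgdir_shift, entries, entries * 8 / 1024);
	return 0;
}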



>>>> + * At the moment, we do not support a combination of guest IPA and host VA_BITS
>>>> + * where
>>>> + * STAGE2_PGTABLE_LEVELS > CONFIG_PGTABLE_LEVELS
>>>
>>> can you change this comment to reverse the statement to avoid someone
>>> seeing this as a constraint, when in fact it's a negative invariant?
>>>
>>> So the case we don't support is a sufficiently larger IPA space compared
>>> to the host VA space such that the above happens? (Since at the same
>>> IPA space size as host VA space size, the stage-2 levels will always be
>>> less than or equal to the host levels.)
>>
>> Correct.
>>
>>>
>>> I don't see how that would ever work with userspace either so I think
>>> this is a safe assumption and not something that ever needs fixing. In
>>
>> For e.g, we can perfectly run a guest with 40bit IPA under a host with 16K+36bit
>> VA. The moment we go above 40bit IPA, we could trigger the conditions above.
>> I think it is perfectly fine for the guest to choose higher IPA width, and place
>> its memory well above as long as the qemu/lkvm doesn't exhaust its VA. I just
>> tried booting a VM with memory at 0x70_0000_0000 on a 16K+36bitVA host and it
>> boots perfectly fine.
>>
>
> Right, I was thinking about it as providing more than 36bits of *memory*
> not address space in this case, so you're right, it is at least a
> theoretically possible case.
>

I have reworded the comment as follows:
/*
* With all the supported VA_BITs and 40bit guest IPA, the following condition
* is always true:
*
* CONFIG_PGTABLE_LEVELS >= STAGE2_PGTABLE_LEVELS
*
* We base our stage-2 page table walker helpers on this assumption and
* fall back to using the host version of the helper wherever possible.
* i.e, if a particular level is not folded (e.g, PUD) at stage2, we fall back
* to using the host version, since it is guaranteed it is not folded at host.
*
* If the condition breaks in the future, we can rearrange the host level
* definitions and reuse them for stage2. Till then...
*/
#if STAGE2_PGTABLE_LEVELS > CONFIG_PGTABLE_LEVELS
#error "Unsupported combination of guest IPA and host VA_BITS."
#endif

---

Please let me know your comments.


Cheers
Suzuki

2016-04-14 12:17:43

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH 15/17] kvm: arm64: Get rid of fake page table levels

On Wed, Apr 13, 2016 at 06:49:59PM +0100, Suzuki K Poulose wrote:
> On 12/04/16 13:14, Christoffer Dall wrote:
> >On Mon, Apr 11, 2016 at 03:33:45PM +0100, Suzuki K Poulose wrote:
> >>On 08/04/16 16:05, Christoffer Dall wrote:
> >>>On Mon, Apr 04, 2016 at 05:26:15PM +0100, Suzuki K Poulose wrote:
> >>
> >>>>diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
> >>>>index 751227d..139b4db 100644
> >>>>--- a/arch/arm64/include/asm/stage2_pgtable.h
> >>>>+++ b/arch/arm64/include/asm/stage2_pgtable.h
> >>>>@@ -22,32 +22,55 @@
> >>>> #include <asm/pgtable.h>
> >>>>
> >>>> /*
> >>>>- * In the case where PGDIR_SHIFT is larger than KVM_PHYS_SHIFT, we can address
> >>>>- * the entire IPA input range with a single pgd entry, and we would only need
> >>>>- * one pgd entry. Note that in this case, the pgd is actually not used by
> >>>>- * the MMU for Stage-2 translations, but is merely a fake pgd used as a data
> >>>>- * structure for the kernel pgtable macros to work.
> >>>>+ * The hardware mandates concatenation of upto 16 tables at stage2 entry level.
> >>>
> >>>s/upto/up to/
> >>>
> >>>>+ * Now, the minimum number of bits resolved at any level is (PAGE_SHIFT - 3),
> >>>>+ * or in other words log2(PTRS_PER_PTE). On arm64, the smallest PAGE_SIZE
> >>>
> >>>not sure the log2 comment helps here.
> >>
> >>OK, will address both the above comments.
> >>
> >>>
> >>>>+ * supported is 4k, which means (PAGE_SHIFT - 3) > 4 holds for all page sizes.
> >>>>+ * This implies, the total number of page table levels at stage2 expected
> >>>>+ * by the hardware is actually the number of levels required for (KVM_PHYS_SHIFT - 4)
> >>>>+ * in normal translations(e.g, stage-1), since we cannot have another level in
> >>>>+ * the range (KVM_PHYS_SHIFT, KVM_PHYS_SHIFT - 4).
> >>>
> >>>Is it not a design decision to always choose the maximum number of
> >>>concatinated initial-level stage2 tables (with the constraint that
> >>>there's a minimum number required)?
> >>
>
> I have changed the above comment to :
>
> /*
> * The hardware supports concatenation of up to 16 tables at stage2 entry level
> * and we use the feature whenever possible.
> *
> * Now, the minimum number of bits resolved at any level is (PAGE_SHIFT - 3).
> * On arm64, the smallest PAGE_SIZE supported is 4k, which means
> * (PAGE_SHIFT - 3) > 4 holds for all page sizes.
> * This implies, the total number of page table levels at stage2 expected
> * by the hardware is actually the number of levels required for (KVM_PHYS_SHIFT - 4)
> * in normal translations(e.g, stage1), since we cannot have another level in
> * the range (KVM_PHYS_SHIFT, KVM_PHYS_SHIFT - 4).
> */
>
>

Looks great.

>
> >>>>+ * At the moment, we do not support a combination of guest IPA and host VA_BITS
> >>>>+ * where
> >>>>+ * STAGE2_PGTABLE_LEVELS > CONFIG_PGTABLE_LEVELS
> >>>
> >>>can you change this comment to reverse the statement to avoid someone
> >>>seeing this as a constraint, when in fact it's a negative invariant?
> >>>
> >>>So the case we don't support is a sufficiently larger IPA space compared
> >>>to the host VA space such that the above happens? (Since at the same
> >>>IPA space size as host VA space size, the stage-2 levels will always be
> >>>less than or equal to the host levels.)
> >>
> >>Correct.
> >>
> >>>
> >>>I don't see how that would ever work with userspace either so I think
> >>>this is a safe assumption and not something that ever needs fixing. In
> >>
> >>For e.g, we can perfectly run a guest with 40bit IPA under a host with 16K+36bit
> >>VA. The moment we go above 40bit IPA, we could trigger the conditions above.
> >>I think it is perfectly fine for the guest to choose higher IPA width, and place
> >>its memory well above as long as the qemu/lkvm doesn't exhaust its VA. I just
> >>tried booting a VM with memory at 0x70_0000_0000 on a 16K+36bitVA host and it
> >>boots perfectly fine.
> >>
> >
> >Right, I was thinking about it as providing more than 36bits of *memory*
> >not address space in this case, so you're right, it is at least a
> >theoretically possible case.
> >
>
> I have reworded the comment as follows:
> /*
> * With all the supported VA_BITs and 40bit guest IPA, the following condition
> * is always true:
> *
> * CONFIG_PGTABLE_LEVELS >= STAGE2_PGTABLE_LEVELS
> *
> * We base our stage-2 page table walker helpers on this assumption and
> * fall back to using the host version of the helper wherever possible.
> * i.e, if a particular level is not folded (e.g, PUD) at stage2, we fall back
> * to using the host version, since it is guaranteed it is not folded at host.
> *
> * If the condition breaks in the future, we can rearrange the host level
> * definitions and reuse them for stage2. Till then...
> */
> #if STAGE2_PGTABLE_LEVELS > CONFIG_PGTABLE_LEVELS
> #error "Unsupported combination of guest IPA and host VA_BITS."
> #endif
>
> ---
>
> Please let me know your comments.
>
>
Looks good, thanks for the cleanup!

-Christoffer