2023-11-23 06:58:15

by Xu Lu

Subject: [RFC PATCH V1 00/11] riscv: Introduce 64K base page

Some existing architectures, such as ARM, support base pages larger than
4K because their MMUs support multiple page sizes. Thus, besides hugetlb
pages and transparent huge pages, these architectures have another way
to enjoy the benefit of fewer TLB misses without worrying about the cost
of splitting and merging huge pages. However, on architectures whose MMU
only supports 4K pages, a larger base page is currently unavailable.

This patch series attempts to break through this MMU limitation and
support a larger base page on RISC-V, which currently only supports a
4K page size.

The key idea behind implementing a larger base page on top of a 4K MMU
is to decouple the MMU page from the base page as seen by kernel mm,
which we denote as the software page. In contrast to the software page,
we denote the MMU page as the hardware page. Below is the difference
between these two kinds of pages.

1. The kernel memory management module manages, allocates and maps
memory at the granularity of the software page, which is not restricted
by the MMU and can be larger than the hardware page.

2. Architecture page table operations are carried out from the MMU's
perspective, and page table entries are encoded at the granularity of
the hardware page, which is currently 4K on the RISC-V MMU.

The main work to decouple these two kinds of pages lies in architecture
code. For example, we turn the pte_t struct into an array of page table
entries to match the software page, which can be larger than the
hardware page, and adapt the page table operations accordingly. For a
64K software base page, the pte_t struct now contains 16 contiguous page
table entries which point to 16 contiguous 4K hardware pages.
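
As a rough illustration (a simplified sketch rather than the exact
definition added in this series; HW_PTES_PER_PTE is just a name used
here for the 16:1 ratio):

  #define HW_PAGE_SHIFT   12
  #define PAGE_SHIFT      16
  #define HW_PTES_PER_PTE (1UL << (PAGE_SHIFT - HW_PAGE_SHIFT))  /* 16 */

  /* one software pte covers 16 contiguous 4K hardware ptes */
  typedef struct {
          unsigned long ptes[HW_PTES_PER_PTE];
  } pte_t;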

To achieve the benefits of a large base page, we apply Svnapot to each
base page's mapping. The Svnapot extension on RISC-V is similar to the
contiguous PTE feature on ARM64. It allows the ptes of a naturally
aligned power-of-2 sized memory range to be encoded in the same format
to save TLB space.
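
For reference, a simplified view of how one 64K software page ends up
encoded (this mirrors the pte_mknapot()/__pte_mknapot() helpers touched
in patch 08; hardware page order 4 here means 16 hardware pages):

  /*
   * Bit 63 (N) marks the pte as NAPOT. Within the PFN field, bit
   * (order - 1) is set and the bits below it are cleared, so all
   * 16 hardware ptes of the 64K range carry the same value and can
   * share a single TLB entry.
   */
  pteval = (pteval & ~GENMASK(_PAGE_PFN_SHIFT + 3, _PAGE_PFN_SHIFT))
           | BIT(_PAGE_PFN_SHIFT + 3) | _PAGE_NAPOT;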

This patch series is the first version and is based on v6.7-rc1. This
version supports both bare metal and virtualization scenarios.

In later versions, we will continue with the following work:

1. Reduce the memory usage of page table pages, as each one only uses 4K
of space but costs a whole base page.

2. When the IMSIC interrupt file is smaller than 64K, extra isolation
measures for the interrupt file are needed. (S)PMP and IOPMP may be good
choices.

3. More consideration is needed to make this patch series work better
with folios.

4. Support 64K base page on IOMMU.

5. Performance tests are scheduled to verify the actual performance
improvement and the decrease in TLB miss rate.

Thanks in advance for comments.

Xu Lu (11):
mm: Fix misused APIs on huge pte
riscv: Introduce concept of hardware base page
riscv: Adapt pte struct to gap between hw page and sw page
riscv: Adapt pte operations to gap between hw page and sw page
riscv: Decouple pmd operations and pte operations
riscv: Distinguish pmd huge pte and napot huge pte
riscv: Adapt satp operations to gap between hw page and sw page
riscv: Apply Svnapot for base page mapping
riscv: Adjust fix_btmap slots number to match variable page size
riscv: kvm: Adapt kvm to gap between hw page and sw page
riscv: Introduce 64K page size

arch/Kconfig | 1 +
arch/riscv/Kconfig | 28 +++
arch/riscv/include/asm/fixmap.h | 3 +-
arch/riscv/include/asm/hugetlb.h | 71 ++++++-
arch/riscv/include/asm/page.h | 16 +-
arch/riscv/include/asm/pgalloc.h | 21 ++-
arch/riscv/include/asm/pgtable-32.h | 2 +-
arch/riscv/include/asm/pgtable-64.h | 45 +++--
arch/riscv/include/asm/pgtable.h | 282 +++++++++++++++++++++++-----
arch/riscv/kernel/efi.c | 2 +-
arch/riscv/kernel/head.S | 4 +-
arch/riscv/kernel/hibernate.c | 3 +-
arch/riscv/kvm/mmu.c | 198 +++++++++++++------
arch/riscv/mm/context.c | 7 +-
arch/riscv/mm/fault.c | 1 +
arch/riscv/mm/hugetlbpage.c | 42 +++--
arch/riscv/mm/init.c | 25 +--
arch/riscv/mm/kasan_init.c | 7 +-
arch/riscv/mm/pageattr.c | 2 +-
fs/proc/task_mmu.c | 2 +-
include/asm-generic/hugetlb.h | 7 +
include/asm-generic/pgtable-nopmd.h | 1 +
include/linux/pgtable.h | 6 +
mm/hugetlb.c | 2 +-
mm/migrate.c | 5 +-
mm/mprotect.c | 2 +-
mm/rmap.c | 10 +-
mm/vmalloc.c | 3 +-
28 files changed, 616 insertions(+), 182 deletions(-)

--
2.20.1


2023-11-23 06:58:20

by Xu Lu

Subject: [RFC PATCH V1 05/11] riscv: Decouple pmd operations and pte operations

Existing pmd operations are usually implemented via pte operations. For
example, the pmd_mkdirty function, which marks a pmd_t struct as dirty,
first converts the pmd_t struct to a pte_t struct via pmd_pte, then
marks the generated pte_t as dirty, and finally converts it back to a
pmd_t struct via the pte_pmd function. Such an implementation introduces
unnecessary struct conversion overhead. Also, now that the pte_t struct
holds multiple page table entries and can be larger than the pmd_t
struct, functions like set_pmd_at implemented via set_pte_at will cause
write amplification.

This commit decouples pmd operations from pte operations. Pmd operations
are now implemented independently of pte operations.
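
As a simplified illustration of the change (the real hunks follow):

  /*
   * Before this patch:
   *         return pte_pmd(pte_mkdirty(pmd_pte(pmd)));
   * After this patch the pmd value is manipulated directly:
   */
  static inline pmd_t pmd_mkdirty(pmd_t pmd)
  {
          return __pmd(pmd_val(pmd) | _PAGE_DIRTY);
  }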

Signed-off-by: Xu Lu <[email protected]>
---
arch/riscv/include/asm/pgtable-64.h | 6 ++
arch/riscv/include/asm/pgtable.h | 124 +++++++++++++++++++++-------
include/asm-generic/pgtable-nopmd.h | 1 +
include/linux/pgtable.h | 6 ++
4 files changed, 108 insertions(+), 29 deletions(-)

diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index 1926727698fc..95e785f2160c 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -206,6 +206,12 @@ static inline int pud_leaf(pud_t pud)
return pud_present(pud) && (pud_val(pud) & _PAGE_LEAF);
}

+#define pud_exec pud_exec
+static inline int pud_exec(pud_t pud)
+{
+ return pud_val(pud) & _PAGE_EXEC;
+}
+
static inline int pud_user(pud_t pud)
{
return pud_val(pud) & _PAGE_USER;
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index d50c4588c1ed..9f81fe046cb8 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -272,6 +272,18 @@ static inline int pmd_leaf(pmd_t pmd)
return pmd_present(pmd) && (pmd_val(pmd) & _PAGE_LEAF);
}

+#define pmd_exec pmd_exec
+static inline int pmd_exec(pmd_t pmd)
+{
+ return pmd_val(pmd) & _PAGE_EXEC;
+}
+
+#define __HAVE_ARCH_PMD_SAME
+static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
+{
+ return pmd_val(pmd_a) == pmd_val(pmd_b);
+}
+
static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
{
*pmdp = pmd;
@@ -506,7 +518,7 @@ static inline int pte_protnone(pte_t pte)

static inline int pmd_protnone(pmd_t pmd)
{
- return pte_protnone(pmd_pte(pmd));
+ return (pmd_val(pmd) & (_PAGE_PRESENT | _PAGE_PROT_NONE)) == _PAGE_PROT_NONE;
}
#endif

@@ -740,73 +752,95 @@ static inline unsigned long pud_pfn(pud_t pud)

static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
{
- return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
+ unsigned long newprot_val = pgprot_val(newprot);
+
+ ALT_THEAD_PMA(newprot_val);
+
+ return __pmd((pmd_val(pmd) & _PAGE_CHG_MASK) | newprot_val);
}

#define pmd_write pmd_write
static inline int pmd_write(pmd_t pmd)
{
- return pte_write(pmd_pte(pmd));
+ return pmd_val(pmd) & _PAGE_WRITE;
}

static inline int pmd_dirty(pmd_t pmd)
{
- return pte_dirty(pmd_pte(pmd));
+ return pmd_val(pmd) & _PAGE_DIRTY;
}

#define pmd_young pmd_young
static inline int pmd_young(pmd_t pmd)
{
- return pte_young(pmd_pte(pmd));
+ return pmd_val(pmd) & _PAGE_ACCESSED;
}

static inline int pmd_user(pmd_t pmd)
{
- return pte_user(pmd_pte(pmd));
+ return pmd_val(pmd) & _PAGE_USER;
}

static inline pmd_t pmd_mkold(pmd_t pmd)
{
- return pte_pmd(pte_mkold(pmd_pte(pmd)));
+ return __pmd(pmd_val(pmd) & ~(_PAGE_ACCESSED));
}

static inline pmd_t pmd_mkyoung(pmd_t pmd)
{
- return pte_pmd(pte_mkyoung(pmd_pte(pmd)));
+ return __pmd(pmd_val(pmd) | _PAGE_ACCESSED);
}

static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
{
- return pte_pmd(pte_mkwrite_novma(pmd_pte(pmd)));
+ return __pmd(pmd_val(pmd) | _PAGE_WRITE);
}

static inline pmd_t pmd_wrprotect(pmd_t pmd)
{
- return pte_pmd(pte_wrprotect(pmd_pte(pmd)));
+ return __pmd(pmd_val(pmd) & (~_PAGE_WRITE));
}

static inline pmd_t pmd_mkclean(pmd_t pmd)
{
- return pte_pmd(pte_mkclean(pmd_pte(pmd)));
+ return __pmd(pmd_val(pmd) & (~_PAGE_DIRTY));
}

static inline pmd_t pmd_mkdirty(pmd_t pmd)
{
- return pte_pmd(pte_mkdirty(pmd_pte(pmd)));
+ return __pmd(pmd_val(pmd) | _PAGE_DIRTY);
+}
+
+#define pmd_accessible(mm, pmd) ((void)(pmd), 1)
+
+static inline void __set_pmd_at(pmd_t *pmdp, pmd_t pmd)
+{
+ if (pmd_present(pmd) && pmd_exec(pmd))
+ flush_icache_pte(pmd_pte(pmd));
+
+ set_pmd(pmdp, pmd);
}

static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp, pmd_t pmd)
{
page_table_check_pmd_set(mm, pmdp, pmd);
- return __set_pte_at((pte_t *)pmdp, pmd_pte(pmd));
+ return __set_pmd_at(pmdp, pmd);
+}
+
+static inline void __set_pud_at(pud_t *pudp, pud_t pud)
+{
+ if (pud_present(pud) && pud_exec(pud))
+ flush_icache_pte(pud_pte(pud));
+
+ set_pud(pudp, pud);
}

static inline void set_pud_at(struct mm_struct *mm, unsigned long addr,
pud_t *pudp, pud_t pud)
{
page_table_check_pud_set(mm, pudp, pud);
- return __set_pte_at((pte_t *)pudp, pud_pte(pud));
+ return __set_pud_at(pudp, pud);
}

#ifdef CONFIG_PAGE_TABLE_CHECK
@@ -826,25 +860,64 @@ static inline bool pud_user_accessible_page(pud_t pud)
}
#endif

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline int pmd_trans_huge(pmd_t pmd)
-{
- return pmd_leaf(pmd);
-}
-
#define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp,
pmd_t entry, int dirty)
{
- return ptep_set_access_flags(vma, address, (pte_t *)pmdp, pmd_pte(entry), dirty);
+ if (!pmd_same(*pmdp, entry))
+ set_pmd_at(vma->vm_mm, address, pmdp, entry);
+ /*
+ * update_mmu_cache will unconditionally execute, handling both
+ * the case that the PMD changed and the spurious fault case.
+ */
+ return true;
+}
+
+#define __HAVE_ARCH_PMDP_GET_AND_CLEAR
+static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm,
+ unsigned long address, pmd_t *pmdp)
+{
+ pmd_t pmd = __pmd(atomic_long_xchg((atomic_long_t *)pmdp, 0));
+
+ page_table_check_pmd_clear(mm, pmd);
+
+ return pmd;
+}
+
+#define __HAVE_ARCH_PMDP_SET_WRPROTECT
+static inline void pmdp_set_wrprotect(struct mm_struct *mm,
+ unsigned long address, pmd_t *pmdp)
+{
+ atomic_long_and(~(unsigned long)_PAGE_WRITE, (atomic_long_t *)pmdp);
+}
+
+#define __HAVE_ARCH_PMDP_CLEAR_FLUSH
+static inline pmd_t pmdp_clear_flush(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp)
+{
+ struct mm_struct *mm = (vma)->vm_mm;
+ pmd_t pmd = pmdp_get_and_clear(mm, address, pmdp);
+
+ if (pmd_accessible(mm, pmd))
+ flush_tlb_page(vma, address);
+
+ return pmd;
}

#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp)
{
- return ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
+ if (!pmd_young(*pmdp))
+ return 0;
+ return test_and_clear_bit(_PAGE_ACCESSED_OFFSET, &pmd_val(*pmdp));
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int pmd_trans_huge(pmd_t pmd)
+{
+ return pmd_leaf(pmd);
}

#define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
@@ -858,13 +931,6 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
return pmd;
}

-#define __HAVE_ARCH_PMDP_SET_WRPROTECT
-static inline void pmdp_set_wrprotect(struct mm_struct *mm,
- unsigned long address, pmd_t *pmdp)
-{
- ptep_set_wrprotect(mm, address, (pte_t *)pmdp);
-}
-
#define pmdp_establish pmdp_establish
static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp, pmd_t pmd)
diff --git a/include/asm-generic/pgtable-nopmd.h b/include/asm-generic/pgtable-nopmd.h
index 8ffd64e7a24c..acef201b29f5 100644
--- a/include/asm-generic/pgtable-nopmd.h
+++ b/include/asm-generic/pgtable-nopmd.h
@@ -32,6 +32,7 @@ static inline int pud_bad(pud_t pud) { return 0; }
static inline int pud_present(pud_t pud) { return 1; }
static inline int pud_user(pud_t pud) { return 0; }
static inline int pud_leaf(pud_t pud) { return 0; }
+static inline int pud_exec(pud_t pud) { return 0; }
static inline void pud_clear(pud_t *pud) { }
#define pmd_ERROR(pmd) (pud_ERROR((pmd).pud))

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index af7639c3b0a3..b8d6e39fefc2 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1630,9 +1630,15 @@ typedef unsigned int pgtbl_mod_mask;
#ifndef pud_leaf
#define pud_leaf(x) 0
#endif
+#ifndef pud_exec
+#define pud_exec(x) 0
+#endif
#ifndef pmd_leaf
#define pmd_leaf(x) 0
#endif
+#ifndef pmd_exec
+#define pmd_exec(x) 0
+#endif

#ifndef pgd_leaf_size
#define pgd_leaf_size(x) (1ULL << PGDIR_SHIFT)
--
2.20.1

2023-11-23 06:58:43

by Xu Lu

Subject: [RFC PATCH V1 04/11] riscv: Adapt pte operations to gap between hw page and sw page

The MMU handles pages at the granularity of the hardware page. That is,
for a 4K MMU, the pfn decoded from a page table entry is always regarded
as a 4K page frame number, no matter how large the software page is.
Thus, page table entries should always be encoded at the granularity of
the hardware page.

This commit makes pte operations aware of the gap between the hw page
and the sw page. All pte operations now configure page table entries via
the hardware page frame number.
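
A quick worked example of the conversion introduced here (assuming the
64K configuration, i.e. PAGE_SHIFT == 16 and HW_PAGE_SHIFT == 12):

  unsigned long hwpfn, phys, pfn;

  hwpfn = pfn_to_hwpfn(1);        /* 1 << 4 == 16 (16th 4K frame) */
  phys = hwpfn << HW_PAGE_SHIFT;  /* 16 << 12 == 0x10000          */
  pfn = hwpfn_to_pfn(hwpfn);      /* back to software pfn 1       */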

Signed-off-by: Xu Lu <[email protected]>
---
arch/riscv/include/asm/page.h | 3 +++
arch/riscv/include/asm/pgalloc.h | 21 ++++++++++-----
arch/riscv/include/asm/pgtable-32.h | 2 +-
arch/riscv/include/asm/pgtable-64.h | 40 ++++++++++++++++++-----------
arch/riscv/include/asm/pgtable.h | 19 +++++++-------
arch/riscv/mm/init.c | 18 ++++++-------
6 files changed, 62 insertions(+), 41 deletions(-)

diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index cbaa7e027f9a..12f2e73ed55b 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -177,6 +177,9 @@ extern phys_addr_t __phys_addr_symbol(unsigned long x);
#define __pa(x) __virt_to_phys((unsigned long)(x))
#define __va(x) ((void *)__pa_to_va_nodebug((phys_addr_t)(x)))

+#define pfn_to_hwpfn(pfn) (pfn << (PAGE_SHIFT - HW_PAGE_SHIFT))
+#define hwpfn_to_pfn(hwpfn) (hwpfn >> (PAGE_SHIFT - HW_PAGE_SHIFT))
+
#define phys_to_pfn(phys) (PFN_DOWN(phys))
#define pfn_to_phys(pfn) (PFN_PHYS(pfn))

diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
index d169a4f41a2e..eab75d5f7093 100644
--- a/arch/riscv/include/asm/pgalloc.h
+++ b/arch/riscv/include/asm/pgalloc.h
@@ -19,32 +19,36 @@ static inline void pmd_populate_kernel(struct mm_struct *mm,
pmd_t *pmd, pte_t *pte)
{
unsigned long pfn = virt_to_pfn(pte);
+ unsigned long hwpfn = pfn_to_hwpfn(pfn);

- set_pmd(pmd, __pmd((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
+ set_pmd(pmd, __pmd((hwpfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
}

static inline void pmd_populate(struct mm_struct *mm,
pmd_t *pmd, pgtable_t pte)
{
unsigned long pfn = virt_to_pfn(page_address(pte));
+ unsigned long hwpfn = pfn_to_hwpfn(pfn);

- set_pmd(pmd, __pmd((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
+ set_pmd(pmd, __pmd((hwpfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
}

#ifndef __PAGETABLE_PMD_FOLDED
static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
{
unsigned long pfn = virt_to_pfn(pmd);
+ unsigned long hwpfn = pfn_to_hwpfn(pfn);

- set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
+ set_pud(pud, __pud((hwpfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
}

static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
{
if (pgtable_l4_enabled) {
unsigned long pfn = virt_to_pfn(pud);
+ unsigned long hwpfn = pfn_to_hwpfn(pfn);

- set_p4d(p4d, __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
+ set_p4d(p4d, __p4d((hwpfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
}
}

@@ -53,9 +57,10 @@ static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d,
{
if (pgtable_l4_enabled) {
unsigned long pfn = virt_to_pfn(pud);
+ unsigned long hwpfn = pfn_to_hwpfn(pfn);

set_p4d_safe(p4d,
- __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
+ __p4d((hwpfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
}
}

@@ -63,8 +68,9 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, p4d_t *p4d)
{
if (pgtable_l5_enabled) {
unsigned long pfn = virt_to_pfn(p4d);
+ unsigned long hwpfn = pfn_to_hwpfn(pfn);

- set_pgd(pgd, __pgd((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
+ set_pgd(pgd, __pgd((hwpfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
}
}

@@ -73,9 +79,10 @@ static inline void pgd_populate_safe(struct mm_struct *mm, pgd_t *pgd,
{
if (pgtable_l5_enabled) {
unsigned long pfn = virt_to_pfn(p4d);
+ unsigned long hwpfn = pfn_to_hwpfn(pfn);

set_pgd_safe(pgd,
- __pgd((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
+ __pgd((hwpfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
}
}

diff --git a/arch/riscv/include/asm/pgtable-32.h b/arch/riscv/include/asm/pgtable-32.h
index 00f3369570a8..dec436e146ae 100644
--- a/arch/riscv/include/asm/pgtable-32.h
+++ b/arch/riscv/include/asm/pgtable-32.h
@@ -20,7 +20,7 @@
/*
* rv32 PTE format:
* | XLEN-1 10 | 9 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
- * PFN reserved for SW D A G U X W R V
+ * HW_PFN reserved for SW D A G U X W R V
*/
#define _PAGE_PFN_MASK GENMASK(31, 10)

diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index c08db54594a9..1926727698fc 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -50,7 +50,7 @@ typedef struct {

#define p4d_val(x) ((x).p4d)
#define __p4d(x) ((p4d_t) { (x) })
-#define PTRS_PER_P4D (PAGE_SIZE / sizeof(p4d_t))
+#define PTRS_PER_P4D (HW_PAGE_SIZE / sizeof(p4d_t))

/* Page Upper Directory entry */
typedef struct {
@@ -59,7 +59,7 @@ typedef struct {

#define pud_val(x) ((x).pud)
#define __pud(x) ((pud_t) { (x) })
-#define PTRS_PER_PUD (PAGE_SIZE / sizeof(pud_t))
+#define PTRS_PER_PUD (HW_PAGE_SIZE / sizeof(pud_t))

/* Page Middle Directory entry */
typedef struct {
@@ -69,12 +69,12 @@ typedef struct {
#define pmd_val(x) ((x).pmd)
#define __pmd(x) ((pmd_t) { (x) })

-#define PTRS_PER_PMD (PAGE_SIZE / sizeof(pmd_t))
+#define PTRS_PER_PMD (HW_PAGE_SIZE / sizeof(pmd_t))

/*
* rv64 PTE format:
* | 63 | 62 61 | 60 54 | 53 10 | 9 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
- * N MT RSV PFN reserved for SW D A G U X W R V
+ * N MT RSV HW_PFN reserved for SW D A G U X W R V
*/
#define _PAGE_PFN_MASK GENMASK(53, 10)

@@ -94,13 +94,23 @@ enum napot_cont_order {
NAPOT_ORDER_MAX,
};

-#define for_each_napot_order(order) \
- for (order = NAPOT_CONT_ORDER_BASE; order < NAPOT_ORDER_MAX; order++)
-#define for_each_napot_order_rev(order) \
- for (order = NAPOT_ORDER_MAX - 1; \
- order >= NAPOT_CONT_ORDER_BASE; order--)
-#define napot_cont_order(val) \
- (__builtin_ctzl((pte_val(val) >> _PAGE_PFN_SHIFT) << 1))
+#define NAPOT_PAGE_ORDER_BASE \
+ ((NAPOT_CONT_ORDER_BASE >= (PAGE_SHIFT - HW_PAGE_SHIFT)) ? \
+ (NAPOT_CONT_ORDER_BASE - (PAGE_SHIFT - HW_PAGE_SHIFT)) : 1)
+#define NAPOT_PAGE_ORDER_MAX \
+ ((NAPOT_ORDER_MAX > (PAGE_SHIFT - HW_PAGE_SHIFT)) ? \
+ (NAPOT_ORDER_MAX - (PAGE_SHIFT - HW_PAGE_SHIFT)) : \
+ NAPOT_PAGE_ORDER_BASE)
+
+#define for_each_napot_order(order) \
+ for (order = NAPOT_PAGE_ORDER_BASE; \
+ order < NAPOT_PAGE_ORDER_MAX; order++)
+#define for_each_napot_order_rev(order) \
+ for (order = NAPOT_PAGE_ORDER_MAX - 1; \
+ order >= NAPOT_PAGE_ORDER_BASE; order--)
+#define napot_cont_order(val) \
+ (__builtin_ctzl((pte_val(val) >> _PAGE_PFN_SHIFT) << 1) \
+ - (PAGE_SHIFT - HW_PAGE_SHIFT))

#define napot_cont_shift(order) ((order) + PAGE_SHIFT)
#define napot_cont_size(order) BIT(napot_cont_shift(order))
@@ -108,7 +118,7 @@ enum napot_cont_order {
#define napot_pte_num(order) BIT(order)

#ifdef CONFIG_RISCV_ISA_SVNAPOT
-#define HUGE_MAX_HSTATE (2 + (NAPOT_ORDER_MAX - NAPOT_CONT_ORDER_BASE))
+#define HUGE_MAX_HSTATE (2 + (NAPOT_ORDER_MAX - NAPOT_PAGE_ORDER_BASE))
#else
#define HUGE_MAX_HSTATE 2
#endif
@@ -213,7 +223,7 @@ static inline void pud_clear(pud_t *pudp)

static inline pud_t pfn_pud(unsigned long pfn, pgprot_t prot)
{
- return __pud((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
+ return __pud((pfn_to_hwpfn(pfn) << _PAGE_PFN_SHIFT) | pgprot_val(prot));
}

static inline unsigned long _pud_pfn(pud_t pud)
@@ -257,7 +267,7 @@ static inline pmd_t pfn_pmd(unsigned long pfn, pgprot_t prot)

ALT_THEAD_PMA(prot_val);

- return __pmd((pfn << _PAGE_PFN_SHIFT) | prot_val);
+ return __pmd((pfn_to_hwpfn(pfn) << _PAGE_PFN_SHIFT) | prot_val);
}

static inline unsigned long _pmd_pfn(pmd_t pmd)
@@ -316,7 +326,7 @@ static inline void p4d_clear(p4d_t *p4d)

static inline p4d_t pfn_p4d(unsigned long pfn, pgprot_t prot)
{
- return __p4d((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
+ return __p4d((pfn_to_hwpfn(pfn) << _PAGE_PFN_SHIFT) | pgprot_val(prot));
}

static inline unsigned long _p4d_pfn(p4d_t p4d)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 342be2112fd2..d50c4588c1ed 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -26,9 +26,9 @@
#endif

/* Number of entries in the page global directory */
-#define PTRS_PER_PGD (PAGE_SIZE / sizeof(pgd_t))
+#define PTRS_PER_PGD (HW_PAGE_SIZE / sizeof(pgd_t))
/* Number of entries in the page table */
-#define PTRS_PER_PTE (PAGE_SIZE / sizeof(pte_t))
+#define PTRS_PER_PTE (HW_PAGE_SIZE / sizeof(pte_t))

/*
* Half of the kernel address space (1/4 of the entries of the page global
@@ -118,7 +118,8 @@
#include <linux/mm_types.h>
#include <asm/compat.h>

-#define __page_val_to_pfn(_val) (((_val) & _PAGE_PFN_MASK) >> _PAGE_PFN_SHIFT)
+#define __page_val_to_hwpfn(_val) (((_val) & _PAGE_PFN_MASK) >> _PAGE_PFN_SHIFT)
+#define __page_val_to_pfn(_val) hwpfn_to_pfn(__page_val_to_hwpfn(_val))

#ifdef CONFIG_64BIT
#include <asm/pgtable-64.h>
@@ -287,7 +288,7 @@ static inline pgd_t pfn_pgd(unsigned long pfn, pgprot_t prot)

ALT_THEAD_PMA(prot_val);

- return __pgd((pfn << _PAGE_PFN_SHIFT) | prot_val);
+ return __pgd((pfn_to_hwpfn(pfn) << _PAGE_PFN_SHIFT) | prot_val);
}

static inline unsigned long _pgd_pfn(pgd_t pgd)
@@ -351,12 +352,12 @@ static inline unsigned long pte_napot(pte_t pte)
/* Yields the page frame number (PFN) of a page table entry */
static inline unsigned long pte_pfn(pte_t pte)
{
- unsigned long res = __page_val_to_pfn(pte_val(pte));
+ unsigned long res = __page_val_to_hwpfn(pte_val(pte));

if (has_svnapot() && pte_napot(pte))
res = res & (res - 1UL);

- return res;
+ return hwpfn_to_pfn(res);
}

#define pte_page(x) pfn_to_page(pte_pfn(x))
@@ -368,7 +369,7 @@ static inline pte_t pfn_pte(unsigned long pfn, pgprot_t prot)

ALT_THEAD_PMA(prot_val);

- return __pte((pfn << _PAGE_PFN_SHIFT) | prot_val);
+ return __pte((pfn_to_hwpfn(pfn) << _PAGE_PFN_SHIFT) | prot_val);
}

#define mk_pte(page, prot) pfn_pte(page_to_pfn(page), prot)
@@ -723,14 +724,14 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd)
return __pmd(pmd_val(pmd) & ~(_PAGE_PRESENT|_PAGE_PROT_NONE));
}

-#define __pmd_to_phys(pmd) (__page_val_to_pfn(pmd_val(pmd)) << PAGE_SHIFT)
+#define __pmd_to_phys(pmd) (__page_val_to_hwpfn(pmd_val(pmd)) << HW_PAGE_SHIFT)

static inline unsigned long pmd_pfn(pmd_t pmd)
{
return ((__pmd_to_phys(pmd) & PMD_MASK) >> PAGE_SHIFT);
}

-#define __pud_to_phys(pud) (__page_val_to_pfn(pud_val(pud)) << PAGE_SHIFT)
+#define __pud_to_phys(pud) (__page_val_to_hwpfn(pud_val(pud)) << HW_PAGE_SHIFT)

static inline unsigned long pud_pfn(pud_t pud)
{
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 2e011cbddf3a..a768b2b3ff05 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -466,7 +466,7 @@ static void __init create_pmd_mapping(pmd_t *pmdp,
pte_phys = pt_ops.alloc_pte(va);
pmdp[pmd_idx] = pfn_pmd(PFN_DOWN(pte_phys), PAGE_TABLE);
ptep = pt_ops.get_pte_virt(pte_phys);
- memset(ptep, 0, PAGE_SIZE);
+ memset(ptep, 0, PTRS_PER_PTE * sizeof(pte_t));
} else {
pte_phys = PFN_PHYS(_pmd_pfn(pmdp[pmd_idx]));
ptep = pt_ops.get_pte_virt(pte_phys);
@@ -569,7 +569,7 @@ static void __init create_pud_mapping(pud_t *pudp,
next_phys = pt_ops.alloc_pmd(va);
pudp[pud_index] = pfn_pud(PFN_DOWN(next_phys), PAGE_TABLE);
nextp = pt_ops.get_pmd_virt(next_phys);
- memset(nextp, 0, PAGE_SIZE);
+ memset(nextp, 0, PTRS_PER_PMD * sizeof(pmd_t));
} else {
next_phys = PFN_PHYS(_pud_pfn(pudp[pud_index]));
nextp = pt_ops.get_pmd_virt(next_phys);
@@ -596,7 +596,7 @@ static void __init create_p4d_mapping(p4d_t *p4dp,
next_phys = pt_ops.alloc_pud(va);
p4dp[p4d_index] = pfn_p4d(PFN_DOWN(next_phys), PAGE_TABLE);
nextp = pt_ops.get_pud_virt(next_phys);
- memset(nextp, 0, PAGE_SIZE);
+ memset(nextp, 0, PTRS_PER_PUD * sizeof(pud_t));
} else {
next_phys = PFN_PHYS(_p4d_pfn(p4dp[p4d_index]));
nextp = pt_ops.get_pud_virt(next_phys);
@@ -654,7 +654,7 @@ void __init create_pgd_mapping(pgd_t *pgdp,
next_phys = alloc_pgd_next(va);
pgdp[pgd_idx] = pfn_pgd(PFN_DOWN(next_phys), PAGE_TABLE);
nextp = get_pgd_next_virt(next_phys);
- memset(nextp, 0, PAGE_SIZE);
+ memset(nextp, 0, PTRS_PER_P4D * sizeof(p4d_t));
} else {
next_phys = PFN_PHYS(_pgd_pfn(pgdp[pgd_idx]));
nextp = get_pgd_next_virt(next_phys);
@@ -815,16 +815,16 @@ static __init void set_satp_mode(uintptr_t dtb_pa)
if (hw_satp != identity_satp) {
if (pgtable_l5_enabled) {
disable_pgtable_l5();
- memset(early_pg_dir, 0, PAGE_SIZE);
+ memset(early_pg_dir, 0, PTRS_PER_PGD * sizeof(pgd_t));
goto retry;
}
disable_pgtable_l4();
}

- memset(early_pg_dir, 0, PAGE_SIZE);
- memset(early_p4d, 0, PAGE_SIZE);
- memset(early_pud, 0, PAGE_SIZE);
- memset(early_pmd, 0, PAGE_SIZE);
+ memset(early_pg_dir, 0, PTRS_PER_PGD * sizeof(pgd_t));
+ memset(early_p4d, 0, PTRS_PER_P4D * sizeof(p4d_t));
+ memset(early_pud, 0, PTRS_PER_PUD * sizeof(pud_t));
+ memset(early_pmd, 0, PTRS_PER_PMD * sizeof(pmd_t));
}
#endif

--
2.20.1

2023-11-23 06:58:45

by Xu Lu

Subject: [RFC PATCH V1 06/11] riscv: Distinguish pmd huge pte and napot huge pte

There exist two kinds of huge pte on RISC-V: pmd huge ptes and napot
huge ptes. A pmd huge pte represents a huge page much larger than the
base page, so it can be represented by a single pmd entry and needs no
special handling. A napot huge pte is actually several normal pte
entries encoded in the Svnapot format, so it should be represented by a
pte_t struct and handled via pte operations.

This commit distinguishes these two kinds of huge pte and handles them
with different operations.
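
The dispatch pattern used below looks roughly like this (a simplified
sketch with a hypothetical _sketch suffix; the napot folding of the
dirty/accessed bits is omitted):

  static pte_t huge_ptep_get_sketch(pte_t *ptep)
  {
          /* a huge "pte" pointer names a pmd entry first of all */
          pte_t orig_pte = pmd_pte(pmdp_get((pmd_t *)ptep));

          /* non-napot: a pmd-sized huge page, return it as is */
          if (!pte_present(orig_pte) || !pte_napot(orig_pte))
                  return orig_pte;

          /*
           * Otherwise this is a napot group: the real helper walks
           * napot_pte_num() entries and folds their dirty/young bits
           * into the returned value.
           */
          return orig_pte;
  }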

Signed-off-by: Xu Lu <[email protected]>
---
arch/riscv/include/asm/hugetlb.h | 71 +++++++++++++++++++++++++++++++-
arch/riscv/mm/hugetlbpage.c | 40 ++++++++++++------
2 files changed, 97 insertions(+), 14 deletions(-)

diff --git a/arch/riscv/include/asm/hugetlb.h b/arch/riscv/include/asm/hugetlb.h
index 4c5b0e929890..1cdd5a26e6d4 100644
--- a/arch/riscv/include/asm/hugetlb.h
+++ b/arch/riscv/include/asm/hugetlb.h
@@ -4,6 +4,7 @@

#include <asm/cacheflush.h>
#include <asm/page.h>
+#include <asm/pgtable.h>

static inline void arch_clear_hugepage_flags(struct page *page)
{
@@ -12,6 +13,7 @@ static inline void arch_clear_hugepage_flags(struct page *page)
#define arch_clear_hugepage_flags arch_clear_hugepage_flags

#ifdef CONFIG_RISCV_ISA_SVNAPOT
+
#define __HAVE_ARCH_HUGE_PTE_CLEAR
void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, unsigned long sz);
@@ -41,10 +43,77 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
#define __HAVE_ARCH_HUGE_PTEP_GET
pte_t huge_ptep_get(pte_t *ptep);

+#define __HAVE_ARCH_HUGE_PTEP_GET_LOCKLESS
+static inline pte_t huge_ptep_get_lockless(pte_t *ptep)
+{
+ unsigned long pteval = READ_ONCE(ptep->ptes[0]);
+
+ return __pte(pteval);
+}
+
pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags);
#define arch_make_huge_pte arch_make_huge_pte

-#endif /*CONFIG_RISCV_ISA_SVNAPOT*/
+#else /* CONFIG_RISCV_ISA_SVNAPOT */
+
+#define __HAVE_ARCH_HUGE_PTEP_GET
+static inline pte_t huge_ptep_get(pte_t *ptep)
+{
+ pmd_t *pmdp = (pmd_t *)ptep;
+
+ return pmd_pte(pmdp_get(pmdp));
+}
+
+#define __HAVE_ARCH_HUGE_PTEP_GET_LOCKLESS
+static inline pte_t huge_ptep_get_lockless(pte_t *ptep)
+{
+ return huge_ptep_get(ptep);
+}
+
+#define __HAVE_ARCH_HUGE_SET_HUGE_PTE_AT
+static inline void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+ set_pmd_at(mm, addr, (pmd_t *)ptep, pte_pmd(pte));
+}
+
+#define __HAVE_ARCH_HUGE_PTEP_SET_ACCESS_FLAGS
+static inline int huge_ptep_set_access_flags(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ pte_t pte, int dirty)
+{
+ return pmdp_set_access_flags(vma, addr, (pmd_t *)ptep, pte_pmd(pte), dirty);
+}
+
+#define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
+static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
+{
+ return pmd_pte(pmdp_get_and_clear(mm, addr, (pmd_t *)ptep));
+}
+
+#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
+static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
+{
+ pmdp_set_wrprotect(mm, addr, (pmd_t *)ptep);
+}
+
+#define __HAVE_ARCH_HUGE_PTEP_CLEAR_FLUSH
+static inline pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+{
+ return pmd_pte(pmdp_clear_flush(vma, addr, (pmd_t *)ptep));
+}
+
+#define __HAVE_ARCH_HUGE_PTE_CLEAR
+static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, unsigned long sz)
+{
+ pmd_clear((pmd_t *)ptep);
+}
+
+#endif /* CONFIG_RISCV_ISA_SVNAPOT */

#include <asm-generic/hugetlb.h>

diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c
index 67fd71c36853..4a2ad8657502 100644
--- a/arch/riscv/mm/hugetlbpage.c
+++ b/arch/riscv/mm/hugetlbpage.c
@@ -7,8 +7,13 @@ pte_t huge_ptep_get(pte_t *ptep)
{
unsigned long pte_num;
int i;
- pte_t orig_pte = ptep_get(ptep);
+ pmd_t *pmdp = (pmd_t *)ptep;
+ pte_t orig_pte = pmd_pte(pmdp_get(pmdp));

+ /*
+ * Non napot pte indicates a middle page table entry and
+ * should be treated as a pmd.
+ */
if (!pte_present(orig_pte) || !pte_napot(orig_pte))
return orig_pte;

@@ -198,6 +203,8 @@ void set_huge_pte_at(struct mm_struct *mm,
hugepage_shift = PAGE_SHIFT;

pte_num = sz >> hugepage_shift;
+ if (pte_num == 1)
+ set_pmd_at(mm, addr, (pmd_t *)ptep, pte_pmd(pte));
for (i = 0; i < pte_num; i++, ptep++, addr += (1 << hugepage_shift))
set_pte_at(mm, addr, ptep, pte);
}
@@ -214,7 +221,8 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
int i, pte_num;

if (!pte_napot(pte))
- return ptep_set_access_flags(vma, addr, ptep, pte, dirty);
+ return pmdp_set_access_flags(vma, addr, (pmd_t *)ptep,
+ pte_pmd(pte), dirty);

order = napot_cont_order(pte);
pte_num = napot_pte_num(order);
@@ -237,11 +245,12 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
unsigned long addr,
pte_t *ptep)
{
- pte_t orig_pte = ptep_get(ptep);
+ pmd_t *pmdp = (pmd_t *)ptep;
+ pte_t orig_pte = pmd_pte(pmdp_get(pmdp));
int pte_num;

if (!pte_napot(orig_pte))
- return ptep_get_and_clear(mm, addr, ptep);
+ return pmd_pte(pmdp_get_and_clear(mm, addr, pmdp));

pte_num = napot_pte_num(napot_cont_order(orig_pte));

@@ -252,13 +261,14 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
unsigned long addr,
pte_t *ptep)
{
- pte_t pte = ptep_get(ptep);
+ pmd_t *pmdp = (pmd_t *)ptep;
+ pte_t pte = pmd_pte(pmdp_get(pmdp));
unsigned long order;
pte_t orig_pte;
int i, pte_num;

if (!pte_napot(pte)) {
- ptep_set_wrprotect(mm, addr, ptep);
+ pmdp_set_wrprotect(mm, addr, pmdp);
return;
}

@@ -277,11 +287,12 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
unsigned long addr,
pte_t *ptep)
{
- pte_t pte = ptep_get(ptep);
+ pmd_t *pmdp = (pmd_t *)ptep;
+ pte_t pte = pmd_pte(pmdp_get(pmdp));
int pte_num;

if (!pte_napot(pte))
- return ptep_clear_flush(vma, addr, ptep);
+ return pmd_pte(pmdp_clear_flush(vma, addr, pmdp));

pte_num = napot_pte_num(napot_cont_order(pte));

@@ -293,11 +304,12 @@ void huge_pte_clear(struct mm_struct *mm,
pte_t *ptep,
unsigned long sz)
{
- pte_t pte = ptep_get(ptep);
+ pmd_t *pmdp = (pmd_t *)ptep;
+ pte_t pte = pmd_pte(pmdp_get(pmdp));
int i, pte_num;

if (!pte_napot(pte)) {
- pte_clear(mm, addr, ptep);
+ pmd_clear(pmdp);
return;
}

@@ -325,8 +337,10 @@ static __init int napot_hugetlbpages_init(void)
if (has_svnapot()) {
unsigned long order;

- for_each_napot_order(order)
- hugetlb_add_hstate(order);
+ for_each_napot_order(order) {
+ if (napot_cont_shift(order) > PAGE_SHIFT)
+ hugetlb_add_hstate(order);
+ }
}
return 0;
}
@@ -357,7 +371,7 @@ bool __init arch_hugetlb_valid_size(unsigned long size)
return true;
else if (IS_ENABLED(CONFIG_64BIT) && size == PUD_SIZE)
return true;
- else if (is_napot_size(size))
+ else if (is_napot_size(size) && size > PAGE_SIZE)
return true;
else
return false;
--
2.20.1

2023-11-23 06:58:47

by Xu Lu

Subject: [RFC PATCH V1 02/11] riscv: Introduce concept of hardware base page

The key idea behind implementing a larger base page on top of an MMU
that only supports 4K pages is to decouple the MMU page from the
software page seen by kernel mm. In contrast to the software page, we
denote the MMU page as the hardware page.

To decouple these two kinds of pages, we should manage, allocate and map
memory at the granularity of the software page, which is exactly what
existing mm code does. The page table operations, however, should
configure page table entries at the granularity of the hardware page,
which is the responsibility of arch code.

This commit introduces the concept of the hardware base page for RISC-V.

Signed-off-by: Xu Lu <[email protected]>
---
arch/riscv/Kconfig | 8 ++++++++
arch/riscv/include/asm/page.h | 6 +++++-
2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 95a2a06acc6a..105cbb3ca797 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -221,6 +221,14 @@ config PAGE_OFFSET
default 0x80000000 if !MMU
default 0xff60000000000000 if 64BIT

+config RISCV_HW_PAGE_SHIFT
+ int
+ default 12
+
+config RISCV_PAGE_SHIFT
+ int
+ default 12
+
config KASAN_SHADOW_OFFSET
hex
depends on KASAN_GENERIC
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 57e887bfa34c..a8c59d80683c 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -12,7 +12,11 @@
#include <linux/pfn.h>
#include <linux/const.h>

-#define PAGE_SHIFT (12)
+#define HW_PAGE_SHIFT CONFIG_RISCV_HW_PAGE_SHIFT
+#define HW_PAGE_SIZE (_AC(1, UL) << HW_PAGE_SHIFT)
+#define HW_PAGE_MASK (~(HW_PAGE_SIZE - 1))
+
+#define PAGE_SHIFT CONFIG_RISCV_PAGE_SHIFT
#define PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
#define PAGE_MASK (~(PAGE_SIZE - 1))

--
2.20.1

2023-11-23 06:59:09

by Xu Lu

Subject: [RFC PATCH V1 07/11] riscv: Adapt satp operations to gap between hw page and sw page

The control register CSR_SATP on RISC-V, which points to the root page
table page, is used by the MMU to translate va to pa when a TLB miss
happens. Thus it should be encoded at the granularity of the hardware
page, while existing code usually encodes it via the software page frame
number.

This commit corrects the encoding of the CSR_SATP register. To spare
developers the encoding format of CSR_SATP and the conversion between
the sw pfn and the hw pfn, we abstract the encoding of CSR_SATP into a
dedicated helper function.
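
Typical usage after this change (as in the context-switch path below;
"asid" stands for whatever ASID value the caller has computed):

  /* the software pfn of the pgd is converted to a hardware pfn
     inside make_satp() */
  csr_write(CSR_SATP, make_satp(virt_to_pfn(mm->pgd), asid, satp_mode));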

Signed-off-by: Xu Lu <[email protected]>
---
arch/riscv/include/asm/pgtable.h | 7 +++++++
arch/riscv/kernel/head.S | 4 ++--
arch/riscv/kernel/hibernate.c | 3 ++-
arch/riscv/mm/context.c | 7 +++----
arch/riscv/mm/fault.c | 1 +
arch/riscv/mm/init.c | 7 +++++--
arch/riscv/mm/kasan_init.c | 7 +++++--
7 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 9f81fe046cb8..56366f07985d 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -213,6 +213,13 @@ extern pgd_t swapper_pg_dir[];
extern pgd_t trampoline_pg_dir[];
extern pgd_t early_pg_dir[];

+static inline unsigned long make_satp(unsigned long pfn,
+ unsigned long asid, unsigned long satp_mode)
+{
+ return (pfn_to_hwpfn(pfn) |
+ ((asid & SATP_ASID_MASK) << SATP_ASID_SHIFT) | satp_mode);
+}
+
static __always_inline int __pte_present(unsigned long pteval)
{
return (pteval & (_PAGE_PRESENT | _PAGE_PROT_NONE));
diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index b77397432403..dace2e4e6164 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -87,7 +87,7 @@ relocate_enable_mmu:
csrw CSR_TVEC, a2

/* Compute satp for kernel page tables, but don't load it yet */
- srl a2, a0, PAGE_SHIFT
+ srl a2, a0, HW_PAGE_SHIFT
la a1, satp_mode
REG_L a1, 0(a1)
or a2, a2, a1
@@ -100,7 +100,7 @@ relocate_enable_mmu:
*/
la a0, trampoline_pg_dir
XIP_FIXUP_OFFSET a0
- srl a0, a0, PAGE_SHIFT
+ srl a0, a0, HW_PAGE_SHIFT
or a0, a0, a1
sfence.vma
csrw CSR_SATP, a0
diff --git a/arch/riscv/kernel/hibernate.c b/arch/riscv/kernel/hibernate.c
index 671b686c0158..155be6b1d32c 100644
--- a/arch/riscv/kernel/hibernate.c
+++ b/arch/riscv/kernel/hibernate.c
@@ -395,7 +395,8 @@ int swsusp_arch_resume(void)
if (ret)
return ret;

- hibernate_restore_image(resume_hdr.saved_satp, (PFN_DOWN(__pa(resume_pg_dir)) | satp_mode),
+ hibernate_restore_image(resume_hdr.saved_satp,
+ make_satp(PFN_DOWN(__pa(resume_pg_dir)), 0, satp_mode),
resume_hdr.restore_cpu_addr);

return 0;
diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
index 217fd4de6134..2ecf87433dfc 100644
--- a/arch/riscv/mm/context.c
+++ b/arch/riscv/mm/context.c
@@ -190,9 +190,8 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
raw_spin_unlock_irqrestore(&context_lock, flags);

switch_mm_fast:
- csr_write(CSR_SATP, virt_to_pfn(mm->pgd) |
- ((cntx & asid_mask) << SATP_ASID_SHIFT) |
- satp_mode);
+ csr_write(CSR_SATP, make_satp(virt_to_pfn(mm->pgd), (cntx & asid_mask),
+ satp_mode));

if (need_flush_tlb)
local_flush_tlb_all();
@@ -201,7 +200,7 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
static void set_mm_noasid(struct mm_struct *mm)
{
/* Switch the page table and blindly nuke entire local TLB */
- csr_write(CSR_SATP, virt_to_pfn(mm->pgd) | satp_mode);
+ csr_write(CSR_SATP, make_satp(virt_to_pfn(mm->pgd), 0, satp_mode));
local_flush_tlb_all();
}

diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 90d4ba36d1d0..026ac007febf 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -133,6 +133,7 @@ static inline void vmalloc_fault(struct pt_regs *regs, int code, unsigned long a
*/
index = pgd_index(addr);
pfn = csr_read(CSR_SATP) & SATP_PPN;
+ pfn = hwpfn_to_pfn(pfn);
pgd = (pgd_t *)pfn_to_virt(pfn) + index;
pgd_k = init_mm.pgd + index;

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index a768b2b3ff05..c33a90d0c51d 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -805,7 +805,7 @@ static __init void set_satp_mode(uintptr_t dtb_pa)
(uintptr_t)early_p4d : (uintptr_t)early_pud,
PGDIR_SIZE, PAGE_TABLE);

- identity_satp = PFN_DOWN((uintptr_t)&early_pg_dir) | satp_mode;
+ identity_satp = make_satp(PFN_DOWN((uintptr_t)&early_pg_dir), 0, satp_mode);

local_flush_tlb_all();
csr_write(CSR_SATP, identity_satp);
@@ -1285,6 +1285,8 @@ static void __init create_linear_mapping_page_table(void)

static void __init setup_vm_final(void)
{
+ unsigned long satp;
+
/* Setup swapper PGD for fixmap */
#if !defined(CONFIG_64BIT)
/*
@@ -1318,7 +1320,8 @@ static void __init setup_vm_final(void)
clear_fixmap(FIX_P4D);

/* Move to swapper page table */
- csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | satp_mode);
+ satp = make_satp(PFN_DOWN(__pa_symbol(swapper_pg_dir)), 0, satp_mode);
+ csr_write(CSR_SATP, satp);
local_flush_tlb_all();

pt_ops_set_late();
diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c
index 5e39dcf23fdb..72269e9f1964 100644
--- a/arch/riscv/mm/kasan_init.c
+++ b/arch/riscv/mm/kasan_init.c
@@ -471,11 +471,13 @@ static void __init create_tmp_mapping(void)

void __init kasan_init(void)
{
+ unsigned long satp;
phys_addr_t p_start, p_end;
u64 i;

create_tmp_mapping();
- csr_write(CSR_SATP, PFN_DOWN(__pa(tmp_pg_dir)) | satp_mode);
+ satp = make_satp(PFN_DOWN(__pa(tmp_pg_dir)), 0, satp_mode);
+ csr_write(CSR_SATP, satp);

kasan_early_clear_pgd(pgd_offset_k(KASAN_SHADOW_START),
KASAN_SHADOW_START, KASAN_SHADOW_END);
@@ -520,6 +522,7 @@ void __init kasan_init(void)
memset(kasan_early_shadow_page, KASAN_SHADOW_INIT, PAGE_SIZE);
init_task.kasan_depth = 0;

- csr_write(CSR_SATP, PFN_DOWN(__pa(swapper_pg_dir)) | satp_mode);
+ satp = make_satp(PFN_DOWN(__pa(swapper_pg_dir)), 0, satp_mode);
+ csr_write(CSR_SATP, satp);
local_flush_tlb_all();
}
--
2.20.1

2023-11-23 06:59:27

by Xu Lu

Subject: [RFC PATCH V1 08/11] riscv: Apply Svnapot for base page mapping

The Svnapot extension on RISC-V is similar to the contiguous PTE feature
on ARM64. It allows the ptes of a naturally aligned power-of-2 (NAPOT)
memory range to be encoded in the same format to save TLB space.

This commit applies Svnapot to each base page's mapping. It is the key
to achieving the performance benefit of a larger base page.
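
A quick example of the order conversion done below (64K configuration,
so PAGE_SHIFT - HW_PAGE_SHIFT == 4):

  /*
   * pte_mknapot(pte, 0): software page order 0 (one 64K page) becomes
   * hardware page order 4, i.e. a NAPOT group of 16 4K hardware ptes,
   * which is what set_pte() now installs for every present mapping
   * when Svnapot is available.
   */
  pte = pte_mknapot(pte, 0);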

Signed-off-by: Xu Lu <[email protected]>
---
arch/riscv/include/asm/pgtable.h | 34 +++++++++++++++++++++++++++-----
1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 56366f07985d..803dc5fb6314 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -230,6 +230,16 @@ static __always_inline unsigned long __pte_napot(unsigned long pteval)
return pteval & _PAGE_NAPOT;
}

+static __always_inline unsigned long __pte_mknapot(unsigned long pteval,
+ unsigned int order)
+{
+ int pos = order - 1 + _PAGE_PFN_SHIFT;
+ unsigned long napot_bit = BIT(pos);
+ unsigned long napot_mask = ~GENMASK(pos, _PAGE_PFN_SHIFT);
+
+ return (pteval & napot_mask) | napot_bit | _PAGE_NAPOT;
+}
+
static inline pte_t __pte(unsigned long pteval)
{
pte_t pte;
@@ -348,13 +358,11 @@ static inline unsigned long pte_napot(pte_t pte)
return __pte_napot(pte_val(pte));
}

-static inline pte_t pte_mknapot(pte_t pte, unsigned int order)
+static inline pte_t pte_mknapot(pte_t pte, unsigned int page_order)
{
- int pos = order - 1 + _PAGE_PFN_SHIFT;
- unsigned long napot_bit = BIT(pos);
- unsigned long napot_mask = ~GENMASK(pos, _PAGE_PFN_SHIFT);
+ unsigned int hw_page_order = page_order + (PAGE_SHIFT - HW_PAGE_SHIFT);

- return __pte((pte_val(pte) & napot_mask) | napot_bit | _PAGE_NAPOT);
+ return __pte(__pte_mknapot(pte_val(pte), hw_page_order));
}

#else
@@ -366,6 +374,11 @@ static inline unsigned long pte_napot(pte_t pte)
return 0;
}

+static inline pte_t pte_mknapot(pte_t pte, unsigned int page_order)
+{
+ return pte;
+}
+
#endif /* CONFIG_RISCV_ISA_SVNAPOT */

/* Yields the page frame number (PFN) of a page table entry */
@@ -585,6 +598,17 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b)
*/
static inline void set_pte(pte_t *ptep, pte_t pteval)
{
+ unsigned long order;
+
+ /*
+ * has_svnapot() always return false before riscv_isa is initialized.
+ */
+ if (has_svnapot() && pte_present(pteval) && !pte_napot(pteval)) {
+ for_each_napot_order(order) {
+ if (napot_cont_shift(order) == PAGE_SHIFT)
+ pteval = pte_mknapot(pteval, order);
+ }
+ }
*ptep = pteval;
}

--
2.20.1

2023-11-23 06:59:30

by Xu Lu

Subject: [RFC PATCH V1 11/11] riscv: Introduce 64K page size

This patch introduces a new config option to control whether the 64K
base page feature is enabled on RISC-V.
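
For example, a config fragment selecting the new option (which in turn
makes RISCV_PAGE_SHIFT default to 16, see the hunks below) would look
like:

  CONFIG_RISCV_64K_PAGES=y
  # CONFIG_RISCV_4K_PAGES is not set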

Signed-off-by: Xu Lu <[email protected]>
---
arch/Kconfig | 1 +
arch/riscv/Kconfig | 20 ++++++++++++++++++++
2 files changed, 21 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index f4b210ab0612..66f64450d409 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1087,6 +1087,7 @@ config HAVE_ARCH_COMPAT_MMAP_BASES

config PAGE_SIZE_LESS_THAN_64KB
def_bool y
+ depends on !RISCV_64K_PAGES
depends on !ARM64_64K_PAGES
depends on !PAGE_SIZE_64KB
depends on !PARISC_PAGE_SIZE_64KB
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 105cbb3ca797..d561f9f7f9b4 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -227,6 +227,7 @@ config RISCV_HW_PAGE_SHIFT

config RISCV_PAGE_SHIFT
int
+ default 16 if RISCV_64K_PAGES
default 12

config KASAN_SHADOW_OFFSET
@@ -692,6 +693,25 @@ config RISCV_BOOT_SPINWAIT

If unsure what to do here, say N.

+choice
+ prompt "Page size"
+ default RISCV_4K_PAGES
+ help
+ Page size (translation granule) configuration.
+
+config RISCV_4K_PAGES
+ bool "4KB"
+ help
+ This feature enables 4KB pages support.
+
+config RISCV_64K_PAGES
+ bool "64KB"
+ depends on ARCH_HAS_STRICT_KERNEL_RWX && 64BIT
+ help
+ This feature enables 64KB pages support (4KB by default)
+
+endchoice
+
config ARCH_SUPPORTS_KEXEC
def_bool MMU

--
2.20.1

2023-11-23 06:59:33

by Xu Lu

Subject: [RFC PATCH V1 09/11] riscv: Adjust fix_btmap slots number to match variable page size

With the existing hard-coded fixmap slot number, the fixmap size will
exceed FIX_FDT_SIZE when the base page becomes larger than 4K. This
patch derives the slot number from the available fixmap space so that
they always match.
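
A worked example of the sizing with the definitions below (the exact
FIXADDR_SIZE value depends on the rest of the fixmap layout):

  /* 4K  base page: NR_FIX_BTMAPS = 256K / 4K  = 64 ptes per slot */
  /* 64K base page: NR_FIX_BTMAPS = 256K / 64K = 4  ptes per slot */
  /* FIX_BTMAPS_SLOTS = remaining fixmap space / 256K, instead of
     the previous hard-coded 7 */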

Signed-off-by: Xu Lu <[email protected]>
---
arch/riscv/include/asm/fixmap.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
index 0a55099bb734..17bf31334bd5 100644
--- a/arch/riscv/include/asm/fixmap.h
+++ b/arch/riscv/include/asm/fixmap.h
@@ -44,7 +44,8 @@ enum fixed_addresses {
* before ioremap() is functional.
*/
#define NR_FIX_BTMAPS (SZ_256K / PAGE_SIZE)
-#define FIX_BTMAPS_SLOTS 7
+#define FIX_BTMAPS_SIZE (FIXADDR_SIZE - ((FIX_BTMAP_END + 1) << PAGE_SHIFT))
+#define FIX_BTMAPS_SLOTS (FIX_BTMAPS_SIZE / SZ_256K)
#define TOTAL_FIX_BTMAPS (NR_FIX_BTMAPS * FIX_BTMAPS_SLOTS)

FIX_BTMAP_END = __end_of_permanent_fixed_addresses,
--
2.20.1

2023-11-23 07:00:03

by Xu Lu

Subject: [RFC PATCH V1 10/11] riscv: kvm: Adapt kvm to gap between hw page and sw page

Existing mmu code in kvm handles middle-level page table entries and
last-level page table entries in the same way, which is insufficient
when the base page becomes larger. For example, with a 64K base page,
each pte_t contains 16 page table entries while each pmd_t still
contains one, so the two need to be handled in different ways.

This commit refines the kvm mmu code to handle middle-level page table
entries and last-level page table entries separately.
Signed-off-by: Xu Lu <[email protected]>
---
arch/riscv/include/asm/pgtable.h | 7 ++
arch/riscv/kvm/mmu.c | 198 +++++++++++++++++++++----------
2 files changed, 145 insertions(+), 60 deletions(-)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 803dc5fb6314..9bed1512b3d2 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -220,6 +220,13 @@ static inline unsigned long make_satp(unsigned long pfn,
((asid & SATP_ASID_MASK) << SATP_ASID_SHIFT) | satp_mode);
}

+static inline unsigned long make_hgatp(unsigned long pfn,
+ unsigned long vmid, unsigned long hgatp_mode)
+{
+ return ((pfn_to_hwpfn(pfn) & HGATP_PPN) |
+ ((vmid << HGATP_VMID_SHIFT) & HGATP_VMID) | hgatp_mode);
+}
+
static __always_inline int __pte_present(unsigned long pteval)
{
return (pteval & (_PAGE_PRESENT | _PAGE_PROT_NONE));
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 068c74593871..f26d3e94fe17 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -36,22 +36,36 @@ static unsigned long gstage_pgd_levels __ro_after_init = 2;
gstage_pgd_xbits)
#define gstage_gpa_size ((gpa_t)(1ULL << gstage_gpa_bits))

+#define gstage_pmd_leaf(__pmdp) \
+ (pmd_val(pmdp_get(__pmdp)) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC))
+
#define gstage_pte_leaf(__ptep) \
- (pte_val(*(__ptep)) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC))
+ (pte_val(ptep_get(__ptep)) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC))

-static inline unsigned long gstage_pte_index(gpa_t addr, u32 level)
+static inline unsigned long gstage_pmd_index(gpa_t addr, u32 level)
{
unsigned long mask;
unsigned long shift = HGATP_PAGE_SHIFT + (gstage_index_bits * level);

+ BUG_ON(level == 0);
if (level == (gstage_pgd_levels - 1))
- mask = (PTRS_PER_PTE * (1UL << gstage_pgd_xbits)) - 1;
+ mask = (PTRS_PER_PMD * (1UL << gstage_pgd_xbits)) - 1;
else
- mask = PTRS_PER_PTE - 1;
+ mask = PTRS_PER_PMD - 1;

return (addr >> shift) & mask;
}

+static inline unsigned long gstage_pte_index(gpa_t addr, u32 level)
+{
+ return (addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
+}
+
+static inline unsigned long gstage_pmd_page_vaddr(pmd_t pmd)
+{
+ return (unsigned long)pfn_to_virt(__page_val_to_pfn(pmd_val(pmd)));
+}
+
static inline unsigned long gstage_pte_page_vaddr(pte_t pte)
{
return (unsigned long)pfn_to_virt(__page_val_to_pfn(pte_val(pte)));
@@ -60,9 +74,13 @@ static inline unsigned long gstage_pte_page_vaddr(pte_t pte)
static int gstage_page_size_to_level(unsigned long page_size, u32 *out_level)
{
u32 i;
- unsigned long psz = 1UL << 12;
+ unsigned long psz = 1UL << HW_PAGE_SHIFT;

- for (i = 0; i < gstage_pgd_levels; i++) {
+ if (page_size == PAGE_SIZE) {
+ *out_level = 0;
+ return 0;
+ }
+ for (i = 1; i < gstage_pgd_levels; i++) {
if (page_size == (psz << (i * gstage_index_bits))) {
*out_level = i;
return 0;
@@ -77,7 +95,11 @@ static int gstage_level_to_page_order(u32 level, unsigned long *out_pgorder)
if (gstage_pgd_levels < level)
return -EINVAL;

- *out_pgorder = 12 + (level * gstage_index_bits);
+ if (level == 0)
+ *out_pgorder = PAGE_SHIFT;
+ else
+ *out_pgorder = HW_PAGE_SHIFT + (level * gstage_index_bits);
+
return 0;
}

@@ -95,30 +117,40 @@ static int gstage_level_to_page_size(u32 level, unsigned long *out_pgsize)
}

static bool gstage_get_leaf_entry(struct kvm *kvm, gpa_t addr,
- pte_t **ptepp, u32 *ptep_level)
+ void **ptepp, u32 *ptep_level)
{
- pte_t *ptep;
+ pmd_t *pmdp = NULL;
+ pte_t *ptep = NULL;
u32 current_level = gstage_pgd_levels - 1;

*ptep_level = current_level;
- ptep = (pte_t *)kvm->arch.pgd;
- ptep = &ptep[gstage_pte_index(addr, current_level)];
- while (ptep && pte_val(*ptep)) {
- if (gstage_pte_leaf(ptep)) {
+ pmdp = (pmd_t *)kvm->arch.pgd;
+ pmdp = &pmdp[gstage_pmd_index(addr, current_level)];
+ while (current_level && pmdp && pmd_val(pmdp_get(pmdp))) {
+ if (gstage_pmd_leaf(pmdp)) {
*ptep_level = current_level;
- *ptepp = ptep;
+ *ptepp = (void *)pmdp;
return true;
}

+ current_level--;
+ *ptep_level = current_level;
+ pmdp = (pmd_t *)gstage_pmd_page_vaddr(pmdp_get(pmdp));
if (current_level) {
- current_level--;
- *ptep_level = current_level;
- ptep = (pte_t *)gstage_pte_page_vaddr(*ptep);
- ptep = &ptep[gstage_pte_index(addr, current_level)];
+ pmdp = &pmdp[gstage_pmd_index(addr, current_level)];
} else {
- ptep = NULL;
+ ptep = (pte_t *)pmdp;
+ ptep = &ptep[gstage_pte_index(addr, current_level)];
}
}
+ if (ptep && pte_val(ptep_get(ptep))) {
+ if (gstage_pte_leaf(ptep)) {
+ *ptep_level = current_level;
+ *ptepp = (void *)ptep;
+ return true;
+ }
+ ptep = NULL;
+ }

return false;
}
@@ -136,40 +168,53 @@ static void gstage_remote_tlb_flush(struct kvm *kvm, u32 level, gpa_t addr)

static int gstage_set_pte(struct kvm *kvm, u32 level,
struct kvm_mmu_memory_cache *pcache,
- gpa_t addr, const pte_t *new_pte)
+ gpa_t addr, const void *new_pte)
{
u32 current_level = gstage_pgd_levels - 1;
- pte_t *next_ptep = (pte_t *)kvm->arch.pgd;
- pte_t *ptep = &next_ptep[gstage_pte_index(addr, current_level)];
+ pmd_t *next_pmdp = (pmd_t *)kvm->arch.pgd;
+ pmd_t *pmdp = &next_pmdp[gstage_pmd_index(addr, current_level)];
+ pte_t *next_ptep = NULL;
+ pte_t *ptep = NULL;

if (current_level < level)
return -EINVAL;

while (current_level != level) {
- if (gstage_pte_leaf(ptep))
+ if (gstage_pmd_leaf(pmdp))
return -EEXIST;

- if (!pte_val(*ptep)) {
+ if (!pmd_val(pmdp_get(pmdp))) {
if (!pcache)
return -ENOMEM;
- next_ptep = kvm_mmu_memory_cache_alloc(pcache);
- if (!next_ptep)
+ next_pmdp = kvm_mmu_memory_cache_alloc(pcache);
+ if (!next_pmdp)
return -ENOMEM;
- *ptep = pfn_pte(PFN_DOWN(__pa(next_ptep)),
- __pgprot(_PAGE_TABLE));
+ set_pmd(pmdp, pfn_pmd(PFN_DOWN(__pa(next_pmdp)),
+ __pgprot(_PAGE_TABLE)));
} else {
- if (gstage_pte_leaf(ptep))
+ if (gstage_pmd_leaf(pmdp))
return -EEXIST;
- next_ptep = (pte_t *)gstage_pte_page_vaddr(*ptep);
+ next_pmdp = (pmd_t *)gstage_pmd_page_vaddr(pmdp_get(pmdp));
}

current_level--;
- ptep = &next_ptep[gstage_pte_index(addr, current_level)];
+ if (current_level) {
+ pmdp = &next_pmdp[gstage_pmd_index(addr, current_level)];
+ } else {
+ next_ptep = (pte_t *)next_pmdp;
+ ptep = &next_ptep[gstage_pte_index(addr, current_level)];
+ }
}

- *ptep = *new_pte;
- if (gstage_pte_leaf(ptep))
- gstage_remote_tlb_flush(kvm, current_level, addr);
+ if (current_level) {
+ set_pmd(pmdp, pmdp_get((pmd_t *)new_pte));
+ if (gstage_pmd_leaf(pmdp))
+ gstage_remote_tlb_flush(kvm, current_level, addr);
+ } else {
+ set_pte(ptep, ptep_get((pte_t *)new_pte));
+ if (gstage_pte_leaf(ptep))
+ gstage_remote_tlb_flush(kvm, current_level, addr);
+ }

return 0;
}
@@ -182,6 +227,7 @@ static int gstage_map_page(struct kvm *kvm,
{
int ret;
u32 level = 0;
+ pmd_t new_pmd;
pte_t new_pte;
pgprot_t prot;

@@ -213,10 +259,15 @@ static int gstage_map_page(struct kvm *kvm,
else
prot = PAGE_WRITE;
}
- new_pte = pfn_pte(PFN_DOWN(hpa), prot);
- new_pte = pte_mkdirty(new_pte);
-
- return gstage_set_pte(kvm, level, pcache, gpa, &new_pte);
+ if (level) {
+ new_pmd = pfn_pmd(PFN_DOWN(hpa), prot);
+ new_pmd = pmd_mkdirty(new_pmd);
+ return gstage_set_pte(kvm, level, pcache, gpa, &new_pmd);
+ } else {
+ new_pte = pfn_pte(PFN_DOWN(hpa), prot);
+ new_pte = pte_mkdirty(new_pte);
+ return gstage_set_pte(kvm, level, pcache, gpa, &new_pte);
+ }
}

enum gstage_op {
@@ -226,9 +277,12 @@ enum gstage_op {
};

static void gstage_op_pte(struct kvm *kvm, gpa_t addr,
- pte_t *ptep, u32 ptep_level, enum gstage_op op)
+ void *__ptep, u32 ptep_level, enum gstage_op op)
{
int i, ret;
+ pmd_t *pmdp = (pmd_t *)__ptep;
+ pte_t *ptep = (pte_t *)__ptep;
+ pmd_t *next_pmdp;
pte_t *next_ptep;
u32 next_ptep_level;
unsigned long next_page_size, page_size;
@@ -239,11 +293,13 @@ static void gstage_op_pte(struct kvm *kvm, gpa_t addr,

BUG_ON(addr & (page_size - 1));

- if (!pte_val(*ptep))
+ if (ptep_level && !pmd_val(pmdp_get(pmdp)))
+ return;
+ if (!ptep_level && !pte_val(ptep_get(ptep)))
return;

- if (ptep_level && !gstage_pte_leaf(ptep)) {
- next_ptep = (pte_t *)gstage_pte_page_vaddr(*ptep);
+ if (ptep_level && !gstage_pmd_leaf(pmdp)) {
+ next_pmdp = (pmd_t *)gstage_pmd_page_vaddr(pmdp_get(pmdp));
next_ptep_level = ptep_level - 1;
ret = gstage_level_to_page_size(next_ptep_level,
&next_page_size);
@@ -251,17 +307,33 @@ static void gstage_op_pte(struct kvm *kvm, gpa_t addr,
return;

if (op == GSTAGE_OP_CLEAR)
- set_pte(ptep, __pte(0));
- for (i = 0; i < PTRS_PER_PTE; i++)
- gstage_op_pte(kvm, addr + i * next_page_size,
- &next_ptep[i], next_ptep_level, op);
+ set_pmd(pmdp, __pmd(0));
+ if (next_ptep_level) {
+ for (i = 0; i < PTRS_PER_PMD; i++)
+ gstage_op_pte(kvm, addr + i * next_page_size,
+ &next_pmdp[i], next_ptep_level, op);
+ } else {
+ next_ptep = (pte_t *)next_pmdp;
+ for (i = 0; i < PTRS_PER_PTE; i++)
+ gstage_op_pte(kvm, addr + i * next_page_size,
+ &next_ptep[i], next_ptep_level, op);
+ }
if (op == GSTAGE_OP_CLEAR)
- put_page(virt_to_page(next_ptep));
+ put_page(virt_to_page(next_pmdp));
} else {
- if (op == GSTAGE_OP_CLEAR)
- set_pte(ptep, __pte(0));
- else if (op == GSTAGE_OP_WP)
- set_pte(ptep, __pte(pte_val(*ptep) & ~_PAGE_WRITE));
+ if (ptep_level) {
+ if (op == GSTAGE_OP_CLEAR)
+ set_pmd(pmdp, __pmd(0));
+ else if (op == GSTAGE_OP_WP)
+ set_pmd(pmdp,
+ __pmd(pmd_val(pmdp_get(pmdp)) & ~_PAGE_WRITE));
+ } else {
+ if (op == GSTAGE_OP_CLEAR)
+ set_pte(ptep, __pte(0));
+ else if (op == GSTAGE_OP_WP)
+ set_pte(ptep,
+ __pte(pte_val(ptep_get(ptep)) & ~_PAGE_WRITE));
+ }
gstage_remote_tlb_flush(kvm, ptep_level, addr);
}
}
@@ -270,7 +342,7 @@ static void gstage_unmap_range(struct kvm *kvm, gpa_t start,
gpa_t size, bool may_block)
{
int ret;
- pte_t *ptep;
+ void *ptep;
u32 ptep_level;
bool found_leaf;
unsigned long page_size;
@@ -305,7 +377,7 @@ static void gstage_unmap_range(struct kvm *kvm, gpa_t start,
static void gstage_wp_range(struct kvm *kvm, gpa_t start, gpa_t end)
{
int ret;
- pte_t *ptep;
+ void *ptep;
u32 ptep_level;
bool found_leaf;
gpa_t addr = start;
@@ -572,7 +644,7 @@ bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
- pte_t *ptep;
+ void *ptep;
u32 ptep_level = 0;
u64 size = (range->end - range->start) << PAGE_SHIFT;

@@ -585,12 +657,15 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
&ptep, &ptep_level))
return false;

- return ptep_test_and_clear_young(NULL, 0, ptep);
+ if (ptep_level)
+ return pmdp_test_and_clear_young(NULL, 0, (pmd_t *)ptep);
+ else
+ return ptep_test_and_clear_young(NULL, 0, (pte_t *)ptep);
}

bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
- pte_t *ptep;
+ void *ptep;
u32 ptep_level = 0;
u64 size = (range->end - range->start) << PAGE_SHIFT;

@@ -603,7 +678,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
&ptep, &ptep_level))
return false;

- return pte_young(*ptep);
+ if (ptep_level)
+ return pmd_young(pmdp_get((pmd_t *)ptep));
+ else
+ return pte_young(ptep_get((pte_t *)ptep));
}

int kvm_riscv_gstage_map(struct kvm_vcpu *vcpu,
@@ -746,11 +824,11 @@ void kvm_riscv_gstage_free_pgd(struct kvm *kvm)

void kvm_riscv_gstage_update_hgatp(struct kvm_vcpu *vcpu)
{
- unsigned long hgatp = gstage_mode;
+ unsigned long hgatp;
struct kvm_arch *k = &vcpu->kvm->arch;

- hgatp |= (READ_ONCE(k->vmid.vmid) << HGATP_VMID_SHIFT) & HGATP_VMID;
- hgatp |= (k->pgd_phys >> PAGE_SHIFT) & HGATP_PPN;
+ hgatp = make_hgatp(PFN_DOWN(k->pgd_phys), READ_ONCE(k->vmid.vmid),
+ gstage_mode);

csr_write(CSR_HGATP, hgatp);

--
2.20.1

2023-11-23 09:30:09

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC PATCH V1 00/11] riscv: Introduce 64K base page

On Thu, Nov 23, 2023, at 07:56, Xu Lu wrote:
> Some existing architectures like ARM supports base page larger than 4K
> as their MMU supports more page sizes. Thus, besides hugetlb page and
> transparent huge page, there is another way for these architectures to
> enjoy the benefits of fewer TLB misses without worrying about cost of
> splitting and merging huge pages. However, on architectures with only
> 4K MMU, larger base page is unavailable now.
>
> This patch series attempts to break through the limitation of MMU and
> supports larger base page on RISC-V, which only supports 4K page size
> now.
>
> The key idea to implement larger base page based on 4K MMU is to
> decouple the MMU page from the base page in view of kernel mm, which we
> denote as software page. In contrary to software page, we denote the MMU
> page as hardware page. Below is the difference between these two kinds
> of pages.

We have played with this on arm32, but the conclusion is that it's
almost never worth the memory overhead, as most workloads end up
using several times the amount of physical RAM after each small
file in the page cache and any sparsely populated anonymous memory
area explodes to up to 16 times its size.
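
(To put rough numbers on that overhead: the sketch below is a
back-of-the-envelope calculation, not anything from the patch series.
It assumes a page cache full of small files, with a made-up file count
and average size, and compares how much RAM they pin with a 4 KiB
versus a 64 KiB base page, since each cached file occupies at least
one full base page.)

/* Illustrative only: page-cache footprint of many small files under
 * different base page sizes. File count and average size are
 * hypothetical example numbers.
 */
#include <stdio.h>

static unsigned long long cache_footprint(unsigned long long nr_files,
					  unsigned long long avg_size,
					  unsigned long long page_size)
{
	/* Each file occupies at least one page; round its size up. */
	unsigned long long pages = (avg_size + page_size - 1) / page_size;

	return nr_files * pages * page_size;
}

int main(void)
{
	unsigned long long nr_files = 1000000;	/* 1M small cached files */
	unsigned long long avg_size = 2048;	/* ~2 KiB each */

	printf("4K  pages: %llu MiB\n",
	       cache_footprint(nr_files, avg_size, 4096) >> 20);
	printf("64K pages: %llu MiB\n",
	       cache_footprint(nr_files, avg_size, 65536) >> 20);
	return 0;
}

(With these example numbers the 64K case pins roughly 16 times as much
RAM, which is the worst case mentioned above.)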

On ppc64, using 64KB pages was a way to get around limitations in
their hashed MMU design, which had a much bigger performance impact
because any page table access ends up being a cache miss. On arm64,
there are some CPUs like the Fujitsu A64FX that are really bad at
4KB pages and don't support 16KB pages, so this is the only real
option.

You will see a notable performance benefit in synthetic benchmarks
like speccpu with 64KB pages, or on specific computational
workloads that have large densely packed memory chunks, but for
real workloads, the usual answer is to just use transparent
hugepages for larger mappings and a page size of no more than
16KB for the page cache.

With the work going into using folios in the kernel (see e.g.
https://lwn.net/Articles/932386/), even the workloads that
benefit from 64KB base pages should be better off with 4KB
pages and just using the TLB hints for large folios.
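
(As a concrete illustration of those TLB hints: on RISC-V the
mechanism is the Svnapot extension, analogous to contiguous PTEs on
arm64. The standalone sketch below encodes a 64 KiB NAPOT leaf PTE the
way I read the Svnapot spec (N is PTE bit 63 and ppn[3:0] becomes
0b1000); it does not use the kernel's actual pte helpers, and the
physical address is a made-up example.)

/* Illustrative sketch, not kernel code: build a 64 KiB NAPOT leaf PTE
 * per the Svnapot extension. Assumes the usual Sv39/Sv48 PTE layout
 * with the PPN starting at bit 10 and N (NAPOT) at bit 63.
 */
#include <stdint.h>
#include <stdio.h>

#define PTE_V		(1ULL << 0)
#define PTE_R		(1ULL << 1)
#define PTE_W		(1ULL << 2)
#define PTE_N		(1ULL << 63)	/* Svnapot hint bit */
#define PTE_PPN_SHIFT	10

/* A 64 KiB NAPOT range covers 16 x 4 KiB pages; ppn[3:0] = 0b1000. */
static uint64_t make_napot_64k_pte(uint64_t pa, uint64_t prot)
{
	uint64_t ppn = pa >> 12;	/* 4 KiB page frame number */

	ppn &= ~0xfULL;			/* range must be 64 KiB aligned */
	ppn |= 0x8;			/* NAPOT encoding for 64 KiB */
	return (ppn << PTE_PPN_SHIFT) | PTE_N | prot;
}

int main(void)
{
	/* Hypothetical 64 KiB-aligned physical range, mapped RW. */
	uint64_t pte = make_napot_64k_pte(0x80200000ULL,
					  PTE_V | PTE_R | PTE_W);

	printf("napot pte: 0x%016llx\n", (unsigned long long)pte);
	return 0;
}

(The hardware may then cache all 16 underlying 4 KiB translations in a
single TLB entry, which is what both Svnapot and arm64 contpte-based
large folios rely on.)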

Arnd

2023-11-27 08:14:34

by Xu Lu

[permalink] [raw]
Subject: Re: [External] Re: [RFC PATCH V1 00/11] riscv: Introduce 64K base page

Thanks a lot for your reply! And sorry for replying so late.

On Thu, Nov 23, 2023 at 5:30 PM Arnd Bergmann <[email protected]> wrote:
> We have played with this on arm32, but the conclusion is that it's
> almost never worth the memory overhead, as most workloads end up
> using several times the amount of physical RAM after each small
> file in the page cache and any sparsely populated anonymous memory
> area explodes to up to 16 times its size.
>
> On ppc64, using 64KB pages was a way to get around limitations in
> their hashed MMU design, which had a much bigger performance impact
> because any page table access ends up being a cache miss. On arm64,
> there are some CPUs like the Fujitsu A64FX that are really bad at
> 4KB pages and don't support 16KB pages, so this is the only real
> option.
>
> You will see a notable performance benefit in synthetic benchmarks
> like speccpu with 64KB pages, or on specific computational
> workloads that have large densely packed memory chunks, but for
> real workloads, the usual answer is to just use transparent
> hugepages for larger mappings and a page size of no more than
> 16KB for the page cache.

We did observe real performance benefits from a 64K page size in
production scenarios.
On an Ampere ARM server with a 64K base page size, we saw a 2.5x
improvement in both QPS and latency on Redis, a 10~20% improvement on
our own NewSQL database, and a 50% improvement on object storage.
For MySQL, QPS increased by about 14%, 17.5% and 20% for read-only,
write-only and random read/write workloads respectively, while latency
dropped by about 13.7%, 15.8% and 14.5% on average.
This is why we chose to implement a similar feature on RISC-V in the
first place.

>
> With the work going into using folios in the kernel (see e.g.
> https://lwn.net/Articles/932386/), even the workloads that
> benefit from 64KB base pages should be better off with 4KB
> pages and just using the TLB hints for large folios.

Perhaps a 64K page size combined with large folios can achieve even
more benefits. As mentioned in this patch [1], a 64K page size kernel
combined with large folios and THPs via cont PTE can achieve a speedup
of 10.5x on some memory-intensive workloads on an arm64 SBSA server.

[1] https://lore.kernel.org/all/[email protected]/

>
> Arnd

2023-12-07 06:08:05

by Xu Lu

[permalink] [raw]
Subject: Re: [RFC PATCH V1 00/11] riscv: Introduce 64K base page

A gentle ping.
